├── .gitignore
├── Makefile
├── README.md
├── fetch_data.py
├── ipynbhelper.py
├── notebooks
│   ├── 01.1_setup_and_introduction.ipynb
│   ├── 01.2_sklearn_overview.ipynb
│   ├── 02.1_representation_of_data.ipynb
│   ├── 02.2_feature_extraction.ipynb
│   ├── 03_basic_principles_of_machine_learning.ipynb
│   ├── 04.1_supervised_classification.ipynb
│   ├── 04.2_supervised_regression.ipynb
│   ├── 04.3_measuring_prediction_performance.ipynb
│   ├── 05.1_application_to_face_recognition.ipynb
│   ├── 05.2_application_to_text_mining.ipynb
│   ├── 06.1_validation_and_testing.ipynb
│   ├── 06.2_validation_exercise.ipynb
│   ├── 07.1_in_depth_with_SVMs.ipynb
│   ├── 07.2_in_depth_trees_and_forests.ipynb
│   ├── 07.3_In_depth_with_linear_models.ipynb
│   ├── 08.1_unsupervised_dimreduction.ipynb
│   ├── 08.2_unsupervised_clustering__skipped_at_europython.ipynb
│   ├── 09_pipelining_estimators.ipynb
│   ├── 10_large_scale_text_classification.ipynb
│   ├── figures
│   │   ├── ML_flow_chart.py
│   │   ├── __init__.py
│   │   ├── bias_variance.py
│   │   ├── grid_search_cv_splits.png
│   │   ├── grid_search_parameters.png
│   │   ├── iris_setosa.jpg
│   │   ├── iris_versicolor.jpg
│   │   ├── iris_virginica.jpg
│   │   ├── linear_regression.py
│   │   ├── sdss_filters.py
│   │   ├── sgd_separator.py
│   │   ├── supervised_scikit_learn.png
│   │   └── svm_gui_frames.py
│   ├── helpers.py
│   ├── images
│   │   ├── parallel_text_clf.png
│   │   └── parallel_text_clf_average.png
│   └── solutions
│       ├── 02A_faces_plot.py
│       ├── 04B_houses_plot.py
│       ├── 04B_houses_regression.py
│       ├── 04C_validation_exercise.py
│       ├── 05B_strip_headers.py
│       ├── 06B_basic_grid_search.py
│       ├── 06B_learning_curves.py
│       ├── 08A_digits_projection.py
│       └── 08B_digits_clustering.py
├── rendered_notebooks
│   ├── 01.1_setup_and_introduction.ipynb
│   ├── 01.2_sklearn_overview.ipynb
│   ├── 02.1_representation_of_data.ipynb
│   ├── 02.2_feature_extraction.ipynb
│   ├── 03_basic_principles_of_machine_learning.ipynb
│   ├── 04.1_supervised_classification.ipynb
│   ├── 04.2_supervised_regression.ipynb
│   ├── 04.3_measuring_prediction_performance.ipynb
│   ├── 05.1_application_to_face_recognition.ipynb
│   ├── 05.2_application_to_text_mining.ipynb
│   ├── 06.1_validation_and_testing.ipynb
│   ├── 06.2_validation_exercise.ipynb
│   ├── 07.1_in_depth_with_SVMs.ipynb
│   ├── 07.2_in_depth_trees_and_forests.ipynb
│   ├── 07.3_In_depth_with_linear_models.ipynb
│   ├── 08.1_unsupervised_dimreduction.ipynb
│   ├── 08.2_unsupervised_clustering.ipynb
│   ├── 08.2_unsupervised_clustering__skipped_at_europython.ipynb
│   ├── 09_pipelining_estimators.ipynb
│   └── 10_large_scale_text_classification.ipynb
└── svm_gui.py
/.gitignore:
--------------------------------------------------------------------------------
1 | *~
2 | *.pyc
3 | *.npy
4 | *.npz
5 | notebooks/figures/downloads/
6 | *.ipynb_checkpoints
7 | *.mmap
8 | *.pkl
9 | datasets
10 | joblib
11 |
--------------------------------------------------------------------------------
/Makefile:
--------------------------------------------------------------------------------
1 | # Makefile used to manage the git repository, not for the tutorial
2 |
3 | all:
4 | 	python ipynbhelper.py --check
5 | 	python ipynbhelper.py --render
6 | 	python ipynbhelper.py --clean
7 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | Scikit-learn Tutorial for scipy beginners
2 | =========================================
3 |
4 | This repository contains files and other information for a training session on
5 | data analysis with scikit-learn, aimed at people who are not yet experts in it.
6 |
7 | These were originally associated with the EuroPython 2014 scikit-learn
8 | tutorial.
9 |
10 | **Instructor**: Gael Varoquaux
11 | [@GaelVaroquaux](https://twitter.com/GaelVaroquaux) |
12 | http://gael-varoquaux.info
13 |
14 | Installation Notes
15 | ------------------
16 |
17 | This tutorial will require recent installations of *numpy*, *scipy*,
18 | *matplotlib*, and *scikit-learn*.
19 |
20 | For users who do not yet have these packages installed, a relatively
21 | painless way to install all the requirements is to use a package such as
22 | [Anaconda CE](http://store.continuum.io/ "Anaconda CE"), which can be
23 | downloaded and installed for free.
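   |
   | If you already have a working Python setup, the same requirements can usually
   | be installed with `pip` (a hedged alternative, not part of the original
   | instructions):
   |
   |     pip install numpy scipy matplotlib scikit-learn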
24 |
25 | Reading the training materials
26 | -------------------------------
27 |
28 | **Not all the material will be covered during the training**:
29 | there is not enough time available. However, you can work through the
30 | material on your own.
31 |
32 |
33 | ### With the IPython notebook
34 |
35 | The recommended way to access the materials is to execute them in the
36 | IPython notebook. If you have the IPython notebook installed, you should
37 | download the materials (see below), go to the `notebooks` directory, and
38 | type:
39 |
40 | ipython notebook
41 |
42 | in your terminal window. This will open a notebook panel in your web
43 | browser.
44 |
45 | ### On the Internet
46 |
47 | If you don't have the IPython notebook installed, you can browse the
48 | files on the Internet:
49 |
50 | * For the instructions without the solutions:
51 |
52 | http://nbviewer.ipython.org/github/GaelVaroquaux/sklearn_europython_2014/tree/master/notebooks/
53 |
54 | * For the instructions and the solutions:
55 |
56 | http://nbviewer.ipython.org/github/GaelVaroquaux/sklearn_europython_2014/tree/master/rendered_notebooks/
57 |
58 | Downloading the Tutorial Materials
59 | ----------------------------------
60 |
61 | I would highly recommend using git, not only for this tutorial, but for the
62 | general betterment of your life. Once git is installed, you can clone the
63 | material in this tutorial by using the git address shown above:
64 |
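   | For example (assuming the GitHub repository referenced by the nbviewer links
   | above; adjust the URL if you work from a fork):
   |
   |     git clone https://github.com/GaelVaroquaux/sklearn_europython_2014.git
   |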
65 | If you can't or don't want to install git, there is a link above to download
66 | the contents of this repository as a zip file. I may make minor changes to
67 | the repository in the days before the tutorial, however, so cloning the
68 | repository is a much better option.
69 |
70 | Data Downloads
71 | --------------
72 |
73 | The data for this tutorial is not included in the repository. We will be
74 | using several data sets during the tutorial: most are built into
75 | scikit-learn, which includes code that automatically downloads and
76 | caches these data. Because the wireless network at conferences can often
77 | be spotty, it would be a good idea to download these data sets before
78 | arriving at the conference. You can do so by using the `fetch_data.py`
79 | script included in the tutorial materials.
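   |
   | For example, from the repository root (a sketch of the expected usage; the
   | script pre-downloads the 20 newsgroups and faces data):
   |
   |     python fetch_data.py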
80 |
81 | Original material from the SciPy 2013 tutorial
82 | ----------------------------------------------
83 |
84 | This material is adapted from the SciPy 2013 tutorial:
85 |
86 | http://github.com/jakevdp/sklearn_scipy2013
87 |
88 | Original authors:
89 |
90 | - Gael Varoquaux [@GaelVaroquaux](https://twitter.com/GaelVaroquaux) | http://gael-varoquaux.info
91 | - Olivier Grisel [@ogrisel](https://twitter.com/ogrisel) | http://ogrisel.com
92 | - Jake VanderPlas [@jakevdp](https://twitter.com/jakevdp) | http://jakevdp.github.com
93 |
94 |
95 |
96 |
--------------------------------------------------------------------------------
/fetch_data.py:
--------------------------------------------------------------------------------
1 | import os
2 | import urllib
3 | import tarfile
4 | import zipfile
5 |
6 |
7 | TWENTY_URL = ("http://people.csail.mit.edu/jrennie/"
8 | "20Newsgroups/20news-bydate.tar.gz")
9 | TWENTY_ARCHIVE_NAME = "20news-bydate.tar.gz"
10 | TWENTY_CACHE_NAME = "20news-bydate.pkz"
11 | TWENTY_TRAIN_FOLDER = "20news-bydate-train"
12 | TWENTY_TEST_FOLDER = "20news-bydate-test"
13 |
14 | SENTIMENT140_URL = ("http://cs.stanford.edu/people/alecmgo/"
15 | "trainingandtestdata.zip")
16 | SENTIMENT140_ARCHIVE_NAME = "trainingandtestdata.zip"
17 |
18 |
19 | def get_datasets_folder():
20 | here = os.path.dirname(__file__)
21 | notebooks = os.path.join(here, 'notebooks')
22 | datasets_folder = os.path.abspath(os.path.join(notebooks, 'datasets'))
23 | datasets_archive = os.path.abspath(os.path.join(notebooks, 'datasets.zip'))
24 |
25 | if not os.path.exists(datasets_folder):
26 | if os.path.exists(datasets_archive):
27 | print("Extracting " + datasets_archive)
28 | zf = zipfile.ZipFile(datasets_archive)
29 | zf.extractall('.')
30 | assert os.path.exists(datasets_folder)
31 | else:
32 | print("Creating datasets folder: " + datasets_folder)
33 | os.makedirs(datasets_folder)
34 | else:
35 |         print("Using existing dataset folder: " + datasets_folder)
36 | return datasets_folder
37 |
38 |
39 | def check_twenty_newsgroups(datasets_folder):
40 | print("Checking availability of the 20 newsgroups dataset")
41 |
42 | archive_path = os.path.join(datasets_folder, TWENTY_ARCHIVE_NAME)
43 | train_path = os.path.join(datasets_folder, TWENTY_TRAIN_FOLDER)
44 | test_path = os.path.join(datasets_folder, TWENTY_TEST_FOLDER)
45 |
46 | if not os.path.exists(archive_path):
47 | print("Downloading dataset from %s (14 MB)" % TWENTY_URL)
48 | opener = urllib.urlopen(TWENTY_URL)
49 | open(archive_path, 'wb').write(opener.read())
50 | else:
51 | print("Found archive: " + archive_path)
52 |
53 | if not os.path.exists(train_path) or not os.path.exists(test_path):
54 | print("Decompressing %s" % archive_path)
55 | tarfile.open(archive_path, "r:gz").extractall(path=datasets_folder)
56 |
57 | print("Checking that the 20 newsgroups files exist...")
58 | assert os.path.exists(train_path)
59 | assert os.path.exists(test_path)
60 | print("=> Success!")
61 |
62 |
63 | def check_sentiment140(datasets_folder):
64 | print("Checking availability of the sentiment 140 dataset")
65 | archive_path = os.path.join(datasets_folder, SENTIMENT140_ARCHIVE_NAME)
66 | sentiment140_path = os.path.join(datasets_folder, 'sentiment140')
67 | train_path = os.path.join(sentiment140_path,
68 | 'training.1600000.processed.noemoticon.csv')
69 | test_path = os.path.join(sentiment140_path,
70 | 'testdata.manual.2009.06.14.csv')
71 |
72 | if not os.path.exists(archive_path):
73 | print("Downloading dataset from %s (77MB)" % SENTIMENT140_URL)
74 | opener = urllib.urlopen(SENTIMENT140_URL)
75 | open(archive_path, 'wb').write(opener.read())
76 | else:
77 | print("Found archive: " + archive_path)
78 |
79 | if not os.path.exists(sentiment140_path):
80 | print("Extracting %s to %s" % (archive_path, sentiment140_path))
81 | zf = zipfile.ZipFile(archive_path)
82 | zf.extractall(sentiment140_path)
83 | print("Checking that the sentiment 140 CSV files exist...")
84 | assert os.path.exists(train_path)
85 | assert os.path.exists(test_path)
86 | print("=> Success!")
87 |
88 |
89 | if __name__ == "__main__":
90 | datasets_folder = get_datasets_folder()
91 | check_twenty_newsgroups(datasets_folder)
92 | from sklearn.datasets import fetch_olivetti_faces
93 | fetch_olivetti_faces()
94 |     print("Loading Labeled Faces Data (~200MB)")
95 | from sklearn.datasets import fetch_lfw_people
96 | fetch_lfw_people(min_faces_per_person=70, resize=0.4,
97 | data_home=datasets_folder)
98 |
99 |     print('Not downloading the sentiment140 data as we will not cover ')
100 |     print('notebook 10')
101 | #check_sentiment140(datasets_folder)
102 |
103 | print("=> Success!")
104 |
--------------------------------------------------------------------------------
/ipynbhelper.py:
--------------------------------------------------------------------------------
1 | """Utility script to be used to cleanup the notebooks before git commit
2 |
3 | This a mix from @minrk's various gists.
4 |
5 | """
6 |
7 | import sys
8 | import os
9 | import io
10 | from Queue import Empty
11 |
12 | from IPython.nbformat import current
13 | try:
14 | from IPython.kernel import KernelManager
15 | assert KernelManager # to silence pyflakes
16 | except ImportError:
17 | # 0.13
18 | from IPython.zmq.blockingkernelmanager import BlockingKernelManager
19 | KernelManager = BlockingKernelManager
20 |
21 |
22 | def remove_outputs(nb):
23 | """Remove the outputs from a notebook"""
24 | for ws in nb.worksheets:
25 | for cell in ws.cells:
26 | if cell.cell_type == 'code':
27 | cell.outputs = []
28 | if 'prompt_number' in cell:
29 | del cell['prompt_number']
30 |
31 |
32 | def run_cell(shell, iopub, cell, timeout=300):
33 | # print cell.input
34 | shell.execute(cell.input)
35 | # wait for finish, maximum 5min by default
36 | reply = shell.get_msg(timeout=timeout)['content']
37 | if reply['status'] == 'error':
38 | failed = True
39 | print "\nFAILURE:"
40 | print cell.input
41 | print '-----'
42 | print "raised:"
43 | print '\n'.join(reply['traceback'])
44 | else:
45 | failed = False
46 |
47 | # Collect the outputs of the cell execution
48 | outs = []
49 | while True:
50 | try:
51 | msg = iopub.get_msg(timeout=0.2)
52 | except Empty:
53 | break
54 | msg_type = msg['msg_type']
55 | if msg_type in ('status', 'pyin'):
56 | continue
57 | elif msg_type == 'clear_output':
58 | outs = []
59 | continue
60 |
61 | content = msg['content']
62 | # print msg_type, content
63 | out = current.NotebookNode(output_type=msg_type)
64 |
65 | if msg_type == 'stream':
66 | out.stream = content['name']
67 | out.text = content['data']
68 | elif msg_type in ('display_data', 'pyout'):
69 | for mime, data in content['data'].iteritems():
70 | attr = mime.split('/')[-1].lower()
71 | # this gets most right, but fix svg+html, plain
72 | attr = attr.replace('+xml', '').replace('plain', 'text')
73 | setattr(out, attr, data)
74 | if msg_type == 'pyout':
75 | out.prompt_number = content['execution_count']
76 | elif msg_type == 'pyerr':
77 | out.ename = content['ename']
78 | out.evalue = content['evalue']
79 | out.traceback = content['traceback']
80 | else:
81 | print "unhandled iopub msg:", msg_type
82 |
83 | outs.append(out)
84 | return outs, failed
85 |
86 |
87 | def run_notebook(nb):
88 | km = KernelManager()
89 | km.start_kernel(stderr=open(os.devnull, 'w'))
90 | if hasattr(km, 'client'):
91 | kc = km.client()
92 | kc.start_channels()
93 | iopub = kc.iopub_channel
94 | else:
95 | # IPython 0.13 compat
96 | kc = km
97 | kc.start_channels()
98 | iopub = kc.sub_channel
99 | shell = kc.shell_channel
100 |
101 | # simple ping:
102 | shell.execute("pass")
103 | shell.get_msg()
104 |
105 | cells = 0
106 | failures = 0
107 | for ws in nb.worksheets:
108 | for cell in ws.cells:
109 | if cell.cell_type != 'code':
110 | continue
111 |
112 | outputs, failed = run_cell(shell, iopub, cell)
113 | cell.outputs = outputs
114 | cell['prompt_number'] = cells
115 | failures += failed
116 | cells += 1
117 | sys.stdout.write('.')
118 |
119 | print
120 | print "ran notebook %s" % nb.metadata.name
121 | print " ran %3i cells" % cells
122 | if failures:
123 | print " %3i cells raised exceptions" % failures
124 | kc.stop_channels()
125 | km.shutdown_kernel()
126 | del km
127 |
128 |
129 | def process_notebook_file(fname, action='clean', output_fname=None):
130 | print("Performing '{}' on: {}".format(action, fname))
131 | orig_wd = os.getcwd()
132 | with io.open(fname, 'rb') as f:
133 | nb = current.read(f, 'json')
134 |
135 | if action == 'check':
136 | os.chdir(os.path.dirname(fname))
137 | run_notebook(nb)
138 | remove_outputs(nb)
139 | elif action == 'render':
140 | os.chdir(os.path.dirname(fname))
141 | run_notebook(nb)
142 | else:
143 | # Clean by default
144 | remove_outputs(nb)
145 |
146 | os.chdir(orig_wd)
147 | if output_fname is None:
148 | output_fname = fname
149 | with io.open(output_fname, 'wb') as f:
150 | nb = current.write(nb, f, 'json')
151 |
152 |
153 | if __name__ == '__main__':
154 | # TODO: use argparse instead
155 | args = sys.argv[1:]
156 | targets = [t for t in args if not t.startswith('--')]
157 | action = 'check' if '--check' in args else 'clean'
158 | action = 'render' if '--render' in args else action
159 |
160 | rendered_folder = os.path.join(os.path.dirname(__file__),
161 | 'rendered_notebooks')
162 | if not os.path.exists(rendered_folder):
163 | os.makedirs(rendered_folder)
164 | if not targets:
165 | targets = [os.path.join(os.path.dirname(__file__), 'notebooks')]
166 |
167 | for target in targets:
168 | if os.path.isdir(target):
169 | fnames = [os.path.abspath(os.path.join(target, f))
170 | for f in os.listdir(target)
171 | if f.endswith('.ipynb')]
172 | else:
173 | fnames = [target]
174 | for fname in fnames:
175 | if action == 'render':
176 | output_fname = os.path.join(rendered_folder,
177 | os.path.basename(fname))
178 | else:
179 | output_fname = fname
180 | process_notebook_file(fname, action=action,
181 | output_fname=output_fname)
182 |
--------------------------------------------------------------------------------
/notebooks/01.1_setup_and_introduction.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "metadata": {
3 | "name": ""
4 | },
5 | "nbformat": 3,
6 | "nbformat_minor": 0,
7 | "worksheets": [
8 | {
9 | "cells": [
10 | {
11 | "cell_type": "heading",
12 | "level": 1,
13 | "metadata": {},
14 | "source": [
15 | "Introduction to Machine Learning in Python"
16 | ]
17 | },
18 | {
19 | "cell_type": "markdown",
20 | "metadata": {},
21 | "source": [
22 | "In this section we'll go through some preliminary topics, as well as some of the\n",
23 | "requirements for this tutorial.\n",
24 | "\n",
25 | "By the end of this section you should:\n",
26 | "\n",
27 | "- Know what sort of tasks qualify as Machine Learning problems.\n",
28 | "- See some simple examples of machine learning\n",
29 | "- Know the basics of creating and manipulating numpy arrays.\n",
30 | "- Know the basics of scatter plots in matplotlib."
31 | ]
32 | },
33 | {
34 | "cell_type": "heading",
35 | "level": 2,
36 | "metadata": {},
37 | "source": [
38 | "What is Machine Learning?"
39 | ]
40 | },
41 | {
42 | "cell_type": "markdown",
43 | "metadata": {},
44 | "source": [
45 | "In this section we will begin to explore the basic principles of machine learning.\n",
46 | "Machine Learning is about building programs with **tunable parameters** (typically an\n",
47 | "array of floating point values) that are adjusted automatically so as to improve\n",
48 | "their behavior by **adapting to previously seen data.**\n",
49 | "\n",
50 | "Machine Learning can be considered a subfield of **Artificial Intelligence** since those\n",
51 | "algorithms can be seen as building blocks to make computers learn to behave more\n",
52 |       "intelligently by somehow **generalizing** rather than just storing and retrieving data items\n",
53 | "like a database system would do.\n",
54 | "\n",
55 | "We'll take a look at two very simple machine learning tasks here.\n",
56 | "The first is a **classification** task: the figure shows a\n",
57 | "collection of two-dimensional data, colored according to two different class\n",
58 | "labels. A classification algorithm may be used to draw a dividing boundary\n",
59 | "between the two clusters of points:"
60 | ]
61 | },
62 | {
63 | "cell_type": "code",
64 | "collapsed": false,
65 | "input": [
66 |       "# Start matplotlib inline mode, so figures will appear in the notebook\n",
67 | "%matplotlib inline"
68 | ],
69 | "language": "python",
70 | "metadata": {},
71 | "outputs": []
72 | },
73 | {
74 | "cell_type": "code",
75 | "collapsed": false,
76 | "input": [
77 | "# Import the example plot from the figures directory\n",
78 | "from figures import plot_sgd_separator\n",
79 | "plot_sgd_separator()"
80 | ],
81 | "language": "python",
82 | "metadata": {},
83 | "outputs": []
84 | },
85 | {
86 | "cell_type": "markdown",
87 | "metadata": {},
88 | "source": [
89 | "This may seem like a trivial task, but it is a simple version of a very important concept.\n",
90 | "By drawing this separating line, we have learned a model which can **generalize** to new\n",
91 | "data: if you were to drop another point onto the plane which is unlabeled, this algorithm\n",
92 | "could now **predict** whether it's a blue or a red point.\n",
93 | "\n",
94 | "If you'd like to see the source code used to generate this, you can either open the\n",
95 | "code in the `figures` directory, or you can load the code using the `%load` magic command:"
96 | ]
97 | },
98 | {
99 | "cell_type": "code",
100 | "collapsed": false,
101 | "input": [
102 | "%load figures/sgd_separator.py"
103 | ],
104 | "language": "python",
105 | "metadata": {},
106 | "outputs": []
107 | },
108 | {
109 | "cell_type": "markdown",
110 | "metadata": {},
111 | "source": [
112 | "The next simple task we'll look at is a **regression** task: a simple best-fit line\n",
113 | "to a set of data:"
114 | ]
115 | },
116 | {
117 | "cell_type": "code",
118 | "collapsed": false,
119 | "input": [
120 | "from figures import plot_linear_regression\n",
121 | "plot_linear_regression()"
122 | ],
123 | "language": "python",
124 | "metadata": {},
125 | "outputs": []
126 | },
127 | {
128 | "cell_type": "markdown",
129 | "metadata": {},
130 | "source": [
131 | "Again, this is an example of fitting a model to data, such that the model can make\n",
132 | "generalizations about new data. The model has been **learned** from the training\n",
133 | "data, and can be used to predict the result of test data:\n",
134 | "here, we might be given an x-value, and the model would\n",
135 | "allow us to predict the y value. Again, this might seem like a trivial problem,\n",
136 | "but it is a basic example of a type of operation that is fundamental to\n",
137 | "machine learning tasks."
138 | ]
139 | },
140 | {
141 | "cell_type": "heading",
142 | "level": 2,
143 | "metadata": {},
144 | "source": [
145 | "Numpy Arrays"
146 | ]
147 | },
148 | {
149 | "cell_type": "markdown",
150 | "metadata": {},
151 | "source": [
152 | "Manipulating `numpy` arrays is an important part of doing machine learning\n",
153 | "(or, really, any type of scientific computation) in python. This will likely\n",
154 | "be review for most: we'll quickly go through some of the most important features."
155 | ]
156 | },
157 | {
158 | "cell_type": "code",
159 | "collapsed": false,
160 | "input": [
161 | "import numpy as np\n",
162 | "\n",
163 | "# Generating a random array\n",
164 | "X = np.random.random((3, 5)) # a 3 x 5 array\n",
165 | "\n",
166 | "print X"
167 | ],
168 | "language": "python",
169 | "metadata": {},
170 | "outputs": []
171 | },
172 | {
173 | "cell_type": "code",
174 | "collapsed": false,
175 | "input": [
176 | "# Accessing elements\n",
177 | "\n",
178 | "# get a single element\n",
179 | "print X[0, 0]\n",
180 | "\n",
181 | "# get a row\n",
182 | "print X[1]\n",
183 | "\n",
184 | "# get a column\n",
185 | "print X[:, 1]"
186 | ],
187 | "language": "python",
188 | "metadata": {},
189 | "outputs": []
190 | },
191 | {
192 | "cell_type": "code",
193 | "collapsed": false,
194 | "input": [
195 | "# Transposing an array\n",
196 | "print X.T"
197 | ],
198 | "language": "python",
199 | "metadata": {},
200 | "outputs": []
201 | },
202 | {
203 | "cell_type": "code",
204 | "collapsed": false,
205 | "input": [
206 | "# Turning a row vector into a column vector\n",
207 | "y = np.linspace(0, 12, 5)\n",
208 | "print y"
209 | ],
210 | "language": "python",
211 | "metadata": {},
212 | "outputs": []
213 | },
214 | {
215 | "cell_type": "code",
216 | "collapsed": false,
217 | "input": [
218 | "# make into a column vector\n",
219 | "print y[:, np.newaxis]"
220 | ],
221 | "language": "python",
222 | "metadata": {},
223 | "outputs": []
224 | },
225 | {
226 | "cell_type": "markdown",
227 | "metadata": {},
228 | "source": [
229 | "There is much, much more to know, but these few operations are fundamental to what we'll\n",
230 | "do during this tutorial."
231 | ]
232 | },
233 | {
234 | "cell_type": "heading",
235 | "level": 2,
236 | "metadata": {},
237 | "source": [
238 | "Scipy Sparse Matrices"
239 | ]
240 | },
241 | {
242 | "cell_type": "markdown",
243 | "metadata": {},
244 | "source": [
245 | "We won't make very much use of these in this tutorial, but sparse matrices are very nice\n",
246 | "in some situations. For example, in some machine learning tasks, especially those associated\n",
247 | "with textual analysis, the data may be mostly zeros. Storing all these zeros is very\n",
248 | "inefficient. We can create and manipulate sparse matrices as follows:"
249 | ]
250 | },
251 | {
252 | "cell_type": "code",
253 | "collapsed": false,
254 | "input": [
255 | "from scipy import sparse\n",
256 | "\n",
257 | "# Create a random array with a lot of zeros\n",
258 | "X = np.random.random((10, 5))\n",
259 | "print X"
260 | ],
261 | "language": "python",
262 | "metadata": {},
263 | "outputs": []
264 | },
265 | {
266 | "cell_type": "code",
267 | "collapsed": false,
268 | "input": [
269 | "# set the majority of elements to zero\n",
270 | "X[X < 0.7] = 0\n",
271 | "print X"
272 | ],
273 | "language": "python",
274 | "metadata": {},
275 | "outputs": []
276 | },
277 | {
278 | "cell_type": "code",
279 | "collapsed": false,
280 | "input": [
281 | "# turn X into a csr (Compressed-Sparse-Row) matrix\n",
282 | "X_csr = sparse.csr_matrix(X)\n",
283 | "print X_csr"
284 | ],
285 | "language": "python",
286 | "metadata": {},
287 | "outputs": []
288 | },
289 | {
290 | "cell_type": "code",
291 | "collapsed": false,
292 | "input": [
293 | "# convert the sparse matrix to a dense array\n",
294 | "print X_csr.toarray()"
295 | ],
296 | "language": "python",
297 | "metadata": {},
298 | "outputs": []
299 | },
300 | {
301 | "cell_type": "markdown",
302 | "metadata": {},
303 | "source": [
304 | "______\n",
305 | "\n",
306 | "**To learn numpy and scipy**: http://scipy-lectures.github.io"
307 | ]
308 | },
309 | {
310 | "cell_type": "heading",
311 | "level": 2,
312 | "metadata": {},
313 | "source": [
314 | "Matplotlib"
315 | ]
316 | },
317 | {
318 | "cell_type": "markdown",
319 | "metadata": {},
320 | "source": [
321 | "Another important part of machine learning is visualization of data. The most common\n",
322 | "tool for this in Python is `matplotlib`. It is an extremely flexible package, but\n",
323 | "we will go over some basics here.\n",
324 | "\n",
325 |       "First, something specific to the IPython notebook: we can turn on \"matplotlib inline\" mode,\n",
326 |       "which will make plots show up inline in the notebook."
327 | ]
328 | },
329 | {
330 | "cell_type": "code",
331 | "collapsed": false,
332 | "input": [
333 | "%matplotlib inline"
334 | ],
335 | "language": "python",
336 | "metadata": {},
337 | "outputs": []
338 | },
339 | {
340 | "cell_type": "code",
341 | "collapsed": false,
342 | "input": [
343 | "# Here we import the plotting functions\n",
344 | "import matplotlib.pyplot as plt"
345 | ],
346 | "language": "python",
347 | "metadata": {},
348 | "outputs": []
349 | },
350 | {
351 | "cell_type": "code",
352 | "collapsed": false,
353 | "input": [
354 | "# plotting a line\n",
355 | "x = np.linspace(0, 10, 100)\n",
356 | "plt.plot(x, np.sin(x))"
357 | ],
358 | "language": "python",
359 | "metadata": {},
360 | "outputs": []
361 | },
362 | {
363 | "cell_type": "code",
364 | "collapsed": false,
365 | "input": [
366 | "# scatter-plot points\n",
367 | "x = np.random.normal(size=500)\n",
368 | "y = np.random.normal(size=500)\n",
369 | "plt.scatter(x, y)"
370 | ],
371 | "language": "python",
372 | "metadata": {},
373 | "outputs": []
374 | },
375 | {
376 | "cell_type": "code",
377 | "collapsed": false,
378 | "input": [
379 | "# showing images\n",
380 | "x = np.linspace(1, 12, 100)\n",
381 | "y = x[:, np.newaxis]\n",
382 | "\n",
383 | "im = y * np.sin(x) * np.cos(y)\n",
384 | "print im.shape"
385 | ],
386 | "language": "python",
387 | "metadata": {},
388 | "outputs": []
389 | },
390 | {
391 | "cell_type": "code",
392 | "collapsed": false,
393 | "input": [
394 | "# imshow - note that origin is at the top-left by default!\n",
395 | "plt.imshow(im)"
396 | ],
397 | "language": "python",
398 | "metadata": {},
399 | "outputs": []
400 | },
401 | {
402 | "cell_type": "code",
403 | "collapsed": false,
404 | "input": [
405 | "# Contour plot - note that origin here is at the bottom-left by default!\n",
406 | "plt.contour(im)"
407 | ],
408 | "language": "python",
409 | "metadata": {},
410 | "outputs": []
411 | },
412 | {
413 | "cell_type": "markdown",
414 | "metadata": {},
415 | "source": [
416 | "There are many, many more plot types available. One useful way to explore these is by\n",
417 | "looking at the matplotlib gallery: http://matplotlib.org/gallery.html\n",
418 | "\n",
419 | "You can test these examples out easily in the notebook: simply copy the ``Source Code``\n",
420 | "link on each page, and put it in a notebook using the ``%load`` magic.\n",
421 | "For example:"
422 | ]
423 | },
424 | {
425 | "cell_type": "code",
426 | "collapsed": false,
427 | "input": [
428 | "%load http://matplotlib.org/mpl_examples/pylab_examples/ellipse_collection.py"
429 | ],
430 | "language": "python",
431 | "metadata": {},
432 | "outputs": []
433 | },
434 | {
435 | "cell_type": "markdown",
436 | "metadata": {},
437 | "source": [
438 | "\n",
439 | "\n",
440 | "_______\n",
441 | "\n",
442 | "\n"
443 | ]
444 | },
445 | {
446 | "cell_type": "heading",
447 | "level": 4,
448 | "metadata": {},
449 | "source": [
450 | "Additional plots: 3D -- skipped at EuroPython"
451 | ]
452 | },
453 | {
454 | "cell_type": "code",
455 | "collapsed": false,
456 | "input": [
457 | "# 3D plotting\n",
458 | "from mpl_toolkits.mplot3d import Axes3D\n",
459 | "ax = plt.axes(projection='3d')\n",
460 | "xgrid, ygrid = np.meshgrid(x, y.ravel())\n",
461 | "ax.plot_surface(xgrid, ygrid, im, cmap=plt.cm.jet, cstride=2, rstride=2, linewidth=0)"
462 | ],
463 | "language": "python",
464 | "metadata": {},
465 | "outputs": []
466 | }
467 | ],
468 | "metadata": {}
469 | }
470 | ]
471 | }
--------------------------------------------------------------------------------
/notebooks/01.2_sklearn_overview.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "metadata": {
3 | "name": ""
4 | },
5 | "nbformat": 3,
6 | "nbformat_minor": 0,
7 | "worksheets": [
8 | {
9 | "cells": [
10 | {
11 | "cell_type": "heading",
12 | "level": 1,
13 | "metadata": {},
14 | "source": [
15 | "An Overview of Scikit-learn"
16 | ]
17 | },
18 | {
19 | "cell_type": "markdown",
20 | "metadata": {},
21 | "source": [
22 | "*Adapted from* [*http://scikit-learn.org/stable/tutorial/basic/tutorial.html*](http://scikit-learn.org/stable/tutorial/basic/tutorial.html)"
23 | ]
24 | },
25 | {
26 | "cell_type": "code",
27 | "collapsed": false,
28 | "input": [
29 | "%matplotlib inline\n",
30 | "import numpy as np\n",
31 | "from matplotlib import pyplot as plt"
32 | ],
33 | "language": "python",
34 | "metadata": {},
35 | "outputs": []
36 | },
37 | {
38 | "cell_type": "heading",
39 | "level": 2,
40 | "metadata": {},
41 | "source": [
42 | "Loading an Example Dataset"
43 | ]
44 | },
45 | {
46 | "cell_type": "code",
47 | "collapsed": false,
48 | "input": [
49 | "from sklearn import datasets\n",
50 | "digits = datasets.load_digits()"
51 | ],
52 | "language": "python",
53 | "metadata": {},
54 | "outputs": []
55 | },
56 | {
57 | "cell_type": "code",
58 | "collapsed": false,
59 | "input": [
60 | "digits.data"
61 | ],
62 | "language": "python",
63 | "metadata": {},
64 | "outputs": []
65 | },
66 | {
67 | "cell_type": "code",
68 | "collapsed": false,
69 | "input": [
70 | "digits.target"
71 | ],
72 | "language": "python",
73 | "metadata": {},
74 | "outputs": []
75 | },
76 | {
77 | "cell_type": "code",
78 | "collapsed": false,
79 | "input": [
80 | "digits.images[0]"
81 | ],
82 | "language": "python",
83 | "metadata": {},
84 | "outputs": []
85 | },
86 | {
87 | "cell_type": "heading",
88 | "level": 2,
89 | "metadata": {},
90 | "source": [
91 | "Learning and Predicting"
92 | ]
93 | },
94 | {
95 | "cell_type": "code",
96 | "collapsed": false,
97 | "input": [
98 | "from sklearn import svm\n",
99 | "clf = svm.SVC(gamma=0.001, C=100.)"
100 | ],
101 | "language": "python",
102 | "metadata": {},
103 | "outputs": []
104 | },
105 | {
106 | "cell_type": "code",
107 | "collapsed": false,
108 | "input": [
109 | "clf.fit(digits.data[:-1], digits.target[:-1])"
110 | ],
111 | "language": "python",
112 | "metadata": {},
113 | "outputs": []
114 | },
115 | {
116 | "cell_type": "code",
117 | "collapsed": false,
118 | "input": [
119 | "clf.predict(digits.data[-1])"
120 | ],
121 | "language": "python",
122 | "metadata": {},
123 | "outputs": []
124 | },
125 | {
126 | "cell_type": "code",
127 | "collapsed": false,
128 | "input": [
129 | "plt.figure(figsize=(2, 2))\n",
130 | "plt.imshow(digits.images[-1], interpolation='nearest', cmap=plt.cm.binary)"
131 | ],
132 | "language": "python",
133 | "metadata": {},
134 | "outputs": []
135 | },
136 | {
137 | "cell_type": "code",
138 | "collapsed": false,
139 | "input": [
140 | "print digits.target[-1]"
141 | ],
142 | "language": "python",
143 | "metadata": {},
144 | "outputs": []
145 | },
146 | {
147 | "cell_type": "heading",
148 | "level": 2,
149 | "metadata": {},
150 | "source": [
151 | "Model Persistence"
152 | ]
153 | },
154 | {
155 | "cell_type": "code",
156 | "collapsed": false,
157 | "input": [
158 | "from sklearn import svm\n",
159 | "from sklearn import datasets\n",
160 | "clf = svm.SVC()\n",
161 | "iris = datasets.load_iris()\n",
162 | "X, y = iris.data, iris.target\n",
163 | "clf.fit(X, y)"
164 | ],
165 | "language": "python",
166 | "metadata": {},
167 | "outputs": []
168 | },
169 | {
170 | "cell_type": "code",
171 | "collapsed": false,
172 | "input": [
173 | "import pickle\n",
174 | "s = pickle.dumps(clf)\n",
175 | "clf2 = pickle.loads(s)\n",
176 | "clf2.predict(X[0])"
177 | ],
178 | "language": "python",
179 | "metadata": {},
180 | "outputs": []
181 | },
182 | {
183 | "cell_type": "code",
184 | "collapsed": false,
185 | "input": [
186 | "y[0]"
187 | ],
188 | "language": "python",
189 | "metadata": {},
190 | "outputs": []
191 | },
192 | {
193 | "cell_type": "code",
194 | "collapsed": false,
195 | "input": [
196 | "from sklearn.externals import joblib\n",
197 | "joblib.dump(clf, 'filename.pkl') "
198 | ],
199 | "language": "python",
200 | "metadata": {},
201 | "outputs": []
202 | }
203 | ],
204 | "metadata": {}
205 | }
206 | ]
207 | }
--------------------------------------------------------------------------------
/notebooks/02.2_feature_extraction.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "metadata": {
3 | "name": ""
4 | },
5 | "nbformat": 3,
6 | "nbformat_minor": 0,
7 | "worksheets": [
8 | {
9 | "cells": [
10 | {
11 | "cell_type": "heading",
12 | "level": 1,
13 | "metadata": {},
14 | "source": [
15 | "Feature Extraction"
16 | ]
17 | },
18 | {
19 | "cell_type": "markdown",
20 | "metadata": {},
21 | "source": [
22 | "Here we will talk about an important piece of machine learning: the extraction of\n",
23 | "quantitative features from data. By the end of this section you will\n",
24 | "\n",
25 | "- Know how features are extracted from real-world data.\n",
26 | "- See an example of extracting numerical features from textual data\n",
27 | "\n",
28 | "In addition, we will go over several basic tools within scikit-learn which can be used to accomplish the above tasks."
29 | ]
30 | },
31 | {
32 | "cell_type": "heading",
33 | "level": 2,
34 | "metadata": {},
35 | "source": [
36 | "What Are Features?"
37 | ]
38 | },
39 | {
40 | "cell_type": "heading",
41 | "level": 3,
42 | "metadata": {},
43 | "source": [
44 | "Numerical Features"
45 | ]
46 | },
47 | {
48 | "cell_type": "markdown",
49 | "metadata": {},
50 | "source": [
51 | "Recall that data in scikit-learn is expected to be in two-dimensional arrays, of size\n",
52 | "**n_samples** $\\times$ **n_features**.\n",
53 | "\n",
54 | "Previously, we looked at the iris dataset, which has 150 samples and 4 features"
55 | ]
56 | },
57 | {
58 | "cell_type": "code",
59 | "collapsed": false,
60 | "input": [
61 | "from sklearn.datasets import load_iris\n",
62 | "iris = load_iris()\n",
63 | "print iris.data.shape"
64 | ],
65 | "language": "python",
66 | "metadata": {},
67 | "outputs": []
68 | },
69 | {
70 | "cell_type": "markdown",
71 | "metadata": {},
72 | "source": [
73 | "These features are:\n",
74 | "\n",
75 | "- sepal length in cm\n",
76 | "- sepal width in cm\n",
77 | "- petal length in cm\n",
78 | "- petal width in cm\n",
79 | "\n",
80 | "Numerical features such as these are pretty straightforward: each sample contains a list\n",
81 | "of floating-point numbers corresponding to the features"
82 | ]
83 | },
84 | {
85 | "cell_type": "heading",
86 | "level": 3,
87 | "metadata": {},
88 | "source": [
89 | "Categorical Features"
90 | ]
91 | },
92 | {
93 | "cell_type": "markdown",
94 | "metadata": {},
95 | "source": [
96 | "What if you have categorical features? For example, imagine there is data on the color of each\n",
97 | "iris:\n",
98 | "\n",
99 | " color in [red, blue, purple]\n",
100 | "\n",
101 | "You might be tempted to assign numbers to these features, i.e. *red=1, blue=2, purple=3*\n",
102 | "but in general **this is a bad idea**. Estimators tend to operate under the assumption that\n",
103 | "numerical features lie on some continuous scale, so, for example, 1 and 2 are more alike\n",
104 | "than 1 and 3, and this is often not the case for categorical features.\n",
105 | "\n",
106 | "A better strategy is to give each category its own dimension. \n",
107 | "The enriched iris feature set would hence be in this case:\n",
108 | "\n",
109 | "- sepal length in cm\n",
110 | "- sepal width in cm\n",
111 | "- petal length in cm\n",
112 | "- petal width in cm\n",
113 | "- color=purple (1.0 or 0.0)\n",
114 | "- color=blue (1.0 or 0.0)\n",
115 | "- color=red (1.0 or 0.0)\n",
116 | "\n",
117 | "Note that using many of these categorical features may result in data which is better\n",
118 | "represented as a **sparse matrix**, as we'll see with the text classification example\n",
119 | "below."
120 | ]
121 | },
122 | {
123 | "cell_type": "heading",
124 | "level": 4,
125 | "metadata": {},
126 | "source": [
127 | "Using the DictVectorizer to encode categorical features"
128 | ]
129 | },
130 | {
131 | "cell_type": "markdown",
132 | "metadata": {},
133 | "source": [
134 |       "When the source data is encoded as a list of dicts where the values are either string names for categories or numerical values, you can use the `DictVectorizer` class to compute the boolean expansion of the categorical features while leaving the numerical features untouched:"
135 | ]
136 | },
137 | {
138 | "cell_type": "code",
139 | "collapsed": false,
140 | "input": [
141 | "measurements = [\n",
142 | " {'city': 'Dubai', 'temperature': 33.},\n",
143 | " {'city': 'London', 'temperature': 12.},\n",
144 | " {'city': 'San Fransisco', 'temperature': 18.},\n",
145 | "]"
146 | ],
147 | "language": "python",
148 | "metadata": {},
149 | "outputs": []
150 | },
151 | {
152 | "cell_type": "code",
153 | "collapsed": false,
154 | "input": [
155 | "from sklearn.feature_extraction import DictVectorizer\n",
156 | "\n",
157 | "vec = DictVectorizer()\n",
158 | "vec"
159 | ],
160 | "language": "python",
161 | "metadata": {},
162 | "outputs": []
163 | },
164 | {
165 | "cell_type": "code",
166 | "collapsed": false,
167 | "input": [
168 | "vec.fit_transform(measurements).toarray()"
169 | ],
170 | "language": "python",
171 | "metadata": {},
172 | "outputs": []
173 | },
174 | {
175 | "cell_type": "code",
176 | "collapsed": false,
177 | "input": [
178 | "vec.get_feature_names()"
179 | ],
180 | "language": "python",
181 | "metadata": {},
182 | "outputs": []
183 | },
184 | {
185 | "cell_type": "heading",
186 | "level": 3,
187 | "metadata": {},
188 | "source": [
189 | "Derived Features"
190 | ]
191 | },
192 | {
193 | "cell_type": "markdown",
194 | "metadata": {},
195 | "source": [
196 | "Another common feature type are **derived features**, where some pre-processing step is\n",
197 | "applied to the data to generate features that are somehow more informative. Derived\n",
198 |       "features may be based on **dimensionality reduction** (such as PCA or manifold learning),\n",
199 | "may be linear or nonlinear combinations of features (such as in Polynomial regression),\n",
200 | "or may be some more sophisticated transform of the features. The latter is often used\n",
201 | "in image processing.\n",
202 | "\n",
203 | "For example, [scikit-image](http://scikit-image.org/) provides a variety of feature\n",
204 | "extractors designed for image data: see the ``skimage.feature`` submodule.\n",
205 | "We will see some *dimensionality*-based feature extraction routines later in the tutorial."
206 | ]
207 | },
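   |     {
   |      "cell_type": "markdown",
   |      "metadata": {},
   |      "source": [
   |       "As an illustrative sketch (an addition, not part of the original tutorial), one simple derived-feature transform is to project the four iris measurements onto their first two principal components with `PCA`:"
   |      ]
   |     },
   |     {
   |      "cell_type": "code",
   |      "collapsed": false,
   |      "input": [
   |       "# Sketch: derive 2 features from the 4 iris measurements using PCA\n",
   |       "from sklearn.datasets import load_iris\n",
   |       "from sklearn.decomposition import PCA\n",
   |       "\n",
   |       "iris = load_iris()\n",
   |       "pca = PCA(n_components=2)\n",
   |       "X_derived = pca.fit_transform(iris.data)\n",
   |       "print X_derived.shape"
   |      ],
   |      "language": "python",
   |      "metadata": {},
   |      "outputs": []
   |     },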
208 | {
209 | "cell_type": "heading",
210 | "level": 3,
211 | "metadata": {},
212 | "source": [
213 | "Text Feature Extraction"
214 | ]
215 | },
216 | {
217 | "cell_type": "markdown",
218 | "metadata": {},
219 | "source": [
220 |       "Unstructured content such as text documents requires its own feature extraction step. In general, we treat words in text documents as individual categorical features. An example of text mining will be introduced later."
221 | ]
222 | }
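   |     ,
   |     {
   |      "cell_type": "markdown",
   |      "metadata": {},
   |      "source": [
   |       "As a small preview (an added sketch, not part of the original material), the bag-of-words idea can be tried directly with `CountVectorizer`:"
   |      ]
   |     },
   |     {
   |      "cell_type": "code",
   |      "collapsed": false,
   |      "input": [
   |       "# Sketch: treat each word as a categorical feature via bag-of-words counts\n",
   |       "from sklearn.feature_extraction.text import CountVectorizer\n",
   |       "\n",
   |       "docs = ['machine learning is fun', 'learning from text is fun too']\n",
   |       "vectorizer = CountVectorizer()\n",
   |       "X_text = vectorizer.fit_transform(docs)\n",
   |       "print vectorizer.get_feature_names()\n",
   |       "print X_text.toarray()"
   |      ],
   |      "language": "python",
   |      "metadata": {},
   |      "outputs": []
   |     }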
223 | ],
224 | "metadata": {}
225 | }
226 | ]
227 | }
--------------------------------------------------------------------------------
/notebooks/04.1_supervised_classification.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "metadata": {
3 | "name": ""
4 | },
5 | "nbformat": 3,
6 | "nbformat_minor": 0,
7 | "worksheets": [
8 | {
9 | "cells": [
10 | {
11 | "cell_type": "heading",
12 | "level": 1,
13 | "metadata": {},
14 | "source": [
15 | "Supervised Learning: Classification of Handwritten Digits"
16 | ]
17 | },
18 | {
19 | "cell_type": "markdown",
20 | "metadata": {},
21 | "source": [
22 | "In this section we'll apply scikit-learn to the classification of handwritten\n",
23 | "digits. This will go a bit beyond the iris classification we saw before: we'll\n",
24 | "discuss some of the metrics which can be used in evaluating the effectiveness\n",
25 | "of a classification model.\n",
26 | "\n",
27 | "We'll work with the handwritten digits dataset which we saw in an earlier\n",
28 | "section of the tutorial."
29 | ]
30 | },
31 | {
32 | "cell_type": "code",
33 | "collapsed": false,
34 | "input": [
35 | "from sklearn.datasets import load_digits\n",
36 | "digits = load_digits()"
37 | ],
38 | "language": "python",
39 | "metadata": {},
40 | "outputs": []
41 | },
42 | {
43 | "cell_type": "markdown",
44 | "metadata": {},
45 | "source": [
46 | "We'll re-use some of our code from before to visualize the data and remind us what\n",
47 | "we're looking at:"
48 | ]
49 | },
50 | {
51 | "cell_type": "code",
52 | "collapsed": false,
53 | "input": [
54 | "%matplotlib inline\n",
55 | "from matplotlib import pyplot as plt"
56 | ],
57 | "language": "python",
58 | "metadata": {},
59 | "outputs": []
60 | },
61 | {
62 | "cell_type": "code",
63 | "collapsed": false,
64 | "input": [
65 |       "# copied from notebook 02.1_representation_of_data.ipynb\n",
66 | "fig = plt.figure(figsize=(6, 6)) # figure size in inches\n",
67 | "fig.subplots_adjust(left=0, right=1, bottom=0, top=1, hspace=0.05, wspace=0.05)\n",
68 | "\n",
69 | "# plot the digits: each image is 8x8 pixels\n",
70 | "for i in range(64):\n",
71 | " ax = fig.add_subplot(8, 8, i + 1, xticks=[], yticks=[])\n",
72 | " ax.imshow(digits.images[i], cmap=plt.cm.binary, interpolation='nearest')\n",
73 | " \n",
74 | " # label the image with the target value\n",
75 | " ax.text(0, 7, str(digits.target[i]))"
76 | ],
77 | "language": "python",
78 | "metadata": {},
79 | "outputs": []
80 | },
81 | {
82 | "cell_type": "heading",
83 | "level": 2,
84 | "metadata": {},
85 | "source": [
86 | "Visualizing the Data"
87 | ]
88 | },
89 | {
90 | "cell_type": "markdown",
91 | "metadata": {},
92 | "source": [
93 | "A good first-step for many problems is to visualize the data using one of the\n",
94 | "*Dimensionality Reduction* techniques we saw earlier. We'll start with the\n",
95 | "most straightforward one, Principal Component Analysis (PCA).\n",
96 | "\n",
97 | "PCA seeks orthogonal linear combinations of the features which show the greatest\n",
98 | "variance, and as such, can help give you a good idea of the structure of the\n",
99 | "data set. Here we'll use `RandomizedPCA`, because it's faster for large `N`."
100 | ]
101 | },
102 | {
103 | "cell_type": "code",
104 | "collapsed": false,
105 | "input": [
106 | "from sklearn.decomposition import RandomizedPCA\n",
107 | "pca = RandomizedPCA(n_components=2)\n",
108 | "proj = pca.fit_transform(digits.data)"
109 | ],
110 | "language": "python",
111 | "metadata": {},
112 | "outputs": []
113 | },
114 | {
115 | "cell_type": "code",
116 | "collapsed": false,
117 | "input": [
118 | "plt.scatter(proj[:, 0], proj[:, 1], c=digits.target)\n",
119 | "plt.colorbar()"
120 | ],
121 | "language": "python",
122 | "metadata": {},
123 | "outputs": []
124 | },
125 | {
126 | "cell_type": "markdown",
127 | "metadata": {},
128 | "source": [
129 | "Here we see that the digits do cluster fairly well, so we can expect even\n",
130 | "a fairly naive classification scheme to do a decent job separating them."
131 | ]
132 | },
133 | {
134 | "cell_type": "markdown",
135 | "metadata": {},
136 | "source": [
137 | "**Question: Given these projections of the data, which numbers do you think\n",
138 | "a classifier might have trouble distinguishing?**"
139 | ]
140 | },
141 | {
142 | "cell_type": "heading",
143 | "level": 2,
144 | "metadata": {},
145 | "source": [
146 | "Gaussian Naive Bayes Classification"
147 | ]
148 | },
149 | {
150 | "cell_type": "markdown",
151 | "metadata": {},
152 | "source": [
153 | "For most classification problems, it's nice to have a simple, fast, go-to\n",
154 | "method to provide a quick baseline classification. If the simple and fast\n",
155 | "method is sufficient, then we don't have to waste CPU cycles on more complex\n",
156 | "models. If not, we can use the results of the simple method to give us\n",
157 | "clues about our data.\n",
158 | "\n",
159 |       "One good method to keep in mind is Gaussian Naive Bayes. It fits a Gaussian distribution to each training label, independently for each feature, and uses this to quickly give a rough classification. It is generally not sufficiently accurate for real-world data, but can perform surprisingly well, for instance on text data."
160 | ]
161 | },
162 | {
163 | "cell_type": "code",
164 | "collapsed": false,
165 | "input": [
166 | "from sklearn.naive_bayes import GaussianNB\n",
167 | "from sklearn.cross_validation import train_test_split"
168 | ],
169 | "language": "python",
170 | "metadata": {},
171 | "outputs": []
172 | },
173 | {
174 | "cell_type": "code",
175 | "collapsed": false,
176 | "input": [
177 | "# split the data into training and validation sets\n",
178 | "X_train, X_test, y_train, y_test = train_test_split(digits.data, digits.target)\n",
179 | "\n",
180 | "# train the model\n",
181 | "clf = GaussianNB()\n",
182 | "clf.fit(X_train, y_train)\n",
183 | "\n",
184 | "# use the model to predict the labels of the test data\n",
185 | "predicted = clf.predict(X_test)\n",
186 | "expected = y_test"
187 | ],
188 | "language": "python",
189 | "metadata": {},
190 | "outputs": []
191 | },
192 | {
193 | "cell_type": "markdown",
194 | "metadata": {},
195 | "source": [
196 | "**Question**: why did we split the data into training and validation sets?"
197 | ]
198 | },
199 | {
200 | "cell_type": "markdown",
201 | "metadata": {},
202 | "source": [
203 | "Let's plot the digits again with the predicted labels to get an idea of\n",
204 | "how well the classification is working:"
205 | ]
206 | },
207 | {
208 | "cell_type": "code",
209 | "collapsed": false,
210 | "input": [
211 | "fig = plt.figure(figsize=(6, 6)) # figure size in inches\n",
212 | "fig.subplots_adjust(left=0, right=1, bottom=0, top=1, hspace=0.05, wspace=0.05)\n",
213 | "\n",
214 | "# plot the digits: each image is 8x8 pixels\n",
215 | "for i in range(64):\n",
216 | " ax = fig.add_subplot(8, 8, i + 1, xticks=[], yticks=[])\n",
217 | " ax.imshow(X_test.reshape(-1, 8, 8)[i], cmap=plt.cm.binary,\n",
218 | " interpolation='nearest')\n",
219 | " \n",
220 | " # label the image with the target value\n",
221 | " if predicted[i] == expected[i]:\n",
222 | " ax.text(0, 7, str(predicted[i]), color='green')\n",
223 | " else:\n",
224 | " ax.text(0, 7, str(predicted[i]), color='red')"
225 | ],
226 | "language": "python",
227 | "metadata": {},
228 | "outputs": []
229 | },
230 | {
231 | "cell_type": "heading",
232 | "level": 2,
233 | "metadata": {},
234 | "source": [
235 | "Quantitative Measurement of Performance"
236 | ]
237 | },
238 | {
239 | "cell_type": "markdown",
240 | "metadata": {},
241 | "source": [
242 | "We'd like to measure the performance of our estimator without having to resort\n",
243 | "to plotting examples. A simple method might be to simply compare the number of\n",
244 | "matches:"
245 | ]
246 | },
247 | {
248 | "cell_type": "code",
249 | "collapsed": false,
250 | "input": [
251 | "matches = (predicted == expected)\n",
252 | "print matches.sum()\n",
253 | "print len(matches)"
254 | ],
255 | "language": "python",
256 | "metadata": {},
257 | "outputs": []
258 | },
259 | {
260 | "cell_type": "code",
261 | "collapsed": false,
262 | "input": [
263 | "matches.sum() / float(len(matches))"
264 | ],
265 | "language": "python",
266 | "metadata": {},
267 | "outputs": []
268 | },
269 | {
270 | "cell_type": "markdown",
271 | "metadata": {},
272 | "source": [
273 |       "We see that the large majority of the roughly 450 test predictions match the true labels. But there are other\n",
274 | "more sophisticated metrics that can be used to judge the performance of a classifier:\n",
275 | "several are available in the ``sklearn.metrics`` submodule.\n",
276 | "\n",
277 | "One of the most useful metrics is the ``classification_report``, which combines several\n",
278 | "measures and prints a table with the results:"
279 | ]
280 | },
281 | {
282 | "cell_type": "code",
283 | "collapsed": false,
284 | "input": [
285 | "from sklearn import metrics\n",
286 | "print metrics.classification_report(expected, predicted)"
287 | ],
288 | "language": "python",
289 | "metadata": {},
290 | "outputs": []
291 | },
292 | {
293 | "cell_type": "markdown",
294 | "metadata": {},
295 | "source": [
296 |       "Another enlightening metric for this sort of multi-class classification\n",
297 | "is a *confusion matrix*: it helps us visualize which labels are\n",
298 | "being interchanged in the classification errors:"
299 | ]
300 | },
301 | {
302 | "cell_type": "code",
303 | "collapsed": false,
304 | "input": [
305 | "print metrics.confusion_matrix(expected, predicted)"
306 | ],
307 | "language": "python",
308 | "metadata": {},
309 | "outputs": []
310 | },
311 | {
312 | "cell_type": "markdown",
313 | "metadata": {},
314 | "source": [
315 | "We see here that in particular, the numbers 1, 2, 3, and 9 are often being labeled 8."
316 | ]
317 | }
318 | ],
319 | "metadata": {}
320 | }
321 | ]
322 | }
--------------------------------------------------------------------------------
/notebooks/04.2_supervised_regression.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "metadata": {
3 | "name": ""
4 | },
5 | "nbformat": 3,
6 | "nbformat_minor": 0,
7 | "worksheets": [
8 | {
9 | "cells": [
10 | {
11 | "cell_type": "heading",
12 | "level": 1,
13 | "metadata": {},
14 | "source": [
15 | "Supervised Learning: Regression of Housing Data"
16 | ]
17 | },
18 | {
19 | "cell_type": "markdown",
20 | "metadata": {},
21 | "source": [
22 | "By the end of this section you will\n",
23 | "\n",
24 | "- Know how to instantiate a scikit-learn regression model\n",
25 | "- Know how to train a regressor by calling the `fit(...)` method\n",
26 | "- Know how to predict new labels by calling the `predict(...)` method\n",
27 | "\n",
28 | "Here we'll do a short example of a regression problem: learning a continuous value\n",
29 | "from a set of features.\n",
30 | "\n",
31 | "We'll use the simple Boston house prices set, available in scikit-learn. This\n",
32 | "records measurements of 13 attributes of housing markets around Boston, as well\n",
33 |       "as the median price. The question is: can you predict the price in a new\n",
34 |       "neighborhood given its attributes?\n",
35 | "\n",
36 | "First we'll load the dataset:"
37 | ]
38 | },
39 | {
40 | "cell_type": "code",
41 | "collapsed": false,
42 | "input": [
43 | "from sklearn.datasets import load_boston\n",
44 | "data = load_boston()\n",
45 | "print data.keys()"
46 | ],
47 | "language": "python",
48 | "metadata": {},
49 | "outputs": []
50 | },
51 | {
52 | "cell_type": "markdown",
53 | "metadata": {},
54 | "source": [
55 | "We can see that there are just over 500 data points:"
56 | ]
57 | },
58 | {
59 | "cell_type": "code",
60 | "collapsed": false,
61 | "input": [
62 | "print data.data.shape\n",
63 | "print data.target.shape"
64 | ],
65 | "language": "python",
66 | "metadata": {},
67 | "outputs": []
68 | },
69 | {
70 | "cell_type": "markdown",
71 | "metadata": {},
72 | "source": [
73 | "The ``DESCR`` variable has a long description of the dataset:"
74 | ]
75 | },
76 | {
77 | "cell_type": "code",
78 | "collapsed": false,
79 | "input": [
80 | "print data.DESCR"
81 | ],
82 | "language": "python",
83 | "metadata": {},
84 | "outputs": []
85 | },
86 | {
87 | "cell_type": "markdown",
88 | "metadata": {},
89 | "source": [
90 | "It often helps to quickly visualize pieces of the data using histograms, scatter plots,\n",
91 | "or other plot types. Here we'll load pylab and show a histogram of the target values:\n",
92 | "the median price in each neighborhood."
93 | ]
94 | },
95 | {
96 | "cell_type": "code",
97 | "collapsed": false,
98 | "input": [
99 | "%matplotlib inline\n",
100 | "import pylab as plt\n",
101 | "import numpy as np"
102 | ],
103 | "language": "python",
104 | "metadata": {},
105 | "outputs": []
106 | },
107 | {
108 | "cell_type": "code",
109 | "collapsed": false,
110 | "input": [
111 | "plt.hist(data.target)\n",
112 | "plt.xlabel('price ($1000s)')\n",
113 | "plt.ylabel('count')"
114 | ],
115 | "language": "python",
116 | "metadata": {},
117 | "outputs": []
118 | },
119 | {
120 | "cell_type": "markdown",
121 | "metadata": {},
122 | "source": [
123 | "**Quick Exercise:** Try some scatter plots of the features versus the target.\n",
124 | "\n",
125 | "Are there any features that seem to have a strong correlation with the\n",
126 | "target value? Any that don't?\n",
127 | "\n",
128 | "Remember, you can get at the data columns using:\n",
129 | "\n",
130 | " column_i = data.data[:, i]\n",
131 | " \n",
132 | "Also, you can create a new figure in matplotlib with:\n",
133 | "\n",
134 | " plt.figure()"
135 | ]
136 | },
137 | {
138 | "cell_type": "code",
139 | "collapsed": false,
140 | "input": [],
141 | "language": "python",
142 | "metadata": {},
143 | "outputs": []
144 | },
145 | {
146 | "cell_type": "markdown",
147 | "metadata": {},
148 | "source": [
149 | "**Solution**"
150 | ]
151 | },
152 | {
153 | "cell_type": "code",
154 | "collapsed": false,
155 | "input": [
156 | "%load solutions/04B_houses_plot.py"
157 | ],
158 | "language": "python",
159 | "metadata": {},
160 | "outputs": []
161 | },
162 | {
163 | "cell_type": "markdown",
164 | "metadata": {},
165 | "source": [
166 | "This is a manual version of a technique called **feature selection**.\n",
167 | "\n",
168 |       "Sometimes, in machine learning it is useful to apply\n",
169 |       "feature selection to decide which features are the most useful for a\n",
170 |       "particular problem. Automated methods exist that perform this sort\n",
171 |       "of selection of the most informative features. We won't cover\n",
172 |       "feature selection in this tutorial, but you can read about it elsewhere."
173 | ]
174 | },
175 | {
176 | "cell_type": "heading",
177 | "level": 2,
178 | "metadata": {},
179 | "source": [
180 | "Predicting Home Prices: a Simple Linear Regression"
181 | ]
182 | },
183 | {
184 | "cell_type": "markdown",
185 | "metadata": {},
186 | "source": [
187 | "Now we'll use ``scikit-learn`` to perform a simple linear regression\n",
188 | "on the housing data. There are many possibilities of regressors to\n",
189 | "use. A particularly simple one is ``LinearRegression``: this is\n",
190 | "basically a wrapper around an ordinary least squares calculation.\n",
191 | "\n",
192 | "We'll set it up like this:"
193 | ]
194 | },
195 | {
196 | "cell_type": "code",
197 | "collapsed": false,
198 | "input": [
199 | "from sklearn.cross_validation import train_test_split\n",
200 | "\n",
201 | "X_train, X_test, y_train, y_test = train_test_split(data.data, data.target)"
202 | ],
203 | "language": "python",
204 | "metadata": {},
205 | "outputs": []
206 | },
207 | {
208 | "cell_type": "code",
209 | "collapsed": false,
210 | "input": [
211 | "from sklearn.linear_model import LinearRegression\n",
212 | "\n",
213 | "clf = LinearRegression()\n",
214 | "\n",
215 | "clf.fit(X_train, y_train)"
216 | ],
217 | "language": "python",
218 | "metadata": {},
219 | "outputs": []
220 | },
221 | {
222 | "cell_type": "code",
223 | "collapsed": false,
224 | "input": [
225 | "predicted = clf.predict(X_test)\n",
226 | "expected = y_test"
227 | ],
228 | "language": "python",
229 | "metadata": {},
230 | "outputs": []
231 | },
232 | {
233 | "cell_type": "code",
234 | "collapsed": false,
235 | "input": [
236 | "plt.scatter(expected, predicted)\n",
237 | "plt.plot([0, 50], [0, 50], '--k')\n",
238 | "plt.axis('tight')\n",
239 | "plt.xlabel('True price ($1000s)')\n",
240 | "plt.ylabel('Predicted price ($1000s)')\n",
241 | "print \"RMS:\", np.sqrt(np.mean((predicted - expected) ** 2))"
242 | ],
243 | "language": "python",
244 | "metadata": {},
245 | "outputs": []
246 | },
247 | {
248 | "cell_type": "markdown",
249 | "metadata": {},
250 | "source": [
251 | "The prediction at least correlates with the true price, though there\n",
252 | "are clearly some biases. We could imagine evaluating the performance\n",
253 | "of the regressor by, say, computing the RMS residuals between the\n",
254 | "true and predicted price. There are some subtleties in this, however,\n",
255 | "which we'll cover in a later section."
256 | ]
257 | },
258 | {
259 | "cell_type": "markdown",
260 | "metadata": {},
261 | "source": [
262 |       "There are many examples of regression-type problems in machine learning:\n",
263 | "\n",
264 | "- **Sales:** given consumer data, predict how much they will spend\n",
265 | "- **Advertising:** given information about a user, predict the click-through rate for a web ad.\n",
266 | "- **Collaborative Filtering:** given a collection of user-ratings for movies, predict preferences for other movies & users\n",
267 | "- **Astronomy:** given observations of galaxies, predict their mass or redshift\n",
268 | "\n",
269 | "And much, much more."
270 | ]
271 | },
272 | {
273 | "cell_type": "heading",
274 | "level": 2,
275 | "metadata": {},
276 | "source": [
277 | "Exercise: Gradient Boosting Tree Regression"
278 | ]
279 | },
280 | {
281 | "cell_type": "markdown",
282 | "metadata": {},
283 | "source": [
284 | "There are many other types of regressors available in scikit-learn:\n",
285 | "we'll try a more powerful one here.\n",
286 | "\n",
287 | "**Use the GradientBoostingRegressor class to fit the housing data**.\n",
288 | "\n",
289 | "You can copy and paste some of the above code, replacing `LinearRegression`\n",
290 | "with `GradientBoostingRegressor`."
291 | ]
292 | },
293 | {
294 | "cell_type": "code",
295 | "collapsed": false,
296 | "input": [
297 | "from sklearn.ensemble import GradientBoostingRegressor\n",
298 |       "# Instantiate the model, fit it on the training data, and scatter-plot true vs. predicted values"
299 | ],
300 | "language": "python",
301 | "metadata": {},
302 | "outputs": []
303 | },
304 | {
305 | "cell_type": "heading",
306 | "level": 3,
307 | "metadata": {},
308 | "source": [
309 | "Solution:"
310 | ]
311 | },
312 | {
313 | "cell_type": "code",
314 | "collapsed": false,
315 | "input": [
316 | "%load solutions/04B_houses_regression.py"
317 | ],
318 | "language": "python",
319 | "metadata": {},
320 | "outputs": []
321 | }
322 | ],
323 | "metadata": {}
324 | }
325 | ]
326 | }
--------------------------------------------------------------------------------
/notebooks/04.3_measuring_prediction_performance.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "metadata": {
3 | "name": ""
4 | },
5 | "nbformat": 3,
6 | "nbformat_minor": 0,
7 | "worksheets": [
8 | {
9 | "cells": [
10 | {
11 | "cell_type": "heading",
12 | "level": 1,
13 | "metadata": {},
14 | "source": [
15 | "Measuring prediction performance"
16 | ]
17 | },
18 | {
19 | "cell_type": "markdown",
20 | "metadata": {},
21 | "source": [
22 | "Here we will discuss how to use **validation sets** to get a better measure of\n",
23 | "performance for a classifier."
24 | ]
25 | },
26 | {
27 | "cell_type": "heading",
28 | "level": 2,
29 | "metadata": {},
30 | "source": [
31 | "Using the K-neighbors classifier"
32 | ]
33 | },
34 | {
35 | "cell_type": "markdown",
36 | "metadata": {},
37 | "source": [
38 | "Here we'll continue to look at the digits data, but we'll switch to the\n",
39 | "K-Neighbors classifier. The K-neighbors classifier is an instance-based\n",
40 |       "classifier: it predicts the label of\n",
41 | "an unknown point based on the labels of the *K* nearest points in the\n",
42 | "parameter space."
43 | ]
44 | },
45 | {
46 | "cell_type": "code",
47 | "collapsed": false,
48 | "input": [
49 | "# Get the data\n",
50 | "from sklearn.datasets import load_digits\n",
51 | "digits = load_digits()\n",
52 | "X = digits.data\n",
53 | "y = digits.target"
54 | ],
55 | "language": "python",
56 | "metadata": {},
57 | "outputs": []
58 | },
59 | {
60 | "cell_type": "code",
61 | "collapsed": false,
62 | "input": [
63 | "# Instantiate and train the classifier\n",
64 | "from sklearn.neighbors import KNeighborsClassifier\n",
65 | "clf = KNeighborsClassifier(n_neighbors=1)\n",
66 | "clf.fit(X, y)"
67 | ],
68 | "language": "python",
69 | "metadata": {},
70 | "outputs": []
71 | },
72 | {
73 | "cell_type": "code",
74 | "collapsed": false,
75 | "input": [
76 | "# Check the results using metrics\n",
77 | "from sklearn import metrics\n",
78 | "y_pred = clf.predict(X)"
79 | ],
80 | "language": "python",
81 | "metadata": {},
82 | "outputs": []
83 | },
84 | {
85 | "cell_type": "code",
86 | "collapsed": false,
87 | "input": [
88 | "print metrics.confusion_matrix(y_pred, y)"
89 | ],
90 | "language": "python",
91 | "metadata": {},
92 | "outputs": []
93 | },
94 | {
95 | "cell_type": "markdown",
96 | "metadata": {},
97 | "source": [
98 | "Apparently, we've found a perfect classifier! But this is misleading\n",
99 | "for the reasons we saw before: the classifier essentially \"memorizes\"\n",
100 | "all the samples it has already seen. To really test how well this\n",
101 | "algorithm does, we need to try some samples it *hasn't* yet seen.\n",
102 | "\n",
103 |       "This problem can also occur with regression models. In the following we fit another model that can memorize its training set, a \"decision tree\" regressor, to the Boston housing price dataset we introduced previously:"
104 | ]
105 | },
106 | {
107 | "cell_type": "code",
108 | "collapsed": false,
109 | "input": [
110 | "%matplotlib inline\n",
111 | "from matplotlib import pyplot as plt\n",
112 | "import numpy as np"
113 | ],
114 | "language": "python",
115 | "metadata": {},
116 | "outputs": []
117 | },
118 | {
119 | "cell_type": "code",
120 | "collapsed": false,
121 | "input": [
122 | "from sklearn.datasets import load_boston\n",
123 | "from sklearn.tree import DecisionTreeRegressor\n",
124 | "\n",
125 | "data = load_boston()\n",
126 | "clf = DecisionTreeRegressor().fit(data.data, data.target)\n",
127 | "predicted = clf.predict(data.data)\n",
128 | "expected = data.target\n",
129 | "\n",
130 | "plt.scatter(expected, predicted)\n",
131 | "plt.plot([0, 50], [0, 50], '--k')\n",
132 | "plt.axis('tight')\n",
133 | "plt.xlabel('True price ($1000s)')\n",
134 | "plt.ylabel('Predicted price ($1000s)')"
135 | ],
136 | "language": "python",
137 | "metadata": {},
138 | "outputs": []
139 | },
140 | {
141 | "cell_type": "markdown",
142 | "metadata": {},
143 | "source": [
144 | "Here again the predictions are seemingly perfect as the model was able to perfectly memorize the training set."
145 | ]
146 | },
147 | {
148 | "cell_type": "heading",
149 | "level": 2,
150 | "metadata": {},
151 | "source": [
152 | "A Better Approach: Using a validation set"
153 | ]
154 | },
155 | {
156 | "cell_type": "markdown",
157 | "metadata": {},
158 | "source": [
159 | "Learning the parameters of a prediction function and testing it on the\n",
160 | "same data is a methodological mistake: a model that would just repeat\n",
161 | "the labels of the samples that it has just seen would have a perfect\n",
162 | "score but would fail to predict anything useful on yet-unseen data.\n",
163 | "\n",
164 | "To avoid over-fitting, we have to define two different sets:\n",
165 | "\n",
166 | "- a training set X_train, y_train which is used for learning the parameters of a predictive model\n",
167 | "- a testing set X_test, y_test which is used for evaluating the fitted predictive model\n",
168 | "\n",
169 | "In scikit-learn such a random split can be quickly computed with the\n",
170 | "`train_test_split` helper function. It can be used this way:"
171 | ]
172 | },
173 | {
174 | "cell_type": "code",
175 | "collapsed": false,
176 | "input": [
177 | "from sklearn import cross_validation\n",
178 | "X = digits.data\n",
179 | "y = digits.target\n",
180 | "\n",
181 | "X_train, X_test, y_train, y_test = cross_validation.train_test_split(X, y, test_size=0.25, random_state=0)\n",
182 | "\n",
183 | "print X.shape, X_train.shape, X_test.shape"
184 | ],
185 | "language": "python",
186 | "metadata": {},
187 | "outputs": []
188 | },
189 | {
190 | "cell_type": "markdown",
191 | "metadata": {},
192 | "source": [
193 | "Now we train on the training data, and test on the testing data:"
194 | ]
195 | },
196 | {
197 | "cell_type": "code",
198 | "collapsed": false,
199 | "input": [
200 | "clf = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)\n",
201 | "y_pred = clf.predict(X_test)"
202 | ],
203 | "language": "python",
204 | "metadata": {},
205 | "outputs": []
206 | },
207 | {
208 | "cell_type": "code",
209 | "collapsed": false,
210 | "input": [
211 | "print metrics.confusion_matrix(y_test, y_pred)"
212 | ],
213 | "language": "python",
214 | "metadata": {},
215 | "outputs": []
216 | },
217 | {
218 | "cell_type": "code",
219 | "collapsed": false,
220 | "input": [
221 | "print metrics.classification_report(y_test, y_pred)"
222 | ],
223 | "language": "python",
224 | "metadata": {},
225 | "outputs": []
226 | },
227 | {
228 | "cell_type": "markdown",
229 | "metadata": {},
230 | "source": [
231 | "The averaged f1-score is often used as a convenient measure of the\n",
232 | "overall performance of an algorithm. It appears in the bottom row\n",
233 | "of the classification report; it can also be accessed directly:"
234 | ]
235 | },
236 | {
237 | "cell_type": "code",
238 | "collapsed": false,
239 | "input": [
240 | "metrics.f1_score(y_test, y_pred)"
241 | ],
242 | "language": "python",
243 | "metadata": {},
244 | "outputs": []
245 | },
246 | {
247 | "cell_type": "markdown",
248 | "metadata": {},
249 | "source": [
250 | "The over-fitting we saw previously can be quantified by computing the\n",
251 | "f1-score on the training data itself:"
252 | ]
253 | },
254 | {
255 | "cell_type": "code",
256 | "collapsed": false,
257 | "input": [
258 | "metrics.f1_score(y_train, clf.predict(X_train))"
259 | ],
260 | "language": "python",
261 | "metadata": {},
262 | "outputs": []
263 | },
264 | {
265 | "cell_type": "heading",
266 | "level": 3,
267 | "metadata": {},
268 | "source": [
269 | "Validation with a Regression Model"
270 | ]
271 | },
272 | {
273 | "cell_type": "markdown",
274 | "metadata": {},
275 | "source": [
276 | "These validation metrics also work in the case of regression models. Here we'll use\n",
277 |       "a Gradient-boosted regression tree, which is a meta-estimator that makes use of the\n",
278 | "``DecisionTreeRegressor`` we showed above. We'll start by doing the train-test split\n",
279 | "as we did with the classification case:"
280 | ]
281 | },
282 | {
283 | "cell_type": "code",
284 | "collapsed": false,
285 | "input": [
286 | "data = load_boston()\n",
287 | "X = data.data\n",
288 | "y = data.target\n",
289 | "X_train, X_test, y_train, y_test = cross_validation.train_test_split(X, y, test_size=0.25, random_state=0)\n",
290 | "\n",
291 | "print X.shape, X_train.shape, X_test.shape"
292 | ],
293 | "language": "python",
294 | "metadata": {},
295 | "outputs": []
296 | },
297 | {
298 | "cell_type": "markdown",
299 | "metadata": {},
300 | "source": [
301 |       "Next we'll compute the training and validation scores using the Decision Tree that\n",
302 | "we saw before:"
303 | ]
304 | },
305 | {
306 | "cell_type": "code",
307 | "collapsed": false,
308 | "input": [
309 | "est = DecisionTreeRegressor().fit(X_train, y_train)\n",
310 | "\n",
311 | "print \"validation:\", metrics.explained_variance_score(y_test, est.predict(X_test))\n",
312 | "print \"training:\", metrics.explained_variance_score(y_train, est.predict(X_train))"
313 | ],
314 | "language": "python",
315 | "metadata": {},
316 | "outputs": []
317 | },
318 | {
319 | "cell_type": "markdown",
320 | "metadata": {},
321 | "source": [
322 |       "This large spread between validation and training scores is characteristic\n",
323 | "of a **high variance** model. Decision trees are not entirely useless,\n",
324 | "however: by combining many individual decision trees within ensemble\n",
325 | "estimators such as Gradient Boosted Trees or Random Forests, we can get\n",
326 | "much better performance:"
327 | ]
328 | },
329 | {
330 | "cell_type": "code",
331 | "collapsed": false,
332 | "input": [
333 | "from sklearn.ensemble import GradientBoostingRegressor\n",
334 | "est = GradientBoostingRegressor().fit(X_train, y_train)\n",
335 | "\n",
336 | "print \"validation:\", metrics.explained_variance_score(y_test, est.predict(X_test))\n",
337 | "print \"training:\", metrics.explained_variance_score(y_train, est.predict(X_train))"
338 | ],
339 | "language": "python",
340 | "metadata": {},
341 | "outputs": []
342 | },
343 | {
344 | "cell_type": "markdown",
345 | "metadata": {},
346 | "source": [
347 | "This model is still over-fitting the data, but not by as much as the single tree."
348 | ]
349 | },
350 | {
351 | "cell_type": "heading",
352 | "level": 2,
353 | "metadata": {},
354 | "source": [
355 | "Exercise: Model Selection via Validation"
356 | ]
357 | },
358 | {
359 | "cell_type": "markdown",
360 | "metadata": {},
361 | "source": [
362 | "In the previous notebook, we saw Gaussian Naive Bayes classification of the digits.\n",
363 | "Here we saw K-neighbors classification of the digits. We've also seen support vector\n",
364 | "machine classification of digits. Now that we have these\n",
365 | "validation tools in place, we can ask quantitatively which of the three estimators\n",
366 | "works best for the digits dataset.\n",
367 | "\n",
368 | "Take a moment and determine the answers to these questions for the digits dataset:\n",
369 | "\n",
370 | "- With the default hyper-parameters for each estimator, which gives the best f1 score\n",
371 | " on the **validation set**? Recall that hyperparameters are the parameters set when\n",
372 | " you instantiate the classifier: for example, the ``n_neighbors`` in\n",
373 | "\n",
374 | " clf = KNeighborsClassifier(n_neighbors=1)\n",
375 | "\n",
376 | " To use the default value, simply leave them unspecified.\n",
377 | "- For each classifier, which value for the hyperparameters gives the best results for\n",
378 | " the digits data? For ``LinearSVC``, use ``loss='l2'`` and ``loss='l1'``. For\n",
379 | " ``KNeighborsClassifier`` use ``n_neighbors`` between 1 and 10. Note that ``GaussianNB``\n",
380 | " does not have any adjustable hyperparameters.\n",
381 | "- Bonus: do the same exercise on the Iris data rather than the Digits data. Does the\n",
382 | " same classifier/hyperparameter combination win out in this case?"
383 | ]
384 | },
385 | {
386 | "cell_type": "code",
387 | "collapsed": false,
388 | "input": [
389 | "from sklearn.svm import LinearSVC\n",
390 | "from sklearn.naive_bayes import GaussianNB\n",
391 | "from sklearn.neighbors import KNeighborsClassifier\n"
392 | ],
393 | "language": "python",
394 | "metadata": {},
395 | "outputs": []
396 | },
397 | {
398 | "cell_type": "code",
399 | "collapsed": false,
400 | "input": [],
401 | "language": "python",
402 | "metadata": {},
403 | "outputs": []
404 | },
405 | {
406 | "cell_type": "heading",
407 | "level": 3,
408 | "metadata": {},
409 | "source": [
410 | "Solution"
411 | ]
412 | },
413 | {
414 | "cell_type": "code",
415 | "collapsed": false,
416 | "input": [
417 | "%load solutions/04C_validation_exercise.py"
418 | ],
419 | "language": "python",
420 | "metadata": {},
421 | "outputs": []
422 | }
423 | ],
424 | "metadata": {}
425 | }
426 | ]
427 | }
--------------------------------------------------------------------------------
/notebooks/06.2_validation_exercise.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "metadata": {
3 | "name": "06B_validation_exercise"
4 | },
5 | "nbformat": 3,
6 | "nbformat_minor": 0,
7 | "worksheets": [
8 | {
9 | "cells": [
10 | {
11 | "cell_type": "heading",
12 | "level": 1,
13 | "metadata": {},
14 | "source": [
15 | "Exercise: Cross Validation and Model Selection"
16 | ]
17 | },
18 | {
19 | "cell_type": "code",
20 | "collapsed": false,
21 | "input": [
22 | "%pylab inline"
23 | ],
24 | "language": "python",
25 | "metadata": {},
26 | "outputs": []
27 | },
28 | {
29 | "cell_type": "markdown",
30 | "metadata": {},
31 | "source": [
32 | "This exercise covers cross-validation of regression models on the Diabetes\n",
33 | "dataset. The diabetes data consists of 10 physiological variables (age, sex, weight, blood pressure)\n",
34 |       "measured on 442 patients, and an indication of disease progression after one year:"
35 | ]
36 | },
37 | {
38 | "cell_type": "code",
39 | "collapsed": false,
40 | "input": [
41 | "from sklearn.datasets import load_diabetes\n",
42 | "data = load_diabetes()\n",
43 | "X, y = data.data, data.target"
44 | ],
45 | "language": "python",
46 | "metadata": {},
47 | "outputs": []
48 | },
49 | {
50 | "cell_type": "code",
51 | "collapsed": false,
52 | "input": [
53 | "print X.shape"
54 | ],
55 | "language": "python",
56 | "metadata": {},
57 | "outputs": []
58 | },
59 | {
60 | "cell_type": "code",
61 | "collapsed": false,
62 | "input": [
63 | "print y.shape"
64 | ],
65 | "language": "python",
66 | "metadata": {},
67 | "outputs": []
68 | },
69 | {
70 | "cell_type": "markdown",
71 | "metadata": {},
72 | "source": [
73 | "Here we'll be fitting two regularized linear models,\n",
74 |       "*Ridge Regression*, which uses $\\ell_2$ regularization,\n",
75 | "and *Lasso Regression*, which uses $\\ell_1$ regularization."
76 | ]
77 | },
78 | {
79 | "cell_type": "code",
80 | "collapsed": false,
81 | "input": [
82 | "from sklearn.linear_model import Ridge, Lasso"
83 | ],
84 | "language": "python",
85 | "metadata": {},
86 | "outputs": []
87 | },
88 | {
89 | "cell_type": "markdown",
90 | "metadata": {},
91 | "source": [
92 | "We'll first use the default hyper-parameters to see the baseline estimator. We'll\n",
93 | "use the cross-validation score to determine goodness-of-fit."
94 | ]
95 | },
96 | {
97 | "cell_type": "code",
98 | "collapsed": false,
99 | "input": [
100 | "from sklearn.cross_validation import cross_val_score\n",
101 | "\n",
102 | "for Model in [Ridge, Lasso]:\n",
103 | " model = Model()\n",
104 | " print Model.__name__, cross_val_score(model, X, y).mean()"
105 | ],
106 | "language": "python",
107 | "metadata": {},
108 | "outputs": []
109 | },
110 | {
111 | "cell_type": "markdown",
112 | "metadata": {},
113 | "source": [
114 | "We see that for the default hyper-parameter values, Lasso outperforms Ridge.\n",
115 | "But is this the case for the *optimal* hyperparameters of each model?"
116 | ]
117 | },
118 | {
119 | "cell_type": "heading",
120 | "level": 2,
121 | "metadata": {},
122 | "source": [
123 | "Exercise: Basic Hyperparameter Optimization"
124 | ]
125 | },
126 | {
127 | "cell_type": "markdown",
128 | "metadata": {},
129 | "source": [
130 |       "Here, spend some time writing a function which computes the cross-validation\n",
131 |       "score as a function of ``alpha``, the strength of the regularization for\n",
132 |       "``Lasso`` and ``Ridge``. We'll choose 30 logarithmically spaced values of ``alpha``\n",
133 |       "between 0.001 and 0.1, as defined below:"
134 | ]
135 | },
136 | {
137 | "cell_type": "code",
138 | "collapsed": false,
139 | "input": [
140 | "alphas = np.logspace(-3, -1, 30)\n",
141 | "\n",
142 | "# plot the mean cross-validation score for a Ridge estimator and a Lasso estimator\n",
143 | "# as a function of alpha. Which is more difficult to tune?"
144 | ],
145 | "language": "python",
146 | "metadata": {},
147 | "outputs": []
148 | },
149 | {
150 | "cell_type": "heading",
151 | "level": 3,
152 | "metadata": {},
153 | "source": [
154 | "Solution"
155 | ]
156 | },
157 | {
158 | "cell_type": "code",
159 | "collapsed": false,
160 | "input": [
161 | "%load solutions/06B_basic_grid_search.py"
162 | ],
163 | "language": "python",
164 | "metadata": {},
165 | "outputs": []
166 | },
167 | {
168 | "cell_type": "heading",
169 | "level": 2,
170 | "metadata": {},
171 | "source": [
172 | "Automatically Performing Grid Search"
173 | ]
174 | },
175 | {
176 | "cell_type": "markdown",
177 | "metadata": {},
178 | "source": [
179 | "Because searching a grid of hyperparameters is such a common task, scikit-learn provides\n",
180 |       "several hyper-parameter search tools to automate this. We'll explore this more in depth\n",
181 | "later in the tutorial, but for now it is interesting to see how ``GridSearchCV`` works:"
182 | ]
183 | },
184 | {
185 | "cell_type": "code",
186 | "collapsed": false,
187 | "input": [
188 | "from sklearn.grid_search import GridSearchCV"
189 | ],
190 | "language": "python",
191 | "metadata": {},
192 | "outputs": []
193 | },
194 | {
195 | "cell_type": "markdown",
196 | "metadata": {},
197 | "source": [
198 | "``GridSearchCV`` is constructed with an estimator, as well as a dictionary\n",
199 | "of parameter values to be searched. We can find the optimal parameters this\n",
200 | "way:"
201 | ]
202 | },
203 | {
204 | "cell_type": "code",
205 | "collapsed": false,
206 | "input": [
207 | "for Model in [Ridge, Lasso]:\n",
208 | " gscv = GridSearchCV(Model(), dict(alpha=alphas), cv=3).fit(X, y)\n",
209 | " print Model.__name__, gscv.best_params_"
210 | ],
211 | "language": "python",
212 | "metadata": {},
213 | "outputs": []
214 | },
215 | {
216 | "cell_type": "heading",
217 | "level": 2,
218 | "metadata": {},
219 | "source": [
220 | "Built-in Hyperparameter Search"
221 | ]
222 | },
223 | {
224 | "cell_type": "markdown",
225 | "metadata": {},
226 | "source": [
227 | "For some models within scikit-learn, cross-validation can be performed more efficiently\n",
228 | "on large datasets. In this case, a cross-validated version of the particular model is\n",
229 | "included. The cross-validated versions of ``Ridge`` and ``Lasso`` are ``RidgeCV`` and\n",
230 | "``LassoCV``, respectively. The grid search on these estimators can be performed as\n",
231 | "follows:"
232 | ]
233 | },
234 | {
235 | "cell_type": "code",
236 | "collapsed": false,
237 | "input": [
238 | "from sklearn.linear_model import RidgeCV, LassoCV\n",
239 | "for Model in [RidgeCV, LassoCV]:\n",
240 | " model = Model(alphas=alphas, cv=3).fit(X, y)\n",
241 | " print Model.__name__, model.alpha_"
242 | ],
243 | "language": "python",
244 | "metadata": {},
245 | "outputs": []
246 | },
247 | {
248 | "cell_type": "markdown",
249 | "metadata": {},
250 | "source": [
251 | "We see that the results match those returned by ``GridSearchCV``."
252 | ]
253 | },
254 | {
255 | "cell_type": "heading",
256 | "level": 2,
257 | "metadata": {},
258 | "source": [
259 | "Exercise: Learning Curves"
260 | ]
261 | },
262 | {
263 | "cell_type": "markdown",
264 | "metadata": {},
265 | "source": [
266 |       "Here we'll apply our learning curves to the diabetes data. The questions to answer are these:\n",
267 | "\n",
268 | "- Given the optimal models above, which is over-fitting and which is under-fitting the data?\n",
269 | "- To obtain better results, would you invest time and effort in gathering\n",
270 | " more **training samples**, or gathering more **attributes** for each sample?\n",
271 | " Recall the previous discussion of reading learning curves.\n",
272 | "\n",
273 | "You can follow the process used in the previous notebook to plot the learning curves.\n",
274 | "A good metric to use is the ``mean_squared_error``, which we'll import below:"
275 | ]
276 | },
277 | {
278 | "cell_type": "code",
279 | "collapsed": false,
280 | "input": [
281 | "from sklearn.metrics import mean_squared_error\n",
282 | "# define a function that computes the learning curve (i.e. mean_squared_error as a function\n",
283 | "# of training set size, for both training and test sets) and plot the result\n"
284 | ],
285 | "language": "python",
286 | "metadata": {},
287 | "outputs": []
288 | },
289 | {
290 | "cell_type": "heading",
291 | "level": 3,
292 | "metadata": {},
293 | "source": [
294 | "Solution"
295 | ]
296 | },
297 | {
298 | "cell_type": "code",
299 | "collapsed": false,
300 | "input": [
301 | "%load solutions/06B_learning_curves.py"
302 | ],
303 | "language": "python",
304 | "metadata": {},
305 | "outputs": []
306 | }
307 | ],
308 | "metadata": {}
309 | }
310 | ]
311 | }
--------------------------------------------------------------------------------
/notebooks/07.1_in_depth_with_SVMs.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "metadata": {
3 | "name": "07A_in_depth_with_SVMs"
4 | },
5 | "nbformat": 3,
6 | "nbformat_minor": 0,
7 | "worksheets": [
8 | {
9 | "cells": [
10 | {
11 | "cell_type": "code",
12 | "collapsed": false,
13 | "input": [
14 | "%pylab inline\n",
15 | "import numpy as np\n",
16 | "import pylab as pl"
17 | ],
18 | "language": "python",
19 | "metadata": {},
20 | "outputs": []
21 | },
22 | {
23 | "cell_type": "heading",
24 | "level": 1,
25 | "metadata": {},
26 | "source": [
27 | "In depth with SVMs: Support Vector Machines"
28 | ]
29 | },
30 | {
31 | "cell_type": "markdown",
32 | "metadata": {},
33 | "source": [
34 |       "SVM stands for \"support vector machine\". SVMs are efficient, easy-to-use estimators.\n",
35 | "They come in two kinds: SVCs, Support Vector Classifiers, for classification problems, and SVRs, Support Vector Regressors, for regression problems."
36 | ]
37 | },
38 | {
39 | "cell_type": "code",
40 | "collapsed": false,
41 | "input": [
42 | "from sklearn import svm"
43 | ],
44 | "language": "python",
45 | "metadata": {},
46 | "outputs": []
47 | },
48 | {
49 | "cell_type": "heading",
50 | "level": 2,
51 | "metadata": {},
52 | "source": [
53 | "Linear SVMs: some intuitions"
54 | ]
55 | },
56 | {
57 | "cell_type": "markdown",
58 | "metadata": {},
59 | "source": [
60 | "To develop our intuitions, let us look at a very simple classification problem: classifying irises based on sepal length and width. We only use 2 features to enable easy visualization."
61 | ]
62 | },
63 | {
64 | "cell_type": "code",
65 | "collapsed": false,
66 | "input": [
67 | "svc = svm.SVC(kernel='linear')\n",
68 | "from sklearn import datasets\n",
69 | "iris = datasets.load_iris()\n",
70 | "X = iris.data[:, :2]\n",
71 | "y = iris.target\n",
72 | "svc.fit(X, y)"
73 | ],
74 | "language": "python",
75 | "metadata": {},
76 | "outputs": []
77 | },
78 | {
79 | "cell_type": "markdown",
80 | "metadata": {},
81 | "source": [
82 | "To visualize the prediction, we evaluate it on a grid of points:"
83 | ]
84 | },
85 | {
86 | "cell_type": "code",
87 | "collapsed": false,
88 | "input": [
89 | "from matplotlib.colors import ListedColormap\n",
90 | "# Create color maps for 3-class classification problem, as with iris\n",
91 | "cmap_light = ListedColormap(['#FFAAAA', '#AAFFAA', '#AAAAFF'])\n",
92 | "cmap_bold = ListedColormap(['#FF0000', '#00FF00', '#0000FF'])\n",
93 | "\n",
94 | "def plot_estimator(estimator, X, y):\n",
95 | " estimator.fit(X, y)\n",
96 | " x_min, x_max = X[:, 0].min() - .1, X[:, 0].max() + .1\n",
97 | " y_min, y_max = X[:, 1].min() - .1, X[:, 1].max() + .1\n",
98 | " xx, yy = np.meshgrid(np.linspace(x_min, x_max, 100),\n",
99 | " np.linspace(y_min, y_max, 100))\n",
100 | " Z = estimator.predict(np.c_[xx.ravel(), yy.ravel()])\n",
101 | "\n",
102 | " # Put the result into a color plot\n",
103 | " Z = Z.reshape(xx.shape)\n",
104 | " pl.figure()\n",
105 | " pl.pcolormesh(xx, yy, Z, cmap=cmap_light)\n",
106 | "\n",
107 | " # Plot also the training points\n",
108 | " pl.scatter(X[:, 0], X[:, 1], c=y, cmap=cmap_bold)\n",
109 | " pl.axis('tight')\n",
110 | " pl.axis('off')\n",
111 | " pl.tight_layout()"
112 | ],
113 | "language": "python",
114 | "metadata": {},
115 | "outputs": []
116 | },
117 | {
118 | "cell_type": "code",
119 | "collapsed": false,
120 | "input": [
121 | "plot_estimator(svc, X, y)"
122 | ],
123 | "language": "python",
124 | "metadata": {},
125 | "outputs": []
126 | },
127 | {
128 | "cell_type": "markdown",
129 | "metadata": {},
130 | "source": [
131 | "As we can see, `kernel=\"linear\"` gives linear decision frontiers: the frontier between two classes is a line.\n",
132 | "\n",
133 | "How does multi-class work? With the `SVC` object, it is done by combining \"one versus one\" decisions on each pair of classes.\n",
134 | "\n",
135 | "**LinearSVC**: for linear kernels, there is another object, the `LinearSVC` that uses a different algorithm. On some data it may be faster (for instance sparse data, as in text mining). It uses a \"one versus all\" strategy for multi-class problems."
136 | ]
137 | },
138 | {
139 | "cell_type": "code",
140 | "collapsed": false,
141 | "input": [
142 | "plot_estimator(svm.LinearSVC(), X, y)"
143 | ],
144 | "language": "python",
145 | "metadata": {},
146 | "outputs": []
147 | },
148 | {
149 | "cell_type": "markdown",
150 | "metadata": {},
151 | "source": [
152 | "SVRs (Support Vector Regression) work like SVCs, but for regression rather than classification."
153 | ]
154 | },
155 | {
156 | "cell_type": "heading",
157 | "level": 2,
158 | "metadata": {},
159 | "source": [
160 | "Support vectors and regularisation"
161 | ]
162 | },
163 | {
164 | "cell_type": "markdown",
165 | "metadata": {},
166 | "source": [
167 |       "**Support vectors**: The way a support vector machine works is by finding a decision boundary separating the 2 classes that is spanned by a small number of training samples, called \"support vectors\". These samples lie closest to the other class, and can thus be considered the most representative samples in terms of the two-class discrimination problem.\n",
168 | "\n",
169 | "To make visualization even simpler, let us consider a 2 class problem, for instance using classes 1 and 2 in the iris dataset. These 2 classes are not well linearly separable, which makes it an interesting problem.\n",
170 | "\n",
171 |       "The support vectors for each class can be found in the `support_vectors_` attribute (their indices within the training set are in `support_`). We highlight them in the following figure."
172 | ]
173 | },
174 | {
175 | "cell_type": "code",
176 | "collapsed": false,
177 | "input": [
178 | "X, y = X[np.in1d(y, [1, 2])], y[np.in1d(y, [1, 2])]\n",
179 | "plot_estimator(svc, X, y)\n",
180 | "pl.scatter(svc.support_vectors_[:, 0], svc.support_vectors_[:, 1], s=80, facecolors='none', zorder=10)"
181 | ],
182 | "language": "python",
183 | "metadata": {},
184 | "outputs": []
185 | },
186 | {
187 | "cell_type": "markdown",
188 | "metadata": {},
189 | "source": [
190 | "**Regularization**: Considering only the discriminant samples is a form of regularization. Indeed, it forces the model to be simpler in how it combines observed structures.\n",
191 | "\n",
192 | "This regularization can be tuned with the *C* parameter:\n",
193 | "\n",
194 |       "- Low C values: many support vectors... the decision frontier approaches mean(class A) - mean(class B)\n",
195 |       "- High C values: a small number of support vectors: the decision frontier is fully driven by the most discriminant samples"
196 | ]
197 | },
198 | {
199 | "cell_type": "code",
200 | "collapsed": false,
201 | "input": [
202 | "svc = svm.SVC(kernel='linear', C=1e3)\n",
203 | "plot_estimator(svc, X, y)\n",
204 | "pl.scatter(svc.support_vectors_[:, 0], svc.support_vectors_[:, 1], s=80, facecolors='none', zorder=10)\n",
205 | "pl.title('High C values: small number of support vectors')\n",
206 | "\n",
207 | "svc = svm.SVC(kernel='linear', C=1e-3)\n",
208 | "plot_estimator(svc, X, y)\n",
209 | "pl.scatter(svc.support_vectors_[:, 0], svc.support_vectors_[:, 1], s=80, facecolors='none', zorder=10)\n",
210 | "pl.title('Low C values: high number of support vectors')"
211 | ],
212 | "language": "python",
213 | "metadata": {},
214 | "outputs": []
215 | },
216 | {
217 | "cell_type": "markdown",
218 | "metadata": {},
219 | "source": [
220 |       "One nice feature of SVMs is that, on many datasets, the default value `C=1` works well.\n",
221 |       "\n",
222 |       "**Practical note: Normalizing data** For many estimators, including SVMs, rescaling the dataset so that each feature has unit standard deviation is often important to get good predictions.\n",
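223 |       "\n",
224 |       "A minimal sketch of what this could look like (our own illustration, re-using the `X`, `y` and `svc` defined above):\n",
225 |       "\n",
226 |       "    from sklearn.preprocessing import StandardScaler\n",
227 |       "    X_scaled = StandardScaler().fit_transform(X)   # zero mean, unit standard deviation per feature\n",
228 |       "    svc.fit(X_scaled, y)"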
223 | ]
224 | },
225 | {
226 | "cell_type": "heading",
227 | "level": 2,
228 | "metadata": {},
229 | "source": [
230 | "Kernels"
231 | ]
232 | },
233 | {
234 | "cell_type": "markdown",
235 | "metadata": {},
236 | "source": [
237 |       "One appealing aspect of SVMs is that they can easily be used to build non-linear decision frontiers using **kernels**. Kernels define the 'building blocks' that are assembled to form a decision rule.\n",
238 | "\n",
239 | "- **linear** will give linear decision frontiers. It is the most computationally efficient approach and the one that requires the least amount of data.\n",
240 | "\n",
241 |       "- **poly** will give decision frontiers that are polynomial. The order of this polynomial is given by the 'degree' argument.\n",
242 | "\n",
243 |       "- **rbf** uses 'radial basis functions' centered at each support vector to assemble a decision frontier. The size of the RBFs, controlled by the 'gamma' argument, ultimately determines the smoothness of the decision frontier. RBFs are the most flexible approach, but also the one that will require the largest amount of data."
244 | ]
245 | },
246 | {
247 | "cell_type": "code",
248 | "collapsed": false,
249 | "input": [
250 | "svc = svm.SVC(kernel='linear')\n",
251 | "plot_estimator(svc, X, y)\n",
252 | "pl.scatter(svc.support_vectors_[:, 0], svc.support_vectors_[:, 1], s=80, facecolors='none', zorder=10)\n",
253 | "pl.title('Linear kernel')\n",
254 | "\n",
255 | "svc = svm.SVC(kernel='poly', degree=4)\n",
256 | "plot_estimator(svc, X, y)\n",
257 | "pl.scatter(svc.support_vectors_[:, 0], svc.support_vectors_[:, 1], s=80, facecolors='none', zorder=10)\n",
258 | "pl.title('Polynomial kernel')\n",
259 | "\n",
260 | "svc = svm.SVC(kernel='rbf', gamma=1e2)\n",
261 | "plot_estimator(svc, X, y)\n",
262 | "pl.scatter(svc.support_vectors_[:, 0], svc.support_vectors_[:, 1], s=80, facecolors='none', zorder=10)\n",
263 | "pl.title('RBF kernel')"
264 | ],
265 | "language": "python",
266 | "metadata": {},
267 | "outputs": []
268 | },
269 | {
270 | "cell_type": "markdown",
271 | "metadata": {},
272 | "source": [
273 |       "We can see that RBFs are more flexible and fit our training data best. Remember, minimizing training error is not a goal per se, and we have to watch out for over-fitting."
274 | ]
275 | },
276 | {
277 | "cell_type": "heading",
278 | "level": 2,
279 | "metadata": {},
280 | "source": [
281 | "Exercise: tune an SVM on the digits dataset"
282 | ]
283 | },
284 | {
285 | "cell_type": "code",
286 | "collapsed": false,
287 | "input": [
288 | "from sklearn import datasets\n",
289 | "digits = datasets.load_digits()\n",
290 | "X, y = digits.data, digits.target\n",
291 |       "# ... your turn: tune an SVM on this data and evaluate its predictions"
292 | ],
293 | "language": "python",
294 | "metadata": {},
295 | "outputs": []
296 | },
297 | {
298 | "cell_type": "code",
299 | "collapsed": false,
300 | "input": [],
301 | "language": "python",
302 | "metadata": {},
303 | "outputs": []
304 | }
305 | ],
306 | "metadata": {}
307 | }
308 | ]
309 | }
--------------------------------------------------------------------------------
/notebooks/07.2_in_depth_trees_and_forests.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "metadata": {
3 | "name": "07B_in_depth_trees_and_forests"
4 | },
5 | "nbformat": 3,
6 | "nbformat_minor": 0,
7 | "worksheets": [
8 | {
9 | "cells": [
10 | {
11 | "cell_type": "heading",
12 | "level": 1,
13 | "metadata": {},
14 | "source": [
15 | "Estimators In Depth: Trees and Forests"
16 | ]
17 | },
18 | {
19 | "cell_type": "code",
20 | "collapsed": false,
21 | "input": [
22 | "%pylab inline"
23 | ],
24 | "language": "python",
25 | "metadata": {},
26 | "outputs": []
27 | },
28 | {
29 | "cell_type": "markdown",
30 | "metadata": {},
31 | "source": [
32 | "Here we'll explore a class of algorithms based on Decision trees.\n",
33 | "Decision trees at their root (Ha!) are extremely intuitive. They\n",
34 | "encode a series of binary choices in a process that parallels how\n",
35 | "a person might classify things themselves, but using an information criterion\n",
36 | "to decide which question is most fruitful at each step. For example, if\n",
37 | "you wanted to create a guide to identifying an animal found in nature, you\n",
38 | "might ask the following series of questions:\n",
39 | "\n",
40 | "- Is the animal bigger or smaller than a meter long?\n",
41 | " + *bigger*: does the animal have horns?\n",
42 | " - *yes*: are the horns longer than ten centimeters?\n",
43 |       "        - *no*: is the animal wearing a collar?\n",
44 | " + *smaller*: does the animal have two or four legs?\n",
45 | " - *two*: does the animal have wings?\n",
46 | " - *four*: does the animal have a bushy tail?\n",
47 | "\n",
48 | "and so on. This binary splitting of questions is the essence of a decision tree."
49 | ]
50 | },
51 | {
52 | "cell_type": "markdown",
53 | "metadata": {},
54 | "source": [
55 | "Decision trees, and the related boosted trees and random forest estimators,\n",
56 | "are powerful non-parametric approaches that are broadly useful. Non-parametric\n",
57 | "approaches are useful when the underlying structure or model of the data is\n",
58 | "unknown.\n",
59 | "\n",
60 | "For example, imagine you wanted to derive a model for some periodic function\n",
61 | "having to do with tides. We'll mimic this data as a sum of two sine\n",
62 | "waves:"
63 | ]
64 | },
65 | {
66 | "cell_type": "code",
67 | "collapsed": false,
68 | "input": [
69 | "rng = np.random.RandomState(0)\n",
70 | "N = 100\n",
71 | "X = np.linspace(0, 6, N)[:, np.newaxis]\n",
72 | "error = 0.4\n",
73 | "y_true = np.sin(X).ravel() + np.sin(6 * X).ravel()\n",
74 | "y_noisy = y_true + rng.normal(0, error, X.shape[0])"
75 | ],
76 | "language": "python",
77 | "metadata": {},
78 | "outputs": []
79 | },
80 | {
81 | "cell_type": "code",
82 | "collapsed": false,
83 | "input": [
84 | "plt.plot(X.ravel(), y_true, color='gray')\n",
85 | "plt.plot(X.ravel(), y_noisy, '.k')"
86 | ],
87 | "language": "python",
88 | "metadata": {},
89 | "outputs": []
90 | },
91 | {
92 | "cell_type": "markdown",
93 | "metadata": {},
94 | "source": [
95 | "This data looks relatively complicated, but if we had the intuition to know that it is\n",
96 | "simply the combination of a small number of sine waves, we could use this sparse\n",
97 | "representation in Fourier space and use a fast linear estimator.\n",
98 | "\n",
99 | "Taking the FFT of the data gives us two peaks in frequency, indicating that the representation\n",
100 | "is sparse in this basis:"
101 | ]
102 | },
103 | {
104 | "cell_type": "code",
105 | "collapsed": false,
106 | "input": [
107 | "from scipy import fftpack\n",
108 | "plt.plot(fftpack.fftfreq(len(y_noisy))[:N/2], abs(fftpack.fft(y_noisy))[:N/2])\n",
109 | "plt.xlim(0, None)\n",
110 | "plt.xlabel('frequency')\n",
111 | "plt.ylabel('Fourier power')"
112 | ],
113 | "language": "python",
114 | "metadata": {},
115 | "outputs": []
116 | },
117 | {
118 | "cell_type": "markdown",
119 | "metadata": {},
120 | "source": [
121 | "This shows how important **intuition**, especially physical intuition, can be when\n",
122 | "performing a learning problem on real-world data.\n",
123 | "If you blindly throw algorithms at a data set, you\n",
124 | "might be missing a simple intuition which might lead to a sparse representation that\n",
125 | "is much more meaningful.\n",
126 | "\n",
127 | "But suppose we don't have any intuition that would lead to representing the data\n",
128 | "in a sparse basis. In this case we can benefit from using a non-parametric estimator\n",
129 | "to fit our task. One well-known and powerful non-parametric estimator is the\n",
130 | "Decision Tree."
131 | ]
132 | },
133 | {
134 | "cell_type": "heading",
135 | "level": 2,
136 | "metadata": {},
137 | "source": [
138 | "Decision Tree Regression"
139 | ]
140 | },
141 | {
142 | "cell_type": "markdown",
143 | "metadata": {},
144 | "source": [
145 |       "A decision tree is a simple tree of binary decisions which, in spirit, is\n",
146 |       "similar to nearest neighbor prediction. It can be used as follows:"
147 | ]
148 | },
149 | {
150 | "cell_type": "code",
151 | "collapsed": false,
152 | "input": [
153 | "i = np.random.permutation(X.shape[0])\n",
154 | "X = X[i]\n",
155 | "y_noisy = y_noisy[i]"
156 | ],
157 | "language": "python",
158 | "metadata": {},
159 | "outputs": []
160 | },
161 | {
162 | "cell_type": "code",
163 | "collapsed": false,
164 | "input": [
165 | "from sklearn.tree import DecisionTreeRegressor\n",
166 | "clf = DecisionTreeRegressor(max_depth=5)\n",
167 | "clf.fit(X, y_noisy)\n",
168 | "\n",
169 | "X_fit = np.linspace(0, 6, 1000).reshape((-1, 1))\n",
170 | "y_fit_1 = clf.predict(X_fit)\n",
171 | "\n",
172 | "plt.plot(X_fit.ravel(), y_fit_1, color='blue')\n",
173 | "plt.plot(X.ravel(), y_noisy, '.k')"
174 | ],
175 | "language": "python",
176 | "metadata": {},
177 | "outputs": []
178 | },
179 | {
180 | "cell_type": "markdown",
181 | "metadata": {},
182 | "source": [
183 | "A single decision tree allows us to estimate the signal in a non-parametric way,\n",
184 | "but clearly has some issues. In some regions, the model shows high bias and\n",
185 | "under-fits the data\n",
186 | "(seen in the long flat lines which don't follow the contours of the data),\n",
187 | "while in other regions the model shows high variance and over-fits the data\n",
188 | "(reflected in the narrow spikes which are influenced by noise in single points).\n",
189 | "\n",
190 | "One way to address this is to combine many trees together, so that the\n",
191 | "effects of their over-fitting go away on average. This approach is known as\n",
192 |       "*Random Forests*."
193 | ]
194 | },
195 | {
196 | "cell_type": "heading",
197 | "level": 2,
198 | "metadata": {},
199 | "source": [
200 | "Random Forests"
201 | ]
202 | },
203 | {
204 | "cell_type": "markdown",
205 | "metadata": {},
206 | "source": [
207 | "Here we will use a random forest of 200 trees to reduce the tendency of each\n",
208 |       "tree to over-fit the data."
209 | ]
210 | },
211 | {
212 | "cell_type": "code",
213 | "collapsed": false,
214 | "input": [
215 | "from sklearn.ensemble import RandomForestRegressor\n",
216 | "clf = RandomForestRegressor(n_estimators=200, max_depth=5)\n",
217 | "clf.fit(X, y_noisy)\n",
218 | "\n",
219 | "y_fit_200 = clf.predict(X_fit)\n",
220 | "\n",
221 | "plt.plot(X_fit.ravel(), y_fit_200, color='blue')\n",
222 | "plt.plot(X.ravel(), y_noisy, '.k')"
223 | ],
224 | "language": "python",
225 | "metadata": {},
226 | "outputs": []
227 | },
228 | {
229 | "cell_type": "heading",
230 | "level": 2,
231 | "metadata": {},
232 | "source": [
233 | "Selecting the Optimal Estimator via Cross-Validation"
234 | ]
235 | },
236 | {
237 | "cell_type": "code",
238 | "collapsed": false,
239 | "input": [
240 | "from sklearn import grid_search\n",
241 | "\n",
242 | "rf = RandomForestRegressor()\n",
243 | "parameters = {'n_estimators':[200, 300, 400],\n",
244 | " 'max_depth':[5, 7, 9]}\n",
245 | "\n",
246 | "# Warning: be sure your data is shuffled before using GridSearch!\n",
247 | "clf_grid = grid_search.GridSearchCV(rf, parameters)\n",
248 | "clf_grid.fit(X, y_noisy)"
249 | ],
250 | "language": "python",
251 | "metadata": {},
252 | "outputs": []
253 | },
254 | {
255 | "cell_type": "code",
256 | "collapsed": false,
257 | "input": [
258 | "rf_best = clf_grid.best_estimator_\n",
259 | "X_fit = np.linspace(0, 6, 1000).reshape((-1, 1))\n",
260 | "y_fit_best = rf_best.predict(X_fit)\n",
261 | "\n",
262 | "print rf_best.n_estimators, rf_best.max_depth\n",
263 | "\n",
264 | "plt.plot(X_fit.ravel(), y_fit_best, color='blue')\n",
265 | "plt.plot(X.ravel(), y_noisy, '.k')"
266 | ],
267 | "language": "python",
268 | "metadata": {},
269 | "outputs": []
270 | },
271 | {
272 | "cell_type": "heading",
273 | "level": 2,
274 | "metadata": {},
275 | "source": [
276 | "Another option: Gradient Boosting"
277 | ]
278 | },
279 | {
280 | "cell_type": "markdown",
281 | "metadata": {},
282 | "source": [
283 |       "Another ensemble method that can be useful is *Boosting*: here, rather than\n",
284 |       "looking at 200 (say) parallel estimators, we construct a chain of 200 estimators,\n",
285 |       "each of which iteratively refines the results of the previous one.\n",
286 | "The idea is that by sequentially applying very fast, simple models, we can get a\n",
287 | "total model error which is better than any of the individual pieces."
288 | ]
289 | },
290 | {
291 | "cell_type": "code",
292 | "collapsed": false,
293 | "input": [
294 | "from sklearn.ensemble import GradientBoostingRegressor\n",
295 | "clf = GradientBoostingRegressor(n_estimators=200, max_depth=2)\n",
296 | "clf.fit(X, y_noisy)\n",
297 | "\n",
298 | "y_fit_200 = clf.predict(X_fit)\n",
299 | "\n",
300 | "plt.plot(X_fit.ravel(), y_fit_200, color='blue')\n",
301 | "plt.plot(X.ravel(), y_noisy, '.k')"
302 | ],
303 | "language": "python",
304 | "metadata": {},
305 | "outputs": []
306 | },
307 | {
308 | "cell_type": "heading",
309 | "level": 2,
310 | "metadata": {},
311 | "source": [
312 | "Exercise: Cross-validating Gradient Boosting"
313 | ]
314 | },
315 | {
316 | "cell_type": "markdown",
317 | "metadata": {},
318 | "source": [
319 | "Use a grid search to optimize the number of estimators and max_depth for a Gradient Boosted\n",
320 | "Decision tree."
321 | ]
322 | },
323 | {
324 | "cell_type": "code",
325 | "collapsed": false,
326 | "input": [],
327 | "language": "python",
328 | "metadata": {},
329 | "outputs": []
330 | },
331 | {
332 | "cell_type": "markdown",
333 | "metadata": {},
334 | "source": [
335 | "Plug this optimal ``max_depth`` into a *single* decision tree. Does this single tree over-fit or under-fit the data?"
336 | ]
337 | },
338 | {
339 | "cell_type": "code",
340 | "collapsed": false,
341 | "input": [],
342 | "language": "python",
343 | "metadata": {},
344 | "outputs": []
345 | },
346 | {
347 | "cell_type": "markdown",
348 | "metadata": {},
349 | "source": [
350 | "Repeat this for the Random Forest. Construct a single decision tree using the ``max_depth``\n",
351 | "which is optimal for the Random Forest. Does this single tree over-fit or under-fit the data?"
352 | ]
353 | },
354 | {
355 | "cell_type": "code",
356 | "collapsed": false,
357 | "input": [],
358 | "language": "python",
359 | "metadata": {},
360 | "outputs": []
361 | }
362 | ],
363 | "metadata": {}
364 | }
365 | ]
366 | }
--------------------------------------------------------------------------------
/notebooks/07.3_In_depth_with_linear_models.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "metadata": {
3 | "name": ""
4 | },
5 | "nbformat": 3,
6 | "nbformat_minor": 0,
7 | "worksheets": [
8 | {
9 | "cells": [
10 | {
11 | "cell_type": "code",
12 | "collapsed": false,
13 | "input": [
14 | "%pylab inline\n",
15 | "import numpy as np\n",
16 | "import pylab as pl"
17 | ],
18 | "language": "python",
19 | "metadata": {},
20 | "outputs": []
21 | },
22 | {
23 | "cell_type": "heading",
24 | "level": 1,
25 | "metadata": {},
26 | "source": [
27 | "In depth with linear models"
28 | ]
29 | },
30 | {
31 | "cell_type": "markdown",
32 | "metadata": {},
33 | "source": [
34 |       "Linear models are useful when little data is available. In addition, they form a good case study of regularization.\n",
35 | "\n",
36 | "The underlying assumption of a linear model is that the output, y, is a linear combination of the features:\n",
37 | "\n",
38 | "y = w_1 x_1 + w_2 x_2 ..."
39 | ]
40 | },
41 | {
42 | "cell_type": "heading",
43 | "level": 2,
44 | "metadata": {},
45 | "source": [
46 | "Linear models for regression"
47 | ]
48 | },
49 | {
50 | "cell_type": "markdown",
51 | "metadata": {},
52 | "source": [
53 |       "The most standard linear model is 'ordinary least squares regression', often simply called 'linear regression'. When the number of features is large relative to the number of samples, it becomes ill-posed and overfits.\n",
54 | "\n",
55 | "Let us generate a simple simulation, to see the behavior of these models."
56 | ]
57 | },
58 | {
59 | "cell_type": "code",
60 | "collapsed": false,
61 | "input": [
62 | "rng = np.random.RandomState(0)\n",
63 | "X = rng.normal(size=(1000, 50))\n",
64 | "beta = rng.normal(size=50)\n",
65 | "y = np.dot(X, beta) + 4*rng.normal(size=1000)\n",
66 | "\n",
67 | "from sklearn import linear_model, cross_validation\n",
68 | "\n",
69 | "def plot_learning_curve(estimator):\n",
70 | " scores = list()\n",
71 | " train_sizes = np.linspace(50, 250, 20).astype(np.int)\n",
72 | " for train_size in train_sizes:\n",
73 | " test_error = cross_validation.cross_val_score(estimator, X, y,\n",
74 | " cv=cross_validation.ShuffleSplit(train_size=train_size, test_size=500, n=1000,\n",
75 | " random_state=0)\n",
76 | " )\n",
77 | " scores.append(test_error)\n",
78 | " pl.plot(train_sizes, np.mean(scores, axis=1), label=estimator.__class__.__name__)\n",
79 | " pl.ylim(0, 1)\n",
80 | " pl.ylabel('Explained variance on test set')\n",
81 | " pl.xlabel('Training set size')\n",
82 | " pl.legend(loc='best')\n",
83 | "\n",
84 | "\n",
85 | "plot_learning_curve(linear_model.LinearRegression())"
86 | ],
87 | "language": "python",
88 | "metadata": {},
89 | "outputs": []
90 | },
91 | {
92 | "cell_type": "markdown",
93 | "metadata": {},
94 | "source": [
95 |       "As we can see, ordinary linear regression is not well defined if there are fewer training samples than features. In the presence of noise, it does poorly as long as the number of samples is not several times the number of features.\n",
96 | "\n",
97 | "The LinearRegression is then overfitting: fitting noise. We need to regularize.\n",
98 | "\n",
99 |       "**The Ridge estimator** is a simple regularization of the ordinary LinearRegression. In particular, it has the benefit of not being computationally more expensive than the ordinary least squares estimate."
100 | ]
101 | },
102 | {
103 | "cell_type": "code",
104 | "collapsed": false,
105 | "input": [
106 | "plot_learning_curve(linear_model.LinearRegression())\n",
107 | "plot_learning_curve(linear_model.Ridge())"
108 | ],
109 | "language": "python",
110 | "metadata": {},
111 | "outputs": []
112 | },
113 | {
114 | "cell_type": "markdown",
115 | "metadata": {},
116 | "source": [
117 | "We can see that in the low-sample limit, the Ridge estimator performs much better than the unregularized model.\n",
118 | "\n",
119 |       "The regularization of the Ridge is a shrinkage: the coefficients learned are biased towards zero. Too much bias is not beneficial, but with very few samples, we need more bias.\n",
120 | "\n",
121 | "The amount of regularization is set via the `alpha` parameter of the Ridge. Tuning it is critical for performance. We can set it automatically by cross-validation using the RidgeCV estimator:"
122 | ]
123 | },
124 | {
125 | "cell_type": "code",
126 | "collapsed": false,
127 | "input": [
128 | "plot_learning_curve(linear_model.LinearRegression())\n",
129 | "plot_learning_curve(linear_model.Ridge())\n",
130 | "plot_learning_curve(linear_model.RidgeCV())"
131 | ],
132 | "language": "python",
133 | "metadata": {},
134 | "outputs": []
135 | },
136 | {
137 | "cell_type": "markdown",
138 | "metadata": {},
139 | "source": [
140 |       "**The Lasso estimator** is useful to impose sparsity on the coefficients. In other words, it is to be preferred if we believe that many of the features are not relevant.\n",
141 | "\n",
142 | "Let us create such a situation with a new simulation where only 10 out of the 50 features are relevant:"
143 | ]
144 | },
145 | {
146 | "cell_type": "code",
147 | "collapsed": false,
148 | "input": [
149 | "beta[10:] = 0\n",
150 | "y = np.dot(X, beta) + 4*rng.normal(size=1000)\n",
151 | "plot_learning_curve(linear_model.RidgeCV())\n",
152 | "plot_learning_curve(linear_model.Lasso())"
153 | ],
154 | "language": "python",
155 | "metadata": {},
156 | "outputs": []
157 | },
158 | {
159 | "cell_type": "markdown",
160 | "metadata": {},
161 | "source": [
162 |       "We can see that the Lasso estimator performs, in our case, better than the Ridge when there is a small number of training observations. However, when there are a lot of observations, the Lasso under-performs. Indeed, the variance-reducing effect of the regularization is less critical in these situations, and the bias becomes too detrimental.\n",
163 | "\n",
164 | "As with any estimator, we should tune the regularization parameter to get best prediction. For this purpose, we can use the LassoCV object. Note that it is a significantly more computationally costly operation than the RidgeCV. To speed it up, we reduce the number of values explored for the alpha parameter."
165 | ]
166 | },
167 | {
168 | "cell_type": "code",
169 | "collapsed": false,
170 | "input": [
171 | "plot_learning_curve(linear_model.RidgeCV())\n",
172 | "plot_learning_curve(linear_model.LassoCV(n_alphas=20))"
173 | ],
174 | "language": "python",
175 | "metadata": {},
176 | "outputs": []
177 | },
178 | {
179 | "cell_type": "markdown",
180 | "metadata": {},
181 | "source": [
182 | "**ElasticNet** sits in between Lasso and Ridge. It has a tuning parameter, l1_ratio, that controls this behavior: when set to 0, ElasticNet is a Ridge, when set to 1, it is a Lasso. It is useful when your coefficients are not that sparse. The sparser the coefficients, the higher we should set l1_ratio. Note that l1_ratio can also be set by cross-validation, although we won't do it here to limit computational cost."
183 | ]
184 | },
185 | {
186 | "cell_type": "code",
187 | "collapsed": false,
188 | "input": [
189 | "plot_learning_curve(linear_model.RidgeCV())\n",
190 | "plot_learning_curve(linear_model.ElasticNetCV(l1_ratio=.8, n_alphas=20))"
191 | ],
192 | "language": "python",
193 | "metadata": {},
194 | "outputs": []
195 | },
196 | {
197 | "cell_type": "markdown",
198 | "metadata": {},
199 | "source": [
200 | "**Practical tips:**\n",
201 | "\n",
202 |       "- *Different algorithms for the same problem:* The Lasso model can be solved with different algorithms: the **Lasso** estimator implements a coordinate-descent approach; the **LassoLars** estimator implements the LARS algorithm. Depending on the situation, one may be faster than the other.\n",
203 |       "\n",
204 |       "- *Normalization:* as with all estimators, it can be useful to normalize your data. The linear models have a 'normalize' option that performs the normalization, runs the model, and rescales the coefficients so that they are suitable for non-normalized data (see the sketch after this list). Note that at the time of writing Lasso and LassoLars have different default settings for normalization.\n",
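205 |       "\n",
206 |       "A minimal sketch of that option (our own illustration, re-using the `X` and `y` above; the alpha value is arbitrary):\n",
207 |       "\n",
208 |       "    lasso = linear_model.Lasso(alpha=0.1, normalize=True)\n",
209 |       "    lasso.fit(X, y)\n",
210 |       "    print lasso.coef_   # coefficients usable on the original, non-normalized data"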
205 | ]
206 | },
207 | {
208 | "cell_type": "markdown",
209 | "metadata": {},
210 | "source": [
211 |       "**Exercise:** Find the best linear regression prediction on the `diabetes` dataset, which is available in the scikit-learn datasets module."
212 | ]
213 | },
214 | {
215 | "cell_type": "code",
216 | "collapsed": false,
217 | "input": [],
218 | "language": "python",
219 | "metadata": {},
220 | "outputs": []
221 | },
222 | {
223 | "cell_type": "heading",
224 | "level": 2,
225 | "metadata": {},
226 | "source": [
227 | "Linear models for classification"
228 | ]
229 | },
230 | {
231 | "cell_type": "markdown",
232 | "metadata": {},
233 | "source": [
234 |       "In general, a 2-class classification task can be seen as a regression problem with an output variable (y) taking values in {0, 1}. Let us simulate a simple example and apply linear regression to it."
235 | ]
236 | },
237 | {
238 | "cell_type": "code",
239 | "collapsed": false,
240 | "input": [
241 | "# Generate data\n",
242 | "np.random.seed(0)\n",
243 | "X = np.random.normal(size=100)\n",
244 | "y = (X > 0).astype(np.float)\n",
245 | "X[X > 0] *= 4\n",
246 | "X += .3 * np.random.normal(size=100)\n",
247 | "X = X[:, np.newaxis]\n",
248 | "pl.plot(X, y, 'o')\n",
249 | "\n",
250 | "# Fit a linear regression to it\n",
251 | "ols = linear_model.LinearRegression()\n",
252 | "ols.fit(X, y)\n",
253 | "X_test = np.linspace(-4, 10, 100)[:, np.newaxis]\n",
254 | "pl.plot(X_test, ols.predict(X_test))"
255 | ],
256 | "language": "python",
257 | "metadata": {},
258 | "outputs": []
259 | },
260 | {
261 | "cell_type": "markdown",
262 | "metadata": {},
263 | "source": [
264 |       "We can see that a linear fit is clearly not a good fit for a variable taking values in {0, 1}.\n",
265 |       "\n",
266 |       "A better approach is to fit a sigmoid; this is called **logistic regression**. The sigmoid can be seen as coming from a probabilistic model of 'y'.\n",
267 |       "\n",
268 |       "In scikit-learn, some estimators have a 'predict_proba' method that gives the probability of each class under the learned model. Let us use the predict_proba method of the LogisticRegression object to plot the fit:"
269 | ]
270 | },
271 | {
272 | "cell_type": "code",
273 | "collapsed": false,
274 | "input": [
275 | "# First plot the data and the ordinary least square regression\n",
276 | "pl.plot(X, y, 'o')\n",
277 | "pl.plot(X_test, ols.predict(X_test))\n",
278 | "\n",
279 | "# Plot a logistic regression\n",
280 | "log_reg = linear_model.LogisticRegression(C=1e5)\n",
281 | "log_reg.fit(X, y)\n",
282 | "pl.plot(X_test, log_reg.predict_proba(X_test)[:, 1])"
283 | ],
284 | "language": "python",
285 | "metadata": {},
286 | "outputs": []
287 | },
288 | {
289 | "cell_type": "markdown",
290 | "metadata": {},
291 | "source": [
292 | "**Regularization**: the logistic regression suffers from over-fit in the presence of many features and must be regularized. The 'C' parameter controls that regularization: large C values give unregularized model, while small C give strongly regularized models.\n",
293 | "\n",
294 | "Similar to the Ridge/Lasso separation, you can set the 'penalty' parameter to 'l1' to enforce sparsity of the coefficients.\n",
295 | "\n",
296 | "**Multi-class**: Multi-class with the LogisticRegression object is tackled using a one-vs-all approach."
297 | ]
298 | },
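{
 "cell_type": "markdown",
 "metadata": {},
 "source": [
  "*Illustrative sketch (added, not part of the original notebook)*: with penalty='l1', lowering C drives more and more coefficients exactly to zero. The example below uses the digits dataset bundled with scikit-learn."
 ]
},
{
 "cell_type": "code",
 "collapsed": false,
 "input": [
  "# Hypothetical illustration of the effect of C with an l1 penalty\n",
  "import numpy as np\n",
  "from sklearn import datasets\n",
  "digits = datasets.load_digits()\n",
  "for C in (1e-2, 1., 1e2):\n",
  "    clf = linear_model.LogisticRegression(penalty='l1', C=C).fit(digits.data, digits.target)\n",
  "    print 'C=%g: %d non-zero coefficients' % (C, np.sum(clf.coef_ != 0))"
 ],
 "language": "python",
 "metadata": {},
 "outputs": []
},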
299 | {
300 | "cell_type": "markdown",
301 | "metadata": {},
302 | "source": [
303 | "**Exercise:** Use LogisticRegression to classify digits."
304 | ]
305 | },
306 | {
307 | "cell_type": "code",
308 | "collapsed": false,
309 | "input": [],
310 | "language": "python",
311 | "metadata": {},
312 | "outputs": []
313 | }
314 | ],
315 | "metadata": {}
316 | }
317 | ]
318 | }
--------------------------------------------------------------------------------
/notebooks/08.2_unsupervised_clustering__skipped_at_europython.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "metadata": {
3 | "name": ""
4 | },
5 | "nbformat": 3,
6 | "nbformat_minor": 0,
7 | "worksheets": [
8 | {
9 | "cells": [
10 | {
11 | "cell_type": "heading",
12 | "level": 1,
13 | "metadata": {},
14 | "source": [
15 | "Clustering"
16 | ]
17 | },
18 | {
19 | "cell_type": "markdown",
20 | "metadata": {},
21 | "source": [
22 | "Clustering is the task of gathering samples into groups of similar\n",
23 | "samples according to some predefined similarity or dissimilarity\n",
24 | "measure (such as the Euclidean distance).\n",
25 | "In this section we will explore a basic clustering task on the\n",
26 | "iris data."
27 | ]
28 | },
29 | {
30 | "cell_type": "markdown",
31 | "metadata": {},
32 | "source": [
33 | "By the end of this section you will\n",
34 | "\n",
35 | "- Know how to instantiate and train KMeans, an unsupervised clustering algorithm\n",
36 | "- Know several other interesting clustering algorithms within scikit-learn"
37 | ]
38 | },
39 | {
40 | "cell_type": "markdown",
41 | "metadata": {},
42 | "source": [
43 | "Let's re-use the results of the 2D PCA of the iris dataset in order to\n",
44 | "explore clustering. First we need to repeat some of the code from the\n",
45 | "previous notebook"
46 | ]
47 | },
48 | {
49 | "cell_type": "code",
50 | "collapsed": false,
51 | "input": [
52 | "# make sure ipython inline mode is activated\n",
53 | "%pylab inline"
54 | ],
55 | "language": "python",
56 | "metadata": {},
57 | "outputs": []
58 | },
59 | {
60 | "cell_type": "code",
61 | "collapsed": false,
62 | "input": [
63 | "# all of this is copied from the previous notebook, '06_iris_dimensionality' \n",
64 | "from sklearn.datasets import load_iris\n",
65 | "from sklearn.decomposition import PCA\n",
66 | "import pylab as pl\n",
67 | "from itertools import cycle\n",
68 | "\n",
69 | "iris = load_iris()\n",
70 | "X = iris.data\n",
71 | "y = iris.target\n",
72 | "\n",
73 | "pca = PCA(n_components=2, whiten=True).fit(X)\n",
74 | "X_pca = pca.transform(X)\n",
75 | "\n",
76 | "def plot_2D(data, target, target_names):\n",
77 | " colors = cycle('rgbcmykw')\n",
78 | " target_ids = range(len(target_names))\n",
79 | " pl.figure()\n",
80 | " for i, c, label in zip(target_ids, colors, target_names):\n",
81 | " pl.scatter(data[target == i, 0], data[target == i, 1],\n",
82 | " c=c, label=label)\n",
83 | " pl.legend()"
84 | ],
85 | "language": "python",
86 | "metadata": {},
87 | "outputs": []
88 | },
89 | {
90 | "cell_type": "markdown",
91 | "metadata": {},
92 | "source": [
93 | "To remind ourselves what we're looking at, let's again plot the PCA components\n",
94 | "we defined in the last notebook:"
95 | ]
96 | },
97 | {
98 | "cell_type": "code",
99 | "collapsed": false,
100 | "input": [
101 | "plot_2D(X_pca, iris.target, iris.target_names)"
102 | ],
103 | "language": "python",
104 | "metadata": {},
105 | "outputs": []
106 | },
107 | {
108 | "cell_type": "markdown",
109 | "metadata": {},
110 | "source": [
111 | "Now we will use one of the simplest clustering algorithms, K-means.\n",
112 | "This is an iterative algorithm which searches for three cluster\n",
113 | "centers such that the distance from each point to its cluster is\n",
114 | "minimizied. First, let's step back for a second,\n",
115 | "look at the above plot, and think about what this will do.\n",
116 | "The algorithm will look for three cluster centers, and label the\n",
117 | "points according to which cluster center they're closest to.\n",
118 | "\n",
119 | "**Question:** what would you expect the output to look like?"
120 | ]
121 | },
122 | {
123 | "cell_type": "code",
124 | "collapsed": false,
125 | "input": [
126 | "from sklearn.cluster import KMeans\n",
127 | "from numpy.random import RandomState\n",
128 | "rng = RandomState(42)\n",
129 | "\n",
130 | "kmeans = KMeans(n_clusters=3, random_state=rng)\n",
131 | "kmeans.fit(X_pca)"
132 | ],
133 | "language": "python",
134 | "metadata": {},
135 | "outputs": []
136 | },
137 | {
138 | "cell_type": "code",
139 | "collapsed": false,
140 | "input": [
141 | "import numpy as np\n",
142 | "np.round(kmeans.cluster_centers_, decimals=2)"
143 | ],
144 | "language": "python",
145 | "metadata": {},
146 | "outputs": []
147 | },
148 | {
149 | "cell_type": "markdown",
150 | "metadata": {},
151 | "source": [
152 | "The ``labels_`` attribute of the K means estimator contains the ID of the\n",
153 | "cluster that each point is assigned to."
154 | ]
155 | },
156 | {
157 | "cell_type": "code",
158 | "collapsed": false,
159 | "input": [
160 | "kmeans.labels_"
161 | ],
162 | "language": "python",
163 | "metadata": {},
164 | "outputs": []
165 | },
166 | {
167 | "cell_type": "markdown",
168 | "metadata": {},
169 | "source": [
170 | "The K-means algorithm has been used to infer cluster labels for the\n",
171 | "points. Let's call the ``plot_2D`` function again, but color the points\n",
172 | "based on the cluster labels rather than the iris species."
173 | ]
174 | },
175 | {
176 | "cell_type": "code",
177 | "collapsed": false,
178 | "input": [
179 | "plot_2D(X_pca, kmeans.labels_, [\"c0\", \"c1\", \"c2\"])"
180 | ],
181 | "language": "python",
182 | "metadata": {},
183 | "outputs": []
184 | },
185 | {
186 | "cell_type": "markdown",
187 | "metadata": {},
188 | "source": [
189 | "**Clustering comes with assumptions**: A clustering algorithm find clusters using specific criterion, that correspond to given assumptions. For K-means clustering, the model is that all clusters have equal, spherical variance. In the case of the iris dataset this assumption does not match the geometry of the classes, and thus the clustering cannot recover the classes.\n",
190 | "\n",
191 | "**Gaussian Mixture Models**: we can choose a different set of assumptions using a Gaussian Mixture Model (GMM). The GMM can be used to relax the assumptions of equal variance or of sphericality. However, the less assumptions, the more the problem is ill-posed and hard to learn. The `covariance_type` argument of the GMM controls these assumptions. For the iris dataset, we will use the 'tied' mode, which imposes the same covariance for each classes. This makes the covariance learning problem easier."
192 | ]
193 | },
194 | {
195 | "cell_type": "code",
196 | "collapsed": false,
197 | "input": [
198 | "from sklearn.mixture import GMM\n",
199 | "gmm = GMM(n_components=3, covariance_type='tied')\n",
200 | "gmm.fit(X_pca)\n",
201 | "\n",
202 | "plot_2D(X_pca, gmm.predict(X_pca), [\"c0\", \"c1\", \"c2\"])\n",
203 | "plt.title('GMM labels')\n",
204 | "\n",
205 | "plot_2D(X_pca, iris.target, iris.target_names)\n",
206 | "plt.title('True labels')"
207 | ],
208 | "language": "python",
209 | "metadata": {},
210 | "outputs": []
211 | },
212 | {
213 | "cell_type": "markdown",
214 | "metadata": {},
215 | "source": [
216 | "We see that the label are now much closer to the ground truth.\n",
217 | "\n",
218 | "**In general, there is no guarantee that structure found by a clustering algorithm has anything to do with latent structures of the data**."
219 | ]
220 | },
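{
 "cell_type": "markdown",
 "metadata": {},
 "source": [
  "*Illustrative sketch (added, not part of the original notebook)*: the agreement between cluster labels and true classes can be quantified with the adjusted Rand index (1.0 is a perfect match, values near 0.0 are chance level)."
 ]
},
{
 "cell_type": "code",
 "collapsed": false,
 "input": [
  "# Hypothetical check; reuses kmeans, gmm and X_pca from the cells above\n",
  "from sklearn.metrics import adjusted_rand_score\n",
  "print 'K-means ARI: %.2f' % adjusted_rand_score(iris.target, kmeans.labels_)\n",
  "print 'GMM ARI:     %.2f' % adjusted_rand_score(iris.target, gmm.predict(X_pca))"
 ],
 "language": "python",
 "metadata": {},
 "outputs": []
},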
221 | {
222 | "cell_type": "heading",
223 | "level": 2,
224 | "metadata": {},
225 | "source": [
226 | "Some Notable Clustering Routines"
227 | ]
228 | },
229 | {
230 | "cell_type": "markdown",
231 | "metadata": {},
232 | "source": [
233 | "The following are two well-known clustering algorithms. Like most unsupervised learning\n",
234 | "models in the scikit, they expect the data to be clustered to have the shape `(n_samples, n_features)`:\n",
235 | "\n",
236 | "- `sklearn.cluster.KMeans`:
\n",
237 | " The simplest, yet effective clustering algorithm. Needs to be provided with the\n",
238 | " number of clusters in advance, and assumes that the data is normalized as input\n",
239 | " (but use a PCA model as preprocessor).\n",
240 | "- `sklearn.cluster.MeanShift`:
\n",
241 | " Can find better looking clusters than KMeans but is not scalable to high number of samples.\n",
242 | "- `sklearn.cluster.DBSCAN`:
\n",
243 | " Can detect irregularly shaped clusters based on density, i.e. sparse regions in\n",
244 | " the input space are likely to become inter-cluster boundaries. Can also detect\n",
245 | " outliers (samples that are not part of a cluster).\n",
246 | "\n",
247 | "Other clustering algorithms do not work with a data array of shape (n_samples, n_features)\n",
248 | "but directly with a precomputed affinity matrix of shape (n_samples, n_samples):\n",
249 | "\n",
250 | "- `sklearn.cluster.AffinityPropagation`:
\n",
251 | " Clustering algorithm based on message passing between data points.\n",
252 | "- `sklearn.cluster.SpectralClustering`:
\n",
253 | " KMeans applied to a projection of the normalized graph Laplacian: finds\n",
254 | " normalized graph cuts if the affinity matrix is interpreted as an adjacency matrix of a graph.\n",
255 | "- `sklearn.cluster.Ward`:
\n",
256 | " Ward implements hierarchical clustering based on the Ward algorithm,\n",
257 | " a variance-minimizing approach. At each step, it minimizes the sum of\n",
258 | " squared differences within all clusters (inertia criterion).\n",
259 | "- `sklearn.cluster.DBSCAN`:
\n",
260 | " DBSCAN can work with either an array of samples or an affinity matrix."
261 | ]
262 | },
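{
 "cell_type": "markdown",
 "metadata": {},
 "source": [
  "*Illustrative sketch (added, not part of the original notebook)*: K-means and DBSCAN behave very differently on non-convex clusters. The toy `make_moons` dataset makes this visible: K-means, which looks for compact spherical clusters, typically cuts each moon in half, while DBSCAN, which is density-based, can recover each moon as one cluster (with a suitable `eps`)."
 ]
},
{
 "cell_type": "code",
 "collapsed": false,
 "input": [
  "# Hypothetical comparison on two half-moon shaped clusters\n",
  "from sklearn.datasets import make_moons\n",
  "from sklearn.cluster import KMeans, DBSCAN\n",
  "\n",
  "X_moons, _ = make_moons(n_samples=300, noise=.05, random_state=0)\n",
  "km_labels = KMeans(n_clusters=2, random_state=0).fit_predict(X_moons)\n",
  "db_labels = DBSCAN(eps=.2).fit_predict(X_moons)\n",
  "\n",
  "pl.figure(figsize=(8, 3))\n",
  "pl.subplot(121)\n",
  "pl.scatter(X_moons[:, 0], X_moons[:, 1], c=km_labels)\n",
  "pl.title('KMeans')\n",
  "pl.subplot(122)\n",
  "pl.scatter(X_moons[:, 0], X_moons[:, 1], c=db_labels)\n",
  "pl.title('DBSCAN')"
 ],
 "language": "python",
 "metadata": {},
 "outputs": []
},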
263 | {
264 | "cell_type": "heading",
265 | "level": 2,
266 | "metadata": {},
267 | "source": [
268 | "Some Applications of Clustering"
269 | ]
270 | },
271 | {
272 | "cell_type": "markdown",
273 | "metadata": {},
274 | "source": [
275 | "Here are some common applications of clustering algorithms:\n",
276 | "\n",
277 | "- Compression, in a data reduction sens\n",
278 | "- Can be used as a preprocessing step for recommender systems\n",
279 | "- Similarly:\n",
280 | " - grouping related web news (e.g. Google News) and web search results\n",
281 | " - grouping related stock quotes for investment portfolio management\n",
282 | " - building customer profiles for market analysis\n",
283 | "- Building a code book of prototype samples for unsupervised feature extraction for supervised learning algorithms\n"
284 | ]
285 | },
286 | {
287 | "cell_type": "heading",
288 | "level": 2,
289 | "metadata": {},
290 | "source": [
291 | "Exercise: digits clustering"
292 | ]
293 | },
294 | {
295 | "cell_type": "markdown",
296 | "metadata": {},
297 | "source": [
298 | "Perform K-means clustering on the digits data, searching for ten clusters.\n",
299 | "Visualize the cluster centers as images (i.e. reshape each to 8x8 and use\n",
300 | "``plt.imshow``) Do the clusters seem to be correlated with particular digits?\n",
301 | "\n",
302 | "Visualize the projected digits as in the last notebook, but this time use the\n",
303 | "cluster labels as the color. What do you notice?"
304 | ]
305 | },
306 | {
307 | "cell_type": "code",
308 | "collapsed": false,
309 | "input": [
310 | "from sklearn.datasets import load_digits\n",
311 | "digits = load_digits()\n",
312 | "# ..."
313 | ],
314 | "language": "python",
315 | "metadata": {},
316 | "outputs": []
317 | },
318 | {
319 | "cell_type": "code",
320 | "collapsed": false,
321 | "input": [
322 | "%load solutions/08B_digits_clustering.py"
323 | ],
324 | "language": "python",
325 | "metadata": {},
326 | "outputs": []
327 | }
328 | ],
329 | "metadata": {}
330 | }
331 | ]
332 | }
--------------------------------------------------------------------------------
/notebooks/09_pipelining_estimators.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "metadata": {
3 | "name": ""
4 | },
5 | "nbformat": 3,
6 | "nbformat_minor": 0,
7 | "worksheets": [
8 | {
9 | "cells": [
10 | {
11 | "cell_type": "code",
12 | "collapsed": false,
13 | "input": [
14 | "%matplotlib inline\n",
15 | "import numpy as np\n",
16 | "import pylab as pl"
17 | ],
18 | "language": "python",
19 | "metadata": {},
20 | "outputs": []
21 | },
22 | {
23 | "cell_type": "heading",
24 | "level": 1,
25 | "metadata": {},
26 | "source": [
27 | "Pipelining estimators"
28 | ]
29 | },
30 | {
31 | "cell_type": "markdown",
32 | "metadata": {},
33 | "source": [
34 | "In this section we study how different estimators maybe be chained"
35 | ]
36 | },
37 | {
38 | "cell_type": "heading",
39 | "level": 2,
40 | "metadata": {},
41 | "source": [
42 | "A simple example: feature extraction and selection before an estimator"
43 | ]
44 | },
45 | {
46 | "cell_type": "heading",
47 | "level": 3,
48 | "metadata": {},
49 | "source": [
50 | "Feature extraction: vectorizer"
51 | ]
52 | },
53 | {
54 | "cell_type": "markdown",
55 | "metadata": {},
56 | "source": [
57 | "For some types of data, for instance text data, a feature extraction step must be applied to convert it to numerical features."
58 | ]
59 | },
60 | {
61 | "cell_type": "code",
62 | "collapsed": false,
63 | "input": [
64 | "from sklearn import datasets, feature_selection\n",
65 | "from sklearn.feature_extraction.text import TfidfVectorizer\n",
66 | "news = datasets.fetch_20newsgroups()\n",
67 | "X, y = news.data, news.target\n",
68 | "vectorizer = TfidfVectorizer()\n",
69 | "vectorizer.fit(X)\n",
70 | "vector_X = vectorizer.transform(X)\n",
71 | "print vector_X.shape"
72 | ],
73 | "language": "python",
74 | "metadata": {},
75 | "outputs": []
76 | },
77 | {
78 | "cell_type": "markdown",
79 | "metadata": {},
80 | "source": [
81 | "The feature selection object is a \"transformer\": it has a \"fit\" method and a \"transform\" method.\n",
82 | "\n",
83 | "Importantly, the \"fit\" method of the transformer is applied on the training set, but the transform method can be applied on any data, including the test set.\n",
84 | "\n",
85 | "We can see that the vectorized data has a very large number of features, as it list the words of the document. Many of these are not relevant for the classification problem."
86 | ]
87 | },
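{
 "cell_type": "markdown",
 "metadata": {},
 "source": [
  "*Illustrative sketch (added, not part of the original notebook)*: fit the vectorizer on a training split only, then apply the very same transformation to held-out documents."
 ]
},
{
 "cell_type": "code",
 "collapsed": false,
 "input": [
  "# Hypothetical demonstration of the fit/transform separation;\n",
  "# X is the list of documents loaded above.\n",
  "docs_train, docs_test = X[:10000], X[10000:]\n",
  "vec = TfidfVectorizer().fit(docs_train)  # the vocabulary is learned on the training documents only\n",
  "print vec.transform(docs_train).shape\n",
  "print vec.transform(docs_test).shape  # same columns, applied to unseen documents"
 ],
 "language": "python",
 "metadata": {},
 "outputs": []
},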
88 | {
89 | "cell_type": "heading",
90 | "level": 3,
91 | "metadata": {},
92 | "source": [
93 | "Feature selection"
94 | ]
95 | },
96 | {
97 | "cell_type": "markdown",
98 | "metadata": {},
99 | "source": [
100 | "Supervised feature selection can select features that seem relevent for a learning task based on a simple test. It is often a computationally cheap way of reducing the dimensionality.\n",
101 | "\n",
102 | "Scikit-learn has a variety of feature selection strategy. The univariate feature selection strategies, (FDR, FPR, FWER, k-best, percentile) apply a simple function to compute a test statistic on each feature. The choice of this function (the score_func parameter) is important:\n",
103 | "\n",
104 | "- f_regression for regression problems\n",
105 | "- f_classif for classification problems\n",
106 | "- chi2 for classification problems with sparse non-negative data (typically text data)."
107 | ]
108 | },
109 | {
110 | "cell_type": "code",
111 | "collapsed": false,
112 | "input": [
113 | "from sklearn import feature_selection\n",
114 | "\n",
115 | "selector = feature_selection.SelectPercentile(percentile=5, score_func=feature_selection.chi2)\n",
116 | "X_red = selector.fit_transform(vector_X, y)\n",
117 | "print \"Original data shape %s, reduced data shape %s\" % (vector_X.shape, X_red.shape)"
118 | ],
119 | "language": "python",
120 | "metadata": {},
121 | "outputs": []
122 | },
123 | {
124 | "cell_type": "heading",
125 | "level": 3,
126 | "metadata": {},
127 | "source": [
128 | "Pipelining"
129 | ]
130 | },
131 | {
132 | "cell_type": "markdown",
133 | "metadata": {},
134 | "source": [
135 | "A transformer and a predictor can be combined to form a predictor using the pipeline object.\n",
136 | "\n",
137 | "The constructor of the pipeline object takes a list of (name, estimator) pairs, that are applied on the data in the order of the list. The pipeline object exposes fit, transform, predict and score methods that result from applying the transforms (and fit in the case of the fit method) one after the other to the data, and calling the last object's corresponding function.\n",
138 | "\n",
139 | "Using a pipeline we can combine our feature extraction, selection and final SVC in one step. This is convenient, as it enables to do clean cross-validation."
140 | ]
141 | },
142 | {
143 | "cell_type": "code",
144 | "collapsed": false,
145 | "input": [
146 | "from sklearn.pipeline import Pipeline\n",
147 | "from sklearn.svm import LinearSVC\n",
148 | "from sklearn.cross_validation import cross_val_score\n",
149 | "\n",
150 | "svc = LinearSVC()\n",
151 | "pipeline = Pipeline([('vectorize', vectorizer), ('select', selector), ('svc', svc)])\n",
152 | "cross_val_score(pipeline, X, y, verbose=3)"
153 | ],
154 | "language": "python",
155 | "metadata": {},
156 | "outputs": []
157 | },
158 | {
159 | "cell_type": "heading",
160 | "level": 3,
161 | "metadata": {},
162 | "source": [
163 | "Parameter selection"
164 | ]
165 | },
166 | {
167 | "cell_type": "markdown",
168 | "metadata": {},
169 | "source": [
170 | "The resulting pipelined predictor object has implicitely many parameters. How do we set them in a principled way?\n",
171 | "\n",
172 | "As a reminder, the GridSearchCV object can be used to set the parameters of an estimator. We just need to know the name of the parameters to set.\n",
173 | "\n",
174 | "The pipeline object exposes the parameters of the estimators it wraps with the following convention: first the name of the estimator, as given in the constructor list, then the name of parameter, separated by a double underscore. For instance, to set the SVC's 'C' parameter:"
175 | ]
176 | },
177 | {
178 | "cell_type": "code",
179 | "collapsed": false,
180 | "input": [
181 | "pipeline.set_params(svc__C=10)"
182 | ],
183 | "language": "python",
184 | "metadata": {},
185 | "outputs": []
186 | },
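{
 "cell_type": "markdown",
 "metadata": {},
 "source": [
  "*Illustrative sketch (added, not part of the original notebook)*: the full list of settable parameter names can be inspected with `get_params`."
 ]
},
{
 "cell_type": "code",
 "collapsed": false,
 "input": [
  "# Hypothetical way to discover the parameter names exposed by the pipeline\n",
  "print [name for name in sorted(pipeline.get_params().keys()) if name.startswith('svc__')]"
 ],
 "language": "python",
 "metadata": {},
 "outputs": []
},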
187 | {
188 | "cell_type": "markdown",
189 | "metadata": {},
190 | "source": [
191 | "We can then use the grid search to choose the best C between 3 values.\n",
192 | "\n",
193 | "**Performance tip**: choosing parameters by cross-validation may imply running the transformers many times on the same data with the same parameters. One way to avoid part of this overhead is to use memoization. In particular, we can use the version of joblib that is embedded in scikit-learn:"
194 | ]
195 | },
196 | {
197 | "cell_type": "code",
198 | "collapsed": false,
199 | "input": [
200 | "from sklearn.externals import joblib\n",
201 | "memory = joblib.Memory(cachedir='.')\n",
202 | "memory.clear()\n",
203 | "selector.score_func = memory.cache(selector.score_func)"
204 | ],
205 | "language": "python",
206 | "metadata": {},
207 | "outputs": []
208 | },
209 | {
210 | "cell_type": "markdown",
211 | "metadata": {},
212 | "source": [
213 | "Now we can proceed to run the grid search:"
214 | ]
215 | },
216 | {
217 | "cell_type": "code",
218 | "collapsed": false,
219 | "input": [
220 | "from sklearn.grid_search import GridSearchCV\n",
221 | "grid = GridSearchCV(estimator=pipeline, param_grid=dict(svc__C=[1e-2, 1, 1e2]))\n",
222 | "grid.fit(X, y)\n",
223 | "print grid.best_estimator_.named_steps['svc']"
224 | ],
225 | "language": "python",
226 | "metadata": {},
227 | "outputs": []
228 | },
229 | {
230 | "cell_type": "heading",
231 | "level": 2,
232 | "metadata": {},
233 | "source": [
234 | "Exercise"
235 | ]
236 | },
237 | {
238 | "cell_type": "markdown",
239 | "metadata": {},
240 | "source": [
241 | "On the 'labeled faces in the wild' (datasets.fetch_lfw_people) chain a randomized PCA with an SVC for prediction"
242 | ]
243 | },
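{
 "cell_type": "markdown",
 "metadata": {},
 "source": [
  "*Hint (added, not part of the original exercise)*: a possible skeleton for the pipeline; the parameter values are placeholders to be tuned."
 ]
},
{
 "cell_type": "code",
 "collapsed": false,
 "input": [
  "# Hypothetical skeleton; fitting, scoring and parameter tuning are left as the exercise\n",
  "from sklearn.datasets import fetch_lfw_people\n",
  "from sklearn.decomposition import RandomizedPCA\n",
  "from sklearn.svm import SVC\n",
  "\n",
  "lfw = fetch_lfw_people(min_faces_per_person=70)\n",
  "face_pipeline = Pipeline([('pca', RandomizedPCA(n_components=150, whiten=True)),\n",
  "                          ('svm', SVC())])"
 ],
 "language": "python",
 "metadata": {},
 "outputs": []
},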
244 | {
245 | "cell_type": "code",
246 | "collapsed": false,
247 | "input": [],
248 | "language": "python",
249 | "metadata": {},
250 | "outputs": []
251 | }
252 | ],
253 | "metadata": {}
254 | }
255 | ]
256 | }
--------------------------------------------------------------------------------
/notebooks/figures/ML_flow_chart.py:
--------------------------------------------------------------------------------
1 | """
2 | Tutorial Diagrams
3 | -----------------
4 |
5 | This script plots the flow-charts used in the scikit-learn tutorials.
6 | """
7 |
8 | import numpy as np
9 | import pylab as pl
10 | from matplotlib.patches import Circle, Rectangle, Polygon, Arrow, FancyArrow
11 |
12 | def create_base(box_bg = '#CCCCCC',
13 | arrow1 = '#88CCFF',
14 | arrow2 = '#88FF88',
15 | supervised=True):
16 | fig = pl.figure(figsize=(9, 6), facecolor='w')
17 | ax = pl.axes((0, 0, 1, 1),
18 | xticks=[], yticks=[], frameon=False)
19 | ax.set_xlim(0, 9)
20 | ax.set_ylim(0, 6)
21 |
22 | patches = [Rectangle((0.3, 3.6), 1.5, 1.8, zorder=1, fc=box_bg),
23 | Rectangle((0.5, 3.8), 1.5, 1.8, zorder=2, fc=box_bg),
24 | Rectangle((0.7, 4.0), 1.5, 1.8, zorder=3, fc=box_bg),
25 |
26 | Rectangle((2.9, 3.6), 0.2, 1.8, fc=box_bg),
27 | Rectangle((3.1, 3.8), 0.2, 1.8, fc=box_bg),
28 | Rectangle((3.3, 4.0), 0.2, 1.8, fc=box_bg),
29 |
30 | Rectangle((0.3, 0.2), 1.5, 1.8, fc=box_bg),
31 |
32 | Rectangle((2.9, 0.2), 0.2, 1.8, fc=box_bg),
33 |
34 | Circle((5.5, 3.5), 1.0, fc=box_bg),
35 |
36 | Polygon([[5.5, 1.7],
37 | [6.1, 1.1],
38 | [5.5, 0.5],
39 | [4.9, 1.1]], fc=box_bg),
40 |
41 | FancyArrow(2.3, 4.6, 0.35, 0, fc=arrow1,
42 | width=0.25, head_width=0.5, head_length=0.2),
43 |
44 | FancyArrow(3.75, 4.2, 0.5, -0.2, fc=arrow1,
45 | width=0.25, head_width=0.5, head_length=0.2),
46 |
47 | FancyArrow(5.5, 2.4, 0, -0.4, fc=arrow1,
48 | width=0.25, head_width=0.5, head_length=0.2),
49 |
50 | FancyArrow(2.0, 1.1, 0.5, 0, fc=arrow2,
51 | width=0.25, head_width=0.5, head_length=0.2),
52 |
53 | FancyArrow(3.3, 1.1, 1.3, 0, fc=arrow2,
54 | width=0.25, head_width=0.5, head_length=0.2),
55 |
56 | FancyArrow(6.2, 1.1, 0.8, 0, fc=arrow2,
57 | width=0.25, head_width=0.5, head_length=0.2)]
58 |
59 | if supervised:
60 | patches += [Rectangle((0.3, 2.4), 1.5, 0.5, zorder=1, fc=box_bg),
61 | Rectangle((0.5, 2.6), 1.5, 0.5, zorder=2, fc=box_bg),
62 | Rectangle((0.7, 2.8), 1.5, 0.5, zorder=3, fc=box_bg),
63 | FancyArrow(2.3, 2.9, 2.0, 0, fc=arrow1,
64 | width=0.25, head_width=0.5, head_length=0.2),
65 | Rectangle((7.3, 0.85), 1.5, 0.5, fc=box_bg)]
66 | else:
67 | patches += [Rectangle((7.3, 0.2), 1.5, 1.8, fc=box_bg)]
68 |
69 | for p in patches:
70 | ax.add_patch(p)
71 |
72 | pl.text(1.45, 4.9, "Training\nText,\nDocuments,\nImages,\netc.",
73 | ha='center', va='center', fontsize=14)
74 |
75 | pl.text(3.6, 4.9, "Feature\nVectors",
76 | ha='left', va='center', fontsize=14)
77 |
78 | pl.text(5.5, 3.5, "Machine\nLearning\nAlgorithm",
79 | ha='center', va='center', fontsize=14)
80 |
81 | pl.text(1.05, 1.1, "New Text,\nDocument,\nImage,\netc.",
82 | ha='center', va='center', fontsize=14)
83 |
84 | pl.text(3.3, 1.7, "Feature\nVector",
85 | ha='left', va='center', fontsize=14)
86 |
87 | pl.text(5.5, 1.1, "Predictive\nModel",
88 | ha='center', va='center', fontsize=12)
89 |
90 | if supervised:
91 | pl.text(1.45, 3.05, "Labels",
92 | ha='center', va='center', fontsize=14)
93 |
94 | pl.text(8.05, 1.1, "Expected\nLabel",
95 | ha='center', va='center', fontsize=14)
96 | pl.text(8.8, 5.8, "Supervised Learning Model",
97 | ha='right', va='top', fontsize=18)
98 |
99 | else:
100 | pl.text(8.05, 1.1,
101 | "Likelihood\nor Cluster ID\nor Better\nRepresentation",
102 | ha='center', va='center', fontsize=12)
103 | pl.text(8.8, 5.8, "Unsupervised Learning Model",
104 | ha='right', va='top', fontsize=18)
105 |
106 |
107 |
108 | def plot_supervised_chart(annotate=False):
109 | create_base(supervised=True)
110 | if annotate:
111 | fontdict = dict(color='r', weight='bold', size=14)
112 | pl.text(1.9, 4.55, 'X = vec.fit_transform(input)',
113 | fontdict=fontdict,
114 | rotation=20, ha='left', va='bottom')
115 | pl.text(3.7, 3.2, 'clf.fit(X, y)',
116 | fontdict=fontdict,
117 | rotation=20, ha='left', va='bottom')
118 | pl.text(1.7, 1.5, 'X_new = vec.transform(input)',
119 | fontdict=fontdict,
120 | rotation=20, ha='left', va='bottom')
121 | pl.text(6.1, 1.5, 'y_new = clf.predict(X_new)',
122 | fontdict=fontdict,
123 | rotation=20, ha='left', va='bottom')
124 |
125 | def plot_unsupervised_chart():
126 | create_base(supervised=False)
127 |
128 |
129 | if __name__ == '__main__':
130 | plot_supervised_chart(False)
131 | plot_supervised_chart(True)
132 | plot_unsupervised_chart()
133 | pl.show()
134 |
135 |
136 |
--------------------------------------------------------------------------------
/notebooks/figures/__init__.py:
--------------------------------------------------------------------------------
1 | from sgd_separator import plot_sgd_separator
2 | from linear_regression import plot_linear_regression
3 | from ML_flow_chart import plot_supervised_chart, plot_unsupervised_chart
4 | from bias_variance import plot_bias_variance
5 | from sdss_filters import plot_sdss_filters, plot_redshifts
6 |
--------------------------------------------------------------------------------
/notebooks/figures/bias_variance.py:
--------------------------------------------------------------------------------
1 | import numpy as np
2 | import matplotlib.pyplot as plt
3 |
4 |
5 | def test_func(x, err=0.5):
6 | return np.random.normal(10 - 1. / (x + 0.1), err)
7 |
8 |
9 | def compute_error(x, y, p):
10 | yfit = np.polyval(p, x)
11 | return np.sqrt(np.mean((y - yfit) ** 2))
12 |
13 |
14 | def plot_bias_variance(N=8, random_seed=42, err=0.5):
15 | np.random.seed(random_seed)
16 | x = 10 ** np.linspace(-2, 0, N)
17 | y = test_func(x)
18 |
19 | xfit = np.linspace(-0.2, 1.2, 1000)
20 |
21 | titles = ['d = 1 (under-fit; high bias)',
22 | 'd = 2',
23 | 'd = 6 (over-fit; high variance)']
24 | degrees = [1, 2, 6]
25 |
26 | fig = plt.figure(figsize = (9, 3.5))
27 | fig.subplots_adjust(left = 0.06, right=0.98,
28 | bottom=0.15, top=0.85,
29 | wspace=0.05)
30 | for i, d in enumerate(degrees):
31 | ax = fig.add_subplot(131 + i, xticks=[], yticks=[])
32 | ax.scatter(x, y, marker='x', c='k', s=50)
33 |
34 | p = np.polyfit(x, y, d)
35 | yfit = np.polyval(p, xfit)
36 | ax.plot(xfit, yfit, '-b')
37 |
38 | ax.set_xlim(-0.2, 1.2)
39 | ax.set_ylim(0, 12)
40 | ax.set_xlabel('house size')
41 | if i == 0:
42 | ax.set_ylabel('price')
43 |
44 | ax.set_title(titles[i])
45 |
46 | if __name__ == '__main__':
47 | plot_bias_variance()
48 | plt.show()
49 |
--------------------------------------------------------------------------------
/notebooks/figures/grid_search_cv_splits.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/GaelVaroquaux/sklearn_europython_2014/c616221762dc1fdd7d643d1108fa0704c25d7089/notebooks/figures/grid_search_cv_splits.png
--------------------------------------------------------------------------------
/notebooks/figures/grid_search_parameters.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/GaelVaroquaux/sklearn_europython_2014/c616221762dc1fdd7d643d1108fa0704c25d7089/notebooks/figures/grid_search_parameters.png
--------------------------------------------------------------------------------
/notebooks/figures/iris_setosa.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/GaelVaroquaux/sklearn_europython_2014/c616221762dc1fdd7d643d1108fa0704c25d7089/notebooks/figures/iris_setosa.jpg
--------------------------------------------------------------------------------
/notebooks/figures/iris_versicolor.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/GaelVaroquaux/sklearn_europython_2014/c616221762dc1fdd7d643d1108fa0704c25d7089/notebooks/figures/iris_versicolor.jpg
--------------------------------------------------------------------------------
/notebooks/figures/iris_virginica.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/GaelVaroquaux/sklearn_europython_2014/c616221762dc1fdd7d643d1108fa0704c25d7089/notebooks/figures/iris_virginica.jpg
--------------------------------------------------------------------------------
/notebooks/figures/linear_regression.py:
--------------------------------------------------------------------------------
1 | import numpy as np
2 | import matplotlib.pyplot as plt
3 | from sklearn.linear_model import LinearRegression
4 |
5 |
6 | def plot_linear_regression():
7 | a = 0.5
8 | b = 1.0
9 |
10 | # x from 0 to 10
11 | x = 30 * np.random.random(20)
12 |
13 | # y = a*x + b with noise
14 | y = a * x + b + np.random.normal(size=x.shape)
15 |
16 | # create a linear regression classifier
17 | clf = LinearRegression()
18 | clf.fit(x[:, None], y)
19 |
20 | # predict y from the data
21 | x_new = np.linspace(0, 30, 100)
22 | y_new = clf.predict(x_new[:, None])
23 |
24 | # plot the results
25 | ax = plt.axes()
26 | ax.scatter(x, y)
27 | ax.plot(x_new, y_new)
28 |
29 | ax.set_xlabel('x')
30 | ax.set_ylabel('y')
31 |
32 | ax.axis('tight')
33 |
34 |
35 | if __name__ == '__main__':
36 | plot_linear_regression()
37 | plt.show()
38 |
--------------------------------------------------------------------------------
/notebooks/figures/sdss_filters.py:
--------------------------------------------------------------------------------
1 | """
2 | SDSS Filters
3 | ------------
4 |
5 | This example downloads and plots the filters from the Sloan Digital Sky
6 | Survey, along with a reference spectrum.
7 | """
8 | import os
9 | import urllib2
10 |
11 | import numpy as np
12 | import matplotlib.pyplot as plt
13 | from matplotlib.patches import Arrow
14 |
15 | DOWNLOAD_DIR = os.path.join(os.path.dirname(os.path.abspath(__file__)),
16 | 'downloads')
17 | REFSPEC_URL = 'ftp://ftp.stsci.edu/cdbs/current_calspec/1732526_nic_002.ascii'
18 | FILTER_URL = 'http://www.sdss.org/dr7/instruments/imager/filters/%s.dat'
19 |
20 | def fetch_filter(filt):
21 | assert filt in 'ugriz'
22 | url = FILTER_URL % filt
23 |
24 | if not os.path.exists(DOWNLOAD_DIR):
25 | os.makedirs(DOWNLOAD_DIR)
26 |
27 | loc = os.path.join(DOWNLOAD_DIR, '%s.dat' % filt)
28 | if not os.path.exists(loc):
29 | print "downloading from %s" % url
30 | F = urllib2.urlopen(url)
31 | open(loc, 'w').write(F.read())
32 |
33 | F = open(loc)
34 |
35 | data = np.loadtxt(F)
36 | return data
37 |
38 |
39 | def fetch_vega_spectrum():
40 | if not os.path.exists(DOWNLOAD_DIR):
41 | os.makedirs(DOWNLOAD_DIR)
42 |
43 | refspec_file = os.path.join(DOWNLOAD_DIR, REFSPEC_URL.split('/')[-1])
44 |
45 | if not os.path.exists(refspec_file):
46 | print "downloading from %s" % REFSPEC_URL
47 | F = urllib2.urlopen(REFSPEC_URL)
48 | open(refspec_file, 'w').write(F.read())
49 |
50 | F = open(refspec_file)
51 |
52 | data = np.loadtxt(F)
53 | return data
54 |
55 |
56 | def plot_sdss_filters():
57 | Xref = fetch_vega_spectrum()
58 | Xref[:, 1] /= 2.1 * Xref[:, 1].max()
59 |
60 | #----------------------------------------------------------------------
61 | # Plot filters in color with a single spectrum
62 | fig, ax = plt.subplots()
63 | ax.plot(Xref[:, 0], Xref[:, 1], '-k', lw=2)
64 |
65 | for f,c in zip('ugriz', 'bgrmk'):
66 | X = fetch_filter(f)
67 | ax.fill(X[:, 0], X[:, 1], ec=c, fc=c, alpha=0.4)
68 |
69 | kwargs = dict(fontsize=20, ha='center', va='center', alpha=0.5)
70 | ax.text(3500, 0.02, 'u', color='b', **kwargs)
71 | ax.text(4600, 0.02, 'g', color='g', **kwargs)
72 | ax.text(6100, 0.02, 'r', color='r', **kwargs)
73 | ax.text(7500, 0.02, 'i', color='m', **kwargs)
74 | ax.text(8800, 0.02, 'z', color='k', **kwargs)
75 |
76 | ax.set_xlim(3000, 11000)
77 |
78 | ax.set_title('SDSS Filters and Reference Spectrum')
79 | ax.set_xlabel('Wavelength (Angstroms)')
80 | ax.set_ylabel('normalized flux / filter transmission')
81 |
82 |
83 | def plot_redshifts():
84 | Xref = fetch_vega_spectrum()
85 | Xref[:, 1] /= 2.1 * Xref[:, 1].max()
86 |
87 | #----------------------------------------------------------------------
88 | # Plot filters in gray with several redshifted spectra
89 | fig, ax = plt.subplots()
90 |
91 | redshifts = [0.0, 0.4, 0.8]
92 | colors = 'bgr'
93 |
94 | for z, c in zip(redshifts, colors):
95 | plt.plot((1. + z) * Xref[:, 0], Xref[:, 1], color=c)
96 |
97 | ax.add_patch(Arrow(4200, 0.47, 1300, 0, lw=0, width=0.05, color='r'))
98 | ax.add_patch(Arrow(5800, 0.47, 1250, 0, lw=0, width=0.05, color='r'))
99 |
100 | ax.text(3800, 0.49, 'z = 0.0', fontsize=14, color=colors[0])
101 | ax.text(5500, 0.49, 'z = 0.4', fontsize=14, color=colors[1])
102 | ax.text(7300, 0.49, 'z = 0.8', fontsize=14, color=colors[2])
103 |
104 | for f in 'ugriz':
105 | X = fetch_filter(f)
106 | ax.fill(X[:, 0], X[:, 1], ec='k', fc='k', alpha=0.2)
107 |
108 | kwargs = dict(fontsize=20, color='gray', ha='center', va='center')
109 | ax.text(3500, 0.02, 'u', **kwargs)
110 | ax.text(4600, 0.02, 'g', **kwargs)
111 | ax.text(6100, 0.02, 'r', **kwargs)
112 | ax.text(7500, 0.02, 'i', **kwargs)
113 | ax.text(8800, 0.02, 'z', **kwargs)
114 |
115 | ax.set_xlim(3000, 11000)
116 | ax.set_ylim(0, 0.55)
117 |
118 | ax.set_title('Redshifting of a Spectrum')
119 | ax.set_xlabel('Observed Wavelength (Angstroms)')
120 | ax.set_ylabel('normalized flux / filter transmission')
121 |
122 |
123 | if __name__ == '__main__':
124 | plot_sdss_filters()
125 | plot_redshifts()
126 | plt.show()
127 |
--------------------------------------------------------------------------------
/notebooks/figures/sgd_separator.py:
--------------------------------------------------------------------------------
1 | import numpy as np
2 | import matplotlib.pyplot as plt
3 | from sklearn.linear_model import SGDClassifier
4 | from sklearn.datasets.samples_generator import make_blobs
5 |
6 | def plot_sgd_separator():
7 | # we create 50 separable points
8 | X, Y = make_blobs(n_samples=50, centers=2,
9 | random_state=0, cluster_std=0.60)
10 |
11 | # fit the model
12 | clf = SGDClassifier(loss="hinge", alpha=0.01,
13 | n_iter=200, fit_intercept=True)
14 | clf.fit(X, Y)
15 |
16 | # plot the line, the points, and the nearest vectors to the plane
17 | xx = np.linspace(-1, 5, 10)
18 | yy = np.linspace(-1, 5, 10)
19 |
20 | X1, X2 = np.meshgrid(xx, yy)
21 | Z = np.empty(X1.shape)
22 | for (i, j), val in np.ndenumerate(X1):
23 | x1 = val
24 | x2 = X2[i, j]
25 | p = clf.decision_function([x1, x2])
26 | Z[i, j] = p[0]
27 | levels = [-1.0, 0.0, 1.0]
28 | linestyles = ['dashed', 'solid', 'dashed']
29 | colors = 'k'
30 |
31 | ax = plt.axes()
32 | ax.contour(X1, X2, Z, levels, colors=colors, linestyles=linestyles)
33 | ax.scatter(X[:, 0], X[:, 1], c=Y, cmap=plt.cm.Paired)
34 |
35 | ax.axis('tight')
36 |
37 |
38 | if __name__ == '__main__':
39 | plot_sgd_separator()
40 | plt.show()
41 |
--------------------------------------------------------------------------------
/notebooks/figures/supervised_scikit_learn.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/GaelVaroquaux/sklearn_europython_2014/c616221762dc1fdd7d643d1108fa0704c25d7089/notebooks/figures/supervised_scikit_learn.png
--------------------------------------------------------------------------------
/notebooks/figures/svm_gui_frames.py:
--------------------------------------------------------------------------------
1 | """
2 | Linear Model Example
3 | --------------------
4 |
5 | This is an example plot from the tutorial which accompanies an explanation
6 | of the support vector machine GUI.
7 | """
8 |
9 | import numpy as np
10 | import pylab as pl
11 | import matplotlib
12 |
13 | from sklearn import svm
14 |
15 |
16 | def linear_model(rseed=42, Npts=30):
17 | np.random.seed(rseed)
18 |
19 |
20 | data = np.random.normal(0, 10, (Npts, 2))
21 | data[:Npts / 2] -= 15
22 | data[Npts / 2:] += 15
23 |
24 | labels = np.ones(Npts)
25 | labels[:Npts / 2] = -1
26 |
27 | return data, labels
28 |
29 |
30 | def nonlinear_model(rseed=42, Npts=30):
31 | radius = 40 * np.random.random(Npts)
32 | far_pts = radius > 20
33 | radius[far_pts] *= 1.2
34 | radius[~far_pts] *= 1.1
35 |
36 | theta = np.random.random(Npts) * np.pi * 2
37 |
38 | data = np.empty((Npts, 2))
39 | data[:, 0] = radius * np.cos(theta)
40 | data[:, 1] = radius * np.sin(theta)
41 |
42 | labels = np.ones(Npts)
43 | labels[far_pts] = -1
44 |
45 | return data, labels
46 |
47 |
48 | def plot_linear_model():
49 | X, y = linear_model()
50 | clf = svm.SVC(kernel='linear',
51 | gamma=0.01, coef0=0, degree=3)
52 | clf.fit(X, y)
53 |
54 | fig = pl.figure()
55 | ax = pl.subplot(111, xticks=[], yticks=[])
56 | ax.scatter(X[:, 0], X[:, 1], c=y, cmap=pl.cm.bone)
57 |
58 | ax.scatter(clf.support_vectors_[:, 0],
59 | clf.support_vectors_[:, 1],
60 | s=80, edgecolors="k", facecolors="none")
61 |
62 | delta = 1
63 | y_min, y_max = -50, 50
64 | x_min, x_max = -50, 50
65 | x = np.arange(x_min, x_max + delta, delta)
66 | y = np.arange(y_min, y_max + delta, delta)
67 | X1, X2 = np.meshgrid(x, y)
68 | Z = clf.decision_function(np.c_[X1.ravel(), X2.ravel()])
69 | Z = Z.reshape(X1.shape)
70 |
71 | levels = [-1.0, 0.0, 1.0]
72 | linestyles = ['dashed', 'solid', 'dashed']
73 | colors = 'k'
74 | ax.contour(X1, X2, Z, levels,
75 | colors=colors,
76 | linestyles=linestyles)
77 |
78 |
79 | def plot_rbf_model():
80 | X, y = nonlinear_model()
81 | clf = svm.SVC(kernel='rbf',
82 | gamma=0.001, coef0=0, degree=3)
83 | clf.fit(X, y)
84 |
85 | fig = pl.figure()
86 | ax = pl.subplot(111, xticks=[], yticks=[])
87 | ax.scatter(X[:, 0], X[:, 1], c=y, cmap=pl.cm.bone, zorder=2)
88 |
89 | ax.scatter(clf.support_vectors_[:, 0],
90 | clf.support_vectors_[:, 1],
91 | s=80, edgecolors="k", facecolors="none")
92 |
93 | delta = 1
94 | y_min, y_max = -50, 50
95 | x_min, x_max = -50, 50
96 | x = np.arange(x_min, x_max + delta, delta)
97 | y = np.arange(y_min, y_max + delta, delta)
98 | X1, X2 = np.meshgrid(x, y)
99 | Z = clf.decision_function(np.c_[X1.ravel(), X2.ravel()])
100 | Z = Z.reshape(X1.shape)
101 |
102 | levels = [-1.0, 0.0, 1.0]
103 | linestyles = ['dashed', 'solid', 'dashed']
104 | colors = 'k'
105 |
106 | ax.contourf(X1, X2, Z, 10,
107 | cmap=matplotlib.cm.bone,
108 | origin='lower',
109 | alpha=0.85, zorder=1)
110 | ax.contour(X1, X2, Z, [0.0],
111 | colors='k',
112 | linestyles=['solid'], zorder=1)
113 |
114 |
115 | if __name__ == '__main__':
116 | plot_linear_model()
117 | plot_rbf_model()
118 | pl.show()
119 |
120 |
--------------------------------------------------------------------------------
/notebooks/helpers.py:
--------------------------------------------------------------------------------
1 | """
2 | Small helpers for code that is not shown in the notebooks
3 | """
4 |
5 | from sklearn import neighbors, datasets, linear_model
6 | import pylab as pl
7 | import numpy as np
8 | from matplotlib.colors import ListedColormap
9 |
10 | # Create color maps for 3-class classification problem, as with iris
11 | cmap_light = ListedColormap(['#FFAAAA', '#AAFFAA', '#AAAAFF'])
12 | cmap_bold = ListedColormap(['#FF0000', '#00FF00', '#0000FF'])
13 |
14 | def plot_iris_knn():
15 | iris = datasets.load_iris()
16 | X = iris.data[:, :2] # we only take the first two features. We could
17 | # avoid this ugly slicing by using a two-dim dataset
18 | y = iris.target
19 |
20 | knn = neighbors.KNeighborsClassifier(n_neighbors=3)
21 | knn.fit(X, y)
22 |
23 | x_min, x_max = X[:, 0].min() - .1, X[:, 0].max() + .1
24 | y_min, y_max = X[:, 1].min() - .1, X[:, 1].max() + .1
25 | xx, yy = np.meshgrid(np.linspace(x_min, x_max, 100),
26 | np.linspace(y_min, y_max, 100))
27 | Z = knn.predict(np.c_[xx.ravel(), yy.ravel()])
28 |
29 | # Put the result into a color plot
30 | Z = Z.reshape(xx.shape)
31 | pl.figure()
32 | pl.pcolormesh(xx, yy, Z, cmap=cmap_light)
33 |
34 | # Plot also the training points
35 | pl.scatter(X[:, 0], X[:, 1], c=y, cmap=cmap_bold)
36 | pl.xlabel('sepal length (cm)')
37 | pl.ylabel('sepal width (cm)')
38 | pl.axis('tight')
39 |
40 |
41 | def plot_polynomial_regression():
42 | rng = np.random.RandomState(0)
43 | x = 2*rng.rand(100) - 1
44 |
45 | f = lambda t: 1.2 * t**2 + .1 * t**3 - .4 * t **5 - .5 * t ** 9
46 | y = f(x) + .4 * rng.normal(size=100)
47 |
48 | x_test = np.linspace(-1, 1, 100)
49 |
50 | pl.figure()
51 | pl.scatter(x, y, s=4)
52 |
53 | X = np.array([x**i for i in range(5)]).T
54 | X_test = np.array([x_test**i for i in range(5)]).T
55 | regr = linear_model.LinearRegression()
56 | regr.fit(X, y)
57 | pl.plot(x_test, regr.predict(X_test), label='4th order')
58 |
59 | X = np.array([x**i for i in range(10)]).T
60 | X_test = np.array([x_test**i for i in range(10)]).T
61 | regr = linear_model.LinearRegression()
62 | regr.fit(X, y)
63 | pl.plot(x_test, regr.predict(X_test), label='9th order')
64 |
65 | pl.legend(loc='best')
66 | pl.axis('tight')
67 | pl.title('Fitting a 4th and a 9th order polynomial')
68 |
69 | pl.figure()
70 | pl.scatter(x, y, s=4)
71 | pl.plot(x_test, f(x_test), label="truth")
72 | pl.axis('tight')
73 | pl.title('Ground truth (9th order polynomial)')
74 |
75 |
76 |
--------------------------------------------------------------------------------
/notebooks/images/parallel_text_clf.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/GaelVaroquaux/sklearn_europython_2014/c616221762dc1fdd7d643d1108fa0704c25d7089/notebooks/images/parallel_text_clf.png
--------------------------------------------------------------------------------
/notebooks/images/parallel_text_clf_average.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/GaelVaroquaux/sklearn_europython_2014/c616221762dc1fdd7d643d1108fa0704c25d7089/notebooks/images/parallel_text_clf_average.png
--------------------------------------------------------------------------------
/notebooks/solutions/02A_faces_plot.py:
--------------------------------------------------------------------------------
1 | faces = fetch_olivetti_faces()
2 |
3 | # set up the figure
4 | fig = plt.figure(figsize=(6, 6)) # figure size in inches
5 | fig.subplots_adjust(left=0, right=1, bottom=0, top=1, hspace=0.05, wspace=0.05)
6 |
7 | # plot the faces:
8 | for i in range(64):
9 | ax = fig.add_subplot(8, 8, i + 1, xticks=[], yticks=[])
10 | ax.imshow(faces.images[i], cmap=plt.cm.bone, interpolation='nearest')
11 |
--------------------------------------------------------------------------------
/notebooks/solutions/04B_houses_plot.py:
--------------------------------------------------------------------------------
1 | for index, feature_name in enumerate(data.feature_names):
2 | plt.figure()
3 | plt.scatter(data.data[:, index], data.target)
4 | plt.ylabel('Price')
5 | plt.xlabel(feature_name)
6 |
7 |
--------------------------------------------------------------------------------
/notebooks/solutions/04B_houses_regression.py:
--------------------------------------------------------------------------------
1 | from sklearn.ensemble import GradientBoostingRegressor
2 |
3 | clf = GradientBoostingRegressor()
4 | clf.fit(X_train, y_train)
5 |
6 | predicted = clf.predict(X_test)
7 | expected = y_test
8 |
9 | plt.scatter(expected, predicted)
10 | plt.plot([0, 50], [0, 50], '--k')
11 | plt.axis('tight')
12 | plt.xlabel('True price ($1000s)')
13 | plt.ylabel('Predicted price ($1000s)')
14 | print "RMS:", np.sqrt(np.mean((predicted - expected) ** 2))
15 |
--------------------------------------------------------------------------------
/notebooks/solutions/04C_validation_exercise.py:
--------------------------------------------------------------------------------
1 | # suppress warnings from older versions of KNeighbors
2 | import warnings
3 | warnings.filterwarnings('ignore', message='kneighbors*')
4 |
5 | X = digits.data
6 | y = digits.target
7 | X_train, X_test, y_train, y_test = cross_validation.train_test_split(X, y, test_size=0.25, random_state=0)
8 |
9 | for Model in [LinearSVC, GaussianNB, KNeighborsClassifier]:
10 | clf = Model().fit(X_train, y_train)
11 | y_pred = clf.predict(X_test)
12 | print Model.__name__, metrics.f1_score(y_test, y_pred)
13 |
14 | print '------------------'
15 |
16 | # test SVC loss
17 | for loss in ['l1', 'l2']:
18 | clf = LinearSVC(loss=loss).fit(X_train, y_train)
19 | y_pred = clf.predict(X_test)
20 | print "LinearSVC(loss='{0}')".format(loss), metrics.f1_score(y_test, y_pred)
21 |
22 | print '-------------------'
23 |
24 | # test K-neighbors
25 | for n_neighbors in range(1, 11):
26 | clf = KNeighborsClassifier(n_neighbors=n_neighbors).fit(X_train, y_train)
27 | y_pred = clf.predict(X_test)
28 | print "KNeighbors(n_neighbors={0})".format(n_neighbors), metrics.f1_score(y_test, y_pred)
29 |
--------------------------------------------------------------------------------
/notebooks/solutions/05B_strip_headers.py:
--------------------------------------------------------------------------------
1 | def strip_headers(post):
2 | """Find the first blank line and drop the headers to keep the body"""
3 | if '\n\n' in post:
4 | headers, body = post.split('\n\n', 1)
5 | return body.lower()
6 | else:
7 | # Unexpected post inner-structure, be conservative
8 | # and keep everything
9 | return post.lower()
10 |
11 | # Let's try it on the first post. Here is the original post content,
12 | # including the headers:
13 |
14 | original_text = all_twenty_train.data[0]
15 | print("Oringinal text:")
16 | print(original_text + "\n")
17 |
18 | text_body = strip_headers(original_text)
19 | print("Stripped text:")
20 | print(text_body + "\n")
21 |
22 | # Let's train a new classifier with the header stripping preprocessor
23 |
24 | strip_vectorizer = TfidfVectorizer(preprocessor=strip_headers, min_df=2)
25 | X_train_small_stripped = strip_vectorizer.fit_transform(
26 | twenty_train_small.data)
27 |
28 | y_train_small_stripped = twenty_train_small.target
29 |
30 | classifier = MultinomialNB(alpha=0.01).fit(
31 | X_train_small_stripped, y_train_small_stripped)
32 |
33 | print("Training score: {0:.1f}%".format(
34 | classifier.score(X_train_small_stripped, y_train_small_stripped) * 100))
35 |
36 | X_test_small_stripped = strip_vectorizer.transform(twenty_test_small.data)
37 | y_test_small_stripped = twenty_test_small.target
38 | print("Testing score: {0:.1f}%".format(
39 | classifier.score(X_test_small_stripped, y_test_small_stripped) * 100))
--------------------------------------------------------------------------------
/notebooks/solutions/06B_basic_grid_search.py:
--------------------------------------------------------------------------------
1 | for Model in [Lasso, Ridge]:
2 | scores = [cross_val_score(Model(alpha), X, y, cv=3).mean()
3 | for alpha in alphas]
4 | plt.plot(alphas, scores, label=Model.__name__)
5 | plt.legend(loc='lower left')
6 |
--------------------------------------------------------------------------------
/notebooks/solutions/06B_learning_curves.py:
--------------------------------------------------------------------------------
1 | from sklearn.metrics import explained_variance_score, mean_squared_error
2 | from sklearn.cross_validation import train_test_split
3 |
4 | def plot_learning_curve(model, err_func=explained_variance_score, N=300, n_runs=10, n_sizes=50, ylim=None):
5 | sizes = np.linspace(5, N, n_sizes).astype(int)
6 | train_err = np.zeros((n_runs, n_sizes))
7 | validation_err = np.zeros((n_runs, n_sizes))
8 | for i in range(n_runs):
9 | for j, size in enumerate(sizes):
10 | xtrain, xtest, ytrain, ytest = train_test_split(
11 | X, y, train_size=size, random_state=i)
12 | # Train on only the first `size` points
13 | model.fit(xtrain, ytrain)
14 | validation_err[i, j] = err_func(ytest, model.predict(xtest))
15 | train_err[i, j] = err_func(ytrain, model.predict(xtrain))
16 |
17 | plt.plot(sizes, validation_err.mean(axis=0), lw=2, label='validation')
18 | plt.plot(sizes, train_err.mean(axis=0), lw=2, label='training')
19 |
20 |     plt.xlabel('training set size')
21 | plt.ylabel(err_func.__name__.replace('_', ' '))
22 |
23 | plt.grid(True)
24 |
25 | plt.legend(loc=0)
26 |
27 | plt.xlim(0, N-1)
28 |
29 | if ylim:
30 | plt.ylim(ylim)
31 |
32 |
33 | plt.figure(figsize=(10, 8))
34 | for i, model in enumerate([Lasso(0.01), Ridge(0.06)]):
35 | plt.subplot(221 + i)
36 | plot_learning_curve(model, ylim=(0, 1))
37 | plt.title(model.__class__.__name__)
38 |
39 | plt.subplot(223 + i)
40 | plot_learning_curve(model, err_func=mean_squared_error, ylim=(0, 8000))
41 |
--------------------------------------------------------------------------------
/notebooks/solutions/08A_digits_projection.py:
--------------------------------------------------------------------------------
1 | from sklearn.decomposition import PCA
2 | from sklearn.manifold import Isomap, LocallyLinearEmbedding
3 |
4 | plt.figure(figsize=(14, 4))
5 | for i, est in enumerate([PCA(n_components=2, whiten=True),
6 | Isomap(n_components=2, n_neighbors=10),
7 | LocallyLinearEmbedding(n_components=2, n_neighbors=10, method='modified')]):
8 | plt.subplot(131 + i)
9 | projection = est.fit_transform(digits.data)
10 | plt.scatter(projection[:, 0], projection[:, 1], c=digits.target)
11 | plt.title(est.__class__.__name__)
12 |
--------------------------------------------------------------------------------
/notebooks/solutions/08B_digits_clustering.py:
--------------------------------------------------------------------------------
1 | from sklearn.cluster import KMeans
2 | kmeans = KMeans(n_clusters=10)
3 | clusters = kmeans.fit_predict(digits.data)
4 |
5 | print kmeans.cluster_centers_.shape
6 |
7 | #------------------------------------------------------------
8 | # visualize the cluster centers
9 | fig = plt.figure(figsize=(8, 3))
10 | for i in range(10):
11 | ax = fig.add_subplot(2, 5, 1 + i)
12 | ax.imshow(kmeans.cluster_centers_[i].reshape((8, 8)),
13 | cmap=plt.cm.binary)
14 | from sklearn.manifold import Isomap
15 | X_iso = Isomap(n_neighbors=10).fit_transform(digits.data)
16 |
17 | #------------------------------------------------------------
18 | # visualize the projected data
19 | fig, ax = plt.subplots(1, 2, figsize=(8, 4))
20 |
21 | ax[0].scatter(X_iso[:, 0], X_iso[:, 1], c=clusters)
22 | ax[1].scatter(X_iso[:, 0], X_iso[:, 1], c=digits.target)
23 |
--------------------------------------------------------------------------------
/rendered_notebooks/01.2_sklearn_overview.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "metadata": {
3 | "name": ""
4 | },
5 | "nbformat": 3,
6 | "nbformat_minor": 0,
7 | "worksheets": [
8 | {
9 | "cells": [
10 | {
11 | "cell_type": "heading",
12 | "level": 1,
13 | "metadata": {},
14 | "source": [
15 | "An Overview of Scikit-learn"
16 | ]
17 | },
18 | {
19 | "cell_type": "markdown",
20 | "metadata": {},
21 | "source": [
22 | "*Adapted from* [*http://scikit-learn.org/stable/tutorial/basic/tutorial.html*](http://scikit-learn.org/stable/tutorial/basic/tutorial.html)"
23 | ]
24 | },
25 | {
26 | "cell_type": "code",
27 | "collapsed": false,
28 | "input": [
29 | "%matplotlib inline\n",
30 | "import numpy as np\n",
31 | "from matplotlib import pyplot as plt"
32 | ],
33 | "language": "python",
34 | "metadata": {},
35 | "outputs": [],
36 | "prompt_number": 0
37 | },
38 | {
39 | "cell_type": "heading",
40 | "level": 2,
41 | "metadata": {},
42 | "source": [
43 | "Loading an Example Dataset"
44 | ]
45 | },
46 | {
47 | "cell_type": "code",
48 | "collapsed": false,
49 | "input": [
50 | "from sklearn import datasets\n",
51 | "digits = datasets.load_digits()"
52 | ],
53 | "language": "python",
54 | "metadata": {},
55 | "outputs": [
56 | {
57 | "output_type": "stream",
58 | "stream": "stderr",
59 | "text": [
60 | "/usr/lib/python2.7/dist-packages/scipy/stats/distributions.py:32: RuntimeWarning: numpy.ufunc size changed, may indicate binary incompatibility\n",
61 | " from . import vonmises_cython\n",
62 | "/usr/lib/python2.7/dist-packages/scipy/spatial/__init__.py:88: RuntimeWarning: numpy.ufunc size changed, may indicate binary incompatibility\n",
63 | " from .ckdtree import *\n",
64 | "/usr/lib/python2.7/dist-packages/scipy/spatial/__init__.py:89: RuntimeWarning: numpy.ufunc size changed, may indicate binary incompatibility\n",
65 | " from .qhull import *\n",
66 | "/usr/lib/python2.7/dist-packages/scipy/stats/stats.py:251: RuntimeWarning: numpy.ufunc size changed, may indicate binary incompatibility\n",
67 | " from ._rank import rankdata, tiecorrect\n"
68 | ]
69 | }
70 | ],
71 | "prompt_number": 1
72 | },
73 | {
74 | "cell_type": "code",
75 | "collapsed": false,
76 | "input": [
77 | "digits.data"
78 | ],
79 | "language": "python",
80 | "metadata": {},
81 | "outputs": [
82 | {
83 | "output_type": "pyout",
84 | "prompt_number": 4,
85 | "text": [
86 | "array([[ 0., 0., 5., ..., 0., 0., 0.],\n",
87 | " [ 0., 0., 0., ..., 10., 0., 0.],\n",
88 | " [ 0., 0., 0., ..., 16., 9., 0.],\n",
89 | " ..., \n",
90 | " [ 0., 0., 1., ..., 6., 0., 0.],\n",
91 | " [ 0., 0., 2., ..., 12., 0., 0.],\n",
92 | " [ 0., 0., 10., ..., 12., 1., 0.]])"
93 | ]
94 | }
95 | ],
96 | "prompt_number": 2
97 | },
98 | {
99 | "cell_type": "code",
100 | "collapsed": false,
101 | "input": [
102 | "digits.target"
103 | ],
104 | "language": "python",
105 | "metadata": {},
106 | "outputs": [
107 | {
108 | "output_type": "pyout",
109 | "prompt_number": 5,
110 | "text": [
111 | "array([0, 1, 2, ..., 8, 9, 8])"
112 | ]
113 | }
114 | ],
115 | "prompt_number": 3
116 | },
117 | {
118 | "cell_type": "code",
119 | "collapsed": false,
120 | "input": [
121 | "digits.images[0]"
122 | ],
123 | "language": "python",
124 | "metadata": {},
125 | "outputs": [
126 | {
127 | "output_type": "pyout",
128 | "prompt_number": 6,
129 | "text": [
130 | "array([[ 0., 0., 5., 13., 9., 1., 0., 0.],\n",
131 | " [ 0., 0., 13., 15., 10., 15., 5., 0.],\n",
132 | " [ 0., 3., 15., 2., 0., 11., 8., 0.],\n",
133 | " [ 0., 4., 12., 0., 0., 8., 8., 0.],\n",
134 | " [ 0., 5., 8., 0., 0., 9., 8., 0.],\n",
135 | " [ 0., 4., 11., 0., 1., 12., 7., 0.],\n",
136 | " [ 0., 2., 14., 5., 10., 12., 0., 0.],\n",
137 | " [ 0., 0., 6., 13., 10., 0., 0., 0.]])"
138 | ]
139 | }
140 | ],
141 | "prompt_number": 4
142 | },
143 | {
144 | "cell_type": "heading",
145 | "level": 2,
146 | "metadata": {},
147 | "source": [
148 | "Learning and Predicting"
149 | ]
150 | },
151 | {
152 | "cell_type": "code",
153 | "collapsed": false,
154 | "input": [
155 | "from sklearn import svm\n",
156 | "clf = svm.SVC(gamma=0.001, C=100.)"
157 | ],
158 | "language": "python",
159 | "metadata": {},
160 | "outputs": [],
161 | "prompt_number": 5
162 | },
163 | {
164 | "cell_type": "code",
165 | "collapsed": false,
166 | "input": [
167 | "clf.fit(digits.data[:-1], digits.target[:-1])"
168 | ],
169 | "language": "python",
170 | "metadata": {},
171 | "outputs": [
172 | {
173 | "output_type": "pyout",
174 | "prompt_number": 8,
175 | "text": [
176 | "SVC(C=100.0, cache_size=200, class_weight=None, coef0=0.0, degree=3,\n",
177 | " gamma=0.001, kernel='rbf', max_iter=-1, probability=False,\n",
178 | " random_state=None, shrinking=True, tol=0.001, verbose=False)"
179 | ]
180 | }
181 | ],
182 | "prompt_number": 6
183 | },
184 | {
185 | "cell_type": "code",
186 | "collapsed": false,
187 | "input": [
188 | "clf.predict(digits.data[-1])"
189 | ],
190 | "language": "python",
191 | "metadata": {},
192 | "outputs": [
193 | {
194 | "output_type": "pyout",
195 | "prompt_number": 9,
196 | "text": [
197 | "array([8])"
198 | ]
199 | }
200 | ],
201 | "prompt_number": 7
202 | },
203 | {
204 | "cell_type": "code",
205 | "collapsed": false,
206 | "input": [
207 | "plt.figure(figsize=(2, 2))\n",
208 | "plt.imshow(digits.images[-1], interpolation='nearest', cmap=plt.cm.binary)"
209 | ],
210 | "language": "python",
211 | "metadata": {},
212 | "outputs": [
213 | {
214 | "output_type": "pyout",
215 | "prompt_number": 10,
216 | "text": [
217 | ""
218 | ]
219 | },
220 | {
221 | "output_type": "display_data",
222 | "png": "iVBORw0KGgoAAAANSUhEUgAAAIcAAACKCAYAAACAeVnUAAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAAC55JREFUeJzt3V9sU+UbB/Bv52bImA40c13WxeL4s3UbbWXYhKyBqkgm\nYBzMuIH8G3pjTJw3RrwRvSBDY3SIV0YFNRkmXogSNnSByhgxk7SNFxgwsYtlEAPGAe2cXdvHC7P+\n4Le9Pee87Tk9Zc8n4WL0POc8dA/nz/uc9xwLEREYm0VRvhNg5sXFwYS4OJgQFwcT4uJgQlwcTKg4\n2xVYLJZc5MHySDSakXVxZFr53r17sXfvXs3ryxS3c+dOYVwoFILL5Zr1M7/fL4wbHx/HggULNG/P\n7/djzZo1ws+7u7tn/fuenh689tprwjhRLnp8n5n+c/NhhQkpFsfAwADq6uqwZMkS7N+/34icmElk\nLI5kMomXXnoJAwMDOH/+PPr6+vDLL7+oXnmmXa4ecVarVSpu3rx5UnF2u10qrqWlRSrO6O8zY3GM\njIxg8eLFsNvtKCkpQUdHB44ePap7UlwcszNVcYyNjaGmpib9s81mw9jYmNSGWOHJeLWi9jL11jPh\nNWvWSFcq05/f78945XarjMVRXV2NSCSS/jkSicBms81YTubyiuXH///nffPNN4XLZjysNDc349df\nf8Xo6Cji8Ti+/PJLPPXUUzlLlJlbxj1HcXExDh48iHXr1iGZTGL37t2or683KjeWZ4ojpK2trWht\nbTUiF2YyPELKhHLSW5ExOjoqFXf48GGpuAcffFAqTnYs407Aew4mxMXBhLg4mJBicXR1daGyshJN\nTU1G5MNMRLE4du3ahYGBASNyYSajWBxerxcLFy40IhdmMnzOwYRyMs7BXdnCkbOurFrclS0cOevK\nsrlNsTg6OzuxatUqXLx4ETU1Nfj000+NyIuZgOJhpa+vz4g8mAnxYYUJ5a0rK9vtLC8vl4obHx+X\nipPtHgPy/0bZXHON9xxMiIuDCXFxMCHF4ohEIvD5fGhoaEBjYyMOHDhgRF7MBBRPSEtKSvDee+/B\n5XIhGo1ixYoVWLt2Ld+FPgco7jmsVmv6mRdlZWWor6/H5cuXdU+M5Z+mc47R0VEEg0F4PB698mEm\nonqcIxqNor29Hb29vSgrK7vtM+7KFg4tXVmLmsdbT01NYcOGDWhtbZ3xKCOLxSJ87JMeRI9E0ovo\n0U1qvP/++1JxRg6CZfr9KR5WiAi7d++Gw+HI6otihUexOIaHh/HFF1/g1KlTcLvdcLvdfE/pHKF4\nztHS0oJUKmVELsxkeISUCak6Ic24AoNPSL/++mupuLa2thxnomzHjh1ScYcOHcptIhlkdULK5i4u\nDibExcGEFItjcnISHo8HLpcLDocDe/bsMSIvZgKKl7Lz5s3DqVOnUFpaikQigZaWFpw5c0b6Qaus\ncKg6rJSWlgIA4vE4kskk7rvvPl2TYuagqjhSqRRcLhcqKyvh8/ngcDj0zouZgKqubFFREUKhEK5f\nv45169bNeM8Id2ULh25zZcvLy7F+/XqcO3dOWBzM3HI6V/batWvpFvLff/+N77//Hm63O/ssmekp\n7jmuXLmCHTt2IJVKIZVKYdu2bXjssceMyI3lmWJxNDU1IRAIGJELMxkeIWVCXBxMKG8TqWXJ3pcp\nOwE7G9lMwjYD3nMwIS4OJqSqOJLJJNxuNzZu3Kh3PsxEVBVHb28vHA4Hv7d+jlEsjkuXLuH48eN4\n/vnnDb1XlOWfYnG88soreOedd1BUxKcnc03GS9ljx47hgQcegNvtztjJ465s4cjZXNnXX38dn3/+\nOYqLizE5OYkbN25g8+bN+Oyzz/63AoOnJsgWXigUym0iKkw/ukIrtb+8XJCemrBv3z5EIhGEw2Ec\nOXIEjz766G2Fwe5smk4k+GplblE9fL569WqsXr1az1yYyfAlCBPi4mBCeevKyp6R//DDD1Jxsm97\nyOalwz6fTypOdiL1zp07peJEeM/BhLg4mJCqw4rdbse9996Lu+66CyUlJRgZGdE7L2YCqorDYrHA\n7/fzNMg5RvVhhTuyc4+q4rBYLHj88cfR3NyMjz76SO+cmEmoOqwMDw+jqqoKV69exdq1a1FXVwev\n15v+nLuyhSPnc2WrqqoAABUVFWhra8PIyIiwOJi55XSu7MTEBG7evAkAiMVi+O6779DU1JR9lsz0\nFPccf/zxR/oxjYlEAlu3bsUTTzyhe2Is/xSLY9GiRXm5UYblH4+QMiEuDiZUcF1Zo7eXTVdWllnm\n2PKegwlxcTAhxeIYHx9He3s76uvr4XA48OOPPxqRFzMBxXOOl19+GU8++SS++uorJBIJxGIxI/Ji\nJpCxOK5fv46hoSEcPnz4v4WLi/PyEBSWHxkPK+FwGBUVFdi1axcefvhhvPDCC5iYmDAqN5ZnGfcc\niUQCgUAABw8exMqVK9Hd3Y2enh689dZbty3HXdnCkbOurM1mg81mw8qVKwEA7e3t6OnpmbEcd2UL\nR866slarFTU1Nbh48SIAYHBwEA0NDbnJkpme4tXKBx98gK1btyIej6O2tlZ6/gcrPIrF4XQ68dNP\nPxmRCzMZHiFlQlwcTChvLx2efk2HVrJPMJbtymbTIZXt6Mq+WHnBggWaY/ilw0wKFwcTUiyOCxcu\nwO12p/+Ul5fjwIEDRuTG8kzxUnbZsmUIBoMA/ntLZHV1dfpudHZn03RYGRwcRG1tLWpqavTKh5mI\npuI4cuQItmzZolcuzGRU32Acj8fx7bffYv/+/TM+465s4dDlvbL9/f1YsWIFKioqZnzGXdnCkdO5\nstP6+vrQ2dmZVWKssKgqjlgshsHBQWzatEnTymVHJc+cOSMVJzuaKTtaOzk5aej2ZL8X2d+DquKY\nP38+rl27hnvuuUfTyrk4crs9UxYHm5u4OJhQTrqyrLCJSiDridT8lME7Fx9WmBAXBxPSrTgGBgZQ\nV1eHJUuWzDrkPpuuri5UVlZqfiBdJBKBz+dDQ0MDGhsbVd9SMDk5CY/HA5fLBYfDgT179mjarszL\nmO12O5YvXw63241HHnlEdZzMhPasb7cgHSQSCaqtraVwOEzxeJycTiedP39eMe706dMUCASosbFR\n0/auXLlCwWCQiIhu3rxJS5cuVbU9IqJYLEZERFNTU+TxeGhoaEj1dt99913asmULbdy4UXWM3W6n\nP//8U/Xy07Zv304ff/xxOtfx8XFN8clkkqxWK/3++++qY3TZc4yMjGDx4sWw2+0oKSlBR0cHjh49\nqhjn9XqxcOFCzduzWq3pNzGWlZWhvr4ely9fVhVbWloK4L/GYjKZVP1892xexqx1+ekJ7V1dXQDk\nJrTL3G6hS3GMjY3dloTNZsPY2Jgem5phdHQUwWAQHo9H1fKpVAoulwuVlZXw+XxwOByq4mRfxizz\nqPBcTGiXud1Cl+LI19hHNBpFe3s7ent
7UVZWpiqmqKgIoVAIly5dwunTp1UNNd/6Mmate4Hh4WEE\ng0H09/fjww8/xNDQkGLM9IT2F198EYFAAPPnz591zrLI9O0WzzzzjKZcdSmO6upqRCKR9M+RSAQ2\nm02PTaVNTU1h8+bNeO655/D0009rji8vL8f69etx7tw5xWXPnj2Lb775BosWLUJnZydOnjyJ7du3\nq9rObI8KVzLbhPZAIKBqe0Dm2y0y0nRWo9LU1BQ99NBDFA6H6Z9//lF9QkpEFA6HNZ+QplIp2rZt\nG3V3d2uKu3r1Kv31119ERDQxMUFer5cGBwc1rcPv99OGDRtULRuLxejGjRtERBSNRmnVqlV04sQJ\nVbFer5cuXLhARERvvPEGvfrqq6pzfPbZZ+nQoUOql5+mS3EQER0/fpyWLl1KtbW1tG/fPlUxHR0d\nVFVVRXfffTfZbDb65JNPVMUNDQ2RxWIhp9NJLpeLXC4X9ff3K8b9/PPP5Ha7yel0UlNTE7399tuq\ntncrv9+v+mrlt99+I6fTSU6nkxoaGlR/L0REoVCImpubafny5dTW1qb6aiUajdL999+fLkotsu6t\nsDsXj5AyIS4OJsTFwYS4OJgQFwcT4uJgQv8CdYFJ8R3MeG0AAAAASUVORK5CYII=\n",
223 | "text": [
224 | ""
225 | ]
226 | }
227 | ],
228 | "prompt_number": 8
229 | },
230 | {
231 | "cell_type": "code",
232 | "collapsed": false,
233 | "input": [
234 | "print digits.target[-1]"
235 | ],
236 | "language": "python",
237 | "metadata": {},
238 | "outputs": [
239 | {
240 | "output_type": "stream",
241 | "stream": "stdout",
242 | "text": [
243 | "8\n"
244 | ]
245 | }
246 | ],
247 | "prompt_number": 9
248 | },
249 | {
250 | "cell_type": "heading",
251 | "level": 2,
252 | "metadata": {},
253 | "source": [
254 | "Model Persistence"
255 | ]
256 | },
257 | {
258 | "cell_type": "code",
259 | "collapsed": false,
260 | "input": [
261 | "from sklearn import svm\n",
262 | "from sklearn import datasets\n",
263 | "clf = svm.SVC()\n",
264 | "iris = datasets.load_iris()\n",
265 | "X, y = iris.data, iris.target\n",
266 | "clf.fit(X, y)"
267 | ],
268 | "language": "python",
269 | "metadata": {},
270 | "outputs": [
271 | {
272 | "output_type": "pyout",
273 | "prompt_number": 12,
274 | "text": [
275 | "SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0, degree=3, gamma=0.0,\n",
276 | " kernel='rbf', max_iter=-1, probability=False, random_state=None,\n",
277 | " shrinking=True, tol=0.001, verbose=False)"
278 | ]
279 | }
280 | ],
281 | "prompt_number": 10
282 | },
283 | {
284 | "cell_type": "code",
285 | "collapsed": false,
286 | "input": [
287 | "import pickle\n",
288 | "s = pickle.dumps(clf)\n",
289 | "clf2 = pickle.loads(s)\n",
290 | "clf2.predict(X[0])"
291 | ],
292 | "language": "python",
293 | "metadata": {},
294 | "outputs": [
295 | {
296 | "output_type": "pyout",
297 | "prompt_number": 13,
298 | "text": [
299 | "array([0])"
300 | ]
301 | }
302 | ],
303 | "prompt_number": 11
304 | },
305 | {
306 | "cell_type": "code",
307 | "collapsed": false,
308 | "input": [
309 | "y[0]"
310 | ],
311 | "language": "python",
312 | "metadata": {},
313 | "outputs": [
314 | {
315 | "output_type": "pyout",
316 | "prompt_number": 14,
317 | "text": [
318 | "0"
319 | ]
320 | }
321 | ],
322 | "prompt_number": 12
323 | },
324 | {
325 | "cell_type": "code",
326 | "collapsed": false,
327 | "input": [
328 | "from sklearn.externals import joblib\n",
329 | "joblib.dump(clf, 'filename.pkl') "
330 | ],
331 | "language": "python",
332 | "metadata": {},
333 | "outputs": [
334 | {
335 | "output_type": "pyout",
336 | "prompt_number": 15,
337 | "text": [
338 | "['filename.pkl',\n",
339 | " 'filename.pkl_01.npy',\n",
340 | " 'filename.pkl_02.npy',\n",
341 | " 'filename.pkl_03.npy',\n",
342 | " 'filename.pkl_04.npy',\n",
343 | " 'filename.pkl_05.npy',\n",
344 | " 'filename.pkl_06.npy',\n",
345 | " 'filename.pkl_07.npy',\n",
346 | " 'filename.pkl_08.npy',\n",
347 | " 'filename.pkl_09.npy',\n",
348 | " 'filename.pkl_10.npy']"
349 | ]
350 | }
351 | ],
352 | "prompt_number": 13
353 | }
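,
{
 "cell_type": "markdown",
 "metadata": {},
 "source": [
  "The persisted model can later be reloaded from disk. A minimal sketch, assuming the `filename.pkl` written by the previous cell is still in the working directory:"
 ]
},
{
 "cell_type": "code",
 "collapsed": false,
 "input": [
  "from sklearn.externals import joblib\n",
  "# reload the estimator dumped above and reuse it for prediction\n",
  "clf3 = joblib.load('filename.pkl')\n",
  "clf3.predict(X[:1])"
 ],
 "language": "python",
 "metadata": {},
 "outputs": []
}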
354 | ],
355 | "metadata": {}
356 | }
357 | ]
358 | }
--------------------------------------------------------------------------------
/rendered_notebooks/02.2_feature_extraction.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "metadata": {
3 | "name": ""
4 | },
5 | "nbformat": 3,
6 | "nbformat_minor": 0,
7 | "worksheets": [
8 | {
9 | "cells": [
10 | {
11 | "cell_type": "heading",
12 | "level": 1,
13 | "metadata": {},
14 | "source": [
15 | "Feature Extraction"
16 | ]
17 | },
18 | {
19 | "cell_type": "markdown",
20 | "metadata": {},
21 | "source": [
22 | "Here we will talk about an important piece of machine learning: the extraction of\n",
23 | "quantitative features from data. By the end of this section you will\n",
24 | "\n",
25 | "- Know how features are extracted from real-world data.\n",
26 | "- See an example of extracting numerical features from textual data\n",
27 | "\n",
28 | "In addition, we will go over several basic tools within scikit-learn which can be used to accomplish the above tasks."
29 | ]
30 | },
31 | {
32 | "cell_type": "heading",
33 | "level": 2,
34 | "metadata": {},
35 | "source": [
36 | "What Are Features?"
37 | ]
38 | },
39 | {
40 | "cell_type": "heading",
41 | "level": 3,
42 | "metadata": {},
43 | "source": [
44 | "Numerical Features"
45 | ]
46 | },
47 | {
48 | "cell_type": "markdown",
49 | "metadata": {},
50 | "source": [
51 | "Recall that data in scikit-learn is expected to be in two-dimensional arrays, of size\n",
52 | "**n_samples** $\\times$ **n_features**.\n",
53 | "\n",
54 | "Previously, we looked at the iris dataset, which has 150 samples and 4 features"
55 | ]
56 | },
57 | {
58 | "cell_type": "code",
59 | "collapsed": false,
60 | "input": [
61 | "from sklearn.datasets import load_iris\n",
62 | "iris = load_iris()\n",
63 | "print iris.data.shape"
64 | ],
65 | "language": "python",
66 | "metadata": {},
67 | "outputs": [
68 | {
69 | "output_type": "stream",
70 | "stream": "stdout",
71 | "text": [
72 | "(150, 4)\n"
73 | ]
74 | },
75 | {
76 | "output_type": "stream",
77 | "stream": "stderr",
78 | "text": [
79 | "/usr/lib/python2.7/dist-packages/scipy/stats/distributions.py:32: RuntimeWarning: numpy.ufunc size changed, may indicate binary incompatibility\n",
80 | " from . import vonmises_cython\n",
81 | "/usr/lib/python2.7/dist-packages/scipy/spatial/__init__.py:88: RuntimeWarning: numpy.ufunc size changed, may indicate binary incompatibility\n",
82 | " from .ckdtree import *\n",
83 | "/usr/lib/python2.7/dist-packages/scipy/spatial/__init__.py:89: RuntimeWarning: numpy.ufunc size changed, may indicate binary incompatibility\n",
84 | " from .qhull import *\n",
85 | "/usr/lib/python2.7/dist-packages/scipy/stats/stats.py:251: RuntimeWarning: numpy.ufunc size changed, may indicate binary incompatibility\n",
86 | " from ._rank import rankdata, tiecorrect\n"
87 | ]
88 | }
89 | ],
90 | "prompt_number": 0
91 | },
92 | {
93 | "cell_type": "markdown",
94 | "metadata": {},
95 | "source": [
96 | "These features are:\n",
97 | "\n",
98 | "- sepal length in cm\n",
99 | "- sepal width in cm\n",
100 | "- petal length in cm\n",
101 | "- petal width in cm\n",
102 | "\n",
103 | "Numerical features such as these are pretty straightforward: each sample contains a list\n",
104 | "of floating-point numbers corresponding to the features"
105 | ]
106 | },
107 | {
108 | "cell_type": "heading",
109 | "level": 3,
110 | "metadata": {},
111 | "source": [
112 | "Categorical Features"
113 | ]
114 | },
115 | {
116 | "cell_type": "markdown",
117 | "metadata": {},
118 | "source": [
119 | "What if you have categorical features? For example, imagine there is data on the color of each\n",
120 | "iris:\n",
121 | "\n",
122 | " color in [red, blue, purple]\n",
123 | "\n",
124 | "You might be tempted to assign numbers to these features, i.e. *red=1, blue=2, purple=3*\n",
125 | "but in general **this is a bad idea**. Estimators tend to operate under the assumption that\n",
126 | "numerical features lie on some continuous scale, so, for example, 1 and 2 are more alike\n",
127 | "than 1 and 3, and this is often not the case for categorical features.\n",
128 | "\n",
129 | "A better strategy is to give each category its own dimension. \n",
130 | "The enriched iris feature set would hence be in this case:\n",
131 | "\n",
132 | "- sepal length in cm\n",
133 | "- sepal width in cm\n",
134 | "- petal length in cm\n",
135 | "- petal width in cm\n",
136 | "- color=purple (1.0 or 0.0)\n",
137 | "- color=blue (1.0 or 0.0)\n",
138 | "- color=red (1.0 or 0.0)\n",
139 | "\n",
140 | "Note that using many of these categorical features may result in data which is better\n",
141 | "represented as a **sparse matrix**, as we'll see with the text classification example\n",
142 | "below."
143 | ]
144 | },
145 | {
146 | "cell_type": "heading",
147 | "level": 4,
148 | "metadata": {},
149 | "source": [
150 | "Using the DictVectorizer to encode categorical features"
151 | ]
152 | },
153 | {
154 | "cell_type": "markdown",
155 | "metadata": {},
156 | "source": [
157 | "When the source data is encoded has a list of dicts where the values are either strings names for categories or numerical values, you can use the `DictVectorizer` class to compute the boolean expansion of the categorical features while leaving the numerical features unimpacted:"
158 | ]
159 | },
160 | {
161 | "cell_type": "code",
162 | "collapsed": false,
163 | "input": [
164 | "measurements = [\n",
165 | " {'city': 'Dubai', 'temperature': 33.},\n",
166 | " {'city': 'London', 'temperature': 12.},\n",
167 | " {'city': 'San Fransisco', 'temperature': 18.},\n",
168 | "]"
169 | ],
170 | "language": "python",
171 | "metadata": {},
172 | "outputs": [],
173 | "prompt_number": 1
174 | },
175 | {
176 | "cell_type": "code",
177 | "collapsed": false,
178 | "input": [
179 | "from sklearn.feature_extraction import DictVectorizer\n",
180 | "\n",
181 | "vec = DictVectorizer()\n",
182 | "vec"
183 | ],
184 | "language": "python",
185 | "metadata": {},
186 | "outputs": [
187 | {
188 | "output_type": "pyout",
189 | "prompt_number": 4,
190 | "text": [
191 | "DictVectorizer(dtype=, separator='=', sparse=True)"
192 | ]
193 | }
194 | ],
195 | "prompt_number": 2
196 | },
197 | {
198 | "cell_type": "code",
199 | "collapsed": false,
200 | "input": [
201 | "vec.fit_transform(measurements).toarray()"
202 | ],
203 | "language": "python",
204 | "metadata": {},
205 | "outputs": [
206 | {
207 | "output_type": "stream",
208 | "stream": "stderr",
209 | "text": [
210 | "/home/varoquau/dev/numpy/numpy/core/fromnumeric.py:2499: VisibleDeprecationWarning: `rank` is deprecated; use the `ndim` attribute or function instead. To find the rank of a matrix see `numpy.linalg.matrix_rank`.\n",
211 | " VisibleDeprecationWarning)\n"
212 | ]
213 | },
214 | {
215 | "output_type": "pyout",
216 | "prompt_number": 5,
217 | "text": [
218 | "array([[ 1., 0., 0., 33.],\n",
219 | " [ 0., 1., 0., 12.],\n",
220 | " [ 0., 0., 1., 18.]])"
221 | ]
222 | }
223 | ],
224 | "prompt_number": 3
225 | },
226 | {
227 | "cell_type": "code",
228 | "collapsed": false,
229 | "input": [
230 | "vec.get_feature_names()"
231 | ],
232 | "language": "python",
233 | "metadata": {},
234 | "outputs": [
235 | {
236 | "output_type": "pyout",
237 | "prompt_number": 6,
238 | "text": [
239 | "['city=Dubai', 'city=London', 'city=San Fransisco', 'temperature']"
240 | ]
241 | }
242 | ],
243 | "prompt_number": 4
244 | },
245 | {
246 | "cell_type": "heading",
247 | "level": 3,
248 | "metadata": {},
249 | "source": [
250 | "Derived Features"
251 | ]
252 | },
253 | {
254 | "cell_type": "markdown",
255 | "metadata": {},
256 | "source": [
257 | "Another common feature type are **derived features**, where some pre-processing step is\n",
258 | "applied to the data to generate features that are somehow more informative. Derived\n",
259 | "features may be based in **dimensionality reduction** (such as PCA or manifold learning),\n",
260 | "may be linear or nonlinear combinations of features (such as in Polynomial regression),\n",
261 | "or may be some more sophisticated transform of the features. The latter is often used\n",
262 | "in image processing.\n",
263 | "\n",
264 | "For example, [scikit-image](http://scikit-image.org/) provides a variety of feature\n",
265 | "extractors designed for image data: see the ``skimage.feature`` submodule.\n",
266 | "We will see some *dimensionality*-based feature extraction routines later in the tutorial."
267 | ]
268 | },
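{
 "cell_type": "markdown",
 "metadata": {},
 "source": [
  "As a quick illustration of derived features, here is a minimal sketch that uses PCA (assuming `sklearn.decomposition.PCA`, which is covered in more detail later) to derive 2 new features from the 4 iris measurements:"
 ]
},
{
 "cell_type": "code",
 "collapsed": false,
 "input": [
  "from sklearn.decomposition import PCA\n",
  "# derive 2 new features as linear combinations of the original 4\n",
  "pca = PCA(n_components=2)\n",
  "X_derived = pca.fit_transform(iris.data)\n",
  "print X_derived.shape"
 ],
 "language": "python",
 "metadata": {},
 "outputs": []
},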
269 | {
270 | "cell_type": "heading",
271 | "level": 3,
272 | "metadata": {},
273 | "source": [
274 | "Text Feature Extraction"
275 | ]
276 | },
277 | {
278 | "cell_type": "markdown",
279 | "metadata": {},
280 | "source": [
281 | "Unstructed content such as text documents require there own feature extraction step. In general we treat words in text documents as individual categorical features. An example on text mining will be introduced later."
282 | ]
283 | }
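,
{
 "cell_type": "markdown",
 "metadata": {},
 "source": [
  "As a small teaser, here is a minimal sketch that turns a toy corpus into a word-count matrix (assuming `sklearn.feature_extraction.text.CountVectorizer` and a made-up pair of documents):"
 ]
},
{
 "cell_type": "code",
 "collapsed": false,
 "input": [
  "from sklearn.feature_extraction.text import CountVectorizer\n",
  "docs = ['the cat sat', 'the cat sat on the mat']\n",
  "vectorizer = CountVectorizer()\n",
  "# each column counts the occurrences of one word in each document\n",
  "X_text = vectorizer.fit_transform(docs)\n",
  "print vectorizer.get_feature_names()\n",
  "print X_text.toarray()"
 ],
 "language": "python",
 "metadata": {},
 "outputs": []
}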
284 | ],
285 | "metadata": {}
286 | }
287 | ]
288 | }
--------------------------------------------------------------------------------
/rendered_notebooks/06.2_validation_exercise.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "metadata": {
3 | "name": "06B_validation_exercise"
4 | },
5 | "nbformat": 3,
6 | "nbformat_minor": 0,
7 | "worksheets": [
8 | {
9 | "cells": [
10 | {
11 | "cell_type": "heading",
12 | "level": 1,
13 | "metadata": {},
14 | "source": [
15 | "Exercise: Cross Validation and Model Selection"
16 | ]
17 | },
18 | {
19 | "cell_type": "code",
20 | "collapsed": false,
21 | "input": [
22 | "%pylab inline"
23 | ],
24 | "language": "python",
25 | "metadata": {},
26 | "outputs": [
27 | {
28 | "output_type": "stream",
29 | "stream": "stdout",
30 | "text": [
31 | "Populating the interactive namespace from numpy and matplotlib\n"
32 | ]
33 | }
34 | ],
35 | "prompt_number": 0
36 | },
37 | {
38 | "cell_type": "markdown",
39 | "metadata": {},
40 | "source": [
41 | "This exercise covers cross-validation of regression models on the Diabetes\n",
42 | "dataset. The diabetes data consists of 10 physiological variables (age, sex, weight, blood pressure)\n",
43 | "measure on 442 patients, and an indication of disease progression after one year:"
44 | ]
45 | },
46 | {
47 | "cell_type": "code",
48 | "collapsed": false,
49 | "input": [
50 | "from sklearn.datasets import load_diabetes\n",
51 | "data = load_diabetes()\n",
52 | "X, y = data.data, data.target"
53 | ],
54 | "language": "python",
55 | "metadata": {},
56 | "outputs": [
57 | {
58 | "output_type": "stream",
59 | "stream": "stderr",
60 | "text": [
61 | "/usr/lib/python2.7/dist-packages/scipy/stats/distributions.py:32: RuntimeWarning: numpy.ufunc size changed, may indicate binary incompatibility\n",
62 | " from . import vonmises_cython\n",
63 | "/usr/lib/python2.7/dist-packages/scipy/spatial/__init__.py:88: RuntimeWarning: numpy.ufunc size changed, may indicate binary incompatibility\n",
64 | " from .ckdtree import *\n",
65 | "/usr/lib/python2.7/dist-packages/scipy/spatial/__init__.py:89: RuntimeWarning: numpy.ufunc size changed, may indicate binary incompatibility\n",
66 | " from .qhull import *\n",
67 | "/usr/lib/python2.7/dist-packages/scipy/stats/stats.py:251: RuntimeWarning: numpy.ufunc size changed, may indicate binary incompatibility\n",
68 | " from ._rank import rankdata, tiecorrect\n"
69 | ]
70 | }
71 | ],
72 | "prompt_number": 1
73 | },
74 | {
75 | "cell_type": "code",
76 | "collapsed": false,
77 | "input": [
78 | "print X.shape"
79 | ],
80 | "language": "python",
81 | "metadata": {},
82 | "outputs": [
83 | {
84 | "output_type": "stream",
85 | "stream": "stdout",
86 | "text": [
87 | "(442, 10)\n"
88 | ]
89 | }
90 | ],
91 | "prompt_number": 2
92 | },
93 | {
94 | "cell_type": "code",
95 | "collapsed": false,
96 | "input": [
97 | "print y.shape"
98 | ],
99 | "language": "python",
100 | "metadata": {},
101 | "outputs": [
102 | {
103 | "output_type": "stream",
104 | "stream": "stdout",
105 | "text": [
106 | "(442,)\n"
107 | ]
108 | }
109 | ],
110 | "prompt_number": 3
111 | },
112 | {
113 | "cell_type": "markdown",
114 | "metadata": {},
115 | "source": [
116 | "Here we'll be fitting two regularized linear models,\n",
117 | "*Ridge Regression*, which uses $\\ell_2$ regularlization,\n",
118 | "and *Lasso Regression*, which uses $\\ell_1$ regularization."
119 | ]
120 | },
121 | {
122 | "cell_type": "code",
123 | "collapsed": false,
124 | "input": [
125 | "from sklearn.linear_model import Ridge, Lasso"
126 | ],
127 | "language": "python",
128 | "metadata": {},
129 | "outputs": [],
130 | "prompt_number": 4
131 | },
132 | {
133 | "cell_type": "markdown",
134 | "metadata": {},
135 | "source": [
136 | "We'll first use the default hyper-parameters to see the baseline estimator. We'll\n",
137 | "use the cross-validation score to determine goodness-of-fit."
138 | ]
139 | },
140 | {
141 | "cell_type": "code",
142 | "collapsed": false,
143 | "input": [
144 | "from sklearn.cross_validation import cross_val_score\n",
145 | "\n",
146 | "for Model in [Ridge, Lasso]:\n",
147 | " model = Model()\n",
148 | " print Model.__name__, cross_val_score(model, X, y).mean()"
149 | ],
150 | "language": "python",
151 | "metadata": {},
152 | "outputs": [
153 | {
154 | "output_type": "stream",
155 | "stream": "stdout",
156 | "text": [
157 | "Ridge 0.409427438303\n",
158 | "Lasso 0.353800083299\n"
159 | ]
160 | }
161 | ],
162 | "prompt_number": 5
163 | },
164 | {
165 | "cell_type": "markdown",
166 | "metadata": {},
167 | "source": [
168 | "We see that for the default hyper-parameter values, Lasso outperforms Ridge.\n",
169 | "But is this the case for the *optimal* hyperparameters of each model?"
170 | ]
171 | },
172 | {
173 | "cell_type": "heading",
174 | "level": 2,
175 | "metadata": {},
176 | "source": [
177 | "Exercise: Basic Hyperparameter Optimization"
178 | ]
179 | },
180 | {
181 | "cell_type": "markdown",
182 | "metadata": {},
183 | "source": [
184 | "Here spend some time writing a function which computes the cross-validation\n",
185 | "score as a function of ``alpha``, the strength of the regularization for\n",
186 | "``Lasso`` and ``Ridge``. We'll choose 20 values of ``alpha`` between\n",
187 | "0.0001 and 1:"
188 | ]
189 | },
190 | {
191 | "cell_type": "code",
192 | "collapsed": false,
193 | "input": [
194 | "alphas = np.logspace(-3, -1, 30)\n",
195 | "\n",
196 | "# plot the mean cross-validation score for a Ridge estimator and a Lasso estimator\n",
197 | "# as a function of alpha. Which is more difficult to tune?"
198 | ],
199 | "language": "python",
200 | "metadata": {},
201 | "outputs": [],
202 | "prompt_number": 6
203 | },
204 | {
205 | "cell_type": "heading",
206 | "level": 3,
207 | "metadata": {},
208 | "source": [
209 | "Solution"
210 | ]
211 | },
212 | {
213 | "cell_type": "code",
214 | "collapsed": false,
215 | "input": [
216 | "%load solutions/06B_basic_grid_search.py"
217 | ],
218 | "language": "python",
219 | "metadata": {},
220 | "outputs": [],
221 | "prompt_number": 7
222 | },
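{
 "cell_type": "markdown",
 "metadata": {},
 "source": [
  "One possible approach is sketched below (a rough outline only; the loaded solution file may differ in its details):"
 ]
},
{
 "cell_type": "code",
 "collapsed": false,
 "input": [
  "from sklearn.cross_validation import cross_val_score\n",
  "\n",
  "for Model in [Ridge, Lasso]:\n",
  "    # mean cross-validation score for each value of the regularization strength\n",
  "    scores = [cross_val_score(Model(alpha=alpha), X, y).mean() for alpha in alphas]\n",
  "    plt.semilogx(alphas, scores, label=Model.__name__)\n",
  "plt.legend(loc='best')\n",
  "plt.xlabel('alpha')\n",
  "plt.ylabel('mean cross-validation score')"
 ],
 "language": "python",
 "metadata": {},
 "outputs": []
},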
223 | {
224 | "cell_type": "heading",
225 | "level": 2,
226 | "metadata": {},
227 | "source": [
228 | "Automatically Performing Grid Search"
229 | ]
230 | },
231 | {
232 | "cell_type": "markdown",
233 | "metadata": {},
234 | "source": [
235 | "Because searching a grid of hyperparameters is such a common task, scikit-learn provides\n",
236 | "several hyper-parameter estimators to automate this. We'll explore this more in depth\n",
237 | "later in the tutorial, but for now it is interesting to see how ``GridSearchCV`` works:"
238 | ]
239 | },
240 | {
241 | "cell_type": "code",
242 | "collapsed": false,
243 | "input": [
244 | "from sklearn.grid_search import GridSearchCV"
245 | ],
246 | "language": "python",
247 | "metadata": {},
248 | "outputs": [],
249 | "prompt_number": 8
250 | },
251 | {
252 | "cell_type": "markdown",
253 | "metadata": {},
254 | "source": [
255 | "``GridSearchCV`` is constructed with an estimator, as well as a dictionary\n",
256 | "of parameter values to be searched. We can find the optimal parameters this\n",
257 | "way:"
258 | ]
259 | },
260 | {
261 | "cell_type": "code",
262 | "collapsed": false,
263 | "input": [
264 | "for Model in [Ridge, Lasso]:\n",
265 | " gscv = GridSearchCV(Model(), dict(alpha=alphas), cv=3).fit(X, y)\n",
266 | " print Model.__name__, gscv.best_params_"
267 | ],
268 | "language": "python",
269 | "metadata": {},
270 | "outputs": [
271 | {
272 | "output_type": "stream",
273 | "stream": "stdout",
274 | "text": [
275 | "Ridge {'alpha': 0.062101694189156162}\n",
276 | "Lasso"
277 | ]
278 | },
279 | {
280 | "output_type": "stream",
281 | "stream": "stdout",
282 | "text": [
283 | " {'alpha': 0.01268961003167922}\n"
284 | ]
285 | }
286 | ],
287 | "prompt_number": 9
288 | },
289 | {
290 | "cell_type": "heading",
291 | "level": 2,
292 | "metadata": {},
293 | "source": [
294 | "Built-in Hyperparameter Search"
295 | ]
296 | },
297 | {
298 | "cell_type": "markdown",
299 | "metadata": {},
300 | "source": [
301 | "For some models within scikit-learn, cross-validation can be performed more efficiently\n",
302 | "on large datasets. In this case, a cross-validated version of the particular model is\n",
303 | "included. The cross-validated versions of ``Ridge`` and ``Lasso`` are ``RidgeCV`` and\n",
304 | "``LassoCV``, respectively. The grid search on these estimators can be performed as\n",
305 | "follows:"
306 | ]
307 | },
308 | {
309 | "cell_type": "code",
310 | "collapsed": false,
311 | "input": [
312 | "from sklearn.linear_model import RidgeCV, LassoCV\n",
313 | "for Model in [RidgeCV, LassoCV]:\n",
314 | " model = Model(alphas=alphas, cv=3).fit(X, y)\n",
315 | " print Model.__name__, model.alpha_"
316 | ],
317 | "language": "python",
318 | "metadata": {},
319 | "outputs": [
320 | {
321 | "output_type": "stream",
322 | "stream": "stdout",
323 | "text": [
324 | "RidgeCV 0.0621016941892\n",
325 | "LassoCV 0.0126896100317\n"
326 | ]
327 | }
328 | ],
329 | "prompt_number": 10
330 | },
331 | {
332 | "cell_type": "markdown",
333 | "metadata": {},
334 | "source": [
335 | "We see that the results match those returned by ``GridSearchCV``."
336 | ]
337 | },
338 | {
339 | "cell_type": "heading",
340 | "level": 2,
341 | "metadata": {},
342 | "source": [
343 | "Exercise: Learning Curves"
344 | ]
345 | },
346 | {
347 | "cell_type": "markdown",
348 | "metadata": {},
349 | "source": [
350 | "Here we'll apply our learning curves to the diabetes data. The question to answer is this:\n",
351 | "\n",
352 | "- Given the optimal models above, which is over-fitting and which is under-fitting the data?\n",
353 | "- To obtain better results, would you invest time and effort in gathering\n",
354 | " more **training samples**, or gathering more **attributes** for each sample?\n",
355 | " Recall the previous discussion of reading learning curves.\n",
356 | "\n",
357 | "You can follow the process used in the previous notebook to plot the learning curves.\n",
358 | "A good metric to use is the ``mean_squared_error``, which we'll import below:"
359 | ]
360 | },
361 | {
362 | "cell_type": "code",
363 | "collapsed": false,
364 | "input": [
365 | "from sklearn.metrics import mean_squared_error\n",
366 | "# define a function that computes the learning curve (i.e. mean_squared_error as a function\n",
367 | "# of training set size, for both training and test sets) and plot the result\n"
368 | ],
369 | "language": "python",
370 | "metadata": {},
371 | "outputs": [],
372 | "prompt_number": 11
373 | },
374 | {
375 | "cell_type": "heading",
376 | "level": 3,
377 | "metadata": {},
378 | "source": [
379 | "Solution"
380 | ]
381 | },
382 | {
383 | "cell_type": "code",
384 | "collapsed": false,
385 | "input": [
386 | "%load solutions/06B_learning_curves.py"
387 | ],
388 | "language": "python",
389 | "metadata": {},
390 | "outputs": [],
391 | "prompt_number": 12
392 | }
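,
{
 "cell_type": "markdown",
 "metadata": {},
 "source": [
  "A rough sketch of such a learning-curve plot is given below (it assumes a single train/test split and growing training sizes; the loaded solution file may do this differently):"
 ]
},
{
 "cell_type": "code",
 "collapsed": false,
 "input": [
  "from sklearn.cross_validation import train_test_split\n",
  "from sklearn.metrics import mean_squared_error\n",
  "\n",
  "def plot_learning_curve(model, X, y, sizes):\n",
  "    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)\n",
  "    train_err, test_err = [], []\n",
  "    for n in sizes:\n",
  "        model.fit(X_train[:n], y_train[:n])\n",
  "        # training error on the n samples seen so far, test error on the held-out set\n",
  "        train_err.append(mean_squared_error(y_train[:n], model.predict(X_train[:n])))\n",
  "        test_err.append(mean_squared_error(y_test, model.predict(X_test)))\n",
  "    plt.plot(sizes, train_err, label='train')\n",
  "    plt.plot(sizes, test_err, label='test')\n",
  "    plt.xlabel('training set size')\n",
  "    plt.ylabel('mean squared error')\n",
  "    plt.legend(loc='best')\n",
  "\n",
  "# alpha ~ 0.06 is roughly the optimal Ridge value found above\n",
  "plot_learning_curve(Ridge(alpha=0.06), X, y, range(20, 300, 20))"
 ],
 "language": "python",
 "metadata": {},
 "outputs": []
}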
393 | ],
394 | "metadata": {}
395 | }
396 | ]
397 | }
--------------------------------------------------------------------------------
/svm_gui.py:
--------------------------------------------------------------------------------
1 | """
2 | ==========
3 | Libsvm GUI
4 | ==========
5 |
6 | A simple graphical frontend for Libsvm mainly intended for didactic
7 | purposes. You can create data points by point and click and visualize
8 | the decision region induced by different kernels and parameter settings.
9 |
10 | To create positive examples click the left mouse button; to create
11 | negative examples click the right button.
12 |
13 | If all examples are from the same class, it uses a one-class SVM.
14 |
15 | """
16 | from __future__ import division, print_function
17 |
18 | print(__doc__)
19 |
20 | # Author: Peter Prettenhoer
21 | #
22 | # License: BSD 3 clause
23 |
24 | import matplotlib
25 | matplotlib.use('TkAgg')
26 |
27 | from matplotlib.backends.backend_tkagg import FigureCanvasTkAgg
28 | from matplotlib.backends.backend_tkagg import NavigationToolbar2TkAgg
29 | from matplotlib.figure import Figure
30 | from matplotlib.contour import ContourSet
31 |
32 | import Tkinter as Tk
33 | import sys
34 | import numpy as np
35 |
36 | from sklearn import svm
37 | from sklearn.datasets import dump_svmlight_file
38 | from sklearn.externals.six.moves import xrange
39 |
40 | y_min, y_max = -50, 50
41 | x_min, x_max = -50, 50
42 |
43 |
44 | class Model(object):
45 | """The Model which hold the data. It implements the
46 | observable in the observer pattern and notifies the
47 | registered observers on change event.
48 | """
49 |
50 | def __init__(self):
51 | self.observers = []
52 | self.surface = None
53 | self.data = []
54 | self.cls = None
55 | self.surface_type = 0
56 |
57 | def changed(self, event):
58 | """Notify the observers. """
59 | for observer in self.observers:
60 | observer.update(event, self)
61 |
62 | def add_observer(self, observer):
63 | """Register an observer. """
64 | self.observers.append(observer)
65 |
66 | def set_surface(self, surface):
67 | self.surface = surface
68 |
69 | def dump_svmlight_file(self, file):
70 | data = np.array(self.data)
71 | X = data[:, 0:2]
72 | y = data[:, 2]
73 | dump_svmlight_file(X, y, file)
74 |
75 |
76 | class Controller(object):
77 | def __init__(self, model):
78 | self.model = model
79 | self.kernel = Tk.IntVar()
80 | self.surface_type = Tk.IntVar()
81 | # Whether or not a model has been fitted
82 | self.fitted = False
83 |
84 | def fit(self):
85 | print("fit the model")
86 | train = np.array(self.model.data)
87 | X = train[:, 0:2]
88 | y = train[:, 2]
89 |
90 | C = float(self.complexity.get())
91 | gamma = float(self.gamma.get())
92 | coef0 = float(self.coef0.get())
93 | degree = int(self.degree.get())
94 | kernel_map = {0: "linear", 1: "rbf", 2: "poly"}
95 | if len(np.unique(y)) == 1:
96 | clf = svm.OneClassSVM(kernel=kernel_map[self.kernel.get()],
97 | gamma=gamma, coef0=coef0, degree=degree)
98 | clf.fit(X)
99 | else:
100 | clf = svm.SVC(kernel=kernel_map[self.kernel.get()], C=C,
101 | gamma=gamma, coef0=coef0, degree=degree)
102 | clf.fit(X, y)
103 | if hasattr(clf, 'score'):
104 | print("Accuracy:", clf.score(X, y) * 100)
105 | X1, X2, Z = self.decision_surface(clf)
106 | self.model.clf = clf
107 | self.model.set_surface((X1, X2, Z))
108 | self.model.surface_type = self.surface_type.get()
109 | self.fitted = True
110 | self.model.changed("surface")
111 |
112 | def decision_surface(self, cls):
113 | delta = 1
114 | x = np.arange(x_min, x_max + delta, delta)
115 | y = np.arange(y_min, y_max + delta, delta)
116 | X1, X2 = np.meshgrid(x, y)
117 | Z = cls.decision_function(np.c_[X1.ravel(), X2.ravel()])
118 | Z = Z.reshape(X1.shape)
119 | return X1, X2, Z
120 |
121 | def clear_data(self):
122 | self.model.data = []
123 | self.fitted = False
124 | self.model.changed("clear")
125 |
126 | def add_example(self, x, y, label):
127 | self.model.data.append((x, y, label))
128 | self.model.changed("example_added")
129 |
130 | # update decision surface if already fitted.
131 | self.refit()
132 |
133 | def refit(self):
134 | """Refit the model if already fitted. """
135 | if self.fitted:
136 | self.fit()
137 |
138 |
139 | class View(object):
140 | """Test docstring. """
141 | def __init__(self, root, controller):
142 | f = Figure()
143 | ax = f.add_subplot(111)
144 | ax.set_xticks([])
145 | ax.set_yticks([])
146 | ax.set_xlim((x_min, x_max))
147 | ax.set_ylim((y_min, y_max))
148 | canvas = FigureCanvasTkAgg(f, master=root)
149 | canvas.show()
150 | canvas.get_tk_widget().pack(side=Tk.TOP, fill=Tk.BOTH, expand=1)
151 | canvas._tkcanvas.pack(side=Tk.TOP, fill=Tk.BOTH, expand=1)
152 | canvas.mpl_connect('button_press_event', self.onclick)
153 | toolbar = NavigationToolbar2TkAgg(canvas, root)
154 | toolbar.update()
155 | self.controllbar = ControllBar(root, controller)
156 | self.f = f
157 | self.ax = ax
158 | self.canvas = canvas
159 | self.controller = controller
160 | self.contours = []
161 | self.c_labels = None
162 | self.plot_kernels()
163 |
164 | def plot_kernels(self):
165 | self.ax.text(-50, -60, "Linear: $u^T v$")
166 | self.ax.text(-20, -60, "RBF: $\exp (-\gamma \| u-v \|^2)$")
167 | self.ax.text(10, -60, "Poly: $(\gamma \, u^T v + r)^d$")
168 |
169 | def onclick(self, event):
170 | if event.xdata and event.ydata:
171 | if event.button == 1:
172 | self.controller.add_example(event.xdata, event.ydata, 1)
173 | elif event.button == 3:
174 | self.controller.add_example(event.xdata, event.ydata, -1)
175 |
176 | def update_example(self, model, idx):
177 | x, y, l = model.data[idx]
178 | if l == 1:
179 | color = 'w'
180 | elif l == -1:
181 | color = 'k'
182 | self.ax.plot([x], [y], "%so" % color, scalex=0.0, scaley=0.0)
183 |
184 | def update(self, event, model):
185 | if event == "examples_loaded":
186 | for i in xrange(len(model.data)):
187 | self.update_example(model, i)
188 |
189 | if event == "example_added":
190 | self.update_example(model, -1)
191 |
192 | if event == "clear":
193 | self.ax.clear()
194 | self.ax.set_xticks([])
195 | self.ax.set_yticks([])
196 | self.contours = []
197 | self.c_labels = None
198 | self.plot_kernels()
199 |
200 | if event == "surface":
201 | self.remove_surface()
202 | self.plot_support_vectors(model.clf.support_vectors_)
203 | self.plot_decision_surface(model.surface, model.surface_type)
204 |
205 | self.canvas.draw()
206 |
207 | def remove_surface(self):
208 | """Remove old decision surface."""
209 | if len(self.contours) > 0:
210 | for contour in self.contours:
211 | if isinstance(contour, ContourSet):
212 | for lineset in contour.collections:
213 | lineset.remove()
214 | else:
215 | contour.remove()
216 | self.contours = []
217 |
218 | def plot_support_vectors(self, support_vectors):
219 | """Plot the support vectors by placing circles over the
220 | corresponding data points and add the circle collection
221 | to the contours list."""
222 | cs = self.ax.scatter(support_vectors[:, 0], support_vectors[:, 1],
223 | s=80, edgecolors="k", facecolors="none")
224 | self.contours.append(cs)
225 |
226 | def plot_decision_surface(self, surface, type):
227 | X1, X2, Z = surface
228 | if type == 0:
229 | levels = [-1.0, 0.0, 1.0]
230 | linestyles = ['dashed', 'solid', 'dashed']
231 | colors = 'k'
232 | self.contours.append(self.ax.contour(X1, X2, Z, levels,
233 | colors=colors,
234 | linestyles=linestyles))
235 | elif type == 1:
236 | self.contours.append(self.ax.contourf(X1, X2, Z, 10,
237 | cmap=matplotlib.cm.bone,
238 | origin='lower', alpha=0.85))
239 | self.contours.append(self.ax.contour(X1, X2, Z, [0.0], colors='k',
240 | linestyles=['solid']))
241 | else:
242 | raise ValueError("surface type unknown")
243 |
244 |
245 | class ControllBar(object):
246 | def __init__(self, root, controller):
247 | fm = Tk.Frame(root)
248 | kernel_group = Tk.Frame(fm)
249 | Tk.Radiobutton(kernel_group, text="Linear", variable=controller.kernel,
250 | value=0, command=controller.refit).pack(anchor=Tk.W)
251 | Tk.Radiobutton(kernel_group, text="RBF", variable=controller.kernel,
252 | value=1, command=controller.refit).pack(anchor=Tk.W)
253 | Tk.Radiobutton(kernel_group, text="Poly", variable=controller.kernel,
254 | value=2, command=controller.refit).pack(anchor=Tk.W)
255 | kernel_group.pack(side=Tk.LEFT)
256 |
257 | valbox = Tk.Frame(fm)
258 | controller.complexity = Tk.StringVar()
259 | controller.complexity.set("1.0")
260 | c = Tk.Frame(valbox)
261 | Tk.Label(c, text="C:", anchor="e", width=7).pack(side=Tk.LEFT)
262 | Tk.Entry(c, width=6, textvariable=controller.complexity).pack(
263 | side=Tk.LEFT)
264 | c.pack()
265 |
266 | controller.gamma = Tk.StringVar()
267 | controller.gamma.set("0.01")
268 | g = Tk.Frame(valbox)
269 | Tk.Label(g, text="gamma:", anchor="e", width=7).pack(side=Tk.LEFT)
270 | Tk.Entry(g, width=6, textvariable=controller.gamma).pack(side=Tk.LEFT)
271 | g.pack()
272 |
273 | controller.degree = Tk.StringVar()
274 | controller.degree.set("3")
275 | d = Tk.Frame(valbox)
276 | Tk.Label(d, text="degree:", anchor="e", width=7).pack(side=Tk.LEFT)
277 | Tk.Entry(d, width=6, textvariable=controller.degree).pack(side=Tk.LEFT)
278 | d.pack()
279 |
280 | controller.coef0 = Tk.StringVar()
281 | controller.coef0.set("0")
282 | r = Tk.Frame(valbox)
283 | Tk.Label(r, text="coef0:", anchor="e", width=7).pack(side=Tk.LEFT)
284 | Tk.Entry(r, width=6, textvariable=controller.coef0).pack(side=Tk.LEFT)
285 | r.pack()
286 | valbox.pack(side=Tk.LEFT)
287 |
288 | cmap_group = Tk.Frame(fm)
289 | Tk.Radiobutton(cmap_group, text="Hyperplanes",
290 | variable=controller.surface_type, value=0,
291 | command=controller.refit).pack(anchor=Tk.W)
292 | Tk.Radiobutton(cmap_group, text="Surface",
293 | variable=controller.surface_type, value=1,
294 | command=controller.refit).pack(anchor=Tk.W)
295 |
296 | cmap_group.pack(side=Tk.LEFT)
297 |
298 | train_button = Tk.Button(fm, text='Fit', width=5,
299 | command=controller.fit)
300 | train_button.pack()
301 | fm.pack(side=Tk.LEFT)
302 | Tk.Button(fm, text='Clear', width=5,
303 | command=controller.clear_data).pack(side=Tk.LEFT)
304 |
305 |
306 | def get_parser():
307 | from optparse import OptionParser
308 | op = OptionParser()
309 | op.add_option("--output",
310 | action="store", type="str", dest="output",
311 | help="Path where to dump data.")
312 | return op
313 |
314 |
315 | def main(argv):
316 | op = get_parser()
317 | opts, args = op.parse_args(argv[1:])
318 | root = Tk.Tk()
319 | model = Model()
320 | controller = Controller(model)
321 | root.wm_title("Scikit-learn Libsvm GUI")
322 | view = View(root, controller)
323 | model.add_observer(view)
324 | Tk.mainloop()
325 |
326 | if opts.output:
327 | model.dump_svmlight_file(opts.output)
328 |
329 | if __name__ == "__main__":
330 | main(sys.argv)
331 |
--------------------------------------------------------------------------------