├── .gitignore ├── Makefile ├── README.md ├── fetch_data.py ├── ipynbhelper.py └── notebooks ├── 01_introduction.ipynb ├── 02_representation_of_data.ipynb ├── 03_basic_principles_of_machine_learning.ipynb ├── 04_unsupervised_dimreduction.ipynb ├── 05_measuring_prediction_performance.ipynb ├── 06_sklearn_pandas_and_heterogeneous_data_modeling.ipynb ├── 07_sklearn_unstructured_data_face_recognition.ipynb ├── 08_sklearn_pandas_practical.ipynb ├── figures ├── ML_flow_chart.py ├── __init__.py ├── bias_variance.py ├── grid_search_cv_splits.png ├── grid_search_parameters.png ├── iris_setosa.jpg ├── iris_versicolor.jpg ├── iris_virginica.jpg ├── linear_regression.py ├── sgd_separator.py ├── supervised_scikit_learn.png └── svm_gui_frames.py ├── helpers.py ├── images ├── parallel_text_clf.png ├── parallel_text_clf_average.png └── predictive_modeling_data_flow.png └── solutions ├── 02A_faces_plot.py ├── 04A_plot_logistic_regression_weights.py ├── 04B_more_categorical_variables.py ├── 04C_feature_importance.py ├── 05B_houses_plot.py ├── 05B_houses_regression.py ├── 05C_validation_exercise.py ├── 07B_basic_grid_search.py ├── 07B_learning_curves.py └── 08A_digits_projection.py /.gitignore: -------------------------------------------------------------------------------- 1 | *~ 2 | *.pyc 3 | *.npy 4 | *.npz 5 | notebooks/figures/downloads/ 6 | *.ipynb_checkpoints 7 | *.mmap 8 | *.pkl 9 | datasets 10 | joblib 11 | -------------------------------------------------------------------------------- /Makefile: -------------------------------------------------------------------------------- 1 | # Makefile used to manage the git repository, not for the tutorial 2 | 3 | all: 4 | python ipynbhelper.py --check 5 | python ipynbhelper.py --render 6 | python ipynbhelper.py --clean 7 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | Introduction to predictive analytics with pandas and scikit-learn 2 | ================================================================= 3 | 4 | This repository contains notebooks to get started with predictive 5 | analytics using scikit-learn and pandas. 6 | 7 | This material is strongly inspired from the 8 | [EuroPython 2014 scikit-learn tutorial](https://github.com/GaelVaroquaux/sklearn_pandas_tutorial) 9 | 10 | * Olivier Grisel [@ogrisel](https://twitter.com/ogrisel) | 11 | http://ogrisel.com 12 | 13 | * Gael Varoquaux [@GaelVaroquaux](https://twitter.com/GaelVaroquaux) | 14 | http://gael-varoquaux.info 15 | 16 | which was inspired by http://github.com/jakevdp/sklearn_scipy2013 17 | by Jake VanderPlas [@jakevdp](https://twitter.com/jakevdp) | http://jakevdp.github.com 18 | 19 | Installation Notes 20 | ------------------ 21 | 22 | This tutorial will require recent installations of *numpy*, *scipy*, 23 | *matplotlib*, *scikit-learn*, *pandas* and *Pillow* (or PIL). 24 | 25 | For users who do not yet have these packages installed, a relatively 26 | painless way to install all the requirements is to use a package such as 27 | [Anaconda](http://continuum.io/downloads), which can be downloaded and 28 | installed for free. 29 | 30 | Please download in advance the datasets mentionned in [Data Downloads](#data-downloads) 31 | 32 | 33 | ### With the IPython/jupyter notebook 34 | 35 | The recommended way to access the materials is to execute them in the 36 | IPython/jupyter notebook. 
If you have the notebook installed, you should 37 | download the materials (see below), go the the `notebooks` directory, and 38 | launch IPython notebook from there by typing: 39 | 40 | cd notebooks 41 | jupyter notebook # ipython notebook if old version 42 | 43 | in your terminal window. This will open a notebook panel load in your web 44 | browser. 45 | 46 | Downloading the Tutorial Materials 47 | ---------------------------------- 48 | 49 | I would highly recommend using git, not only for this tutorial, but for the 50 | general betterment of your life. Once git is installed, you can clone the 51 | material in this tutorial by using the git address shown above: 52 | 53 | If you can't or don't want to install git, there is a link above to download 54 | the contents of this repository as a zip file. I may make minor changes to 55 | the repository in the days before the tutorial, however, so cloning the 56 | repository is a much better option. 57 | 58 | Data Downloads 59 | -------------- 60 | 61 | The data for this tutorial is not included in the repository. We will be 62 | using several data sets during the tutorial: most are built-in to 63 | scikit-learn, which includes code which automatically downloads and 64 | caches these data. Because the wireless network at conferences can often 65 | be spotty, it would be a good idea to download these data sets before 66 | arriving at the conference. You can do so by using the `fetch_data.py` 67 | included in the tutorial materials. 68 | 69 | You will also need: 70 | 71 | https://dl.dropboxusercontent.com/u/2140486/data/titanic_train.csv 72 | https://dl.dropboxusercontent.com/u/2140486/data/adult_train.csv 73 | -------------------------------------------------------------------------------- /fetch_data.py: -------------------------------------------------------------------------------- 1 | from sklearn.datasets import fetch_olivetti_faces 2 | from sklearn.datasets import fetch_lfw_people 3 | from sklearn.datasets import get_data_home 4 | 5 | 6 | if __name__ == "__main__": 7 | fetch_olivetti_faces() 8 | 9 | print("Loading Labeled Faces Data (~200MB)") 10 | fetch_lfw_people(min_faces_per_person=70, resize=0.4) 11 | print("=> Success!") 12 | print("Data saved in %s" % get_data_home()) 13 | -------------------------------------------------------------------------------- /ipynbhelper.py: -------------------------------------------------------------------------------- 1 | """Utility script to be used to cleanup the notebooks before git commit 2 | 3 | This a mix from @minrk's various gists. 
4 | 5 | """ 6 | 7 | import sys 8 | import os 9 | import io 10 | try: 11 | from queue import Empty 12 | except: 13 | from Queue import Empty 14 | 15 | from nbformat import current 16 | try: 17 | from jupyter_client import KernelManager 18 | assert KernelManager # to silence pyflakes 19 | except ImportError: 20 | # 0.13 21 | from IPython.zmq.blockingkernelmanager import BlockingKernelManager 22 | KernelManager = BlockingKernelManager 23 | 24 | 25 | def remove_outputs(nb): 26 | """Remove the outputs from a notebook""" 27 | for ws in nb.worksheets: 28 | for cell in ws.cells: 29 | if cell.cell_type == 'code': 30 | cell.outputs = [] 31 | if 'prompt_number' in cell: 32 | del cell['prompt_number'] 33 | 34 | 35 | def remove_signature(nb): 36 | """Remove the signature from a notebook""" 37 | if 'signature' in nb.metadata: 38 | del nb.metadata['signature'] 39 | 40 | 41 | def run_cell(shell, iopub, cell, timeout=300): 42 | if not hasattr(cell, 'input'): 43 | return [], False 44 | shell.send(cell.input) 45 | # wait for finish, maximum 5min by default 46 | reply = shell.get_msg(timeout=timeout)['content'] 47 | if reply['status'] == 'error': 48 | failed = True 49 | print("\nFAILURE:") 50 | print(cell.input) 51 | print('-----') 52 | print("raised:") 53 | print('\n'.join(reply['traceback'])) 54 | else: 55 | failed = False 56 | 57 | # Collect the outputs of the cell execution 58 | outs = [] 59 | while True: 60 | try: 61 | msg = iopub.get_msg(timeout=0.2) 62 | except Empty: 63 | break 64 | msg_type = msg['msg_type'] 65 | if msg_type in ('status', 'pyin'): 66 | continue 67 | elif msg_type == 'clear_output': 68 | outs = [] 69 | continue 70 | 71 | content = msg['content'] 72 | out = current.NotebookNode(output_type=msg_type) 73 | 74 | if msg_type == 'stream': 75 | out.stream = content['name'] 76 | out.text = content['data'] 77 | elif msg_type in ('display_data', 'pyout'): 78 | for mime, data in content['data'].items(): 79 | attr = mime.split('/')[-1].lower() 80 | # this gets most right, but fix svg+html, plain 81 | attr = attr.replace('+xml', '').replace('plain', 'text') 82 | setattr(out, attr, data) 83 | if msg_type == 'pyout': 84 | out.prompt_number = content['execution_count'] 85 | elif msg_type == 'pyerr': 86 | out.ename = content['ename'] 87 | out.evalue = content['evalue'] 88 | out.traceback = content['traceback'] 89 | else: 90 | print("unhandled iopub msg: %s" % msg_type) 91 | 92 | outs.append(out) 93 | return outs, failed 94 | 95 | 96 | def run_notebook(nb): 97 | km = KernelManager() 98 | km.start_kernel(stderr=open(os.devnull, 'w')) 99 | if hasattr(km, 'client'): 100 | kc = km.client() 101 | kc.start_channels() 102 | iopub = kc.iopub_channel 103 | else: 104 | # IPython 0.13 compat 105 | kc = km 106 | kc.start_channels() 107 | iopub = kc.sub_channel 108 | shell = kc.shell_channel 109 | 110 | # simple ping: 111 | shell.send("pass") 112 | shell.get_msg() 113 | 114 | cells = 0 115 | failures = 0 116 | for ws in nb.worksheets: 117 | for cell in ws.cells: 118 | if cell.cell_type != 'code': 119 | continue 120 | 121 | outputs, failed = run_cell(shell, iopub, cell) 122 | cell.outputs = outputs 123 | cell['prompt_number'] = cells 124 | failures += failed 125 | cells += 1 126 | sys.stdout.write('.') 127 | 128 | print() 129 | print("ran notebook %s" % nb.metadata.name) 130 | print(" ran %3i cells" % cells) 131 | if failures: 132 | print(" %3i cells raised exceptions" % failures) 133 | kc.stop_channels() 134 | km.shutdown_kernel() 135 | del km 136 | 137 | 138 | def process_notebook_file(fname, action='clean', 
output_fname=None): 139 | print("Performing '{}' on: {}".format(action, fname)) 140 | orig_wd = os.getcwd() 141 | with io.open(fname, 'r') as f: 142 | nb = current.read(f, 'json') 143 | 144 | if action == 'check': 145 | os.chdir(os.path.dirname(fname)) 146 | run_notebook(nb) 147 | remove_outputs(nb) 148 | remove_signature(nb) 149 | elif action == 'render': 150 | os.chdir(os.path.dirname(fname)) 151 | run_notebook(nb) 152 | else: 153 | # Clean by default 154 | remove_outputs(nb) 155 | remove_signature(nb) 156 | 157 | os.chdir(orig_wd) 158 | if output_fname is None: 159 | output_fname = fname 160 | with io.open(output_fname, 'w') as f: 161 | nb = current.write(nb, f, 'json') 162 | 163 | 164 | if __name__ == '__main__': 165 | # TODO: use argparse instead 166 | args = sys.argv[1:] 167 | targets = [t for t in args if not t.startswith('--')] 168 | action = 'check' if '--check' in args else 'clean' 169 | action = 'render' if '--render' in args else action 170 | 171 | rendered_folder = os.path.join(os.path.dirname(__file__), 172 | 'rendered_notebooks') 173 | if not os.path.exists(rendered_folder): 174 | os.makedirs(rendered_folder) 175 | if not targets: 176 | targets = [os.path.join(os.path.dirname(__file__), 'notebooks')] 177 | 178 | for target in targets: 179 | if os.path.isdir(target): 180 | fnames = [os.path.abspath(os.path.join(target, f)) 181 | for f in os.listdir(target) 182 | if f.endswith('.ipynb')] 183 | else: 184 | fnames = [target] 185 | for fname in fnames: 186 | if action == 'render': 187 | output_fname = os.path.join(rendered_folder, 188 | os.path.basename(fname)) 189 | else: 190 | output_fname = fname 191 | process_notebook_file(fname, action=action, 192 | output_fname=output_fname) 193 | -------------------------------------------------------------------------------- /notebooks/01_introduction.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# What is machine learning?" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "In this section we will begin to explore the basic principles of machine learning.\n", 15 | "Machine Learning is about building programs with **tunable parameters** (typically an\n", 16 | "array of floating point values) that are adjusted automatically so as to improve\n", 17 | "their behavior by **adapting to previously seen data.**\n", 18 | "\n", 19 | "Machine Learning can be considered a subfield of **Artificial Intelligence** since those\n", 20 | "algorithms can be seen as building blocks to make computers learn to behave more\n", 21 | "intelligently by somehow **generalizing** rather that just storing and retrieving data items\n", 22 | "like a database system would do.\n", 23 | "\n", 24 | "We'll take a look at two very simple machine learning tasks here.\n", 25 | "The first is a **classification** task: the figure shows a\n", 26 | "collection of two-dimensional data, colored according to two different class\n", 27 | "labels. 
A classification algorithm may be used to draw a dividing boundary\n", 28 | "between the two clusters of points:" 29 | ] 30 | }, 31 | { 32 | "cell_type": "code", 33 | "execution_count": null, 34 | "metadata": { 35 | "collapsed": false 36 | }, 37 | "outputs": [], 38 | "source": [ 39 | "# Start pylab inline mode, so figures will appear in the notebook\n", 40 | "%matplotlib inline" 41 | ] 42 | }, 43 | { 44 | "cell_type": "code", 45 | "execution_count": null, 46 | "metadata": { 47 | "collapsed": false 48 | }, 49 | "outputs": [], 50 | "source": [ 51 | "# Import the example plot from the figures directory\n", 52 | "from figures import plot_sgd_separator\n", 53 | "plot_sgd_separator()" 54 | ] 55 | }, 56 | { 57 | "cell_type": "markdown", 58 | "metadata": {}, 59 | "source": [ 60 | "This may seem like a trivial task, but it is a simple version of a very important concept.\n", 61 | "By drawing this separating line, we have learned a model which can **generalize** to new\n", 62 | "data: if you were to drop another point onto the plane which is unlabeled, this algorithm\n", 63 | "could now **predict** whether it's a blue or a red point.\n", 64 | "\n", 65 | "If you'd like to see the source code used to generate this, you can either open the\n", 66 | "code in the `figures` directory, or you can load the code using the `%load` magic command:" 67 | ] 68 | }, 69 | { 70 | "cell_type": "code", 71 | "execution_count": null, 72 | "metadata": { 73 | "collapsed": false 74 | }, 75 | "outputs": [], 76 | "source": [ 77 | "%load figures/sgd_separator.py" 78 | ] 79 | }, 80 | { 81 | "cell_type": "markdown", 82 | "metadata": {}, 83 | "source": [ 84 | "The next simple task we'll look at is a **regression** task: a simple best-fit line\n", 85 | "to a set of data:" 86 | ] 87 | }, 88 | { 89 | "cell_type": "code", 90 | "execution_count": null, 91 | "metadata": { 92 | "collapsed": false 93 | }, 94 | "outputs": [], 95 | "source": [ 96 | "from figures import plot_linear_regression\n", 97 | "plot_linear_regression()" 98 | ] 99 | }, 100 | { 101 | "cell_type": "markdown", 102 | "metadata": {}, 103 | "source": [ 104 | "Again, this is an example of fitting a model to data, such that the model can make\n", 105 | "generalizations about new data. The model has been **learned** from the training\n", 106 | "data, and can be used to predict the result of test data:\n", 107 | "here, we might be given an x-value, and the model would\n", 108 | "allow us to predict the y value. Again, this might seem like a trivial problem,\n", 109 | "but it is a basic example of a type of operation that is fundamental to\n", 110 | "machine learning tasks." 
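The same idea fits in a few lines of scikit-learn code. The snippet below is not part of the original notebook; it is a minimal sketch, using made-up toy numbers, of fitting a straight line and then predicting the y value for a previously unseen x:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy training data: scikit-learn expects X as a 2D array of shape (n_samples, n_features)
X_train = np.array([[0.0], [1.0], [2.0], [3.0]])
y_train = np.array([0.1, 0.9, 2.1, 2.9])  # roughly y = x

model = LinearRegression()
model.fit(X_train, y_train)

# Given a new x value, the fitted model predicts the corresponding y value
x_new = np.array([[4.0]])
print(model.predict(x_new))  # close to 4
```

The `[[...]]` shape matters: scikit-learn expects a 2D array of shape `(n_samples, n_features)`, a convention discussed in detail in the next notebook.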
111 | ] 112 | }, 113 | { 114 | "cell_type": "markdown", 115 | "metadata": {}, 116 | "source": [ 117 | "# An Overview of Scikit-learn" 118 | ] 119 | }, 120 | { 121 | "cell_type": "markdown", 122 | "metadata": {}, 123 | "source": [ 124 | "*Adapted from* [*http://scikit-learn.org/stable/tutorial/basic/tutorial.html*](http://scikit-learn.org/stable/tutorial/basic/tutorial.html)" 125 | ] 126 | }, 127 | { 128 | "cell_type": "code", 129 | "execution_count": null, 130 | "metadata": { 131 | "collapsed": false 132 | }, 133 | "outputs": [], 134 | "source": [ 135 | "%matplotlib inline\n", 136 | "import numpy as np\n", 137 | "from matplotlib import pyplot as plt" 138 | ] 139 | }, 140 | { 141 | "cell_type": "markdown", 142 | "metadata": {}, 143 | "source": [ 144 | "## Loading an Example Dataset" 145 | ] 146 | }, 147 | { 148 | "cell_type": "code", 149 | "execution_count": null, 150 | "metadata": { 151 | "collapsed": false 152 | }, 153 | "outputs": [], 154 | "source": [ 155 | "from sklearn import datasets\n", 156 | "digits = datasets.load_digits()" 157 | ] 158 | }, 159 | { 160 | "cell_type": "code", 161 | "execution_count": null, 162 | "metadata": { 163 | "collapsed": false 164 | }, 165 | "outputs": [], 166 | "source": [ 167 | "digits.data" 168 | ] 169 | }, 170 | { 171 | "cell_type": "code", 172 | "execution_count": null, 173 | "metadata": { 174 | "collapsed": false 175 | }, 176 | "outputs": [], 177 | "source": [ 178 | "digits.target" 179 | ] 180 | }, 181 | { 182 | "cell_type": "code", 183 | "execution_count": null, 184 | "metadata": { 185 | "collapsed": false 186 | }, 187 | "outputs": [], 188 | "source": [ 189 | "digits.images[0]" 190 | ] 191 | }, 192 | { 193 | "cell_type": "markdown", 194 | "metadata": {}, 195 | "source": [ 196 | "## Learning and Predicting" 197 | ] 198 | }, 199 | { 200 | "cell_type": "code", 201 | "execution_count": null, 202 | "metadata": { 203 | "collapsed": false 204 | }, 205 | "outputs": [], 206 | "source": [ 207 | "from sklearn import svm\n", 208 | "clf = svm.SVC(gamma=0.001, C=100.)" 209 | ] 210 | }, 211 | { 212 | "cell_type": "code", 213 | "execution_count": null, 214 | "metadata": { 215 | "collapsed": false 216 | }, 217 | "outputs": [], 218 | "source": [ 219 | "clf.fit(digits.data[:-1], digits.target[:-1])" 220 | ] 221 | }, 222 | { 223 | "cell_type": "code", 224 | "execution_count": null, 225 | "metadata": { 226 | "collapsed": false 227 | }, 228 | "outputs": [], 229 | "source": [ 230 | "clf.predict(digits.data[-1:, :])" 231 | ] 232 | }, 233 | { 234 | "cell_type": "code", 235 | "execution_count": null, 236 | "metadata": { 237 | "collapsed": false 238 | }, 239 | "outputs": [], 240 | "source": [ 241 | "plt.figure(figsize=(2, 2))\n", 242 | "plt.imshow(digits.images[-1], interpolation='nearest', cmap=plt.cm.binary)" 243 | ] 244 | }, 245 | { 246 | "cell_type": "code", 247 | "execution_count": null, 248 | "metadata": { 249 | "collapsed": false 250 | }, 251 | "outputs": [], 252 | "source": [ 253 | "print(digits.target[-1])" 254 | ] 255 | } 256 | ], 257 | "metadata": { 258 | "kernelspec": { 259 | "display_name": "Python 2", 260 | "language": "python", 261 | "name": "python2" 262 | }, 263 | "language_info": { 264 | "codemirror_mode": { 265 | "name": "ipython", 266 | "version": 2 267 | }, 268 | "file_extension": ".py", 269 | "mimetype": "text/x-python", 270 | "name": "python", 271 | "nbconvert_exporter": "python", 272 | "pygments_lexer": "ipython2", 273 | "version": "2.7.11" 274 | } 275 | }, 276 | "nbformat": 4, 277 | "nbformat_minor": 0 278 | } 279 | 
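As a complement to the cells above (this sketch is not in the original notebook), the same digits classifier can be evaluated on a proper held-out test set rather than on a single left-out sample. Note that `train_test_split` lives in `sklearn.model_selection` in recent scikit-learn releases (older releases have it in `sklearn.cross_validation`):

```python
from sklearn import datasets, svm
from sklearn.model_selection import train_test_split  # sklearn.cross_validation in old releases

digits = datasets.load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.25, random_state=0)

# Same hyperparameters as in the notebook cell above
clf = svm.SVC(gamma=0.001, C=100.)
clf.fit(X_train, y_train)

# Mean accuracy on the held-out quarter of the data
print(clf.score(X_test, y_test))
```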
-------------------------------------------------------------------------------- /notebooks/02_representation_of_data.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Representation and Visualization of Data" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "Machine learning is about creating models from data: for that reason, we'll start by\n", 15 | "discussing how data can be represented in order to be understood by the computer. Along\n", 16 | "with this, we'll build on our matplotlib examples from the previous section and show some\n", 17 | "examples of how to visualize data.\n", 18 | "\n", 19 | "By the end of this section you should:\n", 20 | "\n", 21 | "- Know the internal data representation of scikit-learn.\n", 22 | "- Know how to use scikit-learn's dataset loaders to load example data.\n", 23 | "- Know how to use matplotlib to help visualize different types of data." 24 | ] 25 | }, 26 | { 27 | "cell_type": "markdown", 28 | "metadata": {}, 29 | "source": [ 30 | "## Data in scikit-learn" 31 | ] 32 | }, 33 | { 34 | "cell_type": "markdown", 35 | "metadata": {}, 36 | "source": [ 37 | "Data in scikit-learn, with very few exceptions, is assumed to be stored as a\n", 38 | "**two-dimensional array**, of size `[n_samples, n_features]`." 39 | ] 40 | }, 41 | { 42 | "cell_type": "markdown", 43 | "metadata": {}, 44 | "source": [ 45 | "Most machine learning algorithms implemented in scikit-learn expect data to be stored in a\n", 46 | "**two-dimensional array or matrix**. The arrays can be\n", 47 | "either ``numpy`` arrays, or in some cases ``scipy.sparse`` matrices.\n", 48 | "The size of the array is expected to be `[n_samples, n_features]`\n", 49 | "\n", 50 | "- **n_samples:** The number of samples: each sample is an item to process (e.g. classify).\n", 51 | " A sample can be a document, a picture, a sound, a video, an astronomical object,\n", 52 | " a row in database or CSV file,\n", 53 | " or whatever you can describe with a fixed set of quantitative traits.\n", 54 | "- **n_features:** The number of features or distinct traits that can be used to describe each\n", 55 | " item in a quantitative manner. Features are generally real-valued, but may be boolean or\n", 56 | " discrete-valued in some cases.\n", 57 | "\n", 58 | "The number of features must be fixed in advance. However it can be very high dimensional\n", 59 | "(e.g. millions of features) with most of them being zeros for a given sample. This is a case\n", 60 | "where `scipy.sparse` matrices can be useful, in that they are\n", 61 | "much more memory-efficient than numpy arrays." 62 | ] 63 | }, 64 | { 65 | "cell_type": "markdown", 66 | "metadata": {}, 67 | "source": [ 68 | "### A Simple Example: the Iris Dataset" 69 | ] 70 | }, 71 | { 72 | "cell_type": "markdown", 73 | "metadata": {}, 74 | "source": [ 75 | "As an example of a simple dataset, we're going to take a look at the iris data stored by scikit-learn.\n", 76 | "The data consists of measurements of three different species of irises. 
There are three species of iris\n", 77 | "in the dataset, which we can picture here:" 78 | ] 79 | }, 80 | { 81 | "cell_type": "code", 82 | "execution_count": null, 83 | "metadata": { 84 | "collapsed": false 85 | }, 86 | "outputs": [], 87 | "source": [ 88 | "from IPython.core.display import Image, display\n", 89 | "display(Image(filename='figures/iris_setosa.jpg'))\n", 90 | "print(\"Iris Setosa\\n\")\n", 91 | "\n", 92 | "display(Image(filename='figures/iris_versicolor.jpg'))\n", 93 | "print(\"Iris Versicolor\\n\")\n", 94 | "\n", 95 | "display(Image(filename='figures/iris_virginica.jpg'))\n", 96 | "print(\"Iris Virginica\")" 97 | ] 98 | }, 99 | { 100 | "cell_type": "markdown", 101 | "metadata": {}, 102 | "source": [ 103 | "### Quick Question:" 104 | ] 105 | }, 106 | { 107 | "cell_type": "markdown", 108 | "metadata": {}, 109 | "source": [ 110 | "**If we want to design an algorithm to recognize iris species, what might the data be?**\n", 111 | "\n", 112 | "Remember: we need a 2D array of size `[n_samples x n_features]`.\n", 113 | "\n", 114 | "- What would the `n_samples` refer to?\n", 115 | "\n", 116 | "- What might the `n_features` refer to?\n", 117 | "\n", 118 | "Remember that there must be a **fixed** number of features for each sample, and feature\n", 119 | "number ``i`` must be a similar kind of quantity for each sample." 120 | ] 121 | }, 122 | { 123 | "cell_type": "markdown", 124 | "metadata": {}, 125 | "source": [ 126 | "### Loading the Iris Data with Scikit-learn" 127 | ] 128 | }, 129 | { 130 | "cell_type": "markdown", 131 | "metadata": {}, 132 | "source": [ 133 | "Scikit-learn has a very straightforward set of data on these iris species. The data consist of\n", 134 | "the following:\n", 135 | "\n", 136 | "- Features in the Iris dataset:\n", 137 | "\n", 138 | " 1. sepal length in cm\n", 139 | " 2. sepal width in cm\n", 140 | " 3. petal length in cm\n", 141 | " 4. petal width in cm\n", 142 | "\n", 143 | "- Target classes to predict:\n", 144 | "\n", 145 | " 1. Iris Setosa\n", 146 | " 2. Iris Versicolour\n", 147 | " 3. 
Iris Virginica" 148 | ] 149 | }, 150 | { 151 | "cell_type": "markdown", 152 | "metadata": {}, 153 | "source": [ 154 | "``scikit-learn`` embeds a copy of the iris CSV file along with a helper function to load it into numpy arrays:" 155 | ] 156 | }, 157 | { 158 | "cell_type": "code", 159 | "execution_count": null, 160 | "metadata": { 161 | "collapsed": false 162 | }, 163 | "outputs": [], 164 | "source": [ 165 | "from sklearn.datasets import load_iris\n", 166 | "iris = load_iris()" 167 | ] 168 | }, 169 | { 170 | "cell_type": "markdown", 171 | "metadata": {}, 172 | "source": [ 173 | "The resulting dataset is a ``Bunch`` object: you can see what's available using\n", 174 | "the method ``keys()``:" 175 | ] 176 | }, 177 | { 178 | "cell_type": "code", 179 | "execution_count": null, 180 | "metadata": { 181 | "collapsed": false 182 | }, 183 | "outputs": [], 184 | "source": [ 185 | "iris.keys()" 186 | ] 187 | }, 188 | { 189 | "cell_type": "markdown", 190 | "metadata": {}, 191 | "source": [ 192 | "The features of each sample flower are stored in the ``data`` attribute of the dataset:" 193 | ] 194 | }, 195 | { 196 | "cell_type": "code", 197 | "execution_count": null, 198 | "metadata": { 199 | "collapsed": false 200 | }, 201 | "outputs": [], 202 | "source": [ 203 | "n_samples, n_features = iris.data.shape\n", 204 | "print(n_samples)\n", 205 | "print(n_features)\n", 206 | "print(iris.data[0])" 207 | ] 208 | }, 209 | { 210 | "cell_type": "markdown", 211 | "metadata": {}, 212 | "source": [ 213 | "The information about the class of each sample is stored in the ``target`` attribute of the dataset:" 214 | ] 215 | }, 216 | { 217 | "cell_type": "code", 218 | "execution_count": null, 219 | "metadata": { 220 | "collapsed": false 221 | }, 222 | "outputs": [], 223 | "source": [ 224 | "print(iris.data.shape)\n", 225 | "print(iris.target.shape)" 226 | ] 227 | }, 228 | { 229 | "cell_type": "code", 230 | "execution_count": null, 231 | "metadata": { 232 | "collapsed": false 233 | }, 234 | "outputs": [], 235 | "source": [ 236 | "print(iris.target)" 237 | ] 238 | }, 239 | { 240 | "cell_type": "markdown", 241 | "metadata": {}, 242 | "source": [ 243 | "The names of the classes are stored in the last attribute, namely ``target_names``:" 244 | ] 245 | }, 246 | { 247 | "cell_type": "code", 248 | "execution_count": null, 249 | "metadata": { 250 | "collapsed": false 251 | }, 252 | "outputs": [], 253 | "source": [ 254 | "print(iris.target_names)" 255 | ] 256 | }, 257 | { 258 | "cell_type": "markdown", 259 | "metadata": {}, 260 | "source": [ 261 | "This data is four dimensional, but we can visualize two of the dimensions\n", 262 | "at a time using a simple scatter-plot. 
Again, we'll start by enabling\n", 263 | "matplotlib inline mode:" 264 | ] 265 | }, 266 | { 267 | "cell_type": "code", 268 | "execution_count": null, 269 | "metadata": { 270 | "collapsed": false 271 | }, 272 | "outputs": [], 273 | "source": [ 274 | "%matplotlib inline\n", 275 | "from matplotlib import pyplot as plt" 276 | ] 277 | }, 278 | { 279 | "cell_type": "code", 280 | "execution_count": null, 281 | "metadata": { 282 | "collapsed": false 283 | }, 284 | "outputs": [], 285 | "source": [ 286 | "x_index = 0\n", 287 | "y_index = 1\n", 288 | "\n", 289 | "# this formatter will label the colorbar with the correct target names\n", 290 | "formatter = plt.FuncFormatter(lambda i, *args: iris.target_names[int(i)])\n", 291 | "\n", 292 | "plt.scatter(iris.data[:, x_index], iris.data[:, y_index], c=iris.target)\n", 293 | "plt.colorbar(ticks=[0, 1, 2], format=formatter)\n", 294 | "plt.xlabel(iris.feature_names[x_index])\n", 295 | "plt.ylabel(iris.feature_names[y_index])" 296 | ] 297 | }, 298 | { 299 | "cell_type": "markdown", 300 | "metadata": {}, 301 | "source": [ 302 | "### Quick Exercise:" 303 | ] 304 | }, 305 | { 306 | "cell_type": "markdown", 307 | "metadata": {}, 308 | "source": [ 309 | "**Change** `x_index` **and** `y_index` **in the above script\n", 310 | "and find a combination of two parameters\n", 311 | "which maximally separate the three classes.**\n", 312 | "\n", 313 | "This exercise is a preview of **dimensionality reduction**, which we'll see later." 314 | ] 315 | }, 316 | { 317 | "cell_type": "markdown", 318 | "metadata": {}, 319 | "source": [ 320 | "## Other Available Data" 321 | ] 322 | }, 323 | { 324 | "cell_type": "markdown", 325 | "metadata": {}, 326 | "source": [ 327 | "Scikit-learn makes available a host of datasets for testing learning algorithms.\n", 328 | "They come in three flavors:\n", 329 | "\n", 330 | "- **Packaged Data:** these small datasets are packaged with the scikit-learn installation,\n", 331 | " and can be downloaded using the tools in ``sklearn.datasets.load_*``\n", 332 | "- **Downloadable Data:** these larger datasets are available for download, and scikit-learn\n", 333 | " includes tools which streamline this process. These tools can be found in\n", 334 | " ``sklearn.datasets.fetch_*``\n", 335 | "- **Generated Data:** there are several datasets which are generated from models based on a\n", 336 | " random seed. These are available in the ``sklearn.datasets.make_*``\n", 337 | "\n", 338 | "You can explore the available dataset loaders, fetchers, and generators using IPython's\n", 339 | "tab-completion functionality. After importing the ``datasets`` submodule from ``sklearn``,\n", 340 | "type\n", 341 | "\n", 342 | " datasets.load_\n", 343 | "\n", 344 | "or\n", 345 | "\n", 346 | " datasets.fetch_\n", 347 | "\n", 348 | "or\n", 349 | "\n", 350 | " datasets.make_\n", 351 | "\n", 352 | "to see a list of available functions." 
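If you prefer not to rely on tab-completion, the same list can be produced programmatically. This is a small sketch, not part of the original notebook:

```python
from sklearn import datasets

# Emulate the tab-completion above: list the loaders, fetchers and generators
for prefix in ('load_', 'fetch_', 'make_'):
    names = [name for name in dir(datasets) if name.startswith(prefix)]
    print("%s*: %s" % (prefix, ', '.join(sorted(names))))
```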
353 | ] 354 | }, 355 | { 356 | "cell_type": "code", 357 | "execution_count": null, 358 | "metadata": { 359 | "collapsed": false 360 | }, 361 | "outputs": [], 362 | "source": [ 363 | "from sklearn import datasets" 364 | ] 365 | }, 366 | { 367 | "cell_type": "markdown", 368 | "metadata": {}, 369 | "source": [ 370 | "The data downloaded using the ``fetch_`` scripts are stored locally,\n", 371 | "within a subdirectory of your home directory.\n", 372 | "You can use the following to determine where it is:" 373 | ] 374 | }, 375 | { 376 | "cell_type": "code", 377 | "execution_count": null, 378 | "metadata": { 379 | "collapsed": false 380 | }, 381 | "outputs": [], 382 | "source": [ 383 | "from sklearn.datasets import get_data_home\n", 384 | "get_data_home()" 385 | ] 386 | }, 387 | { 388 | "cell_type": "code", 389 | "execution_count": null, 390 | "metadata": { 391 | "collapsed": false 392 | }, 393 | "outputs": [], 394 | "source": [ 395 | "!ls $HOME/scikit_learn_data/" 396 | ] 397 | }, 398 | { 399 | "cell_type": "markdown", 400 | "metadata": {}, 401 | "source": [ 402 | "Be warned: many of these datasets are quite large, and can take a long time to download!\n", 403 | "(especially on Conference wifi).\n", 404 | "\n", 405 | "If you start a download within the IPython notebook\n", 406 | "and you want to kill it, you can use ipython's \"kernel interrupt\" feature, available in the menu." 407 | ] 408 | }, 409 | { 410 | "cell_type": "markdown", 411 | "metadata": {}, 412 | "source": [ 413 | "## Loading Digits Data" 414 | ] 415 | }, 416 | { 417 | "cell_type": "markdown", 418 | "metadata": {}, 419 | "source": [ 420 | "Now we'll take a look at another dataset, one where we have to put a bit\n", 421 | "more thought into how to represent the data. We can explore the data in\n", 422 | "a similar manner as above:" 423 | ] 424 | }, 425 | { 426 | "cell_type": "code", 427 | "execution_count": null, 428 | "metadata": { 429 | "collapsed": false 430 | }, 431 | "outputs": [], 432 | "source": [ 433 | "from sklearn.datasets import load_digits\n", 434 | "digits = load_digits()" 435 | ] 436 | }, 437 | { 438 | "cell_type": "code", 439 | "execution_count": null, 440 | "metadata": { 441 | "collapsed": false 442 | }, 443 | "outputs": [], 444 | "source": [ 445 | "digits.keys()" 446 | ] 447 | }, 448 | { 449 | "cell_type": "code", 450 | "execution_count": null, 451 | "metadata": { 452 | "collapsed": false 453 | }, 454 | "outputs": [], 455 | "source": [ 456 | "n_samples, n_features = digits.data.shape\n", 457 | "print(n_samples, n_features)" 458 | ] 459 | }, 460 | { 461 | "cell_type": "code", 462 | "execution_count": null, 463 | "metadata": { 464 | "collapsed": false 465 | }, 466 | "outputs": [], 467 | "source": [ 468 | "print(digits.data[0])\n", 469 | "print(digits.target)" 470 | ] 471 | }, 472 | { 473 | "cell_type": "markdown", 474 | "metadata": {}, 475 | "source": [ 476 | "The target here is just the digit represented by the data. The data is an array of\n", 477 | "length 64... but what does this data mean?" 478 | ] 479 | }, 480 | { 481 | "cell_type": "markdown", 482 | "metadata": {}, 483 | "source": [ 484 | "There's a clue in the fact that we have two versions of the data array:\n", 485 | "``data`` and ``images``. 
Let's take a look at them:" 486 | ] 487 | }, 488 | { 489 | "cell_type": "code", 490 | "execution_count": null, 491 | "metadata": { 492 | "collapsed": false 493 | }, 494 | "outputs": [], 495 | "source": [ 496 | "print(digits.data.shape)\n", 497 | "print(digits.images.shape)" 498 | ] 499 | }, 500 | { 501 | "cell_type": "markdown", 502 | "metadata": {}, 503 | "source": [ 504 | "We can see that they're related by a simple reshaping:" 505 | ] 506 | }, 507 | { 508 | "cell_type": "code", 509 | "execution_count": null, 510 | "metadata": { 511 | "collapsed": false 512 | }, 513 | "outputs": [], 514 | "source": [ 515 | "import numpy as np # numpy!\n", 516 | "print(np.all(digits.images.reshape((1797, 64)) == digits.data))" 517 | ] 518 | }, 519 | { 520 | "cell_type": "markdown", 521 | "metadata": {}, 522 | "source": [ 523 | "*Aside... numpy and memory efficiency:*\n", 524 | "\n", 525 | "*You might wonder whether duplicating the data is a problem. In this case, the memory\n", 526 | "overhead is very small. Even though the arrays are different shapes, they point to the\n", 527 | "same memory block, which we can see by doing a bit of digging into the guts of numpy:*" 528 | ] 529 | }, 530 | { 531 | "cell_type": "code", 532 | "execution_count": null, 533 | "metadata": { 534 | "collapsed": false 535 | }, 536 | "outputs": [], 537 | "source": [ 538 | "print(digits.data.__array_interface__['data'])\n", 539 | "print(digits.images.__array_interface__['data'])" 540 | ] 541 | }, 542 | { 543 | "cell_type": "markdown", 544 | "metadata": {}, 545 | "source": [ 546 | "*The long integer here is a memory address: the fact that the two are the same tells\n", 547 | "us that the two arrays are looking at the same data.*" 548 | ] 549 | }, 550 | { 551 | "cell_type": "markdown", 552 | "metadata": {}, 553 | "source": [ 554 | "Let's visualize the data. It's little bit more involved than the simple scatter-plot\n", 555 | "we used above, but we can do it rather quickly." 556 | ] 557 | }, 558 | { 559 | "cell_type": "code", 560 | "execution_count": null, 561 | "metadata": { 562 | "collapsed": false 563 | }, 564 | "outputs": [], 565 | "source": [ 566 | "# set up the figure\n", 567 | "fig = plt.figure(figsize=(6, 6)) # figure size in inches\n", 568 | "fig.subplots_adjust(left=0, right=1, bottom=0, top=1, hspace=0.05, wspace=0.05)\n", 569 | "\n", 570 | "# plot the digits: each image is 8x8 pixels\n", 571 | "for i in range(64):\n", 572 | " ax = fig.add_subplot(8, 8, i + 1, xticks=[], yticks=[])\n", 573 | " ax.imshow(digits.images[i], cmap=plt.cm.binary, interpolation='nearest')\n", 574 | " \n", 575 | " # label the image with the target value\n", 576 | " ax.text(0, 7, str(digits.target[i]))" 577 | ] 578 | }, 579 | { 580 | "cell_type": "markdown", 581 | "metadata": {}, 582 | "source": [ 583 | "We see now what the features mean. Each feature is a real-valued quantity representing the\n", 584 | "darkness of a pixel in an 8x8 image of a hand-written digit.\n", 585 | "\n", 586 | "Even though each sample has data that is inherently two-dimensional, the data matrix flattens\n", 587 | "this 2D data into a **single vector**, which can be contained in one **row** of the data matrix." 
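To make the flattening concrete, here is a short sketch (not in the original notebook) that takes one row of the data matrix, reshapes it back into an 8x8 image, and checks that it matches the corresponding entry of ``images``:

```python
import numpy as np
from matplotlib import pyplot as plt
from sklearn.datasets import load_digits

digits = load_digits()

# One sample: a flat 64-vector in the data matrix ...
row = digits.data[0]
print(row.shape)  # (64,)

# ... which is just the 8x8 image unrolled row by row
image = row.reshape(8, 8)
print(np.all(image == digits.images[0]))  # True

plt.imshow(image, cmap=plt.cm.binary, interpolation='nearest')
```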
588 | ] 589 | }, 590 | { 591 | "cell_type": "markdown", 592 | "metadata": {}, 593 | "source": [ 594 | "## Exercise: working with the faces dataset" 595 | ] 596 | }, 597 | { 598 | "cell_type": "markdown", 599 | "metadata": {}, 600 | "source": [ 601 | "Here we'll take a moment for you to explore the datasets yourself.\n", 602 | "Later on we'll be using the Olivetti faces dataset.\n", 603 | "Take a moment to fetch the data (about 1.4MB), and visualize the faces.\n", 604 | "You can copy the code used to visualize the digits above, and modify it for this data." 605 | ] 606 | }, 607 | { 608 | "cell_type": "code", 609 | "execution_count": null, 610 | "metadata": { 611 | "collapsed": false 612 | }, 613 | "outputs": [], 614 | "source": [ 615 | "from sklearn.datasets import fetch_olivetti_faces" 616 | ] 617 | }, 618 | { 619 | "cell_type": "code", 620 | "execution_count": null, 621 | "metadata": { 622 | "collapsed": false 623 | }, 624 | "outputs": [], 625 | "source": [ 626 | "# fetch the faces data\n" 627 | ] 628 | }, 629 | { 630 | "cell_type": "code", 631 | "execution_count": null, 632 | "metadata": { 633 | "collapsed": false 634 | }, 635 | "outputs": [], 636 | "source": [ 637 | "# Use a script like above to plot the faces image data.\n", 638 | "# hint: plt.cm.bone is a good colormap for this data\n" 639 | ] 640 | }, 641 | { 642 | "cell_type": "markdown", 643 | "metadata": {}, 644 | "source": [ 645 | "### Solution:" 646 | ] 647 | }, 648 | { 649 | "cell_type": "code", 650 | "execution_count": null, 651 | "metadata": { 652 | "collapsed": false 653 | }, 654 | "outputs": [], 655 | "source": [ 656 | "%load solutions/02A_faces_plot.py" 657 | ] 658 | } 659 | ], 660 | "metadata": { 661 | "kernelspec": { 662 | "display_name": "Python 2", 663 | "language": "python", 664 | "name": "python2" 665 | }, 666 | "language_info": { 667 | "codemirror_mode": { 668 | "name": "ipython", 669 | "version": 2 670 | }, 671 | "file_extension": ".py", 672 | "mimetype": "text/x-python", 673 | "name": "python", 674 | "nbconvert_exporter": "python", 675 | "pygments_lexer": "ipython2", 676 | "version": "2.7.11" 677 | } 678 | }, 679 | "nbformat": 4, 680 | "nbformat_minor": 0 681 | } 682 | -------------------------------------------------------------------------------- /notebooks/03_basic_principles_of_machine_learning.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": null, 6 | "metadata": { 7 | "collapsed": false 8 | }, 9 | "outputs": [], 10 | "source": [ 11 | "%matplotlib inline\n", 12 | "import numpy as np\n", 13 | "from matplotlib import pyplot as plt" 14 | ] 15 | }, 16 | { 17 | "cell_type": "markdown", 18 | "metadata": {}, 19 | "source": [ 20 | "# Basic principles of machine learning" 21 | ] 22 | }, 23 | { 24 | "cell_type": "markdown", 25 | "metadata": {}, 26 | "source": [ 27 | "Here is where we start diving into the field of machine learning.\n", 28 | "\n", 29 | "By the end of this section you will\n", 30 | "\n", 31 | "- Know the basic categories of supervised learning, including classification and regression problems.\n", 32 | "- Know the basic categories of unsupervised learning, including dimensionality reduction and clustering.\n", 33 | "- Know the basic syntax of the Scikit-learn **estimator** interface.\n", 34 | "- Know how features are extracted from real-world data.\n", 35 | "\n", 36 | "In addition, we will go over several basic tools within scikit-learn which can be used to accomplish the above tasks." 
37 | ] 38 | }, 39 | { 40 | "cell_type": "markdown", 41 | "metadata": {}, 42 | "source": [ 43 | "## Problem setting" 44 | ] 45 | }, 46 | { 47 | "cell_type": "markdown", 48 | "metadata": {}, 49 | "source": [ 50 | "### A simple definition of machine learning" 51 | ] 52 | }, 53 | { 54 | "cell_type": "markdown", 55 | "metadata": {}, 56 | "source": [ 57 | "Machine Learning (ML) is about building programs with **tunable parameters** (typically an\n", 58 | "array of floating point values) that are adjusted automatically so as to improve\n", 59 | "their behavior by **adapting to previously seen data.**\n", 60 | "\n", 61 | "In most ML applications, the data is in a 2D array of shape ``[n_samples x n_features]``,\n", 62 | "where the number of features is the same for each object, and each feature column refers\n", 63 | "to a related piece of information about each sample.\n", 64 | "\n", 65 | "Machine learning can be broken into two broad regimes:\n", 66 | "*supervised learning* and *unsupervised learning*.\n", 67 | "We’ll introduce these concepts here, and discuss them in more detail below." 68 | ] 69 | }, 70 | { 71 | "cell_type": "markdown", 72 | "metadata": {}, 73 | "source": [ 74 | "### Introducing the scikit-learn estimator object" 75 | ] 76 | }, 77 | { 78 | "cell_type": "markdown", 79 | "metadata": {}, 80 | "source": [ 81 | "Every algorithm is exposed in scikit-learn via an ''Estimator'' object. For instance a linear regression is:" 82 | ] 83 | }, 84 | { 85 | "cell_type": "code", 86 | "execution_count": null, 87 | "metadata": { 88 | "collapsed": false 89 | }, 90 | "outputs": [], 91 | "source": [ 92 | "from sklearn.linear_model import LinearRegression" 93 | ] 94 | }, 95 | { 96 | "cell_type": "markdown", 97 | "metadata": {}, 98 | "source": [ 99 | "**Estimator parameters**: All the parameters of an estimator can be set when it is instantiated:" 100 | ] 101 | }, 102 | { 103 | "cell_type": "code", 104 | "execution_count": null, 105 | "metadata": { 106 | "collapsed": false 107 | }, 108 | "outputs": [], 109 | "source": [ 110 | "model = LinearRegression(normalize=True)\n", 111 | "print(model.normalize)" 112 | ] 113 | }, 114 | { 115 | "cell_type": "code", 116 | "execution_count": null, 117 | "metadata": { 118 | "collapsed": false 119 | }, 120 | "outputs": [], 121 | "source": [ 122 | "print(model)" 123 | ] 124 | }, 125 | { 126 | "cell_type": "markdown", 127 | "metadata": {}, 128 | "source": [ 129 | "**Estimated parameters**: When data is fitted with an estimator, parameters are estimated from the data at hand. 
All the estimated parameters are attributes of the estimator object ending by an underscore:" 130 | ] 131 | }, 132 | { 133 | "cell_type": "code", 134 | "execution_count": null, 135 | "metadata": { 136 | "collapsed": false 137 | }, 138 | "outputs": [], 139 | "source": [ 140 | "x = np.array([0, 1, 2])\n", 141 | "y = np.array([0, 1, 2])" 142 | ] 143 | }, 144 | { 145 | "cell_type": "code", 146 | "execution_count": null, 147 | "metadata": { 148 | "collapsed": false 149 | }, 150 | "outputs": [], 151 | "source": [ 152 | "_ = plt.plot(x, y, marker='o')" 153 | ] 154 | }, 155 | { 156 | "cell_type": "code", 157 | "execution_count": null, 158 | "metadata": { 159 | "collapsed": false 160 | }, 161 | "outputs": [], 162 | "source": [ 163 | "X = x[:, np.newaxis] # The input data for sklearn is 2D: (samples == 3 x features == 1)\n", 164 | "X" 165 | ] 166 | }, 167 | { 168 | "cell_type": "code", 169 | "execution_count": null, 170 | "metadata": { 171 | "collapsed": false 172 | }, 173 | "outputs": [], 174 | "source": [ 175 | "model.fit(X, y) \n", 176 | "model.coef_" 177 | ] 178 | }, 179 | { 180 | "cell_type": "markdown", 181 | "metadata": {}, 182 | "source": [ 183 | "### Supervised Learning: Classification and regression" 184 | ] 185 | }, 186 | { 187 | "cell_type": "markdown", 188 | "metadata": {}, 189 | "source": [ 190 | "In **Supervised Learning**, we have a dataset consisting of both features and labels.\n", 191 | "The task is to construct an estimator which is able to predict the label of an object\n", 192 | "given the set of features. A relatively simple example is predicting the species of \n", 193 | "iris given a set of measurements of its flower. This is a relatively simple task. \n", 194 | "Some more complicated examples are:\n", 195 | "\n", 196 | "- given a multicolor image of an object through a telescope, determine\n", 197 | " whether that object is a star, a quasar, or a galaxy.\n", 198 | "- given a photograph of a person, identify the person in the photo.\n", 199 | "- given a list of movies a person has watched and their personal rating\n", 200 | " of the movie, recommend a list of movies they would like\n", 201 | " (So-called *recommender systems*: a famous example is the [Netflix Prize](http://en.wikipedia.org/wiki/Netflix_prize)).\n", 202 | "\n", 203 | "What these tasks have in common is that there is one or more unknown\n", 204 | "quantities associated with the object which needs to be determined from other\n", 205 | "observed quantities.\n", 206 | "\n", 207 | "Supervised learning is further broken down into two categories, **classification** and **regression**.\n", 208 | "In classification, the label is discrete, while in regression, the label is continuous. For example,\n", 209 | "in astronomy, the task of determining whether an object is a star, a galaxy, or a quasar is a\n", 210 | "classification problem: the label is from three distinct categories. On the other hand, we might\n", 211 | "wish to estimate the age of an object based on such observations: this would be a regression problem,\n", 212 | "because the label (age) is a continuous quantity." 
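The distinction can be made concrete with two tiny examples. The sketch below is not from the original notebook and uses synthetic data from `make_blobs` and `make_regression` purely for illustration: the classifier predicts discrete labels, while the regressor predicts continuous values:

```python
from sklearn.datasets import make_blobs, make_regression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LinearRegression

# Classification: the target is one of a few discrete classes
Xc, yc = make_blobs(n_samples=100, centers=3, random_state=0)
clf = KNeighborsClassifier(n_neighbors=5).fit(Xc, yc)
print(clf.predict(Xc[:5]))  # discrete labels such as [0 2 1 ...]

# Regression: the target is a continuous quantity
Xr, yr = make_regression(n_samples=100, n_features=1, noise=10, random_state=0)
reg = LinearRegression().fit(Xr, yr)
print(reg.predict(Xr[:5]))  # real-valued predictions
```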
213 | ] 214 | }, 215 | { 216 | "cell_type": "markdown", 217 | "metadata": {}, 218 | "source": [ 219 | "**Classification**: K nearest neighbors (kNN) is one of the simplest learning strategies: given a new, unknown observation, look up in your reference database which ones have the closest features and assign the predominant class.\n", 220 | "\n", 221 | "Let's try it out on our iris classification problem:" 222 | ] 223 | }, 224 | { 225 | "cell_type": "code", 226 | "execution_count": null, 227 | "metadata": { 228 | "collapsed": false 229 | }, 230 | "outputs": [], 231 | "source": [ 232 | "from sklearn import neighbors, datasets\n", 233 | "iris = datasets.load_iris()\n", 234 | "X, y = iris.data, iris.target\n", 235 | "knn = neighbors.KNeighborsClassifier(n_neighbors=1)\n", 236 | "knn.fit(X, y)\n", 237 | "# What kind of iris has 3cm x 5cm sepal and 4cm x 2cm petal?\n", 238 | "print(iris.target_names[knn.predict([[3, 5, 4, 2]])])" 239 | ] 240 | }, 241 | { 242 | "cell_type": "code", 243 | "execution_count": null, 244 | "metadata": { 245 | "collapsed": false 246 | }, 247 | "outputs": [], 248 | "source": [ 249 | "# A plot of the sepal space and the prediction of the KNN\n", 250 | "from helpers import plot_iris_knn\n", 251 | "plot_iris_knn()" 252 | ] 253 | }, 254 | { 255 | "cell_type": "markdown", 256 | "metadata": {}, 257 | "source": [ 258 | "**Exercise**: Now use as an estimator on the same problem: sklearn.svm.SVC.\n", 259 | "\n", 260 | "> Note that you don't have to know what it is to use it.\n", 261 | "\n", 262 | "> If you finish early, do the same plot as above." 263 | ] 264 | }, 265 | { 266 | "cell_type": "markdown", 267 | "metadata": {}, 268 | "source": [ 269 | "**Regression**: The simplest possible regression setting is the linear regression one:" 270 | ] 271 | }, 272 | { 273 | "cell_type": "code", 274 | "execution_count": null, 275 | "metadata": { 276 | "collapsed": false 277 | }, 278 | "outputs": [], 279 | "source": [ 280 | "# Create some simple data\n", 281 | "import numpy as np\n", 282 | "np.random.seed(0)\n", 283 | "X = np.random.random(size=(20, 1))\n", 284 | "y = 3 * X[:, 0] + 2 + np.random.normal(size=20)\n", 285 | "\n", 286 | "# Fit a linear regression to it\n", 287 | "from sklearn.linear_model import LinearRegression\n", 288 | "model = LinearRegression(fit_intercept=True)\n", 289 | "model.fit(X, y)\n", 290 | "print(\"Model coefficient: %.5f, and intercept: %.5f\"\n", 291 | " % (model.coef_, model.intercept_))\n", 292 | "\n", 293 | "# Plot the data and the model prediction\n", 294 | "X_test = np.linspace(0, 1, 100)[:, np.newaxis]\n", 295 | "y_test = model.predict(X_test)\n", 296 | "plt.plot(X[:, 0], y, 'o')\n", 297 | "plt.plot(X_test[:, 0], y_test)" 298 | ] 299 | }, 300 | { 301 | "cell_type": "markdown", 302 | "metadata": {}, 303 | "source": [ 304 | "### A recap on Scikit-learn's estimator interface" 305 | ] 306 | }, 307 | { 308 | "cell_type": "markdown", 309 | "metadata": {}, 310 | "source": [ 311 | "Scikit-learn strives to have a uniform interface across all methods,\n", 312 | "and we’ll see examples of these below. Given a scikit-learn *estimator*\n", 313 | "object named `model`, the following methods are available:\n", 314 | "\n", 315 | "- Available in **all Estimators**\n", 316 | " + `model.fit()` : fit training data. For supervised learning applications,\n", 317 | " this accepts two arguments: the data `X` and the labels `y` (e.g. `model.fit(X, y)`).\n", 318 | " For unsupervised learning applications, this accepts only a single argument,\n", 319 | " the data `X` (e.g. 
`model.fit(X)`).\n", 320 | "- Available in **supervised estimators**\n", 321 | " + `model.predict()` : given a trained model, predict the label of a new set of data.\n", 322 | " This method accepts one argument, the new data `X_new` (e.g. `model.predict(X_new)`),\n", 323 | " and returns the learned label for each object in the array.\n", 324 | " + `model.predict_proba()` : For classification problems, some estimators also provide\n", 325 | " this method, which returns the probability that a new observation has each categorical label.\n", 326 | " In this case, the label with the highest probability is returned by `model.predict()`.\n", 327 | " + `model.score()` : for classification or regression problems, most (all?) estimators implement\n", 328 | " a score method. Scores are between 0 and 1, with a larger score indicating a better fit.\n", 329 | "- Available in **unsupervised estimators**\n", 330 | " + `model.transform()` : given an unsupervised model, transform new data into the new basis.\n", 331 | " This also accepts one argument `X_new`, and returns the new representation of the data based\n", 332 | " on the unsupervised model.\n", 333 | " + `model.fit_transform()` : some estimators implement this method,\n", 334 | " which more efficiently performs a fit and a transform on the same input data." 335 | ] 336 | }, 337 | { 338 | "cell_type": "markdown", 339 | "metadata": {}, 340 | "source": [ 341 | "### Regularization: what it is and why it is necessary" 342 | ] 343 | }, 344 | { 345 | "cell_type": "markdown", 346 | "metadata": {}, 347 | "source": [ 348 | "**Train errors** Suppose you are using a 1-nearest neighbor estimator. How many errors do you expect on your train set?\n", 349 | "\n", 350 | "This tells us that:\n", 351 | "- Train set error is not a good measurement of prediction performance. You need to leave out a test set.\n", 352 | "- In general, we should accept errors on the train set." 353 | ] 354 | }, 355 | { 356 | "cell_type": "markdown", 357 | "metadata": {}, 358 | "source": [ 359 | "**An example of regularization** The core idea behind regularization is that we are going to prefer models that are simpler, for a certain definition of ''simpler'', even if they lead to more errors on the train set.\n", 360 | "\n", 361 | "As an example, let's generate with a 9th order polynomial." 362 | ] 363 | }, 364 | { 365 | "cell_type": "code", 366 | "execution_count": null, 367 | "metadata": { 368 | "collapsed": false 369 | }, 370 | "outputs": [], 371 | "source": [ 372 | "rng = np.random.RandomState(0)\n", 373 | "x = 2 * rng.rand(100) - 1\n", 374 | "\n", 375 | "f = lambda t: 1.2 * t ** 2 + .1 * t ** 3 - .4 * t ** 5 - .5 * t ** 9\n", 376 | "y = f(x) + .4 * rng.normal(size=100)\n", 377 | "\n", 378 | "plt.figure()\n", 379 | "plt.scatter(x, y, s=4)" 380 | ] 381 | }, 382 | { 383 | "cell_type": "markdown", 384 | "metadata": {}, 385 | "source": [ 386 | "And now, let's fit a 4th order and a 9th order polynomial to the data. 
For this we need to engineer features: the n_th powers of x:" 387 | ] 388 | }, 389 | { 390 | "cell_type": "code", 391 | "execution_count": null, 392 | "metadata": { 393 | "collapsed": false 394 | }, 395 | "outputs": [], 396 | "source": [ 397 | "x_test = np.linspace(-1, 1, 100)\n", 398 | "\n", 399 | "plt.figure()\n", 400 | "plt.scatter(x, y, s=4)\n", 401 | "\n", 402 | "X = np.array([x**i for i in range(5)]).T\n", 403 | "X_test = np.array([x_test**i for i in range(5)]).T\n", 404 | "order4 = LinearRegression()\n", 405 | "order4.fit(X, y)\n", 406 | "plt.plot(x_test, order4.predict(X_test), label='4th order')\n", 407 | "\n", 408 | "X = np.array([x**i for i in range(10)]).T\n", 409 | "X_test = np.array([x_test**i for i in range(10)]).T\n", 410 | "order9 = LinearRegression()\n", 411 | "order9.fit(X, y)\n", 412 | "plt.plot(x_test, order9.predict(X_test), label='9th order')\n", 413 | "\n", 414 | "plt.legend(loc='best')\n", 415 | "plt.axis('tight')\n", 416 | "plt.title('Fitting a 4th and a 9th order polynomial')" 417 | ] 418 | }, 419 | { 420 | "cell_type": "markdown", 421 | "metadata": {}, 422 | "source": [ 423 | "With your naked eyes, which model do you prefer, the 4th order one, or the 9th order one?\n", 424 | "\n", 425 | "Let's look at the ground truth:" 426 | ] 427 | }, 428 | { 429 | "cell_type": "code", 430 | "execution_count": null, 431 | "metadata": { 432 | "collapsed": false 433 | }, 434 | "outputs": [], 435 | "source": [ 436 | "plt.figure()\n", 437 | "plt.scatter(x, y, s=4)\n", 438 | "plt.plot(x_test, f(x_test), label=\"truth\")\n", 439 | "plt.axis('tight')\n", 440 | "plt.title('Ground truth (9th order polynomial)')" 441 | ] 442 | }, 443 | { 444 | "cell_type": "markdown", 445 | "metadata": {}, 446 | "source": [ 447 | "Regularization is ubiquitous in machine learning. Most scikit-learn estimators have a parameter to tune the amount of regularization. For instance, with k-NN, it is 'k', the number of nearest neighbors used to make the decision. k=1 amounts to no regularization: 0 error on the training set, whereas large k will push toward smoother decision boundaries in the feature space." 
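This effect is easy to check numerically. The following sketch is not part of the original notebook; it compares a 1-nearest-neighbor model with a more regularized k=15 model on a held-out split of the iris data (`train_test_split` is in `sklearn.model_selection` in recent releases, `sklearn.cross_validation` in older ones):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split  # sklearn.cross_validation in old releases
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, random_state=0)

for k in (1, 15):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    print("k=%2d  train accuracy: %.2f  test accuracy: %.2f"
          % (k, knn.score(X_train, y_train), knn.score(X_test, y_test)))

# k=1 scores perfectly on the training set but may generalize worse than a larger k
```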
448 | ] 449 | }, 450 | { 451 | "cell_type": "markdown", 452 | "metadata": {}, 453 | "source": [ 454 | "### Exercise: Interactive Demo on linearly separable data" 455 | ] 456 | }, 457 | { 458 | "cell_type": "markdown", 459 | "metadata": {}, 460 | "source": [ 461 | "Run the **svm_gui.py** file from http://scikit-learn.org/stable/_downloads/svm_gui.py" 462 | ] 463 | }, 464 | { 465 | "cell_type": "markdown", 466 | "metadata": {}, 467 | "source": [ 468 | "\n", 469 | "\n", 470 | "______\n", 471 | "\n", 472 | "\n" 473 | ] 474 | } 475 | ], 476 | "metadata": { 477 | "kernelspec": { 478 | "display_name": "Python 2", 479 | "language": "python", 480 | "name": "python2" 481 | }, 482 | "language_info": { 483 | "codemirror_mode": { 484 | "name": "ipython", 485 | "version": 2 486 | }, 487 | "file_extension": ".py", 488 | "mimetype": "text/x-python", 489 | "name": "python", 490 | "nbconvert_exporter": "python", 491 | "pygments_lexer": "ipython2", 492 | "version": "2.7.11" 493 | } 494 | }, 495 | "nbformat": 4, 496 | "nbformat_minor": 0 497 | } 498 | -------------------------------------------------------------------------------- /notebooks/04_unsupervised_dimreduction.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Unsupervised Learning: Dimensionality Reduction and Visualization" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "Unsupervised learning is interested in situations in which X is available, but not y: data without labels.\n", 15 | "\n", 16 | "A typical use case is to find hiden structure in the data.\n", 17 | "\n", 18 | "Previously we worked on visualizing the iris data by plotting\n", 19 | "pairs of dimensions by trial and error, until we arrived at\n", 20 | "the best pair of dimensions for our dataset. Here we will\n", 21 | "use an unsupervised *dimensionality reduction* algorithm\n", 22 | "to accomplish this more automatically." 
23 | ] 24 | }, 25 | { 26 | "cell_type": "markdown", 27 | "metadata": {}, 28 | "source": [ 29 | "By the end of this section you will\n", 30 | "\n", 31 | "- Know how to instantiate and train an unsupervised dimensionality reduction algorithm:\n", 32 | " Principal Component Analysis (PCA)\n", 33 | "- Know how to use PCA to visualize high-dimensional data" 34 | ] 35 | }, 36 | { 37 | "cell_type": "markdown", 38 | "metadata": {}, 39 | "source": [ 40 | "## Dimensionality Reduction: PCA" 41 | ] 42 | }, 43 | { 44 | "cell_type": "markdown", 45 | "metadata": {}, 46 | "source": [ 47 | "Dimensionality reduction is the task of deriving a set of new\n", 48 | "artificial features that is smaller than the original feature\n", 49 | "set while retaining most of the variance of the original data.\n", 50 | "Here we'll use a common but powerful dimensionality reduction\n", 51 | "technique called Principal Component Analysis (PCA).\n", 52 | "We'll perform PCA on the iris dataset that we saw before:" 53 | ] 54 | }, 55 | { 56 | "cell_type": "code", 57 | "execution_count": null, 58 | "metadata": { 59 | "collapsed": false 60 | }, 61 | "outputs": [], 62 | "source": [ 63 | "from sklearn.datasets import load_iris\n", 64 | "iris = load_iris()\n", 65 | "X = iris.data\n", 66 | "y = iris.target" 67 | ] 68 | }, 69 | { 70 | "cell_type": "markdown", 71 | "metadata": {}, 72 | "source": [ 73 | "PCA is performed using linear combinations of the original features\n", 74 | "using a truncated Singular Value Decomposition of the matrix X so\n", 75 | "as to project the data onto a base of the top singular vectors.\n", 76 | "If the number of retained components is 2 or 3, PCA can be used\n", 77 | "to visualize the dataset." 78 | ] 79 | }, 80 | { 81 | "cell_type": "code", 82 | "execution_count": null, 83 | "metadata": { 84 | "collapsed": false 85 | }, 86 | "outputs": [], 87 | "source": [ 88 | "from sklearn.decomposition import PCA\n", 89 | "pca = PCA(n_components=2, whiten=True)\n", 90 | "pca.fit(X)" 91 | ] 92 | }, 93 | { 94 | "cell_type": "markdown", 95 | "metadata": {}, 96 | "source": [ 97 | "Once fitted, the pca model exposes the singular vectors in the components_ attribute:" 98 | ] 99 | }, 100 | { 101 | "cell_type": "code", 102 | "execution_count": null, 103 | "metadata": { 104 | "collapsed": false 105 | }, 106 | "outputs": [], 107 | "source": [ 108 | "pca.components_" 109 | ] 110 | }, 111 | { 112 | "cell_type": "markdown", 113 | "metadata": {}, 114 | "source": [ 115 | "Other attributes are available as well:" 116 | ] 117 | }, 118 | { 119 | "cell_type": "code", 120 | "execution_count": null, 121 | "metadata": { 122 | "collapsed": false 123 | }, 124 | "outputs": [], 125 | "source": [ 126 | "pca.explained_variance_ratio_" 127 | ] 128 | }, 129 | { 130 | "cell_type": "code", 131 | "execution_count": null, 132 | "metadata": { 133 | "collapsed": false 134 | }, 135 | "outputs": [], 136 | "source": [ 137 | "pca.explained_variance_ratio_.sum()" 138 | ] 139 | }, 140 | { 141 | "cell_type": "markdown", 142 | "metadata": {}, 143 | "source": [ 144 | "Let us project the iris dataset along those first two dimensions:" 145 | ] 146 | }, 147 | { 148 | "cell_type": "code", 149 | "execution_count": null, 150 | "metadata": { 151 | "collapsed": false 152 | }, 153 | "outputs": [], 154 | "source": [ 155 | "X_pca = pca.transform(X)" 156 | ] 157 | }, 158 | { 159 | "cell_type": "markdown", 160 | "metadata": {}, 161 | "source": [ 162 | "PCA `normalizes` and `whitens` the data, which means that the data\n", 163 | "is now centered on both components with unit 
variance:" 164 | ] 165 | }, 166 | { 167 | "cell_type": "code", 168 | "execution_count": null, 169 | "metadata": { 170 | "collapsed": false 171 | }, 172 | "outputs": [], 173 | "source": [ 174 | "import numpy as np\n", 175 | "np.round(X_pca.mean(axis=0), decimals=5)" 176 | ] 177 | }, 178 | { 179 | "cell_type": "code", 180 | "execution_count": null, 181 | "metadata": { 182 | "collapsed": false 183 | }, 184 | "outputs": [], 185 | "source": [ 186 | "np.round(X_pca.std(axis=0), decimals=5)" 187 | ] 188 | }, 189 | { 190 | "cell_type": "markdown", 191 | "metadata": {}, 192 | "source": [ 193 | "Furthermore, the samples components do no longer carry any linear correlation:" 194 | ] 195 | }, 196 | { 197 | "cell_type": "code", 198 | "execution_count": null, 199 | "metadata": { 200 | "collapsed": false 201 | }, 202 | "outputs": [], 203 | "source": [ 204 | "np.corrcoef(X_pca.T)" 205 | ] 206 | }, 207 | { 208 | "cell_type": "markdown", 209 | "metadata": {}, 210 | "source": [ 211 | "We can visualize the projection using pylab, but first\n", 212 | "let's make sure our ipython notebook is in pylab inline mode" 213 | ] 214 | }, 215 | { 216 | "cell_type": "code", 217 | "execution_count": null, 218 | "metadata": { 219 | "collapsed": false 220 | }, 221 | "outputs": [], 222 | "source": [ 223 | "%pylab inline" 224 | ] 225 | }, 226 | { 227 | "cell_type": "markdown", 228 | "metadata": {}, 229 | "source": [ 230 | "Now we can visualize the results using the following utility function:" 231 | ] 232 | }, 233 | { 234 | "cell_type": "code", 235 | "execution_count": null, 236 | "metadata": { 237 | "collapsed": false 238 | }, 239 | "outputs": [], 240 | "source": [ 241 | "import matplotlib.pyplot as plt\n", 242 | "from itertools import cycle\n", 243 | "\n", 244 | "def plot_PCA_2D(data, target, target_names):\n", 245 | " colors = cycle('rgbcmykw')\n", 246 | " target_ids = range(len(target_names))\n", 247 | " plt.figure()\n", 248 | " for i, c, label in zip(target_ids, colors, target_names):\n", 249 | " plt.scatter(data[target == i, 0], data[target == i, 1],\n", 250 | " c=c, label=label)\n", 251 | " plt.legend()" 252 | ] 253 | }, 254 | { 255 | "cell_type": "markdown", 256 | "metadata": {}, 257 | "source": [ 258 | "Now calling this function for our data, we see the plot:" 259 | ] 260 | }, 261 | { 262 | "cell_type": "code", 263 | "execution_count": null, 264 | "metadata": { 265 | "collapsed": false 266 | }, 267 | "outputs": [], 268 | "source": [ 269 | "plot_PCA_2D(X_pca, iris.target, iris.target_names)" 270 | ] 271 | }, 272 | { 273 | "cell_type": "markdown", 274 | "metadata": {}, 275 | "source": [ 276 | "Note that this projection was determined *without* any information about the\n", 277 | "labels (represented by the colors): this is the sense in which the learning\n", 278 | "is **unsupervised**. Nevertheless, we see that the projection gives us insight\n", 279 | "into the distribution of the different flowers in parameter space: notably,\n", 280 | "*iris setosa* is much more distinct than the other two species." 
281 | ] 282 | }, 283 | { 284 | "cell_type": "markdown", 285 | "metadata": {}, 286 | "source": [ 287 | "Note also that the default implementation of PCA computes the\n", 288 | "singular value decomposition (SVD) of the full\n", 289 | "data matrix, which is not scalable when both ``n_samples`` and\n", 290 | "``n_features`` are big (more that a few thousands).\n", 291 | "If you are interested in a number of components that is much\n", 292 | "smaller than both ``n_samples`` and ``n_features``, consider using\n", 293 | "`sklearn.decomposition.RandomizedPCA` instead." 294 | ] 295 | }, 296 | { 297 | "cell_type": "markdown", 298 | "metadata": {}, 299 | "source": [ 300 | "Other dimensionality reduction techniques which are useful to know about:\n", 301 | "\n", 302 | "- [sklearn.decomposition.PCA](http://scikit-learn.org/0.13/modules/generated/sklearn.decomposition.PCA.html): \n", 303 | " Principal Component Analysis\n", 304 | "- [sklearn.decomposition.RandomizedPCA](http://scikit-learn.org/0.13/modules/generated/sklearn.decomposition.RandomizedPCA.html):\n", 305 | " fast non-exact PCA implementation based on a randomized algorithm\n", 306 | "- [sklearn.decomposition.SparsePCA](http://scikit-learn.org/0.13/modules/generated/sklearn.decomposition.SparsePCA.html):\n", 307 | " PCA variant including L1 penalty for sparsity\n", 308 | "- [sklearn.decomposition.FastICA](http://scikit-learn.org/0.13/modules/generated/sklearn.decomposition.FastICA.html):\n", 309 | " Independent Component Analysis\n", 310 | "- [sklearn.decomposition.NMF](http://scikit-learn.org/0.13/modules/generated/sklearn.decomposition.NMF.html):\n", 311 | " non-negative matrix factorization\n", 312 | "- [sklearn.manifold.LocallyLinearEmbedding](http://scikit-learn.org/0.13/modules/generated/sklearn.manifold.LocallyLinearEmbedding.html):\n", 313 | " nonlinear manifold learning technique based on local neighborhood geometry\n", 314 | "- [sklearn.manifold.IsoMap](http://scikit-learn.org/0.13/modules/generated/sklearn.manifold.Isomap.html):\n", 315 | " nonlinear manifold learning technique based on a sparse graph algorithm" 316 | ] 317 | }, 318 | { 319 | "cell_type": "markdown", 320 | "metadata": {}, 321 | "source": [ 322 | "## Manifold Learning" 323 | ] 324 | }, 325 | { 326 | "cell_type": "markdown", 327 | "metadata": {}, 328 | "source": [ 329 | "One weakness of PCA is that it cannot detect non-linear features. A set\n", 330 | "of algorithms known as *Manifold Learning* have been developed to address\n", 331 | "this deficiency. 
A canonical dataset used in Manifold learning is the\n", 332 | "*S-curve*, which we briefly saw in an earlier section:" 333 | ] 334 | }, 335 | { 336 | "cell_type": "code", 337 | "execution_count": null, 338 | "metadata": { 339 | "collapsed": false 340 | }, 341 | "outputs": [], 342 | "source": [ 343 | "from sklearn.datasets import make_s_curve\n", 344 | "X, y = make_s_curve(n_samples=1000)\n", 345 | "\n", 346 | "from mpl_toolkits.mplot3d import Axes3D\n", 347 | "ax = plt.axes(projection='3d')\n", 348 | "\n", 349 | "ax.scatter3D(X[:, 0], X[:, 1], X[:, 2], c=y)\n", 350 | "ax.view_init(10, -60)" 351 | ] 352 | }, 353 | { 354 | "cell_type": "markdown", 355 | "metadata": {}, 356 | "source": [ 357 | "This is a 2-dimensional dataset embedded in three dimensions, but it is embedded\n", 358 | "in such a way that PCA cannot discover the underlying data orientation:" 359 | ] 360 | }, 361 | { 362 | "cell_type": "code", 363 | "execution_count": null, 364 | "metadata": { 365 | "collapsed": false 366 | }, 367 | "outputs": [], 368 | "source": [ 369 | "X_pca = PCA(n_components=2).fit_transform(X)\n", 370 | "plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y)" 371 | ] 372 | }, 373 | { 374 | "cell_type": "markdown", 375 | "metadata": {}, 376 | "source": [ 377 | "Manifold learning algorithms, however, available in the ``sklearn.manifold``\n", 378 | "submodule, are able to recover the underlying 2-dimensional manifold:" 379 | ] 380 | }, 381 | { 382 | "cell_type": "code", 383 | "execution_count": null, 384 | "metadata": { 385 | "collapsed": false 386 | }, 387 | "outputs": [], 388 | "source": [ 389 | "from sklearn.manifold import LocallyLinearEmbedding, Isomap\n", 390 | "lle = LocallyLinearEmbedding(n_neighbors=15, n_components=2, method='modified')\n", 391 | "X_lle = lle.fit_transform(X)\n", 392 | "plt.scatter(X_lle[:, 0], X_lle[:, 1], c=y)" 393 | ] 394 | }, 395 | { 396 | "cell_type": "code", 397 | "execution_count": null, 398 | "metadata": { 399 | "collapsed": false 400 | }, 401 | "outputs": [], 402 | "source": [ 403 | "iso = Isomap(n_neighbors=15, n_components=2)\n", 404 | "X_iso = iso.fit_transform(X)\n", 405 | "plt.scatter(X_iso[:, 0], X_iso[:, 1], c=y)" 406 | ] 407 | }, 408 | { 409 | "cell_type": "markdown", 410 | "metadata": {}, 411 | "source": [ 412 | "## Exercise: Dimension reduction of digits" 413 | ] 414 | }, 415 | { 416 | "cell_type": "markdown", 417 | "metadata": {}, 418 | "source": [ 419 | "Apply PCA, LocallyLinearEmbedding, and Isomap to project the data to two dimensions.\n", 420 | "Which visualization technique separates the classes most cleanly?" 421 | ] 422 | }, 423 | { 424 | "cell_type": "code", 425 | "execution_count": null, 426 | "metadata": { 427 | "collapsed": false 428 | }, 429 | "outputs": [], 430 | "source": [ 431 | "from sklearn.datasets import load_digits\n", 432 | "digits = load_digits()\n", 433 | "# ..." 
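One possible way to tackle this exercise is sketched below; this is only an illustrative outline (it is not the contents of `solutions/08A_digits_projection.py`), reusing the same estimator settings seen earlier in this notebook:

```python
# Illustrative sketch of the exercise: project the 64-dimensional digits data
# to 2D with each technique and color the points by digit label to compare
# how cleanly the ten classes separate.
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.manifold import LocallyLinearEmbedding, Isomap

digits = load_digits()
X, y = digits.data, digits.target

models = [
    PCA(n_components=2),
    LocallyLinearEmbedding(n_neighbors=15, n_components=2, method='modified'),
    Isomap(n_neighbors=15, n_components=2),
]

for model in models:
    X_2d = model.fit_transform(X)
    plt.figure()
    plt.scatter(X_2d[:, 0], X_2d[:, 1], c=y)
    plt.colorbar(label='digit')
    plt.title(type(model).__name__)
```

Judging which of the three projections separates the digit classes most cleanly is exactly the point of the exercise.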
434 | ] 435 | }, 436 | { 437 | "cell_type": "code", 438 | "execution_count": null, 439 | "metadata": { 440 | "collapsed": false 441 | }, 442 | "outputs": [], 443 | "source": [] 444 | }, 445 | { 446 | "cell_type": "markdown", 447 | "metadata": {}, 448 | "source": [ 449 | "### Solution:" 450 | ] 451 | }, 452 | { 453 | "cell_type": "code", 454 | "execution_count": null, 455 | "metadata": { 456 | "collapsed": false 457 | }, 458 | "outputs": [], 459 | "source": [ 460 | "%load solutions/08A_digits_projection.py" 461 | ] 462 | } 463 | ], 464 | "metadata": { 465 | "kernelspec": { 466 | "display_name": "Python 2", 467 | "language": "python", 468 | "name": "python2" 469 | }, 470 | "language_info": { 471 | "codemirror_mode": { 472 | "name": "ipython", 473 | "version": 2 474 | }, 475 | "file_extension": ".py", 476 | "mimetype": "text/x-python", 477 | "name": "python", 478 | "nbconvert_exporter": "python", 479 | "pygments_lexer": "ipython2", 480 | "version": "2.7.11" 481 | } 482 | }, 483 | "nbformat": 4, 484 | "nbformat_minor": 0 485 | } 486 | -------------------------------------------------------------------------------- /notebooks/05_measuring_prediction_performance.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Measuring prediction performance" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "Here we will discuss how to use **validation sets** to get a better measure of\n", 15 | "performance for a classifier." 16 | ] 17 | }, 18 | { 19 | "cell_type": "markdown", 20 | "metadata": {}, 21 | "source": [ 22 | "## Using the K-neighbors classifier" 23 | ] 24 | }, 25 | { 26 | "cell_type": "markdown", 27 | "metadata": {}, 28 | "source": [ 29 | "Here we'll continue to look at the digits data, but we'll switch to the\n", 30 | "K-Neighbors classifier. The K-neighbors classifier is an instance-based\n", 31 | "classifier. The K-neighbors classifier predicts the label of\n", 32 | "an unknown point based on the labels of the *K* nearest points in the\n", 33 | "parameter space." 34 | ] 35 | }, 36 | { 37 | "cell_type": "code", 38 | "execution_count": null, 39 | "metadata": { 40 | "collapsed": false 41 | }, 42 | "outputs": [], 43 | "source": [ 44 | "# Get the data\n", 45 | "from sklearn.datasets import load_digits\n", 46 | "digits = load_digits()\n", 47 | "X = digits.data\n", 48 | "y = digits.target" 49 | ] 50 | }, 51 | { 52 | "cell_type": "code", 53 | "execution_count": null, 54 | "metadata": { 55 | "collapsed": false 56 | }, 57 | "outputs": [], 58 | "source": [ 59 | "# Instantiate and train the classifier\n", 60 | "from sklearn.neighbors import KNeighborsClassifier\n", 61 | "clf = KNeighborsClassifier(n_neighbors=1)\n", 62 | "clf.fit(X, y)" 63 | ] 64 | }, 65 | { 66 | "cell_type": "code", 67 | "execution_count": null, 68 | "metadata": { 69 | "collapsed": false 70 | }, 71 | "outputs": [], 72 | "source": [ 73 | "# Check the results using metrics\n", 74 | "from sklearn import metrics\n", 75 | "y_pred = clf.predict(X)" 76 | ] 77 | }, 78 | { 79 | "cell_type": "code", 80 | "execution_count": null, 81 | "metadata": { 82 | "collapsed": false 83 | }, 84 | "outputs": [], 85 | "source": [ 86 | "print(metrics.confusion_matrix(y_pred, y))" 87 | ] 88 | }, 89 | { 90 | "cell_type": "markdown", 91 | "metadata": {}, 92 | "source": [ 93 | "Apparently, we've found a perfect classifier! 
But this is misleading\n", 94 | "for the reasons we saw before: the classifier essentially \"memorizes\"\n", 95 | "all the samples it has already seen. To really test how well this\n", 96 | "algorithm does, we need to try some samples it *hasn't* yet seen.\n", 97 | "\n", 98 | "This problem can also occur with regression models. In the following we fit an other instance-based model named \"decision tree\" to the Boston Housing price dataset we introduced previously:" 99 | ] 100 | }, 101 | { 102 | "cell_type": "code", 103 | "execution_count": null, 104 | "metadata": { 105 | "collapsed": false 106 | }, 107 | "outputs": [], 108 | "source": [ 109 | "%matplotlib inline\n", 110 | "from matplotlib import pyplot as plt\n", 111 | "import numpy as np" 112 | ] 113 | }, 114 | { 115 | "cell_type": "code", 116 | "execution_count": null, 117 | "metadata": { 118 | "collapsed": false 119 | }, 120 | "outputs": [], 121 | "source": [ 122 | "from sklearn.datasets import load_boston\n", 123 | "from sklearn.tree import DecisionTreeRegressor\n", 124 | "\n", 125 | "data = load_boston()\n", 126 | "clf = DecisionTreeRegressor().fit(data.data, data.target)\n", 127 | "predicted = clf.predict(data.data)\n", 128 | "expected = data.target\n", 129 | "\n", 130 | "plt.scatter(expected, predicted)\n", 131 | "plt.plot([0, 50], [0, 50], '--k')\n", 132 | "plt.axis('tight')\n", 133 | "plt.xlabel('True price ($1000s)')\n", 134 | "plt.ylabel('Predicted price ($1000s)')" 135 | ] 136 | }, 137 | { 138 | "cell_type": "markdown", 139 | "metadata": {}, 140 | "source": [ 141 | "Here again the predictions are seemingly perfect as the model was able to perfectly memorize the training set." 142 | ] 143 | }, 144 | { 145 | "cell_type": "markdown", 146 | "metadata": {}, 147 | "source": [ 148 | "## A Better Approach: Using a validation set" 149 | ] 150 | }, 151 | { 152 | "cell_type": "markdown", 153 | "metadata": {}, 154 | "source": [ 155 | "Learning the parameters of a prediction function and testing it on the\n", 156 | "same data is a methodological mistake: a model that would just repeat\n", 157 | "the labels of the samples that it has just seen would have a perfect\n", 158 | "score but would fail to predict anything useful on yet-unseen data.\n", 159 | "\n", 160 | "To avoid over-fitting, we have to define two different sets:\n", 161 | "\n", 162 | "- a training set X_train, y_train which is used for learning the parameters of a predictive model\n", 163 | "- a testing set X_test, y_test which is used for evaluating the fitted predictive model\n", 164 | "\n", 165 | "In scikit-learn such a random split can be quickly computed with the\n", 166 | "`train_test_split` helper function. 
It can be used this way:" 167 | ] 168 | }, 169 | { 170 | "cell_type": "code", 171 | "execution_count": null, 172 | "metadata": { 173 | "collapsed": false 174 | }, 175 | "outputs": [], 176 | "source": [ 177 | "from sklearn.model_selection import train_test_split\n", 178 | "X = digits.data\n", 179 | "y = digits.target\n", 180 | "\n", 181 | "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)\n", 182 | "\n", 183 | "print(\"%r, %r, %r\" % (X.shape, X_train.shape, X_test.shape))" 184 | ] 185 | }, 186 | { 187 | "cell_type": "markdown", 188 | "metadata": {}, 189 | "source": [ 190 | "Now we train on the training data, and test on the testing data:" 191 | ] 192 | }, 193 | { 194 | "cell_type": "code", 195 | "execution_count": null, 196 | "metadata": { 197 | "collapsed": false 198 | }, 199 | "outputs": [], 200 | "source": [ 201 | "clf = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)\n", 202 | "y_pred = clf.predict(X_test)" 203 | ] 204 | }, 205 | { 206 | "cell_type": "code", 207 | "execution_count": null, 208 | "metadata": { 209 | "collapsed": false 210 | }, 211 | "outputs": [], 212 | "source": [ 213 | "print(metrics.confusion_matrix(y_test, y_pred))" 214 | ] 215 | }, 216 | { 217 | "cell_type": "code", 218 | "execution_count": null, 219 | "metadata": { 220 | "collapsed": false 221 | }, 222 | "outputs": [], 223 | "source": [ 224 | "print(metrics.classification_report(y_test, y_pred))" 225 | ] 226 | }, 227 | { 228 | "cell_type": "markdown", 229 | "metadata": {}, 230 | "source": [ 231 | "The averaged f1-score is often used as a convenient measure of the\n", 232 | "overall performance of an algorithm. It appears in the bottom row\n", 233 | "of the classification report; it can also be accessed directly:" 234 | ] 235 | }, 236 | { 237 | "cell_type": "code", 238 | "execution_count": null, 239 | "metadata": { 240 | "collapsed": false 241 | }, 242 | "outputs": [], 243 | "source": [ 244 | "metrics.f1_score(y_test, y_pred, average='weighted')" 245 | ] 246 | }, 247 | { 248 | "cell_type": "markdown", 249 | "metadata": {}, 250 | "source": [ 251 | "The over-fitting we saw previously can be quantified by computing the\n", 252 | "f1-score on the training data itself:" 253 | ] 254 | }, 255 | { 256 | "cell_type": "code", 257 | "execution_count": null, 258 | "metadata": { 259 | "collapsed": false 260 | }, 261 | "outputs": [], 262 | "source": [ 263 | "metrics.f1_score(y_train, clf.predict(X_train), average='weighted')" 264 | ] 265 | }, 266 | { 267 | "cell_type": "markdown", 268 | "metadata": {}, 269 | "source": [ 270 | "### Validation with a Regression Model" 271 | ] 272 | }, 273 | { 274 | "cell_type": "markdown", 275 | "metadata": {}, 276 | "source": [ 277 | "These validation metrics also work in the case of regression models. Here we'll use\n", 278 | "a Gradient-boosted regression tree, which is a meta-estimator which makes use of the\n", 279 | "``DecisionTreeRegressor`` we showed above. 
We'll start by doing the train-test split\n", 280 | "as we did with the classification case:" 281 | ] 282 | }, 283 | { 284 | "cell_type": "code", 285 | "execution_count": null, 286 | "metadata": { 287 | "collapsed": false 288 | }, 289 | "outputs": [], 290 | "source": [ 291 | "data = load_boston()\n", 292 | "X = data.data\n", 293 | "y = data.target\n", 294 | "X_train, X_test, y_train, y_test = train_test_split(\n", 295 | " X, y, test_size=0.25, random_state=0)\n", 296 | "\n", 297 | "print(\"%r, %r, %r\" % (X.shape, X_train.shape, X_test.shape))" 298 | ] 299 | }, 300 | { 301 | "cell_type": "markdown", 302 | "metadata": {}, 303 | "source": [ 304 | "Next we'll compute the training and testing error using the Decision Tree that\n", 305 | "we saw before:" 306 | ] 307 | }, 308 | { 309 | "cell_type": "code", 310 | "execution_count": null, 311 | "metadata": { 312 | "collapsed": false 313 | }, 314 | "outputs": [], 315 | "source": [ 316 | "est = DecisionTreeRegressor().fit(X_train, y_train)\n", 317 | "\n", 318 | "validation_score = metrics.explained_variance_score(\n", 319 | " y_test, est.predict(X_test))\n", 320 | "\n", 321 | "print(\"validation: %r\" % validation_score)\n", 322 | "\n", 323 | "training_score = metrics.explained_variance_score(\n", 324 | " y_train, est.predict(X_train))\n", 325 | "\n", 326 | "print(\"training: %r\" % training_score)" 327 | ] 328 | }, 329 | { 330 | "cell_type": "markdown", 331 | "metadata": {}, 332 | "source": [ 333 | "This large spread between validation and training error is characteristic\n", 334 | "of a **high variance** model. Decision trees are not entirely useless,\n", 335 | "however: by combining many individual decision trees within ensemble\n", 336 | "estimators such as Gradient Boosted Trees or Random Forests, we can get\n", 337 | "much better performance:" 338 | ] 339 | }, 340 | { 341 | "cell_type": "code", 342 | "execution_count": null, 343 | "metadata": { 344 | "collapsed": false 345 | }, 346 | "outputs": [], 347 | "source": [ 348 | "from sklearn.ensemble import GradientBoostingRegressor\n", 349 | "est = GradientBoostingRegressor().fit(X_train, y_train)\n", 350 | "\n", 351 | "validation_score = metrics.explained_variance_score(\n", 352 | " y_test, est.predict(X_test))\n", 353 | "\n", 354 | "print(\"validation: %r\" % validation_score)\n", 355 | "\n", 356 | "training_score = metrics.explained_variance_score(\n", 357 | " y_train, est.predict(X_train))\n", 358 | "\n", 359 | "print(\"training: %r\" % training_score)" 360 | ] 361 | }, 362 | { 363 | "cell_type": "markdown", 364 | "metadata": {}, 365 | "source": [ 366 | "This model is still over-fitting the data, but not by as much as the single tree." 367 | ] 368 | }, 369 | { 370 | "cell_type": "markdown", 371 | "metadata": {}, 372 | "source": [ 373 | "## Exercise: Model Selection via Validation" 374 | ] 375 | }, 376 | { 377 | "cell_type": "markdown", 378 | "metadata": {}, 379 | "source": [ 380 | "Here we saw K-neighbors classification of the digits. We've also seen support vector\n", 381 | "machine classification of digits. Now that we have these\n", 382 | "validation tools in place, we can ask quantitatively which of the three estimators\n", 383 | "works best for the digits dataset.\n", 384 | "\n", 385 | "Take a moment and determine the answers to these questions for the digits dataset:\n", 386 | "\n", 387 | "- With the default hyper-parameters for each estimator, which gives the best f1 score\n", 388 | " on the **validation set**? 
Recall that hyperparameters are the parameters set when\n", 389 | " you instantiate the classifier: for example, the ``n_neighbors`` in\n", 390 | "\n", 391 | " clf = KNeighborsClassifier(n_neighbors=1)\n", 392 | "\n", 393 | " To use the default value, simply leave them unspecified.\n", 394 | "- For each classifier, which value for the hyperparameters gives the best results for\n", 395 | " the digits data? For ``LinearSVC``, use ``loss='l2'`` and ``loss='l1'``. For\n", 396 | " ``KNeighborsClassifier`` use ``n_neighbors`` between 1 and 10. Try also\n", 397 | " ``GaussianNB``. Note that it does not have any adjustable hyperparameters.\n", 398 | "- Bonus: do the same exercise on the Iris data rather than the Digits data. Does the\n", 399 | " same classifier/hyperparameter combination win out in this case?" 400 | ] 401 | }, 402 | { 403 | "cell_type": "code", 404 | "execution_count": null, 405 | "metadata": { 406 | "collapsed": false 407 | }, 408 | "outputs": [], 409 | "source": [ 410 | "from sklearn.svm import LinearSVC\n", 411 | "from sklearn.naive_bayes import GaussianNB\n", 412 | "from sklearn.neighbors import KNeighborsClassifier" 413 | ] 414 | }, 415 | { 416 | "cell_type": "code", 417 | "execution_count": null, 418 | "metadata": { 419 | "collapsed": false 420 | }, 421 | "outputs": [], 422 | "source": [] 423 | }, 424 | { 425 | "cell_type": "markdown", 426 | "metadata": {}, 427 | "source": [ 428 | "### Solution" 429 | ] 430 | }, 431 | { 432 | "cell_type": "code", 433 | "execution_count": null, 434 | "metadata": { 435 | "collapsed": false 436 | }, 437 | "outputs": [], 438 | "source": [ 439 | "%load solutions/05C_validation_exercise.py" 440 | ] 441 | } 442 | ], 443 | "metadata": { 444 | "kernelspec": { 445 | "display_name": "Python 2", 446 | "language": "python", 447 | "name": "python2" 448 | }, 449 | "language_info": { 450 | "codemirror_mode": { 451 | "name": "ipython", 452 | "version": 2 453 | }, 454 | "file_extension": ".py", 455 | "mimetype": "text/x-python", 456 | "name": "python", 457 | "nbconvert_exporter": "python", 458 | "pygments_lexer": "ipython2", 459 | "version": "2.7.11" 460 | } 461 | }, 462 | "nbformat": 4, 463 | "nbformat_minor": 0 464 | } 465 | -------------------------------------------------------------------------------- /notebooks/06_sklearn_pandas_and_heterogeneous_data_modeling.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Predictive Modeling with scikit-learn and pandas" 8 | ] 9 | }, 10 | { 11 | "cell_type": "code", 12 | "execution_count": null, 13 | "metadata": { 14 | "collapsed": false 15 | }, 16 | "outputs": [], 17 | "source": [ 18 | "%matplotlib inline\n", 19 | "import matplotlib.pyplot as plt\n", 20 | "import numpy as np\n", 21 | "import pandas as pd\n", 22 | "\n", 23 | "import warnings\n", 24 | "warnings.simplefilter('ignore', DeprecationWarning)" 25 | ] 26 | }, 27 | { 28 | "cell_type": "markdown", 29 | "metadata": {}, 30 | "source": [ 31 | "" 32 | ] 33 | }, 34 | { 35 | "cell_type": "markdown", 36 | "metadata": {}, 37 | "source": [ 38 | "## Loading tabular data from the Titanic kaggle challenge in a pandas Data Frame" 39 | ] 40 | }, 41 | { 42 | "cell_type": "markdown", 43 | "metadata": {}, 44 | "source": [ 45 | "Let us have a look at the Titanic dataset from the Kaggle Getting Started challenge at:\n", 46 | "\n", 47 | "https://www.kaggle.com/c/titanic-gettingStarted\n", 48 | "\n", 49 | "We can load the CSV file as a pandas 
data frame in one line:" 50 | ] 51 | }, 52 | { 53 | "cell_type": "code", 54 | "execution_count": null, 55 | "metadata": { 56 | "collapsed": false 57 | }, 58 | "outputs": [], 59 | "source": [ 60 | "#!curl -s https://dl.dropboxusercontent.com/u/2140486/data/titanic_train.csv | head -5\n", 61 | "!head -5 titanic_train.csv" 62 | ] 63 | }, 64 | { 65 | "cell_type": "code", 66 | "execution_count": null, 67 | "metadata": { 68 | "collapsed": false 69 | }, 70 | "outputs": [], 71 | "source": [ 72 | "#data = pd.read_csv('https://dl.dropboxusercontent.com/u/2140486/data/titanic_train.csv')\n", 73 | "data = pd.read_csv('titanic_train.csv')" 74 | ] 75 | }, 76 | { 77 | "cell_type": "markdown", 78 | "metadata": {}, 79 | "source": [ 80 | "pandas data frames have a HTML table representation in the IPython notebook. Let's have a look at the first 5 rows:" 81 | ] 82 | }, 83 | { 84 | "cell_type": "code", 85 | "execution_count": null, 86 | "metadata": { 87 | "collapsed": false 88 | }, 89 | "outputs": [], 90 | "source": [ 91 | "data.head(5)" 92 | ] 93 | }, 94 | { 95 | "cell_type": "code", 96 | "execution_count": null, 97 | "metadata": { 98 | "collapsed": false 99 | }, 100 | "outputs": [], 101 | "source": [ 102 | "data.count()" 103 | ] 104 | }, 105 | { 106 | "cell_type": "markdown", 107 | "metadata": {}, 108 | "source": [ 109 | "The data frame has 891 rows. Some passengers have missing information though: in particular Age and Cabin info can be missing. The meaning of the columns is explained on the challenge website:\n", 110 | "\n", 111 | "https://www.kaggle.com/c/titanic-gettingStarted/data\n", 112 | "\n", 113 | "and copied here:\n", 114 | "\n", 115 | "```\n", 116 | "VARIABLE DESCRIPTIONS:\n", 117 | "survival Survival\n", 118 | " (0 = No; 1 = Yes)\n", 119 | "pclass Passenger Class\n", 120 | " (1 = 1st; 2 = 2nd; 3 = 3rd)\n", 121 | "name Name\n", 122 | "sex Sex\n", 123 | "age Age\n", 124 | "sibsp Number of Siblings/Spouses Aboard\n", 125 | "parch Number of Parents/Children Aboard\n", 126 | "ticket Ticket Number\n", 127 | "fare Passenger Fare\n", 128 | "cabin Cabin\n", 129 | "embarked Port of Embarkation\n", 130 | " (C = Cherbourg; Q = Queenstown; S = Southampton)\n", 131 | "\n", 132 | "SPECIAL NOTES:\n", 133 | "Pclass is a proxy for socio-economic status (SES)\n", 134 | " 1st ~ Upper; 2nd ~ Middle; 3rd ~ Lower\n", 135 | "\n", 136 | "Age is in Years; Fractional if Age less than One (1)\n", 137 | " If the Age is Estimated, it is in the form xx.5\n", 138 | "\n", 139 | "With respect to the family relation variables (i.e. sibsp and parch)\n", 140 | "some relations were ignored. The following are the definitions used\n", 141 | "for sibsp and parch.\n", 142 | "\n", 143 | "Sibling: Brother, Sister, Stepbrother, or Stepsister of Passenger Aboard Titanic\n", 144 | "Spouse: Husband or Wife of Passenger Aboard Titanic (Mistresses and Fiances Ignored)\n", 145 | "Parent: Mother or Father of Passenger Aboard Titanic\n", 146 | "Child: Son, Daughter, Stepson, or Stepdaughter of Passenger Aboard Titanic\n", 147 | "\n", 148 | "Other family relatives excluded from this study include cousins,\n", 149 | "nephews/nieces, aunts/uncles, and in-laws. Some children travelled\n", 150 | "only with a nanny, therefore parch=0 for them. 
As well, some\n", 151 | "travelled with very close friends or neighbors in a village, however,\n", 152 | "the definitions do not support such relations.\n", 153 | "```\n", 154 | "\n", 155 | "A data frame can be converted into a numpy array by calling the `values` attribute:" 156 | ] 157 | }, 158 | { 159 | "cell_type": "code", 160 | "execution_count": null, 161 | "metadata": { 162 | "collapsed": false 163 | }, 164 | "outputs": [], 165 | "source": [ 166 | "list(data.columns)" 167 | ] 168 | }, 169 | { 170 | "cell_type": "code", 171 | "execution_count": null, 172 | "metadata": { 173 | "collapsed": false 174 | }, 175 | "outputs": [], 176 | "source": [ 177 | "data.shape" 178 | ] 179 | }, 180 | { 181 | "cell_type": "code", 182 | "execution_count": null, 183 | "metadata": { 184 | "collapsed": false 185 | }, 186 | "outputs": [], 187 | "source": [ 188 | "data.values" 189 | ] 190 | }, 191 | { 192 | "cell_type": "markdown", 193 | "metadata": {}, 194 | "source": [ 195 | "However this cannot be directly fed to a scikit-learn model:\n", 196 | "\n", 197 | "\n", 198 | "- the target variable (survival) is mixed with the input data\n", 199 | "\n", 200 | "- some attribute such as unique ids have no predictive values for the task\n", 201 | "\n", 202 | "- the values are heterogeneous (string labels for categories, integers and floating point numbers)\n", 203 | "\n", 204 | "- some attribute values are missing (nan: \"not a number\")" 205 | ] 206 | }, 207 | { 208 | "cell_type": "markdown", 209 | "metadata": {}, 210 | "source": [ 211 | "## Predicting survival" 212 | ] 213 | }, 214 | { 215 | "cell_type": "markdown", 216 | "metadata": {}, 217 | "source": [ 218 | "The goal of the challenge is to predict whether a passenger has survived from others known attribute. Let us have a look at the `Survived` columns:" 219 | ] 220 | }, 221 | { 222 | "cell_type": "code", 223 | "execution_count": null, 224 | "metadata": { 225 | "collapsed": false 226 | }, 227 | "outputs": [], 228 | "source": [ 229 | "survived_column = data['Survived']\n", 230 | "survived_column.dtype" 231 | ] 232 | }, 233 | { 234 | "cell_type": "markdown", 235 | "metadata": {}, 236 | "source": [ 237 | "`data.Survived` is an instance of the pandas `Series` class with an integer dtype:" 238 | ] 239 | }, 240 | { 241 | "cell_type": "code", 242 | "execution_count": null, 243 | "metadata": { 244 | "collapsed": false 245 | }, 246 | "outputs": [], 247 | "source": [ 248 | "type(survived_column)" 249 | ] 250 | }, 251 | { 252 | "cell_type": "markdown", 253 | "metadata": {}, 254 | "source": [ 255 | "The `data` object is an instance pandas `DataFrame` class:" 256 | ] 257 | }, 258 | { 259 | "cell_type": "code", 260 | "execution_count": null, 261 | "metadata": { 262 | "collapsed": false 263 | }, 264 | "outputs": [], 265 | "source": [ 266 | "type(data)" 267 | ] 268 | }, 269 | { 270 | "cell_type": "markdown", 271 | "metadata": {}, 272 | "source": [ 273 | "`Series` can be seen as homegeneous, 1D columns. 
`DataFrame` instances are heterogenous collections of columns with the same length.\n", 274 | "\n", 275 | "The original data frame can be aggregated by counting rows for each possible value of the `Survived` column:" 276 | ] 277 | }, 278 | { 279 | "cell_type": "code", 280 | "execution_count": null, 281 | "metadata": { 282 | "collapsed": false 283 | }, 284 | "outputs": [], 285 | "source": [ 286 | "data.groupby('Survived').count()" 287 | ] 288 | }, 289 | { 290 | "cell_type": "code", 291 | "execution_count": null, 292 | "metadata": { 293 | "collapsed": false 294 | }, 295 | "outputs": [], 296 | "source": [ 297 | "np.mean(survived_column == 0)" 298 | ] 299 | }, 300 | { 301 | "cell_type": "markdown", 302 | "metadata": {}, 303 | "source": [ 304 | "From this the subset of the full passengers list, about 2/3 perished in the event. So if we are to build a predictive model from this data, a baseline model to compare the performance to would be to always predict death. Such a constant model would reach around 62% predictive accuracy (which is higher than predicting at random):" 305 | ] 306 | }, 307 | { 308 | "cell_type": "markdown", 309 | "metadata": {}, 310 | "source": [ 311 | "pandas `Series` instances can be converted to regular 1D numpy arrays by using the `values` attribute:" 312 | ] 313 | }, 314 | { 315 | "cell_type": "code", 316 | "execution_count": null, 317 | "metadata": { 318 | "collapsed": false 319 | }, 320 | "outputs": [], 321 | "source": [ 322 | "target = survived_column.values" 323 | ] 324 | }, 325 | { 326 | "cell_type": "code", 327 | "execution_count": null, 328 | "metadata": { 329 | "collapsed": false 330 | }, 331 | "outputs": [], 332 | "source": [ 333 | "type(target)" 334 | ] 335 | }, 336 | { 337 | "cell_type": "code", 338 | "execution_count": null, 339 | "metadata": { 340 | "collapsed": false 341 | }, 342 | "outputs": [], 343 | "source": [ 344 | "target.dtype" 345 | ] 346 | }, 347 | { 348 | "cell_type": "code", 349 | "execution_count": null, 350 | "metadata": { 351 | "collapsed": false 352 | }, 353 | "outputs": [], 354 | "source": [ 355 | "target[:5]" 356 | ] 357 | }, 358 | { 359 | "cell_type": "markdown", 360 | "metadata": {}, 361 | "source": [ 362 | "## Basic visualization\n", 363 | "\n", 364 | "Full doc at http://pandas.pydata.org/pandas-docs/stable/visualization.html\n", 365 | "\n", 366 | "See also seaborn package.\n", 367 | "\n", 368 | "https://stanford.edu/~mwaskom/software/seaborn/" 369 | ] 370 | }, 371 | { 372 | "cell_type": "code", 373 | "execution_count": null, 374 | "metadata": { 375 | "collapsed": false 376 | }, 377 | "outputs": [], 378 | "source": [ 379 | "data.plot(kind='scatter', x='Age', y='Fare', c='Survived', s=50);" 380 | ] 381 | }, 382 | { 383 | "cell_type": "code", 384 | "execution_count": null, 385 | "metadata": { 386 | "collapsed": false 387 | }, 388 | "outputs": [], 389 | "source": [ 390 | "from pandas.tools.plotting import scatter_matrix\n", 391 | "scatter_matrix(data.get(['Fare', 'Pclass', 'Age']), alpha=0.2, figsize=(8, 8), diagonal='kde');" 392 | ] 393 | }, 394 | { 395 | "cell_type": "code", 396 | "execution_count": null, 397 | "metadata": { 398 | "collapsed": false 399 | }, 400 | "outputs": [], 401 | "source": [ 402 | "data.get(['Fare', 'Age', 'Survived']).groupby('Survived').mean().plot(kind='bar');" 403 | ] 404 | }, 405 | { 406 | "cell_type": "markdown", 407 | "metadata": {}, 408 | "source": [ 409 | "## Training a predictive model on numerical features" 410 | ] 411 | }, 412 | { 413 | "cell_type": "markdown", 414 | "metadata": {}, 415 | "source": [ 416 
| "`sklearn` estimators all work with homegeneous numerical feature descriptors passed as a numpy array. Therefore passing the raw data frame will not work out of the box.\n", 417 | "\n", 418 | "Let us start simple and build a first model that only uses readily available numerical features as input, namely `data.Fare`, `data.Pclass` and `data.Age`." 419 | ] 420 | }, 421 | { 422 | "cell_type": "code", 423 | "execution_count": null, 424 | "metadata": { 425 | "collapsed": false 426 | }, 427 | "outputs": [], 428 | "source": [ 429 | "numerical_features = data.get(['Fare', 'Pclass', 'Age'])\n", 430 | "numerical_features.head(5)" 431 | ] 432 | }, 433 | { 434 | "cell_type": "markdown", 435 | "metadata": {}, 436 | "source": [ 437 | "Unfortunately some passengers do not have age information:" 438 | ] 439 | }, 440 | { 441 | "cell_type": "code", 442 | "execution_count": null, 443 | "metadata": { 444 | "collapsed": false 445 | }, 446 | "outputs": [], 447 | "source": [ 448 | "numerical_features.count()" 449 | ] 450 | }, 451 | { 452 | "cell_type": "markdown", 453 | "metadata": {}, 454 | "source": [ 455 | "Let's use pandas `fillna` method to input the median age for those passengers:" 456 | ] 457 | }, 458 | { 459 | "cell_type": "code", 460 | "execution_count": null, 461 | "metadata": { 462 | "collapsed": false 463 | }, 464 | "outputs": [], 465 | "source": [ 466 | "median_features = numerical_features.dropna().median()\n", 467 | "median_features" 468 | ] 469 | }, 470 | { 471 | "cell_type": "code", 472 | "execution_count": null, 473 | "metadata": { 474 | "collapsed": false 475 | }, 476 | "outputs": [], 477 | "source": [ 478 | "imputed_features = numerical_features.fillna(median_features)\n", 479 | "imputed_features.count()" 480 | ] 481 | }, 482 | { 483 | "cell_type": "code", 484 | "execution_count": null, 485 | "metadata": { 486 | "collapsed": false 487 | }, 488 | "outputs": [], 489 | "source": [ 490 | "imputed_features.head(5)" 491 | ] 492 | }, 493 | { 494 | "cell_type": "markdown", 495 | "metadata": {}, 496 | "source": [ 497 | "Now that the data frame is clean, we can convert it into an homogeneous numpy array of floating point values:" 498 | ] 499 | }, 500 | { 501 | "cell_type": "code", 502 | "execution_count": null, 503 | "metadata": { 504 | "collapsed": false 505 | }, 506 | "outputs": [], 507 | "source": [ 508 | "features_array = imputed_features.values\n", 509 | "features_array" 510 | ] 511 | }, 512 | { 513 | "cell_type": "code", 514 | "execution_count": null, 515 | "metadata": { 516 | "collapsed": false 517 | }, 518 | "outputs": [], 519 | "source": [ 520 | "features_array.dtype" 521 | ] 522 | }, 523 | { 524 | "cell_type": "markdown", 525 | "metadata": {}, 526 | "source": [ 527 | "Let's take the 80% of the data for training a first model and keep 20% for computing is generalization score:" 528 | ] 529 | }, 530 | { 531 | "cell_type": "code", 532 | "execution_count": null, 533 | "metadata": { 534 | "collapsed": false 535 | }, 536 | "outputs": [], 537 | "source": [ 538 | "from sklearn.model_selection import train_test_split\n", 539 | "\n", 540 | "features_train, features_test, target_train, target_test = train_test_split(\n", 541 | " features_array, target, test_size=0.20, random_state=0)" 542 | ] 543 | }, 544 | { 545 | "cell_type": "code", 546 | "execution_count": null, 547 | "metadata": { 548 | "collapsed": false 549 | }, 550 | "outputs": [], 551 | "source": [ 552 | "features_train.shape" 553 | ] 554 | }, 555 | { 556 | "cell_type": "code", 557 | "execution_count": null, 558 | "metadata": { 559 | 
"collapsed": false 560 | }, 561 | "outputs": [], 562 | "source": [ 563 | "features_test.shape" 564 | ] 565 | }, 566 | { 567 | "cell_type": "code", 568 | "execution_count": null, 569 | "metadata": { 570 | "collapsed": false 571 | }, 572 | "outputs": [], 573 | "source": [ 574 | "target_train.shape" 575 | ] 576 | }, 577 | { 578 | "cell_type": "code", 579 | "execution_count": null, 580 | "metadata": { 581 | "collapsed": false 582 | }, 583 | "outputs": [], 584 | "source": [ 585 | "target_test.shape" 586 | ] 587 | }, 588 | { 589 | "cell_type": "markdown", 590 | "metadata": {}, 591 | "source": [ 592 | "Let's start with a simple model from sklearn, namely `LogisticRegression`:" 593 | ] 594 | }, 595 | { 596 | "cell_type": "code", 597 | "execution_count": null, 598 | "metadata": { 599 | "collapsed": false 600 | }, 601 | "outputs": [], 602 | "source": [ 603 | "from sklearn.linear_model import LogisticRegression\n", 604 | "\n", 605 | "logreg = LogisticRegression(C=1.)\n", 606 | "logreg.fit(features_train, target_train)" 607 | ] 608 | }, 609 | { 610 | "cell_type": "code", 611 | "execution_count": null, 612 | "metadata": { 613 | "collapsed": false 614 | }, 615 | "outputs": [], 616 | "source": [ 617 | "target_predicted = logreg.predict(features_test)" 618 | ] 619 | }, 620 | { 621 | "cell_type": "code", 622 | "execution_count": null, 623 | "metadata": { 624 | "collapsed": false 625 | }, 626 | "outputs": [], 627 | "source": [ 628 | "from sklearn.metrics import accuracy_score\n", 629 | "\n", 630 | "accuracy_score(target_test, target_predicted)" 631 | ] 632 | }, 633 | { 634 | "cell_type": "markdown", 635 | "metadata": {}, 636 | "source": [ 637 | "This first model has around 73% accuracy: this is better than our baseline that always predicts death." 638 | ] 639 | }, 640 | { 641 | "cell_type": "code", 642 | "execution_count": null, 643 | "metadata": { 644 | "collapsed": false 645 | }, 646 | "outputs": [], 647 | "source": [ 648 | "logreg.score(features_test, target_test)" 649 | ] 650 | }, 651 | { 652 | "cell_type": "markdown", 653 | "metadata": {}, 654 | "source": [ 655 | "## Model evaluation and interpretation" 656 | ] 657 | }, 658 | { 659 | "cell_type": "markdown", 660 | "metadata": {}, 661 | "source": [ 662 | "### Interpreting linear model weights" 663 | ] 664 | }, 665 | { 666 | "cell_type": "markdown", 667 | "metadata": {}, 668 | "source": [ 669 | "The `coef_` attribute of a fitted linear model such as `LogisticRegression` holds the weights of each features:" 670 | ] 671 | }, 672 | { 673 | "cell_type": "code", 674 | "execution_count": null, 675 | "metadata": { 676 | "collapsed": false 677 | }, 678 | "outputs": [], 679 | "source": [ 680 | "feature_names = numerical_features.columns\n", 681 | "feature_names" 682 | ] 683 | }, 684 | { 685 | "cell_type": "code", 686 | "execution_count": null, 687 | "metadata": { 688 | "collapsed": false 689 | }, 690 | "outputs": [], 691 | "source": [ 692 | "logreg.coef_" 693 | ] 694 | }, 695 | { 696 | "cell_type": "code", 697 | "execution_count": null, 698 | "metadata": { 699 | "collapsed": false 700 | }, 701 | "outputs": [], 702 | "source": [ 703 | "x = np.arange(len(feature_names))\n", 704 | "plt.bar(x, logreg.coef_.ravel())\n", 705 | "_ = plt.xticks(x + 0.5, feature_names, rotation=30)" 706 | ] 707 | }, 708 | { 709 | "cell_type": "markdown", 710 | "metadata": {}, 711 | "source": [ 712 | "In this case, survival is slightly positively linked with Fare (the higher the fare, the higher the likelyhood the model will predict survival) while passenger from first class and lower ages 
are predicted to survive more often than older people from the 3rd class.\n", 713 | "\n", 714 | "First-class cabins were closer to the lifeboats and children and women reportedly had the priority. Our model seems to capture that historical data. We will see later if the sex of the passenger can be used as an informative predictor to increase the predictive accuracy of the model." 715 | ] 716 | }, 717 | { 718 | "cell_type": "markdown", 719 | "metadata": {}, 720 | "source": [ 721 | "### Alternative evaluation metrics" 722 | ] 723 | }, 724 | { 725 | "cell_type": "markdown", 726 | "metadata": {}, 727 | "source": [ 728 | "It is possible to see the details of the false positive and false negative errors by computing the confusion matrix:" 729 | ] 730 | }, 731 | { 732 | "cell_type": "code", 733 | "execution_count": null, 734 | "metadata": { 735 | "collapsed": false 736 | }, 737 | "outputs": [], 738 | "source": [ 739 | "from sklearn.metrics import confusion_matrix\n", 740 | "\n", 741 | "cm = confusion_matrix(target_test, target_predicted)\n", 742 | "print(cm)" 743 | ] 744 | }, 745 | { 746 | "cell_type": "markdown", 747 | "metadata": {}, 748 | "source": [ 749 | "The true labeling are seen as the rows and the predicted labels are the columns:" 750 | ] 751 | }, 752 | { 753 | "cell_type": "code", 754 | "execution_count": null, 755 | "metadata": { 756 | "collapsed": false 757 | }, 758 | "outputs": [], 759 | "source": [ 760 | "def plot_confusion(cm):\n", 761 | " plt.imshow(cm, interpolation='nearest', cmap=plt.cm.binary)\n", 762 | " plt.title('Confusion matrix')\n", 763 | " plt.set_cmap('Blues')\n", 764 | " plt.colorbar()\n", 765 | "\n", 766 | " target_names = ['not survived', 'survived']\n", 767 | "\n", 768 | " tick_marks = np.arange(len(target_names))\n", 769 | " plt.xticks(tick_marks, target_names, rotation=60)\n", 770 | " plt.yticks(tick_marks, target_names)\n", 771 | " plt.ylabel('True label')\n", 772 | " plt.xlabel('Predicted label')\n", 773 | " # Convenience function to adjust plot parameters for a clear layout.\n", 774 | " plt.tight_layout()\n", 775 | " \n", 776 | "plot_confusion(cm)" 777 | ] 778 | }, 779 | { 780 | "cell_type": "code", 781 | "execution_count": null, 782 | "metadata": { 783 | "collapsed": false 784 | }, 785 | "outputs": [], 786 | "source": [ 787 | "print(cm)" 788 | ] 789 | }, 790 | { 791 | "cell_type": "markdown", 792 | "metadata": {}, 793 | "source": [ 794 | "We can normalize the number of prediction by dividing by the total number of true \"survived\" and \"not survived\" to compute false and true positive rates for survival (in the second column of the confusion matrix)." 
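One caveat when normalizing with NumPy: broadcasting aligns the trailing axis, so dividing the matrix by `cm.sum(axis=1)` without keeping the dimensions divides each *column*, rather than each row, by the row totals. A small sketch of an explicit row-wise normalization, using made-up counts:

```python
# Row-wise normalization of a confusion matrix (illustrative counts only).
# keepdims=True turns the row totals into a (2, 1) column vector so that
# broadcasting divides every entry of row i by the total of row i.
import numpy as np

cm_example = np.array([[98, 12],
                       [27, 42]])
row_totals = cm_example.sum(axis=1, keepdims=True)   # shape (2, 1)
print(cm_example.astype(np.float64) / row_totals)    # each row sums to 1.0
```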
795 | ] 796 | }, 797 | { 798 | "cell_type": "code", 799 | "execution_count": null, 800 | "metadata": { 801 | "collapsed": false 802 | }, 803 | "outputs": [], 804 | "source": [ 805 | "print(cm.astype(np.float64) / cm.sum(axis=1))" 806 | ] 807 | }, 808 | { 809 | "cell_type": "markdown", 810 | "metadata": {}, 811 | "source": [ 812 | "We can therefore observe that the fact that the target classes are not balanced in the dataset makes the accuracy score not very informative.\n", 813 | "\n", 814 | "scikit-learn provides alternative classification metrics to evaluate models performance on imbalanced data such as precision, recall and f1 score:" 815 | ] 816 | }, 817 | { 818 | "cell_type": "code", 819 | "execution_count": null, 820 | "metadata": { 821 | "collapsed": false 822 | }, 823 | "outputs": [], 824 | "source": [ 825 | "from sklearn.metrics import classification_report\n", 826 | "\n", 827 | "print(classification_report(target_test, target_predicted,\n", 828 | " target_names=['not survived', 'survived']))" 829 | ] 830 | }, 831 | { 832 | "cell_type": "markdown", 833 | "metadata": {}, 834 | "source": [ 835 | "Another way to quantify the quality of a binary classifier on imbalanced data is to compute the precision, recall and f1-score of a model (at the default fixed decision threshold of 0.5)." 836 | ] 837 | }, 838 | { 839 | "cell_type": "markdown", 840 | "metadata": {}, 841 | "source": [ 842 | "Logistic Regression is a probabilistic models: instead of just predicting a binary outcome (survived or not) given the input features it can also estimates the posterior probability of the outcome given the input features using the `predict_proba` method:" 843 | ] 844 | }, 845 | { 846 | "cell_type": "code", 847 | "execution_count": null, 848 | "metadata": { 849 | "collapsed": false 850 | }, 851 | "outputs": [], 852 | "source": [ 853 | "target_predicted_proba = logreg.predict_proba(features_test)\n", 854 | "target_predicted_proba[:5]" 855 | ] 856 | }, 857 | { 858 | "cell_type": "markdown", 859 | "metadata": {}, 860 | "source": [ 861 | "By default the decision threshold is 0.5: if we vary the decision threshold from 0 to 1 we could generate a family of binary classifier models that address all the possible trade offs between false positive and false negative prediction errors.\n", 862 | "\n", 863 | "We can summarize the performance of a binary classifier for all the possible thresholds by plotting the ROC curve and quantifying the Area under the ROC curve:" 864 | ] 865 | }, 866 | { 867 | "cell_type": "code", 868 | "execution_count": null, 869 | "metadata": { 870 | "collapsed": false 871 | }, 872 | "outputs": [], 873 | "source": [ 874 | "from sklearn.metrics import roc_curve\n", 875 | "from sklearn.metrics import auc\n", 876 | "\n", 877 | "def plot_roc_curve(target_test, target_predicted_proba):\n", 878 | " fpr, tpr, thresholds = roc_curve(target_test, target_predicted_proba[:, 1])\n", 879 | " \n", 880 | " roc_auc = auc(fpr, tpr)\n", 881 | " # Plot ROC curve\n", 882 | " plt.plot(fpr, tpr, label='ROC curve (area = %0.3f)' % roc_auc)\n", 883 | " plt.plot([0, 1], [0, 1], 'k--') # random predictions curve\n", 884 | " plt.xlim([0.0, 1.0])\n", 885 | " plt.ylim([0.0, 1.0])\n", 886 | " plt.xlabel('False Positive Rate or (1 - Specifity)')\n", 887 | " plt.ylabel('True Positive Rate or (Sensitivity)')\n", 888 | " plt.title('Receiver Operating Characteristic')\n", 889 | " plt.legend(loc=\"lower right\")" 890 | ] 891 | }, 892 | { 893 | "cell_type": "code", 894 | "execution_count": null, 895 | "metadata": { 896 | 
"collapsed": false 897 | }, 898 | "outputs": [], 899 | "source": [ 900 | "plot_roc_curve(target_test, target_predicted_proba)" 901 | ] 902 | }, 903 | { 904 | "cell_type": "markdown", 905 | "metadata": {}, 906 | "source": [ 907 | "Here the area under ROC curve is 0.756 which is very similar to the accuracy (0.732). However the ROC-AUC score of a random model is expected to 0.5 on average while the accuracy score of a random model depends on the class imbalance of the data. ROC-AUC can be seen as a way to callibrate the predictive accuracy of a model against class imbalance." 908 | ] 909 | }, 910 | { 911 | "cell_type": "markdown", 912 | "metadata": {}, 913 | "source": [ 914 | "### Cross-validation" 915 | ] 916 | }, 917 | { 918 | "cell_type": "markdown", 919 | "metadata": {}, 920 | "source": [ 921 | "We previously decided to randomly split the data to evaluate the model on 20% of held-out data. However the location randomness of the split might have a significant impact in the estimated accuracy:" 922 | ] 923 | }, 924 | { 925 | "cell_type": "code", 926 | "execution_count": null, 927 | "metadata": { 928 | "collapsed": false 929 | }, 930 | "outputs": [], 931 | "source": [ 932 | "features_train, features_test, target_train, target_test = train_test_split(\n", 933 | " features_array, target, test_size=0.20, random_state=0)\n", 934 | "\n", 935 | "logreg.fit(features_train, target_train).score(features_test, target_test)" 936 | ] 937 | }, 938 | { 939 | "cell_type": "code", 940 | "execution_count": null, 941 | "metadata": { 942 | "collapsed": false 943 | }, 944 | "outputs": [], 945 | "source": [ 946 | "features_train, features_test, target_train, target_test = train_test_split(\n", 947 | " features_array, target, test_size=0.20, random_state=1)\n", 948 | "\n", 949 | "logreg.fit(features_train, target_train).score(features_test, target_test)" 950 | ] 951 | }, 952 | { 953 | "cell_type": "code", 954 | "execution_count": null, 955 | "metadata": { 956 | "collapsed": false 957 | }, 958 | "outputs": [], 959 | "source": [ 960 | "features_train, features_test, target_train, target_test = train_test_split(\n", 961 | " features_array, target, test_size=0.20, random_state=2)\n", 962 | "\n", 963 | "logreg.fit(features_train, target_train).score(features_test, target_test)" 964 | ] 965 | }, 966 | { 967 | "cell_type": "markdown", 968 | "metadata": {}, 969 | "source": [ 970 | "So instead of using a single train / test split, we can use a group of them and compute the min, max and mean scores as an estimation of the real test score while not underestimating the variability:" 971 | ] 972 | }, 973 | { 974 | "cell_type": "code", 975 | "execution_count": null, 976 | "metadata": { 977 | "collapsed": false 978 | }, 979 | "outputs": [], 980 | "source": [ 981 | "from sklearn.model_selection import cross_val_score\n", 982 | "\n", 983 | "scores = cross_val_score(logreg, features_array, target, cv=5)\n", 984 | "scores" 985 | ] 986 | }, 987 | { 988 | "cell_type": "code", 989 | "execution_count": null, 990 | "metadata": { 991 | "collapsed": false 992 | }, 993 | "outputs": [], 994 | "source": [ 995 | "scores.min(), scores.mean(), scores.max()" 996 | ] 997 | }, 998 | { 999 | "cell_type": "markdown", 1000 | "metadata": {}, 1001 | "source": [ 1002 | "`cross_val_score` reports accuracy by default be it can also be used to report other performance metrics such as ROC-AUC or f1-score:" 1003 | ] 1004 | }, 1005 | { 1006 | "cell_type": "code", 1007 | "execution_count": null, 1008 | "metadata": { 1009 | "collapsed": false 1010 | }, 1011 | 
"outputs": [], 1012 | "source": [ 1013 | "scores = cross_val_score(logreg, features_array, target, cv=5,\n", 1014 | " scoring='roc_auc')\n", 1015 | "scores.min(), scores.mean(), scores.max()" 1016 | ] 1017 | }, 1018 | { 1019 | "cell_type": "markdown", 1020 | "metadata": {}, 1021 | "source": [ 1022 | "**Exercise**:\n", 1023 | "\n", 1024 | "- Compute cross-validated scores for other classification metrics ('precision', 'recall', 'f1', 'accuracy'...).\n", 1025 | "\n", 1026 | "- Change the number of cross-validation folds between 3 and 10: what is the impact on the mean score? on the processing time?\n", 1027 | "\n", 1028 | "Hints:\n", 1029 | "\n", 1030 | "The list of classification metrics is available in the online documentation:\n", 1031 | "\n", 1032 | " http://scikit-learn.org/stable/modules/model_evaluation.html#common-cases-predefined-values\n", 1033 | " \n", 1034 | "You can use the `%%time` cell magic on the first line of an IPython cell to measure the time of the execution of the cell. " 1035 | ] 1036 | }, 1037 | { 1038 | "cell_type": "code", 1039 | "execution_count": null, 1040 | "metadata": { 1041 | "collapsed": false 1042 | }, 1043 | "outputs": [], 1044 | "source": [] 1045 | }, 1046 | { 1047 | "cell_type": "markdown", 1048 | "metadata": {}, 1049 | "source": [ 1050 | "## More feature engineering and richer models" 1051 | ] 1052 | }, 1053 | { 1054 | "cell_type": "markdown", 1055 | "metadata": {}, 1056 | "source": [ 1057 | "Let us now try to build richer models by including more features as potential predictors for our model.\n", 1058 | "\n", 1059 | "Categorical variables such as `data.Embarked` or `data.Sex` can be converted as boolean indicators features also known as dummy variables or one-hot-encoded features:" 1060 | ] 1061 | }, 1062 | { 1063 | "cell_type": "code", 1064 | "execution_count": null, 1065 | "metadata": { 1066 | "collapsed": false 1067 | }, 1068 | "outputs": [], 1069 | "source": [ 1070 | "pd.get_dummies(data.Sex, prefix='Sex').head(5)" 1071 | ] 1072 | }, 1073 | { 1074 | "cell_type": "code", 1075 | "execution_count": null, 1076 | "metadata": { 1077 | "collapsed": false 1078 | }, 1079 | "outputs": [], 1080 | "source": [ 1081 | "pd.get_dummies(data.Embarked, prefix='Embarked').head(5)" 1082 | ] 1083 | }, 1084 | { 1085 | "cell_type": "markdown", 1086 | "metadata": {}, 1087 | "source": [ 1088 | "We can combine those new numerical features with the previous features using `pandas.concat` along `axis=1`:" 1089 | ] 1090 | }, 1091 | { 1092 | "cell_type": "code", 1093 | "execution_count": null, 1094 | "metadata": { 1095 | "collapsed": false 1096 | }, 1097 | "outputs": [], 1098 | "source": [ 1099 | "rich_features = pd.concat([data.get(['Fare', 'Pclass', 'Age']),\n", 1100 | " pd.get_dummies(data.Sex, prefix='Sex'),\n", 1101 | " pd.get_dummies(data.Embarked, prefix='Embarked')],\n", 1102 | " axis=1)\n", 1103 | "rich_features.head(5)" 1104 | ] 1105 | }, 1106 | { 1107 | "cell_type": "markdown", 1108 | "metadata": {}, 1109 | "source": [ 1110 | "By construction the new `Sex_male` feature is redundant with `Sex_female`. 
Let us drop it:" 1111 | ] 1112 | }, 1113 | { 1114 | "cell_type": "code", 1115 | "execution_count": null, 1116 | "metadata": { 1117 | "collapsed": false 1118 | }, 1119 | "outputs": [], 1120 | "source": [ 1121 | "rich_features_no_male = rich_features.drop('Sex_male', 1)\n", 1122 | "rich_features_no_male.head(5)" 1123 | ] 1124 | }, 1125 | { 1126 | "cell_type": "markdown", 1127 | "metadata": {}, 1128 | "source": [ 1129 | "Let us not forget to impute the median age for passengers without age information:" 1130 | ] 1131 | }, 1132 | { 1133 | "cell_type": "code", 1134 | "execution_count": null, 1135 | "metadata": { 1136 | "collapsed": false 1137 | }, 1138 | "outputs": [], 1139 | "source": [ 1140 | "rich_features_no_male.count()" 1141 | ] 1142 | }, 1143 | { 1144 | "cell_type": "code", 1145 | "execution_count": null, 1146 | "metadata": { 1147 | "collapsed": false 1148 | }, 1149 | "outputs": [], 1150 | "source": [ 1151 | "rich_features_final = rich_features_no_male.fillna(rich_features_no_male.dropna().median())\n", 1152 | "rich_features_final.count()" 1153 | ] 1154 | }, 1155 | { 1156 | "cell_type": "markdown", 1157 | "metadata": {}, 1158 | "source": [ 1159 | "We can finally cross-validate a logistic regression model on this new data an observe that the mean score has significantly increased:" 1160 | ] 1161 | }, 1162 | { 1163 | "cell_type": "code", 1164 | "execution_count": null, 1165 | "metadata": { 1166 | "collapsed": false 1167 | }, 1168 | "outputs": [], 1169 | "source": [ 1170 | "%%time\n", 1171 | "\n", 1172 | "from sklearn.linear_model import LogisticRegression\n", 1173 | "from sklearn.model_selection import cross_val_score\n", 1174 | "\n", 1175 | "logreg = LogisticRegression(C=1.)\n", 1176 | "scores = cross_val_score(logreg, rich_features_final, target, cv=5, scoring='accuracy')\n", 1177 | "print(\"Logistic Regression CV scores:\")\n", 1178 | "print(\"min: {:.3f}, mean: {:.3f}, max: {:.3f}\".format(\n", 1179 | " scores.min(), scores.mean(), scores.max()))" 1180 | ] 1181 | }, 1182 | { 1183 | "cell_type": "markdown", 1184 | "metadata": {}, 1185 | "source": [ 1186 | "**Exercise**:\n", 1187 | "\n", 1188 | "- change the value of the parameter `C`. Does it have an impact on the score?\n", 1189 | "\n", 1190 | "- fit a new instance of the logistic regression model on the full dataset.\n", 1191 | "\n", 1192 | "- plot the weights for the features of this newly fitted logistic regression model." 1193 | ] 1194 | }, 1195 | { 1196 | "cell_type": "code", 1197 | "execution_count": null, 1198 | "metadata": { 1199 | "collapsed": false 1200 | }, 1201 | "outputs": [], 1202 | "source": [ 1203 | "%load solutions/04A_plot_logistic_regression_weights.py" 1204 | ] 1205 | }, 1206 | { 1207 | "cell_type": "code", 1208 | "execution_count": null, 1209 | "metadata": { 1210 | "collapsed": false 1211 | }, 1212 | "outputs": [], 1213 | "source": [] 1214 | }, 1215 | { 1216 | "cell_type": "markdown", 1217 | "metadata": {}, 1218 | "source": [ 1219 | "### Training Non-linear models: ensembles of randomized trees" 1220 | ] 1221 | }, 1222 | { 1223 | "cell_type": "markdown", 1224 | "metadata": {}, 1225 | "source": [ 1226 | "`sklearn` also implement non linear models that are known to perform very well for data-science projects where datasets have not too many features (e.g. 
less than 5000).\n", 1227 | "\n", 1228 | "In particular let us have a look at Random Forests and Gradient Boosted Trees:" 1229 | ] 1230 | }, 1231 | { 1232 | "cell_type": "code", 1233 | "execution_count": null, 1234 | "metadata": { 1235 | "collapsed": false 1236 | }, 1237 | "outputs": [], 1238 | "source": [ 1239 | "%%time\n", 1240 | "\n", 1241 | "from sklearn.ensemble import RandomForestClassifier\n", 1242 | "\n", 1243 | "rf = RandomForestClassifier(n_estimators=100)\n", 1244 | "scores = cross_val_score(rf, rich_features_final, target, cv=5, n_jobs=4,\n", 1245 | " scoring='accuracy')\n", 1246 | "print(\"Random Forest CV scores:\")\n", 1247 | "print(\"min: {:.3f}, mean: {:.3f}, max: {:.3f}\".format(\n", 1248 | " scores.min(), scores.mean(), scores.max()))" 1249 | ] 1250 | }, 1251 | { 1252 | "cell_type": "code", 1253 | "execution_count": null, 1254 | "metadata": { 1255 | "collapsed": false 1256 | }, 1257 | "outputs": [], 1258 | "source": [ 1259 | "%%time\n", 1260 | "\n", 1261 | "from sklearn.ensemble import GradientBoostingClassifier\n", 1262 | "\n", 1263 | "gb = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1,\n", 1264 | " subsample=.8, max_features=.5)\n", 1265 | "scores = cross_val_score(gb, rich_features_final, target, cv=5, n_jobs=4,\n", 1266 | " scoring='accuracy')\n", 1267 | "print(\"Gradient Boosted Trees CV scores:\")\n", 1268 | "print(\"min: {:.3f}, mean: {:.3f}, max: {:.3f}\".format(\n", 1269 | " scores.min(), scores.mean(), scores.max()))" 1270 | ] 1271 | }, 1272 | { 1273 | "cell_type": "markdown", 1274 | "metadata": {}, 1275 | "source": [ 1276 | "Both models seem to do slightly better than the logistic regression model on this data." 1277 | ] 1278 | }, 1279 | { 1280 | "cell_type": "markdown", 1281 | "metadata": {}, 1282 | "source": [ 1283 | "**Exercise**:\n", 1284 | "\n", 1285 | "- Change the value of the `learning_rate` and other `GradientBoostingClassifier` parameters: can you get a better mean score?\n", 1286 | "\n", 1287 | "- Would treating the `Pclass` variable as categorical improve the model's performance?\n", 1288 | "\n", 1289 | "- Find out which predictor variables (features) are the most informative for those models.\n", 1290 | "\n", 1291 | "Hints:\n", 1292 | "\n", 1293 | "Fitted ensembles of trees have a `feature_importances_` attribute that can be used similarly to the `coef_` attribute of linear models." 
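(Editor's aside, not part of the original notebook cells.) The hint above translates to code directly; here is a minimal sketch, reusing the `gb` estimator, `rich_features_final` and `target` defined in this notebook, of how `feature_importances_` can be inspected once the ensemble is fitted. The solution loaded further below shows the same information as a bar plot.

```python
# Sketch only: fit the gradient boosted trees model on the full feature matrix
# and list the features by decreasing importance.
gb.fit(rich_features_final, target)
for name, importance in sorted(zip(rich_features_final.columns,
                                   gb.feature_importances_),
                               key=lambda pair: pair[1], reverse=True):
    print("{}: {:.3f}".format(name, importance))
```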
1294 | ] 1295 | }, 1296 | { 1297 | "cell_type": "code", 1298 | "execution_count": null, 1299 | "metadata": { 1300 | "collapsed": false 1301 | }, 1302 | "outputs": [], 1303 | "source": [ 1304 | "%load solutions/04B_more_categorical_variables.py" 1305 | ] 1306 | }, 1307 | { 1308 | "cell_type": "code", 1309 | "execution_count": null, 1310 | "metadata": { 1311 | "collapsed": false 1312 | }, 1313 | "outputs": [], 1314 | "source": [] 1315 | }, 1316 | { 1317 | "cell_type": "code", 1318 | "execution_count": null, 1319 | "metadata": { 1320 | "collapsed": false 1321 | }, 1322 | "outputs": [], 1323 | "source": [ 1324 | "%load solutions/04C_feature_importance.py" 1325 | ] 1326 | }, 1327 | { 1328 | "cell_type": "code", 1329 | "execution_count": null, 1330 | "metadata": { 1331 | "collapsed": false 1332 | }, 1333 | "outputs": [], 1334 | "source": [] 1335 | }, 1336 | { 1337 | "cell_type": "markdown", 1338 | "metadata": {}, 1339 | "source": [ 1340 | "## Automated parameter tuning" 1341 | ] 1342 | }, 1343 | { 1344 | "cell_type": "markdown", 1345 | "metadata": {}, 1346 | "source": [ 1347 | "Instead of changing the value of the learning rate manually and re-running the cross-validation, we can find the best values for the parameters automatically (assuming we are ready to wait):" 1348 | ] 1349 | }, 1350 | { 1351 | "cell_type": "code", 1352 | "execution_count": null, 1353 | "metadata": { 1354 | "collapsed": false 1355 | }, 1356 | "outputs": [], 1357 | "source": [ 1358 | "%%time\n", 1359 | "\n", 1360 | "from sklearn.model_selection import GridSearchCV\n", 1361 | "\n", 1362 | "gb = GradientBoostingClassifier(n_estimators=100, subsample=.8)\n", 1363 | "\n", 1364 | "params = {\n", 1365 | " 'learning_rate': [0.05, 0.1, 0.5],\n", 1366 | " 'max_features': [0.5, 1],\n", 1367 | " 'max_depth': [3, 4, 5],\n", 1368 | "}\n", 1369 | "gs = GridSearchCV(gb, params, cv=5, scoring='roc_auc', n_jobs=4)\n", 1370 | "gs.fit(rich_features_final, target)" 1371 | ] 1372 | }, 1373 | { 1374 | "cell_type": "markdown", 1375 | "metadata": {}, 1376 | "source": [ 1377 | "Let us sort the models by mean validation score:" 1378 | ] 1379 | }, 1380 | { 1381 | "cell_type": "code", 1382 | "execution_count": null, 1383 | "metadata": { 1384 | "collapsed": false 1385 | }, 1386 | "outputs": [], 1387 | "source": [ 1388 | "sorted(gs.grid_scores_, key=lambda x: x.mean_validation_score, reverse=True)" 1389 | ] 1390 | }, 1391 | { 1392 | "cell_type": "code", 1393 | "execution_count": null, 1394 | "metadata": { 1395 | "collapsed": false 1396 | }, 1397 | "outputs": [], 1398 | "source": [ 1399 | "gs.best_score_" 1400 | ] 1401 | }, 1402 | { 1403 | "cell_type": "code", 1404 | "execution_count": null, 1405 | "metadata": { 1406 | "collapsed": false 1407 | }, 1408 | "outputs": [], 1409 | "source": [ 1410 | "gs.best_params_" 1411 | ] 1412 | }, 1413 | { 1414 | "cell_type": "markdown", 1415 | "metadata": {}, 1416 | "source": [ 1417 | "We should note that the mean scores are very close to one another and almost always within one standard deviation of one another. This means that all those parameters are quite reasonable. The only parameter of importance seems to be the `learning_rate`: 0.5 seems to be a bit too high." 
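(Editor's aside, not part of the original notebook cells.) Because `GridSearchCV` is created with the default `refit=True`, the best parameter combination found above is automatically refit on the whole dataset once the search finishes. A minimal sketch of how the tuned model could then be reused, assuming the fitted `gs` object from the previous cells:

```python
# The grid search object exposes the winning model, refit on all the data.
best_model = gs.best_estimator_

# It behaves like any other fitted classifier, e.g. predicting survival
# probabilities for the first few passengers of the feature dataframe.
print(best_model.predict_proba(rich_features_final[:5]))
```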
1418 | ] 1419 | }, 1420 | { 1421 | "cell_type": "markdown", 1422 | "metadata": {}, 1423 | "source": [ 1424 | "## Avoiding data snooping with pipelines" 1425 | ] 1426 | }, 1427 | { 1428 | "cell_type": "markdown", 1429 | "metadata": {}, 1430 | "source": [ 1431 | "When doing imputation in pandas, prior to computing the train test split we use data from the test to improve the accuracy of the median value that we impute on the training set. This is actually cheating. To avoid this we should compute the median of the features on the training fold and use that median value to do the imputation both on the training and validation fold for a given CV split.\n", 1432 | "\n", 1433 | "To do this we can prepare the features as previously but without the imputation: we just replace missing values by the -1 marker value:" 1434 | ] 1435 | }, 1436 | { 1437 | "cell_type": "code", 1438 | "execution_count": null, 1439 | "metadata": { 1440 | "collapsed": false 1441 | }, 1442 | "outputs": [], 1443 | "source": [ 1444 | "features = pd.concat([data.get(['Fare', 'Age']),\n", 1445 | " pd.get_dummies(data.Sex, prefix='Sex'),\n", 1446 | " pd.get_dummies(data.Pclass, prefix='Pclass'),\n", 1447 | " pd.get_dummies(data.Embarked, prefix='Embarked')],\n", 1448 | " axis=1)\n", 1449 | "features = features.drop('Sex_male', 1)\n", 1450 | "\n", 1451 | "# Because of the following bug we cannot use NaN as the missing\n", 1452 | "# value marker, use a negative value as marker instead:\n", 1453 | "# https://github.com/scikit-learn/scikit-learn/issues/3044\n", 1454 | "features = features.fillna(-1)\n", 1455 | "features.head(5)" 1456 | ] 1457 | }, 1458 | { 1459 | "cell_type": "markdown", 1460 | "metadata": {}, 1461 | "source": [ 1462 | "We can now use the `Imputer` transformer of scikit-learn to find the median value on the training set and apply it on missing values of both the training set and the test set." 1463 | ] 1464 | }, 1465 | { 1466 | "cell_type": "code", 1467 | "execution_count": null, 1468 | "metadata": { 1469 | "collapsed": false 1470 | }, 1471 | "outputs": [], 1472 | "source": [ 1473 | "from sklearn.model_selection import train_test_split\n", 1474 | "\n", 1475 | "X_train, X_test, y_train, y_test = train_test_split(features.values, target, random_state=0)" 1476 | ] 1477 | }, 1478 | { 1479 | "cell_type": "code", 1480 | "execution_count": null, 1481 | "metadata": { 1482 | "collapsed": false 1483 | }, 1484 | "outputs": [], 1485 | "source": [ 1486 | "from sklearn.preprocessing import Imputer\n", 1487 | "\n", 1488 | "imputer = Imputer(strategy='median', missing_values=-1)\n", 1489 | "\n", 1490 | "imputer.fit(X_train)" 1491 | ] 1492 | }, 1493 | { 1494 | "cell_type": "markdown", 1495 | "metadata": {}, 1496 | "source": [ 1497 | "The median age computed on the training set is stored in the `statistics_` attribute." 
1498 | ] 1499 | }, 1500 | { 1501 | "cell_type": "code", 1502 | "execution_count": null, 1503 | "metadata": { 1504 | "collapsed": false 1505 | }, 1506 | "outputs": [], 1507 | "source": [ 1508 | "imputer.statistics_" 1509 | ] 1510 | }, 1511 | { 1512 | "cell_type": "markdown", 1513 | "metadata": {}, 1514 | "source": [ 1515 | "Imputation can now happen by calling the transform method:" 1516 | ] 1517 | }, 1518 | { 1519 | "cell_type": "code", 1520 | "execution_count": null, 1521 | "metadata": { 1522 | "collapsed": false 1523 | }, 1524 | "outputs": [], 1525 | "source": [ 1526 | "X_train_imputed = imputer.transform(X_train)\n", 1527 | "X_test_imputed = imputer.transform(X_test)" 1528 | ] 1529 | }, 1530 | { 1531 | "cell_type": "code", 1532 | "execution_count": null, 1533 | "metadata": { 1534 | "collapsed": false 1535 | }, 1536 | "outputs": [], 1537 | "source": [ 1538 | "np.any(X_train == -1)" 1539 | ] 1540 | }, 1541 | { 1542 | "cell_type": "code", 1543 | "execution_count": null, 1544 | "metadata": { 1545 | "collapsed": false 1546 | }, 1547 | "outputs": [], 1548 | "source": [ 1549 | "np.any(X_train_imputed == -1)" 1550 | ] 1551 | }, 1552 | { 1553 | "cell_type": "code", 1554 | "execution_count": null, 1555 | "metadata": { 1556 | "collapsed": false 1557 | }, 1558 | "outputs": [], 1559 | "source": [ 1560 | "np.any(X_test == -1)" 1561 | ] 1562 | }, 1563 | { 1564 | "cell_type": "code", 1565 | "execution_count": null, 1566 | "metadata": { 1567 | "collapsed": false 1568 | }, 1569 | "outputs": [], 1570 | "source": [ 1571 | "np.any(X_test_imputed == -1)" 1572 | ] 1573 | }, 1574 | { 1575 | "cell_type": "markdown", 1576 | "metadata": {}, 1577 | "source": [ 1578 | "We can now use a pipeline that wraps an imputer transformer and the classifier itself:" 1579 | ] 1580 | }, 1581 | { 1582 | "cell_type": "code", 1583 | "execution_count": null, 1584 | "metadata": { 1585 | "collapsed": false 1586 | }, 1587 | "outputs": [], 1588 | "source": [ 1589 | "from sklearn.pipeline import Pipeline\n", 1590 | "\n", 1591 | "imputer = Imputer(strategy='median', missing_values=-1)\n", 1592 | "\n", 1593 | "classifier = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1,\n", 1594 | " subsample=.8, max_features=.5)\n", 1595 | "\n", 1596 | "pipeline = Pipeline([\n", 1597 | " ('imp', imputer),\n", 1598 | " ('clf', classifier),\n", 1599 | "])\n", 1600 | "\n", 1601 | "scores = cross_val_score(pipeline, features.values, target, cv=5, n_jobs=4,\n", 1602 | " scoring='accuracy', )\n", 1603 | "print(scores.min(), scores.mean(), scores.max())" 1604 | ] 1605 | }, 1606 | { 1607 | "cell_type": "markdown", 1608 | "metadata": {}, 1609 | "source": [ 1610 | "The mean cross-validation is slightly lower than we used the imputation on the whole data as we did earlier although not by much. This means that in this case the data-snooping was not really helping the model cheat by much.\n", 1611 | "\n", 1612 | "Let us re-run the grid search, this time on the pipeline. 
Note that thanks to the pipeline structure we can optimize the interaction of the imputation method with the parameters of the downstream classifier without cheating:" 1613 | ] 1614 | }, 1615 | { 1616 | "cell_type": "code", 1617 | "execution_count": null, 1618 | "metadata": { 1619 | "collapsed": false 1620 | }, 1621 | "outputs": [], 1622 | "source": [ 1623 | "%%time\n", 1624 | "\n", 1625 | "params = {\n", 1626 | " 'imp__strategy': ['mean', 'median'],\n", 1627 | " 'clf__max_features': [0.5, 1],\n", 1628 | " 'clf__max_depth': [3, 4, 5],\n", 1629 | "}\n", 1630 | "gs = GridSearchCV(pipeline, params, cv=5, scoring='roc_auc', n_jobs=4)\n", 1631 | "gs.fit(X_train, y_train)" 1632 | ] 1633 | }, 1634 | { 1635 | "cell_type": "code", 1636 | "execution_count": null, 1637 | "metadata": { 1638 | "collapsed": false 1639 | }, 1640 | "outputs": [], 1641 | "source": [ 1642 | "sorted(gs.grid_scores_, key=lambda x: x.mean_validation_score, reverse=True)" 1643 | ] 1644 | }, 1645 | { 1646 | "cell_type": "code", 1647 | "execution_count": null, 1648 | "metadata": { 1649 | "collapsed": false 1650 | }, 1651 | "outputs": [], 1652 | "source": [ 1653 | "gs.best_score_" 1654 | ] 1655 | }, 1656 | { 1657 | "cell_type": "code", 1658 | "execution_count": null, 1659 | "metadata": { 1660 | "collapsed": false 1661 | }, 1662 | "outputs": [], 1663 | "source": [ 1664 | "plot_roc_curve(y_test, gs.predict_proba(X_test))" 1665 | ] 1666 | }, 1667 | { 1668 | "cell_type": "code", 1669 | "execution_count": null, 1670 | "metadata": { 1671 | "collapsed": false 1672 | }, 1673 | "outputs": [], 1674 | "source": [ 1675 | "gs.best_params_" 1676 | ] 1677 | }, 1678 | { 1679 | "cell_type": "markdown", 1680 | "metadata": {}, 1681 | "source": [ 1682 | "From this search we can conclude that the imputation by the 'mean' strategy is generally a slightly better imputation strategy when training a GBRT model on this data." 1683 | ] 1684 | }, 1685 | { 1686 | "cell_type": "markdown", 1687 | "metadata": {}, 1688 | "source": [ 1689 | "## Further integrating sklearn and pandas" 1690 | ] 1691 | }, 1692 | { 1693 | "cell_type": "markdown", 1694 | "metadata": {}, 1695 | "source": [ 1696 | "Helper tool for better sklearn / pandas integration: https://github.com/paulgb/sklearn-pandas by making it possible to embed the feature construction from the raw dataframe directly inside a pipeline." 
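(Editor's aside, not part of the original notebook cells.) For illustration only, here is a minimal sketch of what such an integration could look like with the `DataFrameMapper` class provided by sklearn-pandas. The column/transformer pairs below are hypothetical choices for the Titanic dataframe, and the exact API may vary across versions of that package:

```python
from sklearn_pandas import DataFrameMapper
from sklearn.preprocessing import Imputer, LabelBinarizer
from sklearn.pipeline import Pipeline
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

# Sketch only: declare, per column, how the raw dataframe should be featurized.
mapper = DataFrameMapper([
    ('Sex', LabelBinarizer()),                       # categorical -> indicator
    ('Pclass', LabelBinarizer()),                    # treat class as categorical
    (['Age', 'Fare'], Imputer(strategy='median')),   # impute numeric columns
])

pipeline = Pipeline([
    ('featurize', mapper),
    ('clf', GradientBoostingClassifier(n_estimators=100, subsample=.8)),
])

# The raw `data` dataframe from earlier in the notebook goes in directly:
# feature construction now happens inside each cross-validation fold.
scores = cross_val_score(pipeline, data, target, cv=5, scoring='roc_auc')
print(scores.mean())
```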
1697 | ] 1698 | }, 1699 | { 1700 | "cell_type": "markdown", 1701 | "metadata": {}, 1702 | "source": [ 1703 | "### Credits" 1704 | ] 1705 | }, 1706 | { 1707 | "cell_type": "markdown", 1708 | "metadata": {}, 1709 | "source": [ 1710 | "Thanks to:\n", 1711 | "\n", 1712 | "- Kaggle for setting up the Titanic challenge.\n", 1713 | "\n", 1714 | "- This blog post by Philippe Adjiman for inspiration:\n", 1715 | "\n", 1716 | "http://www.philippeadjiman.com/blog/2013/09/12/a-data-science-exploration-from-the-titanic-in-r/" 1717 | ] 1718 | }, 1719 | { 1720 | "cell_type": "code", 1721 | "execution_count": null, 1722 | "metadata": { 1723 | "collapsed": false 1724 | }, 1725 | "outputs": [], 1726 | "source": [] 1727 | } 1728 | ], 1729 | "metadata": { 1730 | "kernelspec": { 1731 | "display_name": "Python 2", 1732 | "language": "python", 1733 | "name": "python2" 1734 | }, 1735 | "language_info": { 1736 | "codemirror_mode": { 1737 | "name": "ipython", 1738 | "version": 2 1739 | }, 1740 | "file_extension": ".py", 1741 | "mimetype": "text/x-python", 1742 | "name": "python", 1743 | "nbconvert_exporter": "python", 1744 | "pygments_lexer": "ipython2", 1745 | "version": "2.7.11" 1746 | } 1747 | }, 1748 | "nbformat": 4, 1749 | "nbformat_minor": 0 1750 | } 1751 | -------------------------------------------------------------------------------- /notebooks/07_sklearn_unstructured_data_face_recognition.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Example from Image Processing" 8 | ] 9 | }, 10 | { 11 | "cell_type": "code", 12 | "execution_count": null, 13 | "metadata": { 14 | "collapsed": false 15 | }, 16 | "outputs": [], 17 | "source": [ 18 | "%matplotlib inline\n", 19 | "import numpy as np\n", 20 | "from matplotlib import pyplot as plt" 21 | ] 22 | }, 23 | { 24 | "cell_type": "markdown", 25 | "metadata": {}, 26 | "source": [ 27 | "Here we'll take a look at a simple facial recognition example.\n", 28 | "Ideally, we would use a dataset consisting of a\n", 29 | "subset of the [Labeled Faces in the Wild](http://vis-www.cs.umass.edu/lfw/)\n", 30 | "data that is available within scikit-learn with the 'datasets.fetch_lfw_people' function. However, this is a relatively large download (~200MB) so we will do the tutorial on a simpler, less rich dataset. Feel free to explore the LFW dataset at home." 31 | ] 32 | }, 33 | { 34 | "cell_type": "code", 35 | "execution_count": null, 36 | "metadata": { 37 | "collapsed": false 38 | }, 39 | "outputs": [], 40 | "source": [ 41 | "from sklearn import datasets\n", 42 | "faces = datasets.fetch_olivetti_faces()\n", 43 | "faces.data.shape" 44 | ] 45 | }, 46 | { 47 | "cell_type": "markdown", 48 | "metadata": {}, 49 | "source": [ 50 | "Let's visualize these faces to see what we're working with:" 51 | ] 52 | }, 53 | { 54 | "cell_type": "code", 55 | "execution_count": null, 56 | "metadata": { 57 | "collapsed": false 58 | }, 59 | "outputs": [], 60 | "source": [ 61 | "fig = plt.figure(figsize=(8, 6))\n", 62 | "# plot several images\n", 63 | "for i in range(15):\n", 64 | " ax = fig.add_subplot(3, 5, i + 1, xticks=[], yticks=[])\n", 65 | " ax.imshow(faces.images[i], cmap=plt.cm.bone)" 66 | ] 67 | }, 68 | { 69 | "cell_type": "markdown", 70 | "metadata": {}, 71 | "source": [ 72 | "One thing to note is that these faces have already been localized and scaled\n", 73 | "to a common size. 
This is an important preprocessing piece for facial\n", 74 | "recognition, and is a process that can require a large collection of training\n", 75 | "data. This can be done in scikit-learn, but the challenge is gathering a\n", 76 | "sufficient amount of training data for the algorithm to work\n", 77 | "\n", 78 | "Fortunately, this piece is common enough that it has been done. One good\n", 79 | "resource is [OpenCV](http://opencv.willowgarage.com/wiki/FaceRecognition), the\n", 80 | "*Open Computer Vision Library*." 81 | ] 82 | }, 83 | { 84 | "cell_type": "markdown", 85 | "metadata": {}, 86 | "source": [ 87 | "We'll perform a Support Vector classification of the images. We'll\n", 88 | "do a typical train-test split on the images to make this happen:" 89 | ] 90 | }, 91 | { 92 | "cell_type": "code", 93 | "execution_count": null, 94 | "metadata": { 95 | "collapsed": false 96 | }, 97 | "outputs": [], 98 | "source": [ 99 | "from sklearn.cross_validation import train_test_split\n", 100 | "X_train, X_test, y_train, y_test = train_test_split(faces.data,\n", 101 | " faces.target, random_state=0)\n", 102 | "\n", 103 | "print(X_train.shape, X_test.shape)" 104 | ] 105 | }, 106 | { 107 | "cell_type": "markdown", 108 | "metadata": {}, 109 | "source": [ 110 | "## Preprocessing: Principal Component Analysis" 111 | ] 112 | }, 113 | { 114 | "cell_type": "markdown", 115 | "metadata": {}, 116 | "source": [ 117 | "1850 dimensions is a lot for SVM. We can use PCA to reduce these 1850 features to a manageable\n", 118 | "size, while maintaining most of the information in the dataset. Here it is useful to use a variant\n", 119 | "of PCA called ``RandomizedPCA``, which is an approximation of PCA that can be much faster for large\n", 120 | "datasets. The interface is the same as the normal PCA we saw earlier:" 121 | ] 122 | }, 123 | { 124 | "cell_type": "code", 125 | "execution_count": null, 126 | "metadata": { 127 | "collapsed": false 128 | }, 129 | "outputs": [], 130 | "source": [ 131 | "from sklearn import decomposition\n", 132 | "pca = decomposition.RandomizedPCA(n_components=150, whiten=True)\n", 133 | "pca.fit(X_train)" 134 | ] 135 | }, 136 | { 137 | "cell_type": "markdown", 138 | "metadata": {}, 139 | "source": [ 140 | "One interesting part of PCA is that it computes the \"mean\" face, which can be\n", 141 | "interesting to examine:" 142 | ] 143 | }, 144 | { 145 | "cell_type": "code", 146 | "execution_count": null, 147 | "metadata": { 148 | "collapsed": false 149 | }, 150 | "outputs": [], 151 | "source": [ 152 | "plt.imshow(pca.mean_.reshape(faces.images[0].shape),\n", 153 | " cmap=plt.cm.bone)" 154 | ] 155 | }, 156 | { 157 | "cell_type": "markdown", 158 | "metadata": {}, 159 | "source": [ 160 | "The principal components measure deviations about this mean along orthogonal axes.\n", 161 | "It is also interesting to visualize these principal components:" 162 | ] 163 | }, 164 | { 165 | "cell_type": "code", 166 | "execution_count": null, 167 | "metadata": { 168 | "collapsed": false 169 | }, 170 | "outputs": [], 171 | "source": [ 172 | "print(pca.components_.shape)" 173 | ] 174 | }, 175 | { 176 | "cell_type": "code", 177 | "execution_count": null, 178 | "metadata": { 179 | "collapsed": false 180 | }, 181 | "outputs": [], 182 | "source": [ 183 | "fig = plt.figure(figsize=(16, 6))\n", 184 | "for i in range(30):\n", 185 | " ax = fig.add_subplot(3, 10, i + 1, xticks=[], yticks=[])\n", 186 | " ax.imshow(pca.components_[i].reshape(faces.images[0].shape), cmap=plt.cm.bone)" 187 | ] 188 | }, 189 | { 190 | "cell_type": 
"markdown", 191 | "metadata": {}, 192 | "source": [ 193 | "The components (\"eigenfaces\") are ordered by their importance from top-left to bottom-right.\n", 194 | "We see that the first few components seem to primarily take care of lighting\n", 195 | "conditions; the remaining components pull out certain identifying features:\n", 196 | "the nose, eyes, eyebrows, etc." 197 | ] 198 | }, 199 | { 200 | "cell_type": "markdown", 201 | "metadata": {}, 202 | "source": [ 203 | "With this projection computed, we can now project our original training\n", 204 | "and test data onto the PCA basis:" 205 | ] 206 | }, 207 | { 208 | "cell_type": "code", 209 | "execution_count": null, 210 | "metadata": { 211 | "collapsed": false 212 | }, 213 | "outputs": [], 214 | "source": [ 215 | "X_train_pca = pca.transform(X_train)\n", 216 | "X_test_pca = pca.transform(X_test)" 217 | ] 218 | }, 219 | { 220 | "cell_type": "code", 221 | "execution_count": null, 222 | "metadata": { 223 | "collapsed": false 224 | }, 225 | "outputs": [], 226 | "source": [ 227 | "print(X_train_pca.shape)\n", 228 | "print(X_test_pca.shape)" 229 | ] 230 | }, 231 | { 232 | "cell_type": "markdown", 233 | "metadata": {}, 234 | "source": [ 235 | "These projected components correspond to factors in a linear combination of\n", 236 | "component images such that the combination approaches the original face." 237 | ] 238 | }, 239 | { 240 | "cell_type": "markdown", 241 | "metadata": {}, 242 | "source": [ 243 | "## Doing the Learning: Support Vector Machines" 244 | ] 245 | }, 246 | { 247 | "cell_type": "markdown", 248 | "metadata": {}, 249 | "source": [ 250 | "Now we'll perform support-vector-machine classification on this reduced dataset:" 251 | ] 252 | }, 253 | { 254 | "cell_type": "code", 255 | "execution_count": null, 256 | "metadata": { 257 | "collapsed": false 258 | }, 259 | "outputs": [], 260 | "source": [ 261 | "from sklearn import svm\n", 262 | "clf = svm.SVC(C=5., gamma=0.001)\n", 263 | "clf.fit(X_train_pca, y_train)" 264 | ] 265 | }, 266 | { 267 | "cell_type": "markdown", 268 | "metadata": {}, 269 | "source": [ 270 | "Finally, we can evaluate how well this classification did. First, we might plot a\n", 271 | "few of the test-cases with the labels learned from the training set:" 272 | ] 273 | }, 274 | { 275 | "cell_type": "code", 276 | "execution_count": null, 277 | "metadata": { 278 | "collapsed": false 279 | }, 280 | "outputs": [], 281 | "source": [ 282 | "fig = plt.figure(figsize=(8, 6))\n", 283 | "for i in range(15):\n", 284 | " ax = fig.add_subplot(3, 5, i + 1, xticks=[], yticks=[])\n", 285 | " ax.imshow(X_test[i].reshape(faces.images[0].shape),\n", 286 | " cmap=plt.cm.bone)\n", 287 | " y_pred = clf.predict(X_test_pca[i])[0]\n", 288 | " color = ('black' if y_pred == y_test[i]\n", 289 | " else 'red')\n", 290 | " ax.set_title(y_pred, fontsize='small', color=color)" 291 | ] 292 | }, 293 | { 294 | "cell_type": "markdown", 295 | "metadata": {}, 296 | "source": [ 297 | "The classifier is correct on an impressive number of images given the simplicity\n", 298 | "of its learning model! Using a linear classifier on 150 features derived from\n", 299 | "the pixel-level data, the algorithm correctly identifies a large number of the\n", 300 | "people in the images.\n", 301 | "\n", 302 | "Again, we can\n", 303 | "quantify this effectiveness using one of several measures from the ``sklearn.metrics``\n", 304 | "module. 
First we can do the classification report, which shows the precision,\n", 305 | "recall and other measures of the \"goodness\" of the classification:" 306 | ] 307 | }, 308 | { 309 | "cell_type": "code", 310 | "execution_count": null, 311 | "metadata": { 312 | "collapsed": false 313 | }, 314 | "outputs": [], 315 | "source": [ 316 | "from sklearn import metrics\n", 317 | "y_pred = clf.predict(X_test_pca)\n", 318 | "print(metrics.classification_report(y_test, y_pred))" 319 | ] 320 | }, 321 | { 322 | "cell_type": "markdown", 323 | "metadata": {}, 324 | "source": [ 325 | "Another interesting metric is the *confusion matrix*, which indicates how often\n", 326 | "any two items are mixed-up. The confusion matrix of a perfect classifier\n", 327 | "would only have nonzero entries on the diagonal, with zeros on the off-diagonal." 328 | ] 329 | }, 330 | { 331 | "cell_type": "code", 332 | "execution_count": null, 333 | "metadata": { 334 | "collapsed": false 335 | }, 336 | "outputs": [], 337 | "source": [ 338 | "print(metrics.confusion_matrix(y_test, y_pred))" 339 | ] 340 | }, 341 | { 342 | "cell_type": "code", 343 | "execution_count": null, 344 | "metadata": { 345 | "collapsed": false 346 | }, 347 | "outputs": [], 348 | "source": [ 349 | "print(metrics.f1_score(y_test, y_pred))" 350 | ] 351 | }, 352 | { 353 | "cell_type": "markdown", 354 | "metadata": {}, 355 | "source": [ 356 | "## Pipelining" 357 | ] 358 | }, 359 | { 360 | "cell_type": "markdown", 361 | "metadata": {}, 362 | "source": [ 363 | "Above we used PCA as a pre-processing step before applying our support vector machine classifier.\n", 364 | "Plugging the output of one estimator directly into the input of a second estimator is a commonly\n", 365 | "used pattern; for this reason scikit-learn provides a ``Pipeline`` object which automates this\n", 366 | "process. The above problem can be re-expressed as a pipeline as follows:" 367 | ] 368 | }, 369 | { 370 | "cell_type": "code", 371 | "execution_count": null, 372 | "metadata": { 373 | "collapsed": false 374 | }, 375 | "outputs": [], 376 | "source": [ 377 | "from sklearn.pipeline import Pipeline" 378 | ] 379 | }, 380 | { 381 | "cell_type": "code", 382 | "execution_count": null, 383 | "metadata": { 384 | "collapsed": false 385 | }, 386 | "outputs": [], 387 | "source": [ 388 | "clf = Pipeline([('pca', decomposition.RandomizedPCA(n_components=150, whiten=True)),\n", 389 | " ('svm', svm.LinearSVC(C=1.0))])" 390 | ] 391 | }, 392 | { 393 | "cell_type": "code", 394 | "execution_count": null, 395 | "metadata": { 396 | "collapsed": false 397 | }, 398 | "outputs": [], 399 | "source": [ 400 | "clf.fit(X_train, y_train)" 401 | ] 402 | }, 403 | { 404 | "cell_type": "code", 405 | "execution_count": null, 406 | "metadata": { 407 | "collapsed": false 408 | }, 409 | "outputs": [], 410 | "source": [ 411 | "y_pred = clf.predict(X_test)" 412 | ] 413 | }, 414 | { 415 | "cell_type": "code", 416 | "execution_count": null, 417 | "metadata": { 418 | "collapsed": false 419 | }, 420 | "outputs": [], 421 | "source": [ 422 | "print(metrics.confusion_matrix(y_pred, y_test))" 423 | ] 424 | }, 425 | { 426 | "cell_type": "markdown", 427 | "metadata": {}, 428 | "source": [ 429 | "The results are not identical because we used the randomized version of the PCA -- because the\n", 430 | "projection varies slightly each time, the results vary slightly as well." 
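(Editor's aside, not part of the original notebook cells.) If reproducible results are wanted, the randomized projection can be seeded through its `random_state` parameter; a minimal sketch, reusing the estimators and train/test arrays defined above:

```python
# Seeding the randomized PCA makes the projection, and hence the downstream
# predictions, repeatable from run to run.
clf = Pipeline([('pca', decomposition.RandomizedPCA(n_components=150,
                                                    whiten=True,
                                                    random_state=0)),
                ('svm', svm.LinearSVC(C=1.0))])
clf.fit(X_train, y_train)
print(metrics.confusion_matrix(y_test, clf.predict(X_test)))
```

Note that `LinearSVC` has a `random_state` parameter of its own, which can also be fixed if full determinism is needed.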
431 | ] 432 | }, 433 | { 434 | "cell_type": "markdown", 435 | "metadata": {}, 436 | "source": [ 437 | "## A Quick Note on Facial Recognition" 438 | ] 439 | }, 440 | { 441 | "cell_type": "markdown", 442 | "metadata": {}, 443 | "source": [ 444 | "Here we have used PCA \"eigenfaces\" as a pre-processing step for facial recognition.\n", 445 | "The reason we chose this is because PCA is a broadly-applicable technique, which can\n", 446 | "be useful for a wide array of data types. Research in the field of facial recognition\n", 447 | "in particular, however, has shown that other more specific feature extraction methods\n", 448 | "are can be much more effective." 449 | ] 450 | } 451 | ], 452 | "metadata": { 453 | "kernelspec": { 454 | "display_name": "Python 2", 455 | "language": "python", 456 | "name": "python2" 457 | }, 458 | "language_info": { 459 | "codemirror_mode": { 460 | "name": "ipython", 461 | "version": 2 462 | }, 463 | "file_extension": ".py", 464 | "mimetype": "text/x-python", 465 | "name": "python", 466 | "nbconvert_exporter": "python", 467 | "pygments_lexer": "ipython2", 468 | "version": "2.7.11" 469 | } 470 | }, 471 | "nbformat": 4, 472 | "nbformat_minor": 0 473 | } 474 | -------------------------------------------------------------------------------- /notebooks/08_sklearn_pandas_practical.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 1, 6 | "metadata": { 7 | "collapsed": false 8 | }, 9 | "outputs": [], 10 | "source": [ 11 | "%matplotlib inline\n", 12 | "import numpy as np\n", 13 | "import pandas as pd\n", 14 | "import matplotlib.pyplot as plt" 15 | ] 16 | }, 17 | { 18 | "cell_type": "markdown", 19 | "metadata": {}, 20 | "source": [ 21 | "## Hands on scikit-learn / pandas\n", 22 | "\n", 23 | "You mission is to build the best predictive model using these data:\n", 24 | "\n", 25 | "https://dl.dropboxusercontent.com/u/2140486/data/adult_train.csv\n", 26 | "\n", 27 | "The prediction task is to determine whether a person makes over 50K a year.\n", 28 | "\n", 29 | "What is expected:\n", 30 | "\n", 31 | " - Read the data using Pandas\n", 32 | " - Understand what type of machine learning problem you are facing? regression? classification? binary? multi-class?\n", 33 | " - Identify potential issues such as missing values, quantitative / continous or categorical variables, class imbalance\n", 34 | " - Visualize the distributions of the features\n", 35 | " - Encode properly the categorical variables\n", 36 | " - Propose a predictive model that you will evaluate using ROC-AUC metric\n", 37 | " - You will evaluate the quality of the model with cross-validation.\n", 38 | "\n", 39 | "```\n", 40 | "Attribute Information:\n", 41 | "\n", 42 | "Listing of attributes: \n", 43 | "\n", 44 | "age: continuous. \n", 45 | "workclass: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked. \n", 46 | "fnlwgt: continuous. \n", 47 | "education: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool. \n", 48 | "education-num: continuous. \n", 49 | "marital-status: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse. 
\n", 50 | "occupation: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces. \n", 51 | "relationship: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried. \n", 52 | "race: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black. \n", 53 | "sex: Female, Male. \n", 54 | "capital-gain: continuous. \n", 55 | "capital-loss: continuous. \n", 56 | "hours-per-week: continuous. \n", 57 | "native-country: United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands.\n", 58 | "```\n", 59 | "\n", 60 | "The floor is yours !" 61 | ] 62 | } 63 | ], 64 | "metadata": { 65 | "kernelspec": { 66 | "display_name": "Python 2", 67 | "language": "python", 68 | "name": "python2" 69 | }, 70 | "language_info": { 71 | "codemirror_mode": { 72 | "name": "ipython", 73 | "version": 2 74 | }, 75 | "file_extension": ".py", 76 | "mimetype": "text/x-python", 77 | "name": "python", 78 | "nbconvert_exporter": "python", 79 | "pygments_lexer": "ipython2", 80 | "version": "2.7.11" 81 | } 82 | }, 83 | "nbformat": 4, 84 | "nbformat_minor": 0 85 | } 86 | -------------------------------------------------------------------------------- /notebooks/figures/ML_flow_chart.py: -------------------------------------------------------------------------------- 1 | """ 2 | Tutorial Diagrams 3 | ----------------- 4 | 5 | This script plots the flow-charts used in the scikit-learn tutorials. 
6 | """ 7 | 8 | import numpy as np 9 | import pylab as pl 10 | from matplotlib.patches import Circle, Rectangle, Polygon, Arrow, FancyArrow 11 | 12 | def create_base(box_bg = '#CCCCCC', 13 | arrow1 = '#88CCFF', 14 | arrow2 = '#88FF88', 15 | supervised=True): 16 | fig = pl.figure(figsize=(9, 6), facecolor='w') 17 | ax = pl.axes((0, 0, 1, 1), 18 | xticks=[], yticks=[], frameon=False) 19 | ax.set_xlim(0, 9) 20 | ax.set_ylim(0, 6) 21 | 22 | patches = [Rectangle((0.3, 3.6), 1.5, 1.8, zorder=1, fc=box_bg), 23 | Rectangle((0.5, 3.8), 1.5, 1.8, zorder=2, fc=box_bg), 24 | Rectangle((0.7, 4.0), 1.5, 1.8, zorder=3, fc=box_bg), 25 | 26 | Rectangle((2.9, 3.6), 0.2, 1.8, fc=box_bg), 27 | Rectangle((3.1, 3.8), 0.2, 1.8, fc=box_bg), 28 | Rectangle((3.3, 4.0), 0.2, 1.8, fc=box_bg), 29 | 30 | Rectangle((0.3, 0.2), 1.5, 1.8, fc=box_bg), 31 | 32 | Rectangle((2.9, 0.2), 0.2, 1.8, fc=box_bg), 33 | 34 | Circle((5.5, 3.5), 1.0, fc=box_bg), 35 | 36 | Polygon([[5.5, 1.7], 37 | [6.1, 1.1], 38 | [5.5, 0.5], 39 | [4.9, 1.1]], fc=box_bg), 40 | 41 | FancyArrow(2.3, 4.6, 0.35, 0, fc=arrow1, 42 | width=0.25, head_width=0.5, head_length=0.2), 43 | 44 | FancyArrow(3.75, 4.2, 0.5, -0.2, fc=arrow1, 45 | width=0.25, head_width=0.5, head_length=0.2), 46 | 47 | FancyArrow(5.5, 2.4, 0, -0.4, fc=arrow1, 48 | width=0.25, head_width=0.5, head_length=0.2), 49 | 50 | FancyArrow(2.0, 1.1, 0.5, 0, fc=arrow2, 51 | width=0.25, head_width=0.5, head_length=0.2), 52 | 53 | FancyArrow(3.3, 1.1, 1.3, 0, fc=arrow2, 54 | width=0.25, head_width=0.5, head_length=0.2), 55 | 56 | FancyArrow(6.2, 1.1, 0.8, 0, fc=arrow2, 57 | width=0.25, head_width=0.5, head_length=0.2)] 58 | 59 | if supervised: 60 | patches += [Rectangle((0.3, 2.4), 1.5, 0.5, zorder=1, fc=box_bg), 61 | Rectangle((0.5, 2.6), 1.5, 0.5, zorder=2, fc=box_bg), 62 | Rectangle((0.7, 2.8), 1.5, 0.5, zorder=3, fc=box_bg), 63 | FancyArrow(2.3, 2.9, 2.0, 0, fc=arrow1, 64 | width=0.25, head_width=0.5, head_length=0.2), 65 | Rectangle((7.3, 0.85), 1.5, 0.5, fc=box_bg)] 66 | else: 67 | patches += [Rectangle((7.3, 0.2), 1.5, 1.8, fc=box_bg)] 68 | 69 | for p in patches: 70 | ax.add_patch(p) 71 | 72 | pl.text(1.45, 4.9, "Training\nText,\nDocuments,\nImages,\netc.", 73 | ha='center', va='center', fontsize=14) 74 | 75 | pl.text(3.6, 4.9, "Feature\nVectors", 76 | ha='left', va='center', fontsize=14) 77 | 78 | pl.text(5.5, 3.5, "Machine\nLearning\nAlgorithm", 79 | ha='center', va='center', fontsize=14) 80 | 81 | pl.text(1.05, 1.1, "New Text,\nDocument,\nImage,\netc.", 82 | ha='center', va='center', fontsize=14) 83 | 84 | pl.text(3.3, 1.7, "Feature\nVector", 85 | ha='left', va='center', fontsize=14) 86 | 87 | pl.text(5.5, 1.1, "Predictive\nModel", 88 | ha='center', va='center', fontsize=12) 89 | 90 | if supervised: 91 | pl.text(1.45, 3.05, "Labels", 92 | ha='center', va='center', fontsize=14) 93 | 94 | pl.text(8.05, 1.1, "Expected\nLabel", 95 | ha='center', va='center', fontsize=14) 96 | pl.text(8.8, 5.8, "Supervised Learning Model", 97 | ha='right', va='top', fontsize=18) 98 | 99 | else: 100 | pl.text(8.05, 1.1, 101 | "Likelihood\nor Cluster ID\nor Better\nRepresentation", 102 | ha='center', va='center', fontsize=12) 103 | pl.text(8.8, 5.8, "Unsupervised Learning Model", 104 | ha='right', va='top', fontsize=18) 105 | 106 | 107 | 108 | def plot_supervised_chart(annotate=False): 109 | create_base(supervised=True) 110 | if annotate: 111 | fontdict = dict(color='r', weight='bold', size=14) 112 | pl.text(1.9, 4.55, 'X = vec.fit_transform(input)', 113 | fontdict=fontdict, 114 | rotation=20, ha='left', 
va='bottom') 115 | pl.text(3.7, 3.2, 'clf.fit(X, y)', 116 | fontdict=fontdict, 117 | rotation=20, ha='left', va='bottom') 118 | pl.text(1.7, 1.5, 'X_new = vec.transform(input)', 119 | fontdict=fontdict, 120 | rotation=20, ha='left', va='bottom') 121 | pl.text(6.1, 1.5, 'y_new = clf.predict(X_new)', 122 | fontdict=fontdict, 123 | rotation=20, ha='left', va='bottom') 124 | 125 | def plot_unsupervised_chart(): 126 | create_base(supervised=False) 127 | 128 | 129 | if __name__ == '__main__': 130 | plot_supervised_chart(False) 131 | plot_supervised_chart(True) 132 | plot_unsupervised_chart() 133 | pl.show() 134 | 135 | 136 | -------------------------------------------------------------------------------- /notebooks/figures/__init__.py: -------------------------------------------------------------------------------- 1 | from .sgd_separator import plot_sgd_separator 2 | from .linear_regression import plot_linear_regression 3 | from .ML_flow_chart import plot_supervised_chart, plot_unsupervised_chart 4 | from .bias_variance import plot_bias_variance 5 | -------------------------------------------------------------------------------- /notebooks/figures/bias_variance.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import matplotlib.pyplot as plt 3 | 4 | 5 | def test_func(x, err=0.5): 6 | return np.random.normal(10 - 1. / (x + 0.1), err) 7 | 8 | 9 | def compute_error(x, y, p): 10 | yfit = np.polyval(p, x) 11 | return np.sqrt(np.mean((y - yfit) ** 2)) 12 | 13 | 14 | def plot_bias_variance(N=8, random_seed=42, err=0.5): 15 | np.random.seed(random_seed) 16 | x = 10 ** np.linspace(-2, 0, N) 17 | y = test_func(x) 18 | 19 | xfit = np.linspace(-0.2, 1.2, 1000) 20 | 21 | titles = ['d = 1 (under-fit; high bias)', 22 | 'd = 2', 23 | 'd = 6 (over-fit; high variance)'] 24 | degrees = [1, 2, 6] 25 | 26 | fig = plt.figure(figsize = (9, 3.5)) 27 | fig.subplots_adjust(left = 0.06, right=0.98, 28 | bottom=0.15, top=0.85, 29 | wspace=0.05) 30 | for i, d in enumerate(degrees): 31 | ax = fig.add_subplot(131 + i, xticks=[], yticks=[]) 32 | ax.scatter(x, y, marker='x', c='k', s=50) 33 | 34 | p = np.polyfit(x, y, d) 35 | yfit = np.polyval(p, xfit) 36 | ax.plot(xfit, yfit, '-b') 37 | 38 | ax.set_xlim(-0.2, 1.2) 39 | ax.set_ylim(0, 12) 40 | ax.set_xlabel('house size') 41 | if i == 0: 42 | ax.set_ylabel('price') 43 | 44 | ax.set_title(titles[i]) 45 | 46 | if __name__ == '__main__': 47 | plot_bias_variance() 48 | plt.show() 49 | -------------------------------------------------------------------------------- /notebooks/figures/grid_search_cv_splits.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/agramfort/sklearn_pandas_intro/c4c148faab7747b3750ce10f332d4dde58019bfa/notebooks/figures/grid_search_cv_splits.png -------------------------------------------------------------------------------- /notebooks/figures/grid_search_parameters.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/agramfort/sklearn_pandas_intro/c4c148faab7747b3750ce10f332d4dde58019bfa/notebooks/figures/grid_search_parameters.png -------------------------------------------------------------------------------- /notebooks/figures/iris_setosa.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/agramfort/sklearn_pandas_intro/c4c148faab7747b3750ce10f332d4dde58019bfa/notebooks/figures/iris_setosa.jpg 
-------------------------------------------------------------------------------- /notebooks/figures/iris_versicolor.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/agramfort/sklearn_pandas_intro/c4c148faab7747b3750ce10f332d4dde58019bfa/notebooks/figures/iris_versicolor.jpg -------------------------------------------------------------------------------- /notebooks/figures/iris_virginica.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/agramfort/sklearn_pandas_intro/c4c148faab7747b3750ce10f332d4dde58019bfa/notebooks/figures/iris_virginica.jpg -------------------------------------------------------------------------------- /notebooks/figures/linear_regression.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import matplotlib.pyplot as plt 3 | from sklearn.linear_model import LinearRegression 4 | 5 | 6 | def plot_linear_regression(): 7 | a = 0.5 8 | b = 1.0 9 | 10 | # x from 0 to 10 11 | x = 30 * np.random.random(20) 12 | 13 | # y = a*x + b with noise 14 | y = a * x + b + np.random.normal(size=x.shape) 15 | 16 | # create a linear regression classifier 17 | clf = LinearRegression() 18 | clf.fit(x[:, None], y) 19 | 20 | # predict y from the data 21 | x_new = np.linspace(0, 30, 100) 22 | y_new = clf.predict(x_new[:, None]) 23 | 24 | # plot the results 25 | ax = plt.axes() 26 | ax.scatter(x, y) 27 | ax.plot(x_new, y_new) 28 | 29 | ax.set_xlabel('x') 30 | ax.set_ylabel('y') 31 | 32 | ax.axis('tight') 33 | 34 | 35 | if __name__ == '__main__': 36 | plot_linear_regression() 37 | plt.show() 38 | -------------------------------------------------------------------------------- /notebooks/figures/sgd_separator.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import matplotlib.pyplot as plt 3 | from sklearn.linear_model import SGDClassifier 4 | from sklearn.datasets.samples_generator import make_blobs 5 | 6 | 7 | def plot_sgd_separator(): 8 | # we create 50 separable points 9 | X, Y = make_blobs(n_samples=50, centers=2, 10 | random_state=0, cluster_std=0.60) 11 | 12 | # fit the model 13 | clf = SGDClassifier(loss="hinge", alpha=0.01, 14 | n_iter=200, fit_intercept=True) 15 | clf.fit(X, Y) 16 | 17 | # plot the line, the points, and the nearest vectors to the plane 18 | xx = np.linspace(-1, 5, 10) 19 | yy = np.linspace(-1, 5, 10) 20 | 21 | X1, X2 = np.meshgrid(xx, yy) 22 | Z = np.empty(X1.shape) 23 | for (i, j), val in np.ndenumerate(X1): 24 | x1 = val 25 | x2 = X2[i, j] 26 | p = clf.decision_function([[x1, x2]]) 27 | Z[i, j] = p[0] 28 | levels = [-1.0, 0.0, 1.0] 29 | linestyles = ['dashed', 'solid', 'dashed'] 30 | colors = 'k' 31 | 32 | ax = plt.axes() 33 | ax.contour(X1, X2, Z, levels, colors=colors, linestyles=linestyles) 34 | ax.scatter(X[:, 0], X[:, 1], c=Y, cmap=plt.cm.Paired) 35 | 36 | ax.axis('tight') 37 | 38 | 39 | if __name__ == '__main__': 40 | plot_sgd_separator() 41 | plt.show() 42 | -------------------------------------------------------------------------------- /notebooks/figures/supervised_scikit_learn.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/agramfort/sklearn_pandas_intro/c4c148faab7747b3750ce10f332d4dde58019bfa/notebooks/figures/supervised_scikit_learn.png -------------------------------------------------------------------------------- 
/notebooks/figures/svm_gui_frames.py: -------------------------------------------------------------------------------- 1 | """ 2 | Linear Model Example 3 | -------------------- 4 | 5 | This is an example plot from the tutorial which accompanies an explanation 6 | of the support vector machine GUI. 7 | """ 8 | 9 | import numpy as np 10 | import pylab as pl 11 | import matplotlib 12 | 13 | from sklearn import svm 14 | 15 | 16 | def linear_model(rseed=42, Npts=30): 17 | np.random.seed(rseed) 18 | 19 | 20 | data = np.random.normal(0, 10, (Npts, 2)) 21 | data[:Npts / 2] -= 15 22 | data[Npts / 2:] += 15 23 | 24 | labels = np.ones(Npts) 25 | labels[:Npts / 2] = -1 26 | 27 | return data, labels 28 | 29 | 30 | def nonlinear_model(rseed=42, Npts=30): 31 | radius = 40 * np.random.random(Npts) 32 | far_pts = radius > 20 33 | radius[far_pts] *= 1.2 34 | radius[~far_pts] *= 1.1 35 | 36 | theta = np.random.random(Npts) * np.pi * 2 37 | 38 | data = np.empty((Npts, 2)) 39 | data[:, 0] = radius * np.cos(theta) 40 | data[:, 1] = radius * np.sin(theta) 41 | 42 | labels = np.ones(Npts) 43 | labels[far_pts] = -1 44 | 45 | return data, labels 46 | 47 | 48 | def plot_linear_model(): 49 | X, y = linear_model() 50 | clf = svm.SVC(kernel='linear', 51 | gamma=0.01, coef0=0, degree=3) 52 | clf.fit(X, y) 53 | 54 | fig = pl.figure() 55 | ax = pl.subplot(111, xticks=[], yticks=[]) 56 | ax.scatter(X[:, 0], X[:, 1], c=y, cmap=pl.cm.bone) 57 | 58 | ax.scatter(clf.support_vectors_[:, 0], 59 | clf.support_vectors_[:, 1], 60 | s=80, edgecolors="k", facecolors="none") 61 | 62 | delta = 1 63 | y_min, y_max = -50, 50 64 | x_min, x_max = -50, 50 65 | x = np.arange(x_min, x_max + delta, delta) 66 | y = np.arange(y_min, y_max + delta, delta) 67 | X1, X2 = np.meshgrid(x, y) 68 | Z = clf.decision_function(np.c_[X1.ravel(), X2.ravel()]) 69 | Z = Z.reshape(X1.shape) 70 | 71 | levels = [-1.0, 0.0, 1.0] 72 | linestyles = ['dashed', 'solid', 'dashed'] 73 | colors = 'k' 74 | ax.contour(X1, X2, Z, levels, 75 | colors=colors, 76 | linestyles=linestyles) 77 | 78 | 79 | def plot_rbf_model(): 80 | X, y = nonlinear_model() 81 | clf = svm.SVC(kernel='rbf', 82 | gamma=0.001, coef0=0, degree=3) 83 | clf.fit(X, y) 84 | 85 | fig = pl.figure() 86 | ax = pl.subplot(111, xticks=[], yticks=[]) 87 | ax.scatter(X[:, 0], X[:, 1], c=y, cmap=pl.cm.bone, zorder=2) 88 | 89 | ax.scatter(clf.support_vectors_[:, 0], 90 | clf.support_vectors_[:, 1], 91 | s=80, edgecolors="k", facecolors="none") 92 | 93 | delta = 1 94 | y_min, y_max = -50, 50 95 | x_min, x_max = -50, 50 96 | x = np.arange(x_min, x_max + delta, delta) 97 | y = np.arange(y_min, y_max + delta, delta) 98 | X1, X2 = np.meshgrid(x, y) 99 | Z = clf.decision_function(np.c_[X1.ravel(), X2.ravel()]) 100 | Z = Z.reshape(X1.shape) 101 | 102 | levels = [-1.0, 0.0, 1.0] 103 | linestyles = ['dashed', 'solid', 'dashed'] 104 | colors = 'k' 105 | 106 | ax.contourf(X1, X2, Z, 10, 107 | cmap=matplotlib.cm.bone, 108 | origin='lower', 109 | alpha=0.85, zorder=1) 110 | ax.contour(X1, X2, Z, [0.0], 111 | colors='k', 112 | linestyles=['solid'], zorder=1) 113 | 114 | 115 | if __name__ == '__main__': 116 | plot_linear_model() 117 | plot_rbf_model() 118 | pl.show() 119 | 120 | -------------------------------------------------------------------------------- /notebooks/helpers.py: -------------------------------------------------------------------------------- 1 | """ 2 | Small helpers for code that is not shown in the notebooks 3 | """ 4 | 5 | from sklearn import neighbors, datasets, linear_model 6 | import pylab as pl 7 | 
import numpy as np 8 | from matplotlib.colors import ListedColormap 9 | 10 | # Create color maps for 3-class classification problem, as with iris 11 | cmap_light = ListedColormap(['#FFAAAA', '#AAFFAA', '#AAAAFF']) 12 | cmap_bold = ListedColormap(['#FF0000', '#00FF00', '#0000FF']) 13 | 14 | def plot_iris_knn(): 15 | iris = datasets.load_iris() 16 | X = iris.data[:, :2] # we only take the first two features. We could 17 | # avoid this ugly slicing by using a two-dim dataset 18 | y = iris.target 19 | 20 | knn = neighbors.KNeighborsClassifier(n_neighbors=3) 21 | knn.fit(X, y) 22 | 23 | x_min, x_max = X[:, 0].min() - .1, X[:, 0].max() + .1 24 | y_min, y_max = X[:, 1].min() - .1, X[:, 1].max() + .1 25 | xx, yy = np.meshgrid(np.linspace(x_min, x_max, 100), 26 | np.linspace(y_min, y_max, 100)) 27 | Z = knn.predict(np.c_[xx.ravel(), yy.ravel()]) 28 | 29 | # Put the result into a color plot 30 | Z = Z.reshape(xx.shape) 31 | pl.figure() 32 | pl.pcolormesh(xx, yy, Z, cmap=cmap_light) 33 | 34 | # Plot also the training points 35 | pl.scatter(X[:, 0], X[:, 1], c=y, cmap=cmap_bold) 36 | pl.xlabel('sepal length (cm)') 37 | pl.ylabel('sepal width (cm)') 38 | pl.axis('tight') 39 | 40 | 41 | def plot_polynomial_regression(): 42 | rng = np.random.RandomState(0) 43 | x = 2*rng.rand(100) - 1 44 | 45 | f = lambda t: 1.2 * t**2 + .1 * t**3 - .4 * t **5 - .5 * t ** 9 46 | y = f(x) + .4 * rng.normal(size=100) 47 | 48 | x_test = np.linspace(-1, 1, 100) 49 | 50 | pl.figure() 51 | pl.scatter(x, y, s=4) 52 | 53 | X = np.array([x**i for i in range(5)]).T 54 | X_test = np.array([x_test**i for i in range(5)]).T 55 | regr = linear_model.LinearRegression() 56 | regr.fit(X, y) 57 | pl.plot(x_test, regr.predict(X_test), label='4th order') 58 | 59 | X = np.array([x**i for i in range(10)]).T 60 | X_test = np.array([x_test**i for i in range(10)]).T 61 | regr = linear_model.LinearRegression() 62 | regr.fit(X, y) 63 | pl.plot(x_test, regr.predict(X_test), label='9th order') 64 | 65 | pl.legend(loc='best') 66 | pl.axis('tight') 67 | pl.title('Fitting a 4th and a 9th order polynomial') 68 | 69 | pl.figure() 70 | pl.scatter(x, y, s=4) 71 | pl.plot(x_test, f(x_test), label="truth") 72 | pl.axis('tight') 73 | pl.title('Ground truth (9th order polynomial)') 74 | 75 | 76 | -------------------------------------------------------------------------------- /notebooks/images/parallel_text_clf.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/agramfort/sklearn_pandas_intro/c4c148faab7747b3750ce10f332d4dde58019bfa/notebooks/images/parallel_text_clf.png -------------------------------------------------------------------------------- /notebooks/images/parallel_text_clf_average.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/agramfort/sklearn_pandas_intro/c4c148faab7747b3750ce10f332d4dde58019bfa/notebooks/images/parallel_text_clf_average.png -------------------------------------------------------------------------------- /notebooks/images/predictive_modeling_data_flow.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/agramfort/sklearn_pandas_intro/c4c148faab7747b3750ce10f332d4dde58019bfa/notebooks/images/predictive_modeling_data_flow.png -------------------------------------------------------------------------------- /notebooks/solutions/02A_faces_plot.py: -------------------------------------------------------------------------------- 1 | 
faces = fetch_olivetti_faces() 2 | 3 | # set up the figure 4 | fig = plt.figure(figsize=(6, 6)) # figure size in inches 5 | fig.subplots_adjust(left=0, right=1, bottom=0, top=1, hspace=0.05, wspace=0.05) 6 | 7 | # plot the faces: 8 | for i in range(64): 9 | ax = fig.add_subplot(8, 8, i + 1, xticks=[], yticks=[]) 10 | ax.imshow(faces.images[i], cmap=plt.cm.bone, interpolation='nearest') 11 | -------------------------------------------------------------------------------- /notebooks/solutions/04A_plot_logistic_regression_weights.py: -------------------------------------------------------------------------------- 1 | logreg_new = LogisticRegression(C=1).fit(rich_features_final, target) 2 | 3 | feature_names = rich_features_final.columns.values 4 | x = np.arange(len(feature_names)) 5 | plt.bar(x, logreg_new.coef_.ravel()) 6 | _ = plt.xticks(x + 0.5, feature_names, rotation=30) 7 | 8 | # Rich young women like Kate Winslet tend to survive the Titanic better 9 | # than poor men like Leonardo. 10 | -------------------------------------------------------------------------------- /notebooks/solutions/04B_more_categorical_variables.py: -------------------------------------------------------------------------------- 1 | features = pd.concat([data.get(['Fare', 'Age']), 2 | pd.get_dummies(data.Sex, prefix='Sex'), 3 | pd.get_dummies(data.Pclass, prefix='Pclass'), 4 | pd.get_dummies(data.Embarked, prefix='Embarked')], 5 | axis=1) 6 | features = features.drop('Sex_male', 1) 7 | features = features.fillna(features.dropna().median()) 8 | features.head(5) 9 | 10 | 11 | logreg = LogisticRegression(C=1) 12 | scores = cross_val_score(logreg, features, target, cv=5, scoring='accuracy') 13 | print("Logistic Regression CV scores:") 14 | print("min: {:.3f}, mean: {:.3f}, max: {:.3f}".format( 15 | scores.min(), scores.mean(), scores.max())) 16 | 17 | gb = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, 18 | subsample=.8, max_features=.5) 19 | scores = cross_val_score(gb, features, target, cv=5, n_jobs=4, 20 | scoring='accuracy') 21 | print("Gradient Boosting Trees CV scores:") 22 | print("min: {:.3f}, mean: {:.3f}, max: {:.3f}".format( 23 | scores.min(), scores.mean(), scores.max())) 24 | -------------------------------------------------------------------------------- /notebooks/solutions/04C_feature_importance.py: -------------------------------------------------------------------------------- 1 | gb_new = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, 2 | subsample=.8, max_features=.5) 3 | gb_new.fit(features, target) 4 | feature_names = features.columns.values 5 | x = np.arange(len(feature_names)) 6 | plt.bar(x, gb_new.feature_importances_) 7 | _ = plt.xticks(x + 0.5, feature_names, rotation=30) 8 | -------------------------------------------------------------------------------- /notebooks/solutions/05B_houses_plot.py: -------------------------------------------------------------------------------- 1 | for index, feature_name in enumerate(data.feature_names): 2 | plt.figure() 3 | plt.scatter(data.data[:, index], data.target) 4 | plt.ylabel('Price') 5 | plt.xlabel(feature_name) 6 | 7 | -------------------------------------------------------------------------------- /notebooks/solutions/05B_houses_regression.py: -------------------------------------------------------------------------------- 1 | from sklearn.ensemble import GradientBoostingRegressor 2 | 3 | clf = GradientBoostingRegressor() 4 | clf.fit(X_train, y_train) 5 | 6 | predicted = clf.predict(X_test) 7 | expected = y_test 8 
| 9 | plt.scatter(expected, predicted) 10 | plt.plot([0, 50], [0, 50], '--k') 11 | plt.axis('tight') 12 | plt.xlabel('True price ($1000s)') 13 | plt.ylabel('Predicted price ($1000s)') 14 | print("RMS:", np.sqrt(np.mean((predicted - expected) ** 2))) 15 | -------------------------------------------------------------------------------- /notebooks/solutions/05C_validation_exercise.py: -------------------------------------------------------------------------------- 1 | from sklearn.datasets import load_digits 2 | from sklearn.model_selection import train_test_split 3 | from sklearn.svm import LinearSVC 4 | from sklearn.naive_bayes import GaussianNB 5 | from sklearn.neighbors import KNeighborsClassifier 6 | from sklearn import metrics 7 | 8 | digits = load_digits() 9 | 10 | X = digits.data 11 | y = digits.target 12 | X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0) 13 | 14 | for Model in [LinearSVC, GaussianNB, KNeighborsClassifier]: 15 | clf = Model().fit(X_train, y_train) 16 | y_pred = clf.predict(X_test) 17 | print(Model.__name__, metrics.f1_score(y_test, y_pred, average='weighted')) 18 | 19 | print('------------------') 20 | 21 | # test SVC loss 22 | for loss in ['hinge', 'squared_hinge']: 23 | clf = LinearSVC(loss=loss).fit(X_train, y_train) 24 | y_pred = clf.predict(X_test) 25 | print("LinearSVC(loss='{0}')".format(loss), metrics.f1_score(y_test, y_pred, average='weighted')) 26 | 27 | print('-------------------') 28 | 29 | # test K-neighbors 30 | for n_neighbors in range(1, 11): 31 | clf = KNeighborsClassifier(n_neighbors=n_neighbors).fit(X_train, y_train) 32 | y_pred = clf.predict(X_test) 33 | print("KNeighbors(n_neighbors={}): {:.3f}".format( 34 | n_neighbors, metrics.f1_score(y_test, y_pred, average='weighted'))) 35 | -------------------------------------------------------------------------------- /notebooks/solutions/07B_basic_grid_search.py: -------------------------------------------------------------------------------- 1 | for Model in [Lasso, Ridge]: 2 | scores = [cross_val_score(Model(alpha), X, y, cv=3).mean() 3 | for alpha in alphas] 4 | plt.plot(alphas, scores, label=Model.__name__) 5 | plt.legend(loc='lower left') 6 | -------------------------------------------------------------------------------- /notebooks/solutions/07B_learning_curves.py: -------------------------------------------------------------------------------- 1 | from sklearn.metrics import explained_variance_score, mean_squared_error 2 | from sklearn.cross_validation import train_test_split 3 | 4 | def plot_learning_curve(model, err_func=explained_variance_score, N=300, n_runs=10, n_sizes=50, ylim=None): 5 | sizes = np.linspace(5, N, n_sizes).astype(int) 6 | train_err = np.zeros((n_runs, n_sizes)) 7 | validation_err = np.zeros((n_runs, n_sizes)) 8 | for i in range(n_runs): 9 | for j, size in enumerate(sizes): 10 | xtrain, xtest, ytrain, ytest = train_test_split( 11 | X, y, train_size=size, random_state=i) 12 | # Train on only the first `size` points 13 | model.fit(xtrain, ytrain) 14 | validation_err[i, j] = err_func(ytest, model.predict(xtest)) 15 | train_err[i, j] = err_func(ytrain, model.predict(xtrain)) 16 | 17 | plt.plot(sizes, validation_err.mean(axis=0), lw=2, label='validation') 18 | plt.plot(sizes, train_err.mean(axis=0), lw=2, label='training') 19 | 20 | plt.xlabel('traning set size') 21 | plt.ylabel(err_func.__name__.replace('_', ' ')) 22 | 23 | plt.grid(True) 24 | 25 | plt.legend(loc=0) 26 | 27 | plt.xlim(0, N-1) 28 | 29 | if ylim: 30 | plt.ylim(ylim) 31 | 32 | 
33 | plt.figure(figsize=(10, 8)) 34 | for i, model in enumerate([Lasso(0.01), Ridge(0.06)]): 35 | plt.subplot(221 + i) 36 | plot_learning_curve(model, ylim=(0, 1)) 37 | plt.title(model.__class__.__name__) 38 | 39 | plt.subplot(223 + i) 40 | plot_learning_curve(model, err_func=mean_squared_error, ylim=(0, 8000)) 41 | -------------------------------------------------------------------------------- /notebooks/solutions/08A_digits_projection.py: -------------------------------------------------------------------------------- 1 | from sklearn.decomposition import PCA 2 | from sklearn.manifold import Isomap, LocallyLinearEmbedding 3 | 4 | plt.figure(figsize=(14, 4)) 5 | for i, est in enumerate([PCA(n_components=2, whiten=True), 6 | Isomap(n_components=2, n_neighbors=10), 7 | LocallyLinearEmbedding(n_components=2, n_neighbors=10, method='modified')]): 8 | plt.subplot(131 + i) 9 | projection = est.fit_transform(digits.data) 10 | plt.scatter(projection[:, 0], projection[:, 1], c=digits.target) 11 | plt.title(est.__class__.__name__) 12 | --------------------------------------------------------------------------------