├── .gitignore ├── Makefile ├── README.md ├── fetch_data.py ├── ipynbhelper.py └── notebooks ├── 01_introduction.ipynb ├── 02_representation_of_data.ipynb ├── 03_basic_principles_of_machine_learning.ipynb ├── 04_unsupervised_dimreduction.ipynb ├── 05_measuring_prediction_performance.ipynb ├── 06_sklearn_pandas_and_heterogeneous_data_modeling.ipynb ├── 07_sklearn_unstructured_data_face_recognition.ipynb ├── 08_sklearn_pandas_practical.ipynb ├── figures ├── ML_flow_chart.py ├── __init__.py ├── bias_variance.py ├── grid_search_cv_splits.png ├── grid_search_parameters.png ├── iris_setosa.jpg ├── iris_versicolor.jpg ├── iris_virginica.jpg ├── linear_regression.py ├── sgd_separator.py ├── supervised_scikit_learn.png └── svm_gui_frames.py ├── helpers.py ├── images ├── parallel_text_clf.png ├── parallel_text_clf_average.png └── predictive_modeling_data_flow.png └── solutions ├── 02A_faces_plot.py ├── 04A_plot_logistic_regression_weights.py ├── 04B_more_categorical_variables.py ├── 04C_feature_importance.py ├── 05B_houses_plot.py ├── 05B_houses_regression.py ├── 05C_validation_exercise.py ├── 07B_basic_grid_search.py ├── 07B_learning_curves.py └── 08A_digits_projection.py /.gitignore: -------------------------------------------------------------------------------- 1 | *~ 2 | *.pyc 3 | *.npy 4 | *.npz 5 | notebooks/figures/downloads/ 6 | *.ipynb_checkpoints 7 | *.mmap 8 | *.pkl 9 | datasets 10 | joblib 11 | -------------------------------------------------------------------------------- /Makefile: -------------------------------------------------------------------------------- 1 | # Makefile used to manage the git repository, not for the tutorial 2 | 3 | all: 4 | python ipynbhelper.py --check 5 | python ipynbhelper.py --render 6 | python ipynbhelper.py --clean 7 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | Introduction to predictive analytics with pandas and scikit-learn 2 | ================================================================= 3 | 4 | This repository contains notebooks to get started with predictive 5 | analytics using scikit-learn and pandas. 6 | 7 | This material is strongly inspired from the 8 | [EuroPython 2014 scikit-learn tutorial](https://github.com/GaelVaroquaux/sklearn_pandas_tutorial) 9 | 10 | * Olivier Grisel [@ogrisel](https://twitter.com/ogrisel) | 11 | http://ogrisel.com 12 | 13 | * Gael Varoquaux [@GaelVaroquaux](https://twitter.com/GaelVaroquaux) | 14 | http://gael-varoquaux.info 15 | 16 | which was inspired by http://github.com/jakevdp/sklearn_scipy2013 17 | by Jake VanderPlas [@jakevdp](https://twitter.com/jakevdp) | http://jakevdp.github.com 18 | 19 | Installation Notes 20 | ------------------ 21 | 22 | This tutorial will require recent installations of *numpy*, *scipy*, 23 | *matplotlib*, *scikit-learn*, *pandas* and *Pillow* (or PIL). 24 | 25 | For users who do not yet have these packages installed, a relatively 26 | painless way to install all the requirements is to use a package such as 27 | [Anaconda](http://continuum.io/downloads), which can be downloaded and 28 | installed for free. 29 | 30 | Please download in advance the datasets mentionned in [Data Downloads](#data-downloads) 31 | 32 | 33 | ### With the IPython/jupyter notebook 34 | 35 | The recommended way to access the materials is to execute them in the 36 | IPython/jupyter notebook. 
If you have the notebook installed, you should 37 | download the materials (see below), go the the `notebooks` directory, and 38 | launch IPython notebook from there by typing: 39 | 40 | cd notebooks 41 | jupyter notebook # ipython notebook if old version 42 | 43 | in your terminal window. This will open a notebook panel load in your web 44 | browser. 45 | 46 | Downloading the Tutorial Materials 47 | ---------------------------------- 48 | 49 | I would highly recommend using git, not only for this tutorial, but for the 50 | general betterment of your life. Once git is installed, you can clone the 51 | material in this tutorial by using the git address shown above: 52 | 53 | If you can't or don't want to install git, there is a link above to download 54 | the contents of this repository as a zip file. I may make minor changes to 55 | the repository in the days before the tutorial, however, so cloning the 56 | repository is a much better option. 57 | 58 | Data Downloads 59 | -------------- 60 | 61 | The data for this tutorial is not included in the repository. We will be 62 | using several data sets during the tutorial: most are built-in to 63 | scikit-learn, which includes code which automatically downloads and 64 | caches these data. Because the wireless network at conferences can often 65 | be spotty, it would be a good idea to download these data sets before 66 | arriving at the conference. You can do so by using the `fetch_data.py` 67 | included in the tutorial materials. 68 | 69 | You will also need: 70 | 71 | https://dl.dropboxusercontent.com/u/2140486/data/titanic_train.csv 72 | https://dl.dropboxusercontent.com/u/2140486/data/adult_train.csv 73 | -------------------------------------------------------------------------------- /fetch_data.py: -------------------------------------------------------------------------------- 1 | from sklearn.datasets import fetch_olivetti_faces 2 | from sklearn.datasets import fetch_lfw_people 3 | from sklearn.datasets import get_data_home 4 | 5 | 6 | if __name__ == "__main__": 7 | fetch_olivetti_faces() 8 | 9 | print("Loading Labeled Faces Data (~200MB)") 10 | fetch_lfw_people(min_faces_per_person=70, resize=0.4) 11 | print("=> Success!") 12 | print("Data saved in %s" % get_data_home()) 13 | -------------------------------------------------------------------------------- /ipynbhelper.py: -------------------------------------------------------------------------------- 1 | """Utility script to be used to cleanup the notebooks before git commit 2 | 3 | This a mix from @minrk's various gists. 
4 | 5 | """ 6 | 7 | import sys 8 | import os 9 | import io 10 | try: 11 | from queue import Empty 12 | except: 13 | from Queue import Empty 14 | 15 | from nbformat import current 16 | try: 17 | from jupyter_client import KernelManager 18 | assert KernelManager # to silence pyflakes 19 | except ImportError: 20 | # 0.13 21 | from IPython.zmq.blockingkernelmanager import BlockingKernelManager 22 | KernelManager = BlockingKernelManager 23 | 24 | 25 | def remove_outputs(nb): 26 | """Remove the outputs from a notebook""" 27 | for ws in nb.worksheets: 28 | for cell in ws.cells: 29 | if cell.cell_type == 'code': 30 | cell.outputs = [] 31 | if 'prompt_number' in cell: 32 | del cell['prompt_number'] 33 | 34 | 35 | def remove_signature(nb): 36 | """Remove the signature from a notebook""" 37 | if 'signature' in nb.metadata: 38 | del nb.metadata['signature'] 39 | 40 | 41 | def run_cell(shell, iopub, cell, timeout=300): 42 | if not hasattr(cell, 'input'): 43 | return [], False 44 | shell.send(cell.input) 45 | # wait for finish, maximum 5min by default 46 | reply = shell.get_msg(timeout=timeout)['content'] 47 | if reply['status'] == 'error': 48 | failed = True 49 | print("\nFAILURE:") 50 | print(cell.input) 51 | print('-----') 52 | print("raised:") 53 | print('\n'.join(reply['traceback'])) 54 | else: 55 | failed = False 56 | 57 | # Collect the outputs of the cell execution 58 | outs = [] 59 | while True: 60 | try: 61 | msg = iopub.get_msg(timeout=0.2) 62 | except Empty: 63 | break 64 | msg_type = msg['msg_type'] 65 | if msg_type in ('status', 'pyin'): 66 | continue 67 | elif msg_type == 'clear_output': 68 | outs = [] 69 | continue 70 | 71 | content = msg['content'] 72 | out = current.NotebookNode(output_type=msg_type) 73 | 74 | if msg_type == 'stream': 75 | out.stream = content['name'] 76 | out.text = content['data'] 77 | elif msg_type in ('display_data', 'pyout'): 78 | for mime, data in content['data'].items(): 79 | attr = mime.split('/')[-1].lower() 80 | # this gets most right, but fix svg+html, plain 81 | attr = attr.replace('+xml', '').replace('plain', 'text') 82 | setattr(out, attr, data) 83 | if msg_type == 'pyout': 84 | out.prompt_number = content['execution_count'] 85 | elif msg_type == 'pyerr': 86 | out.ename = content['ename'] 87 | out.evalue = content['evalue'] 88 | out.traceback = content['traceback'] 89 | else: 90 | print("unhandled iopub msg: %s" % msg_type) 91 | 92 | outs.append(out) 93 | return outs, failed 94 | 95 | 96 | def run_notebook(nb): 97 | km = KernelManager() 98 | km.start_kernel(stderr=open(os.devnull, 'w')) 99 | if hasattr(km, 'client'): 100 | kc = km.client() 101 | kc.start_channels() 102 | iopub = kc.iopub_channel 103 | else: 104 | # IPython 0.13 compat 105 | kc = km 106 | kc.start_channels() 107 | iopub = kc.sub_channel 108 | shell = kc.shell_channel 109 | 110 | # simple ping: 111 | shell.send("pass") 112 | shell.get_msg() 113 | 114 | cells = 0 115 | failures = 0 116 | for ws in nb.worksheets: 117 | for cell in ws.cells: 118 | if cell.cell_type != 'code': 119 | continue 120 | 121 | outputs, failed = run_cell(shell, iopub, cell) 122 | cell.outputs = outputs 123 | cell['prompt_number'] = cells 124 | failures += failed 125 | cells += 1 126 | sys.stdout.write('.') 127 | 128 | print() 129 | print("ran notebook %s" % nb.metadata.name) 130 | print(" ran %3i cells" % cells) 131 | if failures: 132 | print(" %3i cells raised exceptions" % failures) 133 | kc.stop_channels() 134 | km.shutdown_kernel() 135 | del km 136 | 137 | 138 | def process_notebook_file(fname, action='clean', 
output_fname=None): 139 | print("Performing '{}' on: {}".format(action, fname)) 140 | orig_wd = os.getcwd() 141 | with io.open(fname, 'r') as f: 142 | nb = current.read(f, 'json') 143 | 144 | if action == 'check': 145 | os.chdir(os.path.dirname(fname)) 146 | run_notebook(nb) 147 | remove_outputs(nb) 148 | remove_signature(nb) 149 | elif action == 'render': 150 | os.chdir(os.path.dirname(fname)) 151 | run_notebook(nb) 152 | else: 153 | # Clean by default 154 | remove_outputs(nb) 155 | remove_signature(nb) 156 | 157 | os.chdir(orig_wd) 158 | if output_fname is None: 159 | output_fname = fname 160 | with io.open(output_fname, 'w') as f: 161 | nb = current.write(nb, f, 'json') 162 | 163 | 164 | if __name__ == '__main__': 165 | # TODO: use argparse instead 166 | args = sys.argv[1:] 167 | targets = [t for t in args if not t.startswith('--')] 168 | action = 'check' if '--check' in args else 'clean' 169 | action = 'render' if '--render' in args else action 170 | 171 | rendered_folder = os.path.join(os.path.dirname(__file__), 172 | 'rendered_notebooks') 173 | if not os.path.exists(rendered_folder): 174 | os.makedirs(rendered_folder) 175 | if not targets: 176 | targets = [os.path.join(os.path.dirname(__file__), 'notebooks')] 177 | 178 | for target in targets: 179 | if os.path.isdir(target): 180 | fnames = [os.path.abspath(os.path.join(target, f)) 181 | for f in os.listdir(target) 182 | if f.endswith('.ipynb')] 183 | else: 184 | fnames = [target] 185 | for fname in fnames: 186 | if action == 'render': 187 | output_fname = os.path.join(rendered_folder, 188 | os.path.basename(fname)) 189 | else: 190 | output_fname = fname 191 | process_notebook_file(fname, action=action, 192 | output_fname=output_fname) 193 | -------------------------------------------------------------------------------- /notebooks/01_introduction.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# What is machine learning?" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "In this section we will begin to explore the basic principles of machine learning.\n", 15 | "Machine Learning is about building programs with **tunable parameters** (typically an\n", 16 | "array of floating point values) that are adjusted automatically so as to improve\n", 17 | "their behavior by **adapting to previously seen data.**\n", 18 | "\n", 19 | "Machine Learning can be considered a subfield of **Artificial Intelligence** since those\n", 20 | "algorithms can be seen as building blocks to make computers learn to behave more\n", 21 | "intelligently by somehow **generalizing** rather that just storing and retrieving data items\n", 22 | "like a database system would do.\n", 23 | "\n", 24 | "We'll take a look at two very simple machine learning tasks here.\n", 25 | "The first is a **classification** task: the figure shows a\n", 26 | "collection of two-dimensional data, colored according to two different class\n", 27 | "labels. 
A classification algorithm may be used to draw a dividing boundary\n", 28 | "between the two clusters of points:" 29 | ] 30 | }, 31 | { 32 | "cell_type": "code", 33 | "execution_count": null, 34 | "metadata": { 35 | "collapsed": false 36 | }, 37 | "outputs": [], 38 | "source": [ 39 | "# Start pylab inline mode, so figures will appear in the notebook\n", 40 | "%matplotlib inline" 41 | ] 42 | }, 43 | { 44 | "cell_type": "code", 45 | "execution_count": null, 46 | "metadata": { 47 | "collapsed": false 48 | }, 49 | "outputs": [], 50 | "source": [ 51 | "# Import the example plot from the figures directory\n", 52 | "from figures import plot_sgd_separator\n", 53 | "plot_sgd_separator()" 54 | ] 55 | }, 56 | { 57 | "cell_type": "markdown", 58 | "metadata": {}, 59 | "source": [ 60 | "This may seem like a trivial task, but it is a simple version of a very important concept.\n", 61 | "By drawing this separating line, we have learned a model which can **generalize** to new\n", 62 | "data: if you were to drop another point onto the plane which is unlabeled, this algorithm\n", 63 | "could now **predict** whether it's a blue or a red point.\n", 64 | "\n", 65 | "If you'd like to see the source code used to generate this, you can either open the\n", 66 | "code in the `figures` directory, or you can load the code using the `%load` magic command:" 67 | ] 68 | }, 69 | { 70 | "cell_type": "code", 71 | "execution_count": null, 72 | "metadata": { 73 | "collapsed": false 74 | }, 75 | "outputs": [], 76 | "source": [ 77 | "%load figures/sgd_separator.py" 78 | ] 79 | }, 80 | { 81 | "cell_type": "markdown", 82 | "metadata": {}, 83 | "source": [ 84 | "The next simple task we'll look at is a **regression** task: a simple best-fit line\n", 85 | "to a set of data:" 86 | ] 87 | }, 88 | { 89 | "cell_type": "code", 90 | "execution_count": null, 91 | "metadata": { 92 | "collapsed": false 93 | }, 94 | "outputs": [], 95 | "source": [ 96 | "from figures import plot_linear_regression\n", 97 | "plot_linear_regression()" 98 | ] 99 | }, 100 | { 101 | "cell_type": "markdown", 102 | "metadata": {}, 103 | "source": [ 104 | "Again, this is an example of fitting a model to data, such that the model can make\n", 105 | "generalizations about new data. The model has been **learned** from the training\n", 106 | "data, and can be used to predict the result of test data:\n", 107 | "here, we might be given an x-value, and the model would\n", 108 | "allow us to predict the y value. Again, this might seem like a trivial problem,\n", 109 | "but it is a basic example of a type of operation that is fundamental to\n", 110 | "machine learning tasks." 
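The same idea fits in a few lines of scikit-learn code. The snippet below is not part of the original notebook; it is a minimal sketch, using made-up toy numbers, of fitting a straight line and then predicting the y value for a previously unseen x:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy training data: scikit-learn expects X as a 2D array of shape (n_samples, n_features)
X_train = np.array([[0.0], [1.0], [2.0], [3.0]])
y_train = np.array([0.1, 0.9, 2.1, 2.9])  # roughly y = x

model = LinearRegression()
model.fit(X_train, y_train)

# Given a new x value, the fitted model predicts the corresponding y value
x_new = np.array([[4.0]])
print(model.predict(x_new))  # close to 4
```

The `[[...]]` shape matters: scikit-learn expects a 2D array of shape `(n_samples, n_features)`, a convention discussed in detail in the next notebook.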
111 | ] 112 | }, 113 | { 114 | "cell_type": "markdown", 115 | "metadata": {}, 116 | "source": [ 117 | "# An Overview of Scikit-learn" 118 | ] 119 | }, 120 | { 121 | "cell_type": "markdown", 122 | "metadata": {}, 123 | "source": [ 124 | "*Adapted from* [*http://scikit-learn.org/stable/tutorial/basic/tutorial.html*](http://scikit-learn.org/stable/tutorial/basic/tutorial.html)" 125 | ] 126 | }, 127 | { 128 | "cell_type": "code", 129 | "execution_count": null, 130 | "metadata": { 131 | "collapsed": false 132 | }, 133 | "outputs": [], 134 | "source": [ 135 | "%matplotlib inline\n", 136 | "import numpy as np\n", 137 | "from matplotlib import pyplot as plt" 138 | ] 139 | }, 140 | { 141 | "cell_type": "markdown", 142 | "metadata": {}, 143 | "source": [ 144 | "## Loading an Example Dataset" 145 | ] 146 | }, 147 | { 148 | "cell_type": "code", 149 | "execution_count": null, 150 | "metadata": { 151 | "collapsed": false 152 | }, 153 | "outputs": [], 154 | "source": [ 155 | "from sklearn import datasets\n", 156 | "digits = datasets.load_digits()" 157 | ] 158 | }, 159 | { 160 | "cell_type": "code", 161 | "execution_count": null, 162 | "metadata": { 163 | "collapsed": false 164 | }, 165 | "outputs": [], 166 | "source": [ 167 | "digits.data" 168 | ] 169 | }, 170 | { 171 | "cell_type": "code", 172 | "execution_count": null, 173 | "metadata": { 174 | "collapsed": false 175 | }, 176 | "outputs": [], 177 | "source": [ 178 | "digits.target" 179 | ] 180 | }, 181 | { 182 | "cell_type": "code", 183 | "execution_count": null, 184 | "metadata": { 185 | "collapsed": false 186 | }, 187 | "outputs": [], 188 | "source": [ 189 | "digits.images[0]" 190 | ] 191 | }, 192 | { 193 | "cell_type": "markdown", 194 | "metadata": {}, 195 | "source": [ 196 | "## Learning and Predicting" 197 | ] 198 | }, 199 | { 200 | "cell_type": "code", 201 | "execution_count": null, 202 | "metadata": { 203 | "collapsed": false 204 | }, 205 | "outputs": [], 206 | "source": [ 207 | "from sklearn import svm\n", 208 | "clf = svm.SVC(gamma=0.001, C=100.)" 209 | ] 210 | }, 211 | { 212 | "cell_type": "code", 213 | "execution_count": null, 214 | "metadata": { 215 | "collapsed": false 216 | }, 217 | "outputs": [], 218 | "source": [ 219 | "clf.fit(digits.data[:-1], digits.target[:-1])" 220 | ] 221 | }, 222 | { 223 | "cell_type": "code", 224 | "execution_count": null, 225 | "metadata": { 226 | "collapsed": false 227 | }, 228 | "outputs": [], 229 | "source": [ 230 | "clf.predict(digits.data[-1:, :])" 231 | ] 232 | }, 233 | { 234 | "cell_type": "code", 235 | "execution_count": null, 236 | "metadata": { 237 | "collapsed": false 238 | }, 239 | "outputs": [], 240 | "source": [ 241 | "plt.figure(figsize=(2, 2))\n", 242 | "plt.imshow(digits.images[-1], interpolation='nearest', cmap=plt.cm.binary)" 243 | ] 244 | }, 245 | { 246 | "cell_type": "code", 247 | "execution_count": null, 248 | "metadata": { 249 | "collapsed": false 250 | }, 251 | "outputs": [], 252 | "source": [ 253 | "print(digits.target[-1])" 254 | ] 255 | } 256 | ], 257 | "metadata": { 258 | "kernelspec": { 259 | "display_name": "Python 2", 260 | "language": "python", 261 | "name": "python2" 262 | }, 263 | "language_info": { 264 | "codemirror_mode": { 265 | "name": "ipython", 266 | "version": 2 267 | }, 268 | "file_extension": ".py", 269 | "mimetype": "text/x-python", 270 | "name": "python", 271 | "nbconvert_exporter": "python", 272 | "pygments_lexer": "ipython2", 273 | "version": "2.7.11" 274 | } 275 | }, 276 | "nbformat": 4, 277 | "nbformat_minor": 0 278 | } 279 | 
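As a complement to the cells above (this sketch is not in the original notebook), the same digits classifier can be evaluated on a proper held-out test set rather than on a single left-out sample. Note that `train_test_split` lives in `sklearn.model_selection` in recent scikit-learn releases (older releases have it in `sklearn.cross_validation`):

```python
from sklearn import datasets, svm
from sklearn.model_selection import train_test_split  # sklearn.cross_validation in old releases

digits = datasets.load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.25, random_state=0)

# Same hyperparameters as in the notebook cell above
clf = svm.SVC(gamma=0.001, C=100.)
clf.fit(X_train, y_train)

# Mean accuracy on the held-out quarter of the data
print(clf.score(X_test, y_test))
```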
-------------------------------------------------------------------------------- /notebooks/02_representation_of_data.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Representation and Visualization of Data" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "Machine learning is about creating models from data: for that reason, we'll start by\n", 15 | "discussing how data can be represented in order to be understood by the computer. Along\n", 16 | "with this, we'll build on our matplotlib examples from the previous section and show some\n", 17 | "examples of how to visualize data.\n", 18 | "\n", 19 | "By the end of this section you should:\n", 20 | "\n", 21 | "- Know the internal data representation of scikit-learn.\n", 22 | "- Know how to use scikit-learn's dataset loaders to load example data.\n", 23 | "- Know how to use matplotlib to help visualize different types of data." 24 | ] 25 | }, 26 | { 27 | "cell_type": "markdown", 28 | "metadata": {}, 29 | "source": [ 30 | "## Data in scikit-learn" 31 | ] 32 | }, 33 | { 34 | "cell_type": "markdown", 35 | "metadata": {}, 36 | "source": [ 37 | "Data in scikit-learn, with very few exceptions, is assumed to be stored as a\n", 38 | "**two-dimensional array**, of size `[n_samples, n_features]`." 39 | ] 40 | }, 41 | { 42 | "cell_type": "markdown", 43 | "metadata": {}, 44 | "source": [ 45 | "Most machine learning algorithms implemented in scikit-learn expect data to be stored in a\n", 46 | "**two-dimensional array or matrix**. The arrays can be\n", 47 | "either ``numpy`` arrays, or in some cases ``scipy.sparse`` matrices.\n", 48 | "The size of the array is expected to be `[n_samples, n_features]`\n", 49 | "\n", 50 | "- **n_samples:** The number of samples: each sample is an item to process (e.g. classify).\n", 51 | " A sample can be a document, a picture, a sound, a video, an astronomical object,\n", 52 | " a row in database or CSV file,\n", 53 | " or whatever you can describe with a fixed set of quantitative traits.\n", 54 | "- **n_features:** The number of features or distinct traits that can be used to describe each\n", 55 | " item in a quantitative manner. Features are generally real-valued, but may be boolean or\n", 56 | " discrete-valued in some cases.\n", 57 | "\n", 58 | "The number of features must be fixed in advance. However it can be very high dimensional\n", 59 | "(e.g. millions of features) with most of them being zeros for a given sample. This is a case\n", 60 | "where `scipy.sparse` matrices can be useful, in that they are\n", 61 | "much more memory-efficient than numpy arrays." 62 | ] 63 | }, 64 | { 65 | "cell_type": "markdown", 66 | "metadata": {}, 67 | "source": [ 68 | "### A Simple Example: the Iris Dataset" 69 | ] 70 | }, 71 | { 72 | "cell_type": "markdown", 73 | "metadata": {}, 74 | "source": [ 75 | "As an example of a simple dataset, we're going to take a look at the iris data stored by scikit-learn.\n", 76 | "The data consists of measurements of three different species of irises. 
There are three species of iris\n", 77 | "in the dataset, which we can picture here:" 78 | ] 79 | }, 80 | { 81 | "cell_type": "code", 82 | "execution_count": null, 83 | "metadata": { 84 | "collapsed": false 85 | }, 86 | "outputs": [], 87 | "source": [ 88 | "from IPython.core.display import Image, display\n", 89 | "display(Image(filename='figures/iris_setosa.jpg'))\n", 90 | "print(\"Iris Setosa\\n\")\n", 91 | "\n", 92 | "display(Image(filename='figures/iris_versicolor.jpg'))\n", 93 | "print(\"Iris Versicolor\\n\")\n", 94 | "\n", 95 | "display(Image(filename='figures/iris_virginica.jpg'))\n", 96 | "print(\"Iris Virginica\")" 97 | ] 98 | }, 99 | { 100 | "cell_type": "markdown", 101 | "metadata": {}, 102 | "source": [ 103 | "### Quick Question:" 104 | ] 105 | }, 106 | { 107 | "cell_type": "markdown", 108 | "metadata": {}, 109 | "source": [ 110 | "**If we want to design an algorithm to recognize iris species, what might the data be?**\n", 111 | "\n", 112 | "Remember: we need a 2D array of size `[n_samples x n_features]`.\n", 113 | "\n", 114 | "- What would the `n_samples` refer to?\n", 115 | "\n", 116 | "- What might the `n_features` refer to?\n", 117 | "\n", 118 | "Remember that there must be a **fixed** number of features for each sample, and feature\n", 119 | "number ``i`` must be a similar kind of quantity for each sample." 120 | ] 121 | }, 122 | { 123 | "cell_type": "markdown", 124 | "metadata": {}, 125 | "source": [ 126 | "### Loading the Iris Data with Scikit-learn" 127 | ] 128 | }, 129 | { 130 | "cell_type": "markdown", 131 | "metadata": {}, 132 | "source": [ 133 | "Scikit-learn has a very straightforward set of data on these iris species. The data consist of\n", 134 | "the following:\n", 135 | "\n", 136 | "- Features in the Iris dataset:\n", 137 | "\n", 138 | " 1. sepal length in cm\n", 139 | " 2. sepal width in cm\n", 140 | " 3. petal length in cm\n", 141 | " 4. petal width in cm\n", 142 | "\n", 143 | "- Target classes to predict:\n", 144 | "\n", 145 | " 1. Iris Setosa\n", 146 | " 2. Iris Versicolour\n", 147 | " 3. 
Iris Virginica" 148 | ] 149 | }, 150 | { 151 | "cell_type": "markdown", 152 | "metadata": {}, 153 | "source": [ 154 | "``scikit-learn`` embeds a copy of the iris CSV file along with a helper function to load it into numpy arrays:" 155 | ] 156 | }, 157 | { 158 | "cell_type": "code", 159 | "execution_count": null, 160 | "metadata": { 161 | "collapsed": false 162 | }, 163 | "outputs": [], 164 | "source": [ 165 | "from sklearn.datasets import load_iris\n", 166 | "iris = load_iris()" 167 | ] 168 | }, 169 | { 170 | "cell_type": "markdown", 171 | "metadata": {}, 172 | "source": [ 173 | "The resulting dataset is a ``Bunch`` object: you can see what's available using\n", 174 | "the method ``keys()``:" 175 | ] 176 | }, 177 | { 178 | "cell_type": "code", 179 | "execution_count": null, 180 | "metadata": { 181 | "collapsed": false 182 | }, 183 | "outputs": [], 184 | "source": [ 185 | "iris.keys()" 186 | ] 187 | }, 188 | { 189 | "cell_type": "markdown", 190 | "metadata": {}, 191 | "source": [ 192 | "The features of each sample flower are stored in the ``data`` attribute of the dataset:" 193 | ] 194 | }, 195 | { 196 | "cell_type": "code", 197 | "execution_count": null, 198 | "metadata": { 199 | "collapsed": false 200 | }, 201 | "outputs": [], 202 | "source": [ 203 | "n_samples, n_features = iris.data.shape\n", 204 | "print(n_samples)\n", 205 | "print(n_features)\n", 206 | "print(iris.data[0])" 207 | ] 208 | }, 209 | { 210 | "cell_type": "markdown", 211 | "metadata": {}, 212 | "source": [ 213 | "The information about the class of each sample is stored in the ``target`` attribute of the dataset:" 214 | ] 215 | }, 216 | { 217 | "cell_type": "code", 218 | "execution_count": null, 219 | "metadata": { 220 | "collapsed": false 221 | }, 222 | "outputs": [], 223 | "source": [ 224 | "print(iris.data.shape)\n", 225 | "print(iris.target.shape)" 226 | ] 227 | }, 228 | { 229 | "cell_type": "code", 230 | "execution_count": null, 231 | "metadata": { 232 | "collapsed": false 233 | }, 234 | "outputs": [], 235 | "source": [ 236 | "print(iris.target)" 237 | ] 238 | }, 239 | { 240 | "cell_type": "markdown", 241 | "metadata": {}, 242 | "source": [ 243 | "The names of the classes are stored in the last attribute, namely ``target_names``:" 244 | ] 245 | }, 246 | { 247 | "cell_type": "code", 248 | "execution_count": null, 249 | "metadata": { 250 | "collapsed": false 251 | }, 252 | "outputs": [], 253 | "source": [ 254 | "print(iris.target_names)" 255 | ] 256 | }, 257 | { 258 | "cell_type": "markdown", 259 | "metadata": {}, 260 | "source": [ 261 | "This data is four dimensional, but we can visualize two of the dimensions\n", 262 | "at a time using a simple scatter-plot. 
Again, we'll start by enabling\n", 263 | "matplotlib inline mode:" 264 | ] 265 | }, 266 | { 267 | "cell_type": "code", 268 | "execution_count": null, 269 | "metadata": { 270 | "collapsed": false 271 | }, 272 | "outputs": [], 273 | "source": [ 274 | "%matplotlib inline\n", 275 | "from matplotlib import pyplot as plt" 276 | ] 277 | }, 278 | { 279 | "cell_type": "code", 280 | "execution_count": null, 281 | "metadata": { 282 | "collapsed": false 283 | }, 284 | "outputs": [], 285 | "source": [ 286 | "x_index = 0\n", 287 | "y_index = 1\n", 288 | "\n", 289 | "# this formatter will label the colorbar with the correct target names\n", 290 | "formatter = plt.FuncFormatter(lambda i, *args: iris.target_names[int(i)])\n", 291 | "\n", 292 | "plt.scatter(iris.data[:, x_index], iris.data[:, y_index], c=iris.target)\n", 293 | "plt.colorbar(ticks=[0, 1, 2], format=formatter)\n", 294 | "plt.xlabel(iris.feature_names[x_index])\n", 295 | "plt.ylabel(iris.feature_names[y_index])" 296 | ] 297 | }, 298 | { 299 | "cell_type": "markdown", 300 | "metadata": {}, 301 | "source": [ 302 | "### Quick Exercise:" 303 | ] 304 | }, 305 | { 306 | "cell_type": "markdown", 307 | "metadata": {}, 308 | "source": [ 309 | "**Change** `x_index` **and** `y_index` **in the above script\n", 310 | "and find a combination of two parameters\n", 311 | "which maximally separate the three classes.**\n", 312 | "\n", 313 | "This exercise is a preview of **dimensionality reduction**, which we'll see later." 314 | ] 315 | }, 316 | { 317 | "cell_type": "markdown", 318 | "metadata": {}, 319 | "source": [ 320 | "## Other Available Data" 321 | ] 322 | }, 323 | { 324 | "cell_type": "markdown", 325 | "metadata": {}, 326 | "source": [ 327 | "Scikit-learn makes available a host of datasets for testing learning algorithms.\n", 328 | "They come in three flavors:\n", 329 | "\n", 330 | "- **Packaged Data:** these small datasets are packaged with the scikit-learn installation,\n", 331 | " and can be downloaded using the tools in ``sklearn.datasets.load_*``\n", 332 | "- **Downloadable Data:** these larger datasets are available for download, and scikit-learn\n", 333 | " includes tools which streamline this process. These tools can be found in\n", 334 | " ``sklearn.datasets.fetch_*``\n", 335 | "- **Generated Data:** there are several datasets which are generated from models based on a\n", 336 | " random seed. These are available in the ``sklearn.datasets.make_*``\n", 337 | "\n", 338 | "You can explore the available dataset loaders, fetchers, and generators using IPython's\n", 339 | "tab-completion functionality. After importing the ``datasets`` submodule from ``sklearn``,\n", 340 | "type\n", 341 | "\n", 342 | " datasets.load_\n", 343 | "\n", 344 | "or\n", 345 | "\n", 346 | " datasets.fetch_\n", 347 | "\n", 348 | "or\n", 349 | "\n", 350 | " datasets.make_\n", 351 | "\n", 352 | "to see a list of available functions." 
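If you prefer not to rely on tab-completion, the same list can be produced programmatically. This is a small sketch, not part of the original notebook:

```python
from sklearn import datasets

# Emulate the tab-completion above: list the loaders, fetchers and generators
for prefix in ('load_', 'fetch_', 'make_'):
    names = [name for name in dir(datasets) if name.startswith(prefix)]
    print("%s*: %s" % (prefix, ', '.join(sorted(names))))
```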
353 | ] 354 | }, 355 | { 356 | "cell_type": "code", 357 | "execution_count": null, 358 | "metadata": { 359 | "collapsed": false 360 | }, 361 | "outputs": [], 362 | "source": [ 363 | "from sklearn import datasets" 364 | ] 365 | }, 366 | { 367 | "cell_type": "markdown", 368 | "metadata": {}, 369 | "source": [ 370 | "The data downloaded using the ``fetch_`` scripts are stored locally,\n", 371 | "within a subdirectory of your home directory.\n", 372 | "You can use the following to determine where it is:" 373 | ] 374 | }, 375 | { 376 | "cell_type": "code", 377 | "execution_count": null, 378 | "metadata": { 379 | "collapsed": false 380 | }, 381 | "outputs": [], 382 | "source": [ 383 | "from sklearn.datasets import get_data_home\n", 384 | "get_data_home()" 385 | ] 386 | }, 387 | { 388 | "cell_type": "code", 389 | "execution_count": null, 390 | "metadata": { 391 | "collapsed": false 392 | }, 393 | "outputs": [], 394 | "source": [ 395 | "!ls $HOME/scikit_learn_data/" 396 | ] 397 | }, 398 | { 399 | "cell_type": "markdown", 400 | "metadata": {}, 401 | "source": [ 402 | "Be warned: many of these datasets are quite large, and can take a long time to download!\n", 403 | "(especially on Conference wifi).\n", 404 | "\n", 405 | "If you start a download within the IPython notebook\n", 406 | "and you want to kill it, you can use ipython's \"kernel interrupt\" feature, available in the menu." 407 | ] 408 | }, 409 | { 410 | "cell_type": "markdown", 411 | "metadata": {}, 412 | "source": [ 413 | "## Loading Digits Data" 414 | ] 415 | }, 416 | { 417 | "cell_type": "markdown", 418 | "metadata": {}, 419 | "source": [ 420 | "Now we'll take a look at another dataset, one where we have to put a bit\n", 421 | "more thought into how to represent the data. We can explore the data in\n", 422 | "a similar manner as above:" 423 | ] 424 | }, 425 | { 426 | "cell_type": "code", 427 | "execution_count": null, 428 | "metadata": { 429 | "collapsed": false 430 | }, 431 | "outputs": [], 432 | "source": [ 433 | "from sklearn.datasets import load_digits\n", 434 | "digits = load_digits()" 435 | ] 436 | }, 437 | { 438 | "cell_type": "code", 439 | "execution_count": null, 440 | "metadata": { 441 | "collapsed": false 442 | }, 443 | "outputs": [], 444 | "source": [ 445 | "digits.keys()" 446 | ] 447 | }, 448 | { 449 | "cell_type": "code", 450 | "execution_count": null, 451 | "metadata": { 452 | "collapsed": false 453 | }, 454 | "outputs": [], 455 | "source": [ 456 | "n_samples, n_features = digits.data.shape\n", 457 | "print(n_samples, n_features)" 458 | ] 459 | }, 460 | { 461 | "cell_type": "code", 462 | "execution_count": null, 463 | "metadata": { 464 | "collapsed": false 465 | }, 466 | "outputs": [], 467 | "source": [ 468 | "print(digits.data[0])\n", 469 | "print(digits.target)" 470 | ] 471 | }, 472 | { 473 | "cell_type": "markdown", 474 | "metadata": {}, 475 | "source": [ 476 | "The target here is just the digit represented by the data. The data is an array of\n", 477 | "length 64... but what does this data mean?" 478 | ] 479 | }, 480 | { 481 | "cell_type": "markdown", 482 | "metadata": {}, 483 | "source": [ 484 | "There's a clue in the fact that we have two versions of the data array:\n", 485 | "``data`` and ``images``. 
Let's take a look at them:" 486 | ] 487 | }, 488 | { 489 | "cell_type": "code", 490 | "execution_count": null, 491 | "metadata": { 492 | "collapsed": false 493 | }, 494 | "outputs": [], 495 | "source": [ 496 | "print(digits.data.shape)\n", 497 | "print(digits.images.shape)" 498 | ] 499 | }, 500 | { 501 | "cell_type": "markdown", 502 | "metadata": {}, 503 | "source": [ 504 | "We can see that they're related by a simple reshaping:" 505 | ] 506 | }, 507 | { 508 | "cell_type": "code", 509 | "execution_count": null, 510 | "metadata": { 511 | "collapsed": false 512 | }, 513 | "outputs": [], 514 | "source": [ 515 | "import numpy as np # numpy!\n", 516 | "print(np.all(digits.images.reshape((1797, 64)) == digits.data))" 517 | ] 518 | }, 519 | { 520 | "cell_type": "markdown", 521 | "metadata": {}, 522 | "source": [ 523 | "*Aside... numpy and memory efficiency:*\n", 524 | "\n", 525 | "*You might wonder whether duplicating the data is a problem. In this case, the memory\n", 526 | "overhead is very small. Even though the arrays are different shapes, they point to the\n", 527 | "same memory block, which we can see by doing a bit of digging into the guts of numpy:*" 528 | ] 529 | }, 530 | { 531 | "cell_type": "code", 532 | "execution_count": null, 533 | "metadata": { 534 | "collapsed": false 535 | }, 536 | "outputs": [], 537 | "source": [ 538 | "print(digits.data.__array_interface__['data'])\n", 539 | "print(digits.images.__array_interface__['data'])" 540 | ] 541 | }, 542 | { 543 | "cell_type": "markdown", 544 | "metadata": {}, 545 | "source": [ 546 | "*The long integer here is a memory address: the fact that the two are the same tells\n", 547 | "us that the two arrays are looking at the same data.*" 548 | ] 549 | }, 550 | { 551 | "cell_type": "markdown", 552 | "metadata": {}, 553 | "source": [ 554 | "Let's visualize the data. It's little bit more involved than the simple scatter-plot\n", 555 | "we used above, but we can do it rather quickly." 556 | ] 557 | }, 558 | { 559 | "cell_type": "code", 560 | "execution_count": null, 561 | "metadata": { 562 | "collapsed": false 563 | }, 564 | "outputs": [], 565 | "source": [ 566 | "# set up the figure\n", 567 | "fig = plt.figure(figsize=(6, 6)) # figure size in inches\n", 568 | "fig.subplots_adjust(left=0, right=1, bottom=0, top=1, hspace=0.05, wspace=0.05)\n", 569 | "\n", 570 | "# plot the digits: each image is 8x8 pixels\n", 571 | "for i in range(64):\n", 572 | " ax = fig.add_subplot(8, 8, i + 1, xticks=[], yticks=[])\n", 573 | " ax.imshow(digits.images[i], cmap=plt.cm.binary, interpolation='nearest')\n", 574 | " \n", 575 | " # label the image with the target value\n", 576 | " ax.text(0, 7, str(digits.target[i]))" 577 | ] 578 | }, 579 | { 580 | "cell_type": "markdown", 581 | "metadata": {}, 582 | "source": [ 583 | "We see now what the features mean. Each feature is a real-valued quantity representing the\n", 584 | "darkness of a pixel in an 8x8 image of a hand-written digit.\n", 585 | "\n", 586 | "Even though each sample has data that is inherently two-dimensional, the data matrix flattens\n", 587 | "this 2D data into a **single vector**, which can be contained in one **row** of the data matrix." 
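To make the flattening concrete, here is a short sketch (not in the original notebook) that takes one row of the data matrix, reshapes it back into an 8x8 image, and checks that it matches the corresponding entry of ``images``:

```python
import numpy as np
from matplotlib import pyplot as plt
from sklearn.datasets import load_digits

digits = load_digits()

# One sample: a flat 64-vector in the data matrix ...
row = digits.data[0]
print(row.shape)  # (64,)

# ... which is just the 8x8 image unrolled row by row
image = row.reshape(8, 8)
print(np.all(image == digits.images[0]))  # True

plt.imshow(image, cmap=plt.cm.binary, interpolation='nearest')
```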
588 | ] 589 | }, 590 | { 591 | "cell_type": "markdown", 592 | "metadata": {}, 593 | "source": [ 594 | "## Exercise: working with the faces dataset" 595 | ] 596 | }, 597 | { 598 | "cell_type": "markdown", 599 | "metadata": {}, 600 | "source": [ 601 | "Here we'll take a moment for you to explore the datasets yourself.\n", 602 | "Later on we'll be using the Olivetti faces dataset.\n", 603 | "Take a moment to fetch the data (about 1.4MB), and visualize the faces.\n", 604 | "You can copy the code used to visualize the digits above, and modify it for this data." 605 | ] 606 | }, 607 | { 608 | "cell_type": "code", 609 | "execution_count": null, 610 | "metadata": { 611 | "collapsed": false 612 | }, 613 | "outputs": [], 614 | "source": [ 615 | "from sklearn.datasets import fetch_olivetti_faces" 616 | ] 617 | }, 618 | { 619 | "cell_type": "code", 620 | "execution_count": null, 621 | "metadata": { 622 | "collapsed": false 623 | }, 624 | "outputs": [], 625 | "source": [ 626 | "# fetch the faces data\n" 627 | ] 628 | }, 629 | { 630 | "cell_type": "code", 631 | "execution_count": null, 632 | "metadata": { 633 | "collapsed": false 634 | }, 635 | "outputs": [], 636 | "source": [ 637 | "# Use a script like above to plot the faces image data.\n", 638 | "# hint: plt.cm.bone is a good colormap for this data\n" 639 | ] 640 | }, 641 | { 642 | "cell_type": "markdown", 643 | "metadata": {}, 644 | "source": [ 645 | "### Solution:" 646 | ] 647 | }, 648 | { 649 | "cell_type": "code", 650 | "execution_count": null, 651 | "metadata": { 652 | "collapsed": false 653 | }, 654 | "outputs": [], 655 | "source": [ 656 | "%load solutions/02A_faces_plot.py" 657 | ] 658 | } 659 | ], 660 | "metadata": { 661 | "kernelspec": { 662 | "display_name": "Python 2", 663 | "language": "python", 664 | "name": "python2" 665 | }, 666 | "language_info": { 667 | "codemirror_mode": { 668 | "name": "ipython", 669 | "version": 2 670 | }, 671 | "file_extension": ".py", 672 | "mimetype": "text/x-python", 673 | "name": "python", 674 | "nbconvert_exporter": "python", 675 | "pygments_lexer": "ipython2", 676 | "version": "2.7.11" 677 | } 678 | }, 679 | "nbformat": 4, 680 | "nbformat_minor": 0 681 | } 682 | -------------------------------------------------------------------------------- /notebooks/03_basic_principles_of_machine_learning.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": null, 6 | "metadata": { 7 | "collapsed": false 8 | }, 9 | "outputs": [], 10 | "source": [ 11 | "%matplotlib inline\n", 12 | "import numpy as np\n", 13 | "from matplotlib import pyplot as plt" 14 | ] 15 | }, 16 | { 17 | "cell_type": "markdown", 18 | "metadata": {}, 19 | "source": [ 20 | "# Basic principles of machine learning" 21 | ] 22 | }, 23 | { 24 | "cell_type": "markdown", 25 | "metadata": {}, 26 | "source": [ 27 | "Here is where we start diving into the field of machine learning.\n", 28 | "\n", 29 | "By the end of this section you will\n", 30 | "\n", 31 | "- Know the basic categories of supervised learning, including classification and regression problems.\n", 32 | "- Know the basic categories of unsupervised learning, including dimensionality reduction and clustering.\n", 33 | "- Know the basic syntax of the Scikit-learn **estimator** interface.\n", 34 | "- Know how features are extracted from real-world data.\n", 35 | "\n", 36 | "In addition, we will go over several basic tools within scikit-learn which can be used to accomplish the above tasks." 
37 | ] 38 | }, 39 | { 40 | "cell_type": "markdown", 41 | "metadata": {}, 42 | "source": [ 43 | "## Problem setting" 44 | ] 45 | }, 46 | { 47 | "cell_type": "markdown", 48 | "metadata": {}, 49 | "source": [ 50 | "### A simple definition of machine learning" 51 | ] 52 | }, 53 | { 54 | "cell_type": "markdown", 55 | "metadata": {}, 56 | "source": [ 57 | "Machine Learning (ML) is about building programs with **tunable parameters** (typically an\n", 58 | "array of floating point values) that are adjusted automatically so as to improve\n", 59 | "their behavior by **adapting to previously seen data.**\n", 60 | "\n", 61 | "In most ML applications, the data is in a 2D array of shape ``[n_samples x n_features]``,\n", 62 | "where the number of features is the same for each object, and each feature column refers\n", 63 | "to a related piece of information about each sample.\n", 64 | "\n", 65 | "Machine learning can be broken into two broad regimes:\n", 66 | "*supervised learning* and *unsupervised learning*.\n", 67 | "We’ll introduce these concepts here, and discuss them in more detail below." 68 | ] 69 | }, 70 | { 71 | "cell_type": "markdown", 72 | "metadata": {}, 73 | "source": [ 74 | "### Introducing the scikit-learn estimator object" 75 | ] 76 | }, 77 | { 78 | "cell_type": "markdown", 79 | "metadata": {}, 80 | "source": [ 81 | "Every algorithm is exposed in scikit-learn via an ''Estimator'' object. For instance a linear regression is:" 82 | ] 83 | }, 84 | { 85 | "cell_type": "code", 86 | "execution_count": null, 87 | "metadata": { 88 | "collapsed": false 89 | }, 90 | "outputs": [], 91 | "source": [ 92 | "from sklearn.linear_model import LinearRegression" 93 | ] 94 | }, 95 | { 96 | "cell_type": "markdown", 97 | "metadata": {}, 98 | "source": [ 99 | "**Estimator parameters**: All the parameters of an estimator can be set when it is instantiated:" 100 | ] 101 | }, 102 | { 103 | "cell_type": "code", 104 | "execution_count": null, 105 | "metadata": { 106 | "collapsed": false 107 | }, 108 | "outputs": [], 109 | "source": [ 110 | "model = LinearRegression(normalize=True)\n", 111 | "print(model.normalize)" 112 | ] 113 | }, 114 | { 115 | "cell_type": "code", 116 | "execution_count": null, 117 | "metadata": { 118 | "collapsed": false 119 | }, 120 | "outputs": [], 121 | "source": [ 122 | "print(model)" 123 | ] 124 | }, 125 | { 126 | "cell_type": "markdown", 127 | "metadata": {}, 128 | "source": [ 129 | "**Estimated parameters**: When data is fitted with an estimator, parameters are estimated from the data at hand. 
All the estimated parameters are attributes of the estimator object ending by an underscore:" 130 | ] 131 | }, 132 | { 133 | "cell_type": "code", 134 | "execution_count": null, 135 | "metadata": { 136 | "collapsed": false 137 | }, 138 | "outputs": [], 139 | "source": [ 140 | "x = np.array([0, 1, 2])\n", 141 | "y = np.array([0, 1, 2])" 142 | ] 143 | }, 144 | { 145 | "cell_type": "code", 146 | "execution_count": null, 147 | "metadata": { 148 | "collapsed": false 149 | }, 150 | "outputs": [], 151 | "source": [ 152 | "_ = plt.plot(x, y, marker='o')" 153 | ] 154 | }, 155 | { 156 | "cell_type": "code", 157 | "execution_count": null, 158 | "metadata": { 159 | "collapsed": false 160 | }, 161 | "outputs": [], 162 | "source": [ 163 | "X = x[:, np.newaxis] # The input data for sklearn is 2D: (samples == 3 x features == 1)\n", 164 | "X" 165 | ] 166 | }, 167 | { 168 | "cell_type": "code", 169 | "execution_count": null, 170 | "metadata": { 171 | "collapsed": false 172 | }, 173 | "outputs": [], 174 | "source": [ 175 | "model.fit(X, y) \n", 176 | "model.coef_" 177 | ] 178 | }, 179 | { 180 | "cell_type": "markdown", 181 | "metadata": {}, 182 | "source": [ 183 | "### Supervised Learning: Classification and regression" 184 | ] 185 | }, 186 | { 187 | "cell_type": "markdown", 188 | "metadata": {}, 189 | "source": [ 190 | "In **Supervised Learning**, we have a dataset consisting of both features and labels.\n", 191 | "The task is to construct an estimator which is able to predict the label of an object\n", 192 | "given the set of features. A relatively simple example is predicting the species of \n", 193 | "iris given a set of measurements of its flower. This is a relatively simple task. \n", 194 | "Some more complicated examples are:\n", 195 | "\n", 196 | "- given a multicolor image of an object through a telescope, determine\n", 197 | " whether that object is a star, a quasar, or a galaxy.\n", 198 | "- given a photograph of a person, identify the person in the photo.\n", 199 | "- given a list of movies a person has watched and their personal rating\n", 200 | " of the movie, recommend a list of movies they would like\n", 201 | " (So-called *recommender systems*: a famous example is the [Netflix Prize](http://en.wikipedia.org/wiki/Netflix_prize)).\n", 202 | "\n", 203 | "What these tasks have in common is that there is one or more unknown\n", 204 | "quantities associated with the object which needs to be determined from other\n", 205 | "observed quantities.\n", 206 | "\n", 207 | "Supervised learning is further broken down into two categories, **classification** and **regression**.\n", 208 | "In classification, the label is discrete, while in regression, the label is continuous. For example,\n", 209 | "in astronomy, the task of determining whether an object is a star, a galaxy, or a quasar is a\n", 210 | "classification problem: the label is from three distinct categories. On the other hand, we might\n", 211 | "wish to estimate the age of an object based on such observations: this would be a regression problem,\n", 212 | "because the label (age) is a continuous quantity." 
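The distinction can be made concrete with two tiny examples. The sketch below is not from the original notebook and uses synthetic data from `make_blobs` and `make_regression` purely for illustration: the classifier predicts discrete labels, while the regressor predicts continuous values:

```python
from sklearn.datasets import make_blobs, make_regression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LinearRegression

# Classification: the target is one of a few discrete classes
Xc, yc = make_blobs(n_samples=100, centers=3, random_state=0)
clf = KNeighborsClassifier(n_neighbors=5).fit(Xc, yc)
print(clf.predict(Xc[:5]))  # discrete labels such as [0 2 1 ...]

# Regression: the target is a continuous quantity
Xr, yr = make_regression(n_samples=100, n_features=1, noise=10, random_state=0)
reg = LinearRegression().fit(Xr, yr)
print(reg.predict(Xr[:5]))  # real-valued predictions
```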
213 | ] 214 | }, 215 | { 216 | "cell_type": "markdown", 217 | "metadata": {}, 218 | "source": [ 219 | "**Classification**: K nearest neighbors (kNN) is one of the simplest learning strategies: given a new, unknown observation, look up in your reference database which ones have the closest features and assign the predominant class.\n", 220 | "\n", 221 | "Let's try it out on our iris classification problem:" 222 | ] 223 | }, 224 | { 225 | "cell_type": "code", 226 | "execution_count": null, 227 | "metadata": { 228 | "collapsed": false 229 | }, 230 | "outputs": [], 231 | "source": [ 232 | "from sklearn import neighbors, datasets\n", 233 | "iris = datasets.load_iris()\n", 234 | "X, y = iris.data, iris.target\n", 235 | "knn = neighbors.KNeighborsClassifier(n_neighbors=1)\n", 236 | "knn.fit(X, y)\n", 237 | "# What kind of iris has 3cm x 5cm sepal and 4cm x 2cm petal?\n", 238 | "print(iris.target_names[knn.predict([[3, 5, 4, 2]])])" 239 | ] 240 | }, 241 | { 242 | "cell_type": "code", 243 | "execution_count": null, 244 | "metadata": { 245 | "collapsed": false 246 | }, 247 | "outputs": [], 248 | "source": [ 249 | "# A plot of the sepal space and the prediction of the KNN\n", 250 | "from helpers import plot_iris_knn\n", 251 | "plot_iris_knn()" 252 | ] 253 | }, 254 | { 255 | "cell_type": "markdown", 256 | "metadata": {}, 257 | "source": [ 258 | "**Exercise**: Now use as an estimator on the same problem: sklearn.svm.SVC.\n", 259 | "\n", 260 | "> Note that you don't have to know what it is to use it.\n", 261 | "\n", 262 | "> If you finish early, do the same plot as above." 263 | ] 264 | }, 265 | { 266 | "cell_type": "markdown", 267 | "metadata": {}, 268 | "source": [ 269 | "**Regression**: The simplest possible regression setting is the linear regression one:" 270 | ] 271 | }, 272 | { 273 | "cell_type": "code", 274 | "execution_count": null, 275 | "metadata": { 276 | "collapsed": false 277 | }, 278 | "outputs": [], 279 | "source": [ 280 | "# Create some simple data\n", 281 | "import numpy as np\n", 282 | "np.random.seed(0)\n", 283 | "X = np.random.random(size=(20, 1))\n", 284 | "y = 3 * X[:, 0] + 2 + np.random.normal(size=20)\n", 285 | "\n", 286 | "# Fit a linear regression to it\n", 287 | "from sklearn.linear_model import LinearRegression\n", 288 | "model = LinearRegression(fit_intercept=True)\n", 289 | "model.fit(X, y)\n", 290 | "print(\"Model coefficient: %.5f, and intercept: %.5f\"\n", 291 | " % (model.coef_, model.intercept_))\n", 292 | "\n", 293 | "# Plot the data and the model prediction\n", 294 | "X_test = np.linspace(0, 1, 100)[:, np.newaxis]\n", 295 | "y_test = model.predict(X_test)\n", 296 | "plt.plot(X[:, 0], y, 'o')\n", 297 | "plt.plot(X_test[:, 0], y_test)" 298 | ] 299 | }, 300 | { 301 | "cell_type": "markdown", 302 | "metadata": {}, 303 | "source": [ 304 | "### A recap on Scikit-learn's estimator interface" 305 | ] 306 | }, 307 | { 308 | "cell_type": "markdown", 309 | "metadata": {}, 310 | "source": [ 311 | "Scikit-learn strives to have a uniform interface across all methods,\n", 312 | "and we’ll see examples of these below. Given a scikit-learn *estimator*\n", 313 | "object named `model`, the following methods are available:\n", 314 | "\n", 315 | "- Available in **all Estimators**\n", 316 | " + `model.fit()` : fit training data. For supervised learning applications,\n", 317 | " this accepts two arguments: the data `X` and the labels `y` (e.g. `model.fit(X, y)`).\n", 318 | " For unsupervised learning applications, this accepts only a single argument,\n", 319 | " the data `X` (e.g. 
`model.fit(X)`).\n", 320 | "- Available in **supervised estimators**\n", 321 | " + `model.predict()` : given a trained model, predict the label of a new set of data.\n", 322 | " This method accepts one argument, the new data `X_new` (e.g. `model.predict(X_new)`),\n", 323 | " and returns the learned label for each object in the array.\n", 324 | " + `model.predict_proba()` : For classification problems, some estimators also provide\n", 325 | " this method, which returns the probability that a new observation has each categorical label.\n", 326 | " In this case, the label with the highest probability is returned by `model.predict()`.\n", 327 | " + `model.score()` : for classification or regression problems, most (all?) estimators implement\n", 328 | " a score method. Scores are between 0 and 1, with a larger score indicating a better fit.\n", 329 | "- Available in **unsupervised estimators**\n", 330 | " + `model.transform()` : given an unsupervised model, transform new data into the new basis.\n", 331 | " This also accepts one argument `X_new`, and returns the new representation of the data based\n", 332 | " on the unsupervised model.\n", 333 | " + `model.fit_transform()` : some estimators implement this method,\n", 334 | " which more efficiently performs a fit and a transform on the same input data." 335 | ] 336 | }, 337 | { 338 | "cell_type": "markdown", 339 | "metadata": {}, 340 | "source": [ 341 | "### Regularization: what it is and why it is necessary" 342 | ] 343 | }, 344 | { 345 | "cell_type": "markdown", 346 | "metadata": {}, 347 | "source": [ 348 | "**Train errors** Suppose you are using a 1-nearest neighbor estimator. How many errors do you expect on your train set?\n", 349 | "\n", 350 | "This tells us that:\n", 351 | "- Train set error is not a good measurement of prediction performance. You need to leave out a test set.\n", 352 | "- In general, we should accept errors on the train set." 353 | ] 354 | }, 355 | { 356 | "cell_type": "markdown", 357 | "metadata": {}, 358 | "source": [ 359 | "**An example of regularization** The core idea behind regularization is that we are going to prefer models that are simpler, for a certain definition of ''simpler'', even if they lead to more errors on the train set.\n", 360 | "\n", 361 | "As an example, let's generate with a 9th order polynomial." 362 | ] 363 | }, 364 | { 365 | "cell_type": "code", 366 | "execution_count": null, 367 | "metadata": { 368 | "collapsed": false 369 | }, 370 | "outputs": [], 371 | "source": [ 372 | "rng = np.random.RandomState(0)\n", 373 | "x = 2 * rng.rand(100) - 1\n", 374 | "\n", 375 | "f = lambda t: 1.2 * t ** 2 + .1 * t ** 3 - .4 * t ** 5 - .5 * t ** 9\n", 376 | "y = f(x) + .4 * rng.normal(size=100)\n", 377 | "\n", 378 | "plt.figure()\n", 379 | "plt.scatter(x, y, s=4)" 380 | ] 381 | }, 382 | { 383 | "cell_type": "markdown", 384 | "metadata": {}, 385 | "source": [ 386 | "And now, let's fit a 4th order and a 9th order polynomial to the data. 
For this we need to engineer features: the n_th powers of x:" 387 | ] 388 | }, 389 | { 390 | "cell_type": "code", 391 | "execution_count": null, 392 | "metadata": { 393 | "collapsed": false 394 | }, 395 | "outputs": [], 396 | "source": [ 397 | "x_test = np.linspace(-1, 1, 100)\n", 398 | "\n", 399 | "plt.figure()\n", 400 | "plt.scatter(x, y, s=4)\n", 401 | "\n", 402 | "X = np.array([x**i for i in range(5)]).T\n", 403 | "X_test = np.array([x_test**i for i in range(5)]).T\n", 404 | "order4 = LinearRegression()\n", 405 | "order4.fit(X, y)\n", 406 | "plt.plot(x_test, order4.predict(X_test), label='4th order')\n", 407 | "\n", 408 | "X = np.array([x**i for i in range(10)]).T\n", 409 | "X_test = np.array([x_test**i for i in range(10)]).T\n", 410 | "order9 = LinearRegression()\n", 411 | "order9.fit(X, y)\n", 412 | "plt.plot(x_test, order9.predict(X_test), label='9th order')\n", 413 | "\n", 414 | "plt.legend(loc='best')\n", 415 | "plt.axis('tight')\n", 416 | "plt.title('Fitting a 4th and a 9th order polynomial')" 417 | ] 418 | }, 419 | { 420 | "cell_type": "markdown", 421 | "metadata": {}, 422 | "source": [ 423 | "With your naked eyes, which model do you prefer, the 4th order one, or the 9th order one?\n", 424 | "\n", 425 | "Let's look at the ground truth:" 426 | ] 427 | }, 428 | { 429 | "cell_type": "code", 430 | "execution_count": null, 431 | "metadata": { 432 | "collapsed": false 433 | }, 434 | "outputs": [], 435 | "source": [ 436 | "plt.figure()\n", 437 | "plt.scatter(x, y, s=4)\n", 438 | "plt.plot(x_test, f(x_test), label=\"truth\")\n", 439 | "plt.axis('tight')\n", 440 | "plt.title('Ground truth (9th order polynomial)')" 441 | ] 442 | }, 443 | { 444 | "cell_type": "markdown", 445 | "metadata": {}, 446 | "source": [ 447 | "Regularization is ubiquitous in machine learning. Most scikit-learn estimators have a parameter to tune the amount of regularization. For instance, with k-NN, it is 'k', the number of nearest neighbors used to make the decision. k=1 amounts to no regularization: 0 error on the training set, whereas large k will push toward smoother decision boundaries in the feature space." 
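This effect is easy to check numerically. The following sketch is not part of the original notebook; it compares a 1-nearest-neighbor model with a more regularized k=15 model on a held-out split of the iris data (`train_test_split` is in `sklearn.model_selection` in recent releases, `sklearn.cross_validation` in older ones):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split  # sklearn.cross_validation in old releases
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, random_state=0)

for k in (1, 15):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    print("k=%2d  train accuracy: %.2f  test accuracy: %.2f"
          % (k, knn.score(X_train, y_train), knn.score(X_test, y_test)))

# k=1 scores perfectly on the training set but may generalize worse than a larger k
```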
448 | ] 449 | }, 450 | { 451 | "cell_type": "markdown", 452 | "metadata": {}, 453 | "source": [ 454 | "### Exercise: Interactive Demo on linearly separable data" 455 | ] 456 | }, 457 | { 458 | "cell_type": "markdown", 459 | "metadata": {}, 460 | "source": [ 461 | "Run the **svm_gui.py** file from http://scikit-learn.org/stable/_downloads/svm_gui.py" 462 | ] 463 | }, 464 | { 465 | "cell_type": "markdown", 466 | "metadata": {}, 467 | "source": [ 468 | "\n", 469 | "\n", 470 | "______\n", 471 | "\n", 472 | "\n" 473 | ] 474 | } 475 | ], 476 | "metadata": { 477 | "kernelspec": { 478 | "display_name": "Python 2", 479 | "language": "python", 480 | "name": "python2" 481 | }, 482 | "language_info": { 483 | "codemirror_mode": { 484 | "name": "ipython", 485 | "version": 2 486 | }, 487 | "file_extension": ".py", 488 | "mimetype": "text/x-python", 489 | "name": "python", 490 | "nbconvert_exporter": "python", 491 | "pygments_lexer": "ipython2", 492 | "version": "2.7.11" 493 | } 494 | }, 495 | "nbformat": 4, 496 | "nbformat_minor": 0 497 | } 498 | -------------------------------------------------------------------------------- /notebooks/04_unsupervised_dimreduction.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Unsupervised Learning: Dimensionality Reduction and Visualization" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "Unsupervised learning is interested in situations in which X is available, but not y: data without labels.\n", 15 | "\n", 16 | "A typical use case is to find hiden structure in the data.\n", 17 | "\n", 18 | "Previously we worked on visualizing the iris data by plotting\n", 19 | "pairs of dimensions by trial and error, until we arrived at\n", 20 | "the best pair of dimensions for our dataset. Here we will\n", 21 | "use an unsupervised *dimensionality reduction* algorithm\n", 22 | "to accomplish this more automatically." 
23 | ] 24 | }, 25 | { 26 | "cell_type": "markdown", 27 | "metadata": {}, 28 | "source": [ 29 | "By the end of this section you will\n", 30 | "\n", 31 | "- Know how to instantiate and train an unsupervised dimensionality reduction algorithm:\n", 32 | " Principal Component Analysis (PCA)\n", 33 | "- Know how to use PCA to visualize high-dimensional data" 34 | ] 35 | }, 36 | { 37 | "cell_type": "markdown", 38 | "metadata": {}, 39 | "source": [ 40 | "## Dimensionality Reduction: PCA" 41 | ] 42 | }, 43 | { 44 | "cell_type": "markdown", 45 | "metadata": {}, 46 | "source": [ 47 | "Dimensionality reduction is the task of deriving a set of new\n", 48 | "artificial features that is smaller than the original feature\n", 49 | "set while retaining most of the variance of the original data.\n", 50 | "Here we'll use a common but powerful dimensionality reduction\n", 51 | "technique called Principal Component Analysis (PCA).\n", 52 | "We'll perform PCA on the iris dataset that we saw before:" 53 | ] 54 | }, 55 | { 56 | "cell_type": "code", 57 | "execution_count": null, 58 | "metadata": { 59 | "collapsed": false 60 | }, 61 | "outputs": [], 62 | "source": [ 63 | "from sklearn.datasets import load_iris\n", 64 | "iris = load_iris()\n", 65 | "X = iris.data\n", 66 | "y = iris.target" 67 | ] 68 | }, 69 | { 70 | "cell_type": "markdown", 71 | "metadata": {}, 72 | "source": [ 73 | "PCA is performed using linear combinations of the original features\n", 74 | "using a truncated Singular Value Decomposition of the matrix X so\n", 75 | "as to project the data onto a base of the top singular vectors.\n", 76 | "If the number of retained components is 2 or 3, PCA can be used\n", 77 | "to visualize the dataset." 78 | ] 79 | }, 80 | { 81 | "cell_type": "code", 82 | "execution_count": null, 83 | "metadata": { 84 | "collapsed": false 85 | }, 86 | "outputs": [], 87 | "source": [ 88 | "from sklearn.decomposition import PCA\n", 89 | "pca = PCA(n_components=2, whiten=True)\n", 90 | "pca.fit(X)" 91 | ] 92 | }, 93 | { 94 | "cell_type": "markdown", 95 | "metadata": {}, 96 | "source": [ 97 | "Once fitted, the pca model exposes the singular vectors in the components_ attribute:" 98 | ] 99 | }, 100 | { 101 | "cell_type": "code", 102 | "execution_count": null, 103 | "metadata": { 104 | "collapsed": false 105 | }, 106 | "outputs": [], 107 | "source": [ 108 | "pca.components_" 109 | ] 110 | }, 111 | { 112 | "cell_type": "markdown", 113 | "metadata": {}, 114 | "source": [ 115 | "Other attributes are available as well:" 116 | ] 117 | }, 118 | { 119 | "cell_type": "code", 120 | "execution_count": null, 121 | "metadata": { 122 | "collapsed": false 123 | }, 124 | "outputs": [], 125 | "source": [ 126 | "pca.explained_variance_ratio_" 127 | ] 128 | }, 129 | { 130 | "cell_type": "code", 131 | "execution_count": null, 132 | "metadata": { 133 | "collapsed": false 134 | }, 135 | "outputs": [], 136 | "source": [ 137 | "pca.explained_variance_ratio_.sum()" 138 | ] 139 | }, 140 | { 141 | "cell_type": "markdown", 142 | "metadata": {}, 143 | "source": [ 144 | "Let us project the iris dataset along those first two dimensions:" 145 | ] 146 | }, 147 | { 148 | "cell_type": "code", 149 | "execution_count": null, 150 | "metadata": { 151 | "collapsed": false 152 | }, 153 | "outputs": [], 154 | "source": [ 155 | "X_pca = pca.transform(X)" 156 | ] 157 | }, 158 | { 159 | "cell_type": "markdown", 160 | "metadata": {}, 161 | "source": [ 162 | "PCA `normalizes` and `whitens` the data, which means that the data\n", 163 | "is now centered on both components with unit 
variance:" 164 | ] 165 | }, 166 | { 167 | "cell_type": "code", 168 | "execution_count": null, 169 | "metadata": { 170 | "collapsed": false 171 | }, 172 | "outputs": [], 173 | "source": [ 174 | "import numpy as np\n", 175 | "np.round(X_pca.mean(axis=0), decimals=5)" 176 | ] 177 | }, 178 | { 179 | "cell_type": "code", 180 | "execution_count": null, 181 | "metadata": { 182 | "collapsed": false 183 | }, 184 | "outputs": [], 185 | "source": [ 186 | "np.round(X_pca.std(axis=0), decimals=5)" 187 | ] 188 | }, 189 | { 190 | "cell_type": "markdown", 191 | "metadata": {}, 192 | "source": [ 193 | "Furthermore, the samples components do no longer carry any linear correlation:" 194 | ] 195 | }, 196 | { 197 | "cell_type": "code", 198 | "execution_count": null, 199 | "metadata": { 200 | "collapsed": false 201 | }, 202 | "outputs": [], 203 | "source": [ 204 | "np.corrcoef(X_pca.T)" 205 | ] 206 | }, 207 | { 208 | "cell_type": "markdown", 209 | "metadata": {}, 210 | "source": [ 211 | "We can visualize the projection using pylab, but first\n", 212 | "let's make sure our ipython notebook is in pylab inline mode" 213 | ] 214 | }, 215 | { 216 | "cell_type": "code", 217 | "execution_count": null, 218 | "metadata": { 219 | "collapsed": false 220 | }, 221 | "outputs": [], 222 | "source": [ 223 | "%pylab inline" 224 | ] 225 | }, 226 | { 227 | "cell_type": "markdown", 228 | "metadata": {}, 229 | "source": [ 230 | "Now we can visualize the results using the following utility function:" 231 | ] 232 | }, 233 | { 234 | "cell_type": "code", 235 | "execution_count": null, 236 | "metadata": { 237 | "collapsed": false 238 | }, 239 | "outputs": [], 240 | "source": [ 241 | "import matplotlib.pyplot as plt\n", 242 | "from itertools import cycle\n", 243 | "\n", 244 | "def plot_PCA_2D(data, target, target_names):\n", 245 | " colors = cycle('rgbcmykw')\n", 246 | " target_ids = range(len(target_names))\n", 247 | " plt.figure()\n", 248 | " for i, c, label in zip(target_ids, colors, target_names):\n", 249 | " plt.scatter(data[target == i, 0], data[target == i, 1],\n", 250 | " c=c, label=label)\n", 251 | " plt.legend()" 252 | ] 253 | }, 254 | { 255 | "cell_type": "markdown", 256 | "metadata": {}, 257 | "source": [ 258 | "Now calling this function for our data, we see the plot:" 259 | ] 260 | }, 261 | { 262 | "cell_type": "code", 263 | "execution_count": null, 264 | "metadata": { 265 | "collapsed": false 266 | }, 267 | "outputs": [], 268 | "source": [ 269 | "plot_PCA_2D(X_pca, iris.target, iris.target_names)" 270 | ] 271 | }, 272 | { 273 | "cell_type": "markdown", 274 | "metadata": {}, 275 | "source": [ 276 | "Note that this projection was determined *without* any information about the\n", 277 | "labels (represented by the colors): this is the sense in which the learning\n", 278 | "is **unsupervised**. Nevertheless, we see that the projection gives us insight\n", 279 | "into the distribution of the different flowers in parameter space: notably,\n", 280 | "*iris setosa* is much more distinct than the other two species." 
281 | ] 282 | }, 283 | { 284 | "cell_type": "markdown", 285 | "metadata": {}, 286 | "source": [ 287 | "Note also that the default implementation of PCA computes the\n", 288 | "singular value decomposition (SVD) of the full\n", 289 | "data matrix, which is not scalable when both ``n_samples`` and\n", 290 | "``n_features`` are big (more that a few thousands).\n", 291 | "If you are interested in a number of components that is much\n", 292 | "smaller than both ``n_samples`` and ``n_features``, consider using\n", 293 | "`sklearn.decomposition.RandomizedPCA` instead." 294 | ] 295 | }, 296 | { 297 | "cell_type": "markdown", 298 | "metadata": {}, 299 | "source": [ 300 | "Other dimensionality reduction techniques which are useful to know about:\n", 301 | "\n", 302 | "- [sklearn.decomposition.PCA](http://scikit-learn.org/0.13/modules/generated/sklearn.decomposition.PCA.html): \n", 303 | " Principal Component Analysis\n", 304 | "- [sklearn.decomposition.RandomizedPCA](http://scikit-learn.org/0.13/modules/generated/sklearn.decomposition.RandomizedPCA.html):\n", 305 | " fast non-exact PCA implementation based on a randomized algorithm\n", 306 | "- [sklearn.decomposition.SparsePCA](http://scikit-learn.org/0.13/modules/generated/sklearn.decomposition.SparsePCA.html):\n", 307 | " PCA variant including L1 penalty for sparsity\n", 308 | "- [sklearn.decomposition.FastICA](http://scikit-learn.org/0.13/modules/generated/sklearn.decomposition.FastICA.html):\n", 309 | " Independent Component Analysis\n", 310 | "- [sklearn.decomposition.NMF](http://scikit-learn.org/0.13/modules/generated/sklearn.decomposition.NMF.html):\n", 311 | " non-negative matrix factorization\n", 312 | "- [sklearn.manifold.LocallyLinearEmbedding](http://scikit-learn.org/0.13/modules/generated/sklearn.manifold.LocallyLinearEmbedding.html):\n", 313 | " nonlinear manifold learning technique based on local neighborhood geometry\n", 314 | "- [sklearn.manifold.IsoMap](http://scikit-learn.org/0.13/modules/generated/sklearn.manifold.Isomap.html):\n", 315 | " nonlinear manifold learning technique based on a sparse graph algorithm" 316 | ] 317 | }, 318 | { 319 | "cell_type": "markdown", 320 | "metadata": {}, 321 | "source": [ 322 | "## Manifold Learning" 323 | ] 324 | }, 325 | { 326 | "cell_type": "markdown", 327 | "metadata": {}, 328 | "source": [ 329 | "One weakness of PCA is that it cannot detect non-linear features. A set\n", 330 | "of algorithms known as *Manifold Learning* have been developed to address\n", 331 | "this deficiency. 
A canonical dataset used in Manifold learning is the\n", 332 | "*S-curve*, which we briefly saw in an earlier section:" 333 | ] 334 | }, 335 | { 336 | "cell_type": "code", 337 | "execution_count": null, 338 | "metadata": { 339 | "collapsed": false 340 | }, 341 | "outputs": [], 342 | "source": [ 343 | "from sklearn.datasets import make_s_curve\n", 344 | "X, y = make_s_curve(n_samples=1000)\n", 345 | "\n", 346 | "from mpl_toolkits.mplot3d import Axes3D\n", 347 | "ax = plt.axes(projection='3d')\n", 348 | "\n", 349 | "ax.scatter3D(X[:, 0], X[:, 1], X[:, 2], c=y)\n", 350 | "ax.view_init(10, -60)" 351 | ] 352 | }, 353 | { 354 | "cell_type": "markdown", 355 | "metadata": {}, 356 | "source": [ 357 | "This is a 2-dimensional dataset embedded in three dimensions, but it is embedded\n", 358 | "in such a way that PCA cannot discover the underlying data orientation:" 359 | ] 360 | }, 361 | { 362 | "cell_type": "code", 363 | "execution_count": null, 364 | "metadata": { 365 | "collapsed": false 366 | }, 367 | "outputs": [], 368 | "source": [ 369 | "X_pca = PCA(n_components=2).fit_transform(X)\n", 370 | "plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y)" 371 | ] 372 | }, 373 | { 374 | "cell_type": "markdown", 375 | "metadata": {}, 376 | "source": [ 377 | "Manifold learning algorithms, however, available in the ``sklearn.manifold``\n", 378 | "submodule, are able to recover the underlying 2-dimensional manifold:" 379 | ] 380 | }, 381 | { 382 | "cell_type": "code", 383 | "execution_count": null, 384 | "metadata": { 385 | "collapsed": false 386 | }, 387 | "outputs": [], 388 | "source": [ 389 | "from sklearn.manifold import LocallyLinearEmbedding, Isomap\n", 390 | "lle = LocallyLinearEmbedding(n_neighbors=15, n_components=2, method='modified')\n", 391 | "X_lle = lle.fit_transform(X)\n", 392 | "plt.scatter(X_lle[:, 0], X_lle[:, 1], c=y)" 393 | ] 394 | }, 395 | { 396 | "cell_type": "code", 397 | "execution_count": null, 398 | "metadata": { 399 | "collapsed": false 400 | }, 401 | "outputs": [], 402 | "source": [ 403 | "iso = Isomap(n_neighbors=15, n_components=2)\n", 404 | "X_iso = iso.fit_transform(X)\n", 405 | "plt.scatter(X_iso[:, 0], X_iso[:, 1], c=y)" 406 | ] 407 | }, 408 | { 409 | "cell_type": "markdown", 410 | "metadata": {}, 411 | "source": [ 412 | "## Exercise: Dimension reduction of digits" 413 | ] 414 | }, 415 | { 416 | "cell_type": "markdown", 417 | "metadata": {}, 418 | "source": [ 419 | "Apply PCA, LocallyLinearEmbedding, and Isomap to project the data to two dimensions.\n", 420 | "Which visualization technique separates the classes most cleanly?" 421 | ] 422 | }, 423 | { 424 | "cell_type": "code", 425 | "execution_count": null, 426 | "metadata": { 427 | "collapsed": false 428 | }, 429 | "outputs": [], 430 | "source": [ 431 | "from sklearn.datasets import load_digits\n", 432 | "digits = load_digits()\n", 433 | "# ..." 
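One possible way to tackle this exercise is sketched below; this is only an illustrative outline (it is not the contents of `solutions/08A_digits_projection.py`), reusing the same estimator settings seen earlier in this notebook:

```python
# Illustrative sketch of the exercise: project the 64-dimensional digits data
# to 2D with each technique and color the points by digit label to compare
# how cleanly the ten classes separate.
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.manifold import LocallyLinearEmbedding, Isomap

digits = load_digits()
X, y = digits.data, digits.target

models = [
    PCA(n_components=2),
    LocallyLinearEmbedding(n_neighbors=15, n_components=2, method='modified'),
    Isomap(n_neighbors=15, n_components=2),
]

for model in models:
    X_2d = model.fit_transform(X)
    plt.figure()
    plt.scatter(X_2d[:, 0], X_2d[:, 1], c=y)
    plt.colorbar(label='digit')
    plt.title(type(model).__name__)
```

Judging which of the three projections separates the digit classes most cleanly is exactly the point of the exercise.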
434 | ] 435 | }, 436 | { 437 | "cell_type": "code", 438 | "execution_count": null, 439 | "metadata": { 440 | "collapsed": false 441 | }, 442 | "outputs": [], 443 | "source": [] 444 | }, 445 | { 446 | "cell_type": "markdown", 447 | "metadata": {}, 448 | "source": [ 449 | "### Solution:" 450 | ] 451 | }, 452 | { 453 | "cell_type": "code", 454 | "execution_count": null, 455 | "metadata": { 456 | "collapsed": false 457 | }, 458 | "outputs": [], 459 | "source": [ 460 | "%load solutions/08A_digits_projection.py" 461 | ] 462 | } 463 | ], 464 | "metadata": { 465 | "kernelspec": { 466 | "display_name": "Python 2", 467 | "language": "python", 468 | "name": "python2" 469 | }, 470 | "language_info": { 471 | "codemirror_mode": { 472 | "name": "ipython", 473 | "version": 2 474 | }, 475 | "file_extension": ".py", 476 | "mimetype": "text/x-python", 477 | "name": "python", 478 | "nbconvert_exporter": "python", 479 | "pygments_lexer": "ipython2", 480 | "version": "2.7.11" 481 | } 482 | }, 483 | "nbformat": 4, 484 | "nbformat_minor": 0 485 | } 486 | -------------------------------------------------------------------------------- /notebooks/05_measuring_prediction_performance.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Measuring prediction performance" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "Here we will discuss how to use **validation sets** to get a better measure of\n", 15 | "performance for a classifier." 16 | ] 17 | }, 18 | { 19 | "cell_type": "markdown", 20 | "metadata": {}, 21 | "source": [ 22 | "## Using the K-neighbors classifier" 23 | ] 24 | }, 25 | { 26 | "cell_type": "markdown", 27 | "metadata": {}, 28 | "source": [ 29 | "Here we'll continue to look at the digits data, but we'll switch to the\n", 30 | "K-Neighbors classifier. The K-neighbors classifier is an instance-based\n", 31 | "classifier. The K-neighbors classifier predicts the label of\n", 32 | "an unknown point based on the labels of the *K* nearest points in the\n", 33 | "parameter space." 34 | ] 35 | }, 36 | { 37 | "cell_type": "code", 38 | "execution_count": null, 39 | "metadata": { 40 | "collapsed": false 41 | }, 42 | "outputs": [], 43 | "source": [ 44 | "# Get the data\n", 45 | "from sklearn.datasets import load_digits\n", 46 | "digits = load_digits()\n", 47 | "X = digits.data\n", 48 | "y = digits.target" 49 | ] 50 | }, 51 | { 52 | "cell_type": "code", 53 | "execution_count": null, 54 | "metadata": { 55 | "collapsed": false 56 | }, 57 | "outputs": [], 58 | "source": [ 59 | "# Instantiate and train the classifier\n", 60 | "from sklearn.neighbors import KNeighborsClassifier\n", 61 | "clf = KNeighborsClassifier(n_neighbors=1)\n", 62 | "clf.fit(X, y)" 63 | ] 64 | }, 65 | { 66 | "cell_type": "code", 67 | "execution_count": null, 68 | "metadata": { 69 | "collapsed": false 70 | }, 71 | "outputs": [], 72 | "source": [ 73 | "# Check the results using metrics\n", 74 | "from sklearn import metrics\n", 75 | "y_pred = clf.predict(X)" 76 | ] 77 | }, 78 | { 79 | "cell_type": "code", 80 | "execution_count": null, 81 | "metadata": { 82 | "collapsed": false 83 | }, 84 | "outputs": [], 85 | "source": [ 86 | "print(metrics.confusion_matrix(y_pred, y))" 87 | ] 88 | }, 89 | { 90 | "cell_type": "markdown", 91 | "metadata": {}, 92 | "source": [ 93 | "Apparently, we've found a perfect classifier! 
But this is misleading\n", 94 | "for the reasons we saw before: the classifier essentially \"memorizes\"\n", 95 | "all the samples it has already seen. To really test how well this\n", 96 | "algorithm does, we need to try some samples it *hasn't* yet seen.\n", 97 | "\n", 98 | "This problem can also occur with regression models. In the following we fit an other instance-based model named \"decision tree\" to the Boston Housing price dataset we introduced previously:" 99 | ] 100 | }, 101 | { 102 | "cell_type": "code", 103 | "execution_count": null, 104 | "metadata": { 105 | "collapsed": false 106 | }, 107 | "outputs": [], 108 | "source": [ 109 | "%matplotlib inline\n", 110 | "from matplotlib import pyplot as plt\n", 111 | "import numpy as np" 112 | ] 113 | }, 114 | { 115 | "cell_type": "code", 116 | "execution_count": null, 117 | "metadata": { 118 | "collapsed": false 119 | }, 120 | "outputs": [], 121 | "source": [ 122 | "from sklearn.datasets import load_boston\n", 123 | "from sklearn.tree import DecisionTreeRegressor\n", 124 | "\n", 125 | "data = load_boston()\n", 126 | "clf = DecisionTreeRegressor().fit(data.data, data.target)\n", 127 | "predicted = clf.predict(data.data)\n", 128 | "expected = data.target\n", 129 | "\n", 130 | "plt.scatter(expected, predicted)\n", 131 | "plt.plot([0, 50], [0, 50], '--k')\n", 132 | "plt.axis('tight')\n", 133 | "plt.xlabel('True price ($1000s)')\n", 134 | "plt.ylabel('Predicted price ($1000s)')" 135 | ] 136 | }, 137 | { 138 | "cell_type": "markdown", 139 | "metadata": {}, 140 | "source": [ 141 | "Here again the predictions are seemingly perfect as the model was able to perfectly memorize the training set." 142 | ] 143 | }, 144 | { 145 | "cell_type": "markdown", 146 | "metadata": {}, 147 | "source": [ 148 | "## A Better Approach: Using a validation set" 149 | ] 150 | }, 151 | { 152 | "cell_type": "markdown", 153 | "metadata": {}, 154 | "source": [ 155 | "Learning the parameters of a prediction function and testing it on the\n", 156 | "same data is a methodological mistake: a model that would just repeat\n", 157 | "the labels of the samples that it has just seen would have a perfect\n", 158 | "score but would fail to predict anything useful on yet-unseen data.\n", 159 | "\n", 160 | "To avoid over-fitting, we have to define two different sets:\n", 161 | "\n", 162 | "- a training set X_train, y_train which is used for learning the parameters of a predictive model\n", 163 | "- a testing set X_test, y_test which is used for evaluating the fitted predictive model\n", 164 | "\n", 165 | "In scikit-learn such a random split can be quickly computed with the\n", 166 | "`train_test_split` helper function. 
It can be used this way:" 167 | ] 168 | }, 169 | { 170 | "cell_type": "code", 171 | "execution_count": null, 172 | "metadata": { 173 | "collapsed": false 174 | }, 175 | "outputs": [], 176 | "source": [ 177 | "from sklearn.model_selection import train_test_split\n", 178 | "X = digits.data\n", 179 | "y = digits.target\n", 180 | "\n", 181 | "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)\n", 182 | "\n", 183 | "print(\"%r, %r, %r\" % (X.shape, X_train.shape, X_test.shape))" 184 | ] 185 | }, 186 | { 187 | "cell_type": "markdown", 188 | "metadata": {}, 189 | "source": [ 190 | "Now we train on the training data, and test on the testing data:" 191 | ] 192 | }, 193 | { 194 | "cell_type": "code", 195 | "execution_count": null, 196 | "metadata": { 197 | "collapsed": false 198 | }, 199 | "outputs": [], 200 | "source": [ 201 | "clf = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)\n", 202 | "y_pred = clf.predict(X_test)" 203 | ] 204 | }, 205 | { 206 | "cell_type": "code", 207 | "execution_count": null, 208 | "metadata": { 209 | "collapsed": false 210 | }, 211 | "outputs": [], 212 | "source": [ 213 | "print(metrics.confusion_matrix(y_test, y_pred))" 214 | ] 215 | }, 216 | { 217 | "cell_type": "code", 218 | "execution_count": null, 219 | "metadata": { 220 | "collapsed": false 221 | }, 222 | "outputs": [], 223 | "source": [ 224 | "print(metrics.classification_report(y_test, y_pred))" 225 | ] 226 | }, 227 | { 228 | "cell_type": "markdown", 229 | "metadata": {}, 230 | "source": [ 231 | "The averaged f1-score is often used as a convenient measure of the\n", 232 | "overall performance of an algorithm. It appears in the bottom row\n", 233 | "of the classification report; it can also be accessed directly:" 234 | ] 235 | }, 236 | { 237 | "cell_type": "code", 238 | "execution_count": null, 239 | "metadata": { 240 | "collapsed": false 241 | }, 242 | "outputs": [], 243 | "source": [ 244 | "metrics.f1_score(y_test, y_pred, average='weighted')" 245 | ] 246 | }, 247 | { 248 | "cell_type": "markdown", 249 | "metadata": {}, 250 | "source": [ 251 | "The over-fitting we saw previously can be quantified by computing the\n", 252 | "f1-score on the training data itself:" 253 | ] 254 | }, 255 | { 256 | "cell_type": "code", 257 | "execution_count": null, 258 | "metadata": { 259 | "collapsed": false 260 | }, 261 | "outputs": [], 262 | "source": [ 263 | "metrics.f1_score(y_train, clf.predict(X_train), average='weighted')" 264 | ] 265 | }, 266 | { 267 | "cell_type": "markdown", 268 | "metadata": {}, 269 | "source": [ 270 | "### Validation with a Regression Model" 271 | ] 272 | }, 273 | { 274 | "cell_type": "markdown", 275 | "metadata": {}, 276 | "source": [ 277 | "These validation metrics also work in the case of regression models. Here we'll use\n", 278 | "a Gradient-boosted regression tree, which is a meta-estimator which makes use of the\n", 279 | "``DecisionTreeRegressor`` we showed above. 
We'll start by doing the train-test split\n", 280 | "as we did with the classification case:" 281 | ] 282 | }, 283 | { 284 | "cell_type": "code", 285 | "execution_count": null, 286 | "metadata": { 287 | "collapsed": false 288 | }, 289 | "outputs": [], 290 | "source": [ 291 | "data = load_boston()\n", 292 | "X = data.data\n", 293 | "y = data.target\n", 294 | "X_train, X_test, y_train, y_test = train_test_split(\n", 295 | " X, y, test_size=0.25, random_state=0)\n", 296 | "\n", 297 | "print(\"%r, %r, %r\" % (X.shape, X_train.shape, X_test.shape))" 298 | ] 299 | }, 300 | { 301 | "cell_type": "markdown", 302 | "metadata": {}, 303 | "source": [ 304 | "Next we'll compute the training and testing error using the Decision Tree that\n", 305 | "we saw before:" 306 | ] 307 | }, 308 | { 309 | "cell_type": "code", 310 | "execution_count": null, 311 | "metadata": { 312 | "collapsed": false 313 | }, 314 | "outputs": [], 315 | "source": [ 316 | "est = DecisionTreeRegressor().fit(X_train, y_train)\n", 317 | "\n", 318 | "validation_score = metrics.explained_variance_score(\n", 319 | " y_test, est.predict(X_test))\n", 320 | "\n", 321 | "print(\"validation: %r\" % validation_score)\n", 322 | "\n", 323 | "training_score = metrics.explained_variance_score(\n", 324 | " y_train, est.predict(X_train))\n", 325 | "\n", 326 | "print(\"training: %r\" % training_score)" 327 | ] 328 | }, 329 | { 330 | "cell_type": "markdown", 331 | "metadata": {}, 332 | "source": [ 333 | "This large spread between validation and training error is characteristic\n", 334 | "of a **high variance** model. Decision trees are not entirely useless,\n", 335 | "however: by combining many individual decision trees within ensemble\n", 336 | "estimators such as Gradient Boosted Trees or Random Forests, we can get\n", 337 | "much better performance:" 338 | ] 339 | }, 340 | { 341 | "cell_type": "code", 342 | "execution_count": null, 343 | "metadata": { 344 | "collapsed": false 345 | }, 346 | "outputs": [], 347 | "source": [ 348 | "from sklearn.ensemble import GradientBoostingRegressor\n", 349 | "est = GradientBoostingRegressor().fit(X_train, y_train)\n", 350 | "\n", 351 | "validation_score = metrics.explained_variance_score(\n", 352 | " y_test, est.predict(X_test))\n", 353 | "\n", 354 | "print(\"validation: %r\" % validation_score)\n", 355 | "\n", 356 | "training_score = metrics.explained_variance_score(\n", 357 | " y_train, est.predict(X_train))\n", 358 | "\n", 359 | "print(\"training: %r\" % training_score)" 360 | ] 361 | }, 362 | { 363 | "cell_type": "markdown", 364 | "metadata": {}, 365 | "source": [ 366 | "This model is still over-fitting the data, but not by as much as the single tree." 367 | ] 368 | }, 369 | { 370 | "cell_type": "markdown", 371 | "metadata": {}, 372 | "source": [ 373 | "## Exercise: Model Selection via Validation" 374 | ] 375 | }, 376 | { 377 | "cell_type": "markdown", 378 | "metadata": {}, 379 | "source": [ 380 | "Here we saw K-neighbors classification of the digits. We've also seen support vector\n", 381 | "machine classification of digits. Now that we have these\n", 382 | "validation tools in place, we can ask quantitatively which of the three estimators\n", 383 | "works best for the digits dataset.\n", 384 | "\n", 385 | "Take a moment and determine the answers to these questions for the digits dataset:\n", 386 | "\n", 387 | "- With the default hyper-parameters for each estimator, which gives the best f1 score\n", 388 | " on the **validation set**? 
Recall that hyperparameters are the parameters set when\n", 389 | " you instantiate the classifier: for example, the ``n_neighbors`` in\n", 390 | "\n", 391 | " clf = KNeighborsClassifier(n_neighbors=1)\n", 392 | "\n", 393 | " To use the default value, simply leave them unspecified.\n", 394 | "- For each classifier, which value for the hyperparameters gives the best results for\n", 395 | " the digits data? For ``LinearSVC``, use ``loss='l2'`` and ``loss='l1'``. For\n", 396 | " ``KNeighborsClassifier`` use ``n_neighbors`` between 1 and 10. Try also\n", 397 | " ``GaussianNB``. Note that it does not have any adjustable hyperparameters.\n", 398 | "- Bonus: do the same exercise on the Iris data rather than the Digits data. Does the\n", 399 | " same classifier/hyperparameter combination win out in this case?" 400 | ] 401 | }, 402 | { 403 | "cell_type": "code", 404 | "execution_count": null, 405 | "metadata": { 406 | "collapsed": false 407 | }, 408 | "outputs": [], 409 | "source": [ 410 | "from sklearn.svm import LinearSVC\n", 411 | "from sklearn.naive_bayes import GaussianNB\n", 412 | "from sklearn.neighbors import KNeighborsClassifier" 413 | ] 414 | }, 415 | { 416 | "cell_type": "code", 417 | "execution_count": null, 418 | "metadata": { 419 | "collapsed": false 420 | }, 421 | "outputs": [], 422 | "source": [] 423 | }, 424 | { 425 | "cell_type": "markdown", 426 | "metadata": {}, 427 | "source": [ 428 | "### Solution" 429 | ] 430 | }, 431 | { 432 | "cell_type": "code", 433 | "execution_count": null, 434 | "metadata": { 435 | "collapsed": false 436 | }, 437 | "outputs": [], 438 | "source": [ 439 | "%load solutions/05C_validation_exercise.py" 440 | ] 441 | } 442 | ], 443 | "metadata": { 444 | "kernelspec": { 445 | "display_name": "Python 2", 446 | "language": "python", 447 | "name": "python2" 448 | }, 449 | "language_info": { 450 | "codemirror_mode": { 451 | "name": "ipython", 452 | "version": 2 453 | }, 454 | "file_extension": ".py", 455 | "mimetype": "text/x-python", 456 | "name": "python", 457 | "nbconvert_exporter": "python", 458 | "pygments_lexer": "ipython2", 459 | "version": "2.7.11" 460 | } 461 | }, 462 | "nbformat": 4, 463 | "nbformat_minor": 0 464 | } 465 | -------------------------------------------------------------------------------- /notebooks/06_sklearn_pandas_and_heterogeneous_data_modeling.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Predictive Modeling with scikit-learn and pandas" 8 | ] 9 | }, 10 | { 11 | "cell_type": "code", 12 | "execution_count": null, 13 | "metadata": { 14 | "collapsed": false 15 | }, 16 | "outputs": [], 17 | "source": [ 18 | "%matplotlib inline\n", 19 | "import matplotlib.pyplot as plt\n", 20 | "import numpy as np\n", 21 | "import pandas as pd\n", 22 | "\n", 23 | "import warnings\n", 24 | "warnings.simplefilter('ignore', DeprecationWarning)" 25 | ] 26 | }, 27 | { 28 | "cell_type": "markdown", 29 | "metadata": {}, 30 | "source": [ 31 | "" 32 | ] 33 | }, 34 | { 35 | "cell_type": "markdown", 36 | "metadata": {}, 37 | "source": [ 38 | "## Loading tabular data from the Titanic kaggle challenge in a pandas Data Frame" 39 | ] 40 | }, 41 | { 42 | "cell_type": "markdown", 43 | "metadata": {}, 44 | "source": [ 45 | "Let us have a look at the Titanic dataset from the Kaggle Getting Started challenge at:\n", 46 | "\n", 47 | "https://www.kaggle.com/c/titanic-gettingStarted\n", 48 | "\n", 49 | "We can load the CSV file as a pandas 
data frame in one line:" 50 | ] 51 | }, 52 | { 53 | "cell_type": "code", 54 | "execution_count": null, 55 | "metadata": { 56 | "collapsed": false 57 | }, 58 | "outputs": [], 59 | "source": [ 60 | "#!curl -s https://dl.dropboxusercontent.com/u/2140486/data/titanic_train.csv | head -5\n", 61 | "!head -5 titanic_train.csv" 62 | ] 63 | }, 64 | { 65 | "cell_type": "code", 66 | "execution_count": null, 67 | "metadata": { 68 | "collapsed": false 69 | }, 70 | "outputs": [], 71 | "source": [ 72 | "#data = pd.read_csv('https://dl.dropboxusercontent.com/u/2140486/data/titanic_train.csv')\n", 73 | "data = pd.read_csv('titanic_train.csv')" 74 | ] 75 | }, 76 | { 77 | "cell_type": "markdown", 78 | "metadata": {}, 79 | "source": [ 80 | "pandas data frames have a HTML table representation in the IPython notebook. Let's have a look at the first 5 rows:" 81 | ] 82 | }, 83 | { 84 | "cell_type": "code", 85 | "execution_count": null, 86 | "metadata": { 87 | "collapsed": false 88 | }, 89 | "outputs": [], 90 | "source": [ 91 | "data.head(5)" 92 | ] 93 | }, 94 | { 95 | "cell_type": "code", 96 | "execution_count": null, 97 | "metadata": { 98 | "collapsed": false 99 | }, 100 | "outputs": [], 101 | "source": [ 102 | "data.count()" 103 | ] 104 | }, 105 | { 106 | "cell_type": "markdown", 107 | "metadata": {}, 108 | "source": [ 109 | "The data frame has 891 rows. Some passengers have missing information though: in particular Age and Cabin info can be missing. The meaning of the columns is explained on the challenge website:\n", 110 | "\n", 111 | "https://www.kaggle.com/c/titanic-gettingStarted/data\n", 112 | "\n", 113 | "and copied here:\n", 114 | "\n", 115 | "```\n", 116 | "VARIABLE DESCRIPTIONS:\n", 117 | "survival Survival\n", 118 | " (0 = No; 1 = Yes)\n", 119 | "pclass Passenger Class\n", 120 | " (1 = 1st; 2 = 2nd; 3 = 3rd)\n", 121 | "name Name\n", 122 | "sex Sex\n", 123 | "age Age\n", 124 | "sibsp Number of Siblings/Spouses Aboard\n", 125 | "parch Number of Parents/Children Aboard\n", 126 | "ticket Ticket Number\n", 127 | "fare Passenger Fare\n", 128 | "cabin Cabin\n", 129 | "embarked Port of Embarkation\n", 130 | " (C = Cherbourg; Q = Queenstown; S = Southampton)\n", 131 | "\n", 132 | "SPECIAL NOTES:\n", 133 | "Pclass is a proxy for socio-economic status (SES)\n", 134 | " 1st ~ Upper; 2nd ~ Middle; 3rd ~ Lower\n", 135 | "\n", 136 | "Age is in Years; Fractional if Age less than One (1)\n", 137 | " If the Age is Estimated, it is in the form xx.5\n", 138 | "\n", 139 | "With respect to the family relation variables (i.e. sibsp and parch)\n", 140 | "some relations were ignored. The following are the definitions used\n", 141 | "for sibsp and parch.\n", 142 | "\n", 143 | "Sibling: Brother, Sister, Stepbrother, or Stepsister of Passenger Aboard Titanic\n", 144 | "Spouse: Husband or Wife of Passenger Aboard Titanic (Mistresses and Fiances Ignored)\n", 145 | "Parent: Mother or Father of Passenger Aboard Titanic\n", 146 | "Child: Son, Daughter, Stepson, or Stepdaughter of Passenger Aboard Titanic\n", 147 | "\n", 148 | "Other family relatives excluded from this study include cousins,\n", 149 | "nephews/nieces, aunts/uncles, and in-laws. Some children travelled\n", 150 | "only with a nanny, therefore parch=0 for them. 
As well, some\n", 151 | "travelled with very close friends or neighbors in a village, however,\n", 152 | "the definitions do not support such relations.\n", 153 | "```\n", 154 | "\n", 155 | "A data frame can be converted into a numpy array by calling the `values` attribute:" 156 | ] 157 | }, 158 | { 159 | "cell_type": "code", 160 | "execution_count": null, 161 | "metadata": { 162 | "collapsed": false 163 | }, 164 | "outputs": [], 165 | "source": [ 166 | "list(data.columns)" 167 | ] 168 | }, 169 | { 170 | "cell_type": "code", 171 | "execution_count": null, 172 | "metadata": { 173 | "collapsed": false 174 | }, 175 | "outputs": [], 176 | "source": [ 177 | "data.shape" 178 | ] 179 | }, 180 | { 181 | "cell_type": "code", 182 | "execution_count": null, 183 | "metadata": { 184 | "collapsed": false 185 | }, 186 | "outputs": [], 187 | "source": [ 188 | "data.values" 189 | ] 190 | }, 191 | { 192 | "cell_type": "markdown", 193 | "metadata": {}, 194 | "source": [ 195 | "However this cannot be directly fed to a scikit-learn model:\n", 196 | "\n", 197 | "\n", 198 | "- the target variable (survival) is mixed with the input data\n", 199 | "\n", 200 | "- some attribute such as unique ids have no predictive values for the task\n", 201 | "\n", 202 | "- the values are heterogeneous (string labels for categories, integers and floating point numbers)\n", 203 | "\n", 204 | "- some attribute values are missing (nan: \"not a number\")" 205 | ] 206 | }, 207 | { 208 | "cell_type": "markdown", 209 | "metadata": {}, 210 | "source": [ 211 | "## Predicting survival" 212 | ] 213 | }, 214 | { 215 | "cell_type": "markdown", 216 | "metadata": {}, 217 | "source": [ 218 | "The goal of the challenge is to predict whether a passenger has survived from others known attribute. Let us have a look at the `Survived` columns:" 219 | ] 220 | }, 221 | { 222 | "cell_type": "code", 223 | "execution_count": null, 224 | "metadata": { 225 | "collapsed": false 226 | }, 227 | "outputs": [], 228 | "source": [ 229 | "survived_column = data['Survived']\n", 230 | "survived_column.dtype" 231 | ] 232 | }, 233 | { 234 | "cell_type": "markdown", 235 | "metadata": {}, 236 | "source": [ 237 | "`data.Survived` is an instance of the pandas `Series` class with an integer dtype:" 238 | ] 239 | }, 240 | { 241 | "cell_type": "code", 242 | "execution_count": null, 243 | "metadata": { 244 | "collapsed": false 245 | }, 246 | "outputs": [], 247 | "source": [ 248 | "type(survived_column)" 249 | ] 250 | }, 251 | { 252 | "cell_type": "markdown", 253 | "metadata": {}, 254 | "source": [ 255 | "The `data` object is an instance pandas `DataFrame` class:" 256 | ] 257 | }, 258 | { 259 | "cell_type": "code", 260 | "execution_count": null, 261 | "metadata": { 262 | "collapsed": false 263 | }, 264 | "outputs": [], 265 | "source": [ 266 | "type(data)" 267 | ] 268 | }, 269 | { 270 | "cell_type": "markdown", 271 | "metadata": {}, 272 | "source": [ 273 | "`Series` can be seen as homegeneous, 1D columns. 
`DataFrame` instances are heterogenous collections of columns with the same length.\n", 274 | "\n", 275 | "The original data frame can be aggregated by counting rows for each possible value of the `Survived` column:" 276 | ] 277 | }, 278 | { 279 | "cell_type": "code", 280 | "execution_count": null, 281 | "metadata": { 282 | "collapsed": false 283 | }, 284 | "outputs": [], 285 | "source": [ 286 | "data.groupby('Survived').count()" 287 | ] 288 | }, 289 | { 290 | "cell_type": "code", 291 | "execution_count": null, 292 | "metadata": { 293 | "collapsed": false 294 | }, 295 | "outputs": [], 296 | "source": [ 297 | "np.mean(survived_column == 0)" 298 | ] 299 | }, 300 | { 301 | "cell_type": "markdown", 302 | "metadata": {}, 303 | "source": [ 304 | "From this the subset of the full passengers list, about 2/3 perished in the event. So if we are to build a predictive model from this data, a baseline model to compare the performance to would be to always predict death. Such a constant model would reach around 62% predictive accuracy (which is higher than predicting at random):" 305 | ] 306 | }, 307 | { 308 | "cell_type": "markdown", 309 | "metadata": {}, 310 | "source": [ 311 | "pandas `Series` instances can be converted to regular 1D numpy arrays by using the `values` attribute:" 312 | ] 313 | }, 314 | { 315 | "cell_type": "code", 316 | "execution_count": null, 317 | "metadata": { 318 | "collapsed": false 319 | }, 320 | "outputs": [], 321 | "source": [ 322 | "target = survived_column.values" 323 | ] 324 | }, 325 | { 326 | "cell_type": "code", 327 | "execution_count": null, 328 | "metadata": { 329 | "collapsed": false 330 | }, 331 | "outputs": [], 332 | "source": [ 333 | "type(target)" 334 | ] 335 | }, 336 | { 337 | "cell_type": "code", 338 | "execution_count": null, 339 | "metadata": { 340 | "collapsed": false 341 | }, 342 | "outputs": [], 343 | "source": [ 344 | "target.dtype" 345 | ] 346 | }, 347 | { 348 | "cell_type": "code", 349 | "execution_count": null, 350 | "metadata": { 351 | "collapsed": false 352 | }, 353 | "outputs": [], 354 | "source": [ 355 | "target[:5]" 356 | ] 357 | }, 358 | { 359 | "cell_type": "markdown", 360 | "metadata": {}, 361 | "source": [ 362 | "## Basic visualization\n", 363 | "\n", 364 | "Full doc at http://pandas.pydata.org/pandas-docs/stable/visualization.html\n", 365 | "\n", 366 | "See also seaborn package.\n", 367 | "\n", 368 | "https://stanford.edu/~mwaskom/software/seaborn/" 369 | ] 370 | }, 371 | { 372 | "cell_type": "code", 373 | "execution_count": null, 374 | "metadata": { 375 | "collapsed": false 376 | }, 377 | "outputs": [], 378 | "source": [ 379 | "data.plot(kind='scatter', x='Age', y='Fare', c='Survived', s=50);" 380 | ] 381 | }, 382 | { 383 | "cell_type": "code", 384 | "execution_count": null, 385 | "metadata": { 386 | "collapsed": false 387 | }, 388 | "outputs": [], 389 | "source": [ 390 | "from pandas.tools.plotting import scatter_matrix\n", 391 | "scatter_matrix(data.get(['Fare', 'Pclass', 'Age']), alpha=0.2, figsize=(8, 8), diagonal='kde');" 392 | ] 393 | }, 394 | { 395 | "cell_type": "code", 396 | "execution_count": null, 397 | "metadata": { 398 | "collapsed": false 399 | }, 400 | "outputs": [], 401 | "source": [ 402 | "data.get(['Fare', 'Age', 'Survived']).groupby('Survived').mean().plot(kind='bar');" 403 | ] 404 | }, 405 | { 406 | "cell_type": "markdown", 407 | "metadata": {}, 408 | "source": [ 409 | "## Training a predictive model on numerical features" 410 | ] 411 | }, 412 | { 413 | "cell_type": "markdown", 414 | "metadata": {}, 415 | "source": [ 416 
| "`sklearn` estimators all work with homegeneous numerical feature descriptors passed as a numpy array. Therefore passing the raw data frame will not work out of the box.\n", 417 | "\n", 418 | "Let us start simple and build a first model that only uses readily available numerical features as input, namely `data.Fare`, `data.Pclass` and `data.Age`." 419 | ] 420 | }, 421 | { 422 | "cell_type": "code", 423 | "execution_count": null, 424 | "metadata": { 425 | "collapsed": false 426 | }, 427 | "outputs": [], 428 | "source": [ 429 | "numerical_features = data.get(['Fare', 'Pclass', 'Age'])\n", 430 | "numerical_features.head(5)" 431 | ] 432 | }, 433 | { 434 | "cell_type": "markdown", 435 | "metadata": {}, 436 | "source": [ 437 | "Unfortunately some passengers do not have age information:" 438 | ] 439 | }, 440 | { 441 | "cell_type": "code", 442 | "execution_count": null, 443 | "metadata": { 444 | "collapsed": false 445 | }, 446 | "outputs": [], 447 | "source": [ 448 | "numerical_features.count()" 449 | ] 450 | }, 451 | { 452 | "cell_type": "markdown", 453 | "metadata": {}, 454 | "source": [ 455 | "Let's use pandas `fillna` method to input the median age for those passengers:" 456 | ] 457 | }, 458 | { 459 | "cell_type": "code", 460 | "execution_count": null, 461 | "metadata": { 462 | "collapsed": false 463 | }, 464 | "outputs": [], 465 | "source": [ 466 | "median_features = numerical_features.dropna().median()\n", 467 | "median_features" 468 | ] 469 | }, 470 | { 471 | "cell_type": "code", 472 | "execution_count": null, 473 | "metadata": { 474 | "collapsed": false 475 | }, 476 | "outputs": [], 477 | "source": [ 478 | "imputed_features = numerical_features.fillna(median_features)\n", 479 | "imputed_features.count()" 480 | ] 481 | }, 482 | { 483 | "cell_type": "code", 484 | "execution_count": null, 485 | "metadata": { 486 | "collapsed": false 487 | }, 488 | "outputs": [], 489 | "source": [ 490 | "imputed_features.head(5)" 491 | ] 492 | }, 493 | { 494 | "cell_type": "markdown", 495 | "metadata": {}, 496 | "source": [ 497 | "Now that the data frame is clean, we can convert it into an homogeneous numpy array of floating point values:" 498 | ] 499 | }, 500 | { 501 | "cell_type": "code", 502 | "execution_count": null, 503 | "metadata": { 504 | "collapsed": false 505 | }, 506 | "outputs": [], 507 | "source": [ 508 | "features_array = imputed_features.values\n", 509 | "features_array" 510 | ] 511 | }, 512 | { 513 | "cell_type": "code", 514 | "execution_count": null, 515 | "metadata": { 516 | "collapsed": false 517 | }, 518 | "outputs": [], 519 | "source": [ 520 | "features_array.dtype" 521 | ] 522 | }, 523 | { 524 | "cell_type": "markdown", 525 | "metadata": {}, 526 | "source": [ 527 | "Let's take the 80% of the data for training a first model and keep 20% for computing is generalization score:" 528 | ] 529 | }, 530 | { 531 | "cell_type": "code", 532 | "execution_count": null, 533 | "metadata": { 534 | "collapsed": false 535 | }, 536 | "outputs": [], 537 | "source": [ 538 | "from sklearn.model_selection import train_test_split\n", 539 | "\n", 540 | "features_train, features_test, target_train, target_test = train_test_split(\n", 541 | " features_array, target, test_size=0.20, random_state=0)" 542 | ] 543 | }, 544 | { 545 | "cell_type": "code", 546 | "execution_count": null, 547 | "metadata": { 548 | "collapsed": false 549 | }, 550 | "outputs": [], 551 | "source": [ 552 | "features_train.shape" 553 | ] 554 | }, 555 | { 556 | "cell_type": "code", 557 | "execution_count": null, 558 | "metadata": { 559 | 
"collapsed": false 560 | }, 561 | "outputs": [], 562 | "source": [ 563 | "features_test.shape" 564 | ] 565 | }, 566 | { 567 | "cell_type": "code", 568 | "execution_count": null, 569 | "metadata": { 570 | "collapsed": false 571 | }, 572 | "outputs": [], 573 | "source": [ 574 | "target_train.shape" 575 | ] 576 | }, 577 | { 578 | "cell_type": "code", 579 | "execution_count": null, 580 | "metadata": { 581 | "collapsed": false 582 | }, 583 | "outputs": [], 584 | "source": [ 585 | "target_test.shape" 586 | ] 587 | }, 588 | { 589 | "cell_type": "markdown", 590 | "metadata": {}, 591 | "source": [ 592 | "Let's start with a simple model from sklearn, namely `LogisticRegression`:" 593 | ] 594 | }, 595 | { 596 | "cell_type": "code", 597 | "execution_count": null, 598 | "metadata": { 599 | "collapsed": false 600 | }, 601 | "outputs": [], 602 | "source": [ 603 | "from sklearn.linear_model import LogisticRegression\n", 604 | "\n", 605 | "logreg = LogisticRegression(C=1.)\n", 606 | "logreg.fit(features_train, target_train)" 607 | ] 608 | }, 609 | { 610 | "cell_type": "code", 611 | "execution_count": null, 612 | "metadata": { 613 | "collapsed": false 614 | }, 615 | "outputs": [], 616 | "source": [ 617 | "target_predicted = logreg.predict(features_test)" 618 | ] 619 | }, 620 | { 621 | "cell_type": "code", 622 | "execution_count": null, 623 | "metadata": { 624 | "collapsed": false 625 | }, 626 | "outputs": [], 627 | "source": [ 628 | "from sklearn.metrics import accuracy_score\n", 629 | "\n", 630 | "accuracy_score(target_test, target_predicted)" 631 | ] 632 | }, 633 | { 634 | "cell_type": "markdown", 635 | "metadata": {}, 636 | "source": [ 637 | "This first model has around 73% accuracy: this is better than our baseline that always predicts death." 638 | ] 639 | }, 640 | { 641 | "cell_type": "code", 642 | "execution_count": null, 643 | "metadata": { 644 | "collapsed": false 645 | }, 646 | "outputs": [], 647 | "source": [ 648 | "logreg.score(features_test, target_test)" 649 | ] 650 | }, 651 | { 652 | "cell_type": "markdown", 653 | "metadata": {}, 654 | "source": [ 655 | "## Model evaluation and interpretation" 656 | ] 657 | }, 658 | { 659 | "cell_type": "markdown", 660 | "metadata": {}, 661 | "source": [ 662 | "### Interpreting linear model weights" 663 | ] 664 | }, 665 | { 666 | "cell_type": "markdown", 667 | "metadata": {}, 668 | "source": [ 669 | "The `coef_` attribute of a fitted linear model such as `LogisticRegression` holds the weights of each features:" 670 | ] 671 | }, 672 | { 673 | "cell_type": "code", 674 | "execution_count": null, 675 | "metadata": { 676 | "collapsed": false 677 | }, 678 | "outputs": [], 679 | "source": [ 680 | "feature_names = numerical_features.columns\n", 681 | "feature_names" 682 | ] 683 | }, 684 | { 685 | "cell_type": "code", 686 | "execution_count": null, 687 | "metadata": { 688 | "collapsed": false 689 | }, 690 | "outputs": [], 691 | "source": [ 692 | "logreg.coef_" 693 | ] 694 | }, 695 | { 696 | "cell_type": "code", 697 | "execution_count": null, 698 | "metadata": { 699 | "collapsed": false 700 | }, 701 | "outputs": [], 702 | "source": [ 703 | "x = np.arange(len(feature_names))\n", 704 | "plt.bar(x, logreg.coef_.ravel())\n", 705 | "_ = plt.xticks(x + 0.5, feature_names, rotation=30)" 706 | ] 707 | }, 708 | { 709 | "cell_type": "markdown", 710 | "metadata": {}, 711 | "source": [ 712 | "In this case, survival is slightly positively linked with Fare (the higher the fare, the higher the likelyhood the model will predict survival) while passenger from first class and lower ages 
are predicted to survive more often than older people from the 3rd class.\n", 713 | "\n", 714 | "First-class cabins were closer to the lifeboats and children and women reportedly had the priority. Our model seems to capture that historical data. We will see later if the sex of the passenger can be used as an informative predictor to increase the predictive accuracy of the model." 715 | ] 716 | }, 717 | { 718 | "cell_type": "markdown", 719 | "metadata": {}, 720 | "source": [ 721 | "### Alternative evaluation metrics" 722 | ] 723 | }, 724 | { 725 | "cell_type": "markdown", 726 | "metadata": {}, 727 | "source": [ 728 | "It is possible to see the details of the false positive and false negative errors by computing the confusion matrix:" 729 | ] 730 | }, 731 | { 732 | "cell_type": "code", 733 | "execution_count": null, 734 | "metadata": { 735 | "collapsed": false 736 | }, 737 | "outputs": [], 738 | "source": [ 739 | "from sklearn.metrics import confusion_matrix\n", 740 | "\n", 741 | "cm = confusion_matrix(target_test, target_predicted)\n", 742 | "print(cm)" 743 | ] 744 | }, 745 | { 746 | "cell_type": "markdown", 747 | "metadata": {}, 748 | "source": [ 749 | "The true labeling are seen as the rows and the predicted labels are the columns:" 750 | ] 751 | }, 752 | { 753 | "cell_type": "code", 754 | "execution_count": null, 755 | "metadata": { 756 | "collapsed": false 757 | }, 758 | "outputs": [], 759 | "source": [ 760 | "def plot_confusion(cm):\n", 761 | " plt.imshow(cm, interpolation='nearest', cmap=plt.cm.binary)\n", 762 | " plt.title('Confusion matrix')\n", 763 | " plt.set_cmap('Blues')\n", 764 | " plt.colorbar()\n", 765 | "\n", 766 | " target_names = ['not survived', 'survived']\n", 767 | "\n", 768 | " tick_marks = np.arange(len(target_names))\n", 769 | " plt.xticks(tick_marks, target_names, rotation=60)\n", 770 | " plt.yticks(tick_marks, target_names)\n", 771 | " plt.ylabel('True label')\n", 772 | " plt.xlabel('Predicted label')\n", 773 | " # Convenience function to adjust plot parameters for a clear layout.\n", 774 | " plt.tight_layout()\n", 775 | " \n", 776 | "plot_confusion(cm)" 777 | ] 778 | }, 779 | { 780 | "cell_type": "code", 781 | "execution_count": null, 782 | "metadata": { 783 | "collapsed": false 784 | }, 785 | "outputs": [], 786 | "source": [ 787 | "print(cm)" 788 | ] 789 | }, 790 | { 791 | "cell_type": "markdown", 792 | "metadata": {}, 793 | "source": [ 794 | "We can normalize the number of prediction by dividing by the total number of true \"survived\" and \"not survived\" to compute false and true positive rates for survival (in the second column of the confusion matrix)." 
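One caveat when normalizing with NumPy: broadcasting aligns the trailing axis, so dividing the matrix by `cm.sum(axis=1)` without keeping the dimensions divides each *column*, rather than each row, by the row totals. A small sketch of an explicit row-wise normalization, using made-up counts:

```python
# Row-wise normalization of a confusion matrix (illustrative counts only).
# keepdims=True turns the row totals into a (2, 1) column vector so that
# broadcasting divides every entry of row i by the total of row i.
import numpy as np

cm_example = np.array([[98, 12],
                       [27, 42]])
row_totals = cm_example.sum(axis=1, keepdims=True)   # shape (2, 1)
print(cm_example.astype(np.float64) / row_totals)    # each row sums to 1.0
```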
795 | ] 796 | }, 797 | { 798 | "cell_type": "code", 799 | "execution_count": null, 800 | "metadata": { 801 | "collapsed": false 802 | }, 803 | "outputs": [], 804 | "source": [ 805 | "print(cm.astype(np.float64) / cm.sum(axis=1))" 806 | ] 807 | }, 808 | { 809 | "cell_type": "markdown", 810 | "metadata": {}, 811 | "source": [ 812 | "We can therefore observe that the fact that the target classes are not balanced in the dataset makes the accuracy score not very informative.\n", 813 | "\n", 814 | "scikit-learn provides alternative classification metrics to evaluate models performance on imbalanced data such as precision, recall and f1 score:" 815 | ] 816 | }, 817 | { 818 | "cell_type": "code", 819 | "execution_count": null, 820 | "metadata": { 821 | "collapsed": false 822 | }, 823 | "outputs": [], 824 | "source": [ 825 | "from sklearn.metrics import classification_report\n", 826 | "\n", 827 | "print(classification_report(target_test, target_predicted,\n", 828 | " target_names=['not survived', 'survived']))" 829 | ] 830 | }, 831 | { 832 | "cell_type": "markdown", 833 | "metadata": {}, 834 | "source": [ 835 | "Another way to quantify the quality of a binary classifier on imbalanced data is to compute the precision, recall and f1-score of a model (at the default fixed decision threshold of 0.5)." 836 | ] 837 | }, 838 | { 839 | "cell_type": "markdown", 840 | "metadata": {}, 841 | "source": [ 842 | "Logistic Regression is a probabilistic models: instead of just predicting a binary outcome (survived or not) given the input features it can also estimates the posterior probability of the outcome given the input features using the `predict_proba` method:" 843 | ] 844 | }, 845 | { 846 | "cell_type": "code", 847 | "execution_count": null, 848 | "metadata": { 849 | "collapsed": false 850 | }, 851 | "outputs": [], 852 | "source": [ 853 | "target_predicted_proba = logreg.predict_proba(features_test)\n", 854 | "target_predicted_proba[:5]" 855 | ] 856 | }, 857 | { 858 | "cell_type": "markdown", 859 | "metadata": {}, 860 | "source": [ 861 | "By default the decision threshold is 0.5: if we vary the decision threshold from 0 to 1 we could generate a family of binary classifier models that address all the possible trade offs between false positive and false negative prediction errors.\n", 862 | "\n", 863 | "We can summarize the performance of a binary classifier for all the possible thresholds by plotting the ROC curve and quantifying the Area under the ROC curve:" 864 | ] 865 | }, 866 | { 867 | "cell_type": "code", 868 | "execution_count": null, 869 | "metadata": { 870 | "collapsed": false 871 | }, 872 | "outputs": [], 873 | "source": [ 874 | "from sklearn.metrics import roc_curve\n", 875 | "from sklearn.metrics import auc\n", 876 | "\n", 877 | "def plot_roc_curve(target_test, target_predicted_proba):\n", 878 | " fpr, tpr, thresholds = roc_curve(target_test, target_predicted_proba[:, 1])\n", 879 | " \n", 880 | " roc_auc = auc(fpr, tpr)\n", 881 | " # Plot ROC curve\n", 882 | " plt.plot(fpr, tpr, label='ROC curve (area = %0.3f)' % roc_auc)\n", 883 | " plt.plot([0, 1], [0, 1], 'k--') # random predictions curve\n", 884 | " plt.xlim([0.0, 1.0])\n", 885 | " plt.ylim([0.0, 1.0])\n", 886 | " plt.xlabel('False Positive Rate or (1 - Specifity)')\n", 887 | " plt.ylabel('True Positive Rate or (Sensitivity)')\n", 888 | " plt.title('Receiver Operating Characteristic')\n", 889 | " plt.legend(loc=\"lower right\")" 890 | ] 891 | }, 892 | { 893 | "cell_type": "code", 894 | "execution_count": null, 895 | "metadata": { 896 | 
"collapsed": false 897 | }, 898 | "outputs": [], 899 | "source": [ 900 | "plot_roc_curve(target_test, target_predicted_proba)" 901 | ] 902 | }, 903 | { 904 | "cell_type": "markdown", 905 | "metadata": {}, 906 | "source": [ 907 | "Here the area under ROC curve is 0.756 which is very similar to the accuracy (0.732). However the ROC-AUC score of a random model is expected to 0.5 on average while the accuracy score of a random model depends on the class imbalance of the data. ROC-AUC can be seen as a way to callibrate the predictive accuracy of a model against class imbalance." 908 | ] 909 | }, 910 | { 911 | "cell_type": "markdown", 912 | "metadata": {}, 913 | "source": [ 914 | "### Cross-validation" 915 | ] 916 | }, 917 | { 918 | "cell_type": "markdown", 919 | "metadata": {}, 920 | "source": [ 921 | "We previously decided to randomly split the data to evaluate the model on 20% of held-out data. However the location randomness of the split might have a significant impact in the estimated accuracy:" 922 | ] 923 | }, 924 | { 925 | "cell_type": "code", 926 | "execution_count": null, 927 | "metadata": { 928 | "collapsed": false 929 | }, 930 | "outputs": [], 931 | "source": [ 932 | "features_train, features_test, target_train, target_test = train_test_split(\n", 933 | " features_array, target, test_size=0.20, random_state=0)\n", 934 | "\n", 935 | "logreg.fit(features_train, target_train).score(features_test, target_test)" 936 | ] 937 | }, 938 | { 939 | "cell_type": "code", 940 | "execution_count": null, 941 | "metadata": { 942 | "collapsed": false 943 | }, 944 | "outputs": [], 945 | "source": [ 946 | "features_train, features_test, target_train, target_test = train_test_split(\n", 947 | " features_array, target, test_size=0.20, random_state=1)\n", 948 | "\n", 949 | "logreg.fit(features_train, target_train).score(features_test, target_test)" 950 | ] 951 | }, 952 | { 953 | "cell_type": "code", 954 | "execution_count": null, 955 | "metadata": { 956 | "collapsed": false 957 | }, 958 | "outputs": [], 959 | "source": [ 960 | "features_train, features_test, target_train, target_test = train_test_split(\n", 961 | " features_array, target, test_size=0.20, random_state=2)\n", 962 | "\n", 963 | "logreg.fit(features_train, target_train).score(features_test, target_test)" 964 | ] 965 | }, 966 | { 967 | "cell_type": "markdown", 968 | "metadata": {}, 969 | "source": [ 970 | "So instead of using a single train / test split, we can use a group of them and compute the min, max and mean scores as an estimation of the real test score while not underestimating the variability:" 971 | ] 972 | }, 973 | { 974 | "cell_type": "code", 975 | "execution_count": null, 976 | "metadata": { 977 | "collapsed": false 978 | }, 979 | "outputs": [], 980 | "source": [ 981 | "from sklearn.model_selection import cross_val_score\n", 982 | "\n", 983 | "scores = cross_val_score(logreg, features_array, target, cv=5)\n", 984 | "scores" 985 | ] 986 | }, 987 | { 988 | "cell_type": "code", 989 | "execution_count": null, 990 | "metadata": { 991 | "collapsed": false 992 | }, 993 | "outputs": [], 994 | "source": [ 995 | "scores.min(), scores.mean(), scores.max()" 996 | ] 997 | }, 998 | { 999 | "cell_type": "markdown", 1000 | "metadata": {}, 1001 | "source": [ 1002 | "`cross_val_score` reports accuracy by default be it can also be used to report other performance metrics such as ROC-AUC or f1-score:" 1003 | ] 1004 | }, 1005 | { 1006 | "cell_type": "code", 1007 | "execution_count": null, 1008 | "metadata": { 1009 | "collapsed": false 1010 | }, 1011 | 
"outputs": [], 1012 | "source": [ 1013 | "scores = cross_val_score(logreg, features_array, target, cv=5,\n", 1014 | " scoring='roc_auc')\n", 1015 | "scores.min(), scores.mean(), scores.max()" 1016 | ] 1017 | }, 1018 | { 1019 | "cell_type": "markdown", 1020 | "metadata": {}, 1021 | "source": [ 1022 | "**Exercise**:\n", 1023 | "\n", 1024 | "- Compute cross-validated scores for other classification metrics ('precision', 'recall', 'f1', 'accuracy'...).\n", 1025 | "\n", 1026 | "- Change the number of cross-validation folds between 3 and 10: what is the impact on the mean score? on the processing time?\n", 1027 | "\n", 1028 | "Hints:\n", 1029 | "\n", 1030 | "The list of classification metrics is available in the online documentation:\n", 1031 | "\n", 1032 | " http://scikit-learn.org/stable/modules/model_evaluation.html#common-cases-predefined-values\n", 1033 | " \n", 1034 | "You can use the `%%time` cell magic on the first line of an IPython cell to measure the time of the execution of the cell. " 1035 | ] 1036 | }, 1037 | { 1038 | "cell_type": "code", 1039 | "execution_count": null, 1040 | "metadata": { 1041 | "collapsed": false 1042 | }, 1043 | "outputs": [], 1044 | "source": [] 1045 | }, 1046 | { 1047 | "cell_type": "markdown", 1048 | "metadata": {}, 1049 | "source": [ 1050 | "## More feature engineering and richer models" 1051 | ] 1052 | }, 1053 | { 1054 | "cell_type": "markdown", 1055 | "metadata": {}, 1056 | "source": [ 1057 | "Let us now try to build richer models by including more features as potential predictors for our model.\n", 1058 | "\n", 1059 | "Categorical variables such as `data.Embarked` or `data.Sex` can be converted as boolean indicators features also known as dummy variables or one-hot-encoded features:" 1060 | ] 1061 | }, 1062 | { 1063 | "cell_type": "code", 1064 | "execution_count": null, 1065 | "metadata": { 1066 | "collapsed": false 1067 | }, 1068 | "outputs": [], 1069 | "source": [ 1070 | "pd.get_dummies(data.Sex, prefix='Sex').head(5)" 1071 | ] 1072 | }, 1073 | { 1074 | "cell_type": "code", 1075 | "execution_count": null, 1076 | "metadata": { 1077 | "collapsed": false 1078 | }, 1079 | "outputs": [], 1080 | "source": [ 1081 | "pd.get_dummies(data.Embarked, prefix='Embarked').head(5)" 1082 | ] 1083 | }, 1084 | { 1085 | "cell_type": "markdown", 1086 | "metadata": {}, 1087 | "source": [ 1088 | "We can combine those new numerical features with the previous features using `pandas.concat` along `axis=1`:" 1089 | ] 1090 | }, 1091 | { 1092 | "cell_type": "code", 1093 | "execution_count": null, 1094 | "metadata": { 1095 | "collapsed": false 1096 | }, 1097 | "outputs": [], 1098 | "source": [ 1099 | "rich_features = pd.concat([data.get(['Fare', 'Pclass', 'Age']),\n", 1100 | " pd.get_dummies(data.Sex, prefix='Sex'),\n", 1101 | " pd.get_dummies(data.Embarked, prefix='Embarked')],\n", 1102 | " axis=1)\n", 1103 | "rich_features.head(5)" 1104 | ] 1105 | }, 1106 | { 1107 | "cell_type": "markdown", 1108 | "metadata": {}, 1109 | "source": [ 1110 | "By construction the new `Sex_male` feature is redundant with `Sex_female`. 
Let us drop it:" 1111 | ] 1112 | }, 1113 | { 1114 | "cell_type": "code", 1115 | "execution_count": null, 1116 | "metadata": { 1117 | "collapsed": false 1118 | }, 1119 | "outputs": [], 1120 | "source": [ 1121 | "rich_features_no_male = rich_features.drop('Sex_male', 1)\n", 1122 | "rich_features_no_male.head(5)" 1123 | ] 1124 | }, 1125 | { 1126 | "cell_type": "markdown", 1127 | "metadata": {}, 1128 | "source": [ 1129 | "Let us not forget to impute the median age for passengers without age information:" 1130 | ] 1131 | }, 1132 | { 1133 | "cell_type": "code", 1134 | "execution_count": null, 1135 | "metadata": { 1136 | "collapsed": false 1137 | }, 1138 | "outputs": [], 1139 | "source": [ 1140 | "rich_features_no_male.count()" 1141 | ] 1142 | }, 1143 | { 1144 | "cell_type": "code", 1145 | "execution_count": null, 1146 | "metadata": { 1147 | "collapsed": false 1148 | }, 1149 | "outputs": [], 1150 | "source": [ 1151 | "rich_features_final = rich_features_no_male.fillna(rich_features_no_male.dropna().median())\n", 1152 | "rich_features_final.count()" 1153 | ] 1154 | }, 1155 | { 1156 | "cell_type": "markdown", 1157 | "metadata": {}, 1158 | "source": [ 1159 | "We can finally cross-validate a logistic regression model on this new data an observe that the mean score has significantly increased:" 1160 | ] 1161 | }, 1162 | { 1163 | "cell_type": "code", 1164 | "execution_count": null, 1165 | "metadata": { 1166 | "collapsed": false 1167 | }, 1168 | "outputs": [], 1169 | "source": [ 1170 | "%%time\n", 1171 | "\n", 1172 | "from sklearn.linear_model import LogisticRegression\n", 1173 | "from sklearn.model_selection import cross_val_score\n", 1174 | "\n", 1175 | "logreg = LogisticRegression(C=1.)\n", 1176 | "scores = cross_val_score(logreg, rich_features_final, target, cv=5, scoring='accuracy')\n", 1177 | "print(\"Logistic Regression CV scores:\")\n", 1178 | "print(\"min: {:.3f}, mean: {:.3f}, max: {:.3f}\".format(\n", 1179 | " scores.min(), scores.mean(), scores.max()))" 1180 | ] 1181 | }, 1182 | { 1183 | "cell_type": "markdown", 1184 | "metadata": {}, 1185 | "source": [ 1186 | "**Exercise**:\n", 1187 | "\n", 1188 | "- change the value of the parameter `C`. Does it have an impact on the score?\n", 1189 | "\n", 1190 | "- fit a new instance of the logistic regression model on the full dataset.\n", 1191 | "\n", 1192 | "- plot the weights for the features of this newly fitted logistic regression model." 1193 | ] 1194 | }, 1195 | { 1196 | "cell_type": "code", 1197 | "execution_count": null, 1198 | "metadata": { 1199 | "collapsed": false 1200 | }, 1201 | "outputs": [], 1202 | "source": [ 1203 | "%load solutions/04A_plot_logistic_regression_weights.py" 1204 | ] 1205 | }, 1206 | { 1207 | "cell_type": "code", 1208 | "execution_count": null, 1209 | "metadata": { 1210 | "collapsed": false 1211 | }, 1212 | "outputs": [], 1213 | "source": [] 1214 | }, 1215 | { 1216 | "cell_type": "markdown", 1217 | "metadata": {}, 1218 | "source": [ 1219 | "### Training Non-linear models: ensembles of randomized trees" 1220 | ] 1221 | }, 1222 | { 1223 | "cell_type": "markdown", 1224 | "metadata": {}, 1225 | "source": [ 1226 | "`sklearn` also implement non linear models that are known to perform very well for data-science projects where datasets have not too many features (e.g. 
less than 5000).\n", 1227 | "\n", 1228 | "In particular let us have a look at Random Forests and Gradient Boosted Trees:" 1229 | ] 1230 | }, 1231 | { 1232 | "cell_type": "code", 1233 | "execution_count": null, 1234 | "metadata": { 1235 | "collapsed": false 1236 | }, 1237 | "outputs": [], 1238 | "source": [ 1239 | "%%time\n", 1240 | "\n", 1241 | "from sklearn.ensemble import RandomForestClassifier\n", 1242 | "\n", 1243 | "rf = RandomForestClassifier(n_estimators=100)\n", 1244 | "scores = cross_val_score(rf, rich_features_final, target, cv=5, n_jobs=4,\n", 1245 | " scoring='accuracy')\n", 1246 | "print(\"Random Forest CV scores:\")\n", 1247 | "print(\"min: {:.3f}, mean: {:.3f}, max: {:.3f}\".format(\n", 1248 | " scores.min(), scores.mean(), scores.max()))" 1249 | ] 1250 | }, 1251 | { 1252 | "cell_type": "code", 1253 | "execution_count": null, 1254 | "metadata": { 1255 | "collapsed": false 1256 | }, 1257 | "outputs": [], 1258 | "source": [ 1259 | "%%time\n", 1260 | "\n", 1261 | "from sklearn.ensemble import GradientBoostingClassifier\n", 1262 | "\n", 1263 | "gb = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1,\n", 1264 | " subsample=.8, max_features=.5)\n", 1265 | "scores = cross_val_score(gb, rich_features_final, target, cv=5, n_jobs=4,\n", 1266 | " scoring='accuracy')\n", 1267 | "print(\"Gradient Boosted Trees CV scores:\")\n", 1268 | "print(\"min: {:.3f}, mean: {:.3f}, max: {:.3f}\".format(\n", 1269 | " scores.min(), scores.mean(), scores.max()))" 1270 | ] 1271 | }, 1272 | { 1273 | "cell_type": "markdown", 1274 | "metadata": {}, 1275 | "source": [ 1276 | "Both models seem to do slightly better than the logistic regression model on this data." 1277 | ] 1278 | }, 1279 | { 1280 | "cell_type": "markdown", 1281 | "metadata": {}, 1282 | "source": [ 1283 | "**Exercise**:\n", 1284 | "\n", 1285 | "- Change the value of the `learning_rate` and other `GradientBoostingClassifier` parameters: can you get a better mean score?\n", 1286 | "\n", 1287 | "- Would treating the `Pclass` variable as categorical improve the model's performance?\n", 1288 | "\n", 1289 | "- Find out which predictor variables (features) are the most informative for those models.\n", 1290 | "\n", 1291 | "Hints:\n", 1292 | "\n", 1293 | "Fitted ensembles of trees have a `feature_importances_` attribute that can be used similarly to the `coef_` attribute of linear models." 
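(Editor's aside, not part of the original notebook cells.) The hint above translates to code directly; here is a minimal sketch, reusing the `gb` estimator, `rich_features_final` and `target` defined in this notebook, of how `feature_importances_` can be inspected once the ensemble is fitted. The solution loaded further below shows the same information as a bar plot.

```python
# Sketch only: fit the gradient boosted trees model on the full feature matrix
# and list the features by decreasing importance.
gb.fit(rich_features_final, target)
for name, importance in sorted(zip(rich_features_final.columns,
                                   gb.feature_importances_),
                               key=lambda pair: pair[1], reverse=True):
    print("{}: {:.3f}".format(name, importance))
```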
1294 | ] 1295 | }, 1296 | { 1297 | "cell_type": "code", 1298 | "execution_count": null, 1299 | "metadata": { 1300 | "collapsed": false 1301 | }, 1302 | "outputs": [], 1303 | "source": [ 1304 | "%load solutions/04B_more_categorical_variables.py" 1305 | ] 1306 | }, 1307 | { 1308 | "cell_type": "code", 1309 | "execution_count": null, 1310 | "metadata": { 1311 | "collapsed": false 1312 | }, 1313 | "outputs": [], 1314 | "source": [] 1315 | }, 1316 | { 1317 | "cell_type": "code", 1318 | "execution_count": null, 1319 | "metadata": { 1320 | "collapsed": false 1321 | }, 1322 | "outputs": [], 1323 | "source": [ 1324 | "%load solutions/04C_feature_importance.py" 1325 | ] 1326 | }, 1327 | { 1328 | "cell_type": "code", 1329 | "execution_count": null, 1330 | "metadata": { 1331 | "collapsed": false 1332 | }, 1333 | "outputs": [], 1334 | "source": [] 1335 | }, 1336 | { 1337 | "cell_type": "markdown", 1338 | "metadata": {}, 1339 | "source": [ 1340 | "## Automated parameter tuning" 1341 | ] 1342 | }, 1343 | { 1344 | "cell_type": "markdown", 1345 | "metadata": {}, 1346 | "source": [ 1347 | "Instead of changing the value of the learning rate manually and re-running the cross-validation, we can find the best values for the parameters automatically (assuming we are ready to wait):" 1348 | ] 1349 | }, 1350 | { 1351 | "cell_type": "code", 1352 | "execution_count": null, 1353 | "metadata": { 1354 | "collapsed": false 1355 | }, 1356 | "outputs": [], 1357 | "source": [ 1358 | "%%time\n", 1359 | "\n", 1360 | "from sklearn.model_selection import GridSearchCV\n", 1361 | "\n", 1362 | "gb = GradientBoostingClassifier(n_estimators=100, subsample=.8)\n", 1363 | "\n", 1364 | "params = {\n", 1365 | " 'learning_rate': [0.05, 0.1, 0.5],\n", 1366 | " 'max_features': [0.5, 1],\n", 1367 | " 'max_depth': [3, 4, 5],\n", 1368 | "}\n", 1369 | "gs = GridSearchCV(gb, params, cv=5, scoring='roc_auc', n_jobs=4)\n", 1370 | "gs.fit(rich_features_final, target)" 1371 | ] 1372 | }, 1373 | { 1374 | "cell_type": "markdown", 1375 | "metadata": {}, 1376 | "source": [ 1377 | "Let us sort the models by mean validation score:" 1378 | ] 1379 | }, 1380 | { 1381 | "cell_type": "code", 1382 | "execution_count": null, 1383 | "metadata": { 1384 | "collapsed": false 1385 | }, 1386 | "outputs": [], 1387 | "source": [ 1388 | "sorted(gs.grid_scores_, key=lambda x: x.mean_validation_score, reverse=True)" 1389 | ] 1390 | }, 1391 | { 1392 | "cell_type": "code", 1393 | "execution_count": null, 1394 | "metadata": { 1395 | "collapsed": false 1396 | }, 1397 | "outputs": [], 1398 | "source": [ 1399 | "gs.best_score_" 1400 | ] 1401 | }, 1402 | { 1403 | "cell_type": "code", 1404 | "execution_count": null, 1405 | "metadata": { 1406 | "collapsed": false 1407 | }, 1408 | "outputs": [], 1409 | "source": [ 1410 | "gs.best_params_" 1411 | ] 1412 | }, 1413 | { 1414 | "cell_type": "markdown", 1415 | "metadata": {}, 1416 | "source": [ 1417 | "We should note that the mean scores are very close to one another and almost always within one standard deviation of one another. This means that all those parameters are quite reasonable. The only parameter of importance seems to be the `learning_rate`: 0.5 seems to be a bit too high." 
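(Editor's aside, not part of the original notebook cells.) Because `GridSearchCV` is created with the default `refit=True`, the best parameter combination found above is automatically refit on the whole dataset once the search finishes. A minimal sketch of how the tuned model could then be reused, assuming the fitted `gs` object from the previous cells:

```python
# The grid search object exposes the winning model, refit on all the data.
best_model = gs.best_estimator_

# It behaves like any other fitted classifier, e.g. predicting survival
# probabilities for the first few passengers of the feature dataframe.
print(best_model.predict_proba(rich_features_final[:5]))
```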
1418 | ] 1419 | }, 1420 | { 1421 | "cell_type": "markdown", 1422 | "metadata": {}, 1423 | "source": [ 1424 | "## Avoiding data snooping with pipelines" 1425 | ] 1426 | }, 1427 | { 1428 | "cell_type": "markdown", 1429 | "metadata": {}, 1430 | "source": [ 1431 | "When doing imputation in pandas, prior to computing the train test split we use data from the test to improve the accuracy of the median value that we impute on the training set. This is actually cheating. To avoid this we should compute the median of the features on the training fold and use that median value to do the imputation both on the training and validation fold for a given CV split.\n", 1432 | "\n", 1433 | "To do this we can prepare the features as previously but without the imputation: we just replace missing values by the -1 marker value:" 1434 | ] 1435 | }, 1436 | { 1437 | "cell_type": "code", 1438 | "execution_count": null, 1439 | "metadata": { 1440 | "collapsed": false 1441 | }, 1442 | "outputs": [], 1443 | "source": [ 1444 | "features = pd.concat([data.get(['Fare', 'Age']),\n", 1445 | " pd.get_dummies(data.Sex, prefix='Sex'),\n", 1446 | " pd.get_dummies(data.Pclass, prefix='Pclass'),\n", 1447 | " pd.get_dummies(data.Embarked, prefix='Embarked')],\n", 1448 | " axis=1)\n", 1449 | "features = features.drop('Sex_male', 1)\n", 1450 | "\n", 1451 | "# Because of the following bug we cannot use NaN as the missing\n", 1452 | "# value marker, use a negative value as marker instead:\n", 1453 | "# https://github.com/scikit-learn/scikit-learn/issues/3044\n", 1454 | "features = features.fillna(-1)\n", 1455 | "features.head(5)" 1456 | ] 1457 | }, 1458 | { 1459 | "cell_type": "markdown", 1460 | "metadata": {}, 1461 | "source": [ 1462 | "We can now use the `Imputer` transformer of scikit-learn to find the median value on the training set and apply it on missing values of both the training set and the test set." 1463 | ] 1464 | }, 1465 | { 1466 | "cell_type": "code", 1467 | "execution_count": null, 1468 | "metadata": { 1469 | "collapsed": false 1470 | }, 1471 | "outputs": [], 1472 | "source": [ 1473 | "from sklearn.model_selection import train_test_split\n", 1474 | "\n", 1475 | "X_train, X_test, y_train, y_test = train_test_split(features.values, target, random_state=0)" 1476 | ] 1477 | }, 1478 | { 1479 | "cell_type": "code", 1480 | "execution_count": null, 1481 | "metadata": { 1482 | "collapsed": false 1483 | }, 1484 | "outputs": [], 1485 | "source": [ 1486 | "from sklearn.preprocessing import Imputer\n", 1487 | "\n", 1488 | "imputer = Imputer(strategy='median', missing_values=-1)\n", 1489 | "\n", 1490 | "imputer.fit(X_train)" 1491 | ] 1492 | }, 1493 | { 1494 | "cell_type": "markdown", 1495 | "metadata": {}, 1496 | "source": [ 1497 | "The median age computed on the training set is stored in the `statistics_` attribute." 
1498 | ] 1499 | }, 1500 | { 1501 | "cell_type": "code", 1502 | "execution_count": null, 1503 | "metadata": { 1504 | "collapsed": false 1505 | }, 1506 | "outputs": [], 1507 | "source": [ 1508 | "imputer.statistics_" 1509 | ] 1510 | }, 1511 | { 1512 | "cell_type": "markdown", 1513 | "metadata": {}, 1514 | "source": [ 1515 | "Imputation can now happen by calling the transform method:" 1516 | ] 1517 | }, 1518 | { 1519 | "cell_type": "code", 1520 | "execution_count": null, 1521 | "metadata": { 1522 | "collapsed": false 1523 | }, 1524 | "outputs": [], 1525 | "source": [ 1526 | "X_train_imputed = imputer.transform(X_train)\n", 1527 | "X_test_imputed = imputer.transform(X_test)" 1528 | ] 1529 | }, 1530 | { 1531 | "cell_type": "code", 1532 | "execution_count": null, 1533 | "metadata": { 1534 | "collapsed": false 1535 | }, 1536 | "outputs": [], 1537 | "source": [ 1538 | "np.any(X_train == -1)" 1539 | ] 1540 | }, 1541 | { 1542 | "cell_type": "code", 1543 | "execution_count": null, 1544 | "metadata": { 1545 | "collapsed": false 1546 | }, 1547 | "outputs": [], 1548 | "source": [ 1549 | "np.any(X_train_imputed == -1)" 1550 | ] 1551 | }, 1552 | { 1553 | "cell_type": "code", 1554 | "execution_count": null, 1555 | "metadata": { 1556 | "collapsed": false 1557 | }, 1558 | "outputs": [], 1559 | "source": [ 1560 | "np.any(X_test == -1)" 1561 | ] 1562 | }, 1563 | { 1564 | "cell_type": "code", 1565 | "execution_count": null, 1566 | "metadata": { 1567 | "collapsed": false 1568 | }, 1569 | "outputs": [], 1570 | "source": [ 1571 | "np.any(X_test_imputed == -1)" 1572 | ] 1573 | }, 1574 | { 1575 | "cell_type": "markdown", 1576 | "metadata": {}, 1577 | "source": [ 1578 | "We can now use a pipeline that wraps an imputer transformer and the classifier itself:" 1579 | ] 1580 | }, 1581 | { 1582 | "cell_type": "code", 1583 | "execution_count": null, 1584 | "metadata": { 1585 | "collapsed": false 1586 | }, 1587 | "outputs": [], 1588 | "source": [ 1589 | "from sklearn.pipeline import Pipeline\n", 1590 | "\n", 1591 | "imputer = Imputer(strategy='median', missing_values=-1)\n", 1592 | "\n", 1593 | "classifier = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1,\n", 1594 | " subsample=.8, max_features=.5)\n", 1595 | "\n", 1596 | "pipeline = Pipeline([\n", 1597 | " ('imp', imputer),\n", 1598 | " ('clf', classifier),\n", 1599 | "])\n", 1600 | "\n", 1601 | "scores = cross_val_score(pipeline, features.values, target, cv=5, n_jobs=4,\n", 1602 | " scoring='accuracy', )\n", 1603 | "print(scores.min(), scores.mean(), scores.max())" 1604 | ] 1605 | }, 1606 | { 1607 | "cell_type": "markdown", 1608 | "metadata": {}, 1609 | "source": [ 1610 | "The mean cross-validation is slightly lower than we used the imputation on the whole data as we did earlier although not by much. This means that in this case the data-snooping was not really helping the model cheat by much.\n", 1611 | "\n", 1612 | "Let us re-run the grid search, this time on the pipeline. 
Note that thanks to the pipeline structure we can optimize the interaction of the imputation method with the parameters of the downstream classifier without cheating:" 1613 | ] 1614 | }, 1615 | { 1616 | "cell_type": "code", 1617 | "execution_count": null, 1618 | "metadata": { 1619 | "collapsed": false 1620 | }, 1621 | "outputs": [], 1622 | "source": [ 1623 | "%%time\n", 1624 | "\n", 1625 | "params = {\n", 1626 | " 'imp__strategy': ['mean', 'median'],\n", 1627 | " 'clf__max_features': [0.5, 1],\n", 1628 | " 'clf__max_depth': [3, 4, 5],\n", 1629 | "}\n", 1630 | "gs = GridSearchCV(pipeline, params, cv=5, scoring='roc_auc', n_jobs=4)\n", 1631 | "gs.fit(X_train, y_train)" 1632 | ] 1633 | }, 1634 | { 1635 | "cell_type": "code", 1636 | "execution_count": null, 1637 | "metadata": { 1638 | "collapsed": false 1639 | }, 1640 | "outputs": [], 1641 | "source": [ 1642 | "sorted(gs.grid_scores_, key=lambda x: x.mean_validation_score, reverse=True)" 1643 | ] 1644 | }, 1645 | { 1646 | "cell_type": "code", 1647 | "execution_count": null, 1648 | "metadata": { 1649 | "collapsed": false 1650 | }, 1651 | "outputs": [], 1652 | "source": [ 1653 | "gs.best_score_" 1654 | ] 1655 | }, 1656 | { 1657 | "cell_type": "code", 1658 | "execution_count": null, 1659 | "metadata": { 1660 | "collapsed": false 1661 | }, 1662 | "outputs": [], 1663 | "source": [ 1664 | "plot_roc_curve(y_test, gs.predict_proba(X_test))" 1665 | ] 1666 | }, 1667 | { 1668 | "cell_type": "code", 1669 | "execution_count": null, 1670 | "metadata": { 1671 | "collapsed": false 1672 | }, 1673 | "outputs": [], 1674 | "source": [ 1675 | "gs.best_params_" 1676 | ] 1677 | }, 1678 | { 1679 | "cell_type": "markdown", 1680 | "metadata": {}, 1681 | "source": [ 1682 | "From this search we can conclude that the imputation by the 'mean' strategy is generally a slightly better imputation strategy when training a GBRT model on this data." 1683 | ] 1684 | }, 1685 | { 1686 | "cell_type": "markdown", 1687 | "metadata": {}, 1688 | "source": [ 1689 | "## Further integrating sklearn and pandas" 1690 | ] 1691 | }, 1692 | { 1693 | "cell_type": "markdown", 1694 | "metadata": {}, 1695 | "source": [ 1696 | "Helper tool for better sklearn / pandas integration: https://github.com/paulgb/sklearn-pandas by making it possible to embed the feature construction from the raw dataframe directly inside a pipeline." 
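(Editor's aside, not part of the original notebook cells.) For illustration only, here is a minimal sketch of what such an integration could look like with the `DataFrameMapper` class provided by sklearn-pandas. The column/transformer pairs below are hypothetical choices for the Titanic dataframe, and the exact API may vary across versions of that package:

```python
from sklearn_pandas import DataFrameMapper
from sklearn.preprocessing import Imputer, LabelBinarizer
from sklearn.pipeline import Pipeline
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

# Sketch only: declare, per column, how the raw dataframe should be featurized.
mapper = DataFrameMapper([
    ('Sex', LabelBinarizer()),                       # categorical -> indicator
    ('Pclass', LabelBinarizer()),                    # treat class as categorical
    (['Age', 'Fare'], Imputer(strategy='median')),   # impute numeric columns
])

pipeline = Pipeline([
    ('featurize', mapper),
    ('clf', GradientBoostingClassifier(n_estimators=100, subsample=.8)),
])

# The raw `data` dataframe from earlier in the notebook goes in directly:
# feature construction now happens inside each cross-validation fold.
scores = cross_val_score(pipeline, data, target, cv=5, scoring='roc_auc')
print(scores.mean())
```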
1697 | ] 1698 | }, 1699 | { 1700 | "cell_type": "markdown", 1701 | "metadata": {}, 1702 | "source": [ 1703 | "### Credits" 1704 | ] 1705 | }, 1706 | { 1707 | "cell_type": "markdown", 1708 | "metadata": {}, 1709 | "source": [ 1710 | "Thanks to:\n", 1711 | "\n", 1712 | "- Kaggle for setting up the Titanic challenge.\n", 1713 | "\n", 1714 | "- This blog post by Philippe Adjiman for inspiration:\n", 1715 | "\n", 1716 | "http://www.philippeadjiman.com/blog/2013/09/12/a-data-science-exploration-from-the-titanic-in-r/" 1717 | ] 1718 | }, 1719 | { 1720 | "cell_type": "code", 1721 | "execution_count": null, 1722 | "metadata": { 1723 | "collapsed": false 1724 | }, 1725 | "outputs": [], 1726 | "source": [] 1727 | } 1728 | ], 1729 | "metadata": { 1730 | "kernelspec": { 1731 | "display_name": "Python 2", 1732 | "language": "python", 1733 | "name": "python2" 1734 | }, 1735 | "language_info": { 1736 | "codemirror_mode": { 1737 | "name": "ipython", 1738 | "version": 2 1739 | }, 1740 | "file_extension": ".py", 1741 | "mimetype": "text/x-python", 1742 | "name": "python", 1743 | "nbconvert_exporter": "python", 1744 | "pygments_lexer": "ipython2", 1745 | "version": "2.7.11" 1746 | } 1747 | }, 1748 | "nbformat": 4, 1749 | "nbformat_minor": 0 1750 | } 1751 | -------------------------------------------------------------------------------- /notebooks/07_sklearn_unstructured_data_face_recognition.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Example from Image Processing" 8 | ] 9 | }, 10 | { 11 | "cell_type": "code", 12 | "execution_count": null, 13 | "metadata": { 14 | "collapsed": false 15 | }, 16 | "outputs": [], 17 | "source": [ 18 | "%matplotlib inline\n", 19 | "import numpy as np\n", 20 | "from matplotlib import pyplot as plt" 21 | ] 22 | }, 23 | { 24 | "cell_type": "markdown", 25 | "metadata": {}, 26 | "source": [ 27 | "Here we'll take a look at a simple facial recognition example.\n", 28 | "Ideally, we would use a dataset consisting of a\n", 29 | "subset of the [Labeled Faces in the Wild](http://vis-www.cs.umass.edu/lfw/)\n", 30 | "data that is available within scikit-learn with the 'datasets.fetch_lfw_people' function. However, this is a relatively large download (~200MB) so we will do the tutorial on a simpler, less rich dataset. Feel free to explore the LFW dataset at home." 31 | ] 32 | }, 33 | { 34 | "cell_type": "code", 35 | "execution_count": null, 36 | "metadata": { 37 | "collapsed": false 38 | }, 39 | "outputs": [], 40 | "source": [ 41 | "from sklearn import datasets\n", 42 | "faces = datasets.fetch_olivetti_faces()\n", 43 | "faces.data.shape" 44 | ] 45 | }, 46 | { 47 | "cell_type": "markdown", 48 | "metadata": {}, 49 | "source": [ 50 | "Let's visualize these faces to see what we're working with:" 51 | ] 52 | }, 53 | { 54 | "cell_type": "code", 55 | "execution_count": null, 56 | "metadata": { 57 | "collapsed": false 58 | }, 59 | "outputs": [], 60 | "source": [ 61 | "fig = plt.figure(figsize=(8, 6))\n", 62 | "# plot several images\n", 63 | "for i in range(15):\n", 64 | " ax = fig.add_subplot(3, 5, i + 1, xticks=[], yticks=[])\n", 65 | " ax.imshow(faces.images[i], cmap=plt.cm.bone)" 66 | ] 67 | }, 68 | { 69 | "cell_type": "markdown", 70 | "metadata": {}, 71 | "source": [ 72 | "One thing to note is that these faces have already been localized and scaled\n", 73 | "to a common size. 
This is an important preprocessing piece for facial\n", 74 | "recognition, and is a process that can require a large collection of training\n", 75 | "data. This can be done in scikit-learn, but the challenge is gathering a\n", 76 | "sufficient amount of training data for the algorithm to work\n", 77 | "\n", 78 | "Fortunately, this piece is common enough that it has been done. One good\n", 79 | "resource is [OpenCV](http://opencv.willowgarage.com/wiki/FaceRecognition), the\n", 80 | "*Open Computer Vision Library*." 81 | ] 82 | }, 83 | { 84 | "cell_type": "markdown", 85 | "metadata": {}, 86 | "source": [ 87 | "We'll perform a Support Vector classification of the images. We'll\n", 88 | "do a typical train-test split on the images to make this happen:" 89 | ] 90 | }, 91 | { 92 | "cell_type": "code", 93 | "execution_count": null, 94 | "metadata": { 95 | "collapsed": false 96 | }, 97 | "outputs": [], 98 | "source": [ 99 | "from sklearn.cross_validation import train_test_split\n", 100 | "X_train, X_test, y_train, y_test = train_test_split(faces.data,\n", 101 | " faces.target, random_state=0)\n", 102 | "\n", 103 | "print(X_train.shape, X_test.shape)" 104 | ] 105 | }, 106 | { 107 | "cell_type": "markdown", 108 | "metadata": {}, 109 | "source": [ 110 | "## Preprocessing: Principal Component Analysis" 111 | ] 112 | }, 113 | { 114 | "cell_type": "markdown", 115 | "metadata": {}, 116 | "source": [ 117 | "1850 dimensions is a lot for SVM. We can use PCA to reduce these 1850 features to a manageable\n", 118 | "size, while maintaining most of the information in the dataset. Here it is useful to use a variant\n", 119 | "of PCA called ``RandomizedPCA``, which is an approximation of PCA that can be much faster for large\n", 120 | "datasets. The interface is the same as the normal PCA we saw earlier:" 121 | ] 122 | }, 123 | { 124 | "cell_type": "code", 125 | "execution_count": null, 126 | "metadata": { 127 | "collapsed": false 128 | }, 129 | "outputs": [], 130 | "source": [ 131 | "from sklearn import decomposition\n", 132 | "pca = decomposition.RandomizedPCA(n_components=150, whiten=True)\n", 133 | "pca.fit(X_train)" 134 | ] 135 | }, 136 | { 137 | "cell_type": "markdown", 138 | "metadata": {}, 139 | "source": [ 140 | "One interesting part of PCA is that it computes the \"mean\" face, which can be\n", 141 | "interesting to examine:" 142 | ] 143 | }, 144 | { 145 | "cell_type": "code", 146 | "execution_count": null, 147 | "metadata": { 148 | "collapsed": false 149 | }, 150 | "outputs": [], 151 | "source": [ 152 | "plt.imshow(pca.mean_.reshape(faces.images[0].shape),\n", 153 | " cmap=plt.cm.bone)" 154 | ] 155 | }, 156 | { 157 | "cell_type": "markdown", 158 | "metadata": {}, 159 | "source": [ 160 | "The principal components measure deviations about this mean along orthogonal axes.\n", 161 | "It is also interesting to visualize these principal components:" 162 | ] 163 | }, 164 | { 165 | "cell_type": "code", 166 | "execution_count": null, 167 | "metadata": { 168 | "collapsed": false 169 | }, 170 | "outputs": [], 171 | "source": [ 172 | "print(pca.components_.shape)" 173 | ] 174 | }, 175 | { 176 | "cell_type": "code", 177 | "execution_count": null, 178 | "metadata": { 179 | "collapsed": false 180 | }, 181 | "outputs": [], 182 | "source": [ 183 | "fig = plt.figure(figsize=(16, 6))\n", 184 | "for i in range(30):\n", 185 | " ax = fig.add_subplot(3, 10, i + 1, xticks=[], yticks=[])\n", 186 | " ax.imshow(pca.components_[i].reshape(faces.images[0].shape), cmap=plt.cm.bone)" 187 | ] 188 | }, 189 | { 190 | "cell_type": 
"markdown", 191 | "metadata": {}, 192 | "source": [ 193 | "The components (\"eigenfaces\") are ordered by their importance from top-left to bottom-right.\n", 194 | "We see that the first few components seem to primarily take care of lighting\n", 195 | "conditions; the remaining components pull out certain identifying features:\n", 196 | "the nose, eyes, eyebrows, etc." 197 | ] 198 | }, 199 | { 200 | "cell_type": "markdown", 201 | "metadata": {}, 202 | "source": [ 203 | "With this projection computed, we can now project our original training\n", 204 | "and test data onto the PCA basis:" 205 | ] 206 | }, 207 | { 208 | "cell_type": "code", 209 | "execution_count": null, 210 | "metadata": { 211 | "collapsed": false 212 | }, 213 | "outputs": [], 214 | "source": [ 215 | "X_train_pca = pca.transform(X_train)\n", 216 | "X_test_pca = pca.transform(X_test)" 217 | ] 218 | }, 219 | { 220 | "cell_type": "code", 221 | "execution_count": null, 222 | "metadata": { 223 | "collapsed": false 224 | }, 225 | "outputs": [], 226 | "source": [ 227 | "print(X_train_pca.shape)\n", 228 | "print(X_test_pca.shape)" 229 | ] 230 | }, 231 | { 232 | "cell_type": "markdown", 233 | "metadata": {}, 234 | "source": [ 235 | "These projected components correspond to factors in a linear combination of\n", 236 | "component images such that the combination approaches the original face." 237 | ] 238 | }, 239 | { 240 | "cell_type": "markdown", 241 | "metadata": {}, 242 | "source": [ 243 | "## Doing the Learning: Support Vector Machines" 244 | ] 245 | }, 246 | { 247 | "cell_type": "markdown", 248 | "metadata": {}, 249 | "source": [ 250 | "Now we'll perform support-vector-machine classification on this reduced dataset:" 251 | ] 252 | }, 253 | { 254 | "cell_type": "code", 255 | "execution_count": null, 256 | "metadata": { 257 | "collapsed": false 258 | }, 259 | "outputs": [], 260 | "source": [ 261 | "from sklearn import svm\n", 262 | "clf = svm.SVC(C=5., gamma=0.001)\n", 263 | "clf.fit(X_train_pca, y_train)" 264 | ] 265 | }, 266 | { 267 | "cell_type": "markdown", 268 | "metadata": {}, 269 | "source": [ 270 | "Finally, we can evaluate how well this classification did. First, we might plot a\n", 271 | "few of the test-cases with the labels learned from the training set:" 272 | ] 273 | }, 274 | { 275 | "cell_type": "code", 276 | "execution_count": null, 277 | "metadata": { 278 | "collapsed": false 279 | }, 280 | "outputs": [], 281 | "source": [ 282 | "fig = plt.figure(figsize=(8, 6))\n", 283 | "for i in range(15):\n", 284 | " ax = fig.add_subplot(3, 5, i + 1, xticks=[], yticks=[])\n", 285 | " ax.imshow(X_test[i].reshape(faces.images[0].shape),\n", 286 | " cmap=plt.cm.bone)\n", 287 | " y_pred = clf.predict(X_test_pca[i])[0]\n", 288 | " color = ('black' if y_pred == y_test[i]\n", 289 | " else 'red')\n", 290 | " ax.set_title(y_pred, fontsize='small', color=color)" 291 | ] 292 | }, 293 | { 294 | "cell_type": "markdown", 295 | "metadata": {}, 296 | "source": [ 297 | "The classifier is correct on an impressive number of images given the simplicity\n", 298 | "of its learning model! Using a linear classifier on 150 features derived from\n", 299 | "the pixel-level data, the algorithm correctly identifies a large number of the\n", 300 | "people in the images.\n", 301 | "\n", 302 | "Again, we can\n", 303 | "quantify this effectiveness using one of several measures from the ``sklearn.metrics``\n", 304 | "module. 
First we can do the classification report, which shows the precision,\n", 305 | "recall and other measures of the \"goodness\" of the classification:" 306 | ] 307 | }, 308 | { 309 | "cell_type": "code", 310 | "execution_count": null, 311 | "metadata": { 312 | "collapsed": false 313 | }, 314 | "outputs": [], 315 | "source": [ 316 | "from sklearn import metrics\n", 317 | "y_pred = clf.predict(X_test_pca)\n", 318 | "print(metrics.classification_report(y_test, y_pred))" 319 | ] 320 | }, 321 | { 322 | "cell_type": "markdown", 323 | "metadata": {}, 324 | "source": [ 325 | "Another interesting metric is the *confusion matrix*, which indicates how often\n", 326 | "any two items are mixed-up. The confusion matrix of a perfect classifier\n", 327 | "would only have nonzero entries on the diagonal, with zeros on the off-diagonal." 328 | ] 329 | }, 330 | { 331 | "cell_type": "code", 332 | "execution_count": null, 333 | "metadata": { 334 | "collapsed": false 335 | }, 336 | "outputs": [], 337 | "source": [ 338 | "print(metrics.confusion_matrix(y_test, y_pred))" 339 | ] 340 | }, 341 | { 342 | "cell_type": "code", 343 | "execution_count": null, 344 | "metadata": { 345 | "collapsed": false 346 | }, 347 | "outputs": [], 348 | "source": [ 349 | "print(metrics.f1_score(y_test, y_pred))" 350 | ] 351 | }, 352 | { 353 | "cell_type": "markdown", 354 | "metadata": {}, 355 | "source": [ 356 | "## Pipelining" 357 | ] 358 | }, 359 | { 360 | "cell_type": "markdown", 361 | "metadata": {}, 362 | "source": [ 363 | "Above we used PCA as a pre-processing step before applying our support vector machine classifier.\n", 364 | "Plugging the output of one estimator directly into the input of a second estimator is a commonly\n", 365 | "used pattern; for this reason scikit-learn provides a ``Pipeline`` object which automates this\n", 366 | "process. The above problem can be re-expressed as a pipeline as follows:" 367 | ] 368 | }, 369 | { 370 | "cell_type": "code", 371 | "execution_count": null, 372 | "metadata": { 373 | "collapsed": false 374 | }, 375 | "outputs": [], 376 | "source": [ 377 | "from sklearn.pipeline import Pipeline" 378 | ] 379 | }, 380 | { 381 | "cell_type": "code", 382 | "execution_count": null, 383 | "metadata": { 384 | "collapsed": false 385 | }, 386 | "outputs": [], 387 | "source": [ 388 | "clf = Pipeline([('pca', decomposition.RandomizedPCA(n_components=150, whiten=True)),\n", 389 | " ('svm', svm.LinearSVC(C=1.0))])" 390 | ] 391 | }, 392 | { 393 | "cell_type": "code", 394 | "execution_count": null, 395 | "metadata": { 396 | "collapsed": false 397 | }, 398 | "outputs": [], 399 | "source": [ 400 | "clf.fit(X_train, y_train)" 401 | ] 402 | }, 403 | { 404 | "cell_type": "code", 405 | "execution_count": null, 406 | "metadata": { 407 | "collapsed": false 408 | }, 409 | "outputs": [], 410 | "source": [ 411 | "y_pred = clf.predict(X_test)" 412 | ] 413 | }, 414 | { 415 | "cell_type": "code", 416 | "execution_count": null, 417 | "metadata": { 418 | "collapsed": false 419 | }, 420 | "outputs": [], 421 | "source": [ 422 | "print(metrics.confusion_matrix(y_pred, y_test))" 423 | ] 424 | }, 425 | { 426 | "cell_type": "markdown", 427 | "metadata": {}, 428 | "source": [ 429 | "The results are not identical because we used the randomized version of the PCA -- because the\n", 430 | "projection varies slightly each time, the results vary slightly as well." 
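(Editor's aside, not part of the original notebook cells.) If reproducible results are wanted, the randomized projection can be seeded through its `random_state` parameter; a minimal sketch, reusing the estimators and train/test arrays defined above:

```python
# Seeding the randomized PCA makes the projection, and hence the downstream
# predictions, repeatable from run to run.
clf = Pipeline([('pca', decomposition.RandomizedPCA(n_components=150,
                                                    whiten=True,
                                                    random_state=0)),
                ('svm', svm.LinearSVC(C=1.0))])
clf.fit(X_train, y_train)
print(metrics.confusion_matrix(y_test, clf.predict(X_test)))
```

Note that `LinearSVC` has a `random_state` parameter of its own, which can also be fixed if full determinism is needed.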
431 | ] 432 | }, 433 | { 434 | "cell_type": "markdown", 435 | "metadata": {}, 436 | "source": [ 437 | "## A Quick Note on Facial Recognition" 438 | ] 439 | }, 440 | { 441 | "cell_type": "markdown", 442 | "metadata": {}, 443 | "source": [ 444 | "Here we have used PCA \"eigenfaces\" as a pre-processing step for facial recognition.\n", 445 | "The reason we chose this is because PCA is a broadly-applicable technique, which can\n", 446 | "be useful for a wide array of data types. Research in the field of facial recognition\n", 447 | "in particular, however, has shown that other more specific feature extraction methods\n", 448 | "are can be much more effective." 449 | ] 450 | } 451 | ], 452 | "metadata": { 453 | "kernelspec": { 454 | "display_name": "Python 2", 455 | "language": "python", 456 | "name": "python2" 457 | }, 458 | "language_info": { 459 | "codemirror_mode": { 460 | "name": "ipython", 461 | "version": 2 462 | }, 463 | "file_extension": ".py", 464 | "mimetype": "text/x-python", 465 | "name": "python", 466 | "nbconvert_exporter": "python", 467 | "pygments_lexer": "ipython2", 468 | "version": "2.7.11" 469 | } 470 | }, 471 | "nbformat": 4, 472 | "nbformat_minor": 0 473 | } 474 | -------------------------------------------------------------------------------- /notebooks/08_sklearn_pandas_practical.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 1, 6 | "metadata": { 7 | "collapsed": false 8 | }, 9 | "outputs": [], 10 | "source": [ 11 | "%matplotlib inline\n", 12 | "import numpy as np\n", 13 | "import pandas as pd\n", 14 | "import matplotlib.pyplot as plt" 15 | ] 16 | }, 17 | { 18 | "cell_type": "markdown", 19 | "metadata": {}, 20 | "source": [ 21 | "## Hands on scikit-learn / pandas\n", 22 | "\n", 23 | "You mission is to build the best predictive model using these data:\n", 24 | "\n", 25 | "https://dl.dropboxusercontent.com/u/2140486/data/adult_train.csv\n", 26 | "\n", 27 | "The prediction task is to determine whether a person makes over 50K a year.\n", 28 | "\n", 29 | "What is expected:\n", 30 | "\n", 31 | " - Read the data using Pandas\n", 32 | " - Understand what type of machine learning problem you are facing? regression? classification? binary? multi-class?\n", 33 | " - Identify potential issues such as missing values, quantitative / continous or categorical variables, class imbalance\n", 34 | " - Visualize the distributions of the features\n", 35 | " - Encode properly the categorical variables\n", 36 | " - Propose a predictive model that you will evaluate using ROC-AUC metric\n", 37 | " - You will evaluate the quality of the model with cross-validation.\n", 38 | "\n", 39 | "```\n", 40 | "Attribute Information:\n", 41 | "\n", 42 | "Listing of attributes: \n", 43 | "\n", 44 | "age: continuous. \n", 45 | "workclass: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked. \n", 46 | "fnlwgt: continuous. \n", 47 | "education: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool. \n", 48 | "education-num: continuous. \n", 49 | "marital-status: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse. 
\n", 50 | "occupation: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces. \n", 51 | "relationship: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried. \n", 52 | "race: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black. \n", 53 | "sex: Female, Male. \n", 54 | "capital-gain: continuous. \n", 55 | "capital-loss: continuous. \n", 56 | "hours-per-week: continuous. \n", 57 | "native-country: United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands.\n", 58 | "```\n", 59 | "\n", 60 | "The floor is yours !" 61 | ] 62 | } 63 | ], 64 | "metadata": { 65 | "kernelspec": { 66 | "display_name": "Python 2", 67 | "language": "python", 68 | "name": "python2" 69 | }, 70 | "language_info": { 71 | "codemirror_mode": { 72 | "name": "ipython", 73 | "version": 2 74 | }, 75 | "file_extension": ".py", 76 | "mimetype": "text/x-python", 77 | "name": "python", 78 | "nbconvert_exporter": "python", 79 | "pygments_lexer": "ipython2", 80 | "version": "2.7.11" 81 | } 82 | }, 83 | "nbformat": 4, 84 | "nbformat_minor": 0 85 | } 86 | -------------------------------------------------------------------------------- /notebooks/figures/ML_flow_chart.py: -------------------------------------------------------------------------------- 1 | """ 2 | Tutorial Diagrams 3 | ----------------- 4 | 5 | This script plots the flow-charts used in the scikit-learn tutorials. 
6 | """ 7 | 8 | import numpy as np 9 | import pylab as pl 10 | from matplotlib.patches import Circle, Rectangle, Polygon, Arrow, FancyArrow 11 | 12 | def create_base(box_bg = '#CCCCCC', 13 | arrow1 = '#88CCFF', 14 | arrow2 = '#88FF88', 15 | supervised=True): 16 | fig = pl.figure(figsize=(9, 6), facecolor='w') 17 | ax = pl.axes((0, 0, 1, 1), 18 | xticks=[], yticks=[], frameon=False) 19 | ax.set_xlim(0, 9) 20 | ax.set_ylim(0, 6) 21 | 22 | patches = [Rectangle((0.3, 3.6), 1.5, 1.8, zorder=1, fc=box_bg), 23 | Rectangle((0.5, 3.8), 1.5, 1.8, zorder=2, fc=box_bg), 24 | Rectangle((0.7, 4.0), 1.5, 1.8, zorder=3, fc=box_bg), 25 | 26 | Rectangle((2.9, 3.6), 0.2, 1.8, fc=box_bg), 27 | Rectangle((3.1, 3.8), 0.2, 1.8, fc=box_bg), 28 | Rectangle((3.3, 4.0), 0.2, 1.8, fc=box_bg), 29 | 30 | Rectangle((0.3, 0.2), 1.5, 1.8, fc=box_bg), 31 | 32 | Rectangle((2.9, 0.2), 0.2, 1.8, fc=box_bg), 33 | 34 | Circle((5.5, 3.5), 1.0, fc=box_bg), 35 | 36 | Polygon([[5.5, 1.7], 37 | [6.1, 1.1], 38 | [5.5, 0.5], 39 | [4.9, 1.1]], fc=box_bg), 40 | 41 | FancyArrow(2.3, 4.6, 0.35, 0, fc=arrow1, 42 | width=0.25, head_width=0.5, head_length=0.2), 43 | 44 | FancyArrow(3.75, 4.2, 0.5, -0.2, fc=arrow1, 45 | width=0.25, head_width=0.5, head_length=0.2), 46 | 47 | FancyArrow(5.5, 2.4, 0, -0.4, fc=arrow1, 48 | width=0.25, head_width=0.5, head_length=0.2), 49 | 50 | FancyArrow(2.0, 1.1, 0.5, 0, fc=arrow2, 51 | width=0.25, head_width=0.5, head_length=0.2), 52 | 53 | FancyArrow(3.3, 1.1, 1.3, 0, fc=arrow2, 54 | width=0.25, head_width=0.5, head_length=0.2), 55 | 56 | FancyArrow(6.2, 1.1, 0.8, 0, fc=arrow2, 57 | width=0.25, head_width=0.5, head_length=0.2)] 58 | 59 | if supervised: 60 | patches += [Rectangle((0.3, 2.4), 1.5, 0.5, zorder=1, fc=box_bg), 61 | Rectangle((0.5, 2.6), 1.5, 0.5, zorder=2, fc=box_bg), 62 | Rectangle((0.7, 2.8), 1.5, 0.5, zorder=3, fc=box_bg), 63 | FancyArrow(2.3, 2.9, 2.0, 0, fc=arrow1, 64 | width=0.25, head_width=0.5, head_length=0.2), 65 | Rectangle((7.3, 0.85), 1.5, 0.5, fc=box_bg)] 66 | else: 67 | patches += [Rectangle((7.3, 0.2), 1.5, 1.8, fc=box_bg)] 68 | 69 | for p in patches: 70 | ax.add_patch(p) 71 | 72 | pl.text(1.45, 4.9, "Training\nText,\nDocuments,\nImages,\netc.", 73 | ha='center', va='center', fontsize=14) 74 | 75 | pl.text(3.6, 4.9, "Feature\nVectors", 76 | ha='left', va='center', fontsize=14) 77 | 78 | pl.text(5.5, 3.5, "Machine\nLearning\nAlgorithm", 79 | ha='center', va='center', fontsize=14) 80 | 81 | pl.text(1.05, 1.1, "New Text,\nDocument,\nImage,\netc.", 82 | ha='center', va='center', fontsize=14) 83 | 84 | pl.text(3.3, 1.7, "Feature\nVector", 85 | ha='left', va='center', fontsize=14) 86 | 87 | pl.text(5.5, 1.1, "Predictive\nModel", 88 | ha='center', va='center', fontsize=12) 89 | 90 | if supervised: 91 | pl.text(1.45, 3.05, "Labels", 92 | ha='center', va='center', fontsize=14) 93 | 94 | pl.text(8.05, 1.1, "Expected\nLabel", 95 | ha='center', va='center', fontsize=14) 96 | pl.text(8.8, 5.8, "Supervised Learning Model", 97 | ha='right', va='top', fontsize=18) 98 | 99 | else: 100 | pl.text(8.05, 1.1, 101 | "Likelihood\nor Cluster ID\nor Better\nRepresentation", 102 | ha='center', va='center', fontsize=12) 103 | pl.text(8.8, 5.8, "Unsupervised Learning Model", 104 | ha='right', va='top', fontsize=18) 105 | 106 | 107 | 108 | def plot_supervised_chart(annotate=False): 109 | create_base(supervised=True) 110 | if annotate: 111 | fontdict = dict(color='r', weight='bold', size=14) 112 | pl.text(1.9, 4.55, 'X = vec.fit_transform(input)', 113 | fontdict=fontdict, 114 | rotation=20, ha='left', 
va='bottom') 115 | pl.text(3.7, 3.2, 'clf.fit(X, y)', 116 | fontdict=fontdict, 117 | rotation=20, ha='left', va='bottom') 118 | pl.text(1.7, 1.5, 'X_new = vec.transform(input)', 119 | fontdict=fontdict, 120 | rotation=20, ha='left', va='bottom') 121 | pl.text(6.1, 1.5, 'y_new = clf.predict(X_new)', 122 | fontdict=fontdict, 123 | rotation=20, ha='left', va='bottom') 124 | 125 | def plot_unsupervised_chart(): 126 | create_base(supervised=False) 127 | 128 | 129 | if __name__ == '__main__': 130 | plot_supervised_chart(False) 131 | plot_supervised_chart(True) 132 | plot_unsupervised_chart() 133 | pl.show() 134 | 135 | 136 | -------------------------------------------------------------------------------- /notebooks/figures/__init__.py: -------------------------------------------------------------------------------- 1 | from .sgd_separator import plot_sgd_separator 2 | from .linear_regression import plot_linear_regression 3 | from .ML_flow_chart import plot_supervised_chart, plot_unsupervised_chart 4 | from .bias_variance import plot_bias_variance 5 | -------------------------------------------------------------------------------- /notebooks/figures/bias_variance.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import matplotlib.pyplot as plt 3 | 4 | 5 | def test_func(x, err=0.5): 6 | return np.random.normal(10 - 1. / (x + 0.1), err) 7 | 8 | 9 | def compute_error(x, y, p): 10 | yfit = np.polyval(p, x) 11 | return np.sqrt(np.mean((y - yfit) ** 2)) 12 | 13 | 14 | def plot_bias_variance(N=8, random_seed=42, err=0.5): 15 | np.random.seed(random_seed) 16 | x = 10 ** np.linspace(-2, 0, N) 17 | y = test_func(x) 18 | 19 | xfit = np.linspace(-0.2, 1.2, 1000) 20 | 21 | titles = ['d = 1 (under-fit; high bias)', 22 | 'd = 2', 23 | 'd = 6 (over-fit; high variance)'] 24 | degrees = [1, 2, 6] 25 | 26 | fig = plt.figure(figsize = (9, 3.5)) 27 | fig.subplots_adjust(left = 0.06, right=0.98, 28 | bottom=0.15, top=0.85, 29 | wspace=0.05) 30 | for i, d in enumerate(degrees): 31 | ax = fig.add_subplot(131 + i, xticks=[], yticks=[]) 32 | ax.scatter(x, y, marker='x', c='k', s=50) 33 | 34 | p = np.polyfit(x, y, d) 35 | yfit = np.polyval(p, xfit) 36 | ax.plot(xfit, yfit, '-b') 37 | 38 | ax.set_xlim(-0.2, 1.2) 39 | ax.set_ylim(0, 12) 40 | ax.set_xlabel('house size') 41 | if i == 0: 42 | ax.set_ylabel('price') 43 | 44 | ax.set_title(titles[i]) 45 | 46 | if __name__ == '__main__': 47 | plot_bias_variance() 48 | plt.show() 49 | -------------------------------------------------------------------------------- /notebooks/figures/grid_search_cv_splits.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/agramfort/sklearn_pandas_intro/c4c148faab7747b3750ce10f332d4dde58019bfa/notebooks/figures/grid_search_cv_splits.png -------------------------------------------------------------------------------- /notebooks/figures/grid_search_parameters.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/agramfort/sklearn_pandas_intro/c4c148faab7747b3750ce10f332d4dde58019bfa/notebooks/figures/grid_search_parameters.png -------------------------------------------------------------------------------- /notebooks/figures/iris_setosa.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/agramfort/sklearn_pandas_intro/c4c148faab7747b3750ce10f332d4dde58019bfa/notebooks/figures/iris_setosa.jpg 
-------------------------------------------------------------------------------- /notebooks/figures/iris_versicolor.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/agramfort/sklearn_pandas_intro/c4c148faab7747b3750ce10f332d4dde58019bfa/notebooks/figures/iris_versicolor.jpg -------------------------------------------------------------------------------- /notebooks/figures/iris_virginica.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/agramfort/sklearn_pandas_intro/c4c148faab7747b3750ce10f332d4dde58019bfa/notebooks/figures/iris_virginica.jpg -------------------------------------------------------------------------------- /notebooks/figures/linear_regression.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import matplotlib.pyplot as plt 3 | from sklearn.linear_model import LinearRegression 4 | 5 | 6 | def plot_linear_regression(): 7 | a = 0.5 8 | b = 1.0 9 | 10 | # x from 0 to 10 11 | x = 30 * np.random.random(20) 12 | 13 | # y = a*x + b with noise 14 | y = a * x + b + np.random.normal(size=x.shape) 15 | 16 | # create a linear regression classifier 17 | clf = LinearRegression() 18 | clf.fit(x[:, None], y) 19 | 20 | # predict y from the data 21 | x_new = np.linspace(0, 30, 100) 22 | y_new = clf.predict(x_new[:, None]) 23 | 24 | # plot the results 25 | ax = plt.axes() 26 | ax.scatter(x, y) 27 | ax.plot(x_new, y_new) 28 | 29 | ax.set_xlabel('x') 30 | ax.set_ylabel('y') 31 | 32 | ax.axis('tight') 33 | 34 | 35 | if __name__ == '__main__': 36 | plot_linear_regression() 37 | plt.show() 38 | -------------------------------------------------------------------------------- /notebooks/figures/sgd_separator.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import matplotlib.pyplot as plt 3 | from sklearn.linear_model import SGDClassifier 4 | from sklearn.datasets.samples_generator import make_blobs 5 | 6 | 7 | def plot_sgd_separator(): 8 | # we create 50 separable points 9 | X, Y = make_blobs(n_samples=50, centers=2, 10 | random_state=0, cluster_std=0.60) 11 | 12 | # fit the model 13 | clf = SGDClassifier(loss="hinge", alpha=0.01, 14 | n_iter=200, fit_intercept=True) 15 | clf.fit(X, Y) 16 | 17 | # plot the line, the points, and the nearest vectors to the plane 18 | xx = np.linspace(-1, 5, 10) 19 | yy = np.linspace(-1, 5, 10) 20 | 21 | X1, X2 = np.meshgrid(xx, yy) 22 | Z = np.empty(X1.shape) 23 | for (i, j), val in np.ndenumerate(X1): 24 | x1 = val 25 | x2 = X2[i, j] 26 | p = clf.decision_function([[x1, x2]]) 27 | Z[i, j] = p[0] 28 | levels = [-1.0, 0.0, 1.0] 29 | linestyles = ['dashed', 'solid', 'dashed'] 30 | colors = 'k' 31 | 32 | ax = plt.axes() 33 | ax.contour(X1, X2, Z, levels, colors=colors, linestyles=linestyles) 34 | ax.scatter(X[:, 0], X[:, 1], c=Y, cmap=plt.cm.Paired) 35 | 36 | ax.axis('tight') 37 | 38 | 39 | if __name__ == '__main__': 40 | plot_sgd_separator() 41 | plt.show() 42 | -------------------------------------------------------------------------------- /notebooks/figures/supervised_scikit_learn.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/agramfort/sklearn_pandas_intro/c4c148faab7747b3750ce10f332d4dde58019bfa/notebooks/figures/supervised_scikit_learn.png -------------------------------------------------------------------------------- 
/notebooks/figures/svm_gui_frames.py: -------------------------------------------------------------------------------- 1 | """ 2 | Linear Model Example 3 | -------------------- 4 | 5 | This is an example plot from the tutorial which accompanies an explanation 6 | of the support vector machine GUI. 7 | """ 8 | 9 | import numpy as np 10 | import pylab as pl 11 | import matplotlib 12 | 13 | from sklearn import svm 14 | 15 | 16 | def linear_model(rseed=42, Npts=30): 17 | np.random.seed(rseed) 18 | 19 | 20 | data = np.random.normal(0, 10, (Npts, 2)) 21 | data[:Npts / 2] -= 15 22 | data[Npts / 2:] += 15 23 | 24 | labels = np.ones(Npts) 25 | labels[:Npts / 2] = -1 26 | 27 | return data, labels 28 | 29 | 30 | def nonlinear_model(rseed=42, Npts=30): 31 | radius = 40 * np.random.random(Npts) 32 | far_pts = radius > 20 33 | radius[far_pts] *= 1.2 34 | radius[~far_pts] *= 1.1 35 | 36 | theta = np.random.random(Npts) * np.pi * 2 37 | 38 | data = np.empty((Npts, 2)) 39 | data[:, 0] = radius * np.cos(theta) 40 | data[:, 1] = radius * np.sin(theta) 41 | 42 | labels = np.ones(Npts) 43 | labels[far_pts] = -1 44 | 45 | return data, labels 46 | 47 | 48 | def plot_linear_model(): 49 | X, y = linear_model() 50 | clf = svm.SVC(kernel='linear', 51 | gamma=0.01, coef0=0, degree=3) 52 | clf.fit(X, y) 53 | 54 | fig = pl.figure() 55 | ax = pl.subplot(111, xticks=[], yticks=[]) 56 | ax.scatter(X[:, 0], X[:, 1], c=y, cmap=pl.cm.bone) 57 | 58 | ax.scatter(clf.support_vectors_[:, 0], 59 | clf.support_vectors_[:, 1], 60 | s=80, edgecolors="k", facecolors="none") 61 | 62 | delta = 1 63 | y_min, y_max = -50, 50 64 | x_min, x_max = -50, 50 65 | x = np.arange(x_min, x_max + delta, delta) 66 | y = np.arange(y_min, y_max + delta, delta) 67 | X1, X2 = np.meshgrid(x, y) 68 | Z = clf.decision_function(np.c_[X1.ravel(), X2.ravel()]) 69 | Z = Z.reshape(X1.shape) 70 | 71 | levels = [-1.0, 0.0, 1.0] 72 | linestyles = ['dashed', 'solid', 'dashed'] 73 | colors = 'k' 74 | ax.contour(X1, X2, Z, levels, 75 | colors=colors, 76 | linestyles=linestyles) 77 | 78 | 79 | def plot_rbf_model(): 80 | X, y = nonlinear_model() 81 | clf = svm.SVC(kernel='rbf', 82 | gamma=0.001, coef0=0, degree=3) 83 | clf.fit(X, y) 84 | 85 | fig = pl.figure() 86 | ax = pl.subplot(111, xticks=[], yticks=[]) 87 | ax.scatter(X[:, 0], X[:, 1], c=y, cmap=pl.cm.bone, zorder=2) 88 | 89 | ax.scatter(clf.support_vectors_[:, 0], 90 | clf.support_vectors_[:, 1], 91 | s=80, edgecolors="k", facecolors="none") 92 | 93 | delta = 1 94 | y_min, y_max = -50, 50 95 | x_min, x_max = -50, 50 96 | x = np.arange(x_min, x_max + delta, delta) 97 | y = np.arange(y_min, y_max + delta, delta) 98 | X1, X2 = np.meshgrid(x, y) 99 | Z = clf.decision_function(np.c_[X1.ravel(), X2.ravel()]) 100 | Z = Z.reshape(X1.shape) 101 | 102 | levels = [-1.0, 0.0, 1.0] 103 | linestyles = ['dashed', 'solid', 'dashed'] 104 | colors = 'k' 105 | 106 | ax.contourf(X1, X2, Z, 10, 107 | cmap=matplotlib.cm.bone, 108 | origin='lower', 109 | alpha=0.85, zorder=1) 110 | ax.contour(X1, X2, Z, [0.0], 111 | colors='k', 112 | linestyles=['solid'], zorder=1) 113 | 114 | 115 | if __name__ == '__main__': 116 | plot_linear_model() 117 | plot_rbf_model() 118 | pl.show() 119 | 120 | -------------------------------------------------------------------------------- /notebooks/helpers.py: -------------------------------------------------------------------------------- 1 | """ 2 | Small helpers for code that is not shown in the notebooks 3 | """ 4 | 5 | from sklearn import neighbors, datasets, linear_model 6 | import pylab as pl 7 | 
import numpy as np 8 | from matplotlib.colors import ListedColormap 9 | 10 | # Create color maps for 3-class classification problem, as with iris 11 | cmap_light = ListedColormap(['#FFAAAA', '#AAFFAA', '#AAAAFF']) 12 | cmap_bold = ListedColormap(['#FF0000', '#00FF00', '#0000FF']) 13 | 14 | def plot_iris_knn(): 15 | iris = datasets.load_iris() 16 | X = iris.data[:, :2] # we only take the first two features. We could 17 | # avoid this ugly slicing by using a two-dim dataset 18 | y = iris.target 19 | 20 | knn = neighbors.KNeighborsClassifier(n_neighbors=3) 21 | knn.fit(X, y) 22 | 23 | x_min, x_max = X[:, 0].min() - .1, X[:, 0].max() + .1 24 | y_min, y_max = X[:, 1].min() - .1, X[:, 1].max() + .1 25 | xx, yy = np.meshgrid(np.linspace(x_min, x_max, 100), 26 | np.linspace(y_min, y_max, 100)) 27 | Z = knn.predict(np.c_[xx.ravel(), yy.ravel()]) 28 | 29 | # Put the result into a color plot 30 | Z = Z.reshape(xx.shape) 31 | pl.figure() 32 | pl.pcolormesh(xx, yy, Z, cmap=cmap_light) 33 | 34 | # Plot also the training points 35 | pl.scatter(X[:, 0], X[:, 1], c=y, cmap=cmap_bold) 36 | pl.xlabel('sepal length (cm)') 37 | pl.ylabel('sepal width (cm)') 38 | pl.axis('tight') 39 | 40 | 41 | def plot_polynomial_regression(): 42 | rng = np.random.RandomState(0) 43 | x = 2*rng.rand(100) - 1 44 | 45 | f = lambda t: 1.2 * t**2 + .1 * t**3 - .4 * t **5 - .5 * t ** 9 46 | y = f(x) + .4 * rng.normal(size=100) 47 | 48 | x_test = np.linspace(-1, 1, 100) 49 | 50 | pl.figure() 51 | pl.scatter(x, y, s=4) 52 | 53 | X = np.array([x**i for i in range(5)]).T 54 | X_test = np.array([x_test**i for i in range(5)]).T 55 | regr = linear_model.LinearRegression() 56 | regr.fit(X, y) 57 | pl.plot(x_test, regr.predict(X_test), label='4th order') 58 | 59 | X = np.array([x**i for i in range(10)]).T 60 | X_test = np.array([x_test**i for i in range(10)]).T 61 | regr = linear_model.LinearRegression() 62 | regr.fit(X, y) 63 | pl.plot(x_test, regr.predict(X_test), label='9th order') 64 | 65 | pl.legend(loc='best') 66 | pl.axis('tight') 67 | pl.title('Fitting a 4th and a 9th order polynomial') 68 | 69 | pl.figure() 70 | pl.scatter(x, y, s=4) 71 | pl.plot(x_test, f(x_test), label="truth") 72 | pl.axis('tight') 73 | pl.title('Ground truth (9th order polynomial)') 74 | 75 | 76 | -------------------------------------------------------------------------------- /notebooks/images/parallel_text_clf.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/agramfort/sklearn_pandas_intro/c4c148faab7747b3750ce10f332d4dde58019bfa/notebooks/images/parallel_text_clf.png -------------------------------------------------------------------------------- /notebooks/images/parallel_text_clf_average.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/agramfort/sklearn_pandas_intro/c4c148faab7747b3750ce10f332d4dde58019bfa/notebooks/images/parallel_text_clf_average.png -------------------------------------------------------------------------------- /notebooks/images/predictive_modeling_data_flow.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/agramfort/sklearn_pandas_intro/c4c148faab7747b3750ce10f332d4dde58019bfa/notebooks/images/predictive_modeling_data_flow.png -------------------------------------------------------------------------------- /notebooks/solutions/02A_faces_plot.py: -------------------------------------------------------------------------------- 1 | 
faces = fetch_olivetti_faces() 2 | 3 | # set up the figure 4 | fig = plt.figure(figsize=(6, 6)) # figure size in inches 5 | fig.subplots_adjust(left=0, right=1, bottom=0, top=1, hspace=0.05, wspace=0.05) 6 | 7 | # plot the faces: 8 | for i in range(64): 9 | ax = fig.add_subplot(8, 8, i + 1, xticks=[], yticks=[]) 10 | ax.imshow(faces.images[i], cmap=plt.cm.bone, interpolation='nearest') 11 | -------------------------------------------------------------------------------- /notebooks/solutions/04A_plot_logistic_regression_weights.py: -------------------------------------------------------------------------------- 1 | logreg_new = LogisticRegression(C=1).fit(rich_features_final, target) 2 | 3 | feature_names = rich_features_final.columns.values 4 | x = np.arange(len(feature_names)) 5 | plt.bar(x, logreg_new.coef_.ravel()) 6 | _ = plt.xticks(x + 0.5, feature_names, rotation=30) 7 | 8 | # Rich young women like Kate Winslet tend to survive the Titanic better 9 | # than poor men like Leonardo. 10 | -------------------------------------------------------------------------------- /notebooks/solutions/04B_more_categorical_variables.py: -------------------------------------------------------------------------------- 1 | features = pd.concat([data.get(['Fare', 'Age']), 2 | pd.get_dummies(data.Sex, prefix='Sex'), 3 | pd.get_dummies(data.Pclass, prefix='Pclass'), 4 | pd.get_dummies(data.Embarked, prefix='Embarked')], 5 | axis=1) 6 | features = features.drop('Sex_male', 1) 7 | features = features.fillna(features.dropna().median()) 8 | features.head(5) 9 | 10 | 11 | logreg = LogisticRegression(C=1) 12 | scores = cross_val_score(logreg, features, target, cv=5, scoring='accuracy') 13 | print("Logistic Regression CV scores:") 14 | print("min: {:.3f}, mean: {:.3f}, max: {:.3f}".format( 15 | scores.min(), scores.mean(), scores.max())) 16 | 17 | gb = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, 18 | subsample=.8, max_features=.5) 19 | scores = cross_val_score(gb, features, target, cv=5, n_jobs=4, 20 | scoring='accuracy') 21 | print("Gradient Boosting Trees CV scores:") 22 | print("min: {:.3f}, mean: {:.3f}, max: {:.3f}".format( 23 | scores.min(), scores.mean(), scores.max())) 24 | -------------------------------------------------------------------------------- /notebooks/solutions/04C_feature_importance.py: -------------------------------------------------------------------------------- 1 | gb_new = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, 2 | subsample=.8, max_features=.5) 3 | gb_new.fit(features, target) 4 | feature_names = features.columns.values 5 | x = np.arange(len(feature_names)) 6 | plt.bar(x, gb_new.feature_importances_) 7 | _ = plt.xticks(x + 0.5, feature_names, rotation=30) 8 | -------------------------------------------------------------------------------- /notebooks/solutions/05B_houses_plot.py: -------------------------------------------------------------------------------- 1 | for index, feature_name in enumerate(data.feature_names): 2 | plt.figure() 3 | plt.scatter(data.data[:, index], data.target) 4 | plt.ylabel('Price') 5 | plt.xlabel(feature_name) 6 | 7 | -------------------------------------------------------------------------------- /notebooks/solutions/05B_houses_regression.py: -------------------------------------------------------------------------------- 1 | from sklearn.ensemble import GradientBoostingRegressor 2 | 3 | clf = GradientBoostingRegressor() 4 | clf.fit(X_train, y_train) 5 | 6 | predicted = clf.predict(X_test) 7 | expected = y_test 8 
| 9 | plt.scatter(expected, predicted) 10 | plt.plot([0, 50], [0, 50], '--k') 11 | plt.axis('tight') 12 | plt.xlabel('True price ($1000s)') 13 | plt.ylabel('Predicted price ($1000s)') 14 | print("RMS:", np.sqrt(np.mean((predicted - expected) ** 2))) 15 | -------------------------------------------------------------------------------- /notebooks/solutions/05C_validation_exercise.py: -------------------------------------------------------------------------------- 1 | from sklearn.datasets import load_digits 2 | from sklearn.model_selection import train_test_split 3 | from sklearn.svm import LinearSVC 4 | from sklearn.naive_bayes import GaussianNB 5 | from sklearn.neighbors import KNeighborsClassifier 6 | from sklearn import metrics 7 | 8 | digits = load_digits() 9 | 10 | X = digits.data 11 | y = digits.target 12 | X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0) 13 | 14 | for Model in [LinearSVC, GaussianNB, KNeighborsClassifier]: 15 | clf = Model().fit(X_train, y_train) 16 | y_pred = clf.predict(X_test) 17 | print(Model.__name__, metrics.f1_score(y_test, y_pred, average='weighted')) 18 | 19 | print('------------------') 20 | 21 | # test SVC loss 22 | for loss in ['hinge', 'squared_hinge']: 23 | clf = LinearSVC(loss=loss).fit(X_train, y_train) 24 | y_pred = clf.predict(X_test) 25 | print("LinearSVC(loss='{0}')".format(loss), metrics.f1_score(y_test, y_pred, average='weighted')) 26 | 27 | print('-------------------') 28 | 29 | # test K-neighbors 30 | for n_neighbors in range(1, 11): 31 | clf = KNeighborsClassifier(n_neighbors=n_neighbors).fit(X_train, y_train) 32 | y_pred = clf.predict(X_test) 33 | print("KNeighbors(n_neighbors={}): {:.3f}".format( 34 | n_neighbors, metrics.f1_score(y_test, y_pred, average='weighted'))) 35 | -------------------------------------------------------------------------------- /notebooks/solutions/07B_basic_grid_search.py: -------------------------------------------------------------------------------- 1 | for Model in [Lasso, Ridge]: 2 | scores = [cross_val_score(Model(alpha), X, y, cv=3).mean() 3 | for alpha in alphas] 4 | plt.plot(alphas, scores, label=Model.__name__) 5 | plt.legend(loc='lower left') 6 | -------------------------------------------------------------------------------- /notebooks/solutions/07B_learning_curves.py: -------------------------------------------------------------------------------- 1 | from sklearn.metrics import explained_variance_score, mean_squared_error 2 | from sklearn.cross_validation import train_test_split 3 | 4 | def plot_learning_curve(model, err_func=explained_variance_score, N=300, n_runs=10, n_sizes=50, ylim=None): 5 | sizes = np.linspace(5, N, n_sizes).astype(int) 6 | train_err = np.zeros((n_runs, n_sizes)) 7 | validation_err = np.zeros((n_runs, n_sizes)) 8 | for i in range(n_runs): 9 | for j, size in enumerate(sizes): 10 | xtrain, xtest, ytrain, ytest = train_test_split( 11 | X, y, train_size=size, random_state=i) 12 | # Train on only the first `size` points 13 | model.fit(xtrain, ytrain) 14 | validation_err[i, j] = err_func(ytest, model.predict(xtest)) 15 | train_err[i, j] = err_func(ytrain, model.predict(xtrain)) 16 | 17 | plt.plot(sizes, validation_err.mean(axis=0), lw=2, label='validation') 18 | plt.plot(sizes, train_err.mean(axis=0), lw=2, label='training') 19 | 20 | plt.xlabel('traning set size') 21 | plt.ylabel(err_func.__name__.replace('_', ' ')) 22 | 23 | plt.grid(True) 24 | 25 | plt.legend(loc=0) 26 | 27 | plt.xlim(0, N-1) 28 | 29 | if ylim: 30 | plt.ylim(ylim) 31 | 32 | 
33 | plt.figure(figsize=(10, 8)) 34 | for i, model in enumerate([Lasso(0.01), Ridge(0.06)]): 35 | plt.subplot(221 + i) 36 | plot_learning_curve(model, ylim=(0, 1)) 37 | plt.title(model.__class__.__name__) 38 | 39 | plt.subplot(223 + i) 40 | plot_learning_curve(model, err_func=mean_squared_error, ylim=(0, 8000)) 41 | -------------------------------------------------------------------------------- /notebooks/solutions/08A_digits_projection.py: -------------------------------------------------------------------------------- 1 | from sklearn.decomposition import PCA 2 | from sklearn.manifold import Isomap, LocallyLinearEmbedding 3 | 4 | plt.figure(figsize=(14, 4)) 5 | for i, est in enumerate([PCA(n_components=2, whiten=True), 6 | Isomap(n_components=2, n_neighbors=10), 7 | LocallyLinearEmbedding(n_components=2, n_neighbors=10, method='modified')]): 8 | plt.subplot(131 + i) 9 | projection = est.fit_transform(digits.data) 10 | plt.scatter(projection[:, 0], projection[:, 1], c=digits.target) 11 | plt.title(est.__class__.__name__) 12 | --------------------------------------------------------------------------------