├── .gitignore
├── Diabetes_DataSet.ipynb
├── Images
│   ├── Petal-sepal.jpg
│   ├── SVMBoundary.png
│   ├── Setosa.jpg
│   ├── Versicolor.jpg
│   └── Virginica.jpg
├── InstructorNotebooks
│   ├── Images
│   └── Iris_DataSet_Instructor.ipynb
├── Iris_DataSet.ipynb
├── KMeans_Tutorial.ipynb
├── KNN_Tutorial.ipynb
├── LICENSE
├── LinearRegression_Tutorial.ipynb
├── README.md
└── SVM_Tutorial.ipynb
/.gitignore:
--------------------------------------------------------------------------------
1 | # Byte-compiled / optimized / DLL files
2 | __pycache__/
3 | *.py[cod]
4 |
5 | # C extensions
6 | *.so
7 |
8 | # Distribution / packaging
9 | .Python
10 | env/
11 | build/
12 | develop-eggs/
13 | dist/
14 | downloads/
15 | eggs/
16 | lib/
17 | lib64/
18 | parts/
19 | sdist/
20 | var/
21 | *.egg-info/
22 | .installed.cfg
23 | *.egg
24 |
25 | # PyInstaller
26 | # Usually these files are written by a python script from a template
27 | # before PyInstaller builds the exe, so as to inject date/other infos into it.
28 | *.manifest
29 | *.spec
30 |
31 | # Installer logs
32 | pip-log.txt
33 | pip-delete-this-directory.txt
34 |
35 | # Unit test / coverage reports
36 | htmlcov/
37 | .tox/
38 | .coverage
39 | .cache
40 | nosetests.xml
41 | coverage.xml
42 |
43 | # Translations
44 | *.mo
45 | *.pot
46 |
47 | # Django stuff:
48 | *.log
49 |
50 | # Sphinx documentation
51 | docs/_build/
52 |
53 | # PyBuilder
54 | target/
55 |
56 | # iPython Notebook
57 | .ipynb_checkpoints
58 |
--------------------------------------------------------------------------------
/Diabetes_DataSet.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "What is a dataset?\n",
8 | "===\n",
9 |     "A dataset is a collection of information (or data) that can be used by a computer. A dataset typically has some number of examples, where each example has features associated with it. Some datasets also include labels, which are identifying pieces of information that are of interest.\n",
10 | "\n",
11 | "What is an example?\n",
12 | "---\n",
13 | "An example is a single element of a dataset, typically a row (similar to a row in a table). Multiple examples are used to generalize trends about the dataset as a whole. When predicting the list price of a house, each house would be considered a single example.\n",
14 | "\n",
15 | "Examples are often referred to with the letter $x$.\n",
16 | "\n",
17 | "What is a feature?\n",
18 | "---\n",
19 | "A feature is a measurable characteristic that describes an example in a dataset. Features make up the information that a computer can use to learn and make predictions. If your examples are houses, your features might be: the square footage, the number of bedrooms, or the number of bathrooms. Some features are more useful than others. When predicting the list price of a house the number of bedrooms is a useful feature while the color of the walls is not, even though they both describe the house.\n",
20 | "\n",
21 | "Features are sometimes specified as a single element of an example, $x_i$\n",
22 | "\n",
23 | "What is a label?\n",
24 | "---\n",
25 | "A label identifies a piece of information about an example that is of particular interest. In machine learning, the label is the information we want the computer to learn to predict. In our housing example, the label would be the list price of the house.\n",
26 | "\n",
27 | "Labels can be continuous (e.g. price, length, width) or they can be a category label (e.g. color, species of plant/animal). They are typically specified by the letter $y$."
28 | ]
29 | },
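  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To make the notation concrete, the next cell builds a tiny, made-up housing dataset as numpy arrays. The numbers are purely illustrative; the point is only to show how examples sit in rows, features in columns, and labels in a separate array."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "# A hypothetical dataset: 3 examples (houses), each with 3 features\n",
    "# (square footage, number of bedrooms, number of bathrooms)\n",
    "X_toy = np.array([[2100, 3, 2],\n",
    "                  [1600, 2, 1],\n",
    "                  [2800, 4, 3]])\n",
    "\n",
    "# One label (list price) per example -- these values are made up\n",
    "y_toy = np.array([350000, 240000, 410000])\n",
    "\n",
    "print('Examples (rows of X): ' + str(X_toy.shape[0]))\n",
    "print('Features per example (columns of X): ' + str(X_toy.shape[1]))\n",
    "print('Labels (entries of y): ' + str(y_toy.shape[0]))"
   ]
  },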
30 | {
31 | "cell_type": "markdown",
32 | "metadata": {},
33 | "source": [
34 | "The Diabetes Dataset\n",
35 | "===\n",
36 | "\n",
37 | "Here, we use the Diabetes dataset, available through scikit-learn. This dataset contains information related to specific patients and disease progression of diabetes.\n",
38 | "\n",
39 | "Examples\n",
40 | "---\n",
41 |     "The dataset consists of 442 examples, each representing an individual diabetes patient.\n",
42 | "\n",
43 | "Features\n",
44 | "---\n",
45 |     "The dataset contains 10 features: age, sex, body mass index, average blood pressure, and six blood serum measurements.\n",
46 | "\n",
47 | "Target\n",
48 | "---\n",
49 | "The target is a quantitative measure of disease progression after one year.\n",
50 | "\n",
51 | "Our goal\n",
52 | "===\n",
53 | "The goal, for this dataset, is to train a computer to predict the progression of diabetes after one year."
54 | ]
55 | },
56 | {
57 | "cell_type": "markdown",
58 | "metadata": {},
59 | "source": [
60 | "Setup\n",
61 | "===\n",
62 |     "Tell matplotlib to print figures in the notebook. Then import numpy (for numerical data), pyplot (for plotting figures), and datasets (to download the diabetes dataset from scikit-learn). Also import colormaps to customize plot coloring and Normalize to normalize data for use with colormaps."
63 | ]
64 | },
65 | {
66 | "cell_type": "code",
67 | "execution_count": null,
68 | "metadata": {
69 | "collapsed": true
70 | },
71 | "outputs": [],
72 | "source": [
73 | "# Print figures in the notebook\n",
74 | "%matplotlib inline \n",
75 | "\n",
76 | "import numpy as np\n",
77 | "import matplotlib.pyplot as plt\n",
78 | "from sklearn import datasets # Import datasets from scikit-learn\n",
79 | "import matplotlib.cm as cm\n",
80 | "from matplotlib.colors import Normalize"
81 | ]
82 | },
83 | {
84 | "cell_type": "markdown",
85 | "metadata": {},
86 | "source": [
87 | "Import the dataset\n",
88 | "===\n",
89 | "Import the dataset and store it to a variable called diabetes. This dataset is similar to a python dictionary, with the keys: ['DESCR', 'target', 'data', 'feature_names']\n",
90 | "\n",
91 | "The data features are stored in diabetes.data, where each row is an example from a single patient, and each column is a single feature. The feature names are stored in diabetes.feature_names. Target values are stored in diabetes.target."
92 | ]
93 | },
94 | {
95 | "cell_type": "code",
96 | "execution_count": null,
97 | "metadata": {},
98 | "outputs": [],
99 | "source": [
100 | "# Import some data to play with\n",
101 | "diabetes = datasets.load_diabetes()\n",
102 | "\n",
103 | "# List the data keys\n",
104 | "print('Keys: ' + str(diabetes.keys()))\n",
105 | "print('Feature names: ' + str(diabetes.feature_names))\n",
106 | "print('')\n",
107 | "\n",
108 | "# Store the labels (y), features (X), and feature names\n",
109 | "y = diabetes.target # Labels are stored in y as numbers\n",
110 | "X = diabetes.data\n",
111 | "featureNames = diabetes.feature_names\n",
112 | "\n",
113 | "# Show the first five examples\n",
114 | "X[:5,:]"
115 | ]
116 | },
117 | {
118 | "cell_type": "markdown",
119 | "metadata": {},
120 | "source": [
121 | "Visualizing the data\n",
122 | "===\n",
123 | "Visualizing the data can help us better understand the data and make use of it. The following block of code will create a plot of serum measurement 1 (x-axis) vs serum measurement 6 (y-axis). The level of diabetes progression has been mapped to fit in the [0,1] range and is shown as a color scale."
124 | ]
125 | },
126 | {
127 | "cell_type": "code",
128 | "execution_count": null,
129 | "metadata": {},
130 | "outputs": [],
131 | "source": [
132 | "norm = Normalize(vmin=y.min(), vmax=y.max()) # need to normalize target to [0,1] range for use with colormap\n",
133 | "plt.scatter(X[:, 4], X[:, 9], c=norm(y), cmap=cm.bone_r)\n",
134 | "plt.colorbar()\n",
135 | "plt.xlabel('Serum Measurement 1 (s1)')\n",
136 | "plt.ylabel('Serum Measurement 6 (s6)')\n",
137 | "\n",
138 | "plt.show()\n"
139 | ]
140 | },
141 | {
142 | "cell_type": "markdown",
143 | "metadata": {},
144 | "source": [
145 | "Make your own plot\n",
146 | "===\n",
147 | "Below, try making your own plots. First, modify the previous code to create a similar plot, comparing different pairs of features. You can start by copying and pasting the previous block of code to the cell below, and modifying it to work."
148 | ]
149 | },
150 | {
151 | "cell_type": "code",
152 | "execution_count": null,
153 | "metadata": {
154 | "collapsed": true
155 | },
156 | "outputs": [],
157 | "source": [
158 | "# Put your code here!"
159 | ]
160 | },
161 | {
162 | "cell_type": "markdown",
163 | "metadata": {},
164 | "source": [
165 | "Training and Testing Sets\n",
166 | "===\n",
167 | "\n",
168 | "In order to evaluate our data properly, we need to divide our dataset into training and testing sets.\n",
169 | "* **Training Set** - Portion of the data used to train a machine learning algorithm. These are the examples that the computer will learn from in order to try to predict data labels.\n",
170 | "* **Testing Set** - Portion of the data (usually 10-30%) not used in training, used to evaluate performance. The computer does not \"see\" this data while learning, but tries to guess the data labels. We can then determine the accuracy of our method by determining how many examples it got correct.\n",
171 |     "* **Validation Set** - (Optional) A third section of data used for parameter tuning or classifier selection. When selecting among many classifiers, or when a classifier parameter must be adjusted (tuned), this data is used like a test set to select the best parameter value(s). The final performance is then evaluated on the remaining, previously unused, testing set.\n",
172 | "\n",
173 | "Creating training and testing sets\n",
174 | "---\n",
175 |     "Below, we create a training and testing set from the diabetes dataset using the [train_test_split()](http://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.train_test_split.html) function. "
176 | ]
177 | },
178 | {
179 | "cell_type": "code",
180 | "execution_count": null,
181 | "metadata": {},
182 | "outputs": [],
183 | "source": [
184 | "from sklearn.model_selection import train_test_split\n",
185 | "\n",
186 | "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)\n",
187 | "\n",
188 | "print('Original dataset size: ' + str(X.shape))\n",
189 | "print('Training dataset size: ' + str(X_train.shape))\n",
190 | "print('Test dataset size: ' + str(X_test.shape))"
191 | ]
192 | },
193 | {
194 | "cell_type": "markdown",
195 | "metadata": {},
196 | "source": [
197 | "Create validation set using crossvalidation\n",
198 | "---\n",
199 | "Crossvalidation allows us to use as much of our data as possible for training without training on our test data. We use it to split our training set into training and validation sets.\n",
200 | "* Divide data into multiple equal sections (called folds)\n",
201 | "* Hold one fold out for validation and train on the other folds\n",
202 | "* Repeat using each fold as validation\n",
203 | "\n",
204 | "The [KFold()](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html) function returns an iterable with pairs of indices for training and testing data."
205 | ]
206 | },
207 | {
208 | "cell_type": "code",
209 | "execution_count": null,
210 | "metadata": {},
211 | "outputs": [],
212 | "source": [
213 | "from sklearn.model_selection import KFold\n",
214 | "\n",
215 | "# Older versions of scikit learn used n_folds instead of n_splits\n",
216 | "kf = KFold(n_splits=5)\n",
217 | "for trainInd, valInd in kf.split(X_train):\n",
218 | " X_tr = X_train[trainInd,:]\n",
219 | " y_tr = y_train[trainInd]\n",
220 | " X_val = X_train[valInd,:]\n",
221 | " y_val = y_train[valInd]\n",
222 | " print(\"%s %s\" % (X_tr.shape, X_val.shape))"
223 | ]
224 | },
225 | {
226 | "cell_type": "markdown",
227 | "metadata": {},
228 | "source": [
229 | "More information on different methods for creating training and testing sets is available at scikit-learn's [crossvalidation](http://scikit-learn.org/stable/modules/cross_validation.html) page."
230 | ]
231 | },
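  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a quick sketch of how the training set and cross-validation fit together, the cell below scores a simple model with 5-fold cross-validation on the training data. The choice of LinearRegression here is only an illustration (the other notebooks in this repository cover specific algorithms in detail); any scikit-learn regressor could be swapped in. For regressors, the default score reported is $R^2$."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.linear_model import LinearRegression\n",
    "from sklearn.model_selection import cross_val_score\n",
    "\n",
    "# Illustrative model choice -- any scikit-learn regressor would work here\n",
    "model = LinearRegression()\n",
    "\n",
    "# cross_val_score splits X_train into 5 folds and returns one score per fold\n",
    "scores = cross_val_score(model, X_train, y_train, cv=5)\n",
    "\n",
    "print('Score for each fold: ' + str(scores))\n",
    "print('Mean score: ' + str(scores.mean()))"
   ]
  },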
232 | {
233 | "cell_type": "code",
234 | "execution_count": null,
235 | "metadata": {
236 | "collapsed": true
237 | },
238 | "outputs": [],
239 | "source": []
240 | }
241 | ],
242 | "metadata": {
243 | "kernelspec": {
244 | "display_name": "Python 3",
245 | "language": "python",
246 | "name": "python3"
247 | },
248 | "language_info": {
249 | "codemirror_mode": {
250 | "name": "ipython",
251 | "version": 3
252 | },
253 | "file_extension": ".py",
254 | "mimetype": "text/x-python",
255 | "name": "python",
256 | "nbconvert_exporter": "python",
257 | "pygments_lexer": "ipython3",
258 | "version": "3.6.2"
259 | }
260 | },
261 | "nbformat": 4,
262 | "nbformat_minor": 1
263 | }
264 |
--------------------------------------------------------------------------------
/Images/Petal-sepal.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/abatula/MachineLearningIntro/4929e0616ac9be40388ecca15b80fa731b741c9e/Images/Petal-sepal.jpg
--------------------------------------------------------------------------------
/Images/SVMBoundary.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/abatula/MachineLearningIntro/4929e0616ac9be40388ecca15b80fa731b741c9e/Images/SVMBoundary.png
--------------------------------------------------------------------------------
/Images/Setosa.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/abatula/MachineLearningIntro/4929e0616ac9be40388ecca15b80fa731b741c9e/Images/Setosa.jpg
--------------------------------------------------------------------------------
/Images/Versicolor.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/abatula/MachineLearningIntro/4929e0616ac9be40388ecca15b80fa731b741c9e/Images/Versicolor.jpg
--------------------------------------------------------------------------------
/Images/Virginica.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/abatula/MachineLearningIntro/4929e0616ac9be40388ecca15b80fa731b741c9e/Images/Virginica.jpg
--------------------------------------------------------------------------------
/InstructorNotebooks/Images:
--------------------------------------------------------------------------------
1 | ../Images
--------------------------------------------------------------------------------
/InstructorNotebooks/Iris_DataSet_Instructor.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "What is a dataset?\n",
8 | "===\n",
9 |     "A dataset is a collection of information (or data) that can be used by a computer. A dataset typically has some number of examples, where each example has features associated with it. Some datasets also include labels, which are identifying pieces of information that are of interest.\n",
10 | "\n",
11 | "What is an example?\n",
12 | "---\n",
13 | "An example is a single element of a dataset, typically a row (similar to a row in a table). Multiple examples are used to generalize trends about the dataset as a whole. When predicting the list price of a house, each house would be considered a single example.\n",
14 | "\n",
15 | "Examples are often referred to with the letter $x$.\n",
16 | "\n",
17 | "What is a feature?\n",
18 | "---\n",
19 | "A feature is a measurable characteristic that describes an example in a dataset. Features make up the information that a computer can use to learn and make predictions. If your examples are houses, your features might be: the square footage, the number of bedrooms, or the number of bathrooms. Some features are more useful than others. When predicting the list price of a house the number of bedrooms is a useful feature while the number of floorboards is not, even though they both describe the house.\n",
20 | "\n",
21 | "Features are sometimes specified as a single element of an example, $x_i$\n",
22 | "\n",
23 | "What is a label?\n",
24 | "---\n",
25 | "A label identifies a piece of information about an example that is of particular interest. In machine learning, the label is the information we want the computer to learn to predict. In our housing example, the label would be the list price of the house.\n",
26 | "\n",
27 | "Labels can be continuous (e.g. price, length, width) or they can be a category label (e.g. color). They are typically specified by the letter $y$."
28 | ]
29 | },
30 | {
31 | "cell_type": "markdown",
32 | "metadata": {},
33 | "source": [
34 | "The Iris Dataset\n",
35 | "===\n",
36 | "\n",
37 | "Here, we use the Iris dataset, available through scikit-learn. Scikit-learn's explanation of the dataset is [here](http://scikit-learn.org/stable/auto_examples/datasets/plot_iris_dataset.html).\n",
38 | "\n",
39 |     "This dataset contains information on three species of iris flowers ([Setosa](http://en.wikipedia.org/wiki/Iris_setosa), [Versicolour](http://en.wikipedia.org/wiki/Iris_versicolor), and [Virginica](http://en.wikipedia.org/wiki/Iris_virginica)). \n",
40 | "\n",
41 |     "|  |  |  |\n",
42 | "|:-------------------------------------:|:-----------------------------------------:|:----------------------------------------:|\n",
43 | "| Iris Setosa [source](http://en.wikipedia.org/wiki/Iris_setosa#mediaviewer/File:Kosaciec_szczecinkowaty_Iris_setosa.jpg) | Iris Versicolour [source](http://en.wikipedia.org/wiki/Iris_versicolor#mediaviewer/File:Blue_Flag,_Ottawa.jpg) | Iris Virginica [source](http://en.wikipedia.org/wiki/Iris_virginica#mediaviewer/File:Iris_virginica.jpg) |\n",
44 | "\n",
45 | "Each example has four features (or measurements): [sepal](http://en.wikipedia.org/wiki/Sepal) length, sepal width, [petal](http://en.wikipedia.org/wiki/Petal) length, and petal width. All measurements are in cm.\n",
46 | "\n",
47 |     "|  |\n",
48 | "|:------------------------------------------:|\n",
49 | "|Petal and sepal of a primrose plant. From [wikipedia](http://en.wikipedia.org/wiki/Petal#mediaviewer/File:Petal-sepal.jpg)|\n",
50 | "\n",
51 | "\n",
52 | "Examples\n",
53 | "---\n",
54 |     "The dataset consists of 150 examples, 50 examples from each species of iris.\n",
55 | "\n",
56 | "Features\n",
57 | "---\n",
58 | "The features are the columns of the dataset. In order from left to right (or 0-3) they are: sepal length, sepal width, petal length, and petal width\n",
59 | "\n",
60 | "Our goal\n",
61 | "===\n",
62 | "The goal, for this dataset, is to train a computer to predict the species of a new iris plant, given only the measured length and width of its sepal and petal."
63 | ]
64 | },
65 | {
66 | "cell_type": "markdown",
67 | "metadata": {},
68 | "source": [
69 | "Setup\n",
70 | "===\n",
71 |     "Tell matplotlib to print figures in the notebook. Then import numpy (for numerical data), pyplot (for plotting figures), ListedColormap (for plotting colors), and datasets (to download the iris dataset from scikit-learn).\n",
72 | "\n",
73 | "Also create the color maps to use to color the plotted data, and \"labelList\", which is a list of colored rectangles to use in plotted legends"
74 | ]
75 | },
76 | {
77 | "cell_type": "code",
78 | "execution_count": null,
79 | "metadata": {
80 | "collapsed": false
81 | },
82 | "outputs": [],
83 | "source": [
84 | "# Print figures in the notebook\n",
85 | "%matplotlib inline \n",
86 | "\n",
87 | "import numpy as np\n",
88 | "import matplotlib.pyplot as plt\n",
89 | "from matplotlib.colors import ListedColormap\n",
90 | "from sklearn import datasets # Import datasets from scikit-learn\n",
91 | "\n",
92 | "# Import patch for drawing rectangles in the legend\n",
93 | "from matplotlib.patches import Rectangle\n",
94 | "\n",
95 | "# Create color maps\n",
96 | "cmap_bold = ListedColormap(['#FF0000', '#00FF00', '#0000FF'])\n",
97 | "\n",
98 | "# Create a legend for the colors, using rectangles for the corresponding colormap colors\n",
99 | "labelList = []\n",
100 | "for color in cmap_bold.colors:\n",
101 | " labelList.append(Rectangle((0, 0), 1, 1, fc=color))"
102 | ]
103 | },
104 | {
105 | "cell_type": "markdown",
106 | "metadata": {},
107 | "source": [
108 | "Import the dataset\n",
109 | "===\n",
110 | "Import the dataset and store it to a variable called iris. This dataset is similar to a python dictionary, with the keys: ['DESCR', 'target_names', 'target', 'data', 'feature_names']\n",
111 | "\n",
112 |     "The data features are stored in iris.data, where each row is an example from a single flower, and each column is a single feature. The feature names are stored in iris.feature_names. Labels are stored as the numbers 0, 1, or 2 in iris.target, and the names of these labels are in iris.target_names."
113 | ]
114 | },
115 | {
116 | "cell_type": "code",
117 | "execution_count": null,
118 | "metadata": {
119 | "collapsed": false
120 | },
121 | "outputs": [],
122 | "source": [
123 | "# Import some data to play with\n",
124 | "iris = datasets.load_iris()\n",
125 | "\n",
126 | "# List the data keys\n",
127 | "print('Keys: ' + str(iris.keys()))\n",
128 | "print('Label names: ' + str(iris.target_names))\n",
129 | "print('Feature names: ' + str(iris.feature_names))\n",
130 | "print('')\n",
131 | "\n",
132 | "# Store the labels (y), label names, features (X), and feature names\n",
133 | "y = iris.target # Labels are stored in y as numbers\n",
134 | "labelNames = iris.target_names # Species names corresponding to labels 0, 1, and 2\n",
135 | "X = iris.data\n",
136 | "featureNames = iris.feature_names\n",
137 | "\n",
138 | "# Show the first five examples\n",
139 |     "print(iris.data[:5,:])"
140 | ]
141 | },
142 | {
143 | "cell_type": "markdown",
144 | "metadata": {},
145 | "source": [
146 | "Visualizing the data\n",
147 | "===\n",
148 | "Visualizing the data can help us better understand the data and make use of it. The following block of code will create a plot of sepal length (x-axis) vs sepal width (y-axis). The colors of the datapoints correspond to the labeled species of iris for that example.\n",
149 | "\n",
150 | "After plotting, look at the data. What do you notice about the way it is arranged?"
151 | ]
152 | },
153 | {
154 | "cell_type": "code",
155 | "execution_count": null,
156 | "metadata": {
157 | "collapsed": false
158 | },
159 | "outputs": [],
160 | "source": [
161 | "# Plot the data\n",
162 | "\n",
163 | "# Sepal length and width\n",
164 | "X_sepal = X[:,:2]\n",
165 | "# Get the minimum and maximum values with an additional 0.5 border\n",
166 | "x_min, x_max = X_sepal[:, 0].min() - .5, X_sepal[:, 0].max() + .5\n",
167 | "y_min, y_max = X_sepal[:, 1].min() - .5, X_sepal[:, 1].max() + .5\n",
168 | "\n",
169 | "plt.figure(figsize=(8, 6))\n",
170 | "\n",
171 | "# Plot the training points\n",
172 | "plt.scatter(X_sepal[:, 0], X_sepal[:, 1], c=y, cmap=cmap_bold)\n",
173 | "plt.xlabel('Sepal length (cm)')\n",
174 | "plt.ylabel('Sepal width (cm)')\n",
175 | "plt.title('Sepal width vs length')\n",
176 | "\n",
177 | "# Set the plot limits\n",
178 | "plt.xlim(x_min, x_max)\n",
179 | "plt.ylim(y_min, y_max)\n",
180 | "\n",
181 | "plt.legend(labelList, labelNames)\n",
182 | "\n",
183 | "plt.show()\n"
184 | ]
185 | },
186 | {
187 | "cell_type": "markdown",
188 | "metadata": {},
189 | "source": [
190 | "Make your own plot\n",
191 | "===\n",
192 | "Below, try making your own plots. First, modify the previous code to create a similar plot, showing the petal width vs the petal length. You can start by copying and pasting the previous block of code to the cell below, and modifying it to work.\n",
193 | "\n",
194 | "How is the data arranged differently? Do you think these additional features would be helpful in determining to which species of iris a new plant should be categorized?\n",
195 | "\n",
196 | "What about plotting other feature combinations, like petal length vs sepal length?\n",
197 | "\n",
198 | "Once you've plotted the data several different ways, think about how you would predict the species of a new iris plant, given only the length and width of its sepals and petals."
199 | ]
200 | },
201 | {
202 | "cell_type": "code",
203 | "execution_count": null,
204 | "metadata": {
205 | "collapsed": false
206 | },
207 | "outputs": [],
208 | "source": [
209 | "# Put your code here!\n",
210 | "\n",
211 | "# Plot the data\n",
212 | "\n",
213 | "# Petal length and width\n",
214 | "X_petal = X[:,2:]\n",
215 | "# Get the minimum and maximum values with an additional 0.5 border\n",
216 | "x_min, x_max = X_petal[:, 0].min() - .5, X_petal[:, 0].max() + .5\n",
217 | "y_min, y_max = X_petal[:, 1].min() - .5, X_petal[:, 1].max() + .5\n",
218 | "\n",
219 | "plt.figure(figsize=(8, 6))\n",
220 | "\n",
221 | "# Plot the training points\n",
222 | "plt.scatter(X_petal[:, 0], X_petal[:, 1], c=y, cmap=cmap_bold)\n",
223 | "plt.xlabel('Petal length (cm)')\n",
224 | "plt.ylabel('Petal width (cm)')\n",
225 | "plt.title('Petal width vs length')\n",
226 | "\n",
227 | "# Set the plot limits\n",
228 | "plt.xlim(x_min, x_max)\n",
229 | "plt.ylim(y_min, y_max)\n",
230 | "\n",
231 | "plt.legend(labelList, labelNames)\n",
232 | "\n",
233 | "plt.show()\n"
234 | ]
235 | },
236 | {
237 | "cell_type": "markdown",
238 | "metadata": {
239 | "collapsed": false
240 | },
241 | "source": [
242 | "Training and Testing Sets\n",
243 | "===\n",
244 | "\n",
245 | "In order to evaluate our data properly, we need to divide our dataset into training and testing sets.\n",
246 | "\n",
247 | "Training Set\n",
248 | "---\n",
249 |     "A portion of the data, usually a majority, used to train a machine learning classifier. These are the examples that the computer will learn from in order to try to predict data labels.\n",
250 | "\n",
251 | "Testing Set\n",
252 | "---\n",
253 | "A portion of the data, smaller than the training set (usually about 30%), used to test the accuracy of the machine learning classifier. The computer does not \"see\" this data while learning, but tries to guess the data labels. We can then determine the accuracy of our method by determining how many examples it got correct.\n",
254 | "\n",
255 | "Creating training and testing sets\n",
256 | "---\n",
257 |     "Below, we create a training and testing set from the iris dataset using the [train_test_split()](http://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.train_test_split.html) function. "
258 | ]
259 | },
260 | {
261 | "cell_type": "code",
262 | "execution_count": null,
263 | "metadata": {
264 | "collapsed": false
265 | },
266 | "outputs": [],
267 | "source": [
268 |     "from sklearn.model_selection import train_test_split\n",
269 |     "\n",
270 |     "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)\n",
271 | "print('Original dataset size: ' + str(X.shape))\n",
272 | "print('Training dataset size: ' + str(X_train.shape))\n",
273 | "print('Test dataset size: ' + str(X_test.shape))"
274 | ]
275 | },
276 | {
277 | "cell_type": "markdown",
278 | "metadata": {},
279 | "source": [
280 | "More information on different methods for creating training and testing sets is available at scikit-learn's [crossvalidation](http://scikit-learn.org/stable/modules/cross_validation.html) page."
281 | ]
282 | },
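  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If a classifier parameter needs to be tuned, a validation set can be carved out of the training data with k-fold cross-validation. The cell below is a minimal sketch using scikit-learn's [KFold](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html); it only prints the size of each training/validation split."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.model_selection import KFold\n",
    "\n",
    "# Split the training data into 5 folds; each fold takes a turn as the validation set\n",
    "kf = KFold(n_splits=5)\n",
    "for trainInd, valInd in kf.split(X_train):\n",
    "    X_tr = X_train[trainInd,:]\n",
    "    y_tr = y_train[trainInd]\n",
    "    X_val = X_train[valInd,:]\n",
    "    y_val = y_train[valInd]\n",
    "    print(\"%s %s\" % (X_tr.shape, X_val.shape))"
   ]
  },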
283 | {
284 | "cell_type": "code",
285 | "execution_count": null,
286 | "metadata": {
287 | "collapsed": true
288 | },
289 | "outputs": [],
290 | "source": []
291 | }
292 | ],
293 | "metadata": {
294 | "kernelspec": {
295 | "display_name": "Python 3",
296 | "language": "python",
297 | "name": "python3"
298 | },
299 | "language_info": {
300 | "codemirror_mode": {
301 | "name": "ipython",
302 | "version": 3
303 | },
304 | "file_extension": ".py",
305 | "mimetype": "text/x-python",
306 | "name": "python",
307 | "nbconvert_exporter": "python",
308 | "pygments_lexer": "ipython3",
309 | "version": "3.5.1"
310 | }
311 | },
312 | "nbformat": 4,
313 | "nbformat_minor": 0
314 | }
315 |
--------------------------------------------------------------------------------
/Iris_DataSet.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "What is a dataset?\n",
8 | "===\n",
9 |     "A dataset is a collection of information (or data) that can be used by a computer. A dataset typically has some number of examples, where each example has features associated with it. Some datasets also include labels, which are identifying pieces of information that are of interest.\n",
10 | "\n",
11 | "What is an example?\n",
12 | "---\n",
13 | "An example is a single element of a dataset, typically a row (similar to a row in a table). Multiple examples are used to generalize trends about the dataset as a whole. When predicting the list price of a house, each house would be considered a single example.\n",
14 | "\n",
15 | "Examples are often referred to with the letter $x$.\n",
16 | "\n",
17 | "What is a feature?\n",
18 | "---\n",
19 | "A feature is a measurable characteristic that describes an example in a dataset. Features make up the information that a computer can use to learn and make predictions. If your examples are houses, your features might be: the square footage, the number of bedrooms, or the number of bathrooms. Some features are more useful than others. When predicting the list price of a house the number of bedrooms is a useful feature while the color of the walls is not, even though they both describe the house.\n",
20 | "\n",
21 | "Features are sometimes specified as a single element of an example, $x_i$\n",
22 | "\n",
23 | "What is a label?\n",
24 | "---\n",
25 | "A label identifies a piece of information about an example that is of particular interest. In machine learning, the label is the information we want the computer to learn to predict. In our housing example, the label would be the list price of the house.\n",
26 | "\n",
27 | "Labels can be continuous (e.g. price, length, width) or they can be a category label (e.g. color, species of plant/animal). They are typically specified by the letter $y$."
28 | ]
29 | },
30 | {
31 | "cell_type": "markdown",
32 | "metadata": {},
33 | "source": [
34 | "The Iris Dataset\n",
35 | "===\n",
36 | "\n",
37 | "Here, we use the Iris dataset, available through scikit-learn. Scikit-learn's explanation of the dataset is [here](http://scikit-learn.org/stable/auto_examples/datasets/plot_iris_dataset.html).\n",
38 | "\n",
39 | "This dataset contains information on three species of iris flowers: [Setosa](http://en.wikipedia.org/wiki/Iris_setosa), [Versicolour](http://en.wikipedia.org/wiki/Iris_versicolor), and [Virginica](http://en.wikipedia.org/wiki/Iris_virginica). \n",
40 | "\n",
41 |     "|  |  |  |\n",
42 | "|:-------------------------------------:|:-----------------------------------------:|:----------------------------------------:|\n",
43 | "| Iris Setosa [source](http://en.wikipedia.org/wiki/Iris_setosa#mediaviewer/File:Kosaciec_szczecinkowaty_Iris_setosa.jpg) | Iris Versicolour [source](http://en.wikipedia.org/wiki/Iris_versicolor#mediaviewer/File:Blue_Flag,_Ottawa.jpg) | Iris Virginica [source](http://en.wikipedia.org/wiki/Iris_virginica#mediaviewer/File:Iris_virginica.jpg) |\n",
44 | "\n",
45 | "Each example has four features (or measurements): [sepal](http://en.wikipedia.org/wiki/Sepal) length, sepal width, [petal](http://en.wikipedia.org/wiki/Petal) length, and petal width. All measurements are in cm.\n",
46 | "\n",
47 |     "|  |\n",
48 | "|:------------------------------------------:|\n",
49 | "|Petal and sepal of a primrose plant. From [wikipedia](http://en.wikipedia.org/wiki/Petal#mediaviewer/File:Petal-sepal.jpg)|\n",
50 | "\n",
51 | "\n",
52 | "Examples\n",
53 | "---\n",
54 |     "The dataset consists of 150 examples, 50 examples from each species of iris.\n",
55 | "\n",
56 | "Features\n",
57 | "---\n",
58 | "The features are the columns of the dataset. In order from left to right (or 0-3) they are: sepal length, sepal width, petal length, and petal width.\n",
59 | "\n",
60 | "Target\n",
61 | "---\n",
62 |     "The target value is the species of Iris. The three species are Setosa, Versicolour, and Virginica.\n",
63 | "\n",
64 | "Our goal\n",
65 | "===\n",
66 | "The goal, for this dataset, is to train a computer to predict the species of a new iris plant, given only the measured length and width of its sepal and petal."
67 | ]
68 | },
69 | {
70 | "cell_type": "markdown",
71 | "metadata": {},
72 | "source": [
73 | "Setup\n",
74 | "===\n",
75 |     "Tell matplotlib to print figures in the notebook. Then import numpy (for numerical data), pyplot (for plotting figures), ListedColormap (for plotting colors), and datasets (to download the iris dataset from scikit-learn).\n",
76 | "\n",
77 | "Also create the color maps to use to color the plotted data, and \"labelList\", which is a list of colored rectangles to use in plotted legends"
78 | ]
79 | },
80 | {
81 | "cell_type": "code",
82 | "execution_count": null,
83 | "metadata": {
84 | "collapsed": true
85 | },
86 | "outputs": [],
87 | "source": [
88 | "# Print figures in the notebook\n",
89 | "%matplotlib inline \n",
90 | "\n",
91 | "import numpy as np\n",
92 | "import matplotlib.pyplot as plt\n",
93 | "from matplotlib.colors import ListedColormap\n",
94 | "from sklearn import datasets # Import datasets from scikit-learn\n",
95 | "\n",
96 | "# Import patch for drawing rectangles in the legend\n",
97 | "from matplotlib.patches import Rectangle\n",
98 | "\n",
99 | "# Create color maps\n",
100 | "cmap_bold = ListedColormap(['#FF0000', '#00FF00', '#0000FF'])\n",
101 | "\n",
102 | "# Create a legend for the colors, using rectangles for the corresponding colormap colors\n",
103 | "labelList = []\n",
104 | "for color in cmap_bold.colors:\n",
105 | " labelList.append(Rectangle((0, 0), 1, 1, fc=color))"
106 | ]
107 | },
108 | {
109 | "cell_type": "markdown",
110 | "metadata": {},
111 | "source": [
112 | "Import the dataset\n",
113 | "===\n",
114 | "Import the dataset and store it to a variable called iris. This dataset is similar to a python dictionary, with the keys: ['DESCR', 'target_names', 'target', 'data', 'feature_names']\n",
115 | "\n",
116 | "The data features are stored in iris.data, where each row is an example from a single flower, and each column is a single feature. The feature names are stored in iris.feature_names. Labels are stored as the numbers 0, 1, or 2 in iris.target, and the names of these labels are in iris.target_names."
117 | ]
118 | },
119 | {
120 | "cell_type": "code",
121 | "execution_count": null,
122 | "metadata": {},
123 | "outputs": [],
124 | "source": [
125 | "# Import some data to play with\n",
126 | "iris = datasets.load_iris()\n",
127 | "\n",
128 | "# List the data keys\n",
129 | "print('Keys: ' + str(iris.keys()))\n",
130 | "print('Label names: ' + str(iris.target_names))\n",
131 | "print('Feature names: ' + str(iris.feature_names))\n",
132 | "print('')\n",
133 | "\n",
134 | "# Store the labels (y), label names, features (X), and feature names\n",
135 | "y = iris.target # Labels are stored in y as numbers\n",
136 | "labelNames = iris.target_names # Species names corresponding to labels 0, 1, and 2\n",
137 | "X = iris.data\n",
138 | "featureNames = iris.feature_names\n",
139 | "\n",
140 | "# Show the first five examples\n",
141 |     "print(X[:5,:])"
142 | ]
143 | },
144 | {
145 | "cell_type": "markdown",
146 | "metadata": {},
147 | "source": [
148 | "Visualizing the data\n",
149 | "===\n",
150 | "Visualizing the data can help us better understand the data and make use of it. The following block of code will create a plot of sepal length (x-axis) vs sepal width (y-axis). The colors of the datapoints correspond to the labeled species of iris for that example.\n",
151 | "\n",
152 | "After plotting, look at the data. What do you notice about the way it is arranged?"
153 | ]
154 | },
155 | {
156 | "cell_type": "code",
157 | "execution_count": null,
158 | "metadata": {
159 | "collapsed": true
160 | },
161 | "outputs": [],
162 | "source": [
163 | "# Plot the data\n",
164 | "\n",
165 | "# Sepal length and width\n",
166 | "X_sepal = X[:,:2]\n",
167 | "# Get the minimum and maximum values with an additional 0.5 border\n",
168 | "x_min, x_max = X_sepal[:, 0].min() - .5, X_sepal[:, 0].max() + .5\n",
169 | "y_min, y_max = X_sepal[:, 1].min() - .5, X_sepal[:, 1].max() + .5\n",
170 | "\n",
171 | "plt.figure(figsize=(8, 6))\n",
172 | "\n",
173 | "# Plot the training points\n",
174 | "plt.scatter(X_sepal[:, 0], X_sepal[:, 1], c=y, cmap=cmap_bold)\n",
175 | "plt.xlabel('Sepal length (cm)')\n",
176 | "plt.ylabel('Sepal width (cm)')\n",
177 | "plt.title('Sepal width vs length')\n",
178 | "\n",
179 | "# Set the plot limits\n",
180 | "plt.xlim(x_min, x_max)\n",
181 | "plt.ylim(y_min, y_max)\n",
182 | "\n",
183 | "plt.legend(labelList, labelNames)\n",
184 | "\n",
185 | "plt.show()\n"
186 | ]
187 | },
188 | {
189 | "cell_type": "markdown",
190 | "metadata": {},
191 | "source": [
192 | "Make your own plot\n",
193 | "===\n",
194 | "Below, try making your own plots. First, modify the previous code to create a similar plot, showing the petal width vs the petal length. You can start by copying and pasting the previous block of code to the cell below, and modifying it to work.\n",
195 | "\n",
196 | "How is the data arranged differently? Do you think these additional features would be helpful in determining to which species of iris a new plant should be categorized?\n",
197 | "\n",
198 | "What about plotting other feature combinations, like petal length vs sepal length?\n",
199 | "\n",
200 | "Once you've plotted the data several different ways, think about how you would predict the species of a new iris plant, given only the length and width of its sepals and petals."
201 | ]
202 | },
203 | {
204 | "cell_type": "code",
205 | "execution_count": null,
206 | "metadata": {
207 | "collapsed": true
208 | },
209 | "outputs": [],
210 | "source": [
211 | "# Put your code here!"
212 | ]
213 | },
214 | {
215 | "cell_type": "markdown",
216 | "metadata": {},
217 | "source": [
218 | "Training and Testing Sets\n",
219 | "===\n",
220 | "\n",
221 | "In order to evaluate our data properly, we need to divide our dataset into training and testing sets.\n",
222 | "* **Training Set** - Portion of the data used to train a machine learning algorithm. These are the examples that the computer will learn from in order to try to predict data labels.\n",
223 | "* **Testing Set** - Portion of the data (usually 10-30%) not used in training, used to evaluate performance. The computer does not \"see\" this data while learning, but tries to guess the data labels. We can then determine the accuracy of our method by determining how many examples it got correct.\n",
224 |     "* **Validation Set** - (Optional) A third section of data used for parameter tuning or classifier selection. When selecting among many classifiers, or when a classifier parameter must be adjusted (tuned), this data is used like a test set to select the best parameter value(s). The final performance is then evaluated on the remaining, previously unused, testing set.\n",
225 | "\n",
226 | "Creating training and testing sets\n",
227 | "---\n",
228 |     "Below, we create a training and testing set from the iris dataset using the [train_test_split()](http://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.train_test_split.html) function. "
229 | ]
230 | },
231 | {
232 | "cell_type": "code",
233 | "execution_count": null,
234 | "metadata": {
235 | "collapsed": true
236 | },
237 | "outputs": [],
238 | "source": [
239 | "from sklearn.model_selection import train_test_split\n",
240 | "\n",
241 | "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)\n",
242 | "\n",
243 | "print('Original dataset size: ' + str(X.shape))\n",
244 | "print('Training dataset size: ' + str(X_train.shape))\n",
245 | "print('Test dataset size: ' + str(X_test.shape))"
246 | ]
247 | },
248 | {
249 | "cell_type": "markdown",
250 | "metadata": {},
251 | "source": [
252 | "Create validation set using crossvalidation\n",
253 | "---\n",
254 | "Crossvalidation allows us to use as much of our data as possible for training without training on our test data. We use it to split our training set into training and validation sets.\n",
255 | "* Divide data into multiple equal sections (called folds)\n",
256 | "* Hold one fold out for validation and train on the other folds\n",
257 | "* Repeat using each fold as validation\n",
258 | "\n",
259 | "The [KFold()](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html) function returns an iterable with pairs of indices for training and testing data."
260 | ]
261 | },
262 | {
263 | "cell_type": "code",
264 | "execution_count": null,
265 | "metadata": {
266 | "collapsed": true
267 | },
268 | "outputs": [],
269 | "source": [
270 | "from sklearn.model_selection import KFold\n",
271 | "\n",
272 | "# Older versions of scikit learn used n_folds instead of n_splits\n",
273 | "kf = KFold(n_splits=5)\n",
274 | "for trainInd, valInd in kf.split(X_train):\n",
275 | " X_tr = X_train[trainInd,:]\n",
276 | " y_tr = y_train[trainInd]\n",
277 | " X_val = X_train[valInd,:]\n",
278 | " y_val = y_train[valInd]\n",
279 | " print(\"%s %s\" % (X_tr.shape, X_val.shape))"
280 | ]
281 | },
282 | {
283 | "cell_type": "markdown",
284 | "metadata": {},
285 | "source": [
286 | "More information on different methods for creating training and testing sets is available at scikit-learn's [crossvalidation](http://scikit-learn.org/stable/modules/cross_validation.html) page."
287 | ]
288 | },
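  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To sketch how the training and testing sets get used, the cell below fits a k-nearest neighbors classifier (covered in detail in KNN_Tutorial.ipynb) on the training set and reports its accuracy on the held-out test set. The classifier choice here is just an illustration; any scikit-learn classifier could take its place."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.neighbors import KNeighborsClassifier\n",
    "\n",
    "# Illustrative classifier choice -- any scikit-learn classifier could be used here\n",
    "knn = KNeighborsClassifier(n_neighbors=5)\n",
    "\n",
    "# Learn from the training examples and their labels\n",
    "knn.fit(X_train, y_train)\n",
    "\n",
    "# score() reports the fraction of test examples whose species is predicted correctly\n",
    "print('Accuracy on the test set: ' + str(knn.score(X_test, y_test)))"
   ]
  },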
289 | {
290 | "cell_type": "code",
291 | "execution_count": null,
292 | "metadata": {
293 | "collapsed": true
294 | },
295 | "outputs": [],
296 | "source": []
297 | }
298 | ],
299 | "metadata": {
300 | "kernelspec": {
301 | "display_name": "Python 3",
302 | "language": "python",
303 | "name": "python3"
304 | },
305 | "language_info": {
306 | "codemirror_mode": {
307 | "name": "ipython",
308 | "version": 3
309 | },
310 | "file_extension": ".py",
311 | "mimetype": "text/x-python",
312 | "name": "python",
313 | "nbconvert_exporter": "python",
314 | "pygments_lexer": "ipython3",
315 | "version": "3.6.2"
316 | }
317 | },
318 | "nbformat": 4,
319 | "nbformat_minor": 1
320 | }
321 |
--------------------------------------------------------------------------------
/KMeans_Tutorial.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "K-Means Tutorial\n",
8 | "===\n",
9 | "\n",
10 | "K-means is an example of unsupervised learning through clustering. It tries to separate unlabeled data into clusters with equal variance. In two dimensions, this can be visualized as grouping data using circular areas of equal radius.\n",
11 | "\n",
12 |     "There are three steps to training a K-means classifier: \n",
13 | "\n",
14 | "1. Pick how many groups you want it to use and (randomly) assign a starting centroid (center point) to each cluster.\n",
15 | "2. Assign each data point to the group with the closest centroid.\n",
16 |     "3. Find the mean value of each feature (the middle point of the cluster) for all the points assigned to each cluster. This is the new centroid for that cluster.\n",
17 |     "\n",
18 |     "Steps 2 and 3 repeat until the cluster centroids do not move significantly (a small numpy sketch of one iteration is shown below).\n",
19 |     "\n",
20 |     "Scikit-learn provides more information on the K-means classifier function [KMeans](http://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html). They also have an example of using K-means to [classify handwritten numbers](http://scikit-learn.org/stable/auto_examples/cluster/plot_kmeans_digits.html#example-cluster-plot-kmeans-digits-py).\n"
21 | ]
22 | },
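  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Before turning to scikit-learn's implementation, here is a minimal numpy sketch of one assignment-and-update iteration (steps 2 and 3) on a handful of made-up 2-D points, just to make the two steps concrete. The points and starting centroids below are purely illustrative."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "# A few made-up 2-D points and two starting centroids (purely illustrative)\n",
    "points = np.array([[1.0, 1.0], [1.5, 2.0], [5.0, 7.0], [6.0, 8.0]])\n",
    "centroids = np.array([[0.0, 0.0], [5.0, 5.0]])\n",
    "\n",
    "# Step 2: assign each point to its closest centroid\n",
    "distances = np.linalg.norm(points[:, np.newaxis, :] - centroids[np.newaxis, :, :], axis=2)\n",
    "assignments = np.argmin(distances, axis=1)\n",
    "print('Cluster assignments: ' + str(assignments))\n",
    "\n",
    "# Step 3: move each centroid to the mean of the points assigned to it\n",
    "new_centroids = np.array([points[assignments == k].mean(axis=0) for k in range(len(centroids))])\n",
    "print('Updated centroids: ' + str(new_centroids))"
   ]
  },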
23 | {
24 | "cell_type": "markdown",
25 | "metadata": {},
26 | "source": [
27 | "Setup\n",
28 | "===\n",
29 |     "Tell matplotlib to print figures in the notebook. Then import numpy (for numerical data), pyplot (for plotting figures), ListedColormap (for plotting colors), KMeans (for the scikit-learn K-means algorithm), and datasets (to download the iris dataset from scikit-learn).\n",
30 | "\n",
31 | "Also create the color maps to use to color the plotted data, and \"labelList\", which is a list of colored rectangles to use in plotted legends."
32 | ]
33 | },
34 | {
35 | "cell_type": "code",
36 | "execution_count": null,
37 | "metadata": {
38 | "collapsed": true
39 | },
40 | "outputs": [],
41 | "source": [
42 | "# Print figures in the notebook\n",
43 | "%matplotlib inline \n",
44 | "\n",
45 | "import numpy as np\n",
46 | "import matplotlib.pyplot as plt\n",
47 | "from matplotlib.colors import ListedColormap\n",
48 | "from sklearn import datasets # Import the dataset from scikit-learn\n",
49 | "from sklearn.cluster import KMeans # Import the KMeans classifier\n",
50 | "\n",
51 | "# Import patch for drawing rectangles in the legend\n",
52 | "from matplotlib.patches import Rectangle\n",
53 | "\n",
54 | "# Create color maps\n",
55 | "cmap_bold = ListedColormap(['#FF0000', '#00FF00', '#0000FF'])\n",
56 | "cmap_bg = ListedColormap(['#333333', '#666666', '#999999'])\n",
57 | "\n",
58 | "# Create a legend for the colors, using rectangles for the corresponding colormap colors\n",
59 | "labelList = []\n",
60 | "for color in cmap_bold.colors:\n",
61 | " labelList.append(Rectangle((0, 0), 1, 1, fc=color))"
62 | ]
63 | },
64 | {
65 | "cell_type": "markdown",
66 | "metadata": {},
67 | "source": [
68 | "Visualizing K-Means\n",
69 | "---\n",
70 | "\n",
71 |     "Let's start by visualizing the steps involved in K-Means clustering. We'll create a simple toy dataset and cluster it into two groups. Don't worry if you aren't sure what the code is doing yet; the important part is the visualizations. We'll cover how to use K-Means afterwards.\n",
72 | "\n",
73 | "Below, we create the toy dataset, randomly select 2 starting locations, and plot them together. First we set the seed to 42 so everyone should see the same results.\n",
74 | "\n",
75 | "Then we create half of the x,y co-ordinates using one random distribution, and the other half from a second distribution. These are joined into a single dataset.\n",
76 | "\n",
77 | "Afterwards, we randomly select the x,y co-ordinates for the 2 starting locations and plot these along with the data points."
78 | ]
79 | },
80 | {
81 | "cell_type": "code",
82 | "execution_count": null,
83 | "metadata": {},
84 | "outputs": [],
85 | "source": [
86 | "np.random.seed(42)\n",
87 | "\n",
88 | "data1 = np.random.randn(40,2) + np.array([1,3])\n",
89 | "data2 = np.random.randn(40,2) + np.array([5,2])\n",
90 | "data = np.concatenate([data1, data2], axis=0)\n",
91 | "\n",
92 | "centroids = np.random.randn(2,2) + np.array([-1,3])\n",
93 | "\n",
94 | "# Plot the data\n",
95 | "plt.scatter(data[:, 0], data[:, 1])\n",
96 | "plt.title(\"Initial Configuration\")\n",
97 | "\n",
98 |     "# Plot the centroids as a red X\n",
99 | "plt.scatter(centroids[:, 0], centroids[:, 1],\n",
100 | " marker='x', s=169, linewidths=3,\n",
101 | " color='r', zorder=10)\n",
102 | "\n",
103 | "plt.show()"
104 | ]
105 | },
106 | {
107 | "cell_type": "markdown",
108 | "metadata": {},
109 | "source": [
110 |     "Currently the centroids aren't in great locations; they aren't even really in the data. But in our next step, we assign each data point to the closest centroid. You can see how the points are assigned in the plot below."
111 | ]
112 | },
113 | {
114 | "cell_type": "code",
115 | "execution_count": null,
116 | "metadata": {},
117 | "outputs": [],
118 | "source": [
119 | "h = .02 # step size in the mesh\n",
120 | "\n",
121 | "# Plot the decision boundary. For that, we will assign a color to each\n",
122 |     "# point in the mesh [x_min, x_max]x[y_min, y_max].\n",
123 | "x_min, x_max = data[:, 0].min() - 1, data[:, 0].max() + 1\n",
124 | "y_min, y_max = data[:, 1].min() - 1, data[:, 1].max() + 1\n",
125 | "xx, yy = np.meshgrid(np.arange(x_min, x_max, h),\n",
126 | " np.arange(y_min, y_max, h))\n",
127 | "\n",
128 | "Z = np.zeros(xx.shape)\n",
129 | "\n",
130 | "for row in range(Z.shape[0]):\n",
131 | " for col in range(Z.shape[1]):\n",
132 | " d1 = np.linalg.norm(np.array([xx[row,col], yy[row,col]]) - centroids[0,:])\n",
133 | " d2 = np.linalg.norm(np.array([xx[row,col], yy[row,col]]) - centroids[1,:])\n",
134 | " \n",
135 | " if d2 < d1:\n",
136 | " Z[row,col] = 1\n",
137 | "\n",
138 | "\n",
139 | "# Put the result into a color plot\n",
140 | "plt.figure(figsize=(8, 6))\n",
141 | "plt.pcolormesh(xx, yy, Z, cmap=cmap_bg)\n",
142 | "\n",
143 | "# Plot the training points\n",
144 | "plt.scatter(data[:, 0], data[:, 1])\n",
145 | "plt.xlim(xx.min(), xx.max())\n",
146 | "plt.ylim(yy.min(), yy.max())\n",
147 | "plt.title(\"Assigned to Clusters\")\n",
148 | "\n",
149 | "# Plot the centroids as a red X\n",
150 | "plt.scatter(centroids[:, 0], centroids[:, 1],\n",
151 | " marker='x', s=169, linewidths=3,\n",
152 | " color='r', zorder=10)\n",
153 | "\n",
154 | "plt.show()"
155 | ]
156 | },
157 | {
158 | "cell_type": "markdown",
159 | "metadata": {},
160 | "source": [
161 | "Once all the points are assigned to a cluster, the centroids move to the average x and y value of their assigned points. Below, the old locations are shown as a black X, and the new locations are red Xs."
162 | ]
163 | },
164 | {
165 | "cell_type": "code",
166 | "execution_count": null,
167 | "metadata": {},
168 | "outputs": [],
169 | "source": [
170 | "kmeans = KMeans(n_clusters=2, init=centroids, max_iter=1, n_init=1)\n",
171 | "kmeans.fit(data)\n",
172 | "\n",
173 | "# Put the result into a color plot\n",
174 | "plt.figure(figsize=(8, 6))\n",
175 | "plt.pcolormesh(xx, yy, Z, cmap=cmap_bg)\n",
176 | "\n",
177 | "# Plot the training points\n",
178 | "plt.scatter(data[:, 0], data[:, 1])\n",
179 | "plt.xlim(xx.min(), xx.max())\n",
180 | "plt.ylim(yy.min(), yy.max())\n",
181 | "plt.title(\"Centroids Relocated\")\n",
182 | "\n",
183 | "# Plot old centroids as a black X\n",
184 | "plt.scatter(centroids[:, 0], centroids[:, 1],\n",
185 | " marker='x', s=169, linewidths=3,\n",
186 | " color='k', zorder=10)\n",
187 | "\n",
188 | "# Plot the centroids as a red X\n",
189 | "centroids = kmeans.cluster_centers_\n",
190 | "plt.scatter(centroids[:, 0], centroids[:, 1],\n",
191 | " marker='x', s=169, linewidths=3,\n",
192 | " color='r', zorder=10)\n",
193 | "\n",
194 | "plt.show()"
195 | ]
196 | },
197 | {
198 | "cell_type": "markdown",
199 | "metadata": {},
200 | "source": [
201 | "Next, the datapoints are re-assigned to the new closest centroid."
202 | ]
203 | },
204 | {
205 | "cell_type": "code",
206 | "execution_count": null,
207 | "metadata": {},
208 | "outputs": [],
209 | "source": [
210 | "h = .02 # step size in the mesh\n",
211 | "\n",
212 | "# Plot the decision boundary. For that, we will assign a color to each\n",
213 |     "# point in the mesh [x_min, x_max]x[y_min, y_max].\n",
214 | "x_min, x_max = data[:, 0].min() - 1, data[:, 0].max() + 1\n",
215 | "y_min, y_max = data[:, 1].min() - 1, data[:, 1].max() + 1\n",
216 | "xx, yy = np.meshgrid(np.arange(x_min, x_max, h),\n",
217 | " np.arange(y_min, y_max, h))\n",
218 | "\n",
219 | "Z = kmeans.predict(np.c_[xx.ravel(), yy.ravel()]) # Make a prediction at every point \n",
220 | " # in the mesh in order to find the \n",
221 | " # classification areas for each label\n",
222 | "\n",
223 | "# Put the result into a color plot\n",
224 | "Z = Z.reshape(xx.shape)\n",
225 | "plt.figure(figsize=(8, 6))\n",
226 | "plt.pcolormesh(xx, yy, Z, cmap=cmap_bg)\n",
227 | "\n",
228 | "# Plot the training points\n",
229 | "plt.scatter(data[:, 0], data[:, 1])\n",
230 | "plt.xlim(xx.min(), xx.max())\n",
231 | "plt.ylim(yy.min(), yy.max())\n",
232 | "plt.title(\"Re-assigned to New Clusters\")\n",
233 | "\n",
234 |     "# Plot the centroids as a red X\n",
235 | "plt.scatter(centroids[:, 0], centroids[:, 1],\n",
236 | " marker='x', s=169, linewidths=3,\n",
237 | " color='r', zorder=10)\n",
238 | "\n",
239 | "plt.show()"
240 | ]
241 | },
242 | {
243 | "cell_type": "markdown",
244 | "metadata": {},
245 | "source": [
246 |     "We repeat these two steps until we've reached a maximum number of iterations or the centroids no longer move significantly. We'll show a few more iterations so you can see what happens."
247 | ]
248 | },
249 | {
250 | "cell_type": "code",
251 | "execution_count": null,
252 | "metadata": {},
253 | "outputs": [],
254 | "source": [
255 | "kmeans = KMeans(n_clusters=2, init=centroids, max_iter=1, n_init=1)\n",
256 | "kmeans.fit(data)\n",
257 | "\n",
258 | "# Put the result into a color plot\n",
259 | "plt.figure(figsize=(8, 6))\n",
260 | "plt.pcolormesh(xx, yy, Z, cmap=cmap_bg)\n",
261 | "\n",
262 | "# Plot the training points\n",
263 | "plt.scatter(data[:, 0], data[:, 1])\n",
264 | "plt.xlim(xx.min(), xx.max())\n",
265 | "plt.ylim(yy.min(), yy.max())\n",
266 | "plt.title(\"Centroids Relocated\")\n",
267 | "\n",
268 | "plt.scatter(centroids[:, 0], centroids[:, 1],\n",
269 | " marker='x', s=169, linewidths=3,\n",
270 | " color='k', zorder=10)\n",
271 | "\n",
272 | "# Plot the centroids as a red X\n",
273 | "centroids = kmeans.cluster_centers_\n",
274 | "plt.scatter(centroids[:, 0], centroids[:, 1],\n",
275 | " marker='x', s=169, linewidths=3,\n",
276 | " color='r', zorder=10)\n",
277 | "\n",
278 | "plt.show()\n",
279 | "\n",
280 | "h = .02 # step size in the mesh\n",
281 | "\n",
282 | "# Plot the decision boundary. For that, we will assign a color to each\n",
283 |     "# point in the mesh [x_min, x_max]x[y_min, y_max].\n",
284 | "x_min, x_max = data[:, 0].min() - 1, data[:, 0].max() + 1\n",
285 | "y_min, y_max = data[:, 1].min() - 1, data[:, 1].max() + 1\n",
286 | "xx, yy = np.meshgrid(np.arange(x_min, x_max, h),\n",
287 | " np.arange(y_min, y_max, h))\n",
288 | "\n",
289 | "Z = kmeans.predict(np.c_[xx.ravel(), yy.ravel()]) # Make a prediction at every point \n",
290 | " # in the mesh in order to find the \n",
291 | " # classification areas for each label\n",
292 | "\n",
293 | "# Put the result into a color plot\n",
294 | "Z = Z.reshape(xx.shape)\n",
295 | "plt.figure(figsize=(8, 6))\n",
296 | "plt.pcolormesh(xx, yy, Z, cmap=cmap_bg)\n",
297 | "\n",
298 | "# Plot the training points\n",
299 | "plt.scatter(data[:, 0], data[:, 1])\n",
300 | "plt.xlim(xx.min(), xx.max())\n",
301 | "plt.ylim(yy.min(), yy.max())\n",
302 | "plt.title(\"Re-assigned to New Clusters\")\n",
303 | "\n",
304 | "# Plot the centroids as a red X\n",
305 | "plt.scatter(centroids[:, 0], centroids[:, 1],\n",
306 | " marker='x', s=169, linewidths=3,\n",
307 | " color='r', zorder=10)\n",
308 | "\n",
309 | "plt.show()\n",
310 | "\n",
311 | "kmeans = KMeans(n_clusters=2, init=centroids, max_iter=1, n_init=1)\n",
312 | "kmeans.fit(data)\n",
313 | "\n",
314 | "# Put the result into a color plot\n",
315 | "plt.figure(figsize=(8, 6))\n",
316 | "plt.pcolormesh(xx, yy, Z, cmap=cmap_bg)\n",
317 | "\n",
318 | "# Plot the training points\n",
319 | "plt.scatter(data[:, 0], data[:, 1])\n",
320 | "plt.xlim(xx.min(), xx.max())\n",
321 | "plt.ylim(yy.min(), yy.max())\n",
322 | "plt.title(\"Centroids Relocated\")\n",
323 | "\n",
324 | "plt.scatter(centroids[:, 0], centroids[:, 1],\n",
325 | " marker='x', s=169, linewidths=3,\n",
326 | " color='k', zorder=10)\n",
327 | "\n",
328 | "# Plot the centroids as a red X\n",
329 | "centroids = kmeans.cluster_centers_\n",
330 | "plt.scatter(centroids[:, 0], centroids[:, 1],\n",
331 | " marker='x', s=169, linewidths=3,\n",
332 | " color='r', zorder=10)\n",
333 | "\n",
334 | "plt.show()\n",
335 | "\n",
336 | "h = .02 # step size in the mesh\n",
337 | "\n",
338 | "# Plot the decision boundary. For that, we will assign a color to each\n",
339 |     "# point in the mesh [x_min, x_max]x[y_min, y_max].\n",
340 | "x_min, x_max = data[:, 0].min() - 1, data[:, 0].max() + 1\n",
341 | "y_min, y_max = data[:, 1].min() - 1, data[:, 1].max() + 1\n",
342 | "xx, yy = np.meshgrid(np.arange(x_min, x_max, h),\n",
343 | " np.arange(y_min, y_max, h))\n",
344 | "\n",
345 | "Z = kmeans.predict(np.c_[xx.ravel(), yy.ravel()]) # Make a prediction at every point \n",
346 | " # in the mesh in order to find the \n",
347 | " # classification areas for each label\n",
348 | "\n",
349 | "# Put the result into a color plot\n",
350 | "Z = Z.reshape(xx.shape)\n",
351 | "plt.figure(figsize=(8, 6))\n",
352 | "plt.pcolormesh(xx, yy, Z, cmap=cmap_bg)\n",
353 | "\n",
354 | "# Plot the training points\n",
355 | "plt.scatter(data[:, 0], data[:, 1])\n",
356 | "plt.xlim(xx.min(), xx.max())\n",
357 | "plt.ylim(yy.min(), yy.max())\n",
358 | "plt.title(\"Re-assigned to New Clusters\")\n",
359 | "\n",
360 |     "# Plot the centroids as a red X\n",
361 | "plt.scatter(centroids[:, 0], centroids[:, 1],\n",
362 | " marker='x', s=169, linewidths=3,\n",
363 | " color='r', zorder=10)\n",
364 | "\n",
365 | "plt.show()"
366 | ]
367 | },
368 | {
369 | "cell_type": "markdown",
370 | "metadata": {},
371 | "source": [
372 | "Import the dataset\n",
373 | "===\n",
374 | "Import the dataset and store it to a variable called iris. Scikit-learn's explanation of the dataset is [here](http://scikit-learn.org/stable/auto_examples/datasets/plot_iris_dataset.html). This dataset is similar to a python dictionary, with the keys: ['DESCR', 'target_names', 'target', 'data', 'feature_names']\n",
375 | "\n",
376 | "The data features are stored in iris.data, where each row is an example from a single flower, and each column is a single feature. The feature names are stored in iris.feature_names. Labels are stored as the numbers 0, 1, or 2 in iris.target, and the names of these labels are in iris.target_names.\n",
377 | "\n",
378 | "The dataset consists of measurements made on 50 examples from each of three different species of iris flowers (Setosa, Versicolour, and Virginica). Each example has four features (or measurements): [sepal](http://en.wikipedia.org/wiki/Sepal) length, sepal width, [petal](http://en.wikipedia.org/wiki/Petal) length, and petal width. All measurements are in cm.\n",
379 | "\n",
380 | "Below, we load the labels into y, the corresponding label names into labelNames, the data into X, and the names of the features into featureNames."
381 | ]
382 | },
383 | {
384 | "cell_type": "code",
385 | "execution_count": null,
386 | "metadata": {
387 | "collapsed": true
388 | },
389 | "outputs": [],
390 | "source": [
391 | "# Import some data to play with\n",
392 | "iris = datasets.load_iris()\n",
393 | "\n",
394 | "# Store the labels (y), label names, features (X), and feature names\n",
395 | "y = iris.target # Labels are stored in y as numbers\n",
396 | "labelNames = iris.target_names # Species names corresponding to labels 0, 1, and 2\n",
397 | "X = iris.data\n",
398 | "featureNames = iris.feature_names"
399 | ]
400 | },
401 | {
402 | "cell_type": "markdown",
403 | "metadata": {},
404 | "source": [
405 | "Below, we plot the first two features from the dataset (sepal length and width). Normally we would try to use all useful features, but sticking with two allows us to visualize the data more easily.\n",
406 | "\n",
407 | "Then we plot the data to get a look at what we're dealing with. The colormap is used to determine what colors are used for each class when plotting."
408 | ]
409 | },
410 | {
411 | "cell_type": "code",
412 | "execution_count": null,
413 | "metadata": {},
414 | "outputs": [],
415 | "source": [
416 | "# Plot the data\n",
417 | "\n",
418 | "# Sepal length and width\n",
419 | "X_small = X[:,:2]\n",
420 | "# Get the minimum and maximum values with an additional 0.5 border\n",
421 | "x_min, x_max = X_small[:, 0].min() - .5, X_small[:, 0].max() + .5\n",
422 | "y_min, y_max = X_small[:, 1].min() - .5, X_small[:, 1].max() + .5\n",
423 | "\n",
424 | "plt.figure(figsize=(8, 6))\n",
425 | "\n",
426 | "# Plot the training points\n",
427 | "plt.scatter(X_small[:, 0], X_small[:, 1], c=y, cmap=cmap_bold)\n",
428 | "plt.xlabel('Sepal length (cm)')\n",
429 | "plt.ylabel('Sepal width (cm)')\n",
430 | "plt.title('Sepal width vs length')\n",
431 | "\n",
432 | "# Set the plot limits\n",
433 | "plt.xlim(x_min, x_max)\n",
434 | "plt.ylim(y_min, y_max)\n",
435 | "\n",
436 | "# Plot the legend\n",
437 | "plt.legend(labelList, labelNames)\n",
438 | "\n",
439 | "plt.show()\n"
440 | ]
441 | },
442 | {
443 | "cell_type": "markdown",
444 | "metadata": {},
445 | "source": [
446 | "Unlabeled data\n",
447 | "===\n",
448 | "K-means is an unsupervised learning method, which means it doesn't make use of data labels. This is useful when we're exploring a new dataset. We may not have labels for this dataset, but we want to see how it is grouped together and what examples are most similar to each other. Below we plot the data again, but this time without any labels. This is what k-means \"sees\" when we use it."
449 | ]
450 | },
451 | {
452 | "cell_type": "code",
453 | "execution_count": null,
454 | "metadata": {},
455 | "outputs": [],
456 | "source": [
457 | "# Plot the data\n",
458 | "\n",
459 | "# Sepal length and width\n",
460 | "X_small = X[:,:2]\n",
461 | "# Get the minimum and maximum values with an additional 0.5 border\n",
462 | "x_min, x_max = X_small[:, 0].min() - .5, X_small[:, 0].max() + .5\n",
463 | "y_min, y_max = X_small[:, 1].min() - .5, X_small[:, 1].max() + .5\n",
464 | "\n",
465 | "plt.figure(figsize=(8, 6))\n",
466 | "\n",
467 | "# Plot the training points\n",
468 | "plt.scatter(X_small[:, 0], X_small[:, 1])\n",
469 | "plt.xlabel('Sepal length (cm)')\n",
470 | "plt.ylabel('Sepal width (cm)')\n",
471 | "plt.title('Sepal width vs length')\n",
472 | "\n",
473 | "# Set the plot limits\n",
474 | "plt.xlim(x_min, x_max)\n",
475 | "plt.ylim(y_min, y_max)\n",
476 | "\n",
477 | "plt.show()\n"
478 | ]
479 | },
480 | {
481 | "cell_type": "markdown",
482 | "metadata": {},
483 | "source": [
484 | "K-means: training\n",
485 | "===\n",
486 | "Next, we train a K-means classifier on our data. \n",
487 | "\n",
488 | "The first section chooses the number of clusters to use, and stores it in the variable n_clusters. We choose 3 because we know there are 3 species of iris, but we don't always know this when approaching a machine learning problem. \n",
489 | "\n",
490 | "The last two lines create and train the classifier. \n",
491 | "\n",
492 | "The first line creates a classifier (kmeans) using the [KMeans()](http://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html) function, and tells it to use the number of neighbors stored in n_neighbors. The second line uses the fit() method to train the classifier on the features in X. Notice that because this is an unsupervised method, it does not use the labels stored in y."
493 | ]
494 | },
495 | {
496 | "cell_type": "code",
497 | "execution_count": null,
498 | "metadata": {},
499 | "outputs": [],
500 | "source": [
501 | "# Choose your number of clusters\n",
502 | "n_clusters = 3\n",
503 | "\n",
504 | "# we create an instance of KMeans Classifier and fit the data.\n",
505 | "kmeans = KMeans(n_clusters=n_clusters)\n",
506 | "kmeans.fit(X_small)"
507 | ]
508 | },
509 | {
510 | "cell_type": "markdown",
511 | "metadata": {},
512 | "source": [
513 | "Plot the classification boundaries\n",
514 | "===\n",
515 | "Now that we have our classifier, let's visualize what it's doing. \n",
516 | "\n",
517 | "First we plot the decision boundaries, or the lines dividing areas assigned to the different clusters. The background shows the areas that are considered to belong to a certain cluster, and each cluster can then be assigned to a species of iris. They are plotted in grey, because the classifier does not assign labels to the clusters. The center of each cluster is plotted as a black x. Then we plot our examples onto the space, showing where each point lies in relation to the decision boundaries.\n",
518 | "\n",
519 | "If we took sepal measurements from a new flower, we could plot it in this space and use the background shade to determine which cluster of data points our classifier would assign to it."
520 | ]
521 | },
522 | {
523 | "cell_type": "code",
524 | "execution_count": null,
525 | "metadata": {},
526 | "outputs": [],
527 | "source": [
528 | "h = .02 # step size in the mesh\n",
529 | "\n",
530 | "# Plot the decision boundary. For that, we will assign a color to each\n",
531 | "# point in the mesh [x_min, m_max]x[y_min, y_max].\n",
532 | "x_min, x_max = X_small[:, 0].min() - 1, X_small[:, 0].max() + 1\n",
533 | "y_min, y_max = X_small[:, 1].min() - 1, X_small[:, 1].max() + 1\n",
534 | "xx, yy = np.meshgrid(np.arange(x_min, x_max, h),\n",
535 | " np.arange(y_min, y_max, h))\n",
536 | "\n",
537 | "Z = kmeans.predict(np.c_[xx.ravel(), yy.ravel()]) # Make a prediction oat every point \n",
538 | " # in the mesh in order to find the \n",
539 | " # classification areas for each label\n",
540 | "\n",
541 | "# Put the result into a color plot\n",
542 | "Z = Z.reshape(xx.shape)\n",
543 | "plt.figure(figsize=(8, 6))\n",
544 | "plt.pcolormesh(xx, yy, Z, cmap=cmap_bg)\n",
545 | "\n",
546 | "# Plot the training points\n",
547 | "plt.scatter(X_small[:, 0], X_small[:, 1])\n",
548 | "plt.xlim(xx.min(), xx.max())\n",
549 | "plt.ylim(yy.min(), yy.max())\n",
550 | "plt.title(\"KMeans (k = %i)\"\n",
551 | " % (n_clusters))\n",
552 | "plt.xlabel('Sepal length (cm)')\n",
553 | "plt.ylabel('Sepal width (cm)')\n",
554 | "\n",
555 | "# Plot the centroids as a black X\n",
556 | "centroids = kmeans.cluster_centers_\n",
557 | "plt.scatter(centroids[:, 0], centroids[:, 1],\n",
558 | " marker='x', s=169, linewidths=3,\n",
559 | " color='k', zorder=10)\n",
560 | "\n",
561 | "plt.show()"
562 | ]
563 | },
564 | {
565 | "cell_type": "markdown",
566 | "metadata": {},
567 | "source": [
568 | "Cheating with labels\n",
569 | "---\n",
570 | "Because we do have labels for this dataset, let's see how well k-means did at separating the three species."
571 | ]
572 | },
573 | {
574 | "cell_type": "code",
575 | "execution_count": null,
576 | "metadata": {},
577 | "outputs": [],
578 | "source": [
579 | "h = .02 # step size in the mesh\n",
580 | "\n",
581 | "# Plot the decision boundary. For that, we will assign a color to each\n",
582 | "# point in the mesh [x_min, m_max]x[y_min, y_max].\n",
583 | "x_min, x_max = X_small[:, 0].min() - 1, X_small[:, 0].max() + 1\n",
584 | "y_min, y_max = X_small[:, 1].min() - 1, X_small[:, 1].max() + 1\n",
585 | "xx, yy = np.meshgrid(np.arange(x_min, x_max, h),\n",
586 | " np.arange(y_min, y_max, h))\n",
587 | "\n",
588 | "Z = kmeans.predict(np.c_[xx.ravel(), yy.ravel()]) # Make a prediction oat every point \n",
589 | " # in the mesh in order to find the \n",
590 | " # classification areas for each label\n",
591 | "\n",
592 | "# Put the result into a color plot\n",
593 | "Z = Z.reshape(xx.shape)\n",
594 | "plt.figure(figsize=(8, 6))\n",
595 | "plt.pcolormesh(xx, yy, Z, cmap=cmap_bg)\n",
596 | "\n",
597 | "# Plot the training points\n",
598 | "plt.scatter(X_small[:, 0], X_small[:, 1], c=y, cmap=cmap_bold)\n",
599 | "plt.xlim(xx.min(), xx.max())\n",
600 | "plt.ylim(yy.min(), yy.max())\n",
601 | "plt.title(\"KMeans (k = %i)\"\n",
602 | " % (n_clusters))\n",
603 | "plt.xlabel('Sepal length (cm)')\n",
604 | "plt.ylabel('Sepal width (cm)')\n",
605 | "\n",
606 | "# Plot the centroids as a black X\n",
607 | "centroids = kmeans.cluster_centers_\n",
608 | "plt.scatter(centroids[:, 0], centroids[:, 1],\n",
609 | " marker='x', s=169, linewidths=3,\n",
610 | " color='k', zorder=10)\n",
611 | "\n",
612 | "# Plot the legend\n",
613 | "plt.legend(labelList, labelNames)\n",
614 | "\n",
615 | "plt.show()"
616 | ]
617 | },
618 | {
619 | "cell_type": "markdown",
620 | "metadata": {},
621 | "source": [
622 | "Analyzing the clusters\n",
623 | "===\n",
624 | "\n",
625 | "As you can see in the previous plots, K-means does a good job of separating the Setosa species (red) into its own cluster. It also does a reasonable job separating Versicolour (green) and Virginica (blue), although there is a considerable amount of overlap that it can't predict properly.\n",
626 | "\n",
627 | "This is an example where it is important to understand your data (and visualize it whenever possible), as well as understand your machine learning model. In this example, you may want to use a different machine learning model that can separate the data more accurately. Alternatively, we could use all four features to see if that improves accuracy (remember, we aren't using petal length or width here for easier data visualization).\n",
628 | "\n",
629 | "Changing the number of clusters\n",
630 | "---\n",
631 | "What would happen if you changed the number of clusters? What would the plot look like with 2 clusters, or 5? Based on the unlabeled data, how would you try to determine the number of classes to use?\n",
632 | "\n",
633 | "In the next block of code, try changing the number of clusteres and seeing what happens. You may need to change the number of colors represented in cmap_bg to match the number of classes you are using."
634 | ]
635 | },
636 | {
637 | "cell_type": "code",
638 | "execution_count": null,
639 | "metadata": {
640 | "collapsed": true
641 | },
642 | "outputs": [],
643 | "source": [
644 | "# Your code goes here!\n",
645 | "cmap_bg = ListedColormap(['#111111','#333333', '#555555', '#777777', '#999999'])"
646 | ]
647 | },
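  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
"One possible approach (a sketch, not the only answer) is shown below: re-fit K-means with 5 clusters and redraw the boundary plot. The name cmap_bg_5 is made up for this sketch so that it doesn't overwrite the cmap_bg used elsewhere in the notebook."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
"# A sketch of one possible answer: re-fit K-means with 5 clusters\n",
"n_clusters = 5\n",
"\n",
"# One background shade per cluster (new name so we don't overwrite cmap_bg)\n",
"cmap_bg_5 = ListedColormap(['#111111', '#333333', '#555555', '#777777', '#999999'])\n",
"\n",
"kmeans = KMeans(n_clusters=n_clusters)\n",
"kmeans.fit(X_small)\n",
"\n",
"h = .02  # step size in the mesh\n",
"\n",
"# Predict the cluster for every point in the mesh\n",
"x_min, x_max = X_small[:, 0].min() - 1, X_small[:, 0].max() + 1\n",
"y_min, y_max = X_small[:, 1].min() - 1, X_small[:, 1].max() + 1\n",
"xx, yy = np.meshgrid(np.arange(x_min, x_max, h),\n",
"                     np.arange(y_min, y_max, h))\n",
"Z = kmeans.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)\n",
"\n",
"# Plot the cluster areas, the data points, and the centroids\n",
"plt.figure(figsize=(8, 6))\n",
"plt.pcolormesh(xx, yy, Z, cmap=cmap_bg_5)\n",
"plt.scatter(X_small[:, 0], X_small[:, 1])\n",
"plt.xlim(xx.min(), xx.max())\n",
"plt.ylim(yy.min(), yy.max())\n",
"plt.title('KMeans (k = %i)' % n_clusters)\n",
"plt.xlabel('Sepal length (cm)')\n",
"plt.ylabel('Sepal width (cm)')\n",
"\n",
"centroids = kmeans.cluster_centers_\n",
"plt.scatter(centroids[:, 0], centroids[:, 1],\n",
"            marker='x', s=169, linewidths=3,\n",
"            color='k', zorder=10)\n",
"\n",
"plt.show()"
   ]
  },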
648 | {
649 | "cell_type": "markdown",
650 | "metadata": {},
651 | "source": [
652 | "Making Predictions\n",
653 | "===\n",
654 | "\n",
655 | "Now, let's say we go out and measure the sepals of two iris plants, and want to know what group they belong to. We're going to use our classifier to predict the flowers with the following measurements:\n",
656 | "\n",
657 | "Plant | Sepal length | Sepal width\n",
658 | "------|--------------|------------\n",
659 | "A |4.3 |2.5\n",
660 | "B |6.3 |2.1\n",
661 | "\n",
662 | "We can use our classifier's [predict()](http://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html#sklearn.cluster.KMeans.predict) function to predict the label for our input features. We pass in the variable examples to the predict() function, which is a list, and each element is another list containing the features (measurements) for a particular example. The output is a list of labels corresponding to the input examples.\n",
663 | "\n",
664 | "We'll also plot them on the boundary plot, to show why they were predicted that way."
665 | ]
666 | },
667 | {
668 | "cell_type": "code",
669 | "execution_count": null,
670 | "metadata": {},
671 | "outputs": [],
672 | "source": [
673 | "# Add our new data examples\n",
674 | "examples = [[4.3, 2.5], # Plant A\n",
675 | " [6.3, 2.1]] # Plant B\n",
676 | "\n",
677 | "# Choose your number of clusters\n",
678 | "n_clusters = 3\n",
679 | "\n",
680 | "# we create an instance of KMeans Classifier and fit the data.\n",
681 | "kmeans = KMeans(n_clusters=n_clusters)\n",
682 | "kmeans.fit(X_small)\n",
683 | "\n",
684 | "# Predict the labels for our new examples\n",
685 | "labels = kmeans.predict(examples)\n",
686 | "\n",
687 | "# Print the predicted species names\n",
688 | "print('A: Cluster ' + str(labels[0]))\n",
689 | "print('B: Cluster ' + str(labels[1]))\n",
690 | "\n",
691 | "# Now plot the results\n",
692 | "h = .02 # step size in the mesh\n",
693 | "\n",
694 | "# Plot the decision boundary. For that, we will assign a color to each\n",
695 | "# point in the mesh [x_min, m_max]x[y_min, y_max].\n",
696 | "x_min, x_max = X_small[:, 0].min() - 1, X_small[:, 0].max() + 1\n",
697 | "y_min, y_max = X_small[:, 1].min() - 1, X_small[:, 1].max() + 1\n",
698 | "xx, yy = np.meshgrid(np.arange(x_min, x_max, h),\n",
699 | " np.arange(y_min, y_max, h))\n",
700 | "Z = kmeans.predict(np.c_[xx.ravel(), yy.ravel()])\n",
701 | "\n",
702 | "# Put the result into a color plot\n",
703 | "Z = Z.reshape(xx.shape)\n",
704 | "plt.figure(figsize=(8, 6))\n",
705 | "plt.pcolormesh(xx, yy, Z, cmap=cmap_bg)\n",
706 | "\n",
707 | "# Plot the training points\n",
708 | "plt.scatter(X_small[:, 0], X_small[:, 1])\n",
709 | "plt.xlim(xx.min(), xx.max())\n",
710 | "plt.ylim(yy.min(), yy.max())\n",
711 | "plt.title(\"KMeans (k = %i)\"\n",
712 | " % (n_clusters))\n",
713 | "plt.xlabel('Sepal length (cm)')\n",
714 | "plt.ylabel('Sepal width (cm)')\n",
715 | "\n",
716 | "# Plot the centroids as a black X\n",
717 | "centroids = kmeans.cluster_centers_\n",
718 | "plt.scatter(centroids[:, 0], centroids[:, 1],\n",
719 | " marker='x', s=169, linewidths=3,\n",
720 | " color='k', zorder=10)\n",
721 | "\n",
722 | "# Display the new examples as labeled text on the graph\n",
723 | "plt.text(examples[0][0], examples[0][1],'A', fontsize=14)\n",
724 | "plt.text(examples[1][0], examples[1][1],'B', fontsize=14)\n",
725 | "\n",
726 | "plt.show()"
727 | ]
728 | },
729 | {
730 | "cell_type": "markdown",
731 | "metadata": {},
732 | "source": [
733 | "As you can see, example A is grouped into Cluster 2 and example B is grouped into Cluster 0. Remember, K-means does not use labels. It only clusters the data by feature similarity, and it's up to us to decide what the clusters mean (or if they don't mean anything at all).\n",
734 | "\n",
735 | "Using different features\n",
736 | "===\n",
737 | "Try using different combinations of the four features and see what results you get. Does it make it any easier to determine how many clusters should be used?\n"
738 | ]
739 | },
740 | {
741 | "cell_type": "code",
742 | "execution_count": null,
743 | "metadata": {
744 | "collapsed": true
745 | },
746 | "outputs": [],
747 | "source": [
748 | "# Your code goes here!\n",
749 | "#cmap_bg = ListedColormap(['#111111','#333333', '#555555', '#777777', '#999999'])"
750 | ]
751 | },
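  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
"One possible combination (a sketch, not the only answer) is shown below: cluster on petal length and width (columns 2 and 3 of X) instead of the sepal measurements. The names X_petal, kmeans_petal, and centroids_petal are made up for this sketch."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
"# A sketch of one possible answer: cluster on petal length and width\n",
"X_petal = X[:, 2:4]\n",
"\n",
"kmeans_petal = KMeans(n_clusters=3)\n",
"kmeans_petal.fit(X_petal)\n",
"\n",
"h = .02  # step size in the mesh\n",
"\n",
"# Predict the cluster for every point in the mesh\n",
"x_min, x_max = X_petal[:, 0].min() - 1, X_petal[:, 0].max() + 1\n",
"y_min, y_max = X_petal[:, 1].min() - 1, X_petal[:, 1].max() + 1\n",
"xx, yy = np.meshgrid(np.arange(x_min, x_max, h),\n",
"                     np.arange(y_min, y_max, h))\n",
"Z = kmeans_petal.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)\n",
"\n",
"# Plot the cluster areas, the data points, and the centroids\n",
"plt.figure(figsize=(8, 6))\n",
"plt.pcolormesh(xx, yy, Z, cmap=cmap_bg)\n",
"plt.scatter(X_petal[:, 0], X_petal[:, 1])\n",
"plt.xlim(xx.min(), xx.max())\n",
"plt.ylim(yy.min(), yy.max())\n",
"plt.title('KMeans on petal measurements (k = 3)')\n",
"plt.xlabel('Petal length (cm)')\n",
"plt.ylabel('Petal width (cm)')\n",
"\n",
"centroids_petal = kmeans_petal.cluster_centers_\n",
"plt.scatter(centroids_petal[:, 0], centroids_petal[:, 1],\n",
"            marker='x', s=169, linewidths=3,\n",
"            color='k', zorder=10)\n",
"\n",
"plt.show()"
   ]
  },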
752 | {
753 | "cell_type": "markdown",
754 | "metadata": {},
755 | "source": [
756 | "Final Notes\n",
757 | "===\n",
758 | "\n",
759 | "Some final things to keep in mind when using K-means to cluster your data\n",
760 | "\n",
761 | "* K-means is unsupervised, meaning it clusters data by similarity of features and does not require (or even use) labels.\n",
762 | "* How well it works is partially dependent on choosing the right number of clusters for the dataset. You can do this using your knowledge of the data (like we did, knowing we are looking at 3 species of plant). Alternatively, there are ways to try to experimentally find the best number of clusters.\n",
763 | "* The output does not provide a meaningful label, only a cluster assignment for the data. It is up to you to determine the meaning of each cluster."
764 | ]
765 | },
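  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
"As a rough sketch of the experimental approach mentioned in the second bullet (an addition to this tutorial, not the only method), the \"elbow\" heuristic fits K-means for several values of k and plots the within-cluster sum of squares (the fitted model's inertia_ attribute). The value of k where the curve stops dropping sharply is a reasonable choice."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
"# Sketch of the elbow heuristic for choosing the number of clusters\n",
"inertias = []\n",
"k_values = range(1, 8)\n",
"for k in k_values:\n",
"    km = KMeans(n_clusters=k)\n",
"    km.fit(X_small)\n",
"    inertias.append(km.inertia_)  # within-cluster sum of squares\n",
"\n",
"plt.figure(figsize=(8, 6))\n",
"plt.plot(list(k_values), inertias, 'o-')\n",
"plt.xlabel('Number of clusters (k)')\n",
"plt.ylabel('Within-cluster sum of squares (inertia)')\n",
"plt.title('Elbow heuristic for choosing k')\n",
"plt.show()"
   ]
  },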
766 | {
767 | "cell_type": "code",
768 | "execution_count": null,
769 | "metadata": {
770 | "collapsed": true
771 | },
772 | "outputs": [],
773 | "source": []
774 | }
775 | ],
776 | "metadata": {
777 | "kernelspec": {
778 | "display_name": "Python 3",
779 | "language": "python",
780 | "name": "python3"
781 | },
782 | "language_info": {
783 | "codemirror_mode": {
784 | "name": "ipython",
785 | "version": 3
786 | },
787 | "file_extension": ".py",
788 | "mimetype": "text/x-python",
789 | "name": "python",
790 | "nbconvert_exporter": "python",
791 | "pygments_lexer": "ipython3",
792 | "version": "3.6.2"
793 | }
794 | },
795 | "nbformat": 4,
796 | "nbformat_minor": 1
797 | }
798 |
--------------------------------------------------------------------------------
/KNN_Tutorial.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "Nearest Neighbor Tutorial\n",
8 | "===\n",
9 | "\n",
10 | "K-nearest neighbors, or K-NN, is a simple form of supervised learning. It assigns an output label to a new input example x based on it's closest neighboring datapoints. The number K is the number of data points to use. For K=1, x is assigned the label of the closest neighbor. If K>1, the majority vote is used to label x.\n",
11 | "\n",
12 | "The code in this tutorial is slightly modified from the scikit-learn [K-NN example](http://scikit-learn.org/stable/auto_examples/neighbors/plot_classification.html#example-neighbors-plot-classification-py). There is also information on the K-NN classifier function [KNeighborsClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html#sklearn.neighbors.KNeighborsClassifier).\n"
13 | ]
14 | },
15 | {
16 | "cell_type": "markdown",
17 | "metadata": {},
18 | "source": [
19 | "Setup\n",
20 | "===\n",
21 | "Tell matplotlib to print figures in the notebook. Then import numpy (for numerical data), matplotlib.pyplot (for plotting figures), ListedColormap (for plotting colors), neighbors (for the scikit-learn nearest-neighbor algorithm), and datasets (to download the iris dataset from scikit-learn).\n",
22 | "\n",
23 | "Also create the color maps to use to color the plotted data, and \"labelList\", which is a list of colored rectangles to use in plotted legends"
24 | ]
25 | },
26 | {
27 | "cell_type": "code",
28 | "execution_count": null,
29 | "metadata": {},
30 | "outputs": [],
31 | "source": [
32 | "# Print figures in the notebook\n",
33 | "%matplotlib inline \n",
34 | "\n",
35 | "import numpy as np\n",
36 | "import matplotlib.pyplot as plt\n",
37 | "from matplotlib.colors import ListedColormap\n",
38 | "from sklearn import neighbors, datasets # Import the nerest neighbor function and dataset from scikit-learn\n",
39 | "from sklearn.model_selection import train_test_split, KFold\n",
40 | "\n",
41 | "# Import patch for drawing rectangles in the legend\n",
42 | "from matplotlib.patches import Rectangle\n",
43 | "\n",
44 | "# Create color maps\n",
45 | "cmap_bold = ListedColormap(['#FF0000', '#00FF00', '#0000FF'])\n",
46 | "cmap_light = ListedColormap(['#FFAAAA', '#AAFFAA', '#AAAAFF'])\n",
47 | "\n",
48 | "# Create a legend for the colors, using rectangles for the corresponding colormap colors\n",
49 | "labelList = []\n",
50 | "for color in cmap_bold.colors:\n",
51 | " labelList.append(Rectangle((0, 0), 1, 1, fc=color))"
52 | ]
53 | },
54 | {
55 | "cell_type": "markdown",
56 | "metadata": {},
57 | "source": [
58 | "Import the dataset\n",
59 | "===\n",
60 | "Import the dataset and store it to a variable called iris. Scikit-learn's explanation of the dataset is [here](http://scikit-learn.org/stable/auto_examples/datasets/plot_iris_dataset.html). This dataset is similar to a python dictionary, with the keys: ['DESCR', 'target_names', 'target', 'data', 'feature_names']\n",
61 | "\n",
62 | "The data features are stored in iris.data, where each row is an example from a single flower, and each column is a single feature. The feature names are stored in iris.feature_names. Labels are stored as the numbers 0, 1, or 2 in iris.target, and the names of these labels are in iris.target_names.\n",
63 | "\n",
64 | "The dataset consists of measurements made on 50 examples from each of three different species of iris flowers (Setosa, Versicolour, and Virginica). Each example has four features (or measurements): [sepal](http://en.wikipedia.org/wiki/Sepal) length, sepal width, [petal](http://en.wikipedia.org/wiki/Petal) length, and petal width. All measurements are in cm.\n",
65 | "\n",
66 | "Below, we load the labels into y, the corresponding label names into labelNames, the data into X, and the names of the features into featureNames."
67 | ]
68 | },
69 | {
70 | "cell_type": "code",
71 | "execution_count": null,
72 | "metadata": {
73 | "collapsed": true
74 | },
75 | "outputs": [],
76 | "source": [
77 | "# Import some data to play with\n",
78 | "iris = datasets.load_iris()\n",
79 | "\n",
80 | "# Store the labels (y), label names, features (X), and feature names\n",
81 | "y = iris.target # Labels are stored in y as numbers\n",
82 | "labelNames = iris.target_names # Species names corresponding to labels 0, 1, and 2\n",
83 | "X = iris.data\n",
84 | "featureNames = iris.feature_names\n"
85 | ]
86 | },
87 | {
88 | "cell_type": "markdown",
89 | "metadata": {},
90 | "source": [
91 | "Below, we plot the first two features from the dataset (sepal length and width). Normally we would try to use all useful features, but sticking with two allows us to visualize the data more easily.\n",
92 | "\n",
93 | "Then we plot the data to get a look at what we're dealing with. The colormap is used to determine what colors are used for each class when plotting."
94 | ]
95 | },
96 | {
97 | "cell_type": "code",
98 | "execution_count": null,
99 | "metadata": {},
100 | "outputs": [],
101 | "source": [
102 | "# Plot the data\n",
103 | "\n",
104 | "# Sepal length and width\n",
105 | "X_small = X[:,:2]\n",
106 | "# Get the minimum and maximum values with an additional 0.5 border\n",
107 | "x_min, x_max = X_small[:, 0].min() - .5, X_small[:, 0].max() + .5\n",
108 | "y_min, y_max = X_small[:, 1].min() - .5, X_small[:, 1].max() + .5\n",
109 | "\n",
110 | "plt.figure(figsize=(8, 6))\n",
111 | "\n",
112 | "# Plot the training points\n",
113 | "plt.scatter(X_small[:, 0], X_small[:, 1], c=y, cmap=cmap_bold)\n",
114 | "plt.xlabel('Sepal length (cm)')\n",
115 | "plt.ylabel('Sepal width (cm)')\n",
116 | "plt.title('Sepal width vs length')\n",
117 | "\n",
118 | "# Set the plot limits\n",
119 | "plt.xlim(x_min, x_max)\n",
120 | "plt.ylim(y_min, y_max)\n",
121 | "\n",
122 | "# Plot the legend\n",
123 | "plt.legend(labelList, labelNames)\n",
124 | "\n",
125 | "plt.show()\n"
126 | ]
127 | },
128 | {
129 | "cell_type": "markdown",
130 | "metadata": {},
131 | "source": [
132 | "Nearest neighbors: training\n",
133 | "===\n",
134 | "Next, we train a nearest neighbor classifier on our data. \n",
135 | "\n",
136 | "The first section chooses the number of neighbors to use, and stores it in the variable n_neighbors (another, more intuitive, name for the K variable mentioned previously). \n",
137 | "\n",
138 | "The last two lines create and train the classifier. Line 1 creates a classifier (clf) using the [KNeighborsClassifier()](http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html#sklearn.neighbors.KNeighborsClassifier) function, and tells it to use the number of neighbors stored in n_neighbors. Line 2 uses the fit() method to train the classifier on the features in X_small, using the labels in y."
139 | ]
140 | },
141 | {
142 | "cell_type": "code",
143 | "execution_count": null,
144 | "metadata": {},
145 | "outputs": [],
146 | "source": [
147 | "# Choose your number of neighbors\n",
148 | "n_neighbors = 15\n",
149 | "\n",
150 | "# we create an instance of Neighbours Classifier and fit the data.\n",
151 | "clf = neighbors.KNeighborsClassifier(n_neighbors)\n",
152 | "clf.fit(X_small, y)"
153 | ]
154 | },
155 | {
156 | "cell_type": "markdown",
157 | "metadata": {},
158 | "source": [
159 | "Plot the classification boundaries\n",
160 | "===\n",
161 | "Now that we have our classifier, let's visualize what it's doing. \n",
162 | "\n",
163 | "First we plot the decision boundaries, or the lines dividing areas assigned to the different labels (species of iris). Then we plot our examples onto the space, showing where each point lies and the corresponding decision boundary.\n",
164 | "\n",
165 | "The colored background shows the areas that are considered to belong to a certain label. If we took sepal measurements from a new flower, we could plot it in this space and use the color to determine which type of iris our classifier believes it to be."
166 | ]
167 | },
168 | {
169 | "cell_type": "code",
170 | "execution_count": null,
171 | "metadata": {},
172 | "outputs": [],
173 | "source": [
174 | "h = .02 # step size in the mesh\n",
175 | "\n",
176 | "# Plot the decision boundary. For that, we will assign a color to each\n",
177 | "# point in the mesh [x_min, m_max]x[y_min, y_max].\n",
178 | "x_min, x_max = X_small[:, 0].min() - 1, X_small[:, 0].max() + 1\n",
179 | "y_min, y_max = X_small[:, 1].min() - 1, X_small[:, 1].max() + 1\n",
180 | "xx, yy = np.meshgrid(np.arange(x_min, x_max, h),\n",
181 | " np.arange(y_min, y_max, h))\n",
182 | "\n",
183 | "Z = clf.predict(np.c_[xx.ravel(), yy.ravel()]) # Make a prediction oat every point \n",
184 | " # in the mesh in order to find the \n",
185 | " # classification areas for each label\n",
186 | "\n",
187 | "# Put the result into a color plot\n",
188 | "Z = Z.reshape(xx.shape)\n",
189 | "plt.figure(figsize=(8, 6))\n",
190 | "plt.pcolormesh(xx, yy, Z, cmap=cmap_light)\n",
191 | "\n",
192 | "# Plot the training points\n",
193 | "plt.scatter(X_small[:, 0], X_small[:, 1], c=y, cmap=cmap_bold)\n",
194 | "plt.xlim(xx.min(), xx.max())\n",
195 | "plt.ylim(yy.min(), yy.max())\n",
196 | "plt.title(\"3-Class classification (k = %i)\"\n",
197 | " % (n_neighbors))\n",
198 | "plt.xlabel('Sepal length (cm)')\n",
199 | "plt.ylabel('Sepal width (cm)')\n",
200 | "\n",
201 | "# Plot the legend\n",
202 | "plt.legend(labelList, labelNames)\n",
203 | "\n",
204 | "plt.show()"
205 | ]
206 | },
207 | {
208 | "cell_type": "markdown",
209 | "metadata": {},
210 | "source": [
211 | "Changing the number of neighbors\n",
212 | "===\n",
213 | "\n",
214 | "Change the number of neighbors (n_neighbors) below, and see how the plot boundaries change. Try as many or as few as you'd like, but remember there are only 150 examples in the dataset, so selecting all 150 wouldn't be very useful!\n",
215 | "\n"
216 | ]
217 | },
218 | {
219 | "cell_type": "code",
220 | "execution_count": null,
221 | "metadata": {},
222 | "outputs": [],
223 | "source": [
224 | "# Choose your number of neighbors\n",
225 | "n_neighbors = # Choose a new number of neighbors\n",
226 | "\n",
227 | "# we create an instance of Neighbours Classifier and fit the data.\n",
228 | "clf = neighbors.KNeighborsClassifier(n_neighbors)\n",
229 | "clf.fit(X_small, y)\n",
230 | "\n",
231 | "h = .02 # step size in the mesh\n",
232 | "\n",
233 | "# Plot the decision boundary. For that, we will assign a color to each\n",
234 | "# point in the mesh [x_min, m_max]x[y_min, y_max].\n",
235 | "x_min, x_max = X_small[:, 0].min() - 1, X_small[:, 0].max() + 1\n",
236 | "y_min, y_max = X_small[:, 1].min() - 1, X_small[:, 1].max() + 1\n",
237 | "xx, yy = np.meshgrid(np.arange(x_min, x_max, h),\n",
238 | " np.arange(y_min, y_max, h))\n",
239 | "\n",
240 | "Z = clf.predict(np.c_[xx.ravel(), yy.ravel()]) # Make a prediction oat every point \n",
241 | " # in the mesh in order to find the \n",
242 | " # classification areas for each label\n",
243 | "\n",
244 | "# Put the result into a color plot\n",
245 | "Z = Z.reshape(xx.shape)\n",
246 | "plt.figure(figsize=(8, 6))\n",
247 | "plt.pcolormesh(xx, yy, Z, cmap=cmap_light)\n",
248 | "\n",
249 | "# Plot the training points\n",
250 | "plt.scatter(X_small[:, 0], X_small[:, 1], c=y, cmap=cmap_bold)\n",
251 | "plt.xlim(xx.min(), xx.max())\n",
252 | "plt.ylim(yy.min(), yy.max())\n",
253 | "plt.title(\"3-Class classification (k = %i)\"\n",
254 | " % (n_neighbors))\n",
255 | "plt.xlabel('Sepal length (cm)')\n",
256 | "plt.ylabel('Sepal width (cm)')\n",
257 | "\n",
258 | "# Plot the legend\n",
259 | "plt.legend(labelList, labelNames)\n",
260 | "\n",
261 | "plt.show()"
262 | ]
263 | },
264 | {
265 | "cell_type": "markdown",
266 | "metadata": {},
267 | "source": [
268 | "Making predictions\n",
269 | "===\n",
270 | "\n",
271 | "Now, let's say we go out and measure the sepals of two new iris plants, and want to know what species they are. We're going to use our classifier to predict the flowers with the following measurements:\n",
272 | "\n",
273 | "Plant | Sepal length | Sepal width\n",
274 | "------|--------------|------------\n",
275 | "A |4.3 |2.5\n",
276 | "B |6.3 |2.1\n",
277 | "\n",
278 | "We can use our classifier's [predict()](http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html#sklearn.neighbors.KNeighborsClassifier.predict) function to predict the label for our input features. We pass in the variable examples to the predict() function, which is a list, and each element is another list containing the features (measurements) for a particular example. The output is a list of labels corresponding to the input examples.\n"
279 | ]
280 | },
281 | {
282 | "cell_type": "code",
283 | "execution_count": null,
284 | "metadata": {},
285 | "outputs": [],
286 | "source": [
287 | "# Add our new data examples\n",
288 | "examples = [[4.3, 2.5], # Plant A\n",
289 | " [6.3, 2.1]] # Plant B\n",
290 | "\n",
291 | "# Reset our number of neighbors\n",
292 | "n_neighbors = 15\n",
293 | "\n",
294 | "# Create an instance of Neighbours Classifier and fit the original data.\n",
295 | "clf = neighbors.KNeighborsClassifier(n_neighbors)\n",
296 | "clf.fit(X_small, y)\n",
297 | "\n",
298 | "# Predict the labels for our new examples\n",
299 | "labels = clf.predict(examples)\n",
300 | "\n",
301 | "# Print the predicted species names\n",
302 | "print('A: ' + labelNames[labels[0]])\n",
303 | "print('B: ' + labelNames[labels[1]])\n"
304 | ]
305 | },
306 | {
307 | "cell_type": "markdown",
308 | "metadata": {},
309 | "source": [
310 | "Plotting our predictions\n",
311 | "---\n",
312 | "Now let's plot our predictions to see why they were classified that way."
313 | ]
314 | },
315 | {
316 | "cell_type": "code",
317 | "execution_count": null,
318 | "metadata": {},
319 | "outputs": [],
320 | "source": [
321 | "# Now plot the results\n",
322 | "h = .02 # step size in the mesh\n",
323 | "\n",
324 | "# Plot the decision boundary. For that, we will assign a color to each\n",
325 | "# point in the mesh [x_min, m_max]x[y_min, y_max].\n",
326 | "x_min, x_max = X_small[:, 0].min() - 1, X_small[:, 0].max() + 1\n",
327 | "y_min, y_max = X_small[:, 1].min() - 1, X_small[:, 1].max() + 1\n",
328 | "xx, yy = np.meshgrid(np.arange(x_min, x_max, h),\n",
329 | " np.arange(y_min, y_max, h))\n",
330 | "\n",
331 | "Z = clf.predict(np.c_[xx.ravel(), yy.ravel()]) # Make a prediction oat every point \n",
332 | " # in the mesh in order to find the \n",
333 | " # classification areas for each label\n",
334 | "\n",
335 | "# Put the result into a color plot\n",
336 | "Z = Z.reshape(xx.shape)\n",
337 | "plt.figure(figsize=(8, 6))\n",
338 | "plt.pcolormesh(xx, yy, Z, cmap=cmap_light)\n",
339 | "\n",
340 | "# Plot the training points\n",
341 | "plt.scatter(X_small[:, 0], X_small[:, 1], c=y, cmap=cmap_bold)\n",
342 | "plt.xlim(xx.min(), xx.max())\n",
343 | "plt.ylim(yy.min(), yy.max())\n",
344 | "plt.title(\"3-Class classification (k = %i)\"\n",
345 | " % (n_neighbors))\n",
346 | "plt.xlabel('Sepal length (cm)')\n",
347 | "plt.ylabel('Sepal width (cm)')\n",
348 | "\n",
349 | "# Display the new examples as labeled text on the graph\n",
350 | "plt.text(examples[0][0], examples[0][1],'A', fontsize=14)\n",
351 | "plt.text(examples[1][0], examples[1][1],'B', fontsize=14)\n",
352 | "\n",
353 | "# Plot the legend\n",
354 | "plt.legend(labelList, labelNames)\n",
355 | "\n",
356 | "plt.show()"
357 | ]
358 | },
359 | {
360 | "cell_type": "markdown",
361 | "metadata": {},
362 | "source": [
363 | "What about our other features?\n",
364 | "===\n",
365 | "You may remember that our original dataset contains two additional features, the length and width of the petals.\n",
366 | "\n",
367 | "What does the plot look like when you train on the petal length and width? How does it change when you change the number of neighbors?\n",
368 | "\n",
369 | "How would you plot our two new plants, A and B, on these new plots? Assume we have all four measurements for each plant, as shown below.\n",
370 | "\n",
371 | "Plant | Sepal length | Sepal width| Petal length | Petal width\n",
372 | "------|--------------|------------|--------------|------------\n",
373 | "A |4.3 |2.5 | 1.5 | 0.5\n",
374 | "B |6.3 |2.1 | 4.8 | 1.5\n",
375 | "\n"
376 | ]
377 | },
378 | {
379 | "cell_type": "code",
380 | "execution_count": null,
381 | "metadata": {
382 | "collapsed": true
383 | },
384 | "outputs": [],
385 | "source": [
386 | "# Put your code here! \n",
387 | "\n",
388 | "# Feel free to add as many code cells as you need"
389 | ]
390 | },
391 | {
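  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
"One possible version of the exercise above (a sketch, not the only answer): train on petal length and width (columns 2 and 3 of X) and place plants A and B using their petal measurements from the table. The names X_petal and clf_petal are made up for this sketch."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
"# A sketch of one possible answer: classify using petal length and width\n",
"X_petal = X[:, 2:4]\n",
"\n",
"n_neighbors = 15\n",
"clf_petal = neighbors.KNeighborsClassifier(n_neighbors)\n",
"clf_petal.fit(X_petal, y)\n",
"\n",
"h = .02  # step size in the mesh\n",
"\n",
"# Predict the label for every point in the mesh\n",
"x_min, x_max = X_petal[:, 0].min() - 1, X_petal[:, 0].max() + 1\n",
"y_min, y_max = X_petal[:, 1].min() - 1, X_petal[:, 1].max() + 1\n",
"xx, yy = np.meshgrid(np.arange(x_min, x_max, h),\n",
"                     np.arange(y_min, y_max, h))\n",
"Z = clf_petal.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)\n",
"\n",
"# Plot the decision areas and the training points\n",
"plt.figure(figsize=(8, 6))\n",
"plt.pcolormesh(xx, yy, Z, cmap=cmap_light)\n",
"plt.scatter(X_petal[:, 0], X_petal[:, 1], c=y, cmap=cmap_bold)\n",
"plt.xlim(xx.min(), xx.max())\n",
"plt.ylim(yy.min(), yy.max())\n",
"plt.title('3-Class classification on petal measurements (k = %i)' % n_neighbors)\n",
"plt.xlabel('Petal length (cm)')\n",
"plt.ylabel('Petal width (cm)')\n",
"\n",
"# Plants A and B, using the petal length and width from the table above\n",
"plt.text(1.5, 0.5, 'A', fontsize=14)\n",
"plt.text(4.8, 1.5, 'B', fontsize=14)\n",
"\n",
"# Plot the legend\n",
"plt.legend(labelList, labelNames)\n",
"\n",
"plt.show()"
   ]
  },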
392 | "cell_type": "markdown",
393 | "metadata": {},
394 | "source": [
395 | "Changing neighbor weights\n",
396 | "===\n",
397 | "Looking at our orignal plot of sepal dimensions, we can see that plant A is classified as Setosa (red) and B is classified as Virginica (blue). While A seems to be clearly correct, B is much closer to two examples from Versicolour (green). Maybe we should give more importance to labels that are closer to our example?\n",
398 | "\n",
399 | "In the previous examples, all the neighbors were considered equally important when deciding what label to give our input. But what if we want to give more importance (or a heigher weight) to neighbors that are closer to our new example? The K-NN algorithm allows you to change from uniform weights, where all included neighbors have the same importance, to distance-based weights, where closer neighbors are given more consideration. \n",
400 | "\n",
401 | "Below, we create a new classifier using distance-based weights and plot the results. The only difference in the code is that we call KNeighborsClassifier() with the argument weights='distance'. \n",
402 | "\n",
403 | "Look at how it's different from the original plot, and see how the classifications of plant B change. Try it with different combinations of neighbors, and compare it to the previous plots."
404 | ]
405 | },
406 | {
407 | "cell_type": "code",
408 | "execution_count": null,
409 | "metadata": {},
410 | "outputs": [],
411 | "source": [
412 | "# Choose your number of neighbors\n",
413 | "n_neighbors_distance = 15\n",
414 | "\n",
415 | "# we create an instance of Neighbours Classifier and fit the data.\n",
416 | "clf_distance = neighbors.KNeighborsClassifier(n_neighbors_distance, \n",
417 | " weights='distance')\n",
418 | "clf_distance.fit(X_small, y)\n",
419 | "\n",
420 | "# Predict the labels of the new examples\n",
421 | "labels = clf_distance.predict(examples)\n",
422 | "\n",
423 | "# Print the predicted species names\n",
424 | "print('A: ' + labelNames[labels[0]])\n",
425 | "print('B: ' + labelNames[labels[1]])\n",
426 | "\n",
427 | "# Plot the results\n",
428 | "h = .02 # step size in the mesh\n",
429 | "\n",
430 | "# Plot the decision boundary. For that, we will assign a color to each\n",
431 | "# point in the mesh [x_min, m_max]x[y_min, y_max].\n",
432 | "x_min, x_max = X_small[:, 0].min() - 1, X_small[:, 0].max() + 1\n",
433 | "y_min, y_max = X_small[:, 1].min() - 1, X_small[:, 1].max() + 1\n",
434 | "xx, yy = np.meshgrid(np.arange(x_min, x_max, h),\n",
435 | " np.arange(y_min, y_max, h))\n",
436 | "\n",
437 | "Z = clf_distance.predict(np.c_[xx.ravel(), yy.ravel()]) # Make a prediction oat every point \n",
438 | " # in the mesh in order to find the \n",
439 | " # classification areas for each label\n",
440 | "\n",
441 | "# Put the result into a color plot\n",
442 | "Z = Z.reshape(xx.shape)\n",
443 | "plt.figure(figsize=(8, 6))\n",
444 | "plt.pcolormesh(xx, yy, Z, cmap=cmap_light)\n",
445 | "\n",
446 | "# Plot the training points\n",
447 | "plt.scatter(X_small[:, 0], X_small[:, 1], c=y, cmap=cmap_bold)\n",
448 | "plt.xlim(xx.min(), xx.max())\n",
449 | "plt.ylim(yy.min(), yy.max())\n",
450 | "plt.title(\"3-Class classification (k = %i)\"\n",
451 | " % (n_neighbors))\n",
452 | "plt.xlabel('Sepal length (cm)')\n",
453 | "plt.ylabel('Sepal width (cm)')\n",
454 | "\n",
455 | "# Display the new examples as labeled text on the graph\n",
456 | "plt.text(examples[0][0], examples[0][1],'A', fontsize=14)\n",
457 | "plt.text(examples[1][0], examples[1][1],'B', fontsize=14)\n",
458 | "\n",
459 | "# Plot the legend\n",
460 | "plt.legend(labelList, labelNames)\n",
461 | "\n",
462 | "plt.show()"
463 | ]
464 | },
465 | {
466 | "cell_type": "markdown",
467 | "metadata": {},
468 | "source": [
469 | "Now, see how this affects your plots using other features."
470 | ]
471 | },
472 | {
473 | "cell_type": "code",
474 | "execution_count": null,
475 | "metadata": {
476 | "collapsed": true
477 | },
478 | "outputs": [],
479 | "source": [
480 | "# Put your code here!"
481 | ]
482 | },
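  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
"For example (a brief sketch, not the only answer), the cell below uses distance weighting with the petal measurements and predicts plants A and B from the petal values in the table above. A boundary plot could be added in the same way as in the earlier cells."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
"# A sketch: distance-weighted K-NN on petal length and width\n",
"clf_petal_distance = neighbors.KNeighborsClassifier(15, weights='distance')\n",
"clf_petal_distance.fit(X[:, 2:4], y)\n",
"\n",
"# Predict plants A and B from their petal measurements\n",
"petal_examples = [[1.5, 0.5],  # Plant A\n",
"                  [4.8, 1.5]]  # Plant B\n",
"labels = clf_petal_distance.predict(petal_examples)\n",
"\n",
"# Print the predicted species names\n",
"print('A: ' + labelNames[labels[0]])\n",
"print('B: ' + labelNames[labels[1]])"
   ]
  },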
483 | {
484 | "cell_type": "markdown",
485 | "metadata": {},
486 | "source": [
487 | "Using more than two features\n",
488 | "===\n",
489 | "Using two features is great for visualizing, but it's often not good for training a good classifier. Below, we will train a classifier using the 'distance' weighting method and all four features, and use that to predict plants A and B.\n",
490 | "\n",
491 | "How do the predictions compare to our predictions using only two labels?"
492 | ]
493 | },
494 | {
495 | "cell_type": "code",
496 | "execution_count": null,
497 | "metadata": {},
498 | "outputs": [],
499 | "source": [
500 | "# Add our new data examples\n",
501 | "examples = [[4.3, 2.5, 1.5, 0.5], # Plant A\n",
502 | " [6.3, 2.1, 4.8, 1.5]] # Plant B\n",
503 | "\n",
504 | "# Choose your number of neighbors\n",
505 | "n_neighbors_distance = 15\n",
506 | "\n",
507 | "# we create an instance of Neighbours Classifier and fit the data.\n",
508 | "clf_distance = neighbors.KNeighborsClassifier(n_neighbors_distance, weights='distance')\n",
509 | "clf_distance.fit(X, y)\n",
510 | "\n",
511 | "# Predict the labels of the new examples\n",
512 | "labels = clf_distance.predict(examples)\n",
513 | "\n",
514 | "# Print the predicted species names\n",
515 | "print('A: ' + labelNames[labels[0]])\n",
516 | "print('B: ' + labelNames[labels[1]])"
517 | ]
518 | },
519 | {
520 | "cell_type": "markdown",
521 | "metadata": {},
522 | "source": [
523 | "Evaluating The Classifier\n",
524 | "===\n",
525 | "\n",
526 | "In order to evaluate a classifier, we need to split our dataset into training data, which we'll show to the classifier so it can learn, and testing data, which we will hold back from training and use to test its predictions.\n",
527 | "\n",
528 | "Below, we create the training and testing datasets, using all four features. We then train our classifier on the training data, and get the predictions for the test data."
529 | ]
530 | },
531 | {
532 | "cell_type": "code",
533 | "execution_count": null,
534 | "metadata": {
535 | "collapsed": true
536 | },
537 | "outputs": [],
538 | "source": [
539 | "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)\n",
540 | "\n",
541 | "# Choose your number of neighbors\n",
542 | "n_neighbors_distance = 15\n",
543 | "\n",
544 | "# we create an instance of Neighbours Classifier and fit the data.\n",
545 | "clf_distance = neighbors.KNeighborsClassifier(n_neighbors_distance, weights='distance')\n",
546 | "clf_distance.fit(X_train, y_train)\n",
547 | "\n",
548 | "# Predict the labels of the test data\n",
549 | "predictions = clf_distance.predict(X_test)"
550 | ]
551 | },
552 | {
553 | "cell_type": "markdown",
554 | "metadata": {},
555 | "source": [
556 | "Next, we evaluate how well the classifier did. The easiest way to do this is to get the average number of correct predictions, usually referred to as the accuracy"
557 | ]
558 | },
559 | {
560 | "cell_type": "code",
561 | "execution_count": null,
562 | "metadata": {
563 | "scrolled": true
564 | },
565 | "outputs": [],
566 | "source": [
567 | "accuracy = np.mean(predictions == y_test )*100\n",
568 | "\n",
569 | "print('The accuracy is ' + '%.2f' % accuracy + '%')"
570 | ]
571 | },
572 | {
573 | "cell_type": "markdown",
574 | "metadata": {
575 | "collapsed": true
576 | },
577 | "source": [
578 | "Comparing Models with Crossvalidation\n",
579 | "===\n",
580 | "\n",
581 | "To select the best number of neighbors to use in our model, we need to use crossvalidation. We can then get our final result using our test data.\n",
582 | "\n",
583 | "First we choose the number of neighbors we want to investigate, then divide our training data into folds. We loop through the sets of training and validation folds. Each time, we train each model on the training data and evaluate on the validation data. We store the accuracy of each classifier on each fold so we can look at them later."
584 | ]
585 | },
586 | {
587 | "cell_type": "code",
588 | "execution_count": null,
589 | "metadata": {},
590 | "outputs": [],
591 | "source": [
592 | "# Choose our k values\n",
593 | "kvals = [1,3,5,10,20,40]\n",
594 | "\n",
595 | "# Create a dictionary of arrays to store accuracies\n",
596 | "accuracies = {}\n",
597 | "for k in kvals:\n",
598 | " accuracies[k] = []\n",
599 | "\n",
600 | "# Loop through 5 folds\n",
601 | "kf = KFold(n_splits=5)\n",
602 | "for trainInd, valInd in kf.split(X_train):\n",
603 | " X_tr = X_train[trainInd,:]\n",
604 | " y_tr = y_train[trainInd]\n",
605 | " X_val = X_train[valInd,:]\n",
606 | " y_val = y_train[valInd]\n",
607 | " \n",
608 | " # Loop through each value of k\n",
609 | " for k in kvals:\n",
610 | " \n",
611 | " # Create the classifier\n",
612 | " clf = neighbors.KNeighborsClassifier(k, weights='distance')\n",
613 | " \n",
614 | " # Train the classifier\n",
615 | " clf.fit(X_tr, y_tr) \n",
616 | " \n",
617 | " # Make our predictions\n",
618 | " pred = clf.predict(X_val)\n",
619 | " \n",
620 | " # Calculate the accuracy\n",
621 | " accuracies[k].append(np.mean(pred == y_val))\n",
622 | " "
623 | ]
624 | },
625 | {
626 | "cell_type": "markdown",
627 | "metadata": {},
628 | "source": [
629 | "Select a Model\n",
630 | "---\n",
631 | "\n",
632 | "To select a model, we look at the average accuracy across all folds."
633 | ]
634 | },
635 | {
636 | "cell_type": "code",
637 | "execution_count": null,
638 | "metadata": {},
639 | "outputs": [],
640 | "source": [
641 | "for k in kvals:\n",
642 | " print('k=%i: %.2f' % (k, np.mean(accuracies[k])))"
643 | ]
644 | },
645 | {
646 | "cell_type": "markdown",
647 | "metadata": {},
648 | "source": [
649 | "Final Evaluation\n",
650 | "---\n",
651 | "\n",
652 | "K=3 gives us the highest accuracy, so we select it as our best model. Now we can evaluate it on our training set and get our final accuracy rating."
653 | ]
654 | },
655 | {
656 | "cell_type": "code",
657 | "execution_count": null,
658 | "metadata": {},
659 | "outputs": [],
660 | "source": [
661 | "clf = neighbors.KNeighborsClassifier(3, weights='distance')\n",
662 | "clf.fit(X_train, y_train)\n",
663 | "predictions = clf.predict(X_test)\n",
664 | "\n",
665 | "accuracy = np.mean(predictions == y_test)*100\n",
666 | "\n",
667 | "print('The final accuracy is %.2f' % accuracy + '%')"
668 | ]
669 | },
670 | {
671 | "cell_type": "markdown",
672 | "metadata": {},
673 | "source": [
674 | "Sidenote: Randomness and Results\n",
675 | "===\n",
676 | "\n",
677 | "Every time you run this notebook, you will get slightly different results. Why? Because data is randomly divided among the training/testing/validation data sets. Running the code again will create a different division of the data, and will make the results slightly different. However, the overall outcome should remain consistent and have approximately the same values. If you have drastically different results when running an analysis multiple times, it suggests a problem with your model or that you need more data.\n",
678 | "\n",
679 | "If it's important that you get the exact same results every time you run the code, you can specify a random state in the `random_state` argument of `train_test_split()` and `KFold`."
680 | ]
681 | },
682 | {
683 | "cell_type": "code",
684 | "execution_count": null,
685 | "metadata": {
686 | "collapsed": true
687 | },
688 | "outputs": [],
689 | "source": [
690 | "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)"
691 | ]
692 | }
693 | ],
694 | "metadata": {
695 | "kernelspec": {
696 | "display_name": "Python 3",
697 | "language": "python",
698 | "name": "python3"
699 | },
700 | "language_info": {
701 | "codemirror_mode": {
702 | "name": "ipython",
703 | "version": 3
704 | },
705 | "file_extension": ".py",
706 | "mimetype": "text/x-python",
707 | "name": "python",
708 | "nbconvert_exporter": "python",
709 | "pygments_lexer": "ipython3",
710 | "version": "3.6.2"
711 | }
712 | },
713 | "nbformat": 4,
714 | "nbformat_minor": 1
715 | }
716 |
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 | GNU GENERAL PUBLIC LICENSE
2 | Version 2, June 1991
3 |
4 | Copyright (C) 1989, 1991 Free Software Foundation, Inc.,
5 | 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
6 | Everyone is permitted to copy and distribute verbatim copies
7 | of this license document, but changing it is not allowed.
8 |
9 | Preamble
10 |
11 | The licenses for most software are designed to take away your
12 | freedom to share and change it. By contrast, the GNU General Public
13 | License is intended to guarantee your freedom to share and change free
14 | software--to make sure the software is free for all its users. This
15 | General Public License applies to most of the Free Software
16 | Foundation's software and to any other program whose authors commit to
17 | using it. (Some other Free Software Foundation software is covered by
18 | the GNU Lesser General Public License instead.) You can apply it to
19 | your programs, too.
20 |
21 | When we speak of free software, we are referring to freedom, not
22 | price. Our General Public Licenses are designed to make sure that you
23 | have the freedom to distribute copies of free software (and charge for
24 | this service if you wish), that you receive source code or can get it
25 | if you want it, that you can change the software or use pieces of it
26 | in new free programs; and that you know you can do these things.
27 |
28 | To protect your rights, we need to make restrictions that forbid
29 | anyone to deny you these rights or to ask you to surrender the rights.
30 | These restrictions translate to certain responsibilities for you if you
31 | distribute copies of the software, or if you modify it.
32 |
33 | For example, if you distribute copies of such a program, whether
34 | gratis or for a fee, you must give the recipients all the rights that
35 | you have. You must make sure that they, too, receive or can get the
36 | source code. And you must show them these terms so they know their
37 | rights.
38 |
39 | We protect your rights with two steps: (1) copyright the software, and
40 | (2) offer you this license which gives you legal permission to copy,
41 | distribute and/or modify the software.
42 |
43 | Also, for each author's protection and ours, we want to make certain
44 | that everyone understands that there is no warranty for this free
45 | software. If the software is modified by someone else and passed on, we
46 | want its recipients to know that what they have is not the original, so
47 | that any problems introduced by others will not reflect on the original
48 | authors' reputations.
49 |
50 | Finally, any free program is threatened constantly by software
51 | patents. We wish to avoid the danger that redistributors of a free
52 | program will individually obtain patent licenses, in effect making the
53 | program proprietary. To prevent this, we have made it clear that any
54 | patent must be licensed for everyone's free use or not licensed at all.
55 |
56 | The precise terms and conditions for copying, distribution and
57 | modification follow.
58 |
59 | GNU GENERAL PUBLIC LICENSE
60 | TERMS AND CONDITIONS FOR COPYING, DISTRIBUTION AND MODIFICATION
61 |
62 | 0. This License applies to any program or other work which contains
63 | a notice placed by the copyright holder saying it may be distributed
64 | under the terms of this General Public License. The "Program", below,
65 | refers to any such program or work, and a "work based on the Program"
66 | means either the Program or any derivative work under copyright law:
67 | that is to say, a work containing the Program or a portion of it,
68 | either verbatim or with modifications and/or translated into another
69 | language. (Hereinafter, translation is included without limitation in
70 | the term "modification".) Each licensee is addressed as "you".
71 |
72 | Activities other than copying, distribution and modification are not
73 | covered by this License; they are outside its scope. The act of
74 | running the Program is not restricted, and the output from the Program
75 | is covered only if its contents constitute a work based on the
76 | Program (independent of having been made by running the Program).
77 | Whether that is true depends on what the Program does.
78 |
79 | 1. You may copy and distribute verbatim copies of the Program's
80 | source code as you receive it, in any medium, provided that you
81 | conspicuously and appropriately publish on each copy an appropriate
82 | copyright notice and disclaimer of warranty; keep intact all the
83 | notices that refer to this License and to the absence of any warranty;
84 | and give any other recipients of the Program a copy of this License
85 | along with the Program.
86 |
87 | You may charge a fee for the physical act of transferring a copy, and
88 | you may at your option offer warranty protection in exchange for a fee.
89 |
90 | 2. You may modify your copy or copies of the Program or any portion
91 | of it, thus forming a work based on the Program, and copy and
92 | distribute such modifications or work under the terms of Section 1
93 | above, provided that you also meet all of these conditions:
94 |
95 | a) You must cause the modified files to carry prominent notices
96 | stating that you changed the files and the date of any change.
97 |
98 | b) You must cause any work that you distribute or publish, that in
99 | whole or in part contains or is derived from the Program or any
100 | part thereof, to be licensed as a whole at no charge to all third
101 | parties under the terms of this License.
102 |
103 | c) If the modified program normally reads commands interactively
104 | when run, you must cause it, when started running for such
105 | interactive use in the most ordinary way, to print or display an
106 | announcement including an appropriate copyright notice and a
107 | notice that there is no warranty (or else, saying that you provide
108 | a warranty) and that users may redistribute the program under
109 | these conditions, and telling the user how to view a copy of this
110 | License. (Exception: if the Program itself is interactive but
111 | does not normally print such an announcement, your work based on
112 | the Program is not required to print an announcement.)
113 |
114 | These requirements apply to the modified work as a whole. If
115 | identifiable sections of that work are not derived from the Program,
116 | and can be reasonably considered independent and separate works in
117 | themselves, then this License, and its terms, do not apply to those
118 | sections when you distribute them as separate works. But when you
119 | distribute the same sections as part of a whole which is a work based
120 | on the Program, the distribution of the whole must be on the terms of
121 | this License, whose permissions for other licensees extend to the
122 | entire whole, and thus to each and every part regardless of who wrote it.
123 |
124 | Thus, it is not the intent of this section to claim rights or contest
125 | your rights to work written entirely by you; rather, the intent is to
126 | exercise the right to control the distribution of derivative or
127 | collective works based on the Program.
128 |
129 | In addition, mere aggregation of another work not based on the Program
130 | with the Program (or with a work based on the Program) on a volume of
131 | a storage or distribution medium does not bring the other work under
132 | the scope of this License.
133 |
134 | 3. You may copy and distribute the Program (or a work based on it,
135 | under Section 2) in object code or executable form under the terms of
136 | Sections 1 and 2 above provided that you also do one of the following:
137 |
138 | a) Accompany it with the complete corresponding machine-readable
139 | source code, which must be distributed under the terms of Sections
140 | 1 and 2 above on a medium customarily used for software interchange; or,
141 |
142 | b) Accompany it with a written offer, valid for at least three
143 | years, to give any third party, for a charge no more than your
144 | cost of physically performing source distribution, a complete
145 | machine-readable copy of the corresponding source code, to be
146 | distributed under the terms of Sections 1 and 2 above on a medium
147 | customarily used for software interchange; or,
148 |
149 | c) Accompany it with the information you received as to the offer
150 | to distribute corresponding source code. (This alternative is
151 | allowed only for noncommercial distribution and only if you
152 | received the program in object code or executable form with such
153 | an offer, in accord with Subsection b above.)
154 |
155 | The source code for a work means the preferred form of the work for
156 | making modifications to it. For an executable work, complete source
157 | code means all the source code for all modules it contains, plus any
158 | associated interface definition files, plus the scripts used to
159 | control compilation and installation of the executable. However, as a
160 | special exception, the source code distributed need not include
161 | anything that is normally distributed (in either source or binary
162 | form) with the major components (compiler, kernel, and so on) of the
163 | operating system on which the executable runs, unless that component
164 | itself accompanies the executable.
165 |
166 | If distribution of executable or object code is made by offering
167 | access to copy from a designated place, then offering equivalent
168 | access to copy the source code from the same place counts as
169 | distribution of the source code, even though third parties are not
170 | compelled to copy the source along with the object code.
171 |
172 | 4. You may not copy, modify, sublicense, or distribute the Program
173 | except as expressly provided under this License. Any attempt
174 | otherwise to copy, modify, sublicense or distribute the Program is
175 | void, and will automatically terminate your rights under this License.
176 | However, parties who have received copies, or rights, from you under
177 | this License will not have their licenses terminated so long as such
178 | parties remain in full compliance.
179 |
180 | 5. You are not required to accept this License, since you have not
181 | signed it. However, nothing else grants you permission to modify or
182 | distribute the Program or its derivative works. These actions are
183 | prohibited by law if you do not accept this License. Therefore, by
184 | modifying or distributing the Program (or any work based on the
185 | Program), you indicate your acceptance of this License to do so, and
186 | all its terms and conditions for copying, distributing or modifying
187 | the Program or works based on it.
188 |
189 | 6. Each time you redistribute the Program (or any work based on the
190 | Program), the recipient automatically receives a license from the
191 | original licensor to copy, distribute or modify the Program subject to
192 | these terms and conditions. You may not impose any further
193 | restrictions on the recipients' exercise of the rights granted herein.
194 | You are not responsible for enforcing compliance by third parties to
195 | this License.
196 |
197 | 7. If, as a consequence of a court judgment or allegation of patent
198 | infringement or for any other reason (not limited to patent issues),
199 | conditions are imposed on you (whether by court order, agreement or
200 | otherwise) that contradict the conditions of this License, they do not
201 | excuse you from the conditions of this License. If you cannot
202 | distribute so as to satisfy simultaneously your obligations under this
203 | License and any other pertinent obligations, then as a consequence you
204 | may not distribute the Program at all. For example, if a patent
205 | license would not permit royalty-free redistribution of the Program by
206 | all those who receive copies directly or indirectly through you, then
207 | the only way you could satisfy both it and this License would be to
208 | refrain entirely from distribution of the Program.
209 |
210 | If any portion of this section is held invalid or unenforceable under
211 | any particular circumstance, the balance of the section is intended to
212 | apply and the section as a whole is intended to apply in other
213 | circumstances.
214 |
215 | It is not the purpose of this section to induce you to infringe any
216 | patents or other property right claims or to contest validity of any
217 | such claims; this section has the sole purpose of protecting the
218 | integrity of the free software distribution system, which is
219 | implemented by public license practices. Many people have made
220 | generous contributions to the wide range of software distributed
221 | through that system in reliance on consistent application of that
222 | system; it is up to the author/donor to decide if he or she is willing
223 | to distribute software through any other system and a licensee cannot
224 | impose that choice.
225 |
226 | This section is intended to make thoroughly clear what is believed to
227 | be a consequence of the rest of this License.
228 |
229 | 8. If the distribution and/or use of the Program is restricted in
230 | certain countries either by patents or by copyrighted interfaces, the
231 | original copyright holder who places the Program under this License
232 | may add an explicit geographical distribution limitation excluding
233 | those countries, so that distribution is permitted only in or among
234 | countries not thus excluded. In such case, this License incorporates
235 | the limitation as if written in the body of this License.
236 |
237 | 9. The Free Software Foundation may publish revised and/or new versions
238 | of the General Public License from time to time. Such new versions will
239 | be similar in spirit to the present version, but may differ in detail to
240 | address new problems or concerns.
241 |
242 | Each version is given a distinguishing version number. If the Program
243 | specifies a version number of this License which applies to it and "any
244 | later version", you have the option of following the terms and conditions
245 | either of that version or of any later version published by the Free
246 | Software Foundation. If the Program does not specify a version number of
247 | this License, you may choose any version ever published by the Free Software
248 | Foundation.
249 |
250 | 10. If you wish to incorporate parts of the Program into other free
251 | programs whose distribution conditions are different, write to the author
252 | to ask for permission. For software which is copyrighted by the Free
253 | Software Foundation, write to the Free Software Foundation; we sometimes
254 | make exceptions for this. Our decision will be guided by the two goals
255 | of preserving the free status of all derivatives of our free software and
256 | of promoting the sharing and reuse of software generally.
257 |
258 | NO WARRANTY
259 |
260 | 11. BECAUSE THE PROGRAM IS LICENSED FREE OF CHARGE, THERE IS NO WARRANTY
261 | FOR THE PROGRAM, TO THE EXTENT PERMITTED BY APPLICABLE LAW. EXCEPT WHEN
262 | OTHERWISE STATED IN WRITING THE COPYRIGHT HOLDERS AND/OR OTHER PARTIES
263 | PROVIDE THE PROGRAM "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED
264 | OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF
265 | MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. THE ENTIRE RISK AS
266 | TO THE QUALITY AND PERFORMANCE OF THE PROGRAM IS WITH YOU. SHOULD THE
267 | PROGRAM PROVE DEFECTIVE, YOU ASSUME THE COST OF ALL NECESSARY SERVICING,
268 | REPAIR OR CORRECTION.
269 |
270 | 12. IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING
271 | WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MAY MODIFY AND/OR
272 | REDISTRIBUTE THE PROGRAM AS PERMITTED ABOVE, BE LIABLE TO YOU FOR DAMAGES,
273 | INCLUDING ANY GENERAL, SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING
274 | OUT OF THE USE OR INABILITY TO USE THE PROGRAM (INCLUDING BUT NOT LIMITED
275 | TO LOSS OF DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY
276 | YOU OR THIRD PARTIES OR A FAILURE OF THE PROGRAM TO OPERATE WITH ANY OTHER
277 | PROGRAMS), EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE
278 | POSSIBILITY OF SUCH DAMAGES.
279 |
280 | END OF TERMS AND CONDITIONS
281 |
282 | How to Apply These Terms to Your New Programs
283 |
284 | If you develop a new program, and you want it to be of the greatest
285 | possible use to the public, the best way to achieve this is to make it
286 | free software which everyone can redistribute and change under these terms.
287 |
288 | To do so, attach the following notices to the program. It is safest
289 | to attach them to the start of each source file to most effectively
290 | convey the exclusion of warranty; and each file should have at least
291 | the "copyright" line and a pointer to where the full notice is found.
292 |
293 | {description}
294 | Copyright (C) {year} {fullname}
295 |
296 | This program is free software; you can redistribute it and/or modify
297 | it under the terms of the GNU General Public License as published by
298 | the Free Software Foundation; either version 2 of the License, or
299 | (at your option) any later version.
300 |
301 | This program is distributed in the hope that it will be useful,
302 | but WITHOUT ANY WARRANTY; without even the implied warranty of
303 | MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
304 | GNU General Public License for more details.
305 |
306 | You should have received a copy of the GNU General Public License along
307 | with this program; if not, write to the Free Software Foundation, Inc.,
308 | 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA.
309 |
310 | Also add information on how to contact you by electronic and paper mail.
311 |
312 | If the program is interactive, make it output a short notice like this
313 | when it starts in an interactive mode:
314 |
315 | Gnomovision version 69, Copyright (C) year name of author
316 | Gnomovision comes with ABSOLUTELY NO WARRANTY; for details type `show w'.
317 | This is free software, and you are welcome to redistribute it
318 | under certain conditions; type `show c' for details.
319 |
320 | The hypothetical commands `show w' and `show c' should show the appropriate
321 | parts of the General Public License. Of course, the commands you use may
322 | be called something other than `show w' and `show c'; they could even be
323 | mouse-clicks or menu items--whatever suits your program.
324 |
325 | You should also get your employer (if you work as a programmer) or your
326 | school, if any, to sign a "copyright disclaimer" for the program, if
327 | necessary. Here is a sample; alter the names:
328 |
329 | Yoyodyne, Inc., hereby disclaims all copyright interest in the program
330 | `Gnomovision' (which makes passes at compilers) written by James Hacker.
331 |
332 | {signature of Ty Coon}, 1 April 1989
333 | Ty Coon, President of Vice
334 |
335 | This General Public License does not permit incorporating your program into
336 | proprietary programs. If your program is a subroutine library, you may
337 | consider it more useful to permit linking proprietary applications with the
338 | library. If this is what you want to do, use the GNU Lesser General
339 | Public License instead of this License.
340 |
341 |
--------------------------------------------------------------------------------
/LinearRegression_Tutorial.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "Linear Regression Tutorial\n",
8 | "===\n",
9 | "\n",
10 |     "Some problems don't have discrete (categorical) labels (e.g. color, plant species), but rather a continuous range of numbers (e.g. length, price). For these types of problems, regression is usually a good choice. Rather than predicting a categorical label for each example, it fits a continuous line (or plane, or curve) to the data in order to give a prediction as a number. \n",
11 | "\n",
12 | "If you've ever found a \"line of best fit\" using Excel, you've already used regression!"
13 | ]
14 | },
15 | {
16 | "cell_type": "markdown",
17 | "metadata": {},
18 | "source": [
19 | "Setup\n",
20 | "===\n",
21 |     "Tell matplotlib to print figures in the notebook. Then import numpy (for numerical data), matplotlib.pyplot (for plotting figures), linear_model (for the scikit-learn linear regression algorithm), datasets (to download the diabetes dataset from scikit-learn), train_test_split and KFold (to create training, testing, and validation sets), and mean_squared_error and r2_score (to evaluate the results)."
22 | ]
23 | },
24 | {
25 | "cell_type": "code",
26 | "execution_count": null,
27 | "metadata": {
28 | "collapsed": true
29 | },
30 | "outputs": [],
31 | "source": [
32 | "import numpy as np\n",
33 | "import matplotlib.pyplot as plt\n",
34 | "from sklearn import linear_model, datasets # Import the linear regression function and dataset from scikit-learn\n",
35 | "from sklearn.model_selection import train_test_split, KFold\n",
36 | "from sklearn.metrics import mean_squared_error, r2_score\n",
37 | "\n",
38 | "# Print figures in the notebook\n",
39 | "%matplotlib inline "
40 | ]
41 | },
42 | {
43 | "cell_type": "markdown",
44 | "metadata": {
45 | "collapsed": true
46 | },
47 | "source": [
48 | "Import the dataset\n",
49 | "===\n",
50 | "Import the dataset and store it to a variable called diabetes. This dataset is similar to a python dictionary, with the keys: ['DESCR', 'target', 'data', 'feature_names']\n",
51 | "\n",
52 | "The data features are stored in diabetes.data, where each row is an example from a single patient, and each column is a single feature. The feature names are stored in diabetes.feature_names. Target values are stored in diabetes.target.\n",
53 | "\n",
54 | "Below, we load the labels into y, the data into X, and the names of the features into featureNames. We also print the description of the dataset."
55 | ]
56 | },
57 | {
58 | "cell_type": "code",
59 | "execution_count": null,
60 | "metadata": {},
61 | "outputs": [],
62 | "source": [
63 | "# Import some data to play with\n",
64 | "diabetes = datasets.load_diabetes()\n",
65 | "\n",
66 | "# Store the labels (y), features (X), and feature names\n",
67 | "y = diabetes.target # Labels are stored in y as numbers\n",
68 | "X = diabetes.data\n",
69 | "featureNames = diabetes.feature_names\n",
70 | "\n",
71 | "print(diabetes.DESCR)"
72 | ]
73 | },
74 | {
75 | "cell_type": "markdown",
76 | "metadata": {},
77 | "source": [
78 | "Create Training and Testing Sets\n",
79 | "---\n",
80 | "\n",
81 |     "In order to see how well our model works, we need to divide our data into training and testing sets."
82 | ]
83 | },
84 | {
85 | "cell_type": "code",
86 | "execution_count": null,
87 | "metadata": {
88 | "collapsed": true
89 | },
90 | "outputs": [],
91 | "source": [
92 | "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)"
93 | ]
94 | },
95 | {
96 | "cell_type": "markdown",
97 | "metadata": {},
98 | "source": [
99 | "Visualize The Data\n",
100 | "===\n",
101 | "\n",
102 | "There are too many features to visualize the whole training set, but we can plot a single feature (e.g. body mass index) against the quantified disease progression.\n",
103 | "\n",
104 |     "Remember that the features have been normalized to center them around 0 and scale them by the standard deviation. So the values shown aren't the actual BMI levels, but a standardized version of them."
105 | ]
106 | },
107 | {
108 | "cell_type": "code",
109 | "execution_count": null,
110 | "metadata": {},
111 | "outputs": [],
112 | "source": [
113 | "bmi_ind = 2\n",
114 | "\n",
115 | "plt.scatter(X_train[:,bmi_ind], y_train)\n",
116 | "plt.ylabel('Disease Progression')\n",
117 | "plt.xlabel('Body Mass Index')\n",
118 | "plt.show()"
119 | ]
120 | },
121 | {
122 | "cell_type": "markdown",
123 | "metadata": {},
124 | "source": [
125 | "Train A Toy Model\n",
126 | "===\n",
127 | "Next, we train a linear regression algorithm on our data. \n",
128 | "\n",
129 | "In scikit learn, we use [linear_model.LinearRegression()](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html) to create our model. Models are trained using a `fit()` method. For linear regression, `fit()` expects two arguments: the training examples `X_train` and the corresponding labels `y_train`.\n",
130 | "\n",
131 | "To start, we will train a toy model using only the body mass index feature."
132 | ]
133 | },
134 | {
135 | "cell_type": "code",
136 | "execution_count": null,
137 | "metadata": {},
138 | "outputs": [],
139 | "source": [
140 | "regr = linear_model.LinearRegression()\n",
141 | "x_train = X_train[:,bmi_ind].reshape(-1, 1) # regression expects a (#examples,#features) array shape\n",
142 | "regr.fit(x_train, y_train)"
143 | ]
144 | },
145 | {
146 | "cell_type": "markdown",
147 | "metadata": {},
148 | "source": [
149 | "Making Predictions\n",
150 | "---\n",
151 | "\n",
152 | "To make predictions on new data we use the `predict()` method. It expects a single input: an array-like object containing the features for the examples we want to predict.\n",
153 | "\n",
154 | "Here, we get the predictions for two body mass index values: -.05 and 0.1."
155 | ]
156 | },
157 | {
158 | "cell_type": "code",
159 | "execution_count": null,
160 | "metadata": {},
161 | "outputs": [],
162 | "source": [
163 | "bmi = [[-0.05],[0.1]]\n",
164 | "predictions = regr.predict(bmi)\n",
165 | "print(predictions)"
166 | ]
167 | },
168 | {
169 | "cell_type": "markdown",
170 | "metadata": {
171 | "collapsed": true
172 | },
173 | "source": [
174 | "Visualize the Toy Model\n",
175 | "---\n",
176 | "\n",
177 | "Here we plot the linear regression line of our trained model on top of the data. We do this by predicting the output of the model on a range of values from -0.1 to 0.15. These predictions are plotted as a line on top of the training data. \n",
178 | "\n",
179 |     "This can't tell us how well it will perform on new, unseen data, but it can show us how well the line fits the training data."
180 | ]
181 | },
182 | {
183 | "cell_type": "code",
184 | "execution_count": null,
185 | "metadata": {},
186 | "outputs": [],
187 | "source": [
188 |     "values = np.arange(-0.1, 0.15, 0.01).reshape(-1, 1) # Reshape so each value is its own row (one example with a single feature)\n",
189 | "\n",
190 | "plt.scatter(x_train, y_train)\n",
191 | "plt.plot(values, regr.predict(values), c='r')\n",
192 | "plt.ylabel('Disease Progression')\n",
193 | "plt.xlabel('Body Mass Index')\n",
194 | "plt.title('Regression Line on Training Data')\n",
195 | "plt.show()"
196 | ]
197 | },
198 | {
199 | "cell_type": "markdown",
200 | "metadata": {},
201 | "source": [
202 | "Test the Toy Model\n",
203 | "===\n",
204 | "\n",
205 | "Next, we will test the ability of our model to predict the disease progression in our test set, using only the body mass index.\n",
206 | "\n",
207 |     "First, we get our predictions for the test data, and plot the predicted model on top of it. Since we trained our model using only one feature, we need to get only that feature from the test set."
208 | ]
209 | },
210 | {
211 | "cell_type": "code",
212 | "execution_count": null,
213 | "metadata": {},
214 | "outputs": [],
215 | "source": [
216 | "x_test = X_test[:,bmi_ind][np.newaxis].T # regression expects a (#examples,#features) array shape\n",
217 | "predictions = regr.predict(x_test)\n",
218 | "\n",
219 | "plt.scatter(x_test, y_test)\n",
220 | "plt.plot(x_test, predictions, c='r')\n",
221 | "plt.ylabel('Disease Progression')\n",
222 | "plt.xlabel('Body Mass Index')\n",
223 | "plt.title('Regression Line on Test Data')"
224 | ]
225 | },
226 | {
227 | "cell_type": "markdown",
228 | "metadata": {},
229 | "source": [
230 | "Evaluate the Toy Model\n",
231 | "---\n",
232 | "\n",
233 |     "Next, we evaluate how well our model worked on the test dataset. Unlike with discrete classifiers (e.g. KNN, SVM), the number of examples it got \"correct\" isn't meaningful here. We certainly care if the predicted value is off by 100, but do we care as much if it is off by 1? What about 0.01?\n",
234 | "\n",
235 |     "There are many ways to evaluate a regression model, but one popular way is the mean-squared error, or MSE. As the name implies, you find the error for each example (the distance between the point and the predicted line), square each of them, and then take the average. \n",
236 | "\n",
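    "A minimal sketch of that computation by hand (assuming the `y_test` and `predictions` arrays from the cells above):\n",
    "\n",
    "```python\n",
    "errors = y_test - predictions        # distance between each true value and the predicted value\n",
    "mse_by_hand = np.mean(errors ** 2)   # square the errors, then average them\n",
    "print(mse_by_hand)\n",
    "```\n",
    "\n",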
237 | "Scikit-learn has a function that does this for you easily."
238 | ]
239 | },
240 | {
241 | "cell_type": "code",
242 | "execution_count": null,
243 | "metadata": {},
244 | "outputs": [],
245 | "source": [
246 | "mse = mean_squared_error(y_test, predictions)\n",
247 | "\n",
248 | "print('The MSE is ' + '%.2f' % mse)"
249 | ]
250 | },
251 | {
252 | "cell_type": "markdown",
253 | "metadata": {},
254 | "source": [
255 | "The MSE isn't as intuitive as the accuracy of a discrete classifier, but it is highly useful for comparing the effectiveness of different models. Another option is to look at the $R^2$ score, which you may already be familiar with if you've ever fit a line to data in Excel. A value of 1.0 is a perfect predictor, while 0.0 means there is no correlation between the input and output of the regression model."
256 | ]
257 | },
258 | {
259 | "cell_type": "code",
260 | "execution_count": null,
261 | "metadata": {},
262 | "outputs": [],
263 | "source": [
264 | "r2score = r2_score(y_test, predictions)\n",
265 | "\n",
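    "# A quick by-hand check of the same value (a sketch added for illustration;\n",
    "# it assumes y_test and predictions from the cells above):\n",
    "ss_res = np.sum((y_test - predictions) ** 2)         # residual sum of squares\n",
    "ss_tot = np.sum((y_test - np.mean(y_test)) ** 2)     # total sum of squares\n",
    "print('R^2 by hand is ' + '%.2f' % (1 - ss_res / ss_tot))\n",
    "\n",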
266 | "print('The R^2 score is ' + '%.2f' % r2score)"
267 | ]
268 | },
269 | {
270 | "cell_type": "markdown",
271 | "metadata": {},
272 | "source": [
273 | "Train A Model on All Features\n",
274 | "===\n",
275 | "\n",
276 | "Next we will train a model on all of the available features and use it to predict the progression of diabetes after a year. We can then see how this compares to using only a single feature."
277 | ]
278 | },
279 | {
280 | "cell_type": "code",
281 | "execution_count": null,
282 | "metadata": {},
283 | "outputs": [],
284 | "source": [
285 | "regr = linear_model.LinearRegression()\n",
286 | "regr.fit(X_train, y_train)\n",
287 | "\n",
288 | "predictions = regr.predict(X_test)\n",
289 | "\n",
290 | "mse = mean_squared_error(y_test, predictions)\n",
291 | "print('The MSE is ' + '%.2f' % mse)\n",
292 | "\n",
293 | "r2score = r2_score(y_test, predictions)\n",
294 | "print('The R^2 score is ' + '%.2f' % r2score)"
295 | ]
296 | },
297 | {
298 | "cell_type": "markdown",
299 | "metadata": {
300 | "collapsed": true
301 | },
302 | "source": [
303 | "Using Crossvalidation\n",
304 | "===\n",
305 | "\n",
306 | "To properly compare these two models, we need to use crossvalidation to select the best model. We can then get our final result using our test data.\n",
307 | "\n",
308 |     "First we create our two linear regression models, and then we divide our training data into folds. We loop through the sets of training and validation folds. Each time, we train each model on the training data and evaluate on the validation data. We store the MSE and $R^2$ score of each model on each fold so we can look at them later."
309 | ]
310 | },
311 | {
312 | "cell_type": "code",
313 | "execution_count": null,
314 | "metadata": {
315 | "collapsed": true
316 | },
317 | "outputs": [],
318 | "source": [
319 | "# Create our regression models\n",
320 | "regr_1 = linear_model.LinearRegression()\n",
321 | "regr_all = linear_model.LinearRegression()\n",
322 | "\n",
323 | "# Create arrays to store the MSE and R^2 score\n",
324 | "mse_1 = []\n",
325 | "r2score_1 = []\n",
326 | "mse_all = []\n",
327 | "r2score_all = []\n",
328 | "\n",
329 | "# Loop through 5 folds\n",
330 | "kf = KFold(n_splits=5)\n",
331 | "for trainInd, valInd in kf.split(X_train):\n",
332 | " X_tr = X_train[trainInd,:]\n",
333 | " y_tr = y_train[trainInd]\n",
334 | " X_val = X_train[valInd,:]\n",
335 | " y_val = y_train[valInd]\n",
336 | " \n",
337 | " # Train our models\n",
338 | " regr_1.fit(X_tr[:,bmi_ind].reshape(-1, 1), y_tr) # Train on only one feature\n",
339 | " regr_all.fit(X_tr, y_tr) # Train on all features\n",
340 | " \n",
341 | " # Make our predictions\n",
342 | " pred_1 = regr_1.predict(X_val[:,bmi_ind].reshape(-1, 1))\n",
343 | " pred_all = regr_all.predict(X_val)\n",
344 | " \n",
345 | " # Calculate the MSE\n",
346 | " mse_1.append(mean_squared_error(y_val, pred_1))\n",
347 | " mse_all.append(mean_squared_error(y_val, pred_all))\n",
348 | " \n",
349 | " # Calculate the R^2 score\n",
350 | " r2score_1.append(r2_score(y_val, pred_1))\n",
351 | " r2score_all.append(r2_score(y_val, pred_all))"
352 | ]
353 | },
354 | {
355 | "cell_type": "markdown",
356 | "metadata": {},
357 | "source": [
358 | "Select a Model\n",
359 | "---\n",
360 | "\n",
361 | "To select a model, we look at the average $R^2$ score and MSE across all folds."
362 | ]
363 | },
364 | {
365 | "cell_type": "code",
366 | "execution_count": null,
367 | "metadata": {},
368 | "outputs": [],
369 | "source": [
370 | "print('One Feature:\\nMSE: ' + '%.2f' % np.mean(mse_1))\n",
371 | "print('R^2: ' + '%.2f' % np.mean(r2score_1))\n",
372 | "\n",
373 | "print('\\nAll Features:\\nMSE: ' + '%.2f' % np.mean(mse_all))\n",
374 | "print('R^2: ' + '%.2f' % np.mean(r2score_all))"
375 | ]
376 | },
377 | {
378 | "cell_type": "markdown",
379 | "metadata": {},
380 | "source": [
381 | "Final Evaluation\n",
382 | "---\n",
383 | "\n",
384 |     "The model using all features has a higher $R^2$ score and lower MSE, so we select it as our best model. Now we can evaluate it on our test set and get our final MSE and $R^2$ score values."
385 | ]
386 | },
387 | {
388 | "cell_type": "code",
389 | "execution_count": null,
390 | "metadata": {},
391 | "outputs": [],
392 | "source": [
393 | "regr_all.fit(X_train, y_train)\n",
394 | "predictions = regr_all.predict(X_test)\n",
395 | "\n",
396 | "mse = mean_squared_error(y_test, predictions)\n",
397 | "r2score = r2_score(y_test, predictions)\n",
398 | "\n",
399 | "print('The final MSE is ' + '%.2f' % mse)\n",
400 | "print('The final R^2 score is ' + '%.2f' % r2score)"
401 | ]
402 | },
403 | {
404 | "cell_type": "markdown",
405 | "metadata": {},
406 | "source": [
407 | "Sidenote: Randomness and Results\n",
408 | "===\n",
409 | "\n",
410 | "Every time you run this notebook, you will get slightly different results. Why? Because data is randomly divided among the training/testing/validation data sets. Running the code again will create a different division of the data, and will make the results slightly different. However, the overall outcome should remain consistent and have approximately the same values. If you have drastically different results when running an analysis multiple times, it suggests a problem with your model or that you need more data.\n",
411 | "\n",
412 | "If it's important that you get the exact same results every time you run the code, you can specify a random state in the `random_state` argument of `train_test_split()` and `KFold`."
413 | ]
414 | },
415 | {
416 | "cell_type": "code",
417 | "execution_count": null,
418 | "metadata": {
419 | "collapsed": true
420 | },
421 | "outputs": [],
422 | "source": [
423 | "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)"
424 | ]
425 | },
426 | {
427 | "cell_type": "code",
428 | "execution_count": null,
429 | "metadata": {
430 | "collapsed": true
431 | },
432 | "outputs": [],
433 | "source": []
434 | }
435 | ],
436 | "metadata": {
437 | "kernelspec": {
438 | "display_name": "Python 3",
439 | "language": "python",
440 | "name": "python3"
441 | },
442 | "language_info": {
443 | "codemirror_mode": {
444 | "name": "ipython",
445 | "version": 3
446 | },
447 | "file_extension": ".py",
448 | "mimetype": "text/x-python",
449 | "name": "python",
450 | "nbconvert_exporter": "python",
451 | "pygments_lexer": "ipython3",
452 | "version": "3.6.2"
453 | }
454 | },
455 | "nbformat": 4,
456 | "nbformat_minor": 1
457 | }
458 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # MachineLearningIntro
2 |
3 | Jupyter notebooks for use in an introduction to machine learning course or talk.
4 |
5 | Originally developed for a Girl Develop It Philadelphia class in 2015 with [Ani Vemprala](https://github.com/aniv) and [Michael Becker](https://github.com/mdbecker).
6 |
7 | ## Recommended Order
8 |
9 | It is (loosely) recommended that you work through the notebooks in the following order:
10 |
11 | 1. Diabetes_DataSet.ipynb
12 | 2. LinearRegression_Tutorial.ipynb
13 | 3. Iris_DataSet.ipynb
14 | 4. KNN_Tutorial.ipynb
15 | 5. SVM_Tutorial.ipynb
16 | 6. KMeans_Tutorial.ipynb
17 |
--------------------------------------------------------------------------------
/SVM_Tutorial.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 |     "This notebook uses material from a tutorial given by [Jake Vanderplas](http://www.vanderplas.com) for PyCon 2014. Source and license info is on [GitHub](https://github.com/jakevdp/sklearn_pycon2014/)."
8 | ]
9 | },
10 | {
11 | "cell_type": "markdown",
12 | "metadata": {},
13 | "source": [
14 | "Support Vector Machine tutorial\n",
15 | "===\n",
16 | "\n",
17 |     "Support vector machines (or SVMs) are a popular supervised classification method. SVMs attempt to find a classification boundary that maximizes the separability of the classes. This means that they try to maximize the distance between the boundary lines and the closest data points.\n",
18 | "\n",
19 | "Scikit-learn has a great [SVM tutorial](http://scikit-learn.org/stable/modules/svm.html) if you want more detailed information.\n",
20 | "\n",
21 | "Toy Dataset Illustration\n",
22 | "---\n",
23 | "\n",
24 |     "Here, we will use a toy (or overly simple) dataset of two classes which can be perfectly separated with a single, straight line. \n",
25 | "\n",
26 |     "<img src=\"Images/SVMBoundary.png\">\n",
27 | "\n",
28 | "The solid line is the decision boundary, dividing the red and blue classes. Notice that on either side of the boundary, there is a dotted line that passes through the closest datapoints. The distance between the solid boundary line and this dotted line is what an SVM tries to maximize. \n",
29 | "\n",
30 | "The points that touch the dotted lines are called \"support vectors\". These points are the only ones that matter when determining boundary locations. All other datapoints can be added, moved, or removed from the dataset without changing the classification boundary, as long as they do not cross that dotted line."
31 | ]
32 | },
33 | {
34 | "cell_type": "markdown",
35 | "metadata": {},
36 | "source": [
37 | "Setup\n",
38 | "===\n",
39 |     "Tell matplotlib to print figures in the notebook. Then import numpy (for numerical data), pyplot (for plotting figures), ListedColormap (for plotting colors), SVC (for the scikit-learn support vector machine classifier), datasets (to download the iris dataset from scikit-learn), and train_test_split and KFold (to create training and testing sets).\n",
40 | "\n",
41 | "Also create the color maps to use to color the plotted data, and \"labelList\", which is a list of colored rectangles to use in plotted legends"
42 | ]
43 | },
44 | {
45 | "cell_type": "code",
46 | "execution_count": null,
47 | "metadata": {},
48 | "outputs": [],
49 | "source": [
50 | "# Print figures in the notebook\n",
51 | "%matplotlib inline \n",
52 | "\n",
53 | "import numpy as np\n",
54 | "import matplotlib.pyplot as plt\n",
55 | "from matplotlib.colors import ListedColormap\n",
56 | "from sklearn import datasets # Import the dataset from scikit-learn\n",
57 | "from sklearn.svm import SVC\n",
58 | "from sklearn.model_selection import train_test_split, KFold\n",
59 | "\n",
60 | "# Import patch for drawing rectangles in the legend\n",
61 | "from matplotlib.patches import Rectangle\n",
62 | "\n",
63 | "# Create color maps\n",
64 | "cmap_bold = ListedColormap(['#FF0000', '#00FF00', '#0000FF'])\n",
65 | "cmap_light = ListedColormap(['#FFAAAA', '#AAFFAA', '#AAAAFF'])\n",
66 | "\n",
67 | "# Create a legend for the colors, using rectangles for the corresponding colormap colors\n",
68 | "labelList = []\n",
69 | "for color in cmap_bold.colors:\n",
70 | " labelList.append(Rectangle((0, 0), 1, 1, fc=color))"
71 | ]
72 | },
73 | {
74 | "cell_type": "markdown",
75 | "metadata": {},
76 | "source": [
77 | "Import the dataset\n",
78 | "===\n",
79 | "Import the dataset and store it to a variable called iris. Scikit-learn's explanation of the dataset is [here](http://scikit-learn.org/stable/auto_examples/datasets/plot_iris_dataset.html). This dataset is similar to a python dictionary, with the keys: ['DESCR', 'target_names', 'target', 'data', 'feature_names']\n",
80 | "\n",
81 | "The data features are stored in iris.data, where each row is an example from a single flower, and each column is a single feature. The feature names are stored in iris.feature_names. Labels are stored as the numbers 0, 1, or 2 in iris.target, and the names of these labels are in iris.target_names.\n",
82 | "\n",
83 | "The dataset consists of measurements made on 50 examples from each of three different species of iris flowers (Setosa, Versicolour, and Virginica). Each example has four features (or measurements): [sepal](http://en.wikipedia.org/wiki/Sepal) length, sepal width, [petal](http://en.wikipedia.org/wiki/Petal) length, and petal width. All measurements are in cm.\n",
84 | "\n",
85 | "Below, we load the labels into y, the corresponding label names into labelNames, the data into X, and the names of the features into featureNames."
86 | ]
87 | },
88 | {
89 | "cell_type": "code",
90 | "execution_count": null,
91 | "metadata": {},
92 | "outputs": [],
93 | "source": [
94 | "# Import some data to play with\n",
95 | "iris = datasets.load_iris()\n",
96 | "\n",
97 | "# Store the labels (y), label names, features (X), and feature names\n",
98 | "y = iris.target # Labels are stored in y as numbers\n",
99 | "labelNames = iris.target_names # Species names corresponding to labels 0, 1, and 2\n",
100 | "X = iris.data\n",
101 | "featureNames = iris.feature_names\n"
102 | ]
103 | },
104 | {
105 | "cell_type": "markdown",
106 | "metadata": {},
107 | "source": [
108 | "Below, we plot the first two features from the dataset (sepal length and width). Normally we would try to use all useful features, but sticking with two allows us to visualize the data more easily.\n",
109 | "\n",
110 | "Then we plot the data to get a look at what we're dealing with. The colormap is used to determine what colors are used for each class when plotting."
111 | ]
112 | },
113 | {
114 | "cell_type": "code",
115 | "execution_count": null,
116 | "metadata": {},
117 | "outputs": [],
118 | "source": [
119 | "# Plot the data\n",
120 | "\n",
121 | "# Sepal length and width\n",
122 | "X_small = X[:,:2]\n",
123 | "# Get the minimum and maximum values with an additional 0.5 border\n",
124 | "x_min, x_max = X_small[:, 0].min() - .5, X_small[:, 0].max() + .5\n",
125 | "y_min, y_max = X_small[:, 1].min() - .5, X_small[:, 1].max() + .5\n",
126 | "\n",
127 | "plt.figure(figsize=(8, 6))\n",
128 | "\n",
129 | "# Plot the training points\n",
130 | "plt.scatter(X_small[:, 0], X_small[:, 1], c=y, cmap=cmap_bold)\n",
131 | "plt.xlabel('Sepal length (cm)')\n",
132 | "plt.ylabel('Sepal width (cm)')\n",
133 | "plt.title('Sepal width vs length')\n",
134 | "\n",
135 | "# Set the plot limits\n",
136 | "plt.xlim(x_min, x_max)\n",
137 | "plt.ylim(y_min, y_max)\n",
138 | "\n",
139 | "# Plot the legend\n",
140 | "plt.legend(labelList, labelNames)\n",
141 | "\n",
142 | "plt.show()\n"
143 | ]
144 | },
145 | {
146 | "cell_type": "markdown",
147 | "metadata": {},
148 | "source": [
149 | "Support vector machines: training\n",
150 | "===\n",
151 | "Next, we train a SVM classifier on our data. \n",
152 | "\n",
153 | "The first line creates our classifier using the [SVC()](http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html) function. For now we can ignore the parameter kernel='linear', this just means the decision boundaries should be straight lines. The second line uses the fit() method to train the classifier on the features in X_small, using the labels in y.\n",
154 | "\n",
155 | "It is safe to ignore the parameter 'decision_function_shape'. This is not important for this tutorial, but its inclusion prevents warnings from Scikit-learn about future changes to the default."
156 | ]
157 | },
158 | {
159 | "cell_type": "code",
160 | "execution_count": null,
161 | "metadata": {},
162 | "outputs": [],
163 | "source": [
164 | "# Create an instance of SVM and fit the data.\n",
165 | "clf = SVC(kernel='linear', decision_function_shape='ovo')\n",
166 | "clf.fit(X_small, y)"
167 | ]
168 | },
169 | {
170 | "cell_type": "markdown",
171 | "metadata": {},
172 | "source": [
173 | "Plot the classification boundaries\n",
174 | "===\n",
175 | "Now that we have our classifier, let's visualize what it's doing. \n",
176 | "\n",
177 | "First we plot the decision spaces, or the areas assigned to the different labels (species of iris). Then we plot our examples onto the space, showing where each point lies and the corresponding decision boundary.\n",
178 | "\n",
179 | "The colored background shows the areas that are considered to belong to a certain label. If we took sepal measurements from a new flower, we could plot it in this space and use the color to determine which type of iris our classifier believes it to be."
180 | ]
181 | },
182 | {
183 | "cell_type": "code",
184 | "execution_count": null,
185 | "metadata": {},
186 | "outputs": [],
187 | "source": [
188 | "h = .02 # step size in the mesh\n",
189 | "\n",
190 | "# Plot the decision boundary. For that, we will assign a color to each\n",
191 | "# point in the mesh [x_min, m_max]x[y_min, y_max].\n",
192 | "x_min, x_max = X_small[:, 0].min() - 1, X_small[:, 0].max() + 1\n",
193 | "y_min, y_max = X_small[:, 1].min() - 1, X_small[:, 1].max() + 1\n",
194 | "xx, yy = np.meshgrid(np.arange(x_min, x_max, h),\n",
195 | " np.arange(y_min, y_max, h))\n",
196 | "\n",
197 |     "Z = clf.predict(np.c_[xx.ravel(), yy.ravel()]) # Make a prediction at every point \n",
198 | " # in the mesh in order to find the \n",
199 | " # classification areas for each label\n",
200 | "\n",
201 | "# Put the result into a color plot\n",
202 | "Z = Z.reshape(xx.shape)\n",
203 | "plt.figure(figsize=(8, 6))\n",
204 | "plt.pcolormesh(xx, yy, Z, cmap=cmap_light)\n",
205 | "\n",
206 | "# Plot the training points\n",
207 | "plt.scatter(X_small[:, 0], X_small[:, 1], c=y, cmap=cmap_bold)\n",
208 | "plt.xlim(xx.min(), xx.max())\n",
209 | "plt.ylim(yy.min(), yy.max())\n",
210 | "plt.title(\"3-Class classification (SVM)\")\n",
211 | "plt.xlabel('Sepal length (cm)')\n",
212 | "plt.ylabel('Sepal width (cm)')\n",
213 | "\n",
214 | "# Plot the legend\n",
215 | "plt.legend(labelList, labelNames)\n",
216 | "\n",
217 | "plt.show()"
218 | ]
219 | },
220 | {
221 | "cell_type": "markdown",
222 | "metadata": {},
223 | "source": [
224 | "Making predictions\n",
225 | "===\n",
226 | "\n",
227 | "Now, let's say we go out and measure the sepals of two new iris plants, and want to know what species they are. We're going to use our classifier to predict the flowers with the following measurements:\n",
228 | "\n",
229 | "Plant | Sepal length | Sepal width\n",
230 | "------|--------------|------------\n",
231 | "A |4.3 |2.5\n",
232 | "B |6.3 |2.1\n",
233 | "\n",
234 |     "We can use our classifier's [predict()](http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html#sklearn.svm.SVC.predict) function to predict the labels for our input features. We pass the variable examples to the predict() function; it is a list in which each element is another list containing the features (measurements) for a particular example. The output is a list of labels corresponding to the input examples.\n"
235 | ]
236 | },
237 | {
238 | "cell_type": "code",
239 | "execution_count": null,
240 | "metadata": {},
241 | "outputs": [],
242 | "source": [
243 | "# Add our new data examples\n",
244 | "examples = [[4.3, 2.5], # Plant A\n",
245 | " [6.3, 2.1]] # Plant B\n",
246 | "\n",
247 | "\n",
248 | "# Create an instance of SVM and fit the data\n",
249 | "clf = SVC(kernel='linear', decision_function_shape='ovo')\n",
250 | "clf.fit(X_small, y)\n",
251 | "\n",
252 | "# Predict the labels for our new examples\n",
253 | "labels = clf.predict(examples)\n",
254 | "\n",
255 | "# Print the predicted species names\n",
256 | "print('A: ' + labelNames[labels[0]])\n",
257 | "print('B: ' + labelNames[labels[1]])\n"
258 | ]
259 | },
260 | {
261 | "cell_type": "markdown",
262 | "metadata": {},
263 | "source": [
264 | "Plotting our predictions\n",
265 | "---\n",
266 | "Now let's plot our predictions to see why they were classified that way."
267 | ]
268 | },
269 | {
270 | "cell_type": "code",
271 | "execution_count": null,
272 | "metadata": {},
273 | "outputs": [],
274 | "source": [
275 | "# Now plot the results\n",
276 | "h = .02 # step size in the mesh\n",
277 | "\n",
278 | "# Plot the decision boundary. For that, we will assign a color to each\n",
279 | "# point in the mesh [x_min, m_max]x[y_min, y_max].\n",
280 | "x_min, x_max = X_small[:, 0].min() - 1, X_small[:, 0].max() + 1\n",
281 | "y_min, y_max = X_small[:, 1].min() - 1, X_small[:, 1].max() + 1\n",
282 | "xx, yy = np.meshgrid(np.arange(x_min, x_max, h),\n",
283 | " np.arange(y_min, y_max, h))\n",
284 | "\n",
285 |     "Z = clf.predict(np.c_[xx.ravel(), yy.ravel()]) # Make a prediction at every point \n",
286 | " # in the mesh in order to find the \n",
287 | " # classification areas for each label\n",
288 | "\n",
289 | "# Put the result into a color plot\n",
290 | "Z = Z.reshape(xx.shape)\n",
291 | "plt.figure(figsize=(8, 6))\n",
292 | "plt.pcolormesh(xx, yy, Z, cmap=cmap_light)\n",
293 | "\n",
294 | "# Plot the training points\n",
295 | "plt.scatter(X_small[:, 0], X_small[:, 1], c=y, cmap=cmap_bold)\n",
296 | "plt.xlim(xx.min(), xx.max())\n",
297 | "plt.ylim(yy.min(), yy.max())\n",
298 | "plt.title(\"3-Class classification (SVM)\")\n",
299 | "plt.xlabel('Sepal length (cm)')\n",
300 | "plt.ylabel('Sepal width (cm)')\n",
301 | "\n",
302 | "# Display the new examples as labeled text on the graph\n",
303 | "plt.text(examples[0][0], examples[0][1],'A', fontsize=14)\n",
304 | "plt.text(examples[1][0], examples[1][1],'B', fontsize=14)\n",
305 | "\n",
306 | "# Plot the legend\n",
307 | "plt.legend(labelList, labelNames)\n",
308 | "\n",
309 | "plt.show()"
310 | ]
311 | },
312 | {
313 | "cell_type": "markdown",
314 | "metadata": {},
315 | "source": [
316 | "What are the support vectors in this example?\n",
317 | "===\n",
318 | "\n",
319 | "Below, we define a function to plot the solid decision boundary and corresponding dashed lines, as shown in the introductory picture. Because there are three classes to separate, there will now be three sets of lines."
320 | ]
321 | },
322 | {
323 | "cell_type": "code",
324 | "execution_count": null,
325 | "metadata": {},
326 | "outputs": [],
327 | "source": [
328 | "def plot_svc_decision_function(clf):\n",
329 | " \"\"\"Plot the decision function for a 2D SVC\"\"\"\n",
330 | " x = np.linspace(plt.xlim()[0], plt.xlim()[1], 30)\n",
331 | " y = np.linspace(plt.ylim()[0], plt.ylim()[1], 30)\n",
332 | " Y, X = np.meshgrid(y, x)\n",
333 | " P = np.zeros((3,X.shape[0],X.shape[1]))\n",
334 | " for i, xi in enumerate(x):\n",
335 | " for j, yj in enumerate(y):\n",
336 | " P[:, i,j] = clf.decision_function([[xi, yj]])[0]\n",
337 | " for ind in range(3):\n",
338 | " plt.contour(X, Y, P[ind,:,:], colors='k',\n",
339 | " levels=[-1, 0, 1],\n",
340 | " linestyles=['--', '-', '--'])"
341 | ]
342 | },
343 | {
344 | "cell_type": "markdown",
345 | "metadata": {},
346 | "source": [
347 | "And now we plot the lines on top of our previous plot"
348 | ]
349 | },
350 | {
351 | "cell_type": "code",
352 | "execution_count": null,
353 | "metadata": {
354 | "scrolled": true
355 | },
356 | "outputs": [],
357 | "source": [
358 | "# Now plot the results\n",
359 | "h = .02 # step size in the mesh\n",
360 | "\n",
361 | "# Plot the decision boundary. For that, we will assign a color to each\n",
362 | "# point in the mesh [x_min, m_max]x[y_min, y_max].\n",
363 | "x_min, x_max = X_small[:, 0].min() - 1, X_small[:, 0].max() + 1\n",
364 | "y_min, y_max = X_small[:, 1].min() - 1, X_small[:, 1].max() + 1\n",
365 | "xx, yy = np.meshgrid(np.arange(x_min, x_max, h),\n",
366 | " np.arange(y_min, y_max, h))\n",
367 | "\n",
368 | "Z = clf.predict(np.c_[xx.ravel(), yy.ravel()]) # Make a prediction at every point \n",
369 | " # in the mesh in order to find the \n",
370 | " # classification areas for each label\n",
371 | "\n",
372 | "# Put the result into a color plot\n",
373 | "Z = Z.reshape(xx.shape)\n",
374 | "plt.figure(figsize=(8, 6))\n",
375 | "plt.pcolormesh(xx, yy, Z, cmap=cmap_light)\n",
376 | "\n",
377 | "# Plot the training points\n",
378 | "plt.scatter(X_small[:, 0], X_small[:, 1], c=y, cmap=cmap_bold)\n",
379 | "plt.xlim(xx.min(), xx.max())\n",
380 | "plt.ylim(yy.min(), yy.max())\n",
381 | "plt.title(\"3-Class classification (SVM)\")\n",
382 | "plt.xlabel('Sepal length (cm)')\n",
383 | "plt.ylabel('Sepal width (cm)')\n",
384 | "\n",
385 | "# Display the new examples as labeled text on the graph\n",
386 | "plt.text(examples[0][0], examples[0][1],'A', fontsize=14)\n",
387 | "plt.text(examples[1][0], examples[1][1],'B', fontsize=14)\n",
388 | "\n",
389 | "# Plot the legend\n",
390 | "plt.legend(labelList, labelNames)\n",
391 | "\n",
392 |     "plot_svc_decision_function(clf) # Plot the decision function\n",
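    "\n",
    "# Sketch (an addition for illustration, not part of the original plot): the support\n",
    "# vectors found by the fitted classifier can be highlighted with open black circles.\n",
    "plt.scatter(clf.support_vectors_[:, 0], clf.support_vectors_[:, 1],\n",
    "            s=80, facecolors='none', edgecolors='k')\n",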
393 | "\n",
394 | "plt.show()\n"
395 | ]
396 | },
397 | {
398 | "cell_type": "markdown",
399 | "metadata": {},
400 | "source": [
401 | "This plot is much more visually cluttered than our previous toy example. There are a few points worth noticing if you take a closer look. \n",
402 | "\n",
403 |     "First, notice how each of the three solid lines runs right along one of the decision boundaries. These are used to determine the boundaries between the classification areas (where the colors change). \n",
404 | "\n",
405 |     "Additionally, while the parallel dotted lines still pass through one or more support vectors, there are now data points located between the decision boundary and the dotted line (and even on the wrong side of the decision boundary!). This happens when our data is not \"perfectly separable\". A perfectly separable dataset is one where the classes can be separated completely with a single, straight (or at least simple) line. While it makes for nice examples, real-world machine learning datasets are almost never perfectly separable."
406 | ]
407 | },
408 | {
409 | "cell_type": "markdown",
410 | "metadata": {},
411 | "source": [
412 | "Kernels: Changing The Decision Boundary Lines\n",
413 | "===\n",
414 | "\n",
415 |     "In our previous example, all of the decision boundaries are straight lines. But what if our data is grouped into more circular clusters? Maybe a curved line would separate the data better.\n",
416 | "\n",
417 | "SVMs use something called [kernels](http://scikit-learn.org/stable/modules/svm.html#kernel-functions) to determine the shape of the decision boundary. Remember that when we called the SVC() function we gave it a parameter kernel='linear', which made the boundaries straight. A different kernel, the radial basis function (RBF) groups data into circular clusters instead of dividing by straight lines.\n",
418 | "\n",
419 | "Below we show the same example as before, but with an RBF kernel."
420 | ]
421 | },
422 | {
423 | "cell_type": "code",
424 | "execution_count": null,
425 | "metadata": {},
426 | "outputs": [],
427 | "source": [
428 | "# Create an instance of SVM and fit the data.\n",
429 | "clf = SVC(kernel='rbf', decision_function_shape='ovo') # Use the RBF kernel this time\n",
430 | "clf.fit(X_small, y)\n",
431 | "\n",
432 | "# Now plot the results\n",
433 | "h = .02 # step size in the mesh\n",
434 | "\n",
435 | "# Plot the decision boundary. For that, we will assign a color to each\n",
436 | "# point in the mesh [x_min, m_max]x[y_min, y_max].\n",
437 | "x_min, x_max = X_small[:, 0].min() - 1, X_small[:, 0].max() + 1\n",
438 | "y_min, y_max = X_small[:, 1].min() - 1, X_small[:, 1].max() + 1\n",
439 | "xx, yy = np.meshgrid(np.arange(x_min, x_max, h),\n",
440 | " np.arange(y_min, y_max, h))\n",
441 | "\n",
442 |     "Z = clf.predict(np.c_[xx.ravel(), yy.ravel()]) # Make a prediction at every point \n",
443 | " # in the mesh in order to find the \n",
444 | " # classification areas for each label\n",
445 | "\n",
446 | "# Put the result into a color plot\n",
447 | "Z = Z.reshape(xx.shape)\n",
448 | "plt.figure(figsize=(8, 6))\n",
449 | "plt.pcolormesh(xx, yy, Z, cmap=cmap_light)\n",
450 | "\n",
451 | "# Plot the training points\n",
452 | "plt.scatter(X_small[:, 0], X_small[:, 1], c=y, cmap=cmap_bold)\n",
453 | "plt.xlim(xx.min(), xx.max())\n",
454 | "plt.ylim(yy.min(), yy.max())\n",
455 | "plt.title(\"3-Class classification (SVM)\")\n",
456 | "plt.xlabel('Sepal length (cm)')\n",
457 | "plt.ylabel('Sepal width (cm)')\n",
458 | "\n",
459 | "# Display the new examples as labeled text on the graph\n",
460 | "plt.text(examples[0][0], examples[0][1],'A', fontsize=14)\n",
461 | "plt.text(examples[1][0], examples[1][1],'B', fontsize=14)\n",
462 | "\n",
463 | "# Plot the legend\n",
464 | "plt.legend(labelList, labelNames)\n",
465 | "\n",
466 | "plt.show()\n",
467 | "\n"
468 | ]
469 | },
470 | {
471 | "cell_type": "markdown",
472 | "metadata": {},
473 | "source": [
474 |     "The areas are very similar to before, but now their boundaries are curved instead of straight. Now let's add the decision function lines."
475 | ]
476 | },
477 | {
478 | "cell_type": "code",
479 | "execution_count": null,
480 | "metadata": {
481 | "scrolled": true
482 | },
483 | "outputs": [],
484 | "source": [
485 | "# Create an instance of SVM and fit the data.\n",
486 | "clf = SVC(kernel='rbf', decision_function_shape='ovo') # Use the RBF kernel this time\n",
487 | "clf.fit(X_small, y)\n",
488 | "\n",
489 | "# Now plot the results\n",
490 | "h = .02 # step size in the mesh\n",
491 | "\n",
492 | "# Plot the decision boundary. For that, we will assign a color to each\n",
493 | "# point in the mesh [x_min, m_max]x[y_min, y_max].\n",
494 | "x_min, x_max = X_small[:, 0].min() - 1, X_small[:, 0].max() + 1\n",
495 | "y_min, y_max = X_small[:, 1].min() - 1, X_small[:, 1].max() + 1\n",
496 | "xx, yy = np.meshgrid(np.arange(x_min, x_max, h),\n",
497 | " np.arange(y_min, y_max, h))\n",
498 | "\n",
499 |     "Z = clf.predict(np.c_[xx.ravel(), yy.ravel()]) # Make a prediction at every point \n",
500 | " # in the mesh in order to find the \n",
501 | " # classification areas for each label\n",
502 | "\n",
503 | "# Put the result into a color plot\n",
504 | "Z = Z.reshape(xx.shape)\n",
505 | "plt.figure(figsize=(8, 6))\n",
506 | "plt.pcolormesh(xx, yy, Z, cmap=cmap_light)\n",
507 | "\n",
508 | "# Plot the training points\n",
509 | "plt.scatter(X_small[:, 0], X_small[:, 1], c=y, cmap=cmap_bold)\n",
510 | "plt.xlim(xx.min(), xx.max())\n",
511 | "plt.ylim(yy.min(), yy.max())\n",
512 | "plt.title(\"3-Class classification (SVM)\")\n",
513 | "plt.xlabel('Sepal length (cm)')\n",
514 | "plt.ylabel('Sepal width (cm)')\n",
515 | "\n",
516 | "# Display the new examples as labeled text on the graph\n",
517 | "plt.text(examples[0][0], examples[0][1],'A', fontsize=14)\n",
518 | "plt.text(examples[1][0], examples[1][1],'B', fontsize=14)\n",
519 | "\n",
520 | "# Plot the legend\n",
521 | "plt.legend(labelList, labelNames)\n",
522 | "\n",
523 |     "plot_svc_decision_function(clf) # Plot the decision function\n",
524 | "\n",
525 | "plt.show()\n",
526 | "\n"
527 | ]
528 | },
529 | {
530 | "cell_type": "markdown",
531 | "metadata": {},
532 | "source": [
533 | "Now the plot looks very different from before! The solid black lines are now all curves, but each decision boundary still falls along one part of those lines. And instead of having dotted lines parallel to the solid, there are smaller ellipsoids on either side of the solid line."
534 | ]
535 | },
536 | {
537 | "cell_type": "markdown",
538 | "metadata": {},
539 | "source": [
540 | "What other kernels exist?\n",
541 | "---\n",
542 | "\n",
543 |     "Scikit-learn comes with two other default kernels: polynomial and sigmoid. Advanced users can also create their own kernels, but we will stick to the defaults for now.\n",
544 | "\n",
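    "As a starting hint (only the classifier line changes; the plotting code stays the same):\n",
    "\n",
    "```python\n",
    "clf = SVC(kernel='poly', decision_function_shape='ovo')     # polynomial kernel\n",
    "clf = SVC(kernel='sigmoid', decision_function_shape='ovo')  # sigmoid kernel\n",
    "```\n",
    "\n",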
545 | "Below, modify our previous examples to try out the other kernels. How do they change the decision boundaries?"
546 | ]
547 | },
548 | {
549 | "cell_type": "code",
550 | "execution_count": null,
551 | "metadata": {},
552 | "outputs": [],
553 | "source": [
554 | "# Your code here!"
555 | ]
556 | },
557 | {
558 | "cell_type": "markdown",
559 | "metadata": {},
560 | "source": [
561 | "What about my other features?\n",
562 | "===\n",
563 | "\n",
564 |     "We've been looking at two features: the length and width of the plant's sepal. But what about the other two features, petal length and width? What does the graph look like when trained on the petal length and width? How does it change when you change the SVM kernel?\n",
565 | "\n",
566 | "How would you plot our two new plants, A and B, on these new plots? Assume we have all four measurements for each plant, as shown below.\n",
567 | "\n",
568 | "Plant | Sepal length | Sepal width| Petal length | Petal width\n",
569 | "------|--------------|------------|--------------|------------\n",
570 | "A |4.3 |2.5 | 1.5 | 0.5\n",
571 | "B |6.3 |2.1 | 4.8 | 1.5"
572 | ]
573 | },
574 | {
575 | "cell_type": "code",
576 | "execution_count": null,
577 | "metadata": {},
578 | "outputs": [],
579 | "source": [
580 | "# Your code here!"
581 | ]
582 | },
583 | {
584 | "cell_type": "markdown",
585 | "metadata": {},
586 | "source": [
587 | "Using more than two features\n",
588 | "---\n",
589 | "\n",
590 | "Sticking to two features is great for visualization, but is less practical for solving real machine learning problems. If you have time, you can experiment with using more features to train your classifier. It gets much harder to visualize the results with 3 features, and nearly impossible with 4 or more. There are techniques that can help you visualize the data in some form, and there are also ways to reduce the number of features you use while still retaining (hopefully) the same amount of information. However, those techniques are beyond the scope of this class."
591 | ]
592 | },
593 | {
594 | "cell_type": "code",
595 | "execution_count": null,
596 | "metadata": {
597 | "collapsed": true
598 | },
599 | "outputs": [],
600 | "source": []
601 | },
602 | {
603 | "cell_type": "markdown",
604 | "metadata": {},
605 | "source": [
606 | "Evaluating The Classifier\n",
607 | "===\n",
608 | "\n",
609 | "In order to evaluate a classifier, we need to split our dataset into training data, which we'll show to the classifier so it can learn, and testing data, which we will hold back from training and use to test its predictions.\n",
610 | "\n",
611 | "Below, we create the training and testing datasets, using all four features. We then train our classifier on the training data, and get the predictions for the test data."
612 | ]
613 | },
614 | {
615 | "cell_type": "code",
616 | "execution_count": null,
617 | "metadata": {
618 | "collapsed": true
619 | },
620 | "outputs": [],
621 | "source": [
622 | "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)\n",
623 | "\n",
624 | "# Create an instance of SVM and fit the data\n",
625 | "clf = SVC(kernel='linear', decision_function_shape='ovo')\n",
626 | "clf.fit(X_train, y_train)\n",
627 | "\n",
628 | "# Predict the labels of the test data\n",
629 | "predictions = clf.predict(X_test)"
630 | ]
631 | },
632 | {
633 | "cell_type": "markdown",
634 | "metadata": {},
635 | "source": [
636 |     "Next, we evaluate how well the classifier did. The easiest way to do this is to get the percentage of predictions that are correct, usually referred to as the accuracy."
637 | ]
638 | },
639 | {
640 | "cell_type": "code",
641 | "execution_count": null,
642 | "metadata": {
643 | "scrolled": true
644 | },
645 | "outputs": [],
646 | "source": [
647 | "accuracy = np.mean(predictions == y_test )*100\n",
648 | "\n",
649 | "print('The accuracy is %.2f' % accuracy + '%')"
650 | ]
651 | },
652 | {
653 | "cell_type": "markdown",
654 | "metadata": {
655 | "collapsed": true
656 | },
657 | "source": [
658 | "Comparing Models with Crossvalidation\n",
659 | "===\n",
660 | "\n",
661 | "To select the kernel to use in our model, we need to use crossvalidation. We can then get our final result using our test data.\n",
662 | "\n",
663 | "First we choose the kernels we want to investigate, then divide our training data into folds. We loop through the sets of training and validation folds. Each time, we train each model on the training data and evaluate on the validation data. We store the accuracy of each classifier on each fold so we can look at them later."
664 | ]
665 | },
666 | {
667 | "cell_type": "code",
668 | "execution_count": null,
669 | "metadata": {},
670 | "outputs": [],
671 | "source": [
672 | "# Choose our kernels\n",
673 | "kernels = ['linear', 'rbf']\n",
674 | "\n",
675 | "# Create a dictionary of arrays to store accuracies\n",
676 | "accuracies = {}\n",
677 | "for kernel in kernels:\n",
678 | " accuracies[kernel] = []\n",
679 | "\n",
680 | "# Loop through 5 folds\n",
681 | "kf = KFold(n_splits=5)\n",
682 | "for trainInd, valInd in kf.split(X_train):\n",
683 | " X_tr = X_train[trainInd,:]\n",
684 | " y_tr = y_train[trainInd]\n",
685 | " X_val = X_train[valInd,:]\n",
686 | " y_val = y_train[valInd]\n",
687 | " \n",
688 | " # Loop through each kernel\n",
689 | " for kernel in kernels:\n",
690 | " \n",
691 | " # Create the classifier\n",
692 | " clf = SVC(kernel=kernel, decision_function_shape='ovo')\n",
693 | " \n",
694 | " # Train the classifier\n",
695 | " clf.fit(X_tr, y_tr) \n",
696 | " \n",
697 | " # Make our predictions\n",
698 | " pred = clf.predict(X_val)\n",
699 | " \n",
700 | " # Calculate the accuracy\n",
701 | " accuracies[kernel].append(np.mean(pred == y_val))\n",
702 | " "
703 | ]
704 | },
705 | {
706 | "cell_type": "markdown",
707 | "metadata": {},
708 | "source": [
709 | "Select a Model\n",
710 | "---\n",
711 | "\n",
712 | "To select a model, we look at the average accuracy across all folds."
713 | ]
714 | },
715 | {
716 | "cell_type": "code",
717 | "execution_count": null,
718 | "metadata": {},
719 | "outputs": [],
720 | "source": [
721 | "for kernel in kernels:\n",
722 | " print('%s: %.2f' % (kernel, np.mean(accuracies[kernel])))"
723 | ]
724 | },
725 | {
726 | "cell_type": "markdown",
727 | "metadata": {},
728 | "source": [
729 | "Final Evaluation\n",
730 | "---\n",
731 | "\n",
732 |     "The linear kernel gives us the highest accuracy, so we select it as our best model. Now we can evaluate it on our test set and get our final accuracy rating."
733 | ]
734 | },
735 | {
736 | "cell_type": "code",
737 | "execution_count": null,
738 | "metadata": {},
739 | "outputs": [],
740 | "source": [
741 | "clf = SVC(kernel='linear', decision_function_shape='ovo')\n",
742 | "clf.fit(X_train, y_train)\n",
743 | "predictions = clf.predict(X_test)\n",
744 | "\n",
745 | "accuracy = np.mean(predictions == y_test) * 100\n",
746 | "\n",
747 | "print('The final accuracy is %.2f' % accuracy + '%')"
748 | ]
749 | },
750 | {
751 | "cell_type": "markdown",
752 | "metadata": {},
753 | "source": [
754 | "Sidenote: Randomness and Results\n",
755 | "===\n",
756 | "\n",
757 | "Every time you run this notebook, you will get slightly different results. Why? Because data is randomly divided among the training/testing/validation data sets. Running the code again will create a different division of the data, and will make the results slightly different. However, the overall outcome should remain consistent and have approximately the same values. If you have drastically different results when running an analysis multiple times, it suggests a problem with your model or that you need more data.\n",
758 | "\n",
759 | "If it's important that you get the exact same results every time you run the code, you can specify a random state in the `random_state` argument of `train_test_split()` and `KFold`."
760 | ]
761 | },
762 | {
763 | "cell_type": "code",
764 | "execution_count": null,
765 | "metadata": {
766 | "collapsed": true
767 | },
768 | "outputs": [],
769 | "source": [
770 | "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)"
771 | ]
772 | },
773 | {
774 | "cell_type": "code",
775 | "execution_count": null,
776 | "metadata": {},
777 | "outputs": [],
778 | "source": []
779 | }
780 | ],
781 | "metadata": {
782 | "kernelspec": {
783 | "display_name": "Python 3",
784 | "language": "python",
785 | "name": "python3"
786 | },
787 | "language_info": {
788 | "codemirror_mode": {
789 | "name": "ipython",
790 | "version": 3
791 | },
792 | "file_extension": ".py",
793 | "mimetype": "text/x-python",
794 | "name": "python",
795 | "nbconvert_exporter": "python",
796 | "pygments_lexer": "ipython3",
797 | "version": "3.6.2"
798 | }
799 | },
800 | "nbformat": 4,
801 | "nbformat_minor": 1
802 | }
803 |
--------------------------------------------------------------------------------