├── .gitignore ├── .ipynb_checkpoints ├── 0. Beginning-checkpoint.ipynb ├── 1. Hello, World!-checkpoint.ipynb ├── 2.0 First Impressions of Machine Learning-checkpoint.ipynb ├── 2.1 Supervised Learning - Classification-checkpoint.ipynb ├── 2.2 Supervised Learning - Regression-checkpoint.ipynb ├── 2.3 Unsupervised Learning - Transformations and Dimensionality Reduction-checkpoint.ipynb ├── 2.4 Unsupervised Learning - Clustering-checkpoint.ipynb └── 3. Validations and Learning Curves-checkpoint.ipynb ├── 0. Beginning.ipynb ├── 1. Hello, World!.ipynb ├── 2.0 First Impressions of Machine Learning.ipynb ├── 2.1 Supervised Learning - Classification.ipynb ├── 2.2 Supervised Learning - Regression.ipynb ├── 2.3 Unsupervised Learning - Transformations and Dimensionality Reduction.ipynb ├── 2.4 Unsupervised Learning - Clustering.ipynb ├── 2.5 Review of Scikit-learn API.ipynb ├── 3. Validations and Learning Curves.ipynb ├── 4.1 Example - Supervised Spam Classification.ipynb ├── 4.2 Example - Face Recognition.ipynb ├── 5. Where do we go from here.ipynb ├── README.md ├── cheatsheet.txt ├── data ├── SMSSpamCollection └── readme ├── fetch_data.py ├── figures ├── BangPypers.pdf ├── Pic_BP_PDF-01.png ├── Pic_BP_PDF-02.png ├── Pic_BP_PDF-03.png ├── Pic_BP_PDF-04.png ├── Pic_BP_PDF-05.png ├── Pic_BP_PDF-06.png ├── Pic_BP_PDF-07.png ├── Pic_BP_PDF-08.png ├── Pic_BP_PDF-09.png ├── Pic_BP_PDF-10.png ├── Pic_BP_PDF-11.png ├── Pic_BP_PDF-12.png ├── Pic_BP_PDF-13.png ├── Pic_BP_PDF-14.png ├── Pic_BP_PDF-15.png ├── Pic_BP_PDF-16.png ├── Pic_BP_PDF-17.png ├── Pic_BP_PDF-18.png ├── Pic_BP_PDF-19.png ├── Pic_BP_PDF-20.png ├── Pic_BP_PDF-21.png ├── Pic_BP_PDF-22.png ├── Pic_BP_PDF-23.png ├── Pic_BP_PDF-24.png ├── __init__.py ├── __init__.pyc ├── bowjpg ├── cluster_comparison.png ├── iris_setosa.jpg ├── iris_versicolor.jpg ├── iris_virginica.jpg ├── magician.jpg ├── ml_map.png ├── netflix-prize.png ├── petal_sepal.jpg ├── plot.py ├── plot.pyc ├── plot_2d_separator.py ├── plot_2d_separator.pyc ├── plot_digits_datasets.py ├── plot_digits_datasets.pyc ├── supervised_workflow.svg ├── train_test_split.svg ├── train_validation_test2.svg └── unsupervised_workflow.svg └── scripts ├── classify_iris.py ├── cluster_digits.py ├── knn_iris.py ├── knn_regression.py └── plot_digits.py /.gitignore: -------------------------------------------------------------------------------- 1 | .ipynb_checkpoints 2 | /data/lfw-home 3 | -------------------------------------------------------------------------------- /.ipynb_checkpoints/0. 
Beginning-checkpoint.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [], 3 | "metadata": {}, 4 | "nbformat": 4, 5 | "nbformat_minor": 0 6 | } 7 | -------------------------------------------------------------------------------- /.ipynb_checkpoints/2.3 Unsupervised Learning - Transformations and Dimensionality Reduction-checkpoint.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": null, 6 | "metadata": { 7 | "collapsed": false 8 | }, 9 | "outputs": [], 10 | "source": [ 11 | "%matplotlib inline\n", 12 | "import matplotlib.pyplot as plt\n", 13 | "import numpy as np" 14 | ] 15 | }, 16 | { 17 | "cell_type": "markdown", 18 | "metadata": {}, 19 | "source": [ 20 | "# Unsupervised Learning\n" 21 | ] 22 | }, 23 | { 24 | "cell_type": "markdown", 25 | "metadata": {}, 26 | "source": [ 27 | "Many instances of unsupervised learning, such as dimensionality reduction, manifold learning and feature extraction, find a new representation of the input data without any additional input.\n", 28 | "\n", 29 | "\n", 30 | "\n", 31 | "The most simple example of this, which can barely be called learning, is rescaling the data to have zero mean and unit variance. This is a helpful preprocessing step for many machine learning models.\n", 32 | "\n", 33 | "Applying such a preprocessing has a very similar interface to the supervised learning algorithms we saw so far.\n", 34 | "Let's load the iris dataset and rescale it:" 35 | ] 36 | }, 37 | { 38 | "cell_type": "code", 39 | "execution_count": null, 40 | "metadata": { 41 | "collapsed": false 42 | }, 43 | "outputs": [], 44 | "source": [ 45 | "from sklearn.datasets import load_iris\n", 46 | "\n", 47 | "iris = load_iris()\n", 48 | "X, y = iris.data, iris.target\n", 49 | "print(X.shape)" 50 | ] 51 | }, 52 | { 53 | "cell_type": "markdown", 54 | "metadata": {}, 55 | "source": [ 56 | "The iris dataset is not \"centered\" that is it has non-zero mean and the standard deviation is different for each component:\n" 57 | ] 58 | }, 59 | { 60 | "cell_type": "code", 61 | "execution_count": null, 62 | "metadata": { 63 | "collapsed": false 64 | }, 65 | "outputs": [], 66 | "source": [ 67 | "print(\"mean : %s \" % X.mean(axis=0))\n", 68 | "print(\"standard deviation : %s \" % X.std(axis=0))" 69 | ] 70 | }, 71 | { 72 | "cell_type": "markdown", 73 | "metadata": {}, 74 | "source": [ 75 | "To use a preprocessing method, we first import the estimator, here StandardScaler and instantiate it:\n", 76 | " " 77 | ] 78 | }, 79 | { 80 | "cell_type": "code", 81 | "execution_count": null, 82 | "metadata": { 83 | "collapsed": true 84 | }, 85 | "outputs": [], 86 | "source": [ 87 | "from sklearn.preprocessing import StandardScaler\n", 88 | "scaler = StandardScaler()" 89 | ] 90 | }, 91 | { 92 | "cell_type": "markdown", 93 | "metadata": {}, 94 | "source": [ 95 | "As with the classification and regression algorithms, we call ``fit`` to learn the model from the data. As this is an unsupervised model, we only pass ``X``, not ``y``. This simply estimates mean and standard deviation." 
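For reference, a minimal end-to-end sketch of the scaling workflow this notebook walks through cell by cell; it only uses `load_iris` and `StandardScaler` as introduced above, and is not part of the notebook's own cells:

```python
# Compact sketch of the scaling steps described above: load the iris data,
# fit a StandardScaler (estimates per-feature mean and std), transform,
# and verify the result.
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

iris = load_iris()
X = iris.data

scaler = StandardScaler()
scaler.fit(X)                  # unsupervised: only X is passed, no y
X_scaled = scaler.transform(X)

print(X_scaled.mean(axis=0))   # approximately zero for every feature
print(X_scaled.std(axis=0))    # approximately one for every feature
```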
96 | ] 97 | }, 98 | { 99 | "cell_type": "code", 100 | "execution_count": null, 101 | "metadata": { 102 | "collapsed": false 103 | }, 104 | "outputs": [], 105 | "source": [ 106 | "scaler.fit(X)" 107 | ] 108 | }, 109 | { 110 | "cell_type": "markdown", 111 | "metadata": {}, 112 | "source": [ 113 | "Now we can rescale our data by applying the ``transform`` (not ``predict``) method:" 114 | ] 115 | }, 116 | { 117 | "cell_type": "code", 118 | "execution_count": null, 119 | "metadata": { 120 | "collapsed": true 121 | }, 122 | "outputs": [], 123 | "source": [ 124 | "X_scaled = scaler.transform(X)" 125 | ] 126 | }, 127 | { 128 | "cell_type": "markdown", 129 | "metadata": {}, 130 | "source": [ 131 | "``X_scaled`` has the same number of samples and features, but the mean was subtracted and all features were scaled to have unit standard deviation:" 132 | ] 133 | }, 134 | { 135 | "cell_type": "code", 136 | "execution_count": null, 137 | "metadata": { 138 | "collapsed": false 139 | }, 140 | "outputs": [], 141 | "source": [ 142 | "print(X_scaled.shape)" 143 | ] 144 | }, 145 | { 146 | "cell_type": "code", 147 | "execution_count": null, 148 | "metadata": { 149 | "collapsed": false 150 | }, 151 | "outputs": [], 152 | "source": [ 153 | "print(\"mean : %s \" % X_scaled.mean(axis=0))\n", 154 | "print(\"standard deviation : %s \" % X_scaled.std(axis=0))" 155 | ] 156 | }, 157 | { 158 | "cell_type": "markdown", 159 | "metadata": {}, 160 | "source": [ 161 | "Principal Component Analysis\n", 162 | "============================" 163 | ] 164 | }, 165 | { 166 | "cell_type": "markdown", 167 | "metadata": {}, 168 | "source": [ 169 | "An unsupervised transformation that is somewhat more interesting is Principle Component Analysis (PCA).\n", 170 | "It is a technique to reduce the dimensionality of the data, by creating a linear projection.\n", 171 | "That is, we find new features to represent the data that are a linear combination of the old data (i.e. we rotate it).\n", 172 | "\n", 173 | "The way PCA finds these new directions is by looking for the directions of maximum variance.\n", 174 | "Usually only few components that explain most of the variance in the data are kept. To illustrate how a rotation might look like, we first show it on two dimensional data and keep both principal components.\n", 175 | "\n", 176 | "We create a Gaussian blob that is rotated:" 177 | ] 178 | }, 179 | { 180 | "cell_type": "code", 181 | "execution_count": null, 182 | "metadata": { 183 | "collapsed": false 184 | }, 185 | "outputs": [], 186 | "source": [ 187 | "rnd = np.random.RandomState(5)\n", 188 | "X_ = rnd.normal(size=(300, 2))\n", 189 | "X_blob = np.dot(X_, rnd.normal(size=(2, 2))) + rnd.normal(size=2)\n", 190 | "y = X_[:, 0] > 0\n", 191 | "plt.scatter(X_blob[:, 0], X_blob[:, 1], c=y, linewidths=0, s=30)\n", 192 | "plt.xlabel(\"feature 1\")\n", 193 | "plt.ylabel(\"feature 2\")" 194 | ] 195 | }, 196 | { 197 | "cell_type": "markdown", 198 | "metadata": {}, 199 | "source": [ 200 | "As always, we instantiate our PCA model. By default all directions are kept." 201 | ] 202 | }, 203 | { 204 | "cell_type": "code", 205 | "execution_count": null, 206 | "metadata": { 207 | "collapsed": false 208 | }, 209 | "outputs": [], 210 | "source": [ 211 | "from sklearn.decomposition import PCA\n", 212 | "pca = PCA()" 213 | ] 214 | }, 215 | { 216 | "cell_type": "markdown", 217 | "metadata": {}, 218 | "source": [ 219 | "Then we fit the PCA model with our data. As PCA is an unsupervised algorithm, there is no output ``y``." 
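As a side note (not shown in the notebook's cells), a fitted PCA object exposes the learned directions and how much variance each one explains; a small sketch, recreating the same rotated blob as above:

```python
# Hedged sketch: inspecting a fitted PCA via its standard attributes
# components_ and explained_variance_ratio_ (these attributes are part of
# scikit-learn's PCA, but are not used in the notebook itself).
import numpy as np
from sklearn.decomposition import PCA

rnd = np.random.RandomState(5)
X_blob = np.dot(rnd.normal(size=(300, 2)), rnd.normal(size=(2, 2))) + rnd.normal(size=2)

pca = PCA()
pca.fit(X_blob)                        # unsupervised: only X, no y

print(pca.components_)                 # one row per principal direction (a rotation)
print(pca.explained_variance_ratio_)   # fraction of variance captured by each component
```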
220 | ] 221 | }, 222 | { 223 | "cell_type": "code", 224 | "execution_count": null, 225 | "metadata": { 226 | "collapsed": false 227 | }, 228 | "outputs": [], 229 | "source": [ 230 | "pca.fit(X_blob)" 231 | ] 232 | }, 233 | { 234 | "cell_type": "markdown", 235 | "metadata": {}, 236 | "source": [ 237 | "Then we can transform the data, projected on the principal components:" 238 | ] 239 | }, 240 | { 241 | "cell_type": "code", 242 | "execution_count": null, 243 | "metadata": { 244 | "collapsed": false 245 | }, 246 | "outputs": [], 247 | "source": [ 248 | "X_pca = pca.transform(X_blob)\n", 249 | "\n", 250 | "plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y, linewidths=0, s=30)\n", 251 | "plt.xlabel(\"first principal component\")\n", 252 | "plt.ylabel(\"second principal component\")" 253 | ] 254 | }, 255 | { 256 | "cell_type": "markdown", 257 | "metadata": {}, 258 | "source": [ 259 | "On the left of the plot you can see the four points that were on the top right before. PCA found fit first component to be along the diagonal, and the second to be perpendicular to it. As PCA finds a rotation, the principal components are always at right angles to each other." 260 | ] 261 | }, 262 | { 263 | "cell_type": "markdown", 264 | "metadata": {}, 265 | "source": [ 266 | "Dimensionality Reduction for Visualization with PCA\n", 267 | "-------------------------------------------------------------\n", 268 | "Consider the digits dataset. It cannot be visualized in a single 2D plot, as it has 64 features. We are going to extract 2 dimensions to visualize it in, using the example from the sklearn examples [here](http://scikit-learn.org/stable/auto_examples/manifold/plot_lle_digits.html)" 269 | ] 270 | }, 271 | { 272 | "cell_type": "code", 273 | "execution_count": null, 274 | "metadata": { 275 | "collapsed": false 276 | }, 277 | "outputs": [], 278 | "source": [ 279 | "from figures.plot_digits_datasets import digits_plot\n", 280 | "\n", 281 | "digits_plot()" 282 | ] 283 | }, 284 | { 285 | "cell_type": "markdown", 286 | "metadata": {}, 287 | "source": [ 288 | "Note that this projection was determined *without* any information about the\n", 289 | "labels (represented by the colors): this is the sense in which the learning\n", 290 | "is **unsupervised**. Nevertheless, we see that the projection gives us insight\n", 291 | "into the distribution of the different digits in parameter space." 292 | ] 293 | }, 294 | { 295 | "cell_type": "markdown", 296 | "metadata": {}, 297 | "source": [ 298 | "## Manifold Learning\n", 299 | "\n", 300 | "One weakness of PCA is that it cannot detect non-linear features. A set\n", 301 | "of algorithms known as *Manifold Learning* have been developed to address\n", 302 | "this deficiency. 
A canonical dataset used in Manifold learning is the\n", 303 | "*S-curve*, which we briefly saw in an earlier section:" 304 | ] 305 | }, 306 | { 307 | "cell_type": "code", 308 | "execution_count": null, 309 | "metadata": { 310 | "collapsed": false 311 | }, 312 | "outputs": [], 313 | "source": [ 314 | "from sklearn.datasets import make_s_curve\n", 315 | "X, y = make_s_curve(n_samples=1000)\n", 316 | "\n", 317 | "from mpl_toolkits.mplot3d import Axes3D\n", 318 | "ax = plt.axes(projection='3d')\n", 319 | "\n", 320 | "ax.scatter3D(X[:, 0], X[:, 1], X[:, 2], c=y)\n", 321 | "ax.view_init(10, -60)" 322 | ] 323 | }, 324 | { 325 | "cell_type": "markdown", 326 | "metadata": {}, 327 | "source": [ 328 | "This is a 2-dimensional dataset embedded in three dimensions, but it is embedded\n", 329 | "in such a way that PCA cannot discover the underlying data orientation:" 330 | ] 331 | }, 332 | { 333 | "cell_type": "code", 334 | "execution_count": null, 335 | "metadata": { 336 | "collapsed": false 337 | }, 338 | "outputs": [], 339 | "source": [ 340 | "X_pca = PCA(n_components=2).fit_transform(X)\n", 341 | "plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y)" 342 | ] 343 | }, 344 | { 345 | "cell_type": "markdown", 346 | "metadata": {}, 347 | "source": [ 348 | "Manifold learning algorithms, however, available in the ``sklearn.manifold``\n", 349 | "submodule, are able to recover the underlying 2-dimensional manifold:" 350 | ] 351 | }, 352 | { 353 | "cell_type": "code", 354 | "execution_count": null, 355 | "metadata": { 356 | "collapsed": false 357 | }, 358 | "outputs": [], 359 | "source": [ 360 | "from sklearn.manifold import Isomap\n", 361 | "\n", 362 | "iso = Isomap(n_neighbors=15, n_components=2)\n", 363 | "X_iso = iso.fit_transform(X)\n", 364 | "plt.scatter(X_iso[:, 0], X_iso[:, 1], c=y)" 365 | ] 366 | }, 367 | { 368 | "cell_type": "markdown", 369 | "metadata": {}, 370 | "source": [ 371 | "##Exercise\n", 372 | "Compare the results of Isomap and PCA on a 5-class subset of the digits dataset (``load_digits(5)``).\n", 373 | "\n", 374 | "__Bonus__: Also compare to TSNE, another popular manifold learning technique." 375 | ] 376 | }, 377 | { 378 | "cell_type": "code", 379 | "execution_count": null, 380 | "metadata": { 381 | "collapsed": true 382 | }, 383 | "outputs": [], 384 | "source": [ 385 | "from sklearn.datasets import load_digits\n", 386 | "\n", 387 | "digits = load_digits(5)\n", 388 | "\n", 389 | "X = digits.data\n", 390 | "# ..." 391 | ] 392 | } 393 | ], 394 | "metadata": { 395 | "kernelspec": { 396 | "display_name": "Python 2", 397 | "language": "python", 398 | "name": "python2" 399 | }, 400 | "language_info": { 401 | "codemirror_mode": { 402 | "name": "ipython", 403 | "version": 2 404 | }, 405 | "file_extension": ".py", 406 | "mimetype": "text/x-python", 407 | "name": "python", 408 | "nbconvert_exporter": "python", 409 | "pygments_lexer": "ipython2", 410 | "version": "2.7.11" 411 | } 412 | }, 413 | "nbformat": 4, 414 | "nbformat_minor": 0 415 | } 416 | -------------------------------------------------------------------------------- /.ipynb_checkpoints/3. Validations and Learning Curves-checkpoint.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [], 3 | "metadata": {}, 4 | "nbformat": 4, 5 | "nbformat_minor": 0 6 | } 7 | -------------------------------------------------------------------------------- /0. 
Beginning.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": { 6 | "collapsed": true 7 | }, 8 | "source": [ 9 | "## Machine Learning in Python: An Introduction to Scikit-Learn\n" 10 | ] 11 | }, 12 | { 13 | "cell_type": "markdown", 14 | "metadata": {}, 15 | "source": [ 16 | "### What this workshop is about?" 17 | ] 18 | }, 19 | { 20 | "cell_type": "markdown", 21 | "metadata": {}, 22 | "source": [ 23 | "* Introduction to the basics of Machine Learning, and some tips and tricks\n", 24 | "* Introduction to scikit-learn, utilizing it for your machine learning needs" 25 | ] 26 | }, 27 | { 28 | "cell_type": "markdown", 29 | "metadata": {}, 30 | "source": [ 31 | "### Today's Workflow\n", 32 | "\n", 33 | "#### Setup and Introduction\n", 34 | "* Getting your machines to a common working baseline.\n", 35 | "\n", 36 | "#### A Gentle Introduction to Machine Learning and Scikit-Learn\n", 37 | "* What is Machine Learning?\n", 38 | "* Core Terminologies \n", 39 | "* Supervised Learning\n", 40 | "* Unsupervised Learning\n", 41 | "* Evaluation of Models\n", 42 | "* How to choose the right model for your dataset\n", 43 | "\n", 44 | "#### Going deeper with Supervised Learning\n", 45 | "* Classification\n", 46 | "* Regression\n", 47 | "\n", 48 | "#### Going deeper with Unsupervised Learning\n", 49 | "* Clustering\n", 50 | "* Dimensionality Reduction\n", 51 | "\n", 52 | "#### Model Validation\n", 53 | "* Validation and Cross Validation" 54 | ] 55 | }, 56 | { 57 | "cell_type": "code", 58 | "execution_count": null, 59 | "metadata": { 60 | "collapsed": true 61 | }, 62 | "outputs": [], 63 | "source": [] 64 | } 65 | ], 66 | "metadata": { 67 | "anaconda-cloud": {}, 68 | "kernelspec": { 69 | "display_name": "Python [default]", 70 | "language": "python", 71 | "name": "python2" 72 | }, 73 | "language_info": { 74 | "codemirror_mode": { 75 | "name": "ipython", 76 | "version": 2 77 | }, 78 | "file_extension": ".py", 79 | "mimetype": "text/x-python", 80 | "name": "python", 81 | "nbconvert_exporter": "python", 82 | "pygments_lexer": "ipython2", 83 | "version": "2.7.12" 84 | } 85 | }, 86 | "nbformat": 4, 87 | "nbformat_minor": 0 88 | } 89 | -------------------------------------------------------------------------------- /2.1 Supervised Learning - Classification.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": null, 6 | "metadata": { 7 | "collapsed": false 8 | }, 9 | "outputs": [], 10 | "source": [ 11 | "%matplotlib inline\n", 12 | "import matplotlib.pyplot as plt\n", 13 | "import numpy as np\n", 14 | "import seaborn" 15 | ] 16 | }, 17 | { 18 | "cell_type": "markdown", 19 | "metadata": {}, 20 | "source": [ 21 | "To visualize the workings of machine learning algorithms, it is often helpful to study two-dimensional or one-dimensional data, that is data with only one or two features. While in practice, datasets usually have many more features, it is hard to plot high-dimensional data on two-dimensional screens.\n", 22 | "\n", 23 | "We will illustrate some very simple examples before we move on to more \"real world\" data sets." 24 | ] 25 | }, 26 | { 27 | "cell_type": "markdown", 28 | "metadata": {}, 29 | "source": [ 30 | "Classification\n", 31 | "========\n", 32 | "First, we will look at a two class classification problem in two dimensions. We use the synthetic data generated by the ``make_blobs`` function." 
33 | ] 34 | }, 35 | { 36 | "cell_type": "code", 37 | "execution_count": null, 38 | "metadata": { 39 | "collapsed": false 40 | }, 41 | "outputs": [], 42 | "source": [ 43 | "from sklearn.datasets import make_blobs\n", 44 | "X, y = make_blobs(centers=2, random_state=0)\n", 45 | "print(X.shape)\n", 46 | "print(y.shape)\n", 47 | "print(X[:5, :])\n", 48 | "print(y[:5])" 49 | ] 50 | }, 51 | { 52 | "cell_type": "markdown", 53 | "metadata": {}, 54 | "source": [ 55 | "As the data is two-dimensional, we can plot each sample as a point in two-dimensional space, with the first feature being the x-axis and the second feature being the y-axis." 56 | ] 57 | }, 58 | { 59 | "cell_type": "code", 60 | "execution_count": null, 61 | "metadata": { 62 | "collapsed": false 63 | }, 64 | "outputs": [], 65 | "source": [ 66 | "plt.scatter(X[:, 0], X[:, 1], c=y, s=40)\n", 67 | "plt.xlabel(\"first feature\")\n", 68 | "plt.ylabel(\"second feature\")" 69 | ] 70 | }, 71 | { 72 | "cell_type": "markdown", 73 | "metadata": {}, 74 | "source": [ 75 | "As classification is a supervised task, and we are interested in how well the model generalizes, we split our data into a training set,\n", 76 | "to built the model from, and a test-set, to evaluate how well our model performs on new data. The ``train_test_split`` function form the ``cross_validation`` module does that for us, by randomly splitting of 25% of the data for testing.\n" 77 | ] 78 | }, 79 | { 80 | "cell_type": "code", 81 | "execution_count": null, 82 | "metadata": { 83 | "collapsed": true 84 | }, 85 | "outputs": [], 86 | "source": [ 87 | "from sklearn.cross_validation import train_test_split\n", 88 | "X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)" 89 | ] 90 | }, 91 | { 92 | "cell_type": "markdown", 93 | "metadata": {}, 94 | "source": [ 95 | "K Nearest Neighbors\n", 96 | "------------------------------------------------\n", 97 | "A popular and easy to understand classifier is K nearest neighbors (kNN). 
It has one of the simplest learning strategies: given a new, unknown observation, look up in your reference database which ones have the closest features and assign the predominant class.\n" 98 | ] 99 | }, 100 | { 101 | "cell_type": "code", 102 | "execution_count": null, 103 | "metadata": { 104 | "collapsed": false 105 | }, 106 | "outputs": [], 107 | "source": [ 108 | "from sklearn.neighbors import KNeighborsClassifier" 109 | ] 110 | }, 111 | { 112 | "cell_type": "markdown", 113 | "metadata": {}, 114 | "source": [ 115 | "This time we set a parameter of the KNeighborsClassifier to tell it we only want to look at one nearest neighbor:" 116 | ] 117 | }, 118 | { 119 | "cell_type": "code", 120 | "execution_count": null, 121 | "metadata": { 122 | "collapsed": false 123 | }, 124 | "outputs": [], 125 | "source": [ 126 | "knn = KNeighborsClassifier(n_neighbors=9)" 127 | ] 128 | }, 129 | { 130 | "cell_type": "markdown", 131 | "metadata": {}, 132 | "source": [ 133 | "We fit the model with out training data" 134 | ] 135 | }, 136 | { 137 | "cell_type": "code", 138 | "execution_count": null, 139 | "metadata": { 140 | "collapsed": false, 141 | "scrolled": true 142 | }, 143 | "outputs": [], 144 | "source": [ 145 | "knn.fit(X_train, y_train)" 146 | ] 147 | }, 148 | { 149 | "cell_type": "code", 150 | "execution_count": null, 151 | "metadata": { 152 | "collapsed": false 153 | }, 154 | "outputs": [], 155 | "source": [ 156 | "from figures import plot_2d_separator" 157 | ] 158 | }, 159 | { 160 | "cell_type": "code", 161 | "execution_count": null, 162 | "metadata": { 163 | "collapsed": false 164 | }, 165 | "outputs": [], 166 | "source": [ 167 | "plt.scatter(X[:, 0], X[:, 1], c=y, s=40)\n", 168 | "plt.xlabel(\"first feature\")\n", 169 | "plt.ylabel(\"second feature\")\n", 170 | "plot_2d_separator.plot_2d_separator(knn, X)" 171 | ] 172 | }, 173 | { 174 | "cell_type": "code", 175 | "execution_count": null, 176 | "metadata": { 177 | "collapsed": false 178 | }, 179 | "outputs": [], 180 | "source": [ 181 | "knn.score(X_test, y_test)" 182 | ] 183 | }, 184 | { 185 | "cell_type": "markdown", 186 | "metadata": {}, 187 | "source": [ 188 | "## Using a different classifier" 189 | ] 190 | }, 191 | { 192 | "cell_type": "markdown", 193 | "metadata": {}, 194 | "source": [ 195 | "Now we'll take a few minutes and try out another learning model. Because of ``scikit-learn``'s uniform interface, the syntax is identical to that of ``LinearSVC`` above.\n", 196 | "\n", 197 | "There are many possibilities of classifiers; you could try any of the methods discussed at . Alternatively, you can explore what's available in ``scikit-learn`` using just the tab-completion feature. For example, import the ``linear_model`` submodule:" 198 | ] 199 | }, 200 | { 201 | "cell_type": "code", 202 | "execution_count": null, 203 | "metadata": { 204 | "collapsed": true 205 | }, 206 | "outputs": [], 207 | "source": [ 208 | "from sklearn import linear_model" 209 | ] 210 | }, 211 | { 212 | "cell_type": "markdown", 213 | "metadata": {}, 214 | "source": [ 215 | "And use the tab completion to find what's available. Type ``linear_model.`` and then the tab key to see an interactive list of the functions within this submodule. The ones which begin with capital letters are the models which are available." 
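As one concrete way the exercise below could look, here is a hedged sketch using `LogisticRegression` (any of the suggested estimators would work the same way thanks to the uniform interface); it assumes the `X_train`, `X_test`, `y_train`, `y_test` splits created earlier in this notebook:

```python
# Sketch of the "try a different classifier" exercise: only the import and
# constructor change, the fit/predict/score calls stay identical.
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression()
clf.fit(X_train, y_train)            # train on the training split
y_pred = clf.predict(X_test)         # predict labels for unseen points
print(clf.score(X_test, y_test))     # mean accuracy on the test split
```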
216 | ] 217 | }, 218 | { 219 | "cell_type": "markdown", 220 | "metadata": {}, 221 | "source": [ 222 | "Now select a new classifier and try out a classification of the iris data.\n", 223 | "\n", 224 | "Some good choices are\n", 225 | "\n", 226 | "- ``sklearn.svm.LinearSVC`` :\n", 227 | " Support Vector Machines without kernels based on liblinear\n", 228 | "\n", 229 | "- ``sklearn.svm.SVC`` :\n", 230 | " Support Vector Machines with kernels based on libsvm\n", 231 | "\n", 232 | "- ``sklearn.linear_model.LogisticRegression`` :\n", 233 | " Regularized Logistic Regression based on liblinear\n", 234 | "\n", 235 | "- ``sklearn.linear_model.SGDClassifier`` :\n", 236 | " Regularized linear models (SVM or logistic regression) using a Stochastic Gradient Descent algorithm written in Cython\n", 237 | "\n", 238 | "- ``sklearn.neighbors.NeighborsClassifier`` :\n", 239 | " k-Nearest Neighbors classifier based on the ball tree datastructure for low dimensional data and brute force search for high dimensional data\n", 240 | "\n", 241 | "- ``sklearn.naive_bayes.GaussianNB`` :\n", 242 | " Gaussian Naive Bayes model. This is an unsophisticated model which can be trained very quickly. It is often used to obtain baseline results before moving to a more sophisticated classifier.\n", 243 | "\n", 244 | "- ``sklearn.tree.DecisionTreeClassifier`` :\n", 245 | " A classifier based on a series of binary decisions. This is another very fast classifier, which can be very powerful.\n", 246 | "\n", 247 | "Choose one of the above, import it, and use the ``?`` feature to learn about it." 248 | ] 249 | }, 250 | { 251 | "cell_type": "code", 252 | "execution_count": null, 253 | "metadata": { 254 | "collapsed": true 255 | }, 256 | "outputs": [], 257 | "source": [] 258 | }, 259 | { 260 | "cell_type": "markdown", 261 | "metadata": {}, 262 | "source": [ 263 | "Now instantiate this model as we did with ``LinearSVC`` above. Call it ``clf``." 264 | ] 265 | }, 266 | { 267 | "cell_type": "code", 268 | "execution_count": null, 269 | "metadata": { 270 | "collapsed": true 271 | }, 272 | "outputs": [], 273 | "source": [] 274 | }, 275 | { 276 | "cell_type": "markdown", 277 | "metadata": {}, 278 | "source": [ 279 | "Now use our data ``X`` and ``y`` to train the model, using ``clf2.fit(X, y)``" 280 | ] 281 | }, 282 | { 283 | "cell_type": "code", 284 | "execution_count": null, 285 | "metadata": { 286 | "collapsed": true 287 | }, 288 | "outputs": [], 289 | "source": [] 290 | }, 291 | { 292 | "cell_type": "markdown", 293 | "metadata": {}, 294 | "source": [ 295 | "Now call the ``predict`` method, and find the classification of ``X_new``." 296 | ] 297 | }, 298 | { 299 | "cell_type": "code", 300 | "execution_count": null, 301 | "metadata": { 302 | "collapsed": true 303 | }, 304 | "outputs": [], 305 | "source": [] 306 | }, 307 | { 308 | "cell_type": "markdown", 309 | "metadata": {}, 310 | "source": [ 311 | "Now use the code snippet in `Cell 16` to plot the corresponding graph" 312 | ] 313 | }, 314 | { 315 | "cell_type": "code", 316 | "execution_count": null, 317 | "metadata": { 318 | "collapsed": true 319 | }, 320 | "outputs": [], 321 | "source": [] 322 | }, 323 | { 324 | "cell_type": "markdown", 325 | "metadata": {}, 326 | "source": [ 327 | "On the Iris Dataset\n", 328 | "=========\n", 329 | "**Exercise** Apply the KNeighborsClassifier to the ``iris`` dataset. 
Play with different values of the ``n_neighbors`` and observe how training and test score change.\n", 330 | "\n", 331 | "Note: If you finish early, you can try applying a different estimator: `sklearn.svm.SVC`" 332 | ] 333 | }, 334 | { 335 | "cell_type": "code", 336 | "execution_count": null, 337 | "metadata": { 338 | "collapsed": false 339 | }, 340 | "outputs": [], 341 | "source": [ 342 | "%load scripts/knn_iris.py" 343 | ] 344 | }, 345 | { 346 | "cell_type": "code", 347 | "execution_count": null, 348 | "metadata": { 349 | "collapsed": false 350 | }, 351 | "outputs": [], 352 | "source": [ 353 | "plot_iris_knn()" 354 | ] 355 | }, 356 | { 357 | "cell_type": "markdown", 358 | "metadata": {}, 359 | "source": [ 360 | "## Support Vector Machines" 361 | ] 362 | }, 363 | { 364 | "cell_type": "markdown", 365 | "metadata": {}, 366 | "source": [ 367 | "Another powerful and highly effective method can be used for both Classification and Regression.\n", 368 | "\n", 369 | "SVMs are a **discriminative** classifier: that is, they draw a boundary between clusters of data." 370 | ] 371 | }, 372 | { 373 | "cell_type": "code", 374 | "execution_count": null, 375 | "metadata": { 376 | "collapsed": false 377 | }, 378 | "outputs": [], 379 | "source": [ 380 | "from sklearn.datasets.samples_generator import make_blobs\n", 381 | "X, y = make_blobs(n_samples=50, centers=2,\n", 382 | " random_state=0, cluster_std=0.60)\n", 383 | "plt.scatter(X[:, 0], X[:, 1], c=y, s=50, cmap='spring');" 384 | ] 385 | }, 386 | { 387 | "cell_type": "markdown", 388 | "metadata": {}, 389 | "source": [ 390 | "There can be many seperators for the dataset above. How the find the best one?" 391 | ] 392 | }, 393 | { 394 | "cell_type": "code", 395 | "execution_count": null, 396 | "metadata": { 397 | "collapsed": false 398 | }, 399 | "outputs": [], 400 | "source": [ 401 | "xfit = np.linspace(-1, 3.5)\n", 402 | "plt.scatter(X[:, 0], X[:, 1], c=y, s=50, cmap='spring')\n", 403 | "\n", 404 | "for m, b, d in [(1, 0.65, 0.33), (0.5, 1.6, 0.55), (-0.2, 2.9, 0.2)]:\n", 405 | " yfit = m * xfit + b\n", 406 | " plt.plot(xfit, yfit, '-k')\n", 407 | " plt.fill_between(xfit, yfit - d, yfit + d, edgecolor='none', color='#AAAAAA', alpha=0.4)\n", 408 | "\n", 409 | "plt.xlim(-1, 3.5);" 410 | ] 411 | }, 412 | { 413 | "cell_type": "markdown", 414 | "metadata": {}, 415 | "source": [ 416 | "### SVM's to the rescue :)" 417 | ] 418 | }, 419 | { 420 | "cell_type": "code", 421 | "execution_count": null, 422 | "metadata": { 423 | "collapsed": false 424 | }, 425 | "outputs": [], 426 | "source": [ 427 | "from sklearn.svm import SVC # \"Support Vector Classifier\"\n", 428 | "clf = SVC(kernel='linear')\n", 429 | "clf.fit(X, y)" 430 | ] 431 | }, 432 | { 433 | "cell_type": "code", 434 | "execution_count": null, 435 | "metadata": { 436 | "collapsed": true 437 | }, 438 | "outputs": [], 439 | "source": [ 440 | "def plot_svc_decision_function(clf, ax=None):\n", 441 | " \"\"\"Plot the decision function for a 2D SVC\"\"\"\n", 442 | " if ax is None:\n", 443 | " ax = plt.gca()\n", 444 | " x = np.linspace(plt.xlim()[0], plt.xlim()[1], 30)\n", 445 | " y = np.linspace(plt.ylim()[0], plt.ylim()[1], 30)\n", 446 | " Y, X = np.meshgrid(y, x)\n", 447 | " P = np.zeros_like(X)\n", 448 | " for i, xi in enumerate(x):\n", 449 | " for j, yj in enumerate(y):\n", 450 | " P[i, j] = clf.decision_function([[xi, yj]])\n", 451 | " # plot the margins\n", 452 | " ax.contour(X, Y, P, colors='k',\n", 453 | " levels=[-1, 0, 1], alpha=0.5,\n", 454 | " linestyles=['--', '-', '--'])" 455 | ] 456 | }, 457 | { 458 
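Further down, this notebook asks what happens when the data is not linearly separable (the `make_circles` example) and leaves the answer cells empty. Under the assumption that a kernel change is the intended answer, a minimal sketch:

```python
# Hedged sketch: switching SVC from the linear kernel to the RBF kernel so
# the decision boundary can bend around the inner circle of make_circles.
from sklearn.datasets import make_circles
from sklearn.svm import SVC

X_c, y_c = make_circles(100, factor=.1, noise=.1)

clf_rbf = SVC(kernel='rbf')        # non-linear kernel; kernel='linear' fails here
clf_rbf.fit(X_c, y_c)
print(clf_rbf.score(X_c, y_c))     # near-perfect accuracy on this toy data
```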
| "cell_type": "code", 459 | "execution_count": null, 460 | "metadata": { 461 | "collapsed": false 462 | }, 463 | "outputs": [], 464 | "source": [ 465 | "plt.scatter(X[:, 0], X[:, 1], c=y, s=50, cmap='spring')\n", 466 | "plot_svc_decision_function(clf);" 467 | ] 468 | }, 469 | { 470 | "cell_type": "code", 471 | "execution_count": null, 472 | "metadata": { 473 | "collapsed": false 474 | }, 475 | "outputs": [], 476 | "source": [ 477 | "plt.scatter(X[:, 0], X[:, 1], c=y, s=50, cmap='spring')\n", 478 | "plot_svc_decision_function(clf)\n", 479 | "plt.scatter(clf.support_vectors_[:, 0], clf.support_vectors_[:, 1],\n", 480 | " s=200, facecolors='none');" 481 | ] 482 | }, 483 | { 484 | "cell_type": "code", 485 | "execution_count": null, 486 | "metadata": { 487 | "collapsed": false 488 | }, 489 | "outputs": [], 490 | "source": [ 491 | "from IPython.html.widgets import interact\n", 492 | "\n", 493 | "def plot_svm(N=10):\n", 494 | " X, y = make_blobs(n_samples=200, centers=2,\n", 495 | " random_state=0, cluster_std=0.60)\n", 496 | " X = X[:N]\n", 497 | " y = y[:N]\n", 498 | " clf = SVC(kernel='linear')\n", 499 | " clf.fit(X, y)\n", 500 | " plt.scatter(X[:, 0], X[:, 1], c=y, s=50, cmap='spring')\n", 501 | " plt.xlim(-1, 4)\n", 502 | " plt.ylim(-1, 6)\n", 503 | " plot_svc_decision_function(clf, plt.gca())\n", 504 | " plt.scatter(clf.support_vectors_[:, 0], clf.support_vectors_[:, 1],\n", 505 | " s=200, facecolors='none')\n", 506 | " \n", 507 | "interact(plot_svm, N=[10, 200], kernel='linear');" 508 | ] 509 | }, 510 | { 511 | "cell_type": "markdown", 512 | "metadata": {}, 513 | "source": [ 514 | "#### What if the data is not linear?" 515 | ] 516 | }, 517 | { 518 | "cell_type": "code", 519 | "execution_count": null, 520 | "metadata": { 521 | "collapsed": false 522 | }, 523 | "outputs": [], 524 | "source": [ 525 | "from sklearn.datasets.samples_generator import make_circles\n", 526 | "X, y = make_circles(100, factor=.1, noise=.1)\n", 527 | "\n", 528 | "clf = SVC(kernel='linear').fit(X, y)\n", 529 | "\n", 530 | "plt.scatter(X[:, 0], X[:, 1], c=y, s=50, cmap='spring')\n", 531 | "plot_svc_decision_function(clf);" 532 | ] 533 | }, 534 | { 535 | "cell_type": "code", 536 | "execution_count": null, 537 | "metadata": { 538 | "collapsed": true 539 | }, 540 | "outputs": [], 541 | "source": [] 542 | }, 543 | { 544 | "cell_type": "code", 545 | "execution_count": null, 546 | "metadata": { 547 | "collapsed": true 548 | }, 549 | "outputs": [], 550 | "source": [] 551 | } 552 | ], 553 | "metadata": { 554 | "kernelspec": { 555 | "display_name": "Python [default]", 556 | "language": "python", 557 | "name": "python2" 558 | }, 559 | "language_info": { 560 | "codemirror_mode": { 561 | "name": "ipython", 562 | "version": 2 563 | }, 564 | "file_extension": ".py", 565 | "mimetype": "text/x-python", 566 | "name": "python", 567 | "nbconvert_exporter": "python", 568 | "pygments_lexer": "ipython2", 569 | "version": "2.7.12" 570 | }, 571 | "widgets": { 572 | "state": { 573 | "59dece73c784433691328d95d18a2e4a": { 574 | "views": [ 575 | { 576 | "cell_index": 44 577 | } 578 | ] 579 | } 580 | }, 581 | "version": "1.2.0" 582 | } 583 | }, 584 | "nbformat": 4, 585 | "nbformat_minor": 0 586 | } 587 | -------------------------------------------------------------------------------- /2.5 Review of Scikit-learn API.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": { 6 | "collapsed": true 7 | }, 8 | "source": [ 9 | "\n", 10 | "### A recap 
on Scikit-learn's estimator interface\n", 11 | "Scikit-learn strives to have a uniform interface across all methods. Given a scikit-learn *estimator*\n", 12 | "object named `model`, the following methods are available (not all for each model):\n", 13 | "\n", 14 | "- Available in **all Estimators**\n", 15 | " + `model.fit()` : fit training data. For supervised learning applications,\n", 16 | " this accepts two arguments: the data `X` and the labels `y` (e.g. `model.fit(X, y)`).\n", 17 | " For unsupervised learning applications, ``fit`` takes only a single argument,\n", 18 | " the data `X` (e.g. `model.fit(X)`).\n", 19 | "- Available in **supervised estimators**\n", 20 | " + `model.predict()` : given a trained model, predict the label of a new set of data.\n", 21 | " This method accepts one argument, the new data `X_new` (e.g. `model.predict(X_new)`),\n", 22 | " and returns the learned label for each object in the array.\n", 23 | " + `model.predict_proba()` : For classification problems, some estimators also provide\n", 24 | " this method, which returns the probability that a new observation has each categorical label.\n", 25 | " In this case, the label with the highest probability is returned by `model.predict()`.\n", 26 | " + `model.decision_function()` : For classification problems, some estimators provide an uncertainty estimate that is not a probability. For binary classification, a decision_function >= 0 means the positive class will be predicted, while < 0 means the negative class.\n", 27 | " + `model.score()` : for classification or regression problems, most (all?) estimators implement\n", 28 | " a score method. Scores are between 0 and 1, with a larger score indicating a better fit.\n", 29 | " + `model.transform()` : For feature selection algorithms, this will reduce the dataset to the selected features. For some classification and regression models such as some linear models and random forests, this method reduces the dataset to the most informative features. These classification and regression models can therefor also be used as feature selection methods.\n", 30 | " \n", 31 | "- Available in **unsupervised estimators**\n", 32 | " + `model.transform()` : given an unsupervised model, transform new data into the new basis.\n", 33 | " This also accepts one argument `X_new`, and returns the new representation of the data based\n", 34 | " on the unsupervised model.\n", 35 | " + `model.fit_transform()` : some estimators implement this method,\n", 36 | " which more efficiently performs a fit and a transform on the same input data.\n", 37 | " + `model.predict()` : for clustering algorithms, the predict method will produce cluster labels for new data points. Not all clustering methods have this functionality.\n", 38 | " + `model.predict_proba()` : Gaussian mixture models (GMMs) provide the probability for each point to be generated by a given mixture component.\n", 39 | " + `model.score()` : Density models like KDE and GMMs provide the likelihood of the data under the model." 
40 | ] 41 | }, 42 | { 43 | "cell_type": "markdown", 44 | "metadata": { 45 | "collapsed": true 46 | }, 47 | "source": [ 48 | "Apart from ``fit``, the two most important functions are arguably ``predict`` to produce a target variable (a ``y``) ``transform``, which produces a new representation of the data (an ``X``).\n", 49 | "The following table shows for which class of models which function applies:\n", 50 | "\n" 51 | ] 52 | }, 53 | { 54 | "cell_type": "markdown", 55 | "metadata": {}, 56 | "source": [ 57 | "\n", 58 | "\n", 59 | "\n", 60 | "\n", 61 | "\n", 62 | "\n", 63 | "\n", 64 | "
``model.predict``  |  ``model.transform``
Classification     |  Preprocessing
Regression         |  Dimensionality Reduction
Clustering         |  Feature Extraction
                   |  Feature selection
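A short illustrative sketch of the table above, not taken from the notebook itself: supervised models end in `predict()`, preprocessing models end in `transform()`, and both share `fit()`:

```python
# Supervised estimator: fit(X, y) then predict()/score().
# Unsupervised preprocessing estimator: fit(X) then transform().
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

iris = load_iris()
X, y = iris.data, iris.target

clf = LogisticRegression().fit(X, y)   # supervised: needs X and y
print(clf.predict(X[:3]))              # -> predicted class labels
print(clf.score(X, y))                 # -> mean accuracy

scaler = StandardScaler().fit(X)       # unsupervised: only X
print(scaler.transform(X[:3]))         # -> new representation of the data
```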
\n", 65 | "\n", 66 | "\n" 67 | ] 68 | } 69 | ], 70 | "metadata": { 71 | "kernelspec": { 72 | "display_name": "Python 2", 73 | "language": "python", 74 | "name": "python2" 75 | }, 76 | "language_info": { 77 | "codemirror_mode": { 78 | "name": "ipython", 79 | "version": 2 80 | }, 81 | "file_extension": ".py", 82 | "mimetype": "text/x-python", 83 | "name": "python", 84 | "nbconvert_exporter": "python", 85 | "pygments_lexer": "ipython2", 86 | "version": "2.7.9" 87 | } 88 | }, 89 | "nbformat": 4, 90 | "nbformat_minor": 0 91 | } 92 | -------------------------------------------------------------------------------- /4.1 Example - Supervised Spam Classification.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Spam Classification" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "## Some background information for text processing" 15 | ] 16 | }, 17 | { 18 | "cell_type": "code", 19 | "execution_count": 1, 20 | "metadata": { 21 | "collapsed": true 22 | }, 23 | "outputs": [], 24 | "source": [ 25 | "%matplotlib inline\n", 26 | "import matplotlib.pyplot as plt\n", 27 | "import numpy as np" 28 | ] 29 | }, 30 | { 31 | "cell_type": "markdown", 32 | "metadata": {}, 33 | "source": [ 34 | "### Bag of words? Bag of whaa...?" 35 | ] 36 | }, 37 | { 38 | "cell_type": "markdown", 39 | "metadata": {}, 40 | "source": [ 41 | "\n" 42 | ] 43 | }, 44 | { 45 | "cell_type": "code", 46 | "execution_count": 2, 47 | "metadata": { 48 | "collapsed": true 49 | }, 50 | "outputs": [], 51 | "source": [ 52 | "X = [\"It was a bright cold day in April, and the clocks were striking thirteen\",\n", 53 | " \"The sky above the port was the color of television, tuned to a dead channel\"]" 54 | ] 55 | }, 56 | { 57 | "cell_type": "code", 58 | "execution_count": 4, 59 | "metadata": { 60 | "collapsed": false 61 | }, 62 | "outputs": [ 63 | { 64 | "data": { 65 | "text/plain": [ 66 | "2" 67 | ] 68 | }, 69 | "execution_count": 4, 70 | "metadata": {}, 71 | "output_type": "execute_result" 72 | } 73 | ], 74 | "source": [ 75 | "len(X)" 76 | ] 77 | }, 78 | { 79 | "cell_type": "code", 80 | "execution_count": 5, 81 | "metadata": { 82 | "collapsed": false 83 | }, 84 | "outputs": [ 85 | { 86 | "data": { 87 | "text/plain": [ 88 | "CountVectorizer(analyzer=u'word', binary=False, decode_error=u'strict',\n", 89 | " dtype=, encoding=u'utf-8', input=u'content',\n", 90 | " lowercase=True, max_df=1.0, max_features=None, min_df=1,\n", 91 | " ngram_range=(1, 1), preprocessor=None, stop_words=None,\n", 92 | " strip_accents=None, token_pattern=u'(?u)\\\\b\\\\w\\\\w+\\\\b',\n", 93 | " tokenizer=None, vocabulary=None)" 94 | ] 95 | }, 96 | "execution_count": 5, 97 | "metadata": {}, 98 | "output_type": "execute_result" 99 | } 100 | ], 101 | "source": [ 102 | "from sklearn.feature_extraction.text import CountVectorizer\n", 103 | "\n", 104 | "vectorizer = CountVectorizer()\n", 105 | "vectorizer.fit(X)\n" 106 | ] 107 | }, 108 | { 109 | "cell_type": "code", 110 | "execution_count": 7, 111 | "metadata": { 112 | "collapsed": false 113 | }, 114 | "outputs": [ 115 | { 116 | "data": { 117 | "text/plain": [ 118 | "{u'above': 0,\n", 119 | " u'and': 1,\n", 120 | " u'april': 2,\n", 121 | " u'bright': 3,\n", 122 | " u'channel': 4,\n", 123 | " u'clocks': 5,\n", 124 | " u'cold': 6,\n", 125 | " u'color': 7,\n", 126 | " u'day': 8,\n", 127 | " u'dead': 9,\n", 128 | " u'in': 10,\n", 129 | " u'it': 11,\n", 130 | " 
u'of': 12,\n", 131 | " u'port': 13,\n", 132 | " u'sky': 14,\n", 133 | " u'striking': 15,\n", 134 | " u'television': 16,\n", 135 | " u'the': 17,\n", 136 | " u'thirteen': 18,\n", 137 | " u'to': 19,\n", 138 | " u'tuned': 20,\n", 139 | " u'was': 21,\n", 140 | " u'were': 22}" 141 | ] 142 | }, 143 | "execution_count": 7, 144 | "metadata": {}, 145 | "output_type": "execute_result" 146 | } 147 | ], 148 | "source": [ 149 | "vectorizer.vocabulary_" 150 | ] 151 | }, 152 | { 153 | "cell_type": "code", 154 | "execution_count": 8, 155 | "metadata": { 156 | "collapsed": true 157 | }, 158 | "outputs": [], 159 | "source": [ 160 | "X_bag_of_words = vectorizer.transform(X)" 161 | ] 162 | }, 163 | { 164 | "cell_type": "code", 165 | "execution_count": 9, 166 | "metadata": { 167 | "collapsed": false 168 | }, 169 | "outputs": [ 170 | { 171 | "data": { 172 | "text/plain": [ 173 | "<2x23 sparse matrix of type ''\n", 174 | "\twith 25 stored elements in Compressed Sparse Row format>" 175 | ] 176 | }, 177 | "execution_count": 9, 178 | "metadata": {}, 179 | "output_type": "execute_result" 180 | } 181 | ], 182 | "source": [ 183 | "X_bag_of_words" 184 | ] 185 | }, 186 | { 187 | "cell_type": "code", 188 | "execution_count": 10, 189 | "metadata": { 190 | "collapsed": false 191 | }, 192 | "outputs": [ 193 | { 194 | "data": { 195 | "text/plain": [ 196 | "(2, 23)" 197 | ] 198 | }, 199 | "execution_count": 10, 200 | "metadata": {}, 201 | "output_type": "execute_result" 202 | } 203 | ], 204 | "source": [ 205 | "X_bag_of_words.shape" 206 | ] 207 | }, 208 | { 209 | "cell_type": "code", 210 | "execution_count": 11, 211 | "metadata": { 212 | "collapsed": false 213 | }, 214 | "outputs": [ 215 | { 216 | "data": { 217 | "text/plain": [ 218 | "array([[0, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1],\n", 219 | " [1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 0, 1, 3, 0, 1, 1, 1, 0]])" 220 | ] 221 | }, 222 | "execution_count": 11, 223 | "metadata": {}, 224 | "output_type": "execute_result" 225 | } 226 | ], 227 | "source": [ 228 | "X_bag_of_words.toarray()" 229 | ] 230 | }, 231 | { 232 | "cell_type": "code", 233 | "execution_count": 12, 234 | "metadata": { 235 | "collapsed": false 236 | }, 237 | "outputs": [ 238 | { 239 | "data": { 240 | "text/plain": [ 241 | "[u'above',\n", 242 | " u'and',\n", 243 | " u'april',\n", 244 | " u'bright',\n", 245 | " u'channel',\n", 246 | " u'clocks',\n", 247 | " u'cold',\n", 248 | " u'color',\n", 249 | " u'day',\n", 250 | " u'dead',\n", 251 | " u'in',\n", 252 | " u'it',\n", 253 | " u'of',\n", 254 | " u'port',\n", 255 | " u'sky',\n", 256 | " u'striking',\n", 257 | " u'television',\n", 258 | " u'the',\n", 259 | " u'thirteen',\n", 260 | " u'to',\n", 261 | " u'tuned',\n", 262 | " u'was',\n", 263 | " u'were']" 264 | ] 265 | }, 266 | "execution_count": 12, 267 | "metadata": {}, 268 | "output_type": "execute_result" 269 | } 270 | ], 271 | "source": [ 272 | "vectorizer.get_feature_names()" 273 | ] 274 | }, 275 | { 276 | "cell_type": "code", 277 | "execution_count": 13, 278 | "metadata": { 279 | "collapsed": false 280 | }, 281 | "outputs": [ 282 | { 283 | "data": { 284 | "text/plain": [ 285 | "[array([u'and', u'april', u'bright', u'clocks', u'cold', u'day', u'in',\n", 286 | " u'it', u'striking', u'the', u'thirteen', u'was', u'were'], \n", 287 | " dtype=', encoding=u'utf-8', input=u'content',\n", 340 | " lowercase=True, max_df=1.0, max_features=None, min_df=1,\n", 341 | " ngram_range=(1, 1), norm=u'l2', preprocessor=None, smooth_idf=True,\n", 342 | " stop_words=None, strip_accents=None, 
sublinear_tf=False,\n", 343 | " token_pattern=u'(?u)\\\\b\\\\w\\\\w+\\\\b', tokenizer=None, use_idf=True,\n", 344 | " vocabulary=None)" 345 | ] 346 | }, 347 | "execution_count": 16, 348 | "metadata": {}, 349 | "output_type": "execute_result" 350 | } 351 | ], 352 | "source": [ 353 | "from sklearn.feature_extraction.text import TfidfVectorizer\n", 354 | "\n", 355 | "tfidf_vectorizer = TfidfVectorizer()\n", 356 | "tfidf_vectorizer.fit(X)" 357 | ] 358 | }, 359 | { 360 | "cell_type": "code", 361 | "execution_count": 15, 362 | "metadata": { 363 | "collapsed": false 364 | }, 365 | "outputs": [ 366 | { 367 | "name": "stdout", 368 | "output_type": "stream", 369 | "text": [ 370 | "[[ 0. 0.29 0.29 0.29 0. 0.29 0.29 0. 0.29 0. 0.29 0.29\n", 371 | " 0. 0. 0. 0.29 0. 0.21 0.29 0. 0. 0.21 0.29]\n", 372 | " [ 0.26 0. 0. 0. 0.26 0. 0. 0.26 0. 0.26 0. 0.\n", 373 | " 0.26 0.26 0.26 0. 0.26 0.55 0. 0.26 0.26 0.18 0. ]]\n" 374 | ] 375 | } 376 | ], 377 | "source": [ 378 | "import numpy as np\n", 379 | "np.set_printoptions(precision=2)\n", 380 | "\n", 381 | "print(tfidf_vectorizer.transform(X).toarray())" 382 | ] 383 | }, 384 | { 385 | "cell_type": "markdown", 386 | "metadata": {}, 387 | "source": [ 388 | "### Bigrams and N-Grams\n", 389 | "Entirely discarding word order is not always a good idea, as composite phrases often have specific meaning, and modifiers like \"not\" can invert the meaning of words.\n", 390 | "A simple way to include some word order are n-grams, which don't only look at a single token, but at all pairs of neighborhing tokens:" 391 | ] 392 | }, 393 | { 394 | "cell_type": "code", 395 | "execution_count": 17, 396 | "metadata": { 397 | "collapsed": false 398 | }, 399 | "outputs": [ 400 | { 401 | "data": { 402 | "text/plain": [ 403 | "CountVectorizer(analyzer=u'word', binary=False, decode_error=u'strict',\n", 404 | " dtype=, encoding=u'utf-8', input=u'content',\n", 405 | " lowercase=True, max_df=1.0, max_features=None, min_df=1,\n", 406 | " ngram_range=(2, 2), preprocessor=None, stop_words=None,\n", 407 | " strip_accents=None, token_pattern=u'(?u)\\\\b\\\\w\\\\w+\\\\b',\n", 408 | " tokenizer=None, vocabulary=None)" 409 | ] 410 | }, 411 | "execution_count": 17, 412 | "metadata": {}, 413 | "output_type": "execute_result" 414 | } 415 | ], 416 | "source": [ 417 | "# look at sequences of tokens of minimum length 2 and maximum length 2\n", 418 | "bigram_vectorizer = CountVectorizer(ngram_range=(2, 2))\n", 419 | "bigram_vectorizer.fit(X)" 420 | ] 421 | }, 422 | { 423 | "cell_type": "code", 424 | "execution_count": 18, 425 | "metadata": { 426 | "collapsed": false 427 | }, 428 | "outputs": [ 429 | { 430 | "data": { 431 | "text/plain": [ 432 | "[u'above the',\n", 433 | " u'and the',\n", 434 | " u'april and',\n", 435 | " u'bright cold',\n", 436 | " u'clocks were',\n", 437 | " u'cold day',\n", 438 | " u'color of',\n", 439 | " u'day in',\n", 440 | " u'dead channel',\n", 441 | " u'in april',\n", 442 | " u'it was',\n", 443 | " u'of television',\n", 444 | " u'port was',\n", 445 | " u'sky above',\n", 446 | " u'striking thirteen',\n", 447 | " u'television tuned',\n", 448 | " u'the clocks',\n", 449 | " u'the color',\n", 450 | " u'the port',\n", 451 | " u'the sky',\n", 452 | " u'to dead',\n", 453 | " u'tuned to',\n", 454 | " u'was bright',\n", 455 | " u'was the',\n", 456 | " u'were striking']" 457 | ] 458 | }, 459 | "execution_count": 18, 460 | "metadata": {}, 461 | "output_type": "execute_result" 462 | } 463 | ], 464 | "source": [ 465 | "bigram_vectorizer.get_feature_names()" 466 | ] 467 | }, 468 | { 469 | 
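To consolidate the vectorizer variants shown in these cells, a small sketch (an extra illustration, not a notebook cell) comparing how the vocabulary grows with different n-gram settings on the same two-sentence corpus `X`, including the character n-grams discussed a little further down:

```python
# Vocabulary size for different n-gram configurations of CountVectorizer,
# fit on the toy two-sentence corpus X from the cells above.
from sklearn.feature_extraction.text import CountVectorizer

variants = [
    ("unigrams", CountVectorizer()),
    ("bigrams", CountVectorizer(ngram_range=(2, 2))),
    ("uni+bigrams", CountVectorizer(ngram_range=(1, 2))),
    ("char bigrams", CountVectorizer(ngram_range=(2, 2), analyzer="char")),
]
for name, vect in variants:
    vect.fit(X)
    print(name, len(vect.vocabulary_))
```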
"cell_type": "code", 470 | "execution_count": 19, 471 | "metadata": { 472 | "collapsed": false 473 | }, 474 | "outputs": [ 475 | { 476 | "data": { 477 | "text/plain": [ 478 | "array([[0, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0,\n", 479 | " 1, 0, 1],\n", 480 | " [1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1,\n", 481 | " 0, 1, 0]])" 482 | ] 483 | }, 484 | "execution_count": 19, 485 | "metadata": {}, 486 | "output_type": "execute_result" 487 | } 488 | ], 489 | "source": [ 490 | "bigram_vectorizer.transform(X).toarray()" 491 | ] 492 | }, 493 | { 494 | "cell_type": "code", 495 | "execution_count": 20, 496 | "metadata": { 497 | "collapsed": false 498 | }, 499 | "outputs": [ 500 | { 501 | "data": { 502 | "text/plain": [ 503 | "CountVectorizer(analyzer=u'word', binary=False, decode_error=u'strict',\n", 504 | " dtype=, encoding=u'utf-8', input=u'content',\n", 505 | " lowercase=True, max_df=1.0, max_features=None, min_df=1,\n", 506 | " ngram_range=(1, 2), preprocessor=None, stop_words=None,\n", 507 | " strip_accents=None, token_pattern=u'(?u)\\\\b\\\\w\\\\w+\\\\b',\n", 508 | " tokenizer=None, vocabulary=None)" 509 | ] 510 | }, 511 | "execution_count": 20, 512 | "metadata": {}, 513 | "output_type": "execute_result" 514 | } 515 | ], 516 | "source": [ 517 | "gram_vectorizer = CountVectorizer(ngram_range=(1, 2))\n", 518 | "gram_vectorizer.fit(X)" 519 | ] 520 | }, 521 | { 522 | "cell_type": "code", 523 | "execution_count": 21, 524 | "metadata": { 525 | "collapsed": false 526 | }, 527 | "outputs": [ 528 | { 529 | "data": { 530 | "text/plain": [ 531 | "[u'above',\n", 532 | " u'above the',\n", 533 | " u'and',\n", 534 | " u'and the',\n", 535 | " u'april',\n", 536 | " u'april and',\n", 537 | " u'bright',\n", 538 | " u'bright cold',\n", 539 | " u'channel',\n", 540 | " u'clocks',\n", 541 | " u'clocks were',\n", 542 | " u'cold',\n", 543 | " u'cold day',\n", 544 | " u'color',\n", 545 | " u'color of',\n", 546 | " u'day',\n", 547 | " u'day in',\n", 548 | " u'dead',\n", 549 | " u'dead channel',\n", 550 | " u'in',\n", 551 | " u'in april',\n", 552 | " u'it',\n", 553 | " u'it was',\n", 554 | " u'of',\n", 555 | " u'of television',\n", 556 | " u'port',\n", 557 | " u'port was',\n", 558 | " u'sky',\n", 559 | " u'sky above',\n", 560 | " u'striking',\n", 561 | " u'striking thirteen',\n", 562 | " u'television',\n", 563 | " u'television tuned',\n", 564 | " u'the',\n", 565 | " u'the clocks',\n", 566 | " u'the color',\n", 567 | " u'the port',\n", 568 | " u'the sky',\n", 569 | " u'thirteen',\n", 570 | " u'to',\n", 571 | " u'to dead',\n", 572 | " u'tuned',\n", 573 | " u'tuned to',\n", 574 | " u'was',\n", 575 | " u'was bright',\n", 576 | " u'was the',\n", 577 | " u'were',\n", 578 | " u'were striking']" 579 | ] 580 | }, 581 | "execution_count": 21, 582 | "metadata": {}, 583 | "output_type": "execute_result" 584 | } 585 | ], 586 | "source": [ 587 | "gram_vectorizer.get_feature_names()" 588 | ] 589 | }, 590 | { 591 | "cell_type": "code", 592 | "execution_count": 22, 593 | "metadata": { 594 | "collapsed": false 595 | }, 596 | "outputs": [ 597 | { 598 | "data": { 599 | "text/plain": [ 600 | "array([[0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 1,\n", 601 | " 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1,\n", 602 | " 1, 0, 1, 1],\n", 603 | " [1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0,\n", 604 | " 0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 3, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1,\n", 605 | " 0, 1, 0, 0]])" 606 | ] 607 | }, 608 | "execution_count": 22, 609 
| "metadata": {}, 610 | "output_type": "execute_result" 611 | } 612 | ], 613 | "source": [ 614 | "gram_vectorizer.transform(X).toarray()" 615 | ] 616 | }, 617 | { 618 | "cell_type": "markdown", 619 | "metadata": {}, 620 | "source": [ 621 | "### Character N-grams" 622 | ] 623 | }, 624 | { 625 | "cell_type": "markdown", 626 | "metadata": {}, 627 | "source": [ 628 | "Sometimes it is also helpful to not look at words, but instead single character.\n", 629 | "That is particularly useful if you have very noisy data, want to identify the language, or we want to predict something about a single word.\n", 630 | "We can simply look at characters instead of words by setting ``analyzer=\"char\"``.\n", 631 | "Looking at single characters is usually not very informative, but looking at longer n-grams of characters can be:" 632 | ] 633 | }, 634 | { 635 | "cell_type": "code", 636 | "execution_count": 24, 637 | "metadata": { 638 | "collapsed": false 639 | }, 640 | "outputs": [ 641 | { 642 | "data": { 643 | "text/plain": [ 644 | "CountVectorizer(analyzer='char', binary=False, decode_error=u'strict',\n", 645 | " dtype=, encoding=u'utf-8', input=u'content',\n", 646 | " lowercase=True, max_df=1.0, max_features=None, min_df=1,\n", 647 | " ngram_range=(2, 2), preprocessor=None, stop_words=None,\n", 648 | " strip_accents=None, token_pattern=u'(?u)\\\\b\\\\w\\\\w+\\\\b',\n", 649 | " tokenizer=None, vocabulary=None)" 650 | ] 651 | }, 652 | "execution_count": 24, 653 | "metadata": {}, 654 | "output_type": "execute_result" 655 | } 656 | ], 657 | "source": [ 658 | "char_vectorizer = CountVectorizer(ngram_range=(2, 2), analyzer=\"char\")\n", 659 | "char_vectorizer.fit(X)" 660 | ] 661 | }, 662 | { 663 | "cell_type": "code", 664 | "execution_count": 25, 665 | "metadata": { 666 | "collapsed": false 667 | }, 668 | "outputs": [ 669 | { 670 | "name": "stdout", 671 | "output_type": "stream", 672 | "text": [ 673 | "[u' a', u' b', u' c', u' d', u' i', u' o', u' p', u' s', u' t', u' w', u', ', u'a ', u'ab', u'ad', u'an', u'ap', u'as', u'ay', u'bo', u'br', u'ch', u'ck', u'cl', u'co', u'd ', u'da', u'de', u'e ', u'ea', u'ed', u'ee', u'el', u'en', u'er', u'ev', u'f ', u'g ', u'gh', u'ha', u'he', u'hi', u'ht', u'ig', u'ik', u'il', u'in', u'io', u'ir', u'is', u'it', u'ki', u'ks', u'ky', u'l,', u'ld', u'le', u'lo', u'n ', u'n,', u'nd', u'ne', u'ng', u'nn', u'o ', u'oc', u'of', u'ol', u'on', u'or', u'ov', u'po', u'pr', u'r ', u're', u'ri', u'rt', u's ', u'si', u'sk', u'st', u't ', u'te', u'th', u'to', u'tr', u'tu', u'un', u've', u'vi', u'wa', u'we', u'y ']\n" 674 | ] 675 | } 676 | ], 677 | "source": [ 678 | "print(char_vectorizer.get_feature_names())" 679 | ] 680 | }, 681 | { 682 | "cell_type": "markdown", 683 | "metadata": {}, 684 | "source": [ 685 | "## Moving on to the problem at hand" 686 | ] 687 | }, 688 | { 689 | "cell_type": "code", 690 | "execution_count": 26, 691 | "metadata": { 692 | "collapsed": true 693 | }, 694 | "outputs": [], 695 | "source": [ 696 | "import os\n", 697 | "with open(os.path.join(\"data\",\"SMSSpamCollection\")) as f:\n", 698 | " lines = [line.strip().split(\"\\t\") for line in f.readlines()]\n", 699 | "text = [x[1] for x in lines]\n", 700 | "y = [x[0] == \"ham\" for x in lines]" 701 | ] 702 | }, 703 | { 704 | "cell_type": "code", 705 | "execution_count": 28, 706 | "metadata": { 707 | "collapsed": false 708 | }, 709 | "outputs": [], 710 | "source": [ 711 | "from sklearn.cross_validation import train_test_split\n", 712 | "\n", 713 | "text_train, text_test, y_train, y_test = train_test_split(text, y, 
random_state=42)" 714 | ] 715 | }, 716 | { 717 | "cell_type": "code", 718 | "execution_count": 29, 719 | "metadata": { 720 | "collapsed": true 721 | }, 722 | "outputs": [], 723 | "source": [ 724 | "from sklearn.feature_extraction.text import CountVectorizer\n", 725 | "\n", 726 | "vectorizer = CountVectorizer()\n", 727 | "vectorizer.fit(text_train)\n", 728 | "\n", 729 | "X_train = vectorizer.transform(text_train)\n", 730 | "X_test = vectorizer.transform(text_test)" 731 | ] 732 | }, 733 | { 734 | "cell_type": "code", 735 | "execution_count": 30, 736 | "metadata": { 737 | "collapsed": false 738 | }, 739 | "outputs": [ 740 | { 741 | "name": "stdout", 742 | "output_type": "stream", 743 | "text": [ 744 | "7464\n" 745 | ] 746 | } 747 | ], 748 | "source": [ 749 | "print(len(vectorizer.vocabulary_))" 750 | ] 751 | }, 752 | { 753 | "cell_type": "code", 754 | "execution_count": 31, 755 | "metadata": { 756 | "collapsed": false 757 | }, 758 | "outputs": [ 759 | { 760 | "name": "stdout", 761 | "output_type": "stream", 762 | "text": [ 763 | "[u'00', u'000', u'000pes', u'008704050406', u'0089', u'0121', u'01223585236', u'01223585334', u'02', u'0207', u'02072069400', u'02073162414', u'02085076972', u'03', u'04', u'0430', u'05', u'050703', u'0578', u'06']\n" 764 | ] 765 | } 766 | ], 767 | "source": [ 768 | "print(vectorizer.get_feature_names()[:20])\n" 769 | ] 770 | }, 771 | { 772 | "cell_type": "code", 773 | "execution_count": 32, 774 | "metadata": { 775 | "collapsed": false 776 | }, 777 | "outputs": [ 778 | { 779 | "name": "stdout", 780 | "output_type": "stream", 781 | "text": [ 782 | "[u'getting', u'getzed', u'gf', u'ghodbandar', u'ghost', u'gibbs', u'gibe', u'gift', u'gifted', u'gifts', u'giggle', u'gimme', u'gimmi', u'gin', u'girl', u'girlfrnd', u'girlie', u'girls', u'gist', u'giv']\n" 783 | ] 784 | } 785 | ], 786 | "source": [ 787 | "print(vectorizer.get_feature_names()[3000:3020])" 788 | ] 789 | }, 790 | { 791 | "cell_type": "code", 792 | "execution_count": 33, 793 | "metadata": { 794 | "collapsed": false 795 | }, 796 | "outputs": [ 797 | { 798 | "data": { 799 | "text/plain": [ 800 | "SGDClassifier(alpha=0.0001, average=False, class_weight=None, epsilon=0.1,\n", 801 | " eta0=0.0, fit_intercept=True, l1_ratio=0.15,\n", 802 | " learning_rate='optimal', loss='hinge', n_iter=5, n_jobs=1,\n", 803 | " penalty='l2', power_t=0.5, random_state=None, shuffle=True,\n", 804 | " verbose=0, warm_start=False)" 805 | ] 806 | }, 807 | "execution_count": 33, 808 | "metadata": {}, 809 | "output_type": "execute_result" 810 | } 811 | ], 812 | "source": [ 813 | "from sklearn.linear_model import SGDClassifier\n", 814 | "\n", 815 | "clf = SGDClassifier()\n", 816 | "clf" 817 | ] 818 | }, 819 | { 820 | "cell_type": "code", 821 | "execution_count": 34, 822 | "metadata": { 823 | "collapsed": false 824 | }, 825 | "outputs": [ 826 | { 827 | "data": { 828 | "text/plain": [ 829 | "SGDClassifier(alpha=0.0001, average=False, class_weight=None, epsilon=0.1,\n", 830 | " eta0=0.0, fit_intercept=True, l1_ratio=0.15,\n", 831 | " learning_rate='optimal', loss='hinge', n_iter=5, n_jobs=1,\n", 832 | " penalty='l2', power_t=0.5, random_state=None, shuffle=True,\n", 833 | " verbose=0, warm_start=False)" 834 | ] 835 | }, 836 | "execution_count": 34, 837 | "metadata": {}, 838 | "output_type": "execute_result" 839 | } 840 | ], 841 | "source": [ 842 | "clf.fit(X_train, y_train)" 843 | ] 844 | }, 845 | { 846 | "cell_type": "code", 847 | "execution_count": 35, 848 | "metadata": { 849 | "collapsed": false 850 | }, 851 | "outputs": [ 852 | { 853 | 
"data": { 854 | "text/plain": [ 855 | "0.9813486370157819" 856 | ] 857 | }, 858 | "execution_count": 35, 859 | "metadata": {}, 860 | "output_type": "execute_result" 861 | } 862 | ], 863 | "source": [ 864 | "clf.score(X_test, y_test)" 865 | ] 866 | }, 867 | { 868 | "cell_type": "code", 869 | "execution_count": 36, 870 | "metadata": { 871 | "collapsed": false 872 | }, 873 | "outputs": [ 874 | { 875 | "data": { 876 | "text/plain": [ 877 | "0.99880382775119614" 878 | ] 879 | }, 880 | "execution_count": 36, 881 | "metadata": {}, 882 | "output_type": "execute_result" 883 | } 884 | ], 885 | "source": [ 886 | "clf.score(X_train, y_train)" 887 | ] 888 | }, 889 | { 890 | "cell_type": "code", 891 | "execution_count": null, 892 | "metadata": { 893 | "collapsed": true 894 | }, 895 | "outputs": [], 896 | "source": [] 897 | } 898 | ], 899 | "metadata": { 900 | "kernelspec": { 901 | "display_name": "Python 2", 902 | "language": "python", 903 | "name": "python2" 904 | }, 905 | "language_info": { 906 | "codemirror_mode": { 907 | "name": "ipython", 908 | "version": 2 909 | }, 910 | "file_extension": ".py", 911 | "mimetype": "text/x-python", 912 | "name": "python", 913 | "nbconvert_exporter": "python", 914 | "pygments_lexer": "ipython2", 915 | "version": "2.7.11" 916 | } 917 | }, 918 | "nbformat": 4, 919 | "nbformat_minor": 0 920 | } 921 | -------------------------------------------------------------------------------- /5. Where do we go from here.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "## Story Time : The Netflix Prize\n", 8 | "" 9 | ] 10 | }, 11 | { 12 | "cell_type": "markdown", 13 | "metadata": {}, 14 | "source": [ 15 | "The Netflix Prize was an open competition for the best collaborative filtering algorithm to predict user ratings for films, based on previous ratings without any other information about the users or films, i.e. without the users or the films being identified except by numbers assigned for the contest." 16 | ] 17 | }, 18 | { 19 | "cell_type": "markdown", 20 | "metadata": {}, 21 | "source": [ 22 | " On 21 September 2009, the grand prize of US$1,000,000 was given to the BellKor's Pragmatic Chaos team which bested Netflix's own algorithm for predicting ratings by 10.06%" 23 | ] 24 | }, 25 | { 26 | "cell_type": "markdown", 27 | "metadata": {}, 28 | "source": [ 29 | "This competition took 3 years to complete. The winner was not a single algorithm but a complex combination of many algorithms, each one reducing the error ever so slightly." 30 | ] 31 | }, 32 | { 33 | "cell_type": "markdown", 34 | "metadata": {}, 35 | "source": [ 36 | "Netflix doesn't use it !!! Because in it's own words it “did not seem to justify the engineering effort needed to bring them into a production environment,”" 37 | ] 38 | }, 39 | { 40 | "cell_type": "markdown", 41 | "metadata": {}, 42 | "source": [ 43 | "### Ok nice story, but why?" 44 | ] 45 | }, 46 | { 47 | "cell_type": "markdown", 48 | "metadata": {}, 49 | "source": [ 50 | "## ML is not magic!" 51 | ] 52 | }, 53 | { 54 | "cell_type": "markdown", 55 | "metadata": {}, 56 | "source": [ 57 | "" 58 | ] 59 | }, 60 | { 61 | "cell_type": "markdown", 62 | "metadata": {}, 63 | "source": [ 64 | "The point is, it's very easy to underestimate the complexity that goes with ML.\n", 65 | "A couple of very important points from the paper, however I urge you to read the paper itself." 
66 | ] 67 | }, 68 | { 69 | "cell_type": "markdown", 70 | "metadata": {}, 71 | "source": [ 72 | "There is a pretty awesome paper named **[\"A Few Useful Things to Know about Machine Learning\"](http://homes.cs.washington.edu/~pedrod/papers/cacm12.pdf)** by Prof. Pedro Domingos, wherein he talks about the pitfalls of ML." 73 | ] 74 | }, 75 | { 76 | "cell_type": "markdown", 77 | "metadata": {}, 78 | "source": [ 79 | "* **Sometimes data is not enough.** Quoting Domingos: \"... the need for knowledge in learning should not be surprising. Machine learning is not magic; it can’t get something from nothing. What it does is get more from less. Programming, like all engineering, is a lot of work: we have to build everything from scratch. Learning is more like farming, which lets nature do most of the work. Farmers combine seeds with nutrients to grow crops. Learners combine knowledge with data to grow programs.\"" 80 | ] 81 | }, 82 | { 83 | "cell_type": "markdown", 84 | "metadata": {}, 85 | "source": [ 86 | "* **More Data > Clever Algorithm.** Quoting Domingos: \"Suppose you’ve constructed the best set of features you can, but the classifiers you’re getting are still not accurate enough. What can you do now? There are two main choices: design a better learning algorithm, or gather more data. [...] As a rule of thumb, a dumb algorithm with lots and lots of data beats a clever one with modest amounts of it. (After all, machine learning is all about letting data do the heavy lifting.)\"" 87 | ] 88 | }, 89 | { 90 | "cell_type": "markdown", 91 | "metadata": {}, 92 | "source": [ 93 | "* **The CURSE of Dimensionality.** This expression was coined by Bellman in 1961 to refer to the fact that many algorithms that work fine in low dimensions become intractable when the input is high-dimensional. Generalizing correctly becomes exponentially harder as the dimensionality (the number of features) of the examples grows, because a fixed-size training set covers a dwindling fraction of the input space. Even with a moderate dimension of 100 and a huge training set of a trillion examples, the latter covers only a fraction of about $10^{-18}$ of the input space. This is what makes machine learning both necessary and hard.\n", 94 | "\n" 95 | ] 96 | }, 97 | { 98 | "cell_type": "markdown", 99 | "metadata": {}, 100 | "source": [ 101 | "### Which brings us to ..." 102 | ] 103 | }, 104 | { 105 | "cell_type": "markdown", 106 | "metadata": {}, 107 | "source": [ 108 | "## Diving Deeper into ML" 109 | ] 110 | }, 111 | { 112 | "cell_type": "markdown", 113 | "metadata": {}, 114 | "source": [ 115 | "If you wish to better understand the inner workings of Machine Learning, how each of the prediction algorithms works and what makes it work, then this is the path to follow.\n", 116 | "\n", 117 | "The thing to note here is that this path is filled with math and statistics along with programming. It is very easy to forget that, at ground level, ML is all about math and statistics, especially when we have an elegant library like scikit-learn providing such a splendid abstraction."
118 | ] 119 | }, 120 | { 121 | "cell_type": "markdown", 122 | "metadata": {}, 123 | "source": [ 124 | "However, this path has its own benefits:\n", 125 | " * You get a better understanding of hyperparameters.\n", 126 | " * Visualization becomes easier.\n", 127 | " * Standard algorithms sometimes fail on non-trivial problems :(\n", 128 | " " 129 | ] 130 | }, 131 | { 132 | "cell_type": "markdown", 133 | "metadata": {}, 134 | "source": [ 135 | "### How to go about it then?" 136 | ] 137 | }, 138 | { 139 | "cell_type": "markdown", 140 | "metadata": {}, 141 | "source": [ 142 | "* A highly recommended approach is to start with Andrew Ng's Machine Learning course on [Coursera](https://www.coursera.org/learn/machine-learning). It provides a gentle and solid introduction to the internals of Machine Learning and has lots of programming exercises too. \n", 143 | "* Start playing with the many ML-related notebooks collected in the\n", 144 | "[IPython notebook gallery](https://github.com/ipython/ipython/wiki/A-gallery-of-interesting-IPython-Notebooks).\n", 145 | "* Some examples shared by [Scikit-Learn](http://scikit-learn.org/stable/auto_examples/)\n", 146 | "* [Kaggle](https://www.kaggle.com/)\n", 147 | "* This excellent post about Machine Learning on [GitHub](https://github.com/hangtwenty/dive-into-machine-learning)" 148 | ] 149 | }, 150 | { 151 | "cell_type": "markdown", 152 | "metadata": {}, 153 | "source": [ 154 | "### Finally" 155 | ] 156 | }, 157 | { 158 | "cell_type": "markdown", 159 | "metadata": {}, 160 | "source": [ 161 | "" 162 | ] 163 | }, 164 | { 165 | "cell_type": "markdown", 166 | "metadata": {}, 167 | "source": [ 168 | "# THANK YOU" 169 | ] 170 | } 171 | ], 172 | "metadata": { 173 | "kernelspec": { 174 | "display_name": "Python 2", 175 | "language": "python", 176 | "name": "python2" 177 | }, 178 | "language_info": { 179 | "codemirror_mode": { 180 | "name": "ipython", 181 | "version": 2 182 | }, 183 | "file_extension": ".py", 184 | "mimetype": "text/x-python", 185 | "name": "python", 186 | "nbconvert_exporter": "python", 187 | "pygments_lexer": "ipython2", 188 | "version": "2.7.11" 189 | } 190 | }, 191 | "nbformat": 4, 192 | "nbformat_minor": 0 193 | } 194 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Machine Learning using Scikit-Learn 2 | 3 | ## Introduction 4 | 5 | This repository contains notebooks based on the material I used during the BangPypers July meetup. These notebooks are made keeping in mind that the intended audience has very little or no experience with scikit-learn and/or machine learning but has some knowledge of Python. 6 | 7 | ## Installation 8 | 9 | * Clone this repo: `git clone https://github.com/pfrcks/BangPypers-SKLearn.git` 10 | * If you don't have `python-dev`, install it using `sudo apt-get install python-dev` or whatever the equivalent command is for your distribution. 11 | * Installation is generally a non-trivial process. However, we have the wonderful **conda** environment manager, part of the Anaconda Scientific Distribution. The best course of action is downloading and installing [miniconda](http://conda.pydata.org/miniconda.html). 12 | * Once you have miniconda installed, issue the following commands in your shell: 13 | * `conda install numpy scipy matplotlib scikit-learn ipython-notebook seaborn` 14 | * `conda install -c conda-forge ipywidgets` 15 | * **Note**: The above process requires a good internet connection and some time. 
Please do this before coming to the workshop. 16 | * If you want to simplify the process further, you can go for the full-fledged [Anaconda](https://docs.continuum.io/anaconda/install) package instead of the above method. (This is the preferred method.) 17 | * After installing, issue `conda install -c conda-forge ipywidgets` 18 | * **fetch_data.py** fetches the data required for the Facial Recognition example. The dataset is ~230MB. If you want to follow along during the workshop, you can execute `python fetch_data.py` after cd'ing into the repo directory. In case you don't want to download it, you are welcome to just look at the example during the workshop. 19 | * **NOTE**: This repo is a work in progress. To stay on the latest version, issue a `git pull` before attending the workshop. 20 | * **NOTE**: If you face any problems during installation, please create an issue on GitHub. 21 | * That's it. 22 | 23 | ## Requirements 24 | 25 | * Python 2.7 26 | * Working knowledge of Python 27 | 28 | ## Notes 29 | 30 | * This workshop has been developed for an intended audience with little or no experience of scikit-learn and/or machine learning. 31 | * Please download the repo and fetch the dependencies before coming to the workshop. The installation takes time that is better spent on the workshop itself. 32 | 33 | ## Credits where credit's due 34 | 35 | * These notebooks owe a lot to the notebooks published by [Jake Vanderplas](https://github.com/jakevdp/sklearn_tutorial) and [Andreas Muller](https://www.youtube.com/watch?v=80fZrVMurPM), whose coverage of these topics is much more extensive. If you want to go further with the black-box approach to scikit-learn, I would highly recommend going through their notebooks and screencasts. These tutorials helped me a lot in understanding scikit-learn and its application. 36 | 37 | ## Where to go from here? 38 | 39 | * Kaggle 40 | * Andrew-Ng 41 | * KD-Nuggets post 42 | * Dive into 43 | * Awesome notebooks 44 | * [A visual intro to ML](http://www.r2d3.us/visual-intro-to-machine-learning-part-1/) 45 | -------------------------------------------------------------------------------- /cheatsheet.txt: -------------------------------------------------------------------------------- 1 | Numpy: 2 | - ones : Return a new array of given shape and type, filled with ones. 3 | - arange : Return evenly spaced values within a given interval. arange([start,] stop[, step,], dtype=None) 4 | - asarray : Convert the input to an array. 5 | - random.random : Return random floats in the half-open interval [0.0, 1.0). 6 | - linspace : Returns `num` evenly spaced samples, calculated over the interval [`start`, `stop`]. 7 | - newaxis : Adds a new axis, e.g. turning a 1D array of numbers into a 2D row or column matrix. 8 | - array : Create an array. 9 | - random.normal : Draw random samples from a normal (Gaussian) distribution. 10 | - meshgrid : Create a rectangular grid out of an array of x values and an array of y values. 11 | - random.RandomState : Container for the Mersenne Twister pseudo-random number generator. 12 | - random.RandomState.permutation : Randomly permute a sequence, or return a permuted range. 13 | - random.RandomState.uniform : Draw samples from a uniform distribution. 14 | - squeeze : Remove single-dimensional entries from the shape of an array. 15 | - random.randn : Return a sample (or samples) from the "standard normal" distribution. 16 | - dot : Dot product of two arrays. 
17 | - ravel : Return a contiguous flattened array. 18 | - set_printoptions : Set printing options. 19 | 20 | Matplotlib: 21 | - scatter : Make a scatter plot of x vs y, where x and y are sequence-like objects of the same length. 22 | - contour : Draw contour lines. 23 | - figure : Creates a new figure. 24 | - add_subplot : Add a subplot. 25 | - subplots_adjust : Tune the subplot layout (spacing between subplots). 26 | - pcolormesh : Create a pseudocolor plot of a 2-D array. 27 | - xlim : Get or set the *x* limits of the current axes. 28 | - ylim : Get or set the *y* limits of the current axes. 29 | - axis : Convenience method to get or set axis properties. 30 | - fill_between : Make filled polygons between two curves. 31 | - setp : Set a property on an artist object. 32 | 33 | Scikit-Learn: 34 | - K Neighbors (Classifier/Regressor) : Classifier/Regressor implementing the k-nearest neighbors vote. 35 | - linear_model : The :mod:`sklearn.linear_model` module implements generalized linear models, e.g. SGD, BR, etc. 36 | - Linear Regression : Ordinary least squares Linear Regression. 37 | - ^ normalize? : If True, the regressors X will be normalized before regression. 38 | - coef_ : Estimated coefficients for the linear regression problem. 39 | - intercept_ : Independent term in the linear model. 40 | - residues_ : Get the residues of the fitted model. 41 | - make_blobs : Generate isotropic Gaussian blobs for clustering. 42 | - make_circles : Make a large circle containing a smaller circle in 2d. 43 | - SVM : The :mod:`sklearn.svm` module includes Support Vector Machine algorithms. 44 | - SVC : Support Vector Classification. The implementation is based on libsvm. 45 | - Kernels : The kernel is effectively a similarity measure. 46 | - decision_function : Gives per-class scores for each sample (or a single score per sample in the binary case). 47 | - support_vectors_ : The support vectors of the fitted SVM. 48 | - score : Returns the mean accuracy on the given test data and labels. 49 | - StandardScaler : Standardize features by removing the mean and scaling to unit variance. 50 | - transform : Transform the data based on what is learned from `fit`. 51 | - pca.explained_variance_ : Amount of variance explained by each of the selected components. 52 | - pca.components_ : Principal axes in feature space, representing the directions of maximum variance. 53 | - fit_transform : Fit the model with X and apply the dimensionality reduction on X. 54 | - inverse_transform : Transform data back to its original space, i.e., return an input X_original whose transform would be X. 55 | - KMeans : K-Means clustering. 56 | - fit_predict : Compute cluster centers and predict cluster index for each sample. 57 | - confusion_matrix : Compute confusion matrix to evaluate the accuracy of a classification. 58 | - accuracy_score : Accuracy classification score. 59 | - adjusted_rand_score : The Rand Index computes a similarity measure between two clusterings by considering all pairs of samples and 60 | counting pairs that are assigned in the same or different clusters in the predicted and true clusterings. 61 | - MiniBatchKMeans : Mini-Batch K-Means clustering. 62 | - cross_val_score : Evaluate a score by cross-validation. 63 | - mean_squared_error : Mean squared error regression loss. 64 | - pipeline.make_pipeline : Construct a Pipeline from the given estimators. 65 | - Polynomial Features : Generate a new feature matrix consisting of all polynomial combinations of the features with degree less than 66 | or equal to the specified degree. 67 | - learning_curve.validation_curve : Determine training and test scores for varying parameter values. 
68 | - feature_extraction.CountVectorizer : Convert a collection of text documents to a matrix of token counts 69 | - TfidfVectorizer : Convert a collection of raw documents to a matrix of TF-IDF features. 70 | - SGDClassifier : Linear classifiers (SVM, logistic regression, a.o.) with SGD training. 71 | - RandomizedPCA : Principal component analysis (PCA) using randomized SVD 72 | - RandomForestRegressor : A random forest is a meta estimator that fits a number of classifying decision trees on various sub-samples 73 | of the dataset and use averaging to improve the predictive accuracy and control over-fitting. 74 | -------------------------------------------------------------------------------- /data/readme: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/pfrcks/Machine-Learning-SKLearn/f7f8a50136cc6f54406859cfe693abf4b4aca19e/data/readme -------------------------------------------------------------------------------- /fetch_data.py: -------------------------------------------------------------------------------- 1 | from sklearn import datasets 2 | lfw_people = datasets.fetch_lfw_people(min_faces_per_person=70, resize=0.4, data_home='data') 3 | -------------------------------------------------------------------------------- /figures/BangPypers.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/pfrcks/Machine-Learning-SKLearn/f7f8a50136cc6f54406859cfe693abf4b4aca19e/figures/BangPypers.pdf -------------------------------------------------------------------------------- /figures/Pic_BP_PDF-01.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/pfrcks/Machine-Learning-SKLearn/f7f8a50136cc6f54406859cfe693abf4b4aca19e/figures/Pic_BP_PDF-01.png -------------------------------------------------------------------------------- /figures/Pic_BP_PDF-02.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/pfrcks/Machine-Learning-SKLearn/f7f8a50136cc6f54406859cfe693abf4b4aca19e/figures/Pic_BP_PDF-02.png -------------------------------------------------------------------------------- /figures/Pic_BP_PDF-03.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/pfrcks/Machine-Learning-SKLearn/f7f8a50136cc6f54406859cfe693abf4b4aca19e/figures/Pic_BP_PDF-03.png -------------------------------------------------------------------------------- /figures/Pic_BP_PDF-04.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/pfrcks/Machine-Learning-SKLearn/f7f8a50136cc6f54406859cfe693abf4b4aca19e/figures/Pic_BP_PDF-04.png -------------------------------------------------------------------------------- /figures/Pic_BP_PDF-05.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/pfrcks/Machine-Learning-SKLearn/f7f8a50136cc6f54406859cfe693abf4b4aca19e/figures/Pic_BP_PDF-05.png -------------------------------------------------------------------------------- /figures/Pic_BP_PDF-06.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/pfrcks/Machine-Learning-SKLearn/f7f8a50136cc6f54406859cfe693abf4b4aca19e/figures/Pic_BP_PDF-06.png 
-------------------------------------------------------------------------------- /figures/Pic_BP_PDF-07.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/pfrcks/Machine-Learning-SKLearn/f7f8a50136cc6f54406859cfe693abf4b4aca19e/figures/Pic_BP_PDF-07.png -------------------------------------------------------------------------------- /figures/Pic_BP_PDF-08.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/pfrcks/Machine-Learning-SKLearn/f7f8a50136cc6f54406859cfe693abf4b4aca19e/figures/Pic_BP_PDF-08.png -------------------------------------------------------------------------------- /figures/Pic_BP_PDF-09.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/pfrcks/Machine-Learning-SKLearn/f7f8a50136cc6f54406859cfe693abf4b4aca19e/figures/Pic_BP_PDF-09.png -------------------------------------------------------------------------------- /figures/Pic_BP_PDF-10.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/pfrcks/Machine-Learning-SKLearn/f7f8a50136cc6f54406859cfe693abf4b4aca19e/figures/Pic_BP_PDF-10.png -------------------------------------------------------------------------------- /figures/Pic_BP_PDF-11.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/pfrcks/Machine-Learning-SKLearn/f7f8a50136cc6f54406859cfe693abf4b4aca19e/figures/Pic_BP_PDF-11.png -------------------------------------------------------------------------------- /figures/Pic_BP_PDF-12.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/pfrcks/Machine-Learning-SKLearn/f7f8a50136cc6f54406859cfe693abf4b4aca19e/figures/Pic_BP_PDF-12.png -------------------------------------------------------------------------------- /figures/Pic_BP_PDF-13.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/pfrcks/Machine-Learning-SKLearn/f7f8a50136cc6f54406859cfe693abf4b4aca19e/figures/Pic_BP_PDF-13.png -------------------------------------------------------------------------------- /figures/Pic_BP_PDF-14.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/pfrcks/Machine-Learning-SKLearn/f7f8a50136cc6f54406859cfe693abf4b4aca19e/figures/Pic_BP_PDF-14.png -------------------------------------------------------------------------------- /figures/Pic_BP_PDF-15.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/pfrcks/Machine-Learning-SKLearn/f7f8a50136cc6f54406859cfe693abf4b4aca19e/figures/Pic_BP_PDF-15.png -------------------------------------------------------------------------------- /figures/Pic_BP_PDF-16.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/pfrcks/Machine-Learning-SKLearn/f7f8a50136cc6f54406859cfe693abf4b4aca19e/figures/Pic_BP_PDF-16.png -------------------------------------------------------------------------------- /figures/Pic_BP_PDF-17.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/pfrcks/Machine-Learning-SKLearn/f7f8a50136cc6f54406859cfe693abf4b4aca19e/figures/Pic_BP_PDF-17.png 
-------------------------------------------------------------------------------- /figures/Pic_BP_PDF-18.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/pfrcks/Machine-Learning-SKLearn/f7f8a50136cc6f54406859cfe693abf4b4aca19e/figures/Pic_BP_PDF-18.png -------------------------------------------------------------------------------- /figures/Pic_BP_PDF-19.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/pfrcks/Machine-Learning-SKLearn/f7f8a50136cc6f54406859cfe693abf4b4aca19e/figures/Pic_BP_PDF-19.png -------------------------------------------------------------------------------- /figures/Pic_BP_PDF-20.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/pfrcks/Machine-Learning-SKLearn/f7f8a50136cc6f54406859cfe693abf4b4aca19e/figures/Pic_BP_PDF-20.png -------------------------------------------------------------------------------- /figures/Pic_BP_PDF-21.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/pfrcks/Machine-Learning-SKLearn/f7f8a50136cc6f54406859cfe693abf4b4aca19e/figures/Pic_BP_PDF-21.png -------------------------------------------------------------------------------- /figures/Pic_BP_PDF-22.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/pfrcks/Machine-Learning-SKLearn/f7f8a50136cc6f54406859cfe693abf4b4aca19e/figures/Pic_BP_PDF-22.png -------------------------------------------------------------------------------- /figures/Pic_BP_PDF-23.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/pfrcks/Machine-Learning-SKLearn/f7f8a50136cc6f54406859cfe693abf4b4aca19e/figures/Pic_BP_PDF-23.png -------------------------------------------------------------------------------- /figures/Pic_BP_PDF-24.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/pfrcks/Machine-Learning-SKLearn/f7f8a50136cc6f54406859cfe693abf4b4aca19e/figures/Pic_BP_PDF-24.png -------------------------------------------------------------------------------- /figures/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/pfrcks/Machine-Learning-SKLearn/f7f8a50136cc6f54406859cfe693abf4b4aca19e/figures/__init__.py -------------------------------------------------------------------------------- /figures/__init__.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/pfrcks/Machine-Learning-SKLearn/f7f8a50136cc6f54406859cfe693abf4b4aca19e/figures/__init__.pyc -------------------------------------------------------------------------------- /figures/bowjpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/pfrcks/Machine-Learning-SKLearn/f7f8a50136cc6f54406859cfe693abf4b4aca19e/figures/bowjpg -------------------------------------------------------------------------------- /figures/cluster_comparison.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/pfrcks/Machine-Learning-SKLearn/f7f8a50136cc6f54406859cfe693abf4b4aca19e/figures/cluster_comparison.png 
-------------------------------------------------------------------------------- /figures/iris_setosa.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/pfrcks/Machine-Learning-SKLearn/f7f8a50136cc6f54406859cfe693abf4b4aca19e/figures/iris_setosa.jpg -------------------------------------------------------------------------------- /figures/iris_versicolor.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/pfrcks/Machine-Learning-SKLearn/f7f8a50136cc6f54406859cfe693abf4b4aca19e/figures/iris_versicolor.jpg -------------------------------------------------------------------------------- /figures/iris_virginica.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/pfrcks/Machine-Learning-SKLearn/f7f8a50136cc6f54406859cfe693abf4b4aca19e/figures/iris_virginica.jpg -------------------------------------------------------------------------------- /figures/magician.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/pfrcks/Machine-Learning-SKLearn/f7f8a50136cc6f54406859cfe693abf4b4aca19e/figures/magician.jpg -------------------------------------------------------------------------------- /figures/ml_map.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/pfrcks/Machine-Learning-SKLearn/f7f8a50136cc6f54406859cfe693abf4b4aca19e/figures/ml_map.png -------------------------------------------------------------------------------- /figures/netflix-prize.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/pfrcks/Machine-Learning-SKLearn/f7f8a50136cc6f54406859cfe693abf4b4aca19e/figures/netflix-prize.png -------------------------------------------------------------------------------- /figures/petal_sepal.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/pfrcks/Machine-Learning-SKLearn/f7f8a50136cc6f54406859cfe693abf4b4aca19e/figures/petal_sepal.jpg -------------------------------------------------------------------------------- /figures/plot.py: -------------------------------------------------------------------------------- 1 | """ 2 | Small helpers for code that is not shown in the notebooks 3 | Taken from Jake Vanderplas. 4 | https://github.com/jakevdp/ 5 | """ 6 | 7 | from sklearn import neighbors, datasets, linear_model 8 | import pylab as pl 9 | import numpy as np 10 | from matplotlib.colors import ListedColormap 11 | import matplotlib.pyplot as plt 12 | from sklearn.linear_model import SGDClassifier 13 | from sklearn.datasets.samples_generator import make_blobs 14 | import warnings 15 | 16 | # Create color maps for 3-class classification problem, as with iris 17 | cmap_light = ListedColormap(['#FFAAAA', '#AAFFAA', '#AAAAFF']) 18 | cmap_bold = ListedColormap(['#FF0000', '#00FF00', '#0000FF']) 19 | 20 | 21 | def plot_iris_knn(): 22 | iris = datasets.load_iris() 23 | X = iris.data[:, :2] # we only take the first two features. 
We could 24 | # avoid this ugly slicing by using a two-dim dataset 25 | y = iris.target 26 | 27 | knn = neighbors.KNeighborsClassifier(n_neighbors=3) 28 | knn.fit(X, y) 29 | 30 | x_min, x_max = X[:, 0].min() - .1, X[:, 0].max() + .1 31 | y_min, y_max = X[:, 1].min() - .1, X[:, 1].max() + .1 32 | xx, yy = np.meshgrid(np.linspace(x_min, x_max, 100), 33 | np.linspace(y_min, y_max, 100)) 34 | Z = knn.predict(np.c_[xx.ravel(), yy.ravel()]) 35 | 36 | # Put the result into a color plot 37 | Z = Z.reshape(xx.shape) 38 | pl.figure() 39 | pl.pcolormesh(xx, yy, Z, cmap=cmap_light) 40 | 41 | # Plot also the training points 42 | pl.scatter(X[:, 0], X[:, 1], c=y, cmap=cmap_bold) 43 | pl.xlabel('sepal length (cm)') 44 | pl.ylabel('sepal width (cm)') 45 | pl.axis('tight') 46 | 47 | 48 | def plot_polynomial_regression(): 49 | rng = np.random.RandomState(0) 50 | x = 2*rng.rand(100) - 1 51 | f = lambda t: 1.2 * t**2 + .1 * t**3 - .4 * t **5 - .5 * t ** 9 52 | y = f(x) + .4 * rng.normal(size=100) 53 | 54 | x_test = np.linspace(-1, 1, 100) 55 | 56 | pl.figure() 57 | pl.scatter(x, y, s=4) 58 | 59 | X = np.array([x**i for i in range(5)]).T 60 | X_test = np.array([x_test**i for i in range(5)]).T 61 | regr = linear_model.LinearRegression() 62 | regr.fit(X, y) 63 | pl.plot(x_test, regr.predict(X_test), label='4th order') 64 | 65 | X = np.array([x**i for i in range(10)]).T 66 | X_test = np.array([x_test**i for i in range(10)]).T 67 | regr = linear_model.LinearRegression() 68 | regr.fit(X, y) 69 | pl.plot(x_test, regr.predict(X_test), label='9th order') 70 | 71 | pl.legend(loc='best') 72 | pl.axis('tight') 73 | pl.title('Fitting a 4th and a 9th order polynomial') 74 | 75 | pl.figure() 76 | pl.scatter(x, y, s=4) 77 | pl.plot(x_test, f(x_test), label="truth") 78 | pl.axis('tight') 79 | pl.title('Ground truth (9th order polynomial)') 80 | 81 | 82 | def plot_sgd_separator(): 83 | # we create 50 separable points 84 | X, Y = make_blobs(n_samples=50, centers=2, 85 | random_state=0, cluster_std=0.60) 86 | 87 | # fit the model 88 | clf = SGDClassifier(loss="hinge", alpha=0.01, 89 | n_iter=200, fit_intercept=True) 90 | clf.fit(X, Y) 91 | 92 | # plot the line, the points, and the nearest vectors to the plane 93 | xx = np.linspace(-1, 5, 10) 94 | yy = np.linspace(-1, 5, 10) 95 | 96 | X1, X2 = np.meshgrid(xx, yy) 97 | Z = np.empty(X1.shape) 98 | for (i, j), val in np.ndenumerate(X1): 99 | x1 = val 100 | x2 = X2[i, j] 101 | x3 = np.array([x1, x2]) 102 | p = clf.decision_function(x3.reshape(1, -1)) 103 | Z[i, j] = p[0] 104 | levels = [-1.0, 0.0, 1.0] 105 | linestyles = ['dashed', 'solid', 'dashed'] 106 | colors = 'k' 107 | 108 | ax = plt.axes() 109 | ax.contour(X1, X2, Z, levels, colors=colors, linestyles=linestyles) 110 | ax.scatter(X[:, 0], X[:, 1], c=Y, cmap=plt.cm.Paired) 111 | 112 | ax.axis('tight') 113 | 114 | 115 | def plot_example_decision_tree(): 116 | fig = plt.figure(figsize=(10, 4)) 117 | ax = fig.add_axes([0, 0, 0.8, 1], frameon=False, xticks=[], yticks=[]) 118 | ax.set_title('Example Decision Tree: Animal Classification', size=24) 119 | 120 | def text(ax, x, y, t, size=20, **kwargs): 121 | ax.text(x, y, t, 122 | ha='center', va='center', size=size, 123 | bbox=dict(boxstyle='round', ec='k', fc='w'), **kwargs) 124 | 125 | text(ax, 0.5, 0.9, "How big is\nthe animal?", 20) 126 | text(ax, 0.3, 0.6, "Does the animal\nhave horns?", 18) 127 | text(ax, 0.7, 0.6, "Does the animal\nhave two legs?", 18) 128 | text(ax, 0.12, 0.3, "Are the horns\nlonger than 10cm?", 14) 129 | text(ax, 0.38, 0.3, "Is the animal\nwearing a collar?", 
14) 130 | text(ax, 0.62, 0.3, "Does the animal\nhave wings?", 14) 131 | text(ax, 0.88, 0.3, "Does the animal\nhave a tail?", 14) 132 | 133 | text(ax, 0.4, 0.75, "> 1m", 12, alpha=0.4) 134 | text(ax, 0.6, 0.75, "< 1m", 12, alpha=0.4) 135 | 136 | text(ax, 0.21, 0.45, "yes", 12, alpha=0.4) 137 | text(ax, 0.34, 0.45, "no", 12, alpha=0.4) 138 | 139 | text(ax, 0.66, 0.45, "yes", 12, alpha=0.4) 140 | text(ax, 0.79, 0.45, "no", 12, alpha=0.4) 141 | 142 | ax.plot([0.3, 0.5, 0.7], [0.6, 0.9, 0.6], '-k') 143 | ax.plot([0.12, 0.3, 0.38], [0.3, 0.6, 0.3], '-k') 144 | ax.plot([0.62, 0.7, 0.88], [0.3, 0.6, 0.3], '-k') 145 | ax.plot([0.0, 0.12, 0.20], [0.0, 0.3, 0.0], '--k') 146 | ax.plot([0.28, 0.38, 0.48], [0.0, 0.3, 0.0], '--k') 147 | ax.plot([0.52, 0.62, 0.72], [0.0, 0.3, 0.0], '--k') 148 | ax.plot([0.8, 0.88, 1.0], [0.0, 0.3, 0.0], '--k') 149 | ax.axis([0, 1, 0, 1]) 150 | 151 | 152 | def visualize_tree(estimator, X, y, boundaries=True, 153 | xlim=None, ylim=None): 154 | estimator.fit(X, y) 155 | 156 | if xlim is None: 157 | xlim = (X[:, 0].min() - 0.1, X[:, 0].max() + 0.1) 158 | if ylim is None: 159 | ylim = (X[:, 1].min() - 0.1, X[:, 1].max() + 0.1) 160 | 161 | x_min, x_max = xlim 162 | y_min, y_max = ylim 163 | xx, yy = np.meshgrid(np.linspace(x_min, x_max, 100), 164 | np.linspace(y_min, y_max, 100)) 165 | Z = estimator.predict(np.c_[xx.ravel(), yy.ravel()]) 166 | 167 | # Put the result into a color plot 168 | Z = Z.reshape(xx.shape) 169 | plt.figure() 170 | plt.pcolormesh(xx, yy, Z, alpha=0.2, cmap='rainbow') 171 | plt.clim(y.min(), y.max()) 172 | 173 | # Plot also the training points 174 | plt.scatter(X[:, 0], X[:, 1], c=y, s=50, cmap='rainbow') 175 | plt.axis('off') 176 | 177 | plt.xlim(x_min, x_max) 178 | plt.ylim(y_min, y_max) 179 | plt.clim(y.min(), y.max()) 180 | 181 | # Plot the decision boundaries 182 | def plot_boundaries(i, xlim, ylim): 183 | if i < 0: 184 | return 185 | 186 | tree = estimator.tree_ 187 | 188 | if tree.feature[i] == 0: 189 | plt.plot([tree.threshold[i], tree.threshold[i]], ylim, '-k') 190 | plot_boundaries(tree.children_left[i], 191 | [xlim[0], tree.threshold[i]], ylim) 192 | plot_boundaries(tree.children_right[i], 193 | [tree.threshold[i], xlim[1]], ylim) 194 | 195 | elif tree.feature[i] == 1: 196 | plt.plot(xlim, [tree.threshold[i], tree.threshold[i]], '-k') 197 | plot_boundaries(tree.children_left[i], xlim, 198 | [ylim[0], tree.threshold[i]]) 199 | plot_boundaries(tree.children_right[i], xlim, 200 | [tree.threshold[i], ylim[1]]) 201 | 202 | if boundaries: 203 | plot_boundaries(0, plt.xlim(), plt.ylim()) 204 | 205 | 206 | def plot_tree_interactive(X, y): 207 | from sklearn.tree import DecisionTreeClassifier 208 | 209 | def interactive_tree(depth=1): 210 | clf = DecisionTreeClassifier(max_depth=depth, random_state=0) 211 | visualize_tree(clf, X, y) 212 | 213 | from IPython.html.widgets import interact 214 | return interact(interactive_tree, depth=[1, 5]) 215 | 216 | 217 | def plot_kmeans_interactive(min_clusters=1, max_clusters=6): 218 | from IPython.html.widgets import interact 219 | from sklearn.metrics.pairwise import euclidean_distances 220 | from sklearn.datasets.samples_generator import make_blobs 221 | 222 | with warnings.catch_warnings(): 223 | warnings.filterwarnings('ignore') 224 | 225 | X, y = make_blobs(n_samples=300, centers=4, 226 | random_state=0, cluster_std=0.60) 227 | 228 | def _kmeans_step(frame=0, n_clusters=4): 229 | rng = np.random.RandomState(2) 230 | labels = np.zeros(X.shape[0]) 231 | centers = rng.randn(n_clusters, 2) 232 | 233 | nsteps = frame 
// 3 234 | 235 | for i in range(nsteps + 1): 236 | old_centers = centers 237 | if i < nsteps or frame % 3 > 0: 238 | dist = euclidean_distances(X, centers) 239 | labels = dist.argmin(1) 240 | 241 | if i < nsteps or frame % 3 > 1: 242 | centers = np.array([X[labels == j].mean(0) 243 | for j in range(n_clusters)]) 244 | nans = np.isnan(centers) 245 | centers[nans] = old_centers[nans] 246 | 247 | 248 | # plot the data and cluster centers 249 | plt.scatter(X[:, 0], X[:, 1], c=labels, s=50, cmap='rainbow', 250 | vmin=0, vmax=n_clusters - 1); 251 | plt.scatter(old_centers[:, 0], old_centers[:, 1], marker='o', 252 | c=np.arange(n_clusters), 253 | s=200, cmap='rainbow') 254 | plt.scatter(old_centers[:, 0], old_centers[:, 1], marker='o', 255 | c='black', s=50) 256 | 257 | # plot new centers if third frame 258 | if frame % 3 == 2: 259 | for i in range(n_clusters): 260 | plt.annotate('', centers[i], old_centers[i], 261 | arrowprops=dict(arrowstyle='->', linewidth=1)) 262 | plt.scatter(centers[:, 0], centers[:, 1], marker='o', 263 | c=np.arange(n_clusters), 264 | s=200, cmap='rainbow') 265 | plt.scatter(centers[:, 0], centers[:, 1], marker='o', 266 | c='black', s=50) 267 | 268 | plt.xlim(-4, 4) 269 | plt.ylim(-2, 10) 270 | 271 | if frame % 3 == 1: 272 | plt.text(3.8, 9.5, "1. Reassign points to nearest centroid", 273 | ha='right', va='top', size=14) 274 | elif frame % 3 == 2: 275 | plt.text(3.8, 9.5, "2. Update centroids to cluster means", 276 | ha='right', va='top', size=14) 277 | 278 | 279 | return interact(_kmeans_step, frame=[0, 50], 280 | n_clusters=[min_clusters, max_clusters]) 281 | 282 | 283 | def plot_image_components(x, coefficients=None, mean=0, components=None, 284 | imshape=(8, 8), n_components=6, fontsize=12): 285 | if coefficients is None: 286 | coefficients = x 287 | 288 | if components is None: 289 | components = np.eye(len(coefficients), len(x)) 290 | 291 | mean = np.zeros_like(x) + mean 292 | 293 | 294 | fig = plt.figure(figsize=(1.2 * (5 + n_components), 1.2 * 2)) 295 | g = plt.GridSpec(2, 5 + n_components, hspace=0.3) 296 | 297 | def show(i, j, x, title=None): 298 | ax = fig.add_subplot(g[i, j], xticks=[], yticks=[]) 299 | ax.imshow(x.reshape(imshape), interpolation='nearest') 300 | if title: 301 | ax.set_title(title, fontsize=fontsize) 302 | 303 | show(slice(2), slice(2), x, "True") 304 | 305 | approx = mean.copy() 306 | show(0, 2, np.zeros_like(x) + mean, r'$\mu$') 307 | show(1, 2, approx, r'$1 \cdot \mu$') 308 | 309 | for i in range(0, n_components): 310 | approx = approx + coefficients[i] * components[i] 311 | show(0, i + 3, components[i], r'$c_{0}$'.format(i + 1)) 312 | show(1, i + 3, approx, 313 | r"${0:.2f} \cdot c_{1}$".format(coefficients[i], i + 1)) 314 | plt.gca().text(0, 1.05, '$+$', ha='right', va='bottom', 315 | transform=plt.gca().transAxes, fontsize=fontsize) 316 | 317 | show(slice(2), slice(-2, None), approx, "Approx") 318 | 319 | 320 | def plot_pca_interactive(data, n_components=6): 321 | from sklearn.decomposition import PCA 322 | from IPython.html.widgets import interact 323 | 324 | pca = PCA(n_components=n_components) 325 | Xproj = pca.fit_transform(data) 326 | 327 | def show_decomp(i=0): 328 | plot_image_components(data[i], Xproj[i], 329 | pca.mean_, pca.components_) 330 | 331 | interact(show_decomp, i=(0, data.shape[0] - 1)); 332 | -------------------------------------------------------------------------------- /figures/plot.pyc: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/pfrcks/Machine-Learning-SKLearn/f7f8a50136cc6f54406859cfe693abf4b4aca19e/figures/plot.pyc -------------------------------------------------------------------------------- /figures/plot_2d_separator.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import matplotlib.pyplot as plt 3 | 4 | 5 | def plot_2d_separator(classifier, X, fill=False, ax=None, eps=None): 6 | if eps is None: 7 | eps = X.std() / 2. 8 | x_min, x_max = X[:, 0].min() - eps, X[:, 0].max() + eps 9 | y_min, y_max = X[:, 1].min() - eps, X[:, 1].max() + eps 10 | xx = np.linspace(x_min, x_max, 100) 11 | yy = np.linspace(y_min, y_max, 100) 12 | 13 | X1, X2 = np.meshgrid(xx, yy) 14 | X_grid = np.c_[X1.ravel(), X2.ravel()] 15 | try: 16 | decision_values = classifier.decision_function(X_grid) 17 | levels = [0] 18 | fill_levels = [decision_values.min(), 0, decision_values.max()] 19 | except AttributeError: 20 | # no decision_function 21 | decision_values = classifier.predict_proba(X_grid)[:, 1] 22 | levels = [.5] 23 | fill_levels = [0, .5, 1] 24 | 25 | if ax is None: 26 | ax = plt.gca() 27 | if fill: 28 | ax.contourf(X1, X2, decision_values.reshape(X1.shape), 29 | levels=fill_levels, colors=['blue', 'red']) 30 | else: 31 | ax.contour(X1, X2, decision_values.reshape(X1.shape), levels=levels, 32 | colors="black") 33 | ax.set_xlim(x_min, x_max) 34 | ax.set_ylim(y_min, y_max) 35 | ax.set_xticks(()) 36 | ax.set_yticks(()) 37 | 38 | 39 | if __name__ == '__main__': 40 | from sklearn.datasets import make_blobs 41 | from sklearn.linear_model import LogisticRegression 42 | X, y = make_blobs(centers=2, random_state=42) 43 | clf = LogisticRegression().fit(X, y) 44 | plot_2d_separator(clf, X, fill=True) 45 | plt.scatter(X[:, 0], X[:, 1], c=y) 46 | plt.show() 47 | -------------------------------------------------------------------------------- /figures/plot_2d_separator.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/pfrcks/Machine-Learning-SKLearn/f7f8a50136cc6f54406859cfe693abf4b4aca19e/figures/plot_2d_separator.pyc -------------------------------------------------------------------------------- /figures/plot_digits_datasets.py: -------------------------------------------------------------------------------- 1 | # Taken from example in scikit-learn examples 2 | # Authors: Fabian Pedregosa 3 | # Olivier Grisel 4 | # Mathieu Blondel 5 | # Gael Varoquaux 6 | # License: BSD 3 clause (C) INRIA 2011 7 | 8 | import numpy as np 9 | import matplotlib.pyplot as plt 10 | from matplotlib import offsetbox 11 | from sklearn import (manifold, datasets, decomposition, ensemble, lda, 12 | random_projection) 13 | 14 | def digits_plot(): 15 | digits = datasets.load_digits(n_class=6) 16 | n_digits = 500 17 | X = digits.data[:n_digits] 18 | y = digits.target[:n_digits] 19 | n_samples, n_features = X.shape 20 | n_neighbors = 30 21 | 22 | def plot_embedding(X, title=None): 23 | x_min, x_max = np.min(X, 0), np.max(X, 0) 24 | X = (X - x_min) / (x_max - x_min) 25 | 26 | plt.figure() 27 | ax = plt.subplot(111) 28 | for i in range(X.shape[0]): 29 | plt.text(X[i, 0], X[i, 1], str(digits.target[i]), 30 | color=plt.cm.Set1(y[i] / 10.), 31 | fontdict={'weight': 'bold', 'size': 9}) 32 | 33 | if hasattr(offsetbox, 'AnnotationBbox'): 34 | # only print thumbnails with matplotlib > 1.0 35 | shown_images = np.array([[1., 1.]]) # just something big 36 | for i in range(X.shape[0]): 37 | dist = np.sum((X[i] - 
shown_images) ** 2, 1) 38 | if np.min(dist) < 1e5: 39 | # don't show points that are too close 40 | # set a high threshold to basically turn this off 41 | continue 42 | shown_images = np.r_[shown_images, [X[i]]] 43 | imagebox = offsetbox.AnnotationBbox( 44 | offsetbox.OffsetImage(digits.images[i], cmap=plt.cm.gray_r), 45 | X[i]) 46 | ax.add_artist(imagebox) 47 | plt.xticks([]), plt.yticks([]) 48 | if title is not None: 49 | plt.title(title) 50 | 51 | n_img_per_row = 10 52 | img = np.zeros((10 * n_img_per_row, 10 * n_img_per_row)) 53 | for i in range(n_img_per_row): 54 | ix = 10 * i + 1 55 | for j in range(n_img_per_row): 56 | iy = 10 * j + 1 57 | img[ix:ix + 8, iy:iy + 8] = X[i * n_img_per_row + j].reshape((8, 8)) 58 | 59 | plt.imshow(img, cmap=plt.cm.binary) 60 | plt.xticks([]) 61 | plt.yticks([]) 62 | plt.title('A selection from the 64-dimensional digits dataset') 63 | print("Computing PCA projection") 64 | pca = decomposition.PCA(n_components=2).fit(X) 65 | X_pca = pca.transform(X) 66 | plot_embedding(X_pca, "Principal Components projection of the digits") 67 | plt.figure() 68 | plt.matshow(pca.components_[0, :].reshape(8, 8), cmap="gray") 69 | plt.axis('off') 70 | plt.figure() 71 | plt.matshow(pca.components_[1, :].reshape(8, 8), cmap="gray") 72 | plt.axis('off') 73 | plt.show() 74 | -------------------------------------------------------------------------------- /figures/plot_digits_datasets.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/pfrcks/Machine-Learning-SKLearn/f7f8a50136cc6f54406859cfe693abf4b4aca19e/figures/plot_digits_datasets.pyc -------------------------------------------------------------------------------- /figures/train_test_split.svg: -------------------------------------------------------------------------------- 1 | 2 | image/svg+xmlAll Data 369 | Training data 395 | Test data 421 | -------------------------------------------------------------------------------- /figures/train_validation_test2.svg: -------------------------------------------------------------------------------- 1 | 2 | image/svg+xmlAll Data 351 | Training 373 | Test 398 | Validation 423 | -------------------------------------------------------------------------------- /figures/unsupervised_workflow.svg: -------------------------------------------------------------------------------- 1 | 2 | image/svg+xmlTraining Data 365 | Test Data 391 | Model 413 | New View 435 | -------------------------------------------------------------------------------- /scripts/classify_iris.py: -------------------------------------------------------------------------------- 1 | from sklearn import neighbors, datasets 2 | 3 | iris = datasets.load_iris() 4 | X, y = iris.data, iris.target 5 | 6 | # create the model 7 | knn = neighbors.KNeighborsClassifier(n_neighbors=5) 8 | 9 | # fit the model 10 | knn.fit(X, y) 11 | 12 | # What kind of iris has 3cm x 5cm sepal and 4cm x 2cm petal? 
13 | # call the "predict" method: 14 | result = knn.predict([[3, 5, 4, 2],]) 15 | 16 | print(iris.target_names[result]) 17 | 18 | knn.predict_proba([[3, 5, 4, 2],]) 19 | -------------------------------------------------------------------------------- /scripts/cluster_digits.py: -------------------------------------------------------------------------------- 1 | from sklearn.cluster import KMeans 2 | kmeans = KMeans(n_clusters=10) 3 | clusters = kmeans.fit_predict(digits.data) 4 | 5 | print(kmeans.cluster_centers_.shape) 6 | 7 | #------------------------------------------------------------ 8 | # visualize the cluster centers 9 | fig = plt.figure(figsize=(8, 3)) 10 | for i in range(10): 11 | ax = fig.add_subplot(2, 5, 1 + i) 12 | ax.imshow(kmeans.cluster_centers_[i].reshape((8, 8)), 13 | cmap=plt.cm.binary) 14 | from sklearn.manifold import Isomap 15 | X_iso = Isomap(n_neighbors=10).fit_transform(digits.data) 16 | 17 | #------------------------------------------------------------ 18 | # visualize the projected data 19 | fig, ax = plt.subplots(1, 2, figsize=(8, 4)) 20 | 21 | ax[0].scatter(X_iso[:, 0], X_iso[:, 1], c=clusters) 22 | ax[1].scatter(X_iso[:, 0], X_iso[:, 1], c=digits.target) 23 | -------------------------------------------------------------------------------- /scripts/knn_iris.py: -------------------------------------------------------------------------------- 1 | from sklearn import neighbors, datasets 2 | import pylab as pl 3 | import numpy as np 4 | from matplotlib.colors import ListedColormap 5 | 6 | 7 | # Create color maps for 3-class classification problem, as with iris 8 | cmap_light = ListedColormap(['#FFAAAA', '#AAFFAA', '#AAAAFF']) 9 | cmap_bold = ListedColormap(['#FF0000', '#00FF00', '#0000FF']) 10 | 11 | iris = datasets.load_iris() 12 | X, y = iris.data, iris.target 13 | 14 | # create the model 15 | knn = neighbors.KNeighborsClassifier(n_neighbors=5) 16 | 17 | # fit the model 18 | knn.fit(X, y) 19 | 20 | # What kind of iris has 3cm x 5cm sepal and 4cm x 2cm petal? 21 | # call the "predict" method: 22 | result = knn.predict([[3, 5, 4, 2],]) 23 | print(iris.target_names[result]) 24 | 25 | 26 | def plot_iris_knn(): 27 | iris = datasets.load_iris() 28 | X = iris.data[:, :2] 29 | # we only take the first two features. 
We could 30 | # avoid this ugly slicing by using a two-dim dataset 31 | y = iris.target 32 | 33 | knn = neighbors.KNeighborsClassifier(n_neighbors=3) 34 | knn.fit(X, y) 35 | 36 | x_min, x_max = X[:, 0].min() - .1, X[:, 0].max() + .1 37 | y_min, y_max = X[:, 1].min() - .1, X[:, 1].max() + .1 38 | xx, yy = np.meshgrid(np.linspace(x_min, x_max, 100), 39 | np.linspace(y_min, y_max, 100)) 40 | Z = knn.predict(np.c_[xx.ravel(), yy.ravel()]) 41 | 42 | # Put the result into a color plot 43 | Z = Z.reshape(xx.shape) 44 | pl.figure() 45 | pl.pcolormesh(xx, yy, Z, cmap=cmap_light) 46 | 47 | # Plot also the training points 48 | pl.scatter(X[:, 0], X[:, 1], c=y, cmap=cmap_bold) 49 | pl.xlabel('sepal length (cm)') 50 | pl.ylabel('sepal width (cm)') 51 | pl.axis('tight') 52 | -------------------------------------------------------------------------------- /scripts/knn_regression.py: -------------------------------------------------------------------------------- 1 | from sklearn.neighbors import KNeighborsRegressor 2 | kneighbor_regression = KNeighborsRegressor(n_neighbors=1) 3 | kneighbor_regression.fit(X_train, y_train) 4 | 5 | y_pred_train = kneighbor_regression.predict(X_train) 6 | 7 | plt.plot(X_train, y_train, 'o', label="data") 8 | plt.plot(X_train, y_pred_train, 'o', label="prediction") 9 | plt.legend(loc='best') 10 | 11 | #y_pred_test = kneighbor_regression.predict(X_test) 12 | 13 | #plt.plot(X_test, y_test, 'o', label="data") 14 | #plt.plot(X_test, y_pred_test, 'o', label="prediction") 15 | #plt.legend(loc='best') 16 | -------------------------------------------------------------------------------- /scripts/plot_digits.py: -------------------------------------------------------------------------------- 1 | # set up the figure 2 | fig = plt.figure(figsize=(6, 6)) # figure size in inches 3 | fig.subplots_adjust(left=0, right=1, bottom=0, top=1, hspace=0.05, wspace=0.05) 4 | 5 | # plot the digits: each image is 8x8 pixels 6 | for i in range(64): 7 | ax = fig.add_subplot(8, 8, i + 1, xticks=[], yticks=[]) 8 | ax.imshow(digits.images[i], cmap=plt.cm.binary, interpolation='nearest') 9 | 10 | # label the image with the target value 11 | ax.text(0, 7, str(digits.target[i])) 12 | --------------------------------------------------------------------------------
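A note on the snippet-style scripts above: cluster_digits.py, knn_regression.py and plot_digits.py reference names such as `digits`, `plt`, `X_train` and `y_train` without defining them, so they are meant to be run in a session where that setup already exists. Below is a minimal sketch of such a setup, written to match the names the scripts expect; the 1-D regression data is made up purely for illustration and is not part of the repository.

import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import load_digits

# `digits` is what plot_digits.py and cluster_digits.py expect to find
digits = load_digits()

# A made-up 1-D regression problem so that knn_regression.py has X_train/y_train;
# any small numeric dataset with this shape would do.
rng = np.random.RandomState(42)
X = 30 * rng.rand(100, 1)                   # 100 samples, 1 feature
y = 0.5 * X.ravel() + rng.normal(size=100)  # noisy linear target
X_train, X_test = X[:75], X[75:]
y_train, y_test = y[:75], y[75:]

With these names defined, the bodies of the scripts can be pasted into the same interpreter session (or a notebook cell) and run as-is.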