├── .ipynb_checkpoints ├── 01_machine_learning_intro-checkpoint.ipynb ├── 02_machine_learning_setup-checkpoint.ipynb ├── 03_getting_started_with_iris-checkpoint.ipynb ├── 04_model_training-checkpoint.ipynb ├── 05_model_evaluation-checkpoint.ipynb ├── 06_linear_regression-checkpoint.ipynb ├── 07_cross_validation-checkpoint.ipynb ├── 08_grid_search-checkpoint.ipynb ├── 09_classification_metrics-checkpoint.ipynb └── IPython Tutorial-checkpoint.ipynb ├── 01_machine_learning_intro.ipynb ├── 02_machine_learning_setup.ipynb ├── 03_getting_started_with_iris.ipynb ├── 04_model_training.ipynb ├── 05_model_evaluation.ipynb ├── 06_linear_regression.ipynb ├── 07_cross_validation.ipynb ├── 08_grid_search.ipynb ├── 09_classification_metrics.ipynb ├── README.md ├── images ├── 01_clustering.png ├── 01_robot.png ├── 01_spam_filter.png ├── 01_supervised_learning.png ├── 02_ipython_header.png ├── 02_sklearn_algorithms.png ├── 02_sklearn_logo.png ├── 03_iris.png ├── 04_1nn_map.png ├── 04_5nn_map.png ├── 04_knn_dataset.png ├── 05_overfitting.png ├── 05_train_test_split.png ├── 07_cross_validation_diagram.png ├── 09_confusion_matrix_1.png └── 09_confusion_matrix_2.png ├── ipython_tutorial.md ├── ml-with-text ├── .gitignore ├── README.md ├── data │ ├── sms.tsv │ └── yelp.csv ├── exercise.ipynb ├── exercise.py ├── exercise_solution.ipynb ├── exercise_solution.py ├── tutorial.html ├── tutorial.ipynb ├── tutorial.py ├── tutorial_with_output.ipynb └── youtube.jpg ├── requirements.txt └── styles └── custom.css /.ipynb_checkpoints/01_machine_learning_intro-checkpoint.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "## Topics\n", 8 | "\n", 9 | "1. What is machine learning?\n", 10 | "2. What are the two main categories of machine learning?\n", 11 | "3. How does machine learning work for supervised learning (predictive modelling)?\n", 12 | "4. Big questions in learning machine learning\n", 13 | "5. Resources" 14 | ] 15 | }, 16 | { 17 | "cell_type": "markdown", 18 | "metadata": {}, 19 | "source": [ 20 | "_This tutorial is derived from Data School's Machine Learning with scikit-learn tutorial. I added my own notes so anyone, including myself, can refer to this tutorial without watching the videos._" 21 | ] 22 | }, 23 | { 24 | "cell_type": "markdown", 25 | "metadata": {}, 26 | "source": [ 27 | "### 1. What is machine learning?\n", 28 | "\n", 29 | "High level definition: semi-automated extraction of knowledge from data\n", 30 | "\n", 31 | "- **Starts with data**: You need data to exact insights from it\n", 32 | "- **Knowledge from data**: Starts with a question that might be answerable using data\n", 33 | "- **Automated extraction**: A computer provides the insight\n", 34 | "- **Semi-automated**: Requires many smart decisions by a human" 35 | ] 36 | }, 37 | { 38 | "cell_type": "markdown", 39 | "metadata": {}, 40 | "source": [ 41 | "### 2. 
What are the two main categories of machine learning?\n", 42 | "\n", 43 | "**Supervised learning**: making predictions using data\n", 44 | " \n", 45 | "- This is also called predictive modelling\n", 46 | "- Example: Is a given email \"spam\" or \"ham/non-spam\"?\n", 47 | "- There is an outcome we are trying to predict" 48 | ] 49 | }, 50 | { 51 | "cell_type": "markdown", 52 | "metadata": {}, 53 | "source": [ 54 | "![Spam filter](images/01_spam_filter.png)" 55 | ] 56 | }, 57 | { 58 | "cell_type": "markdown", 59 | "metadata": {}, 60 | "source": [ 61 | "**Unsupervised learning**: extracting structure from data\n", 62 | "\n", 63 | "- Example: Segment grocery store shoppers into clusters that exhibit similar behaviors\n", 64 | "- There is no \"right answer\"" 65 | ] 66 | }, 67 | { 68 | "cell_type": "markdown", 69 | "metadata": {}, 70 | "source": [ 71 | "![Clustering](images/01_clustering.png)" 72 | ] 73 | }, 74 | { 75 | "cell_type": "markdown", 76 | "metadata": {}, 77 | "source": [ 78 | "### 3. How does machine learning work for supervised learning (predictive modelling)?\n", 79 | "\n", 80 | "2 high-level steps of supervised learning:\n", 81 | "\n", 82 | "1. Train a machine learning model using labeled data\n", 83 | " - \"Labeled data\" has been labeled with the outcome\n", 84 | " - \"Machine learning model\" learns the relationship between the attributes of the data and its outcome\n", 85 | "\n", 86 | "2. Make **predictions** on **new data** for which the label is unknown" 87 | ] 88 | }, 89 | { 90 | "cell_type": "markdown", 91 | "metadata": {}, 92 | "source": [ 93 | "![Supervised learning](images/01_supervised_learning.png)" 94 | ] 95 | }, 96 | { 97 | "cell_type": "markdown", 98 | "metadata": {}, 99 | "source": [ 100 | "The primary goal of supervised learning is to build a model that generalizes \n", 101 | "- It accurately predicts the **future** rather than the past" 102 | ] 103 | }, 104 | { 105 | "cell_type": "markdown", 106 | "metadata": {}, 107 | "source": [ 108 | "### 4. Big questions in learning machine learning\n", 109 | "\n", 110 | "- How do I choose **which attributes** of my data to include in the model?\n", 111 | "- How do I choose **which model** to use?\n", 112 | "- How do I **optimize** this model for best performance?\n", 113 | "- How do I ensure that I'm building a model that will **generalize** to unseen data?\n", 114 | "- Can I **estimate** how well my model is likely to perform on unseen data?" 115 | ] 116 | }, 117 | { 118 | "cell_type": "markdown", 119 | "metadata": {}, 120 | "source": [ 121 | "### 5. 
Resources\n", 122 | "\n", 123 | "- Book: [An Introduction to Statistical Learning](http://www-bcf.usc.edu/~gareth/ISL/) (section 2.1, 14 pages)\n", 124 | "- Video: [Learning Paradigms](http://work.caltech.edu/library/014.html) (13 minutes)" 125 | ] 126 | } 127 | ], 128 | "metadata": { 129 | "kernelspec": { 130 | "display_name": "Python 2", 131 | "language": "python", 132 | "name": "python2" 133 | }, 134 | "language_info": { 135 | "codemirror_mode": { 136 | "name": "ipython", 137 | "version": 2 138 | }, 139 | "file_extension": ".py", 140 | "mimetype": "text/x-python", 141 | "name": "python", 142 | "nbconvert_exporter": "python", 143 | "pygments_lexer": "ipython2", 144 | "version": "2.7.11" 145 | } 146 | }, 147 | "nbformat": 4, 148 | "nbformat_minor": 0 149 | } 150 | -------------------------------------------------------------------------------- /.ipynb_checkpoints/02_machine_learning_setup-checkpoint.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Setting up Python for machine learning: scikit-learn and IPython Notebook" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "## Agenda\n", 15 | "\n", 16 | "- What are the benefits and drawbacks of scikit-learn?\n", 17 | "- How do I install scikit-learn?\n", 18 | "- How do I use the IPython Notebook?\n", 19 | "- What are some good resources for learning Python?" 20 | ] 21 | }, 22 | { 23 | "cell_type": "markdown", 24 | "metadata": {}, 25 | "source": [ 26 | "_This IPython notebook is derived from Data School's introduction to machine learning. All credits to Data School._" 27 | ] 28 | }, 29 | { 30 | "cell_type": "markdown", 31 | "metadata": {}, 32 | "source": [ 33 | "![scikit-learn algorithm map](images/02_sklearn_algorithms.png)" 34 | ] 35 | }, 36 | { 37 | "cell_type": "markdown", 38 | "metadata": {}, 39 | "source": [ 40 | "## Benefits and drawbacks of scikit-learn\n", 41 | "\n", 42 | "### Benefits:\n", 43 | "\n", 44 | "- **Consistent interface** to machine learning models\n", 45 | "- Provides many **tuning parameters** but with **sensible defaults**\n", 46 | "- Exceptional **documentation**\n", 47 | "- Rich set of functionality for **companion tasks**\n", 48 | "- **Active community** for development and support\n", 49 | "\n", 50 | "### Potential drawbacks:\n", 51 | "\n", 52 | "- Harder (than R) to **get started with machine learning**\n", 53 | "- Less emphasis (than R) on **model interpretability**\n", 54 | "\n", 55 | "### Further reading:\n", 56 | "\n", 57 | "- Ben Lorica: [Six reasons why I recommend scikit-learn](http://radar.oreilly.com/2013/12/six-reasons-why-i-recommend-scikit-learn.html)\n", 58 | "- scikit-learn authors: [API design for machine learning software](http://arxiv.org/pdf/1309.0238v1.pdf)\n", 59 | "- Data School: [Should you teach Python or R for data science?](http://www.dataschool.io/python-or-r-for-data-science/)" 60 | ] 61 | }, 62 | { 63 | "cell_type": "markdown", 64 | "metadata": {}, 65 | "source": [ 66 | "![scikit-learn logo](images/02_sklearn_logo.png)" 67 | ] 68 | }, 69 | { 70 | "cell_type": "markdown", 71 | "metadata": {}, 72 | "source": [ 73 | "## Installing scikit-learn\n", 74 | "\n", 75 | "**Option 1:** [Install scikit-learn library](http://scikit-learn.org/stable/install.html) and dependencies (NumPy and SciPy)\n", 76 | "\n", 77 | "**Option 2:** [Install Anaconda distribution](https://store.continuum.io/cshop/anaconda/) of Python, which 
includes:\n", 78 | "\n", 79 | "- Hundreds of useful packages (including scikit-learn)\n", 80 | "- IPython and IPython Notebook\n", 81 | "- conda package manager\n", 82 | "- Spyder IDE" 83 | ] 84 | }, 85 | { 86 | "cell_type": "markdown", 87 | "metadata": {}, 88 | "source": [ 89 | "![IPython header](images/02_ipython_header.png)" 90 | ] 91 | }, 92 | { 93 | "cell_type": "markdown", 94 | "metadata": {}, 95 | "source": [ 96 | "## Using the IPython Notebook\n", 97 | "\n", 98 | "### Components:\n", 99 | "\n", 100 | "- **IPython interpreter:** enhanced version of the standard Python interpreter\n", 101 | "- **Browser-based notebook interface:** weave together code, formatted text, and plots\n", 102 | "\n", 103 | "### Installation:\n", 104 | "\n", 105 | "- **Option 1:** [Install IPython and the notebook](http://ipython.org/install.html)\n", 106 | "- **Option 2:** Included with the Anaconda distribution\n", 107 | "\n", 108 | "### Launching the Notebook:\n", 109 | "\n", 110 | "- Type **ipython notebook** at the command line to open the dashboard\n", 111 | "- Don't close the command line window while the Notebook is running\n", 112 | "\n", 113 | "### Keyboard shortcuts:\n", 114 | "\n", 115 | "**Command mode** (gray border)\n", 116 | "\n", 117 | "- Create new cells above (**a**) or below (**b**) the current cell\n", 118 | "- Navigate using the **up arrow** and **down arrow**\n", 119 | "- Convert the cell type to Markdown (**m**) or code (**y**)\n", 120 | "- See keyboard shortcuts using **h**\n", 121 | "- Switch to Edit mode using **Enter**\n", 122 | "\n", 123 | "**Edit mode** (green border)\n", 124 | "\n", 125 | "- **Ctrl+Enter** to run a cell\n", 126 | "- Switch to Command mode using **Esc**\n", 127 | "\n", 128 | "### IPython and Markdown resources:\n", 129 | "\n", 130 | "- [nbviewer](http://nbviewer.ipython.org/): view notebooks online as static documents\n", 131 | "- [IPython documentation](http://ipython.org/ipython-doc/stable/index.html): focuses on the interpreter\n", 132 | "- [IPython Notebook tutorials](https://github.com/jupyter/notebook/blob/master/docs/source/examples/Notebook/Examples%20and%20Tutorials%20Index.ipynb): in-depth introduction\n", 133 | "- [GitHub's Mastering Markdown](https://guides.github.com/features/mastering-markdown/): short guide with lots of examples" 134 | ] 135 | }, 136 | { 137 | "cell_type": "markdown", 138 | "metadata": {}, 139 | "source": [ 140 | "## Resources for learning Python\n", 141 | "\n", 142 | "- [Codecademy's Python course](http://www.codecademy.com/en/tracks/python): browser-based, tons of exercises\n", 143 | "- [DataQuest](https://dataquest.io/missions): browser-based, teaches Python in the context of data science\n", 144 | "- [Google's Python class](https://developers.google.com/edu/python/): slightly more advanced, includes videos and downloadable exercises (with solutions)\n", 145 | "- [Python for Informatics](http://www.pythonlearn.com/): beginner-oriented book, includes slides and videos" 146 | ] 147 | } 148 | ], 149 | "metadata": { 150 | "kernelspec": { 151 | "display_name": "Python 2", 152 | "language": "python", 153 | "name": "python2" 154 | }, 155 | "language_info": { 156 | "codemirror_mode": { 157 | "name": "ipython", 158 | "version": 2 159 | }, 160 | "file_extension": ".py", 161 | "mimetype": "text/x-python", 162 | "name": "python", 163 | "nbconvert_exporter": "python", 164 | "pygments_lexer": "ipython2", 165 | "version": "2.7.11" 166 | } 167 | }, 168 | "nbformat": 4, 169 | "nbformat_minor": 0 170 | } 171 | 
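A quick way to confirm that the scikit-learn installation described in the setup notebook above worked is to import the packages and print their versions — a minimal sanity check, assuming scikit-learn, NumPy, and SciPy were installed via one of the two options given:

```python
# verify the installation by importing each dependency and printing its version
import sklearn
import numpy
import scipy

print("scikit-learn:", sklearn.__version__)
print("NumPy:", numpy.__version__)
print("SciPy:", scipy.__version__)
```

If any of these imports fails, re-run the installer for that package before continuing with the notebooks.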
-------------------------------------------------------------------------------- /.ipynb_checkpoints/03_getting_started_with_iris-checkpoint.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "## Topics\n", 8 | "1. About Iris dataset\n", 9 | "2. Display Iris dataset\n", 10 | "3. Supervised learning on Iris dataset\n", 11 | "4. Loading the Iris dataset into scikit-learn\n", 12 | "5. Machine learning terminology\n", 13 | "6. Exploring the Iris dataset\n", 14 | "7. Requirements for working with datasets in scikit-learn\n", 15 | "8. Additional resources\n" 16 | ] 17 | }, 18 | { 19 | "cell_type": "markdown", 20 | "metadata": {}, 21 | "source": [ 22 | "_This tutorial is derived from Data School's Machine Learning with scikit-learn tutorial. I added my own notes so anyone, including myself, can refer to this tutorial without watching the videos._" 23 | ] 24 | }, 25 | { 26 | "cell_type": "markdown", 27 | "metadata": {}, 28 | "source": [ 29 | "### 1. About Iris dataset" 30 | ] 31 | }, 32 | { 33 | "cell_type": "markdown", 34 | "metadata": {}, 35 | "source": [ 36 | "![Iris](images/03_iris.png)" 37 | ] 38 | }, 39 | { 40 | "cell_type": "markdown", 41 | "metadata": {}, 42 | "source": [ 43 | "- The iris dataset contains the following data\n", 44 | " - 50 samples of 3 different species of iris (150 samples total)\n", 45 | " - Measurements: sepal length, sepal width, petal length, petal width\n", 46 | "- The format for the data:\n", 47 | "(sepal length, sepal width, petal length, petal width)" 48 | ] 49 | }, 50 | { 51 | "cell_type": "markdown", 52 | "metadata": {}, 53 | "source": [ 54 | "### 2. Display Iris Dataset" 55 | ] 56 | }, 57 | { 58 | "cell_type": "code", 59 | "execution_count": 1, 60 | "metadata": { 61 | "collapsed": false 62 | }, 63 | "outputs": [ 64 | { 65 | "data": { 66 | "text/html": [ 67 | "" 68 | ], 69 | "text/plain": [ 70 | "" 71 | ] 72 | }, 73 | "execution_count": 1, 74 | "metadata": {}, 75 | "output_type": "execute_result" 76 | } 77 | ], 78 | "source": [ 79 | "# Display HTML using IPython.display module\n", 80 | "# You can display any other HTML using this module too\n", 81 | "# Just replace the link with your desired HTML page\n", 82 | "from IPython.display import HTML\n", 83 | "HTML('')" 84 | ] 85 | }, 86 | { 87 | "cell_type": "markdown", 88 | "metadata": {}, 89 | "source": [ 90 | "### 3. Supervised learning on the iris dataset\n", 91 | "\n", 92 | "- Framed as a **supervised learning** problem\n", 93 | " - Predict the species of an iris using the measurements\n", 94 | "- Famous dataset for machine learning because prediction is **easy**\n", 95 | "- Learn more about the iris dataset: [UCI Machine Learning Repository](http://archive.ics.uci.edu/ml/datasets/Iris)" 96 | ] 97 | }, 98 | { 99 | "cell_type": "markdown", 100 | "metadata": {}, 101 | "source": [ 102 | "### 4. 
Loading the iris dataset into scikit-learn" 103 | ] 104 | }, 105 | { 106 | "cell_type": "code", 107 | "execution_count": 2, 108 | "metadata": { 109 | "collapsed": false 110 | }, 111 | "outputs": [], 112 | "source": [ 113 | "# import load_iris function from datasets module\n", 114 | "# convention is to import modules instead of sklearn as a whole\n", 115 | "from sklearn.datasets import load_iris" 116 | ] 117 | }, 118 | { 119 | "cell_type": "code", 120 | "execution_count": 3, 121 | "metadata": { 122 | "collapsed": false 123 | }, 124 | "outputs": [ 125 | { 126 | "data": { 127 | "text/plain": [ 128 | "sklearn.datasets.base.Bunch" 129 | ] 130 | }, 131 | "execution_count": 3, 132 | "metadata": {}, 133 | "output_type": "execute_result" 134 | } 135 | ], 136 | "source": [ 137 | "# save \"bunch\" object containing iris dataset and its attributes\n", 138 | "# the data type is \"bunch\"\n", 139 | "iris = load_iris()\n", 140 | "type(iris)" 141 | ] 142 | }, 143 | { 144 | "cell_type": "code", 145 | "execution_count": 5, 146 | "metadata": { 147 | "collapsed": false 148 | }, 149 | "outputs": [ 150 | { 151 | "name": "stdout", 152 | "output_type": "stream", 153 | "text": [ 154 | "[[ 5.1 3.5 1.4 0.2]\n", 155 | " [ 4.9 3. 1.4 0.2]\n", 156 | " [ 4.7 3.2 1.3 0.2]\n", 157 | " [ 4.6 3.1 1.5 0.2]\n", 158 | " [ 5. 3.6 1.4 0.2]\n", 159 | " [ 5.4 3.9 1.7 0.4]\n", 160 | " [ 4.6 3.4 1.4 0.3]\n", 161 | " [ 5. 3.4 1.5 0.2]\n", 162 | " [ 4.4 2.9 1.4 0.2]\n", 163 | " [ 4.9 3.1 1.5 0.1]\n", 164 | " [ 5.4 3.7 1.5 0.2]\n", 165 | " [ 4.8 3.4 1.6 0.2]\n", 166 | " [ 4.8 3. 1.4 0.1]\n", 167 | " [ 4.3 3. 1.1 0.1]\n", 168 | " [ 5.8 4. 1.2 0.2]\n", 169 | " [ 5.7 4.4 1.5 0.4]\n", 170 | " [ 5.4 3.9 1.3 0.4]\n", 171 | " [ 5.1 3.5 1.4 0.3]\n", 172 | " [ 5.7 3.8 1.7 0.3]\n", 173 | " [ 5.1 3.8 1.5 0.3]\n", 174 | " [ 5.4 3.4 1.7 0.2]\n", 175 | " [ 5.1 3.7 1.5 0.4]\n", 176 | " [ 4.6 3.6 1. 0.2]\n", 177 | " [ 5.1 3.3 1.7 0.5]\n", 178 | " [ 4.8 3.4 1.9 0.2]\n", 179 | " [ 5. 3. 1.6 0.2]\n", 180 | " [ 5. 3.4 1.6 0.4]\n", 181 | " [ 5.2 3.5 1.5 0.2]\n", 182 | " [ 5.2 3.4 1.4 0.2]\n", 183 | " [ 4.7 3.2 1.6 0.2]\n", 184 | " [ 4.8 3.1 1.6 0.2]\n", 185 | " [ 5.4 3.4 1.5 0.4]\n", 186 | " [ 5.2 4.1 1.5 0.1]\n", 187 | " [ 5.5 4.2 1.4 0.2]\n", 188 | " [ 4.9 3.1 1.5 0.1]\n", 189 | " [ 5. 3.2 1.2 0.2]\n", 190 | " [ 5.5 3.5 1.3 0.2]\n", 191 | " [ 4.9 3.1 1.5 0.1]\n", 192 | " [ 4.4 3. 1.3 0.2]\n", 193 | " [ 5.1 3.4 1.5 0.2]\n", 194 | " [ 5. 3.5 1.3 0.3]\n", 195 | " [ 4.5 2.3 1.3 0.3]\n", 196 | " [ 4.4 3.2 1.3 0.2]\n", 197 | " [ 5. 3.5 1.6 0.6]\n", 198 | " [ 5.1 3.8 1.9 0.4]\n", 199 | " [ 4.8 3. 1.4 0.3]\n", 200 | " [ 5.1 3.8 1.6 0.2]\n", 201 | " [ 4.6 3.2 1.4 0.2]\n", 202 | " [ 5.3 3.7 1.5 0.2]\n", 203 | " [ 5. 3.3 1.4 0.2]\n", 204 | " [ 7. 3.2 4.7 1.4]\n", 205 | " [ 6.4 3.2 4.5 1.5]\n", 206 | " [ 6.9 3.1 4.9 1.5]\n", 207 | " [ 5.5 2.3 4. 1.3]\n", 208 | " [ 6.5 2.8 4.6 1.5]\n", 209 | " [ 5.7 2.8 4.5 1.3]\n", 210 | " [ 6.3 3.3 4.7 1.6]\n", 211 | " [ 4.9 2.4 3.3 1. ]\n", 212 | " [ 6.6 2.9 4.6 1.3]\n", 213 | " [ 5.2 2.7 3.9 1.4]\n", 214 | " [ 5. 2. 3.5 1. ]\n", 215 | " [ 5.9 3. 4.2 1.5]\n", 216 | " [ 6. 2.2 4. 1. ]\n", 217 | " [ 6.1 2.9 4.7 1.4]\n", 218 | " [ 5.6 2.9 3.6 1.3]\n", 219 | " [ 6.7 3.1 4.4 1.4]\n", 220 | " [ 5.6 3. 4.5 1.5]\n", 221 | " [ 5.8 2.7 4.1 1. ]\n", 222 | " [ 6.2 2.2 4.5 1.5]\n", 223 | " [ 5.6 2.5 3.9 1.1]\n", 224 | " [ 5.9 3.2 4.8 1.8]\n", 225 | " [ 6.1 2.8 4. 1.3]\n", 226 | " [ 6.3 2.5 4.9 1.5]\n", 227 | " [ 6.1 2.8 4.7 1.2]\n", 228 | " [ 6.4 2.9 4.3 1.3]\n", 229 | " [ 6.6 3. 4.4 1.4]\n", 230 | " [ 6.8 2.8 4.8 1.4]\n", 231 | " [ 6.7 3. 5. 
1.7]\n", 232 | " [ 6. 2.9 4.5 1.5]\n", 233 | " [ 5.7 2.6 3.5 1. ]\n", 234 | " [ 5.5 2.4 3.8 1.1]\n", 235 | " [ 5.5 2.4 3.7 1. ]\n", 236 | " [ 5.8 2.7 3.9 1.2]\n", 237 | " [ 6. 2.7 5.1 1.6]\n", 238 | " [ 5.4 3. 4.5 1.5]\n", 239 | " [ 6. 3.4 4.5 1.6]\n", 240 | " [ 6.7 3.1 4.7 1.5]\n", 241 | " [ 6.3 2.3 4.4 1.3]\n", 242 | " [ 5.6 3. 4.1 1.3]\n", 243 | " [ 5.5 2.5 4. 1.3]\n", 244 | " [ 5.5 2.6 4.4 1.2]\n", 245 | " [ 6.1 3. 4.6 1.4]\n", 246 | " [ 5.8 2.6 4. 1.2]\n", 247 | " [ 5. 2.3 3.3 1. ]\n", 248 | " [ 5.6 2.7 4.2 1.3]\n", 249 | " [ 5.7 3. 4.2 1.2]\n", 250 | " [ 5.7 2.9 4.2 1.3]\n", 251 | " [ 6.2 2.9 4.3 1.3]\n", 252 | " [ 5.1 2.5 3. 1.1]\n", 253 | " [ 5.7 2.8 4.1 1.3]\n", 254 | " [ 6.3 3.3 6. 2.5]\n", 255 | " [ 5.8 2.7 5.1 1.9]\n", 256 | " [ 7.1 3. 5.9 2.1]\n", 257 | " [ 6.3 2.9 5.6 1.8]\n", 258 | " [ 6.5 3. 5.8 2.2]\n", 259 | " [ 7.6 3. 6.6 2.1]\n", 260 | " [ 4.9 2.5 4.5 1.7]\n", 261 | " [ 7.3 2.9 6.3 1.8]\n", 262 | " [ 6.7 2.5 5.8 1.8]\n", 263 | " [ 7.2 3.6 6.1 2.5]\n", 264 | " [ 6.5 3.2 5.1 2. ]\n", 265 | " [ 6.4 2.7 5.3 1.9]\n", 266 | " [ 6.8 3. 5.5 2.1]\n", 267 | " [ 5.7 2.5 5. 2. ]\n", 268 | " [ 5.8 2.8 5.1 2.4]\n", 269 | " [ 6.4 3.2 5.3 2.3]\n", 270 | " [ 6.5 3. 5.5 1.8]\n", 271 | " [ 7.7 3.8 6.7 2.2]\n", 272 | " [ 7.7 2.6 6.9 2.3]\n", 273 | " [ 6. 2.2 5. 1.5]\n", 274 | " [ 6.9 3.2 5.7 2.3]\n", 275 | " [ 5.6 2.8 4.9 2. ]\n", 276 | " [ 7.7 2.8 6.7 2. ]\n", 277 | " [ 6.3 2.7 4.9 1.8]\n", 278 | " [ 6.7 3.3 5.7 2.1]\n", 279 | " [ 7.2 3.2 6. 1.8]\n", 280 | " [ 6.2 2.8 4.8 1.8]\n", 281 | " [ 6.1 3. 4.9 1.8]\n", 282 | " [ 6.4 2.8 5.6 2.1]\n", 283 | " [ 7.2 3. 5.8 1.6]\n", 284 | " [ 7.4 2.8 6.1 1.9]\n", 285 | " [ 7.9 3.8 6.4 2. ]\n", 286 | " [ 6.4 2.8 5.6 2.2]\n", 287 | " [ 6.3 2.8 5.1 1.5]\n", 288 | " [ 6.1 2.6 5.6 1.4]\n", 289 | " [ 7.7 3. 6.1 2.3]\n", 290 | " [ 6.3 3.4 5.6 2.4]\n", 291 | " [ 6.4 3.1 5.5 1.8]\n", 292 | " [ 6. 3. 4.8 1.8]\n", 293 | " [ 6.9 3.1 5.4 2.1]\n", 294 | " [ 6.7 3.1 5.6 2.4]\n", 295 | " [ 6.9 3.1 5.1 2.3]\n", 296 | " [ 5.8 2.7 5.1 1.9]\n", 297 | " [ 6.8 3.2 5.9 2.3]\n", 298 | " [ 6.7 3.3 5.7 2.5]\n", 299 | " [ 6.7 3. 5.2 2.3]\n", 300 | " [ 6.3 2.5 5. 1.9]\n", 301 | " [ 6.5 3. 5.2 2. ]\n", 302 | " [ 6.2 3.4 5.4 2.3]\n", 303 | " [ 5.9 3. 5.1 1.8]]\n" 304 | ] 305 | } 306 | ], 307 | "source": [ 308 | "# print the iris data\n", 309 | "# same data as shown previously\n", 310 | "# each row represents each sample\n", 311 | "# each column represents the features\n", 312 | "print(iris.data)" 313 | ] 314 | }, 315 | { 316 | "cell_type": "markdown", 317 | "metadata": {}, 318 | "source": [ 319 | "### 5. Machine learning terminology\n", 320 | "\n", 321 | "- Each row is an **observation** (also known as: sample, example, instance, record)\n", 322 | "- Each column is a **feature** (also known as: predictor, attribute, independent variable, input, regressor, covariate)" 323 | ] 324 | }, 325 | { 326 | "cell_type": "markdown", 327 | "metadata": {}, 328 | "source": [ 329 | "### 6. 
Exploring the iris dataset" 330 | ] 331 | }, 332 | { 333 | "cell_type": "code", 334 | "execution_count": null, 335 | "metadata": { 336 | "collapsed": false 337 | }, 338 | "outputs": [], 339 | "source": [ 340 | "# print the names of the four features\n", 341 | "print iris.feature_names" 342 | ] 343 | }, 344 | { 345 | "cell_type": "code", 346 | "execution_count": null, 347 | "metadata": { 348 | "collapsed": false 349 | }, 350 | "outputs": [], 351 | "source": [ 352 | "# print integers representing the species of each observation\n", 353 | "# 0, 1, and 2 represent different species\n", 354 | "print iris.target" 355 | ] 356 | }, 357 | { 358 | "cell_type": "code", 359 | "execution_count": null, 360 | "metadata": { 361 | "collapsed": false 362 | }, 363 | "outputs": [], 364 | "source": [ 365 | "# print the encoding scheme for species: 0 = setosa, 1 = versicolor, 2 = virginica\n", 366 | "print iris.target_names" 367 | ] 368 | }, 369 | { 370 | "cell_type": "markdown", 371 | "metadata": {}, 372 | "source": [ 373 | "- Each value we are predicting is the **response** (also known as: target, outcome, label, dependent variable)\n", 374 | "- **Classification** is supervised learning in which the response is categorical\n", 375 | " - \"0\": setosa\n", 376 | " - \"1\": versicolor\n", 377 | " - \"2\": virginica\n", 378 | "- **Regression** is supervised learning in which the response is ordered and continuous\n", 379 | " - any number (continuous)" 380 | ] 381 | }, 382 | { 383 | "cell_type": "markdown", 384 | "metadata": {}, 385 | "source": [ 386 | "### 7. Requirements for working with data in scikit-learn\n", 387 | "\n", 388 | "1. Features and response are **separate objects**\n", 389 | " - In this case, data and target are separate\n", 390 | "2. Features and response should be **numeric**\n", 391 | " - In this case, features and response are numeric with the matrix dimension of 150 x 4\n", 392 | "3. Features and response should be **NumPy arrays**\n", 393 | " - The iris dataset contains NumPy arrays already\n", 394 | " - For other dataset, by loading them into NumPy\n", 395 | "4. 
Features and response should have **specific shapes**\n", 396 | " - 150 x 4 for whole dataset\n", 397 | " - 150 x 1 for examples\n", 398 | " - 4 x 1 for features\n", 399 | " - you can convert the matrix accordingly using np.tile(a, [4, 1]), where a is the matrix and [4, 1] is the intended matrix dimensionality" 400 | ] 401 | }, 402 | { 403 | "cell_type": "code", 404 | "execution_count": 6, 405 | "metadata": { 406 | "collapsed": false 407 | }, 408 | "outputs": [ 409 | { 410 | "name": "stdout", 411 | "output_type": "stream", 412 | "text": [ 413 | "\n", 414 | "\n" 415 | ] 416 | } 417 | ], 418 | "source": [ 419 | "# check the types of the features and response\n", 420 | "print(type(iris.data))\n", 421 | "print(type(iris.target))" 422 | ] 423 | }, 424 | { 425 | "cell_type": "code", 426 | "execution_count": 8, 427 | "metadata": { 428 | "collapsed": false 429 | }, 430 | "outputs": [ 431 | { 432 | "name": "stdout", 433 | "output_type": "stream", 434 | "text": [ 435 | "(150, 4)\n" 436 | ] 437 | } 438 | ], 439 | "source": [ 440 | "# check the shape of the features (first dimension = number of observations, second dimensions = number of features)\n", 441 | "print(iris.data.shape)" 442 | ] 443 | }, 444 | { 445 | "cell_type": "code", 446 | "execution_count": 7, 447 | "metadata": { 448 | "collapsed": false 449 | }, 450 | "outputs": [ 451 | { 452 | "name": "stdout", 453 | "output_type": "stream", 454 | "text": [ 455 | "(150,)\n" 456 | ] 457 | } 458 | ], 459 | "source": [ 460 | "# check the shape of the response (single dimension matching the number of observations)\n", 461 | "print(iris.target.shape)" 462 | ] 463 | }, 464 | { 465 | "cell_type": "code", 466 | "execution_count": null, 467 | "metadata": { 468 | "collapsed": false 469 | }, 470 | "outputs": [], 471 | "source": [ 472 | "# store feature matrix in \"X\"\n", 473 | "X = iris.data\n", 474 | "\n", 475 | "# store response vector in \"y\"\n", 476 | "y = iris.target" 477 | ] 478 | }, 479 | { 480 | "cell_type": "markdown", 481 | "metadata": {}, 482 | "source": [ 483 | "### 8. Resources\n", 484 | "\n", 485 | "- scikit-learn documentation: [Dataset loading utilities](http://scikit-learn.org/stable/datasets/)\n", 486 | "- Jake VanderPlas: Fast Numerical Computing with NumPy ([slides](https://speakerdeck.com/jakevdp/losing-your-loops-fast-numerical-computing-with-numpy-pycon-2015), [video](https://www.youtube.com/watch?v=EEUXKG97YRw))\n", 487 | "- Scott Shell: [An Introduction to NumPy](http://www.engr.ucsb.edu/~shell/che210d/numpy.pdf) (PDF)" 488 | ] 489 | } 490 | ], 491 | "metadata": { 492 | "kernelspec": { 493 | "display_name": "Python [py3k]", 494 | "language": "python", 495 | "name": "Python [py3k]" 496 | }, 497 | "language_info": { 498 | "codemirror_mode": { 499 | "name": "ipython", 500 | "version": 3 501 | }, 502 | "file_extension": ".py", 503 | "mimetype": "text/x-python", 504 | "name": "python", 505 | "nbconvert_exporter": "python", 506 | "pygments_lexer": "ipython3", 507 | "version": "3.5.2" 508 | } 509 | }, 510 | "nbformat": 4, 511 | "nbformat_minor": 0 512 | } 513 | -------------------------------------------------------------------------------- /.ipynb_checkpoints/04_model_training-checkpoint.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "## Topics\n", 8 | "1. Reviewing the iris dataset\n", 9 | "2. K-nearest neighbors (KNN) classification\n", 10 | "3. Example of training data\n", 11 | "4. 
KNN classification map (K = 1)\n", 12 | "5. KNN classification map (K = 5)\n", 13 | "6. Loading the data\n", 14 | "7. scikit-learn 4-step modeling pattern\n", 15 | "8. Using a different value for K\n", 16 | "9. Using a different classification model (logistic regression)\n", 17 | "10. Different values explanation\n", 18 | "11. Resources" 19 | ] 20 | }, 21 | { 22 | "cell_type": "markdown", 23 | "metadata": {}, 24 | "source": [ 25 | "_This tutorial is derived from Data School's Machine Learning with scikit-learn tutorial. I added my own notes so anyone, including myself, can refer to this tutorial without watching the videos._" 26 | ] 27 | }, 28 | { 29 | "cell_type": "markdown", 30 | "metadata": {}, 31 | "source": [ 32 | "### 1. Reviewing the iris dataset" 33 | ] 34 | }, 35 | { 36 | "cell_type": "code", 37 | "execution_count": 1, 38 | "metadata": { 39 | "collapsed": false 40 | }, 41 | "outputs": [ 42 | { 43 | "data": { 44 | "text/html": [ 45 | "" 46 | ], 47 | "text/plain": [ 48 | "" 49 | ] 50 | }, 51 | "execution_count": 1, 52 | "metadata": {}, 53 | "output_type": "execute_result" 54 | } 55 | ], 56 | "source": [ 57 | "from IPython.display import HTML\n", 58 | "HTML('')" 59 | ] 60 | }, 61 | { 62 | "cell_type": "markdown", 63 | "metadata": {}, 64 | "source": [ 65 | "- 150 **observations**\n", 66 | "- 4 **features** (sepal length, sepal width, petal length, petal width)\n", 67 | "- **Response** variable is the iris species\n", 68 | "- **Classification** problem since response is categorical\n", 69 | "- More information in the [UCI Machine Learning Repository](http://archive.ics.uci.edu/ml/datasets/Iris)" 70 | ] 71 | }, 72 | { 73 | "cell_type": "markdown", 74 | "metadata": {}, 75 | "source": [ 76 | "### 2. K-nearest neighbors (KNN) classification" 77 | ] 78 | }, 79 | { 80 | "cell_type": "markdown", 81 | "metadata": {}, 82 | "source": [ 83 | "In essence, it's to classify your training data into groups that are similar." 84 | ] 85 | }, 86 | { 87 | "cell_type": "markdown", 88 | "metadata": {}, 89 | "source": [ 90 | "1. Pick a value for K.\n", 91 | "2. Search for the K observations in the training data that are \"nearest\" to the measurements of the unknown iris.\n", 92 | "3. Use the most popular response value from the K nearest neighbors as the predicted response value for the unknown iris." 93 | ] 94 | }, 95 | { 96 | "cell_type": "markdown", 97 | "metadata": {}, 98 | "source": [ 99 | "### 3. Example training data\n", 100 | "\n", 101 | "![Training data](images/04_knn_dataset.png)" 102 | ] 103 | }, 104 | { 105 | "cell_type": "markdown", 106 | "metadata": {}, 107 | "source": [ 108 | "- Dataset with 2 numerical features (x and y coordinates)\n", 109 | "- Each point: observation\n", 110 | "- Colour of point: response class (red, blue or green)" 111 | ] 112 | }, 113 | { 114 | "cell_type": "markdown", 115 | "metadata": {}, 116 | "source": [ 117 | "### 4. KNN classification map (K=1)\n", 118 | "\n", 119 | "![1NN classification map](images/04_1nn_map.png)" 120 | ] 121 | }, 122 | { 123 | "cell_type": "markdown", 124 | "metadata": {}, 125 | "source": [ 126 | "- Background colour: predicted response value for a new observation" 127 | ] 128 | }, 129 | { 130 | "cell_type": "markdown", 131 | "metadata": {}, 132 | "source": [ 133 | "### 5. 
KNN classification map (K=5)\n", 134 | "\n", 135 | "![5NN classification map](images/04_5nn_map.png)" 136 | ] 137 | }, 138 | { 139 | "cell_type": "markdown", 140 | "metadata": {}, 141 | "source": [ 142 | "- Decision boundaries have changed in this scenario\n", 143 | "- White areas: KNN cannot make a clear decision because there's a tie between 2 classes\n", 144 | "- It can make good predictions if the features have very dissimilar values" 145 | ] 146 | }, 147 | { 148 | "cell_type": "markdown", 149 | "metadata": {}, 150 | "source": [ 151 | "*Image Credits: [Data3classes](http://commons.wikimedia.org/wiki/File:Data3classes.png#/media/File:Data3classes.png), [Map1NN](http://commons.wikimedia.org/wiki/File:Map1NN.png#/media/File:Map1NN.png), [Map5NN](http://commons.wikimedia.org/wiki/File:Map5NN.png#/media/File:Map5NN.png) by Agor153. Licensed under CC BY-SA 3.0*" 152 | ] 153 | }, 154 | { 155 | "cell_type": "markdown", 156 | "metadata": {}, 157 | "source": [ 158 | "### 6. Loading the data" 159 | ] 160 | }, 161 | { 162 | "cell_type": "code", 163 | "execution_count": 2, 164 | "metadata": { 165 | "collapsed": false 166 | }, 167 | "outputs": [], 168 | "source": [ 169 | "# import load_iris function from datasets module\n", 170 | "from sklearn.datasets import load_iris\n", 171 | "\n", 172 | "# save \"bunch\" object containing iris dataset and its attributes\n", 173 | "iris = load_iris()\n", 174 | "\n", 175 | "# discover what data and target are\n", 176 | "# print iris.data\n", 177 | "# print iris.target\n", 178 | "\n", 179 | "# store feature matrix in \"X\"\n", 180 | "# we use an uppercase for \"X\" because it is a matrix of m x n dimension\n", 181 | "X = iris.data\n", 182 | "\n", 183 | "# store response vector in \"y\"\n", 184 | "# we use a lowercase for \"y\" because it is a vector or m x 1 dimension\n", 185 | "y = iris.target" 186 | ] 187 | }, 188 | { 189 | "cell_type": "code", 190 | "execution_count": 3, 191 | "metadata": { 192 | "collapsed": false 193 | }, 194 | "outputs": [ 195 | { 196 | "name": "stdout", 197 | "output_type": "stream", 198 | "text": [ 199 | "(150, 4)\n", 200 | "(150,)\n" 201 | ] 202 | } 203 | ], 204 | "source": [ 205 | "# print the shapes of X and y\n", 206 | "print(X.shape)\n", 207 | "print(y.shape)" 208 | ] 209 | }, 210 | { 211 | "cell_type": "markdown", 212 | "metadata": {}, 213 | "source": [ 214 | "### 7. scikit-learn 4-step modeling pattern" 215 | ] 216 | }, 217 | { 218 | "cell_type": "markdown", 219 | "metadata": {}, 220 | "source": [ 221 | "_Scikit-learn uses a common modeling pattern to use its algorithms. 
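(For orientation, the whole pattern fits in a few lines — a minimal sketch, assuming the iris `X` and `y` loaded in the cell above; the four steps are spelled out one by one below.)

```python
# the common scikit-learn modeling pattern, condensed into one cell;
# assumes X (feature matrix) and y (response vector) from the iris data above
from sklearn.neighbors import KNeighborsClassifier   # Step 1: import the class

knn = KNeighborsClassifier(n_neighbors=1)            # Step 2: instantiate the estimator
knn.fit(X, y)                                        # Step 3: fit the model with data
print(knn.predict([[3, 5, 4, 2]]))                   # Step 4: predict for a new observation
```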
The steps are similar for using other algorithms._" 222 | ] 223 | }, 224 | { 225 | "cell_type": "markdown", 226 | "metadata": {}, 227 | "source": [ 228 | "**Step 1:** Import the class you plan to use" 229 | ] 230 | }, 231 | { 232 | "cell_type": "code", 233 | "execution_count": 4, 234 | "metadata": { 235 | "collapsed": false 236 | }, 237 | "outputs": [], 238 | "source": [ 239 | "from sklearn.neighbors import KNeighborsClassifier" 240 | ] 241 | }, 242 | { 243 | "cell_type": "markdown", 244 | "metadata": {}, 245 | "source": [ 246 | "**Step 2:** \"Instantiate\" the \"estimator\"\n", 247 | "\n", 248 | "- \"Estimator\" is scikit-learn's term for model\n", 249 | "- \"Instantiate\" means \"make an instance of\"" 250 | ] 251 | }, 252 | { 253 | "cell_type": "code", 254 | "execution_count": 5, 255 | "metadata": { 256 | "collapsed": false 257 | }, 258 | "outputs": [], 259 | "source": [ 260 | "knn = KNeighborsClassifier(n_neighbors=1)" 261 | ] 262 | }, 263 | { 264 | "cell_type": "markdown", 265 | "metadata": {}, 266 | "source": [ 267 | "- Name of the object does not matter\n", 268 | "- Can specify tuning parameters (aka \"hyperparameters\") during this step\n", 269 | " - n_neighbours is a hyperparameter where it represents k\n", 270 | "- All parameters not specified are set to their defaults" 271 | ] 272 | }, 273 | { 274 | "cell_type": "code", 275 | "execution_count": 6, 276 | "metadata": { 277 | "collapsed": false 278 | }, 279 | "outputs": [ 280 | { 281 | "name": "stdout", 282 | "output_type": "stream", 283 | "text": [ 284 | "KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',\n", 285 | " metric_params=None, n_jobs=1, n_neighbors=1, p=2,\n", 286 | " weights='uniform')\n" 287 | ] 288 | } 289 | ], 290 | "source": [ 291 | "print(knn)" 292 | ] 293 | }, 294 | { 295 | "cell_type": "markdown", 296 | "metadata": {}, 297 | "source": [ 298 | "**Step 3:** Fit the model with data (aka \"model training\")\n", 299 | "\n", 300 | "- Model is learning the relationship between X and y\n", 301 | "- Occurs in-place" 302 | ] 303 | }, 304 | { 305 | "cell_type": "code", 306 | "execution_count": 7, 307 | "metadata": { 308 | "collapsed": false 309 | }, 310 | "outputs": [ 311 | { 312 | "data": { 313 | "text/plain": [ 314 | "KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',\n", 315 | " metric_params=None, n_jobs=1, n_neighbors=1, p=2,\n", 316 | " weights='uniform')" 317 | ] 318 | }, 319 | "execution_count": 7, 320 | "metadata": {}, 321 | "output_type": "execute_result" 322 | } 323 | ], 324 | "source": [ 325 | "knn.fit(X, y)" 326 | ] 327 | }, 328 | { 329 | "cell_type": "markdown", 330 | "metadata": {}, 331 | "source": [ 332 | "**Step 4:** Predict the response for a new observation\n", 333 | "\n", 334 | "- New observations are called \"out-of-sample\" data\n", 335 | "- Uses the information it learned during the model training process" 336 | ] 337 | }, 338 | { 339 | "cell_type": "code", 340 | "execution_count": 8, 341 | "metadata": { 342 | "collapsed": false 343 | }, 344 | "outputs": [ 345 | { 346 | "data": { 347 | "text/plain": [ 348 | "array([2, 1])" 349 | ] 350 | }, 351 | "execution_count": 8, 352 | "metadata": {}, 353 | "output_type": "execute_result" 354 | } 355 | ], 356 | "source": [ 357 | "X_new = ([3, 5, 4, 2], [5, 4, 3, 2])\n", 358 | "knn.predict(X_new)" 359 | ] 360 | }, 361 | { 362 | "cell_type": "markdown", 363 | "metadata": {}, 364 | "source": [ 365 | "- Returns a NumPy array\n", 366 | "- 2 and 1 are the predicted species\n", 367 | " - \"0\": setosa\n", 368 | " - \"1\": 
versicolor\n", 369 | " - \"2\": virginica\n", 370 | "- Can predict for multiple observations at once" 371 | ] 372 | }, 373 | { 374 | "cell_type": "markdown", 375 | "metadata": {}, 376 | "source": [ 377 | "### 8. Using a different value for K" 378 | ] 379 | }, 380 | { 381 | "cell_type": "code", 382 | "execution_count": 9, 383 | "metadata": { 384 | "collapsed": false 385 | }, 386 | "outputs": [ 387 | { 388 | "data": { 389 | "text/plain": [ 390 | "array([1, 1])" 391 | ] 392 | }, 393 | "execution_count": 9, 394 | "metadata": {}, 395 | "output_type": "execute_result" 396 | } 397 | ], 398 | "source": [ 399 | "# instantiate the model (using the value K=5)\n", 400 | "knn = KNeighborsClassifier(n_neighbors=5)\n", 401 | "\n", 402 | "# fit the model with data\n", 403 | "knn.fit(X, y)\n", 404 | "\n", 405 | "# predict the response for new observations\n", 406 | "knn.predict(X_new)" 407 | ] 408 | }, 409 | { 410 | "cell_type": "markdown", 411 | "metadata": {}, 412 | "source": [ 413 | "### 9. Using a different classification model" 414 | ] 415 | }, 416 | { 417 | "cell_type": "code", 418 | "execution_count": 10, 419 | "metadata": { 420 | "collapsed": false 421 | }, 422 | "outputs": [ 423 | { 424 | "data": { 425 | "text/plain": [ 426 | "array([2, 0])" 427 | ] 428 | }, 429 | "execution_count": 10, 430 | "metadata": {}, 431 | "output_type": "execute_result" 432 | } 433 | ], 434 | "source": [ 435 | "# import the class\n", 436 | "from sklearn.linear_model import LogisticRegression\n", 437 | "\n", 438 | "# instantiate the model (using the default parameters)\n", 439 | "logreg = LogisticRegression()\n", 440 | "\n", 441 | "# fit the model with data\n", 442 | "logreg.fit(X, y)\n", 443 | "\n", 444 | "# predict the response for new observations\n", 445 | "logreg.predict(X_new)" 446 | ] 447 | }, 448 | { 449 | "cell_type": "markdown", 450 | "metadata": {}, 451 | "source": [ 452 | "### 10. Different values explanation" 453 | ] 454 | }, 455 | { 456 | "cell_type": "markdown", 457 | "metadata": {}, 458 | "source": [ 459 | "Why are the KNN models (different values of K) and Logistic Regression model predict different values? Which are right?\n", 460 | "- We are unable to determine here as we do not have the label (outcome)\n", 461 | "- We, however, can compare the models to determine which is the \"best\n", 462 | "- This will be covered in the following guides" 463 | ] 464 | }, 465 | { 466 | "cell_type": "markdown", 467 | "metadata": {}, 468 | "source": [ 469 | "### 11. 
Resources\n", 470 | "\n", 471 | "- [Nearest Neighbors](http://scikit-learn.org/stable/modules/neighbors.html) (user guide), [KNeighborsClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html) (class documentation)\n", 472 | "- [Logistic Regression](http://scikit-learn.org/stable/modules/linear_model.html#logistic-regression) (user guide), [LogisticRegression](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) (class documentation)\n", 473 | "- [Videos from An Introduction to Statistical Learning](http://www.dataschool.io/15-hours-of-expert-machine-learning-videos/)\n", 474 | " - Classification Problems and K-Nearest Neighbors (Chapter 2)\n", 475 | " - Introduction to Classification (Chapter 4)\n", 476 | " - Logistic Regression and Maximum Likelihood (Chapter 4)" 477 | ] 478 | } 479 | ], 480 | "metadata": { 481 | "kernelspec": { 482 | "display_name": "Python [py3k]", 483 | "language": "python", 484 | "name": "Python [py3k]" 485 | }, 486 | "language_info": { 487 | "codemirror_mode": { 488 | "name": "ipython", 489 | "version": 3 490 | }, 491 | "file_extension": ".py", 492 | "mimetype": "text/x-python", 493 | "name": "python", 494 | "nbconvert_exporter": "python", 495 | "pygments_lexer": "ipython3", 496 | "version": "3.5.2" 497 | } 498 | }, 499 | "nbformat": 4, 500 | "nbformat_minor": 0 501 | } 502 | -------------------------------------------------------------------------------- /.ipynb_checkpoints/05_model_evaluation-checkpoint.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "## Topics\n", 8 | "1. Evaluation procedure 1 - Train and test on the entire dataset\n", 9 | " a. Logistic regression\n", 10 | " b. KNN (k = 5)\n", 11 | " c. KNN (k = 1)\n", 12 | " d. Problems with training and testing on the same data\n", 13 | "2. Evaluation procedure 2 - Train/test split\n", 14 | "3. Making predictions on out-of-sample data\n", 15 | "4. Downsides of train/test split\n", 16 | "5. Resources" 17 | ] 18 | }, 19 | { 20 | "cell_type": "markdown", 21 | "metadata": {}, 22 | "source": [ 23 | "_This tutorial is derived from Data School's Machine Learning with scikit-learn tutorial. I added my own notes so anyone, including myself, can refer to this tutorial without watching the videos._" 24 | ] 25 | }, 26 | { 27 | "cell_type": "markdown", 28 | "metadata": {}, 29 | "source": [ 30 | "### 1. Evaluation procedure 1 - Train and test on the entire dataset" 31 | ] 32 | }, 33 | { 34 | "cell_type": "markdown", 35 | "metadata": {}, 36 | "source": [ 37 | "1. Train the model on the **entire dataset**.\n", 38 | "2. Test the model on the **same dataset**, and evaluate how well we did by comparing the **predicted** response values with the **true** response values." 39 | ] 40 | }, 41 | { 42 | "cell_type": "code", 43 | "execution_count": 1, 44 | "metadata": { 45 | "collapsed": true 46 | }, 47 | "outputs": [], 48 | "source": [ 49 | "# read in the iris data\n", 50 | "from sklearn.datasets import load_iris\n", 51 | "iris = load_iris()\n", 52 | "\n", 53 | "# create X (features) and y (response)\n", 54 | "X = iris.data\n", 55 | "y = iris.target" 56 | ] 57 | }, 58 | { 59 | "cell_type": "markdown", 60 | "metadata": {}, 61 | "source": [ 62 | "#### 1a. 
Logistic regression" 63 | ] 64 | }, 65 | { 66 | "cell_type": "code", 67 | "execution_count": 2, 68 | "metadata": { 69 | "collapsed": false 70 | }, 71 | "outputs": [ 72 | { 73 | "data": { 74 | "text/plain": [ 75 | "array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n", 76 | " 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n", 77 | " 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1,\n", 78 | " 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 1, 1, 1, 1, 1, 1,\n", 79 | " 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,\n", 80 | " 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2,\n", 81 | " 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])" 82 | ] 83 | }, 84 | "execution_count": 2, 85 | "metadata": {}, 86 | "output_type": "execute_result" 87 | } 88 | ], 89 | "source": [ 90 | "# import the class\n", 91 | "from sklearn.linear_model import LogisticRegression\n", 92 | "\n", 93 | "# instantiate the model (using the default parameters)\n", 94 | "logreg = LogisticRegression()\n", 95 | "\n", 96 | "# fit the model with data\n", 97 | "logreg.fit(X, y)\n", 98 | "\n", 99 | "# predict the response values for the observations in X\n", 100 | "logreg.predict(X)" 101 | ] 102 | }, 103 | { 104 | "cell_type": "code", 105 | "execution_count": 3, 106 | "metadata": { 107 | "collapsed": false 108 | }, 109 | "outputs": [ 110 | { 111 | "data": { 112 | "text/plain": [ 113 | "150" 114 | ] 115 | }, 116 | "execution_count": 3, 117 | "metadata": {}, 118 | "output_type": "execute_result" 119 | } 120 | ], 121 | "source": [ 122 | "# store the predicted response values\n", 123 | "y_pred = logreg.predict(X)\n", 124 | "\n", 125 | "# check how many predictions were generated\n", 126 | "len(y_pred)" 127 | ] 128 | }, 129 | { 130 | "cell_type": "markdown", 131 | "metadata": {}, 132 | "source": [ 133 | "Classification accuracy:\n", 134 | "\n", 135 | "- **Proportion** of correct predictions\n", 136 | "- Common **evaluation metric** for classification problems" 137 | ] 138 | }, 139 | { 140 | "cell_type": "code", 141 | "execution_count": 4, 142 | "metadata": { 143 | "collapsed": false 144 | }, 145 | "outputs": [ 146 | { 147 | "name": "stdout", 148 | "output_type": "stream", 149 | "text": [ 150 | "0.96\n" 151 | ] 152 | } 153 | ], 154 | "source": [ 155 | "# compute classification accuracy for the logistic regression model\n", 156 | "from sklearn import metrics\n", 157 | "\n", 158 | "print(metrics.accuracy_score(y, y_pred))" 159 | ] 160 | }, 161 | { 162 | "cell_type": "markdown", 163 | "metadata": {}, 164 | "source": [ 165 | "- Known as **training accuracy** when you train and test the model on the same data\n", 166 | "- 96% of our predictions are correct" 167 | ] 168 | }, 169 | { 170 | "cell_type": "markdown", 171 | "metadata": {}, 172 | "source": [ 173 | "#### 1b. 
KNN (K=5)" 174 | ] 175 | }, 176 | { 177 | "cell_type": "code", 178 | "execution_count": 5, 179 | "metadata": { 180 | "collapsed": false 181 | }, 182 | "outputs": [ 183 | { 184 | "name": "stdout", 185 | "output_type": "stream", 186 | "text": [ 187 | "0.966666666667\n" 188 | ] 189 | } 190 | ], 191 | "source": [ 192 | "from sklearn.neighbors import KNeighborsClassifier\n", 193 | "knn = KNeighborsClassifier(n_neighbors=5)\n", 194 | "knn.fit(X, y)\n", 195 | "y_pred = knn.predict(X)\n", 196 | "print(metrics.accuracy_score(y, y_pred))" 197 | ] 198 | }, 199 | { 200 | "cell_type": "markdown", 201 | "metadata": {}, 202 | "source": [ 203 | "_It seems, there is a higher accuracy here but there is a big issue of testing on your training data_" 204 | ] 205 | }, 206 | { 207 | "cell_type": "markdown", 208 | "metadata": {}, 209 | "source": [ 210 | "#### 1c. KNN (K=1)" 211 | ] 212 | }, 213 | { 214 | "cell_type": "code", 215 | "execution_count": 6, 216 | "metadata": { 217 | "collapsed": false 218 | }, 219 | "outputs": [ 220 | { 221 | "name": "stdout", 222 | "output_type": "stream", 223 | "text": [ 224 | "1.0\n" 225 | ] 226 | } 227 | ], 228 | "source": [ 229 | "knn = KNeighborsClassifier(n_neighbors=1)\n", 230 | "knn.fit(X, y)\n", 231 | "y_pred = knn.predict(X)\n", 232 | "print(metrics.accuracy_score(y, y_pred))" 233 | ] 234 | }, 235 | { 236 | "cell_type": "markdown", 237 | "metadata": {}, 238 | "source": [ 239 | "- KNN model\n", 240 | " 1. Pick a value for K.\n", 241 | " 2. Search for the K observations in the training data that are \"nearest\" to the measurements of the unknown iris\n", 242 | " 3. Use the most popular response value from the K nearest neighbors as the predicted response value for the unknown iris\n", 243 | "- This would always have 100% accuracy, because we are testing on the exact same data, it would always make correct predictions\n", 244 | "- KNN would search for one nearest observation and find that exact same observation\n", 245 | " - KNN has memorized the training set\n", 246 | " - Because we testing on the exact same data, it would always make the same prediction" 247 | ] 248 | }, 249 | { 250 | "cell_type": "markdown", 251 | "metadata": {}, 252 | "source": [ 253 | "#### 1d. Problems with training and testing on the same data\n", 254 | "\n", 255 | "- Goal is to estimate likely performance of a model on **out-of-sample data**\n", 256 | "- But, maximizing training accuracy rewards **overly complex models** that won't necessarily generalize\n", 257 | "- Unnecessarily complex models **overfit** the training data" 258 | ] 259 | }, 260 | { 261 | "cell_type": "markdown", 262 | "metadata": {}, 263 | "source": [ 264 | "![Overfitting](images/05_overfitting.png)" 265 | ] 266 | }, 267 | { 268 | "cell_type": "markdown", 269 | "metadata": {}, 270 | "source": [ 271 | "*Image Credit: [Overfitting](http://commons.wikimedia.org/wiki/File:Overfitting.svg#/media/File:Overfitting.svg) by Chabacano. 
Licensed under GFDL via Wikimedia Commons.*" 272 | ] 273 | }, 274 | { 275 | "cell_type": "markdown", 276 | "metadata": {}, 277 | "source": [ 278 | "- Green line (decision boundary): overfit\n", 279 | " - Your accuracy would be high but may not generalize well for future observations\n", 280 | " - Your accuracy is high because it is perfect in classifying your training data but not out-of-sample data\n", 281 | "- Black line (decision boundary): just right\n", 282 | " - Good for generalizing for future observations\n", 283 | "- Hence we need to solve this issue using a **train/test split** that will be explained below" 284 | ] 285 | }, 286 | { 287 | "cell_type": "markdown", 288 | "metadata": {}, 289 | "source": [ 290 | "### 2. Evaluation procedure 2 - Train/test split" 291 | ] 292 | }, 293 | { 294 | "cell_type": "markdown", 295 | "metadata": {}, 296 | "source": [ 297 | "1. Split the dataset into two pieces: a **training set** and a **testing set**.\n", 298 | "2. Train the model on the **training set**.\n", 299 | "3. Test the model on the **testing set**, and evaluate how well we did." 300 | ] 301 | }, 302 | { 303 | "cell_type": "code", 304 | "execution_count": 7, 305 | "metadata": { 306 | "collapsed": false 307 | }, 308 | "outputs": [ 309 | { 310 | "name": "stdout", 311 | "output_type": "stream", 312 | "text": [ 313 | "(150, 4)\n", 314 | "(150,)\n" 315 | ] 316 | } 317 | ], 318 | "source": [ 319 | "# print the shapes of X and y\n", 320 | "# X is our features matrix with 150 x 4 dimension\n", 321 | "print(X.shape)\n", 322 | "# y is our response vector with 150 x 1 dimension\n", 323 | "print(y.shape)" 324 | ] 325 | }, 326 | { 327 | "cell_type": "code", 328 | "execution_count": 8, 329 | "metadata": { 330 | "collapsed": false 331 | }, 332 | "outputs": [], 333 | "source": [ 334 | "# STEP 1: split X and y into training and testing sets\n", 335 | "from sklearn.cross_validation import train_test_split\n", 336 | "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=4)" 337 | ] 338 | }, 339 | { 340 | "cell_type": "markdown", 341 | "metadata": {}, 342 | "source": [ 343 | "- test_size=0.4\n", 344 | " - 40% of observations to test set\n", 345 | " - 60% of observations to training set\n", 346 | "- data is randomly assigned unless you use random_state hyperparameter\n", 347 | " - If you use random_state=4\n", 348 | " - Your data will be split exactly the same way" 349 | ] 350 | }, 351 | { 352 | "cell_type": "markdown", 353 | "metadata": {}, 354 | "source": [ 355 | "![Train/test split](images/05_train_test_split.png)" 356 | ] 357 | }, 358 | { 359 | "cell_type": "markdown", 360 | "metadata": {}, 361 | "source": [ 362 | "What did this accomplish?\n", 363 | "\n", 364 | "- Model can be trained and tested on **different data**\n", 365 | "- Response values are known for the testing set, and thus **predictions can be evaluated**\n", 366 | "- **Testing accuracy** is a better estimate than training accuracy of out-of-sample performance" 367 | ] 368 | }, 369 | { 370 | "cell_type": "code", 371 | "execution_count": 9, 372 | "metadata": { 373 | "collapsed": false 374 | }, 375 | "outputs": [ 376 | { 377 | "name": "stdout", 378 | "output_type": "stream", 379 | "text": [ 380 | "(90, 4)\n", 381 | "(60, 4)\n" 382 | ] 383 | } 384 | ], 385 | "source": [ 386 | "# print the shapes of the new X objects\n", 387 | "print(X_train.shape)\n", 388 | "print(X_test.shape)" 389 | ] 390 | }, 391 | { 392 | "cell_type": "code", 393 | "execution_count": 10, 394 | "metadata": { 395 | "collapsed": false 396 | }, 
397 | "outputs": [ 398 | { 399 | "name": "stdout", 400 | "output_type": "stream", 401 | "text": [ 402 | "(90,)\n", 403 | "(60,)\n" 404 | ] 405 | } 406 | ], 407 | "source": [ 408 | "# print the shapes of the new y objects\n", 409 | "print(y_train.shape)\n", 410 | "print(y_test.shape)" 411 | ] 412 | }, 413 | { 414 | "cell_type": "code", 415 | "execution_count": 11, 416 | "metadata": { 417 | "collapsed": false 418 | }, 419 | "outputs": [ 420 | { 421 | "data": { 422 | "text/plain": [ 423 | "LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,\n", 424 | " intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,\n", 425 | " penalty='l2', random_state=None, solver='liblinear', tol=0.0001,\n", 426 | " verbose=0, warm_start=False)" 427 | ] 428 | }, 429 | "execution_count": 11, 430 | "metadata": {}, 431 | "output_type": "execute_result" 432 | } 433 | ], 434 | "source": [ 435 | "# STEP 2: train the model on the training set\n", 436 | "logreg = LogisticRegression()\n", 437 | "logreg.fit(X_train, y_train)" 438 | ] 439 | }, 440 | { 441 | "cell_type": "code", 442 | "execution_count": 12, 443 | "metadata": { 444 | "collapsed": false 445 | }, 446 | "outputs": [ 447 | { 448 | "name": "stdout", 449 | "output_type": "stream", 450 | "text": [ 451 | "0.95\n" 452 | ] 453 | } 454 | ], 455 | "source": [ 456 | "# STEP 3: make predictions on the testing set\n", 457 | "y_pred = logreg.predict(X_test)\n", 458 | "\n", 459 | "# compare actual response values (y_test) with predicted response values (y_pred)\n", 460 | "print(metrics.accuracy_score(y_test, y_pred))" 461 | ] 462 | }, 463 | { 464 | "cell_type": "markdown", 465 | "metadata": {}, 466 | "source": [ 467 | "Repeat for KNN with K=5:" 468 | ] 469 | }, 470 | { 471 | "cell_type": "code", 472 | "execution_count": 13, 473 | "metadata": { 474 | "collapsed": false 475 | }, 476 | "outputs": [ 477 | { 478 | "name": "stdout", 479 | "output_type": "stream", 480 | "text": [ 481 | "0.966666666667\n" 482 | ] 483 | } 484 | ], 485 | "source": [ 486 | "knn = KNeighborsClassifier(n_neighbors=5)\n", 487 | "knn.fit(X_train, y_train)\n", 488 | "y_pred = knn.predict(X_test)\n", 489 | "print(metrics.accuracy_score(y_test, y_pred))" 490 | ] 491 | }, 492 | { 493 | "cell_type": "markdown", 494 | "metadata": {}, 495 | "source": [ 496 | "Repeat for KNN with K=1:" 497 | ] 498 | }, 499 | { 500 | "cell_type": "code", 501 | "execution_count": 14, 502 | "metadata": { 503 | "collapsed": false 504 | }, 505 | "outputs": [ 506 | { 507 | "name": "stdout", 508 | "output_type": "stream", 509 | "text": [ 510 | "0.966666666667\n" 511 | ] 512 | } 513 | ], 514 | "source": [ 515 | "knn = KNeighborsClassifier(n_neighbors=5)\n", 516 | "knn.fit(X_train, y_train)\n", 517 | "y_pred = knn.predict(X_test)\n", 518 | "print(metrics.accuracy_score(y_test, y_pred))" 519 | ] 520 | }, 521 | { 522 | "cell_type": "markdown", 523 | "metadata": {}, 524 | "source": [ 525 | "Can we locate an even better value for K?" 
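(The loop that follows answers this by scoring K=1 through K=25 on a single train/test split; a steadier estimate for any one K averages accuracy over several splits, as covered in the cross-validation notebook later in this repo. A minimal sketch, assuming `X`, `y`, and `KNeighborsClassifier` from the cells above — note that `sklearn.cross_validation` was renamed `sklearn.model_selection` in later scikit-learn releases:)

```python
# score one value of K with 10-fold cross-validation instead of a single split;
# assumes X, y, and KNeighborsClassifier are already defined above
from sklearn.cross_validation import cross_val_score  # sklearn.model_selection in newer versions

knn = KNeighborsClassifier(n_neighbors=5)
scores = cross_val_score(knn, X, y, cv=10, scoring='accuracy')
print(scores.mean())  # mean accuracy across the 10 folds
```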
526 | ] 527 | }, 528 | { 529 | "cell_type": "code", 530 | "execution_count": 15, 531 | "metadata": { 532 | "collapsed": false 533 | }, 534 | "outputs": [ 535 | { 536 | "name": "stdout", 537 | "output_type": "stream", 538 | "text": [ 539 | "[0.94999999999999996, 0.94999999999999996, 0.96666666666666667, 0.96666666666666667, 0.96666666666666667, 0.98333333333333328, 0.98333333333333328, 0.98333333333333328, 0.98333333333333328, 0.98333333333333328, 0.98333333333333328, 0.98333333333333328, 0.98333333333333328, 0.98333333333333328, 0.98333333333333328, 0.98333333333333328, 0.98333333333333328, 0.96666666666666667, 0.98333333333333328, 0.96666666666666667, 0.96666666666666667, 0.96666666666666667, 0.96666666666666667, 0.94999999999999996, 0.94999999999999996]\n" 540 | ] 541 | } 542 | ], 543 | "source": [ 544 | "# try K=1 through K=25 and record testing accuracy\n", 545 | "k_range = range(1, 26)\n", 546 | "\n", 547 | "# We can create Python dictionary using [] or dict()\n", 548 | "scores = []\n", 549 | "\n", 550 | "# We use a loop through the range 1 to 26\n", 551 | "# We append the scores in the dictionary\n", 552 | "for k in k_range:\n", 553 | " knn = KNeighborsClassifier(n_neighbors=k)\n", 554 | " knn.fit(X_train, y_train)\n", 555 | " y_pred = knn.predict(X_test)\n", 556 | " scores.append(metrics.accuracy_score(y_test, y_pred))\n", 557 | " \n", 558 | "print(scores)" 559 | ] 560 | }, 561 | { 562 | "cell_type": "code", 563 | "execution_count": 16, 564 | "metadata": { 565 | "collapsed": false 566 | }, 567 | "outputs": [ 568 | { 569 | "data": { 570 | "text/plain": [ 571 | "" 572 | ] 573 | }, 574 | "execution_count": 16, 575 | "metadata": {}, 576 | "output_type": "execute_result" 577 | }, 578 | { 579 | "data": { 580 | "image/png": "iVBORw0KGgoAAAANSUhEUgAAAZMAAAEPCAYAAACHuClZAAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAAIABJREFUeJzt3Xu4XFWd5vHvmxuQgEmEhBhCbiQCobkqkW5Ej4IS6dY4\nGS/EZgTlCbElDdo9M0TGngRH7YSepkWRaRGaJ/ag2NJR8AYB4Yj2GIMkhFtITsiFxFwMhEsCAknO\nb/7Yu5KiUnVOnVO1q+pUvZ/nOU+q9nVVUdRba+2111JEYGZmVol+9S6AmZn1fQ4TMzOrmMPEzMwq\n5jAxM7OKOUzMzKxiDhMzM6tY5mEiaZqkpyStkXRVkfXDJC2WtFLSUklT8tZ9XtLjkh6VdJukQeny\neZI2S1qe/k3L+nWYmVlpmYaJpH7ADcD5wEnATEknFGx2NbAiIk4FLga+nu47Gvhr4IyIOAUYAFyY\nt991EXFG+nd3lq/DzMy6lnXNZCrQEREbI2IPcDswvWCbKcD9ABGxGhgvaUS6rj8wRNIAYDCwJW8/\nZVpyMzMrW9ZhcgywKe/55nRZvpXADABJU4GxwJiI2AL8I/AM8HvghYi4L2+/OZIekXSzpKFZvQAz\nM+teI1yAXwAMl7QcuBxYAeyTNIykFjMOGA0cLukT6T43AhMj4jRgG3Bd7YttZmY5AzI+/u9Jaho5\nY9Jl+0XELuDTueeS1gHrgGnAuojYmS5fDPwZ8N2I2JF3iG8DPy52ckkeeMzMrBciokeXErKumTwE\nTJI0Lu2JdSFwV/4GkoZKGpg+ngU8GBG7SZq3zpJ0qCQB5wKr0u1G5R1iBvB4qQJEhP8imDdvXt3L\n0Ch/fi/8Xvi96PqvNzKtmUTEPklzgCUkwXVLRKySNDtZHTcBJwKLJHUCTwCXpvsuk3QHSbPXnvTf\nm9JDXyvpNKAT2ADMzvJ1mJlZ17Ju5iKSbrvHFyz7Vt7jpYXr89ZdA1xTZPknq1xMMzOrQCNcgLca\naGtrq3cRGobfiwP8Xhzg96Iy6m37WF8gKZr59ZmZZUES0WAX4M3MrAU4TMzMrGKZX4C3vmnnTvji\nF2Hv3nqXxJrBIYfAtdfCYYdle57XXoNvfxvmzMn2PHYwXzOxon76U7j6arj88nqXxJrBggXw/e/D\nmWdme56HH4azzoI//hEG+Kdyr/XmmonfbiuqowPe9S647LJ6l8SawX33JZ+prMOkoyOpTW/cCMcd\nl+257I18zcSKWrsWJk+udymsWUyenHymspY7Ry3OZW/kMLGiOjpg0qR6l8KaxaRJyWcqax0dMGRI\nbc5lb+QwsaI6OlwzseqZPLl2YXLeeQ6TenCY2EFefx22bIHx4+tdEmsWtQyTCy5wmNSDw8QOsm4d\nHHssDBxY75JYsxg5EvbsSbqcZ+WFF+DVV+Gccxwm9eAwsYO4icuqTcq+dpK7zjdxIjzzTBJeVjsO\nEzuIw8SyUIswmTw5uUFy9GjYsCG7c9nBHCZ2EIeJZSHrMMnvzl6rrsh2gMPEDrJ2rbsFW/VNmpTt\nF3z+j6BaXfC3AxwmdhDXTCwLtbpmArW7r8UOcJjYG7z6KmzbBuPG1bsk1mxyYZLVcHmumdSXw8Te\nYN26JEg8SJ5V21FHJUHy3HPVP/bOnUnvrZEjk+cOk9rLPEwkTZP0lKQ1kq4qsn6YpMWSVkpaKmlK\n3rrPS3pc0qOSbpM0KF0+XNISSasl3SNpaNavo1W4icuykmX34NznVuk4txMmwObNyQ24VhuZhomk\nfsANwP
nAScBMSScUbHY1sCIiTgUuBr6e7jsa+GvgjIg4hWSE4wvTfeYC90XE8cD9wBeyfB2txGFi\nWco6THIGDYIxY2D9+uqfy4rLumYyFeiIiI0RsQe4HZhesM0UkkAgIlYD4yWNSNf1B4ZIGgAMBn6f\nLp8OLEofLwI+nN1LaC0OE8tSVmFSbJRrdw+urazD5BhgU97zzemyfCuBGQCSpgJjgTERsQX4R+AZ\nkhB5ISJ+ke4zMiK2A0TENmBkZq+gxbhbsGUpqy/4Yj+CfN2kthrhAvwCYLik5cDlwApgn6RhJDWQ\nccBo4HBJnyhxDE+nWCWumViWsuqy6zCpv6z77PyepKaRM4YDTVUARMQu4NO555LWAeuAacC6iNiZ\nLl8M/BnwXWC7pKMjYrukUcAfShVg/vz5+x+3tbXR1tZW2StqYn/8I+zYAWPHdr+tWW/kdw9WjyaF\nLS2i+Pw7kybBT35SnXM0u/b2dtrb2ys6RqZzwEvqD6wGzgW2AsuAmRGxKm+bocArEbFH0izg7Ii4\nJG3yugU4E3gNuBV4KCK+KWkhsDMiFqY9xIZHxNwi5/cc8D3w+OPw0Y/CqlXdb2vWW29+Mzz11IFu\nvJV69tkkOJ5//o0B1dEB73+/L8L3Rm/mgM+0mSsi9gFzgCXAE8DtEbFK0mxJudnFTwQel7SKpNfX\nlem+y4A7SJq9VgICbkr3WQi8T1IuqBZk+TpahZu4rBaq3fxU2C04Z/z4ZF6e116r3rmstMxvTYuI\nu4HjC5Z9K+/x0sL1eeuuAa4psnwncF51S2oOE6uFXJicfXZ1jlfqcztwYNJku24dnHhidc5lpTXC\nBXhrEA4Tq4Vq10yKdQvOP5e7B9eGw8T2K3YR06zasmrmqsW5rDSHie3X1S88s2qpdm3BYdIYHCYG\nwCuvJAPwHXtsvUtizS53r0k1OlqW6hZceC7LnsPEgOSX4sSJ0M+fCMvY8OHJ1Lrbt1d+rB07oH9/\nOPLI4utdM6kdf3UY4IvvVlvV+pLv7nM7blwSWq++Wvm5rGsOEwMcJlZbtQqTAQOSQHn66crPZV1z\nmBjgMLHaqlaYlNNpxN2Da8NhYoC7BVtt1apmUs1zWdccJga4ZmK1Va3agsOkcThMjN274cUX4ZjC\nmWbMMjJpUhImlXQP7q5bcI7DpDYcJsbatXDcce4WbLUzdCgMHgxbt/b+GNu3J12Mhw/vejvfa1Ib\n/vowN3FZXVRaYyj3czt2bHI/yiuv9P5c1j2HiTlMrC5qFSb9+8OECe4enDWHiTlMrC4qDZOejCXn\n7sHZc5iYuwVbXdSqZlKNc1n3HCbmmonVhcOkuThMWtxLLyVdg0ePrndJrNVMmpRcx+js7Pm+EUmz\nVbk1aodJ9hwmLS73P2Th/NlmWTviCHjTm5J52ntq69aka/HQoeVt7+7B2cs8TCRNk/SUpDWSriqy\nfpikxZJWSloqaUq6/K2SVkhanv77oqQr0nXzJG1O1y2XNC3r19Gs3MRl9dTbGkNPP7fHHgs7d8LL\nL/f8XFaeTMNEUj/gBuB84CRgpqQTCja7GlgREacCFwNfB4iINRFxekScAbwNeBlYnLffdRFxRvp3\nd5avo5k5TKyeahUm/fol8/W4R1d2sq6ZTAU6ImJjROwBbgemF2wzBbgfICJWA+MljSjY5jzg6YjY\nnLfMDTNV4DCxeuptmPRmiml3D85W1mFyDLAp7/nmdFm+lcAMAElTgbHAmIJtPg58r2DZHEmPSLpZ\nUpktp1bI3YKtnmpVM6nkXFaeAfUuALAAuF7ScuAxYAWwL7dS0kDgQ8DcvH1uBL4UESHpy8B1wKXF\nDj5//vz9j9va2mhra6ty8fs210ysnmodJr/9bc/P1Qra29tpb2+v6BiKSobt7O7g0lnA/IiYlj6f\nC0RELOxin/XAyRGxO33+IeCzuWMU2X4c8OOIOKXIusjy9fV1L7yQXJh86SX35rL6ePllOOqo5N9y\nBxrt7Ex6gm3blvxbrgcegHnz4MEHe1fWViKJiOjRt0LWzVwPAZMkjZM0CLgQuCt/A0lD09oHkmYB\nv8wFSWomBU1ckkblPZ0BPJ5F4ZuduwVbvQ0Zkoz6u3lz99vmbNmSdCnuSZCAm7mylmkzV0TskzQH\nWEISXLdExCpJs5PVcRNwIrBIUifwBHnNVZIGk1x8v6zg0NdKOg3oBDYAs7N8Hc3KTVzWCHJf8mPH\nlrd9b6/zjR6dzNuza1fPg8i6l/k1k7Tb7vEFy76V93hp4fq8da8AhT27iIhPVrmYLclhYo0gFybn\nnlve9r393Pbrl8zbs3YtnH56z/e3rvkO+BbmMLFG0NPmp950C84/l7sHZ8Nh0sLcLdgaQU/DpJIf\nQb5ukh2HSQtzzcQagcOkOThMWtTOnbBnD4wcWe+SWKs77jhYvx727et+285OWLeu9zVqh0l2HCYt\nKvfrzt2Crd4GD4YRI2DTpu633bw56Uo8ZEjvzuUwyY7DpEVVchHTrNrKHSK+0ut8b3lLMn/PSy/1\n/hhWnMOkRfl6iTWScmsMlX5uJc9tkhWHSYtymFgjKTdMqlGjdvfgbDhMWpS7BVsjqVXNpCfnsp5x\nmLSgCNdMrLE4TPq+bsNE0l95vpDm8txzSaAcdVS9S2KWOO442LgR9u4tvc2+fUkX4uOOq+xcDpNs\nlFMzGQcsl/RdSedlXSDLnrsFW6M59FA4+ugkUErZtCn5ATR4cGXncphko9swiYi5wGTgNuAzkjok\nfUnS+IzLZhlxt2BrRJMmdX1hvFpNs0cfDa++msznY9VT1jWTiMgN9b6BZNj3twB3Svr7zEpmmfH1\nEmtE3dUYqtVpxN2Ds1HONZPLJS0DrgceBk6JiFnA6SRzs1sf4zCxRtRdmFSzRu3uwdVXznwmo4GZ\nEfF0/sKI6Eyn1LU+xt2CrRFNngy/+EXp9R0d8K53Ve9crplUVznNXD8CtueeSDpC0tsBIsLT5fYx\n7hZsjaqcZq5q1kwcJtVVTpjcBLyS9/xl4FsltrUGt2MH9O8PRx5Z75KYvdHEifDMM8lo1oX27oUN\nGyrvFpzjMKm+csKkX3oBHth/MX5gdkWyLLlWYo3qkEOSedo3bDh43TPPJL2wDj20OudymFRfOWGy\nPr1xsb+kfpIuJ+nVVRZJ0yQ9JWmNpKuKrB8mabGklZKWSpqSLn+rpBWSlqf/vijpinTdcElLJK2W\ndI9vqiyfw8QaWakL49X+3I4YkdR2du6s3jFbXTlhMhs4l+S6yXbg3cCscg4uqR9wA3A+cBIwU9IJ\nBZtdDayIiFOBi4GvA0TEmog4PSLOAN5G0ry2ON1nLnBfRBwP3A98oZzymO8xscZWqstutTuNuHtw\n9ZVz0+L2iPhIRBwVESMi4mMRsb27/VJTgY6I2BgRe4DbgekF20whCQQiYjUwXtKIgm3OA56OiM3p\n8+nAovTxIuDDZZan5blmYo2sVPNTFj+C3NRVXd12DZZ0CHAJSc1
if4tlRFxWxvGPAfLnT9tMEjD5\nVgIzgP+QNBUYC4wBduRt83Hge3nPR+YCLSK2SfLks2Vyt2BrZJMnwz33HLy8owPe+97qn8v3mlRP\nOfeZfAdYB/wF8BXgE8ATVSzDAuB6ScuBx4AVwP7ZoCUNBD5E0rRVSpRaMX/+/P2P29raaGtrq6y0\nfZi7BVujK1VbyOJzWyq4WlF7ezvt7e0VHUMRJb+Hkw2kFRFxuqRHI+KU9Mv9VxFxVrcHl84C5kfE\ntPT5XCAiYmEX+6wHTo6I3enzDwGfzR0jXbYKaIuI7ZJGAQ9ExIlFjhXdvb5Wsm0b/MmfwLPP1rsk\nZsW9/joccQTs2gWDBiXL9u6Fww+HF19MenxVy29+A1deCcuWVe+YzUISEdGjoWDLuQCf6/X9gqQT\ngSOAcpuVHgImSRonaRBwIXBX/gaShqYBhaRZwC9zQZKayRubuEiPcUn6+GLgzjLL09JcK7FGN2gQ\njBmTDDWfs2FDMnd7NYMEDtSC/HuzOsoJk1skDQfmAfcAa4D/Xc7BI2IfMAdYQtI0dntErJI0W1Lu\nmsuJwONpbeN84Mrc/pIGk1x8X/zGI7MQeJ+k1SQ9zRaUU55W5zCxvqCwqSurz23uxt3nnqv+sVtR\nl9dMJPUHno2I54EHSC6O90hE3A0cX7DsW3mPlxauz1v3ClDYs4uI2EkSMtYDDhPrCwqHos/qcysd\nCC5PFFe5Lmsmac3i6hqVxTLme0ysLyismaxdm10PRN9rUj3lNHMtkfQ5SW+R9KbcX+Yls6pzt2Dr\nC2rVzJU7l7sHV0c5XYMvSv/927xlQS+avKx+Ilwzsb6h1mHy059mc+xW022YRMSxtSiIZWvr1mTu\n7KEexcwa3PjxsGULvPYa9OsHmzfDhAnZnMt3wVdPOXfAf6LY8oj4bvWLY1nxxXfrKwYOhLFjYd26\nZLqEY445cM9JteV3D1aP7qqwQuU0c52T9/hQ4L0k0/c6TPoQh4n1Jbkv+f79s/3cvvnNMGBAMs/P\nSA/KVJFymrn+Kv95es+Jg6SPcZhYX5K7MN6vX/af21xwOUwqU05vrkK7gInVLohlyxffrS/JddnN\nsltw4bmsMuVcM/khBwZS7EcyerCHL+lj3C3Y+pLJk+HOO5NmrgsuyP5c7h5cuXKumdyQ93gvsDEi\nNmRTHMtCZyc8/bRrJtZ31OqaSe5cd/rnccXKCZMO4A8R8SqApMMkHRsRm7rZzxrEli3JSKxHHFHv\nkpiVZ9w42J5OwTd+fLbncvfg6ijnmslioDPveSfw79kUx7Lgi+/W1wwYkATKsccmXYWz5NGDq6Oc\nmsmAiHg99yQiXktnX7Q+wmFifdHkyUkTbdaGDYNDD01qQqNGZX++ZlVOmDwn6YKI+BmApL8AdmZb\nrOb35JPwta/V5lwPPwwf+UhtzmVWLbUKk9y5PvvZxh49+NJL4R3vqHcpSitnpsW3ktxXko7+zw7g\noohYk3HZKtbIMy0uWAAPPggf/nBtzvfBDyYTDJn1FRs3Jk1PWV8zAVi6FB59NPvz9Nb99yf///7T\nP9XmfL2ZabHbMMk7+DCAiHihF2Wri0YOk9yvjMsu635bM2ttP/oR3Hwz/OQntTlfJtP2SvpfkoZF\nxAsR8YKk4ZKu6X0xDXwToZmVry/cC1NOb66/yK+NpLMufjC7IrUG30RoZuWaOBE2bIC9e+tdktLK\nCZP+kvaP2SnpUCCjMTxbw+7d8MILyWioZmbdOeywZOywZ56pd0lKKydMbgfulXSxpIuBe+jBQI+S\npkl6StIaSVcVWT9M0mJJKyUtlTQlb91QST+QtErSE5LekS6fJ2mzpOXp37Ryy9MI1q6F445LBrEz\nMytHo99cWc6owV+V9ChwXrro2ogoa24ySf1IhmM5F9gCPCTpzoh4Km+zq4EVETFD0vHAN/POdT3w\ns4j4qKQBwOC8/a6LiOvKKUej8X0fZtZTuTA5//x6l6S4sn4bR8RPIuJzEfE5kvtOri/z+FOBjojY\nGBF7SGo50wu2mQLcn55nNTBe0oh0nvlzIuLWdN3eiHgpb78+O5WNw8TMeqrRayZlhYmkkyV9VdLT\nwD8A68s8/jFA/hhem9Nl+VYCM9LzTCWZW34MMAF4VtKtaVPWTZIOy9tvjqRHJN0sqU9NRuswMbOe\navQwKdnMJWkiMDP92w18HxgYEeeU2qeXFgDXS1oOPAasAPYBA4EzgMsj4neSvgbMBeYBNwJfioiQ\n9GXgOuDSYgefP3/+/sdtbW20tbVVufg919EBl1xS71KYWV+SZZi0t7fT3t5e0TFK3rQoqRP4FTAr\nd7e7pHURUfbEWJLOAuZHxLT0+VwgImJhF/usB04GhgC/yZ1P0juBqyLigwXbjwN+HBGnFDlWQ960\nOGpUMsSJe3OZWblefRWGDk16g2Y9+GW1b1r8GMnQKfdJulHSu+n5dYqHgEmSxqXdiy8E7ioo9FBJ\nA9PHs4BfRsTuiNgObEqHc4HkIv6T6Xb5w7HNAB7vYbnq5qWXYNcuGD263iUxs77k0EOTIVU2bqx3\nSYor2cwVEXcAd0g6AvhPJE1MR0v6BvDDiLi/u4NHxD5Jc4AlJMF1S0SskjQ7WR03AScCi9Ka0BO8\nsbnqCuC2NGzWAZ9Kl18r6TSS4fA3ALN78qLrKTcNqfps9wEzq5dcU1cj3vBc9thcAJKOIqmxfDwi\n3p1ZqaqkEZu5vv99+MEP4I476l0SM+trPvtZOOEEuOKKbM+Tydhc+SLi2Yi4sS8ESaNyTy4z661G\n7tHle7BrzGFiZr3lMLH9HCZm1lsOE9svdwHezKynJkyAzZvh9de737bWypnP5HlJOwv+1qcDMI7P\nvojN48UX4Y9/9DzTZtY7gwYl96dt2FDvkhysnDngvwls5cBIwTOB8STDoNwKvCeTkjWhXJc+dws2\ns97KNXW99a3db1tL5TRzfTAivhkRz6d/NwLvj4jbgDdnXL6m4uslZlapRr1uUk6Y/FHSjNyT9PFr\n6dPOTErVpBwmZlapvhwmFwGz0mslzwGzgP8iaTDwuUxL12QcJmZWqUYNk3Imx1oLfKDE6l9WtzjN\nraMDPvOZepfCzPqyRg2TbodTSYdQ+TTJRff94RMRl2VasipotOFUjjwSnnwSjj663iUxs75qzx44\n/PBk0NhDDsnmHL0ZTqWc3lx3AkuBX5PMM2K9sHNn8iEYObLeJTGzvmzgQBg7FtavT8bpahTlhMmQ\niPjbzEvS5NauTaqn7hZsZpWaNClp6mqkMCnnAvzPJb0/85I0OV98N7NqacTrJuWEyWeAuyXtTnt0\nPS9pZ9YFazYOEzOrlr4aJkeRzMc+FBiRPh+RZaGakcPEzKqlT4WJpNxX30kl/qwHHCZmVi2NGCYl\nuwZLuiUiLpX0qyKrIyLelW3RKtcoXYMj4M1vhjVrYITrdGZWob17k+7BL7yQzA1fbVXtGhwRubnY\n3xsRewpONLAX5WtZzz
2XBMpRR9W7JGbWDAYMgHHj4Omn4aQGaScq55rJb8tcVpSkaZKekrRG0lVF\n1g+TtFjSSklLJU3JWzc0Hep+laQnJL0jXT5c0hJJqyXdI2loueWpB3cLNrNqmzQp+W5pFF1dMxkp\n6VTgMEknSzol/XsnMLicg0vqB9wAnE9ynWWmpMKe0VcDKyLiVOBi4Ot5664HfhYRJwKnAqvS5XOB\n+yLieOB+4AvllKdefL3EzKqt0a6bdHXT4p+TDKMyhmROk9zv6l3A35V5/KlAR0RsBJB0OzAdeCpv\nmynA3wNExGpJ4yWNIBmZ+JyIuCRdtxd4Kd1nOvDu9PEioJ0kYBqSw8TMqm3yZHj00XqX4oCSNZOI\nuDUizgEujYh3RcQ56d8FEfGDMo9/DLAp7/nmdFm+lcAMAElTgbEkATYBeFbSrZKWS7pJ0mHpPiMj\nYntazm1AQw9S4jAxs2rrSzWTnJGS3hQRL0n6Z+AM4AsR8YsqlWEBcL2k5cBjwAqSMcAGpue6PCJ+\nJ+lrJLWPeRyoJeWU7LI1f/78/Y/b2tpoa2urUrHL5zAxs2qrZpi0t7fT3t5e0THKGTX40Yg4JR1S\n5XLgfwL/EhFv6/bg0lnA/IiYlj6fS9KteGEX+6wHTgaGAL+JiInp8ncCV0XEByWtAtoiYrukUcAD\n6XWVwmPVvWtwBAwbBuvWJaMGm5lVw759MGRIMojs4LKuYpevN12Dy+nNlfs2vgD4TkSsLHM/gIeA\nSZLGSRoEXAjclb9B2mNrYPp4FvDLiNidNmNtkpSb6fhc4Mn08V3AJenji0lGNm5IO3ZA//4OEjOr\nrv79YcKEpHtwIyinmWulpJ8BbwWulnQ4XTQr5YuIfZLmAEtIAuiWiFglaXayOm4CTgQWSeoEngAu\nzTvEFcBtadisAz6VLl8I/JukTwMbgY+VU556yHULNjOrtsmTk++Yk0+ud0nKC5NPAW8D1kbEK+lk\nWZd2s89+EXE3cHzBsm/lPV5auD5v3UrgzCLLdwLnlVuGevL1EjPLSm4o+kbQbXNVROwDJgJ/lS46\nrJz9LOEwMbOsNFKPrm5DQdINwHuAi9JFLwP/nGWhmonDxMyy0khhUk4z159FxBmSVkDSxJReTLcy\nOEzMLCuNFCblNFftSYdFCQBJRwKdmZaqSUQk/6EnTap3ScysGR17bNI1+OWX612SrsfmytVavgn8\nOzBC0jXAr0l6U1k3tm+HQw6B4cPrXRIza0b9+sHEiY0x4GNXzVzLgDMi4juSHibpPSXgoxHxeE1K\n18e5icvMspZr6jr11PqWo6sw2X/3Y0Q8QXIPiPWA7zExs6w1ylD0XYXJCEl/U2plRFyXQXmaimsm\nZpa1yZNh2bJ6l6LrC/D9gcOBI0r8WTccJmaWtUbp0dVVzWRrRHypZiVpQg4TM8tao4RJVzUTTzJb\ngYikHdPdgs0sS8ccAy++CLt21bccXYXJuTUrRRPaujUZFnpoQ89Ob2Z9Xb9+cNxx9b8I39VMiztr\nWZBm4yYuM6uVRmjq8oCNGXGYmFmtOEyamO8xMbNaaYR7TRwmGXHNxMxqxTWTJuYwMbNaaYQwUURZ\nM/D2SZKiHq+vsxOOOAK2bUv+NTPLUgQcfnjSi/RNb6r8eJKIiB7dHuKaSQa2bElCxEFiZrUg1X8K\n38zDRNI0SU9JWiPpqiLrh0laLGmlpKWSpuSt25AuXyFpWd7yeZI2S1qe/k3L+nX0hJu4zKzW6t3U\nVc5Mi72WTqp1A8kNkFuAhyTdGRFP5W12NbAiImZIOp5k/pTz0nWdQFtEPF/k8Nc16mCTDhMzq7V6\nh0nWNZOpQEdEbIyIPcDtwPSCbaYA9wNExGpgvKQR6Tp1UcaGHe7F3YLNrNbq3T046zA5BtiU93xz\nuizfSmAGgKSpwFhgTLougHslPSRpVsF+cyQ9IulmSQ01aIlrJmZWa/WumWTazFWmBcD1kpYDjwEr\ngH3purMjYmtaU7lX0qqI+DVwI/CliAhJXwauAy4tdvD58+fvf9zW1kZbW1tmLyTHYWJmtVZJmLS3\nt9Pe3l7R+TPtGizpLGB+RExLn88FIiJKziEvaT1wckTsLlg+D9hVeJ1E0jjgxxFxSpFj1bxrcGcn\nDBkCO3YkXfXMzGohIukWvGkTDBtW2bEasWvwQ8AkSeMkDQIuBO7K30DSUEkD08ezgF9GxG5JgyUd\nni4fArwfeDx9PirvEDNyyxvB5s0wfLiDxMxqq97dgzNt5oqIfZLmAEtIguuWiFglaXayOm4CTgQW\nSeokmWc+11x1NPBDSZGW87aIWJKuu1bSaSS9vTYAs7N8HT3hJi4zq5dcU9eZZ9b+3JlfM4mIu4Hj\nC5Z9K+81deebAAALf0lEQVTx0sL16fL1wGkljvnJKhezahwmZlYv9bwI7zvgq8xhYmb14jBpIr7H\nxMzqpZ73mjhMqsw1EzOrF9dMmsS+fbB+fTIfs5lZrY0cCXv2wM46TLruMKmiTZvgyCNh8OB6l8TM\nWpFUv9qJw6SK3MRlZvXmMGkCDhMzqzeHSRNwmJhZvTlMmoDDxMzqrV5DqjhMqsj3mJhZveVqJjUe\n49ZhUi1798KGDTBxYr1LYmat7Kijkn+fe66253WYVMkzzyR9vA87rN4lMbNWVq/uwQ6TKvH1EjNr\nFA6TPsxhYmaNwmHShzlMzKxROEz6MIeJmTUKh0kf5m7BZtYockPR17J7sMOkCvbuTXpzTZhQ75KY\nmSUDzvbvDzt21O6cmYeJpGmSnpK0RtJVRdYPk7RY0kpJSyVNyVu3IV2+QtKyvOXDJS2RtFrSPZKG\nZv06urJhA4waBYceWs9SmJkdUOumrkzDRFI/4AbgfOAkYKakEwo2uxpYERGnAhcDX89b1wm0RcTp\nETE1b/lc4L6IOB64H/hCVq+hHL5eYmaNpqnCBJgKdETExojYA9wOTC/YZgpJIBARq4Hxkkak61Si\njNOBRenjRcCHq13wnnCYmFmjabYwOQbYlPd8c7os30pgBoCkqcBYYEy6LoB7JT0kaVbePiMjYjtA\nRGwDRmZQ9rI5TMys0dQ6TAbU7lQlLQCul7QceAxYAexL150dEVvTmsq9klZFxK+LHKMqfRa2boV5\n83q+3733wje+UY0SmJlVx+TJ8OCDcNlltTlf1mHye5KaRs6YdNl+EbEL+HTuuaT1wLp03db03x2S\nfkjSbPZrYLukoyNiu6RRwB9KFWD+/Pn7H7e1tdHW1laysIcdBm9/e5mvLM873gHveU/P9zMzy8rp\np8OCBfD6691vu3p1O2vWtFd0PkWGHZEl9QdWA+cCW4FlwMyIWJW3zVDglYjYkzZlnR0Rl0gaDPSL\niN2ShgBLgGsiYomkhcDOiFiY9hAbHhFzi5w/snx9ZmbNSBIRoZ7sk2nNJCL2SZpDEgT9gFsiYpWk\n2cnquAk4EVgkqRN4Arg03f1o4IeSIi3nbRGxJF23EPg3SZ8GNgIfy/J1mJlZ1zK
tmdSbayZmZj3X\nm5qJ74A3M7OKOUzMzKxiDhMzM6uYw8TMzCrmMDEzs4o5TMzMrGIOEzMzq5jDxMzMKuYwMTOzijlM\nzMysYg4TMzOrmMPEzMwq5jAxM7OKOUzMzKxiDhMzM6uYw8TMzCrmMDEzs4o5TMzMrGIOEzMzq1jm\nYSJpmqSnJK2RdFWR9cMkLZa0UtJSSVMK1veTtFzSXXnL5knanC5fLmla1q/DzMxKyzRMJPUDbgDO\nB04CZko6oWCzq4EVEXEqcDHw9YL1VwJPFjn8dRFxRvp3d5WL3nTa29vrXYSG4ffiAL8XB/i9qEzW\nNZOpQEdEbIyIPcDtwPSCbaYA9wNExGpgvKQRAJLGABcANxc5tjIrdRPy/ygH+L04wO/FAX4vKpN1\nmBwDbMp7vjldlm8lMANA0lRgLDAmXfdPwH8Dosix50h6RNLNkoZWtdRmZtYjjXABfgEwXNJy4HJg\nBbBP0p8D2yPiEZJaSH5N5EZgYkScBmwDrqtxmc3MLI8iiv3or9LBpbOA+RExLX0+F4iIWNjFPuuA\nU0iupVwE7AUOA44AFkfEJwu2Hwf8OCJOKXKs7F6cmVkTi4geXUrIOkz6A6uBc4GtwDJgZkSsyttm\nKPBKROyRNAs4OyIuKTjOu4G/jYgPpc9HRcS29PHngTMj4hOZvRAzM+vSgCwPHhH7JM0BlpA0qd0S\nEaskzU5Wx03AicAiSZ3AE8ClZRz6WkmnAZ3ABmB2Ji/AzMzKkmnNxMzMWkMjXICvuu5ulGw1kjak\nN4WukLSs3uWpJUm3SNou6dG8ZcMlLZG0WtI9rdIbsMR70XI3AEsaI+l+SU9IekzSFenylvtcFHkv\n/jpd3uPPRdPVTNIbJdeQXKfZAjwEXBgRT9W1YHWUdmp4W0Q8X++y1JqkdwK7ge/kOmlIWgg8FxHX\npj82hkfE3HqWsxZKvBfzgF0R0TI9IiWNAkZFxCOSDgceJrn/7VO02Oeii/fi4/Twc9GMNZNybpRs\nNaI5/1t3KyJ+DRSG6HRgUfp4EfDhmhaqTkq8F9BiNwBHxLb0lgMiYjewiuTetpb7XJR4L3L3Avbo\nc9GMXzDl3CjZagK4V9JDaY+5VjcyIrZD8j8TMLLO5am3lr0BWNJ44DRgKXB0K38u8t6L36aLevS5\naMYwsYOdHRFnkAxNc3na3GEHNFdbb8+07A3AabPOHcCV6a/yws9By3wuirwXPf5cNGOY/J5kSJac\nMemylhURW9N/dwA/JGkKbGXbJR0N+9uM/1Dn8tRNROyIAxdOvw2cWc/y1IqkASRfnv8aEXemi1vy\nc1HsvejN56IZw+QhYJKkcZIGARcCd3WzT9OSNDj91YGkIcD7gcfrW6qaKxyO5y7gkvTxxcCdhTs0\nsTe8F+mXZs4MWuez8S/AkxFxfd6yVv1cHPRe9OZz0XS9uSDpGgxcz4EbJRfUuUh1I2kCSW0kSG5S\nva2V3g9J3wXagCOB7cA84EfAD4BjgY3AxyLihXqVsVZKvBfvIWkn338DcO66QbOSdDbwIPAYyf8X\nQTJ80zLg32ihz0UX78Un6OHnoinDxMzMaqsZm7nMzKzGHCZmZlYxh4mZmVXMYWJmZhVzmJiZWcUc\nJmZmVjGHifVp6fDZ7ytYdqWkb3az366My3WUpKWSHk778ueve0DSGenjCelUCe8rcox/SIcFLznN\ndTdleLekH+c9/7Kkn0kaKKld0kN5694m6YG8/Tol/Xne+h9LeldvymGtwWFifd13gZkFyy5Ml3cl\n6xuszgMejYi3RcR/FNtA0hjg58DnI+LeIpvMAk6JiLLm5FEyTXahSNd9EfhT4MPpaNoBjJB0fuG2\nqc3A/yjnvGbgMLG+79+BC9LxhZA0DnhLRPyHpCGS7pP0u3RysA8V7lzk1/s3JH0yfXxG7he8pJ/n\nxm0q2H+cpF+kx783nWzoVGAhMD2dWOiQIuUeDdwDfCEiflrkuHcChwMPS/po3nkeyZ0n3e5WSf9H\n0tL0nEUOpb8Bzgc+GBGv5637B+CLRd9VWAm8KOncEuvN3sBhYn1aOuHXMuAD6aILSYbEAHiV5Jf4\n24H3Av9Y6jCFC9Jw+gbwnyPiTOBW4KtF9v0GcGtEnEpSG/pGRKwE/ifw/Yg4IyJeK7LfonTbH5Z4\nXdOBV9L9f5B3ntNy58nb/JiIOCsi/muRQ50NzAY+EBGvFLzm3wCvSXp3sSIAXwH+rlj5zAo5TKwZ\n3E4SIqT/fi99LODvJa0E7gNGSyp3jorjgT8hmQdmBUmTz+gi2/1p3vn+leTLuxz3AhdJOrSLbfIH\np+zqPD/o4hhr0+O8v8SxSwZGOplWFF7zMSvGYWLN4E7gXEmnA4dFxIp0+V8CRwGnR8TpJEOKF355\n7+WN/x/k1gt4PK0ZnB4Rp0bEBzhYb6+9XEsywvUd6VTTxUSJx4Ve7mLdNpJ5bL4mqe2gE0Q8QPKa\nzyqx/1dJmsI8iJ91yWFifV5EvAy0kwyl/b28VUOBP0REp6T3AOPy1uV+mW8EpqQ9nIYBuWsEq0ku\nUJ8FSbOXpClFTv//ONAB4CLgVz0o9+eBF9NyF5NfM6nkPGtJhhH/v5JOKbLJV4D/XmLfe4HhQLH9\nzPZzmFiz+B7JF15+mNwGnJk2c11EMr91TgBExGaSayyPkzSXLU+X7wE+AiyU9AiwgqSpqdAVwKfS\nbf4SuLKMsub/yr8EGFWi+2/+dqXOU1aNISJ+B3wKuCudliDy1v2cpNZW6lhfIRmW3awkD0FvZmYV\nc83EzMwq5jAxM7OKOUzMzKxiDhMzM6uYw8TMzCrmMDEzs4o5TMzMrGIOEzMzq9j/B8v0jBDbKPEn\nAAAAAElFTkSuQmCC\n", 581 | "text/plain": [ 582 | "" 583 | ] 584 | }, 585 | "metadata": {}, 586 | "output_type": "display_data" 587 | } 588 | ], 589 | "source": [ 590 | "# import Matplotlib (scientific plotting library)\n", 591 | "import matplotlib.pyplot as plt\n", 592 | "\n", 593 | "# allow plots to appear within the notebook\n", 594 | "%matplotlib inline\n", 595 | "\n", 596 | "# plot the relationship between K and testing accuracy\n", 597 | "# plt.plot(x_axis, y_axis)\n", 598 | "plt.plot(k_range, scores)\n", 599 | "plt.xlabel('Value of K for KNN')\n", 600 | "plt.ylabel('Testing Accuracy')" 601 | ] 602 | }, 603 | { 604 | "cell_type": 
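The testing accuracy in the plot above plateaus for mid-range values of K. A minimal sketch of inspecting that plateau programmatically (not part of the original notebook; it assumes the `k_range` and `scores` objects computed in the loop above):

```python
# a minimal sketch (not in the original notebook): list the values of K that
# tie for the best testing accuracy, assuming k_range and scores from above
import numpy as np

scores_arr = np.array(scores)
best_idx = np.flatnonzero(scores_arr == scores_arr.max())  # indices of the tied maxima
best_ks = [k_range[i] for i in best_idx]                   # K values with the highest testing accuracy
print(best_ks)  # a plateau of K values; the notebook later settles on K=11, which lies inside it
```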
"markdown", 605 | "metadata": {}, 606 | "source": [ 607 | "- **Training accuracy** rises as model complexity increases\n", 608 | "- **Testing accuracy** penalizes models that are too complex or not complex enough\n", 609 | "- For KNN models, complexity is determined by the **value of K** (lower value = more complex)" 610 | ] 611 | }, 612 | { 613 | "cell_type": "markdown", 614 | "metadata": {}, 615 | "source": [ 616 | "### 3. Making predictions on out-of-sample data" 617 | ] 618 | }, 619 | { 620 | "cell_type": "code", 621 | "execution_count": 17, 622 | "metadata": { 623 | "collapsed": false 624 | }, 625 | "outputs": [ 626 | { 627 | "name": "stderr", 628 | "output_type": "stream", 629 | "text": [ 630 | "/Users/ritchieng/anaconda3/envs/py3k/lib/python3.5/site-packages/sklearn/utils/validation.py:386: DeprecationWarning: Passing 1d arrays as data is deprecated in 0.17 and willraise ValueError in 0.19. Reshape your data either using X.reshape(-1, 1) if your data has a single feature or X.reshape(1, -1) if it contains a single sample.\n", 631 | " DeprecationWarning)\n" 632 | ] 633 | }, 634 | { 635 | "data": { 636 | "text/plain": [ 637 | "array([1])" 638 | ] 639 | }, 640 | "execution_count": 17, 641 | "metadata": {}, 642 | "output_type": "execute_result" 643 | } 644 | ], 645 | "source": [ 646 | "# instantiate the model with the best known parameters\n", 647 | "knn = KNeighborsClassifier(n_neighbors=11)\n", 648 | "\n", 649 | "# train the model with X and y (not X_train and y_train)\n", 650 | "knn.fit(X, y)\n", 651 | "\n", 652 | "# make a prediction for an out-of-sample observation\n", 653 | "knn.predict([3, 5, 4, 2])" 654 | ] 655 | }, 656 | { 657 | "cell_type": "markdown", 658 | "metadata": {}, 659 | "source": [ 660 | "### 4. Downsides of train/test split" 661 | ] 662 | }, 663 | { 664 | "cell_type": "markdown", 665 | "metadata": {}, 666 | "source": [ 667 | "- Provides a **high-variance estimate** of out-of-sample accuracy\n", 668 | "- **K-fold cross-validation** overcomes this limitation\n", 669 | "- But, train/test split is still useful because of its **flexibility and speed**" 670 | ] 671 | }, 672 | { 673 | "cell_type": "markdown", 674 | "metadata": {}, 675 | "source": [ 676 | "### 5. 
Resources\n", 677 | "\n", 678 | "- Quora: [What is an intuitive explanation of overfitting?](http://www.quora.com/What-is-an-intuitive-explanation-of-overfitting/answer/Jessica-Su)\n", 679 | "- Video: [Estimating prediction error](https://www.youtube.com/watch?v=_2ij6eaaSl0&t=2m34s) (12 minutes, starting at 2:34) by Hastie and Tibshirani\n", 680 | "- [Understanding the Bias-Variance Tradeoff](http://scott.fortmann-roe.com/docs/BiasVariance.html)\n", 681 | " - [Guiding questions](https://github.com/justmarkham/DAT5/blob/master/homework/06_bias_variance.md) when reading this article\n", 682 | "- Video: [Visualizing bias and variance](http://work.caltech.edu/library/081.html) (15 minutes) by Abu-Mostafa" 683 | ] 684 | } 685 | ], 686 | "metadata": { 687 | "anaconda-cloud": {}, 688 | "kernelspec": { 689 | "display_name": "Python [py3k]", 690 | "language": "python", 691 | "name": "Python [py3k]" 692 | }, 693 | "language_info": { 694 | "codemirror_mode": { 695 | "name": "ipython", 696 | "version": 3 697 | }, 698 | "file_extension": ".py", 699 | "mimetype": "text/x-python", 700 | "name": "python", 701 | "nbconvert_exporter": "python", 702 | "pygments_lexer": "ipython3", 703 | "version": "3.5.2" 704 | }, 705 | "name": "" 706 | }, 707 | "nbformat": 4, 708 | "nbformat_minor": 0 709 | } 710 | -------------------------------------------------------------------------------- /.ipynb_checkpoints/IPython Tutorial-checkpoint.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [], 3 | "metadata": {}, 4 | "nbformat": 4, 5 | "nbformat_minor": 0 6 | } 7 | -------------------------------------------------------------------------------- /01_machine_learning_intro.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "## Topics\n", 8 | "\n", 9 | "1. What is machine learning?\n", 10 | "2. What are the two main categories of machine learning?\n", 11 | "3. How does machine learning work for supervised learning (predictive modelling)?\n", 12 | "4. Big questions in learning machine learning\n", 13 | "5. Resources" 14 | ] 15 | }, 16 | { 17 | "cell_type": "markdown", 18 | "metadata": {}, 19 | "source": [ 20 | "_This tutorial is derived from Data School's Machine Learning with scikit-learn tutorial. I added my own notes so anyone, including myself, can refer to this tutorial without watching the videos._" 21 | ] 22 | }, 23 | { 24 | "cell_type": "markdown", 25 | "metadata": {}, 26 | "source": [ 27 | "### 1. What is machine learning?\n", 28 | "\n", 29 | "High level definition: semi-automated extraction of knowledge from data\n", 30 | "\n", 31 | "- **Starts with data**: You need data to exact insights from it\n", 32 | "- **Knowledge from data**: Starts with a question that might be answerable using data\n", 33 | "- **Automated extraction**: A computer provides the insight\n", 34 | "- **Semi-automated**: Requires many smart decisions by a human" 35 | ] 36 | }, 37 | { 38 | "cell_type": "markdown", 39 | "metadata": {}, 40 | "source": [ 41 | "### 2. 
What are the two main categories of machine learning?\n", 42 | "\n", 43 | "**Supervised learning**: making predictions using data\n", 44 | " \n", 45 | "- This is also called predictive modelling\n", 46 | "- Example: Is a given email \"spam\" or \"ham/non-spam\"?\n", 47 | "- There is an outcome we are trying to predict" 48 | ] 49 | }, 50 | { 51 | "cell_type": "markdown", 52 | "metadata": {}, 53 | "source": [ 54 | "![Spam filter](images/01_spam_filter.png)" 55 | ] 56 | }, 57 | { 58 | "cell_type": "markdown", 59 | "metadata": {}, 60 | "source": [ 61 | "**Unsupervised learning**: extracting structure from data\n", 62 | "\n", 63 | "- Example: Segment grocery store shoppers into clusters that exhibit similar behaviors\n", 64 | "- There is no \"right answer\"" 65 | ] 66 | }, 67 | { 68 | "cell_type": "markdown", 69 | "metadata": {}, 70 | "source": [ 71 | "![Clustering](images/01_clustering.png)" 72 | ] 73 | }, 74 | { 75 | "cell_type": "markdown", 76 | "metadata": {}, 77 | "source": [ 78 | "### 3. How does machine learning work for supervised learning (predictive modelling)?\n", 79 | "\n", 80 | "2 high-level steps of supervised learning:\n", 81 | "\n", 82 | "1. Train a machine learning model using labeled data\n", 83 | " - \"Labeled data\" has been labeled with the outcome\n", 84 | " - \"Machine learning model\" learns the relationship between the attributes of the data and its outcome\n", 85 | "\n", 86 | "2. Make **predictions** on **new data** for which the label is unknown" 87 | ] 88 | }, 89 | { 90 | "cell_type": "markdown", 91 | "metadata": {}, 92 | "source": [ 93 | "![Supervised learning](images/01_supervised_learning.png)" 94 | ] 95 | }, 96 | { 97 | "cell_type": "markdown", 98 | "metadata": {}, 99 | "source": [ 100 | "The primary goal of supervised learning is to build a model that generalizes \n", 101 | "- It accurately predicts the **future** rather than the past" 102 | ] 103 | }, 104 | { 105 | "cell_type": "markdown", 106 | "metadata": {}, 107 | "source": [ 108 | "### 4. Big questions in learning machine learning\n", 109 | "\n", 110 | "- How do I choose **which attributes** of my data to include in the model?\n", 111 | "- How do I choose **which model** to use?\n", 112 | "- How do I **optimize** this model for best performance?\n", 113 | "- How do I ensure that I'm building a model that will **generalize** to unseen data?\n", 114 | "- Can I **estimate** how well my model is likely to perform on unseen data?" 115 | ] 116 | }, 117 | { 118 | "cell_type": "markdown", 119 | "metadata": {}, 120 | "source": [ 121 | "### 5. 
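The two high-level steps of supervised learning in section 3 map directly onto code. A minimal sketch on a made-up toy dataset (the data and the choice of KNeighborsClassifier here are illustrative assumptions, not part of this notebook):

```python
# a minimal sketch of the two supervised learning steps, on a made-up toy dataset
from sklearn.neighbors import KNeighborsClassifier

# step 1: train a model on labeled data (attributes plus known outcomes)
X_labeled = [[0, 0], [1, 1], [8, 9], [9, 8]]  # hypothetical feature rows
y_labeled = [0, 0, 1, 1]                      # the outcome each row was labeled with
model = KNeighborsClassifier(n_neighbors=1).fit(X_labeled, y_labeled)

# step 2: make predictions on new data for which the label is unknown
print(model.predict([[8, 8]]))                # -> array([1])
```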
Resources\n", 122 | "\n", 123 | "- Book: [An Introduction to Statistical Learning](http://www-bcf.usc.edu/~gareth/ISL/) (section 2.1, 14 pages)\n", 124 | "- Video: [Learning Paradigms](http://work.caltech.edu/library/014.html) (13 minutes)" 125 | ] 126 | } 127 | ], 128 | "metadata": { 129 | "kernelspec": { 130 | "display_name": "Python 2", 131 | "language": "python", 132 | "name": "python2" 133 | }, 134 | "language_info": { 135 | "codemirror_mode": { 136 | "name": "ipython", 137 | "version": 2 138 | }, 139 | "file_extension": ".py", 140 | "mimetype": "text/x-python", 141 | "name": "python", 142 | "nbconvert_exporter": "python", 143 | "pygments_lexer": "ipython2", 144 | "version": "2.7.11" 145 | } 146 | }, 147 | "nbformat": 4, 148 | "nbformat_minor": 0 149 | } 150 | -------------------------------------------------------------------------------- /02_machine_learning_setup.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Setting up Python for machine learning: scikit-learn and IPython Notebook" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "## Agenda\n", 15 | "\n", 16 | "- What are the benefits and drawbacks of scikit-learn?\n", 17 | "- How do I install scikit-learn?\n", 18 | "- How do I use the IPython Notebook?\n", 19 | "- What are some good resources for learning Python?" 20 | ] 21 | }, 22 | { 23 | "cell_type": "markdown", 24 | "metadata": {}, 25 | "source": [ 26 | "_This IPython notebook is derived from Data School's introduction to machine learning. All credits to Data School._" 27 | ] 28 | }, 29 | { 30 | "cell_type": "markdown", 31 | "metadata": {}, 32 | "source": [ 33 | "![scikit-learn algorithm map](images/02_sklearn_algorithms.png)" 34 | ] 35 | }, 36 | { 37 | "cell_type": "markdown", 38 | "metadata": {}, 39 | "source": [ 40 | "## Benefits and drawbacks of scikit-learn\n", 41 | "\n", 42 | "### Benefits:\n", 43 | "\n", 44 | "- **Consistent interface** to machine learning models\n", 45 | "- Provides many **tuning parameters** but with **sensible defaults**\n", 46 | "- Exceptional **documentation**\n", 47 | "- Rich set of functionality for **companion tasks**\n", 48 | "- **Active community** for development and support\n", 49 | "\n", 50 | "### Potential drawbacks:\n", 51 | "\n", 52 | "- Harder (than R) to **get started with machine learning**\n", 53 | "- Less emphasis (than R) on **model interpretability**\n", 54 | "\n", 55 | "### Further reading:\n", 56 | "\n", 57 | "- Ben Lorica: [Six reasons why I recommend scikit-learn](http://radar.oreilly.com/2013/12/six-reasons-why-i-recommend-scikit-learn.html)\n", 58 | "- scikit-learn authors: [API design for machine learning software](http://arxiv.org/pdf/1309.0238v1.pdf)\n", 59 | "- Data School: [Should you teach Python or R for data science?](http://www.dataschool.io/python-or-r-for-data-science/)" 60 | ] 61 | }, 62 | { 63 | "cell_type": "markdown", 64 | "metadata": {}, 65 | "source": [ 66 | "![scikit-learn logo](images/02_sklearn_logo.png)" 67 | ] 68 | }, 69 | { 70 | "cell_type": "markdown", 71 | "metadata": {}, 72 | "source": [ 73 | "## Installing scikit-learn\n", 74 | "\n", 75 | "**Option 1:** [Install scikit-learn library](http://scikit-learn.org/stable/install.html) and dependencies (NumPy and SciPy)\n", 76 | "\n", 77 | "**Option 2:** [Install Anaconda distribution](https://store.continuum.io/cshop/anaconda/) of Python, which includes:\n", 78 | "\n", 79 | 
"- Hundreds of useful packages (including scikit-learn)\n", 80 | "- IPython and IPython Notebook\n", 81 | "- conda package manager\n", 82 | "- Spyder IDE" 83 | ] 84 | }, 85 | { 86 | "cell_type": "markdown", 87 | "metadata": {}, 88 | "source": [ 89 | "![IPython header](images/02_ipython_header.png)" 90 | ] 91 | }, 92 | { 93 | "cell_type": "markdown", 94 | "metadata": {}, 95 | "source": [ 96 | "## Using the IPython Notebook\n", 97 | "\n", 98 | "### Components:\n", 99 | "\n", 100 | "- **IPython interpreter:** enhanced version of the standard Python interpreter\n", 101 | "- **Browser-based notebook interface:** weave together code, formatted text, and plots\n", 102 | "\n", 103 | "### Installation:\n", 104 | "\n", 105 | "- **Option 1:** [Install IPython and the notebook](http://ipython.org/install.html)\n", 106 | "- **Option 2:** Included with the Anaconda distribution\n", 107 | "\n", 108 | "### Launching the Notebook:\n", 109 | "\n", 110 | "- Type **ipython notebook** at the command line to open the dashboard\n", 111 | "- Don't close the command line window while the Notebook is running\n", 112 | "\n", 113 | "### Keyboard shortcuts:\n", 114 | "\n", 115 | "**Command mode** (gray border)\n", 116 | "\n", 117 | "- Create new cells above (**a**) or below (**b**) the current cell\n", 118 | "- Navigate using the **up arrow** and **down arrow**\n", 119 | "- Convert the cell type to Markdown (**m**) or code (**y**)\n", 120 | "- See keyboard shortcuts using **h**\n", 121 | "- Switch to Edit mode using **Enter**\n", 122 | "\n", 123 | "**Edit mode** (green border)\n", 124 | "\n", 125 | "- **Ctrl+Enter** to run a cell\n", 126 | "- Switch to Command mode using **Esc**\n", 127 | "\n", 128 | "### IPython and Markdown resources:\n", 129 | "\n", 130 | "- [nbviewer](http://nbviewer.ipython.org/): view notebooks online as static documents\n", 131 | "- [IPython documentation](http://ipython.org/ipython-doc/stable/index.html): focuses on the interpreter\n", 132 | "- [IPython Notebook tutorials](https://github.com/jupyter/notebook/blob/master/docs/source/examples/Notebook/Examples%20and%20Tutorials%20Index.ipynb): in-depth introduction\n", 133 | "- [GitHub's Mastering Markdown](https://guides.github.com/features/mastering-markdown/): short guide with lots of examples" 134 | ] 135 | }, 136 | { 137 | "cell_type": "markdown", 138 | "metadata": {}, 139 | "source": [ 140 | "## Resources for learning Python\n", 141 | "\n", 142 | "- [Codecademy's Python course](http://www.codecademy.com/en/tracks/python): browser-based, tons of exercises\n", 143 | "- [DataQuest](https://dataquest.io/missions): browser-based, teaches Python in the context of data science\n", 144 | "- [Google's Python class](https://developers.google.com/edu/python/): slightly more advanced, includes videos and downloadable exercises (with solutions)\n", 145 | "- [Python for Informatics](http://www.pythonlearn.com/): beginner-oriented book, includes slides and videos" 146 | ] 147 | } 148 | ], 149 | "metadata": { 150 | "kernelspec": { 151 | "display_name": "Python 2", 152 | "language": "python", 153 | "name": "python2" 154 | }, 155 | "language_info": { 156 | "codemirror_mode": { 157 | "name": "ipython", 158 | "version": 2 159 | }, 160 | "file_extension": ".py", 161 | "mimetype": "text/x-python", 162 | "name": "python", 163 | "nbconvert_exporter": "python", 164 | "pygments_lexer": "ipython2", 165 | "version": "2.7.11" 166 | } 167 | }, 168 | "nbformat": 4, 169 | "nbformat_minor": 0 170 | } 171 | 
-------------------------------------------------------------------------------- /03_getting_started_with_iris.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "## Topics\n", 8 | "1. About Iris dataset\n", 9 | "2. Display Iris dataset\n", 10 | "3. Supervised learning on Iris dataset\n", 11 | "4. Loading the Iris dataset into scikit-learn\n", 12 | "5. Machine learning terminology\n", 13 | "6. Exploring the Iris dataset\n", 14 | "7. Requirements for working with datasets in scikit-learn\n", 15 | "8. Additional resources\n" 16 | ] 17 | }, 18 | { 19 | "cell_type": "markdown", 20 | "metadata": {}, 21 | "source": [ 22 | "_This tutorial is derived from Data School's Machine Learning with scikit-learn tutorial. I added my own notes so anyone, including myself, can refer to this tutorial without watching the videos._" 23 | ] 24 | }, 25 | { 26 | "cell_type": "markdown", 27 | "metadata": {}, 28 | "source": [ 29 | "### 1. About Iris dataset" 30 | ] 31 | }, 32 | { 33 | "cell_type": "markdown", 34 | "metadata": {}, 35 | "source": [ 36 | "![Iris](images/03_iris.png)" 37 | ] 38 | }, 39 | { 40 | "cell_type": "markdown", 41 | "metadata": {}, 42 | "source": [ 43 | "- The iris dataset contains the following data\n", 44 | " - 50 samples of 3 different species of iris (150 samples total)\n", 45 | " - Measurements: sepal length, sepal width, petal length, petal width\n", 46 | "- The format for the data:\n", 47 | "(sepal length, sepal width, petal length, petal width)" 48 | ] 49 | }, 50 | { 51 | "cell_type": "markdown", 52 | "metadata": {}, 53 | "source": [ 54 | "### 2. Display Iris Dataset" 55 | ] 56 | }, 57 | { 58 | "cell_type": "code", 59 | "execution_count": 1, 60 | "metadata": { 61 | "collapsed": false 62 | }, 63 | "outputs": [ 64 | { 65 | "data": { 66 | "text/html": [ 67 | "" 68 | ], 69 | "text/plain": [ 70 | "" 71 | ] 72 | }, 73 | "execution_count": 1, 74 | "metadata": {}, 75 | "output_type": "execute_result" 76 | } 77 | ], 78 | "source": [ 79 | "# Display HTML using IPython.display module\n", 80 | "# You can display any other HTML using this module too\n", 81 | "# Just replace the link with your desired HTML page\n", 82 | "from IPython.display import HTML\n", 83 | "HTML('')" 84 | ] 85 | }, 86 | { 87 | "cell_type": "markdown", 88 | "metadata": {}, 89 | "source": [ 90 | "### 3. Supervised learning on the iris dataset\n", 91 | "\n", 92 | "- Framed as a **supervised learning** problem\n", 93 | " - Predict the species of an iris using the measurements\n", 94 | "- Famous dataset for machine learning because prediction is **easy**\n", 95 | "- Learn more about the iris dataset: [UCI Machine Learning Repository](http://archive.ics.uci.edu/ml/datasets/Iris)" 96 | ] 97 | }, 98 | { 99 | "cell_type": "markdown", 100 | "metadata": {}, 101 | "source": [ 102 | "### 4. 
Loading the iris dataset into scikit-learn" 103 | ] 104 | }, 105 | { 106 | "cell_type": "code", 107 | "execution_count": 2, 108 | "metadata": { 109 | "collapsed": false 110 | }, 111 | "outputs": [], 112 | "source": [ 113 | "# import load_iris function from datasets module\n", 114 | "# convention is to import modules instead of sklearn as a whole\n", 115 | "from sklearn.datasets import load_iris" 116 | ] 117 | }, 118 | { 119 | "cell_type": "code", 120 | "execution_count": 3, 121 | "metadata": { 122 | "collapsed": false 123 | }, 124 | "outputs": [ 125 | { 126 | "data": { 127 | "text/plain": [ 128 | "sklearn.datasets.base.Bunch" 129 | ] 130 | }, 131 | "execution_count": 3, 132 | "metadata": {}, 133 | "output_type": "execute_result" 134 | } 135 | ], 136 | "source": [ 137 | "# save \"bunch\" object containing iris dataset and its attributes\n", 138 | "# the data type is \"bunch\"\n", 139 | "iris = load_iris()\n", 140 | "type(iris)" 141 | ] 142 | }, 143 | { 144 | "cell_type": "code", 145 | "execution_count": 5, 146 | "metadata": { 147 | "collapsed": false 148 | }, 149 | "outputs": [ 150 | { 151 | "name": "stdout", 152 | "output_type": "stream", 153 | "text": [ 154 | "[[ 5.1 3.5 1.4 0.2]\n", 155 | " [ 4.9 3. 1.4 0.2]\n", 156 | " [ 4.7 3.2 1.3 0.2]\n", 157 | " [ 4.6 3.1 1.5 0.2]\n", 158 | " [ 5. 3.6 1.4 0.2]\n", 159 | " [ 5.4 3.9 1.7 0.4]\n", 160 | " [ 4.6 3.4 1.4 0.3]\n", 161 | " [ 5. 3.4 1.5 0.2]\n", 162 | " [ 4.4 2.9 1.4 0.2]\n", 163 | " [ 4.9 3.1 1.5 0.1]\n", 164 | " [ 5.4 3.7 1.5 0.2]\n", 165 | " [ 4.8 3.4 1.6 0.2]\n", 166 | " [ 4.8 3. 1.4 0.1]\n", 167 | " [ 4.3 3. 1.1 0.1]\n", 168 | " [ 5.8 4. 1.2 0.2]\n", 169 | " [ 5.7 4.4 1.5 0.4]\n", 170 | " [ 5.4 3.9 1.3 0.4]\n", 171 | " [ 5.1 3.5 1.4 0.3]\n", 172 | " [ 5.7 3.8 1.7 0.3]\n", 173 | " [ 5.1 3.8 1.5 0.3]\n", 174 | " [ 5.4 3.4 1.7 0.2]\n", 175 | " [ 5.1 3.7 1.5 0.4]\n", 176 | " [ 4.6 3.6 1. 0.2]\n", 177 | " [ 5.1 3.3 1.7 0.5]\n", 178 | " [ 4.8 3.4 1.9 0.2]\n", 179 | " [ 5. 3. 1.6 0.2]\n", 180 | " [ 5. 3.4 1.6 0.4]\n", 181 | " [ 5.2 3.5 1.5 0.2]\n", 182 | " [ 5.2 3.4 1.4 0.2]\n", 183 | " [ 4.7 3.2 1.6 0.2]\n", 184 | " [ 4.8 3.1 1.6 0.2]\n", 185 | " [ 5.4 3.4 1.5 0.4]\n", 186 | " [ 5.2 4.1 1.5 0.1]\n", 187 | " [ 5.5 4.2 1.4 0.2]\n", 188 | " [ 4.9 3.1 1.5 0.1]\n", 189 | " [ 5. 3.2 1.2 0.2]\n", 190 | " [ 5.5 3.5 1.3 0.2]\n", 191 | " [ 4.9 3.1 1.5 0.1]\n", 192 | " [ 4.4 3. 1.3 0.2]\n", 193 | " [ 5.1 3.4 1.5 0.2]\n", 194 | " [ 5. 3.5 1.3 0.3]\n", 195 | " [ 4.5 2.3 1.3 0.3]\n", 196 | " [ 4.4 3.2 1.3 0.2]\n", 197 | " [ 5. 3.5 1.6 0.6]\n", 198 | " [ 5.1 3.8 1.9 0.4]\n", 199 | " [ 4.8 3. 1.4 0.3]\n", 200 | " [ 5.1 3.8 1.6 0.2]\n", 201 | " [ 4.6 3.2 1.4 0.2]\n", 202 | " [ 5.3 3.7 1.5 0.2]\n", 203 | " [ 5. 3.3 1.4 0.2]\n", 204 | " [ 7. 3.2 4.7 1.4]\n", 205 | " [ 6.4 3.2 4.5 1.5]\n", 206 | " [ 6.9 3.1 4.9 1.5]\n", 207 | " [ 5.5 2.3 4. 1.3]\n", 208 | " [ 6.5 2.8 4.6 1.5]\n", 209 | " [ 5.7 2.8 4.5 1.3]\n", 210 | " [ 6.3 3.3 4.7 1.6]\n", 211 | " [ 4.9 2.4 3.3 1. ]\n", 212 | " [ 6.6 2.9 4.6 1.3]\n", 213 | " [ 5.2 2.7 3.9 1.4]\n", 214 | " [ 5. 2. 3.5 1. ]\n", 215 | " [ 5.9 3. 4.2 1.5]\n", 216 | " [ 6. 2.2 4. 1. ]\n", 217 | " [ 6.1 2.9 4.7 1.4]\n", 218 | " [ 5.6 2.9 3.6 1.3]\n", 219 | " [ 6.7 3.1 4.4 1.4]\n", 220 | " [ 5.6 3. 4.5 1.5]\n", 221 | " [ 5.8 2.7 4.1 1. ]\n", 222 | " [ 6.2 2.2 4.5 1.5]\n", 223 | " [ 5.6 2.5 3.9 1.1]\n", 224 | " [ 5.9 3.2 4.8 1.8]\n", 225 | " [ 6.1 2.8 4. 1.3]\n", 226 | " [ 6.3 2.5 4.9 1.5]\n", 227 | " [ 6.1 2.8 4.7 1.2]\n", 228 | " [ 6.4 2.9 4.3 1.3]\n", 229 | " [ 6.6 3. 4.4 1.4]\n", 230 | " [ 6.8 2.8 4.8 1.4]\n", 231 | " [ 6.7 3. 5. 
1.7]\n", 232 | " [ 6. 2.9 4.5 1.5]\n", 233 | " [ 5.7 2.6 3.5 1. ]\n", 234 | " [ 5.5 2.4 3.8 1.1]\n", 235 | " [ 5.5 2.4 3.7 1. ]\n", 236 | " [ 5.8 2.7 3.9 1.2]\n", 237 | " [ 6. 2.7 5.1 1.6]\n", 238 | " [ 5.4 3. 4.5 1.5]\n", 239 | " [ 6. 3.4 4.5 1.6]\n", 240 | " [ 6.7 3.1 4.7 1.5]\n", 241 | " [ 6.3 2.3 4.4 1.3]\n", 242 | " [ 5.6 3. 4.1 1.3]\n", 243 | " [ 5.5 2.5 4. 1.3]\n", 244 | " [ 5.5 2.6 4.4 1.2]\n", 245 | " [ 6.1 3. 4.6 1.4]\n", 246 | " [ 5.8 2.6 4. 1.2]\n", 247 | " [ 5. 2.3 3.3 1. ]\n", 248 | " [ 5.6 2.7 4.2 1.3]\n", 249 | " [ 5.7 3. 4.2 1.2]\n", 250 | " [ 5.7 2.9 4.2 1.3]\n", 251 | " [ 6.2 2.9 4.3 1.3]\n", 252 | " [ 5.1 2.5 3. 1.1]\n", 253 | " [ 5.7 2.8 4.1 1.3]\n", 254 | " [ 6.3 3.3 6. 2.5]\n", 255 | " [ 5.8 2.7 5.1 1.9]\n", 256 | " [ 7.1 3. 5.9 2.1]\n", 257 | " [ 6.3 2.9 5.6 1.8]\n", 258 | " [ 6.5 3. 5.8 2.2]\n", 259 | " [ 7.6 3. 6.6 2.1]\n", 260 | " [ 4.9 2.5 4.5 1.7]\n", 261 | " [ 7.3 2.9 6.3 1.8]\n", 262 | " [ 6.7 2.5 5.8 1.8]\n", 263 | " [ 7.2 3.6 6.1 2.5]\n", 264 | " [ 6.5 3.2 5.1 2. ]\n", 265 | " [ 6.4 2.7 5.3 1.9]\n", 266 | " [ 6.8 3. 5.5 2.1]\n", 267 | " [ 5.7 2.5 5. 2. ]\n", 268 | " [ 5.8 2.8 5.1 2.4]\n", 269 | " [ 6.4 3.2 5.3 2.3]\n", 270 | " [ 6.5 3. 5.5 1.8]\n", 271 | " [ 7.7 3.8 6.7 2.2]\n", 272 | " [ 7.7 2.6 6.9 2.3]\n", 273 | " [ 6. 2.2 5. 1.5]\n", 274 | " [ 6.9 3.2 5.7 2.3]\n", 275 | " [ 5.6 2.8 4.9 2. ]\n", 276 | " [ 7.7 2.8 6.7 2. ]\n", 277 | " [ 6.3 2.7 4.9 1.8]\n", 278 | " [ 6.7 3.3 5.7 2.1]\n", 279 | " [ 7.2 3.2 6. 1.8]\n", 280 | " [ 6.2 2.8 4.8 1.8]\n", 281 | " [ 6.1 3. 4.9 1.8]\n", 282 | " [ 6.4 2.8 5.6 2.1]\n", 283 | " [ 7.2 3. 5.8 1.6]\n", 284 | " [ 7.4 2.8 6.1 1.9]\n", 285 | " [ 7.9 3.8 6.4 2. ]\n", 286 | " [ 6.4 2.8 5.6 2.2]\n", 287 | " [ 6.3 2.8 5.1 1.5]\n", 288 | " [ 6.1 2.6 5.6 1.4]\n", 289 | " [ 7.7 3. 6.1 2.3]\n", 290 | " [ 6.3 3.4 5.6 2.4]\n", 291 | " [ 6.4 3.1 5.5 1.8]\n", 292 | " [ 6. 3. 4.8 1.8]\n", 293 | " [ 6.9 3.1 5.4 2.1]\n", 294 | " [ 6.7 3.1 5.6 2.4]\n", 295 | " [ 6.9 3.1 5.1 2.3]\n", 296 | " [ 5.8 2.7 5.1 1.9]\n", 297 | " [ 6.8 3.2 5.9 2.3]\n", 298 | " [ 6.7 3.3 5.7 2.5]\n", 299 | " [ 6.7 3. 5.2 2.3]\n", 300 | " [ 6.3 2.5 5. 1.9]\n", 301 | " [ 6.5 3. 5.2 2. ]\n", 302 | " [ 6.2 3.4 5.4 2.3]\n", 303 | " [ 5.9 3. 5.1 1.8]]\n" 304 | ] 305 | } 306 | ], 307 | "source": [ 308 | "# print the iris data\n", 309 | "# same data as shown previously\n", 310 | "# each row represents each sample\n", 311 | "# each column represents the features\n", 312 | "print(iris.data)" 313 | ] 314 | }, 315 | { 316 | "cell_type": "markdown", 317 | "metadata": {}, 318 | "source": [ 319 | "### 5. Machine learning terminology\n", 320 | "\n", 321 | "- Each row is an **observation** (also known as: sample, example, instance, record)\n", 322 | "- Each column is a **feature** (also known as: predictor, attribute, independent variable, input, regressor, covariate)" 323 | ] 324 | }, 325 | { 326 | "cell_type": "markdown", 327 | "metadata": {}, 328 | "source": [ 329 | "### 6. 
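To make the observation/feature terminology concrete, a minimal sketch of slicing one observation and one feature out of the data matrix (not part of the original notebook; it assumes `iris = load_iris()` as above):

```python
# a minimal sketch: rows are observations, columns are features
print(iris.data[0, :])   # first observation: all four measurements of one flower
print(iris.data[:, 0])   # first feature: sepal length across all 150 observations
```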
Exploring the iris dataset" 330 | ] 331 | }, 332 | { 333 | "cell_type": "code", 334 | "execution_count": null, 335 | "metadata": { 336 | "collapsed": false 337 | }, 338 | "outputs": [], 339 | "source": [ 340 | "# print the names of the four features\n", 341 | "print(iris.feature_names)" 342 | ] 343 | }, 344 | { 345 | "cell_type": "code", 346 | "execution_count": null, 347 | "metadata": { 348 | "collapsed": false 349 | }, 350 | "outputs": [], 351 | "source": [ 352 | "# print integers representing the species of each observation\n", 353 | "# 0, 1, and 2 represent different species\n", 354 | "print(iris.target)" 355 | ] 356 | }, 357 | { 358 | "cell_type": "code", 359 | "execution_count": null, 360 | "metadata": { 361 | "collapsed": false 362 | }, 363 | "outputs": [], 364 | "source": [ 365 | "# print the encoding scheme for species: 0 = setosa, 1 = versicolor, 2 = virginica\n", 366 | "print(iris.target_names)" 367 | ] 368 | }, 369 | { 370 | "cell_type": "markdown", 371 | "metadata": {}, 372 | "source": [ 373 | "- Each value we are predicting is the **response** (also known as: target, outcome, label, dependent variable)\n", 374 | "- **Classification** is supervised learning in which the response is categorical\n", 375 | " - \"0\": setosa\n", 376 | " - \"1\": versicolor\n", 377 | " - \"2\": virginica\n", 378 | "- **Regression** is supervised learning in which the response is ordered and continuous\n", 379 | " - any number (continuous)" 380 | ] 381 | }, 382 | { 383 | "cell_type": "markdown", 384 | "metadata": {}, 385 | "source": [ 386 | "### 7. Requirements for working with data in scikit-learn\n", 387 | "\n", 388 | "1. Features and response are **separate objects**\n", 389 | " - In this case, data and target are separate\n", 390 | "2. Features and response should be **numeric**\n", 391 | " - In this case, the features form a numeric 150 x 4 matrix and the response is numeric as well\n", 392 | "3. Features and response should be **NumPy arrays**\n", 393 | " - The iris dataset contains NumPy arrays already\n", 394 | " - For other datasets, load them into NumPy arrays first\n", 395 | "4. 
Features and response should have **specific shapes**\n", 396 | " - 150 x 4 for whole dataset\n", 397 | " - 150 x 1 for examples\n", 398 | " - 4 x 1 for features\n", 399 | " - you can convert the matrix accordingly using np.tile(a, [4, 1]), where a is the matrix and [4, 1] is the intended matrix dimensionality" 400 | ] 401 | }, 402 | { 403 | "cell_type": "code", 404 | "execution_count": 6, 405 | "metadata": { 406 | "collapsed": false 407 | }, 408 | "outputs": [ 409 | { 410 | "name": "stdout", 411 | "output_type": "stream", 412 | "text": [ 413 | "\n", 414 | "\n" 415 | ] 416 | } 417 | ], 418 | "source": [ 419 | "# check the types of the features and response\n", 420 | "print(type(iris.data))\n", 421 | "print(type(iris.target))" 422 | ] 423 | }, 424 | { 425 | "cell_type": "code", 426 | "execution_count": 8, 427 | "metadata": { 428 | "collapsed": false 429 | }, 430 | "outputs": [ 431 | { 432 | "name": "stdout", 433 | "output_type": "stream", 434 | "text": [ 435 | "(150, 4)\n" 436 | ] 437 | } 438 | ], 439 | "source": [ 440 | "# check the shape of the features (first dimension = number of observations, second dimensions = number of features)\n", 441 | "print(iris.data.shape)" 442 | ] 443 | }, 444 | { 445 | "cell_type": "code", 446 | "execution_count": 7, 447 | "metadata": { 448 | "collapsed": false 449 | }, 450 | "outputs": [ 451 | { 452 | "name": "stdout", 453 | "output_type": "stream", 454 | "text": [ 455 | "(150,)\n" 456 | ] 457 | } 458 | ], 459 | "source": [ 460 | "# check the shape of the response (single dimension matching the number of observations)\n", 461 | "print(iris.target.shape)" 462 | ] 463 | }, 464 | { 465 | "cell_type": "code", 466 | "execution_count": null, 467 | "metadata": { 468 | "collapsed": false 469 | }, 470 | "outputs": [], 471 | "source": [ 472 | "# store feature matrix in \"X\"\n", 473 | "X = iris.data\n", 474 | "\n", 475 | "# store response vector in \"y\"\n", 476 | "y = iris.target" 477 | ] 478 | }, 479 | { 480 | "cell_type": "markdown", 481 | "metadata": {}, 482 | "source": [ 483 | "### 8. Resources\n", 484 | "\n", 485 | "- scikit-learn documentation: [Dataset loading utilities](http://scikit-learn.org/stable/datasets/)\n", 486 | "- Jake VanderPlas: Fast Numerical Computing with NumPy ([slides](https://speakerdeck.com/jakevdp/losing-your-loops-fast-numerical-computing-with-numpy-pycon-2015), [video](https://www.youtube.com/watch?v=EEUXKG97YRw))\n", 487 | "- Scott Shell: [An Introduction to NumPy](http://www.engr.ucsb.edu/~shell/che210d/numpy.pdf) (PDF)" 488 | ] 489 | } 490 | ], 491 | "metadata": { 492 | "kernelspec": { 493 | "display_name": "Python [py3k]", 494 | "language": "python", 495 | "name": "Python [py3k]" 496 | }, 497 | "language_info": { 498 | "codemirror_mode": { 499 | "name": "ipython", 500 | "version": 3 501 | }, 502 | "file_extension": ".py", 503 | "mimetype": "text/x-python", 504 | "name": "python", 505 | "nbconvert_exporter": "python", 506 | "pygments_lexer": "ipython3", 507 | "version": "3.5.2" 508 | } 509 | }, 510 | "nbformat": 4, 511 | "nbformat_minor": 0 512 | } 513 | -------------------------------------------------------------------------------- /04_model_training.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "## Topics\n", 8 | "1. Reviewing the iris dataset\n", 9 | "2. K-nearest neighbors (KNN) classification\n", 10 | "3. Example of training data\n", 11 | "4. KNN classification map (K = 1)\n", 12 | "5. 
KNN classification map (K = 5)\n", 13 | "6. Loading the data\n", 14 | "7. scikit-learn 4-step modeling pattern\n", 15 | "8. Using a different value for K\n", 16 | "9. Using a different classification model (logistic regression)\n", 17 | "10. Different values explanation\n", 18 | "11. Resources" 19 | ] 20 | }, 21 | { 22 | "cell_type": "markdown", 23 | "metadata": {}, 24 | "source": [ 25 | "_This tutorial is derived from Data School's Machine Learning with scikit-learn tutorial. I added my own notes so anyone, including myself, can refer to this tutorial without watching the videos._" 26 | ] 27 | }, 28 | { 29 | "cell_type": "markdown", 30 | "metadata": {}, 31 | "source": [ 32 | "### 1. Reviewing the iris dataset" 33 | ] 34 | }, 35 | { 36 | "cell_type": "code", 37 | "execution_count": 1, 38 | "metadata": { 39 | "collapsed": false 40 | }, 41 | "outputs": [ 42 | { 43 | "data": { 44 | "text/html": [ 45 | "" 46 | ], 47 | "text/plain": [ 48 | "" 49 | ] 50 | }, 51 | "execution_count": 1, 52 | "metadata": {}, 53 | "output_type": "execute_result" 54 | } 55 | ], 56 | "source": [ 57 | "from IPython.display import HTML\n", 58 | "HTML('')" 59 | ] 60 | }, 61 | { 62 | "cell_type": "markdown", 63 | "metadata": {}, 64 | "source": [ 65 | "- 150 **observations**\n", 66 | "- 4 **features** (sepal length, sepal width, petal length, petal width)\n", 67 | "- **Response** variable is the iris species\n", 68 | "- **Classification** problem since response is categorical\n", 69 | "- More information in the [UCI Machine Learning Repository](http://archive.ics.uci.edu/ml/datasets/Iris)" 70 | ] 71 | }, 72 | { 73 | "cell_type": "markdown", 74 | "metadata": {}, 75 | "source": [ 76 | "### 2. K-nearest neighbors (KNN) classification" 77 | ] 78 | }, 79 | { 80 | "cell_type": "markdown", 81 | "metadata": {}, 82 | "source": [ 83 | "In essence, it's to classify your training data into groups that are similar." 84 | ] 85 | }, 86 | { 87 | "cell_type": "markdown", 88 | "metadata": {}, 89 | "source": [ 90 | "1. Pick a value for K.\n", 91 | "2. Search for the K observations in the training data that are \"nearest\" to the measurements of the unknown iris.\n", 92 | "3. Use the most popular response value from the K nearest neighbors as the predicted response value for the unknown iris." 93 | ] 94 | }, 95 | { 96 | "cell_type": "markdown", 97 | "metadata": {}, 98 | "source": [ 99 | "### 3. Example training data\n", 100 | "\n", 101 | "![Training data](images/04_knn_dataset.png)" 102 | ] 103 | }, 104 | { 105 | "cell_type": "markdown", 106 | "metadata": {}, 107 | "source": [ 108 | "- Dataset with 2 numerical features (x and y coordinates)\n", 109 | "- Each point: observation\n", 110 | "- Colour of point: response class (red, blue or green)" 111 | ] 112 | }, 113 | { 114 | "cell_type": "markdown", 115 | "metadata": {}, 116 | "source": [ 117 | "### 4. KNN classification map (K=1)\n", 118 | "\n", 119 | "![1NN classification map](images/04_1nn_map.png)" 120 | ] 121 | }, 122 | { 123 | "cell_type": "markdown", 124 | "metadata": {}, 125 | "source": [ 126 | "- Background colour: predicted response value for a new observation" 127 | ] 128 | }, 129 | { 130 | "cell_type": "markdown", 131 | "metadata": {}, 132 | "source": [ 133 | "### 5. 
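The three steps above are simple enough to write out directly. A minimal from-scratch sketch on a made-up 2-feature dataset (illustrative only; the notebook itself uses scikit-learn's KNeighborsClassifier instead):

```python
# a minimal from-scratch sketch of the three KNN steps above (illustrative only)
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k):
    # step 2: find the k training observations nearest to the unknown point
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    nearest = np.argsort(distances)[:k]
    # step 3: use the most popular response among the k nearest neighbors
    return Counter(y_train[nearest]).most_common(1)[0][0]

# step 1: pick a value for K (here k=3), on hypothetical data
X_train = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.8]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([5.1, 5.1]), k=3))  # -> 1
```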
KNN classification map (K=5)\n", 134 | "\n", 135 | "![5NN classification map](images/04_5nn_map.png)" 136 | ] 137 | }, 138 | { 139 | "cell_type": "markdown", 140 | "metadata": {}, 141 | "source": [ 142 | "- Decision boundaries have changed in this scenario\n", 143 | "- White areas: KNN cannot make a clear decision because there's a tie between 2 classes\n", 144 | "- It can make good predictions if the features have very dissimilar values" 145 | ] 146 | }, 147 | { 148 | "cell_type": "markdown", 149 | "metadata": {}, 150 | "source": [ 151 | "*Image Credits: [Data3classes](http://commons.wikimedia.org/wiki/File:Data3classes.png#/media/File:Data3classes.png), [Map1NN](http://commons.wikimedia.org/wiki/File:Map1NN.png#/media/File:Map1NN.png), [Map5NN](http://commons.wikimedia.org/wiki/File:Map5NN.png#/media/File:Map5NN.png) by Agor153. Licensed under CC BY-SA 3.0*" 152 | ] 153 | }, 154 | { 155 | "cell_type": "markdown", 156 | "metadata": {}, 157 | "source": [ 158 | "### 6. Loading the data" 159 | ] 160 | }, 161 | { 162 | "cell_type": "code", 163 | "execution_count": 2, 164 | "metadata": { 165 | "collapsed": false 166 | }, 167 | "outputs": [], 168 | "source": [ 169 | "# import load_iris function from datasets module\n", 170 | "from sklearn.datasets import load_iris\n", 171 | "\n", 172 | "# save \"bunch\" object containing iris dataset and its attributes\n", 173 | "iris = load_iris()\n", 174 | "\n", 175 | "# discover what data and target are\n", 176 | "# print iris.data\n", 177 | "# print iris.target\n", 178 | "\n", 179 | "# store feature matrix in \"X\"\n", 180 | "# we use an uppercase for \"X\" because it is a matrix of m x n dimension\n", 181 | "X = iris.data\n", 182 | "\n", 183 | "# store response vector in \"y\"\n", 184 | "# we use a lowercase for \"y\" because it is a vector or m x 1 dimension\n", 185 | "y = iris.target" 186 | ] 187 | }, 188 | { 189 | "cell_type": "code", 190 | "execution_count": 3, 191 | "metadata": { 192 | "collapsed": false 193 | }, 194 | "outputs": [ 195 | { 196 | "name": "stdout", 197 | "output_type": "stream", 198 | "text": [ 199 | "(150, 4)\n", 200 | "(150,)\n" 201 | ] 202 | } 203 | ], 204 | "source": [ 205 | "# print the shapes of X and y\n", 206 | "print(X.shape)\n", 207 | "print(y.shape)" 208 | ] 209 | }, 210 | { 211 | "cell_type": "markdown", 212 | "metadata": {}, 213 | "source": [ 214 | "### 7. scikit-learn 4-step modeling pattern" 215 | ] 216 | }, 217 | { 218 | "cell_type": "markdown", 219 | "metadata": {}, 220 | "source": [ 221 | "_Scikit-learn uses a common modeling pattern to use its algorithms. 
The steps are similar for using other algorithms._" 222 | ] 223 | }, 224 | { 225 | "cell_type": "markdown", 226 | "metadata": {}, 227 | "source": [ 228 | "**Step 1:** Import the class you plan to use" 229 | ] 230 | }, 231 | { 232 | "cell_type": "code", 233 | "execution_count": 4, 234 | "metadata": { 235 | "collapsed": false 236 | }, 237 | "outputs": [], 238 | "source": [ 239 | "from sklearn.neighbors import KNeighborsClassifier" 240 | ] 241 | }, 242 | { 243 | "cell_type": "markdown", 244 | "metadata": {}, 245 | "source": [ 246 | "**Step 2:** \"Instantiate\" the \"estimator\"\n", 247 | "\n", 248 | "- \"Estimator\" is scikit-learn's term for model\n", 249 | "- \"Instantiate\" means \"make an instance of\"" 250 | ] 251 | }, 252 | { 253 | "cell_type": "code", 254 | "execution_count": 5, 255 | "metadata": { 256 | "collapsed": false 257 | }, 258 | "outputs": [], 259 | "source": [ 260 | "knn = KNeighborsClassifier(n_neighbors=1)" 261 | ] 262 | }, 263 | { 264 | "cell_type": "markdown", 265 | "metadata": {}, 266 | "source": [ 267 | "- Name of the object does not matter\n", 268 | "- Can specify tuning parameters (aka \"hyperparameters\") during this step\n", 269 | " - n_neighbours is a hyperparameter where it represents k\n", 270 | "- All parameters not specified are set to their defaults" 271 | ] 272 | }, 273 | { 274 | "cell_type": "code", 275 | "execution_count": 6, 276 | "metadata": { 277 | "collapsed": false 278 | }, 279 | "outputs": [ 280 | { 281 | "name": "stdout", 282 | "output_type": "stream", 283 | "text": [ 284 | "KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',\n", 285 | " metric_params=None, n_jobs=1, n_neighbors=1, p=2,\n", 286 | " weights='uniform')\n" 287 | ] 288 | } 289 | ], 290 | "source": [ 291 | "print(knn)" 292 | ] 293 | }, 294 | { 295 | "cell_type": "markdown", 296 | "metadata": {}, 297 | "source": [ 298 | "**Step 3:** Fit the model with data (aka \"model training\")\n", 299 | "\n", 300 | "- Model is learning the relationship between X and y\n", 301 | "- Occurs in-place" 302 | ] 303 | }, 304 | { 305 | "cell_type": "code", 306 | "execution_count": 7, 307 | "metadata": { 308 | "collapsed": false 309 | }, 310 | "outputs": [ 311 | { 312 | "data": { 313 | "text/plain": [ 314 | "KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',\n", 315 | " metric_params=None, n_jobs=1, n_neighbors=1, p=2,\n", 316 | " weights='uniform')" 317 | ] 318 | }, 319 | "execution_count": 7, 320 | "metadata": {}, 321 | "output_type": "execute_result" 322 | } 323 | ], 324 | "source": [ 325 | "knn.fit(X, y)" 326 | ] 327 | }, 328 | { 329 | "cell_type": "markdown", 330 | "metadata": {}, 331 | "source": [ 332 | "**Step 4:** Predict the response for a new observation\n", 333 | "\n", 334 | "- New observations are called \"out-of-sample\" data\n", 335 | "- Uses the information it learned during the model training process" 336 | ] 337 | }, 338 | { 339 | "cell_type": "code", 340 | "execution_count": 8, 341 | "metadata": { 342 | "collapsed": false 343 | }, 344 | "outputs": [ 345 | { 346 | "data": { 347 | "text/plain": [ 348 | "array([2, 1])" 349 | ] 350 | }, 351 | "execution_count": 8, 352 | "metadata": {}, 353 | "output_type": "execute_result" 354 | } 355 | ], 356 | "source": [ 357 | "X_new = ([3, 5, 4, 2], [5, 4, 3, 2])\n", 358 | "knn.predict(X_new)" 359 | ] 360 | }, 361 | { 362 | "cell_type": "markdown", 363 | "metadata": {}, 364 | "source": [ 365 | "- Returns a NumPy array\n", 366 | "- 2 and 1 are the predicted species\n", 367 | " - \"0\": setosa\n", 368 | " - \"1\": 
versicolor\n", 369 | " - \"2\": virginica\n", 370 | "- Can predict for multiple observations at once" 371 | ] 372 | }, 373 | { 374 | "cell_type": "markdown", 375 | "metadata": {}, 376 | "source": [ 377 | "### 8. Using a different value for K" 378 | ] 379 | }, 380 | { 381 | "cell_type": "code", 382 | "execution_count": 9, 383 | "metadata": { 384 | "collapsed": false 385 | }, 386 | "outputs": [ 387 | { 388 | "data": { 389 | "text/plain": [ 390 | "array([1, 1])" 391 | ] 392 | }, 393 | "execution_count": 9, 394 | "metadata": {}, 395 | "output_type": "execute_result" 396 | } 397 | ], 398 | "source": [ 399 | "# instantiate the model (using the value K=5)\n", 400 | "knn = KNeighborsClassifier(n_neighbors=5)\n", 401 | "\n", 402 | "# fit the model with data\n", 403 | "knn.fit(X, y)\n", 404 | "\n", 405 | "# predict the response for new observations\n", 406 | "knn.predict(X_new)" 407 | ] 408 | }, 409 | { 410 | "cell_type": "markdown", 411 | "metadata": {}, 412 | "source": [ 413 | "### 9. Using a different classification model" 414 | ] 415 | }, 416 | { 417 | "cell_type": "code", 418 | "execution_count": 10, 419 | "metadata": { 420 | "collapsed": false 421 | }, 422 | "outputs": [ 423 | { 424 | "data": { 425 | "text/plain": [ 426 | "array([2, 0])" 427 | ] 428 | }, 429 | "execution_count": 10, 430 | "metadata": {}, 431 | "output_type": "execute_result" 432 | } 433 | ], 434 | "source": [ 435 | "# import the class\n", 436 | "from sklearn.linear_model import LogisticRegression\n", 437 | "\n", 438 | "# instantiate the model (using the default parameters)\n", 439 | "logreg = LogisticRegression()\n", 440 | "\n", 441 | "# fit the model with data\n", 442 | "logreg.fit(X, y)\n", 443 | "\n", 444 | "# predict the response for new observations\n", 445 | "logreg.predict(X_new)" 446 | ] 447 | }, 448 | { 449 | "cell_type": "markdown", 450 | "metadata": {}, 451 | "source": [ 452 | "### 10. Different values explanation" 453 | ] 454 | }, 455 | { 456 | "cell_type": "markdown", 457 | "metadata": {}, 458 | "source": [ 459 | "Why do the KNN models (with different values of K) and the logistic regression model predict different values? Which one is right?\n", 460 | "- We cannot tell here, because we do not have the true labels (outcomes) for these observations\n", 461 | "- We can, however, compare the models to determine which is the \"best\"\n", 462 | "- This will be covered in the following guides" 463 | ] 464 | }, 465 | { 466 | "cell_type": "markdown", 467 | "metadata": {}, 468 | "source": [ 469 | "### 11. 
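One way to answer "which model is best?" is to score each model on data it was not trained on, which is exactly where the next notebook picks up. A minimal sketch (not part of this notebook; note that in scikit-learn releases contemporary with it, train_test_split lives in sklearn.cross_validation rather than sklearn.model_selection):

```python
# a minimal sketch: compare KNN (K=5) and logistic regression on held-out data
from sklearn.model_selection import train_test_split
from sklearn import metrics

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=4)

for model in [KNeighborsClassifier(n_neighbors=5), LogisticRegression()]:
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    print(type(model).__name__, metrics.accuracy_score(y_test, y_pred))
```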
Resources\n", 470 | "\n", 471 | "- [Nearest Neighbors](http://scikit-learn.org/stable/modules/neighbors.html) (user guide), [KNeighborsClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html) (class documentation)\n", 472 | "- [Logistic Regression](http://scikit-learn.org/stable/modules/linear_model.html#logistic-regression) (user guide), [LogisticRegression](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) (class documentation)\n", 473 | "- [Videos from An Introduction to Statistical Learning](http://www.dataschool.io/15-hours-of-expert-machine-learning-videos/)\n", 474 | " - Classification Problems and K-Nearest Neighbors (Chapter 2)\n", 475 | " - Introduction to Classification (Chapter 4)\n", 476 | " - Logistic Regression and Maximum Likelihood (Chapter 4)" 477 | ] 478 | } 479 | ], 480 | "metadata": { 481 | "kernelspec": { 482 | "display_name": "Python [py3k]", 483 | "language": "python", 484 | "name": "Python [py3k]" 485 | }, 486 | "language_info": { 487 | "codemirror_mode": { 488 | "name": "ipython", 489 | "version": 3 490 | }, 491 | "file_extension": ".py", 492 | "mimetype": "text/x-python", 493 | "name": "python", 494 | "nbconvert_exporter": "python", 495 | "pygments_lexer": "ipython3", 496 | "version": "3.5.2" 497 | } 498 | }, 499 | "nbformat": 4, 500 | "nbformat_minor": 0 501 | } 502 | -------------------------------------------------------------------------------- /05_model_evaluation.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "## Topics\n", 8 | "1. Evaluation procedure 1 - Train and test on the entire dataset\n", 9 | " a. Logistic regression\n", 10 | " b. KNN (k = 5)\n", 11 | " c. KNN (k = 1)\n", 12 | " d. Problems with training and testing on the same data\n", 13 | "2. Evaluation procedure 2 - Train/test split\n", 14 | "3. Making predictions on out-of-sample data\n", 15 | "4. Downsides of train/test split\n", 16 | "5. Resources" 17 | ] 18 | }, 19 | { 20 | "cell_type": "markdown", 21 | "metadata": {}, 22 | "source": [ 23 | "_This tutorial is derived from Data School's Machine Learning with scikit-learn tutorial. I added my own notes so anyone, including myself, can refer to this tutorial without watching the videos._" 24 | ] 25 | }, 26 | { 27 | "cell_type": "markdown", 28 | "metadata": {}, 29 | "source": [ 30 | "### 1. Evaluation procedure 1 - Train and test on the entire dataset" 31 | ] 32 | }, 33 | { 34 | "cell_type": "markdown", 35 | "metadata": {}, 36 | "source": [ 37 | "1. Train the model on the **entire dataset**.\n", 38 | "2. Test the model on the **same dataset**, and evaluate how well we did by comparing the **predicted** response values with the **true** response values." 39 | ] 40 | }, 41 | { 42 | "cell_type": "code", 43 | "execution_count": 1, 44 | "metadata": { 45 | "collapsed": true 46 | }, 47 | "outputs": [], 48 | "source": [ 49 | "# read in the iris data\n", 50 | "from sklearn.datasets import load_iris\n", 51 | "iris = load_iris()\n", 52 | "\n", 53 | "# create X (features) and y (response)\n", 54 | "X = iris.data\n", 55 | "y = iris.target" 56 | ] 57 | }, 58 | { 59 | "cell_type": "markdown", 60 | "metadata": {}, 61 | "source": [ 62 | "#### 1a. 
Logistic regression" 63 | ] 64 | }, 65 | { 66 | "cell_type": "code", 67 | "execution_count": 2, 68 | "metadata": { 69 | "collapsed": false 70 | }, 71 | "outputs": [ 72 | { 73 | "data": { 74 | "text/plain": [ 75 | "array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n", 76 | " 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n", 77 | " 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1,\n", 78 | " 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 1, 1, 1, 1, 1, 1,\n", 79 | " 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,\n", 80 | " 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2,\n", 81 | " 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])" 82 | ] 83 | }, 84 | "execution_count": 2, 85 | "metadata": {}, 86 | "output_type": "execute_result" 87 | } 88 | ], 89 | "source": [ 90 | "# import the class\n", 91 | "from sklearn.linear_model import LogisticRegression\n", 92 | "\n", 93 | "# instantiate the model (using the default parameters)\n", 94 | "logreg = LogisticRegression()\n", 95 | "\n", 96 | "# fit the model with data\n", 97 | "logreg.fit(X, y)\n", 98 | "\n", 99 | "# predict the response values for the observations in X\n", 100 | "logreg.predict(X)" 101 | ] 102 | }, 103 | { 104 | "cell_type": "code", 105 | "execution_count": 3, 106 | "metadata": { 107 | "collapsed": false 108 | }, 109 | "outputs": [ 110 | { 111 | "data": { 112 | "text/plain": [ 113 | "150" 114 | ] 115 | }, 116 | "execution_count": 3, 117 | "metadata": {}, 118 | "output_type": "execute_result" 119 | } 120 | ], 121 | "source": [ 122 | "# store the predicted response values\n", 123 | "y_pred = logreg.predict(X)\n", 124 | "\n", 125 | "# check how many predictions were generated\n", 126 | "len(y_pred)" 127 | ] 128 | }, 129 | { 130 | "cell_type": "markdown", 131 | "metadata": {}, 132 | "source": [ 133 | "Classification accuracy:\n", 134 | "\n", 135 | "- **Proportion** of correct predictions\n", 136 | "- Common **evaluation metric** for classification problems" 137 | ] 138 | }, 139 | { 140 | "cell_type": "code", 141 | "execution_count": 4, 142 | "metadata": { 143 | "collapsed": false 144 | }, 145 | "outputs": [ 146 | { 147 | "name": "stdout", 148 | "output_type": "stream", 149 | "text": [ 150 | "0.96\n" 151 | ] 152 | } 153 | ], 154 | "source": [ 155 | "# compute classification accuracy for the logistic regression model\n", 156 | "from sklearn import metrics\n", 157 | "\n", 158 | "print(metrics.accuracy_score(y, y_pred))" 159 | ] 160 | }, 161 | { 162 | "cell_type": "markdown", 163 | "metadata": {}, 164 | "source": [ 165 | "- Known as **training accuracy** when you train and test the model on the same data\n", 166 | "- 96% of our predictions are correct" 167 | ] 168 | }, 169 | { 170 | "cell_type": "markdown", 171 | "metadata": {}, 172 | "source": [ 173 | "#### 1b. 
KNN (K=5)" 174 |    ] 175 |   }, 176 |   { 177 |    "cell_type": "code", 178 |    "execution_count": 5, 179 |    "metadata": { 180 |     "collapsed": false 181 |    }, 182 |    "outputs": [ 183 |     { 184 |      "name": "stdout", 185 |      "output_type": "stream", 186 |      "text": [ 187 |       "0.966666666667\n" 188 |      ] 189 |     } 190 |    ], 191 |    "source": [ 192 |     "from sklearn.neighbors import KNeighborsClassifier\n", 193 |     "knn = KNeighborsClassifier(n_neighbors=5)\n", 194 |     "knn.fit(X, y)\n", 195 |     "y_pred = knn.predict(X)\n", 196 |     "print(metrics.accuracy_score(y, y_pred))" 197 |    ] 198 |   }, 199 |   { 200 |    "cell_type": "markdown", 201 |    "metadata": {}, 202 |    "source": [ 203 |     "_The accuracy appears higher here, but testing a model on the same data it was trained on is a serious problem_" 204 |    ] 205 |   }, 206 |   { 207 |    "cell_type": "markdown", 208 |    "metadata": {}, 209 |    "source": [ 210 |     "#### 1c. KNN (K=1)" 211 |    ] 212 |   }, 213 |   { 214 |    "cell_type": "code", 215 |    "execution_count": 6, 216 |    "metadata": { 217 |     "collapsed": false 218 |    }, 219 |    "outputs": [ 220 |     { 221 |      "name": "stdout", 222 |      "output_type": "stream", 223 |      "text": [ 224 |       "1.0\n" 225 |      ] 226 |     } 227 |    ], 228 |    "source": [ 229 |     "knn = KNeighborsClassifier(n_neighbors=1)\n", 230 |     "knn.fit(X, y)\n", 231 |     "y_pred = knn.predict(X)\n", 232 |     "print(metrics.accuracy_score(y, y_pred))" 233 |    ] 234 |   }, 235 |   { 236 |    "cell_type": "markdown", 237 |    "metadata": {}, 238 |    "source": [ 239 |     "- KNN model\n", 240 |     "    1. Pick a value for K.\n", 241 |     "    2. Search for the K observations in the training data that are \"nearest\" to the measurements of the unknown iris\n", 242 |     "    3. Use the most popular response value from the K nearest neighbors as the predicted response value for the unknown iris\n", 243 |     "- This will always have 100% accuracy: because we are testing on the exact same data, the model always makes correct predictions\n", 244 |     "- KNN would search for the one nearest observation and find that exact same observation\n", 245 |     "    - KNN has memorized the training set\n", 246 |     "    - Because we are testing on the exact same data, it will always make a correct prediction" 247 |    ] 248 |   }, 249 |   { 250 |    "cell_type": "markdown", 251 |    "metadata": {}, 252 |    "source": [ 253 |     "#### 1d. Problems with training and testing on the same data\n", 254 |     "\n", 255 |     "- Goal is to estimate likely performance of a model on **out-of-sample data**\n", 256 |     "- But, maximizing training accuracy rewards **overly complex models** that won't necessarily generalize\n", 257 |     "- Unnecessarily complex models **overfit** the training data" 258 |    ] 259 |   }, 260 |   { 261 |    "cell_type": "markdown", 262 |    "metadata": {}, 263 |    "source": [ 264 |     "![Overfitting](images/05_overfitting.png)" 265 |    ] 266 |   }, 267 |   { 268 |    "cell_type": "markdown", 269 |    "metadata": {}, 270 |    "source": [ 271 |     "*Image Credit: [Overfitting](http://commons.wikimedia.org/wiki/File:Overfitting.svg#/media/File:Overfitting.svg) by Chabacano. 
Licensed under GFDL via Wikimedia Commons.*" 272 |    ] 273 |   }, 274 |   { 275 |    "cell_type": "markdown", 276 |    "metadata": {}, 277 |    "source": [ 278 |     "- Green line (decision boundary): overfit\n", 279 |     "    - Your accuracy would be high but may not generalize well to future observations\n", 280 |     "    - Your accuracy is high because the model classifies your training data perfectly, not because it handles out-of-sample data well\n", 281 |     "- Black line (decision boundary): just right\n", 282 |     "    - Likely to generalize well to future observations\n", 283 |     "- Hence we address this issue using a **train/test split**, explained below" 284 |    ] 285 |   }, 286 |   { 287 |    "cell_type": "markdown", 288 |    "metadata": {}, 289 |    "source": [ 290 |     "### 2. Evaluation procedure 2 - Train/test split" 291 |    ] 292 |   }, 293 |   { 294 |    "cell_type": "markdown", 295 |    "metadata": {}, 296 |    "source": [ 297 |     "1. Split the dataset into two pieces: a **training set** and a **testing set**.\n", 298 |     "2. Train the model on the **training set**.\n", 299 |     "3. Test the model on the **testing set**, and evaluate how well we did." 300 |    ] 301 |   }, 302 |   { 303 |    "cell_type": "code", 304 |    "execution_count": 7, 305 |    "metadata": { 306 |     "collapsed": false 307 |    }, 308 |    "outputs": [ 309 |     { 310 |      "name": "stdout", 311 |      "output_type": "stream", 312 |      "text": [ 313 |       "(150, 4)\n", 314 |       "(150,)\n" 315 |      ] 316 |     } 317 |    ], 318 |    "source": [ 319 |     "# print the shapes of X and y\n", 320 |     "# X is our feature matrix with dimensions 150 x 4\n", 321 |     "print(X.shape)\n", 322 |     "# y is our response vector with dimensions 150 x 1\n", 323 |     "print(y.shape)" 324 |    ] 325 |   }, 326 |   { 327 |    "cell_type": "code", 328 |    "execution_count": 8, 329 |    "metadata": { 330 |     "collapsed": false 331 |    }, 332 |    "outputs": [], 333 |    "source": [ 334 |     "# STEP 1: split X and y into training and testing sets\n", 335 |     "from sklearn.cross_validation import train_test_split\n", 336 |     "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=4)" 337 |    ] 338 |   }, 339 |   { 340 |    "cell_type": "markdown", 341 |    "metadata": {}, 342 |    "source": [ 343 |     "- test_size=0.4\n", 344 |     "    - 40% of observations to test set\n", 345 |     "    - 60% of observations to training set\n", 346 |     "- observations are assigned randomly; the random_state parameter makes the split reproducible\n", 347 |     "    - If you use random_state=4\n", 348 |     "    - Your data will be split exactly the same way every time" 349 |    ] 350 |   }, 351 |   { 352 |    "cell_type": "markdown", 353 |    "metadata": {}, 354 |    "source": [ 355 |     "![Train/test split](images/05_train_test_split.png)" 356 |    ] 357 |   }, 358 |   { 359 |    "cell_type": "markdown", 360 |    "metadata": {}, 361 |    "source": [ 362 |     "What did this accomplish?\n", 363 |     "\n", 364 |     "- Model can be trained and tested on **different data**\n", 365 |     "- Response values are known for the testing set, and thus **predictions can be evaluated**\n", 366 |     "- **Testing accuracy** is a better estimate than training accuracy of out-of-sample performance" 367 |    ] 368 |   }, 369 |   { 370 |    "cell_type": "code", 371 |    "execution_count": 9, 372 |    "metadata": { 373 |     "collapsed": false 374 |    }, 375 |    "outputs": [ 376 |     { 377 |      "name": "stdout", 378 |      "output_type": "stream", 379 |      "text": [ 380 |       "(90, 4)\n", 381 |       "(60, 4)\n" 382 |      ] 383 |     } 384 |    ], 385 |    "source": [ 386 |     "# print the shapes of the new X objects\n", 387 |     "print(X_train.shape)\n", 388 |     "print(X_test.shape)" 389 |    ] 390 |   }, 391 |   { 392 |    "cell_type": "code", 393 |    "execution_count": 10, 394 |    "metadata": { 395 |     "collapsed": false 396 |    }, 
397 |    "outputs": [ 398 |     { 399 |      "name": "stdout", 400 |      "output_type": "stream", 401 |      "text": [ 402 |       "(90,)\n", 403 |       "(60,)\n" 404 |      ] 405 |     } 406 |    ], 407 |    "source": [ 408 |     "# print the shapes of the new y objects\n", 409 |     "print(y_train.shape)\n", 410 |     "print(y_test.shape)" 411 |    ] 412 |   }, 413 |   { 414 |    "cell_type": "code", 415 |    "execution_count": 11, 416 |    "metadata": { 417 |     "collapsed": false 418 |    }, 419 |    "outputs": [ 420 |     { 421 |      "data": { 422 |       "text/plain": [ 423 |        "LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,\n", 424 |        "          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,\n", 425 |        "          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,\n", 426 |        "          verbose=0, warm_start=False)" 427 |       ] 428 |      }, 429 |      "execution_count": 11, 430 |      "metadata": {}, 431 |      "output_type": "execute_result" 432 |     } 433 |    ], 434 |    "source": [ 435 |     "# STEP 2: train the model on the training set\n", 436 |     "logreg = LogisticRegression()\n", 437 |     "logreg.fit(X_train, y_train)" 438 |    ] 439 |   }, 440 |   { 441 |    "cell_type": "code", 442 |    "execution_count": 12, 443 |    "metadata": { 444 |     "collapsed": false 445 |    }, 446 |    "outputs": [ 447 |     { 448 |      "name": "stdout", 449 |      "output_type": "stream", 450 |      "text": [ 451 |       "0.95\n" 452 |      ] 453 |     } 454 |    ], 455 |    "source": [ 456 |     "# STEP 3: make predictions on the testing set\n", 457 |     "y_pred = logreg.predict(X_test)\n", 458 |     "\n", 459 |     "# compare actual response values (y_test) with predicted response values (y_pred)\n", 460 |     "print(metrics.accuracy_score(y_test, y_pred))" 461 |    ] 462 |   }, 463 |   { 464 |    "cell_type": "markdown", 465 |    "metadata": {}, 466 |    "source": [ 467 |     "Repeat for KNN with K=5:" 468 |    ] 469 |   }, 470 |   { 471 |    "cell_type": "code", 472 |    "execution_count": 13, 473 |    "metadata": { 474 |     "collapsed": false 475 |    }, 476 |    "outputs": [ 477 |     { 478 |      "name": "stdout", 479 |      "output_type": "stream", 480 |      "text": [ 481 |       "0.966666666667\n" 482 |      ] 483 |     } 484 |    ], 485 |    "source": [ 486 |     "knn = KNeighborsClassifier(n_neighbors=5)\n", 487 |     "knn.fit(X_train, y_train)\n", 488 |     "y_pred = knn.predict(X_test)\n", 489 |     "print(metrics.accuracy_score(y_test, y_pred))" 490 |    ] 491 |   }, 492 |   { 493 |    "cell_type": "markdown", 494 |    "metadata": {}, 495 |    "source": [ 496 |     "Repeat for KNN with K=1:" 497 |    ] 498 |   }, 499 |   { 500 |    "cell_type": "code", 501 |    "execution_count": 14, 502 |    "metadata": { 503 |     "collapsed": false 504 |    }, 505 |    "outputs": [ 506 |     { 507 |      "name": "stdout", 508 |      "output_type": "stream", 509 |      "text": [ 510 |       "0.95\n" 511 |      ] 512 |     } 513 |    ], 514 |    "source": [ 515 |     "knn = KNeighborsClassifier(n_neighbors=1)\n", 516 |     "knn.fit(X_train, y_train)\n", 517 |     "y_pred = knn.predict(X_test)\n", 518 |     "print(metrics.accuracy_score(y_test, y_pred))" 519 |    ] 520 |   }, 521 |   { 522 |    "cell_type": "markdown", 523 |    "metadata": {}, 524 |    "source": [ 525 |     "Can we locate an even better value for K?" 
526 |    ] 527 |   }, 528 |   { 529 |    "cell_type": "code", 530 |    "execution_count": 15, 531 |    "metadata": { 532 |     "collapsed": false 533 |    }, 534 |    "outputs": [ 535 |     { 536 |      "name": "stdout", 537 |      "output_type": "stream", 538 |      "text": [ 539 |       "[0.94999999999999996, 0.94999999999999996, 0.96666666666666667, 0.96666666666666667, 0.96666666666666667, 0.98333333333333328, 0.98333333333333328, 0.98333333333333328, 0.98333333333333328, 0.98333333333333328, 0.98333333333333328, 0.98333333333333328, 0.98333333333333328, 0.98333333333333328, 0.98333333333333328, 0.98333333333333328, 0.98333333333333328, 0.96666666666666667, 0.98333333333333328, 0.96666666666666667, 0.96666666666666667, 0.96666666666666667, 0.96666666666666667, 0.94999999999999996, 0.94999999999999996]\n" 540 |      ] 541 |     } 542 |    ], 543 |    "source": [ 544 |     "# try K=1 through K=25 and record testing accuracy\n", 545 |     "k_range = range(1, 26)\n", 546 |     "\n", 547 |     "# We can create an empty Python list using [] or list()\n", 548 |     "scores = []\n", 549 |     "\n", 550 |     "# We loop through the K values 1 to 25\n", 551 |     "# We append each testing accuracy score to the list\n", 552 |     "for k in k_range:\n", 553 |     "    knn = KNeighborsClassifier(n_neighbors=k)\n", 554 |     "    knn.fit(X_train, y_train)\n", 555 |     "    y_pred = knn.predict(X_test)\n", 556 |     "    scores.append(metrics.accuracy_score(y_test, y_pred))\n", 557 |     "    \n", 558 |     "print(scores)" 559 |    ] 560 |   }, 561 |   { 562 |    "cell_type": "code", 563 |    "execution_count": 16, 564 |    "metadata": { 565 |     "collapsed": false 566 |    }, 567 |    "outputs": [ 568 |     { 569 |      "data": { 570 |       "text/plain": [ 571 |        "" 572 |       ] 573 |      }, 574 |      "execution_count": 16, 575 |      "metadata": {}, 576 |      "output_type": "execute_result" 577 |     }, 578 |     { 579 |      "data": { 580 |       "image/png": "iVBORw0KGgoAAAANSUhEUgAAAZMAAAEPCAYAAACHuClZAAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAAIABJREFUeJzt3Xu4XFWd5vHvmxuQgEmEhBhCbiQCobkqkW5Ej4IS6dY4\nGS/EZgTlCbElDdo9M0TGngRH7YSepkWRaRGaJ/ag2NJR8AYB4Yj2GIMkhFtITsiFxFwMhEsCAknO
nAScBMSScUbHY1sCIiTgUuBr6e7jsa+GvgjIg4hWSE4wvTfeYC90XE8cD9wBeyfB2txGFi\nWco6THIGDYIxY2D9+uqfy4rLumYyFeiIiI0RsQe4HZhesM0UkkAgIlYD4yWNSNf1B4ZIGgAMBn6f\nLp8OLEofLwI+nN1LaC0OE8tSVmFSbJRrdw+urazD5BhgU97zzemyfCuBGQCSpgJjgTERsQX4R+AZ\nkhB5ISJ+ke4zMiK2A0TENmBkZq+gxbhbsGUpqy/4Yj+CfN2kthrhAvwCYLik5cDlwApgn6RhJDWQ\nccBo4HBJnyhxDE+nWCWumViWsuqy6zCpv6z77PyepKaRM4YDTVUARMQu4NO555LWAeuAacC6iNiZ\nLl8M/BnwXWC7pKMjYrukUcAfShVg/vz5+x+3tbXR1tZW2StqYn/8I+zYAWPHdr+tWW/kdw9WjyaF\nLS2i+Pw7kybBT35SnXM0u/b2dtrb2ys6RqZzwEvqD6wGzgW2AsuAmRGxKm+bocArEbFH0izg7Ii4\nJG3yugU4E3gNuBV4KCK+KWkhsDMiFqY9xIZHxNwi5/cc8D3w+OPw0Y/CqlXdb2vWW29+Mzz11IFu\nvJV69tkkOJ5//o0B1dEB73+/L8L3Rm/mgM+0mSsi9gFzgCXAE8DtEbFK0mxJudnFTwQel7SKpNfX\nlem+y4A7SJq9VgICbkr3WQi8T1IuqBZk+TpahZu4rBaq3fxU2C04Z/z4ZF6e116r3rmstMxvTYuI\nu4HjC5Z9K+/x0sL1eeuuAa4psnwncF51S2oOE6uFXJicfXZ1jlfqcztwYNJku24dnHhidc5lpTXC\nBXhrEA4Tq4Vq10yKdQvOP5e7B9eGw8T2K3YR06zasmrmqsW5rDSHie3X1S88s2qpdm3BYdIYHCYG\nwCuvJAPwHXtsvUtizS53r0k1OlqW6hZceC7LnsPEgOSX4sSJ0M+fCMvY8OHJ1Lrbt1d+rB07oH9/\nOPLI4utdM6kdf3UY4IvvVlvV+pLv7nM7blwSWq++Wvm5rGsOEwMcJlZbtQqTAQOSQHn66crPZV1z\nmBjgMLHaqlaYlNNpxN2Da8NhYoC7BVtt1apmUs1zWdccJga4ZmK1Va3agsOkcThMjN274cUX4ZjC\nmWbMMjJpUhImlXQP7q5bcI7DpDYcJsbatXDcce4WbLUzdCgMHgxbt/b+GNu3J12Mhw/vejvfa1Ib\n/vowN3FZXVRaYyj3czt2bHI/yiuv9P5c1j2HiTlMrC5qFSb9+8OECe4enDWHiTlMrC4qDZOejCXn\n7sHZc5iYuwVbXdSqZlKNc1n3HCbmmonVhcOkuThMWtxLLyVdg0ePrndJrNVMmpRcx+js7Pm+EUmz\nVbk1aodJ9hwmLS73P2Th/NlmWTviCHjTm5J52ntq69aka/HQoeVt7+7B2cs8TCRNk/SUpDWSriqy\nfpikxZJWSloqaUq6/K2SVkhanv77oqQr0nXzJG1O1y2XNC3r19Gs3MRl9dTbGkNPP7fHHgs7d8LL\nL/f8XFaeTMNEUj/gBuB84CRgpqQTCja7GlgREacCFwNfB4iINRFxekScAbwNeBlYnLffdRFxRvp3\nd5avo5k5TKyeahUm/fol8/W4R1d2sq6ZTAU6ImJjROwBbgemF2wzBbgfICJWA+MljSjY5jzg6YjY\nnLfMDTNV4DCxeuptmPRmiml3D85W1mFyDLAp7/nmdFm+lcAMAElTgbHAmIJtPg58r2DZHEmPSLpZ\nUpktp1bI3YKtnmpVM6nkXFaeAfUuALAAuF7ScuAxYAWwL7dS0kDgQ8DcvH1uBL4UESHpy8B1wKXF\nDj5//vz9j9va2mhra6ty8fs210ysnmodJr/9bc/P1Qra29tpb2+v6BiKSobt7O7g0lnA/IiYlj6f\nC0RELOxin/XAyRGxO33+IeCzuWMU2X4c8OOIOKXIusjy9fV1L7yQXJh86SX35rL6ePllOOqo5N9y\nBxrt7Ex6gm3blvxbrgcegHnz4MEHe1fWViKJiOjRt0LWzVwPAZMkjZM0CLgQuCt/A0lD09oHkmYB\nv8wFSWomBU1ckkblPZ0BPJ5F4ZuduwVbvQ0Zkoz6u3lz99vmbNmSdCnuSZCAm7mylmkzV0TskzQH\nWEISXLdExCpJs5PVcRNwIrBIUifwBHnNVZIGk1x8v6zg0NdKOg3oBDYAs7N8Hc3KTVzWCHJf8mPH\nlrd9b6/zjR6dzNuza1fPg8i6l/k1k7Tb7vEFy76V93hp4fq8da8AhT27iIhPVrmYLclhYo0gFybn\nnlve9r393Pbrl8zbs3YtnH56z/e3rvkO+BbmMLFG0NPmp950C84/l7sHZ8Nh0sLcLdgaQU/DpJIf\nQb5ukh2HSQtzzcQagcOkOThMWtTOnbBnD4wcWe+SWKs77jhYvx727et+285OWLeu9zVqh0l2HCYt\nKvfrzt2Crd4GD4YRI2DTpu633bw56Uo8ZEjvzuUwyY7DpEVVchHTrNrKHSK+0ut8b3lLMn/PSy/1\n/hhWnMOkRfl6iTWScmsMlX5uJc9tkhWHSYtymFgjKTdMqlGjdvfgbDhMWpS7BVsjqVXNpCfnsp5x\nmLSgCNdMrLE4TPq+bsNE0l95vpDm8txzSaAcdVS9S2KWOO442LgR9u4tvc2+fUkX4uOOq+xcDpNs\nlFMzGQcsl/RdSedlXSDLnrsFW6M59FA4+ugkUErZtCn5ATR4cGXncphko9swiYi5wGTgNuAzkjok\nfUnS+IzLZhlxt2BrRJMmdX1hvFpNs0cfDa++msznY9VT1jWTiMgN9b6BZNj3twB3Svr7zEpmmfH1\nEmtE3dUYqtVpxN2Ds1HONZPLJS0DrgceBk6JiFnA6SRzs1sf4zCxRtRdmFSzRu3uwdVXznwmo4GZ\nEfF0/sKI6Eyn1LU+xt2CrRFNngy/+EXp9R0d8K53Ve9crplUVznNXD8CtueeSDpC0tsBIsLT5fYx\n7hZsjaqcZq5q1kwcJtVVTpjcBLyS9/xl4FsltrUGt2MH9O8PRx5Z75KYvdHEifDMM8lo1oX27oUN\nGyrvFpzjMKm+csKkX3oBHth/MX5gdkWyLLlWYo3qkEOSedo3bDh43TPPJL2wDj20OudymFRfOWGy\nPr1xsb+kfpIuJ+nVVRZJ0yQ9JWmNpKuKrB8mabGklZKWSpqSLn+rpBWSlqf/vijpinTdcElLJK2W\ndI9vqiyfw8QaWakL49X+3I4YkdR2du6s3jFbXTlhMhs4l+S6yXbg3cCscg4uqR9wA3A+cBIwU9IJ\nBZtdDayIiFOBi4GvA0TEmog4PSLOAN5G0ry2ON1nLnBfRBwP3A98oZzymO8xscZWqstutTuNuHtw\n9ZVz0+L2iPhIRBwVESMi4mMRsb27/VJTgY6I2BgRe4DbgekF20whCQQiYjUwXtKIgm3OA56OiM3p\n8+nAovTxIuDDZZan5blmYo2sVPNTFj+C3NRVXd12DZZ0CHAJSc1
if4tlRFxWxvGPAfLnT9tMEjD5\nVgIzgP+QNBUYC4wBduRt83Hge3nPR+YCLSK2SfLks2Vyt2BrZJMnwz33HLy8owPe+97qn8v3mlRP\nOfeZfAdYB/wF8BXgE8ATVSzDAuB6ScuBx4AVwP7ZoCUNBD5E0rRVSpRaMX/+/P2P29raaGtrq6y0\nfZi7BVujK1VbyOJzWyq4WlF7ezvt7e0VHUMRJb+Hkw2kFRFxuqRHI+KU9Mv9VxFxVrcHl84C5kfE\ntPT5XCAiYmEX+6wHTo6I3enzDwGfzR0jXbYKaIuI7ZJGAQ9ExIlFjhXdvb5Wsm0b/MmfwLPP1rsk\nZsW9/joccQTs2gWDBiXL9u6Fww+HF19MenxVy29+A1deCcuWVe+YzUISEdGjoWDLuQCf6/X9gqQT\ngSOAcpuVHgImSRonaRBwIXBX/gaShqYBhaRZwC9zQZKayRubuEiPcUn6+GLgzjLL09JcK7FGN2gQ\njBmTDDWfs2FDMnd7NYMEDtSC/HuzOsoJk1skDQfmAfcAa4D/Xc7BI2IfMAdYQtI0dntErJI0W1Lu\nmsuJwONpbeN84Mrc/pIGk1x8X/zGI7MQeJ+k1SQ9zRaUU55W5zCxvqCwqSurz23uxt3nnqv+sVtR\nl9dMJPUHno2I54EHSC6O90hE3A0cX7DsW3mPlxauz1v3ClDYs4uI2EkSMtYDDhPrCwqHos/qcysd\nCC5PFFe5Lmsmac3i6hqVxTLme0ysLyismaxdm10PRN9rUj3lNHMtkfQ5SW+R9KbcX+Yls6pzt2Dr\nC2rVzJU7l7sHV0c5XYMvSv/927xlQS+avKx+Ilwzsb6h1mHy059mc+xW022YRMSxtSiIZWvr1mTu\n7KEexcwa3PjxsGULvPYa9OsHmzfDhAnZnMt3wVdPOXfAf6LY8oj4bvWLY1nxxXfrKwYOhLFjYd26\nZLqEY445cM9JteV3D1aP7qqwQuU0c52T9/hQ4L0k0/c6TPoQh4n1Jbkv+f79s/3cvvnNMGBAMs/P\nSA/KVJFymrn+Kv95es+Jg6SPcZhYX5K7MN6vX/af21xwOUwqU05vrkK7gInVLohlyxffrS/JddnN\nsltw4bmsMuVcM/khBwZS7EcyerCHL+lj3C3Y+pLJk+HOO5NmrgsuyP5c7h5cuXKumdyQ93gvsDEi\nNmRTHMtCZyc8/bRrJtZ31OqaSe5cd/rnccXKCZMO4A8R8SqApMMkHRsRm7rZzxrEli3JSKxHHFHv\nkpiVZ9w42J5OwTd+fLbncvfg6ijnmslioDPveSfw79kUx7Lgi+/W1wwYkATKsccmXYWz5NGDq6Oc\nmsmAiHg99yQiXktnX7Q+wmFifdHkyUkTbdaGDYNDD01qQqNGZX++ZlVOmDwn6YKI+BmApL8AdmZb\nrOb35JPwta/V5lwPPwwf+UhtzmVWLbUKk9y5PvvZxh49+NJL4R3vqHcpSitnpsW3ktxXko7+zw7g\noohYk3HZKtbIMy0uWAAPPggf/nBtzvfBDyYTDJn1FRs3Jk1PWV8zAVi6FB59NPvz9Nb99yf///7T\nP9XmfL2ZabHbMMk7+DCAiHihF2Wri0YOk9yvjMsu635bM2ttP/oR3Hwz/OQntTlfJtP2SvpfkoZF\nxAsR8YKk4ZKu6X0xDXwToZmVry/cC1NOb66/yK+NpLMufjC7IrUG30RoZuWaOBE2bIC9e+tdktLK\nCZP+kvaP2SnpUCCjMTxbw+7d8MILyWioZmbdOeywZOywZ56pd0lKKydMbgfulXSxpIuBe+jBQI+S\npkl6StIaSVcVWT9M0mJJKyUtlTQlb91QST+QtErSE5LekS6fJ2mzpOXp37Ryy9MI1q6F445LBrEz\nMytHo99cWc6owV+V9ChwXrro2ogoa24ySf1IhmM5F9gCPCTpzoh4Km+zq4EVETFD0vHAN/POdT3w\ns4j4qKQBwOC8/a6LiOvKKUej8X0fZtZTuTA5//x6l6S4sn4bR8RPIuJzEfE5kvtOri/z+FOBjojY\nGBF7SGo50wu2mQLcn55nNTBe0oh0nvlzIuLWdN3eiHgpb78+O5WNw8TMeqrRayZlhYmkkyV9VdLT\nwD8A68s8/jFA/hhem9Nl+VYCM9LzTCWZW34MMAF4VtKtaVPWTZIOy9tvjqRHJN0sqU9NRuswMbOe\navQwKdnMJWkiMDP92w18HxgYEeeU2qeXFgDXS1oOPAasAPYBA4EzgMsj4neSvgbMBeYBNwJfioiQ\n9GXgOuDSYgefP3/+/sdtbW20tbVVufg919EBl1xS71KYWV+SZZi0t7fT3t5e0TFK3rQoqRP4FTAr\nd7e7pHURUfbEWJLOAuZHxLT0+VwgImJhF/usB04GhgC/yZ1P0juBqyLigwXbjwN+HBGnFDlWQ960\nOGpUMsSJe3OZWblefRWGDk16g2Y9+GW1b1r8GMnQKfdJulHSu+n5dYqHgEmSxqXdiy8E7ioo9FBJ\nA9PHs4BfRsTuiNgObEqHc4HkIv6T6Xb5w7HNAB7vYbnq5qWXYNcuGD263iUxs77k0EOTIVU2bqx3\nSYor2cwVEXcAd0g6AvhPJE1MR0v6BvDDiLi/u4NHxD5Jc4AlJMF1S0SskjQ7WR03AScCi9Ka0BO8\nsbnqCuC2NGzWAZ9Kl18r6TSS4fA3ALN78qLrKTcNqfps9wEzq5dcU1cj3vBc9thcAJKOIqmxfDwi\n3p1ZqaqkEZu5vv99+MEP4I476l0SM+trPvtZOOEEuOKKbM+Tydhc+SLi2Yi4sS8ESaNyTy4z661G\n7tHle7BrzGFiZr3lMLH9HCZm1lsOE9svdwHezKynJkyAzZvh9de737bWypnP5HlJOwv+1qcDMI7P\nvojN48UX4Y9/9DzTZtY7gwYl96dt2FDvkhysnDngvwls5cBIwTOB8STDoNwKvCeTkjWhXJc+dws2\ns97KNXW99a3db1tL5TRzfTAivhkRz6d/NwLvj4jbgDdnXL6m4uslZlapRr1uUk6Y/FHSjNyT9PFr\n6dPOTErVpBwmZlapvhwmFwGz0mslzwGzgP8iaTDwuUxL12QcJmZWqUYNk3Imx1oLfKDE6l9WtzjN\nraMDPvOZepfCzPqyRg2TbodTSYdQ+TTJRff94RMRl2VasipotOFUjjwSnnwSjj663iUxs75qzx44\n/PBk0NhDDsnmHL0ZTqWc3lx3AkuBX5PMM2K9sHNn8iEYObLeJTGzvmzgQBg7FtavT8bpahTlhMmQ\niPjbzEvS5NauTaqn7hZsZpWaNClp6mqkMCnnAvzPJb0/85I0OV98N7NqacTrJuWEyWeAuyXtTnt0\nPS9pZ9YFazYOEzOrlr4aJkeRzMc+FBiRPh+RZaGakcPEzKqlT4WJpNxX30kl/qwHHCZmVi2NGCYl\nuwZLuiUiLpX0qyKrIyLelW3RKtcoXYMj4M1vhjVrYITrdGZWob17k+7BL7yQzA1fbVXtGhwRubnY\n3xsRewpONLAX5WtZzz
2XBMpRR9W7JGbWDAYMgHHj4Omn4aQGaScq55rJb8tcVpSkaZKekrRG0lVF\n1g+TtFjSSklLJU3JWzc0Hep+laQnJL0jXT5c0hJJqyXdI2loueWpB3cLNrNqmzQp+W5pFF1dMxkp\n6VTgMEknSzol/XsnMLicg0vqB9wAnE9ynWWmpMKe0VcDKyLiVOBi4Ot5664HfhYRJwKnAqvS5XOB\n+yLieOB+4AvllKdefL3EzKqt0a6bdHXT4p+TDKMyhmROk9zv6l3A35V5/KlAR0RsBJB0OzAdeCpv\nmynA3wNExGpJ4yWNIBmZ+JyIuCRdtxd4Kd1nOvDu9PEioJ0kYBqSw8TMqm3yZHj00XqX4oCSNZOI\nuDUizgEujYh3RcQ56d8FEfGDMo9/DLAp7/nmdFm+lcAMAElTgbEkATYBeFbSrZKWS7pJ0mHpPiMj\nYntazm1AQw9S4jAxs2rrSzWTnJGS3hQRL0n6Z+AM4AsR8YsqlWEBcL2k5cBjwAqSMcAGpue6PCJ+\nJ+lrJLWPeRyoJeWU7LI1f/78/Y/b2tpoa2urUrHL5zAxs2qrZpi0t7fT3t5e0THKGTX40Yg4JR1S\n5XLgfwL/EhFv6/bg0lnA/IiYlj6fS9KteGEX+6wHTgaGAL+JiInp8ncCV0XEByWtAtoiYrukUcAD\n6XWVwmPVvWtwBAwbBuvWJaMGm5lVw759MGRIMojs4LKuYpevN12Dy+nNlfs2vgD4TkSsLHM/gIeA\nSZLGSRoEXAjclb9B2mNrYPp4FvDLiNidNmNtkpSb6fhc4Mn08V3AJenji0lGNm5IO3ZA//4OEjOr\nrv79YcKEpHtwIyinmWulpJ8BbwWulnQ4XTQr5YuIfZLmAEtIAuiWiFglaXayOm4CTgQWSeoEngAu\nzTvEFcBtadisAz6VLl8I/JukTwMbgY+VU556yHULNjOrtsmTk++Yk0+ud0nKC5NPAW8D1kbEK+lk\nWZd2s89+EXE3cHzBsm/lPV5auD5v3UrgzCLLdwLnlVuGevL1EjPLSm4o+kbQbXNVROwDJgJ/lS46\nrJz9LOEwMbOsNFKPrm5DQdINwHuAi9JFLwP/nGWhmonDxMyy0khhUk4z159FxBmSVkDSxJReTLcy\nOEzMLCuNFCblNFftSYdFCQBJRwKdmZaqSUQk/6EnTap3ScysGR17bNI1+OWX612SrsfmytVavgn8\nOzBC0jXAr0l6U1k3tm+HQw6B4cPrXRIza0b9+sHEiY0x4GNXzVzLgDMi4juSHibpPSXgoxHxeE1K\n18e5icvMspZr6jr11PqWo6sw2X/3Y0Q8QXIPiPWA7zExs6w1ylD0XYXJCEl/U2plRFyXQXmaimsm\nZpa1yZNh2bJ6l6LrC/D9gcOBI0r8WTccJmaWtUbp0dVVzWRrRHypZiVpQg4TM8tao4RJVzUTTzJb\ngYikHdPdgs0sS8ccAy++CLt21bccXYXJuTUrRRPaujUZFnpoQ89Ob2Z9Xb9+cNxx9b8I39VMiztr\nWZBm4yYuM6uVRmjq8oCNGXGYmFmtOEyamO8xMbNaaYR7TRwmGXHNxMxqxTWTJuYwMbNaaYQwUURZ\nM/D2SZKiHq+vsxOOOAK2bUv+NTPLUgQcfnjSi/RNb6r8eJKIiB7dHuKaSQa2bElCxEFiZrUg1X8K\n38zDRNI0SU9JWiPpqiLrh0laLGmlpKWSpuSt25AuXyFpWd7yeZI2S1qe/k3L+nX0hJu4zKzW6t3U\nVc5Mi72WTqp1A8kNkFuAhyTdGRFP5W12NbAiImZIOp5k/pTz0nWdQFtEPF/k8Nc16mCTDhMzq7V6\nh0nWNZOpQEdEbIyIPcDtwPSCbaYA9wNExGpgvKQR6Tp1UcaGHe7F3YLNrNbq3T046zA5BtiU93xz\nuizfSmAGgKSpwFhgTLougHslPSRpVsF+cyQ9IulmSQ01aIlrJmZWa/WumWTazFWmBcD1kpYDjwEr\ngH3purMjYmtaU7lX0qqI+DVwI/CliAhJXwauAy4tdvD58+fvf9zW1kZbW1tmLyTHYWJmtVZJmLS3\nt9Pe3l7R+TPtGizpLGB+RExLn88FIiJKziEvaT1wckTsLlg+D9hVeJ1E0jjgxxFxSpFj1bxrcGcn\nDBkCO3YkXfXMzGohIukWvGkTDBtW2bEasWvwQ8AkSeMkDQIuBO7K30DSUEkD08ezgF9GxG5JgyUd\nni4fArwfeDx9PirvEDNyyxvB5s0wfLiDxMxqq97dgzNt5oqIfZLmAEtIguuWiFglaXayOm4CTgQW\nSeokmWc+11x1NPBDSZGW87aIWJKuu1bSaSS9vTYAs7N8HT3hJi4zq5dcU9eZZ9b+3JlfM4mIu4Hj\nC5Z9K+81deebAAALf0lEQVTx0sL16fL1wGkljvnJKhezahwmZlYv9bwI7zvgq8xhYmb14jBpIr7H\nxMzqpZ73mjhMqsw1EzOrF9dMmsS+fbB+fTIfs5lZrY0cCXv2wM46TLruMKmiTZvgyCNh8OB6l8TM\nWpFUv9qJw6SK3MRlZvXmMGkCDhMzqzeHSRNwmJhZvTlMmoDDxMzqrV5DqjhMqsj3mJhZveVqJjUe\n49ZhUi1798KGDTBxYr1LYmat7Kijkn+fe66253WYVMkzzyR9vA87rN4lMbNWVq/uwQ6TKvH1EjNr\nFA6TPsxhYmaNwmHShzlMzKxROEz6MIeJmTUKh0kf5m7BZtYockPR17J7sMOkCvbuTXpzTZhQ75KY\nmSUDzvbvDzt21O6cmYeJpGmSnpK0RtJVRdYPk7RY0kpJSyVNyVu3IV2+QtKyvOXDJS2RtFrSPZKG\nZv06urJhA4waBYceWs9SmJkdUOumrkzDRFI/4AbgfOAkYKakEwo2uxpYERGnAhcDX89b1wm0RcTp\nETE1b/lc4L6IOB64H/hCVq+hHL5eYmaNpqnCBJgKdETExojYA9wOTC/YZgpJIBARq4Hxkkak61Si\njNOBRenjRcCHq13wnnCYmFmjabYwOQbYlPd8c7os30pgBoCkqcBYYEy6LoB7JT0kaVbePiMjYjtA\nRGwDRmZQ9rI5TMys0dQ6TAbU7lQlLQCul7QceAxYAexL150dEVvTmsq9klZFxK+LHKMqfRa2boV5\n83q+3733wje+UY0SmJlVx+TJ8OCDcNlltTlf1mHye5KaRs6YdNl+EbEL+HTuuaT1wLp03db03x2S\nfkjSbPZrYLukoyNiu6RRwB9KFWD+/Pn7H7e1tdHW1laysIcdBm9/e5mvLM873gHveU/P9zMzy8rp\np8OCBfD6691vu3p1O2vWtFd0PkWGHZEl9QdWA+cCW4FlwMyIWJW3zVDglYjYkzZlnR0Rl0gaDPSL\niN2ShgBLgGsiYomkhcDOiFiY9hAbHhFzi5w/snx9ZmbNSBIRoZ7sk2nNJCL2SZpDEgT9gFsiYpWk\n2cnquAk4EVgkqRN4Arg03f1o4IeSIi3nbRGxJF23EPg3SZ8GNgIfy/J1mJlZ1zK
tmdSbayZmZj3X\nm5qJ74A3M7OKOUzMzKxiDhMzM6uYw8TMzCrmMDEzs4o5TMzMrGIOEzMzq5jDxMzMKuYwMTOzijlM\nzMysYg4TMzOrmMPEzMwq5jAxM7OKOUzMzKxiDhMzM6uYw8TMzCrmMDEzs4o5TMzMrGIOEzMzq1jm\nYSJpmqSnJK2RdFWR9cMkLZa0UtJSSVMK1veTtFzSXXnL5knanC5fLmla1q/DzMxKyzRMJPUDbgDO\nB04CZko6oWCzq4EVEXEqcDHw9YL1VwJPFjn8dRFxRvp3d5WL3nTa29vrXYSG4ffiAL8XB/i9qEzW\nNZOpQEdEbIyIPcDtwPSCbaYA9wNExGpgvKQRAJLGABcANxc5tjIrdRPy/ygH+L04wO/FAX4vKpN1\nmBwDbMp7vjldlm8lMANA0lRgLDAmXfdPwH8Dosix50h6RNLNkoZWtdRmZtYjjXABfgEwXNJy4HJg\nBbBP0p8D2yPiEZJaSH5N5EZgYkScBmwDrqtxmc3MLI8iiv3or9LBpbOA+RExLX0+F4iIWNjFPuuA\nU0iupVwE7AUOA44AFkfEJwu2Hwf8OCJOKXKs7F6cmVkTi4geXUrIOkz6A6uBc4GtwDJgZkSsyttm\nKPBKROyRNAs4OyIuKTjOu4G/jYgPpc9HRcS29PHngTMj4hOZvRAzM+vSgCwPHhH7JM0BlpA0qd0S\nEaskzU5Wx03AicAiSZ3AE8ClZRz6WkmnAZ3ABmB2Ji/AzMzKkmnNxMzMWkMjXICvuu5ulGw1kjak\nN4WukLSs3uWpJUm3SNou6dG8ZcMlLZG0WtI9rdIbsMR70XI3AEsaI+l+SU9IekzSFenylvtcFHkv\n/jpd3uPPRdPVTNIbJdeQXKfZAjwEXBgRT9W1YHWUdmp4W0Q8X++y1JqkdwK7ge/kOmlIWgg8FxHX\npj82hkfE3HqWsxZKvBfzgF0R0TI9IiWNAkZFxCOSDgceJrn/7VO02Oeii/fi4/Twc9GMNZNybpRs\nNaI5/1t3KyJ+DRSG6HRgUfp4EfDhmhaqTkq8F9BiNwBHxLb0lgMiYjewiuTetpb7XJR4L3L3Avbo\nc9GMXzDl3CjZagK4V9JDaY+5VjcyIrZD8j8TMLLO5am3lr0BWNJ44DRgKXB0K38u8t6L36aLevS5\naMYwsYOdHRFnkAxNc3na3GEHNFdbb8+07A3AabPOHcCV6a/yws9By3wuirwXPf5cNGOY/J5kSJac\nMemylhURW9N/dwA/JGkKbGXbJR0N+9uM/1Dn8tRNROyIAxdOvw2cWc/y1IqkASRfnv8aEXemi1vy\nc1HsvejN56IZw+QhYJKkcZIGARcCd3WzT9OSNDj91YGkIcD7gcfrW6qaKxyO5y7gkvTxxcCdhTs0\nsTe8F+mXZs4MWuez8S/AkxFxfd6yVv1cHPRe9OZz0XS9uSDpGgxcz4EbJRfUuUh1I2kCSW0kSG5S\nva2V3g9J3wXagCOB7cA84EfAD4BjgY3AxyLihXqVsVZKvBfvIWkn338DcO66QbOSdDbwIPAYyf8X\nQTJ80zLg32ihz0UX78Un6OHnoinDxMzMaqsZm7nMzKzGHCZmZlYxh4mZmVXMYWJmZhVzmJiZWcUc\nJmZmVjGHifVp6fDZ7ytYdqWkb3az366My3WUpKWSHk778ueve0DSGenjCelUCe8rcox/SIcFLznN\ndTdleLekH+c9/7Kkn0kaKKld0kN5694m6YG8/Tol/Xne+h9LeldvymGtwWFifd13gZkFyy5Ml3cl\n6xuszgMejYi3RcR/FNtA0hjg58DnI+LeIpvMAk6JiLLm5FEyTXahSNd9EfhT4MPpaNoBjJB0fuG2\nqc3A/yjnvGbgMLG+79+BC9LxhZA0DnhLRPyHpCGS7pP0u3RysA8V7lzk1/s3JH0yfXxG7he8pJ/n\nxm0q2H+cpF+kx783nWzoVGAhMD2dWOiQIuUeDdwDfCEiflrkuHcChwMPS/po3nkeyZ0n3e5WSf9H\n0tL0nEUOpb8Bzgc+GBGv5637B+CLRd9VWAm8KOncEuvN3sBhYn1aOuHXMuAD6aILSYbEAHiV5Jf4\n24H3Av9Y6jCFC9Jw+gbwnyPiTOBW4KtF9v0GcGtEnEpSG/pGRKwE/ifw/Yg4IyJeK7LfonTbH5Z4\nXdOBV9L9f5B3ntNy58nb/JiIOCsi/muRQ50NzAY+EBGvFLzm3wCvSXp3sSIAXwH+rlj5zAo5TKwZ\n3E4SIqT/fi99LODvJa0E7gNGSyp3jorjgT8hmQdmBUmTz+gi2/1p3vn+leTLuxz3AhdJOrSLbfIH\np+zqPD/o4hhr0+O8v8SxSwZGOplWFF7zMSvGYWLN4E7gXEmnA4dFxIp0+V8CRwGnR8TpJEOKF355\n7+WN/x/k1gt4PK0ZnB4Rp0bEBzhYb6+9XEsywvUd6VTTxUSJx4Ve7mLdNpJ5bL4mqe2gE0Q8QPKa\nzyqx/1dJmsI8iJ91yWFifV5EvAy0kwyl/b28VUOBP0REp6T3AOPy1uV+mW8EpqQ9nIYBuWsEq0ku\nUJ8FSbOXpClFTv//ONAB4CLgVz0o9+eBF9NyF5NfM6nkPGtJhhH/v5JOKbLJV4D/XmLfe4HhQLH9\nzPZzmFiz+B7JF15+mNwGnJk2c11EMr91TgBExGaSayyPkzSXLU+X7wE+AiyU9AiwgqSpqdAVwKfS\nbf4SuLKMsub/yr8EGFWi+2/+dqXOU1aNISJ+B3wKuCudliDy1v2cpNZW6lhfIRmW3awkD0FvZmYV\nc83EzMwq5jAxM7OKOUzMzKxiDhMzM6uYw8TMzCrmMDEzs4o5TMzMrGIOEzMzq9j/B8v0jBDbKPEn\nAAAAAElFTkSuQmCC\n", 581 | "text/plain": [ 582 | "" 583 | ] 584 | }, 585 | "metadata": {}, 586 | "output_type": "display_data" 587 | } 588 | ], 589 | "source": [ 590 | "# import Matplotlib (scientific plotting library)\n", 591 | "import matplotlib.pyplot as plt\n", 592 | "\n", 593 | "# allow plots to appear within the notebook\n", 594 | "%matplotlib inline\n", 595 | "\n", 596 | "# plot the relationship between K and testing accuracy\n", 597 | "# plt.plot(x_axis, y_axis)\n", 598 | "plt.plot(k_range, scores)\n", 599 | "plt.xlabel('Value of K for KNN')\n", 600 | "plt.ylabel('Testing Accuracy')" 601 | ] 602 | }, 603 | { 604 | "cell_type": 
"markdown", 605 | "metadata": {}, 606 | "source": [ 607 | "- **Training accuracy** rises as model complexity increases\n", 608 | "- **Testing accuracy** penalizes models that are too complex or not complex enough\n", 609 | "- For KNN models, complexity is determined by the **value of K** (lower value = more complex)" 610 | ] 611 | }, 612 | { 613 | "cell_type": "markdown", 614 | "metadata": {}, 615 | "source": [ 616 | "### 3. Making predictions on out-of-sample data" 617 | ] 618 | }, 619 | { 620 | "cell_type": "code", 621 | "execution_count": 17, 622 | "metadata": { 623 | "collapsed": false 624 | }, 625 | "outputs": [ 626 | { 627 | "name": "stderr", 628 | "output_type": "stream", 629 | "text": [ 630 | "/Users/ritchieng/anaconda3/envs/py3k/lib/python3.5/site-packages/sklearn/utils/validation.py:386: DeprecationWarning: Passing 1d arrays as data is deprecated in 0.17 and willraise ValueError in 0.19. Reshape your data either using X.reshape(-1, 1) if your data has a single feature or X.reshape(1, -1) if it contains a single sample.\n", 631 | " DeprecationWarning)\n" 632 | ] 633 | }, 634 | { 635 | "data": { 636 | "text/plain": [ 637 | "array([1])" 638 | ] 639 | }, 640 | "execution_count": 17, 641 | "metadata": {}, 642 | "output_type": "execute_result" 643 | } 644 | ], 645 | "source": [ 646 | "# instantiate the model with the best known parameters\n", 647 | "knn = KNeighborsClassifier(n_neighbors=11)\n", 648 | "\n", 649 | "# train the model with X and y (not X_train and y_train)\n", 650 | "knn.fit(X, y)\n", 651 | "\n", 652 | "# make a prediction for an out-of-sample observation\n", 653 | "knn.predict([3, 5, 4, 2])" 654 | ] 655 | }, 656 | { 657 | "cell_type": "markdown", 658 | "metadata": {}, 659 | "source": [ 660 | "### 4. Downsides of train/test split" 661 | ] 662 | }, 663 | { 664 | "cell_type": "markdown", 665 | "metadata": {}, 666 | "source": [ 667 | "- Provides a **high-variance estimate** of out-of-sample accuracy\n", 668 | "- **K-fold cross-validation** overcomes this limitation\n", 669 | "- But, train/test split is still useful because of its **flexibility and speed**" 670 | ] 671 | }, 672 | { 673 | "cell_type": "markdown", 674 | "metadata": {}, 675 | "source": [ 676 | "### 5. 
Resources\n", 677 | "\n", 678 | "- Quora: [What is an intuitive explanation of overfitting?](http://www.quora.com/What-is-an-intuitive-explanation-of-overfitting/answer/Jessica-Su)\n", 679 | "- Video: [Estimating prediction error](https://www.youtube.com/watch?v=_2ij6eaaSl0&t=2m34s) (12 minutes, starting at 2:34) by Hastie and Tibshirani\n", 680 | "- [Understanding the Bias-Variance Tradeoff](http://scott.fortmann-roe.com/docs/BiasVariance.html)\n", 681 | " - [Guiding questions](https://github.com/justmarkham/DAT5/blob/master/homework/06_bias_variance.md) when reading this article\n", 682 | "- Video: [Visualizing bias and variance](http://work.caltech.edu/library/081.html) (15 minutes) by Abu-Mostafa" 683 | ] 684 | } 685 | ], 686 | "metadata": { 687 | "anaconda-cloud": {}, 688 | "kernelspec": { 689 | "display_name": "Python [py3k]", 690 | "language": "python", 691 | "name": "Python [py3k]" 692 | }, 693 | "language_info": { 694 | "codemirror_mode": { 695 | "name": "ipython", 696 | "version": 3 697 | }, 698 | "file_extension": ".py", 699 | "mimetype": "text/x-python", 700 | "name": "python", 701 | "nbconvert_exporter": "python", 702 | "pygments_lexer": "ipython3", 703 | "version": "3.5.2" 704 | }, 705 | "name": "" 706 | }, 707 | "nbformat": 4, 708 | "nbformat_minor": 0 709 | } 710 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Introduction to machine learning with scikit-learn (Data School) 2 | 3 | This repo contains the IPython/Jupyter notebooks from Dataschool's scikit-learn video series, as seen on Kaggle's blog. 4 | 5 | ## Entire series 6 | 7 | - [Read the blog posts](http://blog.kaggle.com/author/kevin-markham/) (Kaggle's blog) 8 | - [Watch the entire series](https://www.youtube.com/playlist?list=PL5-da3qGB5ICeMbQuqbbCOQWcS6OYBr5A) (YouTube playlist) 9 | - [View the IPython Notebooks](http://nbviewer.jupyter.org/github/justmarkham/scikit-learn-videos/tree/master/) (nbviewer) 10 | - [Run the IPython Notebooks online](http://mybinder.org/repo/justmarkham/scikit-learn-videos) (binder) 11 | 12 | ## Individual videos 13 | 14 | 1. What is machine learning, and how does it work? ([video](https://www.youtube.com/watch?v=elojMnjn4kk&list=PL5-da3qGB5ICeMbQuqbbCOQWcS6OYBr5A&index=1), [notebook](01_machine_learning_intro.ipynb), [blog post](http://blog.kaggle.com/2015/04/08/new-video-series-introduction-to-machine-learning-with-scikit-learn/)) 15 | - What is machine learning? 16 | - What are the two main categories of machine learning? 17 | - What are some examples of machine learning? 18 | - How does machine learning "work"? 19 | 20 | 2. Setting up Python for machine learning: scikit-learn and IPython Notebook ([video](https://www.youtube.com/watch?v=IsXXlYVBt1M&list=PL5-da3qGB5ICeMbQuqbbCOQWcS6OYBr5A&index=2), [notebook](02_machine_learning_setup.ipynb), [blog post](http://blog.kaggle.com/2015/04/15/scikit-learn-video-2-setting-up-python-for-machine-learning/)) 21 | - What are the benefits and drawbacks of scikit-learn? 22 | - How do I install scikit-learn? 23 | - How do I use the IPython Notebook? 24 | - What are some good resources for learning Python? 25 | 26 | 3. 
Getting started in scikit-learn with the famous iris dataset ([video](https://www.youtube.com/watch?v=hd1W4CyPX58&list=PL5-da3qGB5ICeMbQuqbbCOQWcS6OYBr5A&index=3), [notebook](03_getting_started_with_iris.ipynb), [blog post](http://blog.kaggle.com/2015/04/22/scikit-learn-video-3-machine-learning-first-steps-with-the-iris-dataset/)) 27 | - What is the famous iris dataset, and how does it relate to machine learning? 28 | - How do we load the iris dataset into scikit-learn? 29 | - How do we describe a dataset using machine learning terminology? 30 | - What are scikit-learn's four key requirements for working with data? 31 | 32 | 4. Training a machine learning model with scikit-learn ([video](https://www.youtube.com/watch?v=RlQuVL6-qe8&list=PL5-da3qGB5ICeMbQuqbbCOQWcS6OYBr5A&index=4), [notebook](04_model_training.ipynb), [blog post](http://blog.kaggle.com/2015/04/30/scikit-learn-video-4-model-training-and-prediction-with-k-nearest-neighbors/)) 33 | - What is the K-nearest neighbors classification model? 34 | - What are the four steps for model training and prediction in scikit-learn? 35 | - How can I apply this pattern to other machine learning models? 36 | 37 | 5. Comparing machine learning models in scikit-learn ([video](https://www.youtube.com/watch?v=0pP4EwWJgIU&list=PL5-da3qGB5ICeMbQuqbbCOQWcS6OYBr5A&index=5), [notebook](05_model_evaluation.ipynb), [blog post](http://blog.kaggle.com/2015/05/14/scikit-learn-video-5-choosing-a-machine-learning-model/)) 38 | - How do I choose which model to use for my supervised learning task? 39 | - How do I choose the best tuning parameters for that model? 40 | - How do I estimate the likely performance of my model on out-of-sample data? 41 | 42 | 6. Data science pipeline: pandas, seaborn, scikit-learn ([video](https://www.youtube.com/watch?v=3ZWuPVWq7p4&list=PL5-da3qGB5ICeMbQuqbbCOQWcS6OYBr5A&index=6), [notebook](06_linear_regression.ipynb), [blog post](http://blog.kaggle.com/2015/05/28/scikit-learn-video-6-linear-regression-plus-pandas-seaborn/)) 43 | - How do I use the pandas library to read data into Python? 44 | - How do I use the seaborn library to visualize data? 45 | - What is linear regression, and how does it work? 46 | - How do I train and interpret a linear regression model in scikit-learn? 47 | - What are some evaluation metrics for regression problems? 48 | - How do I choose which features to include in my model? 49 | 50 | 7. Cross-validation for parameter tuning, model selection, and feature selection ([video](https://www.youtube.com/watch?v=6dbrR-WymjI&list=PL5-da3qGB5ICeMbQuqbbCOQWcS6OYBr5A&index=7), [notebook](07_cross_validation.ipynb), [blog post](http://blog.kaggle.com/2015/06/29/scikit-learn-video-7-optimizing-your-model-with-cross-validation/)) 51 | - What is the drawback of using the train/test split procedure for model evaluation? 52 | - How does K-fold cross-validation overcome this limitation? 53 | - How can cross-validation be used for selecting tuning parameters, choosing between models, and selecting features? 54 | - What are some possible improvements to cross-validation? 55 | 56 | 8. Efficiently searching for optimal tuning parameters ([video](https://www.youtube.com/watch?v=Gol_qOgRqfA&list=PL5-da3qGB5ICeMbQuqbbCOQWcS6OYBr5A&index=8), [notebook](08_grid_search.ipynb), [blog post](http://blog.kaggle.com/2015/07/16/scikit-learn-video-8-efficiently-searching-for-optimal-tuning-parameters/)) 57 | - How can K-fold cross-validation be used to search for an optimal tuning parameter? 
58 | - How can this process be made more efficient? 59 | - How do you search for multiple tuning parameters at once? 60 | - What do you do with those tuning parameters before making real predictions? 61 | - How can the computational expense of this process be reduced? 62 | 63 | 9. Evaluating a classification model ([video](https://www.youtube.com/watch?v=85dtiMz9tSo&list=PL5-da3qGB5ICeMbQuqbbCOQWcS6OYBr5A&index=9), [notebook](09_classification_metrics.ipynb), [blog post](http://blog.kaggle.com/2015/10/23/scikit-learn-video-9-better-evaluation-of-classification-models/)) 64 | - What is the purpose of model evaluation, and what are some common evaluation procedures? 65 | - What is the usage of classification accuracy, and what are its limitations? 66 | - How does a confusion matrix describe the performance of a classifier? 67 | - What metrics can be computed from a confusion matrix? 68 | - How can you adjust classifier performance by changing the classification threshold? 69 | - What is the purpose of an ROC curve? 70 | - How does Area Under the Curve (AUC) differ from classification accuracy? 71 | -------------------------------------------------------------------------------- /images/01_clustering.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ritchieng/machine-learning-dataschool/33d7ea8b1171923a91871b1fb4a12b8c47ceefc2/images/01_clustering.png -------------------------------------------------------------------------------- /images/01_robot.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ritchieng/machine-learning-dataschool/33d7ea8b1171923a91871b1fb4a12b8c47ceefc2/images/01_robot.png -------------------------------------------------------------------------------- /images/01_spam_filter.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ritchieng/machine-learning-dataschool/33d7ea8b1171923a91871b1fb4a12b8c47ceefc2/images/01_spam_filter.png -------------------------------------------------------------------------------- /images/01_supervised_learning.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ritchieng/machine-learning-dataschool/33d7ea8b1171923a91871b1fb4a12b8c47ceefc2/images/01_supervised_learning.png -------------------------------------------------------------------------------- /images/02_ipython_header.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ritchieng/machine-learning-dataschool/33d7ea8b1171923a91871b1fb4a12b8c47ceefc2/images/02_ipython_header.png -------------------------------------------------------------------------------- /images/02_sklearn_algorithms.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ritchieng/machine-learning-dataschool/33d7ea8b1171923a91871b1fb4a12b8c47ceefc2/images/02_sklearn_algorithms.png -------------------------------------------------------------------------------- /images/02_sklearn_logo.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ritchieng/machine-learning-dataschool/33d7ea8b1171923a91871b1fb4a12b8c47ceefc2/images/02_sklearn_logo.png -------------------------------------------------------------------------------- /images/03_iris.png: 
-------------------------------------------------------------------------------- https://raw.githubusercontent.com/ritchieng/machine-learning-dataschool/33d7ea8b1171923a91871b1fb4a12b8c47ceefc2/images/03_iris.png -------------------------------------------------------------------------------- /images/04_1nn_map.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ritchieng/machine-learning-dataschool/33d7ea8b1171923a91871b1fb4a12b8c47ceefc2/images/04_1nn_map.png -------------------------------------------------------------------------------- /images/04_5nn_map.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ritchieng/machine-learning-dataschool/33d7ea8b1171923a91871b1fb4a12b8c47ceefc2/images/04_5nn_map.png -------------------------------------------------------------------------------- /images/04_knn_dataset.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ritchieng/machine-learning-dataschool/33d7ea8b1171923a91871b1fb4a12b8c47ceefc2/images/04_knn_dataset.png -------------------------------------------------------------------------------- /images/05_overfitting.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ritchieng/machine-learning-dataschool/33d7ea8b1171923a91871b1fb4a12b8c47ceefc2/images/05_overfitting.png -------------------------------------------------------------------------------- /images/05_train_test_split.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ritchieng/machine-learning-dataschool/33d7ea8b1171923a91871b1fb4a12b8c47ceefc2/images/05_train_test_split.png -------------------------------------------------------------------------------- /images/07_cross_validation_diagram.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ritchieng/machine-learning-dataschool/33d7ea8b1171923a91871b1fb4a12b8c47ceefc2/images/07_cross_validation_diagram.png -------------------------------------------------------------------------------- /images/09_confusion_matrix_1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ritchieng/machine-learning-dataschool/33d7ea8b1171923a91871b1fb4a12b8c47ceefc2/images/09_confusion_matrix_1.png -------------------------------------------------------------------------------- /images/09_confusion_matrix_2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ritchieng/machine-learning-dataschool/33d7ea8b1171923a91871b1fb4a12b8c47ceefc2/images/09_confusion_matrix_2.png -------------------------------------------------------------------------------- /ipython_tutorial.md: -------------------------------------------------------------------------------- 1 | # IPython Quick Introduction 2 | 3 | - Press shift + enter to run the current cell 4 | - Press esc to exit editing mode 5 | - Press enter to enter editing mode 6 | - Before pressing enter (i.e. while in command mode), you can press the following keys to change a cell's type or modify cells 7 |     - m for markdown 8 |     - y for code 9 |     - 1 to 6 for headings 10 |     - a to insert a cell above (b to insert one below) 11 |     - d, d (pressed twice) to delete a cell 12 | - Press h for shortcuts 13 | - Press command + s to save the IPython notebook 14 | 15 | 16 | This should be 
enough to get you started! If you need any more shortcuts, you can simply press the "h" key in command mode to see the full list! 17 | -------------------------------------------------------------------------------- /ml-with-text/.gitignore: -------------------------------------------------------------------------------- 1 | .ipynb_checkpoints/ 2 | .DS_Store 3 | *.pyc 4 | extras/ 5 | *.tpl -------------------------------------------------------------------------------- /ml-with-text/README.md: -------------------------------------------------------------------------------- 1 | ## Tutorial: Machine Learning with Text in scikit-learn 2 | 3 | Presented by [Kevin Markham](http://www.dataschool.io/about/) at PyCon on May 28, 2016. Watch the complete [tutorial video](https://www.youtube.com/watch?v=ZiKMIuYidY0&list=PL5-da3qGB5ICeMbQuqbbCOQWcS6OYBr5A&index=10) on YouTube. 4 | 5 | [![Watch the complete tutorial video on YouTube](youtube.jpg)](https://www.youtube.com/watch?v=ZiKMIuYidY0&list=PL5-da3qGB5ICeMbQuqbbCOQWcS6OYBr5A&index=10 "Machine Learning with Text in scikit-learn - PyCon 2016") 6 | 7 | ### Description 8 | 9 | Although numeric data is easy to work with in Python, most knowledge created by humans is actually raw, unstructured text. By learning how to transform text into data that is usable by machine learning models, you drastically increase the amount of data that your models can learn from. In this tutorial, we'll build and evaluate predictive models from real-world text using scikit-learn. 10 | 11 | ### Objectives 12 | 13 | By the end of this tutorial, attendees will be able to confidently build a predictive model from their own text-based data, including feature extraction, model building and model evaluation. 14 | 15 | ### Required Software 16 | 17 | Attendees will need to bring a laptop with [scikit-learn](http://scikit-learn.org/stable/install.html) and [pandas](http://pandas.pydata.org/pandas-docs/stable/install.html) (and their dependencies) already installed. Installing the [Anaconda distribution of Python](https://www.continuum.io/downloads) is an easy way to accomplish this. Both Python 2 and 3 are welcome. 18 | 19 | I will be leading the tutorial using the IPython/Jupyter notebook, and have added a pre-written notebook to this repository. I have also created a Python script that is identical to the notebook, which you can use in the Python environment of your choice. 20 | 21 | ### Tutorial Files 22 | 23 | * IPython/Jupyter notebooks: [tutorial.ipynb](tutorial.ipynb), [tutorial_with_output.ipynb](tutorial_with_output.ipynb), [exercise.ipynb](exercise.ipynb), [exercise_solution.ipynb](exercise_solution.ipynb) 24 | * Python scripts: [tutorial.py](tutorial.py), [exercise.py](exercise.py), [exercise_solution.py](exercise_solution.py) 25 | * Datasets: [data/sms.tsv](data/sms.tsv), [data/yelp.csv](data/yelp.csv) 26 | 27 | ### Prerequisite Knowledge 28 | 29 | Attendees to this tutorial should be comfortable working in Python, should understand the basic principles of machine learning, and should have at least basic experience with both pandas and scikit-learn. However, no knowledge of advanced mathematics is required. 30 | 31 | - If you need a refresher on scikit-learn or machine learning, I recommend reviewing the notebooks and/or videos from my [scikit-learn video series](https://github.com/justmarkham/scikit-learn-videos), focusing on videos 1-5 as well as video 9. 
Alternatively, you may prefer reading the [tutorials](http://scikit-learn.org/stable/tutorial/index.html) from the scikit-learn documentation. 32 | - If you need a refresher on pandas, I recommend reviewing the notebook and/or videos from my [pandas video series](https://github.com/justmarkham/pandas-videos). Alternatively, you may prefer reading this 3-part [tutorial](http://www.gregreda.com/2013/10/26/intro-to-pandas-data-structures/). 33 | 34 | ### Abstract 35 | 36 | It can be difficult to figure out how to work with text in scikit-learn, even if you're already comfortable with the scikit-learn API. Many questions immediately come up: Which vectorizer should I use, and why? What's the difference between a "fit" and a "transform"? What's a document-term matrix, and why is it so sparse? Is it okay for my training data to have more features than observations? What's the appropriate machine learning model to use? And so on... 37 | 38 | In this tutorial, we'll answer all of those questions, and more! We'll start by walking through the vectorization process in order to understand the input and output formats. Then we'll read a simple dataset into pandas, and immediately apply what we've learned about vectorization. We'll move on to the model building process, including a discussion of which model is most appropriate for the task. We'll evaluate our model a few different ways, and then examine the model for greater insight into how the text is influencing its predictions. Finally, we'll practice this entire workflow on a new dataset, and end with a discussion of which parts of the process are worth tuning for improved performance. 39 | 40 | ### Detailed Outline 41 | 42 | 1. Model building in scikit-learn (refresher) 43 | 2. Representing text as numerical data 44 | 3. Reading a text-based dataset into pandas 45 | 4. Vectorizing our dataset 46 | 5. Building and evaluating a model 47 | 6. Comparing models 48 | 7. Examining a model for further insight 49 | 8. Practicing this workflow on another dataset 50 | 9. Tuning the vectorizer (discussion) 51 | 52 | ### About the Instructor 53 | 54 | Kevin Markham is the founder of [Data School](http://www.dataschool.io/) and the former lead instructor for [General Assembly's Data Science course](https://github.com/justmarkham/DAT8) in Washington, DC. He is passionate about teaching data science to people who are new to the field, regardless of their educational and professional backgrounds, and he enjoys teaching both online and in the classroom. Kevin's professional focus is supervised machine learning, which led him to create the popular [scikit-learn video series](https://github.com/justmarkham/scikit-learn-videos) for Kaggle. He has a degree in Computer Engineering from Vanderbilt University. 55 | 56 | * Email: [kevin@dataschool.io](mailto:kevin@dataschool.io) 57 | * Twitter: [@justmarkham](https://twitter.com/justmarkham) 58 | 59 | ### Recommended Resources 60 | 61 | **Text classification:** 62 | * Read Paul Graham's classic post, [A Plan for Spam](http://www.paulgraham.com/spam.html), for an overview of a basic text classification system using a Bayesian approach. (He also wrote a [follow-up post](http://www.paulgraham.com/better.html) about how he improved his spam filter.) 63 | * Coursera's Natural Language Processing (NLP) course has [video lectures](https://class.coursera.org/nlp/lecture) on text classification, tokenization, Naive Bayes, and many other fundamental NLP topics. 
(Here are the [slides](http://web.stanford.edu/~jurafsky/NLPCourseraSlides.html) used in all of the videos.) 64 | * [Automatically Categorizing Yelp Businesses](http://engineeringblog.yelp.com/2015/09/automatically-categorizing-yelp-businesses.html) discusses how Yelp uses NLP and scikit-learn to solve the problem of uncategorized businesses. 65 | * [How to Read the Mind of a Supreme Court Justice](http://fivethirtyeight.com/features/how-to-read-the-mind-of-a-supreme-court-justice/) discusses CourtCast, a machine learning model that predicts the outcome of Supreme Court cases using text-based features only. (The CourtCast creator wrote a post explaining [how it works](https://sciencecowboy.wordpress.com/2015/03/05/predicting-the-supreme-court-from-oral-arguments/), and the [Python code](https://github.com/nasrallah/CourtCast) is available on GitHub.) 66 | * [Identifying Humorous Cartoon Captions](http://www.cs.huji.ac.il/~dshahaf/pHumor.pdf) is a readable paper about identifying funny captions submitted to the New Yorker Caption Contest. 67 | * In this [PyData video](https://www.youtube.com/watch?v=y3ZTKFZ-1QQ) (50 minutes), Facebook explains how they use scikit-learn for sentiment classification by training a Naive Bayes model on emoji-labeled data. 68 | 69 | **Naive Bayes and logistic regression:** 70 | * Read this brief Quora post on [airport security](http://www.quora.com/In-laymans-terms-how-does-Naive-Bayes-work/answer/Konstantin-Tt) for an intuitive explanation of how Naive Bayes classification works. 71 | * For a longer introduction to Naive Bayes, read Sebastian Raschka's article on [Naive Bayes and Text Classification](http://sebastianraschka.com/Articles/2014_naive_bayes_1.html). As well, Wikipedia has two excellent articles ([Naive Bayes classifier](http://en.wikipedia.org/wiki/Naive_Bayes_classifier) and [Naive Bayes spam filtering](http://en.wikipedia.org/wiki/Naive_Bayes_spam_filtering)), and Cross Validated has a good [Q&A](http://stats.stackexchange.com/questions/21822/understanding-naive-bayes). 72 | * My [guide to an in-depth understanding of logistic regression](http://www.dataschool.io/guide-to-logistic-regression/) includes a lesson notebook and a curated list of resources for going deeper into this topic. 73 | * [Comparison of Machine Learning Models](https://github.com/justmarkham/DAT8/blob/master/other/model_comparison.md) lists the advantages and disadvantages of Naive Bayes, logistic regression, and other classification and regression models. 74 | 75 | **scikit-learn:** 76 | * The scikit-learn user guide includes an excellent section on [text feature extraction](http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction) that includes many details not covered in today's tutorial. 77 | * The user guide also describes the [performance trade-offs](http://scikit-learn.org/stable/modules/computational_performance.html#influence-of-the-input-data-representation) involved when choosing between sparse and dense input data representations. 78 | * To learn more about evaluating classification models, watch video #9 from my [scikit-learn video series](https://github.com/justmarkham/scikit-learn-videos) (or just read the associated [notebook](https://github.com/justmarkham/scikit-learn-videos/blob/master/09_classification_metrics.ipynb)). 79 | 80 | **pandas:** 81 | * Here are my [top 8 resources for learning data analysis with pandas](http://www.dataschool.io/best-python-pandas-resources/). 
82 | * As well, I have a new [pandas Q&A video series](http://www.dataschool.io/easier-data-analysis-with-pandas/) targeted at beginners that includes two new videos every week. 83 | -------------------------------------------------------------------------------- /ml-with-text/exercise.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Tutorial Exercise: Yelp reviews" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "## Introduction\n", 15 | "\n", 16 | "This exercise uses a small subset of the data from Kaggle's [Yelp Business Rating Prediction](https://www.kaggle.com/c/yelp-recsys-2013) competition.\n", 17 | "\n", 18 | "**Description of the data:**\n", 19 | "\n", 20 | "- **`yelp.csv`** contains the dataset. It is stored in the repository (in the **`data`** directory), so there is no need to download anything from the Kaggle website.\n", 21 | "- Each observation (row) in this dataset is a review of a particular business by a particular user.\n", 22 | "- The **stars** column is the number of stars (1 through 5) assigned by the reviewer to the business. (Higher stars is better.) In other words, it is the rating of the business by the person who wrote the review.\n", 23 | "- The **text** column is the text of the review.\n", 24 | "\n", 25 | "**Goal:** Predict the star rating of a review using **only** the review text.\n", 26 | "\n", 27 | "**Tip:** After each task, I recommend that you check the shape and the contents of your objects, to confirm that they match your expectations." 28 | ] 29 | }, 30 | { 31 | "cell_type": "markdown", 32 | "metadata": {}, 33 | "source": [ 34 | "## Task 1\n", 35 | "\n", 36 | "Read **`yelp.csv`** into a pandas DataFrame and examine it." 37 | ] 38 | }, 39 | { 40 | "cell_type": "markdown", 41 | "metadata": {}, 42 | "source": [ 43 | "## Task 2\n", 44 | "\n", 45 | "Create a new DataFrame that only contains the **5-star** and **1-star** reviews.\n", 46 | "\n", 47 | "- **Hint:** [How do I apply multiple filter criteria to a pandas DataFrame?](http://nbviewer.jupyter.org/github/justmarkham/pandas-videos/blob/master/pandas.ipynb#9.-How-do-I-apply-multiple-filter-criteria-to-a-pandas-DataFrame%3F-%28video%29) explains how to do this." 48 | ] 49 | }, 50 | { 51 | "cell_type": "markdown", 52 | "metadata": {}, 53 | "source": [ 54 | "## Task 3\n", 55 | "\n", 56 | "Define X and y from the new DataFrame, and then split X and y into training and testing sets, using the **review text** as the only feature and the **star rating** as the response.\n", 57 | "\n", 58 | "- **Hint:** Keep in mind that X should be a pandas Series (not a DataFrame), since we will pass it to CountVectorizer in the task that follows." 59 | ] 60 | }, 61 | { 62 | "cell_type": "markdown", 63 | "metadata": {}, 64 | "source": [ 65 | "## Task 4\n", 66 | "\n", 67 | "Use CountVectorizer to create **document-term matrices** from X_train and X_test." 
68 | ] 69 | }, 70 | { 71 | "cell_type": "markdown", 72 | "metadata": {}, 73 | "source": [ 74 | "## Task 5\n", 75 | "\n", 76 | "Use multinomial Naive Bayes to **predict the star rating** for the reviews in the testing set, and then **calculate the accuracy** and **print the confusion matrix**.\n", 77 | "\n", 78 | "- **Hint:** [Evaluating a classification model](https://github.com/justmarkham/scikit-learn-videos/blob/master/09_classification_metrics.ipynb) explains how to interpret both classification accuracy and the confusion matrix." 79 | ] 80 | }, 81 | { 82 | "cell_type": "markdown", 83 | "metadata": {}, 84 | "source": [ 85 | "## Task 6 (Challenge)\n", 86 | "\n", 87 | "Calculate the **null accuracy**, which is the classification accuracy that could be achieved by always predicting the most frequent class.\n", 88 | "\n", 89 | "- **Hint:** [Evaluating a classification model](https://github.com/justmarkham/scikit-learn-videos/blob/master/09_classification_metrics.ipynb) explains null accuracy and demonstrates two ways to calculate it, though only one of those ways will work in this case. Alternatively, you can come up with your own method to calculate null accuracy!" 90 | ] 91 | }, 92 | { 93 | "cell_type": "markdown", 94 | "metadata": {}, 95 | "source": [ 96 | "## Task 7 (Challenge)\n", 97 | "\n", 98 | "Browse through the review text of some of the **false positives** and **false negatives**. Based on your knowledge of how Naive Bayes works, do you have any ideas about why the model is incorrectly classifying these reviews?\n", 99 | "\n", 100 | "- **Hint:** [Evaluating a classification model](https://github.com/justmarkham/scikit-learn-videos/blob/master/09_classification_metrics.ipynb) explains the definitions of \"false positives\" and \"false negatives\".\n", 101 | "- **Hint:** Think about what a false positive means in this context, and what a false negative means in this context. What has scikit-learn defined as the \"positive class\"?" 102 | ] 103 | }, 104 | { 105 | "cell_type": "markdown", 106 | "metadata": {}, 107 | "source": [ 108 | "## Task 8 (Challenge)\n", 109 | "\n", 110 | "Calculate which 10 tokens are the most predictive of **5-star reviews**, and which 10 tokens are the most predictive of **1-star reviews**.\n", 111 | "\n", 112 | "- **Hint:** Naive Bayes automatically counts the number of times each token appears in each class, as well as the number of observations in each class. You can access these counts via the `feature_count_` and `class_count_` attributes of the Naive Bayes model object." 113 | ] 114 | }, 115 | { 116 | "cell_type": "markdown", 117 | "metadata": {}, 118 | "source": [ 119 | "## Task 9 (Challenge)\n", 120 | "\n", 121 | "Up to this point, we have framed this as a **binary classification problem** by only considering the 5-star and 1-star reviews. Now, let's repeat the model building process using all reviews, which makes this a **5-class classification problem**.\n", 122 | "\n", 123 | "Here are the steps:\n", 124 | "\n", 125 | "- Define X and y using the original DataFrame. (y should contain 5 different classes.)\n", 126 | "- Split X and y into training and testing sets.\n", 127 | "- Create document-term matrices using CountVectorizer.\n", 128 | "- Calculate the testing accuracy of a Multinomial Naive Bayes model.\n", 129 | "- Compare the testing accuracy with the null accuracy, and comment on the results.\n", 130 | "- Print the confusion matrix, and comment on the results. 
(This [Stack Overflow answer](http://stackoverflow.com/a/30748053/1636598) explains how to read a multi-class confusion matrix.)\n", 131 | "- Print the [classification report](http://scikit-learn.org/stable/modules/model_evaluation.html#classification-report), and comment on the results. If you are unfamiliar with the terminology it uses, research the terms, and then try to figure out how to calculate these metrics manually from the confusion matrix!" 132 | ] 133 | } 134 | ], 135 | "metadata": { 136 | "kernelspec": { 137 | "display_name": "Python 2", 138 | "language": "python", 139 | "name": "python2" 140 | }, 141 | "language_info": { 142 | "codemirror_mode": { 143 | "name": "ipython", 144 | "version": 2 145 | }, 146 | "file_extension": ".py", 147 | "mimetype": "text/x-python", 148 | "name": "python", 149 | "nbconvert_exporter": "python", 150 | "pygments_lexer": "ipython2", 151 | "version": "2.7.11" 152 | } 153 | }, 154 | "nbformat": 4, 155 | "nbformat_minor": 0 156 | } 157 | -------------------------------------------------------------------------------- /ml-with-text/exercise.py: -------------------------------------------------------------------------------- 1 | # # Tutorial Exercise: Yelp reviews 2 | 3 | # ## Introduction 4 | # 5 | # This exercise uses a small subset of the data from Kaggle's [Yelp Business Rating Prediction](https://www.kaggle.com/c/yelp-recsys-2013) competition. 6 | # 7 | # **Description of the data:** 8 | # 9 | # - **`yelp.csv`** contains the dataset. It is stored in the repository (in the **`data`** directory), so there is no need to download anything from the Kaggle website. 10 | # - Each observation (row) in this dataset is a review of a particular business by a particular user. 11 | # - The **stars** column is the number of stars (1 through 5) assigned by the reviewer to the business. (A higher star rating is better.) In other words, it is the rating of the business by the person who wrote the review. 12 | # - The **text** column is the text of the review. 13 | # 14 | # **Goal:** Predict the star rating of a review using **only** the review text. 15 | # 16 | # **Tip:** After each task, I recommend that you check the shape and the contents of your objects, to confirm that they match your expectations. 17 | 18 | # ## Task 1 19 | # 20 | # Read **`yelp.csv`** into a pandas DataFrame and examine it. 21 | 22 | # ## Task 2 23 | # 24 | # Create a new DataFrame that only contains the **5-star** and **1-star** reviews. 25 | # 26 | # - **Hint:** [How do I apply multiple filter criteria to a pandas DataFrame?](http://nbviewer.jupyter.org/github/justmarkham/pandas-videos/blob/master/pandas.ipynb#9.-How-do-I-apply-multiple-filter-criteria-to-a-pandas-DataFrame%3F-%28video%29) explains how to do this. 27 | 28 | # ## Task 3 29 | # 30 | # Define X and y from the new DataFrame, and then split X and y into training and testing sets, using the **review text** as the only feature and the **star rating** as the response. 31 | # 32 | # - **Hint:** Keep in mind that X should be a pandas Series (not a DataFrame), since we will pass it to CountVectorizer in the task that follows. 33 | 34 | # ## Task 4 35 | # 36 | # Use CountVectorizer to create **document-term matrices** from X_train and X_test. 37 | 38 | # ## Task 5 39 | # 40 | # Use multinomial Naive Bayes to **predict the star rating** for the reviews in the testing set, and then **calculate the accuracy** and **print the confusion matrix**. 
41 | # 42 | # - **Hint:** [Evaluating a classification model](https://github.com/justmarkham/scikit-learn-videos/blob/master/09_classification_metrics.ipynb) explains how to interpret both classification accuracy and the confusion matrix. 43 | 44 | # ## Task 6 (Challenge) 45 | # 46 | # Calculate the **null accuracy**, which is the classification accuracy that could be achieved by always predicting the most frequent class. 47 | # 48 | # - **Hint:** [Evaluating a classification model](https://github.com/justmarkham/scikit-learn-videos/blob/master/09_classification_metrics.ipynb) explains null accuracy and demonstrates two ways to calculate it, though only one of those ways will work in this case. Alternatively, you can come up with your own method to calculate null accuracy! 49 | 50 | # ## Task 7 (Challenge) 51 | # 52 | # Browse through the review text of some of the **false positives** and **false negatives**. Based on your knowledge of how Naive Bayes works, do you have any ideas about why the model is incorrectly classifying these reviews? 53 | # 54 | # - **Hint:** [Evaluating a classification model](https://github.com/justmarkham/scikit-learn-videos/blob/master/09_classification_metrics.ipynb) explains the definitions of "false positives" and "false negatives". 55 | # - **Hint:** Think about what a false positive means in this context, and what a false negative means in this context. What has scikit-learn defined as the "positive class"? 56 | 57 | # ## Task 8 (Challenge) 58 | # 59 | # Calculate which 10 tokens are the most predictive of **5-star reviews**, and which 10 tokens are the most predictive of **1-star reviews**. 60 | # 61 | # - **Hint:** Naive Bayes automatically counts the number of times each token appears in each class, as well as the number of observations in each class. You can access these counts via the `feature_count_` and `class_count_` attributes of the Naive Bayes model object. 62 | 63 | # ## Task 9 (Challenge) 64 | # 65 | # Up to this point, we have framed this as a **binary classification problem** by only considering the 5-star and 1-star reviews. Now, let's repeat the model building process using all reviews, which makes this a **5-class classification problem**. 66 | # 67 | # Here are the steps: 68 | # 69 | # - Define X and y using the original DataFrame. (y should contain 5 different classes.) 70 | # - Split X and y into training and testing sets. 71 | # - Create document-term matrices using CountVectorizer. 72 | # - Calculate the testing accuracy of a Multinomial Naive Bayes model. 73 | # - Compare the testing accuracy with the null accuracy, and comment on the results. 74 | # - Print the confusion matrix, and comment on the results. (This [Stack Overflow answer](http://stackoverflow.com/a/30748053/1636598) explains how to read a multi-class confusion matrix.) 75 | # - Print the [classification report](http://scikit-learn.org/stable/modules/model_evaluation.html#classification-report), and comment on the results. If you are unfamiliar with the terminology it uses, research the terms, and then try to figure out how to calculate these metrics manually from the confusion matrix! 
76 | -------------------------------------------------------------------------------- /ml-with-text/exercise_solution.py: -------------------------------------------------------------------------------- 1 | # # Tutorial Exercise: Yelp reviews (Solution) 2 | 3 | # ## Introduction 4 | # 5 | # This exercise uses a small subset of the data from Kaggle's [Yelp Business Rating Prediction](https://www.kaggle.com/c/yelp-recsys-2013) competition. 6 | # 7 | # **Description of the data:** 8 | # 9 | # - **`yelp.csv`** contains the dataset. It is stored in the repository (in the **`data`** directory), so there is no need to download anything from the Kaggle website. 10 | # - Each observation (row) in this dataset is a review of a particular business by a particular user. 11 | # - The **stars** column is the number of stars (1 through 5) assigned by the reviewer to the business. (A higher star rating is better.) In other words, it is the rating of the business by the person who wrote the review. 12 | # - The **text** column is the text of the review. 13 | # 14 | # **Goal:** Predict the star rating of a review using **only** the review text. 15 | # 16 | # **Tip:** After each task, I recommend that you check the shape and the contents of your objects, to confirm that they match your expectations. 17 | 18 | # for Python 2: use print only as a function 19 | from __future__ import print_function 20 | 21 | 22 | # ## Task 1 23 | # 24 | # Read **`yelp.csv`** into a pandas DataFrame and examine it. 25 | 26 | # read yelp.csv using a relative path 27 | import pandas as pd 28 | path = 'data/yelp.csv' 29 | yelp = pd.read_csv(path) 30 | 31 | 32 | # examine the shape 33 | yelp.shape 34 | 35 | 36 | # examine the first row 37 | yelp.head(1) 38 | 39 | 40 | # examine the class distribution 41 | yelp.stars.value_counts().sort_index() 42 | 43 | 44 | # ## Task 2 45 | # 46 | # Create a new DataFrame that only contains the **5-star** and **1-star** reviews. 47 | # 48 | # - **Hint:** [How do I apply multiple filter criteria to a pandas DataFrame?](http://nbviewer.jupyter.org/github/justmarkham/pandas-videos/blob/master/pandas.ipynb#9.-How-do-I-apply-multiple-filter-criteria-to-a-pandas-DataFrame%3F-%28video%29) explains how to do this. 49 | 50 | # filter the DataFrame using an OR condition 51 | yelp_best_worst = yelp[(yelp.stars==5) | (yelp.stars==1)] 52 | 53 | # equivalently, use the 'loc' method 54 | yelp_best_worst = yelp.loc[(yelp.stars==5) | (yelp.stars==1), :] 55 | 56 | 57 | # examine the shape 58 | yelp_best_worst.shape 59 | 60 | 61 | # ## Task 3 62 | # 63 | # Define X and y from the new DataFrame, and then split X and y into training and testing sets, using the **review text** as the only feature and the **star rating** as the response. 64 | # 65 | # - **Hint:** Keep in mind that X should be a pandas Series (not a DataFrame), since we will pass it to CountVectorizer in the task that follows. 66 | 67 | # define X and y 68 | X = yelp_best_worst.text 69 | y = yelp_best_worst.stars 70 | 71 | 72 | # split X and y into training and testing sets 73 | from sklearn.cross_validation import train_test_split 74 | X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1) 75 | 76 | 77 | # examine the object shapes 78 | print(X_train.shape) 79 | print(X_test.shape) 80 | print(y_train.shape) 81 | print(y_test.shape) 82 | 83 | 84 | # ## Task 4 85 | # 86 | # Use CountVectorizer to create **document-term matrices** from X_train and X_test. 
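# (Note: the vectorizer is fit on the training text only, and the fitted vocabulary is then reused to transform the testing text -- fitting on the testing text would leak information about unseen data into the model.)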
87 | 88 | # import and instantiate CountVectorizer 89 | from sklearn.feature_extraction.text import CountVectorizer 90 | vect = CountVectorizer() 91 | 92 | 93 | # fit and transform X_train into X_train_dtm 94 | X_train_dtm = vect.fit_transform(X_train) 95 | X_train_dtm.shape 96 | 97 | 98 | # transform X_test into X_test_dtm 99 | X_test_dtm = vect.transform(X_test) 100 | X_test_dtm.shape 101 | 102 | 103 | # ## Task 5 104 | # 105 | # Use multinomial Naive Bayes to **predict the star rating** for the reviews in the testing set, and then **calculate the accuracy** and **print the confusion matrix**. 106 | # 107 | # - **Hint:** [Evaluating a classification model](https://github.com/justmarkham/scikit-learn-videos/blob/master/09_classification_metrics.ipynb) explains how to interpret both classification accuracy and the confusion matrix. 108 | 109 | # import and instantiate MultinomialNB 110 | from sklearn.naive_bayes import MultinomialNB 111 | nb = MultinomialNB() 112 | 113 | 114 | # train the model using X_train_dtm 115 | nb.fit(X_train_dtm, y_train) 116 | 117 | 118 | # make class predictions for X_test_dtm 119 | y_pred_class = nb.predict(X_test_dtm) 120 | 121 | 122 | # calculate accuracy of class predictions 123 | from sklearn import metrics 124 | metrics.accuracy_score(y_test, y_pred_class) 125 | 126 | 127 | # print the confusion matrix 128 | metrics.confusion_matrix(y_test, y_pred_class) 129 | 130 | 131 | # ## Task 6 (Challenge) 132 | # 133 | # Calculate the **null accuracy**, which is the classification accuracy that could be achieved by always predicting the most frequent class. 134 | # 135 | # - **Hint:** [Evaluating a classification model](https://github.com/justmarkham/scikit-learn-videos/blob/master/09_classification_metrics.ipynb) explains null accuracy and demonstrates two ways to calculate it, though only one of those ways will work in this case. Alternatively, you can come up with your own method to calculate null accuracy! 136 | 137 | # examine the class distribution of the testing set 138 | y_test.value_counts() 139 | 140 | 141 | # calculate null accuracy 142 | y_test.value_counts().head(1) / y_test.shape 143 | 144 | 145 | # calculate null accuracy manually 146 | 838 / float(838 + 184) 147 | 148 | 149 | # ## Task 7 (Challenge) 150 | # 151 | # Browse through the review text of some of the **false positives** and **false negatives**. Based on your knowledge of how Naive Bayes works, do you have any ideas about why the model is incorrectly classifying these reviews? 152 | # 153 | # - **Hint:** [Evaluating a classification model](https://github.com/justmarkham/scikit-learn-videos/blob/master/09_classification_metrics.ipynb) explains the definitions of "false positives" and "false negatives". 154 | # - **Hint:** Think about what a false positive means in this context, and what a false negative means in this context. What has scikit-learn defined as the "positive class"? 
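# (With only the classes 1 and 5 present, scikit-learn treats the 5-star class as the "positive" class here, so a false positive is a 1-star review predicted as 5-star, and a false negative is a 5-star review predicted as 1-star.)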
155 | 156 | # first 10 false positives (1-star reviews incorrectly classified as 5-star reviews) 157 | X_test[y_test < y_pred_class].head(10) 158 | 159 | 160 | # false positive: model is reacting to the words "good", "impressive", "nice" 161 | X_test[1781] 162 | 163 | 164 | # false positive: model does not have enough data to work with 165 | X_test[1919] 166 | 167 | 168 | # first 10 false negatives (5-star reviews incorrectly classified as 1-star reviews) 169 | X_test[y_test > y_pred_class].head(10) 170 | 171 | 172 | # false negative: model is reacting to the words "complain", "crowds", "rushing", "pricey", "scum" 173 | X_test[4963] 174 | 175 | 176 | # ## Task 8 (Challenge) 177 | # 178 | # Calculate which 10 tokens are the most predictive of **5-star reviews**, and which 10 tokens are the most predictive of **1-star reviews**. 179 | # 180 | # - **Hint:** Naive Bayes automatically counts the number of times each token appears in each class, as well as the number of observations in each class. You can access these counts via the `feature_count_` and `class_count_` attributes of the Naive Bayes model object. 181 | 182 | # store the vocabulary of X_train 183 | X_train_tokens = vect.get_feature_names() 184 | len(X_train_tokens) 185 | 186 | 187 | # first row is one-star reviews, second row is five-star reviews 188 | nb.feature_count_.shape 189 | 190 | 191 | # store the number of times each token appears across each class 192 | one_star_token_count = nb.feature_count_[0, :] 193 | five_star_token_count = nb.feature_count_[1, :] 194 | 195 | 196 | # create a DataFrame of tokens with their separate one-star and five-star counts 197 | tokens = pd.DataFrame({'token':X_train_tokens, 'one_star':one_star_token_count, 'five_star':five_star_token_count}).set_index('token') 198 | 199 | 200 | # add 1 to one-star and five-star counts to avoid dividing by 0 201 | tokens['one_star'] = tokens.one_star + 1 202 | tokens['five_star'] = tokens.five_star + 1 203 | 204 | 205 | # first number is one-star reviews, second number is five-star reviews 206 | nb.class_count_ 207 | 208 | 209 | # convert the one-star and five-star counts into frequencies 210 | tokens['one_star'] = tokens.one_star / nb.class_count_[0] 211 | tokens['five_star'] = tokens.five_star / nb.class_count_[1] 212 | 213 | 214 | # calculate the ratio of five-star to one-star for each token 215 | tokens['five_star_ratio'] = tokens.five_star / tokens.one_star 216 | 217 | 218 | # sort the DataFrame by five_star_ratio (descending order), and examine the first 10 rows 219 | # note: use sort() instead of sort_values() for pandas 0.16.2 and earlier 220 | tokens.sort_values('five_star_ratio', ascending=False).head(10) 221 | 222 | 223 | # sort the DataFrame by five_star_ratio (ascending order), and examine the first 10 rows 224 | tokens.sort_values('five_star_ratio', ascending=True).head(10) 225 | 226 | 227 | # ## Task 9 (Challenge) 228 | # 229 | # Up to this point, we have framed this as a **binary classification problem** by only considering the 5-star and 1-star reviews. Now, let's repeat the model building process using all reviews, which makes this a **5-class classification problem**. 230 | # 231 | # Here are the steps: 232 | # 233 | # - Define X and y using the original DataFrame. (y should contain 5 different classes.) 234 | # - Split X and y into training and testing sets. 235 | # - Create document-term matrices using CountVectorizer. 236 | # - Calculate the testing accuracy of a Multinomial Naive Bayes model. 
237 | # - Compare the testing accuracy with the null accuracy, and comment on the results. 238 | # - Print the confusion matrix, and comment on the results. (This [Stack Overflow answer](http://stackoverflow.com/a/30748053/1636598) explains how to read a multi-class confusion matrix.) 239 | # - Print the [classification report](http://scikit-learn.org/stable/modules/model_evaluation.html#classification-report), and comment on the results. If you are unfamiliar with the terminology it uses, research the terms, and then try to figure out how to calculate these metrics manually from the confusion matrix! 240 | 241 | # define X and y using the original DataFrame 242 | X = yelp.text 243 | y = yelp.stars 244 | 245 | 246 | # check that y contains 5 different classes 247 | y.value_counts().sort_index() 248 | 249 | 250 | # split X and y into training and testing sets 251 | X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1) 252 | 253 | 254 | # create document-term matrices using CountVectorizer 255 | X_train_dtm = vect.fit_transform(X_train) 256 | X_test_dtm = vect.transform(X_test) 257 | 258 | 259 | # fit a Multinomial Naive Bayes model 260 | nb.fit(X_train_dtm, y_train) 261 | 262 | 263 | # make class predictions 264 | y_pred_class = nb.predict(X_test_dtm) 265 | 266 | 267 | # calculate the accuracy 268 | metrics.accuracy_score(y_test, y_pred_class) 269 | 270 | 271 | # calculate the null accuracy 272 | y_test.value_counts().head(1) / y_test.shape 273 | 274 | 275 | # **Accuracy comments:** At first glance, 47% accuracy does not seem very good, given that it is not much higher than the null accuracy. However, I would consider the 47% accuracy to be quite impressive, given that humans would also have a hard time precisely identifying the star rating for many of these reviews. 276 | 277 | # print the confusion matrix 278 | metrics.confusion_matrix(y_test, y_pred_class) 279 | 280 | 281 | # **Confusion matrix comments:** 282 | # 283 | # - Nearly all 4-star and 5-star reviews are classified as 4 or 5 stars, but the model has a hard time distinguishing between them. 284 | # - 1-star, 2-star, and 3-star reviews are most commonly classified as 4 stars, probably because it's the predominant class in the training data. 285 | 286 | # print the classification report 287 | print(metrics.classification_report(y_test, y_pred_class)) 288 | 289 | 290 | # **Precision** answers the question: "When a given class is predicted, how often are those predictions correct?" To calculate the precision for class 1, for example, you divide 55 by the sum of the first column of the confusion matrix. 291 | 292 | # manually calculate the precision for class 1 293 | precision = 55 / float(55 + 28 + 5 + 7 + 6) 294 | print(precision) 295 | 296 | 297 | # **Recall** answers the question: "When a given class is the true class, how often is that class predicted?" To calculate the recall for class 1, for example, you divide 55 by the sum of the first row of the confusion matrix. 298 | 299 | # manually calculate the recall for class 1 300 | recall = 55 / float(55 + 14 + 24 + 65 + 27) 301 | print(recall) 302 | 303 | 304 | # **F1 score** is the harmonic mean of precision and recall. 305 | 306 | # manually calculate the F1 score for class 1 307 | f1 = 2 * (precision * recall) / (precision + recall) 308 | print(f1) 309 | 310 | 311 | # **Support** answers the question: "How many observations exist for which a given class is the true class?" 
To calculate the support for class 1, for example, you sum the first row of the confusion matrix. 312 | 313 | # manually calculate the support for class 1 314 | support = 55 + 14 + 24 + 65 + 27 315 | print(support) 316 | 317 | 318 | # **Classification report comments:** 319 | # 320 | # - Class 1 has low recall, meaning that the model has a hard time detecting the 1-star reviews, but high precision, meaning that when the model predicts a review is 1-star, it's usually correct. 321 | # - Class 5 has high recall and precision, probably because 5-star reviews have polarized language, and because the model has a lot of observations to learn from. 322 | -------------------------------------------------------------------------------- /ml-with-text/tutorial.py: -------------------------------------------------------------------------------- 1 | # # Tutorial: Machine Learning with Text in scikit-learn 2 | 3 | # ## Agenda 4 | # 5 | # 1. Model building in scikit-learn (refresher) 6 | # 2. Representing text as numerical data 7 | # 3. Reading a text-based dataset into pandas 8 | # 4. Vectorizing our dataset 9 | # 5. Building and evaluating a model 10 | # 6. Comparing models 11 | # 7. Examining a model for further insight 12 | # 8. Practicing this workflow on another dataset 13 | # 9. Tuning the vectorizer (discussion) 14 | 15 | # for Python 2: use print only as a function 16 | from __future__ import print_function 17 | 18 | 19 | # ## Part 1: Model building in scikit-learn (refresher) 20 | 21 | # load the iris dataset as an example 22 | from sklearn.datasets import load_iris 23 | iris = load_iris() 24 | 25 | 26 | # store the feature matrix (X) and response vector (y) 27 | X = iris.data 28 | y = iris.target 29 | 30 | 31 | # **"Features"** are also known as predictors, inputs, or attributes. The **"response"** is also known as the target, label, or output. 32 | 33 | # check the shapes of X and y 34 | print(X.shape) 35 | print(y.shape) 36 | 37 | 38 | # **"Observations"** are also known as samples, instances, or records. 39 | 40 | # examine the first 5 rows of the feature matrix (including the feature names) 41 | import pandas as pd 42 | pd.DataFrame(X, columns=iris.feature_names).head() 43 | 44 | 45 | # examine the response vector 46 | print(y) 47 | 48 | 49 | # In order to **build a model**, the features must be **numeric**, and every observation must have the **same features in the same order**. 50 | 51 | # import the class 52 | from sklearn.neighbors import KNeighborsClassifier 53 | 54 | # instantiate the model (with the default parameters) 55 | knn = KNeighborsClassifier() 56 | 57 | # fit the model with data (occurs in-place) 58 | knn.fit(X, y) 59 | 60 | 61 | # In order to **make a prediction**, the new observation must have the **same features as the training observations**, both in number and meaning. 62 | 63 | # predict the response for a new observation 64 | knn.predict([[3, 5, 4, 2]]) 65 | 66 | 67 | # ## Part 2: Representing text as numerical data 68 | 69 | # example text for model training (SMS messages) 70 | simple_train = ['call you tonight', 'Call me a cab', 'please call me... PLEASE!'] 71 | 72 | 73 | # From the [scikit-learn documentation](http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction): 74 | # 75 | # > Text Analysis is a major application field for machine learning algorithms. 
However the raw data, a sequence of symbols cannot be fed directly to the algorithms themselves as most of them expect **numerical feature vectors with a fixed size** rather than the **raw text documents with variable length**. 76 | # 77 | # We will use [CountVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) to "convert text into a matrix of token counts": 78 | 79 | # import and instantiate CountVectorizer (with the default parameters) 80 | from sklearn.feature_extraction.text import CountVectorizer 81 | vect = CountVectorizer() 82 | 83 | 84 | # learn the 'vocabulary' of the training data (occurs in-place) 85 | vect.fit(simple_train) 86 | 87 | 88 | # examine the fitted vocabulary 89 | vect.get_feature_names() 90 | 91 | 92 | # transform training data into a 'document-term matrix' 93 | simple_train_dtm = vect.transform(simple_train) 94 | simple_train_dtm 95 | 96 | 97 | # convert sparse matrix to a dense matrix 98 | simple_train_dtm.toarray() 99 | 100 | 101 | # examine the vocabulary and document-term matrix together 102 | pd.DataFrame(simple_train_dtm.toarray(), columns=vect.get_feature_names()) 103 | 104 | 105 | # From the [scikit-learn documentation](http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction): 106 | # 107 | # > In this scheme, features and samples are defined as follows: 108 | # 109 | # > - Each individual token occurrence frequency (normalized or not) is treated as a **feature**. 110 | # > - The vector of all the token frequencies for a given document is considered a multivariate **sample**. 111 | # 112 | # > A **corpus of documents** can thus be represented by a matrix with **one row per document** and **one column per token** (e.g. word) occurring in the corpus. 113 | # 114 | # > We call **vectorization** the general process of turning a collection of text documents into numerical feature vectors. This specific strategy (tokenization, counting and normalization) is called the **Bag of Words** or "Bag of n-grams" representation. Documents are described by word occurrences while completely ignoring the relative position information of the words in the document. 115 | 116 | # check the type of the document-term matrix 117 | type(simple_train_dtm) 118 | 119 | 120 | # examine the sparse matrix contents 121 | print(simple_train_dtm) 122 | 123 | 124 | # From the [scikit-learn documentation](http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction): 125 | # 126 | # > As most documents will typically use a very small subset of the words used in the corpus, the resulting matrix will have **many feature values that are zeros** (typically more than 99% of them). 127 | # 128 | # > For instance, a collection of 10,000 short text documents (such as emails) will use a vocabulary with a size in the order of 100,000 unique words in total while each document will use 100 to 1000 unique words individually. 129 | # 130 | # > In order to be able to **store such a matrix in memory** but also to **speed up operations**, implementations will typically use a **sparse representation** such as the implementations available in the `scipy.sparse` package. 131 | 132 | # example text for model testing 133 | simple_test = ["please don't call me"] 134 | 135 | 136 | # In order to **make a prediction**, the new observation must have the **same features as the training observations**, both in number and meaning. 
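# (Note: with the default token_pattern, "don't" is tokenized as "don", which is not in the fitted vocabulary, so we should expect it to be dropped from the document-term matrix below.)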
137 | 138 | # transform testing data into a document-term matrix (using existing vocabulary) 139 | simple_test_dtm = vect.transform(simple_test) 140 | simple_test_dtm.toarray() 141 | 142 | 143 | # examine the vocabulary and document-term matrix together 144 | pd.DataFrame(simple_test_dtm.toarray(), columns=vect.get_feature_names()) 145 | 146 | 147 | # **Summary:** 148 | # 149 | # - `vect.fit(train)` **learns the vocabulary** of the training data 150 | # - `vect.transform(train)` uses the **fitted vocabulary** to build a document-term matrix from the training data 151 | # - `vect.transform(test)` uses the **fitted vocabulary** to build a document-term matrix from the testing data (and **ignores tokens** it hasn't seen before) 152 | 153 | # ## Part 3: Reading a text-based dataset into pandas 154 | 155 | # read file into pandas using a relative path 156 | path = 'data/sms.tsv' 157 | sms = pd.read_table(path, header=None, names=['label', 'message']) 158 | 159 | 160 | # alternative: read file into pandas from a URL 161 | # url = 'https://raw.githubusercontent.com/justmarkham/pycon-2016-tutorial/master/data/sms.tsv' 162 | # sms = pd.read_table(url, header=None, names=['label', 'message']) 163 | 164 | 165 | # examine the shape 166 | sms.shape 167 | 168 | 169 | # examine the first 10 rows 170 | sms.head(10) 171 | 172 | 173 | # examine the class distribution 174 | sms.label.value_counts() 175 | 176 | 177 | # convert label to a numerical variable 178 | sms['label_num'] = sms.label.map({'ham':0, 'spam':1}) 179 | 180 | 181 | # check that the conversion worked 182 | sms.head(10) 183 | 184 | 185 | # how to define X and y (from the iris data) for use with a MODEL 186 | X = iris.data 187 | y = iris.target 188 | print(X.shape) 189 | print(y.shape) 190 | 191 | 192 | # how to define X and y (from the SMS data) for use with COUNTVECTORIZER 193 | X = sms.message 194 | y = sms.label_num 195 | print(X.shape) 196 | print(y.shape) 197 | 198 | 199 | # split X and y into training and testing sets 200 | from sklearn.cross_validation import train_test_split 201 | X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1) 202 | print(X_train.shape) 203 | print(X_test.shape) 204 | print(y_train.shape) 205 | print(y_test.shape) 206 | 207 | 208 | # ## Part 4: Vectorizing our dataset 209 | 210 | # instantiate the vectorizer 211 | vect = CountVectorizer() 212 | 213 | 214 | # learn training data vocabulary, then use it to create a document-term matrix 215 | vect.fit(X_train) 216 | X_train_dtm = vect.transform(X_train) 217 | 218 | 219 | # equivalently: combine fit and transform into a single step 220 | X_train_dtm = vect.fit_transform(X_train) 221 | 222 | 223 | # examine the document-term matrix 224 | X_train_dtm 225 | 226 | 227 | # transform testing data (using fitted vocabulary) into a document-term matrix 228 | X_test_dtm = vect.transform(X_test) 229 | X_test_dtm 230 | 231 | 232 | # ## Part 5: Building and evaluating a model 233 | # 234 | # We will use [multinomial Naive Bayes](http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html): 235 | # 236 | # > The multinomial Naive Bayes classifier is suitable for classification with **discrete features** (e.g., word counts for text classification). The multinomial distribution normally requires integer feature counts. However, in practice, fractional counts such as tf-idf may also work. 
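# Aside: a minimal sketch of the scoring rule behind multinomial Naive Bayes, using made-up numbers rather than values learned from our data. For each class, the model adds the class log prior to the count-weighted log likelihoods of the tokens, and predicts the class with the highest total.
import numpy as np
log_prior_spam = np.log(0.5)                   # assumed P(spam) = 0.5
log_likelihood_spam = np.log([0.010, 0.002])   # assumed P(token | spam) for two tokens
token_counts = np.array([2, 1])                # token counts in a hypothetical message
log_score_spam = log_prior_spam + (token_counts * log_likelihood_spam).sum()
print(log_score_spam)                          # repeat for "ham" and compare the two scores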
237 | 238 | # import and instantiate a Multinomial Naive Bayes model 239 | from sklearn.naive_bayes import MultinomialNB 240 | nb = MultinomialNB() 241 | 242 | 243 | # train the model using X_train_dtm 244 | nb.fit(X_train_dtm, y_train) 245 | 246 | 247 | # make class predictions for X_test_dtm 248 | y_pred_class = nb.predict(X_test_dtm) 249 | 250 | 251 | # calculate accuracy of class predictions 252 | from sklearn import metrics 253 | metrics.accuracy_score(y_test, y_pred_class) 254 | 255 | 256 | # print the confusion matrix 257 | metrics.confusion_matrix(y_test, y_pred_class) 258 | 259 | 260 | # print message text for the false positives (ham incorrectly classified as spam) 261 | 262 | 263 | # print message text for the false negatives (spam incorrectly classified as ham) 264 | 265 | 266 | # example false negative 267 | X_test[3132] 268 | 269 | 270 | # calculate predicted probabilities for X_test_dtm (poorly calibrated) 271 | y_pred_prob = nb.predict_proba(X_test_dtm)[:, 1] 272 | y_pred_prob 273 | 274 | 275 | # calculate AUC 276 | metrics.roc_auc_score(y_test, y_pred_prob) 277 | 278 | 279 | # ## Part 6: Comparing models 280 | # 281 | # We will compare multinomial Naive Bayes with [logistic regression](http://scikit-learn.org/stable/modules/linear_model.html#logistic-regression): 282 | # 283 | # > Logistic regression, despite its name, is a **linear model for classification** rather than regression. Logistic regression is also known in the literature as logit regression, maximum-entropy classification (MaxEnt) or the log-linear classifier. In this model, the probabilities describing the possible outcomes of a single trial are modeled using a logistic function. 284 | 285 | # import and instantiate a logistic regression model 286 | from sklearn.linear_model import LogisticRegression 287 | logreg = LogisticRegression() 288 | 289 | 290 | # train the model using X_train_dtm 291 | logreg.fit(X_train_dtm, y_train) 292 | 293 | 294 | # make class predictions for X_test_dtm 295 | y_pred_class = logreg.predict(X_test_dtm) 296 | 297 | 298 | # calculate predicted probabilities for X_test_dtm (well calibrated) 299 | y_pred_prob = logreg.predict_proba(X_test_dtm)[:, 1] 300 | y_pred_prob 301 | 302 | 303 | # calculate accuracy 304 | metrics.accuracy_score(y_test, y_pred_class) 305 | 306 | 307 | # calculate AUC 308 | metrics.roc_auc_score(y_test, y_pred_prob) 309 | 310 | 311 | # ## Part 7: Examining a model for further insight 312 | # 313 | # We will examine our **trained Naive Bayes model** to calculate the approximate **"spamminess" of each token**. 
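# The steps below implement a simple ratio of smoothed within-class frequencies -- a sketch of the quantity we are after:
#
#   spam_ratio(token) = ((spam count + 1) / number of spam messages) /
#                       ((ham count + 1) / number of ham messages)
#
# where the "+ 1" is add-one smoothing that avoids dividing by zero.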
314 | 315 | # store the vocabulary of X_train 316 | X_train_tokens = vect.get_feature_names() 317 | len(X_train_tokens) 318 | 319 | 320 | # examine the first 50 tokens 321 | print(X_train_tokens[0:50]) 322 | 323 | 324 | # examine the last 50 tokens 325 | print(X_train_tokens[-50:]) 326 | 327 | 328 | # Naive Bayes counts the number of times each token appears in each class 329 | nb.feature_count_ 330 | 331 | 332 | # rows represent classes, columns represent tokens 333 | nb.feature_count_.shape 334 | 335 | 336 | # number of times each token appears across all HAM messages 337 | ham_token_count = nb.feature_count_[0, :] 338 | ham_token_count 339 | 340 | 341 | # number of times each token appears across all SPAM messages 342 | spam_token_count = nb.feature_count_[1, :] 343 | spam_token_count 344 | 345 | 346 | # create a DataFrame of tokens with their separate ham and spam counts 347 | tokens = pd.DataFrame({'token':X_train_tokens, 'ham':ham_token_count, 'spam':spam_token_count}).set_index('token') 348 | tokens.head() 349 | 350 | 351 | # examine 5 random DataFrame rows 352 | tokens.sample(5, random_state=6) 353 | 354 | 355 | # Naive Bayes counts the number of observations in each class 356 | nb.class_count_ 357 | 358 | 359 | # Before we can calculate the "spamminess" of each token, we need to avoid **dividing by zero** and account for the **class imbalance**. 360 | 361 | # add 1 to ham and spam counts to avoid dividing by 0 362 | tokens['ham'] = tokens.ham + 1 363 | tokens['spam'] = tokens.spam + 1 364 | tokens.sample(5, random_state=6) 365 | 366 | 367 | # convert the ham and spam counts into frequencies 368 | tokens['ham'] = tokens.ham / nb.class_count_[0] 369 | tokens['spam'] = tokens.spam / nb.class_count_[1] 370 | tokens.sample(5, random_state=6) 371 | 372 | 373 | # calculate the ratio of spam-to-ham for each token 374 | tokens['spam_ratio'] = tokens.spam / tokens.ham 375 | tokens.sample(5, random_state=6) 376 | 377 | 378 | # examine the DataFrame sorted by spam_ratio 379 | # note: use sort() instead of sort_values() for pandas 0.16.2 and earlier 380 | tokens.sort_values('spam_ratio', ascending=False) 381 | 382 | 383 | # look up the spam_ratio for a given token 384 | tokens.loc['dating', 'spam_ratio'] 385 | 386 | 387 | # ## Part 8: Practicing this workflow on another dataset 388 | # 389 | # Please open the **`exercise.ipynb`** notebook (or the **`exercise.py`** script). 390 | 391 | # ## Part 9: Tuning the vectorizer (discussion) 392 | # 393 | # Thus far, we have been using the default parameters of [CountVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html): 394 | 395 | # show default parameters for CountVectorizer 396 | vect 397 | 398 | 399 | # However, the vectorizer is worth tuning, just like a model is worth tuning! Here are a few parameters that you might want to tune: 400 | # 401 | # - **stop_words:** string {'english'}, list, or None (default) 402 | # - If 'english', a built-in stop word list for English is used. 403 | # - If a list, that list is assumed to contain stop words, all of which will be removed from the resulting tokens. 404 | # - If None, no stop words will be used. 405 | 406 | # remove English stop words 407 | vect = CountVectorizer(stop_words='english') 408 | 409 | 410 | # - **ngram_range:** tuple (min_n, max_n), default=(1, 1) 411 | # - The lower and upper boundary of the range of n-values for different n-grams to be extracted. 412 | # - All values of n such that min_n <= n <= max_n will be used. 
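# (Caution: including 2-grams can greatly increase the number of features, since most word pairs are rare; combining ngram_range with min_df or max_features is one way to keep the vocabulary manageable.)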
413 | 414 | # include 1-grams and 2-grams 415 | vect = CountVectorizer(ngram_range=(1, 2)) 416 | 417 | 418 | # - **max_df:** float in range [0.0, 1.0] or int, default=1.0 419 | # - When building the vocabulary, ignore terms that have a document frequency strictly higher than the given threshold (corpus-specific stop words). 420 | # - If float, the parameter represents a proportion of documents. 421 | # - If integer, the parameter represents an absolute count. 422 | 423 | # ignore terms that appear in more than 50% of the documents 424 | vect = CountVectorizer(max_df=0.5) 425 | 426 | 427 | # - **min_df:** float in range [0.0, 1.0] or int, default=1 428 | # - When building the vocabulary, ignore terms that have a document frequency strictly lower than the given threshold. (This value is also called "cut-off" in the literature.) 429 | # - If float, the parameter represents a proportion of documents. 430 | # - If integer, the parameter represents an absolute count. 431 | 432 | # only keep terms that appear in at least 2 documents 433 | vect = CountVectorizer(min_df=2) 434 | 435 | 436 | # **Guidelines for tuning CountVectorizer:** 437 | # 438 | # - Use your knowledge of the **problem** and the **text**, and your understanding of the **tuning parameters**, to help you decide what parameters to tune and how to tune them. 439 | # - **Experiment**, and let the data tell you the best approach! 440 | -------------------------------------------------------------------------------- /ml-with-text/youtube.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ritchieng/machine-learning-dataschool/33d7ea8b1171923a91871b1fb4a12b8c47ceefc2/ml-with-text/youtube.jpg -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | seaborn 2 | -------------------------------------------------------------------------------- /styles/custom.css: -------------------------------------------------------------------------------- 1 | 53 | --------------------------------------------------------------------------------