├── .gitignore
├── .nojekyll
├── LICENSE
├── README.md
├── check_env.ipynb
├── images
    ├── check_env-1.png
    ├── check_env-2.png
    └── download-repo.png
├── notebooks
    ├── 01 - Data Loading.ipynb
    ├── 02 - Supervised Learning.ipynb
    ├── 03 - Preprocessing.ipynb
    ├── data
    │   ├── adult.csv
    │   └── ram_price.csv
    └── solutions
    │   ├── load_adult.py
    │   ├── load_iris.py
    │   └── train_iris.py
├── slides
    ├── 01-introduction.html
    ├── 02-supervised-learning.html
    ├── 03-preprocessing.html
    ├── 04-missing_values.html
    ├── images
    │   ├── PDSH.png
    │   ├── alpha_go.png
    │   ├── amazon1.png
    │   ├── amazon2.png
    │   ├── amazon_explanations.png
    │   ├── api-table.png
    │   ├── apm.png
    │   ├── boro_ordinal.png
    │   ├── boro_ordinal_classification.png
    │   ├── column_transformer_schematic.png
    │   ├── cross_validation_new.png
    │   ├── esl.png
    │   ├── exoplanet.png
    │   ├── facebook_gael.png
    │   ├── fancy_impute_comparison.png
    │   ├── fb1.png
    │   ├── fb2.png
    │   ├── fb3.png
    │   ├── grid_search_cross_validation_new.png
    │   ├── house_price_boxplot.png
    │   ├── house_price_scaled_box.png
    │   ├── house_price_scatter.png
    │   ├── imlp.png
    │   ├── imputation-median-schema.png
    │   ├── imputation-schema.png
    │   ├── information_leak_preprocessing.png
    │   ├── knn_boundary_dataset.png
    │   ├── knn_boundary_k1.png
    │   ├── knn_boundary_k3.png
    │   ├── knn_boundary_test_points.png
    │   ├── knn_boundary_varying_k.png
    │   ├── knn_imputation.png
    │   ├── knn_model_complexity.png
    │   ├── knn_scaling.png
    │   ├── knn_scaling2.png
    │   ├── knn_vs_nearest_centroid.png
    │   ├── matrix-representation.png
    │   ├── mean_knn_rf_comparison.png
    │   ├── med_knn_rf_comparison.png
    │   ├── median_imputation.png
    │   ├── missing_values_img_17.png
    │   ├── missing_values_img_19.png
    │   ├── missing_values_img_20.png
    │   ├── missing_values_img_22.png
    │   ├── missing_values_img_23.png
    │   ├── missing_values_img_24.png
    │   ├── missing_values_img_27.png
    │   ├── nao.png
    │   ├── no_information_leak_preprocessing.png
    │   ├── no_separate_scaling.png
    │   ├── overfitting_underfitting_cartoon_full.png
    │   ├── overfitting_underfitting_cartoon_generalization.png
    │   ├── overfitting_underfitting_cartoon_train.png
    │   ├── pipeline.png
    │   ├── propublica_compas.png
    │   ├── reinforcement_cycle.png
    │   ├── row_nan_col_nan.png
    │   ├── scaler_comparison_scatter.png
    │   ├── shuffle_split_cv.png
    │   ├── sklearn-docs.png
    │   ├── sklearn_logo.png
    │   ├── spotify.png
    │   ├── stratified_cv.png
    │   ├── supervised-ml-api.png
    │   ├── supervised-ml-workflow.png
    │   ├── threefold_split.png
    │   ├── time_series_cv.png
    │   ├── train-test-split.png
    │   ├── train_test_set_2d_classification.png
    │   ├── train_test_split_new.png
    │   ├── train_test_validation_split.png
    │   ├── unsupervised-ml-workflow.png
    │   └── unsupervised_ml_api.png
    ├── sklearn_logo.png
    └── style.css
└── todo.rst


/.gitignore:
--------------------------------------------------------------------------------
 1 | # exlude datasets and externals
 2 | notebooks/datasets
 3 | notebooks/joblib/
 4 | 
 5 | # exclude temporary files
 6 | .ipynb_checkpoints
 7 | .DS_Store
 8 | gmon.out
 9 | __pycache__
10 | *.pyc
11 | *.o
12 | *.so
13 | *.gcno
14 | *.swp
15 | *.egg-info
16 | *.egg
17 | *~
18 | build
19 | dist
20 | lib/test
21 | doc/_build
22 | *env
23 | *ENV
24 | .idea
25 | 


--------------------------------------------------------------------------------
/.nojekyll:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/amueller/ml-workshop-1-of-4/917850cf6e751c36edb6d9692c48878b42c7f263/.nojekyll


--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
 1 | MIT License
 2 | 
 3 | Copyright (c) 2017 Andreas Mueller
 4 | 
 5 | Permission is hereby granted, free of charge, to any person obtaining a copy
 6 | of this software and associated documentation files (the "Software"), to deal
 7 | in the Software without restriction, including without limitation the rights
 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
 9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 | 
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 | 
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 | 


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
  1 | Introduction to Machine learning with scikit-learn
  2 | ========================================================
  3 | 
  4 | Part 1 of 4
  5 | -----------
  6 | Other parts:
  7 | - [Part 2](https://github.com/amueller/ml-workshop-2-of-4)
  8 | - [Part 3](https://github.com/amueller/ml-workshop-3-of-4)
  9 | - [Part 4](https://github.com/amueller/ml-workshop-4-of-4)
 10 | 
 11 | Content
 12 | -------
 13 | - [What is machine learning and what can it do for you?](https://amueller.github.io/ml-workshop-1-of-4/slides/01-introduction.html)
 14 | - [Data loading and basic API of scikit-learn](https://amueller.github.io/ml-workshop-1-of-4/slides/02-supervised-learning.html)
 15 | - [Fundamentals of Data Preprocessing: scaling and categorical data](https://amueller.github.io/ml-workshop-1-of-4/slides/03-preprocessing.html)
 16 | - [Imputation: dealing with missing values](https://amueller.github.io/ml-workshop-1-of-4/slides/04-missing_values.html)
 17 | 
 18 | Instructor
 19 | -----------
 20 | 
 21 | - [Andreas Mueller](http://amuller.github.io) [@amuellerml](https://twitter.com/amuellerml) - Columbia University; [Book: Introduction to Machine Learning with Python](http://shop.oreilly.com/product/0636920030515.do)
 22 | 
 23 | ---
 24 | 
 25 | This repository will contain the teaching material and other info associated
 26 | with the "Introduction to Machine Learning with scikit-learn" course.
 27 | 
 28 | About the workshop
 29 | ------------------
 30 | Machine learning has become an indispensable tool across many areas of research and commercial applications. From text-to-speech for your phone to detecting the Higgs boson, machine learning excels at extracting knowledge from large amounts of data. This talk will give a general introduction to machine learning, as well as introduce practical tools for you to apply machine learning in your research. We will focus on one particularly important subfield of machine learning, supervised learning. The goal of supervised learning is to "learn" a function that maps inputs x to an output y, by using a collection of training data consisting of input-output pairs. We will walk through formulating a problem as a supervised machine learning problem, creating the necessary training data and applying and evaluating a machine learning algorithm. This workshop should give you all the necessary background to start using machine learning yourself.
 31 | 
 32 | Prerequisites
 33 | -------------
 34 | This workshop assumes familiarity with Jupyter notebooks and basics of pandas, matplotlib and numpy.
 35 | 
 36 | 
 37 | Obtaining the Tutorial Material
 38 | --------------------------------
 39 | 
 40 | 
 41 | If you are familiar with git, it is most convenient if you clone the GitHub repository. This
 42 | is highly encouraged as it allows you to easily synchronize any changes to the material.
 43 | 
 44 | ```
 45 | git clone https://github.com/amueller/ml-workshop-1-of-4.git
 46 | ```
 47 | 
 48 | If you are not familiar with git, you can download the repository as a .zip file by heading over to the GitHub repository (https://github.com/amueller/ml-workshop-1-of-4) in your browser and click the green “Download” button in the upper right.
 49 | 
 50 | ![](images/download-repo.png)
 51 | 
 52 | Please note that I may add and improve the material until shortly before the tutorial session, and we recommend you to update your copy of the materials one day before the tutorials. If you have an GitHub account and forked/cloned the repository via GitHub, you can sync your existing fork with via the following commands:
 53 | 
 54 | ```
 55 | git pull origin master
 56 | ```
 57 | 
 58 | 
 59 | Installation Notes
 60 | ------------------
 61 | 
 62 | This tutorial will require recent installations of
 63 | 
 64 | - [NumPy](http://www.numpy.org)
 65 | - [SciPy](http://www.scipy.org)
 66 | - [matplotlib](http://matplotlib.org)
 67 | - [pillow](https://python-pillow.org)
 68 | - [pandas](http://pandas.pydata.org)
 69 | - [scikit-learn](http://scikit-learn.org/stable/) (>=0.22.1)
 70 | - [IPython](http://ipython.readthedocs.org/en/stable/)
 71 | - [Jupyter Notebook](http://jupyter.org)
 72 | 
 73 | The last one is important, you should be able to type:
 74 | 
 75 |     jupyter notebook
 76 | 
 77 | in your terminal window and see the notebook panel load in your web browser.
 78 | Try opening and running a notebook from the material to see check that it works.
 79 | 
 80 | For users who do not yet have these  packages installed, a relatively
 81 | painless way to install all the requirements is to use a Python distribution
 82 | such as [Anaconda](https://www.continuum.io/downloads), which includes
 83 | the most relevant Python packages for science, math, engineering, and
 84 | data analysis; Anaconda can be downloaded and installed for free
 85 | including commercial use and redistribution.
 86 | The code examples in this tutorial requires Python 3.5 or later.
 87 | 
 88 | After obtaining the material, we **strongly recommend** you to open and execute
 89 | a Jupyter Notebook `jupter notebook check_env.ipynb` that is located at the
 90 | top level of this repository. Inside the repository, you can open the notebook
 91 | by executing
 92 | 
 93 | ```bash
 94 | jupyter notebook check_env.ipynb
 95 | ```
 96 | 
 97 | inside this repository. Inside the Notebook, you can run the code cell by
 98 | clicking on the "Run Cells" button as illustrated in the figure below:
 99 | 
100 | ![](images/check_env-1.png)
101 | 
102 | 
103 | Finally, if your environment satisfies the requirements for the tutorials, the executed code cell will produce an output message as shown below:
104 | 
105 | ![](images/check_env-2.png)
106 | 


--------------------------------------------------------------------------------
/check_env.ipynb:
--------------------------------------------------------------------------------
 1 | {
 2 |  "cells": [
 3 |   {
 4 |    "cell_type": "code",
 5 |    "execution_count": null,
 6 |    "metadata": {},
 7 |    "outputs": [],
 8 |    "source": [
 9 |     "from distutils.version import LooseVersion as Version\n",
10 |     "import sys\n",
11 |     "\n",
12 |     "\n",
13 |     "OK = '\\x1b[42m[ OK ]\\x1b[0m'\n",
14 |     "FAIL = \"\\x1b[41m[FAIL]\\x1b[0m\"\n",
15 |     "\n",
16 |     "try:\n",
17 |     "    import importlib\n",
18 |     "except ImportError:\n",
19 |     "    print(FAIL, \"Python version 3.5 is required,\"\n",
20 |     "                \" but %s is installed.\" % sys.version)\n",
21 |     "\n",
22 |     "    \n",
23 |     "def import_version(pkg, min_ver, fail_msg=\"\"):\n",
24 |     "    mod = None\n",
25 |     "    try:\n",
26 |     "        mod = importlib.import_module(pkg)\n",
27 |     "        ver = mod.__version__\n",
28 |     "        if Version(ver) < min_ver:\n",
29 |     "            print(FAIL, \"%s version %s or higher required, but %s installed.\"\n",
30 |     "                  % (lib, min_ver, ver))\n",
31 |     "        else:\n",
32 |     "            print(OK, '%s version %s' % (pkg, ver))\n",
33 |     "    except ImportError:\n",
34 |     "        print(FAIL, '%s not installed. %s' % (pkg, fail_msg))\n",
35 |     "    return mod\n",
36 |     "\n",
37 |     "\n",
38 |     "# first check the python version\n",
39 |     "print('Using python in', sys.prefix)\n",
40 |     "print(sys.version)\n",
41 |     "pyversion = Version(sys.version)\n",
42 |     "if pyversion < \"3.5\":\n",
43 |     "    print(FAIL, \"Python version 3.5 is required,\"\n",
44 |     "                \" but %s is installed.\" % sys.version)\n",
45 |     "print()\n",
46 |     "requirements = {'numpy': \"1.6.1\", 'scipy': \"1.0\", 'matplotlib': \"2.0\",\n",
47 |     "                'IPython': \"3.0\", 'sklearn': \"0.22.1\", 'pandas': \"0.18\"}\n",
48 |     "\n",
49 |     "# now the dependencies\n",
50 |     "for lib, required_version in list(requirements.items()):\n",
51 |     "    import_version(lib, required_version)"
52 |    ]
53 |   }
54 |  ],
55 |  "metadata": {
56 |   "anaconda-cloud": {},
57 |   "kernelspec": {
58 |    "display_name": "Python 3",
59 |    "language": "python",
60 |    "name": "python3"
61 |   },
62 |   "language_info": {
63 |    "codemirror_mode": {
64 |     "name": "ipython",
65 |     "version": 3
66 |    },
67 |    "file_extension": ".py",
68 |    "mimetype": "text/x-python",
69 |    "name": "python",
70 |    "nbconvert_exporter": "python",
71 |    "pygments_lexer": "ipython3",
72 |    "version": "3.7.3"
73 |   }
74 |  },
75 |  "nbformat": 4,
76 |  "nbformat_minor": 4
77 | }
78 | 


--------------------------------------------------------------------------------
/images/check_env-1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/amueller/ml-workshop-1-of-4/917850cf6e751c36edb6d9692c48878b42c7f263/images/check_env-1.png


--------------------------------------------------------------------------------
/images/check_env-2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/amueller/ml-workshop-1-of-4/917850cf6e751c36edb6d9692c48878b42c7f263/images/check_env-2.png


--------------------------------------------------------------------------------
/images/download-repo.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/amueller/ml-workshop-1-of-4/917850cf6e751c36edb6d9692c48878b42c7f263/images/download-repo.png


--------------------------------------------------------------------------------
/notebooks/01 - Data Loading.ipynb:
--------------------------------------------------------------------------------
  1 | {
  2 |  "cells": [
  3 |   {
  4 |    "cell_type": "markdown",
  5 |    "metadata": {},
  6 |    "source": [
  7 |     "# Data Loading"
  8 |    ]
  9 |   },
 10 |   {
 11 |    "cell_type": "markdown",
 12 |    "metadata": {},
 13 |    "source": [
 14 |     "Get some data to play with"
 15 |    ]
 16 |   },
 17 |   {
 18 |    "cell_type": "code",
 19 |    "execution_count": null,
 20 |    "metadata": {},
 21 |    "outputs": [],
 22 |    "source": [
 23 |     "from sklearn.datasets import fetch_openml\n",
 24 |     "blood = fetch_openml('blood-transfusion-service-center')\n",
 25 |     "print(blood.DESCR)"
 26 |    ]
 27 |   },
 28 |   {
 29 |    "cell_type": "code",
 30 |    "execution_count": null,
 31 |    "metadata": {},
 32 |    "outputs": [],
 33 |    "source": [
 34 |     "blood.data.shape"
 35 |    ]
 36 |   },
 37 |   {
 38 |    "cell_type": "code",
 39 |    "execution_count": null,
 40 |    "metadata": {},
 41 |    "outputs": [],
 42 |    "source": [
 43 |     "blood.data"
 44 |    ]
 45 |   },
 46 |   {
 47 |    "cell_type": "code",
 48 |    "execution_count": null,
 49 |    "metadata": {},
 50 |    "outputs": [],
 51 |    "source": [
 52 |     "import pandas as pd\n",
 53 |     "X = pd.DataFrame(blood.data, columns=['recency', 'frequency', 'total_amount', 'since_first'])"
 54 |    ]
 55 |   },
 56 |   {
 57 |    "cell_type": "code",
 58 |    "execution_count": null,
 59 |    "metadata": {},
 60 |    "outputs": [],
 61 |    "source": [
 62 |     "blood.target.shape"
 63 |    ]
 64 |   },
 65 |   {
 66 |    "cell_type": "code",
 67 |    "execution_count": null,
 68 |    "metadata": {},
 69 |    "outputs": [],
 70 |    "source": [
 71 |     "blood.target"
 72 |    ]
 73 |   },
 74 |   {
 75 |    "cell_type": "code",
 76 |    "execution_count": null,
 77 |    "metadata": {},
 78 |    "outputs": [],
 79 |    "source": [
 80 |     "y = pd.Series(blood.target)\n",
 81 |     "y.value_counts()"
 82 |    ]
 83 |   },
 84 |   {
 85 |    "cell_type": "code",
 86 |    "execution_count": null,
 87 |    "metadata": {},
 88 |    "outputs": [],
 89 |    "source": [
 90 |     "import matplotlib.pyplot as plt\n",
 91 |     "pd.plotting.scatter_matrix(X, c=y=='2', cmap='Paired', figsize=(10, 10));"
 92 |    ]
 93 |   },
 94 |   {
 95 |    "cell_type": "markdown",
 96 |    "metadata": {},
 97 |    "source": [
 98 |     "**Data is always a numpy array (or sparse matrix) of shape (n_samples, n_features)**"
 99 |    ]
100 |   },
101 |   {
102 |    "cell_type": "markdown",
103 |    "metadata": {},
104 |    "source": [
105 |     "Split the data to get going"
106 |    ]
107 |   },
108 |   {
109 |    "cell_type": "code",
110 |    "execution_count": null,
111 |    "metadata": {},
112 |    "outputs": [],
113 |    "source": [
114 |     "from sklearn.model_selection import train_test_split\n",
115 |     "X_train, X_test, y_train, y_test = train_test_split(X, y)"
116 |    ]
117 |   },
118 |   {
119 |    "cell_type": "code",
120 |    "execution_count": null,
121 |    "metadata": {},
122 |    "outputs": [],
123 |    "source": [
124 |     "X.shape"
125 |    ]
126 |   },
127 |   {
128 |    "cell_type": "code",
129 |    "execution_count": null,
130 |    "metadata": {},
131 |    "outputs": [],
132 |    "source": [
133 |     "X_train.shape"
134 |    ]
135 |   },
136 |   {
137 |    "cell_type": "code",
138 |    "execution_count": null,
139 |    "metadata": {},
140 |    "outputs": [],
141 |    "source": [
142 |     "X_test.shape"
143 |    ]
144 |   },
145 |   {
146 |    "cell_type": "markdown",
147 |    "metadata": {},
148 |    "source": [
149 |     "# Exercises\n",
150 |     "\n",
151 |     "## Excercise 1\n",
152 |     "\n",
153 |     "Load the iris dataset from the ``sklearn.datasets`` module using the ``load_iris`` function.\n",
154 |     "The function returns a dictionary-like object that has the same attributes as ``blood``.\n",
155 |     "\n",
156 |     "What is the number of classes, features and data points in this dataset?\n",
157 |     "Use a scatterplot to visualize the dataset.\n",
158 |     "\n",
159 |     "You can look at ``DESCR`` attribute to learn more about the dataset.\n",
160 |     "``print(iris.DESCR)``\n",
161 |     "\n",
162 |     "Split the data into training and test set.\n",
163 |     "\n",
164 |     "## Exercise 2\n",
165 |     "\n",
166 |     "Usually data doesn't come in that nice a format. You can find the csv file that contains the iris dataset at the following path:\n",
167 |     "\n",
168 |     "```python\n",
169 |     "import sklearn.datasets\n",
170 |     "import os\n",
171 |     "iris_path = os.path.join(sklearn.datasets.__path__[0], 'data', 'iris.csv')\n",
172 |     "```\n",
173 |     "Load the data from there using pandas ``pd.read_csv`` method and bring it into the same format as before with the data in a variable X and the labels in a variable y. The first few lines of ``iris.csv`` file looks like:\n",
174 |     "\n",
175 |     "```\n",
176 |     "150,4,setosa,versicolor,virginica\n",
177 |     "5.1,3.5,1.4,0.2,0\n",
178 |     "4.9,3.0,1.4,0.2,0\n",
179 |     "4.7,3.2,1.3,0.2,0\n",
180 |     "4.6,3.1,1.5,0.2,0\n",
181 |     "```"
182 |    ]
183 |   },
184 |   {
185 |    "cell_type": "markdown",
186 |    "metadata": {},
187 |    "source": [
188 |     "# http://github.com/amueller/ml-workshop-1-of-4"
189 |    ]
190 |   },
191 |   {
192 |    "cell_type": "code",
193 |    "execution_count": null,
194 |    "metadata": {},
195 |    "outputs": [],
196 |    "source": [
197 |     "# %load solutions/load_iris.py"
198 |    ]
199 |   }
200 |  ],
201 |  "metadata": {
202 |   "anaconda-cloud": {},
203 |   "kernelspec": {
204 |    "display_name": "Python 3",
205 |    "language": "python",
206 |    "name": "python3"
207 |   },
208 |   "language_info": {
209 |    "codemirror_mode": {
210 |     "name": "ipython",
211 |     "version": 3
212 |    },
213 |    "file_extension": ".py",
214 |    "mimetype": "text/x-python",
215 |    "name": "python",
216 |    "nbconvert_exporter": "python",
217 |    "pygments_lexer": "ipython3",
218 |    "version": "3.7.3"
219 |   }
220 |  },
221 |  "nbformat": 4,
222 |  "nbformat_minor": 4
223 | }
224 | 


--------------------------------------------------------------------------------
/notebooks/02 - Supervised Learning.ipynb:
--------------------------------------------------------------------------------
  1 | {
  2 |  "cells": [
  3 |   {
  4 |    "cell_type": "markdown",
  5 |    "metadata": {},
  6 |    "source": [
  7 |     "# Introduction to Scikit-learn"
  8 |    ]
  9 |   },
 10 |   {
 11 |    "cell_type": "code",
 12 |    "execution_count": null,
 13 |    "metadata": {},
 14 |    "outputs": [],
 15 |    "source": [
 16 |     "import matplotlib.pyplot as plt\n",
 17 |     "import numpy as np\n",
 18 |     "import sklearn\n",
 19 |     "sklearn.set_config(print_changed_only=True)"
 20 |    ]
 21 |   },
 22 |   {
 23 |    "cell_type": "code",
 24 |    "execution_count": null,
 25 |    "metadata": {},
 26 |    "outputs": [],
 27 |    "source": [
 28 |     "from sklearn.datasets import fetch_openml\n",
 29 |     "from sklearn.model_selection import train_test_split\n",
 30 |     "blood = fetch_openml('blood-transfusion-service-center')\n",
 31 |     "\n",
 32 |     "X_train, X_test, y_train, y_test = train_test_split(\n",
 33 |     "    blood.data, blood.target, random_state=0)"
 34 |    ]
 35 |   },
 36 |   {
 37 |    "cell_type": "code",
 38 |    "execution_count": null,
 39 |    "metadata": {},
 40 |    "outputs": [],
 41 |    "source": [
 42 |     "X_train.shape"
 43 |    ]
 44 |   },
 45 |   {
 46 |    "cell_type": "code",
 47 |    "execution_count": null,
 48 |    "metadata": {},
 49 |    "outputs": [],
 50 |    "source": [
 51 |     "import pandas as pd\n",
 52 |     "pd.Series(y_train).value_counts()"
 53 |    ]
 54 |   },
 55 |   {
 56 |    "cell_type": "code",
 57 |    "execution_count": null,
 58 |    "metadata": {},
 59 |    "outputs": [],
 60 |    "source": [
 61 |     "pd.Series(y_train).value_counts(normalize=True)"
 62 |    ]
 63 |   },
 64 |   {
 65 |    "cell_type": "markdown",
 66 |    "metadata": {},
 67 |    "source": [
 68 |     "Really Simple API\n",
 69 |     "-------------------\n",
 70 |     "0) Import your model class"
 71 |    ]
 72 |   },
 73 |   {
 74 |    "cell_type": "code",
 75 |    "execution_count": null,
 76 |    "metadata": {},
 77 |    "outputs": [],
 78 |    "source": [
 79 |     "from sklearn.svm import LinearSVC"
 80 |    ]
 81 |   },
 82 |   {
 83 |    "cell_type": "markdown",
 84 |    "metadata": {},
 85 |    "source": [
 86 |     "1) Instantiate an object and set the parameters"
 87 |    ]
 88 |   },
 89 |   {
 90 |    "cell_type": "code",
 91 |    "execution_count": null,
 92 |    "metadata": {},
 93 |    "outputs": [],
 94 |    "source": [
 95 |     "svm = LinearSVC()"
 96 |    ]
 97 |   },
 98 |   {
 99 |    "cell_type": "markdown",
100 |    "metadata": {},
101 |    "source": [
102 |     "2) Fit the model"
103 |    ]
104 |   },
105 |   {
106 |    "cell_type": "code",
107 |    "execution_count": null,
108 |    "metadata": {},
109 |    "outputs": [],
110 |    "source": [
111 |     "svm.fit(X_train, y_train)"
112 |    ]
113 |   },
114 |   {
115 |    "cell_type": "markdown",
116 |    "metadata": {},
117 |    "source": [
118 |     "3) Apply / evaluate"
119 |    ]
120 |   },
121 |   {
122 |    "cell_type": "code",
123 |    "execution_count": null,
124 |    "metadata": {},
125 |    "outputs": [],
126 |    "source": [
127 |     "print(svm.predict(X_train))\n",
128 |     "print(y_train)"
129 |    ]
130 |   },
131 |   {
132 |    "cell_type": "code",
133 |    "execution_count": null,
134 |    "metadata": {},
135 |    "outputs": [],
136 |    "source": [
137 |     "svm.score(X_train, y_train)"
138 |    ]
139 |   },
140 |   {
141 |    "cell_type": "code",
142 |    "execution_count": null,
143 |    "metadata": {},
144 |    "outputs": [],
145 |    "source": [
146 |     "svm.score(X_test, y_test)"
147 |    ]
148 |   },
149 |   {
150 |    "cell_type": "markdown",
151 |    "metadata": {},
152 |    "source": [
153 |     "And again\n",
154 |     "---------"
155 |    ]
156 |   },
157 |   {
158 |    "cell_type": "code",
159 |    "execution_count": null,
160 |    "metadata": {},
161 |    "outputs": [],
162 |    "source": [
163 |     "from sklearn.ensemble import RandomForestClassifier"
164 |    ]
165 |   },
166 |   {
167 |    "cell_type": "code",
168 |    "execution_count": null,
169 |    "metadata": {},
170 |    "outputs": [],
171 |    "source": [
172 |     "rf = RandomForestClassifier()"
173 |    ]
174 |   },
175 |   {
176 |    "cell_type": "code",
177 |    "execution_count": null,
178 |    "metadata": {},
179 |    "outputs": [],
180 |    "source": [
181 |     "rf.fit(X_train, y_train)"
182 |    ]
183 |   },
184 |   {
185 |    "cell_type": "code",
186 |    "execution_count": null,
187 |    "metadata": {},
188 |    "outputs": [],
189 |    "source": [
190 |     "rf.score(X_train, y_train)"
191 |    ]
192 |   },
193 |   {
194 |    "cell_type": "code",
195 |    "execution_count": null,
196 |    "metadata": {},
197 |    "outputs": [],
198 |    "source": [
199 |     "rf.score(X_test, y_test)"
200 |    ]
201 |   },
202 |   {
203 |    "cell_type": "markdown",
204 |    "metadata": {},
205 |    "source": [
206 |     "# Materials: https://github.com/amueller/ml-workshop-1-of-4"
207 |    ]
208 |   },
209 |   {
210 |    "cell_type": "markdown",
211 |    "metadata": {},
212 |    "source": [
213 |     "# Exercises\n",
214 |     "\n",
215 |     "## Exercise 1\n",
216 |     "Load the iris dataset from the ``sklearn.datasets`` module using the ``load_iris`` function.\n",
217 |     "\n",
218 |     "Split it into training and test set using ``train_test_split``.\n",
219 |     "\n",
220 |     "## Exercise 2\n",
221 |     "Then train an evaluate ``sklearn.neighbors.KNeighborsClassifier``, the RandomForestClassifier and  ``sklearn.linear_model.LogisticRegression`` on the iris dataset.\n",
222 |     "How do these perform on the training set vs the test set? Which one is the best on the training set, which one is the best on the test set?\n",
223 |     "\n",
224 |     "## Exercise 3 (extra)\n",
225 |     "Can you construct a binary classification dataset (using np.random for example) on which ``sklearn.linear_model.LogisticRegression`` achieves an accuracy of 1? Can you construct a binary classification dataset on which it achieves accuracy 0.5?"
226 |    ]
227 |   },
228 |   {
229 |    "cell_type": "code",
230 |    "execution_count": null,
231 |    "metadata": {},
232 |    "outputs": [],
233 |    "source": [
234 |     "# %load solutions/train_iris.py"
235 |    ]
236 |   }
237 |  ],
238 |  "metadata": {
239 |   "anaconda-cloud": {},
240 |   "kernelspec": {
241 |    "display_name": "Python 3",
242 |    "language": "python",
243 |    "name": "python3"
244 |   },
245 |   "language_info": {
246 |    "codemirror_mode": {
247 |     "name": "ipython",
248 |     "version": 3
249 |    },
250 |    "file_extension": ".py",
251 |    "mimetype": "text/x-python",
252 |    "name": "python",
253 |    "nbconvert_exporter": "python",
254 |    "pygments_lexer": "ipython3",
255 |    "version": "3.7.3"
256 |   }
257 |  },
258 |  "nbformat": 4,
259 |  "nbformat_minor": 4
260 | }
261 | 


--------------------------------------------------------------------------------
/notebooks/03 - Preprocessing.ipynb:
--------------------------------------------------------------------------------
  1 | {
  2 |  "cells": [
  3 |   {
  4 |    "cell_type": "markdown",
  5 |    "metadata": {},
  6 |    "source": [
  7 |     "# Preprocessing"
  8 |    ]
  9 |   },
 10 |   {
 11 |    "cell_type": "code",
 12 |    "execution_count": null,
 13 |    "metadata": {},
 14 |    "outputs": [],
 15 |    "source": [
 16 |     "import numpy as np\n",
 17 |     "import matplotlib.pyplot as plt\n",
 18 |     "import sklearn\n",
 19 |     "sklearn.set_config(print_changed_only=True)"
 20 |    ]
 21 |   },
 22 |   {
 23 |    "cell_type": "code",
 24 |    "execution_count": null,
 25 |    "metadata": {},
 26 |    "outputs": [],
 27 |    "source": [
 28 |     "from sklearn.datasets import load_boston\n",
 29 |     "boston = load_boston()\n",
 30 |     "from sklearn.model_selection import train_test_split\n",
 31 |     "X, y = boston.data, boston.target\n",
 32 |     "X_train, X_test, y_train, y_test = train_test_split(\n",
 33 |     "    X, y, random_state=0)"
 34 |    ]
 35 |   },
 36 |   {
 37 |    "cell_type": "code",
 38 |    "execution_count": null,
 39 |    "metadata": {},
 40 |    "outputs": [],
 41 |    "source": [
 42 |     "print(boston.DESCR)"
 43 |    ]
 44 |   },
 45 |   {
 46 |    "cell_type": "code",
 47 |    "execution_count": null,
 48 |    "metadata": {},
 49 |    "outputs": [],
 50 |    "source": [
 51 |     "fig, axes = plt.subplots(3, 5, figsize=(20, 10))\n",
 52 |     "for i, ax in enumerate(axes.ravel()):\n",
 53 |     "    if i > 12:\n",
 54 |     "        ax.set_visible(False)\n",
 55 |     "        continue\n",
 56 |     "    ax.plot(X[:, i], y, 'o', alpha=.5)\n",
 57 |     "    ax.set_title(\"{}: {}\".format(i, boston.feature_names[i]))\n",
 58 |     "    ax.set_ylabel(\"MEDV\")"
 59 |    ]
 60 |   },
 61 |   {
 62 |    "cell_type": "code",
 63 |    "execution_count": null,
 64 |    "metadata": {},
 65 |    "outputs": [],
 66 |    "source": [
 67 |     "plt.boxplot(X)\n",
 68 |     "plt.xticks(np.arange(1, X.shape[1] + 1),\n",
 69 |     "           boston.feature_names, rotation=30, ha=\"right\");"
 70 |    ]
 71 |   },
 72 |   {
 73 |    "cell_type": "code",
 74 |    "execution_count": null,
 75 |    "metadata": {},
 76 |    "outputs": [],
 77 |    "source": [
 78 |     "from sklearn.preprocessing import StandardScaler\n",
 79 |     "scaler = StandardScaler()\n",
 80 |     "X_train_scaled = scaler.fit_transform(X_train)"
 81 |    ]
 82 |   },
 83 |   {
 84 |    "cell_type": "code",
 85 |    "execution_count": null,
 86 |    "metadata": {},
 87 |    "outputs": [],
 88 |    "source": [
 89 |     "plt.boxplot(X_train_scaled)\n",
 90 |     "plt.xticks(np.arange(1, X.shape[1] + 1),\n",
 91 |     "           boston.feature_names, rotation=30, ha=\"right\");"
 92 |    ]
 93 |   },
 94 |   {
 95 |    "cell_type": "code",
 96 |    "execution_count": null,
 97 |    "metadata": {},
 98 |    "outputs": [],
 99 |    "source": [
100 |     "from sklearn.neighbors import KNeighborsRegressor\n",
101 |     "knr = KNeighborsRegressor().fit(X_train, y_train)\n",
102 |     "knr.score(X_train, y_train)"
103 |    ]
104 |   },
105 |   {
106 |    "cell_type": "code",
107 |    "execution_count": null,
108 |    "metadata": {},
109 |    "outputs": [],
110 |    "source": [
111 |     "knr.score(X_test, y_test)"
112 |    ]
113 |   },
114 |   {
115 |    "cell_type": "code",
116 |    "execution_count": null,
117 |    "metadata": {},
118 |    "outputs": [],
119 |    "source": [
120 |     "knr_scaled = KNeighborsRegressor().fit(X_train_scaled, y_train)\n",
121 |     "knr_scaled.fit(X_train_scaled, y_train)\n",
122 |     "knr_scaled.score(X_train_scaled, y_train)"
123 |    ]
124 |   },
125 |   {
126 |    "cell_type": "code",
127 |    "execution_count": null,
128 |    "metadata": {},
129 |    "outputs": [],
130 |    "source": [
131 |     "X_test_scaled = scaler.transform(X_test)\n",
132 |     "knr_scaled.score(X_test_scaled, y_test)"
133 |    ]
134 |   },
135 |   {
136 |    "cell_type": "code",
137 |    "execution_count": null,
138 |    "metadata": {},
139 |    "outputs": [],
140 |    "source": [
141 |     "from sklearn.ensemble import RandomForestRegressor\n",
142 |     "rf = RandomForestRegressor(random_state=0)\n",
143 |     "rf.fit(X_train, y_train)\n",
144 |     "rf.score(X_test, y_test)"
145 |    ]
146 |   },
147 |   {
148 |    "cell_type": "code",
149 |    "execution_count": null,
150 |    "metadata": {},
151 |    "outputs": [],
152 |    "source": [
153 |     "rf_scaled = RandomForestRegressor(random_state=0)\n",
154 |     "rf_scaled.fit(X_train_scaled, y_train)\n",
155 |     "rf_scaled.score(X_test_scaled, y_test)"
156 |    ]
157 |   },
158 |   {
159 |    "cell_type": "markdown",
160 |    "metadata": {},
161 |    "source": [
162 |     "# Categorical Variables"
163 |    ]
164 |   },
165 |   {
166 |    "cell_type": "code",
167 |    "execution_count": null,
168 |    "metadata": {},
169 |    "outputs": [],
170 |    "source": [
171 |     "import pandas as pd\n",
172 |     "df = pd.DataFrame({'salary': [103, 89, 142, 54, 63, 219],\n",
173 |     "                   'boro': ['Manhattan', 'Queens', 'Manhattan', 'Brooklyn', 'Brooklyn', 'Bronx']})\n",
174 |     "df"
175 |    ]
176 |   },
177 |   {
178 |    "cell_type": "code",
179 |    "execution_count": null,
180 |    "metadata": {},
181 |    "outputs": [],
182 |    "source": [
183 |     "pd.get_dummies(df)"
184 |    ]
185 |   },
186 |   {
187 |    "cell_type": "code",
188 |    "execution_count": null,
189 |    "metadata": {},
190 |    "outputs": [],
191 |    "source": [
192 |     "from sklearn.compose import make_column_transformer\n",
193 |     "from sklearn.preprocessing import OneHotEncoder\n",
194 |     "categorical = df.dtypes == object\n",
195 |     "categorical"
196 |    ]
197 |   },
198 |   {
199 |    "cell_type": "code",
200 |    "execution_count": null,
201 |    "metadata": {},
202 |    "outputs": [],
203 |    "source": [
204 |     "~categorical"
205 |    ]
206 |   },
207 |   {
208 |    "cell_type": "code",
209 |    "execution_count": null,
210 |    "metadata": {},
211 |    "outputs": [],
212 |    "source": [
213 |     "ct = make_column_transformer((OneHotEncoder(), categorical),\n",
214 |     "                             (StandardScaler(), ~categorical))\n",
215 |     "ct.fit_transform(df)"
216 |    ]
217 |   },
218 |   {
219 |    "cell_type": "code",
220 |    "execution_count": null,
221 |    "metadata": {},
222 |    "outputs": [],
223 |    "source": [
224 |     "ct = make_column_transformer((OneHotEncoder(sparse=False), categorical))\n",
225 |     "ct.fit_transform(df)"
226 |    ]
227 |   },
228 |   {
229 |    "cell_type": "code",
230 |    "execution_count": null,
231 |    "metadata": {},
232 |    "outputs": [],
233 |    "source": [
234 |     "ct = make_column_transformer((OneHotEncoder(), categorical),\n",
235 |     "                             remainder='passthrough')\n",
236 |     "ct.fit_transform(df)"
237 |    ]
238 |   },
239 |   {
240 |    "cell_type": "code",
241 |    "execution_count": null,
242 |    "metadata": {},
243 |    "outputs": [],
244 |    "source": [
245 |     "ct = make_column_transformer((OneHotEncoder(), categorical),\n",
246 |     "                             remainder=StandardScaler())\n",
247 |     "ct.fit_transform(df)"
248 |    ]
249 |   },
250 |   {
251 |    "cell_type": "markdown",
252 |    "metadata": {},
253 |    "source": [
254 |     "# Exercises\n",
255 |     "\n",
256 |     "## Exercise 1\n",
257 |     "Load the \"adult\" datasets using consisting of income data from the census, including information whether someone has a salary of less than \\$50k or more. Look at the data using the ``head`` method. Our final goal in Exercise 4 will be to classify entries into those making less than \\$50k and those that make more.\n",
258 |     "\n",
259 |     "## Exercise 2\n",
260 |     "Experiment with visualizing the data. Can you find out which features influence the income the most?\n",
261 |     "\n",
262 |     "## Exercise 3\n",
263 |     "Separate the target variable from the features.\n",
264 |     "Split the data into training and test set.\n",
265 |     "Apply dummy encoding and scaling.\n",
266 |     "How did this change the number of variables?\n",
267 |     "\n",
268 |     "## Exercise 4\n",
269 |     "Build and evaluate a LogisticRegression model on the data.\n",
270 |     "\n"
271 |    ]
272 |   },
273 |   {
274 |    "cell_type": "code",
275 |    "execution_count": null,
276 |    "metadata": {},
277 |    "outputs": [],
278 |    "source": [
279 |     "data = pd.read_csv(\"data/adult.csv\", index_col=0)"
280 |    ]
281 |   },
282 |   {
283 |    "cell_type": "code",
284 |    "execution_count": null,
285 |    "metadata": {},
286 |    "outputs": [],
287 |    "source": [
288 |     "# %load solutions/load_adult.py"
289 |    ]
290 |   }
291 |  ],
292 |  "metadata": {
293 |   "anaconda-cloud": {},
294 |   "kernelspec": {
295 |    "display_name": "Python 3",
296 |    "language": "python",
297 |    "name": "python3"
298 |   },
299 |   "language_info": {
300 |    "codemirror_mode": {
301 |     "name": "ipython",
302 |     "version": 3
303 |    },
304 |    "file_extension": ".py",
305 |    "mimetype": "text/x-python",
306 |    "name": "python",
307 |    "nbconvert_exporter": "python",
308 |    "pygments_lexer": "ipython3",
309 |    "version": "3.7.3"
310 |   }
311 |  },
312 |  "nbformat": 4,
313 |  "nbformat_minor": 4
314 | }
315 | 


--------------------------------------------------------------------------------
/notebooks/data/ram_price.csv:
--------------------------------------------------------------------------------
  1 | ,date,price
  2 | 0,1957.0,411041792.0
  3 | 1,1959.0,67947725.0
  4 | 2,1960.0,5242880.0
  5 | 3,1965.0,2642412.0
  6 | 4,1970.0,734003.0
  7 | 5,1973.0,399360.0
  8 | 6,1974.0,314573.0
  9 | 7,1975.0,421888.0
 10 | 8,1975.08,180224.0
 11 | 9,1975.25,67584.0
 12 | 10,1975.75,49920.0
 13 | 11,1976.0,40704.0
 14 | 12,1976.17,48960.0
 15 | 13,1976.42,23040.0
 16 | 14,1976.58,32000.0
 17 | 15,1977.08,36800.0
 18 | 16,1978.17,28000.0
 19 | 17,1978.25,29440.0
 20 | 18,1978.33,19200.0
 21 | 19,1978.5,24000.0
 22 | 20,1978.58,16000.0
 23 | 21,1978.75,15200.0
 24 | 22,1979.0,10528.0
 25 | 23,1979.75,6704.0
 26 | 24,1980.0,6480.0
 27 | 25,1981.0,8800.0
 28 | 26,1981.58,4479.0
 29 | 27,1982.0,3520.0
 30 | 28,1982.17,4464.0
 31 | 29,1982.67,1980.0
 32 | 30,1983.0,2396.0
 33 | 31,1983.67,1980.0
 34 | 32,1984.0,1379.0
 35 | 33,1984.58,1331.0
 36 | 34,1985.0,880.0
 37 | 35,1985.33,720.0
 38 | 36,1985.42,550.0
 39 | 37,1985.5,420.0
 40 | 38,1985.58,350.0
 41 | 39,1985.67,300.0
 42 | 40,1985.83,300.0
 43 | 41,1985.92,300.0
 44 | 42,1986.0,300.0
 45 | 43,1986.08,300.0
 46 | 44,1986.17,300.0
 47 | 45,1986.25,300.0
 48 | 46,1986.33,190.0
 49 | 47,1986.42,190.0
 50 | 48,1986.5,190.0
 51 | 49,1986.58,190.0
 52 | 50,1986.67,190.0
 53 | 51,1986.75,190.0
 54 | 52,1986.92,190.0
 55 | 53,1987.0,176.0
 56 | 54,1987.08,176.0
 57 | 55,1987.17,157.0
 58 | 56,1987.25,154.0
 59 | 57,1987.33,154.0
 60 | 58,1987.42,154.0
 61 | 59,1987.5,154.0
 62 | 60,1987.58,154.0
 63 | 61,1987.67,163.0
 64 | 62,1987.75,133.0
 65 | 63,1987.83,163.0
 66 | 64,1987.92,163.0
 67 | 65,1988.0,163.0
 68 | 66,1988.08,182.0
 69 | 67,1988.17,199.0
 70 | 68,1988.33,199.0
 71 | 69,1988.42,199.0
 72 | 70,1988.5,505.0
 73 | 71,1988.58,505.0
 74 | 72,1988.67,505.0
 75 | 73,1988.75,505.0
 76 | 74,1988.83,505.0
 77 | 75,1988.92,505.0
 78 | 76,1989.0,505.0
 79 | 77,1989.08,505.0
 80 | 78,1989.17,505.0
 81 | 79,1989.25,505.0
 82 | 80,1989.42,344.0
 83 | 81,1989.5,197.0
 84 | 82,1989.58,188.0
 85 | 83,1989.67,188.0
 86 | 84,1989.75,128.0
 87 | 85,1989.83,117.0
 88 | 86,1989.92,113.0
 89 | 87,1990.0,106.0
 90 | 88,1990.17,98.3
 91 | 89,1990.33,98.3
 92 | 90,1990.42,89.5
 93 | 91,1990.5,82.8
 94 | 92,1990.58,81.1
 95 | 93,1990.67,71.5
 96 | 94,1990.75,59.0
 97 | 95,1990.83,51.0
 98 | 96,1990.92,45.5
 99 | 97,1991.0,44.5
100 | 98,1991.08,44.5
101 | 99,1991.17,45.0
102 | 100,1991.25,45.0
103 | 101,1991.33,45.0
104 | 102,1991.42,43.8
105 | 103,1991.5,43.8
106 | 104,1991.58,41.3
107 | 105,1991.67,46.3
108 | 106,1991.75,45.0
109 | 107,1991.83,39.8
110 | 108,1991.92,39.8
111 | 109,1992.0,36.3
112 | 110,1992.08,36.3
113 | 111,1992.17,36.3
114 | 112,1992.25,34.8
115 | 113,1992.33,30.0
116 | 114,1992.42,32.5
117 | 115,1992.5,33.5
118 | 116,1992.58,31.0
119 | 117,1992.67,27.5
120 | 118,1992.75,26.3
121 | 119,1992.83,26.3
122 | 120,1992.92,26.3
123 | 121,1993.0,33.1
124 | 122,1993.08,27.5
125 | 123,1993.17,27.5
126 | 124,1993.25,27.5
127 | 125,1993.33,27.5
128 | 126,1993.42,30.0
129 | 127,1993.5,30.0
130 | 128,1993.58,30.0
131 | 129,1993.67,30.0
132 | 130,1993.75,36.0
133 | 131,1993.83,39.8
134 | 132,1993.92,35.8
135 | 133,1994.0,35.8
136 | 134,1994.08,35.8
137 | 135,1994.17,36.0
138 | 136,1994.25,37.3
139 | 137,1994.33,37.3
140 | 138,1994.42,37.3
141 | 139,1994.5,38.5
142 | 140,1994.58,37.0
143 | 141,1994.67,34.0
144 | 142,1994.75,33.5
145 | 143,1994.83,32.3
146 | 144,1994.92,32.3
147 | 145,1995.0,32.3
148 | 146,1995.08,32.0
149 | 147,1995.17,32.0
150 | 148,1995.25,31.2
151 | 149,1995.33,31.2
152 | 150,1995.42,31.1
153 | 151,1995.5,31.2
154 | 152,1995.58,30.6
155 | 153,1995.67,33.1
156 | 154,1995.75,33.1
157 | 155,1995.83,30.9
158 | 156,1995.92,30.9
159 | 157,1996.0,29.9
160 | 158,1996.08,28.8
161 | 159,1996.17,26.1
162 | 160,1996.25,24.7
163 | 161,1996.33,17.2
164 | 162,1996.42,14.9
165 | 163,1996.5,11.3
166 | 164,1996.58,9.06
167 | 165,1996.67,8.44
168 | 166,1996.75,8.0
169 | 167,1996.83,5.25
170 | 168,1996.92,5.25
171 | 169,1997.0,4.63
172 | 170,1997.08,3.63
173 | 171,1997.17,3.0
174 | 172,1997.25,3.0
175 | 173,1997.33,3.0
176 | 174,1997.42,3.69
177 | 175,1997.5,4.0
178 | 176,1997.58,4.13
179 | 177,1997.67,3.63
180 | 178,1997.75,3.41
181 | 179,1997.83,3.25
182 | 180,1997.92,2.16
183 | 181,1998.0,2.16
184 | 182,1998.08,0.91
185 | 183,1998.17,0.97
186 | 184,1998.25,1.22
187 | 185,1998.33,1.19
188 | 186,1998.42,0.97
189 | 187,1998.58,1.03
190 | 188,1998.67,0.97
191 | 189,1998.75,1.16
192 | 190,1998.83,0.84
193 | 191,1998.92,0.84
194 | 192,1999.08,1.44
195 | 193,1999.13,0.84
196 | 194,1999.17,1.25
197 | 195,1999.25,1.25
198 | 196,1999.33,0.86
199 | 197,1999.5,0.78
200 | 198,1999.67,0.87
201 | 199,1999.75,1.04
202 | 200,1999.83,1.34
203 | 201,1999.92,2.35
204 | 202,2000.0,1.56
205 | 203,2000.08,1.48
206 | 204,2000.17,1.08
207 | 205,2000.25,0.84
208 | 206,2000.33,0.7
209 | 207,2000.42,0.9
210 | 208,2000.5,0.77
211 | 209,2000.58,0.84
212 | 210,2000.67,1.07
213 | 211,2000.75,1.12
214 | 212,2000.83,1.12
215 | 213,2000.92,0.9
216 | 214,2001.0,0.75
217 | 215,2001.08,0.464
218 | 216,2001.17,0.464
219 | 217,2001.25,0.383
220 | 218,2001.33,0.387
221 | 219,2001.42,0.305
222 | 220,2001.5,0.352
223 | 221,2001.5,0.27
224 | 222,2001.58,0.191
225 | 223,2001.67,0.191
226 | 224,2001.75,0.169
227 | 225,2001.77,0.148
228 | 226,2002.08,0.134
229 | 227,2002.08,0.207
230 | 228,2002.25,0.193
231 | 229,2002.33,0.193
232 | 230,2002.42,0.33
233 | 231,2002.58,0.193
234 | 232,2002.75,0.193
235 | 233,2003.17,0.176
236 | 234,2003.25,0.076
237 | 235,2003.33,0.126
238 | 236,2003.42,0.115
239 | 237,2003.5,0.133
240 | 238,2003.58,0.129
241 | 239,2003.67,0.143
242 | 240,2003.75,0.148
243 | 241,2003.83,0.16
244 | 242,2003.99,0.166
245 | 243,2004.0,0.174
246 | 244,2004.08,0.148
247 | 245,2004.17,0.146
248 | 246,2004.33,0.156
249 | 247,2004.42,0.203
250 | 248,2004.5,0.176
251 | 249,2005.25,0.185
252 | 250,2005.42,0.149
253 | 251,2005.83,0.116
254 | 252,2005.92,0.185
255 | 253,2006.17,0.112
256 | 254,2006.33,0.073
257 | 255,2006.5,0.082
258 | 256,2006.67,0.073
259 | 257,2006.75,0.088
260 | 258,2006.83,0.098
261 | 259,2006.99,0.092
262 | 260,2007.0,0.082
263 | 261,2007.08,0.078
264 | 262,2007.17,0.066
265 | 263,2007.33,0.0464
266 | 264,2007.5,0.0386
267 | 265,2007.67,0.0351
268 | 266,2007.75,0.0322
269 | 267,2007.83,0.0244
270 | 268,2007.92,0.0244
271 | 269,2008.0,0.0232
272 | 270,2008.08,0.022
273 | 271,2008.33,0.022
274 | 272,2008.5,0.0207
275 | 273,2008.58,0.0176
276 | 274,2008.67,0.0146
277 | 275,2008.83,0.011
278 | 276,2008.92,0.0098
279 | 277,2009.0,0.0098
280 | 278,2009.08,0.0107
281 | 279,2009.25,0.0105
282 | 280,2009.42,0.0115
283 | 281,2009.5,0.011
284 | 282,2009.58,0.0127
285 | 283,2009.75,0.0183
286 | 284,2009.92,0.0205
287 | 285,2010.0,0.019
288 | 286,2010.08,0.0202
289 | 287,2010.17,0.0195
290 | 288,2010.33,0.0242
291 | 289,2010.5,0.021
292 | 290,2010.58,0.022
293 | 291,2010.75,0.0171
294 | 292,2010.83,0.0146
295 | 293,2010.92,0.0122
296 | 294,2011.0,0.01
297 | 295,2011.08,0.0103
298 | 296,2011.33,0.01
299 | 297,2011.42,0.0085
300 | 298,2011.67,0.0054
301 | 299,2011.75,0.0051
302 | 300,2012.0,0.0049
303 | 301,2012.08,0.0049
304 | 302,2012.25,0.005
305 | 303,2012.33,0.0049
306 | 304,2012.58,0.0048
307 | 305,2012.67,0.004
308 | 306,2012.83,0.0037
309 | 307,2013.0,0.0043
310 | 308,2013.08,0.0054
311 | 309,2013.33,0.0067
312 | 310,2013.42,0.0061
313 | 311,2013.58,0.0073
314 | 312,2013.67,0.0065
315 | 313,2013.75,0.0082
316 | 314,2013.83,0.0085
317 | 315,2013.92,0.0079
318 | 316,2014.08,0.0095
319 | 317,2014.17,0.0079
320 | 318,2014.25,0.0073
321 | 319,2014.42,0.0079
322 | 320,2014.58,0.0085
323 | 321,2014.67,0.0085
324 | 322,2014.83,0.0085
325 | 323,2015.0,0.0078
326 | 324,2015.08,0.0073
327 | 325,2015.25,0.0061
328 | 326,2015.33,0.0056
329 | 327,2015.5,0.0049
330 | 328,2015.58,0.0045
331 | 329,2015.67,0.0043
332 | 330,2015.75,0.0042
333 | 331,2015.83,0.0038
334 | 332,2015.92,0.0037
335 | 


--------------------------------------------------------------------------------
/notebooks/solutions/load_adult.py:
--------------------------------------------------------------------------------
 1 | import pandas as pd
 2 | import matplotlib.pyplot as plt
 3 | from sklearn.model_selection import train_test_split
 4 | import numpy as np
 5 | from sklearn.preprocessing import MinMaxScaler
 6 | # We use display so that we can do multiple nice renderings of dataframes
 7 | # in Jupyter
 8 | from IPython.display import display
 9 | 
10 | # Exercise 1
11 | data = pd.read_csv("data/adult.csv", index_col=0)
12 | display(data.head())
13 | 
14 | income = data.income
15 | data_features = data.drop("income", axis=1)
16 | 
17 | display(data_features.head())
18 | 
19 | # Exercise 2
20 | 
21 | data.age.hist()
22 | 
23 | # plot by gender
24 | data['income_bin'] = data.income == " >50K"
25 | plt.figure()
26 | plt.title("By gender")
27 | grouped = data.groupby("gender")
28 | grouped.income_bin.mean().plot.barh()
29 | 
30 | # plot by education
31 | plt.figure()
32 | plt.title("By education")
33 | data.groupby("education").income_bin.mean().sort_values().plot.barh()
34 | 
35 | plt.figure()
36 | plt.title("By race")
37 | data.groupby("race").income_bin.mean().sort_values().plot.barh()
38 | 
39 | 
40 | # Exercise 3
41 | # using pd.get_dummies
42 | data_one_hot = pd.get_dummies(data_features)
43 | X_train, X_test, y_train, y_test = train_test_split(data_one_hot, income)
44 | 
45 | scaler = MinMaxScaler().fit(X_train)
46 | X_train_scaled = scaler.transform(X_train)
47 | X_test_scaled = scaler.transform(X_test)
48 | 
49 | # using OneHotEncoder
50 | cont_features = data_features.dtypes == "int64"
51 | ct = make_column_transformer((OneHotEncoder(), ~cont_features),
52 |                              (StandardScaler(), cont_features))
53 | X_train, X_test, y_train, y_test = train_test_split(data_features, income)
54 | X_train_scaled = ct.fit_transform(X_train)
55 | X_test_scaled = ct.transform(X_test)
56 | 
57 | 
58 | # Exercise 4
59 | from sklearn.linear_model import LogisticRegression
60 | logreg = LogisticRegression(C=0.1)
61 | logreg.fit(X_train_scaled, y_train)
62 | print("Training score:", logreg.score(X_train_scaled, y_train))
63 | 
64 | print("Test score:", logreg.score(X_test_scaled, y_test))
65 | 
66 | print("Faction <= 50k", (y_train.values == " <=50K").mean())
67 | 


--------------------------------------------------------------------------------
/notebooks/solutions/load_iris.py:
--------------------------------------------------------------------------------
 1 | import numpy as np
 2 | import matplotlib.pyplot as plt
 3 | 
 4 | from sklearn.datasets import load_iris
 5 | from sklearn.model_selection import train_test_split
 6 | 
 7 | iris = load_iris()
 8 | X, y = iris.data, iris.target
 9 | 
10 | print("Dataset size: %d  number of features: %d  number of classes: %d"
11 |       % (X.shape[0], X.shape[1], len(np.unique(y))))
12 | 
13 | X_train, X_test, y_train, y_test = train_test_split(X, y)
14 | 
15 | plt.scatter(X_train[:, 0], X_train[:, 1], c=y_train)
16 | plt.xlabel(iris.feature_names[0])
17 | plt.ylabel(iris.feature_names[1])
18 | 
19 | plt.figure()
20 | plt.scatter(X_train[:, 2], X_train[:, 3], c=y_train)
21 | plt.xlabel(iris.feature_names[2])
22 | plt.ylabel(iris.feature_names[3])
23 | 
24 | import sklearn.datasets
25 | import os
26 | import pandas as pd
27 | iris_path = os.path.join(sklearn.datasets.__path__[0], 'data', 'iris.csv')
28 | iris_df =  pd.read_csv(iris_path, header=None)
29 | display(iris_df.head())
30 | 
31 | iris_df = pd.read_csv(iris_path, skiprows=1, header=None)
32 | display(iris_df.head())
33 | 
34 | features = iris_df.iloc[:, :4]
35 | target = iris_df.iloc[:, 4]


--------------------------------------------------------------------------------
/notebooks/solutions/train_iris.py:
--------------------------------------------------------------------------------
 1 | from sklearn.datasets import load_iris
 2 | from sklearn.neighbors import KNeighborsClassifier
 3 | from sklearn.model_selection import train_test_split
 4 | # Exercise 1, loading data
 5 | iris = load_iris()
 6 | X, y = iris.data, iris.target
 7 | 
 8 | X_train, X_test, y_train, y_test = train_test_split(X, y)
 9 | 
10 | # Exercise 2
11 | # Training KNN
12 | knn = KNeighborsClassifier(n_neighbors=3)
13 | knn.fit(X_train, y_train)
14 | 
15 | print("test set score of knn: %f" % knn.score(X_test, y_test))
16 | 
17 | # Training RandomForest
18 | rf = RandomForestClassifier()
19 | rf.fit(X_train, y_train)
20 | rf.score(X_train, y_train)
21 | rf.score(X_test, y_test)
22 | 
23 | # Exercise 3
24 | 
25 | # Perfect classification (accuracy=1) on easy dataset
26 | from sklearn.linear_model import LogisticRegression
27 | X = np.random.uniform(size=(1000, 3))
28 | X[::2] += 1000
29 | y = X[:, 0] > 500
30 | X_train, X_test, y_train, y_test = train_test_split(X, y)
31 | logreg = LogisticRegression()
32 | logreg.fit(X_train, y_train)
33 | print("score on trivial data: ", logreg.score(X_test, y_test))
34 | 
35 | # Random classification (accuracy=.5) on random data
36 | y = np.random.normal(size=1000) > .0
37 | X_train, X_test, y_train, y_test = train_test_split(X, y)
38 | logreg = LogisticRegression()
39 | logreg.fit(X_train, y_train)
40 | print("score on random data: ", logreg.score(X_test, y_test))
41 | 


--------------------------------------------------------------------------------
/slides/01-introduction.html:
--------------------------------------------------------------------------------
  1 | <!DOCTYPE html>
  2 | <html>
  3 |   <head>
  4 |     <title>Introduction</title>
  5 |     <meta charset="utf-8">
  6 |     <link rel="stylesheet" href="style.css">
  7 |     <style>
  8 |       @import url(https://fonts.googleapis.com/css?family=Garamond);
  9 |       @import url(https://fonts.googleapis.com/css?family=Muli:400,700,400italic);
 10 |       @import url(https://fonts.googleapis.com/css?family=Ubuntu+Mono:400,700,400italic);
 11 |     </style>
 12 |   </head>
 13 |   <body>
 14 |     <textarea id="source">
 15 | 
 16 | class: center, middle
 17 | 
 18 | ![:scale 40%](images/sklearn_logo.png)
 19 | 
 20 | 
 21 | ### Introduction to Machine learning with scikit-learn
 22 | 
 23 | # Introduction
 24 | 
 25 | Andreas C. Müller
 26 | 
 27 | Columbia University, scikit-learn
 28 | 
 29 | .smaller[https://github.com/amueller/ml-workshop-1-of-4]
 30 | 
 31 | 
 32 | ---
 33 | 
 34 | # Other Resources
 35 | 
 36 | .center[
 37 | ![:scale 25%](images/PDSH.png)&nbsp;&nbsp;&nbsp;
 38 | ![:scale 25%](images/imlp.png)&nbsp;&nbsp;&nbsp;
 39 | ![:scale 25%](images/esl.png)
 40 | ]
 41 | Lecture: http://www.cs.columbia.edu/~amueller/comsw4995s20/schedule/
 42 | 
 43 | https://www.youtube.com/andreasmueller
 44 | 
 45 | Videos and more slides!
 46 | 
 47 | ???
 48 | 
 49 | FIXME JAKE book border
 50 | There are three books that I recommend looking into for this course. Definitely
 51 | check out my book, Introduction to machine learning with Python. You can find
 52 | the PDF on courseworks. My book should be a relatively easy read and it’s quite
 53 | short. The second one is Applied predictive modelling by Max Kuhn, which goes a
 54 | bit deeper. This is about the level I want to go to in this course. You can get
 55 | it for free at springer link, I posted a link in courseworks. These two are
 56 | really the essential ones. Finally, there's Elements of statistical learning,
 57 | also known as ESL or the stanford book, by Hastie Tibshibani and Friedman is a
 58 | classic for a more theoretical view. You can get it for free on the authors
 59 | website.
 60 | 
 61 | If you want to brush up on your Python skills, I also recommend the Python Data
 62 | Science Handbook by my friend Jake Vanderplas.
 63 | 
 64 | ---
 65 | 
 66 | class: center, middle
 67 | 
 68 | # What and Why of Machine Learning
 69 | 
 70 | ???
 71 | 
 72 | I first want to talk about what is machine learning, and why do we want it. As you’re
 73 | in this course, you’re probably already somewhat convinced that it’s useful,
 74 | but I briefly want to give my own perspective. In general, today will not be
 75 | very meaty and be more a loose collections of ideas and directions. The next
 76 | class we will go down to the metal much more.
 77 | 
 78 | ---
 79 | 
 80 | class: center, middle
 81 | 
 82 | # What is machine learning?
 83 | 
 84 | ???
 85 | 
 86 | Machine learning is about extracting knowledge from data. It is closely related
 87 | to statistics and optimization. What distinguishes machine learning is that it
 88 | is very focused on prediction.
 89 | 
 90 | We want to learn from a large dataset how to make decisions for future
 91 | observations. You could say that the input to a machine learning program is
 92 | the dataset, and the output is a program that can make decisions on future
 93 | observations.
 94 | 
 95 | Machine learning is really widely used now, and I want to give you some
 96 | examples that most of you probably already interacted with today.
 97 | 
 98 | ---
 99 | 
100 | .center[
101 | ![:scale 70%](images/fb1.png)
102 | ]
103 | 
104 | ???
105 | 
106 | Here’s the Facebook news-feed. There’s so much machine learning here, it’s crazy.
107 | Can you point out some of them?
108 | So the most space is a sponsored item. Facebook used ML to know who to show
109 | this too. It’s clearly targetted at developers. Facebook also used ML to know
110 | how much to charge for showing it. Then there’s a post below by a friend.
111 | Facebook ranked that top most interesting to me right now, again ML. Then on
112 | the right, you can see birthdays. It shows only one name, though there’s two
113 | birthdays. Again ML to decide how many and whom to show. Below, trending
114 | topics, apps to connect and people I might now. All ML.
115 | 
116 | ---
117 | 
118 | .center[
119 | ![:scale 70%](images/facebook_gael.png)
120 | ]
121 | 
122 | ???
123 | 
124 | But wait, there’s more. Here’s me uploading a picture. It finds the faces
125 | of many of the people here, even in odd poses. Not only my friend Gael here,
126 | one of the creators of scikit-learn, but also Jared Lander there in the background.
127 | 
128 | I'm pretty sure they could automatically tag the people, and actually describe the photos,
129 | but that seems to be disabled on my account. My account might still be European,
130 | where they don't tag faces because of privacy concerns.
131 | 
132 | ---
133 | 
134 | .center[
135 | ![:scale 70%](images/fb3.png)
136 | ]
137 | 
138 | ???
139 | 
140 | And then after you post an album, facebook will select some most interesting
141 | pictures for you, and give them different kinds of space, to create a mosaic.
142 | 
143 | But that’s just facebook. Let’s see what else I’ve got open.
144 | 
145 | ---
146 | 
147 | .center[
148 | ![:scale 70%](images/amazon1.png)
149 | ]
150 | 
151 | ???
152 | 
153 | And then as a last example, amazon. Because google would have been too easy.
154 | Here I’m searching for machine learning. I get a ranked list, using machine
155 | learning. I get sub-categories to search in, via machine learning. Each book
156 | has some features, like top seller etc, added with machine learning.  And
157 | there’s an ad at the top, selected via machine learning.
158 | 
159 | ---
160 | 
161 | .center[
162 | ![:scale 70%](images/amazon2.png)
163 | ]
164 | 
165 | ???
166 | 
167 | And if I click on a book, more machine learning.
168 | There’s an ad below for textbooks, targetted to me. There’s paperback as the
169 | default choice, again machine learning. There’s frequently bought together, and
170 | maybe less obviously, there’s a default seller! The price that is shown on the
171 | right that’s selected for “add to cart” is also selected from a whole pool of
172 | possible amazon sellers via machine learning.
173 | 
174 | Ok, that’s enough websites I think. While I went through all this, you were
175 | probably on your phone, looking at more output produced by machine learning. My
176 | point is, it’s everywhere. And often non-obvious, as in the case of selecting
177 | a seller here.
178 | 
179 | ---
180 | 
181 | # Science!
182 | 
183 | .center[
184 | ![:scale 70%](images/exoplanet.png)
185 | ]
186 | 
187 | ???
188 | 
189 | That was some of the flashy, every day live applications. Something that might
190 | get you VC funding. There’s also a lot of machine learning applications in less
191 | visible, but equally important - or more important - applications in science.
192 | There is more and more personalized cancer treatment – via machine learning.
193 | More medical diagnosis, and more drug discovery are using machine learning.
194 | The higgs boson couldn’t have been found without machine learning, and the same
195 | is true for many earth like planets in other solar systems.
196 | Which is shown using an artists illustration here. In reality you would
197 | have a single pixel, containing the sun and the planet. You can find exoplanets
198 | by checking whether the star gets periodically slightly darker, in which case you found
199 | a planet. Of course with machine learning!
200 | Machine learning is an essential in many data driven sciences now!  So no matter
201 | where you want to go with data, you need machine learning.  But what does that
202 | mean?  Next, I want to give you a little taxonomy of machine learning methods.
203 | 
204 | ---
205 | class: center, middle
206 | 
207 | # Types of Machine Learning
208 | 
209 | ???
210 | 
211 | There are three main branches of machine learning.
212 | Who can name them?
213 | ---
214 | class: spacious
215 | 
216 | # Types of Machine Learning
217 | - Supervised
218 | - Unsupervised
219 | - Reinforcement
220 | 
221 | ???
222 | 
223 | They called supervised learning, unsupervised learning and reinforcement learning.
224 | What are they?
225 | This course will heavily focus on supervised learning, but you should be aware
226 | the other types and their characteristics. We will do some unsupervised learning,
227 | but no reinforcement learning. Supervised learning is the most commonly used type in
228 | industry and research right now, though reinforcement learning becomes
229 | increasingly important.
230 | 
231 | ---
232 | 
233 | class: center
234 | 
235 | # Supervised Learning
236 | 
237 | .larger[
238 | $$ (x_i, y_i) \propto p(x, y) \text{ i.i.d.}$$
239 | $$ x_i \in \mathbb{R}^p$$
240 | $$ y_i \in \mathbb{R}$$
241 | $$f(x_i) \approx y_i$$
242 | ]
243 | 
244 | ???
245 | 
246 | In supervised learning, the dataset we learn form is input-output pairs (x_i,
247 | y_i), where x_i is some n_dimensional input, or feature vector, and y_i is the
248 | desired output we want to learn.  Generally, we assume these samples are drawn
249 | from some unknown joint distribution p(x, y). In statistics, x_i might be
250 | called independent variables and y_i dependent variable.
251 | What does iid mean?
252 | We say they are drawn iid, which stands for independent identically
253 | distributed. In other words, the x_i, y_i pairs are independent and all come
254 | from the same distribution p. You can think of this as there being some
255 | process that goes from x_i to y_i, but that we don’t know. We write this as a
256 | probability distribution and not as a function since even if there is a real
257 | process creating y from x, this process might not be deterministic.  The goal
258 | is to learn a function f so that for new inputs x for which we don’t observe y,
259 | f(x) is close to y.  This approach is very similar to function approximation.
260 | The name supervised comes from the fact that during learning, a supervisor
261 | gives you the correct answers y_i.
262 | 
263 | ---
264 | 
265 | # Generalization
266 | .padding-top[
267 | .left-column[
268 | Not only
269 | 
270 | 
271 | also for new data:
272 | ]
273 | 
274 | .right-column[$f(x_i) \approx y_i$,
275 | 
276 | 
277 | $f(x) \approx y$
278 | ]]
279 | ???
280 | For both regression and classification, it’s important to keep in mind the
281 | concept of generalization.
282 | Let’s say we have a regression task. We have features, that is data vectors x_i
283 | and targets y_i drawn from a joint distribution. We now want to learn a
284 | function f, such that f(x) is approximately y, not on this training data, but
285 | on new data drawn from this distribution. This is what’s called generalization,
286 | and this is a core distinction to function approximation. In principle we don’t
287 | care about how well we do on x_i, we only care how well we do on new samples
288 | from the distribution. We’ll go into much more detail about generalization in
289 | about a week, when we dive into supervised learning.
290 | 
291 | ---
292 | class: spacious
293 | # Examples of Supervised Learning
294 | - spam detection
295 | - medical diagnosis
296 | - ad click prediction
297 | 
298 | ???
299 | 
300 | Here are some examples of supervised learning.
301 | Given an array of test results from a patient, does this patient have diabetes?
302 | The x_i would be the different test results, and y_i would be diabetes or no
303 | diabetes.  Given a piece of a satellite image, what is the terrain in this
304 | image? Here x_i would be the pixels of the image, and y_i would be the terrain
305 | types.
306 | 
307 | This is often used to automate manual labor. For example, you might annotate
308 | part of a dataset manually, then learn a machine learning model from this
309 | annotations, and use the model to annotate the rest of your data.
310 | 
311 | ---
312 | 
313 | # Unsupervised Learning
314 | .padding-top[
315 | $$ x_i \propto p(x) \text{ i.i.d.}$$
316 | Learn about $p$.
317 | 
318 | ]
319 | ???
320 | 
321 | In unsupervised machine learning, we are just given data points x_i, that are
322 | assumed to be drawn from an unknown distribution. Usually we want to learn
323 | something about these, such as whether they lie on a low-dimensional subspace,
324 | or whether the data clusters in several groups, or find ways to represent the
325 | distribution compactly.  The goal in unsupervised learning is often much less
326 | clear than in supervised learning, and there is no-one providing a “correct”
327 | answers and no supervisor.
328 | 
329 | Common examples of unsupervised learning is discovering topics in news articles
330 | or on twitter, or grouping data into clusters for easier analysis.  Another one
331 | is outlier detection, where you ask “does this data look normal” which is
332 | important for fraud detection and security systems.
333 | 
334 | ---
335 | 
336 | class: center, spacious
337 | 
338 | # Reinforcement Learning
339 | 
340 | .left-column[
341 | ![:scale 100%](images/alpha_go.png)
342 | ]
343 | 
344 | .right-column[
345 | ![:scale 100%](images/nao.png)
346 | ]
347 | 
348 | ???
349 | 
350 | The third kind, reinforcement learning, has been in the news quite a bit in the
351 | last year. Has anyone heard of that? Alpha go beat the world champion in go.
352 | Reinforcement learning is about an agent learning to interact with an
353 | environment, with some ultimate goal. The environment could be a go board, and
354 | the goal to win the game. For self-driving cars, the the environment could be
355 | roads, sensed by cameras and laser sensors, and the goal would be to get you
356 | somewhere quickly and safely. Or, the environment could be a social media
357 | platform, and the goal could be to provide you such great content that you
358 | never remove your eyes from your phone again!
359 | 
360 | ---
361 | class: spacious
362 | 
363 | # Explore & Learn
364 | .center[
365 | ![:scale 80%](images/reinforcement_cycle.png)
366 | ]
367 | 
368 | ???
369 | 
370 | Reinforcement learning is quite different in that you usually don’t work with a
371 | dataset – you work with a whole world. This can be a video game, a simulation,
372 | or the real world. The actions of the agent influence which part of the world
373 | they see and which situations they encounter, and it’s usually impossible to
374 | look at all possible situations, even for something as limited as a go board.
375 | This means in reinforcement learning you can not separate data collection and
376 | learning, which you can do for unsupervised and supervised learning.  There
377 | will be no reinforcement learning in this class, as the techniques are quite different,
378 | and real-world use of this technique is still quite limited.
379 | 
380 | ---
381 | 
382 | # Other kinds of learning
383 | 
384 | - Semi-supervised
385 | - Active Learning
386 | - Forecasting
387 | - ...
388 | 
389 | ???
390 | 
391 | There are other kinds of learning that are somewhere between the three kinds I
392 | just explained. Semi-supervised learning for example is a combination of
393 | supervised and unsupervised learning. Active learning is somewhere between
394 | reinforcement learning and supervised learning. There are also many kinds of
395 | supervised learning where the assumption that data points are independent is
396 | dropped, for example for time series analysis and forecasting.  However, if you
397 | get the three main concepts, the rest will be easy to understand.  Some people,
398 | including the local and famous Yann LeCun think that supervised learning is
399 | fundamentally limited. In particular it doesn’t seem to be how humans learn.
400 | So now you can buy these shirts on redbubble
401 | 
402 | ---
403 | 
404 | # Classification and Regression
405 | 
406 | .left-column[
407 | ### Classification
408 | - target y discrete
409 | 
410 | - Is this patient sick?
411 | ]
412 | 
413 | .right-column[
414 | ### Regression
415 | - target y continuous
416 | - How long will it take for the patient to recover?
417 | ]
418 | ???
419 | So getting back to supervised learning, there are two basic kinds, called
420 | classification and regression.
421 | The difference is quite simple: if y is continuous, then it’s regression, and
422 | if y is discrete, it’s classification.
423 | 
424 | While it's simple, let me give an example.
425 | If I want to predict whether a one of you will pass the class, it’s a
426 | classification problem. There are two possible answers, “yes” and “no”.  If I
427 | want to predict how many points you get on an exam, it’s a regression
428 | problem, there is a continuous, gradual output space.
429 | 
430 | There are generalizations of this where we try to predict more than one
431 | variable, but we won’t go into that in this course. The main reason the distinction
432 | between classification and regression is important is because the way we
433 | measure how good a prediction is is very different for the two.
434 | It's not always entirely clear whether it's best to formulate a problem as classification
435 | or regression. If you think of predicting a 5-star rating, there's only 5 different possible
436 | outcomes, so you might think it's classification. But there is also an obvious ordering
437 | between the outcomes, which would make it a regression problem.
438 | Both formulations could work, and there are approaches that combine the two for
439 | this particular problem.
440 | 
441 | 
442 | ---
443 | 
444 | # Relationship to Statistics
445 | 
446 | .left-column[
447 | ### Statistics
448 | - model first
449 | - inference emphasis
450 | ]
451 | .right-column[
452 | ### Machine learning
453 | - data first
454 | - prediction emphasis
455 | ]
456 | 
457 | ???
458 | 
459 | Before I’ll go into some general principles, I want to position machine
460 | learning in relation to statistics. I recently got chewed out by a colleague
461 | for doing that. My goal here is not to say one is better than the other.
462 | Actually, there’s really no clear boundary between statistics and machine learning, and
463 | anyone that tells you otherwise is lying. Two of the books I recommended for the
464 | course are actually statistics text books. But I can tell you how the tools
465 | that I’m talking about in this course will differ from what you’d learn in a
466 | typical stats course.
467 | 
468 | Statistics is usually about inference, often phrased in terms of hypothesis
469 | testing. An example might be a yes-no-question, such as “are women less likely
470 | to enroll in a Data Science Program”, and you have a sample population, for
471 | example this classroom, and you can then try to make an inference about whether
472 | this statement is true. Often this includes making assumptions on how your
473 | sample relates to the general population, say this class vs all of DSI or
474 | Columbia vs all of the US. You might also have a specific model of how the process
475 | behind your question works.
476 | 
477 | ---
478 | 
479 | # Relationship to Statistics
480 | 
481 | .left-column[
482 | ### Statistics
483 | - model first
484 | - inference emphasis
485 | ]
486 | .right-column[
487 | ### Machine learning
488 | - data first
489 | - prediction emphasis
490 | ]
491 | 
492 | ???
493 | 
494 | On the other hand, machine learning is about prediction and generalization. We
495 | want to learn from past data to predict outcomes on future, unseen data.
496 | 
497 | We usually want to make statements about individual data points, and we want to
498 | build a model that will work on new data that fulfills our assumptions,
499 | independent of the population we samples. Often we don’t have or need a model
500 | of the process, but we rely on the assumption that our training data is
501 | generated from the same process as any future data will be.
502 | 
503 | There are statisticians that do predictions and there are machine learning
504 | scientists that do inference, but I find this distinction helpful.
505 | 
506 | Again I’m not saying one or the other is better, I’m just saying that you should
507 | know what kind of problem you are trying to solve, and what the right tool for
508 | the problem is. And then you can call it machine learning or statistics or
509 | probabilistic inference or data science. The tools you learn in this
510 | class will usually not help you to make yes/no inferences, and they
511 | will only give you a limited insight into the data generating process.
512 | 
513 | ---
514 | 
515 | class: middle
516 | 
517 | # Guiding Principles in Machine Learning
518 | 
519 | ???
520 | 
521 | For the rest of todays lecture, I want to talk about some guiding
522 | principles in machine learning applications, and they will hopefully be
523 | something you’ll come back to later. We will revisit some of them at the very
524 | end of the semester in more detail.
525 | 
526 | ---
527 | 
528 | class: middle
529 | 
530 | # Goal considerations
531 | 
532 | ???
533 | 
534 | One of the most important parts of machine learning is defining the goal, and
535 | defining a way to measure that goal. In this way, Kaggle is a really bad way to
536 | prepare you for machine learning in the real world, because they already did
537 | that work for you.  In the real world, people don’t tell you whether you should
538 | use unsupervised learning, supervised learning, classification or regression,
539 | and what’s the right way to cast something as a machine learning task – or
540 | whether to cast it as machine learning at all.
541 | 
542 | ---
543 | 
544 | class: middle
545 | # The Cost of Complex Systems
546 | 
547 | ## Data driven first? yes! (or maybe)
548 | ## Machine Learning first: No!
549 | 
550 | ???
551 | 
552 | My first advice would be, don’t try machine learning. Machine learning systems
553 | are very complex and often fragile. Whether you’re in research or a startup,
554 | don’t immediately start with "oh we can apply deep learning to this". Often
555 | it’s good to collect data, and be able to use data to drive and evaluate
556 | decisions. But including a complex process like machine learning into whatever
557 | you’re trying to do will make it much harder to debug and much harder to
558 | understand.
559 | 
560 | ---
561 | 
562 | class: middle
563 | 
564 | # Thinking in Context!
565 | # What is the baseline?
566 | # What is the benefit?
567 | 
568 | ???
569 | 
570 | So think in context of your problem. What do you want to achieve? What is the
571 | easiest way to achieve this? And what will improving over this baseline buy
572 | you?
573 | 
574 | Let’s go through an example:
575 | Imagine you had an electronics store, but you realized retail is all digital
576 | and you’re closing your store with one big sale to move on. You want to do an
577 | email campaign to tell all your customers about the sale. Is there a machine
578 | learning application here? You could try A/B testing the email message. But
579 | that will only work if you have many many customers. You could use ML to
580 | predict which customers might respond. How do you get the data? Let’s say you
581 | have data. What’s the benefit? None! This is your closing sale. You’ll never
582 | email them again. It doesn’t matter if they’ll include you in your spam filter.
583 | Just email everybody!
584 | 
585 | ---
586 | 
587 | 
588 | class: middle
589 | 
590 | # Communicating Results
591 | 
592 | ???
593 | 
594 | Another very important aspect of machine learning is to be able to communicate
595 | results, and to be able to explain methodologies. If you’re want to use machine learning
596 | in research that's not computer science, you need to be able to
597 | explain why it works. If you automatically make some decisions in your business
598 | applications, you need to convince your boss that nothing crazy will happen.
599 | How do you do that? It’s an important skill to explain what particular results
600 | mean, what their impact will be, and how often a method will work or fail.
601 | Only if you can convince your manager your approach is sensible will your
602 | system go to production.
603 | 
604 | ---
605 | 
606 | class: middle, center
607 | 
608 | # Explainable Results
609 | 
610 | ![:scale 60%](images/amazon_explanations.png)
611 | 
612 | ???
613 | 
614 | Sometimes, you even need to explain to the user why you made a decision. People
615 | found that recommendation engines work much better if the users are shown a
616 | reason for a particular recommendation. Netflix tells you based on what show it
617 | recommends another show, Amazon does the same for products. Interestingly,
618 | this also works even if the explanation is not true. There are systems that
619 | make a recommendation, and then provide a believable “explanation” after the
620 | fact. That gets more engagement than just providing the recommendation, though
621 | some people might question the ethics of that. Earlier I said in ML we are
622 | just interested in good predictions. If you need to be able to explain your
623 | predictions, this is no longer true.
624 | 
625 | ---
626 | class: middle, center
627 | 
628 | # Sidebar: Ethical Considerations
629 | 
630 | ![:scale 30%](images/propublica_compas.png)<br/>
631 | .compact[https://www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing]
632 | 
633 | ???
634 | 
635 | There is another area where explainability and transparency matter, and that is
636 | when people’s lives are at stake.
637 | One aspect of machine learning that only recently is getting some more attention is
638 | ethics. There was a recent article in propublica about racial bias in
639 | risk-assessment used in the criminal justice system. Spoiler alert: it’s bad. I
640 | recommend reading the article, it’s quite interesting. This is a black-box
641 | machine learning system created by some company. If they had to provide
642 | explanations, or a more transparent system, the situation would likely be
643 | better. But this is not the only place where ethics plays a role in machine
644 | learning.
645 |     </textarea>
646 |     <script src="https://remarkjs.com/downloads/remark-latest.min.js"></script>
647 |     <script type="text/javascript" src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.1/MathJax.js?config=TeX-AMS-MML_HTMLorMML"></script>
648 | 
649 |     <script>
650 |     // Config Remark
651 |     remark.macros['scale'] = function (percentage) {
652 |         var url = this;
653 |         return '<img src="' + url + '" style="width: ' + percentage + '" />';
654 |     };
655 |     config_remark = {
656 |         highlightStyle: 'github',
657 |         highlightSpans: true,
658 |         highlightLines: true,
659 |         ratio: "16:9"
660 |     };
661 |       var slideshow = remark.create(config_remark);
662 | 
663 |     // Configure MathJax
664 |     MathJax.Hub.Config({
665 |     tex2jax: {
666 |         inlineMath: [['$','$'], ['\\(','\\)']],
667 |         skipTags: ['script', 'noscript', 'style', 'textarea', 'pre'] /* removed 'code' entry*/
668 |     }
669 |     });
670 |     MathJax.Hub.Queue(function() {
671 |         var all = MathJax.Hub.getAllJax(), i;
672 |         for(i = 0; i < all.length; i += 1) {
673 |             all[i].SourceElement().parentNode.className += ' has-jax';
674 |         }
675 |     });
676 |     </script>
677 |   </body>
678 | </html>
679 | 


--------------------------------------------------------------------------------
/slides/02-supervised-learning.html:
--------------------------------------------------------------------------------
  1 | <!DOCTYPE html>
  2 | <html>
  3 |   <head>
  4 |     <title>Introduction to Supervised Learning</title>
  5 |     <meta charset="utf-8">
  6 |     <link rel="stylesheet" href="style.css">
  7 |     <style>
  8 |       @import url(https://fonts.googleapis.com/css?family=Garamond);
  9 |       @import url(https://fonts.googleapis.com/css?family=Muli:400,700,400italic);
 10 |       @import url(https://fonts.googleapis.com/css?family=Ubuntu+Mono:400,700,400italic);
 11 |     </style>
 12 |   </head>
 13 |   <body>
 14 |     <textarea id="source">
 15 | 
 16 | class: center, middle
 17 | 
 18 | ![:scale 40%](images/sklearn_logo.png)
 19 | 
 20 | 
 21 | ### Introduction to Machine learning with scikit-learn
 22 | 
 23 | # Supervised Learning
 24 | 
 25 | Andreas C. Müller
 26 | 
 27 | Columbia University, scikit-learn
 28 | 
 29 | .smaller[https://github.com/amueller/ml-workshop-1-of-4]
 30 | 
 31 | 
 32 | ---
 33 | 
 34 | class: center
 35 | 
 36 | # scikit-learn documentation
 37 | ![:scale 60%](images/sklearn-docs.png)
 38 | 
 39 | <a href="http://scikit-learn.org/" style="color:black; font-size:50px; text-decoration:None" >scikit-learn.org</a>
 40 | 
 41 | ---
 42 | class: center
 43 | 
 44 | # Representing Data
 45 | 
 46 | ![:scale 100%](images/matrix-representation.png)
 47 | ---
 48 | class: center
 49 | 
 50 | # Training and Test Data
 51 | 
 52 | ![:scale 80%](images/train-test-split.png)
 53 | ---
 54 | class: center, middle
 55 | 
 56 | # Notebook: Data Loading
 57 | 
 58 | ---
 59 | class: center
 60 | 
 61 | # Supervised ML Workflow
 62 | 
 63 | ![:scale 80%](images/supervised-ml-workflow.png)
 64 | ---
 65 | class: center
 66 | 
 67 | # Supervised ML Workflow
 68 | 
 69 | ![:scale 100%](images/supervised-ml-api.png)
 70 | 
 71 | ---
 72 | class: center
 73 | 
 74 | # Nearest Neighbors
 75 | 
 76 | ![:scale 40%](images/knn_boundary_test_points.png)
 77 | $$f(x) = y_i, i = \text{argmin}_j || x_j - x||$$
 78 | 
 79 | ???
 80 | Let’s say we have this two-class classification dataset here, with
 81 | two features, one on the x axis and one on the y axis.
 82 | And we have three new points as marked by the stars here.
 83 | If I make a prediction using a one nearest neighbor classifier, what
 84 | will it predict?
 85 | It will predict the label of the closest data point in the training set.
 86 | That is basically the simplest machine learning algorithm I can come
 87 | up with.
 88 | 
 89 | Here’s the formula:
 90 | the prediction for a new x is the y_i so that x_i is the closest point
 91 | in the training set.
 92 | Ok, so now how do we find out whether this model is any good?
 93 | ---
 94 | class: center
 95 | 
 96 | # Nearest Neighbors
 97 | 
 98 | ![:scale 40%](images/knn_boundary_k1.png)
 99 | 
100 | $$f(x) = y_i, i = \text{argmin}_j || x_j - x||$$
101 | 
102 | ???
103 | Let’s say we have this two-class classification dataset here, with
104 | two features, one on the x axis and one on the y axis.
105 | And we have three new points as marked by the stars here.
106 | If I make a prediction using a one nearest neighbor classifier, what
107 | will it predict?
108 | It will predict the label of the closest data point in the training set.
109 | That is basically the simplest machine learning algorithm you can think of.
110 | 
111 | Here’s the formula:
112 | the prediction for a new x is the y_i so that x_i is the closest point
113 | in the training set.
114 | Ok, so now how do we find out whether this model and these predictions are any good?
115 | ---
116 | class: center
117 | 
118 | ![:scale 60%](images/train_test_set_2d_classification.png)
119 | 
120 | ???
121 | The simplest way is to split the data into a training and a test set.
122 | So we take some part of the data set,  let’s say 75% and the
123 | corresponding output, and train the model, and then apply the model on
124 | the remaining 25% to compute the accuracy. This test-set accuracy
125 | will provide an unbiased estimate of future performance.
126 | So if our i.i.d. assumption is correct, and we get a 90% success
127 | rate on the test data, we expect about a 90% success rate on any future
128 | data, for which we don't have labels.
129 | 
130 | Let's dive into how to build and evaluate this model with scikit-learn.
131 | ---
132 | # KNN with scikit-learn
133 | 
134 | ```python
135 | from sklearn.model_selection import train_test_split
136 | X_train, X_test, y_train, y_test = train_test_split(X, y)
137 | 
138 | from sklearn.neighbors import KNeighborsClassifier
139 | knn = KNeighborsClassifier(n_neighbors=1)
140 | knn.fit(X_train, y_train)
141 | print("accuracy: ", knn.score(X_test, y_test)))
142 | y_pred = knn.predict(X_test)
143 | ```
144 | accuracy: 0.77
145 | 
146 | ???
147 | We import train_test_split form model selection, which does a
148 | random split into 75%/25%.
149 | We provide it with the data X, which are our two features, and the
150 | labels y.
151 | 
152 | As you might already know, all the models in scikit-learn are implemented
153 | in python classes, with a single object used to build and store the model.
154 | 
155 | We start by importing our model, the KneighborsClassifier, and instantiate
156 | it with n_neighbors=1. Instantiating the object is when we set any hyper parameter,
157 | such as here saying that we only want to look at the neirest neighbor.
158 | 
159 | Then we can call the fit method to build the model, here knn.fit(X_train,
160 | y_train)
161 | All models in scikit-learn have a fit-method, and all the supervised ones take
162 | the data X and the outcomes y. The fit method builds the model and stores any
163 | estimated quantities in the model object, here knn.  In the case of nearest
164 | neighbors, the `fit` methods simply remembers the whole training set.
165 | 
166 | Then, we can use knn.score to make predictions on the test data, and
167 | compare them against the true labels y_test.
168 | 
169 | For classification models, the score method will always compute
170 | accuracy.
171 | 
172 | Just for illustration purposes, I also call the predict method here.
173 | The predict method is what's used to make predictions on any dataset.
174 | If we use the score method, it calls predict internally and then
175 | compares the outcome to the ground truth that's provided.
176 | 
177 | Who here has not seen this before?
178 | ---
179 | class: center
180 | 
181 | # Influence of Number of Neighbors
182 | 
183 | ![:scale 50%](images/knn_boundary_k1.png)
184 | 
185 | ???
186 | So this was the predictions as made by one-nearest neighbor.
187 | But we can also consider more neighbors, for example three. Here is the
188 | three nearest neighbors for each of the points and the corresponding
189 | labels.
190 | We can then make a prediction by considering the majority among these
191 | three neighbors.
192 | And as you can see, in this case all the points changed their labels! (I
193 | was actually quite surprised when I saw that, I just picked some points
194 | at random).
195 | Clearly the number of neighbors that we consider matters a lot. But what
196 | is the right number?
197 | The is a problem you’ll encounter a lot in machine learning, the
198 | problem of tuning parameters of the model, also called hyper-parameters,
199 | which can not be learned directly from the data.
200 | ---
201 | class: center
202 | 
203 | # Influence of Number of Neighbors
204 | 
205 | ![:scale 50%](images/knn_boundary_k3.png)
206 | 
207 | ???
208 | So this was the predictions as made by one-nearest neighbor.
209 | But we can also consider more neighbors, for example three. Here is the
210 | three nearest neighbors for each of the points and the corresponding
211 | labels.
212 | We can then make a prediction by considering the majority among these
213 | three neighbors.
214 | And as you can see, in this case all the points changed their labels! (I
215 | was actually quite surprised when I saw that, I just picked some points
216 | at random).
217 | Clearly the number of neighbors that we consider matters a lot. But what
218 | is the right number?
219 | The is a problem you’ll encounter a lot in machine learning, the
220 | problem of tuning parameters of the model, also called hyper-parameters,
221 | which can not be learned directly from the data.
222 | ---
223 | class: center, some-space
224 | 
225 | # Influence of n_neighbors
226 | 
227 | ![:scale 45%](images/knn_boundary_varying_k.png)
228 | 
229 | ???
230 | Here’s an overview of how the classification changes if we consider
231 | different numbers of neighbors.
232 | You can see as red and blue circles the training data. And the background
233 | is colored according to which class a datapoint would be assigned to
234 | for each location.
235 | For one neighbor, you can see that each point in the training set has
236 | a little area around it that would be classified according to it’s
237 | label. This means all the training points would be classified correctly,
238 | but it leads to a very complex shape of the decision boundary.
239 | If we increase the number of neighbors, the boundary between red and
240 | blue simplifies, and with 40 neighbors we mostly end up with a line.
241 | This also means that now many of the training data points would be
242 | labeled incorrectly.
243 | ---
244 | class: center, spacious
245 | 
246 | # Model complexity
247 | 
248 | ![:scale 75%](images/knn_model_complexity.png)
249 | 
250 | ???
251 | We can look at this in more detail by comparing training and test set
252 | scores for the different numbers of neighbors.
253 | Here, I did a random 75%/25% split again. This is a very noisy plot as
254 | the dataset is very small and I only did a random split, but you can
255 | see a trend here.
256 | You can see that for a single neighbor, the training score is 1 so perfect
257 | accuracy, but the test score is only 70%.  If we increase the number of
258 | neighbors we consider, the training score goes down, but the test score
259 | goes up, with an optimum at 19 and 21, but then both go down again.
260 | 
261 | This is a very typical behavior, that I sketched in a schematic for you.
262 | ---
263 | class: center, middle
264 | 
265 | # Notebook: Supervised Learning
266 | 
267 | 
268 |     </textarea>
269 |     <script src="https://remarkjs.com/downloads/remark-latest.min.js"></script>
270 |     <script type="text/javascript" src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.1/MathJax.js?config=TeX-AMS-MML_HTMLorMML"></script>
271 | 
272 |     <script>
273 |     // Config Remark
274 |     remark.macros['scale'] = function (percentage) {
275 |         var url = this;
276 |         return '<img src="' + url + '" style="width: ' + percentage + '" />';
277 |     };
278 |     config_remark = {
279 |         highlightStyle: 'github',
280 |         highlightSpans: true,
281 |         highlightLines: true,
282 |         ratio: "16:9"
283 |     };
284 |       var slideshow = remark.create(config_remark);
285 | 
286 |     // Configure MathJax
287 |     MathJax.Hub.Config({
288 |     tex2jax: {
289 |         inlineMath: [['$','$'], ['\\(','\\)']],
290 |         skipTags: ['script', 'noscript', 'style', 'textarea', 'pre'] /* removed 'code' entry*/
291 |     }
292 |     });
293 |     MathJax.Hub.Queue(function() {
294 |         var all = MathJax.Hub.getAllJax(), i;
295 |         for(i = 0; i < all.length; i += 1) {
296 |             all[i].SourceElement().parentNode.className += ' has-jax';
297 |         }
298 |     });
299 |     </script>
300 |   </body>
301 | </html>
302 | 


--------------------------------------------------------------------------------
/slides/03-preprocessing.html:
--------------------------------------------------------------------------------
  1 | <!DOCTYPE html>
  2 | <html>
  3 |   <head>
  4 |     <title>Preprocessing</title>
  5 |     <meta charset="utf-8">
  6 |     <link rel="stylesheet" href="style.css">
  7 |     <style>
  8 |       @import url(https://fonts.googleapis.com/css?family=Garamond);
  9 |       @import url(https://fonts.googleapis.com/css?family=Muli:400,700,400italic);
 10 |       @import url(https://fonts.googleapis.com/css?family=Ubuntu+Mono:400,700,400italic);
 11 |     </style>
 12 |   </head>
 13 |   <body>
 14 |     <textarea id="source">
 15 | 
 16 | class: center, middle
 17 | 
 18 | ![:scale 40%](images/sklearn_logo.png)
 19 | 
 20 | ### Introduction to Machine learning with scikit-learn
 21 | 
 22 | # Preprocessing
 23 | 
 24 | Andreas C. Müller
 25 | 
 26 | Columbia University, scikit-learn
 27 | 
 28 | .smaller[https://github.com/amueller/ml-workshop-1-of-4]
 29 | 
 30 | 
 31 | 
 32 | ???
 33 | Today we’ll talk about preprocessing and featureengineering. What we’re talking about today mostly
 34 | applies to linear models, and not to tree-based
 35 | models, but it also applies to neural nets and kernel
 36 | SVMs.
 37 | 
 38 | ---
 39 | 
 40 | class: middle
 41 | 
 42 | # King County House Prices
 43 | 
 44 | ![:scale 90%](images/house_price_scatter.png)
 45 | 
 46 | 
 47 | 
 48 | ???
 49 | ---
 50 | 
 51 | 
 52 | class: center, middle
 53 | 
 54 | #Scaling
 55 | 
 56 | 
 57 | ???
 58 | N/A
 59 | 
 60 | ---
 61 | 
 62 | class: center, middle
 63 | 
 64 | .center[
 65 | ![:scale 70%](images/house_price_boxplot.png)
 66 | ]
 67 | 
 68 | 
 69 | ???
 70 | Let’s start with the different scales.
 71 | 
 72 | Many model want data that is on the same scale.
 73 | KNearestNeighbors: If the distance in TAX is
 74 | between 300 and 400 then the distance difference in
 75 | CHArS doesn’t matter!
 76 | 
 77 | Linear models: the different scales mean different
 78 | penalty. L2 is the same for all!
 79 | 
 80 | We can also see non-gaussian distributions here btw!
 81 | ---
 82 | # Scaling and Distances
 83 | 
 84 | ![:scale 90%](images/knn_scaling.png)
 85 | 
 86 | ???
 87 | Here is an example of the importance of scaling using a
 88 | distance-based algorithm, K nearest neighbors. My favorite
 89 | toy dataset with two classes in two dimensions. The scatter
 90 | plots look identical, but on the left hand side, the two
 91 | axes have very different scales. The x axis has much larger
 92 | values than the y axis. On the right hand side, I used
 93 | standard scaler and so both features have zero mean and unit
 94 | variance. So what do you think will happen if I use k
 95 | nearest neighbors here? Let's see
 96 | ---
 97 | # Scaling and Distances
 98 | 
 99 | ![:scale 90%](images/knn_scaling2.png)
100 | 
101 | ???
102 | As you can see, the difference is quite dramatic. Because
103 | the X axis has such a larger magnitude on the left-hand
104 | side, only distances along the x axis matter. However, the
105 | important feature for this task is the y axis. So the
106 | important feature gets entirely ignored because of the
107 | different scales. And usually the scales don't have any
108 | meaning - it could be a matter of changing meters to
109 | kilometers.
110 | Linear models: the different scales mean different penalty.
111 | L2 is the same for all!
112 | 
113 | ---
114 | class: center
115 | 
116 | # Ways to Scale Data
117 | <br />
118 | ![:scale 80%](images/scaler_comparison_scatter.png)
119 | ???
120 | Here's an illustration of four of the most common ways. One
121 | of the most common ones that we already saw before is the
122 | Standard Scaler. StandardScaler subtracts the mean and
123 | divides by standard deviation. Making all the features have
124 | a zero mean and a standard deviation of one. One thing about
125 | this is that it won't guarantee any minimum or maximum
126 | values. The range can be arbitrarily large. MinMaxScaler
127 | subtracts minimum and divides by range. Scales between 0 and 1.
128 | So all the features will have an exact minimum at zero
129 | and exact maximum at one. This mostly makes sense if there
130 | are actually minimum and maximum values in your data set.
131 | If it's actually Gaussian distribution, the scaling might
132 | not make a lot of sense, because if you add one more data
133 | point, it's very far away and it will cram all the other
134 | data points more together. This will make sense if you have
135 | a grayscale value between 0 and 255 or something else that
136 | has like clearly defined boundaries. Another alternative is
137 | the RobustScaler. It’s the robust version of the
138 | StandardScaler. It computes the median and the quartiles.
139 | This cannot be skewed by outliers. The StandardScaler uses
140 | mean and standard deviation. So if you have a point that’s
141 | very far away, it can have unlimited influence on the mean.
142 | The RobustScaler uses robust statistics, so it’s not skewed
143 | be outliers. The final one is the Normalizer. This projects
144 | things either on the L1 or L0, meaning it makes sure to
145 | vectors have length one either in L1 norm or L2 norm. If you
146 | do this for L2 norm, it means you don't care about the
147 | length you project on a circle. More commonly used in L1
148 | norm, it projects onto the diamond. What that does is
149 | basically it means you make sure the sum of all the entries
150 | is one. That's often used if you have histograms or if you
151 | have counts of things. If you want to make sure that you
152 | have frequency features instead of count features, you can
153 | use the L1 normalizer.
154 | 
155 | ---
156 | class: spacious
157 | 
158 | # Sparse Data
159 | 
160 | - Data with many zeros – only store non-zero entries.
161 | - Subtracting anything will make the data “dense” (no more zeros) and blow the RAM.
162 | - Only scale, don’t center (use MaxAbsScaler)
163 | 
164 | ???
165 | 
166 | FIXME ugly slide
167 | There’s another one that's important for sparse data. It’s
168 | called the MaxAbsScaler in scikit-learn. So sparse data is
169 | the kind of data that happens actually quite frequently in
170 | practice where most of the features are zero most of the
171 | time. So if you have tens of thousands of features, but
172 | they're always nearly zero. For example, you can think about
173 | these being particular user actions and at any time, any
174 | user makes only very few actions. But there are many, many
175 | possibilities. So if you have data like that, you only want
176 | to store the non-zero values, you don't want to store all
177 | the zeros. If you would store all the zeros, the data
178 | wouldn't even fit into the RAM often. So you have this very
179 | large sparse data set and you want to scale it, but if you
180 | subtract anything from it, if you want to make it zero mean
181 | then all the entries will be non-zero because the chance
182 | that the mean is actually zero is small. So then if you
183 | subtract something from everything, you're going to move all
184 | those zeros away from zero and then you can't store the data
185 | in sparse format and your RAM will blow up. Basically, you
186 | can subtract anything from sparse matrix because that won't
187 | be sparse anymore. But we can still scale it because any
188 | number of times zero is still zero. So if we scale it, then
189 | it will have the same structure again. The MaxAbsScaler sets
190 | the maximum absolute values to one. Basically, it looks at
191 | the maximum absolute values in the whole dataset, or for
192 | each feature and makes sure that this is one, just by
193 | scaling by one divided by the maximum absolute value.
194 | 
195 | ---
196 | # Standard Scaler Example
197 | 
198 | ```python
199 | from sklearn.linear_model import Ridge
200 | # Back to King Country house prices
201 | X_train, X_test, y_train, y_test = train_test_split(
202 |     X, y, random_state=0)
203 | 
204 | scaler = StandardScaler()
205 | scaler.fit(X_train)
206 | X_train_scaled = scaler.transform(X_train)
207 | 
208 | ridge = Ridge().fit(X_train_scaled, y_train)
209 | X_test_scaled = scaler.transform(X_test)
210 | ridge.score(X_test_scaled, y_test)
211 | ```
212 | ```
213 | 0.684
214 | ```
215 | 
216 | ???
217 | Here’s how you do the scaling with StandardScaler in
218 | scikit-learn. Similar interface to models, but“transform”
219 | instead of “predict”. “transform” is always used when you
220 | want a new representation of the data.
221 | 
222 | Fit on training set, transform training set, fit ridge on
223 | scaled data, transform test data, score scaled test data.
224 | 
225 | The fit computes mean and standard deviation on the training
226 | set, transform subtracts the mean and the standard
227 | deviation.
228 | 
229 | We fit on the training set and apply transform on both the
230 | training and the test set. That means the training set mean
231 | gets subtracted from the test set, not the test-set mean.
232 | That’s quite important.
233 | ---
234 | 
235 | class: center, middle
236 | 
237 | .center[
238 | ![:scale 100%](images/no_separate_scaling.png)
239 | ]
240 | 
241 | ???
242 | 
243 | Here’s an illustration why this is important using the
244 | min-max scaler. Left is the original data. Center is what
245 | happens when we fit on the training set and then transform
246 | the training and test set using this transformer. The data
247 | looks exactly the same, but the ticks changed. Now the data
248 | has a minimum of zero and a maximum of one on the training
249 | set. That’s not true for the test set, though. No particular
250 | range is ensured for the test-set. It could even be outside
251 | of 0 and 1. But the transformation is consistent with the
252 | transformation on the training set, so the data looks the
253 | same.
254 | 
255 | On the right you see what happens when you use the test-set
256 | minimum and maximum for scaling the test set. That’s what
257 | would happen if you’d fit again on the test set. Now the
258 | test set also has minimum at 0 and maximum at 1, but the
259 | data is totally distorted from what it was before. So don’t
260 | do that.
261 | ---
262 | class: center, middle
263 | 
264 | # Sckit-Learn API Summary
265 | 
266 | ![:scale 90%](images/api-table.png)
267 | 
268 | 
269 | ???
270 | 
271 | Here’s a summary of the scikit-learn methods. All models
272 | have a fit method which takes the training data X_train. If
273 | the model is supervised, such as our classification and
274 | regression models, they also take a y_train parameter. The
275 | scalers don’t use a y_train because they don’t use the
276 | labels at all – you could say they are unsupervised methods,
277 | but arguably they are not really learning methods at all.
278 | Models (also known as estimators in scikit-learn) to make a
279 | prediction of a target variable, you use the predict method,
280 | as in classification and regression. If you want to create a
281 | new representation of the data, a new kind of X, then you
282 | use the transform method, as we did with scaling. The
283 | transform method is also used for preprocessing, feature
284 | extraction and feature selection, which we’ll see later. All
285 | of these change X into some new form. There’s two important
286 | shortcuts. To fit an estimator and immediately transform the
287 | training data, you can use fit_transform. That’s often more
288 | efficient then using first fit and then transform. The same
289 | goes for fit_predict.
290 | 
291 | 
292 | ---
293 | class: smaller, compact
294 | 
295 | ```python
296 | from sklearn.linear_model import Ridge
297 | ridge = Ridge()
298 | ridge.fit(X_train, y_train)
299 | ridge.score(X_test, y_test)
300 | ```
301 | 
302 | ```
303 | 0.717
304 | 
305 | ```
306 | 
307 | ```python
308 | ridge.fit(X_train_scaled, y_train)
309 | ridge.score(X_test_scaled, y_test)
310 | ```
311 | 
312 | ```
313 | 0.718
314 | 
315 | ```
316 | 
317 | ```python
318 | from sklearn.neighbors import KNeighborsRegressor
319 | knn = KNeighborsRegressor()
320 | knn.fit(X_train, y_train)
321 | knn.score(X_test, y_test)
322 | ```
323 | 
324 | ```
325 | 0.499
326 | 
327 | ```
328 | 
329 | ```python
330 | knn = KNeighborsRegressor()
331 | knn.fit(X_train_scaled, y_train)
332 | knn.score(X_test_scaled, y_test)
333 | ```
334 | 
335 | ```
336 | 0.750
337 | 
338 | ```
339 | 
340 | ???
341 | 
342 | Let’s apply the scaler to the King Country housing data.
343 | 
344 | First I used the StandardScaler to scale the training data.
345 | Then I applied ten-fold cross-validation to evaluate the
346 | Ridge model on the data with and without scaling. I used
347 | RidgeCV which automatically picks alpha for me. With and
348 | without scaling we get an R^2 of about .72, so no
349 | difference. Often there is a difference for Ridge, but not
350 | in this case.
351 | 
352 | If we use KneighborsRegressor instead, we see a big
353 | difference. Without scaling R^2 is about .5, and with
354 | scaling it’s .75. That makes sense since we saw that for
355 | distance calculations basically all features are dominated
356 | by the TAX feature.
357 | 
358 | However, there is a bit of a problem with the analysis we
359 | did here. Can you see it?
360 | 
361 | 
362 | ---
363 | 
364 | class: left, middle
365 | 
366 | # Categorical Variables
367 | 
368 | ---
369 | 
370 | # Categorical Variables
371 | 
372 | .smaller[
373 | ```python
374 | import pandas as pd
375 | df = pd.DataFrame({
376 |   'boro': ['Manhattan', 'Queens', 'Manhattan', 'Brooklyn', 'Brooklyn', 'Bronx'],
377 |   'salary': [103, 89, 142, 54, 63, 219],
378 |   'vegan': ['No', 'No','No','Yes', 'Yes', 'No']})
379 | ```
380 | ]
381 | 
382 | <table border="1" class="dataframe">
383 |   <thead>
384 |     <tr style="text-align: right;">
385 |       <th></th><th>boro</th><th>salary</th><th>vegan</th></tr>
386 |   </thead>
387 |   <tbody>
388 |     <tr><th>0</th><td>Manhattan</td><td>103</td><td>No</td></tr>
389 |     <tr><th>1</th><td>Queens</td><td>89</td><td>No</td></tr>
390 |     <tr><th>2</th><td>Manhattan</td><td>142</td><td>No</td></tr>
391 |     <tr><th>3</th><td>Brooklyn</td><td>54</td><td>Yes</td></tr>
392 |     <tr><th>4</th><td>Brooklyn</td><td>63</td><td>Yes</td></tr>
393 |     <tr><th>5</th><td>Bronx</td><td>219</td><td>No</td></tr>
394 |   </tbody>
395 | </table>
396 | ???
397 | 
398 | Before we can apply a machine learning algorithm, we first
399 | need to think about how we represent our data.
400 | 
401 | Earlier, I said x \in R^n. That’s not how you usually get
402 | data. Often data has units, possibly different units for
403 | different sensors, it has a mixture of continuous values and
404 | discrete values, and different measurements might be on
405 | totally different scales.
406 | 
407 | First, let me explain how to deal with discrete input
408 | variables, also known as categorical features. They come up
409 | in nearly all applications.
410 | 
411 | Scikit-learn requires you to explicitly handle these, and assumes
412 | in general that all your input is continuous numbers.
413 | This is different from how many libraries in R do it,
414 | which deal with categorical variables implicitly.
415 | 
416 | ---
417 | # Ordinal encoding
418 | 
419 | .smaller[
420 | ```python
421 | df['boro_ordinal'] = df.boro.astype("category").cat.codes
422 | df
423 | ```
424 | ]
425 | .left-column[
426 | <table border="1" class="dataframe">
427 |   <thead>
428 |     <tr style="text-align: right;">
429 |       <th></th><th>boro</th><th>salary</th><th>vegan</th></tr>
430 |   </thead>
431 |   <tbody>
432 |     <tr><th>0</th><td>2</td><td>103</td><td>No</td></tr>
433 |     <tr><th>1</th><td>3</td><td>89</td><td>No</td></tr>
434 |     <tr><th>2</th><td>2</td><td>142</td><td>No</td></tr>
435 |     <tr><th>3</th><td>1</td><td>54</td><td>Yes</td></tr>
436 |     <tr><th>4</th><td>1</td><td>63</td><td>Yes</td></tr>
437 |     <tr><th>5</th><td>0</td><td>219</td><td>No</td></tr>
438 |   </tbody>
439 | </table>
440 | ]
441 | 
442 | --
443 | 
444 | .right-column[
445 | ![:scale 100%](images/boro_ordinal.png)
446 | ]
447 | ???
448 | 
449 | If you encode all three values using the same feature, then
450 | you are imposing a linear relation between them, and in
451 | particular you define an order between the categories.
452 | Usually, there is no semantic ordering of the categories,
453 | and so we shouldn’t introduce one in our representation of
454 | the data.
455 | 
456 | ---
457 | # Ordinal encoding
458 | 
459 | .smaller[
460 | ```python
461 | df['boro_ordinal'] = df.boro.astype("category").cat.codes
462 | df
463 | ```
464 | ]
465 | .left-column[
466 | <table border="1" class="dataframe">
467 |   <thead>
468 |     <tr style="text-align: right;">
469 |       <th></th><th>boro_ordinal</th><th>salary</th><th>vegan</th></tr>
470 |   </thead>
471 |   <tbody>
472 |     <tr><th>0</th><td>2</td><td>103</td><td>No</td></tr>
473 |     <tr><th>1</th><td>3</td><td>89</td><td>No</td></tr>
474 |     <tr><th>2</th><td>2</td><td>142</td><td>No</td></tr>
475 |     <tr><th>3</th><td>1</td><td>54</td><td>Yes</td></tr>
476 |     <tr><th>4</th><td>1</td><td>63</td><td>Yes</td></tr>
477 |     <tr><th>5</th><td>0</td><td>219</td><td>No</td></tr>
478 |   </tbody>
479 | </table>
480 | ]
481 | 
482 | .right-column[
483 | ![:scale 100%](images/boro_ordinal_classification.png)
484 | ]
485 | ---
486 | 
487 | # One-Hot (Dummy) Encoding
488 | 
489 | .narrow-left-column[
490 | <table border="1" class="dataframe">
491 |   <thead>
492 |     <tr style="text-align: right;">
493 |       <th></th><th>boro</th><th>salary</th><th>vegan</th></tr>
494 |   </thead>
495 |   <tbody>
496 |     <tr><th>0</th><td>Manhattan</td><td>103</td><td>No</td></tr>
497 |     <tr><th>1</th><td>Queens</td><td>89</td><td>No</td></tr>
498 |     <tr><th>2</th><td>Manhattan</td><td>142</td><td>No</td></tr>
499 |     <tr><th>3</th><td>Brooklyn</td><td>54</td><td>Yes</td></tr>
500 |     <tr><th>4</th><td>Brooklyn</td><td>63</td><td>Yes</td></tr>
501 |     <tr><th>5</th><td>Bronx</td><td>219</td><td>No</td></tr>
502 |   </tbody>
503 | </table>
504 | ]
505 | 
506 | .wide-right-column[
507 | .tiny[
508 | ```python
509 | pd.get_dummies(df)
510 | ```
511 | <table border="1" class="dataframe" style="font-size:15px">
512 |   <thead>
513 |     <tr style="text-align: right;">
514 |       <th></th><th>salary</th><th>boro_Bronx</th><th>boro_Brooklyn</th><th>boro_Manhattan</th><th>boro_Queens</th><th>vegan_No</th><th>vegan_Yes</th></tr>
515 |   </thead>
516 |   <tbody>
517 |     <tr><th>0</th><td>103</td><td>0</td><td>0</td><td>1</td><td>0</td><td>1</td><td>0</td></tr>
518 |     <tr><th>1</th><td>89</td><td>0</td><td>0</td><td>0</td><td>1</td><td>1</td><td>0</td></tr>
519 |     <tr><th>2</th><td>142</td><td>0</td><td>0</td><td>1</td><td>0</td><td>1</td><td>0</td></tr>
520 |     <tr><th>3</th><td>54</td><td>0</td><td>1</td><td>0</td><td>0</td><td>0</td><td>1</td></tr>
521 |     <tr><th>4</th><td>63</td><td>0</td><td>1</td><td>0</td><td>0</td><td>0</td><td>1</td></tr>
522 |     <tr><th>5</th><td>219</td><td>1</td><td>0</td><td>0</td><td>0</td><td>1</td><td>0</td></tr>
523 |   </tbody>
524 | </table>
525 | ]
526 | ]
527 | ???
528 | Instead, we add one new feature for each category,
529 | 
530 | And that feature encodes whether a sample belongs to this
531 | category or not.
532 | 
533 | That’s called a one-hot encoding, because only one of the
534 | three features in this example is active at a time.
535 | 
536 | You could actually get away with n-1 features, but in
537 | machine learning that usually doesn’t matter
538 | 
539 | One way to do is with Pandas. Here I have an example of a
540 | data frame where I have the boroughs of New York as a
541 | categorical variable and variable saying whether they are vegan. One to get the
542 | dummies is to get dummies on this data frame. This will
543 | create new columns, it will actually replace borough column
544 | by four columns that correspond to the four different
545 | values. The get_dummies applies transformation to all
546 | columns that have a dtype that's either object or
547 | categorical.
548 | 
549 | In this case we didn't actually want to transform the target variable vegan.
550 | 
551 | ---
552 | # One-Hot (Dummy) Encoding
553 | 
554 | .narrow-left-column[
555 | <table border="1" class="dataframe">
556 |   <thead>
557 |     <tr style="text-align: right;">
558 |       <th></th><th>boro</th><th>salary</th><th>vegan</th></tr>
559 |   </thead>
560 |   <tbody>
561 |     <tr><th>0</th><td>Manhattan</td><td>103</td><td>No</td></tr>
562 |     <tr><th>1</th><td>Queens</td><td>89</td><td>No</td></tr>
563 |     <tr><th>2</th><td>Manhattan</td><td>142</td><td>No</td></tr>
564 |     <tr><th>3</th><td>Brooklyn</td><td>54</td><td>Yes</td></tr>
565 |     <tr><th>4</th><td>Brooklyn</td><td>63</td><td>Yes</td></tr>
566 |     <tr><th>5</th><td>Bronx</td><td>219</td><td>No</td></tr>
567 |   </tbody>
568 | </table>
569 | ]
570 | 
571 | .wide-right-column[
572 | .tiny[
573 | ```python
574 | pd.get_dummies(df, columns=['boro'])
575 | ```
576 | <table border="1" class="dataframe" style="font-size:15px">
577 |   <thead>
578 |     <tr style="text-align: right;">
579 |       <th></th><th>salary</th><th>vegan</th><th>boro_Bronx</th><th>boro_Brooklyn</th><th>boro_Manhattan</th><th>boro_Queens</th></tr>
580 |   </thead>
581 |   <tbody>
582 |     <tr><th>0</th><td>103</td><td>No</td><td>0</td><td>0</td><td>1</td><td>0</td></tr>
583 |     <tr><th>1</th><td>89</td><td>No</td><td>0</td><td>0</td><td>0</td><td>1</td></tr>
584 |     <tr><th>2</th><td>142</td><td>No</td><td>0</td><td>0</td><td>1</td><td>0</td></tr>
585 |     <tr><th>3</th><td>54</td><td>Yes</td><td>0</td><td>1</td><td>0</td><td>0</td></tr>
586 |     <tr><th>4</th><td>63</td><td>Yes</td><td>0</td><td>1</td><td>0</td><td>0</td></tr>
587 |     <tr><th>5</th><td>219</td><td>No</td><td>1</td><td>0</td><td>0</td><td>0</td></tr>
588 |   </tbody>
589 | </table>
590 | ]
591 | ]
592 | 
593 | ???
594 | We can specify selectively which columns to apply the encoding to.
595 | 
596 | 
597 | ---
598 | # One-Hot (Dummy) Encoding
599 | .narrow-left-column[
600 | <table border="1" class="dataframe">
601 |   <thead>
602 |     <tr style="text-align: right;">
603 |       <th></th><th>boro</th><th>salary</th><th>vegan</th></tr>
604 |   </thead>
605 |   <tbody>
606 |     <tr><th>0</th><td>2</td><td>103</td><td>No</td></tr>
607 |     <tr><th>1</th><td>3</td><td>89</td><td>No</td></tr>
608 |     <tr><th>2</th><td>2</td><td>142</td><td>No</td></tr>
609 |     <tr><th>3</th><td>1</td><td>54</td><td>Yes</td></tr>
610 |     <tr><th>4</th><td>1</td><td>63</td><td>Yes</td></tr>
611 |     <tr><th>5</th><td>0</td><td>219</td><td>No</td></tr>
612 |   </tbody>
613 | </table>
614 | ]
615 | .wide-right-column[
616 | .tiny[
617 | 
618 | ```python
619 | pd.get_dummies(df_ordinal, columns=['boro'])
620 | ```
621 | <table border="1" class="dataframe">
622 |   <thead>
623 |     <tr style="text-align: right;">
624 |       <th></th><th>salary</th><th>vegan</th><th>boro_0</th><th>boro_1</th><th>boro_2</th><th>boro_3</th></tr>
625 |   </thead>
626 |   <tbody>
627 |     <tr><th>0</th><td>103</td><td>No</td><td>0</td><td>0</td><td>1</td><td>0</td></tr>
628 |     <tr><th>1</th><td>89</td><td>No</td><td>0</td><td>0</td><td>0</td><td>1</td></tr>
629 |     <tr><th>2</th><td>142</td><td>No</td><td>0</td><td>0</td><td>1</td><td>0</td></tr>
630 |     <tr><th>3</th><td>54</td><td>Yes</td><td>0</td><td>1</td><td>0</td><td>0</td></tr>
631 |     <tr><th>4</th><td>63</td><td>Yes</td><td>0</td><td>1</td><td>0</td><td>0</td></tr>
632 |     <tr><th>5</th><td>219</td><td>No</td><td>1</td><td>0</td><td>0</td><td>0</td></tr>
633 |   </tbody>
634 | </table>
635 | ]
636 | ]
637 | ???
638 | This also helps if the variable was already encoded using integers.
639 | Sometimes, someone has already encoded the categorical
640 | variables to integers like here. So here this is exactly the
641 | same information only except instead of strings you have
642 | them numbered. If you call the get_dummies on this nothing
643 | happens because none of them are object data types or
644 | categorical data types. If you want to look at the One Hot
645 | Encoding, you can explicitly pass columns equal and this
646 | will transform into boro_1, boro_2, boro_3.
647 | In this case get_dummies usually wouldn't do anything, but we can tell
648 | it which variables are categorical and it will dummy encode those for us.
649 | 
650 | ---
651 | .tiny[
652 | .left-column[
653 | ```python
654 | df = pd.DataFrame({
655 |   'boro': ['Manhattan', 'Queens', 'Manhattan',
656 |            'Brooklyn', 'Brooklyn', 'Bronx'],
657 |   'salary': [103, 89, 142, 54, 63, 219],
658 |   'vegan': ['No', 'No','No','Yes', 'Yes', 'No']})
659 | df_dummies = pd.get_dummies(df, columns=['boro']
660 | ```
661 | <table border="1" class="dataframe" style="font-size:12px">
662 |   <thead>
663 |     <tr style="text-align: right;">
664 |       <th></th><th>salary</th><th>vegan</th><th>boro_Bronx</th><th>boro_Brooklyn</th><th>boro_Manhattan</th><th>boro_Queens</th></tr>
665 |   </thead>
666 |   <tbody>
667 |     <tr><th>0</th><td>103</td><td>No</td><td>0</td><td>0</td><td>1</td><td>0</td></tr>
668 |     <tr><th>1</th><td>89</td><td>No</td><td>0</td><td>0</td><td>0</td><td>1</td></tr>
669 |     <tr><th>2</th><td>142</td><td>No</td><td>0</td><td>0</td><td>1</td><td>0</td></tr>
670 |     <tr><th>3</th><td>54</td><td>Yes</td><td>0</td><td>1</td><td>0</td><td>0</td></tr>
671 |     <tr><th>4</th><td>63</td><td>Yes</td><td>0</td><td>1</td><td>0</td><td>0</td></tr>
672 |     <tr><th>5</th><td>219</td><td>No</td><td>1</td><td>0</td><td>0</td><td>0</td></tr>
673 |   </tbody>
674 | </table>
675 | ]
676 | .right-column[
677 | ```python
678 | df = pd.DataFrame({
679 |   'boro': ['Brooklyn', 'Manhattan', 'Brooklyn',
680 |            'Queens', 'Brooklyn', 'Staten Island'],
681 |   'salary': [61, 146, 142, 212, 98, 47],
682 |   'vegan': ['Yes', 'No','Yes','No', 'Yes', 'No']})
683 | df_dummies = pd.get_dummies(df, columns=['boro'])
684 | ```
685 | <table border="1" class="dataframe" style="font-size:12px">
686 |   <thead>
687 |     <tr style="text-align: right;">
688 |       <th></th><th>salary</th><th>vegan</th><th style="color:red">boro_Brooklyn</th>
689 |       <th style="color:red">boro_Manhattan</th><th style="color:red">boro_Queens</th><th style="color:red">boro_Staten Island</th></tr>
690 |   </thead>
691 |   <tbody>
692 |     <tr><th>0</th><td>61</td><td>Yes</td><td>1</td><td>0</td><td>0</td><td>0</td></tr>
693 |     <tr><th>1</th><td>146</td><td>No</td><td>0</td><td>1</td><td>0</td><td>0</td></tr>
694 |     <tr><th>2</th><td>142</td><td>Yes</td><td>1</td><td>0</td><td>0</td><td>0</td></tr>
695 |     <tr><th>3</th><td>212</td><td>No</td><td>0</td><td>0</td><td>1</td><td>0</td></tr>
696 |     <tr><th>4</th><td>98</td><td>Yes</td><td>1</td><td>0</td><td>0</td><td>0</td></tr>
697 |     <tr><th>5</th><td>47</td><td>No</td><td>0</td><td>0</td><td>0</td><td>1</td></tr>
698 |   </tbody>
699 | </table>
700 | ]]
701 | ???
702 | If someone else gives you a new data set and in this new
703 | data set there is Staten Island, Manhattan, Bronx and
704 | Brooklyn. So new dataset doesn't have anyone from Queens. So
705 | now you transform this with get_dummies, you get something
706 | that has the same shape as the original data but actually,
707 | the last column means something completely different.
708 | Because now the last column is Staten Island, not Queens. If
709 | someone gives you separate training and test data sets, if
710 | you call get_dummies, you don't know that the columns
711 | correspond actually to the same thing. Unless you take care
712 | of the names, unfortunately, scikit-learn completely ignores
713 | column names.
714 | 
715 | ---
716 | class: smaller
717 | 
718 | #Pandas Categorical Columns
719 | 
720 | .smaller[
721 | ```python
722 | 
723 | df = pd.DataFrame({
724 |    'boro': ['Manhattan', 'Queens', 'Manhattan', 'Brooklyn', 'Brooklyn', 'Bronx'],
725 |    'salary': [103, 89, 142, 54, 63, 219],
726 |    'vegan': ['No', 'No','No','Yes', 'Yes', 'No']})
727 | 
728 | df['boro'] = pd.Categorical(df.boro,
729 |                             categories=['Manhattan', 'Queens', 'Brooklyn',
730 |                                         'Bronx', 'Staten Island'])
731 | pd.get_dummies(df, columns=['boro'])
732 | ```
733 | ]
734 | .tiny[
735 | 
736 | <table border="1" class="dataframe">
737 |   <thead>
738 |     <tr style="text-align: right;">
739 |       <th></th><th>salary</th><th>vegan</th><th>boro_Manhattan</th><th>boro_Queens</th><th>boro_Brooklyn</th><th>boro_Bronx</th><th>boro_Staten Island</th></tr>
740 |   </thead>
741 |   <tbody>
742 |     <tr><th>0</th><td>103</td><td>No</td><td>1</td><td>0</td><td>0</td><td>0</td><td>0</td></tr>
743 |     <tr><th>1</th><td>89</td><td>No</td><td>0</td><td>1</td><td>0</td><td>0</td><td>0</td></tr>
744 |     <tr><th>2</th><td>142</td><td>No</td><td>1</td><td>0</td><td>0</td><td>0</td><td>0</td></tr>
745 |     <tr><th>3</th><td>54</td><td>Yes</td><td>0</td><td>0</td><td>1</td><td>0</td><td>0</td></tr>
746 |     <tr><th>4</th><td>63</td><td>Yes</td><td>0</td><td>0</td><td>1</td><td>0</td><td>0</td></tr>
747 |     <tr><th>5</th><td>219</td><td>No</td><td>0</td><td>0</td><td>0</td><td>1</td><td>0</td></tr>
748 |   </tbody>
749 | </table>
750 | ]
751 | 
752 | 
753 | ???
754 | The way to fix this is by using Pandas categorical types.
755 | Since we know what the boroughs of Manhattan are, we can
756 | create Pandas categorical dtype, we can create this
757 | categorical dtype with the categories Manhattan, Queens,
758 | Brooklyn, Bronx, and Staten Island. So now I have my column
759 | here and I'm going to convert it to a categorical dtype. So
760 | now it will not actually store the strings. It will just
761 | internally store zero to four, and it will also store what
762 | are the possible values. If a call get_dummies it will use
763 | all the possible values and for each of the possible values
764 | it will create a column. Even though Staten Island has not
765 | appeared in my dataset, it will still make a column for
766 | Staten Island. If I fix this categorical dtype I can apply
767 | it to the training and test data set and that'll make sure
768 | all the columns are always the same no matter what are the
769 | values are actually in the data set.
770 | 
771 | ---
772 | 
773 | 
774 | # OneHotEncoder
775 | 
776 | ```python
777 | import pandas as pd
778 | df = pd.DataFrame({'salary': [103, 89, 142, 54, 63, 219],
779 |                    'boro': ['Manhattan', 'Queens', 'Manhattan',
780 |                             'Brooklyn', 'Brooklyn', 'Bronx']})
781 | 
782 | ce = OneHotEncoder().fit(df)
783 | ce.transform(df).toarray()
784 | ```
785 | ```
786 | array([[ 0.,  0.,  1.,  0.,  0.,  0.,  0.,  1.,  0.,  0.],
787 |        [ 0.,  0.,  0.,  1.,  0.,  0.,  1.,  0.,  0.,  0.],
788 |        [ 0.,  0.,  1.,  0.,  0.,  0.,  0.,  0.,  1.,  0.],
789 |        [ 0.,  1.,  0.,  0.,  1.,  0.,  0.,  0.,  0.,  0.],
790 |        [ 0.,  1.,  0.,  0.,  0.,  1.,  0.,  0.,  0.,  0.],
791 |        [ 1.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  1.]])
792 | ```
793 | 
794 | - Always transforms all columns
795 | 
796 | ???
797 | - Fit-transform paradigm ensures train and test-set categories correspond.
798 | 
799 | ---
800 | # OneHotEncoder + ColumnTransformer
801 | 
802 | ```python
803 | categorical = df.dtypes == object
804 | 
805 | preprocess = make_column_transformer(
806 |     (StandardScaler(), ~categorical),
807 |     (OneHotEncoder(), categorical))
808 | 
809 | ```
810 | ???
811 | The way to use this with mixed type data is column transformer, which
812 | allows you to transforms only some of the columns. For
813 | example, you can call categorical encoder only on the
814 | categorical columns and call StandardScaler on the
815 | non-categorical columns, and then use that to preprocess
816 | your data. Right now using Pandas, make sure your column
817 | names match up, make everything to an integer,
818 | or use column transformer and everything is awesome.
819 | 
820 | In contrast to basically all other estimators in sklearn,
821 | this uses the column information in pandas and allows you to slice
822 | out different columns based on column names, integer indices or boolean masks.
823 | In this example I'm constructing a boolean mask
824 | 
825 | ---
826 | class:center, middle
827 | ![:scale 100%](images/column_transformer_schematic.png)
828 | ???
829 | Here's a schematic of the column transformer.
830 | Most commonly you might want to separate continuous and categorical columns,
831 | but you can select any subsets of columns you like. They can also overlap.
832 | Or you can apply multiple transformations to the same set of columns.
833 | Let's say I want a scaled version of the data, but I also want to
834 | extract principal components. I can use the same column as inputs to multiple
835 | transformers, and the results will be concatenated.
836 | 
837 | FIXME add code
838 | ---
839 | class: some-space
840 | 
841 | # Dummy variables and colinearity
842 | 
843 | - One-hot is redundant (last one is 1 – sum of others)
844 | - Can introduce co-linearity
845 | - Can drop one
846 | - Choice which one matters for penalized models
847 | - Keeping all can make the model more interpretable
848 | 
849 | 
850 | ???
851 | N/A
852 | 
853 | ---
854 | 
855 | class: some-space
856 | 
857 | #Models Supporting Discrete Features
858 | 
859 | - In principle:
860 |   - All tree-based models, naive Bayes
861 | - In scikit-learn:
862 |   - Some Naive Bayes classifiers.
863 | - In scikit-learn "soon":
864 |   - Decision trees, random forests, gradient boosting
865 | 
866 | 
867 | ???
868 | In principle all tree-based models support categorical
869 | features, in scikit-learn none of them do, hopefully, soon
870 | they will. So what you can do is either you do the One Hot
871 | Encoder or you just encode this as integers and treat it as
872 | a continuous. If you have very high categorical variables
873 | with many levels, maybe it keeping it as an integer might
874 | make more sense.
875 | ---
876 | class: center, middle
877 | 
878 | # Notebook: Preprocessing
879 | 
880 |     </textarea>
881 |     <script src="https://remarkjs.com/downloads/remark-latest.min.js"></script>
882 |     <script type="text/javascript" src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.1/MathJax.js?config=TeX-AMS-MML_HTMLorMML"></script>
883 | 
884 |     <script>
885 |     // Config Remark
886 |     remark.macros['scale'] = function (percentage) {
887 |         var url = this;
888 |         return '<img src="' + url + '" style="width: ' + percentage + '" />';
889 |     };
890 |     config_remark = {
891 |         highlightStyle: 'magula',
892 |         highlightSpans: true,
893 |         highlightLines: true,
894 |         ratio: "16:9"
895 |     };
896 |       var slideshow = remark.create(config_remark);
897 |     // Configure MathJax
898 |     MathJax.Hub.Config({
899 |     tex2jax: {
900 |         inlineMath: [['$','$'], ['\\(','\\)']],
901 |         skipTags: ['script', 'noscript', 'style', 'textarea', 'pre'] /* removed 'code' entry*/
902 |     }
903 |     });
904 |     MathJax.Hub.Queue(function() {
905 |         var all = MathJax.Hub.getAllJax(), i;
906 |         for(i = 0; i < all.length; i += 1) {
907 |             all[i].SourceElement().parentNode.className += ' has-jax';
908 |         }
909 |     });
910 |     </script>
911 |   </body>
912 | </html>
913 | 


--------------------------------------------------------------------------------
/slides/04-missing_values.html:
--------------------------------------------------------------------------------
  1 | <!DOCTYPE html>
  2 | <html>
  3 |   <head>
  4 |     <title>Missing Values</title>
  5 |     <meta charset="utf-8">
  6 |     <link rel="stylesheet" href="style.css">
  7 |     <style>
  8 |       @import url(https://fonts.googleapis.com/css?family=Garamond);
  9 |       @import url(https://fonts.googleapis.com/css?family=Muli:400,700,400italic);
 10 |       @import url(https://fonts.googleapis.com/css?family=Ubuntu+Mono:400,700,400italic);
 11 |     </style>
 12 |   </head>
 13 |   <body>
 14 |     <textarea id="source">
 15 | 
 16 | class: center, middle
 17 | 
 18 | ![:scale 40%](images/sklearn_logo.png)
 19 | 
 20 | ### Introduction to Machine learning with scikit-learn
 21 | 
 22 | # Missing values
 23 | 
 24 | Andreas C. Müller
 25 | 
 26 | Columbia University, scikit-learn
 27 | 
 28 | .smaller[https://github.com/amueller/ml-workshop-1-of-4]
 29 | 
 30 | ---
 31 | class: spacious
 32 | 
 33 | # Dealing with missing values
 34 | 
 35 | - Missing values can be encoded in many ways
 36 | - Numpy has no standard format for it (often np.NaN) - pandas does
 37 | - Sometimes: 999, ???, ?, np.inf, “N/A”, “Unknown“ …
 38 | - Not discussing “missing output” - that’s semi-supervised learning.
 39 | - Often missingness is informative (Use `MissingIndicator`)
 40 | 
 41 | ???
 42 | 
 43 | So first we're going to talk about imputation, which means
 44 | basically dealing with missing values. Before that, we're
 45 | going to talk about different methods to deal with missing
 46 | values. The first thing about missing values is that you
 47 | need to figure out whether your dataset has them and how
 48 | they encode it. If you're on Python, numpy has no standard
 49 | way to represent missing values while Pandas does and it's
 50 | usually nan. But the problem is usually more in the data
 51 | source. So depending on where your data comes from, missing
 52 | values might be encoded as anything. They might be
 53 | differently encoded for different parts of the data set. So
 54 | if you see some question marks somewhere, it doesn't mean
 55 | that all missing values are encoded as question marks. There
 56 | might be different reasons why data is missing. If you look
 57 | into like theoretical analysis missingness, often there you
 58 | can see something that's called missing at random or missing
 59 | completely at random, meaning that data was retracted
 60 | randomly from the dataset. That's not usually what happens.
 61 | Usually, if the data is missing, it's missing because
 62 | something when differently in a process, like someone didn't
 63 | fill out the form correctly, or someone didn't reply in a
 64 | survey. And very often the fact that someone didn't reply or
 65 | that something was not measured is actually informative.
 66 | Whenever you have missing values, it's often a good idea to
 67 | keep the information about whether the value was missing or
 68 | not. If a feature was missing, while we're going to do
 69 | imputation and we're going to try to fill in the missing
 70 | values. It's often really useful information that's
 71 | something was missing and you should record the fact and
 72 | represent it in the dataset somehow. We're only going to
 73 | talk about missing input today. You can also have missing
 74 | values in the outputs that are usually called
 75 | semi-supervised learning where you have the true target or
 76 | the true class only for some data points, but not for all of
 77 | them. So there's the first method which is very obvious.
 78 | Let's say your data looks like this. All my illustrations
 79 | today we'll be adding randomness to the iris data set.
 80 | 
 81 | FIXME: Better digram for feature selection. What are the
 82 | hypothesises for the tests
 83 | ---
 84 | 
 85 | .center[
 86 | ![:scale 80%](images/row_nan_col_nan.png)
 87 | ]
 88 | 
 89 | ???
 90 | So if my dataset looks like something on the left-hand side
 91 | here, then you can see that there are only missing values in
 92 | the first feature and it's mostly missing. One of the ways
 93 | to deal with this is just completely dropped the first
 94 | feature, and that might be a reasonable thing to do. If
 95 | there are so few values here that you don't think there's
 96 | any information in this just drop it I always like to
 97 | compare everything to a baseline approach. So your baseline
 98 | approach should be if there's only missing values in some of
 99 | the columns, just drop these columns, see what happens. You
100 | can always iterate and improve on that but that should be
101 | the baseline. A little bit trickier situation is that there
102 | might be some missing values only for a few rows, the rows
103 | are data points. You can kick out these data points and
104 | train the model on the rest and that might be fine. There's
105 | a bit of a problem with this though, that if you want to
106 | make predictions on new data and the data that arrives has
107 | missing values you'll not be able to make predictions
108 | because you don't have a way to deal with missing values. If
109 | you're guaranteed that any new test point that comes in will
110 | not have missing values then doing this might make sense.
111 | Another issue with dropping the rows with missing values is
112 | that if this was related to the actual outcome then it might
113 | be that you biased how well you think you're doing. Maybe
114 | all the hard data points are the ones that have missing
115 | values. And so by excluding them from your training data,
116 | you're also excluding them from the validation. So it means
117 | you think you're doing very well because you discarded all
118 | the hard data points. Discarding data points is a little bit
119 | trickier and it depends on your situation.
120 | 
121 | 
122 | ---
123 | 
124 | .center[
125 | ![:scale 100%](images/imputation-schema.png)
126 | ]
127 | 
128 | ???
129 | The other solution is to impute them. So the idea is that
130 | you have your training data set, you build some model of the
131 | training data set, and then you fill in the missing values
132 | using the information from the other rows and columns. And
133 | you built a model for this and then you can also apply the
134 | same imputation model on the test data set if you want to
135 | make predictions. Question is what if it has all missing
136 | values? Then you have no choice but drop that. If in the
137 | dataset that happens, you need to figure out what you are
138 | going to do. Like, if you have a production system, and
139 | something comes in with all missing values, you need to
140 | decide what you're going to do. But you probably cannot use
141 | this data point for training model. You could like use the
142 | outputs and train to find the mean outcome of all the values
143 | that are missing.
144 | 
145 | ---
146 | class: spacious
147 | 
148 | # Imputation Methods
149 | 
150 | - Mean / Median
151 | - kNN
152 | - Regression models
153 | - Matrix factorization (not in this lecture)
154 | 
155 | ???
156 | So let's talk about the general methods for data imputation.
157 | So these are the ones that we are going to talk through. The
158 | easiest one is me doing a constant value per column.
159 | Imputing the mean or the medium of the column that we're
160 | trying to compute. kNN means taking the average of KNearest
161 | Neighbors. Regression means I build a regression model from
162 | some of the features trying to rip the missing future. And
163 | finally, elaborate probabilistic models. They try to build a
164 | probabilistic model of the dataset and complete the missing
165 | values based on this probabilistic model
166 | 
167 | ---
168 | 
169 | # Baseline: Dropping Columns
170 | 
171 | .smaller[
172 | ```python
173 | 
174 | from sklearn.linear_model import LogisticRegressionCV
175 | from sklearn.model_selection import train_test_split
176 | X_train, X_test, y_train, y_test = train_test_split(X_, y, stratify=y)
177 | 
178 | nan_columns = np.any(np.isnan(X_train), axis=0)
179 | X_drop_columns = X_train[:, ~nan_columns]
180 | lr = LogisticRegression().fit(X_drop_columns, y_train)
181 | lr.score(X_test[:, ~nan_columns], y_test)
182 | ```
183 | 0.772
184 | ]
185 | 
186 | ???
187 | Here’s my baseline, which is dropping the columns. I used
188 | iris dataset where I had to put some missing values into the
189 | second and third column. Here, x_score has like some missing
190 | values, I split into test and training test data set and
191 | then I drop all columns that have missing values and then I
192 | can train logistic regression, which is sort of the model
193 | I'm going to use to tell me how well does this imputation
194 | work for classification. For my baseline, using a logistic
195 | regression model and 10 fold baseline, I get 77.2% accuracy.
196 | 
197 | ---
198 | 
199 | # Mean and Median
200 | 
201 | .center[
202 | ![:scale 100%](images/imputation-median-schema.png)
203 | ]
204 | 
205 | ???
206 | The simplest imputation is mean or medium. Unfortunately,
207 | only one is implemented in scikit-learn right now. If this
208 | is our dataset, the imputation will look like this. For a
209 | column, each missing value is replaced by the mean of the
210 | column. The imputer is the only transformer in scikit-learn
211 | that can handle missing data. Using the imputer you can
212 | specify the strategy, mean or median or constant and then
213 | you can call the method transform and that imputes missing
214 | values.
215 | 
216 | ---
217 | 
218 | .center[
219 | ![:scale 90%](images/median_imputation.png)
220 | ]
221 | 
222 | ???
223 | Here is a graphical illustration of what does this. So the
224 | iris dataset is four-dimensional, and I'm plotting it in two
225 | dimensions in which I added missing values. So, there are
226 | two other dimensions which had no missing values, which you
227 | can't see. The original data set is basically blue points
228 | here, orange points here, green points here, and it's
229 | relatively easy to separate. But here you can see that the
230 | green points they have a lot of missingness, so they were
231 | replaced by the mean of the whole dataset.
232 | 
233 | So there are two things about this. One, I kind of lost the
234 | class information a lot. Two, the data is now in places
235 | where there was no data before, which is not so great. You
236 | could do smart things like doing the mean per class, but
237 | that's actually not something that I've seen a lot.
238 | 
239 | 
240 | ---
241 | 
242 | 
243 | 
244 | ```python
245 | nan_columns = np.any(np.isnan(X_train), axis = 0)
246 | X_drop_columns = X_train[:,~nan_columns]
247 | logreg = LogisticRegression()
248 | logreg.fit(X_drop_columns, y_train)
249 | logreg.score(X_test[:,~nan_columns], y_test)
250 | ```
251 | 0.794
252 | ```python
253 | imp = SimpleImputer()
254 | X_train_imp = imp.fit_transform(X_train)
255 | lr = LogisticRegression().fit(X_train_imp, y_train)
256 | X_test_imp = imp.transform(X_test)
257 | lr.score(X_test_imp, y_test)
258 | ```
259 | 0.729
260 | 
261 | ???
262 | 
263 | Here's the comparison of dropping the columns versus doing
264 | the mean amputation. We actually see that the mean
265 | imputation is worse. In general, if you have very few
266 | missing values, it might actually not be so bad and putting
267 | in the mean might work.
268 | Here I basically designed the dataset so that mean imputation fails.
269 | This is not very realistic.
270 | 
271 | ---
272 | class: spacious
273 | 
274 | # KNN Imputation
275 | 
276 | - Find k nearest neighbors that have non-missing values.
277 | - Fill in all missing values using the average of the
278 | neighbors.
279 | 
280 | .smaller[
281 | ```python
282 | from sklearn.impute import KNNImputer
283 | knnimp = KNNImputer().fit(X_train)
284 | X_train_knn = knnimp.transform(X_train)
285 | lr = LogisticRegression().fit(X_train_knn, y_train)
286 | X_test_knn = knnimp.transform(X_test)
287 | lr.score(X_test_knn, y_test)
288 | ```
289 | ```
290 | 0.849
291 | ```
292 | ]
293 | ---
294 | .center[
295 | ![:scale 80%](images/knn_imputation.png)
296 | ]
297 | 
298 | ???
299 | 
300 | ---
301 | 
302 | # Model-Driven Imputation
303 | 
304 | - Train regression model for missing values
305 | - Iterate: retrain after filling in
306 | - IterativeImputer in next sklearn release
307 | 
308 | .smaller[
309 | ```python
310 | rf_imp = IterativeImputer(predictor=RandomForestRegressor())
311 | rf_imp.fit(X_train)
312 | X_train_rf = rf_imp.transform(X_train)
313 | lr = LogisticRegression().fit(X_train_rf, y_train)
314 | X_test_rf = rf_imp.transform(X_test)
315 | lr.score(X_test_rf, y_test)
316 | ```
317 | ```
318 | 0.845
319 | ```
320 | ]
321 | 
322 | ???
323 | The next step up in complexity is using an arbitrary
324 | regression model for imputation. I mean, arguably, the kNN
325 | is also a regression model, but there's like some
326 | intricacies that make it a little bit different. So the idea
327 | with using a regression model is you do the first pass and
328 | impute data using the mean. And then you try to predict the
329 | missing features using a regression model trained on the
330 | non-missing features and then you iterate this until stuff
331 | doesn't change anymore. You can use any model you like, and
332 | this is very flexible, and it can be fast if you have a fast
333 | model.
334 | ---
335 | class: spacious
336 | 
337 | # Comparision of Imputation Methods
338 | 
339 | .center[
340 | ![:scale 100%](images/med_knn_rf_comparison.png)
341 | ]
342 | 
343 | ???
344 | FIXME better illustration/ graph
345 | Here's a comparison on the iris dataset again.
346 | This is a very artificial example and is only meant to illustrate the ideas.
347 | In practice, using the mean or median is actually quite decent.
348 | 
349 | 
350 |     </textarea>
351 |     <script src="https://remarkjs.com/downloads/remark-latest.min.js"></script>
352 |     <script type="text/javascript" src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.1/MathJax.js?config=TeX-AMS-MML_HTMLorMML"></script>
353 | 
354 |     <script>
355 |     // Config Remark
356 |     remark.macros['scale'] = function (percentage) {
357 |         var url = this;
358 |         return '<img src="' + url + '" style="width: ' + percentage + '" />';
359 |     };
360 |     config_remark = {
361 |         highlightStyle: 'magula',
362 |         highlightSpans: true,
363 |         highlightLines: true,
364 |         ratio: "16:9"
365 |     };
366 |       var slideshow = remark.create(config_remark);
367 |     // Configure MathJax
368 |     MathJax.Hub.Config({
369 |     tex2jax: {
370 |         inlineMath: [['$','$'], ['\\(','\\)']],
371 |         skipTags: ['script', 'noscript', 'style', 'textarea', 'pre'] /* removed 'code' entry*/
372 |     }
373 |     });
374 |     MathJax.Hub.Queue(function() {
375 |         var all = MathJax.Hub.getAllJax(), i;
376 |         for(i = 0; i < all.length; i += 1) {
377 |             all[i].SourceElement().parentNode.className += ' has-jax';
378 |         }
379 |     });
380 |     </script>
381 |   </body>
382 | </html>
383 | 


--------------------------------------------------------------------------------
/slides/images/PDSH.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/amueller/ml-workshop-1-of-4/917850cf6e751c36edb6d9692c48878b42c7f263/slides/images/PDSH.png


--------------------------------------------------------------------------------
/slides/images/alpha_go.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/amueller/ml-workshop-1-of-4/917850cf6e751c36edb6d9692c48878b42c7f263/slides/images/alpha_go.png


--------------------------------------------------------------------------------
/slides/images/amazon1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/amueller/ml-workshop-1-of-4/917850cf6e751c36edb6d9692c48878b42c7f263/slides/images/amazon1.png


--------------------------------------------------------------------------------
/slides/images/amazon2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/amueller/ml-workshop-1-of-4/917850cf6e751c36edb6d9692c48878b42c7f263/slides/images/amazon2.png


--------------------------------------------------------------------------------
/slides/images/amazon_explanations.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/amueller/ml-workshop-1-of-4/917850cf6e751c36edb6d9692c48878b42c7f263/slides/images/amazon_explanations.png


--------------------------------------------------------------------------------
/slides/images/api-table.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/amueller/ml-workshop-1-of-4/917850cf6e751c36edb6d9692c48878b42c7f263/slides/images/api-table.png


--------------------------------------------------------------------------------
/slides/images/apm.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/amueller/ml-workshop-1-of-4/917850cf6e751c36edb6d9692c48878b42c7f263/slides/images/apm.png


--------------------------------------------------------------------------------
/slides/images/boro_ordinal.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/amueller/ml-workshop-1-of-4/917850cf6e751c36edb6d9692c48878b42c7f263/slides/images/boro_ordinal.png


--------------------------------------------------------------------------------
/slides/images/boro_ordinal_classification.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/amueller/ml-workshop-1-of-4/917850cf6e751c36edb6d9692c48878b42c7f263/slides/images/boro_ordinal_classification.png


--------------------------------------------------------------------------------
/slides/images/column_transformer_schematic.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/amueller/ml-workshop-1-of-4/917850cf6e751c36edb6d9692c48878b42c7f263/slides/images/column_transformer_schematic.png


--------------------------------------------------------------------------------
/slides/images/cross_validation_new.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/amueller/ml-workshop-1-of-4/917850cf6e751c36edb6d9692c48878b42c7f263/slides/images/cross_validation_new.png


--------------------------------------------------------------------------------
/slides/images/esl.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/amueller/ml-workshop-1-of-4/917850cf6e751c36edb6d9692c48878b42c7f263/slides/images/esl.png


--------------------------------------------------------------------------------
/slides/images/exoplanet.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/amueller/ml-workshop-1-of-4/917850cf6e751c36edb6d9692c48878b42c7f263/slides/images/exoplanet.png


--------------------------------------------------------------------------------
/slides/images/facebook_gael.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/amueller/ml-workshop-1-of-4/917850cf6e751c36edb6d9692c48878b42c7f263/slides/images/facebook_gael.png


--------------------------------------------------------------------------------
/slides/images/fancy_impute_comparison.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/amueller/ml-workshop-1-of-4/917850cf6e751c36edb6d9692c48878b42c7f263/slides/images/fancy_impute_comparison.png


--------------------------------------------------------------------------------
/slides/images/fb1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/amueller/ml-workshop-1-of-4/917850cf6e751c36edb6d9692c48878b42c7f263/slides/images/fb1.png


--------------------------------------------------------------------------------
/slides/images/fb2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/amueller/ml-workshop-1-of-4/917850cf6e751c36edb6d9692c48878b42c7f263/slides/images/fb2.png


--------------------------------------------------------------------------------
/slides/images/fb3.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/amueller/ml-workshop-1-of-4/917850cf6e751c36edb6d9692c48878b42c7f263/slides/images/fb3.png


--------------------------------------------------------------------------------
/slides/images/grid_search_cross_validation_new.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/amueller/ml-workshop-1-of-4/917850cf6e751c36edb6d9692c48878b42c7f263/slides/images/grid_search_cross_validation_new.png


--------------------------------------------------------------------------------
/slides/images/house_price_boxplot.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/amueller/ml-workshop-1-of-4/917850cf6e751c36edb6d9692c48878b42c7f263/slides/images/house_price_boxplot.png


--------------------------------------------------------------------------------
/slides/images/house_price_scaled_box.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/amueller/ml-workshop-1-of-4/917850cf6e751c36edb6d9692c48878b42c7f263/slides/images/house_price_scaled_box.png


--------------------------------------------------------------------------------
/slides/images/house_price_scatter.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/amueller/ml-workshop-1-of-4/917850cf6e751c36edb6d9692c48878b42c7f263/slides/images/house_price_scatter.png


--------------------------------------------------------------------------------
/slides/images/imlp.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/amueller/ml-workshop-1-of-4/917850cf6e751c36edb6d9692c48878b42c7f263/slides/images/imlp.png


--------------------------------------------------------------------------------
/slides/images/imputation-median-schema.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/amueller/ml-workshop-1-of-4/917850cf6e751c36edb6d9692c48878b42c7f263/slides/images/imputation-median-schema.png


--------------------------------------------------------------------------------
/slides/images/imputation-schema.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/amueller/ml-workshop-1-of-4/917850cf6e751c36edb6d9692c48878b42c7f263/slides/images/imputation-schema.png


--------------------------------------------------------------------------------
/slides/images/information_leak_preprocessing.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/amueller/ml-workshop-1-of-4/917850cf6e751c36edb6d9692c48878b42c7f263/slides/images/information_leak_preprocessing.png


--------------------------------------------------------------------------------
/slides/images/knn_boundary_dataset.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/amueller/ml-workshop-1-of-4/917850cf6e751c36edb6d9692c48878b42c7f263/slides/images/knn_boundary_dataset.png


--------------------------------------------------------------------------------
/slides/images/knn_boundary_k1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/amueller/ml-workshop-1-of-4/917850cf6e751c36edb6d9692c48878b42c7f263/slides/images/knn_boundary_k1.png


--------------------------------------------------------------------------------
/slides/images/knn_boundary_k3.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/amueller/ml-workshop-1-of-4/917850cf6e751c36edb6d9692c48878b42c7f263/slides/images/knn_boundary_k3.png


--------------------------------------------------------------------------------
/slides/images/knn_boundary_test_points.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/amueller/ml-workshop-1-of-4/917850cf6e751c36edb6d9692c48878b42c7f263/slides/images/knn_boundary_test_points.png


--------------------------------------------------------------------------------
/slides/images/knn_boundary_varying_k.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/amueller/ml-workshop-1-of-4/917850cf6e751c36edb6d9692c48878b42c7f263/slides/images/knn_boundary_varying_k.png


--------------------------------------------------------------------------------
/slides/images/knn_imputation.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/amueller/ml-workshop-1-of-4/917850cf6e751c36edb6d9692c48878b42c7f263/slides/images/knn_imputation.png


--------------------------------------------------------------------------------
/slides/images/knn_model_complexity.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/amueller/ml-workshop-1-of-4/917850cf6e751c36edb6d9692c48878b42c7f263/slides/images/knn_model_complexity.png


--------------------------------------------------------------------------------
/slides/images/knn_scaling.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/amueller/ml-workshop-1-of-4/917850cf6e751c36edb6d9692c48878b42c7f263/slides/images/knn_scaling.png


--------------------------------------------------------------------------------
/slides/images/knn_scaling2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/amueller/ml-workshop-1-of-4/917850cf6e751c36edb6d9692c48878b42c7f263/slides/images/knn_scaling2.png


--------------------------------------------------------------------------------
/slides/images/knn_vs_nearest_centroid.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/amueller/ml-workshop-1-of-4/917850cf6e751c36edb6d9692c48878b42c7f263/slides/images/knn_vs_nearest_centroid.png


--------------------------------------------------------------------------------
/slides/images/matrix-representation.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/amueller/ml-workshop-1-of-4/917850cf6e751c36edb6d9692c48878b42c7f263/slides/images/matrix-representation.png


--------------------------------------------------------------------------------
/slides/images/mean_knn_rf_comparison.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/amueller/ml-workshop-1-of-4/917850cf6e751c36edb6d9692c48878b42c7f263/slides/images/mean_knn_rf_comparison.png


--------------------------------------------------------------------------------
/slides/images/med_knn_rf_comparison.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/amueller/ml-workshop-1-of-4/917850cf6e751c36edb6d9692c48878b42c7f263/slides/images/med_knn_rf_comparison.png


--------------------------------------------------------------------------------
/slides/images/median_imputation.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/amueller/ml-workshop-1-of-4/917850cf6e751c36edb6d9692c48878b42c7f263/slides/images/median_imputation.png


--------------------------------------------------------------------------------
/slides/images/missing_values_img_17.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/amueller/ml-workshop-1-of-4/917850cf6e751c36edb6d9692c48878b42c7f263/slides/images/missing_values_img_17.png


--------------------------------------------------------------------------------
/slides/images/missing_values_img_19.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/amueller/ml-workshop-1-of-4/917850cf6e751c36edb6d9692c48878b42c7f263/slides/images/missing_values_img_19.png


--------------------------------------------------------------------------------
/slides/images/missing_values_img_20.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/amueller/ml-workshop-1-of-4/917850cf6e751c36edb6d9692c48878b42c7f263/slides/images/missing_values_img_20.png


--------------------------------------------------------------------------------
/slides/images/missing_values_img_22.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/amueller/ml-workshop-1-of-4/917850cf6e751c36edb6d9692c48878b42c7f263/slides/images/missing_values_img_22.png


--------------------------------------------------------------------------------
/slides/images/missing_values_img_23.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/amueller/ml-workshop-1-of-4/917850cf6e751c36edb6d9692c48878b42c7f263/slides/images/missing_values_img_23.png


--------------------------------------------------------------------------------
/slides/images/missing_values_img_24.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/amueller/ml-workshop-1-of-4/917850cf6e751c36edb6d9692c48878b42c7f263/slides/images/missing_values_img_24.png


--------------------------------------------------------------------------------
/slides/images/missing_values_img_27.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/amueller/ml-workshop-1-of-4/917850cf6e751c36edb6d9692c48878b42c7f263/slides/images/missing_values_img_27.png


--------------------------------------------------------------------------------
/slides/images/nao.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/amueller/ml-workshop-1-of-4/917850cf6e751c36edb6d9692c48878b42c7f263/slides/images/nao.png


--------------------------------------------------------------------------------
/slides/images/no_information_leak_preprocessing.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/amueller/ml-workshop-1-of-4/917850cf6e751c36edb6d9692c48878b42c7f263/slides/images/no_information_leak_preprocessing.png


--------------------------------------------------------------------------------
/slides/images/no_separate_scaling.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/amueller/ml-workshop-1-of-4/917850cf6e751c36edb6d9692c48878b42c7f263/slides/images/no_separate_scaling.png


--------------------------------------------------------------------------------
/slides/images/overfitting_underfitting_cartoon_full.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/amueller/ml-workshop-1-of-4/917850cf6e751c36edb6d9692c48878b42c7f263/slides/images/overfitting_underfitting_cartoon_full.png


--------------------------------------------------------------------------------
/slides/images/overfitting_underfitting_cartoon_generalization.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/amueller/ml-workshop-1-of-4/917850cf6e751c36edb6d9692c48878b42c7f263/slides/images/overfitting_underfitting_cartoon_generalization.png


--------------------------------------------------------------------------------
/slides/images/overfitting_underfitting_cartoon_train.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/amueller/ml-workshop-1-of-4/917850cf6e751c36edb6d9692c48878b42c7f263/slides/images/overfitting_underfitting_cartoon_train.png


--------------------------------------------------------------------------------
/slides/images/pipeline.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/amueller/ml-workshop-1-of-4/917850cf6e751c36edb6d9692c48878b42c7f263/slides/images/pipeline.png


--------------------------------------------------------------------------------
/slides/images/propublica_compas.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/amueller/ml-workshop-1-of-4/917850cf6e751c36edb6d9692c48878b42c7f263/slides/images/propublica_compas.png


--------------------------------------------------------------------------------
/slides/images/reinforcement_cycle.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/amueller/ml-workshop-1-of-4/917850cf6e751c36edb6d9692c48878b42c7f263/slides/images/reinforcement_cycle.png


--------------------------------------------------------------------------------
/slides/images/row_nan_col_nan.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/amueller/ml-workshop-1-of-4/917850cf6e751c36edb6d9692c48878b42c7f263/slides/images/row_nan_col_nan.png


--------------------------------------------------------------------------------
/slides/images/scaler_comparison_scatter.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/amueller/ml-workshop-1-of-4/917850cf6e751c36edb6d9692c48878b42c7f263/slides/images/scaler_comparison_scatter.png


--------------------------------------------------------------------------------
/slides/images/shuffle_split_cv.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/amueller/ml-workshop-1-of-4/917850cf6e751c36edb6d9692c48878b42c7f263/slides/images/shuffle_split_cv.png


--------------------------------------------------------------------------------
/slides/images/sklearn-docs.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/amueller/ml-workshop-1-of-4/917850cf6e751c36edb6d9692c48878b42c7f263/slides/images/sklearn-docs.png


--------------------------------------------------------------------------------
/slides/images/sklearn_logo.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/amueller/ml-workshop-1-of-4/917850cf6e751c36edb6d9692c48878b42c7f263/slides/images/sklearn_logo.png


--------------------------------------------------------------------------------
/slides/images/spotify.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/amueller/ml-workshop-1-of-4/917850cf6e751c36edb6d9692c48878b42c7f263/slides/images/spotify.png


--------------------------------------------------------------------------------
/slides/images/stratified_cv.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/amueller/ml-workshop-1-of-4/917850cf6e751c36edb6d9692c48878b42c7f263/slides/images/stratified_cv.png


--------------------------------------------------------------------------------
/slides/images/supervised-ml-api.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/amueller/ml-workshop-1-of-4/917850cf6e751c36edb6d9692c48878b42c7f263/slides/images/supervised-ml-api.png


--------------------------------------------------------------------------------
/slides/images/supervised-ml-workflow.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/amueller/ml-workshop-1-of-4/917850cf6e751c36edb6d9692c48878b42c7f263/slides/images/supervised-ml-workflow.png


--------------------------------------------------------------------------------
/slides/images/threefold_split.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/amueller/ml-workshop-1-of-4/917850cf6e751c36edb6d9692c48878b42c7f263/slides/images/threefold_split.png


--------------------------------------------------------------------------------
/slides/images/time_series_cv.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/amueller/ml-workshop-1-of-4/917850cf6e751c36edb6d9692c48878b42c7f263/slides/images/time_series_cv.png


--------------------------------------------------------------------------------
/slides/images/train-test-split.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/amueller/ml-workshop-1-of-4/917850cf6e751c36edb6d9692c48878b42c7f263/slides/images/train-test-split.png


--------------------------------------------------------------------------------
/slides/images/train_test_set_2d_classification.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/amueller/ml-workshop-1-of-4/917850cf6e751c36edb6d9692c48878b42c7f263/slides/images/train_test_set_2d_classification.png


--------------------------------------------------------------------------------
/slides/images/train_test_split_new.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/amueller/ml-workshop-1-of-4/917850cf6e751c36edb6d9692c48878b42c7f263/slides/images/train_test_split_new.png


--------------------------------------------------------------------------------
/slides/images/train_test_validation_split.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/amueller/ml-workshop-1-of-4/917850cf6e751c36edb6d9692c48878b42c7f263/slides/images/train_test_validation_split.png


--------------------------------------------------------------------------------
/slides/images/unsupervised-ml-workflow.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/amueller/ml-workshop-1-of-4/917850cf6e751c36edb6d9692c48878b42c7f263/slides/images/unsupervised-ml-workflow.png


--------------------------------------------------------------------------------
/slides/images/unsupervised_ml_api.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/amueller/ml-workshop-1-of-4/917850cf6e751c36edb6d9692c48878b42c7f263/slides/images/unsupervised_ml_api.png


--------------------------------------------------------------------------------
/slides/sklearn_logo.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/amueller/ml-workshop-1-of-4/917850cf6e751c36edb6d9692c48878b42c7f263/slides/sklearn_logo.png


--------------------------------------------------------------------------------
/slides/style.css:
--------------------------------------------------------------------------------
  1 | body {
  2 | font-family: 'Muli';
  3 | font-size: 140%;
  4 | }
  5 | h1, h2 {
  6 | font-family: 'Garamond';
  7 | font-weight: normal;
  8 | margin-top: 10px;
  9 | margin-bottom: 10px;
 10 | }
 11 | .remark-slide-content h1 {
 12 | font-size: 70px;
 13 | text-align: center;
 14 | }
 15 | .remark-slide-content p, .remark-slide-content li {
 16 | font-size:30px;
 17 | line-height: 1.4;
 18 | }
 19 | .remark-code {
 20 | font-size:30px;
 21 | }
 22 | .remark-slide-content p {
 23 |     margin: 5px;
 24 | }
 25 | .remark-slide-container .spacious p,
 26 | .remark-slide-container .spacious li{
 27 |     margin-bottom: 50px;
 28 |     margin-top: 50px;
 29 | }
 30 | .remark-slide-container .spacious h1{
 31 |     margin-bottom: 50px;
 32 | }
 33 | .remark-slide-container .some-space p,
 34 | .remark-slide-container .some-space li,
 35 | .remark-slide-container .some-space h1{
 36 |     margin-bottom: 30px;
 37 | }
 38 | .reset-column {
 39 |     overflow: auto;
 40 |     width: 100%;
 41 | }
 42 | .remark-slide-container .compact p, .remark-slide-container .compact li, .remark-slide-container .compact pre{
 43 | line-height: 1.1;
 44 | margin: 0px 0;
 45 | }
 46 | .remark-slide-container .compact .MathJax_Display{
 47 |     line-height: 1.1;
 48 |     margin: 1px 0;
 49 |     }
 50 | .remark-slide-container .compact h1{
 51 | margin-bottom: 3px;
 52 | }
 53 | .padding-top {
 54 |     padding-top: 100px;
 55 | }
 56 | .remark-slide-content .smaller p, .remark-slide-content .smaller p .MathJax, .remark-slide-content .smaller li,
 57 | .remark-slide-content .smaller .remark-code,  .smaller .remark-code-line,.remark-slide-content .smaller a,
 58 |  .remark-slide-content .smaller .dataframe{
 59 |     font-size: 25px;
 60 | }
 61 | 
 62 | .remark-slide-content .smallest p, .remark-slide-content .smallest .MathJax, .remark-slide-content .smallest li, .remark-slide-content .smallest .remark-code,
 63 | .smallest .remark-code-line, .remark-slide-content .smallest .dataframe, .remark-slide-content  span.smallest{
 64 | font-size: 20px;
 65 | }
 66 | .remark-slide-content .tiny p, .remark-slide-content .tiny li, .remark-slide-content .tiny .remark-code,
 67 | .tiny .remark-code-line, .remark-slide-content .tiny .dataframe{
 68 | font-size: 16px;
 69 | }
 70 | .normal {
 71 |     font-size: 30px;
 72 | }
 73 | .quote_author {
 74 |     display: block;
 75 |     text-align: right;
 76 |     margin-top: 20px;
 77 |     font-size: 30px;
 78 |     font-family: 'Garamond';
 79 | }
 80 | .larger, .larger .remark-code {
 81 |     font-size: 40px;
 82 | }
 83 | .largest, .largest .remark-code {
 84 |     font-size: 50px;
 85 | }
 86 | .left-column, .right-column {
 87 |     width: 48%;
 88 | }
 89 | .right-column{
 90 |     float: right;
 91 | }
 92 | .left-column{
 93 |     float: left;
 94 | }
 95 | .clear-column{
 96 |     clear: both;
 97 | }
 98 | .narrow-right-column {
 99 |     float: right;
100 |     width: 32%
101 | }
102 | .wide-left-column {
103 |     float: left;
104 |     width: 65%
105 | }
106 | .narrow-left-column {
107 |     float: left;
108 |     width: 32%
109 | }
110 | .wide-right-column {
111 |     float: right;
112 |     width: 65%
113 | }
114 | 
115 | .invisible {
116 |     visibility: hidden
117 | }
118 | .tiny-code .remark-code, .remark-inline-code .tiny-code{
119 | font-size: 15px;
120 | }
121 | .remark-code, .remark-inline-code  { font-family: 'Ubuntu Mono';}
122 | .hljs.remark-code {background: #e0e0e0}
123 | 
124 | /* Some additional styling taken form the Jupyter notebook CSS */
125 | table.dataframe {
126 | border: none;
127 | border-collapse: collapse;
128 | border-spacing: 0;
129 | color: black;
130 | table-layout: fixed;
131 | }
132 | table.dataframe thead {
133 | border-bottom: 1px solid black;
134 | vertical-align: bottom;
135 | }
136 | table.dataframe tr,
137 | table.dataframe th,
138 | table.dataframe td {
139 | text-align: right;
140 | vertical-align: middle;
141 | padding: 0.5em 0.5em;
142 | line-height: normal;
143 | white-space: normal;
144 | max-width: none;
145 | border: none;
146 | }
147 | table.dataframe th {
148 | font-weight: bold;
149 | }
150 | table.dataframe tbody tr:nth-child(odd) {
151 | background: #f5f5f5;
152 | }
153 | table.dataframe tbody tr:hover {
154 | background: rgba(66, 165, 245, 0.2);
155 | }


--------------------------------------------------------------------------------
/todo.rst:
--------------------------------------------------------------------------------
 1 | linear models for classification: adult solution gives weird coefficients!
 2 | 
 3 | 3.2: grid-searching pipeline non-deterministic bad result?
 4 | 3.1: bank campaign data too big for this grid-search?
 5 | 
 6 | handle-unknown=ignore for cross-validating one-hot encoder!
 7 | 
 8 | Adult classification solution 1.3 using columntransformer - actually not? leave for day 2?
 9 | 2.2 cross validation instead of crossval score?
10 | Cross validation slides too complicated.
11 | Linear regression: delete most of the notebook.
12 | Remove most of the linear regression slides.
13 | Host the slides so I can link to them.
14 | Handle unknown in onehot in cv. Didn't get to imbalanced data.
15 | Sgdclassifier exercise not interesting without explanation of sgd.
16 | CSV iterator typo on slide
17 | change open office slides to rst?
18 | Bad error message on changing in init (clone breaks weird).
19 | Text data notebook and slides redundant.
20 | Many slides and notebooks redundant?!
21 | Clean up writing own estimator
22 | 
23 | 
24 | make slides work offline?
25 | add power transformer, ranking transformer? better preprocessing?
26 | make sure the requirements include imblearn xgboost
27 | replace boston housing by ames?
28 | 
29 | For 1/4
30 | preprocessing: sync notebook and slides! story!
31 | exercise for imputation
32 | 
33 | 
34 | 
35 | For 2/4
36 | Exercise for review: use column transformer! BROKEN?
37 | gridsearch: use data that makes it instantaneous?
38 | no notebooks for forests?!?!
39 | elasticnet to linear models?
40 | linear models for classification notebook bleak
41 | 
42 | For 3/4
43 | No solutions at all?
44 | evaluation metrics notebook comes from book, use updated book code? it's kinda weird, make sure to sync with slides
45 | 03-model-evaluation.html#28 -> update report with averages!
46 | 
47 | For 4/4
48 | structure?
49 | efficient parameter tuning?
50 | text data: make sure it runs quickly enough!
51 | 
52 | 
53 | NEW
54 | ===
55 | 
56 | 1/4
57 | preprocessing and imputation need some syncing
58 | all need check for images
59 | 
60 | .format is the enemy
61 | 
62 | 
63 | EXERCISE  & NOTEBOOK FOR FORESTS!!!


--------------------------------------------------------------------------------