├── .ipynb_checkpoints ├── Chapter 1 - Vectors, Matrices, and Arrays-checkpoint.ipynb ├── Chapter 11 - Model Evaluation-checkpoint.ipynb ├── Chapter 12 - Model Selection-checkpoint.ipynb ├── Chapter 13 - Linear Regression-checkpoint.ipynb ├── Chapter 14 - Trees and Forests-checkpoint.ipynb ├── Chapter 15 - K-Nearest Neighbors-checkpoint.ipynb ├── Chapter 16 - Logistic Regression-checkpoint.ipynb ├── Chapter 17 - Support Vector Machines-checkpoint.ipynb ├── Chapter 18 - Naive Bayes-checkpoint.ipynb ├── Chapter 19 - Clustering-checkpoint.ipynb ├── Chapter 2 - Loading Data-checkpoint.ipynb ├── Chapter 21 - Saving and Loading Trained Models-checkpoint.ipynb ├── Chapter 3 - Data Wrangling-checkpoint.ipynb ├── Chapter 4 - Handling Numerical Data-checkpoint.ipynb ├── Chapter 5 - Handling Categorical Data-checkpoint.ipynb ├── Chapter 6 - Handling Text-checkpoint.ipynb └── Chapter 7 - Handling Dates and Times-checkpoint.ipynb ├── Chapter 1 - Vectors, Matrices, and Arrays.ipynb ├── Chapter 11 - Model Evaluation.ipynb ├── Chapter 12 - Model Selection.ipynb ├── Chapter 13 - Linear Regression.ipynb ├── Chapter 14 - Trees and Forests.ipynb ├── Chapter 15 - K-Nearest Neighbors.ipynb ├── Chapter 16 - Logistic Regression.ipynb ├── Chapter 17 - Support Vector Machines.ipynb ├── Chapter 18 - Naive Bayes.ipynb ├── Chapter 19 - Clustering.ipynb ├── Chapter 2 - Loading Data.ipynb ├── Chapter 21 - Saving and Loading Trained Models.ipynb ├── Chapter 3 - Data Wrangling.ipynb ├── Chapter 4 - Handling Numerical Data.ipynb ├── Chapter 5 - Handling Categorical Data.ipynb ├── Chapter 6 - Handling Text.ipynb ├── Chapter 7 - Handling Dates and Times.ipynb ├── README.md ├── environment.yml ├── model.pkl ├── requirements.txt └── sample.db /.ipynb_checkpoints/Chapter 13 - Linear Regression-checkpoint.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "## Chapter 13\n", 8 | "---\n", 9 | "# Linear Regression\n", 10 | "\n", 11 | "### 13.0 Introduction\n", 12 | "Linear regression is one of the simplest supervised learning algorithms in our toolkit. If you have ever taken an introductory statistics course in college, likely the final topic you covered was linear regression. In fact, it is so simple that it is sometimes not considered machine learning at all!\n", 13 | "\n", 14 | "Whatever you believe, the fact is that linear regression--and its extensions--continues to be a common and useful method of making predictions when the target vector is a quantitative value (e.g. 
home price, age)\n", 15 | "\n", 16 | "### 13.1 Fitting a Line\n", 17 | "#### Problem\n", 18 | "You want to train a model that represents a linear relationship between the feature and target vector.\n", 19 | "\n", 20 | "#### Solution\n", 21 | "Use a linear regression (`LinearRegression` in scikit-learn)" 22 | ] 23 | }, 24 | { 25 | "cell_type": "code", 26 | "execution_count": 1, 27 | "metadata": { 28 | "collapsed": true 29 | }, 30 | "outputs": [], 31 | "source": [ 32 | "from sklearn.linear_model import LinearRegression\n", 33 | "from sklearn.datasets import load_boston\n", 34 | "\n", 35 | "boston = load_boston()\n", 36 | "features = boston.data[:, 0:2]\n", 37 | "target = boston.target\n", 38 | "\n", 39 | "regression = LinearRegression()\n", 40 | "\n", 41 | "model = regression.fit(features, target)" 42 | ] 43 | }, 44 | { 45 | "cell_type": "markdown", 46 | "metadata": {}, 47 | "source": [ 48 | "### 13.4 Reducing Variance with Regularization\n", 49 | "#### Problem\n", 50 | "You want to reduce the variance of your linear regression model\n", 51 | "\n", 52 | "#### Solution\n", 53 | "Use a learning algorithm that includes a *shrinkage penalty* (also called **regularization**) like ridge regression and lasso regression:" 54 | ] 55 | }, 56 | { 57 | "cell_type": "code", 58 | "execution_count": 2, 59 | "metadata": { 60 | "collapsed": false 61 | }, 62 | "outputs": [], 63 | "source": [ 64 | "from sklearn.linear_model import Ridge\n", 65 | "from sklearn.datasets import load_boston\n", 66 | "from sklearn.preprocessing import StandardScaler\n", 67 | "\n", 68 | "boston = load_boston()\n", 69 | "features = boston.data\n", 70 | "target = boston.target\n", 71 | "\n", 72 | "scaler = StandardScaler()\n", 73 | "features_standardized = scaler.fit_transform(features)\n", 74 | "\n", 75 | "regression = Ridge(alpha=0.5)\n", 76 | "model = regression.fit(features_standardized, target)" 77 | ] 78 | }, 79 | { 80 | "cell_type": "markdown", 81 | "metadata": {}, 82 | "source": [ 83 | "#### Discussion\n", 84 | "In standard linear regression the model trains to minimize the sum of squared error between the true($y_i$) and prediction ($\\hat y_i$) target values, or residual sum of squares (RSS):\n", 85 | "$$\n", 86 | "RSS = \\sum_{i=1}^n{(y_i - \\hat y_i)^2}\n", 87 | "$$\n", 88 | "\n", 89 | "Regularized regression learners are similar, except they attempt to minimize RSS and some penalty for the total size of the coefficient values, called a shrinkage penalty because it attempts to \"shrink\" the model. There are two common types of regularized learners for linear regression: ridge regression and the lasso. The only formal difference is the type of shrinkage penalty used. In ridge regression, the shrinkage penalty is a tuning hyperparameter multiplied by the squared sum of all coefficients:\n", 90 | "$$\n", 91 | "RSS+\\alpha \\sum_{j=1}^p{\\hat \\beta_j^2}\n", 92 | "$$\n", 93 | "\n", 94 | "where $\\hat \\beta_j$ is the coefficient of the jth of p features and $\\alpha$ is a hyperparameter (discussed next). The lasso is similar, except the shrinkage penalty is a tuning hyperparmeter multiplied by the squared sum of all coefficients:\n", 95 | "$$\n", 96 | "\\frac{1}{2n} RSS + \\alpha \\sum_{j=1}^p{|\\beta_j|}\n", 97 | "$$\n", 98 | "\n", 99 | "where n is the number of observations. So which one should we use? A a very general rule of thumb, ridge regression often produces slightly better predictions than lasso, but lasso (for reasons we will discuss in Recipe 13.5) produces more interpretable models. 
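To make the objective above concrete, the short sketch below (an editorial addition, not a cell from the original notebook) computes the RSS and both penalty terms by hand for the ridge model fitted in the preceding cell; note that ridge penalizes the squared coefficients, whereas the lasso formula above penalizes their absolute values. It assumes `model`, `features_standardized`, and `target` from the Ridge cell are still in scope.

```python
import numpy as np

# Residual sum of squares for the fitted ridge model
predictions = model.predict(features_standardized)
rss = np.sum((target - predictions) ** 2)

# Shrinkage penalties on the learned coefficients (alpha=0.5, as above)
alpha = 0.5
ridge_penalty = alpha * np.sum(model.coef_ ** 2)      # squared coefficients (L2)
lasso_penalty = alpha * np.sum(np.abs(model.coef_))   # absolute values (L1)

print(rss, ridge_penalty, lasso_penalty)
```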
If we want a balance between, ridge and lasso's penalty functions we can use elastic net, which is simply a regression model with both penalties included. Regardless of which one we use, bot hridge and lasso regresions can penalize large or complex models by including coefficient values in the loss funciton we are trying to minimize\n", 100 | "\n", 101 | "The hyper parameter $\\alpha$ lets us control how much we penalize the coefficients, with higher values of $\\alpha$ creating simpler models. The ideal value of $\\alpha$ should be tuned like any other hyperparameter. In scikit-learn, $\\alpha$ is set using the alpha parameter.\n", 102 | "\n", 103 | "scikit-learn includes a RidgeCV method that allows us to select the ideal value for $\\alpha:" 104 | ] 105 | }, 106 | { 107 | "cell_type": "code", 108 | "execution_count": 3, 109 | "metadata": { 110 | "collapsed": false 111 | }, 112 | "outputs": [ 113 | { 114 | "data": { 115 | "text/plain": [ 116 | "array([-0.91215884, 1.0658758 , 0.11942614, 0.68558782, -2.03231631,\n", 117 | " 2.67922108, 0.01477326, -3.0777265 , 2.58814315, -2.00973173,\n", 118 | " -2.05390717, 0.85614763, -3.73565106])" 119 | ] 120 | }, 121 | "execution_count": 3, 122 | "metadata": {}, 123 | "output_type": "execute_result" 124 | } 125 | ], 126 | "source": [ 127 | "from sklearn.linear_model import RidgeCV\n", 128 | "\n", 129 | "regr_cv = RidgeCV(alphas=[0.1, 1.0, 10.0])\n", 130 | "\n", 131 | "model_cv = regr_cv.fit(features_standardized, target)\n", 132 | "\n", 133 | "model_cv.coef_" 134 | ] 135 | }, 136 | { 137 | "cell_type": "code", 138 | "execution_count": 4, 139 | "metadata": { 140 | "collapsed": false 141 | }, 142 | "outputs": [ 143 | { 144 | "data": { 145 | "text/plain": [ 146 | "1.0" 147 | ] 148 | }, 149 | "execution_count": 4, 150 | "metadata": {}, 151 | "output_type": "execute_result" 152 | } 153 | ], 154 | "source": [ 155 | "# view alpha\n", 156 | "model_cv.alpha_" 157 | ] 158 | }, 159 | { 160 | "cell_type": "markdown", 161 | "metadata": {}, 162 | "source": [ 163 | "One final note: because in linear regression the value of the coefficients is partially determined by the scale of the feature, and in regularized models all coefficients are summed together, we must make sure to standardize the feature prior to training\n", 164 | "\n", 165 | "### 13.5 Reducing Features with Lasso Regression\n", 166 | "#### Problem\n", 167 | "You want to simplify your linear regression model by reducing the number of features.\n", 168 | "\n", 169 | "#### Solution\n", 170 | "Use a lasso regression" 171 | ] 172 | }, 173 | { 174 | "cell_type": "code", 175 | "execution_count": 5, 176 | "metadata": { 177 | "collapsed": true 178 | }, 179 | "outputs": [], 180 | "source": [ 181 | "from sklearn.linear_model import Lasso\n", 182 | "from sklearn.datasets import load_boston\n", 183 | "from sklearn.preprocessing import StandardScaler\n", 184 | "\n", 185 | "boston = load_boston()\n", 186 | "features = boston.data\n", 187 | "target = boston.target\n", 188 | "\n", 189 | "scaler = StandardScaler()\n", 190 | "features_standardized = scaler.fit_transform(features)\n", 191 | "\n", 192 | "regression = Lasso(alpha=0.5)\n", 193 | "model = regression.fit(features_standardized, target)" 194 | ] 195 | }, 196 | { 197 | "cell_type": "markdown", 198 | "metadata": {}, 199 | "source": [ 200 | "#### Discussion\n", 201 | "One interesting characteristic of lasso regression's penalty is that it can shrink the coefficients of a model to zero, effectively reducing the number of features in the model. 
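As an aside on the elastic net mentioned above, scikit-learn exposes it as `ElasticNet`; here is a minimal illustrative sketch (the `alpha` and `l1_ratio` values are arbitrary, and the same Boston data as the surrounding cells is assumed). The lasso-specific zeroing behavior is shown in the cells that follow.

```python
from sklearn.linear_model import ElasticNet
from sklearn.datasets import load_boston
from sklearn.preprocessing import StandardScaler

boston = load_boston()
features_standardized = StandardScaler().fit_transform(boston.data)
target = boston.target

# l1_ratio blends the two penalties: near 0 behaves like ridge, 1 is the lasso
elastic_net = ElasticNet(alpha=0.5, l1_ratio=0.5)
model_enet = elastic_net.fit(features_standardized, target)
model_enet.coef_
```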
For example, in our solution we set `alpha` to 0.5 and we can see that many of the coefficients are 0, meaning their corresponding features are not used in the model:" 202 | ] 203 | }, 204 | { 205 | "cell_type": "code", 206 | "execution_count": 6, 207 | "metadata": { 208 | "collapsed": false 209 | }, 210 | "outputs": [ 211 | { 212 | "data": { 213 | "text/plain": [ 214 | "array([-0.10697735, 0. , -0. , 0.39739898, -0. ,\n", 215 | " 2.97332316, -0. , -0.16937793, -0. , -0. ,\n", 216 | " -1.59957374, 0.54571511, -3.66888402])" 217 | ] 218 | }, 219 | "execution_count": 6, 220 | "metadata": {}, 221 | "output_type": "execute_result" 222 | } 223 | ], 224 | "source": [ 225 | "model.coef_" 226 | ] 227 | }, 228 | { 229 | "cell_type": "markdown", 230 | "metadata": {}, 231 | "source": [ 232 | "However if we increase $\\alpha$ to a much higher value, we see that lierally none of the features are being used:" 233 | ] 234 | }, 235 | { 236 | "cell_type": "code", 237 | "execution_count": 7, 238 | "metadata": { 239 | "collapsed": false 240 | }, 241 | "outputs": [ 242 | { 243 | "data": { 244 | "text/plain": [ 245 | "array([-0., 0., -0., 0., -0., 0., -0., 0., -0., -0., -0., 0., -0.])" 246 | ] 247 | }, 248 | "execution_count": 7, 249 | "metadata": {}, 250 | "output_type": "execute_result" 251 | } 252 | ], 253 | "source": [ 254 | "regression_a10 = Lasso(alpha=10)\n", 255 | "model_a10 = regression_a10.fit(features_standardized, target)\n", 256 | "model_a10.coef_" 257 | ] 258 | }, 259 | { 260 | "cell_type": "markdown", 261 | "metadata": {}, 262 | "source": [ 263 | "The practical benefit of this effect is that it means that we could include 100 features in our feature matrix and then, through adjusting lasso's $\\alpha$ hyperparameter, produce a model that uses only 10 (for instance) of the most important features. 
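A small sketch of that idea, added for illustration with the 13 Boston features already loaded rather than 100: sweep $\alpha$ and count how many coefficients survive. It assumes `features_standardized` and `target` from the cells above.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Count the non-zero (surviving) coefficients as the penalty grows
for alpha in [0.1, 0.5, 1.0, 5.0, 10.0]:
    lasso = Lasso(alpha=alpha).fit(features_standardized, target)
    print(alpha, np.sum(lasso.coef_ != 0))
```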
This lets us reduce variance whiel improving interpretability of our model (since fewer features is easier to explain)" 264 | ] 265 | }, 266 | { 267 | "cell_type": "code", 268 | "execution_count": null, 269 | "metadata": { 270 | "collapsed": true 271 | }, 272 | "outputs": [], 273 | "source": [] 274 | } 275 | ], 276 | "metadata": { 277 | "kernelspec": { 278 | "display_name": "Python [conda env:machine_learning_cookbook]", 279 | "language": "python", 280 | "name": "conda-env-machine_learning_cookbook-py" 281 | }, 282 | "language_info": { 283 | "codemirror_mode": { 284 | "name": "ipython", 285 | "version": 3 286 | }, 287 | "file_extension": ".py", 288 | "mimetype": "text/x-python", 289 | "name": "python", 290 | "nbconvert_exporter": "python", 291 | "pygments_lexer": "ipython3", 292 | "version": "3.6.6" 293 | } 294 | }, 295 | "nbformat": 4, 296 | "nbformat_minor": 2 297 | } 298 | -------------------------------------------------------------------------------- /.ipynb_checkpoints/Chapter 15 - K-Nearest Neighbors-checkpoint.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "## Chapter 15\n", 8 | "---\n", 9 | "# K-Nearest Neighbors\n", 10 | "\n", 11 | "An observation is predicted to be the class of that of the largest proportion of the k-nearest observations.\n", 12 | "\n", 13 | "## 15.1 Finding an Observation's Nearest Neighbors" 14 | ] 15 | }, 16 | { 17 | "cell_type": "code", 18 | "execution_count": 2, 19 | "metadata": { 20 | "collapsed": false 21 | }, 22 | "outputs": [ 23 | { 24 | "data": { 25 | "text/plain": [ 26 | "array([[[1.03800476, 0.56925129, 1.10395287, 1.1850097 ],\n", 27 | " [0.79566902, 0.33784833, 0.76275864, 1.05353673]]])" 28 | ] 29 | }, 30 | "execution_count": 2, 31 | "metadata": {}, 32 | "output_type": "execute_result" 33 | } 34 | ], 35 | "source": [ 36 | "from sklearn import datasets\n", 37 | "from sklearn.neighbors import NearestNeighbors\n", 38 | "from sklearn.preprocessing import StandardScaler\n", 39 | "\n", 40 | "iris = datasets.load_iris()\n", 41 | "features = iris.data\n", 42 | "\n", 43 | "standardizer = StandardScaler()\n", 44 | "\n", 45 | "features_standardized = standardizer.fit_transform(features)\n", 46 | "\n", 47 | "nearest_neighbors = NearestNeighbors(n_neighbors=2).fit(features_standardized)\n", 48 | "#nearest_neighbors_euclidian = NearestNeighbors(n_neighbors=2, metric='euclidian').fit(features_standardized)\n", 49 | "new_observation = [1, 1, 1, 1]\n", 50 | "\n", 51 | "distances, indices = nearest_neighbors.kneighbors([new_observation])\n", 52 | "\n", 53 | "features_standardized[indices]" 54 | ] 55 | }, 56 | { 57 | "cell_type": "markdown", 58 | "metadata": {}, 59 | "source": [ 60 | "### Discussion\n", 61 | "\n", 62 | "How do we measure distance?\n", 63 | "\n", 64 | "* Euclidian\n", 65 | "$$\n", 66 | "d_{euclidean} = \\sqrt{\\sum_{i=1}^{n}{(x_i - y_i)^2}}\n", 67 | "$$\n", 68 | "\n", 69 | "* Manhattan\n", 70 | "$$\n", 71 | "d_{manhattan} = \\sum_{i=1}^{n}{|x_i - y_i|}\n", 72 | "$$\n", 73 | "\n", 74 | "* Minkowski (default)\n", 75 | "$$\n", 76 | "d_{minkowski} = (\\sum_{i=1}^{n}{|x_i - y_i|^p})^{\\frac{1}{p}}\n", 77 | "$$\n", 78 | "## 15.2 Creating a K-Nearest Neighbor Classifier" 79 | ] 80 | }, 81 | { 82 | "cell_type": "code", 83 | "execution_count": 3, 84 | "metadata": { 85 | "collapsed": false 86 | }, 87 | "outputs": [ 88 | { 89 | "data": { 90 | "text/plain": [ 91 | "array([1, 2])" 92 | ] 93 | }, 94 | "execution_count": 3, 95 | 
"metadata": {}, 96 | "output_type": "execute_result" 97 | } 98 | ], 99 | "source": [ 100 | "from sklearn.neighbors import KNeighborsClassifier\n", 101 | "from sklearn.preprocessing import StandardScaler\n", 102 | "from sklearn import datasets\n", 103 | "\n", 104 | "iris = datasets.load_iris()\n", 105 | "X = iris.data\n", 106 | "y = iris.target\n", 107 | "\n", 108 | "standardizer = StandardScaler()\n", 109 | "\n", 110 | "X_std = standardizer.fit_transform(X)\n", 111 | "\n", 112 | "knn = KNeighborsClassifier(n_neighbors=5, n_jobs=-1).fit(X_std, y)\n", 113 | "\n", 114 | "new_observations = [[0.75, 0.75, 0.75, 0.75],\n", 115 | " [1, 1, 1, 1]]\n", 116 | "\n", 117 | "knn.predict(new_observations)" 118 | ] 119 | }, 120 | { 121 | "cell_type": "markdown", 122 | "metadata": {}, 123 | "source": [ 124 | "### Discussion\n", 125 | "In KNN, given an observation $x_u$, with an unknown target class, the algorithm first identifies the k closest observations (sometimes called $x_u$'s neighborhood) based on some distance metric, then these k observations \"vote\" based on their class and the class that wins the vote is $x_u$'s predicted class. More formally, the probability $x_u$ is some class j is:\n", 126 | "$$\n", 127 | "\\frac{1}{k} \\sum_{i \\in v}{I(y_i = j)}\n", 128 | "$$\n", 129 | "where v is the k observatoin in $x_u$'s neighborhood, $y_i$ is the class of the ith observation, and I is an indicator function (i.e., 1 is true, 0 otherwise). In scikit-learn we can see these probabilities using `predict_proba`\n", 130 | "\n", 131 | "## 15.3 Identifying the Best Neighborhood Size" 132 | ] 133 | }, 134 | { 135 | "cell_type": "code", 136 | "execution_count": 4, 137 | "metadata": { 138 | "collapsed": false 139 | }, 140 | "outputs": [ 141 | { 142 | "data": { 143 | "text/plain": [ 144 | "6" 145 | ] 146 | }, 147 | "execution_count": 4, 148 | "metadata": {}, 149 | "output_type": "execute_result" 150 | } 151 | ], 152 | "source": [ 153 | "from sklearn.neighbors import KNeighborsClassifier\n", 154 | "from sklearn import datasets\n", 155 | "from sklearn.preprocessing import StandardScaler\n", 156 | "from sklearn.pipeline import Pipeline, FeatureUnion\n", 157 | "from sklearn.model_selection import GridSearchCV\n", 158 | "\n", 159 | "iris = datasets.load_iris()\n", 160 | "features = iris.data\n", 161 | "target = iris.target\n", 162 | "\n", 163 | "standardizer = StandardScaler()\n", 164 | "features_standardized = standardizer.fit_transform(features)\n", 165 | "\n", 166 | "knn = KNeighborsClassifier(n_neighbors=5, n_jobs=-1)\n", 167 | "\n", 168 | "pipe = Pipeline([(\"standardizer\", standardizer), (\"knn\", knn)])\n", 169 | "\n", 170 | "search_space = [{\"knn__n_neighbors\": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]}]\n", 171 | "\n", 172 | "classifier = GridSearchCV(\n", 173 | " pipe, search_space, cv=5, verbose=0).fit(features_standardized, target)\n", 174 | "\n", 175 | "classifier.best_estimator_.get_params()[\"knn__n_neighbors\"]" 176 | ] 177 | }, 178 | { 179 | "cell_type": "markdown", 180 | "metadata": {}, 181 | "source": [ 182 | "## 15.4 Creating a Radius-Based Nearest Neighbor Classifier\n", 183 | "given an observation of unknown class, you need to predict its class based on the class of all observations within a certain distance." 
184 | ] 185 | }, 186 | { 187 | "cell_type": "code", 188 | "execution_count": 5, 189 | "metadata": { 190 | "collapsed": false 191 | }, 192 | "outputs": [ 193 | { 194 | "data": { 195 | "text/plain": [ 196 | "array([2])" 197 | ] 198 | }, 199 | "execution_count": 5, 200 | "metadata": {}, 201 | "output_type": "execute_result" 202 | } 203 | ], 204 | "source": [ 205 | "from sklearn.neighbors import RadiusNeighborsClassifier\n", 206 | "from sklearn.preprocessing import StandardScaler\n", 207 | "from sklearn import datasets\n", 208 | "\n", 209 | "iris = datasets.load_iris()\n", 210 | "features = iris.data\n", 211 | "target = iris.target\n", 212 | "\n", 213 | "standardizer = StandardScaler()\n", 214 | "features_standardized = standardizer.fit_transform(features)\n", 215 | "\n", 216 | "rnn = RadiusNeighborsClassifier(\n", 217 | " radius=.5, n_jobs=-1).fit(features_standardized, target)\n", 218 | "\n", 219 | "new_observations = [[1, 1, 1, 1]]\n", 220 | "\n", 221 | "rnn.predict(new_observations)" 222 | ] 223 | }, 224 | { 225 | "cell_type": "code", 226 | "execution_count": null, 227 | "metadata": { 228 | "collapsed": true 229 | }, 230 | "outputs": [], 231 | "source": [] 232 | } 233 | ], 234 | "metadata": { 235 | "kernelspec": { 236 | "display_name": "Python [conda env:machine_learning_cookbook]", 237 | "language": "python", 238 | "name": "conda-env-machine_learning_cookbook-py" 239 | }, 240 | "language_info": { 241 | "codemirror_mode": { 242 | "name": "ipython", 243 | "version": 3 244 | }, 245 | "file_extension": ".py", 246 | "mimetype": "text/x-python", 247 | "name": "python", 248 | "nbconvert_exporter": "python", 249 | "pygments_lexer": "ipython3", 250 | "version": "3.6.6" 251 | } 252 | }, 253 | "nbformat": 4, 254 | "nbformat_minor": 2 255 | } 256 | -------------------------------------------------------------------------------- /.ipynb_checkpoints/Chapter 16 - Logistic Regression-checkpoint.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "## Chapter 16\n", 8 | "---\n", 9 | "# Logistic Regression\n", 10 | "\n", 11 | "Despire being called a regression, logistic regression is actually a widely used supervised classification technique. 
\n", 12 | "Allows us to predict the probability that an observation is of a certain class\n", 13 | "\n", 14 | "## 16.1 Training a Binary Classifier" 15 | ] 16 | }, 17 | { 18 | "cell_type": "code", 19 | "execution_count": 3, 20 | "metadata": { 21 | "collapsed": false 22 | }, 23 | "outputs": [ 24 | { 25 | "name": "stdout", 26 | "output_type": "stream", 27 | "text": [ 28 | "model.predict: [1]\n", 29 | "model.predict_proba: [[0.18823041 0.81176959]]\n" 30 | ] 31 | } 32 | ], 33 | "source": [ 34 | "from sklearn.linear_model import LogisticRegression\n", 35 | "from sklearn import datasets\n", 36 | "from sklearn.preprocessing import StandardScaler\n", 37 | "\n", 38 | "iris = datasets.load_iris()\n", 39 | "features = iris.data[:100,:]\n", 40 | "target = iris.target[:100]\n", 41 | "\n", 42 | "scaler = StandardScaler()\n", 43 | "features_standardized = scaler.fit_transform(features)\n", 44 | "\n", 45 | "logistic_regression = LogisticRegression(random_state=0)\n", 46 | "model = logistic_regression.fit(features_standardized, target)\n", 47 | "\n", 48 | "new_observation = [[.5, .5, .5, .5]]\n", 49 | "\n", 50 | "print(\"model.predict: {}\".format(model.predict(new_observation)))\n", 51 | "print(\"model.predict_proba: {}\".format(model.predict_proba(new_observation)))" 52 | ] 53 | }, 54 | { 55 | "cell_type": "markdown", 56 | "metadata": {}, 57 | "source": [ 58 | "### Discussion\n", 59 | "Dispire having \"regression\" in its name, a logistic regression is actually a widely used binary lassifier (i.e. the target vector can only take two values). In a logistic regression, a linear model (e.g. $\\beta_0 + \\beta_i x$) is included in a logistic (also called sigmoid) function, $\\frac{1}{1+e^{-z }}$, such that:\n", 60 | "$$\n", 61 | "P(y_i = 1 | X) = \\frac{1}{1+e^{-(\\beta_0 + \\beta_1x)}}\n", 62 | "$$\n", 63 | "where $P(y_i = 1 | X)$ is the probability of the ith obsevation's target, $y_i$ being class 1, X is the training data, $\\beta_0$ and $\\beta_1$ are the parameters to be learned, and e is Euler's number. The effect of the logistic function is to constrain the value of the function's output to between 0 and 1 so that i can be interpreted as a probability. If $P(y_i = 1 | X)$ is greater than 0.5, class 1 is predicted; otherwise class 0 is predicted\n", 64 | "\n", 65 | "## 16.2 Training a Multiclass Classifier" 66 | ] 67 | }, 68 | { 69 | "cell_type": "code", 70 | "execution_count": 4, 71 | "metadata": { 72 | "collapsed": true 73 | }, 74 | "outputs": [], 75 | "source": [ 76 | "from sklearn.linear_model import LogisticRegression\n", 77 | "from sklearn import datasets\n", 78 | "from sklearn.preprocessing import StandardScaler\n", 79 | "\n", 80 | "iris = datasets.load_iris()\n", 81 | "features = iris.data\n", 82 | "target = iris.target\n", 83 | "\n", 84 | "scaler = StandardScaler()\n", 85 | "features_standardized = scaler.fit_transform(features)\n", 86 | "\n", 87 | "logistic_regression = LogisticRegression(random_state=0, multi_class=\"ovr\")\n", 88 | "#logistic_regression_MNL = LogisticRegression(random_state=0, multi_class=\"multinomial\")\n", 89 | "\n", 90 | "model = logistic_regression.fit(features_standardized, target)" 91 | ] 92 | }, 93 | { 94 | "cell_type": "markdown", 95 | "metadata": {}, 96 | "source": [ 97 | "### Discussion\n", 98 | "On their own, logistic regressions are only binary classifiers, meaning they cannot handle target vectors with more than two classes. However, two clever extensions to logistic regression do just that. 
First, in one-vs-rest logistic regression (OVR) a separate model is trained for each class predicted whether an observation is that class or not (thus making it a binary classification problem). It assumes that each observation problem (e.g. class 0 or not) is independent\n", 99 | "\n", 100 | "Alternatively in multinomial logistic regression (MLR) the logistic function we saw in Recipe 15.1 is replaced with a softmax function:\n", 101 | "$$\n", 102 | "P(y_I = k | X) = \\frac{e^{\\beta_k x_i}}{\\sum_{j=1}^{K}{e^{\\beta_j x_i}}}\n", 103 | "$$\n", 104 | "where $P(y_i = k | X)$ is the probability of the ith observation's target value, $y_i$, is class k, and K is the total number of classes. One practical advantage of the MLR is that its predicted probabilities using `predict_proba` method are more reliable\n", 105 | "\n", 106 | "We can switch to an MNL by setting `multi_class='multinomial'`\n", 107 | "\n", 108 | "## 16.3 Reducing Variance Through Regularization" 109 | ] 110 | }, 111 | { 112 | "cell_type": "code", 113 | "execution_count": 5, 114 | "metadata": { 115 | "collapsed": true 116 | }, 117 | "outputs": [], 118 | "source": [ 119 | "from sklearn.linear_model import LogisticRegressionCV\n", 120 | "from sklearn import datasets\n", 121 | "from sklearn.preprocessing import StandardScaler\n", 122 | "\n", 123 | "iris = datasets.load_iris()\n", 124 | "features = iris.data\n", 125 | "target = iris.target\n", 126 | "\n", 127 | "scaler = StandardScaler()\n", 128 | "features_standardized = scaler.fit_transform(features)\n", 129 | "\n", 130 | "logistic_regression = LogisticRegressionCV(\n", 131 | " penalty='l2', Cs=10, random_state=0, n_jobs=-1)\n", 132 | "\n", 133 | "model = logistic_regression.fit(features_standardized, target)" 134 | ] 135 | }, 136 | { 137 | "cell_type": "markdown", 138 | "metadata": {}, 139 | "source": [ 140 | "### Discussion\n", 141 | "Regularization is a method of penalizing complex models to reduce their variance. Specifically, a penalty term is added to the loss function we are trying to minimize typically the L1 and L2 penalties\n", 142 | "\n", 143 | "In the L1 penalty:\n", 144 | "$$\n", 145 | "\\alpha \\sum_{j=1}^{p}{|\\hat\\beta_j|}\n", 146 | "$$\n", 147 | "where $\\hat\\beta_j$ is the parameters of the jth of p features being learned and $\\alpha$ is a hyperparameter denoting the regularization strength.\n", 148 | "\n", 149 | "With the L2 penalty:\n", 150 | "$$\n", 151 | "\\alpha \\sum_{j=1}^{p}{\\hat\\beta_j^2}\n", 152 | "$$\n", 153 | "higher values of $\\alpha$ increase the penalty for larger parameter values(i.e. more complex models). scikit-learn follows the common method of using C instead of $\\alpha$ where C is the inverse of the regularization strength: $C = \\frac{1}{\\alpha}$. To reduce variance while using logistic regression, we can treat C as a hyperparameter to be tuned to find thevalue of C that creates the best model. 
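One way to do that tuning, sketched here as an addition (the candidate values and 5-fold cross-validation are arbitrary choices), is an ordinary grid search over C:

```python
import numpy as np
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler

iris = datasets.load_iris()
features_standardized = StandardScaler().fit_transform(iris.data)
target = iris.target

# Search candidate values of the inverse regularization strength C
grid = GridSearchCV(LogisticRegression(random_state=0),
                    {"C": np.logspace(-3, 3, 7)},
                    cv=5)
best_C = grid.fit(features_standardized, target).best_params_["C"]
```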
In scikit-learn we can use the `LogisticRegressionCV` class to efficiently tune C.\n", 154 | "\n", 155 | "## 16.4 Training a Classifier on Very Large Data" 156 | ] 157 | }, 158 | { 159 | "cell_type": "code", 160 | "execution_count": 6, 161 | "metadata": { 162 | "collapsed": true 163 | }, 164 | "outputs": [], 165 | "source": [ 166 | "from sklearn.linear_model import LogisticRegression\n", 167 | "from sklearn import datasets\n", 168 | "from sklearn.preprocessing import StandardScaler\n", 169 | "\n", 170 | "iris = datasets.load_iris()\n", 171 | "features = iris.data\n", 172 | "target = iris.target\n", 173 | "\n", 174 | "scaler = StandardScaler()\n", 175 | "features_standardized = scaler.fit_transform(features)\n", 176 | "\n", 177 | "logistic_regression = LogisticRegression(random_state=0, solver=\"sag\") # stochastic average gradient (SAG) solver\n", 178 | "model = logistic_regression.fit(features_standardized, target)" 179 | ] 180 | }, 181 | { 182 | "cell_type": "markdown", 183 | "metadata": {}, 184 | "source": [ 185 | "### Discussion\n", 186 | "scikit-learn's `LogisticRegression` offers a number of techniques for training a logistic regression, called solvers. Most of the time scikit-learn will select the best solver automatically for us or warn us we cannot do something with that solver.\n", 187 | "\n", 188 | "Stochastic averge gradient descent allows us to train a model much faster than other solvers when our data is very large. However, it is also very sensitive to feature scaling, so standardizing our features is particularly important\n", 189 | "\n", 190 | "### See Also\n", 191 | "* Minimizing Finite Sums with the Stochastic Average Gradient Algorithm, Mark Schmidt (http://www.birs.ca/workshops/2014/14w5003/files/schmidt.pdf)\n", 192 | "\n", 193 | "## 16.5 Handling Imbalanced Classes" 194 | ] 195 | }, 196 | { 197 | "cell_type": "code", 198 | "execution_count": 8, 199 | "metadata": { 200 | "collapsed": false 201 | }, 202 | "outputs": [], 203 | "source": [ 204 | "import numpy as np\n", 205 | "from sklearn.linear_model import LogisticRegression\n", 206 | "from sklearn import datasets\n", 207 | "from sklearn.preprocessing import StandardScaler\n", 208 | "\n", 209 | "iris = datasets.load_iris()\n", 210 | "features = iris.data[40:, :]\n", 211 | "target = iris.target[40:]\n", 212 | "\n", 213 | "target = np.where((target == 0), 0, 1)\n", 214 | "\n", 215 | "scaler = StandardScaler()\n", 216 | "features_standardized = scaler.fit_transform(features)\n", 217 | "\n", 218 | "logistic_regression = LogisticRegression(random_state=0, class_weight=\"balanced\")\n", 219 | "model = logistic_regression.fit(features_standardized, target)" 220 | ] 221 | }, 222 | { 223 | "cell_type": "markdown", 224 | "metadata": {}, 225 | "source": [ 226 | "### Discussion\n", 227 | "`LogisticRegression` comes with a built in method of handling imbalanced classes.\n", 228 | "`class_weight=\"balanced\"` will automatically weigh classes inversely proportional to their frequency:\n", 229 | "$$\n", 230 | "w_j = \\frac{n}{kn_j}\n", 231 | "$$\n", 232 | "where $w_j$ is the weight to class j, n is the number of observations, $n_j$ is the number of observations in class j, and k is the total number of classes" 233 | ] 234 | } 235 | ], 236 | "metadata": { 237 | "kernelspec": { 238 | "display_name": "Python [conda env:machine_learning_cookbook]", 239 | "language": "python", 240 | "name": "conda-env-machine_learning_cookbook-py" 241 | }, 242 | "language_info": { 243 | "codemirror_mode": { 244 | "name": "ipython", 245 | "version": 3 246 | 
}, 247 | "file_extension": ".py", 248 | "mimetype": "text/x-python", 249 | "name": "python", 250 | "nbconvert_exporter": "python", 251 | "pygments_lexer": "ipython3", 252 | "version": "3.6.6" 253 | } 254 | }, 255 | "nbformat": 4, 256 | "nbformat_minor": 2 257 | } 258 | -------------------------------------------------------------------------------- /.ipynb_checkpoints/Chapter 18 - Naive Bayes-checkpoint.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "## Chapter 18\n", 8 | "---\n", 9 | "# Naive Bayes\n", 10 | "\n", 11 | "### 18.0 Introduction\n", 12 | "Bayes' theorem is the premier method for understanding the probability of some event $P(A|B)$, given some new information, $P(B|A)$, and a prior belief in the probability of the event, P(A):\n", 13 | "$$\n", 14 | "P(A | B) = \\frac{P(B|A)P(A)}{P(B)}\n", 15 | "$$\n", 16 | "\n", 17 | "The Bayesian method's popularity has skyrocked in the last decade, more and more rivaling the traditional frequentist applications in academia, government, and business. In machine learning, one applicaiton of Bayes' theorem to classifican comes in the form of the naive Bayes classifier. Naive Bayes classifiers combine a number of desirable qualities in practical machine learning into a single classifier:\n", 18 | "\n", 19 | "1. An intuitive approach\n", 20 | "2. The ability to work with small data\n", 21 | "3. Low computation costs for training and prediction\n", 22 | "4. Often solid results in a variety of settigns\n", 23 | "\n", 24 | "Specifically, a naive bayes classifier is based on:\n", 25 | "$$\n", 26 | "P(y | x_1, ..., x_j) = \\frac{P(x_1, ..., x_j | y)P(y)}{P(x_1,...,x_j)}\n", 27 | "$$\n", 28 | "where,\n", 29 | "* $P(y | x_1, ..., x_j)$ is called the *posterior* and is the probability that an observation is class y given observation's values for the j features, $x_1, ..., x_j$\n", 30 | "* $P(x_1, ..., x_j)$ is called likelihood and is the *likelihood* of an observation's values for features, $x_1, ..., x_j$, given their class y.\n", 31 | "* $P(y)$ is called the *prior* and is our belief for the probability of class y before looking at the data\n", 32 | "* P($x_1, ..., x_j$) is called the *marginal probability*\n", 33 | "\n", 34 | "In naive Bayes, we compare an obsrvation's posterior values for each possible class. Specifically, because the marginal probability is constant across these comparisons, we compare the numerators of the posterior for each class. For each observation the class with the greatest posterior numerator becomes the predicted class, $\\hat y$.\n", 35 | "\n", 36 | "There are two important things to note about naive Bayes classifiers.\n", 37 | "\n", 38 | "1. for each feature in the data, we have to assume the statistical distribution of the likelihood, $P(x_1, ..., x_j)$.\n", 39 | "- the common distributions are the normal (Gaussian), multinomial, and Bernoulli distributions.\n", 40 | "- the distribution chose is often determined by the nature of the features (continuous, binary, etc.)\n", 41 | "\n", 42 | "2. naive Bayes gets its name because we assume that each feature, and its resulting likelihood, is independent. 
This \"naive\" assumption is frequently wrong, yet in practice does little to prevent building high quality classifiers\n", 43 | "\n", 44 | "## 18.1 Training a Classifier for Continuous Features" 45 | ] 46 | }, 47 | { 48 | "cell_type": "code", 49 | "execution_count": 3, 50 | "metadata": { 51 | "collapsed": false 52 | }, 53 | "outputs": [ 54 | { 55 | "data": { 56 | "text/plain": [ 57 | "array([1])" 58 | ] 59 | }, 60 | "execution_count": 3, 61 | "metadata": {}, 62 | "output_type": "execute_result" 63 | } 64 | ], 65 | "source": [ 66 | "from sklearn import datasets\n", 67 | "from sklearn.naive_bayes import GaussianNB\n", 68 | "\n", 69 | "iris = datasets.load_iris()\n", 70 | "features = iris.data\n", 71 | "target = iris.target\n", 72 | "\n", 73 | "classifier = GaussianNB()\n", 74 | "\n", 75 | "model = classifier.fit(features, target)\n", 76 | "\n", 77 | "new_observation = [[4, 4, 4, 0.4]]\n", 78 | "model.predict(new_observation)" 79 | ] 80 | }, 81 | { 82 | "cell_type": "markdown", 83 | "metadata": {}, 84 | "source": [ 85 | "### Discussion\n", 86 | "The most common type of naive Bayes classifier is the Gaussian naive Bayesa. In Gaussian naive Bayesam we assuem that the likelihood of the feature values, x, given an observation is of class y, follows a normal distribution:\n", 87 | "$$\n", 88 | "p(x_j | y) = \\frac{1}{\\sqrt{2\\pi \\sigma_y^2}} e^{-\\frac{(x_j - \\mu_y)^2}{2\\sigma_y^2}}\n", 89 | "$$\n", 90 | "where $\\sigma_y^2$ and $\\mu_y$ are the variance and mean values of feature x_j for class y. Because of the assumption of the normal distribution, Gaussian naive Bayes is best used in cases when all our features are continuous.\n", 91 | "\n", 92 | "One of the interesting aspects of naive Bayes classifiers is that they allow us to assign a prior belief over the respect target classes. We can do this using `GaussianNB`'s `priors` parameter, which takes in a list of the probabilities assigned to each class of the target vector" 93 | ] 94 | }, 95 | { 96 | "cell_type": "code", 97 | "execution_count": 4, 98 | "metadata": { 99 | "collapsed": true 100 | }, 101 | "outputs": [], 102 | "source": [ 103 | "clf = GaussianNB(priors=[0.25, 0.25, 0.5])\n", 104 | "model = classifier.fit(features, target)" 105 | ] 106 | }, 107 | { 108 | "cell_type": "markdown", 109 | "metadata": {}, 110 | "source": [ 111 | "### See Also\n", 112 | "* How the Naive Bayes Classifier Works (http://dataaspirant.com/2017/02/06/naive-bayes-classifier-machine-learning/)\n", 113 | "\n", 114 | "## 18.2 Training a Classifier for Discrete and Count Features\n" 115 | ] 116 | }, 117 | { 118 | "cell_type": "code", 119 | "execution_count": 6, 120 | "metadata": { 121 | "collapsed": false 122 | }, 123 | "outputs": [ 124 | { 125 | "data": { 126 | "text/plain": [ 127 | "array([0])" 128 | ] 129 | }, 130 | "execution_count": 6, 131 | "metadata": {}, 132 | "output_type": "execute_result" 133 | } 134 | ], 135 | "source": [ 136 | "import numpy as np\n", 137 | "from sklearn.naive_bayes import MultinomialNB\n", 138 | "from sklearn.feature_extraction.text import CountVectorizer\n", 139 | "\n", 140 | "text_data = np.array(['I love Brazil. 
Brazil!', 'Brazil is best', 'Germany beats both'])\n", 141 | "\n", 142 | "count = CountVectorizer()\n", 143 | "bag_of_words = count.fit_transform(text_data)\n", 144 | "\n", 145 | "features = bag_of_words.toarray()\n", 146 | "\n", 147 | "target = np.array([0, 0, 1])\n", 148 | "\n", 149 | "classifier = MultinomialNB(class_prior=[0.25, 0.5])\n", 150 | "model = classifier.fit(features, target)\n", 151 | "\n", 152 | "new_observation = [[0, 0, 0, 1, 0, 1, 0]]\n", 153 | "model.predict(new_observation)" 154 | ] 155 | }, 156 | { 157 | "cell_type": "markdown", 158 | "metadata": {}, 159 | "source": [ 160 | "### Discussion\n", 161 | "\n", 162 | "Multinomial naive Bayes works similarly to Gaussian naive Bayes, but the features are assumed to be multinomial distributed. In practice this means that this classifier is commonly used when we have discrete data. One of the most common uses is text classification using bag of words or tf-idf approaches\n", 163 | "\n", 164 | "## 18.3 Training a Naive Bayes Classifier for Binary Features" 165 | ] 166 | }, 167 | { 168 | "cell_type": "code", 169 | "execution_count": 7, 170 | "metadata": { 171 | "collapsed": true 172 | }, 173 | "outputs": [], 174 | "source": [ 175 | "import numpy as np\n", 176 | "from sklearn.naive_bayes import BernoulliNB\n", 177 | "\n", 178 | "features = np.random.randint(2, size=(100, 3))\n", 179 | "target = np.random.randint(2, size=(100, 1)).ravel()\n", 180 | "\n", 181 | "classifier = BernoulliNB(class_prior=[0.25, 0.5])\n", 182 | "model = classifier.fit(features, target)" 183 | ] 184 | }, 185 | { 186 | "cell_type": "markdown", 187 | "metadata": {}, 188 | "source": [ 189 | "The Bernoulli naive Bayes classifier assumes that all our features are binary such that they take only two values (e.g. a nominal categorical feature that has been one-hot encoded). 
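For instance, word presence/absence features can be produced with `CountVectorizer(binary=True)` and fed to `BernoulliNB`; the sketch below is an added illustration reusing the text data from Recipe 18.2.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB

text_data = np.array(['I love Brazil. Brazil!', 'Brazil is best', 'Germany beats both'])
target = np.array([0, 0, 1])

# binary=True records word presence/absence rather than counts
binary_bag_of_words = CountVectorizer(binary=True).fit_transform(text_data).toarray()

model = BernoulliNB().fit(binary_bag_of_words, target)
```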
Like its multinomial cousin, Bernoulli naive Bayes is often used in text classification, when our feature matrix is simply the presence or absence of a word in a document" 190 | ] 191 | } 192 | ], 193 | "metadata": { 194 | "kernelspec": { 195 | "display_name": "Python [conda env:machine_learning_cookbook]", 196 | "language": "python", 197 | "name": "conda-env-machine_learning_cookbook-py" 198 | }, 199 | "language_info": { 200 | "codemirror_mode": { 201 | "name": "ipython", 202 | "version": 3 203 | }, 204 | "file_extension": ".py", 205 | "mimetype": "text/x-python", 206 | "name": "python", 207 | "nbconvert_exporter": "python", 208 | "pygments_lexer": "ipython3", 209 | "version": "3.6.6" 210 | } 211 | }, 212 | "nbformat": 4, 213 | "nbformat_minor": 2 214 | } 215 | -------------------------------------------------------------------------------- /.ipynb_checkpoints/Chapter 19 - Clustering-checkpoint.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "## Chapter 19\n", 8 | "---\n", 9 | "# Clustering\n", 10 | "\n", 11 | "## 19.0 Introduction\n", 12 | "\n", 13 | "Frequently, we run into situations where we only know the features.\n", 14 | "\n", 15 | "The goal of clustering algorithms is to identify latent groupings of obesrvations, which if done well, allow us to predict the class of observations even without a target vector.\n", 16 | "\n", 17 | "## 19.1 Clustering Using K-Means" 18 | ] 19 | }, 20 | { 21 | "cell_type": "code", 22 | "execution_count": 2, 23 | "metadata": { 24 | "collapsed": false 25 | }, 26 | "outputs": [ 27 | { 28 | "data": { 29 | "text/plain": [ 30 | "array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,\n", 31 | " 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,\n", 32 | " 1, 1, 1, 1, 1, 1, 0, 0, 0, 2, 2, 2, 0, 2, 2, 2, 2, 2, 2, 2, 2, 0,\n", 33 | " 2, 2, 2, 2, 0, 2, 2, 2, 2, 0, 0, 0, 2, 2, 2, 2, 2, 2, 2, 0, 0, 2,\n", 34 | " 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 0, 2, 0, 0, 0, 0, 2, 0, 0, 0,\n", 35 | " 0, 0, 0, 2, 2, 0, 0, 0, 0, 2, 0, 2, 0, 2, 0, 0, 2, 0, 0, 0, 0, 0,\n", 36 | " 0, 2, 2, 0, 0, 0, 2, 0, 0, 0, 2, 0, 0, 0, 2, 0, 0, 2], dtype=int32)" 37 | ] 38 | }, 39 | "execution_count": 2, 40 | "metadata": {}, 41 | "output_type": "execute_result" 42 | } 43 | ], 44 | "source": [ 45 | "from sklearn import datasets\n", 46 | "from sklearn.preprocessing import StandardScaler\n", 47 | "from sklearn.cluster import KMeans\n", 48 | "\n", 49 | "iris = datasets.load_iris()\n", 50 | "features = iris.data\n", 51 | "\n", 52 | "scaler = StandardScaler()\n", 53 | "features_std = scaler.fit_transform(features)\n", 54 | "\n", 55 | "cluster = KMeans(n_clusters=3, random_state=0, n_jobs=-1)\n", 56 | "model = cluster.fit(features_std)\n", 57 | "\n", 58 | "model.labels_" 59 | ] 60 | }, 61 | { 62 | "cell_type": "markdown", 63 | "metadata": {}, 64 | "source": [ 65 | "### Discussion\n", 66 | "k-means clustering is one of the most common clustering techniques. In k-means clustering, the algorithm attempts to group observations into k groups, with each group having roughly equal variance. The number of groups, k, is specified by the user as a hyperparameter. Specifically, in k-means:\n", 67 | "\n", 68 | "1. k cluster \"center\" points are created at random locations.\n", 69 | "\n", 70 | "2. For each observation:\n", 71 | " a. the distance between each observaiton and the k center points is calculated\n", 72 | " b. 
the observation is assigned to the cluster of the nearest center point\n", 73 | " \n", 74 | "3. The center points are moved to the means (i.e., centers) of their respective clusters\n", 75 | "\n", 76 | "4. Steps 2 and 3 are repeated until no observation changes in cluster membership\n", 77 | "\n", 78 | "k-means clustering assumes:\n", 79 | "* the clusters are convex shaped (e.g. a circle, a sphere).\n", 80 | "* all features are equally scaled\n", 81 | "* the groups are balanced\n", 82 | "\n", 83 | "### See Also\n", 84 | "* Introduction to K-means Clustering (https://www.datascience.com/blog/k-means-clustering)\n", 85 | "\n", 86 | "## 19.2 Speeding Up K-Means Clustering" 87 | ] 88 | }, 89 | { 90 | "cell_type": "code", 91 | "execution_count": 3, 92 | "metadata": { 93 | "collapsed": false 94 | }, 95 | "outputs": [ 96 | { 97 | "data": { 98 | "text/plain": [ 99 | "array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,\n", 100 | " 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,\n", 101 | " 1, 1, 1, 1, 1, 1, 2, 2, 2, 0, 0, 0, 2, 0, 2, 0, 0, 0, 0, 0, 0, 2,\n", 102 | " 0, 0, 0, 0, 2, 0, 0, 0, 0, 2, 2, 2, 0, 0, 0, 0, 0, 0, 0, 2, 2, 0,\n", 103 | " 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 2, 2, 2, 2, 0, 2, 2, 2,\n", 104 | " 2, 2, 2, 0, 2, 2, 2, 2, 2, 0, 2, 0, 2, 0, 2, 2, 0, 2, 2, 2, 2, 2,\n", 105 | " 2, 0, 0, 2, 2, 2, 2, 2, 2, 2, 0, 2, 2, 2, 0, 2, 2, 2], dtype=int32)" 106 | ] 107 | }, 108 | "execution_count": 3, 109 | "metadata": {}, 110 | "output_type": "execute_result" 111 | } 112 | ], 113 | "source": [ 114 | "from sklearn import datasets\n", 115 | "from sklearn.preprocessing import StandardScaler\n", 116 | "from sklearn.cluster import MiniBatchKMeans\n", 117 | "\n", 118 | "iris = datasets.load_iris()\n", 119 | "features = iris.data\n", 120 | "\n", 121 | "scaler = StandardScaler()\n", 122 | "features_std = scaler.fit_transform(features)\n", 123 | "\n", 124 | "cluster = MiniBatchKMeans(n_clusters=3, random_state=0, batch_size=100)\n", 125 | "model = cluster.fit(features_std)\n", 126 | "\n", 127 | "model.labels_" 128 | ] 129 | }, 130 | { 131 | "cell_type": "markdown", 132 | "metadata": {}, 133 | "source": [ 134 | "## 19.3 Clustering Using Meanshift" 135 | ] 136 | }, 137 | { 138 | "cell_type": "code", 139 | "execution_count": 4, 140 | "metadata": { 141 | "collapsed": false 142 | }, 143 | "outputs": [ 144 | { 145 | "data": { 146 | "text/plain": [ 147 | "array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,\n", 148 | " 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,\n", 149 | " 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n", 150 | " 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n", 151 | " 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n", 152 | " 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n", 153 | " 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])" 154 | ] 155 | }, 156 | "execution_count": 4, 157 | "metadata": {}, 158 | "output_type": "execute_result" 159 | } 160 | ], 161 | "source": [ 162 | "from sklearn import datasets\n", 163 | "from sklearn.preprocessing import StandardScaler\n", 164 | "from sklearn.cluster import MeanShift\n", 165 | "\n", 166 | "iris = datasets.load_iris()\n", 167 | "features = iris.data\n", 168 | "\n", 169 | "scaler = StandardScaler()\n", 170 | "features_std = scaler.fit_transform(features)\n", 171 | "\n", 172 | "cluster = MeanShift(n_jobs=-1)\n", 173 | "model = cluster.fit(features_std)\n", 174 | "\n", 175 | 
"model.labels_" 176 | ] 177 | }, 178 | { 179 | "cell_type": "markdown", 180 | "metadata": {}, 181 | "source": [ 182 | "## 19.4 Clustering Using DBSCAN" 183 | ] 184 | }, 185 | { 186 | "cell_type": "code", 187 | "execution_count": 5, 188 | "metadata": { 189 | "collapsed": false 190 | }, 191 | "outputs": [ 192 | { 193 | "data": { 194 | "text/plain": [ 195 | "array([ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, -1, -1, 0,\n", 196 | " 0, 0, 0, 0, 0, -1, 0, 0, 0, 0, 0, 0, 0, 0, 0, -1, -1,\n", 197 | " 0, 0, 0, 0, 0, 0, 0, -1, 0, 0, 0, 0, 0, 0, 0, 0, 1,\n", 198 | " 1, 1, 1, 1, 1, -1, -1, 1, -1, -1, 1, -1, 1, 1, 1, 1, 1,\n", 199 | " -1, 1, 1, 1, -1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,\n", 200 | " -1, 1, -1, 1, 1, 1, 1, 1, -1, 1, 1, 1, 1, -1, 1, -1, 1,\n", 201 | " 1, 1, 1, -1, -1, -1, -1, -1, 1, 1, 1, 1, -1, 1, 1, -1, -1,\n", 202 | " -1, 1, 1, -1, 1, 1, -1, 1, 1, 1, -1, -1, -1, 1, 1, 1, -1,\n", 203 | " -1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, -1, 1])" 204 | ] 205 | }, 206 | "execution_count": 5, 207 | "metadata": {}, 208 | "output_type": "execute_result" 209 | } 210 | ], 211 | "source": [ 212 | "from sklearn import datasets\n", 213 | "from sklearn.preprocessing import StandardScaler\n", 214 | "from sklearn.cluster import DBSCAN\n", 215 | "\n", 216 | "iris = datasets.load_iris()\n", 217 | "features = iris.data\n", 218 | "\n", 219 | "scaler = StandardScaler()\n", 220 | "features_std = scaler.fit_transform(features)\n", 221 | "\n", 222 | "cluster = DBSCAN(n_jobs=-1)\n", 223 | "model = cluster.fit(features_std)\n", 224 | "\n", 225 | "model.labels_" 226 | ] 227 | }, 228 | { 229 | "cell_type": "markdown", 230 | "metadata": {}, 231 | "source": [ 232 | "## 19.5 Clustering using Hierarchical Merging" 233 | ] 234 | }, 235 | { 236 | "cell_type": "code", 237 | "execution_count": 7, 238 | "metadata": { 239 | "collapsed": false 240 | }, 241 | "outputs": [ 242 | { 243 | "data": { 244 | "text/plain": [ 245 | "array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,\n", 246 | " 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1,\n", 247 | " 1, 1, 1, 1, 1, 1, 0, 0, 0, 2, 0, 2, 0, 2, 0, 2, 2, 0, 2, 0, 2, 0,\n", 248 | " 2, 2, 2, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 2, 2, 2, 0, 2, 0, 0, 2,\n", 249 | " 2, 2, 2, 0, 2, 2, 2, 2, 2, 0, 2, 2, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0,\n", 250 | " 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n", 251 | " 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])" 252 | ] 253 | }, 254 | "execution_count": 7, 255 | "metadata": {}, 256 | "output_type": "execute_result" 257 | } 258 | ], 259 | "source": [ 260 | "from sklearn import datasets\n", 261 | "from sklearn.preprocessing import StandardScaler\n", 262 | "from sklearn.cluster import AgglomerativeClustering\n", 263 | "\n", 264 | "iris = datasets.load_iris()\n", 265 | "features = iris.data\n", 266 | "\n", 267 | "scaler = StandardScaler()\n", 268 | "features_std = scaler.fit_transform(features)\n", 269 | "\n", 270 | "cluster = AgglomerativeClustering(n_clusters=3)\n", 271 | "model = cluster.fit(features_std)\n", 272 | "\n", 273 | "model.labels_" 274 | ] 275 | }, 276 | { 277 | "cell_type": "code", 278 | "execution_count": null, 279 | "metadata": { 280 | "collapsed": true 281 | }, 282 | "outputs": [], 283 | "source": [] 284 | } 285 | ], 286 | "metadata": { 287 | "kernelspec": { 288 | "display_name": "Python [conda env:machine_learning_cookbook]", 289 | "language": "python", 290 | "name": "conda-env-machine_learning_cookbook-py" 291 | }, 292 | "language_info": { 293 | "codemirror_mode": { 294 | "name": 
"ipython", 295 | "version": 3 296 | }, 297 | "file_extension": ".py", 298 | "mimetype": "text/x-python", 299 | "name": "python", 300 | "nbconvert_exporter": "python", 301 | "pygments_lexer": "ipython3", 302 | "version": "3.6.6" 303 | } 304 | }, 305 | "nbformat": 4, 306 | "nbformat_minor": 2 307 | } 308 | -------------------------------------------------------------------------------- /.ipynb_checkpoints/Chapter 21 - Saving and Loading Trained Models-checkpoint.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "### Chapter 21\n", 8 | "## Saving and Loading Trained Models\n", 9 | "\n", 10 | "### 21.0 Introduction\n", 11 | "In the last 20 chapters around 200 recipies, we have convered how to take raw data nad usem achine learning to create well-performing predictive models. However, for all our work to be worthwhile we eventually need to do something with our model, such as integrating it with an existing software application. To accomplish this goal, we need to be able to bot hsave our models after training and load them when they are needed by an application. This is the focus of the final chapter\n", 12 | "\n", 13 | "### 21.1 Saving and Loading a scikit-learn Model\n", 14 | "#### Problem\n", 15 | "You have trained a scikit-learn model and want to save it and load it elsewhere.\n", 16 | "\n", 17 | "#### Solution\n", 18 | "Save the model as a pickle file:" 19 | ] 20 | }, 21 | { 22 | "cell_type": "code", 23 | "execution_count": 1, 24 | "metadata": {}, 25 | "outputs": [ 26 | { 27 | "name": "stderr", 28 | "output_type": "stream", 29 | "text": [ 30 | "/Users/f00/anaconda/envs/machine_learning_cookbook/lib/python3.6/site-packages/sklearn/ensemble/weight_boosting.py:29: DeprecationWarning: numpy.core.umath_tests is an internal NumPy module and should not be imported. 
It will be removed in a future NumPy release.\n", 31 | " from numpy.core.umath_tests import inner1d\n" 32 | ] 33 | }, 34 | { 35 | "data": { 36 | "text/plain": [ 37 | "['model.pkl']" 38 | ] 39 | }, 40 | "execution_count": 1, 41 | "metadata": {}, 42 | "output_type": "execute_result" 43 | } 44 | ], 45 | "source": [ 46 | "# load libraries\n", 47 | "from sklearn.ensemble import RandomForestClassifier\n", 48 | "from sklearn import datasets\n", 49 | "from sklearn.externals import joblib\n", 50 | "\n", 51 | "# load data\n", 52 | "iris = datasets.load_iris()\n", 53 | "features = iris.data\n", 54 | "target = iris.target\n", 55 | "\n", 56 | "# create decision tree classifier object\n", 57 | "classifier = RandomForestClassifier()\n", 58 | "\n", 59 | "# train model\n", 60 | "model = classifier.fit(features, target)\n", 61 | "\n", 62 | "# save model as pickle file\n", 63 | "joblib.dump(model, \"model.pkl\")" 64 | ] 65 | }, 66 | { 67 | "cell_type": "markdown", 68 | "metadata": {}, 69 | "source": [ 70 | "Once the model is saved we can use scikit-learn in our destination application (e.g., web application) to load the model:" 71 | ] 72 | }, 73 | { 74 | "cell_type": "code", 75 | "execution_count": 2, 76 | "metadata": {}, 77 | "outputs": [], 78 | "source": [ 79 | "# load model from file\n", 80 | "classifier = joblib.load(\"model.pkl\")" 81 | ] 82 | }, 83 | { 84 | "cell_type": "markdown", 85 | "metadata": {}, 86 | "source": [ 87 | "And use it to make predictions" 88 | ] 89 | }, 90 | { 91 | "cell_type": "code", 92 | "execution_count": 3, 93 | "metadata": {}, 94 | "outputs": [ 95 | { 96 | "data": { 97 | "text/plain": [ 98 | "array([0])" 99 | ] 100 | }, 101 | "execution_count": 3, 102 | "metadata": {}, 103 | "output_type": "execute_result" 104 | } 105 | ], 106 | "source": [ 107 | "# create new observation\n", 108 | "new_observation = [[ 5.2, 3.2, 1.1, 0.1]]\n", 109 | "\n", 110 | "# predict obserrvation's class\n", 111 | "classifier.predict(new_observation)" 112 | ] 113 | }, 114 | { 115 | "cell_type": "markdown", 116 | "metadata": {}, 117 | "source": [ 118 | "### Discussion\n", 119 | "The first step in using a model in production is to save that model as a file that can be loaded by another application or workflow. We can accomplish this by saving the model as a pickle file, a Python-specific data format. 
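For reference, the same round trip with the standard-library `pickle` module would look like the sketch below; this is an added aside (the filename `model_pickle.pkl` is arbitrary), while the notebook itself uses joblib, as explained next.

```python
import pickle

# Save the trained classifier with the standard-library pickle module
with open("model_pickle.pkl", "wb") as f:
    pickle.dump(model, f)

# Load it back and make a prediction
with open("model_pickle.pkl", "rb") as f:
    loaded_classifier = pickle.load(f)

loaded_classifier.predict([[5.2, 3.2, 1.1, 0.1]])
```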
Specifically, to save the model we use `joblib`, which is a library extending pickle for cases when we have large NumPy arrays--a common occurance for trained models in scikit-learn.\n", 120 | "\n", 121 | "When saving scikit-learn models, be aware that saved models might not be compatible between versions of scikit-learn; therefore, it can be helpful to include the version of scikit-learn used in the model in the filename:" 122 | ] 123 | }, 124 | { 125 | "cell_type": "code", 126 | "execution_count": 4, 127 | "metadata": {}, 128 | "outputs": [ 129 | { 130 | "data": { 131 | "text/plain": [ 132 | "['model_(version).pkl']" 133 | ] 134 | }, 135 | "execution_count": 4, 136 | "metadata": {}, 137 | "output_type": "execute_result" 138 | } 139 | ], 140 | "source": [ 141 | "# import library\n", 142 | "import sklearn\n", 143 | "\n", 144 | "# get scikit-learn version\n", 145 | "scikit_version = joblib.__version__\n", 146 | "\n", 147 | "# save model as pickle file\n", 148 | "joblib.dump(model, \"model_(version).pkl\".format(version=scikit_version))" 149 | ] 150 | }, 151 | { 152 | "cell_type": "markdown", 153 | "metadata": {}, 154 | "source": [ 155 | "### 21.2 Saving and Loading a Keras Model\n", 156 | "#### Problem\n", 157 | "You have a trained Keras model and want to save it and load it elsewhere.\n", 158 | "\n", 159 | "#### Solution\n", 160 | "Save the model as HDF5:" 161 | ] 162 | }, 163 | { 164 | "cell_type": "code", 165 | "execution_count": 1, 166 | "metadata": {}, 167 | "outputs": [ 168 | { 169 | "name": "stderr", 170 | "output_type": "stream", 171 | "text": [ 172 | "Using Theano backend.\n" 173 | ] 174 | }, 175 | { 176 | "ename": "ModuleNotFoundError", 177 | "evalue": "No module named 'theano'", 178 | "output_type": "error", 179 | "traceback": [ 180 | "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", 181 | "\u001b[0;31mModuleNotFoundError\u001b[0m Traceback (most recent call last)", 182 | "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[1;32m 1\u001b[0m \u001b[0;31m# load libraries\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 2\u001b[0m \u001b[0;32mimport\u001b[0m \u001b[0mnumpy\u001b[0m \u001b[0;32mas\u001b[0m \u001b[0mnp\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 3\u001b[0;31m \u001b[0;32mfrom\u001b[0m \u001b[0mkeras\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mdatasets\u001b[0m \u001b[0;32mimport\u001b[0m \u001b[0mimdb\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 4\u001b[0m \u001b[0;32mfrom\u001b[0m \u001b[0mkeras\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mpreprocessing\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mtext\u001b[0m \u001b[0;32mimport\u001b[0m \u001b[0mTokenizer\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 5\u001b[0m \u001b[0;32mfrom\u001b[0m \u001b[0mkeras\u001b[0m \u001b[0;32mimport\u001b[0m \u001b[0mmodels\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", 183 | "\u001b[0;32m~/anaconda/envs/machine_learning_cookbook/lib/python3.6/site-packages/keras/__init__.py\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[1;32m 1\u001b[0m \u001b[0;32mfrom\u001b[0m \u001b[0m__future__\u001b[0m \u001b[0;32mimport\u001b[0m \u001b[0mabsolute_import\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 2\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 3\u001b[0;31m \u001b[0;32mfrom\u001b[0m \u001b[0;34m.\u001b[0m \u001b[0;32mimport\u001b[0m \u001b[0mutils\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 4\u001b[0m \u001b[0;32mfrom\u001b[0m 
\u001b[0;34m.\u001b[0m \u001b[0;32mimport\u001b[0m \u001b[0mactivations\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 5\u001b[0m \u001b[0;32mfrom\u001b[0m \u001b[0;34m.\u001b[0m \u001b[0;32mimport\u001b[0m \u001b[0mapplications\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", 184 | "\u001b[0;32m~/anaconda/envs/machine_learning_cookbook/lib/python3.6/site-packages/keras/utils/__init__.py\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[1;32m 4\u001b[0m \u001b[0;32mfrom\u001b[0m \u001b[0;34m.\u001b[0m \u001b[0;32mimport\u001b[0m \u001b[0mdata_utils\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 5\u001b[0m \u001b[0;32mfrom\u001b[0m \u001b[0;34m.\u001b[0m \u001b[0;32mimport\u001b[0m \u001b[0mio_utils\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 6\u001b[0;31m \u001b[0;32mfrom\u001b[0m \u001b[0;34m.\u001b[0m \u001b[0;32mimport\u001b[0m \u001b[0mconv_utils\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 7\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 8\u001b[0m \u001b[0;31m# Globally-importable utils.\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", 185 | "\u001b[0;32m~/anaconda/envs/machine_learning_cookbook/lib/python3.6/site-packages/keras/utils/conv_utils.py\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[1;32m 7\u001b[0m \u001b[0;32mfrom\u001b[0m \u001b[0msix\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mmoves\u001b[0m \u001b[0;32mimport\u001b[0m \u001b[0mrange\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 8\u001b[0m \u001b[0;32mimport\u001b[0m \u001b[0mnumpy\u001b[0m \u001b[0;32mas\u001b[0m \u001b[0mnp\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 9\u001b[0;31m \u001b[0;32mfrom\u001b[0m \u001b[0;34m.\u001b[0m\u001b[0;34m.\u001b[0m \u001b[0;32mimport\u001b[0m \u001b[0mbackend\u001b[0m \u001b[0;32mas\u001b[0m \u001b[0mK\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 10\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 11\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n", 186 | "\u001b[0;32m~/anaconda/envs/machine_learning_cookbook/lib/python3.6/site-packages/keras/backend/__init__.py\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[1;32m 84\u001b[0m \u001b[0;32melif\u001b[0m \u001b[0m_BACKEND\u001b[0m \u001b[0;34m==\u001b[0m \u001b[0;34m'theano'\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 85\u001b[0m \u001b[0msys\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mstderr\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mwrite\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m'Using Theano backend.\\n'\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 86\u001b[0;31m \u001b[0;32mfrom\u001b[0m \u001b[0;34m.\u001b[0m\u001b[0mtheano_backend\u001b[0m \u001b[0;32mimport\u001b[0m \u001b[0;34m*\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 87\u001b[0m \u001b[0;32melif\u001b[0m \u001b[0m_BACKEND\u001b[0m \u001b[0;34m==\u001b[0m \u001b[0;34m'tensorflow'\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 88\u001b[0m \u001b[0msys\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mstderr\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mwrite\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m'Using TensorFlow backend.\\n'\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", 187 | "\u001b[0;32m~/anaconda/envs/machine_learning_cookbook/lib/python3.6/site-packages/keras/backend/theano_backend.py\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[1;32m 5\u001b[0m \u001b[0;32mfrom\u001b[0m \u001b[0mcollections\u001b[0m \u001b[0;32mimport\u001b[0m 
\u001b[0mdefaultdict\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 6\u001b[0m \u001b[0;32mfrom\u001b[0m \u001b[0mcontextlib\u001b[0m \u001b[0;32mimport\u001b[0m \u001b[0mcontextmanager\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 7\u001b[0;31m \u001b[0;32mimport\u001b[0m \u001b[0mtheano\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 8\u001b[0m \u001b[0;32mfrom\u001b[0m \u001b[0mtheano\u001b[0m \u001b[0;32mimport\u001b[0m \u001b[0mtensor\u001b[0m \u001b[0;32mas\u001b[0m \u001b[0mT\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 9\u001b[0m \u001b[0;32mfrom\u001b[0m \u001b[0mtheano\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0msandbox\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mrng_mrg\u001b[0m \u001b[0;32mimport\u001b[0m \u001b[0mMRG_RandomStreams\u001b[0m \u001b[0;32mas\u001b[0m \u001b[0mRandomStreams\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", 188 | "\u001b[0;31mModuleNotFoundError\u001b[0m: No module named 'theano'" 189 | ] 190 | } 191 | ], 192 | "source": [ 193 | "# load libraries\n", 194 | "import numpy as np\n", 195 | "from keras.datasets import imdb\n", 196 | "from keras.preprocessing.text import Tokenizer\n", 197 | "from keras import models\n", 198 | "from keras import layers\n", 199 | "from keras.models import load_model\n", 200 | "\n", 201 | "# set random seed\n", 202 | "np.random.seed(0)\n", 203 | "\n", 204 | "# set the number of features we want\n", 205 | "number_of_features = 1000\n", 206 | "\n", 207 | "# load data and target vector from movie review data\n", 208 | "(train_Data, train_target), (test_data, test_target) = imdb.load_data(num_words=number_of_features)\n", 209 | "\n", 210 | "# convert movie review data to a one-hot encoded feature matrix\n", 211 | "tokenizer = Tokenizer(num_words=number_of_features)\n", 212 | "train_features = tokenizer.sequences_to_matrix(train_data, mode=\"binary\")\n", 213 | "test_features = tokenizer.sequences_to_matrix(test_data, mode=\"binary\")\n", 214 | "\n", 215 | "# start neural network\n", 216 | "network = models.Sequential()\n", 217 | "\n", 218 | "# add fully connected layer with ReLU activation function\n", 219 | "network.add(layers.Dense(units=16, activation=\"relu\", input_shape=(number_of_features,)))\n", 220 | "\n", 221 | "# add fully connected layer with a sigmoid activation function\n", 222 | "network.add(layers.Dense(units=1, activation=\"sigmoid\"))\n", 223 | "\n", 224 | "# compile neural network\n", 225 | "network.compile(loss=\"binary_crossentropy\", optimizer=\"rmsprop\", metrics=[\"accuracy\"])\n", 226 | "\n", 227 | "# train neural network\n", 228 | "history = network.fit(train_features, train_target, epochs=3, verbose=0, batch_size=100, validation_data=(test_features, test_target))\n", 229 | "\n", 230 | "# save neural network\n", 231 | "network.save(\"model.h5\")" 232 | ] 233 | }, 234 | { 235 | "cell_type": "markdown", 236 | "metadata": {}, 237 | "source": [ 238 | "We can then load the model either in another application or for additional training" 239 | ] 240 | }, 241 | { 242 | "cell_type": "code", 243 | "execution_count": 7, 244 | "metadata": {}, 245 | "outputs": [ 246 | { 247 | "ename": "NameError", 248 | "evalue": "name 'load_model' is not defined", 249 | "output_type": "error", 250 | "traceback": [ 251 | "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", 252 | "\u001b[0;31mNameError\u001b[0m Traceback (most recent call last)", 253 | "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[1;32m 1\u001b[0m \u001b[0;31m# load neural 
network\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 2\u001b[0;31m \u001b[0mnetwork\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mload_model\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m\"model.h5\"\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m", 254 | "\u001b[0;31mNameError\u001b[0m: name 'load_model' is not defined" 255 | ] 256 | } 257 | ], 258 | "source": [ 259 | "# load neural network\n", 260 | "network = load_model(\"model.h5\")" 261 | ] 262 | }, 263 | { 264 | "cell_type": "markdown", 265 | "metadata": {}, 266 | "source": [ 267 | "#### Discussion\n", 268 | "Unlike scikit-learn, Keras does not recommend you save models using pickle. Instead, models are saved as an HDF5 file. The HDF5 file contains everything you need to not only load the model to make predicitons (i.e., achitecture and trained parameters), but also to restart training (i.e. loss and optimizer settings and the current state)" 269 | ] 270 | }, 271 | { 272 | "cell_type": "code", 273 | "execution_count": null, 274 | "metadata": {}, 275 | "outputs": [], 276 | "source": [] 277 | } 278 | ], 279 | "metadata": { 280 | "kernelspec": { 281 | "display_name": "Python [conda env:machine_learning_cookbook]", 282 | "language": "python", 283 | "name": "conda-env-machine_learning_cookbook-py" 284 | }, 285 | "language_info": { 286 | "codemirror_mode": { 287 | "name": "ipython", 288 | "version": 3 289 | }, 290 | "file_extension": ".py", 291 | "mimetype": "text/x-python", 292 | "name": "python", 293 | "nbconvert_exporter": "python", 294 | "pygments_lexer": "ipython3", 295 | "version": "3.6.6" 296 | } 297 | }, 298 | "nbformat": 4, 299 | "nbformat_minor": 2 300 | } 301 | -------------------------------------------------------------------------------- /.ipynb_checkpoints/Chapter 4 - Handling Numerical Data-checkpoint.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "## Chapter 4\n", 8 | "---\n", 9 | "# Handling Numerical Data\n", 10 | "\n", 11 | "### 4.0 Introduction\n", 12 | "Quantitative data is the measurment of something--weather class size, monthly sales, or student scores. The natural way to represent these quantities is numerically (e.g., 20 students, $529,392 in sales). In this chapter we will cover numerous strategies for transforming raw numerical data into features purpose-built for machine learning algorithms\n", 13 | "\n", 14 | "### 4.1 Rescaling a feature\n", 15 | "Use scikit-learn's `MinMaxScaler` to rescale a feature array" 16 | ] 17 | }, 18 | { 19 | "cell_type": "code", 20 | "execution_count": 2, 21 | "metadata": {}, 22 | "outputs": [ 23 | { 24 | "data": { 25 | "text/plain": [ 26 | "array([[0. ],\n", 27 | " [0.28571429],\n", 28 | " [0.35714286],\n", 29 | " [0.42857143],\n", 30 | " [1. 
]])" 31 | ] 32 | }, 33 | "execution_count": 2, 34 | "metadata": {}, 35 | "output_type": "execute_result" 36 | } 37 | ], 38 | "source": [ 39 | "import numpy as np\n", 40 | "from sklearn import preprocessing\n", 41 | "\n", 42 | "# create a feature\n", 43 | "feature = np.array([\n", 44 | " [-500.5],\n", 45 | " [-100.1],\n", 46 | " [0],\n", 47 | " [100.1],\n", 48 | " [900.9]\n", 49 | "])\n", 50 | "\n", 51 | "# create scaler\n", 52 | "minmax_scaler = preprocessing.MinMaxScaler(feature_range=(0,1))\n", 53 | "\n", 54 | "# scale feature\n", 55 | "scaled_feature = minmax_scaler.fit_transform(feature)\n", 56 | "\n", 57 | "scaled_feature" 58 | ] 59 | }, 60 | { 61 | "cell_type": "markdown", 62 | "metadata": {}, 63 | "source": [ 64 | "#### Discussion\n", 65 | "Rescaling is a common preprocessing task in machine learning. Many of the algorithms described later in this book will assume all features are on the same scale, typically 0 to 1 or -1 to 1. There are a number of rescaling techniques, but one of the simlest is called *min-max scaling*. Min-max scaling uses the minimum and maximum values of a feature to rescale values to within a range. Specfically, min-max calculates:\n", 66 | "$$\n", 67 | "x_i^` = \\frac{x_i - min(x)}{max(x) - min(x)}\n", 68 | "$$\n", 69 | "\n", 70 | "where x is the feature vector, $x_i$ is an individual element of feature x, and $x_i^`$ is the rescaled element\n", 71 | "\n", 72 | "#### See Also\n", 73 | "* Feature scaling, wikipedia (https://en.wikipedia.org/wiki/Feature_scaling)\n", 74 | "\n", 75 | "### 4.2 Standardizing a Feature\n", 76 | "scikit-learn's `StandardScaler` transforms a feature to have a mean of 0 and a standard deviation of 1." 77 | ] 78 | }, 79 | { 80 | "cell_type": "code", 81 | "execution_count": 3, 82 | "metadata": {}, 83 | "outputs": [ 84 | { 85 | "data": { 86 | "text/plain": [ 87 | "array([[-0.76058269],\n", 88 | " [-0.54177196],\n", 89 | " [-0.35009716],\n", 90 | " [-0.32271504],\n", 91 | " [ 1.97516685]])" 92 | ] 93 | }, 94 | "execution_count": 3, 95 | "metadata": {}, 96 | "output_type": "execute_result" 97 | } 98 | ], 99 | "source": [ 100 | "import numpy as np\n", 101 | "from sklearn import preprocessing\n", 102 | "\n", 103 | "# create a feature\n", 104 | "feature = np.array([\n", 105 | " [-1000.1],\n", 106 | " [-200.2],\n", 107 | " [500.5],\n", 108 | " [600.6],\n", 109 | " [9000.9]\n", 110 | "])\n", 111 | "\n", 112 | "# create scaler\n", 113 | "scaler = preprocessing.StandardScaler()\n", 114 | "\n", 115 | "# transform the feature\n", 116 | "standardized = scaler.fit_transform(feature)\n", 117 | "\n", 118 | "standardized" 119 | ] 120 | }, 121 | { 122 | "cell_type": "markdown", 123 | "metadata": {}, 124 | "source": [ 125 | "#### Discussion\n", 126 | "A common alternative to min-max scaling is rescaling of features to be approximately standard normally distributed. To achieve this, we use standardization to tranform the data such that it has a mean, $\\bar x$, or 0 and a standard deviation $\\sigma$, of 1. Specifically, each element in the feature is transformed so that:\n", 127 | "$$\n", 128 | "x_i^` = \\frac{x_i - \\bar x}{\\sigma}\n", 129 | "$$" 130 | ] 131 | }, 132 | { 133 | "cell_type": "markdown", 134 | "metadata": {}, 135 | "source": [ 136 | "Where $x_I^`$ is our standardized form of $x_i$. 
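To make the z-score formula above concrete, here is a small self-contained check (my own sketch, not from the book) that reproduces StandardScaler's output for the Recipe 4.2 feature by hand; note that StandardScaler divides by the population standard deviation (ddof=0):

# Reproduce the StandardScaler result for the 4.2 example manually
import numpy as np

feature = np.array([[-1000.1], [-200.2], [500.5], [600.6], [9000.9]])
manual_z = (feature - feature.mean()) / feature.std()  # population std, ddof=0
print(manual_z.flatten())
# [-0.76058269 -0.54177196 -0.35009716 -0.32271504  1.97516685], matching the output above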
The transformed feature represents the number of standard deviations the original value is away from the feature's mean value (also called a *z-score* in statistics).\n", 137 | "\n", 138 | "Standardization is a common go-to scaling method for machine learning preprocessing and in my experience is used more than min-max scaling. However, it depends on the learning algorithm. For example, principal component analysis often works better using standardization, while min-max scaling is often recommended for neural networks. As a general rule, I'd recommend defaulting to standardization unless you have a specific reason to use an alternative.\n", 139 | "\n", 140 | "We can see the effect of standardization by looking at the mean and standard deviation of our solution's output:" 141 | ] 142 | }, 143 | { 144 | "cell_type": "code", 145 | "execution_count": 4, 146 | "metadata": {}, 147 | "outputs": [ 148 | { 149 | "name": "stdout", 150 | "output_type": "stream", 151 | "text": [ 152 | "Mean 0.0\n", 153 | "Standard Deviation: 1.0\n" 154 | ] 155 | } 156 | ], 157 | "source": [ 158 | "print(\"Mean {}\".format(round(standardized.mean())))\n", 159 | "print(\"Standard Deviation: {}\".format(standardized.std()))" 160 | ] 161 | }, 162 | { 163 | "cell_type": "markdown", 164 | "metadata": {}, 165 | "source": [ 166 | "If our data has significant outliers, it can negatively impact our standardization by affecting the feature's mean and variance. In this scenario, it is often helpful to instead rescale the feature using the median and interquartile range. In scikit-learn, we do this using the *RobustScaler* method:" 167 | ] 168 | }, 169 | { 170 | "cell_type": "code", 171 | "execution_count": 5, 172 | "metadata": {}, 173 | "outputs": [ 174 | { 175 | "data": { 176 | "text/plain": [ 177 | "array([[-1.87387612],\n", 178 | " [-0.875 ],\n", 179 | " [ 0.
],\n", 180 | " [ 0.125 ],\n", 181 | " [10.61488511]])" 182 | ] 183 | }, 184 | "execution_count": 5, 185 | "metadata": {}, 186 | "output_type": "execute_result" 187 | } 188 | ], 189 | "source": [ 190 | "# create scaler\n", 191 | "robust_scaler = preprocessing.RobustScaler()\n", 192 | "\n", 193 | "# transform feature\n", 194 | "robust_scaler.fit_transform(feature)" 195 | ] 196 | }, 197 | { 198 | "cell_type": "markdown", 199 | "metadata": {}, 200 | "source": [ 201 | "### 4.3 Normalizing Observations\n", 202 | "Use scikit-learn's `Normalizer` to rescale the feature values to have unit norm (a total length of 1)" 203 | ] 204 | }, 205 | { 206 | "cell_type": "code", 207 | "execution_count": 6, 208 | "metadata": {}, 209 | "outputs": [ 210 | { 211 | "data": { 212 | "text/plain": [ 213 | "array([[0.70710678, 0.70710678],\n", 214 | " [0.30782029, 0.95144452],\n", 215 | " [0.07405353, 0.99725427],\n", 216 | " [0.04733062, 0.99887928],\n", 217 | " [0.95709822, 0.28976368]])" 218 | ] 219 | }, 220 | "execution_count": 6, 221 | "metadata": {}, 222 | "output_type": "execute_result" 223 | } 224 | ], 225 | "source": [ 226 | "import numpy as np\n", 227 | "from sklearn.preprocessing import Normalizer\n", 228 | "\n", 229 | "# create feature matrix\n", 230 | "features = np.array([\n", 231 | " [0.5, 0.5],\n", 232 | " [1.1, 3.4],\n", 233 | " [1.5, 20.2],\n", 234 | " [1.63, 34.4],\n", 235 | " [10.9, 3.3]\n", 236 | "])\n", 237 | "\n", 238 | "# create normalizer\n", 239 | "normalizer = Normalizer(norm=\"l2\")\n", 240 | "\n", 241 | "# transofmr feature matrix\n", 242 | "normalizer.transform(features)" 243 | ] 244 | }, 245 | { 246 | "cell_type": "markdown", 247 | "metadata": {}, 248 | "source": [ 249 | "#### Discussion\n", 250 | "Many rescaling methods operate of features; however, we can also rescale across individual observations. `Normalizer` rescales the values on individual observations to have unit norm (the sum of their lengths is 1). This type of rescaling is often used when we have many equivalent features (e.g. text-classification when every word is n-word group is a feature).\n", 251 | "\n", 252 | "`Normalizer` provides three norm options with Euclidean norm (often called L2) being the default:\n", 253 | "$$\n", 254 | "||x||_2 = \\sqrt{x_1^2 + x_2^2 + ... + x_n^2}\n", 255 | "$$\n", 256 | "\n", 257 | "where x is an individual observation and x_n is that observation's value for the nth feature.\n", 258 | "\n", 259 | "Alternatively, we can specify Manhattan norm (L1):\n", 260 | "$$\n", 261 | "||x||_1 = \\sum_{i=1}^n{x_i}\n", 262 | "$$\n", 263 | "\n", 264 | "Intuitively, L2 norm can be thought of as the distance between two poitns in New York for a bird (i.e. 
a straight line), while L1 can be thought of as the distance for a human wlaking on the street (walk north one block, east one block, north one block, east one block, etc), which is why it is called \"Manhattan norm\" or \"Taxicab norm\".\n", 265 | "\n", 266 | "Practically, notice that `norm='l1'` rescales an observation's values so they sum to 1, which can sometimes be a desirable quality" 267 | ] 268 | }, 269 | { 270 | "cell_type": "code", 271 | "execution_count": 8, 272 | "metadata": {}, 273 | "outputs": [ 274 | { 275 | "name": "stdout", 276 | "output_type": "stream", 277 | "text": [ 278 | "Sum of the first observation's values: 1.0\n" 279 | ] 280 | } 281 | ], 282 | "source": [ 283 | "# transform feature matrix\n", 284 | "features_l1_norm = Normalizer(norm=\"l1\").transform(features)\n", 285 | "print(\"Sum of the first observation's values: {}\".format(features_l1_norm[0,0] + features_l1_norm[0,1]))" 286 | ] 287 | }, 288 | { 289 | "cell_type": "markdown", 290 | "metadata": {}, 291 | "source": [ 292 | "### 4.9 Grouping Observations Using Clustering" 293 | ] 294 | }, 295 | { 296 | "cell_type": "code", 297 | "execution_count": 9, 298 | "metadata": {}, 299 | "outputs": [ 300 | { 301 | "data": { 302 | "text/html": [ 303 | "
\n", 304 | "\n", 317 | "\n", 318 | " \n", 319 | " \n", 320 | " \n", 321 | " \n", 322 | " \n", 323 | " \n", 324 | " \n", 325 | " \n", 326 | " \n", 327 | " \n", 328 | " \n", 329 | " \n", 330 | " \n", 331 | " \n", 332 | " \n", 333 | " \n", 334 | " \n", 335 | " \n", 336 | " \n", 337 | " \n", 338 | " \n", 339 | " \n", 340 | " \n", 341 | " \n", 342 | " \n", 343 | " \n", 344 | " \n", 345 | " \n", 346 | " \n", 347 | " \n", 348 | " \n", 349 | " \n", 350 | " \n", 351 | " \n", 352 | " \n", 353 | " \n", 354 | " \n", 355 | " \n", 356 | " \n", 357 | " \n", 358 | "
feature_1feature_2group
0-9.877554-3.3361450
1-7.287210-8.3539862
2-6.943061-7.0237442
3-7.440167-8.7919592
4-6.641388-8.0758882
\n", 359 | "
" 360 | ], 361 | "text/plain": [ 362 | " feature_1 feature_2 group\n", 363 | "0 -9.877554 -3.336145 0\n", 364 | "1 -7.287210 -8.353986 2\n", 365 | "2 -6.943061 -7.023744 2\n", 366 | "3 -7.440167 -8.791959 2\n", 367 | "4 -6.641388 -8.075888 2" 368 | ] 369 | }, 370 | "execution_count": 9, 371 | "metadata": {}, 372 | "output_type": "execute_result" 373 | } 374 | ], 375 | "source": [ 376 | "import pandas as pd\n", 377 | "from sklearn.datasets import make_blobs\n", 378 | "from sklearn.cluster import KMeans\n", 379 | "\n", 380 | "features, _ = make_blobs(n_samples = 50,\n", 381 | " n_features = 2,\n", 382 | " centers = 3,\n", 383 | " random_state = 1)\n", 384 | "\n", 385 | "df = pd.DataFrame(features, columns=[\"feature_1\", \"feature_2\"])\n", 386 | "\n", 387 | "# make k-means clusterer\n", 388 | "clusterer = KMeans(3, random_state=0)\n", 389 | "\n", 390 | "# fit clusterer\n", 391 | "clusterer.fit(features)\n", 392 | "\n", 393 | "# predict values\n", 394 | "df['group'] = clusterer.predict(features)\n", 395 | "\n", 396 | "df.head()" 397 | ] 398 | }, 399 | { 400 | "cell_type": "markdown", 401 | "metadata": {}, 402 | "source": [ 403 | "# 4.10 Deleteing Observations with Missing Values" 404 | ] 405 | }, 406 | { 407 | "cell_type": "code", 408 | "execution_count": 10, 409 | "metadata": {}, 410 | "outputs": [ 411 | { 412 | "data": { 413 | "text/plain": [ 414 | "array([[ 1.1, 11.1],\n", 415 | " [ 2.2, 22.2],\n", 416 | " [ 3.3, 33.3]])" 417 | ] 418 | }, 419 | "execution_count": 10, 420 | "metadata": {}, 421 | "output_type": "execute_result" 422 | } 423 | ], 424 | "source": [ 425 | "import numpy as np\n", 426 | "\n", 427 | "features = np.array([\n", 428 | " [1.1, 11.1],\n", 429 | " [2.2, 22.2],\n", 430 | " [3.3, 33.3],\n", 431 | " [np.nan, 55]\n", 432 | "])\n", 433 | "\n", 434 | "# keep only observations that are not (denoted by ~) missing\n", 435 | "features[~np.isnan(features).any(axis=1)]" 436 | ] 437 | }, 438 | { 439 | "cell_type": "code", 440 | "execution_count": 11, 441 | "metadata": {}, 442 | "outputs": [ 443 | { 444 | "data": { 445 | "text/html": [ 446 | "
\n", 447 | "\n", 460 | "\n", 461 | " \n", 462 | " \n", 463 | " \n", 464 | " \n", 465 | " \n", 466 | " \n", 467 | " \n", 468 | " \n", 469 | " \n", 470 | " \n", 471 | " \n", 472 | " \n", 473 | " \n", 474 | " \n", 475 | " \n", 476 | " \n", 477 | " \n", 478 | " \n", 479 | " \n", 480 | " \n", 481 | " \n", 482 | " \n", 483 | " \n", 484 | " \n", 485 | "
feature_1feature_2
01.111.1
12.222.2
23.333.3
\n", 486 | "
" 487 | ], 488 | "text/plain": [ 489 | " feature_1 feature_2\n", 490 | "0 1.1 11.1\n", 491 | "1 2.2 22.2\n", 492 | "2 3.3 33.3" 493 | ] 494 | }, 495 | "execution_count": 11, 496 | "metadata": {}, 497 | "output_type": "execute_result" 498 | } 499 | ], 500 | "source": [ 501 | "import pandas as pd\n", 502 | "df = pd.DataFrame(features, columns=[\"feature_1\", \"feature_2\"])\n", 503 | "df.dropna()" 504 | ] 505 | }, 506 | { 507 | "cell_type": "markdown", 508 | "metadata": {}, 509 | "source": [ 510 | "#### Discussion\n", 511 | "Most machine learnign algorithms cannot handling any missing values in the target and feature arrays. The simplest solution is the delete every observation that contains one or more missing values\n", 512 | "\n", 513 | "There are three types of missing data:\n", 514 | "\n", 515 | "*Missing Completely At Random (MCAR)*\n", 516 | "* The probability that a value is missing is independent of everything.\n", 517 | "\n", 518 | "*Missing At Random (MAR)*\n", 519 | "* The probability that a value is missing is not completely random, but depends on information capture in other feature\n", 520 | "\n", 521 | "*Missing Not At Random (MNAR)*\n", 522 | "* The probability that a value is missing is not random and depends on information not captured in our features\n", 523 | "\n", 524 | "#### See Also\n", 525 | "* Identifying the Three Types of Missing Data (https://measuringu.com/missing-data/)\n", 526 | "* Missing-Data Imputation (http://www.stat.columbia.edu/~gelman/arm/missing.pdf)\n", 527 | "\n", 528 | "### 4.11 Imputing Missing Values" 529 | ] 530 | }, 531 | { 532 | "cell_type": "code", 533 | "execution_count": 14, 534 | "metadata": {}, 535 | "outputs": [ 536 | { 537 | "name": "stdout", 538 | "output_type": "stream", 539 | "text": [ 540 | "True Value: 0.8730186113995938\n", 541 | "Imputed Value: -3.058372724614996\n" 542 | ] 543 | } 544 | ], 545 | "source": [ 546 | "import numpy as np\n", 547 | "from sklearn.preprocessing import StandardScaler\n", 548 | "from sklearn.datasets import make_blobs\n", 549 | "from sklearn.preprocessing import Imputer\n", 550 | "\n", 551 | "# make fake data\n", 552 | "features, _ = make_blobs(n_samples = 1000,\n", 553 | " n_features = 2,\n", 554 | " random_state = 1)\n", 555 | "\n", 556 | "# standardize the features\n", 557 | "scaler = StandardScaler()\n", 558 | "standardized_features = scaler.fit_transform(features)\n", 559 | "\n", 560 | "# replace the first feature's first value with a missing value\n", 561 | "true_value = standardized_features[0, 0]\n", 562 | "standardized_features[0,0] = np.nan\n", 563 | "\n", 564 | "# create imputer\n", 565 | "mean_imputer = Imputer(strategy=\"mean\", axis=0)\n", 566 | "\n", 567 | "# impute values\n", 568 | "feautres_mean_imputed = mean_imputer.fit_transform(features)\n", 569 | "\n", 570 | "# compare true and imputed values\n", 571 | "print(\"True Value: {}\".format(true_value))\n", 572 | "print(\"Imputed Value: {}\".format(feautres_mean_imputed[0,0]))" 573 | ] 574 | }, 575 | { 576 | "cell_type": "markdown", 577 | "metadata": {}, 578 | "source": [ 579 | "#### See Also\n", 580 | "* A Study of K-Nearest Neighbor as an Imputation Method (http://conteudo.icmc.usp.br/pessoas/gbatista/files/his2002.pdf)" 581 | ] 582 | }, 583 | { 584 | "cell_type": "code", 585 | "execution_count": null, 586 | "metadata": {}, 587 | "outputs": [], 588 | "source": [] 589 | } 590 | ], 591 | "metadata": { 592 | "kernelspec": { 593 | "display_name": "Python [conda env:machine_learning_cookbook]", 594 | "language": "python", 595 | "name": 
"conda-env-machine_learning_cookbook-py" 596 | }, 597 | "language_info": { 598 | "codemirror_mode": { 599 | "name": "ipython", 600 | "version": 3 601 | }, 602 | "file_extension": ".py", 603 | "mimetype": "text/x-python", 604 | "name": "python", 605 | "nbconvert_exporter": "python", 606 | "pygments_lexer": "ipython3", 607 | "version": "3.6.6" 608 | } 609 | }, 610 | "nbformat": 4, 611 | "nbformat_minor": 2 612 | } 613 | -------------------------------------------------------------------------------- /.ipynb_checkpoints/Chapter 7 - Handling Dates and Times-checkpoint.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "## 7.1 Converting Strings to Dates" 8 | ] 9 | }, 10 | { 11 | "cell_type": "code", 12 | "execution_count": 2, 13 | "metadata": {}, 14 | "outputs": [ 15 | { 16 | "data": { 17 | "text/plain": [ 18 | "[Timestamp('2005-04-03 23:35:00'),\n", 19 | " Timestamp('2010-05-23 00:01:00'),\n", 20 | " Timestamp('2009-09-04 21:09:00')]" 21 | ] 22 | }, 23 | "execution_count": 2, 24 | "metadata": {}, 25 | "output_type": "execute_result" 26 | } 27 | ], 28 | "source": [ 29 | "import numpy as np\n", 30 | "import pandas as pd\n", 31 | "\n", 32 | "date_strings = np.array([\n", 33 | " '03-04-2005 11:35 PM',\n", 34 | " '23-05-2010 12:01 AM',\n", 35 | " '04-09-2009 09:09 PM'\n", 36 | "])\n", 37 | "\n", 38 | "# convert to datetimes\n", 39 | "[pd.to_datetime(date, format='%d-%m-%Y %I:%M %p') for date in date_strings]" 40 | ] 41 | }, 42 | { 43 | "cell_type": "code", 44 | "execution_count": 3, 45 | "metadata": {}, 46 | "outputs": [ 47 | { 48 | "data": { 49 | "text/plain": [ 50 | "[Timestamp('2005-04-03 23:35:00'),\n", 51 | " Timestamp('2010-05-23 00:01:00'),\n", 52 | " Timestamp('2009-09-04 21:09:00')]" 53 | ] 54 | }, 55 | "execution_count": 3, 56 | "metadata": {}, 57 | "output_type": "execute_result" 58 | } 59 | ], 60 | "source": [ 61 | "[pd.to_datetime(date, format='%d-%m-%Y %I:%M %p', errors='coerce') for date in date_strings]" 62 | ] 63 | }, 64 | { 65 | "cell_type": "markdown", 66 | "metadata": {}, 67 | "source": [ 68 | "### See Also\n", 69 | "* http://strftime.org/\n", 70 | "\n", 71 | "## 7.2 Handling Time Zones" 72 | ] 73 | }, 74 | { 75 | "cell_type": "code", 76 | "execution_count": 4, 77 | "metadata": {}, 78 | "outputs": [ 79 | { 80 | "data": { 81 | "text/plain": [ 82 | "Timestamp('2017-05-01 06:00:00+0100', tz='Europe/London')" 83 | ] 84 | }, 85 | "execution_count": 4, 86 | "metadata": {}, 87 | "output_type": "execute_result" 88 | } 89 | ], 90 | "source": [ 91 | "import pandas as pd\n", 92 | "\n", 93 | "pd.Timestamp('2017-05-01 06:00:00', tz='Europe/London')" 94 | ] 95 | }, 96 | { 97 | "cell_type": "code", 98 | "execution_count": 5, 99 | "metadata": {}, 100 | "outputs": [ 101 | { 102 | "data": { 103 | "text/plain": [ 104 | "Timestamp('2017-05-01 06:00:00+0100', tz='Europe/London')" 105 | ] 106 | }, 107 | "execution_count": 5, 108 | "metadata": {}, 109 | "output_type": "execute_result" 110 | } 111 | ], 112 | "source": [ 113 | "date = pd.Timestamp('2017-05-01 06:00:00')\n", 114 | "\n", 115 | "date_in_london = date.tz_localize('Europe/London')\n", 116 | "\n", 117 | "date_in_london" 118 | ] 119 | }, 120 | { 121 | "cell_type": "code", 122 | "execution_count": 8, 123 | "metadata": {}, 124 | "outputs": [ 125 | { 126 | "data": { 127 | "text/plain": [ 128 | "Timestamp('2017-05-01 05:00:00+0000', tz='Africa/Abidjan')" 129 | ] 130 | }, 131 | "execution_count": 8, 132 | "metadata": 
{}, 133 | "output_type": "execute_result" 134 | } 135 | ], 136 | "source": [ 137 | "date_in_london.tz_convert('Africa/Abidjan')" 138 | ] 139 | }, 140 | { 141 | "cell_type": "code", 142 | "execution_count": 9, 143 | "metadata": {}, 144 | "outputs": [ 145 | { 146 | "data": { 147 | "text/plain": [ 148 | "0 2002-02-28 00:00:00+00:00\n", 149 | "1 2002-03-31 00:00:00+00:00\n", 150 | "2 2002-04-30 00:00:00+00:00\n", 151 | "dtype: datetime64[ns, Africa/Abidjan]" 152 | ] 153 | }, 154 | "execution_count": 9, 155 | "metadata": {}, 156 | "output_type": "execute_result" 157 | } 158 | ], 159 | "source": [ 160 | "dates = pd.Series(pd.date_range('2/2/2002', periods=3, freq='M'))\n", 161 | "\n", 162 | "dates.dt.tz_localize('Africa/Abidjan')" 163 | ] 164 | }, 165 | { 166 | "cell_type": "markdown", 167 | "metadata": {}, 168 | "source": [ 169 | "## 7.3 Selecting Dates and Times\n", 170 | "## 7.4 Breaking Up Date Data into Multiple Features\n", 171 | "## 7.5 Calculating the Difference Between Dates\n", 172 | "## 7.6 Encoding Days of the Week\n", 173 | "## 7.7 Creating Lagged Feature\n", 174 | "## 7.8 Using Rolling Time Windows" 175 | ] 176 | }, 177 | { 178 | "cell_type": "code", 179 | "execution_count": 10, 180 | "metadata": {}, 181 | "outputs": [ 182 | { 183 | "data": { 184 | "text/html": [ 185 | "
\n", 186 | "\n", 199 | "\n", 200 | " \n", 201 | " \n", 202 | " \n", 203 | " \n", 204 | " \n", 205 | " \n", 206 | " \n", 207 | " \n", 208 | " \n", 209 | " \n", 210 | " \n", 211 | " \n", 212 | " \n", 213 | " \n", 214 | " \n", 215 | " \n", 216 | " \n", 217 | " \n", 218 | " \n", 219 | " \n", 220 | " \n", 221 | " \n", 222 | " \n", 223 | " \n", 224 | " \n", 225 | " \n", 226 | " \n", 227 | " \n", 228 | "
Stock_Price
2010-01-31NaN
2010-02-281.5
2010-03-312.5
2010-04-303.5
2010-05-314.5
\n", 229 | "
" 230 | ], 231 | "text/plain": [ 232 | " Stock_Price\n", 233 | "2010-01-31 NaN\n", 234 | "2010-02-28 1.5\n", 235 | "2010-03-31 2.5\n", 236 | "2010-04-30 3.5\n", 237 | "2010-05-31 4.5" 238 | ] 239 | }, 240 | "execution_count": 10, 241 | "metadata": {}, 242 | "output_type": "execute_result" 243 | } 244 | ], 245 | "source": [ 246 | "import pandas as pd\n", 247 | "\n", 248 | "time_index = pd.date_range('01/01/2010', periods=5, freq='M')\n", 249 | "df = pd.DataFrame(index=time_index)\n", 250 | "df['Stock_Price'] = [1,2,3,4,5]\n", 251 | "df.rolling(window=2).mean()" 252 | ] 253 | }, 254 | { 255 | "cell_type": "markdown", 256 | "metadata": {}, 257 | "source": [ 258 | "### See Also\n", 259 | "* pandas documentation: Rolling Windows (https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.rolling.html)\n", 260 | "* What are Moving Average or Smoothing Techniques (https://www.itl.nist.gov/div898/handbook/pmc/section4/pmc42.htm)\n", 261 | "\n", 262 | "## 7.9 Handling Missing Data in Time Series" 263 | ] 264 | }, 265 | { 266 | "cell_type": "code", 267 | "execution_count": 11, 268 | "metadata": {}, 269 | "outputs": [ 270 | { 271 | "data": { 272 | "text/html": [ 273 | "
\n", 274 | "\n", 287 | "\n", 288 | " \n", 289 | " \n", 290 | " \n", 291 | " \n", 292 | " \n", 293 | " \n", 294 | " \n", 295 | " \n", 296 | " \n", 297 | " \n", 298 | " \n", 299 | " \n", 300 | " \n", 301 | " \n", 302 | " \n", 303 | " \n", 304 | " \n", 305 | " \n", 306 | " \n", 307 | " \n", 308 | " \n", 309 | " \n", 310 | " \n", 311 | " \n", 312 | " \n", 313 | " \n", 314 | " \n", 315 | " \n", 316 | "
Sales
2010-01-311.0
2010-02-282.0
2010-03-313.0
2010-04-304.0
2010-05-315.0
\n", 317 | "
" 318 | ], 319 | "text/plain": [ 320 | " Sales\n", 321 | "2010-01-31 1.0\n", 322 | "2010-02-28 2.0\n", 323 | "2010-03-31 3.0\n", 324 | "2010-04-30 4.0\n", 325 | "2010-05-31 5.0" 326 | ] 327 | }, 328 | "execution_count": 11, 329 | "metadata": {}, 330 | "output_type": "execute_result" 331 | } 332 | ], 333 | "source": [ 334 | "import pandas as pd\n", 335 | "import numpy as np\n", 336 | "\n", 337 | "time_index = pd.date_range('01/01/2010', periods=5, freq='M')\n", 338 | "\n", 339 | "df = pd.DataFrame(index=time_index)\n", 340 | "\n", 341 | "df[\"Sales\"] = [1.0, 2.0, np.nan, np.nan, 5.0]\n", 342 | "\n", 343 | "df.interpolate()" 344 | ] 345 | }, 346 | { 347 | "cell_type": "code", 348 | "execution_count": 12, 349 | "metadata": {}, 350 | "outputs": [ 351 | { 352 | "data": { 353 | "text/html": [ 354 | "
\n", 355 | "\n", 368 | "\n", 369 | " \n", 370 | " \n", 371 | " \n", 372 | " \n", 373 | " \n", 374 | " \n", 375 | " \n", 376 | " \n", 377 | " \n", 378 | " \n", 379 | " \n", 380 | " \n", 381 | " \n", 382 | " \n", 383 | " \n", 384 | " \n", 385 | " \n", 386 | " \n", 387 | " \n", 388 | " \n", 389 | " \n", 390 | " \n", 391 | " \n", 392 | " \n", 393 | " \n", 394 | " \n", 395 | " \n", 396 | " \n", 397 | "
Sales
2010-01-311.0
2010-02-282.0
2010-03-312.0
2010-04-302.0
2010-05-315.0
\n", 398 | "
" 399 | ], 400 | "text/plain": [ 401 | " Sales\n", 402 | "2010-01-31 1.0\n", 403 | "2010-02-28 2.0\n", 404 | "2010-03-31 2.0\n", 405 | "2010-04-30 2.0\n", 406 | "2010-05-31 5.0" 407 | ] 408 | }, 409 | "execution_count": 12, 410 | "metadata": {}, 411 | "output_type": "execute_result" 412 | } 413 | ], 414 | "source": [ 415 | "df.ffill()" 416 | ] 417 | }, 418 | { 419 | "cell_type": "code", 420 | "execution_count": 13, 421 | "metadata": {}, 422 | "outputs": [ 423 | { 424 | "data": { 425 | "text/html": [ 426 | "
\n", 427 | "\n", 440 | "\n", 441 | " \n", 442 | " \n", 443 | " \n", 444 | " \n", 445 | " \n", 446 | " \n", 447 | " \n", 448 | " \n", 449 | " \n", 450 | " \n", 451 | " \n", 452 | " \n", 453 | " \n", 454 | " \n", 455 | " \n", 456 | " \n", 457 | " \n", 458 | " \n", 459 | " \n", 460 | " \n", 461 | " \n", 462 | " \n", 463 | " \n", 464 | " \n", 465 | " \n", 466 | " \n", 467 | " \n", 468 | " \n", 469 | "
Sales
2010-01-311.0
2010-02-282.0
2010-03-315.0
2010-04-305.0
2010-05-315.0
\n", 470 | "
" 471 | ], 472 | "text/plain": [ 473 | " Sales\n", 474 | "2010-01-31 1.0\n", 475 | "2010-02-28 2.0\n", 476 | "2010-03-31 5.0\n", 477 | "2010-04-30 5.0\n", 478 | "2010-05-31 5.0" 479 | ] 480 | }, 481 | "execution_count": 13, 482 | "metadata": {}, 483 | "output_type": "execute_result" 484 | } 485 | ], 486 | "source": [ 487 | "df.bfill()" 488 | ] 489 | }, 490 | { 491 | "cell_type": "code", 492 | "execution_count": 14, 493 | "metadata": {}, 494 | "outputs": [ 495 | { 496 | "data": { 497 | "text/html": [ 498 | "
\n", 499 | "\n", 512 | "\n", 513 | " \n", 514 | " \n", 515 | " \n", 516 | " \n", 517 | " \n", 518 | " \n", 519 | " \n", 520 | " \n", 521 | " \n", 522 | " \n", 523 | " \n", 524 | " \n", 525 | " \n", 526 | " \n", 527 | " \n", 528 | " \n", 529 | " \n", 530 | " \n", 531 | " \n", 532 | " \n", 533 | " \n", 534 | " \n", 535 | " \n", 536 | " \n", 537 | " \n", 538 | " \n", 539 | " \n", 540 | " \n", 541 | "
Sales
2010-01-311.000000
2010-02-282.000000
2010-03-313.059808
2010-04-304.038069
2010-05-315.000000
\n", 542 | "
" 543 | ], 544 | "text/plain": [ 545 | " Sales\n", 546 | "2010-01-31 1.000000\n", 547 | "2010-02-28 2.000000\n", 548 | "2010-03-31 3.059808\n", 549 | "2010-04-30 4.038069\n", 550 | "2010-05-31 5.000000" 551 | ] 552 | }, 553 | "execution_count": 14, 554 | "metadata": {}, 555 | "output_type": "execute_result" 556 | } 557 | ], 558 | "source": [ 559 | "df.interpolate(method=\"quadratic\")" 560 | ] 561 | }, 562 | { 563 | "cell_type": "code", 564 | "execution_count": 15, 565 | "metadata": {}, 566 | "outputs": [ 567 | { 568 | "data": { 569 | "text/html": [ 570 | "
\n", 571 | "\n", 584 | "\n", 585 | " \n", 586 | " \n", 587 | " \n", 588 | " \n", 589 | " \n", 590 | " \n", 591 | " \n", 592 | " \n", 593 | " \n", 594 | " \n", 595 | " \n", 596 | " \n", 597 | " \n", 598 | " \n", 599 | " \n", 600 | " \n", 601 | " \n", 602 | " \n", 603 | " \n", 604 | " \n", 605 | " \n", 606 | " \n", 607 | " \n", 608 | " \n", 609 | " \n", 610 | " \n", 611 | " \n", 612 | " \n", 613 | "
Sales
2010-01-311.0
2010-02-282.0
2010-03-313.0
2010-04-30NaN
2010-05-315.0
\n", 614 | "
" 615 | ], 616 | "text/plain": [ 617 | " Sales\n", 618 | "2010-01-31 1.0\n", 619 | "2010-02-28 2.0\n", 620 | "2010-03-31 3.0\n", 621 | "2010-04-30 NaN\n", 622 | "2010-05-31 5.0" 623 | ] 624 | }, 625 | "execution_count": 15, 626 | "metadata": {}, 627 | "output_type": "execute_result" 628 | } 629 | ], 630 | "source": [ 631 | "df.interpolate(limit=1, limit_direction=\"forward\")" 632 | ] 633 | }, 634 | { 635 | "cell_type": "code", 636 | "execution_count": null, 637 | "metadata": {}, 638 | "outputs": [], 639 | "source": [] 640 | } 641 | ], 642 | "metadata": { 643 | "kernelspec": { 644 | "display_name": "Python [conda env:machine_learning_cookbook]", 645 | "language": "python", 646 | "name": "conda-env-machine_learning_cookbook-py" 647 | }, 648 | "language_info": { 649 | "codemirror_mode": { 650 | "name": "ipython", 651 | "version": 3 652 | }, 653 | "file_extension": ".py", 654 | "mimetype": "text/x-python", 655 | "name": "python", 656 | "nbconvert_exporter": "python", 657 | "pygments_lexer": "ipython3", 658 | "version": "3.6.6" 659 | } 660 | }, 661 | "nbformat": 4, 662 | "nbformat_minor": 2 663 | } 664 | -------------------------------------------------------------------------------- /Chapter 13 - Linear Regression.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "## Chapter 13\n", 8 | "---\n", 9 | "# Linear Regression\n", 10 | "\n", 11 | "### 13.0 Introduction\n", 12 | "Linear regression is one of the simplest supervised learning algorithms in our toolkit. If you have ever taken an introductory statistics course in college, likely the final topic you covered was linear regression. In fact, it is so simple that it is sometimes not considered machine learning at all!\n", 13 | "\n", 14 | "Whatever you believe, the fact is that linear regression--and its extensions--continues to be a common and useful method of making predictions when the target vector is a quantitative value (e.g. 
home price, age)\n", 15 | "\n", 16 | "### 13.1 Fitting a Line\n", 17 | "#### Problem\n", 18 | "You want to train a model that represents a linear relationship between the feature and target vector.\n", 19 | "\n", 20 | "#### Solution\n", 21 | "Use a linear regression (`LinearRegression` in scikit-learn)" 22 | ] 23 | }, 24 | { 25 | "cell_type": "code", 26 | "execution_count": 1, 27 | "metadata": { 28 | "collapsed": true 29 | }, 30 | "outputs": [], 31 | "source": [ 32 | "from sklearn.linear_model import LinearRegression\n", 33 | "from sklearn.datasets import load_boston\n", 34 | "\n", 35 | "boston = load_boston()\n", 36 | "features = boston.data[:, 0:2]\n", 37 | "target = boston.target\n", 38 | "\n", 39 | "regression = LinearRegression()\n", 40 | "\n", 41 | "model = regression.fit(features, target)" 42 | ] 43 | }, 44 | { 45 | "cell_type": "markdown", 46 | "metadata": {}, 47 | "source": [ 48 | "### 13.4 Reducing Variance with Regularization\n", 49 | "#### Problem\n", 50 | "You want to reduce the variance of your linear regression model\n", 51 | "\n", 52 | "#### Solution\n", 53 | "Use a learning algorithm that includes a *shrinkage penalty* (also called **regularization**) like ridge regression and lasso regression:" 54 | ] 55 | }, 56 | { 57 | "cell_type": "code", 58 | "execution_count": 2, 59 | "metadata": { 60 | "collapsed": false 61 | }, 62 | "outputs": [], 63 | "source": [ 64 | "from sklearn.linear_model import Ridge\n", 65 | "from sklearn.datasets import load_boston\n", 66 | "from sklearn.preprocessing import StandardScaler\n", 67 | "\n", 68 | "boston = load_boston()\n", 69 | "features = boston.data\n", 70 | "target = boston.target\n", 71 | "\n", 72 | "scaler = StandardScaler()\n", 73 | "features_standardized = scaler.fit_transform(features)\n", 74 | "\n", 75 | "regression = Ridge(alpha=0.5)\n", 76 | "model = regression.fit(features_standardized, target)" 77 | ] 78 | }, 79 | { 80 | "cell_type": "markdown", 81 | "metadata": {}, 82 | "source": [ 83 | "#### Discussion\n", 84 | "In standard linear regression the model trains to minimize the sum of squared error between the true($y_i$) and prediction ($\\hat y_i$) target values, or residual sum of squares (RSS):\n", 85 | "$$\n", 86 | "RSS = \\sum_{i=1}^n{(y_i - \\hat y_i)^2}\n", 87 | "$$\n", 88 | "\n", 89 | "Regularized regression learners are similar, except they attempt to minimize RSS and some penalty for the total size of the coefficient values, called a shrinkage penalty because it attempts to \"shrink\" the model. There are two common types of regularized learners for linear regression: ridge regression and the lasso. The only formal difference is the type of shrinkage penalty used. In ridge regression, the shrinkage penalty is a tuning hyperparameter multiplied by the squared sum of all coefficients:\n", 90 | "$$\n", 91 | "RSS+\\alpha \\sum_{j=1}^p{\\hat \\beta_j^2}\n", 92 | "$$\n", 93 | "\n", 94 | "where $\\hat \\beta_j$ is the coefficient of the jth of p features and $\\alpha$ is a hyperparameter (discussed next). The lasso is similar, except the shrinkage penalty is a tuning hyperparmeter multiplied by the squared sum of all coefficients:\n", 95 | "$$\n", 96 | "\\frac{1}{2n} RSS + \\alpha \\sum_{j=1}^p{|\\beta_j|}\n", 97 | "$$\n", 98 | "\n", 99 | "where n is the number of observations. So which one should we use? A a very general rule of thumb, ridge regression often produces slightly better predictions than lasso, but lasso (for reasons we will discuss in Recipe 13.5) produces more interpretable models. 
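To make the ridge-versus-lasso trade-off concrete before moving on, here is a small comparison sketch of my own (assuming `features_standardized` and `target` from the solution above are still in scope): at the same penalty strength, ridge keeps every coefficient non-zero while the lasso drives some of them exactly to zero.

# Compare how the two penalties treat the coefficients at the same alpha
from sklearn.linear_model import Ridge, Lasso

ridge_coef = Ridge(alpha=0.5).fit(features_standardized, target).coef_
lasso_coef = Lasso(alpha=0.5).fit(features_standardized, target).coef_

print("Ridge non-zero coefficients:", (ridge_coef != 0).sum())  # all 13 survive
print("Lasso non-zero coefficients:", (lasso_coef != 0).sum())  # several are exactly 0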
If we want a balance between, ridge and lasso's penalty functions we can use elastic net, which is simply a regression model with both penalties included. Regardless of which one we use, bot hridge and lasso regresions can penalize large or complex models by including coefficient values in the loss funciton we are trying to minimize\n", 100 | "\n", 101 | "The hyper parameter $\\alpha$ lets us control how much we penalize the coefficients, with higher values of $\\alpha$ creating simpler models. The ideal value of $\\alpha$ should be tuned like any other hyperparameter. In scikit-learn, $\\alpha$ is set using the alpha parameter.\n", 102 | "\n", 103 | "scikit-learn includes a RidgeCV method that allows us to select the ideal value for $\\alpha:" 104 | ] 105 | }, 106 | { 107 | "cell_type": "code", 108 | "execution_count": 3, 109 | "metadata": { 110 | "collapsed": false 111 | }, 112 | "outputs": [ 113 | { 114 | "data": { 115 | "text/plain": [ 116 | "array([-0.91215884, 1.0658758 , 0.11942614, 0.68558782, -2.03231631,\n", 117 | " 2.67922108, 0.01477326, -3.0777265 , 2.58814315, -2.00973173,\n", 118 | " -2.05390717, 0.85614763, -3.73565106])" 119 | ] 120 | }, 121 | "execution_count": 3, 122 | "metadata": {}, 123 | "output_type": "execute_result" 124 | } 125 | ], 126 | "source": [ 127 | "from sklearn.linear_model import RidgeCV\n", 128 | "\n", 129 | "regr_cv = RidgeCV(alphas=[0.1, 1.0, 10.0])\n", 130 | "\n", 131 | "model_cv = regr_cv.fit(features_standardized, target)\n", 132 | "\n", 133 | "model_cv.coef_" 134 | ] 135 | }, 136 | { 137 | "cell_type": "code", 138 | "execution_count": 4, 139 | "metadata": { 140 | "collapsed": false 141 | }, 142 | "outputs": [ 143 | { 144 | "data": { 145 | "text/plain": [ 146 | "1.0" 147 | ] 148 | }, 149 | "execution_count": 4, 150 | "metadata": {}, 151 | "output_type": "execute_result" 152 | } 153 | ], 154 | "source": [ 155 | "# view alpha\n", 156 | "model_cv.alpha_" 157 | ] 158 | }, 159 | { 160 | "cell_type": "markdown", 161 | "metadata": {}, 162 | "source": [ 163 | "One final note: because in linear regression the value of the coefficients is partially determined by the scale of the feature, and in regularized models all coefficients are summed together, we must make sure to standardize the feature prior to training\n", 164 | "\n", 165 | "### 13.5 Reducing Features with Lasso Regression\n", 166 | "#### Problem\n", 167 | "You want to simplify your linear regression model by reducing the number of features.\n", 168 | "\n", 169 | "#### Solution\n", 170 | "Use a lasso regression" 171 | ] 172 | }, 173 | { 174 | "cell_type": "code", 175 | "execution_count": 5, 176 | "metadata": { 177 | "collapsed": true 178 | }, 179 | "outputs": [], 180 | "source": [ 181 | "from sklearn.linear_model import Lasso\n", 182 | "from sklearn.datasets import load_boston\n", 183 | "from sklearn.preprocessing import StandardScaler\n", 184 | "\n", 185 | "boston = load_boston()\n", 186 | "features = boston.data\n", 187 | "target = boston.target\n", 188 | "\n", 189 | "scaler = StandardScaler()\n", 190 | "features_standardized = scaler.fit_transform(features)\n", 191 | "\n", 192 | "regression = Lasso(alpha=0.5)\n", 193 | "model = regression.fit(features_standardized, target)" 194 | ] 195 | }, 196 | { 197 | "cell_type": "markdown", 198 | "metadata": {}, 199 | "source": [ 200 | "#### Discussion\n", 201 | "One interesting characteristic of lasso regression's penalty is that it can shrink the coefficients of a model to zero, effectively reducing the number of features in the model. 
For example, in our solution we set `alpha` to 0.5 and we can see that many of the coefficients are 0, meaning their corresponding features are not used in the model:" 202 | ] 203 | }, 204 | { 205 | "cell_type": "code", 206 | "execution_count": 6, 207 | "metadata": { 208 | "collapsed": false 209 | }, 210 | "outputs": [ 211 | { 212 | "data": { 213 | "text/plain": [ 214 | "array([-0.10697735, 0. , -0. , 0.39739898, -0. ,\n", 215 | " 2.97332316, -0. , -0.16937793, -0. , -0. ,\n", 216 | " -1.59957374, 0.54571511, -3.66888402])" 217 | ] 218 | }, 219 | "execution_count": 6, 220 | "metadata": {}, 221 | "output_type": "execute_result" 222 | } 223 | ], 224 | "source": [ 225 | "model.coef_" 226 | ] 227 | }, 228 | { 229 | "cell_type": "markdown", 230 | "metadata": {}, 231 | "source": [ 232 | "However if we increase $\\alpha$ to a much higher value, we see that lierally none of the features are being used:" 233 | ] 234 | }, 235 | { 236 | "cell_type": "code", 237 | "execution_count": 7, 238 | "metadata": { 239 | "collapsed": false 240 | }, 241 | "outputs": [ 242 | { 243 | "data": { 244 | "text/plain": [ 245 | "array([-0., 0., -0., 0., -0., 0., -0., 0., -0., -0., -0., 0., -0.])" 246 | ] 247 | }, 248 | "execution_count": 7, 249 | "metadata": {}, 250 | "output_type": "execute_result" 251 | } 252 | ], 253 | "source": [ 254 | "regression_a10 = Lasso(alpha=10)\n", 255 | "model_a10 = regression_a10.fit(features_standardized, target)\n", 256 | "model_a10.coef_" 257 | ] 258 | }, 259 | { 260 | "cell_type": "markdown", 261 | "metadata": {}, 262 | "source": [ 263 | "The practical benefit of this effect is that it means that we could include 100 features in our feature matrix and then, through adjusting lasso's $\\alpha$ hyperparameter, produce a model that uses only 10 (for instance) of the most important features. 
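In practice we would rarely hand-pick alpha; a hedged sketch (my own, again assuming `features_standardized` and `target` from the recipe above) that lets cross-validation choose it for the lasso, analogous to the RidgeCV example earlier:

# Let cross-validation pick the lasso's regularization strength
from sklearn.linear_model import LassoCV

lasso_cv = LassoCV(alphas=[0.01, 0.1, 0.5, 1.0, 10.0], cv=5)
lasso_cv_model = lasso_cv.fit(features_standardized, target)

print(lasso_cv_model.alpha_)              # the selected alpha
print((lasso_cv_model.coef_ != 0).sum())  # how many features survive the penalty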
This lets us reduce variance whiel improving interpretability of our model (since fewer features is easier to explain)" 264 | ] 265 | }, 266 | { 267 | "cell_type": "code", 268 | "execution_count": null, 269 | "metadata": { 270 | "collapsed": true 271 | }, 272 | "outputs": [], 273 | "source": [] 274 | } 275 | ], 276 | "metadata": { 277 | "kernelspec": { 278 | "display_name": "Python [conda env:machine_learning_cookbook]", 279 | "language": "python", 280 | "name": "conda-env-machine_learning_cookbook-py" 281 | }, 282 | "language_info": { 283 | "codemirror_mode": { 284 | "name": "ipython", 285 | "version": 3 286 | }, 287 | "file_extension": ".py", 288 | "mimetype": "text/x-python", 289 | "name": "python", 290 | "nbconvert_exporter": "python", 291 | "pygments_lexer": "ipython3", 292 | "version": "3.6.6" 293 | } 294 | }, 295 | "nbformat": 4, 296 | "nbformat_minor": 2 297 | } 298 | -------------------------------------------------------------------------------- /Chapter 15 - K-Nearest Neighbors.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "## Chapter 15\n", 8 | "---\n", 9 | "# K-Nearest Neighbors\n", 10 | "\n", 11 | "An observation is predicted to be the class of that of the largest proportion of the k-nearest observations.\n", 12 | "\n", 13 | "## 15.1 Finding an Observation's Nearest Neighbors" 14 | ] 15 | }, 16 | { 17 | "cell_type": "code", 18 | "execution_count": 2, 19 | "metadata": { 20 | "collapsed": false 21 | }, 22 | "outputs": [ 23 | { 24 | "data": { 25 | "text/plain": [ 26 | "array([[[1.03800476, 0.56925129, 1.10395287, 1.1850097 ],\n", 27 | " [0.79566902, 0.33784833, 0.76275864, 1.05353673]]])" 28 | ] 29 | }, 30 | "execution_count": 2, 31 | "metadata": {}, 32 | "output_type": "execute_result" 33 | } 34 | ], 35 | "source": [ 36 | "from sklearn import datasets\n", 37 | "from sklearn.neighbors import NearestNeighbors\n", 38 | "from sklearn.preprocessing import StandardScaler\n", 39 | "\n", 40 | "iris = datasets.load_iris()\n", 41 | "features = iris.data\n", 42 | "\n", 43 | "standardizer = StandardScaler()\n", 44 | "\n", 45 | "features_standardized = standardizer.fit_transform(features)\n", 46 | "\n", 47 | "nearest_neighbors = NearestNeighbors(n_neighbors=2).fit(features_standardized)\n", 48 | "#nearest_neighbors_euclidian = NearestNeighbors(n_neighbors=2, metric='euclidian').fit(features_standardized)\n", 49 | "new_observation = [1, 1, 1, 1]\n", 50 | "\n", 51 | "distances, indices = nearest_neighbors.kneighbors([new_observation])\n", 52 | "\n", 53 | "features_standardized[indices]" 54 | ] 55 | }, 56 | { 57 | "cell_type": "markdown", 58 | "metadata": {}, 59 | "source": [ 60 | "### Discussion\n", 61 | "\n", 62 | "How do we measure distance?\n", 63 | "\n", 64 | "* Euclidian\n", 65 | "$$\n", 66 | "d_{euclidean} = \\sqrt{\\sum_{i=1}^{n}{(x_i - y_i)^2}}\n", 67 | "$$\n", 68 | "\n", 69 | "* Manhattan\n", 70 | "$$\n", 71 | "d_{manhattan} = \\sum_{i=1}^{n}{|x_i - y_i|}\n", 72 | "$$\n", 73 | "\n", 74 | "* Minkowski (default)\n", 75 | "$$\n", 76 | "d_{minkowski} = (\\sum_{i=1}^{n}{|x_i - y_i|^p})^{\\frac{1}{p}}\n", 77 | "$$\n", 78 | "## 15.2 Creating a K-Nearest Neighbor Classifier" 79 | ] 80 | }, 81 | { 82 | "cell_type": "code", 83 | "execution_count": 3, 84 | "metadata": { 85 | "collapsed": false 86 | }, 87 | "outputs": [ 88 | { 89 | "data": { 90 | "text/plain": [ 91 | "array([1, 2])" 92 | ] 93 | }, 94 | "execution_count": 3, 95 | "metadata": {}, 96 | 
"output_type": "execute_result" 97 | } 98 | ], 99 | "source": [ 100 | "from sklearn.neighbors import KNeighborsClassifier\n", 101 | "from sklearn.preprocessing import StandardScaler\n", 102 | "from sklearn import datasets\n", 103 | "\n", 104 | "iris = datasets.load_iris()\n", 105 | "X = iris.data\n", 106 | "y = iris.target\n", 107 | "\n", 108 | "standardizer = StandardScaler()\n", 109 | "\n", 110 | "X_std = standardizer.fit_transform(X)\n", 111 | "\n", 112 | "knn = KNeighborsClassifier(n_neighbors=5, n_jobs=-1).fit(X_std, y)\n", 113 | "\n", 114 | "new_observations = [[0.75, 0.75, 0.75, 0.75],\n", 115 | " [1, 1, 1, 1]]\n", 116 | "\n", 117 | "knn.predict(new_observations)" 118 | ] 119 | }, 120 | { 121 | "cell_type": "markdown", 122 | "metadata": {}, 123 | "source": [ 124 | "### Discussion\n", 125 | "In KNN, given an observation $x_u$, with an unknown target class, the algorithm first identifies the k closest observations (sometimes called $x_u$'s neighborhood) based on some distance metric, then these k observations \"vote\" based on their class and the class that wins the vote is $x_u$'s predicted class. More formally, the probability $x_u$ is some class j is:\n", 126 | "$$\n", 127 | "\\frac{1}{k} \\sum_{i \\in v}{I(y_i = j)}\n", 128 | "$$\n", 129 | "where v is the k observatoin in $x_u$'s neighborhood, $y_i$ is the class of the ith observation, and I is an indicator function (i.e., 1 is true, 0 otherwise). In scikit-learn we can see these probabilities using `predict_proba`\n", 130 | "\n", 131 | "## 15.3 Identifying the Best Neighborhood Size" 132 | ] 133 | }, 134 | { 135 | "cell_type": "code", 136 | "execution_count": 4, 137 | "metadata": { 138 | "collapsed": false 139 | }, 140 | "outputs": [ 141 | { 142 | "data": { 143 | "text/plain": [ 144 | "6" 145 | ] 146 | }, 147 | "execution_count": 4, 148 | "metadata": {}, 149 | "output_type": "execute_result" 150 | } 151 | ], 152 | "source": [ 153 | "from sklearn.neighbors import KNeighborsClassifier\n", 154 | "from sklearn import datasets\n", 155 | "from sklearn.preprocessing import StandardScaler\n", 156 | "from sklearn.pipeline import Pipeline, FeatureUnion\n", 157 | "from sklearn.model_selection import GridSearchCV\n", 158 | "\n", 159 | "iris = datasets.load_iris()\n", 160 | "features = iris.data\n", 161 | "target = iris.target\n", 162 | "\n", 163 | "standardizer = StandardScaler()\n", 164 | "features_standardized = standardizer.fit_transform(features)\n", 165 | "\n", 166 | "knn = KNeighborsClassifier(n_neighbors=5, n_jobs=-1)\n", 167 | "\n", 168 | "pipe = Pipeline([(\"standardizer\", standardizer), (\"knn\", knn)])\n", 169 | "\n", 170 | "search_space = [{\"knn__n_neighbors\": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]}]\n", 171 | "\n", 172 | "classifier = GridSearchCV(\n", 173 | " pipe, search_space, cv=5, verbose=0).fit(features_standardized, target)\n", 174 | "\n", 175 | "classifier.best_estimator_.get_params()[\"knn__n_neighbors\"]" 176 | ] 177 | }, 178 | { 179 | "cell_type": "markdown", 180 | "metadata": {}, 181 | "source": [ 182 | "## 15.4 Creating a Radius-Based Nearest Neighbor Classifier\n", 183 | "given an observation of unknown class, you need to predict its class based on the class of all observations within a certain distance." 
184 | ] 185 | }, 186 | { 187 | "cell_type": "code", 188 | "execution_count": 5, 189 | "metadata": { 190 | "collapsed": false 191 | }, 192 | "outputs": [ 193 | { 194 | "data": { 195 | "text/plain": [ 196 | "array([2])" 197 | ] 198 | }, 199 | "execution_count": 5, 200 | "metadata": {}, 201 | "output_type": "execute_result" 202 | } 203 | ], 204 | "source": [ 205 | "from sklearn.neighbors import RadiusNeighborsClassifier\n", 206 | "from sklearn.preprocessing import StandardScaler\n", 207 | "from sklearn import datasets\n", 208 | "\n", 209 | "iris = datasets.load_iris()\n", 210 | "features = iris.data\n", 211 | "target = iris.target\n", 212 | "\n", 213 | "standardizer = StandardScaler()\n", 214 | "features_standardized = standardizer.fit_transform(features)\n", 215 | "\n", 216 | "rnn = RadiusNeighborsClassifier(\n", 217 | " radius=.5, n_jobs=-1).fit(features_standardized, target)\n", 218 | "\n", 219 | "new_observations = [[1, 1, 1, 1]]\n", 220 | "\n", 221 | "rnn.predict(new_observations)" 222 | ] 223 | }, 224 | { 225 | "cell_type": "code", 226 | "execution_count": null, 227 | "metadata": { 228 | "collapsed": true 229 | }, 230 | "outputs": [], 231 | "source": [] 232 | } 233 | ], 234 | "metadata": { 235 | "kernelspec": { 236 | "display_name": "Python [conda env:machine_learning_cookbook]", 237 | "language": "python", 238 | "name": "conda-env-machine_learning_cookbook-py" 239 | }, 240 | "language_info": { 241 | "codemirror_mode": { 242 | "name": "ipython", 243 | "version": 3 244 | }, 245 | "file_extension": ".py", 246 | "mimetype": "text/x-python", 247 | "name": "python", 248 | "nbconvert_exporter": "python", 249 | "pygments_lexer": "ipython3", 250 | "version": "3.6.6" 251 | } 252 | }, 253 | "nbformat": 4, 254 | "nbformat_minor": 2 255 | } 256 | -------------------------------------------------------------------------------- /Chapter 16 - Logistic Regression.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "## Chapter 16\n", 8 | "---\n", 9 | "# Logistic Regression\n", 10 | "\n", 11 | "Despire being called a regression, logistic regression is actually a widely used supervised classification technique. 
\n", 12 | "Allows us to predict the probability that an observation is of a certain class\n", 13 | "\n", 14 | "## 16.1 Training a Binary Classifier" 15 | ] 16 | }, 17 | { 18 | "cell_type": "code", 19 | "execution_count": 3, 20 | "metadata": { 21 | "collapsed": false 22 | }, 23 | "outputs": [ 24 | { 25 | "name": "stdout", 26 | "output_type": "stream", 27 | "text": [ 28 | "model.predict: [1]\n", 29 | "model.predict_proba: [[0.18823041 0.81176959]]\n" 30 | ] 31 | } 32 | ], 33 | "source": [ 34 | "from sklearn.linear_model import LogisticRegression\n", 35 | "from sklearn import datasets\n", 36 | "from sklearn.preprocessing import StandardScaler\n", 37 | "\n", 38 | "iris = datasets.load_iris()\n", 39 | "features = iris.data[:100,:]\n", 40 | "target = iris.target[:100]\n", 41 | "\n", 42 | "scaler = StandardScaler()\n", 43 | "features_standardized = scaler.fit_transform(features)\n", 44 | "\n", 45 | "logistic_regression = LogisticRegression(random_state=0)\n", 46 | "model = logistic_regression.fit(features_standardized, target)\n", 47 | "\n", 48 | "new_observation = [[.5, .5, .5, .5]]\n", 49 | "\n", 50 | "print(\"model.predict: {}\".format(model.predict(new_observation)))\n", 51 | "print(\"model.predict_proba: {}\".format(model.predict_proba(new_observation)))" 52 | ] 53 | }, 54 | { 55 | "cell_type": "markdown", 56 | "metadata": {}, 57 | "source": [ 58 | "### Discussion\n", 59 | "Dispire having \"regression\" in its name, a logistic regression is actually a widely used binary lassifier (i.e. the target vector can only take two values). In a logistic regression, a linear model (e.g. $\\beta_0 + \\beta_i x$) is included in a logistic (also called sigmoid) function, $\\frac{1}{1+e^{-z }}$, such that:\n", 60 | "$$\n", 61 | "P(y_i = 1 | X) = \\frac{1}{1+e^{-(\\beta_0 + \\beta_1x)}}\n", 62 | "$$\n", 63 | "where $P(y_i = 1 | X)$ is the probability of the ith obsevation's target, $y_i$ being class 1, X is the training data, $\\beta_0$ and $\\beta_1$ are the parameters to be learned, and e is Euler's number. The effect of the logistic function is to constrain the value of the function's output to between 0 and 1 so that i can be interpreted as a probability. If $P(y_i = 1 | X)$ is greater than 0.5, class 1 is predicted; otherwise class 0 is predicted\n", 64 | "\n", 65 | "## 16.2 Training a Multiclass Classifier" 66 | ] 67 | }, 68 | { 69 | "cell_type": "code", 70 | "execution_count": 4, 71 | "metadata": { 72 | "collapsed": true 73 | }, 74 | "outputs": [], 75 | "source": [ 76 | "from sklearn.linear_model import LogisticRegression\n", 77 | "from sklearn import datasets\n", 78 | "from sklearn.preprocessing import StandardScaler\n", 79 | "\n", 80 | "iris = datasets.load_iris()\n", 81 | "features = iris.data\n", 82 | "target = iris.target\n", 83 | "\n", 84 | "scaler = StandardScaler()\n", 85 | "features_standardized = scaler.fit_transform(features)\n", 86 | "\n", 87 | "logistic_regression = LogisticRegression(random_state=0, multi_class=\"ovr\")\n", 88 | "#logistic_regression_MNL = LogisticRegression(random_state=0, multi_class=\"multinomial\")\n", 89 | "\n", 90 | "model = logistic_regression.fit(features_standardized, target)" 91 | ] 92 | }, 93 | { 94 | "cell_type": "markdown", 95 | "metadata": {}, 96 | "source": [ 97 | "### Discussion\n", 98 | "On their own, logistic regressions are only binary classifiers, meaning they cannot handle target vectors with more than two classes. However, two clever extensions to logistic regression do just that. 
First, in one-vs-rest logistic regression (OVR) a separate model is trained for each class predicted whether an observation is that class or not (thus making it a binary classification problem). It assumes that each observation problem (e.g. class 0 or not) is independent\n", 99 | "\n", 100 | "Alternatively in multinomial logistic regression (MLR) the logistic function we saw in Recipe 15.1 is replaced with a softmax function:\n", 101 | "$$\n", 102 | "P(y_I = k | X) = \\frac{e^{\\beta_k x_i}}{\\sum_{j=1}^{K}{e^{\\beta_j x_i}}}\n", 103 | "$$\n", 104 | "where $P(y_i = k | X)$ is the probability of the ith observation's target value, $y_i$, is class k, and K is the total number of classes. One practical advantage of the MLR is that its predicted probabilities using `predict_proba` method are more reliable\n", 105 | "\n", 106 | "We can switch to an MNL by setting `multi_class='multinomial'`\n", 107 | "\n", 108 | "## 16.3 Reducing Variance Through Regularization" 109 | ] 110 | }, 111 | { 112 | "cell_type": "code", 113 | "execution_count": 5, 114 | "metadata": { 115 | "collapsed": true 116 | }, 117 | "outputs": [], 118 | "source": [ 119 | "from sklearn.linear_model import LogisticRegressionCV\n", 120 | "from sklearn import datasets\n", 121 | "from sklearn.preprocessing import StandardScaler\n", 122 | "\n", 123 | "iris = datasets.load_iris()\n", 124 | "features = iris.data\n", 125 | "target = iris.target\n", 126 | "\n", 127 | "scaler = StandardScaler()\n", 128 | "features_standardized = scaler.fit_transform(features)\n", 129 | "\n", 130 | "logistic_regression = LogisticRegressionCV(\n", 131 | " penalty='l2', Cs=10, random_state=0, n_jobs=-1)\n", 132 | "\n", 133 | "model = logistic_regression.fit(features_standardized, target)" 134 | ] 135 | }, 136 | { 137 | "cell_type": "markdown", 138 | "metadata": {}, 139 | "source": [ 140 | "### Discussion\n", 141 | "Regularization is a method of penalizing complex models to reduce their variance. Specifically, a penalty term is added to the loss function we are trying to minimize typically the L1 and L2 penalties\n", 142 | "\n", 143 | "In the L1 penalty:\n", 144 | "$$\n", 145 | "\\alpha \\sum_{j=1}^{p}{|\\hat\\beta_j|}\n", 146 | "$$\n", 147 | "where $\\hat\\beta_j$ is the parameters of the jth of p features being learned and $\\alpha$ is a hyperparameter denoting the regularization strength.\n", 148 | "\n", 149 | "With the L2 penalty:\n", 150 | "$$\n", 151 | "\\alpha \\sum_{j=1}^{p}{\\hat\\beta_j^2}\n", 152 | "$$\n", 153 | "higher values of $\\alpha$ increase the penalty for larger parameter values(i.e. more complex models). scikit-learn follows the common method of using C instead of $\\alpha$ where C is the inverse of the regularization strength: $C = \\frac{1}{\\alpha}$. To reduce variance while using logistic regression, we can treat C as a hyperparameter to be tuned to find thevalue of C that creates the best model. 
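One way to see the effect of C is to cross-validate a few candidate values by hand. A minimal sketch (the grid of C values is arbitrary and only for illustration):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn import datasets

iris = datasets.load_iris()
features_standardized = StandardScaler().fit_transform(iris.data)
target = iris.target

# smaller C = stronger regularization (C = 1/alpha)
for C in [0.01, 0.1, 1.0, 10.0]:
    scores = cross_val_score(
        LogisticRegression(C=C, random_state=0),
        features_standardized, target, cv=5)
    print(C, scores.mean())
```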
In scikit-learn we can use the `LogisticRegressionCV` class to efficiently tune C.\n", 154 | "\n", 155 | "## 16.4 Training a Classifier on Very Large Data" 156 | ] 157 | }, 158 | { 159 | "cell_type": "code", 160 | "execution_count": 6, 161 | "metadata": { 162 | "collapsed": true 163 | }, 164 | "outputs": [], 165 | "source": [ 166 | "from sklearn.linear_model import LogisticRegression\n", 167 | "from sklearn import datasets\n", 168 | "from sklearn.preprocessing import StandardScaler\n", 169 | "\n", 170 | "iris = datasets.load_iris()\n", 171 | "features = iris.data\n", 172 | "target = iris.target\n", 173 | "\n", 174 | "scaler = StandardScaler()\n", 175 | "features_standardized = scaler.fit_transform(features)\n", 176 | "\n", 177 | "logistic_regression = LogisticRegression(random_state=0, solver=\"sag\") # stochastic average gradient (SAG) solver\n", 178 | "model = logistic_regression.fit(features_standardized, target)" 179 | ] 180 | }, 181 | { 182 | "cell_type": "markdown", 183 | "metadata": {}, 184 | "source": [ 185 | "### Discussion\n", 186 | "scikit-learn's `LogisticRegression` offers a number of techniques for training a logistic regression, called solvers. Most of the time scikit-learn will select the best solver automatically for us or warn us we cannot do something with that solver.\n", 187 | "\n", 188 | "Stochastic averge gradient descent allows us to train a model much faster than other solvers when our data is very large. However, it is also very sensitive to feature scaling, so standardizing our features is particularly important\n", 189 | "\n", 190 | "### See Also\n", 191 | "* Minimizing Finite Sums with the Stochastic Average Gradient Algorithm, Mark Schmidt (http://www.birs.ca/workshops/2014/14w5003/files/schmidt.pdf)\n", 192 | "\n", 193 | "## 16.5 Handling Imbalanced Classes" 194 | ] 195 | }, 196 | { 197 | "cell_type": "code", 198 | "execution_count": 8, 199 | "metadata": { 200 | "collapsed": false 201 | }, 202 | "outputs": [], 203 | "source": [ 204 | "import numpy as np\n", 205 | "from sklearn.linear_model import LogisticRegression\n", 206 | "from sklearn import datasets\n", 207 | "from sklearn.preprocessing import StandardScaler\n", 208 | "\n", 209 | "iris = datasets.load_iris()\n", 210 | "features = iris.data[40:, :]\n", 211 | "target = iris.target[40:]\n", 212 | "\n", 213 | "target = np.where((target == 0), 0, 1)\n", 214 | "\n", 215 | "scaler = StandardScaler()\n", 216 | "features_standardized = scaler.fit_transform(features)\n", 217 | "\n", 218 | "logistic_regression = LogisticRegression(random_state=0, class_weight=\"balanced\")\n", 219 | "model = logistic_regression.fit(features_standardized, target)" 220 | ] 221 | }, 222 | { 223 | "cell_type": "markdown", 224 | "metadata": {}, 225 | "source": [ 226 | "### Discussion\n", 227 | "`LogisticRegression` comes with a built in method of handling imbalanced classes.\n", 228 | "`class_weight=\"balanced\"` will automatically weigh classes inversely proportional to their frequency:\n", 229 | "$$\n", 230 | "w_j = \\frac{n}{kn_j}\n", 231 | "$$\n", 232 | "where $w_j$ is the weight to class j, n is the number of observations, $n_j$ is the number of observations in class j, and k is the total number of classes" 233 | ] 234 | } 235 | ], 236 | "metadata": { 237 | "kernelspec": { 238 | "display_name": "Python [conda env:machine_learning_cookbook]", 239 | "language": "python", 240 | "name": "conda-env-machine_learning_cookbook-py" 241 | }, 242 | "language_info": { 243 | "codemirror_mode": { 244 | "name": "ipython", 245 | "version": 3 246 | 
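To make the `class_weight="balanced"` formula from Recipe 16.5 concrete, the weights can be reproduced directly. A minimal sketch using the same imbalanced iris subset; `compute_class_weight` is scikit-learn's helper for this calculation:

```python
import numpy as np
from sklearn import datasets
from sklearn.utils.class_weight import compute_class_weight

iris = datasets.load_iris()
target = np.where(iris.target[40:] == 0, 0, 1)   # 10 observations of class 0, 100 of class 1

n = len(target)                      # total number of observations
k = len(np.unique(target))           # number of classes
n_j = np.bincount(target)            # observations per class

print(n / (k * n_j))                                   # manual w_j = n / (k * n_j)
print(compute_class_weight(class_weight="balanced",
                           classes=np.unique(target), y=target))
```

Both lines print the same weights (here 5.5 for the rare class and 0.55 for the common one), which is what `class_weight="balanced"` applies internally.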
}, 247 | "file_extension": ".py", 248 | "mimetype": "text/x-python", 249 | "name": "python", 250 | "nbconvert_exporter": "python", 251 | "pygments_lexer": "ipython3", 252 | "version": "3.6.6" 253 | } 254 | }, 255 | "nbformat": 4, 256 | "nbformat_minor": 2 257 | } 258 | -------------------------------------------------------------------------------- /Chapter 18 - Naive Bayes.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "## Chapter 18\n", 8 | "---\n", 9 | "# Naive Bayes\n", 10 | "\n", 11 | "### 18.0 Introduction\n", 12 | "Bayes' theorem is the premier method for understanding the probability of some event $P(A|B)$, given some new information, $P(B|A)$, and a prior belief in the probability of the event, P(A):\n", 13 | "$$\n", 14 | "P(A | B) = \\frac{P(B|A)P(A)}{P(B)}\n", 15 | "$$\n", 16 | "\n", 17 | "The Bayesian method's popularity has skyrocked in the last decade, more and more rivaling the traditional frequentist applications in academia, government, and business. In machine learning, one applicaiton of Bayes' theorem to classifican comes in the form of the naive Bayes classifier. Naive Bayes classifiers combine a number of desirable qualities in practical machine learning into a single classifier:\n", 18 | "\n", 19 | "1. An intuitive approach\n", 20 | "2. The ability to work with small data\n", 21 | "3. Low computation costs for training and prediction\n", 22 | "4. Often solid results in a variety of settigns\n", 23 | "\n", 24 | "Specifically, a naive bayes classifier is based on:\n", 25 | "$$\n", 26 | "P(y | x_1, ..., x_j) = \\frac{P(x_1, ..., x_j | y)P(y)}{P(x_1,...,x_j)}\n", 27 | "$$\n", 28 | "where,\n", 29 | "* $P(y | x_1, ..., x_j)$ is called the *posterior* and is the probability that an observation is class y given observation's values for the j features, $x_1, ..., x_j$\n", 30 | "* $P(x_1, ..., x_j)$ is called likelihood and is the *likelihood* of an observation's values for features, $x_1, ..., x_j$, given their class y.\n", 31 | "* $P(y)$ is called the *prior* and is our belief for the probability of class y before looking at the data\n", 32 | "* P($x_1, ..., x_j$) is called the *marginal probability*\n", 33 | "\n", 34 | "In naive Bayes, we compare an obsrvation's posterior values for each possible class. Specifically, because the marginal probability is constant across these comparisons, we compare the numerators of the posterior for each class. For each observation the class with the greatest posterior numerator becomes the predicted class, $\\hat y$.\n", 35 | "\n", 36 | "There are two important things to note about naive Bayes classifiers.\n", 37 | "\n", 38 | "1. for each feature in the data, we have to assume the statistical distribution of the likelihood, $P(x_1, ..., x_j)$.\n", 39 | "- the common distributions are the normal (Gaussian), multinomial, and Bernoulli distributions.\n", 40 | "- the distribution chose is often determined by the nature of the features (continuous, binary, etc.)\n", 41 | "\n", 42 | "2. naive Bayes gets its name because we assume that each feature, and its resulting likelihood, is independent. 
This \"naive\" assumption is frequently wrong, yet in practice does little to prevent building high quality classifiers\n", 43 | "\n", 44 | "## 18.1 Training a Classifier for Continuous Features" 45 | ] 46 | }, 47 | { 48 | "cell_type": "code", 49 | "execution_count": 3, 50 | "metadata": { 51 | "collapsed": false 52 | }, 53 | "outputs": [ 54 | { 55 | "data": { 56 | "text/plain": [ 57 | "array([1])" 58 | ] 59 | }, 60 | "execution_count": 3, 61 | "metadata": {}, 62 | "output_type": "execute_result" 63 | } 64 | ], 65 | "source": [ 66 | "from sklearn import datasets\n", 67 | "from sklearn.naive_bayes import GaussianNB\n", 68 | "\n", 69 | "iris = datasets.load_iris()\n", 70 | "features = iris.data\n", 71 | "target = iris.target\n", 72 | "\n", 73 | "classifier = GaussianNB()\n", 74 | "\n", 75 | "model = classifier.fit(features, target)\n", 76 | "\n", 77 | "new_observation = [[4, 4, 4, 0.4]]\n", 78 | "model.predict(new_observation)" 79 | ] 80 | }, 81 | { 82 | "cell_type": "markdown", 83 | "metadata": {}, 84 | "source": [ 85 | "### Discussion\n", 86 | "The most common type of naive Bayes classifier is the Gaussian naive Bayesa. In Gaussian naive Bayesam we assuem that the likelihood of the feature values, x, given an observation is of class y, follows a normal distribution:\n", 87 | "$$\n", 88 | "p(x_j | y) = \\frac{1}{\\sqrt{2\\pi \\sigma_y^2}} e^{-\\frac{(x_j - \\mu_y)^2}{2\\sigma_y^2}}\n", 89 | "$$\n", 90 | "where $\\sigma_y^2$ and $\\mu_y$ are the variance and mean values of feature x_j for class y. Because of the assumption of the normal distribution, Gaussian naive Bayes is best used in cases when all our features are continuous.\n", 91 | "\n", 92 | "One of the interesting aspects of naive Bayes classifiers is that they allow us to assign a prior belief over the respect target classes. We can do this using `GaussianNB`'s `priors` parameter, which takes in a list of the probabilities assigned to each class of the target vector" 93 | ] 94 | }, 95 | { 96 | "cell_type": "code", 97 | "execution_count": 4, 98 | "metadata": { 99 | "collapsed": true 100 | }, 101 | "outputs": [], 102 | "source": [ 103 | "clf = GaussianNB(priors=[0.25, 0.25, 0.5])\n", 104 | "model = classifier.fit(features, target)" 105 | ] 106 | }, 107 | { 108 | "cell_type": "markdown", 109 | "metadata": {}, 110 | "source": [ 111 | "### See Also\n", 112 | "* How the Naive Bayes Classifier Works (http://dataaspirant.com/2017/02/06/naive-bayes-classifier-machine-learning/)\n", 113 | "\n", 114 | "## 18.2 Training a Classifier for Discrete and Count Features\n" 115 | ] 116 | }, 117 | { 118 | "cell_type": "code", 119 | "execution_count": 6, 120 | "metadata": { 121 | "collapsed": false 122 | }, 123 | "outputs": [ 124 | { 125 | "data": { 126 | "text/plain": [ 127 | "array([0])" 128 | ] 129 | }, 130 | "execution_count": 6, 131 | "metadata": {}, 132 | "output_type": "execute_result" 133 | } 134 | ], 135 | "source": [ 136 | "import numpy as np\n", 137 | "from sklearn.naive_bayes import MultinomialNB\n", 138 | "from sklearn.feature_extraction.text import CountVectorizer\n", 139 | "\n", 140 | "text_data = np.array(['I love Brazil. 
Brazil!', 'Brazil is best', 'Germany beats both'])\n", 141 | "\n", 142 | "count = CountVectorizer()\n", 143 | "bag_of_words = count.fit_transform(text_data)\n", 144 | "\n", 145 | "features = bag_of_words.toarray()\n", 146 | "\n", 147 | "target = np.array([0, 0, 1])\n", 148 | "\n", 149 | "classifier = MultinomialNB(class_prior=[0.25, 0.5])\n", 150 | "model = classifier.fit(features, target)\n", 151 | "\n", 152 | "new_observation = [[0, 0, 0, 1, 0, 1, 0]]\n", 153 | "model.predict(new_observation)" 154 | ] 155 | }, 156 | { 157 | "cell_type": "markdown", 158 | "metadata": {}, 159 | "source": [ 160 | "### Discussion\n", 161 | "\n", 162 | "Multinomial naive Bayes works similarly to Gaussian naive Bayes, but the features are assumed to be multinomial distributed. In practice this means that this classifier is commonly used when we have discrete data. One of the most common uses is text classification using bag of words or tf-idf approaches\n", 163 | "\n", 164 | "## 18.3 Training a Naive Bayes Classifier for Binary Features" 165 | ] 166 | }, 167 | { 168 | "cell_type": "code", 169 | "execution_count": 7, 170 | "metadata": { 171 | "collapsed": true 172 | }, 173 | "outputs": [], 174 | "source": [ 175 | "import numpy as np\n", 176 | "from sklearn.naive_bayes import BernoulliNB\n", 177 | "\n", 178 | "features = np.random.randint(2, size=(100, 3))\n", 179 | "target = np.random.randint(2, size=(100, 1)).ravel()\n", 180 | "\n", 181 | "classifier = BernoulliNB(class_prior=[0.25, 0.5])\n", 182 | "model = classifier.fit(features, target)" 183 | ] 184 | }, 185 | { 186 | "cell_type": "markdown", 187 | "metadata": {}, 188 | "source": [ 189 | "The Bernoulli naive Bayes classifier assumes that all our features are binary such that they take only two values (e.g. a nominal categorical feature that has been one-hot encoded). 
Like its multinomial cousin, Bernoulli naive Bayes is often used in text classification, when our feature matrix is simply the presence or absence of a word in a document" 190 | ] 191 | } 192 | ], 193 | "metadata": { 194 | "kernelspec": { 195 | "display_name": "Python [conda env:machine_learning_cookbook]", 196 | "language": "python", 197 | "name": "conda-env-machine_learning_cookbook-py" 198 | }, 199 | "language_info": { 200 | "codemirror_mode": { 201 | "name": "ipython", 202 | "version": 3 203 | }, 204 | "file_extension": ".py", 205 | "mimetype": "text/x-python", 206 | "name": "python", 207 | "nbconvert_exporter": "python", 208 | "pygments_lexer": "ipython3", 209 | "version": "3.6.6" 210 | } 211 | }, 212 | "nbformat": 4, 213 | "nbformat_minor": 2 214 | } 215 | -------------------------------------------------------------------------------- /Chapter 19 - Clustering.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "## Chapter 19\n", 8 | "---\n", 9 | "# Clustering\n", 10 | "\n", 11 | "## 19.0 Introduction\n", 12 | "\n", 13 | "Frequently, we run into situations where we only know the features.\n", 14 | "\n", 15 | "The goal of clustering algorithms is to identify latent groupings of obesrvations, which if done well, allow us to predict the class of observations even without a target vector.\n", 16 | "\n", 17 | "## 19.1 Clustering Using K-Means" 18 | ] 19 | }, 20 | { 21 | "cell_type": "code", 22 | "execution_count": 2, 23 | "metadata": { 24 | "collapsed": false 25 | }, 26 | "outputs": [ 27 | { 28 | "data": { 29 | "text/plain": [ 30 | "array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,\n", 31 | " 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,\n", 32 | " 1, 1, 1, 1, 1, 1, 0, 0, 0, 2, 2, 2, 0, 2, 2, 2, 2, 2, 2, 2, 2, 0,\n", 33 | " 2, 2, 2, 2, 0, 2, 2, 2, 2, 0, 0, 0, 2, 2, 2, 2, 2, 2, 2, 0, 0, 2,\n", 34 | " 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 0, 2, 0, 0, 0, 0, 2, 0, 0, 0,\n", 35 | " 0, 0, 0, 2, 2, 0, 0, 0, 0, 2, 0, 2, 0, 2, 0, 0, 2, 0, 0, 0, 0, 0,\n", 36 | " 0, 2, 2, 0, 0, 0, 2, 0, 0, 0, 2, 0, 0, 0, 2, 0, 0, 2], dtype=int32)" 37 | ] 38 | }, 39 | "execution_count": 2, 40 | "metadata": {}, 41 | "output_type": "execute_result" 42 | } 43 | ], 44 | "source": [ 45 | "from sklearn import datasets\n", 46 | "from sklearn.preprocessing import StandardScaler\n", 47 | "from sklearn.cluster import KMeans\n", 48 | "\n", 49 | "iris = datasets.load_iris()\n", 50 | "features = iris.data\n", 51 | "\n", 52 | "scaler = StandardScaler()\n", 53 | "features_std = scaler.fit_transform(features)\n", 54 | "\n", 55 | "cluster = KMeans(n_clusters=3, random_state=0, n_jobs=-1)\n", 56 | "model = cluster.fit(features_std)\n", 57 | "\n", 58 | "model.labels_" 59 | ] 60 | }, 61 | { 62 | "cell_type": "markdown", 63 | "metadata": {}, 64 | "source": [ 65 | "### Discussion\n", 66 | "k-means clustering is one of the most common clustering techniques. In k-means clustering, the algorithm attempts to group observations into k groups, with each group having roughly equal variance. The number of groups, k, is specified by the user as a hyperparameter. Specifically, in k-means:\n", 67 | "\n", 68 | "1. k cluster \"center\" points are created at random locations.\n", 69 | "\n", 70 | "2. For each observation:\n", 71 | " a. the distance between each observaiton and the k center points is calculated\n", 72 | " b. 
the observation is assigned to the cluster of the nearest center point\n", 73 | " \n", 74 | "3. The center points are moved to the means (i.e., centers) of their respective clusters\n", 75 | "\n", 76 | "4. Steps 2 and 3 are repeated until no observation changes in cluster membership\n", 77 | "\n", 78 | "k-means clustering assumes:\n", 79 | "* the clusters are convex shaped (e.g. a circle, a sphere).\n", 80 | "* all features are equally scaled\n", 81 | "* the groups are balanced\n", 82 | "\n", 83 | "### See Also\n", 84 | "* Introduction to K-means Clustering (https://www.datascience.com/blog/k-means-clustering)\n", 85 | "\n", 86 | "## 19.2 Speeding Up K-Means Clustering" 87 | ] 88 | }, 89 | { 90 | "cell_type": "code", 91 | "execution_count": 3, 92 | "metadata": { 93 | "collapsed": false 94 | }, 95 | "outputs": [ 96 | { 97 | "data": { 98 | "text/plain": [ 99 | "array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,\n", 100 | " 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,\n", 101 | " 1, 1, 1, 1, 1, 1, 2, 2, 2, 0, 0, 0, 2, 0, 2, 0, 0, 0, 0, 0, 0, 2,\n", 102 | " 0, 0, 0, 0, 2, 0, 0, 0, 0, 2, 2, 2, 0, 0, 0, 0, 0, 0, 0, 2, 2, 0,\n", 103 | " 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 2, 2, 2, 2, 0, 2, 2, 2,\n", 104 | " 2, 2, 2, 0, 2, 2, 2, 2, 2, 0, 2, 0, 2, 0, 2, 2, 0, 2, 2, 2, 2, 2,\n", 105 | " 2, 0, 0, 2, 2, 2, 2, 2, 2, 2, 0, 2, 2, 2, 0, 2, 2, 2], dtype=int32)" 106 | ] 107 | }, 108 | "execution_count": 3, 109 | "metadata": {}, 110 | "output_type": "execute_result" 111 | } 112 | ], 113 | "source": [ 114 | "from sklearn import datasets\n", 115 | "from sklearn.preprocessing import StandardScaler\n", 116 | "from sklearn.cluster import MiniBatchKMeans\n", 117 | "\n", 118 | "iris = datasets.load_iris()\n", 119 | "features = iris.data\n", 120 | "\n", 121 | "scaler = StandardScaler()\n", 122 | "features_std = scaler.fit_transform(features)\n", 123 | "\n", 124 | "cluster = MiniBatchKMeans(n_clusters=3, random_state=0, batch_size=100)\n", 125 | "model = cluster.fit(features_std)\n", 126 | "\n", 127 | "model.labels_" 128 | ] 129 | }, 130 | { 131 | "cell_type": "markdown", 132 | "metadata": {}, 133 | "source": [ 134 | "## 19.3 Clustering Using Meanshift" 135 | ] 136 | }, 137 | { 138 | "cell_type": "code", 139 | "execution_count": 4, 140 | "metadata": { 141 | "collapsed": false 142 | }, 143 | "outputs": [ 144 | { 145 | "data": { 146 | "text/plain": [ 147 | "array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,\n", 148 | " 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,\n", 149 | " 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n", 150 | " 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n", 151 | " 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n", 152 | " 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n", 153 | " 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])" 154 | ] 155 | }, 156 | "execution_count": 4, 157 | "metadata": {}, 158 | "output_type": "execute_result" 159 | } 160 | ], 161 | "source": [ 162 | "from sklearn import datasets\n", 163 | "from sklearn.preprocessing import StandardScaler\n", 164 | "from sklearn.cluster import MeanShift\n", 165 | "\n", 166 | "iris = datasets.load_iris()\n", 167 | "features = iris.data\n", 168 | "\n", 169 | "scaler = StandardScaler()\n", 170 | "features_std = scaler.fit_transform(features)\n", 171 | "\n", 172 | "cluster = MeanShift(n_jobs=-1)\n", 173 | "model = cluster.fit(features_std)\n", 174 | "\n", 175 | 
"model.labels_" 176 | ] 177 | }, 178 | { 179 | "cell_type": "markdown", 180 | "metadata": {}, 181 | "source": [ 182 | "## 19.4 Clustering Using DBSCAN" 183 | ] 184 | }, 185 | { 186 | "cell_type": "code", 187 | "execution_count": 5, 188 | "metadata": { 189 | "collapsed": false 190 | }, 191 | "outputs": [ 192 | { 193 | "data": { 194 | "text/plain": [ 195 | "array([ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, -1, -1, 0,\n", 196 | " 0, 0, 0, 0, 0, -1, 0, 0, 0, 0, 0, 0, 0, 0, 0, -1, -1,\n", 197 | " 0, 0, 0, 0, 0, 0, 0, -1, 0, 0, 0, 0, 0, 0, 0, 0, 1,\n", 198 | " 1, 1, 1, 1, 1, -1, -1, 1, -1, -1, 1, -1, 1, 1, 1, 1, 1,\n", 199 | " -1, 1, 1, 1, -1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,\n", 200 | " -1, 1, -1, 1, 1, 1, 1, 1, -1, 1, 1, 1, 1, -1, 1, -1, 1,\n", 201 | " 1, 1, 1, -1, -1, -1, -1, -1, 1, 1, 1, 1, -1, 1, 1, -1, -1,\n", 202 | " -1, 1, 1, -1, 1, 1, -1, 1, 1, 1, -1, -1, -1, 1, 1, 1, -1,\n", 203 | " -1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, -1, 1])" 204 | ] 205 | }, 206 | "execution_count": 5, 207 | "metadata": {}, 208 | "output_type": "execute_result" 209 | } 210 | ], 211 | "source": [ 212 | "from sklearn import datasets\n", 213 | "from sklearn.preprocessing import StandardScaler\n", 214 | "from sklearn.cluster import DBSCAN\n", 215 | "\n", 216 | "iris = datasets.load_iris()\n", 217 | "features = iris.data\n", 218 | "\n", 219 | "scaler = StandardScaler()\n", 220 | "features_std = scaler.fit_transform(features)\n", 221 | "\n", 222 | "cluster = DBSCAN(n_jobs=-1)\n", 223 | "model = cluster.fit(features_std)\n", 224 | "\n", 225 | "model.labels_" 226 | ] 227 | }, 228 | { 229 | "cell_type": "markdown", 230 | "metadata": {}, 231 | "source": [ 232 | "## 19.5 Clustering using Hierarchical Merging" 233 | ] 234 | }, 235 | { 236 | "cell_type": "code", 237 | "execution_count": 7, 238 | "metadata": { 239 | "collapsed": false 240 | }, 241 | "outputs": [ 242 | { 243 | "data": { 244 | "text/plain": [ 245 | "array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,\n", 246 | " 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1,\n", 247 | " 1, 1, 1, 1, 1, 1, 0, 0, 0, 2, 0, 2, 0, 2, 0, 2, 2, 0, 2, 0, 2, 0,\n", 248 | " 2, 2, 2, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 2, 2, 2, 0, 2, 0, 0, 2,\n", 249 | " 2, 2, 2, 0, 2, 2, 2, 2, 2, 0, 2, 2, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0,\n", 250 | " 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n", 251 | " 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])" 252 | ] 253 | }, 254 | "execution_count": 7, 255 | "metadata": {}, 256 | "output_type": "execute_result" 257 | } 258 | ], 259 | "source": [ 260 | "from sklearn import datasets\n", 261 | "from sklearn.preprocessing import StandardScaler\n", 262 | "from sklearn.cluster import AgglomerativeClustering\n", 263 | "\n", 264 | "iris = datasets.load_iris()\n", 265 | "features = iris.data\n", 266 | "\n", 267 | "scaler = StandardScaler()\n", 268 | "features_std = scaler.fit_transform(features)\n", 269 | "\n", 270 | "cluster = AgglomerativeClustering(n_clusters=3)\n", 271 | "model = cluster.fit(features_std)\n", 272 | "\n", 273 | "model.labels_" 274 | ] 275 | }, 276 | { 277 | "cell_type": "code", 278 | "execution_count": null, 279 | "metadata": { 280 | "collapsed": true 281 | }, 282 | "outputs": [], 283 | "source": [] 284 | } 285 | ], 286 | "metadata": { 287 | "kernelspec": { 288 | "display_name": "Python [conda env:machine_learning_cookbook]", 289 | "language": "python", 290 | "name": "conda-env-machine_learning_cookbook-py" 291 | }, 292 | "language_info": { 293 | "codemirror_mode": { 294 | "name": 
"ipython", 295 | "version": 3 296 | }, 297 | "file_extension": ".py", 298 | "mimetype": "text/x-python", 299 | "name": "python", 300 | "nbconvert_exporter": "python", 301 | "pygments_lexer": "ipython3", 302 | "version": "3.6.6" 303 | } 304 | }, 305 | "nbformat": 4, 306 | "nbformat_minor": 2 307 | } 308 | -------------------------------------------------------------------------------- /Chapter 21 - Saving and Loading Trained Models.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "### Chapter 21\n", 8 | "## Saving and Loading Trained Models\n", 9 | "\n", 10 | "### 21.0 Introduction\n", 11 | "In the last 20 chapters around 200 recipies, we have convered how to take raw data nad usem achine learning to create well-performing predictive models. However, for all our work to be worthwhile we eventually need to do something with our model, such as integrating it with an existing software application. To accomplish this goal, we need to be able to bot hsave our models after training and load them when they are needed by an application. This is the focus of the final chapter\n", 12 | "\n", 13 | "### 21.1 Saving and Loading a scikit-learn Model\n", 14 | "#### Problem\n", 15 | "You have trained a scikit-learn model and want to save it and load it elsewhere.\n", 16 | "\n", 17 | "#### Solution\n", 18 | "Save the model as a pickle file:" 19 | ] 20 | }, 21 | { 22 | "cell_type": "code", 23 | "execution_count": 1, 24 | "metadata": {}, 25 | "outputs": [ 26 | { 27 | "name": "stderr", 28 | "output_type": "stream", 29 | "text": [ 30 | "/Users/f00/anaconda/envs/machine_learning_cookbook/lib/python3.6/site-packages/sklearn/ensemble/weight_boosting.py:29: DeprecationWarning: numpy.core.umath_tests is an internal NumPy module and should not be imported. 
It will be removed in a future NumPy release.\n", 31 | " from numpy.core.umath_tests import inner1d\n" 32 | ] 33 | }, 34 | { 35 | "data": { 36 | "text/plain": [ 37 | "['model.pkl']" 38 | ] 39 | }, 40 | "execution_count": 1, 41 | "metadata": {}, 42 | "output_type": "execute_result" 43 | } 44 | ], 45 | "source": [ 46 | "# load libraries\n", 47 | "from sklearn.ensemble import RandomForestClassifier\n", 48 | "from sklearn import datasets\n", 49 | "from sklearn.externals import joblib\n", 50 | "\n", 51 | "# load data\n", 52 | "iris = datasets.load_iris()\n", 53 | "features = iris.data\n", 54 | "target = iris.target\n", 55 | "\n", 56 | "# create decision tree classifier object\n", 57 | "classifier = RandomForestClassifier()\n", 58 | "\n", 59 | "# train model\n", 60 | "model = classifier.fit(features, target)\n", 61 | "\n", 62 | "# save model as pickle file\n", 63 | "joblib.dump(model, \"model.pkl\")" 64 | ] 65 | }, 66 | { 67 | "cell_type": "markdown", 68 | "metadata": {}, 69 | "source": [ 70 | "Once the model is saved we can use scikit-learn in our destination application (e.g., web application) to load the model:" 71 | ] 72 | }, 73 | { 74 | "cell_type": "code", 75 | "execution_count": 2, 76 | "metadata": {}, 77 | "outputs": [], 78 | "source": [ 79 | "# load model from file\n", 80 | "classifier = joblib.load(\"model.pkl\")" 81 | ] 82 | }, 83 | { 84 | "cell_type": "markdown", 85 | "metadata": {}, 86 | "source": [ 87 | "And use it to make predictions" 88 | ] 89 | }, 90 | { 91 | "cell_type": "code", 92 | "execution_count": 3, 93 | "metadata": {}, 94 | "outputs": [ 95 | { 96 | "data": { 97 | "text/plain": [ 98 | "array([0])" 99 | ] 100 | }, 101 | "execution_count": 3, 102 | "metadata": {}, 103 | "output_type": "execute_result" 104 | } 105 | ], 106 | "source": [ 107 | "# create new observation\n", 108 | "new_observation = [[ 5.2, 3.2, 1.1, 0.1]]\n", 109 | "\n", 110 | "# predict obserrvation's class\n", 111 | "classifier.predict(new_observation)" 112 | ] 113 | }, 114 | { 115 | "cell_type": "markdown", 116 | "metadata": {}, 117 | "source": [ 118 | "### Discussion\n", 119 | "The first step in using a model in production is to save that model as a file that can be loaded by another application or workflow. We can accomplish this by saving the model as a pickle file, a Python-specific data format. 
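Because a pickle file is just Python's native serialization, the same round trip can also be done with the standard-library `pickle` module. A minimal sketch (the filename is illustrative):

```python
import pickle
from sklearn.ensemble import RandomForestClassifier
from sklearn import datasets

iris = datasets.load_iris()
model = RandomForestClassifier().fit(iris.data, iris.target)

# save the trained model to disk
with open("model_pickle_demo.pkl", "wb") as f:
    pickle.dump(model, f)

# load it back and make a prediction
with open("model_pickle_demo.pkl", "rb") as f:
    restored = pickle.load(f)

print(restored.predict([[5.2, 3.2, 1.1, 0.1]]))
```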
Specifically, to save the model we use `joblib`, which is a library extending pickle for cases when we have large NumPy arrays--a common occurance for trained models in scikit-learn.\n", 120 | "\n", 121 | "When saving scikit-learn models, be aware that saved models might not be compatible between versions of scikit-learn; therefore, it can be helpful to include the version of scikit-learn used in the model in the filename:" 122 | ] 123 | }, 124 | { 125 | "cell_type": "code", 126 | "execution_count": 4, 127 | "metadata": {}, 128 | "outputs": [ 129 | { 130 | "data": { 131 | "text/plain": [ 132 | "['model_(version).pkl']" 133 | ] 134 | }, 135 | "execution_count": 4, 136 | "metadata": {}, 137 | "output_type": "execute_result" 138 | } 139 | ], 140 | "source": [ 141 | "# import library\n", 142 | "import sklearn\n", 143 | "\n", 144 | "# get scikit-learn version\n", 145 | "scikit_version = joblib.__version__\n", 146 | "\n", 147 | "# save model as pickle file\n", 148 | "joblib.dump(model, \"model_(version).pkl\".format(version=scikit_version))" 149 | ] 150 | }, 151 | { 152 | "cell_type": "markdown", 153 | "metadata": {}, 154 | "source": [ 155 | "### 21.2 Saving and Loading a Keras Model\n", 156 | "#### Problem\n", 157 | "You have a trained Keras model and want to save it and load it elsewhere.\n", 158 | "\n", 159 | "#### Solution\n", 160 | "Save the model as HDF5:" 161 | ] 162 | }, 163 | { 164 | "cell_type": "code", 165 | "execution_count": 1, 166 | "metadata": {}, 167 | "outputs": [ 168 | { 169 | "name": "stderr", 170 | "output_type": "stream", 171 | "text": [ 172 | "Using Theano backend.\n" 173 | ] 174 | }, 175 | { 176 | "ename": "ModuleNotFoundError", 177 | "evalue": "No module named 'theano'", 178 | "output_type": "error", 179 | "traceback": [ 180 | "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", 181 | "\u001b[0;31mModuleNotFoundError\u001b[0m Traceback (most recent call last)", 182 | "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[1;32m 1\u001b[0m \u001b[0;31m# load libraries\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 2\u001b[0m \u001b[0;32mimport\u001b[0m \u001b[0mnumpy\u001b[0m \u001b[0;32mas\u001b[0m \u001b[0mnp\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 3\u001b[0;31m \u001b[0;32mfrom\u001b[0m \u001b[0mkeras\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mdatasets\u001b[0m \u001b[0;32mimport\u001b[0m \u001b[0mimdb\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 4\u001b[0m \u001b[0;32mfrom\u001b[0m \u001b[0mkeras\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mpreprocessing\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mtext\u001b[0m \u001b[0;32mimport\u001b[0m \u001b[0mTokenizer\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 5\u001b[0m \u001b[0;32mfrom\u001b[0m \u001b[0mkeras\u001b[0m \u001b[0;32mimport\u001b[0m \u001b[0mmodels\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", 183 | "\u001b[0;32m~/anaconda/envs/machine_learning_cookbook/lib/python3.6/site-packages/keras/__init__.py\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[1;32m 1\u001b[0m \u001b[0;32mfrom\u001b[0m \u001b[0m__future__\u001b[0m \u001b[0;32mimport\u001b[0m \u001b[0mabsolute_import\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 2\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 3\u001b[0;31m \u001b[0;32mfrom\u001b[0m \u001b[0;34m.\u001b[0m \u001b[0;32mimport\u001b[0m \u001b[0mutils\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 4\u001b[0m \u001b[0;32mfrom\u001b[0m 
\u001b[0;34m.\u001b[0m \u001b[0;32mimport\u001b[0m \u001b[0mactivations\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 5\u001b[0m \u001b[0;32mfrom\u001b[0m \u001b[0;34m.\u001b[0m \u001b[0;32mimport\u001b[0m \u001b[0mapplications\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", 184 | "\u001b[0;32m~/anaconda/envs/machine_learning_cookbook/lib/python3.6/site-packages/keras/utils/__init__.py\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[1;32m 4\u001b[0m \u001b[0;32mfrom\u001b[0m \u001b[0;34m.\u001b[0m \u001b[0;32mimport\u001b[0m \u001b[0mdata_utils\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 5\u001b[0m \u001b[0;32mfrom\u001b[0m \u001b[0;34m.\u001b[0m \u001b[0;32mimport\u001b[0m \u001b[0mio_utils\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 6\u001b[0;31m \u001b[0;32mfrom\u001b[0m \u001b[0;34m.\u001b[0m \u001b[0;32mimport\u001b[0m \u001b[0mconv_utils\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 7\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 8\u001b[0m \u001b[0;31m# Globally-importable utils.\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", 185 | "\u001b[0;32m~/anaconda/envs/machine_learning_cookbook/lib/python3.6/site-packages/keras/utils/conv_utils.py\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[1;32m 7\u001b[0m \u001b[0;32mfrom\u001b[0m \u001b[0msix\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mmoves\u001b[0m \u001b[0;32mimport\u001b[0m \u001b[0mrange\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 8\u001b[0m \u001b[0;32mimport\u001b[0m \u001b[0mnumpy\u001b[0m \u001b[0;32mas\u001b[0m \u001b[0mnp\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 9\u001b[0;31m \u001b[0;32mfrom\u001b[0m \u001b[0;34m.\u001b[0m\u001b[0;34m.\u001b[0m \u001b[0;32mimport\u001b[0m \u001b[0mbackend\u001b[0m \u001b[0;32mas\u001b[0m \u001b[0mK\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 10\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 11\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n", 186 | "\u001b[0;32m~/anaconda/envs/machine_learning_cookbook/lib/python3.6/site-packages/keras/backend/__init__.py\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[1;32m 84\u001b[0m \u001b[0;32melif\u001b[0m \u001b[0m_BACKEND\u001b[0m \u001b[0;34m==\u001b[0m \u001b[0;34m'theano'\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 85\u001b[0m \u001b[0msys\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mstderr\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mwrite\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m'Using Theano backend.\\n'\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 86\u001b[0;31m \u001b[0;32mfrom\u001b[0m \u001b[0;34m.\u001b[0m\u001b[0mtheano_backend\u001b[0m \u001b[0;32mimport\u001b[0m \u001b[0;34m*\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 87\u001b[0m \u001b[0;32melif\u001b[0m \u001b[0m_BACKEND\u001b[0m \u001b[0;34m==\u001b[0m \u001b[0;34m'tensorflow'\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 88\u001b[0m \u001b[0msys\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mstderr\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mwrite\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m'Using TensorFlow backend.\\n'\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", 187 | "\u001b[0;32m~/anaconda/envs/machine_learning_cookbook/lib/python3.6/site-packages/keras/backend/theano_backend.py\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[1;32m 5\u001b[0m \u001b[0;32mfrom\u001b[0m \u001b[0mcollections\u001b[0m \u001b[0;32mimport\u001b[0m 
\u001b[0mdefaultdict\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 6\u001b[0m \u001b[0;32mfrom\u001b[0m \u001b[0mcontextlib\u001b[0m \u001b[0;32mimport\u001b[0m \u001b[0mcontextmanager\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 7\u001b[0;31m \u001b[0;32mimport\u001b[0m \u001b[0mtheano\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 8\u001b[0m \u001b[0;32mfrom\u001b[0m \u001b[0mtheano\u001b[0m \u001b[0;32mimport\u001b[0m \u001b[0mtensor\u001b[0m \u001b[0;32mas\u001b[0m \u001b[0mT\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 9\u001b[0m \u001b[0;32mfrom\u001b[0m \u001b[0mtheano\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0msandbox\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mrng_mrg\u001b[0m \u001b[0;32mimport\u001b[0m \u001b[0mMRG_RandomStreams\u001b[0m \u001b[0;32mas\u001b[0m \u001b[0mRandomStreams\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", 188 | "\u001b[0;31mModuleNotFoundError\u001b[0m: No module named 'theano'" 189 | ] 190 | } 191 | ], 192 | "source": [ 193 | "# load libraries\n", 194 | "import numpy as np\n", 195 | "from keras.datasets import imdb\n", 196 | "from keras.preprocessing.text import Tokenizer\n", 197 | "from keras import models\n", 198 | "from keras import layers\n", 199 | "from keras.models import load_model\n", 200 | "\n", 201 | "# set random seed\n", 202 | "np.random.seed(0)\n", 203 | "\n", 204 | "# set the number of features we want\n", 205 | "number_of_features = 1000\n", 206 | "\n", 207 | "# load data and target vector from movie review data\n", 208 | "(train_Data, train_target), (test_data, test_target) = imdb.load_data(num_words=number_of_features)\n", 209 | "\n", 210 | "# convert movie review data to a one-hot encoded feature matrix\n", 211 | "tokenizer = Tokenizer(num_words=number_of_features)\n", 212 | "train_features = tokenizer.sequences_to_matrix(train_data, mode=\"binary\")\n", 213 | "test_features = tokenizer.sequences_to_matrix(test_data, mode=\"binary\")\n", 214 | "\n", 215 | "# start neural network\n", 216 | "network = models.Sequential()\n", 217 | "\n", 218 | "# add fully connected layer with ReLU activation function\n", 219 | "network.add(layers.Dense(units=16, activation=\"relu\", input_shape=(number_of_features,)))\n", 220 | "\n", 221 | "# add fully connected layer with a sigmoid activation function\n", 222 | "network.add(layers.Dense(units=1, activation=\"sigmoid\"))\n", 223 | "\n", 224 | "# compile neural network\n", 225 | "network.compile(loss=\"binary_crossentropy\", optimizer=\"rmsprop\", metrics=[\"accuracy\"])\n", 226 | "\n", 227 | "# train neural network\n", 228 | "history = network.fit(train_features, train_target, epochs=3, verbose=0, batch_size=100, validation_data=(test_features, test_target))\n", 229 | "\n", 230 | "# save neural network\n", 231 | "network.save(\"model.h5\")" 232 | ] 233 | }, 234 | { 235 | "cell_type": "markdown", 236 | "metadata": {}, 237 | "source": [ 238 | "We can then load the model either in another application or for additional training" 239 | ] 240 | }, 241 | { 242 | "cell_type": "code", 243 | "execution_count": 7, 244 | "metadata": {}, 245 | "outputs": [ 246 | { 247 | "ename": "NameError", 248 | "evalue": "name 'load_model' is not defined", 249 | "output_type": "error", 250 | "traceback": [ 251 | "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", 252 | "\u001b[0;31mNameError\u001b[0m Traceback (most recent call last)", 253 | "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[1;32m 1\u001b[0m \u001b[0;31m# load neural 
network\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 2\u001b[0;31m \u001b[0mnetwork\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mload_model\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m\"model.h5\"\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m", 254 | "\u001b[0;31mNameError\u001b[0m: name 'load_model' is not defined" 255 | ] 256 | } 257 | ], 258 | "source": [ 259 | "# load neural network\n", 260 | "network = load_model(\"model.h5\")" 261 | ] 262 | }, 263 | { 264 | "cell_type": "markdown", 265 | "metadata": {}, 266 | "source": [ 267 | "#### Discussion\n", 268 | "Unlike scikit-learn, Keras does not recommend you save models using pickle. Instead, models are saved as an HDF5 file. The HDF5 file contains everything you need to not only load the model to make predicitons (i.e., achitecture and trained parameters), but also to restart training (i.e. loss and optimizer settings and the current state)" 269 | ] 270 | }, 271 | { 272 | "cell_type": "code", 273 | "execution_count": null, 274 | "metadata": {}, 275 | "outputs": [], 276 | "source": [] 277 | } 278 | ], 279 | "metadata": { 280 | "kernelspec": { 281 | "display_name": "Python [conda env:machine_learning_cookbook]", 282 | "language": "python", 283 | "name": "conda-env-machine_learning_cookbook-py" 284 | }, 285 | "language_info": { 286 | "codemirror_mode": { 287 | "name": "ipython", 288 | "version": 3 289 | }, 290 | "file_extension": ".py", 291 | "mimetype": "text/x-python", 292 | "name": "python", 293 | "nbconvert_exporter": "python", 294 | "pygments_lexer": "ipython3", 295 | "version": "3.6.6" 296 | } 297 | }, 298 | "nbformat": 4, 299 | "nbformat_minor": 2 300 | } 301 | -------------------------------------------------------------------------------- /Chapter 4 - Handling Numerical Data.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "## Chapter 4\n", 8 | "---\n", 9 | "# Handling Numerical Data\n", 10 | "\n", 11 | "### 4.0 Introduction\n", 12 | "Quantitative data is the measurment of something--weather class size, monthly sales, or student scores. The natural way to represent these quantities is numerically (e.g., 20 students, $529,392 in sales). In this chapter we will cover numerous strategies for transforming raw numerical data into features purpose-built for machine learning algorithms\n", 13 | "\n", 14 | "### 4.1 Rescaling a feature\n", 15 | "Use scikit-learn's `MinMaxScaler` to rescale a feature array" 16 | ] 17 | }, 18 | { 19 | "cell_type": "code", 20 | "execution_count": 2, 21 | "metadata": {}, 22 | "outputs": [ 23 | { 24 | "data": { 25 | "text/plain": [ 26 | "array([[0. ],\n", 27 | " [0.28571429],\n", 28 | " [0.35714286],\n", 29 | " [0.42857143],\n", 30 | " [1. 
]])" 31 | ] 32 | }, 33 | "execution_count": 2, 34 | "metadata": {}, 35 | "output_type": "execute_result" 36 | } 37 | ], 38 | "source": [ 39 | "import numpy as np\n", 40 | "from sklearn import preprocessing\n", 41 | "\n", 42 | "# create a feature\n", 43 | "feature = np.array([\n", 44 | " [-500.5],\n", 45 | " [-100.1],\n", 46 | " [0],\n", 47 | " [100.1],\n", 48 | " [900.9]\n", 49 | "])\n", 50 | "\n", 51 | "# create scaler\n", 52 | "minmax_scaler = preprocessing.MinMaxScaler(feature_range=(0,1))\n", 53 | "\n", 54 | "# scale feature\n", 55 | "scaled_feature = minmax_scaler.fit_transform(feature)\n", 56 | "\n", 57 | "scaled_feature" 58 | ] 59 | }, 60 | { 61 | "cell_type": "markdown", 62 | "metadata": {}, 63 | "source": [ 64 | "#### Discussion\n", 65 | "Rescaling is a common preprocessing task in machine learning. Many of the algorithms described later in this book will assume all features are on the same scale, typically 0 to 1 or -1 to 1. There are a number of rescaling techniques, but one of the simlest is called *min-max scaling*. Min-max scaling uses the minimum and maximum values of a feature to rescale values to within a range. Specfically, min-max calculates:\n", 66 | "$$\n", 67 | "x_i^` = \\frac{x_i - min(x)}{max(x) - min(x)}\n", 68 | "$$\n", 69 | "\n", 70 | "where x is the feature vector, $x_i$ is an individual element of feature x, and $x_i^`$ is the rescaled element\n", 71 | "\n", 72 | "#### See Also\n", 73 | "* Feature scaling, wikipedia (https://en.wikipedia.org/wiki/Feature_scaling)\n", 74 | "\n", 75 | "### 4.2 Standardizing a Feature\n", 76 | "scikit-learn's `StandardScaler` transforms a feature to have a mean of 0 and a standard deviation of 1." 77 | ] 78 | }, 79 | { 80 | "cell_type": "code", 81 | "execution_count": 3, 82 | "metadata": {}, 83 | "outputs": [ 84 | { 85 | "data": { 86 | "text/plain": [ 87 | "array([[-0.76058269],\n", 88 | " [-0.54177196],\n", 89 | " [-0.35009716],\n", 90 | " [-0.32271504],\n", 91 | " [ 1.97516685]])" 92 | ] 93 | }, 94 | "execution_count": 3, 95 | "metadata": {}, 96 | "output_type": "execute_result" 97 | } 98 | ], 99 | "source": [ 100 | "import numpy as np\n", 101 | "from sklearn import preprocessing\n", 102 | "\n", 103 | "# create a feature\n", 104 | "feature = np.array([\n", 105 | " [-1000.1],\n", 106 | " [-200.2],\n", 107 | " [500.5],\n", 108 | " [600.6],\n", 109 | " [9000.9]\n", 110 | "])\n", 111 | "\n", 112 | "# create scaler\n", 113 | "scaler = preprocessing.StandardScaler()\n", 114 | "\n", 115 | "# transform the feature\n", 116 | "standardized = scaler.fit_transform(feature)\n", 117 | "\n", 118 | "standardized" 119 | ] 120 | }, 121 | { 122 | "cell_type": "markdown", 123 | "metadata": {}, 124 | "source": [ 125 | "#### Discussion\n", 126 | "A common alternative to min-max scaling is rescaling of features to be approximately standard normally distributed. To achieve this, we use standardization to tranform the data such that it has a mean, $\\bar x$, or 0 and a standard deviation $\\sigma$, of 1. Specifically, each element in the feature is transformed so that:\n", 127 | "$$\n", 128 | "x_i^` = \\frac{x_i - \\bar x}{\\sigma}\n", 129 | "$$" 130 | ] 131 | }, 132 | { 133 | "cell_type": "markdown", 134 | "metadata": {}, 135 | "source": [ 136 | "Where $x_I^`$ is our standardized form of $x_i$. 
The transformed feature represents the number of standard deviations in the original value is away from the feature's mean value (also called a *z-score* in statistics)\n", 137 | "\n", 138 | "Standardization is a common go-to scaling method for machine learning preprocessing and in my experience is used more than min-max scaling. However it depends on the learning algorithm. For example, principal component analysis often works better using standardization, while min-max scaling is often recommended for neural netwroks. As a general rule, I'd recommend defauling to standardization unless you have a specific reason to use an alternative.\n", 139 | "\n", 140 | "We can see the effect of standardization by looking at the mean and standard deviation of our solutions output:" 141 | ] 142 | }, 143 | { 144 | "cell_type": "code", 145 | "execution_count": 4, 146 | "metadata": {}, 147 | "outputs": [ 148 | { 149 | "name": "stdout", 150 | "output_type": "stream", 151 | "text": [ 152 | "Mean 0.0\n", 153 | "Standard Deviation: 1.0\n" 154 | ] 155 | } 156 | ], 157 | "source": [ 158 | "print(\"Mean {}\".format(round(standardized.mean())))\n", 159 | "print(\"Standard Deviation: {}\".format(standardized.std()))" 160 | ] 161 | }, 162 | { 163 | "cell_type": "markdown", 164 | "metadata": {}, 165 | "source": [ 166 | "If our data has significant outliers, it can negatively impact our standardizatino by affecting the feature's mean and variance. In this scenario, it is often helpful to instead rescale the feature using the median and quartile range. In scikit-learn, we do this using the *RobustScaler* method:" 167 | ] 168 | }, 169 | { 170 | "cell_type": "code", 171 | "execution_count": 5, 172 | "metadata": {}, 173 | "outputs": [ 174 | { 175 | "data": { 176 | "text/plain": [ 177 | "array([[-1.87387612],\n", 178 | " [-0.875 ],\n", 179 | " [ 0. 
],\n", 180 | " [ 0.125 ],\n", 181 | " [10.61488511]])" 182 | ] 183 | }, 184 | "execution_count": 5, 185 | "metadata": {}, 186 | "output_type": "execute_result" 187 | } 188 | ], 189 | "source": [ 190 | "# create scaler\n", 191 | "robust_scaler = preprocessing.RobustScaler()\n", 192 | "\n", 193 | "# transform feature\n", 194 | "robust_scaler.fit_transform(feature)" 195 | ] 196 | }, 197 | { 198 | "cell_type": "markdown", 199 | "metadata": {}, 200 | "source": [ 201 | "### 4.3 Normalizing Observations\n", 202 | "Use scikit-learn's `Normalizer` to rescale the feature values to have unit norm (a total length of 1)" 203 | ] 204 | }, 205 | { 206 | "cell_type": "code", 207 | "execution_count": 6, 208 | "metadata": {}, 209 | "outputs": [ 210 | { 211 | "data": { 212 | "text/plain": [ 213 | "array([[0.70710678, 0.70710678],\n", 214 | " [0.30782029, 0.95144452],\n", 215 | " [0.07405353, 0.99725427],\n", 216 | " [0.04733062, 0.99887928],\n", 217 | " [0.95709822, 0.28976368]])" 218 | ] 219 | }, 220 | "execution_count": 6, 221 | "metadata": {}, 222 | "output_type": "execute_result" 223 | } 224 | ], 225 | "source": [ 226 | "import numpy as np\n", 227 | "from sklearn.preprocessing import Normalizer\n", 228 | "\n", 229 | "# create feature matrix\n", 230 | "features = np.array([\n", 231 | " [0.5, 0.5],\n", 232 | " [1.1, 3.4],\n", 233 | " [1.5, 20.2],\n", 234 | " [1.63, 34.4],\n", 235 | " [10.9, 3.3]\n", 236 | "])\n", 237 | "\n", 238 | "# create normalizer\n", 239 | "normalizer = Normalizer(norm=\"l2\")\n", 240 | "\n", 241 | "# transofmr feature matrix\n", 242 | "normalizer.transform(features)" 243 | ] 244 | }, 245 | { 246 | "cell_type": "markdown", 247 | "metadata": {}, 248 | "source": [ 249 | "#### Discussion\n", 250 | "Many rescaling methods operate of features; however, we can also rescale across individual observations. `Normalizer` rescales the values on individual observations to have unit norm (the sum of their lengths is 1). This type of rescaling is often used when we have many equivalent features (e.g. text-classification when every word is n-word group is a feature).\n", 251 | "\n", 252 | "`Normalizer` provides three norm options with Euclidean norm (often called L2) being the default:\n", 253 | "$$\n", 254 | "||x||_2 = \\sqrt{x_1^2 + x_2^2 + ... + x_n^2}\n", 255 | "$$\n", 256 | "\n", 257 | "where x is an individual observation and x_n is that observation's value for the nth feature.\n", 258 | "\n", 259 | "Alternatively, we can specify Manhattan norm (L1):\n", 260 | "$$\n", 261 | "||x||_1 = \\sum_{i=1}^n{x_i}\n", 262 | "$$\n", 263 | "\n", 264 | "Intuitively, L2 norm can be thought of as the distance between two poitns in New York for a bird (i.e. 
a straight line), while L1 can be thought of as the distance for a human wlaking on the street (walk north one block, east one block, north one block, east one block, etc), which is why it is called \"Manhattan norm\" or \"Taxicab norm\".\n", 265 | "\n", 266 | "Practically, notice that `norm='l1'` rescales an observation's values so they sum to 1, which can sometimes be a desirable quality" 267 | ] 268 | }, 269 | { 270 | "cell_type": "code", 271 | "execution_count": 8, 272 | "metadata": {}, 273 | "outputs": [ 274 | { 275 | "name": "stdout", 276 | "output_type": "stream", 277 | "text": [ 278 | "Sum of the first observation's values: 1.0\n" 279 | ] 280 | } 281 | ], 282 | "source": [ 283 | "# transform feature matrix\n", 284 | "features_l1_norm = Normalizer(norm=\"l1\").transform(features)\n", 285 | "print(\"Sum of the first observation's values: {}\".format(features_l1_norm[0,0] + features_l1_norm[0,1]))" 286 | ] 287 | }, 288 | { 289 | "cell_type": "markdown", 290 | "metadata": {}, 291 | "source": [ 292 | "### 4.9 Grouping Observations Using Clustering" 293 | ] 294 | }, 295 | { 296 | "cell_type": "code", 297 | "execution_count": 9, 298 | "metadata": {}, 299 | "outputs": [ 300 | { 301 | "data": { 302 | "text/html": [ 303 | "
\n", 304 | "\n", 317 | "\n", 318 | " \n", 319 | " \n", 320 | " \n", 321 | " \n", 322 | " \n", 323 | " \n", 324 | " \n", 325 | " \n", 326 | " \n", 327 | " \n", 328 | " \n", 329 | " \n", 330 | " \n", 331 | " \n", 332 | " \n", 333 | " \n", 334 | " \n", 335 | " \n", 336 | " \n", 337 | " \n", 338 | " \n", 339 | " \n", 340 | " \n", 341 | " \n", 342 | " \n", 343 | " \n", 344 | " \n", 345 | " \n", 346 | " \n", 347 | " \n", 348 | " \n", 349 | " \n", 350 | " \n", 351 | " \n", 352 | " \n", 353 | " \n", 354 | " \n", 355 | " \n", 356 | " \n", 357 | " \n", 358 | "
" 360 | ], 361 | "text/plain": [ 362 | " feature_1 feature_2 group\n", 363 | "0 -9.877554 -3.336145 0\n", 364 | "1 -7.287210 -8.353986 2\n", 365 | "2 -6.943061 -7.023744 2\n", 366 | "3 -7.440167 -8.791959 2\n", 367 | "4 -6.641388 -8.075888 2" 368 | ] 369 | }, 370 | "execution_count": 9, 371 | "metadata": {}, 372 | "output_type": "execute_result" 373 | } 374 | ], 375 | "source": [ 376 | "import pandas as pd\n", 377 | "from sklearn.datasets import make_blobs\n", 378 | "from sklearn.cluster import KMeans\n", 379 | "\n", 380 | "features, _ = make_blobs(n_samples = 50,\n", 381 | " n_features = 2,\n", 382 | " centers = 3,\n", 383 | " random_state = 1)\n", 384 | "\n", 385 | "df = pd.DataFrame(features, columns=[\"feature_1\", \"feature_2\"])\n", 386 | "\n", 387 | "# make k-means clusterer\n", 388 | "clusterer = KMeans(3, random_state=0)\n", 389 | "\n", 390 | "# fit clusterer\n", 391 | "clusterer.fit(features)\n", 392 | "\n", 393 | "# predict values\n", 394 | "df['group'] = clusterer.predict(features)\n", 395 | "\n", 396 | "df.head()" 397 | ] 398 | }, 399 | { 400 | "cell_type": "markdown", 401 | "metadata": {}, 402 | "source": [ 403 | "# 4.10 Deleteing Observations with Missing Values" 404 | ] 405 | }, 406 | { 407 | "cell_type": "code", 408 | "execution_count": 10, 409 | "metadata": {}, 410 | "outputs": [ 411 | { 412 | "data": { 413 | "text/plain": [ 414 | "array([[ 1.1, 11.1],\n", 415 | " [ 2.2, 22.2],\n", 416 | " [ 3.3, 33.3]])" 417 | ] 418 | }, 419 | "execution_count": 10, 420 | "metadata": {}, 421 | "output_type": "execute_result" 422 | } 423 | ], 424 | "source": [ 425 | "import numpy as np\n", 426 | "\n", 427 | "features = np.array([\n", 428 | " [1.1, 11.1],\n", 429 | " [2.2, 22.2],\n", 430 | " [3.3, 33.3],\n", 431 | " [np.nan, 55]\n", 432 | "])\n", 433 | "\n", 434 | "# keep only observations that are not (denoted by ~) missing\n", 435 | "features[~np.isnan(features).any(axis=1)]" 436 | ] 437 | }, 438 | { 439 | "cell_type": "code", 440 | "execution_count": 11, 441 | "metadata": {}, 442 | "outputs": [ 443 | { 444 | "data": { 445 | "text/html": [ 446 | "
\n", 447 | "\n", 460 | "\n", 461 | " \n", 462 | " \n", 463 | " \n", 464 | " \n", 465 | " \n", 466 | " \n", 467 | " \n", 468 | " \n", 469 | " \n", 470 | " \n", 471 | " \n", 472 | " \n", 473 | " \n", 474 | " \n", 475 | " \n", 476 | " \n", 477 | " \n", 478 | " \n", 479 | " \n", 480 | " \n", 481 | " \n", 482 | " \n", 483 | " \n", 484 | " \n", 485 | "
feature_1feature_2
01.111.1
12.222.2
23.333.3
\n", 486 | "
" 487 | ], 488 | "text/plain": [ 489 | " feature_1 feature_2\n", 490 | "0 1.1 11.1\n", 491 | "1 2.2 22.2\n", 492 | "2 3.3 33.3" 493 | ] 494 | }, 495 | "execution_count": 11, 496 | "metadata": {}, 497 | "output_type": "execute_result" 498 | } 499 | ], 500 | "source": [ 501 | "import pandas as pd\n", 502 | "df = pd.DataFrame(features, columns=[\"feature_1\", \"feature_2\"])\n", 503 | "df.dropna()" 504 | ] 505 | }, 506 | { 507 | "cell_type": "markdown", 508 | "metadata": {}, 509 | "source": [ 510 | "#### Discussion\n", 511 | "Most machine learnign algorithms cannot handling any missing values in the target and feature arrays. The simplest solution is the delete every observation that contains one or more missing values\n", 512 | "\n", 513 | "There are three types of missing data:\n", 514 | "\n", 515 | "*Missing Completely At Random (MCAR)*\n", 516 | "* The probability that a value is missing is independent of everything.\n", 517 | "\n", 518 | "*Missing At Random (MAR)*\n", 519 | "* The probability that a value is missing is not completely random, but depends on information capture in other feature\n", 520 | "\n", 521 | "*Missing Not At Random (MNAR)*\n", 522 | "* The probability that a value is missing is not random and depends on information not captured in our features\n", 523 | "\n", 524 | "#### See Also\n", 525 | "* Identifying the Three Types of Missing Data (https://measuringu.com/missing-data/)\n", 526 | "* Missing-Data Imputation (http://www.stat.columbia.edu/~gelman/arm/missing.pdf)\n", 527 | "\n", 528 | "### 4.11 Imputing Missing Values" 529 | ] 530 | }, 531 | { 532 | "cell_type": "code", 533 | "execution_count": 14, 534 | "metadata": {}, 535 | "outputs": [ 536 | { 537 | "name": "stdout", 538 | "output_type": "stream", 539 | "text": [ 540 | "True Value: 0.8730186113995938\n", 541 | "Imputed Value: -3.058372724614996\n" 542 | ] 543 | } 544 | ], 545 | "source": [ 546 | "import numpy as np\n", 547 | "from sklearn.preprocessing import StandardScaler\n", 548 | "from sklearn.datasets import make_blobs\n", 549 | "from sklearn.preprocessing import Imputer\n", 550 | "\n", 551 | "# make fake data\n", 552 | "features, _ = make_blobs(n_samples = 1000,\n", 553 | " n_features = 2,\n", 554 | " random_state = 1)\n", 555 | "\n", 556 | "# standardize the features\n", 557 | "scaler = StandardScaler()\n", 558 | "standardized_features = scaler.fit_transform(features)\n", 559 | "\n", 560 | "# replace the first feature's first value with a missing value\n", 561 | "true_value = standardized_features[0, 0]\n", 562 | "standardized_features[0,0] = np.nan\n", 563 | "\n", 564 | "# create imputer\n", 565 | "mean_imputer = Imputer(strategy=\"mean\", axis=0)\n", 566 | "\n", 567 | "# impute values\n", 568 | "feautres_mean_imputed = mean_imputer.fit_transform(features)\n", 569 | "\n", 570 | "# compare true and imputed values\n", 571 | "print(\"True Value: {}\".format(true_value))\n", 572 | "print(\"Imputed Value: {}\".format(feautres_mean_imputed[0,0]))" 573 | ] 574 | }, 575 | { 576 | "cell_type": "markdown", 577 | "metadata": {}, 578 | "source": [ 579 | "#### See Also\n", 580 | "* A Study of K-Nearest Neighbor as an Imputation Method (http://conteudo.icmc.usp.br/pessoas/gbatista/files/his2002.pdf)" 581 | ] 582 | }, 583 | { 584 | "cell_type": "code", 585 | "execution_count": null, 586 | "metadata": {}, 587 | "outputs": [], 588 | "source": [] 589 | } 590 | ], 591 | "metadata": { 592 | "kernelspec": { 593 | "display_name": "Python [conda env:machine_learning_cookbook]", 594 | "language": "python", 595 | "name": 
"conda-env-machine_learning_cookbook-py" 596 | }, 597 | "language_info": { 598 | "codemirror_mode": { 599 | "name": "ipython", 600 | "version": 3 601 | }, 602 | "file_extension": ".py", 603 | "mimetype": "text/x-python", 604 | "name": "python", 605 | "nbconvert_exporter": "python", 606 | "pygments_lexer": "ipython3", 607 | "version": "3.6.6" 608 | } 609 | }, 610 | "nbformat": 4, 611 | "nbformat_minor": 2 612 | } 613 | -------------------------------------------------------------------------------- /Chapter 7 - Handling Dates and Times.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "## 7.1 Converting Strings to Dates" 8 | ] 9 | }, 10 | { 11 | "cell_type": "code", 12 | "execution_count": 2, 13 | "metadata": {}, 14 | "outputs": [ 15 | { 16 | "data": { 17 | "text/plain": [ 18 | "[Timestamp('2005-04-03 23:35:00'),\n", 19 | " Timestamp('2010-05-23 00:01:00'),\n", 20 | " Timestamp('2009-09-04 21:09:00')]" 21 | ] 22 | }, 23 | "execution_count": 2, 24 | "metadata": {}, 25 | "output_type": "execute_result" 26 | } 27 | ], 28 | "source": [ 29 | "import numpy as np\n", 30 | "import pandas as pd\n", 31 | "\n", 32 | "date_strings = np.array([\n", 33 | " '03-04-2005 11:35 PM',\n", 34 | " '23-05-2010 12:01 AM',\n", 35 | " '04-09-2009 09:09 PM'\n", 36 | "])\n", 37 | "\n", 38 | "# convert to datetimes\n", 39 | "[pd.to_datetime(date, format='%d-%m-%Y %I:%M %p') for date in date_strings]" 40 | ] 41 | }, 42 | { 43 | "cell_type": "code", 44 | "execution_count": 3, 45 | "metadata": {}, 46 | "outputs": [ 47 | { 48 | "data": { 49 | "text/plain": [ 50 | "[Timestamp('2005-04-03 23:35:00'),\n", 51 | " Timestamp('2010-05-23 00:01:00'),\n", 52 | " Timestamp('2009-09-04 21:09:00')]" 53 | ] 54 | }, 55 | "execution_count": 3, 56 | "metadata": {}, 57 | "output_type": "execute_result" 58 | } 59 | ], 60 | "source": [ 61 | "[pd.to_datetime(date, format='%d-%m-%Y %I:%M %p', errors='coerce') for date in date_strings]" 62 | ] 63 | }, 64 | { 65 | "cell_type": "markdown", 66 | "metadata": {}, 67 | "source": [ 68 | "### See Also\n", 69 | "* http://strftime.org/\n", 70 | "\n", 71 | "## 7.2 Handling Time Zones" 72 | ] 73 | }, 74 | { 75 | "cell_type": "code", 76 | "execution_count": 4, 77 | "metadata": {}, 78 | "outputs": [ 79 | { 80 | "data": { 81 | "text/plain": [ 82 | "Timestamp('2017-05-01 06:00:00+0100', tz='Europe/London')" 83 | ] 84 | }, 85 | "execution_count": 4, 86 | "metadata": {}, 87 | "output_type": "execute_result" 88 | } 89 | ], 90 | "source": [ 91 | "import pandas as pd\n", 92 | "\n", 93 | "pd.Timestamp('2017-05-01 06:00:00', tz='Europe/London')" 94 | ] 95 | }, 96 | { 97 | "cell_type": "code", 98 | "execution_count": 5, 99 | "metadata": {}, 100 | "outputs": [ 101 | { 102 | "data": { 103 | "text/plain": [ 104 | "Timestamp('2017-05-01 06:00:00+0100', tz='Europe/London')" 105 | ] 106 | }, 107 | "execution_count": 5, 108 | "metadata": {}, 109 | "output_type": "execute_result" 110 | } 111 | ], 112 | "source": [ 113 | "date = pd.Timestamp('2017-05-01 06:00:00')\n", 114 | "\n", 115 | "date_in_london = date.tz_localize('Europe/London')\n", 116 | "\n", 117 | "date_in_london" 118 | ] 119 | }, 120 | { 121 | "cell_type": "code", 122 | "execution_count": 8, 123 | "metadata": {}, 124 | "outputs": [ 125 | { 126 | "data": { 127 | "text/plain": [ 128 | "Timestamp('2017-05-01 05:00:00+0000', tz='Africa/Abidjan')" 129 | ] 130 | }, 131 | "execution_count": 8, 132 | "metadata": {}, 133 | "output_type": 
"execute_result" 134 | } 135 | ], 136 | "source": [ 137 | "date_in_london.tz_convert('Africa/Abidjan')" 138 | ] 139 | }, 140 | { 141 | "cell_type": "code", 142 | "execution_count": 9, 143 | "metadata": {}, 144 | "outputs": [ 145 | { 146 | "data": { 147 | "text/plain": [ 148 | "0 2002-02-28 00:00:00+00:00\n", 149 | "1 2002-03-31 00:00:00+00:00\n", 150 | "2 2002-04-30 00:00:00+00:00\n", 151 | "dtype: datetime64[ns, Africa/Abidjan]" 152 | ] 153 | }, 154 | "execution_count": 9, 155 | "metadata": {}, 156 | "output_type": "execute_result" 157 | } 158 | ], 159 | "source": [ 160 | "dates = pd.Series(pd.date_range('2/2/2002', periods=3, freq='M'))\n", 161 | "\n", 162 | "dates.dt.tz_localize('Africa/Abidjan')" 163 | ] 164 | }, 165 | { 166 | "cell_type": "markdown", 167 | "metadata": {}, 168 | "source": [ 169 | "## 7.3 Selecting Dates and Times\n", 170 | "## 7.4 Breaking Up Date Data into Multiple Features\n", 171 | "## 7.5 Calculating the Difference Between Dates\n", 172 | "## 7.6 Encoding Days of the Week\n", 173 | "## 7.7 Creating Lagged Feature\n", 174 | "## 7.8 Using Rolling Time Windows" 175 | ] 176 | }, 177 | { 178 | "cell_type": "code", 179 | "execution_count": 10, 180 | "metadata": {}, 181 | "outputs": [ 182 | { 183 | "data": { 184 | "text/html": [ 185 | "
\n", 186 | "\n", 199 | "\n", 200 | " \n", 201 | " \n", 202 | " \n", 203 | " \n", 204 | " \n", 205 | " \n", 206 | " \n", 207 | " \n", 208 | " \n", 209 | " \n", 210 | " \n", 211 | " \n", 212 | " \n", 213 | " \n", 214 | " \n", 215 | " \n", 216 | " \n", 217 | " \n", 218 | " \n", 219 | " \n", 220 | " \n", 221 | " \n", 222 | " \n", 223 | " \n", 224 | " \n", 225 | " \n", 226 | " \n", 227 | " \n", 228 | "
Stock_Price
2010-01-31NaN
2010-02-281.5
2010-03-312.5
2010-04-303.5
2010-05-314.5
\n", 229 | "
" 230 | ], 231 | "text/plain": [ 232 | " Stock_Price\n", 233 | "2010-01-31 NaN\n", 234 | "2010-02-28 1.5\n", 235 | "2010-03-31 2.5\n", 236 | "2010-04-30 3.5\n", 237 | "2010-05-31 4.5" 238 | ] 239 | }, 240 | "execution_count": 10, 241 | "metadata": {}, 242 | "output_type": "execute_result" 243 | } 244 | ], 245 | "source": [ 246 | "import pandas as pd\n", 247 | "\n", 248 | "time_index = pd.date_range('01/01/2010', periods=5, freq='M')\n", 249 | "df = pd.DataFrame(index=time_index)\n", 250 | "df['Stock_Price'] = [1,2,3,4,5]\n", 251 | "df.rolling(window=2).mean()" 252 | ] 253 | }, 254 | { 255 | "cell_type": "markdown", 256 | "metadata": {}, 257 | "source": [ 258 | "### See Also\n", 259 | "* pandas documentation: Rolling Windows (https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.rolling.html)\n", 260 | "* What are Moving Average or Smoothing Techniques (https://www.itl.nist.gov/div898/handbook/pmc/section4/pmc42.htm)\n", 261 | "\n", 262 | "## 7.9 Handling Missing Data in Time Series" 263 | ] 264 | }, 265 | { 266 | "cell_type": "code", 267 | "execution_count": 11, 268 | "metadata": {}, 269 | "outputs": [ 270 | { 271 | "data": { 272 | "text/html": [ 273 | "
\n", 274 | "\n", 287 | "\n", 288 | " \n", 289 | " \n", 290 | " \n", 291 | " \n", 292 | " \n", 293 | " \n", 294 | " \n", 295 | " \n", 296 | " \n", 297 | " \n", 298 | " \n", 299 | " \n", 300 | " \n", 301 | " \n", 302 | " \n", 303 | " \n", 304 | " \n", 305 | " \n", 306 | " \n", 307 | " \n", 308 | " \n", 309 | " \n", 310 | " \n", 311 | " \n", 312 | " \n", 313 | " \n", 314 | " \n", 315 | " \n", 316 | "
Sales
2010-01-311.0
2010-02-282.0
2010-03-313.0
2010-04-304.0
2010-05-315.0
\n", 317 | "
" 318 | ], 319 | "text/plain": [ 320 | " Sales\n", 321 | "2010-01-31 1.0\n", 322 | "2010-02-28 2.0\n", 323 | "2010-03-31 3.0\n", 324 | "2010-04-30 4.0\n", 325 | "2010-05-31 5.0" 326 | ] 327 | }, 328 | "execution_count": 11, 329 | "metadata": {}, 330 | "output_type": "execute_result" 331 | } 332 | ], 333 | "source": [ 334 | "import pandas as pd\n", 335 | "import numpy as np\n", 336 | "\n", 337 | "time_index = pd.date_range('01/01/2010', periods=5, freq='M')\n", 338 | "\n", 339 | "df = pd.DataFrame(index=time_index)\n", 340 | "\n", 341 | "df[\"Sales\"] = [1.0, 2.0, np.nan, np.nan, 5.0]\n", 342 | "\n", 343 | "df.interpolate()" 344 | ] 345 | }, 346 | { 347 | "cell_type": "code", 348 | "execution_count": 12, 349 | "metadata": {}, 350 | "outputs": [ 351 | { 352 | "data": { 353 | "text/html": [ 354 | "
\n", 355 | "\n", 368 | "\n", 369 | " \n", 370 | " \n", 371 | " \n", 372 | " \n", 373 | " \n", 374 | " \n", 375 | " \n", 376 | " \n", 377 | " \n", 378 | " \n", 379 | " \n", 380 | " \n", 381 | " \n", 382 | " \n", 383 | " \n", 384 | " \n", 385 | " \n", 386 | " \n", 387 | " \n", 388 | " \n", 389 | " \n", 390 | " \n", 391 | " \n", 392 | " \n", 393 | " \n", 394 | " \n", 395 | " \n", 396 | " \n", 397 | "
Sales
2010-01-311.0
2010-02-282.0
2010-03-312.0
2010-04-302.0
2010-05-315.0
\n", 398 | "
" 399 | ], 400 | "text/plain": [ 401 | " Sales\n", 402 | "2010-01-31 1.0\n", 403 | "2010-02-28 2.0\n", 404 | "2010-03-31 2.0\n", 405 | "2010-04-30 2.0\n", 406 | "2010-05-31 5.0" 407 | ] 408 | }, 409 | "execution_count": 12, 410 | "metadata": {}, 411 | "output_type": "execute_result" 412 | } 413 | ], 414 | "source": [ 415 | "df.ffill()" 416 | ] 417 | }, 418 | { 419 | "cell_type": "code", 420 | "execution_count": 13, 421 | "metadata": {}, 422 | "outputs": [ 423 | { 424 | "data": { 425 | "text/html": [ 426 | "
\n", 427 | "\n", 440 | "\n", 441 | " \n", 442 | " \n", 443 | " \n", 444 | " \n", 445 | " \n", 446 | " \n", 447 | " \n", 448 | " \n", 449 | " \n", 450 | " \n", 451 | " \n", 452 | " \n", 453 | " \n", 454 | " \n", 455 | " \n", 456 | " \n", 457 | " \n", 458 | " \n", 459 | " \n", 460 | " \n", 461 | " \n", 462 | " \n", 463 | " \n", 464 | " \n", 465 | " \n", 466 | " \n", 467 | " \n", 468 | " \n", 469 | "
Sales
2010-01-311.0
2010-02-282.0
2010-03-315.0
2010-04-305.0
2010-05-315.0
\n", 470 | "
" 471 | ], 472 | "text/plain": [ 473 | " Sales\n", 474 | "2010-01-31 1.0\n", 475 | "2010-02-28 2.0\n", 476 | "2010-03-31 5.0\n", 477 | "2010-04-30 5.0\n", 478 | "2010-05-31 5.0" 479 | ] 480 | }, 481 | "execution_count": 13, 482 | "metadata": {}, 483 | "output_type": "execute_result" 484 | } 485 | ], 486 | "source": [ 487 | "df.bfill()" 488 | ] 489 | }, 490 | { 491 | "cell_type": "code", 492 | "execution_count": 14, 493 | "metadata": {}, 494 | "outputs": [ 495 | { 496 | "data": { 497 | "text/html": [ 498 | "
\n", 499 | "\n", 512 | "\n", 513 | " \n", 514 | " \n", 515 | " \n", 516 | " \n", 517 | " \n", 518 | " \n", 519 | " \n", 520 | " \n", 521 | " \n", 522 | " \n", 523 | " \n", 524 | " \n", 525 | " \n", 526 | " \n", 527 | " \n", 528 | " \n", 529 | " \n", 530 | " \n", 531 | " \n", 532 | " \n", 533 | " \n", 534 | " \n", 535 | " \n", 536 | " \n", 537 | " \n", 538 | " \n", 539 | " \n", 540 | " \n", 541 | "
Sales
2010-01-311.000000
2010-02-282.000000
2010-03-313.059808
2010-04-304.038069
2010-05-315.000000
\n", 542 | "
" 543 | ], 544 | "text/plain": [ 545 | " Sales\n", 546 | "2010-01-31 1.000000\n", 547 | "2010-02-28 2.000000\n", 548 | "2010-03-31 3.059808\n", 549 | "2010-04-30 4.038069\n", 550 | "2010-05-31 5.000000" 551 | ] 552 | }, 553 | "execution_count": 14, 554 | "metadata": {}, 555 | "output_type": "execute_result" 556 | } 557 | ], 558 | "source": [ 559 | "df.interpolate(method=\"quadratic\")" 560 | ] 561 | }, 562 | { 563 | "cell_type": "code", 564 | "execution_count": 15, 565 | "metadata": {}, 566 | "outputs": [ 567 | { 568 | "data": { 569 | "text/html": [ 570 | "
\n", 571 | "\n", 584 | "\n", 585 | " \n", 586 | " \n", 587 | " \n", 588 | " \n", 589 | " \n", 590 | " \n", 591 | " \n", 592 | " \n", 593 | " \n", 594 | " \n", 595 | " \n", 596 | " \n", 597 | " \n", 598 | " \n", 599 | " \n", 600 | " \n", 601 | " \n", 602 | " \n", 603 | " \n", 604 | " \n", 605 | " \n", 606 | " \n", 607 | " \n", 608 | " \n", 609 | " \n", 610 | " \n", 611 | " \n", 612 | " \n", 613 | "
Sales
2010-01-311.0
2010-02-282.0
2010-03-313.0
2010-04-30NaN
2010-05-315.0
\n", 614 | "
" 615 | ], 616 | "text/plain": [ 617 | " Sales\n", 618 | "2010-01-31 1.0\n", 619 | "2010-02-28 2.0\n", 620 | "2010-03-31 3.0\n", 621 | "2010-04-30 NaN\n", 622 | "2010-05-31 5.0" 623 | ] 624 | }, 625 | "execution_count": 15, 626 | "metadata": {}, 627 | "output_type": "execute_result" 628 | } 629 | ], 630 | "source": [ 631 | "df.interpolate(limit=1, limit_direction=\"forward\")" 632 | ] 633 | }, 634 | { 635 | "cell_type": "code", 636 | "execution_count": null, 637 | "metadata": {}, 638 | "outputs": [], 639 | "source": [] 640 | } 641 | ], 642 | "metadata": { 643 | "kernelspec": { 644 | "display_name": "Python [conda env:machine_learning_cookbook]", 645 | "language": "python", 646 | "name": "conda-env-machine_learning_cookbook-py" 647 | }, 648 | "language_info": { 649 | "codemirror_mode": { 650 | "name": "ipython", 651 | "version": 3 652 | }, 653 | "file_extension": ".py", 654 | "mimetype": "text/x-python", 655 | "name": "python", 656 | "nbconvert_exporter": "python", 657 | "pygments_lexer": "ipython3", 658 | "version": "3.6.6" 659 | } 660 | }, 661 | "nbformat": 4, 662 | "nbformat_minor": 2 663 | } 664 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # machine-learning-with-python-cookbook-notes 2 | My jupyter notebooks/code samples from Chris Albon's Machine Learning with Python Cookbook 3 | 4 | https://nbviewer.jupyter.org/github/DustinAlandzes/machine-learning-with-python-cookbook-notes/tree/master/ 5 | 6 | # usage 7 | ``` 8 | git clone https://github.com/f00-/machine-learning-with-python-cookbook-notes.git 9 | cd machine-learning-with-python-cookbook-notes 10 | conda env create -f environment.yml 11 | source activate machine_learning_cookbook 12 | pip install -r requirements.txt 13 | jupyter notebook 14 | ``` 15 | -------------------------------------------------------------------------------- /environment.yml: -------------------------------------------------------------------------------- 1 | name: machine_learning_cookbook 2 | channels: 3 | - defaults 4 | dependencies: 5 | - _nb_ext_conf=0.4.0=py36_1 6 | - _tflow_1100_select=0.0.2=eigen 7 | - absl-py=0.4.0=py36h28b3542_0 8 | - anaconda-client=1.7.1=py36_0 9 | - appdirs=1.4.3=py36h28b3542_0 10 | - appnope=0.1.0=py36hf537a9a_0 11 | - asn1crypto=0.24.0=py36_0 12 | - astor=0.7.1=py36_0 13 | - attrs=18.1.0=py36_0 14 | - automat=0.7.0=py36_0 15 | - backcall=0.1.0=py36_0 16 | - blas=1.0=mkl 17 | - bleach=2.1.4=py36_0 18 | - ca-certificates=2018.03.07=0 19 | - certifi=2018.8.24=py36_1 20 | - cffi=1.11.5=py36h342bebf_0 21 | - chardet=3.0.4=py36_1 22 | - clyent=1.2.2=py36_1 23 | - constantly=15.1.0=py36h28b3542_0 24 | - cryptography=2.3.1=py36hdbc3d79_0 25 | - cycler=0.10.0=py36hfc81398_0 26 | - decorator=4.3.0=py36_0 27 | - entrypoints=0.2.3=py36_2 28 | - freetype=2.9.1=hb4e5f40_0 29 | - gast=0.2.0=py36_0 30 | - grpcio=1.12.1=py36hd9629dc_0 31 | - h5py=2.8.0=py36h3c9e6ae_0 32 | - hdf5=1.10.2=hfa1e0ec_1 33 | - html5lib=1.0.1=py36_0 34 | - hyperlink=18.0.0=py36_0 35 | - idna=2.7=py36_0 36 | - incremental=17.5.0=py36_0 37 | - intel-openmp=2018.0.3=0 38 | - ipykernel=4.8.2=py36_0 39 | - ipython=6.5.0=py36_0 40 | - ipython_genutils=0.2.0=py36h241746c_0 41 | - ipywidgets=7.4.0=py36_0 42 | - jedi=0.12.1=py36_0 43 | - jinja2=2.10=py36_0 44 | - jsonschema=2.6.0=py36hb385e00_0 45 | - jupyter_client=5.2.3=py36_0 46 | - jupyter_core=4.4.0=py36_0 47 | - keras=2.2.2=0 48 | - keras-applications=1.0.4=py36_1 49 | - 
keras-base=2.2.2=py36_0 50 | - keras-preprocessing=1.0.2=py36_1 51 | - kiwisolver=1.0.1=py36h0a44026_0 52 | - libcxx=4.0.1=h579ed51_0 53 | - libcxxabi=4.0.1=hebd6815_0 54 | - libedit=3.1.20170329=hb402a30_2 55 | - libffi=3.2.1=h475c297_4 56 | - libgfortran=3.0.1=h93005f0_2 57 | - libpng=1.6.34=he12f830_0 58 | - libprotobuf=3.6.0=hd9629dc_0 59 | - libsodium=1.0.16=h3efe00b_0 60 | - markdown=2.6.11=py36_0 61 | - markupsafe=1.0=py36h1de35cc_1 62 | - matplotlib=2.2.3=py36h54f8f79_0 63 | - mistune=0.8.3=py36h1de35cc_1 64 | - mkl=2018.0.3=1 65 | - mkl_fft=1.0.4=py36h5d10147_1 66 | - mkl_random=1.0.1=py36h5d10147_1 67 | - nb_anacondacloud=1.4.0=py36_0 68 | - nb_conda=2.2.1=py36_0 69 | - nb_conda_kernels=2.1.0=py36_0 70 | - nbconvert=5.3.1=py36_0 71 | - nbformat=4.4.0=py36h827af21_0 72 | - nbpresent=3.0.2=py36_1 73 | - ncurses=6.1=h0a44026_0 74 | - notebook=5.6.0=py36_0 75 | - numpy=1.15.0=py36h648b28d_0 76 | - numpy-base=1.15.0=py36h8a80b8c_0 77 | - openssl=1.0.2p=h1de35cc_0 78 | - pandas=0.23.4=py36h6440ff4_0 79 | - pandoc=2.2.1=h1a437c5_0 80 | - pandocfilters=1.4.2=py36_1 81 | - parso=0.3.1=py36_0 82 | - patsy=0.5.0=py36_0 83 | - pexpect=4.6.0=py36_0 84 | - pickleshare=0.7.4=py36hf512f8e_0 85 | - pip=10.0.1=py36_0 86 | - prometheus_client=0.3.1=py36_0 87 | - prompt_toolkit=1.0.15=py36haeda067_0 88 | - protobuf=3.6.0=py36h0a44026_0 89 | - ptyprocess=0.6.0=py36_0 90 | - pyasn1=0.4.4=py36_0 91 | - pyasn1-modules=0.2.2=py36_0 92 | - pycparser=2.18=py36_1 93 | - pygments=2.2.0=py36h240cd3f_0 94 | - pyopenssl=18.0.0=py36_0 95 | - pyparsing=2.2.0=py36_1 96 | - pysocks=1.6.8=py36_0 97 | - python=3.6.6=hc167b69_0 98 | - python-dateutil=2.7.3=py36_0 99 | - pytz=2018.5=py36_0 100 | - pyyaml=3.13=py36h1de35cc_0 101 | - pyzmq=17.1.2=py36h1de35cc_0 102 | - readline=7.0=hc1231fa_4 103 | - requests=2.19.1=py36_0 104 | - scikit-learn=0.19.1=py36hf9f1f73_0 105 | - scipy=1.1.0=py36hf1f7d93_0 106 | - seaborn=0.9.0=py36_0 107 | - send2trash=1.5.0=py36_0 108 | - service_identity=17.0.0=py36h28b3542_0 109 | - setuptools=40.0.0=py36_0 110 | - simplegeneric=0.8.1=py36_2 111 | - six=1.11.0=py36_1 112 | - sqlalchemy=1.2.11=py36h1de35cc_0 113 | - sqlite=3.24.0=ha441bb4_0 114 | - statsmodels=0.9.0=py36h1d22016_0 115 | - tensorboard=1.10.0=py36hdc36e2c_0 116 | - tensorflow=1.10.0=eigen_py36h0906837_0 117 | - tensorflow-base=1.10.0=eigen_py36h4f0eeca_0 118 | - termcolor=1.1.0=py36_1 119 | - terminado=0.8.1=py36_1 120 | - testpath=0.3.1=py36h625a49b_0 121 | - tk=8.6.7=h35a86e2_3 122 | - tornado=5.1=py36h1de35cc_0 123 | - traitlets=4.3.2=py36h65bd3ce_0 124 | - twisted=17.5.0=py36_0 125 | - urllib3=1.23=py36_0 126 | - wcwidth=0.1.7=py36h8c6ec74_0 127 | - webencodings=0.5.1=py36_1 128 | - werkzeug=0.14.1=py36_0 129 | - wheel=0.31.1=py36_0 130 | - widgetsnbextension=3.4.0=py36_0 131 | - xlrd=1.1.0=py36_1 132 | - xz=5.2.4=h1de35cc_4 133 | - yaml=0.1.7=hc338f04_2 134 | - zeromq=4.2.5=h0a44026_0 135 | - zlib=1.2.11=hf3cbc9b_2 136 | - zope=1.0=py36_0 137 | - zope.interface=4.5.0=py36h1de35cc_0 138 | - pip: 139 | - lifelines==0.14.6 140 | prefix: /Users/f00/anaconda/envs/machine_learning_cookbook 141 | 142 | -------------------------------------------------------------------------------- /model.pkl: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/DustinAlandzes/machine-learning-with-python-cookbook-notes/e48e1ed5e2dca6f2fdf4114988e7772ff6ff23bb/model.pkl -------------------------------------------------------------------------------- /requirements.txt: 
-------------------------------------------------------------------------------- 1 | absl-py==0.4.0 2 | anaconda-client==1.7.1 3 | appdirs==1.4.3 4 | appnope==0.1.0 5 | asn1crypto==0.24.0 6 | astor==0.7.1 7 | attrs==18.1.0 8 | Automat==0.7.0 9 | backcall==0.1.0 10 | bleach==3.1.4 11 | certifi==2018.8.24 12 | cffi==1.11.5 13 | chardet==3.0.4 14 | clyent==1.2.2 15 | constantly==15.1.0 16 | cryptography==3.2 17 | cycler==0.10.0 18 | decorator==4.3.0 19 | entrypoints==0.2.3 20 | gast==0.2.0 21 | grpcio==1.12.1 22 | h5py==2.8.0 23 | html5lib==1.0.1 24 | hyperlink==18.0.0 25 | idna==2.7 26 | incremental==17.5.0 27 | ipykernel==4.8.2 28 | ipython==6.5.0 29 | ipython-genutils==0.2.0 30 | ipywidgets==7.4.0 31 | jedi==0.12.1 32 | Jinja2==2.10.1 33 | jsonschema==2.6.0 34 | jupyter-client==5.2.3 35 | jupyter-core==4.4.0 36 | Keras==2.2.2 37 | Keras-Applications==1.0.4 38 | Keras-Preprocessing==1.0.2 39 | kiwisolver==1.0.1 40 | lifelines==0.14.6 41 | Markdown==2.6.11 42 | MarkupSafe==1.0 43 | matplotlib==2.2.3 44 | mistune==0.8.3 45 | mkl-fft==1.0.4 46 | mkl-random==1.0.1 47 | nb-anacondacloud==1.4.0 48 | nb-conda==2.2.1 49 | nb-conda-kernels==2.1.0 50 | nbconvert==5.3.1 51 | nbformat==4.4.0 52 | nbpresent==3.0.2 53 | notebook==6.1.5 54 | numpy==1.15.0 55 | pandas==0.23.4 56 | pandocfilters==1.4.2 57 | parso==0.3.1 58 | patsy==0.5.0 59 | pexpect==4.6.0 60 | pickleshare==0.7.4 61 | prometheus-client==0.3.1 62 | prompt-toolkit==1.0.15 63 | protobuf==3.6.0 64 | ptyprocess==0.6.0 65 | pyasn1==0.4.4 66 | pyasn1-modules==0.2.2 67 | pycparser==2.18 68 | Pygments==2.2.0 69 | pyOpenSSL==18.0.0 70 | pyparsing==2.2.0 71 | PySocks==1.6.8 72 | python-dateutil==2.7.3 73 | pytz==2018.5 74 | PyYAML==5.1 75 | pyzmq==17.1.2 76 | requests==2.20.0 77 | scikit-learn==0.19.1 78 | scipy==1.1.0 79 | seaborn==0.9.0 80 | Send2Trash==1.5.0 81 | service-identity==17.0.0 82 | simplegeneric==0.8.1 83 | six==1.11.0 84 | SQLAlchemy==1.3.0 85 | statsmodels==0.9.0 86 | tensorboard==1.10.0 87 | tensorflow==2.3.1 88 | termcolor==1.1.0 89 | terminado==0.8.1 90 | testpath==0.3.1 91 | tornado==5.1 92 | traitlets==4.3.2 93 | Twisted==20.3.0 94 | urllib3==1.24.2 95 | wcwidth==0.1.7 96 | webencodings==0.5.1 97 | Werkzeug==0.15.3 98 | widgetsnbextension==3.4.0 99 | xlrd==1.1.0 100 | zope.interface==4.5.0 101 | -------------------------------------------------------------------------------- /sample.db: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/DustinAlandzes/machine-learning-with-python-cookbook-notes/e48e1ed5e2dca6f2fdf4114988e7772ff6ff23bb/sample.db --------------------------------------------------------------------------------