├── .ipynb_checkpoints ├── Chapter 1 - Vectors, Matrices, and Arrays-checkpoint.ipynb ├── Chapter 11 - Model Evaluation-checkpoint.ipynb ├── Chapter 12 - Model Selection-checkpoint.ipynb ├── Chapter 13 - Linear Regression-checkpoint.ipynb ├── Chapter 14 - Trees and Forests-checkpoint.ipynb ├── Chapter 15 - K-Nearest Neighbors-checkpoint.ipynb ├── Chapter 16 - Logistic Regression-checkpoint.ipynb ├── Chapter 17 - Support Vector Machines-checkpoint.ipynb ├── Chapter 18 - Naive Bayes-checkpoint.ipynb ├── Chapter 19 - Clustering-checkpoint.ipynb ├── Chapter 2 - Loading Data-checkpoint.ipynb ├── Chapter 21 - Saving and Loading Trained Models-checkpoint.ipynb ├── Chapter 3 - Data Wrangling-checkpoint.ipynb ├── Chapter 4 - Handling Numerical Data-checkpoint.ipynb ├── Chapter 5 - Handling Categorical Data-checkpoint.ipynb ├── Chapter 6 - Handling Text-checkpoint.ipynb └── Chapter 7 - Handling Dates and Times-checkpoint.ipynb ├── Chapter 1 - Vectors, Matrices, and Arrays.ipynb ├── Chapter 11 - Model Evaluation.ipynb ├── Chapter 12 - Model Selection.ipynb ├── Chapter 13 - Linear Regression.ipynb ├── Chapter 14 - Trees and Forests.ipynb ├── Chapter 15 - K-Nearest Neighbors.ipynb ├── Chapter 16 - Logistic Regression.ipynb ├── Chapter 17 - Support Vector Machines.ipynb ├── Chapter 18 - Naive Bayes.ipynb ├── Chapter 19 - Clustering.ipynb ├── Chapter 2 - Loading Data.ipynb ├── Chapter 21 - Saving and Loading Trained Models.ipynb ├── Chapter 3 - Data Wrangling.ipynb ├── Chapter 4 - Handling Numerical Data.ipynb ├── Chapter 5 - Handling Categorical Data.ipynb ├── Chapter 6 - Handling Text.ipynb ├── Chapter 7 - Handling Dates and Times.ipynb ├── README.md ├── environment.yml ├── model.pkl ├── requirements.txt └── sample.db /.ipynb_checkpoints/Chapter 13 - Linear Regression-checkpoint.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "## Chapter 13\n", 8 | "---\n", 9 | "# Linear Regression\n", 10 | "\n", 11 | "### 13.0 Introduction\n", 12 | "Linear regression is one of the simplest supervised learning algorithms in our toolkit. If you have ever taken an introductory statistics course in college, likely the final topic you covered was linear regression. In fact, it is so simple that it is sometimes not considered machine learning at all!\n", 13 | "\n", 14 | "Whatever you believe, the fact is that linear regression--and its extensions--continues to be a common and useful method of making predictions when the target vector is a quantitative value (e.g. 
home price, age)\n", 15 | "\n", 16 | "### 13.1 Fitting a Line\n", 17 | "#### Problem\n", 18 | "You want to train a model that represents a linear relationship between the feature and target vector.\n", 19 | "\n", 20 | "#### Solution\n", 21 | "Use a linear regression (`LinearRegression` in scikit-learn)" 22 | ] 23 | }, 24 | { 25 | "cell_type": "code", 26 | "execution_count": 1, 27 | "metadata": { 28 | "collapsed": true 29 | }, 30 | "outputs": [], 31 | "source": [ 32 | "from sklearn.linear_model import LinearRegression\n", 33 | "from sklearn.datasets import load_boston\n", 34 | "\n", 35 | "boston = load_boston()\n", 36 | "features = boston.data[:, 0:2]\n", 37 | "target = boston.target\n", 38 | "\n", 39 | "regression = LinearRegression()\n", 40 | "\n", 41 | "model = regression.fit(features, target)" 42 | ] 43 | }, 44 | { 45 | "cell_type": "markdown", 46 | "metadata": {}, 47 | "source": [ 48 | "### 13.4 Reducing Variance with Regularization\n", 49 | "#### Problem\n", 50 | "You want to reduce the variance of your linear regression model\n", 51 | "\n", 52 | "#### Solution\n", 53 | "Use a learning algorithm that includes a *shrinkage penalty* (also called **regularization**) like ridge regression and lasso regression:" 54 | ] 55 | }, 56 | { 57 | "cell_type": "code", 58 | "execution_count": 2, 59 | "metadata": { 60 | "collapsed": false 61 | }, 62 | "outputs": [], 63 | "source": [ 64 | "from sklearn.linear_model import Ridge\n", 65 | "from sklearn.datasets import load_boston\n", 66 | "from sklearn.preprocessing import StandardScaler\n", 67 | "\n", 68 | "boston = load_boston()\n", 69 | "features = boston.data\n", 70 | "target = boston.target\n", 71 | "\n", 72 | "scaler = StandardScaler()\n", 73 | "features_standardized = scaler.fit_transform(features)\n", 74 | "\n", 75 | "regression = Ridge(alpha=0.5)\n", 76 | "model = regression.fit(features_standardized, target)" 77 | ] 78 | }, 79 | { 80 | "cell_type": "markdown", 81 | "metadata": {}, 82 | "source": [ 83 | "#### Discussion\n", 84 | "In standard linear regression the model trains to minimize the sum of squared error between the true($y_i$) and prediction ($\\hat y_i$) target values, or residual sum of squares (RSS):\n", 85 | "$$\n", 86 | "RSS = \\sum_{i=1}^n{(y_i - \\hat y_i)^2}\n", 87 | "$$\n", 88 | "\n", 89 | "Regularized regression learners are similar, except they attempt to minimize RSS and some penalty for the total size of the coefficient values, called a shrinkage penalty because it attempts to \"shrink\" the model. There are two common types of regularized learners for linear regression: ridge regression and the lasso. The only formal difference is the type of shrinkage penalty used. In ridge regression, the shrinkage penalty is a tuning hyperparameter multiplied by the squared sum of all coefficients:\n", 90 | "$$\n", 91 | "RSS+\\alpha \\sum_{j=1}^p{\\hat \\beta_j^2}\n", 92 | "$$\n", 93 | "\n", 94 | "where $\\hat \\beta_j$ is the coefficient of the jth of p features and $\\alpha$ is a hyperparameter (discussed next). The lasso is similar, except the shrinkage penalty is a tuning hyperparmeter multiplied by the squared sum of all coefficients:\n", 95 | "$$\n", 96 | "\\frac{1}{2n} RSS + \\alpha \\sum_{j=1}^p{|\\beta_j|}\n", 97 | "$$\n", 98 | "\n", 99 | "where n is the number of observations. So which one should we use? A a very general rule of thumb, ridge regression often produces slightly better predictions than lasso, but lasso (for reasons we will discuss in Recipe 13.5) produces more interpretable models. 
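To make the objective above concrete, the short sketch below (an editorial addition, not a cell from the original notebook) computes the RSS and both penalty terms by hand for the ridge model fitted in the preceding cell; note that ridge penalizes the squared coefficients, whereas the lasso formula above penalizes their absolute values. It assumes `model`, `features_standardized`, and `target` from the Ridge cell are still in scope.

```python
import numpy as np

# Residual sum of squares for the fitted ridge model
predictions = model.predict(features_standardized)
rss = np.sum((target - predictions) ** 2)

# Shrinkage penalties on the learned coefficients (alpha=0.5, as above)
alpha = 0.5
ridge_penalty = alpha * np.sum(model.coef_ ** 2)      # squared coefficients (L2)
lasso_penalty = alpha * np.sum(np.abs(model.coef_))   # absolute values (L1)

print(rss, ridge_penalty, lasso_penalty)
```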
If we want a balance between, ridge and lasso's penalty functions we can use elastic net, which is simply a regression model with both penalties included. Regardless of which one we use, bot hridge and lasso regresions can penalize large or complex models by including coefficient values in the loss funciton we are trying to minimize\n", 100 | "\n", 101 | "The hyper parameter $\\alpha$ lets us control how much we penalize the coefficients, with higher values of $\\alpha$ creating simpler models. The ideal value of $\\alpha$ should be tuned like any other hyperparameter. In scikit-learn, $\\alpha$ is set using the alpha parameter.\n", 102 | "\n", 103 | "scikit-learn includes a RidgeCV method that allows us to select the ideal value for $\\alpha:" 104 | ] 105 | }, 106 | { 107 | "cell_type": "code", 108 | "execution_count": 3, 109 | "metadata": { 110 | "collapsed": false 111 | }, 112 | "outputs": [ 113 | { 114 | "data": { 115 | "text/plain": [ 116 | "array([-0.91215884, 1.0658758 , 0.11942614, 0.68558782, -2.03231631,\n", 117 | " 2.67922108, 0.01477326, -3.0777265 , 2.58814315, -2.00973173,\n", 118 | " -2.05390717, 0.85614763, -3.73565106])" 119 | ] 120 | }, 121 | "execution_count": 3, 122 | "metadata": {}, 123 | "output_type": "execute_result" 124 | } 125 | ], 126 | "source": [ 127 | "from sklearn.linear_model import RidgeCV\n", 128 | "\n", 129 | "regr_cv = RidgeCV(alphas=[0.1, 1.0, 10.0])\n", 130 | "\n", 131 | "model_cv = regr_cv.fit(features_standardized, target)\n", 132 | "\n", 133 | "model_cv.coef_" 134 | ] 135 | }, 136 | { 137 | "cell_type": "code", 138 | "execution_count": 4, 139 | "metadata": { 140 | "collapsed": false 141 | }, 142 | "outputs": [ 143 | { 144 | "data": { 145 | "text/plain": [ 146 | "1.0" 147 | ] 148 | }, 149 | "execution_count": 4, 150 | "metadata": {}, 151 | "output_type": "execute_result" 152 | } 153 | ], 154 | "source": [ 155 | "# view alpha\n", 156 | "model_cv.alpha_" 157 | ] 158 | }, 159 | { 160 | "cell_type": "markdown", 161 | "metadata": {}, 162 | "source": [ 163 | "One final note: because in linear regression the value of the coefficients is partially determined by the scale of the feature, and in regularized models all coefficients are summed together, we must make sure to standardize the feature prior to training\n", 164 | "\n", 165 | "### 13.5 Reducing Features with Lasso Regression\n", 166 | "#### Problem\n", 167 | "You want to simplify your linear regression model by reducing the number of features.\n", 168 | "\n", 169 | "#### Solution\n", 170 | "Use a lasso regression" 171 | ] 172 | }, 173 | { 174 | "cell_type": "code", 175 | "execution_count": 5, 176 | "metadata": { 177 | "collapsed": true 178 | }, 179 | "outputs": [], 180 | "source": [ 181 | "from sklearn.linear_model import Lasso\n", 182 | "from sklearn.datasets import load_boston\n", 183 | "from sklearn.preprocessing import StandardScaler\n", 184 | "\n", 185 | "boston = load_boston()\n", 186 | "features = boston.data\n", 187 | "target = boston.target\n", 188 | "\n", 189 | "scaler = StandardScaler()\n", 190 | "features_standardized = scaler.fit_transform(features)\n", 191 | "\n", 192 | "regression = Lasso(alpha=0.5)\n", 193 | "model = regression.fit(features_standardized, target)" 194 | ] 195 | }, 196 | { 197 | "cell_type": "markdown", 198 | "metadata": {}, 199 | "source": [ 200 | "#### Discussion\n", 201 | "One interesting characteristic of lasso regression's penalty is that it can shrink the coefficients of a model to zero, effectively reducing the number of features in the model. 
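As an aside on the elastic net mentioned above, scikit-learn exposes it as `ElasticNet`; here is a minimal illustrative sketch (the `alpha` and `l1_ratio` values are arbitrary, and the same Boston data as the surrounding cells is assumed). The lasso-specific zeroing behavior is shown in the cells that follow.

```python
from sklearn.linear_model import ElasticNet
from sklearn.datasets import load_boston
from sklearn.preprocessing import StandardScaler

boston = load_boston()
features_standardized = StandardScaler().fit_transform(boston.data)
target = boston.target

# l1_ratio blends the two penalties: near 0 behaves like ridge, 1 is the lasso
elastic_net = ElasticNet(alpha=0.5, l1_ratio=0.5)
model_enet = elastic_net.fit(features_standardized, target)
model_enet.coef_
```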
For example, in our solution we set `alpha` to 0.5 and we can see that many of the coefficients are 0, meaning their corresponding features are not used in the model:" 202 | ] 203 | }, 204 | { 205 | "cell_type": "code", 206 | "execution_count": 6, 207 | "metadata": { 208 | "collapsed": false 209 | }, 210 | "outputs": [ 211 | { 212 | "data": { 213 | "text/plain": [ 214 | "array([-0.10697735, 0. , -0. , 0.39739898, -0. ,\n", 215 | " 2.97332316, -0. , -0.16937793, -0. , -0. ,\n", 216 | " -1.59957374, 0.54571511, -3.66888402])" 217 | ] 218 | }, 219 | "execution_count": 6, 220 | "metadata": {}, 221 | "output_type": "execute_result" 222 | } 223 | ], 224 | "source": [ 225 | "model.coef_" 226 | ] 227 | }, 228 | { 229 | "cell_type": "markdown", 230 | "metadata": {}, 231 | "source": [ 232 | "However if we increase $\\alpha$ to a much higher value, we see that lierally none of the features are being used:" 233 | ] 234 | }, 235 | { 236 | "cell_type": "code", 237 | "execution_count": 7, 238 | "metadata": { 239 | "collapsed": false 240 | }, 241 | "outputs": [ 242 | { 243 | "data": { 244 | "text/plain": [ 245 | "array([-0., 0., -0., 0., -0., 0., -0., 0., -0., -0., -0., 0., -0.])" 246 | ] 247 | }, 248 | "execution_count": 7, 249 | "metadata": {}, 250 | "output_type": "execute_result" 251 | } 252 | ], 253 | "source": [ 254 | "regression_a10 = Lasso(alpha=10)\n", 255 | "model_a10 = regression_a10.fit(features_standardized, target)\n", 256 | "model_a10.coef_" 257 | ] 258 | }, 259 | { 260 | "cell_type": "markdown", 261 | "metadata": {}, 262 | "source": [ 263 | "The practical benefit of this effect is that it means that we could include 100 features in our feature matrix and then, through adjusting lasso's $\\alpha$ hyperparameter, produce a model that uses only 10 (for instance) of the most important features. 
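A small sketch of that idea, added for illustration with the 13 Boston features already loaded rather than 100: sweep $\alpha$ and count how many coefficients survive. It assumes `features_standardized` and `target` from the cells above.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Count the non-zero (surviving) coefficients as the penalty grows
for alpha in [0.1, 0.5, 1.0, 5.0, 10.0]:
    lasso = Lasso(alpha=alpha).fit(features_standardized, target)
    print(alpha, np.sum(lasso.coef_ != 0))
```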
This lets us reduce variance whiel improving interpretability of our model (since fewer features is easier to explain)" 264 | ] 265 | }, 266 | { 267 | "cell_type": "code", 268 | "execution_count": null, 269 | "metadata": { 270 | "collapsed": true 271 | }, 272 | "outputs": [], 273 | "source": [] 274 | } 275 | ], 276 | "metadata": { 277 | "kernelspec": { 278 | "display_name": "Python [conda env:machine_learning_cookbook]", 279 | "language": "python", 280 | "name": "conda-env-machine_learning_cookbook-py" 281 | }, 282 | "language_info": { 283 | "codemirror_mode": { 284 | "name": "ipython", 285 | "version": 3 286 | }, 287 | "file_extension": ".py", 288 | "mimetype": "text/x-python", 289 | "name": "python", 290 | "nbconvert_exporter": "python", 291 | "pygments_lexer": "ipython3", 292 | "version": "3.6.6" 293 | } 294 | }, 295 | "nbformat": 4, 296 | "nbformat_minor": 2 297 | } 298 | -------------------------------------------------------------------------------- /.ipynb_checkpoints/Chapter 15 - K-Nearest Neighbors-checkpoint.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "## Chapter 15\n", 8 | "---\n", 9 | "# K-Nearest Neighbors\n", 10 | "\n", 11 | "An observation is predicted to be the class of that of the largest proportion of the k-nearest observations.\n", 12 | "\n", 13 | "## 15.1 Finding an Observation's Nearest Neighbors" 14 | ] 15 | }, 16 | { 17 | "cell_type": "code", 18 | "execution_count": 2, 19 | "metadata": { 20 | "collapsed": false 21 | }, 22 | "outputs": [ 23 | { 24 | "data": { 25 | "text/plain": [ 26 | "array([[[1.03800476, 0.56925129, 1.10395287, 1.1850097 ],\n", 27 | " [0.79566902, 0.33784833, 0.76275864, 1.05353673]]])" 28 | ] 29 | }, 30 | "execution_count": 2, 31 | "metadata": {}, 32 | "output_type": "execute_result" 33 | } 34 | ], 35 | "source": [ 36 | "from sklearn import datasets\n", 37 | "from sklearn.neighbors import NearestNeighbors\n", 38 | "from sklearn.preprocessing import StandardScaler\n", 39 | "\n", 40 | "iris = datasets.load_iris()\n", 41 | "features = iris.data\n", 42 | "\n", 43 | "standardizer = StandardScaler()\n", 44 | "\n", 45 | "features_standardized = standardizer.fit_transform(features)\n", 46 | "\n", 47 | "nearest_neighbors = NearestNeighbors(n_neighbors=2).fit(features_standardized)\n", 48 | "#nearest_neighbors_euclidian = NearestNeighbors(n_neighbors=2, metric='euclidian').fit(features_standardized)\n", 49 | "new_observation = [1, 1, 1, 1]\n", 50 | "\n", 51 | "distances, indices = nearest_neighbors.kneighbors([new_observation])\n", 52 | "\n", 53 | "features_standardized[indices]" 54 | ] 55 | }, 56 | { 57 | "cell_type": "markdown", 58 | "metadata": {}, 59 | "source": [ 60 | "### Discussion\n", 61 | "\n", 62 | "How do we measure distance?\n", 63 | "\n", 64 | "* Euclidian\n", 65 | "$$\n", 66 | "d_{euclidean} = \\sqrt{\\sum_{i=1}^{n}{(x_i - y_i)^2}}\n", 67 | "$$\n", 68 | "\n", 69 | "* Manhattan\n", 70 | "$$\n", 71 | "d_{manhattan} = \\sum_{i=1}^{n}{|x_i - y_i|}\n", 72 | "$$\n", 73 | "\n", 74 | "* Minkowski (default)\n", 75 | "$$\n", 76 | "d_{minkowski} = (\\sum_{i=1}^{n}{|x_i - y_i|^p})^{\\frac{1}{p}}\n", 77 | "$$\n", 78 | "## 15.2 Creating a K-Nearest Neighbor Classifier" 79 | ] 80 | }, 81 | { 82 | "cell_type": "code", 83 | "execution_count": 3, 84 | "metadata": { 85 | "collapsed": false 86 | }, 87 | "outputs": [ 88 | { 89 | "data": { 90 | "text/plain": [ 91 | "array([1, 2])" 92 | ] 93 | }, 94 | "execution_count": 3, 95 | 
"metadata": {}, 96 | "output_type": "execute_result" 97 | } 98 | ], 99 | "source": [ 100 | "from sklearn.neighbors import KNeighborsClassifier\n", 101 | "from sklearn.preprocessing import StandardScaler\n", 102 | "from sklearn import datasets\n", 103 | "\n", 104 | "iris = datasets.load_iris()\n", 105 | "X = iris.data\n", 106 | "y = iris.target\n", 107 | "\n", 108 | "standardizer = StandardScaler()\n", 109 | "\n", 110 | "X_std = standardizer.fit_transform(X)\n", 111 | "\n", 112 | "knn = KNeighborsClassifier(n_neighbors=5, n_jobs=-1).fit(X_std, y)\n", 113 | "\n", 114 | "new_observations = [[0.75, 0.75, 0.75, 0.75],\n", 115 | " [1, 1, 1, 1]]\n", 116 | "\n", 117 | "knn.predict(new_observations)" 118 | ] 119 | }, 120 | { 121 | "cell_type": "markdown", 122 | "metadata": {}, 123 | "source": [ 124 | "### Discussion\n", 125 | "In KNN, given an observation $x_u$, with an unknown target class, the algorithm first identifies the k closest observations (sometimes called $x_u$'s neighborhood) based on some distance metric, then these k observations \"vote\" based on their class and the class that wins the vote is $x_u$'s predicted class. More formally, the probability $x_u$ is some class j is:\n", 126 | "$$\n", 127 | "\\frac{1}{k} \\sum_{i \\in v}{I(y_i = j)}\n", 128 | "$$\n", 129 | "where v is the k observatoin in $x_u$'s neighborhood, $y_i$ is the class of the ith observation, and I is an indicator function (i.e., 1 is true, 0 otherwise). In scikit-learn we can see these probabilities using `predict_proba`\n", 130 | "\n", 131 | "## 15.3 Identifying the Best Neighborhood Size" 132 | ] 133 | }, 134 | { 135 | "cell_type": "code", 136 | "execution_count": 4, 137 | "metadata": { 138 | "collapsed": false 139 | }, 140 | "outputs": [ 141 | { 142 | "data": { 143 | "text/plain": [ 144 | "6" 145 | ] 146 | }, 147 | "execution_count": 4, 148 | "metadata": {}, 149 | "output_type": "execute_result" 150 | } 151 | ], 152 | "source": [ 153 | "from sklearn.neighbors import KNeighborsClassifier\n", 154 | "from sklearn import datasets\n", 155 | "from sklearn.preprocessing import StandardScaler\n", 156 | "from sklearn.pipeline import Pipeline, FeatureUnion\n", 157 | "from sklearn.model_selection import GridSearchCV\n", 158 | "\n", 159 | "iris = datasets.load_iris()\n", 160 | "features = iris.data\n", 161 | "target = iris.target\n", 162 | "\n", 163 | "standardizer = StandardScaler()\n", 164 | "features_standardized = standardizer.fit_transform(features)\n", 165 | "\n", 166 | "knn = KNeighborsClassifier(n_neighbors=5, n_jobs=-1)\n", 167 | "\n", 168 | "pipe = Pipeline([(\"standardizer\", standardizer), (\"knn\", knn)])\n", 169 | "\n", 170 | "search_space = [{\"knn__n_neighbors\": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]}]\n", 171 | "\n", 172 | "classifier = GridSearchCV(\n", 173 | " pipe, search_space, cv=5, verbose=0).fit(features_standardized, target)\n", 174 | "\n", 175 | "classifier.best_estimator_.get_params()[\"knn__n_neighbors\"]" 176 | ] 177 | }, 178 | { 179 | "cell_type": "markdown", 180 | "metadata": {}, 181 | "source": [ 182 | "## 15.4 Creating a Radius-Based Nearest Neighbor Classifier\n", 183 | "given an observation of unknown class, you need to predict its class based on the class of all observations within a certain distance." 
184 | ] 185 | }, 186 | { 187 | "cell_type": "code", 188 | "execution_count": 5, 189 | "metadata": { 190 | "collapsed": false 191 | }, 192 | "outputs": [ 193 | { 194 | "data": { 195 | "text/plain": [ 196 | "array([2])" 197 | ] 198 | }, 199 | "execution_count": 5, 200 | "metadata": {}, 201 | "output_type": "execute_result" 202 | } 203 | ], 204 | "source": [ 205 | "from sklearn.neighbors import RadiusNeighborsClassifier\n", 206 | "from sklearn.preprocessing import StandardScaler\n", 207 | "from sklearn import datasets\n", 208 | "\n", 209 | "iris = datasets.load_iris()\n", 210 | "features = iris.data\n", 211 | "target = iris.target\n", 212 | "\n", 213 | "standardizer = StandardScaler()\n", 214 | "features_standardized = standardizer.fit_transform(features)\n", 215 | "\n", 216 | "rnn = RadiusNeighborsClassifier(\n", 217 | " radius=.5, n_jobs=-1).fit(features_standardized, target)\n", 218 | "\n", 219 | "new_observations = [[1, 1, 1, 1]]\n", 220 | "\n", 221 | "rnn.predict(new_observations)" 222 | ] 223 | }, 224 | { 225 | "cell_type": "code", 226 | "execution_count": null, 227 | "metadata": { 228 | "collapsed": true 229 | }, 230 | "outputs": [], 231 | "source": [] 232 | } 233 | ], 234 | "metadata": { 235 | "kernelspec": { 236 | "display_name": "Python [conda env:machine_learning_cookbook]", 237 | "language": "python", 238 | "name": "conda-env-machine_learning_cookbook-py" 239 | }, 240 | "language_info": { 241 | "codemirror_mode": { 242 | "name": "ipython", 243 | "version": 3 244 | }, 245 | "file_extension": ".py", 246 | "mimetype": "text/x-python", 247 | "name": "python", 248 | "nbconvert_exporter": "python", 249 | "pygments_lexer": "ipython3", 250 | "version": "3.6.6" 251 | } 252 | }, 253 | "nbformat": 4, 254 | "nbformat_minor": 2 255 | } 256 | -------------------------------------------------------------------------------- /.ipynb_checkpoints/Chapter 16 - Logistic Regression-checkpoint.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "## Chapter 16\n", 8 | "---\n", 9 | "# Logistic Regression\n", 10 | "\n", 11 | "Despire being called a regression, logistic regression is actually a widely used supervised classification technique. 
\n", 12 | "Allows us to predict the probability that an observation is of a certain class\n", 13 | "\n", 14 | "## 16.1 Training a Binary Classifier" 15 | ] 16 | }, 17 | { 18 | "cell_type": "code", 19 | "execution_count": 3, 20 | "metadata": { 21 | "collapsed": false 22 | }, 23 | "outputs": [ 24 | { 25 | "name": "stdout", 26 | "output_type": "stream", 27 | "text": [ 28 | "model.predict: [1]\n", 29 | "model.predict_proba: [[0.18823041 0.81176959]]\n" 30 | ] 31 | } 32 | ], 33 | "source": [ 34 | "from sklearn.linear_model import LogisticRegression\n", 35 | "from sklearn import datasets\n", 36 | "from sklearn.preprocessing import StandardScaler\n", 37 | "\n", 38 | "iris = datasets.load_iris()\n", 39 | "features = iris.data[:100,:]\n", 40 | "target = iris.target[:100]\n", 41 | "\n", 42 | "scaler = StandardScaler()\n", 43 | "features_standardized = scaler.fit_transform(features)\n", 44 | "\n", 45 | "logistic_regression = LogisticRegression(random_state=0)\n", 46 | "model = logistic_regression.fit(features_standardized, target)\n", 47 | "\n", 48 | "new_observation = [[.5, .5, .5, .5]]\n", 49 | "\n", 50 | "print(\"model.predict: {}\".format(model.predict(new_observation)))\n", 51 | "print(\"model.predict_proba: {}\".format(model.predict_proba(new_observation)))" 52 | ] 53 | }, 54 | { 55 | "cell_type": "markdown", 56 | "metadata": {}, 57 | "source": [ 58 | "### Discussion\n", 59 | "Dispire having \"regression\" in its name, a logistic regression is actually a widely used binary lassifier (i.e. the target vector can only take two values). In a logistic regression, a linear model (e.g. $\\beta_0 + \\beta_i x$) is included in a logistic (also called sigmoid) function, $\\frac{1}{1+e^{-z }}$, such that:\n", 60 | "$$\n", 61 | "P(y_i = 1 | X) = \\frac{1}{1+e^{-(\\beta_0 + \\beta_1x)}}\n", 62 | "$$\n", 63 | "where $P(y_i = 1 | X)$ is the probability of the ith obsevation's target, $y_i$ being class 1, X is the training data, $\\beta_0$ and $\\beta_1$ are the parameters to be learned, and e is Euler's number. The effect of the logistic function is to constrain the value of the function's output to between 0 and 1 so that i can be interpreted as a probability. If $P(y_i = 1 | X)$ is greater than 0.5, class 1 is predicted; otherwise class 0 is predicted\n", 64 | "\n", 65 | "## 16.2 Training a Multiclass Classifier" 66 | ] 67 | }, 68 | { 69 | "cell_type": "code", 70 | "execution_count": 4, 71 | "metadata": { 72 | "collapsed": true 73 | }, 74 | "outputs": [], 75 | "source": [ 76 | "from sklearn.linear_model import LogisticRegression\n", 77 | "from sklearn import datasets\n", 78 | "from sklearn.preprocessing import StandardScaler\n", 79 | "\n", 80 | "iris = datasets.load_iris()\n", 81 | "features = iris.data\n", 82 | "target = iris.target\n", 83 | "\n", 84 | "scaler = StandardScaler()\n", 85 | "features_standardized = scaler.fit_transform(features)\n", 86 | "\n", 87 | "logistic_regression = LogisticRegression(random_state=0, multi_class=\"ovr\")\n", 88 | "#logistic_regression_MNL = LogisticRegression(random_state=0, multi_class=\"multinomial\")\n", 89 | "\n", 90 | "model = logistic_regression.fit(features_standardized, target)" 91 | ] 92 | }, 93 | { 94 | "cell_type": "markdown", 95 | "metadata": {}, 96 | "source": [ 97 | "### Discussion\n", 98 | "On their own, logistic regressions are only binary classifiers, meaning they cannot handle target vectors with more than two classes. However, two clever extensions to logistic regression do just that. 
First, in one-vs-rest logistic regression (OVR) a separate model is trained for each class predicted whether an observation is that class or not (thus making it a binary classification problem). It assumes that each observation problem (e.g. class 0 or not) is independent\n", 99 | "\n", 100 | "Alternatively in multinomial logistic regression (MLR) the logistic function we saw in Recipe 15.1 is replaced with a softmax function:\n", 101 | "$$\n", 102 | "P(y_I = k | X) = \\frac{e^{\\beta_k x_i}}{\\sum_{j=1}^{K}{e^{\\beta_j x_i}}}\n", 103 | "$$\n", 104 | "where $P(y_i = k | X)$ is the probability of the ith observation's target value, $y_i$, is class k, and K is the total number of classes. One practical advantage of the MLR is that its predicted probabilities using `predict_proba` method are more reliable\n", 105 | "\n", 106 | "We can switch to an MNL by setting `multi_class='multinomial'`\n", 107 | "\n", 108 | "## 16.3 Reducing Variance Through Regularization" 109 | ] 110 | }, 111 | { 112 | "cell_type": "code", 113 | "execution_count": 5, 114 | "metadata": { 115 | "collapsed": true 116 | }, 117 | "outputs": [], 118 | "source": [ 119 | "from sklearn.linear_model import LogisticRegressionCV\n", 120 | "from sklearn import datasets\n", 121 | "from sklearn.preprocessing import StandardScaler\n", 122 | "\n", 123 | "iris = datasets.load_iris()\n", 124 | "features = iris.data\n", 125 | "target = iris.target\n", 126 | "\n", 127 | "scaler = StandardScaler()\n", 128 | "features_standardized = scaler.fit_transform(features)\n", 129 | "\n", 130 | "logistic_regression = LogisticRegressionCV(\n", 131 | " penalty='l2', Cs=10, random_state=0, n_jobs=-1)\n", 132 | "\n", 133 | "model = logistic_regression.fit(features_standardized, target)" 134 | ] 135 | }, 136 | { 137 | "cell_type": "markdown", 138 | "metadata": {}, 139 | "source": [ 140 | "### Discussion\n", 141 | "Regularization is a method of penalizing complex models to reduce their variance. Specifically, a penalty term is added to the loss function we are trying to minimize typically the L1 and L2 penalties\n", 142 | "\n", 143 | "In the L1 penalty:\n", 144 | "$$\n", 145 | "\\alpha \\sum_{j=1}^{p}{|\\hat\\beta_j|}\n", 146 | "$$\n", 147 | "where $\\hat\\beta_j$ is the parameters of the jth of p features being learned and $\\alpha$ is a hyperparameter denoting the regularization strength.\n", 148 | "\n", 149 | "With the L2 penalty:\n", 150 | "$$\n", 151 | "\\alpha \\sum_{j=1}^{p}{\\hat\\beta_j^2}\n", 152 | "$$\n", 153 | "higher values of $\\alpha$ increase the penalty for larger parameter values(i.e. more complex models). scikit-learn follows the common method of using C instead of $\\alpha$ where C is the inverse of the regularization strength: $C = \\frac{1}{\\alpha}$. To reduce variance while using logistic regression, we can treat C as a hyperparameter to be tuned to find thevalue of C that creates the best model. 
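One way to do that tuning, sketched here as an addition (the candidate values and 5-fold cross-validation are arbitrary choices), is an ordinary grid search over C:

```python
import numpy as np
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler

iris = datasets.load_iris()
features_standardized = StandardScaler().fit_transform(iris.data)
target = iris.target

# Search candidate values of the inverse regularization strength C
grid = GridSearchCV(LogisticRegression(random_state=0),
                    {"C": np.logspace(-3, 3, 7)},
                    cv=5)
best_C = grid.fit(features_standardized, target).best_params_["C"]
```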
In scikit-learn we can use the `LogisticRegressionCV` class to efficiently tune C.\n", 154 | "\n", 155 | "## 16.4 Training a Classifier on Very Large Data" 156 | ] 157 | }, 158 | { 159 | "cell_type": "code", 160 | "execution_count": 6, 161 | "metadata": { 162 | "collapsed": true 163 | }, 164 | "outputs": [], 165 | "source": [ 166 | "from sklearn.linear_model import LogisticRegression\n", 167 | "from sklearn import datasets\n", 168 | "from sklearn.preprocessing import StandardScaler\n", 169 | "\n", 170 | "iris = datasets.load_iris()\n", 171 | "features = iris.data\n", 172 | "target = iris.target\n", 173 | "\n", 174 | "scaler = StandardScaler()\n", 175 | "features_standardized = scaler.fit_transform(features)\n", 176 | "\n", 177 | "logistic_regression = LogisticRegression(random_state=0, solver=\"sag\") # stochastic average gradient (SAG) solver\n", 178 | "model = logistic_regression.fit(features_standardized, target)" 179 | ] 180 | }, 181 | { 182 | "cell_type": "markdown", 183 | "metadata": {}, 184 | "source": [ 185 | "### Discussion\n", 186 | "scikit-learn's `LogisticRegression` offers a number of techniques for training a logistic regression, called solvers. Most of the time scikit-learn will select the best solver automatically for us or warn us we cannot do something with that solver.\n", 187 | "\n", 188 | "Stochastic averge gradient descent allows us to train a model much faster than other solvers when our data is very large. However, it is also very sensitive to feature scaling, so standardizing our features is particularly important\n", 189 | "\n", 190 | "### See Also\n", 191 | "* Minimizing Finite Sums with the Stochastic Average Gradient Algorithm, Mark Schmidt (http://www.birs.ca/workshops/2014/14w5003/files/schmidt.pdf)\n", 192 | "\n", 193 | "## 16.5 Handling Imbalanced Classes" 194 | ] 195 | }, 196 | { 197 | "cell_type": "code", 198 | "execution_count": 8, 199 | "metadata": { 200 | "collapsed": false 201 | }, 202 | "outputs": [], 203 | "source": [ 204 | "import numpy as np\n", 205 | "from sklearn.linear_model import LogisticRegression\n", 206 | "from sklearn import datasets\n", 207 | "from sklearn.preprocessing import StandardScaler\n", 208 | "\n", 209 | "iris = datasets.load_iris()\n", 210 | "features = iris.data[40:, :]\n", 211 | "target = iris.target[40:]\n", 212 | "\n", 213 | "target = np.where((target == 0), 0, 1)\n", 214 | "\n", 215 | "scaler = StandardScaler()\n", 216 | "features_standardized = scaler.fit_transform(features)\n", 217 | "\n", 218 | "logistic_regression = LogisticRegression(random_state=0, class_weight=\"balanced\")\n", 219 | "model = logistic_regression.fit(features_standardized, target)" 220 | ] 221 | }, 222 | { 223 | "cell_type": "markdown", 224 | "metadata": {}, 225 | "source": [ 226 | "### Discussion\n", 227 | "`LogisticRegression` comes with a built in method of handling imbalanced classes.\n", 228 | "`class_weight=\"balanced\"` will automatically weigh classes inversely proportional to their frequency:\n", 229 | "$$\n", 230 | "w_j = \\frac{n}{kn_j}\n", 231 | "$$\n", 232 | "where $w_j$ is the weight to class j, n is the number of observations, $n_j$ is the number of observations in class j, and k is the total number of classes" 233 | ] 234 | } 235 | ], 236 | "metadata": { 237 | "kernelspec": { 238 | "display_name": "Python [conda env:machine_learning_cookbook]", 239 | "language": "python", 240 | "name": "conda-env-machine_learning_cookbook-py" 241 | }, 242 | "language_info": { 243 | "codemirror_mode": { 244 | "name": "ipython", 245 | "version": 3 246 | 
}, 247 | "file_extension": ".py", 248 | "mimetype": "text/x-python", 249 | "name": "python", 250 | "nbconvert_exporter": "python", 251 | "pygments_lexer": "ipython3", 252 | "version": "3.6.6" 253 | } 254 | }, 255 | "nbformat": 4, 256 | "nbformat_minor": 2 257 | } 258 | -------------------------------------------------------------------------------- /.ipynb_checkpoints/Chapter 18 - Naive Bayes-checkpoint.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "## Chapter 18\n", 8 | "---\n", 9 | "# Naive Bayes\n", 10 | "\n", 11 | "### 18.0 Introduction\n", 12 | "Bayes' theorem is the premier method for understanding the probability of some event $P(A|B)$, given some new information, $P(B|A)$, and a prior belief in the probability of the event, P(A):\n", 13 | "$$\n", 14 | "P(A | B) = \\frac{P(B|A)P(A)}{P(B)}\n", 15 | "$$\n", 16 | "\n", 17 | "The Bayesian method's popularity has skyrocked in the last decade, more and more rivaling the traditional frequentist applications in academia, government, and business. In machine learning, one applicaiton of Bayes' theorem to classifican comes in the form of the naive Bayes classifier. Naive Bayes classifiers combine a number of desirable qualities in practical machine learning into a single classifier:\n", 18 | "\n", 19 | "1. An intuitive approach\n", 20 | "2. The ability to work with small data\n", 21 | "3. Low computation costs for training and prediction\n", 22 | "4. Often solid results in a variety of settigns\n", 23 | "\n", 24 | "Specifically, a naive bayes classifier is based on:\n", 25 | "$$\n", 26 | "P(y | x_1, ..., x_j) = \\frac{P(x_1, ..., x_j | y)P(y)}{P(x_1,...,x_j)}\n", 27 | "$$\n", 28 | "where,\n", 29 | "* $P(y | x_1, ..., x_j)$ is called the *posterior* and is the probability that an observation is class y given observation's values for the j features, $x_1, ..., x_j$\n", 30 | "* $P(x_1, ..., x_j)$ is called likelihood and is the *likelihood* of an observation's values for features, $x_1, ..., x_j$, given their class y.\n", 31 | "* $P(y)$ is called the *prior* and is our belief for the probability of class y before looking at the data\n", 32 | "* P($x_1, ..., x_j$) is called the *marginal probability*\n", 33 | "\n", 34 | "In naive Bayes, we compare an obsrvation's posterior values for each possible class. Specifically, because the marginal probability is constant across these comparisons, we compare the numerators of the posterior for each class. For each observation the class with the greatest posterior numerator becomes the predicted class, $\\hat y$.\n", 35 | "\n", 36 | "There are two important things to note about naive Bayes classifiers.\n", 37 | "\n", 38 | "1. for each feature in the data, we have to assume the statistical distribution of the likelihood, $P(x_1, ..., x_j)$.\n", 39 | "- the common distributions are the normal (Gaussian), multinomial, and Bernoulli distributions.\n", 40 | "- the distribution chose is often determined by the nature of the features (continuous, binary, etc.)\n", 41 | "\n", 42 | "2. naive Bayes gets its name because we assume that each feature, and its resulting likelihood, is independent. 
This \"naive\" assumption is frequently wrong, yet in practice does little to prevent building high quality classifiers\n", 43 | "\n", 44 | "## 18.1 Training a Classifier for Continuous Features" 45 | ] 46 | }, 47 | { 48 | "cell_type": "code", 49 | "execution_count": 3, 50 | "metadata": { 51 | "collapsed": false 52 | }, 53 | "outputs": [ 54 | { 55 | "data": { 56 | "text/plain": [ 57 | "array([1])" 58 | ] 59 | }, 60 | "execution_count": 3, 61 | "metadata": {}, 62 | "output_type": "execute_result" 63 | } 64 | ], 65 | "source": [ 66 | "from sklearn import datasets\n", 67 | "from sklearn.naive_bayes import GaussianNB\n", 68 | "\n", 69 | "iris = datasets.load_iris()\n", 70 | "features = iris.data\n", 71 | "target = iris.target\n", 72 | "\n", 73 | "classifier = GaussianNB()\n", 74 | "\n", 75 | "model = classifier.fit(features, target)\n", 76 | "\n", 77 | "new_observation = [[4, 4, 4, 0.4]]\n", 78 | "model.predict(new_observation)" 79 | ] 80 | }, 81 | { 82 | "cell_type": "markdown", 83 | "metadata": {}, 84 | "source": [ 85 | "### Discussion\n", 86 | "The most common type of naive Bayes classifier is the Gaussian naive Bayesa. In Gaussian naive Bayesam we assuem that the likelihood of the feature values, x, given an observation is of class y, follows a normal distribution:\n", 87 | "$$\n", 88 | "p(x_j | y) = \\frac{1}{\\sqrt{2\\pi \\sigma_y^2}} e^{-\\frac{(x_j - \\mu_y)^2}{2\\sigma_y^2}}\n", 89 | "$$\n", 90 | "where $\\sigma_y^2$ and $\\mu_y$ are the variance and mean values of feature x_j for class y. Because of the assumption of the normal distribution, Gaussian naive Bayes is best used in cases when all our features are continuous.\n", 91 | "\n", 92 | "One of the interesting aspects of naive Bayes classifiers is that they allow us to assign a prior belief over the respect target classes. We can do this using `GaussianNB`'s `priors` parameter, which takes in a list of the probabilities assigned to each class of the target vector" 93 | ] 94 | }, 95 | { 96 | "cell_type": "code", 97 | "execution_count": 4, 98 | "metadata": { 99 | "collapsed": true 100 | }, 101 | "outputs": [], 102 | "source": [ 103 | "clf = GaussianNB(priors=[0.25, 0.25, 0.5])\n", 104 | "model = classifier.fit(features, target)" 105 | ] 106 | }, 107 | { 108 | "cell_type": "markdown", 109 | "metadata": {}, 110 | "source": [ 111 | "### See Also\n", 112 | "* How the Naive Bayes Classifier Works (http://dataaspirant.com/2017/02/06/naive-bayes-classifier-machine-learning/)\n", 113 | "\n", 114 | "## 18.2 Training a Classifier for Discrete and Count Features\n" 115 | ] 116 | }, 117 | { 118 | "cell_type": "code", 119 | "execution_count": 6, 120 | "metadata": { 121 | "collapsed": false 122 | }, 123 | "outputs": [ 124 | { 125 | "data": { 126 | "text/plain": [ 127 | "array([0])" 128 | ] 129 | }, 130 | "execution_count": 6, 131 | "metadata": {}, 132 | "output_type": "execute_result" 133 | } 134 | ], 135 | "source": [ 136 | "import numpy as np\n", 137 | "from sklearn.naive_bayes import MultinomialNB\n", 138 | "from sklearn.feature_extraction.text import CountVectorizer\n", 139 | "\n", 140 | "text_data = np.array(['I love Brazil. 
Brazil!', 'Brazil is best', 'Germany beats both'])\n", 141 | "\n", 142 | "count = CountVectorizer()\n", 143 | "bag_of_words = count.fit_transform(text_data)\n", 144 | "\n", 145 | "features = bag_of_words.toarray()\n", 146 | "\n", 147 | "target = np.array([0, 0, 1])\n", 148 | "\n", 149 | "classifier = MultinomialNB(class_prior=[0.25, 0.5])\n", 150 | "model = classifier.fit(features, target)\n", 151 | "\n", 152 | "new_observation = [[0, 0, 0, 1, 0, 1, 0]]\n", 153 | "model.predict(new_observation)" 154 | ] 155 | }, 156 | { 157 | "cell_type": "markdown", 158 | "metadata": {}, 159 | "source": [ 160 | "### Discussion\n", 161 | "\n", 162 | "Multinomial naive Bayes works similarly to Gaussian naive Bayes, but the features are assumed to be multinomial distributed. In practice this means that this classifier is commonly used when we have discrete data. One of the most common uses is text classification using bag of words or tf-idf approaches\n", 163 | "\n", 164 | "## 18.3 Training a Naive Bayes Classifier for Binary Features" 165 | ] 166 | }, 167 | { 168 | "cell_type": "code", 169 | "execution_count": 7, 170 | "metadata": { 171 | "collapsed": true 172 | }, 173 | "outputs": [], 174 | "source": [ 175 | "import numpy as np\n", 176 | "from sklearn.naive_bayes import BernoulliNB\n", 177 | "\n", 178 | "features = np.random.randint(2, size=(100, 3))\n", 179 | "target = np.random.randint(2, size=(100, 1)).ravel()\n", 180 | "\n", 181 | "classifier = BernoulliNB(class_prior=[0.25, 0.5])\n", 182 | "model = classifier.fit(features, target)" 183 | ] 184 | }, 185 | { 186 | "cell_type": "markdown", 187 | "metadata": {}, 188 | "source": [ 189 | "The Bernoulli naive Bayes classifier assumes that all our features are binary such that they take only two values (e.g. a nominal categorical feature that has been one-hot encoded). 
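For instance, word presence/absence features can be produced with `CountVectorizer(binary=True)` and fed to `BernoulliNB`; the sketch below is an added illustration reusing the text data from Recipe 18.2.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB

text_data = np.array(['I love Brazil. Brazil!', 'Brazil is best', 'Germany beats both'])
target = np.array([0, 0, 1])

# binary=True records word presence/absence rather than counts
binary_bag_of_words = CountVectorizer(binary=True).fit_transform(text_data).toarray()

model = BernoulliNB().fit(binary_bag_of_words, target)
```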
Like its multinomial cousin, Bernoulli naive Bayes is often used in text classification, when our feature matrix is simply the presence or absence of a word in a document" 190 | ] 191 | } 192 | ], 193 | "metadata": { 194 | "kernelspec": { 195 | "display_name": "Python [conda env:machine_learning_cookbook]", 196 | "language": "python", 197 | "name": "conda-env-machine_learning_cookbook-py" 198 | }, 199 | "language_info": { 200 | "codemirror_mode": { 201 | "name": "ipython", 202 | "version": 3 203 | }, 204 | "file_extension": ".py", 205 | "mimetype": "text/x-python", 206 | "name": "python", 207 | "nbconvert_exporter": "python", 208 | "pygments_lexer": "ipython3", 209 | "version": "3.6.6" 210 | } 211 | }, 212 | "nbformat": 4, 213 | "nbformat_minor": 2 214 | } 215 | -------------------------------------------------------------------------------- /.ipynb_checkpoints/Chapter 19 - Clustering-checkpoint.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "## Chapter 19\n", 8 | "---\n", 9 | "# Clustering\n", 10 | "\n", 11 | "## 19.0 Introduction\n", 12 | "\n", 13 | "Frequently, we run into situations where we only know the features.\n", 14 | "\n", 15 | "The goal of clustering algorithms is to identify latent groupings of obesrvations, which if done well, allow us to predict the class of observations even without a target vector.\n", 16 | "\n", 17 | "## 19.1 Clustering Using K-Means" 18 | ] 19 | }, 20 | { 21 | "cell_type": "code", 22 | "execution_count": 2, 23 | "metadata": { 24 | "collapsed": false 25 | }, 26 | "outputs": [ 27 | { 28 | "data": { 29 | "text/plain": [ 30 | "array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,\n", 31 | " 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,\n", 32 | " 1, 1, 1, 1, 1, 1, 0, 0, 0, 2, 2, 2, 0, 2, 2, 2, 2, 2, 2, 2, 2, 0,\n", 33 | " 2, 2, 2, 2, 0, 2, 2, 2, 2, 0, 0, 0, 2, 2, 2, 2, 2, 2, 2, 0, 0, 2,\n", 34 | " 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 0, 2, 0, 0, 0, 0, 2, 0, 0, 0,\n", 35 | " 0, 0, 0, 2, 2, 0, 0, 0, 0, 2, 0, 2, 0, 2, 0, 0, 2, 0, 0, 0, 0, 0,\n", 36 | " 0, 2, 2, 0, 0, 0, 2, 0, 0, 0, 2, 0, 0, 0, 2, 0, 0, 2], dtype=int32)" 37 | ] 38 | }, 39 | "execution_count": 2, 40 | "metadata": {}, 41 | "output_type": "execute_result" 42 | } 43 | ], 44 | "source": [ 45 | "from sklearn import datasets\n", 46 | "from sklearn.preprocessing import StandardScaler\n", 47 | "from sklearn.cluster import KMeans\n", 48 | "\n", 49 | "iris = datasets.load_iris()\n", 50 | "features = iris.data\n", 51 | "\n", 52 | "scaler = StandardScaler()\n", 53 | "features_std = scaler.fit_transform(features)\n", 54 | "\n", 55 | "cluster = KMeans(n_clusters=3, random_state=0, n_jobs=-1)\n", 56 | "model = cluster.fit(features_std)\n", 57 | "\n", 58 | "model.labels_" 59 | ] 60 | }, 61 | { 62 | "cell_type": "markdown", 63 | "metadata": {}, 64 | "source": [ 65 | "### Discussion\n", 66 | "k-means clustering is one of the most common clustering techniques. In k-means clustering, the algorithm attempts to group observations into k groups, with each group having roughly equal variance. The number of groups, k, is specified by the user as a hyperparameter. Specifically, in k-means:\n", 67 | "\n", 68 | "1. k cluster \"center\" points are created at random locations.\n", 69 | "\n", 70 | "2. For each observation:\n", 71 | " a. the distance between each observaiton and the k center points is calculated\n", 72 | " b. 
the observation is assigned to the cluster of the nearest center point\n", 73 | " \n", 74 | "3. The center points are moved to the means (i.e., centers) of their respective clusters\n", 75 | "\n", 76 | "4. Steps 2 and 3 are repeated until no observation changes in cluster membership\n", 77 | "\n", 78 | "k-means clustering assumes:\n", 79 | "* the clusters are convex shaped (e.g. a circle, a sphere).\n", 80 | "* all features are equally scaled\n", 81 | "* the groups are balanced\n", 82 | "\n", 83 | "### See Also\n", 84 | "* Introduction to K-means Clustering (https://www.datascience.com/blog/k-means-clustering)\n", 85 | "\n", 86 | "## 19.2 Speeding Up K-Means Clustering" 87 | ] 88 | }, 89 | { 90 | "cell_type": "code", 91 | "execution_count": 3, 92 | "metadata": { 93 | "collapsed": false 94 | }, 95 | "outputs": [ 96 | { 97 | "data": { 98 | "text/plain": [ 99 | "array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,\n", 100 | " 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,\n", 101 | " 1, 1, 1, 1, 1, 1, 2, 2, 2, 0, 0, 0, 2, 0, 2, 0, 0, 0, 0, 0, 0, 2,\n", 102 | " 0, 0, 0, 0, 2, 0, 0, 0, 0, 2, 2, 2, 0, 0, 0, 0, 0, 0, 0, 2, 2, 0,\n", 103 | " 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 2, 2, 2, 2, 0, 2, 2, 2,\n", 104 | " 2, 2, 2, 0, 2, 2, 2, 2, 2, 0, 2, 0, 2, 0, 2, 2, 0, 2, 2, 2, 2, 2,\n", 105 | " 2, 0, 0, 2, 2, 2, 2, 2, 2, 2, 0, 2, 2, 2, 0, 2, 2, 2], dtype=int32)" 106 | ] 107 | }, 108 | "execution_count": 3, 109 | "metadata": {}, 110 | "output_type": "execute_result" 111 | } 112 | ], 113 | "source": [ 114 | "from sklearn import datasets\n", 115 | "from sklearn.preprocessing import StandardScaler\n", 116 | "from sklearn.cluster import MiniBatchKMeans\n", 117 | "\n", 118 | "iris = datasets.load_iris()\n", 119 | "features = iris.data\n", 120 | "\n", 121 | "scaler = StandardScaler()\n", 122 | "features_std = scaler.fit_transform(features)\n", 123 | "\n", 124 | "cluster = MiniBatchKMeans(n_clusters=3, random_state=0, batch_size=100)\n", 125 | "model = cluster.fit(features_std)\n", 126 | "\n", 127 | "model.labels_" 128 | ] 129 | }, 130 | { 131 | "cell_type": "markdown", 132 | "metadata": {}, 133 | "source": [ 134 | "## 19.3 Clustering Using Meanshift" 135 | ] 136 | }, 137 | { 138 | "cell_type": "code", 139 | "execution_count": 4, 140 | "metadata": { 141 | "collapsed": false 142 | }, 143 | "outputs": [ 144 | { 145 | "data": { 146 | "text/plain": [ 147 | "array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,\n", 148 | " 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,\n", 149 | " 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n", 150 | " 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n", 151 | " 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n", 152 | " 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n", 153 | " 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])" 154 | ] 155 | }, 156 | "execution_count": 4, 157 | "metadata": {}, 158 | "output_type": "execute_result" 159 | } 160 | ], 161 | "source": [ 162 | "from sklearn import datasets\n", 163 | "from sklearn.preprocessing import StandardScaler\n", 164 | "from sklearn.cluster import MeanShift\n", 165 | "\n", 166 | "iris = datasets.load_iris()\n", 167 | "features = iris.data\n", 168 | "\n", 169 | "scaler = StandardScaler()\n", 170 | "features_std = scaler.fit_transform(features)\n", 171 | "\n", 172 | "cluster = MeanShift(n_jobs=-1)\n", 173 | "model = cluster.fit(features_std)\n", 174 | "\n", 175 | 
"model.labels_" 176 | ] 177 | }, 178 | { 179 | "cell_type": "markdown", 180 | "metadata": {}, 181 | "source": [ 182 | "## 19.4 Clustering Using DBSCAN" 183 | ] 184 | }, 185 | { 186 | "cell_type": "code", 187 | "execution_count": 5, 188 | "metadata": { 189 | "collapsed": false 190 | }, 191 | "outputs": [ 192 | { 193 | "data": { 194 | "text/plain": [ 195 | "array([ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, -1, -1, 0,\n", 196 | " 0, 0, 0, 0, 0, -1, 0, 0, 0, 0, 0, 0, 0, 0, 0, -1, -1,\n", 197 | " 0, 0, 0, 0, 0, 0, 0, -1, 0, 0, 0, 0, 0, 0, 0, 0, 1,\n", 198 | " 1, 1, 1, 1, 1, -1, -1, 1, -1, -1, 1, -1, 1, 1, 1, 1, 1,\n", 199 | " -1, 1, 1, 1, -1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,\n", 200 | " -1, 1, -1, 1, 1, 1, 1, 1, -1, 1, 1, 1, 1, -1, 1, -1, 1,\n", 201 | " 1, 1, 1, -1, -1, -1, -1, -1, 1, 1, 1, 1, -1, 1, 1, -1, -1,\n", 202 | " -1, 1, 1, -1, 1, 1, -1, 1, 1, 1, -1, -1, -1, 1, 1, 1, -1,\n", 203 | " -1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, -1, 1])" 204 | ] 205 | }, 206 | "execution_count": 5, 207 | "metadata": {}, 208 | "output_type": "execute_result" 209 | } 210 | ], 211 | "source": [ 212 | "from sklearn import datasets\n", 213 | "from sklearn.preprocessing import StandardScaler\n", 214 | "from sklearn.cluster import DBSCAN\n", 215 | "\n", 216 | "iris = datasets.load_iris()\n", 217 | "features = iris.data\n", 218 | "\n", 219 | "scaler = StandardScaler()\n", 220 | "features_std = scaler.fit_transform(features)\n", 221 | "\n", 222 | "cluster = DBSCAN(n_jobs=-1)\n", 223 | "model = cluster.fit(features_std)\n", 224 | "\n", 225 | "model.labels_" 226 | ] 227 | }, 228 | { 229 | "cell_type": "markdown", 230 | "metadata": {}, 231 | "source": [ 232 | "## 19.5 Clustering using Hierarchical Merging" 233 | ] 234 | }, 235 | { 236 | "cell_type": "code", 237 | "execution_count": 7, 238 | "metadata": { 239 | "collapsed": false 240 | }, 241 | "outputs": [ 242 | { 243 | "data": { 244 | "text/plain": [ 245 | "array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,\n", 246 | " 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1,\n", 247 | " 1, 1, 1, 1, 1, 1, 0, 0, 0, 2, 0, 2, 0, 2, 0, 2, 2, 0, 2, 0, 2, 0,\n", 248 | " 2, 2, 2, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 2, 2, 2, 0, 2, 0, 0, 2,\n", 249 | " 2, 2, 2, 0, 2, 2, 2, 2, 2, 0, 2, 2, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0,\n", 250 | " 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n", 251 | " 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])" 252 | ] 253 | }, 254 | "execution_count": 7, 255 | "metadata": {}, 256 | "output_type": "execute_result" 257 | } 258 | ], 259 | "source": [ 260 | "from sklearn import datasets\n", 261 | "from sklearn.preprocessing import StandardScaler\n", 262 | "from sklearn.cluster import AgglomerativeClustering\n", 263 | "\n", 264 | "iris = datasets.load_iris()\n", 265 | "features = iris.data\n", 266 | "\n", 267 | "scaler = StandardScaler()\n", 268 | "features_std = scaler.fit_transform(features)\n", 269 | "\n", 270 | "cluster = AgglomerativeClustering(n_clusters=3)\n", 271 | "model = cluster.fit(features_std)\n", 272 | "\n", 273 | "model.labels_" 274 | ] 275 | }, 276 | { 277 | "cell_type": "code", 278 | "execution_count": null, 279 | "metadata": { 280 | "collapsed": true 281 | }, 282 | "outputs": [], 283 | "source": [] 284 | } 285 | ], 286 | "metadata": { 287 | "kernelspec": { 288 | "display_name": "Python [conda env:machine_learning_cookbook]", 289 | "language": "python", 290 | "name": "conda-env-machine_learning_cookbook-py" 291 | }, 292 | "language_info": { 293 | "codemirror_mode": { 294 | "name": 
"ipython", 295 | "version": 3 296 | }, 297 | "file_extension": ".py", 298 | "mimetype": "text/x-python", 299 | "name": "python", 300 | "nbconvert_exporter": "python", 301 | "pygments_lexer": "ipython3", 302 | "version": "3.6.6" 303 | } 304 | }, 305 | "nbformat": 4, 306 | "nbformat_minor": 2 307 | } 308 | -------------------------------------------------------------------------------- /.ipynb_checkpoints/Chapter 21 - Saving and Loading Trained Models-checkpoint.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "### Chapter 21\n", 8 | "## Saving and Loading Trained Models\n", 9 | "\n", 10 | "### 21.0 Introduction\n", 11 | "In the last 20 chapters around 200 recipies, we have convered how to take raw data nad usem achine learning to create well-performing predictive models. However, for all our work to be worthwhile we eventually need to do something with our model, such as integrating it with an existing software application. To accomplish this goal, we need to be able to bot hsave our models after training and load them when they are needed by an application. This is the focus of the final chapter\n", 12 | "\n", 13 | "### 21.1 Saving and Loading a scikit-learn Model\n", 14 | "#### Problem\n", 15 | "You have trained a scikit-learn model and want to save it and load it elsewhere.\n", 16 | "\n", 17 | "#### Solution\n", 18 | "Save the model as a pickle file:" 19 | ] 20 | }, 21 | { 22 | "cell_type": "code", 23 | "execution_count": 1, 24 | "metadata": {}, 25 | "outputs": [ 26 | { 27 | "name": "stderr", 28 | "output_type": "stream", 29 | "text": [ 30 | "/Users/f00/anaconda/envs/machine_learning_cookbook/lib/python3.6/site-packages/sklearn/ensemble/weight_boosting.py:29: DeprecationWarning: numpy.core.umath_tests is an internal NumPy module and should not be imported. 
It will be removed in a future NumPy release.\n", 31 | " from numpy.core.umath_tests import inner1d\n" 32 | ] 33 | }, 34 | { 35 | "data": { 36 | "text/plain": [ 37 | "['model.pkl']" 38 | ] 39 | }, 40 | "execution_count": 1, 41 | "metadata": {}, 42 | "output_type": "execute_result" 43 | } 44 | ], 45 | "source": [ 46 | "# load libraries\n", 47 | "from sklearn.ensemble import RandomForestClassifier\n", 48 | "from sklearn import datasets\n", 49 | "from sklearn.externals import joblib\n", 50 | "\n", 51 | "# load data\n", 52 | "iris = datasets.load_iris()\n", 53 | "features = iris.data\n", 54 | "target = iris.target\n", 55 | "\n", 56 | "# create decision tree classifier object\n", 57 | "classifier = RandomForestClassifier()\n", 58 | "\n", 59 | "# train model\n", 60 | "model = classifier.fit(features, target)\n", 61 | "\n", 62 | "# save model as pickle file\n", 63 | "joblib.dump(model, \"model.pkl\")" 64 | ] 65 | }, 66 | { 67 | "cell_type": "markdown", 68 | "metadata": {}, 69 | "source": [ 70 | "Once the model is saved we can use scikit-learn in our destination application (e.g., web application) to load the model:" 71 | ] 72 | }, 73 | { 74 | "cell_type": "code", 75 | "execution_count": 2, 76 | "metadata": {}, 77 | "outputs": [], 78 | "source": [ 79 | "# load model from file\n", 80 | "classifier = joblib.load(\"model.pkl\")" 81 | ] 82 | }, 83 | { 84 | "cell_type": "markdown", 85 | "metadata": {}, 86 | "source": [ 87 | "And use it to make predictions" 88 | ] 89 | }, 90 | { 91 | "cell_type": "code", 92 | "execution_count": 3, 93 | "metadata": {}, 94 | "outputs": [ 95 | { 96 | "data": { 97 | "text/plain": [ 98 | "array([0])" 99 | ] 100 | }, 101 | "execution_count": 3, 102 | "metadata": {}, 103 | "output_type": "execute_result" 104 | } 105 | ], 106 | "source": [ 107 | "# create new observation\n", 108 | "new_observation = [[ 5.2, 3.2, 1.1, 0.1]]\n", 109 | "\n", 110 | "# predict obserrvation's class\n", 111 | "classifier.predict(new_observation)" 112 | ] 113 | }, 114 | { 115 | "cell_type": "markdown", 116 | "metadata": {}, 117 | "source": [ 118 | "### Discussion\n", 119 | "The first step in using a model in production is to save that model as a file that can be loaded by another application or workflow. We can accomplish this by saving the model as a pickle file, a Python-specific data format. 
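For reference, the same round trip with the standard-library `pickle` module would look like the sketch below; this is an added aside (the filename `model_pickle.pkl` is arbitrary), while the notebook itself uses joblib, as explained next.

```python
import pickle

# Save the trained classifier with the standard-library pickle module
with open("model_pickle.pkl", "wb") as f:
    pickle.dump(model, f)

# Load it back and make a prediction
with open("model_pickle.pkl", "rb") as f:
    loaded_classifier = pickle.load(f)

loaded_classifier.predict([[5.2, 3.2, 1.1, 0.1]])
```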
Specifically, to save the model we use `joblib`, which is a library extending pickle for cases when we have large NumPy arrays--a common occurance for trained models in scikit-learn.\n", 120 | "\n", 121 | "When saving scikit-learn models, be aware that saved models might not be compatible between versions of scikit-learn; therefore, it can be helpful to include the version of scikit-learn used in the model in the filename:" 122 | ] 123 | }, 124 | { 125 | "cell_type": "code", 126 | "execution_count": 4, 127 | "metadata": {}, 128 | "outputs": [ 129 | { 130 | "data": { 131 | "text/plain": [ 132 | "['model_(version).pkl']" 133 | ] 134 | }, 135 | "execution_count": 4, 136 | "metadata": {}, 137 | "output_type": "execute_result" 138 | } 139 | ], 140 | "source": [ 141 | "# import library\n", 142 | "import sklearn\n", 143 | "\n", 144 | "# get scikit-learn version\n", 145 | "scikit_version = joblib.__version__\n", 146 | "\n", 147 | "# save model as pickle file\n", 148 | "joblib.dump(model, \"model_(version).pkl\".format(version=scikit_version))" 149 | ] 150 | }, 151 | { 152 | "cell_type": "markdown", 153 | "metadata": {}, 154 | "source": [ 155 | "### 21.2 Saving and Loading a Keras Model\n", 156 | "#### Problem\n", 157 | "You have a trained Keras model and want to save it and load it elsewhere.\n", 158 | "\n", 159 | "#### Solution\n", 160 | "Save the model as HDF5:" 161 | ] 162 | }, 163 | { 164 | "cell_type": "code", 165 | "execution_count": 1, 166 | "metadata": {}, 167 | "outputs": [ 168 | { 169 | "name": "stderr", 170 | "output_type": "stream", 171 | "text": [ 172 | "Using Theano backend.\n" 173 | ] 174 | }, 175 | { 176 | "ename": "ModuleNotFoundError", 177 | "evalue": "No module named 'theano'", 178 | "output_type": "error", 179 | "traceback": [ 180 | "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", 181 | "\u001b[0;31mModuleNotFoundError\u001b[0m Traceback (most recent call last)", 182 | "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[1;32m 1\u001b[0m \u001b[0;31m# load libraries\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 2\u001b[0m \u001b[0;32mimport\u001b[0m \u001b[0mnumpy\u001b[0m \u001b[0;32mas\u001b[0m \u001b[0mnp\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 3\u001b[0;31m \u001b[0;32mfrom\u001b[0m \u001b[0mkeras\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mdatasets\u001b[0m \u001b[0;32mimport\u001b[0m \u001b[0mimdb\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 4\u001b[0m \u001b[0;32mfrom\u001b[0m \u001b[0mkeras\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mpreprocessing\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mtext\u001b[0m \u001b[0;32mimport\u001b[0m \u001b[0mTokenizer\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 5\u001b[0m \u001b[0;32mfrom\u001b[0m \u001b[0mkeras\u001b[0m \u001b[0;32mimport\u001b[0m \u001b[0mmodels\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", 183 | "\u001b[0;32m~/anaconda/envs/machine_learning_cookbook/lib/python3.6/site-packages/keras/__init__.py\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[1;32m 1\u001b[0m \u001b[0;32mfrom\u001b[0m \u001b[0m__future__\u001b[0m \u001b[0;32mimport\u001b[0m \u001b[0mabsolute_import\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 2\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 3\u001b[0;31m \u001b[0;32mfrom\u001b[0m \u001b[0;34m.\u001b[0m \u001b[0;32mimport\u001b[0m \u001b[0mutils\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 4\u001b[0m \u001b[0;32mfrom\u001b[0m 
\u001b[0;34m.\u001b[0m \u001b[0;32mimport\u001b[0m \u001b[0mactivations\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 5\u001b[0m \u001b[0;32mfrom\u001b[0m \u001b[0;34m.\u001b[0m \u001b[0;32mimport\u001b[0m \u001b[0mapplications\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", 184 | "\u001b[0;32m~/anaconda/envs/machine_learning_cookbook/lib/python3.6/site-packages/keras/utils/__init__.py\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[1;32m 4\u001b[0m \u001b[0;32mfrom\u001b[0m \u001b[0;34m.\u001b[0m \u001b[0;32mimport\u001b[0m \u001b[0mdata_utils\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 5\u001b[0m \u001b[0;32mfrom\u001b[0m \u001b[0;34m.\u001b[0m \u001b[0;32mimport\u001b[0m \u001b[0mio_utils\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 6\u001b[0;31m \u001b[0;32mfrom\u001b[0m \u001b[0;34m.\u001b[0m \u001b[0;32mimport\u001b[0m \u001b[0mconv_utils\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 7\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 8\u001b[0m \u001b[0;31m# Globally-importable utils.\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", 185 | "\u001b[0;32m~/anaconda/envs/machine_learning_cookbook/lib/python3.6/site-packages/keras/utils/conv_utils.py\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[1;32m 7\u001b[0m \u001b[0;32mfrom\u001b[0m \u001b[0msix\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mmoves\u001b[0m \u001b[0;32mimport\u001b[0m \u001b[0mrange\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 8\u001b[0m \u001b[0;32mimport\u001b[0m \u001b[0mnumpy\u001b[0m \u001b[0;32mas\u001b[0m \u001b[0mnp\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 9\u001b[0;31m \u001b[0;32mfrom\u001b[0m \u001b[0;34m.\u001b[0m\u001b[0;34m.\u001b[0m \u001b[0;32mimport\u001b[0m \u001b[0mbackend\u001b[0m \u001b[0;32mas\u001b[0m \u001b[0mK\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 10\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 11\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n", 186 | "\u001b[0;32m~/anaconda/envs/machine_learning_cookbook/lib/python3.6/site-packages/keras/backend/__init__.py\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[1;32m 84\u001b[0m \u001b[0;32melif\u001b[0m \u001b[0m_BACKEND\u001b[0m \u001b[0;34m==\u001b[0m \u001b[0;34m'theano'\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 85\u001b[0m \u001b[0msys\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mstderr\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mwrite\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m'Using Theano backend.\\n'\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 86\u001b[0;31m \u001b[0;32mfrom\u001b[0m \u001b[0;34m.\u001b[0m\u001b[0mtheano_backend\u001b[0m \u001b[0;32mimport\u001b[0m \u001b[0;34m*\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 87\u001b[0m \u001b[0;32melif\u001b[0m \u001b[0m_BACKEND\u001b[0m \u001b[0;34m==\u001b[0m \u001b[0;34m'tensorflow'\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 88\u001b[0m \u001b[0msys\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mstderr\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mwrite\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m'Using TensorFlow backend.\\n'\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", 187 | "\u001b[0;32m~/anaconda/envs/machine_learning_cookbook/lib/python3.6/site-packages/keras/backend/theano_backend.py\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[1;32m 5\u001b[0m \u001b[0;32mfrom\u001b[0m \u001b[0mcollections\u001b[0m \u001b[0;32mimport\u001b[0m 
\u001b[0mdefaultdict\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 6\u001b[0m \u001b[0;32mfrom\u001b[0m \u001b[0mcontextlib\u001b[0m \u001b[0;32mimport\u001b[0m \u001b[0mcontextmanager\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 7\u001b[0;31m \u001b[0;32mimport\u001b[0m \u001b[0mtheano\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 8\u001b[0m \u001b[0;32mfrom\u001b[0m \u001b[0mtheano\u001b[0m \u001b[0;32mimport\u001b[0m \u001b[0mtensor\u001b[0m \u001b[0;32mas\u001b[0m \u001b[0mT\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 9\u001b[0m \u001b[0;32mfrom\u001b[0m \u001b[0mtheano\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0msandbox\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mrng_mrg\u001b[0m \u001b[0;32mimport\u001b[0m \u001b[0mMRG_RandomStreams\u001b[0m \u001b[0;32mas\u001b[0m \u001b[0mRandomStreams\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", 188 | "\u001b[0;31mModuleNotFoundError\u001b[0m: No module named 'theano'" 189 | ] 190 | } 191 | ], 192 | "source": [ 193 | "# load libraries\n", 194 | "import numpy as np\n", 195 | "from keras.datasets import imdb\n", 196 | "from keras.preprocessing.text import Tokenizer\n", 197 | "from keras import models\n", 198 | "from keras import layers\n", 199 | "from keras.models import load_model\n", 200 | "\n", 201 | "# set random seed\n", 202 | "np.random.seed(0)\n", 203 | "\n", 204 | "# set the number of features we want\n", 205 | "number_of_features = 1000\n", 206 | "\n", 207 | "# load data and target vector from movie review data\n", 208 | "(train_Data, train_target), (test_data, test_target) = imdb.load_data(num_words=number_of_features)\n", 209 | "\n", 210 | "# convert movie review data to a one-hot encoded feature matrix\n", 211 | "tokenizer = Tokenizer(num_words=number_of_features)\n", 212 | "train_features = tokenizer.sequences_to_matrix(train_data, mode=\"binary\")\n", 213 | "test_features = tokenizer.sequences_to_matrix(test_data, mode=\"binary\")\n", 214 | "\n", 215 | "# start neural network\n", 216 | "network = models.Sequential()\n", 217 | "\n", 218 | "# add fully connected layer with ReLU activation function\n", 219 | "network.add(layers.Dense(units=16, activation=\"relu\", input_shape=(number_of_features,)))\n", 220 | "\n", 221 | "# add fully connected layer with a sigmoid activation function\n", 222 | "network.add(layers.Dense(units=1, activation=\"sigmoid\"))\n", 223 | "\n", 224 | "# compile neural network\n", 225 | "network.compile(loss=\"binary_crossentropy\", optimizer=\"rmsprop\", metrics=[\"accuracy\"])\n", 226 | "\n", 227 | "# train neural network\n", 228 | "history = network.fit(train_features, train_target, epochs=3, verbose=0, batch_size=100, validation_data=(test_features, test_target))\n", 229 | "\n", 230 | "# save neural network\n", 231 | "network.save(\"model.h5\")" 232 | ] 233 | }, 234 | { 235 | "cell_type": "markdown", 236 | "metadata": {}, 237 | "source": [ 238 | "We can then load the model either in another application or for additional training" 239 | ] 240 | }, 241 | { 242 | "cell_type": "code", 243 | "execution_count": 7, 244 | "metadata": {}, 245 | "outputs": [ 246 | { 247 | "ename": "NameError", 248 | "evalue": "name 'load_model' is not defined", 249 | "output_type": "error", 250 | "traceback": [ 251 | "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", 252 | "\u001b[0;31mNameError\u001b[0m Traceback (most recent call last)", 253 | "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[1;32m 1\u001b[0m \u001b[0;31m# load neural 
network\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 2\u001b[0;31m \u001b[0mnetwork\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mload_model\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m\"model.h5\"\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m", 254 | "\u001b[0;31mNameError\u001b[0m: name 'load_model' is not defined" 255 | ] 256 | } 257 | ], 258 | "source": [ 259 | "# load neural network\n", 260 | "network = load_model(\"model.h5\")" 261 | ] 262 | }, 263 | { 264 | "cell_type": "markdown", 265 | "metadata": {}, 266 | "source": [ 267 | "#### Discussion\n", 268 | "Unlike scikit-learn, Keras does not recommend you save models using pickle. Instead, models are saved as an HDF5 file. The HDF5 file contains everything you need to not only load the model to make predicitons (i.e., achitecture and trained parameters), but also to restart training (i.e. loss and optimizer settings and the current state)" 269 | ] 270 | }, 271 | { 272 | "cell_type": "code", 273 | "execution_count": null, 274 | "metadata": {}, 275 | "outputs": [], 276 | "source": [] 277 | } 278 | ], 279 | "metadata": { 280 | "kernelspec": { 281 | "display_name": "Python [conda env:machine_learning_cookbook]", 282 | "language": "python", 283 | "name": "conda-env-machine_learning_cookbook-py" 284 | }, 285 | "language_info": { 286 | "codemirror_mode": { 287 | "name": "ipython", 288 | "version": 3 289 | }, 290 | "file_extension": ".py", 291 | "mimetype": "text/x-python", 292 | "name": "python", 293 | "nbconvert_exporter": "python", 294 | "pygments_lexer": "ipython3", 295 | "version": "3.6.6" 296 | } 297 | }, 298 | "nbformat": 4, 299 | "nbformat_minor": 2 300 | } 301 | -------------------------------------------------------------------------------- /.ipynb_checkpoints/Chapter 4 - Handling Numerical Data-checkpoint.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "## Chapter 4\n", 8 | "---\n", 9 | "# Handling Numerical Data\n", 10 | "\n", 11 | "### 4.0 Introduction\n", 12 | "Quantitative data is the measurment of something--weather class size, monthly sales, or student scores. The natural way to represent these quantities is numerically (e.g., 20 students, $529,392 in sales). In this chapter we will cover numerous strategies for transforming raw numerical data into features purpose-built for machine learning algorithms\n", 13 | "\n", 14 | "### 4.1 Rescaling a feature\n", 15 | "Use scikit-learn's `MinMaxScaler` to rescale a feature array" 16 | ] 17 | }, 18 | { 19 | "cell_type": "code", 20 | "execution_count": 2, 21 | "metadata": {}, 22 | "outputs": [ 23 | { 24 | "data": { 25 | "text/plain": [ 26 | "array([[0. ],\n", 27 | " [0.28571429],\n", 28 | " [0.35714286],\n", 29 | " [0.42857143],\n", 30 | " [1. 
]])" 31 | ] 32 | }, 33 | "execution_count": 2, 34 | "metadata": {}, 35 | "output_type": "execute_result" 36 | } 37 | ], 38 | "source": [ 39 | "import numpy as np\n", 40 | "from sklearn import preprocessing\n", 41 | "\n", 42 | "# create a feature\n", 43 | "feature = np.array([\n", 44 | " [-500.5],\n", 45 | " [-100.1],\n", 46 | " [0],\n", 47 | " [100.1],\n", 48 | " [900.9]\n", 49 | "])\n", 50 | "\n", 51 | "# create scaler\n", 52 | "minmax_scaler = preprocessing.MinMaxScaler(feature_range=(0,1))\n", 53 | "\n", 54 | "# scale feature\n", 55 | "scaled_feature = minmax_scaler.fit_transform(feature)\n", 56 | "\n", 57 | "scaled_feature" 58 | ] 59 | }, 60 | { 61 | "cell_type": "markdown", 62 | "metadata": {}, 63 | "source": [ 64 | "#### Discussion\n", 65 | "Rescaling is a common preprocessing task in machine learning. Many of the algorithms described later in this book will assume all features are on the same scale, typically 0 to 1 or -1 to 1. There are a number of rescaling techniques, but one of the simlest is called *min-max scaling*. Min-max scaling uses the minimum and maximum values of a feature to rescale values to within a range. Specfically, min-max calculates:\n", 66 | "$$\n", 67 | "x_i^` = \\frac{x_i - min(x)}{max(x) - min(x)}\n", 68 | "$$\n", 69 | "\n", 70 | "where x is the feature vector, $x_i$ is an individual element of feature x, and $x_i^`$ is the rescaled element\n", 71 | "\n", 72 | "#### See Also\n", 73 | "* Feature scaling, wikipedia (https://en.wikipedia.org/wiki/Feature_scaling)\n", 74 | "\n", 75 | "### 4.2 Standardizing a Feature\n", 76 | "scikit-learn's `StandardScaler` transforms a feature to have a mean of 0 and a standard deviation of 1." 77 | ] 78 | }, 79 | { 80 | "cell_type": "code", 81 | "execution_count": 3, 82 | "metadata": {}, 83 | "outputs": [ 84 | { 85 | "data": { 86 | "text/plain": [ 87 | "array([[-0.76058269],\n", 88 | " [-0.54177196],\n", 89 | " [-0.35009716],\n", 90 | " [-0.32271504],\n", 91 | " [ 1.97516685]])" 92 | ] 93 | }, 94 | "execution_count": 3, 95 | "metadata": {}, 96 | "output_type": "execute_result" 97 | } 98 | ], 99 | "source": [ 100 | "import numpy as np\n", 101 | "from sklearn import preprocessing\n", 102 | "\n", 103 | "# create a feature\n", 104 | "feature = np.array([\n", 105 | " [-1000.1],\n", 106 | " [-200.2],\n", 107 | " [500.5],\n", 108 | " [600.6],\n", 109 | " [9000.9]\n", 110 | "])\n", 111 | "\n", 112 | "# create scaler\n", 113 | "scaler = preprocessing.StandardScaler()\n", 114 | "\n", 115 | "# transform the feature\n", 116 | "standardized = scaler.fit_transform(feature)\n", 117 | "\n", 118 | "standardized" 119 | ] 120 | }, 121 | { 122 | "cell_type": "markdown", 123 | "metadata": {}, 124 | "source": [ 125 | "#### Discussion\n", 126 | "A common alternative to min-max scaling is rescaling of features to be approximately standard normally distributed. To achieve this, we use standardization to tranform the data such that it has a mean, $\\bar x$, or 0 and a standard deviation $\\sigma$, of 1. Specifically, each element in the feature is transformed so that:\n", 127 | "$$\n", 128 | "x_i^` = \\frac{x_i - \\bar x}{\\sigma}\n", 129 | "$$" 130 | ] 131 | }, 132 | { 133 | "cell_type": "markdown", 134 | "metadata": {}, 135 | "source": [ 136 | "Where $x_I^`$ is our standardized form of $x_i$. 
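To make the z-score formula above concrete, here is a small self-contained check (my own sketch, not from the book) that reproduces StandardScaler's output for the Recipe 4.2 feature by hand; note that StandardScaler divides by the population standard deviation (ddof=0):

# Reproduce the StandardScaler result for the 4.2 example manually
import numpy as np

feature = np.array([[-1000.1], [-200.2], [500.5], [600.6], [9000.9]])
manual_z = (feature - feature.mean()) / feature.std()  # population std, ddof=0
print(manual_z.flatten())
# [-0.76058269 -0.54177196 -0.35009716 -0.32271504  1.97516685], matching the output above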
The transformed feature represents the number of standard deviations the original value is away from the feature's mean value (also called a *z-score* in statistics).\n", 137 | "\n", 138 | "Standardization is a common go-to scaling method for machine learning preprocessing and in my experience is used more than min-max scaling. However, it depends on the learning algorithm. For example, principal component analysis often works better using standardization, while min-max scaling is often recommended for neural networks. As a general rule, I'd recommend defaulting to standardization unless you have a specific reason to use an alternative.\n", 139 | "\n", 140 | "We can see the effect of standardization by looking at the mean and standard deviation of our solution's output:" 141 | ] 142 | }, 143 | { 144 | "cell_type": "code", 145 | "execution_count": 4, 146 | "metadata": {}, 147 | "outputs": [ 148 | { 149 | "name": "stdout", 150 | "output_type": "stream", 151 | "text": [ 152 | "Mean 0.0\n", 153 | "Standard Deviation: 1.0\n" 154 | ] 155 | } 156 | ], 157 | "source": [ 158 | "print(\"Mean {}\".format(round(standardized.mean())))\n", 159 | "print(\"Standard Deviation: {}\".format(standardized.std()))" 160 | ] 161 | }, 162 | { 163 | "cell_type": "markdown", 164 | "metadata": {}, 165 | "source": [ 166 | "If our data has significant outliers, it can negatively impact our standardization by affecting the feature's mean and variance. In this scenario, it is often helpful to instead rescale the feature using the median and interquartile range. In scikit-learn, we do this using the *RobustScaler* method:" 167 | ] 168 | }, 169 | { 170 | "cell_type": "code", 171 | "execution_count": 5, 172 | "metadata": {}, 173 | "outputs": [ 174 | { 175 | "data": { 176 | "text/plain": [ 177 | "array([[-1.87387612],\n", 178 | " [-0.875 ],\n", 179 | " [ 0.
],\n", 180 | " [ 0.125 ],\n", 181 | " [10.61488511]])" 182 | ] 183 | }, 184 | "execution_count": 5, 185 | "metadata": {}, 186 | "output_type": "execute_result" 187 | } 188 | ], 189 | "source": [ 190 | "# create scaler\n", 191 | "robust_scaler = preprocessing.RobustScaler()\n", 192 | "\n", 193 | "# transform feature\n", 194 | "robust_scaler.fit_transform(feature)" 195 | ] 196 | }, 197 | { 198 | "cell_type": "markdown", 199 | "metadata": {}, 200 | "source": [ 201 | "### 4.3 Normalizing Observations\n", 202 | "Use scikit-learn's `Normalizer` to rescale the feature values to have unit norm (a total length of 1)" 203 | ] 204 | }, 205 | { 206 | "cell_type": "code", 207 | "execution_count": 6, 208 | "metadata": {}, 209 | "outputs": [ 210 | { 211 | "data": { 212 | "text/plain": [ 213 | "array([[0.70710678, 0.70710678],\n", 214 | " [0.30782029, 0.95144452],\n", 215 | " [0.07405353, 0.99725427],\n", 216 | " [0.04733062, 0.99887928],\n", 217 | " [0.95709822, 0.28976368]])" 218 | ] 219 | }, 220 | "execution_count": 6, 221 | "metadata": {}, 222 | "output_type": "execute_result" 223 | } 224 | ], 225 | "source": [ 226 | "import numpy as np\n", 227 | "from sklearn.preprocessing import Normalizer\n", 228 | "\n", 229 | "# create feature matrix\n", 230 | "features = np.array([\n", 231 | " [0.5, 0.5],\n", 232 | " [1.1, 3.4],\n", 233 | " [1.5, 20.2],\n", 234 | " [1.63, 34.4],\n", 235 | " [10.9, 3.3]\n", 236 | "])\n", 237 | "\n", 238 | "# create normalizer\n", 239 | "normalizer = Normalizer(norm=\"l2\")\n", 240 | "\n", 241 | "# transofmr feature matrix\n", 242 | "normalizer.transform(features)" 243 | ] 244 | }, 245 | { 246 | "cell_type": "markdown", 247 | "metadata": {}, 248 | "source": [ 249 | "#### Discussion\n", 250 | "Many rescaling methods operate of features; however, we can also rescale across individual observations. `Normalizer` rescales the values on individual observations to have unit norm (the sum of their lengths is 1). This type of rescaling is often used when we have many equivalent features (e.g. text-classification when every word is n-word group is a feature).\n", 251 | "\n", 252 | "`Normalizer` provides three norm options with Euclidean norm (often called L2) being the default:\n", 253 | "$$\n", 254 | "||x||_2 = \\sqrt{x_1^2 + x_2^2 + ... + x_n^2}\n", 255 | "$$\n", 256 | "\n", 257 | "where x is an individual observation and x_n is that observation's value for the nth feature.\n", 258 | "\n", 259 | "Alternatively, we can specify Manhattan norm (L1):\n", 260 | "$$\n", 261 | "||x||_1 = \\sum_{i=1}^n{x_i}\n", 262 | "$$\n", 263 | "\n", 264 | "Intuitively, L2 norm can be thought of as the distance between two poitns in New York for a bird (i.e. 
a straight line), while L1 can be thought of as the distance for a human wlaking on the street (walk north one block, east one block, north one block, east one block, etc), which is why it is called \"Manhattan norm\" or \"Taxicab norm\".\n", 265 | "\n", 266 | "Practically, notice that `norm='l1'` rescales an observation's values so they sum to 1, which can sometimes be a desirable quality" 267 | ] 268 | }, 269 | { 270 | "cell_type": "code", 271 | "execution_count": 8, 272 | "metadata": {}, 273 | "outputs": [ 274 | { 275 | "name": "stdout", 276 | "output_type": "stream", 277 | "text": [ 278 | "Sum of the first observation's values: 1.0\n" 279 | ] 280 | } 281 | ], 282 | "source": [ 283 | "# transform feature matrix\n", 284 | "features_l1_norm = Normalizer(norm=\"l1\").transform(features)\n", 285 | "print(\"Sum of the first observation's values: {}\".format(features_l1_norm[0,0] + features_l1_norm[0,1]))" 286 | ] 287 | }, 288 | { 289 | "cell_type": "markdown", 290 | "metadata": {}, 291 | "source": [ 292 | "### 4.9 Grouping Observations Using Clustering" 293 | ] 294 | }, 295 | { 296 | "cell_type": "code", 297 | "execution_count": 9, 298 | "metadata": {}, 299 | "outputs": [ 300 | { 301 | "data": { 302 | "text/html": [ 303 | "
\n", 304 | "\n", 317 | "\n", 318 | " \n", 319 | " \n", 320 | " \n", 321 | " \n", 322 | " \n", 323 | " \n", 324 | " \n", 325 | " \n", 326 | " \n", 327 | " \n", 328 | " \n", 329 | " \n", 330 | " \n", 331 | " \n", 332 | " \n", 333 | " \n", 334 | " \n", 335 | " \n", 336 | " \n", 337 | " \n", 338 | " \n", 339 | " \n", 340 | " \n", 341 | " \n", 342 | " \n", 343 | " \n", 344 | " \n", 345 | " \n", 346 | " \n", 347 | " \n", 348 | " \n", 349 | " \n", 350 | " \n", 351 | " \n", 352 | " \n", 353 | " \n", 354 | " \n", 355 | " \n", 356 | " \n", 357 | " \n", 358 | "
feature_1feature_2group
0-9.877554-3.3361450
1-7.287210-8.3539862
2-6.943061-7.0237442
3-7.440167-8.7919592
4-6.641388-8.0758882
\n", 359 | "
" 360 | ], 361 | "text/plain": [ 362 | " feature_1 feature_2 group\n", 363 | "0 -9.877554 -3.336145 0\n", 364 | "1 -7.287210 -8.353986 2\n", 365 | "2 -6.943061 -7.023744 2\n", 366 | "3 -7.440167 -8.791959 2\n", 367 | "4 -6.641388 -8.075888 2" 368 | ] 369 | }, 370 | "execution_count": 9, 371 | "metadata": {}, 372 | "output_type": "execute_result" 373 | } 374 | ], 375 | "source": [ 376 | "import pandas as pd\n", 377 | "from sklearn.datasets import make_blobs\n", 378 | "from sklearn.cluster import KMeans\n", 379 | "\n", 380 | "features, _ = make_blobs(n_samples = 50,\n", 381 | " n_features = 2,\n", 382 | " centers = 3,\n", 383 | " random_state = 1)\n", 384 | "\n", 385 | "df = pd.DataFrame(features, columns=[\"feature_1\", \"feature_2\"])\n", 386 | "\n", 387 | "# make k-means clusterer\n", 388 | "clusterer = KMeans(3, random_state=0)\n", 389 | "\n", 390 | "# fit clusterer\n", 391 | "clusterer.fit(features)\n", 392 | "\n", 393 | "# predict values\n", 394 | "df['group'] = clusterer.predict(features)\n", 395 | "\n", 396 | "df.head()" 397 | ] 398 | }, 399 | { 400 | "cell_type": "markdown", 401 | "metadata": {}, 402 | "source": [ 403 | "# 4.10 Deleteing Observations with Missing Values" 404 | ] 405 | }, 406 | { 407 | "cell_type": "code", 408 | "execution_count": 10, 409 | "metadata": {}, 410 | "outputs": [ 411 | { 412 | "data": { 413 | "text/plain": [ 414 | "array([[ 1.1, 11.1],\n", 415 | " [ 2.2, 22.2],\n", 416 | " [ 3.3, 33.3]])" 417 | ] 418 | }, 419 | "execution_count": 10, 420 | "metadata": {}, 421 | "output_type": "execute_result" 422 | } 423 | ], 424 | "source": [ 425 | "import numpy as np\n", 426 | "\n", 427 | "features = np.array([\n", 428 | " [1.1, 11.1],\n", 429 | " [2.2, 22.2],\n", 430 | " [3.3, 33.3],\n", 431 | " [np.nan, 55]\n", 432 | "])\n", 433 | "\n", 434 | "# keep only observations that are not (denoted by ~) missing\n", 435 | "features[~np.isnan(features).any(axis=1)]" 436 | ] 437 | }, 438 | { 439 | "cell_type": "code", 440 | "execution_count": 11, 441 | "metadata": {}, 442 | "outputs": [ 443 | { 444 | "data": { 445 | "text/html": [ 446 | "
\n", 447 | "\n", 460 | "\n", 461 | " \n", 462 | " \n", 463 | " \n", 464 | " \n", 465 | " \n", 466 | " \n", 467 | " \n", 468 | " \n", 469 | " \n", 470 | " \n", 471 | " \n", 472 | " \n", 473 | " \n", 474 | " \n", 475 | " \n", 476 | " \n", 477 | " \n", 478 | " \n", 479 | " \n", 480 | " \n", 481 | " \n", 482 | " \n", 483 | " \n", 484 | " \n", 485 | "
feature_1feature_2
01.111.1
12.222.2
23.333.3
\n", 486 | "
" 487 | ], 488 | "text/plain": [ 489 | " feature_1 feature_2\n", 490 | "0 1.1 11.1\n", 491 | "1 2.2 22.2\n", 492 | "2 3.3 33.3" 493 | ] 494 | }, 495 | "execution_count": 11, 496 | "metadata": {}, 497 | "output_type": "execute_result" 498 | } 499 | ], 500 | "source": [ 501 | "import pandas as pd\n", 502 | "df = pd.DataFrame(features, columns=[\"feature_1\", \"feature_2\"])\n", 503 | "df.dropna()" 504 | ] 505 | }, 506 | { 507 | "cell_type": "markdown", 508 | "metadata": {}, 509 | "source": [ 510 | "#### Discussion\n", 511 | "Most machine learnign algorithms cannot handling any missing values in the target and feature arrays. The simplest solution is the delete every observation that contains one or more missing values\n", 512 | "\n", 513 | "There are three types of missing data:\n", 514 | "\n", 515 | "*Missing Completely At Random (MCAR)*\n", 516 | "* The probability that a value is missing is independent of everything.\n", 517 | "\n", 518 | "*Missing At Random (MAR)*\n", 519 | "* The probability that a value is missing is not completely random, but depends on information capture in other feature\n", 520 | "\n", 521 | "*Missing Not At Random (MNAR)*\n", 522 | "* The probability that a value is missing is not random and depends on information not captured in our features\n", 523 | "\n", 524 | "#### See Also\n", 525 | "* Identifying the Three Types of Missing Data (https://measuringu.com/missing-data/)\n", 526 | "* Missing-Data Imputation (http://www.stat.columbia.edu/~gelman/arm/missing.pdf)\n", 527 | "\n", 528 | "### 4.11 Imputing Missing Values" 529 | ] 530 | }, 531 | { 532 | "cell_type": "code", 533 | "execution_count": 14, 534 | "metadata": {}, 535 | "outputs": [ 536 | { 537 | "name": "stdout", 538 | "output_type": "stream", 539 | "text": [ 540 | "True Value: 0.8730186113995938\n", 541 | "Imputed Value: -3.058372724614996\n" 542 | ] 543 | } 544 | ], 545 | "source": [ 546 | "import numpy as np\n", 547 | "from sklearn.preprocessing import StandardScaler\n", 548 | "from sklearn.datasets import make_blobs\n", 549 | "from sklearn.preprocessing import Imputer\n", 550 | "\n", 551 | "# make fake data\n", 552 | "features, _ = make_blobs(n_samples = 1000,\n", 553 | " n_features = 2,\n", 554 | " random_state = 1)\n", 555 | "\n", 556 | "# standardize the features\n", 557 | "scaler = StandardScaler()\n", 558 | "standardized_features = scaler.fit_transform(features)\n", 559 | "\n", 560 | "# replace the first feature's first value with a missing value\n", 561 | "true_value = standardized_features[0, 0]\n", 562 | "standardized_features[0,0] = np.nan\n", 563 | "\n", 564 | "# create imputer\n", 565 | "mean_imputer = Imputer(strategy=\"mean\", axis=0)\n", 566 | "\n", 567 | "# impute values\n", 568 | "feautres_mean_imputed = mean_imputer.fit_transform(features)\n", 569 | "\n", 570 | "# compare true and imputed values\n", 571 | "print(\"True Value: {}\".format(true_value))\n", 572 | "print(\"Imputed Value: {}\".format(feautres_mean_imputed[0,0]))" 573 | ] 574 | }, 575 | { 576 | "cell_type": "markdown", 577 | "metadata": {}, 578 | "source": [ 579 | "#### See Also\n", 580 | "* A Study of K-Nearest Neighbor as an Imputation Method (http://conteudo.icmc.usp.br/pessoas/gbatista/files/his2002.pdf)" 581 | ] 582 | }, 583 | { 584 | "cell_type": "code", 585 | "execution_count": null, 586 | "metadata": {}, 587 | "outputs": [], 588 | "source": [] 589 | } 590 | ], 591 | "metadata": { 592 | "kernelspec": { 593 | "display_name": "Python [conda env:machine_learning_cookbook]", 594 | "language": "python", 595 | "name": 
"conda-env-machine_learning_cookbook-py" 596 | }, 597 | "language_info": { 598 | "codemirror_mode": { 599 | "name": "ipython", 600 | "version": 3 601 | }, 602 | "file_extension": ".py", 603 | "mimetype": "text/x-python", 604 | "name": "python", 605 | "nbconvert_exporter": "python", 606 | "pygments_lexer": "ipython3", 607 | "version": "3.6.6" 608 | } 609 | }, 610 | "nbformat": 4, 611 | "nbformat_minor": 2 612 | } 613 | -------------------------------------------------------------------------------- /.ipynb_checkpoints/Chapter 7 - Handling Dates and Times-checkpoint.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "## 7.1 Converting Strings to Dates" 8 | ] 9 | }, 10 | { 11 | "cell_type": "code", 12 | "execution_count": 2, 13 | "metadata": {}, 14 | "outputs": [ 15 | { 16 | "data": { 17 | "text/plain": [ 18 | "[Timestamp('2005-04-03 23:35:00'),\n", 19 | " Timestamp('2010-05-23 00:01:00'),\n", 20 | " Timestamp('2009-09-04 21:09:00')]" 21 | ] 22 | }, 23 | "execution_count": 2, 24 | "metadata": {}, 25 | "output_type": "execute_result" 26 | } 27 | ], 28 | "source": [ 29 | "import numpy as np\n", 30 | "import pandas as pd\n", 31 | "\n", 32 | "date_strings = np.array([\n", 33 | " '03-04-2005 11:35 PM',\n", 34 | " '23-05-2010 12:01 AM',\n", 35 | " '04-09-2009 09:09 PM'\n", 36 | "])\n", 37 | "\n", 38 | "# convert to datetimes\n", 39 | "[pd.to_datetime(date, format='%d-%m-%Y %I:%M %p') for date in date_strings]" 40 | ] 41 | }, 42 | { 43 | "cell_type": "code", 44 | "execution_count": 3, 45 | "metadata": {}, 46 | "outputs": [ 47 | { 48 | "data": { 49 | "text/plain": [ 50 | "[Timestamp('2005-04-03 23:35:00'),\n", 51 | " Timestamp('2010-05-23 00:01:00'),\n", 52 | " Timestamp('2009-09-04 21:09:00')]" 53 | ] 54 | }, 55 | "execution_count": 3, 56 | "metadata": {}, 57 | "output_type": "execute_result" 58 | } 59 | ], 60 | "source": [ 61 | "[pd.to_datetime(date, format='%d-%m-%Y %I:%M %p', errors='coerce') for date in date_strings]" 62 | ] 63 | }, 64 | { 65 | "cell_type": "markdown", 66 | "metadata": {}, 67 | "source": [ 68 | "### See Also\n", 69 | "* http://strftime.org/\n", 70 | "\n", 71 | "## 7.2 Handling Time Zones" 72 | ] 73 | }, 74 | { 75 | "cell_type": "code", 76 | "execution_count": 4, 77 | "metadata": {}, 78 | "outputs": [ 79 | { 80 | "data": { 81 | "text/plain": [ 82 | "Timestamp('2017-05-01 06:00:00+0100', tz='Europe/London')" 83 | ] 84 | }, 85 | "execution_count": 4, 86 | "metadata": {}, 87 | "output_type": "execute_result" 88 | } 89 | ], 90 | "source": [ 91 | "import pandas as pd\n", 92 | "\n", 93 | "pd.Timestamp('2017-05-01 06:00:00', tz='Europe/London')" 94 | ] 95 | }, 96 | { 97 | "cell_type": "code", 98 | "execution_count": 5, 99 | "metadata": {}, 100 | "outputs": [ 101 | { 102 | "data": { 103 | "text/plain": [ 104 | "Timestamp('2017-05-01 06:00:00+0100', tz='Europe/London')" 105 | ] 106 | }, 107 | "execution_count": 5, 108 | "metadata": {}, 109 | "output_type": "execute_result" 110 | } 111 | ], 112 | "source": [ 113 | "date = pd.Timestamp('2017-05-01 06:00:00')\n", 114 | "\n", 115 | "date_in_london = date.tz_localize('Europe/London')\n", 116 | "\n", 117 | "date_in_london" 118 | ] 119 | }, 120 | { 121 | "cell_type": "code", 122 | "execution_count": 8, 123 | "metadata": {}, 124 | "outputs": [ 125 | { 126 | "data": { 127 | "text/plain": [ 128 | "Timestamp('2017-05-01 05:00:00+0000', tz='Africa/Abidjan')" 129 | ] 130 | }, 131 | "execution_count": 8, 132 | "metadata": 
{}, 133 | "output_type": "execute_result" 134 | } 135 | ], 136 | "source": [ 137 | "date_in_london.tz_convert('Africa/Abidjan')" 138 | ] 139 | }, 140 | { 141 | "cell_type": "code", 142 | "execution_count": 9, 143 | "metadata": {}, 144 | "outputs": [ 145 | { 146 | "data": { 147 | "text/plain": [ 148 | "0 2002-02-28 00:00:00+00:00\n", 149 | "1 2002-03-31 00:00:00+00:00\n", 150 | "2 2002-04-30 00:00:00+00:00\n", 151 | "dtype: datetime64[ns, Africa/Abidjan]" 152 | ] 153 | }, 154 | "execution_count": 9, 155 | "metadata": {}, 156 | "output_type": "execute_result" 157 | } 158 | ], 159 | "source": [ 160 | "dates = pd.Series(pd.date_range('2/2/2002', periods=3, freq='M'))\n", 161 | "\n", 162 | "dates.dt.tz_localize('Africa/Abidjan')" 163 | ] 164 | }, 165 | { 166 | "cell_type": "markdown", 167 | "metadata": {}, 168 | "source": [ 169 | "## 7.3 Selecting Dates and Times\n", 170 | "## 7.4 Breaking Up Date Data into Multiple Features\n", 171 | "## 7.5 Calculating the Difference Between Dates\n", 172 | "## 7.6 Encoding Days of the Week\n", 173 | "## 7.7 Creating Lagged Feature\n", 174 | "## 7.8 Using Rolling Time Windows" 175 | ] 176 | }, 177 | { 178 | "cell_type": "code", 179 | "execution_count": 10, 180 | "metadata": {}, 181 | "outputs": [ 182 | { 183 | "data": { 184 | "text/html": [ 185 | "
\n", 186 | "\n", 199 | "\n", 200 | " \n", 201 | " \n", 202 | " \n", 203 | " \n", 204 | " \n", 205 | " \n", 206 | " \n", 207 | " \n", 208 | " \n", 209 | " \n", 210 | " \n", 211 | " \n", 212 | " \n", 213 | " \n", 214 | " \n", 215 | " \n", 216 | " \n", 217 | " \n", 218 | " \n", 219 | " \n", 220 | " \n", 221 | " \n", 222 | " \n", 223 | " \n", 224 | " \n", 225 | " \n", 226 | " \n", 227 | " \n", 228 | "
Stock_Price
2010-01-31NaN
2010-02-281.5
2010-03-312.5
2010-04-303.5
2010-05-314.5
\n", 229 | "
" 230 | ], 231 | "text/plain": [ 232 | " Stock_Price\n", 233 | "2010-01-31 NaN\n", 234 | "2010-02-28 1.5\n", 235 | "2010-03-31 2.5\n", 236 | "2010-04-30 3.5\n", 237 | "2010-05-31 4.5" 238 | ] 239 | }, 240 | "execution_count": 10, 241 | "metadata": {}, 242 | "output_type": "execute_result" 243 | } 244 | ], 245 | "source": [ 246 | "import pandas as pd\n", 247 | "\n", 248 | "time_index = pd.date_range('01/01/2010', periods=5, freq='M')\n", 249 | "df = pd.DataFrame(index=time_index)\n", 250 | "df['Stock_Price'] = [1,2,3,4,5]\n", 251 | "df.rolling(window=2).mean()" 252 | ] 253 | }, 254 | { 255 | "cell_type": "markdown", 256 | "metadata": {}, 257 | "source": [ 258 | "### See Also\n", 259 | "* pandas documentation: Rolling Windows (https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.rolling.html)\n", 260 | "* What are Moving Average or Smoothing Techniques (https://www.itl.nist.gov/div898/handbook/pmc/section4/pmc42.htm)\n", 261 | "\n", 262 | "## 7.9 Handling Missing Data in Time Series" 263 | ] 264 | }, 265 | { 266 | "cell_type": "code", 267 | "execution_count": 11, 268 | "metadata": {}, 269 | "outputs": [ 270 | { 271 | "data": { 272 | "text/html": [ 273 | "
\n", 274 | "\n", 287 | "\n", 288 | " \n", 289 | " \n", 290 | " \n", 291 | " \n", 292 | " \n", 293 | " \n", 294 | " \n", 295 | " \n", 296 | " \n", 297 | " \n", 298 | " \n", 299 | " \n", 300 | " \n", 301 | " \n", 302 | " \n", 303 | " \n", 304 | " \n", 305 | " \n", 306 | " \n", 307 | " \n", 308 | " \n", 309 | " \n", 310 | " \n", 311 | " \n", 312 | " \n", 313 | " \n", 314 | " \n", 315 | " \n", 316 | "
Sales
2010-01-311.0
2010-02-282.0
2010-03-313.0
2010-04-304.0
2010-05-315.0
\n", 317 | "
" 318 | ], 319 | "text/plain": [ 320 | " Sales\n", 321 | "2010-01-31 1.0\n", 322 | "2010-02-28 2.0\n", 323 | "2010-03-31 3.0\n", 324 | "2010-04-30 4.0\n", 325 | "2010-05-31 5.0" 326 | ] 327 | }, 328 | "execution_count": 11, 329 | "metadata": {}, 330 | "output_type": "execute_result" 331 | } 332 | ], 333 | "source": [ 334 | "import pandas as pd\n", 335 | "import numpy as np\n", 336 | "\n", 337 | "time_index = pd.date_range('01/01/2010', periods=5, freq='M')\n", 338 | "\n", 339 | "df = pd.DataFrame(index=time_index)\n", 340 | "\n", 341 | "df[\"Sales\"] = [1.0, 2.0, np.nan, np.nan, 5.0]\n", 342 | "\n", 343 | "df.interpolate()" 344 | ] 345 | }, 346 | { 347 | "cell_type": "code", 348 | "execution_count": 12, 349 | "metadata": {}, 350 | "outputs": [ 351 | { 352 | "data": { 353 | "text/html": [ 354 | "
\n", 355 | "\n", 368 | "\n", 369 | " \n", 370 | " \n", 371 | " \n", 372 | " \n", 373 | " \n", 374 | " \n", 375 | " \n", 376 | " \n", 377 | " \n", 378 | " \n", 379 | " \n", 380 | " \n", 381 | " \n", 382 | " \n", 383 | " \n", 384 | " \n", 385 | " \n", 386 | " \n", 387 | " \n", 388 | " \n", 389 | " \n", 390 | " \n", 391 | " \n", 392 | " \n", 393 | " \n", 394 | " \n", 395 | " \n", 396 | " \n", 397 | "
Sales
2010-01-311.0
2010-02-282.0
2010-03-312.0
2010-04-302.0
2010-05-315.0
\n", 398 | "
" 399 | ], 400 | "text/plain": [ 401 | " Sales\n", 402 | "2010-01-31 1.0\n", 403 | "2010-02-28 2.0\n", 404 | "2010-03-31 2.0\n", 405 | "2010-04-30 2.0\n", 406 | "2010-05-31 5.0" 407 | ] 408 | }, 409 | "execution_count": 12, 410 | "metadata": {}, 411 | "output_type": "execute_result" 412 | } 413 | ], 414 | "source": [ 415 | "df.ffill()" 416 | ] 417 | }, 418 | { 419 | "cell_type": "code", 420 | "execution_count": 13, 421 | "metadata": {}, 422 | "outputs": [ 423 | { 424 | "data": { 425 | "text/html": [ 426 | "
\n", 427 | "\n", 440 | "\n", 441 | " \n", 442 | " \n", 443 | " \n", 444 | " \n", 445 | " \n", 446 | " \n", 447 | " \n", 448 | " \n", 449 | " \n", 450 | " \n", 451 | " \n", 452 | " \n", 453 | " \n", 454 | " \n", 455 | " \n", 456 | " \n", 457 | " \n", 458 | " \n", 459 | " \n", 460 | " \n", 461 | " \n", 462 | " \n", 463 | " \n", 464 | " \n", 465 | " \n", 466 | " \n", 467 | " \n", 468 | " \n", 469 | "
Sales
2010-01-311.0
2010-02-282.0
2010-03-315.0
2010-04-305.0
2010-05-315.0
\n", 470 | "
" 471 | ], 472 | "text/plain": [ 473 | " Sales\n", 474 | "2010-01-31 1.0\n", 475 | "2010-02-28 2.0\n", 476 | "2010-03-31 5.0\n", 477 | "2010-04-30 5.0\n", 478 | "2010-05-31 5.0" 479 | ] 480 | }, 481 | "execution_count": 13, 482 | "metadata": {}, 483 | "output_type": "execute_result" 484 | } 485 | ], 486 | "source": [ 487 | "df.bfill()" 488 | ] 489 | }, 490 | { 491 | "cell_type": "code", 492 | "execution_count": 14, 493 | "metadata": {}, 494 | "outputs": [ 495 | { 496 | "data": { 497 | "text/html": [ 498 | "
\n", 499 | "\n", 512 | "\n", 513 | " \n", 514 | " \n", 515 | " \n", 516 | " \n", 517 | " \n", 518 | " \n", 519 | " \n", 520 | " \n", 521 | " \n", 522 | " \n", 523 | " \n", 524 | " \n", 525 | " \n", 526 | " \n", 527 | " \n", 528 | " \n", 529 | " \n", 530 | " \n", 531 | " \n", 532 | " \n", 533 | " \n", 534 | " \n", 535 | " \n", 536 | " \n", 537 | " \n", 538 | " \n", 539 | " \n", 540 | " \n", 541 | "
Sales
2010-01-311.000000
2010-02-282.000000
2010-03-313.059808
2010-04-304.038069
2010-05-315.000000
\n", 542 | "
" 543 | ], 544 | "text/plain": [ 545 | " Sales\n", 546 | "2010-01-31 1.000000\n", 547 | "2010-02-28 2.000000\n", 548 | "2010-03-31 3.059808\n", 549 | "2010-04-30 4.038069\n", 550 | "2010-05-31 5.000000" 551 | ] 552 | }, 553 | "execution_count": 14, 554 | "metadata": {}, 555 | "output_type": "execute_result" 556 | } 557 | ], 558 | "source": [ 559 | "df.interpolate(method=\"quadratic\")" 560 | ] 561 | }, 562 | { 563 | "cell_type": "code", 564 | "execution_count": 15, 565 | "metadata": {}, 566 | "outputs": [ 567 | { 568 | "data": { 569 | "text/html": [ 570 | "
\n", 571 | "\n", 584 | "\n", 585 | " \n", 586 | " \n", 587 | " \n", 588 | " \n", 589 | " \n", 590 | " \n", 591 | " \n", 592 | " \n", 593 | " \n", 594 | " \n", 595 | " \n", 596 | " \n", 597 | " \n", 598 | " \n", 599 | " \n", 600 | " \n", 601 | " \n", 602 | " \n", 603 | " \n", 604 | " \n", 605 | " \n", 606 | " \n", 607 | " \n", 608 | " \n", 609 | " \n", 610 | " \n", 611 | " \n", 612 | " \n", 613 | "
Sales
2010-01-311.0
2010-02-282.0
2010-03-313.0
2010-04-30NaN
2010-05-315.0
\n", 614 | "
" 615 | ], 616 | "text/plain": [ 617 | " Sales\n", 618 | "2010-01-31 1.0\n", 619 | "2010-02-28 2.0\n", 620 | "2010-03-31 3.0\n", 621 | "2010-04-30 NaN\n", 622 | "2010-05-31 5.0" 623 | ] 624 | }, 625 | "execution_count": 15, 626 | "metadata": {}, 627 | "output_type": "execute_result" 628 | } 629 | ], 630 | "source": [ 631 | "df.interpolate(limit=1, limit_direction=\"forward\")" 632 | ] 633 | }, 634 | { 635 | "cell_type": "code", 636 | "execution_count": null, 637 | "metadata": {}, 638 | "outputs": [], 639 | "source": [] 640 | } 641 | ], 642 | "metadata": { 643 | "kernelspec": { 644 | "display_name": "Python [conda env:machine_learning_cookbook]", 645 | "language": "python", 646 | "name": "conda-env-machine_learning_cookbook-py" 647 | }, 648 | "language_info": { 649 | "codemirror_mode": { 650 | "name": "ipython", 651 | "version": 3 652 | }, 653 | "file_extension": ".py", 654 | "mimetype": "text/x-python", 655 | "name": "python", 656 | "nbconvert_exporter": "python", 657 | "pygments_lexer": "ipython3", 658 | "version": "3.6.6" 659 | } 660 | }, 661 | "nbformat": 4, 662 | "nbformat_minor": 2 663 | } 664 | -------------------------------------------------------------------------------- /Chapter 13 - Linear Regression.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "## Chapter 13\n", 8 | "---\n", 9 | "# Linear Regression\n", 10 | "\n", 11 | "### 13.0 Introduction\n", 12 | "Linear regression is one of the simplest supervised learning algorithms in our toolkit. If you have ever taken an introductory statistics course in college, likely the final topic you covered was linear regression. In fact, it is so simple that it is sometimes not considered machine learning at all!\n", 13 | "\n", 14 | "Whatever you believe, the fact is that linear regression--and its extensions--continues to be a common and useful method of making predictions when the target vector is a quantitative value (e.g. 
home price, age)\n", 15 | "\n", 16 | "### 13.1 Fitting a Line\n", 17 | "#### Problem\n", 18 | "You want to train a model that represents a linear relationship between the feature and target vector.\n", 19 | "\n", 20 | "#### Solution\n", 21 | "Use a linear regression (`LinearRegression` in scikit-learn)" 22 | ] 23 | }, 24 | { 25 | "cell_type": "code", 26 | "execution_count": 1, 27 | "metadata": { 28 | "collapsed": true 29 | }, 30 | "outputs": [], 31 | "source": [ 32 | "from sklearn.linear_model import LinearRegression\n", 33 | "from sklearn.datasets import load_boston\n", 34 | "\n", 35 | "boston = load_boston()\n", 36 | "features = boston.data[:, 0:2]\n", 37 | "target = boston.target\n", 38 | "\n", 39 | "regression = LinearRegression()\n", 40 | "\n", 41 | "model = regression.fit(features, target)" 42 | ] 43 | }, 44 | { 45 | "cell_type": "markdown", 46 | "metadata": {}, 47 | "source": [ 48 | "### 13.4 Reducing Variance with Regularization\n", 49 | "#### Problem\n", 50 | "You want to reduce the variance of your linear regression model\n", 51 | "\n", 52 | "#### Solution\n", 53 | "Use a learning algorithm that includes a *shrinkage penalty* (also called **regularization**) like ridge regression and lasso regression:" 54 | ] 55 | }, 56 | { 57 | "cell_type": "code", 58 | "execution_count": 2, 59 | "metadata": { 60 | "collapsed": false 61 | }, 62 | "outputs": [], 63 | "source": [ 64 | "from sklearn.linear_model import Ridge\n", 65 | "from sklearn.datasets import load_boston\n", 66 | "from sklearn.preprocessing import StandardScaler\n", 67 | "\n", 68 | "boston = load_boston()\n", 69 | "features = boston.data\n", 70 | "target = boston.target\n", 71 | "\n", 72 | "scaler = StandardScaler()\n", 73 | "features_standardized = scaler.fit_transform(features)\n", 74 | "\n", 75 | "regression = Ridge(alpha=0.5)\n", 76 | "model = regression.fit(features_standardized, target)" 77 | ] 78 | }, 79 | { 80 | "cell_type": "markdown", 81 | "metadata": {}, 82 | "source": [ 83 | "#### Discussion\n", 84 | "In standard linear regression the model trains to minimize the sum of squared error between the true($y_i$) and prediction ($\\hat y_i$) target values, or residual sum of squares (RSS):\n", 85 | "$$\n", 86 | "RSS = \\sum_{i=1}^n{(y_i - \\hat y_i)^2}\n", 87 | "$$\n", 88 | "\n", 89 | "Regularized regression learners are similar, except they attempt to minimize RSS and some penalty for the total size of the coefficient values, called a shrinkage penalty because it attempts to \"shrink\" the model. There are two common types of regularized learners for linear regression: ridge regression and the lasso. The only formal difference is the type of shrinkage penalty used. In ridge regression, the shrinkage penalty is a tuning hyperparameter multiplied by the squared sum of all coefficients:\n", 90 | "$$\n", 91 | "RSS+\\alpha \\sum_{j=1}^p{\\hat \\beta_j^2}\n", 92 | "$$\n", 93 | "\n", 94 | "where $\\hat \\beta_j$ is the coefficient of the jth of p features and $\\alpha$ is a hyperparameter (discussed next). The lasso is similar, except the shrinkage penalty is a tuning hyperparmeter multiplied by the squared sum of all coefficients:\n", 95 | "$$\n", 96 | "\\frac{1}{2n} RSS + \\alpha \\sum_{j=1}^p{|\\beta_j|}\n", 97 | "$$\n", 98 | "\n", 99 | "where n is the number of observations. So which one should we use? A a very general rule of thumb, ridge regression often produces slightly better predictions than lasso, but lasso (for reasons we will discuss in Recipe 13.5) produces more interpretable models. 
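To make the ridge-versus-lasso trade-off concrete before moving on, here is a small comparison sketch of my own (assuming `features_standardized` and `target` from the solution above are still in scope): at the same penalty strength, ridge keeps every coefficient non-zero while the lasso drives some of them exactly to zero.

# Compare how the two penalties treat the coefficients at the same alpha
from sklearn.linear_model import Ridge, Lasso

ridge_coef = Ridge(alpha=0.5).fit(features_standardized, target).coef_
lasso_coef = Lasso(alpha=0.5).fit(features_standardized, target).coef_

print("Ridge non-zero coefficients:", (ridge_coef != 0).sum())  # all 13 survive
print("Lasso non-zero coefficients:", (lasso_coef != 0).sum())  # several are exactly 0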
If we want a balance between, ridge and lasso's penalty functions we can use elastic net, which is simply a regression model with both penalties included. Regardless of which one we use, bot hridge and lasso regresions can penalize large or complex models by including coefficient values in the loss funciton we are trying to minimize\n", 100 | "\n", 101 | "The hyper parameter $\\alpha$ lets us control how much we penalize the coefficients, with higher values of $\\alpha$ creating simpler models. The ideal value of $\\alpha$ should be tuned like any other hyperparameter. In scikit-learn, $\\alpha$ is set using the alpha parameter.\n", 102 | "\n", 103 | "scikit-learn includes a RidgeCV method that allows us to select the ideal value for $\\alpha:" 104 | ] 105 | }, 106 | { 107 | "cell_type": "code", 108 | "execution_count": 3, 109 | "metadata": { 110 | "collapsed": false 111 | }, 112 | "outputs": [ 113 | { 114 | "data": { 115 | "text/plain": [ 116 | "array([-0.91215884, 1.0658758 , 0.11942614, 0.68558782, -2.03231631,\n", 117 | " 2.67922108, 0.01477326, -3.0777265 , 2.58814315, -2.00973173,\n", 118 | " -2.05390717, 0.85614763, -3.73565106])" 119 | ] 120 | }, 121 | "execution_count": 3, 122 | "metadata": {}, 123 | "output_type": "execute_result" 124 | } 125 | ], 126 | "source": [ 127 | "from sklearn.linear_model import RidgeCV\n", 128 | "\n", 129 | "regr_cv = RidgeCV(alphas=[0.1, 1.0, 10.0])\n", 130 | "\n", 131 | "model_cv = regr_cv.fit(features_standardized, target)\n", 132 | "\n", 133 | "model_cv.coef_" 134 | ] 135 | }, 136 | { 137 | "cell_type": "code", 138 | "execution_count": 4, 139 | "metadata": { 140 | "collapsed": false 141 | }, 142 | "outputs": [ 143 | { 144 | "data": { 145 | "text/plain": [ 146 | "1.0" 147 | ] 148 | }, 149 | "execution_count": 4, 150 | "metadata": {}, 151 | "output_type": "execute_result" 152 | } 153 | ], 154 | "source": [ 155 | "# view alpha\n", 156 | "model_cv.alpha_" 157 | ] 158 | }, 159 | { 160 | "cell_type": "markdown", 161 | "metadata": {}, 162 | "source": [ 163 | "One final note: because in linear regression the value of the coefficients is partially determined by the scale of the feature, and in regularized models all coefficients are summed together, we must make sure to standardize the feature prior to training\n", 164 | "\n", 165 | "### 13.5 Reducing Features with Lasso Regression\n", 166 | "#### Problem\n", 167 | "You want to simplify your linear regression model by reducing the number of features.\n", 168 | "\n", 169 | "#### Solution\n", 170 | "Use a lasso regression" 171 | ] 172 | }, 173 | { 174 | "cell_type": "code", 175 | "execution_count": 5, 176 | "metadata": { 177 | "collapsed": true 178 | }, 179 | "outputs": [], 180 | "source": [ 181 | "from sklearn.linear_model import Lasso\n", 182 | "from sklearn.datasets import load_boston\n", 183 | "from sklearn.preprocessing import StandardScaler\n", 184 | "\n", 185 | "boston = load_boston()\n", 186 | "features = boston.data\n", 187 | "target = boston.target\n", 188 | "\n", 189 | "scaler = StandardScaler()\n", 190 | "features_standardized = scaler.fit_transform(features)\n", 191 | "\n", 192 | "regression = Lasso(alpha=0.5)\n", 193 | "model = regression.fit(features_standardized, target)" 194 | ] 195 | }, 196 | { 197 | "cell_type": "markdown", 198 | "metadata": {}, 199 | "source": [ 200 | "#### Discussion\n", 201 | "One interesting characteristic of lasso regression's penalty is that it can shrink the coefficients of a model to zero, effectively reducing the number of features in the model. 
For example, in our solution we set `alpha` to 0.5 and we can see that many of the coefficients are 0, meaning their corresponding features are not used in the model:" 202 | ] 203 | }, 204 | { 205 | "cell_type": "code", 206 | "execution_count": 6, 207 | "metadata": { 208 | "collapsed": false 209 | }, 210 | "outputs": [ 211 | { 212 | "data": { 213 | "text/plain": [ 214 | "array([-0.10697735, 0. , -0. , 0.39739898, -0. ,\n", 215 | " 2.97332316, -0. , -0.16937793, -0. , -0. ,\n", 216 | " -1.59957374, 0.54571511, -3.66888402])" 217 | ] 218 | }, 219 | "execution_count": 6, 220 | "metadata": {}, 221 | "output_type": "execute_result" 222 | } 223 | ], 224 | "source": [ 225 | "model.coef_" 226 | ] 227 | }, 228 | { 229 | "cell_type": "markdown", 230 | "metadata": {}, 231 | "source": [ 232 | "However if we increase $\\alpha$ to a much higher value, we see that lierally none of the features are being used:" 233 | ] 234 | }, 235 | { 236 | "cell_type": "code", 237 | "execution_count": 7, 238 | "metadata": { 239 | "collapsed": false 240 | }, 241 | "outputs": [ 242 | { 243 | "data": { 244 | "text/plain": [ 245 | "array([-0., 0., -0., 0., -0., 0., -0., 0., -0., -0., -0., 0., -0.])" 246 | ] 247 | }, 248 | "execution_count": 7, 249 | "metadata": {}, 250 | "output_type": "execute_result" 251 | } 252 | ], 253 | "source": [ 254 | "regression_a10 = Lasso(alpha=10)\n", 255 | "model_a10 = regression_a10.fit(features_standardized, target)\n", 256 | "model_a10.coef_" 257 | ] 258 | }, 259 | { 260 | "cell_type": "markdown", 261 | "metadata": {}, 262 | "source": [ 263 | "The practical benefit of this effect is that it means that we could include 100 features in our feature matrix and then, through adjusting lasso's $\\alpha$ hyperparameter, produce a model that uses only 10 (for instance) of the most important features. 
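In practice we would rarely hand-pick alpha; a hedged sketch (my own, again assuming `features_standardized` and `target` from the recipe above) that lets cross-validation choose it for the lasso, analogous to the RidgeCV example earlier:

# Let cross-validation pick the lasso's regularization strength
from sklearn.linear_model import LassoCV

lasso_cv = LassoCV(alphas=[0.01, 0.1, 0.5, 1.0, 10.0], cv=5)
lasso_cv_model = lasso_cv.fit(features_standardized, target)

print(lasso_cv_model.alpha_)              # the selected alpha
print((lasso_cv_model.coef_ != 0).sum())  # how many features survive the penalty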
This lets us reduce variance whiel improving interpretability of our model (since fewer features is easier to explain)" 264 | ] 265 | }, 266 | { 267 | "cell_type": "code", 268 | "execution_count": null, 269 | "metadata": { 270 | "collapsed": true 271 | }, 272 | "outputs": [], 273 | "source": [] 274 | } 275 | ], 276 | "metadata": { 277 | "kernelspec": { 278 | "display_name": "Python [conda env:machine_learning_cookbook]", 279 | "language": "python", 280 | "name": "conda-env-machine_learning_cookbook-py" 281 | }, 282 | "language_info": { 283 | "codemirror_mode": { 284 | "name": "ipython", 285 | "version": 3 286 | }, 287 | "file_extension": ".py", 288 | "mimetype": "text/x-python", 289 | "name": "python", 290 | "nbconvert_exporter": "python", 291 | "pygments_lexer": "ipython3", 292 | "version": "3.6.6" 293 | } 294 | }, 295 | "nbformat": 4, 296 | "nbformat_minor": 2 297 | } 298 | -------------------------------------------------------------------------------- /Chapter 15 - K-Nearest Neighbors.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "## Chapter 15\n", 8 | "---\n", 9 | "# K-Nearest Neighbors\n", 10 | "\n", 11 | "An observation is predicted to be the class of that of the largest proportion of the k-nearest observations.\n", 12 | "\n", 13 | "## 15.1 Finding an Observation's Nearest Neighbors" 14 | ] 15 | }, 16 | { 17 | "cell_type": "code", 18 | "execution_count": 2, 19 | "metadata": { 20 | "collapsed": false 21 | }, 22 | "outputs": [ 23 | { 24 | "data": { 25 | "text/plain": [ 26 | "array([[[1.03800476, 0.56925129, 1.10395287, 1.1850097 ],\n", 27 | " [0.79566902, 0.33784833, 0.76275864, 1.05353673]]])" 28 | ] 29 | }, 30 | "execution_count": 2, 31 | "metadata": {}, 32 | "output_type": "execute_result" 33 | } 34 | ], 35 | "source": [ 36 | "from sklearn import datasets\n", 37 | "from sklearn.neighbors import NearestNeighbors\n", 38 | "from sklearn.preprocessing import StandardScaler\n", 39 | "\n", 40 | "iris = datasets.load_iris()\n", 41 | "features = iris.data\n", 42 | "\n", 43 | "standardizer = StandardScaler()\n", 44 | "\n", 45 | "features_standardized = standardizer.fit_transform(features)\n", 46 | "\n", 47 | "nearest_neighbors = NearestNeighbors(n_neighbors=2).fit(features_standardized)\n", 48 | "#nearest_neighbors_euclidian = NearestNeighbors(n_neighbors=2, metric='euclidian').fit(features_standardized)\n", 49 | "new_observation = [1, 1, 1, 1]\n", 50 | "\n", 51 | "distances, indices = nearest_neighbors.kneighbors([new_observation])\n", 52 | "\n", 53 | "features_standardized[indices]" 54 | ] 55 | }, 56 | { 57 | "cell_type": "markdown", 58 | "metadata": {}, 59 | "source": [ 60 | "### Discussion\n", 61 | "\n", 62 | "How do we measure distance?\n", 63 | "\n", 64 | "* Euclidian\n", 65 | "$$\n", 66 | "d_{euclidean} = \\sqrt{\\sum_{i=1}^{n}{(x_i - y_i)^2}}\n", 67 | "$$\n", 68 | "\n", 69 | "* Manhattan\n", 70 | "$$\n", 71 | "d_{manhattan} = \\sum_{i=1}^{n}{|x_i - y_i|}\n", 72 | "$$\n", 73 | "\n", 74 | "* Minkowski (default)\n", 75 | "$$\n", 76 | "d_{minkowski} = (\\sum_{i=1}^{n}{|x_i - y_i|^p})^{\\frac{1}{p}}\n", 77 | "$$\n", 78 | "## 15.2 Creating a K-Nearest Neighbor Classifier" 79 | ] 80 | }, 81 | { 82 | "cell_type": "code", 83 | "execution_count": 3, 84 | "metadata": { 85 | "collapsed": false 86 | }, 87 | "outputs": [ 88 | { 89 | "data": { 90 | "text/plain": [ 91 | "array([1, 2])" 92 | ] 93 | }, 94 | "execution_count": 3, 95 | "metadata": {}, 96 | 
"output_type": "execute_result" 97 | } 98 | ], 99 | "source": [ 100 | "from sklearn.neighbors import KNeighborsClassifier\n", 101 | "from sklearn.preprocessing import StandardScaler\n", 102 | "from sklearn import datasets\n", 103 | "\n", 104 | "iris = datasets.load_iris()\n", 105 | "X = iris.data\n", 106 | "y = iris.target\n", 107 | "\n", 108 | "standardizer = StandardScaler()\n", 109 | "\n", 110 | "X_std = standardizer.fit_transform(X)\n", 111 | "\n", 112 | "knn = KNeighborsClassifier(n_neighbors=5, n_jobs=-1).fit(X_std, y)\n", 113 | "\n", 114 | "new_observations = [[0.75, 0.75, 0.75, 0.75],\n", 115 | " [1, 1, 1, 1]]\n", 116 | "\n", 117 | "knn.predict(new_observations)" 118 | ] 119 | }, 120 | { 121 | "cell_type": "markdown", 122 | "metadata": {}, 123 | "source": [ 124 | "### Discussion\n", 125 | "In KNN, given an observation $x_u$, with an unknown target class, the algorithm first identifies the k closest observations (sometimes called $x_u$'s neighborhood) based on some distance metric, then these k observations \"vote\" based on their class and the class that wins the vote is $x_u$'s predicted class. More formally, the probability $x_u$ is some class j is:\n", 126 | "$$\n", 127 | "\\frac{1}{k} \\sum_{i \\in v}{I(y_i = j)}\n", 128 | "$$\n", 129 | "where v is the k observatoin in $x_u$'s neighborhood, $y_i$ is the class of the ith observation, and I is an indicator function (i.e., 1 is true, 0 otherwise). In scikit-learn we can see these probabilities using `predict_proba`\n", 130 | "\n", 131 | "## 15.3 Identifying the Best Neighborhood Size" 132 | ] 133 | }, 134 | { 135 | "cell_type": "code", 136 | "execution_count": 4, 137 | "metadata": { 138 | "collapsed": false 139 | }, 140 | "outputs": [ 141 | { 142 | "data": { 143 | "text/plain": [ 144 | "6" 145 | ] 146 | }, 147 | "execution_count": 4, 148 | "metadata": {}, 149 | "output_type": "execute_result" 150 | } 151 | ], 152 | "source": [ 153 | "from sklearn.neighbors import KNeighborsClassifier\n", 154 | "from sklearn import datasets\n", 155 | "from sklearn.preprocessing import StandardScaler\n", 156 | "from sklearn.pipeline import Pipeline, FeatureUnion\n", 157 | "from sklearn.model_selection import GridSearchCV\n", 158 | "\n", 159 | "iris = datasets.load_iris()\n", 160 | "features = iris.data\n", 161 | "target = iris.target\n", 162 | "\n", 163 | "standardizer = StandardScaler()\n", 164 | "features_standardized = standardizer.fit_transform(features)\n", 165 | "\n", 166 | "knn = KNeighborsClassifier(n_neighbors=5, n_jobs=-1)\n", 167 | "\n", 168 | "pipe = Pipeline([(\"standardizer\", standardizer), (\"knn\", knn)])\n", 169 | "\n", 170 | "search_space = [{\"knn__n_neighbors\": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]}]\n", 171 | "\n", 172 | "classifier = GridSearchCV(\n", 173 | " pipe, search_space, cv=5, verbose=0).fit(features_standardized, target)\n", 174 | "\n", 175 | "classifier.best_estimator_.get_params()[\"knn__n_neighbors\"]" 176 | ] 177 | }, 178 | { 179 | "cell_type": "markdown", 180 | "metadata": {}, 181 | "source": [ 182 | "## 15.4 Creating a Radius-Based Nearest Neighbor Classifier\n", 183 | "given an observation of unknown class, you need to predict its class based on the class of all observations within a certain distance." 
184 | ] 185 | }, 186 | { 187 | "cell_type": "code", 188 | "execution_count": 5, 189 | "metadata": { 190 | "collapsed": false 191 | }, 192 | "outputs": [ 193 | { 194 | "data": { 195 | "text/plain": [ 196 | "array([2])" 197 | ] 198 | }, 199 | "execution_count": 5, 200 | "metadata": {}, 201 | "output_type": "execute_result" 202 | } 203 | ], 204 | "source": [ 205 | "from sklearn.neighbors import RadiusNeighborsClassifier\n", 206 | "from sklearn.preprocessing import StandardScaler\n", 207 | "from sklearn import datasets\n", 208 | "\n", 209 | "iris = datasets.load_iris()\n", 210 | "features = iris.data\n", 211 | "target = iris.target\n", 212 | "\n", 213 | "standardizer = StandardScaler()\n", 214 | "features_standardized = standardizer.fit_transform(features)\n", 215 | "\n", 216 | "rnn = RadiusNeighborsClassifier(\n", 217 | " radius=.5, n_jobs=-1).fit(features_standardized, target)\n", 218 | "\n", 219 | "new_observations = [[1, 1, 1, 1]]\n", 220 | "\n", 221 | "rnn.predict(new_observations)" 222 | ] 223 | }, 224 | { 225 | "cell_type": "code", 226 | "execution_count": null, 227 | "metadata": { 228 | "collapsed": true 229 | }, 230 | "outputs": [], 231 | "source": [] 232 | } 233 | ], 234 | "metadata": { 235 | "kernelspec": { 236 | "display_name": "Python [conda env:machine_learning_cookbook]", 237 | "language": "python", 238 | "name": "conda-env-machine_learning_cookbook-py" 239 | }, 240 | "language_info": { 241 | "codemirror_mode": { 242 | "name": "ipython", 243 | "version": 3 244 | }, 245 | "file_extension": ".py", 246 | "mimetype": "text/x-python", 247 | "name": "python", 248 | "nbconvert_exporter": "python", 249 | "pygments_lexer": "ipython3", 250 | "version": "3.6.6" 251 | } 252 | }, 253 | "nbformat": 4, 254 | "nbformat_minor": 2 255 | } 256 | -------------------------------------------------------------------------------- /Chapter 16 - Logistic Regression.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "## Chapter 16\n", 8 | "---\n", 9 | "# Logistic Regression\n", 10 | "\n", 11 | "Despire being called a regression, logistic regression is actually a widely used supervised classification technique. 
\n", 12 | "Allows us to predict the probability that an observation is of a certain class\n", 13 | "\n", 14 | "## 16.1 Training a Binary Classifier" 15 | ] 16 | }, 17 | { 18 | "cell_type": "code", 19 | "execution_count": 3, 20 | "metadata": { 21 | "collapsed": false 22 | }, 23 | "outputs": [ 24 | { 25 | "name": "stdout", 26 | "output_type": "stream", 27 | "text": [ 28 | "model.predict: [1]\n", 29 | "model.predict_proba: [[0.18823041 0.81176959]]\n" 30 | ] 31 | } 32 | ], 33 | "source": [ 34 | "from sklearn.linear_model import LogisticRegression\n", 35 | "from sklearn import datasets\n", 36 | "from sklearn.preprocessing import StandardScaler\n", 37 | "\n", 38 | "iris = datasets.load_iris()\n", 39 | "features = iris.data[:100,:]\n", 40 | "target = iris.target[:100]\n", 41 | "\n", 42 | "scaler = StandardScaler()\n", 43 | "features_standardized = scaler.fit_transform(features)\n", 44 | "\n", 45 | "logistic_regression = LogisticRegression(random_state=0)\n", 46 | "model = logistic_regression.fit(features_standardized, target)\n", 47 | "\n", 48 | "new_observation = [[.5, .5, .5, .5]]\n", 49 | "\n", 50 | "print(\"model.predict: {}\".format(model.predict(new_observation)))\n", 51 | "print(\"model.predict_proba: {}\".format(model.predict_proba(new_observation)))" 52 | ] 53 | }, 54 | { 55 | "cell_type": "markdown", 56 | "metadata": {}, 57 | "source": [ 58 | "### Discussion\n", 59 | "Dispire having \"regression\" in its name, a logistic regression is actually a widely used binary lassifier (i.e. the target vector can only take two values). In a logistic regression, a linear model (e.g. $\\beta_0 + \\beta_i x$) is included in a logistic (also called sigmoid) function, $\\frac{1}{1+e^{-z }}$, such that:\n", 60 | "$$\n", 61 | "P(y_i = 1 | X) = \\frac{1}{1+e^{-(\\beta_0 + \\beta_1x)}}\n", 62 | "$$\n", 63 | "where $P(y_i = 1 | X)$ is the probability of the ith obsevation's target, $y_i$ being class 1, X is the training data, $\\beta_0$ and $\\beta_1$ are the parameters to be learned, and e is Euler's number. The effect of the logistic function is to constrain the value of the function's output to between 0 and 1 so that i can be interpreted as a probability. If $P(y_i = 1 | X)$ is greater than 0.5, class 1 is predicted; otherwise class 0 is predicted\n", 64 | "\n", 65 | "## 16.2 Training a Multiclass Classifier" 66 | ] 67 | }, 68 | { 69 | "cell_type": "code", 70 | "execution_count": 4, 71 | "metadata": { 72 | "collapsed": true 73 | }, 74 | "outputs": [], 75 | "source": [ 76 | "from sklearn.linear_model import LogisticRegression\n", 77 | "from sklearn import datasets\n", 78 | "from sklearn.preprocessing import StandardScaler\n", 79 | "\n", 80 | "iris = datasets.load_iris()\n", 81 | "features = iris.data\n", 82 | "target = iris.target\n", 83 | "\n", 84 | "scaler = StandardScaler()\n", 85 | "features_standardized = scaler.fit_transform(features)\n", 86 | "\n", 87 | "logistic_regression = LogisticRegression(random_state=0, multi_class=\"ovr\")\n", 88 | "#logistic_regression_MNL = LogisticRegression(random_state=0, multi_class=\"multinomial\")\n", 89 | "\n", 90 | "model = logistic_regression.fit(features_standardized, target)" 91 | ] 92 | }, 93 | { 94 | "cell_type": "markdown", 95 | "metadata": {}, 96 | "source": [ 97 | "### Discussion\n", 98 | "On their own, logistic regressions are only binary classifiers, meaning they cannot handle target vectors with more than two classes. However, two clever extensions to logistic regression do just that. 
First, in one-vs-rest logistic regression (OVR) a separate model is trained for each class predicted whether an observation is that class or not (thus making it a binary classification problem). It assumes that each observation problem (e.g. class 0 or not) is independent\n", 99 | "\n", 100 | "Alternatively in multinomial logistic regression (MLR) the logistic function we saw in Recipe 15.1 is replaced with a softmax function:\n", 101 | "$$\n", 102 | "P(y_I = k | X) = \\frac{e^{\\beta_k x_i}}{\\sum_{j=1}^{K}{e^{\\beta_j x_i}}}\n", 103 | "$$\n", 104 | "where $P(y_i = k | X)$ is the probability of the ith observation's target value, $y_i$, is class k, and K is the total number of classes. One practical advantage of the MLR is that its predicted probabilities using `predict_proba` method are more reliable\n", 105 | "\n", 106 | "We can switch to an MNL by setting `multi_class='multinomial'`\n", 107 | "\n", 108 | "## 16.3 Reducing Variance Through Regularization" 109 | ] 110 | }, 111 | { 112 | "cell_type": "code", 113 | "execution_count": 5, 114 | "metadata": { 115 | "collapsed": true 116 | }, 117 | "outputs": [], 118 | "source": [ 119 | "from sklearn.linear_model import LogisticRegressionCV\n", 120 | "from sklearn import datasets\n", 121 | "from sklearn.preprocessing import StandardScaler\n", 122 | "\n", 123 | "iris = datasets.load_iris()\n", 124 | "features = iris.data\n", 125 | "target = iris.target\n", 126 | "\n", 127 | "scaler = StandardScaler()\n", 128 | "features_standardized = scaler.fit_transform(features)\n", 129 | "\n", 130 | "logistic_regression = LogisticRegressionCV(\n", 131 | " penalty='l2', Cs=10, random_state=0, n_jobs=-1)\n", 132 | "\n", 133 | "model = logistic_regression.fit(features_standardized, target)" 134 | ] 135 | }, 136 | { 137 | "cell_type": "markdown", 138 | "metadata": {}, 139 | "source": [ 140 | "### Discussion\n", 141 | "Regularization is a method of penalizing complex models to reduce their variance. Specifically, a penalty term is added to the loss function we are trying to minimize typically the L1 and L2 penalties\n", 142 | "\n", 143 | "In the L1 penalty:\n", 144 | "$$\n", 145 | "\\alpha \\sum_{j=1}^{p}{|\\hat\\beta_j|}\n", 146 | "$$\n", 147 | "where $\\hat\\beta_j$ is the parameters of the jth of p features being learned and $\\alpha$ is a hyperparameter denoting the regularization strength.\n", 148 | "\n", 149 | "With the L2 penalty:\n", 150 | "$$\n", 151 | "\\alpha \\sum_{j=1}^{p}{\\hat\\beta_j^2}\n", 152 | "$$\n", 153 | "higher values of $\\alpha$ increase the penalty for larger parameter values(i.e. more complex models). scikit-learn follows the common method of using C instead of $\\alpha$ where C is the inverse of the regularization strength: $C = \\frac{1}{\\alpha}$. To reduce variance while using logistic regression, we can treat C as a hyperparameter to be tuned to find thevalue of C that creates the best model. 
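One way to see the effect of C is to cross-validate a few candidate values by hand. A minimal sketch (the grid of C values is arbitrary and only for illustration):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn import datasets

iris = datasets.load_iris()
features_standardized = StandardScaler().fit_transform(iris.data)
target = iris.target

# smaller C = stronger regularization (C = 1/alpha)
for C in [0.01, 0.1, 1.0, 10.0]:
    scores = cross_val_score(
        LogisticRegression(C=C, random_state=0),
        features_standardized, target, cv=5)
    print(C, scores.mean())
```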
In scikit-learn we can use the `LogisticRegressionCV` class to efficiently tune C.\n", 154 | "\n", 155 | "## 16.4 Training a Classifier on Very Large Data" 156 | ] 157 | }, 158 | { 159 | "cell_type": "code", 160 | "execution_count": 6, 161 | "metadata": { 162 | "collapsed": true 163 | }, 164 | "outputs": [], 165 | "source": [ 166 | "from sklearn.linear_model import LogisticRegression\n", 167 | "from sklearn import datasets\n", 168 | "from sklearn.preprocessing import StandardScaler\n", 169 | "\n", 170 | "iris = datasets.load_iris()\n", 171 | "features = iris.data\n", 172 | "target = iris.target\n", 173 | "\n", 174 | "scaler = StandardScaler()\n", 175 | "features_standardized = scaler.fit_transform(features)\n", 176 | "\n", 177 | "logistic_regression = LogisticRegression(random_state=0, solver=\"sag\") # stochastic average gradient (SAG) solver\n", 178 | "model = logistic_regression.fit(features_standardized, target)" 179 | ] 180 | }, 181 | { 182 | "cell_type": "markdown", 183 | "metadata": {}, 184 | "source": [ 185 | "### Discussion\n", 186 | "scikit-learn's `LogisticRegression` offers a number of techniques for training a logistic regression, called solvers. Most of the time scikit-learn will select the best solver automatically for us or warn us we cannot do something with that solver.\n", 187 | "\n", 188 | "Stochastic averge gradient descent allows us to train a model much faster than other solvers when our data is very large. However, it is also very sensitive to feature scaling, so standardizing our features is particularly important\n", 189 | "\n", 190 | "### See Also\n", 191 | "* Minimizing Finite Sums with the Stochastic Average Gradient Algorithm, Mark Schmidt (http://www.birs.ca/workshops/2014/14w5003/files/schmidt.pdf)\n", 192 | "\n", 193 | "## 16.5 Handling Imbalanced Classes" 194 | ] 195 | }, 196 | { 197 | "cell_type": "code", 198 | "execution_count": 8, 199 | "metadata": { 200 | "collapsed": false 201 | }, 202 | "outputs": [], 203 | "source": [ 204 | "import numpy as np\n", 205 | "from sklearn.linear_model import LogisticRegression\n", 206 | "from sklearn import datasets\n", 207 | "from sklearn.preprocessing import StandardScaler\n", 208 | "\n", 209 | "iris = datasets.load_iris()\n", 210 | "features = iris.data[40:, :]\n", 211 | "target = iris.target[40:]\n", 212 | "\n", 213 | "target = np.where((target == 0), 0, 1)\n", 214 | "\n", 215 | "scaler = StandardScaler()\n", 216 | "features_standardized = scaler.fit_transform(features)\n", 217 | "\n", 218 | "logistic_regression = LogisticRegression(random_state=0, class_weight=\"balanced\")\n", 219 | "model = logistic_regression.fit(features_standardized, target)" 220 | ] 221 | }, 222 | { 223 | "cell_type": "markdown", 224 | "metadata": {}, 225 | "source": [ 226 | "### Discussion\n", 227 | "`LogisticRegression` comes with a built in method of handling imbalanced classes.\n", 228 | "`class_weight=\"balanced\"` will automatically weigh classes inversely proportional to their frequency:\n", 229 | "$$\n", 230 | "w_j = \\frac{n}{kn_j}\n", 231 | "$$\n", 232 | "where $w_j$ is the weight to class j, n is the number of observations, $n_j$ is the number of observations in class j, and k is the total number of classes" 233 | ] 234 | } 235 | ], 236 | "metadata": { 237 | "kernelspec": { 238 | "display_name": "Python [conda env:machine_learning_cookbook]", 239 | "language": "python", 240 | "name": "conda-env-machine_learning_cookbook-py" 241 | }, 242 | "language_info": { 243 | "codemirror_mode": { 244 | "name": "ipython", 245 | "version": 3 246 | 
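To make the `class_weight="balanced"` formula from Recipe 16.5 concrete, the weights can be reproduced directly. A minimal sketch using the same imbalanced iris subset; `compute_class_weight` is scikit-learn's helper for this calculation:

```python
import numpy as np
from sklearn import datasets
from sklearn.utils.class_weight import compute_class_weight

iris = datasets.load_iris()
target = np.where(iris.target[40:] == 0, 0, 1)   # 10 observations of class 0, 100 of class 1

n = len(target)                      # total number of observations
k = len(np.unique(target))           # number of classes
n_j = np.bincount(target)            # observations per class

print(n / (k * n_j))                                   # manual w_j = n / (k * n_j)
print(compute_class_weight(class_weight="balanced",
                           classes=np.unique(target), y=target))
```

Both lines print the same weights (here 5.5 for the rare class and 0.55 for the common one), which is what `class_weight="balanced"` applies internally.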
}, 247 | "file_extension": ".py", 248 | "mimetype": "text/x-python", 249 | "name": "python", 250 | "nbconvert_exporter": "python", 251 | "pygments_lexer": "ipython3", 252 | "version": "3.6.6" 253 | } 254 | }, 255 | "nbformat": 4, 256 | "nbformat_minor": 2 257 | } 258 | -------------------------------------------------------------------------------- /Chapter 18 - Naive Bayes.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "## Chapter 18\n", 8 | "---\n", 9 | "# Naive Bayes\n", 10 | "\n", 11 | "### 18.0 Introduction\n", 12 | "Bayes' theorem is the premier method for understanding the probability of some event $P(A|B)$, given some new information, $P(B|A)$, and a prior belief in the probability of the event, P(A):\n", 13 | "$$\n", 14 | "P(A | B) = \\frac{P(B|A)P(A)}{P(B)}\n", 15 | "$$\n", 16 | "\n", 17 | "The Bayesian method's popularity has skyrocked in the last decade, more and more rivaling the traditional frequentist applications in academia, government, and business. In machine learning, one applicaiton of Bayes' theorem to classifican comes in the form of the naive Bayes classifier. Naive Bayes classifiers combine a number of desirable qualities in practical machine learning into a single classifier:\n", 18 | "\n", 19 | "1. An intuitive approach\n", 20 | "2. The ability to work with small data\n", 21 | "3. Low computation costs for training and prediction\n", 22 | "4. Often solid results in a variety of settigns\n", 23 | "\n", 24 | "Specifically, a naive bayes classifier is based on:\n", 25 | "$$\n", 26 | "P(y | x_1, ..., x_j) = \\frac{P(x_1, ..., x_j | y)P(y)}{P(x_1,...,x_j)}\n", 27 | "$$\n", 28 | "where,\n", 29 | "* $P(y | x_1, ..., x_j)$ is called the *posterior* and is the probability that an observation is class y given observation's values for the j features, $x_1, ..., x_j$\n", 30 | "* $P(x_1, ..., x_j)$ is called likelihood and is the *likelihood* of an observation's values for features, $x_1, ..., x_j$, given their class y.\n", 31 | "* $P(y)$ is called the *prior* and is our belief for the probability of class y before looking at the data\n", 32 | "* P($x_1, ..., x_j$) is called the *marginal probability*\n", 33 | "\n", 34 | "In naive Bayes, we compare an obsrvation's posterior values for each possible class. Specifically, because the marginal probability is constant across these comparisons, we compare the numerators of the posterior for each class. For each observation the class with the greatest posterior numerator becomes the predicted class, $\\hat y$.\n", 35 | "\n", 36 | "There are two important things to note about naive Bayes classifiers.\n", 37 | "\n", 38 | "1. for each feature in the data, we have to assume the statistical distribution of the likelihood, $P(x_1, ..., x_j)$.\n", 39 | "- the common distributions are the normal (Gaussian), multinomial, and Bernoulli distributions.\n", 40 | "- the distribution chose is often determined by the nature of the features (continuous, binary, etc.)\n", 41 | "\n", 42 | "2. naive Bayes gets its name because we assume that each feature, and its resulting likelihood, is independent. 
This \"naive\" assumption is frequently wrong, yet in practice does little to prevent building high quality classifiers\n", 43 | "\n", 44 | "## 18.1 Training a Classifier for Continuous Features" 45 | ] 46 | }, 47 | { 48 | "cell_type": "code", 49 | "execution_count": 3, 50 | "metadata": { 51 | "collapsed": false 52 | }, 53 | "outputs": [ 54 | { 55 | "data": { 56 | "text/plain": [ 57 | "array([1])" 58 | ] 59 | }, 60 | "execution_count": 3, 61 | "metadata": {}, 62 | "output_type": "execute_result" 63 | } 64 | ], 65 | "source": [ 66 | "from sklearn import datasets\n", 67 | "from sklearn.naive_bayes import GaussianNB\n", 68 | "\n", 69 | "iris = datasets.load_iris()\n", 70 | "features = iris.data\n", 71 | "target = iris.target\n", 72 | "\n", 73 | "classifier = GaussianNB()\n", 74 | "\n", 75 | "model = classifier.fit(features, target)\n", 76 | "\n", 77 | "new_observation = [[4, 4, 4, 0.4]]\n", 78 | "model.predict(new_observation)" 79 | ] 80 | }, 81 | { 82 | "cell_type": "markdown", 83 | "metadata": {}, 84 | "source": [ 85 | "### Discussion\n", 86 | "The most common type of naive Bayes classifier is the Gaussian naive Bayesa. In Gaussian naive Bayesam we assuem that the likelihood of the feature values, x, given an observation is of class y, follows a normal distribution:\n", 87 | "$$\n", 88 | "p(x_j | y) = \\frac{1}{\\sqrt{2\\pi \\sigma_y^2}} e^{-\\frac{(x_j - \\mu_y)^2}{2\\sigma_y^2}}\n", 89 | "$$\n", 90 | "where $\\sigma_y^2$ and $\\mu_y$ are the variance and mean values of feature x_j for class y. Because of the assumption of the normal distribution, Gaussian naive Bayes is best used in cases when all our features are continuous.\n", 91 | "\n", 92 | "One of the interesting aspects of naive Bayes classifiers is that they allow us to assign a prior belief over the respect target classes. We can do this using `GaussianNB`'s `priors` parameter, which takes in a list of the probabilities assigned to each class of the target vector" 93 | ] 94 | }, 95 | { 96 | "cell_type": "code", 97 | "execution_count": 4, 98 | "metadata": { 99 | "collapsed": true 100 | }, 101 | "outputs": [], 102 | "source": [ 103 | "clf = GaussianNB(priors=[0.25, 0.25, 0.5])\n", 104 | "model = classifier.fit(features, target)" 105 | ] 106 | }, 107 | { 108 | "cell_type": "markdown", 109 | "metadata": {}, 110 | "source": [ 111 | "### See Also\n", 112 | "* How the Naive Bayes Classifier Works (http://dataaspirant.com/2017/02/06/naive-bayes-classifier-machine-learning/)\n", 113 | "\n", 114 | "## 18.2 Training a Classifier for Discrete and Count Features\n" 115 | ] 116 | }, 117 | { 118 | "cell_type": "code", 119 | "execution_count": 6, 120 | "metadata": { 121 | "collapsed": false 122 | }, 123 | "outputs": [ 124 | { 125 | "data": { 126 | "text/plain": [ 127 | "array([0])" 128 | ] 129 | }, 130 | "execution_count": 6, 131 | "metadata": {}, 132 | "output_type": "execute_result" 133 | } 134 | ], 135 | "source": [ 136 | "import numpy as np\n", 137 | "from sklearn.naive_bayes import MultinomialNB\n", 138 | "from sklearn.feature_extraction.text import CountVectorizer\n", 139 | "\n", 140 | "text_data = np.array(['I love Brazil. 
Brazil!', 'Brazil is best', 'Germany beats both'])\n", 141 | "\n", 142 | "count = CountVectorizer()\n", 143 | "bag_of_words = count.fit_transform(text_data)\n", 144 | "\n", 145 | "features = bag_of_words.toarray()\n", 146 | "\n", 147 | "target = np.array([0, 0, 1])\n", 148 | "\n", 149 | "classifier = MultinomialNB(class_prior=[0.25, 0.5])\n", 150 | "model = classifier.fit(features, target)\n", 151 | "\n", 152 | "new_observation = [[0, 0, 0, 1, 0, 1, 0]]\n", 153 | "model.predict(new_observation)" 154 | ] 155 | }, 156 | { 157 | "cell_type": "markdown", 158 | "metadata": {}, 159 | "source": [ 160 | "### Discussion\n", 161 | "\n", 162 | "Multinomial naive Bayes works similarly to Gaussian naive Bayes, but the features are assumed to be multinomial distributed. In practice this means that this classifier is commonly used when we have discrete data. One of the most common uses is text classification using bag of words or tf-idf approaches\n", 163 | "\n", 164 | "## 18.3 Training a Naive Bayes Classifier for Binary Features" 165 | ] 166 | }, 167 | { 168 | "cell_type": "code", 169 | "execution_count": 7, 170 | "metadata": { 171 | "collapsed": true 172 | }, 173 | "outputs": [], 174 | "source": [ 175 | "import numpy as np\n", 176 | "from sklearn.naive_bayes import BernoulliNB\n", 177 | "\n", 178 | "features = np.random.randint(2, size=(100, 3))\n", 179 | "target = np.random.randint(2, size=(100, 1)).ravel()\n", 180 | "\n", 181 | "classifier = BernoulliNB(class_prior=[0.25, 0.5])\n", 182 | "model = classifier.fit(features, target)" 183 | ] 184 | }, 185 | { 186 | "cell_type": "markdown", 187 | "metadata": {}, 188 | "source": [ 189 | "The Bernoulli naive Bayes classifier assumes that all our features are binary such that they take only two values (e.g. a nominal categorical feature that has been one-hot encoded). 
Like its multinomial cousin, Bernoulli naive Bayes is often used in text classification, when our feature matrix is simply the presence or absence of a word in a document" 190 | ] 191 | } 192 | ], 193 | "metadata": { 194 | "kernelspec": { 195 | "display_name": "Python [conda env:machine_learning_cookbook]", 196 | "language": "python", 197 | "name": "conda-env-machine_learning_cookbook-py" 198 | }, 199 | "language_info": { 200 | "codemirror_mode": { 201 | "name": "ipython", 202 | "version": 3 203 | }, 204 | "file_extension": ".py", 205 | "mimetype": "text/x-python", 206 | "name": "python", 207 | "nbconvert_exporter": "python", 208 | "pygments_lexer": "ipython3", 209 | "version": "3.6.6" 210 | } 211 | }, 212 | "nbformat": 4, 213 | "nbformat_minor": 2 214 | } 215 | -------------------------------------------------------------------------------- /Chapter 19 - Clustering.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "## Chapter 19\n", 8 | "---\n", 9 | "# Clustering\n", 10 | "\n", 11 | "## 19.0 Introduction\n", 12 | "\n", 13 | "Frequently, we run into situations where we only know the features.\n", 14 | "\n", 15 | "The goal of clustering algorithms is to identify latent groupings of obesrvations, which if done well, allow us to predict the class of observations even without a target vector.\n", 16 | "\n", 17 | "## 19.1 Clustering Using K-Means" 18 | ] 19 | }, 20 | { 21 | "cell_type": "code", 22 | "execution_count": 2, 23 | "metadata": { 24 | "collapsed": false 25 | }, 26 | "outputs": [ 27 | { 28 | "data": { 29 | "text/plain": [ 30 | "array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,\n", 31 | " 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,\n", 32 | " 1, 1, 1, 1, 1, 1, 0, 0, 0, 2, 2, 2, 0, 2, 2, 2, 2, 2, 2, 2, 2, 0,\n", 33 | " 2, 2, 2, 2, 0, 2, 2, 2, 2, 0, 0, 0, 2, 2, 2, 2, 2, 2, 2, 0, 0, 2,\n", 34 | " 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 0, 2, 0, 0, 0, 0, 2, 0, 0, 0,\n", 35 | " 0, 0, 0, 2, 2, 0, 0, 0, 0, 2, 0, 2, 0, 2, 0, 0, 2, 0, 0, 0, 0, 0,\n", 36 | " 0, 2, 2, 0, 0, 0, 2, 0, 0, 0, 2, 0, 0, 0, 2, 0, 0, 2], dtype=int32)" 37 | ] 38 | }, 39 | "execution_count": 2, 40 | "metadata": {}, 41 | "output_type": "execute_result" 42 | } 43 | ], 44 | "source": [ 45 | "from sklearn import datasets\n", 46 | "from sklearn.preprocessing import StandardScaler\n", 47 | "from sklearn.cluster import KMeans\n", 48 | "\n", 49 | "iris = datasets.load_iris()\n", 50 | "features = iris.data\n", 51 | "\n", 52 | "scaler = StandardScaler()\n", 53 | "features_std = scaler.fit_transform(features)\n", 54 | "\n", 55 | "cluster = KMeans(n_clusters=3, random_state=0, n_jobs=-1)\n", 56 | "model = cluster.fit(features_std)\n", 57 | "\n", 58 | "model.labels_" 59 | ] 60 | }, 61 | { 62 | "cell_type": "markdown", 63 | "metadata": {}, 64 | "source": [ 65 | "### Discussion\n", 66 | "k-means clustering is one of the most common clustering techniques. In k-means clustering, the algorithm attempts to group observations into k groups, with each group having roughly equal variance. The number of groups, k, is specified by the user as a hyperparameter. Specifically, in k-means:\n", 67 | "\n", 68 | "1. k cluster \"center\" points are created at random locations.\n", 69 | "\n", 70 | "2. For each observation:\n", 71 | " a. the distance between each observaiton and the k center points is calculated\n", 72 | " b. 
the observation is assigned to the cluster of the nearest center point\n", 73 | " \n", 74 | "3. The center points are moved to the means (i.e., centers) of their respective clusters\n", 75 | "\n", 76 | "4. Steps 2 and 3 are repeated until no observation changes in cluster membership\n", 77 | "\n", 78 | "k-means clustering assumes:\n", 79 | "* the clusters are convex shaped (e.g. a circle, a sphere).\n", 80 | "* all features are equally scaled\n", 81 | "* the groups are balanced\n", 82 | "\n", 83 | "### See Also\n", 84 | "* Introduction to K-means Clustering (https://www.datascience.com/blog/k-means-clustering)\n", 85 | "\n", 86 | "## 19.2 Speeding Up K-Means Clustering" 87 | ] 88 | }, 89 | { 90 | "cell_type": "code", 91 | "execution_count": 3, 92 | "metadata": { 93 | "collapsed": false 94 | }, 95 | "outputs": [ 96 | { 97 | "data": { 98 | "text/plain": [ 99 | "array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,\n", 100 | " 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,\n", 101 | " 1, 1, 1, 1, 1, 1, 2, 2, 2, 0, 0, 0, 2, 0, 2, 0, 0, 0, 0, 0, 0, 2,\n", 102 | " 0, 0, 0, 0, 2, 0, 0, 0, 0, 2, 2, 2, 0, 0, 0, 0, 0, 0, 0, 2, 2, 0,\n", 103 | " 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 2, 2, 2, 2, 0, 2, 2, 2,\n", 104 | " 2, 2, 2, 0, 2, 2, 2, 2, 2, 0, 2, 0, 2, 0, 2, 2, 0, 2, 2, 2, 2, 2,\n", 105 | " 2, 0, 0, 2, 2, 2, 2, 2, 2, 2, 0, 2, 2, 2, 0, 2, 2, 2], dtype=int32)" 106 | ] 107 | }, 108 | "execution_count": 3, 109 | "metadata": {}, 110 | "output_type": "execute_result" 111 | } 112 | ], 113 | "source": [ 114 | "from sklearn import datasets\n", 115 | "from sklearn.preprocessing import StandardScaler\n", 116 | "from sklearn.cluster import MiniBatchKMeans\n", 117 | "\n", 118 | "iris = datasets.load_iris()\n", 119 | "features = iris.data\n", 120 | "\n", 121 | "scaler = StandardScaler()\n", 122 | "features_std = scaler.fit_transform(features)\n", 123 | "\n", 124 | "cluster = MiniBatchKMeans(n_clusters=3, random_state=0, batch_size=100)\n", 125 | "model = cluster.fit(features_std)\n", 126 | "\n", 127 | "model.labels_" 128 | ] 129 | }, 130 | { 131 | "cell_type": "markdown", 132 | "metadata": {}, 133 | "source": [ 134 | "## 19.3 Clustering Using Meanshift" 135 | ] 136 | }, 137 | { 138 | "cell_type": "code", 139 | "execution_count": 4, 140 | "metadata": { 141 | "collapsed": false 142 | }, 143 | "outputs": [ 144 | { 145 | "data": { 146 | "text/plain": [ 147 | "array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,\n", 148 | " 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,\n", 149 | " 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n", 150 | " 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n", 151 | " 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n", 152 | " 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n", 153 | " 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])" 154 | ] 155 | }, 156 | "execution_count": 4, 157 | "metadata": {}, 158 | "output_type": "execute_result" 159 | } 160 | ], 161 | "source": [ 162 | "from sklearn import datasets\n", 163 | "from sklearn.preprocessing import StandardScaler\n", 164 | "from sklearn.cluster import MeanShift\n", 165 | "\n", 166 | "iris = datasets.load_iris()\n", 167 | "features = iris.data\n", 168 | "\n", 169 | "scaler = StandardScaler()\n", 170 | "features_std = scaler.fit_transform(features)\n", 171 | "\n", 172 | "cluster = MeanShift(n_jobs=-1)\n", 173 | "model = cluster.fit(features_std)\n", 174 | "\n", 175 | 
"model.labels_" 176 | ] 177 | }, 178 | { 179 | "cell_type": "markdown", 180 | "metadata": {}, 181 | "source": [ 182 | "## 19.4 Clustering Using DBSCAN" 183 | ] 184 | }, 185 | { 186 | "cell_type": "code", 187 | "execution_count": 5, 188 | "metadata": { 189 | "collapsed": false 190 | }, 191 | "outputs": [ 192 | { 193 | "data": { 194 | "text/plain": [ 195 | "array([ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, -1, -1, 0,\n", 196 | " 0, 0, 0, 0, 0, -1, 0, 0, 0, 0, 0, 0, 0, 0, 0, -1, -1,\n", 197 | " 0, 0, 0, 0, 0, 0, 0, -1, 0, 0, 0, 0, 0, 0, 0, 0, 1,\n", 198 | " 1, 1, 1, 1, 1, -1, -1, 1, -1, -1, 1, -1, 1, 1, 1, 1, 1,\n", 199 | " -1, 1, 1, 1, -1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,\n", 200 | " -1, 1, -1, 1, 1, 1, 1, 1, -1, 1, 1, 1, 1, -1, 1, -1, 1,\n", 201 | " 1, 1, 1, -1, -1, -1, -1, -1, 1, 1, 1, 1, -1, 1, 1, -1, -1,\n", 202 | " -1, 1, 1, -1, 1, 1, -1, 1, 1, 1, -1, -1, -1, 1, 1, 1, -1,\n", 203 | " -1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, -1, 1])" 204 | ] 205 | }, 206 | "execution_count": 5, 207 | "metadata": {}, 208 | "output_type": "execute_result" 209 | } 210 | ], 211 | "source": [ 212 | "from sklearn import datasets\n", 213 | "from sklearn.preprocessing import StandardScaler\n", 214 | "from sklearn.cluster import DBSCAN\n", 215 | "\n", 216 | "iris = datasets.load_iris()\n", 217 | "features = iris.data\n", 218 | "\n", 219 | "scaler = StandardScaler()\n", 220 | "features_std = scaler.fit_transform(features)\n", 221 | "\n", 222 | "cluster = DBSCAN(n_jobs=-1)\n", 223 | "model = cluster.fit(features_std)\n", 224 | "\n", 225 | "model.labels_" 226 | ] 227 | }, 228 | { 229 | "cell_type": "markdown", 230 | "metadata": {}, 231 | "source": [ 232 | "## 19.5 Clustering using Hierarchical Merging" 233 | ] 234 | }, 235 | { 236 | "cell_type": "code", 237 | "execution_count": 7, 238 | "metadata": { 239 | "collapsed": false 240 | }, 241 | "outputs": [ 242 | { 243 | "data": { 244 | "text/plain": [ 245 | "array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,\n", 246 | " 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1,\n", 247 | " 1, 1, 1, 1, 1, 1, 0, 0, 0, 2, 0, 2, 0, 2, 0, 2, 2, 0, 2, 0, 2, 0,\n", 248 | " 2, 2, 2, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 2, 2, 2, 0, 2, 0, 0, 2,\n", 249 | " 2, 2, 2, 0, 2, 2, 2, 2, 2, 0, 2, 2, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0,\n", 250 | " 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n", 251 | " 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])" 252 | ] 253 | }, 254 | "execution_count": 7, 255 | "metadata": {}, 256 | "output_type": "execute_result" 257 | } 258 | ], 259 | "source": [ 260 | "from sklearn import datasets\n", 261 | "from sklearn.preprocessing import StandardScaler\n", 262 | "from sklearn.cluster import AgglomerativeClustering\n", 263 | "\n", 264 | "iris = datasets.load_iris()\n", 265 | "features = iris.data\n", 266 | "\n", 267 | "scaler = StandardScaler()\n", 268 | "features_std = scaler.fit_transform(features)\n", 269 | "\n", 270 | "cluster = AgglomerativeClustering(n_clusters=3)\n", 271 | "model = cluster.fit(features_std)\n", 272 | "\n", 273 | "model.labels_" 274 | ] 275 | }, 276 | { 277 | "cell_type": "code", 278 | "execution_count": null, 279 | "metadata": { 280 | "collapsed": true 281 | }, 282 | "outputs": [], 283 | "source": [] 284 | } 285 | ], 286 | "metadata": { 287 | "kernelspec": { 288 | "display_name": "Python [conda env:machine_learning_cookbook]", 289 | "language": "python", 290 | "name": "conda-env-machine_learning_cookbook-py" 291 | }, 292 | "language_info": { 293 | "codemirror_mode": { 294 | "name": 
"ipython", 295 | "version": 3 296 | }, 297 | "file_extension": ".py", 298 | "mimetype": "text/x-python", 299 | "name": "python", 300 | "nbconvert_exporter": "python", 301 | "pygments_lexer": "ipython3", 302 | "version": "3.6.6" 303 | } 304 | }, 305 | "nbformat": 4, 306 | "nbformat_minor": 2 307 | } 308 | -------------------------------------------------------------------------------- /Chapter 21 - Saving and Loading Trained Models.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "### Chapter 21\n", 8 | "## Saving and Loading Trained Models\n", 9 | "\n", 10 | "### 21.0 Introduction\n", 11 | "In the last 20 chapters around 200 recipies, we have convered how to take raw data nad usem achine learning to create well-performing predictive models. However, for all our work to be worthwhile we eventually need to do something with our model, such as integrating it with an existing software application. To accomplish this goal, we need to be able to bot hsave our models after training and load them when they are needed by an application. This is the focus of the final chapter\n", 12 | "\n", 13 | "### 21.1 Saving and Loading a scikit-learn Model\n", 14 | "#### Problem\n", 15 | "You have trained a scikit-learn model and want to save it and load it elsewhere.\n", 16 | "\n", 17 | "#### Solution\n", 18 | "Save the model as a pickle file:" 19 | ] 20 | }, 21 | { 22 | "cell_type": "code", 23 | "execution_count": 1, 24 | "metadata": {}, 25 | "outputs": [ 26 | { 27 | "name": "stderr", 28 | "output_type": "stream", 29 | "text": [ 30 | "/Users/f00/anaconda/envs/machine_learning_cookbook/lib/python3.6/site-packages/sklearn/ensemble/weight_boosting.py:29: DeprecationWarning: numpy.core.umath_tests is an internal NumPy module and should not be imported. 
It will be removed in a future NumPy release.\n", 31 | " from numpy.core.umath_tests import inner1d\n" 32 | ] 33 | }, 34 | { 35 | "data": { 36 | "text/plain": [ 37 | "['model.pkl']" 38 | ] 39 | }, 40 | "execution_count": 1, 41 | "metadata": {}, 42 | "output_type": "execute_result" 43 | } 44 | ], 45 | "source": [ 46 | "# load libraries\n", 47 | "from sklearn.ensemble import RandomForestClassifier\n", 48 | "from sklearn import datasets\n", 49 | "from sklearn.externals import joblib\n", 50 | "\n", 51 | "# load data\n", 52 | "iris = datasets.load_iris()\n", 53 | "features = iris.data\n", 54 | "target = iris.target\n", 55 | "\n", 56 | "# create decision tree classifier object\n", 57 | "classifier = RandomForestClassifier()\n", 58 | "\n", 59 | "# train model\n", 60 | "model = classifier.fit(features, target)\n", 61 | "\n", 62 | "# save model as pickle file\n", 63 | "joblib.dump(model, \"model.pkl\")" 64 | ] 65 | }, 66 | { 67 | "cell_type": "markdown", 68 | "metadata": {}, 69 | "source": [ 70 | "Once the model is saved we can use scikit-learn in our destination application (e.g., web application) to load the model:" 71 | ] 72 | }, 73 | { 74 | "cell_type": "code", 75 | "execution_count": 2, 76 | "metadata": {}, 77 | "outputs": [], 78 | "source": [ 79 | "# load model from file\n", 80 | "classifier = joblib.load(\"model.pkl\")" 81 | ] 82 | }, 83 | { 84 | "cell_type": "markdown", 85 | "metadata": {}, 86 | "source": [ 87 | "And use it to make predictions" 88 | ] 89 | }, 90 | { 91 | "cell_type": "code", 92 | "execution_count": 3, 93 | "metadata": {}, 94 | "outputs": [ 95 | { 96 | "data": { 97 | "text/plain": [ 98 | "array([0])" 99 | ] 100 | }, 101 | "execution_count": 3, 102 | "metadata": {}, 103 | "output_type": "execute_result" 104 | } 105 | ], 106 | "source": [ 107 | "# create new observation\n", 108 | "new_observation = [[ 5.2, 3.2, 1.1, 0.1]]\n", 109 | "\n", 110 | "# predict obserrvation's class\n", 111 | "classifier.predict(new_observation)" 112 | ] 113 | }, 114 | { 115 | "cell_type": "markdown", 116 | "metadata": {}, 117 | "source": [ 118 | "### Discussion\n", 119 | "The first step in using a model in production is to save that model as a file that can be loaded by another application or workflow. We can accomplish this by saving the model as a pickle file, a Python-specific data format. 
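Because a pickle file is just Python's native serialization, the same round trip can also be done with the standard-library `pickle` module. A minimal sketch (the filename is illustrative):

```python
import pickle
from sklearn.ensemble import RandomForestClassifier
from sklearn import datasets

iris = datasets.load_iris()
model = RandomForestClassifier().fit(iris.data, iris.target)

# save the trained model to disk
with open("model_pickle_demo.pkl", "wb") as f:
    pickle.dump(model, f)

# load it back and make a prediction
with open("model_pickle_demo.pkl", "rb") as f:
    restored = pickle.load(f)

print(restored.predict([[5.2, 3.2, 1.1, 0.1]]))
```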
Specifically, to save the model we use `joblib`, which is a library extending pickle for cases when we have large NumPy arrays--a common occurance for trained models in scikit-learn.\n", 120 | "\n", 121 | "When saving scikit-learn models, be aware that saved models might not be compatible between versions of scikit-learn; therefore, it can be helpful to include the version of scikit-learn used in the model in the filename:" 122 | ] 123 | }, 124 | { 125 | "cell_type": "code", 126 | "execution_count": 4, 127 | "metadata": {}, 128 | "outputs": [ 129 | { 130 | "data": { 131 | "text/plain": [ 132 | "['model_(version).pkl']" 133 | ] 134 | }, 135 | "execution_count": 4, 136 | "metadata": {}, 137 | "output_type": "execute_result" 138 | } 139 | ], 140 | "source": [ 141 | "# import library\n", 142 | "import sklearn\n", 143 | "\n", 144 | "# get scikit-learn version\n", 145 | "scikit_version = joblib.__version__\n", 146 | "\n", 147 | "# save model as pickle file\n", 148 | "joblib.dump(model, \"model_(version).pkl\".format(version=scikit_version))" 149 | ] 150 | }, 151 | { 152 | "cell_type": "markdown", 153 | "metadata": {}, 154 | "source": [ 155 | "### 21.2 Saving and Loading a Keras Model\n", 156 | "#### Problem\n", 157 | "You have a trained Keras model and want to save it and load it elsewhere.\n", 158 | "\n", 159 | "#### Solution\n", 160 | "Save the model as HDF5:" 161 | ] 162 | }, 163 | { 164 | "cell_type": "code", 165 | "execution_count": 1, 166 | "metadata": {}, 167 | "outputs": [ 168 | { 169 | "name": "stderr", 170 | "output_type": "stream", 171 | "text": [ 172 | "Using Theano backend.\n" 173 | ] 174 | }, 175 | { 176 | "ename": "ModuleNotFoundError", 177 | "evalue": "No module named 'theano'", 178 | "output_type": "error", 179 | "traceback": [ 180 | "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", 181 | "\u001b[0;31mModuleNotFoundError\u001b[0m Traceback (most recent call last)", 182 | "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[1;32m 1\u001b[0m \u001b[0;31m# load libraries\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 2\u001b[0m \u001b[0;32mimport\u001b[0m \u001b[0mnumpy\u001b[0m \u001b[0;32mas\u001b[0m \u001b[0mnp\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 3\u001b[0;31m \u001b[0;32mfrom\u001b[0m \u001b[0mkeras\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mdatasets\u001b[0m \u001b[0;32mimport\u001b[0m \u001b[0mimdb\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 4\u001b[0m \u001b[0;32mfrom\u001b[0m \u001b[0mkeras\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mpreprocessing\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mtext\u001b[0m \u001b[0;32mimport\u001b[0m \u001b[0mTokenizer\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 5\u001b[0m \u001b[0;32mfrom\u001b[0m \u001b[0mkeras\u001b[0m \u001b[0;32mimport\u001b[0m \u001b[0mmodels\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", 183 | "\u001b[0;32m~/anaconda/envs/machine_learning_cookbook/lib/python3.6/site-packages/keras/__init__.py\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[1;32m 1\u001b[0m \u001b[0;32mfrom\u001b[0m \u001b[0m__future__\u001b[0m \u001b[0;32mimport\u001b[0m \u001b[0mabsolute_import\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 2\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 3\u001b[0;31m \u001b[0;32mfrom\u001b[0m \u001b[0;34m.\u001b[0m \u001b[0;32mimport\u001b[0m \u001b[0mutils\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 4\u001b[0m \u001b[0;32mfrom\u001b[0m 
\u001b[0;34m.\u001b[0m \u001b[0;32mimport\u001b[0m \u001b[0mactivations\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 5\u001b[0m \u001b[0;32mfrom\u001b[0m \u001b[0;34m.\u001b[0m \u001b[0;32mimport\u001b[0m \u001b[0mapplications\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", 184 | "\u001b[0;32m~/anaconda/envs/machine_learning_cookbook/lib/python3.6/site-packages/keras/utils/__init__.py\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[1;32m 4\u001b[0m \u001b[0;32mfrom\u001b[0m \u001b[0;34m.\u001b[0m \u001b[0;32mimport\u001b[0m \u001b[0mdata_utils\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 5\u001b[0m \u001b[0;32mfrom\u001b[0m \u001b[0;34m.\u001b[0m \u001b[0;32mimport\u001b[0m \u001b[0mio_utils\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 6\u001b[0;31m \u001b[0;32mfrom\u001b[0m \u001b[0;34m.\u001b[0m \u001b[0;32mimport\u001b[0m \u001b[0mconv_utils\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 7\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 8\u001b[0m \u001b[0;31m# Globally-importable utils.\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", 185 | "\u001b[0;32m~/anaconda/envs/machine_learning_cookbook/lib/python3.6/site-packages/keras/utils/conv_utils.py\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[1;32m 7\u001b[0m \u001b[0;32mfrom\u001b[0m \u001b[0msix\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mmoves\u001b[0m \u001b[0;32mimport\u001b[0m \u001b[0mrange\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 8\u001b[0m \u001b[0;32mimport\u001b[0m \u001b[0mnumpy\u001b[0m \u001b[0;32mas\u001b[0m \u001b[0mnp\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 9\u001b[0;31m \u001b[0;32mfrom\u001b[0m \u001b[0;34m.\u001b[0m\u001b[0;34m.\u001b[0m \u001b[0;32mimport\u001b[0m \u001b[0mbackend\u001b[0m \u001b[0;32mas\u001b[0m \u001b[0mK\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 10\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 11\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n", 186 | "\u001b[0;32m~/anaconda/envs/machine_learning_cookbook/lib/python3.6/site-packages/keras/backend/__init__.py\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[1;32m 84\u001b[0m \u001b[0;32melif\u001b[0m \u001b[0m_BACKEND\u001b[0m \u001b[0;34m==\u001b[0m \u001b[0;34m'theano'\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 85\u001b[0m \u001b[0msys\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mstderr\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mwrite\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m'Using Theano backend.\\n'\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 86\u001b[0;31m \u001b[0;32mfrom\u001b[0m \u001b[0;34m.\u001b[0m\u001b[0mtheano_backend\u001b[0m \u001b[0;32mimport\u001b[0m \u001b[0;34m*\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 87\u001b[0m \u001b[0;32melif\u001b[0m \u001b[0m_BACKEND\u001b[0m \u001b[0;34m==\u001b[0m \u001b[0;34m'tensorflow'\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 88\u001b[0m \u001b[0msys\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mstderr\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mwrite\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m'Using TensorFlow backend.\\n'\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", 187 | "\u001b[0;32m~/anaconda/envs/machine_learning_cookbook/lib/python3.6/site-packages/keras/backend/theano_backend.py\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[1;32m 5\u001b[0m \u001b[0;32mfrom\u001b[0m \u001b[0mcollections\u001b[0m \u001b[0;32mimport\u001b[0m 
\u001b[0mdefaultdict\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 6\u001b[0m \u001b[0;32mfrom\u001b[0m \u001b[0mcontextlib\u001b[0m \u001b[0;32mimport\u001b[0m \u001b[0mcontextmanager\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 7\u001b[0;31m \u001b[0;32mimport\u001b[0m \u001b[0mtheano\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 8\u001b[0m \u001b[0;32mfrom\u001b[0m \u001b[0mtheano\u001b[0m \u001b[0;32mimport\u001b[0m \u001b[0mtensor\u001b[0m \u001b[0;32mas\u001b[0m \u001b[0mT\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 9\u001b[0m \u001b[0;32mfrom\u001b[0m \u001b[0mtheano\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0msandbox\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mrng_mrg\u001b[0m \u001b[0;32mimport\u001b[0m \u001b[0mMRG_RandomStreams\u001b[0m \u001b[0;32mas\u001b[0m \u001b[0mRandomStreams\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", 188 | "\u001b[0;31mModuleNotFoundError\u001b[0m: No module named 'theano'" 189 | ] 190 | } 191 | ], 192 | "source": [ 193 | "# load libraries\n", 194 | "import numpy as np\n", 195 | "from keras.datasets import imdb\n", 196 | "from keras.preprocessing.text import Tokenizer\n", 197 | "from keras import models\n", 198 | "from keras import layers\n", 199 | "from keras.models import load_model\n", 200 | "\n", 201 | "# set random seed\n", 202 | "np.random.seed(0)\n", 203 | "\n", 204 | "# set the number of features we want\n", 205 | "number_of_features = 1000\n", 206 | "\n", 207 | "# load data and target vector from movie review data\n", 208 | "(train_Data, train_target), (test_data, test_target) = imdb.load_data(num_words=number_of_features)\n", 209 | "\n", 210 | "# convert movie review data to a one-hot encoded feature matrix\n", 211 | "tokenizer = Tokenizer(num_words=number_of_features)\n", 212 | "train_features = tokenizer.sequences_to_matrix(train_data, mode=\"binary\")\n", 213 | "test_features = tokenizer.sequences_to_matrix(test_data, mode=\"binary\")\n", 214 | "\n", 215 | "# start neural network\n", 216 | "network = models.Sequential()\n", 217 | "\n", 218 | "# add fully connected layer with ReLU activation function\n", 219 | "network.add(layers.Dense(units=16, activation=\"relu\", input_shape=(number_of_features,)))\n", 220 | "\n", 221 | "# add fully connected layer with a sigmoid activation function\n", 222 | "network.add(layers.Dense(units=1, activation=\"sigmoid\"))\n", 223 | "\n", 224 | "# compile neural network\n", 225 | "network.compile(loss=\"binary_crossentropy\", optimizer=\"rmsprop\", metrics=[\"accuracy\"])\n", 226 | "\n", 227 | "# train neural network\n", 228 | "history = network.fit(train_features, train_target, epochs=3, verbose=0, batch_size=100, validation_data=(test_features, test_target))\n", 229 | "\n", 230 | "# save neural network\n", 231 | "network.save(\"model.h5\")" 232 | ] 233 | }, 234 | { 235 | "cell_type": "markdown", 236 | "metadata": {}, 237 | "source": [ 238 | "We can then load the model either in another application or for additional training" 239 | ] 240 | }, 241 | { 242 | "cell_type": "code", 243 | "execution_count": 7, 244 | "metadata": {}, 245 | "outputs": [ 246 | { 247 | "ename": "NameError", 248 | "evalue": "name 'load_model' is not defined", 249 | "output_type": "error", 250 | "traceback": [ 251 | "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", 252 | "\u001b[0;31mNameError\u001b[0m Traceback (most recent call last)", 253 | "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[1;32m 1\u001b[0m \u001b[0;31m# load neural 
network\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 2\u001b[0;31m \u001b[0mnetwork\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mload_model\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m\"model.h5\"\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m", 254 | "\u001b[0;31mNameError\u001b[0m: name 'load_model' is not defined" 255 | ] 256 | } 257 | ], 258 | "source": [ 259 | "# load neural network\n", 260 | "network = load_model(\"model.h5\")" 261 | ] 262 | }, 263 | { 264 | "cell_type": "markdown", 265 | "metadata": {}, 266 | "source": [ 267 | "#### Discussion\n", 268 | "Unlike scikit-learn, Keras does not recommend you save models using pickle. Instead, models are saved as an HDF5 file. The HDF5 file contains everything you need to not only load the model to make predicitons (i.e., achitecture and trained parameters), but also to restart training (i.e. loss and optimizer settings and the current state)" 269 | ] 270 | }, 271 | { 272 | "cell_type": "code", 273 | "execution_count": null, 274 | "metadata": {}, 275 | "outputs": [], 276 | "source": [] 277 | } 278 | ], 279 | "metadata": { 280 | "kernelspec": { 281 | "display_name": "Python [conda env:machine_learning_cookbook]", 282 | "language": "python", 283 | "name": "conda-env-machine_learning_cookbook-py" 284 | }, 285 | "language_info": { 286 | "codemirror_mode": { 287 | "name": "ipython", 288 | "version": 3 289 | }, 290 | "file_extension": ".py", 291 | "mimetype": "text/x-python", 292 | "name": "python", 293 | "nbconvert_exporter": "python", 294 | "pygments_lexer": "ipython3", 295 | "version": "3.6.6" 296 | } 297 | }, 298 | "nbformat": 4, 299 | "nbformat_minor": 2 300 | } 301 | -------------------------------------------------------------------------------- /Chapter 4 - Handling Numerical Data.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "## Chapter 4\n", 8 | "---\n", 9 | "# Handling Numerical Data\n", 10 | "\n", 11 | "### 4.0 Introduction\n", 12 | "Quantitative data is the measurment of something--weather class size, monthly sales, or student scores. The natural way to represent these quantities is numerically (e.g., 20 students, $529,392 in sales). In this chapter we will cover numerous strategies for transforming raw numerical data into features purpose-built for machine learning algorithms\n", 13 | "\n", 14 | "### 4.1 Rescaling a feature\n", 15 | "Use scikit-learn's `MinMaxScaler` to rescale a feature array" 16 | ] 17 | }, 18 | { 19 | "cell_type": "code", 20 | "execution_count": 2, 21 | "metadata": {}, 22 | "outputs": [ 23 | { 24 | "data": { 25 | "text/plain": [ 26 | "array([[0. ],\n", 27 | " [0.28571429],\n", 28 | " [0.35714286],\n", 29 | " [0.42857143],\n", 30 | " [1. 
]])" 31 | ] 32 | }, 33 | "execution_count": 2, 34 | "metadata": {}, 35 | "output_type": "execute_result" 36 | } 37 | ], 38 | "source": [ 39 | "import numpy as np\n", 40 | "from sklearn import preprocessing\n", 41 | "\n", 42 | "# create a feature\n", 43 | "feature = np.array([\n", 44 | " [-500.5],\n", 45 | " [-100.1],\n", 46 | " [0],\n", 47 | " [100.1],\n", 48 | " [900.9]\n", 49 | "])\n", 50 | "\n", 51 | "# create scaler\n", 52 | "minmax_scaler = preprocessing.MinMaxScaler(feature_range=(0,1))\n", 53 | "\n", 54 | "# scale feature\n", 55 | "scaled_feature = minmax_scaler.fit_transform(feature)\n", 56 | "\n", 57 | "scaled_feature" 58 | ] 59 | }, 60 | { 61 | "cell_type": "markdown", 62 | "metadata": {}, 63 | "source": [ 64 | "#### Discussion\n", 65 | "Rescaling is a common preprocessing task in machine learning. Many of the algorithms described later in this book will assume all features are on the same scale, typically 0 to 1 or -1 to 1. There are a number of rescaling techniques, but one of the simlest is called *min-max scaling*. Min-max scaling uses the minimum and maximum values of a feature to rescale values to within a range. Specfically, min-max calculates:\n", 66 | "$$\n", 67 | "x_i^` = \\frac{x_i - min(x)}{max(x) - min(x)}\n", 68 | "$$\n", 69 | "\n", 70 | "where x is the feature vector, $x_i$ is an individual element of feature x, and $x_i^`$ is the rescaled element\n", 71 | "\n", 72 | "#### See Also\n", 73 | "* Feature scaling, wikipedia (https://en.wikipedia.org/wiki/Feature_scaling)\n", 74 | "\n", 75 | "### 4.2 Standardizing a Feature\n", 76 | "scikit-learn's `StandardScaler` transforms a feature to have a mean of 0 and a standard deviation of 1." 77 | ] 78 | }, 79 | { 80 | "cell_type": "code", 81 | "execution_count": 3, 82 | "metadata": {}, 83 | "outputs": [ 84 | { 85 | "data": { 86 | "text/plain": [ 87 | "array([[-0.76058269],\n", 88 | " [-0.54177196],\n", 89 | " [-0.35009716],\n", 90 | " [-0.32271504],\n", 91 | " [ 1.97516685]])" 92 | ] 93 | }, 94 | "execution_count": 3, 95 | "metadata": {}, 96 | "output_type": "execute_result" 97 | } 98 | ], 99 | "source": [ 100 | "import numpy as np\n", 101 | "from sklearn import preprocessing\n", 102 | "\n", 103 | "# create a feature\n", 104 | "feature = np.array([\n", 105 | " [-1000.1],\n", 106 | " [-200.2],\n", 107 | " [500.5],\n", 108 | " [600.6],\n", 109 | " [9000.9]\n", 110 | "])\n", 111 | "\n", 112 | "# create scaler\n", 113 | "scaler = preprocessing.StandardScaler()\n", 114 | "\n", 115 | "# transform the feature\n", 116 | "standardized = scaler.fit_transform(feature)\n", 117 | "\n", 118 | "standardized" 119 | ] 120 | }, 121 | { 122 | "cell_type": "markdown", 123 | "metadata": {}, 124 | "source": [ 125 | "#### Discussion\n", 126 | "A common alternative to min-max scaling is rescaling of features to be approximately standard normally distributed. To achieve this, we use standardization to tranform the data such that it has a mean, $\\bar x$, or 0 and a standard deviation $\\sigma$, of 1. Specifically, each element in the feature is transformed so that:\n", 127 | "$$\n", 128 | "x_i^` = \\frac{x_i - \\bar x}{\\sigma}\n", 129 | "$$" 130 | ] 131 | }, 132 | { 133 | "cell_type": "markdown", 134 | "metadata": {}, 135 | "source": [ 136 | "Where $x_I^`$ is our standardized form of $x_i$. 
The transformed feature represents the number of standard deviations in the original value is away from the feature's mean value (also called a *z-score* in statistics)\n", 137 | "\n", 138 | "Standardization is a common go-to scaling method for machine learning preprocessing and in my experience is used more than min-max scaling. However it depends on the learning algorithm. For example, principal component analysis often works better using standardization, while min-max scaling is often recommended for neural netwroks. As a general rule, I'd recommend defauling to standardization unless you have a specific reason to use an alternative.\n", 139 | "\n", 140 | "We can see the effect of standardization by looking at the mean and standard deviation of our solutions output:" 141 | ] 142 | }, 143 | { 144 | "cell_type": "code", 145 | "execution_count": 4, 146 | "metadata": {}, 147 | "outputs": [ 148 | { 149 | "name": "stdout", 150 | "output_type": "stream", 151 | "text": [ 152 | "Mean 0.0\n", 153 | "Standard Deviation: 1.0\n" 154 | ] 155 | } 156 | ], 157 | "source": [ 158 | "print(\"Mean {}\".format(round(standardized.mean())))\n", 159 | "print(\"Standard Deviation: {}\".format(standardized.std()))" 160 | ] 161 | }, 162 | { 163 | "cell_type": "markdown", 164 | "metadata": {}, 165 | "source": [ 166 | "If our data has significant outliers, it can negatively impact our standardizatino by affecting the feature's mean and variance. In this scenario, it is often helpful to instead rescale the feature using the median and quartile range. In scikit-learn, we do this using the *RobustScaler* method:" 167 | ] 168 | }, 169 | { 170 | "cell_type": "code", 171 | "execution_count": 5, 172 | "metadata": {}, 173 | "outputs": [ 174 | { 175 | "data": { 176 | "text/plain": [ 177 | "array([[-1.87387612],\n", 178 | " [-0.875 ],\n", 179 | " [ 0. 
],\n", 180 | " [ 0.125 ],\n", 181 | " [10.61488511]])" 182 | ] 183 | }, 184 | "execution_count": 5, 185 | "metadata": {}, 186 | "output_type": "execute_result" 187 | } 188 | ], 189 | "source": [ 190 | "# create scaler\n", 191 | "robust_scaler = preprocessing.RobustScaler()\n", 192 | "\n", 193 | "# transform feature\n", 194 | "robust_scaler.fit_transform(feature)" 195 | ] 196 | }, 197 | { 198 | "cell_type": "markdown", 199 | "metadata": {}, 200 | "source": [ 201 | "### 4.3 Normalizing Observations\n", 202 | "Use scikit-learn's `Normalizer` to rescale the feature values to have unit norm (a total length of 1)" 203 | ] 204 | }, 205 | { 206 | "cell_type": "code", 207 | "execution_count": 6, 208 | "metadata": {}, 209 | "outputs": [ 210 | { 211 | "data": { 212 | "text/plain": [ 213 | "array([[0.70710678, 0.70710678],\n", 214 | " [0.30782029, 0.95144452],\n", 215 | " [0.07405353, 0.99725427],\n", 216 | " [0.04733062, 0.99887928],\n", 217 | " [0.95709822, 0.28976368]])" 218 | ] 219 | }, 220 | "execution_count": 6, 221 | "metadata": {}, 222 | "output_type": "execute_result" 223 | } 224 | ], 225 | "source": [ 226 | "import numpy as np\n", 227 | "from sklearn.preprocessing import Normalizer\n", 228 | "\n", 229 | "# create feature matrix\n", 230 | "features = np.array([\n", 231 | " [0.5, 0.5],\n", 232 | " [1.1, 3.4],\n", 233 | " [1.5, 20.2],\n", 234 | " [1.63, 34.4],\n", 235 | " [10.9, 3.3]\n", 236 | "])\n", 237 | "\n", 238 | "# create normalizer\n", 239 | "normalizer = Normalizer(norm=\"l2\")\n", 240 | "\n", 241 | "# transofmr feature matrix\n", 242 | "normalizer.transform(features)" 243 | ] 244 | }, 245 | { 246 | "cell_type": "markdown", 247 | "metadata": {}, 248 | "source": [ 249 | "#### Discussion\n", 250 | "Many rescaling methods operate of features; however, we can also rescale across individual observations. `Normalizer` rescales the values on individual observations to have unit norm (the sum of their lengths is 1). This type of rescaling is often used when we have many equivalent features (e.g. text-classification when every word is n-word group is a feature).\n", 251 | "\n", 252 | "`Normalizer` provides three norm options with Euclidean norm (often called L2) being the default:\n", 253 | "$$\n", 254 | "||x||_2 = \\sqrt{x_1^2 + x_2^2 + ... + x_n^2}\n", 255 | "$$\n", 256 | "\n", 257 | "where x is an individual observation and x_n is that observation's value for the nth feature.\n", 258 | "\n", 259 | "Alternatively, we can specify Manhattan norm (L1):\n", 260 | "$$\n", 261 | "||x||_1 = \\sum_{i=1}^n{x_i}\n", 262 | "$$\n", 263 | "\n", 264 | "Intuitively, L2 norm can be thought of as the distance between two poitns in New York for a bird (i.e. 
a straight line), while L1 can be thought of as the distance for a human wlaking on the street (walk north one block, east one block, north one block, east one block, etc), which is why it is called \"Manhattan norm\" or \"Taxicab norm\".\n", 265 | "\n", 266 | "Practically, notice that `norm='l1'` rescales an observation's values so they sum to 1, which can sometimes be a desirable quality" 267 | ] 268 | }, 269 | { 270 | "cell_type": "code", 271 | "execution_count": 8, 272 | "metadata": {}, 273 | "outputs": [ 274 | { 275 | "name": "stdout", 276 | "output_type": "stream", 277 | "text": [ 278 | "Sum of the first observation's values: 1.0\n" 279 | ] 280 | } 281 | ], 282 | "source": [ 283 | "# transform feature matrix\n", 284 | "features_l1_norm = Normalizer(norm=\"l1\").transform(features)\n", 285 | "print(\"Sum of the first observation's values: {}\".format(features_l1_norm[0,0] + features_l1_norm[0,1]))" 286 | ] 287 | }, 288 | { 289 | "cell_type": "markdown", 290 | "metadata": {}, 291 | "source": [ 292 | "### 4.9 Grouping Observations Using Clustering" 293 | ] 294 | }, 295 | { 296 | "cell_type": "code", 297 | "execution_count": 9, 298 | "metadata": {}, 299 | "outputs": [ 300 | { 301 | "data": { 302 | "text/html": [ 303 | "
\n", 304 | "\n", 317 | "\n", 318 | " \n", 319 | " \n", 320 | " \n", 321 | " \n", 322 | " \n", 323 | " \n", 324 | " \n", 325 | " \n", 326 | " \n", 327 | " \n", 328 | " \n", 329 | " \n", 330 | " \n", 331 | " \n", 332 | " \n", 333 | " \n", 334 | " \n", 335 | " \n", 336 | " \n", 337 | " \n", 338 | " \n", 339 | " \n", 340 | " \n", 341 | " \n", 342 | " \n", 343 | " \n", 344 | " \n", 345 | " \n", 346 | " \n", 347 | " \n", 348 | " \n", 349 | " \n", 350 | " \n", 351 | " \n", 352 | " \n", 353 | " \n", 354 | " \n", 355 | " \n", 356 | " \n", 357 | " \n", 358 | "
" 360 | ], 361 | "text/plain": [ 362 | " feature_1 feature_2 group\n", 363 | "0 -9.877554 -3.336145 0\n", 364 | "1 -7.287210 -8.353986 2\n", 365 | "2 -6.943061 -7.023744 2\n", 366 | "3 -7.440167 -8.791959 2\n", 367 | "4 -6.641388 -8.075888 2" 368 | ] 369 | }, 370 | "execution_count": 9, 371 | "metadata": {}, 372 | "output_type": "execute_result" 373 | } 374 | ], 375 | "source": [ 376 | "import pandas as pd\n", 377 | "from sklearn.datasets import make_blobs\n", 378 | "from sklearn.cluster import KMeans\n", 379 | "\n", 380 | "features, _ = make_blobs(n_samples = 50,\n", 381 | " n_features = 2,\n", 382 | " centers = 3,\n", 383 | " random_state = 1)\n", 384 | "\n", 385 | "df = pd.DataFrame(features, columns=[\"feature_1\", \"feature_2\"])\n", 386 | "\n", 387 | "# make k-means clusterer\n", 388 | "clusterer = KMeans(3, random_state=0)\n", 389 | "\n", 390 | "# fit clusterer\n", 391 | "clusterer.fit(features)\n", 392 | "\n", 393 | "# predict values\n", 394 | "df['group'] = clusterer.predict(features)\n", 395 | "\n", 396 | "df.head()" 397 | ] 398 | }, 399 | { 400 | "cell_type": "markdown", 401 | "metadata": {}, 402 | "source": [ 403 | "# 4.10 Deleteing Observations with Missing Values" 404 | ] 405 | }, 406 | { 407 | "cell_type": "code", 408 | "execution_count": 10, 409 | "metadata": {}, 410 | "outputs": [ 411 | { 412 | "data": { 413 | "text/plain": [ 414 | "array([[ 1.1, 11.1],\n", 415 | " [ 2.2, 22.2],\n", 416 | " [ 3.3, 33.3]])" 417 | ] 418 | }, 419 | "execution_count": 10, 420 | "metadata": {}, 421 | "output_type": "execute_result" 422 | } 423 | ], 424 | "source": [ 425 | "import numpy as np\n", 426 | "\n", 427 | "features = np.array([\n", 428 | " [1.1, 11.1],\n", 429 | " [2.2, 22.2],\n", 430 | " [3.3, 33.3],\n", 431 | " [np.nan, 55]\n", 432 | "])\n", 433 | "\n", 434 | "# keep only observations that are not (denoted by ~) missing\n", 435 | "features[~np.isnan(features).any(axis=1)]" 436 | ] 437 | }, 438 | { 439 | "cell_type": "code", 440 | "execution_count": 11, 441 | "metadata": {}, 442 | "outputs": [ 443 | { 444 | "data": { 445 | "text/html": [ 446 | "
\n", 447 | "\n", 460 | "\n", 461 | " \n", 462 | " \n", 463 | " \n", 464 | " \n", 465 | " \n", 466 | " \n", 467 | " \n", 468 | " \n", 469 | " \n", 470 | " \n", 471 | " \n", 472 | " \n", 473 | " \n", 474 | " \n", 475 | " \n", 476 | " \n", 477 | " \n", 478 | " \n", 479 | " \n", 480 | " \n", 481 | " \n", 482 | " \n", 483 | " \n", 484 | " \n", 485 | "
feature_1feature_2
01.111.1
12.222.2
23.333.3
\n", 486 | "
" 487 | ], 488 | "text/plain": [ 489 | " feature_1 feature_2\n", 490 | "0 1.1 11.1\n", 491 | "1 2.2 22.2\n", 492 | "2 3.3 33.3" 493 | ] 494 | }, 495 | "execution_count": 11, 496 | "metadata": {}, 497 | "output_type": "execute_result" 498 | } 499 | ], 500 | "source": [ 501 | "import pandas as pd\n", 502 | "df = pd.DataFrame(features, columns=[\"feature_1\", \"feature_2\"])\n", 503 | "df.dropna()" 504 | ] 505 | }, 506 | { 507 | "cell_type": "markdown", 508 | "metadata": {}, 509 | "source": [ 510 | "#### Discussion\n", 511 | "Most machine learnign algorithms cannot handling any missing values in the target and feature arrays. The simplest solution is the delete every observation that contains one or more missing values\n", 512 | "\n", 513 | "There are three types of missing data:\n", 514 | "\n", 515 | "*Missing Completely At Random (MCAR)*\n", 516 | "* The probability that a value is missing is independent of everything.\n", 517 | "\n", 518 | "*Missing At Random (MAR)*\n", 519 | "* The probability that a value is missing is not completely random, but depends on information capture in other feature\n", 520 | "\n", 521 | "*Missing Not At Random (MNAR)*\n", 522 | "* The probability that a value is missing is not random and depends on information not captured in our features\n", 523 | "\n", 524 | "#### See Also\n", 525 | "* Identifying the Three Types of Missing Data (https://measuringu.com/missing-data/)\n", 526 | "* Missing-Data Imputation (http://www.stat.columbia.edu/~gelman/arm/missing.pdf)\n", 527 | "\n", 528 | "### 4.11 Imputing Missing Values" 529 | ] 530 | }, 531 | { 532 | "cell_type": "code", 533 | "execution_count": 14, 534 | "metadata": {}, 535 | "outputs": [ 536 | { 537 | "name": "stdout", 538 | "output_type": "stream", 539 | "text": [ 540 | "True Value: 0.8730186113995938\n", 541 | "Imputed Value: -3.058372724614996\n" 542 | ] 543 | } 544 | ], 545 | "source": [ 546 | "import numpy as np\n", 547 | "from sklearn.preprocessing import StandardScaler\n", 548 | "from sklearn.datasets import make_blobs\n", 549 | "from sklearn.preprocessing import Imputer\n", 550 | "\n", 551 | "# make fake data\n", 552 | "features, _ = make_blobs(n_samples = 1000,\n", 553 | " n_features = 2,\n", 554 | " random_state = 1)\n", 555 | "\n", 556 | "# standardize the features\n", 557 | "scaler = StandardScaler()\n", 558 | "standardized_features = scaler.fit_transform(features)\n", 559 | "\n", 560 | "# replace the first feature's first value with a missing value\n", 561 | "true_value = standardized_features[0, 0]\n", 562 | "standardized_features[0,0] = np.nan\n", 563 | "\n", 564 | "# create imputer\n", 565 | "mean_imputer = Imputer(strategy=\"mean\", axis=0)\n", 566 | "\n", 567 | "# impute values\n", 568 | "feautres_mean_imputed = mean_imputer.fit_transform(features)\n", 569 | "\n", 570 | "# compare true and imputed values\n", 571 | "print(\"True Value: {}\".format(true_value))\n", 572 | "print(\"Imputed Value: {}\".format(feautres_mean_imputed[0,0]))" 573 | ] 574 | }, 575 | { 576 | "cell_type": "markdown", 577 | "metadata": {}, 578 | "source": [ 579 | "#### See Also\n", 580 | "* A Study of K-Nearest Neighbor as an Imputation Method (http://conteudo.icmc.usp.br/pessoas/gbatista/files/his2002.pdf)" 581 | ] 582 | }, 583 | { 584 | "cell_type": "code", 585 | "execution_count": null, 586 | "metadata": {}, 587 | "outputs": [], 588 | "source": [] 589 | } 590 | ], 591 | "metadata": { 592 | "kernelspec": { 593 | "display_name": "Python [conda env:machine_learning_cookbook]", 594 | "language": "python", 595 | "name": 
"conda-env-machine_learning_cookbook-py" 596 | }, 597 | "language_info": { 598 | "codemirror_mode": { 599 | "name": "ipython", 600 | "version": 3 601 | }, 602 | "file_extension": ".py", 603 | "mimetype": "text/x-python", 604 | "name": "python", 605 | "nbconvert_exporter": "python", 606 | "pygments_lexer": "ipython3", 607 | "version": "3.6.6" 608 | } 609 | }, 610 | "nbformat": 4, 611 | "nbformat_minor": 2 612 | } 613 | -------------------------------------------------------------------------------- /Chapter 7 - Handling Dates and Times.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "## 7.1 Converting Strings to Dates" 8 | ] 9 | }, 10 | { 11 | "cell_type": "code", 12 | "execution_count": 2, 13 | "metadata": {}, 14 | "outputs": [ 15 | { 16 | "data": { 17 | "text/plain": [ 18 | "[Timestamp('2005-04-03 23:35:00'),\n", 19 | " Timestamp('2010-05-23 00:01:00'),\n", 20 | " Timestamp('2009-09-04 21:09:00')]" 21 | ] 22 | }, 23 | "execution_count": 2, 24 | "metadata": {}, 25 | "output_type": "execute_result" 26 | } 27 | ], 28 | "source": [ 29 | "import numpy as np\n", 30 | "import pandas as pd\n", 31 | "\n", 32 | "date_strings = np.array([\n", 33 | " '03-04-2005 11:35 PM',\n", 34 | " '23-05-2010 12:01 AM',\n", 35 | " '04-09-2009 09:09 PM'\n", 36 | "])\n", 37 | "\n", 38 | "# convert to datetimes\n", 39 | "[pd.to_datetime(date, format='%d-%m-%Y %I:%M %p') for date in date_strings]" 40 | ] 41 | }, 42 | { 43 | "cell_type": "code", 44 | "execution_count": 3, 45 | "metadata": {}, 46 | "outputs": [ 47 | { 48 | "data": { 49 | "text/plain": [ 50 | "[Timestamp('2005-04-03 23:35:00'),\n", 51 | " Timestamp('2010-05-23 00:01:00'),\n", 52 | " Timestamp('2009-09-04 21:09:00')]" 53 | ] 54 | }, 55 | "execution_count": 3, 56 | "metadata": {}, 57 | "output_type": "execute_result" 58 | } 59 | ], 60 | "source": [ 61 | "[pd.to_datetime(date, format='%d-%m-%Y %I:%M %p', errors='coerce') for date in date_strings]" 62 | ] 63 | }, 64 | { 65 | "cell_type": "markdown", 66 | "metadata": {}, 67 | "source": [ 68 | "### See Also\n", 69 | "* http://strftime.org/\n", 70 | "\n", 71 | "## 7.2 Handling Time Zones" 72 | ] 73 | }, 74 | { 75 | "cell_type": "code", 76 | "execution_count": 4, 77 | "metadata": {}, 78 | "outputs": [ 79 | { 80 | "data": { 81 | "text/plain": [ 82 | "Timestamp('2017-05-01 06:00:00+0100', tz='Europe/London')" 83 | ] 84 | }, 85 | "execution_count": 4, 86 | "metadata": {}, 87 | "output_type": "execute_result" 88 | } 89 | ], 90 | "source": [ 91 | "import pandas as pd\n", 92 | "\n", 93 | "pd.Timestamp('2017-05-01 06:00:00', tz='Europe/London')" 94 | ] 95 | }, 96 | { 97 | "cell_type": "code", 98 | "execution_count": 5, 99 | "metadata": {}, 100 | "outputs": [ 101 | { 102 | "data": { 103 | "text/plain": [ 104 | "Timestamp('2017-05-01 06:00:00+0100', tz='Europe/London')" 105 | ] 106 | }, 107 | "execution_count": 5, 108 | "metadata": {}, 109 | "output_type": "execute_result" 110 | } 111 | ], 112 | "source": [ 113 | "date = pd.Timestamp('2017-05-01 06:00:00')\n", 114 | "\n", 115 | "date_in_london = date.tz_localize('Europe/London')\n", 116 | "\n", 117 | "date_in_london" 118 | ] 119 | }, 120 | { 121 | "cell_type": "code", 122 | "execution_count": 8, 123 | "metadata": {}, 124 | "outputs": [ 125 | { 126 | "data": { 127 | "text/plain": [ 128 | "Timestamp('2017-05-01 05:00:00+0000', tz='Africa/Abidjan')" 129 | ] 130 | }, 131 | "execution_count": 8, 132 | "metadata": {}, 133 | "output_type": 
"execute_result" 134 | } 135 | ], 136 | "source": [ 137 | "date_in_london.tz_convert('Africa/Abidjan')" 138 | ] 139 | }, 140 | { 141 | "cell_type": "code", 142 | "execution_count": 9, 143 | "metadata": {}, 144 | "outputs": [ 145 | { 146 | "data": { 147 | "text/plain": [ 148 | "0 2002-02-28 00:00:00+00:00\n", 149 | "1 2002-03-31 00:00:00+00:00\n", 150 | "2 2002-04-30 00:00:00+00:00\n", 151 | "dtype: datetime64[ns, Africa/Abidjan]" 152 | ] 153 | }, 154 | "execution_count": 9, 155 | "metadata": {}, 156 | "output_type": "execute_result" 157 | } 158 | ], 159 | "source": [ 160 | "dates = pd.Series(pd.date_range('2/2/2002', periods=3, freq='M'))\n", 161 | "\n", 162 | "dates.dt.tz_localize('Africa/Abidjan')" 163 | ] 164 | }, 165 | { 166 | "cell_type": "markdown", 167 | "metadata": {}, 168 | "source": [ 169 | "## 7.3 Selecting Dates and Times\n", 170 | "## 7.4 Breaking Up Date Data into Multiple Features\n", 171 | "## 7.5 Calculating the Difference Between Dates\n", 172 | "## 7.6 Encoding Days of the Week\n", 173 | "## 7.7 Creating Lagged Feature\n", 174 | "## 7.8 Using Rolling Time Windows" 175 | ] 176 | }, 177 | { 178 | "cell_type": "code", 179 | "execution_count": 10, 180 | "metadata": {}, 181 | "outputs": [ 182 | { 183 | "data": { 184 | "text/html": [ 185 | "
\n", 186 | "\n", 199 | "\n", 200 | " \n", 201 | " \n", 202 | " \n", 203 | " \n", 204 | " \n", 205 | " \n", 206 | " \n", 207 | " \n", 208 | " \n", 209 | " \n", 210 | " \n", 211 | " \n", 212 | " \n", 213 | " \n", 214 | " \n", 215 | " \n", 216 | " \n", 217 | " \n", 218 | " \n", 219 | " \n", 220 | " \n", 221 | " \n", 222 | " \n", 223 | " \n", 224 | " \n", 225 | " \n", 226 | " \n", 227 | " \n", 228 | "
Stock_Price
2010-01-31NaN
2010-02-281.5
2010-03-312.5
2010-04-303.5
2010-05-314.5
\n", 229 | "
" 230 | ], 231 | "text/plain": [ 232 | " Stock_Price\n", 233 | "2010-01-31 NaN\n", 234 | "2010-02-28 1.5\n", 235 | "2010-03-31 2.5\n", 236 | "2010-04-30 3.5\n", 237 | "2010-05-31 4.5" 238 | ] 239 | }, 240 | "execution_count": 10, 241 | "metadata": {}, 242 | "output_type": "execute_result" 243 | } 244 | ], 245 | "source": [ 246 | "import pandas as pd\n", 247 | "\n", 248 | "time_index = pd.date_range('01/01/2010', periods=5, freq='M')\n", 249 | "df = pd.DataFrame(index=time_index)\n", 250 | "df['Stock_Price'] = [1,2,3,4,5]\n", 251 | "df.rolling(window=2).mean()" 252 | ] 253 | }, 254 | { 255 | "cell_type": "markdown", 256 | "metadata": {}, 257 | "source": [ 258 | "### See Also\n", 259 | "* pandas documentation: Rolling Windows (https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.rolling.html)\n", 260 | "* What are Moving Average or Smoothing Techniques (https://www.itl.nist.gov/div898/handbook/pmc/section4/pmc42.htm)\n", 261 | "\n", 262 | "## 7.9 Handling Missing Data in Time Series" 263 | ] 264 | }, 265 | { 266 | "cell_type": "code", 267 | "execution_count": 11, 268 | "metadata": {}, 269 | "outputs": [ 270 | { 271 | "data": { 272 | "text/html": [ 273 | "
\n", 274 | "\n", 287 | "\n", 288 | " \n", 289 | " \n", 290 | " \n", 291 | " \n", 292 | " \n", 293 | " \n", 294 | " \n", 295 | " \n", 296 | " \n", 297 | " \n", 298 | " \n", 299 | " \n", 300 | " \n", 301 | " \n", 302 | " \n", 303 | " \n", 304 | " \n", 305 | " \n", 306 | " \n", 307 | " \n", 308 | " \n", 309 | " \n", 310 | " \n", 311 | " \n", 312 | " \n", 313 | " \n", 314 | " \n", 315 | " \n", 316 | "
Sales
2010-01-311.0
2010-02-282.0
2010-03-313.0
2010-04-304.0
2010-05-315.0
\n", 317 | "
" 318 | ], 319 | "text/plain": [ 320 | " Sales\n", 321 | "2010-01-31 1.0\n", 322 | "2010-02-28 2.0\n", 323 | "2010-03-31 3.0\n", 324 | "2010-04-30 4.0\n", 325 | "2010-05-31 5.0" 326 | ] 327 | }, 328 | "execution_count": 11, 329 | "metadata": {}, 330 | "output_type": "execute_result" 331 | } 332 | ], 333 | "source": [ 334 | "import pandas as pd\n", 335 | "import numpy as np\n", 336 | "\n", 337 | "time_index = pd.date_range('01/01/2010', periods=5, freq='M')\n", 338 | "\n", 339 | "df = pd.DataFrame(index=time_index)\n", 340 | "\n", 341 | "df[\"Sales\"] = [1.0, 2.0, np.nan, np.nan, 5.0]\n", 342 | "\n", 343 | "df.interpolate()" 344 | ] 345 | }, 346 | { 347 | "cell_type": "code", 348 | "execution_count": 12, 349 | "metadata": {}, 350 | "outputs": [ 351 | { 352 | "data": { 353 | "text/html": [ 354 | "
\n", 355 | "\n", 368 | "\n", 369 | " \n", 370 | " \n", 371 | " \n", 372 | " \n", 373 | " \n", 374 | " \n", 375 | " \n", 376 | " \n", 377 | " \n", 378 | " \n", 379 | " \n", 380 | " \n", 381 | " \n", 382 | " \n", 383 | " \n", 384 | " \n", 385 | " \n", 386 | " \n", 387 | " \n", 388 | " \n", 389 | " \n", 390 | " \n", 391 | " \n", 392 | " \n", 393 | " \n", 394 | " \n", 395 | " \n", 396 | " \n", 397 | "
Sales
2010-01-311.0
2010-02-282.0
2010-03-312.0
2010-04-302.0
2010-05-315.0
\n", 398 | "
" 399 | ], 400 | "text/plain": [ 401 | " Sales\n", 402 | "2010-01-31 1.0\n", 403 | "2010-02-28 2.0\n", 404 | "2010-03-31 2.0\n", 405 | "2010-04-30 2.0\n", 406 | "2010-05-31 5.0" 407 | ] 408 | }, 409 | "execution_count": 12, 410 | "metadata": {}, 411 | "output_type": "execute_result" 412 | } 413 | ], 414 | "source": [ 415 | "df.ffill()" 416 | ] 417 | }, 418 | { 419 | "cell_type": "code", 420 | "execution_count": 13, 421 | "metadata": {}, 422 | "outputs": [ 423 | { 424 | "data": { 425 | "text/html": [ 426 | "
\n", 427 | "\n", 440 | "\n", 441 | " \n", 442 | " \n", 443 | " \n", 444 | " \n", 445 | " \n", 446 | " \n", 447 | " \n", 448 | " \n", 449 | " \n", 450 | " \n", 451 | " \n", 452 | " \n", 453 | " \n", 454 | " \n", 455 | " \n", 456 | " \n", 457 | " \n", 458 | " \n", 459 | " \n", 460 | " \n", 461 | " \n", 462 | " \n", 463 | " \n", 464 | " \n", 465 | " \n", 466 | " \n", 467 | " \n", 468 | " \n", 469 | "
Sales
2010-01-311.0
2010-02-282.0
2010-03-315.0
2010-04-305.0
2010-05-315.0
\n", 470 | "
" 471 | ], 472 | "text/plain": [ 473 | " Sales\n", 474 | "2010-01-31 1.0\n", 475 | "2010-02-28 2.0\n", 476 | "2010-03-31 5.0\n", 477 | "2010-04-30 5.0\n", 478 | "2010-05-31 5.0" 479 | ] 480 | }, 481 | "execution_count": 13, 482 | "metadata": {}, 483 | "output_type": "execute_result" 484 | } 485 | ], 486 | "source": [ 487 | "df.bfill()" 488 | ] 489 | }, 490 | { 491 | "cell_type": "code", 492 | "execution_count": 14, 493 | "metadata": {}, 494 | "outputs": [ 495 | { 496 | "data": { 497 | "text/html": [ 498 | "
\n", 499 | "\n", 512 | "\n", 513 | " \n", 514 | " \n", 515 | " \n", 516 | " \n", 517 | " \n", 518 | " \n", 519 | " \n", 520 | " \n", 521 | " \n", 522 | " \n", 523 | " \n", 524 | " \n", 525 | " \n", 526 | " \n", 527 | " \n", 528 | " \n", 529 | " \n", 530 | " \n", 531 | " \n", 532 | " \n", 533 | " \n", 534 | " \n", 535 | " \n", 536 | " \n", 537 | " \n", 538 | " \n", 539 | " \n", 540 | " \n", 541 | "
Sales
2010-01-311.000000
2010-02-282.000000
2010-03-313.059808
2010-04-304.038069
2010-05-315.000000
\n", 542 | "
" 543 | ], 544 | "text/plain": [ 545 | " Sales\n", 546 | "2010-01-31 1.000000\n", 547 | "2010-02-28 2.000000\n", 548 | "2010-03-31 3.059808\n", 549 | "2010-04-30 4.038069\n", 550 | "2010-05-31 5.000000" 551 | ] 552 | }, 553 | "execution_count": 14, 554 | "metadata": {}, 555 | "output_type": "execute_result" 556 | } 557 | ], 558 | "source": [ 559 | "df.interpolate(method=\"quadratic\")" 560 | ] 561 | }, 562 | { 563 | "cell_type": "code", 564 | "execution_count": 15, 565 | "metadata": {}, 566 | "outputs": [ 567 | { 568 | "data": { 569 | "text/html": [ 570 | "
\n", 571 | "\n", 584 | "\n", 585 | " \n", 586 | " \n", 587 | " \n", 588 | " \n", 589 | " \n", 590 | " \n", 591 | " \n", 592 | " \n", 593 | " \n", 594 | " \n", 595 | " \n", 596 | " \n", 597 | " \n", 598 | " \n", 599 | " \n", 600 | " \n", 601 | " \n", 602 | " \n", 603 | " \n", 604 | " \n", 605 | " \n", 606 | " \n", 607 | " \n", 608 | " \n", 609 | " \n", 610 | " \n", 611 | " \n", 612 | " \n", 613 | "
Sales
2010-01-311.0
2010-02-282.0
2010-03-313.0
2010-04-30NaN
2010-05-315.0
\n", 614 | "
" 615 | ], 616 | "text/plain": [ 617 | " Sales\n", 618 | "2010-01-31 1.0\n", 619 | "2010-02-28 2.0\n", 620 | "2010-03-31 3.0\n", 621 | "2010-04-30 NaN\n", 622 | "2010-05-31 5.0" 623 | ] 624 | }, 625 | "execution_count": 15, 626 | "metadata": {}, 627 | "output_type": "execute_result" 628 | } 629 | ], 630 | "source": [ 631 | "df.interpolate(limit=1, limit_direction=\"forward\")" 632 | ] 633 | }, 634 | { 635 | "cell_type": "code", 636 | "execution_count": null, 637 | "metadata": {}, 638 | "outputs": [], 639 | "source": [] 640 | } 641 | ], 642 | "metadata": { 643 | "kernelspec": { 644 | "display_name": "Python [conda env:machine_learning_cookbook]", 645 | "language": "python", 646 | "name": "conda-env-machine_learning_cookbook-py" 647 | }, 648 | "language_info": { 649 | "codemirror_mode": { 650 | "name": "ipython", 651 | "version": 3 652 | }, 653 | "file_extension": ".py", 654 | "mimetype": "text/x-python", 655 | "name": "python", 656 | "nbconvert_exporter": "python", 657 | "pygments_lexer": "ipython3", 658 | "version": "3.6.6" 659 | } 660 | }, 661 | "nbformat": 4, 662 | "nbformat_minor": 2 663 | } 664 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # machine-learning-with-python-cookbook-notes 2 | My jupyter notebooks/code samples from Chris Albon's Machine Learning with Python Cookbook 3 | 4 | https://nbviewer.jupyter.org/github/DustinAlandzes/machine-learning-with-python-cookbook-notes/tree/master/ 5 | 6 | # usage 7 | ``` 8 | git clone https://github.com/f00-/machine-learning-with-python-cookbook-notes.git 9 | cd machine-learning-with-python-cookbook-notes 10 | conda env create -f environment.yml 11 | source activate machine_learning_cookbook 12 | pip install -r requirements.txt 13 | jupyter notebook 14 | ``` 15 | -------------------------------------------------------------------------------- /environment.yml: -------------------------------------------------------------------------------- 1 | name: machine_learning_cookbook 2 | channels: 3 | - defaults 4 | dependencies: 5 | - _nb_ext_conf=0.4.0=py36_1 6 | - _tflow_1100_select=0.0.2=eigen 7 | - absl-py=0.4.0=py36h28b3542_0 8 | - anaconda-client=1.7.1=py36_0 9 | - appdirs=1.4.3=py36h28b3542_0 10 | - appnope=0.1.0=py36hf537a9a_0 11 | - asn1crypto=0.24.0=py36_0 12 | - astor=0.7.1=py36_0 13 | - attrs=18.1.0=py36_0 14 | - automat=0.7.0=py36_0 15 | - backcall=0.1.0=py36_0 16 | - blas=1.0=mkl 17 | - bleach=2.1.4=py36_0 18 | - ca-certificates=2018.03.07=0 19 | - certifi=2018.8.24=py36_1 20 | - cffi=1.11.5=py36h342bebf_0 21 | - chardet=3.0.4=py36_1 22 | - clyent=1.2.2=py36_1 23 | - constantly=15.1.0=py36h28b3542_0 24 | - cryptography=2.3.1=py36hdbc3d79_0 25 | - cycler=0.10.0=py36hfc81398_0 26 | - decorator=4.3.0=py36_0 27 | - entrypoints=0.2.3=py36_2 28 | - freetype=2.9.1=hb4e5f40_0 29 | - gast=0.2.0=py36_0 30 | - grpcio=1.12.1=py36hd9629dc_0 31 | - h5py=2.8.0=py36h3c9e6ae_0 32 | - hdf5=1.10.2=hfa1e0ec_1 33 | - html5lib=1.0.1=py36_0 34 | - hyperlink=18.0.0=py36_0 35 | - idna=2.7=py36_0 36 | - incremental=17.5.0=py36_0 37 | - intel-openmp=2018.0.3=0 38 | - ipykernel=4.8.2=py36_0 39 | - ipython=6.5.0=py36_0 40 | - ipython_genutils=0.2.0=py36h241746c_0 41 | - ipywidgets=7.4.0=py36_0 42 | - jedi=0.12.1=py36_0 43 | - jinja2=2.10=py36_0 44 | - jsonschema=2.6.0=py36hb385e00_0 45 | - jupyter_client=5.2.3=py36_0 46 | - jupyter_core=4.4.0=py36_0 47 | - keras=2.2.2=0 48 | - keras-applications=1.0.4=py36_1 49 | - 
keras-base=2.2.2=py36_0 50 | - keras-preprocessing=1.0.2=py36_1 51 | - kiwisolver=1.0.1=py36h0a44026_0 52 | - libcxx=4.0.1=h579ed51_0 53 | - libcxxabi=4.0.1=hebd6815_0 54 | - libedit=3.1.20170329=hb402a30_2 55 | - libffi=3.2.1=h475c297_4 56 | - libgfortran=3.0.1=h93005f0_2 57 | - libpng=1.6.34=he12f830_0 58 | - libprotobuf=3.6.0=hd9629dc_0 59 | - libsodium=1.0.16=h3efe00b_0 60 | - markdown=2.6.11=py36_0 61 | - markupsafe=1.0=py36h1de35cc_1 62 | - matplotlib=2.2.3=py36h54f8f79_0 63 | - mistune=0.8.3=py36h1de35cc_1 64 | - mkl=2018.0.3=1 65 | - mkl_fft=1.0.4=py36h5d10147_1 66 | - mkl_random=1.0.1=py36h5d10147_1 67 | - nb_anacondacloud=1.4.0=py36_0 68 | - nb_conda=2.2.1=py36_0 69 | - nb_conda_kernels=2.1.0=py36_0 70 | - nbconvert=5.3.1=py36_0 71 | - nbformat=4.4.0=py36h827af21_0 72 | - nbpresent=3.0.2=py36_1 73 | - ncurses=6.1=h0a44026_0 74 | - notebook=5.6.0=py36_0 75 | - numpy=1.15.0=py36h648b28d_0 76 | - numpy-base=1.15.0=py36h8a80b8c_0 77 | - openssl=1.0.2p=h1de35cc_0 78 | - pandas=0.23.4=py36h6440ff4_0 79 | - pandoc=2.2.1=h1a437c5_0 80 | - pandocfilters=1.4.2=py36_1 81 | - parso=0.3.1=py36_0 82 | - patsy=0.5.0=py36_0 83 | - pexpect=4.6.0=py36_0 84 | - pickleshare=0.7.4=py36hf512f8e_0 85 | - pip=10.0.1=py36_0 86 | - prometheus_client=0.3.1=py36_0 87 | - prompt_toolkit=1.0.15=py36haeda067_0 88 | - protobuf=3.6.0=py36h0a44026_0 89 | - ptyprocess=0.6.0=py36_0 90 | - pyasn1=0.4.4=py36_0 91 | - pyasn1-modules=0.2.2=py36_0 92 | - pycparser=2.18=py36_1 93 | - pygments=2.2.0=py36h240cd3f_0 94 | - pyopenssl=18.0.0=py36_0 95 | - pyparsing=2.2.0=py36_1 96 | - pysocks=1.6.8=py36_0 97 | - python=3.6.6=hc167b69_0 98 | - python-dateutil=2.7.3=py36_0 99 | - pytz=2018.5=py36_0 100 | - pyyaml=3.13=py36h1de35cc_0 101 | - pyzmq=17.1.2=py36h1de35cc_0 102 | - readline=7.0=hc1231fa_4 103 | - requests=2.19.1=py36_0 104 | - scikit-learn=0.19.1=py36hf9f1f73_0 105 | - scipy=1.1.0=py36hf1f7d93_0 106 | - seaborn=0.9.0=py36_0 107 | - send2trash=1.5.0=py36_0 108 | - service_identity=17.0.0=py36h28b3542_0 109 | - setuptools=40.0.0=py36_0 110 | - simplegeneric=0.8.1=py36_2 111 | - six=1.11.0=py36_1 112 | - sqlalchemy=1.2.11=py36h1de35cc_0 113 | - sqlite=3.24.0=ha441bb4_0 114 | - statsmodels=0.9.0=py36h1d22016_0 115 | - tensorboard=1.10.0=py36hdc36e2c_0 116 | - tensorflow=1.10.0=eigen_py36h0906837_0 117 | - tensorflow-base=1.10.0=eigen_py36h4f0eeca_0 118 | - termcolor=1.1.0=py36_1 119 | - terminado=0.8.1=py36_1 120 | - testpath=0.3.1=py36h625a49b_0 121 | - tk=8.6.7=h35a86e2_3 122 | - tornado=5.1=py36h1de35cc_0 123 | - traitlets=4.3.2=py36h65bd3ce_0 124 | - twisted=17.5.0=py36_0 125 | - urllib3=1.23=py36_0 126 | - wcwidth=0.1.7=py36h8c6ec74_0 127 | - webencodings=0.5.1=py36_1 128 | - werkzeug=0.14.1=py36_0 129 | - wheel=0.31.1=py36_0 130 | - widgetsnbextension=3.4.0=py36_0 131 | - xlrd=1.1.0=py36_1 132 | - xz=5.2.4=h1de35cc_4 133 | - yaml=0.1.7=hc338f04_2 134 | - zeromq=4.2.5=h0a44026_0 135 | - zlib=1.2.11=hf3cbc9b_2 136 | - zope=1.0=py36_0 137 | - zope.interface=4.5.0=py36h1de35cc_0 138 | - pip: 139 | - lifelines==0.14.6 140 | prefix: /Users/f00/anaconda/envs/machine_learning_cookbook 141 | 142 | -------------------------------------------------------------------------------- /model.pkl: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/DustinAlandzes/machine-learning-with-python-cookbook-notes/e48e1ed5e2dca6f2fdf4114988e7772ff6ff23bb/model.pkl -------------------------------------------------------------------------------- /requirements.txt: 
-------------------------------------------------------------------------------- 1 | absl-py==0.4.0 2 | anaconda-client==1.7.1 3 | appdirs==1.4.3 4 | appnope==0.1.0 5 | asn1crypto==0.24.0 6 | astor==0.7.1 7 | attrs==18.1.0 8 | Automat==0.7.0 9 | backcall==0.1.0 10 | bleach==3.1.4 11 | certifi==2018.8.24 12 | cffi==1.11.5 13 | chardet==3.0.4 14 | clyent==1.2.2 15 | constantly==15.1.0 16 | cryptography==3.2 17 | cycler==0.10.0 18 | decorator==4.3.0 19 | entrypoints==0.2.3 20 | gast==0.2.0 21 | grpcio==1.12.1 22 | h5py==2.8.0 23 | html5lib==1.0.1 24 | hyperlink==18.0.0 25 | idna==2.7 26 | incremental==17.5.0 27 | ipykernel==4.8.2 28 | ipython==6.5.0 29 | ipython-genutils==0.2.0 30 | ipywidgets==7.4.0 31 | jedi==0.12.1 32 | Jinja2==2.10.1 33 | jsonschema==2.6.0 34 | jupyter-client==5.2.3 35 | jupyter-core==4.4.0 36 | Keras==2.2.2 37 | Keras-Applications==1.0.4 38 | Keras-Preprocessing==1.0.2 39 | kiwisolver==1.0.1 40 | lifelines==0.14.6 41 | Markdown==2.6.11 42 | MarkupSafe==1.0 43 | matplotlib==2.2.3 44 | mistune==0.8.3 45 | mkl-fft==1.0.4 46 | mkl-random==1.0.1 47 | nb-anacondacloud==1.4.0 48 | nb-conda==2.2.1 49 | nb-conda-kernels==2.1.0 50 | nbconvert==5.3.1 51 | nbformat==4.4.0 52 | nbpresent==3.0.2 53 | notebook==6.1.5 54 | numpy==1.15.0 55 | pandas==0.23.4 56 | pandocfilters==1.4.2 57 | parso==0.3.1 58 | patsy==0.5.0 59 | pexpect==4.6.0 60 | pickleshare==0.7.4 61 | prometheus-client==0.3.1 62 | prompt-toolkit==1.0.15 63 | protobuf==3.6.0 64 | ptyprocess==0.6.0 65 | pyasn1==0.4.4 66 | pyasn1-modules==0.2.2 67 | pycparser==2.18 68 | Pygments==2.2.0 69 | pyOpenSSL==18.0.0 70 | pyparsing==2.2.0 71 | PySocks==1.6.8 72 | python-dateutil==2.7.3 73 | pytz==2018.5 74 | PyYAML==5.1 75 | pyzmq==17.1.2 76 | requests==2.20.0 77 | scikit-learn==0.19.1 78 | scipy==1.1.0 79 | seaborn==0.9.0 80 | Send2Trash==1.5.0 81 | service-identity==17.0.0 82 | simplegeneric==0.8.1 83 | six==1.11.0 84 | SQLAlchemy==1.3.0 85 | statsmodels==0.9.0 86 | tensorboard==1.10.0 87 | tensorflow==2.3.1 88 | termcolor==1.1.0 89 | terminado==0.8.1 90 | testpath==0.3.1 91 | tornado==5.1 92 | traitlets==4.3.2 93 | Twisted==20.3.0 94 | urllib3==1.24.2 95 | wcwidth==0.1.7 96 | webencodings==0.5.1 97 | Werkzeug==0.15.3 98 | widgetsnbextension==3.4.0 99 | xlrd==1.1.0 100 | zope.interface==4.5.0 101 | -------------------------------------------------------------------------------- /sample.db: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/DustinAlandzes/machine-learning-with-python-cookbook-notes/e48e1ed5e2dca6f2fdf4114988e7772ff6ff23bb/sample.db --------------------------------------------------------------------------------