├── cv_example.png ├── cv_example_2.png ├── README.md └── Cross Validation done wrong.ipynb /cv_example.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/mottalrd/cross-validation-done-wrong/HEAD/cv_example.png -------------------------------------------------------------------------------- /cv_example_2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/mottalrd/cross-validation-done-wrong/HEAD/cv_example_2.png -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Cross validation done wrong 2 | 3 | This is an IPython notebook showing common errors in cross validation using the scikit-learn framework. 4 | This notebook has been published as a blog post on [my blog](http://www.alfredo.motta.name/) 5 | -------------------------------------------------------------------------------- /Cross Validation done wrong.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Cross validation done wrong\n", 8 | "\n", 9 | "Cross validation is an essential tool in statistical learning for estimating the accuracy of your machine learning algorithm. Despite its great power, when done wrong it can severely bias your accuracy estimate.\n", 10 | "\n", 11 | "In this blog post I'll use the Python scikit-learn framework to demonstrate how to avoid the biggest and most common pitfall of cross validation in your experiments." 12 | ] 13 | }, 14 | { 15 | "cell_type": "markdown", 16 | "metadata": {}, 17 | "source": [ 18 | "## Theory first\n", 19 | "\n", 20 | "Cross validation involves randomly dividing the set of observations into `k` groups (or folds) of approximately equal size. The first fold is treated as a validation set, and the machine learning algorithm is trained on the remaining `k-1` folds. The mean squared error is then computed on the held-out fold. This procedure is repeated `k` times; each time, a different group of observations is treated as the validation set.\n", 21 | "\n", 22 | "This process results in `k` estimates of the MSE, namely $MSE_1$, $MSE_2$, ..., $MSE_k$. The cross validation estimate of the MSE is then computed by simply averaging these values: \n", 23 | "$$CV_{(k)} = \frac{1}{k} \sum_{i=1}^k MSE_i$$\n", 24 | "\n", 25 | "This value is an _estimate_, say $\hat{MSE}$, of the real $MSE$, and our goal is to make this estimate as accurate as possible." 26 | ] 27 | }, 28 | { 29 | "cell_type": "markdown", 30 | "metadata": {}, 31 | "source": [ 32 | "## Hands on\n", 33 | "\n", 34 | "Let's now have a look at one of the most typical mistakes made when using cross validation. When cross validation is done wrong, the estimate $\hat{MSE}$ does not reflect the real value. In other words, you may think that you just found a perfect machine learning algorithm with an incredibly low $MSE$, while in reality you have simply applied CV incorrectly.\n", 35 | "\n", 36 | "I'll first show you - hands on - a wrong application of cross validation, and then we will fix it together. To keep things simple, let's generate some random data and pretend that we want to build a machine learning algorithm to predict the outcome."
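As a quick aside, the averaging step in the $CV_{(k)}$ formula above is nothing more than the plain mean of the per-fold errors. A minimal sketch, using made-up per-fold MSE values rather than anything computed in this notebook:

```python
import numpy as np

# Hypothetical per-fold mean squared errors from a 5-fold cross validation
fold_mses = np.array([0.21, 0.35, 0.28, 0.31, 0.25])

# CV_(k) = (1/k) * sum_i MSE_i, i.e. the plain average of the k fold errors
cv_estimate = fold_mses.mean()
print(cv_estimate)  # ~0.28
```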
37 | ] 38 | }, 39 | { 40 | "cell_type": "markdown", 41 | "metadata": {}, 42 | "source": [ 43 | "### Dataset generation" 44 | ] 45 | }, 46 | { 47 | "cell_type": "code", 48 | "execution_count": 12, 49 | "metadata": { 50 | "collapsed": false 51 | }, 52 | "outputs": [], 53 | "source": [ 54 | "# Import numpy and pandas\n", 55 | "import numpy as np\n", 56 | "import pandas as pd\n", 57 | "from pandas import *\n", 58 | "# Import scikit-learn\n", 59 | "from sklearn.linear_model import LogisticRegression\n", 60 | "from sklearn.cross_validation import *\n", 61 | "from sklearn.metrics import *\n", 62 | "import random" 63 | ] 64 | }, 65 | { 66 | "cell_type": "markdown", 67 | "metadata": {}, 68 | "source": [ 69 | "I'll first generate a dataset of $100$ entries. Each entry has $10,000$ features. But why so many? To demonstrate the issue I need some correlation between our inputs and the output that arises purely by chance. You'll understand why later in this post." 70 | ] 71 | }, 72 | { 73 | "cell_type": "code", 74 | "execution_count": 13, 75 | "metadata": { 76 | "collapsed": false 77 | }, 78 | "outputs": [], 79 | "source": [ 80 | "np.random.seed(0)\n", 81 | "features = np.random.randint(0, 10, size=[100, 10000])\n", 82 | "target = np.random.randint(0, 2, size=100)" 83 | ] 84 | }, 85 | { 86 | "cell_type": "code", 87 | "execution_count": 14, 88 | "metadata": { 89 | "collapsed": false 90 | }, 91 | "outputs": [ 92 | { 93 | "data": { 94 | "text/html": [ 95 | "
[Stripped HTML rendering of df.head() omitted; the same 5 rows × 10001 columns appear in the text/plain output below.]
" 248 | ], 249 | "text/plain": [ 250 | " 0 1 2 3 4 5 6 7 8 9 ... 9991 9992 9993 9994 9995 9996 \\\n", 251 | "0 5 0 3 3 7 9 3 5 2 4 ... 7 7 4 1 2 8 \n", 252 | "1 9 9 7 9 3 7 1 0 2 2 ... 8 7 9 3 3 0 \n", 253 | "2 9 3 9 3 2 6 3 9 0 7 ... 0 8 7 2 3 4 \n", 254 | "3 3 2 7 0 1 4 2 1 2 1 ... 1 9 6 7 9 1 \n", 255 | "4 5 2 6 6 2 6 6 1 7 6 ... 5 1 5 8 1 0 \n", 256 | "\n", 257 | " 9997 9998 9999 target \n", 258 | "0 0 8 0 1 \n", 259 | "1 1 0 1 1 \n", 260 | "2 4 8 7 0 \n", 261 | "3 0 5 9 1 \n", 262 | "4 6 7 1 1 \n", 263 | "\n", 264 | "[5 rows x 10001 columns]" 265 | ] 266 | }, 267 | "execution_count": 14, 268 | "metadata": {}, 269 | "output_type": "execute_result" 270 | } 271 | ], 272 | "source": [ 273 | "df = DataFrame(features)\n", 274 | "df['target'] = target\n", 275 | "df.head()" 276 | ] 277 | }, 278 | { 279 | "cell_type": "markdown", 280 | "metadata": {}, 281 | "source": [ 282 | "### Feature selection\n", 283 | "\n", 284 | "At this point we would like to know which features are most useful for training our predictor. This is called _feature selection_. The simplest approach is to find which of the $10,000$ input features are most correlated with the target. Using `pandas` this is very easy to do thanks to the [`corr()`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.corr.html) function. We run `corr()` on our dataframe, sort the correlation values, and pick the top two features. " 285 | ] 286 | }, 287 | { 288 | "cell_type": "code", 289 | "execution_count": 15, 290 | "metadata": { 291 | "collapsed": false 292 | }, 293 | "outputs": [], 294 | "source": [ 295 | "corr = df.corr()['target'][df.corr()['target'] < 1].abs()\n", 296 | "corr.sort(ascending=False)" 297 | ] 298 | }, 299 | { 300 | "cell_type": "code", 301 | "execution_count": 17, 302 | "metadata": { 303 | "collapsed": false 304 | }, 305 | "outputs": [ 306 | { 307 | "data": { 308 | "text/plain": [ 309 | "8487 0.428223\n", 310 | "3555 0.398636\n", 311 | "627 0.365970\n", 312 | "3987 0.361673\n", 313 | "1409 0.357135\n", 314 | "Name: target, dtype: float64" 315 | ] 316 | }, 317 | "execution_count": 17, 318 | "metadata": {}, 319 | "output_type": "execute_result" 320 | } 321 | ], 322 | "source": [ 323 | "corr.head()" 324 | ] 325 | }, 326 | { 327 | "cell_type": "markdown", 328 | "metadata": {}, 329 | "source": [ 330 | "### Start the training\n", 331 | "\n", 332 | "Great! Out of the $10,000$ features we have been able to select two of them, i.e. features number $8487$ and $3555$, which have a $0.42$ and a $0.39$ correlation with the output. At this point let's just drop all the other columns and use these two features to train a simple `LogisticRegression`. We then use scikit-learn's `cross_val_score` to produce $\hat{MSE}$."
333 | ] 334 | }, 335 | { 336 | "cell_type": "code", 337 | "execution_count": 18, 338 | "metadata": { 339 | "collapsed": true 340 | }, 341 | "outputs": [], 342 | "source": [ 343 | "features = corr.index[[0,1]].values" 344 | ] 345 | }, 346 | { 347 | "cell_type": "code", 348 | "execution_count": 19, 349 | "metadata": { 350 | "collapsed": false 351 | }, 352 | "outputs": [], 353 | "source": [ 354 | "training_input = df[features].values\n", 355 | "training_output = df['target']" 356 | ] 357 | }, 358 | { 359 | "cell_type": "code", 360 | "execution_count": 20, 361 | "metadata": { 362 | "collapsed": true 363 | }, 364 | "outputs": [], 365 | "source": [ 366 | "logreg = LogisticRegression()" 367 | ] 368 | }, 369 | { 370 | "cell_type": "code", 371 | "execution_count": 21, 372 | "metadata": { 373 | "collapsed": false 374 | }, 375 | "outputs": [], 376 | "source": [ 377 | "# scikit-learn returns the negative of the MSE\n", 378 | "# http://stackoverflow.com/questions/21443865/scikit-learn-cross-validation-negative-values-with-mean-squared-error\n", 379 | "mse_estimate = -1 * cross_val_score(logreg, training_input, training_output, cv=10, scoring='mean_squared_error')" 380 | ] 381 | }, 382 | { 383 | "cell_type": "code", 384 | "execution_count": 35, 385 | "metadata": { 386 | "collapsed": false 387 | }, 388 | "outputs": [ 389 | { 390 | "data": { 391 | "text/plain": [ 392 | "array([ 0.45454545, 0.2 , 0.2 , 0.1 , 0.1 ,\n", 393 | " 0. , 0.3 , 0.4 , 0.3 , 0.44444444])" 394 | ] 395 | }, 396 | "execution_count": 35, 397 | "metadata": {}, 398 | "output_type": "execute_result" 399 | } 400 | ], 401 | "source": [ 402 | "mse_estimate" 403 | ] 404 | }, 405 | { 406 | "cell_type": "code", 407 | "execution_count": 24, 408 | "metadata": { 409 | "collapsed": false 410 | }, 411 | "outputs": [ 412 | { 413 | "data": { 414 | "text/plain": [ 415 | "0 0.249899\n", 416 | "dtype: float64" 417 | ] 418 | }, 419 | "execution_count": 24, 420 | "metadata": {}, 421 | "output_type": "execute_result" 422 | } 423 | ], 424 | "source": [ 425 | "DataFrame(mse_estimate).mean()" 426 | ] 427 | }, 428 | { 429 | "cell_type": "markdown", 430 | "metadata": {}, 431 | "source": [ 432 | "### Knowledge leaking\n", 433 | "\n", 434 | "According to the previous estimate we built a system that can predict a random noise target from a random noise input with an $MSE$ of just $0.249$. The result is, as you might expect, wrong. But why? \n", 435 | "\n", 436 | "The reason is rather counterintuitive, and this is why this mistake is so common. When we applied the feature selection we used information from both the training and the test folds used for the cross validation, i.e. the correlation values. As a consequence our LogisticRegression knew information about the test folds that was supposed to be hidden from it. In fact, when you are computing $MSE_i$ in the i-th iteration of the cross validation you should be using only the information in the training fold, and nothing should come from the test fold. In our case the model did indeed have information from the test fold, i.e. the top correlated features. I think the term **knowledge leaking** expresses this concept fairly well. \n", 437 | "\n", 438 | "The schema that follows shows how the knowledge leaked into the LogisticRegression because the feature selection was applied before the cross validation procedure started. The model knows something about the data highlighted in yellow that it shouldn't know: its top correlated features."
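The general cure, which we will implement by hand with an explicit fold loop later in this post, is to make the feature selection part of the model itself so that it is re-fit on every training fold. As a sketch only (this is not what the notebook does, and it uses a univariate score rather than the raw correlation used above, but the principle is the same), scikit-learn's `Pipeline` combined with `SelectKBest` keeps the selection inside each fold:

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.cross_validation import cross_val_score  # sklearn.model_selection in newer releases

# The selector is re-fit on each training fold, so the test fold
# never influences which two features are picked.
pipeline = Pipeline([
    ('select', SelectKBest(f_classif, k=2)),
    ('logreg', LogisticRegression()),
])

# 'mean_squared_error' is the scoring name in the sklearn version used by this notebook;
# newer releases call it 'neg_mean_squared_error'.
leak_free_mse = -1 * cross_val_score(pipeline,
                                     df.drop('target', axis=1).values,
                                     df['target'].values,
                                     cv=10,
                                     scoring='mean_squared_error')
print(leak_free_mse.mean())
```

With the selection confined to the training folds, you should expect an estimate in the same ballpark as the manual fold loop shown in the "Cross validation done right" section below.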
439 | ] 440 | }, 441 | { 442 | "cell_type": "markdown", 443 | "metadata": {}, 444 | "source": [ 445 | "![Knowledge leaking](cv_example.png)\n", 446 | "Figure 1. _The exposed knowledge leak. The LogisticRegression knows the top correlated features of the entire dataset (hence including the test folds) because of the initial correlation operation, whilst it should be exposed only to the training fold information._" 447 | ] 448 | }, 449 | { 450 | "cell_type": "markdown", 451 | "metadata": {}, 452 | "source": [ 453 | "### Proof that our model is biased\n", 454 | "To check that we were actually wrong, let's do the following: \n", 455 | "* Take out a portion of the data set (take_out_set).\n", 456 | "* Train the LogisticRegression on the remaining data using the same feature selection we did before.\n", 457 | "* After the training is done, check the $MSE$ on the take_out_set. \n", 458 | "\n", 459 | "Is the $MSE$ on the take_out_set similar to the $\hat{MSE}$ we estimated with the CV? \n", 460 | "The answer is no: we get a much more reasonable $MSE$ of $0.53$, which is much higher than the $\hat{MSE}$ of $0.249$. " 461 | ] 462 | }, 463 | { 464 | "cell_type": "code", 465 | "execution_count": 25, 466 | "metadata": { 467 | "collapsed": false 468 | }, 469 | "outputs": [], 470 | "source": [ 471 | "take_out_set = df.ix[random.sample(df.index, 30)]" 472 | ] 473 | }, 474 | { 475 | "cell_type": "code", 476 | "execution_count": 26, 477 | "metadata": { 478 | "collapsed": false 479 | }, 480 | "outputs": [], 481 | "source": [ 482 | "training_set = df[~(df.isin(take_out_set)).all(axis=1)]" 483 | ] 484 | }, 485 | { 486 | "cell_type": "code", 487 | "execution_count": 27, 488 | "metadata": { 489 | "collapsed": true 490 | }, 491 | "outputs": [], 492 | "source": [ 493 | "corr = training_set.corr()['target'][training_set.corr()['target'] < 1].abs()\n", 494 | "corr.sort(ascending=False)\n", 495 | "features = corr.index[[0,1]].values" 496 | ] 497 | }, 498 | { 499 | "cell_type": "code", 500 | "execution_count": 28, 501 | "metadata": { 502 | "collapsed": false 503 | }, 504 | "outputs": [], 505 | "source": [ 506 | "training_input = training_set[features].values\n", 507 | "training_output = training_set['target']" 508 | ] 509 | }, 510 | { 511 | "cell_type": "code", 512 | "execution_count": 29, 513 | "metadata": { 514 | "collapsed": true 515 | }, 516 | "outputs": [], 517 | "source": [ 518 | "logreg = LogisticRegression()" 519 | ] 520 | }, 521 | { 522 | "cell_type": "code", 523 | "execution_count": 30, 524 | "metadata": { 525 | "collapsed": false 526 | }, 527 | "outputs": [ 528 | { 529 | "data": { 530 | "text/plain": [ 531 | "LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,\n", 532 | " intercept_scaling=1, max_iter=100, multi_class='ovr',\n", 533 | " penalty='l2', random_state=None, solver='liblinear', tol=0.0001,\n", 534 | " verbose=0)" 535 | ] 536 | }, 537 | "execution_count": 30, 538 | "metadata": {}, 539 | "output_type": "execute_result" 540 | } 541 | ], 542 | "source": [ 543 | "logreg.fit(training_input, training_output)" 544 | ] 545 | }, 546 | { 547 | "cell_type": "code", 548 | "execution_count": 31, 549 | "metadata": { 550 | "collapsed": false 551 | }, 552 | "outputs": [], 553 | "source": [ 554 | "y_take_out = logreg.predict(take_out_set[features])" 555 | ] 556 | }, 557 | { 558 | "cell_type": "code", 559 | "execution_count": 32, 560 | "metadata": { 561 | "collapsed": false 562 | }, 563 | "outputs": [ 564 | { 565 | "data": { 566 | "text/plain": [ 567 | "0.53333333333333333" 568 | ] 569 | }, 570 | "execution_count": 32, 571 |
"metadata": {}, 572 | "output_type": "execute_result" 573 | } 574 | ], 575 | "source": [ 576 | "mean_squared_error(take_out_set.target, y_take_out)" 577 | ] 578 | }, 579 | { 580 | "cell_type": "markdown", 581 | "metadata": { 582 | "collapsed": true 583 | }, 584 | "source": [ 585 | "## Cross validation done right\n", 586 | "\n", 587 | "In the previous section we saw that if you inject test knowledge into your model, your cross validation procedure will be biased. To avoid this, let's compute the feature correlations inside each cross validation iteration. The difference is that now the feature correlations use only the information in the training fold instead of the entire dataset; using the entire dataset is exactly what caused the bias we saw previously. The following graph shows the revisited procedure. This time we get a realistic $\hat{MSE}$ of $0.44$, which confirms that the data is randomly distributed.\n", 588 | "\n", 589 | "![Revisited cross validation workflow](cv_example_2.png)\n", 590 | "Figure 2. _Revisited cross validation workflow with the correlation step performed separately for each of the K train/test fold splits._" 591 | ] 592 | }, 593 | { 594 | "cell_type": "code", 595 | "execution_count": 33, 596 | "metadata": { 597 | "collapsed": false 598 | }, 599 | "outputs": [ 600 | { 601 | "name": "stdout", 602 | "output_type": "stream", 603 | "text": [ 604 | "Processing fold 0\n", 605 | "Processing fold 1\n", 606 | "Processing fold 2\n", 607 | "Processing fold 3\n", 608 | "Processing fold 4\n", 609 | "Processing fold 5\n", 610 | "Processing fold 6\n", 611 | "Processing fold 7\n", 612 | "Processing fold 8\n", 613 | "Processing fold 9\n", 614 | "0 0.441212\n", 615 | "dtype: float64\n" 616 | ] 617 | } 618 | ], 619 | "source": [ 620 | "kf = StratifiedKFold(df['target'], n_folds=10)\n", 621 | "mse = []\n", 622 | "fold_count = 0\n", 623 | "for train, test in kf:\n", 624 | "    print(\"Processing fold %s\" % fold_count)\n", 625 | "    train_fold = df.ix[train]\n", 626 | "    test_fold = df.ix[test]\n", 627 | "\n", 628 | "    # Find the best features using only the training fold\n", 629 | "    corr = train_fold.corr()['target'][train_fold.corr()['target'] < 1].abs()\n", 630 | "    corr.sort(ascending=False)\n", 631 | "    features = corr.index[[0,1]].values\n", 632 | "\n", 633 | "    # Get the training examples\n", 634 | "    train_fold_input = train_fold[features].values\n", 635 | "    train_fold_output = train_fold['target']\n", 636 | "\n", 637 | "    # Fit the logistic regression on the training fold\n", 638 | "    logreg = LogisticRegression()\n", 639 | "    logreg.fit(train_fold_input, train_fold_output)\n", 640 | "\n", 641 | "    # Check the MSE on the test fold\n", 642 | "    pred = logreg.predict(test_fold[features])\n", 643 | "    mse.append(mean_squared_error(test_fold.target, pred))\n", 644 | "\n", 645 | "    # Done with the fold\n", 646 | "    fold_count += 1\n", 647 | "\n", 648 | "print(DataFrame(mse).mean())" 649 | ] 650 | }, 651 | { 652 | "cell_type": "code", 653 | "execution_count": 34, 654 | "metadata": { 655 | "collapsed": false 656 | }, 657 | "outputs": [ 658 | { 659 | "data": { 660 | "text/plain": [ 661 | "0 0.441212\n", 662 | "dtype: float64" 663 | ] 664 | }, 665 | "execution_count": 34, 666 | "metadata": {}, 667 | "output_type": "execute_result" 668 | } 669 | ], 670 | "source": [ 671 | "DataFrame(mse).mean()" 672 | ] 673 | }, 674 | { 675 | "cell_type": "markdown", 676 | "metadata": {}, 677 | "source": [ 678 | "## Conclusion\n", 679 | "\n", 680 | "We have seen how doing feature selection at the wrong step when using cross validation can severely bias the `MSE` estimate of your machine learning algorithm.
We have also seen how to apply cross validation correctly, by simply moving the feature selection step inside the cross validation loop so that knowledge from the test data does not leak into our learning procedure. If you want to know more about cross validation and its tradeoffs, both R. Kohavi [3] and Y. Bengio with Y. Grandvalet [4] have written on this topic.\n" 681 | ] 682 | }, 683 | { 684 | "cell_type": "markdown", 685 | "metadata": {}, 686 | "source": [ 687 | "### References\n", 688 | "\n", 689 | "[1] [Lecture 1 on cross validation - Statistical Learning @ Stanford](https://www.youtube.com/watch?v=nZAM5OXrktY) \n", 690 | "[2] [Lecture 2 on cross validation - Statistical Learning @ Stanford](https://www.youtube.com/watch?v=S06JpVoNaA0) \n", 691 | "[3] [Ron Kohavi. A study of cross-validation and bootstrap for accuracy estimation and model selection](http://dl.acm.org/citation.cfm?id=1643047) \n", 692 | "[4] [Y. Bengio and Y. Grandvalet. No unbiased estimator of the variance of k-fold cross-validation](http://dl.acm.org/citation.cfm?id=1044695)\n" 693 | ] 694 | } 695 | ], 696 | "metadata": { 697 | "kernelspec": { 698 | "display_name": "Python 2", 699 | "language": "python", 700 | "name": "python2" 701 | }, 702 | "language_info": { 703 | "codemirror_mode": { 704 | "name": "ipython", 705 | "version": 2 706 | }, 707 | "file_extension": ".py", 708 | "mimetype": "text/x-python", 709 | "name": "python", 710 | "nbconvert_exporter": "python", 711 | "pygments_lexer": "ipython2", 712 | "version": "2.7.6" 713 | } 714 | }, 715 | "nbformat": 4, 716 | "nbformat_minor": 0 717 | } 718 | --------------------------------------------------------------------------------