├── .gitignore
├── 01 - Introduction to Scikit-learn.ipynb
├── 02 - Unsupervised Transformers.ipynb
├── 03 - Cross-validation.ipynb
├── 04 - Grid Searches for Hyper Parameters.ipynb
├── 05 - Preprocessing and Pipelines.ipynb
├── 06 - Working With Text Data.ipynb
├── 07 - Out Of Core Learning.ipynb
├── 08 - Out Of Core Learning for Text.ipynb
├── LICENSE
├── Readme.md
├── data
│   └── aclImdb.tar.bz2
├── machine-learning-with-scikit-learn-nyc-ml-meetup-2016.odp
└── machine-learning-with-scikit-learn-nyc-ml-meetup-2016.pdf


/.gitignore:
--------------------------------------------------------------------------------
1 | data/aclImdb/*
2 | .ipynb_checkpoints/*
3 | data/batch*
4 | data/movies.txt
5 | 


--------------------------------------------------------------------------------
/01 - Introduction to Scikit-learn.ipynb:
--------------------------------------------------------------------------------
  1 | {
  2 |  "cells": [
  3 |   {
  4 |    "cell_type": "markdown",
  5 |    "metadata": {},
  6 |    "source": [
  7 |     "Get some data to play with"
  8 |    ]
  9 |   },
 10 |   {
 11 |    "cell_type": "code",
 12 |    "execution_count": null,
 13 |    "metadata": {
 14 |     "collapsed": false
 15 |    },
 16 |    "outputs": [],
 17 |    "source": [
 18 |     "from sklearn.datasets import load_digits\n",
 19 |     "digits = load_digits()\n",
 20 |     "digits.keys()"
 21 |    ]
 22 |   },
 23 |   {
 24 |    "cell_type": "code",
 25 |    "execution_count": null,
 26 |    "metadata": {
 27 |     "collapsed": false
 28 |    },
 29 |    "outputs": [],
 30 |    "source": [
 31 |     "digits.images.shape"
 32 |    ]
 33 |   },
 34 |   {
 35 |    "cell_type": "code",
 36 |    "execution_count": null,
 37 |    "metadata": {
 38 |     "collapsed": false
 39 |    },
 40 |    "outputs": [],
 41 |    "source": [
 42 |     "print(digits.images[0])"
 43 |    ]
 44 |   },
 45 |   {
 46 |    "cell_type": "code",
 47 |    "execution_count": null,
 48 |    "metadata": {
 49 |     "collapsed": false
 50 |    },
 51 |    "outputs": [],
 52 |    "source": [
 53 |     "import matplotlib.pyplot as plt\n",
 54 |     "%matplotlib notebook\n",
 55 |     "\n",
 56 |     "plt.matshow(digits.images[0], cmap=plt.cm.Greys)"
 57 |    ]
 58 |   },
 59 |   {
 60 |    "cell_type": "code",
 61 |    "execution_count": null,
 62 |    "metadata": {
 63 |     "collapsed": false
 64 |    },
 65 |    "outputs": [],
 66 |    "source": [
 67 |     "digits.data.shape"
 68 |    ]
 69 |   },
 70 |   {
 71 |    "cell_type": "code",
 72 |    "execution_count": null,
 73 |    "metadata": {
 74 |     "collapsed": false
 75 |    },
 76 |    "outputs": [],
 77 |    "source": [
 78 |     "digits.target.shape"
 79 |    ]
 80 |   },
 81 |   {
 82 |    "cell_type": "code",
 83 |    "execution_count": null,
 84 |    "metadata": {
 85 |     "collapsed": false
 86 |    },
 87 |    "outputs": [],
 88 |    "source": [
 89 |     "digits.target"
 90 |    ]
 91 |   },
 92 |   {
 93 |    "cell_type": "markdown",
 94 |    "metadata": {},
 95 |    "source": [
 96 |     "**Data is always a numpy array (or sparse matrix) of shape (n_samples, n_features)**"
 97 |    ]
 98 |   },
 99 |   {
100 |    "cell_type": "markdown",
101 |    "metadata": {},
102 |    "source": [
103 |     "Split the data to get going"
104 |    ]
105 |   },
106 |   {
107 |    "cell_type": "code",
108 |    "execution_count": null,
109 |    "metadata": {
110 |     "collapsed": false
111 |    },
112 |    "outputs": [],
113 |    "source": [
114 |     "from sklearn.model_selection import train_test_split\n",
115 |     "X_train, X_test, y_train, y_test = train_test_split(digits.data,\n",
116 |     "                                                    digits.target)"
117 |    ]
118 |   },
119 |   {
120 |    "cell_type": "markdown",
121 |    "metadata": {},
122 |    "source": [
123 |     "Really Simple API\n",
124 |     "-------------------\n",
125 |     "0) Import your model class"
126 |    ]
127 |   },
128 |   {
129 |    "cell_type": "code",
130 |    "execution_count": null,
131 |    "metadata": {
132 |     "collapsed": false
133 |    },
134 |    "outputs": [],
135 |    "source": [
136 |     "from sklearn.svm import LinearSVC"
137 |    ]
138 |   },
139 |   {
140 |    "cell_type": "markdown",
141 |    "metadata": {},
142 |    "source": [
143 |     "1) Instantiate an object and set the parameters"
144 |    ]
145 |   },
146 |   {
147 |    "cell_type": "code",
148 |    "execution_count": null,
149 |    "metadata": {
150 |     "collapsed": false
151 |    },
152 |    "outputs": [],
153 |    "source": [
154 |     "svm = LinearSVC(C=0.1)"
155 |    ]
156 |   },
157 |   {
158 |    "cell_type": "markdown",
159 |    "metadata": {},
160 |    "source": [
161 |     "2) Fit the model"
162 |    ]
163 |   },
164 |   {
165 |    "cell_type": "code",
166 |    "execution_count": null,
167 |    "metadata": {
168 |     "collapsed": false
169 |    },
170 |    "outputs": [],
171 |    "source": [
172 |     "svm.fit(X_train, y_train)"
173 |    ]
174 |   },
175 |   {
176 |    "cell_type": "markdown",
177 |    "metadata": {},
178 |    "source": [
179 |     "3) Apply / evaluate"
180 |    ]
181 |   },
182 |   {
183 |    "cell_type": "code",
184 |    "execution_count": null,
185 |    "metadata": {
186 |     "collapsed": false
187 |    },
188 |    "outputs": [],
189 |    "source": [
190 |     "print(svm.predict(X_train))\n",
191 |     "print(y_train)"
192 |    ]
193 |   },
194 |   {
195 |    "cell_type": "code",
196 |    "execution_count": null,
197 |    "metadata": {
198 |     "collapsed": false
199 |    },
200 |    "outputs": [],
201 |    "source": [
202 |     "svm.score(X_train, y_train)"
203 |    ]
204 |   },
205 |   {
206 |    "cell_type": "code",
207 |    "execution_count": null,
208 |    "metadata": {
209 |     "collapsed": false
210 |    },
211 |    "outputs": [],
212 |    "source": [
213 |     "svm.score(X_test, y_test)"
214 |    ]
215 |   },
216 |   {
217 |    "cell_type": "markdown",
218 |    "metadata": {},
219 |    "source": [
220 |     "And again\n",
221 |     "---------"
222 |    ]
223 |   },
224 |   {
225 |    "cell_type": "code",
226 |    "execution_count": null,
227 |    "metadata": {
228 |     "collapsed": false
229 |    },
230 |    "outputs": [],
231 |    "source": [
232 |     "from sklearn.ensemble import RandomForestClassifier"
233 |    ]
234 |   },
235 |   {
236 |    "cell_type": "code",
237 |    "execution_count": null,
238 |    "metadata": {
239 |     "collapsed": false
240 |    },
241 |    "outputs": [],
242 |    "source": [
243 |     "rf = RandomForestClassifier(n_estimators=50)"
244 |    ]
245 |   },
246 |   {
247 |    "cell_type": "code",
248 |    "execution_count": null,
249 |    "metadata": {
250 |     "collapsed": false
251 |    },
252 |    "outputs": [],
253 |    "source": [
254 |     "rf.fit(X_train, y_train)"
255 |    ]
256 |   },
257 |   {
258 |    "cell_type": "code",
259 |    "execution_count": null,
260 |    "metadata": {
261 |     "collapsed": false
262 |    },
263 |    "outputs": [],
264 |    "source": [
265 |     "rf.score(X_test, y_test)"
266 |    ]
267 |   },
268 |   {
269 |    "cell_type": "code",
270 |    "execution_count": null,
271 |    "metadata": {
272 |     "collapsed": true
273 |    },
274 |    "outputs": [],
275 |    "source": []
276 |   }
277 |  ],
278 |  "metadata": {
279 |   "kernelspec": {
280 |    "display_name": "Python 3",
281 |    "language": "python",
282 |    "name": "python3"
283 |   },
284 |   "language_info": {
285 |    "codemirror_mode": {
286 |     "name": "ipython",
287 |     "version": 3
288 |    },
289 |    "file_extension": ".py",
290 |    "mimetype": "text/x-python",
291 |    "name": "python",
292 |    "nbconvert_exporter": "python",
293 |    "pygments_lexer": "ipython3",
294 |    "version": "3.4.4"
295 |   }
296 |  },
297 |  "nbformat": 4,
298 |  "nbformat_minor": 0
299 | }
300 | 
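The notebook above follows one pattern for both `LinearSVC` and `RandomForestClassifier`: instantiate, `fit`, then `predict` or `score`. The sketch below mimics that interface with a toy nearest-mean classifier in plain Python. `ToyNearestMean`, its `means_` attribute, and the toy data are hypothetical names made up for this illustration; this is not how scikit-learn implements either estimator.

```python
from collections import defaultdict


class ToyNearestMean:
    """Hypothetical toy classifier; it only mimics the fit/predict/score interface."""

    def fit(self, X, y):
        # Group the training rows by label and average them feature by feature.
        groups = defaultdict(list)
        for row, label in zip(X, y):
            groups[label].append(row)
        self.means_ = {
            label: [sum(col) / len(rows) for col in zip(*rows)]
            for label, rows in groups.items()
        }
        return self  # returning self allows ToyNearestMean().fit(X, y)

    def predict(self, X):
        def sq_dist(a, b):
            return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

        # Each row gets the label of the closest class mean.
        return [min(self.means_, key=lambda lab: sq_dist(row, self.means_[lab]))
                for row in X]

    def score(self, X, y):
        # Accuracy: fraction of rows whose predicted label matches the true one.
        predictions = self.predict(X)
        return sum(p == t for p, t in zip(predictions, y)) / len(y)


# Toy data, shaped (n_samples, n_features) like the arrays in the notebook.
X_toy_train = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [6.0, 5.0]]
y_toy_train = [0, 0, 1, 1]
X_toy_test = [[0.5, 0.5], [5.5, 5.0], [1.0, 0.0]]
y_toy_test = [0, 1, 1]

clf = ToyNearestMean()                      # 1) instantiate
clf.fit(X_toy_train, y_toy_train)           # 2) fit
toy_predictions = clf.predict(X_toy_test)   # 3) apply
toy_accuracy = clf.score(X_toy_test, y_toy_test)  # 3) evaluate
print(toy_predictions, toy_accuracy)
```

The third test point lies closer to the class-0 mean than to the class-1 mean, so it is misclassified and the toy accuracy is 2/3.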


--------------------------------------------------------------------------------
/02 - Unsupervised Transformers.ipynb:
--------------------------------------------------------------------------------
  1 | {
  2 |  "cells": [
  3 |   {
  4 |    "cell_type": "code",
  5 |    "execution_count": null,
  6 |    "metadata": {
  7 |     "collapsed": false
  8 |    },
  9 |    "outputs": [],
 10 |    "source": [
 11 |     "from sklearn.datasets import load_digits\n",
 12 |     "from sklearn.model_selection import train_test_split\n",
 13 |     "import numpy as np\n",
 14 |     "np.set_printoptions(suppress=True)\n",
 15 |     "\n",
 16 |     "digits = load_digits()\n",
 17 |     "X, y = digits.data, digits.target\n",
 18 |     "X_train, X_test, y_train, y_test = train_test_split(X, y)"
 19 |    ]
 20 |   },
 21 |   {
 22 |    "cell_type": "markdown",
 23 |    "metadata": {},
 24 |    "source": [
 25 |     "Removing mean and scaling variance\n",
 26 |     "==================================="
 27 |    ]
 28 |   },
 29 |   {
 30 |    "cell_type": "code",
 31 |    "execution_count": null,
 32 |    "metadata": {
 33 |     "collapsed": false
 34 |    },
 35 |    "outputs": [],
 36 |    "source": [
 37 |     "from sklearn.preprocessing import StandardScaler"
 38 |    ]
 39 |   },
 40 |   {
 41 |    "cell_type": "markdown",
 42 |    "metadata": {},
 43 |    "source": [
 44 |     "1) Instantiate the model"
 45 |    ]
 46 |   },
 47 |   {
 48 |    "cell_type": "code",
 49 |    "execution_count": null,
 50 |    "metadata": {
 51 |     "collapsed": false
 52 |    },
 53 |    "outputs": [],
 54 |    "source": [
 55 |     "scaler = StandardScaler()"
 56 |    ]
 57 |   },
 58 |   {
 59 |    "cell_type": "markdown",
 60 |    "metadata": {},
 61 |    "source": [
 62 |     "2) Fit using only the data."
 63 |    ]
 64 |   },
 65 |   {
 66 |    "cell_type": "code",
 67 |    "execution_count": null,
 68 |    "metadata": {
 69 |     "collapsed": false
 70 |    },
 71 |    "outputs": [],
 72 |    "source": [
 73 |     "scaler.fit(X_train)"
 74 |    ]
 75 |   },
 76 |   {
 77 |    "cell_type": "markdown",
 78 |    "metadata": {},
 79 |    "source": [
 80 |     "3) `transform` the data (not `predict`)."
 81 |    ]
 82 |   },
 83 |   {
 84 |    "cell_type": "code",
 85 |    "execution_count": null,
 86 |    "metadata": {
 87 |     "collapsed": false
 88 |    },
 89 |    "outputs": [],
 90 |    "source": [
 91 |     "X_train_scaled = scaler.transform(X_train)"
 92 |    ]
 93 |   },
 94 |   {
 95 |    "cell_type": "code",
 96 |    "execution_count": null,
 97 |    "metadata": {
 98 |     "collapsed": false
 99 |    },
100 |    "outputs": [],
101 |    "source": [
102 |     "X_train.shape"
103 |    ]
104 |   },
105 |   {
106 |    "cell_type": "code",
107 |    "execution_count": null,
108 |    "metadata": {
109 |     "collapsed": false
110 |    },
111 |    "outputs": [],
112 |    "source": [
113 |     "X_train_scaled.shape"
114 |    ]
115 |   },
116 |   {
117 |    "cell_type": "markdown",
118 |    "metadata": {},
119 |    "source": [
120 |     "The transformed version of the data has the mean removed:"
121 |    ]
122 |   },
123 |   {
124 |    "cell_type": "code",
125 |    "execution_count": null,
126 |    "metadata": {
127 |     "collapsed": false
128 |    },
129 |    "outputs": [],
130 |    "source": [
131 |     "X_train_scaled.mean(axis=0)"
132 |    ]
133 |   },
134 |   {
135 |    "cell_type": "code",
136 |    "execution_count": null,
137 |    "metadata": {
138 |     "collapsed": false
139 |    },
140 |    "outputs": [],
141 |    "source": [
142 |     "X_train_scaled.std(axis=0)"
143 |    ]
144 |   },
145 |   {
146 |    "cell_type": "code",
147 |    "execution_count": null,
148 |    "metadata": {
149 |     "collapsed": false
150 |    },
151 |    "outputs": [],
152 |    "source": [
153 |     "X_test_transformed = scaler.transform(X_test)"
154 |    ]
155 |   },
156 |   {
157 |    "cell_type": "code",
158 |    "execution_count": null,
159 |    "metadata": {
160 |     "collapsed": false
161 |    },
162 |    "outputs": [],
163 |    "source": [
164 |     "X_test_transformed.mean(axis=0)"
165 |    ]
166 |   },
167 |   {
168 |    "cell_type": "markdown",
169 |    "metadata": {},
170 |    "source": [
171 |     "Principal Component Analysis\n",
172 |     "============================="
173 |    ]
174 |   },
175 |   {
176 |    "cell_type": "markdown",
177 |    "metadata": {},
178 |    "source": [
179 |     "0) Import the model"
180 |    ]
181 |   },
182 |   {
183 |    "cell_type": "code",
184 |    "execution_count": null,
185 |    "metadata": {
186 |     "collapsed": false
187 |    },
188 |    "outputs": [],
189 |    "source": [
190 |     "from sklearn.decomposition import PCA"
191 |    ]
192 |   },
193 |   {
194 |    "cell_type": "markdown",
195 |    "metadata": {},
196 |    "source": [
197 |     "1) Instantiate the model"
198 |    ]
199 |   },
200 |   {
201 |    "cell_type": "code",
202 |    "execution_count": null,
203 |    "metadata": {
204 |     "collapsed": false
205 |    },
206 |    "outputs": [],
207 |    "source": [
208 |     "pca = PCA(n_components=2)"
209 |    ]
210 |   },
211 |   {
212 |    "cell_type": "markdown",
213 |    "metadata": {},
214 |    "source": [
215 |     "2) Fit to the data (the full `X` here, since the result is only used for visualization)"
216 |    ]
217 |   },
218 |   {
219 |    "cell_type": "code",
220 |    "execution_count": null,
221 |    "metadata": {
222 |     "collapsed": false
223 |    },
224 |    "outputs": [],
225 |    "source": [
226 |     "pca.fit(X)"
227 |    ]
228 |   },
229 |   {
230 |    "cell_type": "markdown",
231 |    "metadata": {},
232 |    "source": [
233 |     "3) Transform to lower-dimensional representation"
234 |    ]
235 |   },
236 |   {
237 |    "cell_type": "code",
238 |    "execution_count": null,
239 |    "metadata": {
240 |     "collapsed": false
241 |    },
242 |    "outputs": [],
243 |    "source": [
244 |     "X_pca = pca.transform(X)\n",
245 |     "print(X.shape)\n",
246 |     "X_pca.shape"
247 |    ]
248 |   },
249 |   {
250 |    "cell_type": "markdown",
251 |    "metadata": {},
252 |    "source": [
253 |     "Visualize\n",
254 |     "----------"
255 |    ]
256 |   },
257 |   {
258 |    "cell_type": "code",
259 |    "execution_count": null,
260 |    "metadata": {
261 |     "collapsed": false
262 |    },
263 |    "outputs": [],
264 |    "source": [
265 |     "import matplotlib.pyplot as plt\n",
266 |     "%matplotlib notebook\n",
267 |     "plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y)"
268 |    ]
269 |   },
270 |   {
271 |    "cell_type": "code",
272 |    "execution_count": null,
273 |    "metadata": {
274 |     "collapsed": false
275 |    },
276 |    "outputs": [],
277 |    "source": []
278 |   }
279 |  ],
280 |  "metadata": {
281 |   "kernelspec": {
282 |    "display_name": "Python 3",
283 |    "language": "python",
284 |    "name": "python3"
285 |   },
286 |   "language_info": {
287 |    "codemirror_mode": {
288 |     "name": "ipython",
289 |     "version": 3
290 |    },
291 |    "file_extension": ".py",
292 |    "mimetype": "text/x-python",
293 |    "name": "python",
294 |    "nbconvert_exporter": "python",
295 |    "pygments_lexer": "ipython3",
296 |    "version": "3.4.4"
297 |   }
298 |  },
299 |  "nbformat": 4,
300 |  "nbformat_minor": 0
301 | }
302 | 
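The scaler in the notebook above is fitted on `X_train` only and then reused on `X_test`, so the test-set column means are generally not exactly zero. A minimal plain-Python sketch of that idea follows. `fit_scaler` and `transform` are hypothetical helpers written for this illustration; the sketch assumes the population standard deviation (dividing by n) and leaves constant columns unscaled.

```python
import math


def fit_scaler(rows):
    """Return per-feature (means, stds) computed from `rows` only."""
    n = len(rows)
    columns = list(zip(*rows))
    means = [sum(col) / n for col in columns]
    stds = [math.sqrt(sum((v - m) ** 2 for v in col) / n)
            for col, m in zip(columns, means)]
    # Constant features have zero spread; divide by 1 so they stay unchanged.
    stds = [s if s > 0 else 1.0 for s in stds]
    return means, stds


def transform(rows, means, stds):
    """Apply previously learned statistics to any data set."""
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]


train = [[1.0, 10.0], [3.0, 10.0], [5.0, 10.0]]
test = [[3.0, 10.0], [7.0, 10.0]]

means, stds = fit_scaler(train)              # statistics come from the training rows
train_scaled = transform(train, means, stds)
test_scaled = transform(test, means, stds)   # reused unchanged on the test rows

train_col_means = [sum(col) / len(train_scaled) for col in zip(*train_scaled)]
test_col_means = [sum(col) / len(test_scaled) for col in zip(*test_scaled)]
print(train_col_means, test_col_means)
```

The training columns are centred at zero by construction, while the first test column is not, because its own mean differs from the training mean it was shifted by.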


--------------------------------------------------------------------------------
/03 - Cross-validation.ipynb:
--------------------------------------------------------------------------------
  1 | {
  2 |  "cells": [
  3 |   {
  4 |    "cell_type": "markdown",
  5 |    "metadata": {},
  6 |    "source": [
  7 |     "Cross-Validation\n",
  8 |     "----------------------------------------"
  9 |    ]
 10 |   },
 11 |   {
 12 |    "cell_type": "code",
 13 |    "execution_count": null,
 14 |    "metadata": {
 15 |     "collapsed": false
 16 |    },
 17 |    "outputs": [],
 18 |    "source": [
 19 |     "from sklearn.datasets import load_digits"
 20 |    ]
 21 |   },
 22 |   {
 23 |    "cell_type": "code",
 24 |    "execution_count": null,
 25 |    "metadata": {
 26 |     "collapsed": false
 27 |    },
 28 |    "outputs": [],
 29 |    "source": [
 30 |     "digits = load_digits()\n",
 31 |     "X = digits.data\n",
 32 |     "y = digits.target"
 33 |    ]
 34 |   },
 35 |   {
 36 |    "cell_type": "code",
 37 |    "execution_count": null,
 38 |    "metadata": {
 39 |     "collapsed": false
 40 |    },
 41 |    "outputs": [],
 42 |    "source": [
 43 |     "from sklearn.model_selection import cross_val_score\n",
 44 |     "from sklearn.svm import LinearSVC"
 45 |    ]
 46 |   },
 47 |   {
 48 |    "cell_type": "code",
 49 |    "execution_count": null,
 50 |    "metadata": {
 51 |     "collapsed": false
 52 |    },
 53 |    "outputs": [],
 54 |    "source": [
 55 |     "cross_val_score(LinearSVC(), X, y, cv=5)"
 56 |    ]
 57 |   },
 58 |   {
 59 |    "cell_type": "code",
 60 |    "execution_count": null,
 61 |    "metadata": {
 62 |     "collapsed": false
 63 |    },
 64 |    "outputs": [],
 65 |    "source": [
 66 |     "cross_val_score(LinearSVC(), X, y, cv=5, scoring=\"f1_macro\")"
 67 |    ]
 68 |   },
 69 |   {
 70 |    "cell_type": "code",
 71 |    "execution_count": null,
 72 |    "metadata": {
 73 |     "collapsed": false
 74 |    },
 75 |    "outputs": [],
 76 |    "source": [
 77 |     "from sklearn.metrics import get_scorer_names\n",
 78 |     "print(get_scorer_names())"
 79 |    ]
 80 |   },
 81 |   {
 82 |    "cell_type": "markdown",
 83 |    "metadata": {},
 84 |    "source": [
 85 |     "There are other ways to do cross-validation"
 86 |    ]
 87 |   },
 88 |   {
 89 |    "cell_type": "code",
 90 |    "execution_count": null,
 91 |    "metadata": {
 92 |     "collapsed": false
 93 |    },
 94 |    "outputs": [],
 95 |    "source": [
 96 |     "from sklearn.model_selection import ShuffleSplit\n",
 97 |     "\n",
 98 |     "shuffle_split = ShuffleSplit(n_splits=10, test_size=.4)\n",
 99 |     "cross_val_score(LinearSVC(), X, y, cv=shuffle_split)"
100 |    ]
101 |   }
102 |  ],
103 |  "metadata": {
104 |   "kernelspec": {
105 |    "display_name": "Python 3",
106 |    "language": "python",
107 |    "name": "python3"
108 |   },
109 |   "language_info": {
110 |    "codemirror_mode": {
111 |     "name": "ipython",
112 |     "version": 3
113 |    },
114 |    "file_extension": ".py",
115 |    "mimetype": "text/x-python",
116 |    "name": "python",
117 |    "nbconvert_exporter": "python",
118 |    "pygments_lexer": "ipython3",
119 |    "version": "3.4.4"
120 |   }
121 |  },
122 |  "nbformat": 4,
123 |  "nbformat_minor": 0
124 | }
125 | 
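Passing `cv=5` in the notebook above asks for the data to be split into five folds, each used once as the test set. The sketch below shows plain, unshuffled k-fold index splitting in pure Python. `k_fold_indices` is a hypothetical helper written for this illustration; it ignores class labels, whereas scikit-learn's splitters can also stratify or shuffle, so treat it as a picture of the idea only.

```python
def k_fold_indices(n_samples, n_folds):
    """Yield (train_indices, test_indices) for contiguous, unshuffled folds."""
    # Spread the remainder over the first folds so sizes differ by at most one.
    base, extra = divmod(n_samples, n_folds)
    start = 0
    for fold in range(n_folds):
        stop = start + base + (1 if fold < extra else 0)
        test_idx = list(range(start, stop))
        # Everything outside the current test block is used for training.
        train_idx = list(range(0, start)) + list(range(stop, n_samples))
        yield train_idx, test_idx
        start = stop


folds = list(k_fold_indices(n_samples=7, n_folds=3))
for train_idx, test_idx in folds:
    print("train:", train_idx, "test:", test_idx)
```

Every sample index lands in exactly one test fold, so each sample is scored once by a model that never saw it during training.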


--------------------------------------------------------------------------------
/04 - Grid Searches for Hyper Parameters.ipynb:
--------------------------------------------------------------------------------
  1 | {
  2 |  "cells": [
  3 |   {
  4 |    "cell_type": "markdown",
  5 |    "metadata": {},
  6 |    "source": [
  7 |     "Grid Searches\n",
  8 |     "================="
  9 |    ]
 10 |   },
 11 |   {
 12 |    "cell_type": "markdown",
 13 |    "metadata": {},
 14 |    "source": [
 15 |     "Grid search with built-in cross-validation"
 16 |    ]
 17 |   },
 18 |   {
 19 |    "cell_type": "code",
 20 |    "execution_count": null,
 21 |    "metadata": {
 22 |     "collapsed": false
 23 |    },
 24 |    "outputs": [],
 25 |    "source": [
 26 |     "from sklearn.model_selection import GridSearchCV\n",
 27 |     "from sklearn.svm import SVC"
 28 |    ]
 29 |   },
 30 |   {
 31 |    "cell_type": "code",
 32 |    "execution_count": null,
 33 |    "metadata": {
 34 |     "collapsed": false
 35 |    },
 36 |    "outputs": [],
 37 |    "source": [
 38 |     "from sklearn.datasets import load_digits\n",
 39 |     "from sklearn.model_selection import train_test_split\n",
 40 |     "digits = load_digits()\n",
 41 |     "X_train, X_test, y_train, y_test = train_test_split(digits.data, digits.target)"
 42 |    ]
 43 |   },
 44 |   {
 45 |    "cell_type": "markdown",
 46 |    "metadata": {},
 47 |    "source": [
 48 |     "Define parameter grid:"
 49 |    ]
 50 |   },
 51 |   {
 52 |    "cell_type": "code",
 53 |    "execution_count": null,
 54 |    "metadata": {
 55 |     "collapsed": false
 56 |    },
 57 |    "outputs": [],
 58 |    "source": [
 59 |     "import numpy as np\n",
 60 |     "\n",
 61 |     "param_grid = {'C': 10. ** np.arange(-3, 3),\n",
 62 |     "              'gamma': 10. ** np.arange(-5, 0)}\n",
 63 |     "\n",
 64 |     "\n",
 65 |     "np.set_printoptions(suppress=True)\n",
 66 |     "print(param_grid)"
 67 |    ]
 68 |   },
 69 |   {
 70 |    "cell_type": "code",
 71 |    "execution_count": null,
 72 |    "metadata": {
 73 |     "collapsed": false
 74 |    },
 75 |    "outputs": [],
 76 |    "source": [
 77 |     "grid_search = GridSearchCV(SVC(), param_grid, verbose=3, cv=5)"
 78 |    ]
 79 |   },
 80 |   {
 81 |    "cell_type": "markdown",
 82 |    "metadata": {},
 83 |    "source": [
 84 |     "A GridSearchCV object behaves just like a normal classifier."
 85 |    ]
 86 |   },
 87 |   {
 88 |    "cell_type": "code",
 89 |    "execution_count": null,
 90 |    "metadata": {
 91 |     "collapsed": false,
 92 |     "scrolled": true
 93 |    },
 94 |    "outputs": [],
 95 |    "source": [
 96 |     "grid_search.fit(X_train, y_train)"
 97 |    ]
 98 |   },
 99 |   {
100 |    "cell_type": "code",
101 |    "execution_count": null,
102 |    "metadata": {
103 |     "collapsed": false,
104 |     "scrolled": true
105 |    },
106 |    "outputs": [],
107 |    "source": [
108 |     "grid_search.predict(X_test)"
109 |    ]
110 |   },
111 |   {
112 |    "cell_type": "code",
113 |    "execution_count": null,
114 |    "metadata": {
115 |     "collapsed": false
116 |    },
117 |    "outputs": [],
118 |    "source": [
119 |     "grid_search.score(X_test, y_test)"
120 |    ]
121 |   },
122 |   {
123 |    "cell_type": "code",
124 |    "execution_count": null,
125 |    "metadata": {
126 |     "collapsed": false
127 |    },
128 |    "outputs": [],
129 |    "source": [
130 |     "grid_search.best_params_"
131 |    ]
132 |   },
133 |   {
134 |    "cell_type": "code",
135 |    "execution_count": null,
136 |    "metadata": {
137 |     "collapsed": false
138 |    },
139 |    "outputs": [],
140 |    "source": [
141 |     "# Extract just the mean cross-validation score of each candidate\n",
142 |     "%matplotlib notebook\n",
143 |     "import matplotlib.pyplot as plt\n",
144 |     "\n",
145 |     "scores = grid_search.cv_results_['mean_test_score']\n",
146 |     "scores = np.array(scores).reshape(6, 5)\n",
147 |     "\n",
148 |     "plt.matshow(scores, cmap='viridis')\n",
149 |     "plt.xlabel('gamma')\n",
150 |     "plt.ylabel('C')\n",
151 |     "plt.colorbar()\n",
152 |     "plt.xticks(np.arange(5), param_grid['gamma'])\n",
153 |     "plt.yticks(np.arange(6), param_grid['C']);"
154 |    ]
155 |   },
156 |   {
157 |    "cell_type": "code",
158 |    "execution_count": null,
159 |    "metadata": {
160 |     "collapsed": true
161 |    },
162 |    "outputs": [],
163 |    "source": []
164 |   }
165 |  ],
166 |  "metadata": {
167 |   "kernelspec": {
168 |    "display_name": "Python 3",
169 |    "language": "python",
170 |    "name": "python3"
171 |   },
172 |   "language_info": {
173 |    "codemirror_mode": {
174 |     "name": "ipython",
175 |     "version": 3
176 |    },
177 |    "file_extension": ".py",
178 |    "mimetype": "text/x-python",
179 |    "name": "python",
180 |    "nbconvert_exporter": "python",
181 |    "pygments_lexer": "ipython3",
182 |    "version": "3.4.4"
183 |   }
184 |  },
185 |  "nbformat": 4,
186 |  "nbformat_minor": 0
187 | }
188 | 
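A grid search evaluates every combination of the listed parameter values: the 6 values of `C` and 5 values of `gamma` above give 30 candidates, and with `cv=5` each candidate is fitted five times. The sketch below enumerates such a grid in plain Python. `toy_score` is a hypothetical stand-in for a mean cross-validation score, and the ordering (with `C` varying slowest) is an assumption chosen to match the `reshape(6, 5)` call in the notebook.

```python
from itertools import product

param_grid = {
    "C": [10.0 ** e for e in range(-3, 3)],      # 6 values, as in the notebook
    "gamma": [10.0 ** e for e in range(-5, 0)],  # 5 values, as in the notebook
}

# Every combination of the listed values is one candidate setting.
# With the names sorted, "C" varies slowest and "gamma" fastest.
names = sorted(param_grid)
candidates = [dict(zip(names, values))
              for values in product(*(param_grid[name] for name in names))]


def toy_score(params):
    """Hypothetical stand-in for a mean cross-validation score."""
    return -abs(params["C"] - 1.0) - abs(params["gamma"] - 0.001)


best = max(candidates, key=toy_score)
n_fits = len(candidates) * 5  # 5 folds per candidate, before any final refit
print(len(candidates), n_fits, best)
```

Because the number of candidates is the product of the list lengths, adding a third parameter with 4 values would multiply the work by 4.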


--------------------------------------------------------------------------------
/05 - Preprocessing and Pipelines.ipynb:
--------------------------------------------------------------------------------
  1 | {
  2 |  "cells": [
  3 |   {
  4 |    "cell_type": "markdown",
  5 |    "metadata": {},
  6 |    "source": [
  7 |     "Preprocessing and Pipelines\n",
  8 |     "============================="
  9 |    ]
 10 |   },
 11 |   {
 12 |    "cell_type": "code",
 13 |    "execution_count": null,
 14 |    "metadata": {
 15 |     "collapsed": false
 16 |    },
 17 |    "outputs": [],
 18 |    "source": [
 19 |     "from sklearn.datasets import load_digits\n",
 20 |     "from sklearn.model_selection import train_test_split\n",
 21 |     "digits = load_digits()\n",
 22 |     "X_train, X_test, y_train, y_test = train_test_split(digits.data,\n",
 23 |     "                                                    digits.target)"
 24 |    ]
 25 |   },
 26 |   {
 27 |    "cell_type": "markdown",
 28 |    "metadata": {},
 29 |    "source": [
 30 |     "When cross-validating a model that includes scaling, we need to estimate the mean and standard deviation separately for each fold.\n",
 31 |     "To do that, we build a pipeline."
 32 |    ]
 33 |   },
 34 |   {
 35 |    "cell_type": "code",
 36 |    "execution_count": null,
 37 |    "metadata": {
 38 |     "collapsed": false
 39 |    },
 40 |    "outputs": [],
 41 |    "source": [
 42 |     "from sklearn.pipeline import Pipeline, make_pipeline\n",
 43 |     "from sklearn.svm import SVC\n",
 44 |     "from sklearn.preprocessing import StandardScaler"
 45 |    ]
 46 |   },
 47 |   {
 48 |    "cell_type": "code",
 49 |    "execution_count": null,
 50 |    "metadata": {
 51 |     "collapsed": false
 52 |    },
 53 |    "outputs": [],
 54 |    "source": [
 55 |     "standard_scaler = StandardScaler()\n",
 56 |     "standard_scaler.fit(X_train)\n",
 57 |     "X_train_scaled = standard_scaler.transform(X_train)\n",
 58 |     "svm = SVC()\n",
 59 |     "svm.fit(X_train_scaled, y_train)"
 60 |    ]
 61 |   },
 62 |   {
 63 |    "cell_type": "code",
 64 |    "execution_count": null,
 65 |    "metadata": {
 66 |     "collapsed": false
 67 |    },
 68 |    "outputs": [],
 69 |    "source": [
 70 |     "X_test_scaled = standard_scaler.transform(X_test)\n",
 71 |     "svm.score(X_test_scaled, y_test)"
 72 |    ]
 73 |   },
 74 |   {
 75 |    "cell_type": "code",
 76 |    "execution_count": null,
 77 |    "metadata": {
 78 |     "collapsed": false
 79 |    },
 80 |    "outputs": [],
 81 |    "source": [
 82 |     "pipeline = make_pipeline(StandardScaler(), SVC())"
 83 |    ]
 84 |   },
 85 |   {
 86 |    "cell_type": "code",
 87 |    "execution_count": null,
 88 |    "metadata": {
 89 |     "collapsed": false
 90 |    },
 91 |    "outputs": [],
 92 |    "source": [
 93 |     "pipeline.fit(X_train, y_train)"
 94 |    ]
 95 |   },
 96 |   {
 97 |    "cell_type": "code",
 98 |    "execution_count": null,
 99 |    "metadata": {
100 |     "collapsed": false
101 |    },
102 |    "outputs": [],
103 |    "source": [
104 |     "pipeline.score(X_test, y_test)"
105 |    ]
106 |   },
107 |   {
108 |    "cell_type": "code",
109 |    "execution_count": null,
110 |    "metadata": {
111 |     "collapsed": false,
112 |     "scrolled": true
113 |    },
114 |    "outputs": [],
115 |    "source": [
116 |     "pipeline.predict(X_test)"
117 |    ]
118 |   },
119 |   {
120 |    "cell_type": "markdown",
121 |    "metadata": {},
122 |    "source": [
123 |     "Cross-validation with a pipeline\n",
124 |     "---------------------------------"
125 |    ]
126 |   },
127 |   {
128 |    "cell_type": "code",
129 |    "execution_count": null,
130 |    "metadata": {
131 |     "collapsed": false
132 |    },
133 |    "outputs": [],
134 |    "source": [
135 |     "from sklearn.model_selection import cross_val_score\n",
136 |     "cross_val_score(pipeline, X_train, y_train)"
137 |    ]
138 |   },
139 |   {
140 |    "cell_type": "markdown",
141 |    "metadata": {},
142 |    "source": [
143 |     "Grid Search with a pipeline\n",
144 |     "==========================="
145 |    ]
146 |   },
147 |   {
148 |    "cell_type": "code",
149 |    "execution_count": null,
150 |    "metadata": {
151 |     "collapsed": false
152 |    },
153 |    "outputs": [],
154 |    "source": [
155 |     "import numpy as np\n",
156 |     "from sklearn.model_selection import GridSearchCV\n",
157 |     "\n",
158 |     "param_grid = {'svc__C': 10. ** np.arange(-3, 3),\n",
159 |     "              'svc__gamma' : 10. ** np.arange(-3, 3)\n",
160 |     "             }\n",
161 |     "\n",
162 |     "grid_pipeline = GridSearchCV(pipeline, param_grid=param_grid) "
163 |    ]
164 |   },
165 |   {
166 |    "cell_type": "code",
167 |    "execution_count": null,
168 |    "metadata": {
169 |     "collapsed": false
170 |    },
171 |    "outputs": [],
172 |    "source": [
173 |     "grid_pipeline.fit(X_train, y_train)"
174 |    ]
175 |   },
176 |   {
177 |    "cell_type": "code",
178 |    "execution_count": null,
179 |    "metadata": {
180 |     "collapsed": false
181 |    },
182 |    "outputs": [],
183 |    "source": [
184 |     "grid_pipeline.score(X_test, y_test)"
185 |    ]
186 |   }
187 |  ],
188 |  "metadata": {
189 |   "kernelspec": {
190 |    "display_name": "Python 3",
191 |    "language": "python",
192 |    "name": "python3"
193 |   },
194 |   "language_info": {
195 |    "codemirror_mode": {
196 |     "name": "ipython",
197 |     "version": 3
198 |    },
199 |    "file_extension": ".py",
200 |    "mimetype": "text/x-python",
201 |    "name": "python",
202 |    "nbconvert_exporter": "python",
203 |    "pygments_lexer": "ipython3",
204 |    "version": "3.4.4"
205 |   }
206 |  },
207 |  "nbformat": 4,
208 |  "nbformat_minor": 0
209 | }
210 | 
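The reason for putting the scaler inside the pipeline is that its statistics must come from the training part of each fold only. The plain-Python sketch below contrasts the two options on a toy one-dimensional feature. `mean`, `centre`, and the data are hypothetical and made up for this illustration, and only mean removal is shown to keep it short.

```python
def mean(values):
    return sum(values) / len(values)


def centre(values, centre_value):
    """Subtract a previously estimated mean from every value."""
    return [v - centre_value for v in values]


# Toy 1-D feature; the last sample plays the role of the held-out fold.
feature = [1.0, 2.0, 3.0, 4.0, 100.0]
train_fold, test_fold = feature[:4], feature[4:]

# Leaky: the statistic is estimated on all samples, including the held-out one.
leaky_mean = mean(feature)

# Pipeline-style: the statistic is estimated on the training fold only,
# then applied unchanged to the held-out fold.
fold_mean = mean(train_fold)
test_centred = centre(test_fold, fold_mean)

print(leaky_mean, fold_mean, test_centred)
```

The held-out sample shifts the leaky mean from 2.5 to 22.0, which is information a model evaluated on that sample should not have had during preprocessing.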


--------------------------------------------------------------------------------
/06 - Working With Text Data.ipynb:
--------------------------------------------------------------------------------
  1 | {
  2 |  "cells": [
  3 |   {
  4 |    "cell_type": "code",
  5 |    "execution_count": null,
  6 |    "metadata": {
  7 |     "collapsed": true
  8 |    },
  9 |    "outputs": [],
 10 |    "source": [
 11 |     "%matplotlib notebook\n",
 12 |     "import matplotlib.pyplot as plt\n",
 13 |     "import numpy as np"
 14 |    ]
 15 |   },
 16 |   {
 17 |    "cell_type": "markdown",
 18 |    "metadata": {},
 19 |    "source": [
 20 |     "# Text Classification of Movie Reviews"
 21 |    ]
 22 |   },
 23 |   {
 24 |    "cell_type": "markdown",
 25 |    "metadata": {},
 26 |    "source": [
 27 |     "Unpack the data. The `tar` command below works on Linux and (probably) OS X; on Windows, unpack the archive with 7-Zip."
 28 |    ]
 29 |   },
 30 |   {
 31 |    "cell_type": "code",
 32 |    "execution_count": null,
 33 |    "metadata": {
 34 |     "collapsed": false
 35 |    },
 36 |    "outputs": [],
 37 |    "source": [
 38 |     "#! tar -xf data/aclImdb.tar.bz2 --directory data"
 39 |    ]
 40 |   },
 41 |   {
 42 |    "cell_type": "code",
 43 |    "execution_count": null,
 44 |    "metadata": {
 45 |     "collapsed": false
 46 |    },
 47 |    "outputs": [],
 48 |    "source": [
 49 |     "from sklearn.datasets import load_files\n",
 50 |     "\n",
 51 |     "reviews_train = load_files(\"data/aclImdb/train/\")\n",
 52 |     "text_train, y_train = reviews_train.data, reviews_train.target"
 53 |    ]
 54 |   },
 55 |   {
 56 |    "cell_type": "code",
 57 |    "execution_count": null,
 58 |    "metadata": {
 59 |     "collapsed": false
 60 |    },
 61 |    "outputs": [],
 62 |    "source": [
 63 |     "print(\"Number of documents in training data: %d\" % len(text_train))\n",
 64 |     "print(np.bincount(y_train))"
 65 |    ]
 66 |   },
 67 |   {
 68 |    "cell_type": "code",
 69 |    "execution_count": null,
 70 |    "metadata": {
 71 |     "collapsed": false
 72 |    },
 73 |    "outputs": [],
 74 |    "source": [
 75 |     "reviews_test = load_files(\"data/aclImdb/test/\")\n",
 76 |     "text_test, y_test = reviews_test.data, reviews_test.target\n",
 77 |     "print(\"Number of documents in test data: %d\" % len(text_test))\n",
 78 |     "print(np.bincount(y_test))"
 79 |    ]
 80 |   },
 81 |   {
 82 |    "cell_type": "code",
 83 |    "execution_count": null,
 84 |    "metadata": {
 85 |     "collapsed": false
 86 |    },
 87 |    "outputs": [],
 88 |    "source": [
 89 |     "print(text_train[1])"
 90 |    ]
 91 |   },
 92 |   {
 93 |    "cell_type": "code",
 94 |    "execution_count": null,
 95 |    "metadata": {
 96 |     "collapsed": false
 97 |    },
 98 |    "outputs": [],
 99 |    "source": [
100 |     "print(y_train[1])"
101 |    ]
102 |   },
103 |   {
104 |    "cell_type": "code",
105 |    "execution_count": null,
106 |    "metadata": {
107 |     "collapsed": false
108 |    },
109 |    "outputs": [],
110 |    "source": [
111 |     "from sklearn.feature_extraction.text import CountVectorizer\n",
112 |     "cv = CountVectorizer()\n",
113 |     "cv.fit(text_train)\n",
114 |     "\n",
115 |     "len(cv.vocabulary_)"
116 |    ]
117 |   },
118 |   {
119 |    "cell_type": "code",
120 |    "execution_count": null,
121 |    "metadata": {
122 |     "collapsed": false,
123 |     "scrolled": true
124 |    },
125 |    "outputs": [],
126 |    "source": [
127 |     "print(cv.get_feature_names()[:50])\n",
128 |     "print(cv.get_feature_names()[50000:50050])"
129 |    ]
130 |   },
131 |   {
132 |    "cell_type": "code",
133 |    "execution_count": null,
134 |    "metadata": {
135 |     "collapsed": false
136 |    },
137 |    "outputs": [],
138 |    "source": [
139 |     "X_train = cv.transform(text_train)\n",
140 |     "X_train"
141 |    ]
142 |   },
143 |   {
144 |    "cell_type": "code",
145 |    "execution_count": null,
146 |    "metadata": {
147 |     "collapsed": false
148 |    },
149 |    "outputs": [],
150 |    "source": [
151 |     "print(text_train[19726])"
152 |    ]
153 |   },
154 |   {
155 |    "cell_type": "code",
156 |    "execution_count": null,
157 |    "metadata": {
158 |     "collapsed": false
159 |    },
160 |    "outputs": [],
161 |    "source": [
162 |     "X_train[19726].nonzero()[1]"
163 |    ]
164 |   },
165 |   {
166 |    "cell_type": "code",
167 |    "execution_count": null,
168 |    "metadata": {
169 |     "collapsed": false
170 |    },
171 |    "outputs": [],
172 |    "source": [
173 |     "X_test = cv.transform(text_test)"
174 |    ]
175 |   },
176 |   {
177 |    "cell_type": "code",
178 |    "execution_count": null,
179 |    "metadata": {
180 |     "collapsed": false
181 |    },
182 |    "outputs": [],
183 |    "source": [
184 |     "from sklearn.svm import LinearSVC\n",
185 |     "\n",
186 |     "svm = LinearSVC()\n",
187 |     "svm.fit(X_train, y_train)"
188 |    ]
189 |   },
190 |   {
191 |    "cell_type": "code",
192 |    "execution_count": null,
193 |    "metadata": {
194 |     "collapsed": false
195 |    },
196 |    "outputs": [],
197 |    "source": [
198 |     "svm.score(X_train, y_train)"
199 |    ]
200 |   },
201 |   {
202 |    "cell_type": "code",
203 |    "execution_count": null,
204 |    "metadata": {
205 |     "collapsed": false
206 |    },
207 |    "outputs": [],
208 |    "source": [
209 |     "svm.score(X_test, y_test)"
210 |    ]
211 |   },
212 |   {
213 |    "cell_type": "code",
214 |    "execution_count": null,
215 |    "metadata": {
216 |     "collapsed": false
217 |    },
218 |    "outputs": [],
219 |    "source": [
220 |     "def visualize_coefficients(classifier, feature_names, n_top_features=25):\n",
221 |     "    # get coefficients with large absolute values \n",
222 |     "    coef = classifier.coef_.ravel()\n",
223 |     "    positive_coefficients = np.argsort(coef)[-n_top_features:]\n",
224 |     "    negative_coefficients = np.argsort(coef)[:n_top_features]\n",
225 |     "    interesting_coefficients = np.hstack([negative_coefficients, positive_coefficients])\n",
226 |     "    # plot them\n",
227 |     "    plt.figure(figsize=(15, 5))\n",
228 |     "    colors = [\"red\" if c < 0 else \"blue\" for c in coef[interesting_coefficients]]\n",
229 |     "    plt.bar(np.arange(2 * n_top_features), coef[interesting_coefficients], color=colors)\n",
230 |     "    feature_names = np.array(feature_names)\n",
231 |     "    plt.subplots_adjust(bottom=0.3)\n",
232 |     "    plt.xticks(np.arange(1, 1 + 2 * n_top_features), feature_names[interesting_coefficients], rotation=60, ha=\"right\");\n"
233 |    ]
234 |   },
235 |   {
236 |    "cell_type": "code",
237 |    "execution_count": null,
238 |    "metadata": {
239 |     "collapsed": false
240 |    },
241 |    "outputs": [],
242 |    "source": [
243 |     "visualize_coefficients(svm, cv.get_feature_names())"
244 |    ]
245 |   },
246 |   {
247 |    "cell_type": "code",
248 |    "execution_count": null,
249 |    "metadata": {
250 |     "collapsed": false
251 |    },
252 |    "outputs": [],
253 |    "source": [
254 |     "svm = LinearSVC(C=0.001)\n",
255 |     "svm.fit(X_train, y_train)\n",
256 |     "svm.score(X_test, y_test)"
257 |    ]
258 |   },
259 |   {
260 |    "cell_type": "code",
261 |    "execution_count": null,
262 |    "metadata": {
263 |     "collapsed": false
264 |    },
265 |    "outputs": [],
266 |    "source": [
267 |     "visualize_coefficients(svm, cv.get_feature_names())"
268 |    ]
269 |   },
270 |   {
271 |    "cell_type": "code",
272 |    "execution_count": null,
273 |    "metadata": {
274 |     "collapsed": false
275 |    },
276 |    "outputs": [],
277 |    "source": [
278 |     "from sklearn.pipeline import make_pipeline\n",
279 |     "text_pipe = make_pipeline(CountVectorizer(), LinearSVC())\n",
280 |     "text_pipe.fit(text_train, y_train)\n",
281 |     "text_pipe.score(text_test, y_test)"
282 |    ]
283 |   },
284 |   {
285 |    "cell_type": "code",
286 |    "execution_count": null,
287 |    "metadata": {
288 |     "collapsed": false,
289 |     "scrolled": true
290 |    },
291 |    "outputs": [],
292 |    "source": [
293 |     "from sklearn.grid_search import GridSearchCV\n",
294 |     "\n",
295 |     "param_grid = {'linearsvc__C': np.logspace(-5, 0, 6)}\n",
296 |     "grid = GridSearchCV(text_pipe, param_grid, cv=5)\n",
297 |     "grid.fit(text_train, y_train)"
298 |    ]
299 |   },
300 |   {
301 |    "cell_type": "code",
302 |    "execution_count": null,
303 |    "metadata": {
304 |     "collapsed": false
305 |    },
306 |    "outputs": [],
307 |    "source": [
308 |     "grid.best_params_"
309 |    ]
310 |   },
311 |   {
312 |    "cell_type": "code",
313 |    "execution_count": null,
314 |    "metadata": {
315 |     "collapsed": false
316 |    },
317 |    "outputs": [],
318 |    "source": [
319 |     "visualize_coefficients(grid.best_estimator_.named_steps['linearsvc'],\n",
320 |     "                       grid.best_estimator_.named_steps['countvectorizer'].get_feature_names())"
321 |    ]
322 |   },
323 |   {
324 |    "cell_type": "code",
325 |    "execution_count": null,
326 |    "metadata": {
327 |     "collapsed": false
328 |    },
329 |    "outputs": [],
330 |    "source": [
331 |     "grid.score(text_test, y_test)"
332 |    ]
333 |   },
334 |   {
335 |    "cell_type": "markdown",
336 |    "metadata": {},
337 |    "source": [
338 |     "# N-Grams"
339 |    ]
340 |   },
341 |   {
342 |    "cell_type": "code",
343 |    "execution_count": null,
344 |    "metadata": {
345 |     "collapsed": false,
346 |     "scrolled": true
347 |    },
348 |    "outputs": [],
349 |    "source": [
350 |     "text_pipe = make_pipeline(CountVectorizer(), LinearSVC())\n",
351 |     "\n",
352 |     "param_grid = {'linearsvc__C': np.logspace(-3, 2, 6),\n",
353 |     "              \"countvectorizer__ngram_range\": [(1, 1), (1, 2)]}\n",
354 |     "\n",
355 |     "grid = GridSearchCV(text_pipe, param_grid, cv=5)\n",
356 |     "\n",
357 |     "grid.fit(text_train, y_train)"
358 |    ]
359 |   },
360 |   {
361 |    "cell_type": "code",
362 |    "execution_count": null,
363 |    "metadata": {
364 |     "collapsed": false
365 |    },
366 |    "outputs": [],
367 |    "source": [
368 |     "scores = np.array([score.mean_validation_score for score in grid.grid_scores_]).reshape(2, -1)\n",
369 |     "plt.matshow(scores)\n",
370 |     "plt.ylabel(\"n-gram range\")\n",
371 |     "plt.yticks(range(2), param_grid[\"countvectorizer__ngram_range\"])\n",
372 |     "plt.xlabel(\"C\")\n",
373 |     "plt.xticks(range(6), param_grid[\"linearsvc__C\"]);\n",
374 |     "plt.colorbar()"
375 |    ]
376 |   },
377 |   {
378 |    "cell_type": "code",
379 |    "execution_count": null,
380 |    "metadata": {
381 |     "collapsed": false
382 |    },
383 |    "outputs": [],
384 |    "source": [
385 |     "grid.best_params_"
386 |    ]
387 |   },
388 |   {
389 |    "cell_type": "code",
390 |    "execution_count": null,
391 |    "metadata": {
392 |     "collapsed": false
393 |    },
394 |    "outputs": [],
395 |    "source": [
396 |     "visualize_coefficients(grid.best_estimator_.named_steps['linearsvc'],\n",
397 |     "                       grid.best_estimator_.named_steps['countvectorizer'].get_feature_names())"
398 |    ]
399 |   },
400 |   {
401 |    "cell_type": "code",
402 |    "execution_count": null,
403 |    "metadata": {
404 |     "collapsed": false
405 |    },
406 |    "outputs": [],
407 |    "source": [
408 |     "grid.score(text_test, y_test)"
409 |    ]
410 |   },
411 |   {
412 |    "cell_type": "markdown",
413 |    "metadata": {},
414 |    "source": [
415 |     "## Look at spaCy and NLTK"
416 |    ]
417 |   }
418 |  ],
419 |  "metadata": {
420 |   "kernelspec": {
421 |    "display_name": "Python 3",
422 |    "language": "python",
423 |    "name": "python3"
424 |   },
425 |   "language_info": {
426 |    "codemirror_mode": {
427 |     "name": "ipython",
428 |     "version": 3
429 |    },
430 |    "file_extension": ".py",
431 |    "mimetype": "text/x-python",
432 |    "name": "python",
433 |    "nbconvert_exporter": "python",
434 |    "pygments_lexer": "ipython3",
435 |    "version": "3.4.4"
436 |   }
437 |  },
438 |  "nbformat": 4,
439 |  "nbformat_minor": 0
440 | }
441 | 
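The n-gram grid search in this notebook tries 2 n-gram ranges and 6 values of `C`, which gives 12 parameter combinations, and the score matrix is plotted with the n-gram range on the y-axis and `C` on the x-axis. The sketch below uses only the standard library to show how a flat list of 12 results maps onto a 2 x 6 matrix. It assumes the grid is enumerated over alphabetically sorted parameter names with the last name varying fastest, which is my understanding of how scikit-learn's `ParameterGrid` orders candidates; the helper `enumerate_grid` and the placeholder scores are hypothetical.

```python
from itertools import product


def enumerate_grid(param_grid):
    """Yield parameter dicts over sorted keys; the last key varies fastest (assumed order)."""
    keys = sorted(param_grid)
    for values in product(*(param_grid[k] for k in keys)):
        yield dict(zip(keys, values))


param_grid = {
    "linearsvc__C": [10.0 ** e for e in range(-3, 3)],  # 6 values, like np.logspace(-3, 2, 6)
    "countvectorizer__ngram_range": [(1, 1), (1, 2)],   # 2 values
}

combos = list(enumerate_grid(param_grid))
n_rows = len(param_grid["countvectorizer__ngram_range"])
n_cols = len(param_grid["linearsvc__C"])

# Placeholder "scores": the flat index of each candidate, so the layout is easy to read.
flat = list(range(len(combos)))
matrix = [flat[r * n_cols:(r + 1) * n_cols] for r in range(n_rows)]

for row in matrix:
    print(row)
```

Under that ordering each row holds one n-gram range and each column one value of `C`, so the number of rows passed to a reshape has to equal the number of n-gram ranges in the grid.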


--------------------------------------------------------------------------------
/07 - Out Of Core Learning.ipynb:
--------------------------------------------------------------------------------
 1 | {
 2 |  "cells": [
 3 |   {
 4 |    "cell_type": "code",
 5 |    "execution_count": null,
 6 |    "metadata": {
 7 |     "collapsed": false
 8 |    },
 9 |    "outputs": [],
10 |    "source": [
11 |     "# write out some toy data\n",
12 |     "from sklearn.datasets import load_digits\n",
13 |     "import pickle\n",
14 |     "\n",
15 |     "digits = load_digits()\n",
16 |     "\n",
17 |     "X, y = digits.data, digits.target\n",
18 |     "\n",
19 |     "for i in range(10):\n",
20 |     "    with open(\"data/batch_%02d.pickle\" % i, \"wb\") as f:\n",
21 |     "        pickle.dump((X[i::10] / 16., y[i::10]), f, -1)"
22 |    ]
23 |   },
24 |   {
25 |    "cell_type": "code",
26 |    "execution_count": null,
27 |    "metadata": {
28 |     "collapsed": false
29 |    },
30 |    "outputs": [],
31 |    "source": [
32 |     "from sklearn.linear_model import SGDClassifier"
33 |    ]
34 |   },
35 |   {
36 |    "cell_type": "code",
37 |    "execution_count": null,
38 |    "metadata": {
39 |     "collapsed": false
40 |    },
41 |    "outputs": [],
42 |    "source": [
43 |     "sgd = SGDClassifier(random_state=0)\n",
44 |     "for i in range(9):\n",
45 |     "    X_batch, y_batch = pickle.load(open(\"data/batch_%02d.pickle\" % i, \"rb\"))\n",
46 |     "    sgd.partial_fit(X_batch, y_batch, classes=range(10))"
47 |    ]
48 |   },
49 |   {
50 |    "cell_type": "code",
51 |    "execution_count": null,
52 |    "metadata": {
53 |     "collapsed": false
54 |    },
55 |    "outputs": [],
56 |    "source": [
57 |     "X_test, y_test = pickle.load(open(\"data/batch_09.pickle\", \"rb\"))\n",
58 |     "\n",
59 |     "sgd.score(X_test, y_test)"
60 |    ]
61 |   }
62 |  ],
63 |  "metadata": {
64 |   "kernelspec": {
65 |    "display_name": "Python 3",
66 |    "language": "python",
67 |    "name": "python3"
68 |   },
69 |   "language_info": {
70 |    "codemirror_mode": {
71 |     "name": "ipython",
72 |     "version": 3
73 |    },
74 |    "file_extension": ".py",
75 |    "mimetype": "text/x-python",
76 |    "name": "python",
77 |    "nbconvert_exporter": "python",
78 |    "pygments_lexer": "ipython3",
79 |    "version": "3.4.4"
80 |   }
81 |  },
82 |  "nbformat": 4,
83 |  "nbformat_minor": 0
84 | }
85 | 
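Notebook 07 writes ten batch files by taking every tenth sample starting at offset `i` (`X[i::10]`), trains on batches 0 to 8, and scores on batch 9. The standard-library sketch below illustrates why those strided slices form disjoint batches that together cover the whole dataset. The list `samples` is only a stand-in for the digits arrays (1797 rows), not the real data.

```python
n_batches = 10
samples = list(range(1797))  # stand-in: one integer id per digits sample

# Batch i takes every 10th sample starting at offset i, as X[i::10] does.
batches = [samples[i::n_batches] for i in range(n_batches)]

# The notebook trains on the first nine batches and holds out the last one.
train = [s for batch in batches[:9] for s in batch]
test = batches[9]

print([len(batch) for batch in batches])
print(len(train), len(test))
```

Because each sample index has exactly one remainder modulo 10, no sample can appear in two batches, and the held-out batch never overlaps the training batches.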


--------------------------------------------------------------------------------
/08 - Out Of Core Learning for Text.ipynb:
--------------------------------------------------------------------------------
  1 | {
  2 |  "cells": [
  3 |   {
  4 |    "cell_type": "code",
  5 |    "execution_count": null,
  6 |    "metadata": {
  7 |     "collapsed": false
  8 |    },
  9 |    "outputs": [],
 10 |    "source": [
 11 |     "import matplotlib.pyplot as plt\n",
 12 |     "import numpy as np\n",
 13 |     "%matplotlib notebook"
 14 |    ]
 15 |   },
 16 |   {
 17 |    "cell_type": "markdown",
 18 |    "metadata": {},
 19 |    "source": [
 20 |     "# Out of core text classification with the Hashing Vectorizer"
 21 |    ]
 22 |   },
 23 |   {
 24 |    "cell_type": "markdown",
 25 |    "metadata": {},
 26 |    "source": [
 27 |     "Using the Amazon movie reviews collected by J. McAuley and J. Leskovec\n",
 28 |     "\n",
 29 |     "https://snap.stanford.edu/data/web-Movies.html"
 30 |    ]
 31 |   },
 32 |   {
 33 |    "cell_type": "code",
 34 |    "execution_count": null,
 35 |    "metadata": {
 36 |     "collapsed": false
 37 |    },
 38 |    "outputs": [],
 39 |    "source": [
 40 |     "import os\n",
 41 |     "print(\"file size: %d GB\" % (os.path.getsize(\"data/movies.txt\") / 1024 ** 3))"
 42 |    ]
 43 |   },
 44 |   {
 45 |    "cell_type": "code",
 46 |    "execution_count": null,
 47 |    "metadata": {
 48 |     "collapsed": false
 49 |    },
 50 |    "outputs": [],
 51 |    "source": [
 52 |     "with open(\"data/movies.txt\") as f:\n",
 53 |     "    print(f.read(4000))"
 54 |    ]
 55 |   },
 56 |   {
 57 |    "cell_type": "code",
 58 |    "execution_count": null,
 59 |    "metadata": {
 60 |     "collapsed": false
 61 |    },
 62 |    "outputs": [],
 63 |    "source": [
 64 |     "def review_iter(f):\n",
 65 |     "    current_post = []\n",
 66 |     "    for line in f:\n",
 67 |     "        if line.startswith(\"product/productId\"):\n",
 68 |     "            if len(current_post):\n",
 69 |     "                score = current_post[3].split(\"review/score: \", 1)[-1].strip()\n",
 70 |     "                review = \"\".join(current_post[6:]).split(\"review/text: \", 1)[-1].strip()\n",
 71 |     "                # there are about 20 posts with linebreaks in them.\n",
 72 |     "                # we just ignore those for simplicity\n",
 73 |     "                try:\n",
 74 |     "                    yield int(float(score)), review\n",
 75 |     "                except ValueError:\n",
 76 |     "                    current_post = []\n",
 77 |     "                    continue\n",
 78 |     "            current_post = []\n",
 79 |     "        else:\n",
 80 |     "            current_post.append(line)"
 81 |    ]
 82 |   },
 83 |   {
 84 |    "cell_type": "code",
 85 |    "execution_count": null,
 86 |    "metadata": {
 87 |     "collapsed": false,
 88 |     "scrolled": false
 89 |    },
 90 |    "outputs": [],
 91 |    "source": [
 92 |     "n_reviews = 0\n",
 93 |     "with open(\"data/movies.txt\", 'r', errors='ignore') as f:\n",
 94 |     "    for r in review_iter(f):\n",
 95 |     "        n_reviews += 1\n",
 96 |     "\n",
 97 |     "print(\"Number of reviews: %d\" % n_reviews)"
 98 |    ]
 99 |   },
100 |   {
101 |    "cell_type": "code",
102 |    "execution_count": null,
103 |    "metadata": {
104 |     "collapsed": false
105 |    },
106 |    "outputs": [],
107 |    "source": [
108 |     "from itertools import islice\n",
109 |     "\n",
110 |     "with open(\"data/movies.txt\", 'r', errors='ignore') as f:\n",
111 |     "    reviews = islice(review_iter(f), 10000)\n",
112 |     "    scores, texts = zip(*reviews)\n",
113 |     "print(np.bincount(scores))"
114 |    ]
115 |   },
116 |   {
117 |    "cell_type": "code",
118 |    "execution_count": null,
119 |    "metadata": {
120 |     "collapsed": false
121 |    },
122 |    "outputs": [],
123 |    "source": []
124 |   },
125 |   {
126 |    "cell_type": "code",
127 |    "execution_count": null,
128 |    "metadata": {
129 |     "collapsed": false
130 |    },
131 |    "outputs": [],
132 |    "source": [
133 |     "from itertools import zip_longest  # use izip_longest on Python 2\n",
134 |     "# from the itertools recipes\n",
135 |     "def grouper(iterable, n, fillvalue=None):\n",
136 |     "    \"Collect data into fixed-length chunks or blocks\"\n",
137 |     "    # grouper('ABCDEFG', 3, 'x') --> ABC DEF Gxx\n",
138 |     "    args = [iter(iterable)] * n\n",
139 |     "    return zip_longest(*args, fillvalue=fillvalue)"
140 |    ]
141 |   },
142 |   {
143 |    "cell_type": "code",
144 |    "execution_count": null,
145 |    "metadata": {
146 |     "collapsed": true
147 |    },
148 |    "outputs": [],
149 |    "source": [
150 |     "def preprocess_batch(reviews):\n",
151 |     "    # score == 3 is \"neutral\", we only want \"positive\" or \"negative\"\n",
152 |     "    reviews_filtered = [r for r in reviews if r is not None and r[0] != 3]\n",
153 |     "    scores, texts = zip(*reviews_filtered)\n",
154 |     "    polarity = np.array(scores) > 3\n",
155 |     "    return polarity, texts"
156 |    ]
157 |   },
158 |   {
159 |    "cell_type": "code",
160 |    "execution_count": null,
161 |    "metadata": {
162 |     "collapsed": false,
163 |     "scrolled": true
164 |    },
165 |    "outputs": [],
166 |    "source": [
167 |     "from sklearn.feature_extraction.text import HashingVectorizer\n",
168 |     "\n",
169 |     "vectorizer = HashingVectorizer(decode_error=\"ignore\")\n",
170 |     "\n",
171 |     "with open(\"data/movies.txt\", 'r', errors='ignore') as f:\n",
172 |     "    reviews = islice(review_iter(f), 10000)\n",
173 |     "    polarity_test, texts_test = preprocess_batch(reviews)\n",
174 |     "    X_test = vectorizer.transform(texts_test)"
175 |    ]
176 |   },
177 |   {
178 |    "cell_type": "code",
179 |    "execution_count": null,
180 |    "metadata": {
181 |     "collapsed": false
182 |    },
183 |    "outputs": [],
184 |    "source": [
185 |     "from sklearn.linear_model import SGDClassifier\n",
186 |     "\n",
187 |     "sgd = SGDClassifier(random_state=0)\n",
188 |     "\n",
189 |     "accuracies = []\n",
190 |     "with open(\"data/movies.txt\", 'r', errors='ignore') as f:\n",
191 |     "    training_set = islice(review_iter(f), 10000, None)\n",
192 |     "    batch_iter = grouper(training_set, 10000)\n",
193 |     "    for batch in batch_iter:\n",
194 |     "        polarity, texts = preprocess_batch(batch)\n",
195 |     "        X = vectorizer.transform(texts)\n",
196 |     "        sgd.partial_fit(X, polarity, classes=[0, 1])\n",
197 |     "        accuracies.append(sgd.score(X_test, polarity_test))"
198 |    ]
199 |   },
200 |   {
201 |    "cell_type": "code",
202 |    "execution_count": null,
203 |    "metadata": {
204 |     "collapsed": false
205 |    },
206 |    "outputs": [],
207 |    "source": [
208 |     "plt.plot(accuracies)"
209 |    ]
210 |   }
211 |  ],
212 |  "metadata": {
213 |   "kernelspec": {
214 |    "display_name": "Python 3",
215 |    "language": "python",
216 |    "name": "python3"
217 |   },
218 |   "language_info": {
219 |    "codemirror_mode": {
220 |     "name": "ipython",
221 |     "version": 3
222 |    },
223 |    "file_extension": ".py",
224 |    "mimetype": "text/x-python",
225 |    "name": "python",
226 |    "nbconvert_exporter": "python",
227 |    "pygments_lexer": "ipython3",
228 |    "version": "3.4.4"
229 |   }
230 |  },
231 |  "nbformat": 4,
232 |  "nbformat_minor": 0
233 | }
234 | 
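Notebook 08 turns the flat `movies.txt` file into `(score, text)` pairs and then feeds them to the classifier in fixed-size batches. The sketch below walks through both steps on a tiny in-memory sample using only the standard library. The sample records are invented for illustration, the helpers `to_pair`, `parse_reviews` and `batches` are hypothetical, and the record layout (eight `key: value` lines followed by a blank line) is assumed from the indices the notebook uses. It differs from the notebook in three ways: fields are looked up by name, the final record is emitted too, and batching uses `islice` rather than the `zip_longest` grouper, so there is no `None` padding to filter out.

```python
import io
from itertools import chain, islice

# Invented sample in the assumed layout; the last record has no trailing blank line.
SAMPLE = """\
product/productId: B000000001
review/userId: A1
review/profileName: someone
review/helpfulness: 1/1
review/score: 5.0
review/time: 1182729600
review/summary: great
review/text: it was excellent

product/productId: B000000002
review/userId: A2
review/profileName: someone else
review/helpfulness: 0/1
review/score: 3.0
review/time: 1182729601
review/summary: ok
review/text: it was fine

product/productId: B000000003
review/userId: A3
review/profileName: a third person
review/helpfulness: 2/3
review/score: 1.0
review/time: 1182729602
review/summary: bad
review/text: terrible

product/productId: B000000004
review/userId: A4
review/profileName: a fourth person
review/helpfulness: 0/0
review/score: 4.0
review/time: 1182729603
review/summary: good
review/text: worth a look
"""


def to_pair(record):
    """Return (integer score, text), or None if the record is malformed."""
    try:
        return int(float(record["review/score"])), record["review/text"]
    except (KeyError, ValueError):
        return None


def parse_reviews(lines):
    """Yield (score, text) per record; a blank line ends a record."""
    record = {}
    # The extra "" forces a flush of the last record at end of input.
    for line in chain(lines, [""]):
        line = line.rstrip("\n")
        if line:
            # Split on the first ": " only, so the value may itself contain ": ".
            key, sep, value = line.partition(": ")
            if sep:
                record[key] = value
            continue
        pair = to_pair(record) if record else None
        if pair is not None:
            yield pair
        record = {}


def batches(iterable, n):
    """Yield lists of up to n items; the last list may be shorter."""
    it = iter(iterable)
    while True:
        chunk = list(islice(it, n))
        if not chunk:
            return
        yield chunk


for chunk in batches(parse_reviews(io.StringIO(SAMPLE)), 2):
    kept = [(score, text) for score, text in chunk if score != 3]  # drop neutral reviews
    labels = [score > 3 for score, _ in kept]                       # True means positive
    print(labels, [text for _, text in kept])
```

Splitting on the separator matters because `str.strip("review/text: ")` removes any of those characters from both ends of the string rather than the literal prefix, so a review that begins with "it was" would lose its first letters.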


--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
  1 | CC0 1.0 Universal
  2 | 
  3 | Statement of Purpose
  4 | 
  5 | The laws of most jurisdictions throughout the world automatically confer
  6 | exclusive Copyright and Related Rights (defined below) upon the creator and
  7 | subsequent owner(s) (each and all, an "owner") of an original work of
  8 | authorship and/or a database (each, a "Work").
  9 | 
 10 | Certain owners wish to permanently relinquish those rights to a Work for the
 11 | purpose of contributing to a commons of creative, cultural and scientific
 12 | works ("Commons") that the public can reliably and without fear of later
 13 | claims of infringement build upon, modify, incorporate in other works, reuse
 14 | and redistribute as freely as possible in any form whatsoever and for any
 15 | purposes, including without limitation commercial purposes. These owners may
 16 | contribute to the Commons to promote the ideal of a free culture and the
 17 | further production of creative, cultural and scientific works, or to gain
 18 | reputation or greater distribution for their Work in part through the use and
 19 | efforts of others.
 20 | 
 21 | For these and/or other purposes and motivations, and without any expectation
 22 | of additional consideration or compensation, the person associating CC0 with a
 23 | Work (the "Affirmer"), to the extent that he or she is an owner of Copyright
 24 | and Related Rights in the Work, voluntarily elects to apply CC0 to the Work
 25 | and publicly distribute the Work under its terms, with knowledge of his or her
 26 | Copyright and Related Rights in the Work and the meaning and intended legal
 27 | effect of CC0 on those rights.
 28 | 
 29 | 1. Copyright and Related Rights. A Work made available under CC0 may be
 30 | protected by copyright and related or neighboring rights ("Copyright and
 31 | Related Rights"). Copyright and Related Rights include, but are not limited
 32 | to, the following:
 33 | 
 34 |   i. the right to reproduce, adapt, distribute, perform, display, communicate,
 35 |   and translate a Work;
 36 | 
 37 |   ii. moral rights retained by the original author(s) and/or performer(s);
 38 | 
 39 |   iii. publicity and privacy rights pertaining to a person's image or likeness
 40 |   depicted in a Work;
 41 | 
 42 |   iv. rights protecting against unfair competition in regards to a Work,
 43 |   subject to the limitations in paragraph 4(a), below;
 44 | 
 45 |   v. rights protecting the extraction, dissemination, use and reuse of data in
 46 |   a Work;
 47 | 
 48 |   vi. database rights (such as those arising under Directive 96/9/EC of the
 49 |   European Parliament and of the Council of 11 March 1996 on the legal
 50 |   protection of databases, and under any national implementation thereof,
 51 |   including any amended or successor version of such directive); and
 52 | 
 53 |   vii. other similar, equivalent or corresponding rights throughout the world
 54 |   based on applicable law or treaty, and any national implementations thereof.
 55 | 
 56 | 2. Waiver. To the greatest extent permitted by, but not in contravention of,
 57 | applicable law, Affirmer hereby overtly, fully, permanently, irrevocably and
 58 | unconditionally waives, abandons, and surrenders all of Affirmer's Copyright
 59 | and Related Rights and associated claims and causes of action, whether now
 60 | known or unknown (including existing as well as future claims and causes of
 61 | action), in the Work (i) in all territories worldwide, (ii) for the maximum
 62 | duration provided by applicable law or treaty (including future time
 63 | extensions), (iii) in any current or future medium and for any number of
 64 | copies, and (iv) for any purpose whatsoever, including without limitation
 65 | commercial, advertising or promotional purposes (the "Waiver"). Affirmer makes
 66 | the Waiver for the benefit of each member of the public at large and to the
 67 | detriment of Affirmer's heirs and successors, fully intending that such Waiver
 68 | shall not be subject to revocation, rescission, cancellation, termination, or
 69 | any other legal or equitable action to disrupt the quiet enjoyment of the Work
 70 | by the public as contemplated by Affirmer's express Statement of Purpose.
 71 | 
 72 | 3. Public License Fallback. Should any part of the Waiver for any reason be
 73 | judged legally invalid or ineffective under applicable law, then the Waiver
 74 | shall be preserved to the maximum extent permitted taking into account
 75 | Affirmer's express Statement of Purpose. In addition, to the extent the Waiver
 76 | is so judged Affirmer hereby grants to each affected person a royalty-free,
 77 | non transferable, non sublicensable, non exclusive, irrevocable and
 78 | unconditional license to exercise Affirmer's Copyright and Related Rights in
 79 | the Work (i) in all territories worldwide, (ii) for the maximum duration
 80 | provided by applicable law or treaty (including future time extensions), (iii)
 81 | in any current or future medium and for any number of copies, and (iv) for any
 82 | purpose whatsoever, including without limitation commercial, advertising or
 83 | promotional purposes (the "License"). The License shall be deemed effective as
 84 | of the date CC0 was applied by Affirmer to the Work. Should any part of the
 85 | License for any reason be judged legally invalid or ineffective under
 86 | applicable law, such partial invalidity or ineffectiveness shall not
 87 | invalidate the remainder of the License, and in such case Affirmer hereby
 88 | affirms that he or she will not (i) exercise any of his or her remaining
 89 | Copyright and Related Rights in the Work or (ii) assert any associated claims
 90 | and causes of action with respect to the Work, in either case contrary to
 91 | Affirmer's express Statement of Purpose.
 92 | 
 93 | 4. Limitations and Disclaimers.
 94 | 
 95 |   a. No trademark or patent rights held by Affirmer are waived, abandoned,
 96 |   surrendered, licensed or otherwise affected by this document.
 97 | 
 98 |   b. Affirmer offers the Work as-is and makes no representations or warranties
 99 |   of any kind concerning the Work, express, implied, statutory or otherwise,
100 |   including without limitation warranties of title, merchantability, fitness
101 |   for a particular purpose, non infringement, or the absence of latent or
102 |   other defects, accuracy, or the present or absence of errors, whether or not
103 |   discoverable, all to the greatest extent permissible under applicable law.
104 | 
105 |   c. Affirmer disclaims responsibility for clearing rights of other persons
106 |   that may apply to the Work or any use thereof, including without limitation
107 |   any person's Copyright and Related Rights in the Work. Further, Affirmer
108 |   disclaims responsibility for obtaining any necessary consents, permissions
109 |   or other rights required for any use of the Work.
110 | 
111 |   d. Affirmer understands and acknowledges that Creative Commons is not a
112 |   party to this document and has no duty or obligation with respect to this
113 |   CC0 or use of the Work.
114 | 
115 | For more information, please see
116 | <http://creativecommons.org/publicdomain/zero/1.0/>
117 | 


--------------------------------------------------------------------------------
/Readme.md:
--------------------------------------------------------------------------------
 1 | Slides and Notebooks for New York Machine Learning Meetup
 2 | =========================================================
 3 | Materials for the Scikit-learn talk on Jan 21 2016.
 4 | 
 5 | Please download the materials and install scikit-learn and the jupyter notebook to follow along.
 6 | 
 7 | Please use Jupyter / IPython version 4.0 or higher.
 8 | 
 9 | The tutorial requires scikit-learn 0.15 or higher (current is 0.17).
10 | 


--------------------------------------------------------------------------------
/data/aclImdb.tar.bz2:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/amueller/ml_meetup_nyc_2016/cfbe1a4bf3ddd457add029a5c2a4ca72878e46b0/data/aclImdb.tar.bz2


--------------------------------------------------------------------------------
/machine-learning-with-scikit-learn-nyc-ml-meetup-2016.odp:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/amueller/ml_meetup_nyc_2016/cfbe1a4bf3ddd457add029a5c2a4ca72878e46b0/machine-learning-with-scikit-learn-nyc-ml-meetup-2016.odp


--------------------------------------------------------------------------------
/machine-learning-with-scikit-learn-nyc-ml-meetup-2016.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/amueller/ml_meetup_nyc_2016/cfbe1a4bf3ddd457add029a5c2a4ca72878e46b0/machine-learning-with-scikit-learn-nyc-ml-meetup-2016.pdf


--------------------------------------------------------------------------------