├── .gitignore ├── 01 - Introduction to Scikit-learn.ipynb ├── 02 - Unsupervised Transformers.ipynb ├── 03 - Cross-validation.ipynb ├── 04 - Grid Searches for Hyper Parameters.ipynb ├── 05 - Preprocessing and Pipelines.ipynb ├── 06 - Working With Text Data.ipynb ├── 07 - Out Of Core Learning.ipynb ├── 08 - Out Of Core Learning for Text.ipynb ├── LICENSE ├── Readme.md ├── data └── aclImdb.tar.bz2 ├── machine-learning-with-scikit-learn-nyc-ml-meetup-2016.odp └── machine-learning-with-scikit-learn-nyc-ml-meetup-2016.pdf /.gitignore: -------------------------------------------------------------------------------- 1 | data/aclImdb/* 2 | .ipynb_checkpoints/* 3 | data/batch* 4 | data/movies.txt 5 | -------------------------------------------------------------------------------- /01 - Introduction to Scikit-learn.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "Get some data to play with" 8 | ] 9 | }, 10 | { 11 | "cell_type": "code", 12 | "execution_count": null, 13 | "metadata": { 14 | "collapsed": false 15 | }, 16 | "outputs": [], 17 | "source": [ 18 | "from sklearn.datasets import load_digits\n", 19 | "digits = load_digits()\n", 20 | "digits.keys()" 21 | ] 22 | }, 23 | { 24 | "cell_type": "code", 25 | "execution_count": null, 26 | "metadata": { 27 | "collapsed": false 28 | }, 29 | "outputs": [], 30 | "source": [ 31 | "digits.images.shape" 32 | ] 33 | }, 34 | { 35 | "cell_type": "code", 36 | "execution_count": null, 37 | "metadata": { 38 | "collapsed": false 39 | }, 40 | "outputs": [], 41 | "source": [ 42 | "print(digits.images[0])" 43 | ] 44 | }, 45 | { 46 | "cell_type": "code", 47 | "execution_count": null, 48 | "metadata": { 49 | "collapsed": false 50 | }, 51 | "outputs": [], 52 | "source": [ 53 | "import matplotlib.pyplot as plt\n", 54 | "%matplotlib notebook\n", 55 | "\n", 56 | "plt.matshow(digits.images[0], cmap=plt.cm.Greys)" 57 | 
] 58 | }, 59 | { 60 | "cell_type": "code", 61 | "execution_count": null, 62 | "metadata": { 63 | "collapsed": false 64 | }, 65 | "outputs": [], 66 | "source": [ 67 | "digits.data.shape" 68 | ] 69 | }, 70 | { 71 | "cell_type": "code", 72 | "execution_count": null, 73 | "metadata": { 74 | "collapsed": false 75 | }, 76 | "outputs": [], 77 | "source": [ 78 | "digits.target.shape" 79 | ] 80 | }, 81 | { 82 | "cell_type": "code", 83 | "execution_count": null, 84 | "metadata": { 85 | "collapsed": false 86 | }, 87 | "outputs": [], 88 | "source": [ 89 | "digits.target" 90 | ] 91 | }, 92 | { 93 | "cell_type": "markdown", 94 | "metadata": {}, 95 | "source": [ 96 | "**Data is always a numpy array (or sparse matrix) of shape (n_samples, n_features)**" 97 | ] 98 | }, 99 | { 100 | "cell_type": "markdown", 101 | "metadata": {}, 102 | "source": [ 103 | "Split the data to get going" 104 | ] 105 | }, 106 | { 107 | "cell_type": "code", 108 | "execution_count": null, 109 | "metadata": { 110 | "collapsed": false 111 | }, 112 | "outputs": [], 113 | "source": [ 114 | "from sklearn.cross_validation import train_test_split\n", 115 | "X_train, X_test, y_train, y_test = train_test_split(digits.data,\n", 116 | " digits.target)" 117 | ] 118 | }, 119 | { 120 | "cell_type": "markdown", 121 | "metadata": {}, 122 | "source": [ 123 | "Really Simple API\n", 124 | "-------------------\n", 125 | "0) Import your model class" 126 | ] 127 | }, 128 | { 129 | "cell_type": "code", 130 | "execution_count": null, 131 | "metadata": { 132 | "collapsed": false 133 | }, 134 | "outputs": [], 135 | "source": [ 136 | "from sklearn.svm import LinearSVC" 137 | ] 138 | }, 139 | { 140 | "cell_type": "markdown", 141 | "metadata": {}, 142 | "source": [ 143 | "1) Instantiate an object and set the parameters" 144 | ] 145 | }, 146 | { 147 | "cell_type": "code", 148 | "execution_count": null, 149 | "metadata": { 150 | "collapsed": false 151 | }, 152 | "outputs": [], 153 | "source": [ 154 | "svm = LinearSVC(C=0.1)" 155 | ] 156 | 
}, 157 | { 158 | "cell_type": "markdown", 159 | "metadata": {}, 160 | "source": [ 161 | "2) Fit the model" 162 | ] 163 | }, 164 | { 165 | "cell_type": "code", 166 | "execution_count": null, 167 | "metadata": { 168 | "collapsed": false 169 | }, 170 | "outputs": [], 171 | "source": [ 172 | "svm.fit(X_train, y_train)" 173 | ] 174 | }, 175 | { 176 | "cell_type": "markdown", 177 | "metadata": {}, 178 | "source": [ 179 | "3) Apply / evaluate" 180 | ] 181 | }, 182 | { 183 | "cell_type": "code", 184 | "execution_count": null, 185 | "metadata": { 186 | "collapsed": false 187 | }, 188 | "outputs": [], 189 | "source": [ 190 | "print(svm.predict(X_train))\n", 191 | "print(y_train)" 192 | ] 193 | }, 194 | { 195 | "cell_type": "code", 196 | "execution_count": null, 197 | "metadata": { 198 | "collapsed": false 199 | }, 200 | "outputs": [], 201 | "source": [ 202 | "svm.score(X_train, y_train)" 203 | ] 204 | }, 205 | { 206 | "cell_type": "code", 207 | "execution_count": null, 208 | "metadata": { 209 | "collapsed": false 210 | }, 211 | "outputs": [], 212 | "source": [ 213 | "svm.score(X_test, y_test)" 214 | ] 215 | }, 216 | { 217 | "cell_type": "markdown", 218 | "metadata": {}, 219 | "source": [ 220 | "And again\n", 221 | "---------" 222 | ] 223 | }, 224 | { 225 | "cell_type": "code", 226 | "execution_count": null, 227 | "metadata": { 228 | "collapsed": false 229 | }, 230 | "outputs": [], 231 | "source": [ 232 | "from sklearn.ensemble import RandomForestClassifier" 233 | ] 234 | }, 235 | { 236 | "cell_type": "code", 237 | "execution_count": null, 238 | "metadata": { 239 | "collapsed": false 240 | }, 241 | "outputs": [], 242 | "source": [ 243 | "rf = RandomForestClassifier(n_estimators=50)" 244 | ] 245 | }, 246 | { 247 | "cell_type": "code", 248 | "execution_count": null, 249 | "metadata": { 250 | "collapsed": false 251 | }, 252 | "outputs": [], 253 | "source": [ 254 | "rf.fit(X_train, y_train)" 255 | ] 256 | }, 257 | { 258 | "cell_type": "code", 259 | "execution_count": null, 260 | 
"metadata": { 261 | "collapsed": false 262 | }, 263 | "outputs": [], 264 | "source": [ 265 | "rf.score(X_test, y_test)" 266 | ] 267 | }, 268 | { 269 | "cell_type": "code", 270 | "execution_count": null, 271 | "metadata": { 272 | "collapsed": true 273 | }, 274 | "outputs": [], 275 | "source": [] 276 | } 277 | ], 278 | "metadata": { 279 | "kernelspec": { 280 | "display_name": "Python 3", 281 | "language": "python", 282 | "name": "python3" 283 | }, 284 | "language_info": { 285 | "codemirror_mode": { 286 | "name": "ipython", 287 | "version": 3 288 | }, 289 | "file_extension": ".py", 290 | "mimetype": "text/x-python", 291 | "name": "python", 292 | "nbconvert_exporter": "python", 293 | "pygments_lexer": "ipython3", 294 | "version": "3.4.4" 295 | } 296 | }, 297 | "nbformat": 4, 298 | "nbformat_minor": 0 299 | } 300 | -------------------------------------------------------------------------------- /02 - Unsupervised Transformers.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": null, 6 | "metadata": { 7 | "collapsed": false 8 | }, 9 | "outputs": [], 10 | "source": [ 11 | "from sklearn.datasets import load_digits\n", 12 | "from sklearn.cross_validation import train_test_split\n", 13 | "import numpy as np\n", 14 | "np.set_printoptions(suppress=True)\n", 15 | "\n", 16 | "digits = load_digits()\n", 17 | "X, y = digits.data, digits.target\n", 18 | "X_train, X_test, y_train, y_test = train_test_split(X, y)" 19 | ] 20 | }, 21 | { 22 | "cell_type": "markdown", 23 | "metadata": {}, 24 | "source": [ 25 | "Removing mean and scaling variance\n", 26 | "===================================" 27 | ] 28 | }, 29 | { 30 | "cell_type": "code", 31 | "execution_count": null, 32 | "metadata": { 33 | "collapsed": false 34 | }, 35 | "outputs": [], 36 | "source": [ 37 | "from sklearn.preprocessing import StandardScaler" 38 | ] 39 | }, 40 | { 41 | "cell_type": "markdown", 42 | "metadata": 
{}, 43 | "source": [ 44 | "1) Instantiate the model" 45 | ] 46 | }, 47 | { 48 | "cell_type": "code", 49 | "execution_count": null, 50 | "metadata": { 51 | "collapsed": false 52 | }, 53 | "outputs": [], 54 | "source": [ 55 | "scaler = StandardScaler()" 56 | ] 57 | }, 58 | { 59 | "cell_type": "markdown", 60 | "metadata": {}, 61 | "source": [ 62 | "2) Fit using only the data." 63 | ] 64 | }, 65 | { 66 | "cell_type": "code", 67 | "execution_count": null, 68 | "metadata": { 69 | "collapsed": false 70 | }, 71 | "outputs": [], 72 | "source": [ 73 | "scaler.fit(X_train)" 74 | ] 75 | }, 76 | { 77 | "cell_type": "markdown", 78 | "metadata": {}, 79 | "source": [ 80 | "3) `transform` the data (not `predict`)." 81 | ] 82 | }, 83 | { 84 | "cell_type": "code", 85 | "execution_count": null, 86 | "metadata": { 87 | "collapsed": false 88 | }, 89 | "outputs": [], 90 | "source": [ 91 | "X_train_scaled = scaler.transform(X_train)" 92 | ] 93 | }, 94 | { 95 | "cell_type": "code", 96 | "execution_count": null, 97 | "metadata": { 98 | "collapsed": false 99 | }, 100 | "outputs": [], 101 | "source": [ 102 | "X_train.shape" 103 | ] 104 | }, 105 | { 106 | "cell_type": "code", 107 | "execution_count": null, 108 | "metadata": { 109 | "collapsed": false 110 | }, 111 | "outputs": [], 112 | "source": [ 113 | "X_train_scaled.shape" 114 | ] 115 | }, 116 | { 117 | "cell_type": "markdown", 118 | "metadata": {}, 119 | "source": [ 120 | "The transformed version of the data has the mean removed:" 121 | ] 122 | }, 123 | { 124 | "cell_type": "code", 125 | "execution_count": null, 126 | "metadata": { 127 | "collapsed": false 128 | }, 129 | "outputs": [], 130 | "source": [ 131 | "X_train_scaled.mean(axis=0)" 132 | ] 133 | }, 134 | { 135 | "cell_type": "code", 136 | "execution_count": null, 137 | "metadata": { 138 | "collapsed": false 139 | }, 140 | "outputs": [], 141 | "source": [ 142 | "X_train_scaled.std(axis=0)" 143 | ] 144 | }, 145 | { 146 | "cell_type": "code", 147 | "execution_count": null, 148 | 
"metadata": { 149 | "collapsed": false 150 | }, 151 | "outputs": [], 152 | "source": [ 153 | "X_test_transformed = scaler.transform(X_test)" 154 | ] 155 | }, 156 | { 157 | "cell_type": "code", 158 | "execution_count": null, 159 | "metadata": { 160 | "collapsed": false 161 | }, 162 | "outputs": [], 163 | "source": [ 164 | "X_test_transformed.mean(axis=0)" 165 | ] 166 | }, 167 | { 168 | "cell_type": "markdown", 169 | "metadata": {}, 170 | "source": [ 171 | "Principal Component Analysis\n", 172 | "=============================" 173 | ] 174 | }, 175 | { 176 | "cell_type": "markdown", 177 | "metadata": {}, 178 | "source": [ 179 | "0) Import the model" 180 | ] 181 | }, 182 | { 183 | "cell_type": "code", 184 | "execution_count": null, 185 | "metadata": { 186 | "collapsed": false 187 | }, 188 | "outputs": [], 189 | "source": [ 190 | "from sklearn.decomposition import PCA" 191 | ] 192 | }, 193 | { 194 | "cell_type": "markdown", 195 | "metadata": {}, 196 | "source": [ 197 | "1) Instantiate the model" 198 | ] 199 | }, 200 | { 201 | "cell_type": "code", 202 | "execution_count": null, 203 | "metadata": { 204 | "collapsed": false 205 | }, 206 | "outputs": [], 207 | "source": [ 208 | "pca = PCA(n_components=2)" 209 | ] 210 | }, 211 | { 212 | "cell_type": "markdown", 213 | "metadata": {}, 214 | "source": [ 215 | "2) Fit to training data" 216 | ] 217 | }, 218 | { 219 | "cell_type": "code", 220 | "execution_count": null, 221 | "metadata": { 222 | "collapsed": false 223 | }, 224 | "outputs": [], 225 | "source": [ 226 | "pca.fit(X)" 227 | ] 228 | }, 229 | { 230 | "cell_type": "markdown", 231 | "metadata": {}, 232 | "source": [ 233 | "3) Transform to lower-dimensional representation" 234 | ] 235 | }, 236 | { 237 | "cell_type": "code", 238 | "execution_count": null, 239 | "metadata": { 240 | "collapsed": false 241 | }, 242 | "outputs": [], 243 | "source": [ 244 | "X_pca = pca.transform(X)\n", 245 | "print(X.shape)\n", 246 | "X_pca.shape" 247 | ] 248 | }, 249 | { 250 | "cell_type": 
"markdown", 251 | "metadata": {}, 252 | "source": [ 253 | "Visualize\n", 254 | "----------" 255 | ] 256 | }, 257 | { 258 | "cell_type": "code", 259 | "execution_count": null, 260 | "metadata": { 261 | "collapsed": false 262 | }, 263 | "outputs": [], 264 | "source": [ 265 | "import matplotlib.pyplot as plt\n", 266 | "%matplotlib notebook\n", 267 | "plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y)" 268 | ] 269 | }, 270 | { 271 | "cell_type": "code", 272 | "execution_count": null, 273 | "metadata": { 274 | "collapsed": false 275 | }, 276 | "outputs": [], 277 | "source": [] 278 | } 279 | ], 280 | "metadata": { 281 | "kernelspec": { 282 | "display_name": "Python 3", 283 | "language": "python", 284 | "name": "python3" 285 | }, 286 | "language_info": { 287 | "codemirror_mode": { 288 | "name": "ipython", 289 | "version": 3 290 | }, 291 | "file_extension": ".py", 292 | "mimetype": "text/x-python", 293 | "name": "python", 294 | "nbconvert_exporter": "python", 295 | "pygments_lexer": "ipython3", 296 | "version": "3.4.4" 297 | } 298 | }, 299 | "nbformat": 4, 300 | "nbformat_minor": 0 301 | } 302 | -------------------------------------------------------------------------------- /03 - Cross-validation.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "Cross-Validation\n", 8 | "----------------------------------------" 9 | ] 10 | }, 11 | { 12 | "cell_type": "code", 13 | "execution_count": null, 14 | "metadata": { 15 | "collapsed": false 16 | }, 17 | "outputs": [], 18 | "source": [ 19 | "from sklearn.datasets import load_digits" 20 | ] 21 | }, 22 | { 23 | "cell_type": "code", 24 | "execution_count": null, 25 | "metadata": { 26 | "collapsed": false 27 | }, 28 | "outputs": [], 29 | "source": [ 30 | "digits = load_digits()\n", 31 | "X = digits.data\n", 32 | "y = digits.target" 33 | ] 34 | }, 35 | { 36 | "cell_type": "code", 37 | "execution_count": null, 38 | 
"metadata": { 39 | "collapsed": false 40 | }, 41 | "outputs": [], 42 | "source": [ 43 | "from sklearn.cross_validation import cross_val_score\n", 44 | "from sklearn.svm import LinearSVC" 45 | ] 46 | }, 47 | { 48 | "cell_type": "code", 49 | "execution_count": null, 50 | "metadata": { 51 | "collapsed": false 52 | }, 53 | "outputs": [], 54 | "source": [ 55 | "cross_val_score(LinearSVC(), X, y, cv=5)" 56 | ] 57 | }, 58 | { 59 | "cell_type": "code", 60 | "execution_count": null, 61 | "metadata": { 62 | "collapsed": false 63 | }, 64 | "outputs": [], 65 | "source": [ 66 | "cross_val_score(LinearSVC(), X, y, cv=5, scoring=\"f1_macro\")" 67 | ] 68 | }, 69 | { 70 | "cell_type": "code", 71 | "execution_count": null, 72 | "metadata": { 73 | "collapsed": false 74 | }, 75 | "outputs": [], 76 | "source": [ 77 | "from sklearn.metrics.scorer import SCORERS\n", 78 | "print(SCORERS.keys())" 79 | ] 80 | }, 81 | { 82 | "cell_type": "markdown", 83 | "metadata": {}, 84 | "source": [ 85 | "There are other ways to do cross-validation" 86 | ] 87 | }, 88 | { 89 | "cell_type": "code", 90 | "execution_count": null, 91 | "metadata": { 92 | "collapsed": false 93 | }, 94 | "outputs": [], 95 | "source": [ 96 | "from sklearn.cross_validation import ShuffleSplit\n", 97 | "\n", 98 | "shuffle_split = ShuffleSplit(len(X), 10, test_size=.4)\n", 99 | "cross_val_score(LinearSVC(), X, y, cv=shuffle_split)" 100 | ] 101 | } 102 | ], 103 | "metadata": { 104 | "kernelspec": { 105 | "display_name": "Python 3", 106 | "language": "python", 107 | "name": "python3" 108 | }, 109 | "language_info": { 110 | "codemirror_mode": { 111 | "name": "ipython", 112 | "version": 3 113 | }, 114 | "file_extension": ".py", 115 | "mimetype": "text/x-python", 116 | "name": "python", 117 | "nbconvert_exporter": "python", 118 | "pygments_lexer": "ipython3", 119 | "version": "3.4.4" 120 | } 121 | }, 122 | "nbformat": 4, 123 | "nbformat_minor": 0 124 | } 125 | 
-------------------------------------------------------------------------------- /04 - Grid Searches for Hyper Parameters.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "Grid Searches\n", 8 | "=================" 9 | ] 10 | }, 11 | { 12 | "cell_type": "markdown", 13 | "metadata": {}, 14 | "source": [ 15 | "Grid search with built-in cross-validation" 16 | ] 17 | }, 18 | { 19 | "cell_type": "code", 20 | "execution_count": null, 21 | "metadata": { 22 | "collapsed": false 23 | }, 24 | "outputs": [], 25 | "source": [ 26 | "from sklearn.grid_search import GridSearchCV\n", 27 | "from sklearn.svm import SVC" 28 | ] 29 | }, 30 | { 31 | "cell_type": "code", 32 | "execution_count": null, 33 | "metadata": { 34 | "collapsed": false 35 | }, 36 | "outputs": [], 37 | "source": [ 38 | "from sklearn.datasets import load_digits\n", 39 | "from sklearn.cross_validation import train_test_split\n", 40 | "digits = load_digits()\n", 41 | "X_train, X_test, y_train, y_test = train_test_split(digits.data, digits.target)" 42 | ] 43 | }, 44 | { 45 | "cell_type": "markdown", 46 | "metadata": {}, 47 | "source": [ 48 | "Define parameter grid:" 49 | ] 50 | }, 51 | { 52 | "cell_type": "code", 53 | "execution_count": null, 54 | "metadata": { 55 | "collapsed": false 56 | }, 57 | "outputs": [], 58 | "source": [ 59 | "import numpy as np\n", 60 | "\n", 61 | "param_grid = {'C': 10. ** np.arange(-3, 3),\n", 62 | " 'gamma' : 10. 
** np.arange(-5, 0)}\n", 63 | " \n", 64 | "\n", 65 | "np.set_printoptions(suppress=True)\n", 66 | "print(param_grid)" 67 | ] 68 | }, 69 | { 70 | "cell_type": "code", 71 | "execution_count": null, 72 | "metadata": { 73 | "collapsed": false 74 | }, 75 | "outputs": [], 76 | "source": [ 77 | "grid_search = GridSearchCV(SVC(), param_grid, verbose=3, cv=5)" 78 | ] 79 | }, 80 | { 81 | "cell_type": "markdown", 82 | "metadata": {}, 83 | "source": [ 84 | "A GridSearchCV object behaves just like a normal classifier." 85 | ] 86 | }, 87 | { 88 | "cell_type": "code", 89 | "execution_count": null, 90 | "metadata": { 91 | "collapsed": false, 92 | "scrolled": true 93 | }, 94 | "outputs": [], 95 | "source": [ 96 | "grid_search.fit(X_train, y_train)" 97 | ] 98 | }, 99 | { 100 | "cell_type": "code", 101 | "execution_count": null, 102 | "metadata": { 103 | "collapsed": false, 104 | "scrolled": true 105 | }, 106 | "outputs": [], 107 | "source": [ 108 | "grid_search.predict(X_test)" 109 | ] 110 | }, 111 | { 112 | "cell_type": "code", 113 | "execution_count": null, 114 | "metadata": { 115 | "collapsed": false 116 | }, 117 | "outputs": [], 118 | "source": [ 119 | "grid_search.score(X_test, y_test)" 120 | ] 121 | }, 122 | { 123 | "cell_type": "code", 124 | "execution_count": null, 125 | "metadata": { 126 | "collapsed": false 127 | }, 128 | "outputs": [], 129 | "source": [ 130 | "grid_search.best_params_" 131 | ] 132 | }, 133 | { 134 | "cell_type": "code", 135 | "execution_count": null, 136 | "metadata": { 137 | "collapsed": false 138 | }, 139 | "outputs": [], 140 | "source": [ 141 | "# We extract just the scores\n", 142 | "%matplotlib notebook\n", 143 | "import matplotlib.pyplot as plt\n", 144 | "\n", 145 | "scores = [x[1] for x in grid_search.grid_scores_]\n", 146 | "scores = np.array(scores).reshape(6, 5)\n", 147 | "\n", 148 | "plt.matshow(scores, cmap='viridis')\n", 149 | "plt.xlabel('gamma')\n", 150 | "plt.ylabel('C')\n", 151 | "plt.colorbar()\n", 152 | "plt.xticks(np.arange(5), 
param_grid['gamma'])\n", 153 | "plt.yticks(np.arange(6), param_grid['C']);" 154 | ] 155 | }, 156 | { 157 | "cell_type": "code", 158 | "execution_count": null, 159 | "metadata": { 160 | "collapsed": true 161 | }, 162 | "outputs": [], 163 | "source": [] 164 | } 165 | ], 166 | "metadata": { 167 | "kernelspec": { 168 | "display_name": "Python 3", 169 | "language": "python", 170 | "name": "python3" 171 | }, 172 | "language_info": { 173 | "codemirror_mode": { 174 | "name": "ipython", 175 | "version": 3 176 | }, 177 | "file_extension": ".py", 178 | "mimetype": "text/x-python", 179 | "name": "python", 180 | "nbconvert_exporter": "python", 181 | "pygments_lexer": "ipython3", 182 | "version": "3.4.4" 183 | } 184 | }, 185 | "nbformat": 4, 186 | "nbformat_minor": 0 187 | } 188 | -------------------------------------------------------------------------------- /05 - Preprocessing and Pipelines.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "Preprocessing and Pipelines\n", 8 | "=============================" 9 | ] 10 | }, 11 | { 12 | "cell_type": "code", 13 | "execution_count": null, 14 | "metadata": { 15 | "collapsed": false 16 | }, 17 | "outputs": [], 18 | "source": [ 19 | "from sklearn.datasets import load_digits\n", 20 | "from sklearn.cross_validation import train_test_split\n", 21 | "digits = load_digits()\n", 22 | "X_train, X_test, y_train, y_test = train_test_split(digits.data,\n", 23 | " digits.target)" 24 | ] 25 | }, 26 | { 27 | "cell_type": "markdown", 28 | "metadata": {}, 29 | "source": [ 30 | "When cross-validating a model that includes scaling, we need to estimate the mean and standard deviation separately for each fold.\n", 31 | "To do that, we build a pipeline." 
32 | ] 33 | }, 34 | { 35 | "cell_type": "code", 36 | "execution_count": null, 37 | "metadata": { 38 | "collapsed": false 39 | }, 40 | "outputs": [], 41 | "source": [ 42 | "from sklearn.pipeline import Pipeline, make_pipeline\n", 43 | "from sklearn.svm import SVC\n", 44 | "from sklearn.preprocessing import StandardScaler" 45 | ] 46 | }, 47 | { 48 | "cell_type": "code", 49 | "execution_count": null, 50 | "metadata": { 51 | "collapsed": false 52 | }, 53 | "outputs": [], 54 | "source": [ 55 | "standard_scaler = StandardScaler()\n", 56 | "standard_scaler.fit(X_train)\n", 57 | "X_train_scaled = standard_scaler.transform(X_train)\n", 58 | "svm = SVC()\n", 59 | "svm.fit(X_train_scaled, y_train)" 60 | ] 61 | }, 62 | { 63 | "cell_type": "code", 64 | "execution_count": null, 65 | "metadata": { 66 | "collapsed": false 67 | }, 68 | "outputs": [], 69 | "source": [ 70 | "X_test_scaled = standard_scaler.transform(X_test)\n", 71 | "svm.score(X_test_scaled, y_test)" 72 | ] 73 | }, 74 | { 75 | "cell_type": "code", 76 | "execution_count": null, 77 | "metadata": { 78 | "collapsed": false 79 | }, 80 | "outputs": [], 81 | "source": [ 82 | "pipeline = make_pipeline(StandardScaler(), SVC())" 83 | ] 84 | }, 85 | { 86 | "cell_type": "code", 87 | "execution_count": null, 88 | "metadata": { 89 | "collapsed": false 90 | }, 91 | "outputs": [], 92 | "source": [ 93 | "pipeline.fit(X_train, y_train)" 94 | ] 95 | }, 96 | { 97 | "cell_type": "code", 98 | "execution_count": null, 99 | "metadata": { 100 | "collapsed": false 101 | }, 102 | "outputs": [], 103 | "source": [ 104 | "pipeline.score(X_test, y_test)" 105 | ] 106 | }, 107 | { 108 | "cell_type": "code", 109 | "execution_count": null, 110 | "metadata": { 111 | "collapsed": false, 112 | "scrolled": true 113 | }, 114 | "outputs": [], 115 | "source": [ 116 | "pipeline.predict(X_test)" 117 | ] 118 | }, 119 | { 120 | "cell_type": "markdown", 121 | "metadata": {}, 122 | "source": [ 123 | "Cross-validation with a pipeline\n", 124 | 
"---------------------------------" 125 | ] 126 | }, 127 | { 128 | "cell_type": "code", 129 | "execution_count": null, 130 | "metadata": { 131 | "collapsed": false 132 | }, 133 | "outputs": [], 134 | "source": [ 135 | "from sklearn.cross_validation import cross_val_score\n", 136 | "cross_val_score(pipeline, X_train, y_train)" 137 | ] 138 | }, 139 | { 140 | "cell_type": "markdown", 141 | "metadata": {}, 142 | "source": [ 143 | "Grid Search with a pipeline\n", 144 | "===========================" 145 | ] 146 | }, 147 | { 148 | "cell_type": "code", 149 | "execution_count": null, 150 | "metadata": { 151 | "collapsed": false 152 | }, 153 | "outputs": [], 154 | "source": [ 155 | "import numpy as np\n", 156 | "from sklearn.grid_search import GridSearchCV\n", 157 | "\n", 158 | "param_grid = {'svc__C': 10. ** np.arange(-3, 3),\n", 159 | " 'svc__gamma' : 10. ** np.arange(-3, 3)\n", 160 | " }\n", 161 | "\n", 162 | "grid_pipeline = GridSearchCV(pipeline, param_grid=param_grid) " 163 | ] 164 | }, 165 | { 166 | "cell_type": "code", 167 | "execution_count": null, 168 | "metadata": { 169 | "collapsed": false 170 | }, 171 | "outputs": [], 172 | "source": [ 173 | "grid_pipeline.fit(X_train, y_train)" 174 | ] 175 | }, 176 | { 177 | "cell_type": "code", 178 | "execution_count": null, 179 | "metadata": { 180 | "collapsed": false 181 | }, 182 | "outputs": [], 183 | "source": [ 184 | "grid_pipeline.score(X_test, y_test)" 185 | ] 186 | } 187 | ], 188 | "metadata": { 189 | "kernelspec": { 190 | "display_name": "Python 3", 191 | "language": "python", 192 | "name": "python3" 193 | }, 194 | "language_info": { 195 | "codemirror_mode": { 196 | "name": "ipython", 197 | "version": 3 198 | }, 199 | "file_extension": ".py", 200 | "mimetype": "text/x-python", 201 | "name": "python", 202 | "nbconvert_exporter": "python", 203 | "pygments_lexer": "ipython3", 204 | "version": "3.4.4" 205 | } 206 | }, 207 | "nbformat": 4, 208 | "nbformat_minor": 0 209 | } 210 | 
-------------------------------------------------------------------------------- /06 - Working With Text Data.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": null, 6 | "metadata": { 7 | "collapsed": true 8 | }, 9 | "outputs": [], 10 | "source": [ 11 | "%matplotlib notebook\n", 12 | "import matplotlib.pyplot as plt\n", 13 | "import numpy as np" 14 | ] 15 | }, 16 | { 17 | "cell_type": "markdown", 18 | "metadata": {}, 19 | "source": [ 20 | "# Text Classification of Movie Reviews" 21 | ] 22 | }, 23 | { 24 | "cell_type": "markdown", 25 | "metadata": {}, 26 | "source": [ 27 | "Unpack the data. The command below only works on Linux and (maybe?) OS X; on Windows, unpack using 7zip." 28 | ] 29 | }, 30 | { 31 | "cell_type": "code", 32 | "execution_count": null, 33 | "metadata": { 34 | "collapsed": false 35 | }, 36 | "outputs": [], 37 | "source": [ 38 | "#! tar -xf data/aclImdb.tar.bz2 --directory data" 39 | ] 40 | }, 41 | { 42 | "cell_type": "code", 43 | "execution_count": null, 44 | "metadata": { 45 | "collapsed": false 46 | }, 47 | "outputs": [], 48 | "source": [ 49 | "from sklearn.datasets import load_files\n", 50 | "\n", 51 | "reviews_train = load_files(\"data/aclImdb/train/\")\n", 52 | "text_train, y_train = reviews_train.data, reviews_train.target" 53 | ] 54 | }, 55 | { 56 | "cell_type": "code", 57 | "execution_count": null, 58 | "metadata": { 59 | "collapsed": false 60 | }, 61 | "outputs": [], 62 | "source": [ 63 | "print(\"Number of documents in training data: %d\" % len(text_train))\n", 64 | "print(np.bincount(y_train))" 65 | ] 66 | }, 67 | { 68 | "cell_type": "code", 69 | "execution_count": null, 70 | "metadata": { 71 | "collapsed": false 72 | }, 73 | "outputs": [], 74 | "source": [ 75 | "reviews_test = load_files(\"data/aclImdb/test/\")\n", 76 | "text_test, y_test = reviews_test.data, reviews_test.target\n", 77 | "print(\"Number of documents in test data: %d\" % 
len(text_test))\n", 78 | "print(np.bincount(y_test))" 79 | ] 80 | }, 81 | { 82 | "cell_type": "code", 83 | "execution_count": null, 84 | "metadata": { 85 | "collapsed": false 86 | }, 87 | "outputs": [], 88 | "source": [ 89 | "print(text_train[1])" 90 | ] 91 | }, 92 | { 93 | "cell_type": "code", 94 | "execution_count": null, 95 | "metadata": { 96 | "collapsed": false 97 | }, 98 | "outputs": [], 99 | "source": [ 100 | "print(y_train[1])" 101 | ] 102 | }, 103 | { 104 | "cell_type": "code", 105 | "execution_count": null, 106 | "metadata": { 107 | "collapsed": false 108 | }, 109 | "outputs": [], 110 | "source": [ 111 | "from sklearn.feature_extraction.text import CountVectorizer\n", 112 | "cv = CountVectorizer()\n", 113 | "cv.fit(text_train)\n", 114 | "\n", 115 | "len(cv.vocabulary_)" 116 | ] 117 | }, 118 | { 119 | "cell_type": "code", 120 | "execution_count": null, 121 | "metadata": { 122 | "collapsed": false, 123 | "scrolled": true 124 | }, 125 | "outputs": [], 126 | "source": [ 127 | "print(cv.get_feature_names()[:50])\n", 128 | "print(cv.get_feature_names()[50000:50050])" 129 | ] 130 | }, 131 | { 132 | "cell_type": "code", 133 | "execution_count": null, 134 | "metadata": { 135 | "collapsed": false 136 | }, 137 | "outputs": [], 138 | "source": [ 139 | "X_train = cv.transform(text_train)\n", 140 | "X_train" 141 | ] 142 | }, 143 | { 144 | "cell_type": "code", 145 | "execution_count": null, 146 | "metadata": { 147 | "collapsed": false 148 | }, 149 | "outputs": [], 150 | "source": [ 151 | "print(text_train[19726])" 152 | ] 153 | }, 154 | { 155 | "cell_type": "code", 156 | "execution_count": null, 157 | "metadata": { 158 | "collapsed": false 159 | }, 160 | "outputs": [], 161 | "source": [ 162 | "X_train[19726].nonzero()[1]" 163 | ] 164 | }, 165 | { 166 | "cell_type": "code", 167 | "execution_count": null, 168 | "metadata": { 169 | "collapsed": false 170 | }, 171 | "outputs": [], 172 | "source": [ 173 | "X_test = cv.transform(text_test)" 174 | ] 175 | }, 176 | { 177 | 
"cell_type": "code", 178 | "execution_count": null, 179 | "metadata": { 180 | "collapsed": false 181 | }, 182 | "outputs": [], 183 | "source": [ 184 | "from sklearn.svm import LinearSVC\n", 185 | "\n", 186 | "svm = LinearSVC()\n", 187 | "svm.fit(X_train, y_train)" 188 | ] 189 | }, 190 | { 191 | "cell_type": "code", 192 | "execution_count": null, 193 | "metadata": { 194 | "collapsed": false 195 | }, 196 | "outputs": [], 197 | "source": [ 198 | "svm.score(X_train, y_train)" 199 | ] 200 | }, 201 | { 202 | "cell_type": "code", 203 | "execution_count": null, 204 | "metadata": { 205 | "collapsed": false 206 | }, 207 | "outputs": [], 208 | "source": [ 209 | "svm.score(X_test, y_test)" 210 | ] 211 | }, 212 | { 213 | "cell_type": "code", 214 | "execution_count": null, 215 | "metadata": { 216 | "collapsed": false 217 | }, 218 | "outputs": [], 219 | "source": [ 220 | "def visualize_coefficients(classifier, feature_names, n_top_features=25):\n", 221 | " # get coefficients with large absolute values \n", 222 | " coef = classifier.coef_.ravel()\n", 223 | " positive_coefficients = np.argsort(coef)[-n_top_features:]\n", 224 | " negative_coefficients = np.argsort(coef)[:n_top_features]\n", 225 | " interesting_coefficients = np.hstack([negative_coefficients, positive_coefficients])\n", 226 | " # plot them\n", 227 | " plt.figure(figsize=(15, 5))\n", 228 | " colors = [\"red\" if c < 0 else \"blue\" for c in coef[interesting_coefficients]]\n", 229 | " plt.bar(np.arange(2 * n_top_features), coef[interesting_coefficients], color=colors)\n", 230 | " feature_names = np.array(feature_names)\n", 231 | " plt.subplots_adjust(bottom=0.3)\n", 232 | " plt.xticks(np.arange(1, 1 + 2 * n_top_features), feature_names[interesting_coefficients], rotation=60, ha=\"right\");\n" 233 | ] 234 | }, 235 | { 236 | "cell_type": "code", 237 | "execution_count": null, 238 | "metadata": { 239 | "collapsed": false 240 | }, 241 | "outputs": [], 242 | "source": [ 243 | "visualize_coefficients(svm, 
cv.get_feature_names())" 244 | ] 245 | }, 246 | { 247 | "cell_type": "code", 248 | "execution_count": null, 249 | "metadata": { 250 | "collapsed": false 251 | }, 252 | "outputs": [], 253 | "source": [ 254 | "svm = LinearSVC(C=0.001)\n", 255 | "svm.fit(X_train, y_train)\n", 256 | "svm.score(X_test, y_test)" 257 | ] 258 | }, 259 | { 260 | "cell_type": "code", 261 | "execution_count": null, 262 | "metadata": { 263 | "collapsed": false 264 | }, 265 | "outputs": [], 266 | "source": [ 267 | "visualize_coefficients(svm, cv.get_feature_names())" 268 | ] 269 | }, 270 | { 271 | "cell_type": "code", 272 | "execution_count": null, 273 | "metadata": { 274 | "collapsed": false 275 | }, 276 | "outputs": [], 277 | "source": [ 278 | "from sklearn.pipeline import make_pipeline\n", 279 | "text_pipe = make_pipeline(CountVectorizer(), LinearSVC())\n", 280 | "text_pipe.fit(text_train, y_train)\n", 281 | "text_pipe.score(text_test, y_test)" 282 | ] 283 | }, 284 | { 285 | "cell_type": "code", 286 | "execution_count": null, 287 | "metadata": { 288 | "collapsed": false, 289 | "scrolled": true 290 | }, 291 | "outputs": [], 292 | "source": [ 293 | "from sklearn.grid_search import GridSearchCV\n", 294 | "\n", 295 | "param_grid = {'linearsvc__C': np.logspace(-5, 0, 6)}\n", 296 | "grid = GridSearchCV(text_pipe, param_grid, cv=5)\n", 297 | "grid.fit(text_train, y_train)" 298 | ] 299 | }, 300 | { 301 | "cell_type": "code", 302 | "execution_count": null, 303 | "metadata": { 304 | "collapsed": false 305 | }, 306 | "outputs": [], 307 | "source": [ 308 | "grid.best_params_" 309 | ] 310 | }, 311 | { 312 | "cell_type": "code", 313 | "execution_count": null, 314 | "metadata": { 315 | "collapsed": false 316 | }, 317 | "outputs": [], 318 | "source": [ 319 | "visualize_coefficients(grid.best_estimator_.named_steps['linearsvc'],\n", 320 | " grid.best_estimator_.named_steps['countvectorizer'].get_feature_names())" 321 | ] 322 | }, 323 | { 324 | "cell_type": "code", 325 | "execution_count": null, 326 | 
"metadata": { 327 | "collapsed": false 328 | }, 329 | "outputs": [], 330 | "source": [ 331 | "grid.score(text_test, y_test)" 332 | ] 333 | }, 334 | { 335 | "cell_type": "markdown", 336 | "metadata": {}, 337 | "source": [ 338 | "# N-Grams" 339 | ] 340 | }, 341 | { 342 | "cell_type": "code", 343 | "execution_count": null, 344 | "metadata": { 345 | "collapsed": false, 346 | "scrolled": true 347 | }, 348 | "outputs": [], 349 | "source": [ 350 | "text_pipe = make_pipeline(CountVectorizer(), LinearSVC())\n", 351 | "\n", 352 | "param_grid = {'linearsvc__C': np.logspace(-3, 2, 6),\n", 353 | " \"countvectorizer__ngram_range\": [(1, 1), (1, 2)]}\n", 354 | "\n", 355 | "grid = GridSearchCV(text_pipe, param_grid, cv=5)\n", 356 | "\n", 357 | "grid.fit(text_train, y_train)" 358 | ] 359 | }, 360 | { 361 | "cell_type": "code", 362 | "execution_count": null, 363 | "metadata": { 364 | "collapsed": false 365 | }, 366 | "outputs": [], 367 | "source": [ 368 | "scores = np.array([score.mean_validation_score for score in grid.grid_scores_]).reshape(2, -1)\n", 369 | "plt.matshow(scores)\n", 370 | "plt.ylabel(\"n-gram range\")\n", 371 | "plt.yticks(range(2), param_grid[\"countvectorizer__ngram_range\"])\n", 372 | "plt.xlabel(\"C\")\n", 373 | "plt.xticks(range(6), param_grid[\"linearsvc__C\"]);\n", 374 | "plt.colorbar()" 375 | ] 376 | }, 377 | { 378 | "cell_type": "code", 379 | "execution_count": null, 380 | "metadata": { 381 | "collapsed": false 382 | }, 383 | "outputs": [], 384 | "source": [ 385 | "grid.best_params_" 386 | ] 387 | }, 388 | { 389 | "cell_type": "code", 390 | "execution_count": null, 391 | "metadata": { 392 | "collapsed": false 393 | }, 394 | "outputs": [], 395 | "source": [ 396 | "visualize_coefficients(grid.best_estimator_.named_steps['linearsvc'],\n", 397 | " grid.best_estimator_.named_steps['countvectorizer'].get_feature_names())" 398 | ] 399 | }, 400 | { 401 | "cell_type": "code", 402 | "execution_count": null, 403 | "metadata": { 404 | "collapsed": false 405 | }, 406 | 
"outputs": [], 407 | "source": [ 408 | "grid.score(text_test, y_test)" 409 | ] 410 | }, 411 | { 412 | "cell_type": "markdown", 413 | "metadata": {}, 414 | "source": [ 415 | "## Look at SpaCy and NLTK" 416 | ] 417 | } 418 | ], 419 | "metadata": { 420 | "kernelspec": { 421 | "display_name": "Python 3", 422 | "language": "python", 423 | "name": "python3" 424 | }, 425 | "language_info": { 426 | "codemirror_mode": { 427 | "name": "ipython", 428 | "version": 3 429 | }, 430 | "file_extension": ".py", 431 | "mimetype": "text/x-python", 432 | "name": "python", 433 | "nbconvert_exporter": "python", 434 | "pygments_lexer": "ipython3", 435 | "version": "3.4.4" 436 | } 437 | }, 438 | "nbformat": 4, 439 | "nbformat_minor": 0 440 | } 441 | -------------------------------------------------------------------------------- /07 - Out Of Core Learning.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": null, 6 | "metadata": { 7 | "collapsed": false 8 | }, 9 | "outputs": [], 10 | "source": [ 11 | "# write out some toy data\n", 12 | "from sklearn.datasets import load_digits\n", 13 | "import pickle\n", 14 | "\n", 15 | "digits = load_digits()\n", 16 | "\n", 17 | "X, y = digits.data, digits.target\n", 18 | "\n", 19 | "for i in range(10):\n", 20 | " pickle.dump((X[i::10] / 16., y[i::10]),\n", 21 | " open(\"data/batch_%02d.pickle\" % i, \"wb\"), -1)" 22 | ] 23 | }, 24 | { 25 | "cell_type": "code", 26 | "execution_count": null, 27 | "metadata": { 28 | "collapsed": false 29 | }, 30 | "outputs": [], 31 | "source": [ 32 | "from sklearn.linear_model import SGDClassifier" 33 | ] 34 | }, 35 | { 36 | "cell_type": "code", 37 | "execution_count": null, 38 | "metadata": { 39 | "collapsed": false 40 | }, 41 | "outputs": [], 42 | "source": [ 43 | "sgd = SGDClassifier(random_state=0)\n", 44 | "for i in range(9):\n", 45 | " X_batch, y_batch = pickle.load(open(\"data/batch_%02d.pickle\" % i, 
\"rb\"))\n", 46 | " sgd.partial_fit(X_batch, y_batch, classes=range(10))" 47 | ] 48 | }, 49 | { 50 | "cell_type": "code", 51 | "execution_count": null, 52 | "metadata": { 53 | "collapsed": false 54 | }, 55 | "outputs": [], 56 | "source": [ 57 | "X_test, y_test = pickle.load(open(\"data/batch_09.pickle\", \"rb\"))\n", 58 | "\n", 59 | "sgd.score(X_test, y_test)" 60 | ] 61 | } 62 | ], 63 | "metadata": { 64 | "kernelspec": { 65 | "display_name": "Python 3", 66 | "language": "python", 67 | "name": "python3" 68 | }, 69 | "language_info": { 70 | "codemirror_mode": { 71 | "name": "ipython", 72 | "version": 3 73 | }, 74 | "file_extension": ".py", 75 | "mimetype": "text/x-python", 76 | "name": "python", 77 | "nbconvert_exporter": "python", 78 | "pygments_lexer": "ipython3", 79 | "version": "3.4.4" 80 | } 81 | }, 82 | "nbformat": 4, 83 | "nbformat_minor": 0 84 | } 85 | -------------------------------------------------------------------------------- /08 - Out Of Core Learning for Text.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": null, 6 | "metadata": { 7 | "collapsed": false 8 | }, 9 | "outputs": [], 10 | "source": [ 11 | "import matplotlib.pyplot as plt\n", 12 | "import numpy as np\n", 13 | "%matplotlib notebook" 14 | ] 15 | }, 16 | { 17 | "cell_type": "markdown", 18 | "metadata": {}, 19 | "source": [ 20 | "# Out of core text classification with the Hashing Vectorizer" 21 | ] 22 | }, 23 | { 24 | "cell_type": "markdown", 25 | "metadata": {}, 26 | "source": [ 27 | "Using the Amazon movie reviews collected by J. McAuley and J. 
Leskovec\n", 28 | "\n", 29 | "https://snap.stanford.edu/data/web-Movies.html" 30 | ] 31 | }, 32 | { 33 | "cell_type": "code", 34 | "execution_count": null, 35 | "metadata": { 36 | "collapsed": false 37 | }, 38 | "outputs": [], 39 | "source": [ 40 | "import os\n", 41 | "print(\"file size: %d GB\" % (os.path.getsize(\"data/movies.txt\") / 1024 ** 3))" 42 | ] 43 | }, 44 | { 45 | "cell_type": "code", 46 | "execution_count": null, 47 | "metadata": { 48 | "collapsed": false 49 | }, 50 | "outputs": [], 51 | "source": [ 52 | "with open(\"data/movies.txt\") as f:\n", 53 | " print(f.read(4000))" 54 | ] 55 | }, 56 | { 57 | "cell_type": "code", 58 | "execution_count": null, 59 | "metadata": { 60 | "collapsed": false 61 | }, 62 | "outputs": [], 63 | "source": [ 64 | "def review_iter(f):\n", 65 | " current_post = []\n", 66 | " for line in f:\n", 67 | " if line.startswith(\"product/productId\"):\n", 68 | " if len(current_post):\n", 69 | " score = current_post[3][len(\"review/score: \"):].strip()\n", 70 | " review = \"\".join(current_post[6:])[len(\"review/text: \"):].strip()\n", 71 | " # there are about 20 posts with linebreaks in them.\n", 72 | " # we just ignore those for simplicity\n", 73 | " try:\n", 74 | " yield int(float(score)), review\n", 75 | " except ValueError:\n", 76 | " current_post = []\n", 77 | " continue\n", 78 | " current_post = []\n", 79 | " else:\n", 80 | " current_post.append(line)" 81 | ] 82 | }, 83 | { 84 | "cell_type": "code", 85 | "execution_count": null, 86 | "metadata": { 87 | "collapsed": false, 88 | "scrolled": false 89 | }, 90 | "outputs": [], 91 | "source": [ 92 | "n_reviews = 0\n", 93 | "with open(\"data/movies.txt\", 'r', errors='ignore') as f:\n", 94 | " for r in review_iter(f):\n", 95 | " n_reviews += 1\n", 96 | "\n", 97 | "print(\"Number of reviews: %d\" % n_reviews)" 98 | ] 99 | }, 100 | { 101 | "cell_type": "code", 102 | "execution_count": null, 103 | "metadata": { 104 | "collapsed": false 105 | }, 106 | "outputs": [], 107 | "source": [ 108 | "from 
itertools import islice\n", 109 | "\n", 110 | "with open(\"data/movies.txt\", 'r', errors='ignore') as f:\n", 111 | " reviews = islice(review_iter(f), 10000)\n", 112 | " scores, texts = zip(*reviews)\n", 113 | "print(np.bincount(scores))" 114 | ] 115 | }, 116 | { 117 | "cell_type": "code", 118 | "execution_count": null, 119 | "metadata": { 120 | "collapsed": false 121 | }, 122 | "outputs": [], 123 | "source": [] 124 | }, 125 | { 126 | "cell_type": "code", 127 | "execution_count": null, 128 | "metadata": { 129 | "collapsed": false 130 | }, 131 | "outputs": [], 132 | "source": [ 133 | "from itertools import zip_longest # use izip_longest on Python 2\n", 134 | "# from the itertools recipes\n", 135 | "def grouper(iterable, n, fillvalue=None):\n", 136 | " \"Collect data into fixed-length chunks or blocks\"\n", 137 | " # grouper('ABCDEFG', 3, 'x') --> ABC DEF Gxx\n", 138 | " args = [iter(iterable)] * n\n", 139 | " return zip_longest(fillvalue=fillvalue, *args)" 140 | ] 141 | }, 142 | { 143 | "cell_type": "code", 144 | "execution_count": null, 145 | "metadata": { 146 | "collapsed": true 147 | }, 148 | "outputs": [], 149 | "source": [ 150 | "def preprocess_batch(reviews):\n", 151 | " # score == 3 is \"neutral\", we only want \"positive\" or \"negative\"\n", 152 | " reviews_filtered = [r for r in reviews if r is not None and r[0] != 3]\n", 153 | " scores, texts = zip(*reviews_filtered)\n", 154 | " polarity = np.array(scores) > 3\n", 155 | " return polarity, texts" 156 | ] 157 | }, 158 | { 159 | "cell_type": "code", 160 | "execution_count": null, 161 | "metadata": { 162 | "collapsed": false, 163 | "scrolled": true 164 | }, 165 | "outputs": [], 166 | "source": [ 167 | "from sklearn.feature_extraction.text import HashingVectorizer\n", 168 | "\n", 169 | "vectorizer = HashingVectorizer(decode_error=\"ignore\")\n", 170 | "\n", 171 | "with open(\"data/movies.txt\") as f:\n", 172 | " reviews = islice(review_iter(f), 10000)\n", 173 | " polarity_test, texts_test = preprocess_batch(reviews)\n", 174 | 
" X_test = vectorizer.transform(texts_test)" 175 | ] 176 | }, 177 | { 178 | "cell_type": "code", 179 | "execution_count": null, 180 | "metadata": { 181 | "collapsed": false 182 | }, 183 | "outputs": [], 184 | "source": [ 185 | "from sklearn.linear_model import SGDClassifier\n", 186 | "\n", 187 | "sgd = SGDClassifier(random_state=0)\n", 188 | "\n", 189 | "accuracies = []\n", 190 | "with open(\"data/movies.txt\") as f:\n", 191 | " training_set = islice(review_iter(f), 10000, None)\n", 192 | " batch_iter = grouper(training_set, 10000)\n", 193 | " for batch in batch_iter:\n", 194 | " polarity, texts = preprocess_batch(batch)\n", 195 | " X = vectorizer.transform(texts)\n", 196 | " sgd.partial_fit(X, polarity, classes=[0, 1])\n", 197 | " accuracies.append(sgd.score(X_test, polarity_test))" 198 | ] 199 | }, 200 | { 201 | "cell_type": "code", 202 | "execution_count": null, 203 | "metadata": { 204 | "collapsed": false 205 | }, 206 | "outputs": [], 207 | "source": [ 208 | "plt.plot(accuracies)" 209 | ] 210 | } 211 | ], 212 | "metadata": { 213 | "kernelspec": { 214 | "display_name": "Python 3", 215 | "language": "python", 216 | "name": "python3" 217 | }, 218 | "language_info": { 219 | "codemirror_mode": { 220 | "name": "ipython", 221 | "version": 3 222 | }, 223 | "file_extension": ".py", 224 | "mimetype": "text/x-python", 225 | "name": "python", 226 | "nbconvert_exporter": "python", 227 | "pygments_lexer": "ipython3", 228 | "version": "3.4.4" 229 | } 230 | }, 231 | "nbformat": 4, 232 | "nbformat_minor": 0 233 | } 234 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | CC0 1.0 Universal 2 | 3 | Statement of Purpose 4 | 5 | The laws of most jurisdictions throughout the world automatically confer 6 | exclusive Copyright and Related Rights (defined below) upon the creator and 7 | subsequent owner(s) (each and all, an "owner") of an original work of 8 | 
authorship and/or a database (each, a "Work"). 9 | 10 | Certain owners wish to permanently relinquish those rights to a Work for the 11 | purpose of contributing to a commons of creative, cultural and scientific 12 | works ("Commons") that the public can reliably and without fear of later 13 | claims of infringement build upon, modify, incorporate in other works, reuse 14 | and redistribute as freely as possible in any form whatsoever and for any 15 | purposes, including without limitation commercial purposes. These owners may 16 | contribute to the Commons to promote the ideal of a free culture and the 17 | further production of creative, cultural and scientific works, or to gain 18 | reputation or greater distribution for their Work in part through the use and 19 | efforts of others. 20 | 21 | For these and/or other purposes and motivations, and without any expectation 22 | of additional consideration or compensation, the person associating CC0 with a 23 | Work (the "Affirmer"), to the extent that he or she is an owner of Copyright 24 | and Related Rights in the Work, voluntarily elects to apply CC0 to the Work 25 | and publicly distribute the Work under its terms, with knowledge of his or her 26 | Copyright and Related Rights in the Work and the meaning and intended legal 27 | effect of CC0 on those rights. 28 | 29 | 1. Copyright and Related Rights. A Work made available under CC0 may be 30 | protected by copyright and related or neighboring rights ("Copyright and 31 | Related Rights"). Copyright and Related Rights include, but are not limited 32 | to, the following: 33 | 34 | i. the right to reproduce, adapt, distribute, perform, display, communicate, 35 | and translate a Work; 36 | 37 | ii. moral rights retained by the original author(s) and/or performer(s); 38 | 39 | iii. publicity and privacy rights pertaining to a person's image or likeness 40 | depicted in a Work; 41 | 42 | iv. 
rights protecting against unfair competition in regards to a Work, 43 | subject to the limitations in paragraph 4(a), below; 44 | 45 | v. rights protecting the extraction, dissemination, use and reuse of data in 46 | a Work; 47 | 48 | vi. database rights (such as those arising under Directive 96/9/EC of the 49 | European Parliament and of the Council of 11 March 1996 on the legal 50 | protection of databases, and under any national implementation thereof, 51 | including any amended or successor version of such directive); and 52 | 53 | vii. other similar, equivalent or corresponding rights throughout the world 54 | based on applicable law or treaty, and any national implementations thereof. 55 | 56 | 2. Waiver. To the greatest extent permitted by, but not in contravention of, 57 | applicable law, Affirmer hereby overtly, fully, permanently, irrevocably and 58 | unconditionally waives, abandons, and surrenders all of Affirmer's Copyright 59 | and Related Rights and associated claims and causes of action, whether now 60 | known or unknown (including existing as well as future claims and causes of 61 | action), in the Work (i) in all territories worldwide, (ii) for the maximum 62 | duration provided by applicable law or treaty (including future time 63 | extensions), (iii) in any current or future medium and for any number of 64 | copies, and (iv) for any purpose whatsoever, including without limitation 65 | commercial, advertising or promotional purposes (the "Waiver"). Affirmer makes 66 | the Waiver for the benefit of each member of the public at large and to the 67 | detriment of Affirmer's heirs and successors, fully intending that such Waiver 68 | shall not be subject to revocation, rescission, cancellation, termination, or 69 | any other legal or equitable action to disrupt the quiet enjoyment of the Work 70 | by the public as contemplated by Affirmer's express Statement of Purpose. 71 | 72 | 3. Public License Fallback. 
Should any part of the Waiver for any reason be 73 | judged legally invalid or ineffective under applicable law, then the Waiver 74 | shall be preserved to the maximum extent permitted taking into account 75 | Affirmer's express Statement of Purpose. In addition, to the extent the Waiver 76 | is so judged Affirmer hereby grants to each affected person a royalty-free, 77 | non transferable, non sublicensable, non exclusive, irrevocable and 78 | unconditional license to exercise Affirmer's Copyright and Related Rights in 79 | the Work (i) in all territories worldwide, (ii) for the maximum duration 80 | provided by applicable law or treaty (including future time extensions), (iii) 81 | in any current or future medium and for any number of copies, and (iv) for any 82 | purpose whatsoever, including without limitation commercial, advertising or 83 | promotional purposes (the "License"). The License shall be deemed effective as 84 | of the date CC0 was applied by Affirmer to the Work. Should any part of the 85 | License for any reason be judged legally invalid or ineffective under 86 | applicable law, such partial invalidity or ineffectiveness shall not 87 | invalidate the remainder of the License, and in such case Affirmer hereby 88 | affirms that he or she will not (i) exercise any of his or her remaining 89 | Copyright and Related Rights in the Work or (ii) assert any associated claims 90 | and causes of action with respect to the Work, in either case contrary to 91 | Affirmer's express Statement of Purpose. 92 | 93 | 4. Limitations and Disclaimers. 94 | 95 | a. No trademark or patent rights held by Affirmer are waived, abandoned, 96 | surrendered, licensed or otherwise affected by this document. 97 | 98 | b. 
Affirmer offers the Work as-is and makes no representations or warranties 99 | of any kind concerning the Work, express, implied, statutory or otherwise, 100 | including without limitation warranties of title, merchantability, fitness 101 | for a particular purpose, non infringement, or the absence of latent or 102 | other defects, accuracy, or the present or absence of errors, whether or not 103 | discoverable, all to the greatest extent permissible under applicable law. 104 | 105 | c. Affirmer disclaims responsibility for clearing rights of other persons 106 | that may apply to the Work or any use thereof, including without limitation 107 | any person's Copyright and Related Rights in the Work. Further, Affirmer 108 | disclaims responsibility for obtaining any necessary consents, permissions 109 | or other rights required for any use of the Work. 110 | 111 | d. Affirmer understands and acknowledges that Creative Commons is not a 112 | party to this document and has no duty or obligation with respect to this 113 | CC0 or use of the Work. 114 | 115 | For more information, please see 116 | 117 | -------------------------------------------------------------------------------- /Readme.md: -------------------------------------------------------------------------------- 1 | Slides and Notebooks for New York Machine Learning Meetup 2 | ========================================================= 3 | Materials for the Scikit-learn talk on Jan 21 2016. 4 | 5 | Please download the materials and install scikit-learn and the jupyter notebook to follow along. 6 | 7 | Please use Jupyter / IPython in Version 4.0 or higher. 8 | 9 | The tutorial requires scikit-learn 0.15 or higher (current is 0.17). 
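
To confirm that the installed scikit-learn meets the requirement above before opening the notebooks, a minimal sketch like the following may help. The helper names `version_tuple` and `meets_minimum` are hypothetical and not part of the original materials; the `0.15` minimum is taken from the requirement stated above. Note that the notebooks import from `sklearn.grid_search`, which was deprecated in scikit-learn 0.18 and removed in 0.20, so a release close to 0.17 is likely the safest choice for running them unchanged.

```python
import re


def version_tuple(version):
    """Turn a version string such as '0.17.1' into a tuple of ints, (0, 17, 1).

    Parsing stops at the first component that does not start with a digit,
    so a suffix such as '.dev0' is ignored. (Hypothetical helper.)
    """
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def meets_minimum(version, minimum="0.15"):
    """Return True if `version` is at least `minimum` (assumed from the Readme)."""
    return version_tuple(version) >= version_tuple(minimum)


try:
    import sklearn
    print("scikit-learn %s, meets minimum: %s"
          % (sklearn.__version__, meets_minimum(sklearn.__version__)))
except ImportError:
    print("scikit-learn is not installed")
```

The comparison uses integer tuples rather than raw strings so that, for example, `0.9` is not treated as newer than `0.15`.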
10 | -------------------------------------------------------------------------------- /data/aclImdb.tar.bz2: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/amueller/ml_meetup_nyc_2016/cfbe1a4bf3ddd457add029a5c2a4ca72878e46b0/data/aclImdb.tar.bz2 -------------------------------------------------------------------------------- /machine-learning-with-scikit-learn-nyc-ml-meetup-2016.odp: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/amueller/ml_meetup_nyc_2016/cfbe1a4bf3ddd457add029a5c2a4ca72878e46b0/machine-learning-with-scikit-learn-nyc-ml-meetup-2016.odp -------------------------------------------------------------------------------- /machine-learning-with-scikit-learn-nyc-ml-meetup-2016.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/amueller/ml_meetup_nyc_2016/cfbe1a4bf3ddd457add029a5c2a4ca72878e46b0/machine-learning-with-scikit-learn-nyc-ml-meetup-2016.pdf --------------------------------------------------------------------------------