├── .gitignore ├── README.md ├── sms.tsv ├── tutorial.ipynb ├── tutorial.py ├── tutorial_with_output.ipynb └── youtube.jpg /.gitignore: -------------------------------------------------------------------------------- 1 | .ipynb_checkpoints/ 2 | .DS_Store 3 | *.pyc 4 | extras/ 5 | *.tpl 6 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | ## Tutorial: Machine Learning with Text in scikit-learn 2 | 3 | Presented by [Kevin Markham](http://www.dataschool.io/about/) at PyData DC on October 7, 2016. Watch the complete [tutorial video](https://www.youtube.com/watch?v=vTaxdJ6VYWE) on YouTube. 4 | 5 | [![Watch the complete tutorial video on YouTube](youtube.jpg)](https://www.youtube.com/watch?v=vTaxdJ6VYWE "Machine Learning with Text in scikit-learn - PyData DC 2016") 6 | 7 | ### Description 8 | 9 | Although numeric data is easy to work with in Python, most knowledge created by humans is actually raw, unstructured text. By learning how to transform text into data that is usable by machine learning models, you drastically increase the amount of data that your models can learn from. In this tutorial, we'll build and evaluate predictive models from real-world text using scikit-learn. 10 | 11 | ### Objectives 12 | 13 | By the end of this tutorial, participants will be able to confidently build a predictive model from their own text-based data, including feature extraction, model building and model evaluation. 14 | 15 | ### Required Software 16 | 17 | Participants will need to bring a laptop with [scikit-learn](http://scikit-learn.org/stable/install.html) and [pandas](http://pandas.pydata.org/pandas-docs/stable/install.html) (and their dependencies) already installed. Installing the [Anaconda distribution of Python](https://www.continuum.io/downloads) is an easy way to accomplish this. Both Python 2 and 3 are welcome. 18 | 19 | I will be leading the tutorial using the Jupyter notebook, and have added a pre-written notebook to this repository. I have also created a Python script that is identical to the notebook, which you can use in the Python environment of your choice. 20 | 21 | ### Tutorial Files 22 | 23 | * Jupyter notebooks: [tutorial.ipynb](tutorial.ipynb), [tutorial_with_output.ipynb](tutorial_with_output.ipynb) 24 | * Python script: [tutorial.py](tutorial.py) 25 | * Dataset: [sms.tsv](sms.tsv) 26 | 27 | ### Prerequisite Knowledge 28 | 29 | To get the most out of this tutorial, participants should be comfortable working in Python, should understand the basic principles of machine learning, and should have at least basic experience with both pandas and scikit-learn. However, no knowledge of advanced mathematics is required. 30 | 31 | ### Abstract 32 | 33 | It can be difficult to figure out how to work with text in scikit-learn, even if you're already comfortable with the scikit-learn API. Many questions immediately come up: Which vectorizer should I use, and why? What's the difference between a "fit" and a "transform"? What's a document-term matrix, and why is it so sparse? Is it okay for my training data to have more features than observations? What's the appropriate machine learning model to use? And so on... 34 | 35 | In this tutorial, we'll answer all of those questions, and more! We'll start by walking through the vectorization process in order to understand the input and output formats. 
Then we'll read a simple dataset into pandas, and immediately apply what we've learned about vectorization. We'll move on to the model building process, including a discussion of which model is most appropriate for the task. We'll end by evaluating our model a few different ways. 36 | 37 | ### Detailed Outline 38 | 39 | 1. Model building in scikit-learn (refresher) 40 | 2. Representing text as numerical data 41 | 3. Reading a text-based dataset into pandas 42 | 4. Vectorizing our dataset 43 | 5. Building and evaluating a model 44 | 6. Comparing models 45 | 46 | ### About the Instructor 47 | 48 | Kevin Markham is the founder of [Data School](http://www.dataschool.io/) and the former lead instructor for General Assembly's part-time Data Science course in Washington, DC. He is passionate about teaching data science to people who are new to the field, regardless of their educational and professional backgrounds, and he enjoys teaching both online and in the classroom. Kevin's professional focus is supervised machine learning, which led him to create the popular [scikit-learn video series](https://github.com/justmarkham/scikit-learn-videos) for Kaggle. He has a degree in Computer Engineering from Vanderbilt University. 49 | 50 | * Email: [kevin@dataschool.io](mailto:kevin@dataschool.io) 51 | * Twitter: [@justmarkham](https://twitter.com/justmarkham) 52 | 53 | ### Recommended Resources 54 | 55 | **scikit-learn:** 56 | * For a thorough introduction to machine learning with scikit-learn, I recommend watching my [scikit-learn video series](https://github.com/justmarkham/scikit-learn-videos) (9 videos, 4 hours). 57 | * If you prefer a written resource for learning scikit-learn, you may like the [tutorials](http://scikit-learn.org/stable/tutorial/index.html) from the scikit-learn documentation. 58 | * The scikit-learn user guide includes an excellent section on [text feature extraction](http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction) that includes many details not covered in today's tutorial. 59 | * The user guide also describes the [performance trade-offs](http://scikit-learn.org/stable/modules/computational_performance.html#influence-of-the-input-data-representation) involved when choosing between sparse and dense input data representations. 60 | 61 | **pandas:** 62 | * For a thorough introduction to data analysis, manipulation, and visualization with pandas, I recommend watching my [pandas Q&A video series](https://github.com/justmarkham/pandas-videos) (30 videos, 7 hours). 63 | * If you prefer a written resource for learning pandas, here are my [top 8 recommended resources](http://www.dataschool.io/best-python-pandas-resources/). 64 | 65 | **Text classification:** 66 | * Read Paul Graham's classic post, [A Plan for Spam](http://www.paulgraham.com/spam.html), for an overview of a basic text classification system using a Bayesian approach. (He also wrote a [follow-up post](http://www.paulgraham.com/better.html) about how he improved his spam filter.) 67 | * Coursera's Natural Language Processing (NLP) course has [video lectures](https://www.youtube.com/playlist?list=PL6397E4B26D00A269) on text classification, tokenization, Naive Bayes, and many other fundamental NLP topics. (Here are the [slides](http://web.stanford.edu/~jurafsky/NLPCourseraSlides.html) used in all of the videos.) 
68 | * [Automatically Categorizing Yelp Businesses](http://engineeringblog.yelp.com/2015/09/automatically-categorizing-yelp-businesses.html) discusses how Yelp uses NLP and scikit-learn to solve the problem of uncategorized businesses. 69 | * [How to Read the Mind of a Supreme Court Justice](http://fivethirtyeight.com/features/how-to-read-the-mind-of-a-supreme-court-justice/) discusses CourtCast, a machine learning model that predicts the outcome of Supreme Court cases using text-based features only. (The CourtCast creator wrote a post explaining [how it works](https://sciencecowboy.wordpress.com/2015/03/05/predicting-the-supreme-court-from-oral-arguments/), and the [Python code](https://github.com/nasrallah/CourtCast) is available on GitHub.) 70 | * [Identifying Humorous Cartoon Captions](http://www.cs.huji.ac.il/~dshahaf/pHumor.pdf) is a readable paper about identifying funny captions submitted to the New Yorker Caption Contest. 71 | * In this [PyData video](https://www.youtube.com/watch?v=y3ZTKFZ-1QQ) (50 minutes), Facebook explains how they use scikit-learn for sentiment classification by training a Naive Bayes model on emoji-labeled data. 72 | 73 | **Naive Bayes and logistic regression:** 74 | * Read this brief Quora post on [airport security](http://www.quora.com/In-laymans-terms-how-does-Naive-Bayes-work/answer/Konstantin-Tt) for an intuitive explanation of how Naive Bayes classification works. 75 | * For a longer introduction to Naive Bayes, read Sebastian Raschka's article on [Naive Bayes and Text Classification](http://sebastianraschka.com/Articles/2014_naive_bayes_1.html). As well, Wikipedia has two excellent articles ([Naive Bayes classifier](http://en.wikipedia.org/wiki/Naive_Bayes_classifier) and [Naive Bayes spam filtering](http://en.wikipedia.org/wiki/Naive_Bayes_spam_filtering)), and Cross Validated has a good [Q&A](http://stats.stackexchange.com/questions/21822/understanding-naive-bayes). 76 | * My [guide to an in-depth understanding of logistic regression](http://www.dataschool.io/guide-to-logistic-regression/) includes a lesson notebook and a curated list of resources for going deeper into this topic. 77 | * [Comparison of Machine Learning Models](https://github.com/justmarkham/DAT8/blob/master/other/model_comparison.md) lists the advantages and disadvantages of Naive Bayes, logistic regression, and other classification and regression models. 78 | -------------------------------------------------------------------------------- /tutorial.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Tutorial: Machine Learning with Text in scikit-learn" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "## Agenda\n", 15 | "\n", 16 | "1. Model building in scikit-learn (refresher)\n", 17 | "2. Representing text as numerical data\n", 18 | "3. Reading a text-based dataset into pandas\n", 19 | "4. Vectorizing our dataset\n", 20 | "5. Building and evaluating a model\n", 21 | "6. 
Comparing models" 22 | ] 23 | }, 24 | { 25 | "cell_type": "code", 26 | "execution_count": null, 27 | "metadata": { 28 | "collapsed": false 29 | }, 30 | "outputs": [], 31 | "source": [ 32 | "# for Python 2: use print only as a function\n", 33 | "from __future__ import print_function" 34 | ] 35 | }, 36 | { 37 | "cell_type": "markdown", 38 | "metadata": {}, 39 | "source": [ 40 | "## Part 1: Model building in scikit-learn (refresher)" 41 | ] 42 | }, 43 | { 44 | "cell_type": "code", 45 | "execution_count": null, 46 | "metadata": { 47 | "collapsed": true 48 | }, 49 | "outputs": [], 50 | "source": [ 51 | "# load the iris dataset as an example\n", 52 | "from sklearn.datasets import load_iris\n", 53 | "iris = load_iris()" 54 | ] 55 | }, 56 | { 57 | "cell_type": "code", 58 | "execution_count": null, 59 | "metadata": { 60 | "collapsed": true 61 | }, 62 | "outputs": [], 63 | "source": [ 64 | "# store the feature matrix (X) and response vector (y)\n", 65 | "X = iris.data\n", 66 | "y = iris.target" 67 | ] 68 | }, 69 | { 70 | "cell_type": "markdown", 71 | "metadata": {}, 72 | "source": [ 73 | "**\"Features\"** are also known as predictors, inputs, or attributes. The **\"response\"** is also known as the target, label, or output." 74 | ] 75 | }, 76 | { 77 | "cell_type": "code", 78 | "execution_count": null, 79 | "metadata": { 80 | "collapsed": false 81 | }, 82 | "outputs": [], 83 | "source": [ 84 | "# check the shapes of X and y\n", 85 | "print(X.shape)\n", 86 | "print(y.shape)" 87 | ] 88 | }, 89 | { 90 | "cell_type": "markdown", 91 | "metadata": {}, 92 | "source": [ 93 | "**\"Observations\"** are also known as samples, instances, or records." 94 | ] 95 | }, 96 | { 97 | "cell_type": "code", 98 | "execution_count": null, 99 | "metadata": { 100 | "collapsed": false 101 | }, 102 | "outputs": [], 103 | "source": [ 104 | "# examine the first 5 rows of the feature matrix (including the feature names)\n", 105 | "import pandas as pd\n", 106 | "pd.DataFrame(X, columns=iris.feature_names).head()" 107 | ] 108 | }, 109 | { 110 | "cell_type": "code", 111 | "execution_count": null, 112 | "metadata": { 113 | "collapsed": false 114 | }, 115 | "outputs": [], 116 | "source": [ 117 | "# examine the response vector\n", 118 | "print(y)" 119 | ] 120 | }, 121 | { 122 | "cell_type": "markdown", 123 | "metadata": {}, 124 | "source": [ 125 | "In order to **build a model**, the features must be **numeric**, and every observation must have the **same features in the same order**." 126 | ] 127 | }, 128 | { 129 | "cell_type": "code", 130 | "execution_count": null, 131 | "metadata": { 132 | "collapsed": false 133 | }, 134 | "outputs": [], 135 | "source": [ 136 | "# import the class\n", 137 | "from sklearn.neighbors import KNeighborsClassifier\n", 138 | "\n", 139 | "# instantiate the model (with the default parameters)\n", 140 | "knn = KNeighborsClassifier()\n", 141 | "\n", 142 | "# fit the model with data (occurs in-place)\n", 143 | "knn.fit(X, y)" 144 | ] 145 | }, 146 | { 147 | "cell_type": "markdown", 148 | "metadata": {}, 149 | "source": [ 150 | "In order to **make a prediction**, the new observation must have the **same features as the training observations**, both in number and meaning." 
151 | ] 152 | }, 153 | { 154 | "cell_type": "code", 155 | "execution_count": null, 156 | "metadata": { 157 | "collapsed": false 158 | }, 159 | "outputs": [], 160 | "source": [ 161 | "# predict the response for a new observation\n", 162 | "knn.predict([[3, 5, 4, 2]])" 163 | ] 164 | }, 165 | { 166 | "cell_type": "markdown", 167 | "metadata": {}, 168 | "source": [ 169 | "## Part 2: Representing text as numerical data" 170 | ] 171 | }, 172 | { 173 | "cell_type": "code", 174 | "execution_count": null, 175 | "metadata": { 176 | "collapsed": true 177 | }, 178 | "outputs": [], 179 | "source": [ 180 | "# example text for model training (SMS messages)\n", 181 | "simple_train = ['call you tonight', 'Call me a cab', 'please call me... PLEASE!']" 182 | ] 183 | }, 184 | { 185 | "cell_type": "code", 186 | "execution_count": null, 187 | "metadata": { 188 | "collapsed": true 189 | }, 190 | "outputs": [], 191 | "source": [ 192 | "# example response vector\n", 193 | "is_desperate = [0, 0, 1]" 194 | ] 195 | }, 196 | { 197 | "cell_type": "markdown", 198 | "metadata": {}, 199 | "source": [ 200 | "From the [scikit-learn documentation](http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction):\n", 201 | "\n", 202 | "> Text Analysis is a major application field for machine learning algorithms. However the raw data, a sequence of symbols cannot be fed directly to the algorithms themselves as most of them expect **numerical feature vectors with a fixed size** rather than the **raw text documents with variable length**.\n", 203 | "\n", 204 | "We will use [CountVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) to \"convert text into a matrix of token counts\":" 205 | ] 206 | }, 207 | { 208 | "cell_type": "code", 209 | "execution_count": null, 210 | "metadata": { 211 | "collapsed": true 212 | }, 213 | "outputs": [], 214 | "source": [ 215 | "# import and instantiate CountVectorizer (with the default parameters)\n", 216 | "from sklearn.feature_extraction.text import CountVectorizer\n", 217 | "vect = CountVectorizer()" 218 | ] 219 | }, 220 | { 221 | "cell_type": "code", 222 | "execution_count": null, 223 | "metadata": { 224 | "collapsed": false 225 | }, 226 | "outputs": [], 227 | "source": [ 228 | "# learn the 'vocabulary' of the training data (occurs in-place)\n", 229 | "vect.fit(simple_train)" 230 | ] 231 | }, 232 | { 233 | "cell_type": "code", 234 | "execution_count": null, 235 | "metadata": { 236 | "collapsed": false 237 | }, 238 | "outputs": [], 239 | "source": [ 240 | "# examine the fitted vocabulary\n", 241 | "vect.get_feature_names()" 242 | ] 243 | }, 244 | { 245 | "cell_type": "code", 246 | "execution_count": null, 247 | "metadata": { 248 | "collapsed": false 249 | }, 250 | "outputs": [], 251 | "source": [ 252 | "# transform training data into a 'document-term matrix'\n", 253 | "simple_train_dtm = vect.transform(simple_train)\n", 254 | "simple_train_dtm" 255 | ] 256 | }, 257 | { 258 | "cell_type": "code", 259 | "execution_count": null, 260 | "metadata": { 261 | "collapsed": false 262 | }, 263 | "outputs": [], 264 | "source": [ 265 | "# convert sparse matrix to a dense matrix\n", 266 | "simple_train_dtm.toarray()" 267 | ] 268 | }, 269 | { 270 | "cell_type": "code", 271 | "execution_count": null, 272 | "metadata": { 273 | "collapsed": false 274 | }, 275 | "outputs": [], 276 | "source": [ 277 | "# examine the vocabulary and document-term matrix together\n", 278 | "pd.DataFrame(simple_train_dtm.toarray(), 
columns=vect.get_feature_names())" 279 | ] 280 | }, 281 | { 282 | "cell_type": "markdown", 283 | "metadata": {}, 284 | "source": [ 285 | "From the [scikit-learn documentation](http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction):\n", 286 | "\n", 287 | "> In this scheme, features and samples are defined as follows:\n", 288 | "\n", 289 | "> - Each individual token occurrence frequency (normalized or not) is treated as a **feature**.\n", 290 | "> - The vector of all the token frequencies for a given document is considered a multivariate **sample**.\n", 291 | "\n", 292 | "> A **corpus of documents** can thus be represented by a matrix with **one row per document** and **one column per token** (e.g. word) occurring in the corpus.\n", 293 | "\n", 294 | "> We call **vectorization** the general process of turning a collection of text documents into numerical feature vectors. This specific strategy (tokenization, counting and normalization) is called the **Bag of Words** or \"Bag of n-grams\" representation. Documents are described by word occurrences while completely ignoring the relative position information of the words in the document." 295 | ] 296 | }, 297 | { 298 | "cell_type": "code", 299 | "execution_count": null, 300 | "metadata": { 301 | "collapsed": false 302 | }, 303 | "outputs": [], 304 | "source": [ 305 | "# check the type of the document-term matrix\n", 306 | "type(simple_train_dtm)" 307 | ] 308 | }, 309 | { 310 | "cell_type": "code", 311 | "execution_count": null, 312 | "metadata": { 313 | "collapsed": false, 314 | "scrolled": true 315 | }, 316 | "outputs": [], 317 | "source": [ 318 | "# examine the sparse matrix contents\n", 319 | "print(simple_train_dtm)" 320 | ] 321 | }, 322 | { 323 | "cell_type": "markdown", 324 | "metadata": {}, 325 | "source": [ 326 | "From the [scikit-learn documentation](http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction):\n", 327 | "\n", 328 | "> As most documents will typically use a very small subset of the words used in the corpus, the resulting matrix will have **many feature values that are zeros** (typically more than 99% of them).\n", 329 | "\n", 330 | "> For instance, a collection of 10,000 short text documents (such as emails) will use a vocabulary with a size in the order of 100,000 unique words in total while each document will use 100 to 1000 unique words individually.\n", 331 | "\n", 332 | "> In order to be able to **store such a matrix in memory** but also to **speed up operations**, implementations will typically use a **sparse representation** such as the implementations available in the `scipy.sparse` package." 333 | ] 334 | }, 335 | { 336 | "cell_type": "code", 337 | "execution_count": null, 338 | "metadata": { 339 | "collapsed": false 340 | }, 341 | "outputs": [], 342 | "source": [ 343 | "# build a model to predict desperation\n", 344 | "knn = KNeighborsClassifier(n_neighbors=1)\n", 345 | "knn.fit(simple_train_dtm, is_desperate)" 346 | ] 347 | }, 348 | { 349 | "cell_type": "code", 350 | "execution_count": null, 351 | "metadata": { 352 | "collapsed": true 353 | }, 354 | "outputs": [], 355 | "source": [ 356 | "# example text for model testing\n", 357 | "simple_test = [\"please don't call me\"]" 358 | ] 359 | }, 360 | { 361 | "cell_type": "markdown", 362 | "metadata": {}, 363 | "source": [ 364 | "In order to **make a prediction**, the new observation must have the **same features as the training observations**, both in number and meaning." 
365 | ] 366 | }, 367 | { 368 | "cell_type": "code", 369 | "execution_count": null, 370 | "metadata": { 371 | "collapsed": false 372 | }, 373 | "outputs": [], 374 | "source": [ 375 | "# transform testing data into a document-term matrix (using existing vocabulary)\n", 376 | "simple_test_dtm = vect.transform(simple_test)\n", 377 | "simple_test_dtm.toarray()" 378 | ] 379 | }, 380 | { 381 | "cell_type": "code", 382 | "execution_count": null, 383 | "metadata": { 384 | "collapsed": false 385 | }, 386 | "outputs": [], 387 | "source": [ 388 | "# examine the vocabulary and document-term matrix together\n", 389 | "pd.DataFrame(simple_test_dtm.toarray(), columns=vect.get_feature_names())" 390 | ] 391 | }, 392 | { 393 | "cell_type": "code", 394 | "execution_count": null, 395 | "metadata": { 396 | "collapsed": false 397 | }, 398 | "outputs": [], 399 | "source": [ 400 | "# predict whether simple_test is desperate\n", 401 | "knn.predict(simple_test_dtm)" 402 | ] 403 | }, 404 | { 405 | "cell_type": "markdown", 406 | "metadata": {}, 407 | "source": [ 408 | "**Summary:**\n", 409 | "\n", 410 | "- `vect.fit(train)` **learns the vocabulary** of the training data\n", 411 | "- `vect.transform(train)` uses the **fitted vocabulary** to build a document-term matrix from the training data\n", 412 | "- `vect.transform(test)` uses the **fitted vocabulary** to build a document-term matrix from the testing data (and **ignores tokens** it hasn't seen before)" 413 | ] 414 | }, 415 | { 416 | "cell_type": "markdown", 417 | "metadata": {}, 418 | "source": [ 419 | "## Part 3: Reading a text-based dataset into pandas" 420 | ] 421 | }, 422 | { 423 | "cell_type": "code", 424 | "execution_count": null, 425 | "metadata": { 426 | "collapsed": true 427 | }, 428 | "outputs": [], 429 | "source": [ 430 | "# read file into pandas from the working directory\n", 431 | "sms = pd.read_table('sms.tsv', header=None, names=['label', 'message'])" 432 | ] 433 | }, 434 | { 435 | "cell_type": "code", 436 | "execution_count": null, 437 | "metadata": { 438 | "collapsed": false 439 | }, 440 | "outputs": [], 441 | "source": [ 442 | "# alternative: read file into pandas from a URL\n", 443 | "# url = 'https://raw.githubusercontent.com/justmarkham/pydata-dc-2016-tutorial/master/sms.tsv'\n", 444 | "# sms = pd.read_table(url, header=None, names=['label', 'message'])" 445 | ] 446 | }, 447 | { 448 | "cell_type": "code", 449 | "execution_count": null, 450 | "metadata": { 451 | "collapsed": false 452 | }, 453 | "outputs": [], 454 | "source": [ 455 | "# examine the shape\n", 456 | "sms.shape" 457 | ] 458 | }, 459 | { 460 | "cell_type": "code", 461 | "execution_count": null, 462 | "metadata": { 463 | "collapsed": false 464 | }, 465 | "outputs": [], 466 | "source": [ 467 | "# examine the first 10 rows\n", 468 | "sms.head(10)" 469 | ] 470 | }, 471 | { 472 | "cell_type": "code", 473 | "execution_count": null, 474 | "metadata": { 475 | "collapsed": false 476 | }, 477 | "outputs": [], 478 | "source": [ 479 | "# examine the class distribution\n", 480 | "sms.label.value_counts()" 481 | ] 482 | }, 483 | { 484 | "cell_type": "code", 485 | "execution_count": null, 486 | "metadata": { 487 | "collapsed": true 488 | }, 489 | "outputs": [], 490 | "source": [ 491 | "# convert label to a numerical variable\n", 492 | "sms['label_num'] = sms.label.map({'ham':0, 'spam':1})" 493 | ] 494 | }, 495 | { 496 | "cell_type": "code", 497 | "execution_count": null, 498 | "metadata": { 499 | "collapsed": false 500 | }, 501 | "outputs": [], 502 | "source": [ 503 | "# check that the conversion 
worked\n", 504 | "sms.head(10)" 505 | ] 506 | }, 507 | { 508 | "cell_type": "code", 509 | "execution_count": null, 510 | "metadata": { 511 | "collapsed": false 512 | }, 513 | "outputs": [], 514 | "source": [ 515 | "# how to define X and y (from the iris data) for use with a MODEL\n", 516 | "X = iris.data\n", 517 | "y = iris.target\n", 518 | "print(X.shape)\n", 519 | "print(y.shape)" 520 | ] 521 | }, 522 | { 523 | "cell_type": "code", 524 | "execution_count": null, 525 | "metadata": { 526 | "collapsed": false 527 | }, 528 | "outputs": [], 529 | "source": [ 530 | "# how to define X and y (from the SMS data) for use with COUNTVECTORIZER\n", 531 | "X = sms.message\n", 532 | "y = sms.label_num\n", 533 | "print(X.shape)\n", 534 | "print(y.shape)" 535 | ] 536 | }, 537 | { 538 | "cell_type": "code", 539 | "execution_count": null, 540 | "metadata": { 541 | "collapsed": false 542 | }, 543 | "outputs": [], 544 | "source": [ 545 | "# split X and y into training and testing sets\n", 546 | "from sklearn.cross_validation import train_test_split\n", 547 | "X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)\n", 548 | "print(X_train.shape)\n", 549 | "print(X_test.shape)\n", 550 | "print(y_train.shape)\n", 551 | "print(y_test.shape)" 552 | ] 553 | }, 554 | { 555 | "cell_type": "markdown", 556 | "metadata": {}, 557 | "source": [ 558 | "## Part 4: Vectorizing our dataset" 559 | ] 560 | }, 561 | { 562 | "cell_type": "code", 563 | "execution_count": null, 564 | "metadata": { 565 | "collapsed": true 566 | }, 567 | "outputs": [], 568 | "source": [ 569 | "# instantiate the vectorizer\n", 570 | "vect = CountVectorizer()" 571 | ] 572 | }, 573 | { 574 | "cell_type": "code", 575 | "execution_count": null, 576 | "metadata": { 577 | "collapsed": true 578 | }, 579 | "outputs": [], 580 | "source": [ 581 | "# learn training data vocabulary, then use it to create a document-term matrix\n", 582 | "vect.fit(X_train)\n", 583 | "X_train_dtm = vect.transform(X_train)" 584 | ] 585 | }, 586 | { 587 | "cell_type": "code", 588 | "execution_count": null, 589 | "metadata": { 590 | "collapsed": true 591 | }, 592 | "outputs": [], 593 | "source": [ 594 | "# equivalently: combine fit and transform into a single step\n", 595 | "X_train_dtm = vect.fit_transform(X_train)" 596 | ] 597 | }, 598 | { 599 | "cell_type": "code", 600 | "execution_count": null, 601 | "metadata": { 602 | "collapsed": false 603 | }, 604 | "outputs": [], 605 | "source": [ 606 | "# examine the document-term matrix\n", 607 | "X_train_dtm" 608 | ] 609 | }, 610 | { 611 | "cell_type": "code", 612 | "execution_count": null, 613 | "metadata": { 614 | "collapsed": false 615 | }, 616 | "outputs": [], 617 | "source": [ 618 | "# transform testing data (using fitted vocabulary) into a document-term matrix\n", 619 | "X_test_dtm = vect.transform(X_test)\n", 620 | "X_test_dtm" 621 | ] 622 | }, 623 | { 624 | "cell_type": "markdown", 625 | "metadata": {}, 626 | "source": [ 627 | "## Part 5: Building and evaluating a model\n", 628 | "\n", 629 | "We will use [multinomial Naive Bayes](http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html):\n", 630 | "\n", 631 | "> The multinomial Naive Bayes classifier is suitable for classification with **discrete features** (e.g., word counts for text classification). The multinomial distribution normally requires integer feature counts. However, in practice, fractional counts such as tf-idf may also work." 
632 | ] 633 | }, 634 | { 635 | "cell_type": "code", 636 | "execution_count": null, 637 | "metadata": { 638 | "collapsed": true 639 | }, 640 | "outputs": [], 641 | "source": [ 642 | "# import and instantiate a Multinomial Naive Bayes model\n", 643 | "from sklearn.naive_bayes import MultinomialNB\n", 644 | "nb = MultinomialNB()" 645 | ] 646 | }, 647 | { 648 | "cell_type": "code", 649 | "execution_count": null, 650 | "metadata": { 651 | "collapsed": false 652 | }, 653 | "outputs": [], 654 | "source": [ 655 | "# train the model using X_train_dtm (timing it with an IPython \"magic command\")\n", 656 | "%time nb.fit(X_train_dtm, y_train)" 657 | ] 658 | }, 659 | { 660 | "cell_type": "code", 661 | "execution_count": null, 662 | "metadata": { 663 | "collapsed": true 664 | }, 665 | "outputs": [], 666 | "source": [ 667 | "# make class predictions for X_test_dtm\n", 668 | "y_pred_class = nb.predict(X_test_dtm)" 669 | ] 670 | }, 671 | { 672 | "cell_type": "code", 673 | "execution_count": null, 674 | "metadata": { 675 | "collapsed": false 676 | }, 677 | "outputs": [], 678 | "source": [ 679 | "# calculate accuracy of class predictions\n", 680 | "from sklearn import metrics\n", 681 | "metrics.accuracy_score(y_test, y_pred_class)" 682 | ] 683 | }, 684 | { 685 | "cell_type": "code", 686 | "execution_count": null, 687 | "metadata": { 688 | "collapsed": false 689 | }, 690 | "outputs": [], 691 | "source": [ 692 | "# print the confusion matrix\n", 693 | "metrics.confusion_matrix(y_test, y_pred_class)" 694 | ] 695 | }, 696 | { 697 | "cell_type": "code", 698 | "execution_count": null, 699 | "metadata": { 700 | "collapsed": false 701 | }, 702 | "outputs": [], 703 | "source": [ 704 | "# print message text for the false positives (ham incorrectly classified as spam)\n" 705 | ] 706 | }, 707 | { 708 | "cell_type": "code", 709 | "execution_count": null, 710 | "metadata": { 711 | "collapsed": false, 712 | "scrolled": true 713 | }, 714 | "outputs": [], 715 | "source": [ 716 | "# print message text for the false negatives (spam incorrectly classified as ham)\n" 717 | ] 718 | }, 719 | { 720 | "cell_type": "code", 721 | "execution_count": null, 722 | "metadata": { 723 | "collapsed": false, 724 | "scrolled": true 725 | }, 726 | "outputs": [], 727 | "source": [ 728 | "# example false negative\n", 729 | "X_test[3132]" 730 | ] 731 | }, 732 | { 733 | "cell_type": "code", 734 | "execution_count": null, 735 | "metadata": { 736 | "collapsed": false 737 | }, 738 | "outputs": [], 739 | "source": [ 740 | "# calculate predicted probabilities for X_test_dtm (poorly calibrated)\n", 741 | "y_pred_prob = nb.predict_proba(X_test_dtm)[:, 1]\n", 742 | "y_pred_prob" 743 | ] 744 | }, 745 | { 746 | "cell_type": "code", 747 | "execution_count": null, 748 | "metadata": { 749 | "collapsed": false 750 | }, 751 | "outputs": [], 752 | "source": [ 753 | "# calculate AUC\n", 754 | "metrics.roc_auc_score(y_test, y_pred_prob)" 755 | ] 756 | }, 757 | { 758 | "cell_type": "markdown", 759 | "metadata": {}, 760 | "source": [ 761 | "## Part 6: Comparing models\n", 762 | "\n", 763 | "We will compare multinomial Naive Bayes with [logistic regression](http://scikit-learn.org/stable/modules/linear_model.html#logistic-regression):\n", 764 | "\n", 765 | "> Logistic regression, despite its name, is a **linear model for classification** rather than regression. Logistic regression is also known in the literature as logit regression, maximum-entropy classification (MaxEnt) or the log-linear classifier. 
In this model, the probabilities describing the possible outcomes of a single trial are modeled using a logistic function." 766 | ] 767 | }, 768 | { 769 | "cell_type": "code", 770 | "execution_count": null, 771 | "metadata": { 772 | "collapsed": true 773 | }, 774 | "outputs": [], 775 | "source": [ 776 | "# import and instantiate a logistic regression model\n", 777 | "from sklearn.linear_model import LogisticRegression\n", 778 | "logreg = LogisticRegression()" 779 | ] 780 | }, 781 | { 782 | "cell_type": "code", 783 | "execution_count": null, 784 | "metadata": { 785 | "collapsed": false 786 | }, 787 | "outputs": [], 788 | "source": [ 789 | "# train the model using X_train_dtm\n", 790 | "%time logreg.fit(X_train_dtm, y_train)" 791 | ] 792 | }, 793 | { 794 | "cell_type": "code", 795 | "execution_count": null, 796 | "metadata": { 797 | "collapsed": true 798 | }, 799 | "outputs": [], 800 | "source": [ 801 | "# make class predictions for X_test_dtm\n", 802 | "y_pred_class = logreg.predict(X_test_dtm)" 803 | ] 804 | }, 805 | { 806 | "cell_type": "code", 807 | "execution_count": null, 808 | "metadata": { 809 | "collapsed": false 810 | }, 811 | "outputs": [], 812 | "source": [ 813 | "# calculate predicted probabilities for X_test_dtm (well calibrated)\n", 814 | "y_pred_prob = logreg.predict_proba(X_test_dtm)[:, 1]\n", 815 | "y_pred_prob" 816 | ] 817 | }, 818 | { 819 | "cell_type": "code", 820 | "execution_count": null, 821 | "metadata": { 822 | "collapsed": false 823 | }, 824 | "outputs": [], 825 | "source": [ 826 | "# calculate accuracy\n", 827 | "metrics.accuracy_score(y_test, y_pred_class)" 828 | ] 829 | }, 830 | { 831 | "cell_type": "code", 832 | "execution_count": null, 833 | "metadata": { 834 | "collapsed": false 835 | }, 836 | "outputs": [], 837 | "source": [ 838 | "# calculate AUC\n", 839 | "metrics.roc_auc_score(y_test, y_pred_prob)" 840 | ] 841 | } 842 | ], 843 | "metadata": { 844 | "kernelspec": { 845 | "display_name": "Python 2", 846 | "language": "python", 847 | "name": "python2" 848 | }, 849 | "language_info": { 850 | "codemirror_mode": { 851 | "name": "ipython", 852 | "version": 2 853 | }, 854 | "file_extension": ".py", 855 | "mimetype": "text/x-python", 856 | "name": "python", 857 | "nbconvert_exporter": "python", 858 | "pygments_lexer": "ipython2", 859 | "version": "2.7.11" 860 | } 861 | }, 862 | "nbformat": 4, 863 | "nbformat_minor": 0 864 | } 865 | -------------------------------------------------------------------------------- /tutorial.py: -------------------------------------------------------------------------------- 1 | # # Tutorial: Machine Learning with Text in scikit-learn 2 | 3 | # ## Agenda 4 | # 5 | # 1. Model building in scikit-learn (refresher) 6 | # 2. Representing text as numerical data 7 | # 3. Reading a text-based dataset into pandas 8 | # 4. Vectorizing our dataset 9 | # 5. Building and evaluating a model 10 | # 6. Comparing models 11 | 12 | # for Python 2: use print only as a function 13 | from __future__ import print_function 14 | 15 | 16 | # ## Part 1: Model building in scikit-learn (refresher) 17 | 18 | # load the iris dataset as an example 19 | from sklearn.datasets import load_iris 20 | iris = load_iris() 21 | 22 | 23 | # store the feature matrix (X) and response vector (y) 24 | X = iris.data 25 | y = iris.target 26 | 27 | 28 | # **"Features"** are also known as predictors, inputs, or attributes. The **"response"** is also known as the target, label, or output. 
29 | 30 | # check the shapes of X and y 31 | print(X.shape) 32 | print(y.shape) 33 | 34 | 35 | # **"Observations"** are also known as samples, instances, or records. 36 | 37 | # examine the first 5 rows of the feature matrix (including the feature names) 38 | import pandas as pd 39 | pd.DataFrame(X, columns=iris.feature_names).head() 40 | 41 | 42 | # examine the response vector 43 | print(y) 44 | 45 | 46 | # In order to **build a model**, the features must be **numeric**, and every observation must have the **same features in the same order**. 47 | 48 | # import the class 49 | from sklearn.neighbors import KNeighborsClassifier 50 | 51 | # instantiate the model (with the default parameters) 52 | knn = KNeighborsClassifier() 53 | 54 | # fit the model with data (occurs in-place) 55 | knn.fit(X, y) 56 | 57 | 58 | # In order to **make a prediction**, the new observation must have the **same features as the training observations**, both in number and meaning. 59 | 60 | # predict the response for a new observation 61 | knn.predict([[3, 5, 4, 2]]) 62 | 63 | 64 | # ## Part 2: Representing text as numerical data 65 | 66 | # example text for model training (SMS messages) 67 | simple_train = ['call you tonight', 'Call me a cab', 'please call me... PLEASE!'] 68 | 69 | 70 | # example response vector 71 | is_desperate = [0, 0, 1] 72 | 73 | 74 | # From the [scikit-learn documentation](http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction): 75 | # 76 | # > Text Analysis is a major application field for machine learning algorithms. However the raw data, a sequence of symbols cannot be fed directly to the algorithms themselves as most of them expect **numerical feature vectors with a fixed size** rather than the **raw text documents with variable length**. 77 | # 78 | # We will use [CountVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) to "convert text into a matrix of token counts": 79 | 80 | # import and instantiate CountVectorizer (with the default parameters) 81 | from sklearn.feature_extraction.text import CountVectorizer 82 | vect = CountVectorizer() 83 | 84 | 85 | # learn the 'vocabulary' of the training data (occurs in-place) 86 | vect.fit(simple_train) 87 | 88 | 89 | # examine the fitted vocabulary 90 | vect.get_feature_names() 91 | 92 | 93 | # transform training data into a 'document-term matrix' 94 | simple_train_dtm = vect.transform(simple_train) 95 | simple_train_dtm 96 | 97 | 98 | # convert sparse matrix to a dense matrix 99 | simple_train_dtm.toarray() 100 | 101 | 102 | # examine the vocabulary and document-term matrix together 103 | pd.DataFrame(simple_train_dtm.toarray(), columns=vect.get_feature_names()) 104 | 105 | 106 | # From the [scikit-learn documentation](http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction): 107 | # 108 | # > In this scheme, features and samples are defined as follows: 109 | # 110 | # > - Each individual token occurrence frequency (normalized or not) is treated as a **feature**. 111 | # > - The vector of all the token frequencies for a given document is considered a multivariate **sample**. 112 | # 113 | # > A **corpus of documents** can thus be represented by a matrix with **one row per document** and **one column per token** (e.g. word) occurring in the corpus. 114 | # 115 | # > We call **vectorization** the general process of turning a collection of text documents into numerical feature vectors. 
This specific strategy (tokenization, counting and normalization) is called the **Bag of Words** or "Bag of n-grams" representation. Documents are described by word occurrences while completely ignoring the relative position information of the words in the document. 116 | 117 | # check the type of the document-term matrix 118 | type(simple_train_dtm) 119 | 120 | 121 | # examine the sparse matrix contents 122 | print(simple_train_dtm) 123 | 124 | 125 | # From the [scikit-learn documentation](http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction): 126 | # 127 | # > As most documents will typically use a very small subset of the words used in the corpus, the resulting matrix will have **many feature values that are zeros** (typically more than 99% of them). 128 | # 129 | # > For instance, a collection of 10,000 short text documents (such as emails) will use a vocabulary with a size in the order of 100,000 unique words in total while each document will use 100 to 1000 unique words individually. 130 | # 131 | # > In order to be able to **store such a matrix in memory** but also to **speed up operations**, implementations will typically use a **sparse representation** such as the implementations available in the `scipy.sparse` package. 132 | 133 | # build a model to predict desperation 134 | knn = KNeighborsClassifier(n_neighbors=1) 135 | knn.fit(simple_train_dtm, is_desperate) 136 | 137 | 138 | # example text for model testing 139 | simple_test = ["please don't call me"] 140 | 141 | 142 | # In order to **make a prediction**, the new observation must have the **same features as the training observations**, both in number and meaning. 143 | 144 | # transform testing data into a document-term matrix (using existing vocabulary) 145 | simple_test_dtm = vect.transform(simple_test) 146 | simple_test_dtm.toarray() 147 | 148 | 149 | # examine the vocabulary and document-term matrix together 150 | pd.DataFrame(simple_test_dtm.toarray(), columns=vect.get_feature_names()) 151 | 152 | 153 | # predict whether simple_test is desperate 154 | knn.predict(simple_test_dtm) 155 | 156 | 157 | # **Summary:** 158 | # 159 | # - `vect.fit(train)` **learns the vocabulary** of the training data 160 | # - `vect.transform(train)` uses the **fitted vocabulary** to build a document-term matrix from the training data 161 | # - `vect.transform(test)` uses the **fitted vocabulary** to build a document-term matrix from the testing data (and **ignores tokens** it hasn't seen before) 162 | 163 | # ## Part 3: Reading a text-based dataset into pandas 164 | 165 | # read file into pandas from the working directory 166 | sms = pd.read_table('sms.tsv', header=None, names=['label', 'message']) 167 | 168 | 169 | # alternative: read file into pandas from a URL 170 | # url = 'https://raw.githubusercontent.com/justmarkham/pydata-dc-2016-tutorial/master/sms.tsv' 171 | # sms = pd.read_table(url, header=None, names=['label', 'message']) 172 | 173 | 174 | # examine the shape 175 | sms.shape 176 | 177 | 178 | # examine the first 10 rows 179 | sms.head(10) 180 | 181 | 182 | # examine the class distribution 183 | sms.label.value_counts() 184 | 185 | 186 | # convert label to a numerical variable 187 | sms['label_num'] = sms.label.map({'ham':0, 'spam':1}) 188 | 189 | 190 | # check that the conversion worked 191 | sms.head(10) 192 | 193 | 194 | # how to define X and y (from the iris data) for use with a MODEL 195 | X = iris.data 196 | y = iris.target 197 | print(X.shape) 198 | print(y.shape) 199 | 200 | 201 | # how to 
define X and y (from the SMS data) for use with COUNTVECTORIZER 202 | X = sms.message 203 | y = sms.label_num 204 | print(X.shape) 205 | print(y.shape) 206 | 207 | 208 | # split X and y into training and testing sets 209 | from sklearn.cross_validation import train_test_split 210 | X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1) 211 | print(X_train.shape) 212 | print(X_test.shape) 213 | print(y_train.shape) 214 | print(y_test.shape) 215 | 216 | 217 | # ## Part 4: Vectorizing our dataset 218 | 219 | # instantiate the vectorizer 220 | vect = CountVectorizer() 221 | 222 | 223 | # learn training data vocabulary, then use it to create a document-term matrix 224 | vect.fit(X_train) 225 | X_train_dtm = vect.transform(X_train) 226 | 227 | 228 | # equivalently: combine fit and transform into a single step 229 | X_train_dtm = vect.fit_transform(X_train) 230 | 231 | 232 | # examine the document-term matrix 233 | X_train_dtm 234 | 235 | 236 | # transform testing data (using fitted vocabulary) into a document-term matrix 237 | X_test_dtm = vect.transform(X_test) 238 | X_test_dtm 239 | 240 | 241 | # ## Part 5: Building and evaluating a model 242 | # 243 | # We will use [multinomial Naive Bayes](http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html): 244 | # 245 | # > The multinomial Naive Bayes classifier is suitable for classification with **discrete features** (e.g., word counts for text classification). The multinomial distribution normally requires integer feature counts. However, in practice, fractional counts such as tf-idf may also work. 246 | 247 | # import and instantiate a Multinomial Naive Bayes model 248 | from sklearn.naive_bayes import MultinomialNB 249 | nb = MultinomialNB() 250 | 251 | 252 | # train the model using X_train_dtm 253 | nb.fit(X_train_dtm, y_train) 254 | 255 | 256 | # make class predictions for X_test_dtm 257 | y_pred_class = nb.predict(X_test_dtm) 258 | 259 | 260 | # calculate accuracy of class predictions 261 | from sklearn import metrics 262 | metrics.accuracy_score(y_test, y_pred_class) 263 | 264 | 265 | # print the confusion matrix 266 | metrics.confusion_matrix(y_test, y_pred_class) 267 | 268 | 269 | # print message text for the false positives (ham incorrectly classified as spam) 270 | 271 | 272 | # print message text for the false negatives (spam incorrectly classified as ham) 273 | 274 | 275 | # example false negative 276 | X_test[3132] 277 | 278 | 279 | # calculate predicted probabilities for X_test_dtm (poorly calibrated) 280 | y_pred_prob = nb.predict_proba(X_test_dtm)[:, 1] 281 | y_pred_prob 282 | 283 | 284 | # calculate AUC 285 | metrics.roc_auc_score(y_test, y_pred_prob) 286 | 287 | 288 | # ## Part 6: Comparing models 289 | # 290 | # We will compare multinomial Naive Bayes with [logistic regression](http://scikit-learn.org/stable/modules/linear_model.html#logistic-regression): 291 | # 292 | # > Logistic regression, despite its name, is a **linear model for classification** rather than regression. Logistic regression is also known in the literature as logit regression, maximum-entropy classification (MaxEnt) or the log-linear classifier. In this model, the probabilities describing the possible outcomes of a single trial are modeled using a logistic function. 
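# a minimal illustrative sketch of the "logistic function" referenced in the quote
# above (assumes only that NumPy is installed, as it is a scikit-learn dependency):
# it maps any real-valued linear score onto a probability between 0 and 1, which is
# what LogisticRegression outputs via predict_proba
import numpy as np

def logistic_function(z):
    return 1.0 / (1.0 + np.exp(-z))

print(logistic_function(0))    # 0.5
print(logistic_function(5))    # ~0.993
print(logistic_function(-5))   # ~0.007
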
293 | 294 | # import and instantiate a logistic regression model 295 | from sklearn.linear_model import LogisticRegression 296 | logreg = LogisticRegression() 297 | 298 | 299 | # train the model using X_train_dtm 300 | logreg.fit(X_train_dtm, y_train) 301 | 302 | 303 | # make class predictions for X_test_dtm 304 | y_pred_class = logreg.predict(X_test_dtm) 305 | 306 | 307 | # calculate predicted probabilities for X_test_dtm (well calibrated) 308 | y_pred_prob = logreg.predict_proba(X_test_dtm)[:, 1] 309 | y_pred_prob 310 | 311 | 312 | # calculate accuracy 313 | metrics.accuracy_score(y_test, y_pred_class) 314 | 315 | 316 | # calculate AUC 317 | metrics.roc_auc_score(y_test, y_pred_prob) 318 | -------------------------------------------------------------------------------- /tutorial_with_output.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Tutorial: Machine Learning with Text in scikit-learn" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "## Agenda\n", 15 | "\n", 16 | "1. Model building in scikit-learn (refresher)\n", 17 | "2. Representing text as numerical data\n", 18 | "3. Reading a text-based dataset into pandas\n", 19 | "4. Vectorizing our dataset\n", 20 | "5. Building and evaluating a model\n", 21 | "6. Comparing models" 22 | ] 23 | }, 24 | { 25 | "cell_type": "code", 26 | "execution_count": 1, 27 | "metadata": { 28 | "collapsed": false 29 | }, 30 | "outputs": [], 31 | "source": [ 32 | "# for Python 2: use print only as a function\n", 33 | "from __future__ import print_function" 34 | ] 35 | }, 36 | { 37 | "cell_type": "markdown", 38 | "metadata": {}, 39 | "source": [ 40 | "## Part 1: Model building in scikit-learn (refresher)" 41 | ] 42 | }, 43 | { 44 | "cell_type": "code", 45 | "execution_count": 2, 46 | "metadata": { 47 | "collapsed": true 48 | }, 49 | "outputs": [], 50 | "source": [ 51 | "# load the iris dataset as an example\n", 52 | "from sklearn.datasets import load_iris\n", 53 | "iris = load_iris()" 54 | ] 55 | }, 56 | { 57 | "cell_type": "code", 58 | "execution_count": 3, 59 | "metadata": { 60 | "collapsed": true 61 | }, 62 | "outputs": [], 63 | "source": [ 64 | "# store the feature matrix (X) and response vector (y)\n", 65 | "X = iris.data\n", 66 | "y = iris.target" 67 | ] 68 | }, 69 | { 70 | "cell_type": "markdown", 71 | "metadata": {}, 72 | "source": [ 73 | "**\"Features\"** are also known as predictors, inputs, or attributes. The **\"response\"** is also known as the target, label, or output." 74 | ] 75 | }, 76 | { 77 | "cell_type": "code", 78 | "execution_count": 4, 79 | "metadata": { 80 | "collapsed": false 81 | }, 82 | "outputs": [ 83 | { 84 | "name": "stdout", 85 | "output_type": "stream", 86 | "text": [ 87 | "(150L, 4L)\n", 88 | "(150L,)\n" 89 | ] 90 | } 91 | ], 92 | "source": [ 93 | "# check the shapes of X and y\n", 94 | "print(X.shape)\n", 95 | "print(y.shape)" 96 | ] 97 | }, 98 | { 99 | "cell_type": "markdown", 100 | "metadata": {}, 101 | "source": [ 102 | "**\"Observations\"** are also known as samples, instances, or records." 103 | ] 104 | }, 105 | { 106 | "cell_type": "code", 107 | "execution_count": 5, 108 | "metadata": { 109 | "collapsed": false 110 | }, 111 | "outputs": [ 112 | { 113 | "data": { 114 | "text/html": [ 115 | "
\n", 116 | "\n", 117 | " \n", 118 | " \n", 119 | " \n", 120 | " \n", 121 | " \n", 122 | " \n", 123 | " \n", 124 | " \n", 125 | " \n", 126 | " \n", 127 | " \n", 128 | " \n", 129 | " \n", 130 | " \n", 131 | " \n", 132 | " \n", 133 | " \n", 134 | " \n", 135 | " \n", 136 | " \n", 137 | " \n", 138 | " \n", 139 | " \n", 140 | " \n", 141 | " \n", 142 | " \n", 143 | " \n", 144 | " \n", 145 | " \n", 146 | " \n", 147 | " \n", 148 | " \n", 149 | " \n", 150 | " \n", 151 | " \n", 152 | " \n", 153 | " \n", 154 | " \n", 155 | " \n", 156 | " \n", 157 | " \n", 158 | " \n", 159 | " \n", 160 | " \n", 161 | " \n", 162 | " \n", 163 | "
sepal length (cm)sepal width (cm)petal length (cm)petal width (cm)
05.13.51.40.2
14.93.01.40.2
24.73.21.30.2
34.63.11.50.2
45.03.61.40.2
\n", 164 | "
" 165 | ], 166 | "text/plain": [ 167 | " sepal length (cm) sepal width (cm) petal length (cm) petal width (cm)\n", 168 | "0 5.1 3.5 1.4 0.2\n", 169 | "1 4.9 3.0 1.4 0.2\n", 170 | "2 4.7 3.2 1.3 0.2\n", 171 | "3 4.6 3.1 1.5 0.2\n", 172 | "4 5.0 3.6 1.4 0.2" 173 | ] 174 | }, 175 | "execution_count": 5, 176 | "metadata": {}, 177 | "output_type": "execute_result" 178 | } 179 | ], 180 | "source": [ 181 | "# examine the first 5 rows of the feature matrix (including the feature names)\n", 182 | "import pandas as pd\n", 183 | "pd.DataFrame(X, columns=iris.feature_names).head()" 184 | ] 185 | }, 186 | { 187 | "cell_type": "code", 188 | "execution_count": 6, 189 | "metadata": { 190 | "collapsed": false 191 | }, 192 | "outputs": [ 193 | { 194 | "name": "stdout", 195 | "output_type": "stream", 196 | "text": [ 197 | "[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0\n", 198 | " 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1\n", 199 | " 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2\n", 200 | " 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2\n", 201 | " 2 2]\n" 202 | ] 203 | } 204 | ], 205 | "source": [ 206 | "# examine the response vector\n", 207 | "print(y)" 208 | ] 209 | }, 210 | { 211 | "cell_type": "markdown", 212 | "metadata": {}, 213 | "source": [ 214 | "In order to **build a model**, the features must be **numeric**, and every observation must have the **same features in the same order**." 215 | ] 216 | }, 217 | { 218 | "cell_type": "code", 219 | "execution_count": 7, 220 | "metadata": { 221 | "collapsed": false 222 | }, 223 | "outputs": [ 224 | { 225 | "data": { 226 | "text/plain": [ 227 | "KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',\n", 228 | " metric_params=None, n_jobs=1, n_neighbors=5, p=2,\n", 229 | " weights='uniform')" 230 | ] 231 | }, 232 | "execution_count": 7, 233 | "metadata": {}, 234 | "output_type": "execute_result" 235 | } 236 | ], 237 | "source": [ 238 | "# import the class\n", 239 | "from sklearn.neighbors import KNeighborsClassifier\n", 240 | "\n", 241 | "# instantiate the model (with the default parameters)\n", 242 | "knn = KNeighborsClassifier()\n", 243 | "\n", 244 | "# fit the model with data (occurs in-place)\n", 245 | "knn.fit(X, y)" 246 | ] 247 | }, 248 | { 249 | "cell_type": "markdown", 250 | "metadata": {}, 251 | "source": [ 252 | "In order to **make a prediction**, the new observation must have the **same features as the training observations**, both in number and meaning." 253 | ] 254 | }, 255 | { 256 | "cell_type": "code", 257 | "execution_count": 8, 258 | "metadata": { 259 | "collapsed": false 260 | }, 261 | "outputs": [ 262 | { 263 | "data": { 264 | "text/plain": [ 265 | "array([1])" 266 | ] 267 | }, 268 | "execution_count": 8, 269 | "metadata": {}, 270 | "output_type": "execute_result" 271 | } 272 | ], 273 | "source": [ 274 | "# predict the response for a new observation\n", 275 | "knn.predict([[3, 5, 4, 2]])" 276 | ] 277 | }, 278 | { 279 | "cell_type": "markdown", 280 | "metadata": {}, 281 | "source": [ 282 | "## Part 2: Representing text as numerical data" 283 | ] 284 | }, 285 | { 286 | "cell_type": "code", 287 | "execution_count": 9, 288 | "metadata": { 289 | "collapsed": true 290 | }, 291 | "outputs": [], 292 | "source": [ 293 | "# example text for model training (SMS messages)\n", 294 | "simple_train = ['call you tonight', 'Call me a cab', 'please call me... 
PLEASE!']" 295 | ] 296 | }, 297 | { 298 | "cell_type": "code", 299 | "execution_count": 10, 300 | "metadata": { 301 | "collapsed": true 302 | }, 303 | "outputs": [], 304 | "source": [ 305 | "# example response vector\n", 306 | "is_desperate = [0, 0, 1]" 307 | ] 308 | }, 309 | { 310 | "cell_type": "markdown", 311 | "metadata": {}, 312 | "source": [ 313 | "From the [scikit-learn documentation](http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction):\n", 314 | "\n", 315 | "> Text Analysis is a major application field for machine learning algorithms. However the raw data, a sequence of symbols cannot be fed directly to the algorithms themselves as most of them expect **numerical feature vectors with a fixed size** rather than the **raw text documents with variable length**.\n", 316 | "\n", 317 | "We will use [CountVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) to \"convert text into a matrix of token counts\":" 318 | ] 319 | }, 320 | { 321 | "cell_type": "code", 322 | "execution_count": 11, 323 | "metadata": { 324 | "collapsed": true 325 | }, 326 | "outputs": [], 327 | "source": [ 328 | "# import and instantiate CountVectorizer (with the default parameters)\n", 329 | "from sklearn.feature_extraction.text import CountVectorizer\n", 330 | "vect = CountVectorizer()" 331 | ] 332 | }, 333 | { 334 | "cell_type": "code", 335 | "execution_count": 12, 336 | "metadata": { 337 | "collapsed": false 338 | }, 339 | "outputs": [ 340 | { 341 | "data": { 342 | "text/plain": [ 343 | "CountVectorizer(analyzer=u'word', binary=False, decode_error=u'strict',\n", 344 | " dtype=, encoding=u'utf-8', input=u'content',\n", 345 | " lowercase=True, max_df=1.0, max_features=None, min_df=1,\n", 346 | " ngram_range=(1, 1), preprocessor=None, stop_words=None,\n", 347 | " strip_accents=None, token_pattern=u'(?u)\\\\b\\\\w\\\\w+\\\\b',\n", 348 | " tokenizer=None, vocabulary=None)" 349 | ] 350 | }, 351 | "execution_count": 12, 352 | "metadata": {}, 353 | "output_type": "execute_result" 354 | } 355 | ], 356 | "source": [ 357 | "# learn the 'vocabulary' of the training data (occurs in-place)\n", 358 | "vect.fit(simple_train)" 359 | ] 360 | }, 361 | { 362 | "cell_type": "code", 363 | "execution_count": 13, 364 | "metadata": { 365 | "collapsed": false 366 | }, 367 | "outputs": [ 368 | { 369 | "data": { 370 | "text/plain": [ 371 | "[u'cab', u'call', u'me', u'please', u'tonight', u'you']" 372 | ] 373 | }, 374 | "execution_count": 13, 375 | "metadata": {}, 376 | "output_type": "execute_result" 377 | } 378 | ], 379 | "source": [ 380 | "# examine the fitted vocabulary\n", 381 | "vect.get_feature_names()" 382 | ] 383 | }, 384 | { 385 | "cell_type": "code", 386 | "execution_count": 14, 387 | "metadata": { 388 | "collapsed": false 389 | }, 390 | "outputs": [ 391 | { 392 | "data": { 393 | "text/plain": [ 394 | "<3x6 sparse matrix of type ''\n", 395 | "\twith 9 stored elements in Compressed Sparse Row format>" 396 | ] 397 | }, 398 | "execution_count": 14, 399 | "metadata": {}, 400 | "output_type": "execute_result" 401 | } 402 | ], 403 | "source": [ 404 | "# transform training data into a 'document-term matrix'\n", 405 | "simple_train_dtm = vect.transform(simple_train)\n", 406 | "simple_train_dtm" 407 | ] 408 | }, 409 | { 410 | "cell_type": "code", 411 | "execution_count": 15, 412 | "metadata": { 413 | "collapsed": false 414 | }, 415 | "outputs": [ 416 | { 417 | "data": { 418 | "text/plain": [ 419 | "array([[0, 1, 0, 0, 1, 1],\n", 420 | " [1, 1, 
1, 0, 0, 0],\n", 421 | " [0, 1, 1, 2, 0, 0]], dtype=int64)" 422 | ] 423 | }, 424 | "execution_count": 15, 425 | "metadata": {}, 426 | "output_type": "execute_result" 427 | } 428 | ], 429 | "source": [ 430 | "# convert sparse matrix to a dense matrix\n", 431 | "simple_train_dtm.toarray()" 432 | ] 433 | }, 434 | { 435 | "cell_type": "code", 436 | "execution_count": 16, 437 | "metadata": { 438 | "collapsed": false 439 | }, 440 | "outputs": [ 441 | { 442 | "data": { 443 | "text/html": [ 444 | "
\n", 445 | "\n", 446 | " \n", 447 | " \n", 448 | " \n", 449 | " \n", 450 | " \n", 451 | " \n", 452 | " \n", 453 | " \n", 454 | " \n", 455 | " \n", 456 | " \n", 457 | " \n", 458 | " \n", 459 | " \n", 460 | " \n", 461 | " \n", 462 | " \n", 463 | " \n", 464 | " \n", 465 | " \n", 466 | " \n", 467 | " \n", 468 | " \n", 469 | " \n", 470 | " \n", 471 | " \n", 472 | " \n", 473 | " \n", 474 | " \n", 475 | " \n", 476 | " \n", 477 | " \n", 478 | " \n", 479 | " \n", 480 | " \n", 481 | " \n", 482 | " \n", 483 | " \n", 484 | " \n", 485 | " \n", 486 | "
cabcallmepleasetonightyou
0010011
1111000
2011200
\n", 487 | "
" 488 | ], 489 | "text/plain": [ 490 | " cab call me please tonight you\n", 491 | "0 0 1 0 0 1 1\n", 492 | "1 1 1 1 0 0 0\n", 493 | "2 0 1 1 2 0 0" 494 | ] 495 | }, 496 | "execution_count": 16, 497 | "metadata": {}, 498 | "output_type": "execute_result" 499 | } 500 | ], 501 | "source": [ 502 | "# examine the vocabulary and document-term matrix together\n", 503 | "pd.DataFrame(simple_train_dtm.toarray(), columns=vect.get_feature_names())" 504 | ] 505 | }, 506 | { 507 | "cell_type": "markdown", 508 | "metadata": {}, 509 | "source": [ 510 | "From the [scikit-learn documentation](http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction):\n", 511 | "\n", 512 | "> In this scheme, features and samples are defined as follows:\n", 513 | "\n", 514 | "> - Each individual token occurrence frequency (normalized or not) is treated as a **feature**.\n", 515 | "> - The vector of all the token frequencies for a given document is considered a multivariate **sample**.\n", 516 | "\n", 517 | "> A **corpus of documents** can thus be represented by a matrix with **one row per document** and **one column per token** (e.g. word) occurring in the corpus.\n", 518 | "\n", 519 | "> We call **vectorization** the general process of turning a collection of text documents into numerical feature vectors. This specific strategy (tokenization, counting and normalization) is called the **Bag of Words** or \"Bag of n-grams\" representation. Documents are described by word occurrences while completely ignoring the relative position information of the words in the document." 520 | ] 521 | }, 522 | { 523 | "cell_type": "code", 524 | "execution_count": 17, 525 | "metadata": { 526 | "collapsed": false 527 | }, 528 | "outputs": [ 529 | { 530 | "data": { 531 | "text/plain": [ 532 | "scipy.sparse.csr.csr_matrix" 533 | ] 534 | }, 535 | "execution_count": 17, 536 | "metadata": {}, 537 | "output_type": "execute_result" 538 | } 539 | ], 540 | "source": [ 541 | "# check the type of the document-term matrix\n", 542 | "type(simple_train_dtm)" 543 | ] 544 | }, 545 | { 546 | "cell_type": "code", 547 | "execution_count": 18, 548 | "metadata": { 549 | "collapsed": false, 550 | "scrolled": true 551 | }, 552 | "outputs": [ 553 | { 554 | "name": "stdout", 555 | "output_type": "stream", 556 | "text": [ 557 | " (0, 1)\t1\n", 558 | " (0, 4)\t1\n", 559 | " (0, 5)\t1\n", 560 | " (1, 0)\t1\n", 561 | " (1, 1)\t1\n", 562 | " (1, 2)\t1\n", 563 | " (2, 1)\t1\n", 564 | " (2, 2)\t1\n", 565 | " (2, 3)\t2\n" 566 | ] 567 | } 568 | ], 569 | "source": [ 570 | "# examine the sparse matrix contents\n", 571 | "print(simple_train_dtm)" 572 | ] 573 | }, 574 | { 575 | "cell_type": "markdown", 576 | "metadata": {}, 577 | "source": [ 578 | "From the [scikit-learn documentation](http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction):\n", 579 | "\n", 580 | "> As most documents will typically use a very small subset of the words used in the corpus, the resulting matrix will have **many feature values that are zeros** (typically more than 99% of them).\n", 581 | "\n", 582 | "> For instance, a collection of 10,000 short text documents (such as emails) will use a vocabulary with a size in the order of 100,000 unique words in total while each document will use 100 to 1000 unique words individually.\n", 583 | "\n", 584 | "> In order to be able to **store such a matrix in memory** but also to **speed up operations**, implementations will typically use a **sparse representation** such as the implementations available in 
the `scipy.sparse` package." 585 | ] 586 | }, 587 | { 588 | "cell_type": "code", 589 | "execution_count": 19, 590 | "metadata": { 591 | "collapsed": false 592 | }, 593 | "outputs": [ 594 | { 595 | "data": { 596 | "text/plain": [ 597 | "KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',\n", 598 | " metric_params=None, n_jobs=1, n_neighbors=1, p=2,\n", 599 | " weights='uniform')" 600 | ] 601 | }, 602 | "execution_count": 19, 603 | "metadata": {}, 604 | "output_type": "execute_result" 605 | } 606 | ], 607 | "source": [ 608 | "# build a model to predict desperation\n", 609 | "knn = KNeighborsClassifier(n_neighbors=1)\n", 610 | "knn.fit(simple_train_dtm, is_desperate)" 611 | ] 612 | }, 613 | { 614 | "cell_type": "code", 615 | "execution_count": 20, 616 | "metadata": { 617 | "collapsed": true 618 | }, 619 | "outputs": [], 620 | "source": [ 621 | "# example text for model testing\n", 622 | "simple_test = [\"please don't call me\"]" 623 | ] 624 | }, 625 | { 626 | "cell_type": "markdown", 627 | "metadata": {}, 628 | "source": [ 629 | "In order to **make a prediction**, the new observation must have the **same features as the training observations**, both in number and meaning." 630 | ] 631 | }, 632 | { 633 | "cell_type": "code", 634 | "execution_count": 21, 635 | "metadata": { 636 | "collapsed": false 637 | }, 638 | "outputs": [ 639 | { 640 | "data": { 641 | "text/plain": [ 642 | "array([[0, 1, 1, 1, 0, 0]], dtype=int64)" 643 | ] 644 | }, 645 | "execution_count": 21, 646 | "metadata": {}, 647 | "output_type": "execute_result" 648 | } 649 | ], 650 | "source": [ 651 | "# transform testing data into a document-term matrix (using existing vocabulary)\n", 652 | "simple_test_dtm = vect.transform(simple_test)\n", 653 | "simple_test_dtm.toarray()" 654 | ] 655 | }, 656 | { 657 | "cell_type": "code", 658 | "execution_count": 22, 659 | "metadata": { 660 | "collapsed": false 661 | }, 662 | "outputs": [ 663 | { 664 | "data": { 665 | "text/html": [ 666 | "
\n", 667 | "\n", 668 | " \n", 669 | " \n", 670 | " \n", 671 | " \n", 672 | " \n", 673 | " \n", 674 | " \n", 675 | " \n", 676 | " \n", 677 | " \n", 678 | " \n", 679 | " \n", 680 | " \n", 681 | " \n", 682 | " \n", 683 | " \n", 684 | " \n", 685 | " \n", 686 | " \n", 687 | " \n", 688 | " \n", 689 | " \n", 690 | "
cabcallmepleasetonightyou
0011100
\n", 691 | "
" 692 | ], 693 | "text/plain": [ 694 | " cab call me please tonight you\n", 695 | "0 0 1 1 1 0 0" 696 | ] 697 | }, 698 | "execution_count": 22, 699 | "metadata": {}, 700 | "output_type": "execute_result" 701 | } 702 | ], 703 | "source": [ 704 | "# examine the vocabulary and document-term matrix together\n", 705 | "pd.DataFrame(simple_test_dtm.toarray(), columns=vect.get_feature_names())" 706 | ] 707 | }, 708 | { 709 | "cell_type": "code", 710 | "execution_count": 23, 711 | "metadata": { 712 | "collapsed": false 713 | }, 714 | "outputs": [ 715 | { 716 | "data": { 717 | "text/plain": [ 718 | "array([1])" 719 | ] 720 | }, 721 | "execution_count": 23, 722 | "metadata": {}, 723 | "output_type": "execute_result" 724 | } 725 | ], 726 | "source": [ 727 | "# predict whether simple_test is desperate\n", 728 | "knn.predict(simple_test_dtm)" 729 | ] 730 | }, 731 | { 732 | "cell_type": "markdown", 733 | "metadata": {}, 734 | "source": [ 735 | "**Summary:**\n", 736 | "\n", 737 | "- `vect.fit(train)` **learns the vocabulary** of the training data\n", 738 | "- `vect.transform(train)` uses the **fitted vocabulary** to build a document-term matrix from the training data\n", 739 | "- `vect.transform(test)` uses the **fitted vocabulary** to build a document-term matrix from the testing data (and **ignores tokens** it hasn't seen before)" 740 | ] 741 | }, 742 | { 743 | "cell_type": "markdown", 744 | "metadata": {}, 745 | "source": [ 746 | "## Part 3: Reading a text-based dataset into pandas" 747 | ] 748 | }, 749 | { 750 | "cell_type": "code", 751 | "execution_count": 24, 752 | "metadata": { 753 | "collapsed": true 754 | }, 755 | "outputs": [], 756 | "source": [ 757 | "# read file into pandas from the working directory\n", 758 | "sms = pd.read_table('sms.tsv', header=None, names=['label', 'message'])" 759 | ] 760 | }, 761 | { 762 | "cell_type": "code", 763 | "execution_count": 25, 764 | "metadata": { 765 | "collapsed": false 766 | }, 767 | "outputs": [], 768 | "source": [ 769 | "# alternative: read file into pandas from a URL\n", 770 | "# url = 'https://raw.githubusercontent.com/justmarkham/pydata-dc-2016-tutorial/master/sms.tsv'\n", 771 | "# sms = pd.read_table(url, header=None, names=['label', 'message'])" 772 | ] 773 | }, 774 | { 775 | "cell_type": "code", 776 | "execution_count": 26, 777 | "metadata": { 778 | "collapsed": false 779 | }, 780 | "outputs": [ 781 | { 782 | "data": { 783 | "text/plain": [ 784 | "(5572, 2)" 785 | ] 786 | }, 787 | "execution_count": 26, 788 | "metadata": {}, 789 | "output_type": "execute_result" 790 | } 791 | ], 792 | "source": [ 793 | "# examine the shape\n", 794 | "sms.shape" 795 | ] 796 | }, 797 | { 798 | "cell_type": "code", 799 | "execution_count": 27, 800 | "metadata": { 801 | "collapsed": false 802 | }, 803 | "outputs": [ 804 | { 805 | "data": { 806 | "text/html": [ 807 | "
\n", 808 | "\n", 809 | " \n", 810 | " \n", 811 | " \n", 812 | " \n", 813 | " \n", 814 | " \n", 815 | " \n", 816 | " \n", 817 | " \n", 818 | " \n", 819 | " \n", 820 | " \n", 821 | " \n", 822 | " \n", 823 | " \n", 824 | " \n", 825 | " \n", 826 | " \n", 827 | " \n", 828 | " \n", 829 | " \n", 830 | " \n", 831 | " \n", 832 | " \n", 833 | " \n", 834 | " \n", 835 | " \n", 836 | " \n", 837 | " \n", 838 | " \n", 839 | " \n", 840 | " \n", 841 | " \n", 842 | " \n", 843 | " \n", 844 | " \n", 845 | " \n", 846 | " \n", 847 | " \n", 848 | " \n", 849 | " \n", 850 | " \n", 851 | " \n", 852 | " \n", 853 | " \n", 854 | " \n", 855 | " \n", 856 | " \n", 857 | " \n", 858 | " \n", 859 | " \n", 860 | " \n", 861 | " \n", 862 | " \n", 863 | " \n", 864 | " \n", 865 | " \n", 866 | " \n", 867 | " \n", 868 | "
labelmessage
0hamGo until jurong point, crazy.. Available only ...
1hamOk lar... Joking wif u oni...
2spamFree entry in 2 a wkly comp to win FA Cup fina...
3hamU dun say so early hor... U c already then say...
4hamNah I don't think he goes to usf, he lives aro...
5spamFreeMsg Hey there darling it's been 3 week's n...
6hamEven my brother is not like to speak with me. ...
7hamAs per your request 'Melle Melle (Oru Minnamin...
8spamWINNER!! As a valued network customer you have...
9spamHad your mobile 11 months or more? U R entitle...
\n", 869 | "
" 870 | ], 871 | "text/plain": [ 872 | " label message\n", 873 | "0 ham Go until jurong point, crazy.. Available only ...\n", 874 | "1 ham Ok lar... Joking wif u oni...\n", 875 | "2 spam Free entry in 2 a wkly comp to win FA Cup fina...\n", 876 | "3 ham U dun say so early hor... U c already then say...\n", 877 | "4 ham Nah I don't think he goes to usf, he lives aro...\n", 878 | "5 spam FreeMsg Hey there darling it's been 3 week's n...\n", 879 | "6 ham Even my brother is not like to speak with me. ...\n", 880 | "7 ham As per your request 'Melle Melle (Oru Minnamin...\n", 881 | "8 spam WINNER!! As a valued network customer you have...\n", 882 | "9 spam Had your mobile 11 months or more? U R entitle..." 883 | ] 884 | }, 885 | "execution_count": 27, 886 | "metadata": {}, 887 | "output_type": "execute_result" 888 | } 889 | ], 890 | "source": [ 891 | "# examine the first 10 rows\n", 892 | "sms.head(10)" 893 | ] 894 | }, 895 | { 896 | "cell_type": "code", 897 | "execution_count": 28, 898 | "metadata": { 899 | "collapsed": false 900 | }, 901 | "outputs": [ 902 | { 903 | "data": { 904 | "text/plain": [ 905 | "ham 4825\n", 906 | "spam 747\n", 907 | "Name: label, dtype: int64" 908 | ] 909 | }, 910 | "execution_count": 28, 911 | "metadata": {}, 912 | "output_type": "execute_result" 913 | } 914 | ], 915 | "source": [ 916 | "# examine the class distribution\n", 917 | "sms.label.value_counts()" 918 | ] 919 | }, 920 | { 921 | "cell_type": "code", 922 | "execution_count": 29, 923 | "metadata": { 924 | "collapsed": true 925 | }, 926 | "outputs": [], 927 | "source": [ 928 | "# convert label to a numerical variable\n", 929 | "sms['label_num'] = sms.label.map({'ham':0, 'spam':1})" 930 | ] 931 | }, 932 | { 933 | "cell_type": "code", 934 | "execution_count": 30, 935 | "metadata": { 936 | "collapsed": false 937 | }, 938 | "outputs": [ 939 | { 940 | "data": { 941 | "text/html": [ 942 | "
\n", 943 | "\n", 944 | " \n", 945 | " \n", 946 | " \n", 947 | " \n", 948 | " \n", 949 | " \n", 950 | " \n", 951 | " \n", 952 | " \n", 953 | " \n", 954 | " \n", 955 | " \n", 956 | " \n", 957 | " \n", 958 | " \n", 959 | " \n", 960 | " \n", 961 | " \n", 962 | " \n", 963 | " \n", 964 | " \n", 965 | " \n", 966 | " \n", 967 | " \n", 968 | " \n", 969 | " \n", 970 | " \n", 971 | " \n", 972 | " \n", 973 | " \n", 974 | " \n", 975 | " \n", 976 | " \n", 977 | " \n", 978 | " \n", 979 | " \n", 980 | " \n", 981 | " \n", 982 | " \n", 983 | " \n", 984 | " \n", 985 | " \n", 986 | " \n", 987 | " \n", 988 | " \n", 989 | " \n", 990 | " \n", 991 | " \n", 992 | " \n", 993 | " \n", 994 | " \n", 995 | " \n", 996 | " \n", 997 | " \n", 998 | " \n", 999 | " \n", 1000 | " \n", 1001 | " \n", 1002 | " \n", 1003 | " \n", 1004 | " \n", 1005 | " \n", 1006 | " \n", 1007 | " \n", 1008 | " \n", 1009 | " \n", 1010 | " \n", 1011 | " \n", 1012 | " \n", 1013 | " \n", 1014 | "
labelmessagelabel_num
0hamGo until jurong point, crazy.. Available only ...0
1hamOk lar... Joking wif u oni...0
2spamFree entry in 2 a wkly comp to win FA Cup fina...1
3hamU dun say so early hor... U c already then say...0
4hamNah I don't think he goes to usf, he lives aro...0
5spamFreeMsg Hey there darling it's been 3 week's n...1
6hamEven my brother is not like to speak with me. ...0
7hamAs per your request 'Melle Melle (Oru Minnamin...0
8spamWINNER!! As a valued network customer you have...1
9spamHad your mobile 11 months or more? U R entitle...1
\n", 1015 | "
" 1016 | ], 1017 | "text/plain": [ 1018 | " label message label_num\n", 1019 | "0 ham Go until jurong point, crazy.. Available only ... 0\n", 1020 | "1 ham Ok lar... Joking wif u oni... 0\n", 1021 | "2 spam Free entry in 2 a wkly comp to win FA Cup fina... 1\n", 1022 | "3 ham U dun say so early hor... U c already then say... 0\n", 1023 | "4 ham Nah I don't think he goes to usf, he lives aro... 0\n", 1024 | "5 spam FreeMsg Hey there darling it's been 3 week's n... 1\n", 1025 | "6 ham Even my brother is not like to speak with me. ... 0\n", 1026 | "7 ham As per your request 'Melle Melle (Oru Minnamin... 0\n", 1027 | "8 spam WINNER!! As a valued network customer you have... 1\n", 1028 | "9 spam Had your mobile 11 months or more? U R entitle... 1" 1029 | ] 1030 | }, 1031 | "execution_count": 30, 1032 | "metadata": {}, 1033 | "output_type": "execute_result" 1034 | } 1035 | ], 1036 | "source": [ 1037 | "# check that the conversion worked\n", 1038 | "sms.head(10)" 1039 | ] 1040 | }, 1041 | { 1042 | "cell_type": "code", 1043 | "execution_count": 31, 1044 | "metadata": { 1045 | "collapsed": false 1046 | }, 1047 | "outputs": [ 1048 | { 1049 | "name": "stdout", 1050 | "output_type": "stream", 1051 | "text": [ 1052 | "(150L, 4L)\n", 1053 | "(150L,)\n" 1054 | ] 1055 | } 1056 | ], 1057 | "source": [ 1058 | "# how to define X and y (from the iris data) for use with a MODEL\n", 1059 | "X = iris.data\n", 1060 | "y = iris.target\n", 1061 | "print(X.shape)\n", 1062 | "print(y.shape)" 1063 | ] 1064 | }, 1065 | { 1066 | "cell_type": "code", 1067 | "execution_count": 32, 1068 | "metadata": { 1069 | "collapsed": false 1070 | }, 1071 | "outputs": [ 1072 | { 1073 | "name": "stdout", 1074 | "output_type": "stream", 1075 | "text": [ 1076 | "(5572L,)\n", 1077 | "(5572L,)\n" 1078 | ] 1079 | } 1080 | ], 1081 | "source": [ 1082 | "# how to define X and y (from the SMS data) for use with COUNTVECTORIZER\n", 1083 | "X = sms.message\n", 1084 | "y = sms.label_num\n", 1085 | "print(X.shape)\n", 1086 | "print(y.shape)" 1087 | ] 1088 | }, 1089 | { 1090 | "cell_type": "code", 1091 | "execution_count": 33, 1092 | "metadata": { 1093 | "collapsed": false 1094 | }, 1095 | "outputs": [ 1096 | { 1097 | "name": "stdout", 1098 | "output_type": "stream", 1099 | "text": [ 1100 | "(4179L,)\n", 1101 | "(1393L,)\n", 1102 | "(4179L,)\n", 1103 | "(1393L,)\n" 1104 | ] 1105 | } 1106 | ], 1107 | "source": [ 1108 | "# split X and y into training and testing sets\n", 1109 | "from sklearn.cross_validation import train_test_split\n", 1110 | "X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)\n", 1111 | "print(X_train.shape)\n", 1112 | "print(X_test.shape)\n", 1113 | "print(y_train.shape)\n", 1114 | "print(y_test.shape)" 1115 | ] 1116 | }, 1117 | { 1118 | "cell_type": "markdown", 1119 | "metadata": {}, 1120 | "source": [ 1121 | "## Part 4: Vectorizing our dataset" 1122 | ] 1123 | }, 1124 | { 1125 | "cell_type": "code", 1126 | "execution_count": 34, 1127 | "metadata": { 1128 | "collapsed": true 1129 | }, 1130 | "outputs": [], 1131 | "source": [ 1132 | "# instantiate the vectorizer\n", 1133 | "vect = CountVectorizer()" 1134 | ] 1135 | }, 1136 | { 1137 | "cell_type": "code", 1138 | "execution_count": 35, 1139 | "metadata": { 1140 | "collapsed": true 1141 | }, 1142 | "outputs": [], 1143 | "source": [ 1144 | "# learn training data vocabulary, then use it to create a document-term matrix\n", 1145 | "vect.fit(X_train)\n", 1146 | "X_train_dtm = vect.transform(X_train)" 1147 | ] 1148 | }, 1149 | { 1150 | "cell_type": "code", 1151 | 
"execution_count": 36, 1152 | "metadata": { 1153 | "collapsed": true 1154 | }, 1155 | "outputs": [], 1156 | "source": [ 1157 | "# equivalently: combine fit and transform into a single step\n", 1158 | "X_train_dtm = vect.fit_transform(X_train)" 1159 | ] 1160 | }, 1161 | { 1162 | "cell_type": "code", 1163 | "execution_count": 37, 1164 | "metadata": { 1165 | "collapsed": false 1166 | }, 1167 | "outputs": [ 1168 | { 1169 | "data": { 1170 | "text/plain": [ 1171 | "<4179x7456 sparse matrix of type ''\n", 1172 | "\twith 55209 stored elements in Compressed Sparse Row format>" 1173 | ] 1174 | }, 1175 | "execution_count": 37, 1176 | "metadata": {}, 1177 | "output_type": "execute_result" 1178 | } 1179 | ], 1180 | "source": [ 1181 | "# examine the document-term matrix\n", 1182 | "X_train_dtm" 1183 | ] 1184 | }, 1185 | { 1186 | "cell_type": "code", 1187 | "execution_count": 38, 1188 | "metadata": { 1189 | "collapsed": false 1190 | }, 1191 | "outputs": [ 1192 | { 1193 | "data": { 1194 | "text/plain": [ 1195 | "<1393x7456 sparse matrix of type ''\n", 1196 | "\twith 17604 stored elements in Compressed Sparse Row format>" 1197 | ] 1198 | }, 1199 | "execution_count": 38, 1200 | "metadata": {}, 1201 | "output_type": "execute_result" 1202 | } 1203 | ], 1204 | "source": [ 1205 | "# transform testing data (using fitted vocabulary) into a document-term matrix\n", 1206 | "X_test_dtm = vect.transform(X_test)\n", 1207 | "X_test_dtm" 1208 | ] 1209 | }, 1210 | { 1211 | "cell_type": "markdown", 1212 | "metadata": {}, 1213 | "source": [ 1214 | "## Part 5: Building and evaluating a model\n", 1215 | "\n", 1216 | "We will use [multinomial Naive Bayes](http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html):\n", 1217 | "\n", 1218 | "> The multinomial Naive Bayes classifier is suitable for classification with **discrete features** (e.g., word counts for text classification). The multinomial distribution normally requires integer feature counts. However, in practice, fractional counts such as tf-idf may also work." 
1219 | ] 1220 | }, 1221 | { 1222 | "cell_type": "code", 1223 | "execution_count": 39, 1224 | "metadata": { 1225 | "collapsed": true 1226 | }, 1227 | "outputs": [], 1228 | "source": [ 1229 | "# import and instantiate a Multinomial Naive Bayes model\n", 1230 | "from sklearn.naive_bayes import MultinomialNB\n", 1231 | "nb = MultinomialNB()" 1232 | ] 1233 | }, 1234 | { 1235 | "cell_type": "code", 1236 | "execution_count": 40, 1237 | "metadata": { 1238 | "collapsed": false 1239 | }, 1240 | "outputs": [ 1241 | { 1242 | "name": "stdout", 1243 | "output_type": "stream", 1244 | "text": [ 1245 | "Wall time: 6 ms\n" 1246 | ] 1247 | }, 1248 | { 1249 | "data": { 1250 | "text/plain": [ 1251 | "MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)" 1252 | ] 1253 | }, 1254 | "execution_count": 40, 1255 | "metadata": {}, 1256 | "output_type": "execute_result" 1257 | } 1258 | ], 1259 | "source": [ 1260 | "# train the model using X_train_dtm (timing it with an IPython \"magic command\")\n", 1261 | "%time nb.fit(X_train_dtm, y_train)" 1262 | ] 1263 | }, 1264 | { 1265 | "cell_type": "code", 1266 | "execution_count": 41, 1267 | "metadata": { 1268 | "collapsed": true 1269 | }, 1270 | "outputs": [], 1271 | "source": [ 1272 | "# make class predictions for X_test_dtm\n", 1273 | "y_pred_class = nb.predict(X_test_dtm)" 1274 | ] 1275 | }, 1276 | { 1277 | "cell_type": "code", 1278 | "execution_count": 42, 1279 | "metadata": { 1280 | "collapsed": false 1281 | }, 1282 | "outputs": [ 1283 | { 1284 | "data": { 1285 | "text/plain": [ 1286 | "0.98851399856424982" 1287 | ] 1288 | }, 1289 | "execution_count": 42, 1290 | "metadata": {}, 1291 | "output_type": "execute_result" 1292 | } 1293 | ], 1294 | "source": [ 1295 | "# calculate accuracy of class predictions\n", 1296 | "from sklearn import metrics\n", 1297 | "metrics.accuracy_score(y_test, y_pred_class)" 1298 | ] 1299 | }, 1300 | { 1301 | "cell_type": "code", 1302 | "execution_count": 43, 1303 | "metadata": { 1304 | "collapsed": false 1305 | }, 1306 | "outputs": [ 1307 | { 1308 | "data": { 1309 | "text/plain": [ 1310 | "array([[1203, 5],\n", 1311 | " [ 11, 174]])" 1312 | ] 1313 | }, 1314 | "execution_count": 43, 1315 | "metadata": {}, 1316 | "output_type": "execute_result" 1317 | } 1318 | ], 1319 | "source": [ 1320 | "# print the confusion matrix\n", 1321 | "metrics.confusion_matrix(y_test, y_pred_class)" 1322 | ] 1323 | }, 1324 | { 1325 | "cell_type": "code", 1326 | "execution_count": 44, 1327 | "metadata": { 1328 | "collapsed": false 1329 | }, 1330 | "outputs": [ 1331 | { 1332 | "data": { 1333 | "text/plain": [ 1334 | "574 Waiting for your call.\n", 1335 | "3375 Also andros ice etc etc\n", 1336 | "45 No calls..messages..missed calls\n", 1337 | "3415 No pic. 
Please re-send.\n", 1338 | "1988 No calls..messages..missed calls\n", 1339 | "Name: message, dtype: object" 1340 | ] 1341 | }, 1342 | "execution_count": 44, 1343 | "metadata": {}, 1344 | "output_type": "execute_result" 1345 | } 1346 | ], 1347 | "source": [ 1348 | "# print message text for the false positives (ham incorrectly classified as spam)\n", 1349 | "X_test[y_test < y_pred_class]" 1350 | ] 1351 | }, 1352 | { 1353 | "cell_type": "code", 1354 | "execution_count": 45, 1355 | "metadata": { 1356 | "collapsed": false, 1357 | "scrolled": true 1358 | }, 1359 | "outputs": [ 1360 | { 1361 | "data": { 1362 | "text/plain": [ 1363 | "3132 LookAtMe!: Thanks for your purchase of a video...\n", 1364 | "5 FreeMsg Hey there darling it's been 3 week's n...\n", 1365 | "3530 Xmas & New Years Eve tickets are now on sale f...\n", 1366 | "684 Hi I'm sue. I am 20 years old and work as a la...\n", 1367 | "1875 Would you like to see my XXX pics they are so ...\n", 1368 | "1893 CALL 09090900040 & LISTEN TO EXTREME DIRTY LIV...\n", 1369 | "4298 thesmszone.com lets you send free anonymous an...\n", 1370 | "4949 Hi this is Amy, we will be sending you a free ...\n", 1371 | "2821 INTERFLORA - “It's not too late to order Inter...\n", 1372 | "2247 Hi ya babe x u 4goten bout me?' scammers getti...\n", 1373 | "4514 Money i have won wining number 946 wot do i do...\n", 1374 | "Name: message, dtype: object" 1375 | ] 1376 | }, 1377 | "execution_count": 45, 1378 | "metadata": {}, 1379 | "output_type": "execute_result" 1380 | } 1381 | ], 1382 | "source": [ 1383 | "# print message text for the false negatives (spam incorrectly classified as ham)\n", 1384 | "X_test[y_test > y_pred_class]" 1385 | ] 1386 | }, 1387 | { 1388 | "cell_type": "code", 1389 | "execution_count": 46, 1390 | "metadata": { 1391 | "collapsed": false, 1392 | "scrolled": true 1393 | }, 1394 | "outputs": [ 1395 | { 1396 | "data": { 1397 | "text/plain": [ 1398 | "\"LookAtMe!: Thanks for your purchase of a video clip from LookAtMe!, you've been charged 35p. Think you can do better? 
Why not send a video in a MMSto 32323.\"" 1399 | ] 1400 | }, 1401 | "execution_count": 46, 1402 | "metadata": {}, 1403 | "output_type": "execute_result" 1404 | } 1405 | ], 1406 | "source": [ 1407 | "# example false negative\n", 1408 | "X_test[3132]" 1409 | ] 1410 | }, 1411 | { 1412 | "cell_type": "code", 1413 | "execution_count": 47, 1414 | "metadata": { 1415 | "collapsed": false 1416 | }, 1417 | "outputs": [ 1418 | { 1419 | "data": { 1420 | "text/plain": [ 1421 | "array([ 2.87744864e-03, 1.83488846e-05, 2.07301295e-03, ...,\n", 1422 | " 1.09026171e-06, 1.00000000e+00, 3.98279868e-09])" 1423 | ] 1424 | }, 1425 | "execution_count": 47, 1426 | "metadata": {}, 1427 | "output_type": "execute_result" 1428 | } 1429 | ], 1430 | "source": [ 1431 | "# calculate predicted probabilities for X_test_dtm (poorly calibrated)\n", 1432 | "y_pred_prob = nb.predict_proba(X_test_dtm)[:, 1]\n", 1433 | "y_pred_prob" 1434 | ] 1435 | }, 1436 | { 1437 | "cell_type": "code", 1438 | "execution_count": 48, 1439 | "metadata": { 1440 | "collapsed": false 1441 | }, 1442 | "outputs": [ 1443 | { 1444 | "data": { 1445 | "text/plain": [ 1446 | "0.98664310005369604" 1447 | ] 1448 | }, 1449 | "execution_count": 48, 1450 | "metadata": {}, 1451 | "output_type": "execute_result" 1452 | } 1453 | ], 1454 | "source": [ 1455 | "# calculate AUC\n", 1456 | "metrics.roc_auc_score(y_test, y_pred_prob)" 1457 | ] 1458 | }, 1459 | { 1460 | "cell_type": "markdown", 1461 | "metadata": {}, 1462 | "source": [ 1463 | "## Part 6: Comparing models\n", 1464 | "\n", 1465 | "We will compare multinomial Naive Bayes with [logistic regression](http://scikit-learn.org/stable/modules/linear_model.html#logistic-regression):\n", 1466 | "\n", 1467 | "> Logistic regression, despite its name, is a **linear model for classification** rather than regression. Logistic regression is also known in the literature as logit regression, maximum-entropy classification (MaxEnt) or the log-linear classifier. In this model, the probabilities describing the possible outcomes of a single trial are modeled using a logistic function." 
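,
    "\n",
    "Once the cells below have been run, the two classifiers can be compared head-to-head on the same test document-term matrix. A minimal sketch, assuming the fitted `nb` and `logreg` models plus the `X_test_dtm` and `y_test` objects from this notebook:\n",
    "\n",
    "```python\n",
    "# sketch: evaluate both fitted models on the same test document-term matrix\n",
    "for name, model in [('Multinomial Naive Bayes', nb), ('Logistic regression', logreg)]:\n",
    "    pred = model.predict(X_test_dtm)\n",
    "    prob = model.predict_proba(X_test_dtm)[:, 1]\n",
    "    print(name)\n",
    "    print(metrics.accuracy_score(y_test, pred))\n",
    "    print(metrics.roc_auc_score(y_test, prob))\n",
    "```"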
1468 | ] 1469 | }, 1470 | { 1471 | "cell_type": "code", 1472 | "execution_count": 49, 1473 | "metadata": { 1474 | "collapsed": true 1475 | }, 1476 | "outputs": [], 1477 | "source": [ 1478 | "# import and instantiate a logistic regression model\n", 1479 | "from sklearn.linear_model import LogisticRegression\n", 1480 | "logreg = LogisticRegression()" 1481 | ] 1482 | }, 1483 | { 1484 | "cell_type": "code", 1485 | "execution_count": 50, 1486 | "metadata": { 1487 | "collapsed": false 1488 | }, 1489 | "outputs": [ 1490 | { 1491 | "name": "stdout", 1492 | "output_type": "stream", 1493 | "text": [ 1494 | "Wall time: 131 ms\n" 1495 | ] 1496 | }, 1497 | { 1498 | "data": { 1499 | "text/plain": [ 1500 | "LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,\n", 1501 | " intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,\n", 1502 | " penalty='l2', random_state=None, solver='liblinear', tol=0.0001,\n", 1503 | " verbose=0, warm_start=False)" 1504 | ] 1505 | }, 1506 | "execution_count": 50, 1507 | "metadata": {}, 1508 | "output_type": "execute_result" 1509 | } 1510 | ], 1511 | "source": [ 1512 | "# train the model using X_train_dtm\n", 1513 | "%time logreg.fit(X_train_dtm, y_train)" 1514 | ] 1515 | }, 1516 | { 1517 | "cell_type": "code", 1518 | "execution_count": 51, 1519 | "metadata": { 1520 | "collapsed": true 1521 | }, 1522 | "outputs": [], 1523 | "source": [ 1524 | "# make class predictions for X_test_dtm\n", 1525 | "y_pred_class = logreg.predict(X_test_dtm)" 1526 | ] 1527 | }, 1528 | { 1529 | "cell_type": "code", 1530 | "execution_count": 52, 1531 | "metadata": { 1532 | "collapsed": false 1533 | }, 1534 | "outputs": [ 1535 | { 1536 | "data": { 1537 | "text/plain": [ 1538 | "array([ 0.01269556, 0.00347183, 0.00616517, ..., 0.03354907,\n", 1539 | " 0.99725053, 0.00157706])" 1540 | ] 1541 | }, 1542 | "execution_count": 52, 1543 | "metadata": {}, 1544 | "output_type": "execute_result" 1545 | } 1546 | ], 1547 | "source": [ 1548 | "# calculate predicted probabilities for X_test_dtm (well calibrated)\n", 1549 | "y_pred_prob = logreg.predict_proba(X_test_dtm)[:, 1]\n", 1550 | "y_pred_prob" 1551 | ] 1552 | }, 1553 | { 1554 | "cell_type": "code", 1555 | "execution_count": 53, 1556 | "metadata": { 1557 | "collapsed": false 1558 | }, 1559 | "outputs": [ 1560 | { 1561 | "data": { 1562 | "text/plain": [ 1563 | "0.9877961234745154" 1564 | ] 1565 | }, 1566 | "execution_count": 53, 1567 | "metadata": {}, 1568 | "output_type": "execute_result" 1569 | } 1570 | ], 1571 | "source": [ 1572 | "# calculate accuracy\n", 1573 | "metrics.accuracy_score(y_test, y_pred_class)" 1574 | ] 1575 | }, 1576 | { 1577 | "cell_type": "code", 1578 | "execution_count": 54, 1579 | "metadata": { 1580 | "collapsed": false 1581 | }, 1582 | "outputs": [ 1583 | { 1584 | "data": { 1585 | "text/plain": [ 1586 | "0.99368176123143015" 1587 | ] 1588 | }, 1589 | "execution_count": 54, 1590 | "metadata": {}, 1591 | "output_type": "execute_result" 1592 | } 1593 | ], 1594 | "source": [ 1595 | "# calculate AUC\n", 1596 | "metrics.roc_auc_score(y_test, y_pred_prob)" 1597 | ] 1598 | } 1599 | ], 1600 | "metadata": { 1601 | "kernelspec": { 1602 | "display_name": "Python 2", 1603 | "language": "python", 1604 | "name": "python2" 1605 | }, 1606 | "language_info": { 1607 | "codemirror_mode": { 1608 | "name": "ipython", 1609 | "version": 2 1610 | }, 1611 | "file_extension": ".py", 1612 | "mimetype": "text/x-python", 1613 | "name": "python", 1614 | "nbconvert_exporter": "python", 1615 | "pygments_lexer": "ipython2", 1616 | 
"version": "2.7.11" 1617 | } 1618 | }, 1619 | "nbformat": 4, 1620 | "nbformat_minor": 0 1621 | } 1622 | -------------------------------------------------------------------------------- /youtube.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/justmarkham/pydata-dc-2016-tutorial/319cf51fee90fcfd3a4ddd56c60c2584c36689bf/youtube.jpg --------------------------------------------------------------------------------