├── .gitignore
├── README.md
├── sms.tsv
├── tutorial.ipynb
├── tutorial.py
├── tutorial_with_output.ipynb
└── youtube.jpg
/.gitignore:
--------------------------------------------------------------------------------
1 | .ipynb_checkpoints/
2 | .DS_Store
3 | *.pyc
4 | extras/
5 | *.tpl
6 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | ## Tutorial: Machine Learning with Text in scikit-learn
2 |
3 | Presented by [Kevin Markham](http://www.dataschool.io/about/) at PyData DC on October 7, 2016. Watch the complete [tutorial video](https://www.youtube.com/watch?v=vTaxdJ6VYWE) on YouTube.
4 |
5 | [![Watch the tutorial video](youtube.jpg)](https://www.youtube.com/watch?v=vTaxdJ6VYWE "Machine Learning with Text in scikit-learn - PyData DC 2016")
6 |
7 | ### Description
8 |
9 | Although numeric data is easy to work with in Python, most knowledge created by humans is actually raw, unstructured text. By learning how to transform text into data that is usable by machine learning models, you drastically increase the amount of data that your models can learn from. In this tutorial, we'll build and evaluate predictive models from real-world text using scikit-learn.
10 |
11 | ### Objectives
12 |
13 | By the end of this tutorial, participants will be able to confidently build a predictive model from their own text-based data, including feature extraction, model building and model evaluation.
14 |
15 | ### Required Software
16 |
17 | Participants will need to bring a laptop with [scikit-learn](http://scikit-learn.org/stable/install.html) and [pandas](http://pandas.pydata.org/pandas-docs/stable/install.html) (and their dependencies) already installed. Installing the [Anaconda distribution of Python](https://www.continuum.io/downloads) is an easy way to accomplish this. Both Python 2 and 3 are welcome.
18 |
19 | I will be leading the tutorial using the Jupyter notebook, and have added a pre-written notebook to this repository. I have also created a Python script that is identical to the notebook, which you can use in the Python environment of your choice.
20 |
21 | ### Tutorial Files
22 |
23 | * Jupyter notebooks: [tutorial.ipynb](tutorial.ipynb), [tutorial_with_output.ipynb](tutorial_with_output.ipynb)
24 | * Python script: [tutorial.py](tutorial.py)
25 | * Dataset: [sms.tsv](sms.tsv)
26 |
27 | ### Prerequisite Knowledge
28 |
29 | To get the most out of this tutorial, participants should be comfortable working in Python, should understand the basic principles of machine learning, and should have at least basic experience with both pandas and scikit-learn. However, no knowledge of advanced mathematics is required.
30 |
31 | ### Abstract
32 |
33 | It can be difficult to figure out how to work with text in scikit-learn, even if you're already comfortable with the scikit-learn API. Many questions immediately come up: Which vectorizer should I use, and why? What's the difference between a "fit" and a "transform"? What's a document-term matrix, and why is it so sparse? Is it okay for my training data to have more features than observations? What's the appropriate machine learning model to use? And so on...
34 |
35 | In this tutorial, we'll answer all of those questions, and more! We'll start by walking through the vectorization process in order to understand the input and output formats. Then we'll read a simple dataset into pandas, and immediately apply what we've learned about vectorization. We'll move on to the model building process, including a discussion of which model is most appropriate for the task. We'll end by evaluating our model a few different ways.
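
If you'd like a preview, here is a minimal sketch of that workflow, using the same toy messages that appear in the notebook (variable names are illustrative, not code from the tutorial files):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

train_texts = ['call you tonight', 'Call me a cab', 'please call me... PLEASE!']
train_labels = [0, 0, 1]

# learn the vocabulary and build the (sparse) document-term matrix in one step
vect = CountVectorizer()
train_dtm = vect.fit_transform(train_texts)

# train a classifier on the document-term matrix
nb = MultinomialNB()
nb.fit(train_dtm, train_labels)

# new text must be transformed with the fitted vocabulary before predicting
test_dtm = vect.transform(["please don't call me"])
nb.predict(test_dtm)
```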
36 |
37 | ### Detailed Outline
38 |
39 | 1. Model building in scikit-learn (refresher)
40 | 2. Representing text as numerical data
41 | 3. Reading a text-based dataset into pandas
42 | 4. Vectorizing our dataset
43 | 5. Building and evaluating a model
44 | 6. Comparing models
45 |
46 | ### About the Instructor
47 |
48 | Kevin Markham is the founder of [Data School](http://www.dataschool.io/) and the former lead instructor for General Assembly's part-time Data Science course in Washington, DC. He is passionate about teaching data science to people who are new to the field, regardless of their educational and professional backgrounds, and he enjoys teaching both online and in the classroom. Kevin's professional focus is supervised machine learning, which led him to create the popular [scikit-learn video series](https://github.com/justmarkham/scikit-learn-videos) for Kaggle. He has a degree in Computer Engineering from Vanderbilt University.
49 |
50 | * Email: [kevin@dataschool.io](mailto:kevin@dataschool.io)
51 | * Twitter: [@justmarkham](https://twitter.com/justmarkham)
52 |
53 | ### Recommended Resources
54 |
55 | **scikit-learn:**
56 | * For a thorough introduction to machine learning with scikit-learn, I recommend watching my [scikit-learn video series](https://github.com/justmarkham/scikit-learn-videos) (9 videos, 4 hours).
57 | * If you prefer a written resource for learning scikit-learn, you may like the [tutorials](http://scikit-learn.org/stable/tutorial/index.html) from the scikit-learn documentation.
58 | * The scikit-learn user guide includes an excellent section on [text feature extraction](http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction) that includes many details not covered in today's tutorial.
59 | * The user guide also describes the [performance trade-offs](http://scikit-learn.org/stable/modules/computational_performance.html#influence-of-the-input-data-representation) involved when choosing between sparse and dense input data representations.
60 |
61 | **pandas:**
62 | * For a thorough introduction to data analysis, manipulation, and visualization with pandas, I recommend watching my [pandas Q&A video series](https://github.com/justmarkham/pandas-videos) (30 videos, 7 hours).
63 | * If you prefer a written resource for learning pandas, here are my [top 8 recommended resources](http://www.dataschool.io/best-python-pandas-resources/).
64 |
65 | **Text classification:**
66 | * Read Paul Graham's classic post, [A Plan for Spam](http://www.paulgraham.com/spam.html), for an overview of a basic text classification system using a Bayesian approach. (He also wrote a [follow-up post](http://www.paulgraham.com/better.html) about how he improved his spam filter.)
67 | * Coursera's Natural Language Processing (NLP) course has [video lectures](https://www.youtube.com/playlist?list=PL6397E4B26D00A269) on text classification, tokenization, Naive Bayes, and many other fundamental NLP topics. (Here are the [slides](http://web.stanford.edu/~jurafsky/NLPCourseraSlides.html) used in all of the videos.)
68 | * [Automatically Categorizing Yelp Businesses](http://engineeringblog.yelp.com/2015/09/automatically-categorizing-yelp-businesses.html) discusses how Yelp uses NLP and scikit-learn to solve the problem of uncategorized businesses.
69 | * [How to Read the Mind of a Supreme Court Justice](http://fivethirtyeight.com/features/how-to-read-the-mind-of-a-supreme-court-justice/) discusses CourtCast, a machine learning model that predicts the outcome of Supreme Court cases using text-based features only. (The CourtCast creator wrote a post explaining [how it works](https://sciencecowboy.wordpress.com/2015/03/05/predicting-the-supreme-court-from-oral-arguments/), and the [Python code](https://github.com/nasrallah/CourtCast) is available on GitHub.)
70 | * [Identifying Humorous Cartoon Captions](http://www.cs.huji.ac.il/~dshahaf/pHumor.pdf) is a readable paper about identifying funny captions submitted to the New Yorker Caption Contest.
71 | * In this [PyData video](https://www.youtube.com/watch?v=y3ZTKFZ-1QQ) (50 minutes), Facebook explains how they use scikit-learn for sentiment classification by training a Naive Bayes model on emoji-labeled data.
72 |
73 | **Naive Bayes and logistic regression:**
74 | * Read this brief Quora post on [airport security](http://www.quora.com/In-laymans-terms-how-does-Naive-Bayes-work/answer/Konstantin-Tt) for an intuitive explanation of how Naive Bayes classification works.
75 | * For a longer introduction to Naive Bayes, read Sebastian Raschka's article on [Naive Bayes and Text Classification](http://sebastianraschka.com/Articles/2014_naive_bayes_1.html). As well, Wikipedia has two excellent articles ([Naive Bayes classifier](http://en.wikipedia.org/wiki/Naive_Bayes_classifier) and [Naive Bayes spam filtering](http://en.wikipedia.org/wiki/Naive_Bayes_spam_filtering)), and Cross Validated has a good [Q&A](http://stats.stackexchange.com/questions/21822/understanding-naive-bayes).
76 | * My [guide to an in-depth understanding of logistic regression](http://www.dataschool.io/guide-to-logistic-regression/) includes a lesson notebook and a curated list of resources for going deeper into this topic.
77 | * [Comparison of Machine Learning Models](https://github.com/justmarkham/DAT8/blob/master/other/model_comparison.md) lists the advantages and disadvantages of Naive Bayes, logistic regression, and other classification and regression models.
78 |
--------------------------------------------------------------------------------
/tutorial.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# Tutorial: Machine Learning with Text in scikit-learn"
8 | ]
9 | },
10 | {
11 | "cell_type": "markdown",
12 | "metadata": {},
13 | "source": [
14 | "## Agenda\n",
15 | "\n",
16 | "1. Model building in scikit-learn (refresher)\n",
17 | "2. Representing text as numerical data\n",
18 | "3. Reading a text-based dataset into pandas\n",
19 | "4. Vectorizing our dataset\n",
20 | "5. Building and evaluating a model\n",
21 | "6. Comparing models"
22 | ]
23 | },
24 | {
25 | "cell_type": "code",
26 | "execution_count": null,
27 | "metadata": {
28 | "collapsed": false
29 | },
30 | "outputs": [],
31 | "source": [
32 | "# for Python 2: use print only as a function\n",
33 | "from __future__ import print_function"
34 | ]
35 | },
36 | {
37 | "cell_type": "markdown",
38 | "metadata": {},
39 | "source": [
40 | "## Part 1: Model building in scikit-learn (refresher)"
41 | ]
42 | },
43 | {
44 | "cell_type": "code",
45 | "execution_count": null,
46 | "metadata": {
47 | "collapsed": true
48 | },
49 | "outputs": [],
50 | "source": [
51 | "# load the iris dataset as an example\n",
52 | "from sklearn.datasets import load_iris\n",
53 | "iris = load_iris()"
54 | ]
55 | },
56 | {
57 | "cell_type": "code",
58 | "execution_count": null,
59 | "metadata": {
60 | "collapsed": true
61 | },
62 | "outputs": [],
63 | "source": [
64 | "# store the feature matrix (X) and response vector (y)\n",
65 | "X = iris.data\n",
66 | "y = iris.target"
67 | ]
68 | },
69 | {
70 | "cell_type": "markdown",
71 | "metadata": {},
72 | "source": [
73 | "**\"Features\"** are also known as predictors, inputs, or attributes. The **\"response\"** is also known as the target, label, or output."
74 | ]
75 | },
76 | {
77 | "cell_type": "code",
78 | "execution_count": null,
79 | "metadata": {
80 | "collapsed": false
81 | },
82 | "outputs": [],
83 | "source": [
84 | "# check the shapes of X and y\n",
85 | "print(X.shape)\n",
86 | "print(y.shape)"
87 | ]
88 | },
89 | {
90 | "cell_type": "markdown",
91 | "metadata": {},
92 | "source": [
93 | "**\"Observations\"** are also known as samples, instances, or records."
94 | ]
95 | },
96 | {
97 | "cell_type": "code",
98 | "execution_count": null,
99 | "metadata": {
100 | "collapsed": false
101 | },
102 | "outputs": [],
103 | "source": [
104 | "# examine the first 5 rows of the feature matrix (including the feature names)\n",
105 | "import pandas as pd\n",
106 | "pd.DataFrame(X, columns=iris.feature_names).head()"
107 | ]
108 | },
109 | {
110 | "cell_type": "code",
111 | "execution_count": null,
112 | "metadata": {
113 | "collapsed": false
114 | },
115 | "outputs": [],
116 | "source": [
117 | "# examine the response vector\n",
118 | "print(y)"
119 | ]
120 | },
121 | {
122 | "cell_type": "markdown",
123 | "metadata": {},
124 | "source": [
125 | "In order to **build a model**, the features must be **numeric**, and every observation must have the **same features in the same order**."
126 | ]
127 | },
128 | {
129 | "cell_type": "code",
130 | "execution_count": null,
131 | "metadata": {
132 | "collapsed": false
133 | },
134 | "outputs": [],
135 | "source": [
136 | "# import the class\n",
137 | "from sklearn.neighbors import KNeighborsClassifier\n",
138 | "\n",
139 | "# instantiate the model (with the default parameters)\n",
140 | "knn = KNeighborsClassifier()\n",
141 | "\n",
142 | "# fit the model with data (occurs in-place)\n",
143 | "knn.fit(X, y)"
144 | ]
145 | },
146 | {
147 | "cell_type": "markdown",
148 | "metadata": {},
149 | "source": [
150 | "In order to **make a prediction**, the new observation must have the **same features as the training observations**, both in number and meaning."
151 | ]
152 | },
153 | {
154 | "cell_type": "code",
155 | "execution_count": null,
156 | "metadata": {
157 | "collapsed": false
158 | },
159 | "outputs": [],
160 | "source": [
161 | "# predict the response for a new observation\n",
162 | "knn.predict([[3, 5, 4, 2]])"
163 | ]
164 | },
165 | {
166 | "cell_type": "markdown",
167 | "metadata": {},
168 | "source": [
169 | "## Part 2: Representing text as numerical data"
170 | ]
171 | },
172 | {
173 | "cell_type": "code",
174 | "execution_count": null,
175 | "metadata": {
176 | "collapsed": true
177 | },
178 | "outputs": [],
179 | "source": [
180 | "# example text for model training (SMS messages)\n",
181 | "simple_train = ['call you tonight', 'Call me a cab', 'please call me... PLEASE!']"
182 | ]
183 | },
184 | {
185 | "cell_type": "code",
186 | "execution_count": null,
187 | "metadata": {
188 | "collapsed": true
189 | },
190 | "outputs": [],
191 | "source": [
192 | "# example response vector\n",
193 | "is_desperate = [0, 0, 1]"
194 | ]
195 | },
196 | {
197 | "cell_type": "markdown",
198 | "metadata": {},
199 | "source": [
200 | "From the [scikit-learn documentation](http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction):\n",
201 | "\n",
202 | "> Text Analysis is a major application field for machine learning algorithms. However the raw data, a sequence of symbols cannot be fed directly to the algorithms themselves as most of them expect **numerical feature vectors with a fixed size** rather than the **raw text documents with variable length**.\n",
203 | "\n",
204 | "We will use [CountVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) to \"convert text into a matrix of token counts\":"
205 | ]
206 | },
207 | {
208 | "cell_type": "code",
209 | "execution_count": null,
210 | "metadata": {
211 | "collapsed": true
212 | },
213 | "outputs": [],
214 | "source": [
215 | "# import and instantiate CountVectorizer (with the default parameters)\n",
216 | "from sklearn.feature_extraction.text import CountVectorizer\n",
217 | "vect = CountVectorizer()"
218 | ]
219 | },
220 | {
221 | "cell_type": "code",
222 | "execution_count": null,
223 | "metadata": {
224 | "collapsed": false
225 | },
226 | "outputs": [],
227 | "source": [
228 | "# learn the 'vocabulary' of the training data (occurs in-place)\n",
229 | "vect.fit(simple_train)"
230 | ]
231 | },
232 | {
233 | "cell_type": "code",
234 | "execution_count": null,
235 | "metadata": {
236 | "collapsed": false
237 | },
238 | "outputs": [],
239 | "source": [
240 | "# examine the fitted vocabulary\n",
241 | "vect.get_feature_names()"
242 | ]
243 | },
244 | {
245 | "cell_type": "code",
246 | "execution_count": null,
247 | "metadata": {
248 | "collapsed": false
249 | },
250 | "outputs": [],
251 | "source": [
252 | "# transform training data into a 'document-term matrix'\n",
253 | "simple_train_dtm = vect.transform(simple_train)\n",
254 | "simple_train_dtm"
255 | ]
256 | },
257 | {
258 | "cell_type": "code",
259 | "execution_count": null,
260 | "metadata": {
261 | "collapsed": false
262 | },
263 | "outputs": [],
264 | "source": [
265 | "# convert sparse matrix to a dense matrix\n",
266 | "simple_train_dtm.toarray()"
267 | ]
268 | },
269 | {
270 | "cell_type": "code",
271 | "execution_count": null,
272 | "metadata": {
273 | "collapsed": false
274 | },
275 | "outputs": [],
276 | "source": [
277 | "# examine the vocabulary and document-term matrix together\n",
278 | "pd.DataFrame(simple_train_dtm.toarray(), columns=vect.get_feature_names())"
279 | ]
280 | },
281 | {
282 | "cell_type": "markdown",
283 | "metadata": {},
284 | "source": [
285 | "From the [scikit-learn documentation](http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction):\n",
286 | "\n",
287 | "> In this scheme, features and samples are defined as follows:\n",
288 | "\n",
289 | "> - Each individual token occurrence frequency (normalized or not) is treated as a **feature**.\n",
290 | "> - The vector of all the token frequencies for a given document is considered a multivariate **sample**.\n",
291 | "\n",
292 | "> A **corpus of documents** can thus be represented by a matrix with **one row per document** and **one column per token** (e.g. word) occurring in the corpus.\n",
293 | "\n",
294 | "> We call **vectorization** the general process of turning a collection of text documents into numerical feature vectors. This specific strategy (tokenization, counting and normalization) is called the **Bag of Words** or \"Bag of n-grams\" representation. Documents are described by word occurrences while completely ignoring the relative position information of the words in the document."
295 | ]
296 | },
297 | {
298 | "cell_type": "code",
299 | "execution_count": null,
300 | "metadata": {
301 | "collapsed": false
302 | },
303 | "outputs": [],
304 | "source": [
305 | "# check the type of the document-term matrix\n",
306 | "type(simple_train_dtm)"
307 | ]
308 | },
309 | {
310 | "cell_type": "code",
311 | "execution_count": null,
312 | "metadata": {
313 | "collapsed": false,
314 | "scrolled": true
315 | },
316 | "outputs": [],
317 | "source": [
318 | "# examine the sparse matrix contents\n",
319 | "print(simple_train_dtm)"
320 | ]
321 | },
322 | {
323 | "cell_type": "markdown",
324 | "metadata": {},
325 | "source": [
326 | "From the [scikit-learn documentation](http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction):\n",
327 | "\n",
328 | "> As most documents will typically use a very small subset of the words used in the corpus, the resulting matrix will have **many feature values that are zeros** (typically more than 99% of them).\n",
329 | "\n",
330 | "> For instance, a collection of 10,000 short text documents (such as emails) will use a vocabulary with a size in the order of 100,000 unique words in total while each document will use 100 to 1000 unique words individually.\n",
331 | "\n",
332 | "> In order to be able to **store such a matrix in memory** but also to **speed up operations**, implementations will typically use a **sparse representation** such as the implementations available in the `scipy.sparse` package."
333 | ]
334 | },
335 | {
336 | "cell_type": "code",
337 | "execution_count": null,
338 | "metadata": {
339 | "collapsed": false
340 | },
341 | "outputs": [],
342 | "source": [
343 | "# build a model to predict desperation\n",
344 | "knn = KNeighborsClassifier(n_neighbors=1)\n",
345 | "knn.fit(simple_train_dtm, is_desperate)"
346 | ]
347 | },
348 | {
349 | "cell_type": "code",
350 | "execution_count": null,
351 | "metadata": {
352 | "collapsed": true
353 | },
354 | "outputs": [],
355 | "source": [
356 | "# example text for model testing\n",
357 | "simple_test = [\"please don't call me\"]"
358 | ]
359 | },
360 | {
361 | "cell_type": "markdown",
362 | "metadata": {},
363 | "source": [
364 | "In order to **make a prediction**, the new observation must have the **same features as the training observations**, both in number and meaning."
365 | ]
366 | },
367 | {
368 | "cell_type": "code",
369 | "execution_count": null,
370 | "metadata": {
371 | "collapsed": false
372 | },
373 | "outputs": [],
374 | "source": [
375 | "# transform testing data into a document-term matrix (using existing vocabulary)\n",
376 | "simple_test_dtm = vect.transform(simple_test)\n",
377 | "simple_test_dtm.toarray()"
378 | ]
379 | },
380 | {
381 | "cell_type": "code",
382 | "execution_count": null,
383 | "metadata": {
384 | "collapsed": false
385 | },
386 | "outputs": [],
387 | "source": [
388 | "# examine the vocabulary and document-term matrix together\n",
389 | "pd.DataFrame(simple_test_dtm.toarray(), columns=vect.get_feature_names())"
390 | ]
391 | },
392 | {
393 | "cell_type": "code",
394 | "execution_count": null,
395 | "metadata": {
396 | "collapsed": false
397 | },
398 | "outputs": [],
399 | "source": [
400 | "# predict whether simple_test is desperate\n",
401 | "knn.predict(simple_test_dtm)"
402 | ]
403 | },
404 | {
405 | "cell_type": "markdown",
406 | "metadata": {},
407 | "source": [
408 | "**Summary:**\n",
409 | "\n",
410 | "- `vect.fit(train)` **learns the vocabulary** of the training data\n",
411 | "- `vect.transform(train)` uses the **fitted vocabulary** to build a document-term matrix from the training data\n",
412 | "- `vect.transform(test)` uses the **fitted vocabulary** to build a document-term matrix from the testing data (and **ignores tokens** it hasn't seen before)"
413 | ]
414 | },
415 | {
416 | "cell_type": "markdown",
417 | "metadata": {},
418 | "source": [
419 | "## Part 3: Reading a text-based dataset into pandas"
420 | ]
421 | },
422 | {
423 | "cell_type": "code",
424 | "execution_count": null,
425 | "metadata": {
426 | "collapsed": true
427 | },
428 | "outputs": [],
429 | "source": [
430 | "# read file into pandas from the working directory\n",
431 | "sms = pd.read_table('sms.tsv', header=None, names=['label', 'message'])"
432 | ]
433 | },
434 | {
435 | "cell_type": "code",
436 | "execution_count": null,
437 | "metadata": {
438 | "collapsed": false
439 | },
440 | "outputs": [],
441 | "source": [
442 | "# alternative: read file into pandas from a URL\n",
443 | "# url = 'https://raw.githubusercontent.com/justmarkham/pydata-dc-2016-tutorial/master/sms.tsv'\n",
444 | "# sms = pd.read_table(url, header=None, names=['label', 'message'])"
445 | ]
446 | },
447 | {
448 | "cell_type": "code",
449 | "execution_count": null,
450 | "metadata": {
451 | "collapsed": false
452 | },
453 | "outputs": [],
454 | "source": [
455 | "# examine the shape\n",
456 | "sms.shape"
457 | ]
458 | },
459 | {
460 | "cell_type": "code",
461 | "execution_count": null,
462 | "metadata": {
463 | "collapsed": false
464 | },
465 | "outputs": [],
466 | "source": [
467 | "# examine the first 10 rows\n",
468 | "sms.head(10)"
469 | ]
470 | },
471 | {
472 | "cell_type": "code",
473 | "execution_count": null,
474 | "metadata": {
475 | "collapsed": false
476 | },
477 | "outputs": [],
478 | "source": [
479 | "# examine the class distribution\n",
480 | "sms.label.value_counts()"
481 | ]
482 | },
483 | {
484 | "cell_type": "code",
485 | "execution_count": null,
486 | "metadata": {
487 | "collapsed": true
488 | },
489 | "outputs": [],
490 | "source": [
491 | "# convert label to a numerical variable\n",
492 | "sms['label_num'] = sms.label.map({'ham':0, 'spam':1})"
493 | ]
494 | },
495 | {
496 | "cell_type": "code",
497 | "execution_count": null,
498 | "metadata": {
499 | "collapsed": false
500 | },
501 | "outputs": [],
502 | "source": [
503 | "# check that the conversion worked\n",
504 | "sms.head(10)"
505 | ]
506 | },
507 | {
508 | "cell_type": "code",
509 | "execution_count": null,
510 | "metadata": {
511 | "collapsed": false
512 | },
513 | "outputs": [],
514 | "source": [
515 | "# how to define X and y (from the iris data) for use with a MODEL\n",
516 | "X = iris.data\n",
517 | "y = iris.target\n",
518 | "print(X.shape)\n",
519 | "print(y.shape)"
520 | ]
521 | },
522 | {
523 | "cell_type": "code",
524 | "execution_count": null,
525 | "metadata": {
526 | "collapsed": false
527 | },
528 | "outputs": [],
529 | "source": [
530 | "# how to define X and y (from the SMS data) for use with COUNTVECTORIZER\n",
531 | "X = sms.message\n",
532 | "y = sms.label_num\n",
533 | "print(X.shape)\n",
534 | "print(y.shape)"
535 | ]
536 | },
537 | {
538 | "cell_type": "code",
539 | "execution_count": null,
540 | "metadata": {
541 | "collapsed": false
542 | },
543 | "outputs": [],
544 | "source": [
545 | "# split X and y into training and testing sets\n",
546 | "from sklearn.cross_validation import train_test_split\n",
547 | "X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)\n",
548 | "print(X_train.shape)\n",
549 | "print(X_test.shape)\n",
550 | "print(y_train.shape)\n",
551 | "print(y_test.shape)"
552 | ]
553 | },
554 | {
555 | "cell_type": "markdown",
556 | "metadata": {},
557 | "source": [
558 | "## Part 4: Vectorizing our dataset"
559 | ]
560 | },
561 | {
562 | "cell_type": "code",
563 | "execution_count": null,
564 | "metadata": {
565 | "collapsed": true
566 | },
567 | "outputs": [],
568 | "source": [
569 | "# instantiate the vectorizer\n",
570 | "vect = CountVectorizer()"
571 | ]
572 | },
573 | {
574 | "cell_type": "code",
575 | "execution_count": null,
576 | "metadata": {
577 | "collapsed": true
578 | },
579 | "outputs": [],
580 | "source": [
581 | "# learn training data vocabulary, then use it to create a document-term matrix\n",
582 | "vect.fit(X_train)\n",
583 | "X_train_dtm = vect.transform(X_train)"
584 | ]
585 | },
586 | {
587 | "cell_type": "code",
588 | "execution_count": null,
589 | "metadata": {
590 | "collapsed": true
591 | },
592 | "outputs": [],
593 | "source": [
594 | "# equivalently: combine fit and transform into a single step\n",
595 | "X_train_dtm = vect.fit_transform(X_train)"
596 | ]
597 | },
598 | {
599 | "cell_type": "code",
600 | "execution_count": null,
601 | "metadata": {
602 | "collapsed": false
603 | },
604 | "outputs": [],
605 | "source": [
606 | "# examine the document-term matrix\n",
607 | "X_train_dtm"
608 | ]
609 | },
610 | {
611 | "cell_type": "code",
612 | "execution_count": null,
613 | "metadata": {
614 | "collapsed": false
615 | },
616 | "outputs": [],
617 | "source": [
618 | "# transform testing data (using fitted vocabulary) into a document-term matrix\n",
619 | "X_test_dtm = vect.transform(X_test)\n",
620 | "X_test_dtm"
621 | ]
622 | },
623 | {
624 | "cell_type": "markdown",
625 | "metadata": {},
626 | "source": [
627 | "## Part 5: Building and evaluating a model\n",
628 | "\n",
629 | "We will use [multinomial Naive Bayes](http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html):\n",
630 | "\n",
631 | "> The multinomial Naive Bayes classifier is suitable for classification with **discrete features** (e.g., word counts for text classification). The multinomial distribution normally requires integer feature counts. However, in practice, fractional counts such as tf-idf may also work."
632 | ]
633 | },
634 | {
635 | "cell_type": "code",
636 | "execution_count": null,
637 | "metadata": {
638 | "collapsed": true
639 | },
640 | "outputs": [],
641 | "source": [
642 | "# import and instantiate a Multinomial Naive Bayes model\n",
643 | "from sklearn.naive_bayes import MultinomialNB\n",
644 | "nb = MultinomialNB()"
645 | ]
646 | },
647 | {
648 | "cell_type": "code",
649 | "execution_count": null,
650 | "metadata": {
651 | "collapsed": false
652 | },
653 | "outputs": [],
654 | "source": [
655 | "# train the model using X_train_dtm (timing it with an IPython \"magic command\")\n",
656 | "%time nb.fit(X_train_dtm, y_train)"
657 | ]
658 | },
659 | {
660 | "cell_type": "code",
661 | "execution_count": null,
662 | "metadata": {
663 | "collapsed": true
664 | },
665 | "outputs": [],
666 | "source": [
667 | "# make class predictions for X_test_dtm\n",
668 | "y_pred_class = nb.predict(X_test_dtm)"
669 | ]
670 | },
671 | {
672 | "cell_type": "code",
673 | "execution_count": null,
674 | "metadata": {
675 | "collapsed": false
676 | },
677 | "outputs": [],
678 | "source": [
679 | "# calculate accuracy of class predictions\n",
680 | "from sklearn import metrics\n",
681 | "metrics.accuracy_score(y_test, y_pred_class)"
682 | ]
683 | },
684 | {
685 | "cell_type": "code",
686 | "execution_count": null,
687 | "metadata": {
688 | "collapsed": false
689 | },
690 | "outputs": [],
691 | "source": [
692 | "# print the confusion matrix\n",
693 | "metrics.confusion_matrix(y_test, y_pred_class)"
694 | ]
695 | },
696 | {
697 | "cell_type": "code",
698 | "execution_count": null,
699 | "metadata": {
700 | "collapsed": false
701 | },
702 | "outputs": [],
703 | "source": [
704 | "# print message text for the false positives (ham incorrectly classified as spam)\n"
705 | ]
706 | },
707 | {
708 | "cell_type": "code",
709 | "execution_count": null,
710 | "metadata": {
711 | "collapsed": false,
712 | "scrolled": true
713 | },
714 | "outputs": [],
715 | "source": [
716 | "# print message text for the false negatives (spam incorrectly classified as ham)\n"
717 | ]
718 | },
719 | {
720 | "cell_type": "code",
721 | "execution_count": null,
722 | "metadata": {
723 | "collapsed": false,
724 | "scrolled": true
725 | },
726 | "outputs": [],
727 | "source": [
728 | "# example false negative\n",
729 | "X_test[3132]"
730 | ]
731 | },
732 | {
733 | "cell_type": "code",
734 | "execution_count": null,
735 | "metadata": {
736 | "collapsed": false
737 | },
738 | "outputs": [],
739 | "source": [
740 | "# calculate predicted probabilities for X_test_dtm (poorly calibrated)\n",
741 | "y_pred_prob = nb.predict_proba(X_test_dtm)[:, 1]\n",
742 | "y_pred_prob"
743 | ]
744 | },
745 | {
746 | "cell_type": "code",
747 | "execution_count": null,
748 | "metadata": {
749 | "collapsed": false
750 | },
751 | "outputs": [],
752 | "source": [
753 | "# calculate AUC\n",
754 | "metrics.roc_auc_score(y_test, y_pred_prob)"
755 | ]
756 | },
757 | {
758 | "cell_type": "markdown",
759 | "metadata": {},
760 | "source": [
761 | "## Part 6: Comparing models\n",
762 | "\n",
763 | "We will compare multinomial Naive Bayes with [logistic regression](http://scikit-learn.org/stable/modules/linear_model.html#logistic-regression):\n",
764 | "\n",
765 | "> Logistic regression, despite its name, is a **linear model for classification** rather than regression. Logistic regression is also known in the literature as logit regression, maximum-entropy classification (MaxEnt) or the log-linear classifier. In this model, the probabilities describing the possible outcomes of a single trial are modeled using a logistic function."
766 | ]
767 | },
768 | {
769 | "cell_type": "code",
770 | "execution_count": null,
771 | "metadata": {
772 | "collapsed": true
773 | },
774 | "outputs": [],
775 | "source": [
776 | "# import and instantiate a logistic regression model\n",
777 | "from sklearn.linear_model import LogisticRegression\n",
778 | "logreg = LogisticRegression()"
779 | ]
780 | },
781 | {
782 | "cell_type": "code",
783 | "execution_count": null,
784 | "metadata": {
785 | "collapsed": false
786 | },
787 | "outputs": [],
788 | "source": [
789 | "# train the model using X_train_dtm\n",
790 | "%time logreg.fit(X_train_dtm, y_train)"
791 | ]
792 | },
793 | {
794 | "cell_type": "code",
795 | "execution_count": null,
796 | "metadata": {
797 | "collapsed": true
798 | },
799 | "outputs": [],
800 | "source": [
801 | "# make class predictions for X_test_dtm\n",
802 | "y_pred_class = logreg.predict(X_test_dtm)"
803 | ]
804 | },
805 | {
806 | "cell_type": "code",
807 | "execution_count": null,
808 | "metadata": {
809 | "collapsed": false
810 | },
811 | "outputs": [],
812 | "source": [
813 | "# calculate predicted probabilities for X_test_dtm (well calibrated)\n",
814 | "y_pred_prob = logreg.predict_proba(X_test_dtm)[:, 1]\n",
815 | "y_pred_prob"
816 | ]
817 | },
818 | {
819 | "cell_type": "code",
820 | "execution_count": null,
821 | "metadata": {
822 | "collapsed": false
823 | },
824 | "outputs": [],
825 | "source": [
826 | "# calculate accuracy\n",
827 | "metrics.accuracy_score(y_test, y_pred_class)"
828 | ]
829 | },
830 | {
831 | "cell_type": "code",
832 | "execution_count": null,
833 | "metadata": {
834 | "collapsed": false
835 | },
836 | "outputs": [],
837 | "source": [
838 | "# calculate AUC\n",
839 | "metrics.roc_auc_score(y_test, y_pred_prob)"
840 | ]
841 | }
842 | ],
843 | "metadata": {
844 | "kernelspec": {
845 | "display_name": "Python 2",
846 | "language": "python",
847 | "name": "python2"
848 | },
849 | "language_info": {
850 | "codemirror_mode": {
851 | "name": "ipython",
852 | "version": 2
853 | },
854 | "file_extension": ".py",
855 | "mimetype": "text/x-python",
856 | "name": "python",
857 | "nbconvert_exporter": "python",
858 | "pygments_lexer": "ipython2",
859 | "version": "2.7.11"
860 | }
861 | },
862 | "nbformat": 4,
863 | "nbformat_minor": 0
864 | }
865 |
--------------------------------------------------------------------------------
/tutorial.py:
--------------------------------------------------------------------------------
1 | # # Tutorial: Machine Learning with Text in scikit-learn
2 |
3 | # ## Agenda
4 | #
5 | # 1. Model building in scikit-learn (refresher)
6 | # 2. Representing text as numerical data
7 | # 3. Reading a text-based dataset into pandas
8 | # 4. Vectorizing our dataset
9 | # 5. Building and evaluating a model
10 | # 6. Comparing models
11 |
12 | # for Python 2: use print only as a function
13 | from __future__ import print_function
14 |
15 |
16 | # ## Part 1: Model building in scikit-learn (refresher)
17 |
18 | # load the iris dataset as an example
19 | from sklearn.datasets import load_iris
20 | iris = load_iris()
21 |
22 |
23 | # store the feature matrix (X) and response vector (y)
24 | X = iris.data
25 | y = iris.target
26 |
27 |
28 | # **"Features"** are also known as predictors, inputs, or attributes. The **"response"** is also known as the target, label, or output.
29 |
30 | # check the shapes of X and y
31 | print(X.shape)
32 | print(y.shape)
33 |
34 |
35 | # **"Observations"** are also known as samples, instances, or records.
36 |
37 | # examine the first 5 rows of the feature matrix (including the feature names)
38 | import pandas as pd
39 | pd.DataFrame(X, columns=iris.feature_names).head()
40 |
41 |
42 | # examine the response vector
43 | print(y)
44 |
45 |
46 | # In order to **build a model**, the features must be **numeric**, and every observation must have the **same features in the same order**.
47 |
48 | # import the class
49 | from sklearn.neighbors import KNeighborsClassifier
50 |
51 | # instantiate the model (with the default parameters)
52 | knn = KNeighborsClassifier()
53 |
54 | # fit the model with data (occurs in-place)
55 | knn.fit(X, y)
56 |
57 |
58 | # In order to **make a prediction**, the new observation must have the **same features as the training observations**, both in number and meaning.
59 |
60 | # predict the response for a new observation
61 | knn.predict([[3, 5, 4, 2]])
62 |
63 |
64 | # ## Part 2: Representing text as numerical data
65 |
66 | # example text for model training (SMS messages)
67 | simple_train = ['call you tonight', 'Call me a cab', 'please call me... PLEASE!']
68 |
69 |
70 | # example response vector
71 | is_desperate = [0, 0, 1]
72 |
73 |
74 | # From the [scikit-learn documentation](http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction):
75 | #
76 | # > Text Analysis is a major application field for machine learning algorithms. However the raw data, a sequence of symbols cannot be fed directly to the algorithms themselves as most of them expect **numerical feature vectors with a fixed size** rather than the **raw text documents with variable length**.
77 | #
78 | # We will use [CountVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) to "convert text into a matrix of token counts":
79 |
80 | # import and instantiate CountVectorizer (with the default parameters)
81 | from sklearn.feature_extraction.text import CountVectorizer
82 | vect = CountVectorizer()
83 |
84 |
85 | # learn the 'vocabulary' of the training data (occurs in-place)
86 | vect.fit(simple_train)
87 |
88 |
89 | # examine the fitted vocabulary
90 | vect.get_feature_names()
91 |
92 |
93 | # transform training data into a 'document-term matrix'
94 | simple_train_dtm = vect.transform(simple_train)
95 | simple_train_dtm
96 |
97 |
98 | # convert sparse matrix to a dense matrix
99 | simple_train_dtm.toarray()
100 |
101 |
102 | # examine the vocabulary and document-term matrix together
103 | pd.DataFrame(simple_train_dtm.toarray(), columns=vect.get_feature_names())
104 |
105 |
106 | # From the [scikit-learn documentation](http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction):
107 | #
108 | # > In this scheme, features and samples are defined as follows:
109 | #
110 | # > - Each individual token occurrence frequency (normalized or not) is treated as a **feature**.
111 | # > - The vector of all the token frequencies for a given document is considered a multivariate **sample**.
112 | #
113 | # > A **corpus of documents** can thus be represented by a matrix with **one row per document** and **one column per token** (e.g. word) occurring in the corpus.
114 | #
115 | # > We call **vectorization** the general process of turning a collection of text documents into numerical feature vectors. This specific strategy (tokenization, counting and normalization) is called the **Bag of Words** or "Bag of n-grams" representation. Documents are described by word occurrences while completely ignoring the relative position information of the words in the document.
116 |
117 | # check the type of the document-term matrix
118 | type(simple_train_dtm)
119 |
120 |
121 | # examine the sparse matrix contents
122 | print(simple_train_dtm)
123 |
124 |
125 | # From the [scikit-learn documentation](http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction):
126 | #
127 | # > As most documents will typically use a very small subset of the words used in the corpus, the resulting matrix will have **many feature values that are zeros** (typically more than 99% of them).
128 | #
129 | # > For instance, a collection of 10,000 short text documents (such as emails) will use a vocabulary with a size in the order of 100,000 unique words in total while each document will use 100 to 1000 unique words individually.
130 | #
131 | # > In order to be able to **store such a matrix in memory** but also to **speed up operations**, implementations will typically use a **sparse representation** such as the implementations available in the `scipy.sparse` package.
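
# aside (not part of the original tutorial): the fraction of non-zero entries makes
# the sparsity concrete; a real corpus would be far sparser than this toy example
density = simple_train_dtm.nnz / float(simple_train_dtm.shape[0] * simple_train_dtm.shape[1])
print(density)
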
132 |
133 | # build a model to predict desperation
134 | knn = KNeighborsClassifier(n_neighbors=1)
135 | knn.fit(simple_train_dtm, is_desperate)
136 |
137 |
138 | # example text for model testing
139 | simple_test = ["please don't call me"]
140 |
141 |
142 | # In order to **make a prediction**, the new observation must have the **same features as the training observations**, both in number and meaning.
143 |
144 | # transform testing data into a document-term matrix (using existing vocabulary)
145 | simple_test_dtm = vect.transform(simple_test)
146 | simple_test_dtm.toarray()
147 |
148 |
149 | # examine the vocabulary and document-term matrix together
150 | pd.DataFrame(simple_test_dtm.toarray(), columns=vect.get_feature_names())
151 |
152 |
153 | # predict whether simple_test is desperate
154 | knn.predict(simple_test_dtm)
155 |
156 |
157 | # **Summary:**
158 | #
159 | # - `vect.fit(train)` **learns the vocabulary** of the training data
160 | # - `vect.transform(train)` uses the **fitted vocabulary** to build a document-term matrix from the training data
161 | # - `vect.transform(test)` uses the **fitted vocabulary** to build a document-term matrix from the testing data (and **ignores tokens** it hasn't seen before)
162 |
163 | # ## Part 3: Reading a text-based dataset into pandas
164 |
165 | # read file into pandas from the working directory
166 | sms = pd.read_table('sms.tsv', header=None, names=['label', 'message'])
167 |
168 |
169 | # alternative: read file into pandas from a URL
170 | # url = 'https://raw.githubusercontent.com/justmarkham/pydata-dc-2016-tutorial/master/sms.tsv'
171 | # sms = pd.read_table(url, header=None, names=['label', 'message'])
172 |
173 |
174 | # examine the shape
175 | sms.shape
176 |
177 |
178 | # examine the first 10 rows
179 | sms.head(10)
180 |
181 |
182 | # examine the class distribution
183 | sms.label.value_counts()
184 |
185 |
186 | # convert label to a numerical variable
187 | sms['label_num'] = sms.label.map({'ham':0, 'spam':1})
188 |
189 |
190 | # check that the conversion worked
191 | sms.head(10)
192 |
193 |
194 | # how to define X and y (from the iris data) for use with a MODEL
195 | X = iris.data
196 | y = iris.target
197 | print(X.shape)
198 | print(y.shape)
199 |
200 |
201 | # how to define X and y (from the SMS data) for use with COUNTVECTORIZER
202 | X = sms.message
203 | y = sms.label_num
204 | print(X.shape)
205 | print(y.shape)
206 |
207 |
208 | # split X and y into training and testing sets
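# note: in scikit-learn 0.18+, train_test_split moved to sklearn.model_selection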
209 | from sklearn.cross_validation import train_test_split
210 | X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
211 | print(X_train.shape)
212 | print(X_test.shape)
213 | print(y_train.shape)
214 | print(y_test.shape)
215 |
216 |
217 | # ## Part 4: Vectorizing our dataset
218 |
219 | # instantiate the vectorizer
220 | vect = CountVectorizer()
221 |
222 |
223 | # learn training data vocabulary, then use it to create a document-term matrix
224 | vect.fit(X_train)
225 | X_train_dtm = vect.transform(X_train)
226 |
227 |
228 | # equivalently: combine fit and transform into a single step
229 | X_train_dtm = vect.fit_transform(X_train)
230 |
231 |
232 | # examine the document-term matrix
233 | X_train_dtm
234 |
235 |
236 | # transform testing data (using fitted vocabulary) into a document-term matrix
237 | X_test_dtm = vect.transform(X_test)
238 | X_test_dtm
239 |
240 |
241 | # ## Part 5: Building and evaluating a model
242 | #
243 | # We will use [multinomial Naive Bayes](http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html):
244 | #
245 | # > The multinomial Naive Bayes classifier is suitable for classification with **discrete features** (e.g., word counts for text classification). The multinomial distribution normally requires integer feature counts. However, in practice, fractional counts such as tf-idf may also work.
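
# aside (not in the original tutorial): the quote above notes that fractional counts
# such as tf-idf may also work; TfidfVectorizer is a drop-in alternative to
# CountVectorizer, sketched here for illustration only (the rest of the tutorial
# keeps the count matrix built in Part 4)
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_vect = TfidfVectorizer()
X_train_tfidf = tfidf_vect.fit_transform(X_train)
X_train_tfidf
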
246 |
247 | # import and instantiate a Multinomial Naive Bayes model
248 | from sklearn.naive_bayes import MultinomialNB
249 | nb = MultinomialNB()
250 |
251 |
252 | # train the model using X_train_dtm
253 | nb.fit(X_train_dtm, y_train)
254 |
255 |
256 | # make class predictions for X_test_dtm
257 | y_pred_class = nb.predict(X_test_dtm)
258 |
259 |
260 | # calculate accuracy of class predictions
261 | from sklearn import metrics
262 | metrics.accuracy_score(y_test, y_pred_class)
263 |
264 |
265 | # print the confusion matrix
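# (rows are actual classes and columns are predicted classes, so for this binary
# problem the layout is [[true negatives, false positives], [false negatives, true positives]])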
266 | metrics.confusion_matrix(y_test, y_pred_class)
267 |
268 |
269 | # print message text for the false positives (ham incorrectly classified as spam)
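# one possible approach (this is left blank in the original as an exercise):
X_test[(y_test == 0) & (y_pred_class == 1)]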
270 |
271 |
272 | # print message text for the false negatives (spam incorrectly classified as ham)
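# one possible approach, as above:
X_test[(y_test == 1) & (y_pred_class == 0)]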
273 |
274 |
275 | # example false negative
276 | X_test[3132]
277 |
278 |
279 | # calculate predicted probabilities for X_test_dtm (poorly calibrated)
280 | y_pred_prob = nb.predict_proba(X_test_dtm)[:, 1]
281 | y_pred_prob
282 |
283 |
284 | # calculate AUC
285 | metrics.roc_auc_score(y_test, y_pred_prob)
286 |
287 |
288 | # ## Part 6: Comparing models
289 | #
290 | # We will compare multinomial Naive Bayes with [logistic regression](http://scikit-learn.org/stable/modules/linear_model.html#logistic-regression):
291 | #
292 | # > Logistic regression, despite its name, is a **linear model for classification** rather than regression. Logistic regression is also known in the literature as logit regression, maximum-entropy classification (MaxEnt) or the log-linear classifier. In this model, the probabilities describing the possible outcomes of a single trial are modeled using a logistic function.
293 |
294 | # import and instantiate a logistic regression model
295 | from sklearn.linear_model import LogisticRegression
296 | logreg = LogisticRegression()
297 |
298 |
299 | # train the model using X_train_dtm
300 | logreg.fit(X_train_dtm, y_train)
301 |
302 |
303 | # make class predictions for X_test_dtm
304 | y_pred_class = logreg.predict(X_test_dtm)
305 |
306 |
307 | # calculate predicted probabilities for X_test_dtm (well calibrated)
308 | y_pred_prob = logreg.predict_proba(X_test_dtm)[:, 1]
309 | y_pred_prob
310 |
311 |
312 | # calculate accuracy
313 | metrics.accuracy_score(y_test, y_pred_class)
314 |
315 |
316 | # calculate AUC
317 | metrics.roc_auc_score(y_test, y_pred_prob)
318 |
--------------------------------------------------------------------------------
/tutorial_with_output.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# Tutorial: Machine Learning with Text in scikit-learn"
8 | ]
9 | },
10 | {
11 | "cell_type": "markdown",
12 | "metadata": {},
13 | "source": [
14 | "## Agenda\n",
15 | "\n",
16 | "1. Model building in scikit-learn (refresher)\n",
17 | "2. Representing text as numerical data\n",
18 | "3. Reading a text-based dataset into pandas\n",
19 | "4. Vectorizing our dataset\n",
20 | "5. Building and evaluating a model\n",
21 | "6. Comparing models"
22 | ]
23 | },
24 | {
25 | "cell_type": "code",
26 | "execution_count": 1,
27 | "metadata": {
28 | "collapsed": false
29 | },
30 | "outputs": [],
31 | "source": [
32 | "# for Python 2: use print only as a function\n",
33 | "from __future__ import print_function"
34 | ]
35 | },
36 | {
37 | "cell_type": "markdown",
38 | "metadata": {},
39 | "source": [
40 | "## Part 1: Model building in scikit-learn (refresher)"
41 | ]
42 | },
43 | {
44 | "cell_type": "code",
45 | "execution_count": 2,
46 | "metadata": {
47 | "collapsed": true
48 | },
49 | "outputs": [],
50 | "source": [
51 | "# load the iris dataset as an example\n",
52 | "from sklearn.datasets import load_iris\n",
53 | "iris = load_iris()"
54 | ]
55 | },
56 | {
57 | "cell_type": "code",
58 | "execution_count": 3,
59 | "metadata": {
60 | "collapsed": true
61 | },
62 | "outputs": [],
63 | "source": [
64 | "# store the feature matrix (X) and response vector (y)\n",
65 | "X = iris.data\n",
66 | "y = iris.target"
67 | ]
68 | },
69 | {
70 | "cell_type": "markdown",
71 | "metadata": {},
72 | "source": [
73 | "**\"Features\"** are also known as predictors, inputs, or attributes. The **\"response\"** is also known as the target, label, or output."
74 | ]
75 | },
76 | {
77 | "cell_type": "code",
78 | "execution_count": 4,
79 | "metadata": {
80 | "collapsed": false
81 | },
82 | "outputs": [
83 | {
84 | "name": "stdout",
85 | "output_type": "stream",
86 | "text": [
87 | "(150L, 4L)\n",
88 | "(150L,)\n"
89 | ]
90 | }
91 | ],
92 | "source": [
93 | "# check the shapes of X and y\n",
94 | "print(X.shape)\n",
95 | "print(y.shape)"
96 | ]
97 | },
98 | {
99 | "cell_type": "markdown",
100 | "metadata": {},
101 | "source": [
102 | "**\"Observations\"** are also known as samples, instances, or records."
103 | ]
104 | },
105 | {
106 | "cell_type": "code",
107 | "execution_count": 5,
108 | "metadata": {
109 | "collapsed": false
110 | },
111 | "outputs": [
112 | {
113 | "data": {
114 | "text/html": [
115 | "
\n",
116 | "
\n",
117 | " \n",
118 | " \n",
119 | " | \n",
120 | " sepal length (cm) | \n",
121 | " sepal width (cm) | \n",
122 | " petal length (cm) | \n",
123 | " petal width (cm) | \n",
124 | "
\n",
125 | " \n",
126 | " \n",
127 | " \n",
128 | " 0 | \n",
129 | " 5.1 | \n",
130 | " 3.5 | \n",
131 | " 1.4 | \n",
132 | " 0.2 | \n",
133 | "
\n",
134 | " \n",
135 | " 1 | \n",
136 | " 4.9 | \n",
137 | " 3.0 | \n",
138 | " 1.4 | \n",
139 | " 0.2 | \n",
140 | "
\n",
141 | " \n",
142 | " 2 | \n",
143 | " 4.7 | \n",
144 | " 3.2 | \n",
145 | " 1.3 | \n",
146 | " 0.2 | \n",
147 | "
\n",
148 | " \n",
149 | " 3 | \n",
150 | " 4.6 | \n",
151 | " 3.1 | \n",
152 | " 1.5 | \n",
153 | " 0.2 | \n",
154 | "
\n",
155 | " \n",
156 | " 4 | \n",
157 | " 5.0 | \n",
158 | " 3.6 | \n",
159 | " 1.4 | \n",
160 | " 0.2 | \n",
161 | "
\n",
162 | " \n",
163 | "
\n",
164 | "
"
165 | ],
166 | "text/plain": [
167 | " sepal length (cm) sepal width (cm) petal length (cm) petal width (cm)\n",
168 | "0 5.1 3.5 1.4 0.2\n",
169 | "1 4.9 3.0 1.4 0.2\n",
170 | "2 4.7 3.2 1.3 0.2\n",
171 | "3 4.6 3.1 1.5 0.2\n",
172 | "4 5.0 3.6 1.4 0.2"
173 | ]
174 | },
175 | "execution_count": 5,
176 | "metadata": {},
177 | "output_type": "execute_result"
178 | }
179 | ],
180 | "source": [
181 | "# examine the first 5 rows of the feature matrix (including the feature names)\n",
182 | "import pandas as pd\n",
183 | "pd.DataFrame(X, columns=iris.feature_names).head()"
184 | ]
185 | },
186 | {
187 | "cell_type": "code",
188 | "execution_count": 6,
189 | "metadata": {
190 | "collapsed": false
191 | },
192 | "outputs": [
193 | {
194 | "name": "stdout",
195 | "output_type": "stream",
196 | "text": [
197 | "[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0\n",
198 | " 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1\n",
199 | " 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2\n",
200 | " 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2\n",
201 | " 2 2]\n"
202 | ]
203 | }
204 | ],
205 | "source": [
206 | "# examine the response vector\n",
207 | "print(y)"
208 | ]
209 | },
210 | {
211 | "cell_type": "markdown",
212 | "metadata": {},
213 | "source": [
214 | "In order to **build a model**, the features must be **numeric**, and every observation must have the **same features in the same order**."
215 | ]
216 | },
217 | {
218 | "cell_type": "code",
219 | "execution_count": 7,
220 | "metadata": {
221 | "collapsed": false
222 | },
223 | "outputs": [
224 | {
225 | "data": {
226 | "text/plain": [
227 | "KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',\n",
228 | " metric_params=None, n_jobs=1, n_neighbors=5, p=2,\n",
229 | " weights='uniform')"
230 | ]
231 | },
232 | "execution_count": 7,
233 | "metadata": {},
234 | "output_type": "execute_result"
235 | }
236 | ],
237 | "source": [
238 | "# import the class\n",
239 | "from sklearn.neighbors import KNeighborsClassifier\n",
240 | "\n",
241 | "# instantiate the model (with the default parameters)\n",
242 | "knn = KNeighborsClassifier()\n",
243 | "\n",
244 | "# fit the model with data (occurs in-place)\n",
245 | "knn.fit(X, y)"
246 | ]
247 | },
248 | {
249 | "cell_type": "markdown",
250 | "metadata": {},
251 | "source": [
252 | "In order to **make a prediction**, the new observation must have the **same features as the training observations**, both in number and meaning."
253 | ]
254 | },
255 | {
256 | "cell_type": "code",
257 | "execution_count": 8,
258 | "metadata": {
259 | "collapsed": false
260 | },
261 | "outputs": [
262 | {
263 | "data": {
264 | "text/plain": [
265 | "array([1])"
266 | ]
267 | },
268 | "execution_count": 8,
269 | "metadata": {},
270 | "output_type": "execute_result"
271 | }
272 | ],
273 | "source": [
274 | "# predict the response for a new observation\n",
275 | "knn.predict([[3, 5, 4, 2]])"
276 | ]
277 | },
278 | {
279 | "cell_type": "markdown",
280 | "metadata": {},
281 | "source": [
282 | "## Part 2: Representing text as numerical data"
283 | ]
284 | },
285 | {
286 | "cell_type": "code",
287 | "execution_count": 9,
288 | "metadata": {
289 | "collapsed": true
290 | },
291 | "outputs": [],
292 | "source": [
293 | "# example text for model training (SMS messages)\n",
294 | "simple_train = ['call you tonight', 'Call me a cab', 'please call me... PLEASE!']"
295 | ]
296 | },
297 | {
298 | "cell_type": "code",
299 | "execution_count": 10,
300 | "metadata": {
301 | "collapsed": true
302 | },
303 | "outputs": [],
304 | "source": [
305 | "# example response vector\n",
306 | "is_desperate = [0, 0, 1]"
307 | ]
308 | },
309 | {
310 | "cell_type": "markdown",
311 | "metadata": {},
312 | "source": [
313 | "From the [scikit-learn documentation](http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction):\n",
314 | "\n",
315 | "> Text Analysis is a major application field for machine learning algorithms. However the raw data, a sequence of symbols cannot be fed directly to the algorithms themselves as most of them expect **numerical feature vectors with a fixed size** rather than the **raw text documents with variable length**.\n",
316 | "\n",
317 | "We will use [CountVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) to \"convert text into a matrix of token counts\":"
318 | ]
319 | },
320 | {
321 | "cell_type": "code",
322 | "execution_count": 11,
323 | "metadata": {
324 | "collapsed": true
325 | },
326 | "outputs": [],
327 | "source": [
328 | "# import and instantiate CountVectorizer (with the default parameters)\n",
329 | "from sklearn.feature_extraction.text import CountVectorizer\n",
330 | "vect = CountVectorizer()"
331 | ]
332 | },
333 | {
334 | "cell_type": "code",
335 | "execution_count": 12,
336 | "metadata": {
337 | "collapsed": false
338 | },
339 | "outputs": [
340 | {
341 | "data": {
342 | "text/plain": [
343 | "CountVectorizer(analyzer=u'word', binary=False, decode_error=u'strict',\n",
344 | " dtype=, encoding=u'utf-8', input=u'content',\n",
345 | " lowercase=True, max_df=1.0, max_features=None, min_df=1,\n",
346 | " ngram_range=(1, 1), preprocessor=None, stop_words=None,\n",
347 | " strip_accents=None, token_pattern=u'(?u)\\\\b\\\\w\\\\w+\\\\b',\n",
348 | " tokenizer=None, vocabulary=None)"
349 | ]
350 | },
351 | "execution_count": 12,
352 | "metadata": {},
353 | "output_type": "execute_result"
354 | }
355 | ],
356 | "source": [
357 | "# learn the 'vocabulary' of the training data (occurs in-place)\n",
358 | "vect.fit(simple_train)"
359 | ]
360 | },
361 | {
362 | "cell_type": "code",
363 | "execution_count": 13,
364 | "metadata": {
365 | "collapsed": false
366 | },
367 | "outputs": [
368 | {
369 | "data": {
370 | "text/plain": [
371 | "[u'cab', u'call', u'me', u'please', u'tonight', u'you']"
372 | ]
373 | },
374 | "execution_count": 13,
375 | "metadata": {},
376 | "output_type": "execute_result"
377 | }
378 | ],
379 | "source": [
380 | "# examine the fitted vocabulary\n",
381 | "vect.get_feature_names()"
382 | ]
383 | },
384 | {
385 | "cell_type": "code",
386 | "execution_count": 14,
387 | "metadata": {
388 | "collapsed": false
389 | },
390 | "outputs": [
391 | {
392 | "data": {
393 | "text/plain": [
394 | "<3x6 sparse matrix of type ''\n",
395 | "\twith 9 stored elements in Compressed Sparse Row format>"
396 | ]
397 | },
398 | "execution_count": 14,
399 | "metadata": {},
400 | "output_type": "execute_result"
401 | }
402 | ],
403 | "source": [
404 | "# transform training data into a 'document-term matrix'\n",
405 | "simple_train_dtm = vect.transform(simple_train)\n",
406 | "simple_train_dtm"
407 | ]
408 | },
409 | {
410 | "cell_type": "code",
411 | "execution_count": 15,
412 | "metadata": {
413 | "collapsed": false
414 | },
415 | "outputs": [
416 | {
417 | "data": {
418 | "text/plain": [
419 | "array([[0, 1, 0, 0, 1, 1],\n",
420 | " [1, 1, 1, 0, 0, 0],\n",
421 | " [0, 1, 1, 2, 0, 0]], dtype=int64)"
422 | ]
423 | },
424 | "execution_count": 15,
425 | "metadata": {},
426 | "output_type": "execute_result"
427 | }
428 | ],
429 | "source": [
430 | "# convert sparse matrix to a dense matrix\n",
431 | "simple_train_dtm.toarray()"
432 | ]
433 | },
434 | {
435 | "cell_type": "code",
436 | "execution_count": 16,
437 | "metadata": {
438 | "collapsed": false
439 | },
440 | "outputs": [
441 | {
442 | "data": {
443 | "text/html": [
444 | "\n",
445 | "
\n",
446 | " \n",
447 | " \n",
448 | " | \n",
449 | " cab | \n",
450 | " call | \n",
451 | " me | \n",
452 | " please | \n",
453 | " tonight | \n",
454 | " you | \n",
455 | "
\n",
456 | " \n",
457 | " \n",
458 | " \n",
459 | " 0 | \n",
460 | " 0 | \n",
461 | " 1 | \n",
462 | " 0 | \n",
463 | " 0 | \n",
464 | " 1 | \n",
465 | " 1 | \n",
466 | "
\n",
467 | " \n",
468 | " 1 | \n",
469 | " 1 | \n",
470 | " 1 | \n",
471 | " 1 | \n",
472 | " 0 | \n",
473 | " 0 | \n",
474 | " 0 | \n",
475 | "
\n",
476 | " \n",
477 | " 2 | \n",
478 | " 0 | \n",
479 | " 1 | \n",
480 | " 1 | \n",
481 | " 2 | \n",
482 | " 0 | \n",
483 | " 0 | \n",
484 | "
\n",
485 | " \n",
486 | "
\n",
487 | "
"
488 | ],
489 | "text/plain": [
490 | " cab call me please tonight you\n",
491 | "0 0 1 0 0 1 1\n",
492 | "1 1 1 1 0 0 0\n",
493 | "2 0 1 1 2 0 0"
494 | ]
495 | },
496 | "execution_count": 16,
497 | "metadata": {},
498 | "output_type": "execute_result"
499 | }
500 | ],
501 | "source": [
502 | "# examine the vocabulary and document-term matrix together\n",
503 | "pd.DataFrame(simple_train_dtm.toarray(), columns=vect.get_feature_names())"
504 | ]
505 | },
506 | {
507 | "cell_type": "markdown",
508 | "metadata": {},
509 | "source": [
510 | "From the [scikit-learn documentation](http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction):\n",
511 | "\n",
512 | "> In this scheme, features and samples are defined as follows:\n",
513 | "\n",
514 | "> - Each individual token occurrence frequency (normalized or not) is treated as a **feature**.\n",
515 | "> - The vector of all the token frequencies for a given document is considered a multivariate **sample**.\n",
516 | "\n",
517 | "> A **corpus of documents** can thus be represented by a matrix with **one row per document** and **one column per token** (e.g. word) occurring in the corpus.\n",
518 | "\n",
519 | "> We call **vectorization** the general process of turning a collection of text documents into numerical feature vectors. This specific strategy (tokenization, counting and normalization) is called the **Bag of Words** or \"Bag of n-grams\" representation. Documents are described by word occurrences while completely ignoring the relative position information of the words in the document."
520 | ]
521 | },
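   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
     "As a small aside, the fitted vectorizer also exposes the mapping from each token to its column index. A minimal check, assuming the `vect` object fitted above:"
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {
     "collapsed": false
    },
    "outputs": [],
    "source": [
     "# examine the token-to-column mapping learned during the fit step\n",
     "vect.vocabulary_"
    ]
   },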
522 | {
523 | "cell_type": "code",
524 | "execution_count": 17,
525 | "metadata": {
526 | "collapsed": false
527 | },
528 | "outputs": [
529 | {
530 | "data": {
531 | "text/plain": [
532 | "scipy.sparse.csr.csr_matrix"
533 | ]
534 | },
535 | "execution_count": 17,
536 | "metadata": {},
537 | "output_type": "execute_result"
538 | }
539 | ],
540 | "source": [
541 | "# check the type of the document-term matrix\n",
542 | "type(simple_train_dtm)"
543 | ]
544 | },
545 | {
546 | "cell_type": "code",
547 | "execution_count": 18,
548 | "metadata": {
549 | "collapsed": false,
550 | "scrolled": true
551 | },
552 | "outputs": [
553 | {
554 | "name": "stdout",
555 | "output_type": "stream",
556 | "text": [
557 | " (0, 1)\t1\n",
558 | " (0, 4)\t1\n",
559 | " (0, 5)\t1\n",
560 | " (1, 0)\t1\n",
561 | " (1, 1)\t1\n",
562 | " (1, 2)\t1\n",
563 | " (2, 1)\t1\n",
564 | " (2, 2)\t1\n",
565 | " (2, 3)\t2\n"
566 | ]
567 | }
568 | ],
569 | "source": [
570 | "# examine the sparse matrix contents\n",
571 | "print(simple_train_dtm)"
572 | ]
573 | },
574 | {
575 | "cell_type": "markdown",
576 | "metadata": {},
577 | "source": [
578 | "From the [scikit-learn documentation](http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction):\n",
579 | "\n",
580 | "> As most documents will typically use a very small subset of the words used in the corpus, the resulting matrix will have **many feature values that are zeros** (typically more than 99% of them).\n",
581 | "\n",
582 | "> For instance, a collection of 10,000 short text documents (such as emails) will use a vocabulary with a size in the order of 100,000 unique words in total while each document will use 100 to 1000 unique words individually.\n",
583 | "\n",
584 | "> In order to be able to **store such a matrix in memory** but also to **speed up operations**, implementations will typically use a **sparse representation** such as the implementations available in the `scipy.sparse` package."
585 | ]
586 | },
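   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
     "A quick sanity check of that claim on our tiny example, assuming the `simple_train_dtm` object created above: the sparse representation stores only the nonzero counts."
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {
     "collapsed": false
    },
    "outputs": [],
    "source": [
     "# compare the number of stored (nonzero) elements to the full matrix size\n",
     "print(simple_train_dtm.nnz)\n",
     "print(simple_train_dtm.shape[0] * simple_train_dtm.shape[1])"
    ]
   },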
587 | {
588 | "cell_type": "code",
589 | "execution_count": 19,
590 | "metadata": {
591 | "collapsed": false
592 | },
593 | "outputs": [
594 | {
595 | "data": {
596 | "text/plain": [
597 | "KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',\n",
598 | " metric_params=None, n_jobs=1, n_neighbors=1, p=2,\n",
599 | " weights='uniform')"
600 | ]
601 | },
602 | "execution_count": 19,
603 | "metadata": {},
604 | "output_type": "execute_result"
605 | }
606 | ],
607 | "source": [
608 | "# build a model to predict desperation\n",
609 | "knn = KNeighborsClassifier(n_neighbors=1)\n",
610 | "knn.fit(simple_train_dtm, is_desperate)"
611 | ]
612 | },
613 | {
614 | "cell_type": "code",
615 | "execution_count": 20,
616 | "metadata": {
617 | "collapsed": true
618 | },
619 | "outputs": [],
620 | "source": [
621 | "# example text for model testing\n",
622 | "simple_test = [\"please don't call me\"]"
623 | ]
624 | },
625 | {
626 | "cell_type": "markdown",
627 | "metadata": {},
628 | "source": [
629 | "In order to **make a prediction**, the new observation must have the **same features as the training observations**, both in number and meaning."
630 | ]
631 | },
632 | {
633 | "cell_type": "code",
634 | "execution_count": 21,
635 | "metadata": {
636 | "collapsed": false
637 | },
638 | "outputs": [
639 | {
640 | "data": {
641 | "text/plain": [
642 | "array([[0, 1, 1, 1, 0, 0]], dtype=int64)"
643 | ]
644 | },
645 | "execution_count": 21,
646 | "metadata": {},
647 | "output_type": "execute_result"
648 | }
649 | ],
650 | "source": [
651 | "# transform testing data into a document-term matrix (using existing vocabulary)\n",
652 | "simple_test_dtm = vect.transform(simple_test)\n",
653 | "simple_test_dtm.toarray()"
654 | ]
655 | },
656 | {
657 | "cell_type": "code",
658 | "execution_count": 22,
659 | "metadata": {
660 | "collapsed": false
661 | },
662 | "outputs": [
663 | {
664 | "data": {
693 | "text/plain": [
694 | " cab call me please tonight you\n",
695 | "0 0 1 1 1 0 0"
696 | ]
697 | },
698 | "execution_count": 22,
699 | "metadata": {},
700 | "output_type": "execute_result"
701 | }
702 | ],
703 | "source": [
704 | "# examine the vocabulary and document-term matrix together\n",
705 | "pd.DataFrame(simple_test_dtm.toarray(), columns=vect.get_feature_names())"
706 | ]
707 | },
708 | {
709 | "cell_type": "code",
710 | "execution_count": 23,
711 | "metadata": {
712 | "collapsed": false
713 | },
714 | "outputs": [
715 | {
716 | "data": {
717 | "text/plain": [
718 | "array([1])"
719 | ]
720 | },
721 | "execution_count": 23,
722 | "metadata": {},
723 | "output_type": "execute_result"
724 | }
725 | ],
726 | "source": [
727 | "# predict whether simple_test is desperate\n",
728 | "knn.predict(simple_test_dtm)"
729 | ]
730 | },
731 | {
732 | "cell_type": "markdown",
733 | "metadata": {},
734 | "source": [
735 | "**Summary:**\n",
736 | "\n",
737 | "- `vect.fit(train)` **learns the vocabulary** of the training data\n",
738 | "- `vect.transform(train)` uses the **fitted vocabulary** to build a document-term matrix from the training data\n",
739 | "- `vect.transform(test)` uses the **fitted vocabulary** to build a document-term matrix from the testing data (and **ignores tokens** it hasn't seen before)"
740 | ]
741 | },
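   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
     "To illustrate the last point, here is a minimal sketch, assuming the `vect` object fitted above: a document made up entirely of tokens that are not in the fitted vocabulary is transformed into an all-zero row."
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {
     "collapsed": false
    },
    "outputs": [],
    "source": [
     "# tokens unseen during fit are simply ignored at transform time\n",
     "vect.transform(['text the vectorizer has never seen']).toarray()"
    ]
   },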
742 | {
743 | "cell_type": "markdown",
744 | "metadata": {},
745 | "source": [
746 | "## Part 3: Reading a text-based dataset into pandas"
747 | ]
748 | },
749 | {
750 | "cell_type": "code",
751 | "execution_count": 24,
752 | "metadata": {
753 | "collapsed": true
754 | },
755 | "outputs": [],
756 | "source": [
757 | "# read file into pandas from the working directory\n",
758 | "sms = pd.read_table('sms.tsv', header=None, names=['label', 'message'])"
759 | ]
760 | },
761 | {
762 | "cell_type": "code",
763 | "execution_count": 25,
764 | "metadata": {
765 | "collapsed": false
766 | },
767 | "outputs": [],
768 | "source": [
769 | "# alternative: read file into pandas from a URL\n",
770 | "# url = 'https://raw.githubusercontent.com/justmarkham/pydata-dc-2016-tutorial/master/sms.tsv'\n",
771 | "# sms = pd.read_table(url, header=None, names=['label', 'message'])"
772 | ]
773 | },
774 | {
775 | "cell_type": "code",
776 | "execution_count": 26,
777 | "metadata": {
778 | "collapsed": false
779 | },
780 | "outputs": [
781 | {
782 | "data": {
783 | "text/plain": [
784 | "(5572, 2)"
785 | ]
786 | },
787 | "execution_count": 26,
788 | "metadata": {},
789 | "output_type": "execute_result"
790 | }
791 | ],
792 | "source": [
793 | "# examine the shape\n",
794 | "sms.shape"
795 | ]
796 | },
797 | {
798 | "cell_type": "code",
799 | "execution_count": 27,
800 | "metadata": {
801 | "collapsed": false
802 | },
803 | "outputs": [
804 | {
805 | "data": {
871 | "text/plain": [
872 | " label message\n",
873 | "0 ham Go until jurong point, crazy.. Available only ...\n",
874 | "1 ham Ok lar... Joking wif u oni...\n",
875 | "2 spam Free entry in 2 a wkly comp to win FA Cup fina...\n",
876 | "3 ham U dun say so early hor... U c already then say...\n",
877 | "4 ham Nah I don't think he goes to usf, he lives aro...\n",
878 | "5 spam FreeMsg Hey there darling it's been 3 week's n...\n",
879 | "6 ham Even my brother is not like to speak with me. ...\n",
880 | "7 ham As per your request 'Melle Melle (Oru Minnamin...\n",
881 | "8 spam WINNER!! As a valued network customer you have...\n",
882 | "9 spam Had your mobile 11 months or more? U R entitle..."
883 | ]
884 | },
885 | "execution_count": 27,
886 | "metadata": {},
887 | "output_type": "execute_result"
888 | }
889 | ],
890 | "source": [
891 | "# examine the first 10 rows\n",
892 | "sms.head(10)"
893 | ]
894 | },
895 | {
896 | "cell_type": "code",
897 | "execution_count": 28,
898 | "metadata": {
899 | "collapsed": false
900 | },
901 | "outputs": [
902 | {
903 | "data": {
904 | "text/plain": [
905 | "ham 4825\n",
906 | "spam 747\n",
907 | "Name: label, dtype: int64"
908 | ]
909 | },
910 | "execution_count": 28,
911 | "metadata": {},
912 | "output_type": "execute_result"
913 | }
914 | ],
915 | "source": [
916 | "# examine the class distribution\n",
917 | "sms.label.value_counts()"
918 | ]
919 | },
920 | {
921 | "cell_type": "code",
922 | "execution_count": 29,
923 | "metadata": {
924 | "collapsed": true
925 | },
926 | "outputs": [],
927 | "source": [
928 | "# convert label to a numerical variable\n",
929 | "sms['label_num'] = sms.label.map({'ham':0, 'spam':1})"
930 | ]
931 | },
932 | {
933 | "cell_type": "code",
934 | "execution_count": 30,
935 | "metadata": {
936 | "collapsed": false
937 | },
938 | "outputs": [
939 | {
940 | "data": {
1017 | "text/plain": [
1018 | " label message label_num\n",
1019 | "0 ham Go until jurong point, crazy.. Available only ... 0\n",
1020 | "1 ham Ok lar... Joking wif u oni... 0\n",
1021 | "2 spam Free entry in 2 a wkly comp to win FA Cup fina... 1\n",
1022 | "3 ham U dun say so early hor... U c already then say... 0\n",
1023 | "4 ham Nah I don't think he goes to usf, he lives aro... 0\n",
1024 | "5 spam FreeMsg Hey there darling it's been 3 week's n... 1\n",
1025 | "6 ham Even my brother is not like to speak with me. ... 0\n",
1026 | "7 ham As per your request 'Melle Melle (Oru Minnamin... 0\n",
1027 | "8 spam WINNER!! As a valued network customer you have... 1\n",
1028 | "9 spam Had your mobile 11 months or more? U R entitle... 1"
1029 | ]
1030 | },
1031 | "execution_count": 30,
1032 | "metadata": {},
1033 | "output_type": "execute_result"
1034 | }
1035 | ],
1036 | "source": [
1037 | "# check that the conversion worked\n",
1038 | "sms.head(10)"
1039 | ]
1040 | },
1041 | {
1042 | "cell_type": "code",
1043 | "execution_count": 31,
1044 | "metadata": {
1045 | "collapsed": false
1046 | },
1047 | "outputs": [
1048 | {
1049 | "name": "stdout",
1050 | "output_type": "stream",
1051 | "text": [
1052 | "(150L, 4L)\n",
1053 | "(150L,)\n"
1054 | ]
1055 | }
1056 | ],
1057 | "source": [
1058 | "# how to define X and y (from the iris data) for use with a MODEL\n",
1059 | "X = iris.data\n",
1060 | "y = iris.target\n",
1061 | "print(X.shape)\n",
1062 | "print(y.shape)"
1063 | ]
1064 | },
1065 | {
1066 | "cell_type": "code",
1067 | "execution_count": 32,
1068 | "metadata": {
1069 | "collapsed": false
1070 | },
1071 | "outputs": [
1072 | {
1073 | "name": "stdout",
1074 | "output_type": "stream",
1075 | "text": [
1076 | "(5572L,)\n",
1077 | "(5572L,)\n"
1078 | ]
1079 | }
1080 | ],
1081 | "source": [
1082 | "# how to define X and y (from the SMS data) for use with COUNTVECTORIZER\n",
1083 | "X = sms.message\n",
1084 | "y = sms.label_num\n",
1085 | "print(X.shape)\n",
1086 | "print(y.shape)"
1087 | ]
1088 | },
1089 | {
1090 | "cell_type": "code",
1091 | "execution_count": 33,
1092 | "metadata": {
1093 | "collapsed": false
1094 | },
1095 | "outputs": [
1096 | {
1097 | "name": "stdout",
1098 | "output_type": "stream",
1099 | "text": [
1100 | "(4179L,)\n",
1101 | "(1393L,)\n",
1102 | "(4179L,)\n",
1103 | "(1393L,)\n"
1104 | ]
1105 | }
1106 | ],
1107 | "source": [
1108 | "# split X and y into training and testing sets\n",
1109 | "from sklearn.cross_validation import train_test_split\n",
1110 | "X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)\n",
1111 | "print(X_train.shape)\n",
1112 | "print(X_test.shape)\n",
1113 | "print(y_train.shape)\n",
1114 | "print(y_test.shape)"
1115 | ]
1116 | },
1117 | {
1118 | "cell_type": "markdown",
1119 | "metadata": {},
1120 | "source": [
1121 | "## Part 4: Vectorizing our dataset"
1122 | ]
1123 | },
1124 | {
1125 | "cell_type": "code",
1126 | "execution_count": 34,
1127 | "metadata": {
1128 | "collapsed": true
1129 | },
1130 | "outputs": [],
1131 | "source": [
1132 | "# instantiate the vectorizer\n",
1133 | "vect = CountVectorizer()"
1134 | ]
1135 | },
1136 | {
1137 | "cell_type": "code",
1138 | "execution_count": 35,
1139 | "metadata": {
1140 | "collapsed": true
1141 | },
1142 | "outputs": [],
1143 | "source": [
1144 | "# learn training data vocabulary, then use it to create a document-term matrix\n",
1145 | "vect.fit(X_train)\n",
1146 | "X_train_dtm = vect.transform(X_train)"
1147 | ]
1148 | },
1149 | {
1150 | "cell_type": "code",
1151 | "execution_count": 36,
1152 | "metadata": {
1153 | "collapsed": true
1154 | },
1155 | "outputs": [],
1156 | "source": [
1157 | "# equivalently: combine fit and transform into a single step\n",
1158 | "X_train_dtm = vect.fit_transform(X_train)"
1159 | ]
1160 | },
1161 | {
1162 | "cell_type": "code",
1163 | "execution_count": 37,
1164 | "metadata": {
1165 | "collapsed": false
1166 | },
1167 | "outputs": [
1168 | {
1169 | "data": {
1170 | "text/plain": [
1171 | "<4179x7456 sparse matrix of type '<type 'numpy.int64'>'\n",
1172 | "\twith 55209 stored elements in Compressed Sparse Row format>"
1173 | ]
1174 | },
1175 | "execution_count": 37,
1176 | "metadata": {},
1177 | "output_type": "execute_result"
1178 | }
1179 | ],
1180 | "source": [
1181 | "# examine the document-term matrix\n",
1182 | "X_train_dtm"
1183 | ]
1184 | },
1185 | {
1186 | "cell_type": "code",
1187 | "execution_count": 38,
1188 | "metadata": {
1189 | "collapsed": false
1190 | },
1191 | "outputs": [
1192 | {
1193 | "data": {
1194 | "text/plain": [
1195 | "<1393x7456 sparse matrix of type '<type 'numpy.int64'>'\n",
1196 | "\twith 17604 stored elements in Compressed Sparse Row format>"
1197 | ]
1198 | },
1199 | "execution_count": 38,
1200 | "metadata": {},
1201 | "output_type": "execute_result"
1202 | }
1203 | ],
1204 | "source": [
1205 | "# transform testing data (using fitted vocabulary) into a document-term matrix\n",
1206 | "X_test_dtm = vect.transform(X_test)\n",
1207 | "X_test_dtm"
1208 | ]
1209 | },
1210 | {
1211 | "cell_type": "markdown",
1212 | "metadata": {},
1213 | "source": [
1214 | "## Part 5: Building and evaluating a model\n",
1215 | "\n",
1216 | "We will use [multinomial Naive Bayes](http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html):\n",
1217 | "\n",
1218 | "> The multinomial Naive Bayes classifier is suitable for classification with **discrete features** (e.g., word counts for text classification). The multinomial distribution normally requires integer feature counts. However, in practice, fractional counts such as tf-idf may also work."
1219 | ]
1220 | },
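   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
     "The note above mentions that fractional counts such as tf-idf may also work. Here is a minimal sketch of that alternative using `TfidfVectorizer`, which follows the same fit/transform pattern as `CountVectorizer`; the `vect_tfidf`, `X_train_tfidf`, and `X_test_tfidf` names are only for illustration and are not used in the rest of the tutorial."
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {
     "collapsed": true
    },
    "outputs": [],
    "source": [
     "# optional alternative: tf-idf weighted document-term matrices instead of raw counts\n",
     "from sklearn.feature_extraction.text import TfidfVectorizer\n",
     "vect_tfidf = TfidfVectorizer()\n",
     "X_train_tfidf = vect_tfidf.fit_transform(X_train)\n",
     "X_test_tfidf = vect_tfidf.transform(X_test)"
    ]
   },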
1221 | {
1222 | "cell_type": "code",
1223 | "execution_count": 39,
1224 | "metadata": {
1225 | "collapsed": true
1226 | },
1227 | "outputs": [],
1228 | "source": [
1229 | "# import and instantiate a Multinomial Naive Bayes model\n",
1230 | "from sklearn.naive_bayes import MultinomialNB\n",
1231 | "nb = MultinomialNB()"
1232 | ]
1233 | },
1234 | {
1235 | "cell_type": "code",
1236 | "execution_count": 40,
1237 | "metadata": {
1238 | "collapsed": false
1239 | },
1240 | "outputs": [
1241 | {
1242 | "name": "stdout",
1243 | "output_type": "stream",
1244 | "text": [
1245 | "Wall time: 6 ms\n"
1246 | ]
1247 | },
1248 | {
1249 | "data": {
1250 | "text/plain": [
1251 | "MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)"
1252 | ]
1253 | },
1254 | "execution_count": 40,
1255 | "metadata": {},
1256 | "output_type": "execute_result"
1257 | }
1258 | ],
1259 | "source": [
1260 | "# train the model using X_train_dtm (timing it with an IPython \"magic command\")\n",
1261 | "%time nb.fit(X_train_dtm, y_train)"
1262 | ]
1263 | },
1264 | {
1265 | "cell_type": "code",
1266 | "execution_count": 41,
1267 | "metadata": {
1268 | "collapsed": true
1269 | },
1270 | "outputs": [],
1271 | "source": [
1272 | "# make class predictions for X_test_dtm\n",
1273 | "y_pred_class = nb.predict(X_test_dtm)"
1274 | ]
1275 | },
1276 | {
1277 | "cell_type": "code",
1278 | "execution_count": 42,
1279 | "metadata": {
1280 | "collapsed": false
1281 | },
1282 | "outputs": [
1283 | {
1284 | "data": {
1285 | "text/plain": [
1286 | "0.98851399856424982"
1287 | ]
1288 | },
1289 | "execution_count": 42,
1290 | "metadata": {},
1291 | "output_type": "execute_result"
1292 | }
1293 | ],
1294 | "source": [
1295 | "# calculate accuracy of class predictions\n",
1296 | "from sklearn import metrics\n",
1297 | "metrics.accuracy_score(y_test, y_pred_class)"
1298 | ]
1299 | },
1300 | {
1301 | "cell_type": "code",
1302 | "execution_count": 43,
1303 | "metadata": {
1304 | "collapsed": false
1305 | },
1306 | "outputs": [
1307 | {
1308 | "data": {
1309 | "text/plain": [
1310 | "array([[1203, 5],\n",
1311 | " [ 11, 174]])"
1312 | ]
1313 | },
1314 | "execution_count": 43,
1315 | "metadata": {},
1316 | "output_type": "execute_result"
1317 | }
1318 | ],
1319 | "source": [
1320 | "# print the confusion matrix\n",
1321 | "metrics.confusion_matrix(y_test, y_pred_class)"
1322 | ]
1323 | },
1324 | {
1325 | "cell_type": "code",
1326 | "execution_count": 44,
1327 | "metadata": {
1328 | "collapsed": false
1329 | },
1330 | "outputs": [
1331 | {
1332 | "data": {
1333 | "text/plain": [
1334 | "574 Waiting for your call.\n",
1335 | "3375 Also andros ice etc etc\n",
1336 | "45 No calls..messages..missed calls\n",
1337 | "3415 No pic. Please re-send.\n",
1338 | "1988 No calls..messages..missed calls\n",
1339 | "Name: message, dtype: object"
1340 | ]
1341 | },
1342 | "execution_count": 44,
1343 | "metadata": {},
1344 | "output_type": "execute_result"
1345 | }
1346 | ],
1347 | "source": [
1348 | "# print message text for the false positives (ham incorrectly classified as spam)\n",
1349 | "X_test[y_test < y_pred_class]"
1350 | ]
1351 | },
1352 | {
1353 | "cell_type": "code",
1354 | "execution_count": 45,
1355 | "metadata": {
1356 | "collapsed": false,
1357 | "scrolled": true
1358 | },
1359 | "outputs": [
1360 | {
1361 | "data": {
1362 | "text/plain": [
1363 | "3132 LookAtMe!: Thanks for your purchase of a video...\n",
1364 | "5 FreeMsg Hey there darling it's been 3 week's n...\n",
1365 | "3530 Xmas & New Years Eve tickets are now on sale f...\n",
1366 | "684 Hi I'm sue. I am 20 years old and work as a la...\n",
1367 | "1875 Would you like to see my XXX pics they are so ...\n",
1368 | "1893 CALL 09090900040 & LISTEN TO EXTREME DIRTY LIV...\n",
1369 | "4298 thesmszone.com lets you send free anonymous an...\n",
1370 | "4949 Hi this is Amy, we will be sending you a free ...\n",
1371 | "2821 INTERFLORA - It's not too late to order Inter...\n",
1372 | "2247 Hi ya babe x u 4goten bout me?' scammers getti...\n",
1373 | "4514 Money i have won wining number 946 wot do i do...\n",
1374 | "Name: message, dtype: object"
1375 | ]
1376 | },
1377 | "execution_count": 45,
1378 | "metadata": {},
1379 | "output_type": "execute_result"
1380 | }
1381 | ],
1382 | "source": [
1383 | "# print message text for the false negatives (spam incorrectly classified as ham)\n",
1384 | "X_test[y_test > y_pred_class]"
1385 | ]
1386 | },
1387 | {
1388 | "cell_type": "code",
1389 | "execution_count": 46,
1390 | "metadata": {
1391 | "collapsed": false,
1392 | "scrolled": true
1393 | },
1394 | "outputs": [
1395 | {
1396 | "data": {
1397 | "text/plain": [
1398 | "\"LookAtMe!: Thanks for your purchase of a video clip from LookAtMe!, you've been charged 35p. Think you can do better? Why not send a video in a MMSto 32323.\""
1399 | ]
1400 | },
1401 | "execution_count": 46,
1402 | "metadata": {},
1403 | "output_type": "execute_result"
1404 | }
1405 | ],
1406 | "source": [
1407 | "# example false negative\n",
1408 | "X_test[3132]"
1409 | ]
1410 | },
1411 | {
1412 | "cell_type": "code",
1413 | "execution_count": 47,
1414 | "metadata": {
1415 | "collapsed": false
1416 | },
1417 | "outputs": [
1418 | {
1419 | "data": {
1420 | "text/plain": [
1421 | "array([ 2.87744864e-03, 1.83488846e-05, 2.07301295e-03, ...,\n",
1422 | " 1.09026171e-06, 1.00000000e+00, 3.98279868e-09])"
1423 | ]
1424 | },
1425 | "execution_count": 47,
1426 | "metadata": {},
1427 | "output_type": "execute_result"
1428 | }
1429 | ],
1430 | "source": [
1431 | "# calculate predicted probabilities for X_test_dtm (poorly calibrated)\n",
1432 | "y_pred_prob = nb.predict_proba(X_test_dtm)[:, 1]\n",
1433 | "y_pred_prob"
1434 | ]
1435 | },
1436 | {
1437 | "cell_type": "code",
1438 | "execution_count": 48,
1439 | "metadata": {
1440 | "collapsed": false
1441 | },
1442 | "outputs": [
1443 | {
1444 | "data": {
1445 | "text/plain": [
1446 | "0.98664310005369604"
1447 | ]
1448 | },
1449 | "execution_count": 48,
1450 | "metadata": {},
1451 | "output_type": "execute_result"
1452 | }
1453 | ],
1454 | "source": [
1455 | "# calculate AUC\n",
1456 | "metrics.roc_auc_score(y_test, y_pred_prob)"
1457 | ]
1458 | },
1459 | {
1460 | "cell_type": "markdown",
1461 | "metadata": {},
1462 | "source": [
1463 | "## Part 6: Comparing models\n",
1464 | "\n",
1465 | "We will compare multinomial Naive Bayes with [logistic regression](http://scikit-learn.org/stable/modules/linear_model.html#logistic-regression):\n",
1466 | "\n",
1467 | "> Logistic regression, despite its name, is a **linear model for classification** rather than regression. Logistic regression is also known in the literature as logit regression, maximum-entropy classification (MaxEnt) or the log-linear classifier. In this model, the probabilities describing the possible outcomes of a single trial are modeled using a logistic function."
1468 | ]
1469 | },
1470 | {
1471 | "cell_type": "code",
1472 | "execution_count": 49,
1473 | "metadata": {
1474 | "collapsed": true
1475 | },
1476 | "outputs": [],
1477 | "source": [
1478 | "# import and instantiate a logistic regression model\n",
1479 | "from sklearn.linear_model import LogisticRegression\n",
1480 | "logreg = LogisticRegression()"
1481 | ]
1482 | },
1483 | {
1484 | "cell_type": "code",
1485 | "execution_count": 50,
1486 | "metadata": {
1487 | "collapsed": false
1488 | },
1489 | "outputs": [
1490 | {
1491 | "name": "stdout",
1492 | "output_type": "stream",
1493 | "text": [
1494 | "Wall time: 131 ms\n"
1495 | ]
1496 | },
1497 | {
1498 | "data": {
1499 | "text/plain": [
1500 | "LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,\n",
1501 | " intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,\n",
1502 | " penalty='l2', random_state=None, solver='liblinear', tol=0.0001,\n",
1503 | " verbose=0, warm_start=False)"
1504 | ]
1505 | },
1506 | "execution_count": 50,
1507 | "metadata": {},
1508 | "output_type": "execute_result"
1509 | }
1510 | ],
1511 | "source": [
1512 | "# train the model using X_train_dtm\n",
1513 | "%time logreg.fit(X_train_dtm, y_train)"
1514 | ]
1515 | },
1516 | {
1517 | "cell_type": "code",
1518 | "execution_count": 51,
1519 | "metadata": {
1520 | "collapsed": true
1521 | },
1522 | "outputs": [],
1523 | "source": [
1524 | "# make class predictions for X_test_dtm\n",
1525 | "y_pred_class = logreg.predict(X_test_dtm)"
1526 | ]
1527 | },
1528 | {
1529 | "cell_type": "code",
1530 | "execution_count": 52,
1531 | "metadata": {
1532 | "collapsed": false
1533 | },
1534 | "outputs": [
1535 | {
1536 | "data": {
1537 | "text/plain": [
1538 | "array([ 0.01269556, 0.00347183, 0.00616517, ..., 0.03354907,\n",
1539 | " 0.99725053, 0.00157706])"
1540 | ]
1541 | },
1542 | "execution_count": 52,
1543 | "metadata": {},
1544 | "output_type": "execute_result"
1545 | }
1546 | ],
1547 | "source": [
1548 | "# calculate predicted probabilities for X_test_dtm (well calibrated)\n",
1549 | "y_pred_prob = logreg.predict_proba(X_test_dtm)[:, 1]\n",
1550 | "y_pred_prob"
1551 | ]
1552 | },
1553 | {
1554 | "cell_type": "code",
1555 | "execution_count": 53,
1556 | "metadata": {
1557 | "collapsed": false
1558 | },
1559 | "outputs": [
1560 | {
1561 | "data": {
1562 | "text/plain": [
1563 | "0.9877961234745154"
1564 | ]
1565 | },
1566 | "execution_count": 53,
1567 | "metadata": {},
1568 | "output_type": "execute_result"
1569 | }
1570 | ],
1571 | "source": [
1572 | "# calculate accuracy\n",
1573 | "metrics.accuracy_score(y_test, y_pred_class)"
1574 | ]
1575 | },
1576 | {
1577 | "cell_type": "code",
1578 | "execution_count": 54,
1579 | "metadata": {
1580 | "collapsed": false
1581 | },
1582 | "outputs": [
1583 | {
1584 | "data": {
1585 | "text/plain": [
1586 | "0.99368176123143015"
1587 | ]
1588 | },
1589 | "execution_count": 54,
1590 | "metadata": {},
1591 | "output_type": "execute_result"
1592 | }
1593 | ],
1594 | "source": [
1595 | "# calculate AUC\n",
1596 | "metrics.roc_auc_score(y_test, y_pred_prob)"
1597 | ]
1598 | }
1599 | ],
1600 | "metadata": {
1601 | "kernelspec": {
1602 | "display_name": "Python 2",
1603 | "language": "python",
1604 | "name": "python2"
1605 | },
1606 | "language_info": {
1607 | "codemirror_mode": {
1608 | "name": "ipython",
1609 | "version": 2
1610 | },
1611 | "file_extension": ".py",
1612 | "mimetype": "text/x-python",
1613 | "name": "python",
1614 | "nbconvert_exporter": "python",
1615 | "pygments_lexer": "ipython2",
1616 | "version": "2.7.11"
1617 | }
1618 | },
1619 | "nbformat": 4,
1620 | "nbformat_minor": 0
1621 | }
1622 |
--------------------------------------------------------------------------------
/youtube.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/justmarkham/pydata-dc-2016-tutorial/319cf51fee90fcfd3a4ddd56c60c2584c36689bf/youtube.jpg
--------------------------------------------------------------------------------