├── .gitignore
├── exercise.py
├── README.md
├── exercise.ipynb
├── exercise_solution.py
├── tutorial.py
├── tutorial.ipynb
├── exercise_solution.ipynb
└── tutorial_with_output.ipynb


/.gitignore:
--------------------------------------------------------------------------------
1 | .ipynb_checkpoints/
2 | .DS_Store
3 | *.pyc
4 | extras/
5 | *.tpl
6 | 


--------------------------------------------------------------------------------
/exercise.py:
--------------------------------------------------------------------------------
 1 | # # Tutorial Exercise: Yelp reviews
 2 | 
 3 | # ## Introduction
 4 | # 
 5 | # This exercise uses a small subset of the data from Kaggle's [Yelp Business Rating Prediction](https://www.kaggle.com/c/yelp-recsys-2013) competition.
 6 | # 
 7 | # **Description of the data:**
 8 | # 
 9 | # - **`yelp.csv`** contains the dataset. It is stored in the repository (in the **`data`** directory), so there is no need to download anything from the Kaggle website.
10 | # - Each observation (row) in this dataset is a review of a particular business by a particular user.
11 | # - The **stars** column is the number of stars (1 through 5) assigned by the reviewer to the business. (More stars means a better rating.) In other words, it is the rating of the business by the person who wrote the review.
12 | # - The **text** column is the text of the review.
13 | # 
14 | # **Goal:** Predict the star rating of a review using **only** the review text.
15 | # 
16 | # **Tip:** After each task, I recommend that you check the shape and the contents of your objects, to confirm that they match your expectations.
17 | 
18 | # ## Task 1
19 | # 
20 | # Read **`yelp.csv`** into a pandas DataFrame and examine it.
21 | 
22 | # ## Task 2
23 | # 
24 | # Create a new DataFrame that only contains the **5-star** and **1-star** reviews.
25 | # 
26 | # - **Hint:** [How do I apply multiple filter criteria to a pandas DataFrame?](http://nbviewer.jupyter.org/github/justmarkham/pandas-videos/blob/master/pandas.ipynb#9.-How-do-I-apply-multiple-filter-criteria-to-a-pandas-DataFrame%3F-%28video%29) explains how to do this.
27 | 
28 | # ## Task 3
29 | # 
30 | # Define X and y from the new DataFrame, and then split X and y into training and testing sets, using the **review text** as the only feature and the **star rating** as the response.
31 | # 
32 | # - **Hint:** Keep in mind that X should be a pandas Series (not a DataFrame), since we will pass it to CountVectorizer in the task that follows.
33 | 
34 | # ## Task 4
35 | # 
36 | # Use CountVectorizer to create **document-term matrices** from X_train and X_test.
37 | 
38 | # ## Task 5
39 | # 
40 | # Use multinomial Naive Bayes to **predict the star rating** for the reviews in the testing set, and then **calculate the accuracy** and **print the confusion matrix**.
41 | # 
42 | # - **Hint:** [Evaluating a classification model](https://github.com/justmarkham/scikit-learn-videos/blob/master/09_classification_metrics.ipynb) explains how to interpret both classification accuracy and the confusion matrix.
43 | 
44 | # ## Task 6 (Challenge)
45 | # 
46 | # Calculate the **null accuracy**, which is the classification accuracy that could be achieved by always predicting the most frequent class.
47 | # 
48 | # - **Hint:** [Evaluating a classification model](https://github.com/justmarkham/scikit-learn-videos/blob/master/09_classification_metrics.ipynb) explains null accuracy and demonstrates two ways to calculate it, though only one of those ways will work in this case. Alternatively, you can come up with your own method to calculate null accuracy!
49 | 
50 | # ## Task 7 (Challenge)
51 | # 
52 | # Browse through the review text of some of the **false positives** and **false negatives**. Based on your knowledge of how Naive Bayes works, do you have any ideas about why the model is incorrectly classifying these reviews?
53 | # 
54 | # - **Hint:** [Evaluating a classification model](https://github.com/justmarkham/scikit-learn-videos/blob/master/09_classification_metrics.ipynb) explains the definitions of "false positives" and "false negatives".
55 | # - **Hint:** Think about what a false positive means in this context, and what a false negative means in this context. What has scikit-learn defined as the "positive class"?
56 | 
57 | # ## Task 8 (Challenge)
58 | # 
59 | # Calculate which 10 tokens are the most predictive of **5-star reviews**, and which 10 tokens are the most predictive of **1-star reviews**.
60 | # 
61 | # - **Hint:** Naive Bayes automatically counts the number of times each token appears in each class, as well as the number of observations in each class. You can access these counts via the `feature_count_` and `class_count_` attributes of the Naive Bayes model object.
62 | 
63 | # ## Task 9 (Challenge)
64 | # 
65 | # Up to this point, we have framed this as a **binary classification problem** by only considering the 5-star and 1-star reviews. Now, let's repeat the model building process using all reviews, which makes this a **5-class classification problem**.
66 | # 
67 | # Here are the steps:
68 | # 
69 | # - Define X and y using the original DataFrame. (y should contain 5 different classes.)
70 | # - Split X and y into training and testing sets.
71 | # - Create document-term matrices using CountVectorizer.
72 | # - Calculate the testing accuracy of a Multinomial Naive Bayes model.
73 | # - Compare the testing accuracy with the null accuracy, and comment on the results.
74 | # - Print the confusion matrix, and comment on the results. (This [Stack Overflow answer](http://stackoverflow.com/a/30748053/1636598) explains how to read a multi-class confusion matrix.)
75 | # - Print the [classification report](http://scikit-learn.org/stable/modules/model_evaluation.html#classification-report), and comment on the results. If you are unfamiliar with the terminology it uses, research the terms, and then try to figure out how to calculate these metrics manually from the confusion matrix!
76 | 


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
 1 | ## Tutorial: Machine Learning with Text using Python
 2 | 
 3 | Round of applause to [Kevin Markham](http://www.dataschool.io/) and his video tutorials!
 4 | 
 5 | Instructor: [Vaibhav Srivastav](https://www.linkedin.com/in/vaibhavs10/)
 6 | 
 7 | ### Description
 8 | 
 9 | Although numeric data is easy to work with in Python, most knowledge created by humans is actually raw, unstructured text. By learning how to transform text into data that is usable by machine learning models, you drastically increase the amount of data that your models can learn from. In this tutorial, we'll build and evaluate predictive models from real-world text using scikit-learn.
10 | 
11 | ### Objectives
12 | 
13 | By the end of this tutorial, attendees will be able to confidently build a predictive model from their own text-based data, including feature extraction, model building and model evaluation.
14 | 
15 | ### Required Software
16 | 
17 | Attendees will need to bring a laptop with [scikit-learn](http://scikit-learn.org/stable/install.html) and [pandas](http://pandas.pydata.org/pandas-docs/stable/install.html) (and their dependencies) already installed. Installing the [Anaconda distribution of Python](https://www.continuum.io/downloads) is an easy way to accomplish this. Both Python 2 and 3 are welcome.
18 | 
19 | I will be leading the tutorial using the IPython/Jupyter notebook, and have added a pre-written notebook to this repository. I have also created a Python script that is identical to the notebook, which you can use in the Python environment of your choice.
20 | 
21 | ### Tutorial Files
22 | 
23 | * IPython/Jupyter notebooks: [tutorial.ipynb](tutorial.ipynb), [tutorial_with_output.ipynb](tutorial_with_output.ipynb), [exercise.ipynb](exercise.ipynb), [exercise_solution.ipynb](exercise_solution.ipynb)
24 | * Python scripts: [tutorial.py](tutorial.py), [exercise.py](exercise.py), [exercise_solution.py](exercise_solution.py)
25 | * Datasets: [data/sms.tsv](data/sms.tsv), [data/yelp.csv](data/yelp.csv)
26 | 
27 | ### Prerequisite Knowledge
28 | 
29 | Attendees of this tutorial should be comfortable working in Python, should understand the basic principles of machine learning, and should have at least basic experience with both pandas and scikit-learn. However, no knowledge of advanced mathematics is required.
30 | 
31 | ### Abstract
32 | 
33 | It can be difficult to figure out how to work with text in scikit-learn, even if you're already comfortable with the scikit-learn API. Many questions immediately come up: Which vectorizer should I use, and why? What's the difference between a "fit" and a "transform"? What's a document-term matrix, and why is it so sparse? Is it okay for my training data to have more features than observations? What's the appropriate machine learning model to use? And so on...
34 | 
35 | In this tutorial, we'll answer all of those questions, and more! We'll start by walking through the vectorization process in order to understand the input and output formats. Then we'll read a simple dataset into pandas, and immediately apply what we've learned about vectorization. We'll move on to the model building process, including a discussion of which model is most appropriate for the task. We'll evaluate our model a few different ways, and then examine the model for greater insight into how the text is influencing its predictions. Finally, we'll practice this entire workflow on a new dataset, and end with a discussion of which parts of the process are worth tuning for improved performance.
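
As a rough illustration of the "fit" versus "transform" question above, here is a minimal pure-Python sketch of how a document-term matrix can be built. It is a simplified stand-in, not scikit-learn's `CountVectorizer`: the helper names (`tokenize`, `fit_vocabulary`, `transform`) and the toy sentences are hypothetical, the tokenizer only lowercases and splits on whitespace, and the result is a dense list of lists rather than a sparse matrix.

```python
from collections import Counter


def tokenize(text):
    # Lowercase and split on whitespace (a deliberate simplification).
    return text.lower().split()


def fit_vocabulary(documents):
    # "fit": learn the vocabulary from the training documents only,
    # mapping each token to a column index in alphabetical order.
    tokens = sorted({token for doc in documents for token in tokenize(doc)})
    return {token: column for column, token in enumerate(tokens)}


def transform(documents, vocabulary):
    # "transform": count known tokens per document; tokens that are not
    # in the learned vocabulary are silently ignored.
    matrix = []
    for doc in documents:
        counts = Counter(tokenize(doc))
        matrix.append([counts.get(token, 0) for token in vocabulary])
    return matrix


# Hypothetical toy data for illustration.
train = ["call you tonight", "call me a cab", "please call me please"]
test = ["please don't call me"]

vocab = fit_vocabulary(train)        # learned from training data only
train_dtm = transform(train, vocab)  # one row per training document
test_dtm = transform(test, vocab)    # same columns; "don't" is dropped

print(list(vocab))
print(train_dtm)
print(test_dtm)
```

Most entries in each row are zero because any single document uses only a small part of the vocabulary, which is why real document-term matrices are usually stored in a sparse format.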
36 | 
37 | ### Detailed Outline
38 | 
39 | 1. Model building in scikit-learn (refresher)
40 | 2. Representing text as numerical data
41 | 3. Reading a text-based dataset into pandas
42 | 4. Vectorizing our dataset
43 | 5. Building and evaluating a model
44 | 6. Comparing models
45 | 7. Examining a model for further insight
46 | 8. Practicing this workflow on another dataset
47 | 9. Tuning the vectorizer (discussion)
48 | 
49 | ### About the Instructor
50 | 
51 | Vaibhav Srivastav is a Data Scientist currently working with Deloitte Consulting LLP. He has more than three years of demonstrated experience building large-scale machine learning and natural language processing solutions for Fortune Technology 10 clients.
52 | 
53 | In his free time, he teaches machine learning and data science to young coders! If Python is what floats your boat, then hit him up on any of the channels below:
54 | 
55 | * Email: [Vaibhavs10@gmail.com](mailto:vaibhavs10@gmail.com)
56 | * Twitter: [@Vaibhavsriv10](https://twitter.com/vaibhavsriv10)
57 | 
58 | ### Recommended Resources
59 | 
60 | **Text classification:**
61 | * Read Paul Graham's classic post, [A Plan for Spam](http://www.paulgraham.com/spam.html), for an overview of a basic text classification system using a Bayesian approach. (He also wrote a [follow-up post](http://www.paulgraham.com/better.html) about how he improved his spam filter.)
62 | * Coursera's Natural Language Processing (NLP) course has [video lectures](https://class.coursera.org/nlp/lecture) on text classification, tokenization, Naive Bayes, and many other fundamental NLP topics. (Here are the [slides](http://web.stanford.edu/~jurafsky/NLPCourseraSlides.html) used in all of the videos.)
63 | * [Automatically Categorizing Yelp Businesses](http://engineeringblog.yelp.com/2015/09/automatically-categorizing-yelp-businesses.html) discusses how Yelp uses NLP and scikit-learn to solve the problem of uncategorized businesses.
64 | * [How to Read the Mind of a Supreme Court Justice](http://fivethirtyeight.com/features/how-to-read-the-mind-of-a-supreme-court-justice/) discusses CourtCast, a machine learning model that predicts the outcome of Supreme Court cases using text-based features only. (The CourtCast creator wrote a post explaining [how it works](https://sciencecowboy.wordpress.com/2015/03/05/predicting-the-supreme-court-from-oral-arguments/), and the [Python code](https://github.com/nasrallah/CourtCast) is available on GitHub.)
65 | * [Identifying Humorous Cartoon Captions](http://www.cs.huji.ac.il/~dshahaf/pHumor.pdf) is a readable paper about identifying funny captions submitted to the New Yorker Caption Contest.
66 | * In this [PyData video](https://www.youtube.com/watch?v=y3ZTKFZ-1QQ) (50 minutes), Facebook explains how they use scikit-learn for sentiment classification by training a Naive Bayes model on emoji-labeled data.
67 | 
68 | **Naive Bayes and logistic regression:**
69 | * Read this brief Quora post on [airport security](http://www.quora.com/In-laymans-terms-how-does-Naive-Bayes-work/answer/Konstantin-Tt) for an intuitive explanation of how Naive Bayes classification works.
70 | * For a longer introduction to Naive Bayes, read Sebastian Raschka's article on [Naive Bayes and Text Classification](http://sebastianraschka.com/Articles/2014_naive_bayes_1.html). In addition, Wikipedia has two excellent articles ([Naive Bayes classifier](http://en.wikipedia.org/wiki/Naive_Bayes_classifier) and [Naive Bayes spam filtering](http://en.wikipedia.org/wiki/Naive_Bayes_spam_filtering)), and Cross Validated has a good [Q&A](http://stats.stackexchange.com/questions/21822/understanding-naive-bayes).
71 | * Kevin Markham's [guide to an in-depth understanding of logistic regression](http://www.dataschool.io/guide-to-logistic-regression/) includes a lesson notebook and a curated list of resources for going deeper into this topic.
72 | * [Comparison of Machine Learning Models](https://github.com/justmarkham/DAT8/blob/master/other/model_comparison.md) lists the advantages and disadvantages of Naive Bayes, logistic regression, and other classification and regression models.
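
The resources above cover Naive Bayes in depth. As a rough companion, the sketch below shows the counting idea behind a multinomial-style Naive Bayes text classifier with add-one (Laplace) smoothing. It is an illustrative simplification in pure Python (Python 3 assumed), not scikit-learn's `MultinomialNB`; the function names and the toy reviews are hypothetical.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(documents, labels):
    # Count documents per class and token occurrences per class.
    class_counts = Counter(labels)
    token_counts = defaultdict(Counter)
    for doc, label in zip(documents, labels):
        token_counts[label].update(doc.lower().split())
    vocabulary = {token for counts in token_counts.values() for token in counts}
    return class_counts, token_counts, vocabulary


def predict(text, class_counts, token_counts, vocabulary):
    total_docs = sum(class_counts.values())
    scores = {}
    for label in class_counts:
        # Start from the log prior of the class.
        score = math.log(class_counts[label] / total_docs)
        total_tokens = sum(token_counts[label].values())
        for token in text.lower().split():
            if token in vocabulary:
                # Add-one smoothing keeps unseen (token, class) pairs
                # from producing a zero probability.
                numerator = token_counts[label][token] + 1
                denominator = total_tokens + len(vocabulary)
                score += math.log(numerator / denominator)
        scores[label] = score
    # Return the class with the highest log score.
    return max(scores, key=scores.get)


# Hypothetical toy reviews labeled with star ratings.
docs = ["great food great service", "terrible food",
        "great place", "terrible rude service"]
stars = [5, 1, 5, 1]

class_counts, token_counts, vocabulary = train_naive_bayes(docs, stars)
print(predict("great service", class_counts, token_counts, vocabulary))
print(predict("terrible rude", class_counts, token_counts, vocabulary))
```

The per-class token counts and per-class document counts computed here play the same role as the counts a fitted Naive Bayes model keeps internally, although the data structures differ.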


--------------------------------------------------------------------------------
/exercise.ipynb:
--------------------------------------------------------------------------------
  1 | {
  2 |  "cells": [
  3 |   {
  4 |    "cell_type": "markdown",
  5 |    "metadata": {},
  6 |    "source": [
  7 |     "# Tutorial Exercise: Yelp reviews"
  8 |    ]
  9 |   },
 10 |   {
 11 |    "cell_type": "markdown",
 12 |    "metadata": {},
 13 |    "source": [
 14 |     "## Introduction\n",
 15 |     "\n",
 16 |     "This exercise uses a small subset of the data from Kaggle's [Yelp Business Rating Prediction](https://www.kaggle.com/c/yelp-recsys-2013) competition.\n",
 17 |     "\n",
 18 |     "**Description of the data:**\n",
 19 |     "\n",
 20 |     "- **`yelp.csv`** contains the dataset. It is stored in the repository (in the **`data`** directory), so there is no need to download anything from the Kaggle website.\n",
 21 |     "- Each observation (row) in this dataset is a review of a particular business by a particular user.\n",
 22 |     "- The **stars** column is the number of stars (1 through 5) assigned by the reviewer to the business. (More stars means a better rating.) In other words, it is the rating of the business by the person who wrote the review.\n",
 23 |     "- The **text** column is the text of the review.\n",
 24 |     "\n",
 25 |     "**Goal:** Predict the star rating of a review using **only** the review text.\n",
 26 |     "\n",
 27 |     "**Tip:** After each task, I recommend that you check the shape and the contents of your objects, to confirm that they match your expectations."
 28 |    ]
 29 |   },
 30 |   {
 31 |    "cell_type": "markdown",
 32 |    "metadata": {},
 33 |    "source": [
 34 |     "## Task 1\n",
 35 |     "\n",
 36 |     "Read **`yelp.csv`** into a pandas DataFrame and examine it."
 37 |    ]
 38 |   },
 39 |   {
 40 |    "cell_type": "markdown",
 41 |    "metadata": {},
 42 |    "source": [
 43 |     "## Task 2\n",
 44 |     "\n",
 45 |     "Create a new DataFrame that only contains the **5-star** and **1-star** reviews.\n",
 46 |     "\n",
 47 |     "- **Hint:** [How do I apply multiple filter criteria to a pandas DataFrame?](http://nbviewer.jupyter.org/github/justmarkham/pandas-videos/blob/master/pandas.ipynb#9.-How-do-I-apply-multiple-filter-criteria-to-a-pandas-DataFrame%3F-%28video%29) explains how to do this."
 48 |    ]
 49 |   },
 50 |   {
 51 |    "cell_type": "markdown",
 52 |    "metadata": {},
 53 |    "source": [
 54 |     "## Task 3\n",
 55 |     "\n",
 56 |     "Define X and y from the new DataFrame, and then split X and y into training and testing sets, using the **review text** as the only feature and the **star rating** as the response.\n",
 57 |     "\n",
 58 |     "- **Hint:** Keep in mind that X should be a pandas Series (not a DataFrame), since we will pass it to CountVectorizer in the task that follows."
 59 |    ]
 60 |   },
 61 |   {
 62 |    "cell_type": "markdown",
 63 |    "metadata": {},
 64 |    "source": [
 65 |     "## Task 4\n",
 66 |     "\n",
 67 |     "Use CountVectorizer to create **document-term matrices** from X_train and X_test."
 68 |    ]
 69 |   },
 70 |   {
 71 |    "cell_type": "markdown",
 72 |    "metadata": {},
 73 |    "source": [
 74 |     "## Task 5\n",
 75 |     "\n",
 76 |     "Use multinomial Naive Bayes to **predict the star rating** for the reviews in the testing set, and then **calculate the accuracy** and **print the confusion matrix**.\n",
 77 |     "\n",
 78 |     "- **Hint:** [Evaluating a classification model](https://github.com/justmarkham/scikit-learn-videos/blob/master/09_classification_metrics.ipynb) explains how to interpret both classification accuracy and the confusion matrix."
 79 |    ]
 80 |   },
 81 |   {
 82 |    "cell_type": "markdown",
 83 |    "metadata": {},
 84 |    "source": [
 85 |     "## Task 6 (Challenge)\n",
 86 |     "\n",
 87 |     "Calculate the **null accuracy**, which is the classification accuracy that could be achieved by always predicting the most frequent class.\n",
 88 |     "\n",
 89 |     "- **Hint:** [Evaluating a classification model](https://github.com/justmarkham/scikit-learn-videos/blob/master/09_classification_metrics.ipynb) explains null accuracy and demonstrates two ways to calculate it, though only one of those ways will work in this case. Alternatively, you can come up with your own method to calculate null accuracy!"
 90 |    ]
 91 |   },
 92 |   {
 93 |    "cell_type": "markdown",
 94 |    "metadata": {},
 95 |    "source": [
 96 |     "## Task 7 (Challenge)\n",
 97 |     "\n",
 98 |     "Browse through the review text of some of the **false positives** and **false negatives**. Based on your knowledge of how Naive Bayes works, do you have any ideas about why the model is incorrectly classifying these reviews?\n",
 99 |     "\n",
100 |     "- **Hint:** [Evaluating a classification model](https://github.com/justmarkham/scikit-learn-videos/blob/master/09_classification_metrics.ipynb) explains the definitions of \"false positives\" and \"false negatives\".\n",
101 |     "- **Hint:** Think about what a false positive means in this context, and what a false negative means in this context. What has scikit-learn defined as the \"positive class\"?"
102 |    ]
103 |   },
104 |   {
105 |    "cell_type": "markdown",
106 |    "metadata": {},
107 |    "source": [
108 |     "## Task 8 (Challenge)\n",
109 |     "\n",
110 |     "Calculate which 10 tokens are the most predictive of **5-star reviews**, and which 10 tokens are the most predictive of **1-star reviews**.\n",
111 |     "\n",
112 |     "- **Hint:** Naive Bayes automatically counts the number of times each token appears in each class, as well as the number of observations in each class. You can access these counts via the `feature_count_` and `class_count_` attributes of the Naive Bayes model object."
113 |    ]
114 |   },
115 |   {
116 |    "cell_type": "markdown",
117 |    "metadata": {},
118 |    "source": [
119 |     "## Task 9 (Challenge)\n",
120 |     "\n",
121 |     "Up to this point, we have framed this as a **binary classification problem** by only considering the 5-star and 1-star reviews. Now, let's repeat the model building process using all reviews, which makes this a **5-class classification problem**.\n",
122 |     "\n",
123 |     "Here are the steps:\n",
124 |     "\n",
125 |     "- Define X and y using the original DataFrame. (y should contain 5 different classes.)\n",
126 |     "- Split X and y into training and testing sets.\n",
127 |     "- Create document-term matrices using CountVectorizer.\n",
128 |     "- Calculate the testing accuracy of a Multinomial Naive Bayes model.\n",
129 |     "- Compare the testing accuracy with the null accuracy, and comment on the results.\n",
130 |     "- Print the confusion matrix, and comment on the results. (This [Stack Overflow answer](http://stackoverflow.com/a/30748053/1636598) explains how to read a multi-class confusion matrix.)\n",
131 |     "- Print the [classification report](http://scikit-learn.org/stable/modules/model_evaluation.html#classification-report), and comment on the results. If you are unfamiliar with the terminology it uses, research the terms, and then try to figure out how to calculate these metrics manually from the confusion matrix!"
132 |    ]
133 |   }
134 |  ],
135 |  "metadata": {
136 |   "kernelspec": {
137 |    "display_name": "Python 2",
138 |    "language": "python",
139 |    "name": "python2"
140 |   },
141 |   "language_info": {
142 |    "codemirror_mode": {
143 |     "name": "ipython",
144 |     "version": 2
145 |    },
146 |    "file_extension": ".py",
147 |    "mimetype": "text/x-python",
148 |    "name": "python",
149 |    "nbconvert_exporter": "python",
150 |    "pygments_lexer": "ipython2",
151 |    "version": "2.7.11"
152 |   }
153 |  },
154 |  "nbformat": 4,
155 |  "nbformat_minor": 0
156 | }
157 | 


--------------------------------------------------------------------------------
/exercise_solution.py:
--------------------------------------------------------------------------------
  1 | # # Tutorial Exercise: Yelp reviews (Solution)
  2 | 
  3 | # ## Introduction
  4 | # 
  5 | # This exercise uses a small subset of the data from Kaggle's [Yelp Business Rating Prediction](https://www.kaggle.com/c/yelp-recsys-2013) competition.
  6 | # 
  7 | # **Description of the data:**
  8 | # 
  9 | # - **`yelp.csv`** contains the dataset. It is stored in the repository (in the **`data`** directory), so there is no need to download anything from the Kaggle website.
 10 | # - Each observation (row) in this dataset is a review of a particular business by a particular user.
 11 | # - The **stars** column is the number of stars (1 through 5) assigned by the reviewer to the business. (More stars means a better rating.) In other words, it is the rating of the business by the person who wrote the review.
 12 | # - The **text** column is the text of the review.
 13 | # 
 14 | # **Goal:** Predict the star rating of a review using **only** the review text.
 15 | # 
 16 | # **Tip:** After each task, I recommend that you check the shape and the contents of your objects, to confirm that they match your expectations.
 17 | 
 18 | # for Python 2: use print only as a function
 19 | from __future__ import print_function
 20 | 
 21 | 
 22 | # ## Task 1
 23 | # 
 24 | # Read **`yelp.csv`** into a pandas DataFrame and examine it.
 25 | 
 26 | # read yelp.csv using a relative path
 27 | import pandas as pd
 28 | path = 'data/yelp.csv'
 29 | yelp = pd.read_csv(path)
 30 | 
 31 | 
 32 | # examine the shape
 33 | yelp.shape
 34 | 
 35 | 
 36 | # examine the first row
 37 | yelp.head(1)
 38 | 
 39 | 
 40 | # examine the class distribution
 41 | yelp.stars.value_counts().sort_index()
 42 | 
 43 | 
 44 | # ## Task 2
 45 | # 
 46 | # Create a new DataFrame that only contains the **5-star** and **1-star** reviews.
 47 | # 
 48 | # - **Hint:** [How do I apply multiple filter criteria to a pandas DataFrame?](http://nbviewer.jupyter.org/github/justmarkham/pandas-videos/blob/master/pandas.ipynb#9.-How-do-I-apply-multiple-filter-criteria-to-a-pandas-DataFrame%3F-%28video%29) explains how to do this.
 49 | 
 50 | # filter the DataFrame using an OR condition
 51 | yelp_best_worst = yelp[(yelp.stars==5) | (yelp.stars==1)]
 52 | 
 53 | # equivalently, use the 'loc' indexer
 54 | yelp_best_worst = yelp.loc[(yelp.stars==5) | (yelp.stars==1), :]
 55 | 
 56 | 
 57 | # examine the shape
 58 | yelp_best_worst.shape
 59 | 
 60 | 
 61 | # ## Task 3
 62 | # 
 63 | # Define X and y from the new DataFrame, and then split X and y into training and testing sets, using the **review text** as the only feature and the **star rating** as the response.
 64 | # 
 65 | # - **Hint:** Keep in mind that X should be a pandas Series (not a DataFrame), since we will pass it to CountVectorizer in the task that follows.
 66 | 
 67 | # define X and y
 68 | X = yelp_best_worst.text
 69 | y = yelp_best_worst.stars
 70 | 
 71 | 
 72 | # split X and y into training and testing sets (sklearn.cross_validation was removed in scikit-learn 0.20)
 73 | from sklearn.model_selection import train_test_split
 74 | X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
 75 | 
 76 | 
 77 | # examine the object shapes
 78 | print(X_train.shape)
 79 | print(X_test.shape)
 80 | print(y_train.shape)
 81 | print(y_test.shape)
 82 | 
 83 | 
 84 | # ## Task 4
 85 | # 
 86 | # Use CountVectorizer to create **document-term matrices** from X_train and X_test.
 87 | 
 88 | # import and instantiate CountVectorizer
 89 | from sklearn.feature_extraction.text import CountVectorizer
 90 | vect = CountVectorizer()
 91 | 
 92 | 
 93 | # fit and transform X_train into X_train_dtm
 94 | X_train_dtm = vect.fit_transform(X_train)
 95 | X_train_dtm.shape
 96 | 
 97 | 
 98 | # transform X_test into X_test_dtm
 99 | X_test_dtm = vect.transform(X_test)
100 | X_test_dtm.shape
101 | 
102 | 
103 | # ## Task 5
104 | # 
105 | # Use multinomial Naive Bayes to **predict the star rating** for the reviews in the testing set, and then **calculate the accuracy** and **print the confusion matrix**.
106 | # 
107 | # - **Hint:** [Evaluating a classification model](https://github.com/justmarkham/scikit-learn-videos/blob/master/09_classification_metrics.ipynb) explains how to interpret both classification accuracy and the confusion matrix.
108 | 
109 | # import and instantiate MultinomialNB
110 | from sklearn.naive_bayes import MultinomialNB
111 | nb = MultinomialNB()
112 | 
113 | 
114 | # train the model using X_train_dtm
115 | nb.fit(X_train_dtm, y_train)
116 | 
117 | 
118 | # make class predictions for X_test_dtm
119 | y_pred_class = nb.predict(X_test_dtm)
120 | 
121 | 
122 | # calculate accuracy of class predictions
123 | from sklearn import metrics
124 | metrics.accuracy_score(y_test, y_pred_class)
125 | 
126 | 
127 | # print the confusion matrix
128 | metrics.confusion_matrix(y_test, y_pred_class)
129 | 
130 | 
131 | # ## Task 6 (Challenge)
132 | # 
133 | # Calculate the **null accuracy**, which is the classification accuracy that could be achieved by always predicting the most frequent class.
134 | # 
135 | # - **Hint:** [Evaluating a classification model](https://github.com/justmarkham/scikit-learn-videos/blob/master/09_classification_metrics.ipynb) explains null accuracy and demonstrates two ways to calculate it, though only one of those ways will work in this case. Alternatively, you can come up with your own method to calculate null accuracy!
136 | 
137 | # examine the class distribution of the testing set
138 | y_test.value_counts()
139 | 
140 | 
141 | # calculate null accuracy
142 | y_test.value_counts().head(1) / len(y_test)
143 | 
144 | 
145 | # calculate null accuracy manually
146 | 838 / float(838 + 184)
147 | 
148 | 
149 | # ## Task 7 (Challenge)
150 | # 
151 | # Browse through the review text of some of the **false positives** and **false negatives**. Based on your knowledge of how Naive Bayes works, do you have any ideas about why the model is incorrectly classifying these reviews?
152 | # 
153 | # - **Hint:** [Evaluating a classification model](https://github.com/justmarkham/scikit-learn-videos/blob/master/09_classification_metrics.ipynb) explains the definitions of "false positives" and "false negatives".
154 | # - **Hint:** Think about what a false positive means in this context, and what a false negative means in this context. What has scikit-learn defined as the "positive class"?
155 | 
156 | # first 10 false positives (1-star reviews incorrectly classified as 5-star reviews)
157 | X_test[y_test < y_pred_class].head(10)
158 | 
159 | 
160 | # false positive: model is reacting to the words "good", "impressive", "nice"
161 | X_test[1781]
162 | 
163 | 
164 | # false positive: model does not have enough data to work with
165 | X_test[1919]
166 | 
167 | 
168 | # first 10 false negatives (5-star reviews incorrectly classified as 1-star reviews)
169 | X_test[y_test > y_pred_class].head(10)
170 | 
171 | 
172 | # false negative: model is reacting to the words "complain", "crowds", "rushing", "pricey", "scum"
173 | X_test[4963]
174 | 
175 | 
176 | # ## Task 8 (Challenge)
177 | # 
178 | # Calculate which 10 tokens are the most predictive of **5-star reviews**, and which 10 tokens are the most predictive of **1-star reviews**.
179 | # 
180 | # - **Hint:** Naive Bayes automatically counts the number of times each token appears in each class, as well as the number of observations in each class. You can access these counts via the `feature_count_` and `class_count_` attributes of the Naive Bayes model object.
181 | 
182 | # store the vocabulary of X_train (on scikit-learn versions before 1.0, use get_feature_names() instead)
183 | X_train_tokens = vect.get_feature_names_out()
184 | len(X_train_tokens)
185 | 
186 | 
187 | # first row is one-star reviews, second row is five-star reviews
188 | nb.feature_count_.shape
189 | 
190 | 
191 | # store the number of times each token appears across each class
192 | one_star_token_count = nb.feature_count_[0, :]
193 | five_star_token_count = nb.feature_count_[1, :]
194 | 
195 | 
196 | # create a DataFrame of tokens with their separate one-star and five-star counts
197 | tokens = pd.DataFrame({'token':X_train_tokens, 'one_star':one_star_token_count, 'five_star':five_star_token_count}).set_index('token')
198 | 
199 | 
200 | # add 1 to one-star and five-star counts to avoid dividing by 0
201 | tokens['one_star'] = tokens.one_star + 1
202 | tokens['five_star'] = tokens.five_star + 1
203 | 
204 | 
205 | # first number is one-star reviews, second number is five-star reviews
206 | nb.class_count_
207 | 
208 | 
209 | # convert the one-star and five-star counts into frequencies
210 | tokens['one_star'] = tokens.one_star / nb.class_count_[0]
211 | tokens['five_star'] = tokens.five_star / nb.class_count_[1]
212 | 
213 | 
214 | # calculate the ratio of five-star to one-star for each token
215 | tokens['five_star_ratio'] = tokens.five_star / tokens.one_star
216 | 
217 | 
218 | # sort the DataFrame by five_star_ratio (descending order), and examine the first 10 rows
219 | # note: use sort() instead of sort_values() for pandas 0.16.2 and earlier
220 | tokens.sort_values('five_star_ratio', ascending=False).head(10)
221 | 
222 | 
223 | # sort the DataFrame by five_star_ratio (ascending order), and examine the first 10 rows
224 | tokens.sort_values('five_star_ratio', ascending=True).head(10)
225 | 
226 | 
227 | # ## Task 9 (Challenge)
228 | # 
229 | # Up to this point, we have framed this as a **binary classification problem** by only considering the 5-star and 1-star reviews. Now, let's repeat the model building process using all reviews, which makes this a **5-class classification problem**.
230 | # 
231 | # Here are the steps:
232 | # 
233 | # - Define X and y using the original DataFrame. (y should contain 5 different classes.)
234 | # - Split X and y into training and testing sets.
235 | # - Create document-term matrices using CountVectorizer.
236 | # - Calculate the testing accuracy of a Multinomial Naive Bayes model.
237 | # - Compare the testing accuracy with the null accuracy, and comment on the results.
238 | # - Print the confusion matrix, and comment on the results. (This [Stack Overflow answer](http://stackoverflow.com/a/30748053/1636598) explains how to read a multi-class confusion matrix.)
239 | # - Print the [classification report](http://scikit-learn.org/stable/modules/model_evaluation.html#classification-report), and comment on the results. If you are unfamiliar with the terminology it uses, research the terms, and then try to figure out how to calculate these metrics manually from the confusion matrix!
240 | 
241 | # define X and y using the original DataFrame
242 | X = yelp.text
243 | y = yelp.stars
244 | 
245 | 
246 | # check that y contains 5 different classes
247 | y.value_counts().sort_index()
248 | 
249 | 
250 | # split X and y into training and testing sets
251 | X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
252 | 
253 | 
254 | # create document-term matrices using CountVectorizer
255 | X_train_dtm = vect.fit_transform(X_train)
256 | X_test_dtm = vect.transform(X_test)
257 | 
258 | 
259 | # fit a Multinomial Naive Bayes model
260 | nb.fit(X_train_dtm, y_train)
261 | 
262 | 
263 | # make class predictions
264 | y_pred_class = nb.predict(X_test_dtm)
265 | 
266 | 
267 | # calculate the accuracy
268 | metrics.accuracy_score(y_test, y_pred_class)
269 | 
270 | 
271 | # calculate the null accuracy
272 | y_test.value_counts().head(1) / len(y_test)
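
# Null accuracy is the accuracy you would get by always predicting the most frequent class. The sketch below computes it in plain Python on made-up labels (the `null_accuracy` helper and the `toy_labels` list are hypothetical and are not part of the Yelp data), which may help when checking the pandas result by hand. It assumes Python 3 division.

```python
from collections import Counter


def null_accuracy(labels):
    """Return (most frequent class, share of observations in that class)."""
    counts = Counter(labels)
    top_class, top_count = counts.most_common(1)[0]
    return top_class, top_count / len(labels)


# hypothetical star ratings, for illustration only
toy_labels = [5, 4, 4, 1, 4, 5, 3, 4, 2, 5]
top_class, share = null_accuracy(toy_labels)
print(top_class, share)
```

# A model is only adding value if its testing accuracy is higher than this baseline.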
273 | 
274 | 
275 | # **Accuracy comments:** At first glance, 47% accuracy does not seem very good, given that it is not much higher than the null accuracy. However, I would consider the 47% accuracy to be quite impressive, given that humans would also have a hard time precisely identifying the star rating for many of these reviews.
276 | 
277 | # print the confusion matrix
278 | metrics.confusion_matrix(y_test, y_pred_class)
279 | 
280 | 
281 | # **Confusion matrix comments:**
282 | # 
283 | # - Nearly all 4-star and 5-star reviews are classified as 4 or 5 stars, but the model has a hard time telling the two apart.
284 | # - 1-star, 2-star, and 3-star reviews are most commonly classified as 4 stars, probably because it's the predominant class in the training data.
285 | 
286 | # print the classification report
287 | print(metrics.classification_report(y_test, y_pred_class))
288 | 
289 | 
290 | # **Precision** answers the question: "When a given class is predicted, how often are those predictions correct?" To calculate the precision for class 1, for example, you divide 55 by the sum of the first column of the confusion matrix.
291 | 
292 | # manually calculate the precision for class 1
293 | precision = 55 / float(55 + 28 + 5 + 7 + 6)
294 | print(precision)
295 | 
296 | 
297 | # **Recall** answers the question: "When a given class is the true class, how often is that class predicted?" To calculate the recall for class 1, for example, you divide 55 by the sum of the first row of the confusion matrix.
298 | 
299 | # manually calculate the recall for class 1
300 | recall = 55 / float(55 + 14 + 24 + 65 + 27)
301 | print(recall)
302 | 
303 | 
304 | # **F1 score** is the harmonic mean of precision and recall.
305 | 
306 | # manually calculate the F1 score for class 1
307 | f1 = 2 * (precision * recall) / (precision + recall)
308 | print(f1)
309 | 
310 | 
311 | # **Support** answers the question: "How many observations exist for which a given class is the true class?" To calculate the support for class 1, for example, you sum the first row of the confusion matrix.
312 | 
313 | # manually calculate the support for class 1
314 | support = 55 + 14 + 24 + 65 + 27
315 | print(support)
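
# The four manual calculations above follow the same pattern for every class, so they can be wrapped in one function. This is a minimal sketch, assuming the scikit-learn layout (rows are true classes, columns are predicted classes) and Python 3 division. The `per_class_metrics` helper and the 3x3 `toy_cm` matrix are hypothetical and are not the Yelp results.

```python
def per_class_metrics(cm, k):
    """Return (precision, recall, f1, support) for class index k.

    cm is a square list of lists: cm[i][j] counts observations whose
    true class is i and whose predicted class is j.
    """
    true_positives = cm[k][k]
    predicted_k = sum(row[k] for row in cm)  # column sum: everything predicted as k
    actual_k = sum(cm[k])                    # row sum: everything that truly is k
    precision = true_positives / predicted_k if predicted_k else 0.0
    recall = true_positives / actual_k if actual_k else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1, actual_k


# hypothetical confusion matrix, for illustration only
toy_cm = [
    [5, 2, 1],
    [1, 6, 1],
    [0, 2, 7],
]
for k in range(len(toy_cm)):
    print(k, per_class_metrics(toy_cm, k))
```

# Comparing these values against `metrics.classification_report` is a quick way to confirm that you are reading the matrix in the right orientation.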
316 | 
317 | 
318 | # **Classification report comments:**
319 | # 
320 | # - Class 1 has low recall, meaning that the model has a hard time detecting the 1-star reviews, but high precision, meaning that when the model predicts a review is 1-star, it's usually correct.
321 | # - Class 5 has high recall and precision, probably because 5-star reviews have polarized language, and because the model has a lot of observations to learn from.
322 | 


--------------------------------------------------------------------------------
/tutorial.py:
--------------------------------------------------------------------------------
  1 | # # Tutorial: Machine Learning with Text in scikit-learn
  2 | 
  3 | # ## Agenda
  4 | # 
  5 | # 1. Model building in scikit-learn (refresher)
  6 | # 2. Representing text as numerical data
  7 | # 3. Reading a text-based dataset into pandas
  8 | # 4. Vectorizing our dataset
  9 | # 5. Building and evaluating a model
 10 | # 6. Comparing models
 11 | # 7. Examining a model for further insight
 12 | # 8. Practicing this workflow on another dataset
 13 | # 9. Tuning the vectorizer (discussion)
 14 | 
 15 | # for Python 2: use print only as a function
 16 | from __future__ import print_function
 17 | 
 18 | 
 19 | # ## Part 1: Model building in scikit-learn (refresher)
 20 | 
 21 | # load the iris dataset as an example
 22 | from sklearn.datasets import load_iris
 23 | iris = load_iris()
 24 | 
 25 | 
 26 | # store the feature matrix (X) and response vector (y)
 27 | X = iris.data
 28 | y = iris.target
 29 | 
 30 | 
 31 | # **"Features"** are also known as predictors, inputs, or attributes. The **"response"** is also known as the target, label, or output.
 32 | 
 33 | # check the shapes of X and y
 34 | print(X.shape)
 35 | print(y.shape)
 36 | 
 37 | 
 38 | # **"Observations"** are also known as samples, instances, or records.
 39 | 
 40 | # examine the first 5 rows of the feature matrix (including the feature names)
 41 | import pandas as pd
 42 | pd.DataFrame(X, columns=iris.feature_names).head()
 43 | 
 44 | 
 45 | # examine the response vector
 46 | print(y)
 47 | 
 48 | 
 49 | # In order to **build a model**, the features must be **numeric**, and every observation must have the **same features in the same order**.
 50 | 
 51 | # import the class
 52 | from sklearn.neighbors import KNeighborsClassifier
 53 | 
 54 | # instantiate the model (with the default parameters)
 55 | knn = KNeighborsClassifier()
 56 | 
 57 | # fit the model with data (occurs in-place)
 58 | knn.fit(X, y)
 59 | 
 60 | 
 61 | # In order to **make a prediction**, the new observation must have the **same features as the training observations**, both in number and meaning.
 62 | 
 63 | # predict the response for a new observation
 64 | knn.predict([[3, 5, 4, 2]])
 65 | 
 66 | 
 67 | # ## Part 2: Representing text as numerical data
 68 | 
 69 | # example text for model training (SMS messages)
 70 | simple_train = ['call you tonight', 'Call me a cab', 'please call me... PLEASE!']
 71 | 
 72 | 
 73 | # From the [scikit-learn documentation](http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction):
 74 | # 
 75 | # > Text Analysis is a major application field for machine learning algorithms. However the raw data, a sequence of symbols cannot be fed directly to the algorithms themselves as most of them expect **numerical feature vectors with a fixed size** rather than the **raw text documents with variable length**.
 76 | # 
 77 | # We will use [CountVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) to "convert text into a matrix of token counts":
 78 | 
 79 | # import and instantiate CountVectorizer (with the default parameters)
 80 | from sklearn.feature_extraction.text import CountVectorizer
 81 | vect = CountVectorizer()
 82 | 
 83 | 
 84 | # learn the 'vocabulary' of the training data (occurs in-place)
 85 | vect.fit(simple_train)
 86 | 
 87 | 
 88 | # examine the fitted vocabulary (note: use get_feature_names_out() instead for scikit-learn 1.2 and later)
 89 | vect.get_feature_names()
 90 | 
 91 | 
 92 | # transform training data into a 'document-term matrix'
 93 | simple_train_dtm = vect.transform(simple_train)
 94 | simple_train_dtm
 95 | 
 96 | 
 97 | # convert sparse matrix to a dense matrix
 98 | simple_train_dtm.toarray()
 99 | 
100 | 
101 | # examine the vocabulary and document-term matrix together
102 | pd.DataFrame(simple_train_dtm.toarray(), columns=vect.get_feature_names())
103 | 
104 | 
105 | # From the [scikit-learn documentation](http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction):
106 | # 
107 | # > In this scheme, features and samples are defined as follows:
108 | # 
109 | # > - Each individual token occurrence frequency (normalized or not) is treated as a **feature**.
110 | # > - The vector of all the token frequencies for a given document is considered a multivariate **sample**.
111 | # 
112 | # > A **corpus of documents** can thus be represented by a matrix with **one row per document** and **one column per token** (e.g. word) occurring in the corpus.
113 | # 
114 | # > We call **vectorization** the general process of turning a collection of text documents into numerical feature vectors. This specific strategy (tokenization, counting and normalization) is called the **Bag of Words** or "Bag of n-grams" representation. Documents are described by word occurrences while completely ignoring the relative position information of the words in the document.
115 | 
116 | # check the type of the document-term matrix
117 | type(simple_train_dtm)
118 | 
119 | 
120 | # examine the sparse matrix contents
121 | print(simple_train_dtm)
122 | 
123 | 
124 | # From the [scikit-learn documentation](http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction):
125 | # 
126 | # > As most documents will typically use a very small subset of the words used in the corpus, the resulting matrix will have **many feature values that are zeros** (typically more than 99% of them).
127 | # 
128 | # > For instance, a collection of 10,000 short text documents (such as emails) will use a vocabulary with a size in the order of 100,000 unique words in total while each document will use 100 to 1000 unique words individually.
129 | # 
130 | # > In order to be able to **store such a matrix in memory** but also to **speed up operations**, implementations will typically use a **sparse representation** such as the implementations available in the `scipy.sparse` package.
131 | 
132 | # example text for model testing
133 | simple_test = ["please don't call me"]
134 | 
135 | 
136 | # In order to **make a prediction**, the new observation must have the **same features as the training observations**, both in number and meaning.
137 | 
138 | # transform testing data into a document-term matrix (using existing vocabulary)
139 | simple_test_dtm = vect.transform(simple_test)
140 | simple_test_dtm.toarray()
141 | 
142 | 
143 | # examine the vocabulary and document-term matrix together
144 | pd.DataFrame(simple_test_dtm.toarray(), columns=vect.get_feature_names())
145 | 
146 | 
147 | # **Summary:**
148 | # 
149 | # - `vect.fit(train)` **learns the vocabulary** of the training data
150 | # - `vect.transform(train)` uses the **fitted vocabulary** to build a document-term matrix from the training data
151 | # - `vect.transform(test)` uses the **fitted vocabulary** to build a document-term matrix from the testing data (and **ignores tokens** it hasn't seen before)
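
# To make the fit/transform split concrete, here is a rough plain-Python sketch of the idea. It is not scikit-learn's implementation: the `tokenize`, `fit_vocabulary`, and `transform` helpers are hypothetical, they return dense lists rather than a sparse matrix, and they only imitate two CountVectorizer defaults (lowercasing, and tokens of two or more word characters).

```python
import re
from collections import Counter

# tokens of two or more word characters, matched on lowercased text
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def tokenize(text):
    return TOKEN_PATTERN.findall(text.lower())


def fit_vocabulary(docs):
    """Learn the sorted set of tokens that appear in the training documents."""
    return sorted({token for doc in docs for token in tokenize(doc)})


def transform(docs, vocabulary):
    """Count vocabulary tokens per document; unseen tokens are ignored."""
    known = set(vocabulary)
    rows = []
    for doc in docs:
        counts = Counter(token for token in tokenize(doc) if token in known)
        rows.append([counts[token] for token in vocabulary])
    return rows


simple_train = ['call you tonight', 'Call me a cab', 'please call me... PLEASE!']
simple_test = ["please don't call me"]

vocabulary = fit_vocabulary(simple_train)
print(vocabulary)
print(transform(simple_train, vocabulary))
print(transform(simple_test, vocabulary))
```

# Note that "don" from the test message never gets a column, because the vocabulary is fixed once it has been learned from the training data.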
152 | 
153 | # ## Part 3: Reading a text-based dataset into pandas
154 | 
155 | # read file into pandas using a relative path
156 | path = 'data/sms.tsv'
157 | sms = pd.read_table(path, header=None, names=['label', 'message'])
158 | 
159 | 
160 | # alternative: read file into pandas from a URL
161 | # url = 'https://raw.githubusercontent.com/justmarkham/pycon-2016-tutorial/master/data/sms.tsv'
162 | # sms = pd.read_table(url, header=None, names=['label', 'message'])
163 | 
164 | 
165 | # examine the shape
166 | sms.shape
167 | 
168 | 
169 | # examine the first 10 rows
170 | sms.head(10)
171 | 
172 | 
173 | # examine the class distribution
174 | sms.label.value_counts()
175 | 
176 | 
177 | # convert label to a numerical variable
178 | sms['label_num'] = sms.label.map({'ham':0, 'spam':1})
179 | 
180 | 
181 | # check that the conversion worked
182 | sms.head(10)
183 | 
184 | 
185 | # how to define X and y (from the iris data) for use with a MODEL
186 | X = iris.data
187 | y = iris.target
188 | print(X.shape)
189 | print(y.shape)
190 | 
191 | 
192 | # how to define X and y (from the SMS data) for use with COUNTVECTORIZER
193 | X = sms.message
194 | y = sms.label_num
195 | print(X.shape)
196 | print(y.shape)
197 | 
198 | 
199 | # split X and y into training and testing sets
200 | from sklearn.model_selection import train_test_split
201 | X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
202 | print(X_train.shape)
203 | print(X_test.shape)
204 | print(y_train.shape)
205 | print(y_test.shape)
206 | 
207 | 
208 | # ## Part 4: Vectorizing our dataset
209 | 
210 | # instantiate the vectorizer
211 | vect = CountVectorizer()
212 | 
213 | 
214 | # learn training data vocabulary, then use it to create a document-term matrix
215 | vect.fit(X_train)
216 | X_train_dtm = vect.transform(X_train)
217 | 
218 | 
219 | # equivalently: combine fit and transform into a single step
220 | X_train_dtm = vect.fit_transform(X_train)
221 | 
222 | 
223 | # examine the document-term matrix
224 | X_train_dtm
225 | 
226 | 
227 | # transform testing data (using fitted vocabulary) into a document-term matrix
228 | X_test_dtm = vect.transform(X_test)
229 | X_test_dtm
230 | 
231 | 
232 | # ## Part 5: Building and evaluating a model
233 | # 
234 | # We will use [multinomial Naive Bayes](http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html):
235 | # 
236 | # > The multinomial Naive Bayes classifier is suitable for classification with **discrete features** (e.g., word counts for text classification). The multinomial distribution normally requires integer feature counts. However, in practice, fractional counts such as tf-idf may also work.
237 | 
238 | # import and instantiate a Multinomial Naive Bayes model
239 | from sklearn.naive_bayes import MultinomialNB
240 | nb = MultinomialNB()
241 | 
242 | 
243 | # train the model using X_train_dtm
244 | nb.fit(X_train_dtm, y_train)
245 | 
246 | 
247 | # make class predictions for X_test_dtm
248 | y_pred_class = nb.predict(X_test_dtm)
249 | 
250 | 
251 | # calculate accuracy of class predictions
252 | from sklearn import metrics
253 | metrics.accuracy_score(y_test, y_pred_class)
254 | 
255 | 
256 | # print the confusion matrix
257 | metrics.confusion_matrix(y_test, y_pred_class)
258 | 
259 | 
260 | # print message text for the false positives (ham incorrectly classified as spam)
261 | 
262 | 
263 | # print message text for the false negatives (spam incorrectly classified as ham)
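
# One possible way to approach the two steps above, sketched in plain Python on made-up data. The `misclassified` helper and the toy messages are hypothetical, and labels are assumed to be encoded as 0 for ham and 1 for spam, as in `label_num`. With the pandas objects in this tutorial, a boolean mask such as `X_test[(y_pred_class == 1) & (y_test == 0)]` should select the false positives in the same way.

```python
def misclassified(messages, y_true, y_pred):
    """Split misclassified messages into false positives and false negatives.

    Labels are assumed to be 0 for ham and 1 for spam.
    """
    rows = list(zip(messages, y_true, y_pred))
    false_positives = [m for m, true, pred in rows if true == 0 and pred == 1]
    false_negatives = [m for m, true, pred in rows if true == 1 and pred == 0]
    return false_positives, false_negatives


# hypothetical messages, true labels, and predictions, for illustration only
toy_messages = ['see you at lunch', 'win a free prize now', 'free for dinner?', 'claim your reward']
toy_true = [0, 1, 0, 1]
toy_pred = [0, 1, 1, 0]

false_positives, false_negatives = misclassified(toy_messages, toy_true, toy_pred)
print(false_positives)
print(false_negatives)
```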
264 | 
265 | 
266 | # example false negative
267 | X_test[3132]
268 | 
269 | 
270 | # calculate predicted probabilities for X_test_dtm (poorly calibrated)
271 | y_pred_prob = nb.predict_proba(X_test_dtm)[:, 1]
272 | y_pred_prob
273 | 
274 | 
275 | # calculate AUC
276 | metrics.roc_auc_score(y_test, y_pred_prob)
277 | 
278 | 
279 | # ## Part 6: Comparing models
280 | # 
281 | # We will compare multinomial Naive Bayes with [logistic regression](http://scikit-learn.org/stable/modules/linear_model.html#logistic-regression):
282 | # 
283 | # > Logistic regression, despite its name, is a **linear model for classification** rather than regression. Logistic regression is also known in the literature as logit regression, maximum-entropy classification (MaxEnt) or the log-linear classifier. In this model, the probabilities describing the possible outcomes of a single trial are modeled using a logistic function.
284 | 
285 | # import and instantiate a logistic regression model
286 | from sklearn.linear_model import LogisticRegression
287 | logreg = LogisticRegression()
288 | 
289 | 
290 | # train the model using X_train_dtm
291 | logreg.fit(X_train_dtm, y_train)
292 | 
293 | 
294 | # make class predictions for X_test_dtm
295 | y_pred_class = logreg.predict(X_test_dtm)
296 | 
297 | 
298 | # calculate predicted probabilities for X_test_dtm (well calibrated)
299 | y_pred_prob = logreg.predict_proba(X_test_dtm)[:, 1]
300 | y_pred_prob
301 | 
302 | 
303 | # calculate accuracy
304 | metrics.accuracy_score(y_test, y_pred_class)
305 | 
306 | 
307 | # calculate AUC
308 | metrics.roc_auc_score(y_test, y_pred_prob)
309 | 
310 | 
311 | # ## Part 7: Examining a model for further insight
312 | # 
313 | # We will examine our **trained Naive Bayes model** to calculate the approximate **"spamminess" of each token**.
314 | 
315 | # store the vocabulary of X_train (note: use get_feature_names_out() instead for scikit-learn 1.2 and later)
316 | X_train_tokens = vect.get_feature_names()
317 | len(X_train_tokens)
318 | 
319 | 
320 | # examine the first 50 tokens
321 | print(X_train_tokens[0:50])
322 | 
323 | 
324 | # examine the last 50 tokens
325 | print(X_train_tokens[-50:])
326 | 
327 | 
328 | # Naive Bayes counts the number of times each token appears in each class
329 | nb.feature_count_
330 | 
331 | 
332 | # rows represent classes, columns represent tokens
333 | nb.feature_count_.shape
334 | 
335 | 
336 | # number of times each token appears across all HAM messages
337 | ham_token_count = nb.feature_count_[0, :]
338 | ham_token_count
339 | 
340 | 
341 | # number of times each token appears across all SPAM messages
342 | spam_token_count = nb.feature_count_[1, :]
343 | spam_token_count
344 | 
345 | 
346 | # create a DataFrame of tokens with their separate ham and spam counts
347 | tokens = pd.DataFrame({'token':X_train_tokens, 'ham':ham_token_count, 'spam':spam_token_count}).set_index('token')
348 | tokens.head()
349 | 
350 | 
351 | # examine 5 random DataFrame rows
352 | tokens.sample(5, random_state=6)
353 | 
354 | 
355 | # Naive Bayes counts the number of observations in each class
356 | nb.class_count_
357 | 
358 | 
359 | # Before we can calculate the "spamminess" of each token, we need to avoid **dividing by zero** and account for the **class imbalance**.
360 | 
361 | # add 1 to ham and spam counts to avoid dividing by 0
362 | tokens['ham'] = tokens.ham + 1
363 | tokens['spam'] = tokens.spam + 1
364 | tokens.sample(5, random_state=6)
365 | 
366 | 
367 | # convert the ham and spam counts into frequencies
368 | tokens['ham'] = tokens.ham / nb.class_count_[0]
369 | tokens['spam'] = tokens.spam / nb.class_count_[1]
370 | tokens.sample(5, random_state=6)
371 | 
372 | 
373 | # calculate the ratio of spam-to-ham for each token
374 | tokens['spam_ratio'] = tokens.spam / tokens.ham
375 | tokens.sample(5, random_state=6)
376 | 
377 | 
378 | # examine the DataFrame sorted by spam_ratio
379 | # note: use sort() instead of sort_values() for pandas 0.16.2 and earlier
380 | tokens.sort_values('spam_ratio', ascending=False)
381 | 
382 | 
383 | # look up the spam_ratio for a given token
384 | tokens.loc['dating', 'spam_ratio']
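
# The arithmetic behind `spam_ratio` can be checked by hand for a single token. This is a minimal sketch with made-up counts: the `spam_ratio` function and the class sizes below are hypothetical and are not taken from the SMS data. It assumes Python 3 division.

```python
def spam_ratio(ham_count, spam_count, n_ham, n_spam):
    """Ratio of a token's smoothed spam frequency to its smoothed ham frequency."""
    ham_freq = (ham_count + 1) / n_ham     # add 1 so the ratio never divides by 0
    spam_freq = (spam_count + 1) / n_spam  # dividing by class size corrects for imbalance
    return spam_freq / ham_freq


# hypothetical class sizes: 512 ham messages and 128 spam messages
n_ham, n_spam = 512, 128

# a token seen 7 times in spam and never in ham
print(spam_ratio(0, 7, n_ham, n_spam))

# a token seen 15 times in ham and never in spam
print(spam_ratio(15, 0, n_ham, n_spam))
```

# A ratio above 1 means the token is relatively more common in spam, and a ratio below 1 means it is relatively more common in ham.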
385 | 
386 | 
387 | # ## Part 8: Practicing this workflow on another dataset
388 | # 
389 | # Please open the **`exercise.ipynb`** notebook (or the **`exercise.py`** script).
390 | 
391 | # ## Part 9: Tuning the vectorizer (discussion)
392 | # 
393 | # Thus far, we have been using the default parameters of [CountVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html):
394 | 
395 | # show default parameters for CountVectorizer
396 | vect
397 | 
398 | 
399 | # However, the vectorizer is worth tuning, just like a model is worth tuning! Here are a few parameters that you might want to tune:
400 | # 
401 | # - **stop_words:** string {'english'}, list, or None (default)
402 | #     - If 'english', a built-in stop word list for English is used.
403 | #     - If a list, that list is assumed to contain stop words, all of which will be removed from the resulting tokens.
404 | #     - If None, no stop words will be used.
405 | 
406 | # remove English stop words
407 | vect = CountVectorizer(stop_words='english')
408 | 
409 | 
410 | # - **ngram_range:** tuple (min_n, max_n), default=(1, 1)
411 | #     - The lower and upper boundary of the range of n-values for different n-grams to be extracted.
412 | #     - All values of n such that min_n <= n <= max_n will be used.
413 | 
414 | # include 1-grams and 2-grams
415 | vect = CountVectorizer(ngram_range=(1, 2))
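
# As a rough illustration of what `ngram_range=(1, 2)` adds to the vocabulary, the sketch below builds word n-grams from an already tokenized message. The `word_ngrams` helper is hypothetical and only imitates the idea; it is not scikit-learn's code.

```python
def word_ngrams(tokens, min_n, max_n):
    """Return all n-grams with min_n <= n <= max_n, joined by single spaces."""
    ngrams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(tokens) - n + 1):
            ngrams.append(' '.join(tokens[start:start + n]))
    return ngrams


# tokens of a short message after lowercasing and dropping one-character words
toy_tokens = ['call', 'me', 'cab']
print(word_ngrams(toy_tokens, 1, 2))
```

# Each 2-gram becomes its own column in the document-term matrix, so the number of features can grow quickly.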
416 | 
417 | 
418 | # - **max_df:** float in range [0.0, 1.0] or int, default=1.0
419 | #     - When building the vocabulary, ignore terms that have a document frequency strictly higher than the given threshold (corpus-specific stop words).
420 | #     - If float, the parameter represents a proportion of documents.
421 | #     - If integer, the parameter represents an absolute count.
422 | 
423 | # ignore terms that appear in more than 50% of the documents
424 | vect = CountVectorizer(max_df=0.5)
425 | 
426 | 
427 | # - **min_df:** float in range [0.0, 1.0] or int, default=1
428 | #     - When building the vocabulary, ignore terms that have a document frequency strictly lower than the given threshold. (This value is also called "cut-off" in the literature.)
429 | #     - If float, the parameter represents a proportion of documents.
430 | #     - If integer, the parameter represents an absolute count.
431 | 
432 | # only keep terms that appear in at least 2 documents
433 | vect = CountVectorizer(min_df=2)
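
# The `max_df` and `min_df` thresholds both act on document frequency, meaning the number of documents that contain a term at least once. The sketch below applies them to a tiny tokenized corpus. The `filter_by_document_frequency` helper and the toy corpus are hypothetical, and the code only mirrors the rules described above (floats are proportions, integers are absolute counts).

```python
from collections import Counter


def filter_by_document_frequency(tokenized_docs, min_df=1, max_df=1.0):
    """Keep terms whose document frequency lies within [min_df, max_df]."""
    n_docs = len(tokenized_docs)
    doc_freq = Counter()
    for tokens in tokenized_docs:
        doc_freq.update(set(tokens))  # count each term once per document

    # floats are proportions of documents, integers are absolute counts
    min_count = min_df if isinstance(min_df, int) else min_df * n_docs
    max_count = max_df if isinstance(max_df, int) else max_df * n_docs
    return sorted(term for term, count in doc_freq.items()
                  if min_count <= count <= max_count)


# hypothetical tokenized corpus, for illustration only
toy_docs = [
    ['call', 'you', 'tonight'],
    ['call', 'me', 'cab'],
    ['please', 'call', 'me', 'please'],
]
print(filter_by_document_frequency(toy_docs, min_df=2))
print(filter_by_document_frequency(toy_docs, max_df=0.5))
```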
434 | 
435 | 
436 | # **Guidelines for tuning CountVectorizer:**
437 | # 
438 | # - Use your knowledge of the **problem** and the **text**, and your understanding of the **tuning parameters**, to help you decide what parameters to tune and how to tune them.
439 | # - **Experiment**, and let the data tell you the best approach!
440 | 


--------------------------------------------------------------------------------
/tutorial.ipynb:
--------------------------------------------------------------------------------
   1 | {
   2 |  "cells": [
   3 |   {
   4 |    "cell_type": "markdown",
   5 |    "metadata": {},
   6 |    "source": [
   7 |     "# Tutorial: Machine Learning with Text in scikit-learn"
   8 |    ]
   9 |   },
  10 |   {
  11 |    "cell_type": "markdown",
  12 |    "metadata": {},
  13 |    "source": [
  14 |     "## Agenda\n",
  15 |     "\n",
  16 |     "1. Model building in scikit-learn (refresher)\n",
  17 |     "2. Representing text as numerical data\n",
  18 |     "3. Reading a text-based dataset into pandas\n",
  19 |     "4. Vectorizing our dataset\n",
  20 |     "5. Building and evaluating a model\n",
  21 |     "6. Comparing models\n",
  22 |     "7. Examining a model for further insight\n",
  23 |     "8. Practicing this workflow on another dataset\n",
  24 |     "9. Tuning the vectorizer (discussion)"
  25 |    ]
  26 |   },
  27 |   {
  28 |    "cell_type": "code",
  29 |    "execution_count": null,
  30 |    "metadata": {
  31 |     "collapsed": false
  32 |    },
  33 |    "outputs": [],
  34 |    "source": [
  35 |     "# for Python 2: use print only as a function\n",
  36 |     "from __future__ import print_function"
  37 |    ]
  38 |   },
  39 |   {
  40 |    "cell_type": "markdown",
  41 |    "metadata": {},
  42 |    "source": [
  43 |     "## Part 1: Model building in scikit-learn (refresher)"
  44 |    ]
  45 |   },
  46 |   {
  47 |    "cell_type": "code",
  48 |    "execution_count": null,
  49 |    "metadata": {
  50 |     "collapsed": true
  51 |    },
  52 |    "outputs": [],
  53 |    "source": [
  54 |     "# load the iris dataset as an example\n",
  55 |     "from sklearn.datasets import load_iris\n",
  56 |     "iris = load_iris()"
  57 |    ]
  58 |   },
  59 |   {
  60 |    "cell_type": "code",
  61 |    "execution_count": null,
  62 |    "metadata": {
  63 |     "collapsed": true
  64 |    },
  65 |    "outputs": [],
  66 |    "source": [
  67 |     "# store the feature matrix (X) and response vector (y)\n",
  68 |     "X = iris.data\n",
  69 |     "y = iris.target"
  70 |    ]
  71 |   },
  72 |   {
  73 |    "cell_type": "markdown",
  74 |    "metadata": {},
  75 |    "source": [
  76 |     "**\"Features\"** are also known as predictors, inputs, or attributes. The **\"response\"** is also known as the target, label, or output."
  77 |    ]
  78 |   },
  79 |   {
  80 |    "cell_type": "code",
  81 |    "execution_count": null,
  82 |    "metadata": {
  83 |     "collapsed": false
  84 |    },
  85 |    "outputs": [],
  86 |    "source": [
  87 |     "# check the shapes of X and y\n",
  88 |     "print(X.shape)\n",
  89 |     "print(y.shape)"
  90 |    ]
  91 |   },
  92 |   {
  93 |    "cell_type": "markdown",
  94 |    "metadata": {},
  95 |    "source": [
  96 |     "**\"Observations\"** are also known as samples, instances, or records."
  97 |    ]
  98 |   },
  99 |   {
 100 |    "cell_type": "code",
 101 |    "execution_count": null,
 102 |    "metadata": {
 103 |     "collapsed": false
 104 |    },
 105 |    "outputs": [],
 106 |    "source": [
 107 |     "# examine the first 5 rows of the feature matrix (including the feature names)\n",
 108 |     "import pandas as pd\n",
 109 |     "pd.DataFrame(X, columns=iris.feature_names).head()"
 110 |    ]
 111 |   },
 112 |   {
 113 |    "cell_type": "code",
 114 |    "execution_count": null,
 115 |    "metadata": {
 116 |     "collapsed": false
 117 |    },
 118 |    "outputs": [],
 119 |    "source": [
 120 |     "# examine the response vector\n",
 121 |     "print(y)"
 122 |    ]
 123 |   },
 124 |   {
 125 |    "cell_type": "markdown",
 126 |    "metadata": {},
 127 |    "source": [
 128 |     "In order to **build a model**, the features must be **numeric**, and every observation must have the **same features in the same order**."
 129 |    ]
 130 |   },
 131 |   {
 132 |    "cell_type": "code",
 133 |    "execution_count": null,
 134 |    "metadata": {
 135 |     "collapsed": false
 136 |    },
 137 |    "outputs": [],
 138 |    "source": [
 139 |     "# import the class\n",
 140 |     "from sklearn.neighbors import KNeighborsClassifier\n",
 141 |     "\n",
 142 |     "# instantiate the model (with the default parameters)\n",
 143 |     "knn = KNeighborsClassifier()\n",
 144 |     "\n",
 145 |     "# fit the model with data (occurs in-place)\n",
 146 |     "knn.fit(X, y)"
 147 |    ]
 148 |   },
 149 |   {
 150 |    "cell_type": "markdown",
 151 |    "metadata": {},
 152 |    "source": [
 153 |     "In order to **make a prediction**, the new observation must have the **same features as the training observations**, both in number and meaning."
 154 |    ]
 155 |   },
 156 |   {
 157 |    "cell_type": "code",
 158 |    "execution_count": null,
 159 |    "metadata": {
 160 |     "collapsed": false
 161 |    },
 162 |    "outputs": [],
 163 |    "source": [
 164 |     "# predict the response for a new observation\n",
 165 |     "knn.predict([[3, 5, 4, 2]])"
 166 |    ]
 167 |   },
 168 |   {
 169 |    "cell_type": "markdown",
 170 |    "metadata": {},
 171 |    "source": [
 172 |     "## Part 2: Representing text as numerical data"
 173 |    ]
 174 |   },
 175 |   {
 176 |    "cell_type": "code",
 177 |    "execution_count": null,
 178 |    "metadata": {
 179 |     "collapsed": true
 180 |    },
 181 |    "outputs": [],
 182 |    "source": [
 183 |     "# example text for model training (SMS messages)\n",
 184 |     "simple_train = ['call you tonight', 'Call me a cab', 'please call me... PLEASE!']"
 185 |    ]
 186 |   },
 187 |   {
 188 |    "cell_type": "markdown",
 189 |    "metadata": {},
 190 |    "source": [
 191 |     "From the [scikit-learn documentation](http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction):\n",
 192 |     "\n",
 193 |     "> Text Analysis is a major application field for machine learning algorithms. However the raw data, a sequence of symbols cannot be fed directly to the algorithms themselves as most of them expect **numerical feature vectors with a fixed size** rather than the **raw text documents with variable length**.\n",
 194 |     "\n",
 195 |     "We will use [CountVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) to \"convert text into a matrix of token counts\":"
 196 |    ]
 197 |   },
 198 |   {
 199 |    "cell_type": "code",
 200 |    "execution_count": null,
 201 |    "metadata": {
 202 |     "collapsed": true
 203 |    },
 204 |    "outputs": [],
 205 |    "source": [
 206 |     "# import and instantiate CountVectorizer (with the default parameters)\n",
 207 |     "from sklearn.feature_extraction.text import CountVectorizer\n",
 208 |     "vect = CountVectorizer()"
 209 |    ]
 210 |   },
 211 |   {
 212 |    "cell_type": "code",
 213 |    "execution_count": null,
 214 |    "metadata": {
 215 |     "collapsed": false
 216 |    },
 217 |    "outputs": [],
 218 |    "source": [
 219 |     "# learn the 'vocabulary' of the training data (occurs in-place)\n",
 220 |     "vect.fit(simple_train)"
 221 |    ]
 222 |   },
 223 |   {
 224 |    "cell_type": "code",
 225 |    "execution_count": null,
 226 |    "metadata": {
 227 |     "collapsed": false
 228 |    },
 229 |    "outputs": [],
 230 |    "source": [
 231 |     "# examine the fitted vocabulary (note: use get_feature_names_out() instead for scikit-learn 1.2 and later)\n",
 232 |     "vect.get_feature_names()"
 233 |    ]
 234 |   },
 235 |   {
 236 |    "cell_type": "code",
 237 |    "execution_count": null,
 238 |    "metadata": {
 239 |     "collapsed": false
 240 |    },
 241 |    "outputs": [],
 242 |    "source": [
 243 |     "# transform training data into a 'document-term matrix'\n",
 244 |     "simple_train_dtm = vect.transform(simple_train)\n",
 245 |     "simple_train_dtm"
 246 |    ]
 247 |   },
 248 |   {
 249 |    "cell_type": "code",
 250 |    "execution_count": null,
 251 |    "metadata": {
 252 |     "collapsed": false
 253 |    },
 254 |    "outputs": [],
 255 |    "source": [
 256 |     "# convert sparse matrix to a dense matrix\n",
 257 |     "simple_train_dtm.toarray()"
 258 |    ]
 259 |   },
 260 |   {
 261 |    "cell_type": "code",
 262 |    "execution_count": null,
 263 |    "metadata": {
 264 |     "collapsed": false
 265 |    },
 266 |    "outputs": [],
 267 |    "source": [
 268 |     "# examine the vocabulary and document-term matrix together\n",
 269 |     "pd.DataFrame(simple_train_dtm.toarray(), columns=vect.get_feature_names())"
 270 |    ]
 271 |   },
 272 |   {
 273 |    "cell_type": "markdown",
 274 |    "metadata": {},
 275 |    "source": [
 276 |     "From the [scikit-learn documentation](http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction):\n",
 277 |     "\n",
 278 |     "> In this scheme, features and samples are defined as follows:\n",
 279 |     "\n",
 280 |     "> - Each individual token occurrence frequency (normalized or not) is treated as a **feature**.\n",
 281 |     "> - The vector of all the token frequencies for a given document is considered a multivariate **sample**.\n",
 282 |     "\n",
 283 |     "> A **corpus of documents** can thus be represented by a matrix with **one row per document** and **one column per token** (e.g. word) occurring in the corpus.\n",
 284 |     "\n",
 285 |     "> We call **vectorization** the general process of turning a collection of text documents into numerical feature vectors. This specific strategy (tokenization, counting and normalization) is called the **Bag of Words** or \"Bag of n-grams\" representation. Documents are described by word occurrences while completely ignoring the relative position information of the words in the document."
 286 |    ]
 287 |   },
 288 |   {
 289 |    "cell_type": "code",
 290 |    "execution_count": null,
 291 |    "metadata": {
 292 |     "collapsed": false
 293 |    },
 294 |    "outputs": [],
 295 |    "source": [
 296 |     "# check the type of the document-term matrix\n",
 297 |     "type(simple_train_dtm)"
 298 |    ]
 299 |   },
 300 |   {
 301 |    "cell_type": "code",
 302 |    "execution_count": null,
 303 |    "metadata": {
 304 |     "collapsed": false,
 305 |     "scrolled": true
 306 |    },
 307 |    "outputs": [],
 308 |    "source": [
 309 |     "# examine the sparse matrix contents\n",
 310 |     "print(simple_train_dtm)"
 311 |    ]
 312 |   },
 313 |   {
 314 |    "cell_type": "markdown",
 315 |    "metadata": {},
 316 |    "source": [
 317 |     "From the [scikit-learn documentation](http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction):\n",
 318 |     "\n",
 319 |     "> As most documents will typically use a very small subset of the words used in the corpus, the resulting matrix will have **many feature values that are zeros** (typically more than 99% of them).\n",
 320 |     "\n",
 321 |     "> For instance, a collection of 10,000 short text documents (such as emails) will use a vocabulary with a size in the order of 100,000 unique words in total while each document will use 100 to 1000 unique words individually.\n",
 322 |     "\n",
 323 |     "> In order to be able to **store such a matrix in memory** but also to **speed up operations**, implementations will typically use a **sparse representation** such as the implementations available in the `scipy.sparse` package."
 324 |    ]
 325 |   },
 326 |   {
 327 |    "cell_type": "code",
 328 |    "execution_count": null,
 329 |    "metadata": {
 330 |     "collapsed": true
 331 |    },
 332 |    "outputs": [],
 333 |    "source": [
 334 |     "# example text for model testing\n",
 335 |     "simple_test = [\"please don't call me\"]"
 336 |    ]
 337 |   },
 338 |   {
 339 |    "cell_type": "markdown",
 340 |    "metadata": {},
 341 |    "source": [
 342 |     "In order to **make a prediction**, the new observation must have the **same features as the training observations**, both in number and meaning."
 343 |    ]
 344 |   },
 345 |   {
 346 |    "cell_type": "code",
 347 |    "execution_count": null,
 348 |    "metadata": {
 349 |     "collapsed": false
 350 |    },
 351 |    "outputs": [],
 352 |    "source": [
 353 |     "# transform testing data into a document-term matrix (using existing vocabulary)\n",
 354 |     "simple_test_dtm = vect.transform(simple_test)\n",
 355 |     "simple_test_dtm.toarray()"
 356 |    ]
 357 |   },
 358 |   {
 359 |    "cell_type": "code",
 360 |    "execution_count": null,
 361 |    "metadata": {
 362 |     "collapsed": false
 363 |    },
 364 |    "outputs": [],
 365 |    "source": [
 366 |     "# examine the vocabulary and document-term matrix together (note: in scikit-learn 1.2 and later, use get_feature_names_out() instead of get_feature_names())\n",
 367 |     "pd.DataFrame(simple_test_dtm.toarray(), columns=vect.get_feature_names())"
 368 |    ]
 369 |   },
 370 |   {
 371 |    "cell_type": "markdown",
 372 |    "metadata": {},
 373 |    "source": [
 374 |     "**Summary:**\n",
 375 |     "\n",
 376 |     "- `vect.fit(train)` **learns the vocabulary** of the training data\n",
 377 |     "- `vect.transform(train)` uses the **fitted vocabulary** to build a document-term matrix from the training data\n",
 378 |     "- `vect.transform(test)` uses the **fitted vocabulary** to build a document-term matrix from the testing data (and **ignores tokens** it hasn't seen before)"
 379 |    ]
 380 |   },
 381 |   {
 382 |    "cell_type": "markdown",
 383 |    "metadata": {},
 384 |    "source": [
 385 |     "## Part 3: Reading a text-based dataset into pandas"
 386 |    ]
 387 |   },
 388 |   {
 389 |    "cell_type": "code",
 390 |    "execution_count": null,
 391 |    "metadata": {
 392 |     "collapsed": true
 393 |    },
 394 |    "outputs": [],
 395 |    "source": [
 396 |     "# read file into pandas using a relative path\n",
 397 |     "path = 'data/sms.tsv'\n",
 398 |     "sms = pd.read_table(path, header=None, names=['label', 'message'])"
 399 |    ]
 400 |   },
 401 |   {
 402 |    "cell_type": "code",
 403 |    "execution_count": null,
 404 |    "metadata": {
 405 |     "collapsed": false
 406 |    },
 407 |    "outputs": [],
 408 |    "source": [
 409 |     "# alternative: read file into pandas from a URL\n",
 410 |     "# url = 'https://raw.githubusercontent.com/justmarkham/pycon-2016-tutorial/master/data/sms.tsv'\n",
 411 |     "# sms = pd.read_table(url, header=None, names=['label', 'message'])"
 412 |    ]
 413 |   },
 414 |   {
 415 |    "cell_type": "code",
 416 |    "execution_count": null,
 417 |    "metadata": {
 418 |     "collapsed": false
 419 |    },
 420 |    "outputs": [],
 421 |    "source": [
 422 |     "# examine the shape\n",
 423 |     "sms.shape"
 424 |    ]
 425 |   },
 426 |   {
 427 |    "cell_type": "code",
 428 |    "execution_count": null,
 429 |    "metadata": {
 430 |     "collapsed": false
 431 |    },
 432 |    "outputs": [],
 433 |    "source": [
 434 |     "# examine the first 10 rows\n",
 435 |     "sms.head(10)"
 436 |    ]
 437 |   },
 438 |   {
 439 |    "cell_type": "code",
 440 |    "execution_count": null,
 441 |    "metadata": {
 442 |     "collapsed": false
 443 |    },
 444 |    "outputs": [],
 445 |    "source": [
 446 |     "# examine the class distribution\n",
 447 |     "sms.label.value_counts()"
 448 |    ]
 449 |   },
 450 |   {
 451 |    "cell_type": "code",
 452 |    "execution_count": null,
 453 |    "metadata": {
 454 |     "collapsed": true
 455 |    },
 456 |    "outputs": [],
 457 |    "source": [
 458 |     "# convert label to a numerical variable\n",
 459 |     "sms['label_num'] = sms.label.map({'ham':0, 'spam':1})"
 460 |    ]
 461 |   },
 462 |   {
 463 |    "cell_type": "code",
 464 |    "execution_count": null,
 465 |    "metadata": {
 466 |     "collapsed": false
 467 |    },
 468 |    "outputs": [],
 469 |    "source": [
 470 |     "# check that the conversion worked\n",
 471 |     "sms.head(10)"
 472 |    ]
 473 |   },
 474 |   {
 475 |    "cell_type": "code",
 476 |    "execution_count": null,
 477 |    "metadata": {
 478 |     "collapsed": false
 479 |    },
 480 |    "outputs": [],
 481 |    "source": [
 482 |     "# how to define X and y (from the iris data) for use with a MODEL\n",
 483 |     "X = iris.data\n",
 484 |     "y = iris.target\n",
 485 |     "print(X.shape)\n",
 486 |     "print(y.shape)"
 487 |    ]
 488 |   },
 489 |   {
 490 |    "cell_type": "code",
 491 |    "execution_count": null,
 492 |    "metadata": {
 493 |     "collapsed": false
 494 |    },
 495 |    "outputs": [],
 496 |    "source": [
 497 |     "# how to define X and y (from the SMS data) for use with COUNTVECTORIZER\n",
 498 |     "X = sms.message\n",
 499 |     "y = sms.label_num\n",
 500 |     "print(X.shape)\n",
 501 |     "print(y.shape)"
 502 |    ]
 503 |   },
 504 |   {
 505 |    "cell_type": "code",
 506 |    "execution_count": null,
 507 |    "metadata": {
 508 |     "collapsed": false
 509 |    },
 510 |    "outputs": [],
 511 |    "source": [
 512 |     "# split X and y into training and testing sets\n",
 513 |     "from sklearn.model_selection import train_test_split\n",
 514 |     "X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)\n",
 515 |     "print(X_train.shape)\n",
 516 |     "print(X_test.shape)\n",
 517 |     "print(y_train.shape)\n",
 518 |     "print(y_test.shape)"
 519 |    ]
 520 |   },
 521 |   {
 522 |    "cell_type": "markdown",
 523 |    "metadata": {},
 524 |    "source": [
 525 |     "## Part 4: Vectorizing our dataset"
 526 |    ]
 527 |   },
 528 |   {
 529 |    "cell_type": "code",
 530 |    "execution_count": null,
 531 |    "metadata": {
 532 |     "collapsed": true
 533 |    },
 534 |    "outputs": [],
 535 |    "source": [
 536 |     "# instantiate the vectorizer\n",
 537 |     "vect = CountVectorizer()"
 538 |    ]
 539 |   },
 540 |   {
 541 |    "cell_type": "code",
 542 |    "execution_count": null,
 543 |    "metadata": {
 544 |     "collapsed": true
 545 |    },
 546 |    "outputs": [],
 547 |    "source": [
 548 |     "# learn training data vocabulary, then use it to create a document-term matrix\n",
 549 |     "vect.fit(X_train)\n",
 550 |     "X_train_dtm = vect.transform(X_train)"
 551 |    ]
 552 |   },
 553 |   {
 554 |    "cell_type": "code",
 555 |    "execution_count": null,
 556 |    "metadata": {
 557 |     "collapsed": true
 558 |    },
 559 |    "outputs": [],
 560 |    "source": [
 561 |     "# equivalently: combine fit and transform into a single step\n",
 562 |     "X_train_dtm = vect.fit_transform(X_train)"
 563 |    ]
 564 |   },
 565 |   {
 566 |    "cell_type": "code",
 567 |    "execution_count": null,
 568 |    "metadata": {
 569 |     "collapsed": false
 570 |    },
 571 |    "outputs": [],
 572 |    "source": [
 573 |     "# examine the document-term matrix\n",
 574 |     "X_train_dtm"
 575 |    ]
 576 |   },
 577 |   {
 578 |    "cell_type": "code",
 579 |    "execution_count": null,
 580 |    "metadata": {
 581 |     "collapsed": false
 582 |    },
 583 |    "outputs": [],
 584 |    "source": [
 585 |     "# transform testing data (using fitted vocabulary) into a document-term matrix\n",
 586 |     "X_test_dtm = vect.transform(X_test)\n",
 587 |     "X_test_dtm"
 588 |    ]
 589 |   },
 590 |   {
 591 |    "cell_type": "markdown",
 592 |    "metadata": {},
 593 |    "source": [
 594 |     "## Part 5: Building and evaluating a model\n",
 595 |     "\n",
 596 |     "We will use [multinomial Naive Bayes](http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html):\n",
 597 |     "\n",
 598 |     "> The multinomial Naive Bayes classifier is suitable for classification with **discrete features** (e.g., word counts for text classification). The multinomial distribution normally requires integer feature counts. However, in practice, fractional counts such as tf-idf may also work."
 599 |    ]
 600 |   },
 601 |   {
 602 |    "cell_type": "code",
 603 |    "execution_count": null,
 604 |    "metadata": {
 605 |     "collapsed": true
 606 |    },
 607 |    "outputs": [],
 608 |    "source": [
 609 |     "# import and instantiate a Multinomial Naive Bayes model\n",
 610 |     "from sklearn.naive_bayes import MultinomialNB\n",
 611 |     "nb = MultinomialNB()"
 612 |    ]
 613 |   },
 614 |   {
 615 |    "cell_type": "code",
 616 |    "execution_count": null,
 617 |    "metadata": {
 618 |     "collapsed": false
 619 |    },
 620 |    "outputs": [],
 621 |    "source": [
 622 |     "# train the model using X_train_dtm (timing it with an IPython \"magic command\")\n",
 623 |     "%time nb.fit(X_train_dtm, y_train)"
 624 |    ]
 625 |   },
 626 |   {
 627 |    "cell_type": "code",
 628 |    "execution_count": null,
 629 |    "metadata": {
 630 |     "collapsed": true
 631 |    },
 632 |    "outputs": [],
 633 |    "source": [
 634 |     "# make class predictions for X_test_dtm\n",
 635 |     "y_pred_class = nb.predict(X_test_dtm)"
 636 |    ]
 637 |   },
 638 |   {
 639 |    "cell_type": "code",
 640 |    "execution_count": null,
 641 |    "metadata": {
 642 |     "collapsed": false
 643 |    },
 644 |    "outputs": [],
 645 |    "source": [
 646 |     "# calculate accuracy of class predictions\n",
 647 |     "from sklearn import metrics\n",
 648 |     "metrics.accuracy_score(y_test, y_pred_class)"
 649 |    ]
 650 |   },
 651 |   {
 652 |    "cell_type": "code",
 653 |    "execution_count": null,
 654 |    "metadata": {
 655 |     "collapsed": false
 656 |    },
 657 |    "outputs": [],
 658 |    "source": [
 659 |     "# print the confusion matrix\n",
 660 |     "metrics.confusion_matrix(y_test, y_pred_class)"
 661 |    ]
 662 |   },
 663 |   {
 664 |    "cell_type": "code",
 665 |    "execution_count": null,
 666 |    "metadata": {
 667 |     "collapsed": false
 668 |    },
 669 |    "outputs": [],
 670 |    "source": [
 671 |     "# print message text for the false positives (ham incorrectly classified as spam)\n"
 672 |    ]
 673 |   },
 674 |   {
 675 |    "cell_type": "code",
 676 |    "execution_count": null,
 677 |    "metadata": {
 678 |     "collapsed": false,
 679 |     "scrolled": true
 680 |    },
 681 |    "outputs": [],
 682 |    "source": [
 683 |     "# print message text for the false negatives (spam incorrectly classified as ham)\n"
 684 |    ]
 685 |   },
 686 |   {
 687 |    "cell_type": "code",
 688 |    "execution_count": null,
 689 |    "metadata": {
 690 |     "collapsed": false,
 691 |     "scrolled": true
 692 |    },
 693 |    "outputs": [],
 694 |    "source": [
 695 |     "# example false negative\n",
 696 |     "X_test[3132]"
 697 |    ]
 698 |   },
 699 |   {
 700 |    "cell_type": "code",
 701 |    "execution_count": null,
 702 |    "metadata": {
 703 |     "collapsed": false
 704 |    },
 705 |    "outputs": [],
 706 |    "source": [
 707 |     "# calculate predicted probabilities for X_test_dtm (poorly calibrated)\n",
 708 |     "y_pred_prob = nb.predict_proba(X_test_dtm)[:, 1]\n",
 709 |     "y_pred_prob"
 710 |    ]
 711 |   },
 712 |   {
 713 |    "cell_type": "code",
 714 |    "execution_count": null,
 715 |    "metadata": {
 716 |     "collapsed": false
 717 |    },
 718 |    "outputs": [],
 719 |    "source": [
 720 |     "# calculate AUC\n",
 721 |     "metrics.roc_auc_score(y_test, y_pred_prob)"
 722 |    ]
 723 |   },
 724 |   {
 725 |    "cell_type": "markdown",
 726 |    "metadata": {},
 727 |    "source": [
 728 |     "## Part 6: Comparing models\n",
 729 |     "\n",
 730 |     "We will compare multinomial Naive Bayes with [logistic regression](http://scikit-learn.org/stable/modules/linear_model.html#logistic-regression):\n",
 731 |     "\n",
 732 |     "> Logistic regression, despite its name, is a **linear model for classification** rather than regression. Logistic regression is also known in the literature as logit regression, maximum-entropy classification (MaxEnt) or the log-linear classifier. In this model, the probabilities describing the possible outcomes of a single trial are modeled using a logistic function."
 733 |    ]
 734 |   },
 735 |   {
 736 |    "cell_type": "code",
 737 |    "execution_count": null,
 738 |    "metadata": {
 739 |     "collapsed": true
 740 |    },
 741 |    "outputs": [],
 742 |    "source": [
 743 |     "# import and instantiate a logistic regression model\n",
 744 |     "from sklearn.linear_model import LogisticRegression\n",
 745 |     "logreg = LogisticRegression()"
 746 |    ]
 747 |   },
 748 |   {
 749 |    "cell_type": "code",
 750 |    "execution_count": null,
 751 |    "metadata": {
 752 |     "collapsed": false
 753 |    },
 754 |    "outputs": [],
 755 |    "source": [
 756 |     "# train the model using X_train_dtm\n",
 757 |     "%time logreg.fit(X_train_dtm, y_train)"
 758 |    ]
 759 |   },
 760 |   {
 761 |    "cell_type": "code",
 762 |    "execution_count": null,
 763 |    "metadata": {
 764 |     "collapsed": true
 765 |    },
 766 |    "outputs": [],
 767 |    "source": [
 768 |     "# make class predictions for X_test_dtm\n",
 769 |     "y_pred_class = logreg.predict(X_test_dtm)"
 770 |    ]
 771 |   },
 772 |   {
 773 |    "cell_type": "code",
 774 |    "execution_count": null,
 775 |    "metadata": {
 776 |     "collapsed": false
 777 |    },
 778 |    "outputs": [],
 779 |    "source": [
 780 |     "# calculate predicted probabilities for X_test_dtm (well calibrated)\n",
 781 |     "y_pred_prob = logreg.predict_proba(X_test_dtm)[:, 1]\n",
 782 |     "y_pred_prob"
 783 |    ]
 784 |   },
 785 |   {
 786 |    "cell_type": "code",
 787 |    "execution_count": null,
 788 |    "metadata": {
 789 |     "collapsed": false
 790 |    },
 791 |    "outputs": [],
 792 |    "source": [
 793 |     "# calculate accuracy\n",
 794 |     "metrics.accuracy_score(y_test, y_pred_class)"
 795 |    ]
 796 |   },
 797 |   {
 798 |    "cell_type": "code",
 799 |    "execution_count": null,
 800 |    "metadata": {
 801 |     "collapsed": false
 802 |    },
 803 |    "outputs": [],
 804 |    "source": [
 805 |     "# calculate AUC\n",
 806 |     "metrics.roc_auc_score(y_test, y_pred_prob)"
 807 |    ]
 808 |   },
 809 |   {
 810 |    "cell_type": "markdown",
 811 |    "metadata": {},
 812 |    "source": [
 813 |     "## Part 7: Examining a model for further insight\n",
 814 |     "\n",
 815 |     "We will examine our **trained Naive Bayes model** to calculate the approximate **\"spamminess\" of each token**."
 816 |    ]
 817 |   },
 818 |   {
 819 |    "cell_type": "code",
 820 |    "execution_count": null,
 821 |    "metadata": {
 822 |     "collapsed": false
 823 |    },
 824 |    "outputs": [],
 825 |    "source": [
 826 |     "# store the vocabulary of X_train (note: in scikit-learn 1.2 and later, use get_feature_names_out() instead of get_feature_names())\n",
 827 |     "X_train_tokens = vect.get_feature_names()\n",
 828 |     "len(X_train_tokens)"
 829 |    ]
 830 |   },
 831 |   {
 832 |    "cell_type": "code",
 833 |    "execution_count": null,
 834 |    "metadata": {
 835 |     "collapsed": false,
 836 |     "scrolled": true
 837 |    },
 838 |    "outputs": [],
 839 |    "source": [
 840 |     "# examine the first 50 tokens\n",
 841 |     "print(X_train_tokens[0:50])"
 842 |    ]
 843 |   },
 844 |   {
 845 |    "cell_type": "code",
 846 |    "execution_count": null,
 847 |    "metadata": {
 848 |     "collapsed": false
 849 |    },
 850 |    "outputs": [],
 851 |    "source": [
 852 |     "# examine the last 50 tokens\n",
 853 |     "print(X_train_tokens[-50:])"
 854 |    ]
 855 |   },
 856 |   {
 857 |    "cell_type": "code",
 858 |    "execution_count": null,
 859 |    "metadata": {
 860 |     "collapsed": false
 861 |    },
 862 |    "outputs": [],
 863 |    "source": [
 864 |     "# Naive Bayes counts the number of times each token appears in each class\n",
 865 |     "nb.feature_count_"
 866 |    ]
 867 |   },
 868 |   {
 869 |    "cell_type": "code",
 870 |    "execution_count": null,
 871 |    "metadata": {
 872 |     "collapsed": false
 873 |    },
 874 |    "outputs": [],
 875 |    "source": [
 876 |     "# rows represent classes, columns represent tokens\n",
 877 |     "nb.feature_count_.shape"
 878 |    ]
 879 |   },
 880 |   {
 881 |    "cell_type": "code",
 882 |    "execution_count": null,
 883 |    "metadata": {
 884 |     "collapsed": false
 885 |    },
 886 |    "outputs": [],
 887 |    "source": [
 888 |     "# number of times each token appears across all HAM messages\n",
 889 |     "ham_token_count = nb.feature_count_[0, :]\n",
 890 |     "ham_token_count"
 891 |    ]
 892 |   },
 893 |   {
 894 |    "cell_type": "code",
 895 |    "execution_count": null,
 896 |    "metadata": {
 897 |     "collapsed": false
 898 |    },
 899 |    "outputs": [],
 900 |    "source": [
 901 |     "# number of times each token appears across all SPAM messages\n",
 902 |     "spam_token_count = nb.feature_count_[1, :]\n",
 903 |     "spam_token_count"
 904 |    ]
 905 |   },
 906 |   {
 907 |    "cell_type": "code",
 908 |    "execution_count": null,
 909 |    "metadata": {
 910 |     "collapsed": false
 911 |    },
 912 |    "outputs": [],
 913 |    "source": [
 914 |     "# create a DataFrame of tokens with their separate ham and spam counts\n",
 915 |     "tokens = pd.DataFrame({'token':X_train_tokens, 'ham':ham_token_count, 'spam':spam_token_count}).set_index('token')\n",
 916 |     "tokens.head()"
 917 |    ]
 918 |   },
 919 |   {
 920 |    "cell_type": "code",
 921 |    "execution_count": null,
 922 |    "metadata": {
 923 |     "collapsed": false
 924 |    },
 925 |    "outputs": [],
 926 |    "source": [
 927 |     "# examine 5 random DataFrame rows\n",
 928 |     "tokens.sample(5, random_state=6)"
 929 |    ]
 930 |   },
 931 |   {
 932 |    "cell_type": "code",
 933 |    "execution_count": null,
 934 |    "metadata": {
 935 |     "collapsed": false
 936 |    },
 937 |    "outputs": [],
 938 |    "source": [
 939 |     "# Naive Bayes counts the number of observations in each class\n",
 940 |     "nb.class_count_"
 941 |    ]
 942 |   },
 943 |   {
 944 |    "cell_type": "markdown",
 945 |    "metadata": {},
 946 |    "source": [
 947 |     "Before we can calculate the \"spamminess\" of each token, we need to avoid **dividing by zero** and account for the **class imbalance**."
 948 |    ]
 949 |   },
 950 |   {
 951 |    "cell_type": "code",
 952 |    "execution_count": null,
 953 |    "metadata": {
 954 |     "collapsed": false
 955 |    },
 956 |    "outputs": [],
 957 |    "source": [
 958 |     "# add 1 to ham and spam counts to avoid dividing by 0\n",
 959 |     "tokens['ham'] = tokens.ham + 1\n",
 960 |     "tokens['spam'] = tokens.spam + 1\n",
 961 |     "tokens.sample(5, random_state=6)"
 962 |    ]
 963 |   },
 964 |   {
 965 |    "cell_type": "code",
 966 |    "execution_count": null,
 967 |    "metadata": {
 968 |     "collapsed": false
 969 |    },
 970 |    "outputs": [],
 971 |    "source": [
 972 |     "# convert the ham and spam counts into frequencies\n",
 973 |     "tokens['ham'] = tokens.ham / nb.class_count_[0]\n",
 974 |     "tokens['spam'] = tokens.spam / nb.class_count_[1]\n",
 975 |     "tokens.sample(5, random_state=6)"
 976 |    ]
 977 |   },
 978 |   {
 979 |    "cell_type": "code",
 980 |    "execution_count": null,
 981 |    "metadata": {
 982 |     "collapsed": false
 983 |    },
 984 |    "outputs": [],
 985 |    "source": [
 986 |     "# calculate the ratio of spam-to-ham for each token\n",
 987 |     "tokens['spam_ratio'] = tokens.spam / tokens.ham\n",
 988 |     "tokens.sample(5, random_state=6)"
 989 |    ]
 990 |   },
 991 |   {
 992 |    "cell_type": "code",
 993 |    "execution_count": null,
 994 |    "metadata": {
 995 |     "collapsed": false
 996 |    },
 997 |    "outputs": [],
 998 |    "source": [
 999 |     "# examine the DataFrame sorted by spam_ratio\n",
1000 |     "# note: use sort() instead of sort_values() for pandas 0.16.2 and earlier\n",
1001 |     "tokens.sort_values('spam_ratio', ascending=False)"
1002 |    ]
1003 |   },
1004 |   {
1005 |    "cell_type": "code",
1006 |    "execution_count": null,
1007 |    "metadata": {
1008 |     "collapsed": false
1009 |    },
1010 |    "outputs": [],
1011 |    "source": [
1012 |     "# look up the spam_ratio for a given token\n",
1013 |     "tokens.loc['dating', 'spam_ratio']"
1014 |    ]
1015 |   },
1016 |   {
1017 |    "cell_type": "markdown",
1018 |    "metadata": {},
1019 |    "source": [
1020 |     "## Part 8: Practicing this workflow on another dataset\n",
1021 |     "\n",
1022 |     "Please open the **`exercise.ipynb`** notebook (or the **`exercise.py`** script)."
1023 |    ]
1024 |   },
1025 |   {
1026 |    "cell_type": "markdown",
1027 |    "metadata": {},
1028 |    "source": [
1029 |     "## Part 9: Tuning the vectorizer (discussion)\n",
1030 |     "\n",
1031 |     "Thus far, we have been using the default parameters of [CountVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html):"
1032 |    ]
1033 |   },
1034 |   {
1035 |    "cell_type": "code",
1036 |    "execution_count": null,
1037 |    "metadata": {
1038 |     "collapsed": false
1039 |    },
1040 |    "outputs": [],
1041 |    "source": [
1042 |     "# show default parameters for CountVectorizer\n",
1043 |     "vect"
1044 |    ]
1045 |   },
1046 |   {
1047 |    "cell_type": "markdown",
1048 |    "metadata": {},
1049 |    "source": [
1050 |     "However, the vectorizer is worth tuning, just like a model is worth tuning! Here are a few parameters that you might want to tune:\n",
1051 |     "\n",
1052 |     "- **stop_words:** string {'english'}, list, or None (default)\n",
1053 |     "    - If 'english', a built-in stop word list for English is used.\n",
1054 |     "    - If a list, that list is assumed to contain stop words, all of which will be removed from the resulting tokens.\n",
1055 |     "    - If None, no stop words will be used."
1056 |    ]
1057 |   },
1058 |   {
1059 |    "cell_type": "code",
1060 |    "execution_count": null,
1061 |    "metadata": {
1062 |     "collapsed": true
1063 |    },
1064 |    "outputs": [],
1065 |    "source": [
1066 |     "# remove English stop words\n",
1067 |     "vect = CountVectorizer(stop_words='english')"
1068 |    ]
1069 |   },
1070 |   {
1071 |    "cell_type": "markdown",
1072 |    "metadata": {},
1073 |    "source": [
1074 |     "- **ngram_range:** tuple (min_n, max_n), default=(1, 1)\n",
1075 |     "    - The lower and upper boundary of the range of n-values for different n-grams to be extracted.\n",
1076 |     "    - All values of n such that min_n <= n <= max_n will be used."
1077 |    ]
1078 |   },
1079 |   {
1080 |    "cell_type": "code",
1081 |    "execution_count": null,
1082 |    "metadata": {
1083 |     "collapsed": true
1084 |    },
1085 |    "outputs": [],
1086 |    "source": [
1087 |     "# include 1-grams and 2-grams\n",
1088 |     "vect = CountVectorizer(ngram_range=(1, 2))"
1089 |    ]
1090 |   },
1091 |   {
1092 |    "cell_type": "markdown",
1093 |    "metadata": {},
1094 |    "source": [
1095 |     "- **max_df:** float in range [0.0, 1.0] or int, default=1.0\n",
1096 |     "    - When building the vocabulary, ignore terms that have a document frequency strictly higher than the given threshold (corpus-specific stop words).\n",
1097 |     "    - If float, the parameter represents a proportion of documents.\n",
1098 |     "    - If integer, the parameter represents an absolute count."
1099 |    ]
1100 |   },
1101 |   {
1102 |    "cell_type": "code",
1103 |    "execution_count": null,
1104 |    "metadata": {
1105 |     "collapsed": true
1106 |    },
1107 |    "outputs": [],
1108 |    "source": [
1109 |     "# ignore terms that appear in more than 50% of the documents\n",
1110 |     "vect = CountVectorizer(max_df=0.5)"
1111 |    ]
1112 |   },
1113 |   {
1114 |    "cell_type": "markdown",
1115 |    "metadata": {},
1116 |    "source": [
1117 |     "- **min_df:** float in range [0.0, 1.0] or int, default=1\n",
1118 |     "    - When building the vocabulary, ignore terms that have a document frequency strictly lower than the given threshold. (This value is also called \"cut-off\" in the literature.)\n",
1119 |     "    - If float, the parameter represents a proportion of documents.\n",
1120 |     "    - If integer, the parameter represents an absolute count."
1121 |    ]
1122 |   },
1123 |   {
1124 |    "cell_type": "code",
1125 |    "execution_count": null,
1126 |    "metadata": {
1127 |     "collapsed": true
1128 |    },
1129 |    "outputs": [],
1130 |    "source": [
1131 |     "# only keep terms that appear in at least 2 documents\n",
1132 |     "vect = CountVectorizer(min_df=2)"
1133 |    ]
1134 |   },
1135 |   {
1136 |    "cell_type": "markdown",
1137 |    "metadata": {},
1138 |    "source": [
1139 |     "**Guidelines for tuning CountVectorizer:**\n",
1140 |     "\n",
1141 |     "- Use your knowledge of the **problem** and the **text**, and your understanding of the **tuning parameters**, to help you decide what parameters to tune and how to tune them.\n",
1142 |     "- **Experiment**, and let the data tell you the best approach!"
1143 |    ]
1144 |   }
1145 |  ],
1146 |  "metadata": {
1147 |   "kernelspec": {
1148 |    "display_name": "Python 2",
1149 |    "language": "python",
1150 |    "name": "python2"
1151 |   },
1152 |   "language_info": {
1153 |    "codemirror_mode": {
1154 |     "name": "ipython",
1155 |     "version": 2
1156 |    },
1157 |    "file_extension": ".py",
1158 |    "mimetype": "text/x-python",
1159 |    "name": "python",
1160 |    "nbconvert_exporter": "python",
1161 |    "pygments_lexer": "ipython2",
1162 |    "version": "2.7.11"
1163 |   }
1164 |  },
1165 |  "nbformat": 4,
1166 |  "nbformat_minor": 0
1167 | }
1168 | 


--------------------------------------------------------------------------------
/exercise_solution.ipynb:
--------------------------------------------------------------------------------
   1 | {
   2 |  "cells": [
   3 |   {
   4 |    "cell_type": "markdown",
   5 |    "metadata": {},
   6 |    "source": [
   7 |     "# Tutorial Exercise: Yelp reviews (Solution)"
   8 |    ]
   9 |   },
  10 |   {
  11 |    "cell_type": "markdown",
  12 |    "metadata": {},
  13 |    "source": [
  14 |     "## Introduction\n",
  15 |     "\n",
  16 |     "This exercise uses a small subset of the data from Kaggle's [Yelp Business Rating Prediction](https://www.kaggle.com/c/yelp-recsys-2013) competition.\n",
  17 |     "\n",
  18 |     "**Description of the data:**\n",
  19 |     "\n",
  20 |     "- **`yelp.csv`** contains the dataset. It is stored in the repository (in the **`data`** directory), so there is no need to download anything from the Kaggle website.\n",
  21 |     "- Each observation (row) in this dataset is a review of a particular business by a particular user.\n",
  22 |     "- The **stars** column is the number of stars (1 through 5) assigned by the reviewer to the business. (More stars is better.) In other words, it is the rating of the business by the person who wrote the review.\n",
  23 |     "- The **text** column is the text of the review.\n",
  24 |     "\n",
  25 |     "**Goal:** Predict the star rating of a review using **only** the review text.\n",
  26 |     "\n",
  27 |     "**Tip:** After each task, I recommend that you check the shape and the contents of your objects, to confirm that they match your expectations."
  28 |    ]
  29 |   },
  30 |   {
  31 |    "cell_type": "code",
  32 |    "execution_count": 1,
  33 |    "metadata": {
  34 |     "collapsed": true
  35 |    },
  36 |    "outputs": [],
  37 |    "source": [
  38 |     "# for Python 2: use print only as a function\n",
  39 |     "from __future__ import print_function"
  40 |    ]
  41 |   },
  42 |   {
  43 |    "cell_type": "markdown",
  44 |    "metadata": {},
  45 |    "source": [
  46 |     "## Task 1\n",
  47 |     "\n",
  48 |     "Read **`yelp.csv`** into a pandas DataFrame and examine it."
  49 |    ]
  50 |   },
  51 |   {
  52 |    "cell_type": "code",
  53 |    "execution_count": 2,
  54 |    "metadata": {
  55 |     "collapsed": true
  56 |    },
  57 |    "outputs": [],
  58 |    "source": [
  59 |     "# read yelp.csv using a relative path\n",
  60 |     "import pandas as pd\n",
  61 |     "path = 'data/yelp.csv'\n",
  62 |     "yelp = pd.read_csv(path)"
  63 |    ]
  64 |   },
  65 |   {
  66 |    "cell_type": "code",
  67 |    "execution_count": 3,
  68 |    "metadata": {
  69 |     "collapsed": false
  70 |    },
  71 |    "outputs": [
  72 |     {
  73 |      "data": {
  74 |       "text/plain": [
  75 |        "(10000, 10)"
  76 |       ]
  77 |      },
  78 |      "execution_count": 3,
  79 |      "metadata": {},
  80 |      "output_type": "execute_result"
  81 |     }
  82 |    ],
  83 |    "source": [
  84 |     "# examine the shape\n",
  85 |     "yelp.shape"
  86 |    ]
  87 |   },
  88 |   {
  89 |    "cell_type": "code",
  90 |    "execution_count": 4,
  91 |    "metadata": {
  92 |     "collapsed": false
  93 |    },
  94 |    "outputs": [
  95 |     {
  96 |      "data": {
  97 |       "text/html": [
  98 |        "<div>\n",
  99 |        "<table border=\"1\" class=\"dataframe\">\n",
 100 |        "  <thead>\n",
 101 |        "    <tr style=\"text-align: right;\">\n",
 102 |        "      <th></th>\n",
 103 |        "      <th>business_id</th>\n",
 104 |        "      <th>date</th>\n",
 105 |        "      <th>review_id</th>\n",
 106 |        "      <th>stars</th>\n",
 107 |        "      <th>text</th>\n",
 108 |        "      <th>type</th>\n",
 109 |        "      <th>user_id</th>\n",
 110 |        "      <th>cool</th>\n",
 111 |        "      <th>useful</th>\n",
 112 |        "      <th>funny</th>\n",
 113 |        "    </tr>\n",
 114 |        "  </thead>\n",
 115 |        "  <tbody>\n",
 116 |        "    <tr>\n",
 117 |        "      <th>0</th>\n",
 118 |        "      <td>9yKzy9PApeiPPOUJEtnvkg</td>\n",
 119 |        "      <td>2011-01-26</td>\n",
 120 |        "      <td>fWKvX83p0-ka4JS3dc6E5A</td>\n",
 121 |        "      <td>5</td>\n",
 122 |        "      <td>My wife took me here on my birthday for breakf...</td>\n",
 123 |        "      <td>review</td>\n",
 124 |        "      <td>rLtl8ZkDX5vH5nAx9C3q5Q</td>\n",
 125 |        "      <td>2</td>\n",
 126 |        "      <td>5</td>\n",
 127 |        "      <td>0</td>\n",
 128 |        "    </tr>\n",
 129 |        "  </tbody>\n",
 130 |        "</table>\n",
 131 |        "</div>"
 132 |       ],
 133 |       "text/plain": [
 134 |        "              business_id        date               review_id  stars  \\\n",
 135 |        "0  9yKzy9PApeiPPOUJEtnvkg  2011-01-26  fWKvX83p0-ka4JS3dc6E5A      5   \n",
 136 |        "\n",
 137 |        "                                                text    type  \\\n",
 138 |        "0  My wife took me here on my birthday for breakf...  review   \n",
 139 |        "\n",
 140 |        "                  user_id  cool  useful  funny  \n",
 141 |        "0  rLtl8ZkDX5vH5nAx9C3q5Q     2       5      0  "
 142 |       ]
 143 |      },
 144 |      "execution_count": 4,
 145 |      "metadata": {},
 146 |      "output_type": "execute_result"
 147 |     }
 148 |    ],
 149 |    "source": [
 150 |     "# examine the first row\n",
 151 |     "yelp.head(1)"
 152 |    ]
 153 |   },
 154 |   {
 155 |    "cell_type": "code",
 156 |    "execution_count": 5,
 157 |    "metadata": {
 158 |     "collapsed": false
 159 |    },
 160 |    "outputs": [
 161 |     {
 162 |      "data": {
 163 |       "text/plain": [
 164 |        "1     749\n",
 165 |        "2     927\n",
 166 |        "3    1461\n",
 167 |        "4    3526\n",
 168 |        "5    3337\n",
 169 |        "Name: stars, dtype: int64"
 170 |       ]
 171 |      },
 172 |      "execution_count": 5,
 173 |      "metadata": {},
 174 |      "output_type": "execute_result"
 175 |     }
 176 |    ],
 177 |    "source": [
 178 |     "# examine the class distribution\n",
 179 |     "yelp.stars.value_counts().sort_index()"
 180 |    ]
 181 |   },
 182 |   {
 183 |    "cell_type": "markdown",
 184 |    "metadata": {},
 185 |    "source": [
 186 |     "## Task 2\n",
 187 |     "\n",
 188 |     "Create a new DataFrame that only contains the **5-star** and **1-star** reviews.\n",
 189 |     "\n",
 190 |     "- **Hint:** [How do I apply multiple filter criteria to a pandas DataFrame?](http://nbviewer.jupyter.org/github/justmarkham/pandas-videos/blob/master/pandas.ipynb#9.-How-do-I-apply-multiple-filter-criteria-to-a-pandas-DataFrame%3F-%28video%29) explains how to do this."
 191 |    ]
 192 |   },
 193 |   {
 194 |    "cell_type": "code",
 195 |    "execution_count": 6,
 196 |    "metadata": {
 197 |     "collapsed": true
 198 |    },
 199 |    "outputs": [],
 200 |    "source": [
 201 |     "# filter the DataFrame using an OR condition\n",
 202 |     "yelp_best_worst = yelp[(yelp.stars==5) | (yelp.stars==1)]\n",
 203 |     "\n",
 204 |     "# equivalently, use the 'loc' indexer to select the same rows and all columns\n",
 205 |     "yelp_best_worst = yelp.loc[(yelp.stars==5) | (yelp.stars==1), :]"
 206 |    ]
 207 |   },
 208 |   {
 209 |    "cell_type": "code",
 210 |    "execution_count": 7,
 211 |    "metadata": {
 212 |     "collapsed": false
 213 |    },
 214 |    "outputs": [
 215 |     {
 216 |      "data": {
 217 |       "text/plain": [
 218 |        "(4086, 10)"
 219 |       ]
 220 |      },
 221 |      "execution_count": 7,
 222 |      "metadata": {},
 223 |      "output_type": "execute_result"
 224 |     }
 225 |    ],
 226 |    "source": [
 227 |     "# examine the shape\n",
 228 |     "yelp_best_worst.shape"
 229 |    ]
 230 |   },
 231 |   {
 232 |    "cell_type": "markdown",
 233 |    "metadata": {},
 234 |    "source": [
 235 |     "## Task 3\n",
 236 |     "\n",
 237 |     "Define X and y from the new DataFrame, and then split X and y into training and testing sets, using the **review text** as the only feature and the **star rating** as the response.\n",
 238 |     "\n",
 239 |     "- **Hint:** Keep in mind that X should be a pandas Series (not a DataFrame), since we will pass it to CountVectorizer in the task that follows."
 240 |    ]
 241 |   },
 242 |   {
 243 |    "cell_type": "code",
 244 |    "execution_count": 8,
 245 |    "metadata": {
 246 |     "collapsed": true
 247 |    },
 248 |    "outputs": [],
 249 |    "source": [
 250 |     "# define X and y\n",
 251 |     "X = yelp_best_worst.text\n",
 252 |     "y = yelp_best_worst.stars"
 253 |    ]
 254 |   },
 255 |   {
 256 |    "cell_type": "code",
 257 |    "execution_count": 9,
 258 |    "metadata": {
 259 |     "collapsed": true
 260 |    },
 261 |    "outputs": [],
 262 |    "source": [
 263 |     "# split X and y into training and testing sets\n",
 264 |     "from sklearn.model_selection import train_test_split\n",
 265 |     "X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)"
 266 |    ]
 267 |   },
 268 |   {
 269 |    "cell_type": "code",
 270 |    "execution_count": 10,
 271 |    "metadata": {
 272 |     "collapsed": false
 273 |    },
 274 |    "outputs": [
 275 |     {
 276 |      "name": "stdout",
 277 |      "output_type": "stream",
 278 |      "text": [
 279 |       "(3064L,)\n",
 280 |       "(1022L,)\n",
 281 |       "(3064L,)\n",
 282 |       "(1022L,)\n"
 283 |      ]
 284 |     }
 285 |    ],
 286 |    "source": [
 287 |     "# examine the object shapes\n",
 288 |     "print(X_train.shape)\n",
 289 |     "print(X_test.shape)\n",
 290 |     "print(y_train.shape)\n",
 291 |     "print(y_test.shape)"
 292 |    ]
 293 |   },
 294 |   {
 295 |    "cell_type": "markdown",
 296 |    "metadata": {},
 297 |    "source": [
 298 |     "## Task 4\n",
 299 |     "\n",
 300 |     "Use CountVectorizer to create **document-term matrices** from X_train and X_test."
 301 |    ]
 302 |   },
 303 |   {
 304 |    "cell_type": "code",
 305 |    "execution_count": 11,
 306 |    "metadata": {
 307 |     "collapsed": true
 308 |    },
 309 |    "outputs": [],
 310 |    "source": [
 311 |     "# import and instantiate CountVectorizer\n",
 312 |     "from sklearn.feature_extraction.text import CountVectorizer\n",
 313 |     "vect = CountVectorizer()"
 314 |    ]
 315 |   },
 316 |   {
 317 |    "cell_type": "code",
 318 |    "execution_count": 12,
 319 |    "metadata": {
 320 |     "collapsed": false
 321 |    },
 322 |    "outputs": [
 323 |     {
 324 |      "data": {
 325 |       "text/plain": [
 326 |        "(3064, 16825)"
 327 |       ]
 328 |      },
 329 |      "execution_count": 12,
 330 |      "metadata": {},
 331 |      "output_type": "execute_result"
 332 |     }
 333 |    ],
 334 |    "source": [
 335 |     "# fit and transform X_train into X_train_dtm\n",
 336 |     "X_train_dtm = vect.fit_transform(X_train)\n",
 337 |     "X_train_dtm.shape"
 338 |    ]
 339 |   },
 340 |   {
 341 |    "cell_type": "code",
 342 |    "execution_count": 13,
 343 |    "metadata": {
 344 |     "collapsed": false
 345 |    },
 346 |    "outputs": [
 347 |     {
 348 |      "data": {
 349 |       "text/plain": [
 350 |        "(1022, 16825)"
 351 |       ]
 352 |      },
 353 |      "execution_count": 13,
 354 |      "metadata": {},
 355 |      "output_type": "execute_result"
 356 |     }
 357 |    ],
 358 |    "source": [
 359 |     "# transform X_test into X_test_dtm\n",
 360 |     "X_test_dtm = vect.transform(X_test)\n",
 361 |     "X_test_dtm.shape"
 362 |    ]
 363 |   },
 364 |   {
 365 |    "cell_type": "markdown",
 366 |    "metadata": {},
 367 |    "source": [
 368 |     "## Task 5\n",
 369 |     "\n",
 370 |     "Use multinomial Naive Bayes to **predict the star rating** for the reviews in the testing set, and then **calculate the accuracy** and **print the confusion matrix**.\n",
 371 |     "\n",
 372 |     "- **Hint:** [Evaluating a classification model](https://github.com/justmarkham/scikit-learn-videos/blob/master/09_classification_metrics.ipynb) explains how to interpret both classification accuracy and the confusion matrix."
 373 |    ]
 374 |   },
 375 |   {
 376 |    "cell_type": "code",
 377 |    "execution_count": 14,
 378 |    "metadata": {
 379 |     "collapsed": true
 380 |    },
 381 |    "outputs": [],
 382 |    "source": [
 383 |     "# import and instantiate MultinomialNB\n",
 384 |     "from sklearn.naive_bayes import MultinomialNB\n",
 385 |     "nb = MultinomialNB()"
 386 |    ]
 387 |   },
 388 |   {
 389 |    "cell_type": "code",
 390 |    "execution_count": 15,
 391 |    "metadata": {
 392 |     "collapsed": false
 393 |    },
 394 |    "outputs": [
 395 |     {
 396 |      "data": {
 397 |       "text/plain": [
 398 |        "MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)"
 399 |       ]
 400 |      },
 401 |      "execution_count": 15,
 402 |      "metadata": {},
 403 |      "output_type": "execute_result"
 404 |     }
 405 |    ],
 406 |    "source": [
 407 |     "# train the model using X_train_dtm\n",
 408 |     "nb.fit(X_train_dtm, y_train)"
 409 |    ]
 410 |   },
 411 |   {
 412 |    "cell_type": "code",
 413 |    "execution_count": 16,
 414 |    "metadata": {
 415 |     "collapsed": true
 416 |    },
 417 |    "outputs": [],
 418 |    "source": [
 419 |     "# make class predictions for X_test_dtm\n",
 420 |     "y_pred_class = nb.predict(X_test_dtm)"
 421 |    ]
 422 |   },
 423 |   {
 424 |    "cell_type": "code",
 425 |    "execution_count": 17,
 426 |    "metadata": {
 427 |     "collapsed": false
 428 |    },
 429 |    "outputs": [
 430 |     {
 431 |      "data": {
 432 |       "text/plain": [
 433 |        "0.91878669275929548"
 434 |       ]
 435 |      },
 436 |      "execution_count": 17,
 437 |      "metadata": {},
 438 |      "output_type": "execute_result"
 439 |     }
 440 |    ],
 441 |    "source": [
 442 |     "# calculate accuracy of class predictions\n",
 443 |     "from sklearn import metrics\n",
 444 |     "metrics.accuracy_score(y_test, y_pred_class)"
 445 |    ]
 446 |   },
 447 |   {
 448 |    "cell_type": "code",
 449 |    "execution_count": 18,
 450 |    "metadata": {
 451 |     "collapsed": false
 452 |    },
 453 |    "outputs": [
 454 |     {
 455 |      "data": {
 456 |       "text/plain": [
 457 |        "array([[126,  58],\n",
 458 |        "       [ 25, 813]])"
 459 |       ]
 460 |      },
 461 |      "execution_count": 18,
 462 |      "metadata": {},
 463 |      "output_type": "execute_result"
 464 |     }
 465 |    ],
 466 |    "source": [
 467 |     "# print the confusion matrix\n",
 468 |     "metrics.confusion_matrix(y_test, y_pred_class)"
 469 |    ]
 470 |   },
 471 |   {
 472 |    "cell_type": "markdown",
 473 |    "metadata": {},
 474 |    "source": [
 475 |     "## Task 6 (Challenge)\n",
 476 |     "\n",
 477 |     "Calculate the **null accuracy**, which is the classification accuracy that could be achieved by always predicting the most frequent class.\n",
 478 |     "\n",
 479 |     "- **Hint:** [Evaluating a classification model](https://github.com/justmarkham/scikit-learn-videos/blob/master/09_classification_metrics.ipynb) explains null accuracy and demonstrates two ways to calculate it, though only one of those ways will work in this case. Alternatively, you can come up with your own method to calculate null accuracy!"
 480 |    ]
 481 |   },
 482 |   {
 483 |    "cell_type": "code",
 484 |    "execution_count": 19,
 485 |    "metadata": {
 486 |     "collapsed": false
 487 |    },
 488 |    "outputs": [
 489 |     {
 490 |      "data": {
 491 |       "text/plain": [
 492 |        "5    838\n",
 493 |        "1    184\n",
 494 |        "Name: stars, dtype: int64"
 495 |       ]
 496 |      },
 497 |      "execution_count": 19,
 498 |      "metadata": {},
 499 |      "output_type": "execute_result"
 500 |     }
 501 |    ],
 502 |    "source": [
 503 |     "# examine the class distribution of the testing set\n",
 504 |     "y_test.value_counts()"
 505 |    ]
 506 |   },
 507 |   {
 508 |    "cell_type": "code",
 509 |    "execution_count": 20,
 510 |    "metadata": {
 511 |     "collapsed": false
 512 |    },
 513 |    "outputs": [
 514 |     {
 515 |      "data": {
 516 |       "text/plain": [
 517 |        "5    0.819961\n",
 518 |        "Name: stars, dtype: float64"
 519 |       ]
 520 |      },
 521 |      "execution_count": 20,
 522 |      "metadata": {},
 523 |      "output_type": "execute_result"
 524 |     }
 525 |    ],
 526 |    "source": [
 527 |     "# calculate null accuracy\n",
 528 |     "y_test.value_counts().head(1) / len(y_test)"
 529 |    ]
 530 |   },
 531 |   {
 532 |    "cell_type": "code",
 533 |    "execution_count": 21,
 534 |    "metadata": {
 535 |     "collapsed": false
 536 |    },
 537 |    "outputs": [
 538 |     {
 539 |      "data": {
 540 |       "text/plain": [
 541 |        "0.8199608610567515"
 542 |       ]
 543 |      },
 544 |      "execution_count": 21,
 545 |      "metadata": {},
 546 |      "output_type": "execute_result"
 547 |     }
 548 |    ],
 549 |    "source": [
 550 |     "# calculate null accuracy manually\n",
 551 |     "838 / float(838 + 184)"
 552 |    ]
 553 |   },
 554 |   {
 555 |    "cell_type": "markdown",
 556 |    "metadata": {},
 557 |    "source": [
 558 |     "## Task 7 (Challenge)\n",
 559 |     "\n",
 560 |     "Browse through the review text of some of the **false positives** and **false negatives**. Based on your knowledge of how Naive Bayes works, do you have any ideas about why the model is incorrectly classifying these reviews?\n",
 561 |     "\n",
 562 |     "- **Hint:** [Evaluating a classification model](https://github.com/justmarkham/scikit-learn-videos/blob/master/09_classification_metrics.ipynb) explains the definitions of \"false positives\" and \"false negatives\".\n",
 563 |     "- **Hint:** Think about what a false positive means in this context, and what a false negative means in this context. What has scikit-learn defined as the \"positive class\"?"
 564 |    ]
 565 |   },
 566 |   {
 567 |    "cell_type": "code",
 568 |    "execution_count": 22,
 569 |    "metadata": {
 570 |     "collapsed": false
 571 |    },
 572 |    "outputs": [
 573 |     {
 574 |      "data": {
 575 |       "text/plain": [
 576 |        "2175    This has to be the worst restaurant in terms o...\n",
 577 |        "1781    If you like the stuck up Scottsdale vibe this ...\n",
 578 |        "2674    I'm sorry to be what seems to be the lone one ...\n",
 579 |        "9984    Went last night to Whore Foods to get basics t...\n",
 580 |        "3392    I found Lisa G's while driving through phoenix...\n",
 581 |        "8283    Don't know where I should start. Grand opening...\n",
 582 |        "2765    Went last week, and ordered a dozen variety. I...\n",
 583 |        "2839    Never Again,\\r\\nI brought my Mountain Bike in ...\n",
 584 |        "321     My wife and I live around the corner, hadn't e...\n",
 585 |        "1919                                         D-scust-ing.\n",
 586 |        "Name: text, dtype: object"
 587 |       ]
 588 |      },
 589 |      "execution_count": 22,
 590 |      "metadata": {},
 591 |      "output_type": "execute_result"
 592 |     }
 593 |    ],
 594 |    "source": [
 595 |     "# first 10 false positives (1-star reviews incorrectly classified as 5-star reviews)\n",
 596 |     "X_test[y_test < y_pred_class].head(10)"
 597 |    ]
 598 |   },
 599 |   {
 600 |    "cell_type": "code",
 601 |    "execution_count": 23,
 602 |    "metadata": {
 603 |     "collapsed": false
 604 |    },
 605 |    "outputs": [
 606 |     {
 607 |      "data": {
 608 |       "text/plain": [
 609 |        "\"If you like the stuck up Scottsdale vibe this is a good place for you. The food isn't impressive. Nice outdoor seating.\""
 610 |       ]
 611 |      },
 612 |      "execution_count": 23,
 613 |      "metadata": {},
 614 |      "output_type": "execute_result"
 615 |     }
 616 |    ],
 617 |    "source": [
 618 |     "# false positive: model is reacting to the words \"good\", \"impressive\", \"nice\"\n",
 619 |     "X_test[1781]"
 620 |    ]
 621 |   },
 622 |   {
 623 |    "cell_type": "code",
 624 |    "execution_count": 24,
 625 |    "metadata": {
 626 |     "collapsed": false
 627 |    },
 628 |    "outputs": [
 629 |     {
 630 |      "data": {
 631 |       "text/plain": [
 632 |        "'D-scust-ing.'"
 633 |       ]
 634 |      },
 635 |      "execution_count": 24,
 636 |      "metadata": {},
 637 |      "output_type": "execute_result"
 638 |     }
 639 |    ],
 640 |    "source": [
 641 |     "# false positive: model does not have enough data to work with\n",
 642 |     "X_test[1919]"
 643 |    ]
 644 |   },
 645 |   {
 646 |    "cell_type": "code",
 647 |    "execution_count": 25,
 648 |    "metadata": {
 649 |     "collapsed": false
 650 |    },
 651 |    "outputs": [
 652 |     {
 653 |      "data": {
 654 |       "text/plain": [
 655 |        "7148    I now consider myself an Arizonian. If you dri...\n",
 656 |        "4963    This is by far my favourite department store, ...\n",
 657 |        "6318    Since I have ranted recently on poor customer ...\n",
 658 |        "380     This is a must try for any Mani Pedi fan. I us...\n",
 659 |        "5565    I`ve had work done by this shop a few times th...\n",
 660 |        "3448    I was there last week with my sisters and whil...\n",
 661 |        "6050    I went to sears today to check on a layaway th...\n",
 662 |        "2504    I've passed by prestige nails in walmart 100s ...\n",
 663 |        "2475    This place is so great! I am a nanny and had t...\n",
 664 |        "241     I was sad to come back to lai lai's and they n...\n",
 665 |        "Name: text, dtype: object"
 666 |       ]
 667 |      },
 668 |      "execution_count": 25,
 669 |      "metadata": {},
 670 |      "output_type": "execute_result"
 671 |     }
 672 |    ],
 673 |    "source": [
 674 |     "# first 10 false negatives (5-star reviews incorrectly classified as 1-star reviews)\n",
 675 |     "X_test[y_test > y_pred_class].head(10)"
 676 |    ]
 677 |   },
 678 |   {
 679 |    "cell_type": "code",
 680 |    "execution_count": 26,
 681 |    "metadata": {
 682 |     "collapsed": false
 683 |    },
 684 |    "outputs": [
 685 |     {
 686 |      "data": {
 687 |       "text/plain": [
 688 |        "'This is by far my favourite department store, hands down. I have had nothing but perfect experiences in this store, without exception, no matter what department I\\'m in. The shoe SA\\'s will bend over backwards to help you find a specific shoe, and the staff will even go so far as to send out hand-written thank you cards to your home address after you make a purchase - big or small. Tim & Anthony in the shoe salon are fabulous beyond words! \\r\\n\\r\\nI am not completely sure that I understand why people complain about the amount of merchandise on the floor or the lack of crowds in this store. Frankly, I would rather not be bombarded with merchandise and other people. One of the things I love the most about Barney\\'s is not only the prompt attention of SA\\'s, but the fact that they aren\\'t rushing around trying to help 35 people at once. The SA\\'s at Barney\\'s are incredibly friendly and will stop to have an actual conversation, regardless or whether you are purchasing something or not. I have also never experienced a \"high pressure\" sale situation here.\\r\\n\\r\\nAll in all, Barneys is pricey, and there is no getting around it. But, um, so is Neiman\\'s and that place is a crock. Anywhere that ONLY accepts American Express or their charge card and then treats you like scum if you aren\\'t carrying neither is no place that I want to spend my hard earned dollars. Yay Barneys!'"
 689 |       ]
 690 |      },
 691 |      "execution_count": 26,
 692 |      "metadata": {},
 693 |      "output_type": "execute_result"
 694 |     }
 695 |    ],
 696 |    "source": [
 697 |     "# false negative: model is reacting to the words \"complain\", \"crowds\", \"rushing\", \"pricey\", \"scum\"\n",
 698 |     "X_test[4963]"
 699 |    ]
 700 |   },
 701 |   {
 702 |    "cell_type": "markdown",
 703 |    "metadata": {},
 704 |    "source": [
 705 |     "## Task 8 (Challenge)\n",
 706 |     "\n",
 707 |     "Calculate which 10 tokens are the most predictive of **5-star reviews**, and which 10 tokens are the most predictive of **1-star reviews**.\n",
 708 |     "\n",
 709 |     "- **Hint:** Naive Bayes automatically counts the number of times each token appears in each class, as well as the number of observations in each class. You can access these counts via the `feature_count_` and `class_count_` attributes of the Naive Bayes model object."
 710 |    ]
 711 |   },
 712 |   {
 713 |    "cell_type": "code",
 714 |    "execution_count": 27,
 715 |    "metadata": {
 716 |     "collapsed": false
 717 |    },
 718 |    "outputs": [
 719 |     {
 720 |      "data": {
 721 |       "text/plain": [
 722 |        "16825"
 723 |       ]
 724 |      },
 725 |      "execution_count": 27,
 726 |      "metadata": {},
 727 |      "output_type": "execute_result"
 728 |     }
 729 |    ],
 730 |    "source": [
 731 |     "# store the vocabulary of X_train\n",
 732 |     "X_train_tokens = vect.get_feature_names_out()\n",
 733 |     "len(X_train_tokens)"
 734 |    ]
 735 |   },
 736 |   {
 737 |    "cell_type": "code",
 738 |    "execution_count": 28,
 739 |    "metadata": {
 740 |     "collapsed": false
 741 |    },
 742 |    "outputs": [
 743 |     {
 744 |      "data": {
 745 |       "text/plain": [
 746 |        "(2L, 16825L)"
 747 |       ]
 748 |      },
 749 |      "execution_count": 28,
 750 |      "metadata": {},
 751 |      "output_type": "execute_result"
 752 |     }
 753 |    ],
 754 |    "source": [
 755 |     "# first row is one-star reviews, second row is five-star reviews\n",
 756 |     "nb.feature_count_.shape"
 757 |    ]
 758 |   },
 759 |   {
 760 |    "cell_type": "code",
 761 |    "execution_count": 29,
 762 |    "metadata": {
 763 |     "collapsed": true
 764 |    },
 765 |    "outputs": [],
 766 |    "source": [
 767 |     "# store the number of times each token appears in each class\n",
 768 |     "one_star_token_count = nb.feature_count_[0, :]\n",
 769 |     "five_star_token_count = nb.feature_count_[1, :]"
 770 |    ]
 771 |   },
 772 |   {
 773 |    "cell_type": "code",
 774 |    "execution_count": 30,
 775 |    "metadata": {
 776 |     "collapsed": true
 777 |    },
 778 |    "outputs": [],
 779 |    "source": [
 780 |     "# create a DataFrame of tokens with their separate one-star and five-star counts\n",
 781 |     "tokens = pd.DataFrame({'token':X_train_tokens, 'one_star':one_star_token_count, 'five_star':five_star_token_count}).set_index('token')"
 782 |    ]
 783 |   },
 784 |   {
 785 |    "cell_type": "code",
 786 |    "execution_count": 31,
 787 |    "metadata": {
 788 |     "collapsed": true
 789 |    },
 790 |    "outputs": [],
 791 |    "source": [
 792 |     "# add 1 to one-star and five-star counts to avoid dividing by 0\n",
 793 |     "tokens['one_star'] = tokens.one_star + 1\n",
 794 |     "tokens['five_star'] = tokens.five_star + 1"
 795 |    ]
 796 |   },
 797 |   {
 798 |    "cell_type": "code",
 799 |    "execution_count": 32,
 800 |    "metadata": {
 801 |     "collapsed": false
 802 |    },
 803 |    "outputs": [
 804 |     {
 805 |      "data": {
 806 |       "text/plain": [
 807 |        "array([  565.,  2499.])"
 808 |       ]
 809 |      },
 810 |      "execution_count": 32,
 811 |      "metadata": {},
 812 |      "output_type": "execute_result"
 813 |     }
 814 |    ],
 815 |    "source": [
 816 |     "# first number is one-star reviews, second number is five-star reviews\n",
 817 |     "nb.class_count_"
 818 |    ]
 819 |   },
 820 |   {
 821 |    "cell_type": "code",
 822 |    "execution_count": 33,
 823 |    "metadata": {
 824 |     "collapsed": true
 825 |    },
 826 |    "outputs": [],
 827 |    "source": [
 828 |     "# convert the one-star and five-star counts into frequencies\n",
 829 |     "tokens['one_star'] = tokens.one_star / nb.class_count_[0]\n",
 830 |     "tokens['five_star'] = tokens.five_star / nb.class_count_[1]"
 831 |    ]
 832 |   },
 833 |   {
 834 |    "cell_type": "code",
 835 |    "execution_count": 34,
 836 |    "metadata": {
 837 |     "collapsed": true
 838 |    },
 839 |    "outputs": [],
 840 |    "source": [
 841 |     "# calculate the ratio of five-star to one-star for each token\n",
 842 |     "tokens['five_star_ratio'] = tokens.five_star / tokens.one_star"
 843 |    ]
 844 |   },
 845 |   {
 846 |    "cell_type": "code",
 847 |    "execution_count": 35,
 848 |    "metadata": {
 849 |     "collapsed": false
 850 |    },
 851 |    "outputs": [
 852 |     {
 853 |      "data": {
 854 |       "text/html": [
 855 |        "<div>\n",
 856 |        "<table border=\"1\" class=\"dataframe\">\n",
 857 |        "  <thead>\n",
 858 |        "    <tr style=\"text-align: right;\">\n",
 859 |        "      <th></th>\n",
 860 |        "      <th>five_star</th>\n",
 861 |        "      <th>one_star</th>\n",
 862 |        "      <th>five_star_ratio</th>\n",
 863 |        "    </tr>\n",
 864 |        "    <tr>\n",
 865 |        "      <th>token</th>\n",
 866 |        "      <th></th>\n",
 867 |        "      <th></th>\n",
 868 |        "      <th></th>\n",
 869 |        "    </tr>\n",
 870 |        "  </thead>\n",
 871 |        "  <tbody>\n",
 872 |        "    <tr>\n",
 873 |        "      <th>fantastic</th>\n",
 874 |        "      <td>0.077231</td>\n",
 875 |        "      <td>0.003540</td>\n",
 876 |        "      <td>21.817727</td>\n",
 877 |        "    </tr>\n",
 878 |        "    <tr>\n",
 879 |        "      <th>perfect</th>\n",
 880 |        "      <td>0.098039</td>\n",
 881 |        "      <td>0.005310</td>\n",
 882 |        "      <td>18.464052</td>\n",
 883 |        "    </tr>\n",
 884 |        "    <tr>\n",
 885 |        "      <th>yum</th>\n",
 886 |        "      <td>0.024810</td>\n",
 887 |        "      <td>0.001770</td>\n",
 888 |        "      <td>14.017607</td>\n",
 889 |        "    </tr>\n",
 890 |        "    <tr>\n",
 891 |        "      <th>favorite</th>\n",
 892 |        "      <td>0.138055</td>\n",
 893 |        "      <td>0.012389</td>\n",
 894 |        "      <td>11.143029</td>\n",
 895 |        "    </tr>\n",
 896 |        "    <tr>\n",
 897 |        "      <th>outstanding</th>\n",
 898 |        "      <td>0.019608</td>\n",
 899 |        "      <td>0.001770</td>\n",
 900 |        "      <td>11.078431</td>\n",
 901 |        "    </tr>\n",
 902 |        "    <tr>\n",
 903 |        "      <th>brunch</th>\n",
 904 |        "      <td>0.016807</td>\n",
 905 |        "      <td>0.001770</td>\n",
 906 |        "      <td>9.495798</td>\n",
 907 |        "    </tr>\n",
 908 |        "    <tr>\n",
 909 |        "      <th>gem</th>\n",
 910 |        "      <td>0.016006</td>\n",
 911 |        "      <td>0.001770</td>\n",
 912 |        "      <td>9.043617</td>\n",
 913 |        "    </tr>\n",
 914 |        "    <tr>\n",
 915 |        "      <th>mozzarella</th>\n",
 916 |        "      <td>0.015606</td>\n",
 917 |        "      <td>0.001770</td>\n",
 918 |        "      <td>8.817527</td>\n",
 919 |        "    </tr>\n",
 920 |        "    <tr>\n",
 921 |        "      <th>pasty</th>\n",
 922 |        "      <td>0.015606</td>\n",
 923 |        "      <td>0.001770</td>\n",
 924 |        "      <td>8.817527</td>\n",
 925 |        "    </tr>\n",
 926 |        "    <tr>\n",
 927 |        "      <th>amazing</th>\n",
 928 |        "      <td>0.185274</td>\n",
 929 |        "      <td>0.021239</td>\n",
 930 |        "      <td>8.723323</td>\n",
 931 |        "    </tr>\n",
 932 |        "  </tbody>\n",
 933 |        "</table>\n",
 934 |        "</div>"
 935 |       ],
 936 |       "text/plain": [
 937 |        "             five_star  one_star  five_star_ratio\n",
 938 |        "token                                            \n",
 939 |        "fantastic     0.077231  0.003540        21.817727\n",
 940 |        "perfect       0.098039  0.005310        18.464052\n",
 941 |        "yum           0.024810  0.001770        14.017607\n",
 942 |        "favorite      0.138055  0.012389        11.143029\n",
 943 |        "outstanding   0.019608  0.001770        11.078431\n",
 944 |        "brunch        0.016807  0.001770         9.495798\n",
 945 |        "gem           0.016006  0.001770         9.043617\n",
 946 |        "mozzarella    0.015606  0.001770         8.817527\n",
 947 |        "pasty         0.015606  0.001770         8.817527\n",
 948 |        "amazing       0.185274  0.021239         8.723323"
 949 |       ]
 950 |      },
 951 |      "execution_count": 35,
 952 |      "metadata": {},
 953 |      "output_type": "execute_result"
 954 |     }
 955 |    ],
 956 |    "source": [
 957 |     "# sort the DataFrame by five_star_ratio (descending order), and examine the first 10 rows\n",
 958 |     "# note: use sort() instead of sort_values() for pandas 0.16.2 and earlier\n",
 959 |     "tokens.sort_values('five_star_ratio', ascending=False).head(10)"
 960 |    ]
 961 |   },
 962 |   {
 963 |    "cell_type": "code",
 964 |    "execution_count": 36,
 965 |    "metadata": {
 966 |     "collapsed": false
 967 |    },
 968 |    "outputs": [
 969 |     {
 970 |      "data": {
 971 |       "text/html": [
 972 |        "<div>\n",
 973 |        "<table border=\"1\" class=\"dataframe\">\n",
 974 |        "  <thead>\n",
 975 |        "    <tr style=\"text-align: right;\">\n",
 976 |        "      <th></th>\n",
 977 |        "      <th>five_star</th>\n",
 978 |        "      <th>one_star</th>\n",
 979 |        "      <th>five_star_ratio</th>\n",
 980 |        "    </tr>\n",
 981 |        "    <tr>\n",
 982 |        "      <th>token</th>\n",
 983 |        "      <th></th>\n",
 984 |        "      <th></th>\n",
 985 |        "      <th></th>\n",
 986 |        "    </tr>\n",
 987 |        "  </thead>\n",
 988 |        "  <tbody>\n",
 989 |        "    <tr>\n",
 990 |        "      <th>staffperson</th>\n",
 991 |        "      <td>0.0004</td>\n",
 992 |        "      <td>0.030088</td>\n",
 993 |        "      <td>0.013299</td>\n",
 994 |        "    </tr>\n",
 995 |        "    <tr>\n",
 996 |        "      <th>refused</th>\n",
 997 |        "      <td>0.0004</td>\n",
 998 |        "      <td>0.024779</td>\n",
 999 |        "      <td>0.016149</td>\n",
1000 |        "    </tr>\n",
1001 |        "    <tr>\n",
1002 |        "      <th>disgusting</th>\n",
1003 |        "      <td>0.0008</td>\n",
1004 |        "      <td>0.042478</td>\n",
1005 |        "      <td>0.018841</td>\n",
1006 |        "    </tr>\n",
1007 |        "    <tr>\n",
1008 |        "      <th>filthy</th>\n",
1009 |        "      <td>0.0004</td>\n",
1010 |        "      <td>0.019469</td>\n",
1011 |        "      <td>0.020554</td>\n",
1012 |        "    </tr>\n",
1013 |        "    <tr>\n",
1014 |        "      <th>unprofessional</th>\n",
1015 |        "      <td>0.0004</td>\n",
1016 |        "      <td>0.015929</td>\n",
1017 |        "      <td>0.025121</td>\n",
1018 |        "    </tr>\n",
1019 |        "    <tr>\n",
1020 |        "      <th>unacceptable</th>\n",
1021 |        "      <td>0.0004</td>\n",
1022 |        "      <td>0.015929</td>\n",
1023 |        "      <td>0.025121</td>\n",
1024 |        "    </tr>\n",
1025 |        "    <tr>\n",
1026 |        "      <th>acknowledge</th>\n",
1027 |        "      <td>0.0004</td>\n",
1028 |        "      <td>0.015929</td>\n",
1029 |        "      <td>0.025121</td>\n",
1030 |        "    </tr>\n",
1031 |        "    <tr>\n",
1032 |        "      <th>ugh</th>\n",
1033 |        "      <td>0.0008</td>\n",
1034 |        "      <td>0.030088</td>\n",
1035 |        "      <td>0.026599</td>\n",
1036 |        "    </tr>\n",
1037 |        "    <tr>\n",
1038 |        "      <th>fuse</th>\n",
1039 |        "      <td>0.0004</td>\n",
1040 |        "      <td>0.014159</td>\n",
1041 |        "      <td>0.028261</td>\n",
1042 |        "    </tr>\n",
1043 |        "    <tr>\n",
1044 |        "      <th>boca</th>\n",
1045 |        "      <td>0.0004</td>\n",
1046 |        "      <td>0.014159</td>\n",
1047 |        "      <td>0.028261</td>\n",
1048 |        "    </tr>\n",
1049 |        "  </tbody>\n",
1050 |        "</table>\n",
1051 |        "</div>"
1052 |       ],
1053 |       "text/plain": [
1054 |        "                five_star  one_star  five_star_ratio\n",
1055 |        "token                                               \n",
1056 |        "staffperson        0.0004  0.030088         0.013299\n",
1057 |        "refused            0.0004  0.024779         0.016149\n",
1058 |        "disgusting         0.0008  0.042478         0.018841\n",
1059 |        "filthy             0.0004  0.019469         0.020554\n",
1060 |        "unprofessional     0.0004  0.015929         0.025121\n",
1061 |        "unacceptable       0.0004  0.015929         0.025121\n",
1062 |        "acknowledge        0.0004  0.015929         0.025121\n",
1063 |        "ugh                0.0008  0.030088         0.026599\n",
1064 |        "fuse               0.0004  0.014159         0.028261\n",
1065 |        "boca               0.0004  0.014159         0.028261"
1066 |       ]
1067 |      },
1068 |      "execution_count": 36,
1069 |      "metadata": {},
1070 |      "output_type": "execute_result"
1071 |     }
1072 |    ],
1073 |    "source": [
1074 |     "# sort the DataFrame by five_star_ratio (ascending order), and examine the first 10 rows\n",
1075 |     "tokens.sort_values('five_star_ratio', ascending=True).head(10)"
1076 |    ]
1077 |   },
1078 |   {
1079 |    "cell_type": "markdown",
1080 |    "metadata": {},
1081 |    "source": [
1082 |     "## Task 9 (Challenge)\n",
1083 |     "\n",
1084 |     "Up to this point, we have framed this as a **binary classification problem** by only considering the 5-star and 1-star reviews. Now, let's repeat the model building process using all reviews, which makes this a **5-class classification problem**.\n",
1085 |     "\n",
1086 |     "Here are the steps:\n",
1087 |     "\n",
1088 |     "- Define X and y using the original DataFrame. (y should contain 5 different classes.)\n",
1089 |     "- Split X and y into training and testing sets.\n",
1090 |     "- Create document-term matrices using CountVectorizer.\n",
1091 |     "- Calculate the testing accuracy of a Multinomial Naive Bayes model.\n",
1092 |     "- Compare the testing accuracy with the null accuracy, and comment on the results.\n",
1093 |     "- Print the confusion matrix, and comment on the results. (This [Stack Overflow answer](http://stackoverflow.com/a/30748053/1636598) explains how to read a multi-class confusion matrix.)\n",
1094 |     "- Print the [classification report](http://scikit-learn.org/stable/modules/model_evaluation.html#classification-report), and comment on the results. If you are unfamiliar with the terminology it uses, research the terms, and then try to figure out how to calculate these metrics manually from the confusion matrix!"
1095 |    ]
1096 |   },
1097 |   {
1098 |    "cell_type": "code",
1099 |    "execution_count": 37,
1100 |    "metadata": {
1101 |     "collapsed": true
1102 |    },
1103 |    "outputs": [],
1104 |    "source": [
1105 |     "# define X and y using the original DataFrame\n",
1106 |     "X = yelp.text\n",
1107 |     "y = yelp.stars"
1108 |    ]
1109 |   },
1110 |   {
1111 |    "cell_type": "code",
1112 |    "execution_count": 38,
1113 |    "metadata": {
1114 |     "collapsed": false
1115 |    },
1116 |    "outputs": [
1117 |     {
1118 |      "data": {
1119 |       "text/plain": [
1120 |        "1     749\n",
1121 |        "2     927\n",
1122 |        "3    1461\n",
1123 |        "4    3526\n",
1124 |        "5    3337\n",
1125 |        "Name: stars, dtype: int64"
1126 |       ]
1127 |      },
1128 |      "execution_count": 38,
1129 |      "metadata": {},
1130 |      "output_type": "execute_result"
1131 |     }
1132 |    ],
1133 |    "source": [
1134 |     "# check that y contains 5 different classes\n",
1135 |     "y.value_counts().sort_index()"
1136 |    ]
1137 |   },
1138 |   {
1139 |    "cell_type": "code",
1140 |    "execution_count": 39,
1141 |    "metadata": {
1142 |     "collapsed": true
1143 |    },
1144 |    "outputs": [],
1145 |    "source": [
1146 |     "# split X and y into training and testing sets\n",
1147 |     "X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)"
1148 |    ]
1149 |   },
1150 |   {
1151 |    "cell_type": "code",
1152 |    "execution_count": 40,
1153 |    "metadata": {
1154 |     "collapsed": true
1155 |    },
1156 |    "outputs": [],
1157 |    "source": [
1158 |     "# create document-term matrices using CountVectorizer\n",
1159 |     "X_train_dtm = vect.fit_transform(X_train)\n",
1160 |     "X_test_dtm = vect.transform(X_test)"
1161 |    ]
1162 |   },
1163 |   {
1164 |    "cell_type": "code",
1165 |    "execution_count": 41,
1166 |    "metadata": {
1167 |     "collapsed": false
1168 |    },
1169 |    "outputs": [
1170 |     {
1171 |      "data": {
1172 |       "text/plain": [
1173 |        "MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)"
1174 |       ]
1175 |      },
1176 |      "execution_count": 41,
1177 |      "metadata": {},
1178 |      "output_type": "execute_result"
1179 |     }
1180 |    ],
1181 |    "source": [
1182 |     "# fit a Multinomial Naive Bayes model\n",
1183 |     "nb.fit(X_train_dtm, y_train)"
1184 |    ]
1185 |   },
1186 |   {
1187 |    "cell_type": "code",
1188 |    "execution_count": 42,
1189 |    "metadata": {
1190 |     "collapsed": true
1191 |    },
1192 |    "outputs": [],
1193 |    "source": [
1194 |     "# make class predictions\n",
1195 |     "y_pred_class = nb.predict(X_test_dtm)"
1196 |    ]
1197 |   },
1198 |   {
1199 |    "cell_type": "code",
1200 |    "execution_count": 43,
1201 |    "metadata": {
1202 |     "collapsed": false
1203 |    },
1204 |    "outputs": [
1205 |     {
1206 |      "data": {
1207 |       "text/plain": [
1208 |        "0.47120000000000001"
1209 |       ]
1210 |      },
1211 |      "execution_count": 43,
1212 |      "metadata": {},
1213 |      "output_type": "execute_result"
1214 |     }
1215 |    ],
1216 |    "source": [
1217 |     "# calculate the accuracy\n",
1218 |     "metrics.accuracy_score(y_test, y_pred_class)"
1219 |    ]
1220 |   },
1221 |   {
1222 |    "cell_type": "code",
1223 |    "execution_count": 44,
1224 |    "metadata": {
1225 |     "collapsed": false
1226 |    },
1227 |    "outputs": [
1228 |     {
1229 |      "data": {
1230 |       "text/plain": [
1231 |        "4    0.3536\n",
1232 |        "Name: stars, dtype: float64"
1233 |       ]
1234 |      },
1235 |      "execution_count": 44,
1236 |      "metadata": {},
1237 |      "output_type": "execute_result"
1238 |     }
1239 |    ],
1240 |    "source": [
1241 |     "# calculate the null accuracy\n",
1242 |     "y_test.value_counts().head(1) / len(y_test)"
1243 |    ]
1244 |   },
1245 |   {
1246 |    "cell_type": "markdown",
1247 |    "metadata": {},
1248 |    "source": [
1249 |     "**Accuracy comments:** At first glance, 47% accuracy does not seem very good, given that it is only about 12 percentage points higher than the null accuracy of 35%. However, I would consider it quite impressive, given that humans would also have a hard time identifying the exact star rating for many of these reviews."
1250 |    ]
1251 |   },
1252 |   {
1253 |    "cell_type": "code",
1254 |    "execution_count": 45,
1255 |    "metadata": {
1256 |     "collapsed": false
1257 |    },
1258 |    "outputs": [
1259 |     {
1260 |      "data": {
1261 |       "text/plain": [
1262 |        "array([[ 55,  14,  24,  65,  27],\n",
1263 |        "       [ 28,  16,  41, 122,  27],\n",
1264 |        "       [  5,   7,  35, 281,  37],\n",
1265 |        "       [  7,   0,  16, 629, 232],\n",
1266 |        "       [  6,   4,   6, 373, 443]])"
1267 |       ]
1268 |      },
1269 |      "execution_count": 45,
1270 |      "metadata": {},
1271 |      "output_type": "execute_result"
1272 |     }
1273 |    ],
1274 |    "source": [
1275 |     "# print the confusion matrix\n",
1276 |     "metrics.confusion_matrix(y_test, y_pred_class)"
1277 |    ]
1278 |   },
1279 |   {
1280 |    "cell_type": "markdown",
1281 |    "metadata": {},
1282 |    "source": [
1283 |     "**Confusion matrix comments:**\n",
1284 |     "\n",
1285 |     "- Nearly all 4-star and 5-star reviews are classified as 4 or 5 stars, but the model has a hard time telling the two classes apart.\n",
1286 |     "- 1-star, 2-star, and 3-star reviews are most commonly classified as 4 stars, probably because it's the predominant class in the training data."
1287 |    ]
1288 |   },
1289 |   {
1290 |    "cell_type": "code",
1291 |    "execution_count": 46,
1292 |    "metadata": {
1293 |     "collapsed": false
1294 |    },
1295 |    "outputs": [
1296 |     {
1297 |      "name": "stdout",
1298 |      "output_type": "stream",
1299 |      "text": [
1300 |       "             precision    recall  f1-score   support\n",
1301 |       "\n",
1302 |       "          1       0.54      0.30      0.38       185\n",
1303 |       "          2       0.39      0.07      0.12       234\n",
1304 |       "          3       0.29      0.10      0.14       365\n",
1305 |       "          4       0.43      0.71      0.53       884\n",
1306 |       "          5       0.58      0.53      0.55       832\n",
1307 |       "\n",
1308 |       "avg / total       0.46      0.47      0.43      2500\n",
1309 |       "\n"
1310 |      ]
1311 |     }
1312 |    ],
1313 |    "source": [
1314 |     "# print the classification report\n",
1315 |     "print(metrics.classification_report(y_test, y_pred_class))"
1316 |    ]
1317 |   },
1318 |   {
1319 |    "cell_type": "markdown",
1320 |    "metadata": {},
1321 |    "source": [
1322 |     "**Precision** answers the question: \"When a given class is predicted, how often are those predictions correct?\" To calculate the precision for class 1, for example, you divide 55 by the sum of the first column of the confusion matrix."
1323 |    ]
1324 |   },
1325 |   {
1326 |    "cell_type": "code",
1327 |    "execution_count": 47,
1328 |    "metadata": {
1329 |     "collapsed": false
1330 |    },
1331 |    "outputs": [
1332 |     {
1333 |      "name": "stdout",
1334 |      "output_type": "stream",
1335 |      "text": [
1336 |       "0.544554455446\n"
1337 |      ]
1338 |     }
1339 |    ],
1340 |    "source": [
1341 |     "# manually calculate the precision for class 1\n",
1342 |     "precision = 55 / float(55 + 28 + 5 + 7 + 6)\n",
1343 |     "print(precision)"
1344 |    ]
1345 |   },
1346 |   {
1347 |    "cell_type": "markdown",
1348 |    "metadata": {},
1349 |    "source": [
1350 |     "**Recall** answers the question: \"When a given class is the true class, how often is that class predicted?\" To calculate the recall for class 1, for example, you divide 55 by the sum of the first row of the confusion matrix."
1351 |    ]
1352 |   },
1353 |   {
1354 |    "cell_type": "code",
1355 |    "execution_count": 48,
1356 |    "metadata": {
1357 |     "collapsed": false
1358 |    },
1359 |    "outputs": [
1360 |     {
1361 |      "name": "stdout",
1362 |      "output_type": "stream",
1363 |      "text": [
1364 |       "0.297297297297\n"
1365 |      ]
1366 |     }
1367 |    ],
1368 |    "source": [
1369 |     "# manually calculate the recall for class 1\n",
1370 |     "recall = 55 / float(55 + 14 + 24 + 65 + 27)\n",
1371 |     "print(recall)"
1372 |    ]
1373 |   },
1374 |   {
1375 |    "cell_type": "markdown",
1376 |    "metadata": {},
1377 |    "source": [
1378 |     "**F1 score** is the harmonic mean of precision and recall, so it is high only when both precision and recall are high."
1379 |    ]
1380 |   },
1381 |   {
1382 |    "cell_type": "code",
1383 |    "execution_count": 49,
1384 |    "metadata": {
1385 |     "collapsed": false
1386 |    },
1387 |    "outputs": [
1388 |     {
1389 |      "name": "stdout",
1390 |      "output_type": "stream",
1391 |      "text": [
1392 |       "0.384615384615\n"
1393 |      ]
1394 |     }
1395 |    ],
1396 |    "source": [
1397 |     "# manually calculate the F1 score for class 1\n",
1398 |     "f1 = 2 * (precision * recall) / (precision + recall)\n",
1399 |     "print(f1)"
1400 |    ]
1401 |   },
1402 |   {
1403 |    "cell_type": "markdown",
1404 |    "metadata": {},
1405 |    "source": [
1406 |     "**Support** answers the question: \"How many observations exist for which a given class is the true class?\" To calculate the support for class 1, for example, you sum the first row of the confusion matrix."
1407 |    ]
1408 |   },
1409 |   {
1410 |    "cell_type": "code",
1411 |    "execution_count": 50,
1412 |    "metadata": {
1413 |     "collapsed": false
1414 |    },
1415 |    "outputs": [
1416 |     {
1417 |      "name": "stdout",
1418 |      "output_type": "stream",
1419 |      "text": [
1420 |       "185\n"
1421 |      ]
1422 |     }
1423 |    ],
1424 |    "source": [
1425 |     "# manually calculate the support for class 1\n",
1426 |     "support = 55 + 14 + 24 + 65 + 27\n",
1427 |     "print(support)"
1428 |    ]
1429 |   },
1430 |   {
1431 |    "cell_type": "markdown",
1432 |    "metadata": {},
1433 |    "source": [
1434 |     "**Classification report comments:**\n",
1435 |     "\n",
1436 |     "- Class 1 has low recall (0.30), meaning that the model has a hard time detecting the 1-star reviews, but relatively high precision (0.54), meaning that when the model predicts a review is 1-star, it is correct a little more than half the time.\n",
1437 |     "- Class 5 has the best balance of precision (0.58) and recall (0.53), probably because 5-star reviews have polarized language, and because the model has a lot of observations to learn from."
1438 |    ]
1439 |   }
1440 |  ],
1441 |  "metadata": {
1442 |   "kernelspec": {
1443 |    "display_name": "Python 2",
1444 |    "language": "python",
1445 |    "name": "python2"
1446 |   },
1447 |   "language_info": {
1448 |    "codemirror_mode": {
1449 |     "name": "ipython",
1450 |     "version": 2
1451 |    },
1452 |    "file_extension": ".py",
1453 |    "mimetype": "text/x-python",
1454 |    "name": "python",
1455 |    "nbconvert_exporter": "python",
1456 |    "pygments_lexer": "ipython2",
1457 |    "version": "2.7.11"
1458 |   }
1459 |  },
1460 |  "nbformat": 4,
1461 |  "nbformat_minor": 0
1462 | }
1463 | 
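The manual precision, recall, F1, and support calculations for class 1 in the solution notebook above can be repeated for every class with a short loop. The following is a minimal sketch, not part of the original notebook: the confusion matrix is copied from the notebook's output, and the names `per_class_metrics` and `report` are hypothetical helpers introduced only for illustration. It uses plain Python lists, so it does not depend on scikit-learn.

```python
# Confusion matrix copied from the solution notebook's output:
# rows are true classes (1 to 5 stars), columns are predicted classes.
confusion = [
    [55, 14, 24, 65, 27],
    [28, 16, 41, 122, 27],
    [5, 7, 35, 281, 37],
    [7, 0, 16, 629, 232],
    [6, 4, 6, 373, 443],
]
labels = [1, 2, 3, 4, 5]


def per_class_metrics(matrix):
    """Return one (precision, recall, f1, support) tuple per class.

    Hypothetical helper for illustration; not a scikit-learn function.
    """
    n = len(matrix)
    results = []
    for k in range(n):
        true_positives = matrix[k][k]
        predicted_k = sum(matrix[row][k] for row in range(n))  # column sum
        support = sum(matrix[k])                               # row sum
        precision = true_positives / float(predicted_k) if predicted_k else 0.0
        recall = true_positives / float(support) if support else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        results.append((precision, recall, f1, support))
    return results


report = per_class_metrics(confusion)

# Overall accuracy is the diagonal sum divided by the number of observations.
total = sum(sum(row) for row in confusion)
accuracy = sum(confusion[k][k] for k in range(len(confusion))) / float(total)

# Null accuracy is the share of the most frequent true class (largest row sum).
null_accuracy = max(sum(row) for row in confusion) / float(total)

for label, (precision, recall, f1, support) in zip(labels, report):
    print('%d  precision=%.2f  recall=%.2f  f1=%.2f  support=%d'
          % (label, precision, recall, f1, support))
print('accuracy=%.4f  null accuracy=%.4f' % (accuracy, null_accuracy))
```

Each class's precision divides the diagonal entry by its column sum, and recall divides it by its row sum, which matches the class 1 calculations shown in the notebook.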


--------------------------------------------------------------------------------
/tutorial_with_output.ipynb:
--------------------------------------------------------------------------------
   1 | {
   2 |  "cells": [
   3 |   {
   4 |    "cell_type": "markdown",
   5 |    "metadata": {},
   6 |    "source": [
   7 |     "# Tutorial: Machine Learning with Text in scikit-learn"
   8 |    ]
   9 |   },
  10 |   {
  11 |    "cell_type": "markdown",
  12 |    "metadata": {},
  13 |    "source": [
  14 |     "## Agenda\n",
  15 |     "\n",
  16 |     "1. Model building in scikit-learn (refresher)\n",
  17 |     "2. Representing text as numerical data\n",
  18 |     "3. Reading a text-based dataset into pandas\n",
  19 |     "4. Vectorizing our dataset\n",
  20 |     "5. Building and evaluating a model\n",
  21 |     "6. Comparing models\n",
  22 |     "7. Examining a model for further insight\n",
  23 |     "8. Practicing this workflow on another dataset\n",
  24 |     "9. Tuning the vectorizer (discussion)"
  25 |    ]
  26 |   },
  27 |   {
  28 |    "cell_type": "code",
  29 |    "execution_count": 1,
  30 |    "metadata": {
  31 |     "collapsed": false
  32 |    },
  33 |    "outputs": [],
  34 |    "source": [
  35 |     "# for Python 2: use print only as a function\n",
  36 |     "from __future__ import print_function"
  37 |    ]
  38 |   },
  39 |   {
  40 |    "cell_type": "markdown",
  41 |    "metadata": {},
  42 |    "source": [
  43 |     "## Part 1: Model building in scikit-learn (refresher)"
  44 |    ]
  45 |   },
  46 |   {
  47 |    "cell_type": "code",
  48 |    "execution_count": 2,
  49 |    "metadata": {
  50 |     "collapsed": true
  51 |    },
  52 |    "outputs": [],
  53 |    "source": [
  54 |     "# load the iris dataset as an example\n",
  55 |     "from sklearn.datasets import load_iris\n",
  56 |     "iris = load_iris()"
  57 |    ]
  58 |   },
  59 |   {
  60 |    "cell_type": "code",
  61 |    "execution_count": 3,
  62 |    "metadata": {
  63 |     "collapsed": true
  64 |    },
  65 |    "outputs": [],
  66 |    "source": [
  67 |     "# store the feature matrix (X) and response vector (y)\n",
  68 |     "X = iris.data\n",
  69 |     "y = iris.target"
  70 |    ]
  71 |   },
  72 |   {
  73 |    "cell_type": "markdown",
  74 |    "metadata": {},
  75 |    "source": [
  76 |     "**\"Features\"** are also known as predictors, inputs, or attributes. The **\"response\"** is also known as the target, label, or output."
  77 |    ]
  78 |   },
  79 |   {
  80 |    "cell_type": "code",
  81 |    "execution_count": 4,
  82 |    "metadata": {
  83 |     "collapsed": false
  84 |    },
  85 |    "outputs": [
  86 |     {
  87 |      "name": "stdout",
  88 |      "output_type": "stream",
  89 |      "text": [
  90 |       "(150L, 4L)\n",
  91 |       "(150L,)\n"
  92 |      ]
  93 |     }
  94 |    ],
  95 |    "source": [
  96 |     "# check the shapes of X and y\n",
  97 |     "print(X.shape)\n",
  98 |     "print(y.shape)"
  99 |    ]
 100 |   },
 101 |   {
 102 |    "cell_type": "markdown",
 103 |    "metadata": {},
 104 |    "source": [
 105 |     "**\"Observations\"** are also known as samples, instances, or records."
 106 |    ]
 107 |   },
 108 |   {
 109 |    "cell_type": "code",
 110 |    "execution_count": 5,
 111 |    "metadata": {
 112 |     "collapsed": false
 113 |    },
 114 |    "outputs": [
 115 |     {
 116 |      "data": {
 117 |       "text/html": [
 118 |        "<div>\n",
 119 |        "<table border=\"1\" class=\"dataframe\">\n",
 120 |        "  <thead>\n",
 121 |        "    <tr style=\"text-align: right;\">\n",
 122 |        "      <th></th>\n",
 123 |        "      <th>sepal length (cm)</th>\n",
 124 |        "      <th>sepal width (cm)</th>\n",
 125 |        "      <th>petal length (cm)</th>\n",
 126 |        "      <th>petal width (cm)</th>\n",
 127 |        "    </tr>\n",
 128 |        "  </thead>\n",
 129 |        "  <tbody>\n",
 130 |        "    <tr>\n",
 131 |        "      <th>0</th>\n",
 132 |        "      <td>5.1</td>\n",
 133 |        "      <td>3.5</td>\n",
 134 |        "      <td>1.4</td>\n",
 135 |        "      <td>0.2</td>\n",
 136 |        "    </tr>\n",
 137 |        "    <tr>\n",
 138 |        "      <th>1</th>\n",
 139 |        "      <td>4.9</td>\n",
 140 |        "      <td>3.0</td>\n",
 141 |        "      <td>1.4</td>\n",
 142 |        "      <td>0.2</td>\n",
 143 |        "    </tr>\n",
 144 |        "    <tr>\n",
 145 |        "      <th>2</th>\n",
 146 |        "      <td>4.7</td>\n",
 147 |        "      <td>3.2</td>\n",
 148 |        "      <td>1.3</td>\n",
 149 |        "      <td>0.2</td>\n",
 150 |        "    </tr>\n",
 151 |        "    <tr>\n",
 152 |        "      <th>3</th>\n",
 153 |        "      <td>4.6</td>\n",
 154 |        "      <td>3.1</td>\n",
 155 |        "      <td>1.5</td>\n",
 156 |        "      <td>0.2</td>\n",
 157 |        "    </tr>\n",
 158 |        "    <tr>\n",
 159 |        "      <th>4</th>\n",
 160 |        "      <td>5.0</td>\n",
 161 |        "      <td>3.6</td>\n",
 162 |        "      <td>1.4</td>\n",
 163 |        "      <td>0.2</td>\n",
 164 |        "    </tr>\n",
 165 |        "  </tbody>\n",
 166 |        "</table>\n",
 167 |        "</div>"
 168 |       ],
 169 |       "text/plain": [
 170 |        "   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)\n",
 171 |        "0                5.1               3.5                1.4               0.2\n",
 172 |        "1                4.9               3.0                1.4               0.2\n",
 173 |        "2                4.7               3.2                1.3               0.2\n",
 174 |        "3                4.6               3.1                1.5               0.2\n",
 175 |        "4                5.0               3.6                1.4               0.2"
 176 |       ]
 177 |      },
 178 |      "execution_count": 5,
 179 |      "metadata": {},
 180 |      "output_type": "execute_result"
 181 |     }
 182 |    ],
 183 |    "source": [
 184 |     "# examine the first 5 rows of the feature matrix (including the feature names)\n",
 185 |     "import pandas as pd\n",
 186 |     "pd.DataFrame(X, columns=iris.feature_names).head()"
 187 |    ]
 188 |   },
 189 |   {
 190 |    "cell_type": "code",
 191 |    "execution_count": 6,
 192 |    "metadata": {
 193 |     "collapsed": false
 194 |    },
 195 |    "outputs": [
 196 |     {
 197 |      "name": "stdout",
 198 |      "output_type": "stream",
 199 |      "text": [
 200 |       "[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0\n",
 201 |       " 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1\n",
 202 |       " 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2\n",
 203 |       " 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2\n",
 204 |       " 2 2]\n"
 205 |      ]
 206 |     }
 207 |    ],
 208 |    "source": [
 209 |     "# examine the response vector\n",
 210 |     "print(y)"
 211 |    ]
 212 |   },
 213 |   {
 214 |    "cell_type": "markdown",
 215 |    "metadata": {},
 216 |    "source": [
 217 |     "In order to **build a model**, the features must be **numeric**, and every observation must have the **same features in the same order**."
 218 |    ]
 219 |   },
 220 |   {
 221 |    "cell_type": "code",
 222 |    "execution_count": 7,
 223 |    "metadata": {
 224 |     "collapsed": false
 225 |    },
 226 |    "outputs": [
 227 |     {
 228 |      "data": {
 229 |       "text/plain": [
 230 |        "KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',\n",
 231 |        "           metric_params=None, n_jobs=1, n_neighbors=5, p=2,\n",
 232 |        "           weights='uniform')"
 233 |       ]
 234 |      },
 235 |      "execution_count": 7,
 236 |      "metadata": {},
 237 |      "output_type": "execute_result"
 238 |     }
 239 |    ],
 240 |    "source": [
 241 |     "# import the class\n",
 242 |     "from sklearn.neighbors import KNeighborsClassifier\n",
 243 |     "\n",
 244 |     "# instantiate the model (with the default parameters)\n",
 245 |     "knn = KNeighborsClassifier()\n",
 246 |     "\n",
 247 |     "# fit the model with data (occurs in-place)\n",
 248 |     "knn.fit(X, y)"
 249 |    ]
 250 |   },
 251 |   {
 252 |    "cell_type": "markdown",
 253 |    "metadata": {},
 254 |    "source": [
 255 |     "In order to **make a prediction**, the new observation must have the **same features as the training observations**, both in number and meaning."
 256 |    ]
 257 |   },
 258 |   {
 259 |    "cell_type": "code",
 260 |    "execution_count": 8,
 261 |    "metadata": {
 262 |     "collapsed": false
 263 |    },
 264 |    "outputs": [
 265 |     {
 266 |      "data": {
 267 |       "text/plain": [
 268 |        "array([1])"
 269 |       ]
 270 |      },
 271 |      "execution_count": 8,
 272 |      "metadata": {},
 273 |      "output_type": "execute_result"
 274 |     }
 275 |    ],
 276 |    "source": [
 277 |     "# predict the response for a new observation\n",
 278 |     "knn.predict([[3, 5, 4, 2]])"
 279 |    ]
 280 |   },
 281 |   {
 282 |    "cell_type": "markdown",
 283 |    "metadata": {},
 284 |    "source": [
 285 |     "## Part 2: Representing text as numerical data"
 286 |    ]
 287 |   },
 288 |   {
 289 |    "cell_type": "code",
 290 |    "execution_count": 9,
 291 |    "metadata": {
 292 |     "collapsed": true
 293 |    },
 294 |    "outputs": [],
 295 |    "source": [
 296 |     "# example text for model training (SMS messages)\n",
 297 |     "simple_train = ['call you tonight', 'Call me a cab', 'please call me... PLEASE!']"
 298 |    ]
 299 |   },
 300 |   {
 301 |    "cell_type": "markdown",
 302 |    "metadata": {},
 303 |    "source": [
 304 |     "From the [scikit-learn documentation](http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction):\n",
 305 |     "\n",
 306 |     "> Text Analysis is a major application field for machine learning algorithms. However the raw data, a sequence of symbols cannot be fed directly to the algorithms themselves as most of them expect **numerical feature vectors with a fixed size** rather than the **raw text documents with variable length**.\n",
 307 |     "\n",
 308 |     "We will use [CountVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) to \"convert text into a matrix of token counts\":"
 309 |    ]
 310 |   },
 311 |   {
 312 |    "cell_type": "code",
 313 |    "execution_count": 10,
 314 |    "metadata": {
 315 |     "collapsed": true
 316 |    },
 317 |    "outputs": [],
 318 |    "source": [
 319 |     "# import and instantiate CountVectorizer (with the default parameters)\n",
 320 |     "from sklearn.feature_extraction.text import CountVectorizer\n",
 321 |     "vect = CountVectorizer()"
 322 |    ]
 323 |   },
 324 |   {
 325 |    "cell_type": "code",
 326 |    "execution_count": 11,
 327 |    "metadata": {
 328 |     "collapsed": false
 329 |    },
 330 |    "outputs": [
 331 |     {
 332 |      "data": {
 333 |       "text/plain": [
 334 |        "CountVectorizer(analyzer=u'word', binary=False, decode_error=u'strict',\n",
 335 |        "        dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',\n",
 336 |        "        lowercase=True, max_df=1.0, max_features=None, min_df=1,\n",
 337 |        "        ngram_range=(1, 1), preprocessor=None, stop_words=None,\n",
 338 |        "        strip_accents=None, token_pattern=u'(?u)\\\\b\\\\w\\\\w+\\\\b',\n",
 339 |        "        tokenizer=None, vocabulary=None)"
 340 |       ]
 341 |      },
 342 |      "execution_count": 11,
 343 |      "metadata": {},
 344 |      "output_type": "execute_result"
 345 |     }
 346 |    ],
 347 |    "source": [
 348 |     "# learn the 'vocabulary' of the training data (occurs in-place)\n",
 349 |     "vect.fit(simple_train)"
 350 |    ]
 351 |   },
 352 |   {
 353 |    "cell_type": "code",
 354 |    "execution_count": 12,
 355 |    "metadata": {
 356 |     "collapsed": false
 357 |    },
 358 |    "outputs": [
 359 |     {
 360 |      "data": {
 361 |       "text/plain": [
 362 |        "[u'cab', u'call', u'me', u'please', u'tonight', u'you']"
 363 |       ]
 364 |      },
 365 |      "execution_count": 12,
 366 |      "metadata": {},
 367 |      "output_type": "execute_result"
 368 |     }
 369 |    ],
 370 |    "source": [
 371 |     "# examine the fitted vocabulary\n",
 372 |     "vect.get_feature_names()"
 373 |    ]
 374 |   },
 375 |   {
 376 |    "cell_type": "code",
 377 |    "execution_count": 13,
 378 |    "metadata": {
 379 |     "collapsed": false
 380 |    },
 381 |    "outputs": [
 382 |     {
 383 |      "data": {
 384 |       "text/plain": [
 385 |        "<3x6 sparse matrix of type '<type 'numpy.int64'>'\n",
 386 |        "\twith 9 stored elements in Compressed Sparse Row format>"
 387 |       ]
 388 |      },
 389 |      "execution_count": 13,
 390 |      "metadata": {},
 391 |      "output_type": "execute_result"
 392 |     }
 393 |    ],
 394 |    "source": [
 395 |     "# transform training data into a 'document-term matrix'\n",
 396 |     "simple_train_dtm = vect.transform(simple_train)\n",
 397 |     "simple_train_dtm"
 398 |    ]
 399 |   },
 400 |   {
 401 |    "cell_type": "code",
 402 |    "execution_count": 14,
 403 |    "metadata": {
 404 |     "collapsed": false
 405 |    },
 406 |    "outputs": [
 407 |     {
 408 |      "data": {
 409 |       "text/plain": [
 410 |        "array([[0, 1, 0, 0, 1, 1],\n",
 411 |        "       [1, 1, 1, 0, 0, 0],\n",
 412 |        "       [0, 1, 1, 2, 0, 0]], dtype=int64)"
 413 |       ]
 414 |      },
 415 |      "execution_count": 14,
 416 |      "metadata": {},
 417 |      "output_type": "execute_result"
 418 |     }
 419 |    ],
 420 |    "source": [
 421 |     "# convert sparse matrix to a dense matrix\n",
 422 |     "simple_train_dtm.toarray()"
 423 |    ]
 424 |   },
 425 |   {
 426 |    "cell_type": "code",
 427 |    "execution_count": 15,
 428 |    "metadata": {
 429 |     "collapsed": false
 430 |    },
 431 |    "outputs": [
 432 |     {
 433 |      "data": {
 434 |       "text/html": [
 435 |        "<div>\n",
 436 |        "<table border=\"1\" class=\"dataframe\">\n",
 437 |        "  <thead>\n",
 438 |        "    <tr style=\"text-align: right;\">\n",
 439 |        "      <th></th>\n",
 440 |        "      <th>cab</th>\n",
 441 |        "      <th>call</th>\n",
 442 |        "      <th>me</th>\n",
 443 |        "      <th>please</th>\n",
 444 |        "      <th>tonight</th>\n",
 445 |        "      <th>you</th>\n",
 446 |        "    </tr>\n",
 447 |        "  </thead>\n",
 448 |        "  <tbody>\n",
 449 |        "    <tr>\n",
 450 |        "      <th>0</th>\n",
 451 |        "      <td>0</td>\n",
 452 |        "      <td>1</td>\n",
 453 |        "      <td>0</td>\n",
 454 |        "      <td>0</td>\n",
 455 |        "      <td>1</td>\n",
 456 |        "      <td>1</td>\n",
 457 |        "    </tr>\n",
 458 |        "    <tr>\n",
 459 |        "      <th>1</th>\n",
 460 |        "      <td>1</td>\n",
 461 |        "      <td>1</td>\n",
 462 |        "      <td>1</td>\n",
 463 |        "      <td>0</td>\n",
 464 |        "      <td>0</td>\n",
 465 |        "      <td>0</td>\n",
 466 |        "    </tr>\n",
 467 |        "    <tr>\n",
 468 |        "      <th>2</th>\n",
 469 |        "      <td>0</td>\n",
 470 |        "      <td>1</td>\n",
 471 |        "      <td>1</td>\n",
 472 |        "      <td>2</td>\n",
 473 |        "      <td>0</td>\n",
 474 |        "      <td>0</td>\n",
 475 |        "    </tr>\n",
 476 |        "  </tbody>\n",
 477 |        "</table>\n",
 478 |        "</div>"
 479 |       ],
 480 |       "text/plain": [
 481 |        "   cab  call  me  please  tonight  you\n",
 482 |        "0    0     1   0       0        1    1\n",
 483 |        "1    1     1   1       0        0    0\n",
 484 |        "2    0     1   1       2        0    0"
 485 |       ]
 486 |      },
 487 |      "execution_count": 15,
 488 |      "metadata": {},
 489 |      "output_type": "execute_result"
 490 |     }
 491 |    ],
 492 |    "source": [
 493 |     "# examine the vocabulary and document-term matrix together\n",
 494 |     "pd.DataFrame(simple_train_dtm.toarray(), columns=vect.get_feature_names())"
 495 |    ]
 496 |   },
 497 |   {
 498 |    "cell_type": "markdown",
 499 |    "metadata": {},
 500 |    "source": [
 501 |     "From the [scikit-learn documentation](http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction):\n",
 502 |     "\n",
 503 |     "> In this scheme, features and samples are defined as follows:\n",
 504 |     "\n",
 505 |     "> - Each individual token occurrence frequency (normalized or not) is treated as a **feature**.\n",
 506 |     "> - The vector of all the token frequencies for a given document is considered a multivariate **sample**.\n",
 507 |     "\n",
 508 |     "> A **corpus of documents** can thus be represented by a matrix with **one row per document** and **one column per token** (e.g. word) occurring in the corpus.\n",
 509 |     "\n",
 510 |     "> We call **vectorization** the general process of turning a collection of text documents into numerical feature vectors. This specific strategy (tokenization, counting and normalization) is called the **Bag of Words** or \"Bag of n-grams\" representation. Documents are described by word occurrences while completely ignoring the relative position information of the words in the document."
 511 |    ]
 512 |   },
 513 |   {
 514 |    "cell_type": "code",
 515 |    "execution_count": 16,
 516 |    "metadata": {
 517 |     "collapsed": false
 518 |    },
 519 |    "outputs": [
 520 |     {
 521 |      "data": {
 522 |       "text/plain": [
 523 |        "scipy.sparse.csr.csr_matrix"
 524 |       ]
 525 |      },
 526 |      "execution_count": 16,
 527 |      "metadata": {},
 528 |      "output_type": "execute_result"
 529 |     }
 530 |    ],
 531 |    "source": [
 532 |     "# check the type of the document-term matrix\n",
 533 |     "type(simple_train_dtm)"
 534 |    ]
 535 |   },
 536 |   {
 537 |    "cell_type": "code",
 538 |    "execution_count": 17,
 539 |    "metadata": {
 540 |     "collapsed": false,
 541 |     "scrolled": true
 542 |    },
 543 |    "outputs": [
 544 |     {
 545 |      "name": "stdout",
 546 |      "output_type": "stream",
 547 |      "text": [
 548 |       "  (0, 1)\t1\n",
 549 |       "  (0, 4)\t1\n",
 550 |       "  (0, 5)\t1\n",
 551 |       "  (1, 0)\t1\n",
 552 |       "  (1, 1)\t1\n",
 553 |       "  (1, 2)\t1\n",
 554 |       "  (2, 1)\t1\n",
 555 |       "  (2, 2)\t1\n",
 556 |       "  (2, 3)\t2\n"
 557 |      ]
 558 |     }
 559 |    ],
 560 |    "source": [
 561 |     "# examine the sparse matrix contents\n",
 562 |     "print(simple_train_dtm)"
 563 |    ]
 564 |   },
 565 |   {
 566 |    "cell_type": "markdown",
 567 |    "metadata": {},
 568 |    "source": [
 569 |     "From the [scikit-learn documentation](http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction):\n",
 570 |     "\n",
 571 |     "> As most documents will typically use a very small subset of the words used in the corpus, the resulting matrix will have **many feature values that are zeros** (typically more than 99% of them).\n",
 572 |     "\n",
 573 |     "> For instance, a collection of 10,000 short text documents (such as emails) will use a vocabulary with a size in the order of 100,000 unique words in total while each document will use 100 to 1000 unique words individually.\n",
 574 |     "\n",
 575 |     "> In order to be able to **store such a matrix in memory** but also to **speed up operations**, implementations will typically use a **sparse representation** such as the implementations available in the `scipy.sparse` package."
 576 |    ]
 577 |   },
 578 |   {
 579 |    "cell_type": "code",
 580 |    "execution_count": 18,
 581 |    "metadata": {
 582 |     "collapsed": true
 583 |    },
 584 |    "outputs": [],
 585 |    "source": [
 586 |     "# example text for model testing\n",
 587 |     "simple_test = [\"please don't call me\"]"
 588 |    ]
 589 |   },
 590 |   {
 591 |    "cell_type": "markdown",
 592 |    "metadata": {},
 593 |    "source": [
 594 |     "In order to **make a prediction**, the new observation must have the **same features as the training observations**, both in number and meaning."
 595 |    ]
 596 |   },
 597 |   {
 598 |    "cell_type": "code",
 599 |    "execution_count": 19,
 600 |    "metadata": {
 601 |     "collapsed": false
 602 |    },
 603 |    "outputs": [
 604 |     {
 605 |      "data": {
 606 |       "text/plain": [
 607 |        "array([[0, 1, 1, 1, 0, 0]], dtype=int64)"
 608 |       ]
 609 |      },
 610 |      "execution_count": 19,
 611 |      "metadata": {},
 612 |      "output_type": "execute_result"
 613 |     }
 614 |    ],
 615 |    "source": [
 616 |     "# transform testing data into a document-term matrix (using existing vocabulary)\n",
 617 |     "simple_test_dtm = vect.transform(simple_test)\n",
 618 |     "simple_test_dtm.toarray()"
 619 |    ]
 620 |   },
 621 |   {
 622 |    "cell_type": "code",
 623 |    "execution_count": 20,
 624 |    "metadata": {
 625 |     "collapsed": false
 626 |    },
 627 |    "outputs": [
 628 |     {
 629 |      "data": {
 630 |       "text/html": [
 631 |        "<div>\n",
 632 |        "<table border=\"1\" class=\"dataframe\">\n",
 633 |        "  <thead>\n",
 634 |        "    <tr style=\"text-align: right;\">\n",
 635 |        "      <th></th>\n",
 636 |        "      <th>cab</th>\n",
 637 |        "      <th>call</th>\n",
 638 |        "      <th>me</th>\n",
 639 |        "      <th>please</th>\n",
 640 |        "      <th>tonight</th>\n",
 641 |        "      <th>you</th>\n",
 642 |        "    </tr>\n",
 643 |        "  </thead>\n",
 644 |        "  <tbody>\n",
 645 |        "    <tr>\n",
 646 |        "      <th>0</th>\n",
 647 |        "      <td>0</td>\n",
 648 |        "      <td>1</td>\n",
 649 |        "      <td>1</td>\n",
 650 |        "      <td>1</td>\n",
 651 |        "      <td>0</td>\n",
 652 |        "      <td>0</td>\n",
 653 |        "    </tr>\n",
 654 |        "  </tbody>\n",
 655 |        "</table>\n",
 656 |        "</div>"
 657 |       ],
 658 |       "text/plain": [
 659 |        "   cab  call  me  please  tonight  you\n",
 660 |        "0    0     1   1       1        0    0"
 661 |       ]
 662 |      },
 663 |      "execution_count": 20,
 664 |      "metadata": {},
 665 |      "output_type": "execute_result"
 666 |     }
 667 |    ],
 668 |    "source": [
 669 |     "# examine the vocabulary and document-term matrix together\n",
 670 |     "pd.DataFrame(simple_test_dtm.toarray(), columns=vect.get_feature_names_out())"
 671 |    ]
 672 |   },
 673 |   {
 674 |    "cell_type": "markdown",
 675 |    "metadata": {},
 676 |    "source": [
 677 |     "**Summary:**\n",
 678 |     "\n",
 679 |     "- `vect.fit(train)` **learns the vocabulary** of the training data\n",
 680 |     "- `vect.transform(train)` uses the **fitted vocabulary** to build a document-term matrix from the training data\n",
 681 |     "- `vect.transform(test)` uses the **fitted vocabulary** to build a document-term matrix from the testing data (and **ignores tokens** it hasn't seen before)"
 682 |    ]
 683 |   },
 684 |   {
 685 |    "cell_type": "markdown",
 686 |    "metadata": {},
 687 |    "source": [
 688 |     "## Part 3: Reading a text-based dataset into pandas"
 689 |    ]
 690 |   },
 691 |   {
 692 |    "cell_type": "code",
 693 |    "execution_count": 21,
 694 |    "metadata": {
 695 |     "collapsed": true
 696 |    },
 697 |    "outputs": [],
 698 |    "source": [
 699 |     "# read file into pandas using a relative path\n",
 700 |     "path = 'data/sms.tsv'\n",
 701 |     "sms = pd.read_table(path, header=None, names=['label', 'message'])"
 702 |    ]
 703 |   },
 704 |   {
 705 |    "cell_type": "code",
 706 |    "execution_count": 22,
 707 |    "metadata": {
 708 |     "collapsed": false
 709 |    },
 710 |    "outputs": [],
 711 |    "source": [
 712 |     "# alternative: read file into pandas from a URL\n",
 713 |     "# url = 'https://raw.githubusercontent.com/justmarkham/pycon-2016-tutorial/master/data/sms.tsv'\n",
 714 |     "# sms = pd.read_table(url, header=None, names=['label', 'message'])"
 715 |    ]
 716 |   },
 717 |   {
 718 |    "cell_type": "code",
 719 |    "execution_count": 23,
 720 |    "metadata": {
 721 |     "collapsed": false
 722 |    },
 723 |    "outputs": [
 724 |     {
 725 |      "data": {
 726 |       "text/plain": [
 727 |        "(5572, 2)"
 728 |       ]
 729 |      },
 730 |      "execution_count": 23,
 731 |      "metadata": {},
 732 |      "output_type": "execute_result"
 733 |     }
 734 |    ],
 735 |    "source": [
 736 |     "# examine the shape\n",
 737 |     "sms.shape"
 738 |    ]
 739 |   },
 740 |   {
 741 |    "cell_type": "code",
 742 |    "execution_count": 24,
 743 |    "metadata": {
 744 |     "collapsed": false
 745 |    },
 746 |    "outputs": [
 747 |     {
 748 |      "data": {
 749 |       "text/html": [
 750 |        "<div>\n",
 751 |        "<table border=\"1\" class=\"dataframe\">\n",
 752 |        "  <thead>\n",
 753 |        "    <tr style=\"text-align: right;\">\n",
 754 |        "      <th></th>\n",
 755 |        "      <th>label</th>\n",
 756 |        "      <th>message</th>\n",
 757 |        "    </tr>\n",
 758 |        "  </thead>\n",
 759 |        "  <tbody>\n",
 760 |        "    <tr>\n",
 761 |        "      <th>0</th>\n",
 762 |        "      <td>ham</td>\n",
 763 |        "      <td>Go until jurong point, crazy.. Available only ...</td>\n",
 764 |        "    </tr>\n",
 765 |        "    <tr>\n",
 766 |        "      <th>1</th>\n",
 767 |        "      <td>ham</td>\n",
 768 |        "      <td>Ok lar... Joking wif u oni...</td>\n",
 769 |        "    </tr>\n",
 770 |        "    <tr>\n",
 771 |        "      <th>2</th>\n",
 772 |        "      <td>spam</td>\n",
 773 |        "      <td>Free entry in 2 a wkly comp to win FA Cup fina...</td>\n",
 774 |        "    </tr>\n",
 775 |        "    <tr>\n",
 776 |        "      <th>3</th>\n",
 777 |        "      <td>ham</td>\n",
 778 |        "      <td>U dun say so early hor... U c already then say...</td>\n",
 779 |        "    </tr>\n",
 780 |        "    <tr>\n",
 781 |        "      <th>4</th>\n",
 782 |        "      <td>ham</td>\n",
 783 |        "      <td>Nah I don't think he goes to usf, he lives aro...</td>\n",
 784 |        "    </tr>\n",
 785 |        "    <tr>\n",
 786 |        "      <th>5</th>\n",
 787 |        "      <td>spam</td>\n",
 788 |        "      <td>FreeMsg Hey there darling it's been 3 week's n...</td>\n",
 789 |        "    </tr>\n",
 790 |        "    <tr>\n",
 791 |        "      <th>6</th>\n",
 792 |        "      <td>ham</td>\n",
 793 |        "      <td>Even my brother is not like to speak with me. ...</td>\n",
 794 |        "    </tr>\n",
 795 |        "    <tr>\n",
 796 |        "      <th>7</th>\n",
 797 |        "      <td>ham</td>\n",
 798 |        "      <td>As per your request 'Melle Melle (Oru Minnamin...</td>\n",
 799 |        "    </tr>\n",
 800 |        "    <tr>\n",
 801 |        "      <th>8</th>\n",
 802 |        "      <td>spam</td>\n",
 803 |        "      <td>WINNER!! As a valued network customer you have...</td>\n",
 804 |        "    </tr>\n",
 805 |        "    <tr>\n",
 806 |        "      <th>9</th>\n",
 807 |        "      <td>spam</td>\n",
 808 |        "      <td>Had your mobile 11 months or more? U R entitle...</td>\n",
 809 |        "    </tr>\n",
 810 |        "  </tbody>\n",
 811 |        "</table>\n",
 812 |        "</div>"
 813 |       ],
 814 |       "text/plain": [
 815 |        "  label                                            message\n",
 816 |        "0   ham  Go until jurong point, crazy.. Available only ...\n",
 817 |        "1   ham                      Ok lar... Joking wif u oni...\n",
 818 |        "2  spam  Free entry in 2 a wkly comp to win FA Cup fina...\n",
 819 |        "3   ham  U dun say so early hor... U c already then say...\n",
 820 |        "4   ham  Nah I don't think he goes to usf, he lives aro...\n",
 821 |        "5  spam  FreeMsg Hey there darling it's been 3 week's n...\n",
 822 |        "6   ham  Even my brother is not like to speak with me. ...\n",
 823 |        "7   ham  As per your request 'Melle Melle (Oru Minnamin...\n",
 824 |        "8  spam  WINNER!! As a valued network customer you have...\n",
 825 |        "9  spam  Had your mobile 11 months or more? U R entitle..."
 826 |       ]
 827 |      },
 828 |      "execution_count": 24,
 829 |      "metadata": {},
 830 |      "output_type": "execute_result"
 831 |     }
 832 |    ],
 833 |    "source": [
 834 |     "# examine the first 10 rows\n",
 835 |     "sms.head(10)"
 836 |    ]
 837 |   },
 838 |   {
 839 |    "cell_type": "code",
 840 |    "execution_count": 25,
 841 |    "metadata": {
 842 |     "collapsed": false
 843 |    },
 844 |    "outputs": [
 845 |     {
 846 |      "data": {
 847 |       "text/plain": [
 848 |        "ham     4825\n",
 849 |        "spam     747\n",
 850 |        "Name: label, dtype: int64"
 851 |       ]
 852 |      },
 853 |      "execution_count": 25,
 854 |      "metadata": {},
 855 |      "output_type": "execute_result"
 856 |     }
 857 |    ],
 858 |    "source": [
 859 |     "# examine the class distribution\n",
 860 |     "sms.label.value_counts()"
 861 |    ]
 862 |   },
 863 |   {
 864 |    "cell_type": "code",
 865 |    "execution_count": 26,
 866 |    "metadata": {
 867 |     "collapsed": true
 868 |    },
 869 |    "outputs": [],
 870 |    "source": [
 871 |     "# convert label to a numerical variable\n",
 872 |     "sms['label_num'] = sms.label.map({'ham':0, 'spam':1})"
 873 |    ]
 874 |   },
 875 |   {
 876 |    "cell_type": "code",
 877 |    "execution_count": 27,
 878 |    "metadata": {
 879 |     "collapsed": false
 880 |    },
 881 |    "outputs": [
 882 |     {
 883 |      "data": {
 884 |       "text/html": [
 885 |        "<div>\n",
 886 |        "<table border=\"1\" class=\"dataframe\">\n",
 887 |        "  <thead>\n",
 888 |        "    <tr style=\"text-align: right;\">\n",
 889 |        "      <th></th>\n",
 890 |        "      <th>label</th>\n",
 891 |        "      <th>message</th>\n",
 892 |        "      <th>label_num</th>\n",
 893 |        "    </tr>\n",
 894 |        "  </thead>\n",
 895 |        "  <tbody>\n",
 896 |        "    <tr>\n",
 897 |        "      <th>0</th>\n",
 898 |        "      <td>ham</td>\n",
 899 |        "      <td>Go until jurong point, crazy.. Available only ...</td>\n",
 900 |        "      <td>0</td>\n",
 901 |        "    </tr>\n",
 902 |        "    <tr>\n",
 903 |        "      <th>1</th>\n",
 904 |        "      <td>ham</td>\n",
 905 |        "      <td>Ok lar... Joking wif u oni...</td>\n",
 906 |        "      <td>0</td>\n",
 907 |        "    </tr>\n",
 908 |        "    <tr>\n",
 909 |        "      <th>2</th>\n",
 910 |        "      <td>spam</td>\n",
 911 |        "      <td>Free entry in 2 a wkly comp to win FA Cup fina...</td>\n",
 912 |        "      <td>1</td>\n",
 913 |        "    </tr>\n",
 914 |        "    <tr>\n",
 915 |        "      <th>3</th>\n",
 916 |        "      <td>ham</td>\n",
 917 |        "      <td>U dun say so early hor... U c already then say...</td>\n",
 918 |        "      <td>0</td>\n",
 919 |        "    </tr>\n",
 920 |        "    <tr>\n",
 921 |        "      <th>4</th>\n",
 922 |        "      <td>ham</td>\n",
 923 |        "      <td>Nah I don't think he goes to usf, he lives aro...</td>\n",
 924 |        "      <td>0</td>\n",
 925 |        "    </tr>\n",
 926 |        "    <tr>\n",
 927 |        "      <th>5</th>\n",
 928 |        "      <td>spam</td>\n",
 929 |        "      <td>FreeMsg Hey there darling it's been 3 week's n...</td>\n",
 930 |        "      <td>1</td>\n",
 931 |        "    </tr>\n",
 932 |        "    <tr>\n",
 933 |        "      <th>6</th>\n",
 934 |        "      <td>ham</td>\n",
 935 |        "      <td>Even my brother is not like to speak with me. ...</td>\n",
 936 |        "      <td>0</td>\n",
 937 |        "    </tr>\n",
 938 |        "    <tr>\n",
 939 |        "      <th>7</th>\n",
 940 |        "      <td>ham</td>\n",
 941 |        "      <td>As per your request 'Melle Melle (Oru Minnamin...</td>\n",
 942 |        "      <td>0</td>\n",
 943 |        "    </tr>\n",
 944 |        "    <tr>\n",
 945 |        "      <th>8</th>\n",
 946 |        "      <td>spam</td>\n",
 947 |        "      <td>WINNER!! As a valued network customer you have...</td>\n",
 948 |        "      <td>1</td>\n",
 949 |        "    </tr>\n",
 950 |        "    <tr>\n",
 951 |        "      <th>9</th>\n",
 952 |        "      <td>spam</td>\n",
 953 |        "      <td>Had your mobile 11 months or more? U R entitle...</td>\n",
 954 |        "      <td>1</td>\n",
 955 |        "    </tr>\n",
 956 |        "  </tbody>\n",
 957 |        "</table>\n",
 958 |        "</div>"
 959 |       ],
 960 |       "text/plain": [
 961 |        "  label                                            message  label_num\n",
 962 |        "0   ham  Go until jurong point, crazy.. Available only ...          0\n",
 963 |        "1   ham                      Ok lar... Joking wif u oni...          0\n",
 964 |        "2  spam  Free entry in 2 a wkly comp to win FA Cup fina...          1\n",
 965 |        "3   ham  U dun say so early hor... U c already then say...          0\n",
 966 |        "4   ham  Nah I don't think he goes to usf, he lives aro...          0\n",
 967 |        "5  spam  FreeMsg Hey there darling it's been 3 week's n...          1\n",
 968 |        "6   ham  Even my brother is not like to speak with me. ...          0\n",
 969 |        "7   ham  As per your request 'Melle Melle (Oru Minnamin...          0\n",
 970 |        "8  spam  WINNER!! As a valued network customer you have...          1\n",
 971 |        "9  spam  Had your mobile 11 months or more? U R entitle...          1"
 972 |       ]
 973 |      },
 974 |      "execution_count": 27,
 975 |      "metadata": {},
 976 |      "output_type": "execute_result"
 977 |     }
 978 |    ],
 979 |    "source": [
 980 |     "# check that the conversion worked\n",
 981 |     "sms.head(10)"
 982 |    ]
 983 |   },
 984 |   {
 985 |    "cell_type": "code",
 986 |    "execution_count": 28,
 987 |    "metadata": {
 988 |     "collapsed": false
 989 |    },
 990 |    "outputs": [
 991 |     {
 992 |      "name": "stdout",
 993 |      "output_type": "stream",
 994 |      "text": [
 995 |       "(150L, 4L)\n",
 996 |       "(150L,)\n"
 997 |      ]
 998 |     }
 999 |    ],
1000 |    "source": [
1001 |     "# how to define X and y (from the iris data) for use with a MODEL\n",
1002 |     "X = iris.data\n",
1003 |     "y = iris.target\n",
1004 |     "print(X.shape)\n",
1005 |     "print(y.shape)"
1006 |    ]
1007 |   },
1008 |   {
1009 |    "cell_type": "code",
1010 |    "execution_count": 29,
1011 |    "metadata": {
1012 |     "collapsed": false
1013 |    },
1014 |    "outputs": [
1015 |     {
1016 |      "name": "stdout",
1017 |      "output_type": "stream",
1018 |      "text": [
1019 |       "(5572L,)\n",
1020 |       "(5572L,)\n"
1021 |      ]
1022 |     }
1023 |    ],
1024 |    "source": [
1025 |     "# how to define X and y (from the SMS data) for use with COUNTVECTORIZER\n",
1026 |     "X = sms.message\n",
1027 |     "y = sms.label_num\n",
1028 |     "print(X.shape)\n",
1029 |     "print(y.shape)"
1030 |    ]
1031 |   },
1032 |   {
1033 |    "cell_type": "code",
1034 |    "execution_count": 30,
1035 |    "metadata": {
1036 |     "collapsed": false
1037 |    },
1038 |    "outputs": [
1039 |     {
1040 |      "name": "stdout",
1041 |      "output_type": "stream",
1042 |      "text": [
1043 |       "(4179L,)\n",
1044 |       "(1393L,)\n",
1045 |       "(4179L,)\n",
1046 |       "(1393L,)\n"
1047 |      ]
1048 |     }
1049 |    ],
1050 |    "source": [
1051 |     "# split X and y into training and testing sets\n",
1052 |     "from sklearn.model_selection import train_test_split\n",
1053 |     "X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)\n",
1054 |     "print(X_train.shape)\n",
1055 |     "print(X_test.shape)\n",
1056 |     "print(y_train.shape)\n",
1057 |     "print(y_test.shape)"
1058 |    ]
1059 |   },
1060 |   {
1061 |    "cell_type": "markdown",
1062 |    "metadata": {},
1063 |    "source": [
1064 |     "## Part 4: Vectorizing our dataset"
1065 |    ]
1066 |   },
1067 |   {
1068 |    "cell_type": "code",
1069 |    "execution_count": 31,
1070 |    "metadata": {
1071 |     "collapsed": true
1072 |    },
1073 |    "outputs": [],
1074 |    "source": [
1075 |     "# instantiate the vectorizer\n",
1076 |     "vect = CountVectorizer()"
1077 |    ]
1078 |   },
1079 |   {
1080 |    "cell_type": "code",
1081 |    "execution_count": 32,
1082 |    "metadata": {
1083 |     "collapsed": true
1084 |    },
1085 |    "outputs": [],
1086 |    "source": [
1087 |     "# learn training data vocabulary, then use it to create a document-term matrix\n",
1088 |     "vect.fit(X_train)\n",
1089 |     "X_train_dtm = vect.transform(X_train)"
1090 |    ]
1091 |   },
1092 |   {
1093 |    "cell_type": "code",
1094 |    "execution_count": 33,
1095 |    "metadata": {
1096 |     "collapsed": true
1097 |    },
1098 |    "outputs": [],
1099 |    "source": [
1100 |     "# equivalently: combine fit and transform into a single step\n",
1101 |     "X_train_dtm = vect.fit_transform(X_train)"
1102 |    ]
1103 |   },
1104 |   {
1105 |    "cell_type": "code",
1106 |    "execution_count": 34,
1107 |    "metadata": {
1108 |     "collapsed": false
1109 |    },
1110 |    "outputs": [
1111 |     {
1112 |      "data": {
1113 |       "text/plain": [
1114 |        "<4179x7456 sparse matrix of type '<type 'numpy.int64'>'\n",
1115 |        "\twith 55209 stored elements in Compressed Sparse Row format>"
1116 |       ]
1117 |      },
1118 |      "execution_count": 34,
1119 |      "metadata": {},
1120 |      "output_type": "execute_result"
1121 |     }
1122 |    ],
1123 |    "source": [
1124 |     "# examine the document-term matrix\n",
1125 |     "X_train_dtm"
1126 |    ]
1127 |   },
1128 |   {
1129 |    "cell_type": "code",
1130 |    "execution_count": 35,
1131 |    "metadata": {
1132 |     "collapsed": false
1133 |    },
1134 |    "outputs": [
1135 |     {
1136 |      "data": {
1137 |       "text/plain": [
1138 |        "<1393x7456 sparse matrix of type '<type 'numpy.int64'>'\n",
1139 |        "\twith 17604 stored elements in Compressed Sparse Row format>"
1140 |       ]
1141 |      },
1142 |      "execution_count": 35,
1143 |      "metadata": {},
1144 |      "output_type": "execute_result"
1145 |     }
1146 |    ],
1147 |    "source": [
1148 |     "# transform testing data (using fitted vocabulary) into a document-term matrix\n",
1149 |     "X_test_dtm = vect.transform(X_test)\n",
1150 |     "X_test_dtm"
1151 |    ]
1152 |   },
1153 |   {
1154 |    "cell_type": "markdown",
1155 |    "metadata": {},
1156 |    "source": [
1157 |     "## Part 5: Building and evaluating a model\n",
1158 |     "\n",
1159 |     "We will use [multinomial Naive Bayes](http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html):\n",
1160 |     "\n",
1161 |     "> The multinomial Naive Bayes classifier is suitable for classification with **discrete features** (e.g., word counts for text classification). The multinomial distribution normally requires integer feature counts. However, in practice, fractional counts such as tf-idf may also work."
1162 |    ]
1163 |   },
1164 |   {
1165 |    "cell_type": "code",
1166 |    "execution_count": 36,
1167 |    "metadata": {
1168 |     "collapsed": true
1169 |    },
1170 |    "outputs": [],
1171 |    "source": [
1172 |     "# import and instantiate a Multinomial Naive Bayes model\n",
1173 |     "from sklearn.naive_bayes import MultinomialNB\n",
1174 |     "nb = MultinomialNB()"
1175 |    ]
1176 |   },
1177 |   {
1178 |    "cell_type": "code",
1179 |    "execution_count": 37,
1180 |    "metadata": {
1181 |     "collapsed": false
1182 |    },
1183 |    "outputs": [
1184 |     {
1185 |      "name": "stdout",
1186 |      "output_type": "stream",
1187 |      "text": [
1188 |       "Wall time: 3 ms\n"
1189 |      ]
1190 |     },
1191 |     {
1192 |      "data": {
1193 |       "text/plain": [
1194 |        "MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)"
1195 |       ]
1196 |      },
1197 |      "execution_count": 37,
1198 |      "metadata": {},
1199 |      "output_type": "execute_result"
1200 |     }
1201 |    ],
1202 |    "source": [
1203 |     "# train the model using X_train_dtm (timing it with an IPython \"magic command\")\n",
1204 |     "%time nb.fit(X_train_dtm, y_train)"
1205 |    ]
1206 |   },
1207 |   {
1208 |    "cell_type": "code",
1209 |    "execution_count": 38,
1210 |    "metadata": {
1211 |     "collapsed": true
1212 |    },
1213 |    "outputs": [],
1214 |    "source": [
1215 |     "# make class predictions for X_test_dtm\n",
1216 |     "y_pred_class = nb.predict(X_test_dtm)"
1217 |    ]
1218 |   },
1219 |   {
1220 |    "cell_type": "code",
1221 |    "execution_count": 39,
1222 |    "metadata": {
1223 |     "collapsed": false
1224 |    },
1225 |    "outputs": [
1226 |     {
1227 |      "data": {
1228 |       "text/plain": [
1229 |        "0.98851399856424982"
1230 |       ]
1231 |      },
1232 |      "execution_count": 39,
1233 |      "metadata": {},
1234 |      "output_type": "execute_result"
1235 |     }
1236 |    ],
1237 |    "source": [
1238 |     "# calculate accuracy of class predictions\n",
1239 |     "from sklearn import metrics\n",
1240 |     "metrics.accuracy_score(y_test, y_pred_class)"
1241 |    ]
1242 |   },
1243 |   {
1244 |    "cell_type": "code",
1245 |    "execution_count": 40,
1246 |    "metadata": {
1247 |     "collapsed": false
1248 |    },
1249 |    "outputs": [
1250 |     {
1251 |      "data": {
1252 |       "text/plain": [
1253 |        "array([[1203,    5],\n",
1254 |        "       [  11,  174]])"
1255 |       ]
1256 |      },
1257 |      "execution_count": 40,
1258 |      "metadata": {},
1259 |      "output_type": "execute_result"
1260 |     }
1261 |    ],
1262 |    "source": [
1263 |     "# print the confusion matrix\n",
1264 |     "metrics.confusion_matrix(y_test, y_pred_class)"
1265 |    ]
1266 |   },
1267 |   {
1268 |    "cell_type": "code",
1269 |    "execution_count": 41,
1270 |    "metadata": {
1271 |     "collapsed": false
1272 |    },
1273 |    "outputs": [
1274 |     {
1275 |      "data": {
1276 |       "text/plain": [
1277 |        "574               Waiting for your call.\n",
1278 |        "3375             Also andros ice etc etc\n",
1279 |        "45      No calls..messages..missed calls\n",
1280 |        "3415             No pic. Please re-send.\n",
1281 |        "1988    No calls..messages..missed calls\n",
1282 |        "Name: message, dtype: object"
1283 |       ]
1284 |      },
1285 |      "execution_count": 41,
1286 |      "metadata": {},
1287 |      "output_type": "execute_result"
1288 |     }
1289 |    ],
1290 |    "source": [
1291 |     "# print message text for the false positives (ham incorrectly classified as spam)\n",
1292 |     "X_test[y_test < y_pred_class]"
1293 |    ]
1294 |   },
1295 |   {
1296 |    "cell_type": "code",
1297 |    "execution_count": 42,
1298 |    "metadata": {
1299 |     "collapsed": false,
1300 |     "scrolled": true
1301 |    },
1302 |    "outputs": [
1303 |     {
1304 |      "data": {
1305 |       "text/plain": [
1306 |        "3132    LookAtMe!: Thanks for your purchase of a video...\n",
1307 |        "5       FreeMsg Hey there darling it's been 3 week's n...\n",
1308 |        "3530    Xmas & New Years Eve tickets are now on sale f...\n",
1309 |        "684     Hi I'm sue. I am 20 years old and work as a la...\n",
1310 |        "1875    Would you like to see my XXX pics they are so ...\n",
1311 |        "1893    CALL 09090900040 & LISTEN TO EXTREME DIRTY LIV...\n",
1312 |        "4298    thesmszone.com lets you send free anonymous an...\n",
1313 |        "4949    Hi this is Amy, we will be sending you a free ...\n",
1314 |        "2821    INTERFLORA - It's not too late to order Inter...\n",
1315 |        "2247    Hi ya babe x u 4goten bout me?' scammers getti...\n",
1316 |        "4514    Money i have won wining number 946 wot do i do...\n",
1317 |        "Name: message, dtype: object"
1318 |       ]
1319 |      },
1320 |      "execution_count": 42,
1321 |      "metadata": {},
1322 |      "output_type": "execute_result"
1323 |     }
1324 |    ],
1325 |    "source": [
1326 |     "# print message text for the false negatives (spam incorrectly classified as ham)\n",
1327 |     "X_test[y_test > y_pred_class]"
1328 |    ]
1329 |   },
1330 |   {
1331 |    "cell_type": "code",
1332 |    "execution_count": 43,
1333 |    "metadata": {
1334 |     "collapsed": false,
1335 |     "scrolled": true
1336 |    },
1337 |    "outputs": [
1338 |     {
1339 |      "data": {
1340 |       "text/plain": [
1341 |        "\"LookAtMe!: Thanks for your purchase of a video clip from LookAtMe!, you've been charged 35p. Think you can do better? Why not send a video in a MMSto 32323.\""
1342 |       ]
1343 |      },
1344 |      "execution_count": 43,
1345 |      "metadata": {},
1346 |      "output_type": "execute_result"
1347 |     }
1348 |    ],
1349 |    "source": [
1350 |     "# example false negative\n",
1351 |     "X_test[3132]"
1352 |    ]
1353 |   },
1354 |   {
1355 |    "cell_type": "code",
1356 |    "execution_count": 44,
1357 |    "metadata": {
1358 |     "collapsed": false
1359 |    },
1360 |    "outputs": [
1361 |     {
1362 |      "data": {
1363 |       "text/plain": [
1364 |        "array([  2.87744864e-03,   1.83488846e-05,   2.07301295e-03, ...,\n",
1365 |        "         1.09026171e-06,   1.00000000e+00,   3.98279868e-09])"
1366 |       ]
1367 |      },
1368 |      "execution_count": 44,
1369 |      "metadata": {},
1370 |      "output_type": "execute_result"
1371 |     }
1372 |    ],
1373 |    "source": [
1374 |     "# calculate predicted probabilities for X_test_dtm (poorly calibrated)\n",
1375 |     "y_pred_prob = nb.predict_proba(X_test_dtm)[:, 1]\n",
1376 |     "y_pred_prob"
1377 |    ]
1378 |   },
1379 |   {
1380 |    "cell_type": "code",
1381 |    "execution_count": 45,
1382 |    "metadata": {
1383 |     "collapsed": false
1384 |    },
1385 |    "outputs": [
1386 |     {
1387 |      "data": {
1388 |       "text/plain": [
1389 |        "0.98664310005369604"
1390 |       ]
1391 |      },
1392 |      "execution_count": 45,
1393 |      "metadata": {},
1394 |      "output_type": "execute_result"
1395 |     }
1396 |    ],
1397 |    "source": [
1398 |     "# calculate AUC\n",
1399 |     "metrics.roc_auc_score(y_test, y_pred_prob)"
1400 |    ]
1401 |   },
1402 |   {
1403 |    "cell_type": "markdown",
1404 |    "metadata": {},
1405 |    "source": [
1406 |     "## Part 6: Comparing models\n",
1407 |     "\n",
1408 |     "We will compare multinomial Naive Bayes with [logistic regression](http://scikit-learn.org/stable/modules/linear_model.html#logistic-regression):\n",
1409 |     "\n",
1410 |     "> Logistic regression, despite its name, is a **linear model for classification** rather than regression. Logistic regression is also known in the literature as logit regression, maximum-entropy classification (MaxEnt) or the log-linear classifier. In this model, the probabilities describing the possible outcomes of a single trial are modeled using a logistic function."
1411 |    ]
1412 |   },
1413 |   {
1414 |    "cell_type": "code",
1415 |    "execution_count": 46,
1416 |    "metadata": {
1417 |     "collapsed": true
1418 |    },
1419 |    "outputs": [],
1420 |    "source": [
1421 |     "# import and instantiate a logistic regression model\n",
1422 |     "from sklearn.linear_model import LogisticRegression\n",
1423 |     "logreg = LogisticRegression()"
1424 |    ]
1425 |   },
1426 |   {
1427 |    "cell_type": "code",
1428 |    "execution_count": 47,
1429 |    "metadata": {
1430 |     "collapsed": false
1431 |    },
1432 |    "outputs": [
1433 |     {
1434 |      "name": "stdout",
1435 |      "output_type": "stream",
1436 |      "text": [
1437 |       "Wall time: 39 ms\n"
1438 |      ]
1439 |     },
1440 |     {
1441 |      "data": {
1442 |       "text/plain": [
1443 |        "LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,\n",
1444 |        "          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,\n",
1445 |        "          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,\n",
1446 |        "          verbose=0, warm_start=False)"
1447 |       ]
1448 |      },
1449 |      "execution_count": 47,
1450 |      "metadata": {},
1451 |      "output_type": "execute_result"
1452 |     }
1453 |    ],
1454 |    "source": [
1455 |     "# train the model using X_train_dtm\n",
1456 |     "%time logreg.fit(X_train_dtm, y_train)"
1457 |    ]
1458 |   },
1459 |   {
1460 |    "cell_type": "code",
1461 |    "execution_count": 48,
1462 |    "metadata": {
1463 |     "collapsed": true
1464 |    },
1465 |    "outputs": [],
1466 |    "source": [
1467 |     "# make class predictions for X_test_dtm\n",
1468 |     "y_pred_class = logreg.predict(X_test_dtm)"
1469 |    ]
1470 |   },
1471 |   {
1472 |    "cell_type": "code",
1473 |    "execution_count": 49,
1474 |    "metadata": {
1475 |     "collapsed": false
1476 |    },
1477 |    "outputs": [
1478 |     {
1479 |      "data": {
1480 |       "text/plain": [
1481 |        "array([ 0.01269556,  0.00347183,  0.00616517, ...,  0.03354907,\n",
1482 |        "        0.99725053,  0.00157706])"
1483 |       ]
1484 |      },
1485 |      "execution_count": 49,
1486 |      "metadata": {},
1487 |      "output_type": "execute_result"
1488 |     }
1489 |    ],
1490 |    "source": [
1491 |     "# calculate predicted probabilities for X_test_dtm (well calibrated)\n",
1492 |     "y_pred_prob = logreg.predict_proba(X_test_dtm)[:, 1]\n",
1493 |     "y_pred_prob"
1494 |    ]
1495 |   },
1496 |   {
1497 |    "cell_type": "code",
1498 |    "execution_count": 50,
1499 |    "metadata": {
1500 |     "collapsed": false
1501 |    },
1502 |    "outputs": [
1503 |     {
1504 |      "data": {
1505 |       "text/plain": [
1506 |        "0.9877961234745154"
1507 |       ]
1508 |      },
1509 |      "execution_count": 50,
1510 |      "metadata": {},
1511 |      "output_type": "execute_result"
1512 |     }
1513 |    ],
1514 |    "source": [
1515 |     "# calculate accuracy\n",
1516 |     "metrics.accuracy_score(y_test, y_pred_class)"
1517 |    ]
1518 |   },
1519 |   {
1520 |    "cell_type": "code",
1521 |    "execution_count": 51,
1522 |    "metadata": {
1523 |     "collapsed": false
1524 |    },
1525 |    "outputs": [
1526 |     {
1527 |      "data": {
1528 |       "text/plain": [
1529 |        "0.99368176123143015"
1530 |       ]
1531 |      },
1532 |      "execution_count": 51,
1533 |      "metadata": {},
1534 |      "output_type": "execute_result"
1535 |     }
1536 |    ],
1537 |    "source": [
1538 |     "# calculate AUC\n",
1539 |     "metrics.roc_auc_score(y_test, y_pred_prob)"
1540 |    ]
1541 |   },
1542 |   {
1543 |    "cell_type": "markdown",
1544 |    "metadata": {},
1545 |    "source": [
1546 |     "## Part 7: Examining a model for further insight\n",
1547 |     "\n",
1548 |     "We will examine our **trained Naive Bayes model** to calculate the approximate **\"spamminess\" of each token**."
1549 |    ]
1550 |   },
1551 |   {
1552 |    "cell_type": "code",
1553 |    "execution_count": 52,
1554 |    "metadata": {
1555 |     "collapsed": false
1556 |    },
1557 |    "outputs": [
1558 |     {
1559 |      "data": {
1560 |       "text/plain": [
1561 |        "7456"
1562 |       ]
1563 |      },
1564 |      "execution_count": 52,
1565 |      "metadata": {},
1566 |      "output_type": "execute_result"
1567 |     }
1568 |    ],
1569 |    "source": [
1570 |     "# store the vocabulary of X_train (on scikit-learn 1.0+, use vect.get_feature_names_out() instead)\n",
1571 |     "X_train_tokens = vect.get_feature_names()\n",
1572 |     "len(X_train_tokens)"
1573 |    ]
1574 |   },
1575 |   {
1576 |    "cell_type": "code",
1577 |    "execution_count": 53,
1578 |    "metadata": {
1579 |     "collapsed": false,
1580 |     "scrolled": true
1581 |    },
1582 |    "outputs": [
1583 |     {
1584 |      "name": "stdout",
1585 |      "output_type": "stream",
1586 |      "text": [
1587 |       "[u'00', u'000', u'008704050406', u'0121', u'01223585236', u'01223585334', u'0125698789', u'02', u'0207', u'02072069400', u'02073162414', u'02085076972', u'021', u'03', u'04', u'0430', u'05', u'050703', u'0578', u'06', u'07', u'07008009200', u'07090201529', u'07090298926', u'07123456789', u'07732584351', u'07734396839', u'07742676969', u'0776xxxxxxx', u'07781482378', u'07786200117', u'078', u'07801543489', u'07808', u'07808247860', u'07808726822', u'07815296484', u'07821230901', u'07880867867', u'0789xxxxxxx', u'07946746291', u'0796xxxxxx', u'07973788240', u'07xxxxxxxxx', u'08', u'0800', u'08000407165', u'08000776320', u'08000839402', u'08000930705']\n"
1588 |      ]
1589 |     }
1590 |    ],
1591 |    "source": [
1592 |     "# examine the first 50 tokens\n",
1593 |     "print(X_train_tokens[0:50])"
1594 |    ]
1595 |   },
1596 |   {
1597 |    "cell_type": "code",
1598 |    "execution_count": 54,
1599 |    "metadata": {
1600 |     "collapsed": false
1601 |    },
1602 |    "outputs": [
1603 |     {
1604 |      "name": "stdout",
1605 |      "output_type": "stream",
1606 |      "text": [
1607 |       "[u'yer', u'yes', u'yest', u'yesterday', u'yet', u'yetunde', u'yijue', u'ym', u'ymca', u'yo', u'yoga', u'yogasana', u'yor', u'yorge', u'you', u'youdoing', u'youi', u'youphone', u'your', u'youre', u'yourjob', u'yours', u'yourself', u'youwanna', u'yowifes', u'yoyyooo', u'yr', u'yrs', u'ything', u'yummmm', u'yummy', u'yun', u'yunny', u'yuo', u'yuou', u'yup', u'zac', u'zaher', u'zealand', u'zebra', u'zed', u'zeros', u'zhong', u'zindgi', u'zoe', u'zoom', u'zouk', u'zyada', u'\\xe8n', u'\\u3028ud']\n"
1608 |      ]
1609 |     }
1610 |    ],
1611 |    "source": [
1612 |     "# examine the last 50 tokens\n",
1613 |     "print(X_train_tokens[-50:])"
1614 |    ]
1615 |   },
1616 |   {
1617 |    "cell_type": "code",
1618 |    "execution_count": 55,
1619 |    "metadata": {
1620 |     "collapsed": false
1621 |    },
1622 |    "outputs": [
1623 |     {
1624 |      "data": {
1625 |       "text/plain": [
1626 |        "array([[  0.,   0.,   0., ...,   1.,   1.,   1.],\n",
1627 |        "       [  5.,  23.,   2., ...,   0.,   0.,   0.]])"
1628 |       ]
1629 |      },
1630 |      "execution_count": 55,
1631 |      "metadata": {},
1632 |      "output_type": "execute_result"
1633 |     }
1634 |    ],
1635 |    "source": [
1636 |     "# Naive Bayes counts the number of times each token appears in each class\n",
1637 |     "nb.feature_count_"
1638 |    ]
1639 |   },
1640 |   {
1641 |    "cell_type": "code",
1642 |    "execution_count": 56,
1643 |    "metadata": {
1644 |     "collapsed": false
1645 |    },
1646 |    "outputs": [
1647 |     {
1648 |      "data": {
1649 |       "text/plain": [
1650 |        "(2L, 7456L)"
1651 |       ]
1652 |      },
1653 |      "execution_count": 56,
1654 |      "metadata": {},
1655 |      "output_type": "execute_result"
1656 |     }
1657 |    ],
1658 |    "source": [
1659 |     "# rows represent classes, columns represent tokens\n",
1660 |     "nb.feature_count_.shape"
1661 |    ]
1662 |   },
1663 |   {
1664 |    "cell_type": "code",
1665 |    "execution_count": 57,
1666 |    "metadata": {
1667 |     "collapsed": false
1668 |    },
1669 |    "outputs": [
1670 |     {
1671 |      "data": {
1672 |       "text/plain": [
1673 |        "array([ 0.,  0.,  0., ...,  1.,  1.,  1.])"
1674 |       ]
1675 |      },
1676 |      "execution_count": 57,
1677 |      "metadata": {},
1678 |      "output_type": "execute_result"
1679 |     }
1680 |    ],
1681 |    "source": [
1682 |     "# number of times each token appears across all HAM messages\n",
1683 |     "ham_token_count = nb.feature_count_[0, :]\n",
1684 |     "ham_token_count"
1685 |    ]
1686 |   },
1687 |   {
1688 |    "cell_type": "code",
1689 |    "execution_count": 58,
1690 |    "metadata": {
1691 |     "collapsed": false
1692 |    },
1693 |    "outputs": [
1694 |     {
1695 |      "data": {
1696 |       "text/plain": [
1697 |        "array([  5.,  23.,   2., ...,   0.,   0.,   0.])"
1698 |       ]
1699 |      },
1700 |      "execution_count": 58,
1701 |      "metadata": {},
1702 |      "output_type": "execute_result"
1703 |     }
1704 |    ],
1705 |    "source": [
1706 |     "# number of times each token appears across all SPAM messages\n",
1707 |     "spam_token_count = nb.feature_count_[1, :]\n",
1708 |     "spam_token_count"
1709 |    ]
1710 |   },
1711 |   {
1712 |    "cell_type": "code",
1713 |    "execution_count": 59,
1714 |    "metadata": {
1715 |     "collapsed": false
1716 |    },
1717 |    "outputs": [
1718 |     {
1719 |      "data": {
1720 |       "text/html": [
1721 |        "<div>\n",
1722 |        "<table border=\"1\" class=\"dataframe\">\n",
1723 |        "  <thead>\n",
1724 |        "    <tr style=\"text-align: right;\">\n",
1725 |        "      <th></th>\n",
1726 |        "      <th>ham</th>\n",
1727 |        "      <th>spam</th>\n",
1728 |        "    </tr>\n",
1729 |        "    <tr>\n",
1730 |        "      <th>token</th>\n",
1731 |        "      <th></th>\n",
1732 |        "      <th></th>\n",
1733 |        "    </tr>\n",
1734 |        "  </thead>\n",
1735 |        "  <tbody>\n",
1736 |        "    <tr>\n",
1737 |        "      <th>00</th>\n",
1738 |        "      <td>0</td>\n",
1739 |        "      <td>5</td>\n",
1740 |        "    </tr>\n",
1741 |        "    <tr>\n",
1742 |        "      <th>000</th>\n",
1743 |        "      <td>0</td>\n",
1744 |        "      <td>23</td>\n",
1745 |        "    </tr>\n",
1746 |        "    <tr>\n",
1747 |        "      <th>008704050406</th>\n",
1748 |        "      <td>0</td>\n",
1749 |        "      <td>2</td>\n",
1750 |        "    </tr>\n",
1751 |        "    <tr>\n",
1752 |        "      <th>0121</th>\n",
1753 |        "      <td>0</td>\n",
1754 |        "      <td>1</td>\n",
1755 |        "    </tr>\n",
1756 |        "    <tr>\n",
1757 |        "      <th>01223585236</th>\n",
1758 |        "      <td>0</td>\n",
1759 |        "      <td>1</td>\n",
1760 |        "    </tr>\n",
1761 |        "  </tbody>\n",
1762 |        "</table>\n",
1763 |        "</div>"
1764 |       ],
1765 |       "text/plain": [
1766 |        "              ham  spam\n",
1767 |        "token                  \n",
1768 |        "00              0     5\n",
1769 |        "000             0    23\n",
1770 |        "008704050406    0     2\n",
1771 |        "0121            0     1\n",
1772 |        "01223585236     0     1"
1773 |       ]
1774 |      },
1775 |      "execution_count": 59,
1776 |      "metadata": {},
1777 |      "output_type": "execute_result"
1778 |     }
1779 |    ],
1780 |    "source": [
1781 |     "# create a DataFrame of tokens with their separate ham and spam counts\n",
1782 |     "tokens = pd.DataFrame({'token':X_train_tokens, 'ham':ham_token_count, 'spam':spam_token_count}).set_index('token')\n",
1783 |     "tokens.head()"
1784 |    ]
1785 |   },
1786 |   {
1787 |    "cell_type": "code",
1788 |    "execution_count": 60,
1789 |    "metadata": {
1790 |     "collapsed": false
1791 |    },
1792 |    "outputs": [
1793 |     {
1794 |      "data": {
1795 |       "text/html": [
1796 |        "<div>\n",
1797 |        "<table border=\"1\" class=\"dataframe\">\n",
1798 |        "  <thead>\n",
1799 |        "    <tr style=\"text-align: right;\">\n",
1800 |        "      <th></th>\n",
1801 |        "      <th>ham</th>\n",
1802 |        "      <th>spam</th>\n",
1803 |        "    </tr>\n",
1804 |        "    <tr>\n",
1805 |        "      <th>token</th>\n",
1806 |        "      <th></th>\n",
1807 |        "      <th></th>\n",
1808 |        "    </tr>\n",
1809 |        "  </thead>\n",
1810 |        "  <tbody>\n",
1811 |        "    <tr>\n",
1812 |        "      <th>very</th>\n",
1813 |        "      <td>64</td>\n",
1814 |        "      <td>2</td>\n",
1815 |        "    </tr>\n",
1816 |        "    <tr>\n",
1817 |        "      <th>nasty</th>\n",
1818 |        "      <td>1</td>\n",
1819 |        "      <td>1</td>\n",
1820 |        "    </tr>\n",
1821 |        "    <tr>\n",
1822 |        "      <th>villa</th>\n",
1823 |        "      <td>0</td>\n",
1824 |        "      <td>1</td>\n",
1825 |        "    </tr>\n",
1826 |        "    <tr>\n",
1827 |        "      <th>beloved</th>\n",
1828 |        "      <td>1</td>\n",
1829 |        "      <td>0</td>\n",
1830 |        "    </tr>\n",
1831 |        "    <tr>\n",
1832 |        "      <th>textoperator</th>\n",
1833 |        "      <td>0</td>\n",
1834 |        "      <td>2</td>\n",
1835 |        "    </tr>\n",
1836 |        "  </tbody>\n",
1837 |        "</table>\n",
1838 |        "</div>"
1839 |       ],
1840 |       "text/plain": [
1841 |        "              ham  spam\n",
1842 |        "token                  \n",
1843 |        "very           64     2\n",
1844 |        "nasty           1     1\n",
1845 |        "villa           0     1\n",
1846 |        "beloved         1     0\n",
1847 |        "textoperator    0     2"
1848 |       ]
1849 |      },
1850 |      "execution_count": 60,
1851 |      "metadata": {},
1852 |      "output_type": "execute_result"
1853 |     }
1854 |    ],
1855 |    "source": [
1856 |     "# examine 5 random DataFrame rows\n",
1857 |     "tokens.sample(5, random_state=6)"
1858 |    ]
1859 |   },
1860 |   {
1861 |    "cell_type": "code",
1862 |    "execution_count": 61,
1863 |    "metadata": {
1864 |     "collapsed": false
1865 |    },
1866 |    "outputs": [
1867 |     {
1868 |      "data": {
1869 |       "text/plain": [
1870 |        "array([ 3617.,   562.])"
1871 |       ]
1872 |      },
1873 |      "execution_count": 61,
1874 |      "metadata": {},
1875 |      "output_type": "execute_result"
1876 |     }
1877 |    ],
1878 |    "source": [
1879 |     "# Naive Bayes counts the number of observations in each class\n",
1880 |     "nb.class_count_"
1881 |    ]
1882 |   },
1883 |   {
1884 |    "cell_type": "markdown",
1885 |    "metadata": {},
1886 |    "source": [
1887 |     "Before we can calculate the \"spamminess\" of each token, we need to avoid **dividing by zero** and account for the **class imbalance**."
1888 |    ]
1889 |   },
1890 |   {
1891 |    "cell_type": "code",
1892 |    "execution_count": 62,
1893 |    "metadata": {
1894 |     "collapsed": false
1895 |    },
1896 |    "outputs": [
1897 |     {
1898 |      "data": {
1899 |       "text/html": [
1900 |        "<div>\n",
1901 |        "<table border=\"1\" class=\"dataframe\">\n",
1902 |        "  <thead>\n",
1903 |        "    <tr style=\"text-align: right;\">\n",
1904 |        "      <th></th>\n",
1905 |        "      <th>ham</th>\n",
1906 |        "      <th>spam</th>\n",
1907 |        "    </tr>\n",
1908 |        "    <tr>\n",
1909 |        "      <th>token</th>\n",
1910 |        "      <th></th>\n",
1911 |        "      <th></th>\n",
1912 |        "    </tr>\n",
1913 |        "  </thead>\n",
1914 |        "  <tbody>\n",
1915 |        "    <tr>\n",
1916 |        "      <th>very</th>\n",
1917 |        "      <td>65</td>\n",
1918 |        "      <td>3</td>\n",
1919 |        "    </tr>\n",
1920 |        "    <tr>\n",
1921 |        "      <th>nasty</th>\n",
1922 |        "      <td>2</td>\n",
1923 |        "      <td>2</td>\n",
1924 |        "    </tr>\n",
1925 |        "    <tr>\n",
1926 |        "      <th>villa</th>\n",
1927 |        "      <td>1</td>\n",
1928 |        "      <td>2</td>\n",
1929 |        "    </tr>\n",
1930 |        "    <tr>\n",
1931 |        "      <th>beloved</th>\n",
1932 |        "      <td>2</td>\n",
1933 |        "      <td>1</td>\n",
1934 |        "    </tr>\n",
1935 |        "    <tr>\n",
1936 |        "      <th>textoperator</th>\n",
1937 |        "      <td>1</td>\n",
1938 |        "      <td>3</td>\n",
1939 |        "    </tr>\n",
1940 |        "  </tbody>\n",
1941 |        "</table>\n",
1942 |        "</div>"
1943 |       ],
1944 |       "text/plain": [
1945 |        "              ham  spam\n",
1946 |        "token                  \n",
1947 |        "very           65     3\n",
1948 |        "nasty           2     2\n",
1949 |        "villa           1     2\n",
1950 |        "beloved         2     1\n",
1951 |        "textoperator    1     3"
1952 |       ]
1953 |      },
1954 |      "execution_count": 62,
1955 |      "metadata": {},
1956 |      "output_type": "execute_result"
1957 |     }
1958 |    ],
1959 |    "source": [
1960 |     "# add 1 to ham and spam counts to avoid dividing by 0\n",
1961 |     "tokens['ham'] = tokens.ham + 1\n",
1962 |     "tokens['spam'] = tokens.spam + 1\n",
1963 |     "tokens.sample(5, random_state=6)"
1964 |    ]
1965 |   },
1966 |   {
1967 |    "cell_type": "code",
1968 |    "execution_count": 63,
1969 |    "metadata": {
1970 |     "collapsed": false
1971 |    },
1972 |    "outputs": [
1973 |     {
1974 |      "data": {
1975 |       "text/html": [
1976 |        "<div>\n",
1977 |        "<table border=\"1\" class=\"dataframe\">\n",
1978 |        "  <thead>\n",
1979 |        "    <tr style=\"text-align: right;\">\n",
1980 |        "      <th></th>\n",
1981 |        "      <th>ham</th>\n",
1982 |        "      <th>spam</th>\n",
1983 |        "    </tr>\n",
1984 |        "    <tr>\n",
1985 |        "      <th>token</th>\n",
1986 |        "      <th></th>\n",
1987 |        "      <th></th>\n",
1988 |        "    </tr>\n",
1989 |        "  </thead>\n",
1990 |        "  <tbody>\n",
1991 |        "    <tr>\n",
1992 |        "      <th>very</th>\n",
1993 |        "      <td>0.017971</td>\n",
1994 |        "      <td>0.005338</td>\n",
1995 |        "    </tr>\n",
1996 |        "    <tr>\n",
1997 |        "      <th>nasty</th>\n",
1998 |        "      <td>0.000553</td>\n",
1999 |        "      <td>0.003559</td>\n",
2000 |        "    </tr>\n",
2001 |        "    <tr>\n",
2002 |        "      <th>villa</th>\n",
2003 |        "      <td>0.000276</td>\n",
2004 |        "      <td>0.003559</td>\n",
2005 |        "    </tr>\n",
2006 |        "    <tr>\n",
2007 |        "      <th>beloved</th>\n",
2008 |        "      <td>0.000553</td>\n",
2009 |        "      <td>0.001779</td>\n",
2010 |        "    </tr>\n",
2011 |        "    <tr>\n",
2012 |        "      <th>textoperator</th>\n",
2013 |        "      <td>0.000276</td>\n",
2014 |        "      <td>0.005338</td>\n",
2015 |        "    </tr>\n",
2016 |        "  </tbody>\n",
2017 |        "</table>\n",
2018 |        "</div>"
2019 |       ],
2020 |       "text/plain": [
2021 |        "                   ham      spam\n",
2022 |        "token                           \n",
2023 |        "very          0.017971  0.005338\n",
2024 |        "nasty         0.000553  0.003559\n",
2025 |        "villa         0.000276  0.003559\n",
2026 |        "beloved       0.000553  0.001779\n",
2027 |        "textoperator  0.000276  0.005338"
2028 |       ]
2029 |      },
2030 |      "execution_count": 63,
2031 |      "metadata": {},
2032 |      "output_type": "execute_result"
2033 |     }
2034 |    ],
2035 |    "source": [
2036 |     "# convert the ham and spam counts into frequencies by dividing by the number of messages in each class\n",
2037 |     "tokens['ham'] = tokens.ham / nb.class_count_[0]\n",
2038 |     "tokens['spam'] = tokens.spam / nb.class_count_[1]\n",
2039 |     "tokens.sample(5, random_state=6)"
2040 |    ]
2041 |   },
2042 |   {
2043 |    "cell_type": "code",
2044 |    "execution_count": 64,
2045 |    "metadata": {
2046 |     "collapsed": false
2047 |    },
2048 |    "outputs": [
2049 |     {
2050 |      "data": {
2051 |       "text/html": [
2052 |        "<div>\n",
2053 |        "<table border=\"1\" class=\"dataframe\">\n",
2054 |        "  <thead>\n",
2055 |        "    <tr style=\"text-align: right;\">\n",
2056 |        "      <th></th>\n",
2057 |        "      <th>ham</th>\n",
2058 |        "      <th>spam</th>\n",
2059 |        "      <th>spam_ratio</th>\n",
2060 |        "    </tr>\n",
2061 |        "    <tr>\n",
2062 |        "      <th>token</th>\n",
2063 |        "      <th></th>\n",
2064 |        "      <th></th>\n",
2065 |        "      <th></th>\n",
2066 |        "    </tr>\n",
2067 |        "  </thead>\n",
2068 |        "  <tbody>\n",
2069 |        "    <tr>\n",
2070 |        "      <th>very</th>\n",
2071 |        "      <td>0.017971</td>\n",
2072 |        "      <td>0.005338</td>\n",
2073 |        "      <td>0.297044</td>\n",
2074 |        "    </tr>\n",
2075 |        "    <tr>\n",
2076 |        "      <th>nasty</th>\n",
2077 |        "      <td>0.000553</td>\n",
2078 |        "      <td>0.003559</td>\n",
2079 |        "      <td>6.435943</td>\n",
2080 |        "    </tr>\n",
2081 |        "    <tr>\n",
2082 |        "      <th>villa</th>\n",
2083 |        "      <td>0.000276</td>\n",
2084 |        "      <td>0.003559</td>\n",
2085 |        "      <td>12.871886</td>\n",
2086 |        "    </tr>\n",
2087 |        "    <tr>\n",
2088 |        "      <th>beloved</th>\n",
2089 |        "      <td>0.000553</td>\n",
2090 |        "      <td>0.001779</td>\n",
2091 |        "      <td>3.217972</td>\n",
2092 |        "    </tr>\n",
2093 |        "    <tr>\n",
2094 |        "      <th>textoperator</th>\n",
2095 |        "      <td>0.000276</td>\n",
2096 |        "      <td>0.005338</td>\n",
2097 |        "      <td>19.307829</td>\n",
2098 |        "    </tr>\n",
2099 |        "  </tbody>\n",
2100 |        "</table>\n",
2101 |        "</div>"
2102 |       ],
2103 |       "text/plain": [
2104 |        "                   ham      spam  spam_ratio\n",
2105 |        "token                                       \n",
2106 |        "very          0.017971  0.005338    0.297044\n",
2107 |        "nasty         0.000553  0.003559    6.435943\n",
2108 |        "villa         0.000276  0.003559   12.871886\n",
2109 |        "beloved       0.000553  0.001779    3.217972\n",
2110 |        "textoperator  0.000276  0.005338   19.307829"
2111 |       ]
2112 |      },
2113 |      "execution_count": 64,
2114 |      "metadata": {},
2115 |      "output_type": "execute_result"
2116 |     }
2117 |    ],
2118 |    "source": [
2119 |     "# calculate the ratio of spam-to-ham for each token\n",
2120 |     "tokens['spam_ratio'] = tokens.spam / tokens.ham\n",
2121 |     "tokens.sample(5, random_state=6)"
2122 |    ]
2123 |   },
2124 |   {
2125 |    "cell_type": "code",
2126 |    "execution_count": 65,
2127 |    "metadata": {
2128 |     "collapsed": false
2129 |    },
2130 |    "outputs": [
2131 |     {
2132 |      "data": {
2133 |       "text/html": [
2134 |        "<div>\n",
2135 |        "<table border=\"1\" class=\"dataframe\">\n",
2136 |        "  <thead>\n",
2137 |        "    <tr style=\"text-align: right;\">\n",
2138 |        "      <th></th>\n",
2139 |        "      <th>ham</th>\n",
2140 |        "      <th>spam</th>\n",
2141 |        "      <th>spam_ratio</th>\n",
2142 |        "    </tr>\n",
2143 |        "    <tr>\n",
2144 |        "      <th>token</th>\n",
2145 |        "      <th></th>\n",
2146 |        "      <th></th>\n",
2147 |        "      <th></th>\n",
2148 |        "    </tr>\n",
2149 |        "  </thead>\n",
2150 |        "  <tbody>\n",
2151 |        "    <tr>\n",
2152 |        "      <th>claim</th>\n",
2153 |        "      <td>0.000276</td>\n",
2154 |        "      <td>0.158363</td>\n",
2155 |        "      <td>572.798932</td>\n",
2156 |        "    </tr>\n",
2157 |        "    <tr>\n",
2158 |        "      <th>prize</th>\n",
2159 |        "      <td>0.000276</td>\n",
2160 |        "      <td>0.135231</td>\n",
2161 |        "      <td>489.131673</td>\n",
2162 |        "    </tr>\n",
2163 |        "    <tr>\n",
2164 |        "      <th>150p</th>\n",
2165 |        "      <td>0.000276</td>\n",
2166 |        "      <td>0.087189</td>\n",
2167 |        "      <td>315.361210</td>\n",
2168 |        "    </tr>\n",
2169 |        "    <tr>\n",
2170 |        "      <th>tone</th>\n",
2171 |        "      <td>0.000276</td>\n",
2172 |        "      <td>0.085409</td>\n",
2173 |        "      <td>308.925267</td>\n",
2174 |        "    </tr>\n",
2175 |        "    <tr>\n",
2176 |        "      <th>guaranteed</th>\n",
2177 |        "      <td>0.000276</td>\n",
2178 |        "      <td>0.076512</td>\n",
2179 |        "      <td>276.745552</td>\n",
2180 |        "    </tr>\n",
2181 |        "    <tr>\n",
2182 |        "      <th>18</th>\n",
2183 |        "      <td>0.000276</td>\n",
2184 |        "      <td>0.069395</td>\n",
2185 |        "      <td>251.001779</td>\n",
2186 |        "    </tr>\n",
2187 |        "    <tr>\n",
2188 |        "      <th>cs</th>\n",
2189 |        "      <td>0.000276</td>\n",
2190 |        "      <td>0.065836</td>\n",
2191 |        "      <td>238.129893</td>\n",
2192 |        "    </tr>\n",
2193 |        "    <tr>\n",
2194 |        "      <th>www</th>\n",
2195 |        "      <td>0.000553</td>\n",
2196 |        "      <td>0.129893</td>\n",
2197 |        "      <td>234.911922</td>\n",
2198 |        "    </tr>\n",
2199 |        "    <tr>\n",
2200 |        "      <th>1000</th>\n",
2201 |        "      <td>0.000276</td>\n",
2202 |        "      <td>0.056940</td>\n",
2203 |        "      <td>205.950178</td>\n",
2204 |        "    </tr>\n",
2205 |        "    <tr>\n",
2206 |        "      <th>awarded</th>\n",
2207 |        "      <td>0.000276</td>\n",
2208 |        "      <td>0.053381</td>\n",
2209 |        "      <td>193.078292</td>\n",
2210 |        "    </tr>\n",
2211 |        "    <tr>\n",
2212 |        "      <th>150ppm</th>\n",
2213 |        "      <td>0.000276</td>\n",
2214 |        "      <td>0.051601</td>\n",
2215 |        "      <td>186.642349</td>\n",
2216 |        "    </tr>\n",
2217 |        "    <tr>\n",
2218 |        "      <th>uk</th>\n",
2219 |        "      <td>0.000553</td>\n",
2220 |        "      <td>0.099644</td>\n",
2221 |        "      <td>180.206406</td>\n",
2222 |        "    </tr>\n",
2223 |        "    <tr>\n",
2224 |        "      <th>500</th>\n",
2225 |        "      <td>0.000276</td>\n",
2226 |        "      <td>0.048043</td>\n",
2227 |        "      <td>173.770463</td>\n",
2228 |        "    </tr>\n",
2229 |        "    <tr>\n",
2230 |        "      <th>ringtone</th>\n",
2231 |        "      <td>0.000276</td>\n",
2232 |        "      <td>0.044484</td>\n",
2233 |        "      <td>160.898577</td>\n",
2234 |        "    </tr>\n",
2235 |        "    <tr>\n",
2236 |        "      <th>000</th>\n",
2237 |        "      <td>0.000276</td>\n",
2238 |        "      <td>0.042705</td>\n",
2239 |        "      <td>154.462633</td>\n",
2240 |        "    </tr>\n",
2241 |        "    <tr>\n",
2242 |        "      <th>mob</th>\n",
2243 |        "      <td>0.000276</td>\n",
2244 |        "      <td>0.042705</td>\n",
2245 |        "      <td>154.462633</td>\n",
2246 |        "    </tr>\n",
2247 |        "    <tr>\n",
2248 |        "      <th>co</th>\n",
2249 |        "      <td>0.000553</td>\n",
2250 |        "      <td>0.078292</td>\n",
2251 |        "      <td>141.590747</td>\n",
2252 |        "    </tr>\n",
2253 |        "    <tr>\n",
2254 |        "      <th>collection</th>\n",
2255 |        "      <td>0.000276</td>\n",
2256 |        "      <td>0.039146</td>\n",
2257 |        "      <td>141.590747</td>\n",
2258 |        "    </tr>\n",
2259 |        "    <tr>\n",
2260 |        "      <th>valid</th>\n",
2261 |        "      <td>0.000276</td>\n",
2262 |        "      <td>0.037367</td>\n",
2263 |        "      <td>135.154804</td>\n",
2264 |        "    </tr>\n",
2265 |        "    <tr>\n",
2266 |        "      <th>2000</th>\n",
2267 |        "      <td>0.000276</td>\n",
2268 |        "      <td>0.037367</td>\n",
2269 |        "      <td>135.154804</td>\n",
2270 |        "    </tr>\n",
2271 |        "    <tr>\n",
2272 |        "      <th>800</th>\n",
2273 |        "      <td>0.000276</td>\n",
2274 |        "      <td>0.037367</td>\n",
2275 |        "      <td>135.154804</td>\n",
2276 |        "    </tr>\n",
2277 |        "    <tr>\n",
2278 |        "      <th>10p</th>\n",
2279 |        "      <td>0.000276</td>\n",
2280 |        "      <td>0.037367</td>\n",
2281 |        "      <td>135.154804</td>\n",
2282 |        "    </tr>\n",
2283 |        "    <tr>\n",
2284 |        "      <th>8007</th>\n",
2285 |        "      <td>0.000276</td>\n",
2286 |        "      <td>0.035587</td>\n",
2287 |        "      <td>128.718861</td>\n",
2288 |        "    </tr>\n",
2289 |        "    <tr>\n",
2290 |        "      <th>16</th>\n",
2291 |        "      <td>0.000553</td>\n",
2292 |        "      <td>0.067616</td>\n",
2293 |        "      <td>122.282918</td>\n",
2294 |        "    </tr>\n",
2295 |        "    <tr>\n",
2296 |        "      <th>weekly</th>\n",
2297 |        "      <td>0.000276</td>\n",
2298 |        "      <td>0.033808</td>\n",
2299 |        "      <td>122.282918</td>\n",
2300 |        "    </tr>\n",
2301 |        "    <tr>\n",
2302 |        "      <th>tones</th>\n",
2303 |        "      <td>0.000276</td>\n",
2304 |        "      <td>0.032028</td>\n",
2305 |        "      <td>115.846975</td>\n",
2306 |        "    </tr>\n",
2307 |        "    <tr>\n",
2308 |        "      <th>land</th>\n",
2309 |        "      <td>0.000276</td>\n",
2310 |        "      <td>0.032028</td>\n",
2311 |        "      <td>115.846975</td>\n",
2312 |        "    </tr>\n",
2313 |        "    <tr>\n",
2314 |        "      <th>http</th>\n",
2315 |        "      <td>0.000276</td>\n",
2316 |        "      <td>0.032028</td>\n",
2317 |        "      <td>115.846975</td>\n",
2318 |        "    </tr>\n",
2319 |        "    <tr>\n",
2320 |        "      <th>national</th>\n",
2321 |        "      <td>0.000276</td>\n",
2322 |        "      <td>0.030249</td>\n",
2323 |        "      <td>109.411032</td>\n",
2324 |        "    </tr>\n",
2325 |        "    <tr>\n",
2326 |        "      <th>5000</th>\n",
2327 |        "      <td>0.000276</td>\n",
2328 |        "      <td>0.030249</td>\n",
2329 |        "      <td>109.411032</td>\n",
2330 |        "    </tr>\n",
2331 |        "    <tr>\n",
2332 |        "      <th>...</th>\n",
2333 |        "      <td>...</td>\n",
2334 |        "      <td>...</td>\n",
2335 |        "      <td>...</td>\n",
2336 |        "    </tr>\n",
2337 |        "    <tr>\n",
2338 |        "      <th>went</th>\n",
2339 |        "      <td>0.012718</td>\n",
2340 |        "      <td>0.001779</td>\n",
2341 |        "      <td>0.139912</td>\n",
2342 |        "    </tr>\n",
2343 |        "    <tr>\n",
2344 |        "      <th>ll</th>\n",
2345 |        "      <td>0.052530</td>\n",
2346 |        "      <td>0.007117</td>\n",
2347 |        "      <td>0.135494</td>\n",
2348 |        "    </tr>\n",
2349 |        "    <tr>\n",
2350 |        "      <th>told</th>\n",
2351 |        "      <td>0.013824</td>\n",
2352 |        "      <td>0.001779</td>\n",
2353 |        "      <td>0.128719</td>\n",
2354 |        "    </tr>\n",
2355 |        "    <tr>\n",
2356 |        "      <th>feel</th>\n",
2357 |        "      <td>0.013824</td>\n",
2358 |        "      <td>0.001779</td>\n",
2359 |        "      <td>0.128719</td>\n",
2360 |        "    </tr>\n",
2361 |        "    <tr>\n",
2362 |        "      <th>gud</th>\n",
2363 |        "      <td>0.014100</td>\n",
2364 |        "      <td>0.001779</td>\n",
2365 |        "      <td>0.126195</td>\n",
2366 |        "    </tr>\n",
2367 |        "    <tr>\n",
2368 |        "      <th>cos</th>\n",
2369 |        "      <td>0.014929</td>\n",
2370 |        "      <td>0.001779</td>\n",
2371 |        "      <td>0.119184</td>\n",
2372 |        "    </tr>\n",
2373 |        "    <tr>\n",
2374 |        "      <th>but</th>\n",
2375 |        "      <td>0.090683</td>\n",
2376 |        "      <td>0.010676</td>\n",
2377 |        "      <td>0.117731</td>\n",
2378 |        "    </tr>\n",
2379 |        "    <tr>\n",
2380 |        "      <th>amp</th>\n",
2381 |        "      <td>0.015206</td>\n",
2382 |        "      <td>0.001779</td>\n",
2383 |        "      <td>0.117017</td>\n",
2384 |        "    </tr>\n",
2385 |        "    <tr>\n",
2386 |        "      <th>something</th>\n",
2387 |        "      <td>0.015206</td>\n",
2388 |        "      <td>0.001779</td>\n",
2389 |        "      <td>0.117017</td>\n",
2390 |        "    </tr>\n",
2391 |        "    <tr>\n",
2392 |        "      <th>sure</th>\n",
2393 |        "      <td>0.015206</td>\n",
2394 |        "      <td>0.001779</td>\n",
2395 |        "      <td>0.117017</td>\n",
2396 |        "    </tr>\n",
2397 |        "    <tr>\n",
2398 |        "      <th>ok</th>\n",
2399 |        "      <td>0.061100</td>\n",
2400 |        "      <td>0.007117</td>\n",
2401 |        "      <td>0.116488</td>\n",
2402 |        "    </tr>\n",
2403 |        "    <tr>\n",
2404 |        "      <th>said</th>\n",
2405 |        "      <td>0.016312</td>\n",
2406 |        "      <td>0.001779</td>\n",
2407 |        "      <td>0.109084</td>\n",
2408 |        "    </tr>\n",
2409 |        "    <tr>\n",
2410 |        "      <th>morning</th>\n",
2411 |        "      <td>0.016865</td>\n",
2412 |        "      <td>0.001779</td>\n",
2413 |        "      <td>0.105507</td>\n",
2414 |        "    </tr>\n",
2415 |        "    <tr>\n",
2416 |        "      <th>yeah</th>\n",
2417 |        "      <td>0.017694</td>\n",
2418 |        "      <td>0.001779</td>\n",
2419 |        "      <td>0.100562</td>\n",
2420 |        "    </tr>\n",
2421 |        "    <tr>\n",
2422 |        "      <th>lol</th>\n",
2423 |        "      <td>0.017694</td>\n",
2424 |        "      <td>0.001779</td>\n",
2425 |        "      <td>0.100562</td>\n",
2426 |        "    </tr>\n",
2427 |        "    <tr>\n",
2428 |        "      <th>anything</th>\n",
2429 |        "      <td>0.017971</td>\n",
2430 |        "      <td>0.001779</td>\n",
2431 |        "      <td>0.099015</td>\n",
2432 |        "    </tr>\n",
2433 |        "    <tr>\n",
2434 |        "      <th>my</th>\n",
2435 |        "      <td>0.150401</td>\n",
2436 |        "      <td>0.014235</td>\n",
2437 |        "      <td>0.094646</td>\n",
2438 |        "    </tr>\n",
2439 |        "    <tr>\n",
2440 |        "      <th>doing</th>\n",
2441 |        "      <td>0.019077</td>\n",
2442 |        "      <td>0.001779</td>\n",
2443 |        "      <td>0.093275</td>\n",
2444 |        "    </tr>\n",
2445 |        "    <tr>\n",
2446 |        "      <th>way</th>\n",
2447 |        "      <td>0.019630</td>\n",
2448 |        "      <td>0.001779</td>\n",
2449 |        "      <td>0.090647</td>\n",
2450 |        "    </tr>\n",
2451 |        "    <tr>\n",
2452 |        "      <th>ask</th>\n",
2453 |        "      <td>0.019630</td>\n",
2454 |        "      <td>0.001779</td>\n",
2455 |        "      <td>0.090647</td>\n",
2456 |        "    </tr>\n",
2457 |        "    <tr>\n",
2458 |        "      <th>already</th>\n",
2459 |        "      <td>0.019630</td>\n",
2460 |        "      <td>0.001779</td>\n",
2461 |        "      <td>0.090647</td>\n",
2462 |        "    </tr>\n",
2463 |        "    <tr>\n",
2464 |        "      <th>too</th>\n",
2465 |        "      <td>0.021841</td>\n",
2466 |        "      <td>0.001779</td>\n",
2467 |        "      <td>0.081468</td>\n",
2468 |        "    </tr>\n",
2469 |        "    <tr>\n",
2470 |        "      <th>come</th>\n",
2471 |        "      <td>0.048936</td>\n",
2472 |        "      <td>0.003559</td>\n",
2473 |        "      <td>0.072723</td>\n",
2474 |        "    </tr>\n",
2475 |        "    <tr>\n",
2476 |        "      <th>later</th>\n",
2477 |        "      <td>0.030688</td>\n",
2478 |        "      <td>0.001779</td>\n",
2479 |        "      <td>0.057981</td>\n",
2480 |        "    </tr>\n",
2481 |        "    <tr>\n",
2482 |        "      <th>lor</th>\n",
2483 |        "      <td>0.032900</td>\n",
2484 |        "      <td>0.001779</td>\n",
2485 |        "      <td>0.054084</td>\n",
2486 |        "    </tr>\n",
2487 |        "    <tr>\n",
2488 |        "      <th>da</th>\n",
2489 |        "      <td>0.032900</td>\n",
2490 |        "      <td>0.001779</td>\n",
2491 |        "      <td>0.054084</td>\n",
2492 |        "    </tr>\n",
2493 |        "    <tr>\n",
2494 |        "      <th>she</th>\n",
2495 |        "      <td>0.035665</td>\n",
2496 |        "      <td>0.001779</td>\n",
2497 |        "      <td>0.049891</td>\n",
2498 |        "    </tr>\n",
2499 |        "    <tr>\n",
2500 |        "      <th>he</th>\n",
2501 |        "      <td>0.047000</td>\n",
2502 |        "      <td>0.001779</td>\n",
2503 |        "      <td>0.037858</td>\n",
2504 |        "    </tr>\n",
2505 |        "    <tr>\n",
2506 |        "      <th>lt</th>\n",
2507 |        "      <td>0.064142</td>\n",
2508 |        "      <td>0.001779</td>\n",
2509 |        "      <td>0.027741</td>\n",
2510 |        "    </tr>\n",
2511 |        "    <tr>\n",
2512 |        "      <th>gt</th>\n",
2513 |        "      <td>0.064971</td>\n",
2514 |        "      <td>0.001779</td>\n",
2515 |        "      <td>0.027387</td>\n",
2516 |        "    </tr>\n",
2517 |        "  </tbody>\n",
2518 |        "</table>\n",
2519 |        "<p>7456 rows × 3 columns</p>\n",
2520 |        "</div>"
2521 |       ],
2522 |       "text/plain": [
2523 |        "                 ham      spam  spam_ratio\n",
2524 |        "token                                     \n",
2525 |        "claim       0.000276  0.158363  572.798932\n",
2526 |        "prize       0.000276  0.135231  489.131673\n",
2527 |        "150p        0.000276  0.087189  315.361210\n",
2528 |        "tone        0.000276  0.085409  308.925267\n",
2529 |        "guaranteed  0.000276  0.076512  276.745552\n",
2530 |        "18          0.000276  0.069395  251.001779\n",
2531 |        "cs          0.000276  0.065836  238.129893\n",
2532 |        "www         0.000553  0.129893  234.911922\n",
2533 |        "1000        0.000276  0.056940  205.950178\n",
2534 |        "awarded     0.000276  0.053381  193.078292\n",
2535 |        "150ppm      0.000276  0.051601  186.642349\n",
2536 |        "uk          0.000553  0.099644  180.206406\n",
2537 |        "500         0.000276  0.048043  173.770463\n",
2538 |        "ringtone    0.000276  0.044484  160.898577\n",
2539 |        "000         0.000276  0.042705  154.462633\n",
2540 |        "mob         0.000276  0.042705  154.462633\n",
2541 |        "co          0.000553  0.078292  141.590747\n",
2542 |        "collection  0.000276  0.039146  141.590747\n",
2543 |        "valid       0.000276  0.037367  135.154804\n",
2544 |        "2000        0.000276  0.037367  135.154804\n",
2545 |        "800         0.000276  0.037367  135.154804\n",
2546 |        "10p         0.000276  0.037367  135.154804\n",
2547 |        "8007        0.000276  0.035587  128.718861\n",
2548 |        "16          0.000553  0.067616  122.282918\n",
2549 |        "weekly      0.000276  0.033808  122.282918\n",
2550 |        "tones       0.000276  0.032028  115.846975\n",
2551 |        "land        0.000276  0.032028  115.846975\n",
2552 |        "http        0.000276  0.032028  115.846975\n",
2553 |        "national    0.000276  0.030249  109.411032\n",
2554 |        "5000        0.000276  0.030249  109.411032\n",
2555 |        "...              ...       ...         ...\n",
2556 |        "went        0.012718  0.001779    0.139912\n",
2557 |        "ll          0.052530  0.007117    0.135494\n",
2558 |        "told        0.013824  0.001779    0.128719\n",
2559 |        "feel        0.013824  0.001779    0.128719\n",
2560 |        "gud         0.014100  0.001779    0.126195\n",
2561 |        "cos         0.014929  0.001779    0.119184\n",
2562 |        "but         0.090683  0.010676    0.117731\n",
2563 |        "amp         0.015206  0.001779    0.117017\n",
2564 |        "something   0.015206  0.001779    0.117017\n",
2565 |        "sure        0.015206  0.001779    0.117017\n",
2566 |        "ok          0.061100  0.007117    0.116488\n",
2567 |        "said        0.016312  0.001779    0.109084\n",
2568 |        "morning     0.016865  0.001779    0.105507\n",
2569 |        "yeah        0.017694  0.001779    0.100562\n",
2570 |        "lol         0.017694  0.001779    0.100562\n",
2571 |        "anything    0.017971  0.001779    0.099015\n",
2572 |        "my          0.150401  0.014235    0.094646\n",
2573 |        "doing       0.019077  0.001779    0.093275\n",
2574 |        "way         0.019630  0.001779    0.090647\n",
2575 |        "ask         0.019630  0.001779    0.090647\n",
2576 |        "already     0.019630  0.001779    0.090647\n",
2577 |        "too         0.021841  0.001779    0.081468\n",
2578 |        "come        0.048936  0.003559    0.072723\n",
2579 |        "later       0.030688  0.001779    0.057981\n",
2580 |        "lor         0.032900  0.001779    0.054084\n",
2581 |        "da          0.032900  0.001779    0.054084\n",
2582 |        "she         0.035665  0.001779    0.049891\n",
2583 |        "he          0.047000  0.001779    0.037858\n",
2584 |        "lt          0.064142  0.001779    0.027741\n",
2585 |        "gt          0.064971  0.001779    0.027387\n",
2586 |        "\n",
2587 |        "[7456 rows x 3 columns]"
2588 |       ]
2589 |      },
2590 |      "execution_count": 65,
2591 |      "metadata": {},
2592 |      "output_type": "execute_result"
2593 |     }
2594 |    ],
2595 |    "source": [
2596 |     "# examine the DataFrame sorted by spam_ratio\n",
2597 |     "# note: use sort() instead of sort_values() for pandas 0.16.2 and earlier\n",
2598 |     "tokens.sort_values('spam_ratio', ascending=False)"
2599 |    ]
2600 |   },
2601 |   {
2602 |    "cell_type": "code",
2603 |    "execution_count": 66,
2604 |    "metadata": {
2605 |     "collapsed": false
2606 |    },
2607 |    "outputs": [
2608 |     {
2609 |      "data": {
2610 |       "text/plain": [
2611 |        "83.667259786476862"
2612 |       ]
2613 |      },
2614 |      "execution_count": 66,
2615 |      "metadata": {},
2616 |      "output_type": "execute_result"
2617 |     }
2618 |    ],
2619 |    "source": [
2620 |     "# look up the spam_ratio for a given token\n",
2621 |     "tokens.loc['dating', 'spam_ratio']"
2622 |    ]
2623 |   },
2624 |   {
2625 |    "cell_type": "markdown",
2626 |    "metadata": {},
2627 |    "source": [
2628 |     "## Part 8: Practicing this workflow on another dataset\n",
2629 |     "\n",
2630 |     "Please open the **`exercise.ipynb`** notebook (or the **`exercise.py`** script)."
2631 |    ]
2632 |   },
2633 |   {
2634 |    "cell_type": "markdown",
2635 |    "metadata": {},
2636 |    "source": [
2637 |     "## Part 9: Tuning the vectorizer (discussion)\n",
2638 |     "\n",
2639 |     "Thus far, we have been using the default parameters of [CountVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html):"
2640 |    ]
2641 |   },
2642 |   {
2643 |    "cell_type": "code",
2644 |    "execution_count": 67,
2645 |    "metadata": {
2646 |     "collapsed": false
2647 |    },
2648 |    "outputs": [
2649 |     {
2650 |      "data": {
2651 |       "text/plain": [
2652 |        "CountVectorizer(analyzer=u'word', binary=False, decode_error=u'strict',\n",
2653 |        "        dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',\n",
2654 |        "        lowercase=True, max_df=1.0, max_features=None, min_df=1,\n",
2655 |        "        ngram_range=(1, 1), preprocessor=None, stop_words=None,\n",
2656 |        "        strip_accents=None, token_pattern=u'(?u)\\\\b\\\\w\\\\w+\\\\b',\n",
2657 |        "        tokenizer=None, vocabulary=None)"
2658 |       ]
2659 |      },
2660 |      "execution_count": 67,
2661 |      "metadata": {},
2662 |      "output_type": "execute_result"
2663 |     }
2664 |    ],
2665 |    "source": [
2666 |     "# show default parameters for CountVectorizer\n",
2667 |     "vect"
2668 |    ]
2669 |   },
2670 |   {
2671 |    "cell_type": "markdown",
2672 |    "metadata": {},
2673 |    "source": [
2674 |     "However, the vectorizer is worth tuning, just like a model is worth tuning! Here are a few parameters that you might want to tune:\n",
2675 |     "\n",
2676 |     "- **stop_words:** string {'english'}, list, or None (default)\n",
2677 |     "    - If 'english', a built-in stop word list for English is used.\n",
2678 |     "    - If a list, that list is assumed to contain stop words, all of which will be removed from the resulting tokens.\n",
2679 |     "    - If None, no stop words are removed."
2680 |    ]
2681 |   },
2682 |   {
2683 |    "cell_type": "code",
2684 |    "execution_count": 68,
2685 |    "metadata": {
2686 |     "collapsed": true
2687 |    },
2688 |    "outputs": [],
2689 |    "source": [
2690 |     "# remove English stop words\n",
2691 |     "vect = CountVectorizer(stop_words='english')"
2692 |    ]
2693 |   },
2694 |   {
2695 |    "cell_type": "markdown",
2696 |    "metadata": {},
2697 |    "source": [
2698 |     "- **ngram_range:** tuple (min_n, max_n), default=(1, 1)\n",
2699 |     "    - The lower and upper boundary of the range of n-values for different n-grams to be extracted.\n",
2700 |     "    - All values of n such that min_n <= n <= max_n will be used. For example, (1, 2) extracts both 1-grams and 2-grams."
2701 |    ]
2702 |   },
2703 |   {
2704 |    "cell_type": "code",
2705 |    "execution_count": 69,
2706 |    "metadata": {
2707 |     "collapsed": true
2708 |    },
2709 |    "outputs": [],
2710 |    "source": [
2711 |     "# include 1-grams and 2-grams\n",
2712 |     "vect = CountVectorizer(ngram_range=(1, 2))"
2713 |    ]
2714 |   },
2715 |   {
2716 |    "cell_type": "markdown",
2717 |    "metadata": {},
2718 |    "source": [
2719 |     "- **max_df:** float in range [0.0, 1.0] or int, default=1.0\n",
2720 |     "    - When building the vocabulary, ignore terms that have a document frequency strictly higher than the given threshold (corpus-specific stop words).\n",
2721 |     "    - If float, the parameter represents a proportion of documents.\n",
2722 |     "    - If integer, the parameter represents an absolute count."
2723 |    ]
2724 |   },
2725 |   {
2726 |    "cell_type": "code",
2727 |    "execution_count": 70,
2728 |    "metadata": {
2729 |     "collapsed": true
2730 |    },
2731 |    "outputs": [],
2732 |    "source": [
2733 |     "# ignore terms that appear in more than 50% of the documents\n",
2734 |     "vect = CountVectorizer(max_df=0.5)"
2735 |    ]
2736 |   },
2737 |   {
2738 |    "cell_type": "markdown",
2739 |    "metadata": {},
2740 |    "source": [
2741 |     "- **min_df:** float in range [0.0, 1.0] or int, default=1\n",
2742 |     "    - When building the vocabulary, ignore terms that have a document frequency strictly lower than the given threshold. (This value is also called \"cut-off\" in the literature.)\n",
2743 |     "    - If float, the parameter represents a proportion of documents.\n",
2744 |     "    - If integer, the parameter represents an absolute count."
2745 |    ]
2746 |   },
2747 |   {
2748 |    "cell_type": "code",
2749 |    "execution_count": 71,
2750 |    "metadata": {
2751 |     "collapsed": true
2752 |    },
2753 |    "outputs": [],
2754 |    "source": [
2755 |     "# only keep terms that appear in at least 2 documents\n",
2756 |     "vect = CountVectorizer(min_df=2)"
2757 |    ]
2758 |   },
2759 |   {
2760 |    "cell_type": "markdown",
2761 |    "metadata": {},
2762 |    "source": [
2763 |     "**Guidelines for tuning CountVectorizer:**\n",
2764 |     "\n",
2765 |     "- Use your knowledge of the **problem** and the **text**, and your understanding of the **tuning parameters**, to help you decide what parameters to tune and how to tune them.\n",
2766 |     "- **Experiment**: compare model performance on the testing set for each vectorizer setting, and let the data tell you the best approach!"
2767 |    ]
2768 |   }
2769 |  ],
2770 |  "metadata": {
2771 |   "kernelspec": {
2772 |    "display_name": "Python 2",
2773 |    "language": "python",
2774 |    "name": "python2"
2775 |   },
2776 |   "language_info": {
2777 |    "codemirror_mode": {
2778 |     "name": "ipython",
2779 |     "version": 2
2780 |    },
2781 |    "file_extension": ".py",
2782 |    "mimetype": "text/x-python",
2783 |    "name": "python",
2784 |    "nbconvert_exporter": "python",
2785 |    "pygments_lexer": "ipython2",
2786 |    "version": "2.7.11"
2787 |   }
2788 |  },
2789 |  "nbformat": 4,
2790 |  "nbformat_minor": 0
2791 | }
2792 | 


--------------------------------------------------------------------------------