├── .gitignore
├── exercise.py
├── README.md
├── exercise.ipynb
├── exercise_solution.py
├── tutorial.py
├── tutorial.ipynb
├── exercise_solution.ipynb
└── tutorial_with_output.ipynb
/.gitignore:
--------------------------------------------------------------------------------
1 | .ipynb_checkpoints/
2 | .DS_Store
3 | *.pyc
4 | extras/
5 | *.tpl
6 |
--------------------------------------------------------------------------------
/exercise.py:
--------------------------------------------------------------------------------
1 | # # Tutorial Exercise: Yelp reviews
2 |
3 | # ## Introduction
4 | #
5 | # This exercise uses a small subset of the data from Kaggle's [Yelp Business Rating Prediction](https://www.kaggle.com/c/yelp-recsys-2013) competition.
6 | #
7 | # **Description of the data:**
8 | #
9 | # - **`yelp.csv`** contains the dataset. It is stored in the repository (in the **`data`** directory), so there is no need to download anything from the Kaggle website.
10 | # - Each observation (row) in this dataset is a review of a particular business by a particular user.
11 | # - The **stars** column is the number of stars (1 through 5) assigned by the reviewer to the business. (A higher number of stars is better.) In other words, it is the rating of the business by the person who wrote the review.
12 | # - The **text** column is the text of the review.
13 | #
14 | # **Goal:** Predict the star rating of a review using **only** the review text.
15 | #
16 | # **Tip:** After each task, I recommend that you check the shape and the contents of your objects, to confirm that they match your expectations.
17 |
18 | # ## Task 1
19 | #
20 | # Read **`yelp.csv`** into a pandas DataFrame and examine it.
21 |
22 | # ## Task 2
23 | #
24 | # Create a new DataFrame that only contains the **5-star** and **1-star** reviews.
25 | #
26 | # - **Hint:** [How do I apply multiple filter criteria to a pandas DataFrame?](http://nbviewer.jupyter.org/github/justmarkham/pandas-videos/blob/master/pandas.ipynb#9.-How-do-I-apply-multiple-filter-criteria-to-a-pandas-DataFrame%3F-%28video%29) explains how to do this.
27 |
28 | # ## Task 3
29 | #
30 | # Define X and y from the new DataFrame, and then split X and y into training and testing sets, using the **review text** as the only feature and the **star rating** as the response.
31 | #
32 | # - **Hint:** Keep in mind that X should be a pandas Series (not a DataFrame), since we will pass it to CountVectorizer in the task that follows.
33 |
34 | # ## Task 4
35 | #
36 | # Use CountVectorizer to create **document-term matrices** from X_train and X_test.
37 |
38 | # ## Task 5
39 | #
40 | # Use multinomial Naive Bayes to **predict the star rating** for the reviews in the testing set, and then **calculate the accuracy** and **print the confusion matrix**.
41 | #
42 | # - **Hint:** [Evaluating a classification model](https://github.com/justmarkham/scikit-learn-videos/blob/master/09_classification_metrics.ipynb) explains how to interpret both classification accuracy and the confusion matrix.
43 |
44 | # ## Task 6 (Challenge)
45 | #
46 | # Calculate the **null accuracy**, which is the classification accuracy that could be achieved by always predicting the most frequent class.
47 | #
48 | # - **Hint:** [Evaluating a classification model](https://github.com/justmarkham/scikit-learn-videos/blob/master/09_classification_metrics.ipynb) explains null accuracy and demonstrates two ways to calculate it, though only one of those ways will work in this case. Alternatively, you can come up with your own method to calculate null accuracy!
49 |
50 | # ## Task 7 (Challenge)
51 | #
52 | # Browse through the review text of some of the **false positives** and **false negatives**. Based on your knowledge of how Naive Bayes works, do you have any ideas about why the model is incorrectly classifying these reviews?
53 | #
54 | # - **Hint:** [Evaluating a classification model](https://github.com/justmarkham/scikit-learn-videos/blob/master/09_classification_metrics.ipynb) explains the definitions of "false positives" and "false negatives".
55 | # - **Hint:** Think about what a false positive means in this context, and what a false negative means in this context. What has scikit-learn defined as the "positive class"?
56 |
57 | # ## Task 8 (Challenge)
58 | #
59 | # Calculate which 10 tokens are the most predictive of **5-star reviews**, and which 10 tokens are the most predictive of **1-star reviews**.
60 | #
61 | # - **Hint:** Naive Bayes automatically counts the number of times each token appears in each class, as well as the number of observations in each class. You can access these counts via the `feature_count_` and `class_count_` attributes of the Naive Bayes model object.
62 |
63 | # ## Task 9 (Challenge)
64 | #
65 | # Up to this point, we have framed this as a **binary classification problem** by only considering the 5-star and 1-star reviews. Now, let's repeat the model building process using all reviews, which makes this a **5-class classification problem**.
66 | #
67 | # Here are the steps:
68 | #
69 | # - Define X and y using the original DataFrame. (y should contain 5 different classes.)
70 | # - Split X and y into training and testing sets.
71 | # - Create document-term matrices using CountVectorizer.
72 | # - Calculate the testing accuracy of a Multinomial Naive Bayes model.
73 | # - Compare the testing accuracy with the null accuracy, and comment on the results.
74 | # - Print the confusion matrix, and comment on the results. (This [Stack Overflow answer](http://stackoverflow.com/a/30748053/1636598) explains how to read a multi-class confusion matrix.)
75 | # - Print the [classification report](http://scikit-learn.org/stable/modules/model_evaluation.html#classification-report), and comment on the results. If you are unfamiliar with the terminology it uses, research the terms, and then try to figure out how to calculate these metrics manually from the confusion matrix!
76 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | ## Tutorial: Machine Learning with Text using Python
2 |
3 | Round of applause to [Kevin Markham](http://www.dataschool.io/) and his video tutorials!
4 |
5 | Instructor: [Vaibhav Srivastav](https://www.linkedin.com/in/vaibhavs10/)
6 |
7 | ### Description
8 |
9 | Although numeric data is easy to work with in Python, most knowledge created by humans is actually raw, unstructured text. By learning how to transform text into data that is usable by machine learning models, you drastically increase the amount of data that your models can learn from. In this tutorial, we'll build and evaluate predictive models from real-world text using scikit-learn.
10 |
11 | ### Objectives
12 |
13 | By the end of this tutorial, attendees will be able to confidently build a predictive model from their own text-based data, including feature extraction, model building and model evaluation.
14 |
15 | ### Required Software
16 |
17 | Attendees will need to bring a laptop with [scikit-learn](http://scikit-learn.org/stable/install.html) and [pandas](http://pandas.pydata.org/pandas-docs/stable/install.html) (and their dependencies) already installed. Installing the [Anaconda distribution of Python](https://www.continuum.io/downloads) is an easy way to accomplish this. Both Python 2 and 3 are welcome.
18 |
19 | I will be leading the tutorial using the IPython/Jupyter notebook, and have added a pre-written notebook to this repository. I have also created a Python script that is identical to the notebook, which you can use in the Python environment of your choice.
20 |
21 | ### Tutorial Files
22 |
23 | * IPython/Jupyter notebooks: [tutorial.ipynb](tutorial.ipynb), [tutorial_with_output.ipynb](tutorial_with_output.ipynb), [exercise.ipynb](exercise.ipynb), [exercise_solution.ipynb](exercise_solution.ipynb)
24 | * Python scripts: [tutorial.py](tutorial.py), [exercise.py](exercise.py), [exercise_solution.py](exercise_solution.py)
25 | * Datasets: [data/sms.tsv](data/sms.tsv), [data/yelp.csv](data/yelp.csv)
26 |
27 | ### Prerequisite Knowledge
28 |
29 | Attendees of this tutorial should be comfortable working in Python, should understand the basic principles of machine learning, and should have at least basic experience with both pandas and scikit-learn. However, no knowledge of advanced mathematics is required.
30 |
31 | ### Abstract
32 |
33 | It can be difficult to figure out how to work with text in scikit-learn, even if you're already comfortable with the scikit-learn API. Many questions immediately come up: Which vectorizer should I use, and why? What's the difference between a "fit" and a "transform"? What's a document-term matrix, and why is it so sparse? Is it okay for my training data to have more features than observations? What's the appropriate machine learning model to use? And so on...
34 |
35 | In this tutorial, we'll answer all of those questions, and more! We'll start by walking through the vectorization process in order to understand the input and output formats. Then we'll read a simple dataset into pandas, and immediately apply what we've learned about vectorization. We'll move on to the model building process, including a discussion of which model is most appropriate for the task. We'll evaluate our model a few different ways, and then examine the model for greater insight into how the text is influencing its predictions. Finally, we'll practice this entire workflow on a new dataset, and end with a discussion of which parts of the process are worth tuning for improved performance.
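
As a minimal sketch of the vectorization step (the toy sentences and variable names below are illustrative assumptions, not taken from the tutorial notebook), `fit` learns the vocabulary from the training text, while `transform` uses that fixed vocabulary to build a document-term matrix:

```python
from sklearn.feature_extraction.text import CountVectorizer

# hypothetical toy documents, used only for illustration
train_docs = ["call you tonight", "call me a cab", "please call me please"]
test_docs = ["please don't call me"]

vect = CountVectorizer()

# fit: learn the vocabulary (one column per unique token) from the training text
vect.fit(train_docs)
vocabulary = sorted(vect.vocabulary_)

# transform: count how often each learned token appears in each document
train_dtm = vect.transform(train_docs)
test_dtm = vect.transform(test_docs)

print(vocabulary)
print(train_dtm.toarray())
print(test_dtm.toarray())
```

Tokens that were not seen during fitting (such as "don" in the test sentence) are dropped at transform time. The result is stored as a sparse matrix because each document usually contains only a small fraction of the vocabulary, so most entries are zero.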
36 |
37 | ### Detailed Outline
38 |
39 | 1. Model building in scikit-learn (refresher)
40 | 2. Representing text as numerical data
41 | 3. Reading a text-based dataset into pandas
42 | 4. Vectorizing our dataset
43 | 5. Building and evaluating a model
44 | 6. Comparing models
45 | 7. Examining a model for further insight
46 | 8. Practicing this workflow on another dataset
47 | 9. Tuning the vectorizer (discussion)
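
Steps 3 through 5 of the outline can be sketched roughly as follows. This is only an illustration: the tiny DataFrame and its `text` and `label` columns are hypothetical stand-ins for the real datasets, and the accuracy on so few rows is not meaningful.

```python
import pandas as pd
from sklearn import metrics
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# hypothetical toy dataset (the tutorial itself uses data/sms.tsv and data/yelp.csv)
df = pd.DataFrame({
    'text': ['great food and friendly staff', 'terrible service and cold food',
             'loved the friendly staff', 'cold food and rude staff',
             'great service loved it', 'terrible food rude service',
             'friendly and great', 'rude and terrible'],
    'label': [5, 1, 5, 1, 5, 1, 5, 1],
})

# X is a Series of raw text, y is the response
X = df.text
y = df.label
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# learn the vocabulary from the training text only, then reuse it for the testing text
vect = CountVectorizer()
X_train_dtm = vect.fit_transform(X_train)
X_test_dtm = vect.transform(X_test)

# train a multinomial Naive Bayes model and predict the testing set
nb = MultinomialNB()
nb.fit(X_train_dtm, y_train)
y_pred_class = nb.predict(X_test_dtm)

# evaluate the predictions
accuracy = metrics.accuracy_score(y_test, y_pred_class)
print(accuracy)
print(metrics.confusion_matrix(y_test, y_pred_class))
```

Fitting the vectorizer on the training set alone keeps the testing set unseen until evaluation, which is why `fit_transform` is called once and `transform` is used for the testing text.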
48 |
49 | ### About the Instructor
50 |
51 | Vaibhav Srivastav is a Data Scientist currently working with Deloitte Consulting LLP. He has more than three years of demonstrated experience building large-scale Machine Learning and Natural Language Processing solutions for Fortune Technology 10 clients.
52 |
53 | In his free time, he teaches Machine Learning/Data Science to young coders! If Python is what floats your boat, then hit him up on any of the channels below:
54 |
55 | * Email: [Vaibhavs10@gmail.com](mailto:vaibhavs10@gmail.com)
56 | * Twitter: [@Vaibhavsriv10](https://twitter.com/vaibhavsriv10)
57 |
58 | ### Recommended Resources
59 |
60 | **Text classification:**
61 | * Read Paul Graham's classic post, [A Plan for Spam](http://www.paulgraham.com/spam.html), for an overview of a basic text classification system using a Bayesian approach. (He also wrote a [follow-up post](http://www.paulgraham.com/better.html) about how he improved his spam filter.)
62 | * Coursera's Natural Language Processing (NLP) course has [video lectures](https://class.coursera.org/nlp/lecture) on text classification, tokenization, Naive Bayes, and many other fundamental NLP topics. (Here are the [slides](http://web.stanford.edu/~jurafsky/NLPCourseraSlides.html) used in all of the videos.)
63 | * [Automatically Categorizing Yelp Businesses](http://engineeringblog.yelp.com/2015/09/automatically-categorizing-yelp-businesses.html) discusses how Yelp uses NLP and scikit-learn to solve the problem of uncategorized businesses.
64 | * [How to Read the Mind of a Supreme Court Justice](http://fivethirtyeight.com/features/how-to-read-the-mind-of-a-supreme-court-justice/) discusses CourtCast, a machine learning model that predicts the outcome of Supreme Court cases using text-based features only. (The CourtCast creator wrote a post explaining [how it works](https://sciencecowboy.wordpress.com/2015/03/05/predicting-the-supreme-court-from-oral-arguments/), and the [Python code](https://github.com/nasrallah/CourtCast) is available on GitHub.)
65 | * [Identifying Humorous Cartoon Captions](http://www.cs.huji.ac.il/~dshahaf/pHumor.pdf) is a readable paper about identifying funny captions submitted to the New Yorker Caption Contest.
66 | * In this [PyData video](https://www.youtube.com/watch?v=y3ZTKFZ-1QQ) (50 minutes), Facebook explains how they use scikit-learn for sentiment classification by training a Naive Bayes model on emoji-labeled data.
67 |
68 | **Naive Bayes and logistic regression:**
69 | * Read this brief Quora post on [airport security](http://www.quora.com/In-laymans-terms-how-does-Naive-Bayes-work/answer/Konstantin-Tt) for an intuitive explanation of how Naive Bayes classification works.
70 | * For a longer introduction to Naive Bayes, read Sebastian Raschka's article on [Naive Bayes and Text Classification](http://sebastianraschka.com/Articles/2014_naive_bayes_1.html). As well, Wikipedia has two excellent articles ([Naive Bayes classifier](http://en.wikipedia.org/wiki/Naive_Bayes_classifier) and [Naive Bayes spam filtering](http://en.wikipedia.org/wiki/Naive_Bayes_spam_filtering)), and Cross Validated has a good [Q&A](http://stats.stackexchange.com/questions/21822/understanding-naive-bayes).
71 | * Kevin Markham's [guide to an in-depth understanding of logistic regression](http://www.dataschool.io/guide-to-logistic-regression/) includes a lesson notebook and a curated list of resources for going deeper into this topic.
72 | * [Comparison of Machine Learning Models](https://github.com/justmarkham/DAT8/blob/master/other/model_comparison.md) lists the advantages and disadvantages of Naive Bayes, logistic regression, and other classification and regression models.
--------------------------------------------------------------------------------
/exercise.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# Tutorial Exercise: Yelp reviews"
8 | ]
9 | },
10 | {
11 | "cell_type": "markdown",
12 | "metadata": {},
13 | "source": [
14 | "## Introduction\n",
15 | "\n",
16 | "This exercise uses a small subset of the data from Kaggle's [Yelp Business Rating Prediction](https://www.kaggle.com/c/yelp-recsys-2013) competition.\n",
17 | "\n",
18 | "**Description of the data:**\n",
19 | "\n",
20 | "- **`yelp.csv`** contains the dataset. It is stored in the repository (in the **`data`** directory), so there is no need to download anything from the Kaggle website.\n",
21 | "- Each observation (row) in this dataset is a review of a particular business by a particular user.\n",
22 | "- The **stars** column is the number of stars (1 through 5) assigned by the reviewer to the business. (A higher number of stars is better.) In other words, it is the rating of the business by the person who wrote the review.\n",
23 | "- The **text** column is the text of the review.\n",
24 | "\n",
25 | "**Goal:** Predict the star rating of a review using **only** the review text.\n",
26 | "\n",
27 | "**Tip:** After each task, I recommend that you check the shape and the contents of your objects, to confirm that they match your expectations."
28 | ]
29 | },
30 | {
31 | "cell_type": "markdown",
32 | "metadata": {},
33 | "source": [
34 | "## Task 1\n",
35 | "\n",
36 | "Read **`yelp.csv`** into a pandas DataFrame and examine it."
37 | ]
38 | },
39 | {
40 | "cell_type": "markdown",
41 | "metadata": {},
42 | "source": [
43 | "## Task 2\n",
44 | "\n",
45 | "Create a new DataFrame that only contains the **5-star** and **1-star** reviews.\n",
46 | "\n",
47 | "- **Hint:** [How do I apply multiple filter criteria to a pandas DataFrame?](http://nbviewer.jupyter.org/github/justmarkham/pandas-videos/blob/master/pandas.ipynb#9.-How-do-I-apply-multiple-filter-criteria-to-a-pandas-DataFrame%3F-%28video%29) explains how to do this."
48 | ]
49 | },
50 | {
51 | "cell_type": "markdown",
52 | "metadata": {},
53 | "source": [
54 | "## Task 3\n",
55 | "\n",
56 | "Define X and y from the new DataFrame, and then split X and y into training and testing sets, using the **review text** as the only feature and the **star rating** as the response.\n",
57 | "\n",
58 | "- **Hint:** Keep in mind that X should be a pandas Series (not a DataFrame), since we will pass it to CountVectorizer in the task that follows."
59 | ]
60 | },
61 | {
62 | "cell_type": "markdown",
63 | "metadata": {},
64 | "source": [
65 | "## Task 4\n",
66 | "\n",
67 | "Use CountVectorizer to create **document-term matrices** from X_train and X_test."
68 | ]
69 | },
70 | {
71 | "cell_type": "markdown",
72 | "metadata": {},
73 | "source": [
74 | "## Task 5\n",
75 | "\n",
76 | "Use multinomial Naive Bayes to **predict the star rating** for the reviews in the testing set, and then **calculate the accuracy** and **print the confusion matrix**.\n",
77 | "\n",
78 | "- **Hint:** [Evaluating a classification model](https://github.com/justmarkham/scikit-learn-videos/blob/master/09_classification_metrics.ipynb) explains how to interpret both classification accuracy and the confusion matrix."
79 | ]
80 | },
81 | {
82 | "cell_type": "markdown",
83 | "metadata": {},
84 | "source": [
85 | "## Task 6 (Challenge)\n",
86 | "\n",
87 | "Calculate the **null accuracy**, which is the classification accuracy that could be achieved by always predicting the most frequent class.\n",
88 | "\n",
89 | "- **Hint:** [Evaluating a classification model](https://github.com/justmarkham/scikit-learn-videos/blob/master/09_classification_metrics.ipynb) explains null accuracy and demonstrates two ways to calculate it, though only one of those ways will work in this case. Alternatively, you can come up with your own method to calculate null accuracy!"
90 | ]
91 | },
92 | {
93 | "cell_type": "markdown",
94 | "metadata": {},
95 | "source": [
96 | "## Task 7 (Challenge)\n",
97 | "\n",
98 | "Browse through the review text of some of the **false positives** and **false negatives**. Based on your knowledge of how Naive Bayes works, do you have any ideas about why the model is incorrectly classifying these reviews?\n",
99 | "\n",
100 | "- **Hint:** [Evaluating a classification model](https://github.com/justmarkham/scikit-learn-videos/blob/master/09_classification_metrics.ipynb) explains the definitions of \"false positives\" and \"false negatives\".\n",
101 | "- **Hint:** Think about what a false positive means in this context, and what a false negative means in this context. What has scikit-learn defined as the \"positive class\"?"
102 | ]
103 | },
104 | {
105 | "cell_type": "markdown",
106 | "metadata": {},
107 | "source": [
108 | "## Task 8 (Challenge)\n",
109 | "\n",
110 | "Calculate which 10 tokens are the most predictive of **5-star reviews**, and which 10 tokens are the most predictive of **1-star reviews**.\n",
111 | "\n",
112 | "- **Hint:** Naive Bayes automatically counts the number of times each token appears in each class, as well as the number of observations in each class. You can access these counts via the `feature_count_` and `class_count_` attributes of the Naive Bayes model object."
113 | ]
114 | },
115 | {
116 | "cell_type": "markdown",
117 | "metadata": {},
118 | "source": [
119 | "## Task 9 (Challenge)\n",
120 | "\n",
121 | "Up to this point, we have framed this as a **binary classification problem** by only considering the 5-star and 1-star reviews. Now, let's repeat the model building process using all reviews, which makes this a **5-class classification problem**.\n",
122 | "\n",
123 | "Here are the steps:\n",
124 | "\n",
125 | "- Define X and y using the original DataFrame. (y should contain 5 different classes.)\n",
126 | "- Split X and y into training and testing sets.\n",
127 | "- Create document-term matrices using CountVectorizer.\n",
128 | "- Calculate the testing accuracy of a Multinomial Naive Bayes model.\n",
129 | "- Compare the testing accuracy with the null accuracy, and comment on the results.\n",
130 | "- Print the confusion matrix, and comment on the results. (This [Stack Overflow answer](http://stackoverflow.com/a/30748053/1636598) explains how to read a multi-class confusion matrix.)\n",
131 | "- Print the [classification report](http://scikit-learn.org/stable/modules/model_evaluation.html#classification-report), and comment on the results. If you are unfamiliar with the terminology it uses, research the terms, and then try to figure out how to calculate these metrics manually from the confusion matrix!"
132 | ]
133 | }
134 | ],
135 | "metadata": {
136 | "kernelspec": {
137 | "display_name": "Python 2",
138 | "language": "python",
139 | "name": "python2"
140 | },
141 | "language_info": {
142 | "codemirror_mode": {
143 | "name": "ipython",
144 | "version": 2
145 | },
146 | "file_extension": ".py",
147 | "mimetype": "text/x-python",
148 | "name": "python",
149 | "nbconvert_exporter": "python",
150 | "pygments_lexer": "ipython2",
151 | "version": "2.7.11"
152 | }
153 | },
154 | "nbformat": 4,
155 | "nbformat_minor": 0
156 | }
157 |
--------------------------------------------------------------------------------
/exercise_solution.py:
--------------------------------------------------------------------------------
1 | # # Tutorial Exercise: Yelp reviews (Solution)
2 |
3 | # ## Introduction
4 | #
5 | # This exercise uses a small subset of the data from Kaggle's [Yelp Business Rating Prediction](https://www.kaggle.com/c/yelp-recsys-2013) competition.
6 | #
7 | # **Description of the data:**
8 | #
9 | # - **`yelp.csv`** contains the dataset. It is stored in the repository (in the **`data`** directory), so there is no need to download anything from the Kaggle website.
10 | # - Each observation (row) in this dataset is a review of a particular business by a particular user.
11 | # - The **stars** column is the number of stars (1 through 5) assigned by the reviewer to the business. (A higher number of stars is better.) In other words, it is the rating of the business by the person who wrote the review.
12 | # - The **text** column is the text of the review.
13 | #
14 | # **Goal:** Predict the star rating of a review using **only** the review text.
15 | #
16 | # **Tip:** After each task, I recommend that you check the shape and the contents of your objects, to confirm that they match your expectations.
17 |
18 | # for Python 2: use print only as a function
19 | from __future__ import print_function
20 |
21 |
22 | # ## Task 1
23 | #
24 | # Read **`yelp.csv`** into a pandas DataFrame and examine it.
25 |
26 | # read yelp.csv using a relative path
27 | import pandas as pd
28 | path = 'data/yelp.csv'
29 | yelp = pd.read_csv(path)
30 |
31 |
32 | # examine the shape
33 | yelp.shape
34 |
35 |
36 | # examine the first row
37 | yelp.head(1)
38 |
39 |
40 | # examine the class distribution
41 | yelp.stars.value_counts().sort_index()
42 |
43 |
44 | # ## Task 2
45 | #
46 | # Create a new DataFrame that only contains the **5-star** and **1-star** reviews.
47 | #
48 | # - **Hint:** [How do I apply multiple filter criteria to a pandas DataFrame?](http://nbviewer.jupyter.org/github/justmarkham/pandas-videos/blob/master/pandas.ipynb#9.-How-do-I-apply-multiple-filter-criteria-to-a-pandas-DataFrame%3F-%28video%29) explains how to do this.
49 |
50 | # filter the DataFrame using an OR condition
51 | yelp_best_worst = yelp[(yelp.stars==5) | (yelp.stars==1)]
52 |
53 | # equivalently, use the 'loc' accessor
54 | yelp_best_worst = yelp.loc[(yelp.stars==5) | (yelp.stars==1), :]
55 |
56 |
57 | # examine the shape
58 | yelp_best_worst.shape
59 |
60 |
61 | # ## Task 3
62 | #
63 | # Define X and y from the new DataFrame, and then split X and y into training and testing sets, using the **review text** as the only feature and the **star rating** as the response.
64 | #
65 | # - **Hint:** Keep in mind that X should be a pandas Series (not a DataFrame), since we will pass it to CountVectorizer in the task that follows.
66 |
67 | # define X and y
68 | X = yelp_best_worst.text
69 | y = yelp_best_worst.stars
70 |
71 |
72 | # split X and y into training and testing sets
73 | from sklearn.model_selection import train_test_split
74 | X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
75 |
76 |
77 | # examine the object shapes
78 | print(X_train.shape)
79 | print(X_test.shape)
80 | print(y_train.shape)
81 | print(y_test.shape)
82 |
83 |
84 | # ## Task 4
85 | #
86 | # Use CountVectorizer to create **document-term matrices** from X_train and X_test.
87 |
88 | # import and instantiate CountVectorizer
89 | from sklearn.feature_extraction.text import CountVectorizer
90 | vect = CountVectorizer()
91 |
92 |
93 | # fit and transform X_train into X_train_dtm
94 | X_train_dtm = vect.fit_transform(X_train)
95 | X_train_dtm.shape
96 |
97 |
98 | # transform X_test into X_test_dtm
99 | X_test_dtm = vect.transform(X_test)
100 | X_test_dtm.shape
101 |
102 |
103 | # ## Task 5
104 | #
105 | # Use multinomial Naive Bayes to **predict the star rating** for the reviews in the testing set, and then **calculate the accuracy** and **print the confusion matrix**.
106 | #
107 | # - **Hint:** [Evaluating a classification model](https://github.com/justmarkham/scikit-learn-videos/blob/master/09_classification_metrics.ipynb) explains how to interpret both classification accuracy and the confusion matrix.
108 |
109 | # import and instantiate MultinomialNB
110 | from sklearn.naive_bayes import MultinomialNB
111 | nb = MultinomialNB()
112 |
113 |
114 | # train the model using X_train_dtm
115 | nb.fit(X_train_dtm, y_train)
116 |
117 |
118 | # make class predictions for X_test_dtm
119 | y_pred_class = nb.predict(X_test_dtm)
120 |
121 |
122 | # calculate accuracy of class predictions
123 | from sklearn import metrics
124 | metrics.accuracy_score(y_test, y_pred_class)
125 |
126 |
127 | # print the confusion matrix
128 | metrics.confusion_matrix(y_test, y_pred_class)
129 |
130 |
131 | # ## Task 6 (Challenge)
132 | #
133 | # Calculate the **null accuracy**, which is the classification accuracy that could be achieved by always predicting the most frequent class.
134 | #
135 | # - **Hint:** [Evaluating a classification model](https://github.com/justmarkham/scikit-learn-videos/blob/master/09_classification_metrics.ipynb) explains null accuracy and demonstrates two ways to calculate it, though only one of those ways will work in this case. Alternatively, you can come up with your own method to calculate null accuracy!
136 |
137 | # examine the class distribution of the testing set
138 | y_test.value_counts()
139 |
140 |
141 | # calculate null accuracy
142 | y_test.value_counts().head(1) / y_test.shape[0]
143 |
144 |
145 | # calculate null accuracy manually
146 | 838 / float(838 + 184)
147 |
148 |
149 | # ## Task 7 (Challenge)
150 | #
151 | # Browse through the review text of some of the **false positives** and **false negatives**. Based on your knowledge of how Naive Bayes works, do you have any ideas about why the model is incorrectly classifying these reviews?
152 | #
153 | # - **Hint:** [Evaluating a classification model](https://github.com/justmarkham/scikit-learn-videos/blob/master/09_classification_metrics.ipynb) explains the definitions of "false positives" and "false negatives".
154 | # - **Hint:** Think about what a false positive means in this context, and what a false negative means in this context. What has scikit-learn defined as the "positive class"?
155 |
156 | # first 10 false positives (1-star reviews incorrectly classified as 5-star reviews)
157 | X_test[y_test < y_pred_class].head(10)
158 |
159 |
160 | # false positive: model is reacting to the words "good", "impressive", "nice"
161 | X_test[1781]
162 |
163 |
164 | # false positive: model does not have enough data to work with
165 | X_test[1919]
166 |
167 |
168 | # first 10 false negatives (5-star reviews incorrectly classified as 1-star reviews)
169 | X_test[y_test > y_pred_class].head(10)
170 |
171 |
172 | # false negative: model is reacting to the words "complain", "crowds", "rushing", "pricey", "scum"
173 | X_test[4963]
174 |
175 |
176 | # ## Task 8 (Challenge)
177 | #
178 | # Calculate which 10 tokens are the most predictive of **5-star reviews**, and which 10 tokens are the most predictive of **1-star reviews**.
179 | #
180 | # - **Hint:** Naive Bayes automatically counts the number of times each token appears in each class, as well as the number of observations in each class. You can access these counts via the `feature_count_` and `class_count_` attributes of the Naive Bayes model object.
181 |
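182 | # store the vocabulary of X_train (note: use get_feature_names_out() for scikit-learn 1.2 and later, where get_feature_names() was removed)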
183 | X_train_tokens = vect.get_feature_names()
184 | len(X_train_tokens)
185 |
186 |
187 | # first row is one-star reviews, second row is five-star reviews
188 | nb.feature_count_.shape
189 |
190 |
191 | # store the number of times each token appears across each class
192 | one_star_token_count = nb.feature_count_[0, :]
193 | five_star_token_count = nb.feature_count_[1, :]
194 |
195 |
196 | # create a DataFrame of tokens with their separate one-star and five-star counts
197 | tokens = pd.DataFrame({'token':X_train_tokens, 'one_star':one_star_token_count, 'five_star':five_star_token_count}).set_index('token')
198 |
199 |
200 | # add 1 to one-star and five-star counts to avoid dividing by 0
201 | tokens['one_star'] = tokens.one_star + 1
202 | tokens['five_star'] = tokens.five_star + 1
203 |
204 |
205 | # first number is one-star reviews, second number is five-star reviews
206 | nb.class_count_
207 |
208 |
209 | # convert the one-star and five-star counts into frequencies
210 | tokens['one_star'] = tokens.one_star / nb.class_count_[0]
211 | tokens['five_star'] = tokens.five_star / nb.class_count_[1]
212 |
213 |
214 | # calculate the ratio of five-star to one-star for each token
215 | tokens['five_star_ratio'] = tokens.five_star / tokens.one_star
216 |
217 |
218 | # sort the DataFrame by five_star_ratio (descending order), and examine the first 10 rows
219 | # note: use sort() instead of sort_values() for pandas 0.16.2 and earlier
220 | tokens.sort_values('five_star_ratio', ascending=False).head(10)
221 |
222 |
223 | # sort the DataFrame by five_star_ratio (ascending order), and examine the first 10 rows
224 | tokens.sort_values('five_star_ratio', ascending=True).head(10)
225 |
226 |
227 | # ## Task 9 (Challenge)
228 | #
229 | # Up to this point, we have framed this as a **binary classification problem** by only considering the 5-star and 1-star reviews. Now, let's repeat the model building process using all reviews, which makes this a **5-class classification problem**.
230 | #
231 | # Here are the steps:
232 | #
233 | # - Define X and y using the original DataFrame. (y should contain 5 different classes.)
234 | # - Split X and y into training and testing sets.
235 | # - Create document-term matrices using CountVectorizer.
236 | # - Calculate the testing accuracy of a Multinomial Naive Bayes model.
237 | # - Compare the testing accuracy with the null accuracy, and comment on the results.
238 | # - Print the confusion matrix, and comment on the results. (This [Stack Overflow answer](http://stackoverflow.com/a/30748053/1636598) explains how to read a multi-class confusion matrix.)
239 | # - Print the [classification report](http://scikit-learn.org/stable/modules/model_evaluation.html#classification-report), and comment on the results. If you are unfamiliar with the terminology it uses, research the terms, and then try to figure out how to calculate these metrics manually from the confusion matrix!
240 |
241 | # define X and y using the original DataFrame
242 | X = yelp.text
243 | y = yelp.stars
244 |
245 |
246 | # check that y contains 5 different classes
247 | y.value_counts().sort_index()
248 |
249 |
250 | # split X and y into training and testing sets
251 | X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
252 |
253 |
254 | # create document-term matrices using CountVectorizer
255 | X_train_dtm = vect.fit_transform(X_train)
256 | X_test_dtm = vect.transform(X_test)
257 |
258 |
259 | # fit a Multinomial Naive Bayes model
260 | nb.fit(X_train_dtm, y_train)
261 |
262 |
263 | # make class predictions
264 | y_pred_class = nb.predict(X_test_dtm)
265 |
266 |
267 | # calculate the accuracy
268 | metrics.accuracy_score(y_test, y_pred_class)
269 |
270 |
271 | # calculate the null accuracy
272 | y_test.value_counts().head(1) / len(y_test)
273 |
274 |
275 | # **Accuracy comments:** At first glance, 47% accuracy does not seem very good, given that it is not much higher than the null accuracy. However, I would consider the 47% accuracy to be quite impressive, given that humans would also have a hard time precisely identifying the star rating for many of these reviews.
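# **Null accuracy sketch:** Null accuracy is the accuracy you would get by always predicting the most frequent class. The snippet below is a minimal illustration in plain Python; the ratings in `y_test_example` are made up for the example and are not taken from the Yelp data.

```python
from collections import Counter

# hypothetical star ratings for a small testing set (not the Yelp data)
y_test_example = [4, 4, 5, 4, 1, 3, 5, 4, 2, 4]

# find the most frequent class and how often it occurs
most_frequent_class, count = Counter(y_test_example).most_common(1)[0]

# null accuracy: proportion of observations that belong to the most frequent class
null_accuracy = count / float(len(y_test_example))
print(most_frequent_class, null_accuracy)
```

# A model is only useful if its testing accuracy is higher than this baseline.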
276 |
277 | # print the confusion matrix
278 | metrics.confusion_matrix(y_test, y_pred_class)
279 |
280 |
281 | # **Confusion matrix comments:**
282 | #
283 | # - Nearly all 4-star and 5-star reviews are classified as 4 or 5 stars, but they are hard for the model to distinguish between.
284 | # - 1-star, 2-star, and 3-star reviews are most commonly classified as 4 stars, probably because it's the predominant class in the training data.
285 |
286 | # print the classification report
287 | print(metrics.classification_report(y_test, y_pred_class))
288 |
289 |
290 | # **Precision** answers the question: "When a given class is predicted, how often are those predictions correct?" To calculate the precision for class 1, for example, you divide 55 by the sum of the first column of the confusion matrix.
291 |
292 | # manually calculate the precision for class 1
293 | precision = 55 / float(55 + 28 + 5 + 7 + 6)
294 | print(precision)
295 |
296 |
297 | # **Recall** answers the question: "When a given class is the true class, how often is that class predicted?" To calculate the recall for class 1, for example, you divide 55 by the sum of the first row of the confusion matrix.
298 |
299 | # manually calculate the recall for class 1
300 | recall = 55 / float(55 + 14 + 24 + 65 + 27)
301 | print(recall)
302 |
303 |
304 | # **F1 score** is the harmonic mean of precision and recall.
305 |
306 | # manually calculate the F1 score for class 1
307 | f1 = 2 * (precision * recall) / (precision + recall)
308 | print(f1)
309 |
310 |
311 | # **Support** answers the question: "How many observations exist for which a given class is the true class?" To calculate the support for class 1, for example, you sum the first row of the confusion matrix.
312 |
313 | # manually calculate the support for class 1
314 | support = 55 + 14 + 24 + 65 + 27
315 | print(support)
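# **Generalizing the manual calculations:** The four calculations above can be wrapped in one helper that works for any class. This is a sketch only: `class_metrics` is a hypothetical helper (not part of scikit-learn), the 3-class matrix is made up, and it assumes the scikit-learn layout (rows are true classes, columns are predicted classes). It does not guard against classes that are never predicted, which would cause a division by zero.

```python
# hypothetical 3-class confusion matrix (rows = true class, columns = predicted class)
cm = [[5, 2, 1],
      [1, 6, 3],
      [0, 2, 10]]


def class_metrics(cm, k):
    """Return (precision, recall, f1, support) for the class at index k."""
    true_positives = cm[k][k]
    predicted_k = sum(row[k] for row in cm)  # column sum: times class k was predicted
    actual_k = sum(cm[k])                    # row sum: times class k was the true class
    precision = true_positives / float(predicted_k)
    recall = true_positives / float(actual_k)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1, actual_k


precision_0, recall_0, f1_0, support_0 = class_metrics(cm, 0)
print(precision_0, recall_0, f1_0, support_0)
```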
316 |
317 |
318 | # **Classification report comments:**
319 | #
320 | # - Class 1 has low recall, meaning that the model has a hard time detecting the 1-star reviews, but high precision, meaning that when the model predicts a review is 1-star, it's usually correct.
321 | # - Class 5 has high recall and precision, probably because 5-star reviews have polarized language, and because the model has a lot of observations to learn from.
322 |
--------------------------------------------------------------------------------
/tutorial.py:
--------------------------------------------------------------------------------
1 | # # Tutorial: Machine Learning with Text in scikit-learn
2 |
3 | # ## Agenda
4 | #
5 | # 1. Model building in scikit-learn (refresher)
6 | # 2. Representing text as numerical data
7 | # 3. Reading a text-based dataset into pandas
8 | # 4. Vectorizing our dataset
9 | # 5. Building and evaluating a model
10 | # 6. Comparing models
11 | # 7. Examining a model for further insight
12 | # 8. Practicing this workflow on another dataset
13 | # 9. Tuning the vectorizer (discussion)
14 |
15 | # for Python 2: use print only as a function
16 | from __future__ import print_function
17 |
18 |
19 | # ## Part 1: Model building in scikit-learn (refresher)
20 |
21 | # load the iris dataset as an example
22 | from sklearn.datasets import load_iris
23 | iris = load_iris()
24 |
25 |
26 | # store the feature matrix (X) and response vector (y)
27 | X = iris.data
28 | y = iris.target
29 |
30 |
31 | # **"Features"** are also known as predictors, inputs, or attributes. The **"response"** is also known as the target, label, or output.
32 |
33 | # check the shapes of X and y
34 | print(X.shape)
35 | print(y.shape)
36 |
37 |
38 | # **"Observations"** are also known as samples, instances, or records.
39 |
40 | # examine the first 5 rows of the feature matrix (including the feature names)
41 | import pandas as pd
42 | pd.DataFrame(X, columns=iris.feature_names).head()
43 |
44 |
45 | # examine the response vector
46 | print(y)
47 |
48 |
49 | # In order to **build a model**, the features must be **numeric**, and every observation must have the **same features in the same order**.
50 |
51 | # import the class
52 | from sklearn.neighbors import KNeighborsClassifier
53 |
54 | # instantiate the model (with the default parameters)
55 | knn = KNeighborsClassifier()
56 |
57 | # fit the model with data (occurs in-place)
58 | knn.fit(X, y)
59 |
60 |
61 | # In order to **make a prediction**, the new observation must have the **same features as the training observations**, both in number and meaning.
62 |
63 | # predict the response for a new observation
64 | knn.predict([[3, 5, 4, 2]])
65 |
66 |
67 | # ## Part 2: Representing text as numerical data
68 |
69 | # example text for model training (SMS messages)
70 | simple_train = ['call you tonight', 'Call me a cab', 'please call me... PLEASE!']
71 |
72 |
73 | # From the [scikit-learn documentation](http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction):
74 | #
75 | # > Text Analysis is a major application field for machine learning algorithms. However the raw data, a sequence of symbols cannot be fed directly to the algorithms themselves as most of them expect **numerical feature vectors with a fixed size** rather than the **raw text documents with variable length**.
76 | #
77 | # We will use [CountVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) to "convert text into a matrix of token counts":
78 |
79 | # import and instantiate CountVectorizer (with the default parameters)
80 | from sklearn.feature_extraction.text import CountVectorizer
81 | vect = CountVectorizer()
82 |
83 |
84 | # learn the 'vocabulary' of the training data (occurs in-place)
85 | vect.fit(simple_train)
86 |
87 |
88 | # examine the fitted vocabulary (note: use get_feature_names_out() for scikit-learn 1.2 and later)
89 | vect.get_feature_names()
90 |
91 |
92 | # transform training data into a 'document-term matrix'
93 | simple_train_dtm = vect.transform(simple_train)
94 | simple_train_dtm
95 |
96 |
97 | # convert sparse matrix to a dense matrix
98 | simple_train_dtm.toarray()
99 |
100 |
101 | # examine the vocabulary and document-term matrix together
102 | pd.DataFrame(simple_train_dtm.toarray(), columns=vect.get_feature_names())
103 |
104 |
105 | # From the [scikit-learn documentation](http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction):
106 | #
107 | # > In this scheme, features and samples are defined as follows:
108 | #
109 | # > - Each individual token occurrence frequency (normalized or not) is treated as a **feature**.
110 | # > - The vector of all the token frequencies for a given document is considered a multivariate **sample**.
111 | #
112 | # > A **corpus of documents** can thus be represented by a matrix with **one row per document** and **one column per token** (e.g. word) occurring in the corpus.
113 | #
114 | # > We call **vectorization** the general process of turning a collection of text documents into numerical feature vectors. This specific strategy (tokenization, counting and normalization) is called the **Bag of Words** or "Bag of n-grams" representation. Documents are described by word occurrences while completely ignoring the relative position information of the words in the document.
115 |
116 | # check the type of the document-term matrix
117 | type(simple_train_dtm)
118 |
119 |
120 | # examine the sparse matrix contents
121 | print(simple_train_dtm)
122 |
123 |
124 | # From the [scikit-learn documentation](http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction):
125 | #
126 | # > As most documents will typically use a very small subset of the words used in the corpus, the resulting matrix will have **many feature values that are zeros** (typically more than 99% of them).
127 | #
128 | # > For instance, a collection of 10,000 short text documents (such as emails) will use a vocabulary with a size in the order of 100,000 unique words in total while each document will use 100 to 1000 unique words individually.
129 | #
130 | # > In order to be able to **store such a matrix in memory** but also to **speed up operations**, implementations will typically use a **sparse representation** such as the implementations available in the `scipy.sparse` package.
131 |
132 | # example text for model testing
133 | simple_test = ["please don't call me"]
134 |
135 |
136 | # In order to **make a prediction**, the new observation must have the **same features as the training observations**, both in number and meaning.
137 |
138 | # transform testing data into a document-term matrix (using existing vocabulary)
139 | simple_test_dtm = vect.transform(simple_test)
140 | simple_test_dtm.toarray()
141 |
142 |
143 | # examine the vocabulary and document-term matrix together
144 | pd.DataFrame(simple_test_dtm.toarray(), columns=vect.get_feature_names())
145 |
146 |
147 | # **Summary:**
148 | #
149 | # - `vect.fit(train)` **learns the vocabulary** of the training data
150 | # - `vect.transform(train)` uses the **fitted vocabulary** to build a document-term matrix from the training data
151 | # - `vect.transform(test)` uses the **fitted vocabulary** to build a document-term matrix from the testing data (and **ignores tokens** it hasn't seen before)
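# **What fit and transform do, in plain Python:** The sketch below mimics the summary above without scikit-learn. It is a simplified stand-in, not the actual CountVectorizer implementation: `tokenize`, `fit_vocabulary`, and `transform` are hypothetical helpers, and the output is a dense list of lists rather than a sparse matrix. It lowercases the text and uses the same default token pattern (words of two or more characters).

```python
import re
from collections import Counter

# same token pattern as CountVectorizer's default: words of 2+ characters
TOKEN_PATTERN = re.compile(r'(?u)\b\w\w+\b')


def tokenize(document):
    """Lowercase the document and split it into tokens."""
    return TOKEN_PATTERN.findall(document.lower())


def fit_vocabulary(documents):
    """Learn the sorted list of unique tokens in the documents."""
    return sorted(set(token for doc in documents for token in tokenize(doc)))


def transform(documents, vocabulary):
    """Count vocabulary tokens per document; tokens outside the vocabulary are ignored."""
    rows = []
    for doc in documents:
        counts = Counter(tokenize(doc))
        rows.append([counts[token] for token in vocabulary])
    return rows


simple_train = ['call you tonight', 'Call me a cab', 'please call me... PLEASE!']
simple_test = ["please don't call me"]

vocabulary = fit_vocabulary(simple_train)
train_dtm = transform(simple_train, vocabulary)
test_dtm = transform(simple_test, vocabulary)
print(vocabulary)
print(train_dtm)
print(test_dtm)
```

# Note that "don" from the testing document does not get a column, because it was not in the training vocabulary.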
152 |
153 | # ## Part 3: Reading a text-based dataset into pandas
154 |
155 | # read file into pandas using a relative path
156 | path = 'data/sms.tsv'
157 | sms = pd.read_table(path, header=None, names=['label', 'message'])
158 |
159 |
160 | # alternative: read file into pandas from a URL
161 | # url = 'https://raw.githubusercontent.com/justmarkham/pycon-2016-tutorial/master/data/sms.tsv'
162 | # sms = pd.read_table(url, header=None, names=['label', 'message'])
163 |
164 |
165 | # examine the shape
166 | sms.shape
167 |
168 |
169 | # examine the first 10 rows
170 | sms.head(10)
171 |
172 |
173 | # examine the class distribution
174 | sms.label.value_counts()
175 |
176 |
177 | # convert label to a numerical variable
178 | sms['label_num'] = sms.label.map({'ham':0, 'spam':1})
179 |
180 |
181 | # check that the conversion worked
182 | sms.head(10)
183 |
184 |
185 | # how to define X and y (from the iris data) for use with a MODEL
186 | X = iris.data
187 | y = iris.target
188 | print(X.shape)
189 | print(y.shape)
190 |
191 |
192 | # how to define X and y (from the SMS data) for use with COUNTVECTORIZER
193 | X = sms.message
194 | y = sms.label_num
195 | print(X.shape)
196 | print(y.shape)
197 |
198 |
199 | # split X and y into training and testing sets
200 | from sklearn.model_selection import train_test_split
201 | X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
202 | print(X_train.shape)
203 | print(X_test.shape)
204 | print(y_train.shape)
205 | print(y_test.shape)
206 |
207 |
208 | # ## Part 4: Vectorizing our dataset
209 |
210 | # instantiate the vectorizer
211 | vect = CountVectorizer()
212 |
213 |
214 | # learn training data vocabulary, then use it to create a document-term matrix
215 | vect.fit(X_train)
216 | X_train_dtm = vect.transform(X_train)
217 |
218 |
219 | # equivalently: combine fit and transform into a single step
220 | X_train_dtm = vect.fit_transform(X_train)
221 |
222 |
223 | # examine the document-term matrix
224 | X_train_dtm
225 |
226 |
227 | # transform testing data (using fitted vocabulary) into a document-term matrix
228 | X_test_dtm = vect.transform(X_test)
229 | X_test_dtm
230 |
231 |
232 | # ## Part 5: Building and evaluating a model
233 | #
234 | # We will use [multinomial Naive Bayes](http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html):
235 | #
236 | # > The multinomial Naive Bayes classifier is suitable for classification with **discrete features** (e.g., word counts for text classification). The multinomial distribution normally requires integer feature counts. However, in practice, fractional counts such as tf-idf may also work.
237 |
238 | # import and instantiate a Multinomial Naive Bayes model
239 | from sklearn.naive_bayes import MultinomialNB
240 | nb = MultinomialNB()
241 |
242 |
243 | # train the model using X_train_dtm
244 | nb.fit(X_train_dtm, y_train)
245 |
246 |
247 | # make class predictions for X_test_dtm
248 | y_pred_class = nb.predict(X_test_dtm)
249 |
250 |
251 | # calculate accuracy of class predictions
252 | from sklearn import metrics
253 | metrics.accuracy_score(y_test, y_pred_class)
254 |
255 |
256 | # print the confusion matrix
257 | metrics.confusion_matrix(y_test, y_pred_class)
258 |
259 |
260 | # print message text for the false positives (ham incorrectly classified as spam)
261 |
262 |
263 | # print message text for the false negatives (spam incorrectly classified as ham)
264 |
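# **One possible way to fill in the two cells above:** Filter the message text with a boolean mask that compares the predicted and true classes. The snippet is a self-contained sketch with made-up messages and predictions (`messages`, `y_true`, and `y_pred` are hypothetical stand-ins for `X_test`, `y_test`, and `y_pred_class`), where 0 means ham and 1 means spam.

```python
import numpy as np
import pandas as pd

# hypothetical testing data (not the SMS dataset)
messages = pd.Series(['free prize now', 'see you at lunch', 'call me later', 'win cash today'],
                     index=[10, 11, 12, 13])
y_true = pd.Series([1, 0, 0, 1], index=messages.index)
y_pred = np.array([1, 1, 0, 0])

# false positives: predicted spam (1), actually ham (0)
false_positives = messages[(y_pred == 1) & (y_true.values == 0)]

# false negatives: predicted ham (0), actually spam (1)
false_negatives = messages[(y_pred == 0) & (y_true.values == 1)]

print(false_positives)
print(false_negatives)
```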
265 |
266 | # example false negative
267 | X_test[3132]
268 |
269 |
270 | # calculate predicted probabilities for X_test_dtm (poorly calibrated)
271 | y_pred_prob = nb.predict_proba(X_test_dtm)[:, 1]
272 | y_pred_prob
273 |
274 |
275 | # calculate AUC
276 | metrics.roc_auc_score(y_test, y_pred_prob)
277 |
278 |
279 | # ## Part 6: Comparing models
280 | #
281 | # We will compare multinomial Naive Bayes with [logistic regression](http://scikit-learn.org/stable/modules/linear_model.html#logistic-regression):
282 | #
283 | # > Logistic regression, despite its name, is a **linear model for classification** rather than regression. Logistic regression is also known in the literature as logit regression, maximum-entropy classification (MaxEnt) or the log-linear classifier. In this model, the probabilities describing the possible outcomes of a single trial are modeled using a logistic function.
284 |
285 | # import and instantiate a logistic regression model
286 | from sklearn.linear_model import LogisticRegression
287 | logreg = LogisticRegression()
288 |
289 |
290 | # train the model using X_train_dtm
291 | logreg.fit(X_train_dtm, y_train)
292 |
293 |
294 | # make class predictions for X_test_dtm
295 | y_pred_class = logreg.predict(X_test_dtm)
296 |
297 |
298 | # calculate predicted probabilities for X_test_dtm (well calibrated)
299 | y_pred_prob = logreg.predict_proba(X_test_dtm)[:, 1]
300 | y_pred_prob
301 |
302 |
303 | # calculate accuracy
304 | metrics.accuracy_score(y_test, y_pred_class)
305 |
306 |
307 | # calculate AUC
308 | metrics.roc_auc_score(y_test, y_pred_prob)
309 |
310 |
311 | # ## Part 7: Examining a model for further insight
312 | #
313 | # We will examine our **trained Naive Bayes model** to calculate the approximate **"spamminess" of each token**.
314 |
315 | # store the vocabulary of X_train
316 | X_train_tokens = vect.get_feature_names()
317 | len(X_train_tokens)
318 |
319 |
320 | # examine the first 50 tokens
321 | print(X_train_tokens[0:50])
322 |
323 |
324 | # examine the last 50 tokens
325 | print(X_train_tokens[-50:])
326 |
327 |
328 | # Naive Bayes counts the number of times each token appears in each class
329 | nb.feature_count_
330 |
331 |
332 | # rows represent classes, columns represent tokens
333 | nb.feature_count_.shape
334 |
335 |
336 | # number of times each token appears across all HAM messages
337 | ham_token_count = nb.feature_count_[0, :]
338 | ham_token_count
339 |
340 |
341 | # number of times each token appears across all SPAM messages
342 | spam_token_count = nb.feature_count_[1, :]
343 | spam_token_count
344 |
345 |
346 | # create a DataFrame of tokens with their separate ham and spam counts
347 | tokens = pd.DataFrame({'token':X_train_tokens, 'ham':ham_token_count, 'spam':spam_token_count}).set_index('token')
348 | tokens.head()
349 |
350 |
351 | # examine 5 random DataFrame rows
352 | tokens.sample(5, random_state=6)
353 |
354 |
355 | # Naive Bayes counts the number of observations in each class
356 | nb.class_count_
357 |
358 |
359 | # Before we can calculate the "spamminess" of each token, we need to avoid **dividing by zero** and account for the **class imbalance**.
360 |
361 | # add 1 to ham and spam counts to avoid dividing by 0
362 | tokens['ham'] = tokens.ham + 1
363 | tokens['spam'] = tokens.spam + 1
364 | tokens.sample(5, random_state=6)
365 |
366 |
367 | # convert the ham and spam counts into frequencies
368 | tokens['ham'] = tokens.ham / nb.class_count_[0]
369 | tokens['spam'] = tokens.spam / nb.class_count_[1]
370 | tokens.sample(5, random_state=6)
371 |
372 |
373 | # calculate the ratio of spam-to-ham for each token
374 | tokens['spam_ratio'] = tokens.spam / tokens.ham
375 | tokens.sample(5, random_state=6)
376 |
377 |
378 | # examine the DataFrame sorted by spam_ratio
379 | # note: use sort() instead of sort_values() for pandas 0.16.2 and earlier
380 | tokens.sort_values('spam_ratio', ascending=False)
381 |
382 |
383 | # look up the spam_ratio for a given token
384 | tokens.loc['dating', 'spam_ratio']
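# **The spam ratio steps in plain Python:** The sketch below repeats the three steps above (add 1, divide by the class count, take the ratio) for two tokens. All counts and class sizes are made up for illustration and do not come from the SMS data.

```python
# hypothetical per-class token counts and class sizes (not the SMS data)
ham_counts = {'claim': 0, 'home': 9}
spam_counts = {'claim': 8, 'home': 0}
n_ham, n_spam = 100, 20

spam_ratio = {}
for token in ham_counts:
    # add 1 to each count so that neither frequency can be zero
    ham_freq = (ham_counts[token] + 1) / float(n_ham)
    spam_freq = (spam_counts[token] + 1) / float(n_spam)
    # ratio above 1 means the token is relatively more common in spam
    spam_ratio[token] = spam_freq / ham_freq

print(spam_ratio)
```

# Dividing by the class counts matters because there are far fewer spam messages than ham messages; without it, almost every token would look "hammy".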
385 |
386 |
387 | # ## Part 8: Practicing this workflow on another dataset
388 | #
389 | # Please open the **`exercise.ipynb`** notebook (or the **`exercise.py`** script).
390 |
391 | # ## Part 9: Tuning the vectorizer (discussion)
392 | #
393 | # Thus far, we have been using the default parameters of [CountVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html):
394 |
395 | # show default parameters for CountVectorizer
396 | vect
397 |
398 |
399 | # However, the vectorizer is worth tuning, just like a model is worth tuning! Here are a few parameters that you might want to tune:
400 | #
401 | # - **stop_words:** string {'english'}, list, or None (default)
402 | # - If 'english', a built-in stop word list for English is used.
403 | # - If a list, that list is assumed to contain stop words, all of which will be removed from the resulting tokens.
404 | # - If None, no stop words will be used.
405 |
406 | # remove English stop words
407 | vect = CountVectorizer(stop_words='english')
408 |
409 |
410 | # - **ngram_range:** tuple (min_n, max_n), default=(1, 1)
411 | # - The lower and upper boundary of the range of n-values for different n-grams to be extracted.
412 | # - All values of n such that min_n <= n <= max_n will be used.
413 |
414 | # include 1-grams and 2-grams
415 | vect = CountVectorizer(ngram_range=(1, 2))
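# **What ngram_range produces:** The sketch below builds every n-gram with min_n <= n <= max_n from a list of tokens. `word_ngrams` is a hypothetical helper written for this illustration, not a scikit-learn function.

```python
def word_ngrams(tokens, ngram_range=(1, 1)):
    """Return all n-grams (joined by spaces) with min_n <= n <= max_n."""
    min_n, max_n = ngram_range
    ngrams = []
    for n in range(min_n, max_n + 1):
        # slide a window of n tokens across the document
        for start in range(len(tokens) - n + 1):
            ngrams.append(' '.join(tokens[start:start + n]))
    return ngrams


tokens = ['please', 'call', 'me']
unigrams_and_bigrams = word_ngrams(tokens, ngram_range=(1, 2))
print(unigrams_and_bigrams)
```

# Each n-gram becomes its own column in the document-term matrix, so widening the range can increase the number of features substantially.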
416 |
417 |
418 | # - **max_df:** float in range [0.0, 1.0] or int, default=1.0
419 | # - When building the vocabulary, ignore terms that have a document frequency strictly higher than the given threshold (corpus-specific stop words).
420 | # - If float, the parameter represents a proportion of documents.
421 | # - If integer, the parameter represents an absolute count.
422 |
423 | # ignore terms that appear in more than 50% of the documents
424 | vect = CountVectorizer(max_df=0.5)
425 |
426 |
427 | # - **min_df:** float in range [0.0, 1.0] or int, default=1
428 | # - When building the vocabulary, ignore terms that have a document frequency strictly lower than the given threshold. (This value is also called "cut-off" in the literature.)
429 | # - If float, the parameter represents a proportion of documents.
430 | # - If integer, the parameter represents an absolute count.
431 |
432 | # only keep terms that appear in at least 2 documents
433 | vect = CountVectorizer(min_df=2)
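# **How max_df and min_df filter the vocabulary:** The sketch below applies both thresholds to the three example messages from Part 2. `filter_by_document_frequency` is a hypothetical helper, not the scikit-learn implementation; it follows the rules described above (a float is a proportion of documents, an integer is an absolute count).

```python
import re


def tokenize(document):
    """Lowercase the document and keep words of 2+ characters."""
    return re.findall(r'(?u)\b\w\w+\b', document.lower())


def filter_by_document_frequency(documents, min_df=1, max_df=1.0):
    """Return the sorted vocabulary after applying min_df and max_df."""
    n_docs = len(documents)
    doc_freq = {}
    for doc in documents:
        # use a set so that a token is counted once per document
        for token in set(tokenize(doc)):
            doc_freq[token] = doc_freq.get(token, 0) + 1
    # integer thresholds are absolute counts; float thresholds are proportions
    low = min_df if isinstance(min_df, int) else min_df * n_docs
    high = max_df if isinstance(max_df, int) else max_df * n_docs
    return sorted(token for token, df in doc_freq.items() if low <= df <= high)


simple_train = ['call you tonight', 'Call me a cab', 'please call me... PLEASE!']
kept_with_min_df = filter_by_document_frequency(simple_train, min_df=2)
kept_with_max_df = filter_by_document_frequency(simple_train, max_df=0.5)
print(kept_with_min_df)
print(kept_with_max_df)
```

# "please" appears twice in one message, but its document frequency is 1, so min_df=2 removes it.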
434 |
435 |
436 | # **Guidelines for tuning CountVectorizer:**
437 | #
438 | # - Use your knowledge of the **problem** and the **text**, and your understanding of the **tuning parameters**, to help you decide what parameters to tune and how to tune them.
439 | # - **Experiment**, and let the data tell you the best approach!
440 |
--------------------------------------------------------------------------------
/tutorial.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# Tutorial: Machine Learning with Text in scikit-learn"
8 | ]
9 | },
10 | {
11 | "cell_type": "markdown",
12 | "metadata": {},
13 | "source": [
14 | "## Agenda\n",
15 | "\n",
16 | "1. Model building in scikit-learn (refresher)\n",
17 | "2. Representing text as numerical data\n",
18 | "3. Reading a text-based dataset into pandas\n",
19 | "4. Vectorizing our dataset\n",
20 | "5. Building and evaluating a model\n",
21 | "6. Comparing models\n",
22 | "7. Examining a model for further insight\n",
23 | "8. Practicing this workflow on another dataset\n",
24 | "9. Tuning the vectorizer (discussion)"
25 | ]
26 | },
27 | {
28 | "cell_type": "code",
29 | "execution_count": null,
30 | "metadata": {
31 | "collapsed": false
32 | },
33 | "outputs": [],
34 | "source": [
35 | "# for Python 2: use print only as a function\n",
36 | "from __future__ import print_function"
37 | ]
38 | },
39 | {
40 | "cell_type": "markdown",
41 | "metadata": {},
42 | "source": [
43 | "## Part 1: Model building in scikit-learn (refresher)"
44 | ]
45 | },
46 | {
47 | "cell_type": "code",
48 | "execution_count": null,
49 | "metadata": {
50 | "collapsed": true
51 | },
52 | "outputs": [],
53 | "source": [
54 | "# load the iris dataset as an example\n",
55 | "from sklearn.datasets import load_iris\n",
56 | "iris = load_iris()"
57 | ]
58 | },
59 | {
60 | "cell_type": "code",
61 | "execution_count": null,
62 | "metadata": {
63 | "collapsed": true
64 | },
65 | "outputs": [],
66 | "source": [
67 | "# store the feature matrix (X) and response vector (y)\n",
68 | "X = iris.data\n",
69 | "y = iris.target"
70 | ]
71 | },
72 | {
73 | "cell_type": "markdown",
74 | "metadata": {},
75 | "source": [
76 | "**\"Features\"** are also known as predictors, inputs, or attributes. The **\"response\"** is also known as the target, label, or output."
77 | ]
78 | },
79 | {
80 | "cell_type": "code",
81 | "execution_count": null,
82 | "metadata": {
83 | "collapsed": false
84 | },
85 | "outputs": [],
86 | "source": [
87 | "# check the shapes of X and y\n",
88 | "print(X.shape)\n",
89 | "print(y.shape)"
90 | ]
91 | },
92 | {
93 | "cell_type": "markdown",
94 | "metadata": {},
95 | "source": [
96 | "**\"Observations\"** are also known as samples, instances, or records."
97 | ]
98 | },
99 | {
100 | "cell_type": "code",
101 | "execution_count": null,
102 | "metadata": {
103 | "collapsed": false
104 | },
105 | "outputs": [],
106 | "source": [
107 | "# examine the first 5 rows of the feature matrix (including the feature names)\n",
108 | "import pandas as pd\n",
109 | "pd.DataFrame(X, columns=iris.feature_names).head()"
110 | ]
111 | },
112 | {
113 | "cell_type": "code",
114 | "execution_count": null,
115 | "metadata": {
116 | "collapsed": false
117 | },
118 | "outputs": [],
119 | "source": [
120 | "# examine the response vector\n",
121 | "print(y)"
122 | ]
123 | },
124 | {
125 | "cell_type": "markdown",
126 | "metadata": {},
127 | "source": [
128 | "In order to **build a model**, the features must be **numeric**, and every observation must have the **same features in the same order**."
129 | ]
130 | },
131 | {
132 | "cell_type": "code",
133 | "execution_count": null,
134 | "metadata": {
135 | "collapsed": false
136 | },
137 | "outputs": [],
138 | "source": [
139 | "# import the class\n",
140 | "from sklearn.neighbors import KNeighborsClassifier\n",
141 | "\n",
142 | "# instantiate the model (with the default parameters)\n",
143 | "knn = KNeighborsClassifier()\n",
144 | "\n",
145 | "# fit the model with data (occurs in-place)\n",
146 | "knn.fit(X, y)"
147 | ]
148 | },
149 | {
150 | "cell_type": "markdown",
151 | "metadata": {},
152 | "source": [
153 | "In order to **make a prediction**, the new observation must have the **same features as the training observations**, both in number and meaning."
154 | ]
155 | },
156 | {
157 | "cell_type": "code",
158 | "execution_count": null,
159 | "metadata": {
160 | "collapsed": false
161 | },
162 | "outputs": [],
163 | "source": [
164 | "# predict the response for a new observation\n",
165 | "knn.predict([[3, 5, 4, 2]])"
166 | ]
167 | },
168 | {
169 | "cell_type": "markdown",
170 | "metadata": {},
171 | "source": [
172 | "## Part 2: Representing text as numerical data"
173 | ]
174 | },
175 | {
176 | "cell_type": "code",
177 | "execution_count": null,
178 | "metadata": {
179 | "collapsed": true
180 | },
181 | "outputs": [],
182 | "source": [
183 | "# example text for model training (SMS messages)\n",
184 | "simple_train = ['call you tonight', 'Call me a cab', 'please call me... PLEASE!']"
185 | ]
186 | },
187 | {
188 | "cell_type": "markdown",
189 | "metadata": {},
190 | "source": [
191 | "From the [scikit-learn documentation](http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction):\n",
192 | "\n",
193 | "> Text Analysis is a major application field for machine learning algorithms. However the raw data, a sequence of symbols cannot be fed directly to the algorithms themselves as most of them expect **numerical feature vectors with a fixed size** rather than the **raw text documents with variable length**.\n",
194 | "\n",
195 | "We will use [CountVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) to \"convert text into a matrix of token counts\":"
196 | ]
197 | },
198 | {
199 | "cell_type": "code",
200 | "execution_count": null,
201 | "metadata": {
202 | "collapsed": true
203 | },
204 | "outputs": [],
205 | "source": [
206 | "# import and instantiate CountVectorizer (with the default parameters)\n",
207 | "from sklearn.feature_extraction.text import CountVectorizer\n",
208 | "vect = CountVectorizer()"
209 | ]
210 | },
211 | {
212 | "cell_type": "code",
213 | "execution_count": null,
214 | "metadata": {
215 | "collapsed": false
216 | },
217 | "outputs": [],
218 | "source": [
219 | "# learn the 'vocabulary' of the training data (occurs in-place)\n",
220 | "vect.fit(simple_train)"
221 | ]
222 | },
223 | {
224 | "cell_type": "code",
225 | "execution_count": null,
226 | "metadata": {
227 | "collapsed": false
228 | },
229 | "outputs": [],
230 | "source": [
231 | "# examine the fitted vocabulary\n",
232 | "vect.get_feature_names()"
233 | ]
234 | },
235 | {
236 | "cell_type": "code",
237 | "execution_count": null,
238 | "metadata": {
239 | "collapsed": false
240 | },
241 | "outputs": [],
242 | "source": [
243 | "# transform training data into a 'document-term matrix'\n",
244 | "simple_train_dtm = vect.transform(simple_train)\n",
245 | "simple_train_dtm"
246 | ]
247 | },
248 | {
249 | "cell_type": "code",
250 | "execution_count": null,
251 | "metadata": {
252 | "collapsed": false
253 | },
254 | "outputs": [],
255 | "source": [
256 | "# convert sparse matrix to a dense matrix\n",
257 | "simple_train_dtm.toarray()"
258 | ]
259 | },
260 | {
261 | "cell_type": "code",
262 | "execution_count": null,
263 | "metadata": {
264 | "collapsed": false
265 | },
266 | "outputs": [],
267 | "source": [
268 | "# examine the vocabulary and document-term matrix together\n",
269 | "pd.DataFrame(simple_train_dtm.toarray(), columns=vect.get_feature_names())"
270 | ]
271 | },
272 | {
273 | "cell_type": "markdown",
274 | "metadata": {},
275 | "source": [
276 | "From the [scikit-learn documentation](http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction):\n",
277 | "\n",
278 | "> In this scheme, features and samples are defined as follows:\n",
279 | "\n",
280 | "> - Each individual token occurrence frequency (normalized or not) is treated as a **feature**.\n",
281 | "> - The vector of all the token frequencies for a given document is considered a multivariate **sample**.\n",
282 | "\n",
283 | "> A **corpus of documents** can thus be represented by a matrix with **one row per document** and **one column per token** (e.g. word) occurring in the corpus.\n",
284 | "\n",
285 | "> We call **vectorization** the general process of turning a collection of text documents into numerical feature vectors. This specific strategy (tokenization, counting and normalization) is called the **Bag of Words** or \"Bag of n-grams\" representation. Documents are described by word occurrences while completely ignoring the relative position information of the words in the document."
286 | ]
287 | },
288 | {
289 | "cell_type": "code",
290 | "execution_count": null,
291 | "metadata": {
292 | "collapsed": false
293 | },
294 | "outputs": [],
295 | "source": [
296 | "# check the type of the document-term matrix\n",
297 | "type(simple_train_dtm)"
298 | ]
299 | },
300 | {
301 | "cell_type": "code",
302 | "execution_count": null,
303 | "metadata": {
304 | "collapsed": false,
305 | "scrolled": true
306 | },
307 | "outputs": [],
308 | "source": [
309 | "# examine the sparse matrix contents\n",
310 | "print(simple_train_dtm)"
311 | ]
312 | },
313 | {
314 | "cell_type": "markdown",
315 | "metadata": {},
316 | "source": [
317 | "From the [scikit-learn documentation](http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction):\n",
318 | "\n",
319 | "> As most documents will typically use a very small subset of the words used in the corpus, the resulting matrix will have **many feature values that are zeros** (typically more than 99% of them).\n",
320 | "\n",
321 | "> For instance, a collection of 10,000 short text documents (such as emails) will use a vocabulary with a size in the order of 100,000 unique words in total while each document will use 100 to 1000 unique words individually.\n",
322 | "\n",
323 | "> In order to be able to **store such a matrix in memory** but also to **speed up operations**, implementations will typically use a **sparse representation** such as the implementations available in the `scipy.sparse` package."
324 | ]
325 | },
326 | {
327 | "cell_type": "code",
328 | "execution_count": null,
329 | "metadata": {
330 | "collapsed": true
331 | },
332 | "outputs": [],
333 | "source": [
334 | "# example text for model testing\n",
335 | "simple_test = [\"please don't call me\"]"
336 | ]
337 | },
338 | {
339 | "cell_type": "markdown",
340 | "metadata": {},
341 | "source": [
342 | "In order to **make a prediction**, the new observation must have the **same features as the training observations**, both in number and meaning."
343 | ]
344 | },
345 | {
346 | "cell_type": "code",
347 | "execution_count": null,
348 | "metadata": {
349 | "collapsed": false
350 | },
351 | "outputs": [],
352 | "source": [
353 | "# transform testing data into a document-term matrix (using existing vocabulary)\n",
354 | "simple_test_dtm = vect.transform(simple_test)\n",
355 | "simple_test_dtm.toarray()"
356 | ]
357 | },
358 | {
359 | "cell_type": "code",
360 | "execution_count": null,
361 | "metadata": {
362 | "collapsed": false
363 | },
364 | "outputs": [],
365 | "source": [
366 | "# examine the vocabulary and document-term matrix together\n",
367 | "pd.DataFrame(simple_test_dtm.toarray(), columns=vect.get_feature_names())"
368 | ]
369 | },
370 | {
371 | "cell_type": "markdown",
372 | "metadata": {},
373 | "source": [
374 | "**Summary:**\n",
375 | "\n",
376 | "- `vect.fit(train)` **learns the vocabulary** of the training data\n",
377 | "- `vect.transform(train)` uses the **fitted vocabulary** to build a document-term matrix from the training data\n",
378 | "- `vect.transform(test)` uses the **fitted vocabulary** to build a document-term matrix from the testing data (and **ignores tokens** it hasn't seen before)"
379 | ]
380 | },
381 | {
382 | "cell_type": "markdown",
383 | "metadata": {},
384 | "source": [
385 | "## Part 3: Reading a text-based dataset into pandas"
386 | ]
387 | },
388 | {
389 | "cell_type": "code",
390 | "execution_count": null,
391 | "metadata": {
392 | "collapsed": true
393 | },
394 | "outputs": [],
395 | "source": [
396 | "# read file into pandas using a relative path\n",
397 | "path = 'data/sms.tsv'\n",
398 | "sms = pd.read_table(path, header=None, names=['label', 'message'])"
399 | ]
400 | },
401 | {
402 | "cell_type": "code",
403 | "execution_count": null,
404 | "metadata": {
405 | "collapsed": false
406 | },
407 | "outputs": [],
408 | "source": [
409 | "# alternative: read file into pandas from a URL\n",
410 | "# url = 'https://raw.githubusercontent.com/justmarkham/pycon-2016-tutorial/master/data/sms.tsv'\n",
411 | "# sms = pd.read_table(url, header=None, names=['label', 'message'])"
412 | ]
413 | },
414 | {
415 | "cell_type": "code",
416 | "execution_count": null,
417 | "metadata": {
418 | "collapsed": false
419 | },
420 | "outputs": [],
421 | "source": [
422 | "# examine the shape\n",
423 | "sms.shape"
424 | ]
425 | },
426 | {
427 | "cell_type": "code",
428 | "execution_count": null,
429 | "metadata": {
430 | "collapsed": false
431 | },
432 | "outputs": [],
433 | "source": [
434 | "# examine the first 10 rows\n",
435 | "sms.head(10)"
436 | ]
437 | },
438 | {
439 | "cell_type": "code",
440 | "execution_count": null,
441 | "metadata": {
442 | "collapsed": false
443 | },
444 | "outputs": [],
445 | "source": [
446 | "# examine the class distribution\n",
447 | "sms.label.value_counts()"
448 | ]
449 | },
450 | {
451 | "cell_type": "code",
452 | "execution_count": null,
453 | "metadata": {
454 | "collapsed": true
455 | },
456 | "outputs": [],
457 | "source": [
458 | "# convert label to a numerical variable\n",
459 | "sms['label_num'] = sms.label.map({'ham':0, 'spam':1})"
460 | ]
461 | },
462 | {
463 | "cell_type": "code",
464 | "execution_count": null,
465 | "metadata": {
466 | "collapsed": false
467 | },
468 | "outputs": [],
469 | "source": [
470 | "# check that the conversion worked\n",
471 | "sms.head(10)"
472 | ]
473 | },
474 | {
475 | "cell_type": "code",
476 | "execution_count": null,
477 | "metadata": {
478 | "collapsed": false
479 | },
480 | "outputs": [],
481 | "source": [
482 | "# how to define X and y (from the iris data) for use with a MODEL\n",
483 | "X = iris.data\n",
484 | "y = iris.target\n",
485 | "print(X.shape)\n",
486 | "print(y.shape)"
487 | ]
488 | },
489 | {
490 | "cell_type": "code",
491 | "execution_count": null,
492 | "metadata": {
493 | "collapsed": false
494 | },
495 | "outputs": [],
496 | "source": [
497 | "# how to define X and y (from the SMS data) for use with COUNTVECTORIZER\n",
498 | "X = sms.message\n",
499 | "y = sms.label_num\n",
500 | "print(X.shape)\n",
501 | "print(y.shape)"
502 | ]
503 | },
504 | {
505 | "cell_type": "code",
506 | "execution_count": null,
507 | "metadata": {
508 | "collapsed": false
509 | },
510 | "outputs": [],
511 | "source": [
512 | "# split X and y into training and testing sets\n",
513 | "from sklearn.model_selection import train_test_split\n",
514 | "X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)\n",
515 | "print(X_train.shape)\n",
516 | "print(X_test.shape)\n",
517 | "print(y_train.shape)\n",
518 | "print(y_test.shape)"
519 | ]
520 | },
521 | {
522 | "cell_type": "markdown",
523 | "metadata": {},
524 | "source": [
525 | "## Part 4: Vectorizing our dataset"
526 | ]
527 | },
528 | {
529 | "cell_type": "code",
530 | "execution_count": null,
531 | "metadata": {
532 | "collapsed": true
533 | },
534 | "outputs": [],
535 | "source": [
536 | "# instantiate the vectorizer\n",
537 | "vect = CountVectorizer()"
538 | ]
539 | },
540 | {
541 | "cell_type": "code",
542 | "execution_count": null,
543 | "metadata": {
544 | "collapsed": true
545 | },
546 | "outputs": [],
547 | "source": [
548 | "# learn training data vocabulary, then use it to create a document-term matrix\n",
549 | "vect.fit(X_train)\n",
550 | "X_train_dtm = vect.transform(X_train)"
551 | ]
552 | },
553 | {
554 | "cell_type": "code",
555 | "execution_count": null,
556 | "metadata": {
557 | "collapsed": true
558 | },
559 | "outputs": [],
560 | "source": [
561 | "# equivalently: combine fit and transform into a single step\n",
562 | "X_train_dtm = vect.fit_transform(X_train)"
563 | ]
564 | },
565 | {
566 | "cell_type": "code",
567 | "execution_count": null,
568 | "metadata": {
569 | "collapsed": false
570 | },
571 | "outputs": [],
572 | "source": [
573 | "# examine the document-term matrix\n",
574 | "X_train_dtm"
575 | ]
576 | },
577 | {
578 | "cell_type": "code",
579 | "execution_count": null,
580 | "metadata": {
581 | "collapsed": false
582 | },
583 | "outputs": [],
584 | "source": [
585 | "# transform testing data (using fitted vocabulary) into a document-term matrix\n",
586 | "X_test_dtm = vect.transform(X_test)\n",
587 | "X_test_dtm"
588 | ]
589 | },
590 | {
591 | "cell_type": "markdown",
592 | "metadata": {},
593 | "source": [
594 | "## Part 5: Building and evaluating a model\n",
595 | "\n",
596 | "We will use [multinomial Naive Bayes](http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html):\n",
597 | "\n",
598 | "> The multinomial Naive Bayes classifier is suitable for classification with **discrete features** (e.g., word counts for text classification). The multinomial distribution normally requires integer feature counts. However, in practice, fractional counts such as tf-idf may also work."
599 | ]
600 | },
601 | {
602 | "cell_type": "code",
603 | "execution_count": null,
604 | "metadata": {
605 | "collapsed": true
606 | },
607 | "outputs": [],
608 | "source": [
609 | "# import and instantiate a Multinomial Naive Bayes model\n",
610 | "from sklearn.naive_bayes import MultinomialNB\n",
611 | "nb = MultinomialNB()"
612 | ]
613 | },
614 | {
615 | "cell_type": "code",
616 | "execution_count": null,
617 | "metadata": {
618 | "collapsed": false
619 | },
620 | "outputs": [],
621 | "source": [
622 | "# train the model using X_train_dtm (timing it with an IPython \"magic command\")\n",
623 | "%time nb.fit(X_train_dtm, y_train)"
624 | ]
625 | },
626 | {
627 | "cell_type": "code",
628 | "execution_count": null,
629 | "metadata": {
630 | "collapsed": true
631 | },
632 | "outputs": [],
633 | "source": [
634 | "# make class predictions for X_test_dtm\n",
635 | "y_pred_class = nb.predict(X_test_dtm)"
636 | ]
637 | },
638 | {
639 | "cell_type": "code",
640 | "execution_count": null,
641 | "metadata": {
642 | "collapsed": false
643 | },
644 | "outputs": [],
645 | "source": [
646 | "# calculate accuracy of class predictions\n",
647 | "from sklearn import metrics\n",
648 | "metrics.accuracy_score(y_test, y_pred_class)"
649 | ]
650 | },
651 | {
652 | "cell_type": "code",
653 | "execution_count": null,
654 | "metadata": {
655 | "collapsed": false
656 | },
657 | "outputs": [],
658 | "source": [
659 | "# print the confusion matrix\n",
660 | "metrics.confusion_matrix(y_test, y_pred_class)"
661 | ]
662 | },
663 | {
664 | "cell_type": "code",
665 | "execution_count": null,
666 | "metadata": {
667 | "collapsed": false
668 | },
669 | "outputs": [],
670 | "source": [
671 | "# print message text for the false positives (ham incorrectly classified as spam)\n", "X_test[(y_pred_class == 1) & (y_test == 0)]"
672 | ]
673 | },
674 | {
675 | "cell_type": "code",
676 | "execution_count": null,
677 | "metadata": {
678 | "collapsed": false,
679 | "scrolled": true
680 | },
681 | "outputs": [],
682 | "source": [
683 | "# print message text for the false negatives (spam incorrectly classified as ham)\n", "X_test[(y_pred_class == 0) & (y_test == 1)]"
684 | ]
685 | },
686 | {
687 | "cell_type": "code",
688 | "execution_count": null,
689 | "metadata": {
690 | "collapsed": false,
691 | "scrolled": true
692 | },
693 | "outputs": [],
694 | "source": [
695 | "# example false negative\n",
696 | "X_test[3132]"
697 | ]
698 | },
699 | {
700 | "cell_type": "code",
701 | "execution_count": null,
702 | "metadata": {
703 | "collapsed": false
704 | },
705 | "outputs": [],
706 | "source": [
707 | "# calculate predicted probabilities for X_test_dtm (poorly calibrated)\n",
708 | "y_pred_prob = nb.predict_proba(X_test_dtm)[:, 1]\n",
709 | "y_pred_prob"
710 | ]
711 | },
712 | {
713 | "cell_type": "code",
714 | "execution_count": null,
715 | "metadata": {
716 | "collapsed": false
717 | },
718 | "outputs": [],
719 | "source": [
720 | "# calculate AUC\n",
721 | "metrics.roc_auc_score(y_test, y_pred_prob)"
722 | ]
723 | },
724 | {
725 | "cell_type": "markdown",
726 | "metadata": {},
727 | "source": [
728 | "## Part 6: Comparing models\n",
729 | "\n",
730 | "We will compare multinomial Naive Bayes with [logistic regression](http://scikit-learn.org/stable/modules/linear_model.html#logistic-regression):\n",
731 | "\n",
732 | "> Logistic regression, despite its name, is a **linear model for classification** rather than regression. Logistic regression is also known in the literature as logit regression, maximum-entropy classification (MaxEnt) or the log-linear classifier. In this model, the probabilities describing the possible outcomes of a single trial are modeled using a logistic function."
733 | ]
734 | },
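{
 "cell_type": "markdown",
 "metadata": {},
 "source": [
  "As a rough sketch of the description above (not part of the original tutorial): the logistic function squeezes any real-valued score into the range 0 to 1, so a score of 0 maps to a probability of 0.5. The `scores` array is made up purely for illustration."
 ]
},
{
 "cell_type": "code",
 "execution_count": null,
 "metadata": {
  "collapsed": false
 },
 "outputs": [],
 "source": [
  "# apply the logistic function to a few hypothetical linear-model scores\n",
  "import numpy as np\n",
  "scores = np.array([-4.0, 0.0, 4.0])\n",
  "1 / (1 + np.exp(-scores))"
 ]
},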
735 | {
736 | "cell_type": "code",
737 | "execution_count": null,
738 | "metadata": {
739 | "collapsed": true
740 | },
741 | "outputs": [],
742 | "source": [
743 | "# import and instantiate a logistic regression model\n",
744 | "from sklearn.linear_model import LogisticRegression\n",
745 | "logreg = LogisticRegression()"
746 | ]
747 | },
748 | {
749 | "cell_type": "code",
750 | "execution_count": null,
751 | "metadata": {
752 | "collapsed": false
753 | },
754 | "outputs": [],
755 | "source": [
756 | "# train the model using X_train_dtm\n",
757 | "%time logreg.fit(X_train_dtm, y_train)"
758 | ]
759 | },
760 | {
761 | "cell_type": "code",
762 | "execution_count": null,
763 | "metadata": {
764 | "collapsed": true
765 | },
766 | "outputs": [],
767 | "source": [
768 | "# make class predictions for X_test_dtm\n",
769 | "y_pred_class = logreg.predict(X_test_dtm)"
770 | ]
771 | },
772 | {
773 | "cell_type": "code",
774 | "execution_count": null,
775 | "metadata": {
776 | "collapsed": false
777 | },
778 | "outputs": [],
779 | "source": [
780 | "# calculate predicted probabilities for X_test_dtm (well calibrated)\n",
781 | "y_pred_prob = logreg.predict_proba(X_test_dtm)[:, 1]\n",
782 | "y_pred_prob"
783 | ]
784 | },
785 | {
786 | "cell_type": "code",
787 | "execution_count": null,
788 | "metadata": {
789 | "collapsed": false
790 | },
791 | "outputs": [],
792 | "source": [
793 | "# calculate accuracy\n",
794 | "metrics.accuracy_score(y_test, y_pred_class)"
795 | ]
796 | },
797 | {
798 | "cell_type": "code",
799 | "execution_count": null,
800 | "metadata": {
801 | "collapsed": false
802 | },
803 | "outputs": [],
804 | "source": [
805 | "# calculate AUC\n",
806 | "metrics.roc_auc_score(y_test, y_pred_prob)"
807 | ]
808 | },
809 | {
810 | "cell_type": "markdown",
811 | "metadata": {},
812 | "source": [
813 | "## Part 7: Examining a model for further insight\n",
814 | "\n",
815 | "We will examine our **trained Naive Bayes model** to calculate the approximate **\"spamminess\" of each token**."
816 | ]
817 | },
818 | {
819 | "cell_type": "code",
820 | "execution_count": null,
821 | "metadata": {
822 | "collapsed": false
823 | },
824 | "outputs": [],
825 | "source": [
826 | "# store the vocabulary of X_train (note: use get_feature_names_out() for scikit-learn 1.2 and later)\n",
827 | "X_train_tokens = vect.get_feature_names()\n",
828 | "len(X_train_tokens)"
829 | ]
830 | },
831 | {
832 | "cell_type": "code",
833 | "execution_count": null,
834 | "metadata": {
835 | "collapsed": false,
836 | "scrolled": true
837 | },
838 | "outputs": [],
839 | "source": [
840 | "# examine the first 50 tokens\n",
841 | "print(X_train_tokens[0:50])"
842 | ]
843 | },
844 | {
845 | "cell_type": "code",
846 | "execution_count": null,
847 | "metadata": {
848 | "collapsed": false
849 | },
850 | "outputs": [],
851 | "source": [
852 | "# examine the last 50 tokens\n",
853 | "print(X_train_tokens[-50:])"
854 | ]
855 | },
856 | {
857 | "cell_type": "code",
858 | "execution_count": null,
859 | "metadata": {
860 | "collapsed": false
861 | },
862 | "outputs": [],
863 | "source": [
864 | "# Naive Bayes counts the number of times each token appears in each class\n",
865 | "nb.feature_count_"
866 | ]
867 | },
868 | {
869 | "cell_type": "code",
870 | "execution_count": null,
871 | "metadata": {
872 | "collapsed": false
873 | },
874 | "outputs": [],
875 | "source": [
876 | "# rows represent classes, columns represent tokens\n",
877 | "nb.feature_count_.shape"
878 | ]
879 | },
880 | {
881 | "cell_type": "code",
882 | "execution_count": null,
883 | "metadata": {
884 | "collapsed": false
885 | },
886 | "outputs": [],
887 | "source": [
888 | "# number of times each token appears across all HAM messages\n",
889 | "ham_token_count = nb.feature_count_[0, :]\n",
890 | "ham_token_count"
891 | ]
892 | },
893 | {
894 | "cell_type": "code",
895 | "execution_count": null,
896 | "metadata": {
897 | "collapsed": false
898 | },
899 | "outputs": [],
900 | "source": [
901 | "# number of times each token appears across all SPAM messages\n",
902 | "spam_token_count = nb.feature_count_[1, :]\n",
903 | "spam_token_count"
904 | ]
905 | },
906 | {
907 | "cell_type": "code",
908 | "execution_count": null,
909 | "metadata": {
910 | "collapsed": false
911 | },
912 | "outputs": [],
913 | "source": [
914 | "# create a DataFrame of tokens with their separate ham and spam counts\n",
915 | "tokens = pd.DataFrame({'token':X_train_tokens, 'ham':ham_token_count, 'spam':spam_token_count}).set_index('token')\n",
916 | "tokens.head()"
917 | ]
918 | },
919 | {
920 | "cell_type": "code",
921 | "execution_count": null,
922 | "metadata": {
923 | "collapsed": false
924 | },
925 | "outputs": [],
926 | "source": [
927 | "# examine 5 random DataFrame rows\n",
928 | "tokens.sample(5, random_state=6)"
929 | ]
930 | },
931 | {
932 | "cell_type": "code",
933 | "execution_count": null,
934 | "metadata": {
935 | "collapsed": false
936 | },
937 | "outputs": [],
938 | "source": [
939 | "# Naive Bayes counts the number of observations in each class\n",
940 | "nb.class_count_"
941 | ]
942 | },
943 | {
944 | "cell_type": "markdown",
945 | "metadata": {},
946 | "source": [
947 | "Before we can calculate the \"spamminess\" of each token, we need to avoid **dividing by zero** and account for the **class imbalance**."
948 | ]
949 | },
950 | {
951 | "cell_type": "code",
952 | "execution_count": null,
953 | "metadata": {
954 | "collapsed": false
955 | },
956 | "outputs": [],
957 | "source": [
958 | "# add 1 to ham and spam counts to avoid dividing by 0\n",
959 | "tokens['ham'] = tokens.ham + 1\n",
960 | "tokens['spam'] = tokens.spam + 1\n",
961 | "tokens.sample(5, random_state=6)"
962 | ]
963 | },
964 | {
965 | "cell_type": "code",
966 | "execution_count": null,
967 | "metadata": {
968 | "collapsed": false
969 | },
970 | "outputs": [],
971 | "source": [
972 | "# convert the ham and spam counts into frequencies\n",
973 | "tokens['ham'] = tokens.ham / nb.class_count_[0]\n",
974 | "tokens['spam'] = tokens.spam / nb.class_count_[1]\n",
975 | "tokens.sample(5, random_state=6)"
976 | ]
977 | },
978 | {
979 | "cell_type": "code",
980 | "execution_count": null,
981 | "metadata": {
982 | "collapsed": false
983 | },
984 | "outputs": [],
985 | "source": [
986 | "# calculate the ratio of spam-to-ham for each token\n",
987 | "tokens['spam_ratio'] = tokens.spam / tokens.ham\n",
988 | "tokens.sample(5, random_state=6)"
989 | ]
990 | },
991 | {
992 | "cell_type": "code",
993 | "execution_count": null,
994 | "metadata": {
995 | "collapsed": false
996 | },
997 | "outputs": [],
998 | "source": [
999 | "# examine the DataFrame sorted by spam_ratio\n",
1000 | "# note: use sort() instead of sort_values() for pandas 0.16.2 and earlier\n",
1001 | "tokens.sort_values('spam_ratio', ascending=False)"
1002 | ]
1003 | },
1004 | {
1005 | "cell_type": "code",
1006 | "execution_count": null,
1007 | "metadata": {
1008 | "collapsed": false
1009 | },
1010 | "outputs": [],
1011 | "source": [
1012 | "# look up the spam_ratio for a given token\n",
1013 | "tokens.loc['dating', 'spam_ratio']"
1014 | ]
1015 | },
1016 | {
1017 | "cell_type": "markdown",
1018 | "metadata": {},
1019 | "source": [
1020 | "## Part 8: Practicing this workflow on another dataset\n",
1021 | "\n",
1022 | "Please open the **`exercise.ipynb`** notebook (or the **`exercise.py`** script)."
1023 | ]
1024 | },
1025 | {
1026 | "cell_type": "markdown",
1027 | "metadata": {},
1028 | "source": [
1029 | "## Part 9: Tuning the vectorizer (discussion)\n",
1030 | "\n",
1031 | "Thus far, we have been using the default parameters of [CountVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html):"
1032 | ]
1033 | },
1034 | {
1035 | "cell_type": "code",
1036 | "execution_count": null,
1037 | "metadata": {
1038 | "collapsed": false
1039 | },
1040 | "outputs": [],
1041 | "source": [
1042 | "# show default parameters for CountVectorizer\n",
1043 | "vect"
1044 | ]
1045 | },
1046 | {
1047 | "cell_type": "markdown",
1048 | "metadata": {},
1049 | "source": [
1050 | "However, the vectorizer is worth tuning, just like a model is worth tuning! Here are a few parameters that you might want to tune:\n",
1051 | "\n",
1052 | "- **stop_words:** string {'english'}, list, or None (default)\n",
1053 | " - If 'english', a built-in stop word list for English is used.\n",
1054 | " - If a list, that list is assumed to contain stop words, all of which will be removed from the resulting tokens.\n",
1055 | " - If None, no stop words will be used."
1056 | ]
1057 | },
1058 | {
1059 | "cell_type": "code",
1060 | "execution_count": null,
1061 | "metadata": {
1062 | "collapsed": true
1063 | },
1064 | "outputs": [],
1065 | "source": [
1066 | "# remove English stop words\n",
1067 | "vect = CountVectorizer(stop_words='english')"
1068 | ]
1069 | },
1070 | {
1071 | "cell_type": "markdown",
1072 | "metadata": {},
1073 | "source": [
1074 | "- **ngram_range:** tuple (min_n, max_n), default=(1, 1)\n",
1075 | " - The lower and upper boundary of the range of n-values for different n-grams to be extracted.\n",
1076 | " - All values of n such that min_n <= n <= max_n will be used."
1077 | ]
1078 | },
1079 | {
1080 | "cell_type": "code",
1081 | "execution_count": null,
1082 | "metadata": {
1083 | "collapsed": true
1084 | },
1085 | "outputs": [],
1086 | "source": [
1087 | "# include 1-grams and 2-grams\n",
1088 | "vect = CountVectorizer(ngram_range=(1, 2))"
1089 | ]
1090 | },
1091 | {
1092 | "cell_type": "markdown",
1093 | "metadata": {},
1094 | "source": [
1095 | "- **max_df:** float in range [0.0, 1.0] or int, default=1.0\n",
1096 | " - When building the vocabulary, ignore terms that have a document frequency strictly higher than the given threshold (corpus-specific stop words).\n",
1097 | " - If float, the parameter represents a proportion of documents.\n",
1098 | " - If integer, the parameter represents an absolute count."
1099 | ]
1100 | },
1101 | {
1102 | "cell_type": "code",
1103 | "execution_count": null,
1104 | "metadata": {
1105 | "collapsed": true
1106 | },
1107 | "outputs": [],
1108 | "source": [
1109 | "# ignore terms that appear in more than 50% of the documents\n",
1110 | "vect = CountVectorizer(max_df=0.5)"
1111 | ]
1112 | },
1113 | {
1114 | "cell_type": "markdown",
1115 | "metadata": {},
1116 | "source": [
1117 | "- **min_df:** float in range [0.0, 1.0] or int, default=1\n",
1118 | " - When building the vocabulary, ignore terms that have a document frequency strictly lower than the given threshold. (This value is also called \"cut-off\" in the literature.)\n",
1119 | " - If float, the parameter represents a proportion of documents.\n",
1120 | " - If integer, the parameter represents an absolute count."
1121 | ]
1122 | },
1123 | {
1124 | "cell_type": "code",
1125 | "execution_count": null,
1126 | "metadata": {
1127 | "collapsed": true
1128 | },
1129 | "outputs": [],
1130 | "source": [
1131 | "# only keep terms that appear in at least 2 documents\n",
1132 | "vect = CountVectorizer(min_df=2)"
1133 | ]
1134 | },
1135 | {
1136 | "cell_type": "markdown",
1137 | "metadata": {},
1138 | "source": [
1139 | "**Guidelines for tuning CountVectorizer:**\n",
1140 | "\n",
1141 | "- Use your knowledge of the **problem** and the **text**, and your understanding of the **tuning parameters**, to help you decide what parameters to tune and how to tune them.\n",
1142 | "- **Experiment**, and let the data tell you the best approach!"
1143 | ]
1144 | }
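,
{
 "cell_type": "markdown",
 "metadata": {},
 "source": [
  "One possible way to run such an experiment (a sketch that is not part of the original tutorial): wrap each candidate vectorizer and the model in a pipeline, then compare cross-validated accuracy. It assumes `X` and `y` are still the SMS messages and numeric labels defined in Part 3. The four candidate settings are arbitrary examples, not recommendations."
 ]
},
{
 "cell_type": "code",
 "execution_count": null,
 "metadata": {
  "collapsed": false
 },
 "outputs": [],
 "source": [
  "# sketch: compare a few CountVectorizer settings using 5-fold cross-validation\n",
  "from sklearn.pipeline import make_pipeline\n",
  "from sklearn.model_selection import cross_val_score\n",
  "from sklearn.feature_extraction.text import CountVectorizer\n",
  "from sklearn.naive_bayes import MultinomialNB\n",
  "\n",
  "# hypothetical candidates, chosen only to illustrate the parameters discussed above\n",
  "candidates = {\n",
  "    'default': CountVectorizer(),\n",
  "    'stop_words': CountVectorizer(stop_words='english'),\n",
  "    'bigrams': CountVectorizer(ngram_range=(1, 2)),\n",
  "    'min_df=2': CountVectorizer(min_df=2),\n",
  "}\n",
  "\n",
  "for name, candidate in candidates.items():\n",
  "    # the pipeline refits the vectorizer within each fold, so the vocabulary is learned from the training folds only\n",
  "    pipe = make_pipeline(candidate, MultinomialNB())\n",
  "    scores = cross_val_score(pipe, X, y, cv=5, scoring='accuracy')\n",
  "    print('%s: %.4f' % (name, scores.mean()))"
 ]
}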
1145 | ],
1146 | "metadata": {
1147 | "kernelspec": {
1148 | "display_name": "Python 2",
1149 | "language": "python",
1150 | "name": "python2"
1151 | },
1152 | "language_info": {
1153 | "codemirror_mode": {
1154 | "name": "ipython",
1155 | "version": 2
1156 | },
1157 | "file_extension": ".py",
1158 | "mimetype": "text/x-python",
1159 | "name": "python",
1160 | "nbconvert_exporter": "python",
1161 | "pygments_lexer": "ipython2",
1162 | "version": "2.7.11"
1163 | }
1164 | },
1165 | "nbformat": 4,
1166 | "nbformat_minor": 0
1167 | }
1168 |
--------------------------------------------------------------------------------
/exercise_solution.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# Tutorial Exercise: Yelp reviews (Solution)"
8 | ]
9 | },
10 | {
11 | "cell_type": "markdown",
12 | "metadata": {},
13 | "source": [
14 | "## Introduction\n",
15 | "\n",
16 | "This exercise uses a small subset of the data from Kaggle's [Yelp Business Rating Prediction](https://www.kaggle.com/c/yelp-recsys-2013) competition.\n",
17 | "\n",
18 | "**Description of the data:**\n",
19 | "\n",
20 | "- **`yelp.csv`** contains the dataset. It is stored in the repository (in the **`data`** directory), so there is no need to download anything from the Kaggle website.\n",
21 | "- Each observation (row) in this dataset is a review of a particular business by a particular user.\n",
22 | "- The **stars** column is the number of stars (1 through 5) assigned by the reviewer to the business. (Higher stars is better.) In other words, it is the rating of the business by the person who wrote the review.\n",
23 | "- The **text** column is the text of the review.\n",
24 | "\n",
25 | "**Goal:** Predict the star rating of a review using **only** the review text.\n",
26 | "\n",
27 | "**Tip:** After each task, I recommend that you check the shape and the contents of your objects, to confirm that they match your expectations."
28 | ]
29 | },
30 | {
31 | "cell_type": "code",
32 | "execution_count": 1,
33 | "metadata": {
34 | "collapsed": true
35 | },
36 | "outputs": [],
37 | "source": [
38 | "# for Python 2: use print only as a function\n",
39 | "from __future__ import print_function"
40 | ]
41 | },
42 | {
43 | "cell_type": "markdown",
44 | "metadata": {},
45 | "source": [
46 | "## Task 1\n",
47 | "\n",
48 | "Read **`yelp.csv`** into a pandas DataFrame and examine it."
49 | ]
50 | },
51 | {
52 | "cell_type": "code",
53 | "execution_count": 2,
54 | "metadata": {
55 | "collapsed": true
56 | },
57 | "outputs": [],
58 | "source": [
59 | "# read yelp.csv using a relative path\n",
60 | "import pandas as pd\n",
61 | "path = 'data/yelp.csv'\n",
62 | "yelp = pd.read_csv(path)"
63 | ]
64 | },
65 | {
66 | "cell_type": "code",
67 | "execution_count": 3,
68 | "metadata": {
69 | "collapsed": false
70 | },
71 | "outputs": [
72 | {
73 | "data": {
74 | "text/plain": [
75 | "(10000, 10)"
76 | ]
77 | },
78 | "execution_count": 3,
79 | "metadata": {},
80 | "output_type": "execute_result"
81 | }
82 | ],
83 | "source": [
84 | "# examine the shape\n",
85 | "yelp.shape"
86 | ]
87 | },
88 | {
89 | "cell_type": "code",
90 | "execution_count": 4,
91 | "metadata": {
92 | "collapsed": false
93 | },
94 | "outputs": [
95 | {
96 | "data": {
97 | "text/html": [
98 | "<div>\n",
99 | "<table border=\"1\" class=\"dataframe\">\n",
100 | "  <thead>\n",
101 | "    <tr style=\"text-align: right;\">\n",
102 | "      <th></th>\n",
103 | "      <th>business_id</th>\n",
104 | "      <th>date</th>\n",
105 | "      <th>review_id</th>\n",
106 | "      <th>stars</th>\n",
107 | "      <th>text</th>\n",
108 | "      <th>type</th>\n",
109 | "      <th>user_id</th>\n",
110 | "      <th>cool</th>\n",
111 | "      <th>useful</th>\n",
112 | "      <th>funny</th>\n",
113 | "    </tr>\n",
114 | "  </thead>\n",
115 | "  <tbody>\n",
116 | "    <tr>\n",
117 | "      <th>0</th>\n",
118 | "      <td>9yKzy9PApeiPPOUJEtnvkg</td>\n",
119 | "      <td>2011-01-26</td>\n",
120 | "      <td>fWKvX83p0-ka4JS3dc6E5A</td>\n",
121 | "      <td>5</td>\n",
122 | "      <td>My wife took me here on my birthday for breakf...</td>\n",
123 | "      <td>review</td>\n",
124 | "      <td>rLtl8ZkDX5vH5nAx9C3q5Q</td>\n",
125 | "      <td>2</td>\n",
126 | "      <td>5</td>\n",
127 | "      <td>0</td>\n",
128 | "    </tr>\n",
129 | "  </tbody>\n",
130 | "</table>\n",
131 | "</div>"
132 | ],
133 | "text/plain": [
134 | " business_id date review_id stars \\\n",
135 | "0 9yKzy9PApeiPPOUJEtnvkg 2011-01-26 fWKvX83p0-ka4JS3dc6E5A 5 \n",
136 | "\n",
137 | " text type \\\n",
138 | "0 My wife took me here on my birthday for breakf... review \n",
139 | "\n",
140 | " user_id cool useful funny \n",
141 | "0 rLtl8ZkDX5vH5nAx9C3q5Q 2 5 0 "
142 | ]
143 | },
144 | "execution_count": 4,
145 | "metadata": {},
146 | "output_type": "execute_result"
147 | }
148 | ],
149 | "source": [
150 | "# examine the first row\n",
151 | "yelp.head(1)"
152 | ]
153 | },
154 | {
155 | "cell_type": "code",
156 | "execution_count": 5,
157 | "metadata": {
158 | "collapsed": false
159 | },
160 | "outputs": [
161 | {
162 | "data": {
163 | "text/plain": [
164 | "1 749\n",
165 | "2 927\n",
166 | "3 1461\n",
167 | "4 3526\n",
168 | "5 3337\n",
169 | "Name: stars, dtype: int64"
170 | ]
171 | },
172 | "execution_count": 5,
173 | "metadata": {},
174 | "output_type": "execute_result"
175 | }
176 | ],
177 | "source": [
178 | "# examine the class distribution\n",
179 | "yelp.stars.value_counts().sort_index()"
180 | ]
181 | },
182 | {
183 | "cell_type": "markdown",
184 | "metadata": {},
185 | "source": [
186 | "## Task 2\n",
187 | "\n",
188 | "Create a new DataFrame that only contains the **5-star** and **1-star** reviews.\n",
189 | "\n",
190 | "- **Hint:** [How do I apply multiple filter criteria to a pandas DataFrame?](http://nbviewer.jupyter.org/github/justmarkham/pandas-videos/blob/master/pandas.ipynb#9.-How-do-I-apply-multiple-filter-criteria-to-a-pandas-DataFrame%3F-%28video%29) explains how to do this."
191 | ]
192 | },
193 | {
194 | "cell_type": "code",
195 | "execution_count": 6,
196 | "metadata": {
197 | "collapsed": true
198 | },
199 | "outputs": [],
200 | "source": [
201 | "# filter the DataFrame using an OR condition\n",
202 | "yelp_best_worst = yelp[(yelp.stars==5) | (yelp.stars==1)]\n",
203 | "\n",
204 | "# equivalently, use the 'loc' method\n",
205 | "yelp_best_worst = yelp.loc[(yelp.stars==5) | (yelp.stars==1), :]"
206 | ]
207 | },
208 | {
209 | "cell_type": "code",
210 | "execution_count": 7,
211 | "metadata": {
212 | "collapsed": false
213 | },
214 | "outputs": [
215 | {
216 | "data": {
217 | "text/plain": [
218 | "(4086, 10)"
219 | ]
220 | },
221 | "execution_count": 7,
222 | "metadata": {},
223 | "output_type": "execute_result"
224 | }
225 | ],
226 | "source": [
227 | "# examine the shape\n",
228 | "yelp_best_worst.shape"
229 | ]
230 | },
231 | {
232 | "cell_type": "markdown",
233 | "metadata": {},
234 | "source": [
235 | "## Task 3\n",
236 | "\n",
237 | "Define X and y from the new DataFrame, and then split X and y into training and testing sets, using the **review text** as the only feature and the **star rating** as the response.\n",
238 | "\n",
239 | "- **Hint:** Keep in mind that X should be a pandas Series (not a DataFrame), since we will pass it to CountVectorizer in the task that follows."
240 | ]
241 | },
242 | {
243 | "cell_type": "code",
244 | "execution_count": 8,
245 | "metadata": {
246 | "collapsed": true
247 | },
248 | "outputs": [],
249 | "source": [
250 | "# define X and y\n",
251 | "X = yelp_best_worst.text\n",
252 | "y = yelp_best_worst.stars"
253 | ]
254 | },
255 | {
256 | "cell_type": "code",
257 | "execution_count": 9,
258 | "metadata": {
259 | "collapsed": true
260 | },
261 | "outputs": [],
262 | "source": [
263 | "# split X and y into training and testing sets\n",
264 | "from sklearn.model_selection import train_test_split\n",
265 | "X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)"
266 | ]
267 | },
268 | {
269 | "cell_type": "code",
270 | "execution_count": 10,
271 | "metadata": {
272 | "collapsed": false
273 | },
274 | "outputs": [
275 | {
276 | "name": "stdout",
277 | "output_type": "stream",
278 | "text": [
279 | "(3064L,)\n",
280 | "(1022L,)\n",
281 | "(3064L,)\n",
282 | "(1022L,)\n"
283 | ]
284 | }
285 | ],
286 | "source": [
287 | "# examine the object shapes\n",
288 | "print(X_train.shape)\n",
289 | "print(X_test.shape)\n",
290 | "print(y_train.shape)\n",
291 | "print(y_test.shape)"
292 | ]
293 | },
294 | {
295 | "cell_type": "markdown",
296 | "metadata": {},
297 | "source": [
298 | "## Task 4\n",
299 | "\n",
300 | "Use CountVectorizer to create **document-term matrices** from X_train and X_test."
301 | ]
302 | },
303 | {
304 | "cell_type": "code",
305 | "execution_count": 11,
306 | "metadata": {
307 | "collapsed": true
308 | },
309 | "outputs": [],
310 | "source": [
311 | "# import and instantiate CountVectorizer\n",
312 | "from sklearn.feature_extraction.text import CountVectorizer\n",
313 | "vect = CountVectorizer()"
314 | ]
315 | },
316 | {
317 | "cell_type": "code",
318 | "execution_count": 12,
319 | "metadata": {
320 | "collapsed": false
321 | },
322 | "outputs": [
323 | {
324 | "data": {
325 | "text/plain": [
326 | "(3064, 16825)"
327 | ]
328 | },
329 | "execution_count": 12,
330 | "metadata": {},
331 | "output_type": "execute_result"
332 | }
333 | ],
334 | "source": [
335 | "# fit and transform X_train into X_train_dtm\n",
336 | "X_train_dtm = vect.fit_transform(X_train)\n",
337 | "X_train_dtm.shape"
338 | ]
339 | },
340 | {
341 | "cell_type": "code",
342 | "execution_count": 13,
343 | "metadata": {
344 | "collapsed": false
345 | },
346 | "outputs": [
347 | {
348 | "data": {
349 | "text/plain": [
350 | "(1022, 16825)"
351 | ]
352 | },
353 | "execution_count": 13,
354 | "metadata": {},
355 | "output_type": "execute_result"
356 | }
357 | ],
358 | "source": [
359 | "# transform X_test into X_test_dtm\n",
360 | "X_test_dtm = vect.transform(X_test)\n",
361 | "X_test_dtm.shape"
362 | ]
363 | },
364 | {
365 | "cell_type": "markdown",
366 | "metadata": {},
367 | "source": [
368 | "## Task 5\n",
369 | "\n",
370 | "Use multinomial Naive Bayes to **predict the star rating** for the reviews in the testing set, and then **calculate the accuracy** and **print the confusion matrix**.\n",
371 | "\n",
372 | "- **Hint:** [Evaluating a classification model](https://github.com/justmarkham/scikit-learn-videos/blob/master/09_classification_metrics.ipynb) explains how to interpret both classification accuracy and the confusion matrix."
373 | ]
374 | },
375 | {
376 | "cell_type": "code",
377 | "execution_count": 14,
378 | "metadata": {
379 | "collapsed": true
380 | },
381 | "outputs": [],
382 | "source": [
383 | "# import and instantiate MultinomialNB\n",
384 | "from sklearn.naive_bayes import MultinomialNB\n",
385 | "nb = MultinomialNB()"
386 | ]
387 | },
388 | {
389 | "cell_type": "code",
390 | "execution_count": 15,
391 | "metadata": {
392 | "collapsed": false
393 | },
394 | "outputs": [
395 | {
396 | "data": {
397 | "text/plain": [
398 | "MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)"
399 | ]
400 | },
401 | "execution_count": 15,
402 | "metadata": {},
403 | "output_type": "execute_result"
404 | }
405 | ],
406 | "source": [
407 | "# train the model using X_train_dtm\n",
408 | "nb.fit(X_train_dtm, y_train)"
409 | ]
410 | },
411 | {
412 | "cell_type": "code",
413 | "execution_count": 16,
414 | "metadata": {
415 | "collapsed": true
416 | },
417 | "outputs": [],
418 | "source": [
419 | "# make class predictions for X_test_dtm\n",
420 | "y_pred_class = nb.predict(X_test_dtm)"
421 | ]
422 | },
423 | {
424 | "cell_type": "code",
425 | "execution_count": 17,
426 | "metadata": {
427 | "collapsed": false
428 | },
429 | "outputs": [
430 | {
431 | "data": {
432 | "text/plain": [
433 | "0.91878669275929548"
434 | ]
435 | },
436 | "execution_count": 17,
437 | "metadata": {},
438 | "output_type": "execute_result"
439 | }
440 | ],
441 | "source": [
442 | "# calculate accuracy of class predictions\n",
443 | "from sklearn import metrics\n",
444 | "metrics.accuracy_score(y_test, y_pred_class)"
445 | ]
446 | },
447 | {
448 | "cell_type": "code",
449 | "execution_count": 18,
450 | "metadata": {
451 | "collapsed": false
452 | },
453 | "outputs": [
454 | {
455 | "data": {
456 | "text/plain": [
457 | "array([[126, 58],\n",
458 | " [ 25, 813]])"
459 | ]
460 | },
461 | "execution_count": 18,
462 | "metadata": {},
463 | "output_type": "execute_result"
464 | }
465 | ],
466 | "source": [
467 | "# print the confusion matrix\n",
468 | "metrics.confusion_matrix(y_test, y_pred_class)"
469 | ]
470 | },
471 | {
472 | "cell_type": "markdown",
473 | "metadata": {},
474 | "source": [
475 | "## Task 6 (Challenge)\n",
476 | "\n",
477 | "Calculate the **null accuracy**, which is the classification accuracy that could be achieved by always predicting the most frequent class.\n",
478 | "\n",
479 | "- **Hint:** [Evaluating a classification model](https://github.com/justmarkham/scikit-learn-videos/blob/master/09_classification_metrics.ipynb) explains null accuracy and demonstrates two ways to calculate it, though only one of those ways will work in this case. Alternatively, you can come up with your own method to calculate null accuracy!"
480 | ]
481 | },
482 | {
483 | "cell_type": "code",
484 | "execution_count": 19,
485 | "metadata": {
486 | "collapsed": false
487 | },
488 | "outputs": [
489 | {
490 | "data": {
491 | "text/plain": [
492 | "5 838\n",
493 | "1 184\n",
494 | "Name: stars, dtype: int64"
495 | ]
496 | },
497 | "execution_count": 19,
498 | "metadata": {},
499 | "output_type": "execute_result"
500 | }
501 | ],
502 | "source": [
503 | "# examine the class distribution of the testing set\n",
504 | "y_test.value_counts()"
505 | ]
506 | },
507 | {
508 | "cell_type": "code",
509 | "execution_count": 20,
510 | "metadata": {
511 | "collapsed": false
512 | },
513 | "outputs": [
514 | {
515 | "data": {
516 | "text/plain": [
517 | "5 0.819961\n",
518 | "Name: stars, dtype: float64"
519 | ]
520 | },
521 | "execution_count": 20,
522 | "metadata": {},
523 | "output_type": "execute_result"
524 | }
525 | ],
526 | "source": [
527 | "# calculate null accuracy\n",
528 | "y_test.value_counts().head(1) / float(len(y_test))"
529 | ]
530 | },
531 | {
532 | "cell_type": "code",
533 | "execution_count": 21,
534 | "metadata": {
535 | "collapsed": false
536 | },
537 | "outputs": [
538 | {
539 | "data": {
540 | "text/plain": [
541 | "0.8199608610567515"
542 | ]
543 | },
544 | "execution_count": 21,
545 | "metadata": {},
546 | "output_type": "execute_result"
547 | }
548 | ],
549 | "source": [
550 | "# calculate null accuracy manually\n",
551 | "838 / float(838 + 184)"
552 | ]
553 | },
554 | {
555 | "cell_type": "markdown",
556 | "metadata": {},
557 | "source": [
558 | "## Task 7 (Challenge)\n",
559 | "\n",
560 | "Browse through the review text of some of the **false positives** and **false negatives**. Based on your knowledge of how Naive Bayes works, do you have any ideas about why the model is incorrectly classifying these reviews?\n",
561 | "\n",
562 | "- **Hint:** [Evaluating a classification model](https://github.com/justmarkham/scikit-learn-videos/blob/master/09_classification_metrics.ipynb) explains the definitions of \"false positives\" and \"false negatives\".\n",
563 | "- **Hint:** Think about what a false positive means in this context, and what a false negative means in this context. What has scikit-learn defined as the \"positive class\"?"
564 | ]
565 | },
566 | {
567 | "cell_type": "code",
568 | "execution_count": 22,
569 | "metadata": {
570 | "collapsed": false
571 | },
572 | "outputs": [
573 | {
574 | "data": {
575 | "text/plain": [
576 | "2175 This has to be the worst restaurant in terms o...\n",
577 | "1781 If you like the stuck up Scottsdale vibe this ...\n",
578 | "2674 I'm sorry to be what seems to be the lone one ...\n",
579 | "9984 Went last night to Whore Foods to get basics t...\n",
580 | "3392 I found Lisa G's while driving through phoenix...\n",
581 | "8283 Don't know where I should start. Grand opening...\n",
582 | "2765 Went last week, and ordered a dozen variety. I...\n",
583 | "2839 Never Again,\\r\\nI brought my Mountain Bike in ...\n",
584 | "321 My wife and I live around the corner, hadn't e...\n",
585 | "1919 D-scust-ing.\n",
586 | "Name: text, dtype: object"
587 | ]
588 | },
589 | "execution_count": 22,
590 | "metadata": {},
591 | "output_type": "execute_result"
592 | }
593 | ],
594 | "source": [
595 | "# first 10 false positives (1-star reviews incorrectly classified as 5-star reviews)\n",
596 | "X_test[y_test < y_pred_class].head(10)"
597 | ]
598 | },
599 | {
600 | "cell_type": "code",
601 | "execution_count": 23,
602 | "metadata": {
603 | "collapsed": false
604 | },
605 | "outputs": [
606 | {
607 | "data": {
608 | "text/plain": [
609 | "\"If you like the stuck up Scottsdale vibe this is a good place for you. The food isn't impressive. Nice outdoor seating.\""
610 | ]
611 | },
612 | "execution_count": 23,
613 | "metadata": {},
614 | "output_type": "execute_result"
615 | }
616 | ],
617 | "source": [
618 | "# false positive: model is reacting to the words \"good\", \"impressive\", \"nice\"\n",
619 | "X_test[1781]"
620 | ]
621 | },
622 | {
623 | "cell_type": "code",
624 | "execution_count": 24,
625 | "metadata": {
626 | "collapsed": false
627 | },
628 | "outputs": [
629 | {
630 | "data": {
631 | "text/plain": [
632 | "'D-scust-ing.'"
633 | ]
634 | },
635 | "execution_count": 24,
636 | "metadata": {},
637 | "output_type": "execute_result"
638 | }
639 | ],
640 | "source": [
641 | "# false positive: model does not have enough data to work with\n",
642 | "X_test[1919]"
643 | ]
644 | },
645 | {
646 | "cell_type": "code",
647 | "execution_count": 25,
648 | "metadata": {
649 | "collapsed": false
650 | },
651 | "outputs": [
652 | {
653 | "data": {
654 | "text/plain": [
655 | "7148 I now consider myself an Arizonian. If you dri...\n",
656 | "4963 This is by far my favourite department store, ...\n",
657 | "6318 Since I have ranted recently on poor customer ...\n",
658 | "380 This is a must try for any Mani Pedi fan. I us...\n",
659 | "5565 I`ve had work done by this shop a few times th...\n",
660 | "3448 I was there last week with my sisters and whil...\n",
661 | "6050 I went to sears today to check on a layaway th...\n",
662 | "2504 I've passed by prestige nails in walmart 100s ...\n",
663 | "2475 This place is so great! I am a nanny and had t...\n",
664 | "241 I was sad to come back to lai lai's and they n...\n",
665 | "Name: text, dtype: object"
666 | ]
667 | },
668 | "execution_count": 25,
669 | "metadata": {},
670 | "output_type": "execute_result"
671 | }
672 | ],
673 | "source": [
674 | "# first 10 false negatives (5-star reviews incorrectly classified as 1-star reviews)\n",
675 | "X_test[y_test > y_pred_class].head(10)"
676 | ]
677 | },
678 | {
679 | "cell_type": "code",
680 | "execution_count": 26,
681 | "metadata": {
682 | "collapsed": false
683 | },
684 | "outputs": [
685 | {
686 | "data": {
687 | "text/plain": [
688 | "'This is by far my favourite department store, hands down. I have had nothing but perfect experiences in this store, without exception, no matter what department I\\'m in. The shoe SA\\'s will bend over backwards to help you find a specific shoe, and the staff will even go so far as to send out hand-written thank you cards to your home address after you make a purchase - big or small. Tim & Anthony in the shoe salon are fabulous beyond words! \\r\\n\\r\\nI am not completely sure that I understand why people complain about the amount of merchandise on the floor or the lack of crowds in this store. Frankly, I would rather not be bombarded with merchandise and other people. One of the things I love the most about Barney\\'s is not only the prompt attention of SA\\'s, but the fact that they aren\\'t rushing around trying to help 35 people at once. The SA\\'s at Barney\\'s are incredibly friendly and will stop to have an actual conversation, regardless or whether you are purchasing something or not. I have also never experienced a \"high pressure\" sale situation here.\\r\\n\\r\\nAll in all, Barneys is pricey, and there is no getting around it. But, um, so is Neiman\\'s and that place is a crock. Anywhere that ONLY accepts American Express or their charge card and then treats you like scum if you aren\\'t carrying neither is no place that I want to spend my hard earned dollars. Yay Barneys!'"
689 | ]
690 | },
691 | "execution_count": 26,
692 | "metadata": {},
693 | "output_type": "execute_result"
694 | }
695 | ],
696 | "source": [
697 | "# false negative: model is reacting to the words \"complain\", \"crowds\", \"rushing\", \"pricey\", \"scum\"\n",
698 | "X_test[4963]"
699 | ]
700 | },
701 | {
702 | "cell_type": "markdown",
703 | "metadata": {},
704 | "source": [
705 | "## Task 8 (Challenge)\n",
706 | "\n",
707 | "Calculate which 10 tokens are the most predictive of **5-star reviews**, and which 10 tokens are the most predictive of **1-star reviews**.\n",
708 | "\n",
709 | "- **Hint:** Naive Bayes automatically counts the number of times each token appears in each class, as well as the number of observations in each class. You can access these counts via the `feature_count_` and `class_count_` attributes of the Naive Bayes model object."
710 | ]
711 | },
712 | {
713 | "cell_type": "code",
714 | "execution_count": 27,
715 | "metadata": {
716 | "collapsed": false
717 | },
718 | "outputs": [
719 | {
720 | "data": {
721 | "text/plain": [
722 | "16825"
723 | ]
724 | },
725 | "execution_count": 27,
726 | "metadata": {},
727 | "output_type": "execute_result"
728 | }
729 | ],
730 | "source": [
731 | "# store the vocabulary of X_train (note: use get_feature_names_out() for scikit-learn 1.0 and later)\n",
732 | "X_train_tokens = vect.get_feature_names()\n",
733 | "len(X_train_tokens)"
734 | ]
735 | },
736 | {
737 | "cell_type": "code",
738 | "execution_count": 28,
739 | "metadata": {
740 | "collapsed": false
741 | },
742 | "outputs": [
743 | {
744 | "data": {
745 | "text/plain": [
746 | "(2L, 16825L)"
747 | ]
748 | },
749 | "execution_count": 28,
750 | "metadata": {},
751 | "output_type": "execute_result"
752 | }
753 | ],
754 | "source": [
755 | "# first row is one-star reviews, second row is five-star reviews\n",
756 | "nb.feature_count_.shape"
757 | ]
758 | },
759 | {
760 | "cell_type": "code",
761 | "execution_count": 29,
762 | "metadata": {
763 | "collapsed": true
764 | },
765 | "outputs": [],
766 | "source": [
767 | "# store the number of times each token appears across each class\n",
768 | "one_star_token_count = nb.feature_count_[0, :]\n",
769 | "five_star_token_count = nb.feature_count_[1, :]"
770 | ]
771 | },
772 | {
773 | "cell_type": "code",
774 | "execution_count": 30,
775 | "metadata": {
776 | "collapsed": true
777 | },
778 | "outputs": [],
779 | "source": [
780 | "# create a DataFrame of tokens with their separate one-star and five-star counts\n",
781 | "tokens = pd.DataFrame({'token':X_train_tokens, 'one_star':one_star_token_count, 'five_star':five_star_token_count}).set_index('token')"
782 | ]
783 | },
784 | {
785 | "cell_type": "code",
786 | "execution_count": 31,
787 | "metadata": {
788 | "collapsed": true
789 | },
790 | "outputs": [],
791 | "source": [
792 | "# add 1 to one-star and five-star counts to avoid dividing by 0\n",
793 | "tokens['one_star'] = tokens.one_star + 1\n",
794 | "tokens['five_star'] = tokens.five_star + 1"
795 | ]
796 | },
797 | {
798 | "cell_type": "code",
799 | "execution_count": 32,
800 | "metadata": {
801 | "collapsed": false
802 | },
803 | "outputs": [
804 | {
805 | "data": {
806 | "text/plain": [
807 | "array([ 565., 2499.])"
808 | ]
809 | },
810 | "execution_count": 32,
811 | "metadata": {},
812 | "output_type": "execute_result"
813 | }
814 | ],
815 | "source": [
816 | "# first number is one-star reviews, second number is five-star reviews\n",
817 | "nb.class_count_"
818 | ]
819 | },
820 | {
821 | "cell_type": "code",
822 | "execution_count": 33,
823 | "metadata": {
824 | "collapsed": true
825 | },
826 | "outputs": [],
827 | "source": [
828 | "# convert the one-star and five-star counts into frequencies\n",
829 | "tokens['one_star'] = tokens.one_star / nb.class_count_[0]\n",
830 | "tokens['five_star'] = tokens.five_star / nb.class_count_[1]"
831 | ]
832 | },
833 | {
834 | "cell_type": "code",
835 | "execution_count": 34,
836 | "metadata": {
837 | "collapsed": true
838 | },
839 | "outputs": [],
840 | "source": [
841 | "# calculate the ratio of five-star to one-star for each token\n",
842 | "tokens['five_star_ratio'] = tokens.five_star / tokens.one_star"
843 | ]
844 | },
845 | {
846 | "cell_type": "code",
847 | "execution_count": 35,
848 | "metadata": {
849 | "collapsed": false
850 | },
851 | "outputs": [
852 | {
853 | "data": {
854 | "text/html": [
855 | "<div>\n",
856 | "<table border=\"1\" class=\"dataframe\">\n",
857 | "  <thead>\n",
858 | "    <tr style=\"text-align: right;\">\n",
859 | "      <th></th>\n",
860 | "      <th>five_star</th>\n",
861 | "      <th>one_star</th>\n",
862 | "      <th>five_star_ratio</th>\n",
863 | "    </tr>\n",
864 | "    <tr>\n",
865 | "      <th>token</th>\n",
866 | "      <th></th>\n",
867 | "      <th></th>\n",
868 | "      <th></th>\n",
869 | "    </tr>\n",
870 | "  </thead>\n",
871 | "  <tbody>\n",
872 | "    <tr>\n",
873 | "      <th>fantastic</th>\n",
874 | "      <td>0.077231</td>\n",
875 | "      <td>0.003540</td>\n",
876 | "      <td>21.817727</td>\n",
877 | "    </tr>\n",
878 | "    <tr>\n",
879 | "      <th>perfect</th>\n",
880 | "      <td>0.098039</td>\n",
881 | "      <td>0.005310</td>\n",
882 | "      <td>18.464052</td>\n",
883 | "    </tr>\n",
884 | "    <tr>\n",
885 | "      <th>yum</th>\n",
886 | "      <td>0.024810</td>\n",
887 | "      <td>0.001770</td>\n",
888 | "      <td>14.017607</td>\n",
889 | "    </tr>\n",
890 | "    <tr>\n",
891 | "      <th>favorite</th>\n",
892 | "      <td>0.138055</td>\n",
893 | "      <td>0.012389</td>\n",
894 | "      <td>11.143029</td>\n",
895 | "    </tr>\n",
896 | "    <tr>\n",
897 | "      <th>outstanding</th>\n",
898 | "      <td>0.019608</td>\n",
899 | "      <td>0.001770</td>\n",
900 | "      <td>11.078431</td>\n",
901 | "    </tr>\n",
902 | "    <tr>\n",
903 | "      <th>brunch</th>\n",
904 | "      <td>0.016807</td>\n",
905 | "      <td>0.001770</td>\n",
906 | "      <td>9.495798</td>\n",
907 | "    </tr>\n",
908 | "    <tr>\n",
909 | "      <th>gem</th>\n",
910 | "      <td>0.016006</td>\n",
911 | "      <td>0.001770</td>\n",
912 | "      <td>9.043617</td>\n",
913 | "    </tr>\n",
914 | "    <tr>\n",
915 | "      <th>mozzarella</th>\n",
916 | "      <td>0.015606</td>\n",
917 | "      <td>0.001770</td>\n",
918 | "      <td>8.817527</td>\n",
919 | "    </tr>\n",
920 | "    <tr>\n",
921 | "      <th>pasty</th>\n",
922 | "      <td>0.015606</td>\n",
923 | "      <td>0.001770</td>\n",
924 | "      <td>8.817527</td>\n",
925 | "    </tr>\n",
926 | "    <tr>\n",
927 | "      <th>amazing</th>\n",
928 | "      <td>0.185274</td>\n",
929 | "      <td>0.021239</td>\n",
930 | "      <td>8.723323</td>\n",
931 | "    </tr>\n",
932 | "  </tbody>\n",
933 | "</table>\n",
934 | "</div>"
935 | ],
936 | "text/plain": [
937 | " five_star one_star five_star_ratio\n",
938 | "token \n",
939 | "fantastic 0.077231 0.003540 21.817727\n",
940 | "perfect 0.098039 0.005310 18.464052\n",
941 | "yum 0.024810 0.001770 14.017607\n",
942 | "favorite 0.138055 0.012389 11.143029\n",
943 | "outstanding 0.019608 0.001770 11.078431\n",
944 | "brunch 0.016807 0.001770 9.495798\n",
945 | "gem 0.016006 0.001770 9.043617\n",
946 | "mozzarella 0.015606 0.001770 8.817527\n",
947 | "pasty 0.015606 0.001770 8.817527\n",
948 | "amazing 0.185274 0.021239 8.723323"
949 | ]
950 | },
951 | "execution_count": 35,
952 | "metadata": {},
953 | "output_type": "execute_result"
954 | }
955 | ],
956 | "source": [
957 | "# sort the DataFrame by five_star_ratio (descending order), and examine the first 10 rows\n",
958 | "# note: use sort() instead of sort_values() for pandas 0.16.2 and earlier\n",
959 | "tokens.sort_values('five_star_ratio', ascending=False).head(10)"
960 | ]
961 | },
962 | {
963 | "cell_type": "code",
964 | "execution_count": 36,
965 | "metadata": {
966 | "collapsed": false
967 | },
968 | "outputs": [
969 | {
970 | "data": {
971 | "text/html": [
972 | "<div>\n",
973 | "<table border=\"1\" class=\"dataframe\">\n",
974 | "  <thead>\n",
975 | "    <tr style=\"text-align: right;\">\n",
976 | "      <th></th>\n",
977 | "      <th>five_star</th>\n",
978 | "      <th>one_star</th>\n",
979 | "      <th>five_star_ratio</th>\n",
980 | "    </tr>\n",
981 | "    <tr>\n",
982 | "      <th>token</th>\n",
983 | "      <th></th>\n",
984 | "      <th></th>\n",
985 | "      <th></th>\n",
986 | "    </tr>\n",
987 | "  </thead>\n",
988 | "  <tbody>\n",
989 | "    <tr>\n",
990 | "      <th>staffperson</th>\n",
991 | "      <td>0.0004</td>\n",
992 | "      <td>0.030088</td>\n",
993 | "      <td>0.013299</td>\n",
994 | "    </tr>\n",
995 | "    <tr>\n",
996 | "      <th>refused</th>\n",
997 | "      <td>0.0004</td>\n",
998 | "      <td>0.024779</td>\n",
999 | "      <td>0.016149</td>\n",
1000 | "    </tr>\n",
1001 | "    <tr>\n",
1002 | "      <th>disgusting</th>\n",
1003 | "      <td>0.0008</td>\n",
1004 | "      <td>0.042478</td>\n",
1005 | "      <td>0.018841</td>\n",
1006 | "    </tr>\n",
1007 | "    <tr>\n",
1008 | "      <th>filthy</th>\n",
1009 | "      <td>0.0004</td>\n",
1010 | "      <td>0.019469</td>\n",
1011 | "      <td>0.020554</td>\n",
1012 | "    </tr>\n",
1013 | "    <tr>\n",
1014 | "      <th>unprofessional</th>\n",
1015 | "      <td>0.0004</td>\n",
1016 | "      <td>0.015929</td>\n",
1017 | "      <td>0.025121</td>\n",
1018 | "    </tr>\n",
1019 | "    <tr>\n",
1020 | "      <th>unacceptable</th>\n",
1021 | "      <td>0.0004</td>\n",
1022 | "      <td>0.015929</td>\n",
1023 | "      <td>0.025121</td>\n",
1024 | "    </tr>\n",
1025 | "    <tr>\n",
1026 | "      <th>acknowledge</th>\n",
1027 | "      <td>0.0004</td>\n",
1028 | "      <td>0.015929</td>\n",
1029 | "      <td>0.025121</td>\n",
1030 | "    </tr>\n",
1031 | "    <tr>\n",
1032 | "      <th>ugh</th>\n",
1033 | "      <td>0.0008</td>\n",
1034 | "      <td>0.030088</td>\n",
1035 | "      <td>0.026599</td>\n",
1036 | "    </tr>\n",
1037 | "    <tr>\n",
1038 | "      <th>fuse</th>\n",
1039 | "      <td>0.0004</td>\n",
1040 | "      <td>0.014159</td>\n",
1041 | "      <td>0.028261</td>\n",
1042 | "    </tr>\n",
1043 | "    <tr>\n",
1044 | "      <th>boca</th>\n",
1045 | "      <td>0.0004</td>\n",
1046 | "      <td>0.014159</td>\n",
1047 | "      <td>0.028261</td>\n",
1048 | "    </tr>\n",
1049 | "  </tbody>\n",
1050 | "</table>\n",
1051 | "</div>"
1052 | ],
1053 | "text/plain": [
1054 | " five_star one_star five_star_ratio\n",
1055 | "token \n",
1056 | "staffperson 0.0004 0.030088 0.013299\n",
1057 | "refused 0.0004 0.024779 0.016149\n",
1058 | "disgusting 0.0008 0.042478 0.018841\n",
1059 | "filthy 0.0004 0.019469 0.020554\n",
1060 | "unprofessional 0.0004 0.015929 0.025121\n",
1061 | "unacceptable 0.0004 0.015929 0.025121\n",
1062 | "acknowledge 0.0004 0.015929 0.025121\n",
1063 | "ugh 0.0008 0.030088 0.026599\n",
1064 | "fuse 0.0004 0.014159 0.028261\n",
1065 | "boca 0.0004 0.014159 0.028261"
1066 | ]
1067 | },
1068 | "execution_count": 36,
1069 | "metadata": {},
1070 | "output_type": "execute_result"
1071 | }
1072 | ],
1073 | "source": [
1074 | "# sort the DataFrame by five_star_ratio (ascending order), and examine the first 10 rows\n",
1075 | "tokens.sort_values('five_star_ratio', ascending=True).head(10)"
1076 | ]
1077 | },
1078 | {
1079 | "cell_type": "markdown",
1080 | "metadata": {},
1081 | "source": [
1082 | "## Task 9 (Challenge)\n",
1083 | "\n",
1084 | "Up to this point, we have framed this as a **binary classification problem** by only considering the 5-star and 1-star reviews. Now, let's repeat the model building process using all reviews, which makes this a **5-class classification problem**.\n",
1085 | "\n",
1086 | "Here are the steps:\n",
1087 | "\n",
1088 | "- Define X and y using the original DataFrame. (y should contain 5 different classes.)\n",
1089 | "- Split X and y into training and testing sets.\n",
1090 | "- Create document-term matrices using CountVectorizer.\n",
1091 | "- Calculate the testing accuracy of a Multinomial Naive Bayes model.\n",
1092 | "- Compare the testing accuracy with the null accuracy, and comment on the results.\n",
1093 | "- Print the confusion matrix, and comment on the results. (This [Stack Overflow answer](http://stackoverflow.com/a/30748053/1636598) explains how to read a multi-class confusion matrix.)\n",
1094 | "- Print the [classification report](http://scikit-learn.org/stable/modules/model_evaluation.html#classification-report), and comment on the results. If you are unfamiliar with the terminology it uses, research the terms, and then try to figure out how to calculate these metrics manually from the confusion matrix!"
1095 | ]
1096 | },
1097 | {
1098 | "cell_type": "code",
1099 | "execution_count": 37,
1100 | "metadata": {
1101 | "collapsed": true
1102 | },
1103 | "outputs": [],
1104 | "source": [
1105 | "# define X and y using the original DataFrame\n",
1106 | "X = yelp.text\n",
1107 | "y = yelp.stars"
1108 | ]
1109 | },
1110 | {
1111 | "cell_type": "code",
1112 | "execution_count": 38,
1113 | "metadata": {
1114 | "collapsed": false
1115 | },
1116 | "outputs": [
1117 | {
1118 | "data": {
1119 | "text/plain": [
1120 | "1 749\n",
1121 | "2 927\n",
1122 | "3 1461\n",
1123 | "4 3526\n",
1124 | "5 3337\n",
1125 | "Name: stars, dtype: int64"
1126 | ]
1127 | },
1128 | "execution_count": 38,
1129 | "metadata": {},
1130 | "output_type": "execute_result"
1131 | }
1132 | ],
1133 | "source": [
1134 | "# check that y contains 5 different classes\n",
1135 | "y.value_counts().sort_index()"
1136 | ]
1137 | },
1138 | {
1139 | "cell_type": "code",
1140 | "execution_count": 39,
1141 | "metadata": {
1142 | "collapsed": true
1143 | },
1144 | "outputs": [],
1145 | "source": [
1146 | "# split X and y into training and testing sets\n",
1147 | "X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)"
1148 | ]
1149 | },
1150 | {
1151 | "cell_type": "code",
1152 | "execution_count": 40,
1153 | "metadata": {
1154 | "collapsed": true
1155 | },
1156 | "outputs": [],
1157 | "source": [
1158 | "# create document-term matrices using CountVectorizer\n",
1159 | "X_train_dtm = vect.fit_transform(X_train)\n",
1160 | "X_test_dtm = vect.transform(X_test)"
1161 | ]
1162 | },
1163 | {
1164 | "cell_type": "code",
1165 | "execution_count": 41,
1166 | "metadata": {
1167 | "collapsed": false
1168 | },
1169 | "outputs": [
1170 | {
1171 | "data": {
1172 | "text/plain": [
1173 | "MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)"
1174 | ]
1175 | },
1176 | "execution_count": 41,
1177 | "metadata": {},
1178 | "output_type": "execute_result"
1179 | }
1180 | ],
1181 | "source": [
1182 | "# fit a Multinomial Naive Bayes model\n",
1183 | "nb.fit(X_train_dtm, y_train)"
1184 | ]
1185 | },
1186 | {
1187 | "cell_type": "code",
1188 | "execution_count": 42,
1189 | "metadata": {
1190 | "collapsed": true
1191 | },
1192 | "outputs": [],
1193 | "source": [
1194 | "# make class predictions\n",
1195 | "y_pred_class = nb.predict(X_test_dtm)"
1196 | ]
1197 | },
1198 | {
1199 | "cell_type": "code",
1200 | "execution_count": 43,
1201 | "metadata": {
1202 | "collapsed": false
1203 | },
1204 | "outputs": [
1205 | {
1206 | "data": {
1207 | "text/plain": [
1208 | "0.47120000000000001"
1209 | ]
1210 | },
1211 | "execution_count": 43,
1212 | "metadata": {},
1213 | "output_type": "execute_result"
1214 | }
1215 | ],
1216 | "source": [
1217 | "# calculate the accuracy\n",
1218 | "metrics.accuracy_score(y_test, y_pred_class)"
1219 | ]
1220 | },
1221 | {
1222 | "cell_type": "code",
1223 | "execution_count": 44,
1224 | "metadata": {
1225 | "collapsed": false
1226 | },
1227 | "outputs": [
1228 | {
1229 | "data": {
1230 | "text/plain": [
1231 | "4 0.3536\n",
1232 | "Name: stars, dtype: float64"
1233 | ]
1234 | },
1235 | "execution_count": 44,
1236 | "metadata": {},
1237 | "output_type": "execute_result"
1238 | }
1239 | ],
1240 | "source": [
1241 | "# calculate the null accuracy\n",
1242 | "y_test.value_counts().head(1) / float(len(y_test))"
1243 | ]
1244 | },
1245 | {
1246 | "cell_type": "markdown",
1247 | "metadata": {},
1248 | "source": [
1249 | "**Accuracy comments:** At first glance, 47% accuracy does not seem very good, given that it is only about 12 percentage points higher than the null accuracy of 35%. However, I would consider the 47% accuracy to be quite impressive, given that humans would also have a hard time precisely identifying the star rating for many of these reviews."
1250 | ]
1251 | },
1252 | {
1253 | "cell_type": "code",
1254 | "execution_count": 45,
1255 | "metadata": {
1256 | "collapsed": false
1257 | },
1258 | "outputs": [
1259 | {
1260 | "data": {
1261 | "text/plain": [
1262 | "array([[ 55, 14, 24, 65, 27],\n",
1263 | " [ 28, 16, 41, 122, 27],\n",
1264 | " [ 5, 7, 35, 281, 37],\n",
1265 | " [ 7, 0, 16, 629, 232],\n",
1266 | " [ 6, 4, 6, 373, 443]])"
1267 | ]
1268 | },
1269 | "execution_count": 45,
1270 | "metadata": {},
1271 | "output_type": "execute_result"
1272 | }
1273 | ],
1274 | "source": [
1275 | "# print the confusion matrix\n",
1276 | "metrics.confusion_matrix(y_test, y_pred_class)"
1277 | ]
1278 | },
1279 | {
1280 | "cell_type": "markdown",
1281 | "metadata": {},
1282 | "source": [
1283 | "**Confusion matrix comments:**\n",
1284 | "\n",
1285 | "- Nearly all 4-star and 5-star reviews are classified as 4 or 5 stars, but the model has a hard time distinguishing between the two.\n",
1286 | "- 1-star, 2-star, and 3-star reviews are most commonly classified as 4 stars, probably because it's the predominant class in the training data."
1287 | ]
1288 | },
1289 | {
1290 | "cell_type": "code",
1291 | "execution_count": 46,
1292 | "metadata": {
1293 | "collapsed": false
1294 | },
1295 | "outputs": [
1296 | {
1297 | "name": "stdout",
1298 | "output_type": "stream",
1299 | "text": [
1300 | " precision recall f1-score support\n",
1301 | "\n",
1302 | " 1 0.54 0.30 0.38 185\n",
1303 | " 2 0.39 0.07 0.12 234\n",
1304 | " 3 0.29 0.10 0.14 365\n",
1305 | " 4 0.43 0.71 0.53 884\n",
1306 | " 5 0.58 0.53 0.55 832\n",
1307 | "\n",
1308 | "avg / total 0.46 0.47 0.43 2500\n",
1309 | "\n"
1310 | ]
1311 | }
1312 | ],
1313 | "source": [
1314 | "# print the classification report\n",
1315 | "print(metrics.classification_report(y_test, y_pred_class))"
1316 | ]
1317 | },
1318 | {
1319 | "cell_type": "markdown",
1320 | "metadata": {},
1321 | "source": [
1322 | "**Precision** answers the question: \"When a given class is predicted, how often are those predictions correct?\" To calculate the precision for class 1, for example, you divide 55 by the sum of the first column of the confusion matrix."
1323 | ]
1324 | },
1325 | {
1326 | "cell_type": "code",
1327 | "execution_count": 47,
1328 | "metadata": {
1329 | "collapsed": false
1330 | },
1331 | "outputs": [
1332 | {
1333 | "name": "stdout",
1334 | "output_type": "stream",
1335 | "text": [
1336 | "0.544554455446\n"
1337 | ]
1338 | }
1339 | ],
1340 | "source": [
1341 | "# manually calculate the precision for class 1\n",
1342 | "precision = 55 / float(55 + 28 + 5 + 7 + 6)\n",
1343 | "print(precision)"
1344 | ]
1345 | },
1346 | {
1347 | "cell_type": "markdown",
1348 | "metadata": {},
1349 | "source": [
1350 | "**Recall** answers the question: \"When a given class is the true class, how often is that class predicted?\" To calculate the recall for class 1, for example, you divide 55 by the sum of the first row of the confusion matrix."
1351 | ]
1352 | },
1353 | {
1354 | "cell_type": "code",
1355 | "execution_count": 48,
1356 | "metadata": {
1357 | "collapsed": false
1358 | },
1359 | "outputs": [
1360 | {
1361 | "name": "stdout",
1362 | "output_type": "stream",
1363 | "text": [
1364 | "0.297297297297\n"
1365 | ]
1366 | }
1367 | ],
1368 | "source": [
1369 | "# manually calculate the recall for class 1\n",
1370 | "recall = 55 / float(55 + 14 + 24 + 65 + 27)\n",
1371 | "print(recall)"
1372 | ]
1373 | },
1374 | {
1375 | "cell_type": "markdown",
1376 | "metadata": {},
1377 | "source": [
1378 | "**F1 score** is the harmonic mean of precision and recall."
1379 | ]
1380 | },
1381 | {
1382 | "cell_type": "code",
1383 | "execution_count": 49,
1384 | "metadata": {
1385 | "collapsed": false
1386 | },
1387 | "outputs": [
1388 | {
1389 | "name": "stdout",
1390 | "output_type": "stream",
1391 | "text": [
1392 | "0.384615384615\n"
1393 | ]
1394 | }
1395 | ],
1396 | "source": [
1397 | "# manually calculate the F1 score for class 1\n",
1398 | "f1 = 2 * (precision * recall) / (precision + recall)\n",
1399 | "print(f1)"
1400 | ]
1401 | },
1402 | {
1403 | "cell_type": "markdown",
1404 | "metadata": {},
1405 | "source": [
1406 | "**Support** answers the question: \"How many observations exist for which a given class is the true class?\" To calculate the support for class 1, for example, you sum the first row of the confusion matrix."
1407 | ]
1408 | },
1409 | {
1410 | "cell_type": "code",
1411 | "execution_count": 50,
1412 | "metadata": {
1413 | "collapsed": false
1414 | },
1415 | "outputs": [
1416 | {
1417 | "name": "stdout",
1418 | "output_type": "stream",
1419 | "text": [
1420 | "185\n"
1421 | ]
1422 | }
1423 | ],
1424 | "source": [
1425 | "# manually calculate the support for class 1\n",
1426 | "support = 55 + 14 + 24 + 65 + 27\n",
1427 | "print(support)"
1428 | ]
1429 | },
1430 | {
1431 | "cell_type": "markdown",
1432 | "metadata": {},
1433 | "source": [
1434 | "**Classification report comments:**\n",
1435 | "\n",
1436 | "- Class 1 has low recall (0.30), meaning that the model has a hard time detecting the 1-star reviews, but comparatively high precision (0.54), meaning that when the model predicts a review is 1-star, it is correct slightly more than half the time.\n",
1437 | "- Class 5 has comparatively high recall (0.53) and precision (0.58), probably because 5-star reviews have polarized language, and because the model has a lot of observations to learn from."
1438 | ]
1439 | }
1440 | ],
1441 | "metadata": {
1442 | "kernelspec": {
1443 | "display_name": "Python 2",
1444 | "language": "python",
1445 | "name": "python2"
1446 | },
1447 | "language_info": {
1448 | "codemirror_mode": {
1449 | "name": "ipython",
1450 | "version": 2
1451 | },
1452 | "file_extension": ".py",
1453 | "mimetype": "text/x-python",
1454 | "name": "python",
1455 | "nbconvert_exporter": "python",
1456 | "pygments_lexer": "ipython2",
1457 | "version": "2.7.11"
1458 | }
1459 | },
1460 | "nbformat": 4,
1461 | "nbformat_minor": 0
1462 | }
1463 |
--------------------------------------------------------------------------------
/tutorial_with_output.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# Tutorial: Machine Learning with Text in scikit-learn"
8 | ]
9 | },
10 | {
11 | "cell_type": "markdown",
12 | "metadata": {},
13 | "source": [
14 | "## Agenda\n",
15 | "\n",
16 | "1. Model building in scikit-learn (refresher)\n",
17 | "2. Representing text as numerical data\n",
18 | "3. Reading a text-based dataset into pandas\n",
19 | "4. Vectorizing our dataset\n",
20 | "5. Building and evaluating a model\n",
21 | "6. Comparing models\n",
22 | "7. Examining a model for further insight\n",
23 | "8. Practicing this workflow on another dataset\n",
24 | "9. Tuning the vectorizer (discussion)"
25 | ]
26 | },
27 | {
28 | "cell_type": "code",
29 | "execution_count": 1,
30 | "metadata": {
31 | "collapsed": false
32 | },
33 | "outputs": [],
34 | "source": [
35 | "# for Python 2: use print only as a function\n",
36 | "from __future__ import print_function"
37 | ]
38 | },
39 | {
40 | "cell_type": "markdown",
41 | "metadata": {},
42 | "source": [
43 | "## Part 1: Model building in scikit-learn (refresher)"
44 | ]
45 | },
46 | {
47 | "cell_type": "code",
48 | "execution_count": 2,
49 | "metadata": {
50 | "collapsed": true
51 | },
52 | "outputs": [],
53 | "source": [
54 | "# load the iris dataset as an example\n",
55 | "from sklearn.datasets import load_iris\n",
56 | "iris = load_iris()"
57 | ]
58 | },
59 | {
60 | "cell_type": "code",
61 | "execution_count": 3,
62 | "metadata": {
63 | "collapsed": true
64 | },
65 | "outputs": [],
66 | "source": [
67 | "# store the feature matrix (X) and response vector (y)\n",
68 | "X = iris.data\n",
69 | "y = iris.target"
70 | ]
71 | },
72 | {
73 | "cell_type": "markdown",
74 | "metadata": {},
75 | "source": [
76 | "**\"Features\"** are also known as predictors, inputs, or attributes. The **\"response\"** is also known as the target, label, or output."
77 | ]
78 | },
79 | {
80 | "cell_type": "code",
81 | "execution_count": 4,
82 | "metadata": {
83 | "collapsed": false
84 | },
85 | "outputs": [
86 | {
87 | "name": "stdout",
88 | "output_type": "stream",
89 | "text": [
90 | "(150L, 4L)\n",
91 | "(150L,)\n"
92 | ]
93 | }
94 | ],
95 | "source": [
96 | "# check the shapes of X and y\n",
97 | "print(X.shape)\n",
98 | "print(y.shape)"
99 | ]
100 | },
101 | {
102 | "cell_type": "markdown",
103 | "metadata": {},
104 | "source": [
105 | "**\"Observations\"** are also known as samples, instances, or records."
106 | ]
107 | },
108 | {
109 | "cell_type": "code",
110 | "execution_count": 5,
111 | "metadata": {
112 | "collapsed": false
113 | },
114 | "outputs": [
115 | {
116 | "data": {
117 | "text/html": [
118 | "<div>\n",
119 | "<table border=\"1\" class=\"dataframe\">\n",
120 | "  <thead>\n",
121 | "    <tr style=\"text-align: right;\">\n",
122 | "      <th></th>\n",
123 | "      <th>sepal length (cm)</th>\n",
124 | "      <th>sepal width (cm)</th>\n",
125 | "      <th>petal length (cm)</th>\n",
126 | "      <th>petal width (cm)</th>\n",
127 | "    </tr>\n",
128 | "  </thead>\n",
129 | "  <tbody>\n",
130 | "    <tr>\n",
131 | "      <th>0</th>\n",
132 | "      <td>5.1</td>\n",
133 | "      <td>3.5</td>\n",
134 | "      <td>1.4</td>\n",
135 | "      <td>0.2</td>\n",
136 | "    </tr>\n",
137 | "    <tr>\n",
138 | "      <th>1</th>\n",
139 | "      <td>4.9</td>\n",
140 | "      <td>3.0</td>\n",
141 | "      <td>1.4</td>\n",
142 | "      <td>0.2</td>\n",
143 | "    </tr>\n",
144 | "    <tr>\n",
145 | "      <th>2</th>\n",
146 | "      <td>4.7</td>\n",
147 | "      <td>3.2</td>\n",
148 | "      <td>1.3</td>\n",
149 | "      <td>0.2</td>\n",
150 | "    </tr>\n",
151 | "    <tr>\n",
152 | "      <th>3</th>\n",
153 | "      <td>4.6</td>\n",
154 | "      <td>3.1</td>\n",
155 | "      <td>1.5</td>\n",
156 | "      <td>0.2</td>\n",
157 | "    </tr>\n",
158 | "    <tr>\n",
159 | "      <th>4</th>\n",
160 | "      <td>5.0</td>\n",
161 | "      <td>3.6</td>\n",
162 | "      <td>1.4</td>\n",
163 | "      <td>0.2</td>\n",
164 | "    </tr>\n",
165 | "  </tbody>\n",
166 | "</table>\n",
167 | "</div>"
168 | ],
169 | "text/plain": [
170 | " sepal length (cm) sepal width (cm) petal length (cm) petal width (cm)\n",
171 | "0 5.1 3.5 1.4 0.2\n",
172 | "1 4.9 3.0 1.4 0.2\n",
173 | "2 4.7 3.2 1.3 0.2\n",
174 | "3 4.6 3.1 1.5 0.2\n",
175 | "4 5.0 3.6 1.4 0.2"
176 | ]
177 | },
178 | "execution_count": 5,
179 | "metadata": {},
180 | "output_type": "execute_result"
181 | }
182 | ],
183 | "source": [
184 | "# examine the first 5 rows of the feature matrix (including the feature names)\n",
185 | "import pandas as pd\n",
186 | "pd.DataFrame(X, columns=iris.feature_names).head()"
187 | ]
188 | },
189 | {
190 | "cell_type": "code",
191 | "execution_count": 6,
192 | "metadata": {
193 | "collapsed": false
194 | },
195 | "outputs": [
196 | {
197 | "name": "stdout",
198 | "output_type": "stream",
199 | "text": [
200 | "[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0\n",
201 | " 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1\n",
202 | " 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2\n",
203 | " 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2\n",
204 | " 2 2]\n"
205 | ]
206 | }
207 | ],
208 | "source": [
209 | "# examine the response vector\n",
210 | "print(y)"
211 | ]
212 | },
213 | {
214 | "cell_type": "markdown",
215 | "metadata": {},
216 | "source": [
217 | "In order to **build a model**, the features must be **numeric**, and every observation must have the **same features in the same order**."
218 | ]
219 | },
220 | {
221 | "cell_type": "code",
222 | "execution_count": 7,
223 | "metadata": {
224 | "collapsed": false
225 | },
226 | "outputs": [
227 | {
228 | "data": {
229 | "text/plain": [
230 | "KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',\n",
231 | " metric_params=None, n_jobs=1, n_neighbors=5, p=2,\n",
232 | " weights='uniform')"
233 | ]
234 | },
235 | "execution_count": 7,
236 | "metadata": {},
237 | "output_type": "execute_result"
238 | }
239 | ],
240 | "source": [
241 | "# import the class\n",
242 | "from sklearn.neighbors import KNeighborsClassifier\n",
243 | "\n",
244 | "# instantiate the model (with the default parameters)\n",
245 | "knn = KNeighborsClassifier()\n",
246 | "\n",
247 | "# fit the model with data (occurs in-place)\n",
248 | "knn.fit(X, y)"
249 | ]
250 | },
251 | {
252 | "cell_type": "markdown",
253 | "metadata": {},
254 | "source": [
255 | "In order to **make a prediction**, the new observation must have the **same features as the training observations**, both in number and meaning."
256 | ]
257 | },
258 | {
259 | "cell_type": "code",
260 | "execution_count": 8,
261 | "metadata": {
262 | "collapsed": false
263 | },
264 | "outputs": [
265 | {
266 | "data": {
267 | "text/plain": [
268 | "array([1])"
269 | ]
270 | },
271 | "execution_count": 8,
272 | "metadata": {},
273 | "output_type": "execute_result"
274 | }
275 | ],
276 | "source": [
277 | "# predict the response for a new observation\n",
278 | "knn.predict([[3, 5, 4, 2]])"
279 | ]
280 | },
281 | {
282 | "cell_type": "markdown",
283 | "metadata": {},
284 | "source": [
285 | "## Part 2: Representing text as numerical data"
286 | ]
287 | },
288 | {
289 | "cell_type": "code",
290 | "execution_count": 9,
291 | "metadata": {
292 | "collapsed": true
293 | },
294 | "outputs": [],
295 | "source": [
296 | "# example text for model training (SMS messages)\n",
297 | "simple_train = ['call you tonight', 'Call me a cab', 'please call me... PLEASE!']"
298 | ]
299 | },
300 | {
301 | "cell_type": "markdown",
302 | "metadata": {},
303 | "source": [
304 | "From the [scikit-learn documentation](http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction):\n",
305 | "\n",
306 | "> Text Analysis is a major application field for machine learning algorithms. However the raw data, a sequence of symbols cannot be fed directly to the algorithms themselves as most of them expect **numerical feature vectors with a fixed size** rather than the **raw text documents with variable length**.\n",
307 | "\n",
308 | "We will use [CountVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) to \"convert text into a matrix of token counts\":"
309 | ]
310 | },
311 | {
312 | "cell_type": "code",
313 | "execution_count": 10,
314 | "metadata": {
315 | "collapsed": true
316 | },
317 | "outputs": [],
318 | "source": [
319 | "# import and instantiate CountVectorizer (with the default parameters)\n",
320 | "from sklearn.feature_extraction.text import CountVectorizer\n",
321 | "vect = CountVectorizer()"
322 | ]
323 | },
324 | {
325 | "cell_type": "code",
326 | "execution_count": 11,
327 | "metadata": {
328 | "collapsed": false
329 | },
330 | "outputs": [
331 | {
332 | "data": {
333 | "text/plain": [
334 | "CountVectorizer(analyzer=u'word', binary=False, decode_error=u'strict',\n",
335 | " dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',\n",
336 | " lowercase=True, max_df=1.0, max_features=None, min_df=1,\n",
337 | " ngram_range=(1, 1), preprocessor=None, stop_words=None,\n",
338 | " strip_accents=None, token_pattern=u'(?u)\\\\b\\\\w\\\\w+\\\\b',\n",
339 | " tokenizer=None, vocabulary=None)"
340 | ]
341 | },
342 | "execution_count": 11,
343 | "metadata": {},
344 | "output_type": "execute_result"
345 | }
346 | ],
347 | "source": [
348 | "# learn the 'vocabulary' of the training data (occurs in-place)\n",
349 | "vect.fit(simple_train)"
350 | ]
351 | },
352 | {
353 | "cell_type": "code",
354 | "execution_count": 12,
355 | "metadata": {
356 | "collapsed": false
357 | },
358 | "outputs": [
359 | {
360 | "data": {
361 | "text/plain": [
362 | "[u'cab', u'call', u'me', u'please', u'tonight', u'you']"
363 | ]
364 | },
365 | "execution_count": 12,
366 | "metadata": {},
367 | "output_type": "execute_result"
368 | }
369 | ],
370 | "source": [
371 | "# examine the fitted vocabulary\n",
372 | "vect.get_feature_names()"
373 | ]
374 | },
375 | {
376 | "cell_type": "code",
377 | "execution_count": 13,
378 | "metadata": {
379 | "collapsed": false
380 | },
381 | "outputs": [
382 | {
383 | "data": {
384 | "text/plain": [
385 | "<3x6 sparse matrix of type '<type 'numpy.int64'>'\n",
386 | "\twith 9 stored elements in Compressed Sparse Row format>"
387 | ]
388 | },
389 | "execution_count": 13,
390 | "metadata": {},
391 | "output_type": "execute_result"
392 | }
393 | ],
394 | "source": [
395 | "# transform training data into a 'document-term matrix'\n",
396 | "simple_train_dtm = vect.transform(simple_train)\n",
397 | "simple_train_dtm"
398 | ]
399 | },
400 | {
401 | "cell_type": "code",
402 | "execution_count": 14,
403 | "metadata": {
404 | "collapsed": false
405 | },
406 | "outputs": [
407 | {
408 | "data": {
409 | "text/plain": [
410 | "array([[0, 1, 0, 0, 1, 1],\n",
411 | " [1, 1, 1, 0, 0, 0],\n",
412 | " [0, 1, 1, 2, 0, 0]], dtype=int64)"
413 | ]
414 | },
415 | "execution_count": 14,
416 | "metadata": {},
417 | "output_type": "execute_result"
418 | }
419 | ],
420 | "source": [
421 | "# convert sparse matrix to a dense matrix\n",
422 | "simple_train_dtm.toarray()"
423 | ]
424 | },
425 | {
426 | "cell_type": "code",
427 | "execution_count": 15,
428 | "metadata": {
429 | "collapsed": false
430 | },
431 | "outputs": [
432 | {
433 | "data": {
434 | "text/html": [
435 | "<div>\n",
436 | "<table border=\"1\" class=\"dataframe\">\n",
437 | "  <thead>\n",
438 | "    <tr style=\"text-align: right;\">\n",
439 | "      <th></th>\n",
440 | "      <th>cab</th>\n",
441 | "      <th>call</th>\n",
442 | "      <th>me</th>\n",
443 | "      <th>please</th>\n",
444 | "      <th>tonight</th>\n",
445 | "      <th>you</th>\n",
446 | "    </tr>\n",
447 | "  </thead>\n",
448 | "  <tbody>\n",
449 | "    <tr>\n",
450 | "      <th>0</th>\n",
451 | "      <td>0</td>\n",
452 | "      <td>1</td>\n",
453 | "      <td>0</td>\n",
454 | "      <td>0</td>\n",
455 | "      <td>1</td>\n",
456 | "      <td>1</td>\n",
457 | "    </tr>\n",
458 | "    <tr>\n",
459 | "      <th>1</th>\n",
460 | "      <td>1</td>\n",
461 | "      <td>1</td>\n",
462 | "      <td>1</td>\n",
463 | "      <td>0</td>\n",
464 | "      <td>0</td>\n",
465 | "      <td>0</td>\n",
466 | "    </tr>\n",
467 | "    <tr>\n",
468 | "      <th>2</th>\n",
469 | "      <td>0</td>\n",
470 | "      <td>1</td>\n",
471 | "      <td>1</td>\n",
472 | "      <td>2</td>\n",
473 | "      <td>0</td>\n",
474 | "      <td>0</td>\n",
475 | "    </tr>\n",
476 | "  </tbody>\n",
477 | "</table>\n",
478 | "</div>"
479 | ],
480 | "text/plain": [
481 | " cab call me please tonight you\n",
482 | "0 0 1 0 0 1 1\n",
483 | "1 1 1 1 0 0 0\n",
484 | "2 0 1 1 2 0 0"
485 | ]
486 | },
487 | "execution_count": 15,
488 | "metadata": {},
489 | "output_type": "execute_result"
490 | }
491 | ],
492 | "source": [
493 | "# examine the vocabulary and document-term matrix together\n",
494 | "pd.DataFrame(simple_train_dtm.toarray(), columns=vect.get_feature_names())"
495 | ]
496 | },
497 | {
498 | "cell_type": "markdown",
499 | "metadata": {},
500 | "source": [
501 | "From the [scikit-learn documentation](http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction):\n",
502 | "\n",
503 | "> In this scheme, features and samples are defined as follows:\n",
504 | "\n",
505 | "> - Each individual token occurrence frequency (normalized or not) is treated as a **feature**.\n",
506 | "> - The vector of all the token frequencies for a given document is considered a multivariate **sample**.\n",
507 | "\n",
508 | "> A **corpus of documents** can thus be represented by a matrix with **one row per document** and **one column per token** (e.g. word) occurring in the corpus.\n",
509 | "\n",
510 | "> We call **vectorization** the general process of turning a collection of text documents into numerical feature vectors. This specific strategy (tokenization, counting and normalization) is called the **Bag of Words** or \"Bag of n-grams\" representation. Documents are described by word occurrences while completely ignoring the relative position information of the words in the document."
511 | ]
512 | },
513 | {
514 | "cell_type": "code",
515 | "execution_count": 16,
516 | "metadata": {
517 | "collapsed": false
518 | },
519 | "outputs": [
520 | {
521 | "data": {
522 | "text/plain": [
523 | "scipy.sparse.csr.csr_matrix"
524 | ]
525 | },
526 | "execution_count": 16,
527 | "metadata": {},
528 | "output_type": "execute_result"
529 | }
530 | ],
531 | "source": [
532 | "# check the type of the document-term matrix\n",
533 | "type(simple_train_dtm)"
534 | ]
535 | },
536 | {
537 | "cell_type": "code",
538 | "execution_count": 17,
539 | "metadata": {
540 | "collapsed": false,
541 | "scrolled": true
542 | },
543 | "outputs": [
544 | {
545 | "name": "stdout",
546 | "output_type": "stream",
547 | "text": [
548 | " (0, 1)\t1\n",
549 | " (0, 4)\t1\n",
550 | " (0, 5)\t1\n",
551 | " (1, 0)\t1\n",
552 | " (1, 1)\t1\n",
553 | " (1, 2)\t1\n",
554 | " (2, 1)\t1\n",
555 | " (2, 2)\t1\n",
556 | " (2, 3)\t2\n"
557 | ]
558 | }
559 | ],
560 | "source": [
561 | "# examine the sparse matrix contents\n",
562 | "print(simple_train_dtm)"
563 | ]
564 | },
565 | {
566 | "cell_type": "markdown",
567 | "metadata": {},
568 | "source": [
569 | "From the [scikit-learn documentation](http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction):\n",
570 | "\n",
571 | "> As most documents will typically use a very small subset of the words used in the corpus, the resulting matrix will have **many feature values that are zeros** (typically more than 99% of them).\n",
572 | "\n",
573 | "> For instance, a collection of 10,000 short text documents (such as emails) will use a vocabulary with a size in the order of 100,000 unique words in total while each document will use 100 to 1000 unique words individually.\n",
574 | "\n",
575 | "> In order to be able to **store such a matrix in memory** but also to **speed up operations**, implementations will typically use a **sparse representation** such as the implementations available in the `scipy.sparse` package."
576 | ]
577 | },
578 | {
579 | "cell_type": "code",
580 | "execution_count": 18,
581 | "metadata": {
582 | "collapsed": true
583 | },
584 | "outputs": [],
585 | "source": [
586 | "# example text for model testing\n",
587 | "simple_test = [\"please don't call me\"]"
588 | ]
589 | },
590 | {
591 | "cell_type": "markdown",
592 | "metadata": {},
593 | "source": [
594 | "In order to **make a prediction**, the new observation must have the **same features as the training observations**, both in number and meaning."
595 | ]
596 | },
597 | {
598 | "cell_type": "code",
599 | "execution_count": 19,
600 | "metadata": {
601 | "collapsed": false
602 | },
603 | "outputs": [
604 | {
605 | "data": {
606 | "text/plain": [
607 | "array([[0, 1, 1, 1, 0, 0]], dtype=int64)"
608 | ]
609 | },
610 | "execution_count": 19,
611 | "metadata": {},
612 | "output_type": "execute_result"
613 | }
614 | ],
615 | "source": [
616 | "# transform testing data into a document-term matrix (using existing vocabulary)\n",
617 | "simple_test_dtm = vect.transform(simple_test)\n",
618 | "simple_test_dtm.toarray()"
619 | ]
620 | },
621 | {
622 | "cell_type": "code",
623 | "execution_count": 20,
624 | "metadata": {
625 | "collapsed": false
626 | },
627 | "outputs": [
628 | {
629 | "data": {
630 | "text/html": [
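631 | "<div>\n",
632 | "<table border=\"1\" class=\"dataframe\">\n",
633 | "  <thead>\n",
634 | "    <tr style=\"text-align: right;\">\n",
635 | "      <th></th>\n",
636 | "      <th>cab</th>\n",
637 | "      <th>call</th>\n",
638 | "      <th>me</th>\n",
639 | "      <th>please</th>\n",
640 | "      <th>tonight</th>\n",
641 | "      <th>you</th>\n",
642 | "    </tr>\n",
643 | "  </thead>\n",
644 | "  <tbody>\n",
645 | "    <tr>\n",
646 | "      <th>0</th>\n",
647 | "      <td>0</td>\n",
648 | "      <td>1</td>\n",
649 | "      <td>1</td>\n",
650 | "      <td>1</td>\n",
651 | "      <td>0</td>\n",
652 | "      <td>0</td>\n",
653 | "    </tr>\n",
654 | "  </tbody>\n",
655 | "</table>\n",
656 | "</div>"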
657 | ],
658 | "text/plain": [
659 | " cab call me please tonight you\n",
660 | "0 0 1 1 1 0 0"
661 | ]
662 | },
663 | "execution_count": 20,
664 | "metadata": {},
665 | "output_type": "execute_result"
666 | }
667 | ],
668 | "source": [
669 | "# examine the vocabulary and document-term matrix together\n",
670 | "pd.DataFrame(simple_test_dtm.toarray(), columns=vect.get_feature_names())"
671 | ]
672 | },
673 | {
674 | "cell_type": "markdown",
675 | "metadata": {},
676 | "source": [
677 | "**Summary:**\n",
678 | "\n",
679 | "- `vect.fit(train)` **learns the vocabulary** of the training data\n",
680 | "- `vect.transform(train)` uses the **fitted vocabulary** to build a document-term matrix from the training data\n",
681 | "- `vect.transform(test)` uses the **fitted vocabulary** to build a document-term matrix from the testing data (and **ignores tokens** it hasn't seen before)"
682 | ]
683 | },
684 | {
685 | "cell_type": "markdown",
686 | "metadata": {},
687 | "source": [
688 | "## Part 3: Reading a text-based dataset into pandas"
689 | ]
690 | },
691 | {
692 | "cell_type": "code",
693 | "execution_count": 21,
694 | "metadata": {
695 | "collapsed": true
696 | },
697 | "outputs": [],
698 | "source": [
699 | "# read file into pandas using a relative path\n",
700 | "path = 'data/sms.tsv'\n",
701 | "sms = pd.read_table(path, header=None, names=['label', 'message'])"
702 | ]
703 | },
704 | {
705 | "cell_type": "code",
706 | "execution_count": 22,
707 | "metadata": {
708 | "collapsed": false
709 | },
710 | "outputs": [],
711 | "source": [
712 | "# alternative: read file into pandas from a URL\n",
713 | "# url = 'https://raw.githubusercontent.com/justmarkham/pycon-2016-tutorial/master/data/sms.tsv'\n",
714 | "# sms = pd.read_table(url, header=None, names=['label', 'message'])"
715 | ]
716 | },
717 | {
718 | "cell_type": "code",
719 | "execution_count": 23,
720 | "metadata": {
721 | "collapsed": false
722 | },
723 | "outputs": [
724 | {
725 | "data": {
726 | "text/plain": [
727 | "(5572, 2)"
728 | ]
729 | },
730 | "execution_count": 23,
731 | "metadata": {},
732 | "output_type": "execute_result"
733 | }
734 | ],
735 | "source": [
736 | "# examine the shape\n",
737 | "sms.shape"
738 | ]
739 | },
740 | {
741 | "cell_type": "code",
742 | "execution_count": 24,
743 | "metadata": {
744 | "collapsed": false
745 | },
746 | "outputs": [
747 | {
748 | "data": {
749 | "text/html": [
750 | "<div>\n",
751 | "<table border=\"1\" class=\"dataframe\">\n",
752 | "  <thead>\n",
753 | "    <tr style=\"text-align: right;\">\n",
754 | "      <th></th>\n",
755 | "      <th>label</th>\n",
756 | "      <th>message</th>\n",
757 | "    </tr>\n",
758 | "  </thead>\n",
759 | "  <tbody>\n",
760 | "    <tr>\n",
761 | "      <th>0</th>\n",
762 | "      <td>ham</td>\n",
763 | "      <td>Go until jurong point, crazy.. Available only ...</td>\n",
764 | "    </tr>\n",
765 | "    <tr>\n",
766 | "      <th>1</th>\n",
767 | "      <td>ham</td>\n",
768 | "      <td>Ok lar... Joking wif u oni...</td>\n",
769 | "    </tr>\n",
770 | "    <tr>\n",
771 | "      <th>2</th>\n",
772 | "      <td>spam</td>\n",
773 | "      <td>Free entry in 2 a wkly comp to win FA Cup fina...</td>\n",
774 | "    </tr>\n",
775 | "    <tr>\n",
776 | "      <th>3</th>\n",
777 | "      <td>ham</td>\n",
778 | "      <td>U dun say so early hor... U c already then say...</td>\n",
779 | "    </tr>\n",
780 | "    <tr>\n",
781 | "      <th>4</th>\n",
782 | "      <td>ham</td>\n",
783 | "      <td>Nah I don't think he goes to usf, he lives aro...</td>\n",
784 | "    </tr>\n",
785 | "    <tr>\n",
786 | "      <th>5</th>\n",
787 | "      <td>spam</td>\n",
788 | "      <td>FreeMsg Hey there darling it's been 3 week's n...</td>\n",
789 | "    </tr>\n",
790 | "    <tr>\n",
791 | "      <th>6</th>\n",
792 | "      <td>ham</td>\n",
793 | "      <td>Even my brother is not like to speak with me. ...</td>\n",
794 | "    </tr>\n",
795 | "    <tr>\n",
796 | "      <th>7</th>\n",
797 | "      <td>ham</td>\n",
798 | "      <td>As per your request 'Melle Melle (Oru Minnamin...</td>\n",
799 | "    </tr>\n",
800 | "    <tr>\n",
801 | "      <th>8</th>\n",
802 | "      <td>spam</td>\n",
803 | "      <td>WINNER!! As a valued network customer you have...</td>\n",
804 | "    </tr>\n",
805 | "    <tr>\n",
806 | "      <th>9</th>\n",
807 | "      <td>spam</td>\n",
808 | "      <td>Had your mobile 11 months or more? U R entitle...</td>\n",
809 | "    </tr>\n",
810 | "  </tbody>\n",
811 | "</table>\n",
812 | "</div>"
813 | ],
814 | "text/plain": [
815 | " label message\n",
816 | "0 ham Go until jurong point, crazy.. Available only ...\n",
817 | "1 ham Ok lar... Joking wif u oni...\n",
818 | "2 spam Free entry in 2 a wkly comp to win FA Cup fina...\n",
819 | "3 ham U dun say so early hor... U c already then say...\n",
820 | "4 ham Nah I don't think he goes to usf, he lives aro...\n",
821 | "5 spam FreeMsg Hey there darling it's been 3 week's n...\n",
822 | "6 ham Even my brother is not like to speak with me. ...\n",
823 | "7 ham As per your request 'Melle Melle (Oru Minnamin...\n",
824 | "8 spam WINNER!! As a valued network customer you have...\n",
825 | "9 spam Had your mobile 11 months or more? U R entitle..."
826 | ]
827 | },
828 | "execution_count": 24,
829 | "metadata": {},
830 | "output_type": "execute_result"
831 | }
832 | ],
833 | "source": [
834 | "# examine the first 10 rows\n",
835 | "sms.head(10)"
836 | ]
837 | },
838 | {
839 | "cell_type": "code",
840 | "execution_count": 25,
841 | "metadata": {
842 | "collapsed": false
843 | },
844 | "outputs": [
845 | {
846 | "data": {
847 | "text/plain": [
848 | "ham 4825\n",
849 | "spam 747\n",
850 | "Name: label, dtype: int64"
851 | ]
852 | },
853 | "execution_count": 25,
854 | "metadata": {},
855 | "output_type": "execute_result"
856 | }
857 | ],
858 | "source": [
859 | "# examine the class distribution\n",
860 | "sms.label.value_counts()"
861 | ]
862 | },
863 | {
864 | "cell_type": "code",
865 | "execution_count": 26,
866 | "metadata": {
867 | "collapsed": true
868 | },
869 | "outputs": [],
870 | "source": [
871 | "# convert label to a numerical variable\n",
872 | "sms['label_num'] = sms.label.map({'ham':0, 'spam':1})"
873 | ]
874 | },
875 | {
876 | "cell_type": "code",
877 | "execution_count": 27,
878 | "metadata": {
879 | "collapsed": false
880 | },
881 | "outputs": [
882 | {
883 | "data": {
884 | "text/html": [
885 | "<div>\n",
886 | "<table border=\"1\" class=\"dataframe\">\n",
887 | "  <thead>\n",
888 | "    <tr style=\"text-align: right;\">\n",
889 | "      <th></th>\n",
890 | "      <th>label</th>\n",
891 | "      <th>message</th>\n",
892 | "      <th>label_num</th>\n",
893 | "    </tr>\n",
894 | "  </thead>\n",
895 | "  <tbody>\n",
896 | "    <tr>\n",
897 | "      <th>0</th>\n",
898 | "      <td>ham</td>\n",
899 | "      <td>Go until jurong point, crazy.. Available only ...</td>\n",
900 | "      <td>0</td>\n",
901 | "    </tr>\n",
902 | "    <tr>\n",
903 | "      <th>1</th>\n",
904 | "      <td>ham</td>\n",
905 | "      <td>Ok lar... Joking wif u oni...</td>\n",
906 | "      <td>0</td>\n",
907 | "    </tr>\n",
908 | "    <tr>\n",
909 | "      <th>2</th>\n",
910 | "      <td>spam</td>\n",
911 | "      <td>Free entry in 2 a wkly comp to win FA Cup fina...</td>\n",
912 | "      <td>1</td>\n",
913 | "    </tr>\n",
914 | "    <tr>\n",
915 | "      <th>3</th>\n",
916 | "      <td>ham</td>\n",
917 | "      <td>U dun say so early hor... U c already then say...</td>\n",
918 | "      <td>0</td>\n",
919 | "    </tr>\n",
920 | "    <tr>\n",
921 | "      <th>4</th>\n",
922 | "      <td>ham</td>\n",
923 | "      <td>Nah I don't think he goes to usf, he lives aro...</td>\n",
924 | "      <td>0</td>\n",
925 | "    </tr>\n",
926 | "    <tr>\n",
927 | "      <th>5</th>\n",
928 | "      <td>spam</td>\n",
929 | "      <td>FreeMsg Hey there darling it's been 3 week's n...</td>\n",
930 | "      <td>1</td>\n",
931 | "    </tr>\n",
932 | "    <tr>\n",
933 | "      <th>6</th>\n",
934 | "      <td>ham</td>\n",
935 | "      <td>Even my brother is not like to speak with me. ...</td>\n",
936 | "      <td>0</td>\n",
937 | "    </tr>\n",
938 | "    <tr>\n",
939 | "      <th>7</th>\n",
940 | "      <td>ham</td>\n",
941 | "      <td>As per your request 'Melle Melle (Oru Minnamin...</td>\n",
942 | "      <td>0</td>\n",
943 | "    </tr>\n",
944 | "    <tr>\n",
945 | "      <th>8</th>\n",
946 | "      <td>spam</td>\n",
947 | "      <td>WINNER!! As a valued network customer you have...</td>\n",
948 | "      <td>1</td>\n",
949 | "    </tr>\n",
950 | "    <tr>\n",
951 | "      <th>9</th>\n",
952 | "      <td>spam</td>\n",
953 | "      <td>Had your mobile 11 months or more? U R entitle...</td>\n",
954 | "      <td>1</td>\n",
955 | "    </tr>\n",
956 | "  </tbody>\n",
957 | "</table>\n",
958 | "</div>"
959 | ],
960 | "text/plain": [
961 | " label message label_num\n",
962 | "0 ham Go until jurong point, crazy.. Available only ... 0\n",
963 | "1 ham Ok lar... Joking wif u oni... 0\n",
964 | "2 spam Free entry in 2 a wkly comp to win FA Cup fina... 1\n",
965 | "3 ham U dun say so early hor... U c already then say... 0\n",
966 | "4 ham Nah I don't think he goes to usf, he lives aro... 0\n",
967 | "5 spam FreeMsg Hey there darling it's been 3 week's n... 1\n",
968 | "6 ham Even my brother is not like to speak with me. ... 0\n",
969 | "7 ham As per your request 'Melle Melle (Oru Minnamin... 0\n",
970 | "8 spam WINNER!! As a valued network customer you have... 1\n",
971 | "9 spam Had your mobile 11 months or more? U R entitle... 1"
972 | ]
973 | },
974 | "execution_count": 27,
975 | "metadata": {},
976 | "output_type": "execute_result"
977 | }
978 | ],
979 | "source": [
980 | "# check that the conversion worked\n",
981 | "sms.head(10)"
982 | ]
983 | },
984 | {
985 | "cell_type": "code",
986 | "execution_count": 28,
987 | "metadata": {
988 | "collapsed": false
989 | },
990 | "outputs": [
991 | {
992 | "name": "stdout",
993 | "output_type": "stream",
994 | "text": [
995 | "(150L, 4L)\n",
996 | "(150L,)\n"
997 | ]
998 | }
999 | ],
1000 | "source": [
1001 | "# how to define X and y (from the iris data) for use with a MODEL\n",
1002 | "X = iris.data\n",
1003 | "y = iris.target\n",
1004 | "print(X.shape)\n",
1005 | "print(y.shape)"
1006 | ]
1007 | },
1008 | {
1009 | "cell_type": "code",
1010 | "execution_count": 29,
1011 | "metadata": {
1012 | "collapsed": false
1013 | },
1014 | "outputs": [
1015 | {
1016 | "name": "stdout",
1017 | "output_type": "stream",
1018 | "text": [
1019 | "(5572L,)\n",
1020 | "(5572L,)\n"
1021 | ]
1022 | }
1023 | ],
1024 | "source": [
1025 | "# how to define X and y (from the SMS data) for use with COUNTVECTORIZER (X is 1-D raw text, not a 2-D matrix)\n",
1026 | "X = sms.message\n",
1027 | "y = sms.label_num\n",
1028 | "print(X.shape)\n",
1029 | "print(y.shape)"
1030 | ]
1031 | },
1032 | {
1033 | "cell_type": "code",
1034 | "execution_count": 30,
1035 | "metadata": {
1036 | "collapsed": false
1037 | },
1038 | "outputs": [
1039 | {
1040 | "name": "stdout",
1041 | "output_type": "stream",
1042 | "text": [
1043 | "(4179L,)\n",
1044 | "(1393L,)\n",
1045 | "(4179L,)\n",
1046 | "(1393L,)\n"
1047 | ]
1048 | }
1049 | ],
1050 | "source": [
1051 | "# split X and y into training and testing sets\n",
1052 | "from sklearn.model_selection import train_test_split\n",
1053 | "X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)\n",
1054 | "print(X_train.shape)\n",
1055 | "print(X_test.shape)\n",
1056 | "print(y_train.shape)\n",
1057 | "print(y_test.shape)"
1058 | ]
1059 | },
1060 | {
1061 | "cell_type": "markdown",
1062 | "metadata": {},
1063 | "source": [
1064 | "## Part 4: Vectorizing our dataset"
1065 | ]
1066 | },
1067 | {
1068 | "cell_type": "code",
1069 | "execution_count": 31,
1070 | "metadata": {
1071 | "collapsed": true
1072 | },
1073 | "outputs": [],
1074 | "source": [
1075 | "# instantiate the vectorizer\n",
1076 | "vect = CountVectorizer()"
1077 | ]
1078 | },
1079 | {
1080 | "cell_type": "code",
1081 | "execution_count": 32,
1082 | "metadata": {
1083 | "collapsed": true
1084 | },
1085 | "outputs": [],
1086 | "source": [
1087 | "# learn training data vocabulary, then use it to create a document-term matrix\n",
1088 | "vect.fit(X_train)\n",
1089 | "X_train_dtm = vect.transform(X_train)"
1090 | ]
1091 | },
1092 | {
1093 | "cell_type": "code",
1094 | "execution_count": 33,
1095 | "metadata": {
1096 | "collapsed": true
1097 | },
1098 | "outputs": [],
1099 | "source": [
1100 | "# equivalently: combine fit and transform into a single step\n",
1101 | "X_train_dtm = vect.fit_transform(X_train)"
1102 | ]
1103 | },
1104 | {
1105 | "cell_type": "code",
1106 | "execution_count": 34,
1107 | "metadata": {
1108 | "collapsed": false
1109 | },
1110 | "outputs": [
1111 | {
1112 | "data": {
1113 | "text/plain": [
1114 | "<4179x7456 sparse matrix of type '<type 'numpy.int64'>'\n",
1115 | "\twith 55209 stored elements in Compressed Sparse Row format>"
1116 | ]
1117 | },
1118 | "execution_count": 34,
1119 | "metadata": {},
1120 | "output_type": "execute_result"
1121 | }
1122 | ],
1123 | "source": [
1124 | "# examine the document-term matrix\n",
1125 | "X_train_dtm"
1126 | ]
1127 | },
1128 | {
1129 | "cell_type": "code",
1130 | "execution_count": 35,
1131 | "metadata": {
1132 | "collapsed": false
1133 | },
1134 | "outputs": [
1135 | {
1136 | "data": {
1137 | "text/plain": [
1138 | "<1393x7456 sparse matrix of type '<type 'numpy.int64'>'\n",
1139 | "\twith 17604 stored elements in Compressed Sparse Row format>"
1140 | ]
1141 | },
1142 | "execution_count": 35,
1143 | "metadata": {},
1144 | "output_type": "execute_result"
1145 | }
1146 | ],
1147 | "source": [
1148 | "# transform testing data (using fitted vocabulary) into a document-term matrix\n",
1149 | "X_test_dtm = vect.transform(X_test)\n",
1150 | "X_test_dtm"
1151 | ]
1152 | },
1153 | {
1154 | "cell_type": "markdown",
1155 | "metadata": {},
1156 | "source": [
1157 | "## Part 5: Building and evaluating a model\n",
1158 | "\n",
1159 | "We will use [multinomial Naive Bayes](http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html):\n",
1160 | "\n",
1161 | "> The multinomial Naive Bayes classifier is suitable for classification with **discrete features** (e.g., word counts for text classification). The multinomial distribution normally requires integer feature counts. However, in practice, fractional counts such as tf-idf may also work."
1162 | ]
1163 | },
1164 | {
1165 | "cell_type": "code",
1166 | "execution_count": 36,
1167 | "metadata": {
1168 | "collapsed": true
1169 | },
1170 | "outputs": [],
1171 | "source": [
1172 | "# import and instantiate a Multinomial Naive Bayes model\n",
1173 | "from sklearn.naive_bayes import MultinomialNB\n",
1174 | "nb = MultinomialNB()"
1175 | ]
1176 | },
1177 | {
1178 | "cell_type": "code",
1179 | "execution_count": 37,
1180 | "metadata": {
1181 | "collapsed": false
1182 | },
1183 | "outputs": [
1184 | {
1185 | "name": "stdout",
1186 | "output_type": "stream",
1187 | "text": [
1188 | "Wall time: 3 ms\n"
1189 | ]
1190 | },
1191 | {
1192 | "data": {
1193 | "text/plain": [
1194 | "MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)"
1195 | ]
1196 | },
1197 | "execution_count": 37,
1198 | "metadata": {},
1199 | "output_type": "execute_result"
1200 | }
1201 | ],
1202 | "source": [
1203 | "# train the model using X_train_dtm (timing it with an IPython \"magic command\")\n",
1204 | "%time nb.fit(X_train_dtm, y_train)"
1205 | ]
1206 | },
1207 | {
1208 | "cell_type": "code",
1209 | "execution_count": 38,
1210 | "metadata": {
1211 | "collapsed": true
1212 | },
1213 | "outputs": [],
1214 | "source": [
1215 | "# make class predictions for X_test_dtm\n",
1216 | "y_pred_class = nb.predict(X_test_dtm)"
1217 | ]
1218 | },
1219 | {
1220 | "cell_type": "code",
1221 | "execution_count": 39,
1222 | "metadata": {
1223 | "collapsed": false
1224 | },
1225 | "outputs": [
1226 | {
1227 | "data": {
1228 | "text/plain": [
1229 | "0.98851399856424982"
1230 | ]
1231 | },
1232 | "execution_count": 39,
1233 | "metadata": {},
1234 | "output_type": "execute_result"
1235 | }
1236 | ],
1237 | "source": [
1238 | "# calculate accuracy of class predictions\n",
1239 | "from sklearn import metrics\n",
1240 | "metrics.accuracy_score(y_test, y_pred_class)"
1241 | ]
1242 | },
1243 | {
1244 | "cell_type": "code",
1245 | "execution_count": 40,
1246 | "metadata": {
1247 | "collapsed": false
1248 | },
1249 | "outputs": [
1250 | {
1251 | "data": {
1252 | "text/plain": [
1253 | "array([[1203, 5],\n",
1254 | " [ 11, 174]])"
1255 | ]
1256 | },
1257 | "execution_count": 40,
1258 | "metadata": {},
1259 | "output_type": "execute_result"
1260 | }
1261 | ],
1262 | "source": [
1263 | "# print the confusion matrix\n",
1264 | "metrics.confusion_matrix(y_test, y_pred_class)"
1265 | ]
1266 | },
1267 | {
1268 | "cell_type": "code",
1269 | "execution_count": 41,
1270 | "metadata": {
1271 | "collapsed": false
1272 | },
1273 | "outputs": [
1274 | {
1275 | "data": {
1276 | "text/plain": [
1277 | "574 Waiting for your call.\n",
1278 | "3375 Also andros ice etc etc\n",
1279 | "45 No calls..messages..missed calls\n",
1280 | "3415 No pic. Please re-send.\n",
1281 | "1988 No calls..messages..missed calls\n",
1282 | "Name: message, dtype: object"
1283 | ]
1284 | },
1285 | "execution_count": 41,
1286 | "metadata": {},
1287 | "output_type": "execute_result"
1288 | }
1289 | ],
1290 | "source": [
1291 | "# print message text for the false positives (ham incorrectly classified as spam)\n",
1292 | "X_test[y_test < y_pred_class]"
1293 | ]
1294 | },
1295 | {
1296 | "cell_type": "code",
1297 | "execution_count": 42,
1298 | "metadata": {
1299 | "collapsed": false,
1300 | "scrolled": true
1301 | },
1302 | "outputs": [
1303 | {
1304 | "data": {
1305 | "text/plain": [
1306 | "3132 LookAtMe!: Thanks for your purchase of a video...\n",
1307 | "5 FreeMsg Hey there darling it's been 3 week's n...\n",
1308 | "3530 Xmas & New Years Eve tickets are now on sale f...\n",
1309 | "684 Hi I'm sue. I am 20 years old and work as a la...\n",
1310 | "1875 Would you like to see my XXX pics they are so ...\n",
1311 | "1893 CALL 09090900040 & LISTEN TO EXTREME DIRTY LIV...\n",
1312 | "4298 thesmszone.com lets you send free anonymous an...\n",
1313 | "4949 Hi this is Amy, we will be sending you a free ...\n",
1314 | "2821 INTERFLORA - It's not too late to order Inter...\n",
1315 | "2247 Hi ya babe x u 4goten bout me?' scammers getti...\n",
1316 | "4514 Money i have won wining number 946 wot do i do...\n",
1317 | "Name: message, dtype: object"
1318 | ]
1319 | },
1320 | "execution_count": 42,
1321 | "metadata": {},
1322 | "output_type": "execute_result"
1323 | }
1324 | ],
1325 | "source": [
1326 | "# print message text for the false negatives (spam incorrectly classified as ham)\n",
1327 | "X_test[y_test > y_pred_class]"
1328 | ]
1329 | },
1330 | {
1331 | "cell_type": "code",
1332 | "execution_count": 43,
1333 | "metadata": {
1334 | "collapsed": false,
1335 | "scrolled": true
1336 | },
1337 | "outputs": [
1338 | {
1339 | "data": {
1340 | "text/plain": [
1341 | "\"LookAtMe!: Thanks for your purchase of a video clip from LookAtMe!, you've been charged 35p. Think you can do better? Why not send a video in a MMSto 32323.\""
1342 | ]
1343 | },
1344 | "execution_count": 43,
1345 | "metadata": {},
1346 | "output_type": "execute_result"
1347 | }
1348 | ],
1349 | "source": [
1350 | "# example false negative\n",
1351 | "X_test[3132]"
1352 | ]
1353 | },
1354 | {
1355 | "cell_type": "code",
1356 | "execution_count": 44,
1357 | "metadata": {
1358 | "collapsed": false
1359 | },
1360 | "outputs": [
1361 | {
1362 | "data": {
1363 | "text/plain": [
1364 | "array([ 2.87744864e-03, 1.83488846e-05, 2.07301295e-03, ...,\n",
1365 | " 1.09026171e-06, 1.00000000e+00, 3.98279868e-09])"
1366 | ]
1367 | },
1368 | "execution_count": 44,
1369 | "metadata": {},
1370 | "output_type": "execute_result"
1371 | }
1372 | ],
1373 | "source": [
1374 | "# calculate predicted probabilities for X_test_dtm (poorly calibrated)\n",
1375 | "y_pred_prob = nb.predict_proba(X_test_dtm)[:, 1]\n",
1376 | "y_pred_prob"
1377 | ]
1378 | },
1379 | {
1380 | "cell_type": "code",
1381 | "execution_count": 45,
1382 | "metadata": {
1383 | "collapsed": false
1384 | },
1385 | "outputs": [
1386 | {
1387 | "data": {
1388 | "text/plain": [
1389 | "0.98664310005369604"
1390 | ]
1391 | },
1392 | "execution_count": 45,
1393 | "metadata": {},
1394 | "output_type": "execute_result"
1395 | }
1396 | ],
1397 | "source": [
1398 | "# calculate AUC\n",
1399 | "metrics.roc_auc_score(y_test, y_pred_prob)"
1400 | ]
1401 | },
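AUC depends only on how the predicted probabilities rank the observations, not on their absolute values, which is why a poorly calibrated model can still achieve a high AUC. The sketch below is a pure-Python illustration of that ranking interpretation, not scikit-learn's implementation; the `pairwise_auc` helper and the toy labels and scores are hypothetical.

```python
def pairwise_auc(y_true, scores):
    """Fraction of (spam, ham) pairs in which the spam message gets the higher score.

    Ties count as half a win. This is the ranking interpretation of AUC.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

auc = pairwise_auc(y_true, scores)
# squaring distorts the probabilities but preserves their order
auc_squared = pairwise_auc(y_true, [s ** 2 for s in scores])
print(auc, auc_squared)
```

Both values are the same, because squaring changes the scores without changing their order.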
1402 | {
1403 | "cell_type": "markdown",
1404 | "metadata": {},
1405 | "source": [
1406 | "## Part 6: Comparing models\n",
1407 | "\n",
1408 | "We will compare multinomial Naive Bayes with [logistic regression](http://scikit-learn.org/stable/modules/linear_model.html#logistic-regression):\n",
1409 | "\n",
1410 | "> Logistic regression, despite its name, is a **linear model for classification** rather than regression. Logistic regression is also known in the literature as logit regression, maximum-entropy classification (MaxEnt) or the log-linear classifier. In this model, the probabilities describing the possible outcomes of a single trial are modeled using a logistic function."
1411 | ]
1412 | },
1413 | {
1414 | "cell_type": "code",
1415 | "execution_count": 46,
1416 | "metadata": {
1417 | "collapsed": true
1418 | },
1419 | "outputs": [],
1420 | "source": [
1421 | "# import and instantiate a logistic regression model\n",
1422 | "from sklearn.linear_model import LogisticRegression\n",
1423 | "logreg = LogisticRegression()"
1424 | ]
1425 | },
1426 | {
1427 | "cell_type": "code",
1428 | "execution_count": 47,
1429 | "metadata": {
1430 | "collapsed": false
1431 | },
1432 | "outputs": [
1433 | {
1434 | "name": "stdout",
1435 | "output_type": "stream",
1436 | "text": [
1437 | "Wall time: 39 ms\n"
1438 | ]
1439 | },
1440 | {
1441 | "data": {
1442 | "text/plain": [
1443 | "LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,\n",
1444 | " intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,\n",
1445 | " penalty='l2', random_state=None, solver='liblinear', tol=0.0001,\n",
1446 | " verbose=0, warm_start=False)"
1447 | ]
1448 | },
1449 | "execution_count": 47,
1450 | "metadata": {},
1451 | "output_type": "execute_result"
1452 | }
1453 | ],
1454 | "source": [
1455 | "# train the model using X_train_dtm\n",
1456 | "%time logreg.fit(X_train_dtm, y_train)"
1457 | ]
1458 | },
1459 | {
1460 | "cell_type": "code",
1461 | "execution_count": 48,
1462 | "metadata": {
1463 | "collapsed": true
1464 | },
1465 | "outputs": [],
1466 | "source": [
1467 | "# make class predictions for X_test_dtm\n",
1468 | "y_pred_class = logreg.predict(X_test_dtm)"
1469 | ]
1470 | },
1471 | {
1472 | "cell_type": "code",
1473 | "execution_count": 49,
1474 | "metadata": {
1475 | "collapsed": false
1476 | },
1477 | "outputs": [
1478 | {
1479 | "data": {
1480 | "text/plain": [
1481 | "array([ 0.01269556, 0.00347183, 0.00616517, ..., 0.03354907,\n",
1482 | " 0.99725053, 0.00157706])"
1483 | ]
1484 | },
1485 | "execution_count": 49,
1486 | "metadata": {},
1487 | "output_type": "execute_result"
1488 | }
1489 | ],
1490 | "source": [
1491 | "# calculate predicted probabilities for X_test_dtm (well calibrated)\n",
1492 | "y_pred_prob = logreg.predict_proba(X_test_dtm)[:, 1]\n",
1493 | "y_pred_prob"
1494 | ]
1495 | },
1496 | {
1497 | "cell_type": "code",
1498 | "execution_count": 50,
1499 | "metadata": {
1500 | "collapsed": false
1501 | },
1502 | "outputs": [
1503 | {
1504 | "data": {
1505 | "text/plain": [
1506 | "0.9877961234745154"
1507 | ]
1508 | },
1509 | "execution_count": 50,
1510 | "metadata": {},
1511 | "output_type": "execute_result"
1512 | }
1513 | ],
1514 | "source": [
1515 | "# calculate accuracy\n",
1516 | "metrics.accuracy_score(y_test, y_pred_class)"
1517 | ]
1518 | },
1519 | {
1520 | "cell_type": "code",
1521 | "execution_count": 51,
1522 | "metadata": {
1523 | "collapsed": false
1524 | },
1525 | "outputs": [
1526 | {
1527 | "data": {
1528 | "text/plain": [
1529 | "0.99368176123143015"
1530 | ]
1531 | },
1532 | "execution_count": 51,
1533 | "metadata": {},
1534 | "output_type": "execute_result"
1535 | }
1536 | ],
1537 | "source": [
1538 | "# calculate AUC\n",
1539 | "metrics.roc_auc_score(y_test, y_pred_prob)"
1540 | ]
1541 | },
1542 | {
1543 | "cell_type": "markdown",
1544 | "metadata": {},
1545 | "source": [
1546 | "## Part 7: Examining a model for further insight\n",
1547 | "\n",
1548 | "We will examine our **trained Naive Bayes model** to calculate the approximate **\"spamminess\" of each token**."
1549 | ]
1550 | },
1551 | {
1552 | "cell_type": "code",
1553 | "execution_count": 52,
1554 | "metadata": {
1555 | "collapsed": false
1556 | },
1557 | "outputs": [
1558 | {
1559 | "data": {
1560 | "text/plain": [
1561 | "7456"
1562 | ]
1563 | },
1564 | "execution_count": 52,
1565 | "metadata": {},
1566 | "output_type": "execute_result"
1567 | }
1568 | ],
1569 | "source": [
1570 | "# store the vocabulary of X_train (note: scikit-learn 1.2 and later requires get_feature_names_out() instead)\n",
1571 | "X_train_tokens = vect.get_feature_names()\n",
1572 | "len(X_train_tokens)"
1573 | ]
1574 | },
1575 | {
1576 | "cell_type": "code",
1577 | "execution_count": 53,
1578 | "metadata": {
1579 | "collapsed": false,
1580 | "scrolled": true
1581 | },
1582 | "outputs": [
1583 | {
1584 | "name": "stdout",
1585 | "output_type": "stream",
1586 | "text": [
1587 | "[u'00', u'000', u'008704050406', u'0121', u'01223585236', u'01223585334', u'0125698789', u'02', u'0207', u'02072069400', u'02073162414', u'02085076972', u'021', u'03', u'04', u'0430', u'05', u'050703', u'0578', u'06', u'07', u'07008009200', u'07090201529', u'07090298926', u'07123456789', u'07732584351', u'07734396839', u'07742676969', u'0776xxxxxxx', u'07781482378', u'07786200117', u'078', u'07801543489', u'07808', u'07808247860', u'07808726822', u'07815296484', u'07821230901', u'07880867867', u'0789xxxxxxx', u'07946746291', u'0796xxxxxx', u'07973788240', u'07xxxxxxxxx', u'08', u'0800', u'08000407165', u'08000776320', u'08000839402', u'08000930705']\n"
1588 | ]
1589 | }
1590 | ],
1591 | "source": [
1592 | "# examine the first 50 tokens\n",
1593 | "print(X_train_tokens[0:50])"
1594 | ]
1595 | },
1596 | {
1597 | "cell_type": "code",
1598 | "execution_count": 54,
1599 | "metadata": {
1600 | "collapsed": false
1601 | },
1602 | "outputs": [
1603 | {
1604 | "name": "stdout",
1605 | "output_type": "stream",
1606 | "text": [
1607 | "[u'yer', u'yes', u'yest', u'yesterday', u'yet', u'yetunde', u'yijue', u'ym', u'ymca', u'yo', u'yoga', u'yogasana', u'yor', u'yorge', u'you', u'youdoing', u'youi', u'youphone', u'your', u'youre', u'yourjob', u'yours', u'yourself', u'youwanna', u'yowifes', u'yoyyooo', u'yr', u'yrs', u'ything', u'yummmm', u'yummy', u'yun', u'yunny', u'yuo', u'yuou', u'yup', u'zac', u'zaher', u'zealand', u'zebra', u'zed', u'zeros', u'zhong', u'zindgi', u'zoe', u'zoom', u'zouk', u'zyada', u'\\xe8n', u'\\u3028ud']\n"
1608 | ]
1609 | }
1610 | ],
1611 | "source": [
1612 | "# examine the last 50 tokens\n",
1613 | "print(X_train_tokens[-50:])"
1614 | ]
1615 | },
1616 | {
1617 | "cell_type": "code",
1618 | "execution_count": 55,
1619 | "metadata": {
1620 | "collapsed": false
1621 | },
1622 | "outputs": [
1623 | {
1624 | "data": {
1625 | "text/plain": [
1626 | "array([[ 0., 0., 0., ..., 1., 1., 1.],\n",
1627 | " [ 5., 23., 2., ..., 0., 0., 0.]])"
1628 | ]
1629 | },
1630 | "execution_count": 55,
1631 | "metadata": {},
1632 | "output_type": "execute_result"
1633 | }
1634 | ],
1635 | "source": [
1636 | "# Naive Bayes counts the number of times each token appears in each class\n",
1637 | "nb.feature_count_"
1638 | ]
1639 | },
1640 | {
1641 | "cell_type": "code",
1642 | "execution_count": 56,
1643 | "metadata": {
1644 | "collapsed": false
1645 | },
1646 | "outputs": [
1647 | {
1648 | "data": {
1649 | "text/plain": [
1650 | "(2L, 7456L)"
1651 | ]
1652 | },
1653 | "execution_count": 56,
1654 | "metadata": {},
1655 | "output_type": "execute_result"
1656 | }
1657 | ],
1658 | "source": [
1659 | "# rows represent classes, columns represent tokens\n",
1660 | "nb.feature_count_.shape"
1661 | ]
1662 | },
1663 | {
1664 | "cell_type": "code",
1665 | "execution_count": 57,
1666 | "metadata": {
1667 | "collapsed": false
1668 | },
1669 | "outputs": [
1670 | {
1671 | "data": {
1672 | "text/plain": [
1673 | "array([ 0., 0., 0., ..., 1., 1., 1.])"
1674 | ]
1675 | },
1676 | "execution_count": 57,
1677 | "metadata": {},
1678 | "output_type": "execute_result"
1679 | }
1680 | ],
1681 | "source": [
1682 | "# number of times each token appears across all HAM messages\n",
1683 | "ham_token_count = nb.feature_count_[0, :]\n",
1684 | "ham_token_count"
1685 | ]
1686 | },
1687 | {
1688 | "cell_type": "code",
1689 | "execution_count": 58,
1690 | "metadata": {
1691 | "collapsed": false
1692 | },
1693 | "outputs": [
1694 | {
1695 | "data": {
1696 | "text/plain": [
1697 | "array([ 5., 23., 2., ..., 0., 0., 0.])"
1698 | ]
1699 | },
1700 | "execution_count": 58,
1701 | "metadata": {},
1702 | "output_type": "execute_result"
1703 | }
1704 | ],
1705 | "source": [
1706 | "# number of times each token appears across all SPAM messages\n",
1707 | "spam_token_count = nb.feature_count_[1, :]\n",
1708 | "spam_token_count"
1709 | ]
1710 | },
1711 | {
1712 | "cell_type": "code",
1713 | "execution_count": 59,
1714 | "metadata": {
1715 | "collapsed": false
1716 | },
1717 | "outputs": [
1718 | {
1719 | "data": {
1765 | "text/plain": [
1766 | " ham spam\n",
1767 | "token \n",
1768 | "00 0 5\n",
1769 | "000 0 23\n",
1770 | "008704050406 0 2\n",
1771 | "0121 0 1\n",
1772 | "01223585236 0 1"
1773 | ]
1774 | },
1775 | "execution_count": 59,
1776 | "metadata": {},
1777 | "output_type": "execute_result"
1778 | }
1779 | ],
1780 | "source": [
1781 | "# create a DataFrame of tokens with their separate ham and spam counts\n",
1782 | "tokens = pd.DataFrame({'token':X_train_tokens, 'ham':ham_token_count, 'spam':spam_token_count}).set_index('token')\n",
1783 | "tokens.head()"
1784 | ]
1785 | },
1786 | {
1787 | "cell_type": "code",
1788 | "execution_count": 60,
1789 | "metadata": {
1790 | "collapsed": false
1791 | },
1792 | "outputs": [
1793 | {
1794 | "data": {
1840 | "text/plain": [
1841 | " ham spam\n",
1842 | "token \n",
1843 | "very 64 2\n",
1844 | "nasty 1 1\n",
1845 | "villa 0 1\n",
1846 | "beloved 1 0\n",
1847 | "textoperator 0 2"
1848 | ]
1849 | },
1850 | "execution_count": 60,
1851 | "metadata": {},
1852 | "output_type": "execute_result"
1853 | }
1854 | ],
1855 | "source": [
1856 | "# examine 5 random DataFrame rows\n",
1857 | "tokens.sample(5, random_state=6)"
1858 | ]
1859 | },
1860 | {
1861 | "cell_type": "code",
1862 | "execution_count": 61,
1863 | "metadata": {
1864 | "collapsed": false
1865 | },
1866 | "outputs": [
1867 | {
1868 | "data": {
1869 | "text/plain": [
1870 | "array([ 3617., 562.])"
1871 | ]
1872 | },
1873 | "execution_count": 61,
1874 | "metadata": {},
1875 | "output_type": "execute_result"
1876 | }
1877 | ],
1878 | "source": [
1879 | "# Naive Bayes counts the number of observations in each class\n",
1880 | "nb.class_count_"
1881 | ]
1882 | },
1883 | {
1884 | "cell_type": "markdown",
1885 | "metadata": {},
1886 | "source": [
1887 | "Before we can calculate the \"spamminess\" of each token, we need to avoid **dividing by zero** and account for the **class imbalance**."
1888 | ]
1889 | },
1890 | {
1891 | "cell_type": "code",
1892 | "execution_count": 62,
1893 | "metadata": {
1894 | "collapsed": false
1895 | },
1896 | "outputs": [
1897 | {
1898 | "data": {
1944 | "text/plain": [
1945 | " ham spam\n",
1946 | "token \n",
1947 | "very 65 3\n",
1948 | "nasty 2 2\n",
1949 | "villa 1 2\n",
1950 | "beloved 2 1\n",
1951 | "textoperator 1 3"
1952 | ]
1953 | },
1954 | "execution_count": 62,
1955 | "metadata": {},
1956 | "output_type": "execute_result"
1957 | }
1958 | ],
1959 | "source": [
1960 | "# add 1 to ham and spam counts to avoid dividing by 0\n",
1961 | "tokens['ham'] = tokens.ham + 1\n",
1962 | "tokens['spam'] = tokens.spam + 1\n",
1963 | "tokens.sample(5, random_state=6)"
1964 | ]
1965 | },
1966 | {
1967 | "cell_type": "code",
1968 | "execution_count": 63,
1969 | "metadata": {
1970 | "collapsed": false
1971 | },
1972 | "outputs": [
1973 | {
1974 | "data": {
2020 | "text/plain": [
2021 | " ham spam\n",
2022 | "token \n",
2023 | "very 0.017971 0.005338\n",
2024 | "nasty 0.000553 0.003559\n",
2025 | "villa 0.000276 0.003559\n",
2026 | "beloved 0.000553 0.001779\n",
2027 | "textoperator 0.000276 0.005338"
2028 | ]
2029 | },
2030 | "execution_count": 63,
2031 | "metadata": {},
2032 | "output_type": "execute_result"
2033 | }
2034 | ],
2035 | "source": [
2036 | "# convert the ham and spam counts into frequencies\n",
2037 | "tokens['ham'] = tokens.ham / nb.class_count_[0]\n",
2038 | "tokens['spam'] = tokens.spam / nb.class_count_[1]\n",
2039 | "tokens.sample(5, random_state=6)"
2040 | ]
2041 | },
2042 | {
2043 | "cell_type": "code",
2044 | "execution_count": 64,
2045 | "metadata": {
2046 | "collapsed": false
2047 | },
2048 | "outputs": [
2049 | {
2050 | "data": {
2103 | "text/plain": [
2104 | " ham spam spam_ratio\n",
2105 | "token \n",
2106 | "very 0.017971 0.005338 0.297044\n",
2107 | "nasty 0.000553 0.003559 6.435943\n",
2108 | "villa 0.000276 0.003559 12.871886\n",
2109 | "beloved 0.000553 0.001779 3.217972\n",
2110 | "textoperator 0.000276 0.005338 19.307829"
2111 | ]
2112 | },
2113 | "execution_count": 64,
2114 | "metadata": {},
2115 | "output_type": "execute_result"
2116 | }
2117 | ],
2118 | "source": [
2119 | "# calculate the ratio of spam-to-ham for each token\n",
2120 | "tokens['spam_ratio'] = tokens.spam / tokens.ham\n",
2121 | "tokens.sample(5, random_state=6)"
2122 | ]
2123 | },
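As a sanity check, the three steps above (add 1, divide by the class count, take the ratio) can be reproduced by hand for the token `very`, using its raw counts (64 ham, 2 spam) and the class counts (3617 ham, 562 spam) from the earlier outputs. The variable names are illustrative only.

```python
# raw counts for the token 'very' and the number of messages per class,
# copied from the notebook outputs
ham_count, spam_count = 64, 2
n_ham, n_spam = 3617, 562

# step 1: add 1 to each count to avoid dividing by zero
# step 2: divide by the class size to account for class imbalance
ham_freq = (ham_count + 1) / float(n_ham)
spam_freq = (spam_count + 1) / float(n_spam)

# step 3: ratio of spam frequency to ham frequency
spam_ratio = spam_freq / ham_freq

print(round(ham_freq, 6), round(spam_freq, 6), round(spam_ratio, 6))
```

The rounded values should agree with the `very` row of the DataFrame. A ratio below 1 means the token is relatively more common in ham than in spam.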
2124 | {
2125 | "cell_type": "code",
2126 | "execution_count": 65,
2127 | "metadata": {
2128 | "collapsed": false
2129 | },
2130 | "outputs": [
2131 | {
2132 | "data": {
2522 | "text/plain": [
2523 | " ham spam spam_ratio\n",
2524 | "token \n",
2525 | "claim 0.000276 0.158363 572.798932\n",
2526 | "prize 0.000276 0.135231 489.131673\n",
2527 | "150p 0.000276 0.087189 315.361210\n",
2528 | "tone 0.000276 0.085409 308.925267\n",
2529 | "guaranteed 0.000276 0.076512 276.745552\n",
2530 | "18 0.000276 0.069395 251.001779\n",
2531 | "cs 0.000276 0.065836 238.129893\n",
2532 | "www 0.000553 0.129893 234.911922\n",
2533 | "1000 0.000276 0.056940 205.950178\n",
2534 | "awarded 0.000276 0.053381 193.078292\n",
2535 | "150ppm 0.000276 0.051601 186.642349\n",
2536 | "uk 0.000553 0.099644 180.206406\n",
2537 | "500 0.000276 0.048043 173.770463\n",
2538 | "ringtone 0.000276 0.044484 160.898577\n",
2539 | "000 0.000276 0.042705 154.462633\n",
2540 | "mob 0.000276 0.042705 154.462633\n",
2541 | "co 0.000553 0.078292 141.590747\n",
2542 | "collection 0.000276 0.039146 141.590747\n",
2543 | "valid 0.000276 0.037367 135.154804\n",
2544 | "2000 0.000276 0.037367 135.154804\n",
2545 | "800 0.000276 0.037367 135.154804\n",
2546 | "10p 0.000276 0.037367 135.154804\n",
2547 | "8007 0.000276 0.035587 128.718861\n",
2548 | "16 0.000553 0.067616 122.282918\n",
2549 | "weekly 0.000276 0.033808 122.282918\n",
2550 | "tones 0.000276 0.032028 115.846975\n",
2551 | "land 0.000276 0.032028 115.846975\n",
2552 | "http 0.000276 0.032028 115.846975\n",
2553 | "national 0.000276 0.030249 109.411032\n",
2554 | "5000 0.000276 0.030249 109.411032\n",
2555 | "... ... ... ...\n",
2556 | "went 0.012718 0.001779 0.139912\n",
2557 | "ll 0.052530 0.007117 0.135494\n",
2558 | "told 0.013824 0.001779 0.128719\n",
2559 | "feel 0.013824 0.001779 0.128719\n",
2560 | "gud 0.014100 0.001779 0.126195\n",
2561 | "cos 0.014929 0.001779 0.119184\n",
2562 | "but 0.090683 0.010676 0.117731\n",
2563 | "amp 0.015206 0.001779 0.117017\n",
2564 | "something 0.015206 0.001779 0.117017\n",
2565 | "sure 0.015206 0.001779 0.117017\n",
2566 | "ok 0.061100 0.007117 0.116488\n",
2567 | "said 0.016312 0.001779 0.109084\n",
2568 | "morning 0.016865 0.001779 0.105507\n",
2569 | "yeah 0.017694 0.001779 0.100562\n",
2570 | "lol 0.017694 0.001779 0.100562\n",
2571 | "anything 0.017971 0.001779 0.099015\n",
2572 | "my 0.150401 0.014235 0.094646\n",
2573 | "doing 0.019077 0.001779 0.093275\n",
2574 | "way 0.019630 0.001779 0.090647\n",
2575 | "ask 0.019630 0.001779 0.090647\n",
2576 | "already 0.019630 0.001779 0.090647\n",
2577 | "too 0.021841 0.001779 0.081468\n",
2578 | "come 0.048936 0.003559 0.072723\n",
2579 | "later 0.030688 0.001779 0.057981\n",
2580 | "lor 0.032900 0.001779 0.054084\n",
2581 | "da 0.032900 0.001779 0.054084\n",
2582 | "she 0.035665 0.001779 0.049891\n",
2583 | "he 0.047000 0.001779 0.037858\n",
2584 | "lt 0.064142 0.001779 0.027741\n",
2585 | "gt 0.064971 0.001779 0.027387\n",
2586 | "\n",
2587 | "[7456 rows x 3 columns]"
2588 | ]
2589 | },
2590 | "execution_count": 65,
2591 | "metadata": {},
2592 | "output_type": "execute_result"
2593 | }
2594 | ],
2595 | "source": [
2596 | "# examine the DataFrame sorted by spam_ratio\n",
2597 | "# note: use sort() instead of sort_values() for pandas 0.16.2 and earlier\n",
2598 | "tokens.sort_values('spam_ratio', ascending=False)"
2599 | ]
2600 | },
2601 | {
2602 | "cell_type": "code",
2603 | "execution_count": 66,
2604 | "metadata": {
2605 | "collapsed": false
2606 | },
2607 | "outputs": [
2608 | {
2609 | "data": {
2610 | "text/plain": [
2611 | "83.667259786476862"
2612 | ]
2613 | },
2614 | "execution_count": 66,
2615 | "metadata": {},
2616 | "output_type": "execute_result"
2617 | }
2618 | ],
2619 | "source": [
2620 | "# look up the spam_ratio for a given token\n",
2621 | "tokens.loc['dating', 'spam_ratio']"
2622 | ]
2623 | },
2624 | {
2625 | "cell_type": "markdown",
2626 | "metadata": {},
2627 | "source": [
2628 | "## Part 8: Practicing this workflow on another dataset\n",
2629 | "\n",
2630 | "Please open the **`exercise.ipynb`** notebook (or the **`exercise.py`** script)."
2631 | ]
2632 | },
2633 | {
2634 | "cell_type": "markdown",
2635 | "metadata": {},
2636 | "source": [
2637 | "## Part 9: Tuning the vectorizer (discussion)\n",
2638 | "\n",
2639 | "Thus far, we have been using the default parameters of [CountVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html):"
2640 | ]
2641 | },
2642 | {
2643 | "cell_type": "code",
2644 | "execution_count": 67,
2645 | "metadata": {
2646 | "collapsed": false
2647 | },
2648 | "outputs": [
2649 | {
2650 | "data": {
2651 | "text/plain": [
2652 | "CountVectorizer(analyzer=u'word', binary=False, decode_error=u'strict',\n",
2653 | " dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',\n",
2654 | " lowercase=True, max_df=1.0, max_features=None, min_df=1,\n",
2655 | " ngram_range=(1, 1), preprocessor=None, stop_words=None,\n",
2656 | " strip_accents=None, token_pattern=u'(?u)\\\\b\\\\w\\\\w+\\\\b',\n",
2657 | " tokenizer=None, vocabulary=None)"
2658 | ]
2659 | },
2660 | "execution_count": 67,
2661 | "metadata": {},
2662 | "output_type": "execute_result"
2663 | }
2664 | ],
2665 | "source": [
2666 | "# show default parameters for CountVectorizer\n",
2667 | "vect"
2668 | ]
2669 | },
2670 | {
2671 | "cell_type": "markdown",
2672 | "metadata": {},
2673 | "source": [
2674 | "However, the vectorizer is worth tuning, just like a model is worth tuning! Here are a few parameters that you might want to tune:\n",
2675 | "\n",
2676 | "- **stop_words:** string {'english'}, list, or None (default)\n",
2677 | " - If 'english', a built-in stop word list for English is used.\n",
2678 | " - If a list, that list is assumed to contain stop words, all of which will be removed from the resulting tokens.\n",
2679 | " - If None, no stop words will be used."
2680 | ]
2681 | },
2682 | {
2683 | "cell_type": "code",
2684 | "execution_count": 68,
2685 | "metadata": {
2686 | "collapsed": true
2687 | },
2688 | "outputs": [],
2689 | "source": [
2690 | "# remove English stop words\n",
2691 | "vect = CountVectorizer(stop_words='english')"
2692 | ]
2693 | },
2694 | {
2695 | "cell_type": "markdown",
2696 | "metadata": {},
2697 | "source": [
2698 | "- **ngram_range:** tuple (min_n, max_n), default=(1, 1)\n",
2699 | " - The lower and upper boundary of the range of n-values for different n-grams to be extracted.\n",
2700 | " - All values of n such that min_n <= n <= max_n will be used."
2701 | ]
2702 | },
2703 | {
2704 | "cell_type": "code",
2705 | "execution_count": 69,
2706 | "metadata": {
2707 | "collapsed": true
2708 | },
2709 | "outputs": [],
2710 | "source": [
2711 | "# include 1-grams and 2-grams\n",
2712 | "vect = CountVectorizer(ngram_range=(1, 2))"
2713 | ]
2714 | },
2715 | {
2716 | "cell_type": "markdown",
2717 | "metadata": {},
2718 | "source": [
2719 | "- **max_df:** float in range [0.0, 1.0] or int, default=1.0\n",
2720 | " - When building the vocabulary, ignore terms that have a document frequency strictly higher than the given threshold (corpus-specific stop words).\n",
2721 | " - If float, the parameter represents a proportion of documents.\n",
2722 | " - If integer, the parameter represents an absolute count."
2723 | ]
2724 | },
2725 | {
2726 | "cell_type": "code",
2727 | "execution_count": 70,
2728 | "metadata": {
2729 | "collapsed": true
2730 | },
2731 | "outputs": [],
2732 | "source": [
2733 | "# ignore terms that appear in more than 50% of the documents\n",
2734 | "vect = CountVectorizer(max_df=0.5)"
2735 | ]
2736 | },
2737 | {
2738 | "cell_type": "markdown",
2739 | "metadata": {},
2740 | "source": [
2741 | "- **min_df:** float in range [0.0, 1.0] or int, default=1\n",
2742 | " - When building the vocabulary, ignore terms that have a document frequency strictly lower than the given threshold. (This value is also called \"cut-off\" in the literature.)\n",
2743 | " - If float, the parameter represents a proportion of documents.\n",
2744 | " - If integer, the parameter represents an absolute count."
2745 | ]
2746 | },
2747 | {
2748 | "cell_type": "code",
2749 | "execution_count": 71,
2750 | "metadata": {
2751 | "collapsed": true
2752 | },
2753 | "outputs": [],
2754 | "source": [
2755 | "# only keep terms that appear in at least 2 documents\n",
2756 | "vect = CountVectorizer(min_df=2)"
2757 | ]
2758 | },
2759 | {
2760 | "cell_type": "markdown",
2761 | "metadata": {},
2762 | "source": [
2763 | "**Guidelines for tuning CountVectorizer:**\n",
2764 | "\n",
2765 | "- Use your knowledge of the **problem** and the **text**, and your understanding of the **tuning parameters**, to decide which parameters to tune and how to tune them.\n",
2766 | "- **Experiment**, and let the data tell you the best approach! For example, compare model accuracy on the testing set for each vectorizer setting."
2767 | ]
2768 | }
2769 | ],
2770 | "metadata": {
2771 | "kernelspec": {
2772 | "display_name": "Python 2",
2773 | "language": "python",
2774 | "name": "python2"
2775 | },
2776 | "language_info": {
2777 | "codemirror_mode": {
2778 | "name": "ipython",
2779 | "version": 2
2780 | },
2781 | "file_extension": ".py",
2782 | "mimetype": "text/x-python",
2783 | "name": "python",
2784 | "nbconvert_exporter": "python",
2785 | "pygments_lexer": "ipython2",
2786 | "version": "2.7.11"
2787 | }
2788 | },
2789 | "nbformat": 4,
2790 | "nbformat_minor": 0
2791 | }
2792 |
--------------------------------------------------------------------------------