├── .gitignore ├── exercise.py ├── README.md ├── exercise.ipynb ├── exercise_solution.py ├── tutorial.py ├── tutorial.ipynb ├── exercise_solution.ipynb └── tutorial_with_output.ipynb /.gitignore: -------------------------------------------------------------------------------- 1 | .ipynb_checkpoints/ 2 | .DS_Store 3 | *.pyc 4 | extras/ 5 | *.tpl 6 | -------------------------------------------------------------------------------- /exercise.py: -------------------------------------------------------------------------------- 1 | # # Tutorial Exercise: Yelp reviews 2 | 3 | # ## Introduction 4 | # 5 | # This exercise uses a small subset of the data from Kaggle's [Yelp Business Rating Prediction](https://www.kaggle.com/c/yelp-recsys-2013) competition. 6 | # 7 | # **Description of the data:** 8 | # 9 | # - **`yelp.csv`** contains the dataset. It is stored in the repository (in the **`data`** directory), so there is no need to download anything from the Kaggle website. 10 | # - Each observation (row) in this dataset is a review of a particular business by a particular user. 11 | # - The **stars** column is the number of stars (1 through 5) assigned by the reviewer to the business. (Higher stars is better.) In other words, it is the rating of the business by the person who wrote the review. 12 | # - The **text** column is the text of the review. 13 | # 14 | # **Goal:** Predict the star rating of a review using **only** the review text. 15 | # 16 | # **Tip:** After each task, I recommend that you check the shape and the contents of your objects, to confirm that they match your expectations. 17 | 18 | # ## Task 1 19 | # 20 | # Read **`yelp.csv`** into a pandas DataFrame and examine it. 21 | 22 | # ## Task 2 23 | # 24 | # Create a new DataFrame that only contains the **5-star** and **1-star** reviews. 
25 | # 26 | # - **Hint:** [How do I apply multiple filter criteria to a pandas DataFrame?](http://nbviewer.jupyter.org/github/justmarkham/pandas-videos/blob/master/pandas.ipynb#9.-How-do-I-apply-multiple-filter-criteria-to-a-pandas-DataFrame%3F-%28video%29) explains how to do this. 27 | 28 | # ## Task 3 29 | # 30 | # Define X and y from the new DataFrame, and then split X and y into training and testing sets, using the **review text** as the only feature and the **star rating** as the response. 31 | # 32 | # - **Hint:** Keep in mind that X should be a pandas Series (not a DataFrame), since we will pass it to CountVectorizer in the task that follows. 33 | 34 | # ## Task 4 35 | # 36 | # Use CountVectorizer to create **document-term matrices** from X_train and X_test. 37 | 38 | # ## Task 5 39 | # 40 | # Use multinomial Naive Bayes to **predict the star rating** for the reviews in the testing set, and then **calculate the accuracy** and **print the confusion matrix**. 41 | # 42 | # - **Hint:** [Evaluating a classification model](https://github.com/justmarkham/scikit-learn-videos/blob/master/09_classification_metrics.ipynb) explains how to interpret both classification accuracy and the confusion matrix. 43 | 44 | # ## Task 6 (Challenge) 45 | # 46 | # Calculate the **null accuracy**, which is the classification accuracy that could be achieved by always predicting the most frequent class. 47 | # 48 | # - **Hint:** [Evaluating a classification model](https://github.com/justmarkham/scikit-learn-videos/blob/master/09_classification_metrics.ipynb) explains null accuracy and demonstrates two ways to calculate it, though only one of those ways will work in this case. Alternatively, you can come up with your own method to calculate null accuracy! 49 | 50 | # ## Task 7 (Challenge) 51 | # 52 | # Browse through the review text of some of the **false positives** and **false negatives**. 
Based on your knowledge of how Naive Bayes works, do you have any ideas about why the model is incorrectly classifying these reviews? 53 | # 54 | # - **Hint:** [Evaluating a classification model](https://github.com/justmarkham/scikit-learn-videos/blob/master/09_classification_metrics.ipynb) explains the definitions of "false positives" and "false negatives". 55 | # - **Hint:** Think about what a false positive means in this context, and what a false negative means in this context. What has scikit-learn defined as the "positive class"? 56 | 57 | # ## Task 8 (Challenge) 58 | # 59 | # Calculate which 10 tokens are the most predictive of **5-star reviews**, and which 10 tokens are the most predictive of **1-star reviews**. 60 | # 61 | # - **Hint:** Naive Bayes automatically counts the number of times each token appears in each class, as well as the number of observations in each class. You can access these counts via the `feature_count_` and `class_count_` attributes of the Naive Bayes model object. 62 | 63 | # ## Task 9 (Challenge) 64 | # 65 | # Up to this point, we have framed this as a **binary classification problem** by only considering the 5-star and 1-star reviews. Now, let's repeat the model building process using all reviews, which makes this a **5-class classification problem**. 66 | # 67 | # Here are the steps: 68 | # 69 | # - Define X and y using the original DataFrame. (y should contain 5 different classes.) 70 | # - Split X and y into training and testing sets. 71 | # - Create document-term matrices using CountVectorizer. 72 | # - Calculate the testing accuracy of a Multinomial Naive Bayes model. 73 | # - Compare the testing accuracy with the null accuracy, and comment on the results. 74 | # - Print the confusion matrix, and comment on the results. (This [Stack Overflow answer](http://stackoverflow.com/a/30748053/1636598) explains how to read a multi-class confusion matrix.) 
75 | # - Print the [classification report](http://scikit-learn.org/stable/modules/model_evaluation.html#classification-report), and comment on the results. If you are unfamiliar with the terminology it uses, research the terms, and then try to figure out how to calculate these metrics manually from the confusion matrix! 76 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | ## Tutorial: Machine Learning with Text using Python 2 | 3 | Round of applause to [Kevin Markham](http://www.dataschool.io/) and his video tutorials! 4 | 5 | Instructor: [Vaibhav Srivastav](https://www.linkedin.com/in/vaibhavs10/) 6 | 7 | ### Description 8 | 9 | Although numeric data is easy to work with in Python, most knowledge created by humans is actually raw, unstructured text. By learning how to transform text into data that is usable by machine learning models, you drastically increase the amount of data that your models can learn from. In this tutorial, we'll build and evaluate predictive models from real-world text using scikit-learn. 10 | 11 | ### Objectives 12 | 13 | By the end of this tutorial, attendees will be able to confidently build a predictive model from their own text-based data, including feature extraction, model building and model evaluation. 14 | 15 | ### Required Software 16 | 17 | Attendees will need to bring a laptop with [scikit-learn](http://scikit-learn.org/stable/install.html) and [pandas](http://pandas.pydata.org/pandas-docs/stable/install.html) (and their dependencies) already installed. Installing the [Anaconda distribution of Python](https://www.continuum.io/downloads) is an easy way to accomplish this. Both Python 2 and 3 are welcome. 18 | 19 | I will be leading the tutorial using the IPython/Jupyter notebook, and have added a pre-written notebook to this repository. 
I have also created a Python script that is identical to the notebook, which you can use in the Python environment of your choice. 20 | 21 | ### Tutorial Files 22 | 23 | * IPython/Jupyter notebooks: [tutorial.ipynb](tutorial.ipynb), [tutorial_with_output.ipynb](tutorial_with_output.ipynb), [exercise.ipynb](exercise.ipynb), [exercise_solution.ipynb](exercise_solution.ipynb) 24 | * Python scripts: [tutorial.py](tutorial.py), [exercise.py](exercise.py), [exercise_solution.py](exercise_solution.py) 25 | * Datasets: [data/sms.tsv](data/sms.tsv), [data/yelp.csv](data/yelp.csv) 26 | 27 | ### Prerequisite Knowledge 28 | 29 | Attendees to this tutorial should be comfortable working in Python, should understand the basic principles of machine learning, and should have at least basic experience with both pandas and scikit-learn. However, no knowledge of advanced mathematics is required. 30 | 31 | ### Abstract 32 | 33 | It can be difficult to figure out how to work with text in scikit-learn, even if you're already comfortable with the scikit-learn API. Many questions immediately come up: Which vectorizer should I use, and why? What's the difference between a "fit" and a "transform"? What's a document-term matrix, and why is it so sparse? Is it okay for my training data to have more features than observations? What's the appropriate machine learning model to use? And so on... 34 | 35 | In this tutorial, we'll answer all of those questions, and more! We'll start by walking through the vectorization process in order to understand the input and output formats. Then we'll read a simple dataset into pandas, and immediately apply what we've learned about vectorization. We'll move on to the model building process, including a discussion of which model is most appropriate for the task. We'll evaluate our model a few different ways, and then examine the model for greater insight into how the text is influencing its predictions. 
Finally, we'll practice this entire workflow on a new dataset, and end with a discussion of which parts of the process are worth tuning for improved performance. 36 | 37 | ### Detailed Outline 38 | 39 | 1. Model building in scikit-learn (refresher) 40 | 2. Representing text as numerical data 41 | 3. Reading a text-based dataset into pandas 42 | 4. Vectorizing our dataset 43 | 5. Building and evaluating a model 44 | 6. Comparing models 45 | 7. Examining a model for further insight 46 | 8. Practicing this workflow on another dataset 47 | 9. Tuning the vectorizer (discussion) 48 | 49 | ### About the Instructor 50 | 51 | Vaibhav Srivastav is a Data Scientist currently working with Deloitte Consulting LLP. He has more than three years of demonstrated experience building large-scale Machine Learning and Natural Language Processing solutions for Fortune Technology 10 clients. 52 | 53 | In his free time, he teaches Machine Learning/Data Science to young coders! If Python is what floats your boat, then hit him up on any of the channels below: 54 | 55 | * Email: [Vaibhavs10@gmail.com](mailto:vaibhavs10@gmail.com) 56 | * Twitter: [@Vaibhavsriv10](https://twitter.com/vaibhavsriv10) 57 | 58 | ### Recommended Resources 59 | 60 | **Text classification:** 61 | * Read Paul Graham's classic post, [A Plan for Spam](http://www.paulgraham.com/spam.html), for an overview of a basic text classification system using a Bayesian approach. (He also wrote a [follow-up post](http://www.paulgraham.com/better.html) about how he improved his spam filter.) 62 | * Coursera's Natural Language Processing (NLP) course has [video lectures](https://class.coursera.org/nlp/lecture) on text classification, tokenization, Naive Bayes, and many other fundamental NLP topics. (Here are the [slides](http://web.stanford.edu/~jurafsky/NLPCourseraSlides.html) used in all of the videos.) 
63 | * [Automatically Categorizing Yelp Businesses](http://engineeringblog.yelp.com/2015/09/automatically-categorizing-yelp-businesses.html) discusses how Yelp uses NLP and scikit-learn to solve the problem of uncategorized businesses. 64 | * [How to Read the Mind of a Supreme Court Justice](http://fivethirtyeight.com/features/how-to-read-the-mind-of-a-supreme-court-justice/) discusses CourtCast, a machine learning model that predicts the outcome of Supreme Court cases using text-based features only. (The CourtCast creator wrote a post explaining [how it works](https://sciencecowboy.wordpress.com/2015/03/05/predicting-the-supreme-court-from-oral-arguments/), and the [Python code](https://github.com/nasrallah/CourtCast) is available on GitHub.) 65 | * [Identifying Humorous Cartoon Captions](http://www.cs.huji.ac.il/~dshahaf/pHumor.pdf) is a readable paper about identifying funny captions submitted to the New Yorker Caption Contest. 66 | * In this [PyData video](https://www.youtube.com/watch?v=y3ZTKFZ-1QQ) (50 minutes), Facebook explains how they use scikit-learn for sentiment classification by training a Naive Bayes model on emoji-labeled data. 67 | 68 | **Naive Bayes and logistic regression:** 69 | * Read this brief Quora post on [airport security](http://www.quora.com/In-laymans-terms-how-does-Naive-Bayes-work/answer/Konstantin-Tt) for an intuitive explanation of how Naive Bayes classification works. 70 | * For a longer introduction to Naive Bayes, read Sebastian Raschka's article on [Naive Bayes and Text Classification](http://sebastianraschka.com/Articles/2014_naive_bayes_1.html). As well, Wikipedia has two excellent articles ([Naive Bayes classifier](http://en.wikipedia.org/wiki/Naive_Bayes_classifier) and [Naive Bayes spam filtering](http://en.wikipedia.org/wiki/Naive_Bayes_spam_filtering)), and Cross Validated has a good [Q&A](http://stats.stackexchange.com/questions/21822/understanding-naive-bayes). 
71 | * My [guide to an in-depth understanding of logistic regression](http://www.dataschool.io/guide-to-logistic-regression/) includes a lesson notebook and a curated list of resources for going deeper into this topic. 72 | * [Comparison of Machine Learning Models](https://github.com/justmarkham/DAT8/blob/master/other/model_comparison.md) lists the advantages and disadvantages of Naive Bayes, logistic regression, and other classification and regression models. -------------------------------------------------------------------------------- /exercise.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Tutorial Exercise: Yelp reviews" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "## Introduction\n", 15 | "\n", 16 | "This exercise uses a small subset of the data from Kaggle's [Yelp Business Rating Prediction](https://www.kaggle.com/c/yelp-recsys-2013) competition.\n", 17 | "\n", 18 | "**Description of the data:**\n", 19 | "\n", 20 | "- **`yelp.csv`** contains the dataset. It is stored in the repository (in the **`data`** directory), so there is no need to download anything from the Kaggle website.\n", 21 | "- Each observation (row) in this dataset is a review of a particular business by a particular user.\n", 22 | "- The **stars** column is the number of stars (1 through 5) assigned by the reviewer to the business. (Higher stars is better.) In other words, it is the rating of the business by the person who wrote the review.\n", 23 | "- The **text** column is the text of the review.\n", 24 | "\n", 25 | "**Goal:** Predict the star rating of a review using **only** the review text.\n", 26 | "\n", 27 | "**Tip:** After each task, I recommend that you check the shape and the contents of your objects, to confirm that they match your expectations." 
28 | ] 29 | }, 30 | { 31 | "cell_type": "markdown", 32 | "metadata": {}, 33 | "source": [ 34 | "## Task 1\n", 35 | "\n", 36 | "Read **`yelp.csv`** into a pandas DataFrame and examine it." 37 | ] 38 | }, 39 | { 40 | "cell_type": "markdown", 41 | "metadata": {}, 42 | "source": [ 43 | "## Task 2\n", 44 | "\n", 45 | "Create a new DataFrame that only contains the **5-star** and **1-star** reviews.\n", 46 | "\n", 47 | "- **Hint:** [How do I apply multiple filter criteria to a pandas DataFrame?](http://nbviewer.jupyter.org/github/justmarkham/pandas-videos/blob/master/pandas.ipynb#9.-How-do-I-apply-multiple-filter-criteria-to-a-pandas-DataFrame%3F-%28video%29) explains how to do this." 48 | ] 49 | }, 50 | { 51 | "cell_type": "markdown", 52 | "metadata": {}, 53 | "source": [ 54 | "## Task 3\n", 55 | "\n", 56 | "Define X and y from the new DataFrame, and then split X and y into training and testing sets, using the **review text** as the only feature and the **star rating** as the response.\n", 57 | "\n", 58 | "- **Hint:** Keep in mind that X should be a pandas Series (not a DataFrame), since we will pass it to CountVectorizer in the task that follows." 59 | ] 60 | }, 61 | { 62 | "cell_type": "markdown", 63 | "metadata": {}, 64 | "source": [ 65 | "## Task 4\n", 66 | "\n", 67 | "Use CountVectorizer to create **document-term matrices** from X_train and X_test." 68 | ] 69 | }, 70 | { 71 | "cell_type": "markdown", 72 | "metadata": {}, 73 | "source": [ 74 | "## Task 5\n", 75 | "\n", 76 | "Use multinomial Naive Bayes to **predict the star rating** for the reviews in the testing set, and then **calculate the accuracy** and **print the confusion matrix**.\n", 77 | "\n", 78 | "- **Hint:** [Evaluating a classification model](https://github.com/justmarkham/scikit-learn-videos/blob/master/09_classification_metrics.ipynb) explains how to interpret both classification accuracy and the confusion matrix." 
79 | ] 80 | }, 81 | { 82 | "cell_type": "markdown", 83 | "metadata": {}, 84 | "source": [ 85 | "## Task 6 (Challenge)\n", 86 | "\n", 87 | "Calculate the **null accuracy**, which is the classification accuracy that could be achieved by always predicting the most frequent class.\n", 88 | "\n", 89 | "- **Hint:** [Evaluating a classification model](https://github.com/justmarkham/scikit-learn-videos/blob/master/09_classification_metrics.ipynb) explains null accuracy and demonstrates two ways to calculate it, though only one of those ways will work in this case. Alternatively, you can come up with your own method to calculate null accuracy!" 90 | ] 91 | }, 92 | { 93 | "cell_type": "markdown", 94 | "metadata": {}, 95 | "source": [ 96 | "## Task 7 (Challenge)\n", 97 | "\n", 98 | "Browse through the review text of some of the **false positives** and **false negatives**. Based on your knowledge of how Naive Bayes works, do you have any ideas about why the model is incorrectly classifying these reviews?\n", 99 | "\n", 100 | "- **Hint:** [Evaluating a classification model](https://github.com/justmarkham/scikit-learn-videos/blob/master/09_classification_metrics.ipynb) explains the definitions of \"false positives\" and \"false negatives\".\n", 101 | "- **Hint:** Think about what a false positive means in this context, and what a false negative means in this context. What has scikit-learn defined as the \"positive class\"?" 102 | ] 103 | }, 104 | { 105 | "cell_type": "markdown", 106 | "metadata": {}, 107 | "source": [ 108 | "## Task 8 (Challenge)\n", 109 | "\n", 110 | "Calculate which 10 tokens are the most predictive of **5-star reviews**, and which 10 tokens are the most predictive of **1-star reviews**.\n", 111 | "\n", 112 | "- **Hint:** Naive Bayes automatically counts the number of times each token appears in each class, as well as the number of observations in each class. 
You can access these counts via the `feature_count_` and `class_count_` attributes of the Naive Bayes model object." 113 | ] 114 | }, 115 | { 116 | "cell_type": "markdown", 117 | "metadata": {}, 118 | "source": [ 119 | "## Task 9 (Challenge)\n", 120 | "\n", 121 | "Up to this point, we have framed this as a **binary classification problem** by only considering the 5-star and 1-star reviews. Now, let's repeat the model building process using all reviews, which makes this a **5-class classification problem**.\n", 122 | "\n", 123 | "Here are the steps:\n", 124 | "\n", 125 | "- Define X and y using the original DataFrame. (y should contain 5 different classes.)\n", 126 | "- Split X and y into training and testing sets.\n", 127 | "- Create document-term matrices using CountVectorizer.\n", 128 | "- Calculate the testing accuracy of a Multinomial Naive Bayes model.\n", 129 | "- Compare the testing accuracy with the null accuracy, and comment on the results.\n", 130 | "- Print the confusion matrix, and comment on the results. (This [Stack Overflow answer](http://stackoverflow.com/a/30748053/1636598) explains how to read a multi-class confusion matrix.)\n", 131 | "- Print the [classification report](http://scikit-learn.org/stable/modules/model_evaluation.html#classification-report), and comment on the results. If you are unfamiliar with the terminology it uses, research the terms, and then try to figure out how to calculate these metrics manually from the confusion matrix!" 
132 | ] 133 | } 134 | ], 135 | "metadata": { 136 | "kernelspec": { 137 | "display_name": "Python 2", 138 | "language": "python", 139 | "name": "python2" 140 | }, 141 | "language_info": { 142 | "codemirror_mode": { 143 | "name": "ipython", 144 | "version": 2 145 | }, 146 | "file_extension": ".py", 147 | "mimetype": "text/x-python", 148 | "name": "python", 149 | "nbconvert_exporter": "python", 150 | "pygments_lexer": "ipython2", 151 | "version": "2.7.11" 152 | } 153 | }, 154 | "nbformat": 4, 155 | "nbformat_minor": 0 156 | } 157 | -------------------------------------------------------------------------------- /exercise_solution.py: -------------------------------------------------------------------------------- 1 | # # Tutorial Exercise: Yelp reviews (Solution) 2 | 3 | # ## Introduction 4 | # 5 | # This exercise uses a small subset of the data from Kaggle's [Yelp Business Rating Prediction](https://www.kaggle.com/c/yelp-recsys-2013) competition. 6 | # 7 | # **Description of the data:** 8 | # 9 | # - **`yelp.csv`** contains the dataset. It is stored in the repository (in the **`data`** directory), so there is no need to download anything from the Kaggle website. 10 | # - Each observation (row) in this dataset is a review of a particular business by a particular user. 11 | # - The **stars** column is the number of stars (1 through 5) assigned by the reviewer to the business. (Higher stars is better.) In other words, it is the rating of the business by the person who wrote the review. 12 | # - The **text** column is the text of the review. 13 | # 14 | # **Goal:** Predict the star rating of a review using **only** the review text. 15 | # 16 | # **Tip:** After each task, I recommend that you check the shape and the contents of your objects, to confirm that they match your expectations. 
17 | 18 | # for Python 2: use print only as a function 19 | from __future__ import print_function 20 | 21 | 22 | # ## Task 1 23 | # 24 | # Read **`yelp.csv`** into a pandas DataFrame and examine it. 25 | 26 | # read yelp.csv using a relative path 27 | import pandas as pd 28 | path = 'data/yelp.csv' 29 | yelp = pd.read_csv(path) 30 | 31 | 32 | # examine the shape 33 | yelp.shape 34 | 35 | 36 | # examine the first row 37 | yelp.head(1) 38 | 39 | 40 | # examine the class distribution 41 | yelp.stars.value_counts().sort_index() 42 | 43 | 44 | # ## Task 2 45 | # 46 | # Create a new DataFrame that only contains the **5-star** and **1-star** reviews. 47 | # 48 | # - **Hint:** [How do I apply multiple filter criteria to a pandas DataFrame?](http://nbviewer.jupyter.org/github/justmarkham/pandas-videos/blob/master/pandas.ipynb#9.-How-do-I-apply-multiple-filter-criteria-to-a-pandas-DataFrame%3F-%28video%29) explains how to do this. 49 | 50 | # filter the DataFrame using an OR condition 51 | yelp_best_worst = yelp[(yelp.stars==5) | (yelp.stars==1)] 52 | 53 | # equivalently, use the 'loc' indexer 54 | yelp_best_worst = yelp.loc[(yelp.stars==5) | (yelp.stars==1), :] 55 | 56 | 57 | # examine the shape 58 | yelp_best_worst.shape 59 | 60 | 61 | # ## Task 3 62 | # 63 | # Define X and y from the new DataFrame, and then split X and y into training and testing sets, using the **review text** as the only feature and the **star rating** as the response. 64 | # 65 | # - **Hint:** Keep in mind that X should be a pandas Series (not a DataFrame), since we will pass it to CountVectorizer in the task that follows. 
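The Series-versus-DataFrame hint can be illustrated with a small sketch. It uses a made-up three-row DataFrame (the `reviews` name and its contents are hypothetical stand-ins, not the Yelp data) and relies on the fact that CountVectorizer simply iterates over whatever it is given: iterating a Series yields its values, while iterating a DataFrame yields its column names.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# hypothetical stand-in for the filtered Yelp DataFrame
reviews = pd.DataFrame({
    'text': ['great food', 'terrible service', 'great service'],
    'stars': [5, 1, 5],
})

# a Series iterates over its values, so each review becomes one document
X_series = reviews.text
dtm_series = CountVectorizer().fit_transform(X_series)
print(dtm_series.shape)  # one row per review

# a DataFrame iterates over its column names, so the vectorizer only
# sees a single "document" containing the word 'text'
X_frame = reviews[['text']]
dtm_frame = CountVectorizer().fit_transform(X_frame)
print(dtm_frame.shape)  # one row per column, not per review
```

Checking the shape of the document-term matrix after vectorizing, as the tip at the top of the exercise suggests, is a quick way to catch this mistake.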
66 | 67 | # define X and y 68 | X = yelp_best_worst.text 69 | y = yelp_best_worst.stars 70 | 71 | 72 | # split X and y into training and testing sets (sklearn.cross_validation was removed in scikit-learn 0.20) 73 | from sklearn.model_selection import train_test_split 74 | X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1) 75 | 76 | 77 | # examine the object shapes 78 | print(X_train.shape) 79 | print(X_test.shape) 80 | print(y_train.shape) 81 | print(y_test.shape) 82 | 83 | 84 | # ## Task 4 85 | # 86 | # Use CountVectorizer to create **document-term matrices** from X_train and X_test. 87 | 88 | # import and instantiate CountVectorizer 89 | from sklearn.feature_extraction.text import CountVectorizer 90 | vect = CountVectorizer() 91 | 92 | 93 | # fit and transform X_train into X_train_dtm 94 | X_train_dtm = vect.fit_transform(X_train) 95 | X_train_dtm.shape 96 | 97 | 98 | # transform X_test into X_test_dtm 99 | X_test_dtm = vect.transform(X_test) 100 | X_test_dtm.shape 101 | 102 | 103 | # ## Task 5 104 | # 105 | # Use multinomial Naive Bayes to **predict the star rating** for the reviews in the testing set, and then **calculate the accuracy** and **print the confusion matrix**. 106 | # 107 | # - **Hint:** [Evaluating a classification model](https://github.com/justmarkham/scikit-learn-videos/blob/master/09_classification_metrics.ipynb) explains how to interpret both classification accuracy and the confusion matrix. 
108 | 109 | # import and instantiate MultinomialNB 110 | from sklearn.naive_bayes import MultinomialNB 111 | nb = MultinomialNB() 112 | 113 | 114 | # train the model using X_train_dtm 115 | nb.fit(X_train_dtm, y_train) 116 | 117 | 118 | # make class predictions for X_test_dtm 119 | y_pred_class = nb.predict(X_test_dtm) 120 | 121 | 122 | # calculate accuracy of class predictions 123 | from sklearn import metrics 124 | metrics.accuracy_score(y_test, y_pred_class) 125 | 126 | 127 | # print the confusion matrix 128 | metrics.confusion_matrix(y_test, y_pred_class) 129 | 130 | 131 | # ## Task 6 (Challenge) 132 | # 133 | # Calculate the **null accuracy**, which is the classification accuracy that could be achieved by always predicting the most frequent class. 134 | # 135 | # - **Hint:** [Evaluating a classification model](https://github.com/justmarkham/scikit-learn-videos/blob/master/09_classification_metrics.ipynb) explains null accuracy and demonstrates two ways to calculate it, though only one of those ways will work in this case. Alternatively, you can come up with your own method to calculate null accuracy! 136 | 137 | # examine the class distribution of the testing set 138 | y_test.value_counts() 139 | 140 | 141 | # calculate null accuracy 142 | y_test.value_counts().head(1) / y_test.shape 143 | 144 | 145 | # calculate null accuracy manually 146 | 838 / float(838 + 184) 147 | 148 | 149 | # ## Task 7 (Challenge) 150 | # 151 | # Browse through the review text of some of the **false positives** and **false negatives**. Based on your knowledge of how Naive Bayes works, do you have any ideas about why the model is incorrectly classifying these reviews? 152 | # 153 | # - **Hint:** [Evaluating a classification model](https://github.com/justmarkham/scikit-learn-videos/blob/master/09_classification_metrics.ipynb) explains the definitions of "false positives" and "false negatives". 
154 | # - **Hint:** Think about what a false positive means in this context, and what a false negative means in this context. What has scikit-learn defined as the "positive class"? 155 | 156 | # first 10 false positives (1-star reviews incorrectly classified as 5-star reviews) 157 | X_test[y_test < y_pred_class].head(10) 158 | 159 | 160 | # false positive: model is reacting to the words "good", "impressive", "nice" 161 | X_test[1781] 162 | 163 | 164 | # false positive: model does not have enough data to work with 165 | X_test[1919] 166 | 167 | 168 | # first 10 false negatives (5-star reviews incorrectly classified as 1-star reviews) 169 | X_test[y_test > y_pred_class].head(10) 170 | 171 | 172 | # false negative: model is reacting to the words "complain", "crowds", "rushing", "pricey", "scum" 173 | X_test[4963] 174 | 175 | 176 | # ## Task 8 (Challenge) 177 | # 178 | # Calculate which 10 tokens are the most predictive of **5-star reviews**, and which 10 tokens are the most predictive of **1-star reviews**. 179 | # 180 | # - **Hint:** Naive Bayes automatically counts the number of times each token appears in each class, as well as the number of observations in each class. You can access these counts via the `feature_count_` and `class_count_` attributes of the Naive Bayes model object. 
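Before working with the real vocabulary, it may help to see what `feature_count_` and `class_count_` hold on a corpus small enough to count by hand. The three documents and their labels below are made up for illustration, and the `toy_` names are hypothetical so they do not clash with the objects used in the solution.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# made-up corpus: two 5-star documents and one 1-star document
toy_docs = ['great great food', 'awful food', 'great place']
toy_labels = [5, 1, 5]

toy_vect = CountVectorizer()
toy_dtm = toy_vect.fit_transform(toy_docs)
toy_nb = MultinomialNB().fit(toy_dtm, toy_labels)

# tokens in column order (the vocabulary maps each token to its column index)
toy_tokens = sorted(toy_vect.vocabulary_, key=toy_vect.vocabulary_.get)
print(toy_tokens)

# classes are sorted, so row 0 is the 1-star class and row 1 the 5-star class
print(toy_nb.classes_)

# one row per class, one column per token: total occurrences of each token
print(toy_nb.feature_count_)

# number of training documents in each class
print(toy_nb.class_count_)
```

Here "great" appears three times across the two 5-star documents and never in the 1-star document, which is the kind of imbalance the five-star ratio in this task is meant to surface.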
181 | 182 | # store the vocabulary of X_train (use get_feature_names() for scikit-learn older than 1.0) 183 | X_train_tokens = vect.get_feature_names_out() 184 | len(X_train_tokens) 185 | 186 | 187 | # first row is one-star reviews, second row is five-star reviews 188 | nb.feature_count_.shape 189 | 190 | 191 | # store the number of times each token appears across each class 192 | one_star_token_count = nb.feature_count_[0, :] 193 | five_star_token_count = nb.feature_count_[1, :] 194 | 195 | 196 | # create a DataFrame of tokens with their separate one-star and five-star counts 197 | tokens = pd.DataFrame({'token':X_train_tokens, 'one_star':one_star_token_count, 'five_star':five_star_token_count}).set_index('token') 198 | 199 | 200 | # add 1 to one-star and five-star counts to avoid dividing by 0 201 | tokens['one_star'] = tokens.one_star + 1 202 | tokens['five_star'] = tokens.five_star + 1 203 | 204 | 205 | # first number is one-star reviews, second number is five-star reviews 206 | nb.class_count_ 207 | 208 | 209 | # convert the one-star and five-star counts into frequencies 210 | tokens['one_star'] = tokens.one_star / nb.class_count_[0] 211 | tokens['five_star'] = tokens.five_star / nb.class_count_[1] 212 | 213 | 214 | # calculate the ratio of five-star to one-star for each token 215 | tokens['five_star_ratio'] = tokens.five_star / tokens.one_star 216 | 217 | 218 | # sort the DataFrame by five_star_ratio (descending order), and examine the first 10 rows 219 | # note: use sort() instead of sort_values() for pandas 0.16.2 and earlier 220 | tokens.sort_values('five_star_ratio', ascending=False).head(10) 221 | 222 | 223 | # sort the DataFrame by five_star_ratio (ascending order), and examine the first 10 rows 224 | tokens.sort_values('five_star_ratio', ascending=True).head(10) 225 | 226 | 227 | # ## Task 9 (Challenge) 228 | # 229 | # Up to this point, we have framed this as a **binary classification problem** by only considering the 5-star and 1-star reviews. 
Now, let's repeat the model building process using all reviews, which makes this a **5-class classification problem**. 230 | # 231 | # Here are the steps: 232 | # 233 | # - Define X and y using the original DataFrame. (y should contain 5 different classes.) 234 | # - Split X and y into training and testing sets. 235 | # - Create document-term matrices using CountVectorizer. 236 | # - Calculate the testing accuracy of a Multinomial Naive Bayes model. 237 | # - Compare the testing accuracy with the null accuracy, and comment on the results. 238 | # - Print the confusion matrix, and comment on the results. (This [Stack Overflow answer](http://stackoverflow.com/a/30748053/1636598) explains how to read a multi-class confusion matrix.) 239 | # - Print the [classification report](http://scikit-learn.org/stable/modules/model_evaluation.html#classification-report), and comment on the results. If you are unfamiliar with the terminology it uses, research the terms, and then try to figure out how to calculate these metrics manually from the confusion matrix! 
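The last step, calculating the classification report metrics manually from the confusion matrix, can be sketched for every class at once. The labels below are made up (three classes, nine observations) rather than taken from the Yelp data. The sketch assumes scikit-learn's convention that rows of the confusion matrix are true classes and columns are predicted classes.

```python
import numpy as np
from sklearn import metrics

# made-up true and predicted labels for a 3-class problem
y_true = [1, 1, 1, 2, 2, 3, 3, 3, 3]
y_pred = [1, 2, 1, 2, 3, 3, 3, 1, 3]

# rows are true classes, columns are predicted classes
cm = metrics.confusion_matrix(y_true, y_pred)
print(cm)

# the diagonal holds the correctly classified observations for each class
correct = np.diag(cm)

# precision: correct predictions divided by the column sums (all predictions of that class)
precision = correct / cm.sum(axis=0)

# recall: correct predictions divided by the row sums (all true members of that class)
recall = correct / cm.sum(axis=1)

# F1 score: harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)

# support: number of true observations in each class
support = cm.sum(axis=1)

print(precision, recall, f1, support)
```

A class that is never predicted has a column sum of zero, so a real implementation would need to guard against that division. This sketch avoids the case by construction.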
240 | 241 | # define X and y using the original DataFrame 242 | X = yelp.text 243 | y = yelp.stars 244 | 245 | 246 | # check that y contains 5 different classes 247 | y.value_counts().sort_index() 248 | 249 | 250 | # split X and y into training and testing sets 251 | X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1) 252 | 253 | 254 | # create document-term matrices using CountVectorizer 255 | X_train_dtm = vect.fit_transform(X_train) 256 | X_test_dtm = vect.transform(X_test) 257 | 258 | 259 | # fit a Multinomial Naive Bayes model 260 | nb.fit(X_train_dtm, y_train) 261 | 262 | 263 | # make class predictions 264 | y_pred_class = nb.predict(X_test_dtm) 265 | 266 | 267 | # calculate the accuracy 268 | metrics.accuracy_score(y_test, y_pred_class) 269 | 270 | 271 | # calculate the null accuracy 272 | y_test.value_counts().head(1) / y_test.shape 273 | 274 | 275 | # **Accuracy comments:** At first glance, 47% accuracy does not seem very good, given that it is not much higher than the null accuracy. However, I would consider the 47% accuracy to be quite impressive, given that humans would also have a hard time precisely identifying the star rating for many of these reviews. 276 | 277 | # print the confusion matrix 278 | metrics.confusion_matrix(y_test, y_pred_class) 279 | 280 | 281 | # **Confusion matrix comments:** 282 | # 283 | # - Nearly all 4-star and 5-star reviews are classified as 4 or 5 stars, but they are hard for the model to distinguish between. 284 | # - 1-star, 2-star, and 3-star reviews are most commonly classified as 4 stars, probably because it's the predominant class in the training data. 285 | 286 | # print the classification report 287 | print(metrics.classification_report(y_test, y_pred_class)) 288 | 289 | 290 | # **Precision** answers the question: "When a given class is predicted, how often are those predictions correct?" 
To calculate the precision for class 1, for example, you divide 55 by the sum of the first column of the confusion matrix. 291 | 292 | # manually calculate the precision for class 1 293 | precision = 55 / float(55 + 28 + 5 + 7 + 6) 294 | print(precision) 295 | 296 | 297 | # **Recall** answers the question: "When a given class is the true class, how often is that class predicted?" To calculate the recall for class 1, for example, you divide 55 by the sum of the first row of the confusion matrix. 298 | 299 | # manually calculate the recall for class 1 300 | recall = 55 / float(55 + 14 + 24 + 65 + 27) 301 | print(recall) 302 | 303 | 304 | # **F1 score** is the harmonic mean of precision and recall. 305 | 306 | # manually calculate the F1 score for class 1 307 | f1 = 2 * (precision * recall) / (precision + recall) 308 | print(f1) 309 | 310 | 311 | # **Support** answers the question: "How many observations exist for which a given class is the true class?" To calculate the support for class 1, for example, you sum the first row of the confusion matrix. 312 | 313 | # manually calculate the support for class 1 314 | support = 55 + 14 + 24 + 65 + 27 315 | print(support) 316 | 317 | 318 | # **Classification report comments:** 319 | # 320 | # - Class 1 has low recall, meaning that the model has a hard time detecting the 1-star reviews, but high precision, meaning that when the model predicts a review is 1-star, it's usually correct. 321 | # - Class 5 has high recall and precision, probably because 5-star reviews have polarized language, and because the model has a lot of observations to learn from. 322 | -------------------------------------------------------------------------------- /tutorial.py: -------------------------------------------------------------------------------- 1 | # # Tutorial: Machine Learning with Text in scikit-learn 2 | 3 | # ## Agenda 4 | # 5 | # 1. Model building in scikit-learn (refresher) 6 | # 2. Representing text as numerical data 7 | # 3. 
Reading a text-based dataset into pandas 8 | # 4. Vectorizing our dataset 9 | # 5. Building and evaluating a model 10 | # 6. Comparing models 11 | # 7. Examining a model for further insight 12 | # 8. Practicing this workflow on another dataset 13 | # 9. Tuning the vectorizer (discussion) 14 | 15 | # for Python 2: use print only as a function 16 | from __future__ import print_function 17 | 18 | 19 | # ## Part 1: Model building in scikit-learn (refresher) 20 | 21 | # load the iris dataset as an example 22 | from sklearn.datasets import load_iris 23 | iris = load_iris() 24 | 25 | 26 | # store the feature matrix (X) and response vector (y) 27 | X = iris.data 28 | y = iris.target 29 | 30 | 31 | # **"Features"** are also known as predictors, inputs, or attributes. The **"response"** is also known as the target, label, or output. 32 | 33 | # check the shapes of X and y 34 | print(X.shape) 35 | print(y.shape) 36 | 37 | 38 | # **"Observations"** are also known as samples, instances, or records. 39 | 40 | # examine the first 5 rows of the feature matrix (including the feature names) 41 | import pandas as pd 42 | pd.DataFrame(X, columns=iris.feature_names).head() 43 | 44 | 45 | # examine the response vector 46 | print(y) 47 | 48 | 49 | # In order to **build a model**, the features must be **numeric**, and every observation must have the **same features in the same order**. 50 | 51 | # import the class 52 | from sklearn.neighbors import KNeighborsClassifier 53 | 54 | # instantiate the model (with the default parameters) 55 | knn = KNeighborsClassifier() 56 | 57 | # fit the model with data (occurs in-place) 58 | knn.fit(X, y) 59 | 60 | 61 | # In order to **make a prediction**, the new observation must have the **same features as the training observations**, both in number and meaning. 
62 | 63 | # predict the response for a new observation 64 | knn.predict([[3, 5, 4, 2]]) 65 | 66 | 67 | # ## Part 2: Representing text as numerical data 68 | 69 | # example text for model training (SMS messages) 70 | simple_train = ['call you tonight', 'Call me a cab', 'please call me... PLEASE!'] 71 | 72 | 73 | # From the [scikit-learn documentation](http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction): 74 | # 75 | # > Text Analysis is a major application field for machine learning algorithms. However the raw data, a sequence of symbols cannot be fed directly to the algorithms themselves as most of them expect **numerical feature vectors with a fixed size** rather than the **raw text documents with variable length**. 76 | # 77 | # We will use [CountVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) to "convert text into a matrix of token counts": 78 | 79 | # import and instantiate CountVectorizer (with the default parameters) 80 | from sklearn.feature_extraction.text import CountVectorizer 81 | vect = CountVectorizer() 82 | 83 | 84 | # learn the 'vocabulary' of the training data (occurs in-place) 85 | vect.fit(simple_train) 86 | 87 | 88 | # examine the fitted vocabulary 89 | vect.get_feature_names() 90 | 91 | 92 | # transform training data into a 'document-term matrix' 93 | simple_train_dtm = vect.transform(simple_train) 94 | simple_train_dtm 95 | 96 | 97 | # convert sparse matrix to a dense matrix 98 | simple_train_dtm.toarray() 99 | 100 | 101 | # examine the vocabulary and document-term matrix together 102 | pd.DataFrame(simple_train_dtm.toarray(), columns=vect.get_feature_names()) 103 | 104 | 105 | # From the [scikit-learn documentation](http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction): 106 | # 107 | # > In this scheme, features and samples are defined as follows: 108 | # 109 | # > - Each individual token occurrence frequency 
(normalized or not) is treated as a **feature**. 110 | # > - The vector of all the token frequencies for a given document is considered a multivariate **sample**. 111 | # 112 | # > A **corpus of documents** can thus be represented by a matrix with **one row per document** and **one column per token** (e.g. word) occurring in the corpus. 113 | # 114 | # > We call **vectorization** the general process of turning a collection of text documents into numerical feature vectors. This specific strategy (tokenization, counting and normalization) is called the **Bag of Words** or "Bag of n-grams" representation. Documents are described by word occurrences while completely ignoring the relative position information of the words in the document. 115 | 116 | # check the type of the document-term matrix 117 | type(simple_train_dtm) 118 | 119 | 120 | # examine the sparse matrix contents 121 | print(simple_train_dtm) 122 | 123 | 124 | # From the [scikit-learn documentation](http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction): 125 | # 126 | # > As most documents will typically use a very small subset of the words used in the corpus, the resulting matrix will have **many feature values that are zeros** (typically more than 99% of them). 127 | # 128 | # > For instance, a collection of 10,000 short text documents (such as emails) will use a vocabulary with a size in the order of 100,000 unique words in total while each document will use 100 to 1000 unique words individually. 129 | # 130 | # > In order to be able to **store such a matrix in memory** but also to **speed up operations**, implementations will typically use a **sparse representation** such as the implementations available in the `scipy.sparse` package. 
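# To make the sparsity point concrete, here is a rough standard-library sketch that rebuilds the document-term matrix for `simple_train` while storing only the non-zero cells. It is not how scikit-learn or `scipy.sparse` are implemented: the `tokenize` helper only approximates CountVectorizer's default behaviour (lowercasing, tokens of two or more word characters), and the names `docs`, `column` and `sparse_counts` are made up for this example.

```python
import re
from collections import Counter

docs = ['call you tonight', 'Call me a cab', 'please call me... PLEASE!']


def tokenize(text):
    # approximation of the default tokenizer: lowercase, 2+ word characters
    return re.findall(r"\b\w\w+\b", text.lower())


# learn the vocabulary and assign one column index per token
vocabulary = sorted({tok for doc in docs for tok in tokenize(doc)})
column = {tok: j for j, tok in enumerate(vocabulary)}

# store only the non-zero counts, keyed by (row, column)
sparse_counts = {}
for i, doc in enumerate(docs):
    for tok, n in Counter(tokenize(doc)).items():
        sparse_counts[(i, column[tok])] = n

dense_cells = len(docs) * len(vocabulary)
stored_cells = len(sparse_counts)
print(vocabulary)
print(stored_cells, 'of', dense_cells, 'cells are non-zero')
```

# With only three short messages, half of the cells are already zero; the proportion of zeros grows quickly as the vocabulary gets larger.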
131 | 132 | # example text for model testing 133 | simple_test = ["please don't call me"] 134 | 135 | 136 | # In order to **make a prediction**, the new observation must have the **same features as the training observations**, both in number and meaning. 137 | 138 | # transform testing data into a document-term matrix (using existing vocabulary) 139 | simple_test_dtm = vect.transform(simple_test) 140 | simple_test_dtm.toarray() 141 | 142 | 143 | # examine the vocabulary and document-term matrix together 144 | pd.DataFrame(simple_test_dtm.toarray(), columns=vect.get_feature_names()) 145 | 146 | 147 | # **Summary:** 148 | # 149 | # - `vect.fit(train)` **learns the vocabulary** of the training data 150 | # - `vect.transform(train)` uses the **fitted vocabulary** to build a document-term matrix from the training data 151 | # - `vect.transform(test)` uses the **fitted vocabulary** to build a document-term matrix from the testing data (and **ignores tokens** it hasn't seen before) 152 | 153 | # ## Part 3: Reading a text-based dataset into pandas 154 | 155 | # read file into pandas using a relative path 156 | path = 'data/sms.tsv' 157 | sms = pd.read_table(path, header=None, names=['label', 'message']) 158 | 159 | 160 | # alternative: read file into pandas from a URL 161 | # url = 'https://raw.githubusercontent.com/justmarkham/pycon-2016-tutorial/master/data/sms.tsv' 162 | # sms = pd.read_table(url, header=None, names=['label', 'message']) 163 | 164 | 165 | # examine the shape 166 | sms.shape 167 | 168 | 169 | # examine the first 10 rows 170 | sms.head(10) 171 | 172 | 173 | # examine the class distribution 174 | sms.label.value_counts() 175 | 176 | 177 | # convert label to a numerical variable 178 | sms['label_num'] = sms.label.map({'ham':0, 'spam':1}) 179 | 180 | 181 | # check that the conversion worked 182 | sms.head(10) 183 | 184 | 185 | # how to define X and y (from the iris data) for use with a MODEL 186 | X = iris.data 187 | y = iris.target 188 | print(X.shape) 189 
| print(y.shape) 190 | 191 | 192 | # how to define X and y (from the SMS data) for use with COUNTVECTORIZER 193 | X = sms.message 194 | y = sms.label_num 195 | print(X.shape) 196 | print(y.shape) 197 | 198 | 199 | # split X and y into training and testing sets 200 | from sklearn.model_selection import train_test_split 201 | X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1) 202 | print(X_train.shape) 203 | print(X_test.shape) 204 | print(y_train.shape) 205 | print(y_test.shape) 206 | 207 | 208 | # ## Part 4: Vectorizing our dataset 209 | 210 | # instantiate the vectorizer 211 | vect = CountVectorizer() 212 | 213 | 214 | # learn training data vocabulary, then use it to create a document-term matrix 215 | vect.fit(X_train) 216 | X_train_dtm = vect.transform(X_train) 217 | 218 | 219 | # equivalently: combine fit and transform into a single step 220 | X_train_dtm = vect.fit_transform(X_train) 221 | 222 | 223 | # examine the document-term matrix 224 | X_train_dtm 225 | 226 | 227 | # transform testing data (using fitted vocabulary) into a document-term matrix 228 | X_test_dtm = vect.transform(X_test) 229 | X_test_dtm 230 | 231 | 232 | # ## Part 5: Building and evaluating a model 233 | # 234 | # We will use [multinomial Naive Bayes](http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html): 235 | # 236 | # > The multinomial Naive Bayes classifier is suitable for classification with **discrete features** (e.g., word counts for text classification). The multinomial distribution normally requires integer feature counts. However, in practice, fractional counts such as tf-idf may also work. 
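# To show why count features suit this model, here is a from-scratch sketch of multinomial Naive Bayes with add-one (Laplace) smoothing. It is a simplified illustration, not scikit-learn's implementation, and the helper names (`fit_multinomial_nb`, `predict`) and the toy counts are assumptions made up for this example.

```python
import math


def fit_multinomial_nb(X, y, alpha=1.0):
    """Estimate class priors and smoothed per-class token probabilities."""
    classes = sorted(set(y))
    n_features = len(X[0])
    model = {}
    for c in classes:
        rows = [x for x, label in zip(X, y) if label == c]
        # total count of each token across all documents of class c
        feature_count = [sum(r[j] for r in rows) for j in range(n_features)]
        total = sum(feature_count)
        model[c] = {
            'log_prior': math.log(len(rows) / float(len(X))),
            'log_prob': [math.log((n + alpha) / (total + alpha * n_features))
                         for n in feature_count],
        }
    return model


def predict(model, x):
    """Return the class with the highest log-posterior score for counts x."""
    def score(c):
        return model[c]['log_prior'] + sum(
            n * lp for n, lp in zip(x, model[c]['log_prob']))
    return max(model, key=score)


# toy document-term matrix: 4 documents, 3 tokens (made-up counts)
X_toy = [[2, 0, 1], [3, 0, 0], [0, 2, 1], [0, 3, 0]]
y_toy = [0, 0, 1, 1]

toy_model = fit_multinomial_nb(X_toy, y_toy)
print(predict(toy_model, [1, 0, 0]))
print(predict(toy_model, [0, 2, 1]))
```

# Each token count multiplies that token's log-probability, which is why integer counts are the natural input for this model.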
237 | 238 | # import and instantiate a Multinomial Naive Bayes model 239 | from sklearn.naive_bayes import MultinomialNB 240 | nb = MultinomialNB() 241 | 242 | 243 | # train the model using X_train_dtm 244 | nb.fit(X_train_dtm, y_train) 245 | 246 | 247 | # make class predictions for X_test_dtm 248 | y_pred_class = nb.predict(X_test_dtm) 249 | 250 | 251 | # calculate accuracy of class predictions 252 | from sklearn import metrics 253 | metrics.accuracy_score(y_test, y_pred_class) 254 | 255 | 256 | # print the confusion matrix 257 | metrics.confusion_matrix(y_test, y_pred_class) 258 | 259 | 260 | # print message text for the false positives (ham incorrectly classified as spam) 261 | 262 | 263 | # print message text for the false negatives (spam incorrectly classified as ham) 264 | 265 | 266 | # example false negative 267 | X_test[3132] 268 | 269 | 270 | # calculate predicted probabilities for X_test_dtm (poorly calibrated) 271 | y_pred_prob = nb.predict_proba(X_test_dtm)[:, 1] 272 | y_pred_prob 273 | 274 | 275 | # calculate AUC 276 | metrics.roc_auc_score(y_test, y_pred_prob) 277 | 278 | 279 | # ## Part 6: Comparing models 280 | # 281 | # We will compare multinomial Naive Bayes with [logistic regression](http://scikit-learn.org/stable/modules/linear_model.html#logistic-regression): 282 | # 283 | # > Logistic regression, despite its name, is a **linear model for classification** rather than regression. Logistic regression is also known in the literature as logit regression, maximum-entropy classification (MaxEnt) or the log-linear classifier. In this model, the probabilities describing the possible outcomes of a single trial are modeled using a logistic function. 
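# The "logistic function" mentioned in the quote maps any real-valued score to a probability between 0 and 1. A minimal sketch of the function itself follows; the name `logistic` and the sample scores are chosen for illustration only.

```python
import math


def logistic(z):
    """Map a real-valued score z to a value strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


# a score of 0 sits exactly on the decision boundary
for score in (-4.0, 0.0, 4.0):
    print(score, round(logistic(score), 3))
```

# In a fitted logistic regression model, the score is a linear combination of the features, which is why it counts as a linear model.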
284 | 285 | # import and instantiate a logistic regression model 286 | from sklearn.linear_model import LogisticRegression 287 | logreg = LogisticRegression() 288 | 289 | 290 | # train the model using X_train_dtm 291 | logreg.fit(X_train_dtm, y_train) 292 | 293 | 294 | # make class predictions for X_test_dtm 295 | y_pred_class = logreg.predict(X_test_dtm) 296 | 297 | 298 | # calculate predicted probabilities for X_test_dtm (well calibrated) 299 | y_pred_prob = logreg.predict_proba(X_test_dtm)[:, 1] 300 | y_pred_prob 301 | 302 | 303 | # calculate accuracy 304 | metrics.accuracy_score(y_test, y_pred_class) 305 | 306 | 307 | # calculate AUC 308 | metrics.roc_auc_score(y_test, y_pred_prob) 309 | 310 | 311 | # ## Part 7: Examining a model for further insight 312 | # 313 | # We will examine our **trained Naive Bayes model** to calculate the approximate **"spamminess" of each token**. 314 | 315 | # store the vocabulary of X_train 316 | X_train_tokens = vect.get_feature_names() 317 | len(X_train_tokens) 318 | 319 | 320 | # examine the first 50 tokens 321 | print(X_train_tokens[0:50]) 322 | 323 | 324 | # examine the last 50 tokens 325 | print(X_train_tokens[-50:]) 326 | 327 | 328 | # Naive Bayes counts the number of times each token appears in each class 329 | nb.feature_count_ 330 | 331 | 332 | # rows represent classes, columns represent tokens 333 | nb.feature_count_.shape 334 | 335 | 336 | # number of times each token appears across all HAM messages 337 | ham_token_count = nb.feature_count_[0, :] 338 | ham_token_count 339 | 340 | 341 | # number of times each token appears across all SPAM messages 342 | spam_token_count = nb.feature_count_[1, :] 343 | spam_token_count 344 | 345 | 346 | # create a DataFrame of tokens with their separate ham and spam counts 347 | tokens = pd.DataFrame({'token':X_train_tokens, 'ham':ham_token_count, 'spam':spam_token_count}).set_index('token') 348 | tokens.head() 349 | 350 | 351 | # examine 5 random DataFrame rows 352 | tokens.sample(5, 
random_state=6) 353 | 354 | 355 | # Naive Bayes counts the number of observations in each class 356 | nb.class_count_ 357 | 358 | 359 | # Before we can calculate the "spamminess" of each token, we need to avoid **dividing by zero** and account for the **class imbalance**. 360 | 361 | # add 1 to ham and spam counts to avoid dividing by 0 362 | tokens['ham'] = tokens.ham + 1 363 | tokens['spam'] = tokens.spam + 1 364 | tokens.sample(5, random_state=6) 365 | 366 | 367 | # convert the ham and spam counts into frequencies 368 | tokens['ham'] = tokens.ham / nb.class_count_[0] 369 | tokens['spam'] = tokens.spam / nb.class_count_[1] 370 | tokens.sample(5, random_state=6) 371 | 372 | 373 | # calculate the ratio of spam-to-ham for each token 374 | tokens['spam_ratio'] = tokens.spam / tokens.ham 375 | tokens.sample(5, random_state=6) 376 | 377 | 378 | # examine the DataFrame sorted by spam_ratio 379 | # note: use sort() instead of sort_values() for pandas 0.16.2 and earlier 380 | tokens.sort_values('spam_ratio', ascending=False) 381 | 382 | 383 | # look up the spam_ratio for a given token 384 | tokens.loc['dating', 'spam_ratio'] 385 | 386 | 387 | # ## Part 8: Practicing this workflow on another dataset 388 | # 389 | # Please open the **`exercise.ipynb`** notebook (or the **`exercise.py`** script). 390 | 391 | # ## Part 9: Tuning the vectorizer (discussion) 392 | # 393 | # Thus far, we have been using the default parameters of [CountVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html): 394 | 395 | # show default parameters for CountVectorizer 396 | vect 397 | 398 | 399 | # However, the vectorizer is worth tuning, just like a model is worth tuning! Here are a few parameters that you might want to tune: 400 | # 401 | # - **stop_words:** string {'english'}, list, or None (default) 402 | # - If 'english', a built-in stop word list for English is used. 
403 | # - If a list, that list is assumed to contain stop words, all of which will be removed from the resulting tokens. 404 | # - If None, no stop words will be used. 405 | 406 | # remove English stop words 407 | vect = CountVectorizer(stop_words='english') 408 | 409 | 410 | # - **ngram_range:** tuple (min_n, max_n), default=(1, 1) 411 | # - The lower and upper boundary of the range of n-values for different n-grams to be extracted. 412 | # - All values of n such that min_n <= n <= max_n will be used. 413 | 414 | # include 1-grams and 2-grams 415 | vect = CountVectorizer(ngram_range=(1, 2)) 416 | 417 | 418 | # - **max_df:** float in range [0.0, 1.0] or int, default=1.0 419 | # - When building the vocabulary, ignore terms that have a document frequency strictly higher than the given threshold (corpus-specific stop words). 420 | # - If float, the parameter represents a proportion of documents. 421 | # - If integer, the parameter represents an absolute count. 422 | 423 | # ignore terms that appear in more than 50% of the documents 424 | vect = CountVectorizer(max_df=0.5) 425 | 426 | 427 | # - **min_df:** float in range [0.0, 1.0] or int, default=1 428 | # - When building the vocabulary, ignore terms that have a document frequency strictly lower than the given threshold. (This value is also called "cut-off" in the literature.) 429 | # - If float, the parameter represents a proportion of documents. 430 | # - If integer, the parameter represents an absolute count. 431 | 432 | # only keep terms that appear in at least 2 documents 433 | vect = CountVectorizer(min_df=2) 434 | 435 | 436 | # **Guidelines for tuning CountVectorizer:** 437 | # 438 | # - Use your knowledge of the **problem** and the **text**, and your understanding of the **tuning parameters**, to help you decide what parameters to tune and how to tune them. 439 | # - **Experiment**, and let the data tell you the best approach! 
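# The `min_df` and `max_df` thresholds described above can be sketched with the standard library alone. This is an illustration of the documented semantics (floats are proportions of documents, integers are absolute counts), not scikit-learn's code; `tokenize` only approximates the default tokenizer, and `filtered_vocabulary` is a hypothetical helper written for this example.

```python
import re

docs = ['call you tonight', 'Call me a cab', 'please call me... PLEASE!']


def tokenize(text):
    # approximation of the default tokenizer: lowercase, 2+ word characters
    return re.findall(r"\b\w\w+\b", text.lower())


def filtered_vocabulary(docs, min_df=1, max_df=1.0):
    """Keep tokens whose document frequency lies within [min_df, max_df]."""
    n_docs = len(docs)
    doc_freq = {}
    for doc in docs:
        for tok in set(tokenize(doc)):  # count each token once per document
            doc_freq[tok] = doc_freq.get(tok, 0) + 1
    # floats are proportions of documents, ints are absolute counts
    low = min_df * n_docs if isinstance(min_df, float) else min_df
    high = max_df * n_docs if isinstance(max_df, float) else max_df
    return sorted(t for t, df in doc_freq.items() if low <= df <= high)


print(filtered_vocabulary(docs))              # default: keep everything
print(filtered_vocabulary(docs, min_df=2))    # at least 2 documents
print(filtered_vocabulary(docs, max_df=0.5))  # at most 50% of documents
```

# Printing the surviving vocabulary like this is a cheap way to see what a threshold removes before you refit a model with it.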
440 | -------------------------------------------------------------------------------- /tutorial.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Tutorial: Machine Learning with Text in scikit-learn" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "## Agenda\n", 15 | "\n", 16 | "1. Model building in scikit-learn (refresher)\n", 17 | "2. Representing text as numerical data\n", 18 | "3. Reading a text-based dataset into pandas\n", 19 | "4. Vectorizing our dataset\n", 20 | "5. Building and evaluating a model\n", 21 | "6. Comparing models\n", 22 | "7. Examining a model for further insight\n", 23 | "8. Practicing this workflow on another dataset\n", 24 | "9. Tuning the vectorizer (discussion)" 25 | ] 26 | }, 27 | { 28 | "cell_type": "code", 29 | "execution_count": null, 30 | "metadata": { 31 | "collapsed": false 32 | }, 33 | "outputs": [], 34 | "source": [ 35 | "# for Python 2: use print only as a function\n", 36 | "from __future__ import print_function" 37 | ] 38 | }, 39 | { 40 | "cell_type": "markdown", 41 | "metadata": {}, 42 | "source": [ 43 | "## Part 1: Model building in scikit-learn (refresher)" 44 | ] 45 | }, 46 | { 47 | "cell_type": "code", 48 | "execution_count": null, 49 | "metadata": { 50 | "collapsed": true 51 | }, 52 | "outputs": [], 53 | "source": [ 54 | "# load the iris dataset as an example\n", 55 | "from sklearn.datasets import load_iris\n", 56 | "iris = load_iris()" 57 | ] 58 | }, 59 | { 60 | "cell_type": "code", 61 | "execution_count": null, 62 | "metadata": { 63 | "collapsed": true 64 | }, 65 | "outputs": [], 66 | "source": [ 67 | "# store the feature matrix (X) and response vector (y)\n", 68 | "X = iris.data\n", 69 | "y = iris.target" 70 | ] 71 | }, 72 | { 73 | "cell_type": "markdown", 74 | "metadata": {}, 75 | "source": [ 76 | "**\"Features\"** are also known as 
predictors, inputs, or attributes. The **\"response\"** is also known as the target, label, or output." 77 | ] 78 | }, 79 | { 80 | "cell_type": "code", 81 | "execution_count": null, 82 | "metadata": { 83 | "collapsed": false 84 | }, 85 | "outputs": [], 86 | "source": [ 87 | "# check the shapes of X and y\n", 88 | "print(X.shape)\n", 89 | "print(y.shape)" 90 | ] 91 | }, 92 | { 93 | "cell_type": "markdown", 94 | "metadata": {}, 95 | "source": [ 96 | "**\"Observations\"** are also known as samples, instances, or records." 97 | ] 98 | }, 99 | { 100 | "cell_type": "code", 101 | "execution_count": null, 102 | "metadata": { 103 | "collapsed": false 104 | }, 105 | "outputs": [], 106 | "source": [ 107 | "# examine the first 5 rows of the feature matrix (including the feature names)\n", 108 | "import pandas as pd\n", 109 | "pd.DataFrame(X, columns=iris.feature_names).head()" 110 | ] 111 | }, 112 | { 113 | "cell_type": "code", 114 | "execution_count": null, 115 | "metadata": { 116 | "collapsed": false 117 | }, 118 | "outputs": [], 119 | "source": [ 120 | "# examine the response vector\n", 121 | "print(y)" 122 | ] 123 | }, 124 | { 125 | "cell_type": "markdown", 126 | "metadata": {}, 127 | "source": [ 128 | "In order to **build a model**, the features must be **numeric**, and every observation must have the **same features in the same order**." 
129 | ] 130 | }, 131 | { 132 | "cell_type": "code", 133 | "execution_count": null, 134 | "metadata": { 135 | "collapsed": false 136 | }, 137 | "outputs": [], 138 | "source": [ 139 | "# import the class\n", 140 | "from sklearn.neighbors import KNeighborsClassifier\n", 141 | "\n", 142 | "# instantiate the model (with the default parameters)\n", 143 | "knn = KNeighborsClassifier()\n", 144 | "\n", 145 | "# fit the model with data (occurs in-place)\n", 146 | "knn.fit(X, y)" 147 | ] 148 | }, 149 | { 150 | "cell_type": "markdown", 151 | "metadata": {}, 152 | "source": [ 153 | "In order to **make a prediction**, the new observation must have the **same features as the training observations**, both in number and meaning." 154 | ] 155 | }, 156 | { 157 | "cell_type": "code", 158 | "execution_count": null, 159 | "metadata": { 160 | "collapsed": false 161 | }, 162 | "outputs": [], 163 | "source": [ 164 | "# predict the response for a new observation\n", 165 | "knn.predict([[3, 5, 4, 2]])" 166 | ] 167 | }, 168 | { 169 | "cell_type": "markdown", 170 | "metadata": {}, 171 | "source": [ 172 | "## Part 2: Representing text as numerical data" 173 | ] 174 | }, 175 | { 176 | "cell_type": "code", 177 | "execution_count": null, 178 | "metadata": { 179 | "collapsed": true 180 | }, 181 | "outputs": [], 182 | "source": [ 183 | "# example text for model training (SMS messages)\n", 184 | "simple_train = ['call you tonight', 'Call me a cab', 'please call me... PLEASE!']" 185 | ] 186 | }, 187 | { 188 | "cell_type": "markdown", 189 | "metadata": {}, 190 | "source": [ 191 | "From the [scikit-learn documentation](http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction):\n", 192 | "\n", 193 | "> Text Analysis is a major application field for machine learning algorithms. 
However the raw data, a sequence of symbols cannot be fed directly to the algorithms themselves as most of them expect **numerical feature vectors with a fixed size** rather than the **raw text documents with variable length**.\n", 194 | "\n", 195 | "We will use [CountVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) to \"convert text into a matrix of token counts\":" 196 | ] 197 | }, 198 | { 199 | "cell_type": "code", 200 | "execution_count": null, 201 | "metadata": { 202 | "collapsed": true 203 | }, 204 | "outputs": [], 205 | "source": [ 206 | "# import and instantiate CountVectorizer (with the default parameters)\n", 207 | "from sklearn.feature_extraction.text import CountVectorizer\n", 208 | "vect = CountVectorizer()" 209 | ] 210 | }, 211 | { 212 | "cell_type": "code", 213 | "execution_count": null, 214 | "metadata": { 215 | "collapsed": false 216 | }, 217 | "outputs": [], 218 | "source": [ 219 | "# learn the 'vocabulary' of the training data (occurs in-place)\n", 220 | "vect.fit(simple_train)" 221 | ] 222 | }, 223 | { 224 | "cell_type": "code", 225 | "execution_count": null, 226 | "metadata": { 227 | "collapsed": false 228 | }, 229 | "outputs": [], 230 | "source": [ 231 | "# examine the fitted vocabulary\n", 232 | "vect.get_feature_names()" 233 | ] 234 | }, 235 | { 236 | "cell_type": "code", 237 | "execution_count": null, 238 | "metadata": { 239 | "collapsed": false 240 | }, 241 | "outputs": [], 242 | "source": [ 243 | "# transform training data into a 'document-term matrix'\n", 244 | "simple_train_dtm = vect.transform(simple_train)\n", 245 | "simple_train_dtm" 246 | ] 247 | }, 248 | { 249 | "cell_type": "code", 250 | "execution_count": null, 251 | "metadata": { 252 | "collapsed": false 253 | }, 254 | "outputs": [], 255 | "source": [ 256 | "# convert sparse matrix to a dense matrix\n", 257 | "simple_train_dtm.toarray()" 258 | ] 259 | }, 260 | { 261 | "cell_type": "code", 262 | "execution_count": 
null, 263 | "metadata": { 264 | "collapsed": false 265 | }, 266 | "outputs": [], 267 | "source": [ 268 | "# examine the vocabulary and document-term matrix together\n", 269 | "pd.DataFrame(simple_train_dtm.toarray(), columns=vect.get_feature_names())" 270 | ] 271 | }, 272 | { 273 | "cell_type": "markdown", 274 | "metadata": {}, 275 | "source": [ 276 | "From the [scikit-learn documentation](http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction):\n", 277 | "\n", 278 | "> In this scheme, features and samples are defined as follows:\n", 279 | "\n", 280 | "> - Each individual token occurrence frequency (normalized or not) is treated as a **feature**.\n", 281 | "> - The vector of all the token frequencies for a given document is considered a multivariate **sample**.\n", 282 | "\n", 283 | "> A **corpus of documents** can thus be represented by a matrix with **one row per document** and **one column per token** (e.g. word) occurring in the corpus.\n", 284 | "\n", 285 | "> We call **vectorization** the general process of turning a collection of text documents into numerical feature vectors. This specific strategy (tokenization, counting and normalization) is called the **Bag of Words** or \"Bag of n-grams\" representation. Documents are described by word occurrences while completely ignoring the relative position information of the words in the document." 
286 | ] 287 | }, 288 | { 289 | "cell_type": "code", 290 | "execution_count": null, 291 | "metadata": { 292 | "collapsed": false 293 | }, 294 | "outputs": [], 295 | "source": [ 296 | "# check the type of the document-term matrix\n", 297 | "type(simple_train_dtm)" 298 | ] 299 | }, 300 | { 301 | "cell_type": "code", 302 | "execution_count": null, 303 | "metadata": { 304 | "collapsed": false, 305 | "scrolled": true 306 | }, 307 | "outputs": [], 308 | "source": [ 309 | "# examine the sparse matrix contents\n", 310 | "print(simple_train_dtm)" 311 | ] 312 | }, 313 | { 314 | "cell_type": "markdown", 315 | "metadata": {}, 316 | "source": [ 317 | "From the [scikit-learn documentation](http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction):\n", 318 | "\n", 319 | "> As most documents will typically use a very small subset of the words used in the corpus, the resulting matrix will have **many feature values that are zeros** (typically more than 99% of them).\n", 320 | "\n", 321 | "> For instance, a collection of 10,000 short text documents (such as emails) will use a vocabulary with a size in the order of 100,000 unique words in total while each document will use 100 to 1000 unique words individually.\n", 322 | "\n", 323 | "> In order to be able to **store such a matrix in memory** but also to **speed up operations**, implementations will typically use a **sparse representation** such as the implementations available in the `scipy.sparse` package." 
324 | ] 325 | }, 326 | { 327 | "cell_type": "code", 328 | "execution_count": null, 329 | "metadata": { 330 | "collapsed": true 331 | }, 332 | "outputs": [], 333 | "source": [ 334 | "# example text for model testing\n", 335 | "simple_test = [\"please don't call me\"]" 336 | ] 337 | }, 338 | { 339 | "cell_type": "markdown", 340 | "metadata": {}, 341 | "source": [ 342 | "In order to **make a prediction**, the new observation must have the **same features as the training observations**, both in number and meaning." 343 | ] 344 | }, 345 | { 346 | "cell_type": "code", 347 | "execution_count": null, 348 | "metadata": { 349 | "collapsed": false 350 | }, 351 | "outputs": [], 352 | "source": [ 353 | "# transform testing data into a document-term matrix (using existing vocabulary)\n", 354 | "simple_test_dtm = vect.transform(simple_test)\n", 355 | "simple_test_dtm.toarray()" 356 | ] 357 | }, 358 | { 359 | "cell_type": "code", 360 | "execution_count": null, 361 | "metadata": { 362 | "collapsed": false 363 | }, 364 | "outputs": [], 365 | "source": [ 366 | "# examine the vocabulary and document-term matrix together\n", 367 | "pd.DataFrame(simple_test_dtm.toarray(), columns=vect.get_feature_names())" 368 | ] 369 | }, 370 | { 371 | "cell_type": "markdown", 372 | "metadata": {}, 373 | "source": [ 374 | "**Summary:**\n", 375 | "\n", 376 | "- `vect.fit(train)` **learns the vocabulary** of the training data\n", 377 | "- `vect.transform(train)` uses the **fitted vocabulary** to build a document-term matrix from the training data\n", 378 | "- `vect.transform(test)` uses the **fitted vocabulary** to build a document-term matrix from the testing data (and **ignores tokens** it hasn't seen before)" 379 | ] 380 | }, 381 | { 382 | "cell_type": "markdown", 383 | "metadata": {}, 384 | "source": [ 385 | "## Part 3: Reading a text-based dataset into pandas" 386 | ] 387 | }, 388 | { 389 | "cell_type": "code", 390 | "execution_count": null, 391 | "metadata": { 392 | "collapsed": true 393 | }, 394 | 
"outputs": [], 395 | "source": [ 396 | "# read file into pandas using a relative path\n", 397 | "path = 'data/sms.tsv'\n", 398 | "sms = pd.read_table(path, header=None, names=['label', 'message'])" 399 | ] 400 | }, 401 | { 402 | "cell_type": "code", 403 | "execution_count": null, 404 | "metadata": { 405 | "collapsed": false 406 | }, 407 | "outputs": [], 408 | "source": [ 409 | "# alternative: read file into pandas from a URL\n", 410 | "# url = 'https://raw.githubusercontent.com/justmarkham/pycon-2016-tutorial/master/data/sms.tsv'\n", 411 | "# sms = pd.read_table(url, header=None, names=['label', 'message'])" 412 | ] 413 | }, 414 | { 415 | "cell_type": "code", 416 | "execution_count": null, 417 | "metadata": { 418 | "collapsed": false 419 | }, 420 | "outputs": [], 421 | "source": [ 422 | "# examine the shape\n", 423 | "sms.shape" 424 | ] 425 | }, 426 | { 427 | "cell_type": "code", 428 | "execution_count": null, 429 | "metadata": { 430 | "collapsed": false 431 | }, 432 | "outputs": [], 433 | "source": [ 434 | "# examine the first 10 rows\n", 435 | "sms.head(10)" 436 | ] 437 | }, 438 | { 439 | "cell_type": "code", 440 | "execution_count": null, 441 | "metadata": { 442 | "collapsed": false 443 | }, 444 | "outputs": [], 445 | "source": [ 446 | "# examine the class distribution\n", 447 | "sms.label.value_counts()" 448 | ] 449 | }, 450 | { 451 | "cell_type": "code", 452 | "execution_count": null, 453 | "metadata": { 454 | "collapsed": true 455 | }, 456 | "outputs": [], 457 | "source": [ 458 | "# convert label to a numerical variable\n", 459 | "sms['label_num'] = sms.label.map({'ham':0, 'spam':1})" 460 | ] 461 | }, 462 | { 463 | "cell_type": "code", 464 | "execution_count": null, 465 | "metadata": { 466 | "collapsed": false 467 | }, 468 | "outputs": [], 469 | "source": [ 470 | "# check that the conversion worked\n", 471 | "sms.head(10)" 472 | ] 473 | }, 474 | { 475 | "cell_type": "code", 476 | "execution_count": null, 477 | "metadata": { 478 | "collapsed": false 479 | }, 
480 | "outputs": [], 481 | "source": [ 482 | "# how to define X and y (from the iris data) for use with a MODEL\n", 483 | "X = iris.data\n", 484 | "y = iris.target\n", 485 | "print(X.shape)\n", 486 | "print(y.shape)" 487 | ] 488 | }, 489 | { 490 | "cell_type": "code", 491 | "execution_count": null, 492 | "metadata": { 493 | "collapsed": false 494 | }, 495 | "outputs": [], 496 | "source": [ 497 | "# how to define X and y (from the SMS data) for use with COUNTVECTORIZER\n", 498 | "X = sms.message\n", 499 | "y = sms.label_num\n", 500 | "print(X.shape)\n", 501 | "print(y.shape)" 502 | ] 503 | }, 504 | { 505 | "cell_type": "code", 506 | "execution_count": null, 507 | "metadata": { 508 | "collapsed": false 509 | }, 510 | "outputs": [], 511 | "source": [ 512 | "# split X and y into training and testing sets\n", 513 | "from sklearn.model_selection import train_test_split\n", 514 | "X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)\n", 515 | "print(X_train.shape)\n", 516 | "print(X_test.shape)\n", 517 | "print(y_train.shape)\n", 518 | "print(y_test.shape)" 519 | ] 520 | }, 521 | { 522 | "cell_type": "markdown", 523 | "metadata": {}, 524 | "source": [ 525 | "## Part 4: Vectorizing our dataset" 526 | ] 527 | }, 528 | { 529 | "cell_type": "code", 530 | "execution_count": null, 531 | "metadata": { 532 | "collapsed": true 533 | }, 534 | "outputs": [], 535 | "source": [ 536 | "# instantiate the vectorizer\n", 537 | "vect = CountVectorizer()" 538 | ] 539 | }, 540 | { 541 | "cell_type": "code", 542 | "execution_count": null, 543 | "metadata": { 544 | "collapsed": true 545 | }, 546 | "outputs": [], 547 | "source": [ 548 | "# learn training data vocabulary, then use it to create a document-term matrix\n", 549 | "vect.fit(X_train)\n", 550 | "X_train_dtm = vect.transform(X_train)" 551 | ] 552 | }, 553 | { 554 | "cell_type": "code", 555 | "execution_count": null, 556 | "metadata": { 557 | "collapsed": true 558 | }, 559 | "outputs": [], 560 | "source": [ 561 | "# 
equivalently: combine fit and transform into a single step\n", 562 | "X_train_dtm = vect.fit_transform(X_train)" 563 | ] 564 | }, 565 | { 566 | "cell_type": "code", 567 | "execution_count": null, 568 | "metadata": { 569 | "collapsed": false 570 | }, 571 | "outputs": [], 572 | "source": [ 573 | "# examine the document-term matrix\n", 574 | "X_train_dtm" 575 | ] 576 | }, 577 | { 578 | "cell_type": "code", 579 | "execution_count": null, 580 | "metadata": { 581 | "collapsed": false 582 | }, 583 | "outputs": [], 584 | "source": [ 585 | "# transform testing data (using fitted vocabulary) into a document-term matrix\n", 586 | "X_test_dtm = vect.transform(X_test)\n", 587 | "X_test_dtm" 588 | ] 589 | }, 590 | { 591 | "cell_type": "markdown", 592 | "metadata": {}, 593 | "source": [ 594 | "## Part 5: Building and evaluating a model\n", 595 | "\n", 596 | "We will use [multinomial Naive Bayes](http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html):\n", 597 | "\n", 598 | "> The multinomial Naive Bayes classifier is suitable for classification with **discrete features** (e.g., word counts for text classification). The multinomial distribution normally requires integer feature counts. However, in practice, fractional counts such as tf-idf may also work." 
599 | ] 600 | }, 601 | { 602 | "cell_type": "code", 603 | "execution_count": null, 604 | "metadata": { 605 | "collapsed": true 606 | }, 607 | "outputs": [], 608 | "source": [ 609 | "# import and instantiate a Multinomial Naive Bayes model\n", 610 | "from sklearn.naive_bayes import MultinomialNB\n", 611 | "nb = MultinomialNB()" 612 | ] 613 | }, 614 | { 615 | "cell_type": "code", 616 | "execution_count": null, 617 | "metadata": { 618 | "collapsed": false 619 | }, 620 | "outputs": [], 621 | "source": [ 622 | "# train the model using X_train_dtm (timing it with an IPython \"magic command\")\n", 623 | "%time nb.fit(X_train_dtm, y_train)" 624 | ] 625 | }, 626 | { 627 | "cell_type": "code", 628 | "execution_count": null, 629 | "metadata": { 630 | "collapsed": true 631 | }, 632 | "outputs": [], 633 | "source": [ 634 | "# make class predictions for X_test_dtm\n", 635 | "y_pred_class = nb.predict(X_test_dtm)" 636 | ] 637 | }, 638 | { 639 | "cell_type": "code", 640 | "execution_count": null, 641 | "metadata": { 642 | "collapsed": false 643 | }, 644 | "outputs": [], 645 | "source": [ 646 | "# calculate accuracy of class predictions\n", 647 | "from sklearn import metrics\n", 648 | "metrics.accuracy_score(y_test, y_pred_class)" 649 | ] 650 | }, 651 | { 652 | "cell_type": "code", 653 | "execution_count": null, 654 | "metadata": { 655 | "collapsed": false 656 | }, 657 | "outputs": [], 658 | "source": [ 659 | "# print the confusion matrix\n", 660 | "metrics.confusion_matrix(y_test, y_pred_class)" 661 | ] 662 | }, 663 | { 664 | "cell_type": "code", 665 | "execution_count": null, 666 | "metadata": { 667 | "collapsed": false 668 | }, 669 | "outputs": [], 670 | "source": [ 671 | "# print message text for the false positives (ham incorrectly classified as spam)\n" 672 | ] 673 | }, 674 | { 675 | "cell_type": "code", 676 | "execution_count": null, 677 | "metadata": { 678 | "collapsed": false, 679 | "scrolled": true 680 | }, 681 | "outputs": [], 682 | "source": [ 683 | "# print message 
text for the false negatives (spam incorrectly classified as ham)\n" 684 | ] 685 | }, 686 | { 687 | "cell_type": "code", 688 | "execution_count": null, 689 | "metadata": { 690 | "collapsed": false, 691 | "scrolled": true 692 | }, 693 | "outputs": [], 694 | "source": [ 695 | "# example false negative\n", 696 | "X_test[3132]" 697 | ] 698 | }, 699 | { 700 | "cell_type": "code", 701 | "execution_count": null, 702 | "metadata": { 703 | "collapsed": false 704 | }, 705 | "outputs": [], 706 | "source": [ 707 | "# calculate predicted probabilities for X_test_dtm (poorly calibrated)\n", 708 | "y_pred_prob = nb.predict_proba(X_test_dtm)[:, 1]\n", 709 | "y_pred_prob" 710 | ] 711 | }, 712 | { 713 | "cell_type": "code", 714 | "execution_count": null, 715 | "metadata": { 716 | "collapsed": false 717 | }, 718 | "outputs": [], 719 | "source": [ 720 | "# calculate AUC\n", 721 | "metrics.roc_auc_score(y_test, y_pred_prob)" 722 | ] 723 | }, 724 | { 725 | "cell_type": "markdown", 726 | "metadata": {}, 727 | "source": [ 728 | "## Part 6: Comparing models\n", 729 | "\n", 730 | "We will compare multinomial Naive Bayes with [logistic regression](http://scikit-learn.org/stable/modules/linear_model.html#logistic-regression):\n", 731 | "\n", 732 | "> Logistic regression, despite its name, is a **linear model for classification** rather than regression. Logistic regression is also known in the literature as logit regression, maximum-entropy classification (MaxEnt) or the log-linear classifier. In this model, the probabilities describing the possible outcomes of a single trial are modeled using a logistic function." 
733 | ] 734 | }, 735 | { 736 | "cell_type": "code", 737 | "execution_count": null, 738 | "metadata": { 739 | "collapsed": true 740 | }, 741 | "outputs": [], 742 | "source": [ 743 | "# import and instantiate a logistic regression model\n", 744 | "from sklearn.linear_model import LogisticRegression\n", 745 | "logreg = LogisticRegression()" 746 | ] 747 | }, 748 | { 749 | "cell_type": "code", 750 | "execution_count": null, 751 | "metadata": { 752 | "collapsed": false 753 | }, 754 | "outputs": [], 755 | "source": [ 756 | "# train the model using X_train_dtm\n", 757 | "%time logreg.fit(X_train_dtm, y_train)" 758 | ] 759 | }, 760 | { 761 | "cell_type": "code", 762 | "execution_count": null, 763 | "metadata": { 764 | "collapsed": true 765 | }, 766 | "outputs": [], 767 | "source": [ 768 | "# make class predictions for X_test_dtm\n", 769 | "y_pred_class = logreg.predict(X_test_dtm)" 770 | ] 771 | }, 772 | { 773 | "cell_type": "code", 774 | "execution_count": null, 775 | "metadata": { 776 | "collapsed": false 777 | }, 778 | "outputs": [], 779 | "source": [ 780 | "# calculate predicted probabilities for X_test_dtm (well calibrated)\n", 781 | "y_pred_prob = logreg.predict_proba(X_test_dtm)[:, 1]\n", 782 | "y_pred_prob" 783 | ] 784 | }, 785 | { 786 | "cell_type": "code", 787 | "execution_count": null, 788 | "metadata": { 789 | "collapsed": false 790 | }, 791 | "outputs": [], 792 | "source": [ 793 | "# calculate accuracy\n", 794 | "metrics.accuracy_score(y_test, y_pred_class)" 795 | ] 796 | }, 797 | { 798 | "cell_type": "code", 799 | "execution_count": null, 800 | "metadata": { 801 | "collapsed": false 802 | }, 803 | "outputs": [], 804 | "source": [ 805 | "# calculate AUC\n", 806 | "metrics.roc_auc_score(y_test, y_pred_prob)" 807 | ] 808 | }, 809 | { 810 | "cell_type": "markdown", 811 | "metadata": {}, 812 | "source": [ 813 | "## Part 7: Examining a model for further insight\n", 814 | "\n", 815 | "We will examine our **trained Naive Bayes model** to calculate the approximate
**\"spamminess\" of each token**." 816 | ] 817 | }, 818 | { 819 | "cell_type": "code", 820 | "execution_count": null, 821 | "metadata": { 822 | "collapsed": false 823 | }, 824 | "outputs": [], 825 | "source": [ 826 | "# store the vocabulary of X_train\n", 827 | "X_train_tokens = vect.get_feature_names()\n", 828 | "len(X_train_tokens)" 829 | ] 830 | }, 831 | { 832 | "cell_type": "code", 833 | "execution_count": null, 834 | "metadata": { 835 | "collapsed": false, 836 | "scrolled": true 837 | }, 838 | "outputs": [], 839 | "source": [ 840 | "# examine the first 50 tokens\n", 841 | "print(X_train_tokens[0:50])" 842 | ] 843 | }, 844 | { 845 | "cell_type": "code", 846 | "execution_count": null, 847 | "metadata": { 848 | "collapsed": false 849 | }, 850 | "outputs": [], 851 | "source": [ 852 | "# examine the last 50 tokens\n", 853 | "print(X_train_tokens[-50:])" 854 | ] 855 | }, 856 | { 857 | "cell_type": "code", 858 | "execution_count": null, 859 | "metadata": { 860 | "collapsed": false 861 | }, 862 | "outputs": [], 863 | "source": [ 864 | "# Naive Bayes counts the number of times each token appears in each class\n", 865 | "nb.feature_count_" 866 | ] 867 | }, 868 | { 869 | "cell_type": "code", 870 | "execution_count": null, 871 | "metadata": { 872 | "collapsed": false 873 | }, 874 | "outputs": [], 875 | "source": [ 876 | "# rows represent classes, columns represent tokens\n", 877 | "nb.feature_count_.shape" 878 | ] 879 | }, 880 | { 881 | "cell_type": "code", 882 | "execution_count": null, 883 | "metadata": { 884 | "collapsed": false 885 | }, 886 | "outputs": [], 887 | "source": [ 888 | "# number of times each token appears across all HAM messages\n", 889 | "ham_token_count = nb.feature_count_[0, :]\n", 890 | "ham_token_count" 891 | ] 892 | }, 893 | { 894 | "cell_type": "code", 895 | "execution_count": null, 896 | "metadata": { 897 | "collapsed": false 898 | }, 899 | "outputs": [], 900 | "source": [ 901 | "# number of times each token appears across all SPAM messages\n", 
902 | "spam_token_count = nb.feature_count_[1, :]\n", 903 | "spam_token_count" 904 | ] 905 | }, 906 | { 907 | "cell_type": "code", 908 | "execution_count": null, 909 | "metadata": { 910 | "collapsed": false 911 | }, 912 | "outputs": [], 913 | "source": [ 914 | "# create a DataFrame of tokens with their separate ham and spam counts\n", 915 | "tokens = pd.DataFrame({'token':X_train_tokens, 'ham':ham_token_count, 'spam':spam_token_count}).set_index('token')\n", 916 | "tokens.head()" 917 | ] 918 | }, 919 | { 920 | "cell_type": "code", 921 | "execution_count": null, 922 | "metadata": { 923 | "collapsed": false 924 | }, 925 | "outputs": [], 926 | "source": [ 927 | "# examine 5 random DataFrame rows\n", 928 | "tokens.sample(5, random_state=6)" 929 | ] 930 | }, 931 | { 932 | "cell_type": "code", 933 | "execution_count": null, 934 | "metadata": { 935 | "collapsed": false 936 | }, 937 | "outputs": [], 938 | "source": [ 939 | "# Naive Bayes counts the number of observations in each class\n", 940 | "nb.class_count_" 941 | ] 942 | }, 943 | { 944 | "cell_type": "markdown", 945 | "metadata": {}, 946 | "source": [ 947 | "Before we can calculate the \"spamminess\" of each token, we need to avoid **dividing by zero** and account for the **class imbalance**." 
948 | ] 949 | }, 950 | { 951 | "cell_type": "code", 952 | "execution_count": null, 953 | "metadata": { 954 | "collapsed": false 955 | }, 956 | "outputs": [], 957 | "source": [ 958 | "# add 1 to ham and spam counts to avoid dividing by 0\n", 959 | "tokens['ham'] = tokens.ham + 1\n", 960 | "tokens['spam'] = tokens.spam + 1\n", 961 | "tokens.sample(5, random_state=6)" 962 | ] 963 | }, 964 | { 965 | "cell_type": "code", 966 | "execution_count": null, 967 | "metadata": { 968 | "collapsed": false 969 | }, 970 | "outputs": [], 971 | "source": [ 972 | "# convert the ham and spam counts into frequencies\n", 973 | "tokens['ham'] = tokens.ham / nb.class_count_[0]\n", 974 | "tokens['spam'] = tokens.spam / nb.class_count_[1]\n", 975 | "tokens.sample(5, random_state=6)" 976 | ] 977 | }, 978 | { 979 | "cell_type": "code", 980 | "execution_count": null, 981 | "metadata": { 982 | "collapsed": false 983 | }, 984 | "outputs": [], 985 | "source": [ 986 | "# calculate the ratio of spam-to-ham for each token\n", 987 | "tokens['spam_ratio'] = tokens.spam / tokens.ham\n", 988 | "tokens.sample(5, random_state=6)" 989 | ] 990 | }, 991 | { 992 | "cell_type": "code", 993 | "execution_count": null, 994 | "metadata": { 995 | "collapsed": false 996 | }, 997 | "outputs": [], 998 | "source": [ 999 | "# examine the DataFrame sorted by spam_ratio\n", 1000 | "# note: use sort() instead of sort_values() for pandas 0.16.2 and earlier\n", 1001 | "tokens.sort_values('spam_ratio', ascending=False)" 1002 | ] 1003 | }, 1004 | { 1005 | "cell_type": "code", 1006 | "execution_count": null, 1007 | "metadata": { 1008 | "collapsed": false 1009 | }, 1010 | "outputs": [], 1011 | "source": [ 1012 | "# look up the spam_ratio for a given token\n", 1013 | "tokens.loc['dating', 'spam_ratio']" 1014 | ] 1015 | }, 1016 | { 1017 | "cell_type": "markdown", 1018 | "metadata": {}, 1019 | "source": [ 1020 | "## Part 8: Practicing this workflow on another dataset\n", 1021 | "\n", 1022 | "Please open the **`exercise.ipynb`** 
notebook (or the **`exercise.py`** script)." 1023 | ] 1024 | }, 1025 | { 1026 | "cell_type": "markdown", 1027 | "metadata": {}, 1028 | "source": [ 1029 | "## Part 9: Tuning the vectorizer (discussion)\n", 1030 | "\n", 1031 | "Thus far, we have been using the default parameters of [CountVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html):" 1032 | ] 1033 | }, 1034 | { 1035 | "cell_type": "code", 1036 | "execution_count": null, 1037 | "metadata": { 1038 | "collapsed": false 1039 | }, 1040 | "outputs": [], 1041 | "source": [ 1042 | "# show default parameters for CountVectorizer\n", 1043 | "vect" 1044 | ] 1045 | }, 1046 | { 1047 | "cell_type": "markdown", 1048 | "metadata": {}, 1049 | "source": [ 1050 | "However, the vectorizer is worth tuning, just like a model is worth tuning! Here are a few parameters that you might want to tune:\n", 1051 | "\n", 1052 | "- **stop_words:** string {'english'}, list, or None (default)\n", 1053 | " - If 'english', a built-in stop word list for English is used.\n", 1054 | " - If a list, that list is assumed to contain stop words, all of which will be removed from the resulting tokens.\n", 1055 | " - If None, no stop words will be used." 1056 | ] 1057 | }, 1058 | { 1059 | "cell_type": "code", 1060 | "execution_count": null, 1061 | "metadata": { 1062 | "collapsed": true 1063 | }, 1064 | "outputs": [], 1065 | "source": [ 1066 | "# remove English stop words\n", 1067 | "vect = CountVectorizer(stop_words='english')" 1068 | ] 1069 | }, 1070 | { 1071 | "cell_type": "markdown", 1072 | "metadata": {}, 1073 | "source": [ 1074 | "- **ngram_range:** tuple (min_n, max_n), default=(1, 1)\n", 1075 | " - The lower and upper boundary of the range of n-values for different n-grams to be extracted.\n", 1076 | " - All values of n such that min_n <= n <= max_n will be used." 
1077 | ] 1078 | }, 1079 | { 1080 | "cell_type": "code", 1081 | "execution_count": null, 1082 | "metadata": { 1083 | "collapsed": true 1084 | }, 1085 | "outputs": [], 1086 | "source": [ 1087 | "# include 1-grams and 2-grams\n", 1088 | "vect = CountVectorizer(ngram_range=(1, 2))" 1089 | ] 1090 | }, 1091 | { 1092 | "cell_type": "markdown", 1093 | "metadata": {}, 1094 | "source": [ 1095 | "- **max_df:** float in range [0.0, 1.0] or int, default=1.0\n", 1096 | " - When building the vocabulary, ignore terms that have a document frequency strictly higher than the given threshold (corpus-specific stop words).\n", 1097 | " - If float, the parameter represents a proportion of documents.\n", 1098 | " - If integer, the parameter represents an absolute count." 1099 | ] 1100 | }, 1101 | { 1102 | "cell_type": "code", 1103 | "execution_count": null, 1104 | "metadata": { 1105 | "collapsed": true 1106 | }, 1107 | "outputs": [], 1108 | "source": [ 1109 | "# ignore terms that appear in more than 50% of the documents\n", 1110 | "vect = CountVectorizer(max_df=0.5)" 1111 | ] 1112 | }, 1113 | { 1114 | "cell_type": "markdown", 1115 | "metadata": {}, 1116 | "source": [ 1117 | "- **min_df:** float in range [0.0, 1.0] or int, default=1\n", 1118 | " - When building the vocabulary, ignore terms that have a document frequency strictly lower than the given threshold. (This value is also called \"cut-off\" in the literature.)\n", 1119 | " - If float, the parameter represents a proportion of documents.\n", 1120 | " - If integer, the parameter represents an absolute count." 
1121 | ] 1122 | }, 1123 | { 1124 | "cell_type": "code", 1125 | "execution_count": null, 1126 | "metadata": { 1127 | "collapsed": true 1128 | }, 1129 | "outputs": [], 1130 | "source": [ 1131 | "# only keep terms that appear in at least 2 documents\n", 1132 | "vect = CountVectorizer(min_df=2)" 1133 | ] 1134 | }, 1135 | { 1136 | "cell_type": "markdown", 1137 | "metadata": {}, 1138 | "source": [ 1139 | "**Guidelines for tuning CountVectorizer:**\n", 1140 | "\n", 1141 | "- Use your knowledge of the **problem** and the **text**, and your understanding of the **tuning parameters**, to help you decide what parameters to tune and how to tune them.\n", 1142 | "- **Experiment**, and let the data tell you the best approach!" 1143 | ] 1144 | } 1145 | ], 1146 | "metadata": { 1147 | "kernelspec": { 1148 | "display_name": "Python 2", 1149 | "language": "python", 1150 | "name": "python2" 1151 | }, 1152 | "language_info": { 1153 | "codemirror_mode": { 1154 | "name": "ipython", 1155 | "version": 2 1156 | }, 1157 | "file_extension": ".py", 1158 | "mimetype": "text/x-python", 1159 | "name": "python", 1160 | "nbconvert_exporter": "python", 1161 | "pygments_lexer": "ipython2", 1162 | "version": "2.7.11" 1163 | } 1164 | }, 1165 | "nbformat": 4, 1166 | "nbformat_minor": 0 1167 | } 1168 | -------------------------------------------------------------------------------- /exercise_solution.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Tutorial Exercise: Yelp reviews (Solution)" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "## Introduction\n", 15 | "\n", 16 | "This exercise uses a small subset of the data from Kaggle's [Yelp Business Rating Prediction](https://www.kaggle.com/c/yelp-recsys-2013) competition.\n", 17 | "\n", 18 | "**Description of the data:**\n", 19 | "\n", 20 | "- **`yelp.csv`** contains the 
dataset. It is stored in the repository (in the **`data`** directory), so there is no need to download anything from the Kaggle website.\n", 21 | "- Each observation (row) in this dataset is a review of a particular business by a particular user.\n", 22 | "- The **stars** column is the number of stars (1 through 5) assigned by the reviewer to the business. (Higher stars is better.) In other words, it is the rating of the business by the person who wrote the review.\n", 23 | "- The **text** column is the text of the review.\n", 24 | "\n", 25 | "**Goal:** Predict the star rating of a review using **only** the review text.\n", 26 | "\n", 27 | "**Tip:** After each task, I recommend that you check the shape and the contents of your objects, to confirm that they match your expectations." 28 | ] 29 | }, 30 | { 31 | "cell_type": "code", 32 | "execution_count": 1, 33 | "metadata": { 34 | "collapsed": true 35 | }, 36 | "outputs": [], 37 | "source": [ 38 | "# for Python 2: use print only as a function\n", 39 | "from __future__ import print_function" 40 | ] 41 | }, 42 | { 43 | "cell_type": "markdown", 44 | "metadata": {}, 45 | "source": [ 46 | "## Task 1\n", 47 | "\n", 48 | "Read **`yelp.csv`** into a pandas DataFrame and examine it." 
49 | ] 50 | }, 51 | { 52 | "cell_type": "code", 53 | "execution_count": 2, 54 | "metadata": { 55 | "collapsed": true 56 | }, 57 | "outputs": [], 58 | "source": [ 59 | "# read yelp.csv using a relative path\n", 60 | "import pandas as pd\n", 61 | "path = 'data/yelp.csv'\n", 62 | "yelp = pd.read_csv(path)" 63 | ] 64 | }, 65 | { 66 | "cell_type": "code", 67 | "execution_count": 3, 68 | "metadata": { 69 | "collapsed": false 70 | }, 71 | "outputs": [ 72 | { 73 | "data": { 74 | "text/plain": [ 75 | "(10000, 10)" 76 | ] 77 | }, 78 | "execution_count": 3, 79 | "metadata": {}, 80 | "output_type": "execute_result" 81 | } 82 | ], 83 | "source": [ 84 | "# examine the shape\n", 85 | "yelp.shape" 86 | ] 87 | }, 88 | { 89 | "cell_type": "code", 90 | "execution_count": 4, 91 | "metadata": { 92 | "collapsed": false 93 | }, 94 | "outputs": [ 95 | { 96 | "data": { 97 | "text/html": [ 98 | "
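<div>\n", 99 | "<table border=\"1\" class=\"dataframe\">\n", 100 | " <thead>\n", 101 | " <tr style=\"text-align: right;\">\n", 102 | " <th></th>\n", 103 | " <th>business_id</th>\n", 104 | " <th>date</th>\n", 105 | " <th>review_id</th>\n", 106 | " <th>stars</th>\n", 107 | " <th>text</th>\n", 108 | " <th>type</th>\n", 109 | " <th>user_id</th>\n", 110 | " <th>cool</th>\n", 111 | " <th>useful</th>\n", 112 | " <th>funny</th>\n", 113 | " </tr>\n", 114 | " </thead>\n", 115 | " <tbody>\n", 116 | " <tr>\n", 117 | " <th>0</th>\n", 118 | " <td>9yKzy9PApeiPPOUJEtnvkg</td>\n", 119 | " <td>2011-01-26</td>\n", 120 | " <td>fWKvX83p0-ka4JS3dc6E5A</td>\n", 121 | " <td>5</td>\n", 122 | " <td>My wife took me here on my birthday for breakf...</td>\n", 123 | " <td>review</td>\n", 124 | " <td>rLtl8ZkDX5vH5nAx9C3q5Q</td>\n", 125 | " <td>2</td>\n", 126 | " <td>5</td>\n", 127 | " <td>0</td>\n", 128 | " </tr>\n", 129 | " </tbody>\n", 130 | "</table>\n", 131 | "</div>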
" 132 | ], 133 | "text/plain": [ 134 | " business_id date review_id stars \\\n", 135 | "0 9yKzy9PApeiPPOUJEtnvkg 2011-01-26 fWKvX83p0-ka4JS3dc6E5A 5 \n", 136 | "\n", 137 | " text type \\\n", 138 | "0 My wife took me here on my birthday for breakf... review \n", 139 | "\n", 140 | " user_id cool useful funny \n", 141 | "0 rLtl8ZkDX5vH5nAx9C3q5Q 2 5 0 " 142 | ] 143 | }, 144 | "execution_count": 4, 145 | "metadata": {}, 146 | "output_type": "execute_result" 147 | } 148 | ], 149 | "source": [ 150 | "# examine the first row\n", 151 | "yelp.head(1)" 152 | ] 153 | }, 154 | { 155 | "cell_type": "code", 156 | "execution_count": 5, 157 | "metadata": { 158 | "collapsed": false 159 | }, 160 | "outputs": [ 161 | { 162 | "data": { 163 | "text/plain": [ 164 | "1 749\n", 165 | "2 927\n", 166 | "3 1461\n", 167 | "4 3526\n", 168 | "5 3337\n", 169 | "Name: stars, dtype: int64" 170 | ] 171 | }, 172 | "execution_count": 5, 173 | "metadata": {}, 174 | "output_type": "execute_result" 175 | } 176 | ], 177 | "source": [ 178 | "# examine the class distribution\n", 179 | "yelp.stars.value_counts().sort_index()" 180 | ] 181 | }, 182 | { 183 | "cell_type": "markdown", 184 | "metadata": {}, 185 | "source": [ 186 | "## Task 2\n", 187 | "\n", 188 | "Create a new DataFrame that only contains the **5-star** and **1-star** reviews.\n", 189 | "\n", 190 | "- **Hint:** [How do I apply multiple filter criteria to a pandas DataFrame?](http://nbviewer.jupyter.org/github/justmarkham/pandas-videos/blob/master/pandas.ipynb#9.-How-do-I-apply-multiple-filter-criteria-to-a-pandas-DataFrame%3F-%28video%29) explains how to do this." 
191 | ] 192 | }, 193 | { 194 | "cell_type": "code", 195 | "execution_count": 6, 196 | "metadata": { 197 | "collapsed": true 198 | }, 199 | "outputs": [], 200 | "source": [ 201 | "# filter the DataFrame using an OR condition\n", 202 | "yelp_best_worst = yelp[(yelp.stars==5) | (yelp.stars==1)]\n", 203 | "\n", 204 | "# equivalently, use the 'loc' indexer\n", 205 | "yelp_best_worst = yelp.loc[(yelp.stars==5) | (yelp.stars==1), :]" 206 | ] 207 | }, 208 | { 209 | "cell_type": "code", 210 | "execution_count": 7, 211 | "metadata": { 212 | "collapsed": false 213 | }, 214 | "outputs": [ 215 | { 216 | "data": { 217 | "text/plain": [ 218 | "(4086, 10)" 219 | ] 220 | }, 221 | "execution_count": 7, 222 | "metadata": {}, 223 | "output_type": "execute_result" 224 | } 225 | ], 226 | "source": [ 227 | "# examine the shape\n", 228 | "yelp_best_worst.shape" 229 | ] 230 | }, 231 | { 232 | "cell_type": "markdown", 233 | "metadata": {}, 234 | "source": [ 235 | "## Task 3\n", 236 | "\n", 237 | "Define X and y from the new DataFrame, and then split X and y into training and testing sets, using the **review text** as the only feature and the **star rating** as the response.\n", 238 | "\n", 239 | "- **Hint:** Keep in mind that X should be a pandas Series (not a DataFrame), since we will pass it to CountVectorizer in the task that follows."
240 | ] 241 | }, 242 | { 243 | "cell_type": "code", 244 | "execution_count": 8, 245 | "metadata": { 246 | "collapsed": true 247 | }, 248 | "outputs": [], 249 | "source": [ 250 | "# define X and y\n", 251 | "X = yelp_best_worst.text\n", 252 | "y = yelp_best_worst.stars" 253 | ] 254 | }, 255 | { 256 | "cell_type": "code", 257 | "execution_count": 9, 258 | "metadata": { 259 | "collapsed": true 260 | }, 261 | "outputs": [], 262 | "source": [ 263 | "# split X and y into training and testing sets\n", 264 | "from sklearn.model_selection import train_test_split\n", 265 | "X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)" 266 | ] 267 | }, 268 | { 269 | "cell_type": "code", 270 | "execution_count": 10, 271 | "metadata": { 272 | "collapsed": false 273 | }, 274 | "outputs": [ 275 | { 276 | "name": "stdout", 277 | "output_type": "stream", 278 | "text": [ 279 | "(3064L,)\n", 280 | "(1022L,)\n", 281 | "(3064L,)\n", 282 | "(1022L,)\n" 283 | ] 284 | } 285 | ], 286 | "source": [ 287 | "# examine the object shapes\n", 288 | "print(X_train.shape)\n", 289 | "print(X_test.shape)\n", 290 | "print(y_train.shape)\n", 291 | "print(y_test.shape)" 292 | ] 293 | }, 294 | { 295 | "cell_type": "markdown", 296 | "metadata": {}, 297 | "source": [ 298 | "## Task 4\n", 299 | "\n", 300 | "Use CountVectorizer to create **document-term matrices** from X_train and X_test."
301 | ] 302 | }, 303 | { 304 | "cell_type": "code", 305 | "execution_count": 11, 306 | "metadata": { 307 | "collapsed": true 308 | }, 309 | "outputs": [], 310 | "source": [ 311 | "# import and instantiate CountVectorizer\n", 312 | "from sklearn.feature_extraction.text import CountVectorizer\n", 313 | "vect = CountVectorizer()" 314 | ] 315 | }, 316 | { 317 | "cell_type": "code", 318 | "execution_count": 12, 319 | "metadata": { 320 | "collapsed": false 321 | }, 322 | "outputs": [ 323 | { 324 | "data": { 325 | "text/plain": [ 326 | "(3064, 16825)" 327 | ] 328 | }, 329 | "execution_count": 12, 330 | "metadata": {}, 331 | "output_type": "execute_result" 332 | } 333 | ], 334 | "source": [ 335 | "# fit and transform X_train into X_train_dtm\n", 336 | "X_train_dtm = vect.fit_transform(X_train)\n", 337 | "X_train_dtm.shape" 338 | ] 339 | }, 340 | { 341 | "cell_type": "code", 342 | "execution_count": 13, 343 | "metadata": { 344 | "collapsed": false 345 | }, 346 | "outputs": [ 347 | { 348 | "data": { 349 | "text/plain": [ 350 | "(1022, 16825)" 351 | ] 352 | }, 353 | "execution_count": 13, 354 | "metadata": {}, 355 | "output_type": "execute_result" 356 | } 357 | ], 358 | "source": [ 359 | "# transform X_test into X_test_dtm\n", 360 | "X_test_dtm = vect.transform(X_test)\n", 361 | "X_test_dtm.shape" 362 | ] 363 | }, 364 | { 365 | "cell_type": "markdown", 366 | "metadata": {}, 367 | "source": [ 368 | "## Task 5\n", 369 | "\n", 370 | "Use multinomial Naive Bayes to **predict the star rating** for the reviews in the testing set, and then **calculate the accuracy** and **print the confusion matrix**.\n", 371 | "\n", 372 | "- **Hint:** [Evaluating a classification model](https://github.com/justmarkham/scikit-learn-videos/blob/master/09_classification_metrics.ipynb) explains how to interpret both classification accuracy and the confusion matrix." 
373 | ] 374 | }, 375 | { 376 | "cell_type": "code", 377 | "execution_count": 14, 378 | "metadata": { 379 | "collapsed": true 380 | }, 381 | "outputs": [], 382 | "source": [ 383 | "# import and instantiate MultinomialNB\n", 384 | "from sklearn.naive_bayes import MultinomialNB\n", 385 | "nb = MultinomialNB()" 386 | ] 387 | }, 388 | { 389 | "cell_type": "code", 390 | "execution_count": 15, 391 | "metadata": { 392 | "collapsed": false 393 | }, 394 | "outputs": [ 395 | { 396 | "data": { 397 | "text/plain": [ 398 | "MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)" 399 | ] 400 | }, 401 | "execution_count": 15, 402 | "metadata": {}, 403 | "output_type": "execute_result" 404 | } 405 | ], 406 | "source": [ 407 | "# train the model using X_train_dtm\n", 408 | "nb.fit(X_train_dtm, y_train)" 409 | ] 410 | }, 411 | { 412 | "cell_type": "code", 413 | "execution_count": 16, 414 | "metadata": { 415 | "collapsed": true 416 | }, 417 | "outputs": [], 418 | "source": [ 419 | "# make class predictions for X_test_dtm\n", 420 | "y_pred_class = nb.predict(X_test_dtm)" 421 | ] 422 | }, 423 | { 424 | "cell_type": "code", 425 | "execution_count": 17, 426 | "metadata": { 427 | "collapsed": false 428 | }, 429 | "outputs": [ 430 | { 431 | "data": { 432 | "text/plain": [ 433 | "0.91878669275929548" 434 | ] 435 | }, 436 | "execution_count": 17, 437 | "metadata": {}, 438 | "output_type": "execute_result" 439 | } 440 | ], 441 | "source": [ 442 | "# calculate accuracy of class predictions\n", 443 | "from sklearn import metrics\n", 444 | "metrics.accuracy_score(y_test, y_pred_class)" 445 | ] 446 | }, 447 | { 448 | "cell_type": "code", 449 | "execution_count": 18, 450 | "metadata": { 451 | "collapsed": false 452 | }, 453 | "outputs": [ 454 | { 455 | "data": { 456 | "text/plain": [ 457 | "array([[126, 58],\n", 458 | " [ 25, 813]])" 459 | ] 460 | }, 461 | "execution_count": 18, 462 | "metadata": {}, 463 | "output_type": "execute_result" 464 | } 465 | ], 466 | "source": [ 467 | "# print the 
confusion matrix\n", 468 | "metrics.confusion_matrix(y_test, y_pred_class)" 469 | ] 470 | }, 471 | { 472 | "cell_type": "markdown", 473 | "metadata": {}, 474 | "source": [ 475 | "## Task 6 (Challenge)\n", 476 | "\n", 477 | "Calculate the **null accuracy**, which is the classification accuracy that could be achieved by always predicting the most frequent class.\n", 478 | "\n", 479 | "- **Hint:** [Evaluating a classification model](https://github.com/justmarkham/scikit-learn-videos/blob/master/09_classification_metrics.ipynb) explains null accuracy and demonstrates two ways to calculate it, though only one of those ways will work in this case. Alternatively, you can come up with your own method to calculate null accuracy!" 480 | ] 481 | }, 482 | { 483 | "cell_type": "code", 484 | "execution_count": 19, 485 | "metadata": { 486 | "collapsed": false 487 | }, 488 | "outputs": [ 489 | { 490 | "data": { 491 | "text/plain": [ 492 | "5 838\n", 493 | "1 184\n", 494 | "Name: stars, dtype: int64" 495 | ] 496 | }, 497 | "execution_count": 19, 498 | "metadata": {}, 499 | "output_type": "execute_result" 500 | } 501 | ], 502 | "source": [ 503 | "# examine the class distribution of the testing set\n", 504 | "y_test.value_counts()" 505 | ] 506 | }, 507 | { 508 | "cell_type": "code", 509 | "execution_count": 20, 510 | "metadata": { 511 | "collapsed": false 512 | }, 513 | "outputs": [ 514 | { 515 | "data": { 516 | "text/plain": [ 517 | "5 0.819961\n", 518 | "Name: stars, dtype: float64" 519 | ] 520 | }, 521 | "execution_count": 20, 522 | "metadata": {}, 523 | "output_type": "execute_result" 524 | } 525 | ], 526 | "source": [ 527 | "# calculate null accuracy\n", 528 | "y_test.value_counts().head(1) / y_test.shape" 529 | ] 530 | }, 531 | { 532 | "cell_type": "code", 533 | "execution_count": 21, 534 | "metadata": { 535 | "collapsed": false 536 | }, 537 | "outputs": [ 538 | { 539 | "data": { 540 | "text/plain": [ 541 | "0.8199608610567515" 542 | ] 543 | }, 544 | "execution_count": 21, 545 
| "metadata": {}, 546 | "output_type": "execute_result" 547 | } 548 | ], 549 | "source": [ 550 | "# calculate null accuracy manually\n", 551 | "838 / float(838 + 184)" 552 | ] 553 | }, 554 | { 555 | "cell_type": "markdown", 556 | "metadata": {}, 557 | "source": [ 558 | "## Task 7 (Challenge)\n", 559 | "\n", 560 | "Browse through the review text of some of the **false positives** and **false negatives**. Based on your knowledge of how Naive Bayes works, do you have any ideas about why the model is incorrectly classifying these reviews?\n", 561 | "\n", 562 | "- **Hint:** [Evaluating a classification model](https://github.com/justmarkham/scikit-learn-videos/blob/master/09_classification_metrics.ipynb) explains the definitions of \"false positives\" and \"false negatives\".\n", 563 | "- **Hint:** Think about what a false positive means in this context, and what a false negative means in this context. What has scikit-learn defined as the \"positive class\"?" 564 | ] 565 | }, 566 | { 567 | "cell_type": "code", 568 | "execution_count": 22, 569 | "metadata": { 570 | "collapsed": false 571 | }, 572 | "outputs": [ 573 | { 574 | "data": { 575 | "text/plain": [ 576 | "2175 This has to be the worst restaurant in terms o...\n", 577 | "1781 If you like the stuck up Scottsdale vibe this ...\n", 578 | "2674 I'm sorry to be what seems to be the lone one ...\n", 579 | "9984 Went last night to Whore Foods to get basics t...\n", 580 | "3392 I found Lisa G's while driving through phoenix...\n", 581 | "8283 Don't know where I should start. Grand opening...\n", 582 | "2765 Went last week, and ordered a dozen variety. 
I...\n", 583 | "2839 Never Again,\\r\\nI brought my Mountain Bike in ...\n", 584 | "321 My wife and I live around the corner, hadn't e...\n", 585 | "1919 D-scust-ing.\n", 586 | "Name: text, dtype: object" 587 | ] 588 | }, 589 | "execution_count": 22, 590 | "metadata": {}, 591 | "output_type": "execute_result" 592 | } 593 | ], 594 | "source": [ 595 | "# first 10 false positives (1-star reviews incorrectly classified as 5-star reviews)\n", 596 | "X_test[y_test < y_pred_class].head(10)" 597 | ] 598 | }, 599 | { 600 | "cell_type": "code", 601 | "execution_count": 23, 602 | "metadata": { 603 | "collapsed": false 604 | }, 605 | "outputs": [ 606 | { 607 | "data": { 608 | "text/plain": [ 609 | "\"If you like the stuck up Scottsdale vibe this is a good place for you. The food isn't impressive. Nice outdoor seating.\"" 610 | ] 611 | }, 612 | "execution_count": 23, 613 | "metadata": {}, 614 | "output_type": "execute_result" 615 | } 616 | ], 617 | "source": [ 618 | "# false positive: model is reacting to the words \"good\", \"impressive\", \"nice\"\n", 619 | "X_test[1781]" 620 | ] 621 | }, 622 | { 623 | "cell_type": "code", 624 | "execution_count": 24, 625 | "metadata": { 626 | "collapsed": false 627 | }, 628 | "outputs": [ 629 | { 630 | "data": { 631 | "text/plain": [ 632 | "'D-scust-ing.'" 633 | ] 634 | }, 635 | "execution_count": 24, 636 | "metadata": {}, 637 | "output_type": "execute_result" 638 | } 639 | ], 640 | "source": [ 641 | "# false positive: model does not have enough data to work with\n", 642 | "X_test[1919]" 643 | ] 644 | }, 645 | { 646 | "cell_type": "code", 647 | "execution_count": 25, 648 | "metadata": { 649 | "collapsed": false 650 | }, 651 | "outputs": [ 652 | { 653 | "data": { 654 | "text/plain": [ 655 | "7148 I now consider myself an Arizonian. If you dri...\n", 656 | "4963 This is by far my favourite department store, ...\n", 657 | "6318 Since I have ranted recently on poor customer ...\n", 658 | "380 This is a must try for any Mani Pedi fan. 
I us...\n", 659 | "5565 I`ve had work done by this shop a few times th...\n", 660 | "3448 I was there last week with my sisters and whil...\n", 661 | "6050 I went to sears today to check on a layaway th...\n", 662 | "2504 I've passed by prestige nails in walmart 100s ...\n", 663 | "2475 This place is so great! I am a nanny and had t...\n", 664 | "241 I was sad to come back to lai lai's and they n...\n", 665 | "Name: text, dtype: object" 666 | ] 667 | }, 668 | "execution_count": 25, 669 | "metadata": {}, 670 | "output_type": "execute_result" 671 | } 672 | ], 673 | "source": [ 674 | "# first 10 false negatives (5-star reviews incorrectly classified as 1-star reviews)\n", 675 | "X_test[y_test > y_pred_class].head(10)" 676 | ] 677 | }, 678 | { 679 | "cell_type": "code", 680 | "execution_count": 26, 681 | "metadata": { 682 | "collapsed": false 683 | }, 684 | "outputs": [ 685 | { 686 | "data": { 687 | "text/plain": [ 688 | "'This is by far my favourite department store, hands down. I have had nothing but perfect experiences in this store, without exception, no matter what department I\\'m in. The shoe SA\\'s will bend over backwards to help you find a specific shoe, and the staff will even go so far as to send out hand-written thank you cards to your home address after you make a purchase - big or small. Tim & Anthony in the shoe salon are fabulous beyond words! \\r\\n\\r\\nI am not completely sure that I understand why people complain about the amount of merchandise on the floor or the lack of crowds in this store. Frankly, I would rather not be bombarded with merchandise and other people. One of the things I love the most about Barney\\'s is not only the prompt attention of SA\\'s, but the fact that they aren\\'t rushing around trying to help 35 people at once. The SA\\'s at Barney\\'s are incredibly friendly and will stop to have an actual conversation, regardless or whether you are purchasing something or not. 
I have also never experienced a \"high pressure\" sale situation here.\\r\\n\\r\\nAll in all, Barneys is pricey, and there is no getting around it. But, um, so is Neiman\\'s and that place is a crock. Anywhere that ONLY accepts American Express or their charge card and then treats you like scum if you aren\\'t carrying neither is no place that I want to spend my hard earned dollars. Yay Barneys!'" 689 | ] 690 | }, 691 | "execution_count": 26, 692 | "metadata": {}, 693 | "output_type": "execute_result" 694 | } 695 | ], 696 | "source": [ 697 | "# false negative: model is reacting to the words \"complain\", \"crowds\", \"rushing\", \"pricey\", \"scum\"\n", 698 | "X_test[4963]" 699 | ] 700 | }, 701 | { 702 | "cell_type": "markdown", 703 | "metadata": {}, 704 | "source": [ 705 | "## Task 8 (Challenge)\n", 706 | "\n", 707 | "Calculate which 10 tokens are the most predictive of **5-star reviews**, and which 10 tokens are the most predictive of **1-star reviews**.\n", 708 | "\n", 709 | "- **Hint:** Naive Bayes automatically counts the number of times each token appears in each class, as well as the number of observations in each class. You can access these counts via the `feature_count_` and `class_count_` attributes of the Naive Bayes model object." 
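The arithmetic behind this hint can be sketched as follows. This is a minimal illustration, assuming add-one smoothing as used in the solution cells of this notebook. The helper name `smoothed_ratio` is hypothetical, and the raw token counts for "fantastic" (192 five-star, 1 one-star) are back-computed from the frequencies displayed in this notebook, not read from the model.

```python
def smoothed_ratio(five_count, one_count, n_five, n_one):
    """Ratio of five-star to one-star token frequency, with add-one smoothing.

    Hypothetical helper for illustration only; not part of scikit-learn.
    """
    # add 1 to each count so a token missing from one class never divides by zero
    five_freq = (five_count + 1) / float(n_five)
    one_freq = (one_count + 1) / float(n_one)
    return five_freq / one_freq


# class sizes as reported by nb.class_count_ in this notebook: 565 one-star, 2499 five-star
# token counts are assumptions, back-computed from the displayed frequencies for "fantastic"
ratio = smoothed_ratio(five_count=192, one_count=1, n_five=2499, n_one=565)
print(round(ratio, 6))
```

A ratio far above 1 marks a token as predictive of 5-star reviews, and a ratio far below 1 marks it as predictive of 1-star reviews.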
710 | ] 711 | }, 712 | { 713 | "cell_type": "code", 714 | "execution_count": 27, 715 | "metadata": { 716 | "collapsed": false 717 | }, 718 | "outputs": [ 719 | { 720 | "data": { 721 | "text/plain": [ 722 | "16825" 723 | ] 724 | }, 725 | "execution_count": 27, 726 | "metadata": {}, 727 | "output_type": "execute_result" 728 | } 729 | ], 730 | "source": [ 731 | "# store the vocabulary of X_train\n", 732 | "X_train_tokens = vect.get_feature_names()\n", 733 | "len(X_train_tokens)" 734 | ] 735 | }, 736 | { 737 | "cell_type": "code", 738 | "execution_count": 28, 739 | "metadata": { 740 | "collapsed": false 741 | }, 742 | "outputs": [ 743 | { 744 | "data": { 745 | "text/plain": [ 746 | "(2L, 16825L)" 747 | ] 748 | }, 749 | "execution_count": 28, 750 | "metadata": {}, 751 | "output_type": "execute_result" 752 | } 753 | ], 754 | "source": [ 755 | "# first row is one-star reviews, second row is five-star reviews\n", 756 | "nb.feature_count_.shape" 757 | ] 758 | }, 759 | { 760 | "cell_type": "code", 761 | "execution_count": 29, 762 | "metadata": { 763 | "collapsed": true 764 | }, 765 | "outputs": [], 766 | "source": [ 767 | "# store the number of times each token appears across each class\n", 768 | "one_star_token_count = nb.feature_count_[0, :]\n", 769 | "five_star_token_count = nb.feature_count_[1, :]" 770 | ] 771 | }, 772 | { 773 | "cell_type": "code", 774 | "execution_count": 30, 775 | "metadata": { 776 | "collapsed": true 777 | }, 778 | "outputs": [], 779 | "source": [ 780 | "# create a DataFrame of tokens with their separate one-star and five-star counts\n", 781 | "tokens = pd.DataFrame({'token':X_train_tokens, 'one_star':one_star_token_count, 'five_star':five_star_token_count}).set_index('token')" 782 | ] 783 | }, 784 | { 785 | "cell_type": "code", 786 | "execution_count": 31, 787 | "metadata": { 788 | "collapsed": true 789 | }, 790 | "outputs": [], 791 | "source": [ 792 | "# add 1 to one-star and five-star counts to avoid dividing by 0\n", 793 | "tokens['one_star'] = 
tokens.one_star + 1\n", 794 | "tokens['five_star'] = tokens.five_star + 1" 795 | ] 796 | }, 797 | { 798 | "cell_type": "code", 799 | "execution_count": 32, 800 | "metadata": { 801 | "collapsed": false 802 | }, 803 | "outputs": [ 804 | { 805 | "data": { 806 | "text/plain": [ 807 | "array([ 565., 2499.])" 808 | ] 809 | }, 810 | "execution_count": 32, 811 | "metadata": {}, 812 | "output_type": "execute_result" 813 | } 814 | ], 815 | "source": [ 816 | "# first number is one-star reviews, second number is five-star reviews\n", 817 | "nb.class_count_" 818 | ] 819 | }, 820 | { 821 | "cell_type": "code", 822 | "execution_count": 33, 823 | "metadata": { 824 | "collapsed": true 825 | }, 826 | "outputs": [], 827 | "source": [ 828 | "# convert the one-star and five-star counts into frequencies\n", 829 | "tokens['one_star'] = tokens.one_star / nb.class_count_[0]\n", 830 | "tokens['five_star'] = tokens.five_star / nb.class_count_[1]" 831 | ] 832 | }, 833 | { 834 | "cell_type": "code", 835 | "execution_count": 34, 836 | "metadata": { 837 | "collapsed": true 838 | }, 839 | "outputs": [], 840 | "source": [ 841 | "# calculate the ratio of five-star to one-star for each token\n", 842 | "tokens['five_star_ratio'] = tokens.five_star / tokens.one_star" 843 | ] 844 | }, 845 | { 846 | "cell_type": "code", 847 | "execution_count": 35, 848 | "metadata": { 849 | "collapsed": false 850 | }, 851 | "outputs": [ 852 | { 853 | "data": { 854 | "text/html": [ 855 | "
<div>\n",
"<table>\n",
"  <thead>\n",
"    <tr><th></th><th>five_star</th><th>one_star</th><th>five_star_ratio</th></tr>\n",
"    <tr><th>token</th><th></th><th></th><th></th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>fantastic</th><td>0.077231</td><td>0.003540</td><td>21.817727</td></tr>\n",
"    <tr><th>perfect</th><td>0.098039</td><td>0.005310</td><td>18.464052</td></tr>\n",
"    <tr><th>yum</th><td>0.024810</td><td>0.001770</td><td>14.017607</td></tr>\n",
"    <tr><th>favorite</th><td>0.138055</td><td>0.012389</td><td>11.143029</td></tr>\n",
"    <tr><th>outstanding</th><td>0.019608</td><td>0.001770</td><td>11.078431</td></tr>\n",
"    <tr><th>brunch</th><td>0.016807</td><td>0.001770</td><td>9.495798</td></tr>\n",
"    <tr><th>gem</th><td>0.016006</td><td>0.001770</td><td>9.043617</td></tr>\n",
"    <tr><th>mozzarella</th><td>0.015606</td><td>0.001770</td><td>8.817527</td></tr>\n",
"    <tr><th>pasty</th><td>0.015606</td><td>0.001770</td><td>8.817527</td></tr>\n",
"    <tr><th>amazing</th><td>0.185274</td><td>0.021239</td><td>8.723323</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>
" 935 | ], 936 | "text/plain": [ 937 | " five_star one_star five_star_ratio\n", 938 | "token \n", 939 | "fantastic 0.077231 0.003540 21.817727\n", 940 | "perfect 0.098039 0.005310 18.464052\n", 941 | "yum 0.024810 0.001770 14.017607\n", 942 | "favorite 0.138055 0.012389 11.143029\n", 943 | "outstanding 0.019608 0.001770 11.078431\n", 944 | "brunch 0.016807 0.001770 9.495798\n", 945 | "gem 0.016006 0.001770 9.043617\n", 946 | "mozzarella 0.015606 0.001770 8.817527\n", 947 | "pasty 0.015606 0.001770 8.817527\n", 948 | "amazing 0.185274 0.021239 8.723323" 949 | ] 950 | }, 951 | "execution_count": 35, 952 | "metadata": {}, 953 | "output_type": "execute_result" 954 | } 955 | ], 956 | "source": [ 957 | "# sort the DataFrame by five_star_ratio (descending order), and examine the first 10 rows\n", 958 | "# note: use sort() instead of sort_values() for pandas 0.16.2 and earlier\n", 959 | "tokens.sort_values('five_star_ratio', ascending=False).head(10)" 960 | ] 961 | }, 962 | { 963 | "cell_type": "code", 964 | "execution_count": 36, 965 | "metadata": { 966 | "collapsed": false 967 | }, 968 | "outputs": [ 969 | { 970 | "data": { 971 | "text/html": [ 972 | "
<div>\n",
"<table>\n",
"  <thead>\n",
"    <tr><th></th><th>five_star</th><th>one_star</th><th>five_star_ratio</th></tr>\n",
"    <tr><th>token</th><th></th><th></th><th></th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>staffperson</th><td>0.0004</td><td>0.030088</td><td>0.013299</td></tr>\n",
"    <tr><th>refused</th><td>0.0004</td><td>0.024779</td><td>0.016149</td></tr>\n",
"    <tr><th>disgusting</th><td>0.0008</td><td>0.042478</td><td>0.018841</td></tr>\n",
"    <tr><th>filthy</th><td>0.0004</td><td>0.019469</td><td>0.020554</td></tr>\n",
"    <tr><th>unprofessional</th><td>0.0004</td><td>0.015929</td><td>0.025121</td></tr>\n",
"    <tr><th>unacceptable</th><td>0.0004</td><td>0.015929</td><td>0.025121</td></tr>\n",
"    <tr><th>acknowledge</th><td>0.0004</td><td>0.015929</td><td>0.025121</td></tr>\n",
"    <tr><th>ugh</th><td>0.0008</td><td>0.030088</td><td>0.026599</td></tr>\n",
"    <tr><th>fuse</th><td>0.0004</td><td>0.014159</td><td>0.028261</td></tr>\n",
"    <tr><th>boca</th><td>0.0004</td><td>0.014159</td><td>0.028261</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>
" 1052 | ], 1053 | "text/plain": [ 1054 | " five_star one_star five_star_ratio\n", 1055 | "token \n", 1056 | "staffperson 0.0004 0.030088 0.013299\n", 1057 | "refused 0.0004 0.024779 0.016149\n", 1058 | "disgusting 0.0008 0.042478 0.018841\n", 1059 | "filthy 0.0004 0.019469 0.020554\n", 1060 | "unprofessional 0.0004 0.015929 0.025121\n", 1061 | "unacceptable 0.0004 0.015929 0.025121\n", 1062 | "acknowledge 0.0004 0.015929 0.025121\n", 1063 | "ugh 0.0008 0.030088 0.026599\n", 1064 | "fuse 0.0004 0.014159 0.028261\n", 1065 | "boca 0.0004 0.014159 0.028261" 1066 | ] 1067 | }, 1068 | "execution_count": 36, 1069 | "metadata": {}, 1070 | "output_type": "execute_result" 1071 | } 1072 | ], 1073 | "source": [ 1074 | "# sort the DataFrame by five_star_ratio (ascending order), and examine the first 10 rows\n", 1075 | "tokens.sort_values('five_star_ratio', ascending=True).head(10)" 1076 | ] 1077 | }, 1078 | { 1079 | "cell_type": "markdown", 1080 | "metadata": {}, 1081 | "source": [ 1082 | "## Task 9 (Challenge)\n", 1083 | "\n", 1084 | "Up to this point, we have framed this as a **binary classification problem** by only considering the 5-star and 1-star reviews. Now, let's repeat the model building process using all reviews, which makes this a **5-class classification problem**.\n", 1085 | "\n", 1086 | "Here are the steps:\n", 1087 | "\n", 1088 | "- Define X and y using the original DataFrame. (y should contain 5 different classes.)\n", 1089 | "- Split X and y into training and testing sets.\n", 1090 | "- Create document-term matrices using CountVectorizer.\n", 1091 | "- Calculate the testing accuracy of a Multinomial Naive Bayes model.\n", 1092 | "- Compare the testing accuracy with the null accuracy, and comment on the results.\n", 1093 | "- Print the confusion matrix, and comment on the results. 
(This [Stack Overflow answer](http://stackoverflow.com/a/30748053/1636598) explains how to read a multi-class confusion matrix.)\n", 1094 | "- Print the [classification report](http://scikit-learn.org/stable/modules/model_evaluation.html#classification-report), and comment on the results. If you are unfamiliar with the terminology it uses, research the terms, and then try to figure out how to calculate these metrics manually from the confusion matrix!" 1095 | ] 1096 | }, 1097 | { 1098 | "cell_type": "code", 1099 | "execution_count": 37, 1100 | "metadata": { 1101 | "collapsed": true 1102 | }, 1103 | "outputs": [], 1104 | "source": [ 1105 | "# define X and y using the original DataFrame\n", 1106 | "X = yelp.text\n", 1107 | "y = yelp.stars" 1108 | ] 1109 | }, 1110 | { 1111 | "cell_type": "code", 1112 | "execution_count": 38, 1113 | "metadata": { 1114 | "collapsed": false 1115 | }, 1116 | "outputs": [ 1117 | { 1118 | "data": { 1119 | "text/plain": [ 1120 | "1 749\n", 1121 | "2 927\n", 1122 | "3 1461\n", 1123 | "4 3526\n", 1124 | "5 3337\n", 1125 | "Name: stars, dtype: int64" 1126 | ] 1127 | }, 1128 | "execution_count": 38, 1129 | "metadata": {}, 1130 | "output_type": "execute_result" 1131 | } 1132 | ], 1133 | "source": [ 1134 | "# check that y contains 5 different classes\n", 1135 | "y.value_counts().sort_index()" 1136 | ] 1137 | }, 1138 | { 1139 | "cell_type": "code", 1140 | "execution_count": 39, 1141 | "metadata": { 1142 | "collapsed": true 1143 | }, 1144 | "outputs": [], 1145 | "source": [ 1146 | "# split X and y into training and testing sets\n", 1147 | "X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)" 1148 | ] 1149 | }, 1150 | { 1151 | "cell_type": "code", 1152 | "execution_count": 40, 1153 | "metadata": { 1154 | "collapsed": true 1155 | }, 1156 | "outputs": [], 1157 | "source": [ 1158 | "# create document-term matrices using CountVectorizer\n", 1159 | "X_train_dtm = vect.fit_transform(X_train)\n", 1160 | "X_test_dtm = 
vect.transform(X_test)" 1161 | ] 1162 | }, 1163 | { 1164 | "cell_type": "code", 1165 | "execution_count": 41, 1166 | "metadata": { 1167 | "collapsed": false 1168 | }, 1169 | "outputs": [ 1170 | { 1171 | "data": { 1172 | "text/plain": [ 1173 | "MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)" 1174 | ] 1175 | }, 1176 | "execution_count": 41, 1177 | "metadata": {}, 1178 | "output_type": "execute_result" 1179 | } 1180 | ], 1181 | "source": [ 1182 | "# fit a Multinomial Naive Bayes model\n", 1183 | "nb.fit(X_train_dtm, y_train)" 1184 | ] 1185 | }, 1186 | { 1187 | "cell_type": "code", 1188 | "execution_count": 42, 1189 | "metadata": { 1190 | "collapsed": true 1191 | }, 1192 | "outputs": [], 1193 | "source": [ 1194 | "# make class predictions\n", 1195 | "y_pred_class = nb.predict(X_test_dtm)" 1196 | ] 1197 | }, 1198 | { 1199 | "cell_type": "code", 1200 | "execution_count": 43, 1201 | "metadata": { 1202 | "collapsed": false 1203 | }, 1204 | "outputs": [ 1205 | { 1206 | "data": { 1207 | "text/plain": [ 1208 | "0.47120000000000001" 1209 | ] 1210 | }, 1211 | "execution_count": 43, 1212 | "metadata": {}, 1213 | "output_type": "execute_result" 1214 | } 1215 | ], 1216 | "source": [ 1217 | "# calculate the accuracy\n", 1218 | "metrics.accuracy_score(y_test, y_pred_class)" 1219 | ] 1220 | }, 1221 | { 1222 | "cell_type": "code", 1223 | "execution_count": 44, 1224 | "metadata": { 1225 | "collapsed": false 1226 | }, 1227 | "outputs": [ 1228 | { 1229 | "data": { 1230 | "text/plain": [ 1231 | "4 0.3536\n", 1232 | "Name: stars, dtype: float64" 1233 | ] 1234 | }, 1235 | "execution_count": 44, 1236 | "metadata": {}, 1237 | "output_type": "execute_result" 1238 | } 1239 | ], 1240 | "source": [ 1241 | "# calculate the null accuracy\n", 1242 | "y_test.value_counts().head(1) / y_test.shape" 1243 | ] 1244 | }, 1245 | { 1246 | "cell_type": "markdown", 1247 | "metadata": {}, 1248 | "source": [ 1249 | "**Accuracy comments:** At first glance, 47% accuracy does not seem very good, given
that it is not much higher than the null accuracy. However, I would consider the 47% accuracy to be quite impressive, given that humans would also have a hard time precisely identifying the star rating for many of these reviews." 1250 | ] 1251 | }, 1252 | { 1253 | "cell_type": "code", 1254 | "execution_count": 45, 1255 | "metadata": { 1256 | "collapsed": false 1257 | }, 1258 | "outputs": [ 1259 | { 1260 | "data": { 1261 | "text/plain": [ 1262 | "array([[ 55, 14, 24, 65, 27],\n", 1263 | " [ 28, 16, 41, 122, 27],\n", 1264 | " [ 5, 7, 35, 281, 37],\n", 1265 | " [ 7, 0, 16, 629, 232],\n", 1266 | " [ 6, 4, 6, 373, 443]])" 1267 | ] 1268 | }, 1269 | "execution_count": 45, 1270 | "metadata": {}, 1271 | "output_type": "execute_result" 1272 | } 1273 | ], 1274 | "source": [ 1275 | "# print the confusion matrix\n", 1276 | "metrics.confusion_matrix(y_test, y_pred_class)" 1277 | ] 1278 | }, 1279 | { 1280 | "cell_type": "markdown", 1281 | "metadata": {}, 1282 | "source": [ 1283 | "**Confusion matrix comments:**\n", 1284 | "\n", 1285 | "- Nearly all 4-star and 5-star reviews are classified as 4 or 5 stars, but they are hard for the model to distinguish between.\n", 1286 | "- 1-star, 2-star, and 3-star reviews are most commonly classified as 4 stars, probably because it's the predominant class in the training data." 
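The accuracy and confusion matrix comments above can be checked directly from the matrix itself. The following is a minimal sketch in plain Python, assuming the scikit-learn convention that rows are true classes and columns are predicted classes. The variable names are illustrative and do not appear in the notebook.

```python
# confusion matrix copied from the output above (rows = true class, columns = predicted class)
cm = [
    [55, 14, 24, 65, 27],
    [28, 16, 41, 122, 27],
    [5, 7, 35, 281, 37],
    [7, 0, 16, 629, 232],
    [6, 4, 6, 373, 443],
]

total = sum(sum(row) for row in cm)

# accuracy: correct predictions sit on the diagonal
accuracy = sum(cm[i][i] for i in range(len(cm))) / float(total)

# null accuracy: always predict the class with the most true observations (largest row sum)
null_accuracy = max(sum(row) for row in cm) / float(total)

# share of true 1-, 2-, and 3-star reviews that were predicted as 4 stars (column index 3)
low_star_rows = cm[:3]
low_star_total = sum(sum(row) for row in low_star_rows)
predicted_four = sum(row[3] for row in low_star_rows) / float(low_star_total)

print(accuracy, null_accuracy, predicted_four)
```

The first two values reproduce the 47% accuracy and 35% null accuracy reported earlier, and the third shows that more than half of the low-star reviews were predicted as 4 stars.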
1287 | ] 1288 | }, 1289 | { 1290 | "cell_type": "code", 1291 | "execution_count": 46, 1292 | "metadata": { 1293 | "collapsed": false 1294 | }, 1295 | "outputs": [ 1296 | { 1297 | "name": "stdout", 1298 | "output_type": "stream", 1299 | "text": [ 1300 | " precision recall f1-score support\n", 1301 | "\n", 1302 | " 1 0.54 0.30 0.38 185\n", 1303 | " 2 0.39 0.07 0.12 234\n", 1304 | " 3 0.29 0.10 0.14 365\n", 1305 | " 4 0.43 0.71 0.53 884\n", 1306 | " 5 0.58 0.53 0.55 832\n", 1307 | "\n", 1308 | "avg / total 0.46 0.47 0.43 2500\n", 1309 | "\n" 1310 | ] 1311 | } 1312 | ], 1313 | "source": [ 1314 | "# print the classification report\n", 1315 | "print(metrics.classification_report(y_test, y_pred_class))" 1316 | ] 1317 | }, 1318 | { 1319 | "cell_type": "markdown", 1320 | "metadata": {}, 1321 | "source": [ 1322 | "**Precision** answers the question: \"When a given class is predicted, how often are those predictions correct?\" To calculate the precision for class 1, for example, you divide 55 by the sum of the first column of the confusion matrix." 1323 | ] 1324 | }, 1325 | { 1326 | "cell_type": "code", 1327 | "execution_count": 47, 1328 | "metadata": { 1329 | "collapsed": false 1330 | }, 1331 | "outputs": [ 1332 | { 1333 | "name": "stdout", 1334 | "output_type": "stream", 1335 | "text": [ 1336 | "0.544554455446\n" 1337 | ] 1338 | } 1339 | ], 1340 | "source": [ 1341 | "# manually calculate the precision for class 1\n", 1342 | "precision = 55 / float(55 + 28 + 5 + 7 + 6)\n", 1343 | "print(precision)" 1344 | ] 1345 | }, 1346 | { 1347 | "cell_type": "markdown", 1348 | "metadata": {}, 1349 | "source": [ 1350 | "**Recall** answers the question: \"When a given class is the true class, how often is that class predicted?\" To calculate the recall for class 1, for example, you divide 55 by the sum of the first row of the confusion matrix." 
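The precision and recall definitions above extend to every class at once. This is a minimal sketch in plain Python rather than scikit-learn, assuming the confusion matrix printed earlier in this notebook (rows are true classes, columns are predicted classes). The list names are illustrative.

```python
# confusion matrix from the 5-class model (rows = true class, columns = predicted class)
cm = [
    [55, 14, 24, 65, 27],
    [28, 16, 41, 122, 27],
    [5, 7, 35, 281, 37],
    [7, 0, 16, 629, 232],
    [6, 4, 6, 373, 443],
]
n = len(cm)

# precision: diagonal entry divided by its column sum (all predictions of that class)
precision = [cm[i][i] / float(sum(cm[r][i] for r in range(n))) for i in range(n)]

# recall: diagonal entry divided by its row sum (all true members of that class)
recall = [cm[i][i] / float(sum(cm[i])) for i in range(n)]

for star, (p, r) in enumerate(zip(precision, recall), start=1):
    print(star, round(p, 2), round(r, 2))
```

Rounded to two decimals, these values should agree with the precision and recall columns of the classification report.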
1351 | ] 1352 | }, 1353 | { 1354 | "cell_type": "code", 1355 | "execution_count": 48, 1356 | "metadata": { 1357 | "collapsed": false 1358 | }, 1359 | "outputs": [ 1360 | { 1361 | "name": "stdout", 1362 | "output_type": "stream", 1363 | "text": [ 1364 | "0.297297297297\n" 1365 | ] 1366 | } 1367 | ], 1368 | "source": [ 1369 | "# manually calculate the recall for class 1\n", 1370 | "recall = 55 / float(55 + 14 + 24 + 65 + 27)\n", 1371 | "print(recall)" 1372 | ] 1373 | }, 1374 | { 1375 | "cell_type": "markdown", 1376 | "metadata": {}, 1377 | "source": [ 1378 | "**F1 score** is the harmonic mean of precision and recall, so it is high only when both are high." 1379 | ] 1380 | }, 1381 | { 1382 | "cell_type": "code", 1383 | "execution_count": 49, 1384 | "metadata": { 1385 | "collapsed": false 1386 | }, 1387 | "outputs": [ 1388 | { 1389 | "name": "stdout", 1390 | "output_type": "stream", 1391 | "text": [ 1392 | "0.384615384615\n" 1393 | ] 1394 | } 1395 | ], 1396 | "source": [ 1397 | "# manually calculate the F1 score for class 1\n", 1398 | "f1 = 2 * (precision * recall) / (precision + recall)\n", 1399 | "print(f1)" 1400 | ] 1401 | }, 1402 | { 1403 | "cell_type": "markdown", 1404 | "metadata": {}, 1405 | "source": [ 1406 | "**Support** answers the question: \"How many observations exist for which a given class is the true class?\" To calculate the support for class 1, for example, you sum the first row of the confusion matrix."
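The formula used in this notebook for F1, `2 * (precision * recall) / (precision + recall)`, is the harmonic mean of precision and recall. The sketch below checks that equivalence for class 1 and computes the support, using only numbers from the first row and first column of the confusion matrix. The variable names are illustrative.

```python
# class 1 figures taken from the confusion matrix in this notebook
true_positives = 55
predicted_as_one = 55 + 28 + 5 + 7 + 6   # first column sum: everything predicted as 1 star
actually_one = 55 + 14 + 24 + 65 + 27    # first row sum: everything that truly is 1 star

precision = true_positives / float(predicted_as_one)
recall = true_positives / float(actually_one)

# harmonic mean: the reciprocal of the average of the reciprocals
f1_harmonic = 2 / (1 / precision + 1 / recall)

# the algebraically equivalent form used in the notebook
f1_formula = 2 * (precision * recall) / (precision + recall)

# support is the number of true observations of the class
support = actually_one

print(f1_harmonic, f1_formula, support)
```

Because the harmonic mean is pulled toward the smaller of its two inputs, the low recall of class 1 (about 0.30) keeps its F1 score well below its precision (about 0.54).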
1407 | ] 1408 | }, 1409 | { 1410 | "cell_type": "code", 1411 | "execution_count": 50, 1412 | "metadata": { 1413 | "collapsed": false 1414 | }, 1415 | "outputs": [ 1416 | { 1417 | "name": "stdout", 1418 | "output_type": "stream", 1419 | "text": [ 1420 | "185\n" 1421 | ] 1422 | } 1423 | ], 1424 | "source": [ 1425 | "# manually calculate the support for class 1\n", 1426 | "support = 55 + 14 + 24 + 65 + 27\n", 1427 | "print(support)" 1428 | ] 1429 | }, 1430 | { 1431 | "cell_type": "markdown", 1432 | "metadata": {}, 1433 | "source": [ 1434 | "**Classification report comments:**\n", 1435 | "\n", 1436 | "- Class 1 has low recall, meaning that the model has a hard time detecting the 1-star reviews, but high precision, meaning that when the model predicts a review is 1-star, it's usually correct.\n", 1437 | "- Class 5 has high recall and precision, probably because 5-star reviews have polarized language, and because the model has a lot of observations to learn from." 1438 | ] 1439 | } 1440 | ], 1441 | "metadata": { 1442 | "kernelspec": { 1443 | "display_name": "Python 2", 1444 | "language": "python", 1445 | "name": "python2" 1446 | }, 1447 | "language_info": { 1448 | "codemirror_mode": { 1449 | "name": "ipython", 1450 | "version": 2 1451 | }, 1452 | "file_extension": ".py", 1453 | "mimetype": "text/x-python", 1454 | "name": "python", 1455 | "nbconvert_exporter": "python", 1456 | "pygments_lexer": "ipython2", 1457 | "version": "2.7.11" 1458 | } 1459 | }, 1460 | "nbformat": 4, 1461 | "nbformat_minor": 0 1462 | } 1463 | -------------------------------------------------------------------------------- /tutorial_with_output.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Tutorial: Machine Learning with Text in scikit-learn" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "## Agenda\n", 15 | "\n", 16 
| "1. Model building in scikit-learn (refresher)\n", 17 | "2. Representing text as numerical data\n", 18 | "3. Reading a text-based dataset into pandas\n", 19 | "4. Vectorizing our dataset\n", 20 | "5. Building and evaluating a model\n", 21 | "6. Comparing models\n", 22 | "7. Examining a model for further insight\n", 23 | "8. Practicing this workflow on another dataset\n", 24 | "9. Tuning the vectorizer (discussion)" 25 | ] 26 | }, 27 | { 28 | "cell_type": "code", 29 | "execution_count": 1, 30 | "metadata": { 31 | "collapsed": false 32 | }, 33 | "outputs": [], 34 | "source": [ 35 | "# for Python 2: use print only as a function\n", 36 | "from __future__ import print_function" 37 | ] 38 | }, 39 | { 40 | "cell_type": "markdown", 41 | "metadata": {}, 42 | "source": [ 43 | "## Part 1: Model building in scikit-learn (refresher)" 44 | ] 45 | }, 46 | { 47 | "cell_type": "code", 48 | "execution_count": 2, 49 | "metadata": { 50 | "collapsed": true 51 | }, 52 | "outputs": [], 53 | "source": [ 54 | "# load the iris dataset as an example\n", 55 | "from sklearn.datasets import load_iris\n", 56 | "iris = load_iris()" 57 | ] 58 | }, 59 | { 60 | "cell_type": "code", 61 | "execution_count": 3, 62 | "metadata": { 63 | "collapsed": true 64 | }, 65 | "outputs": [], 66 | "source": [ 67 | "# store the feature matrix (X) and response vector (y)\n", 68 | "X = iris.data\n", 69 | "y = iris.target" 70 | ] 71 | }, 72 | { 73 | "cell_type": "markdown", 74 | "metadata": {}, 75 | "source": [ 76 | "**\"Features\"** are also known as predictors, inputs, or attributes. The **\"response\"** is also known as the target, label, or output." 
77 | ] 78 | }, 79 | { 80 | "cell_type": "code", 81 | "execution_count": 4, 82 | "metadata": { 83 | "collapsed": false 84 | }, 85 | "outputs": [ 86 | { 87 | "name": "stdout", 88 | "output_type": "stream", 89 | "text": [ 90 | "(150L, 4L)\n", 91 | "(150L,)\n" 92 | ] 93 | } 94 | ], 95 | "source": [ 96 | "# check the shapes of X and y\n", 97 | "print(X.shape)\n", 98 | "print(y.shape)" 99 | ] 100 | }, 101 | { 102 | "cell_type": "markdown", 103 | "metadata": {}, 104 | "source": [ 105 | "**\"Observations\"** are also known as samples, instances, or records." 106 | ] 107 | }, 108 | { 109 | "cell_type": "code", 110 | "execution_count": 5, 111 | "metadata": { 112 | "collapsed": false 113 | }, 114 | "outputs": [ 115 | { 116 | "data": { 117 | "text/html": [ 118 | "
<div>\n",
"<table>\n",
"  <thead>\n",
"    <tr><th></th><th>sepal length (cm)</th><th>sepal width (cm)</th><th>petal length (cm)</th><th>petal width (cm)</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>5.1</td><td>3.5</td><td>1.4</td><td>0.2</td></tr>\n",
"    <tr><th>1</th><td>4.9</td><td>3.0</td><td>1.4</td><td>0.2</td></tr>\n",
"    <tr><th>2</th><td>4.7</td><td>3.2</td><td>1.3</td><td>0.2</td></tr>\n",
"    <tr><th>3</th><td>4.6</td><td>3.1</td><td>1.5</td><td>0.2</td></tr>\n",
"    <tr><th>4</th><td>5.0</td><td>3.6</td><td>1.4</td><td>0.2</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>
" 168 | ], 169 | "text/plain": [ 170 | " sepal length (cm) sepal width (cm) petal length (cm) petal width (cm)\n", 171 | "0 5.1 3.5 1.4 0.2\n", 172 | "1 4.9 3.0 1.4 0.2\n", 173 | "2 4.7 3.2 1.3 0.2\n", 174 | "3 4.6 3.1 1.5 0.2\n", 175 | "4 5.0 3.6 1.4 0.2" 176 | ] 177 | }, 178 | "execution_count": 5, 179 | "metadata": {}, 180 | "output_type": "execute_result" 181 | } 182 | ], 183 | "source": [ 184 | "# examine the first 5 rows of the feature matrix (including the feature names)\n", 185 | "import pandas as pd\n", 186 | "pd.DataFrame(X, columns=iris.feature_names).head()" 187 | ] 188 | }, 189 | { 190 | "cell_type": "code", 191 | "execution_count": 6, 192 | "metadata": { 193 | "collapsed": false 194 | }, 195 | "outputs": [ 196 | { 197 | "name": "stdout", 198 | "output_type": "stream", 199 | "text": [ 200 | "[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0\n", 201 | " 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1\n", 202 | " 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2\n", 203 | " 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2\n", 204 | " 2 2]\n" 205 | ] 206 | } 207 | ], 208 | "source": [ 209 | "# examine the response vector\n", 210 | "print(y)" 211 | ] 212 | }, 213 | { 214 | "cell_type": "markdown", 215 | "metadata": {}, 216 | "source": [ 217 | "In order to **build a model**, the features must be **numeric**, and every observation must have the **same features in the same order**." 
218 | ] 219 | }, 220 | { 221 | "cell_type": "code", 222 | "execution_count": 7, 223 | "metadata": { 224 | "collapsed": false 225 | }, 226 | "outputs": [ 227 | { 228 | "data": { 229 | "text/plain": [ 230 | "KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',\n", 231 | " metric_params=None, n_jobs=1, n_neighbors=5, p=2,\n", 232 | " weights='uniform')" 233 | ] 234 | }, 235 | "execution_count": 7, 236 | "metadata": {}, 237 | "output_type": "execute_result" 238 | } 239 | ], 240 | "source": [ 241 | "# import the class\n", 242 | "from sklearn.neighbors import KNeighborsClassifier\n", 243 | "\n", 244 | "# instantiate the model (with the default parameters)\n", 245 | "knn = KNeighborsClassifier()\n", 246 | "\n", 247 | "# fit the model with data (occurs in-place)\n", 248 | "knn.fit(X, y)" 249 | ] 250 | }, 251 | { 252 | "cell_type": "markdown", 253 | "metadata": {}, 254 | "source": [ 255 | "In order to **make a prediction**, the new observation must have the **same features as the training observations**, both in number and meaning." 256 | ] 257 | }, 258 | { 259 | "cell_type": "code", 260 | "execution_count": 8, 261 | "metadata": { 262 | "collapsed": false 263 | }, 264 | "outputs": [ 265 | { 266 | "data": { 267 | "text/plain": [ 268 | "array([1])" 269 | ] 270 | }, 271 | "execution_count": 8, 272 | "metadata": {}, 273 | "output_type": "execute_result" 274 | } 275 | ], 276 | "source": [ 277 | "# predict the response for a new observation\n", 278 | "knn.predict([[3, 5, 4, 2]])" 279 | ] 280 | }, 281 | { 282 | "cell_type": "markdown", 283 | "metadata": {}, 284 | "source": [ 285 | "## Part 2: Representing text as numerical data" 286 | ] 287 | }, 288 | { 289 | "cell_type": "code", 290 | "execution_count": 9, 291 | "metadata": { 292 | "collapsed": true 293 | }, 294 | "outputs": [], 295 | "source": [ 296 | "# example text for model training (SMS messages)\n", 297 | "simple_train = ['call you tonight', 'Call me a cab', 'please call me... 
PLEASE!']" 298 | ] 299 | }, 300 | { 301 | "cell_type": "markdown", 302 | "metadata": {}, 303 | "source": [ 304 | "From the [scikit-learn documentation](http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction):\n", 305 | "\n", 306 | "> Text Analysis is a major application field for machine learning algorithms. However the raw data, a sequence of symbols cannot be fed directly to the algorithms themselves as most of them expect **numerical feature vectors with a fixed size** rather than the **raw text documents with variable length**.\n", 307 | "\n", 308 | "We will use [CountVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) to \"convert text into a matrix of token counts\":" 309 | ] 310 | }, 311 | { 312 | "cell_type": "code", 313 | "execution_count": 10, 314 | "metadata": { 315 | "collapsed": true 316 | }, 317 | "outputs": [], 318 | "source": [ 319 | "# import and instantiate CountVectorizer (with the default parameters)\n", 320 | "from sklearn.feature_extraction.text import CountVectorizer\n", 321 | "vect = CountVectorizer()" 322 | ] 323 | }, 324 | { 325 | "cell_type": "code", 326 | "execution_count": 11, 327 | "metadata": { 328 | "collapsed": false 329 | }, 330 | "outputs": [ 331 | { 332 | "data": { 333 | "text/plain": [ 334 | "CountVectorizer(analyzer=u'word', binary=False, decode_error=u'strict',\n", 335 | " dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',\n", 336 | " lowercase=True, max_df=1.0, max_features=None, min_df=1,\n", 337 | " ngram_range=(1, 1), preprocessor=None, stop_words=None,\n", 338 | " strip_accents=None, token_pattern=u'(?u)\\\\b\\\\w\\\\w+\\\\b',\n", 339 | " tokenizer=None, vocabulary=None)" 340 | ] 341 | }, 342 | "execution_count": 11, 343 | "metadata": {}, 344 | "output_type": "execute_result" 345 | } 346 | ], 347 | "source": [ 348 | "# learn the 'vocabulary' of the training data (occurs in-place)\n", 349 | "vect.fit(simple_train)" 350 | ] 351 | }, 352
| { 353 | "cell_type": "code", 354 | "execution_count": 12, 355 | "metadata": { 356 | "collapsed": false 357 | }, 358 | "outputs": [ 359 | { 360 | "data": { 361 | "text/plain": [ 362 | "[u'cab', u'call', u'me', u'please', u'tonight', u'you']" 363 | ] 364 | }, 365 | "execution_count": 12, 366 | "metadata": {}, 367 | "output_type": "execute_result" 368 | } 369 | ], 370 | "source": [ 371 | "# examine the fitted vocabulary\n", 372 | "vect.get_feature_names()" 373 | ] 374 | }, 375 | { 376 | "cell_type": "code", 377 | "execution_count": 13, 378 | "metadata": { 379 | "collapsed": false 380 | }, 381 | "outputs": [ 382 | { 383 | "data": { 384 | "text/plain": [ 385 | "<3x6 sparse matrix of type '<type 'numpy.int64'>'\n", 386 | "\twith 9 stored elements in Compressed Sparse Row format>" 387 | ] 388 | }, 389 | "execution_count": 13, 390 | "metadata": {}, 391 | "output_type": "execute_result" 392 | } 393 | ], 394 | "source": [ 395 | "# transform training data into a 'document-term matrix'\n", 396 | "simple_train_dtm = vect.transform(simple_train)\n", 397 | "simple_train_dtm" 398 | ] 399 | }, 400 | { 401 | "cell_type": "code", 402 | "execution_count": 14, 403 | "metadata": { 404 | "collapsed": false 405 | }, 406 | "outputs": [ 407 | { 408 | "data": { 409 | "text/plain": [ 410 | "array([[0, 1, 0, 0, 1, 1],\n", 411 | " [1, 1, 1, 0, 0, 0],\n", 412 | " [0, 1, 1, 2, 0, 0]], dtype=int64)" 413 | ] 414 | }, 415 | "execution_count": 14, 416 | "metadata": {}, 417 | "output_type": "execute_result" 418 | } 419 | ], 420 | "source": [ 421 | "# convert sparse matrix to a dense matrix\n", 422 | "simple_train_dtm.toarray()" 423 | ] 424 | }, 425 | { 426 | "cell_type": "code", 427 | "execution_count": 15, 428 | "metadata": { 429 | "collapsed": false 430 | }, 431 | "outputs": [ 432 | { 433 | "data": { 434 | "text/html": [ 435 | "
\n", 436 | "\n", 437 | " \n", 438 | " \n", 439 | " \n", 440 | " \n", 441 | " \n", 442 | " \n", 443 | " \n", 444 | " \n", 445 | " \n", 446 | " \n", 447 | " \n", 448 | " \n", 449 | " \n", 450 | " \n", 451 | " \n", 452 | " \n", 453 | " \n", 454 | " \n", 455 | " \n", 456 | " \n", 457 | " \n", 458 | " \n", 459 | " \n", 460 | " \n", 461 | " \n", 462 | " \n", 463 | " \n", 464 | " \n", 465 | " \n", 466 | " \n", 467 | " \n", 468 | " \n", 469 | " \n", 470 | " \n", 471 | " \n", 472 | " \n", 473 | " \n", 474 | " \n", 475 | " \n", 476 | " \n", 477 | "
   cab  call  me  please  tonight  you
0    0     1   0       0        1    1
1    1     1   1       0        0    0
2    0     1   1       2        0    0
\n", 478 | "
" 479 | ], 480 | "text/plain": [ 481 | " cab call me please tonight you\n", 482 | "0 0 1 0 0 1 1\n", 483 | "1 1 1 1 0 0 0\n", 484 | "2 0 1 1 2 0 0" 485 | ] 486 | }, 487 | "execution_count": 15, 488 | "metadata": {}, 489 | "output_type": "execute_result" 490 | } 491 | ], 492 | "source": [ 493 | "# examine the vocabulary and document-term matrix together\n", 494 | "pd.DataFrame(simple_train_dtm.toarray(), columns=vect.get_feature_names())" 495 | ] 496 | }, 497 | { 498 | "cell_type": "markdown", 499 | "metadata": {}, 500 | "source": [ 501 | "From the [scikit-learn documentation](http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction):\n", 502 | "\n", 503 | "> In this scheme, features and samples are defined as follows:\n", 504 | "\n", 505 | "> - Each individual token occurrence frequency (normalized or not) is treated as a **feature**.\n", 506 | "> - The vector of all the token frequencies for a given document is considered a multivariate **sample**.\n", 507 | "\n", 508 | "> A **corpus of documents** can thus be represented by a matrix with **one row per document** and **one column per token** (e.g. word) occurring in the corpus.\n", 509 | "\n", 510 | "> We call **vectorization** the general process of turning a collection of text documents into numerical feature vectors. This specific strategy (tokenization, counting and normalization) is called the **Bag of Words** or \"Bag of n-grams\" representation. Documents are described by word occurrences while completely ignoring the relative position information of the words in the document." 
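To make the bag-of-words idea concrete, here is a minimal sketch in plain Python (standard library only). It is not scikit-learn's implementation: the helper names `tokenize` and `build_dtm` are hypothetical, and the regular expression is assumed to mirror the default `token_pattern` shown in the CountVectorizer output earlier in this notebook. It rebuilds the small document-term matrix and shows that word order has no effect on the counts:

```python
import re
from collections import Counter

# Assumed to mirror CountVectorizer's default token_pattern:
# a token is a run of two or more word characters
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def tokenize(text):
    """Lowercase the text and split it into tokens (hypothetical helper)."""
    return TOKEN_PATTERN.findall(text.lower())


def build_dtm(docs, vocabulary=None):
    """Return (vocabulary, rows), where each row holds per-token counts."""
    if vocabulary is None:
        # "fit": learn the sorted vocabulary from the documents
        vocabulary = sorted({tok for doc in docs for tok in tokenize(doc)})
    rows = []
    for doc in docs:
        counts = Counter(tokenize(doc))
        # "transform": tokens outside the vocabulary are simply ignored
        rows.append([counts[tok] for tok in vocabulary])
    return vocabulary, rows


simple_train = ['call you tonight', 'Call me a cab', 'please call me... PLEASE!']
vocab, dtm = build_dtm(simple_train)
print(vocab)
print(dtm)

# Word order is discarded: a reordered message produces the same row
_, reordered = build_dtm(['tonight you call'], vocabulary=vocab)
print(reordered[0] == dtm[0])
```

Passing an existing `vocabulary` plays the role of calling `transform` on new text with an already fitted vectorizer.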
511 | ] 512 | }, 513 | { 514 | "cell_type": "code", 515 | "execution_count": 16, 516 | "metadata": { 517 | "collapsed": false 518 | }, 519 | "outputs": [ 520 | { 521 | "data": { 522 | "text/plain": [ 523 | "scipy.sparse.csr.csr_matrix" 524 | ] 525 | }, 526 | "execution_count": 16, 527 | "metadata": {}, 528 | "output_type": "execute_result" 529 | } 530 | ], 531 | "source": [ 532 | "# check the type of the document-term matrix\n", 533 | "type(simple_train_dtm)" 534 | ] 535 | }, 536 | { 537 | "cell_type": "code", 538 | "execution_count": 17, 539 | "metadata": { 540 | "collapsed": false, 541 | "scrolled": true 542 | }, 543 | "outputs": [ 544 | { 545 | "name": "stdout", 546 | "output_type": "stream", 547 | "text": [ 548 | " (0, 1)\t1\n", 549 | " (0, 4)\t1\n", 550 | " (0, 5)\t1\n", 551 | " (1, 0)\t1\n", 552 | " (1, 1)\t1\n", 553 | " (1, 2)\t1\n", 554 | " (2, 1)\t1\n", 555 | " (2, 2)\t1\n", 556 | " (2, 3)\t2\n" 557 | ] 558 | } 559 | ], 560 | "source": [ 561 | "# examine the sparse matrix contents\n", 562 | "print(simple_train_dtm)" 563 | ] 564 | }, 565 | { 566 | "cell_type": "markdown", 567 | "metadata": {}, 568 | "source": [ 569 | "From the [scikit-learn documentation](http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction):\n", 570 | "\n", 571 | "> As most documents will typically use a very small subset of the words used in the corpus, the resulting matrix will have **many feature values that are zeros** (typically more than 99% of them).\n", 572 | "\n", 573 | "> For instance, a collection of 10,000 short text documents (such as emails) will use a vocabulary with a size in the order of 100,000 unique words in total while each document will use 100 to 1000 unique words individually.\n", 574 | "\n", 575 | "> In order to be able to **store such a matrix in memory** but also to **speed up operations**, implementations will typically use a **sparse representation** such as the implementations available in the `scipy.sparse` package." 
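As a rough illustration of why a sparse representation saves memory, the sketch below stores only the nonzero cells of the small document-term matrix as `(row, column, value)` triples. This is a simplified coordinate-style layout written in plain Python, not the Compressed Sparse Row format that `scipy.sparse` actually uses, and the helper name `to_sparse` is hypothetical:

```python
def to_sparse(dense):
    """Keep only the nonzero cells as (row, column, value) triples."""
    return [(i, j, value)
            for i, row in enumerate(dense)
            for j, value in enumerate(row)
            if value != 0]


# The dense document-term matrix for the three training messages
dense = [[0, 1, 0, 0, 1, 1],
         [1, 1, 1, 0, 0, 0],
         [0, 1, 1, 2, 0, 0]]

stored = to_sparse(dense)
total_cells = len(dense) * len(dense[0])
density = len(stored) / total_cells

# Same layout as printing the sparse matrix: coordinates, then the count
for i, j, value in stored:
    print((i, j), value)
print(len(stored), "stored elements out of", total_cells, "cells")
```

With only three short messages, half of the cells are nonzero. In a realistic corpus the vocabulary is far larger than any single document, so the fraction of stored cells becomes very small.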
576 | ] 577 | }, 578 | { 579 | "cell_type": "code", 580 | "execution_count": 18, 581 | "metadata": { 582 | "collapsed": true 583 | }, 584 | "outputs": [], 585 | "source": [ 586 | "# example text for model testing\n", 587 | "simple_test = [\"please don't call me\"]" 588 | ] 589 | }, 590 | { 591 | "cell_type": "markdown", 592 | "metadata": {}, 593 | "source": [ 594 | "In order to **make a prediction**, the new observation must have the **same features as the training observations**, both in number and meaning." 595 | ] 596 | }, 597 | { 598 | "cell_type": "code", 599 | "execution_count": 19, 600 | "metadata": { 601 | "collapsed": false 602 | }, 603 | "outputs": [ 604 | { 605 | "data": { 606 | "text/plain": [ 607 | "array([[0, 1, 1, 1, 0, 0]], dtype=int64)" 608 | ] 609 | }, 610 | "execution_count": 19, 611 | "metadata": {}, 612 | "output_type": "execute_result" 613 | } 614 | ], 615 | "source": [ 616 | "# transform testing data into a document-term matrix (using existing vocabulary)\n", 617 | "simple_test_dtm = vect.transform(simple_test)\n", 618 | "simple_test_dtm.toarray()" 619 | ] 620 | }, 621 | { 622 | "cell_type": "code", 623 | "execution_count": 20, 624 | "metadata": { 625 | "collapsed": false 626 | }, 627 | "outputs": [ 628 | { 629 | "data": { 630 | "text/html": [ 631 | "
\n", 632 | "\n", 633 | " \n", 634 | " \n", 635 | " \n", 636 | " \n", 637 | " \n", 638 | " \n", 639 | " \n", 640 | " \n", 641 | " \n", 642 | " \n", 643 | " \n", 644 | " \n", 645 | " \n", 646 | " \n", 647 | " \n", 648 | " \n", 649 | " \n", 650 | " \n", 651 | " \n", 652 | " \n", 653 | " \n", 654 | " \n", 655 | "
   cab  call  me  please  tonight  you
0    0     1   1       1        0    0
\n", 656 | "
" 657 | ], 658 | "text/plain": [ 659 | " cab call me please tonight you\n", 660 | "0 0 1 1 1 0 0" 661 | ] 662 | }, 663 | "execution_count": 20, 664 | "metadata": {}, 665 | "output_type": "execute_result" 666 | } 667 | ], 668 | "source": [ 669 | "# examine the vocabulary and document-term matrix together\n", 670 | "pd.DataFrame(simple_test_dtm.toarray(), columns=vect.get_feature_names())" 671 | ] 672 | }, 673 | { 674 | "cell_type": "markdown", 675 | "metadata": {}, 676 | "source": [ 677 | "**Summary:**\n", 678 | "\n", 679 | "- `vect.fit(train)` **learns the vocabulary** of the training data\n", 680 | "- `vect.transform(train)` uses the **fitted vocabulary** to build a document-term matrix from the training data\n", 681 | "- `vect.transform(test)` uses the **fitted vocabulary** to build a document-term matrix from the testing data (and **ignores tokens** it hasn't seen before)" 682 | ] 683 | }, 684 | { 685 | "cell_type": "markdown", 686 | "metadata": {}, 687 | "source": [ 688 | "## Part 3: Reading a text-based dataset into pandas" 689 | ] 690 | }, 691 | { 692 | "cell_type": "code", 693 | "execution_count": 21, 694 | "metadata": { 695 | "collapsed": true 696 | }, 697 | "outputs": [], 698 | "source": [ 699 | "# read file into pandas using a relative path\n", 700 | "path = 'data/sms.tsv'\n", 701 | "sms = pd.read_table(path, header=None, names=['label', 'message'])" 702 | ] 703 | }, 704 | { 705 | "cell_type": "code", 706 | "execution_count": 22, 707 | "metadata": { 708 | "collapsed": false 709 | }, 710 | "outputs": [], 711 | "source": [ 712 | "# alternative: read file into pandas from a URL\n", 713 | "# url = 'https://raw.githubusercontent.com/justmarkham/pycon-2016-tutorial/master/data/sms.tsv'\n", 714 | "# sms = pd.read_table(url, header=None, names=['label', 'message'])" 715 | ] 716 | }, 717 | { 718 | "cell_type": "code", 719 | "execution_count": 23, 720 | "metadata": { 721 | "collapsed": false 722 | }, 723 | "outputs": [ 724 | { 725 | "data": { 726 | "text/plain": [ 
727 | "(5572, 2)" 728 | ] 729 | }, 730 | "execution_count": 23, 731 | "metadata": {}, 732 | "output_type": "execute_result" 733 | } 734 | ], 735 | "source": [ 736 | "# examine the shape\n", 737 | "sms.shape" 738 | ] 739 | }, 740 | { 741 | "cell_type": "code", 742 | "execution_count": 24, 743 | "metadata": { 744 | "collapsed": false 745 | }, 746 | "outputs": [ 747 | { 748 | "data": { 749 | "text/html": [ 750 | "
\n", 751 | "\n", 752 | " \n", 753 | " \n", 754 | " \n", 755 | " \n", 756 | " \n", 757 | " \n", 758 | " \n", 759 | " \n", 760 | " \n", 761 | " \n", 762 | " \n", 763 | " \n", 764 | " \n", 765 | " \n", 766 | " \n", 767 | " \n", 768 | " \n", 769 | " \n", 770 | " \n", 771 | " \n", 772 | " \n", 773 | " \n", 774 | " \n", 775 | " \n", 776 | " \n", 777 | " \n", 778 | " \n", 779 | " \n", 780 | " \n", 781 | " \n", 782 | " \n", 783 | " \n", 784 | " \n", 785 | " \n", 786 | " \n", 787 | " \n", 788 | " \n", 789 | " \n", 790 | " \n", 791 | " \n", 792 | " \n", 793 | " \n", 794 | " \n", 795 | " \n", 796 | " \n", 797 | " \n", 798 | " \n", 799 | " \n", 800 | " \n", 801 | " \n", 802 | " \n", 803 | " \n", 804 | " \n", 805 | " \n", 806 | " \n", 807 | " \n", 808 | " \n", 809 | " \n", 810 | " \n", 811 | "
   label  message
0  ham    Go until jurong point, crazy.. Available only ...
1  ham    Ok lar... Joking wif u oni...
2  spam   Free entry in 2 a wkly comp to win FA Cup fina...
3  ham    U dun say so early hor... U c already then say...
4  ham    Nah I don't think he goes to usf, he lives aro...
5  spam   FreeMsg Hey there darling it's been 3 week's n...
6  ham    Even my brother is not like to speak with me. ...
7  ham    As per your request 'Melle Melle (Oru Minnamin...
8  spam   WINNER!! As a valued network customer you have...
9  spam   Had your mobile 11 months or more? U R entitle...
\n", 812 | "
" 813 | ], 814 | "text/plain": [ 815 | " label message\n", 816 | "0 ham Go until jurong point, crazy.. Available only ...\n", 817 | "1 ham Ok lar... Joking wif u oni...\n", 818 | "2 spam Free entry in 2 a wkly comp to win FA Cup fina...\n", 819 | "3 ham U dun say so early hor... U c already then say...\n", 820 | "4 ham Nah I don't think he goes to usf, he lives aro...\n", 821 | "5 spam FreeMsg Hey there darling it's been 3 week's n...\n", 822 | "6 ham Even my brother is not like to speak with me. ...\n", 823 | "7 ham As per your request 'Melle Melle (Oru Minnamin...\n", 824 | "8 spam WINNER!! As a valued network customer you have...\n", 825 | "9 spam Had your mobile 11 months or more? U R entitle..." 826 | ] 827 | }, 828 | "execution_count": 24, 829 | "metadata": {}, 830 | "output_type": "execute_result" 831 | } 832 | ], 833 | "source": [ 834 | "# examine the first 10 rows\n", 835 | "sms.head(10)" 836 | ] 837 | }, 838 | { 839 | "cell_type": "code", 840 | "execution_count": 25, 841 | "metadata": { 842 | "collapsed": false 843 | }, 844 | "outputs": [ 845 | { 846 | "data": { 847 | "text/plain": [ 848 | "ham 4825\n", 849 | "spam 747\n", 850 | "Name: label, dtype: int64" 851 | ] 852 | }, 853 | "execution_count": 25, 854 | "metadata": {}, 855 | "output_type": "execute_result" 856 | } 857 | ], 858 | "source": [ 859 | "# examine the class distribution\n", 860 | "sms.label.value_counts()" 861 | ] 862 | }, 863 | { 864 | "cell_type": "code", 865 | "execution_count": 26, 866 | "metadata": { 867 | "collapsed": true 868 | }, 869 | "outputs": [], 870 | "source": [ 871 | "# convert label to a numerical variable\n", 872 | "sms['label_num'] = sms.label.map({'ham':0, 'spam':1})" 873 | ] 874 | }, 875 | { 876 | "cell_type": "code", 877 | "execution_count": 27, 878 | "metadata": { 879 | "collapsed": false 880 | }, 881 | "outputs": [ 882 | { 883 | "data": { 884 | "text/html": [ 885 | "
\n", 886 | "\n", 887 | " \n", 888 | " \n", 889 | " \n", 890 | " \n", 891 | " \n", 892 | " \n", 893 | " \n", 894 | " \n", 895 | " \n", 896 | " \n", 897 | " \n", 898 | " \n", 899 | " \n", 900 | " \n", 901 | " \n", 902 | " \n", 903 | " \n", 904 | " \n", 905 | " \n", 906 | " \n", 907 | " \n", 908 | " \n", 909 | " \n", 910 | " \n", 911 | " \n", 912 | " \n", 913 | " \n", 914 | " \n", 915 | " \n", 916 | " \n", 917 | " \n", 918 | " \n", 919 | " \n", 920 | " \n", 921 | " \n", 922 | " \n", 923 | " \n", 924 | " \n", 925 | " \n", 926 | " \n", 927 | " \n", 928 | " \n", 929 | " \n", 930 | " \n", 931 | " \n", 932 | " \n", 933 | " \n", 934 | " \n", 935 | " \n", 936 | " \n", 937 | " \n", 938 | " \n", 939 | " \n", 940 | " \n", 941 | " \n", 942 | " \n", 943 | " \n", 944 | " \n", 945 | " \n", 946 | " \n", 947 | " \n", 948 | " \n", 949 | " \n", 950 | " \n", 951 | " \n", 952 | " \n", 953 | " \n", 954 | " \n", 955 | " \n", 956 | " \n", 957 | "
   label  message                                             label_num
0  ham    Go until jurong point, crazy.. Available only ...   0
1  ham    Ok lar... Joking wif u oni...                       0
2  spam   Free entry in 2 a wkly comp to win FA Cup fina...   1
3  ham    U dun say so early hor... U c already then say...   0
4  ham    Nah I don't think he goes to usf, he lives aro...   0
5  spam   FreeMsg Hey there darling it's been 3 week's n...   1
6  ham    Even my brother is not like to speak with me. ...   0
7  ham    As per your request 'Melle Melle (Oru Minnamin...   0
8  spam   WINNER!! As a valued network customer you have...   1
9  spam   Had your mobile 11 months or more? U R entitle...   1
\n", 958 | "
" 959 | ], 960 | "text/plain": [ 961 | " label message label_num\n", 962 | "0 ham Go until jurong point, crazy.. Available only ... 0\n", 963 | "1 ham Ok lar... Joking wif u oni... 0\n", 964 | "2 spam Free entry in 2 a wkly comp to win FA Cup fina... 1\n", 965 | "3 ham U dun say so early hor... U c already then say... 0\n", 966 | "4 ham Nah I don't think he goes to usf, he lives aro... 0\n", 967 | "5 spam FreeMsg Hey there darling it's been 3 week's n... 1\n", 968 | "6 ham Even my brother is not like to speak with me. ... 0\n", 969 | "7 ham As per your request 'Melle Melle (Oru Minnamin... 0\n", 970 | "8 spam WINNER!! As a valued network customer you have... 1\n", 971 | "9 spam Had your mobile 11 months or more? U R entitle... 1" 972 | ] 973 | }, 974 | "execution_count": 27, 975 | "metadata": {}, 976 | "output_type": "execute_result" 977 | } 978 | ], 979 | "source": [ 980 | "# check that the conversion worked\n", 981 | "sms.head(10)" 982 | ] 983 | }, 984 | { 985 | "cell_type": "code", 986 | "execution_count": 28, 987 | "metadata": { 988 | "collapsed": false 989 | }, 990 | "outputs": [ 991 | { 992 | "name": "stdout", 993 | "output_type": "stream", 994 | "text": [ 995 | "(150L, 4L)\n", 996 | "(150L,)\n" 997 | ] 998 | } 999 | ], 1000 | "source": [ 1001 | "# how to define X and y (from the iris data) for use with a MODEL\n", 1002 | "X = iris.data\n", 1003 | "y = iris.target\n", 1004 | "print(X.shape)\n", 1005 | "print(y.shape)" 1006 | ] 1007 | }, 1008 | { 1009 | "cell_type": "code", 1010 | "execution_count": 29, 1011 | "metadata": { 1012 | "collapsed": false 1013 | }, 1014 | "outputs": [ 1015 | { 1016 | "name": "stdout", 1017 | "output_type": "stream", 1018 | "text": [ 1019 | "(5572L,)\n", 1020 | "(5572L,)\n" 1021 | ] 1022 | } 1023 | ], 1024 | "source": [ 1025 | "# how to define X and y (from the SMS data) for use with COUNTVECTORIZER\n", 1026 | "X = sms.message\n", 1027 | "y = sms.label_num\n", 1028 | "print(X.shape)\n", 1029 | "print(y.shape)" 1030 | ] 1031 | }, 1032 
| { 1033 | "cell_type": "code", 1034 | "execution_count": 30, 1035 | "metadata": { 1036 | "collapsed": false 1037 | }, 1038 | "outputs": [ 1039 | { 1040 | "name": "stdout", 1041 | "output_type": "stream", 1042 | "text": [ 1043 | "(4179L,)\n", 1044 | "(1393L,)\n", 1045 | "(4179L,)\n", 1046 | "(1393L,)\n" 1047 | ] 1048 | } 1049 | ], 1050 | "source": [ 1051 | "# split X and y into training and testing sets\n", 1052 | "from sklearn.cross_validation import train_test_split\n", 1053 | "X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)\n", 1054 | "print(X_train.shape)\n", 1055 | "print(X_test.shape)\n", 1056 | "print(y_train.shape)\n", 1057 | "print(y_test.shape)" 1058 | ] 1059 | }, 1060 | { 1061 | "cell_type": "markdown", 1062 | "metadata": {}, 1063 | "source": [ 1064 | "## Part 4: Vectorizing our dataset" 1065 | ] 1066 | }, 1067 | { 1068 | "cell_type": "code", 1069 | "execution_count": 31, 1070 | "metadata": { 1071 | "collapsed": true 1072 | }, 1073 | "outputs": [], 1074 | "source": [ 1075 | "# instantiate the vectorizer\n", 1076 | "vect = CountVectorizer()" 1077 | ] 1078 | }, 1079 | { 1080 | "cell_type": "code", 1081 | "execution_count": 32, 1082 | "metadata": { 1083 | "collapsed": true 1084 | }, 1085 | "outputs": [], 1086 | "source": [ 1087 | "# learn training data vocabulary, then use it to create a document-term matrix\n", 1088 | "vect.fit(X_train)\n", 1089 | "X_train_dtm = vect.transform(X_train)" 1090 | ] 1091 | }, 1092 | { 1093 | "cell_type": "code", 1094 | "execution_count": 33, 1095 | "metadata": { 1096 | "collapsed": true 1097 | }, 1098 | "outputs": [], 1099 | "source": [ 1100 | "# equivalently: combine fit and transform into a single step\n", 1101 | "X_train_dtm = vect.fit_transform(X_train)" 1102 | ] 1103 | }, 1104 | { 1105 | "cell_type": "code", 1106 | "execution_count": 34, 1107 | "metadata": { 1108 | "collapsed": false 1109 | }, 1110 | "outputs": [ 1111 | { 1112 | "data": { 1113 | "text/plain": [ 1114 | "<4179x7456 sparse matrix of 
type '<type 'numpy.int64'>'\n", 1115 | "\twith 55209 stored elements in Compressed Sparse Row format>" 1116 | ] 1117 | }, 1118 | "execution_count": 34, 1119 | "metadata": {}, 1120 | "output_type": "execute_result" 1121 | } 1122 | ], 1123 | "source": [ 1124 | "# examine the document-term matrix\n", 1125 | "X_train_dtm" 1126 | ] 1127 | }, 1128 | { 1129 | "cell_type": "code", 1130 | "execution_count": 35, 1131 | "metadata": { 1132 | "collapsed": false 1133 | }, 1134 | "outputs": [ 1135 | { 1136 | "data": { 1137 | "text/plain": [ 1138 | "<1393x7456 sparse matrix of type '<type 'numpy.int64'>'\n", 1139 | "\twith 17604 stored elements in Compressed Sparse Row format>" 1140 | ] 1141 | }, 1142 | "execution_count": 35, 1143 | "metadata": {}, 1144 | "output_type": "execute_result" 1145 | } 1146 | ], 1147 | "source": [ 1148 | "# transform testing data (using fitted vocabulary) into a document-term matrix\n", 1149 | "X_test_dtm = vect.transform(X_test)\n", 1150 | "X_test_dtm" 1151 | ] 1152 | }, 1153 | { 1154 | "cell_type": "markdown", 1155 | "metadata": {}, 1156 | "source": [ 1157 | "## Part 5: Building and evaluating a model\n", 1158 | "\n", 1159 | "We will use [multinomial Naive Bayes](http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html):\n", 1160 | "\n", 1161 | "> The multinomial Naive Bayes classifier is suitable for classification with **discrete features** (e.g., word counts for text classification). The multinomial distribution normally requires integer feature counts. However, in practice, fractional counts such as tf-idf may also work."
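For intuition about how word counts drive the classifier, here is a minimal plain-Python sketch of multinomial Naive Bayes with additive smoothing (`alpha=1.0`). It is not scikit-learn's implementation: the function names `fit_multinomial_nb` and `predict` are hypothetical, and the four-message toy corpus is invented purely for illustration:

```python
import math
from collections import Counter


def fit_multinomial_nb(docs, labels, alpha=1.0):
    """Estimate log priors and smoothed per-class token log likelihoods.

    docs is a list of token lists; labels is a parallel list of classes.
    """
    classes = sorted(set(labels))
    vocab = sorted({tok for doc in docs for tok in doc})
    token_counts = {c: Counter() for c in classes}
    for doc, label in zip(docs, labels):
        token_counts[label].update(doc)
    class_counts = Counter(labels)
    log_prior = {c: math.log(class_counts[c] / len(labels)) for c in classes}
    log_likelihood = {}
    for c in classes:
        # Smoothing adds alpha to every token count so that no
        # token ends up with a probability of zero
        total = sum(token_counts[c].values()) + alpha * len(vocab)
        log_likelihood[c] = {tok: math.log((token_counts[c][tok] + alpha) / total)
                             for tok in vocab}
    return classes, log_prior, log_likelihood


def predict(model, doc):
    """Return the class with the highest log posterior score."""
    classes, log_prior, log_likelihood = model
    scores = {}
    for c in classes:
        # Tokens that were never seen during training are ignored
        scores[c] = log_prior[c] + sum(log_likelihood[c][tok]
                                       for tok in doc if tok in log_likelihood[c])
    return max(classes, key=lambda c: scores[c])


# Invented toy corpus: 1 = spam, 0 = ham
train_docs = [['win', 'free', 'prize'], ['free', 'entry', 'win'],
              ['call', 'me', 'tonight'], ['see', 'you', 'tonight']]
train_labels = [1, 1, 0, 0]

model = fit_multinomial_nb(train_docs, train_labels)
print(predict(model, ['free', 'prize', 'tonight']))
print(predict(model, ['call', 'you', 'tonight']))
```

Each class score is the log prior plus the sum of the log likelihoods of the tokens in the message, so tokens that appear mostly in one class pull the prediction toward that class.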
1162 | ] 1163 | }, 1164 | { 1165 | "cell_type": "code", 1166 | "execution_count": 36, 1167 | "metadata": { 1168 | "collapsed": true 1169 | }, 1170 | "outputs": [], 1171 | "source": [ 1172 | "# import and instantiate a Multinomial Naive Bayes model\n", 1173 | "from sklearn.naive_bayes import MultinomialNB\n", 1174 | "nb = MultinomialNB()" 1175 | ] 1176 | }, 1177 | { 1178 | "cell_type": "code", 1179 | "execution_count": 37, 1180 | "metadata": { 1181 | "collapsed": false 1182 | }, 1183 | "outputs": [ 1184 | { 1185 | "name": "stdout", 1186 | "output_type": "stream", 1187 | "text": [ 1188 | "Wall time: 3 ms\n" 1189 | ] 1190 | }, 1191 | { 1192 | "data": { 1193 | "text/plain": [ 1194 | "MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)" 1195 | ] 1196 | }, 1197 | "execution_count": 37, 1198 | "metadata": {}, 1199 | "output_type": "execute_result" 1200 | } 1201 | ], 1202 | "source": [ 1203 | "# train the model using X_train_dtm (timing it with an IPython \"magic command\")\n", 1204 | "%time nb.fit(X_train_dtm, y_train)" 1205 | ] 1206 | }, 1207 | { 1208 | "cell_type": "code", 1209 | "execution_count": 38, 1210 | "metadata": { 1211 | "collapsed": true 1212 | }, 1213 | "outputs": [], 1214 | "source": [ 1215 | "# make class predictions for X_test_dtm\n", 1216 | "y_pred_class = nb.predict(X_test_dtm)" 1217 | ] 1218 | }, 1219 | { 1220 | "cell_type": "code", 1221 | "execution_count": 39, 1222 | "metadata": { 1223 | "collapsed": false 1224 | }, 1225 | "outputs": [ 1226 | { 1227 | "data": { 1228 | "text/plain": [ 1229 | "0.98851399856424982" 1230 | ] 1231 | }, 1232 | "execution_count": 39, 1233 | "metadata": {}, 1234 | "output_type": "execute_result" 1235 | } 1236 | ], 1237 | "source": [ 1238 | "# calculate accuracy of class predictions\n", 1239 | "from sklearn import metrics\n", 1240 | "metrics.accuracy_score(y_test, y_pred_class)" 1241 | ] 1242 | }, 1243 | { 1244 | "cell_type": "code", 1245 | "execution_count": 40, 1246 | "metadata": { 1247 | "collapsed": false 1248 | }, 
1249 | "outputs": [ 1250 | { 1251 | "data": { 1252 | "text/plain": [ 1253 | "array([[1203, 5],\n", 1254 | " [ 11, 174]])" 1255 | ] 1256 | }, 1257 | "execution_count": 40, 1258 | "metadata": {}, 1259 | "output_type": "execute_result" 1260 | } 1261 | ], 1262 | "source": [ 1263 | "# print the confusion matrix\n", 1264 | "metrics.confusion_matrix(y_test, y_pred_class)" 1265 | ] 1266 | }, 1267 | { 1268 | "cell_type": "code", 1269 | "execution_count": 41, 1270 | "metadata": { 1271 | "collapsed": false 1272 | }, 1273 | "outputs": [ 1274 | { 1275 | "data": { 1276 | "text/plain": [ 1277 | "574 Waiting for your call.\n", 1278 | "3375 Also andros ice etc etc\n", 1279 | "45 No calls..messages..missed calls\n", 1280 | "3415 No pic. Please re-send.\n", 1281 | "1988 No calls..messages..missed calls\n", 1282 | "Name: message, dtype: object" 1283 | ] 1284 | }, 1285 | "execution_count": 41, 1286 | "metadata": {}, 1287 | "output_type": "execute_result" 1288 | } 1289 | ], 1290 | "source": [ 1291 | "# print message text for the false positives (ham incorrectly classified as spam)\n", 1292 | "X_test[y_test < y_pred_class]" 1293 | ] 1294 | }, 1295 | { 1296 | "cell_type": "code", 1297 | "execution_count": 42, 1298 | "metadata": { 1299 | "collapsed": false, 1300 | "scrolled": true 1301 | }, 1302 | "outputs": [ 1303 | { 1304 | "data": { 1305 | "text/plain": [ 1306 | "3132 LookAtMe!: Thanks for your purchase of a video...\n", 1307 | "5 FreeMsg Hey there darling it's been 3 week's n...\n", 1308 | "3530 Xmas & New Years Eve tickets are now on sale f...\n", 1309 | "684 Hi I'm sue. I am 20 years old and work as a la...\n", 1310 | "1875 Would you like to see my XXX pics they are so ...\n", 1311 | "1893 CALL 09090900040 & LISTEN TO EXTREME DIRTY LIV...\n", 1312 | "4298 thesmszone.com lets you send free anonymous an...\n", 1313 | "4949 Hi this is Amy, we will be sending you a free ...\n", 1314 | "2821 INTERFLORA - “It's not too late to order Inter...\n", 1315 | "2247 Hi ya babe x u 4goten bout me?' 
scammers getti...\n", 1316 | "4514 Money i have won wining number 946 wot do i do...\n", 1317 | "Name: message, dtype: object" 1318 | ] 1319 | }, 1320 | "execution_count": 42, 1321 | "metadata": {}, 1322 | "output_type": "execute_result" 1323 | } 1324 | ], 1325 | "source": [ 1326 | "# print message text for the false negatives (spam incorrectly classified as ham)\n", 1327 | "X_test[y_test > y_pred_class]" 1328 | ] 1329 | }, 1330 | { 1331 | "cell_type": "code", 1332 | "execution_count": 43, 1333 | "metadata": { 1334 | "collapsed": false, 1335 | "scrolled": true 1336 | }, 1337 | "outputs": [ 1338 | { 1339 | "data": { 1340 | "text/plain": [ 1341 | "\"LookAtMe!: Thanks for your purchase of a video clip from LookAtMe!, you've been charged 35p. Think you can do better? Why not send a video in a MMSto 32323.\"" 1342 | ] 1343 | }, 1344 | "execution_count": 43, 1345 | "metadata": {}, 1346 | "output_type": "execute_result" 1347 | } 1348 | ], 1349 | "source": [ 1350 | "# example false negative\n", 1351 | "X_test[3132]" 1352 | ] 1353 | }, 1354 | { 1355 | "cell_type": "code", 1356 | "execution_count": 44, 1357 | "metadata": { 1358 | "collapsed": false 1359 | }, 1360 | "outputs": [ 1361 | { 1362 | "data": { 1363 | "text/plain": [ 1364 | "array([ 2.87744864e-03, 1.83488846e-05, 2.07301295e-03, ...,\n", 1365 | " 1.09026171e-06, 1.00000000e+00, 3.98279868e-09])" 1366 | ] 1367 | }, 1368 | "execution_count": 44, 1369 | "metadata": {}, 1370 | "output_type": "execute_result" 1371 | } 1372 | ], 1373 | "source": [ 1374 | "# calculate predicted probabilities for X_test_dtm (poorly calibrated)\n", 1375 | "y_pred_prob = nb.predict_proba(X_test_dtm)[:, 1]\n", 1376 | "y_pred_prob" 1377 | ] 1378 | }, 1379 | { 1380 | "cell_type": "code", 1381 | "execution_count": 45, 1382 | "metadata": { 1383 | "collapsed": false 1384 | }, 1385 | "outputs": [ 1386 | { 1387 | "data": { 1388 | "text/plain": [ 1389 | "0.98664310005369604" 1390 | ] 1391 | }, 1392 | "execution_count": 45, 1393 | "metadata": {}, 1394 
| "output_type": "execute_result" 1395 | } 1396 | ], 1397 | "source": [ 1398 | "# calculate AUC\n", 1399 | "metrics.roc_auc_score(y_test, y_pred_prob)" 1400 | ] 1401 | }, 1402 | { 1403 | "cell_type": "markdown", 1404 | "metadata": {}, 1405 | "source": [ 1406 | "## Part 6: Comparing models\n", 1407 | "\n", 1408 | "We will compare multinomial Naive Bayes with [logistic regression](http://scikit-learn.org/stable/modules/linear_model.html#logistic-regression):\n", 1409 | "\n", 1410 | "> Logistic regression, despite its name, is a **linear model for classification** rather than regression. Logistic regression is also known in the literature as logit regression, maximum-entropy classification (MaxEnt) or the log-linear classifier. In this model, the probabilities describing the possible outcomes of a single trial are modeled using a logistic function." 1411 | ] 1412 | }, 1413 | { 1414 | "cell_type": "code", 1415 | "execution_count": 46, 1416 | "metadata": { 1417 | "collapsed": true 1418 | }, 1419 | "outputs": [], 1420 | "source": [ 1421 | "# import and instantiate a logistic regression model\n", 1422 | "from sklearn.linear_model import LogisticRegression\n", 1423 | "logreg = LogisticRegression()" 1424 | ] 1425 | }, 1426 | { 1427 | "cell_type": "code", 1428 | "execution_count": 47, 1429 | "metadata": { 1430 | "collapsed": false 1431 | }, 1432 | "outputs": [ 1433 | { 1434 | "name": "stdout", 1435 | "output_type": "stream", 1436 | "text": [ 1437 | "Wall time: 39 ms\n" 1438 | ] 1439 | }, 1440 | { 1441 | "data": { 1442 | "text/plain": [ 1443 | "LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,\n", 1444 | " intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,\n", 1445 | " penalty='l2', random_state=None, solver='liblinear', tol=0.0001,\n", 1446 | " verbose=0, warm_start=False)" 1447 | ] 1448 | }, 1449 | "execution_count": 47, 1450 | "metadata": {}, 1451 | "output_type": "execute_result" 1452 | } 1453 | ], 1454 | "source": [ 1455 | "# train 
the model using X_train_dtm\n", 1456 | "%time logreg.fit(X_train_dtm, y_train)" 1457 | ] 1458 | }, 1459 | { 1460 | "cell_type": "code", 1461 | "execution_count": 48, 1462 | "metadata": { 1463 | "collapsed": true 1464 | }, 1465 | "outputs": [], 1466 | "source": [ 1467 | "# make class predictions for X_test_dtm\n", 1468 | "y_pred_class = logreg.predict(X_test_dtm)" 1469 | ] 1470 | }, 1471 | { 1472 | "cell_type": "code", 1473 | "execution_count": 49, 1474 | "metadata": { 1475 | "collapsed": false 1476 | }, 1477 | "outputs": [ 1478 | { 1479 | "data": { 1480 | "text/plain": [ 1481 | "array([ 0.01269556, 0.00347183, 0.00616517, ..., 0.03354907,\n", 1482 | " 0.99725053, 0.00157706])" 1483 | ] 1484 | }, 1485 | "execution_count": 49, 1486 | "metadata": {}, 1487 | "output_type": "execute_result" 1488 | } 1489 | ], 1490 | "source": [ 1491 | "# calculate predicted probabilities for X_test_dtm (well calibrated)\n", 1492 | "y_pred_prob = logreg.predict_proba(X_test_dtm)[:, 1]\n", 1493 | "y_pred_prob" 1494 | ] 1495 | }, 1496 | { 1497 | "cell_type": "code", 1498 | "execution_count": 50, 1499 | "metadata": { 1500 | "collapsed": false 1501 | }, 1502 | "outputs": [ 1503 | { 1504 | "data": { 1505 | "text/plain": [ 1506 | "0.9877961234745154" 1507 | ] 1508 | }, 1509 | "execution_count": 50, 1510 | "metadata": {}, 1511 | "output_type": "execute_result" 1512 | } 1513 | ], 1514 | "source": [ 1515 | "# calculate accuracy\n", 1516 | "metrics.accuracy_score(y_test, y_pred_class)" 1517 | ] 1518 | }, 1519 | { 1520 | "cell_type": "code", 1521 | "execution_count": 51, 1522 | "metadata": { 1523 | "collapsed": false 1524 | }, 1525 | "outputs": [ 1526 | { 1527 | "data": { 1528 | "text/plain": [ 1529 | "0.99368176123143015" 1530 | ] 1531 | }, 1532 | "execution_count": 51, 1533 | "metadata": {}, 1534 | "output_type": "execute_result" 1535 | } 1536 | ], 1537 | "source": [ 1538 | "# calculate AUC\n", 1539 | "metrics.roc_auc_score(y_test, y_pred_prob)" 1540 | ] 1541 | }, 1542 | { 1543 | "cell_type": 
"markdown", 1544 | "metadata": {}, 1545 | "source": [ 1546 | "## Part 7: Examining a model for further insight\n", 1547 | "\n", 1548 | "We will examine our **trained Naive Bayes model** to calculate the approximate **\"spamminess\" of each token**." 1549 | ] 1550 | }, 1551 | { 1552 | "cell_type": "code", 1553 | "execution_count": 52, 1554 | "metadata": { 1555 | "collapsed": false 1556 | }, 1557 | "outputs": [ 1558 | { 1559 | "data": { 1560 | "text/plain": [ 1561 | "7456" 1562 | ] 1563 | }, 1564 | "execution_count": 52, 1565 | "metadata": {}, 1566 | "output_type": "execute_result" 1567 | } 1568 | ], 1569 | "source": [ 1570 | "# store the vocabulary of X_train\n", 1571 | "X_train_tokens = vect.get_feature_names()\n", 1572 | "len(X_train_tokens)" 1573 | ] 1574 | }, 1575 | { 1576 | "cell_type": "code", 1577 | "execution_count": 53, 1578 | "metadata": { 1579 | "collapsed": false, 1580 | "scrolled": true 1581 | }, 1582 | "outputs": [ 1583 | { 1584 | "name": "stdout", 1585 | "output_type": "stream", 1586 | "text": [ 1587 | "[u'00', u'000', u'008704050406', u'0121', u'01223585236', u'01223585334', u'0125698789', u'02', u'0207', u'02072069400', u'02073162414', u'02085076972', u'021', u'03', u'04', u'0430', u'05', u'050703', u'0578', u'06', u'07', u'07008009200', u'07090201529', u'07090298926', u'07123456789', u'07732584351', u'07734396839', u'07742676969', u'0776xxxxxxx', u'07781482378', u'07786200117', u'078', u'07801543489', u'07808', u'07808247860', u'07808726822', u'07815296484', u'07821230901', u'07880867867', u'0789xxxxxxx', u'07946746291', u'0796xxxxxx', u'07973788240', u'07xxxxxxxxx', u'08', u'0800', u'08000407165', u'08000776320', u'08000839402', u'08000930705']\n" 1588 | ] 1589 | } 1590 | ], 1591 | "source": [ 1592 | "# examine the first 50 tokens\n", 1593 | "print(X_train_tokens[0:50])" 1594 | ] 1595 | }, 1596 | { 1597 | "cell_type": "code", 1598 | "execution_count": 54, 1599 | "metadata": { 1600 | "collapsed": false 1601 | }, 1602 | "outputs": [ 1603 | {
1604 | "name": "stdout", 1605 | "output_type": "stream", 1606 | "text": [ 1607 | "[u'yer', u'yes', u'yest', u'yesterday', u'yet', u'yetunde', u'yijue', u'ym', u'ymca', u'yo', u'yoga', u'yogasana', u'yor', u'yorge', u'you', u'youdoing', u'youi', u'youphone', u'your', u'youre', u'yourjob', u'yours', u'yourself', u'youwanna', u'yowifes', u'yoyyooo', u'yr', u'yrs', u'ything', u'yummmm', u'yummy', u'yun', u'yunny', u'yuo', u'yuou', u'yup', u'zac', u'zaher', u'zealand', u'zebra', u'zed', u'zeros', u'zhong', u'zindgi', u'zoe', u'zoom', u'zouk', u'zyada', u'\\xe8n', u'\\u3028ud']\n" 1608 | ] 1609 | } 1610 | ], 1611 | "source": [ 1612 | "# examine the last 50 tokens\n", 1613 | "print(X_train_tokens[-50:])" 1614 | ] 1615 | }, 1616 | { 1617 | "cell_type": "code", 1618 | "execution_count": 55, 1619 | "metadata": { 1620 | "collapsed": false 1621 | }, 1622 | "outputs": [ 1623 | { 1624 | "data": { 1625 | "text/plain": [ 1626 | "array([[ 0., 0., 0., ..., 1., 1., 1.],\n", 1627 | " [ 5., 23., 2., ..., 0., 0., 0.]])" 1628 | ] 1629 | }, 1630 | "execution_count": 55, 1631 | "metadata": {}, 1632 | "output_type": "execute_result" 1633 | } 1634 | ], 1635 | "source": [ 1636 | "# Naive Bayes counts the number of times each token appears in each class\n", 1637 | "nb.feature_count_" 1638 | ] 1639 | }, 1640 | { 1641 | "cell_type": "code", 1642 | "execution_count": 56, 1643 | "metadata": { 1644 | "collapsed": false 1645 | }, 1646 | "outputs": [ 1647 | { 1648 | "data": { 1649 | "text/plain": [ 1650 | "(2L, 7456L)" 1651 | ] 1652 | }, 1653 | "execution_count": 56, 1654 | "metadata": {}, 1655 | "output_type": "execute_result" 1656 | } 1657 | ], 1658 | "source": [ 1659 | "# rows represent classes, columns represent tokens\n", 1660 | "nb.feature_count_.shape" 1661 | ] 1662 | }, 1663 | { 1664 | "cell_type": "code", 1665 | "execution_count": 57, 1666 | "metadata": { 1667 | "collapsed": false 1668 | }, 1669 | "outputs": [ 1670 | { 1671 | "data": { 1672 | "text/plain": [ 1673 | "array([ 0., 0., 0., ..., 
1., 1., 1.])" 1674 | ] 1675 | }, 1676 | "execution_count": 57, 1677 | "metadata": {}, 1678 | "output_type": "execute_result" 1679 | } 1680 | ], 1681 | "source": [ 1682 | "# number of times each token appears across all HAM messages\n", 1683 | "ham_token_count = nb.feature_count_[0, :]\n", 1684 | "ham_token_count" 1685 | ] 1686 | }, 1687 | { 1688 | "cell_type": "code", 1689 | "execution_count": 58, 1690 | "metadata": { 1691 | "collapsed": false 1692 | }, 1693 | "outputs": [ 1694 | { 1695 | "data": { 1696 | "text/plain": [ 1697 | "array([ 5., 23., 2., ..., 0., 0., 0.])" 1698 | ] 1699 | }, 1700 | "execution_count": 58, 1701 | "metadata": {}, 1702 | "output_type": "execute_result" 1703 | } 1704 | ], 1705 | "source": [ 1706 | "# number of times each token appears across all SPAM messages\n", 1707 | "spam_token_count = nb.feature_count_[1, :]\n", 1708 | "spam_token_count" 1709 | ] 1710 | }, 1711 | { 1712 | "cell_type": "code", 1713 | "execution_count": 59, 1714 | "metadata": { 1715 | "collapsed": false 1716 | }, 1717 | "outputs": [ 1718 | { 1719 | "data": { 1720 | "text/html": [ 1721 | "
\n", 1722 | "\n", 1723 | " \n", 1724 | " \n", 1725 | " \n", 1726 | " \n", 1727 | " \n", 1728 | " \n", 1729 | " \n", 1730 | " \n", 1731 | " \n", 1732 | " \n", 1733 | " \n", 1734 | " \n", 1735 | " \n", 1736 | " \n", 1737 | " \n", 1738 | " \n", 1739 | " \n", 1740 | " \n", 1741 | " \n", 1742 | " \n", 1743 | " \n", 1744 | " \n", 1745 | " \n", 1746 | " \n", 1747 | " \n", 1748 | " \n", 1749 | " \n", 1750 | " \n", 1751 | " \n", 1752 | " \n", 1753 | " \n", 1754 | " \n", 1755 | " \n", 1756 | " \n", 1757 | " \n", 1758 | " \n", 1759 | " \n", 1760 | " \n", 1761 | " \n", 1762 | "
              ham  spam
token
00              0     5
000             0    23
008704050406    0     2
0121            0     1
01223585236     0     1
\n", 1763 | "
" 1764 | ], 1765 | "text/plain": [ 1766 | " ham spam\n", 1767 | "token \n", 1768 | "00 0 5\n", 1769 | "000 0 23\n", 1770 | "008704050406 0 2\n", 1771 | "0121 0 1\n", 1772 | "01223585236 0 1" 1773 | ] 1774 | }, 1775 | "execution_count": 59, 1776 | "metadata": {}, 1777 | "output_type": "execute_result" 1778 | } 1779 | ], 1780 | "source": [ 1781 | "# create a DataFrame of tokens with their separate ham and spam counts\n", 1782 | "tokens = pd.DataFrame({'token':X_train_tokens, 'ham':ham_token_count, 'spam':spam_token_count}).set_index('token')\n", 1783 | "tokens.head()" 1784 | ] 1785 | }, 1786 | { 1787 | "cell_type": "code", 1788 | "execution_count": 60, 1789 | "metadata": { 1790 | "collapsed": false 1791 | }, 1792 | "outputs": [ 1793 | { 1794 | "data": { 1795 | "text/html": [ 1796 | "
\n", 1797 | "\n", 1798 | " \n", 1799 | " \n", 1800 | " \n", 1801 | " \n", 1802 | " \n", 1803 | " \n", 1804 | " \n", 1805 | " \n", 1806 | " \n", 1807 | " \n", 1808 | " \n", 1809 | " \n", 1810 | " \n", 1811 | " \n", 1812 | " \n", 1813 | " \n", 1814 | " \n", 1815 | " \n", 1816 | " \n", 1817 | " \n", 1818 | " \n", 1819 | " \n", 1820 | " \n", 1821 | " \n", 1822 | " \n", 1823 | " \n", 1824 | " \n", 1825 | " \n", 1826 | " \n", 1827 | " \n", 1828 | " \n", 1829 | " \n", 1830 | " \n", 1831 | " \n", 1832 | " \n", 1833 | " \n", 1834 | " \n", 1835 | " \n", 1836 | " \n", 1837 | "
              ham  spam
token
very           64     2
nasty           1     1
villa           0     1
beloved         1     0
textoperator    0     2
\n", 1838 | "
" 1839 | ], 1840 | "text/plain": [ 1841 | " ham spam\n", 1842 | "token \n", 1843 | "very 64 2\n", 1844 | "nasty 1 1\n", 1845 | "villa 0 1\n", 1846 | "beloved 1 0\n", 1847 | "textoperator 0 2" 1848 | ] 1849 | }, 1850 | "execution_count": 60, 1851 | "metadata": {}, 1852 | "output_type": "execute_result" 1853 | } 1854 | ], 1855 | "source": [ 1856 | "# examine 5 random DataFrame rows\n", 1857 | "tokens.sample(5, random_state=6)" 1858 | ] 1859 | }, 1860 | { 1861 | "cell_type": "code", 1862 | "execution_count": 61, 1863 | "metadata": { 1864 | "collapsed": false 1865 | }, 1866 | "outputs": [ 1867 | { 1868 | "data": { 1869 | "text/plain": [ 1870 | "array([ 3617., 562.])" 1871 | ] 1872 | }, 1873 | "execution_count": 61, 1874 | "metadata": {}, 1875 | "output_type": "execute_result" 1876 | } 1877 | ], 1878 | "source": [ 1879 | "# Naive Bayes counts the number of observations in each class\n", 1880 | "nb.class_count_" 1881 | ] 1882 | }, 1883 | { 1884 | "cell_type": "markdown", 1885 | "metadata": {}, 1886 | "source": [ 1887 | "Before we can calculate the \"spamminess\" of each token, we need to avoid **dividing by zero** and account for the **class imbalance**." 1888 | ] 1889 | }, 1890 | { 1891 | "cell_type": "code", 1892 | "execution_count": 62, 1893 | "metadata": { 1894 | "collapsed": false 1895 | }, 1896 | "outputs": [ 1897 | { 1898 | "data": { 1899 | "text/html": [ 1900 | "
\n", 1901 | "\n", 1902 | " \n", 1903 | " \n", 1904 | " \n", 1905 | " \n", 1906 | " \n", 1907 | " \n", 1908 | " \n", 1909 | " \n", 1910 | " \n", 1911 | " \n", 1912 | " \n", 1913 | " \n", 1914 | " \n", 1915 | " \n", 1916 | " \n", 1917 | " \n", 1918 | " \n", 1919 | " \n", 1920 | " \n", 1921 | " \n", 1922 | " \n", 1923 | " \n", 1924 | " \n", 1925 | " \n", 1926 | " \n", 1927 | " \n", 1928 | " \n", 1929 | " \n", 1930 | " \n", 1931 | " \n", 1932 | " \n", 1933 | " \n", 1934 | " \n", 1935 | " \n", 1936 | " \n", 1937 | " \n", 1938 | " \n", 1939 | " \n", 1940 | " \n", 1941 | "
              ham  spam
token
very           65     3
nasty           2     2
villa           1     2
beloved         2     1
textoperator    1     3
\n", 1942 | "
" 1943 | ], 1944 | "text/plain": [ 1945 | " ham spam\n", 1946 | "token \n", 1947 | "very 65 3\n", 1948 | "nasty 2 2\n", 1949 | "villa 1 2\n", 1950 | "beloved 2 1\n", 1951 | "textoperator 1 3" 1952 | ] 1953 | }, 1954 | "execution_count": 62, 1955 | "metadata": {}, 1956 | "output_type": "execute_result" 1957 | } 1958 | ], 1959 | "source": [ 1960 | "# add 1 to ham and spam counts to avoid dividing by 0\n", 1961 | "tokens['ham'] = tokens.ham + 1\n", 1962 | "tokens['spam'] = tokens.spam + 1\n", 1963 | "tokens.sample(5, random_state=6)" 1964 | ] 1965 | }, 1966 | { 1967 | "cell_type": "code", 1968 | "execution_count": 63, 1969 | "metadata": { 1970 | "collapsed": false 1971 | }, 1972 | "outputs": [ 1973 | { 1974 | "data": { 1975 | "text/html": [ 1976 | "
\n", 1977 | "\n", 1978 | " \n", 1979 | " \n", 1980 | " \n", 1981 | " \n", 1982 | " \n", 1983 | " \n", 1984 | " \n", 1985 | " \n", 1986 | " \n", 1987 | " \n", 1988 | " \n", 1989 | " \n", 1990 | " \n", 1991 | " \n", 1992 | " \n", 1993 | " \n", 1994 | " \n", 1995 | " \n", 1996 | " \n", 1997 | " \n", 1998 | " \n", 1999 | " \n", 2000 | " \n", 2001 | " \n", 2002 | " \n", 2003 | " \n", 2004 | " \n", 2005 | " \n", 2006 | " \n", 2007 | " \n", 2008 | " \n", 2009 | " \n", 2010 | " \n", 2011 | " \n", 2012 | " \n", 2013 | " \n", 2014 | " \n", 2015 | " \n", 2016 | " \n", 2017 | "
                   ham      spam
token
very          0.017971  0.005338
nasty         0.000553  0.003559
villa         0.000276  0.003559
beloved       0.000553  0.001779
textoperator  0.000276  0.005338
\n", 2018 | "
" 2019 | ], 2020 | "text/plain": [ 2021 | " ham spam\n", 2022 | "token \n", 2023 | "very 0.017971 0.005338\n", 2024 | "nasty 0.000553 0.003559\n", 2025 | "villa 0.000276 0.003559\n", 2026 | "beloved 0.000553 0.001779\n", 2027 | "textoperator 0.000276 0.005338" 2028 | ] 2029 | }, 2030 | "execution_count": 63, 2031 | "metadata": {}, 2032 | "output_type": "execute_result" 2033 | } 2034 | ], 2035 | "source": [ 2036 | "# convert the ham and spam counts into frequencies\n", 2037 | "tokens['ham'] = tokens.ham / nb.class_count_[0]\n", 2038 | "tokens['spam'] = tokens.spam / nb.class_count_[1]\n", 2039 | "tokens.sample(5, random_state=6)" 2040 | ] 2041 | }, 2042 | { 2043 | "cell_type": "code", 2044 | "execution_count": 64, 2045 | "metadata": { 2046 | "collapsed": false 2047 | }, 2048 | "outputs": [ 2049 | { 2050 | "data": { 2051 | "text/html": [ 2052 | "
\n", 2053 | "\n", 2054 | " \n", 2055 | " \n", 2056 | " \n", 2057 | " \n", 2058 | " \n", 2059 | " \n", 2060 | " \n", 2061 | " \n", 2062 | " \n", 2063 | " \n", 2064 | " \n", 2065 | " \n", 2066 | " \n", 2067 | " \n", 2068 | " \n", 2069 | " \n", 2070 | " \n", 2071 | " \n", 2072 | " \n", 2073 | " \n", 2074 | " \n", 2075 | " \n", 2076 | " \n", 2077 | " \n", 2078 | " \n", 2079 | " \n", 2080 | " \n", 2081 | " \n", 2082 | " \n", 2083 | " \n", 2084 | " \n", 2085 | " \n", 2086 | " \n", 2087 | " \n", 2088 | " \n", 2089 | " \n", 2090 | " \n", 2091 | " \n", 2092 | " \n", 2093 | " \n", 2094 | " \n", 2095 | " \n", 2096 | " \n", 2097 | " \n", 2098 | " \n", 2099 | " \n", 2100 | "
                   ham      spam  spam_ratio
token
very          0.017971  0.005338    0.297044
nasty         0.000553  0.003559    6.435943
villa         0.000276  0.003559   12.871886
beloved       0.000553  0.001779    3.217972
textoperator  0.000276  0.005338   19.307829
\n", 2101 | "
" 2102 | ], 2103 | "text/plain": [ 2104 | " ham spam spam_ratio\n", 2105 | "token \n", 2106 | "very 0.017971 0.005338 0.297044\n", 2107 | "nasty 0.000553 0.003559 6.435943\n", 2108 | "villa 0.000276 0.003559 12.871886\n", 2109 | "beloved 0.000553 0.001779 3.217972\n", 2110 | "textoperator 0.000276 0.005338 19.307829" 2111 | ] 2112 | }, 2113 | "execution_count": 64, 2114 | "metadata": {}, 2115 | "output_type": "execute_result" 2116 | } 2117 | ], 2118 | "source": [ 2119 | "# calculate the ratio of spam-to-ham for each token\n", 2120 | "tokens['spam_ratio'] = tokens.spam / tokens.ham\n", 2121 | "tokens.sample(5, random_state=6)" 2122 | ] 2123 | }, 2124 | { 2125 | "cell_type": "code", 2126 | "execution_count": 65, 2127 | "metadata": { 2128 | "collapsed": false 2129 | }, 2130 | "outputs": [ 2131 | { 2132 | "data": { 2133 | "text/html": [ 2134 | "
\n", 2135 | "\n", 2136 | " \n", 2137 | " \n", 2138 | " \n", 2139 | " \n", 2140 | " \n", 2141 | " \n", 2142 | " \n", 2143 | " \n", 2144 | " \n", 2145 | " \n", 2146 | " \n", 2147 | " \n", 2148 | " \n", 2149 | " \n", 2150 | " \n", 2151 | " \n", 2152 | " \n", 2153 | " \n", 2154 | " \n", 2155 | " \n", 2156 | " \n", 2157 | " \n", 2158 | " \n", 2159 | " \n", 2160 | " \n", 2161 | " \n", 2162 | " \n", 2163 | " \n", 2164 | " \n", 2165 | " \n", 2166 | " \n", 2167 | " \n", 2168 | " \n", 2169 | " \n", 2170 | " \n", 2171 | " \n", 2172 | " \n", 2173 | " \n", 2174 | " \n", 2175 | " \n", 2176 | " \n", 2177 | " \n", 2178 | " \n", 2179 | " \n", 2180 | " \n", 2181 | " \n", 2182 | " \n", 2183 | " \n", 2184 | " \n", 2185 | " \n", 2186 | " \n", 2187 | " \n", 2188 | " \n", 2189 | " \n", 2190 | " \n", 2191 | " \n", 2192 | " \n", 2193 | " \n", 2194 | " \n", 2195 | " \n", 2196 | " \n", 2197 | " \n", 2198 | " \n", 2199 | " \n", 2200 | " \n", 2201 | " \n", 2202 | " \n", 2203 | " \n", 2204 | " \n", 2205 | " \n", 2206 | " \n", 2207 | " \n", 2208 | " \n", 2209 | " \n", 2210 | " \n", 2211 | " \n", 2212 | " \n", 2213 | " \n", 2214 | " \n", 2215 | " \n", 2216 | " \n", 2217 | " \n", 2218 | " \n", 2219 | " \n", 2220 | " \n", 2221 | " \n", 2222 | " \n", 2223 | " \n", 2224 | " \n", 2225 | " \n", 2226 | " \n", 2227 | " \n", 2228 | " \n", 2229 | " \n", 2230 | " \n", 2231 | " \n", 2232 | " \n", 2233 | " \n", 2234 | " \n", 2235 | " \n", 2236 | " \n", 2237 | " \n", 2238 | " \n", 2239 | " \n", 2240 | " \n", 2241 | " \n", 2242 | " \n", 2243 | " \n", 2244 | " \n", 2245 | " \n", 2246 | " \n", 2247 | " \n", 2248 | " \n", 2249 | " \n", 2250 | " \n", 2251 | " \n", 2252 | " \n", 2253 | " \n", 2254 | " \n", 2255 | " \n", 2256 | " \n", 2257 | " \n", 2258 | " \n", 2259 | " \n", 2260 | " \n", 2261 | " \n", 2262 | " \n", 2263 | " \n", 2264 | " \n", 2265 | " \n", 2266 | " \n", 2267 | " \n", 2268 | " \n", 2269 | " \n", 2270 | " \n", 2271 | " \n", 2272 | " \n", 2273 | " \n", 2274 | " \n", 2275 | " \n", 2276 | " \n", 2277 | 
" \n", 2278 | " \n", 2279 | " \n", 2280 | " \n", 2281 | " \n", 2282 | " \n", 2283 | " \n", 2284 | " \n", 2285 | " \n", 2286 | " \n", 2287 | " \n", 2288 | " \n", 2289 | " \n", 2290 | " \n", 2291 | " \n", 2292 | " \n", 2293 | " \n", 2294 | " \n", 2295 | " \n", 2296 | " \n", 2297 | " \n", 2298 | " \n", 2299 | " \n", 2300 | " \n", 2301 | " \n", 2302 | " \n", 2303 | " \n", 2304 | " \n", 2305 | " \n", 2306 | " \n", 2307 | " \n", 2308 | " \n", 2309 | " \n", 2310 | " \n", 2311 | " \n", 2312 | " \n", 2313 | " \n", 2314 | " \n", 2315 | " \n", 2316 | " \n", 2317 | " \n", 2318 | " \n", 2319 | " \n", 2320 | " \n", 2321 | " \n", 2322 | " \n", 2323 | " \n", 2324 | " \n", 2325 | " \n", 2326 | " \n", 2327 | " \n", 2328 | " \n", 2329 | " \n", 2330 | " \n", 2331 | " \n", 2332 | " \n", 2333 | " \n", 2334 | " \n", 2335 | " \n", 2336 | " \n", 2337 | " \n", 2338 | " \n", 2339 | " \n", 2340 | " \n", 2341 | " \n", 2342 | " \n", 2343 | " \n", 2344 | " \n", 2345 | " \n", 2346 | " \n", 2347 | " \n", 2348 | " \n", 2349 | " \n", 2350 | " \n", 2351 | " \n", 2352 | " \n", 2353 | " \n", 2354 | " \n", 2355 | " \n", 2356 | " \n", 2357 | " \n", 2358 | " \n", 2359 | " \n", 2360 | " \n", 2361 | " \n", 2362 | " \n", 2363 | " \n", 2364 | " \n", 2365 | " \n", 2366 | " \n", 2367 | " \n", 2368 | " \n", 2369 | " \n", 2370 | " \n", 2371 | " \n", 2372 | " \n", 2373 | " \n", 2374 | " \n", 2375 | " \n", 2376 | " \n", 2377 | " \n", 2378 | " \n", 2379 | " \n", 2380 | " \n", 2381 | " \n", 2382 | " \n", 2383 | " \n", 2384 | " \n", 2385 | " \n", 2386 | " \n", 2387 | " \n", 2388 | " \n", 2389 | " \n", 2390 | " \n", 2391 | " \n", 2392 | " \n", 2393 | " \n", 2394 | " \n", 2395 | " \n", 2396 | " \n", 2397 | " \n", 2398 | " \n", 2399 | " \n", 2400 | " \n", 2401 | " \n", 2402 | " \n", 2403 | " \n", 2404 | " \n", 2405 | " \n", 2406 | " \n", 2407 | " \n", 2408 | " \n", 2409 | " \n", 2410 | " \n", 2411 | " \n", 2412 | " \n", 2413 | " \n", 2414 | " \n", 2415 | " \n", 2416 | " \n", 2417 | " \n", 2418 | " \n", 2419 | " \n", 2420 
| " \n", 2421 | " \n", 2422 | " \n", 2423 | " \n", 2424 | " \n", 2425 | " \n", 2426 | " \n", 2427 | " \n", 2428 | " \n", 2429 | " \n", 2430 | " \n", 2431 | " \n", 2432 | " \n", 2433 | " \n", 2434 | " \n", 2435 | " \n", 2436 | " \n", 2437 | " \n", 2438 | " \n", 2439 | " \n", 2440 | " \n", 2441 | " \n", 2442 | " \n", 2443 | " \n", 2444 | " \n", 2445 | " \n", 2446 | " \n", 2447 | " \n", 2448 | " \n", 2449 | " \n", 2450 | " \n", 2451 | " \n", 2452 | " \n", 2453 | " \n", 2454 | " \n", 2455 | " \n", 2456 | " \n", 2457 | " \n", 2458 | " \n", 2459 | " \n", 2460 | " \n", 2461 | " \n", 2462 | " \n", 2463 | " \n", 2464 | " \n", 2465 | " \n", 2466 | " \n", 2467 | " \n", 2468 | " \n", 2469 | " \n", 2470 | " \n", 2471 | " \n", 2472 | " \n", 2473 | " \n", 2474 | " \n", 2475 | " \n", 2476 | " \n", 2477 | " \n", 2478 | " \n", 2479 | " \n", 2480 | " \n", 2481 | " \n", 2482 | " \n", 2483 | " \n", 2484 | " \n", 2485 | " \n", 2486 | " \n", 2487 | " \n", 2488 | " \n", 2489 | " \n", 2490 | " \n", 2491 | " \n", 2492 | " \n", 2493 | " \n", 2494 | " \n", 2495 | " \n", 2496 | " \n", 2497 | " \n", 2498 | " \n", 2499 | " \n", 2500 | " \n", 2501 | " \n", 2502 | " \n", 2503 | " \n", 2504 | " \n", 2505 | " \n", 2506 | " \n", 2507 | " \n", 2508 | " \n", 2509 | " \n", 2510 | " \n", 2511 | " \n", 2512 | " \n", 2513 | " \n", 2514 | " \n", 2515 | " \n", 2516 | " \n", 2517 | " \n", 2518 | "
                 ham      spam  spam_ratio
token
claim       0.000276  0.158363  572.798932
prize       0.000276  0.135231  489.131673
150p        0.000276  0.087189  315.361210
tone        0.000276  0.085409  308.925267
guaranteed  0.000276  0.076512  276.745552
18          0.000276  0.069395  251.001779
cs          0.000276  0.065836  238.129893
www         0.000553  0.129893  234.911922
1000        0.000276  0.056940  205.950178
awarded     0.000276  0.053381  193.078292
150ppm      0.000276  0.051601  186.642349
uk          0.000553  0.099644  180.206406
500         0.000276  0.048043  173.770463
ringtone    0.000276  0.044484  160.898577
000         0.000276  0.042705  154.462633
mob         0.000276  0.042705  154.462633
co          0.000553  0.078292  141.590747
collection  0.000276  0.039146  141.590747
valid       0.000276  0.037367  135.154804
2000        0.000276  0.037367  135.154804
800         0.000276  0.037367  135.154804
10p         0.000276  0.037367  135.154804
8007        0.000276  0.035587  128.718861
16          0.000553  0.067616  122.282918
weekly      0.000276  0.033808  122.282918
tones       0.000276  0.032028  115.846975
land        0.000276  0.032028  115.846975
http        0.000276  0.032028  115.846975
national    0.000276  0.030249  109.411032
5000        0.000276  0.030249  109.411032
...              ...       ...         ...
went        0.012718  0.001779    0.139912
ll          0.052530  0.007117    0.135494
told        0.013824  0.001779    0.128719
feel        0.013824  0.001779    0.128719
gud         0.014100  0.001779    0.126195
cos         0.014929  0.001779    0.119184
but         0.090683  0.010676    0.117731
amp         0.015206  0.001779    0.117017
something   0.015206  0.001779    0.117017
sure        0.015206  0.001779    0.117017
ok          0.061100  0.007117    0.116488
said        0.016312  0.001779    0.109084
morning     0.016865  0.001779    0.105507
yeah        0.017694  0.001779    0.100562
lol         0.017694  0.001779    0.100562
anything    0.017971  0.001779    0.099015
my          0.150401  0.014235    0.094646
doing       0.019077  0.001779    0.093275
way         0.019630  0.001779    0.090647
ask         0.019630  0.001779    0.090647
already     0.019630  0.001779    0.090647
too         0.021841  0.001779    0.081468
come        0.048936  0.003559    0.072723
later       0.030688  0.001779    0.057981
lor         0.032900  0.001779    0.054084
da          0.032900  0.001779    0.054084
she         0.035665  0.001779    0.049891
he          0.047000  0.001779    0.037858
lt          0.064142  0.001779    0.027741
gt          0.064971  0.001779    0.027387
\n", 2519 | "

7456 rows × 3 columns

\n", 2520 | "
" 2521 | ], 2522 | "text/plain": [ 2523 | " ham spam spam_ratio\n", 2524 | "token \n", 2525 | "claim 0.000276 0.158363 572.798932\n", 2526 | "prize 0.000276 0.135231 489.131673\n", 2527 | "150p 0.000276 0.087189 315.361210\n", 2528 | "tone 0.000276 0.085409 308.925267\n", 2529 | "guaranteed 0.000276 0.076512 276.745552\n", 2530 | "18 0.000276 0.069395 251.001779\n", 2531 | "cs 0.000276 0.065836 238.129893\n", 2532 | "www 0.000553 0.129893 234.911922\n", 2533 | "1000 0.000276 0.056940 205.950178\n", 2534 | "awarded 0.000276 0.053381 193.078292\n", 2535 | "150ppm 0.000276 0.051601 186.642349\n", 2536 | "uk 0.000553 0.099644 180.206406\n", 2537 | "500 0.000276 0.048043 173.770463\n", 2538 | "ringtone 0.000276 0.044484 160.898577\n", 2539 | "000 0.000276 0.042705 154.462633\n", 2540 | "mob 0.000276 0.042705 154.462633\n", 2541 | "co 0.000553 0.078292 141.590747\n", 2542 | "collection 0.000276 0.039146 141.590747\n", 2543 | "valid 0.000276 0.037367 135.154804\n", 2544 | "2000 0.000276 0.037367 135.154804\n", 2545 | "800 0.000276 0.037367 135.154804\n", 2546 | "10p 0.000276 0.037367 135.154804\n", 2547 | "8007 0.000276 0.035587 128.718861\n", 2548 | "16 0.000553 0.067616 122.282918\n", 2549 | "weekly 0.000276 0.033808 122.282918\n", 2550 | "tones 0.000276 0.032028 115.846975\n", 2551 | "land 0.000276 0.032028 115.846975\n", 2552 | "http 0.000276 0.032028 115.846975\n", 2553 | "national 0.000276 0.030249 109.411032\n", 2554 | "5000 0.000276 0.030249 109.411032\n", 2555 | "... ... ... 
...\n", 2556 | "went 0.012718 0.001779 0.139912\n", 2557 | "ll 0.052530 0.007117 0.135494\n", 2558 | "told 0.013824 0.001779 0.128719\n", 2559 | "feel 0.013824 0.001779 0.128719\n", 2560 | "gud 0.014100 0.001779 0.126195\n", 2561 | "cos 0.014929 0.001779 0.119184\n", 2562 | "but 0.090683 0.010676 0.117731\n", 2563 | "amp 0.015206 0.001779 0.117017\n", 2564 | "something 0.015206 0.001779 0.117017\n", 2565 | "sure 0.015206 0.001779 0.117017\n", 2566 | "ok 0.061100 0.007117 0.116488\n", 2567 | "said 0.016312 0.001779 0.109084\n", 2568 | "morning 0.016865 0.001779 0.105507\n", 2569 | "yeah 0.017694 0.001779 0.100562\n", 2570 | "lol 0.017694 0.001779 0.100562\n", 2571 | "anything 0.017971 0.001779 0.099015\n", 2572 | "my 0.150401 0.014235 0.094646\n", 2573 | "doing 0.019077 0.001779 0.093275\n", 2574 | "way 0.019630 0.001779 0.090647\n", 2575 | "ask 0.019630 0.001779 0.090647\n", 2576 | "already 0.019630 0.001779 0.090647\n", 2577 | "too 0.021841 0.001779 0.081468\n", 2578 | "come 0.048936 0.003559 0.072723\n", 2579 | "later 0.030688 0.001779 0.057981\n", 2580 | "lor 0.032900 0.001779 0.054084\n", 2581 | "da 0.032900 0.001779 0.054084\n", 2582 | "she 0.035665 0.001779 0.049891\n", 2583 | "he 0.047000 0.001779 0.037858\n", 2584 | "lt 0.064142 0.001779 0.027741\n", 2585 | "gt 0.064971 0.001779 0.027387\n", 2586 | "\n", 2587 | "[7456 rows x 3 columns]" 2588 | ] 2589 | }, 2590 | "execution_count": 65, 2591 | "metadata": {}, 2592 | "output_type": "execute_result" 2593 | } 2594 | ], 2595 | "source": [ 2596 | "# examine the DataFrame sorted by spam_ratio\n", 2597 | "# note: use sort() instead of sort_values() for pandas 0.16.2 and earlier\n", 2598 | "tokens.sort_values('spam_ratio', ascending=False)" 2599 | ] 2600 | }, 2601 | { 2602 | "cell_type": "code", 2603 | "execution_count": 66, 2604 | "metadata": { 2605 | "collapsed": false 2606 | }, 2607 | "outputs": [ 2608 | { 2609 | "data": { 2610 | "text/plain": [ 2611 | "83.667259786476862" 2612 | ] 2613 | }, 2614 | 
"execution_count": 66, 2615 | "metadata": {}, 2616 | "output_type": "execute_result" 2617 | } 2618 | ], 2619 | "source": [ 2620 | "# look up the spam_ratio for a given token\n", 2621 | "tokens.loc['dating', 'spam_ratio']" 2622 | ] 2623 | }, 2624 | { 2625 | "cell_type": "markdown", 2626 | "metadata": {}, 2627 | "source": [ 2628 | "## Part 8: Practicing this workflow on another dataset\n", 2629 | "\n", 2630 | "Please open the **`exercise.ipynb`** notebook (or the **`exercise.py`** script)." 2631 | ] 2632 | }, 2633 | { 2634 | "cell_type": "markdown", 2635 | "metadata": {}, 2636 | "source": [ 2637 | "## Part 9: Tuning the vectorizer (discussion)\n", 2638 | "\n", 2639 | "Thus far, we have been using the default parameters of [CountVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html):" 2640 | ] 2641 | }, 2642 | { 2643 | "cell_type": "code", 2644 | "execution_count": 67, 2645 | "metadata": { 2646 | "collapsed": false 2647 | }, 2648 | "outputs": [ 2649 | { 2650 | "data": { 2651 | "text/plain": [ 2652 | "CountVectorizer(analyzer=u'word', binary=False, decode_error=u'strict',\n", 2653 | " dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',\n", 2654 | " lowercase=True, max_df=1.0, max_features=None, min_df=1,\n", 2655 | " ngram_range=(1, 1), preprocessor=None, stop_words=None,\n", 2656 | " strip_accents=None, token_pattern=u'(?u)\\\\b\\\\w\\\\w+\\\\b',\n", 2657 | " tokenizer=None, vocabulary=None)" 2658 | ] 2659 | }, 2660 | "execution_count": 67, 2661 | "metadata": {}, 2662 | "output_type": "execute_result" 2663 | } 2664 | ], 2665 | "source": [ 2666 | "# show default parameters for CountVectorizer\n", 2667 | "vect" 2668 | ] 2669 | }, 2670 | { 2671 | "cell_type": "markdown", 2672 | "metadata": {}, 2673 | "source": [ 2674 | "However, the vectorizer is worth tuning, just like a model is worth tuning!
Here are a few parameters that you might want to tune:\n", 2675 | "\n", 2676 | "- **stop_words:** string {'english'}, list, or None (default)\n", 2677 | " - If 'english', a built-in stop word list for English is used.\n", 2678 | " - If a list, that list is assumed to contain stop words, all of which will be removed from the resulting tokens.\n", 2679 | " - If None, no stop words will be used." 2680 | ] 2681 | }, 2682 | { 2683 | "cell_type": "code", 2684 | "execution_count": 68, 2685 | "metadata": { 2686 | "collapsed": true 2687 | }, 2688 | "outputs": [], 2689 | "source": [ 2690 | "# remove English stop words\n", 2691 | "vect = CountVectorizer(stop_words='english')" 2692 | ] 2693 | }, 2694 | { 2695 | "cell_type": "markdown", 2696 | "metadata": {}, 2697 | "source": [ 2698 | "- **ngram_range:** tuple (min_n, max_n), default=(1, 1)\n", 2699 | " - The lower and upper boundary of the range of n-values for different n-grams to be extracted.\n", 2700 | " - All values of n such that min_n <= n <= max_n will be used." 2701 | ] 2702 | }, 2703 | { 2704 | "cell_type": "code", 2705 | "execution_count": 69, 2706 | "metadata": { 2707 | "collapsed": true 2708 | }, 2709 | "outputs": [], 2710 | "source": [ 2711 | "# include 1-grams and 2-grams\n", 2712 | "vect = CountVectorizer(ngram_range=(1, 2))" 2713 | ] 2714 | }, 2715 | { 2716 | "cell_type": "markdown", 2717 | "metadata": {}, 2718 | "source": [ 2719 | "- **max_df:** float in range [0.0, 1.0] or int, default=1.0\n", 2720 | " - When building the vocabulary, ignore terms that have a document frequency strictly higher than the given threshold (corpus-specific stop words).\n", 2721 | " - If float, the parameter represents a proportion of documents.\n", 2722 | " - If integer, the parameter represents an absolute count." 
2723 | ] 2724 | }, 2725 | { 2726 | "cell_type": "code", 2727 | "execution_count": 70, 2728 | "metadata": { 2729 | "collapsed": true 2730 | }, 2731 | "outputs": [], 2732 | "source": [ 2733 | "# ignore terms that appear in more than 50% of the documents\n", 2734 | "vect = CountVectorizer(max_df=0.5)" 2735 | ] 2736 | }, 2737 | { 2738 | "cell_type": "markdown", 2739 | "metadata": {}, 2740 | "source": [ 2741 | "- **min_df:** float in range [0.0, 1.0] or int, default=1\n", 2742 | " - When building the vocabulary, ignore terms that have a document frequency strictly lower than the given threshold. (This value is also called \"cut-off\" in the literature.)\n", 2743 | " - If float, the parameter represents a proportion of documents.\n", 2744 | " - If integer, the parameter represents an absolute count." 2745 | ] 2746 | }, 2747 | { 2748 | "cell_type": "code", 2749 | "execution_count": 71, 2750 | "metadata": { 2751 | "collapsed": true 2752 | }, 2753 | "outputs": [], 2754 | "source": [ 2755 | "# only keep terms that appear in at least 2 documents\n", 2756 | "vect = CountVectorizer(min_df=2)" 2757 | ] 2758 | }, 2759 | { 2760 | "cell_type": "markdown", 2761 | "metadata": {}, 2762 | "source": [ 2763 | "**Guidelines for tuning CountVectorizer:**\n", 2764 | "\n", 2765 | "- Use your knowledge of the **problem** and the **text**, and your understanding of the **tuning parameters**, to help you decide what parameters to tune and how to tune them.\n", 2766 | "- **Experiment**, and let the data tell you the best approach!" 
2767 | ] 2768 | } 2769 | ], 2770 | "metadata": { 2771 | "kernelspec": { 2772 | "display_name": "Python 2", 2773 | "language": "python", 2774 | "name": "python2" 2775 | }, 2776 | "language_info": { 2777 | "codemirror_mode": { 2778 | "name": "ipython", 2779 | "version": 2 2780 | }, 2781 | "file_extension": ".py", 2782 | "mimetype": "text/x-python", 2783 | "name": "python", 2784 | "nbconvert_exporter": "python", 2785 | "pygments_lexer": "ipython2", 2786 | "version": "2.7.11" 2787 | } 2788 | }, 2789 | "nbformat": 4, 2790 | "nbformat_minor": 0 2791 | } 2792 | --------------------------------------------------------------------------------
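The "spamminess" arithmetic from Part 7 of the tutorial can be checked by hand. The following is a minimal sketch, not the notebook's own code: it assumes Python 3 and only the standard library, and the names `spam_ratio`, `token_counts`, `N_HAM` and `N_SPAM` are hypothetical helpers introduced for illustration. The counts are copied from the notebook output (the five sampled tokens and `nb.class_count_`).

```python
# Class sizes as reported by nb.class_count_ in the notebook: 3617 ham, 562 spam.
N_HAM, N_SPAM = 3617, 562

# Raw (ham, spam) counts for the five tokens sampled in the notebook.
token_counts = {
    "very": (64, 2),
    "nasty": (1, 1),
    "villa": (0, 1),
    "beloved": (1, 0),
    "textoperator": (0, 2),
}


def spam_ratio(ham_count, spam_count, n_ham=N_HAM, n_spam=N_SPAM):
    """Return the spam-to-ham frequency ratio for one token.

    Adding 1 to each count avoids dividing by zero, and dividing by the
    class size accounts for the class imbalance.
    """
    ham_freq = (ham_count + 1) / n_ham
    spam_freq = (spam_count + 1) / n_spam
    return spam_freq / ham_freq


ratios = {token: spam_ratio(ham, spam) for token, (ham, spam) in token_counts.items()}

# Print the tokens from most to least "spammy".
for token, ratio in sorted(ratios.items(), key=lambda item: item[1], reverse=True):
    print(f"{token:<14}{ratio:.6f}")
```

A ratio above 1 means the token is relatively more frequent in spam than in ham; a ratio below 1 means the opposite. Because of the add-one step, the ratio is only an approximation for rare tokens.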