├── A Smattering of NLP in Python.ipynb
├── LICENSE
├── README.md
└── images
├── Scikit-learn_logo.png
├── anaconda_logo_web.png
├── cat.gif
├── dcnlp.jpeg
├── i-was-told-there-would-be-no-math.jpg
├── no_time.jpg
├── python-powered-w-200x80.png
└── stanford-nlp.jpg
/A Smattering of NLP in Python.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "metadata": {
3 | "name": "A Smattering of NLP in Python",
4 | "signature": "sha256:5b38818827e50ee282fa44155be8b71ad71466229789ff67e945f3a6d2570004"
5 | },
6 | "nbformat": 3,
7 | "nbformat_minor": 0,
8 | "worksheets": [
9 | {
10 | "cells": [
11 | {
12 | "cell_type": "markdown",
13 | "metadata": {},
14 | "source": "# A Smattering of NLP in Python\n*by Charlie Greenbacker [@greenbacker](https://twitter.com/greenbacker)*\n\n[](https://www.python.org/)\n\n### Part of a [joint meetup on Natural Language Processing](http://www.meetup.com/stats-prog-dc/events/177772322/) - 9 July 2014\n- #### [Statistical Programming DC](http://www.meetup.com/stats-prog-dc/)\n- #### [Data Wranglers DC](http://www.meetup.com/Data-Wranglers-DC/)\n- #### [DC Natural Language Processing](http://dcnlp.org/)\n\n***\n\n## Introduction\nBack in the dark ages of data science, each group or individual working in Natural Language Processing (NLP) generally maintained an assortment of homebrew utility programs designed to handle many of the common tasks involved with NLP. Despite everyone's best intentions, most of this code was lousy, brittle, and poorly documented -- not a good foundation upon which to build your masterpiece. Fortunately, over the past decade, mainstream open source software libraries like the [Natural Language Toolkit for Python (NLTK)](http://www.nltk.org/) have emerged to offer a collection of high-quality reusable NLP functionality. These libraries allow researchers and developers to spend more time focusing on the application logic of the task at hand, and less on debugging an abandoned method for sentence segmentation or reimplementing noun phrase chunking.\n\nThis presentation will cover a handful of the NLP building blocks provided by NLTK (and a few additional libraries), including extracting text from HTML, stemming & lemmatization, frequency analysis, and named entity recognition. Several of these components will then be assembled to build a very basic document summarization program.\n\n[](http://oreilly.com/catalog/9780596516499/)\n\n### Initial Setup\nObviously, you'll need Python installed on your system to run the code examples used in this presentation. We enthusiatically recommend using [Anaconda](https://store.continuum.io/cshop/anaconda/), a Python distribution provided by [Continuum Analytics](http://www.continuum.io/). Anaconda is free to use, it includes nearly [200 of the most commonly used Python packages for data analysis](http://docs.continuum.io/anaconda/pkg-docs.html) (including NLTK), and it works on Mac, Linux, and yes, even Windows.\n\n[](https://store.continuum.io/cshop/anaconda/)\n\nWe'll make use of the following Python packages in the example code:\n\n- [nltk](http://www.nltk.org/install.html) (comes with Anaconda)\n- [readability-lxml](https://github.com/buriy/python-readability)\n- [BeautifulSoup4](http://www.crummy.com/software/BeautifulSoup/) (comes with Anaconda)\n- [scikit-learn](http://scikit-learn.org/stable/install.html) (comes with Anaconda)\n\nPlease note that the **readability** package is not distributed with Anaconda, so you'll need to download & install it separately using something like easy_install readability-lxml
or pip install readability-lxml
.\n\nIf you don't use Anaconda, you'll also need to download & install the other packages separately using similar methods. Refer to the homepage of each package for instructions.\n\nYou'll want to run nltk.download()
one time to get all of the NLTK packages, corpora, etc. (see below). Select the \"all\" option. Depending on your network speed, this could take a while, but you'll only need to do it once.\n\n#### Java libraries (optional)\nOne of the examples will use NLTK's interface to the [Stanford Named Entity Recognizer](http://www-nlp.stanford.edu/software/CRF-NER.shtml#Download), which is distributed as a Java library. In particular, you'll want the following files handy in order to run this particular example:\n\n- stanford-ner.jar\n- english.all.3class.distsim.crf.ser.gz\n\n[](http://www-nlp.stanford.edu/software/CRF-NER.shtml#Download)\n\n***\n\n## Getting Started\nThe first thing we'll need to do is import nltk
:"
15 | },
16 | {
17 | "cell_type": "code",
18 | "collapsed": false,
19 | "input": "import nltk",
20 | "language": "python",
21 | "metadata": {},
22 | "outputs": []
23 | },
24 | {
25 | "cell_type": "markdown",
26 | "metadata": {},
27 | "source": "#### Downloading NLTK resources\nThe first time you run anything using NLTK, you'll want to go ahead and download the additional resources that aren't distributed directly with the NLTK package. Upon running the nltk.download()
command below, the the NLTK Downloader window will pop-up. In the Collections tab, select \"all\" and click on Download. As mentioned earlier, this may take several minutes depending on your network connection speed, but you'll only ever need to run it a single time."
28 | },
29 | {
30 | "cell_type": "code",
31 | "collapsed": false,
32 | "input": "nltk.download()",
33 | "language": "python",
34 | "metadata": {},
35 | "outputs": []
36 | },
37 | {
38 | "cell_type": "markdown",
39 | "metadata": {},
40 | "source": "## Extracting text from HTML\nNow the fun begins. We'll start with a pretty basic and commonly-faced task: extracting text content from an HTML page. Python's urllib package gives us the tools we need to fetch a web page from a given URL, but we see that the output is full of HTML markup that we don't want to deal with.\n\n(N.B.: Throughout the examples in this presentation, we'll use Python *slicing* (e.g., [:500]
below) to only display a small portion of a string or list. Otherwise, if we displayed the entire item, sometimes it would take up the entire screen.)"
41 | },
42 | {
43 | "cell_type": "code",
44 | "collapsed": false,
45 | "input": "from urllib import urlopen\n\nurl = \"http://venturebeat.com/2014/07/04/facebooks-little-social-experiment-got-you-bummed-out-get-over-it/\"\nhtml = urlopen(url).read()\nhtml[:500]",
46 | "language": "python",
47 | "metadata": {},
48 | "outputs": []
49 | },
50 | {
51 | "cell_type": "markdown",
52 | "metadata": {},
53 | "source": "#### Stripping-out HTML formatting\nFortunately, NTLK provides a method called clean_html()
to get the raw text out of an HTML-formatted string. It's still not perfect, though, since the output will contain page navigation and all kinds of other junk that we don't want, especially if our goal is to focus on the body content from a news article, for example."
54 | },
55 | {
56 | "cell_type": "code",
57 | "collapsed": false,
58 | "input": "text = nltk.clean_html(html)\ntext[:500]",
59 | "language": "python",
60 | "metadata": {},
61 | "outputs": []
62 | },
63 | {
64 | "cell_type": "markdown",
65 | "metadata": {},
66 | "source": "#### Identifying the Main Content\nIf we just want the body content from the article, we'll need to use two additional packages. The first is a Python port of a Ruby port of a Javascript tool called Readability, which pulls the main body content out of an HTML document and subsequently \"cleans it up.\" The second package, BeautifulSoup, is a Python library for pulling data out of HTML and XML files. It parses HTML content into easily-navigable nested data structure. Using Readability and BeautifulSoup together, we can quickly get exactly the text we're looking for out of the HTML, (*mostly*) free of page navigation, comments, ads, etc. Now we're ready to start analyzing this text content."
67 | },
68 | {
69 | "cell_type": "code",
70 | "collapsed": false,
71 | "input": "from readability.readability import Document\nfrom bs4 import BeautifulSoup\n\nreadable_article = Document(html).summary()\nreadable_title = Document(html).title()\nsoup = BeautifulSoup(readable_article)\nprint '*** TITLE *** \\n\\\"' + readable_title + '\\\"\\n'\nprint '*** CONTENT *** \\n\\\"' + soup.text[:500] + '[...]\\\"'",
72 | "language": "python",
73 | "metadata": {},
74 | "outputs": []
75 | },
76 | {
77 | "cell_type": "markdown",
78 | "metadata": {},
79 | "source": "## Frequency Analysis\nHere's a little secret: much of NLP (and data science, for that matter) boils down to counting things. If you've got a bunch of data that needs *analyzin'* but you don't know where to start, counting things is usually a good place to begin. Sure, you'll need to figure out exactly what you want to count, how to count it, and what to do with the counts, but if you're lost and don't know what to do, **just start counting**.\n\nPerhaps we'd like to begin (as is often the case in NLP) by examining the words that appear in our document. To do that, we'll first need to tokenize the text string into discrete words. Since we're working with English, this isn't so bad, but if we were working with a non-whitespace-delimited language like Chinese, Japanese, or Korean, it would be much more difficult.\n\nIn the code snippet below, we're using two of NLTK's tokenize methods to first chop up the article text into sentences, and then each sentence into individual words. (Technically, we didn't need to use sent_tokenize()
, but if we only used word_tokenize()
alone, we'd see a bunch of extraneous sentence-final punctuation in our output.) By printing each token alphabetically, along with a count of the number of times it appeared in the text, we can see the results of the tokenization. Notice that the output contains some punctuation & numbers, hasn't been loweredcased, and counts *BuzzFeed* and *BuzzFeed's* separately. We'll tackle some of those issues next."
80 | },
81 | {
82 | "cell_type": "code",
83 | "collapsed": false,
84 | "input": "tokens = [word for sent in nltk.sent_tokenize(soup.text) for word in nltk.word_tokenize(sent)]\n\nfor token in sorted(set(tokens))[:30]:\n print token + ' [' + str(tokens.count(token)) + ']'",
85 | "language": "python",
86 | "metadata": {},
87 | "outputs": []
88 | },
89 | {
90 | "cell_type": "markdown",
91 | "metadata": {},
92 | "source": "#### Word Stemming\n[Stemming](http://en.wikipedia.org/wiki/Stemming) is the process of reducing a word to its base/stem/root form. Most stemmers are pretty basic and just chop off standard affixes indicating things like tense (e.g., \"-ed\") and possessive forms (e.g., \"-'s\"). Here, we'll use the Snowball stemmer for English, which comes with NLTK.\n\nOnce our tokens are stemmed, we can rest easy knowing that *BuzzFeed* and *BuzzFeed's* are now being counted together as... *buzzfe*? Don't worry: although this may look weird, it's pretty standard behavior for stemmers and won't affect our analysis (much). We also (probably) won't show the stemmed words to users -- we'll normally just use them for internal analysis or indexing purposes."
93 | },
94 | {
95 | "cell_type": "code",
96 | "collapsed": false,
97 | "input": "from nltk.stem.snowball import SnowballStemmer\n\nstemmer = SnowballStemmer(\"english\")\nstemmed_tokens = [stemmer.stem(t) for t in tokens]\n\nfor token in sorted(set(stemmed_tokens))[50:75]:\n print token + ' [' + str(stemmed_tokens.count(token)) + ']'",
98 | "language": "python",
99 | "metadata": {},
100 | "outputs": []
101 | },
102 | {
103 | "cell_type": "markdown",
104 | "metadata": {},
105 | "source": "#### Lemmatization\n\nAlthough the stemmer very helpfully chopped off pesky affixes (and made everything lowercase to boot), there are some word forms that give stemmers indigestion, especially *irregular* words. While the process of stemming typically involves rule-based methods of stripping affixes (making them small & fast), **lemmatization** involves dictionary-based methods to derive the canonical forms (i.e., *lemmas*) of words. For example, *run*, *runs*, *ran*, and *running* all correspond to the lemma *run*. However, lemmatizers are generally big, slow, and brittle due to the nature of the dictionary-based methods, so you'll only want to use them when necessary.\n\nThe example below compares the output of the Snowball stemmer with the WordNet lemmatizer (also distributed with NLTK). Notice that the lemmatizer correctly converts *women* into *woman*, while the stemmer turns *lying* into *lie*. Additionally, both replace *eyes* with *eye*, but neither of them properly transforms *told* into *tell*."
106 | },
107 | {
108 | "cell_type": "code",
109 | "collapsed": false,
110 | "input": "lemmatizer = nltk.WordNetLemmatizer()\ntemp_sent = \"Several women told me I have lying eyes.\"\n\nprint [stemmer.stem(t) for t in nltk.word_tokenize(temp_sent)]\nprint [lemmatizer.lemmatize(t) for t in nltk.word_tokenize(temp_sent)]",
111 | "language": "python",
112 | "metadata": {},
113 | "outputs": []
114 | },
115 | {
116 | "cell_type": "markdown",
117 | "metadata": {},
118 | "source": "#### NLTK Frequency Distributions\nThus far, we've been working with lists of tokens that we're manually sorting, uniquifying, and counting -- all of which can get to be a bit cumbersome. Fortunately, NLTK provides a data structure called FreqDist
that makes it more convenient to work with these kinds of frequency distributions. The code snippet below builds a FreqDist
from our list of stemmed tokens, and then displays the top 25 tokens appearing most frequently in the text of our article. Wasn't that easy?"
119 | },
120 | {
121 | "cell_type": "code",
122 | "collapsed": false,
123 | "input": "fdist = nltk.FreqDist(stemmed_tokens)\n\nfor item in fdist.items()[:25]:\n print item",
124 | "language": "python",
125 | "metadata": {},
126 | "outputs": []
127 | },
128 | {
129 | "cell_type": "markdown",
130 | "metadata": {},
131 | "source": "#### Filtering out Stop Words\nNotice in the output above that most of the top 25 tokens are worthless. With the exception of things like *facebook*, *content*, *user*, and perhaps *emot* (emotion?), the rest are basically devoid of meaningful information. They don't really tells us anything about the article since these tokens will appear is just about any English document. What we need to do is filter out these [*stop words*](http://en.wikipedia.org/wiki/Stop_words) in order to focus on just the important material.\n\nWhile there is no single, definitive list of stop words, NLTK provides a decent start. Let's load it up and take a look at what we get:"
132 | },
133 | {
134 | "cell_type": "code",
135 | "collapsed": false,
136 | "input": "sorted(nltk.corpus.stopwords.words('english'))[:25]",
137 | "language": "python",
138 | "metadata": {},
139 | "outputs": []
140 | },
141 | {
142 | "cell_type": "markdown",
143 | "metadata": {},
144 | "source": "Now we can use this list to filter-out stop words from our list of stemmed tokens before we create the frequency distribution. You'll notice in the output below that we still have some things like punctuation that we'd probably like to remove, but we're much closer to having a list of the most \"important\" words in our article."
145 | },
146 | {
147 | "cell_type": "code",
148 | "collapsed": false,
149 | "input": "stemmed_tokens_no_stop = [stemmer.stem(t) for t in stemmed_tokens if t not in nltk.corpus.stopwords.words('english')]\n\nfdist2 = nltk.FreqDist(stemmed_tokens_no_stop)\n\nfor item in fdist2.items()[:25]:\n print item",
150 | "language": "python",
151 | "metadata": {},
152 | "outputs": []
153 | },
154 | {
155 | "cell_type": "markdown",
156 | "metadata": {},
157 | "source": "## Named Entity Recognition\nAnother task we might want to do to help identify what's \"important\" in a text document is [named entity recogniton (NER)](http://en.wikipedia.org/wiki/Named-entity_recognition). Also called *entity extraction*, this process involves automatically extracting the names of persons, places, organizations, and potentially other entity types out of unstructured text. Building an NER classifier requires *lots* of annotated training data and some [fancy machine learning algorithms](http://en.wikipedia.org/wiki/Conditional_random_field), but fortunately, NLTK comes with a pre-built/pre-trained NER classifier ready to extract entities right out of the box. This classifier has been trained to recognize PERSON, ORGANIZATION, and GPE (geo-political entity) entity types.\n\n(At this point, I should include a disclaimer stating [No True Computational Linguist](http://en.wikipedia.org/wiki/No_true_Scotsman) would ever use a pre-built NER classifier in the \"real world\" without first re-training it on annotated data representing their particular task. So please don't send me any hate mail -- I've done my part to stop the madness.)\n\n\n\nIn the example below (inspired by [this gist from Gavin Hackeling](https://gist.github.com/gavinmh/4735528/) and [this post from John Price](http://freshlyminted.co.uk/blog/2011/02/28/getting-band-and-artist-names-nltk/)), we're defining a method to perform the following steps:\n\n- take a string as input\n- tokenize it into sentences\n- tokenize the sentences into words\n- add part-of-speech tags to the words using nltk.pos_tag()
\n- run this through the NLTK-provided NER classifier using nltk.ne_chunk()
\n- parse these intermediate results and return any extracted entities\n\nWe then apply this method to a sample sentence and parse the clunky output format provided by nltk.ne_chunk()
(it comes as a [nltk.tree.Tree](http://www.nltk.org/_modules/nltk/tree.html)) to display the entities we've extracted. Don't let these nice results fool you -- NER output isn't always this satisfying. Try some other sample text and see what you get."
158 | },
159 | {
160 | "cell_type": "code",
161 | "collapsed": false,
162 | "input": "def extract_entities(text):\n\tentities = []\n\tfor sentence in nltk.sent_tokenize(text):\n\t chunks = nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(sentence)))\n\t entities.extend([chunk for chunk in chunks if hasattr(chunk, 'node')])\n\treturn entities\n\nfor entity in extract_entities('My name is Charlie and I work for Altamira in Tysons Corner.'):\n print '[' + entity.node + '] ' + ' '.join(c[0] for c in entity.leaves())",
163 | "language": "python",
164 | "metadata": {},
165 | "outputs": []
166 | },
167 | {
168 | "cell_type": "markdown",
169 | "metadata": {},
170 | "source": "If you're like me, you've grown accustomed over the years to working with the [Stanford NER](http://nlp.stanford.edu/software/CRF-NER.shtml) library for Java, and you're suspicious of NLTK's built-in NER classifier (especially because it has *chunk* in the name). Thankfully, recent versions of NLTK contain an special NERTagger
interface that enables us to make calls to Stanford NER from our Python programs, even though Stanford NER is a *Java library* (the horror!). [Not surprisingly](http://www.yurtopic.com/tech/programming/images/java-and-python.jpg), the Python NERTagger
API is slightly less verbose than the native Java API for Stanford NER.\n\nTo run this example, you'll need to follow the instructions for installing the optional Java libraries, as outlined in the **Initial Setup** section above. You'll also want to pay close attention to the comment that says # change the paths below to point to wherever you unzipped the Stanford NER download file
."
171 | },
172 | {
173 | "cell_type": "code",
174 | "collapsed": false,
175 | "input": "from nltk.tag.stanford import NERTagger\n\n# change the paths below to point to wherever you unzipped the Stanford NER download file\nst = NERTagger('/Users/cgreenba/stanford-ner/classifiers/english.all.3class.distsim.crf.ser.gz',\n '/Users/cgreenba/stanford-ner/stanford-ner.jar', 'utf-8')\n\nfor i in st.tag('Up next is Tommy, who works at STPI in Washington.'.split()):\n print '[' + i[1] + '] ' + i[0]",
176 | "language": "python",
177 | "metadata": {},
178 | "outputs": []
179 | },
180 | {
181 | "cell_type": "markdown",
182 | "metadata": {},
183 | "source": "## Automatic Summarization\nNow let's try to take some of what we've learned and build something potentially useful in real life: a program that will [automatically summarize](http://en.wikipedia.org/wiki/Automatic_summarization) documents. For this, we'll switch gears slightly, putting aside the web article we've been working on until now and instead using a corpus of documents distributed with NLTK.\n\nThe Reuters Corpus contains nearly 11,000 news articles about a variety of topics and subjects. If you've run the nltk.download()
command as previously recommended, you can then easily import and explore the Reuters Corpus like so:"
184 | },
185 | {
186 | "cell_type": "code",
187 | "collapsed": false,
188 | "input": "from nltk.corpus import reuters\n\nprint '** BEGIN ARTICLE: ** \\\"' + reuters.raw(reuters.fileids()[0])[:500] + ' [...]\\\"'",
189 | "language": "python",
190 | "metadata": {},
191 | "outputs": []
192 | },
193 | {
194 | "cell_type": "markdown",
195 | "metadata": {},
196 | "source": "Our [painfully simplistic](http://anthology.aclweb.org/P/P11/P11-3014.pdf) automatic summarization tool will implement the following steps:\n\n- assign a score to each word in a document corresponding to its level of \"importance\"\n- rank each sentence in the document by summing the individual word scores and dividing by the number of tokens in the sentence\n- extract the top N highest scoring sentences and return them as our \"summary\"\n\nSounds easy enough, right? But before we can say \"*voila!*,\" we'll need to figure out how to calculate an \"importance\" score for words. As we saw above with stop words, etc. simply counting the number of times a word appears in a document will not necessarily tell you which words are most important.\n\n#### Term Frequency - Inverse Document Frequency (TF-IDF)\n\nConsider a document that contains the word *baseball* 8 times. You might think, \"wow, *baseball* isn't a stop word, and it appeared rather frequently here, so it's probably important.\" And you might be right. But what if that document is actually an article posted on a baseball blog? Won't the word *baseball* appear frequently in nearly every post on that blog? In this particular case, if you were generating a summary of this document, would the word *baseball* be a good indicator of importance, or would you maybe look for other words that help distinguish or differentiate this blog post from the rest?\n\nContext is essential. What really matters here isn't the raw frequency of the number of times each word appeared in a document, but rather the **relative frequency** comparing the number of times a word appeared in this document against the number of times it appeared across the rest of the collection of documents. \"Important\" words will be the ones that are generally rare across the collection, but which appear with an unusually high frequency in a given document.\n\nWe'll calculate this relative frequency using a statistical metric called [term frequency - inverse document frequency (TF-IDF)](http://en.wikipedia.org/wiki/Tf%E2%80%93idf). We could implement TF-IDF ourselves using NLTK, but rather than bore you with the math, we'll take a shortcut and use the TF-IDF implementation provided by the [scikit-learn](http://scikit-learn.org/) machine learning library for Python.\n\n\n\n#### Building a Term-Document Matrix\n\nWe'll use scikit-learn's TfidfVectorizer
class to construct a [term-document matrix](http://en.wikipedia.org/wiki/Document-term_matrix) containing the TF-IDF score for each word in each document in the Reuters Corpus. In essence, the rows of this sparse matrix correspond to documents in the corpus, the columns represent each word in the vocabulary of the corpus, and each cell contains the TF-IDF value for a given word in a given document.\n\n[](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html)\n\nInspired by a [computer science lab exercise from Duke University](http://www.cs.duke.edu/courses/spring14/compsci290/assignments/lab02.html), the code sample below iterates through the Reuters Corpus to build a dictionary of stemmed tokens for each article, then uses the TfidfVectorizer
and scikit-learn's own built-in stop words list to generate the term-document matrix containing TF-IDF scores."
197 | },
198 | {
199 | "cell_type": "code",
200 | "collapsed": false,
201 | "input": "import datetime, re, sys\nfrom sklearn.feature_extraction.text import TfidfVectorizer\n\ndef tokenize_and_stem(text):\n tokens = [word for sent in nltk.sent_tokenize(text) for word in nltk.word_tokenize(sent)]\n filtered_tokens = []\n # filter out any tokens not containing letters (e.g., numeric tokens, raw punctuation)\n for token in tokens:\n if re.search('[a-zA-Z]', token):\n filtered_tokens.append(token)\n stems = [stemmer.stem(t) for t in filtered_tokens]\n return stems\n\ntoken_dict = {}\nfor article in reuters.fileids():\n token_dict[article] = reuters.raw(article)\n \ntfidf = TfidfVectorizer(tokenizer=tokenize_and_stem, stop_words='english', decode_error='ignore')\nprint 'building term-document matrix... [process started: ' + str(datetime.datetime.now()) + ']'\nsys.stdout.flush()\n\ntdm = tfidf.fit_transform(token_dict.values()) # this can take some time (about 60 seconds on my machine)\nprint 'done! [process finished: ' + str(datetime.datetime.now()) + ']'",
202 | "language": "python",
203 | "metadata": {},
204 | "outputs": []
205 | },
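{
"cell_type": "markdown",
"metadata": {},
"source": "Before we start poking around inside the matrix, here's a quick sanity check on what a TF-IDF score actually represents. The next cell is a toy illustration only (it isn't part of the summarizer, and the three tiny \"documents\" are made up): it computes plain textbook TF-IDF by hand. Keep in mind that scikit-learn's `TfidfVectorizer` applies smoothing and L2 normalization by default, so its values won't match this back-of-the-envelope arithmetic exactly, but the intuition is the same: a term scores high in a document when it's frequent there and rare in the rest of the collection."
},
{
"cell_type": "code",
"collapsed": false,
"input": "import math\n\n# toy corpus: three tiny made-up 'documents' (purely illustrative)\ntoy_docs = ['the cat sat on the mat',\n            'the dog sat on the log',\n            'the team played baseball baseball baseball']\n\ndef toy_tf_idf(term, doc, docs):\n    # term frequency: number of times the term appears in this document\n    tf = doc.split().count(term)\n    # document frequency: number of documents containing the term (assumed > 0 here)\n    df = sum(1 for d in docs if term in d.split())\n    # inverse document frequency, basic textbook form (no smoothing or normalization)\n    idf = math.log(float(len(docs)) / df)\n    return tf * idf\n\n# 'the' appears in every document, so it scores zero everywhere;\n# 'cat' and 'baseball' occur in only one document each, so they score high there\nfor term in ['the', 'cat', 'baseball']:\n    print term, [round(toy_tf_idf(term, d, toy_docs), 3) for d in toy_docs]",
"language": "python",
"metadata": {},
"outputs": []
},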
206 | {
207 | "cell_type": "markdown",
208 | "metadata": {},
209 | "source": "#### TF-IDF Scores\n\nNow that we've built the term-document matrix, we can explore its contents:"
210 | },
211 | {
212 | "cell_type": "code",
213 | "collapsed": false,
214 | "input": "from random import randint\n\nfeature_names = tfidf.get_feature_names()\nprint 'TDM contains ' + str(len(feature_names)) + ' terms and ' + str(tdm.shape[0]) + ' documents'\n\nprint 'first term: ' + feature_names[0]\nprint 'last term: ' + feature_names[len(feature_names) - 1]\n\nfor i in range(0, 4):\n print 'random term: ' + feature_names[randint(1,len(feature_names) - 2)]",
215 | "language": "python",
216 | "metadata": {},
217 | "outputs": []
218 | },
219 | {
220 | "cell_type": "markdown",
221 | "metadata": {},
222 | "source": "#### Generating the Summary\n\nThat's all we'll need to produce a summary for any document in the corpus. In the example code below, we start by randomly selecting an article from the Reuters Corpus. We iterate through the article, calculating a score for each sentence by summing the TF-IDF values for each word appearing in the sentence. We normalize the sentence scores by dividing by the number of tokens in the sentence (to avoid bias in favor of longer sentences). Then we sort the sentences by their scores, and return the highest-scoring sentences as our summary. The number of sentences returned corresponds to roughly 20% of the overall length of the article.\n\nSince some of the articles in the Reuters Corpus are rather small (i.e., a single sentence in length) or contain just raw financial data, some of the summaries won't make sense. If you run this code a few times, however, you'll eventually see a randomly-selected article that provides a decent demonstration of this simplistic method of identifying the \"most important\" sentence from a document."
223 | },
224 | {
225 | "cell_type": "code",
226 | "collapsed": false,
227 | "input": "import math\nfrom __future__ import division\n\narticle_id = randint(0, tdm.shape[0] - 1)\narticle_text = reuters.raw(reuters.fileids()[article_id])\n\nsent_scores = []\nfor sentence in nltk.sent_tokenize(article_text):\n score = 0\n sent_tokens = tokenize_and_stem(sentence)\n for token in (t for t in sent_tokens if t in feature_names):\n score += tdm[article_id, feature_names.index(token)]\n sent_scores.append((score / len(sent_tokens), sentence))\n\nsummary_length = int(math.ceil(len(sent_scores) / 5))\nsent_scores.sort(key=lambda sent: sent[0], reverse=True)\n\nprint '*** SUMMARY ***'\nfor summary_sentence in sent_scores[:summary_length]:\n print summary_sentence[1]\n\nprint '\\n*** ORIGINAL ***'\nprint article_text",
228 | "language": "python",
229 | "metadata": {},
230 | "outputs": []
231 | },
232 | {
233 | "cell_type": "markdown",
234 | "metadata": {},
235 | "source": "#### Improving the Summary\nThat was fairly easy, but how could we improve the quality of the generated summary? Perhaps we could boost the importance of words found in the title or any entities we're able to extract from the text. After initially selecting the highest-scoring sentence, we might discount the TF-IDF scores for duplicate words in the remaining sentences in an attempt to reduce repetitiveness. We could also look at cleaning up the sentences used to form the summary by fixing any pronouns missing an antecedent, or even pulling out partial phrases instead of complete sentences. The possibilities are virtually endless.\n\n## Next Steps\nWant to learn more? Start by working your way through all the examples in the NLTK book (aka \"the Whale book\"):\n\n[](http://oreilly.com/catalog/9780596516499/)\n\n- [Natural Language Processing with Python (book)](http://oreilly.com/catalog/9780596516499/)\n- (free online version: [nltk.org/book](http://www.nltk.org/book/))\n\n### Additional NLP Resources for Python\n- [NLTK HOWTOs](http://www.nltk.org/howto/)\n- [Python Text Processing with NLTK 2.0 Cookbook (book)](http://www.packtpub.com/python-text-processing-nltk-20-cookbook/book)\n- [Python wrapper for the Stanford CoreNLP Java library](https://pypi.python.org/pypi/corenlp)\n- [guess_language (Python library for language identification)](https://bitbucket.org/spirit/guess_language)\n- [MITIE (new C/C++-based NER library from MIT with a Python API)](https://github.com/mit-nlp/MITIE)\n- [gensim (topic modeling library for Python)](http://radimrehurek.com/gensim/)\n\n### Attend future DC NLP meetups\n\n[](http://dcnlp.org/)\n\n- [dcnlp.org](http://dcnlp.org/) | [@DCNLP](https://twitter.com/DCNLP/)"
236 | }
237 | ],
238 | "metadata": {}
239 | }
240 | ]
241 | }
242 |
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 | Apache License
2 | Version 2.0, January 2004
3 | http://www.apache.org/licenses/
4 |
5 | TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
6 |
7 | 1. Definitions.
8 |
9 | "License" shall mean the terms and conditions for use, reproduction,
10 | and distribution as defined by Sections 1 through 9 of this document.
11 |
12 | "Licensor" shall mean the copyright owner or entity authorized by
13 | the copyright owner that is granting the License.
14 |
15 | "Legal Entity" shall mean the union of the acting entity and all
16 | other entities that control, are controlled by, or are under common
17 | control with that entity. For the purposes of this definition,
18 | "control" means (i) the power, direct or indirect, to cause the
19 | direction or management of such entity, whether by contract or
20 | otherwise, or (ii) ownership of fifty percent (50%) or more of the
21 | outstanding shares, or (iii) beneficial ownership of such entity.
22 |
23 | "You" (or "Your") shall mean an individual or Legal Entity
24 | exercising permissions granted by this License.
25 |
26 | "Source" form shall mean the preferred form for making modifications,
27 | including but not limited to software source code, documentation
28 | source, and configuration files.
29 |
30 | "Object" form shall mean any form resulting from mechanical
31 | transformation or translation of a Source form, including but
32 | not limited to compiled object code, generated documentation,
33 | and conversions to other media types.
34 |
35 | "Work" shall mean the work of authorship, whether in Source or
36 | Object form, made available under the License, as indicated by a
37 | copyright notice that is included in or attached to the work
38 | (an example is provided in the Appendix below).
39 |
40 | "Derivative Works" shall mean any work, whether in Source or Object
41 | form, that is based on (or derived from) the Work and for which the
42 | editorial revisions, annotations, elaborations, or other modifications
43 | represent, as a whole, an original work of authorship. For the purposes
44 | of this License, Derivative Works shall not include works that remain
45 | separable from, or merely link (or bind by name) to the interfaces of,
46 | the Work and Derivative Works thereof.
47 |
48 | "Contribution" shall mean any work of authorship, including
49 | the original version of the Work and any modifications or additions
50 | to that Work or Derivative Works thereof, that is intentionally
51 | submitted to Licensor for inclusion in the Work by the copyright owner
52 | or by an individual or Legal Entity authorized to submit on behalf of
53 | the copyright owner. For the purposes of this definition, "submitted"
54 | means any form of electronic, verbal, or written communication sent
55 | to the Licensor or its representatives, including but not limited to
56 | communication on electronic mailing lists, source code control systems,
57 | and issue tracking systems that are managed by, or on behalf of, the
58 | Licensor for the purpose of discussing and improving the Work, but
59 | excluding communication that is conspicuously marked or otherwise
60 | designated in writing by the copyright owner as "Not a Contribution."
61 |
62 | "Contributor" shall mean Licensor and any individual or Legal Entity
63 | on behalf of whom a Contribution has been received by Licensor and
64 | subsequently incorporated within the Work.
65 |
66 | 2. Grant of Copyright License. Subject to the terms and conditions of
67 | this License, each Contributor hereby grants to You a perpetual,
68 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable
69 | copyright license to reproduce, prepare Derivative Works of,
70 | publicly display, publicly perform, sublicense, and distribute the
71 | Work and such Derivative Works in Source or Object form.
72 |
73 | 3. Grant of Patent License. Subject to the terms and conditions of
74 | this License, each Contributor hereby grants to You a perpetual,
75 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable
76 | (except as stated in this section) patent license to make, have made,
77 | use, offer to sell, sell, import, and otherwise transfer the Work,
78 | where such license applies only to those patent claims licensable
79 | by such Contributor that are necessarily infringed by their
80 | Contribution(s) alone or by combination of their Contribution(s)
81 | with the Work to which such Contribution(s) was submitted. If You
82 | institute patent litigation against any entity (including a
83 | cross-claim or counterclaim in a lawsuit) alleging that the Work
84 | or a Contribution incorporated within the Work constitutes direct
85 | or contributory patent infringement, then any patent licenses
86 | granted to You under this License for that Work shall terminate
87 | as of the date such litigation is filed.
88 |
89 | 4. Redistribution. You may reproduce and distribute copies of the
90 | Work or Derivative Works thereof in any medium, with or without
91 | modifications, and in Source or Object form, provided that You
92 | meet the following conditions:
93 |
94 | (a) You must give any other recipients of the Work or
95 | Derivative Works a copy of this License; and
96 |
97 | (b) You must cause any modified files to carry prominent notices
98 | stating that You changed the files; and
99 |
100 | (c) You must retain, in the Source form of any Derivative Works
101 | that You distribute, all copyright, patent, trademark, and
102 | attribution notices from the Source form of the Work,
103 | excluding those notices that do not pertain to any part of
104 | the Derivative Works; and
105 |
106 | (d) If the Work includes a "NOTICE" text file as part of its
107 | distribution, then any Derivative Works that You distribute must
108 | include a readable copy of the attribution notices contained
109 | within such NOTICE file, excluding those notices that do not
110 | pertain to any part of the Derivative Works, in at least one
111 | of the following places: within a NOTICE text file distributed
112 | as part of the Derivative Works; within the Source form or
113 | documentation, if provided along with the Derivative Works; or,
114 | within a display generated by the Derivative Works, if and
115 | wherever such third-party notices normally appear. The contents
116 | of the NOTICE file are for informational purposes only and
117 | do not modify the License. You may add Your own attribution
118 | notices within Derivative Works that You distribute, alongside
119 | or as an addendum to the NOTICE text from the Work, provided
120 | that such additional attribution notices cannot be construed
121 | as modifying the License.
122 |
123 | You may add Your own copyright statement to Your modifications and
124 | may provide additional or different license terms and conditions
125 | for use, reproduction, or distribution of Your modifications, or
126 | for any such Derivative Works as a whole, provided Your use,
127 | reproduction, and distribution of the Work otherwise complies with
128 | the conditions stated in this License.
129 |
130 | 5. Submission of Contributions. Unless You explicitly state otherwise,
131 | any Contribution intentionally submitted for inclusion in the Work
132 | by You to the Licensor shall be under the terms and conditions of
133 | this License, without any additional terms or conditions.
134 | Notwithstanding the above, nothing herein shall supersede or modify
135 | the terms of any separate license agreement you may have executed
136 | with Licensor regarding such Contributions.
137 |
138 | 6. Trademarks. This License does not grant permission to use the trade
139 | names, trademarks, service marks, or product names of the Licensor,
140 | except as required for reasonable and customary use in describing the
141 | origin of the Work and reproducing the content of the NOTICE file.
142 |
143 | 7. Disclaimer of Warranty. Unless required by applicable law or
144 | agreed to in writing, Licensor provides the Work (and each
145 | Contributor provides its Contributions) on an "AS IS" BASIS,
146 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
147 | implied, including, without limitation, any warranties or conditions
148 | of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
149 | PARTICULAR PURPOSE. You are solely responsible for determining the
150 | appropriateness of using or redistributing the Work and assume any
151 | risks associated with Your exercise of permissions under this License.
152 |
153 | 8. Limitation of Liability. In no event and under no legal theory,
154 | whether in tort (including negligence), contract, or otherwise,
155 | unless required by applicable law (such as deliberate and grossly
156 | negligent acts) or agreed to in writing, shall any Contributor be
157 | liable to You for damages, including any direct, indirect, special,
158 | incidental, or consequential damages of any character arising as a
159 | result of this License or out of the use or inability to use the
160 | Work (including but not limited to damages for loss of goodwill,
161 | work stoppage, computer failure or malfunction, or any and all
162 | other commercial damages or losses), even if such Contributor
163 | has been advised of the possibility of such damages.
164 |
165 | 9. Accepting Warranty or Additional Liability. While redistributing
166 | the Work or Derivative Works thereof, You may choose to offer,
167 | and charge a fee for, acceptance of support, warranty, indemnity,
168 | or other liability obligations and/or rights consistent with this
169 | License. However, in accepting such obligations, You may act only
170 | on Your own behalf and on Your sole responsibility, not on behalf
171 | of any other Contributor, and only if You agree to indemnify,
172 | defend, and hold each Contributor harmless for any liability
173 | incurred by, or claims asserted against, such Contributor by reason
174 | of your accepting any such warranty or additional liability.
175 |
176 | END OF TERMS AND CONDITIONS
177 |
178 | APPENDIX: How to apply the Apache License to your work.
179 |
180 | To apply the Apache License to your work, attach the following
181 | boilerplate notice, with the fields enclosed by brackets "{}"
182 | replaced with your own identifying information. (Don't include
183 | the brackets!) The text should be enclosed in the appropriate
184 | comment syntax for the file format. We also recommend that a
185 | file or class name and description of purpose be included on the
186 | same "printed page" as the copyright notice for easier
187 | identification within third-party archives.
188 |
189 | Copyright {yyyy} {name of copyright owner}
190 |
191 | Licensed under the Apache License, Version 2.0 (the "License");
192 | you may not use this file except in compliance with the License.
193 | You may obtain a copy of the License at
194 |
195 | http://www.apache.org/licenses/LICENSE-2.0
196 |
197 | Unless required by applicable law or agreed to in writing, software
198 | distributed under the License is distributed on an "AS IS" BASIS,
199 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
200 | See the License for the specific language governing permissions and
201 | limitations under the License.
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | ## Hey YOU!
2 | **Yes, you. Don't try to use the code examples in this README. Instead, download the .ipynb file provided in this repository, fire up [IPython Notebook](http://ipython.org/notebook.html), and run the code there. Trust us, you'll like it much better.**
3 |
4 | **You can also view a non-runnable version of the notebook (with proper syntax highlighting and embedded images) here: [http://nbviewer.ipython.org/github/charlieg/A-Smattering-of-NLP-in-Python/blob/master/A%20Smattering%20of%20NLP%20in%20Python.ipynb](http://nbviewer.ipython.org/github/charlieg/A-Smattering-of-NLP-in-Python/blob/master/A%20Smattering%20of%20NLP%20in%20Python.ipynb)**
5 |
6 | ***
7 |
8 | # A Smattering of NLP in Python
9 | *by Charlie Greenbacker [@greenbacker](https://twitter.com/greenbacker)*
10 |
11 | ### Part of a [joint meetup on NLP](http://www.meetup.com/stats-prog-dc/events/177772322/) - 9 July 2014
12 | - [Statistical Programming DC](http://www.meetup.com/stats-prog-dc/)
13 | - [Data Wranglers DC](http://www.meetup.com/Data-Wranglers-DC/)
14 | - [DC Natural Language Processing](http://dcnlp.org/)
15 |
16 | ***
17 |
18 | ## Introduction
19 | Back in the dark ages of data science, each group or individual working in
20 | Natural Language Processing (NLP) generally maintained an assortment of homebrew
21 | utility programs designed to handle many of the common tasks involved with NLP.
22 | Despite everyone's best intentions, most of this code was lousy, brittle, and
23 | poorly documented -- not a good foundation upon which to build your masterpiece.
24 | Fortunately, over the past decade, mainstream open source software libraries
25 | like the [Natural Language Toolkit for Python (NLTK)](http://www.nltk.org/) have
26 | emerged to offer a collection of high-quality reusable NLP functionality. These
27 | libraries allow researchers and developers to spend more time focusing on the
28 | application logic of the task at hand, and less on debugging an abandoned method
29 | for sentence segmentation or reimplementing noun phrase chunking.
30 |
31 | This presentation will cover a handful of the NLP building blocks provided by
32 | NLTK (and a few additional libraries), including extracting text from HTML,
33 | stemming & lemmatization, frequency analysis, and named entity recognition.
34 | Several of these components will then be assembled to build a very basic
35 | document summarization program.
36 |
37 | ### Initial Setup
38 | Obviously, you'll need Python installed on your system to run the code examples
39 | used in this presentation. We enthusiastically recommend using
40 | [Anaconda](https://store.continuum.io/cshop/anaconda/), a Python distribution
41 | provided by [Continuum Analytics](http://www.continuum.io/). Anaconda is free to
42 | use, it includes nearly [200 of the most commonly used Python packages for data
43 | analysis](http://docs.continuum.io/anaconda/pkg-docs.html) (including NLTK), and
44 | it works on Mac, Linux, and yes, even Windows.
45 |
46 | We'll make use of the following Python packages in the example code:
47 |
48 | - [nltk](http://www.nltk.org/install.html) (comes with Anaconda)
49 | - [readability-lxml](https://github.com/buriy/python-readability)
50 | - [BeautifulSoup4](http://www.crummy.com/software/BeautifulSoup/) (comes with
51 | Anaconda)
52 | - [scikit-learn](http://scikit-learn.org/stable/install.html) (comes with
53 | Anaconda)
54 |
55 | Please note that the **readability** package is not distributed with Anaconda,
56 | so you'll need to download & install it separately using something like
57 | `easy_install readability-lxml` or `pip install readability-lxml`.
58 |
59 | If you don't use Anaconda, you'll also need to download & install the other
60 | packages separately using similar methods. Refer to the homepage of each package
61 | for instructions.
62 |
63 | You'll want to run `nltk.download()` one time to get all of the NLTK
64 | packages, corpora, etc. (see below). Select the "all" option. Depending on your
65 | network speed, this could take a while, but you'll only need to do it once.
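
If you'd rather not click through the interactive downloader window, you can
fetch the same data non-interactively from a Python prompt (the 'all'
collection identifier below is the same one you'd pick in the GUI):

    import nltk
    nltk.download('all')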
66 |
67 | #### Java libraries (optional)
68 | One of the examples will use NLTK's interface to the [Stanford Named Entity
69 | Recognizer](http://www-nlp.stanford.edu/software/CRF-NER.shtml#Download), which
70 | is distributed as a Java library. In particular, you'll want the following files
71 | handy in order to run this particular example:
72 |
73 | - stanford-ner.jar
74 | - english.all.3class.distsim.crf.ser.gz
75 |
76 | ***
77 |
78 | ## Getting Started
79 | The first thing we'll need to do is `import nltk`:
80 |
81 |
82 | import nltk
83 |
84 | #### Downloading NLTK resources
85 | The first time you run anything using NLTK, you'll want to go ahead and download
86 | the additional resources that aren't distributed directly with the NLTK package.
87 | Upon running the `nltk.download()` command below, the NLTK
88 | Downloader window will pop up. In the Collections tab, select "all" and click on
89 | Download. As mentioned earlier, this may take several minutes depending on your
90 | network connection speed, but you'll only ever need to run it a single time.
91 |
92 |
93 | nltk.download()
94 |
95 | ## Extracting text from HTML
96 | Now the fun begins. We'll start with a pretty basic and commonly-faced task:
97 | extracting text content from an HTML page. Python's urllib package gives us the
98 | tools we need to fetch a web page from a given URL, but we see that the output
99 | is full of HTML markup that we don't want to deal with.
100 |
101 | (N.B.: Throughout the examples in this presentation, we'll use Python *slicing*
102 | (e.g., `[:500]` below) to only display a small portion of a string or
103 | list. Otherwise, if we displayed the entire item, sometimes it would take up the
104 | entire screen.)
105 |
106 |
107 | from urllib import urlopen
108 |
109 | url = "http://venturebeat.com/2014/07/04/facebooks-little-social-experiment-got-you-bummed-out-get-over-it/"
110 | html = urlopen(url).read()
111 | html[:500]
112 |
113 | #### Stripping out HTML formatting
114 | Fortunately, NLTK provides a method called `clean_html()` to get the
115 | raw text out of an HTML-formatted string. It's still not perfect, though, since
116 | the output will contain page navigation and all kinds of other junk that we
117 | don't want, especially if our goal is to focus on the body content from a news
118 | article, for example.
119 |
120 |
121 | text = nltk.clean_html(html)
122 | text[:500]
123 |
124 | #### Identifying the Main Content
125 | If we just want the body content from the article, we'll need to use two
126 | additional packages. The first is a Python port of a Ruby port of a JavaScript
127 | tool called Readability, which pulls the main body content out of an HTML
128 | document and subsequently "cleans it up." The second package, BeautifulSoup, is
129 | a Python library for pulling data out of HTML and XML files. It parses HTML
130 | content into an easily navigable nested data structure. Using Readability and
131 | BeautifulSoup together, we can quickly get exactly the text we're looking for
132 | out of the HTML, (*mostly*) free of page navigation, comments, ads, etc. Now
133 | we're ready to start analyzing this text content.
134 |
135 |
136 | from readability.readability import Document
137 | from bs4 import BeautifulSoup
138 |
139 | readable_article = Document(html).summary()
140 | readable_title = Document(html).title()
141 | soup = BeautifulSoup(readable_article)
142 | print '*** TITLE *** \n\"' + readable_title + '\"\n'
143 | print '*** CONTENT *** \n\"' + soup.text[:500] + '[...]\"'
144 |
145 | ## Frequency Analysis
146 | Here's a little secret: much of NLP (and data science, for that matter) boils
147 | down to counting things. If you've got a bunch of data that needs *analyzin'*
148 | but you don't know where to start, counting things is usually a good place to
149 | begin. Sure, you'll need to figure out exactly what you want to count, how to
150 | count it, and what to do with the counts, but if you're lost and don't know what
151 | to do, **just start counting**.
152 |
153 | Perhaps we'd like to begin (as is often the case in NLP) by examining the words
154 | that appear in our document. To do that, we'll first need to tokenize the text
155 | string into discrete words. Since we're working with English, this isn't so bad,
156 | but if we were working with a non-whitespace-delimited language like Chinese,
157 | Japanese, or Korean, it would be much more difficult.
158 |
159 | In the code snippet below, we're using two of NLTK's tokenize methods to first
160 | chop up the article text into sentences, and then each sentence into individual
161 | words. (Technically, we didn't need to use `sent_tokenize()`, but if
162 | we only used `word_tokenize()` alone, we'd see a bunch of extraneous
163 | sentence-final punctuation in our output.) By printing each token
164 | alphabetically, along with a count of the number of times it appeared in the
165 | text, we can see the results of the tokenization. Notice that the output
166 | contains some punctuation & numbers, hasn't been lowercased, and counts
167 | *BuzzFeed* and *BuzzFeed's* separately. We'll tackle some of those issues next.
168 |
169 |
170 | tokens = [word for sent in nltk.sent_tokenize(soup.text) for word in nltk.word_tokenize(sent)]
171 |
172 | for token in sorted(set(tokens))[:30]:
173 | print token + ' [' + str(tokens.count(token)) + ']'
174 |
175 | #### Word Stemming
176 | [Stemming](http://en.wikipedia.org/wiki/Stemming) is the process of reducing a
177 | word to its base/stem/root form. Most stemmers are pretty basic and just chop
178 | off standard affixes indicating things like tense (e.g., "-ed") and possessive
179 | forms (e.g., "-'s"). Here, we'll use the Snowball stemmer for English, which
180 | comes with NLTK.
181 |
182 | Once our tokens are stemmed, we can rest easy knowing that *BuzzFeed* and
183 | *BuzzFeed's* are now being counted together as... *buzzfe*? Don't worry:
184 | although this may look weird, it's pretty standard behavior for stemmers and
185 | won't affect our analysis (much). We also (probably) won't show the stemmed
186 | words to users -- we'll normally just use them for internal analysis or indexing
187 | purposes.
188 |
189 |
190 | from nltk.stem.snowball import SnowballStemmer
191 |
192 | stemmer = SnowballStemmer("english")
193 | stemmed_tokens = [stemmer.stem(t) for t in tokens]
194 |
195 | for token in sorted(set(stemmed_tokens))[50:75]:
196 | print token + ' [' + str(stemmed_tokens.count(token)) + ']'
197 |
198 | #### Lemmatization
199 |
200 | Although the stemmer very helpfully chopped off pesky affixes (and made
201 | everything lowercase to boot), there are some word forms that give stemmers
202 | indigestion, especially *irregular* words. While the process of stemming
203 | typically involves rule-based methods of stripping affixes (making them small &
204 | fast), **lemmatization** involves dictionary-based methods to derive the
205 | canonical forms (i.e., *lemmas*) of words. For example, *run*, *runs*, *ran*,
206 | and *running* all correspond to the lemma *run*. However, lemmatizers are
207 | generally big, slow, and brittle due to the nature of the dictionary-based
208 | methods, so you'll only want to use them when necessary.
209 |
210 | The example below compares the output of the Snowball stemmer with the WordNet
211 | lemmatizer (also distributed with NLTK). Notice that the lemmatizer correctly
212 | converts *women* into *woman*, while the stemmer turns *lying* into *lie*.
213 | Additionally, both replace *eyes* with *eye*, but neither of them properly
214 | transforms *told* into *tell*.
215 |
216 |
217 | lemmatizer = nltk.WordNetLemmatizer()
218 | temp_sent = "Several women told me I have lying eyes."
219 |
220 | print [stemmer.stem(t) for t in nltk.word_tokenize(temp_sent)]
221 | print [lemmatizer.lemmatize(t) for t in nltk.word_tokenize(temp_sent)]
222 |
223 | #### NLTK Frequency Distributions
224 | Thus far, we've been working with lists of tokens that we're manually sorting,
225 | uniquifying, and counting -- all of which can get to be a bit cumbersome.
226 | Fortunately, NLTK provides a data structure called `FreqDist` that
227 | makes it more convenient to work with these kinds of frequency distributions.
228 | The code snippet below builds a `FreqDist` from our list of stemmed
229 | tokens, and then displays the top 25 tokens appearing most frequently in the
230 | text of our article. Wasn't that easy?
231 |
232 |
233 | fdist = nltk.FreqDist(stemmed_tokens)
234 |
235 | for item in fdist.items()[:25]:
236 | print item
237 |
238 | #### Filtering out Stop Words
239 | Notice in the output above that most of the top 25 tokens are worthless. With
240 | the exception of things like *facebook*, *content*, *user*, and perhaps *emot*
241 | (emotion?), the rest are basically devoid of meaningful information. They don't
242 | really tell us anything about the article since these tokens will appear in
243 | just about any English document. What we need to do is filter out these [*stop
244 | words*](http://en.wikipedia.org/wiki/Stop_words) in order to focus on just the
245 | important material.
246 |
247 | While there is no single, definitive list of stop words, NLTK provides a decent
248 | start. Let's load it up and take a look at what we get:
249 |
250 |
251 | sorted(nltk.corpus.stopwords.words('english'))[:25]
252 |
253 | Now we can use this list to filter out stop words from our list of stemmed
254 | tokens before we create the frequency distribution. You'll notice in the output
255 | below that we still have some things like punctuation that we'd probably like to
256 | remove, but we're much closer to having a list of the most "important" words in
257 | our article.
258 |
259 |
260 | stemmed_tokens_no_stop = [stemmer.stem(t) for t in stemmed_tokens if t not in nltk.corpus.stopwords.words('english')]
261 |
262 | fdist2 = nltk.FreqDist(stemmed_tokens_no_stop)
263 |
264 | for item in fdist2.items()[:25]:
265 | print item
266 |
267 | ## Named Entity Recognition
268 | Another task we might want to do to help identify what's "important" in a text
269 | document is [named entity recognition (NER)](http://en.wikipedia.org/wiki/Named-entity_recognition).
270 | Also called *entity extraction*, this process involves
271 | automatically extracting the names of persons, places, organizations, and
272 | potentially other entity types out of unstructured text. Building an NER
273 | classifier requires *lots* of annotated training data and some [fancy machine
274 | learning algorithms](http://en.wikipedia.org/wiki/Conditional_random_field), but
275 | fortunately, NLTK comes with a pre-built/pre-trained NER classifier ready to
276 | extract entities right out of the box. This classifier has been trained to
277 | recognize PERSON, ORGANIZATION, and GPE (geo-political entity) entity types.
278 |
279 | (At this point, I should include a disclaimer stating [No True Computational
280 | Linguist](http://en.wikipedia.org/wiki/No_true_Scotsman) would ever use a pre-
281 | built NER classifier in the "real world" without first re-training it on
282 | annotated data representing their particular task. So please don't send me any
283 | hate mail -- I've done my part to stop the madness.)
284 |
285 | In the example below (inspired by [this gist from Gavin
286 | Hackeling](https://gist.github.com/gavinmh/4735528/) and [this post from John
287 | Price](http://freshlyminted.co.uk/blog/2011/02/28/getting-band-and-artist-names-
288 | nltk/)), we're defining a method to perform the following steps:
289 |
290 | - take a string as input
291 | - tokenize it into sentences
292 | - tokenize the sentences into words
293 | - add part-of-speech tags to the words using nltk.pos_tag()
294 | - run this through the NLTK-provided NER classifier using
295 | nltk.ne_chunk()
296 | - parse these intermediate results and return any extracted entities
297 |
298 | We then apply this method to a sample sentence and parse the clunky output
299 | format provided by nltk.ne_chunk() (it comes as a
300 | [nltk.tree.Tree](http://www.nltk.org/_modules/nltk/tree.html)) to display the
301 | entities we've extracted. Don't let these nice results fool you -- NER output
302 | isn't always this satisfying. Try some other sample text and see what you get.
303 |
304 |
305 | def extract_entities(text):
306 | entities = []
307 | for sentence in nltk.sent_tokenize(text):
308 | chunks = nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(sentence)))
309 | entities.extend([chunk for chunk in chunks if hasattr(chunk, 'node')])
310 | return entities
311 |
312 | for entity in extract_entities('My name is Charlie and I work for Altamira in Tysons Corner.'):
313 | print '[' + entity.node + '] ' + ' '.join(c[0] for c in entity.leaves())
314 |
315 | If you're like me, you've grown accustomed over the years to working with the
316 | [Stanford NER](http://nlp.stanford.edu/software/CRF-NER.shtml) library for Java,
317 | and you're suspicious of NLTK's built-in NER classifier (especially because it
318 | has *chunk* in the name). Thankfully, recent versions of NLTK include a special
319 | NERTagger interface that enables us to make calls to Stanford NER
320 | from our Python programs, even though Stanford NER is a *Java library* (the
321 | horror!). [Not surprisingly](http://www.yurtopic.com/tech/programming/images
322 | /java-and-python.jpg), the Python NERTagger API is slightly less
323 | verbose than the native Java API for Stanford NER.
324 |
325 | To run this example, you'll need to follow the instructions for installing the
326 | optional Java libraries, as outlined in the **Initial Setup** section above.
327 | You'll also want to pay close attention to the comment that says # change
328 | the paths below to point to wherever you unzipped the Stanford NER download
329 | file.
330 |
331 |
332 | from nltk.tag.stanford import NERTagger
333 |
334 | # change the paths below to point to wherever you unzipped the Stanford NER download file
335 | st = NERTagger('/Users/cgreenba/stanford-ner/classifiers/english.all.3class.distsim.crf.ser.gz',
336 | '/Users/cgreenba/stanford-ner/stanford-ner.jar', 'utf-8')
337 |
338 | for i in st.tag('Up next is Tommy, who works at STPI in Washington.'.split()):
339 | print '[' + i[1] + '] ' + i[0]
340 |
341 | ## Automatic Summarization
342 | Now let's try to take some of what we've learned and build something potentially
343 | useful in real life: a program that will [automatically
344 | summarize](http://en.wikipedia.org/wiki/Automatic_summarization) documents. For
345 | this, we'll switch gears slightly, putting aside the web article we've been
346 | working on until now and instead using a corpus of documents distributed with
347 | NLTK.
348 |
349 | The Reuters Corpus contains nearly 11,000 news articles about a variety of
350 | topics and subjects. If you've run the nltk.download() command as
351 | previously recommended, you can then easily import and explore the Reuters
352 | Corpus like so:
353 |
354 |
355 | from nltk.corpus import reuters
356 |
357 | print '** BEGIN ARTICLE: ** \"' + reuters.raw(reuters.fileids()[0])[:500] + ' [...]\"'
358 |
359 | Our [painfully simplistic](http://anthology.aclweb.org/P/P11/P11-3014.pdf)
360 | automatic summarization tool will implement the following steps:
361 |
362 | - assign a score to each word in a document corresponding to its level of
363 | "importance"
364 | - rank each sentence in the document by summing the individual word scores and
365 | dividing by the number of tokens in the sentence (see the formula sketch below)
366 | - extract the top N highest scoring sentences and return them as our "summary"
367 |
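Spelled out as a quick formula sketch (where |s| is the number of tokens in
sentence s, and score(w) is the per-word "importance" score we still need to
define), the ranking step is simply:

$$\mathrm{score}(s) = \frac{1}{|s|} \sum_{w \in s} \mathrm{score}(w)$$
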
368 | Sounds easy enough, right? But before we can say "*voila!*," we'll need to
369 | figure out how to calculate an "importance" score for words. As we saw above
370 | with the stop words, simply counting the number of times a word appears in a
371 | document will not necessarily tell you which words are most important.
372 |
373 | #### Term Frequency - Inverse Document Frequency (TF-IDF)
374 |
375 | Consider a document that contains the word *baseball* 8 times. You might think,
376 | "wow, *baseball* isn't a stop word, and it appeared rather frequently here, so
377 | it's probably important." And you might be right. But what if that document is
378 | actually an article posted on a baseball blog? Won't the word *baseball* appear
379 | frequently in nearly every post on that blog? In this particular case, if you
380 | were generating a summary of this document, would the word *baseball* be a good
381 | indicator of importance, or would you maybe look for other words that help
382 | distinguish or differentiate this blog post from the rest?
383 |
384 | Context is essential. What really matters here isn't the raw frequency -- the
385 | number of times each word appeared in a document -- but rather the **relative
386 | frequency**, comparing the number of times a word appeared in this document
387 | against the number of times it appeared across the rest of the collection of
388 | documents. "Important" words will be the ones that are generally rare across the
389 | collection, but which appear with an unusually high frequency in a given
390 | document.
391 |
392 | We'll calculate this relative frequency using a statistical metric called [term
393 | frequency - inverse document frequency (TF-
394 | IDF)](http://en.wikipedia.org/wiki/Tf%E2%80%93idf). We could implement TF-IDF
395 | ourselves using NLTK, but rather than bore you with the math, we'll take a
396 | shortcut and use the TF-IDF implementation provided by the
397 | [scikit-learn](http://scikit-learn.org/) machine learning library for Python.
398 |
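If you're curious about what TF-IDF looks like under the hood, here's a minimal
sketch of one common textbook formulation, computed by hand over a made-up
three-document corpus. The toy_docs list and the tf_idf helper below are purely
illustrative (they're not part of the summarizer we're building), and
scikit-learn's TfidfVectorizer applies its own smoothing and normalization, so
its scores will differ slightly from these.


from __future__ import division
import math

# a tiny made-up "corpus" of three one-line documents (illustrative only)
toy_docs = ['the pitcher threw the baseball',
            'the baseball game went to extra innings',
            'the stock market closed higher today']

def tf_idf(term, doc, docs):
    # term frequency: how often the term appears in this document
    tf = doc.split().count(term) / len(doc.split())
    # document frequency: how many documents contain the term at all
    df = sum(1 for d in docs if term in d.split())
    # inverse document frequency: terms that are rare across the corpus get a boost
    idf = math.log(len(docs) / df)
    return tf * idf

# 'the' appears in every document (score 0), 'baseball' is distinctive for the
# first document (positive score), and 'stock' never appears in it at all (score 0)
for term in ['the', 'baseball', 'stock']:
    print term, tf_idf(term, toy_docs[0], toy_docs)
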
399 | #### Building a Term-Document Matrix
400 |
401 | We'll use scikit-learn's TfidfVectorizer class to construct a
402 | [term-document matrix](http://en.wikipedia.org/wiki/Document-term_matrix)
403 | containing the TF-IDF score for each word in each document in the Reuters
404 | Corpus. In essence, the rows of this sparse matrix correspond to documents in
405 | the corpus, the columns represent each word in the vocabulary of the corpus, and
406 | each cell contains the TF-IDF value for a given word in a given document.
407 |
408 | Inspired by a [computer science lab exercise from Duke University](http://www.cs
409 | .duke.edu/courses/spring14/compsci290/assignments/lab02.html), the code sample
410 | below iterates through the Reuters Corpus to build a dictionary of raw article
411 | text, then feeds it to the TfidfVectorizer along with our tokenizing/stemming
412 | function and scikit-learn's own built-in stop words list to generate the
413 | term-document matrix of TF-IDF scores.
414 |
415 |
416 | import datetime, re, sys
417 | from sklearn.feature_extraction.text import TfidfVectorizer
418 |
419 | def tokenize_and_stem(text):
420 | tokens = [word for sent in nltk.sent_tokenize(text) for word in nltk.word_tokenize(sent)]
421 | filtered_tokens = []
422 | # filter out any tokens not containing letters (e.g., numeric tokens, raw punctuation)
423 | for token in tokens:
424 | if re.search('[a-zA-Z]', token):
425 | filtered_tokens.append(token)
426 | stems = [stemmer.stem(t) for t in filtered_tokens]
427 | return stems
428 |
429 | token_dict = {}
430 | for article in reuters.fileids():
431 | token_dict[article] = reuters.raw(article)
432 |
433 | tfidf = TfidfVectorizer(tokenizer=tokenize_and_stem, stop_words='english', decode_error='ignore')
434 | print 'building term-document matrix... [process started: ' + str(datetime.datetime.now()) + ']'
435 | sys.stdout.flush()
436 |
437 | tdm = tfidf.fit_transform(token_dict.values()) # this can take some time (about 60 seconds on my machine)
438 | print 'done! [process finished: ' + str(datetime.datetime.now()) + ']'
439 |
440 | #### TF-IDF Scores
441 |
442 | Now that we've built the term-document matrix, we can explore its contents:
443 |
444 |
445 | from random import randint
446 |
447 | feature_names = tfidf.get_feature_names()
448 | print 'TDM contains ' + str(len(feature_names)) + ' terms and ' + str(tdm.shape[0]) + ' documents'
449 |
450 | print 'first term: ' + feature_names[0]
451 | print 'last term: ' + feature_names[len(feature_names) - 1]
452 |
453 | for i in range(0, 4):
454 | print 'random term: ' + feature_names[randint(1,len(feature_names) - 2)]
455 |
456 | #### Generating the Summary
457 |
458 | That's all we'll need to produce a summary for any document in the corpus. In
459 | the example code below, we start by randomly selecting an article from the
460 | Reuters Corpus. We iterate through the article, calculating a score for each
461 | sentence by summing the TF-IDF values for each word appearing in the sentence.
462 | We normalize the sentence scores by dividing by the number of tokens in the
463 | sentence (to avoid bias in favor of longer sentences). Then we sort the
464 | sentences by their scores, and return the highest-scoring sentences as our
465 | summary. The number of sentences returned corresponds to roughly 20% of the
466 | overall length of the article.
467 |
468 | Since some of the articles in the Reuters Corpus are rather small (i.e., a
469 | single sentence in length) or contain just raw financial data, some of the
470 | summaries won't make sense. If you run this code a few times, however, you'll
471 | eventually see a randomly-selected article that provides a decent demonstration
472 | of this simplistic method of identifying the "most important" sentences in a
473 | document.
474 |
475 |
476 | from __future__ import division  # __future__ imports must come before any other statements
477 | import math
478 |
479 | article_id = randint(0, tdm.shape[0] - 1)
480 | article_text = token_dict.values()[article_id]  # rows of tdm follow the order of token_dict.values()
481 |
482 | sent_scores = []
483 | for sentence in nltk.sent_tokenize(article_text):
484 | score = 0
485 | sent_tokens = tokenize_and_stem(sentence)
486 | for token in (t for t in sent_tokens if t in feature_names):
487 | score += tdm[article_id, feature_names.index(token)]
488 | sent_scores.append((score / max(len(sent_tokens), 1), sentence))  # avoid dividing by zero for sentences with no alphabetic tokens
489 |
490 | summary_length = int(math.ceil(len(sent_scores) / 5))
491 | sent_scores.sort(key=lambda sent: sent[0], reverse=True)  # highest-scoring sentences first
492 |
493 | print '*** SUMMARY ***'
494 | for summary_sentence in sent_scores[:summary_length]:
495 | print summary_sentence[1]
496 |
497 | print '\n*** ORIGINAL ***'
498 | print article_text
499 |
500 | #### Improving the Summary
501 | That was fairly easy, but how could we improve the quality of the generated
502 | summary? Perhaps we could boost the importance of words found in the title or
503 | any entities we're able to extract from the text. After initially selecting the
504 | highest-scoring sentence, we might discount the TF-IDF scores for duplicate
505 | words in the remaining sentences in an attempt to reduce repetitiveness. We
506 | could also look at cleaning up the sentences used to form the summary by fixing
507 | any pronouns missing an antecedent, or even pulling out partial phrases instead
508 | of complete sentences. The possibilities are virtually endless.
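As a rough (and entirely optional) illustration of the duplicate-word idea,
here's a sketch of a greedy re-ranker that discounts the TF-IDF contribution of
stems already used in the summary. It assumes the variables from the
summarization code above (article_id, article_text, feature_names, tdm,
tokenize_and_stem, and summary_length) are still defined, and the 0.5 discount
factor is an arbitrary choice, not a tuned value.


seen_tokens = set()
candidates = nltk.sent_tokenize(article_text)
summary = []

for _ in range(summary_length):
    best_score, best_sentence = -1, None
    for sentence in candidates:
        tokens = tokenize_and_stem(sentence)
        score = 0
        for token in (t for t in tokens if t in feature_names):
            # stems we've already used in the summary count for less
            weight = 0.5 if token in seen_tokens else 1.0
            score += weight * tdm[article_id, feature_names.index(token)]
        score = score / max(len(tokens), 1)
        if score > best_score:
            best_score, best_sentence = score, sentence
    # keep the best remaining sentence, then penalize its stems next time around
    summary.append(best_sentence)
    candidates.remove(best_sentence)
    seen_tokens.update(tokenize_and_stem(best_sentence))

print '*** RE-RANKED SUMMARY ***'
for sentence in summary:
    print sentence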
509 |
510 | ## Next Steps
511 | Want to learn more? Start by working your way through all the examples in the
512 | NLTK book (aka "the Whale book"):
513 |
514 | - [Natural Language Processing with Python
515 | (book)](http://oreilly.com/catalog/9780596516499/)
516 | - (free online version: [nltk.org/book](http://www.nltk.org/book/))
517 |
518 | ### Additional NLP Resources for Python
519 | - [NLTK HOWTOs](http://www.nltk.org/howto/)
520 | - [Python Text Processing with NLTK 2.0 Cookbook (book)](http://www.packtpub.com
521 | /python-text-processing-nltk-20-cookbook/book)
522 | - [Python wrapper for the Stanford CoreNLP Java
523 | library](https://pypi.python.org/pypi/corenlp)
524 | - [guess_language (Python library for language
525 | identification)](https://bitbucket.org/spirit/guess_language)
526 | - [MITIE (new C/C++-based NER library from MIT with a Python
527 | API)](https://github.com/mit-nlp/MITIE)
528 | - [gensim (topic modeling library for Python)](http://radimrehurek.com/gensim/)
529 |
530 | ### Attend future DC NLP meetups
531 |
532 | - [dcnlp.org](http://dcnlp.org/) | [@DCNLP](https://twitter.com/DCNLP/)
533 |
--------------------------------------------------------------------------------
/images/Scikit-learn_logo.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/charlieg/A-Smattering-of-NLP-in-Python/5145d0959fff7aaec5f3f3ec02458de0f3e815a0/images/Scikit-learn_logo.png
--------------------------------------------------------------------------------
/images/anaconda_logo_web.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/charlieg/A-Smattering-of-NLP-in-Python/5145d0959fff7aaec5f3f3ec02458de0f3e815a0/images/anaconda_logo_web.png
--------------------------------------------------------------------------------
/images/cat.gif:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/charlieg/A-Smattering-of-NLP-in-Python/5145d0959fff7aaec5f3f3ec02458de0f3e815a0/images/cat.gif
--------------------------------------------------------------------------------
/images/dcnlp.jpeg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/charlieg/A-Smattering-of-NLP-in-Python/5145d0959fff7aaec5f3f3ec02458de0f3e815a0/images/dcnlp.jpeg
--------------------------------------------------------------------------------
/images/i-was-told-there-would-be-no-math.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/charlieg/A-Smattering-of-NLP-in-Python/5145d0959fff7aaec5f3f3ec02458de0f3e815a0/images/i-was-told-there-would-be-no-math.jpg
--------------------------------------------------------------------------------
/images/no_time.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/charlieg/A-Smattering-of-NLP-in-Python/5145d0959fff7aaec5f3f3ec02458de0f3e815a0/images/no_time.jpg
--------------------------------------------------------------------------------
/images/python-powered-w-200x80.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/charlieg/A-Smattering-of-NLP-in-Python/5145d0959fff7aaec5f3f3ec02458de0f3e815a0/images/python-powered-w-200x80.png
--------------------------------------------------------------------------------
/images/stanford-nlp.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/charlieg/A-Smattering-of-NLP-in-Python/5145d0959fff7aaec5f3f3ec02458de0f3e815a0/images/stanford-nlp.jpg
--------------------------------------------------------------------------------