├── Assignment+1.ipynb
├── Assignment+2.ipynb
├── Assignment+3.ipynb
├── Assignment+4.ipynb
├── Case+Study+-+Sentiment+Analysis.ipynb
├── Module+2+(Python+3).ipynb
├── README.md
├── Regex+with+Pandas+and+Named+Groups.ipynb
├── Slides
├── Module 1
│ ├── 1.1_Introduction-to-Text-Mining.pdf
│ ├── 1.2_Handling-Text-in-Python.pdf
│ ├── 1.3_Regular-Expressions.pdf
│ └── 1.4-Internationalization-and-Issues-with-Non-ASCII-Characters.pdf
├── Module 2
│ ├── 2.1-Natural-Language-Processing.pdf
│ ├── 2.2-NLP-Tasks-with-NLTK.pdf
│ └── 2.3-NLP-Tasks-with-NLTK.pdf
├── Module 3
│ ├── 3.1_Text-Classification.pdf
│ ├── 3.2_Identifying-Features-from-Text.pdf
│ ├── 3.3_Naive-Bayes-Classifier.pdf
│ ├── 3.4_Naive-Bayes-Variations.pdf
│ ├── 3.5_Support-Vector-Machines.pdf
│ └── 3.6_Learning-Text-Classifiers-in-Python.pdf
├── Module 4
│ ├── 4.1_Semantic-Text-Similarity.pdf
│ ├── 4.2_Topic-Modeling.pdf
│ ├── 4.3_Generative-Models-and-LDA.pdf
│ └── 4.4_Information-Extraction.pdf
├── Working+With+Text.ipynb
├── dates.txt
├── moby.txt
├── newsgroups
├── paraphrases.csv
└── spam.csv
/Assignment+1.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "---\n",
8 | "\n",
9 | "_You are currently looking at **version 1.1** of this notebook. To download notebooks and datafiles, as well as get help on Jupyter notebooks in the Coursera platform, visit the [Jupyter Notebook FAQ](https://www.coursera.org/learn/python-text-mining/resources/d9pwm) course resource._\n",
10 | "\n",
11 | "---"
12 | ]
13 | },
14 | {
15 | "cell_type": "markdown",
16 | "metadata": {},
17 | "source": [
18 | "# Assignment 1\n",
19 | "\n",
20 | "In this assignment, you'll be working with messy medical data and using regex to extract relevant information from the data. \n",
21 | "\n",
22 | "Each line of the `dates.txt` file corresponds to a medical note. Each note has a date that needs to be extracted, but each date is encoded in one of many formats.\n",
23 | "\n",
24 | "The goal of this assignment is to correctly identify all of the different date variants encoded in this dataset and to properly normalize and sort the dates. \n",
25 | "\n",
26 | "Here is a list of some of the variants you might encounter in this dataset:\n",
27 | "* 04/20/2009; 04/20/09; 4/20/09; 4/3/09\n",
28 | "* Mar-20-2009; Mar 20, 2009; March 20, 2009; Mar. 20, 2009; Mar 20 2009;\n",
29 | "* 20 Mar 2009; 20 March 2009; 20 Mar. 2009; 20 March, 2009\n",
30 | "* Mar 20th, 2009; Mar 21st, 2009; Mar 22nd, 2009\n",
31 | "* Feb 2009; Sep 2009; Oct 2010\n",
32 | "* 6/2008; 12/2009\n",
33 | "* 2009; 2010\n",
34 | "\n",
35 | "Once you have extracted these date patterns from the text, the next step is to sort them in ascending chronological order according to the following rules:\n",
36 | "* Assume all dates in xx/xx/xx format are mm/dd/yy\n",
37 | "* Assume all dates where year is encoded in only two digits are years from the 1900's (e.g. 1/5/89 is January 5th, 1989)\n",
38 | "* If the day is missing (e.g. 9/2009), assume it is the first day of the month (e.g. September 1, 2009).\n",
39 | "* If the month is missing (e.g. 2010), assume it is the first of January of that year (e.g. January 1, 2010).\n",
40 | "* Watch out for potential typos as this is a raw, real-life derived dataset.\n",
41 | "\n",
42 | "With these rules in mind, find the correct date in each note and return a pandas Series in chronological order of the original Series' indices.\n",
43 | "\n",
44 | "For example if the original series was this:\n",
45 | "\n",
46 | " 0 1999\n",
47 | " 1 2010\n",
48 | " 2 1978\n",
49 | " 3 2015\n",
50 | " 4 1985\n",
51 | "\n",
52 | "Your function should return this:\n",
53 | "\n",
54 | " 0 2\n",
55 | " 1 4\n",
56 | " 2 0\n",
57 | " 3 1\n",
58 | " 4 3\n",
59 | "\n",
60 | "Your score will be calculated using [Kendall's tau](https://en.wikipedia.org/wiki/Kendall_rank_correlation_coefficient), a correlation measure for ordinal data.\n",
61 | "\n",
62 | "*This function should return a Series of length 500 and dtype int.*"
63 | ]
64 | },
65 | {
66 | "cell_type": "code",
67 | "execution_count": 1,
68 | "metadata": {},
69 | "outputs": [
70 | {
71 | "data": {
72 | "text/plain": [
73 | "0 03/25/93 Total time of visit (in minutes):\\n\n",
74 | "1 6/18/85 Primary Care Doctor:\\n\n",
75 | "2 sshe plans to move as of 7/8/71 In-Home Servic...\n",
76 | "3 7 on 9/27/75 Audit C Score Current:\\n\n",
77 | "4 2/6/96 sleep studyPain Treatment Pain Level (N...\n",
78 | "5 .Per 7/06/79 Movement D/O note:\\n\n",
79 | "6 4, 5/18/78 Patient's thoughts about current su...\n",
80 | "7 10/24/89 CPT Code: 90801 - Psychiatric Diagnos...\n",
81 | "8 3/7/86 SOS-10 Total Score:\\n\n",
82 | "9 (4/10/71)Score-1Audit C Score Current:\\n\n",
83 | "dtype: object"
84 | ]
85 | },
86 | "execution_count": 1,
87 | "metadata": {},
88 | "output_type": "execute_result"
89 | }
90 | ],
91 | "source": [
92 | "import pandas as pd\n",
93 | "\n",
94 | "doc = []\n",
95 | "with open('dates.txt') as file:\n",
96 | " for line in file:\n",
97 | " doc.append(line)\n",
98 | "\n",
99 | "df = pd.Series(doc)\n",
100 | "df.head(10)"
101 | ]
102 | },
103 | {
104 | "cell_type": "code",
105 | "execution_count": 2,
106 | "metadata": {
107 | "collapsed": true
108 | },
109 | "outputs": [],
110 | "source": [
111 | "def date_sorter():\n",
112 | " \n",
113 | "    regex1 = '(\\d{1,2}[/-]\\d{1,2}[/-]\\d{2,4})'  # e.g. 04/20/2009, 4/20/09, 4/3/09\n",
114 | "    regex2 = '((?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[\\S]*[+\\s]\\d{1,2}[,]{0,1}[+\\s]\\d{4})'  # e.g. Mar 20, 2009 or March 20 2009\n",
115 | "    regex3 = '(\\d{1,2}[+\\s](?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[\\S]*[+\\s]\\d{4})'  # e.g. 20 Mar 2009 or 20 March, 2009\n",
116 | "    regex4 = '((?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[\\S]*[+\\s]\\d{4})'  # e.g. Feb 2009, September 2010\n",
117 | "    regex5 = '(\\d{1,2}[/-][1|2]\\d{3})'  # e.g. 6/2008, 12/2009\n",
118 | "    regex6 = '([1|2]\\d{3})'  # a bare four-digit year, e.g. 2009\n",
119 | "    full_regex = '(%s|%s|%s|%s|%s|%s)' %(regex1, regex2, regex3, regex4, regex5, regex6)  # combine, most specific patterns first\n",
120 | "    parsed_date = df.str.extract(full_regex)  # first match in each note\n",
121 | "    parsed_date = parsed_date.iloc[:,0].str.replace('Janaury', 'January').str.replace('Decemeber', 'December')  # fix misspelled months in the raw data\n",
122 | "    parsed_date = pd.Series(pd.to_datetime(parsed_date))  # pandas fills a missing day or month with 1\n",
123 | "    parsed_date = parsed_date.sort_values(ascending=True).index  # chronological order of the original indices\n",
124 | "    return pd.Series(parsed_date.values)"
125 | ]
126 | },
127 | {
128 | "cell_type": "code",
129 | "execution_count": 3,
130 | "metadata": {},
131 | "outputs": [
132 | {
133 | "name": "stderr",
134 | "output_type": "stream",
135 | "text": [
136 | "/opt/conda/lib/python3.6/site-packages/ipykernel_launcher.py:10: FutureWarning: currently extract(expand=None) means expand=False (return Index/Series/DataFrame) but in a future version of pandas this will be changed to expand=True (return DataFrame)\n",
137 | " # Remove the CWD from sys.path while we load stuff.\n"
138 | ]
139 | },
140 | {
141 | "data": {
142 | "text/plain": [
143 | "0 9\n",
144 | "1 84\n",
145 | "2 2\n",
146 | "3 53\n",
147 | "4 28\n",
148 | "5 474\n",
149 | "6 153\n",
150 | "7 13\n",
151 | "8 129\n",
152 | "9 98\n",
153 | "10 111\n",
154 | "11 225\n",
155 | "12 31\n",
156 | "13 171\n",
157 | "14 191\n",
158 | "15 486\n",
159 | "16 335\n",
160 | "17 415\n",
161 | "18 36\n",
162 | "19 405\n",
163 | "20 323\n",
164 | "21 422\n",
165 | "22 375\n",
166 | "23 380\n",
167 | "24 345\n",
168 | "25 57\n",
169 | "26 481\n",
170 | "27 436\n",
171 | "28 104\n",
172 | "29 299\n",
173 | " ... \n",
174 | "470 220\n",
175 | "471 208\n",
176 | "472 243\n",
177 | "473 139\n",
178 | "474 320\n",
179 | "475 383\n",
180 | "476 244\n",
181 | "477 286\n",
182 | "478 480\n",
183 | "479 431\n",
184 | "480 279\n",
185 | "481 198\n",
186 | "482 381\n",
187 | "483 463\n",
188 | "484 366\n",
189 | "485 439\n",
190 | "486 255\n",
191 | "487 401\n",
192 | "488 475\n",
193 | "489 257\n",
194 | "490 152\n",
195 | "491 235\n",
196 | "492 464\n",
197 | "493 253\n",
198 | "494 427\n",
199 | "495 231\n",
200 | "496 141\n",
201 | "497 186\n",
202 | "498 161\n",
203 | "499 413\n",
204 | "Length: 500, dtype: int64"
205 | ]
206 | },
207 | "execution_count": 3,
208 | "metadata": {},
209 | "output_type": "execute_result"
210 | }
211 | ],
212 | "source": [
213 | "date_sorter()"
214 | ]
215 | }
216 | ],
217 | "metadata": {
218 | "coursera": {
219 | "course_slug": "python-text-mining",
220 | "graded_item_id": "LvcWI",
221 | "launcher_item_id": "krne9",
222 | "part_id": "Mkp1I"
223 | },
224 | "kernelspec": {
225 | "display_name": "Python 3",
226 | "language": "python",
227 | "name": "python3"
228 | },
229 | "language_info": {
230 | "codemirror_mode": {
231 | "name": "ipython",
232 | "version": 3
233 | },
234 | "file_extension": ".py",
235 | "mimetype": "text/x-python",
236 | "name": "python",
237 | "nbconvert_exporter": "python",
238 | "pygments_lexer": "ipython3",
239 | "version": "3.6.2"
240 | }
241 | },
242 | "nbformat": 4,
243 | "nbformat_minor": 2
244 | }
245 |
--------------------------------------------------------------------------------
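
Kendall's tau, the grading metric mentioned in Assignment 1 above, measures how well two rankings agree. Below is a minimal, hypothetical illustration (not the grader's exact code; it assumes scipy is installed, and the data and the `ranks` helper are made up for the example) of how the statistic behaves when an otherwise correct ordering has its last two notes swapped:

    from scipy.stats import kendalltau

    true_order      = [2, 4, 0, 1, 3]   # original indices sorted by true date (the example in Assignment 1)
    submitted_order = [2, 4, 0, 3, 1]   # a submission with the last two notes swapped

    def ranks(order):
        # Convert an ordering of indices into a rank for each original index.
        r = [0] * len(order)
        for position, idx in enumerate(order):
            r[idx] = position
        return r

    tau, p_value = kendalltau(ranks(true_order), ranks(submitted_order))
    print(tau)  # 1.0 would be perfect agreement; the single swap lowers it to 0.8
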
/Assignment+2.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "---\n",
8 | "\n",
9 | "_You are currently looking at **version 1.0** of this notebook. To download notebooks and datafiles, as well as get help on Jupyter notebooks in the Coursera platform, visit the [Jupyter Notebook FAQ](https://www.coursera.org/learn/python-text-mining/resources/d9pwm) course resource._\n",
10 | "\n",
11 | "---"
12 | ]
13 | },
14 | {
15 | "cell_type": "markdown",
16 | "metadata": {},
17 | "source": [
18 | "# Assignment 2 - Introduction to NLTK\n",
19 | "\n",
20 | "In part 1 of this assignment you will use nltk to explore the Herman Melville novel Moby Dick. Then in part 2 you will create a spelling recommender function that uses nltk to find words similar to the misspelling. "
21 | ]
22 | },
23 | {
24 | "cell_type": "markdown",
25 | "metadata": {},
26 | "source": [
27 | "## Part 1 - Analyzing Moby Dick"
28 | ]
29 | },
30 | {
31 | "cell_type": "code",
32 | "execution_count": 3,
33 | "metadata": {},
34 | "outputs": [
35 | {
36 | "name": "stdout",
37 | "output_type": "stream",
38 | "text": [
39 | "[nltk_data] Downloading package punkt to /home/jovyan/nltk_data...\n",
40 | "[nltk_data] Unzipping tokenizers/punkt.zip.\n",
41 | "[nltk_data] Downloading package wordnet to /home/jovyan/nltk_data...\n",
42 | "[nltk_data] Unzipping corpora/wordnet.zip.\n",
43 | "[nltk_data] Downloading package averaged_perceptron_tagger to\n",
44 | "[nltk_data] /home/jovyan/nltk_data...\n",
45 | "[nltk_data] Unzipping taggers/averaged_perceptron_tagger.zip.\n",
46 | "[nltk_data] Downloading package words to /home/jovyan/nltk_data...\n",
47 | "[nltk_data] Unzipping corpora/words.zip.\n"
48 | ]
49 | }
50 | ],
51 | "source": [
52 | "import nltk\n",
53 | "nltk.download('punkt')\n",
54 | "nltk.download('wordnet')\n",
55 | "nltk.download('averaged_perceptron_tagger')\n",
56 | "nltk.download('words')\n",
57 | "import pandas as pd\n",
58 | "import numpy as np\n",
59 | "\n",
60 | "# If you would like to work with the raw text you can use 'moby_raw'\n",
61 | "with open('moby.txt', 'r') as f:\n",
62 | " moby_raw = f.read()\n",
63 | " \n",
64 | "# If you would like to work with the novel in nltk.Text format you can use 'text1'\n",
65 | "moby_tokens = nltk.word_tokenize(moby_raw)\n",
66 | "text1 = nltk.Text(moby_tokens)"
67 | ]
68 | },
69 | {
70 | "cell_type": "markdown",
71 | "metadata": {},
72 | "source": [
73 | "### Example 1\n",
74 | "\n",
75 | "How many tokens (words and punctuation symbols) are in text1?\n",
76 | "\n",
77 | "*This function should return an integer.*"
78 | ]
79 | },
80 | {
81 | "cell_type": "code",
82 | "execution_count": 2,
83 | "metadata": {},
84 | "outputs": [
85 | {
86 | "data": {
87 | "text/plain": [
88 | "254989"
89 | ]
90 | },
91 | "execution_count": 2,
92 | "metadata": {},
93 | "output_type": "execute_result"
94 | }
95 | ],
96 | "source": [
97 | "def example_one():\n",
98 | " \n",
99 | " return len(nltk.word_tokenize(moby_raw)) # or alternatively len(text1)\n",
100 | "\n",
101 | "example_one()"
102 | ]
103 | },
104 | {
105 | "cell_type": "markdown",
106 | "metadata": {},
107 | "source": [
108 | "### Example 2\n",
109 | "\n",
110 | "How many unique tokens (unique words and punctuation) does text1 have?\n",
111 | "\n",
112 | "*This function should return an integer.*"
113 | ]
114 | },
115 | {
116 | "cell_type": "code",
117 | "execution_count": 3,
118 | "metadata": {},
119 | "outputs": [
120 | {
121 | "data": {
122 | "text/plain": [
123 | "20755"
124 | ]
125 | },
126 | "execution_count": 3,
127 | "metadata": {},
128 | "output_type": "execute_result"
129 | }
130 | ],
131 | "source": [
132 | "def example_two():\n",
133 | " \n",
134 | " return len(set(nltk.word_tokenize(moby_raw))) # or alternatively len(set(text1))\n",
135 | "\n",
136 | "example_two()"
137 | ]
138 | },
139 | {
140 | "cell_type": "markdown",
141 | "metadata": {},
142 | "source": [
143 | "### Example 3\n",
144 | "\n",
145 | "After lemmatizing the verbs, how many unique tokens does text1 have?\n",
146 | "\n",
147 | "*This function should return an integer.*"
148 | ]
149 | },
150 | {
151 | "cell_type": "code",
152 | "execution_count": 4,
153 | "metadata": {},
154 | "outputs": [
155 | {
156 | "data": {
157 | "text/plain": [
158 | "16900"
159 | ]
160 | },
161 | "execution_count": 4,
162 | "metadata": {},
163 | "output_type": "execute_result"
164 | }
165 | ],
166 | "source": [
167 | "from nltk.stem import WordNetLemmatizer\n",
168 | "\n",
169 | "def example_three():\n",
170 | "\n",
171 | " lemmatizer = WordNetLemmatizer()\n",
172 | " lemmatized = [lemmatizer.lemmatize(w,'v') for w in text1]\n",
173 | "\n",
174 | " return len(set(lemmatized))\n",
175 | "\n",
176 | "example_three()"
177 | ]
178 | },
179 | {
180 | "cell_type": "markdown",
181 | "metadata": {},
182 | "source": [
183 | "### Question 1\n",
184 | "\n",
185 | "What is the lexical diversity of the given text input? (i.e. ratio of unique tokens to the total number of tokens)\n",
186 | "\n",
187 | "*This function should return a float.*"
188 | ]
189 | },
190 | {
191 | "cell_type": "code",
192 | "execution_count": 5,
193 | "metadata": {},
194 | "outputs": [
195 | {
196 | "data": {
197 | "text/plain": [
198 | "0.08139566804842562"
199 | ]
200 | },
201 | "execution_count": 5,
202 | "metadata": {},
203 | "output_type": "execute_result"
204 | }
205 | ],
206 | "source": [
207 | "def answer_one():\n",
208 | " \n",
209 | " return len(set(text1))/len(text1)\n",
210 | "\n",
211 | "answer_one()"
212 | ]
213 | },
214 | {
215 | "cell_type": "markdown",
216 | "metadata": {},
217 | "source": [
218 | "### Question 2\n",
219 | "\n",
220 | "What percentage of tokens is 'whale' or 'Whale'?\n",
221 | "\n",
222 | "*This function should return a float.*"
223 | ]
224 | },
225 | {
226 | "cell_type": "code",
227 | "execution_count": 6,
228 | "metadata": {},
229 | "outputs": [
230 | {
231 | "data": {
232 | "text/plain": [
233 | "0.4125668166077752"
234 | ]
235 | },
236 | "execution_count": 6,
237 | "metadata": {},
238 | "output_type": "execute_result"
239 | }
240 | ],
241 | "source": [
242 | "def answer_two():\n",
243 | " \n",
244 | " dist = nltk.FreqDist(text1)\n",
245 | " return (dist['whale'] + dist['Whale'])/len(text1)*100\n",
246 | "\n",
247 | "answer_two()"
248 | ]
249 | },
250 | {
251 | "cell_type": "markdown",
252 | "metadata": {},
253 | "source": [
254 | "### Question 3\n",
255 | "\n",
256 | "What are the 20 most frequently occurring (unique) tokens in the text? What is their frequency?\n",
257 | "\n",
258 | "*This function should return a list of 20 tuples where each tuple is of the form `(token, frequency)`. The list should be sorted in descending order of frequency.*"
259 | ]
260 | },
261 | {
262 | "cell_type": "code",
263 | "execution_count": 7,
264 | "metadata": {},
265 | "outputs": [
266 | {
267 | "data": {
268 | "text/plain": [
269 | "[(',', 19204),\n",
270 | " ('the', 13715),\n",
271 | " ('.', 7308),\n",
272 | " ('of', 6513),\n",
273 | " ('and', 6010),\n",
274 | " ('a', 4545),\n",
275 | " ('to', 4515),\n",
276 | " (';', 4173),\n",
277 | " ('in', 3908),\n",
278 | " ('that', 2978),\n",
279 | " ('his', 2459),\n",
280 | " ('it', 2196),\n",
281 | " ('I', 2097),\n",
282 | " ('!', 1767),\n",
283 | " ('is', 1722),\n",
284 | " ('--', 1713),\n",
285 | " ('with', 1659),\n",
286 | " ('he', 1658),\n",
287 | " ('was', 1639),\n",
288 | " ('as', 1620)]"
289 | ]
290 | },
291 | "execution_count": 7,
292 | "metadata": {},
293 | "output_type": "execute_result"
294 | }
295 | ],
296 | "source": [
297 | "def answer_three():\n",
298 | " \n",
299 | " return nltk.FreqDist(text1).most_common(20)\n",
300 | "\n",
301 | "answer_three()"
302 | ]
303 | },
304 | {
305 | "cell_type": "markdown",
306 | "metadata": {},
307 | "source": [
308 | "### Question 4\n",
309 | "\n",
310 | "What tokens have a length of greater than 5 and frequency of more than 150?\n",
311 | "\n",
312 | "*This function should return an alphabetically sorted list of the tokens that match the above constraints. To sort your list, use `sorted()`*"
313 | ]
314 | },
315 | {
316 | "cell_type": "code",
317 | "execution_count": 26,
318 | "metadata": {},
319 | "outputs": [
320 | {
321 | "data": {
322 | "text/plain": [
323 | "['Captain',\n",
324 | " 'Pequod',\n",
325 | " 'Queequeg',\n",
326 | " 'Starbuck',\n",
327 | " 'almost',\n",
328 | " 'before',\n",
329 | " 'himself',\n",
330 | " 'little',\n",
331 | " 'seemed',\n",
332 | " 'should',\n",
333 | " 'though',\n",
334 | " 'through',\n",
335 | " 'whales',\n",
336 | " 'without']"
337 | ]
338 | },
339 | "execution_count": 26,
340 | "metadata": {},
341 | "output_type": "execute_result"
342 | }
343 | ],
344 | "source": [
345 | "def answer_four():\n",
346 | " \n",
347 | " dist = nltk.FreqDist(text1)\n",
348 | " vocab1 = dist.keys()\n",
349 | " freqwords = [w for w in vocab1 if len(w)>5 and dist[w] > 150]\n",
350 | " return sorted(freqwords)\n",
351 | "\n",
352 | "answer_four()"
353 | ]
354 | },
355 | {
356 | "cell_type": "markdown",
357 | "metadata": {},
358 | "source": [
359 | "### Question 5\n",
360 | "\n",
361 | "Find the longest word in text1 and that word's length.\n",
362 | "\n",
363 | "*This function should return a tuple `(longest_word, length)`.*"
364 | ]
365 | },
366 | {
367 | "cell_type": "code",
368 | "execution_count": 35,
369 | "metadata": {},
370 | "outputs": [
371 | {
372 | "data": {
373 | "text/plain": [
374 | "(\"twelve-o'clock-at-night\", 23)"
375 | ]
376 | },
377 | "execution_count": 35,
378 | "metadata": {},
379 | "output_type": "execute_result"
380 | }
381 | ],
382 | "source": [
383 | "def answer_five():\n",
384 | " \n",
385 | " maxlen = max(len(w) for w in text1)\n",
386 | " word = [w for w in text1 if len(w) == maxlen]\n",
387 | " return (word[0],maxlen)\n",
388 | "\n",
389 | "answer_five()"
390 | ]
391 | },
392 | {
393 | "cell_type": "markdown",
394 | "metadata": {},
395 | "source": [
396 | "### Question 6\n",
397 | "\n",
398 | "What unique words have a frequency of more than 2000? What is their frequency?\n",
399 | "\n",
400 | "\"Hint: you may want to use `isalpha()` to check if the token is a word and not punctuation.\"\n",
401 | "\n",
402 | "*This function should return a list of tuples of the form `(frequency, word)` sorted in descending order of frequency.*"
403 | ]
404 | },
405 | {
406 | "cell_type": "code",
407 | "execution_count": 11,
408 | "metadata": {},
409 | "outputs": [
410 | {
411 | "data": {
412 | "text/plain": [
413 | "[(13715, 'the'),\n",
414 | " (6513, 'of'),\n",
415 | " (6010, 'and'),\n",
416 | " (4545, 'a'),\n",
417 | " (4515, 'to'),\n",
418 | " (3908, 'in'),\n",
419 | " (2978, 'that'),\n",
420 | " (2459, 'his'),\n",
421 | " (2196, 'it'),\n",
422 | " (2097, 'I')]"
423 | ]
424 | },
425 | "execution_count": 11,
426 | "metadata": {},
427 | "output_type": "execute_result"
428 | }
429 | ],
430 | "source": [
431 | "def answer_six():\n",
432 | " dist = nltk.FreqDist(text1)\n",
433 | " freqwords = [(dist[w],w) for w in set(text1) if w.isalpha() and dist[w]>2000]\n",
434 | " return sorted(freqwords, key=lambda x:x[0],reverse=True)\n",
435 | "\n",
436 | "answer_six()"
437 | ]
438 | },
439 | {
440 | "cell_type": "markdown",
441 | "metadata": {},
442 | "source": [
443 | "### Question 7\n",
444 | "\n",
445 | "What is the average number of tokens per sentence?\n",
446 | "\n",
447 | "*This function should return a float.*"
448 | ]
449 | },
450 | {
451 | "cell_type": "code",
452 | "execution_count": 12,
453 | "metadata": {},
454 | "outputs": [
455 | {
456 | "data": {
457 | "text/plain": [
458 | "25.881952902963864"
459 | ]
460 | },
461 | "execution_count": 12,
462 | "metadata": {},
463 | "output_type": "execute_result"
464 | }
465 | ],
466 | "source": [
467 | "def answer_seven():\n",
468 | " text2 = nltk.sent_tokenize(moby_raw)\n",
469 | " return len(text1)/len(text2)\n",
470 | "\n",
471 | "answer_seven()"
472 | ]
473 | },
474 | {
475 | "cell_type": "markdown",
476 | "metadata": {},
477 | "source": [
478 | "### Question 8\n",
479 | "\n",
480 | "What are the 5 most frequent parts of speech in this text? What is their frequency?\n",
481 | "\n",
482 | "*This function should return a list of tuples of the form `(part_of_speech, frequency)` sorted in descending order of frequency.*"
483 | ]
484 | },
485 | {
486 | "cell_type": "code",
487 | "execution_count": 24,
488 | "metadata": {},
489 | "outputs": [
490 | {
491 | "data": {
492 | "text/plain": [
493 | "[('NN', 32730), ('IN', 28657), ('DT', 25867), (',', 19204), ('JJ', 17620)]"
494 | ]
495 | },
496 | "execution_count": 24,
497 | "metadata": {},
498 | "output_type": "execute_result"
499 | }
500 | ],
501 | "source": [
502 | "def answer_eight():\n",
503 | " \n",
504 | " import collections\n",
505 | " pos_list = nltk.pos_tag(text1)\n",
506 | " pos_counts = collections.Counter((subl[1] for subl in pos_list))\n",
507 | " return pos_counts.most_common(5)\n",
508 | "\n",
509 | "answer_eight()"
510 | ]
511 | },
512 | {
513 | "cell_type": "markdown",
514 | "metadata": {},
515 | "source": [
516 | "## Part 2 - Spelling Recommender\n",
517 | "\n",
518 | "For this part of the assignment you will create three different spelling recommenders, each of which takes a list of misspelled words and recommends a correctly spelled word for every word in the list.\n",
519 | "\n",
520 | "For every misspelled word, the recommender should find the word in `correct_spellings` that has the shortest distance* and starts with the same letter as the misspelled word, and return that word as a recommendation.\n",
521 | "\n",
522 | "*Each of the three different recommenders will use a different distance measure (outlined below).\n",
523 | "\n",
524 | "Each of the recommenders should provide recommendations for the three default words provided: `['cormulent', 'incendenece', 'validrate']`."
525 | ]
526 | },
527 | {
528 | "cell_type": "code",
529 | "execution_count": 4,
530 | "metadata": {},
531 | "outputs": [],
532 | "source": [
533 | "from nltk.corpus import words\n",
534 | "\n",
535 | "correct_spellings = words.words()"
536 | ]
537 | },
538 | {
539 | "cell_type": "markdown",
540 | "metadata": {},
541 | "source": [
542 | "### Question 9\n",
543 | "\n",
544 | "For this recommender, your function should provide recommendations for the three default words provided above using the following distance metric:\n",
545 | "\n",
546 | "**[Jaccard distance](https://en.wikipedia.org/wiki/Jaccard_index) on the trigrams of the two words.**\n",
547 | "\n",
548 | "*This function should return a list of length three:\n",
549 | "`['cormulent_reccomendation', 'incendenece_reccomendation', 'validrate_reccomendation']`.*"
550 | ]
551 | },
552 | {
553 | "cell_type": "code",
554 | "execution_count": 6,
555 | "metadata": {},
556 | "outputs": [
557 | {
558 | "name": "stderr",
559 | "output_type": "stream",
560 | "text": [
561 | "/opt/conda/lib/python3.6/site-packages/ipykernel_launcher.py:9: DeprecationWarning: generator 'ngrams' raised StopIteration\n",
562 | " if __name__ == '__main__':\n"
563 | ]
564 | },
565 | {
566 | "data": {
567 | "text/plain": [
568 | "['corpulent', 'indecence', 'validate']"
569 | ]
570 | },
571 | "execution_count": 6,
572 | "metadata": {},
573 | "output_type": "execute_result"
574 | }
575 | ],
576 | "source": [
577 | "def answer_nine(entries=['cormulent', 'incendenece', 'validrate']):\n",
578 | " \n",
579 | " recommend = []\n",
580 | " for entry in entries:\n",
581 | " \n",
582 | " # Match first letter. input_spell contains all words in correct_spellings with the same first letter.\n",
583 | " input_spell = [x for x in correct_spellings if x[0] == entry[0]]\n",
584 | " \n",
585 | " # Find the jaccard distance between the entry word and every word in correct_spellings with the same first letter.\n",
586 | " jaccard_dist = [nltk.jaccard_distance(set(nltk.ngrams(entry,n=3)), set(nltk.ngrams(x,n=3))) for x in input_spell]\n",
587 | " \n",
588 | " # Recommend the word in input_spell with the minimum Jaccard distance.\n",
589 | " recommend.append(input_spell[np.argmin(jaccard_dist)])\n",
590 | " \n",
591 | " return recommend\n",
592 | " \n",
593 | "answer_nine()"
594 | ]
595 | },
596 | {
597 | "cell_type": "markdown",
598 | "metadata": {},
599 | "source": [
600 | "### Question 10\n",
601 | "\n",
602 | "For this recommender, your function should provide recommendations for the three default words provided above using the following distance metric:\n",
603 | "\n",
604 | "**[Jaccard distance](https://en.wikipedia.org/wiki/Jaccard_index) on the 4-grams of the two words.**\n",
605 | "\n",
606 | "*This function should return a list of length three:\n",
607 | "`['cormulent_reccomendation', 'incendenece_reccomendation', 'validrate_reccomendation']`.*"
608 | ]
609 | },
610 | {
611 | "cell_type": "code",
612 | "execution_count": 7,
613 | "metadata": {},
614 | "outputs": [
615 | {
616 | "name": "stderr",
617 | "output_type": "stream",
618 | "text": [
619 | "/opt/conda/lib/python3.6/site-packages/ipykernel_launcher.py:10: DeprecationWarning: generator 'ngrams' raised StopIteration\n",
620 | " # Remove the CWD from sys.path while we load stuff.\n"
621 | ]
622 | },
623 | {
624 | "data": {
625 | "text/plain": [
626 | "['cormus', 'incendiary', 'valid']"
627 | ]
628 | },
629 | "execution_count": 7,
630 | "metadata": {},
631 | "output_type": "execute_result"
632 | }
633 | ],
634 | "source": [
635 | "def answer_ten(entries=['cormulent', 'incendenece', 'validrate']):\n",
636 | " \n",
637 | " recommend = []\n",
638 | " for entry in entries:\n",
639 | " \n",
640 | " # Match first letter. input_spell contains all words in correct_spellings with the same first letter.\n",
641 | " input_spell = [x for x in correct_spellings if x[0] == entry[0]]\n",
642 | " \n",
643 | " # Find the jaccard distance between the entry word and every word in correct_spellings with the same first letter.\n",
644 | " jaccard_dist = [nltk.jaccard_distance(set(nltk.ngrams(entry,n=4)), set(nltk.ngrams(x,n=4))) for x in input_spell]\n",
645 | " \n",
646 | " # Recommend the word in input_spell with the minimum Jaccard distance.\n",
647 | " recommend.append(input_spell[np.argmin(jaccard_dist)])\n",
648 | " \n",
649 | " return recommend\n",
650 | " \n",
651 | "answer_ten()"
652 | ]
653 | },
654 | {
655 | "cell_type": "markdown",
656 | "metadata": {},
657 | "source": [
658 | "### Question 11\n",
659 | "\n",
660 | "For this recommender, your function should provide recommendations for the three default words provided above using the following distance metric:\n",
661 | "\n",
662 | "**[Edit distance on the two words with transpositions.](https://en.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_distance)**\n",
663 | "\n",
664 | "*This function should return a list of length three:\n",
665 | "`['cormulent_reccomendation', 'incendenece_reccomendation', 'validrate_reccomendation']`.*"
666 | ]
667 | },
668 | {
669 | "cell_type": "code",
670 | "execution_count": 9,
671 | "metadata": {},
672 | "outputs": [
673 | {
674 | "data": {
675 | "text/plain": [
676 | "['corpulent', 'intendence', 'validate']"
677 | ]
678 | },
679 | "execution_count": 9,
680 | "metadata": {},
681 | "output_type": "execute_result"
682 | }
683 | ],
684 | "source": [
685 | "def answer_eleven(entries=['cormulent', 'incendenece', 'validrate']):\n",
686 | " \n",
687 | " recommend = []\n",
688 | " for entry in entries:\n",
689 | " \n",
690 | " # Match first letter. input_spell contains all words in correct_spellings with the same first letter.\n",
691 | " input_spell = [x for x in correct_spellings if x[0] == entry[0]]\n",
692 | " \n",
693 | "    # Find the edit distance (with transpositions) between the entry word and every word in correct_spellings with the same first letter.\n",
694 | " DL_dist = [nltk.edit_distance(x, entry, transpositions=True) for x in input_spell]\n",
695 | " \n",
696 | "    # Recommend the word in input_spell with the minimum edit distance.\n",
697 | " recommend.append(input_spell[np.argmin(DL_dist)])\n",
698 | " \n",
699 | " return recommend\n",
700 | " \n",
701 | "answer_eleven()"
702 | ]
703 | },
704 | {
705 | "cell_type": "code",
706 | "execution_count": null,
707 | "metadata": {
708 | "collapsed": true
709 | },
710 | "outputs": [],
711 | "source": []
712 | }
713 | ],
714 | "metadata": {
715 | "coursera": {
716 | "course_slug": "python-text-mining",
717 | "graded_item_id": "r35En",
718 | "launcher_item_id": "tCVfW",
719 | "part_id": "NTVgL"
720 | },
721 | "kernelspec": {
722 | "display_name": "Python 3",
723 | "language": "python",
724 | "name": "python3"
725 | },
726 | "language_info": {
727 | "codemirror_mode": {
728 | "name": "ipython",
729 | "version": 3
730 | },
731 | "file_extension": ".py",
732 | "mimetype": "text/x-python",
733 | "name": "python",
734 | "nbconvert_exporter": "python",
735 | "pygments_lexer": "ipython3",
736 | "version": "3.6.2"
737 | }
738 | },
739 | "nbformat": 4,
740 | "nbformat_minor": 2
741 | }
742 |
--------------------------------------------------------------------------------
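
The three spelling recommenders in Assignment 2 above share the same structure and differ only in their distance measure. Here is a consolidated sketch of that shared pattern (not part of the graded notebook; the function and variable names are illustrative, and it assumes the NLTK `words` corpus is available, as in the notebook):

    import nltk
    from nltk.corpus import words

    nltk.download('words')  # reports "up-to-date" if the corpus is already installed
    correct_spellings = words.words()

    def recommend(entry, distance):
        # Keep only candidate words that start with the same letter as the
        # misspelling, then return the candidate with the smallest distance.
        candidates = [w for w in correct_spellings if w[0] == entry[0]]
        return min(candidates, key=lambda w: distance(entry, w))

    # The only thing that changes between Questions 9, 10 and 11 is the distance:
    jaccard3 = lambda a, b: nltk.jaccard_distance(set(nltk.ngrams(a, 3)), set(nltk.ngrams(b, 3)))
    jaccard4 = lambda a, b: nltk.jaccard_distance(set(nltk.ngrams(a, 4)), set(nltk.ngrams(b, 4)))
    edit_dl  = lambda a, b: nltk.edit_distance(a, b, transpositions=True)

    print([recommend(w, jaccard3) for w in ['cormulent', 'incendenece', 'validrate']])
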
/Assignment+3.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "---\n",
8 | "\n",
9 | "_You are currently looking at **version 1.1** of this notebook. To download notebooks and datafiles, as well as get help on Jupyter notebooks in the Coursera platform, visit the [Jupyter Notebook FAQ](https://www.coursera.org/learn/python-text-mining/resources/d9pwm) course resource._\n",
10 | "\n",
11 | "---"
12 | ]
13 | },
14 | {
15 | "cell_type": "markdown",
16 | "metadata": {},
17 | "source": [
18 | "# Assignment 3\n",
19 | "\n",
20 | "In this assignment you will explore text message data and create models to predict if a message is spam or not. "
21 | ]
22 | },
23 | {
24 | "cell_type": "code",
25 | "execution_count": 1,
26 | "metadata": {},
27 | "outputs": [
28 | {
29 | "data": {
108 | "text/plain": [
109 | " text target\n",
110 | "0 Go until jurong point, crazy.. Available only ... 0\n",
111 | "1 Ok lar... Joking wif u oni... 0\n",
112 | "2 Free entry in 2 a wkly comp to win FA Cup fina... 1\n",
113 | "3 U dun say so early hor... U c already then say... 0\n",
114 | "4 Nah I don't think he goes to usf, he lives aro... 0\n",
115 | "5 FreeMsg Hey there darling it's been 3 week's n... 1\n",
116 | "6 Even my brother is not like to speak with me. ... 0\n",
117 | "7 As per your request 'Melle Melle (Oru Minnamin... 0\n",
118 | "8 WINNER!! As a valued network customer you have... 1\n",
119 | "9 Had your mobile 11 months or more? U R entitle... 1"
120 | ]
121 | },
122 | "execution_count": 1,
123 | "metadata": {},
124 | "output_type": "execute_result"
125 | }
126 | ],
127 | "source": [
128 | "import pandas as pd\n",
129 | "import numpy as np\n",
130 | "\n",
131 | "spam_data = pd.read_csv('spam.csv')\n",
132 | "\n",
133 | "spam_data['target'] = np.where(spam_data['target']=='spam',1,0)\n",
134 | "spam_data.head(10)"
135 | ]
136 | },
137 | {
138 | "cell_type": "code",
139 | "execution_count": 2,
140 | "metadata": {
141 | "collapsed": true
142 | },
143 | "outputs": [],
144 | "source": [
145 | "from sklearn.model_selection import train_test_split\n",
146 | "\n",
147 | "\n",
148 | "X_train, X_test, y_train, y_test = train_test_split(spam_data['text'], \n",
149 | " spam_data['target'], \n",
150 | " random_state=0)"
151 | ]
152 | },
153 | {
154 | "cell_type": "markdown",
155 | "metadata": {},
156 | "source": [
157 | "### Question 1\n",
158 | "What percentage of the documents in `spam_data` are spam?\n",
159 | "\n",
160 | "*This function should return a float, the percent value (i.e. $ratio * 100$).*"
161 | ]
162 | },
163 | {
164 | "cell_type": "code",
165 | "execution_count": 3,
166 | "metadata": {
167 | "collapsed": true
168 | },
169 | "outputs": [],
170 | "source": [
171 | "def answer_one():\n",
172 | " \n",
173 | " return len(spam_data[spam_data['target']==1])/len(spam_data['target'])*100"
174 | ]
175 | },
176 | {
177 | "cell_type": "code",
178 | "execution_count": 4,
179 | "metadata": {},
180 | "outputs": [
181 | {
182 | "data": {
183 | "text/plain": [
184 | "13.406317300789663"
185 | ]
186 | },
187 | "execution_count": 4,
188 | "metadata": {},
189 | "output_type": "execute_result"
190 | }
191 | ],
192 | "source": [
193 | "answer_one()"
194 | ]
195 | },
196 | {
197 | "cell_type": "markdown",
198 | "metadata": {},
199 | "source": [
200 | "### Question 2\n",
201 | "\n",
202 | "Fit the training data `X_train` using a Count Vectorizer with default parameters.\n",
203 | "\n",
204 | "What is the longest token in the vocabulary?\n",
205 | "\n",
206 | "*This function should return a string.*"
207 | ]
208 | },
209 | {
210 | "cell_type": "code",
211 | "execution_count": 5,
212 | "metadata": {
213 | "collapsed": true
214 | },
215 | "outputs": [],
216 | "source": [
217 | "from sklearn.feature_extraction.text import CountVectorizer\n",
218 | "\n",
219 | "def answer_two():\n",
220 | " \n",
221 | " # List all tokens and their counts as a dictionary:\n",
222 | " vocabulary = CountVectorizer().fit(X_train).vocabulary_\n",
223 | " \n",
224 | " # You want only the keys, i.e, the words:\n",
225 | " vocabulary = [x for x in vocabulary.keys()]\n",
226 | " \n",
227 | " # Store the lengths in a separate list:\n",
228 | " len_vocabulary = [len(x) for x in vocabulary]\n",
229 | " \n",
230 | " # Use the index of the longest token:\n",
231 | " return vocabulary[np.argmax(len_vocabulary)]\n"
232 | ]
233 | },
234 | {
235 | "cell_type": "code",
236 | "execution_count": 6,
237 | "metadata": {},
238 | "outputs": [
239 | {
240 | "data": {
241 | "text/plain": [
242 | "'com1win150ppmx3age16subscription'"
243 | ]
244 | },
245 | "execution_count": 6,
246 | "metadata": {},
247 | "output_type": "execute_result"
248 | }
249 | ],
250 | "source": [
251 | "answer_two()"
252 | ]
253 | },
254 | {
255 | "cell_type": "markdown",
256 | "metadata": {},
257 | "source": [
258 | "### Question 3\n",
259 | "\n",
260 | "Fit and transform the training data `X_train` using a Count Vectorizer with default parameters.\n",
261 | "\n",
262 | "Next, fit a multinomial Naive Bayes classifier model with smoothing `alpha=0.1`. Find the area under the curve (AUC) score using the transformed test data.\n",
263 | "\n",
264 | "*This function should return the AUC score as a float.*"
265 | ]
266 | },
267 | {
268 | "cell_type": "code",
269 | "execution_count": 7,
270 | "metadata": {
271 | "collapsed": true
272 | },
273 | "outputs": [],
274 | "source": [
275 | "from sklearn.naive_bayes import MultinomialNB\n",
276 | "from sklearn.metrics import roc_auc_score\n",
277 | "\n",
278 | "def answer_three():\n",
279 | " \n",
280 | " cv = CountVectorizer().fit(X_train)\n",
281 | " \n",
282 | " # Transform both X_train and X_test with the same CV object:\n",
283 | " X_train_cv = cv.transform(X_train)\n",
284 | " X_test_cv = cv.transform(X_test)\n",
285 | " \n",
286 | " # Classifier for prediction:\n",
287 | " clf = MultinomialNB(alpha=0.1)\n",
288 | " clf.fit(X_train_cv, y_train)\n",
289 | " preds_test = clf.predict(X_test_cv)\n",
290 | " \n",
291 | " return roc_auc_score(y_test, preds_test)\n"
292 | ]
293 | },
294 | {
295 | "cell_type": "code",
296 | "execution_count": 8,
297 | "metadata": {},
298 | "outputs": [
299 | {
300 | "data": {
301 | "text/plain": [
302 | "0.97208121827411165"
303 | ]
304 | },
305 | "execution_count": 8,
306 | "metadata": {},
307 | "output_type": "execute_result"
308 | }
309 | ],
310 | "source": [
311 | "answer_three()"
312 | ]
313 | },
314 | {
315 | "cell_type": "markdown",
316 | "metadata": {},
317 | "source": [
318 | "### Question 4\n",
319 | "\n",
320 | "Fit and transform the training data `X_train` using a Tfidf Vectorizer with default parameters.\n",
321 | "\n",
322 | "What 20 features have the smallest tf-idf and what 20 have the largest tf-idf?\n",
323 | "\n",
324 | "Put these features in two series where each series is sorted by tf-idf value and then alphabetically by feature name. The index of the series should be the feature name, and the data should be the tf-idf.\n",
325 | "\n",
326 | "The series of 20 features with smallest tf-idfs should be sorted smallest tfidf first, the list of 20 features with largest tf-idfs should be sorted largest first. \n",
327 | "\n",
328 | "*This function should return a tuple of two series\n",
329 | "`(smallest tf-idfs series, largest tf-idfs series)`.*"
330 | ]
331 | },
332 | {
333 | "cell_type": "code",
334 | "execution_count": 9,
335 | "metadata": {
336 | "collapsed": true
337 | },
338 | "outputs": [],
339 | "source": [
340 | "from sklearn.feature_extraction.text import TfidfVectorizer\n",
341 | "\n",
342 | "def answer_four():\n",
343 | " \n",
344 | " tfidf = TfidfVectorizer().fit(X_train)\n",
345 | " feature_names = np.array(tfidf.get_feature_names())\n",
346 | " \n",
347 | " X_train_tf = tfidf.transform(X_train)\n",
348 | " \n",
349 | " max_tf_idfs = X_train_tf.max(0).toarray()[0] # Get largest tfidf values across all documents.\n",
350 | " sorted_tf_idxs = max_tf_idfs.argsort() # Sorted indices\n",
351 | " sorted_tf_idfs = max_tf_idfs[sorted_tf_idxs] # Sorted TFIDF values\n",
352 | " \n",
353 | " # feature_names doesn't need to be sorted! You just access it with a list of sorted indices!\n",
354 | " smallest_tf_idfs = pd.Series(sorted_tf_idfs[:20], index=feature_names[sorted_tf_idxs[:20]]) \n",
355 | " largest_tf_idfs = pd.Series(sorted_tf_idfs[-20:][::-1], index=feature_names[sorted_tf_idxs[-20:][::-1]])\n",
356 | " \n",
357 | " return (smallest_tf_idfs, largest_tf_idfs)"
358 | ]
359 | },
360 | {
361 | "cell_type": "code",
362 | "execution_count": 10,
363 | "metadata": {},
364 | "outputs": [
365 | {
366 | "data": {
367 | "text/plain": [
368 | "(sympathetic 0.074475\n",
369 | " healer 0.074475\n",
370 | " aaniye 0.074475\n",
371 | " dependable 0.074475\n",
372 | " companion 0.074475\n",
373 | " listener 0.074475\n",
374 | " athletic 0.074475\n",
375 | " exterminator 0.074475\n",
376 | " psychiatrist 0.074475\n",
377 | " pest 0.074475\n",
378 | " determined 0.074475\n",
379 | " chef 0.074475\n",
380 | " courageous 0.074475\n",
381 | " stylist 0.074475\n",
382 | " psychologist 0.074475\n",
383 | " organizer 0.074475\n",
384 | " pudunga 0.074475\n",
385 | " venaam 0.074475\n",
386 | " diwali 0.091250\n",
387 | " mornings 0.091250\n",
388 | " dtype: float64, 146tf150p 1.000000\n",
389 | " havent 1.000000\n",
390 | " home 1.000000\n",
391 | " okie 1.000000\n",
392 | " thanx 1.000000\n",
393 | " er 1.000000\n",
394 | " anything 1.000000\n",
395 | " lei 1.000000\n",
396 | " nite 1.000000\n",
397 | " yup 1.000000\n",
398 | " thank 1.000000\n",
399 | " ok 1.000000\n",
400 | " where 1.000000\n",
401 | " beerage 1.000000\n",
402 | " anytime 1.000000\n",
403 | " too 1.000000\n",
404 | " done 1.000000\n",
405 | " 645 1.000000\n",
406 | " tick 0.980166\n",
407 | " blank 0.932702\n",
408 | " dtype: float64)"
409 | ]
410 | },
411 | "execution_count": 10,
412 | "metadata": {},
413 | "output_type": "execute_result"
414 | }
415 | ],
416 | "source": [
417 | "answer_four()"
418 | ]
419 | },
420 | {
421 | "cell_type": "markdown",
422 | "metadata": {},
423 | "source": [
424 | "### Question 5\n",
425 | "\n",
426 | "Fit and transform the training data `X_train` using a Tfidf Vectorizer ignoring terms that have a document frequency strictly lower than **3**.\n",
427 | "\n",
428 | "Then fit a multinomial Naive Bayes classifier model with smoothing `alpha=0.1` and compute the area under the curve (AUC) score using the transformed test data.\n",
429 | "\n",
430 | "*This function should return the AUC score as a float.*"
431 | ]
432 | },
433 | {
434 | "cell_type": "code",
435 | "execution_count": 11,
436 | "metadata": {
437 | "collapsed": true
438 | },
439 | "outputs": [],
440 | "source": [
441 | "def answer_five():\n",
442 | " \n",
443 | " tf = TfidfVectorizer(min_df=3).fit(X_train)\n",
444 | " X_train_tf = tf.transform(X_train)\n",
445 | " X_test_tf = tf.transform(X_test)\n",
446 | " clf = MultinomialNB(alpha=0.1)\n",
447 | " clf.fit(X_train_tf, y_train)\n",
448 | " pred = clf.predict(X_test_tf)\n",
449 | " return roc_auc_score(y_test, pred)"
450 | ]
451 | },
452 | {
453 | "cell_type": "code",
454 | "execution_count": 12,
455 | "metadata": {},
456 | "outputs": [
457 | {
458 | "data": {
459 | "text/plain": [
460 | "0.94162436548223349"
461 | ]
462 | },
463 | "execution_count": 12,
464 | "metadata": {},
465 | "output_type": "execute_result"
466 | }
467 | ],
468 | "source": [
469 | "answer_five()"
470 | ]
471 | },
472 | {
473 | "cell_type": "markdown",
474 | "metadata": {},
475 | "source": [
476 | "### Question 6\n",
477 | "\n",
478 | "What is the average length of documents (number of characters) for not spam and spam documents?\n",
479 | "\n",
480 | "*This function should return a tuple (average length not spam, average length spam).*"
481 | ]
482 | },
483 | {
484 | "cell_type": "code",
485 | "execution_count": 13,
486 | "metadata": {
487 | "collapsed": true
488 | },
489 | "outputs": [],
490 | "source": [
491 | "def answer_six():\n",
492 | " \n",
493 | " len_spam = [len(x) for x in spam_data.loc[spam_data['target']==1, 'text']]\n",
494 | " len_not_spam = [len(x) for x in spam_data.loc[spam_data['target']==0, 'text']]\n",
495 | " return (np.mean(len_not_spam), np.mean(len_spam))"
496 | ]
497 | },
498 | {
499 | "cell_type": "code",
500 | "execution_count": 14,
501 | "metadata": {},
502 | "outputs": [
503 | {
504 | "data": {
505 | "text/plain": [
506 | "(71.023626943005183, 138.8661311914324)"
507 | ]
508 | },
509 | "execution_count": 14,
510 | "metadata": {},
511 | "output_type": "execute_result"
512 | }
513 | ],
514 | "source": [
515 | "answer_six()"
516 | ]
517 | },
518 | {
519 | "cell_type": "markdown",
520 | "metadata": {},
521 | "source": [
522 | "<br>\n",
523 | "<br>\n",
524 | "The following function has been provided to help you combine new features into the training data:"
525 | ]
526 | },
527 | {
528 | "cell_type": "code",
529 | "execution_count": 15,
530 | "metadata": {
531 | "collapsed": true
532 | },
533 | "outputs": [],
534 | "source": [
535 | "def add_feature(X, feature_to_add):\n",
536 | " \"\"\"\n",
537 | " Returns sparse feature matrix with added feature.\n",
538 | " feature_to_add can also be a list of features.\n",
539 | " \"\"\"\n",
540 | " from scipy.sparse import csr_matrix, hstack\n",
541 | " return hstack([X, csr_matrix(feature_to_add).T], 'csr')"
542 | ]
543 | },
544 | {
545 | "cell_type": "markdown",
546 | "metadata": {},
547 | "source": [
548 | "### Question 7\n",
549 | "\n",
550 | "Fit and transform the training data X_train using a Tfidf Vectorizer ignoring terms that have a document frequency strictly lower than **5**.\n",
551 | "\n",
552 | "Using this document-term matrix and an additional feature, **the length of document (number of characters)**, fit a Support Vector Classification model with regularization `C=10000`. Then compute the area under the curve (AUC) score using the transformed test data.\n",
553 | "\n",
554 | "*This function should return the AUC score as a float.*"
555 | ]
556 | },
557 | {
558 | "cell_type": "code",
559 | "execution_count": 16,
560 | "metadata": {
561 | "collapsed": true
562 | },
563 | "outputs": [],
564 | "source": [
565 | "from sklearn.svm import SVC\n",
566 | "\n",
567 | "def answer_seven():\n",
568 | " \n",
569 | " len_train = [len(x) for x in X_train]\n",
570 | " len_test = [len(x) for x in X_test]\n",
571 | " \n",
572 | " tf = TfidfVectorizer(min_df=5).fit(X_train)\n",
573 | " X_train_tf = tf.transform(X_train)\n",
574 | " X_test_tf = tf.transform(X_test)\n",
575 | " \n",
576 | " X_train_tf = add_feature(X_train_tf, len_train)\n",
577 | " X_test_tf = add_feature(X_test_tf, len_test)\n",
578 | " \n",
579 | " clf = SVC(C=10000)\n",
580 | " clf.fit(X_train_tf, y_train)\n",
581 | " pred = clf.predict(X_test_tf)\n",
582 | " \n",
583 | " return roc_auc_score(y_test, pred)"
584 | ]
585 | },
586 | {
587 | "cell_type": "code",
588 | "execution_count": 17,
589 | "metadata": {},
590 | "outputs": [
591 | {
592 | "data": {
593 | "text/plain": [
594 | "0.95813668234215565"
595 | ]
596 | },
597 | "execution_count": 17,
598 | "metadata": {},
599 | "output_type": "execute_result"
600 | }
601 | ],
602 | "source": [
603 | "answer_seven()"
604 | ]
605 | },
606 | {
607 | "cell_type": "markdown",
608 | "metadata": {},
609 | "source": [
610 | "### Question 8\n",
611 | "\n",
612 | "What is the average number of digits per document for not spam and spam documents?\n",
613 | "\n",
614 | "*This function should return a tuple (average # digits not spam, average # digits spam).*"
615 | ]
616 | },
617 | {
618 | "cell_type": "code",
619 | "execution_count": 18,
620 | "metadata": {
621 | "collapsed": true
622 | },
623 | "outputs": [],
624 | "source": [
625 | "def answer_eight():\n",
626 | " \n",
627 | " dig_spam = [sum(char.isnumeric() for char in x) for x in spam_data.loc[spam_data['target']==1,'text']]\n",
628 | " dig_not_spam = [sum(char.isnumeric() for char in x) for x in spam_data.loc[spam_data['target']==0,'text']]\n",
629 | " return (np.mean(dig_not_spam), np.mean(dig_spam))"
630 | ]
631 | },
632 | {
633 | "cell_type": "code",
634 | "execution_count": 19,
635 | "metadata": {},
636 | "outputs": [
637 | {
638 | "data": {
639 | "text/plain": [
640 | "(0.29927461139896372, 15.76037483266399)"
641 | ]
642 | },
643 | "execution_count": 19,
644 | "metadata": {},
645 | "output_type": "execute_result"
646 | }
647 | ],
648 | "source": [
649 | "answer_eight()"
650 | ]
651 | },
652 | {
653 | "cell_type": "markdown",
654 | "metadata": {},
655 | "source": [
656 | "### Question 9\n",
657 | "\n",
658 | "Fit and transform the training data `X_train` using a Tfidf Vectorizer ignoring terms that have a document frequency strictly lower than **5** and using **word n-grams from n=1 to n=3** (unigrams, bigrams, and trigrams).\n",
659 | "\n",
660 | "Using this document-term matrix and the following additional features:\n",
661 | "* the length of document (number of characters)\n",
662 | "* **number of digits per document**\n",
663 | "\n",
664 | "fit a Logistic Regression model with regularization `C=100`. Then compute the area under the curve (AUC) score using the transformed test data.\n",
665 | "\n",
666 | "*This function should return the AUC score as a float.*"
667 | ]
668 | },
669 | {
670 | "cell_type": "code",
671 | "execution_count": 20,
672 | "metadata": {
673 | "collapsed": true
674 | },
675 | "outputs": [],
676 | "source": [
677 | "from sklearn.linear_model import LogisticRegression\n",
678 | "\n",
679 | "def answer_nine():\n",
680 | " \n",
681 | " dig_train = [sum(char.isnumeric() for char in x) for x in X_train]\n",
682 | " dig_test = [sum(char.isnumeric() for char in x) for x in X_test]\n",
683 | " \n",
684 | " tf = TfidfVectorizer(min_df = 5, ngram_range = (1,3)).fit(X_train)\n",
685 | " X_train_tf = tf.transform(X_train)\n",
686 | " X_test_tf = tf.transform(X_test)\n",
687 | " \n",
688 | " X_train_tf = add_feature(X_train_tf, dig_train)\n",
689 | " X_test_tf = add_feature(X_test_tf, dig_test)\n",
690 | " \n",
691 | " clf = LogisticRegression(C=100).fit(X_train_tf, y_train)\n",
692 | " pred = clf.predict(X_test_tf)\n",
693 | " \n",
694 | " return roc_auc_score(y_test, pred)"
695 | ]
696 | },
697 | {
698 | "cell_type": "code",
699 | "execution_count": 21,
700 | "metadata": {},
701 | "outputs": [
702 | {
703 | "data": {
704 | "text/plain": [
705 | "0.96787090640544626"
706 | ]
707 | },
708 | "execution_count": 21,
709 | "metadata": {},
710 | "output_type": "execute_result"
711 | }
712 | ],
713 | "source": [
714 | "answer_nine()"
715 | ]
716 | },
717 | {
718 | "cell_type": "markdown",
719 | "metadata": {},
720 | "source": [
721 | "### Question 10\n",
722 | "\n",
723 | "What is the average number of non-word characters (anything other than a letter, digit or underscore) per document for not spam and spam documents?\n",
724 | "\n",
725 | "*Hint: Use `\\w` and `\\W` character classes*\n",
726 | "\n",
727 | "*This function should return a tuple (average # non-word characters not spam, average # non-word characters spam).*"
728 | ]
729 | },
730 | {
731 | "cell_type": "code",
732 | "execution_count": 22,
733 | "metadata": {
734 | "collapsed": true
735 | },
736 | "outputs": [],
737 | "source": [
738 | "def answer_ten():\n",
739 | " \n",
740 | " return (np.mean(spam_data.loc[spam_data['target']==0,'text'].str.count('\\W')), \n",
741 | " np.mean(spam_data.loc[spam_data['target']==1,'text'].str.count('\\W')))"
742 | ]
743 | },
744 | {
745 | "cell_type": "code",
746 | "execution_count": 23,
747 | "metadata": {},
748 | "outputs": [
749 | {
750 | "data": {
751 | "text/plain": [
752 | "(17.291813471502589, 29.041499330655956)"
753 | ]
754 | },
755 | "execution_count": 23,
756 | "metadata": {},
757 | "output_type": "execute_result"
758 | }
759 | ],
760 | "source": [
761 | "answer_ten()"
762 | ]
763 | },
764 | {
765 | "cell_type": "markdown",
766 | "metadata": {},
767 | "source": [
768 | "### Question 11\n",
769 | "\n",
770 | "Fit and transform the training data X_train using a Count Vectorizer ignoring terms that have a document frequency strictly lower than **5** and using **character n-grams from n=2 to n=5.**\n",
771 | "\n",
772 | "To tell Count Vectorizer to use character n-grams pass in `analyzer='char_wb'` which creates character n-grams only from text inside word boundaries. This should make the model more robust to spelling mistakes.\n",
773 | "\n",
774 | "Using this document-term matrix and the following additional features:\n",
775 | "* the length of document (number of characters)\n",
776 | "* number of digits per document\n",
777 | "* **number of non-word characters (anything other than a letter, digit or underscore.)**\n",
778 | "\n",
779 | "fit a Logistic Regression model with regularization C=100. Then compute the area under the curve (AUC) score using the transformed test data.\n",
780 | "\n",
781 | "Also **find the 10 smallest and 10 largest coefficients from the model** and return them along with the AUC score in a tuple.\n",
782 | "\n",
783 | "The list of 10 smallest coefficients should be sorted smallest first, the list of 10 largest coefficients should be sorted largest first.\n",
784 | "\n",
785 | "The three features that were added to the document term matrix should have the following names should they appear in the list of coefficients:\n",
786 | "['length_of_doc', 'digit_count', 'non_word_char_count']\n",
787 | "\n",
788 | "*This function should return a tuple `(AUC score as a float, smallest coefs list, largest coefs list)`.*"
789 | ]
790 | },
791 | {
792 | "cell_type": "code",
793 | "execution_count": 26,
794 | "metadata": {
795 | "collapsed": true
796 | },
797 | "outputs": [],
798 | "source": [
799 | "def answer_eleven():\n",
800 | " \n",
801 | " len_train = [len(x) for x in X_train]\n",
802 | " len_test = [len(x) for x in X_test]\n",
803 | " dig_train = [sum(char.isnumeric() for char in x) for x in X_train]\n",
804 | " dig_test = [sum(char.isnumeric() for char in x) for x in X_test]\n",
805 | " \n",
806 | "    # Number of non-word characters (regex \\W: anything other than a letter, digit or underscore):\n",
807 | " nan_train = X_train.str.count('\\W')\n",
808 | " nan_test = X_test.str.count('\\W')\n",
809 | " \n",
810 | " cv = CountVectorizer(min_df = 5, ngram_range=(2,5), analyzer='char_wb').fit(X_train)\n",
811 | " X_train_cv = cv.transform(X_train)\n",
812 | " X_test_cv = cv.transform(X_test)\n",
813 | " \n",
814 | " X_train_cv = add_feature(X_train_cv, [len_train, dig_train, nan_train])\n",
815 | " X_test_cv = add_feature(X_test_cv, [len_test, dig_test, nan_test])\n",
816 | " \n",
817 | " clf = LogisticRegression(C=100).fit(X_train_cv, y_train)\n",
818 | " pred = clf.predict(X_test_cv)\n",
819 | " \n",
820 | " score = roc_auc_score(y_test, pred)\n",
821 | " \n",
822 | " feature_names = np.array(cv.get_feature_names() + ['length_of_doc', 'digit_count', 'non_word_char_count'])\n",
823 | " sorted_coef_index = clf.coef_[0].argsort()\n",
824 | " small_coeffs = list(feature_names[sorted_coef_index[:10]])\n",
825 | " large_coeffs = list(feature_names[sorted_coef_index[:-11:-1]])\n",
826 | " \n",
827 | " return (score, small_coeffs, large_coeffs)"
828 | ]
829 | },
830 | {
831 | "cell_type": "code",
832 | "execution_count": 27,
833 | "metadata": {},
834 | "outputs": [
835 | {
836 | "data": {
837 | "text/plain": [
838 | "(0.97885931107074342,\n",
839 | " ['. ', '..', '? ', ' i', ' y', ' go', ':)', ' h', 'go', ' m'],\n",
840 | " ['digit_count', 'ne', 'ia', 'co', 'xt', ' ch', 'mob', ' x', 'ww', 'ar'])"
841 | ]
842 | },
843 | "execution_count": 27,
844 | "metadata": {},
845 | "output_type": "execute_result"
846 | }
847 | ],
848 | "source": [
849 | "answer_eleven()"
850 | ]
851 | },
852 | {
853 | "cell_type": "code",
854 | "execution_count": null,
855 | "metadata": {
856 | "collapsed": true
857 | },
858 | "outputs": [],
859 | "source": []
860 | }
861 | ],
862 | "metadata": {
863 | "coursera": {
864 | "course_slug": "python-text-mining",
865 | "graded_item_id": "Pn19K",
866 | "launcher_item_id": "y1juS",
867 | "part_id": "ctlgo"
868 | },
869 | "kernelspec": {
870 | "display_name": "Python 3",
871 | "language": "python",
872 | "name": "python3"
873 | },
874 | "language_info": {
875 | "codemirror_mode": {
876 | "name": "ipython",
877 | "version": 3
878 | },
879 | "file_extension": ".py",
880 | "mimetype": "text/x-python",
881 | "name": "python",
882 | "nbconvert_exporter": "python",
883 | "pygments_lexer": "ipython3",
884 | "version": "3.6.2"
885 | }
886 | },
887 | "nbformat": 4,
888 | "nbformat_minor": 2
889 | }
890 |
--------------------------------------------------------------------------------
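
The AUC scores in Assignment 3 above are computed from the hard 0/1 labels returned by `clf.predict`. A small variant of `answer_three` (an illustrative sketch, not the graded solution; the function name is made up, and it reuses the notebook's `X_train`, `X_test`, `y_train`, `y_test` split) passes continuous scores to `roc_auc_score` instead, which is how ROC-AUC is more commonly reported:

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.metrics import roc_auc_score

    def answer_three_with_scores():
        # Same CountVectorizer + MultinomialNB pipeline as answer_three above.
        cv = CountVectorizer().fit(X_train)
        clf = MultinomialNB(alpha=0.1).fit(cv.transform(X_train), y_train)

        # predict_proba returns one column per class; column 1 is P(spam).
        scores = clf.predict_proba(cv.transform(X_test))[:, 1]
        return roc_auc_score(y_test, scores)
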
/Assignment+4.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "---\n",
8 | "\n",
9 | "_You are currently looking at **version 1.0** of this notebook. To download notebooks and datafiles, as well as get help on Jupyter notebooks in the Coursera platform, visit the [Jupyter Notebook FAQ](https://www.coursera.org/learn/python-text-mining/resources/d9pwm) course resource._\n",
10 | "\n",
11 | "---"
12 | ]
13 | },
14 | {
15 | "cell_type": "markdown",
16 | "metadata": {},
17 | "source": [
18 | "# Assignment 4 - Document Similarity & Topic Modelling"
19 | ]
20 | },
21 | {
22 | "cell_type": "markdown",
23 | "metadata": {},
24 | "source": [
25 | "## Part 1 - Document Similarity\n",
26 | "\n",
27 | "For the first part of this assignment, you will complete the functions `doc_to_synsets` and `similarity_score` which will be used by `document_path_similarity` to find the path similarity between two documents.\n",
28 | "\n",
29 | "The following functions are provided:\n",
30 | "* **`convert_tag:`** converts the tag given by `nltk.pos_tag` to a tag used by `wordnet.synsets`. You will need to use this function in `doc_to_synsets`.\n",
31 | "* **`document_path_similarity:`** computes the symmetrical path similarity between two documents by finding the synsets in each document using `doc_to_synsets`, then computing similarities using `similarity_score`.\n",
32 | "\n",
33 | "You will need to finish writing the following functions:\n",
34 | "* **`doc_to_synsets:`** returns a list of synsets in document. This function should first tokenize and part of speech tag the document using `nltk.word_tokenize` and `nltk.pos_tag`. Then it should find each tokens corresponding synset using `wn.synsets(token, wordnet_tag)`. The first synset match should be used. If there is no match, that token is skipped.\n",
35 | "* **`similarity_score:`** returns the normalized similarity score of a list of synsets (s1) onto a second list of synsets (s2). For each synset in s1, find the synset in s2 with the largest similarity value. Sum all of the largest similarity values together and normalize this value by dividing it by the number of largest similarity values found. Be careful with data types, which should be floats. Missing values should be ignored.\n",
36 | "\n",
37 | "Once `doc_to_synsets` and `similarity_score` have been completed, submit to the autograder which will run `test_document_path_similarity` to test that these functions are running correctly. \n",
38 | "\n",
39 | "*Do not modify the functions `convert_tag`, `document_path_similarity`, and `test_document_path_similarity`.*"
40 | ]
41 | },
42 | {
43 | "cell_type": "code",
44 | "execution_count": null,
45 | "metadata": {
46 | "collapsed": true
47 | },
48 | "outputs": [],
49 | "source": [
50 | "# WordNet is the lexical database i.e. dictionary for the English language, specifically designed for natural language processing.\n",
51 | "\n",
52 | "# Synset is a special kind of a simple interface that is present in NLTK to look up words in WordNet.\n",
53 | "# Synset instances are the groupings of synonymous words that express the same concept.\n",
54 | "# Some of the words have only one Synset and some have several."
55 | ]
56 | },
57 | {
58 | "cell_type": "code",
59 | "execution_count": null,
60 | "metadata": {},
61 | "outputs": [
62 | {
63 | "name": "stdout",
64 | "output_type": "stream",
65 | "text": [
66 | "[nltk_data] Downloading package punkt to /home/jovyan/nltk_data...\n",
67 | "[nltk_data] Package punkt is already up-to-date!\n",
68 | "[nltk_data] Downloading package wordnet to /home/jovyan/nltk_data...\n",
69 | "[nltk_data] Package wordnet is already up-to-date!\n",
70 | "[nltk_data] Downloading package averaged_perceptron_tagger to\n",
71 | "[nltk_data] /home/jovyan/nltk_data...\n",
72 | "[nltk_data] Package averaged_perceptron_tagger is already up-to-\n",
73 | "[nltk_data] date!\n"
74 | ]
75 | }
76 | ],
77 | "source": [
78 | "import numpy as np\n",
79 | "import nltk\n",
80 | "nltk.download('punkt')\n",
81 | "nltk.download('wordnet')\n",
82 | "nltk.download('averaged_perceptron_tagger')\n",
83 | "from nltk.corpus import wordnet as wn\n",
84 | "import pandas as pd\n",
85 | "\n",
86 | "# Need to feed pos tags to this function!\n",
87 | "def convert_tag(tag):\n",
88 | " \"\"\"Convert the tag given by nltk.pos_tag to the tag used by wordnet.synsets\"\"\"\n",
89 | " \n",
90 | " tag_dict = {'N': 'n', 'J': 'a', 'R': 'r', 'V': 'v'}\n",
91 | " try:\n",
92 | " return tag_dict[tag[0]]\n",
93 | " except KeyError:\n",
94 | " return None\n",
95 | "\n",
96 | "\n",
97 | "def doc_to_synsets(doc):\n",
98 | " \"\"\"\n",
99 | " Returns a list of synsets in document.\n",
100 | "\n",
101 | " Tokenizes and tags the words in the document doc.\n",
102 | " Then finds the first synset for each word/tag combination.\n",
103 | " If a synset is not found for that combination it is skipped.\n",
104 | "\n",
105 | " Args:\n",
106 | " doc: string to be converted\n",
107 | "\n",
108 | " Returns:\n",
109 | " list of synsets\n",
110 | "\n",
111 | " Example:\n",
112 | " doc_to_synsets('Fish are nvqjp friends.')\n",
113 | " Out: [Synset('fish.n.01'), Synset('be.v.01'), Synset('friend.n.01')]\n",
114 | " \"\"\"\n",
115 | " \n",
116 | " tokens = nltk.word_tokenize(doc)\n",
117 | " pos_tags = nltk.pos_tag(tokens)\n",
118 | " wn_tags = [convert_tag(x[1]) for x in pos_tags]\n",
119 | " # If there is nothing in the synset for the token, it must be skipped! Therefore check that len of the synset is > 0!\n",
120 | " # Will return a list of lists of synsets - one list for each token!\n",
121 | " # Remember to use only the first match for each token! Hence wn.synsets(x,y)[0]!\n",
122 | " synset_list = [wn.synsets(x,y)[0] for x,y in zip(tokens, wn_tags) if len(wn.synsets(x,y))>0]\n",
123 | " return synset_list\n",
124 | "\n",
125 | "\n",
126 | "def similarity_score(s1, s2):\n",
127 | " \"\"\"\n",
128 | " Calculate the normalized similarity score of s1 onto s2\n",
129 | "\n",
130 | " For each synset in s1, finds the synset in s2 with the largest similarity value.\n",
131 | " Sum of all of the largest similarity values and normalize this value by dividing it by the\n",
132 | " number of largest similarity values found.\n",
133 | "\n",
134 | " Args:\n",
135 | " s1, s2: list of synsets from doc_to_synsets\n",
136 | "\n",
137 | " Returns:\n",
138 | " normalized similarity score of s1 onto s2\n",
139 | "\n",
140 | " Example:\n",
141 | " synsets1 = doc_to_synsets('I like cats')\n",
142 | " synsets2 = doc_to_synsets('I like dogs')\n",
143 | " similarity_score(synsets1, synsets2)\n",
144 | " Out: 0.73333333333333339\n",
145 | " \"\"\"\n",
146 | " \n",
147 | " max_sim = []\n",
148 | " for syn in s1:\n",
149 | " sim = [syn.path_similarity(x) for x in s2 if syn.path_similarity(x) is not None]\n",
150 | " if sim:\n",
151 | " max_sim.append(max(sim))\n",
152 | " return np.mean(max_sim)\n",
153 | "\n",
154 | "\n",
155 | "def document_path_similarity(doc1, doc2):\n",
156 | " \"\"\"Finds the symmetrical similarity between doc1 and doc2\"\"\"\n",
157 | "\n",
158 | " synsets1 = doc_to_synsets(doc1)\n",
159 | " synsets2 = doc_to_synsets(doc2)\n",
160 | "\n",
161 | " return (similarity_score(synsets1, synsets2) + similarity_score(synsets2, synsets1)) / 2"
162 | ]
163 | },
164 | {
165 | "cell_type": "markdown",
166 | "metadata": {},
167 | "source": [
168 | "### test_document_path_similarity\n",
169 | "\n",
170 | "Use this function to check if doc_to_synsets and similarity_score are correct.\n",
171 | "\n",
172 | "*This function should return the similarity score as a float.*"
173 | ]
174 | },
175 | {
176 | "cell_type": "code",
177 | "execution_count": null,
178 | "metadata": {
179 | "collapsed": true
180 | },
181 | "outputs": [],
182 | "source": [
183 | "def test_document_path_similarity():\n",
184 | " doc1 = 'This is a function to test document_path_similarity.'\n",
185 | " doc2 = 'Use this function to see if your code in doc_to_synsets \\\n",
186 | " and similarity_score is correct!'\n",
187 | " return document_path_similarity(doc1, doc2)"
188 | ]
189 | },
190 | {
191 | "cell_type": "code",
192 | "execution_count": null,
193 | "metadata": {},
194 | "outputs": [
195 | {
196 | "data": {
197 | "text/plain": [
198 | "0.55426587301587305"
199 | ]
200 | },
201 | "execution_count": 4,
202 | "metadata": {},
203 | "output_type": "execute_result"
204 | }
205 | ],
206 | "source": [
207 | "test_document_path_similarity()"
208 | ]
209 | },
210 | {
211 | "cell_type": "markdown",
212 | "metadata": {},
213 | "source": [
214 | "
\n",
215 | "___\n",
216 | "`paraphrases` is a DataFrame which contains the following columns: `Quality`, `D1`, and `D2`.\n",
217 | "\n",
218 | "`Quality` is an indicator variable which indicates if the two documents `D1` and `D2` are paraphrases of one another (1 for paraphrase, 0 for not paraphrase)."
219 | ]
220 | },
221 | {
222 | "cell_type": "code",
223 | "execution_count": null,
224 | "metadata": {},
225 | "outputs": [
226 | {
227 | "data": {
228 | "text/html": [
229 | "\n",
230 | "\n",
243 | "
\n",
244 | " \n",
245 | " \n",
246 | " | \n",
247 | " Quality | \n",
248 | " D1 | \n",
249 | " D2 | \n",
250 | "
\n",
251 | " \n",
252 | " \n",
253 | " \n",
254 | " 0 | \n",
255 | " 1 | \n",
256 | " Ms Stewart, the chief executive, was not expec... | \n",
257 | " Ms Stewart, 61, its chief executive officer an... | \n",
258 | "
\n",
259 | " \n",
260 | " 1 | \n",
261 | " 1 | \n",
262 | " After more than two years' detention under the... | \n",
263 | " After more than two years in detention by the ... | \n",
264 | "
\n",
265 | " \n",
266 | " 2 | \n",
267 | " 1 | \n",
268 | " \"It still remains to be seen whether the reven... | \n",
269 | " \"It remains to be seen whether the revenue rec... | \n",
270 | "
\n",
271 | " \n",
272 | " 3 | \n",
273 | " 0 | \n",
274 | " And it's going to be a wild ride,\" said Allan ... | \n",
275 | " Now the rest is just mechanical,\" said Allan H... | \n",
276 | "
\n",
277 | " \n",
278 | " 4 | \n",
279 | " 1 | \n",
280 | " The cards are issued by Mexico's consulates to... | \n",
281 | " The card is issued by Mexico's consulates to i... | \n",
282 | "
\n",
283 | " \n",
284 | "
\n",
285 | "
"
286 | ],
287 | "text/plain": [
288 | " Quality D1 \\\n",
289 | "0 1 Ms Stewart, the chief executive, was not expec... \n",
290 | "1 1 After more than two years' detention under the... \n",
291 | "2 1 \"It still remains to be seen whether the reven... \n",
292 | "3 0 And it's going to be a wild ride,\" said Allan ... \n",
293 | "4 1 The cards are issued by Mexico's consulates to... \n",
294 | "\n",
295 | " D2 \n",
296 | "0 Ms Stewart, 61, its chief executive officer an... \n",
297 | "1 After more than two years in detention by the ... \n",
298 | "2 \"It remains to be seen whether the revenue rec... \n",
299 | "3 Now the rest is just mechanical,\" said Allan H... \n",
300 | "4 The card is issued by Mexico's consulates to i... "
301 | ]
302 | },
303 | "execution_count": 5,
304 | "metadata": {},
305 | "output_type": "execute_result"
306 | }
307 | ],
308 | "source": [
309 | "# Use this dataframe for questions most_similar_docs and label_accuracy\n",
310 | "paraphrases = pd.read_csv('paraphrases.csv')\n",
311 | "paraphrases.head()"
312 | ]
313 | },
314 | {
315 | "cell_type": "markdown",
316 | "metadata": {},
317 | "source": [
318 | "___\n",
319 | "\n",
320 | "### most_similar_docs\n",
321 | "\n",
322 | "Using `document_path_similarity`, find the pair of documents in paraphrases which has the maximum similarity score.\n",
323 | "\n",
324 | "*This function should return a tuple `(D1, D2, similarity_score)`*"
325 | ]
326 | },
327 | {
328 | "cell_type": "code",
329 | "execution_count": null,
330 | "metadata": {
331 | "collapsed": true
332 | },
333 | "outputs": [],
334 | "source": [
335 | "def most_similar_docs():\n",
336 | " \n",
337 | " sim_scores = [document_path_similarity(x,y) for x,y in zip(paraphrases['D1'], paraphrases['D2'])]\n",
338 | " \n",
339 | " return (paraphrases.loc[np.argmax(sim_scores),'D1'], paraphrases.loc[np.argmax(sim_scores),'D2'], max(sim_scores))"
340 | ]
341 | },
342 | {
343 | "cell_type": "code",
344 | "execution_count": null,
345 | "metadata": {},
346 | "outputs": [
347 | {
348 | "data": {
349 | "text/plain": [
350 | "('\"Indeed, Iran should be put on notice that efforts to try to remake Iraq in their image will be aggressively put down,\" he said.',\n",
351 | " '\"Iran should be on notice that attempts to remake Iraq in Iran\\'s image will be aggressively put down,\" he said.\\n',\n",
352 | " 0.97530864197530864)"
353 | ]
354 | },
355 | "execution_count": 7,
356 | "metadata": {},
357 | "output_type": "execute_result"
358 | }
359 | ],
360 | "source": [
361 | "most_similar_docs()"
362 | ]
363 | },
364 | {
365 | "cell_type": "markdown",
366 | "metadata": {},
367 | "source": [
368 | "### label_accuracy\n",
369 | "\n",
370 | "Provide labels for the twenty pairs of documents by computing the similarity for each pair using `document_path_similarity`. Let the classifier rule be that if the score is greater than 0.75, label is paraphrase (1), else label is not paraphrase (0). Report accuracy of the classifier using scikit-learn's accuracy_score.\n",
371 | "\n",
372 | "*This function should return a float.*"
373 | ]
374 | },
375 | {
376 | "cell_type": "code",
377 | "execution_count": null,
378 | "metadata": {
379 | "collapsed": true
380 | },
381 | "outputs": [],
382 | "source": [
383 | "# Remember you always need something to compare against for accuracy!"
384 | ]
385 | },
386 | {
387 | "cell_type": "code",
388 | "execution_count": null,
389 | "metadata": {
390 | "collapsed": true
391 | },
392 | "outputs": [],
393 | "source": [
394 | "def label_accuracy():\n",
395 | " from sklearn.metrics import accuracy_score\n",
396 | "\n",
397 | " paraphrases['sim_scores'] = [document_path_similarity(x,y) for x,y in zip(paraphrases['D1'], paraphrases['D2'])]\n",
398 | " paraphrases['sim_scores'] = np.where(paraphrases['sim_scores']>0.75, 1, 0)\n",
399 | " return accuracy_score(paraphrases['Quality'], paraphrases['sim_scores'])"
400 | ]
401 | },
402 | {
403 | "cell_type": "code",
404 | "execution_count": null,
405 | "metadata": {},
406 | "outputs": [
407 | {
408 | "data": {
409 | "text/plain": [
410 | "0.80000000000000004"
411 | ]
412 | },
413 | "execution_count": 10,
414 | "metadata": {},
415 | "output_type": "execute_result"
416 | }
417 | ],
418 | "source": [
419 | "label_accuracy()"
420 | ]
421 | },
422 | {
423 | "cell_type": "markdown",
424 | "metadata": {},
425 | "source": [
426 | "## Part 2 - Topic Modelling\n",
427 | "\n",
428 | "For the second part of this assignment, you will use Gensim's LDA (Latent Dirichlet Allocation) model to model topics in `newsgroup_data`. You will first need to finish the code in the cell below by using gensim.models.ldamodel.LdaModel constructor to estimate LDA model parameters on the corpus, and save to the variable `ldamodel`. Extract 10 topics using `corpus` and `id_map`, and with `passes=25` and `random_state=34`."
429 | ]
430 | },
431 | {
432 | "cell_type": "code",
433 | "execution_count": null,
434 | "metadata": {
435 | "collapsed": true
436 | },
437 | "outputs": [],
438 | "source": [
439 | "import pickle\n",
440 | "import gensim\n",
441 | "from sklearn.feature_extraction.text import CountVectorizer\n",
442 | "\n",
443 | "# Load the list of documents\n",
444 | "with open('newsgroups', 'rb') as f:\n",
445 | " newsgroup_data = pickle.load(f)\n",
446 | "\n",
447 | "# Use CountVectorizor to find three letter tokens, remove stop_words, \n",
448 | "# remove tokens that don't appear in at least 20 documents,\n",
449 | "# remove tokens that appear in more than 20% of the documents\n",
450 | "vect = CountVectorizer(min_df=20, max_df=0.2, stop_words='english', \n",
451 | " token_pattern='(?u)\\\\b\\\\w\\\\w\\\\w+\\\\b')\n",
452 | "# Fit and transform\n",
453 | "X = vect.fit_transform(newsgroup_data)\n",
454 | "\n",
455 | "# Convert sparse matrix to gensim corpus.\n",
456 | "corpus = gensim.matutils.Sparse2Corpus(X, documents_columns=False)\n",
457 | "\n",
458 | "# Mapping from word IDs to words (To be used in LdaModel's id2word parameter)\n",
459 | "id_map = dict((v, k) for k, v in vect.vocabulary_.items())\n"
460 | ]
461 | },
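To make the hand-off from scikit-learn to gensim in the cell above more concrete, here is a small toy illustration (made-up documents, not the newsgroup data) of how `Sparse2Corpus` turns a document-term matrix into the `(token_id, count)` bag-of-words lists that `LdaModel` consumes.

```python
from sklearn.feature_extraction.text import CountVectorizer
import gensim

toy_docs = ["space shuttle launch", "launch the shuttle", "hockey game tonight"]
toy_vect = CountVectorizer().fit(toy_docs)
toy_X = toy_vect.transform(toy_docs)

# documents_columns=False because scikit-learn puts documents in rows
toy_corpus = gensim.matutils.Sparse2Corpus(toy_X, documents_columns=False)
toy_id_map = dict((v, k) for k, v in toy_vect.vocabulary_.items())

for bow in toy_corpus:
    # each document is a list of (token_id, count) pairs
    print([(toy_id_map[i], int(c)) for i, c in bow])
```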
462 | {
463 | "cell_type": "code",
464 | "execution_count": null,
465 | "metadata": {
466 | "collapsed": true
467 | },
468 | "outputs": [],
469 | "source": [
470 | "# Use the gensim.models.ldamodel.LdaModel constructor to estimate \n",
471 | "# LDA model parameters on the corpus, and save to the variable `ldamodel`\n",
472 | "\n",
473 | "# Your code here:\n",
474 | "ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics=10, id2word=id_map, passes=25, random_state=34)"
475 | ]
476 | },
477 | {
478 | "cell_type": "markdown",
479 | "metadata": {},
480 | "source": [
481 | "### lda_topics\n",
482 | "\n",
483 | "Using `ldamodel`, find a list of the 10 topics and the most significant 10 words in each topic. This should be structured as a list of 10 tuples where each tuple takes on the form:\n",
484 | "\n",
485 | "`(9, '0.068*\"space\" + 0.036*\"nasa\" + 0.021*\"science\" + 0.020*\"edu\" + 0.019*\"data\" + 0.017*\"shuttle\" + 0.015*\"launch\" + 0.015*\"available\" + 0.014*\"center\" + 0.014*\"sci\"')`\n",
486 | "\n",
487 | "for example.\n",
488 | "\n",
489 | "*This function should return a list of tuples.*"
490 | ]
491 | },
492 | {
493 | "cell_type": "code",
494 | "execution_count": null,
495 | "metadata": {
496 | "collapsed": true
497 | },
498 | "outputs": [],
499 | "source": [
500 | "def lda_topics():\n",
501 | " \n",
502 | " return list(ldamodel.show_topics(num_topics=10, num_words=10))"
503 | ]
504 | },
505 | {
506 | "cell_type": "code",
507 | "execution_count": null,
508 | "metadata": {},
509 | "outputs": [
510 | {
511 | "data": {
512 | "text/plain": [
513 | "[(0,\n",
514 | " '0.056*\"edu\" + 0.043*\"com\" + 0.033*\"thanks\" + 0.022*\"mail\" + 0.021*\"know\" + 0.020*\"does\" + 0.014*\"info\" + 0.012*\"monitor\" + 0.010*\"looking\" + 0.010*\"don\"'),\n",
515 | " (1,\n",
516 | " '0.024*\"ground\" + 0.018*\"current\" + 0.018*\"just\" + 0.013*\"want\" + 0.013*\"use\" + 0.011*\"using\" + 0.011*\"used\" + 0.010*\"power\" + 0.010*\"speed\" + 0.010*\"output\"'),\n",
517 | " (2,\n",
518 | " '0.061*\"drive\" + 0.042*\"disk\" + 0.033*\"scsi\" + 0.030*\"drives\" + 0.028*\"hard\" + 0.028*\"controller\" + 0.027*\"card\" + 0.020*\"rom\" + 0.018*\"floppy\" + 0.017*\"bus\"'),\n",
519 | " (3,\n",
520 | " '0.023*\"time\" + 0.015*\"atheism\" + 0.014*\"list\" + 0.013*\"left\" + 0.012*\"alt\" + 0.012*\"faq\" + 0.012*\"probably\" + 0.011*\"know\" + 0.011*\"send\" + 0.010*\"months\"'),\n",
521 | " (4,\n",
522 | " '0.025*\"car\" + 0.016*\"just\" + 0.014*\"don\" + 0.014*\"bike\" + 0.012*\"good\" + 0.011*\"new\" + 0.011*\"think\" + 0.010*\"year\" + 0.010*\"cars\" + 0.010*\"time\"'),\n",
523 | " (5,\n",
524 | " '0.030*\"game\" + 0.027*\"team\" + 0.023*\"year\" + 0.017*\"games\" + 0.016*\"play\" + 0.012*\"season\" + 0.012*\"players\" + 0.012*\"win\" + 0.011*\"hockey\" + 0.011*\"good\"'),\n",
525 | " (6,\n",
526 | " '0.017*\"information\" + 0.014*\"help\" + 0.014*\"medical\" + 0.012*\"new\" + 0.012*\"use\" + 0.012*\"000\" + 0.012*\"research\" + 0.011*\"university\" + 0.010*\"number\" + 0.010*\"program\"'),\n",
527 | " (7,\n",
528 | " '0.022*\"don\" + 0.021*\"people\" + 0.018*\"think\" + 0.017*\"just\" + 0.012*\"say\" + 0.011*\"know\" + 0.011*\"does\" + 0.011*\"good\" + 0.010*\"god\" + 0.009*\"way\"'),\n",
529 | " (8,\n",
530 | " '0.034*\"use\" + 0.023*\"apple\" + 0.020*\"power\" + 0.016*\"time\" + 0.015*\"data\" + 0.015*\"software\" + 0.012*\"pin\" + 0.012*\"memory\" + 0.012*\"simms\" + 0.012*\"port\"'),\n",
531 | " (9,\n",
532 | " '0.068*\"space\" + 0.036*\"nasa\" + 0.021*\"science\" + 0.020*\"edu\" + 0.019*\"data\" + 0.017*\"shuttle\" + 0.015*\"launch\" + 0.015*\"available\" + 0.014*\"center\" + 0.014*\"sci\"')]"
533 | ]
534 | },
535 | "execution_count": 14,
536 | "metadata": {},
537 | "output_type": "execute_result"
538 | }
539 | ],
540 | "source": [
541 | "lda_topics()"
542 | ]
543 | },
544 | {
545 | "cell_type": "markdown",
546 | "metadata": {},
547 | "source": [
548 | "### topic_distribution\n",
549 | "\n",
550 | "For the new document `new_doc`, find the topic distribution. Remember to use vect.transform on the the new doc, and Sparse2Corpus to convert the sparse matrix to gensim corpus.\n",
551 | "\n",
552 | "*This function should return a list of tuples, where each tuple is `(#topic, probability)`*"
553 | ]
554 | },
555 | {
556 | "cell_type": "code",
557 | "execution_count": null,
558 | "metadata": {
559 | "collapsed": true
560 | },
561 | "outputs": [],
562 | "source": [
563 | "new_doc = [\"\\n\\nIt's my understanding that the freezing will start to occur because \\\n",
564 | "of the\\ngrowing distance of Pluto and Charon from the Sun, due to it's\\nelliptical orbit. \\\n",
565 | "It is not due to shadowing effects. \\n\\n\\nPluto can shadow Charon, and vice-versa.\\n\\nGeorge \\\n",
566 | "Krumins\\n-- \"]"
567 | ]
568 | },
569 | {
570 | "cell_type": "code",
571 | "execution_count": 27,
572 | "metadata": {
573 | "collapsed": true
574 | },
575 | "outputs": [],
576 | "source": [
577 | "def topic_distribution():\n",
578 | " \n",
579 | " sparse_doc = vect.transform(new_doc)\n",
580 | " gen_corpus = gensim.matutils.Sparse2Corpus(sparse_doc, documents_columns=False)\n",
581 | " return list(ldamodel[gen_corpus])[0] # It's a list of lists! You just want the first one.\n",
582 | " #return list(ldamodel.show_topics(num_topics=10, num_words=10)) # For topic_names"
583 | ]
584 | },
585 | {
586 | "cell_type": "code",
587 | "execution_count": 28,
588 | "metadata": {},
589 | "outputs": [
590 | {
591 | "data": {
592 | "text/plain": [
593 | "[(0, 0.020001831983605493),\n",
594 | " (1, 0.020002047841298054),\n",
595 | " (2, 0.020000000831972634),\n",
596 | " (3, 0.49633312090606696),\n",
597 | " (4, 0.020002763617431935),\n",
598 | " (5, 0.020002855476879279),\n",
599 | " (6, 0.020001696069172015),\n",
600 | " (7, 0.020001367291069459),\n",
601 | " (8, 0.020001848438415574),\n",
602 | " (9, 0.34365246754408862)]"
603 | ]
604 | },
605 | "execution_count": 28,
606 | "metadata": {},
607 | "output_type": "execute_result"
608 | }
609 | ],
610 | "source": [
611 | "topic_distribution()"
612 | ]
613 | },
614 | {
615 | "cell_type": "markdown",
616 | "metadata": {},
617 | "source": [
618 | "### topic_names\n",
619 | "\n",
620 | "From the list of the following given topics, assign topic names to the topics you found. If none of these names best matches the topics you found, create a new 1-3 word \"title\" for the topic.\n",
621 | "\n",
622 | "Topics: Health, Science, Automobiles, Politics, Government, Travel, Computers & IT, Sports, Business, Society & Lifestyle, Religion, Education.\n",
623 | "\n",
624 | "*This function should return a list of 10 strings.*"
625 | ]
626 | },
627 | {
628 | "cell_type": "code",
629 | "execution_count": null,
630 | "metadata": {
631 | "collapsed": true
632 | },
633 | "outputs": [],
634 | "source": [
635 | "# Manually assign labels based on most important features and most important words in them:\n",
636 | "# Assigning labels manually for the first step in supervised learning.\n",
637 | "def topic_names():\n",
638 | " \n",
639 | " return ['Education','Science','Computers & IT','Religion','Automobiles','Sports','Science','Religion','Computers & IT','Science']"
640 | ]
641 | },
642 | {
643 | "cell_type": "code",
644 | "execution_count": 1,
645 | "metadata": {},
646 | "outputs": [
647 | {
648 | "name": "stdout",
649 | "output_type": "stream",
650 | "text": [
651 | "Amazon_Unlocked_Mobile.csv\n",
652 | "Assignment 1.ipynb\n",
653 | "Assignment 2.ipynb\n",
654 | "Assignment 3.ipynb\n",
655 | "Assignment 4.ipynb\n",
656 | "Case Study - Sentiment Analysis.ipynb\n",
657 | "Module 2 (Python 3).ipynb\n",
658 | "Regex with Pandas and Named Groups.ipynb\n",
659 | "Working With Text.ipynb\n",
660 | "dates.txt\n",
661 | "moby.txt\n",
662 | "mygrammar.cfg\n",
663 | "newsgroups\n",
664 | "paraphrases.csv\n",
665 | "readonly/\n",
666 | "readonly/paraphrases.csv\n",
667 | "readonly/Assignment 3.ipynb\n",
668 | "readonly/Working With Text.ipynb\n",
669 | "readonly/Assignment 1.ipynb\n",
670 | "readonly/.ipynb_checkpoints/\n",
671 | "readonly/.ipynb_checkpoints/Assignment 2-checkpoint.ipynb\n",
672 | "readonly/.ipynb_checkpoints/Assignment 4-checkpoint.ipynb\n",
673 | "readonly/.ipynb_checkpoints/Case Study - Sentiment Analysis-checkpoint.ipynb\n",
674 | "readonly/.ipynb_checkpoints/Assignment 3-checkpoint.ipynb\n",
675 | "readonly/.ipynb_checkpoints/Regex with Pandas and Named Groups-checkpoint.ipynb\n",
676 | "readonly/.ipynb_checkpoints/Assignment 1-checkpoint.ipynb\n",
677 | "readonly/Regex with Pandas and Named Groups.ipynb\n",
678 | "readonly/Module 2 (Python 3).ipynb\n",
679 | "readonly/moby.txt\n",
680 | "readonly/newsgroups\n",
681 | "readonly/Assignment 4.ipynb\n",
682 | "readonly/Case Study - Sentiment Analysis.ipynb\n",
683 | "readonly/Amazon_Unlocked_Mobile.csv\n",
684 | "readonly/mygrammar.cfg\n",
685 | "readonly/spam.csv\n",
686 | "readonly/Assignment 2.ipynb\n",
687 | "readonly/dates.txt\n",
688 | "spam.csv\n"
689 | ]
690 | }
691 | ],
692 | "source": [
693 | "!tar chvfz notebook.tar.gz *"
694 | ]
695 | },
696 | {
697 | "cell_type": "code",
698 | "execution_count": null,
699 | "metadata": {
700 | "collapsed": true
701 | },
702 | "outputs": [],
703 | "source": []
704 | }
705 | ],
706 | "metadata": {
707 | "coursera": {
708 | "course_slug": "python-text-mining",
709 | "graded_item_id": "2qbcK",
710 | "launcher_item_id": "pi9Sh",
711 | "part_id": "kQiwX"
712 | },
713 | "kernelspec": {
714 | "display_name": "Python 3",
715 | "language": "python",
716 | "name": "python3"
717 | },
718 | "language_info": {
719 | "codemirror_mode": {
720 | "name": "ipython",
721 | "version": 3
722 | },
723 | "file_extension": ".py",
724 | "mimetype": "text/x-python",
725 | "name": "python",
726 | "nbconvert_exporter": "python",
727 | "pygments_lexer": "ipython3",
728 | "version": "3.6.2"
729 | }
730 | },
731 | "nbformat": 4,
732 | "nbformat_minor": 2
733 | }
734 |
--------------------------------------------------------------------------------
/Case+Study+-+Sentiment+Analysis.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "---\n",
8 | "\n",
9 | "_You are currently looking at **version 1.0** of this notebook. To download notebooks and datafiles, as well as get help on Jupyter notebooks in the Coursera platform, visit the [Jupyter Notebook FAQ](https://www.coursera.org/learn/python-text-mining/resources/d9pwm) course resource._\n",
10 | "\n",
11 | "---"
12 | ]
13 | },
14 | {
15 | "cell_type": "markdown",
16 | "metadata": {},
17 | "source": [
18 | "*Note: Some of the cells in this notebook are computationally expensive. To reduce runtime, this notebook is using a subset of the data.*"
19 | ]
20 | },
21 | {
22 | "cell_type": "markdown",
23 | "metadata": {},
24 | "source": [
25 | "# Case Study: Sentiment Analysis"
26 | ]
27 | },
28 | {
29 | "cell_type": "markdown",
30 | "metadata": {},
31 | "source": [
32 | "### Data Prep"
33 | ]
34 | },
35 | {
36 | "cell_type": "code",
37 | "execution_count": 1,
38 | "metadata": {},
39 | "outputs": [
40 | {
41 | "data": {
42 | "text/html": [
43 | "\n",
44 | "\n",
57 | "
\n",
58 | " \n",
59 | " \n",
60 | " | \n",
61 | " Product Name | \n",
62 | " Brand Name | \n",
63 | " Price | \n",
64 | " Rating | \n",
65 | " Reviews | \n",
66 | " Review Votes | \n",
67 | "
\n",
68 | " \n",
69 | " \n",
70 | " \n",
71 | " 394349 | \n",
72 | " Sony XPERIA Z2 D6503 FACTORY UNLOCKED Internat... | \n",
73 | " NaN | \n",
74 | " 244.95 | \n",
75 | " 5 | \n",
76 | " Very good one! Better than Samsung S and iphon... | \n",
77 | " 0.0 | \n",
78 | "
\n",
79 | " \n",
80 | " 34377 | \n",
81 | " Apple iPhone 5c 8GB (Pink) - Verizon Wireless | \n",
82 | " Apple | \n",
83 | " 194.99 | \n",
84 | " 1 | \n",
85 | " The phone needed a SIM card, would have been n... | \n",
86 | " 1.0 | \n",
87 | "
\n",
88 | " \n",
89 | " 248521 | \n",
90 | " Motorola Droid RAZR MAXX XT912 M Verizon Smart... | \n",
91 | " Motorola | \n",
92 | " 174.99 | \n",
93 | " 5 | \n",
94 | " I was 3 months away from my upgrade and my Str... | \n",
95 | " 3.0 | \n",
96 | "
\n",
97 | " \n",
98 | " 167661 | \n",
99 | " CNPGD [U.S. Office Extended Warranty] Smartwat... | \n",
100 | " CNPGD | \n",
101 | " 49.99 | \n",
102 | " 1 | \n",
103 | " an experience i want to forget | \n",
104 | " 0.0 | \n",
105 | "
\n",
106 | " \n",
107 | " 73287 | \n",
108 | " Apple iPhone 7 Unlocked Phone 256 GB - US Vers... | \n",
109 | " Apple | \n",
110 | " 922.00 | \n",
111 | " 5 | \n",
112 | " GREAT PHONE WORK ACCORDING MY EXPECTATIONS. | \n",
113 | " 1.0 | \n",
114 | "
\n",
115 | " \n",
116 | "
\n",
117 | "
"
118 | ],
119 | "text/plain": [
120 | " Product Name Brand Name Price \\\n",
121 | "394349 Sony XPERIA Z2 D6503 FACTORY UNLOCKED Internat... NaN 244.95 \n",
122 | "34377 Apple iPhone 5c 8GB (Pink) - Verizon Wireless Apple 194.99 \n",
123 | "248521 Motorola Droid RAZR MAXX XT912 M Verizon Smart... Motorola 174.99 \n",
124 | "167661 CNPGD [U.S. Office Extended Warranty] Smartwat... CNPGD 49.99 \n",
125 | "73287 Apple iPhone 7 Unlocked Phone 256 GB - US Vers... Apple 922.00 \n",
126 | "\n",
127 | " Rating Reviews \\\n",
128 | "394349 5 Very good one! Better than Samsung S and iphon... \n",
129 | "34377 1 The phone needed a SIM card, would have been n... \n",
130 | "248521 5 I was 3 months away from my upgrade and my Str... \n",
131 | "167661 1 an experience i want to forget \n",
132 | "73287 5 GREAT PHONE WORK ACCORDING MY EXPECTATIONS. \n",
133 | "\n",
134 | " Review Votes \n",
135 | "394349 0.0 \n",
136 | "34377 1.0 \n",
137 | "248521 3.0 \n",
138 | "167661 0.0 \n",
139 | "73287 1.0 "
140 | ]
141 | },
142 | "execution_count": 1,
143 | "metadata": {},
144 | "output_type": "execute_result"
145 | }
146 | ],
147 | "source": [
148 | "import pandas as pd\n",
149 | "import numpy as np\n",
150 | "\n",
151 | "# Read in the data\n",
152 | "df = pd.read_csv('Amazon_Unlocked_Mobile.csv')\n",
153 | "\n",
154 | "# Sample the data to speed up computation\n",
155 | "# Comment out this line to match with lecture\n",
156 | "df = df.sample(frac=0.1, random_state=10)\n",
157 | "\n",
158 | "df.head()"
159 | ]
160 | },
161 | {
162 | "cell_type": "code",
163 | "execution_count": 2,
164 | "metadata": {},
165 | "outputs": [
166 | {
167 | "data": {
168 | "text/html": [
169 | "\n",
170 | "\n",
183 | "
\n",
184 | " \n",
185 | " \n",
186 | " | \n",
187 | " Product Name | \n",
188 | " Brand Name | \n",
189 | " Price | \n",
190 | " Rating | \n",
191 | " Reviews | \n",
192 | " Review Votes | \n",
193 | " Positively Rated | \n",
194 | "
\n",
195 | " \n",
196 | " \n",
197 | " \n",
198 | " 34377 | \n",
199 | " Apple iPhone 5c 8GB (Pink) - Verizon Wireless | \n",
200 | " Apple | \n",
201 | " 194.99 | \n",
202 | " 1 | \n",
203 | " The phone needed a SIM card, would have been n... | \n",
204 | " 1.0 | \n",
205 | " 0 | \n",
206 | "
\n",
207 | " \n",
208 | " 248521 | \n",
209 | " Motorola Droid RAZR MAXX XT912 M Verizon Smart... | \n",
210 | " Motorola | \n",
211 | " 174.99 | \n",
212 | " 5 | \n",
213 | " I was 3 months away from my upgrade and my Str... | \n",
214 | " 3.0 | \n",
215 | " 1 | \n",
216 | "
\n",
217 | " \n",
218 | " 167661 | \n",
219 | " CNPGD [U.S. Office Extended Warranty] Smartwat... | \n",
220 | " CNPGD | \n",
221 | " 49.99 | \n",
222 | " 1 | \n",
223 | " an experience i want to forget | \n",
224 | " 0.0 | \n",
225 | " 0 | \n",
226 | "
\n",
227 | " \n",
228 | " 73287 | \n",
229 | " Apple iPhone 7 Unlocked Phone 256 GB - US Vers... | \n",
230 | " Apple | \n",
231 | " 922.00 | \n",
232 | " 5 | \n",
233 | " GREAT PHONE WORK ACCORDING MY EXPECTATIONS. | \n",
234 | " 1.0 | \n",
235 | " 1 | \n",
236 | "
\n",
237 | " \n",
238 | " 277158 | \n",
239 | " Nokia N8 Unlocked GSM Touch Screen Phone Featu... | \n",
240 | " Nokia | \n",
241 | " 95.00 | \n",
242 | " 5 | \n",
243 | " I fell in love with this phone because it did ... | \n",
244 | " 0.0 | \n",
245 | " 1 | \n",
246 | "
\n",
247 | " \n",
248 | " 100311 | \n",
249 | " Blackberry Torch 2 9810 Unlocked Phone with 1.... | \n",
250 | " BlackBerry | \n",
251 | " 77.49 | \n",
252 | " 5 | \n",
253 | " I am pleased with this Blackberry phone! The p... | \n",
254 | " 0.0 | \n",
255 | " 1 | \n",
256 | "
\n",
257 | " \n",
258 | " 251669 | \n",
259 | " Motorola Moto E (1st Generation) - Black - 4 G... | \n",
260 | " Motorola | \n",
261 | " 89.99 | \n",
262 | " 5 | \n",
263 | " Great product, best value for money smartphone... | \n",
264 | " 0.0 | \n",
265 | " 1 | \n",
266 | "
\n",
267 | " \n",
268 | " 279878 | \n",
269 | " OtterBox 77-29864 Defender Series Hybrid Case ... | \n",
270 | " OtterBox | \n",
271 | " 9.99 | \n",
272 | " 5 | \n",
273 | " I've bought 3 no problems. Fast delivery. | \n",
274 | " 0.0 | \n",
275 | " 1 | \n",
276 | "
\n",
277 | " \n",
278 | " 406017 | \n",
279 | " Verizon HTC Rezound 4G Android Smarphone - 8MP... | \n",
280 | " HTC | \n",
281 | " 74.99 | \n",
282 | " 4 | \n",
283 | " Great phone for the price... | \n",
284 | " 0.0 | \n",
285 | " 1 | \n",
286 | "
\n",
287 | " \n",
288 | " 302567 | \n",
289 | " RCA M1 Unlocked Cell Phone, Dual Sim, 5Mp Came... | \n",
290 | " RCA | \n",
291 | " 159.99 | \n",
292 | " 5 | \n",
293 | " My mom is not good with new technoloy but this... | \n",
294 | " 4.0 | \n",
295 | " 1 | \n",
296 | "
\n",
297 | " \n",
298 | "
\n",
299 | "
"
300 | ],
301 | "text/plain": [
302 | " Product Name Brand Name Price \\\n",
303 | "34377 Apple iPhone 5c 8GB (Pink) - Verizon Wireless Apple 194.99 \n",
304 | "248521 Motorola Droid RAZR MAXX XT912 M Verizon Smart... Motorola 174.99 \n",
305 | "167661 CNPGD [U.S. Office Extended Warranty] Smartwat... CNPGD 49.99 \n",
306 | "73287 Apple iPhone 7 Unlocked Phone 256 GB - US Vers... Apple 922.00 \n",
307 | "277158 Nokia N8 Unlocked GSM Touch Screen Phone Featu... Nokia 95.00 \n",
308 | "100311 Blackberry Torch 2 9810 Unlocked Phone with 1.... BlackBerry 77.49 \n",
309 | "251669 Motorola Moto E (1st Generation) - Black - 4 G... Motorola 89.99 \n",
310 | "279878 OtterBox 77-29864 Defender Series Hybrid Case ... OtterBox 9.99 \n",
311 | "406017 Verizon HTC Rezound 4G Android Smarphone - 8MP... HTC 74.99 \n",
312 | "302567 RCA M1 Unlocked Cell Phone, Dual Sim, 5Mp Came... RCA 159.99 \n",
313 | "\n",
314 | " Rating Reviews \\\n",
315 | "34377 1 The phone needed a SIM card, would have been n... \n",
316 | "248521 5 I was 3 months away from my upgrade and my Str... \n",
317 | "167661 1 an experience i want to forget \n",
318 | "73287 5 GREAT PHONE WORK ACCORDING MY EXPECTATIONS. \n",
319 | "277158 5 I fell in love with this phone because it did ... \n",
320 | "100311 5 I am pleased with this Blackberry phone! The p... \n",
321 | "251669 5 Great product, best value for money smartphone... \n",
322 | "279878 5 I've bought 3 no problems. Fast delivery. \n",
323 | "406017 4 Great phone for the price... \n",
324 | "302567 5 My mom is not good with new technoloy but this... \n",
325 | "\n",
326 | " Review Votes Positively Rated \n",
327 | "34377 1.0 0 \n",
328 | "248521 3.0 1 \n",
329 | "167661 0.0 0 \n",
330 | "73287 1.0 1 \n",
331 | "277158 0.0 1 \n",
332 | "100311 0.0 1 \n",
333 | "251669 0.0 1 \n",
334 | "279878 0.0 1 \n",
335 | "406017 0.0 1 \n",
336 | "302567 4.0 1 "
337 | ]
338 | },
339 | "execution_count": 2,
340 | "metadata": {},
341 | "output_type": "execute_result"
342 | }
343 | ],
344 | "source": [
345 | "# Drop missing values\n",
346 | "df.dropna(inplace=True)\n",
347 | "\n",
348 | "# Remove any 'neutral' ratings equal to 3\n",
349 | "df = df[df['Rating'] != 3]\n",
350 | "\n",
351 | "# Encode 4s and 5s as 1 (rated positively)\n",
352 | "# Encode 1s and 2s as 0 (rated poorly)\n",
353 | "df['Positively Rated'] = np.where(df['Rating'] > 3, 1, 0)\n",
354 | "df.head(10)"
355 | ]
356 | },
357 | {
358 | "cell_type": "code",
359 | "execution_count": 3,
360 | "metadata": {},
361 | "outputs": [
362 | {
363 | "data": {
364 | "text/plain": [
365 | "0.74717766860786672"
366 | ]
367 | },
368 | "execution_count": 3,
369 | "metadata": {},
370 | "output_type": "execute_result"
371 | }
372 | ],
373 | "source": [
374 | "# Most ratings are positive\n",
375 | "df['Positively Rated'].mean()"
376 | ]
377 | },
378 | {
379 | "cell_type": "code",
380 | "execution_count": 4,
381 | "metadata": {
382 | "collapsed": true
383 | },
384 | "outputs": [],
385 | "source": [
386 | "from sklearn.model_selection import train_test_split\n",
387 | "\n",
388 | "# Split data into training and test sets\n",
389 | "X_train, X_test, y_train, y_test = train_test_split(df['Reviews'], \n",
390 | " df['Positively Rated'], \n",
391 | " random_state=0)"
392 | ]
393 | },
394 | {
395 | "cell_type": "code",
396 | "execution_count": 5,
397 | "metadata": {},
398 | "outputs": [
399 | {
400 | "name": "stdout",
401 | "output_type": "stream",
402 | "text": [
403 | "X_train first entry:\n",
404 | "\n",
405 | " Everything about it is awesome!\n",
406 | "\n",
407 | "\n",
408 | "X_train shape: (23052,)\n"
409 | ]
410 | }
411 | ],
412 | "source": [
413 | "print('X_train first entry:\\n\\n', X_train.iloc[0])\n",
414 | "print('\\n\\nX_train shape: ', X_train.shape)"
415 | ]
416 | },
417 | {
418 | "cell_type": "markdown",
419 | "metadata": {},
420 | "source": [
421 | "# CountVectorizer"
422 | ]
423 | },
424 | {
425 | "cell_type": "code",
426 | "execution_count": 6,
427 | "metadata": {
428 | "collapsed": true
429 | },
430 | "outputs": [],
431 | "source": [
432 | "from sklearn.feature_extraction.text import CountVectorizer\n",
433 | "\n",
434 | "# Fit the CountVectorizer to the training data\n",
435 | "vect = CountVectorizer().fit(X_train)"
436 | ]
437 | },
438 | {
439 | "cell_type": "code",
440 | "execution_count": 7,
441 | "metadata": {
442 | "scrolled": false
443 | },
444 | "outputs": [
445 | {
446 | "data": {
447 | "text/plain": [
448 | "['00',\n",
449 | " 'arroja',\n",
450 | " 'comapañias',\n",
451 | " 'dvds',\n",
452 | " 'golden',\n",
453 | " 'lands',\n",
454 | " 'oil',\n",
455 | " 'razonable',\n",
456 | " 'smallsliver',\n",
457 | " 'tweak']"
458 | ]
459 | },
460 | "execution_count": 7,
461 | "metadata": {},
462 | "output_type": "execute_result"
463 | }
464 | ],
465 | "source": [
466 | "vect.get_feature_names()[::2000] # Every 2000th feature."
467 | ]
468 | },
469 | {
470 | "cell_type": "code",
471 | "execution_count": 8,
472 | "metadata": {},
473 | "outputs": [
474 | {
475 | "data": {
476 | "text/plain": [
477 | "19601"
478 | ]
479 | },
480 | "execution_count": 8,
481 | "metadata": {},
482 | "output_type": "execute_result"
483 | }
484 | ],
485 | "source": [
486 | "len(vect.get_feature_names())"
487 | ]
488 | },
489 | {
490 | "cell_type": "code",
491 | "execution_count": 9,
492 | "metadata": {},
493 | "outputs": [
494 | {
495 | "data": {
496 | "text/plain": [
497 | "<23052x19601 sparse matrix of type ''\n",
498 | "\twith 613289 stored elements in Compressed Sparse Row format>"
499 | ]
500 | },
501 | "execution_count": 9,
502 | "metadata": {},
503 | "output_type": "execute_result"
504 | }
505 | ],
506 | "source": [
507 | "# transform the documents in the training data to a document-term matrix\n",
508 | "X_train_vectorized = vect.transform(X_train)\n",
509 | "\n",
510 | "X_train_vectorized"
511 | ]
512 | },
513 | {
514 | "cell_type": "code",
515 | "execution_count": 10,
516 | "metadata": {},
517 | "outputs": [
518 | {
519 | "data": {
520 | "text/plain": [
521 | "LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,\n",
522 | " intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,\n",
523 | " penalty='l2', random_state=None, solver='liblinear', tol=0.0001,\n",
524 | " verbose=0, warm_start=False)"
525 | ]
526 | },
527 | "execution_count": 10,
528 | "metadata": {},
529 | "output_type": "execute_result"
530 | }
531 | ],
532 | "source": [
533 | "from sklearn.linear_model import LogisticRegression\n",
534 | "\n",
535 | "# Train the model\n",
536 | "model = LogisticRegression()\n",
537 | "model.fit(X_train_vectorized, y_train)"
538 | ]
539 | },
540 | {
541 | "cell_type": "code",
542 | "execution_count": 11,
543 | "metadata": {},
544 | "outputs": [
545 | {
546 | "name": "stdout",
547 | "output_type": "stream",
548 | "text": [
549 | "AUC: 0.897433277667\n"
550 | ]
551 | }
552 | ],
553 | "source": [
554 | "from sklearn.metrics import roc_auc_score\n",
555 | "\n",
556 | "# Predict the transformed test documents\n",
557 | "predictions = model.predict(vect.transform(X_test))\n",
558 | "\n",
559 | "print('AUC: ', roc_auc_score(y_test, predictions))"
560 | ]
561 | },
562 | {
563 | "cell_type": "code",
564 | "execution_count": 12,
565 | "metadata": {
566 | "scrolled": true
567 | },
568 | "outputs": [
569 | {
570 | "name": "stdout",
571 | "output_type": "stream",
572 | "text": [
573 | "Smallest Coefs:\n",
574 | "['worst' 'terrible' 'slow' 'junk' 'poor' 'sucks' 'horrible' 'useless'\n",
575 | " 'waste' 'disappointed']\n",
576 | "\n",
577 | "Largest Coefs: \n",
578 | "['excelent' 'excelente' 'excellent' 'perfectly' 'love' 'perfect' 'exactly'\n",
579 | " 'great' 'best' 'awesome']\n"
580 | ]
581 | }
582 | ],
583 | "source": [
584 | "# get the feature names as numpy array\n",
585 | "feature_names = np.array(vect.get_feature_names())\n",
586 | "\n",
587 | "# Sort the coefficients from the model\n",
588 | "sorted_coef_index = model.coef_[0].argsort()\n",
589 | "\n",
590 | "# Find the 10 smallest and 10 largest coefficients\n",
591 | "# The 10 largest coefficients are being indexed using [:-11:-1] \n",
592 | "# so the list returned is in order of largest to smallest\n",
593 | "\n",
594 | "# Remember -ve indices mean the array is read backwards!\n",
595 | "\n",
596 | "print('Smallest Coefs:\\n{}\\n'.format(feature_names[sorted_coef_index[:10]]))\n",
597 | "print('Largest Coefs: \\n{}'.format(feature_names[sorted_coef_index[:-11:-1]]))"
598 | ]
599 | },
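The `[:10]` and `[:-11:-1]` slices above are easy to misread, so here is a tiny self-contained numpy example (toy numbers, not the model's coefficients) showing why a reversed slice of the `argsort` result gives the largest entries first.

```python
import numpy as np

coefs = np.array([0.2, -1.5, 3.0, 0.7, -0.1])
order = coefs.argsort()   # indices from smallest to largest: [1 4 0 3 2]
print(order[:3])          # the 3 smallest coefficients' indices -> [1 4 0]
print(order[:-4:-1])      # the 3 largest, largest first        -> [2 3 0]
```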
600 | {
601 | "cell_type": "markdown",
602 | "metadata": {},
603 | "source": [
604 | "# Tfidf"
605 | ]
606 | },
607 | {
608 | "cell_type": "code",
609 | "execution_count": 13,
610 | "metadata": {},
611 | "outputs": [
612 | {
613 | "data": {
614 | "text/plain": [
615 | "5442"
616 | ]
617 | },
618 | "execution_count": 13,
619 | "metadata": {},
620 | "output_type": "execute_result"
621 | }
622 | ],
623 | "source": [
624 | "from sklearn.feature_extraction.text import TfidfVectorizer\n",
625 | "\n",
626 | "# term frequency-inverse document frequency\n",
627 | "\n",
628 | "# Features with high tfidf are usually used in specific types of documents but rarely used across all documents.\n",
629 | "# Features with low tfidf are generally used across all documents in the corpus.\n",
630 | "\n",
631 | "# Fit the TfidfVectorizer to the training data specifiying a minimum document frequency of 5\n",
632 | "# Each token needs to appear in at least 5 documents to become a part of the vocabulary.\n",
633 | "vect = TfidfVectorizer(min_df=5).fit(X_train)\n",
634 | "len(vect.get_feature_names())"
635 | ]
636 | },
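The comments in the cell above describe tf-idf qualitatively; the toy example below (made-up sentences, unrelated to the review data) makes the intuition concrete: words that occur in every document get the lowest weights, while words concentrated in a single document get the highest.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np

toy_docs = ["the phone is great", "the phone is terrible", "the battery is great"]
toy_vect = TfidfVectorizer().fit(toy_docs)
toy_X = toy_vect.transform(toy_docs).toarray()

# Note: newer scikit-learn releases rename get_feature_names() to get_feature_names_out().
for term, col in zip(toy_vect.get_feature_names(), toy_X.T):
    print(term, np.round(col, 2))
# 'the' and 'is' (present in all three documents) get the lowest weights;
# 'terrible' and 'battery' (each present in a single document) get the highest.
```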
637 | {
638 | "cell_type": "code",
639 | "execution_count": 14,
640 | "metadata": {},
641 | "outputs": [
642 | {
643 | "name": "stdout",
644 | "output_type": "stream",
645 | "text": [
646 | "AUC: 0.889951006492\n"
647 | ]
648 | }
649 | ],
650 | "source": [
651 | "X_train_vectorized = vect.transform(X_train)\n",
652 | "\n",
653 | "model = LogisticRegression()\n",
654 | "model.fit(X_train_vectorized, y_train)\n",
655 | "\n",
656 | "predictions = model.predict(vect.transform(X_test))\n",
657 | "\n",
658 | "print('AUC: ', roc_auc_score(y_test, predictions))"
659 | ]
660 | },
661 | {
662 | "cell_type": "code",
663 | "execution_count": 15,
664 | "metadata": {},
665 | "outputs": [
666 | {
667 | "name": "stdout",
668 | "output_type": "stream",
669 | "text": [
670 | "Smallest tfidf:\n",
671 | "['61' 'printer' 'approach' 'adjustment' 'consequences' 'length' 'emailing'\n",
672 | " 'degrees' 'handsfree' 'chipset']\n",
673 | "\n",
674 | "Largest tfidf: \n",
675 | "['unlocked' 'handy' 'useless' 'cheat' 'up' 'original' 'exelent' 'exelente'\n",
676 | " 'exellent' 'satisfied']\n"
677 | ]
678 | }
679 | ],
680 | "source": [
681 | "feature_names = np.array(vect.get_feature_names())\n",
682 | "\n",
683 | "sorted_tfidf_index = X_train_vectorized.max(0).toarray()[0].argsort()\n",
684 | "\n",
685 | "print('Smallest tfidf:\\n{}\\n'.format(feature_names[sorted_tfidf_index[:10]]))\n",
686 | "print('Largest tfidf: \\n{}'.format(feature_names[sorted_tfidf_index[:-11:-1]]))"
687 | ]
688 | },
689 | {
690 | "cell_type": "code",
691 | "execution_count": 16,
692 | "metadata": {},
693 | "outputs": [
694 | {
695 | "name": "stdout",
696 | "output_type": "stream",
697 | "text": [
698 | "Smallest Coefs:\n",
699 | "['not' 'slow' 'disappointed' 'worst' 'terrible' 'never' 'return' 'doesn'\n",
700 | " 'horrible' 'waste']\n",
701 | "\n",
702 | "Largest Coefs: \n",
703 | "['great' 'love' 'excellent' 'good' 'best' 'perfect' 'price' 'awesome' 'far'\n",
704 | " 'perfectly']\n"
705 | ]
706 | }
707 | ],
708 | "source": [
709 | "sorted_coef_index = model.coef_[0].argsort()\n",
710 | "\n",
711 | "print('Smallest Coefs:\\n{}\\n'.format(feature_names[sorted_coef_index[:10]]))\n",
712 | "print('Largest Coefs: \\n{}'.format(feature_names[sorted_coef_index[:-11:-1]]))"
713 | ]
714 | },
715 | {
716 | "cell_type": "code",
717 | "execution_count": 17,
718 | "metadata": {},
719 | "outputs": [
720 | {
721 | "name": "stdout",
722 | "output_type": "stream",
723 | "text": [
724 | "[0 0]\n"
725 | ]
726 | }
727 | ],
728 | "source": [
729 | "# These reviews are treated the same by our current model\n",
730 | "print(model.predict(vect.transform(['not an issue, phone is working',\n",
731 | " 'an issue, phone is not working'])))"
732 | ]
733 | },
734 | {
735 | "cell_type": "markdown",
736 | "metadata": {},
737 | "source": [
738 | "# n-grams"
739 | ]
740 | },
741 | {
742 | "cell_type": "code",
743 | "execution_count": 18,
744 | "metadata": {},
745 | "outputs": [
746 | {
747 | "data": {
748 | "text/plain": [
749 | "29072"
750 | ]
751 | },
752 | "execution_count": 18,
753 | "metadata": {},
754 | "output_type": "execute_result"
755 | }
756 | ],
757 | "source": [
758 | "# Fit the CountVectorizer to the training data specifiying a minimum \n",
759 | "# document frequency of 5 and extracting 1-grams and 2-grams\n",
760 | "vect = CountVectorizer(min_df=5, ngram_range=(1,2)).fit(X_train)\n",
761 | "\n",
762 | "X_train_vectorized = vect.transform(X_train)\n",
763 | "\n",
764 | "len(vect.get_feature_names())"
765 | ]
766 | },
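As a quick illustration of what `ngram_range=(1,2)` adds, the sketch below (a toy sentence, not the review data) shows that the vocabulary now contains every adjacent word pair as well as the single words, which is what lets the model distinguish "not working" from "working" further down.

```python
from sklearn.feature_extraction.text import CountVectorizer

toy_vect = CountVectorizer(ngram_range=(1, 2)).fit(["phone is not working"])
# Newer scikit-learn releases rename get_feature_names() to get_feature_names_out().
print(toy_vect.get_feature_names())
# ['is', 'is not', 'not', 'not working', 'phone', 'phone is', 'working']
```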
767 | {
768 | "cell_type": "code",
769 | "execution_count": 19,
770 | "metadata": {},
771 | "outputs": [
772 | {
773 | "name": "stdout",
774 | "output_type": "stream",
775 | "text": [
776 | "AUC: 0.91106617946\n"
777 | ]
778 | }
779 | ],
780 | "source": [
781 | "model = LogisticRegression()\n",
782 | "model.fit(X_train_vectorized, y_train)\n",
783 | "\n",
784 | "predictions = model.predict(vect.transform(X_test))\n",
785 | "\n",
786 | "print('AUC: ', roc_auc_score(y_test, predictions))"
787 | ]
788 | },
789 | {
790 | "cell_type": "code",
791 | "execution_count": 20,
792 | "metadata": {},
793 | "outputs": [
794 | {
795 | "name": "stdout",
796 | "output_type": "stream",
797 | "text": [
798 | "Smallest Coefs:\n",
799 | "['no good' 'junk' 'poor' 'slow' 'worst' 'broken' 'not good' 'terrible'\n",
800 | " 'defective' 'horrible']\n",
801 | "\n",
802 | "Largest Coefs: \n",
803 | "['excellent' 'excelente' 'excelent' 'perfect' 'great' 'love' 'awesome'\n",
804 | " 'no problems' 'good' 'best']\n"
805 | ]
806 | }
807 | ],
808 | "source": [
809 | "feature_names = np.array(vect.get_feature_names())\n",
810 | "\n",
811 | "sorted_coef_index = model.coef_[0].argsort()\n",
812 | "\n",
813 | "print('Smallest Coefs:\\n{}\\n'.format(feature_names[sorted_coef_index[:10]]))\n",
814 | "print('Largest Coefs: \\n{}'.format(feature_names[sorted_coef_index[:-11:-1]]))"
815 | ]
816 | },
817 | {
818 | "cell_type": "code",
819 | "execution_count": 21,
820 | "metadata": {},
821 | "outputs": [
822 | {
823 | "name": "stdout",
824 | "output_type": "stream",
825 | "text": [
826 | "[1 0]\n"
827 | ]
828 | }
829 | ],
830 | "source": [
831 | "# These reviews are now correctly identified\n",
832 | "print(model.predict(vect.transform(['not an issue, phone is working',\n",
833 | " 'an issue, phone is not working'])))"
834 | ]
835 | }
836 | ],
837 | "metadata": {
838 | "kernelspec": {
839 | "display_name": "Python 3",
840 | "language": "python",
841 | "name": "python3"
842 | },
843 | "language_info": {
844 | "codemirror_mode": {
845 | "name": "ipython",
846 | "version": 3
847 | },
848 | "file_extension": ".py",
849 | "mimetype": "text/x-python",
850 | "name": "python",
851 | "nbconvert_exporter": "python",
852 | "pygments_lexer": "ipython3",
853 | "version": "3.6.2"
854 | }
855 | },
856 | "nbformat": 4,
857 | "nbformat_minor": 2
858 | }
859 |
--------------------------------------------------------------------------------
/Module+2+(Python+3).ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# Module 2 (Python 3)"
8 | ]
9 | },
10 | {
11 | "cell_type": "markdown",
12 | "metadata": {},
13 | "source": [
14 | "## Basic NLP Tasks with NLTK"
15 | ]
16 | },
17 | {
18 | "cell_type": "code",
19 | "execution_count": 1,
20 | "metadata": {},
21 | "outputs": [
22 | {
23 | "name": "stdout",
24 | "output_type": "stream",
25 | "text": [
26 | "*** Introductory Examples for the NLTK Book ***\n",
27 | "Loading text1, ..., text9 and sent1, ..., sent9\n",
28 | "Type the name of the text or sentence to view it.\n",
29 | "Type: 'texts()' or 'sents()' to list the materials.\n",
30 | "text1: Moby Dick by Herman Melville 1851\n",
31 | "text2: Sense and Sensibility by Jane Austen 1811\n",
32 | "text3: The Book of Genesis\n",
33 | "text4: Inaugural Address Corpus\n",
34 | "text5: Chat Corpus\n",
35 | "text6: Monty Python and the Holy Grail\n",
36 | "text7: Wall Street Journal\n",
37 | "text8: Personals Corpus\n",
38 | "text9: The Man Who Was Thursday by G . K . Chesterton 1908\n"
39 | ]
40 | }
41 | ],
42 | "source": [
43 | "import nltk\n",
44 | "from nltk.book import *"
45 | ]
46 | },
47 | {
48 | "cell_type": "markdown",
49 | "metadata": {},
50 | "source": [
51 | "### Counting vocabulary of words"
52 | ]
53 | },
54 | {
55 | "cell_type": "code",
56 | "execution_count": 3,
57 | "metadata": {},
58 | "outputs": [
59 | {
60 | "data": {
61 | "text/plain": [
62 | ""
63 | ]
64 | },
65 | "execution_count": 3,
66 | "metadata": {},
67 | "output_type": "execute_result"
68 | }
69 | ],
70 | "source": [
71 | "text7"
72 | ]
73 | },
74 | {
75 | "cell_type": "code",
76 | "execution_count": 4,
77 | "metadata": {},
78 | "outputs": [
79 | {
80 | "data": {
81 | "text/plain": [
82 | "['Pierre',\n",
83 | " 'Vinken',\n",
84 | " ',',\n",
85 | " '61',\n",
86 | " 'years',\n",
87 | " 'old',\n",
88 | " ',',\n",
89 | " 'will',\n",
90 | " 'join',\n",
91 | " 'the',\n",
92 | " 'board',\n",
93 | " 'as',\n",
94 | " 'a',\n",
95 | " 'nonexecutive',\n",
96 | " 'director',\n",
97 | " 'Nov.',\n",
98 | " '29',\n",
99 | " '.']"
100 | ]
101 | },
102 | "execution_count": 4,
103 | "metadata": {},
104 | "output_type": "execute_result"
105 | }
106 | ],
107 | "source": [
108 | "sent7"
109 | ]
110 | },
111 | {
112 | "cell_type": "code",
113 | "execution_count": 5,
114 | "metadata": {},
115 | "outputs": [
116 | {
117 | "data": {
118 | "text/plain": [
119 | "18"
120 | ]
121 | },
122 | "execution_count": 5,
123 | "metadata": {},
124 | "output_type": "execute_result"
125 | }
126 | ],
127 | "source": [
128 | "len(sent7)"
129 | ]
130 | },
131 | {
132 | "cell_type": "code",
133 | "execution_count": 6,
134 | "metadata": {},
135 | "outputs": [
136 | {
137 | "data": {
138 | "text/plain": [
139 | "100676"
140 | ]
141 | },
142 | "execution_count": 6,
143 | "metadata": {},
144 | "output_type": "execute_result"
145 | }
146 | ],
147 | "source": [
148 | "len(text7)"
149 | ]
150 | },
151 | {
152 | "cell_type": "code",
153 | "execution_count": 7,
154 | "metadata": {},
155 | "outputs": [
156 | {
157 | "data": {
158 | "text/plain": [
159 | "12408"
160 | ]
161 | },
162 | "execution_count": 7,
163 | "metadata": {},
164 | "output_type": "execute_result"
165 | }
166 | ],
167 | "source": [
168 | "len(set(text7))"
169 | ]
170 | },
171 | {
172 | "cell_type": "code",
173 | "execution_count": 8,
174 | "metadata": {},
175 | "outputs": [
176 | {
177 | "data": {
178 | "text/plain": [
179 | "['bottom',\n",
180 | " 'Richmond',\n",
181 | " 'tension',\n",
182 | " 'limits',\n",
183 | " 'Wedtech',\n",
184 | " 'most',\n",
185 | " 'boost',\n",
186 | " '143.80',\n",
187 | " 'Dale',\n",
188 | " 'refunded']"
189 | ]
190 | },
191 | "execution_count": 8,
192 | "metadata": {},
193 | "output_type": "execute_result"
194 | }
195 | ],
196 | "source": [
197 | "list(set(text7))[:10]"
198 | ]
199 | },
200 | {
201 | "cell_type": "markdown",
202 | "metadata": {},
203 | "source": [
204 | "### Frequency of words"
205 | ]
206 | },
207 | {
208 | "cell_type": "code",
209 | "execution_count": 9,
210 | "metadata": {},
211 | "outputs": [
212 | {
213 | "data": {
214 | "text/plain": [
215 | "12408"
216 | ]
217 | },
218 | "execution_count": 9,
219 | "metadata": {},
220 | "output_type": "execute_result"
221 | }
222 | ],
223 | "source": [
224 | "dist = FreqDist(text7)\n",
225 | "len(dist)"
226 | ]
227 | },
228 | {
229 | "cell_type": "code",
230 | "execution_count": 15,
231 | "metadata": {},
232 | "outputs": [
233 | {
234 | "data": {
235 | "text/plain": [
236 | "['Pierre', 'Vinken', ',', '61', 'years', 'old', 'will', 'join', 'the', 'board']"
237 | ]
238 | },
239 | "execution_count": 15,
240 | "metadata": {},
241 | "output_type": "execute_result"
242 | }
243 | ],
244 | "source": [
245 | "vocab1 = dist.keys()\n",
246 | "#vocab1[:10] \n",
247 | "# In Python 3 dict.keys() returns an iterable view instead of a list\n",
248 | "list(vocab1)[:10]"
249 | ]
250 | },
251 | {
252 | "cell_type": "code",
253 | "execution_count": 16,
254 | "metadata": {},
255 | "outputs": [
256 | {
257 | "data": {
258 | "text/plain": [
259 | "20"
260 | ]
261 | },
262 | "execution_count": 16,
263 | "metadata": {},
264 | "output_type": "execute_result"
265 | }
266 | ],
267 | "source": [
268 | "dist['four']"
269 | ]
270 | },
271 | {
272 | "cell_type": "code",
273 | "execution_count": 17,
274 | "metadata": {},
275 | "outputs": [
276 | {
277 | "data": {
278 | "text/plain": [
279 | "['billion',\n",
280 | " 'company',\n",
281 | " 'president',\n",
282 | " 'because',\n",
283 | " 'market',\n",
284 | " 'million',\n",
285 | " 'shares',\n",
286 | " 'trading',\n",
287 | " 'program']"
288 | ]
289 | },
290 | "execution_count": 17,
291 | "metadata": {},
292 | "output_type": "execute_result"
293 | }
294 | ],
295 | "source": [
296 | "freqwords = [w for w in vocab1 if len(w) > 5 and dist[w] > 100]\n",
297 | "freqwords"
298 | ]
299 | },
300 | {
301 | "cell_type": "markdown",
302 | "metadata": {},
303 | "source": [
304 | "### Normalization and stemming"
305 | ]
306 | },
307 | {
308 | "cell_type": "code",
309 | "execution_count": 22,
310 | "metadata": {},
311 | "outputs": [
312 | {
313 | "data": {
314 | "text/plain": [
315 | "['list', 'listed', 'lists', 'listing', 'listings']"
316 | ]
317 | },
318 | "execution_count": 22,
319 | "metadata": {},
320 | "output_type": "execute_result"
321 | }
322 | ],
323 | "source": [
324 | "input1 = \"List listed lists listing listings\"\n",
325 | "words1 = input1.lower().split(' ')\n",
326 | "words1"
327 | ]
328 | },
329 | {
330 | "cell_type": "code",
331 | "execution_count": 23,
332 | "metadata": {},
333 | "outputs": [
334 | {
335 | "data": {
336 | "text/plain": [
337 | "['list', 'list', 'list', 'list', 'list']"
338 | ]
339 | },
340 | "execution_count": 23,
341 | "metadata": {},
342 | "output_type": "execute_result"
343 | }
344 | ],
345 | "source": [
346 | "porter = nltk.PorterStemmer()\n",
347 | "[porter.stem(t) for t in words1]"
348 | ]
349 | },
350 | {
351 | "cell_type": "markdown",
352 | "metadata": {},
353 | "source": [
354 | "### Lemmatization"
355 | ]
356 | },
357 | {
358 | "cell_type": "code",
359 | "execution_count": 26,
360 | "metadata": {},
361 | "outputs": [
362 | {
363 | "data": {
364 | "text/plain": [
365 | "['Universal',\n",
366 | " 'Declaration',\n",
367 | " 'of',\n",
368 | " 'Human',\n",
369 | " 'Rights',\n",
370 | " 'Preamble',\n",
371 | " 'Whereas',\n",
372 | " 'recognition',\n",
373 | " 'of',\n",
374 | " 'the',\n",
375 | " 'inherent',\n",
376 | " 'dignity',\n",
377 | " 'and',\n",
378 | " 'of',\n",
379 | " 'the',\n",
380 | " 'equal',\n",
381 | " 'and',\n",
382 | " 'inalienable',\n",
383 | " 'rights',\n",
384 | " 'of']"
385 | ]
386 | },
387 | "execution_count": 26,
388 | "metadata": {},
389 | "output_type": "execute_result"
390 | }
391 | ],
392 | "source": [
393 | "udhr = nltk.corpus.udhr.words('English-Latin1')\n",
394 | "udhr[:20]"
395 | ]
396 | },
397 | {
398 | "cell_type": "code",
399 | "execution_count": 24,
400 | "metadata": {},
401 | "outputs": [
402 | {
403 | "data": {
404 | "text/plain": [
405 | "['univers',\n",
406 | " 'declar',\n",
407 | " 'of',\n",
408 | " 'human',\n",
409 | " 'right',\n",
410 | " 'preambl',\n",
411 | " 'wherea',\n",
412 | " 'recognit',\n",
413 | " 'of',\n",
414 | " 'the',\n",
415 | " 'inher',\n",
416 | " 'digniti',\n",
417 | " 'and',\n",
418 | " 'of',\n",
419 | " 'the',\n",
420 | " 'equal',\n",
421 | " 'and',\n",
422 | " 'inalien',\n",
423 | " 'right',\n",
424 | " 'of']"
425 | ]
426 | },
427 | "execution_count": 24,
428 | "metadata": {},
429 | "output_type": "execute_result"
430 | }
431 | ],
432 | "source": [
433 | "[porter.stem(t) for t in udhr[:20]] # Still Lemmatization"
434 | ]
435 | },
436 | {
437 | "cell_type": "code",
438 | "execution_count": 25,
439 | "metadata": {},
440 | "outputs": [
441 | {
442 | "data": {
443 | "text/plain": [
444 | "['Universal',\n",
445 | " 'Declaration',\n",
446 | " 'of',\n",
447 | " 'Human',\n",
448 | " 'Rights',\n",
449 | " 'Preamble',\n",
450 | " 'Whereas',\n",
451 | " 'recognition',\n",
452 | " 'of',\n",
453 | " 'the',\n",
454 | " 'inherent',\n",
455 | " 'dignity',\n",
456 | " 'and',\n",
457 | " 'of',\n",
458 | " 'the',\n",
459 | " 'equal',\n",
460 | " 'and',\n",
461 | " 'inalienable',\n",
462 | " 'right',\n",
463 | " 'of']"
464 | ]
465 | },
466 | "execution_count": 25,
467 | "metadata": {},
468 | "output_type": "execute_result"
469 | }
470 | ],
471 | "source": [
472 | "WNlemma = nltk.WordNetLemmatizer()\n",
473 | "[WNlemma.lemmatize(t) for t in udhr[:20]]"
474 | ]
475 | },
476 | {
477 | "cell_type": "markdown",
478 | "metadata": {},
479 | "source": [
480 | "### Tokenization"
481 | ]
482 | },
483 | {
484 | "cell_type": "code",
485 | "execution_count": 28,
486 | "metadata": {},
487 | "outputs": [
488 | {
489 | "data": {
490 | "text/plain": [
491 | "['Children', \"shouldn't\", 'drink', 'a', 'sugary', 'drink', 'before', 'bed.']"
492 | ]
493 | },
494 | "execution_count": 28,
495 | "metadata": {},
496 | "output_type": "execute_result"
497 | }
498 | ],
499 | "source": [
500 | "text11 = \"Children shouldn't drink a sugary drink before bed.\"\n",
501 | "text11.split(' ')"
502 | ]
503 | },
504 | {
505 | "cell_type": "code",
506 | "execution_count": 29,
507 | "metadata": {},
508 | "outputs": [
509 | {
510 | "data": {
511 | "text/plain": [
512 | "['Children',\n",
513 | " 'should',\n",
514 | " \"n't\",\n",
515 | " 'drink',\n",
516 | " 'a',\n",
517 | " 'sugary',\n",
518 | " 'drink',\n",
519 | " 'before',\n",
520 | " 'bed',\n",
521 | " '.']"
522 | ]
523 | },
524 | "execution_count": 29,
525 | "metadata": {},
526 | "output_type": "execute_result"
527 | }
528 | ],
529 | "source": [
530 | "nltk.word_tokenize(text11)"
531 | ]
532 | },
533 | {
534 | "cell_type": "code",
535 | "execution_count": 30,
536 | "metadata": {},
537 | "outputs": [
538 | {
539 | "data": {
540 | "text/plain": [
541 | "4"
542 | ]
543 | },
544 | "execution_count": 30,
545 | "metadata": {},
546 | "output_type": "execute_result"
547 | }
548 | ],
549 | "source": [
550 | "text12 = \"This is the first sentence. A gallon of milk in the U.S. costs $2.99. Is this the third sentence? Yes, it is!\"\n",
551 | "sentences = nltk.sent_tokenize(text12)\n",
552 | "len(sentences)"
553 | ]
554 | },
555 | {
556 | "cell_type": "code",
557 | "execution_count": 31,
558 | "metadata": {},
559 | "outputs": [
560 | {
561 | "data": {
562 | "text/plain": [
563 | "['This is the first sentence.',\n",
564 | " 'A gallon of milk in the U.S. costs $2.99.',\n",
565 | " 'Is this the third sentence?',\n",
566 | " 'Yes, it is!']"
567 | ]
568 | },
569 | "execution_count": 31,
570 | "metadata": {},
571 | "output_type": "execute_result"
572 | }
573 | ],
574 | "source": [
575 | "sentences"
576 | ]
577 | },
578 | {
579 | "cell_type": "markdown",
580 | "metadata": {},
581 | "source": [
582 | "## Advanced NLP Tasks with NLTK"
583 | ]
584 | },
585 | {
586 | "cell_type": "markdown",
587 | "metadata": {},
588 | "source": [
589 | "### POS tagging"
590 | ]
591 | },
592 | {
593 | "cell_type": "code",
594 | "execution_count": 33,
595 | "metadata": {},
596 | "outputs": [
597 | {
598 | "name": "stdout",
599 | "output_type": "stream",
600 | "text": [
601 | "MD: modal auxiliary\n",
602 | " can cannot could couldn't dare may might must need ought shall should\n",
603 | " shouldn't will would\n"
604 | ]
605 | }
606 | ],
607 | "source": [
608 | "nltk.help.upenn_tagset('MD')"
609 | ]
610 | },
611 | {
612 | "cell_type": "code",
613 | "execution_count": 34,
614 | "metadata": {},
615 | "outputs": [
616 | {
617 | "data": {
618 | "text/plain": [
619 | "[('Children', 'NNP'),\n",
620 | " ('should', 'MD'),\n",
621 | " (\"n't\", 'RB'),\n",
622 | " ('drink', 'VB'),\n",
623 | " ('a', 'DT'),\n",
624 | " ('sugary', 'JJ'),\n",
625 | " ('drink', 'NN'),\n",
626 | " ('before', 'IN'),\n",
627 | " ('bed', 'NN'),\n",
628 | " ('.', '.')]"
629 | ]
630 | },
631 | "execution_count": 34,
632 | "metadata": {},
633 | "output_type": "execute_result"
634 | }
635 | ],
636 | "source": [
637 | "text13 = nltk.word_tokenize(text11)\n",
638 | "nltk.pos_tag(text13)"
639 | ]
640 | },
641 | {
642 | "cell_type": "code",
643 | "execution_count": 35,
644 | "metadata": {},
645 | "outputs": [
646 | {
647 | "data": {
648 | "text/plain": [
649 | "[('Visiting', 'VBG'),\n",
650 | " ('aunts', 'NNS'),\n",
651 | " ('can', 'MD'),\n",
652 | " ('be', 'VB'),\n",
653 | " ('a', 'DT'),\n",
654 | " ('nuisance', 'NN')]"
655 | ]
656 | },
657 | "execution_count": 35,
658 | "metadata": {},
659 | "output_type": "execute_result"
660 | }
661 | ],
662 | "source": [
663 | "text14 = nltk.word_tokenize(\"Visiting aunts can be a nuisance\")\n",
664 | "nltk.pos_tag(text14)"
665 | ]
666 | },
667 | {
668 | "cell_type": "code",
669 | "execution_count": 37,
670 | "metadata": {},
671 | "outputs": [
672 | {
673 | "name": "stdout",
674 | "output_type": "stream",
675 | "text": [
676 | "(S (NP Alice) (VP (V loves) (NP Bob)))\n"
677 | ]
678 | }
679 | ],
680 | "source": [
681 | "# Parsing sentence structure\n",
682 | "text15 = nltk.word_tokenize(\"Alice loves Bob\")\n",
683 | "grammar = nltk.CFG.fromstring(\"\"\"\n",
684 | "S -> NP VP\n",
685 | "VP -> V NP\n",
686 | "NP -> 'Alice' | 'Bob'\n",
687 | "V -> 'loves'\n",
688 | "\"\"\")\n",
689 | "\n",
690 | "parser = nltk.ChartParser(grammar)\n",
691 | "trees = parser.parse_all(text15)\n",
692 | "for tree in trees:\n",
693 | " print(tree)"
694 | ]
695 | },
696 | {
697 | "cell_type": "code",
698 | "execution_count": 40,
699 | "metadata": {},
700 | "outputs": [
701 | {
702 | "data": {
703 | "text/plain": [
704 | ""
705 | ]
706 | },
707 | "execution_count": 40,
708 | "metadata": {},
709 | "output_type": "execute_result"
710 | }
711 | ],
712 | "source": [
713 | "text16 = nltk.word_tokenize(\"I saw the man with a telescope\")\n",
714 | "grammar1 = nltk.data.load('mygrammar.cfg')\n",
715 | "grammar1"
716 | ]
717 | },
718 | {
719 | "cell_type": "code",
720 | "execution_count": 41,
721 | "metadata": {},
722 | "outputs": [
723 | {
724 | "name": "stdout",
725 | "output_type": "stream",
726 | "text": [
727 | "(S\n",
728 | " (NP I)\n",
729 | " (VP\n",
730 | " (VP (V saw) (NP (Det the) (N man)))\n",
731 | " (PP (P with) (NP (Det a) (N telescope)))))\n",
732 | "(S\n",
733 | " (NP I)\n",
734 | " (VP\n",
735 | " (V saw)\n",
736 | " (NP (Det the) (N man) (PP (P with) (NP (Det a) (N telescope))))))\n"
737 | ]
738 | }
739 | ],
740 | "source": [
741 | "parser = nltk.ChartParser(grammar1)\n",
742 | "trees = parser.parse_all(text16)\n",
743 | "for tree in trees:\n",
744 | " print(tree)"
745 | ]
746 | },
747 | {
748 | "cell_type": "code",
749 | "execution_count": 42,
750 | "metadata": {},
751 | "outputs": [
752 | {
753 | "name": "stdout",
754 | "output_type": "stream",
755 | "text": [
756 | "(S\n",
757 | " (NP-SBJ\n",
758 | " (NP (NNP Pierre) (NNP Vinken))\n",
759 | " (, ,)\n",
760 | " (ADJP (NP (CD 61) (NNS years)) (JJ old))\n",
761 | " (, ,))\n",
762 | " (VP\n",
763 | " (MD will)\n",
764 | " (VP\n",
765 | " (VB join)\n",
766 | " (NP (DT the) (NN board))\n",
767 | " (PP-CLR (IN as) (NP (DT a) (JJ nonexecutive) (NN director)))\n",
768 | " (NP-TMP (NNP Nov.) (CD 29))))\n",
769 | " (. .))\n"
770 | ]
771 | }
772 | ],
773 | "source": [
774 | "from nltk.corpus import treebank\n",
775 | "text17 = treebank.parsed_sents('wsj_0001.mrg')[0]\n",
776 | "print(text17)"
777 | ]
778 | },
779 | {
780 | "cell_type": "markdown",
781 | "metadata": {},
782 | "source": [
783 | "### POS tagging and parsing ambiguity"
784 | ]
785 | },
786 | {
787 | "cell_type": "code",
788 | "execution_count": 43,
789 | "metadata": {},
790 | "outputs": [
791 | {
792 | "data": {
793 | "text/plain": [
794 | "[('The', 'DT'), ('old', 'JJ'), ('man', 'NN'), ('the', 'DT'), ('boat', 'NN')]"
795 | ]
796 | },
797 | "execution_count": 43,
798 | "metadata": {},
799 | "output_type": "execute_result"
800 | }
801 | ],
802 | "source": [
803 | "text18 = nltk.word_tokenize(\"The old man the boat\")\n",
804 | "nltk.pos_tag(text18)"
805 | ]
806 | },
807 | {
808 | "cell_type": "code",
809 | "execution_count": 44,
810 | "metadata": {},
811 | "outputs": [
812 | {
813 | "data": {
814 | "text/plain": [
815 | "[('Colorless', 'NNP'),\n",
816 | " ('green', 'JJ'),\n",
817 | " ('ideas', 'NNS'),\n",
818 | " ('sleep', 'VBP'),\n",
819 | " ('furiously', 'RB')]"
820 | ]
821 | },
822 | "execution_count": 44,
823 | "metadata": {},
824 | "output_type": "execute_result"
825 | }
826 | ],
827 | "source": [
828 | "text19 = nltk.word_tokenize(\"Colorless green ideas sleep furiously\")\n",
829 | "nltk.pos_tag(text19)"
830 | ]
831 | }
832 | ],
833 | "metadata": {
834 | "kernelspec": {
835 | "display_name": "Python 3",
836 | "language": "python",
837 | "name": "python3"
838 | },
839 | "language_info": {
840 | "codemirror_mode": {
841 | "name": "ipython",
842 | "version": 3
843 | },
844 | "file_extension": ".py",
845 | "mimetype": "text/x-python",
846 | "name": "python",
847 | "nbconvert_exporter": "python",
848 | "pygments_lexer": "ipython3",
849 | "version": "3.6.2"
850 | }
851 | },
852 | "nbformat": 4,
853 | "nbformat_minor": 2
854 | }
855 |
--------------------------------------------------------------------------------
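
The Module 2 notebook above walks through the core NLTK workflow one cell at a time: word and sentence tokenization, stemming with `PorterStemmer`, lemmatization with `WordNetLemmatizer`, POS tagging, and CFG parsing. The sketch below is a minimal standalone recap of those same calls, assuming NLTK is installed and the relevant data packages have been downloaded (e.g. 'punkt', 'wordnet', 'averaged_perceptron_tagger'; exact package names vary slightly across NLTK versions); it is a recap, not one of the course notebooks.

```python
import nltk

# Assumes the NLTK data packages for tokenization, tagging and WordNet
# have already been fetched, e.g. via nltk.download('punkt') and friends.
text = "Children shouldn't drink a sugary drink before bed."

# Tokenization: smarter than str.split(), e.g. "shouldn't" -> "should" + "n't"
tokens = nltk.word_tokenize(text)

# Stemming: crude suffix stripping, so "sugary" becomes "sugari"
porter = nltk.PorterStemmer()
stems = [porter.stem(t) for t in tokens]

# Lemmatization: maps words to dictionary forms, leaving valid words intact
wnl = nltk.WordNetLemmatizer()
lemmas = [wnl.lemmatize(t) for t in tokens]

# POS tagging: (token, Penn Treebank tag) pairs
tagged = nltk.pos_tag(tokens)

print(tokens)   # ['Children', 'should', "n't", 'drink', ...]
print(stems)
print(lemmas)
print(tagged)   # [('Children', 'NNP'), ('should', 'MD'), ...]
```

As the UDHR cells above show, stemming and lemmatization can disagree: the Porter stemmer reduces 'Universal' to 'univers', while the WordNet lemmatizer leaves the word unchanged.
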
/README.md:
--------------------------------------------------------------------------------
1 | # Applied-Text-Mining-in-Python
2 | University of Michigan on Coursera
3 |
4 | This course will introduce the learner to text mining and text manipulation basics. The course begins with an understanding of how text is handled by Python, the structure of text both to the machine and to humans, and an overview of the NLTK framework for manipulating text. The second week focuses on common manipulation needs, including regular expressions (searching for text), cleaning text, and preparing text for use by machine learning processes. The third week will apply basic natural language processing methods to text and demonstrate how text classification is accomplished. The final week will explore more advanced methods for detecting the topics in documents and grouping them by similarity (topic modelling).
5 |
--------------------------------------------------------------------------------
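
The Regex+with+Pandas+and+Named+Groups.ipynb notebook that follows applies the pandas `.str` accessor to a small DataFrame of sentences. As a quick reference, the sketch below collects the same vectorized string and regex methods in one place; it reuses two of that notebook's example sentences and assumes a recent pandas version (in pandas 2.0+, `str.replace` treats the pattern literally unless `regex=True` is passed, and `str.extract` defaults to returning a DataFrame).

```python
import pandas as pd

df = pd.DataFrame(
    ["Monday: The doctor's appointment is at 2:45pm.",
     "Friday: Take the train at 08:10 am, arrive at 09:00am."],
    columns=['text'])

df['text'].str.len()                        # characters per row
df['text'].str.split().str.len()            # whitespace tokens per row
df['text'].str.contains('appointment')      # boolean mask
df['text'].str.count(r'\d')                 # digits per row
df['text'].str.findall(r'(\d?\d):(\d\d)')   # list of (hour, minute) tuples per row

# First match only; one column per group (named groups become column names)
df['text'].str.extract(r'(?P<hour>\d?\d):(?P<minute>\d\d)')

# Every match; result is indexed by a (row, match) MultiIndex
df['text'].str.extractall(r'(?P<hour>\d?\d):(?P<minute>\d\d) ?(?P<period>[ap]m)')
```
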
/Regex+with+Pandas+and+Named+Groups.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "---\n",
8 | "\n",
9 | "_You are currently looking at **version 1.0** of this notebook. To download notebooks and datafiles, as well as get help on Jupyter notebooks in the Coursera platform, visit the [Jupyter Notebook FAQ](https://www.coursera.org/learn/python-text-mining/resources/d9pwm) course resource._\n",
10 | "\n",
11 | "---"
12 | ]
13 | },
14 | {
15 | "cell_type": "markdown",
16 | "metadata": {},
17 | "source": [
18 | "# Working with Text Data in pandas"
19 | ]
20 | },
21 | {
22 | "cell_type": "code",
23 | "execution_count": 1,
24 | "metadata": {
25 | "scrolled": true
26 | },
27 | "outputs": [
28 | {
29 | "data": {
30 | "text/html": [
31 | "\n",
32 | "\n",
45 | "
\n",
46 | " \n",
47 | " \n",
48 | " | \n",
49 | " text | \n",
50 | "
\n",
51 | " \n",
52 | " \n",
53 | " \n",
54 | " 0 | \n",
55 | " Monday: The doctor's appointment is at 2:45pm. | \n",
56 | "
\n",
57 | " \n",
58 | " 1 | \n",
59 | " Tuesday: The dentist's appointment is at 11:30... | \n",
60 | "
\n",
61 | " \n",
62 | " 2 | \n",
63 | " Wednesday: At 7:00pm, there is a basketball game! | \n",
64 | "
\n",
65 | " \n",
66 | " 3 | \n",
67 | " Thursday: Be back home by 11:15 pm at the latest. | \n",
68 | "
\n",
69 | " \n",
70 | " 4 | \n",
71 | " Friday: Take the train at 08:10 am, arrive at ... | \n",
72 | "
\n",
73 | " \n",
74 | "
\n",
75 | "
"
76 | ],
77 | "text/plain": [
78 | " text\n",
79 | "0 Monday: The doctor's appointment is at 2:45pm.\n",
80 | "1 Tuesday: The dentist's appointment is at 11:30...\n",
81 | "2 Wednesday: At 7:00pm, there is a basketball game!\n",
82 | "3 Thursday: Be back home by 11:15 pm at the latest.\n",
83 | "4 Friday: Take the train at 08:10 am, arrive at ..."
84 | ]
85 | },
86 | "execution_count": 1,
87 | "metadata": {},
88 | "output_type": "execute_result"
89 | }
90 | ],
91 | "source": [
92 | "import pandas as pd\n",
93 | "\n",
94 | "time_sentences = [\"Monday: The doctor's appointment is at 2:45pm.\", \n",
95 | " \"Tuesday: The dentist's appointment is at 11:30 am.\",\n",
96 | " \"Wednesday: At 7:00pm, there is a basketball game!\",\n",
97 | " \"Thursday: Be back home by 11:15 pm at the latest.\",\n",
98 | " \"Friday: Take the train at 08:10 am, arrive at 09:00am.\"]\n",
99 | "\n",
100 | "df = pd.DataFrame(time_sentences, columns=['text'])\n",
101 | "df"
102 | ]
103 | },
104 | {
105 | "cell_type": "code",
106 | "execution_count": 2,
107 | "metadata": {},
108 | "outputs": [
109 | {
110 | "data": {
111 | "text/plain": [
112 | "0 46\n",
113 | "1 50\n",
114 | "2 49\n",
115 | "3 49\n",
116 | "4 54\n",
117 | "Name: text, dtype: int64"
118 | ]
119 | },
120 | "execution_count": 2,
121 | "metadata": {},
122 | "output_type": "execute_result"
123 | }
124 | ],
125 | "source": [
126 | "# find the number of characters for each string in df['text']\n",
127 | "df['text'].str.len()"
128 | ]
129 | },
130 | {
131 | "cell_type": "code",
132 | "execution_count": 3,
133 | "metadata": {},
134 | "outputs": [
135 | {
136 | "data": {
137 | "text/plain": [
138 | "0 7\n",
139 | "1 8\n",
140 | "2 8\n",
141 | "3 10\n",
142 | "4 10\n",
143 | "Name: text, dtype: int64"
144 | ]
145 | },
146 | "execution_count": 3,
147 | "metadata": {},
148 | "output_type": "execute_result"
149 | }
150 | ],
151 | "source": [
152 | "# find the number of tokens for each string in df['text']\n",
153 | "df['text'].str.split().str.len()"
154 | ]
155 | },
156 | {
157 | "cell_type": "code",
158 | "execution_count": 4,
159 | "metadata": {},
160 | "outputs": [
161 | {
162 | "data": {
163 | "text/plain": [
164 | "0 True\n",
165 | "1 True\n",
166 | "2 False\n",
167 | "3 False\n",
168 | "4 False\n",
169 | "Name: text, dtype: bool"
170 | ]
171 | },
172 | "execution_count": 4,
173 | "metadata": {},
174 | "output_type": "execute_result"
175 | }
176 | ],
177 | "source": [
178 | "# find which entries contain the word 'appointment'\n",
179 | "df['text'].str.contains('appointment')"
180 | ]
181 | },
182 | {
183 | "cell_type": "code",
184 | "execution_count": 5,
185 | "metadata": {},
186 | "outputs": [
187 | {
188 | "data": {
189 | "text/plain": [
190 | "0 3\n",
191 | "1 4\n",
192 | "2 3\n",
193 | "3 4\n",
194 | "4 8\n",
195 | "Name: text, dtype: int64"
196 | ]
197 | },
198 | "execution_count": 5,
199 | "metadata": {},
200 | "output_type": "execute_result"
201 | }
202 | ],
203 | "source": [
204 | "# find how many times a digit occurs in each string\n",
205 | "df['text'].str.count(r'\\d')"
206 | ]
207 | },
208 | {
209 | "cell_type": "code",
210 | "execution_count": 6,
211 | "metadata": {},
212 | "outputs": [
213 | {
214 | "data": {
215 | "text/plain": [
216 | "0 [2, 4, 5]\n",
217 | "1 [1, 1, 3, 0]\n",
218 | "2 [7, 0, 0]\n",
219 | "3 [1, 1, 1, 5]\n",
220 | "4 [0, 8, 1, 0, 0, 9, 0, 0]\n",
221 | "Name: text, dtype: object"
222 | ]
223 | },
224 | "execution_count": 6,
225 | "metadata": {},
226 | "output_type": "execute_result"
227 | }
228 | ],
229 | "source": [
230 | "# find all occurances of the digits\n",
231 | "df['text'].str.findall(r'\\d')"
232 | ]
233 | },
234 | {
235 | "cell_type": "code",
236 | "execution_count": 7,
237 | "metadata": {},
238 | "outputs": [
239 | {
240 | "data": {
241 | "text/plain": [
242 | "0 [(2, 45)]\n",
243 | "1 [(11, 30)]\n",
244 | "2 [(7, 00)]\n",
245 | "3 [(11, 15)]\n",
246 | "4 [(08, 10), (09, 00)]\n",
247 | "Name: text, dtype: object"
248 | ]
249 | },
250 | "execution_count": 7,
251 | "metadata": {},
252 | "output_type": "execute_result"
253 | }
254 | ],
255 | "source": [
256 | "# group and find the hours and minutes\n",
257 | "df['text'].str.findall(r'(\\d?\\d):(\\d\\d)')"
258 | ]
259 | },
260 | {
261 | "cell_type": "code",
262 | "execution_count": 8,
263 | "metadata": {},
264 | "outputs": [
265 | {
266 | "data": {
267 | "text/plain": [
268 | "0 ???: The doctor's appointment is at 2:45pm.\n",
269 | "1 ???: The dentist's appointment is at 11:30 am.\n",
270 | "2 ???: At 7:00pm, there is a basketball game!\n",
271 | "3 ???: Be back home by 11:15 pm at the latest.\n",
272 | "4 ???: Take the train at 08:10 am, arrive at 09:...\n",
273 | "Name: text, dtype: object"
274 | ]
275 | },
276 | "execution_count": 8,
277 | "metadata": {},
278 | "output_type": "execute_result"
279 | }
280 | ],
281 | "source": [
282 | "# replace weekdays with '???'\n",
283 | "df['text'].str.replace(r'\\w+day\\b', '???')"
284 | ]
285 | },
286 | {
287 | "cell_type": "code",
288 | "execution_count": 9,
289 | "metadata": {},
290 | "outputs": [
291 | {
292 | "data": {
293 | "text/plain": [
294 | "0 Mon: The doctor's appointment is at 2:45pm.\n",
295 | "1 Tue: The dentist's appointment is at 11:30 am.\n",
296 | "2 Wed: At 7:00pm, there is a basketball game!\n",
297 | "3 Thu: Be back home by 11:15 pm at the latest.\n",
298 | "4 Fri: Take the train at 08:10 am, arrive at 09:...\n",
299 | "Name: text, dtype: object"
300 | ]
301 | },
302 | "execution_count": 9,
303 | "metadata": {},
304 | "output_type": "execute_result"
305 | }
306 | ],
307 | "source": [
308 | "# replace weekdays with 3 letter abbrevations\n",
309 | "df['text'].str.replace(r'(\\w+day\\b)', lambda x: x.groups()[0][:3])"
310 | ]
311 | },
312 | {
313 | "cell_type": "code",
314 | "execution_count": 10,
315 | "metadata": {},
316 | "outputs": [
317 | {
318 | "name": "stderr",
319 | "output_type": "stream",
320 | "text": [
321 | "/opt/conda/lib/python3.6/site-packages/ipykernel_launcher.py:2: FutureWarning: currently extract(expand=None) means expand=False (return Index/Series/DataFrame) but in a future version of pandas this will be changed to expand=True (return DataFrame)\n",
322 | " \n"
323 | ]
324 | },
325 | {
326 | "data": {
327 | "text/html": [
328 | "\n",
329 | "\n",
342 | "
\n",
343 | " \n",
344 | " \n",
345 | " | \n",
346 | " 0 | \n",
347 | " 1 | \n",
348 | "
\n",
349 | " \n",
350 | " \n",
351 | " \n",
352 | " 0 | \n",
353 | " 2 | \n",
354 | " 45 | \n",
355 | "
\n",
356 | " \n",
357 | " 1 | \n",
358 | " 11 | \n",
359 | " 30 | \n",
360 | "
\n",
361 | " \n",
362 | " 2 | \n",
363 | " 7 | \n",
364 | " 00 | \n",
365 | "
\n",
366 | " \n",
367 | " 3 | \n",
368 | " 11 | \n",
369 | " 15 | \n",
370 | "
\n",
371 | " \n",
372 | " 4 | \n",
373 | " 08 | \n",
374 | " 10 | \n",
375 | "
\n",
376 | " \n",
377 | "
\n",
378 | "
"
379 | ],
380 | "text/plain": [
381 | " 0 1\n",
382 | "0 2 45\n",
383 | "1 11 30\n",
384 | "2 7 00\n",
385 | "3 11 15\n",
386 | "4 08 10"
387 | ]
388 | },
389 | "execution_count": 10,
390 | "metadata": {},
391 | "output_type": "execute_result"
392 | }
393 | ],
394 | "source": [
395 | "# create new columns from first match of extracted groups\n",
396 | "df['text'].str.extract(r'(\\d?\\d):(\\d\\d)')"
397 | ]
398 | },
399 | {
400 | "cell_type": "code",
401 | "execution_count": 11,
402 | "metadata": {},
403 | "outputs": [
404 | {
405 | "data": {
406 | "text/html": [
407 | "\n",
408 | "\n",
421 | "
\n",
422 | " \n",
423 | " \n",
424 | " | \n",
425 | " | \n",
426 | " 0 | \n",
427 | " 1 | \n",
428 | " 2 | \n",
429 | " 3 | \n",
430 | "
\n",
431 | " \n",
432 | " | \n",
433 | " match | \n",
434 | " | \n",
435 | " | \n",
436 | " | \n",
437 | " | \n",
438 | "
\n",
439 | " \n",
440 | " \n",
441 | " \n",
442 | " 0 | \n",
443 | " 0 | \n",
444 | " 2:45pm | \n",
445 | " 2 | \n",
446 | " 45 | \n",
447 | " pm | \n",
448 | "
\n",
449 | " \n",
450 | " 1 | \n",
451 | " 0 | \n",
452 | " 11:30 am | \n",
453 | " 11 | \n",
454 | " 30 | \n",
455 | " am | \n",
456 | "
\n",
457 | " \n",
458 | " 2 | \n",
459 | " 0 | \n",
460 | " 7:00pm | \n",
461 | " 7 | \n",
462 | " 00 | \n",
463 | " pm | \n",
464 | "
\n",
465 | " \n",
466 | " 3 | \n",
467 | " 0 | \n",
468 | " 11:15 pm | \n",
469 | " 11 | \n",
470 | " 15 | \n",
471 | " pm | \n",
472 | "
\n",
473 | " \n",
474 | " 4 | \n",
475 | " 0 | \n",
476 | " 08:10 am | \n",
477 | " 08 | \n",
478 | " 10 | \n",
479 | " am | \n",
480 | "
\n",
481 | " \n",
482 | " 1 | \n",
483 | " 09:00am | \n",
484 | " 09 | \n",
485 | " 00 | \n",
486 | " am | \n",
487 | "
\n",
488 | " \n",
489 | "
\n",
490 | "
"
491 | ],
492 | "text/plain": [
493 | " 0 1 2 3\n",
494 | " match \n",
495 | "0 0 2:45pm 2 45 pm\n",
496 | "1 0 11:30 am 11 30 am\n",
497 | "2 0 7:00pm 7 00 pm\n",
498 | "3 0 11:15 pm 11 15 pm\n",
499 | "4 0 08:10 am 08 10 am\n",
500 | " 1 09:00am 09 00 am"
501 | ]
502 | },
503 | "execution_count": 11,
504 | "metadata": {},
505 | "output_type": "execute_result"
506 | }
507 | ],
508 | "source": [
509 | "# extract the entire time, the hours, the minutes, and the period\n",
510 | "df['text'].str.extractall(r'((\\d?\\d):(\\d\\d) ?([ap]m))')"
511 | ]
512 | },
513 | {
514 | "cell_type": "code",
515 | "execution_count": 12,
516 | "metadata": {},
517 | "outputs": [
518 | {
519 | "data": {
520 | "text/html": [
521 | "\n",
522 | "\n",
535 | "
\n",
536 | " \n",
537 | " \n",
538 | " | \n",
539 | " | \n",
540 | " time | \n",
541 | " hour | \n",
542 | " minute | \n",
543 | " period | \n",
544 | "
\n",
545 | " \n",
546 | " | \n",
547 | " match | \n",
548 | " | \n",
549 | " | \n",
550 | " | \n",
551 | " | \n",
552 | "
\n",
553 | " \n",
554 | " \n",
555 | " \n",
556 | " 0 | \n",
557 | " 0 | \n",
558 | " 2:45pm | \n",
559 | " 2 | \n",
560 | " 45 | \n",
561 | " pm | \n",
562 | "
\n",
563 | " \n",
564 | " 1 | \n",
565 | " 0 | \n",
566 | " 11:30 am | \n",
567 | " 11 | \n",
568 | " 30 | \n",
569 | " am | \n",
570 | "
\n",
571 | " \n",
572 | " 2 | \n",
573 | " 0 | \n",
574 | " 7:00pm | \n",
575 | " 7 | \n",
576 | " 00 | \n",
577 | " pm | \n",
578 | "
\n",
579 | " \n",
580 | " 3 | \n",
581 | " 0 | \n",
582 | " 11:15 pm | \n",
583 | " 11 | \n",
584 | " 15 | \n",
585 | " pm | \n",
586 | "
\n",
587 | " \n",
588 | " 4 | \n",
589 | " 0 | \n",
590 | " 08:10 am | \n",
591 | " 08 | \n",
592 | " 10 | \n",
593 | " am | \n",
594 | "
\n",
595 | " \n",
596 | " 1 | \n",
597 | " 09:00am | \n",
598 | " 09 | \n",
599 | " 00 | \n",
600 | " am | \n",
601 | "
\n",
602 | " \n",
603 | "
\n",
604 | "
"
605 | ],
606 | "text/plain": [
607 | " time hour minute period\n",
608 | " match \n",
609 | "0 0 2:45pm 2 45 pm\n",
610 | "1 0 11:30 am 11 30 am\n",
611 | "2 0 7:00pm 7 00 pm\n",
612 | "3 0 11:15 pm 11 15 pm\n",
613 | "4 0 08:10 am 08 10 am\n",
614 | " 1 09:00am 09 00 am"
615 | ]
616 | },
617 | "execution_count": 12,
618 | "metadata": {},
619 | "output_type": "execute_result"
620 | }
621 | ],
622 | "source": [
623 | "# extract the entire time, the hours, the minutes, and the period with group names\n",
624 | "df['text'].str.extractall(r'(?P