├── Assignment+1.ipynb ├── Assignment+2.ipynb ├── Assignment+3.ipynb ├── Assignment+4.ipynb ├── Case+Study+-+Sentiment+Analysis.ipynb ├── README.md ├── Regex+with+Pandas+and+Named+Groups.ipynb ├── dates.txt ├── moby.txt ├── newsgroups ├── paraphrases.csv └── spam.csv /Assignment+1.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "---\n", 8 | "\n", 9 | "_You are currently looking at **version 1.0** of this notebook. To download notebooks and datafiles, as well as get help on Jupyter notebooks in the Coursera platform, visit the [Jupyter Notebook FAQ](https://www.coursera.org/learn/python-text-mining/resources/d9pwm) course resource._\n", 10 | "\n", 11 | "---" 12 | ] 13 | }, 14 | { 15 | "cell_type": "markdown", 16 | "metadata": {}, 17 | "source": [ 18 | "# Assignment 1\n", 19 | "\n", 20 | "In this assignment, you'll be working with messy medical data and using regex to extract relevant information from the data. \n", 21 | "\n", 22 | "Each line of the `dates.txt` file corresponds to a medical note. Each note has a date that needs to be extracted, but each date is encoded in one of many formats.\n", 23 | "\n", 24 | "The goal of this assignment is to correctly identify all of the different date variants encoded in this dataset and to properly normalize and sort the dates. \n", 25 | "\n", 26 | "Here is a list of some of the variants you might encounter in this dataset:\n", 27 | "* 04/20/2009; 04/20/09; 4/20/09; 4/3/09\n", 28 | "* Mar-20-2009; Mar 20, 2009; March 20, 2009; Mar. 20, 2009; Mar 20 2009;\n", 29 | "* 20 Mar 2009; 20 March 2009; 20 Mar. 
2009; 20 March, 2009\n", 30 | "* Mar 20th, 2009; Mar 21st, 2009; Mar 22nd, 2009\n", 31 | "* Feb 2009; Sep 2009; Oct 2010\n", 32 | "* 6/2008; 12/2009\n", 33 | "* 2009; 2010\n", 34 | "\n", 35 | "Once you have extracted these date patterns from the text, the next step is to sort them in ascending chronological order according to the following rules:\n", 36 | "* Assume all dates in xx/xx/xx format are mm/dd/yy\n", 37 | "* Assume all dates where year is encoded in only two digits are years from the 1900's (e.g. 1/5/89 is January 5th, 1989)\n", 38 | "* If the day is missing (e.g. 9/2009), assume it is the first day of the month (e.g. September 1, 2009).\n", 39 | "* If the month is missing (e.g. 2010), assume it is the first of January of that year (e.g. January 1, 2010).\n", 40 | "\n", 41 | "With these rules in mind, find the correct date in each note and return a pandas Series in chronological order of the original Series' indices.\n", 42 | "\n", 43 | "For example, if the original series was this:\n", 44 | "\n", 45 | " 0 1999\n", 46 | " 1 2010\n", 47 | " 2 1978\n", 48 | " 3 2015\n", 49 | " 4 1985\n", 50 | "\n", 51 | "Your function should return this:\n", 52 | "\n", 53 | " 0 2\n", 54 | " 1 4\n", 55 | " 2 0\n", 56 | " 3 1\n", 57 | " 4 3\n", 58 | "\n", 59 | "Your score will be calculated using [Kendall's tau](https://en.wikipedia.org/wiki/Kendall_rank_correlation_coefficient), a correlation measure for ordinal data.\n", 60 | "\n", 61 | "*This function should return a Series of length 500 and dtype int.*" 62 | ] 63 | }, 64 | { 65 | "cell_type": "code", 66 | "execution_count": 1, 67 | "metadata": {}, 68 | "outputs": [ 69 | { 70 | "data": { 71 | "text/plain": [ 72 | "0 03/25/93 Total time of visit (in minutes):\\n\n", 73 | "1 6/18/85 Primary Care Doctor:\\n\n", 74 | "2 sshe plans to move as of 7/8/71 In-Home Servic...\n", 75 | "3 7 on 9/27/75 Audit C Score Current:\\n\n", 76 | "4 2/6/96 sleep studyPain Treatment Pain Level (N...\n", 77 | "5 .Per 7/06/79 Movement D/O 
note:\\n\n", 78 | "6 4, 5/18/78 Patient's thoughts about current su...\n", 79 | "7 10/24/89 CPT Code: 90801 - Psychiatric Diagnos...\n", 80 | "8 3/7/86 SOS-10 Total Score:\\n\n", 81 | "9 (4/10/71)Score-1Audit C Score Current:\\n\n", 82 | "dtype: object" 83 | ] 84 | }, 85 | "execution_count": 1, 86 | "metadata": {}, 87 | "output_type": "execute_result" 88 | } 89 | ], 90 | "source": [ 91 | "import pandas as pd\n", 92 | "\n", 93 | "doc = []\n", 94 | "with open('dates.txt') as file:\n", 95 | " for line in file:\n", 96 | " doc.append(line)\n", 97 | "\n", 98 | "df = pd.Series(doc)\n", 99 | "df.head(10)" 100 | ] 101 | }, 102 | { 103 | "cell_type": "code", 104 | "execution_count": 2, 105 | "metadata": {}, 106 | "outputs": [], 107 | "source": [ 108 | "def date_sorter():\n", 109 | " \n", 110 | " # Your code here\n", 111 | " # Full date\n", 112 | " global df\n", 113 | " dates_extracted = df.str.extractall(r'(?P<origin>(?P<month>\d?\d)[/|-](?P<day>\d?\d)[/|-](?P<year>\d{4}))')\n", 114 | " index_left = ~df.index.isin([x[0] for x in dates_extracted.index])\n", 115 | " dates_extracted = dates_extracted.append(df[index_left].str.extractall(r'(?P<origin>(?P<month>\d?\d)[/|-](?P<day>([0-2]?[0-9])|([3][01]))[/|-](?P<year>\d{2}))'))\n", 116 | " index_left = ~df.index.isin([x[0] for x in dates_extracted.index])\n", 117 | " del dates_extracted[3]\n", 118 | " del dates_extracted[4]\n", 119 | " dates_extracted = dates_extracted.append(df[index_left].str.extractall(r'(?P<origin>(?P<day>\d?\d) ?(?P<month>[a-zA-Z]{3,})\.?,? (?P<year>\d{4}))'))\n", 120 | " index_left = ~df.index.isin([x[0] for x in dates_extracted.index])\n", 121 | " dates_extracted = dates_extracted.append(df[index_left].str.extractall(r'(?P<origin>(?P<month>[a-zA-Z]{3,})\.?-? ?(?P<day>\d\d?)(th|nd|st)?,?-? ?(?P<year>\d{4}))'))\n", 122 | " del dates_extracted[3]\n", 123 | " index_left = ~df.index.isin([x[0] for x in dates_extracted.index])\n", 124 | "\n", 125 | " # Without day\n", 126 | " dates_without_day = df[index_left].str.extractall(r'(?P<origin>(?P<month>[A-Z][a-z]{2,}),?\.? (?P<year>\d{4}))')\n", 127 | " dates_without_day = dates_without_day.append(df[index_left].str.extractall(r'(?P<origin>(?P<month>\d\d?)/(?P<year>\d{4}))'))\n", 128 | " dates_without_day['day'] = 1\n", 129 | " dates_extracted = dates_extracted.append(dates_without_day)\n", 130 | " index_left = ~df.index.isin([x[0] for x in dates_extracted.index])\n", 131 | "\n", 132 | " # Only year\n", 133 | " dates_only_year = df[index_left].str.extractall(r'(?P<origin>(?P<year>\d{4}))')\n", 134 | " dates_only_year['day'] = 1\n", 135 | " dates_only_year['month'] = 1\n", 136 | " dates_extracted = dates_extracted.append(dates_only_year)\n", 137 | " index_left = ~df.index.isin([x[0] for x in dates_extracted.index])\n", 138 | "\n", 139 | " # Year\n", 140 | " dates_extracted['year'] = dates_extracted['year'].apply(lambda x: '19' + x if len(x) == 2 else x)\n", 141 | " dates_extracted['year'] = dates_extracted['year'].apply(lambda x: str(x))\n", 142 | "\n", 143 | " # Month\n", 144 | " dates_extracted['month'] = dates_extracted['month'].apply(lambda x: x[1:] if type(x) is str and x.startswith('0') else x)\n", 145 | " month_dict = dict({'September': 9, 'Mar': 3, 'November': 11, 'Jul': 7, 'January': 1, 'December': 12,\n", 146 | " 'Feb': 2, 'May': 5, 'Aug': 8, 'Jun': 6, 'Sep': 9, 'Oct': 10, 'June': 6, 'March': 3,\n", 147 | " 'February': 2, 'Dec': 12, 'Apr': 4, 'Jan': 1, 'Janaury': 1,'August': 8, 'October': 10,\n", 148 | " 'July': 7, 'Since': 1, 'Nov': 11, 'April': 4, 'Decemeber': 12, 'Age': 8})\n", 149 | " dates_extracted.replace({\"month\": month_dict}, inplace=True)\n", 150 | " dates_extracted['month'] = dates_extracted['month'].apply(lambda x: str(x))\n", 151 | "\n", 152 | " # Day\n", 153 | " dates_extracted['day'] = dates_extracted['day'].apply(lambda x: str(x))\n", 154 | "\n", 155 | " # Cleaned date\n", 156 | " dates_extracted['date'] = dates_extracted['month'] + '/' + dates_extracted['day'] + '/' + dates_extracted['year']\n", 157 | " dates_extracted['date'] = pd.to_datetime(dates_extracted['date'])\n", 158 | "\n", 159 | " 
dates_extracted.sort_values(by='date', inplace=True)\n", 160 | " df1 = pd.Series(list(dates_extracted.index.labels[0]))\n", 161 | " \n", 162 | " return df1 # Your answer here" 163 | ] 164 | }, 165 | { 166 | "cell_type": "code", 167 | "execution_count": 3, 168 | "metadata": {}, 169 | "outputs": [ 170 | { 171 | "name": "stdout", 172 | "output_type": "stream", 173 | "text": [ 174 | "0 9\n", 175 | "1 84\n", 176 | "2 2\n", 177 | "3 53\n", 178 | "4 28\n", 179 | "5 474\n", 180 | "6 153\n", 181 | "7 13\n", 182 | "8 129\n", 183 | "9 98\n", 184 | "10 111\n", 185 | "11 225\n", 186 | "12 31\n", 187 | "13 171\n", 188 | "14 191\n", 189 | "15 486\n", 190 | "16 335\n", 191 | "17 415\n", 192 | "18 36\n", 193 | "19 405\n", 194 | "20 323\n", 195 | "21 422\n", 196 | "22 375\n", 197 | "23 380\n", 198 | "24 345\n", 199 | "25 57\n", 200 | "26 481\n", 201 | "27 436\n", 202 | "28 104\n", 203 | "29 299\n", 204 | " ... \n", 205 | "470 220\n", 206 | "471 243\n", 207 | "472 208\n", 208 | "473 139\n", 209 | "474 320\n", 210 | "475 383\n", 211 | "476 286\n", 212 | "477 244\n", 213 | "478 480\n", 214 | "479 431\n", 215 | "480 279\n", 216 | "481 198\n", 217 | "482 381\n", 218 | "483 463\n", 219 | "484 366\n", 220 | "485 439\n", 221 | "486 255\n", 222 | "487 401\n", 223 | "488 475\n", 224 | "489 257\n", 225 | "490 152\n", 226 | "491 235\n", 227 | "492 464\n", 228 | "493 253\n", 229 | "494 231\n", 230 | "495 427\n", 231 | "496 141\n", 232 | "497 186\n", 233 | "498 161\n", 234 | "499 413\n", 235 | "Length: 500, dtype: int64\n" 236 | ] 237 | } 238 | ], 239 | "source": [ 240 | "print(date_sorter())" 241 | ] 242 | }, 243 | { 244 | "cell_type": "code", 245 | "execution_count": null, 246 | "metadata": { 247 | "collapsed": true 248 | }, 249 | "outputs": [], 250 | "source": [] 251 | } 252 | ], 253 | "metadata": { 254 | "coursera": { 255 | "course_slug": "python-text-mining", 256 | "graded_item_id": "LvcWI", 257 | "launcher_item_id": "krne9", 258 | "part_id": "Mkp1I" 259 | }, 260 | "kernelspec": { 261 | 
"display_name": "Python 3", 262 | "language": "python", 263 | "name": "python3" 264 | }, 265 | "language_info": { 266 | "codemirror_mode": { 267 | "name": "ipython", 268 | "version": 3 269 | }, 270 | "file_extension": ".py", 271 | "mimetype": "text/x-python", 272 | "name": "python", 273 | "nbconvert_exporter": "python", 274 | "pygments_lexer": "ipython3", 275 | "version": "3.6.2" 276 | } 277 | }, 278 | "nbformat": 4, 279 | "nbformat_minor": 2 280 | } 281 | -------------------------------------------------------------------------------- /Assignment+2.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "---\n", 8 | "\n", 9 | "_You are currently looking at **version 1.0** of this notebook. To download notebooks and datafiles, as well as get help on Jupyter notebooks in the Coursera platform, visit the [Jupyter Notebook FAQ](https://www.coursera.org/learn/python-text-mining/resources/d9pwm) course resource._\n", 10 | "\n", 11 | "---" 12 | ] 13 | }, 14 | { 15 | "cell_type": "markdown", 16 | "metadata": {}, 17 | "source": [ 18 | "# Assignment 2 - Introduction to NLTK\n", 19 | "\n", 20 | "In part 1 of this assignment you will use nltk to explore the Herman Melville novel Moby Dick. Then in part 2 you will create a spelling recommender function that uses nltk to find words similar to the misspelling. 
" 21 | ] 22 | }, 23 | { 24 | "cell_type": "markdown", 25 | "metadata": {}, 26 | "source": [ 27 | "## Part 1 - Analyzing Moby Dick" 28 | ] 29 | }, 30 | { 31 | "cell_type": "code", 32 | "execution_count": 1, 33 | "metadata": { 34 | "collapsed": true 35 | }, 36 | "outputs": [], 37 | "source": [ 38 | "import nltk\n", 39 | "import pandas as pd\n", 40 | "import numpy as np\n", 41 | "\n", 42 | "# If you would like to work with the raw text you can use 'moby_raw'\n", 43 | "with open('moby.txt', 'r') as f:\n", 44 | " moby_raw = f.read()\n", 45 | " \n", 46 | "# If you would like to work with the novel in nltk.Text format you can use 'text1'\n", 47 | "moby_tokens = nltk.word_tokenize(moby_raw)\n", 48 | "text1 = nltk.Text(moby_tokens)" 49 | ] 50 | }, 51 | { 52 | "cell_type": "markdown", 53 | "metadata": {}, 54 | "source": [ 55 | "### Example 1\n", 56 | "\n", 57 | "How many tokens (words and punctuation symbols) are in text1?\n", 58 | "\n", 59 | "*This function should return an integer.*" 60 | ] 61 | }, 62 | { 63 | "cell_type": "code", 64 | "execution_count": 2, 65 | "metadata": {}, 66 | "outputs": [ 67 | { 68 | "data": { 69 | "text/plain": [ 70 | "254989" 71 | ] 72 | }, 73 | "execution_count": 2, 74 | "metadata": {}, 75 | "output_type": "execute_result" 76 | } 77 | ], 78 | "source": [ 79 | "def example_one():\n", 80 | " \n", 81 | " return len(nltk.word_tokenize(moby_raw)) # or alternatively len(text1)\n", 82 | "\n", 83 | "example_one()" 84 | ] 85 | }, 86 | { 87 | "cell_type": "markdown", 88 | "metadata": {}, 89 | "source": [ 90 | "### Example 2\n", 91 | "\n", 92 | "How many unique tokens (unique words and punctuation) does text1 have?\n", 93 | "\n", 94 | "*This function should return an integer.*" 95 | ] 96 | }, 97 | { 98 | "cell_type": "code", 99 | "execution_count": 3, 100 | "metadata": {}, 101 | "outputs": [ 102 | { 103 | "data": { 104 | "text/plain": [ 105 | "20755" 106 | ] 107 | }, 108 | "execution_count": 3, 109 | "metadata": {}, 110 | "output_type": "execute_result" 111 
| } 112 | ], 113 | "source": [ 114 | "def example_two():\n", 115 | " \n", 116 | " return len(set(nltk.word_tokenize(moby_raw))) # or alternatively len(set(text1))\n", 117 | "\n", 118 | "example_two()" 119 | ] 120 | }, 121 | { 122 | "cell_type": "markdown", 123 | "metadata": {}, 124 | "source": [ 125 | "### Example 3\n", 126 | "\n", 127 | "After lemmatizing the verbs, how many unique tokens does text1 have?\n", 128 | "\n", 129 | "*This function should return an integer.*" 130 | ] 131 | }, 132 | { 133 | "cell_type": "code", 134 | "execution_count": 4, 135 | "metadata": {}, 136 | "outputs": [ 137 | { 138 | "data": { 139 | "text/plain": [ 140 | "16900" 141 | ] 142 | }, 143 | "execution_count": 4, 144 | "metadata": {}, 145 | "output_type": "execute_result" 146 | } 147 | ], 148 | "source": [ 149 | "from nltk.stem import WordNetLemmatizer\n", 150 | "\n", 151 | "def example_three():\n", 152 | "\n", 153 | " lemmatizer = WordNetLemmatizer()\n", 154 | " lemmatized = [lemmatizer.lemmatize(w,'v') for w in text1]\n", 155 | "\n", 156 | " return len(set(lemmatized))\n", 157 | "\n", 158 | "example_three()" 159 | ] 160 | }, 161 | { 162 | "cell_type": "markdown", 163 | "metadata": {}, 164 | "source": [ 165 | "### Question 1\n", 166 | "\n", 167 | "What is the lexical diversity of the given text input? (i.e. 
ratio of unique tokens to the total number of tokens)\n", 168 | "\n", 169 | "*This function should return a float.*" 170 | ] 171 | }, 172 | { 173 | "cell_type": "code", 174 | "execution_count": 5, 175 | "metadata": {}, 176 | "outputs": [ 177 | { 178 | "data": { 179 | "text/plain": [ 180 | "0.08139566804842562" 181 | ] 182 | }, 183 | "execution_count": 5, 184 | "metadata": {}, 185 | "output_type": "execute_result" 186 | } 187 | ], 188 | "source": [ 189 | "def answer_one():\n", 190 | " \n", 191 | " \n", 192 | " return example_two()/example_one()\n", 193 | "\n", 194 | "answer_one()" 195 | ] 196 | }, 197 | { 198 | "cell_type": "markdown", 199 | "metadata": {}, 200 | "source": [ 201 | "### Question 2\n", 202 | "\n", 203 | "What percentage of tokens is 'whale' or 'Whale'?\n", 204 | "\n", 205 | "*This function should return a float.*" 206 | ] 207 | }, 208 | { 209 | "cell_type": "code", 210 | "execution_count": 6, 211 | "metadata": {}, 212 | "outputs": [ 213 | { 214 | "data": { 215 | "text/plain": [ 216 | "0.4125668166077752" 217 | ] 218 | }, 219 | "execution_count": 6, 220 | "metadata": {}, 221 | "output_type": "execute_result" 222 | } 223 | ], 224 | "source": [ 225 | "def answer_two():\n", 226 | " \n", 227 | " \n", 228 | " return (text1.vocab()['whale'] + text1.vocab()['Whale']) / len(nltk.word_tokenize(moby_raw)) * 100 # Your answer here\n", 229 | "\n", 230 | "answer_two()" 231 | ] 232 | }, 233 | { 234 | "cell_type": "markdown", 235 | "metadata": {}, 236 | "source": [ 237 | "### Question 3\n", 238 | "\n", 239 | "What are the 20 most frequently occurring (unique) tokens in the text? What is their frequency?\n", 240 | "\n", 241 | "*This function should return a list of 20 tuples where each tuple is of the form `(token, frequency)`. 
The list should be sorted in descending order of frequency.*" 242 | ] 243 | }, 244 | { 245 | "cell_type": "code", 246 | "execution_count": 7, 247 | "metadata": {}, 248 | "outputs": [ 249 | { 250 | "data": { 251 | "text/plain": [ 252 | "[(',', 19204),\n", 253 | " ('the', 13715),\n", 254 | " ('.', 7308),\n", 255 | " ('of', 6513),\n", 256 | " ('and', 6010),\n", 257 | " ('a', 4545),\n", 258 | " ('to', 4515),\n", 259 | " (';', 4173),\n", 260 | " ('in', 3908),\n", 261 | " ('that', 2978),\n", 262 | " ('his', 2459),\n", 263 | " ('it', 2196),\n", 264 | " ('I', 2097),\n", 265 | " ('!', 1767),\n", 266 | " ('is', 1722),\n", 267 | " ('--', 1713),\n", 268 | " ('with', 1659),\n", 269 | " ('he', 1658),\n", 270 | " ('was', 1639),\n", 271 | " ('as', 1620)]" 272 | ] 273 | }, 274 | "execution_count": 7, 275 | "metadata": {}, 276 | "output_type": "execute_result" 277 | } 278 | ], 279 | "source": [ 280 | "def answer_three():\n", 281 | " import operator\n", 282 | "\n", 283 | " return sorted(text1.vocab().items(), key=operator.itemgetter(1), reverse=True)[:20] # Your answer here\n", 284 | "\n", 285 | "answer_three()" 286 | ] 287 | }, 288 | { 289 | "cell_type": "markdown", 290 | "metadata": {}, 291 | "source": [ 292 | "### Question 4\n", 293 | "\n", 294 | "What tokens have a length of greater than 5 and frequency of more than 150?\n", 295 | "\n", 296 | "*This function should return a sorted list of the tokens that match the above constraints. 
To sort your list, use `sorted()`*" 297 | ] 298 | }, 299 | { 300 | "cell_type": "code", 301 | "execution_count": 8, 302 | "metadata": {}, 303 | "outputs": [ 304 | { 305 | "data": { 306 | "text/plain": [ 307 | "['Captain',\n", 308 | " 'Pequod',\n", 309 | " 'Queequeg',\n", 310 | " 'Starbuck',\n", 311 | " 'almost',\n", 312 | " 'before',\n", 313 | " 'himself',\n", 314 | " 'little',\n", 315 | " 'seemed',\n", 316 | " 'should',\n", 317 | " 'though',\n", 318 | " 'through',\n", 319 | " 'whales',\n", 320 | " 'without']" 321 | ] 322 | }, 323 | "execution_count": 8, 324 | "metadata": {}, 325 | "output_type": "execute_result" 326 | } 327 | ], 328 | "source": [ 329 | "def answer_four():\n", 330 | " \n", 331 | " \n", 332 | " return sorted([token for token, freq in text1.vocab().items() if len(token) > 5 and freq > 150]) # Your answer here\n", 333 | "\n", 334 | "answer_four()" 335 | ] 336 | }, 337 | { 338 | "cell_type": "markdown", 339 | "metadata": {}, 340 | "source": [ 341 | "### Question 5\n", 342 | "\n", 343 | "Find the longest word in text1 and that word's length.\n", 344 | "\n", 345 | "*This function should return a tuple `(longest_word, length)`.*" 346 | ] 347 | }, 348 | { 349 | "cell_type": "code", 350 | "execution_count": 9, 351 | "metadata": {}, 352 | "outputs": [ 353 | { 354 | "data": { 355 | "text/plain": [ 356 | "(\"twelve-o'clock-at-night\", 23)" 357 | ] 358 | }, 359 | "execution_count": 9, 360 | "metadata": {}, 361 | "output_type": "execute_result" 362 | } 363 | ], 364 | "source": [ 365 | "def answer_five():\n", 366 | " import operator\n", 367 | " \n", 368 | " return sorted([(token, len(token))for token, freq in text1.vocab().items()], key=operator.itemgetter(1), reverse=True)[0] # Your answer here\n", 369 | "\n", 370 | "answer_five()" 371 | ] 372 | }, 373 | { 374 | "cell_type": "markdown", 375 | "metadata": {}, 376 | "source": [ 377 | "### Question 6\n", 378 | "\n", 379 | "What unique words have a frequency of more than 2000? 
What is their frequency?\n", 380 | "\n", 381 | "\"Hint: you may want to use `isalpha()` to check if the token is a word and not punctuation.\"\n", 382 | "\n", 383 | "*This function should return a list of tuples of the form `(frequency, word)` sorted in descending order of frequency.*" 384 | ] 385 | }, 386 | { 387 | "cell_type": "code", 388 | "execution_count": 10, 389 | "metadata": {}, 390 | "outputs": [ 391 | { 392 | "data": { 393 | "text/plain": [ 394 | "[(13715, 'the'),\n", 395 | " (6513, 'of'),\n", 396 | " (6010, 'and'),\n", 397 | " (4545, 'a'),\n", 398 | " (4515, 'to'),\n", 399 | " (3908, 'in'),\n", 400 | " (2978, 'that'),\n", 401 | " (2459, 'his'),\n", 402 | " (2196, 'it'),\n", 403 | " (2097, 'I')]" 404 | ] 405 | }, 406 | "execution_count": 10, 407 | "metadata": {}, 408 | "output_type": "execute_result" 409 | } 410 | ], 411 | "source": [ 412 | "def answer_six():\n", 413 | " import operator\n", 414 | " \n", 415 | " return sorted([(freq, token) for token, freq in text1.vocab().items() if freq > 2000 and token.isalpha()], key=operator.itemgetter(0), reverse=True) # Your answer here\n", 416 | "\n", 417 | "answer_six()" 418 | ] 419 | }, 420 | { 421 | "cell_type": "markdown", 422 | "metadata": {}, 423 | "source": [ 424 | "### Question 7\n", 425 | "\n", 426 | "What is the average number of tokens per sentence?\n", 427 | "\n", 428 | "*This function should return a float.*" 429 | ] 430 | }, 431 | { 432 | "cell_type": "code", 433 | "execution_count": 11, 434 | "metadata": {}, 435 | "outputs": [ 436 | { 437 | "data": { 438 | "text/plain": [ 439 | "25.881952902963864" 440 | ] 441 | }, 442 | "execution_count": 11, 443 | "metadata": {}, 444 | "output_type": "execute_result" 445 | } 446 | ], 447 | "source": [ 448 | "def answer_seven():\n", 449 | " \n", 450 | " \n", 451 | " return np.mean([len(nltk.word_tokenize(sent)) for sent in nltk.sent_tokenize(moby_raw)]) # Your answer here\n", 452 | "\n", 453 | "answer_seven()" 454 | ] 455 | }, 456 | { 457 | "cell_type": "markdown", 
458 | "metadata": {}, 459 | "source": [ 460 | "### Question 8\n", 461 | "\n", 462 | "What are the 5 most frequent parts of speech in this text? What is their frequency?\n", 463 | "\n", 464 | "*This function should return a list of tuples of the form `(part_of_speech, frequency)` sorted in descending order of frequency.*" 465 | ] 466 | }, 467 | { 468 | "cell_type": "code", 469 | "execution_count": 12, 470 | "metadata": {}, 471 | "outputs": [ 472 | { 473 | "data": { 474 | "text/plain": [ 475 | "[('NN', 32730), ('IN', 28657), ('DT', 25867), (',', 19204), ('JJ', 17620)]" 476 | ] 477 | }, 478 | "execution_count": 12, 479 | "metadata": {}, 480 | "output_type": "execute_result" 481 | } 482 | ], 483 | "source": [ 484 | "def answer_eight():\n", 485 | " from collections import Counter\n", 486 | " import operator\n", 487 | " \n", 488 | " return sorted(Counter([tag for token, tag in nltk.pos_tag(text1)]).items(), key=operator.itemgetter(1), reverse=True)[:5] # Your answer here\n", 489 | "\n", 490 | "answer_eight()" 491 | ] 492 | }, 493 | { 494 | "cell_type": "markdown", 495 | "metadata": {}, 496 | "source": [ 497 | "## Part 2 - Spelling Recommender\n", 498 | "\n", 499 | "For this part of the assignment you will create three different spelling recommenders that each take a list of misspelled words and recommend a correctly spelled word for every word in the list.\n", 500 | "\n", 501 | "For every misspelled word, the recommender should find the word in `correct_spellings` that has the shortest distance*, and starts with the same letter as the misspelled word, and return that word as a recommendation.\n", 502 | "\n", 503 | "*Each of the three different recommenders will use a different distance measure (outlined below).\n", 504 | "\n", 505 | "Each of the recommenders should provide recommendations for the three default words provided: `['cormulent', 'incendenece', 'validrate']`." 
506 | ] 507 | }, 508 | { 509 | "cell_type": "code", 510 | "execution_count": 13, 511 | "metadata": { 512 | "collapsed": true 513 | }, 514 | "outputs": [], 515 | "source": [ 516 | "from nltk.corpus import words\n", 517 | "\n", 518 | "correct_spellings = words.words()" 519 | ] 520 | }, 521 | { 522 | "cell_type": "markdown", 523 | "metadata": {}, 524 | "source": [ 525 | "### Question 9\n", 526 | "\n", 527 | "For this recommender, your function should provide recommendations for the three default words provided above using the following distance metric:\n", 528 | "\n", 529 | "**[Jaccard distance](https://en.wikipedia.org/wiki/Jaccard_index) on the trigrams of the two words.**\n", 530 | "\n", 531 | "*This function should return a list of length three:\n", 532 | "`['cormulent_reccomendation', 'incendenece_reccomendation', 'validrate_reccomendation']`.*" 533 | ] 534 | }, 535 | { 536 | "cell_type": "code", 537 | "execution_count": 14, 538 | "metadata": {}, 539 | "outputs": [ 540 | { 541 | "data": { 542 | "text/plain": [ 543 | "['corpulent', 'indecence', 'validate']" 544 | ] 545 | }, 546 | "execution_count": 14, 547 | "metadata": {}, 548 | "output_type": "execute_result" 549 | } 550 | ], 551 | "source": [ 552 | "def answer_nine(entries=['cormulent', 'incendenece', 'validrate']):\n", 553 | " result = []\n", 554 | " import operator\n", 555 | " for entry in entries:\n", 556 | " spell_list = [spell for spell in correct_spellings if spell.startswith(entry[0]) and len(spell) > 2]\n", 557 | " distance_list = [(spell, nltk.jaccard_distance(set(nltk.ngrams(entry, n=3)), set(nltk.ngrams(spell, n=3)))) for spell in spell_list]\n", 558 | "\n", 559 | " result.append(sorted(distance_list, key=operator.itemgetter(1))[0][0])\n", 560 | " \n", 561 | " return result # Your answer here\n", 562 | " \n", 563 | "answer_nine()" 564 | ] 565 | }, 566 | { 567 | "cell_type": "markdown", 568 | "metadata": {}, 569 | "source": [ 570 | "### Question 10\n", 571 | "\n", 572 | "For this recommender, your 
function should provide recommendations for the three default words provided above using the following distance metric:\n", 573 | "\n", 574 | "**[Jaccard distance](https://en.wikipedia.org/wiki/Jaccard_index) on the 4-grams of the two words.**\n", 575 | "\n", 576 | "*This function should return a list of length three:\n", 577 | "`['cormulent_reccomendation', 'incendenece_reccomendation', 'validrate_reccomendation']`.*" 578 | ] 579 | }, 580 | { 581 | "cell_type": "code", 582 | "execution_count": 15, 583 | "metadata": {}, 584 | "outputs": [ 585 | { 586 | "data": { 587 | "text/plain": [ 588 | "['cormus', 'incendiary', 'valid']" 589 | ] 590 | }, 591 | "execution_count": 15, 592 | "metadata": {}, 593 | "output_type": "execute_result" 594 | } 595 | ], 596 | "source": [ 597 | "def answer_ten(entries=['cormulent', 'incendenece', 'validrate']):\n", 598 | " result = []\n", 599 | " import operator\n", 600 | " for entry in entries:\n", 601 | " spell_list = [spell for spell in correct_spellings if spell.startswith(entry[0]) and len(spell) > 2]\n", 602 | " distance_list = [(spell, nltk.jaccard_distance(set(nltk.ngrams(entry, n=4)), set(nltk.ngrams(spell, n=4)))) for spell in spell_list]\n", 603 | "\n", 604 | " result.append(sorted(distance_list, key=operator.itemgetter(1))[0][0])\n", 605 | " \n", 606 | " return result # Your answer here\n", 607 | " \n", 608 | "answer_ten()" 609 | ] 610 | }, 611 | { 612 | "cell_type": "markdown", 613 | "metadata": {}, 614 | "source": [ 615 | "### Question 11\n", 616 | "\n", 617 | "For this recommender, your function should provide recommendations for the three default words provided above using the following distance metric:\n", 618 | "\n", 619 | "**[Edit distance on the two words with transpositions.](https://en.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_distance)**\n", 620 | "\n", 621 | "*This function should return a list of length three:\n", 622 | "`['cormulent_reccomendation', 'incendenece_reccomendation', 'validrate_reccomendation']`.*" 
623 | ] 624 | }, 625 | { 626 | "cell_type": "code", 627 | "execution_count": 16, 628 | "metadata": {}, 629 | "outputs": [ 630 | { 631 | "data": { 632 | "text/plain": [ 633 | "['corpulent', 'intendence', 'validate']" 634 | ] 635 | }, 636 | "execution_count": 16, 637 | "metadata": {}, 638 | "output_type": "execute_result" 639 | } 640 | ], 641 | "source": [ 642 | "def answer_eleven(entries=['cormulent', 'incendenece', 'validrate']):\n", 643 | " result = []\n", 644 | " import operator\n", 645 | " for entry in entries:\n", 646 | " spell_list = [spell for spell in correct_spellings if spell.startswith(entry[0]) and len(spell) > 2]\n", 647 | " distance_list = [(spell, nltk.edit_distance(entry, spell, transpositions=True)) for spell in spell_list]\n", 648 | "\n", 649 | " result.append(sorted(distance_list, key=operator.itemgetter(1))[0][0])\n", 650 | " \n", 651 | " return result# Your answer here \n", 652 | " \n", 653 | "answer_eleven()" 654 | ] 655 | }, 656 | { 657 | "cell_type": "code", 658 | "execution_count": null, 659 | "metadata": { 660 | "collapsed": true 661 | }, 662 | "outputs": [], 663 | "source": [] 664 | } 665 | ], 666 | "metadata": { 667 | "coursera": { 668 | "course_slug": "python-text-mining", 669 | "graded_item_id": "r35En", 670 | "launcher_item_id": "tCVfW", 671 | "part_id": "NTVgL" 672 | }, 673 | "kernelspec": { 674 | "display_name": "Python 3", 675 | "language": "python", 676 | "name": "python3" 677 | }, 678 | "language_info": { 679 | "codemirror_mode": { 680 | "name": "ipython", 681 | "version": 3 682 | }, 683 | "file_extension": ".py", 684 | "mimetype": "text/x-python", 685 | "name": "python", 686 | "nbconvert_exporter": "python", 687 | "pygments_lexer": "ipython3", 688 | "version": "3.6.2" 689 | } 690 | }, 691 | "nbformat": 4, 692 | "nbformat_minor": 2 693 | } 694 | -------------------------------------------------------------------------------- /Assignment+3.ipynb: -------------------------------------------------------------------------------- 1 | { 
2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "---\n", 8 | "\n", 9 | "_You are currently looking at **version 1.0** of this notebook. To download notebooks and datafiles, as well as get help on Jupyter notebooks in the Coursera platform, visit the [Jupyter Notebook FAQ](https://www.coursera.org/learn/python-text-mining/resources/d9pwm) course resource._\n", 10 | "\n", 11 | "---" 12 | ] 13 | }, 14 | { 15 | "cell_type": "markdown", 16 | "metadata": {}, 17 | "source": [ 18 | "# Assignment 3\n", 19 | "\n", 20 | "In this assignment you will explore text message data and create models to predict if a message is spam or not. " 21 | ] 22 | }, 23 | { 24 | "cell_type": "code", 25 | "execution_count": 1, 26 | "metadata": {}, 27 | "outputs": [ 28 | { 29 | "data": { 30 | "text/html": [ 31 | "
<div>\n", 32 | "<style>\n", 33 | " .dataframe thead tr:only-child th {\n", 34 | " text-align: right;\n", 35 | " }\n", 36 | "\n", 37 | " .dataframe thead th {\n", 38 | " text-align: left;\n", 39 | " }\n", 40 | "\n", 41 | " .dataframe tbody tr th {\n", 42 | " vertical-align: top;\n", 43 | " }\n", 44 | "</style>\n", 45 | "<table border=\"1\" class=\"dataframe\">\n", 46 | " <thead>\n", 47 | " <tr style=\"text-align: right;\">\n", 48 | " <th></th>\n", 49 | " <th>text</th>\n", 50 | " <th>target</th>\n", 51 | " </tr>\n", 52 | " </thead>\n", 53 | " <tbody>\n", 54 | " <tr>\n", 55 | " <th>0</th>\n", 56 | " <td>Go until jurong point, crazy.. Available only ...</td>\n", 57 | " <td>0</td>\n", 58 | " </tr>\n", 59 | " <tr>\n", 60 | " <th>1</th>\n", 61 | " <td>Ok lar... Joking wif u oni...</td>\n", 62 | " <td>0</td>\n", 63 | " </tr>\n", 64 | " <tr>\n", 65 | " <th>2</th>\n", 66 | " <td>Free entry in 2 a wkly comp to win FA Cup fina...</td>\n", 67 | " <td>1</td>\n", 68 | " </tr>\n", 69 | " <tr>\n", 70 | " <th>3</th>\n", 71 | " <td>U dun say so early hor... U c already then say...</td>\n", 72 | " <td>0</td>\n", 73 | " </tr>\n", 74 | " <tr>\n", 75 | " <th>4</th>\n", 76 | " <td>Nah I don't think he goes to usf, he lives aro...</td>\n", 77 | " <td>0</td>\n", 78 | " </tr>\n", 79 | " <tr>\n", 80 | " <th>5</th>\n", 81 | " <td>FreeMsg Hey there darling it's been 3 week's n...</td>\n", 82 | " <td>1</td>\n", 83 | " </tr>\n", 84 | " <tr>\n", 85 | " <th>6</th>\n", 86 | " <td>Even my brother is not like to speak with me. ...</td>\n", 87 | " <td>0</td>\n", 88 | " </tr>\n", 89 | " <tr>\n", 90 | " <th>7</th>\n", 91 | " <td>As per your request 'Melle Melle (Oru Minnamin...</td>\n", 92 | " <td>0</td>\n", 93 | " </tr>\n", 94 | " <tr>\n", 95 | " <th>8</th>\n", 96 | " <td>WINNER!! As a valued network customer you have...</td>\n", 97 | " <td>1</td>\n", 98 | " </tr>\n", 99 | " <tr>\n", 100 | " <th>9</th>\n", 101 | " <td>Had your mobile 11 months or more? U R entitle...</td>\n", 102 | " <td>1</td>\n", 103 | " </tr>\n", 104 | " </tbody>\n", 105 | "</table>\n", 106 | "</div>
" 107 | ], 108 | "text/plain": [ 109 | " text target\n", 110 | "0 Go until jurong point, crazy.. Available only ... 0\n", 111 | "1 Ok lar... Joking wif u oni... 0\n", 112 | "2 Free entry in 2 a wkly comp to win FA Cup fina... 1\n", 113 | "3 U dun say so early hor... U c already then say... 0\n", 114 | "4 Nah I don't think he goes to usf, he lives aro... 0\n", 115 | "5 FreeMsg Hey there darling it's been 3 week's n... 1\n", 116 | "6 Even my brother is not like to speak with me. ... 0\n", 117 | "7 As per your request 'Melle Melle (Oru Minnamin... 0\n", 118 | "8 WINNER!! As a valued network customer you have... 1\n", 119 | "9 Had your mobile 11 months or more? U R entitle... 1" 120 | ] 121 | }, 122 | "execution_count": 1, 123 | "metadata": {}, 124 | "output_type": "execute_result" 125 | } 126 | ], 127 | "source": [ 128 | "import pandas as pd\n", 129 | "import numpy as np\n", 130 | "\n", 131 | "spam_data = pd.read_csv('spam.csv')\n", 132 | "\n", 133 | "spam_data['target'] = np.where(spam_data['target']=='spam',1,0)\n", 134 | "spam_data.head(10)" 135 | ] 136 | }, 137 | { 138 | "cell_type": "code", 139 | "execution_count": 2, 140 | "metadata": { 141 | "collapsed": true 142 | }, 143 | "outputs": [], 144 | "source": [ 145 | "from sklearn.model_selection import train_test_split\n", 146 | "\n", 147 | "\n", 148 | "X_train, X_test, y_train, y_test = train_test_split(spam_data['text'], \n", 149 | " spam_data['target'], \n", 150 | " random_state=0)" 151 | ] 152 | }, 153 | { 154 | "cell_type": "markdown", 155 | "metadata": {}, 156 | "source": [ 157 | "### Question 1\n", 158 | "What percentage of the documents in `spam_data` are spam?" 
159 | ] 160 | }, 161 | { 162 | "cell_type": "code", 163 | "execution_count": 3, 164 | "metadata": { 165 | "collapsed": true 166 | }, 167 | "outputs": [], 168 | "source": [ 169 | "def answer_one():\n", 170 | " \n", 171 | " return len(spam_data[spam_data['target'] == 1]) / len(spam_data) * 100" 172 | ] 173 | }, 174 | { 175 | "cell_type": "code", 176 | "execution_count": 4, 177 | "metadata": {}, 178 | "outputs": [ 179 | { 180 | "data": { 181 | "text/plain": [ 182 | "13.406317300789663" 183 | ] 184 | }, 185 | "execution_count": 4, 186 | "metadata": {}, 187 | "output_type": "execute_result" 188 | } 189 | ], 190 | "source": [ 191 | "answer_one()" 192 | ] 193 | }, 194 | { 195 | "cell_type": "markdown", 196 | "metadata": {}, 197 | "source": [ 198 | "### Question 2\n", 199 | "\n", 200 | "Fit the training data `X_train` using a Count Vectorizer with default parameters.\n", 201 | "\n", 202 | "What is the longest token in the vocabulary?\n", 203 | "\n", 204 | "*This function should return a string.*" 205 | ] 206 | }, 207 | { 208 | "cell_type": "code", 209 | "execution_count": 5, 210 | "metadata": { 211 | "collapsed": true 212 | }, 213 | "outputs": [], 214 | "source": [ 215 | "from sklearn.feature_extraction.text import CountVectorizer\n", 216 | "\n", 217 | "def answer_two():\n", 218 | " import operator\n", 219 | "\n", 220 | " vectorizer = CountVectorizer()\n", 221 | " vectorizer.fit(X_train)\n", 222 | " \n", 223 | " return sorted([(token, len(token)) for token in vectorizer.vocabulary_.keys()], key=operator.itemgetter(1), reverse=True)[0][0]" 224 | ] 225 | }, 226 | { 227 | "cell_type": "code", 228 | "execution_count": 6, 229 | "metadata": {}, 230 | "outputs": [ 231 | { 232 | "data": { 233 | "text/plain": [ 234 | "'com1win150ppmx3age16subscription'" 235 | ] 236 | }, 237 | "execution_count": 6, 238 | "metadata": {}, 239 | "output_type": "execute_result" 240 | } 241 | ], 242 | "source": [ 243 | "answer_two()" 244 | ] 245 | }, 246 | { 247 | "cell_type": "markdown", 248 | 
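The sort in `answer_two` builds and orders the whole `(token, length)` list; `max` with `key=len` finds the same token in one pass. A sketch on a toy corpus (not the SMS data):

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ['the quick brown fox', 'a supercalifragilistic message', 'short one']

vectorizer = CountVectorizer()
vectorizer.fit(docs)

# The longest token is just the max of the vocabulary keys by length
longest = max(vectorizer.vocabulary_.keys(), key=len)
print(longest)  # supercalifragilistic
```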
"metadata": {}, 249 | "source": [ 250 | "### Question 3\n", 251 | "\n", 252 | "Fit and transform the training data `X_train` using a Count Vectorizer with default parameters.\n", 253 | "\n", 254 | "Next, fit a fit a multinomial Naive Bayes classifier model with smoothing `alpha=0.1`. Find the area under the curve (AUC) score using the transformed test data.\n", 255 | "\n", 256 | "*This function should return the AUC score as a float.*" 257 | ] 258 | }, 259 | { 260 | "cell_type": "code", 261 | "execution_count": 7, 262 | "metadata": { 263 | "collapsed": true 264 | }, 265 | "outputs": [], 266 | "source": [ 267 | "from sklearn.naive_bayes import MultinomialNB\n", 268 | "from sklearn.metrics import roc_auc_score\n", 269 | "\n", 270 | "def answer_three():\n", 271 | " vectorizer = CountVectorizer()\n", 272 | " X_train_transformed = vectorizer.fit_transform(X_train)\n", 273 | " X_test_transformed = vectorizer.transform(X_test)\n", 274 | "\n", 275 | " clf = MultinomialNB(alpha=0.1)\n", 276 | " clf.fit(X_train_transformed, y_train)\n", 277 | "\n", 278 | " y_predicted = clf.predict(X_test_transformed)\n", 279 | " \n", 280 | " return roc_auc_score(y_test, y_predicted)" 281 | ] 282 | }, 283 | { 284 | "cell_type": "code", 285 | "execution_count": 8, 286 | "metadata": {}, 287 | "outputs": [ 288 | { 289 | "data": { 290 | "text/plain": [ 291 | "0.97208121827411165" 292 | ] 293 | }, 294 | "execution_count": 8, 295 | "metadata": {}, 296 | "output_type": "execute_result" 297 | } 298 | ], 299 | "source": [ 300 | "answer_three()" 301 | ] 302 | }, 303 | { 304 | "cell_type": "markdown", 305 | "metadata": {}, 306 | "source": [ 307 | "### Question 4\n", 308 | "\n", 309 | "Fit and transform the training data `X_train` using a Tfidf Vectorizer with default parameters.\n", 310 | "\n", 311 | "What 20 features have the smallest tf-idf and what 20 have the largest tf-idf?\n", 312 | "\n", 313 | "Put these features in a two series where each series is sorted by tf-idf value and then alphabetically 
by feature name. The index of the series should be the feature name, and the data should be the tf-idf.\n", 314 | "\n", 315 | "The series of 20 features with smallest tf-idfs should be sorted smallest tfidf first, the list of 20 features with largest tf-idfs should be sorted largest first. \n", 316 | "\n", 317 | "*This function should return a tuple of two series\n", 318 | "`(smallest tf-idfs series, largest tf-idfs series)`.*" 319 | ] 320 | }, 321 | { 322 | "cell_type": "code", 323 | "execution_count": 9, 324 | "metadata": { 325 | "collapsed": true 326 | }, 327 | "outputs": [], 328 | "source": [ 329 | "from sklearn.feature_extraction.text import TfidfVectorizer\n", 330 | "\n", 331 | "def answer_four():\n", 332 | " import operator\n", 333 | "\n", 334 | " vectorizer = TfidfVectorizer()\n", 335 | " X_train_transformed = vectorizer.fit_transform(X_train)\n", 336 | "\n", 337 | " feature_names = vectorizer.get_feature_names()\n", 338 | " idfs = vectorizer.idf_\n", 339 | " names_idfs = list(zip(feature_names, idfs))\n", 340 | "\n", 341 | " smallest = sorted(names_idfs, key=operator.itemgetter(1))[:20]\n", 342 | " smallest = pd.Series([features[1] for features in smallest], index=[features[0] for features in smallest])\n", 343 | "\n", 344 | " largest = sorted(names_idfs, key=operator.itemgetter(1), reverse=True)[:20]\n", 345 | " # largest = sorted(names_idfs, key=operator.itemgetter(1,0), reverse=True)[:20]\n", 346 | " largest = sorted(largest, key=operator.itemgetter(0))\n", 347 | " largest = pd.Series([features[1] for features in largest], index=[features[0] for features in largest])\n", 348 | " \n", 349 | " return (smallest, largest)" 350 | ] 351 | }, 352 | { 353 | "cell_type": "code", 354 | "execution_count": 10, 355 | "metadata": {}, 356 | "outputs": [ 357 | { 358 | "data": { 359 | "text/plain": [ 360 | "(to 2.198406\n", 361 | " you 2.265645\n", 362 | " the 2.707383\n", 363 | " in 2.890761\n", 364 | " and 2.976764\n", 365 | " is 3.003012\n", 366 | " me 3.111530\n", 
367 | " for 3.206840\n", 368 | " it 3.222174\n", 369 | " my 3.231044\n", 370 | " call 3.297812\n", 371 | " your 3.300196\n", 372 | " of 3.319473\n", 373 | " have 3.354130\n", 374 | " that 3.408477\n", 375 | " on 3.463136\n", 376 | " now 3.465949\n", 377 | " can 3.545053\n", 378 | " are 3.560414\n", 379 | " so 3.566625\n", 380 | " dtype: float64, 000pes 8.644919\n", 381 | " 0089 8.644919\n", 382 | " 0121 8.644919\n", 383 | " 01223585236 8.644919\n", 384 | " 0125698789 8.644919\n", 385 | " 02072069400 8.644919\n", 386 | " 02073162414 8.644919\n", 387 | " 02085076972 8.644919\n", 388 | " 021 8.644919\n", 389 | " 0430 8.644919\n", 390 | " 07008009200 8.644919\n", 391 | " 07099833605 8.644919\n", 392 | " 07123456789 8.644919\n", 393 | " 0721072 8.644919\n", 394 | " 07753741225 8.644919\n", 395 | " 077xxx 8.644919\n", 396 | " 078 8.644919\n", 397 | " 07808247860 8.644919\n", 398 | " 07808726822 8.644919\n", 399 | " 078498 8.644919\n", 400 | " dtype: float64)" 401 | ] 402 | }, 403 | "execution_count": 10, 404 | "metadata": {}, 405 | "output_type": "execute_result" 406 | } 407 | ], 408 | "source": [ 409 | "answer_four()" 410 | ] 411 | }, 412 | { 413 | "cell_type": "markdown", 414 | "metadata": {}, 415 | "source": [ 416 | "### Question 5\n", 417 | "\n", 418 | "Fit and transform the training data `X_train` using a Tfidf Vectorizer ignoring terms that have a document frequency strictly lower than **3**.\n", 419 | "\n", 420 | "Then fit a multinomial Naive Bayes classifier model with smoothing `alpha=0.1` and compute the area under the curve (AUC) score using the transformed test data.\n", 421 | "\n", 422 | "*This function should return the AUC score as a float.*" 423 | ] 424 | }, 425 | { 426 | "cell_type": "code", 427 | "execution_count": 11, 428 | "metadata": { 429 | "collapsed": true 430 | }, 431 | "outputs": [], 432 | "source": [ 433 | "def answer_five():\n", 434 | " vectorizer = TfidfVectorizer(min_df=3)\n", 435 | " X_train_transformed = 
vectorizer.fit_transform(X_train)\n", 436 | " X_test_transformed = vectorizer.transform(X_test)\n", 437 | "\n", 438 | " clf = MultinomialNB(alpha=0.1)\n", 439 | " clf.fit(X_train_transformed, y_train)\n", 440 | "\n", 441 | " # y_predicted_prob = clf.predict_proba(X_test_transformed)[:, 1]\n", 442 | " y_predicted = clf.predict(X_test_transformed)\n", 443 | "\n", 444 | " # return roc_auc_score(y_test, y_predicted_prob) #Your answer here\n", 445 | " return roc_auc_score(y_test, y_predicted)" 446 | ] 447 | }, 448 | { 449 | "cell_type": "code", 450 | "execution_count": 12, 451 | "metadata": {}, 452 | "outputs": [ 453 | { 454 | "data": { 455 | "text/plain": [ 456 | "0.94162436548223349" 457 | ] 458 | }, 459 | "execution_count": 12, 460 | "metadata": {}, 461 | "output_type": "execute_result" 462 | } 463 | ], 464 | "source": [ 465 | "answer_five()" 466 | ] 467 | }, 468 | { 469 | "cell_type": "markdown", 470 | "metadata": {}, 471 | "source": [ 472 | "### Question 6\n", 473 | "\n", 474 | "What is the average length of documents (number of characters) for not spam and spam documents?\n", 475 | "\n", 476 | "*This function should return a tuple (average length not spam, average length spam).*" 477 | ] 478 | }, 479 | { 480 | "cell_type": "code", 481 | "execution_count": 13, 482 | "metadata": { 483 | "collapsed": true 484 | }, 485 | "outputs": [], 486 | "source": [ 487 | "def answer_six():\n", 488 | " spam_data['length'] = spam_data['text'].apply(lambda x:len(x))\n", 489 | " \n", 490 | " return (np.mean(spam_data['length'][spam_data['target'] == 0]), np.mean(spam_data['length'][spam_data['target'] == 1]))#Your answer here" 491 | ] 492 | }, 493 | { 494 | "cell_type": "code", 495 | "execution_count": 14, 496 | "metadata": {}, 497 | "outputs": [ 498 | { 499 | "data": { 500 | "text/plain": [ 501 | "(71.02362694300518, 138.8661311914324)" 502 | ] 503 | }, 504 | "execution_count": 14, 505 | "metadata": {}, 506 | "output_type": "execute_result" 507 | } 508 | ], 509 | "source": [ 510 | 
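The effect of `min_df` is easy to see on a toy corpus: any term that appears in fewer documents than the threshold is dropped from the vocabulary entirely.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ['apple banana', 'apple cherry', 'apple banana cherry', 'apple durian']

# min_df=3 keeps only terms that occur in at least 3 of the 4 documents
vect = TfidfVectorizer(min_df=3)
vect.fit(docs)
print(sorted(vect.vocabulary_))  # ['apple']
```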
"answer_six()" 511 | ] 512 | }, 513 | { 514 | "cell_type": "markdown", 515 | "metadata": {}, 516 | "source": [ 517 | "
\n", 518 | "
\n", 519 | "The following function has been provided to help you combine new features into the training data:" 520 | ] 521 | }, 522 | { 523 | "cell_type": "code", 524 | "execution_count": 15, 525 | "metadata": { 526 | "collapsed": true 527 | }, 528 | "outputs": [], 529 | "source": [ 530 | "def add_feature(X, feature_to_add):\n", 531 | " \"\"\"\n", 532 | " Returns sparse feature matrix with added feature.\n", 533 | " feature_to_add can also be a list of features.\n", 534 | " \"\"\"\n", 535 | " from scipy.sparse import csr_matrix, hstack\n", 536 | " return hstack([X, csr_matrix(feature_to_add).T], 'csr')\n" 537 | ] 538 | }, 539 | { 540 | "cell_type": "markdown", 541 | "metadata": {}, 542 | "source": [ 543 | "### Question 7\n", 544 | "\n", 545 | "Fit and transform the training data X_train using a Tfidf Vectorizer ignoring terms that have a document frequency strictly lower than **5**.\n", 546 | "\n", 547 | "Using this document-term matrix and an additional feature, **the length of document (number of characters)**, fit a Support Vector Classification model with regularization `C=10000`. 
Then compute the area under the curve (AUC) score using the transformed test data.\n", 548 | "\n", 549 | "*This function should return the AUC score as a float.*" 550 | ] 551 | }, 552 | { 553 | "cell_type": "code", 554 | "execution_count": 16, 555 | "metadata": { 556 | "collapsed": true 557 | }, 558 | "outputs": [], 559 | "source": [ 560 | "from sklearn.svm import SVC\n", 561 | "\n", 562 | "def answer_seven():\n", 563 | " vectorizer = TfidfVectorizer(min_df=5)\n", 564 | "\n", 565 | " X_train_transformed = vectorizer.fit_transform(X_train)\n", 566 | " X_train_transformed_with_length = add_feature(X_train_transformed, X_train.str.len())\n", 567 | "\n", 568 | " X_test_transformed = vectorizer.transform(X_test)\n", 569 | " X_test_transformed_with_length = add_feature(X_test_transformed, X_test.str.len())\n", 570 | "\n", 571 | " clf = SVC(C=10000)\n", 572 | "\n", 573 | " clf.fit(X_train_transformed_with_length, y_train)\n", 574 | "\n", 575 | " y_predicted = clf.predict(X_test_transformed_with_length)\n", 576 | " \n", 577 | " return roc_auc_score(y_test, y_predicted)" 578 | ] 579 | }, 580 | { 581 | "cell_type": "code", 582 | "execution_count": 17, 583 | "metadata": {}, 584 | "outputs": [ 585 | { 586 | "data": { 587 | "text/plain": [ 588 | "0.95813668234215565" 589 | ] 590 | }, 591 | "execution_count": 17, 592 | "metadata": {}, 593 | "output_type": "execute_result" 594 | } 595 | ], 596 | "source": [ 597 | "answer_seven()" 598 | ] 599 | }, 600 | { 601 | "cell_type": "markdown", 602 | "metadata": {}, 603 | "source": [ 604 | "### Question 8\n", 605 | "\n", 606 | "What is the average number of digits per document for not spam and spam documents?\n", 607 | "\n", 608 | "*This function should return a tuple (average # digits not spam, average # digits spam).*" 609 | ] 610 | }, 611 | { 612 | "cell_type": "code", 613 | "execution_count": 18, 614 | "metadata": { 615 | "collapsed": true 616 | }, 617 | "outputs": [], 618 | "source": [ 619 | "def answer_eight():\n", 620 | " 
spam_data['length'] = spam_data['text'].apply(lambda x: len(''.join([a for a in x if a.isdigit()])))\n", 621 | " \n", 622 | " return (np.mean(spam_data['length'][spam_data['target'] == 0]), np.mean(spam_data['length'][spam_data['target'] == 1]))" 623 | ] 624 | }, 625 | { 626 | "cell_type": "code", 627 | "execution_count": 19, 628 | "metadata": {}, 629 | "outputs": [ 630 | { 631 | "data": { 632 | "text/plain": [ 633 | "(0.2992746113989637, 15.759036144578314)" 634 | ] 635 | }, 636 | "execution_count": 19, 637 | "metadata": {}, 638 | "output_type": "execute_result" 639 | } 640 | ], 641 | "source": [ 642 | "answer_eight()" 643 | ] 644 | }, 645 | { 646 | "cell_type": "markdown", 647 | "metadata": {}, 648 | "source": [ 649 | "### Question 9\n", 650 | "\n", 651 | "Fit and transform the training data `X_train` using a Tfidf Vectorizer ignoring terms that have a document frequency strictly lower than **5** and using **word n-grams from n=1 to n=3** (unigrams, bigrams, and trigrams).\n", 652 | "\n", 653 | "Using this document-term matrix and the following additional features:\n", 654 | "* the length of document (number of characters)\n", 655 | "* **number of digits per document**\n", 656 | "\n", 657 | "fit a Logistic Regression model with regularization `C=100`. 
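The digit count in `answer_eight` joins the digit characters and measures the result; `Series.str.count` with a `\d` pattern is an equivalent vectorized form. A sketch on toy data:

```python
import pandas as pd

texts = pd.Series(['call 07123456789 now', 'see you at 5', 'hello there'])

# Number of digit characters per document, without building intermediate strings
digit_counts = texts.str.count(r'\d')
print(list(digit_counts))  # [11, 1, 0]
```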
Then compute the area under the curve (AUC) score using the transformed test data.\n", 658 | "\n", 659 | "*This function should return the AUC score as a float.*" 660 | ] 661 | }, 662 | { 663 | "cell_type": "code", 664 | "execution_count": 20, 665 | "metadata": { 666 | "collapsed": true 667 | }, 668 | "outputs": [], 669 | "source": [ 670 | "from sklearn.linear_model import LogisticRegression\n", 671 | "\n", 672 | "def answer_nine():\n", 673 | " vectorizer = TfidfVectorizer(min_df=5, ngram_range=[1,3])\n", 674 | "\n", 675 | " X_train_transformed = vectorizer.fit_transform(X_train)\n", 676 | " X_train_transformed_with_length = add_feature(X_train_transformed, [X_train.str.len(),\n", 677 | " X_train.apply(lambda x: len(''.join([a for a in x if a.isdigit()])))])\n", 678 | "\n", 679 | " X_test_transformed = vectorizer.transform(X_test)\n", 680 | " X_test_transformed_with_length = add_feature(X_test_transformed, [X_test.str.len(),\n", 681 | " X_test.apply(lambda x: len(''.join([a for a in x if a.isdigit()])))])\n", 682 | "\n", 683 | " clf = LogisticRegression(C=100)\n", 684 | "\n", 685 | " clf.fit(X_train_transformed_with_length, y_train)\n", 686 | "\n", 687 | " y_predicted = clf.predict(X_test_transformed_with_length)\n", 688 | "\n", 689 | " return roc_auc_score(y_test, y_predicted)" 690 | ] 691 | }, 692 | { 693 | "cell_type": "code", 694 | "execution_count": 21, 695 | "metadata": {}, 696 | "outputs": [ 697 | { 698 | "data": { 699 | "text/plain": [ 700 | "0.96787090640544626" 701 | ] 702 | }, 703 | "execution_count": 21, 704 | "metadata": {}, 705 | "output_type": "execute_result" 706 | } 707 | ], 708 | "source": [ 709 | "answer_nine()" 710 | ] 711 | }, 712 | { 713 | "cell_type": "markdown", 714 | "metadata": {}, 715 | "source": [ 716 | "### Question 10\n", 717 | "\n", 718 | "What is the average number of non-word characters (anything other than a letter, digit or underscore) per document for not spam and spam documents?\n", 719 | "\n", 720 | "*This function should 
return a tuple (average # non-word characters not spam, average # non-word characters spam).*" 721 | ] 722 | }, 723 | { 724 | "cell_type": "code", 725 | "execution_count": 22, 726 | "metadata": { 727 | "collapsed": true 728 | }, 729 | "outputs": [], 730 | "source": [ 731 | "def answer_ten():\n", 732 | " spam_data['length'] = spam_data['text'].str.findall(r'(\\W)').str.len()\n", 733 | " \n", 734 | " return (np.mean(spam_data['length'][spam_data['target'] == 0]), np.mean(spam_data['length'][spam_data['target'] == 1]))" 735 | ] 736 | }, 737 | { 738 | "cell_type": "code", 739 | "execution_count": 23, 740 | "metadata": {}, 741 | "outputs": [ 742 | { 743 | "data": { 744 | "text/plain": [ 745 | "(17.29181347150259, 29.041499330655956)" 746 | ] 747 | }, 748 | "execution_count": 23, 749 | "metadata": {}, 750 | "output_type": "execute_result" 751 | } 752 | ], 753 | "source": [ 754 | "answer_ten()" 755 | ] 756 | }, 757 | { 758 | "cell_type": "markdown", 759 | "metadata": {}, 760 | "source": [ 761 | "### Question 11\n", 762 | "\n", 763 | "Fit and transform the training data X_train using a Count Vectorizer ignoring terms that have a document frequency strictly lower than **5** and using **character n-grams from n=2 to n=5.**\n", 764 | "\n", 765 | "To tell Count Vectorizer to use character n-grams pass in `analyzer='char_wb'` which creates character n-grams only from text inside word boundaries. This should make the model more robust to spelling mistakes.\n", 766 | "\n", 767 | "Using this document-term matrix and the following additional features:\n", 768 | "* the length of document (number of characters)\n", 769 | "* number of digits per document\n", 770 | "* **number of non-word characters (anything other than a letter, digit or underscore.)**\n", 771 | "\n", 772 | "fit a Logistic Regression model with regularization C=100. 
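Question 10's `\W` class (anything other than a letter, digit or underscore) can be checked on toy data with the same counting idiom `answer_ten` uses:

```python
import pandas as pd

texts = pd.Series(['Hi there!', 'win $$$ now!!', 'ok'])

# \W matches anything other than a letter, digit or underscore
non_word = texts.str.count(r'\W')
print(list(non_word))  # [2, 7, 0]
```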
Then compute the area under the curve (AUC) score using the transformed test data.\n", 773 | "\n", 774 | "Also **find the 10 smallest and 10 largest coefficients from the model** and return them along with the AUC score in a tuple.\n", 775 | "\n", 776 | "The list of 10 smallest coefficients should be sorted smallest first, the list of 10 largest coefficients should be sorted largest first.\n", 777 | "\n", 778 | "The three features that were added to the document term matrix should have the following names should they appear in the list of coefficients:\n", 779 | "['length_of_doc', 'digit_count', 'non_word_char_count']\n", 780 | "\n", 781 | "*This function should return a tuple `(AUC score as a float, smallest coefs list, largest coefs list)`.*" 782 | ] 783 | }, 784 | { 785 | "cell_type": "code", 786 | "execution_count": 24, 787 | "metadata": { 788 | "collapsed": true 789 | }, 790 | "outputs": [], 791 | "source": [ 792 | "def answer_eleven():\n", 793 | " vectorizer = CountVectorizer(min_df=5, analyzer='char_wb', ngram_range=[2,5])\n", 794 | "\n", 795 | " X_train_transformed = vectorizer.fit_transform(X_train)\n", 796 | " X_train_transformed_with_length = add_feature(X_train_transformed, [X_train.str.len(),\n", 797 | " X_train.apply(lambda x: len(''.join([a for a in x if a.isdigit()]))),\n", 798 | " X_train.str.findall(r'(\\W)').str.len()])\n", 799 | "\n", 800 | " X_test_transformed = vectorizer.transform(X_test)\n", 801 | " X_test_transformed_with_length = add_feature(X_test_transformed, [X_test.str.len(),\n", 802 | " X_test.apply(lambda x: len(''.join([a for a in x if a.isdigit()]))),\n", 803 | " X_test.str.findall(r'(\\W)').str.len()])\n", 804 | "\n", 805 | " clf = LogisticRegression(C=100)\n", 806 | "\n", 807 | " clf.fit(X_train_transformed_with_length, y_train)\n", 808 | "\n", 809 | " y_predicted = clf.predict(X_test_transformed_with_length)\n", 810 | "\n", 811 | " auc = roc_auc_score(y_test, y_predicted)\n", 812 | "\n", 813 | " feature_names = 
np.array(vectorizer.get_feature_names() + ['length_of_doc', 'digit_count', 'non_word_char_count'])\n", 814 | " sorted_coef_index = clf.coef_[0].argsort()\n", 815 | " smallest = feature_names[sorted_coef_index[:10]]\n", 816 | " largest = feature_names[sorted_coef_index[:-11:-1]]\n", 817 | "\n", 818 | " return (auc, list(smallest), list(largest))" 819 | ] 820 | }, 821 | { 822 | "cell_type": "code", 823 | "execution_count": 25, 824 | "metadata": {}, 825 | "outputs": [ 826 | { 827 | "data": { 828 | "text/plain": [ 829 | "(0.97885931107074342,\n", 830 | " ['. ', '..', '? ', ' i', ' y', ' go', ':)', ' h', 'go', ' m'],\n", 831 | " ['digit_count', 'ne', 'ia', 'co', 'xt', ' ch', 'mob', ' x', 'ww', 'ar'])" 832 | ] 833 | }, 834 | "execution_count": 25, 835 | "metadata": {}, 836 | "output_type": "execute_result" 837 | } 838 | ], 839 | "source": [ 840 | "answer_eleven()" 841 | ] 842 | }, 843 | { 844 | "cell_type": "code", 845 | "execution_count": null, 846 | "metadata": { 847 | "collapsed": true 848 | }, 849 | "outputs": [], 850 | "source": [] 851 | } 852 | ], 853 | "metadata": { 854 | "coursera": { 855 | "course_slug": "python-text-mining", 856 | "graded_item_id": "Pn19K", 857 | "launcher_item_id": "y1juS", 858 | "part_id": "ctlgo" 859 | }, 860 | "kernelspec": { 861 | "display_name": "Python 3", 862 | "language": "python", 863 | "name": "python3" 864 | }, 865 | "language_info": { 866 | "codemirror_mode": { 867 | "name": "ipython", 868 | "version": 3 869 | }, 870 | "file_extension": ".py", 871 | "mimetype": "text/x-python", 872 | "name": "python", 873 | "nbconvert_exporter": "python", 874 | "pygments_lexer": "ipython3", 875 | "version": "3.6.2" 876 | } 877 | }, 878 | "nbformat": 4, 879 | "nbformat_minor": 2 880 | } 881 | -------------------------------------------------------------------------------- /Assignment+4.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 
| "source": [ 7 | "---\n", 8 | "\n", 9 | "_You are currently looking at **version 1.0** of this notebook. To download notebooks and datafiles, as well as get help on Jupyter notebooks in the Coursera platform, visit the [Jupyter Notebook FAQ](https://www.coursera.org/learn/python-text-mining/resources/d9pwm) course resource._\n", 10 | "\n", 11 | "---" 12 | ] 13 | }, 14 | { 15 | "cell_type": "markdown", 16 | "metadata": {}, 17 | "source": [ 18 | "# Assignment 4 - Document Similarity & Topic Modelling" 19 | ] 20 | }, 21 | { 22 | "cell_type": "markdown", 23 | "metadata": {}, 24 | "source": [ 25 | "## Part 1 - Document Similarity\n", 26 | "\n", 27 | "For the first part of this assignment, you will complete the functions `doc_to_synsets` and `similarity_score` which will be used by `document_path_similarity` to find the path similarity between two documents.\n", 28 | "\n", 29 | "The following functions are provided:\n", 30 | "* **`convert_tag:`** converts the tag given by `nltk.pos_tag` to a tag used by `wordnet.synsets`. You will need to use this function in `doc_to_synsets`.\n", 31 | "* **`document_path_similarity:`** computes the symmetrical path similarity between two documents by finding the synsets in each document using `doc_to_synsets`, then computing similarities using `similarity_score`.\n", 32 | "\n", 33 | "You will need to finish writing the following functions:\n", 34 | "* **`doc_to_synsets:`** returns a list of synsets in document. This function should first tokenize and part of speech tag the document using `nltk.word_tokenize` and `nltk.pos_tag`. Then it should find each tokens corresponding synset using `wn.synsets(token, wordnet_tag)`. The first synset match should be used. If there is no match, that token is skipped.\n", 35 | "* **`similarity_score:`** returns the normalized similarity score of a list of synsets (s1) onto a second list of synsets (s2). For each synset in s1, find the synset in s2 with the largest similarity value. 
Sum all of the largest similarity values together and normalize this value by dividing it by the number of largest similarity values found. Be careful with data types, which should be floats. Missing values should be ignored.\n", 36 | "\n", 37 | "Once `doc_to_synsets` and `similarity_score` have been completed, submit to the autograder which will run `test_document_path_similarity` to test that these functions are running correctly. \n", 38 | "\n", 39 | "*Do not modify the functions `convert_tag`, `document_path_similarity`, and `test_document_path_similarity`.*" 40 | ] 41 | }, 42 | { 43 | "cell_type": "code", 44 | "execution_count": null, 45 | "metadata": { 46 | "collapsed": true 47 | }, 48 | "outputs": [], 49 | "source": [ 50 | "import numpy as np\n", 51 | "import nltk\n", 52 | "from nltk.corpus import wordnet as wn\n", 53 | "import pandas as pd\n", 54 | "\n", 55 | "\n", 56 | "def convert_tag(tag):\n", 57 | " \"\"\"Convert the tag given by nltk.pos_tag to the tag used by wordnet.synsets\"\"\"\n", 58 | " \n", 59 | " tag_dict = {'N': 'n', 'J': 'a', 'R': 'r', 'V': 'v'}\n", 60 | " try:\n", 61 | " return tag_dict[tag[0]]\n", 62 | " except KeyError:\n", 63 | " return None\n", 64 | "\n", 65 | "\n", 66 | "def doc_to_synsets(doc):\n", 67 | " \"\"\"\n", 68 | " Returns a list of synsets in document.\n", 69 | "\n", 70 | " Tokenizes and tags the words in the document doc.\n", 71 | " Then finds the first synset for each word/tag combination.\n", 72 | " If a synset is not found for that combination it is skipped.\n", 73 | "\n", 74 | " Args:\n", 75 | " doc: string to be converted\n", 76 | "\n", 77 | " Returns:\n", 78 | " list of synsets\n", 79 | "\n", 80 | " Example:\n", 81 | " doc_to_synsets('Fish are nvqjp friends.')\n", 82 | " Out: [Synset('fish.n.01'), Synset('be.v.01'), Synset('friend.n.01')]\n", 83 | " \"\"\"\n", 84 | " \n", 85 | "\n", 86 | " # Your Code Here\n", 87 | " \n", 88 | " return # Your Answer Here\n", 89 | "\n", 90 | "\n", 91 | "def similarity_score(s1, s2):\n", 
92 | " \"\"\"\n", 93 | " Calculate the normalized similarity score of s1 onto s2\n", 94 | "\n", 95 | " For each synset in s1, finds the synset in s2 with the largest similarity value.\n", 96 | " Sum of all of the largest similarity values and normalize this value by dividing it by the\n", 97 | " number of largest similarity values found.\n", 98 | "\n", 99 | " Args:\n", 100 | " s1, s2: list of synsets from doc_to_synsets\n", 101 | "\n", 102 | " Returns:\n", 103 | " normalized similarity score of s1 onto s2\n", 104 | "\n", 105 | " Example:\n", 106 | " synsets1 = doc_to_synsets('I like cats')\n", 107 | " synsets2 = doc_to_synsets('I like dogs')\n", 108 | " similarity_score(synsets1, synsets2)\n", 109 | " Out: 0.73333333333333339\n", 110 | " \"\"\"\n", 111 | " \n", 112 | " \n", 113 | " # Your Code Here\n", 114 | " \n", 115 | " return # Your Answer Here\n", 116 | "\n", 117 | "\n", 118 | "def document_path_similarity(doc1, doc2):\n", 119 | " \"\"\"Finds the symmetrical similarity between doc1 and doc2\"\"\"\n", 120 | "\n", 121 | " synsets1 = doc_to_synsets(doc1)\n", 122 | " synsets2 = doc_to_synsets(doc2)\n", 123 | "\n", 124 | " return (similarity_score(synsets1, synsets2) + similarity_score(synsets2, synsets1)) / 2" 125 | ] 126 | }, 127 | { 128 | "cell_type": "markdown", 129 | "metadata": {}, 130 | "source": [ 131 | "### test_document_path_similarity\n", 132 | "\n", 133 | "Use this function to check if doc_to_synsets and similarity_score are correct.\n", 134 | "\n", 135 | "*This function should return the similarity score as a float.*" 136 | ] 137 | }, 138 | { 139 | "cell_type": "code", 140 | "execution_count": null, 141 | "metadata": { 142 | "collapsed": true 143 | }, 144 | "outputs": [], 145 | "source": [ 146 | "def test_document_path_similarity():\n", 147 | " doc1 = 'This is a function to test document_path_similarity.'\n", 148 | " doc2 = 'Use this function to see if your code in doc_to_synsets \\\n", 149 | " and similarity_score is correct!'\n", 150 | " return 
document_path_similarity(doc1, doc2)" 151 | ] 152 | }, 153 | { 154 | "cell_type": "markdown", 155 | "metadata": {}, 156 | "source": [ 157 | "
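The normalization `similarity_score` asks for can be sketched independently of WordNet: take each item of s1's best match in s2, ignore undefined similarities, and average what remains. The `toy_sim` function below is a made-up stand-in for `path_similarity` (which likewise returns `None` when no path exists):

```python
def similarity_score_sketch(s1, s2, sim):
    """Normalized similarity of s1 onto s2; sim(a, b) -> float or None."""
    best = []
    for a in s1:
        scores = [s for s in (sim(a, b) for b in s2) if s is not None]
        if scores:                 # items with no defined similarity are ignored
            best.append(max(scores))
    return sum(best) / len(best)

def toy_sim(a, b):
    """Made-up similarity on numbers: closer is more similar, None if too far."""
    d = abs(a - b)
    return None if d > 2 else 1.0 / (1.0 + d)

# 10 has no defined similarity to anything in s2, so it is skipped;
# best matches are 1.0 (for 1) and 0.5 (for 2), giving (1.0 + 0.5) / 2
print(similarity_score_sketch([1, 2, 10], [1, 3], toy_sim))  # 0.75
```

The real implementation applies the same loop to synset lists, with `synset.path_similarity` as `sim`.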
\n", 158 | "___\n", 159 | "`paraphrases` is a DataFrame which contains the following columns: `Quality`, `D1`, and `D2`.\n", 160 | "\n", 161 | "`Quality` is an indicator variable which indicates if the two documents `D1` and `D2` are paraphrases of one another (1 for paraphrase, 0 for not paraphrase)." 162 | ] 163 | }, 164 | { 165 | "cell_type": "code", 166 | "execution_count": null, 167 | "metadata": { 168 | "collapsed": true 169 | }, 170 | "outputs": [], 171 | "source": [ 172 | "# Use this dataframe for questions most_similar_docs and label_accuracy\n", 173 | "paraphrases = pd.read_csv('paraphrases.csv')\n", 174 | "paraphrases.head()" 175 | ] 176 | }, 177 | { 178 | "cell_type": "markdown", 179 | "metadata": {}, 180 | "source": [ 181 | "___\n", 182 | "\n", 183 | "### most_similar_docs\n", 184 | "\n", 185 | "Using `document_path_similarity`, find the pair of documents in paraphrases which has the maximum similarity score.\n", 186 | "\n", 187 | "*This function should return a tuple `(D1, D2, similarity_score)`*" 188 | ] 189 | }, 190 | { 191 | "cell_type": "code", 192 | "execution_count": null, 193 | "metadata": { 194 | "collapsed": true 195 | }, 196 | "outputs": [], 197 | "source": [ 198 | "def most_similar_docs():\n", 199 | " \n", 200 | " # Your Code Here\n", 201 | " paraphrases['similarity'] = paraphrases.apply(lambda x:document_path_similarity(x['D1'], x['D2']), axis=1)\n", 202 | " pair = paraphrases.iloc[paraphrases['similarity'].argmax()]\n", 203 | " return pair['D1'], pair['D2'], pair['similarity']" 204 | ] 205 | }, 206 | { 207 | "cell_type": "markdown", 208 | "metadata": {}, 209 | "source": [ 210 | "### label_accuracy\n", 211 | "\n", 212 | "Provide labels for the twenty pairs of documents by computing the similarity for each pair using `document_path_similarity`. Let the classifier rule be that if the score is greater than 0.75, label is paraphrase (1), else label is not paraphrase (0). 
Report accuracy of the classifier using scikit-learn's accuracy_score.\n", 213 | "\n", 214 | "*This function should return a float.*" 215 | ] 216 | }, 217 | { 218 | "cell_type": "code", 219 | "execution_count": null, 220 | "metadata": { 221 | "collapsed": true 222 | }, 223 | "outputs": [], 224 | "source": [ 225 | "def label_accuracy():\n", 226 | " from sklearn.metrics import accuracy_score\n", 227 | "\n", 228 | " # Your Code Here\n", 229 | " paraphrases['predicted'] = np.where(paraphrases['similarity'] > 0.75, 1, 0)\n", 230 | " \n", 231 | " return accuracy_score(paraphrases['Quality'], paraphrases['predicted'])" 232 | ] 233 | }, 234 | { 235 | "cell_type": "markdown", 236 | "metadata": {}, 237 | "source": [ 238 | "## Part 2 - Topic Modelling\n", 239 | "\n", 240 | "For the second part of this assignment, you will use Gensim's LDA (Latent Dirichlet Allocation) model to model topics in `newsgroup_data`. You will first need to finish the code in the cell below by using gensim.models.ldamodel.LdaModel constructor to estimate LDA model parameters on the corpus, and save to the variable `ldamodel`. Extract 10 topics using `corpus` and `id_map`, and with `passes=25` and `random_state=34`." 
241 | ] 242 | }, 243 | { 244 | "cell_type": "code", 245 | "execution_count": null, 246 | "metadata": { 247 | "collapsed": true 248 | }, 249 | "outputs": [], 250 | "source": [ 251 | "import pickle\n", 252 | "import gensim\n", 253 | "from sklearn.feature_extraction.text import CountVectorizer\n", 254 | "\n", 255 | "# Load the list of documents\n", 256 | "with open('newsgroups', 'rb') as f:\n", 257 | " newsgroup_data = pickle.load(f)\n", 258 | "\n", 259 | "# Use CountVectorizer to find tokens of three or more characters, remove stop_words, \n", 260 | "# remove tokens that don't appear in at least 20 documents,\n", 261 | "# remove tokens that appear in more than 20% of the documents\n", 262 | "vect = CountVectorizer(min_df=20, max_df=0.2, stop_words='english', \n", 263 | " token_pattern='(?u)\\\\b\\\\w\\\\w\\\\w+\\\\b')\n", 264 | "# Fit and transform\n", 265 | "X = vect.fit_transform(newsgroup_data)\n", 266 | "\n", 267 | "# Convert sparse matrix to gensim corpus.\n", 268 | "corpus = gensim.matutils.Sparse2Corpus(X, documents_columns=False)\n", 269 | "\n", 270 | "# Mapping from word IDs to words (To be used in LdaModel's id2word parameter)\n", 271 | "id_map = dict((v, k) for k, v in vect.vocabulary_.items())\n" 272 | ] 273 | }, 274 | { 275 | "cell_type": "code", 276 | "execution_count": null, 277 | "metadata": { 278 | "collapsed": true 279 | }, 280 | "outputs": [], 281 | "source": [ 282 | "# Use the gensim.models.ldamodel.LdaModel constructor to estimate \n", 283 | "# LDA model parameters on the corpus, and save to the variable `ldamodel`\n", 284 | "\n", 285 | "# Your code here:\n", 286 | "ldamodel = gensim.models.ldamodel.LdaModel(corpus=corpus, id2word=id_map, passes=25, random_state=34, num_topics=10)" 287 | ] 288 | }, 289 | { 290 | "cell_type": "markdown", 291 | "metadata": {}, 292 | "source": [ 293 | "### lda_topics\n", 294 | "\n", 295 | "Using `ldamodel`, find a list of the 10 topics and the most significant 10 words in each topic. 
This should be structured as a list of 10 tuples where each tuple takes on the form:\n", 296 | "\n", 297 | "`(9, '0.068*\"space\" + 0.036*\"nasa\" + 0.021*\"science\" + 0.020*\"edu\" + 0.019*\"data\" + 0.017*\"shuttle\" + 0.015*\"launch\" + 0.015*\"available\" + 0.014*\"center\" + 0.014*\"sci\"')`\n", 298 | "\n", 299 | "for example.\n", 300 | "\n", 301 | "*This function should return a list of tuples.*" 302 | ] 303 | }, 304 | { 305 | "cell_type": "code", 306 | "execution_count": null, 307 | "metadata": { 308 | "collapsed": true 309 | }, 310 | "outputs": [], 311 | "source": [ 312 | "def lda_topics():\n", 313 | "    \n", 314 | "    # Your Code Here\n", 315 | "    \n", 316 | "    return ldamodel.print_topics(num_words=10)" 317 | ] 318 | }, 319 | { 320 | "cell_type": "markdown", 321 | "metadata": {}, 322 | "source": [ 323 | "### topic_distribution\n", 324 | "\n", 325 | "For the new document `new_doc`, find the topic distribution. Remember to use vect.transform on the new doc, and Sparse2Corpus to convert the sparse matrix to a gensim corpus.\n", 326 | "\n", 327 | "*This function should return a list of tuples, where each tuple is `(#topic, probability)`*" 328 | ] 329 | }, 330 | { 331 | "cell_type": "code", 332 | "execution_count": null, 333 | "metadata": { 334 | "collapsed": true 335 | }, 336 | "outputs": [], 337 | "source": [ 338 | "new_doc = [\"\\n\\nIt's my understanding that the freezing will start to occur because \\\n", 339 | "of the\\ngrowing distance of Pluto and Charon from the Sun, due to it's\\nelliptical orbit. \\\n", 340 | "It is not due to shadowing effects. 
\\n\\n\\nPluto can shadow Charon, and vice-versa.\\n\\nGeorge \\\n", 341 | "Krumins\\n-- \"]" 342 | ] 343 | }, 344 | { 345 | "cell_type": "code", 346 | "execution_count": null, 347 | "metadata": { 348 | "collapsed": true 349 | }, 350 | "outputs": [], 351 | "source": [ 352 | "def topic_distribution():\n", 353 | " \n", 354 | " # Your Code Here\n", 355 | " # transform\n", 356 | " X = vect.transform(new_doc)\n", 357 | "\n", 358 | " # Convert sparse matrix to gensim corpus.\n", 359 | " corpus = gensim.matutils.Sparse2Corpus(X, documents_columns=False)\n", 360 | " \n", 361 | " return list(ldamodel[corpus])[0] # Your Answer Here" 362 | ] 363 | }, 364 | { 365 | "cell_type": "markdown", 366 | "metadata": {}, 367 | "source": [ 368 | "### topic_names\n", 369 | "\n", 370 | "From the list of the following given topics, assign topic names to the topics you found. If none of these names best matches the topics you found, create a new 1-3 word \"title\" for the topic.\n", 371 | "\n", 372 | "Topics: Health, Science, Automobiles, Politics, Government, Travel, Computers & IT, Sports, Business, Society & Lifestyle, Religion, Education.\n", 373 | "\n", 374 | "*This function should return a list of 10 strings.*" 375 | ] 376 | }, 377 | { 378 | "cell_type": "code", 379 | "execution_count": null, 380 | "metadata": { 381 | "collapsed": true 382 | }, 383 | "outputs": [], 384 | "source": [ 385 | "def topic_names():\n", 386 | " \n", 387 | " # Your Code Here\n", 388 | " \n", 389 | " return ['Automobiles', 'Health', 'Science',\n", 390 | " 'Politics',\n", 391 | " 'Sports',\n", 392 | " 'Business', 'Society & Lifestyle',\n", 393 | " 'Religion', 'Education', 'Computers & IT']# Your Answer Here" 394 | ] 395 | } 396 | ], 397 | "metadata": { 398 | "coursera": { 399 | "course_slug": "python-text-mining", 400 | "graded_item_id": "2qbcK", 401 | "launcher_item_id": "pi9Sh", 402 | "part_id": "kQiwX" 403 | }, 404 | "kernelspec": { 405 | "display_name": "Python 3", 406 | "language": "python", 407 | "name": 
"python3" 408 | }, 409 | "language_info": { 410 | "codemirror_mode": { 411 | "name": "ipython", 412 | "version": 3 413 | }, 414 | "file_extension": ".py", 415 | "mimetype": "text/x-python", 416 | "name": "python", 417 | "nbconvert_exporter": "python", 418 | "pygments_lexer": "ipython3", 419 | "version": "3.6.2" 420 | } 421 | }, 422 | "nbformat": 4, 423 | "nbformat_minor": 2 424 | } 425 | -------------------------------------------------------------------------------- /Case+Study+-+Sentiment+Analysis.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "---\n", 8 | "\n", 9 | "_You are currently looking at **version 1.0** of this notebook. To download notebooks and datafiles, as well as get help on Jupyter notebooks in the Coursera platform, visit the [Jupyter Notebook FAQ](https://www.coursera.org/learn/python-text-mining/resources/d9pwm) course resource._\n", 10 | "\n", 11 | "---" 12 | ] 13 | }, 14 | { 15 | "cell_type": "markdown", 16 | "metadata": {}, 17 | "source": [ 18 | "*Note: Some of the cells in this notebook are computationally expensive. 
To reduce runtime, this notebook is using a subset of the data.*" 19 | ] 20 | }, 21 | { 22 | "cell_type": "markdown", 23 | "metadata": {}, 24 | "source": [ 25 | "# Case Study: Sentiment Analysis" 26 | ] 27 | }, 28 | { 29 | "cell_type": "markdown", 30 | "metadata": {}, 31 | "source": [ 32 | "### Data Prep" 33 | ] 34 | }, 35 | { 36 | "cell_type": "code", 37 | "execution_count": null, 38 | "metadata": {}, 39 | "outputs": [], 40 | "source": [ 41 | "import pandas as pd\n", 42 | "import numpy as np\n", 43 | "\n", 44 | "# Read in the data\n", 45 | "df = pd.read_csv('Amazon_Unlocked_Mobile.csv')\n", 46 | "\n", 47 | "# Sample the data to speed up computation\n", 48 | "# Comment out this line to match with lecture\n", 49 | "df = df.sample(frac=0.1, random_state=10)\n", 50 | "\n", 51 | "df.head()" 52 | ] 53 | }, 54 | { 55 | "cell_type": "code", 56 | "execution_count": null, 57 | "metadata": {}, 58 | "outputs": [], 59 | "source": [ 60 | "# Drop missing values\n", 61 | "df.dropna(inplace=True)\n", 62 | "\n", 63 | "# Remove any 'neutral' ratings equal to 3\n", 64 | "df = df[df['Rating'] != 3]\n", 65 | "\n", 66 | "# Encode 4s and 5s as 1 (rated positively)\n", 67 | "# Encode 1s and 2s as 0 (rated poorly)\n", 68 | "df['Positively Rated'] = np.where(df['Rating'] > 3, 1, 0)\n", 69 | "df.head(10)" 70 | ] 71 | }, 72 | { 73 | "cell_type": "code", 74 | "execution_count": null, 75 | "metadata": {}, 76 | "outputs": [], 77 | "source": [ 78 | "# Most ratings are positive\n", 79 | "df['Positively Rated'].mean()" 80 | ] 81 | }, 82 | { 83 | "cell_type": "code", 84 | "execution_count": null, 85 | "metadata": { 86 | "collapsed": true 87 | }, 88 | "outputs": [], 89 | "source": [ 90 | "from sklearn.model_selection import train_test_split\n", 91 | "\n", 92 | "# Split data into training and test sets\n", 93 | "X_train, X_test, y_train, y_test = train_test_split(df['Reviews'], \n", 94 | " df['Positively Rated'], \n", 95 | " random_state=0)" 96 | ] 97 | }, 98 | { 99 | "cell_type": "code", 100 | 
"execution_count": null, 101 | "metadata": {}, 102 | "outputs": [], 103 | "source": [ 104 | "print('X_train first entry:\\n\\n', X_train.iloc[0])\n", 105 | "print('\\n\\nX_train shape: ', X_train.shape)" 106 | ] 107 | }, 108 | { 109 | "cell_type": "markdown", 110 | "metadata": {}, 111 | "source": [ 112 | "# CountVectorizer" 113 | ] 114 | }, 115 | { 116 | "cell_type": "code", 117 | "execution_count": null, 118 | "metadata": { 119 | "collapsed": true 120 | }, 121 | "outputs": [], 122 | "source": [ 123 | "from sklearn.feature_extraction.text import CountVectorizer\n", 124 | "\n", 125 | "# Fit the CountVectorizer to the training data\n", 126 | "vect = CountVectorizer().fit(X_train)" 127 | ] 128 | }, 129 | { 130 | "cell_type": "code", 131 | "execution_count": null, 132 | "metadata": { 133 | "scrolled": false 134 | }, 135 | "outputs": [], 136 | "source": [ 137 | "vect.get_feature_names()[::2000]" 138 | ] 139 | }, 140 | { 141 | "cell_type": "code", 142 | "execution_count": null, 143 | "metadata": {}, 144 | "outputs": [], 145 | "source": [ 146 | "len(vect.get_feature_names())" 147 | ] 148 | }, 149 | { 150 | "cell_type": "code", 151 | "execution_count": null, 152 | "metadata": {}, 153 | "outputs": [], 154 | "source": [ 155 | "# transform the documents in the training data to a document-term matrix\n", 156 | "X_train_vectorized = vect.transform(X_train)\n", 157 | "\n", 158 | "X_train_vectorized" 159 | ] 160 | }, 161 | { 162 | "cell_type": "code", 163 | "execution_count": null, 164 | "metadata": {}, 165 | "outputs": [], 166 | "source": [ 167 | "from sklearn.linear_model import LogisticRegression\n", 168 | "\n", 169 | "# Train the model\n", 170 | "model = LogisticRegression()\n", 171 | "model.fit(X_train_vectorized, y_train)" 172 | ] 173 | }, 174 | { 175 | "cell_type": "code", 176 | "execution_count": null, 177 | "metadata": {}, 178 | "outputs": [], 179 | "source": [ 180 | "from sklearn.metrics import roc_auc_score\n", 181 | "\n", 182 | "# Predict the transformed test 
documents\n", 183 | "predictions = model.predict(vect.transform(X_test))\n", 184 | "\n", 185 | "print('AUC: ', roc_auc_score(y_test, predictions))" 186 | ] 187 | }, 188 | { 189 | "cell_type": "code", 190 | "execution_count": null, 191 | "metadata": { 192 | "scrolled": true 193 | }, 194 | "outputs": [], 195 | "source": [ 196 | "# get the feature names as numpy array\n", 197 | "feature_names = np.array(vect.get_feature_names())\n", 198 | "\n", 199 | "# Sort the coefficients from the model\n", 200 | "sorted_coef_index = model.coef_[0].argsort()\n", 201 | "\n", 202 | "# Find the 10 smallest and 10 largest coefficients\n", 203 | "# The 10 largest coefficients are being indexed using [:-11:-1] \n", 204 | "# so the list returned is in order of largest to smallest\n", 205 | "print('Smallest Coefs:\\n{}\\n'.format(feature_names[sorted_coef_index[:10]]))\n", 206 | "print('Largest Coefs: \\n{}'.format(feature_names[sorted_coef_index[:-11:-1]]))" 207 | ] 208 | }, 209 | { 210 | "cell_type": "markdown", 211 | "metadata": {}, 212 | "source": [ 213 | "# Tfidf" 214 | ] 215 | }, 216 | { 217 | "cell_type": "code", 218 | "execution_count": null, 219 | "metadata": {}, 220 | "outputs": [], 221 | "source": [ 222 | "from sklearn.feature_extraction.text import TfidfVectorizer\n", 223 | "\n", 224 | "# Fit the TfidfVectorizer to the training data specifying a minimum document frequency of 5\n", 225 | "vect = TfidfVectorizer(min_df=5).fit(X_train)\n", 226 | "len(vect.get_feature_names())" 227 | ] 228 | }, 229 | { 230 | "cell_type": "code", 231 | "execution_count": null, 232 | "metadata": {}, 233 | "outputs": [], 234 | "source": [ 235 | "X_train_vectorized = vect.transform(X_train)\n", 236 | "\n", 237 | "model = LogisticRegression()\n", 238 | "model.fit(X_train_vectorized, y_train)\n", 239 | "\n", 240 | "predictions = model.predict(vect.transform(X_test))\n", 241 | "\n", 242 | "print('AUC: ', roc_auc_score(y_test, predictions))" 243 | ] 244 | }, 245 | { 246 | "cell_type": "code", 247 | 
"execution_count": null, 248 | "metadata": {}, 249 | "outputs": [], 250 | "source": [ 251 | "feature_names = np.array(vect.get_feature_names())\n", 252 | "\n", 253 | "sorted_tfidf_index = X_train_vectorized.max(0).toarray()[0].argsort()\n", 254 | "\n", 255 | "print('Smallest tfidf:\\n{}\\n'.format(feature_names[sorted_tfidf_index[:10]]))\n", 256 | "print('Largest tfidf: \\n{}'.format(feature_names[sorted_tfidf_index[:-11:-1]]))" 257 | ] 258 | }, 259 | { 260 | "cell_type": "code", 261 | "execution_count": null, 262 | "metadata": {}, 263 | "outputs": [], 264 | "source": [ 265 | "sorted_coef_index = model.coef_[0].argsort()\n", 266 | "\n", 267 | "print('Smallest Coefs:\\n{}\\n'.format(feature_names[sorted_coef_index[:10]]))\n", 268 | "print('Largest Coefs: \\n{}'.format(feature_names[sorted_coef_index[:-11:-1]]))" 269 | ] 270 | }, 271 | { 272 | "cell_type": "code", 273 | "execution_count": null, 274 | "metadata": {}, 275 | "outputs": [], 276 | "source": [ 277 | "# These reviews are treated the same by our current model\n", 278 | "print(model.predict(vect.transform(['not an issue, phone is working',\n", 279 | "                                    'an issue, phone is not working'])))" 280 | ] 281 | }, 282 | { 283 | "cell_type": "markdown", 284 | "metadata": {}, 285 | "source": [ 286 | "# n-grams" 287 | ] 288 | }, 289 | { 290 | "cell_type": "code", 291 | "execution_count": null, 292 | "metadata": {}, 293 | "outputs": [], 294 | "source": [ 295 | "# Fit the CountVectorizer to the training data specifying a minimum \n", 296 | "# document frequency of 5 and extracting 1-grams and 2-grams\n", 297 | "vect = CountVectorizer(min_df=5, ngram_range=(1,2)).fit(X_train)\n", 298 | "\n", 299 | "X_train_vectorized = vect.transform(X_train)\n", 300 | "\n", 301 | "len(vect.get_feature_names())" 302 | ] 303 | }, 304 | { 305 | "cell_type": "code", 306 | "execution_count": null, 307 | "metadata": {}, 308 | "outputs": [], 309 | "source": [ 310 | "model = LogisticRegression()\n", 311 | "model.fit(X_train_vectorized, 
y_train)\n", 312 | "\n", 313 | "predictions = model.predict(vect.transform(X_test))\n", 314 | "\n", 315 | "print('AUC: ', roc_auc_score(y_test, predictions))" 316 | ] 317 | }, 318 | { 319 | "cell_type": "code", 320 | "execution_count": null, 321 | "metadata": {}, 322 | "outputs": [], 323 | "source": [ 324 | "feature_names = np.array(vect.get_feature_names())\n", 325 | "\n", 326 | "sorted_coef_index = model.coef_[0].argsort()\n", 327 | "\n", 328 | "print('Smallest Coefs:\\n{}\\n'.format(feature_names[sorted_coef_index[:10]]))\n", 329 | "print('Largest Coefs: \\n{}'.format(feature_names[sorted_coef_index[:-11:-1]]))" 330 | ] 331 | }, 332 | { 333 | "cell_type": "code", 334 | "execution_count": null, 335 | "metadata": {}, 336 | "outputs": [], 337 | "source": [ 338 | "# These reviews are now correctly identified\n", 339 | "print(model.predict(vect.transform(['not an issue, phone is working',\n", 340 | " 'an issue, phone is not working'])))" 341 | ] 342 | } 343 | ], 344 | "metadata": { 345 | "kernelspec": { 346 | "display_name": "Python 3", 347 | "language": "python", 348 | "name": "python3" 349 | }, 350 | "language_info": { 351 | "codemirror_mode": { 352 | "name": "ipython", 353 | "version": 3 354 | }, 355 | "file_extension": ".py", 356 | "mimetype": "text/x-python", 357 | "name": "python", 358 | "nbconvert_exporter": "python", 359 | "pygments_lexer": "ipython3", 360 | "version": "3.6.0" 361 | } 362 | }, 363 | "nbformat": 4, 364 | "nbformat_minor": 2 365 | } 366 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Applied-text-mining-in-Python 2 | -------------------------------------------------------------------------------- /Regex+with+Pandas+and+Named+Groups.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | 
"---\n", 8 | "\n", 9 | "_You are currently looking at **version 1.0** of this notebook. To download notebooks and datafiles, as well as get help on Jupyter notebooks in the Coursera platform, visit the [Jupyter Notebook FAQ](https://www.coursera.org/learn/python-text-mining/resources/d9pwm) course resource._\n", 10 | "\n", 11 | "---" 12 | ] 13 | }, 14 | { 15 | "cell_type": "markdown", 16 | "metadata": {}, 17 | "source": [ 18 | "# Working with Text Data in pandas" 19 | ] 20 | }, 21 | { 22 | "cell_type": "code", 23 | "execution_count": 1, 24 | "metadata": { 25 | "collapsed": false, 26 | "scrolled": true 27 | }, 28 | "outputs": [ 29 | { 30 | "data": { 31 | "text/html": [ 32 | "
\n", 33 | "\n", 34 | " \n", 35 | " \n", 36 | " \n", 37 | " \n", 38 | " \n", 39 | " \n", 40 | " \n", 41 | " \n", 42 | " \n", 43 | " \n", 44 | " \n", 45 | " \n", 46 | " \n", 47 | " \n", 48 | " \n", 49 | " \n", 50 | " \n", 51 | " \n", 52 | " \n", 53 | " \n", 54 | " \n", 55 | " \n", 56 | " \n", 57 | " \n", 58 | " \n", 59 | " \n", 60 | " \n", 61 | " \n", 62 | "
text
0Monday: The doctor's appointment is at 2:45pm.
1Tuesday: The dentist's appointment is at 11:30...
2Wednesday: At 7:00pm, there is a basketball game!
3Thursday: Be back home by 11:15 pm at the latest.
4Friday: Take the train at 08:10 am, arrive at ...
\n", 63 | "
" 64 | ], 65 | "text/plain": [ 66 | " text\n", 67 | "0 Monday: The doctor's appointment is at 2:45pm.\n", 68 | "1 Tuesday: The dentist's appointment is at 11:30...\n", 69 | "2 Wednesday: At 7:00pm, there is a basketball game!\n", 70 | "3 Thursday: Be back home by 11:15 pm at the latest.\n", 71 | "4 Friday: Take the train at 08:10 am, arrive at ..." 72 | ] 73 | }, 74 | "execution_count": 1, 75 | "metadata": {}, 76 | "output_type": "execute_result" 77 | } 78 | ], 79 | "source": [ 80 | "import pandas as pd\n", 81 | "\n", 82 | "time_sentences = [\"Monday: The doctor's appointment is at 2:45pm.\", \n", 83 | " \"Tuesday: The dentist's appointment is at 11:30 am.\",\n", 84 | " \"Wednesday: At 7:00pm, there is a basketball game!\",\n", 85 | " \"Thursday: Be back home by 11:15 pm at the latest.\",\n", 86 | " \"Friday: Take the train at 08:10 am, arrive at 09:00am.\"]\n", 87 | "\n", 88 | "df = pd.DataFrame(time_sentences, columns=['text'])\n", 89 | "df" 90 | ] 91 | }, 92 | { 93 | "cell_type": "code", 94 | "execution_count": 2, 95 | "metadata": { 96 | "collapsed": false 97 | }, 98 | "outputs": [ 99 | { 100 | "data": { 101 | "text/plain": [ 102 | "0 46\n", 103 | "1 50\n", 104 | "2 49\n", 105 | "3 49\n", 106 | "4 54\n", 107 | "Name: text, dtype: int64" 108 | ] 109 | }, 110 | "execution_count": 2, 111 | "metadata": {}, 112 | "output_type": "execute_result" 113 | } 114 | ], 115 | "source": [ 116 | "# find the number of characters for each string in df['text']\n", 117 | "df['text'].str.len()" 118 | ] 119 | }, 120 | { 121 | "cell_type": "code", 122 | "execution_count": 3, 123 | "metadata": { 124 | "collapsed": false 125 | }, 126 | "outputs": [ 127 | { 128 | "data": { 129 | "text/plain": [ 130 | "0 7\n", 131 | "1 8\n", 132 | "2 8\n", 133 | "3 10\n", 134 | "4 10\n", 135 | "Name: text, dtype: int64" 136 | ] 137 | }, 138 | "execution_count": 3, 139 | "metadata": {}, 140 | "output_type": "execute_result" 141 | } 142 | ], 143 | "source": [ 144 | "# find the number of tokens for each 
string in df['text']\n", 145 | "df['text'].str.split().str.len()" 146 | ] 147 | }, 148 | { 149 | "cell_type": "code", 150 | "execution_count": 4, 151 | "metadata": { 152 | "collapsed": false 153 | }, 154 | "outputs": [ 155 | { 156 | "data": { 157 | "text/plain": [ 158 | "0     True\n", 159 | "1     True\n", 160 | "2    False\n", 161 | "3    False\n", 162 | "4    False\n", 163 | "Name: text, dtype: bool" 164 | ] 165 | }, 166 | "execution_count": 4, 167 | "metadata": {}, 168 | "output_type": "execute_result" 169 | } 170 | ], 171 | "source": [ 172 | "# find which entries contain the word 'appointment'\n", 173 | "df['text'].str.contains('appointment')" 174 | ] 175 | }, 176 | { 177 | "cell_type": "code", 178 | "execution_count": 5, 179 | "metadata": { 180 | "collapsed": false 181 | }, 182 | "outputs": [ 183 | { 184 | "data": { 185 | "text/plain": [ 186 | "0    3\n", 187 | "1    4\n", 188 | "2    3\n", 189 | "3    4\n", 190 | "4    8\n", 191 | "Name: text, dtype: int64" 192 | ] 193 | }, 194 | "execution_count": 5, 195 | "metadata": {}, 196 | "output_type": "execute_result" 197 | } 198 | ], 199 | "source": [ 200 | "# find how many times a digit occurs in each string\n", 201 | "df['text'].str.count(r'\\d')" 202 | ] 203 | }, 204 | { 205 | "cell_type": "code", 206 | "execution_count": 6, 207 | "metadata": { 208 | "collapsed": false 209 | }, 210 | "outputs": [ 211 | { 212 | "data": { 213 | "text/plain": [ 214 | "0                   [2, 4, 5]\n", 215 | "1                [1, 1, 3, 0]\n", 216 | "2                   [7, 0, 0]\n", 217 | "3                [1, 1, 1, 5]\n", 218 | "4    [0, 8, 1, 0, 0, 9, 0, 0]\n", 219 | "Name: text, dtype: object" 220 | ] 221 | }, 222 | "execution_count": 6, 223 | "metadata": {}, 224 | "output_type": "execute_result" 225 | } 226 | ], 227 | "source": [ 228 | "# find all occurrences of the digits\n", 229 | "df['text'].str.findall(r'\\d')" 230 | ] 231 | }, 232 | { 233 | "cell_type": "code", 234 | "execution_count": 7, 235 | "metadata": { 236 | "collapsed": false 237 | }, 238 | "outputs": [ 239 | { 240 | "data": { 241 | "text/plain": [ 242 | "0    [(2, 
45)]\n", 243 | "1 [(11, 30)]\n", 244 | "2 [(7, 00)]\n", 245 | "3 [(11, 15)]\n", 246 | "4 [(08, 10), (09, 00)]\n", 247 | "Name: text, dtype: object" 248 | ] 249 | }, 250 | "execution_count": 7, 251 | "metadata": {}, 252 | "output_type": "execute_result" 253 | } 254 | ], 255 | "source": [ 256 | "# group and find the hours and minutes\n", 257 | "df['text'].str.findall(r'(\\d?\\d):(\\d\\d)')" 258 | ] 259 | }, 260 | { 261 | "cell_type": "code", 262 | "execution_count": 8, 263 | "metadata": { 264 | "collapsed": false 265 | }, 266 | "outputs": [ 267 | { 268 | "data": { 269 | "text/plain": [ 270 | "0 ???: The doctor's appointment is at 2:45pm.\n", 271 | "1 ???: The dentist's appointment is at 11:30 am.\n", 272 | "2 ???: At 7:00pm, there is a basketball game!\n", 273 | "3 ???: Be back home by 11:15 pm at the latest.\n", 274 | "4 ???: Take the train at 08:10 am, arrive at 09:...\n", 275 | "Name: text, dtype: object" 276 | ] 277 | }, 278 | "execution_count": 8, 279 | "metadata": {}, 280 | "output_type": "execute_result" 281 | } 282 | ], 283 | "source": [ 284 | "# replace weekdays with '???'\n", 285 | "df['text'].str.replace(r'\\w+day\\b', '???')" 286 | ] 287 | }, 288 | { 289 | "cell_type": "code", 290 | "execution_count": 9, 291 | "metadata": { 292 | "collapsed": false 293 | }, 294 | "outputs": [ 295 | { 296 | "ename": "TypeError", 297 | "evalue": "repl must be a string", 298 | "output_type": "error", 299 | "traceback": [ 300 | "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", 301 | "\u001b[0;31mTypeError\u001b[0m Traceback (most recent call last)", 302 | "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[1;32m 1\u001b[0m \u001b[0;31m# replace weekdays with 3 letter abbrevations\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 2\u001b[0;31m 
\u001b[0mdf\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m'text'\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mstr\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mreplace\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34mr'(\\w+day\\b)'\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;32mlambda\u001b[0m \u001b[0mx\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0mx\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mgroups\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;36m0\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;36m3\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m", 303 | "\u001b[0;32m/home/sid/anaconda2/lib/python2.7/site-packages/pandas/core/strings.pyc\u001b[0m in \u001b[0;36mreplace\u001b[0;34m(self, pat, repl, n, case, flags)\u001b[0m\n\u001b[1;32m 1504\u001b[0m \u001b[0;32mdef\u001b[0m \u001b[0mreplace\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mpat\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mrepl\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mn\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;34m-\u001b[0m\u001b[0;36m1\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mcase\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mTrue\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mflags\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;36m0\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1505\u001b[0m result = str_replace(self._data, pat, repl, n=n, case=case,\n\u001b[0;32m-> 1506\u001b[0;31m flags=flags)\n\u001b[0m\u001b[1;32m 1507\u001b[0m \u001b[0;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_wrap_result\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mresult\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1508\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n", 304 | "\u001b[0;32m/home/sid/anaconda2/lib/python2.7/site-packages/pandas/core/strings.pyc\u001b[0m in \u001b[0;36mstr_replace\u001b[0;34m(arr, pat, 
repl, n, case, flags)\u001b[0m\n\u001b[1;32m 320\u001b[0m \u001b[0;31m# Check whether repl is valid (GH 13438)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 321\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0;32mnot\u001b[0m \u001b[0mis_string_like\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mrepl\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 322\u001b[0;31m \u001b[0;32mraise\u001b[0m \u001b[0mTypeError\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m\"repl must be a string\"\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 323\u001b[0m \u001b[0muse_re\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;32mnot\u001b[0m \u001b[0mcase\u001b[0m \u001b[0;32mor\u001b[0m \u001b[0mlen\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mpat\u001b[0m\u001b[0;34m)\u001b[0m \u001b[0;34m>\u001b[0m \u001b[0;36m1\u001b[0m \u001b[0;32mor\u001b[0m \u001b[0mflags\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 324\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n", 305 | "\u001b[0;31mTypeError\u001b[0m: repl must be a string" 306 | ] 307 | } 308 | ], 309 | "source": [ 310 | "# replace weekdays with 3 letter abbrevations\n", 311 | "df['text'].str.replace(r'(\\w+day\\b)', lambda x: x.groups()[0][:3])" 312 | ] 313 | }, 314 | { 315 | "cell_type": "code", 316 | "execution_count": 10, 317 | "metadata": { 318 | "collapsed": false 319 | }, 320 | "outputs": [ 321 | { 322 | "name": "stderr", 323 | "output_type": "stream", 324 | "text": [ 325 | "/home/sid/anaconda2/lib/python2.7/site-packages/ipykernel/__main__.py:2: FutureWarning: currently extract(expand=None) means expand=False (return Index/Series/DataFrame) but in a future version of pandas this will be changed to expand=True (return DataFrame)\n", 326 | " from ipykernel import kernelapp as app\n" 327 | ] 328 | }, 329 | { 330 | "data": { 331 | "text/html": [ 332 | "
\n", 333 | "\n", 334 | " \n", 335 | " \n", 336 | " \n", 337 | " \n", 338 | " \n", 339 | " \n", 340 | " \n", 341 | " \n", 342 | " \n", 343 | " \n", 344 | " \n", 345 | " \n", 346 | " \n", 347 | " \n", 348 | " \n", 349 | " \n", 350 | " \n", 351 | " \n", 352 | " \n", 353 | " \n", 354 | " \n", 355 | " \n", 356 | " \n", 357 | " \n", 358 | " \n", 359 | " \n", 360 | " \n", 361 | " \n", 362 | " \n", 363 | " \n", 364 | " \n", 365 | " \n", 366 | " \n", 367 | " \n", 368 | "
01
0245
11130
2700
31115
40810
\n", 369 | "
" 370 | ], 371 | "text/plain": [ 372 | " 0 1\n", 373 | "0 2 45\n", 374 | "1 11 30\n", 375 | "2 7 00\n", 376 | "3 11 15\n", 377 | "4 08 10" 378 | ] 379 | }, 380 | "execution_count": 10, 381 | "metadata": {}, 382 | "output_type": "execute_result" 383 | } 384 | ], 385 | "source": [ 386 | "# create new columns from first match of extracted groups\n", 387 | "df['text'].str.extract(r'(\\d?\\d):(\\d\\d)')" 388 | ] 389 | }, 390 | { 391 | "cell_type": "code", 392 | "execution_count": 11, 393 | "metadata": { 394 | "collapsed": false 395 | }, 396 | "outputs": [ 397 | { 398 | "data": { 399 | "text/html": [ 400 | "
\n", 401 | "\n", 402 | " \n", 403 | " \n", 404 | " \n", 405 | " \n", 406 | " \n", 407 | " \n", 408 | " \n", 409 | " \n", 410 | " \n", 411 | " \n", 412 | " \n", 413 | " \n", 414 | " \n", 415 | " \n", 416 | " \n", 417 | " \n", 418 | " \n", 419 | " \n", 420 | " \n", 421 | " \n", 422 | " \n", 423 | " \n", 424 | " \n", 425 | " \n", 426 | " \n", 427 | " \n", 428 | " \n", 429 | " \n", 430 | " \n", 431 | " \n", 432 | " \n", 433 | " \n", 434 | " \n", 435 | " \n", 436 | " \n", 437 | " \n", 438 | " \n", 439 | " \n", 440 | " \n", 441 | " \n", 442 | " \n", 443 | " \n", 444 | " \n", 445 | " \n", 446 | " \n", 447 | " \n", 448 | " \n", 449 | " \n", 450 | " \n", 451 | " \n", 452 | " \n", 453 | " \n", 454 | " \n", 455 | " \n", 456 | " \n", 457 | " \n", 458 | " \n", 459 | " \n", 460 | " \n", 461 | " \n", 462 | " \n", 463 | " \n", 464 | " \n", 465 | " \n", 466 | " \n", 467 | " \n", 468 | " \n", 469 | "
0123
match
002:45pm245pm
1011:30 am1130am
207:00pm700pm
3011:15 pm1115pm
4008:10 am0810am
109:00am0900am
\n", 470 | "
" 471 | ], 472 | "text/plain": [ 473 | " 0 1 2 3\n", 474 | " match \n", 475 | "0 0 2:45pm 2 45 pm\n", 476 | "1 0 11:30 am 11 30 am\n", 477 | "2 0 7:00pm 7 00 pm\n", 478 | "3 0 11:15 pm 11 15 pm\n", 479 | "4 0 08:10 am 08 10 am\n", 480 | " 1 09:00am 09 00 am" 481 | ] 482 | }, 483 | "execution_count": 11, 484 | "metadata": {}, 485 | "output_type": "execute_result" 486 | } 487 | ], 488 | "source": [ 489 | "# extract the entire time, the hours, the minutes, and the period\n", 490 | "df['text'].str.extractall(r'((\\d?\\d):(\\d\\d) ?([ap]m))')" 491 | ] 492 | }, 493 | { 494 | "cell_type": "code", 495 | "execution_count": 12, 496 | "metadata": { 497 | "collapsed": false 498 | }, 499 | "outputs": [ 500 | { 501 | "data": { 502 | "text/html": [ 503 | "
\n", 504 | "\n", 505 | " \n", 506 | " \n", 507 | " \n", 508 | " \n", 509 | " \n", 510 | " \n", 511 | " \n", 512 | " \n", 513 | " \n", 514 | " \n", 515 | " \n", 516 | " \n", 517 | " \n", 518 | " \n", 519 | " \n", 520 | " \n", 521 | " \n", 522 | " \n", 523 | " \n", 524 | " \n", 525 | " \n", 526 | " \n", 527 | " \n", 528 | " \n", 529 | " \n", 530 | " \n", 531 | " \n", 532 | " \n", 533 | " \n", 534 | " \n", 535 | " \n", 536 | " \n", 537 | " \n", 538 | " \n", 539 | " \n", 540 | " \n", 541 | " \n", 542 | " \n", 543 | " \n", 544 | " \n", 545 | " \n", 546 | " \n", 547 | " \n", 548 | " \n", 549 | " \n", 550 | " \n", 551 | " \n", 552 | " \n", 553 | " \n", 554 | " \n", 555 | " \n", 556 | " \n", 557 | " \n", 558 | " \n", 559 | " \n", 560 | " \n", 561 | " \n", 562 | " \n", 563 | " \n", 564 | " \n", 565 | " \n", 566 | " \n", 567 | " \n", 568 | " \n", 569 | " \n", 570 | " \n", 571 | " \n", 572 | "
timehourminuteperiod
match
002:45pm245pm
1011:30 am1130am
207:00pm700pm
3011:15 pm1115pm
4008:10 am0810am
109:00am0900am
\n", 573 | "
" 574 | ], 575 | "text/plain": [ 576 | " time hour minute period\n", 577 | " match \n", 578 | "0 0 2:45pm 2 45 pm\n", 579 | "1 0 11:30 am 11 30 am\n", 580 | "2 0 7:00pm 7 00 pm\n", 581 | "3 0 11:15 pm 11 15 pm\n", 582 | "4 0 08:10 am 08 10 am\n", 583 | " 1 09:00am 09 00 am" 584 | ] 585 | }, 586 | "execution_count": 12, 587 | "metadata": {}, 588 | "output_type": "execute_result" 589 | } 590 | ], 591 | "source": [ 592 | "# extract the entire time, the hours, the minutes, and the period with group names\n", 593 | "df['text'].str.extractall(r'(?P