├── certificate.pdf
├── README.md
├── .gitattributes
├── .gitignore
├── Week 1
│   └── Assignment+1.ipynb
├── Week 4
│   └── Assignment+4.ipynb
├── Week 2
│   └── Assignment+2.ipynb
└── Week 3
    └── Assignment+3.ipynb
/certificate.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/bondeanikets/Applied-Text-Mining-in-Python/master/certificate.pdf
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Applied-Text-Mining-in-Python
2 | Applied Data Science with Python Specialization: Course 4 (University of Michigan)
3 |
--------------------------------------------------------------------------------
/.gitattributes:
--------------------------------------------------------------------------------
1 | # Auto detect text files and perform LF normalization
2 | * text=auto
3 |
4 | # Custom for Visual Studio
5 | *.cs diff=csharp
6 |
7 | # Standard to msysgit
8 | *.doc diff=astextplain
9 | *.DOC diff=astextplain
10 | *.docx diff=astextplain
11 | *.DOCX diff=astextplain
12 | *.dot diff=astextplain
13 | *.DOT diff=astextplain
14 | *.pdf diff=astextplain
15 | *.PDF diff=astextplain
16 | *.rtf diff=astextplain
17 | *.RTF diff=astextplain
18 |
--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
1 | # Windows image file caches
2 | Thumbs.db
3 | ehthumbs.db
4 |
5 | # Folder config file
6 | Desktop.ini
7 |
8 | # Recycle Bin used on file shares
9 | $RECYCLE.BIN/
10 |
11 | # Windows Installer files
12 | *.cab
13 | *.msi
14 | *.msm
15 | *.msp
16 |
17 | # Windows shortcuts
18 | *.lnk
19 |
20 | # =========================
21 | # Operating System Files
22 | # =========================
23 |
24 | # OSX
25 | # =========================
26 |
27 | .DS_Store
28 | .AppleDouble
29 | .LSOverride
30 |
31 | # Thumbnails
32 | ._*
33 |
34 | # Files that might appear in the root of a volume
35 | .DocumentRevisions-V100
36 | .fseventsd
37 | .Spotlight-V100
38 | .TemporaryItems
39 | .Trashes
40 | .VolumeIcon.icns
41 |
42 | # Directories potentially created on remote AFP share
43 | .AppleDB
44 | .AppleDesktop
45 | Network Trash Folder
46 | Temporary Items
47 | .apdisk
48 |
--------------------------------------------------------------------------------
/Week 1/Assignment+1.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "---\n",
8 | "\n",
9 | "_You are currently looking at **version 1.0** of this notebook. To download notebooks and datafiles, as well as get help on Jupyter notebooks in the Coursera platform, visit the [Jupyter Notebook FAQ](https://www.coursera.org/learn/python-text-mining/resources/d9pwm) course resource._\n",
10 | "\n",
11 | "---"
12 | ]
13 | },
14 | {
15 | "cell_type": "markdown",
16 | "metadata": {},
17 | "source": [
18 | "# Assignment 1\n",
19 | "\n",
20 | "In this assignment, you'll be working with messy medical data and using regex to extract relevant information from the data. \n",
21 | "\n",
22 | "Each line of the `dates.txt` file corresponds to a medical note. Each note has a date that needs to be extracted, but each date is encoded in one of many formats.\n",
23 | "\n",
24 | "The goal of this assignment is to correctly identify all of the different date variants encoded in this dataset and to properly normalize and sort the dates. \n",
25 | "\n",
26 | "Here is a list of some of the variants you might encounter in this dataset:\n",
27 | "* 04/20/2009; 04/20/09; 4/20/09; 4/3/09\n",
28 | "* Mar-20-2009; Mar 20, 2009; March 20, 2009; Mar. 20, 2009; Mar 20 2009;\n",
29 | "* 20 Mar 2009; 20 March 2009; 20 Mar. 2009; 20 March, 2009\n",
30 | "* Mar 20th, 2009; Mar 21st, 2009; Mar 22nd, 2009\n",
31 | "* Feb 2009; Sep 2009; Oct 2010\n",
32 | "* 6/2008; 12/2009\n",
33 | "* 2009; 2010\n",
34 | "\n",
35 | "Once you have extracted these date patterns from the text, the next step is to sort them in ascending chronological order according to the following rules:\n",
36 | "* Assume all dates in xx/xx/xx format are mm/dd/yy\n",
37 | "* Assume all dates where year is encoded in only two digits are years from the 1900's (e.g. 1/5/89 is January 5th, 1989)\n",
38 | "* If the day is missing (e.g. 9/2009), assume it is the first day of the month (e.g. September 1, 2009).\n",
39 | "* If the month is missing (e.g. 2010), assume it is the first of January of that year (e.g. January 1, 2010).\n",
40 | "\n",
41 | "With these rules in mind, find the correct date in each note and return a pandas Series in chronological order of the original Series' indices.\n",
42 | "\n",
43 | "For example if the original series was this:\n",
44 | "\n",
45 | " 0 1999\n",
46 | " 1 2010\n",
47 | " 2 1978\n",
48 | " 3 2015\n",
49 | " 4 1985\n",
50 | "\n",
51 | "Your function should return this:\n",
52 | "\n",
53 | " 0 2\n",
54 | " 1 4\n",
55 | " 2 0\n",
56 | " 3 1\n",
57 | " 4 3\n",
58 | "\n",
59 | "Your score will be calculated using [Kendall's tau](https://en.wikipedia.org/wiki/Kendall_rank_correlation_coefficient), a correlation measure for ordinal data.\n",
60 | "\n",
61 | "*This function should return a Series of length 500 and dtype int.*"
62 | ]
63 | },
64 | {
65 | "cell_type": "code",
66 | "execution_count": 2,
67 | "metadata": {},
68 | "outputs": [
69 | {
70 | "data": {
71 | "text/plain": [
72 | "0 03/25/93 Total time of visit (in minutes):\\n\n",
73 | "1 6/18/85 Primary Care Doctor:\\n\n",
74 | "2 sshe plans to move as of 7/8/71 In-Home Servic...\n",
75 | "3 7 on 9/27/75 Audit C Score Current:\\n\n",
76 | "4 2/6/96 sleep studyPain Treatment Pain Level (N...\n",
77 | "5 .Per 7/06/79 Movement D/O note:\\n\n",
78 | "6 4, 5/18/78 Patient's thoughts about current su...\n",
79 | "7 10/24/89 CPT Code: 90801 - Psychiatric Diagnos...\n",
80 | "8 3/7/86 SOS-10 Total Score:\\n\n",
81 | "9 (4/10/71)Score-1Audit C Score Current:\\n\n",
82 | "dtype: object"
83 | ]
84 | },
85 | "execution_count": 2,
86 | "metadata": {},
87 | "output_type": "execute_result"
88 | }
89 | ],
90 | "source": [
91 | "import pandas as pd\n",
92 | "import numpy as np\n",
93 | "from datetime import datetime\n",
94 | "doc = []\n",
95 | "with open('dates.txt') as file:\n",
96 | " for line in file:\n",
97 | " doc.append(line)\n",
98 | "\n",
99 | "df = pd.Series(doc)\n",
100 | "df.head(10)"
101 | ]
102 | },
103 | {
104 | "cell_type": "code",
105 | "execution_count": 3,
106 | "metadata": {},
107 | "outputs": [],
108 | "source": [
109 | "def date_sorter():\n",
110 | " \n",
111 | " a1_1 =df.str.extractall(r'(\\d{1,2})[/-](\\d{1,2})[/-](\\d{2})\\b')\n",
112 | " a1_2 =df.str.extractall(r'(\\d{1,2})[/-](\\d{1,2})[/-](\\d{4})\\b')\n",
113 | " a1 = pd.concat([a1_1,a1_2])\n",
114 | " a1.reset_index(inplace=True)\n",
115 | " a1_index = a1['level_0']\n",
116 | " \n",
117 | " a2 = df.str.extractall(r'((?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*[-.]* )((?:\\d{1,2}[?:, -]*)\\d{4})')\n",
118 | " a2.reset_index(inplace=True)\n",
119 | " a2_index = a2['level_0']\n",
120 | " \n",
121 | " a3 = df.str.extractall(r'((?:\\d{1,2} ))?((?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*[?:, -]* )(\\d{4})')\n",
122 | " a3.reset_index(inplace=True)\n",
123 | " a3_index = a3['level_0']\n",
124 | " \n",
125 | " a6 = df.str.extractall(r'(\\d{1,2})[/](\\d{4})')\n",
126 | " a6.reset_index(inplace=True)\n",
127 | " a6_index = a6['level_0']\n",
128 | " save=[]\n",
129 | " for i in a6_index:\n",
130 | " if not(i in a1_index.values):\n",
131 | " save.append(i)\n",
132 | " save = np.asarray(save)\n",
133 | " a6 = a6[a6['level_0'].isin(save)]\n",
134 | "\n",
135 | " \n",
136 | " a7_1= df.str.extractall(r'[a-z]?[^0-9](\\d{4})[^0-9]')\n",
137 | " a7_2 = df.str.extractall(r'^(\\d{4})[^0-9]')\n",
138 | " a7 = pd.concat([a7_1,a7_2])\n",
139 | " a7.reset_index(inplace=True)\n",
140 | "\n",
141 | " a7_index = a7['level_0']\n",
142 | " save=[]\n",
143 | " for i in a7_index:\n",
144 | " if not((i in a2_index.values) | (i in a3_index.values) | (i in a6_index.values)):\n",
145 | " save.append(i)\n",
146 | " save = np.asarray(save)\n",
147 | " a7 = a7[a7['level_0'].isin(save)]\n",
148 | " \n",
149 | " s = a1.level_0.values.tolist()+a2.level_0.values.tolist()+a3.level_0.values.tolist()+a6.level_0.values.tolist()+a7.level_0.values.tolist()\n",
150 | " s = np.asarray(s)\n",
151 | " \n",
152 | " a1.columns=['level_0','match','month','day','year']\n",
153 | " a1['year']=a1['year'].apply(str)\n",
154 | " a1['year']=a1['year'].apply(lambda x: '19'+x if len(x)<=2 else x)\n",
155 | " \n",
156 | " a2[1] = a2[1].apply(lambda x: x.replace(',',''))\n",
157 | " a2['day'] = a2[1].apply(lambda x:x.split(' ')[0])\n",
158 | " a2['year'] = a2[1].apply(lambda x:x.split(' ')[1])\n",
159 | " a2.columns=['level_0','match','month','day-year','day','year']\n",
160 | " a2.drop('day-year',axis=1,inplace=True) \n",
161 | " \n",
162 | " a3.columns=['level_0','match','day','month','year']\n",
163 | " a3['day'] = a3['day'].replace(np.nan,-99)\n",
164 | " a3['day'] = a3['day'].apply(lambda x: 1 if int(x)==-99 else x)\n",
165 | "\n",
166 | " a3['month'] = a3.month.apply(lambda x: x[:3])\n",
167 | " a3['month'] = pd.to_datetime(a3.month, format='%b').dt.month\n",
168 | " \n",
169 | " a6.columns=['level_0','match','month','year']\n",
170 | " a6['day']=1\n",
171 | " \n",
172 | " a7.columns=['level_0','match','year']\n",
173 | " a7['day']=1\n",
174 | " a7['month']=1\n",
175 | " \n",
176 | " final = pd.concat([a1,a2,a3,a6,a7])\n",
177 | " final['date'] =pd.to_datetime(final['month'].apply(str)+'/'+final['day'].apply(str)+'/'+final['year'].apply(str))\n",
178 | " final = final.sort_values(by='level_0').set_index('level_0')\n",
179 | "\n",
180 | " myList = final['date']\n",
181 | " answer = pd.Series([i[0] for i in sorted(enumerate(myList), key=lambda x:x[1])],np.arange(500))\n",
182 | " return answer"
183 | ]
184 | },
185 | {
186 | "cell_type": "code",
187 | "execution_count": 4,
188 | "metadata": {
189 | "collapsed": true
190 | },
191 | "outputs": [],
192 | "source": [
193 | "def diff(first, second):\n",
194 | " second = set(second)\n",
195 | " return [item for item in first if item not in second]"
196 | ]
197 | },
198 | {
199 | "cell_type": "code",
200 | "execution_count": null,
201 | "metadata": {
202 | "collapsed": true
203 | },
204 | "outputs": [],
205 | "source": []
206 | }
207 | ],
208 | "metadata": {
209 | "coursera": {
210 | "course_slug": "python-text-mining",
211 | "graded_item_id": "LvcWI",
212 | "launcher_item_id": "krne9",
213 | "part_id": "Mkp1I"
214 | },
215 | "kernelspec": {
216 | "display_name": "Python 3",
217 | "language": "python",
218 | "name": "python3"
219 | },
220 | "language_info": {
221 | "codemirror_mode": {
222 | "name": "ipython",
223 | "version": 3
224 | },
225 | "file_extension": ".py",
226 | "mimetype": "text/x-python",
227 | "name": "python",
228 | "nbconvert_exporter": "python",
229 | "pygments_lexer": "ipython3",
230 | "version": "3.6.0"
231 | }
232 | },
233 | "nbformat": 4,
234 | "nbformat_minor": 2
235 | }
236 |
--------------------------------------------------------------------------------
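The four fallback rules described in the Week 1 notebook (two-digit years belong to the 1900s, a missing day becomes the 1st, a missing month becomes January) can be sketched on a toy series. The notes and helper below are invented for illustration; they are not the notebook's `date_sorter`, which extracts all 500 dates with `str.extractall` and pattern-exclusion bookkeeping:

```python
import re
import pandas as pd

# Hypothetical notes covering four of the date variants (not lines from dates.txt)
notes = pd.Series([
    "Visit on 04/20/2009 went well",
    "Follow-up 6/2008 scheduled",
    "Admitted Mar 20, 2009 for observation",
    "Last seen 2010",
])

def normalize_date(text):
    """Apply the assignment's fallback rules to a single note."""
    # mm/dd/yyyy or mm/dd/yy; two-digit years are assumed to be 19xx
    m = re.search(r'(\d{1,2})/(\d{1,2})/(\d{2,4})', text)
    if m:
        month, day, year = m.groups()
        if len(year) == 2:
            year = '19' + year
        return pd.Timestamp(int(year), int(month), int(day))
    # Month-name forms such as "Mar 20, 2009"
    m = re.search(r'([A-Z][a-z]{2})[a-z]*\.? (\d{1,2}), (\d{4})', text)
    if m:
        return pd.to_datetime(' '.join(m.groups()), format='%b %d %Y')
    # mm/yyyy: the missing day becomes the first of the month
    m = re.search(r'(\d{1,2})/(\d{4})', text)
    if m:
        return pd.Timestamp(int(m.group(2)), int(m.group(1)), 1)
    # Bare year: the missing month becomes January
    m = re.search(r'(\d{4})', text)
    if m:
        return pd.Timestamp(int(m.group(1)), 1, 1)
    return pd.NaT

dates = notes.apply(normalize_date)
# Indices of the original series in chronological order, the shape date_sorter returns
order = pd.Series(dates.sort_values().index.values)
```

Note the ordering of the branches: the most specific pattern is tried first, which is the same reason the notebook excludes rows already matched by `a1` before accepting the looser `mm/yyyy` and bare-year matches.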
/Week 4/Assignment+4.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "---\n",
8 | "\n",
9 | "_You are currently looking at **version 1.0** of this notebook. To download notebooks and datafiles, as well as get help on Jupyter notebooks in the Coursera platform, visit the [Jupyter Notebook FAQ](https://www.coursera.org/learn/python-text-mining/resources/d9pwm) course resource._\n",
10 | "\n",
11 | "---"
12 | ]
13 | },
14 | {
15 | "cell_type": "markdown",
16 | "metadata": {},
17 | "source": [
18 | "# Assignment 4 - Document Similarity & Topic Modelling"
19 | ]
20 | },
21 | {
22 | "cell_type": "markdown",
23 | "metadata": {},
24 | "source": [
25 | "## Part 1 - Document Similarity\n",
26 | "\n",
27 | "For the first part of this assignment, you will complete the functions `doc_to_synsets` and `similarity_score` which will be used by `document_path_similarity` to find the path similarity between two documents.\n",
28 | "\n",
29 | "The following functions are provided:\n",
30 | "* **`convert_tag:`** converts the tag given by `nltk.pos_tag` to a tag used by `wordnet.synsets`. You will need to use this function in `doc_to_synsets`.\n",
31 | "* **`document_path_similarity:`** computes the symmetrical path similarity between two documents by finding the synsets in each document using `doc_to_synsets`, then computing similarities using `similarity_score`.\n",
32 | "\n",
33 | "You will need to finish writing the following functions:\n",
34 | "* **`doc_to_synsets:`** returns a list of synsets in document. This function should first tokenize and part-of-speech tag the document using `nltk.word_tokenize` and `nltk.pos_tag`. Then it should find each token's corresponding synset using `wn.synsets(token, wordnet_tag)`. The first synset match should be used. If there is no match, that token is skipped.\n",
35 | "* **`similarity_score:`** returns the normalized similarity score of a list of synsets (s1) onto a second list of synsets (s2). For each synset in s1, find the synset in s2 with the largest similarity value. Sum all of the largest similarity values together and normalize this value by dividing it by the number of largest similarity values found. Be careful with data types, which should be floats. Missing values should be ignored.\n",
36 | "\n",
37 | "Once `doc_to_synsets` and `similarity_score` have been completed, submit to the autograder which will run `test_document_path_similarity` to test that these functions are running correctly. \n",
38 | "\n",
39 | "*Do not modify the functions `convert_tag`, `document_path_similarity`, and `test_document_path_similarity`.*"
40 | ]
41 | },
42 | {
43 | "cell_type": "code",
44 | "execution_count": 3,
45 | "metadata": {
46 | "collapsed": true
47 | },
48 | "outputs": [],
49 | "source": [
50 | "import numpy as np\n",
51 | "import nltk\n",
52 | "from nltk.corpus import wordnet as wn\n",
53 | "import pandas as pd\n",
54 | "\n",
55 | "from nltk.stem import WordNetLemmatizer\n",
56 | "from nltk import pos_tag, word_tokenize\n",
57 | "\n",
60 | "def conversion(tag):\n",
61 | " if tag.startswith('J'):\n",
62 | " return wn.ADJ\n",
63 | " elif tag.startswith('N'):\n",
64 | " return wn.NOUN\n",
65 | " elif tag.startswith('R'):\n",
66 | " return wn.ADV\n",
67 | " elif tag.startswith('V'):\n",
68 | " return wn.VERB\n",
69 | " return None\n",
70 | "\n",
71 | "def convert_tag(tag):\n",
72 | " \"\"\"Convert the tag given by nltk.pos_tag to the tag used by wordnet.synsets\"\"\"\n",
73 | " \n",
74 | " tag_dict = {'N': 'n', 'J': 'a', 'R': 'r', 'V': 'v'}\n",
75 | " try:\n",
76 | " return tag_dict[tag[0]]\n",
77 | " except KeyError:\n",
78 | " return None\n",
79 | "\n",
80 | "\n",
81 | "def doc_to_synsets(doc):\n",
82 | " \"\"\"\n",
83 | " Returns a list of synsets in document.\n",
84 | "\n",
85 | " Tokenizes and tags the words in the document doc.\n",
86 | " Then finds the first synset for each word/tag combination.\n",
87 | " If a synset is not found for that combination it is skipped.\n",
88 | "\n",
89 | " Args:\n",
90 | " doc: string to be converted\n",
91 | "\n",
92 | " Returns:\n",
93 | " list of synsets\n",
94 | "\n",
95 | " Example:\n",
96 | " doc_to_synsets('Fish are nvqjp friends.')\n",
97 | " Out: [Synset('fish.n.01'), Synset('be.v.01'), Synset('friend.n.01')]\n",
98 | " \"\"\"\n",
99 | " \n",
100 | "\n",
101 | " # Your Code Here\n",
102 | "\n",
103 | " part_of_speech = pos_tag(word_tokenize(doc))\n",
104 | "\n",
105 | " lemmatzr = WordNetLemmatizer()\n",
106 | " results = []\n",
107 | " for token in part_of_speech:\n",
108 | " wn_tag = conversion(token[1])\n",
109 | " if not wn_tag:\n",
110 | " continue\n",
111 | "\n",
112 | " lemma = lemmatzr.lemmatize(token[0], pos=wn_tag)\n",
113 | " synsets = wn.synsets(lemma, pos=wn_tag)\n",
114 | " if len(synsets) > 0 :\n",
115 | " results.append(synsets[0])\n",
116 | "\n",
117 | " return results # Your Answer Here\n",
118 | "\n",
119 | "\n",
120 | "def similarity_score(s1, s2):\n",
121 | " \"\"\"\n",
122 | " Calculate the normalized similarity score of s1 onto s2\n",
123 | "\n",
124 | " For each synset in s1, finds the synset in s2 with the largest similarity value.\n",
125 | "    Sums all of the largest similarity values and normalizes by dividing by the\n",
126 | " number of largest similarity values found.\n",
127 | "\n",
128 | " Args:\n",
129 | " s1, s2: list of synsets from doc_to_synsets\n",
130 | "\n",
131 | " Returns:\n",
132 | " normalized similarity score of s1 onto s2\n",
133 | "\n",
134 | " Example:\n",
135 | " synsets1 = doc_to_synsets('I like cats')\n",
136 | " synsets2 = doc_to_synsets('I like dogs')\n",
137 | " similarity_score(synsets1, synsets2)\n",
138 | " Out: 0.73333333333333339\n",
139 | " \"\"\"\n",
140 | " \n",
141 | " \n",
142 | " # Your Code Here\n",
143 | " s =[]\n",
144 | " for i1 in s1:\n",
145 | " r = []\n",
146 | " for i2 in s2:\n",
147 | " r.append(i1.path_similarity(i2))\n",
148 | " result = [x for x in r if x is not None]\n",
149 | " if len(result) > 0 :\n",
150 | " s.append(max(result))\n",
151 | "\n",
152 | " return sum(s)/len(s)# Your Answer Here\n",
153 | "\n",
154 | "\n",
155 | "def document_path_similarity(doc1, doc2):\n",
156 | " \"\"\"Finds the symmetrical similarity between doc1 and doc2\"\"\"\n",
157 | "\n",
158 | " synsets1 = doc_to_synsets(doc1)\n",
159 | " synsets2 = doc_to_synsets(doc2)\n",
160 | "\n",
161 | " return (similarity_score(synsets1, synsets2) + similarity_score(synsets2, synsets1)) / 2"
162 | ]
163 | },
164 | {
165 | "cell_type": "markdown",
166 | "metadata": {},
167 | "source": [
168 | "### test_document_path_similarity\n",
169 | "\n",
170 | "Use this function to check if doc_to_synsets and similarity_score are correct.\n",
171 | "\n",
172 | "*This function should return the similarity score as a float.*"
173 | ]
174 | },
175 | {
176 | "cell_type": "code",
177 | "execution_count": 4,
178 | "metadata": {},
179 | "outputs": [
180 | {
181 | "data": {
182 | "text/plain": [
183 | "0.6392857142857143"
184 | ]
185 | },
186 | "execution_count": 4,
187 | "metadata": {},
188 | "output_type": "execute_result"
189 | }
190 | ],
191 | "source": [
192 | "def test_document_path_similarity():\n",
193 | " doc1 = 'This is a function to test document_path_similarity.'\n",
194 | " doc2 = 'Use this function to see if your code in doc_to_synsets \\\n",
195 | " and similarity_score is correct!'\n",
196 | " return document_path_similarity(doc1, doc2)"
197 | ]
198 | },
199 | {
200 | "cell_type": "markdown",
201 | "metadata": {},
202 | "source": [
203 | "\n",
204 | "___\n",
205 | "`paraphrases` is a DataFrame which contains the following columns: `Quality`, `D1`, and `D2`.\n",
206 | "\n",
207 | "`Quality` is an indicator variable which indicates if the two documents `D1` and `D2` are paraphrases of one another (1 for paraphrase, 0 for not paraphrase)."
208 | ]
209 | },
210 | {
211 | "cell_type": "code",
212 | "execution_count": 10,
213 | "metadata": {},
214 | "outputs": [
215 | {
216 | "data": {
276 | "text/plain": [
277 | " Quality D1 \\\n",
278 | "0 1 Ms Stewart, the chief executive, was not expec... \n",
279 | "1 1 After more than two years' detention under the... \n",
280 | "2 1 \"It still remains to be seen whether the reven... \n",
281 | "3 0 And it's going to be a wild ride,\" said Allan ... \n",
282 | "4 1 The cards are issued by Mexico's consulates to... \n",
283 | "\n",
284 | " D2 \n",
285 | "0 Ms Stewart, 61, its chief executive officer an... \n",
286 | "1 After more than two years in detention by the ... \n",
287 | "2 \"It remains to be seen whether the revenue rec... \n",
288 | "3 Now the rest is just mechanical,\" said Allan H... \n",
289 | "4 The card is issued by Mexico's consulates to i... "
290 | ]
291 | },
292 | "execution_count": 10,
293 | "metadata": {},
294 | "output_type": "execute_result"
295 | }
296 | ],
297 | "source": [
298 | "# Use this dataframe for questions most_similar_docs and label_accuracy\n",
299 | "paraphrases = pd.read_csv('paraphrases.csv')\n",
300 | "paraphrases.head()"
301 | ]
302 | },
303 | {
304 | "cell_type": "markdown",
305 | "metadata": {},
306 | "source": [
307 | "___\n",
308 | "\n",
309 | "### most_similar_docs\n",
310 | "\n",
311 | "Using `document_path_similarity`, find the pair of documents in paraphrases which has the maximum similarity score.\n",
312 | "\n",
313 | "*This function should return a tuple `(D1, D2, similarity_score)`*"
314 | ]
315 | },
316 | {
317 | "cell_type": "code",
318 | "execution_count": 11,
319 | "metadata": {
320 | "collapsed": true
321 | },
322 | "outputs": [],
323 | "source": [
324 | "def most_similar_docs():\n",
325 | " \n",
326 | " # Your Code Here\n",
327 | " s = 0.0\n",
328 | " for i in range(len(paraphrases)):\n",
329 | " similarity = document_path_similarity(paraphrases['D1'][i], paraphrases['D2'][i])\n",
330 | " if s < similarity:\n",
331 | " s = similarity\n",
332 | "            result = (paraphrases['D1'][i], paraphrases['D2'][i], similarity)\n",
333 | " return result# Your Answer Here"
334 | ]
335 | },
336 | {
337 | "cell_type": "markdown",
338 | "metadata": {},
339 | "source": [
340 | "### label_accuracy\n",
341 | "\n",
342 | "Provide labels for the twenty pairs of documents by computing the similarity for each pair using `document_path_similarity`. Let the classifier rule be that if the score is greater than 0.75, label is paraphrase (1), else label is not paraphrase (0). Report accuracy of the classifier using scikit-learn's accuracy_score.\n",
343 | "\n",
344 | "*This function should return a float.*"
345 | ]
346 | },
347 | {
348 | "cell_type": "code",
349 | "execution_count": 12,
350 | "metadata": {
351 | "collapsed": true
352 | },
353 | "outputs": [],
354 | "source": [
355 | "def label_accuracy():\n",
356 | " from sklearn.metrics import accuracy_score\n",
357 | "\n",
358 | " # Your Code Here\n",
359 | " label=[]\n",
360 | " for i in range(len(paraphrases)):\n",
361 | " similarity = document_path_similarity(paraphrases['D1'][i], paraphrases['D2'][i])\n",
362 | " if similarity > 0.75:\n",
363 | " label.append(1)\n",
364 | " else:\n",
365 | " label.append(0)\n",
366 | " \n",
367 | " return accuracy_score(paraphrases['Quality'],label)# Your Answer Here"
368 | ]
369 | },
370 | {
371 | "cell_type": "markdown",
372 | "metadata": {},
373 | "source": [
374 | "## Part 2 - Topic Modelling\n",
375 | "\n",
376 | "For the second part of this assignment, you will use Gensim's LDA (Latent Dirichlet Allocation) model to model topics in `newsgroup_data`. You will first need to finish the code in the cell below by using gensim.models.ldamodel.LdaModel constructor to estimate LDA model parameters on the corpus, and save to the variable `ldamodel`. Extract 10 topics using `corpus` and `id_map`, and with `passes=25` and `random_state=34`."
377 | ]
378 | },
379 | {
380 | "cell_type": "code",
381 | "execution_count": 13,
382 | "metadata": {
383 | "collapsed": true
384 | },
385 | "outputs": [],
386 | "source": [
387 | "import pickle\n",
388 | "import gensim\n",
389 | "from sklearn.feature_extraction.text import CountVectorizer\n",
390 | "\n",
391 | "# Load the list of documents\n",
392 | "with open('newsgroups', 'rb') as f:\n",
393 | " newsgroup_data = pickle.load(f)\n",
394 | "\n",
395 | "# Use CountVectorizor to find three letter tokens, remove stop_words, \n",
396 | "# remove tokens that don't appear in at least 20 documents,\n",
397 | "# remove tokens that appear in more than 20% of the documents\n",
398 | "vect = CountVectorizer(min_df=20, max_df=0.2, stop_words='english', \n",
399 | " token_pattern='(?u)\\\\b\\\\w\\\\w\\\\w+\\\\b')\n",
400 | "# Fit and transform\n",
401 | "X = vect.fit_transform(newsgroup_data)\n",
402 | "\n",
403 | "# Convert sparse matrix to gensim corpus.\n",
404 | "corpus = gensim.matutils.Sparse2Corpus(X, documents_columns=False)\n",
405 | "\n",
406 | "# Mapping from word IDs to words (To be used in LdaModel's id2word parameter)\n",
407 | "id_map = dict((v, k) for k, v in vect.vocabulary_.items())"
408 | ]
409 | },
410 | {
411 | "cell_type": "code",
412 | "execution_count": 14,
413 | "metadata": {
414 | "collapsed": true
415 | },
416 | "outputs": [],
417 | "source": [
418 | "# Use the gensim.models.ldamodel.LdaModel constructor to estimate \n",
419 | "# LDA model parameters on the corpus, and save to the variable `ldamodel`\n",
420 | "\n",
421 | "# Your code here:\n",
422 | "ldamodel = gensim.models.ldamodel.LdaModel(corpus,num_topics=10,id2word=id_map,random_state=34,passes=25)"
423 | ]
424 | },
425 | {
426 | "cell_type": "markdown",
427 | "metadata": {},
428 | "source": [
429 | "### lda_topics\n",
430 | "\n",
431 | "Using `ldamodel`, find a list of the 10 topics and the most significant 10 words in each topic. This should be structured as a list of 10 tuples where each tuple takes on the form:\n",
432 | "\n",
433 | "`(9, '0.068*\"space\" + 0.036*\"nasa\" + 0.021*\"science\" + 0.020*\"edu\" + 0.019*\"data\" + 0.017*\"shuttle\" + 0.015*\"launch\" + 0.015*\"available\" + 0.014*\"center\" + 0.014*\"sci\"')`\n",
434 | "\n",
435 | "for example.\n",
436 | "\n",
437 | "*This function should return a list of tuples.*"
438 | ]
439 | },
440 | {
441 | "cell_type": "code",
442 | "execution_count": null,
443 | "metadata": {
444 | "collapsed": true
445 | },
446 | "outputs": [],
447 | "source": [
448 | "def lda_topics():\n",
449 | " \n",
450 | " # Your Code Here\n",
451 | " \n",
452 | " return ldamodel.print_topics(num_topics=10, num_words=10)"
453 | ]
454 | },
455 | {
456 | "cell_type": "markdown",
457 | "metadata": {},
458 | "source": [
459 | "### topic_distribution\n",
460 | "\n",
461 | "For the new document `new_doc`, find the topic distribution. Remember to use vect.transform on the new doc, and Sparse2Corpus to convert the sparse matrix to a gensim corpus.\n",
462 | "\n",
463 | "*This function should return a list of tuples, where each tuple is `(#topic, probability)`*"
464 | ]
465 | },
466 | {
467 | "cell_type": "code",
468 | "execution_count": null,
469 | "metadata": {
470 | "collapsed": true
471 | },
472 | "outputs": [],
473 | "source": [
474 | "new_doc = [\"\\n\\nIt's my understanding that the freezing will start to occur because \\\n",
475 | "of the\\ngrowing distance of Pluto and Charon from the Sun, due to it's\\nelliptical orbit. \\\n",
476 | "It is not due to shadowing effects. \\n\\n\\nPluto can shadow Charon, and vice-versa.\\n\\nGeorge \\\n",
477 | "Krumins\\n-- \"]"
478 | ]
479 | },
480 | {
481 | "cell_type": "code",
482 | "execution_count": null,
483 | "metadata": {
484 | "collapsed": true
485 | },
486 | "outputs": [],
487 | "source": [
488 | "def topic_distribution():\n",
489 | "    \n",
490 | "    # Your Code Here\n",
491 | "    # Transform the new document with the already-fitted vectorizer\n",
492 | "    # (refitting would produce a vocabulary the trained model does not\n",
493 | "    # recognize), then convert to a gensim corpus and query the model.\n",
494 | "    new_X = vect.transform(new_doc)\n",
495 | "    new_corpus = gensim.matutils.Sparse2Corpus(new_X, documents_columns=False)\n",
496 | "\n",
497 | "    # get_document_topics yields one (topic, probability) list per document\n",
498 | "    topic_dist = list(ldamodel.get_document_topics(new_corpus))[0]\n",
499 | "    \n",
500 | "    return topic_dist"
501 | ]
502 | },
503 | {
504 | "cell_type": "markdown",
505 | "metadata": {},
506 | "source": [
507 | "### topic_names\n",
508 | "\n",
509 | "From the list of the following given topics, assign topic names to the topics you found. If none of these names best matches the topics you found, create a new 1-3 word \"title\" for the topic.\n",
510 | "\n",
511 | "Topics: Health, Science, Automobiles, Politics, Government, Travel, Computers & IT, Sports, Business, Society & Lifestyle, Religion, Education.\n",
512 | "\n",
513 | "*This function should return a list of 10 strings.*"
514 | ]
515 | },
516 | {
517 | "cell_type": "code",
518 | "execution_count": null,
519 | "metadata": {
520 | "collapsed": true
521 | },
522 | "outputs": [],
523 | "source": [
524 | "topics= ['Health', 'Science', 'Automobiles', 'Politics', 'Government', \n",
525 | " 'Travel', 'Computers & IT', 'Sports', \n",
526 | " 'Business', 'Society & Lifestyle', 'Religion', 'Education']"
527 | ]
528 | },
529 | {
530 | "cell_type": "code",
531 | "execution_count": null,
532 | "metadata": {
533 | "collapsed": true
534 | },
535 | "outputs": [],
536 | "source": [
537 | "def topic_names(topics):\n",
538 | " \n",
539 | " # Your Code Here\n",
540 | " \n",
541 | " return topics"
542 | ]
543 | }
544 | ],
545 | "metadata": {
546 | "coursera": {
547 | "course_slug": "python-text-mining",
548 | "graded_item_id": "2qbcK",
549 | "launcher_item_id": "pi9Sh",
550 | "part_id": "kQiwX"
551 | },
552 | "kernelspec": {
553 | "display_name": "Python 3",
554 | "language": "python",
555 | "name": "python3"
556 | },
557 | "language_info": {
558 | "codemirror_mode": {
559 | "name": "ipython",
560 | "version": 3
561 | },
562 | "file_extension": ".py",
563 | "mimetype": "text/x-python",
564 | "name": "python",
565 | "nbconvert_exporter": "python",
566 | "pygments_lexer": "ipython3",
567 | "version": "3.6.0"
568 | }
569 | },
570 | "nbformat": 4,
571 | "nbformat_minor": 2
572 | }
573 |
--------------------------------------------------------------------------------
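The max-then-mean aggregation that `similarity_score` performs in the Week 4 notebook can be exercised without any WordNet data by hand-feeding it a similarity matrix. The numbers below are invented, with `None` standing in for pairs that `path_similarity` cannot score:

```python
# Made-up pairwise similarities: rows = synsets of s1, columns = synsets of s2.
# None plays the role of path_similarity returning no score for a pair.
sims = [
    [0.5, None, 0.25],
    [None, 1.0, 0.2],
    [None, None, None],  # no comparable counterpart: this row is skipped
]

def aggregate(matrix):
    """Mean of the per-row maxima, skipping rows with no usable score,
    mirroring the notebook's similarity_score."""
    best = []
    for row in matrix:
        scores = [v for v in row if v is not None]
        if scores:
            best.append(max(scores))
    return sum(best) / len(best)

result = aggregate(sims)  # (0.5 + 1.0) / 2 = 0.75
```

Because rows with no usable score are dropped from both the sum and the count, the result stays a proper mean; `document_path_similarity` then averages this value computed in both directions to make the score symmetric.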
/Week 2/Assignment+2.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "---\n",
8 | "\n",
9 | "_You are currently looking at **version 1.0** of this notebook. To download notebooks and datafiles, as well as get help on Jupyter notebooks in the Coursera platform, visit the [Jupyter Notebook FAQ](https://www.coursera.org/learn/python-text-mining/resources/d9pwm) course resource._\n",
10 | "\n",
11 | "---"
12 | ]
13 | },
14 | {
15 | "cell_type": "markdown",
16 | "metadata": {},
17 | "source": [
18 | "# Assignment 2 - Introduction to NLTK\n",
19 | "\n",
20 | "In part 1 of this assignment you will use nltk to explore the Herman Melville novel Moby Dick. Then in part 2 you will create a spelling recommender function that uses nltk to find words similar to the misspelling. "
21 | ]
22 | },
23 | {
24 | "cell_type": "markdown",
25 | "metadata": {},
26 | "source": [
27 | "## Part 1 - Analyzing Moby Dick"
28 | ]
29 | },
30 | {
31 | "cell_type": "code",
32 | "execution_count": 2,
33 | "metadata": {
34 | "collapsed": true
35 | },
36 | "outputs": [],
37 | "source": [
38 | "import nltk\n",
39 | "import pandas as pd\n",
40 | "import numpy as np\n",
41 | "from nltk.stem import WordNetLemmatizer\n",
42 | "import re\n",
43 | "import operator\n",
44 | "from nltk.tokenize import sent_tokenize\n",
45 | "from nltk.tokenize import word_tokenize\n",
46 | "from nltk.corpus import words\n",
47 | "\n",
48 | "# If you would like to work with the raw text you can use 'moby_raw'\n",
49 | "with open('moby.txt', 'r') as f:\n",
50 | " moby_raw = f.read()\n",
51 | " \n",
52 | "# If you would like to work with the novel in nltk.Text format you can use 'text1'\n",
53 | "moby_tokens = nltk.word_tokenize(moby_raw)\n",
54 | "text1 = nltk.Text(moby_tokens)"
55 | ]
56 | },
57 | {
58 | "cell_type": "markdown",
59 | "metadata": {},
60 | "source": [
61 | "### Example 1\n",
62 | "\n",
63 | "How many tokens (words and punctuation symbols) are in text1?\n",
64 | "\n",
65 | "*This function should return an integer.*"
66 | ]
67 | },
68 | {
69 | "cell_type": "code",
70 | "execution_count": 3,
71 | "metadata": {},
72 | "outputs": [
73 | {
74 | "data": {
75 | "text/plain": [
76 | "254989"
77 | ]
78 | },
79 | "execution_count": 3,
80 | "metadata": {},
81 | "output_type": "execute_result"
82 | }
83 | ],
84 | "source": [
85 | "def example_one():\n",
86 | " \n",
87 | " return len(nltk.word_tokenize(moby_raw)) # or alternatively len(text1)\n",
88 | "\n",
89 | "example_one()"
90 | ]
91 | },
92 | {
93 | "cell_type": "markdown",
94 | "metadata": {},
95 | "source": [
96 | "### Example 2\n",
97 | "\n",
98 | "How many unique tokens (unique words and punctuation) does text1 have?\n",
99 | "\n",
100 | "*This function should return an integer.*"
101 | ]
102 | },
103 | {
104 | "cell_type": "code",
105 | "execution_count": 4,
106 | "metadata": {},
107 | "outputs": [
108 | {
109 | "data": {
110 | "text/plain": [
111 | "20755"
112 | ]
113 | },
114 | "execution_count": 4,
115 | "metadata": {},
116 | "output_type": "execute_result"
117 | }
118 | ],
119 | "source": [
120 | "def example_two():\n",
121 | " \n",
122 | " return len(set(nltk.word_tokenize(moby_raw))) # or alternatively len(set(text1))\n",
123 | "\n",
124 | "example_two()"
125 | ]
126 | },
127 | {
128 | "cell_type": "markdown",
129 | "metadata": {},
130 | "source": [
131 | "### Example 3\n",
132 | "\n",
133 | "After lemmatizing the verbs, how many unique tokens does text1 have?\n",
134 | "\n",
135 | "*This function should return an integer.*"
136 | ]
137 | },
138 | {
139 | "cell_type": "code",
140 | "execution_count": 5,
141 | "metadata": {},
142 | "outputs": [
143 | {
144 | "data": {
145 | "text/plain": [
146 | "16900"
147 | ]
148 | },
149 | "execution_count": 5,
150 | "metadata": {},
151 | "output_type": "execute_result"
152 | }
153 | ],
154 | "source": [
155 | "from nltk.stem import WordNetLemmatizer\n",
156 | "\n",
157 | "def example_three():\n",
158 | "\n",
159 | " lemmatizer = WordNetLemmatizer()\n",
160 | " lemmatized = [lemmatizer.lemmatize(w,'v') for w in text1]\n",
161 | "\n",
162 | " return len(set(lemmatized))\n",
163 | "\n",
164 | "example_three()"
165 | ]
166 | },
167 | {
168 | "cell_type": "markdown",
169 | "metadata": {},
170 | "source": [
171 | "### Question 1\n",
172 | "\n",
173 | "What is the lexical diversity of the given text input? (i.e. ratio of unique tokens to the total number of tokens)\n",
174 | "\n",
175 | "*This function should return a float.*"
176 | ]
177 | },
178 | {
179 | "cell_type": "code",
180 | "execution_count": 6,
181 | "metadata": {},
182 | "outputs": [
183 | {
184 | "data": {
185 | "text/plain": [
186 | "0.081"
187 | ]
188 | },
189 | "execution_count": 6,
190 | "metadata": {},
191 | "output_type": "execute_result"
192 | }
193 | ],
194 | "source": [
195 | "def answer_one():\n",
196 | " \n",
197 | " return round(len(set(text1))/len(text1),3)\n",
198 | "\n",
199 | "answer_one()"
200 | ]
201 | },
202 | {
203 | "cell_type": "markdown",
204 | "metadata": {},
205 | "source": [
206 | "### Question 2\n",
207 | "\n",
208 |     "What percentage of tokens is 'whale' or 'Whale'?\n",
209 | "\n",
210 | "*This function should return a float.*"
211 | ]
212 | },
213 | {
214 | "cell_type": "code",
215 | "execution_count": 7,
216 | "metadata": {},
217 | "outputs": [
218 | {
219 | "data": {
220 | "text/plain": [
221 | "0.4125668166077752"
222 | ]
223 | },
224 | "execution_count": 7,
225 | "metadata": {},
226 | "output_type": "execute_result"
227 | }
228 | ],
229 | "source": [
230 | "def answer_two():\n",
231 | " \n",
232 | " test = [w for w in text1 if re.search(r'^[Ww]hale$',w)] \n",
233 | " return ( len(test)/len(text1) ) * 100\n",
234 | "\n",
235 | "answer_two()"
236 | ]
237 | },
238 | {
239 | "cell_type": "markdown",
240 | "metadata": {},
241 | "source": [
242 | "### Question 3\n",
243 | "\n",
244 | "What are the 20 most frequently occurring (unique) tokens in the text? What is their frequency?\n",
245 | "\n",
246 | "*This function should return a list of 20 tuples where each tuple is of the form `(token, frequency)`. The list should be sorted in descending order of frequency.*"
247 | ]
248 | },
249 | {
250 | "cell_type": "code",
251 | "execution_count": 8,
252 | "metadata": {},
253 | "outputs": [
254 | {
255 | "data": {
256 | "text/plain": [
257 | "[(',', 19204),\n",
258 | " ('the', 13715),\n",
259 | " ('.', 7308),\n",
260 | " ('of', 6513),\n",
261 | " ('and', 6010),\n",
262 | " ('a', 4545),\n",
263 | " ('to', 4515),\n",
264 | " (';', 4173),\n",
265 | " ('in', 3908),\n",
266 | " ('that', 2978),\n",
267 | " ('his', 2459),\n",
268 | " ('it', 2196),\n",
269 | " ('I', 2097),\n",
270 | " ('!', 1767),\n",
271 | " ('is', 1722),\n",
272 | " ('--', 1713),\n",
273 | " ('with', 1659),\n",
274 | " ('he', 1658),\n",
275 | " ('was', 1639),\n",
276 | " ('as', 1620)]"
277 | ]
278 | },
279 | "execution_count": 8,
280 | "metadata": {},
281 | "output_type": "execute_result"
282 | }
283 | ],
284 | "source": [
285 | "def answer_three():\n",
286 | " dist = nltk.FreqDist(text1)\n",
287 | " sorted_x = sorted(dist.items(),key=operator.itemgetter(1))\n",
288 | " sorted_x.reverse()\n",
289 | " \n",
290 | " return sorted_x[:20]\n",
291 | "\n",
292 | "answer_three()"
293 | ]
294 | },
295 | {
296 | "cell_type": "markdown",
297 | "metadata": {},
298 | "source": [
299 | "### Question 4\n",
300 | "\n",
301 | "What tokens have a length of greater than 5 and frequency of more than 150?\n",
302 | "\n",
303 | "*This function should return a sorted list of the tokens that match the above constraints. To sort your list, use `sorted()`*"
304 | ]
305 | },
306 | {
307 | "cell_type": "code",
308 | "execution_count": 9,
309 | "metadata": {},
310 | "outputs": [
311 | {
312 | "data": {
313 | "text/plain": [
314 | "['Captain',\n",
315 | " 'Pequod',\n",
316 | " 'Queequeg',\n",
317 | " 'Starbuck',\n",
318 | " 'almost',\n",
319 | " 'before',\n",
320 | " 'himself',\n",
321 | " 'little',\n",
322 | " 'seemed',\n",
323 | " 'should',\n",
324 | " 'though',\n",
325 | " 'through',\n",
326 | " 'whales',\n",
327 | " 'without']"
328 | ]
329 | },
330 | "execution_count": 9,
331 | "metadata": {},
332 | "output_type": "execute_result"
333 | }
334 | ],
335 | "source": [
336 | "def answer_four():\n",
337 | " dist = nltk.FreqDist(text1)\n",
338 | " vocab1 = dist.keys()\n",
339 | " freqwords = [w for w in vocab1 if len(w) > 5 and dist[w] > 150]\n",
340 | " return sorted(freqwords)\n",
341 | "\n",
342 | "answer_four()"
343 | ]
344 | },
345 | {
346 | "cell_type": "markdown",
347 | "metadata": {},
348 | "source": [
349 | "### Question 5\n",
350 | "\n",
351 | "Find the longest word in text1 and that word's length.\n",
352 | "\n",
353 | "*This function should return a tuple `(longest_word, length)`.*"
354 | ]
355 | },
356 | {
357 | "cell_type": "code",
358 | "execution_count": 10,
359 | "metadata": {},
360 | "outputs": [
361 | {
362 | "data": {
363 | "text/plain": [
364 | "(\"twelve-o'clock-at-night\", 23)"
365 | ]
366 | },
367 | "execution_count": 10,
368 | "metadata": {},
369 | "output_type": "execute_result"
370 | }
371 | ],
372 | "source": [
373 | "def answer_five():\n",
374 | "\n",
375 | " length = [len(w) for w in text1]\n",
376 | " return tuple((text1[length.index(max(length))],max(length)))\n",
377 | "\n",
378 | "answer_five()"
379 | ]
380 | },
381 | {
382 | "cell_type": "markdown",
383 | "metadata": {},
384 | "source": [
385 | "### Question 6\n",
386 | "\n",
387 | "What unique words have a frequency of more than 2000? What is their frequency?\n",
388 | "\n",
389 | "\"Hint: you may want to use `isalpha()` to check if the token is a word and not punctuation.\"\n",
390 | "\n",
391 | "*This function should return a list of tuples of the form `(frequency, word)` sorted in descending order of frequency.*"
392 | ]
393 | },
394 | {
395 | "cell_type": "code",
396 | "execution_count": 18,
397 | "metadata": {},
398 | "outputs": [
399 |     {
400 |      "data": {
401 |       "text/plain": [
402 |        "[(13715, 'the'),\n",
403 |        " (6513, 'of'),\n",
404 |        " (6010, 'and'),\n",
405 |        " (4545, 'a'),\n",
406 |        " (4515, 'to'),\n",
407 |        " (3908, 'in'),\n",
408 |        " (2978, 'that'),\n",
409 |        " (2459, 'his'),\n",
410 |        " (2196, 'it'),\n",
411 |        " (2097, 'I')]"
412 |       ]
413 |      },
414 |      "execution_count": 18,
415 |      "metadata": {},
416 |      "output_type": "execute_result"
417 |     }
418 |    ],
419 |    "source": [
420 |     "def answer_six():\n",
421 |     "    # Frequency distribution over all tokens in the text\n",
422 |     "    dist = nltk.FreqDist(text1)\n",
423 |     "\n",
424 |     "    # Keep alphabetic tokens (drops punctuation) with frequency > 2000\n",
425 |     "    freq_words = [(freq, word) for word, freq in dist.items()\n",
426 |     "                  if word.isalpha() and freq > 2000]\n",
427 |     "\n",
428 |     "    # Sort in descending order of frequency\n",
429 |     "    return sorted(freq_words, reverse=True)\n",
430 | "\n",
431 | "answer_six()"
432 | ]
433 | },
434 | {
435 | "cell_type": "markdown",
436 | "metadata": {},
437 | "source": [
438 | "### Question 7\n",
439 | "\n",
440 | "What is the average number of tokens per sentence?\n",
441 | "\n",
442 | "*This function should return a float.*"
443 | ]
444 | },
445 | {
446 | "cell_type": "code",
447 | "execution_count": 12,
448 | "metadata": {},
449 | "outputs": [
450 | {
451 | "data": {
452 | "text/plain": [
453 | "25.881952902963864"
454 | ]
455 | },
456 | "execution_count": 12,
457 | "metadata": {},
458 | "output_type": "execute_result"
459 | }
460 | ],
461 | "source": [
462 | "def answer_seven():\n",
463 | "\n",
464 | " moby_sentence = sent_tokenize(moby_raw)\n",
465 | " again = [word_tokenize(i) for i in moby_sentence]\n",
466 | " numbers = []\n",
467 | " for i in again:\n",
468 | " numbers.append(len(i))\n",
469 | " return np.mean(numbers)\n",
470 | "\n",
471 | "answer_seven()"
472 | ]
473 | },
474 | {
475 | "cell_type": "markdown",
476 | "metadata": {},
477 | "source": [
478 | "### Question 8\n",
479 | "\n",
480 | "What are the 5 most frequent parts of speech in this text? What is their frequency?\n",
481 | "\n",
482 | "*This function should return a list of tuples of the form `(part_of_speech, frequency)` sorted in descending order of frequency.*"
483 | ]
484 | },
485 | {
486 | "cell_type": "code",
487 | "execution_count": 13,
488 | "metadata": {},
489 | "outputs": [
490 | {
491 | "data": {
492 | "text/plain": [
493 | "[('NN', 32730), ('IN', 28657), ('DT', 25867), ('JJ', 17620), ('RB', 13756)]"
494 | ]
495 | },
496 | "execution_count": 13,
497 | "metadata": {},
498 | "output_type": "execute_result"
499 | }
500 | ],
501 | "source": [
502 | "def answer_eight():\n",
503 | " parts_of_speech = nltk.pos_tag(text1)\n",
504 | " count = nltk.FreqDist(tag for (word, tag) in parts_of_speech)\n",
505 | " answer = count.most_common()[:6]\n",
506 | " output = [i for i in answer if i[0]!=',']\n",
507 | " return output\n",
508 | "\n",
509 | "answer_eight()"
510 | ]
511 | },
512 | {
513 | "cell_type": "markdown",
514 | "metadata": {},
515 | "source": [
516 | "## Part 2 - Spelling Recommender\n",
517 | "\n",
518 |     "For this part of the assignment you will create three different spelling recommenders, each of which takes a list of misspelled words and recommends a correctly spelled word for every word in the list.\n",
519 |     "\n",
520 |     "For every misspelled word, the recommender should find the word in `correct_spellings` that has the shortest distance* and starts with the same letter as the misspelled word, and return that word as a recommendation.\n",
521 | "\n",
522 | "*Each of the three different recommenders will use a different distance measure (outlined below).\n",
523 | "\n",
524 | "Each of the recommenders should provide recommendations for the three default words provided: `['cormulent', 'incendenece', 'validrate']`."
525 | ]
526 | },
527 | {
528 | "cell_type": "code",
529 | "execution_count": 14,
530 | "metadata": {
531 | "collapsed": true
532 | },
533 | "outputs": [],
534 | "source": [
535 | "from nltk.corpus import words\n",
536 | "\n",
537 | "correct_spellings = words.words()"
538 | ]
539 | },
540 | {
541 | "cell_type": "markdown",
542 | "metadata": {},
543 | "source": [
544 | "### Question 9\n",
545 | "\n",
546 | "For this recommender, your function should provide recommendations for the three default words provided above using the following distance metric:\n",
547 | "\n",
548 | "**[Jaccard distance](https://en.wikipedia.org/wiki/Jaccard_index) on the trigrams of the two words.**\n",
549 | "\n",
550 | "*This function should return a list of length three:\n",
551 |     "`['cormulent_recommendation', 'incendenece_recommendation', 'validrate_recommendation']`.*"
552 | ]
553 | },
554 | {
555 | "cell_type": "code",
556 | "execution_count": 15,
557 | "metadata": {},
558 | "outputs": [
559 | {
560 | "name": "stderr",
561 | "output_type": "stream",
562 | "text": [
563 | "/opt/conda/lib/python3.6/site-packages/ipykernel/__main__.py:15: DeprecationWarning: generator 'ngrams' raised StopIteration\n"
564 | ]
565 | },
566 | {
567 | "data": {
568 | "text/plain": [
569 | "['corpulent', 'indecence', 'validate']"
570 | ]
571 | },
572 | "execution_count": 15,
573 | "metadata": {},
574 | "output_type": "execute_result"
575 | }
576 | ],
577 | "source": [
578 |     "def answer_nine(entries=['cormulent', 'incendenece', 'validrate']):\n",
579 |     "    final = []\n",
580 |     "    for word in entries:\n",
581 |     "        # Candidates: dictionary words sharing the misspelling's first letter\n",
582 |     "        candidates = [w for w in correct_spellings if w[0].lower() == word[0]]\n",
583 |     "\n",
584 |     "        # Recommend the candidate with the smallest Jaccard distance on trigrams\n",
585 |     "        trigrams = set(nltk.ngrams(word, n=3))\n",
586 |     "        best = min(candidates,\n",
587 |     "                   key=lambda w: nltk.distance.jaccard_distance(trigrams,\n",
588 |     "                                                                set(nltk.ngrams(w, n=3))))\n",
589 |     "        final.append(best)\n",
590 |     "    return final\n",
591 |     "\n",
592 |     "answer_nine()"
600 | ]
601 | },
602 | {
603 | "cell_type": "markdown",
604 | "metadata": {},
605 | "source": [
606 | "### Question 10\n",
607 | "\n",
608 | "For this recommender, your function should provide recommendations for the three default words provided above using the following distance metric:\n",
609 | "\n",
610 | "**[Jaccard distance](https://en.wikipedia.org/wiki/Jaccard_index) on the 4-grams of the two words.**\n",
611 | "\n",
612 | "*This function should return a list of length three:\n",
613 |     "`['cormulent_recommendation', 'incendenece_recommendation', 'validrate_recommendation']`.*"
614 | ]
615 | },
616 | {
617 | "cell_type": "code",
618 | "execution_count": 16,
619 | "metadata": {},
620 | "outputs": [
621 | {
622 | "name": "stderr",
623 | "output_type": "stream",
624 | "text": [
625 | "/opt/conda/lib/python3.6/site-packages/ipykernel/__main__.py:15: DeprecationWarning: generator 'ngrams' raised StopIteration\n"
626 | ]
627 | },
628 | {
629 | "data": {
630 | "text/plain": [
631 | "['cormus', 'incendiary', 'valid']"
632 | ]
633 | },
634 | "execution_count": 16,
635 | "metadata": {},
636 | "output_type": "execute_result"
637 | }
638 | ],
639 | "source": [
640 |     "def answer_ten(entries=['cormulent', 'incendenece', 'validrate']):\n",
641 |     "    final = []\n",
642 |     "    for word in entries:\n",
643 |     "        # Candidates: dictionary words sharing the misspelling's first letter\n",
644 |     "        candidates = [w for w in correct_spellings if w[0].lower() == word[0]]\n",
645 |     "\n",
646 |     "        # Recommend the candidate with the smallest Jaccard distance on 4-grams\n",
647 |     "        fourgrams = set(nltk.ngrams(word, n=4))\n",
648 |     "        best = min(candidates,\n",
649 |     "                   key=lambda w: nltk.distance.jaccard_distance(fourgrams,\n",
650 |     "                                                                set(nltk.ngrams(w, n=4))))\n",
651 |     "        final.append(best)\n",
652 |     "    return final\n",
653 |     "\n",
654 |     "answer_ten()"
662 | ]
663 | },
664 | {
665 | "cell_type": "markdown",
666 | "metadata": {},
667 | "source": [
668 | "### Question 11\n",
669 | "\n",
670 | "For this recommender, your function should provide recommendations for the three default words provided above using the following distance metric:\n",
671 | "\n",
672 | "**[Edit distance on the two words with transpositions.](https://en.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_distance)**\n",
673 | "\n",
674 | "*This function should return a list of length three:\n",
675 |     "`['cormulent_recommendation', 'incendenece_recommendation', 'validrate_recommendation']`.*"
676 | ]
677 | },
678 | {
679 | "cell_type": "code",
680 | "execution_count": 17,
681 | "metadata": {},
682 | "outputs": [
683 | {
684 | "data": {
685 | "text/plain": [
686 | "['corpulent', 'intendence', 'validate']"
687 | ]
688 | },
689 | "execution_count": 17,
690 | "metadata": {},
691 | "output_type": "execute_result"
692 | }
693 | ],
694 | "source": [
695 |     "def answer_eleven(entries=['cormulent', 'incendenece', 'validrate']):\n",
696 |     "    final = []\n",
697 |     "    for word in entries:\n",
698 |     "        # Candidates: dictionary words sharing the misspelling's first letter\n",
699 |     "        candidates = [w for w in correct_spellings if w[0].lower() == word[0]]\n",
700 |     "\n",
701 |     "        # Recommend the candidate with the smallest edit distance\n",
702 |     "        best = min(candidates,\n",
703 |     "                   key=lambda w: nltk.distance.edit_distance(word, w))\n",
704 |     "        final.append(best)\n",
705 |     "    return final\n",
706 |     "\n",
707 |     "answer_eleven()"
716 | ]
717 | },
718 | {
719 | "cell_type": "code",
720 | "execution_count": null,
721 | "metadata": {
722 | "collapsed": true
723 | },
724 | "outputs": [],
725 | "source": []
726 | }
727 | ],
728 | "metadata": {
729 | "coursera": {
730 | "course_slug": "python-text-mining",
731 | "graded_item_id": "r35En",
732 | "launcher_item_id": "tCVfW",
733 | "part_id": "NTVgL"
734 | },
735 | "kernelspec": {
736 | "display_name": "Python 3",
737 | "language": "python",
738 | "name": "python3"
739 | },
740 | "language_info": {
741 | "codemirror_mode": {
742 | "name": "ipython",
743 | "version": 3
744 | },
745 | "file_extension": ".py",
746 | "mimetype": "text/x-python",
747 | "name": "python",
748 | "nbconvert_exporter": "python",
749 | "pygments_lexer": "ipython3",
750 | "version": "3.6.0"
751 | }
752 | },
753 | "nbformat": 4,
754 | "nbformat_minor": 2
755 | }
756 |
--------------------------------------------------------------------------------
/Week 3/Assignment+3.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "---\n",
8 | "\n",
9 | "_You are currently looking at **version 1.0** of this notebook. To download notebooks and datafiles, as well as get help on Jupyter notebooks in the Coursera platform, visit the [Jupyter Notebook FAQ](https://www.coursera.org/learn/python-text-mining/resources/d9pwm) course resource._\n",
10 | "\n",
11 | "---"
12 | ]
13 | },
14 | {
15 | "cell_type": "markdown",
16 | "metadata": {},
17 | "source": [
18 | "# Assignment 3\n",
19 | "\n",
20 | "In this assignment you will explore text message data and create models to predict if a message is spam or not. "
21 | ]
22 | },
23 | {
24 | "cell_type": "code",
25 | "execution_count": 1,
26 | "metadata": {},
27 | "outputs": [
28 | {
29 | "data": {
30 | "text/html": [
31 | "\n",
32 | "\n",
45 | "
\n",
46 | " \n",
47 | " \n",
48 | " | \n",
49 | " text | \n",
50 | " target | \n",
51 | "
\n",
52 | " \n",
53 | " \n",
54 | " \n",
55 | " | 0 | \n",
56 | " Go until jurong point, crazy.. Available only ... | \n",
57 | " 0 | \n",
58 | "
\n",
59 | " \n",
60 | " | 1 | \n",
61 | " Ok lar... Joking wif u oni... | \n",
62 | " 0 | \n",
63 | "
\n",
64 | " \n",
65 | " | 2 | \n",
66 | " Free entry in 2 a wkly comp to win FA Cup fina... | \n",
67 | " 1 | \n",
68 | "
\n",
69 | " \n",
70 | " | 3 | \n",
71 | " U dun say so early hor... U c already then say... | \n",
72 | " 0 | \n",
73 | "
\n",
74 | " \n",
75 | " | 4 | \n",
76 | " Nah I don't think he goes to usf, he lives aro... | \n",
77 | " 0 | \n",
78 | "
\n",
79 | " \n",
80 | " | 5 | \n",
81 | " FreeMsg Hey there darling it's been 3 week's n... | \n",
82 | " 1 | \n",
83 | "
\n",
84 | " \n",
85 | " | 6 | \n",
86 | " Even my brother is not like to speak with me. ... | \n",
87 | " 0 | \n",
88 | "
\n",
89 | " \n",
90 | " | 7 | \n",
91 | " As per your request 'Melle Melle (Oru Minnamin... | \n",
92 | " 0 | \n",
93 | "
\n",
94 | " \n",
95 | " | 8 | \n",
96 | " WINNER!! As a valued network customer you have... | \n",
97 | " 1 | \n",
98 | "
\n",
99 | " \n",
100 | " | 9 | \n",
101 | " Had your mobile 11 months or more? U R entitle... | \n",
102 | " 1 | \n",
103 | "
\n",
104 | " \n",
105 | "
\n",
106 | "
"
107 | ],
108 | "text/plain": [
109 | " text target\n",
110 | "0 Go until jurong point, crazy.. Available only ... 0\n",
111 | "1 Ok lar... Joking wif u oni... 0\n",
112 | "2 Free entry in 2 a wkly comp to win FA Cup fina... 1\n",
113 | "3 U dun say so early hor... U c already then say... 0\n",
114 | "4 Nah I don't think he goes to usf, he lives aro... 0\n",
115 | "5 FreeMsg Hey there darling it's been 3 week's n... 1\n",
116 | "6 Even my brother is not like to speak with me. ... 0\n",
117 | "7 As per your request 'Melle Melle (Oru Minnamin... 0\n",
118 | "8 WINNER!! As a valued network customer you have... 1\n",
119 | "9 Had your mobile 11 months or more? U R entitle... 1"
120 | ]
121 | },
122 | "execution_count": 1,
123 | "metadata": {},
124 | "output_type": "execute_result"
125 | }
126 | ],
127 | "source": [
128 | "import pandas as pd\n",
129 | "import numpy as np\n",
130 | "\n",
131 | "spam_data = pd.read_csv('spam.csv')\n",
132 | "\n",
133 | "spam_data['target'] = np.where(spam_data['target']=='spam',1,0)\n",
134 | "spam_data.head(10)"
135 | ]
136 | },
137 | {
138 | "cell_type": "code",
139 | "execution_count": 2,
140 | "metadata": {
141 | "collapsed": true
142 | },
143 | "outputs": [],
144 | "source": [
145 | "from sklearn.model_selection import train_test_split\n",
146 | "\n",
147 | "\n",
148 | "X_train, X_test, y_train, y_test = train_test_split(spam_data['text'], \n",
149 | " spam_data['target'], \n",
150 | " random_state=0)"
151 | ]
152 | },
153 | {
154 | "cell_type": "markdown",
155 | "metadata": {},
156 | "source": [
157 | "### Question 1\n",
158 | "What percentage of the documents in `spam_data` are spam?"
159 | ]
160 | },
161 | {
162 | "cell_type": "code",
163 | "execution_count": 3,
164 | "metadata": {
165 | "collapsed": true
166 | },
167 | "outputs": [],
168 | "source": [
169 | "def answer_one():\n",
170 | " return (len(spam_data[spam_data.target==1]) / len(spam_data)) * 100 #Your answer here"
171 | ]
172 | },
173 | {
174 | "cell_type": "code",
175 | "execution_count": 4,
176 | "metadata": {},
177 | "outputs": [
178 | {
179 | "data": {
180 | "text/plain": [
181 | "13.406317300789663"
182 | ]
183 | },
184 | "execution_count": 4,
185 | "metadata": {},
186 | "output_type": "execute_result"
187 | }
188 | ],
189 | "source": [
190 | "answer_one()"
191 | ]
192 | },
193 | {
194 | "cell_type": "markdown",
195 | "metadata": {},
196 | "source": [
197 | "### Question 2\n",
198 | "\n",
199 | "Fit the training data `X_train` using a Count Vectorizer with default parameters.\n",
200 | "\n",
201 | "What is the longest token in the vocabulary?\n",
202 | "\n",
203 | "*This function should return a string.*"
204 | ]
205 | },
206 | {
207 | "cell_type": "code",
208 | "execution_count": 5,
209 | "metadata": {
210 | "collapsed": true
211 | },
212 | "outputs": [],
213 | "source": [
214 | "from sklearn.feature_extraction.text import CountVectorizer\n",
216 | "def answer_two():\n",
217 | " vect = CountVectorizer().fit(X_train)\n",
218 | " feature_names = np.array(vect.get_feature_names())\n",
219 | " length = list(map(len,feature_names))\n",
220 |     "    return feature_names[np.argmax(length)]"
221 | ]
222 | },
223 | {
224 | "cell_type": "code",
225 | "execution_count": 6,
226 | "metadata": {},
227 | "outputs": [
228 | {
229 | "data": {
230 | "text/plain": [
231 | "'com1win150ppmx3age16subscription'"
232 | ]
233 | },
234 | "execution_count": 6,
235 | "metadata": {},
236 | "output_type": "execute_result"
237 | }
238 | ],
239 | "source": [
240 | "answer_two()"
241 | ]
242 | },
243 | {
244 | "cell_type": "markdown",
245 | "metadata": {},
246 | "source": [
247 | "### Question 3\n",
248 | "\n",
249 | "Fit and transform the training data `X_train` using a Count Vectorizer with default parameters.\n",
250 | "\n",
251 |     "Next, fit a multinomial Naive Bayes classifier model with smoothing `alpha=0.1` and find the area under the curve (AUC) score using the transformed test data.\n",
252 | "\n",
253 | "*This function should return the AUC score as a float.*"
254 | ]
255 | },
256 | {
257 | "cell_type": "code",
258 | "execution_count": 7,
259 | "metadata": {
260 | "collapsed": true
261 | },
262 | "outputs": [],
263 | "source": [
264 | "from sklearn.naive_bayes import MultinomialNB\n",
265 | "from sklearn.metrics import roc_auc_score\n",
266 | "\n",
267 | "def answer_three():\n",
268 | " vect = CountVectorizer().fit(X_train)\n",
269 | " X_train_vectorized = vect.transform(X_train)\n",
270 | " \n",
271 | " model = MultinomialNB(alpha=0.1)\n",
272 | " model.fit(X_train_vectorized, y_train)\n",
273 | "\n",
274 | " predictions = model.predict(vect.transform(X_test))\n",
275 | " \n",
276 | " return roc_auc_score(y_test, predictions) #Your answer here"
277 | ]
278 | },
279 | {
280 | "cell_type": "code",
281 | "execution_count": 8,
282 | "metadata": {},
283 | "outputs": [
284 | {
285 | "data": {
286 | "text/plain": [
287 | "0.97208121827411165"
288 | ]
289 | },
290 | "execution_count": 8,
291 | "metadata": {},
292 | "output_type": "execute_result"
293 | }
294 | ],
295 | "source": [
296 | "answer_three()"
297 | ]
298 | },
299 | {
300 | "cell_type": "markdown",
301 | "metadata": {},
302 | "source": [
303 | "### Question 4\n",
304 | "\n",
305 | "Fit and transform the training data `X_train` using a Tfidf Vectorizer with default parameters.\n",
306 | "\n",
307 | "What 20 features have the smallest tf-idf and what 20 have the largest tf-idf?\n",
308 | "\n",
309 |     "Put these features in two series, where each series is sorted by tf-idf value and then alphabetically by feature name. The index of each series should be the feature name, and the data should be the tf-idf value.\n",
310 |     "\n",
311 |     "The series of 20 features with the smallest tf-idfs should be sorted smallest tf-idf first; the series of 20 features with the largest tf-idfs should be sorted largest first.\n",
312 | "\n",
313 | "*This function should return a tuple of two series\n",
314 | "`(smallest tf-idfs series, largest tf-idfs series)`.*"
315 | ]
316 | },
317 | {
318 | "cell_type": "code",
319 | "execution_count": 9,
320 | "metadata": {
321 | "collapsed": true
322 | },
323 | "outputs": [],
324 | "source": [
325 | "from sklearn.feature_extraction.text import TfidfVectorizer\n",
326 | "\n",
327 | "def answer_four():\n",
328 | " vect = TfidfVectorizer().fit(X_train)\n",
329 | " \n",
330 | " X_train_vectorized = vect.transform(X_train)\n",
331 | " model = MultinomialNB(alpha=0.1)\n",
332 | " model.fit(X_train_vectorized, y_train)\n",
333 | " predictions = model.predict(vect.transform(X_test))\n",
334 | " \n",
335 | " feature_names = np.array(vect.get_feature_names())\n",
336 | "\n",
337 | " sorted_tfidf_index = X_train_vectorized.max(0).toarray()[0].argsort()\n",
338 | " \n",
339 | " small_index = feature_names[sorted_tfidf_index[:20]]\n",
340 | " small_value = np.sort(X_train_vectorized.max(0).toarray()[0])[:20]\n",
341 | " small_final_index = np.concatenate((np.sort(small_index[small_value==min(small_value)]) ,small_index[small_value!=min(small_value)]))\n",
342 | "\n",
343 | " large_index = feature_names[sorted_tfidf_index[:-21:-1]]\n",
344 | " large_value = np.sort(X_train_vectorized.max(0).toarray()[0])[:-21:-1]\n",
345 | " large_final_index = np.concatenate((np.sort(large_index[large_value==max(large_value)]) ,large_index[large_value!=max(large_value)]))\n",
346 | "\n",
347 | " small = pd.Series(small_value,index=small_final_index)\n",
348 | " large = pd.Series(large_value,index=large_final_index)\n",
349 | " return ((small,large))#Your answer here"
350 | ]
351 | },
352 | {
353 | "cell_type": "code",
354 | "execution_count": 10,
355 | "metadata": {},
356 | "outputs": [
357 | {
358 | "data": {
359 | "text/plain": [
360 | "(aaniye 0.074475\n",
361 | " athletic 0.074475\n",
362 | " chef 0.074475\n",
363 | " companion 0.074475\n",
364 | " courageous 0.074475\n",
365 | " dependable 0.074475\n",
366 | " determined 0.074475\n",
367 | " exterminator 0.074475\n",
368 | " healer 0.074475\n",
369 | " listener 0.074475\n",
370 | " organizer 0.074475\n",
371 | " pest 0.074475\n",
372 | " psychiatrist 0.074475\n",
373 | " psychologist 0.074475\n",
374 | " pudunga 0.074475\n",
375 | " stylist 0.074475\n",
376 | " sympathetic 0.074475\n",
377 | " venaam 0.074475\n",
378 | " diwali 0.091250\n",
379 | " mornings 0.091250\n",
380 | " dtype: float64, 146tf150p 1.000000\n",
381 | " 645 1.000000\n",
382 | " anything 1.000000\n",
383 | " anytime 1.000000\n",
384 | " beerage 1.000000\n",
385 | " done 1.000000\n",
386 | " er 1.000000\n",
387 | " havent 1.000000\n",
388 | " home 1.000000\n",
389 | " lei 1.000000\n",
390 | " nite 1.000000\n",
391 | " ok 1.000000\n",
392 | " okie 1.000000\n",
393 | " thank 1.000000\n",
394 | " thanx 1.000000\n",
395 | " too 1.000000\n",
396 | " where 1.000000\n",
397 | " yup 1.000000\n",
398 | " tick 0.980166\n",
399 | " blank 0.932702\n",
400 | " dtype: float64)"
401 | ]
402 | },
403 | "execution_count": 10,
404 | "metadata": {},
405 | "output_type": "execute_result"
406 | }
407 | ],
408 | "source": [
409 | "answer_four()"
410 | ]
411 | },
412 | {
413 | "cell_type": "markdown",
414 | "metadata": {},
415 | "source": [
416 | "### Question 5\n",
417 | "\n",
418 | "Fit and transform the training data `X_train` using a Tfidf Vectorizer ignoring terms that have a document frequency strictly lower than **3**.\n",
419 | "\n",
420 | "Then fit a multinomial Naive Bayes classifier model with smoothing `alpha=0.1` and compute the area under the curve (AUC) score using the transformed test data.\n",
421 | "\n",
422 | "*This function should return the AUC score as a float.*"
423 | ]
424 | },
425 | {
426 | "cell_type": "code",
427 | "execution_count": 11,
428 | "metadata": {
429 | "collapsed": true
430 | },
431 | "outputs": [],
432 | "source": [
433 | "def answer_five():\n",
434 | " vect = TfidfVectorizer(min_df=3).fit(X_train)\n",
435 | " X_train_vectorized = vect.transform(X_train)\n",
436 | " model = MultinomialNB(alpha=0.1)\n",
437 | " model.fit(X_train_vectorized, y_train)\n",
438 | " predictions = model.predict(vect.transform(X_test)) \n",
439 | " \n",
440 | " return roc_auc_score(y_test, predictions)#Your answer here"
441 | ]
442 | },
443 | {
444 | "cell_type": "code",
445 | "execution_count": 12,
446 | "metadata": {},
447 | "outputs": [
448 | {
449 | "data": {
450 | "text/plain": [
451 | "0.94162436548223349"
452 | ]
453 | },
454 | "execution_count": 12,
455 | "metadata": {},
456 | "output_type": "execute_result"
457 | }
458 | ],
459 | "source": [
460 | "answer_five()"
461 | ]
462 | },
463 | {
464 | "cell_type": "markdown",
465 | "metadata": {},
466 | "source": [
467 | "### Question 6\n",
468 | "\n",
469 | "What is the average length of documents (number of characters) for not spam and spam documents?\n",
470 | "\n",
471 | "*This function should return a tuple (average length not spam, average length spam).*"
472 | ]
473 | },
474 | {
475 | "cell_type": "code",
476 | "execution_count": 13,
477 | "metadata": {
478 | "collapsed": true
479 | },
480 | "outputs": [],
481 | "source": [
482 | "def answer_six():\n",
483 | " length_spam = list(map(len,spam_data['text'][spam_data.target==1]))\n",
484 | " length_not_spam = list(map(len,spam_data['text'][spam_data.target==0]))\n",
485 | " \n",
486 | " return ((np.mean(length_not_spam),np.mean(length_spam)))#Your answer here"
487 | ]
488 | },
489 | {
490 | "cell_type": "code",
491 | "execution_count": 14,
492 | "metadata": {},
493 | "outputs": [
494 | {
495 | "data": {
496 | "text/plain": [
497 | "(71.023626943005183, 138.8661311914324)"
498 | ]
499 | },
500 | "execution_count": 14,
501 | "metadata": {},
502 | "output_type": "execute_result"
503 | }
504 | ],
505 | "source": [
506 | "answer_six()"
507 | ]
508 | },
509 | {
510 | "cell_type": "markdown",
511 | "metadata": {},
512 | "source": [
513 |     "<br>\n",
514 |     "<br>\n",
515 | "The following function has been provided to help you combine new features into the training data:"
516 | ]
517 | },
518 | {
519 | "cell_type": "code",
520 | "execution_count": 15,
521 | "metadata": {
522 | "collapsed": true
523 | },
524 | "outputs": [],
525 | "source": [
526 | "def add_feature(X, feature_to_add):\n",
527 | " \"\"\"\n",
528 | " Returns sparse feature matrix with added feature.\n",
529 | " feature_to_add can also be a list of features.\n",
530 | " \"\"\"\n",
531 | " from scipy.sparse import csr_matrix, hstack\n",
532 | " return hstack([X, csr_matrix(feature_to_add).T], 'csr')"
533 | ]
534 | },
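A minimal sketch of what `add_feature` does under the hood: `scipy.sparse.hstack` appends each extra feature as a new column while keeping the result sparse (toy numbers, not the assignment data):

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack

# Toy 3x2 document-term matrix
X = csr_matrix(np.array([[1, 0], [0, 2], [3, 1]]))

# One scalar feature per document, e.g. document length
lengths = [5, 12, 7]

# csr_matrix(lengths) is a 1x3 row; .T turns it into a 3x1 column,
# and hstack(..., 'csr') appends it as a new rightmost column
X_aug = hstack([X, csr_matrix(lengths).T], 'csr')
print(X_aug.shape)             # (3, 3)
print(X_aug.toarray()[:, -1])  # [ 5 12  7]
```

Keeping the matrix in CSR format matters here: densifying a document-term matrix with thousands of columns would be far more memory-hungry than stacking sparse columns.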
535 | {
536 | "cell_type": "markdown",
537 | "metadata": {},
538 | "source": [
539 | "### Question 7\n",
540 | "\n",
541 | "Fit and transform the training data X_train using a Tfidf Vectorizer ignoring terms that have a document frequency strictly lower than **5**.\n",
542 | "\n",
543 | "Using this document-term matrix and an additional feature, **the length of document (number of characters)**, fit a Support Vector Classification model with regularization `C=10000`. Then compute the area under the curve (AUC) score using the transformed test data.\n",
544 | "\n",
545 | "*This function should return the AUC score as a float.*"
546 | ]
547 | },
548 | {
549 | "cell_type": "code",
550 | "execution_count": 17,
551 | "metadata": {
552 | "collapsed": true
553 | },
554 | "outputs": [],
555 | "source": [
556 | "from sklearn.svm import SVC\n",
557 | "\n",
558 |     "def answer_seven():\n",
559 |     "    length_X_train = list(map(len, X_train))\n",
560 |     "    length_X_test = list(map(len, X_test))\n",
561 |     "    \n",
562 |     "    vect = TfidfVectorizer(min_df=5).fit(X_train)\n",
563 |     "    # Append the document-length feature to both train and test matrices,\n",
564 |     "    # as the question requires\n",
565 |     "    X_train_vectorized = add_feature(vect.transform(X_train), length_X_train)\n",
566 |     "    X_test_vectorized = add_feature(vect.transform(X_test), length_X_test)\n",
567 |     "    model = SVC(C=10000)\n",
568 |     "    model.fit(X_train_vectorized, y_train)\n",
569 |     "    predictions = model.predict(X_test_vectorized)\n",
570 |     "    return roc_auc_score(y_test, predictions)"
570 | ]
571 | },
572 | {
573 | "cell_type": "code",
574 | "execution_count": 18,
575 | "metadata": {},
576 | "outputs": [
577 | {
578 | "data": {
579 | "text/plain": [
580 | "0.94971605860482489"
581 | ]
582 | },
583 | "execution_count": 18,
584 | "metadata": {},
585 | "output_type": "execute_result"
586 | }
587 | ],
588 | "source": [
589 | "answer_seven()"
590 | ]
591 | },
592 | {
593 | "cell_type": "markdown",
594 | "metadata": {},
595 | "source": [
596 | "### Question 8\n",
597 | "\n",
598 | "What is the average number of digits per document for not spam and spam documents?\n",
599 | "\n",
600 | "*This function should return a tuple (average # digits not spam, average # digits spam).*"
601 | ]
602 | },
603 | {
604 | "cell_type": "code",
605 | "execution_count": 19,
606 | "metadata": {
607 | "collapsed": true
608 | },
609 | "outputs": [],
610 | "source": [
611 | "def answer_eight():\n",
612 | " import re\n",
613 | " spam = [re.findall(\"[0-9]\",i) for i in spam_data['text'][spam_data.target==1]]\n",
614 | " non_spam = [re.findall(\"[0-9]\",i) for i in spam_data['text'][spam_data.target==0]]\n",
615 | "\n",
616 | " return ((np.mean(list(map(len,non_spam))),np.mean(list(map(len,spam)))))#Your answer here"
617 | ]
618 | },
619 | {
620 | "cell_type": "code",
621 | "execution_count": 20,
622 | "metadata": {},
623 | "outputs": [
624 | {
625 | "data": {
626 | "text/plain": [
627 | "(0.29927461139896372, 15.759036144578314)"
628 | ]
629 | },
630 | "execution_count": 20,
631 | "metadata": {},
632 | "output_type": "execute_result"
633 | }
634 | ],
635 | "source": [
636 | "answer_eight()"
637 | ]
638 | },
639 | {
640 | "cell_type": "markdown",
641 | "metadata": {},
642 | "source": [
643 | "### Question 9\n",
644 | "\n",
645 | "Fit and transform the training data `X_train` using a Tfidf Vectorizer ignoring terms that have a document frequency strictly lower than **5** and using **word n-grams from n=1 to n=3** (unigrams, bigrams, and trigrams).\n",
646 | "\n",
647 | "Using this document-term matrix and the following additional features:\n",
648 | "* the length of document (number of characters)\n",
649 | "* **number of digits per document**\n",
650 | "\n",
651 | "fit a Logistic Regression model with regularization `C=100`. Then compute the area under the curve (AUC) score using the transformed test data.\n",
652 | "\n",
653 | "*This function should return the AUC score as a float.*"
654 | ]
655 | },
656 | {
657 | "cell_type": "code",
658 | "execution_count": 21,
659 | "metadata": {
660 | "collapsed": true
661 | },
662 | "outputs": [],
663 | "source": [
664 | "from sklearn.linear_model import LogisticRegression\n",
665 | "\n",
666 | "def answer_nine():\n",
667 |     "    import re\n",
668 |     "    # The question asks for a Tfidf Vectorizer, not a Count Vectorizer\n",
669 |     "    vect = TfidfVectorizer(min_df=5, ngram_range=(1,3)).fit(X_train)\n",
670 |     "    # Append document length and digit count as extra features\n",
671 |     "    X_train_vectorized = add_feature(vect.transform(X_train),\n",
672 |     "                                     [list(map(len, X_train)),\n",
673 |     "                                      [len(re.findall('[0-9]', d)) for d in X_train]])\n",
674 |     "    X_test_vectorized = add_feature(vect.transform(X_test),\n",
675 |     "                                    [list(map(len, X_test)),\n",
676 |     "                                     [len(re.findall('[0-9]', d)) for d in X_test]])\n",
677 |     "    model = LogisticRegression(C=100)\n",
678 |     "    model.fit(X_train_vectorized, y_train)\n",
679 |     "    predictions = model.predict(X_test_vectorized)\n",
680 |     "    return roc_auc_score(y_test, predictions)"
676 | ]
677 | },
678 | {
679 | "cell_type": "code",
680 | "execution_count": 22,
681 | "metadata": {},
682 | "outputs": [
683 | {
684 | "data": {
685 | "text/plain": [
686 | "0.95180635960816939"
687 | ]
688 | },
689 | "execution_count": 22,
690 | "metadata": {},
691 | "output_type": "execute_result"
692 | }
693 | ],
694 | "source": [
695 | "answer_nine()"
696 | ]
697 | },
698 | {
699 | "cell_type": "markdown",
700 | "metadata": {},
701 | "source": [
702 | "### Question 10\n",
703 | "\n",
704 | "What is the average number of non-word characters (anything other than a letter, digit or underscore) per document for not spam and spam documents?\n",
705 | "\n",
706 | "*Hint: Use `\\w` and `\\W` character classes*\n",
707 | "\n",
708 | "*This function should return a tuple (average # non-word characters not spam, average # non-word characters spam).*"
709 | ]
710 | },
711 | {
712 | "cell_type": "code",
713 | "execution_count": 23,
714 | "metadata": {
715 | "collapsed": true
716 | },
717 | "outputs": [],
718 | "source": [
719 | "def answer_ten():\n",
720 | " import re\n",
721 |     "    spam = [re.findall(r\"\\W\", i) for i in spam_data['text'][spam_data.target==1]]\n",
722 |     "    non_spam = [re.findall(r\"\\W\", i) for i in spam_data['text'][spam_data.target==0]]\n",
723 | " \n",
724 | " return ((np.mean(list(map(len,non_spam))),np.mean(list(map(len,spam)))))#Your answer here"
725 | ]
726 | },
727 | {
728 | "cell_type": "code",
729 | "execution_count": 24,
730 | "metadata": {},
731 | "outputs": [
732 | {
733 | "data": {
734 | "text/plain": [
735 | "(17.291813471502589, 29.041499330655956)"
736 | ]
737 | },
738 | "execution_count": 24,
739 | "metadata": {},
740 | "output_type": "execute_result"
741 | }
742 | ],
743 | "source": [
744 | "answer_ten()"
745 | ]
746 | },
747 | {
748 | "cell_type": "markdown",
749 | "metadata": {},
750 | "source": [
751 | "### Question 11\n",
752 | "\n",
753 | "Fit and transform the training data X_train using a Count Vectorizer ignoring terms that have a document frequency strictly lower than **5** and using **character n-grams from n=2 to n=5.**\n",
754 | "\n",
755 | "To tell Count Vectorizer to use character n-grams pass in `analyzer='char_wb'` which creates character n-grams only from text inside word boundaries. This should make the model more robust to spelling mistakes.\n",
756 | "\n",
757 | "Using this document-term matrix and the following additional features:\n",
758 | "* the length of document (number of characters)\n",
759 | "* number of digits per document\n",
760 | "* **number of non-word characters (anything other than a letter, digit or underscore.)**\n",
761 | "\n",
762 | "fit a Logistic Regression model with regularization C=100. Then compute the area under the curve (AUC) score using the transformed test data.\n",
763 | "\n",
764 | "Also **find the 10 smallest and 10 largest coefficients from the model** and return them along with the AUC score in a tuple.\n",
765 | "\n",
766 | "The list of 10 smallest coefficients should be sorted smallest first, the list of 10 largest coefficients should be sorted largest first.\n",
767 | "\n",
768 | "The three features that were added to the document term matrix should have the following names should they appear in the list of coefficients:\n",
769 | "['length_of_doc', 'digit_count', 'non_word_char_count']\n",
770 | "\n",
771 | "*This function should return a tuple `(AUC score as a float, smallest coefs list, largest coefs list)`.*"
772 | ]
773 | },
774 | {
775 | "cell_type": "code",
776 | "execution_count": 25,
777 | "metadata": {
778 | "collapsed": true
779 | },
780 | "outputs": [],
781 | "source": [
782 |     "def answer_eleven():\n",
783 |     "    import re\n",
784 |     "    # char_wb: character n-grams only from text inside word boundaries\n",
785 |     "    vect = CountVectorizer(min_df=5, ngram_range=(2,5), analyzer='char_wb').fit(X_train)\n",
786 |     "\n",
787 |     "    # The three extra features named in the question, in order\n",
788 |     "    def extra_features(docs):\n",
789 |     "        return [list(map(len, docs)),\n",
790 |     "                [len(re.findall('[0-9]', d)) for d in docs],\n",
791 |     "                [len(re.findall(r'\\W', d)) for d in docs]]\n",
792 |     "\n",
793 |     "    X_train_vectorized = add_feature(vect.transform(X_train), extra_features(X_train))\n",
794 |     "    X_test_vectorized = add_feature(vect.transform(X_test), extra_features(X_test))\n",
795 |     "    model = LogisticRegression(C=100)\n",
796 |     "    model.fit(X_train_vectorized, y_train)\n",
797 |     "    predictions = model.predict(X_test_vectorized)\n",
798 |     "\n",
799 |     "    # Extend the vocabulary with the names of the three added features\n",
800 |     "    feature_names = np.array(vect.get_feature_names() +\n",
801 |     "                             ['length_of_doc', 'digit_count', 'non_word_char_count'])\n",
802 |     "    sorted_coef_index = model.coef_[0].argsort()\n",
803 |     "    small_coefficient = list(feature_names[sorted_coef_index[:10]])\n",
804 |     "    large_coefficient = list(feature_names[sorted_coef_index[:-11:-1]])\n",
805 |     "    return roc_auc_score(y_test, predictions), small_coefficient, large_coefficient"
796 | ]
797 | },
798 | {
799 | "cell_type": "code",
800 | "execution_count": 26,
801 | "metadata": {},
802 | "outputs": [
803 | {
804 | "data": {
805 | "text/plain": [
806 | "(0.89217654448839612,\n",
807 | " ['can send',\n",
808 | " 'going to',\n",
809 | " 'be in',\n",
810 | " 'the last',\n",
811 | " 'no more',\n",
812 | " 'lt gt',\n",
813 | " 'if you can',\n",
814 | " 'week and',\n",
815 | " 'went out',\n",
816 | " 'nice day'],\n",
817 | " ['co uk',\n",
818 | " 'cost 150ppm',\n",
819 | " 'chat to',\n",
820 | " 'sms ac',\n",
821 | " 'reply with',\n",
822 | " 'txt stop',\n",
823 | " 'to this',\n",
824 | " 'stop to',\n",
825 | " 'visit www',\n",
826 | " 'ur mobile'])"
827 | ]
828 | },
829 | "execution_count": 26,
830 | "metadata": {},
831 | "output_type": "execute_result"
832 | }
833 | ],
834 | "source": [
835 | "answer_eleven()"
836 | ]
837 | },
838 | {
839 | "cell_type": "code",
840 | "execution_count": null,
841 | "metadata": {
842 | "collapsed": true
843 | },
844 | "outputs": [],
845 | "source": []
846 | }
847 | ],
848 | "metadata": {
849 | "coursera": {
850 | "course_slug": "python-text-mining",
851 | "graded_item_id": "Pn19K",
852 | "launcher_item_id": "y1juS",
853 | "part_id": "ctlgo"
854 | },
855 | "kernelspec": {
856 | "display_name": "Python 3",
857 | "language": "python",
858 | "name": "python3"
859 | },
860 | "language_info": {
861 | "codemirror_mode": {
862 | "name": "ipython",
863 | "version": 3
864 | },
865 | "file_extension": ".py",
866 | "mimetype": "text/x-python",
867 | "name": "python",
868 | "nbconvert_exporter": "python",
869 | "pygments_lexer": "ipython3",
870 | "version": "3.6.0"
871 | }
872 | },
873 | "nbformat": 4,
874 | "nbformat_minor": 2
875 | }
876 |
--------------------------------------------------------------------------------