├── .gitignore
├── 1. Quick Python Refresher
    ├── 1. Lists.ipynb
    ├── 2. Dictionaries.ipynb
    ├── 3. Loops and Conditionals.ipynb
    └── 4. Functions.ipynb
├── 2. NLTK and the Basics
    ├── 1. Counting Text.ipynb
    ├── 2. Example - Words Per Sentence Trends.ipynb
    ├── 3. Frequency Distribution.ipynb
    ├── 4. Conditional Frequency Distribution.ipynb
    ├── 5. Example - Informative Words.ipynb
    ├── 6. Bigrams.ipynb
    └── 7. Regular Expressions.ipynb
├── 3. Tokenization, Tagging, Chunking
    ├── 1. Tokenization.ipynb
    ├── 2. Normalization.ipynb
    ├── 3. Part of Speech Tagging.ipynb
    ├── 4. Example - Multiple Parts of Speech.ipynb
    ├── 5. Example - Choices.ipynb
    ├── 6. Chunking.ipynb
    ├── 7. Example - Named Entity Recognition.ipynb
    └── example.txt
├── 4. Custom Sources
    ├── 1. Text File.ipynb
    ├── 2. HTML.ipynb
    ├── 3. URL.ipynb
    ├── 4. CSV (Spreadsheet).ipynb
    ├── 5. Exporting.ipynb
    ├── 6. NLTK Resources.ipynb
    ├── 7. Example - Remove Stopwords.ipynb
    ├── dec_independence.txt
    ├── export.txt
    └── reviews.csv
├── 5. Projects
    ├── 1. Sentiment Analysis.ipynb
    ├── 2. Gender Prediction.ipynb
    ├── 3. Term Frequency, Inverse Document Frequency.ipynb
    ├── reviews.csv
    ├── tfidf_1.txt
    ├── tfidf_10.txt
    ├── tfidf_2.txt
    ├── tfidf_3.txt
    ├── tfidf_4.txt
    ├── tfidf_5.txt
    ├── tfidf_6.txt
    ├── tfidf_7.txt
    ├── tfidf_8.txt
    ├── tfidf_9.txt
    ├── words_negative.csv
    └── words_positive.csv
└── README.md


/.gitignore:
--------------------------------------------------------------------------------
1 | .ipynb_checkpoints
2 | */.ipynb_checkpoints
3 | 


--------------------------------------------------------------------------------
/1. Quick Python Refresher/1. Lists.ipynb:
--------------------------------------------------------------------------------
  1 | {
  2 |  "cells": [
  3 |   {
  4 |    "cell_type": "markdown",
  5 |    "metadata": {},
  6 |    "source": [
  7 |     "# Python Refresher - Lists"
  8 |    ]
  9 |   },
 10 |   {
 11 |    "cell_type": "markdown",
 12 |    "metadata": {},
 13 |    "source": [
 14 |     "A Python list stores comma separated values.  In our cases these values will be strings, and numbers.\n"
 15 |    ]
 16 |   },
 17 |   {
 18 |    "cell_type": "code",
 19 |    "execution_count": 1,
 20 |    "metadata": {
 21 |     "collapsed": true
 22 |    },
 23 |    "outputs": [],
 24 |    "source": [
 25 |     "mylist = [\"a\",\"b\",\"c\"]"
 26 |    ]
 27 |   },
 28 |   {
 29 |    "cell_type": "code",
 30 |    "execution_count": 2,
 31 |    "metadata": {
 32 |     "collapsed": false
 33 |    },
 34 |    "outputs": [
 35 |     {
 36 |      "data": {
 37 |       "text/plain": [
 38 |        "['a', 'b', 'c']"
 39 |       ]
 40 |      },
 41 |      "execution_count": 2,
 42 |      "metadata": {},
 43 |      "output_type": "execute_result"
 44 |     }
 45 |    ],
 46 |    "source": [
 47 |     "mylist"
 48 |    ]
 49 |   },
 50 |   {
 51 |    "cell_type": "code",
 52 |    "execution_count": 3,
 53 |    "metadata": {
 54 |     "collapsed": false
 55 |    },
 56 |    "outputs": [
 57 |     {
 58 |      "data": {
 59 |       "text/plain": [
 60 |        "[1, 2, 3, 4, 5]"
 61 |       ]
 62 |      },
 63 |      "execution_count": 3,
 64 |      "metadata": {},
 65 |      "output_type": "execute_result"
 66 |     }
 67 |    ],
 68 |    "source": [
 69 |     "mylist2 = [1,2,3,4,5]\n",
 70 |     "mylist2"
 71 |    ]
 72 |   },
 73 |   {
 74 |    "cell_type": "markdown",
 75 |    "metadata": {},
 76 |    "source": [
 77 |     "Each item in the list has a position or index.  By using a list index you can get back individual list item.\n",
 78 |     "\n",
 79 |     "Remember that in programming, counting starts at 0, so to get the first item, we would call index 0."
 80 |    ]
 81 |   },
 82 |   {
 83 |    "cell_type": "code",
 84 |    "execution_count": 4,
 85 |    "metadata": {
 86 |     "collapsed": false
 87 |    },
 88 |    "outputs": [
 89 |     {
 90 |      "data": {
 91 |       "text/plain": [
 92 |        "'a'"
 93 |       ]
 94 |      },
 95 |      "execution_count": 4,
 96 |      "metadata": {},
 97 |      "output_type": "execute_result"
 98 |     }
 99 |    ],
100 |    "source": [
101 |     "mylist[0]"
102 |    ]
103 |   },
104 |   {
105 |    "cell_type": "code",
106 |    "execution_count": 5,
107 |    "metadata": {
108 |     "collapsed": false
109 |    },
110 |    "outputs": [
111 |     {
112 |      "data": {
113 |       "text/plain": [
114 |        "1"
115 |       ]
116 |      },
117 |      "execution_count": 5,
118 |      "metadata": {},
119 |      "output_type": "execute_result"
120 |     }
121 |    ],
122 |    "source": [
123 |     "mylist2[0]"
124 |    ]
125 |   },
126 |   {
127 |    "cell_type": "markdown",
128 |    "metadata": {},
129 |    "source": [
130 |     "We can also use a range of indexes to call  back a range from out list."
131 |    ]
132 |   },
133 |   {
134 |    "cell_type": "code",
135 |    "execution_count": 6,
136 |    "metadata": {
137 |     "collapsed": false
138 |    },
139 |    "outputs": [
140 |     {
141 |      "data": {
142 |       "text/plain": [
143 |        "[1, 2]"
144 |       ]
145 |      },
146 |      "execution_count": 6,
147 |      "metadata": {},
148 |      "output_type": "execute_result"
149 |     }
150 |    ],
151 |    "source": [
152 |     "mylist2[0:2]"
153 |    ]
154 |   },
155 |   {
156 |    "cell_type": "markdown",
157 |    "metadata": {},
158 |    "source": [
159 |     "The first number tells us where to start while the second tells us where to end and is exclusive.  If we don't enter the fist number, we will get back the first x items, where x is the second index number we provide."
160 |    ]
161 |   },
162 |   {
163 |    "cell_type": "code",
164 |    "execution_count": 7,
165 |    "metadata": {
166 |     "collapsed": false
167 |    },
168 |    "outputs": [
169 |     {
170 |      "data": {
171 |       "text/plain": [
172 |        "['a', 'b']"
173 |       ]
174 |      },
175 |      "execution_count": 7,
176 |      "metadata": {},
177 |      "output_type": "execute_result"
178 |     }
179 |    ],
180 |    "source": [
181 |     "mylist[:2]"
182 |    ]
183 |   },
184 |   {
185 |    "cell_type": "code",
186 |    "execution_count": 8,
187 |    "metadata": {
188 |     "collapsed": false
189 |    },
190 |    "outputs": [
191 |     {
192 |      "data": {
193 |       "text/plain": [
194 |        "[1, 2, 3]"
195 |       ]
196 |      },
197 |      "execution_count": 8,
198 |      "metadata": {},
199 |      "output_type": "execute_result"
200 |     }
201 |    ],
202 |    "source": [
203 |     "mylist2[:3]"
204 |    ]
205 |   },
206 |   {
207 |    "cell_type": "markdown",
208 |    "metadata": {},
209 |    "source": [
210 |     "We can also call the ends of a list by doing the following."
211 |    ]
212 |   },
213 |   {
214 |    "cell_type": "code",
215 |    "execution_count": 9,
216 |    "metadata": {
217 |     "collapsed": false
218 |    },
219 |    "outputs": [
220 |     {
221 |      "data": {
222 |       "text/plain": [
223 |        "['b', 'c']"
224 |       ]
225 |      },
226 |      "execution_count": 9,
227 |      "metadata": {},
228 |      "output_type": "execute_result"
229 |     }
230 |    ],
231 |    "source": [
232 |     "mylist[-2:]"
233 |    ]
234 |   }
235 |  ],
236 |  "metadata": {
237 |   "kernelspec": {
238 |    "display_name": "Python 2",
239 |    "language": "python",
240 |    "name": "python2"
241 |   },
242 |   "language_info": {
243 |    "codemirror_mode": {
244 |     "name": "ipython",
245 |     "version": 2
246 |    },
247 |    "file_extension": ".py",
248 |    "mimetype": "text/x-python",
249 |    "name": "python",
250 |    "nbconvert_exporter": "python",
251 |    "pygments_lexer": "ipython2",
252 |    "version": "2.7.10"
253 |   }
254 |  },
255 |  "nbformat": 4,
256 |  "nbformat_minor": 0
257 | }
258 | 


--------------------------------------------------------------------------------
/1. Quick Python Refresher/2. Dictionaries.ipynb:
--------------------------------------------------------------------------------
  1 | {
  2 |  "cells": [
  3 |   {
  4 |    "cell_type": "markdown",
  5 |    "metadata": {},
  6 |    "source": [
  7 |     "# Python Refresher - Dictionaries"
  8 |    ]
  9 |   },
 10 |   {
 11 |    "cell_type": "markdown",
 12 |    "metadata": {},
 13 |    "source": [
 14 |     "A Python dictionary stores key-value pairs."
 15 |    ]
 16 |   },
 17 |   {
 18 |    "cell_type": "code",
 19 |    "execution_count": 1,
 20 |    "metadata": {
 21 |     "collapsed": false
 22 |    },
 23 |    "outputs": [],
 24 |    "source": [
 25 |     "d = {\n",
 26 |     "    'Python': 'programming', \n",
 27 |     "    'English': \"natural\", \n",
 28 |     "    'French': 'natrual', \n",
 29 |     "    'Ruby' : 'programming', \n",
 30 |     "    'Javascript' : 'programming'\n",
 31 |     "}"
 32 |    ]
 33 |   },
 34 |   {
 35 |    "cell_type": "code",
 36 |    "execution_count": 2,
 37 |    "metadata": {
 38 |     "collapsed": false
 39 |    },
 40 |    "outputs": [
 41 |     {
 42 |      "data": {
 43 |       "text/plain": [
 44 |        "{'English': 'natural',\n",
 45 |        " 'French': 'natrual',\n",
 46 |        " 'Javascript': 'programming',\n",
 47 |        " 'Python': 'programming',\n",
 48 |        " 'Ruby': 'programming'}"
 49 |       ]
 50 |      },
 51 |      "execution_count": 2,
 52 |      "metadata": {},
 53 |      "output_type": "execute_result"
 54 |     }
 55 |    ],
 56 |    "source": [
 57 |     "d"
 58 |    ]
 59 |   },
 60 |   {
 61 |    "cell_type": "code",
 62 |    "execution_count": 3,
 63 |    "metadata": {
 64 |     "collapsed": false
 65 |    },
 66 |    "outputs": [
 67 |     {
 68 |      "data": {
 69 |       "text/plain": [
 70 |        "dict"
 71 |       ]
 72 |      },
 73 |      "execution_count": 3,
 74 |      "metadata": {},
 75 |      "output_type": "execute_result"
 76 |     }
 77 |    ],
 78 |    "source": [
 79 |     "type(d)"
 80 |    ]
 81 |   },
 82 |   {
 83 |    "cell_type": "code",
 84 |    "execution_count": 4,
 85 |    "metadata": {
 86 |     "collapsed": false
 87 |    },
 88 |    "outputs": [
 89 |     {
 90 |      "data": {
 91 |       "text/plain": [
 92 |        "'programming'"
 93 |       ]
 94 |      },
 95 |      "execution_count": 4,
 96 |      "metadata": {},
 97 |      "output_type": "execute_result"
 98 |     }
 99 |    ],
100 |    "source": [
101 |     "d[\"Python\"]"
102 |    ]
103 |   },
104 |   {
105 |    "cell_type": "code",
106 |    "execution_count": 5,
107 |    "metadata": {
108 |     "collapsed": false
109 |    },
110 |    "outputs": [
111 |     {
112 |      "data": {
113 |       "text/plain": [
114 |        "'natural'"
115 |       ]
116 |      },
117 |      "execution_count": 5,
118 |      "metadata": {},
119 |      "output_type": "execute_result"
120 |     }
121 |    ],
122 |    "source": [
123 |     "d[\"English\"]"
124 |    ]
125 |   },
126 |   {
127 |    "cell_type": "markdown",
128 |    "metadata": {},
129 |    "source": [
130 |     "We can add new entries to a dictionary."
131 |    ]
132 |   },
133 |   {
134 |    "cell_type": "code",
135 |    "execution_count": 6,
136 |    "metadata": {
137 |     "collapsed": true
138 |    },
139 |    "outputs": [],
140 |    "source": [
141 |     "d['Scala'] = 'programming'"
142 |    ]
143 |   },
144 |   {
145 |    "cell_type": "code",
146 |    "execution_count": 7,
147 |    "metadata": {
148 |     "collapsed": false
149 |    },
150 |    "outputs": [
151 |     {
152 |      "data": {
153 |       "text/plain": [
154 |        "{'English': 'natural',\n",
155 |        " 'French': 'natrual',\n",
156 |        " 'Javascript': 'programming',\n",
157 |        " 'Python': 'programming',\n",
158 |        " 'Ruby': 'programming',\n",
159 |        " 'Scala': 'programming'}"
160 |       ]
161 |      },
162 |      "execution_count": 7,
163 |      "metadata": {},
164 |      "output_type": "execute_result"
165 |     }
166 |    ],
167 |    "source": [
168 |     "d"
169 |    ]
170 |   },
171 |   {
172 |    "cell_type": "markdown",
173 |    "metadata": {},
174 |    "source": [
175 |     "Values can also be numbers."
176 |    ]
177 |   },
178 |   {
179 |    "cell_type": "code",
180 |    "execution_count": 8,
181 |    "metadata": {
182 |     "collapsed": true
183 |    },
184 |    "outputs": [],
185 |    "source": [
186 |     "d[\"languages known\"] = 3"
187 |    ]
188 |   },
189 |   {
190 |    "cell_type": "code",
191 |    "execution_count": 9,
192 |    "metadata": {
193 |     "collapsed": false
194 |    },
195 |    "outputs": [
196 |     {
197 |      "data": {
198 |       "text/plain": [
199 |        "{'English': 'natural',\n",
200 |        " 'French': 'natrual',\n",
201 |        " 'Javascript': 'programming',\n",
202 |        " 'Python': 'programming',\n",
203 |        " 'Ruby': 'programming',\n",
204 |        " 'Scala': 'programming',\n",
205 |        " 'languages known': 3}"
206 |       ]
207 |      },
208 |      "execution_count": 9,
209 |      "metadata": {},
210 |      "output_type": "execute_result"
211 |     }
212 |    ],
213 |    "source": [
214 |     "d"
215 |    ]
216 |   },
217 |   {
218 |    "cell_type": "code",
219 |    "execution_count": 10,
220 |    "metadata": {
221 |     "collapsed": false
222 |    },
223 |    "outputs": [
224 |     {
225 |      "data": {
226 |       "text/plain": [
227 |        "dict_keys(['Python', 'English', 'French', 'Ruby', 'Javascript', 'Scala', 'languages known'])"
228 |       ]
229 |      },
230 |      "execution_count": 10,
231 |      "metadata": {},
232 |      "output_type": "execute_result"
233 |     }
234 |    ],
235 |    "source": [
236 |     "d.keys()"
237 |    ]
238 |   },
239 |   {
240 |    "cell_type": "code",
241 |    "execution_count": 11,
242 |    "metadata": {
243 |     "collapsed": false
244 |    },
245 |    "outputs": [
246 |     {
247 |      "data": {
248 |       "text/plain": [
249 |        "dict_values(['programming', 'natural', 'natrual', 'programming', 'programming', 'programming', 3])"
250 |       ]
251 |      },
252 |      "execution_count": 11,
253 |      "metadata": {},
254 |      "output_type": "execute_result"
255 |     }
256 |    ],
257 |    "source": [
258 |     "d.values()"
259 |    ]
260 |   },
261 |   {
262 |    "cell_type": "code",
263 |    "execution_count": 12,
264 |    "metadata": {
265 |     "collapsed": false
266 |    },
267 |    "outputs": [
268 |     {
269 |      "data": {
270 |       "text/plain": [
271 |        "7"
272 |       ]
273 |      },
274 |      "execution_count": 12,
275 |      "metadata": {},
276 |      "output_type": "execute_result"
277 |     }
278 |    ],
279 |    "source": [
280 |     "len(d)"
281 |    ]
282 |   }
283 |  ],
284 |  "metadata": {
285 |   "kernelspec": {
286 |    "display_name": "Python 3",
287 |    "language": "python",
288 |    "name": "python3"
289 |   },
290 |   "language_info": {
291 |    "codemirror_mode": {
292 |     "name": "ipython",
293 |     "version": 3
294 |    },
295 |    "file_extension": ".py",
296 |    "mimetype": "text/x-python",
297 |    "name": "python",
298 |    "nbconvert_exporter": "python",
299 |    "pygments_lexer": "ipython3",
300 |    "version": "3.6.0"
301 |   }
302 |  },
303 |  "nbformat": 4,
304 |  "nbformat_minor": 2
305 | }
306 | 


--------------------------------------------------------------------------------
/1. Quick Python Refresher/3. Loops and Conditionals.ipynb:
--------------------------------------------------------------------------------
  1 | {
  2 |  "cells": [
  3 |   {
  4 |    "cell_type": "markdown",
  5 |    "metadata": {},
  6 |    "source": [
  7 |     "# Python Refresher - Loops and Conditionals"
  8 |    ]
  9 |   },
 10 |   {
 11 |    "cell_type": "code",
 12 |    "execution_count": 2,
 13 |    "metadata": {
 14 |     "collapsed": true
 15 |    },
 16 |    "outputs": [],
 17 |    "source": [
 18 |     "mylist = [\"Python\",\"Ruby\",\"Javascript\",\"Scala\"]"
 19 |    ]
 20 |   },
 21 |   {
 22 |    "cell_type": "code",
 23 |    "execution_count": 3,
 24 |    "metadata": {
 25 |     "collapsed": false
 26 |    },
 27 |    "outputs": [
 28 |     {
 29 |      "data": {
 30 |       "text/plain": [
 31 |        "['Python', 'Ruby', 'Javascript', 'Scala']"
 32 |       ]
 33 |      },
 34 |      "execution_count": 3,
 35 |      "metadata": {},
 36 |      "output_type": "execute_result"
 37 |     }
 38 |    ],
 39 |    "source": [
 40 |     "mylist"
 41 |    ]
 42 |   },
 43 |   {
 44 |    "cell_type": "markdown",
 45 |    "metadata": {},
 46 |    "source": [
 47 |     "With a for loop, we can iterate through this list."
 48 |    ]
 49 |   },
 50 |   {
 51 |    "cell_type": "code",
 52 |    "execution_count": 4,
 53 |    "metadata": {
 54 |     "collapsed": false
 55 |    },
 56 |    "outputs": [
 57 |     {
 58 |      "name": "stdout",
 59 |      "output_type": "stream",
 60 |      "text": [
 61 |       "Python\n",
 62 |       "Ruby\n",
 63 |       "Javascript\n",
 64 |       "Scala\n"
 65 |      ]
 66 |     }
 67 |    ],
 68 |    "source": [
 69 |     "for item in mylist:\n",
 70 |     "    print(item)"
 71 |    ]
 72 |   },
 73 |   {
 74 |    "cell_type": "markdown",
 75 |    "metadata": {},
 76 |    "source": [
 77 |     "We can also write a for loop this way, know as list comprehension when used with a list..."
 78 |    ]
 79 |   },
 80 |   {
 81 |    "cell_type": "code",
 82 |    "execution_count": 5,
 83 |    "metadata": {
 84 |     "collapsed": false
 85 |    },
 86 |    "outputs": [
 87 |     {
 88 |      "data": {
 89 |       "text/plain": [
 90 |        "['Python', 'Ruby', 'Javascript', 'Scala']"
 91 |       ]
 92 |      },
 93 |      "execution_count": 5,
 94 |      "metadata": {},
 95 |      "output_type": "execute_result"
 96 |     }
 97 |    ],
 98 |    "source": [
 99 |     "[item for item in mylist]"
100 |    ]
101 |   },
102 |   {
103 |    "cell_type": "markdown",
104 |    "metadata": {},
105 |    "source": [
106 |     "We can use for loops to carry out actions."
107 |    ]
108 |   },
109 |   {
110 |    "cell_type": "code",
111 |    "execution_count": 6,
112 |    "metadata": {
113 |     "collapsed": false
114 |    },
115 |    "outputs": [
116 |     {
117 |      "name": "stdout",
118 |      "output_type": "stream",
119 |      "text": [
120 |       "Python is a fun programming language.\n",
121 |       "Ruby is a fun programming language.\n",
122 |       "Javascript is a fun programming language.\n",
123 |       "Scala is a fun programming language.\n"
124 |      ]
125 |     }
126 |    ],
127 |    "source": [
128 |     "for item in mylist:\n",
129 |     "    print(item + \" is a fun programming language.\")"
130 |    ]
131 |   },
132 |   {
133 |    "cell_type": "code",
134 |    "execution_count": 7,
135 |    "metadata": {
136 |     "collapsed": false
137 |    },
138 |    "outputs": [],
139 |    "source": [
140 |     "newlist = [item + \" is a fun programming language.\" for item in mylist]"
141 |    ]
142 |   },
143 |   {
144 |    "cell_type": "code",
145 |    "execution_count": 8,
146 |    "metadata": {
147 |     "collapsed": false
148 |    },
149 |    "outputs": [
150 |     {
151 |      "data": {
152 |       "text/plain": [
153 |        "['Python is a fun programming language.',\n",
154 |        " 'Ruby is a fun programming language.',\n",
155 |        " 'Javascript is a fun programming language.',\n",
156 |        " 'Scala is a fun programming language.']"
157 |       ]
158 |      },
159 |      "execution_count": 8,
160 |      "metadata": {},
161 |      "output_type": "execute_result"
162 |     }
163 |    ],
164 |    "source": [
165 |     "newlist"
166 |    ]
167 |   },
168 |   {
169 |    "cell_type": "markdown",
170 |    "metadata": {},
171 |    "source": [
172 |     "We can use if statements to look for special conditions."
173 |    ]
174 |   },
175 |   {
176 |    "cell_type": "code",
177 |    "execution_count": 9,
178 |    "metadata": {
179 |     "collapsed": true
180 |    },
181 |    "outputs": [],
182 |    "source": [
183 |     "x = 10"
184 |    ]
185 |   },
186 |   {
187 |    "cell_type": "code",
188 |    "execution_count": 10,
189 |    "metadata": {
190 |     "collapsed": false
191 |    },
192 |    "outputs": [
193 |     {
194 |      "name": "stdout",
195 |      "output_type": "stream",
196 |      "text": [
197 |       "It looks like x is greater than 5\n"
198 |      ]
199 |     }
200 |    ],
201 |    "source": [
202 |     "if x > 5:\n",
203 |     "    print(\"It looks like x is greater than 5\")"
204 |    ]
205 |   },
206 |   {
207 |    "cell_type": "code",
208 |    "execution_count": 12,
209 |    "metadata": {
210 |     "collapsed": false
211 |    },
212 |    "outputs": [
213 |     {
214 |      "name": "stdout",
215 |      "output_type": "stream",
216 |      "text": [
217 |       "Hello\n"
218 |      ]
219 |     }
220 |    ],
221 |    "source": [
222 |     "if x > 5 and x < 20:\n",
223 |     "    print(\"Hello\")\n"
224 |    ]
225 |   },
226 |   {
227 |    "cell_type": "code",
228 |    "execution_count": 13,
229 |    "metadata": {
230 |     "collapsed": true
231 |    },
232 |    "outputs": [],
233 |    "source": [
234 |     "number_list = [1,2,3,4,5,6,7,8,9,10]"
235 |    ]
236 |   },
237 |   {
238 |    "cell_type": "code",
239 |    "execution_count": 14,
240 |    "metadata": {
241 |     "collapsed": false
242 |    },
243 |    "outputs": [
244 |     {
245 |      "name": "stdout",
246 |      "output_type": "stream",
247 |      "text": [
248 |       "1 is odd\n",
249 |       "2 is even\n",
250 |       "3 is odd\n",
251 |       "4 is even\n",
252 |       "5 is odd\n",
253 |       "6 is even\n",
254 |       "7 is the best number!\n",
255 |       "8 is even\n",
256 |       "9 is odd\n",
257 |       "10 is even\n"
258 |      ]
259 |     }
260 |    ],
261 |    "source": [
262 |     "for number in number_list:\n",
263 |     "    if number%2 == 0:\n",
264 |     "        print(str(number) + \" is even\")\n",
265 |     "    elif number == 7:\n",
266 |     "        print(str(number) + \" is the best number!\")\n",
267 |     "    else:\n",
268 |     "        print(str(number) + \" is odd\")"
269 |    ]
270 |   }
271 |  ],
272 |  "metadata": {
273 |   "kernelspec": {
274 |    "display_name": "Python 3",
275 |    "language": "python",
276 |    "name": "python3"
277 |   },
278 |   "language_info": {
279 |    "codemirror_mode": {
280 |     "name": "ipython",
281 |     "version": 3
282 |    },
283 |    "file_extension": ".py",
284 |    "mimetype": "text/x-python",
285 |    "name": "python",
286 |    "nbconvert_exporter": "python",
287 |    "pygments_lexer": "ipython3",
288 |    "version": "3.6.0"
289 |   }
290 |  },
291 |  "nbformat": 4,
292 |  "nbformat_minor": 2
293 | }
294 | 


--------------------------------------------------------------------------------
/1. Quick Python Refresher/4. Functions.ipynb:
--------------------------------------------------------------------------------
  1 | {
  2 |  "cells": [
  3 |   {
  4 |    "cell_type": "markdown",
  5 |    "metadata": {},
  6 |    "source": [
  7 |     "Is some of our projects we will be using fuctions to repetedly call parts of our code.\n",
  8 |     "\n",
  9 |     "We can define a function by giving it a name, and telling the function any values we plan on passing along."
 10 |    ]
 11 |   },
 12 |   {
 13 |    "cell_type": "code",
 14 |    "execution_count": 1,
 15 |    "metadata": {
 16 |     "collapsed": true
 17 |    },
 18 |    "outputs": [],
 19 |    "source": [
 20 |     "def my_function(something):\n",
 21 |     "    return something"
 22 |    ]
 23 |   },
 24 |   {
 25 |    "cell_type": "code",
 26 |    "execution_count": 2,
 27 |    "metadata": {
 28 |     "collapsed": false
 29 |    },
 30 |    "outputs": [
 31 |     {
 32 |      "data": {
 33 |       "text/plain": [
 34 |        "'hello'"
 35 |       ]
 36 |      },
 37 |      "execution_count": 2,
 38 |      "metadata": {},
 39 |      "output_type": "execute_result"
 40 |     }
 41 |    ],
 42 |    "source": [
 43 |     "my_function(\"hello\")"
 44 |    ]
 45 |   },
 46 |   {
 47 |    "cell_type": "code",
 48 |    "execution_count": 3,
 49 |    "metadata": {
 50 |     "collapsed": false
 51 |    },
 52 |    "outputs": [
 53 |     {
 54 |      "data": {
 55 |       "text/plain": [
 56 |        "2"
 57 |       ]
 58 |      },
 59 |      "execution_count": 3,
 60 |      "metadata": {},
 61 |      "output_type": "execute_result"
 62 |     }
 63 |    ],
 64 |    "source": [
 65 |     "my_function(2)"
 66 |    ]
 67 |   },
 68 |   {
 69 |    "cell_type": "code",
 70 |    "execution_count": 4,
 71 |    "metadata": {
 72 |     "collapsed": false
 73 |    },
 74 |    "outputs": [],
 75 |    "source": [
 76 |     "def information(word):\n",
 77 |     "    return \"Word: \" + str(word) + \", Length: \" + str(len(word))"
 78 |    ]
 79 |   },
 80 |   {
 81 |    "cell_type": "code",
 82 |    "execution_count": 5,
 83 |    "metadata": {
 84 |     "collapsed": false
 85 |    },
 86 |    "outputs": [
 87 |     {
 88 |      "data": {
 89 |       "text/plain": [
 90 |        "'Word: hello, Length: 5'"
 91 |       ]
 92 |      },
 93 |      "execution_count": 5,
 94 |      "metadata": {},
 95 |      "output_type": "execute_result"
 96 |     }
 97 |    ],
 98 |    "source": [
 99 |     "information(\"hello\")"
100 |    ]
101 |   },
102 |   {
103 |    "cell_type": "code",
104 |    "execution_count": 6,
105 |    "metadata": {
106 |     "collapsed": false
107 |    },
108 |    "outputs": [
109 |     {
110 |      "data": {
111 |       "text/plain": [
112 |        "'Word: language, Length: 8'"
113 |       ]
114 |      },
115 |      "execution_count": 6,
116 |      "metadata": {},
117 |      "output_type": "execute_result"
118 |     }
119 |    ],
120 |    "source": [
121 |     "information(\"language\")"
122 |    ]
123 |   }
124 |  ],
125 |  "metadata": {
126 |   "kernelspec": {
127 |    "display_name": "Python 3",
128 |    "language": "python",
129 |    "name": "python3"
130 |   },
131 |   "language_info": {
132 |    "codemirror_mode": {
133 |     "name": "ipython",
134 |     "version": 3
135 |    },
136 |    "file_extension": ".py",
137 |    "mimetype": "text/x-python",
138 |    "name": "python",
139 |    "nbconvert_exporter": "python",
140 |    "pygments_lexer": "ipython3",
141 |    "version": "3.6.0"
142 |   }
143 |  },
144 |  "nbformat": 4,
145 |  "nbformat_minor": 2
146 | }
147 | 


--------------------------------------------------------------------------------
/2. NLTK and the Basics/1. Counting Text.ipynb:
--------------------------------------------------------------------------------
  1 | {
  2 |  "cells": [
  3 |   {
  4 |    "cell_type": "markdown",
  5 |    "metadata": {},
  6 |    "source": [
  7 |     "# NLTK and the Basics - Counting Text"
  8 |    ]
  9 |   },
 10 |   {
 11 |    "cell_type": "code",
 12 |    "execution_count": 1,
 13 |    "metadata": {
 14 |     "collapsed": true
 15 |    },
 16 |    "outputs": [],
 17 |    "source": [
 18 |     "import nltk"
 19 |    ]
 20 |   },
 21 |   {
 22 |    "cell_type": "markdown",
 23 |    "metadata": {},
 24 |    "source": [
 25 |     "NLTK comes with pre-packed text data. \n",
 26 |     "\n",
 27 |     "Project Gutenberg is a group that digitizes books and literature that are mostly in the pubic domain.  These works make great examples for practicing NLP.  If you interested in Project Gutenberg, I recommend checking out their site. http://www.gutenberg.org/wiki/Main_Page"
 28 |    ]
 29 |   },
 30 |   {
 31 |    "cell_type": "code",
 32 |    "execution_count": 2,
 33 |    "metadata": {
 34 |     "collapsed": false
 35 |    },
 36 |    "outputs": [
 37 |     {
 38 |      "data": {
 39 |       "text/plain": [
 40 |        "['austen-emma.txt',\n",
 41 |        " 'austen-persuasion.txt',\n",
 42 |        " 'austen-sense.txt',\n",
 43 |        " 'bible-kjv.txt',\n",
 44 |        " 'blake-poems.txt',\n",
 45 |        " 'bryant-stories.txt',\n",
 46 |        " 'burgess-busterbrown.txt',\n",
 47 |        " 'carroll-alice.txt',\n",
 48 |        " 'chesterton-ball.txt',\n",
 49 |        " 'chesterton-brown.txt',\n",
 50 |        " 'chesterton-thursday.txt',\n",
 51 |        " 'edgeworth-parents.txt',\n",
 52 |        " 'melville-moby_dick.txt',\n",
 53 |        " 'milton-paradise.txt',\n",
 54 |        " 'shakespeare-caesar.txt',\n",
 55 |        " 'shakespeare-hamlet.txt',\n",
 56 |        " 'shakespeare-macbeth.txt',\n",
 57 |        " 'whitman-leaves.txt']"
 58 |       ]
 59 |      },
 60 |      "execution_count": 2,
 61 |      "metadata": {},
 62 |      "output_type": "execute_result"
 63 |     }
 64 |    ],
 65 |    "source": [
 66 |     "nltk.corpus.gutenberg.fileids()"
 67 |    ]
 68 |   },
 69 |   {
 70 |    "cell_type": "markdown",
 71 |    "metadata": {},
 72 |    "source": [
 73 |     "You will notice that every file name has the letter \"u\" before it. The 'u' is part of the external representation of the file name, meaning it's a Unicode string as opposed to a byte string. It is not part of the string."
 74 |    ]
 75 |   },
 76 |   {
 77 |    "cell_type": "code",
 78 |    "execution_count": 3,
 79 |    "metadata": {
 80 |     "collapsed": false
 81 |    },
 82 |    "outputs": [],
 83 |    "source": [
 84 |     "md = nltk.corpus.gutenberg.words(\"melville-moby_dick.txt\")"
 85 |    ]
 86 |   },
 87 |   {
 88 |    "cell_type": "code",
 89 |    "execution_count": 4,
 90 |    "metadata": {
 91 |     "collapsed": false
 92 |    },
 93 |    "outputs": [
 94 |     {
 95 |      "data": {
 96 |       "text/plain": [
 97 |        "['[', 'Moby', 'Dick', 'by', 'Herman', 'Melville', '1851', ']']"
 98 |       ]
 99 |      },
100 |      "execution_count": 4,
101 |      "metadata": {},
102 |      "output_type": "execute_result"
103 |     }
104 |    ],
105 |    "source": [
106 |     "md[:8]"
107 |    ]
108 |   },
109 |   {
110 |    "cell_type": "markdown",
111 |    "metadata": {},
112 |    "source": [
113 |     "We can count how many times a word appears in the book."
114 |    ]
115 |   },
116 |   {
117 |    "cell_type": "code",
118 |    "execution_count": 5,
119 |    "metadata": {
120 |     "collapsed": false
121 |    },
122 |    "outputs": [
123 |     {
124 |      "data": {
125 |       "text/plain": [
126 |        "906"
127 |       ]
128 |      },
129 |      "execution_count": 5,
130 |      "metadata": {},
131 |      "output_type": "execute_result"
132 |     }
133 |    ],
134 |    "source": [
135 |     "md.count(\"whale\")"
136 |    ]
137 |   },
138 |   {
139 |    "cell_type": "code",
140 |    "execution_count": 6,
141 |    "metadata": {
142 |     "collapsed": false
143 |    },
144 |    "outputs": [
145 |     {
146 |      "data": {
147 |       "text/plain": [
148 |        "330"
149 |       ]
150 |      },
151 |      "execution_count": 6,
152 |      "metadata": {},
153 |      "output_type": "execute_result"
154 |     }
155 |    ],
156 |    "source": [
157 |     "md.count(\"boat\")"
158 |    ]
159 |   },
160 |   {
161 |    "cell_type": "code",
162 |    "execution_count": 7,
163 |    "metadata": {
164 |     "collapsed": false
165 |    },
166 |    "outputs": [
167 |     {
168 |      "data": {
169 |       "text/plain": [
170 |        "501"
171 |       ]
172 |      },
173 |      "execution_count": 7,
174 |      "metadata": {},
175 |      "output_type": "execute_result"
176 |     }
177 |    ],
178 |    "source": [
179 |     "md.count(\"Ahab\")"
180 |    ]
181 |   },
182 |   {
183 |    "cell_type": "code",
184 |    "execution_count": 8,
185 |    "metadata": {
186 |     "collapsed": false
187 |    },
188 |    "outputs": [
189 |     {
190 |      "data": {
191 |       "text/plain": [
192 |        "0"
193 |       ]
194 |      },
195 |      "execution_count": 8,
196 |      "metadata": {},
197 |      "output_type": "execute_result"
198 |     }
199 |    ],
200 |    "source": [
201 |     "md.count(\"laptop\")"
202 |    ]
203 |   },
204 |   {
205 |    "cell_type": "markdown",
206 |    "metadata": {},
207 |    "source": [
208 |     "We can get an idea of how long the book is by seeing how many items are in our list."
209 |    ]
210 |   },
211 |   {
212 |    "cell_type": "code",
213 |    "execution_count": 9,
214 |    "metadata": {
215 |     "collapsed": false
216 |    },
217 |    "outputs": [
218 |     {
219 |      "data": {
220 |       "text/plain": [
221 |        "260819"
222 |       ]
223 |      },
224 |      "execution_count": 9,
225 |      "metadata": {},
226 |      "output_type": "execute_result"
227 |     }
228 |    ],
229 |    "source": [
230 |     "len(md)"
231 |    ]
232 |   },
233 |   {
234 |    "cell_type": "markdown",
235 |    "metadata": {},
236 |    "source": [
237 |     "We can see how many unique words are used in the book."
238 |    ]
239 |   },
240 |   {
241 |    "cell_type": "code",
242 |    "execution_count": 10,
243 |    "metadata": {
244 |     "collapsed": true
245 |    },
246 |    "outputs": [],
247 |    "source": [
248 |     "md_set = set(md)"
249 |    ]
250 |   },
251 |   {
252 |    "cell_type": "code",
253 |    "execution_count": 11,
254 |    "metadata": {
255 |     "collapsed": false
256 |    },
257 |    "outputs": [
258 |     {
259 |      "data": {
260 |       "text/plain": [
261 |        "19317"
262 |       ]
263 |      },
264 |      "execution_count": 11,
265 |      "metadata": {},
266 |      "output_type": "execute_result"
267 |     }
268 |    ],
269 |    "source": [
270 |     "len(md_set)"
271 |    ]
272 |   },
273 |   {
274 |    "cell_type": "markdown",
275 |    "metadata": {},
276 |    "source": [
277 |     "We can calculate the average number of times any given word is used in the book."
278 |    ]
279 |   },
280 |   {
281 |    "cell_type": "code",
282 |    "execution_count": 12,
283 |    "metadata": {
284 |     "collapsed": true
285 |    },
286 |    "outputs": [],
287 |    "source": [
288 |     "from __future__ import division #we import this since we are using Python 2.7"
289 |    ]
290 |   },
291 |   {
292 |    "cell_type": "code",
293 |    "execution_count": 13,
294 |    "metadata": {
295 |     "collapsed": false
296 |    },
297 |    "outputs": [
298 |     {
299 |      "data": {
300 |       "text/plain": [
301 |        "13.502044830977896"
302 |       ]
303 |      },
304 |      "execution_count": 13,
305 |      "metadata": {},
306 |      "output_type": "execute_result"
307 |     }
308 |    ],
309 |    "source": [
310 |     "len(md)/len(md_set)"
311 |    ]
312 |   },
313 |   {
314 |    "cell_type": "markdown",
315 |    "metadata": {},
316 |    "source": [
317 |     "We can look at the book as a lists of sentences."
318 |    ]
319 |   },
320 |   {
321 |    "cell_type": "code",
322 |    "execution_count": 14,
323 |    "metadata": {
324 |     "collapsed": false
325 |    },
326 |    "outputs": [],
327 |    "source": [
328 |     "md_sents = nltk.corpus.gutenberg.sents(\"melville-moby_dick.txt\")"
329 |    ]
330 |   },
331 |   {
332 |    "cell_type": "markdown",
333 |    "metadata": {},
334 |    "source": [
335 |     "We can calculate the average number of words per sentence in the book."
336 |    ]
337 |   },
338 |   {
339 |    "cell_type": "code",
340 |    "execution_count": 15,
341 |    "metadata": {
342 |     "collapsed": false
343 |    },
344 |    "outputs": [
345 |     {
346 |      "data": {
347 |       "text/plain": [
348 |        "25.928919375683467"
349 |       ]
350 |      },
351 |      "execution_count": 15,
352 |      "metadata": {},
353 |      "output_type": "execute_result"
354 |     }
355 |    ],
356 |    "source": [
357 |     "len(md)/len(md_sents)"
358 |    ]
359 |   }
360 |  ],
361 |  "metadata": {
362 |   "kernelspec": {
363 |    "display_name": "Python 3",
364 |    "language": "python",
365 |    "name": "python3"
366 |   },
367 |   "language_info": {
368 |    "codemirror_mode": {
369 |     "name": "ipython",
370 |     "version": 3
371 |    },
372 |    "file_extension": ".py",
373 |    "mimetype": "text/x-python",
374 |    "name": "python",
375 |    "nbconvert_exporter": "python",
376 |    "pygments_lexer": "ipython3",
377 |    "version": "3.6.0"
378 |   }
379 |  },
380 |  "nbformat": 4,
381 |  "nbformat_minor": 2
382 | }
383 | 


--------------------------------------------------------------------------------
/2. NLTK and the Basics/4. Conditional Frequency Distribution.ipynb:
--------------------------------------------------------------------------------
  1 | {
  2 |  "cells": [
  3 |   {
  4 |    "cell_type": "markdown",
  5 |    "metadata": {},
  6 |    "source": [
  7 |     "# NLTK and the Basics - Conditional Frequency Distribution"
  8 |    ]
  9 |   },
 10 |   {
 11 |    "cell_type": "code",
 12 |    "execution_count": 1,
 13 |    "metadata": {
 14 |     "collapsed": true
 15 |    },
 16 |    "outputs": [],
 17 |    "source": [
 18 |     "import nltk"
 19 |    ]
 20 |   },
 21 |   {
 22 |    "cell_type": "markdown",
 23 |    "metadata": {},
 24 |    "source": [
 25 |     "A conditional frequency distribution counts multiple cases or conditions. "
 26 |    ]
 27 |   },
 28 |   {
 29 |    "cell_type": "code",
 30 |    "execution_count": 2,
 31 |    "metadata": {
 32 |     "collapsed": false
 33 |    },
 34 |    "outputs": [],
 35 |    "source": [
 36 |     "names = [(\"Group A\", \"Paul\"),(\"Group A\", \"Mike\"),(\"Group A\", \"Katy\"),(\"Group B\", \"Amy\"),(\"Group B\", \"Joe\"),(\"Group B\", \"Amy\")]"
 37 |    ]
 38 |   },
 39 |   {
 40 |    "cell_type": "code",
 41 |    "execution_count": 3,
 42 |    "metadata": {
 43 |     "collapsed": false
 44 |    },
 45 |    "outputs": [
 46 |     {
 47 |      "data": {
 48 |       "text/plain": [
 49 |        "[('Group A', 'Paul'),\n",
 50 |        " ('Group A', 'Mike'),\n",
 51 |        " ('Group A', 'Katy'),\n",
 52 |        " ('Group B', 'Amy'),\n",
 53 |        " ('Group B', 'Joe'),\n",
 54 |        " ('Group B', 'Amy')]"
 55 |       ]
 56 |      },
 57 |      "execution_count": 3,
 58 |      "metadata": {},
 59 |      "output_type": "execute_result"
 60 |     }
 61 |    ],
 62 |    "source": [
 63 |     "names"
 64 |    ]
 65 |   },
 66 |   {
 67 |    "cell_type": "markdown",
 68 |    "metadata": {},
 69 |    "source": [
 70 |     "Running a regular frequency distribution on the list..."
 71 |    ]
 72 |   },
 73 |   {
 74 |    "cell_type": "code",
 75 |    "execution_count": 4,
 76 |    "metadata": {},
 77 |    "outputs": [
 78 |     {
 79 |      "data": {
 80 |       "text/plain": [
 81 |        "FreqDist({('Group A', 'Katy'): 1,\n",
 82 |        "          ('Group A', 'Mike'): 1,\n",
 83 |        "          ('Group A', 'Paul'): 1,\n",
 84 |        "          ('Group B', 'Amy'): 2,\n",
 85 |        "          ('Group B', 'Joe'): 1})"
 86 |       ]
 87 |      },
 88 |      "execution_count": 4,
 89 |      "metadata": {},
 90 |      "output_type": "execute_result"
 91 |     }
 92 |    ],
 93 |    "source": [
 94 |     "nltk.FreqDist(names)"
 95 |    ]
 96 |   },
 97 |   {
 98 |    "cell_type": "markdown",
 99 |    "metadata": {},
100 |    "source": [
101 |     "Running a conditional frequency distribution on the list..."
102 |    ]
103 |   },
104 |   {
105 |    "cell_type": "code",
106 |    "execution_count": 5,
107 |    "metadata": {
108 |     "collapsed": false
109 |    },
110 |    "outputs": [
111 |     {
112 |      "data": {
113 |       "text/plain": [
114 |        "ConditionalFreqDist(nltk.probability.FreqDist,\n",
115 |        "                    {'Group A': FreqDist({'Katy': 1, 'Mike': 1, 'Paul': 1}),\n",
116 |        "                     'Group B': FreqDist({'Amy': 2, 'Joe': 1})})"
117 |       ]
118 |      },
119 |      "execution_count": 5,
120 |      "metadata": {},
121 |      "output_type": "execute_result"
122 |     }
123 |    ],
124 |    "source": [
125 |     "nltk.ConditionalFreqDist(names)"
126 |    ]
127 |   }
128 |  ],
129 |  "metadata": {
130 |   "kernelspec": {
131 |    "display_name": "Python 3",
132 |    "language": "python",
133 |    "name": "python3"
134 |   },
135 |   "language_info": {
136 |    "codemirror_mode": {
137 |     "name": "ipython",
138 |     "version": 3
139 |    },
140 |    "file_extension": ".py",
141 |    "mimetype": "text/x-python",
142 |    "name": "python",
143 |    "nbconvert_exporter": "python",
144 |    "pygments_lexer": "ipython3",
145 |    "version": "3.6.0"
146 |   }
147 |  },
148 |  "nbformat": 4,
149 |  "nbformat_minor": 2
150 | }
151 | 


--------------------------------------------------------------------------------
/2. NLTK and the Basics/5. Example - Informative Words.ipynb:
--------------------------------------------------------------------------------
  1 | {
  2 |  "cells": [
  3 |   {
  4 |    "cell_type": "markdown",
  5 |    "metadata": {},
  6 |    "source": [
  7 |     "# NLTK and the Basics - Example: Informative Words"
  8 |    ]
  9 |   },
 10 |   {
 11 |    "cell_type": "code",
 12 |    "execution_count": 1,
 13 |    "metadata": {
 14 |     "collapsed": true
 15 |    },
 16 |    "outputs": [],
 17 |    "source": [
 18 |     "import nltk"
 19 |    ]
 20 |   },
 21 |   {
 22 |    "cell_type": "code",
 23 |    "execution_count": 2,
 24 |    "metadata": {
 25 |     "collapsed": true
 26 |    },
 27 |    "outputs": [],
 28 |    "source": [
 29 |     "alice = nltk.corpus.gutenberg.words(\"carroll-alice.txt\")"
 30 |    ]
 31 |   },
 32 |   {
 33 |    "cell_type": "code",
 34 |    "execution_count": 3,
 35 |    "metadata": {
 36 |     "collapsed": true
 37 |    },
 38 |    "outputs": [],
 39 |    "source": [
 40 |     "alice_fd = nltk.FreqDist(alice)"
 41 |    ]
 42 |   },
 43 |   {
 44 |    "cell_type": "markdown",
 45 |    "metadata": {},
 46 |    "source": [
 47 |     "Find the 100 most common words."
 48 |    ]
 49 |   },
 50 |   {
 51 |    "cell_type": "code",
 52 |    "execution_count": 4,
 53 |    "metadata": {
 54 |     "collapsed": true
 55 |    },
 56 |    "outputs": [],
 57 |    "source": [
 58 |     "alice_fd_100 = alice_fd.most_common(100)"
 59 |    ]
 60 |   },
 61 |   {
 62 |    "cell_type": "code",
 63 |    "execution_count": 5,
 64 |    "metadata": {
 65 |     "collapsed": false
 66 |    },
 67 |    "outputs": [],
 68 |    "source": [
 69 |     "moby = nltk.corpus.gutenberg.words(\"melville-moby_dick.txt\")\n",
 70 |     "moby_fd = nltk.FreqDist(moby)\n",
 71 |     "moby_fd_100 = moby_fd.most_common(100)"
 72 |    ]
 73 |   },
 74 |   {
 75 |    "cell_type": "markdown",
 76 |    "metadata": {},
 77 |    "source": [
 78 |     "We no longer care exactly how many times each word was seen and can drop the count."
 79 |    ]
 80 |   },
 81 |   {
 82 |    "cell_type": "code",
 83 |    "execution_count": 6,
 84 |    "metadata": {
 85 |     "collapsed": true
 86 |    },
 87 |    "outputs": [],
 88 |    "source": [
 89 |     "alice_100  = [word[0] for word in alice_fd_100]\n",
 90 |     "moby_100 = [word[0] for word in moby_fd_100]"
 91 |    ]
 92 |   },
 93 |   {
 94 |    "cell_type": "markdown",
 95 |    "metadata": {},
 96 |    "source": [
 97 |     "Two sets can be subtracted from one another leaving us with the difference."
 98 |    ]
 99 |   },
100 |   {
101 |    "cell_type": "code",
102 |    "execution_count": 7,
103 |    "metadata": {
104 |     "collapsed": false
105 |    },
106 |    "outputs": [
107 |     {
108 |      "data": {
109 |       "text/plain": [
110 |        "['ll',\n",
111 |        " 'much',\n",
112 |        " 'can',\n",
113 |        " 'Alice',\n",
114 |        " 'herself',\n",
115 |        " 'thought',\n",
116 |        " ':',\n",
117 |        " 'Mock',\n",
118 |        " 'again',\n",
119 |        " 'she',\n",
120 |        " 'Queen',\n",
121 |        " 'King',\n",
122 |        " '*',\n",
123 |        " 'Turtle',\n",
124 |        " 'way',\n",
125 |        " 'could',\n",
126 |        " 'did',\n",
127 |        " 't',\n",
128 |        " 'm',\n",
129 |        " \",'\",\n",
130 |        " 'see',\n",
131 |        " \".'\",\n",
132 |        " 'know',\n",
133 |        " 'little',\n",
134 |        " \"!'\",\n",
135 |        " 'off',\n",
136 |        " 'began',\n",
137 |        " 'went',\n",
138 |        " 'say',\n",
139 |        " 'Hatter',\n",
140 |        " \"?'\",\n",
141 |        " 'quite',\n",
142 |        " 'your',\n",
143 |        " 'said',\n",
144 |        " 'Gryphon',\n",
145 |        " 'do']"
146 |       ]
147 |      },
148 |      "execution_count": 7,
149 |      "metadata": {},
150 |      "output_type": "execute_result"
151 |     }
152 |    ],
153 |    "source": [
154 |     "list(set(alice_100) - set(moby_100))"
155 |    ]
156 |   },
157 |   {
158 |    "cell_type": "code",
159 |    "execution_count": 8,
160 |    "metadata": {
161 |     "collapsed": false
162 |    },
163 |    "outputs": [
164 |     {
165 |      "data": {
166 |       "text/plain": [
167 |        "['will',\n",
168 |        " 'from',\n",
169 |        " 'man',\n",
170 |        " 'over',\n",
171 |        " 'some',\n",
172 |        " 'been',\n",
173 |        " 'head',\n",
174 |        " 'other',\n",
175 |        " 'only',\n",
176 |        " 'more',\n",
177 |        " '!\"',\n",
178 |        " 'who',\n",
179 |        " 'are',\n",
180 |        " 'him',\n",
181 |        " '\"',\n",
182 |        " 'we',\n",
183 |        " 'such',\n",
184 |        " '?',\n",
185 |        " 'these',\n",
186 |        " 'long',\n",
187 |        " 'ye',\n",
188 |        " 'ship',\n",
189 |        " 'boat',\n",
190 |        " 'sea',\n",
191 |        " 'though',\n",
192 |        " 'Ahab',\n",
193 |        " 'which',\n",
194 |        " 'their',\n",
195 |        " 'But',\n",
196 |        " '.\"',\n",
197 |        " 'now',\n",
198 |        " 'any',\n",
199 |        " 'old',\n",
200 |        " 'than',\n",
201 |        " 'whale',\n",
202 |        " 'upon']"
203 |       ]
204 |      },
205 |      "execution_count": 8,
206 |      "metadata": {},
207 |      "output_type": "execute_result"
208 |     }
209 |    ],
210 |    "source": [
211 |     "list(set(moby_100) - set(alice_100))"
212 |    ]
213 |   }
214 |  ],
215 |  "metadata": {
216 |   "kernelspec": {
217 |    "display_name": "Python 3",
218 |    "language": "python",
219 |    "name": "python3"
220 |   },
221 |   "language_info": {
222 |    "codemirror_mode": {
223 |     "name": "ipython",
224 |     "version": 3
225 |    },
226 |    "file_extension": ".py",
227 |    "mimetype": "text/x-python",
228 |    "name": "python",
229 |    "nbconvert_exporter": "python",
230 |    "pygments_lexer": "ipython3",
231 |    "version": "3.6.0"
232 |   }
233 |  },
234 |  "nbformat": 4,
235 |  "nbformat_minor": 2
236 | }
237 | 


--------------------------------------------------------------------------------
/2. NLTK and the Basics/6. Bigrams.ipynb:
--------------------------------------------------------------------------------
  1 | {
  2 |  "cells": [
  3 |   {
  4 |    "cell_type": "markdown",
  5 |    "metadata": {},
  6 |    "source": [
  7 |     "# NLTK and the Basics - Bigrams"
  8 |    ]
  9 |   },
 10 |   {
 11 |    "cell_type": "code",
 12 |    "execution_count": 1,
 13 |    "metadata": {
 14 |     "collapsed": true
 15 |    },
 16 |    "outputs": [],
 17 |    "source": [
 18 |     "import nltk"
 19 |    ]
 20 |   },
 21 |   {
 22 |    "cell_type": "markdown",
 23 |    "metadata": {},
 24 |    "source": [
 25 |     "Bigrams, sometimes called 2grams, or ngrams (when dealing with a different number), is a way of looking at word sequences."
 26 |    ]
 27 |   },
 28 |   {
 29 |    "cell_type": "code",
 30 |    "execution_count": 2,
 31 |    "metadata": {
 32 |     "collapsed": true
 33 |    },
 34 |    "outputs": [],
 35 |    "source": [
 36 |     "text = \"I think it might rain today.\""
 37 |    ]
 38 |   },
 39 |   {
 40 |    "cell_type": "code",
 41 |    "execution_count": 3,
 42 |    "metadata": {
 43 |     "collapsed": false
 44 |    },
 45 |    "outputs": [],
 46 |    "source": [
 47 |     "tokens = nltk.word_tokenize(text)"
 48 |    ]
 49 |   },
 50 |   {
 51 |    "cell_type": "code",
 52 |    "execution_count": 4,
 53 |    "metadata": {
 54 |     "collapsed": false
 55 |    },
 56 |    "outputs": [
 57 |     {
 58 |      "data": {
 59 |       "text/plain": [
 60 |        "['I', 'think', 'it', 'might', 'rain', 'today', '.']"
 61 |       ]
 62 |      },
 63 |      "execution_count": 4,
 64 |      "metadata": {},
 65 |      "output_type": "execute_result"
 66 |     }
 67 |    ],
 68 |    "source": [
 69 |     "tokens"
 70 |    ]
 71 |   },
 72 |   {
 73 |    "cell_type": "code",
 74 |    "execution_count": 5,
 75 |    "metadata": {
 76 |     "collapsed": false
 77 |    },
 78 |    "outputs": [],
 79 |    "source": [
 80 |     "bigrams =  nltk.bigrams(tokens)"
 81 |    ]
 82 |   },
 83 |   {
 84 |    "cell_type": "code",
 85 |    "execution_count": 7,
 86 |    "metadata": {
 87 |     "collapsed": false
 88 |    },
 89 |    "outputs": [
 90 |     {
 91 |      "name": "stdout",
 92 |      "output_type": "stream",
 93 |      "text": [
 94 |       "('I', 'think')\n",
 95 |       "('think', 'it')\n",
 96 |       "('it', 'might')\n",
 97 |       "('might', 'rain')\n",
 98 |       "('rain', 'today')\n",
 99 |       "('today', '.')\n"
100 |      ]
101 |     }
102 |    ],
103 |    "source": [
104 |     "for item in bigrams:\n",
105 |     "    print(item)"
106 |    ]
107 |   },
108 |   {
109 |    "cell_type": "code",
110 |    "execution_count": 8,
111 |    "metadata": {
112 |     "collapsed": true
113 |    },
114 |    "outputs": [],
115 |    "source": [
116 |     "trigrams = nltk.trigrams(tokens)"
117 |    ]
118 |   },
119 |   {
120 |    "cell_type": "code",
121 |    "execution_count": 9,
122 |    "metadata": {
123 |     "collapsed": false
124 |    },
125 |    "outputs": [
126 |     {
127 |      "name": "stdout",
128 |      "output_type": "stream",
129 |      "text": [
130 |       "('I', 'think', 'it')\n",
131 |       "('think', 'it', 'might')\n",
132 |       "('it', 'might', 'rain')\n",
133 |       "('might', 'rain', 'today')\n",
134 |       "('rain', 'today', '.')\n"
135 |      ]
136 |     }
137 |    ],
138 |    "source": [
139 |     "for item in trigrams:\n",
140 |     "    print(item)"
141 |    ]
142 |   },
143 |   {
144 |    "cell_type": "code",
145 |    "execution_count": 10,
146 |    "metadata": {
147 |     "collapsed": true
148 |    },
149 |    "outputs": [],
150 |    "source": [
151 |     "from nltk.util import ngrams"
152 |    ]
153 |   },
154 |   {
155 |    "cell_type": "code",
156 |    "execution_count": 11,
157 |    "metadata": {
158 |     "collapsed": false
159 |    },
160 |    "outputs": [],
161 |    "source": [
162 |     "text = \"If it is nice out, I will go to the beach.\""
163 |    ]
164 |   },
165 |   {
166 |    "cell_type": "code",
167 |    "execution_count": 12,
168 |    "metadata": {
169 |     "collapsed": true
170 |    },
171 |    "outputs": [],
172 |    "source": [
173 |     "tokens = nltk.word_tokenize(text)"
174 |    ]
175 |   },
176 |   {
177 |    "cell_type": "code",
178 |    "execution_count": 13,
179 |    "metadata": {
180 |     "collapsed": false
181 |    },
182 |    "outputs": [],
183 |    "source": [
184 |     "bigrams = ngrams(tokens,2)"
185 |    ]
186 |   },
187 |   {
188 |    "cell_type": "code",
189 |    "execution_count": 14,
190 |    "metadata": {
191 |     "collapsed": false
192 |    },
193 |    "outputs": [
194 |     {
195 |      "name": "stdout",
196 |      "output_type": "stream",
197 |      "text": [
198 |       "('If', 'it')\n",
199 |       "('it', 'is')\n",
200 |       "('is', 'nice')\n",
201 |       "('nice', 'out')\n",
202 |       "('out', ',')\n",
203 |       "(',', 'I')\n",
204 |       "('I', 'will')\n",
205 |       "('will', 'go')\n",
206 |       "('go', 'to')\n",
207 |       "('to', 'the')\n",
208 |       "('the', 'beach')\n",
209 |       "('beach', '.')\n"
210 |      ]
211 |     }
212 |    ],
213 |    "source": [
214 |     "for item in bigrams:\n",
215 |     "    print(item)"
216 |    ]
217 |   },
218 |   {
219 |    "cell_type": "code",
220 |    "execution_count": 15,
221 |    "metadata": {
222 |     "collapsed": false
223 |    },
224 |    "outputs": [],
225 |    "source": [
226 |     "fourgrams = ngrams(tokens,4)"
227 |    ]
228 |   },
229 |   {
230 |    "cell_type": "code",
231 |    "execution_count": 17,
232 |    "metadata": {
233 |     "collapsed": false
234 |    },
235 |    "outputs": [
236 |     {
237 |      "name": "stdout",
238 |      "output_type": "stream",
239 |      "text": [
240 |       "('If', 'it', 'is', 'nice')\n",
241 |       "('it', 'is', 'nice', 'out')\n",
242 |       "('is', 'nice', 'out', ',')\n",
243 |       "('nice', 'out', ',', 'I')\n",
244 |       "('out', ',', 'I', 'will')\n",
245 |       "(',', 'I', 'will', 'go')\n",
246 |       "('I', 'will', 'go', 'to')\n",
247 |       "('will', 'go', 'to', 'the')\n",
248 |       "('go', 'to', 'the', 'beach')\n",
249 |       "('to', 'the', 'beach', '.')\n"
250 |      ]
251 |     }
252 |    ],
253 |    "source": [
254 |     "for item in fourgrams:\n",
255 |     "    print(item)"
256 |    ]
257 |   },
258 |   {
259 |    "cell_type": "markdown",
260 |    "metadata": {},
261 |    "source": [
262 |     "We can build a function to find any ngram."
263 |    ]
264 |   },
265 |   {
266 |    "cell_type": "code",
267 |    "execution_count": 18,
268 |    "metadata": {
269 |     "collapsed": false
270 |    },
271 |    "outputs": [],
272 |    "source": [
273 |     "def n_grams(text,n):\n",
274 |     "    tokens = nltk.word_tokenize(text)\n",
275 |     "    grams = ngrams(tokens,n)\n",
276 |     "    return grams"
277 |    ]
278 |   },
279 |   {
280 |    "cell_type": "code",
281 |    "execution_count": 19,
282 |    "metadata": {
283 |     "collapsed": true
284 |    },
285 |    "outputs": [],
286 |    "source": [
287 |     "text = \"I think it might rain today, but if it is nice out, I will go to the beach.\""
288 |    ]
289 |   },
290 |   {
291 |    "cell_type": "code",
292 |    "execution_count": 20,
293 |    "metadata": {
294 |     "collapsed": false
295 |    },
296 |    "outputs": [],
297 |    "source": [
298 |     "grams = n_grams(text, 5)"
299 |    ]
300 |   },
301 |   {
302 |    "cell_type": "code",
303 |    "execution_count": 21,
304 |    "metadata": {
305 |     "collapsed": false
306 |    },
307 |    "outputs": [
308 |     {
309 |      "name": "stdout",
310 |      "output_type": "stream",
311 |      "text": [
312 |       "('I', 'think', 'it', 'might', 'rain')\n",
313 |       "('think', 'it', 'might', 'rain', 'today')\n",
314 |       "('it', 'might', 'rain', 'today', ',')\n",
315 |       "('might', 'rain', 'today', ',', 'but')\n",
316 |       "('rain', 'today', ',', 'but', 'if')\n",
317 |       "('today', ',', 'but', 'if', 'it')\n",
318 |       "(',', 'but', 'if', 'it', 'is')\n",
319 |       "('but', 'if', 'it', 'is', 'nice')\n",
320 |       "('if', 'it', 'is', 'nice', 'out')\n",
321 |       "('it', 'is', 'nice', 'out', ',')\n",
322 |       "('is', 'nice', 'out', ',', 'I')\n",
323 |       "('nice', 'out', ',', 'I', 'will')\n",
324 |       "('out', ',', 'I', 'will', 'go')\n",
325 |       "(',', 'I', 'will', 'go', 'to')\n",
326 |       "('I', 'will', 'go', 'to', 'the')\n",
327 |       "('will', 'go', 'to', 'the', 'beach')\n",
328 |       "('go', 'to', 'the', 'beach', '.')\n"
329 |      ]
330 |     }
331 |    ],
332 |    "source": [
333 |     "for item in grams:\n",
334 |     "    print(item)"
335 |    ]
336 |   }
337 |  ],
338 |  "metadata": {
339 |   "kernelspec": {
340 |    "display_name": "Python 3",
341 |    "language": "python",
342 |    "name": "python3"
343 |   },
344 |   "language_info": {
345 |    "codemirror_mode": {
346 |     "name": "ipython",
347 |     "version": 3
348 |    },
349 |    "file_extension": ".py",
350 |    "mimetype": "text/x-python",
351 |    "name": "python",
352 |    "nbconvert_exporter": "python",
353 |    "pygments_lexer": "ipython3",
354 |    "version": "3.6.0"
355 |   }
356 |  },
357 |  "nbformat": 4,
358 |  "nbformat_minor": 2
359 | }
360 | 


--------------------------------------------------------------------------------
/2. NLTK and the Basics/7. Regular Expressions.ipynb:
--------------------------------------------------------------------------------
  1 | {
  2 |  "cells": [
  3 |   {
  4 |    "cell_type": "markdown",
  5 |    "metadata": {},
  6 |    "source": [
  7 |     "# NLTK and the Basics - Regular Expressions"
  8 |    ]
  9 |   },
 10 |   {
 11 |    "cell_type": "code",
 12 |    "execution_count": 8,
 13 |    "metadata": {
 14 |     "collapsed": true
 15 |    },
 16 |    "outputs": [],
 17 |    "source": [
 18 |     "import nltk\n",
 19 |     "import re"
 20 |    ]
 21 |   },
 22 |   {
 23 |    "cell_type": "code",
 24 |    "execution_count": 9,
 25 |    "metadata": {
 26 |     "collapsed": true
 27 |    },
 28 |    "outputs": [],
 29 |    "source": [
 30 |     "alice = nltk.corpus.gutenberg.words(\"carroll-alice.txt\")"
 31 |    ]
 32 |   },
 33 |   {
 34 |    "cell_type": "markdown",
 35 |    "metadata": {},
 36 |    "source": [
 37 |     "Finding every word that start with \"new\"."
 38 |    ]
 39 |   },
 40 |   {
 41 |    "cell_type": "code",
 42 |    "execution_count": 10,
 43 |    "metadata": {
 44 |     "collapsed": false
 45 |    },
 46 |    "outputs": [
 47 |     {
 48 |      "data": {
 49 |       "text/plain": [
 50 |        "{'new', 'newspapers'}"
 51 |       ]
 52 |      },
 53 |      "execution_count": 10,
 54 |      "metadata": {},
 55 |      "output_type": "execute_result"
 56 |     }
 57 |    ],
 58 |    "source": [
 59 |     "set([word for word in alice if re.search(\"^new\",word)])"
 60 |    ]
 61 |   },
 62 |   {
 63 |    "cell_type": "markdown",
 64 |    "metadata": {},
 65 |    "source": [
 66 |     "Finding every word that ends with \"ful\"."
 67 |    ]
 68 |   },
 69 |   {
 70 |    "cell_type": "code",
 71 |    "execution_count": 11,
 72 |    "metadata": {
 73 |     "collapsed": false
 74 |    },
 75 |    "outputs": [
 76 |     {
 77 |      "data": {
 78 |       "text/plain": [
 79 |        "{'Beautiful',\n",
 80 |        " 'barrowful',\n",
 81 |        " 'beautiful',\n",
 82 |        " 'delightful',\n",
 83 |        " 'doubtful',\n",
 84 |        " 'dreadful',\n",
 85 |        " 'graceful',\n",
 86 |        " 'hopeful',\n",
 87 |        " 'mournful',\n",
 88 |        " 'ootiful',\n",
 89 |        " 'respectful',\n",
 90 |        " 'sorrowful',\n",
 91 |        " 'truthful',\n",
 92 |        " 'useful',\n",
 93 |        " 'wonderful'}"
 94 |       ]
 95 |      },
 96 |      "execution_count": 11,
 97 |      "metadata": {},
 98 |      "output_type": "execute_result"
 99 |     }
100 |    ],
101 |    "source": [
102 |     "set([word for word in alice if re.search(\"ful$\",word)])"
103 |    ]
104 |   },
105 |   {
106 |    "cell_type": "markdown",
107 |    "metadata": {},
108 |    "source": [
109 |     "Finding words that are six characters long and have two n's in the middle."
110 |    ]
111 |   },
112 |   {
113 |    "cell_type": "code",
114 |    "execution_count": 12,
115 |    "metadata": {
116 |     "collapsed": false
117 |    },
118 |    "outputs": [
119 |     {
120 |      "data": {
121 |       "text/plain": [
122 |        "{'cannot', 'dinner', 'fanned', 'manner', 'tunnel'}"
123 |       ]
124 |      },
125 |      "execution_count": 12,
126 |      "metadata": {},
127 |      "output_type": "execute_result"
128 |     }
129 |    ],
130 |    "source": [
131 |     "set([word for word in alice if re.search(\"^..nn..$\",word)])"
132 |    ]
133 |   },
134 |   {
135 |    "cell_type": "markdown",
136 |    "metadata": {},
137 |    "source": [
138 |     "Finding words that start with \"c\", \"h\", and \"t\", and end in \"at\"."
139 |    ]
140 |   },
141 |   {
142 |    "cell_type": "code",
143 |    "execution_count": 13,
144 |    "metadata": {
145 |     "collapsed": false
146 |    },
147 |    "outputs": [
148 |     {
149 |      "data": {
150 |       "text/plain": [
151 |        "{'cat', 'hat', 'rat'}"
152 |       ]
153 |      },
154 |      "execution_count": 13,
155 |      "metadata": {},
156 |      "output_type": "execute_result"
157 |     }
158 |    ],
159 |    "source": [
160 |     "set([word for word in alice if re.search(\"^[chr]at$\",word)])"
161 |    ]
162 |   },
163 |   {
164 |    "cell_type": "markdown",
165 |    "metadata": {},
166 |    "source": [
167 |     "Finding words of any length that have two n's."
168 |    ]
169 |   },
170 |   {
171 |    "cell_type": "code",
172 |    "execution_count": 14,
173 |    "metadata": {
174 |     "collapsed": false
175 |    },
176 |    "outputs": [
177 |     {
178 |      "data": {
179 |       "text/plain": [
180 |        "{'Ann',\n",
181 |        " 'Dinn',\n",
182 |        " 'Pennyworth',\n",
183 |        " 'annoy',\n",
184 |        " 'annoyed',\n",
185 |        " 'beginning',\n",
186 |        " 'cannot',\n",
187 |        " 'cunning',\n",
188 |        " 'dinn',\n",
189 |        " 'dinner',\n",
190 |        " 'fanned',\n",
191 |        " 'fanning',\n",
192 |        " 'funny',\n",
193 |        " 'grinned',\n",
194 |        " 'grinning',\n",
195 |        " 'manner',\n",
196 |        " 'manners',\n",
197 |        " 'planning',\n",
198 |        " 'running',\n",
199 |        " 'tunnel'}"
200 |       ]
201 |      },
202 |      "execution_count": 14,
203 |      "metadata": {},
204 |      "output_type": "execute_result"
205 |     }
206 |    ],
207 |    "source": [
208 |     "set([word for word in alice if re.search(\"^.*nn.*$\",word)])"
209 |    ]
210 |   }
211 |  ],
212 |  "metadata": {
213 |   "kernelspec": {
214 |    "display_name": "Python 3",
215 |    "language": "python",
216 |    "name": "python3"
217 |   },
218 |   "language_info": {
219 |    "codemirror_mode": {
220 |     "name": "ipython",
221 |     "version": 3
222 |    },
223 |    "file_extension": ".py",
224 |    "mimetype": "text/x-python",
225 |    "name": "python",
226 |    "nbconvert_exporter": "python",
227 |    "pygments_lexer": "ipython3",
228 |    "version": "3.6.0"
229 |   }
230 |  },
231 |  "nbformat": 4,
232 |  "nbformat_minor": 2
233 | }
234 | 


--------------------------------------------------------------------------------
/3. Tokenization, Tagging, Chunking/1. Tokenization.ipynb:
--------------------------------------------------------------------------------
  1 | {
  2 |  "cells": [
  3 |   {
  4 |    "cell_type": "markdown",
  5 |    "metadata": {},
  6 |    "source": [
  7 |     "# Tokenization, Tagging, Chunking - Tokenization"
  8 |    ]
  9 |   },
 10 |   {
 11 |    "cell_type": "code",
 12 |    "execution_count": 1,
 13 |    "metadata": {
 14 |     "collapsed": true
 15 |    },
 16 |    "outputs": [],
 17 |    "source": [
 18 |     "import nltk"
 19 |    ]
 20 |   },
 21 |   {
 22 |    "cell_type": "markdown",
 23 |    "metadata": {},
 24 |    "source": [
 25 |     "Tokenization is the process of breaking up a string into a list of words and punctuation."
 26 |    ]
 27 |   },
 28 |   {
 29 |    "cell_type": "code",
 30 |    "execution_count": 2,
 31 |    "metadata": {
 32 |     "collapsed": true
 33 |    },
 34 |    "outputs": [],
 35 |    "source": [
 36 |     "my_string = \"I am learning Natural Language Processing.\""
 37 |    ]
 38 |   },
 39 |   {
 40 |    "cell_type": "code",
 41 |    "execution_count": 3,
 42 |    "metadata": {
 43 |     "collapsed": false
 44 |    },
 45 |    "outputs": [],
 46 |    "source": [
 47 |     "tokens = nltk.word_tokenize(my_string)"
 48 |    ]
 49 |   },
 50 |   {
 51 |    "cell_type": "code",
 52 |    "execution_count": 4,
 53 |    "metadata": {
 54 |     "collapsed": false
 55 |    },
 56 |    "outputs": [
 57 |     {
 58 |      "data": {
 59 |       "text/plain": [
 60 |        "['I', 'am', 'learning', 'Natural', 'Language', 'Processing', '.']"
 61 |       ]
 62 |      },
 63 |      "execution_count": 4,
 64 |      "metadata": {},
 65 |      "output_type": "execute_result"
 66 |     }
 67 |    ],
 68 |    "source": [
 69 |     "tokens"
 70 |    ]
 71 |   },
 72 |   {
 73 |    "cell_type": "code",
 74 |    "execution_count": 5,
 75 |    "metadata": {
 76 |     "collapsed": false
 77 |    },
 78 |    "outputs": [
 79 |     {
 80 |      "data": {
 81 |       "text/plain": [
 82 |        "7"
 83 |       ]
 84 |      },
 85 |      "execution_count": 5,
 86 |      "metadata": {},
 87 |      "output_type": "execute_result"
 88 |     }
 89 |    ],
 90 |    "source": [
 91 |     "len(tokens)"
 92 |    ]
 93 |   },
 94 |   {
 95 |    "cell_type": "code",
 96 |    "execution_count": 6,
 97 |    "metadata": {
 98 |     "collapsed": true
 99 |    },
100 |    "outputs": [],
101 |    "source": [
102 |     "phrase = \"I am learning Natural Language Processing. I am learning how to tokenize!\""
103 |    ]
104 |   },
105 |   {
106 |    "cell_type": "code",
107 |    "execution_count": 7,
108 |    "metadata": {
109 |     "collapsed": false
110 |    },
111 |    "outputs": [],
112 |    "source": [
113 |     "tokens_sent = nltk.sent_tokenize(phrase)"
114 |    ]
115 |   },
116 |   {
117 |    "cell_type": "code",
118 |    "execution_count": 8,
119 |    "metadata": {
120 |     "collapsed": false
121 |    },
122 |    "outputs": [
123 |     {
124 |      "data": {
125 |       "text/plain": [
126 |        "['I am learning Natural Language Processing.',\n",
127 |        " 'I am learning how to tokenize!']"
128 |       ]
129 |      },
130 |      "execution_count": 8,
131 |      "metadata": {},
132 |      "output_type": "execute_result"
133 |     }
134 |    ],
135 |    "source": [
136 |     "tokens_sent"
137 |    ]
138 |   },
139 |   {
140 |    "cell_type": "code",
141 |    "execution_count": 9,
142 |    "metadata": {
143 |     "collapsed": false
144 |    },
145 |    "outputs": [
146 |     {
147 |      "data": {
148 |       "text/plain": [
149 |        "2"
150 |       ]
151 |      },
152 |      "execution_count": 9,
153 |      "metadata": {},
154 |      "output_type": "execute_result"
155 |     }
156 |    ],
157 |    "source": [
158 |     "len(tokens_sent)"
159 |    ]
160 |   },
161 |   {
162 |    "cell_type": "markdown",
163 |    "metadata": {},
164 |    "source": [
165 |     "We can tokenize our sentence tokens."
166 |    ]
167 |   },
168 |   {
169 |    "cell_type": "code",
170 |    "execution_count": 10,
171 |    "metadata": {
172 |     "collapsed": false
173 |    },
174 |    "outputs": [
175 |     {
176 |      "name": "stdout",
177 |      "output_type": "stream",
178 |      "text": [
179 |       "['I', 'am', 'learning', 'Natural', 'Language', 'Processing', '.']\n",
180 |       "['I', 'am', 'learning', 'how', 'to', 'tokenize', '!']\n"
181 |      ]
182 |     }
183 |    ],
184 |    "source": [
185 |     "for item in tokens_sent:\n",
186 |     "    print(nltk.word_tokenize(item))"
187 |    ]
188 |   }
189 |  ],
190 |  "metadata": {
191 |   "kernelspec": {
192 |    "display_name": "Python 3",
193 |    "language": "python",
194 |    "name": "python3"
195 |   },
196 |   "language_info": {
197 |    "codemirror_mode": {
198 |     "name": "ipython",
199 |     "version": 3
200 |    },
201 |    "file_extension": ".py",
202 |    "mimetype": "text/x-python",
203 |    "name": "python",
204 |    "nbconvert_exporter": "python",
205 |    "pygments_lexer": "ipython3",
206 |    "version": "3.6.0"
207 |   }
208 |  },
209 |  "nbformat": 4,
210 |  "nbformat_minor": 2
211 | }
212 | 


--------------------------------------------------------------------------------
/3. Tokenization, Tagging, Chunking/2. Normalization.ipynb:
--------------------------------------------------------------------------------
  1 | {
  2 |  "cells": [
  3 |   {
  4 |    "cell_type": "markdown",
  5 |    "metadata": {},
  6 |    "source": [
  7 |     "# Tokenization, Tagging, Chunking - Normalization"
  8 |    ]
  9 |   },
 10 |   {
 11 |    "cell_type": "code",
 12 |    "execution_count": 1,
 13 |    "metadata": {
 14 |     "collapsed": true
 15 |    },
 16 |    "outputs": [],
 17 |    "source": [
 18 |     "import nltk"
 19 |    ]
 20 |   },
 21 |   {
 22 |    "cell_type": "markdown",
 23 |    "metadata": {},
 24 |    "source": [
 25 |     "Removing punctuation."
 26 |    ]
 27 |   },
 28 |   {
 29 |    "cell_type": "code",
 30 |    "execution_count": 2,
 31 |    "metadata": {
 32 |     "collapsed": false
 33 |    },
 34 |    "outputs": [],
 35 |    "source": [
 36 |     "md = nltk.corpus.gutenberg.words(\"melville-moby_dick.txt\")"
 37 |    ]
 38 |   },
 39 |   {
 40 |    "cell_type": "code",
 41 |    "execution_count": 3,
 42 |    "metadata": {
 43 |     "collapsed": false
 44 |    },
 45 |    "outputs": [
 46 |     {
 47 |      "data": {
 48 |       "text/plain": [
 49 |        "['[', 'Moby', 'Dick', 'by', 'Herman', 'Melville', '1851', ']']"
 50 |       ]
 51 |      },
 52 |      "execution_count": 3,
 53 |      "metadata": {},
 54 |      "output_type": "execute_result"
 55 |     }
 56 |    ],
 57 |    "source": [
 58 |     "md[:8]"
 59 |    ]
 60 |   },
 61 |   {
 62 |    "cell_type": "code",
 63 |    "execution_count": 4,
 64 |    "metadata": {
 65 |     "collapsed": true
 66 |    },
 67 |    "outputs": [],
 68 |    "source": [
 69 |     "md_8 = md[:8]"
 70 |    ]
 71 |   },
 72 |   {
 73 |    "cell_type": "code",
 74 |    "execution_count": 5,
 75 |    "metadata": {
 76 |     "collapsed": false
 77 |    },
 78 |    "outputs": [
 79 |     {
 80 |      "data": {
 81 |       "text/plain": [
 82 |        "['[', 'Moby', 'Dick', 'by', 'Herman', 'Melville', '1851', ']']"
 83 |       ]
 84 |      },
 85 |      "execution_count": 5,
 86 |      "metadata": {},
 87 |      "output_type": "execute_result"
 88 |     }
 89 |    ],
 90 |    "source": [
 91 |     "md_8"
 92 |    ]
 93 |   },
 94 |   {
 95 |    "cell_type": "code",
 96 |    "execution_count": 7,
 97 |    "metadata": {
 98 |     "collapsed": false
 99 |    },
100 |    "outputs": [
101 |     {
102 |      "name": "stdout",
103 |      "output_type": "stream",
104 |      "text": [
105 |       "Moby\n",
106 |       "Dick\n",
107 |       "by\n",
108 |       "Herman\n",
109 |       "Melville\n"
110 |      ]
111 |     }
112 |    ],
113 |    "source": [
114 |     "for word in md_8:\n",
115 |     "    if word.isalpha():\n",
116 |     "        print(word)"
117 |    ]
118 |   },
119 |   {
120 |    "cell_type": "markdown",
121 |    "metadata": {},
122 |    "source": [
123 |     "Making everything lower case."
124 |    ]
125 |   },
126 |   {
127 |    "cell_type": "code",
128 |    "execution_count": 8,
129 |    "metadata": {
130 |     "collapsed": false
131 |    },
132 |    "outputs": [
133 |     {
134 |      "name": "stdout",
135 |      "output_type": "stream",
136 |      "text": [
137 |       "[\n",
138 |       "moby\n",
139 |       "dick\n",
140 |       "by\n",
141 |       "herman\n",
142 |       "melville\n",
143 |       "1851\n",
144 |       "]\n"
145 |      ]
146 |     }
147 |    ],
148 |    "source": [
149 |     "for word in md_8:\n",
150 |     "    print(word.lower())"
151 |    ]
152 |   },
153 |   {
154 |    "cell_type": "code",
155 |    "execution_count": 9,
156 |    "metadata": {
157 |     "collapsed": false
158 |    },
159 |    "outputs": [],
160 |    "source": [
161 |     "norm = [word.lower() for word in md_8 if word.isalpha()]"
162 |    ]
163 |   },
164 |   {
165 |    "cell_type": "code",
166 |    "execution_count": 10,
167 |    "metadata": {
168 |     "collapsed": false
169 |    },
170 |    "outputs": [
171 |     {
172 |      "data": {
173 |       "text/plain": [
174 |        "['moby', 'dick', 'by', 'herman', 'melville']"
175 |       ]
176 |      },
177 |      "execution_count": 10,
178 |      "metadata": {},
179 |      "output_type": "execute_result"
180 |     }
181 |    ],
182 |    "source": [
183 |     "norm"
184 |    ]
185 |   },
186 |   {
187 |    "cell_type": "markdown",
188 |    "metadata": {},
189 |    "source": [
190 |     "## Stemmers"
191 |    ]
192 |   },
193 |   {
194 |    "cell_type": "markdown",
195 |    "metadata": {},
196 |    "source": [
197 |     "Stemmers help further normalize text when we run into words that might be plural,\n",
198 |     "for example. \n",
199 |     "\n",
200 |     "There are many different kinds of stemmers so you have to pick the one that works best for your use case.\n"
201 |    ]
202 |   },
203 |   {
204 |    "cell_type": "code",
205 |    "execution_count": 11,
206 |    "metadata": {
207 |     "collapsed": true
208 |    },
209 |    "outputs": [],
210 |    "source": [
211 |     "porter = nltk.PorterStemmer()"
212 |    ]
213 |   },
214 |   {
215 |    "cell_type": "code",
216 |    "execution_count": 12,
217 |    "metadata": {
218 |     "collapsed": true
219 |    },
220 |    "outputs": [],
221 |    "source": [
222 |     "my_list = [\"cat\",\"cats\",\"lie\",\"lying\",\"run\",\"running\",\"city\",\"cities\",\"month\",\"monthly\",\"woman\",\"women\"]"
223 |    ]
224 |   },
225 |   {
226 |    "cell_type": "code",
227 |    "execution_count": 14,
228 |    "metadata": {
229 |     "collapsed": false
230 |    },
231 |    "outputs": [
232 |     {
233 |      "name": "stdout",
234 |      "output_type": "stream",
235 |      "text": [
236 |       "cat\n",
237 |       "cat\n",
238 |       "lie\n",
239 |       "lie\n",
240 |       "run\n",
241 |       "run\n",
242 |       "citi\n",
243 |       "citi\n",
244 |       "month\n",
245 |       "monthli\n",
246 |       "woman\n",
247 |       "women\n"
248 |      ]
249 |     }
250 |    ],
251 |    "source": [
252 |     "for word in my_list:\n",
253 |     "    print(porter.stem(word))"
254 |    ]
255 |   },
256 |   {
257 |    "cell_type": "code",
258 |    "execution_count": 15,
259 |    "metadata": {
260 |     "collapsed": true
261 |    },
262 |    "outputs": [],
263 |    "source": [
264 |     "lancaster = nltk.LancasterStemmer()"
265 |    ]
266 |   },
267 |   {
268 |    "cell_type": "code",
269 |    "execution_count": 17,
270 |    "metadata": {
271 |     "collapsed": false
272 |    },
273 |    "outputs": [
274 |     {
275 |      "name": "stdout",
276 |      "output_type": "stream",
277 |      "text": [
278 |       "cat\n",
279 |       "cat\n",
280 |       "lie\n",
281 |       "lying\n",
282 |       "run\n",
283 |       "run\n",
284 |       "city\n",
285 |       "city\n",
286 |       "mon\n",
287 |       "month\n",
288 |       "wom\n",
289 |       "wom\n"
290 |      ]
291 |     }
292 |    ],
293 |    "source": [
294 |     "for word in my_list:\n",
295 |     "    print(lancaster.stem(word))"
296 |    ]
297 |   },
298 |   {
299 |    "cell_type": "markdown",
300 |    "metadata": {},
301 |    "source": [
302 |     "We can try to solve the normalization problem with Lemmatization."
303 |    ]
304 |   },
305 |   {
306 |    "cell_type": "code",
307 |    "execution_count": 18,
308 |    "metadata": {
309 |     "collapsed": true
310 |    },
311 |    "outputs": [],
312 |    "source": [
313 |     "wnlem = nltk.WordNetLemmatizer()"
314 |    ]
315 |   },
316 |   {
317 |    "cell_type": "code",
318 |    "execution_count": 20,
319 |    "metadata": {
320 |     "collapsed": false
321 |    },
322 |    "outputs": [
323 |     {
324 |      "name": "stdout",
325 |      "output_type": "stream",
326 |      "text": [
327 |       "cat\n",
328 |       "cat\n",
329 |       "lie\n",
330 |       "lying\n",
331 |       "run\n",
332 |       "running\n",
333 |       "city\n",
334 |       "city\n",
335 |       "month\n",
336 |       "monthly\n",
337 |       "woman\n",
338 |       "woman\n"
339 |      ]
340 |     }
341 |    ],
342 |    "source": [
343 |     "for word in my_list:\n",
344 |     "    print(wnlem.lemmatize(word))"
345 |    ]
346 |   }
347 |  ],
348 |  "metadata": {
349 |   "kernelspec": {
350 |    "display_name": "Python 3",
351 |    "language": "python",
352 |    "name": "python3"
353 |   },
354 |   "language_info": {
355 |    "codemirror_mode": {
356 |     "name": "ipython",
357 |     "version": 3
358 |    },
359 |    "file_extension": ".py",
360 |    "mimetype": "text/x-python",
361 |    "name": "python",
362 |    "nbconvert_exporter": "python",
363 |    "pygments_lexer": "ipython3",
364 |    "version": "3.6.0"
365 |   }
366 |  },
367 |  "nbformat": 4,
368 |  "nbformat_minor": 2
369 | }
370 | 


--------------------------------------------------------------------------------
/3. Tokenization, Tagging, Chunking/3. Part of Speech Tagging.ipynb:
--------------------------------------------------------------------------------
  1 | {
  2 |  "cells": [
  3 |   {
  4 |    "cell_type": "markdown",
  5 |    "metadata": {},
  6 |    "source": [
  7 |     "# Tokenization, Tagging, Chunking - Part of Speech Tagging"
  8 |    ]
  9 |   },
 10 |   {
 11 |    "cell_type": "code",
 12 |    "execution_count": 17,
 13 |    "metadata": {
 14 |     "collapsed": true
 15 |    },
 16 |    "outputs": [],
 17 |    "source": [
 18 |     "import nltk"
 19 |    ]
 20 |   },
 21 |   {
 22 |    "cell_type": "markdown",
 23 |    "metadata": {},
 24 |    "source": [
 25 |     "A part of speech tagger will identify the part of speech for a sequence of words."
 26 |    ]
 27 |   },
 28 |   {
 29 |    "cell_type": "code",
 30 |    "execution_count": 18,
 31 |    "metadata": {
 32 |     "collapsed": true
 33 |    },
 34 |    "outputs": [],
 35 |    "source": [
 36 |     "text = \"I walked to the cafe to buy coffee before work.\""
 37 |    ]
 38 |   },
 39 |   {
 40 |    "cell_type": "code",
 41 |    "execution_count": 19,
 42 |    "metadata": {
 43 |     "collapsed": false
 44 |    },
 45 |    "outputs": [],
 46 |    "source": [
 47 |     "tokens = nltk.word_tokenize(text)"
 48 |    ]
 49 |   },
 50 |   {
 51 |    "cell_type": "code",
 52 |    "execution_count": 20,
 53 |    "metadata": {
 54 |     "collapsed": false
 55 |    },
 56 |    "outputs": [
 57 |     {
 58 |      "data": {
 59 |       "text/plain": [
 60 |        "[('I', 'PRP'),\n",
 61 |        " ('walked', 'VBD'),\n",
 62 |        " ('to', 'TO'),\n",
 63 |        " ('the', 'DT'),\n",
 64 |        " ('cafe', 'NN'),\n",
 65 |        " ('to', 'TO'),\n",
 66 |        " ('buy', 'VB'),\n",
 67 |        " ('coffee', 'NN'),\n",
 68 |        " ('before', 'IN'),\n",
 69 |        " ('work', 'NN'),\n",
 70 |        " ('.', '.')]"
 71 |       ]
 72 |      },
 73 |      "execution_count": 20,
 74 |      "metadata": {},
 75 |      "output_type": "execute_result"
 76 |     }
 77 |    ],
 78 |    "source": [
 79 |     "nltk.pos_tag(tokens)"
 80 |    ]
 81 |   },
 82 |   {
 83 |    "cell_type": "markdown",
 84 |    "metadata": {},
 85 |    "source": [
 86 |     "For an extensive list of part-of-speech tags visit:\n",
 87 |     "https://en.wikipedia.org/w/index.php?title=Brown_Corpus"
 88 |    ]
 89 |   },
 90 |   {
 91 |    "cell_type": "code",
 92 |    "execution_count": 21,
 93 |    "metadata": {
 94 |     "collapsed": false
 95 |    },
 96 |    "outputs": [
 97 |     {
 98 |      "data": {
 99 |       "text/plain": [
100 |        "[('I', 'PRP'), ('will', 'MD'), ('have', 'VB'), ('desert', 'NN'), ('.', '.')]"
101 |       ]
102 |      },
103 |      "execution_count": 21,
104 |      "metadata": {},
105 |      "output_type": "execute_result"
106 |     }
107 |    ],
108 |    "source": [
109 |     "nltk.pos_tag(nltk.word_tokenize(\"I will have desert.\"))"
110 |    ]
111 |   },
112 |   {
113 |    "cell_type": "code",
114 |    "execution_count": 22,
115 |    "metadata": {
116 |     "collapsed": false
117 |    },
118 |    "outputs": [
119 |     {
120 |      "data": {
121 |       "text/plain": [
122 |        "[('They', 'PRP'), ('will', 'MD'), ('desert', 'VB'), ('us', 'PRP'), ('.', '.')]"
123 |       ]
124 |      },
125 |      "execution_count": 22,
126 |      "metadata": {},
127 |      "output_type": "execute_result"
128 |     }
129 |    ],
130 |    "source": [
131 |     "nltk.pos_tag(nltk.word_tokenize(\"They will desert us.\"))"
132 |    ]
133 |   },
134 |   {
135 |    "cell_type": "markdown",
136 |    "metadata": {},
137 |    "source": [
138 |     "Create a list of all nouns."
139 |    ]
140 |   },
141 |   {
142 |    "cell_type": "code",
143 |    "execution_count": 23,
144 |    "metadata": {
145 |     "collapsed": true
146 |    },
147 |    "outputs": [],
148 |    "source": [
149 |     "md = nltk.corpus.gutenberg.words(\"melville-moby_dick.txt\")"
150 |    ]
151 |   },
152 |   {
153 |    "cell_type": "code",
154 |    "execution_count": 24,
155 |    "metadata": {
156 |     "collapsed": false
157 |    },
158 |    "outputs": [],
159 |    "source": [
160 |     "md_norm = [word.lower() for word in md if word.isalpha()]"
161 |    ]
162 |   },
163 |   {
164 |    "cell_type": "code",
165 |    "execution_count": 25,
166 |    "metadata": {
167 |     "collapsed": true
168 |    },
169 |    "outputs": [],
170 |    "source": [
171 |     "md_tags = nltk.pos_tag(md_norm,tagset=\"universal\")"
172 |    ]
173 |   },
174 |   {
175 |    "cell_type": "code",
176 |    "execution_count": 26,
177 |    "metadata": {
178 |     "collapsed": false
179 |    },
180 |    "outputs": [
181 |     {
182 |      "data": {
183 |       "text/plain": [
184 |        "[('moby', 'NOUN'),\n",
185 |        " ('dick', 'NOUN'),\n",
186 |        " ('by', 'ADP'),\n",
187 |        " ('herman', 'NOUN'),\n",
188 |        " ('melville', 'NOUN')]"
189 |       ]
190 |      },
191 |      "execution_count": 26,
192 |      "metadata": {},
193 |      "output_type": "execute_result"
194 |     }
195 |    ],
196 |    "source": [
197 |     "md_tags[:5]"
198 |    ]
199 |   },
200 |   {
201 |    "cell_type": "code",
202 |    "execution_count": 27,
203 |    "metadata": {
204 |     "collapsed": true
205 |    },
206 |    "outputs": [],
207 |    "source": [
208 |     "md_nouns = [word for word in md_tags if word[1] == \"NOUN\"]"
209 |    ]
210 |   },
211 |   {
212 |    "cell_type": "code",
213 |    "execution_count": 28,
214 |    "metadata": {
215 |     "collapsed": false
216 |    },
217 |    "outputs": [],
218 |    "source": [
219 |     "nouns_fd = nltk.FreqDist(md_nouns)"
220 |    ]
221 |   },
222 |   {
223 |    "cell_type": "code",
224 |    "execution_count": 29,
225 |    "metadata": {
226 |     "collapsed": false
227 |    },
228 |    "outputs": [
229 |     {
230 |      "data": {
231 |       "text/plain": [
232 |        "[(('i', 'NOUN'), 1182),\n",
233 |        " (('whale', 'NOUN'), 909),\n",
234 |        " (('s', 'NOUN'), 774),\n",
235 |        " (('man', 'NOUN'), 527),\n",
236 |        " (('ship', 'NOUN'), 498),\n",
237 |        " (('sea', 'NOUN'), 435),\n",
238 |        " (('head', 'NOUN'), 337),\n",
239 |        " (('time', 'NOUN'), 334),\n",
240 |        " (('boat', 'NOUN'), 332),\n",
241 |        " (('ahab', 'NOUN'), 278)]"
242 |       ]
243 |      },
244 |      "execution_count": 29,
245 |      "metadata": {},
246 |      "output_type": "execute_result"
247 |     }
248 |    ],
249 |    "source": [
250 |     "nouns_fd.most_common()[:10]  "
251 |    ]
252 |   }
253 |  ],
254 |  "metadata": {
255 |   "kernelspec": {
256 |    "display_name": "Python 3",
257 |    "language": "python",
258 |    "name": "python3"
259 |   },
260 |   "language_info": {
261 |    "codemirror_mode": {
262 |     "name": "ipython",
263 |     "version": 3
264 |    },
265 |    "file_extension": ".py",
266 |    "mimetype": "text/x-python",
267 |    "name": "python",
268 |    "nbconvert_exporter": "python",
269 |    "pygments_lexer": "ipython3",
270 |    "version": "3.6.0"
271 |   }
272 |  },
273 |  "nbformat": 4,
274 |  "nbformat_minor": 2
275 | }
276 | 


--------------------------------------------------------------------------------
/3. Tokenization, Tagging, Chunking/4. Example - Multiple Parts of Speech.ipynb:
--------------------------------------------------------------------------------
  1 | {
  2 |  "cells": [
  3 |   {
  4 |    "cell_type": "markdown",
  5 |    "metadata": {},
  6 |    "source": [
  7 |     "# Tokenization, Tagging, Chunking - Example: Multiple Parts of Speech"
  8 |    ]
  9 |   },
 10 |   {
 11 |    "cell_type": "code",
 12 |    "execution_count": 1,
 13 |    "metadata": {
 14 |     "collapsed": true
 15 |    },
 16 |    "outputs": [],
 17 |    "source": [
 18 |     "import nltk"
 19 |    ]
 20 |   },
 21 |   {
 22 |    "cell_type": "markdown",
 23 |    "metadata": {},
 24 |    "source": [
 25 |     "Words can be tagged with a different part of speech based on usage."
 26 |    ]
 27 |   },
 28 |   {
 29 |    "cell_type": "code",
 30 |    "execution_count": 2,
 31 |    "metadata": {
 32 |     "collapsed": false
 33 |    },
 34 |    "outputs": [],
 35 |    "source": [
 36 |     "alice = nltk.corpus.gutenberg.words(\"carroll-alice.txt\")"
 37 |    ]
 38 |   },
 39 |   {
 40 |    "cell_type": "code",
 41 |    "execution_count": 3,
 42 |    "metadata": {
 43 |     "collapsed": false
 44 |    },
 45 |    "outputs": [],
 46 |    "source": [
 47 |     "alice_norm = [word.lower() for word in alice if word.isalpha()]"
 48 |    ]
 49 |   },
 50 |   {
 51 |    "cell_type": "code",
 52 |    "execution_count": 4,
 53 |    "metadata": {
 54 |     "collapsed": true
 55 |    },
 56 |    "outputs": [],
 57 |    "source": [
 58 |     "alice_tags = nltk.pos_tag(alice_norm,tagset=\"universal\")"
 59 |    ]
 60 |   },
 61 |   {
 62 |    "cell_type": "code",
 63 |    "execution_count": 5,
 64 |    "metadata": {
 65 |     "collapsed": true
 66 |    },
 67 |    "outputs": [],
 68 |    "source": [
 69 |     "alice_cfd = nltk.ConditionalFreqDist(alice_tags)"
 70 |    ]
 71 |   },
 72 |   {
 73 |    "cell_type": "code",
 74 |    "execution_count": 6,
 75 |    "metadata": {
 76 |     "collapsed": false
 77 |    },
 78 |    "outputs": [
 79 |     {
 80 |      "data": {
 81 |       "text/plain": [
 82 |        "FreqDist({'ADP': 31, 'ADV': 4, 'PRT': 5})"
 83 |       ]
 84 |      },
 85 |      "execution_count": 6,
 86 |      "metadata": {},
 87 |      "output_type": "execute_result"
 88 |     }
 89 |    ],
 90 |    "source": [
 91 |     "alice_cfd['over']"
 92 |    ]
 93 |   },
 94 |   {
 95 |    "cell_type": "code",
 96 |    "execution_count": 7,
 97 |    "metadata": {
 98 |     "collapsed": false
 99 |    },
100 |    "outputs": [
101 |     {
102 |      "data": {
103 |       "text/plain": [
104 |        "FreqDist({'NOUN': 1, 'VERB': 16})"
105 |       ]
106 |      },
107 |      "execution_count": 7,
108 |      "metadata": {},
109 |      "output_type": "execute_result"
110 |     }
111 |    ],
112 |    "source": [
113 |     "alice_cfd['spoke']"
114 |    ]
115 |   },
116 |   {
117 |    "cell_type": "code",
118 |    "execution_count": 8,
119 |    "metadata": {
120 |     "collapsed": false
121 |    },
122 |    "outputs": [
123 |     {
124 |      "data": {
125 |       "text/plain": [
126 |        "FreqDist({'ADP': 1, 'NOUN': 5, 'VERB': 3})"
127 |       ]
128 |      },
129 |      "execution_count": 8,
130 |      "metadata": {},
131 |      "output_type": "execute_result"
132 |     }
133 |    ],
134 |    "source": [
135 |     "alice_cfd['answer']"
136 |    ]
137 |   }
138 |  ],
139 |  "metadata": {
140 |   "kernelspec": {
141 |    "display_name": "Python 3",
142 |    "language": "python",
143 |    "name": "python3"
144 |   },
145 |   "language_info": {
146 |    "codemirror_mode": {
147 |     "name": "ipython",
148 |     "version": 3
149 |    },
150 |    "file_extension": ".py",
151 |    "mimetype": "text/x-python",
152 |    "name": "python",
153 |    "nbconvert_exporter": "python",
154 |    "pygments_lexer": "ipython3",
155 |    "version": "3.6.0"
156 |   }
157 |  },
158 |  "nbformat": 4,
159 |  "nbformat_minor": 2
160 | }
161 | 


--------------------------------------------------------------------------------
/3. Tokenization, Tagging, Chunking/5. Example - Choices.ipynb:
--------------------------------------------------------------------------------
  1 | {
  2 |  "cells": [
  3 |   {
  4 |    "cell_type": "markdown",
  5 |    "metadata": {},
  6 |    "source": [
  7 |     "# Tokenization, Tagging, Chunking - Example: Choices"
  8 |    ]
  9 |   },
 10 |   {
 11 |    "cell_type": "code",
 12 |    "execution_count": 1,
 13 |    "metadata": {
 14 |     "collapsed": true
 15 |    },
 16 |    "outputs": [],
 17 |    "source": [
 18 |     "import nltk"
 19 |    ]
 20 |   },
 21 |   {
 22 |    "cell_type": "markdown",
 23 |    "metadata": {},
 24 |    "source": [
 25 |     "Finding all the cases in a given text where there was a choice between two options, \"NOUN 'or' NOUN\"."
 26 |    ]
 27 |   },
 28 |   {
 29 |    "cell_type": "code",
 30 |    "execution_count": 2,
 31 |    "metadata": {
 32 |     "collapsed": false
 33 |    },
 34 |    "outputs": [],
 35 |    "source": [
 36 |     "stories = nltk.corpus.gutenberg.words(\"bryant-stories.txt\")\n",
 37 |     "tags = nltk.pos_tag(stories, tagset=\"universal\")"
 38 |    ]
 39 |   },
 40 |   {
 41 |    "cell_type": "code",
 42 |    "execution_count": 3,
 43 |    "metadata": {
 44 |     "collapsed": false
 45 |    },
 46 |    "outputs": [
 47 |     {
 48 |      "data": {
 49 |       "text/plain": [
 50 |        "[('[', 'NOUN'),\n",
 51 |        " ('Stories', 'NOUN'),\n",
 52 |        " ('to', 'PRT'),\n",
 53 |        " ('Tell', 'VERB'),\n",
 54 |        " ('to', 'PRT'),\n",
 55 |        " ('Children', 'NOUN'),\n",
 56 |        " ('by', 'ADP'),\n",
 57 |        " ('Sara', 'NOUN'),\n",
 58 |        " ('Cone', 'NOUN'),\n",
 59 |        " ('Bryant', 'NOUN')]"
 60 |       ]
 61 |      },
 62 |      "execution_count": 3,
 63 |      "metadata": {},
 64 |      "output_type": "execute_result"
 65 |     }
 66 |    ],
 67 |    "source": [
 68 |     "tags[:10]"
 69 |    ]
 70 |   },
 71 |   {
 72 |    "cell_type": "code",
 73 |    "execution_count": 5,
 74 |    "metadata": {
 75 |     "collapsed": false
 76 |    },
 77 |    "outputs": [
 78 |     {
 79 |      "name": "stdout",
 80 |      "output_type": "stream",
 81 |      "text": [
 82 |       "ship or part\n",
 83 |       "food or water\n",
 84 |       "queens or princesses\n",
 85 |       "rank or wealth\n"
 86 |      ]
 87 |     }
 88 |    ],
 89 |    "source": [
 90 |     "for ((word1,tag1),(word2,tag2),(word3,tag3)) in nltk.trigrams(tags):\n",
 91 |     "    if tag1 ==  \"NOUN\" and word2 == \"or\" and tag3 == \"NOUN\":\n",
 92 |     "        print(word1 + \" \" + word2 + \" \" + word3)"
 93 |    ]
 94 |   }
 95 |  ],
 96 |  "metadata": {
 97 |   "kernelspec": {
 98 |    "display_name": "Python 3",
 99 |    "language": "python",
100 |    "name": "python3"
101 |   },
102 |   "language_info": {
103 |    "codemirror_mode": {
104 |     "name": "ipython",
105 |     "version": 3
106 |    },
107 |    "file_extension": ".py",
108 |    "mimetype": "text/x-python",
109 |    "name": "python",
110 |    "nbconvert_exporter": "python",
111 |    "pygments_lexer": "ipython3",
112 |    "version": "3.6.0"
113 |   }
114 |  },
115 |  "nbformat": 4,
116 |  "nbformat_minor": 2
117 | }
118 | 


--------------------------------------------------------------------------------
/3. Tokenization, Tagging, Chunking/6. Chunking.ipynb:
--------------------------------------------------------------------------------
  1 | {
  2 |  "cells": [
  3 |   {
  4 |    "cell_type": "markdown",
  5 |    "metadata": {},
  6 |    "source": [
  7 |     "# Tokenization, Tagging, Chunking - Chunking"
  8 |    ]
  9 |   },
 10 |   {
 11 |    "cell_type": "markdown",
 12 |    "metadata": {},
 13 |    "source": [
 14 |     "Through chunking, we can prevent two word entities from being split."
 15 |    ]
 16 |   },
 17 |   {
 18 |    "cell_type": "code",
 19 |    "execution_count": 1,
 20 |    "metadata": {
 21 |     "collapsed": true
 22 |    },
 23 |    "outputs": [],
 24 |    "source": [
 25 |     "sentence = \"I will go to the coffee shop in New York after I get off the jet plane.\""
 26 |    ]
 27 |   },
 28 |   {
 29 |    "cell_type": "code",
 30 |    "execution_count": 2,
 31 |    "metadata": {
 32 |     "collapsed": true
 33 |    },
 34 |    "outputs": [],
 35 |    "source": [
 36 |     "import nltk"
 37 |    ]
 38 |   },
 39 |   {
 40 |    "cell_type": "code",
 41 |    "execution_count": 3,
 42 |    "metadata": {
 43 |     "collapsed": true
 44 |    },
 45 |    "outputs": [],
 46 |    "source": [
 47 |     "sent_tag = nltk.pos_tag(nltk.word_tokenize(sentence))"
 48 |    ]
 49 |   },
 50 |   {
 51 |    "cell_type": "code",
 52 |    "execution_count": 4,
 53 |    "metadata": {
 54 |     "collapsed": false
 55 |    },
 56 |    "outputs": [
 57 |     {
 58 |      "data": {
 59 |       "text/plain": [
 60 |        "[('I', 'PRP'),\n",
 61 |        " ('will', 'MD'),\n",
 62 |        " ('go', 'VB'),\n",
 63 |        " ('to', 'TO'),\n",
 64 |        " ('the', 'DT'),\n",
 65 |        " ('coffee', 'NN'),\n",
 66 |        " ('shop', 'NN'),\n",
 67 |        " ('in', 'IN'),\n",
 68 |        " ('New', 'NNP'),\n",
 69 |        " ('York', 'NNP'),\n",
 70 |        " ('after', 'IN'),\n",
 71 |        " ('I', 'PRP'),\n",
 72 |        " ('get', 'VBP'),\n",
 73 |        " ('off', 'IN'),\n",
 74 |        " ('the', 'DT'),\n",
 75 |        " ('jet', 'NN'),\n",
 76 |        " ('plane', 'NN'),\n",
 77 |        " ('.', '.')]"
 78 |       ]
 79 |      },
 80 |      "execution_count": 4,
 81 |      "metadata": {},
 82 |      "output_type": "execute_result"
 83 |     }
 84 |    ],
 85 |    "source": [
 86 |     "sent_tag"
 87 |    ]
 88 |   },
 89 |   {
 90 |    "cell_type": "code",
 91 |    "execution_count": 5,
 92 |    "metadata": {
 93 |     "collapsed": true
 94 |    },
 95 |    "outputs": [],
 96 |    "source": [
 97 |     "sequence =  '''\n",
 98 |     "            CHUNK: {<NNP>+}\n",
 99 |     "                   {<NN>+}\n",
100 |     "            '''"
101 |    ]
102 |   },
103 |   {
104 |    "cell_type": "code",
105 |    "execution_count": 6,
106 |    "metadata": {
107 |     "collapsed": false
108 |    },
109 |    "outputs": [],
110 |    "source": [
111 |     "NPChunker = nltk.RegexpParser(sequence)"
112 |    ]
113 |   },
114 |   {
115 |    "cell_type": "code",
116 |    "execution_count": 7,
117 |    "metadata": {
118 |     "collapsed": false
119 |    },
120 |    "outputs": [],
121 |    "source": [
122 |     "result = NPChunker.parse(sent_tag)"
123 |    ]
124 |   },
125 |   {
126 |    "cell_type": "code",
127 |    "execution_count": 9,
128 |    "metadata": {
129 |     "collapsed": false
130 |    },
131 |    "outputs": [
132 |     {
133 |      "name": "stdout",
134 |      "output_type": "stream",
135 |      "text": [
136 |       "(S\n",
137 |       "  I/PRP\n",
138 |       "  will/MD\n",
139 |       "  go/VB\n",
140 |       "  to/TO\n",
141 |       "  the/DT\n",
142 |       "  (CHUNK coffee/NN shop/NN)\n",
143 |       "  in/IN\n",
144 |       "  (CHUNK New/NNP York/NNP)\n",
145 |       "  after/IN\n",
146 |       "  I/PRP\n",
147 |       "  get/VBP\n",
148 |       "  off/IN\n",
149 |       "  the/DT\n",
150 |       "  (CHUNK jet/NN plane/NN)\n",
151 |       "  ./.)\n"
152 |      ]
153 |     }
154 |    ],
155 |    "source": [
156 |     "print(result)"
157 |    ]
158 |   }
159 |  ],
160 |  "metadata": {
161 |   "kernelspec": {
162 |    "display_name": "Python 3",
163 |    "language": "python",
164 |    "name": "python3"
165 |   },
166 |   "language_info": {
167 |    "codemirror_mode": {
168 |     "name": "ipython",
169 |     "version": 3
170 |    },
171 |    "file_extension": ".py",
172 |    "mimetype": "text/x-python",
173 |    "name": "python",
174 |    "nbconvert_exporter": "python",
175 |    "pygments_lexer": "ipython3",
176 |    "version": "3.6.0"
177 |   }
178 |  },
179 |  "nbformat": 4,
180 |  "nbformat_minor": 2
181 | }
182 | 


--------------------------------------------------------------------------------
/3. Tokenization, Tagging, Chunking/7. Example - Named Entity Recognition.ipynb:
--------------------------------------------------------------------------------
  1 | {
  2 |  "cells": [
  3 |   {
  4 |    "cell_type": "markdown",
  5 |    "metadata": {},
  6 |    "source": [
  7 |     "# Tokenization, Tagging, Chunking - Example: Named Entity Recognition"
  8 |    ]
  9 |   },
 10 |   {
 11 |    "cell_type": "code",
 12 |    "execution_count": 9,
 13 |    "metadata": {
 14 |     "collapsed": true
 15 |    },
 16 |    "outputs": [],
 17 |    "source": [
 18 |     "import nltk"
 19 |    ]
 20 |   },
 21 |   {
 22 |    "cell_type": "code",
 23 |    "execution_count": 12,
 24 |    "metadata": {},
 25 |    "outputs": [],
 26 |    "source": [
 27 |     "text = open(\"example.txt\").read()"
 28 |    ]
 29 |   },
 30 |   {
 31 |    "cell_type": "code",
 32 |    "execution_count": 13,
 33 |    "metadata": {
 34 |     "collapsed": false
 35 |    },
 36 |    "outputs": [
 37 |     {
 38 |      "data": {
 39 |       "text/plain": [
 40 |        "'World War II (WWII or WW2), also known as the Second World War, was a global war that lasted from 1939 to 1945, though related conflicts began earlier. It involved the vast majority of the world\\'s nations—including all of the great powers—eventually forming two opposing military alliances: the Allies and the Axis. It was the most widespread war in history, and directly involved more than 100 million people from over 30 countries. In a state of \"total war\", the major participants threw their entire economic, industrial, and scientific capabilities behind the war effort, erasing the distinction between civilian and military resources. Marked by mass deaths of civilians, including the Holocaust (in which approximately 11 million people were killed) and the strategic bombing of industrial and population centres (in which approximately one million were killed, and which included the atomic bombings of Hiroshima and Nagasaki), it resulted in an estimated 50 million to 85 million fatalities. These made World War II the deadliest conflict in human history.\\n\\nThe Empire of Japan aimed to dominate Asia and the Pacific and was already at war with the Republic of China in 1937, but the world war is generally said to have begun on 1 September 1939 with the invasion of Poland by Germany and subsequent declarations of war on Germany by France and the United Kingdom. From late 1939 to early 1941, in a series of campaigns and treaties, Germany conquered or controlled much of continental Europe, and formed the Axis alliance with Italy and Japan. Based on the Molotov–Ribbentrop Pact of August 1939, Germany and the Soviet Union partitioned and annexed territories of their European neighbours, Poland, Finland, Romania and the Baltic states. For a year starting in late June 1940, the United Kingdom and the British Commonwealth were the only Allied forces continuing the fight against the European Axis powers, with campaigns in North Africa and the Horn of Africa, the aerial Battle of Britain and the Blitz bombing campaign, as well as the long-running Battle of the Atlantic. In June 1941, the European Axis powers launched an invasion of the Soviet Union, opening the largest land theatre of war in history, which trapped the major part of the Axis\\' military forces into a war of attrition. In December 1941, Japan attacked the United States and European territories in the Pacific Ocean, and quickly conquered much of the Western Pacific.\\n\\nThe Axis advance halted in 1942 when Japan lost the critical Battle of Midway, near Hawaii, and Germany was defeated in North Africa and then, decisively, at Stalingrad in the Soviet Union. In 1943, with a series of German defeats on the Eastern Front, the Allied invasion of Italy which brought about Italian surrender, and Allied victories in the Pacific, the Axis lost the initiative and undertook strategic retreat on all fronts. In 1944, the Western Allies invaded German-occupied France, while the Soviet Union regained all of its territorial losses and invaded Germany and its allies. During 1944 and 1945 the Japanese suffered major reverses in mainland Asia in South Central China and Burma, while the Allies crippled the Japanese Navy and captured key Western Pacific islands.\\n\\nThe war in Europe ended with an invasion of Germany by the Western Allies and the Soviet Union culminating in the capture of Berlin by Soviet and Polish troops and the subsequent German unconditional surrender on 8 May 1945. Following the Potsdam Declaration by the Allies on 26 July 1945 and the refusal of Japan to surrender under its terms, the United States dropped atomic bombs on the Japanese cities of Hiroshima and Nagasaki on 6 August and 9 August respectively. With an invasion of the Japanese archipelago imminent, the possibility of additional atomic bombings, and the Soviet Union\\'s declaration of war on Japan and invasion of Manchuria, Japan surrendered on 15 August 1945. Thus ended the war in Asia, cementing the total victory of the Allies.\\n\\nWorld War II altered the political alignment and social structure of the world. The United Nations (UN) was established to foster international co-operation and prevent future conflicts. The victorious great powers—the United States, the Soviet Union, China, the United Kingdom, and France—became the permanent members of the United Nations Security Council. The Soviet Union and the United States emerged as rival superpowers, setting the stage for the Cold War, which lasted for the next 46 years. Meanwhile, the influence of European great powers waned, while the decolonisation of Asia and Africa began. Most countries whose industries had been damaged moved towards economic recovery. Political integration, especially in Europe, emerged as an effort to end pre-war enmities and to create a common identity.'"
 41 |       ]
 42 |      },
 43 |      "execution_count": 13,
 44 |      "metadata": {},
 45 |      "output_type": "execute_result"
 46 |     }
 47 |    ],
 48 |    "source": [
 49 |     "text"
 50 |    ]
 51 |   },
 52 |   {
 53 |    "cell_type": "code",
 54 |    "execution_count": 14,
 55 |    "metadata": {
 56 |     "collapsed": true
 57 |    },
 58 |    "outputs": [],
 59 |    "source": [
 60 |     "text_tag = nltk.pos_tag(nltk.word_tokenize(text))"
 61 |    ]
 62 |   },
 63 |   {
 64 |    "cell_type": "code",
 65 |    "execution_count": 15,
 66 |    "metadata": {
 67 |     "collapsed": false
 68 |    },
 69 |    "outputs": [],
 70 |    "source": [
 71 |     "text_ch = nltk.ne_chunk(text_tag)"
 72 |    ]
 73 |   },
 74 |   {
 75 |    "cell_type": "code",
 76 |    "execution_count": 17,
 77 |    "metadata": {
 78 |     "collapsed": false
 79 |    },
 80 |    "outputs": [
 81 |     {
 82 |      "name": "stdout",
 83 |      "output_type": "stream",
 84 |      "text": [
 85 |       "ORGANIZATION WWII\n",
 86 |       "ORGANIZATION WW2\n",
 87 |       "ORGANIZATION Second\n",
 88 |       "ORGANIZATION Axis\n",
 89 |       "ORGANIZATION Hiroshima\n",
 90 |       "GPE Nagasaki\n",
 91 |       "ORGANIZATION Empire of Japan\n",
 92 |       "GPE Asia\n",
 93 |       "ORGANIZATION Pacific\n",
 94 |       "ORGANIZATION Republic\n",
 95 |       "GPE China\n",
 96 |       "GPE Poland\n",
 97 |       "GPE Germany\n",
 98 |       "GPE Germany\n",
 99 |       "GPE France\n",
100 |       "ORGANIZATION United Kingdom\n",
101 |       "GPE Germany\n",
102 |       "GPE Europe\n",
103 |       "ORGANIZATION Axis\n",
104 |       "GPE Italy\n",
105 |       "GPE Japan\n",
106 |       "GPE Germany\n",
107 |       "GPE Soviet Union\n",
108 |       "GPE European\n",
109 |       "GPE Poland\n",
110 |       "GPE Finland\n",
111 |       "GPE Romania\n",
112 |       "GPE Baltic\n",
113 |       "ORGANIZATION United Kingdom\n",
114 |       "GPE British\n",
115 |       "ORGANIZATION European Axis\n",
116 |       "GPE North Africa\n",
117 |       "ORGANIZATION Horn\n",
118 |       "GPE Africa\n",
119 |       "GPE Britain\n",
120 |       "GPE Blitz\n",
121 |       "ORGANIZATION Atlantic\n",
122 |       "ORGANIZATION European Axis\n",
123 |       "GPE Soviet Union\n",
124 |       "ORGANIZATION Axis\n",
125 |       "GPE Japan\n",
126 |       "GPE United States\n",
127 |       "GPE European\n",
128 |       "ORGANIZATION Pacific Ocean\n",
129 |       "LOCATION Western Pacific\n",
130 |       "ORGANIZATION Axis\n",
131 |       "PERSON Japan\n",
132 |       "GPE Midway\n",
133 |       "GPE Hawaii\n",
134 |       "GPE Germany\n",
135 |       "GPE North Africa\n",
136 |       "FACILITY Stalingrad\n",
137 |       "GPE Soviet Union\n",
138 |       "GPE German\n",
139 |       "LOCATION Eastern Front\n",
140 |       "GPE Italy\n",
141 |       "GPE Italian\n",
142 |       "GPE Allied\n",
143 |       "ORGANIZATION Pacific\n",
144 |       "ORGANIZATION Axis\n",
145 |       "LOCATION Western\n",
146 |       "GPE France\n",
147 |       "GPE Soviet Union\n",
148 |       "GPE Germany\n",
149 |       "GPE Japanese\n",
150 |       "GPE Asia\n",
151 |       "GPE South\n",
152 |       "GPE China\n",
153 |       "GPE Burma\n",
154 |       "GPE Japanese\n",
155 |       "ORGANIZATION Navy\n",
156 |       "LOCATION Western Pacific\n",
157 |       "GPE Europe\n",
158 |       "GPE Germany\n",
159 |       "LOCATION Western\n",
160 |       "GPE Soviet Union\n",
161 |       "GPE Berlin\n",
162 |       "GPE Soviet\n",
163 |       "GPE Polish\n",
164 |       "GPE German\n",
165 |       "ORGANIZATION Potsdam\n",
166 |       "GPE Japan\n",
167 |       "GPE United States\n",
168 |       "GPE Japanese\n",
169 |       "ORGANIZATION Hiroshima\n",
170 |       "PERSON Nagasaki\n",
171 |       "GPE Japanese\n",
172 |       "GPE Soviet Union\n",
173 |       "GPE Japan\n",
174 |       "GPE Manchuria\n",
175 |       "GPE Japan\n",
176 |       "GPE Asia\n",
177 |       "ORGANIZATION United Nations\n",
178 |       "GPE United States\n",
179 |       "GPE Soviet Union\n",
180 |       "GPE China\n",
181 |       "ORGANIZATION United Kingdom\n",
182 |       "ORGANIZATION United Nations\n",
183 |       "ORGANIZATION Security Council\n",
184 |       "GPE Soviet Union\n",
185 |       "GPE United States\n",
186 |       "GPE European\n",
187 |       "GPE Asia\n",
188 |       "PERSON Africa\n",
189 |       "GPE Europe\n"
190 |      ]
191 |     }
192 |    ],
193 |    "source": [
194 |     "for chunk in text_ch:\n",
195 |     "    if hasattr(chunk, 'label'):\n",
196 |     "        print(chunk.label(), ' '.join(c[0] for c in chunk.leaves()))"
197 |    ]
198 |   }
199 |  ],
200 |  "metadata": {
201 |   "kernelspec": {
202 |    "display_name": "Python 3",
203 |    "language": "python",
204 |    "name": "python3"
205 |   },
206 |   "language_info": {
207 |    "codemirror_mode": {
208 |     "name": "ipython",
209 |     "version": 3
210 |    },
211 |    "file_extension": ".py",
212 |    "mimetype": "text/x-python",
213 |    "name": "python",
214 |    "nbconvert_exporter": "python",
215 |    "pygments_lexer": "ipython3",
216 |    "version": "3.6.0"
217 |   }
218 |  },
219 |  "nbformat": 4,
220 |  "nbformat_minor": 2
221 | }
222 | 


--------------------------------------------------------------------------------
/3. Tokenization, Tagging, Chunking/example.txt:
--------------------------------------------------------------------------------
1 | World War II (WWII or WW2), also known as the Second World War, was a global war that lasted from 1939 to 1945, though related conflicts began earlier. It involved the vast majority of the world's nations—including all of the great powers—eventually forming two opposing military alliances: the Allies and the Axis. It was the most widespread war in history, and directly involved more than 100 million people from over 30 countries. In a state of "total war", the major participants threw their entire economic, industrial, and scientific capabilities behind the war effort, erasing the distinction between civilian and military resources. Marked by mass deaths of civilians, including the Holocaust (in which approximately 11 million people were killed) and the strategic bombing of industrial and population centres (in which approximately one million were killed, and which included the atomic bombings of Hiroshima and Nagasaki), it resulted in an estimated 50 million to 85 million fatalities. These made World War II the deadliest conflict in human history.
2 | 
3 | The Empire of Japan aimed to dominate Asia and the Pacific and was already at war with the Republic of China in 1937, but the world war is generally said to have begun on 1 September 1939 with the invasion of Poland by Germany and subsequent declarations of war on Germany by France and the United Kingdom. From late 1939 to early 1941, in a series of campaigns and treaties, Germany conquered or controlled much of continental Europe, and formed the Axis alliance with Italy and Japan. Based on the Molotov–Ribbentrop Pact of August 1939, Germany and the Soviet Union partitioned and annexed territories of their European neighbours, Poland, Finland, Romania and the Baltic states. For a year starting in late June 1940, the United Kingdom and the British Commonwealth were the only Allied forces continuing the fight against the European Axis powers, with campaigns in North Africa and the Horn of Africa, the aerial Battle of Britain and the Blitz bombing campaign, as well as the long-running Battle of the Atlantic. In June 1941, the European Axis powers launched an invasion of the Soviet Union, opening the largest land theatre of war in history, which trapped the major part of the Axis' military forces into a war of attrition. In December 1941, Japan attacked the United States and European territories in the Pacific Ocean, and quickly conquered much of the Western Pacific.
4 | 
5 | The Axis advance halted in 1942 when Japan lost the critical Battle of Midway, near Hawaii, and Germany was defeated in North Africa and then, decisively, at Stalingrad in the Soviet Union. In 1943, with a series of German defeats on the Eastern Front, the Allied invasion of Italy which brought about Italian surrender, and Allied victories in the Pacific, the Axis lost the initiative and undertook strategic retreat on all fronts. In 1944, the Western Allies invaded German-occupied France, while the Soviet Union regained all of its territorial losses and invaded Germany and its allies. During 1944 and 1945 the Japanese suffered major reverses in mainland Asia in South Central China and Burma, while the Allies crippled the Japanese Navy and captured key Western Pacific islands.
6 | 
7 | The war in Europe ended with an invasion of Germany by the Western Allies and the Soviet Union culminating in the capture of Berlin by Soviet and Polish troops and the subsequent German unconditional surrender on 8 May 1945. Following the Potsdam Declaration by the Allies on 26 July 1945 and the refusal of Japan to surrender under its terms, the United States dropped atomic bombs on the Japanese cities of Hiroshima and Nagasaki on 6 August and 9 August respectively. With an invasion of the Japanese archipelago imminent, the possibility of additional atomic bombings, and the Soviet Union's declaration of war on Japan and invasion of Manchuria, Japan surrendered on 15 August 1945. Thus ended the war in Asia, cementing the total victory of the Allies.
8 | 
9 | World War II altered the political alignment and social structure of the world. The United Nations (UN) was established to foster international co-operation and prevent future conflicts. The victorious great powers—the United States, the Soviet Union, China, the United Kingdom, and France—became the permanent members of the United Nations Security Council. The Soviet Union and the United States emerged as rival superpowers, setting the stage for the Cold War, which lasted for the next 46 years. Meanwhile, the influence of European great powers waned, while the decolonisation of Asia and Africa began. Most countries whose industries had been damaged moved towards economic recovery. Political integration, especially in Europe, emerged as an effort to end pre-war enmities and to create a common identity.


--------------------------------------------------------------------------------
/4. Custom Sources/2. HTML.ipynb:
--------------------------------------------------------------------------------
  1 | {
  2 |  "cells": [
  3 |   {
  4 |    "cell_type": "markdown",
  5 |    "metadata": {},
  6 |    "source": [
  7 |     "# Custom Sources - HTML"
  8 |    ]
  9 |   },
 10 |   {
 11 |    "cell_type": "code",
 12 |    "execution_count": 1,
 13 |    "metadata": {
 14 |     "collapsed": true
 15 |    },
 16 |    "outputs": [],
 17 |    "source": [
 18 |     "import nltk\n",
 19 |     "import urllib.request"
 20 |    ]
 21 |   },
 22 |   {
 23 |    "cell_type": "markdown",
 24 |    "metadata": {},
 25 |    "source": [
 26 |     "Websites are written in HTML, so when you pull information directly from a site, you will get all the code back along with the text."
 27 |    ]
 28 |   },
 29 |   {
 30 |    "cell_type": "code",
 31 |    "execution_count": 2,
 32 |    "metadata": {
 33 |     "collapsed": true
 34 |    },
 35 |    "outputs": [],
 36 |    "source": [
 37 |     "url = \"https://en.wikipedia.org/wiki/Python_(programming_language)\""
 38 |    ]
 39 |   },
 40 |   {
 41 |    "cell_type": "code",
 42 |    "execution_count": 3,
 43 |    "metadata": {},
 44 |    "outputs": [],
 45 |    "source": [
 46 |     "response = urllib.request.urlopen(url)"
 47 |    ]
 48 |   },
 49 |   {
 50 |    "cell_type": "code",
 51 |    "execution_count": 4,
 52 |    "metadata": {},
 53 |    "outputs": [],
 54 |    "source": [
 55 |     "html = response.read()"
 56 |    ]
 57 |   },
 58 |   {
 59 |    "cell_type": "code",
 60 |    "execution_count": 5,
 61 |    "metadata": {
 62 |     "collapsed": false
 63 |    },
 64 |    "outputs": [
 65 |     {
 66 |      "data": {
 67 |       "text/plain": [
 68 |        "b'<!DOCTYPE html>\\n<html class=\"client-nojs\" lang=\"en\" dir=\"ltr\">\\n<head>\\n<meta charset=\"UTF-8\"/>\\n<title>Python (programming language) - Wikipedia</title>\\n<script>document.documentElement.className = document.documentElement.className.replace( /(^|\\\\s)client-nojs(\\\\s|$)/, \"$1client-js$2\" );</script>\\n<script>(window.RLQ=window.RLQ||[]).push(function(){mw.config.set({\"wgCanonicalNamespace\":\"\",\"wgCanonicalSpecialPageName\":false,\"wgNamespaceNumber\":0,\"wgPageName\":\"Python_(programming_language)\",\"wgTitle\":\"Python (programming language)\",\"wgCurRevisionId\":864962678,\"wgRevisionId\":864962678,\"wgArticleId\":23862,\"wgIsArticle\":true,\"wgIsRedirect\":false,\"wgAction\":\"view\",\"wgUserName\":null,\"wgUserGroups\":[\"*\"],\"wgCategories\":[\"CS1 errors: external links\",\"Wikipedia semi-protected pages\",\"Use dmy dates from August 2015\",\"Wikipedia articles needing clarification from May 2018\",\"All articles with unsourced statements\",\"Articles with unsourced statements from May 2018\",\"Articles containing potentially dated'"
 69 |       ]
 70 |      },
 71 |      "execution_count": 5,
 72 |      "metadata": {},
 73 |      "output_type": "execute_result"
 74 |     }
 75 |    ],
 76 |    "source": [
 77 |     "html[:1000]"
 78 |    ]
 79 |   },
 80 |   {
 81 |    "cell_type": "markdown",
 82 |    "metadata": {},
 83 |    "source": [
 84 |     "We will use a Python library called BeautifulSoup to make working with the HTML easier."
 85 |    ]
 86 |   },
 87 |   {
 88 |    "cell_type": "code",
 89 |    "execution_count": 6,
 90 |    "metadata": {
 91 |     "collapsed": true
 92 |    },
 93 |    "outputs": [],
 94 |    "source": [
 95 |     "from bs4 import BeautifulSoup"
 96 |    ]
 97 |   },
 98 |   {
 99 |    "cell_type": "code",
100 |    "execution_count": 7,
101 |    "metadata": {},
102 |    "outputs": [],
103 |    "source": [
104 |     "soup = BeautifulSoup(html, \"html5lib\")"
105 |    ]
106 |   },
107 |   {
108 |    "cell_type": "markdown",
109 |    "metadata": {},
110 |    "source": [
111 |     "Wikipedia places most readable text within paragraph tags or \"p\" tags."
112 |    ]
113 |   },
114 |   {
115 |    "cell_type": "code",
116 |    "execution_count": 8,
117 |    "metadata": {},
118 |    "outputs": [],
119 |    "source": [
120 |     "web_paragraph = [p_tag.text for p_tag in soup.find_all('p')]"
121 |    ]
122 |   },
123 |   {
124 |    "cell_type": "code",
125 |    "execution_count": 9,
126 |    "metadata": {},
127 |    "outputs": [],
128 |    "source": [
129 |     "web_tokens = [nltk.word_tokenize(paragraph) for paragraph in web_paragraph]"
130 |    ]
131 |   },
132 |   {
133 |    "cell_type": "markdown",
134 |    "metadata": {},
135 |    "source": [
136 |     "Now that we have each paragraph tokenized, we can find the first one on the page."
137 |    ]
138 |   },
139 |   {
140 |    "cell_type": "code",
141 |    "execution_count": 10,
142 |    "metadata": {
143 |     "collapsed": false
144 |    },
145 |    "outputs": [
146 |     {
147 |      "name": "stdout",
148 |      "output_type": "stream",
149 |      "text": [
150 |       "['Python', 'is', 'an', 'interpreted', 'high-level', 'programming', 'language', 'for', 'general-purpose', 'programming', '.', 'Created', 'by', 'Guido', 'van', 'Rossum', 'and', 'first', 'released', 'in', '1991', ',', 'Python', 'has', 'a', 'design', 'philosophy', 'that', 'emphasizes', 'code', 'readability', ',', 'notably', 'using', 'significant', 'whitespace', '.', 'It', 'provides', 'constructs', 'that', 'enable', 'clear', 'programming', 'on', 'both', 'small', 'and', 'large', 'scales', '.', '[', '27', ']', 'In', 'July', '2018', ',', 'Van', 'Rossum', 'stepped', 'down', 'as', 'the', 'leader', 'in', 'the', 'language', 'community', 'after', '30', 'years', '.', '[', '28', ']', '[', '29', ']']\n"
151 |      ]
152 |     }
153 |    ],
154 |    "source": [
155 |     "print(web_tokens[2])"
156 |    ]
157 |   }
158 |  ],
159 |  "metadata": {
160 |   "kernelspec": {
161 |    "display_name": "Python 3",
162 |    "language": "python",
163 |    "name": "python3"
164 |   },
165 |   "language_info": {
166 |    "codemirror_mode": {
167 |     "name": "ipython",
168 |     "version": 3
169 |    },
170 |    "file_extension": ".py",
171 |    "mimetype": "text/x-python",
172 |    "name": "python",
173 |    "nbconvert_exporter": "python",
174 |    "pygments_lexer": "ipython3",
175 |    "version": "3.6.0"
176 |   }
177 |  },
178 |  "nbformat": 4,
179 |  "nbformat_minor": 2
180 | }
181 | 


--------------------------------------------------------------------------------
/4. Custom Sources/3. URL.ipynb:
--------------------------------------------------------------------------------
  1 | {
  2 |  "cells": [
  3 |   {
  4 |    "cell_type": "markdown",
  5 |    "metadata": {},
  6 |    "source": [
  7 |     "# Custom Sources - URL"
  8 |    ]
  9 |   },
 10 |   {
 11 |    "cell_type": "code",
 12 |    "execution_count": 1,
 13 |    "metadata": {
 14 |     "collapsed": true
 15 |    },
 16 |    "outputs": [],
 17 |    "source": [
 18 |     "import nltk\n",
 19 |     "import urllib.request"
 20 |    ]
 21 |   },
 22 |   {
 23 |    "cell_type": "markdown",
 24 |    "metadata": {},
 25 |    "source": [
 26 |     "We can bring in text directly from a url if the source itself is text."
 27 |    ]
 28 |   },
 29 |   {
 30 |    "cell_type": "code",
 31 |    "execution_count": 2,
 32 |    "metadata": {
 33 |     "collapsed": true
 34 |    },
 35 |    "outputs": [],
 36 |    "source": [
 37 |     "url = \"http://www.gutenberg.org/cache/epub/26275/pg26275.txt\""
 38 |    ]
 39 |   },
 40 |   {
 41 |    "cell_type": "code",
 42 |    "execution_count": 3,
 43 |    "metadata": {},
 44 |    "outputs": [],
 45 |    "source": [
 46 |     "response = urllib.request.urlopen(url)\n",
 47 |     "odyssey_str = response.read().decode(\"utf-8-sig\")"
 48 |    ]
 49 |   },
 50 |   {
 51 |    "cell_type": "code",
 52 |    "execution_count": 4,
 53 |    "metadata": {
 54 |     "collapsed": false
 55 |    },
 56 |    "outputs": [
 57 |     {
 58 |      "data": {
 59 |       "text/plain": [
 60 |        "str"
 61 |       ]
 62 |      },
 63 |      "execution_count": 4,
 64 |      "metadata": {},
 65 |      "output_type": "execute_result"
 66 |     }
 67 |    ],
 68 |    "source": [
 69 |     "type(odyssey_str)"
 70 |    ]
 71 |   },
 72 |   {
 73 |    "cell_type": "code",
 74 |    "execution_count": 5,
 75 |    "metadata": {
 76 |     "collapsed": false
 77 |    },
 78 |    "outputs": [
 79 |     {
 80 |      "data": {
 81 |       "text/plain": [
 82 |        "\"The Project Gutenberg EBook of Homer's Odyssey\""
 83 |       ]
 84 |      },
 85 |      "execution_count": 5,
 86 |      "metadata": {},
 87 |      "output_type": "execute_result"
 88 |     }
 89 |    ],
 90 |    "source": [
 91 |     "odyssey_str[0:46]"
 92 |    ]
 93 |   },
 94 |   {
 95 |    "cell_type": "code",
 96 |    "execution_count": 6,
 97 |    "metadata": {
 98 |     "collapsed": true
 99 |    },
100 |    "outputs": [],
101 |    "source": [
102 |     "odyssey_tokens = nltk.word_tokenize(odyssey_str)"
103 |    ]
104 |   },
105 |   {
106 |    "cell_type": "code",
107 |    "execution_count": 7,
108 |    "metadata": {
109 |     "collapsed": false
110 |    },
111 |    "outputs": [
112 |     {
113 |      "data": {
114 |       "text/plain": [
115 |        "145070"
116 |       ]
117 |      },
118 |      "execution_count": 7,
119 |      "metadata": {},
120 |      "output_type": "execute_result"
121 |     }
122 |    ],
123 |    "source": [
124 |     "len(odyssey_tokens)"
125 |    ]
126 |   },
127 |   {
128 |    "cell_type": "code",
129 |    "execution_count": 8,
130 |    "metadata": {
131 |     "collapsed": true
132 |    },
133 |    "outputs": [],
134 |    "source": [
135 |     "odyssey_text = nltk.Text(odyssey_tokens)"
136 |    ]
137 |   },
138 |   {
139 |    "cell_type": "code",
140 |    "execution_count": 9,
141 |    "metadata": {
142 |     "collapsed": false
143 |    },
144 |    "outputs": [
145 |     {
146 |      "name": "stdout",
147 |      "output_type": "stream",
148 |      "text": [
149 |       "['The', 'Project', 'Gutenberg', 'EBook', 'of', 'Homer', \"'s\", 'Odyssey']\n"
150 |      ]
151 |     }
152 |    ],
153 |    "source": [
154 |     "print(odyssey_text[:8])"
155 |    ]
156 |   }
157 |  ],
158 |  "metadata": {
159 |   "kernelspec": {
160 |    "display_name": "Python 3",
161 |    "language": "python",
162 |    "name": "python3"
163 |   },
164 |   "language_info": {
165 |    "codemirror_mode": {
166 |     "name": "ipython",
167 |     "version": 3
168 |    },
169 |    "file_extension": ".py",
170 |    "mimetype": "text/x-python",
171 |    "name": "python",
172 |    "nbconvert_exporter": "python",
173 |    "pygments_lexer": "ipython3",
174 |    "version": "3.6.0"
175 |   }
176 |  },
177 |  "nbformat": 4,
178 |  "nbformat_minor": 2
179 | }
180 | 


--------------------------------------------------------------------------------
/4. Custom Sources/4. CSV (Spreadsheet).ipynb:
--------------------------------------------------------------------------------
  1 | {
  2 |  "cells": [
  3 |   {
  4 |    "cell_type": "markdown",
  5 |    "metadata": {},
  6 |    "source": [
  7 |     "# Custom Sources - CSV (Spreadsheet)"
  8 |    ]
  9 |   },
 10 |   {
 11 |    "cell_type": "code",
 12 |    "execution_count": 1,
 13 |    "metadata": {
 14 |     "collapsed": true
 15 |    },
 16 |    "outputs": [],
 17 |    "source": [
 18 |     "import nltk\n",
 19 |     "import csv"
 20 |    ]
 21 |   },
 22 |   {
 23 |    "cell_type": "markdown",
 24 |    "metadata": {},
 25 |    "source": [
 26 |     "Importing text stored in cells on a spreadsheet or in a csv file (comma separated values)."
 27 |    ]
 28 |   },
 29 |   {
 30 |    "cell_type": "code",
 31 |    "execution_count": 3,
 32 |    "metadata": {
 33 |     "collapsed": false
 34 |    },
 35 |    "outputs": [],
 36 |    "source": [
 37 |     "comments = []\n",
 38 |     "with open(\"reviews.csv\", \"r\") as file:\n",
 39 |     "    reader = csv.reader(file)\n",
 40 |     "    for row in reader:\n",
 41 |     "        comments.append(row)"
 42 |    ]
 43 |   },
 44 |   {
 45 |    "cell_type": "code",
 46 |    "execution_count": 4,
 47 |    "metadata": {
 48 |     "collapsed": false
 49 |    },
 50 |    "outputs": [
 51 |     {
 52 |      "data": {
 53 |       "text/plain": [
 54 |        "[\"This camera is perfect for an enthusiastic amateur photographer. The pictures are razor-sharp, even in macro. It is small enough to fit easily in a coat pocket or purse. It is light enough to carry around all day without bother. Operating its many features is easy and often obvious - i'm no annie lebovitz, but i was able to figure out most of its abilities just messing around with it at a camera store. The manual does a fine job filling in any blanks that remain. The auto-focus performs well, but i love having the 12 optional scene modes - they are dummy-proof, and correspond to many situations in which i would actually seek to use the camera. Comes with a 16 mb compact flash and one rechargeable battery the charging unit, included, is fast and small. I bought a 256 mb cf and a second battery, so it's good to go on a long vacation. I enthusiastically recommend this camera.\"]"
 55 |       ]
 56 |      },
 57 |      "execution_count": 4,
 58 |      "metadata": {},
 59 |      "output_type": "execute_result"
 60 |     }
 61 |    ],
 62 |    "source": [
 63 |     "comments[0]"
 64 |    ]
 65 |   },
 66 |   {
 67 |    "cell_type": "markdown",
 68 |    "metadata": {},
 69 |    "source": [
 70 |     "This command will give us a list, where each element is another list of all the tokens for each comment."
 71 |    ]
 72 |   },
 73 |   {
 74 |    "cell_type": "code",
 75 |    "execution_count": 5,
 76 |    "metadata": {
 77 |     "collapsed": false
 78 |    },
 79 |    "outputs": [],
 80 |    "source": [
 81 |     "tokens = [nltk.word_tokenize(str(entry)) for entry in comments]"
 82 |    ]
 83 |   },
 84 |   {
 85 |    "cell_type": "code",
 86 |    "execution_count": 6,
 87 |    "metadata": {
 88 |     "collapsed": false
 89 |    },
 90 |    "outputs": [
 91 |     {
 92 |      "data": {
 93 |       "text/plain": [
 94 |        "['[',\n",
 95 |        " '``',\n",
 96 |        " 'This',\n",
 97 |        " 'camera',\n",
 98 |        " 'is',\n",
 99 |        " 'perfect',\n",
100 |        " 'for',\n",
101 |        " 'an',\n",
102 |        " 'enthusiastic',\n",
103 |        " 'amateur',\n",
104 |        " 'photographer',\n",
105 |        " '.',\n",
106 |        " 'The',\n",
107 |        " 'pictures',\n",
108 |        " 'are',\n",
109 |        " 'razor-sharp',\n",
110 |        " ',',\n",
111 |        " 'even',\n",
112 |        " 'in',\n",
113 |        " 'macro',\n",
114 |        " '.',\n",
115 |        " 'It',\n",
116 |        " 'is',\n",
117 |        " 'small',\n",
118 |        " 'enough',\n",
119 |        " 'to',\n",
120 |        " 'fit',\n",
121 |        " 'easily',\n",
122 |        " 'in',\n",
123 |        " 'a',\n",
124 |        " 'coat',\n",
125 |        " 'pocket',\n",
126 |        " 'or',\n",
127 |        " 'purse',\n",
128 |        " '.',\n",
129 |        " 'It',\n",
130 |        " 'is',\n",
131 |        " 'light',\n",
132 |        " 'enough',\n",
133 |        " 'to',\n",
134 |        " 'carry',\n",
135 |        " 'around',\n",
136 |        " 'all',\n",
137 |        " 'day',\n",
138 |        " 'without',\n",
139 |        " 'bother',\n",
140 |        " '.',\n",
141 |        " 'Operating',\n",
142 |        " 'its',\n",
143 |        " 'many',\n",
144 |        " 'features',\n",
145 |        " 'is',\n",
146 |        " 'easy',\n",
147 |        " 'and',\n",
148 |        " 'often',\n",
149 |        " 'obvious',\n",
150 |        " '-',\n",
151 |        " 'i',\n",
152 |        " \"'m\",\n",
153 |        " 'no',\n",
154 |        " 'annie',\n",
155 |        " 'lebovitz',\n",
156 |        " ',',\n",
157 |        " 'but',\n",
158 |        " 'i',\n",
159 |        " 'was',\n",
160 |        " 'able',\n",
161 |        " 'to',\n",
162 |        " 'figure',\n",
163 |        " 'out',\n",
164 |        " 'most',\n",
165 |        " 'of',\n",
166 |        " 'its',\n",
167 |        " 'abilities',\n",
168 |        " 'just',\n",
169 |        " 'messing',\n",
170 |        " 'around',\n",
171 |        " 'with',\n",
172 |        " 'it',\n",
173 |        " 'at',\n",
174 |        " 'a',\n",
175 |        " 'camera',\n",
176 |        " 'store',\n",
177 |        " '.',\n",
178 |        " 'The',\n",
179 |        " 'manual',\n",
180 |        " 'does',\n",
181 |        " 'a',\n",
182 |        " 'fine',\n",
183 |        " 'job',\n",
184 |        " 'filling',\n",
185 |        " 'in',\n",
186 |        " 'any',\n",
187 |        " 'blanks',\n",
188 |        " 'that',\n",
189 |        " 'remain',\n",
190 |        " '.',\n",
191 |        " 'The',\n",
192 |        " 'auto-focus',\n",
193 |        " 'performs',\n",
194 |        " 'well',\n",
195 |        " ',',\n",
196 |        " 'but',\n",
197 |        " 'i',\n",
198 |        " 'love',\n",
199 |        " 'having',\n",
200 |        " 'the',\n",
201 |        " '12',\n",
202 |        " 'optional',\n",
203 |        " 'scene',\n",
204 |        " 'modes',\n",
205 |        " '-',\n",
206 |        " 'they',\n",
207 |        " 'are',\n",
208 |        " 'dummy-proof',\n",
209 |        " ',',\n",
210 |        " 'and',\n",
211 |        " 'correspond',\n",
212 |        " 'to',\n",
213 |        " 'many',\n",
214 |        " 'situations',\n",
215 |        " 'in',\n",
216 |        " 'which',\n",
217 |        " 'i',\n",
218 |        " 'would',\n",
219 |        " 'actually',\n",
220 |        " 'seek',\n",
221 |        " 'to',\n",
222 |        " 'use',\n",
223 |        " 'the',\n",
224 |        " 'camera',\n",
225 |        " '.',\n",
226 |        " 'Comes',\n",
227 |        " 'with',\n",
228 |        " 'a',\n",
229 |        " '16',\n",
230 |        " 'mb',\n",
231 |        " 'compact',\n",
232 |        " 'flash',\n",
233 |        " 'and',\n",
234 |        " 'one',\n",
235 |        " 'rechargeable',\n",
236 |        " 'battery',\n",
237 |        " 'the',\n",
238 |        " 'charging',\n",
239 |        " 'unit',\n",
240 |        " ',',\n",
241 |        " 'included',\n",
242 |        " ',',\n",
243 |        " 'is',\n",
244 |        " 'fast',\n",
245 |        " 'and',\n",
246 |        " 'small',\n",
247 |        " '.',\n",
248 |        " 'I',\n",
249 |        " 'bought',\n",
250 |        " 'a',\n",
251 |        " '256',\n",
252 |        " 'mb',\n",
253 |        " 'cf',\n",
254 |        " 'and',\n",
255 |        " 'a',\n",
256 |        " 'second',\n",
257 |        " 'battery',\n",
258 |        " ',',\n",
259 |        " 'so',\n",
260 |        " 'it',\n",
261 |        " \"'s\",\n",
262 |        " 'good',\n",
263 |        " 'to',\n",
264 |        " 'go',\n",
265 |        " 'on',\n",
266 |        " 'a',\n",
267 |        " 'long',\n",
268 |        " 'vacation',\n",
269 |        " '.',\n",
270 |        " 'I',\n",
271 |        " 'enthusiastically',\n",
272 |        " 'recommend',\n",
273 |        " 'this',\n",
274 |        " 'camera',\n",
275 |        " '.',\n",
276 |        " \"''\",\n",
277 |        " ']']"
278 |       ]
279 |      },
280 |      "execution_count": 6,
281 |      "metadata": {},
282 |      "output_type": "execute_result"
283 |     }
284 |    ],
285 |    "source": [
286 |     "tokens[0]"
287 |    ]
288 |   }
289 |  ],
290 |  "metadata": {
291 |   "kernelspec": {
292 |    "display_name": "Python 3",
293 |    "language": "python",
294 |    "name": "python3"
295 |   },
296 |   "language_info": {
297 |    "codemirror_mode": {
298 |     "name": "ipython",
299 |     "version": 3
300 |    },
301 |    "file_extension": ".py",
302 |    "mimetype": "text/x-python",
303 |    "name": "python",
304 |    "nbconvert_exporter": "python",
305 |    "pygments_lexer": "ipython3",
306 |    "version": "3.6.0"
307 |   }
308 |  },
309 |  "nbformat": 4,
310 |  "nbformat_minor": 2
311 | }
312 | 


--------------------------------------------------------------------------------
/4. Custom Sources/5. Exporting.ipynb:
--------------------------------------------------------------------------------
  1 | {
  2 |  "cells": [
  3 |   {
  4 |    "cell_type": "markdown",
  5 |    "metadata": {},
  6 |    "source": [
  7 |     "# Custom Sources - Exporting"
  8 |    ]
  9 |   },
 10 |   {
 11 |    "cell_type": "code",
 12 |    "execution_count": 1,
 13 |    "metadata": {
 14 |     "collapsed": true
 15 |    },
 16 |    "outputs": [],
 17 |    "source": [
 18 |     "import nltk"
 19 |    ]
 20 |   },
 21 |   {
 22 |    "cell_type": "code",
 23 |    "execution_count": 2,
 24 |    "metadata": {
 25 |     "collapsed": false
 26 |    },
 27 |    "outputs": [],
 28 |    "source": [
 29 |     "alice = nltk.corpus.gutenberg.words(\"carroll-alice.txt\")"
 30 |    ]
 31 |   },
 32 |   {
 33 |    "cell_type": "code",
 34 |    "execution_count": 3,
 35 |    "metadata": {
 36 |     "collapsed": true
 37 |    },
 38 |    "outputs": [],
 39 |    "source": [
 40 |     "alice = alice[:1000]"
 41 |    ]
 42 |   },
 43 |   {
 44 |    "cell_type": "markdown",
 45 |    "metadata": {},
 46 |    "source": [
 47 |     "We have to untokenize the text."
 48 |    ]
 49 |   },
 50 |   {
 51 |    "cell_type": "code",
 52 |    "execution_count": 4,
 53 |    "metadata": {
 54 |     "collapsed": false
 55 |    },
 56 |    "outputs": [],
 57 |    "source": [
 58 |     "alice_str = ' '.join(alice)"
 59 |    ]
 60 |   },
 61 |   {
 62 |    "cell_type": "code",
 63 |    "execution_count": 5,
 64 |    "metadata": {
 65 |     "collapsed": false
 66 |    },
 67 |    "outputs": [
 68 |     {
 69 |      "data": {
 70 |       "text/plain": [
 71 |        "\"[ Alice ' s Adventures in Wonderland by Lewis Carroll 1865 ] CHAPTER I . Down the Rabbit - Hole Alice was beginning to get very tired of sitting by her sister on the bank , and of having nothing to do : once or twice she had peeped into the book her sister was reading , but it had no pictures or conversations in it , ' and what is the use of a book ,' thought Alice ' without pictures or conversation ?' So she was considering in her own mind ( as well as she could , for the hot day made her feel very sleepy and stupid ), whether the pleasure of making a daisy - chain would be worth the trouble of getting up and picking the daisies , when suddenly a White Rabbit with pink eyes ran close by her . There was nothing so VERY remarkable in that ; nor did Alice think it so VERY much out of the way to hear the Rabbit say to itself , ' Oh dear ! Oh dear ! I shall be late !' ( when she thought it over afterwards , it occurred to her that she ought to have wondered at this , but at the time it all seemed quite natural ); but when the Rabbit actually TOOK A WATCH OUT OF ITS WAISTCOAT - POCKET , and looked at it , and then hurried on , Alice started to her feet , for it flashed across her mind that she had never before seen a rabbit with either a waistcoat - pocket , or a watch to take out of it , and burning with curiosity , she ran across the field after it , and fortunately was just in time to see it pop down a large rabbit - hole under the hedge . In another moment down went Alice after it , never once considering how in the world she was to get out again . The rabbit - hole went straight on like a tunnel for some way , and then dipped suddenly down , so suddenly that Alice had not a moment to think about stopping herself before she found herself falling down a very deep well . Either the well was very deep , or she fell very slowly , for she had plenty of time as she went down to look about her and to wonder what was going to happen next . First , she tried to look down and make out what she was coming to , but it was too dark to see anything ; then she looked at the sides of the well , and noticed that they were filled with cupboards and book - shelves ; here and there she saw maps and pictures hung upon pegs . She took down a jar from one of the shelves as she passed ; it was labelled ' ORANGE MARMALADE ', but to her great disappointment it was empty : she did not like to drop the jar for fear of killing somebody , so managed to put it into one of the cupboards as she fell past it . ' Well !' thought Alice to herself , ' after such a fall as this , I shall think nothing of tumbling down stairs ! How brave they ' ll all think me at home ! Why , I wouldn ' t say anything about it , even if I fell off the top of the house !' ( Which was very likely true .) Down , down , down . Would the fall NEVER come to an end ! ' I wonder how many miles I ' ve fallen by this time ?' she said aloud . ' I must be getting somewhere near the centre of the earth . Let me see : that would be four thousand miles down , I think --' ( for , you see , Alice had learnt several things of this sort in her lessons in the schoolroom , and though this was not a VERY good opportunity for showing off her knowledge , as there was no one to listen to her , still it was good practice to say it over ) '-- yes , that ' s about the right distance -- but then I wonder what Latitude or Longitude I ' ve got to ?' ( Alice had no idea what Latitude was , or Longitude either , but thought they were nice grand words to say .) Presently she began again . ' I wonder if I shall fall right THROUGH the earth ! How funny it ' ll seem to come out among the people that walk with their heads downward ! The Antipathies , I think --' ( she was rather glad there WAS no one listening , this time , as it didn ' t sound at all the right word ) '-- but I shall have to ask them what the name of the country is , you know . Please , Ma ' am , is this New Zealand or Australia ?' ( and she tried to curtsey as she spoke -- fancy CURTSEYING as you ' re falling through the air ! Do you think you could manage it ?) ' And what an ignorant little girl she ' ll think me for asking ! No , it ' ll never do to ask : perhaps I shall see it written up somewhere .' Down , down , down . There was nothing else to do , so Alice soon began talking again . ' Dinah ' ll miss me very much to - night , I should think !' ( Dinah was the cat .) ' I hope they ' ll remember her saucer of milk at tea - time . Dinah my dear ! I wish you were down here with me\""
 72 |       ]
 73 |      },
 74 |      "execution_count": 5,
 75 |      "metadata": {},
 76 |      "output_type": "execute_result"
 77 |     }
 78 |    ],
 79 |    "source": [
 80 |     "alice_str"
 81 |    ]
 82 |   },
 83 |   {
 84 |    "cell_type": "markdown",
 85 |    "metadata": {},
 86 |    "source": [
 87 |     "We create a new file."
 88 |    ]
 89 |   },
 90 |   {
 91 |    "cell_type": "code",
 92 |    "execution_count": 6,
 93 |    "metadata": {
 94 |     "collapsed": true
 95 |    },
 96 |    "outputs": [],
 97 |    "source": [
 98 |     "new_file = open('export.txt', 'w')"
 99 |    ]
100 |   },
101 |   {
102 |    "cell_type": "code",
103 |    "execution_count": 7,
104 |    "metadata": {
105 |     "collapsed": true
106 |    },
107 |    "outputs": [
108 |     {
109 |      "data": {
110 |       "text/plain": [
111 |        "4542"
112 |       ]
113 |      },
114 |      "execution_count": 7,
115 |      "metadata": {},
116 |      "output_type": "execute_result"
117 |     }
118 |    ],
119 |    "source": [
120 |     "new_file.write(alice_str)"
121 |    ]
122 |   },
123 |   {
124 |    "cell_type": "code",
125 |    "execution_count": 8,
126 |    "metadata": {
127 |     "collapsed": true
128 |    },
129 |    "outputs": [],
130 |    "source": [
131 |     "new_file.close()"
132 |    ]
133 |   }
134 |  ],
135 |  "metadata": {
136 |   "kernelspec": {
137 |    "display_name": "Python 3",
138 |    "language": "python",
139 |    "name": "python3"
140 |   },
141 |   "language_info": {
142 |    "codemirror_mode": {
143 |     "name": "ipython",
144 |     "version": 3
145 |    },
146 |    "file_extension": ".py",
147 |    "mimetype": "text/x-python",
148 |    "name": "python",
149 |    "nbconvert_exporter": "python",
150 |    "pygments_lexer": "ipython3",
151 |    "version": "3.6.0"
152 |   }
153 |  },
154 |  "nbformat": 4,
155 |  "nbformat_minor": 2
156 | }
157 | 


--------------------------------------------------------------------------------
/4. Custom Sources/6. NLTK Resources.ipynb:
--------------------------------------------------------------------------------
  1 | {
  2 |  "cells": [
  3 |   {
  4 |    "cell_type": "markdown",
  5 |    "metadata": {},
  6 |    "source": [
  7 |     "# Custom Sources - NLTK Resources"
  8 |    ]
  9 |   },
 10 |   {
 11 |    "cell_type": "code",
 12 |    "execution_count": 1,
 13 |    "metadata": {
 14 |     "collapsed": true
 15 |    },
 16 |    "outputs": [],
 17 |    "source": [
 18 |     "import nltk"
 19 |    ]
 20 |   },
 21 |   {
 22 |    "cell_type": "markdown",
 23 |    "metadata": {},
 24 |    "source": [
 25 |     "## Wordlist"
 26 |    ]
 27 |   },
 28 |   {
 29 |    "cell_type": "code",
 30 |    "execution_count": 2,
 31 |    "metadata": {
 32 |     "collapsed": false
 33 |    },
 34 |    "outputs": [
 35 |     {
 36 |      "data": {
 37 |       "text/plain": [
 38 |        "['A',\n",
 39 |        " 'a',\n",
 40 |        " 'aa',\n",
 41 |        " 'aal',\n",
 42 |        " 'aalii',\n",
 43 |        " 'aam',\n",
 44 |        " 'Aani',\n",
 45 |        " 'aardvark',\n",
 46 |        " 'aardwolf',\n",
 47 |        " 'Aaron',\n",
 48 |        " 'Aaronic',\n",
 49 |        " 'Aaronical',\n",
 50 |        " 'Aaronite',\n",
 51 |        " 'Aaronitic',\n",
 52 |        " 'Aaru',\n",
 53 |        " 'Ab',\n",
 54 |        " 'aba',\n",
 55 |        " 'Ababdeh',\n",
 56 |        " 'Ababua',\n",
 57 |        " 'abac']"
 58 |       ]
 59 |      },
 60 |      "execution_count": 2,
 61 |      "metadata": {},
 62 |      "output_type": "execute_result"
 63 |     }
 64 |    ],
 65 |    "source": [
 66 |     "nltk.corpus.words.words()[:20]"
 67 |    ]
 68 |   },
 69 |   {
 70 |    "cell_type": "markdown",
 71 |    "metadata": {},
 72 |    "source": [
 73 |     "## Stopwords"
 74 |    ]
 75 |   },
 76 |   {
 77 |    "cell_type": "code",
 78 |    "execution_count": 3,
 79 |    "metadata": {
 80 |     "collapsed": false
 81 |    },
 82 |    "outputs": [
 83 |     {
 84 |      "data": {
 85 |       "text/plain": [
 86 |        "['i',\n",
 87 |        " 'me',\n",
 88 |        " 'my',\n",
 89 |        " 'myself',\n",
 90 |        " 'we',\n",
 91 |        " 'our',\n",
 92 |        " 'ours',\n",
 93 |        " 'ourselves',\n",
 94 |        " 'you',\n",
 95 |        " \"you're\",\n",
 96 |        " \"you've\",\n",
 97 |        " \"you'll\",\n",
 98 |        " \"you'd\",\n",
 99 |        " 'your',\n",
100 |        " 'yours',\n",
101 |        " 'yourself',\n",
102 |        " 'yourselves',\n",
103 |        " 'he',\n",
104 |        " 'him',\n",
105 |        " 'his',\n",
106 |        " 'himself',\n",
107 |        " 'she',\n",
108 |        " \"she's\",\n",
109 |        " 'her',\n",
110 |        " 'hers',\n",
111 |        " 'herself',\n",
112 |        " 'it',\n",
113 |        " \"it's\",\n",
114 |        " 'its',\n",
115 |        " 'itself',\n",
116 |        " 'they',\n",
117 |        " 'them',\n",
118 |        " 'their',\n",
119 |        " 'theirs',\n",
120 |        " 'themselves',\n",
121 |        " 'what',\n",
122 |        " 'which',\n",
123 |        " 'who',\n",
124 |        " 'whom',\n",
125 |        " 'this',\n",
126 |        " 'that',\n",
127 |        " \"that'll\",\n",
128 |        " 'these',\n",
129 |        " 'those',\n",
130 |        " 'am',\n",
131 |        " 'is',\n",
132 |        " 'are',\n",
133 |        " 'was',\n",
134 |        " 'were',\n",
135 |        " 'be',\n",
136 |        " 'been',\n",
137 |        " 'being',\n",
138 |        " 'have',\n",
139 |        " 'has',\n",
140 |        " 'had',\n",
141 |        " 'having',\n",
142 |        " 'do',\n",
143 |        " 'does',\n",
144 |        " 'did',\n",
145 |        " 'doing',\n",
146 |        " 'a',\n",
147 |        " 'an',\n",
148 |        " 'the',\n",
149 |        " 'and',\n",
150 |        " 'but',\n",
151 |        " 'if',\n",
152 |        " 'or',\n",
153 |        " 'because',\n",
154 |        " 'as',\n",
155 |        " 'until',\n",
156 |        " 'while',\n",
157 |        " 'of',\n",
158 |        " 'at',\n",
159 |        " 'by',\n",
160 |        " 'for',\n",
161 |        " 'with',\n",
162 |        " 'about',\n",
163 |        " 'against',\n",
164 |        " 'between',\n",
165 |        " 'into',\n",
166 |        " 'through',\n",
167 |        " 'during',\n",
168 |        " 'before',\n",
169 |        " 'after',\n",
170 |        " 'above',\n",
171 |        " 'below',\n",
172 |        " 'to',\n",
173 |        " 'from',\n",
174 |        " 'up',\n",
175 |        " 'down',\n",
176 |        " 'in',\n",
177 |        " 'out',\n",
178 |        " 'on',\n",
179 |        " 'off',\n",
180 |        " 'over',\n",
181 |        " 'under',\n",
182 |        " 'again',\n",
183 |        " 'further',\n",
184 |        " 'then',\n",
185 |        " 'once',\n",
186 |        " 'here',\n",
187 |        " 'there',\n",
188 |        " 'when',\n",
189 |        " 'where',\n",
190 |        " 'why',\n",
191 |        " 'how',\n",
192 |        " 'all',\n",
193 |        " 'any',\n",
194 |        " 'both',\n",
195 |        " 'each',\n",
196 |        " 'few',\n",
197 |        " 'more',\n",
198 |        " 'most',\n",
199 |        " 'other',\n",
200 |        " 'some',\n",
201 |        " 'such',\n",
202 |        " 'no',\n",
203 |        " 'nor',\n",
204 |        " 'not',\n",
205 |        " 'only',\n",
206 |        " 'own',\n",
207 |        " 'same',\n",
208 |        " 'so',\n",
209 |        " 'than',\n",
210 |        " 'too',\n",
211 |        " 'very',\n",
212 |        " 's',\n",
213 |        " 't',\n",
214 |        " 'can',\n",
215 |        " 'will',\n",
216 |        " 'just',\n",
217 |        " 'don',\n",
218 |        " \"don't\",\n",
219 |        " 'should',\n",
220 |        " \"should've\",\n",
221 |        " 'now',\n",
222 |        " 'd',\n",
223 |        " 'll',\n",
224 |        " 'm',\n",
225 |        " 'o',\n",
226 |        " 're',\n",
227 |        " 've',\n",
228 |        " 'y',\n",
229 |        " 'ain',\n",
230 |        " 'aren',\n",
231 |        " \"aren't\",\n",
232 |        " 'couldn',\n",
233 |        " \"couldn't\",\n",
234 |        " 'didn',\n",
235 |        " \"didn't\",\n",
236 |        " 'doesn',\n",
237 |        " \"doesn't\",\n",
238 |        " 'hadn',\n",
239 |        " \"hadn't\",\n",
240 |        " 'hasn',\n",
241 |        " \"hasn't\",\n",
242 |        " 'haven',\n",
243 |        " \"haven't\",\n",
244 |        " 'isn',\n",
245 |        " \"isn't\",\n",
246 |        " 'ma',\n",
247 |        " 'mightn',\n",
248 |        " \"mightn't\",\n",
249 |        " 'mustn',\n",
250 |        " \"mustn't\",\n",
251 |        " 'needn',\n",
252 |        " \"needn't\",\n",
253 |        " 'shan',\n",
254 |        " \"shan't\",\n",
255 |        " 'shouldn',\n",
256 |        " \"shouldn't\",\n",
257 |        " 'wasn',\n",
258 |        " \"wasn't\",\n",
259 |        " 'weren',\n",
260 |        " \"weren't\",\n",
261 |        " 'won',\n",
262 |        " \"won't\",\n",
263 |        " 'wouldn',\n",
264 |        " \"wouldn't\"]"
265 |       ]
266 |      },
267 |      "execution_count": 3,
268 |      "metadata": {},
269 |      "output_type": "execute_result"
270 |     }
271 |    ],
272 |    "source": [
273 |     "nltk.corpus.stopwords.words(\"english\")"
274 |    ]
275 |   },
276 |   {
277 |    "cell_type": "markdown",
278 |    "metadata": {},
279 |    "source": [
280 |     "All the resources provided by NLTK: http://www.nltk.org/nltk_data/"
281 |    ]
282 |   }
283 |  ],
284 |  "metadata": {
285 |   "kernelspec": {
286 |    "display_name": "Python 3",
287 |    "language": "python",
288 |    "name": "python3"
289 |   },
290 |   "language_info": {
291 |    "codemirror_mode": {
292 |     "name": "ipython",
293 |     "version": 3
294 |    },
295 |    "file_extension": ".py",
296 |    "mimetype": "text/x-python",
297 |    "name": "python",
298 |    "nbconvert_exporter": "python",
299 |    "pygments_lexer": "ipython3",
300 |    "version": "3.6.0"
301 |   }
302 |  },
303 |  "nbformat": 4,
304 |  "nbformat_minor": 2
305 | }
306 | 


--------------------------------------------------------------------------------
/4. Custom Sources/7. Example - Remove Stopwords.ipynb:
--------------------------------------------------------------------------------
  1 | {
  2 |  "cells": [
  3 |   {
  4 |    "cell_type": "markdown",
  5 |    "metadata": {},
  6 |    "source": [
  7 |     "# Custom Sources - Example: Remove Stopwords"
  8 |    ]
  9 |   },
 10 |   {
 11 |    "cell_type": "code",
 12 |    "execution_count": 1,
 13 |    "metadata": {
 14 |     "collapsed": true
 15 |    },
 16 |    "outputs": [],
 17 |    "source": [
 18 |     "import nltk"
 19 |    ]
 20 |   },
 21 |   {
 22 |    "cell_type": "code",
 23 |    "execution_count": 2,
 24 |    "metadata": {
 25 |     "collapsed": false
 26 |    },
 27 |    "outputs": [],
 28 |    "source": [
 29 |     "alice = nltk.corpus.gutenberg.words(\"carroll-alice.txt\")"
 30 |    ]
 31 |   },
 32 |   {
 33 |    "cell_type": "code",
 34 |    "execution_count": 3,
 35 |    "metadata": {
 36 |     "collapsed": false
 37 |    },
 38 |    "outputs": [],
 39 |    "source": [
 40 |     "alice_fd = nltk.FreqDist(alice)"
 41 |    ]
 42 |   },
 43 |   {
 44 |    "cell_type": "code",
 45 |    "execution_count": 4,
 46 |    "metadata": {
 47 |     "collapsed": false
 48 |    },
 49 |    "outputs": [],
 50 |    "source": [
 51 |     "alice_100 = alice_fd.most_common(100)\n",
 52 |     "alice_common = [word[0] for word in alice_100]\n",
 53 |     "common = set(word.lower() for word in alice_common if word.isalpha())"
 54 |    ]
 55 |   },
 56 |   {
 57 |    "cell_type": "code",
 58 |    "execution_count": 5,
 59 |    "metadata": {
 60 |     "collapsed": false
 61 |    },
 62 |    "outputs": [
 63 |     {
 64 |      "data": {
 65 |       "text/plain": [
 66 |        "{'a',\n",
 67 |        " 'about',\n",
 68 |        " 'again',\n",
 69 |        " 'alice',\n",
 70 |        " 'all',\n",
 71 |        " 'an',\n",
 72 |        " 'and',\n",
 73 |        " 'as',\n",
 74 |        " 'at',\n",
 75 |        " 'be',\n",
 76 |        " 'began',\n",
 77 |        " 'but',\n",
 78 |        " 'by',\n",
 79 |        " 'can',\n",
 80 |        " 'could',\n",
 81 |        " 'did',\n",
 82 |        " 'do',\n",
 83 |        " 'down',\n",
 84 |        " 'for',\n",
 85 |        " 'gryphon',\n",
 86 |        " 'had',\n",
 87 |        " 'hatter',\n",
 88 |        " 'have',\n",
 89 |        " 'he',\n",
 90 |        " 'her',\n",
 91 |        " 'herself',\n",
 92 |        " 'his',\n",
 93 |        " 'i',\n",
 94 |        " 'if',\n",
 95 |        " 'in',\n",
 96 |        " 'into',\n",
 97 |        " 'is',\n",
 98 |        " 'it',\n",
 99 |        " 'its',\n",
100 |        " 'king',\n",
101 |        " 'know',\n",
102 |        " 'like',\n",
103 |        " 'little',\n",
104 |        " 'll',\n",
105 |        " 'm',\n",
106 |        " 'me',\n",
107 |        " 'mock',\n",
108 |        " 'much',\n",
109 |        " 'my',\n",
110 |        " 'no',\n",
111 |        " 'not',\n",
112 |        " 'of',\n",
113 |        " 'off',\n",
114 |        " 'on',\n",
115 |        " 'one',\n",
116 |        " 'or',\n",
117 |        " 'out',\n",
118 |        " 'queen',\n",
119 |        " 'quite',\n",
120 |        " 's',\n",
121 |        " 'said',\n",
122 |        " 'say',\n",
123 |        " 'see',\n",
124 |        " 'she',\n",
125 |        " 'so',\n",
126 |        " 't',\n",
127 |        " 'that',\n",
128 |        " 'the',\n",
129 |        " 'them',\n",
130 |        " 'then',\n",
131 |        " 'there',\n",
132 |        " 'they',\n",
133 |        " 'this',\n",
134 |        " 'thought',\n",
135 |        " 'time',\n",
136 |        " 'to',\n",
137 |        " 'turtle',\n",
138 |        " 'up',\n",
139 |        " 'very',\n",
140 |        " 'was',\n",
141 |        " 'way',\n",
142 |        " 'went',\n",
143 |        " 'were',\n",
144 |        " 'what',\n",
145 |        " 'when',\n",
146 |        " 'with',\n",
147 |        " 'would',\n",
148 |        " 'you',\n",
149 |        " 'your'}"
150 |       ]
151 |      },
152 |      "execution_count": 5,
153 |      "metadata": {},
154 |      "output_type": "execute_result"
155 |     }
156 |    ],
157 |    "source": [
158 |     "common"
159 |    ]
160 |   },
161 |   {
162 |    "cell_type": "code",
163 |    "execution_count": 6,
164 |    "metadata": {
165 |     "collapsed": false
166 |    },
167 |    "outputs": [],
168 |    "source": [
169 |     "descriptive = list(set(common) - set(nltk.corpus.stopwords.words(\"english\")))"
170 |    ]
171 |   },
172 |   {
173 |    "cell_type": "code",
174 |    "execution_count": 7,
175 |    "metadata": {
176 |     "collapsed": false
177 |    },
178 |    "outputs": [
179 |     {
180 |      "data": {
181 |       "text/plain": [
182 |        "['time',\n",
183 |        " 'gryphon',\n",
184 |        " 'began',\n",
185 |        " 'could',\n",
186 |        " 'like',\n",
187 |        " 'hatter',\n",
188 |        " 'one',\n",
189 |        " 'say',\n",
190 |        " 'king',\n",
191 |        " 'mock',\n",
192 |        " 'see',\n",
193 |        " 'thought',\n",
194 |        " 'turtle',\n",
195 |        " 'queen',\n",
196 |        " 'quite',\n",
197 |        " 'went',\n",
198 |        " 'alice',\n",
199 |        " 'would',\n",
200 |        " 'said',\n",
201 |        " 'little',\n",
202 |        " 'know',\n",
203 |        " 'way',\n",
204 |        " 'much']"
205 |       ]
206 |      },
207 |      "execution_count": 7,
208 |      "metadata": {},
209 |      "output_type": "execute_result"
210 |     }
211 |    ],
212 |    "source": [
213 |     "descriptive"
214 |    ]
215 |   }
216 |  ],
217 |  "metadata": {
218 |   "kernelspec": {
219 |    "display_name": "Python 3",
220 |    "language": "python",
221 |    "name": "python3"
222 |   },
223 |   "language_info": {
224 |    "codemirror_mode": {
225 |     "name": "ipython",
226 |     "version": 3
227 |    },
228 |    "file_extension": ".py",
229 |    "mimetype": "text/x-python",
230 |    "name": "python",
231 |    "nbconvert_exporter": "python",
232 |    "pygments_lexer": "ipython3",
233 |    "version": "3.6.0"
234 |   }
235 |  },
236 |  "nbformat": 4,
237 |  "nbformat_minor": 2
238 | }
239 | 


--------------------------------------------------------------------------------
/4. Custom Sources/dec_independence.txt:
--------------------------------------------------------------------------------
 1 | When, in the course of human events, it becomes necessary for one people to dissolve the political bonds which have connected them with another, and to assume among the powers of the earth, the separate and equal station to which the laws of nature and of nature's God entitle them, a decent respect to the opinions of mankind requires that they should declare the causes which impel them to the separation.
 2 | 
 3 | We hold these truths to be self-evident, that all men are created equal, that they are endowed by their Creator with certain unalienable rights, that among these are life, liberty and the pursuit of happiness. That to secure these rights, governments are instituted among men, deriving their just powers from the consent of the governed. That whenever any form of government becomes destructive to these ends, it is the right of the people to alter or to abolish it, and to institute new government, laying its foundation on such principles and organizing its powers in such form, as to them shall seem most likely to effect their safety and happiness. Prudence, indeed, will dictate that governments long established should not be changed for light and transient causes; and accordingly all experience hath shown that mankind are more disposed to suffer, while evils are sufferable, than to right themselves by abolishing the forms to which they are accustomed. But when a long train of abuses and usurpations, pursuing invariably the same object evinces a design to reduce them under absolute despotism, it is their right, it is their duty, to throw off such government, and to provide new guards for their future security. --
 4 | 
 5 | Such has been the patient sufferance of these colonies; and such is now the necessity which constrains them to alter their former systems of government. The history of the present King of Great Britain is a history of repeated injuries and usurpations, all having in direct object the establishment of an absolute tyranny over these states. To prove this, let facts be submitted to a candid world.
 6 | 
 7 | He has refused his assent to laws, the most wholesome and necessary for the public good.
 8 | 
 9 | He has forbidden his governors to pass laws of immediate and pressing importance, unless suspended in their operation till his assent should be obtained; and when so suspended, he has utterly neglected to attend to them.
10 | 
11 | He has refused to pass other laws for the accommodation of large districts of people, unless those people would relinquish the right of representation in the legislature, a right inestimable to them and formidable to tyrants only.
12 | 
13 | He has called together legislative bodies at places unusual, uncomfortable, and distant from the depository of their public records, for the sole purpose of fatiguing them into compliance with his measures.
14 | 
15 | He has dissolved representative houses repeatedly, for opposing with manly firmness his invasions on the rights of the people.
16 | 
17 | He has refused for a long time, after such dissolutions, to cause others to be elected; whereby the legislative powers, incapable of annihilation, have returned to the people at large for their exercise; the state remaining in the meantime exposed to all the dangers of invasion from without, and convulsions within.
18 | 
19 | He has endeavored to prevent the population of these states; for that purpose obstructing the laws for naturalization of foreigners; refusing to pass others to encourage their migration hither, and raising the conditions of new appropriations of lands.
20 | 
21 | He has obstructed the administration of justice, by refusing his assent to laws for establishing judiciary powers.
22 | 
23 | He has made judges dependent on his will alone, for the tenure of their offices, and the amount and payment of their salaries.
24 | 
25 | He has erected a multitude of new offices, and sent hither swarms of officers to harass our people, and eat out their substance.
26 | 
27 | He has kept among us, in times of peace, standing armies without the consent of our legislature.
28 | 
29 | He has affected to render the military independent of and superior to civil power.
30 | 
31 | He has combined with others to subject us to a jurisdiction foreign to our constitution, and unacknowledged by our laws; giving his assent to their acts of pretended legislation:
32 | 
33 | For quartering large bodies of armed troops among us:
34 | For protecting them, by mock trial, from punishment for any murders which they should commit on the inhabitants of these states:
35 | For cutting off our trade with all parts of the world:
36 | For imposing taxes on us without our consent:
37 | For depriving us in many cases, of the benefits of trial by jury:
38 | For transporting us beyond seas to be tried for pretended offenses:
39 | For abolishing the free system of English laws in a neighboring province, establishing therein an arbitrary government, and enlarging its boundaries so as to render it at once an example and fit instrument for introducing the same absolute rule in these colonies:
40 | For taking away our charters, abolishing our most valuable laws, and altering fundamentally the forms of our governments:
41 | For suspending our own legislatures, and declaring themselves invested with power to legislate for us in all cases whatsoever.
42 | He has abdicated government here, by declaring us out of his protection and waging war against us.
43 | 
44 | He has plundered our seas, ravaged our coasts, burned our towns, and destroyed the lives of our people.
45 | 
46 | He is at this time transporting large armies of foreign mercenaries to complete the works of death, desolation and tyranny, already begun with circumstances of cruelty and perfidy scarcely paralleled in the most barbarous ages, and totally unworthy the head of a civilized nation.
47 | 
48 | He has constrained our fellow citizens taken captive on the high seas to bear arms against their country, to become the executioners of their friends and brethren, or to fall themselves by their hands.
49 | 
50 | He has excited domestic insurrections amongst us, and has endeavored to bring on the inhabitants of our frontiers, the merciless Indian savages, whose known rule of warfare, is undistinguished destruction of all ages, sexes and conditions. 
51 | In Jefferson's draft there is a part on slavery here
52 | 
53 | In every stage of these oppressions we have petitioned for redress in the most humble terms: our repeated petitions have been answered only by repeated injury. A prince, whose character is thus marked by every act which may define a tyrant, is unfit to be the ruler of a free people.
54 | 
55 | Nor have we been wanting in attention to our British brethren. We have warned them from time to time of attempts by their legislature to extend an unwarrantable jurisdiction over us. We have reminded them of the circumstances of our emigration and settlement here. We have appealed to their native justice and magnanimity, and we have conjured them by the ties of our common kindred to disavow these usurpations, which, would inevitably interrupt our connections and correspondence. We must, therefore, acquiesce in the necessity, which denounces our separation, and hold them, as we hold the rest of mankind, enemies in war, in peace friends.
56 | 
57 | We, therefore, the representatives of the United States of America, in General Congress, assembled, appealing to the Supreme Judge of the world for the rectitude of our intentions, do, in the name, and by the authority of the good people of these colonies, solemnly publish and declare, that these united colonies are, and of right ought to be free and independent states; that they are absolved from all allegiance to the British Crown, and that all political connection between them and the state of Great Britain, is and ought to be totally dissolved; and that as free and independent states, they have full power to levy war, conclude peace, contract alliances, establish commerce, and to do all other acts and things which independent states may of right do. And for the support of this declaration, with a firm reliance on the protection of Divine Providence, we mutually pledge to each other our lives, our fortunes and our sacred honor.


--------------------------------------------------------------------------------
/4. Custom Sources/export.txt:
--------------------------------------------------------------------------------
1 | [ Alice ' s Adventures in Wonderland by Lewis Carroll 1865 ] CHAPTER I . Down the Rabbit - Hole Alice was beginning to get very tired of sitting by her sister on the bank , and of having nothing to do : once or twice she had peeped into the book her sister was reading , but it had no pictures or conversations in it , ' and what is the use of a book ,' thought Alice ' without pictures or conversation ?' So she was considering in her own mind ( as well as she could , for the hot day made her feel very sleepy and stupid ), whether the pleasure of making a daisy - chain would be worth the trouble of getting up and picking the daisies , when suddenly a White Rabbit with pink eyes ran close by her . There was nothing so VERY remarkable in that ; nor did Alice think it so VERY much out of the way to hear the Rabbit say to itself , ' Oh dear ! Oh dear ! I shall be late !' ( when she thought it over afterwards , it occurred to her that she ought to have wondered at this , but at the time it all seemed quite natural ); but when the Rabbit actually TOOK A WATCH OUT OF ITS WAISTCOAT - POCKET , and looked at it , and then hurried on , Alice started to her feet , for it flashed across her mind that she had never before seen a rabbit with either a waistcoat - pocket , or a watch to take out of it , and burning with curiosity , she ran across the field after it , and fortunately was just in time to see it pop down a large rabbit - hole under the hedge . In another moment down went Alice after it , never once considering how in the world she was to get out again . The rabbit - hole went straight on like a tunnel for some way , and then dipped suddenly down , so suddenly that Alice had not a moment to think about stopping herself before she found herself falling down a very deep well . Either the well was very deep , or she fell very slowly , for she had plenty of time as she went down to look about her and to wonder what was going to happen next . First , she tried to look down and make out what she was coming to , but it was too dark to see anything ; then she looked at the sides of the well , and noticed that they were filled with cupboards and book - shelves ; here and there she saw maps and pictures hung upon pegs . She took down a jar from one of the shelves as she passed ; it was labelled ' ORANGE MARMALADE ', but to her great disappointment it was empty : she did not like to drop the jar for fear of killing somebody , so managed to put it into one of the cupboards as she fell past it . ' Well !' thought Alice to herself , ' after such a fall as this , I shall think nothing of tumbling down stairs ! How brave they ' ll all think me at home ! Why , I wouldn ' t say anything about it , even if I fell off the top of the house !' ( Which was very likely true .) Down , down , down . Would the fall NEVER come to an end ! ' I wonder how many miles I ' ve fallen by this time ?' she said aloud . ' I must be getting somewhere near the centre of the earth . Let me see : that would be four thousand miles down , I think --' ( for , you see , Alice had learnt several things of this sort in her lessons in the schoolroom , and though this was not a VERY good opportunity for showing off her knowledge , as there was no one to listen to her , still it was good practice to say it over ) '-- yes , that ' s about the right distance -- but then I wonder what Latitude or Longitude I ' ve got to ?' ( Alice had no idea what Latitude was , or Longitude either , but thought they were nice grand words to say .) Presently she began again . ' I wonder if I shall fall right THROUGH the earth ! How funny it ' ll seem to come out among the people that walk with their heads downward ! The Antipathies , I think --' ( she was rather glad there WAS no one listening , this time , as it didn ' t sound at all the right word ) '-- but I shall have to ask them what the name of the country is , you know . Please , Ma ' am , is this New Zealand or Australia ?' ( and she tried to curtsey as she spoke -- fancy CURTSEYING as you ' re falling through the air ! Do you think you could manage it ?) ' And what an ignorant little girl she ' ll think me for asking ! No , it ' ll never do to ask : perhaps I shall see it written up somewhere .' Down , down , down . There was nothing else to do , so Alice soon began talking again . ' Dinah ' ll miss me very much to - night , I should think !' ( Dinah was the cat .) ' I hope they ' ll remember her saucer of milk at tea - time . Dinah my dear ! I wish you were down here with me


--------------------------------------------------------------------------------
/4. Custom Sources/reviews.csv:
--------------------------------------------------------------------------------
1 | "This camera is perfect for an enthusiastic amateur photographer. The pictures are razor-sharp, even in macro. It is small enough to fit easily in a coat pocket or purse. It is light enough to carry around all day without bother. Operating its many features is easy and often obvious - i'm no annie lebovitz, but i was able to figure out most of its abilities just messing around with it at a camera store. The manual does a fine job filling in any blanks that remain. The auto-focus performs well, but i love having the 12 optional scene modes - they are dummy-proof, and correspond to many situations in which i would actually seek to use the camera. Comes with a 16 mb compact flash and one rechargeable battery the charging unit, included, is fast and small. I bought a 256 mb cf and a second battery, so it's good to go on a long vacation. I enthusiastically recommend this camera."
2 | "I got my camera three days back, and although i had some experience with digital cameras prior to purchasing this one, i still rate myself as a beginner. I bought this camera because it fit my budget and the pre-production and production model reviews were positive. It's easy to use, and yet very feature rich. In the auto mode it functions basically as a point and click, the scene modes are very easy to use and produce good results. The manual mode is feature rich and i can't wait to get the hang of it. The macro mode is exceptional, the pictures are very clear and you can take the pictures with the lens unbelievably close the subject. The battery life is very good, i got about 90 minutes with the lcd turned on all the time, the first time around, and i have been using it with the lcd off every now and then, and have yet needed to recharge it. The camera comes with a lexar 16mb starter card, which stores about 10 images in fine mode at the highest resolution, i intend to buy a bigger card soon."
3 | "I love photography. I had an older camera that was simply a point and shoot camera. I needed something with more power, so i bought a nikon coolpix 4300. I fell in love with this camera, it combines ease of use, with an immense amount of options and power. You can use the scene modes, or fine tune the options, i. you can change the iso level, shutter speed, etc. This camera is ideal for people who want more power, but don’t want to spend 1000s dollars on a camera. "
4 | "I bought coolpix 4300 two months after i had bought canon powershot s400. It was not easy sharing one with my teen age kid. The two cameras are very similar in functionality and pricing. I've had no problem with canon whatsoever. With nikon, although picture qualities are as good as any other 4 mp cameras, i've had the following headaches. Pictures won't transfer to pc directly from the camera using the included transfer cable. I did everything i could, and it took many days of frustration before concluding that the only way to transfer to pc is with the card reader. The speed is noticeably slower than canon, especially so with flashes on. With low battery, it twice wiped out the entire pictures in the memory chip. I used lexar 256 mb and i still use it which means nothing is wrong with lexar. Be very careful when the battery is low and make sure to carry extra batteries. "
5 | "The other reviewers have clearly pointed all the good things about this camera, which i do agree. But there are certain issues might be they are to me here - all of them are minor; not major ones though. This camera keeps on autofocusing in auto mode with a buzzing sound which can't be stopped. Would be really good if they have given an option to stop this autofocusing. If you want to have the date; time on the image, its only through their software ""nikon view"" which reads the images date; time from the images meta-data. So if you use your card reader; copy images - you got to once again open them through their software to put the date; time. In that too, there is not a direct way to add date; time - you got to say' print images' to a different directory in which there is an option to specify the date; time. Even the slightest of the shakes totally distorts your image. Images taken indoor were not so clear. You got to have flash on to get it even though your room is well lit. Lens cap is a really annoying. The movie clips taken will always have some noise in it, you can't avoid that. But overall this is a good camera with a' really good' picture clarity; an exceptional close-up shooting capability. I would rate this is 4.5 stars picture quality; image size defined above are specific to nikon coolpix."
6 | "Within a year, there are problems with my menu dial knob. It became stuck which makes it almost impossible to switch between modes. I send my camera to nikon for servicing, took them a whole 6 weeks to diagnose the problem. Worse of all, they claim that it's some kind of internal damage and refuse to cover the cost via warranty! They wouldn't repair my camera unless if i pay $100 for parts? and labor! It is a good camera in terms of the function and quality, but take your chance with it because nikon absolutely sucks when it comes to customer service."
7 | "Got a ""system error"" problem 30 days after purchase. Made the camera totally inoperable. Also, the lens cap design is flawed. You have to manually Audio on video also lacking. Otherwise, it takes very good pictures; shutter delay is not so bad either. Still, had to send it back to nikon for repair."
8 | "I am an amateur photographer and here is a piece of advise to all the folks who are thinking about making a move the digital world. I feel, is the best camera out there for the features and price. I had initially thought of buying a 2 or 3 megapixel camera but these are good for 4x6"" or 5x7"" prints and i wanted some really great 8x10"" photos once in a while. I did not want a very small camera as it seems to get lost in my hands and i was not comfortable with that. I wanted a decent sized camera with a contour for my fingers to hold it steadily. I wanted a camera that had a lot of built-in settings for different types of surroundings while giving me an option to use my photography skills although, I am an amateur with an interest in photography by turning on the manual settings. I wanted a respected brand and had to stay within my budget because i had bought an expensive camcorder before but had not used it much. Depending on all the above requirements, I had narrowed down my search to nikon 4300 and canon powershot s400 models. Nikon got the final nod for its settings auto and manual along with movie modes, medium; compact size, price, brand name, good software that is included and previous reviews. I should say I have been very happy with my decision ever since. The pictures are absolutely amazing - the camera captures the minutest of details."
9 | 


--------------------------------------------------------------------------------
/5. Projects/3. Term Frequency, Inverse Document Frequency.ipynb:
--------------------------------------------------------------------------------
  1 | {
  2 |  "cells": [
  3 |   {
  4 |    "cell_type": "markdown",
  5 |    "metadata": {},
  6 |    "source": [
  7 |     "# Projects - TF-IDF"
  8 |    ]
  9 |   },
 10 |   {
 11 |    "cell_type": "markdown",
 12 |    "metadata": {},
 13 |    "source": [
 14 |     "## Imports"
 15 |    ]
 16 |   },
 17 |   {
 18 |    "cell_type": "code",
 19 |    "execution_count": 1,
 20 |    "metadata": {
 21 |     "collapsed": true
 22 |    },
 23 |    "outputs": [],
 24 |    "source": [
 25 |     "import nltk\n",
 26 |     "import math"
 27 |    ]
 28 |   },
 29 |   {
 30 |    "cell_type": "markdown",
 31 |    "metadata": {},
 32 |    "source": [
 33 |     "## Load Data"
 34 |    ]
 35 |   },
 36 |   {
 37 |    "cell_type": "code",
 38 |    "execution_count": 2,
 39 |    "metadata": {
 40 |     "collapsed": true
 41 |    },
 42 |    "outputs": [],
 43 |    "source": [
 44 |     "dataset = {\n",
 45 |     "    \"tfidf_1.txt\":open(\"tfidf_1.txt\").read(),\n",
 46 |     "    \"tfidf_2.txt\":open(\"tfidf_2.txt\").read(),\n",
 47 |     "    \"tfidf_3.txt\":open(\"tfidf_3.txt\").read(),\n",
 48 |     "    \"tfidf_4.txt\":open(\"tfidf_4.txt\").read(),\n",
 49 |     "    \"tfidf_5.txt\":open(\"tfidf_5.txt\").read(),\n",
 50 |     "    \"tfidf_6.txt\":open(\"tfidf_6.txt\").read(),\n",
 51 |     "    \"tfidf_7.txt\":open(\"tfidf_7.txt\").read(),\n",
 52 |     "    \"tfidf_8.txt\":open(\"tfidf_8.txt\").read(),\n",
 53 |     "    \"tfidf_9.txt\":open(\"tfidf_9.txt\").read(),\n",
 54 |     "    \"tfidf_10.txt\":open(\"tfidf_10.txt\").read()\n",
 55 |     "}"
 56 |    ]
 57 |   },
 58 |   {
 59 |    "cell_type": "markdown",
 60 |    "metadata": {},
 61 |    "source": [
 62 |     "## Define Functions"
 63 |    ]
 64 |   },
 65 |   {
 66 |    "cell_type": "code",
 67 |    "execution_count": 3,
 68 |    "metadata": {
 69 |     "collapsed": true
 70 |    },
 71 |    "outputs": [],
 72 |    "source": [
 73 |     "# Calculate term frequencies\n",
 74 |     "def tf(dataset, file_name):\n",
 75 |     "    text = dataset[file_name]\n",
 76 |     "    tokens = nltk.word_tokenize(text)\n",
 77 |     "    fd = nltk.FreqDist(tokens)\n",
 78 |     "    return fd\n",
 79 |     "\n",
 80 |     "# Calculate inverse document frequency\n",
 81 |     "def idf(dataset, term):\n",
 82 |     "    count = [term in dataset[file_name] for file_name in dataset]\n",
 83 |     "    inv_df = math.log(len(count)/sum(count))\n",
 84 |     "    return inv_df\n",
 85 |     "\n",
 86 |     "def tfidf(dataset, file_name, n):\n",
 87 |     "    term_scores = {}\n",
 88 |     "    file_fd = tf(dataset,file_name)\n",
 89 |     "    for term in file_fd:\n",
 90 |     "        if term.isalpha():\n",
 91 |     "            idf_val = idf(dataset,term)\n",
 92 |     "            tf_val = tf(dataset, file_name)[term]\n",
 93 |     "            tfidf = tf_val*idf_val\n",
 94 |     "            term_scores[term] = round(tfidf,2)\n",
 95 |     "    return sorted(term_scores.items(), key=lambda x:-x[1])[:n]"
 96 |    ]
 97 |   },
 98 |   {
 99 |    "cell_type": "markdown",
100 |    "metadata": {},
101 |    "source": [
102 |     "## Run Code"
103 |    ]
104 |   },
105 |   {
106 |    "cell_type": "code",
107 |    "execution_count": 4,
108 |    "metadata": {
109 |     "collapsed": false
110 |    },
111 |    "outputs": [
112 |     {
113 |      "data": {
114 |       "text/plain": [
115 |        "[('Soviet', 20.72),\n",
116 |        " ('Union', 18.42),\n",
117 |        " ('Axis', 16.12),\n",
118 |        " ('Japan', 11.27),\n",
119 |        " ('Germany', 11.27)]"
120 |       ]
121 |      },
122 |      "execution_count": 4,
123 |      "metadata": {},
124 |      "output_type": "execute_result"
125 |     }
126 |    ],
127 |    "source": [
128 |     "tfidf(dataset,\"tfidf_1.txt\",5)"
129 |    ]
130 |   },
131 |   {
132 |    "cell_type": "code",
133 |    "execution_count": 5,
134 |    "metadata": {
135 |     "collapsed": false,
136 |     "scrolled": true
137 |    },
138 |    "outputs": [
139 |     {
140 |      "name": "stdout",
141 |      "output_type": "stream",
142 |      "text": [
143 |       "tfidf_1.txt: \n",
144 |       " [('Soviet', 20.72), ('Union', 18.42), ('Axis', 16.12), ('Japan', 11.27), ('Germany', 11.27)] \n",
145 |       "\n",
146 |       "tfidf_2.txt: \n",
147 |       " [('Module', 16.12), ('Armstrong', 13.82), ('lunar', 13.82), ('Apollo', 11.51), ('Moon', 9.21)] \n",
148 |       "\n",
149 |       "tfidf_3.txt: \n",
150 |       " [('Napoleon', 32.19), ('French', 16.86), ('Coalition', 11.51), ('Prussia', 6.91), ('military', 6.02)] \n",
151 |       "\n",
152 |       "tfidf_4.txt: \n",
153 |       " [('Washington', 25.33), ('President', 6.44), ('Continental', 4.82), ('presided', 4.61), ('militia', 4.61)] \n",
154 |       "\n",
155 |       "tfidf_5.txt: \n",
156 |       " [('Newton', 23.03), ('scientists', 6.91), ('motion', 4.83), ('mathematician', 4.61), ('Principia', 4.61)] \n",
157 |       "\n",
158 |       "tfidf_6.txt: \n",
159 |       " [('Revolution', 21.67), ('French', 15.65), ('privileges', 6.91), ('central', 6.91), ('Napoleon', 6.44)] \n",
160 |       "\n",
161 |       "tfidf_7.txt: \n",
162 |       " [('Leonardo', 18.42), ('Vinci', 9.21), ('painting', 6.91), ('Piero', 4.61), ('architecture', 4.61)] \n",
163 |       "\n",
164 |       "tfidf_8.txt: \n",
165 |       " [('Titanic', 18.42), ('passengers', 11.51), ('maritime', 9.21), ('safety', 9.21), ('lifeboats', 9.21)] \n",
166 |       "\n",
167 |       "tfidf_9.txt: \n",
168 |       " [('Rockefeller', 23.03), ('business', 6.91), ('Standard', 6.91), ('Oil', 6.91), ('University', 4.83)] \n",
169 |       "\n",
170 |       "tfidf_10.txt: \n",
171 |       " [('Tesla', 13.82), ('electrical', 6.44), ('wireless', 6.44), ('mechanical', 4.61), ('alternating', 4.61)] \n",
172 |       "\n"
173 |      ]
174 |     }
175 |    ],
176 |    "source": [
177 |     "for file_name in dataset:\n",
178 |     "    print(\"{0}: \\n {1} \\n\".format(file_name, tfidf(dataset,file_name,5)))"
179 |    ]
180 |   },
181 |   {
182 |    "cell_type": "raw",
183 |    "metadata": {},
184 |    "source": [
185 |     "Suggested Next Steps:\n",
186 |     " - Remove stopwords\n",
187 |     " - Make sure all tokens are lowercase\n",
188 |     " - Reduce the number of times we call the \"tf\" function"
189 |    ]
190 |   }
191 |  ],
192 |  "metadata": {
193 |   "kernelspec": {
194 |    "display_name": "Python 3",
195 |    "language": "python",
196 |    "name": "python3"
197 |   },
198 |   "language_info": {
199 |    "codemirror_mode": {
200 |     "name": "ipython",
201 |     "version": 3
202 |    },
203 |    "file_extension": ".py",
204 |    "mimetype": "text/x-python",
205 |    "name": "python",
206 |    "nbconvert_exporter": "python",
207 |    "pygments_lexer": "ipython3",
208 |    "version": "3.6.0"
209 |   }
210 |  },
211 |  "nbformat": 4,
212 |  "nbformat_minor": 2
213 | }
214 | 


--------------------------------------------------------------------------------
/5. Projects/reviews.csv:
--------------------------------------------------------------------------------
1 | "This camera is perfect for an enthusiastic amateur photographer. The pictures are razor-sharp, even in macro. It is small enough to fit easily in a coat pocket or purse. It is light enough to carry around all day without bother. Operating its many features is easy and often obvious - i'm no annie lebovitz, but i was able to figure out most of its abilities just messing around with it at a camera store. The manual does a fine job filling in any blanks that remain. The auto-focus performs well, but i love having the 12 optional scene modes - they are dummy-proof, and correspond to many situations in which i would actually seek to use the camera. Comes with a 16 mb compact flash and one rechargeable battery the charging unit, included, is fast and small. I bought a 256 mb cf and a second battery, so it's good to go on a long vacation. I enthusiastically recommend this camera."
2 | "I got my camera three days back, and although i had some experience with digital cameras prior to purchasing this one, i still rate myself as a beginner. I bought this camera because it fit my budget and the pre-production and production model reviews were positive. It's easy to use, and yet very feature rich. In the auto mode it functions basically as a point and click, the scene modes are very easy to use and produce good results. The manual mode is feature rich and i can't wait to get the hang of it. The macro mode is exceptional, the pictures are very clear and you can take the pictures with the lens unbelievably close the subject. The battery life is very good, i got about 90 minutes with the lcd turned on all the time, the first time around, and i have been using it with the lcd off every now and then, and have yet needed to recharge it. The camera comes with a lexar 16mb starter card, which stores about 10 images in fine mode at the highest resolution, i intend to buy a bigger card soon."
3 | "I love photography. I had an older camera that was simply a point and shoot camera. I needed something with more power, so i bought a nikon coolpix 4300. I fell in love with this camera, it combines ease of use, with an immense amount of options and power. You can use the scene modes, or fine tune the options, i. you can change the iso level, shutter speed, etc. This camera is ideal for people who want more power, but don’t want to spend 1000s dollars on a camera. "
4 | "I bought coolpix 4300 two months after i had bought canon powershot s400. It was not easy sharing one with my teen age kid. The two cameras are very similar in functionality and pricing. I've had no problem with canon whatsoever. With nikon, although picture qualities are as good as any other 4 mp cameras, i've had the following headaches. Pictures won't transfer to pc directly from the camera using the included transfer cable. I did everything i could, and it took many days of frustration before concluding that the only way to transfer to pc is with the card reader. The speed is noticeably slower than canon, especially so with flashes on. With low battery, it twice wiped out the entire pictures in the memory chip. I used lexar 256 mb and i still use it which means nothing is wrong with lexar. Be very careful when the battery is low and make sure to carry extra batteries. "
5 | "The other reviewers have clearly pointed all the good things about this camera, which i do agree. But there are certain issues might be they are to me here - all of them are minor; not major ones though. This camera keeps on autofocusing in auto mode with a buzzing sound which can't be stopped. Would be really good if they have given an option to stop this autofocusing. If you want to have the date; time on the image, its only through their software ""nikon view"" which reads the images date; time from the images meta-data. So if you use your card reader; copy images - you got to once again open them through their software to put the date; time. In that too, there is not a direct way to add date; time - you got to say' print images' to a different directory in which there is an option to specify the date; time. Even the slightest of the shakes totally distorts your image. Images taken indoor were not so clear. You got to have flash on to get it even though your room is well lit. Lens cap is a really annoying. The movie clips taken will always have some noise in it, you can't avoid that. But overall this is a good camera with a' really good' picture clarity; an exceptional close-up shooting capability. I would rate this is 4.5 stars picture quality; image size defined above are specific to nikon coolpix."
6 | "Within a year, there are problems with my menu dial knob. It became stuck which makes it almost impossible to switch between modes. I send my camera to nikon for servicing, took them a whole 6 weeks to diagnose the problem. Worse of all, they claim that it's some kind of internal damage and refuse to cover the cost via warranty! They wouldn't repair my camera unless if i pay $100 for parts? and labor! It is a good camera in terms of the function and quality, but take your chance with it because nikon absolutely sucks when it comes to customer service."
7 | "Got a ""system error"" problem 30 days after purchase. Made the camera totally inoperable. Also, the lens cap design is flawed. You have to manually Audio on video also lacking. Otherwise, it takes very good pictures; shutter delay is not so bad either. Still, had to send it back to nikon for repair."
8 | "I am an amateur photographer and here is a piece of advise to all the folks who are thinking about making a move the digital world. I feel, is the best camera out there for the features and price. I had initially thought of buying a 2 or 3 megapixel camera but these are good for 4x6"" or 5x7"" prints and i wanted some really great 8x10"" photos once in a while. I did not want a very small camera as it seems to get lost in my hands and i was not comfortable with that. I wanted a decent sized camera with a contour for my fingers to hold it steadily. I wanted a camera that had a lot of built-in settings for different types of surroundings while giving me an option to use my photography skills although, I am an amateur with an interest in photography by turning on the manual settings. I wanted a respected brand and had to stay within my budget because i had bought an expensive camcorder before but had not used it much. Depending on all the above requirements, I had narrowed down my search to nikon 4300 and canon powershot s400 models. Nikon got the final nod for its settings auto and manual along with movie modes, medium; compact size, price, brand name, good software that is included and previous reviews. I should say I have been very happy with my decision ever since. The pictures are absolutely amazing - the camera captures the minutest of details."
9 | 


--------------------------------------------------------------------------------
/5. Projects/tfidf_1.txt:
--------------------------------------------------------------------------------
1 | World War II (WWII or WW2), also known as the Second World War, was a global war that lasted from 1939 to 1945, though related conflicts began earlier. It involved the vast majority of the world's nations—including all of the great powers—eventually forming two opposing military alliances: the Allies and the Axis. It was the most widespread war in history, and directly involved more than 100 million people from over 30 countries. In a state of "total war", the major participants threw their entire economic, industrial, and scientific capabilities behind the war effort, erasing the distinction between civilian and military resources. Marked by mass deaths of civilians, including the Holocaust (in which approximately 11 million people were killed) and the strategic bombing of industrial and population centres (in which approximately one million were killed, and which included the atomic bombings of Hiroshima and Nagasaki), it resulted in an estimated 50 million to 85 million fatalities. These made World War II the deadliest conflict in human history.
2 | 
3 | The Empire of Japan aimed to dominate Asia and the Pacific and was already at war with the Republic of China in 1937, but the world war is generally said to have begun on 1 September 1939 with the invasion of Poland by Germany and subsequent declarations of war on Germany by France and the United Kingdom. From late 1939 to early 1941, in a series of campaigns and treaties, Germany conquered or controlled much of continental Europe, and formed the Axis alliance with Italy and Japan. Based on the Molotov–Ribbentrop Pact of August 1939, Germany and the Soviet Union partitioned and annexed territories of their European neighbours, Poland, Finland, Romania and the Baltic states. For a year starting in late June 1940, the United Kingdom and the British Commonwealth were the only Allied forces continuing the fight against the European Axis powers, with campaigns in North Africa and the Horn of Africa, the aerial Battle of Britain and the Blitz bombing campaign, as well as the long-running Battle of the Atlantic. In June 1941, the European Axis powers launched an invasion of the Soviet Union, opening the largest land theatre of war in history, which trapped the major part of the Axis' military forces into a war of attrition. In December 1941, Japan attacked the United States and European territories in the Pacific Ocean, and quickly conquered much of the Western Pacific.
4 | 
5 | The Axis advance halted in 1942 when Japan lost the critical Battle of Midway, near Hawaii, and Germany was defeated in North Africa and then, decisively, at Stalingrad in the Soviet Union. In 1943, with a series of German defeats on the Eastern Front, the Allied invasion of Italy which brought about Italian surrender, and Allied victories in the Pacific, the Axis lost the initiative and undertook strategic retreat on all fronts. In 1944, the Western Allies invaded German-occupied France, while the Soviet Union regained all of its territorial losses and invaded Germany and its allies. During 1944 and 1945 the Japanese suffered major reverses in mainland Asia in South Central China and Burma, while the Allies crippled the Japanese Navy and captured key Western Pacific islands.
6 | 
7 | The war in Europe ended with an invasion of Germany by the Western Allies and the Soviet Union culminating in the capture of Berlin by Soviet and Polish troops and the subsequent German unconditional surrender on 8 May 1945. Following the Potsdam Declaration by the Allies on 26 July 1945 and the refusal of Japan to surrender under its terms, the United States dropped atomic bombs on the Japanese cities of Hiroshima and Nagasaki on 6 August and 9 August respectively. With an invasion of the Japanese archipelago imminent, the possibility of additional atomic bombings, and the Soviet Union's declaration of war on Japan and invasion of Manchuria, Japan surrendered on 15 August 1945. Thus ended the war in Asia, cementing the total victory of the Allies.
8 | 
9 | World War II altered the political alignment and social structure of the world. The United Nations (UN) was established to foster international co-operation and prevent future conflicts. The victorious great powers—the United States, the Soviet Union, China, the United Kingdom, and France—became the permanent members of the United Nations Security Council. The Soviet Union and the United States emerged as rival superpowers, setting the stage for the Cold War, which lasted for the next 46 years. Meanwhile, the influence of European great powers waned, while the decolonisation of Asia and Africa began. Most countries whose industries had been damaged moved towards economic recovery. Political integration, especially in Europe, emerged as an effort to end pre-war enmities and to create a common identity.


--------------------------------------------------------------------------------
/5. Projects/tfidf_10.txt:
--------------------------------------------------------------------------------
1 | Nikola Tesla; 10 July 1856 – 7 January 1943) was a Serbian American inventor, electrical engineer, mechanical engineer, physicist, and futurist best known for his contributions to the design of the modern alternating current (AC) electricity supply system.
2 | 
3 | Tesla gained experience in telephony and electrical engineering before emigrating to the United States in 1884 to work for Thomas Edison in New York City. He soon struck out on his own with financial backers, setting up laboratories and companies to develop a range of electrical devices. His patented AC induction motor and transformer were licensed by George Westinghouse, who also hired Tesla for a short time as a consultant. His work in the formative years of electric power development was involved in a corporate alternating current/direct current "War of Currents" as well as various patent battles.
4 | 
5 | Tesla went on to pursue his ideas of wireless lighting and electricity distribution in his high-voltage, high-frequency power experiments in New York and Colorado Springs, and made early (1893) pronouncements on the possibility of wireless communication with his devices. He tried to put these ideas to practical use in his ill-fated attempt at intercontinental wireless transmission, which was his unfinished Wardenclyffe Tower project. In his lab he also conducted a range of experiments with mechanical oscillators/generators, electrical discharge tubes, and early X-ray imaging. He also built a wireless controlled boat, one of the first ever exhibited.
6 | 
7 | Tesla was renowned for his achievements and showmanship, eventually earning him a reputation in popular culture as an archetypal "mad scientist". His patents earned him a considerable amount of money, much of which was used to finance his own projects with varying degrees of success. He lived most of his life in a series of New York hotels, through his retirement. He died on 7 January 1943. His work fell into relative obscurity after his death, but in 1960 the General Conference on Weights and Measures named the SI unit of magnetic flux density the tesla in his honor. There has been a resurgence in interest in Tesla in popular culture since the 1990s.


--------------------------------------------------------------------------------
/5. Projects/tfidf_2.txt:
--------------------------------------------------------------------------------
1 | Apollo 11 was the spaceflight that landed the first humans on the Moon, Americans Neil Armstrong and Buzz Aldrin, on July 20, 1969, at 20:18 UTC. Armstrong became the first to step onto the lunar surface six hours later on July 21 at 02:56 UTC. Armstrong spent about two and a half hours outside the spacecraft, Aldrin slightly less, and together they collected 47.5 pounds (21.5 kg) of lunar material for return to Earth. The third member of the mission, Michael Collins, piloted the command spacecraft alone in lunar orbit until Armstrong and Aldrin returned to it just under a day later for the trip back to Earth.
2 | 
3 | Launched by a Saturn V rocket from Kennedy Space Center in Merritt Island, Florida, on July 16, Apollo 11 was the fifth manned mission of NASA's Apollo program. The Apollo spacecraft had three parts: a Command Module (CM) with a cabin for the three astronauts, and the only part that landed back on Earth; a Service Module (SM), which supported the Command Module with propulsion, electrical power, oxygen, and water; and a Lunar Module (LM) for landing on the Moon (which itself was composed of two parts). After being sent toward the Moon by the Saturn V's upper stage, the astronauts separated the spacecraft from it and traveled for three days until they entered into lunar orbit. Armstrong and Aldrin then moved into the Lunar Module and landed in the Sea of Tranquility. They stayed a total of about 21 1⁄2 hours on the lunar surface. After lifting off in the upper part of the Lunar Module and rejoining Collins in the Command Module, they returned to Earth and landed in the Pacific Ocean on July 24.
4 | 
5 | Broadcast on live TV to a world-wide audience, Armstrong stepped onto the lunar surface and described the event as "one small step for [a] man, one giant leap for mankind." Apollo 11 effectively ended the Space Race and fulfilled a national goal proposed in 1961 by the U.S. President John F. Kennedy in a speech before the U.S. Congress: "before this decade is out, of landing a man on the Moon and returning him safely to the Earth."


--------------------------------------------------------------------------------
/5. Projects/tfidf_3.txt:
--------------------------------------------------------------------------------
1 | Napoleon Bonaparte, born Napoleone di Buonaparte; 15 August 1769 – 5 May 1821) was a French military and political leader who rose to prominence during the French Revolution and led several successful campaigns during the Revolutionary Wars. As Napoleon I, he was Emperor of the French from 1804 until 1814, and again in 1815. Napoleon dominated European affairs for over a decade while leading France against a series of coalitions in the Napoleonic Wars. He won most of these wars and the vast majority of his battles, rapidly gaining control of continental Europe before his ultimate defeat in 1815. One of the greatest commanders in history, his campaigns are studied at military schools worldwide and he remains one of the most celebrated and controversial political figures in Western history. In civil affairs, Napoleon had a major long-term impact by bringing liberal reforms to the countries that he conquered, especially the Low Countries, Switzerland, Italy, and large parts of Germany. He implemented fundamental liberal policies in France and throughout Western Europe. His lasting legal achievement was the Napoleonic Code, which has been adopted in various forms by a quarter of the world's legal systems, from Japan to Quebec.
2 | 
3 | Napoleon was born in Corsica to a relatively modest family of noble Tuscan ancestry. Serving in the French army, Napoleon supported the Revolution from the outset in 1789 and tried to spread its ideals to Corsica, but was banished from the island in 1793. Two years later, he saved the French government from collapse by firing on the Parisian mobs with cannons. After the Directory rewarded Napoleon by giving him command of Army of Italy at age 26, he began his first military campaign against the Austrians and their Italian allies, scoring a series of decisive victories that made him famous all across Europe. He followed the defeat of the Allies in Europe by commanding a military expedition to Egypt in 1798, conquering the Ottoman province after defeating the Mamelukes and launching modern Egyptology through the discoveries made by his army.
4 | 
5 | After returning from Egypt, Napoleon engineered a coup in November 1799 and became First Consul of the Republic. With the Concordat of 1801, Napoleon restored the religious powers of the Catholic Church but kept the lands seized by the Revolution. The state nominated the bishops and controlled church finances. He extended his political control over France until the Senate declared him Emperor of the French in 1804, launching the French Empire. Intractable differences with the British meant that the French were facing a Third Coalition by 1805. Napoleon shattered this coalition with decisive victories in the Ulm Campaign and a historic triumph at the Battle of Austerlitz, which led to the elimination of the Holy Roman Empire. In October 1805, however, a Franco-Spanish fleet was destroyed at the Battle of Trafalgar, allowing Britain to impose a naval blockade of the French coasts. In retaliation, Napoleon established the Continental System in 1806 to cut off European trade with Britain. The Fourth Coalition took up arms against him the same year because Prussia became worried about growing French influence on the continent. After quickly knocking out Prussia at the battles of Jena and Auerstedt, Napoleon turned his attention towards the Russians and annihilated them in 1807 at Friedland, which forced the Russians to accept the Treaties of Tilsit, the high water mark of the French Empire.
6 | 
7 | Hoping to extend the Continental System, Napoleon invaded Iberia and declared his brother Joseph the King of Spain in 1808. The Spanish and the Portuguese revolted with British support. The Peninsular War, noted for its brutal guerrilla warfare, lasted six years and culminated in an Allied victory. Fighting also erupted in Central Europe, as the Austrians launched another attack against the French in 1809. Napoleon defeated them at the Battle of Wagram, dissolving the Fifth Coalition formed against France. By 1811, Napoleon ruled over 70 million people across an empire that had domination in Europe, which had not witnessed this level of political consolidation since the days of the Roman Empire. He maintained his strategic status through a series of alliances and family appointments. He created a new aristocracy in France while allowing the return of nobles who had been forced into exile by the Revolution.
8 | 
9 | Tensions over rising Polish nationalism and the economic effects of the Continental System led to renewed confrontation with Russia. To enforce his blockade, Napoleon launched an invasion of Russia in 1812 that ended in catastrophic failure for the French. In 1813, Prussia and Austria joined Russian forces in a Sixth Coalition against France. A chaotic military campaign eventually culminated in a large Allied army defeating Napoleon at the Battle of Leipzig in October. The next year, the Allies invaded France and captured Paris, forcing Napoleon to abdicate in April 1814. He was exiled to the island of Elba. The Bourbons were restored to power and the French lost most of the territories that they had conquered since the Revolution. However, Napoleon escaped from Elba in February 1815 and took control of the government once again. The Allies formed a Seventh Coalition, which ultimately defeated Napoleon at the Battle of Waterloo in June. He was then captured by the British and imprisoned on the remote island of Saint Helena. His death in 1821 at the age of 51 was received by shock and grief throughout Europe. In 1840, a million people witnessed his remains returning to Paris, where they still reside at Les Invalides.


--------------------------------------------------------------------------------
/5. Projects/tfidf_4.txt:
--------------------------------------------------------------------------------
 1 | George Washington (February 22, 1732 – December 14, 1799) was the first President of the United States (1789–97), the Commander-in-Chief of the Continental Army during the American Revolutionary War, and one of the Founding Fathers of the United States. He presided over the convention that drafted the current United States Constitution and during his lifetime was called the "father of his country".
 2 | 
 3 | Widely admired for his strong leadership qualities, Washington was unanimously elected President in the first two national elections. He oversaw the creation of a strong, well-financed national government that maintained neutrality in the French Revolutionary Wars, suppressed the Whiskey Rebellion, and won acceptance among Americans of all types. Washington's incumbency established many precedents, still in use today, such as the cabinet system, the inaugural address, and the title Mr. President. His retirement from office after two terms established a tradition that lasted until 1940, when Franklin Delano Roosevelt won an unprecedented third term.
 4 | 
 5 | Born into the provincial gentry of Colonial Virginia, his family were wealthy planters who owned tobacco plantations and slaves which he inherited; he owned hundreds of slaves throughout his lifetime, but his views on slavery evolved. In his youth he became a senior British officer in the colonial militia during the first stages of the French and Indian War. In 1775, the Second Continental Congress commissioned Washington as commander-in-chief of the Continental Army in the American Revolution. In that command, Washington forced the British out of Boston in 1776, but was defeated and nearly captured later that year when he lost New York City. After crossing the Delaware River in the middle of winter, he defeated the British in two battles, retook New Jersey and restored momentum to the Patriot cause. This is known as the Battle of Trenton.
 6 | 
 7 | His strategy enabled Continental forces to capture two major British armies at Saratoga in 1777 and Yorktown in 1781. Historians laud Washington for the selection and supervision of his generals, preservation and command of the army, coordination with the Congress, with state governors and their militia, and attention to supplies, logistics, and training. In battle, however, Washington was repeatedly outmaneuvered by British generals with larger armies. After victory had been finalized in 1783, Washington resigned as commander-in-chief rather than seize power, proving his opposition to dictatorship and his commitment to American republicanism.
 8 | 
 9 | Washington presided over the Constitutional Convention in 1787, which devised a new form of federal government for the United States. Following unanimous election as President in 1789, he worked to unify rival factions in the fledgling nation. He supported Alexander Hamilton's programs to satisfy all debts, federal and state, established a permanent seat of government, implemented an effective tax system, and created a national bank. In avoiding war with Great Britain, he guaranteed a decade of peace and profitable trade by securing the Jay Treaty in 1795, despite intense opposition from the Jeffersonians. Although he remained nonpartisan, never joining the Federalist Party, he largely supported their policies. Washington's Farewell Address was an influential primer on republican virtue, warning against partisanship, sectionalism, and involvement in foreign wars. He retired from the presidency in 1797, returning to his home and plantation at Mount Vernon.
10 | 
11 | While in power, his use of national authority pursued many ends, especially the preservation of liberty, reduction of regional tensions, and promotion of a spirit of American nationalism. Upon his death, Washington was eulogized as "first in war, first in peace, and first in the hearts of his countrymen" by Henry Lee. Revered in life and in death, scholarly and public polling consistently ranks him among the top three presidents in American history; he has been depicted and remembered in monuments, currency, and other dedications to the present day.


--------------------------------------------------------------------------------
/5. Projects/tfidf_5.txt:
--------------------------------------------------------------------------------
1 | Sir Isaac Newton PRS MP (25 December 1642 – 20 March 1726/7) was an English physicist and mathematician (described in his own day as a "natural philosopher") who is widely recognised as one of the most influential scientists of all time and as a key figure in the scientific revolution. His book Philosophiæ Naturalis Principia Mathematica ("Mathematical Principles of Natural Philosophy"), first published in 1687, laid the foundations for classical mechanics. Newton made seminal contributions to optics, and he shares credit with Gottfried Leibniz for the development of calculus.
2 | 
3 | Newton's Principia formulated the laws of motion and universal gravitation, which dominated scientists' view of the physical universe for the next three centuries. By deriving Kepler's laws of planetary motion from his mathematical description of gravity, and then using the same principles to account for the trajectories of comets, the tides, the precession of the equinoxes, and other phenomena, Newton removed the last doubts about the validity of the heliocentric model of the Solar System. This work also demonstrated that the motion of objects on Earth and of celestial bodies could be described by the same principles. His prediction that Earth should be shaped as an oblate spheroid was later vindicated by the measurements of Maupertuis, La Condamine, and others, which helped convince most Continental European scientists of the superiority of Newtonian mechanics over the earlier system of Descartes.
4 | 
5 | Newton built the first practical reflecting telescope and developed a theory of colour based on the observation that a prism decomposes white light into the many colours of the visible spectrum. He formulated an empirical law of cooling, studied the speed of sound, and introduced the notion of a Newtonian fluid. In addition to his work on calculus, as a mathematician Newton contributed to the study of power series, generalised the binomial theorem to non-integer exponents, developed a method for approximating the roots of a function, and classified most of the cubic plane curves.
6 | 
7 | Newton was a fellow of Trinity College and the second Lucasian Professor of Mathematics at the University of Cambridge. He was a devout but unorthodox Christian and, unusually for a member of the Cambridge faculty of the day, he refused to take holy orders in the Church of England, perhaps because he privately rejected the doctrine of the Trinity. Beyond his work on the mathematical sciences, Newton dedicated much of his time to the study of biblical chronology and alchemy, but most of his work in those areas remained unpublished until long after his death. In his later life, Newton became president of the Royal Society. Newton served the British government as Warden and Master of the Royal Mint.


--------------------------------------------------------------------------------
/5. Projects/tfidf_6.txt:
--------------------------------------------------------------------------------
 1 | The French Revolution was a period of far-reaching social and political upheaval in France that lasted from 1789 until 1799, and was partially carried forward by Napoleon during the later expansion of the French Empire. The Revolution overthrew the monarchy, established a republic, experienced violent periods of political turmoil, and finally culminated in a dictatorship by Napoleon that rapidly brought many of its principles to Western Europe and beyond. Inspired by liberal and radical ideas, the Revolution profoundly altered the course of modern history, triggering the global decline of absolute monarchies while replacing them with republics. Through the Revolutionary Wars, it unleashed a wave of global conflicts that extended from the Caribbean to the Middle East. Historians widely regard the Revolution as one of the most important events in human history.
 2 | 
 3 | The causes of the French Revolution are complex and are still debated among historians. Following the Seven Years' War and the American Revolutionary War, the French government was deeply in debt and attempted to restore its financial status through unpopular taxation schemes. Years of bad harvests leading up to the Revolution also inflamed popular resentment of the privileges enjoyed by the clergy and the aristocracy. Demands for change were formulated in terms of Enlightenment ideals and contributed to the convocation of the Estates-General in May 1789. The first year of the Revolution saw members of the Third Estate taking control, the assault on the Bastille in July, the passage of the Declaration of the Rights of Man and of the Citizen in August, and a women's march on Versailles that forced the royal court back to Paris in October. A central event of the first stage, in August 1789, was the abolition of feudalism and the old rules and privileges left over from the Ancien Regime. The next few years featured political struggles between various liberal assemblies and right-wing supporters of the monarchy intent on thwarting major reforms. The Republic was proclaimed in September 1792 after the French victory at Valmy. In a momentous event that led to international condemnation, Louis XVI was executed in January 1793.
 4 | 
 5 | External threats closely shaped the course of the Revolution. The Revolutionary Wars beginning in 1792 ultimately featured French victories that facilitated the conquest of the Italian Peninsula, the Low Countries and most territories west of the Rhine – achievements that had eluded previous French governments for centuries. Internally, popular agitation radicalised the Revolution significantly, culminating in the rise of Maximilien Robespierre and the Jacobins. The dictatorship imposed by the Committee of Public Safety during the Reign of Terror, from 1793 until 1794, established price controls on food and other items, abolished slavery in French colonies abroad, dechristianised society through the creation of a new calendar and the expulsion of religious figures, and secured the borders of the new republic from its enemies. Large numbers of civilians were executed by revolutionary tribunals during the Terror, with estimates ranging from 16,000 to 40,000. After the Thermidorian Reaction, an executive council known as the Directory assumed control of the French state in 1795. The rule of the Directory was characterised by suspended elections, debt repudiations, financial instability, persecutions against the Catholic clergy, and significant military conquests abroad. Dogged by charges of corruption, the Directory collapsed in a coup led by Napoleon Bonaparte in 1799. Napoleon, who became the hero of the Revolution through his popular military campaigns, went on to establish the Consulate and later the First Empire, setting the stage for a wider array of global conflicts in the Napoleonic Wars.
 6 | 
 7 | The modern era has unfolded in the shadow of the French Revolution. Almost all future revolutionary movements looked back to the Revolution as their predecessor. Its central phrases and cultural symbols, such as La Marseillaise and Liberte, egalite, fraternite, became the clarion call for other major upheavals in modern history, including the Russian Revolution over a century later. The values and institutions of the Revolution dominate French politics to this day. French historian François Aulard comments that:
 8 | 
 9 | the Revolution consisted in the suppression of what was called the feudal system, in the emancipation of the individual, in greater division of landed property, the abolition of the privileges of noble birth, the establishment of equality, the simplification of life.... The French Revolution differed from other revolutions in being not merely national, for it aimed at benefiting all humanity."
10 | Globally, the Revolution accelerated the rise of republics and democracies. It became the focal point for the development of all modern political ideologies, leading to the spread of liberalism, radicalism, nationalism, socialism, feminism, and secularism, among many others. The Revolution also witnessed the birth of total war by organizing the resources of France and the lives of its citizens towards the objective of military conquest. Some of its central documents, like the Declaration of the Rights of Man, expanded the arena of human rights to include women and slaves, leading to movements for abolitionism and universal suffrage in the next century.


--------------------------------------------------------------------------------
/5. Projects/tfidf_7.txt:
--------------------------------------------------------------------------------
 1 | Leonardo di ser Piero da Vinci, more commonly Leonardo da Vinci, 15 April 1452 – 2 May 1519) was an Italian polymath whose areas of interest included invention, painting, sculpting, architecture, science, music, mathematics, engineering, literature, anatomy, geology, astronomy, botany, writing, history, and cartography. He has been variously called the father of paleontology, ichnology, and architecture, and is widely considered one of the greatest painters of all time. Sometimes credited with the inventions of the parachute, helicopter and tank, his genius epitomized the Renaissance humanist ideal.
 2 | 
 3 | Many historians and scholars regard Leonardo as a great exemplar of the "Renaissance Man", an individual of "unquenchable curiosity" and "feverishly inventive imagination". According to art historian Helen Gardner, the scope and depth of his interests were without precedent in recorded history, "his mind and personality seem to us superhuman, while the man himself mysterious and remote". Marco Rosci, however, notes that while there is much speculation regarding his life and personality, his view of the world was logical rather than mysterious, and that the empirical methods he employed were unorthodox for his time.
 4 | 
 5 | Born out of wedlock to a notary, Piero da Vinci, and a peasant woman, Caterina, in Vinci in the region of Florence, Leonardo was educated in the studio of the renowned Florentine painter Andrea del Verrocchio. Much of his earlier working life was spent in the service of Ludovico il Moro in Milan. He later worked in Rome, Bologna and Venice, and he spent his last years in France at the home awarded to him by Francis I.
 6 | 
 7 | Leonardo was, and is, renowned primarily as a painter. Among his works, the Mona Lisa is the most famous and most parodied portrait and The Last Supper the most reproduced religious painting of all time, with their fame approached only by Michelangelo's The Creation of Adam. Leonardo's drawing of the Vitruvian Man is also regarded as a cultural icon, being reproduced on items as varied as the euro coin, textbooks, and T-shirts. Perhaps fifteen of his paintings have survived, the small number because of his constant, and frequently disastrous, experimentation with new techniques. Nevertheless, these few works, together with his notebooks, which contain drawings, scientific diagrams, and his thoughts on the nature of painting, compose a contribution to later generations of artists rivalled only by that of his contemporary, Michelangelo.
 8 | 
 9 | Leonardo is revered for his technological ingenuity. He conceptualised flying machines, a type of armoured fighting vehicle, concentrated solar power, an adding machine, and the double hull, also outlining a rudimentary theory of plate tectonics. Relatively few of his designs were constructed or were even feasible during his lifetime, but some of his smaller inventions, such as an automated bobbin winder and a machine for testing the tensile strength of wire, entered the world of manufacturing unheralded. He made substantial discoveries in anatomy, civil engineering, optics, and hydrodynamics, but he did not publish his findings and they had no direct influence on later science.
10 | 
11 | Today, Leonardo is widely recognized as one of the most diversely talented individuals ever to have lived.


--------------------------------------------------------------------------------
/5. Projects/tfidf_8.txt:
--------------------------------------------------------------------------------
1 | RMS Titanic was a British passenger liner that sank in the North Atlantic Ocean in the early morning of 15 April 1912 after colliding with an iceberg during her maiden voyage from Southampton, UK, to New York City, US. The sinking resulted in the loss of more than 1,500 passengers and crew, making it one of the deadliest commercial peacetime maritime disasters in modern history. The RMS Titanic, the largest ship afloat at the time it entered service, was the second of three Olympic class ocean liners operated by the White Star Line, and was built by the Harland and Wolff shipyard in Belfast with Thomas Andrews as her naval architect. Andrews was among those lost in the sinking. On her maiden voyage, she carried 2,224 passengers and crew.
2 | 
3 | Under the command of Edward Smith, the ship's passengers included some of the wealthiest people in the world, as well as hundreds of emigrants from Great Britain and Ireland, Scandinavia and elsewhere throughout Europe seeking a new life in North America. A high-power radiotelegraph transmitter was available for sending passenger "marconigrams" and for the ship's operational use. Although Titanic had advanced safety features such as watertight compartments and remotely activated watertight doors, there were not enough lifeboats to accommodate all of those aboard due to outdated maritime safety regulations. Titanic only carried enough lifeboats for 1,178 people—slightly more than half of the number on board, and one-third her total capacity.
4 | 
5 | After leaving Southampton on 10 April 1912, Titanic called at Cherbourg in France and Queenstown (now Cobh) in Ireland before heading west to New York. On 14 April 1912, four days into the crossing and about 375 miles (600 km) south of Newfoundland, she hit an iceberg at 11:40 p.m. ship's time. The collision caused the ship's hull plates to buckle inwards along her starboard side and opened five of her sixteen watertight compartments to the sea; the ship gradually filled with water. Meanwhile, passengers and some crew members were evacuated in lifeboats, many of which were launched only partly loaded. A disproportionate number of men were left aboard because of a "women and children first" protocol followed by some of the officers loading the lifeboats. By 2:20 a.m., she broke apart and foundered, with well over one thousand people still aboard. Just under two hours after Titanic foundered, the Cunard liner RMS Carpathia arrived on the scene of the sinking, where she brought aboard an estimated 705 survivors.
6 | 
7 | The disaster was greeted with worldwide shock and outrage at the huge loss of life and the regulatory and operational failures that had led to it. Public inquiries in Britain and the United States led to major improvements in maritime safety. One of their most important legacies was the establishment in 1914 of the International Convention for the Safety of Life at Sea (SOLAS), which still governs maritime safety today. Additionally, several new wireless regulations were passed around the world in an effort to learn from the many missteps in wireless communications—which could have saved many more passengers.
8 | 
9 | The wreck of Titanic remains on the seabed, split in two and gradually disintegrating at a depth of 12,415 feet (3,784 m). Since her discovery in 1985, thousands of artefacts have been recovered and put on display at museums around the world. Titanic has become one of the most famous ships in history, her memory kept alive by numerous books, folk songs, films, exhibits, and memorials.


--------------------------------------------------------------------------------
/5. Projects/tfidf_9.txt:
--------------------------------------------------------------------------------
1 | John Davison Rockefeller Sr. (July 8, 1839 – May 23, 1937) was an American business magnate and philanthropist. He was a co-founder of the Standard Oil Company, which dominated the oil industry and was the first great U.S. business trust. Rockefeller revolutionized the petroleum industry, and along with other key contemporary industrialists such as Andrew Carnegie, defined the structure of modern philanthropy. In 1870, he founded Standard Oil Company and actively ran it until he officially retired in 1897.
2 | 
3 | Rockefeller founded Standard Oil as an Ohio partnership with his brother William along with Henry Flagler, Jabez A. Bostwick, chemist Samuel Andrews, and a silent partner, Stephen V. Harkness. As kerosene and gasoline grew in importance, Rockefeller's wealth soared and he became the world's richest man and the first American worth more than a billion dollars, controlling 90% of all oil in the United States at his peak. Adjusting for inflation, his fortune upon his death in 1937 stood at $336 billion, accounting for more than 1.5% of the national economy, making him the richest person in US history.
4 | 
5 | Rockefeller spent the last 40 years of his life in retirement at his estate, Kykuit, in Westchester County, New York. His fortune was mainly used to create the modern systematic approach of targeted philanthropy. He was able to do this through the creation of foundations that had a major effect on medicine, education and scientific research. His foundations pioneered the development of medical research and were instrumental in the eradication of hookworm and yellow fever.
6 | 
7 | Rockefeller was also the founder of both the University of Chicago and Rockefeller University and funded the establishment of Central Philippine University in the Philippines. He was a devoted Northern Baptist and supported many church-based institutions. Rockefeller adhered to total abstinence from alcohol and tobacco throughout his life. He was a faithful congregant of the Erie Street Baptist Mission Church, where he taught Sunday school, and served as a trustee, clerk, and occasional janitor. Religion was a guiding force throughout his life, and Rockefeller believed it to be the source of his success. Rockefeller was also considered a supporter of capitalism based in a perspective of social darwinism, and is often quoted saying "The growth of a large business is merely a survival of the fittest."


--------------------------------------------------------------------------------
/5. Projects/words_negative.csv:
--------------------------------------------------------------------------------
  1 | abysmal
  2 | adverse
  3 | alarming
  4 | angry
  5 | annoy
  6 | annoying
  7 | anxious
  8 | apathy
  9 | appalling
 10 | atrocious
 11 | awful
 12 | bad
 13 | banal
 14 | barbed
 15 | belligerent
 16 | bemoan
 17 | beneath
 18 | boring
 19 | broken
 20 | callous
 21 | can't
 22 | clumsy
 23 | coarse
 24 | cold
 25 | cold-hearted
 26 | collapse
 27 | confused
 28 | contradictory
 29 | contrary
 30 | corrosive
 31 | corrupt
 32 | crazy
 33 | creepy
 34 | criminal
 35 | cruel
 36 | cry
 37 | cutting
 38 | dead
 39 | decaying
 40 | damage
 41 | damaging
 42 | dastardly
 43 | deplorable
 44 | depressed
 45 | deprived
 46 | deformed
 47 | deny
 48 | despicable
 49 | detrimental
 50 | dirty
 51 | disappoint
 52 | disappointed
 53 | disease
 54 | disgusting
 55 | disheveled
 56 | dishonest
 57 | dishonorable
 58 | dismal
 59 | distress
 60 | dont
 61 | don't
 62 | dreadful
 63 | dreary
 64 | enraged
 65 | eroding
 66 | evil
 67 | fail
 68 | faulty
 69 | fear
 70 | feeble
 71 | fight
 72 | filthy
 73 | foul
 74 | frighten
 75 | frightful
 76 | frustration
 77 | fuck
 78 | gawky
 79 | ghastly
 80 | grave
 81 | greed
 82 | grim
 83 | grimace
 84 | gross
 85 | grotesque
 86 | gruesome
 87 | guilty
 88 | haggard
 89 | hard
 90 | hard-hearted
 91 | harmful
 92 | hate
 93 | hideous
 94 | homely
 95 | horrendous
 96 | horrible
 97 | hostile
 98 | hurt
 99 | hurtful
100 | icky
101 | ignore
102 | ignorant
103 | ill
104 | immature
105 | imperfect
106 | impossible
107 | inane
108 | inelegant
109 | infernal
110 | injure
111 | injurious
112 | insane
113 | insidious
114 | insipid
115 | issue
116 | issues
117 | jealous
118 | junky
119 | lose
120 | lousy
121 | lumpy
122 | malicious
123 | mean
124 | menacing
125 | messy
126 | misshapen
127 | missing
128 | misunderstood
129 | moan
130 | moldy
131 | monstrous
132 | naive
133 | nasty
134 | naughty
135 | negate
136 | negative
137 | never
138 | no
139 | nobody
140 | nondescript
141 | nonsense
142 | not
143 | noxious
144 | objectionable
145 | odious
146 | offensive
147 | old
148 | oppressive
149 | pain
150 | perturb
151 | pessimistic
152 | petty
153 | plain
154 | poisonous
155 | poop
156 | poor
157 | prejudice
158 | problem
159 | questionable
160 | quirky
161 | quit
162 | reject
163 | renege
164 | repellant
165 | reptilian
166 | repulsive
167 | repugnant
168 | revenge
169 | revolting
170 | rocky
171 | rotten
172 | rude
173 | ruthless
174 | sad
175 | savage
176 | scare
177 | scary
178 | scream
179 | severe
180 | shit
181 | shoddy
182 | shocking
183 | sick
184 | sickening
185 | sinister
186 | slimy
187 | smelly
188 | sobbing
189 | sorry
190 | spiteful
191 | sticky
192 | stinky
193 | stormy
194 | stressful
195 | stuck
196 | stupid
197 | substandard
198 | suck
199 | sucks
200 | suspect
201 | suspicious
202 | tense
203 | terrible
204 | terrifying
205 | threatening
206 | ugly
207 | undermine
208 | unfair
209 | unfavorable
210 | unhappy
211 | unhealthy
212 | uninspired
213 | unjust
214 | unlucky
215 | unpleasant
216 | upset
217 | unsatisfactory
218 | unsightly
219 | untoward
220 | unwanted
221 | unwelcome
222 | unwholesome
223 | unwieldy
224 | unwise
225 | upset
226 | vice
227 | vicious
228 | vile
229 | villainous
230 | vindictive
231 | wary
232 | weary
233 | wicked
234 | woeful
235 | worse
236 | worst
237 | worthless
238 | wound
239 | yell
240 | yucky
241 | zero
242 | 


--------------------------------------------------------------------------------
/5. Projects/words_positive.csv:
--------------------------------------------------------------------------------
  1 | absolutely
  2 | adorable
  3 | accepted
  4 | acclaimed
  5 | accomplish
  6 | accomplishment
  7 | achievement
  8 | action
  9 | active
 10 | admire
 11 | adventure
 12 | affirmative
 13 | affluent
 14 | agree
 15 | agreeable
 16 | amazing
 17 | angelic
 18 | appealing
 19 | approve
 20 | aptitude
 21 | attractive
 22 | awesome
 23 | beaming
 24 | beautiful
 25 | believe
 26 | beneficial
 27 | bliss
 28 | bountiful
 29 | bounty
 30 | brave
 31 | bravo
 32 | brilliant
 33 | bubbly
 34 | calm
 35 | celebrated
 36 | certain
 37 | champ
 38 | champion
 39 | charming
 40 | cheery
 41 | choice
 42 | classic
 43 | classical
 44 | clean
 45 | commend
 46 | composed
 47 | congratulation
 48 | constant
 49 | cool
 50 | courageous
 51 | creative
 52 | cute
 53 | dazzling
 54 | delight
 55 | delightful
 56 | distinguished
 57 | divine
 58 | earnest
 59 | easy
 60 | ecstatic
 61 | effective
 62 | effervescent
 63 | efficient
 64 | effortless
 65 | electrifying
 66 | elegant
 67 | enchanting
 68 | encouraging
 69 | endorsed
 70 | energetic
 71 | energized
 72 | engaging
 73 | enthusiastic
 74 | essential
 75 | esteemed
 76 | ethical
 77 | excellent
 78 | exceptional
 79 | exciting
 80 | exquisite
 81 | fabulous
 82 | fair
 83 | familiar
 84 | famous
 85 | fantastic
 86 | favorable
 87 | fetching
 88 | fine
 89 | fitting
 90 | flourishing
 91 | fortunate
 92 | free
 93 | fresh
 94 | friendly
 95 | fun
 96 | funny
 97 | generous
 98 | genius
 99 | genuine
100 | giving
101 | glamorous
102 | glowing
103 | good
104 | gorgeous
105 | graceful
106 | great
107 | green
108 | grin
109 | growing
110 | handsome
111 | happy
112 | harmonious
113 | healing
114 | healthy
115 | hearty
116 | heavenly
117 | honest
118 | honorable
119 | honored
120 | hug
121 | idea
122 | ideal
123 | imaginative
124 | imagine
125 | impressive
126 | independent
127 | innovate
128 | innovative
129 | instant
130 | instantaneous
131 | instinctive
132 | intuitive
133 | intellectual
134 | intelligent
135 | inventive
136 | jovial
137 | joy
138 | jubilant
139 | keen
140 | kind
141 | knowing
142 | knowledgeable
143 | laugh
144 | legendary
145 | light
146 | like
147 | learned
148 | lively
149 | love
150 | lovely
151 | lucid
152 | lucky
153 | luminous
154 | marvelous
155 | masterful
156 | meaningful
157 | merit
158 | meritorious
159 | miraculous
160 | motivating
161 | moving
162 | natural
163 | nice
164 | novel
165 | now
166 | nurturing
167 | nutritious
168 | okay
169 | one
170 | one-hundred percent
171 | open
172 | optimistic
173 | paradise
174 | perfect
175 | phenomenal
176 | pleasurable
177 | plentiful
178 | pleasant
179 | poised
180 | polished
181 | popular
182 | positive
183 | powerful
184 | prepared
185 | pretty
186 | principled
187 | productive
188 | progress
189 | prominent
190 | protected
191 | proud
192 | quality
193 | quick
194 | quiet
195 | ready
196 | reassuring
197 | recommend
198 | refined
199 | refreshing
200 | rejoice
201 | reliable
202 | remarkable
203 | resounding
204 | respected
205 | restored
206 | reward
207 | rewarding
208 | right
209 | robust
210 | safe
211 | satisfactory
212 | secure
213 | seemly
214 | simple
215 | skilled
216 | skillful
217 | smile
218 | soulful
219 | sparkling
220 | special
221 | spirited
222 | spiritual
223 | stirring
224 | stupendous
225 | stunning
226 | success
227 | successful
228 | sunny
229 | super
230 | superb
231 | supporting
232 | surprising
233 | terrific
234 | thorough
235 | thrilling
236 | thriving
237 | tops
238 | tranquil
239 | transforming
240 | transformative
241 | trusting
242 | truthful
243 | unreal
244 | unwavering
245 | up
246 | upbeat
247 | upright
248 | upstanding
249 | valued
250 | very
251 | vibrant
252 | victorious
253 | victory
254 | vigorous
255 | virtuous
256 | vital
257 | vivacious
258 | wealthy
259 | welcome
260 | well
261 | whole
262 | wholesome
263 | willing
264 | wonderful
265 | wondrous
266 | worthy
267 | wow
268 | yes
269 | yummy
270 | zeal
271 | zealous
272 | 


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
 1 | # Introduction-to-NLP
 2 | Lectures for Udemy - "Introduction to Natural Language Processing"
 3 | 
 4 | 
 5 | This is the repository for the Udemy course - "Introduction to Natural Language Processing".
 6 | You will find the accompanying Jupyter Notebooks for the course along with necessary data.
 7 | You can download the .ipynb files onto your own computer for convenience.
 8 | 
 9 | Enjoy the course and thank you for enrolling!
10 | 
11 | Note: These Notebooks are free and always will be
12 | 


--------------------------------------------------------------------------------