├── .gitignore ├── 1. Quick Python Refresher ├── 1. Lists.ipynb ├── 2. Dictionaries.ipynb ├── 3. Loops and Conditionals.ipynb └── 4. Functions.ipynb ├── 2. NLTK and the Basics ├── 1. Counting Text.ipynb ├── 2. Example - Words Per Sentence Trends.ipynb ├── 3. Frequency Distribution.ipynb ├── 4. Conditional Frequency Distribution.ipynb ├── 5. Example - Informative Words.ipynb ├── 6. Bigrams.ipynb └── 7. Regular Expressions.ipynb ├── 3. Tokenization, Tagging, Chunking ├── 1. Tokenization.ipynb ├── 2. Normalization.ipynb ├── 3. Part of Speech Tagging.ipynb ├── 4. Example - Multiple Parts of Speech.ipynb ├── 5. Example - Choices.ipynb ├── 6. Chunking.ipynb ├── 7. Example - Named Entity Recognition.ipynb └── example.txt ├── 4. Custom Sources ├── 1. Text File.ipynb ├── 2. HTML.ipynb ├── 3. URL.ipynb ├── 4. CSV (Spreadsheet).ipynb ├── 5. Exporting.ipynb ├── 6. NLTK Resources.ipynb ├── 7. Example - Remove Stopwords.ipynb ├── dec_independence.txt ├── export.txt └── reviews.csv ├── 5. Projects ├── 1. Sentiment Analysis.ipynb ├── 2. Gender Prediction.ipynb ├── 3. Term Frequency, Inverse Document Frequency.ipynb ├── reviews.csv ├── tfidf_1.txt ├── tfidf_10.txt ├── tfidf_2.txt ├── tfidf_3.txt ├── tfidf_4.txt ├── tfidf_5.txt ├── tfidf_6.txt ├── tfidf_7.txt ├── tfidf_8.txt ├── tfidf_9.txt ├── words_negative.csv └── words_positive.csv └── README.md /.gitignore: -------------------------------------------------------------------------------- 1 | .ipynb_checkpoints 2 | */.ipynb_checkpoints 3 | -------------------------------------------------------------------------------- /1. Quick Python Refresher/1. Lists.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Python Refresher - Lists" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "A Python list stores a comma-separated sequence of values. In our case, these values will be strings and numbers.\n" 15 | ] 16 | }, 17 | { 18 | "cell_type": "code", 19 | "execution_count": 1, 20 | "metadata": { 21 | "collapsed": true 22 | }, 23 | "outputs": [], 24 | "source": [ 25 | "mylist = [\"a\",\"b\",\"c\"]" 26 | ] 27 | }, 28 | { 29 | "cell_type": "code", 30 | "execution_count": 2, 31 | "metadata": { 32 | "collapsed": false 33 | }, 34 | "outputs": [ 35 | { 36 | "data": { 37 | "text/plain": [ 38 | "['a', 'b', 'c']" 39 | ] 40 | }, 41 | "execution_count": 2, 42 | "metadata": {}, 43 | "output_type": "execute_result" 44 | } 45 | ], 46 | "source": [ 47 | "mylist" 48 | ] 49 | }, 50 | { 51 | "cell_type": "code", 52 | "execution_count": 3, 53 | "metadata": { 54 | "collapsed": false 55 | }, 56 | "outputs": [ 57 | { 58 | "data": { 59 | "text/plain": [ 60 | "[1, 2, 3, 4, 5]" 61 | ] 62 | }, 63 | "execution_count": 3, 64 | "metadata": {}, 65 | "output_type": "execute_result" 66 | } 67 | ], 68 | "source": [ 69 | "mylist2 = [1,2,3,4,5]\n", 70 | "mylist2" 71 | ] 72 | }, 73 | { 74 | "cell_type": "markdown", 75 | "metadata": {}, 76 | "source": [ 77 | "Each item in the list has a position, or index. By using a list index, you can get back an individual list item.\n", 78 | "\n", 79 | "Remember that in programming, counting starts at 0, so to get the first item, we would call index 0."
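As a quick supplement to the indexing cells that follow: lists are also mutable, so you can grow and inspect them in place. A minimal sketch using only standard Python; the values are illustrative.

```python
mylist = ["a", "b", "c"]

mylist.append("d")    # add an item to the end of the list
print(len(mylist))    # 4 -- the number of items in the list
print("b" in mylist)  # True -- membership test
print(mylist[-1])     # 'd' -- a negative index counts from the end
```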
80 | ] 81 | }, 82 | { 83 | "cell_type": "code", 84 | "execution_count": 4, 85 | "metadata": { 86 | "collapsed": false 87 | }, 88 | "outputs": [ 89 | { 90 | "data": { 91 | "text/plain": [ 92 | "'a'" 93 | ] 94 | }, 95 | "execution_count": 4, 96 | "metadata": {}, 97 | "output_type": "execute_result" 98 | } 99 | ], 100 | "source": [ 101 | "mylist[0]" 102 | ] 103 | }, 104 | { 105 | "cell_type": "code", 106 | "execution_count": 5, 107 | "metadata": { 108 | "collapsed": false 109 | }, 110 | "outputs": [ 111 | { 112 | "data": { 113 | "text/plain": [ 114 | "1" 115 | ] 116 | }, 117 | "execution_count": 5, 118 | "metadata": {}, 119 | "output_type": "execute_result" 120 | } 121 | ], 122 | "source": [ 123 | "mylist2[0]" 124 | ] 125 | }, 126 | { 127 | "cell_type": "markdown", 128 | "metadata": {}, 129 | "source": [ 130 | "We can also use a range of indexes to get back a slice of our list." 131 | ] 132 | }, 133 | { 134 | "cell_type": "code", 135 | "execution_count": 6, 136 | "metadata": { 137 | "collapsed": false 138 | }, 139 | "outputs": [ 140 | { 141 | "data": { 142 | "text/plain": [ 143 | "[1, 2]" 144 | ] 145 | }, 146 | "execution_count": 6, 147 | "metadata": {}, 148 | "output_type": "execute_result" 149 | } 150 | ], 151 | "source": [ 152 | "mylist2[0:2]" 153 | ] 154 | }, 155 | { 156 | "cell_type": "markdown", 157 | "metadata": {}, 158 | "source": [ 159 | "The first number tells us where to start, while the second tells us where to end and is exclusive. If we don't enter the first number, we will get back the first x items, where x is the second index number we provide." 160 | ] 161 | }, 162 | { 163 | "cell_type": "code", 164 | "execution_count": 7, 165 | "metadata": { 166 | "collapsed": false 167 | }, 168 | "outputs": [ 169 | { 170 | "data": { 171 | "text/plain": [ 172 | "['a', 'b']" 173 | ] 174 | }, 175 | "execution_count": 7, 176 | "metadata": {}, 177 | "output_type": "execute_result" 178 | } 179 | ], 180 | "source": [ 181 | "mylist[:2]" 182 | ] 183 | }, 184 | { 185 | "cell_type": "code", 186 | "execution_count": 8, 187 | "metadata": { 188 | "collapsed": false 189 | }, 190 | "outputs": [ 191 | { 192 | "data": { 193 | "text/plain": [ 194 | "[1, 2, 3]" 195 | ] 196 | }, 197 | "execution_count": 8, 198 | "metadata": {}, 199 | "output_type": "execute_result" 200 | } 201 | ], 202 | "source": [ 203 | "mylist2[:3]" 204 | ] 205 | }, 206 | { 207 | "cell_type": "markdown", 208 | "metadata": {}, 209 | "source": [ 210 | "We can also slice from the end of a list by using a negative index, as follows." 211 | ] 212 | }, 213 | { 214 | "cell_type": "code", 215 | "execution_count": 9, 216 | "metadata": { 217 | "collapsed": false 218 | }, 219 | "outputs": [ 220 | { 221 | "data": { 222 | "text/plain": [ 223 | "['b', 'c']" 224 | ] 225 | }, 226 | "execution_count": 9, 227 | "metadata": {}, 228 | "output_type": "execute_result" 229 | } 230 | ], 231 | "source": [ 232 | "mylist[-2:]" 233 | ] 234 | } 235 | ], 236 | "metadata": { 237 | "kernelspec": { 238 | "display_name": "Python 2", 239 | "language": "python", 240 | "name": "python2" 241 | }, 242 | "language_info": { 243 | "codemirror_mode": { 244 | "name": "ipython", 245 | "version": 2 246 | }, 247 | "file_extension": ".py", 248 | "mimetype": "text/x-python", 249 | "name": "python", 250 | "nbconvert_exporter": "python", 251 | "pygments_lexer": "ipython2", 252 | "version": "2.7.10" 253 | } 254 | }, 255 | "nbformat": 4, 256 | "nbformat_minor": 0 257 | } 258 | -------------------------------------------------------------------------------- /1. Quick Python Refresher/2. 
Dictionaries.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Python Refresher - Dictionaries" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "A Python dictionary stores key-value pairs." 15 | ] 16 | }, 17 | { 18 | "cell_type": "code", 19 | "execution_count": 1, 20 | "metadata": { 21 | "collapsed": false 22 | }, 23 | "outputs": [], 24 | "source": [ 25 | "d = {\n", 26 | " 'Python': 'programming', \n", 27 | " 'English': \"natural\", \n", 28 | " 'French': 'natural', \n", 29 | " 'Ruby' : 'programming', \n", 30 | " 'Javascript' : 'programming'\n", 31 | "}" 32 | ] 33 | }, 34 | { 35 | "cell_type": "code", 36 | "execution_count": 2, 37 | "metadata": { 38 | "collapsed": false 39 | }, 40 | "outputs": [ 41 | { 42 | "data": { 43 | "text/plain": [ 44 | "{'English': 'natural',\n", 45 | " 'French': 'natural',\n", 46 | " 'Javascript': 'programming',\n", 47 | " 'Python': 'programming',\n", 48 | " 'Ruby': 'programming'}" 49 | ] 50 | }, 51 | "execution_count": 2, 52 | "metadata": {}, 53 | "output_type": "execute_result" 54 | } 55 | ], 56 | "source": [ 57 | "d" 58 | ] 59 | }, 60 | { 61 | "cell_type": "code", 62 | "execution_count": 3, 63 | "metadata": { 64 | "collapsed": false 65 | }, 66 | "outputs": [ 67 | { 68 | "data": { 69 | "text/plain": [ 70 | "dict" 71 | ] 72 | }, 73 | "execution_count": 3, 74 | "metadata": {}, 75 | "output_type": "execute_result" 76 | } 77 | ], 78 | "source": [ 79 | "type(d)" 80 | ] 81 | }, 82 | { 83 | "cell_type": "code", 84 | "execution_count": 4, 85 | "metadata": { 86 | "collapsed": false 87 | }, 88 | "outputs": [ 89 | { 90 | "data": { 91 | "text/plain": [ 92 | "'programming'" 93 | ] 94 | }, 95 | "execution_count": 4, 96 | "metadata": {}, 97 | "output_type": "execute_result" 98 | } 99 | ], 100 | "source": [ 101 | "d[\"Python\"]" 102 | ] 103 | }, 104 | { 105 | "cell_type": "code", 106 | "execution_count": 5, 107 | "metadata": { 108 | "collapsed": false 109 | }, 110 | "outputs": [ 111 | { 112 | "data": { 113 | "text/plain": [ 114 | "'natural'" 115 | ] 116 | }, 117 | "execution_count": 5, 118 | "metadata": {}, 119 | "output_type": "execute_result" 120 | } 121 | ], 122 | "source": [ 123 | "d[\"English\"]" 124 | ] 125 | }, 126 | { 127 | "cell_type": "markdown", 128 | "metadata": {}, 129 | "source": [ 130 | "We can add new entries to a dictionary." 131 | ] 132 | }, 133 | { 134 | "cell_type": "code", 135 | "execution_count": 6, 136 | "metadata": { 137 | "collapsed": true 138 | }, 139 | "outputs": [], 140 | "source": [ 141 | "d['Scala'] = 'programming'" 142 | ] 143 | }, 144 | { 145 | "cell_type": "code", 146 | "execution_count": 7, 147 | "metadata": { 148 | "collapsed": false 149 | }, 150 | "outputs": [ 151 | { 152 | "data": { 153 | "text/plain": [ 154 | "{'English': 'natural',\n", 155 | " 'French': 'natural',\n", 156 | " 'Javascript': 'programming',\n", 157 | " 'Python': 'programming',\n", 158 | " 'Ruby': 'programming',\n", 159 | " 'Scala': 'programming'}" 160 | ] 161 | }, 162 | "execution_count": 7, 163 | "metadata": {}, 164 | "output_type": "execute_result" 165 | } 166 | ], 167 | "source": [ 168 | "d" 169 | ] 170 | }, 171 | { 172 | "cell_type": "markdown", 173 | "metadata": {}, 174 | "source": [ 175 | "Values can also be numbers."
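A hedged aside before the lookup cells below: indexing a missing key raises a `KeyError`, so `dict.get` with a default is often the safer lookup. The counting pattern at the end is a common precursor to NLTK's `FreqDist`, which appears later in the course; the variable names here are ours.

```python
d = {"Python": "programming", "English": "natural"}

print(d.get("Python", "unknown"))  # 'programming'
print(d.get("Latin", "unknown"))   # 'unknown' -- no KeyError raised

# Counting occurrences with a dictionary:
counts = {}
for word in ["whale", "boat", "whale"]:
    counts[word] = counts.get(word, 0) + 1
print(counts)  # {'whale': 2, 'boat': 1}
```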
176 | ] 177 | }, 178 | { 179 | "cell_type": "code", 180 | "execution_count": 8, 181 | "metadata": { 182 | "collapsed": true 183 | }, 184 | "outputs": [], 185 | "source": [ 186 | "d[\"languages known\"] = 3" 187 | ] 188 | }, 189 | { 190 | "cell_type": "code", 191 | "execution_count": 9, 192 | "metadata": { 193 | "collapsed": false 194 | }, 195 | "outputs": [ 196 | { 197 | "data": { 198 | "text/plain": [ 199 | "{'English': 'natural',\n", 200 | " 'French': 'natural',\n", 201 | " 'Javascript': 'programming',\n", 202 | " 'Python': 'programming',\n", 203 | " 'Ruby': 'programming',\n", 204 | " 'Scala': 'programming',\n", 205 | " 'languages known': 3}" 206 | ] 207 | }, 208 | "execution_count": 9, 209 | "metadata": {}, 210 | "output_type": "execute_result" 211 | } 212 | ], 213 | "source": [ 214 | "d" 215 | ] 216 | }, 217 | { 218 | "cell_type": "code", 219 | "execution_count": 10, 220 | "metadata": { 221 | "collapsed": false 222 | }, 223 | "outputs": [ 224 | { 225 | "data": { 226 | "text/plain": [ 227 | "dict_keys(['Python', 'English', 'French', 'Ruby', 'Javascript', 'Scala', 'languages known'])" 228 | ] 229 | }, 230 | "execution_count": 10, 231 | "metadata": {}, 232 | "output_type": "execute_result" 233 | } 234 | ], 235 | "source": [ 236 | "d.keys()" 237 | ] 238 | }, 239 | { 240 | "cell_type": "code", 241 | "execution_count": 11, 242 | "metadata": { 243 | "collapsed": false 244 | }, 245 | "outputs": [ 246 | { 247 | "data": { 248 | "text/plain": [ 249 | "dict_values(['programming', 'natural', 'natural', 'programming', 'programming', 'programming', 3])" 250 | ] 251 | }, 252 | "execution_count": 11, 253 | "metadata": {}, 254 | "output_type": "execute_result" 255 | } 256 | ], 257 | "source": [ 258 | "d.values()" 259 | ] 260 | }, 261 | { 262 | "cell_type": "code", 263 | "execution_count": 12, 264 | "metadata": { 265 | "collapsed": false 266 | }, 267 | "outputs": [ 268 | { 269 | "data": { 270 | "text/plain": [ 271 | "7" 272 | ] 273 | }, 274 | "execution_count": 12, 275 | "metadata": {}, 276 | "output_type": "execute_result" 277 | } 278 | ], 279 | "source": [ 280 | "len(d)" 281 | ] 282 | } 283 | ], 284 | "metadata": { 285 | "kernelspec": { 286 | "display_name": "Python 3", 287 | "language": "python", 288 | "name": "python3" 289 | }, 290 | "language_info": { 291 | "codemirror_mode": { 292 | "name": "ipython", 293 | "version": 3 294 | }, 295 | "file_extension": ".py", 296 | "mimetype": "text/x-python", 297 | "name": "python", 298 | "nbconvert_exporter": "python", 299 | "pygments_lexer": "ipython3", 300 | "version": "3.6.0" 301 | } 302 | }, 303 | "nbformat": 4, 304 | "nbformat_minor": 2 305 | } 306 | -------------------------------------------------------------------------------- /1. Quick Python Refresher/3. 
Loops and Conditionals.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Python Refresher - Loops and Conditionals" 8 | ] 9 | }, 10 | { 11 | "cell_type": "code", 12 | "execution_count": 2, 13 | "metadata": { 14 | "collapsed": true 15 | }, 16 | "outputs": [], 17 | "source": [ 18 | "mylist = [\"Python\",\"Ruby\",\"Javascript\",\"Scala\"]" 19 | ] 20 | }, 21 | { 22 | "cell_type": "code", 23 | "execution_count": 3, 24 | "metadata": { 25 | "collapsed": false 26 | }, 27 | "outputs": [ 28 | { 29 | "data": { 30 | "text/plain": [ 31 | "['Python', 'Ruby', 'Javascript', 'Scala']" 32 | ] 33 | }, 34 | "execution_count": 3, 35 | "metadata": {}, 36 | "output_type": "execute_result" 37 | } 38 | ], 39 | "source": [ 40 | "mylist" 41 | ] 42 | }, 43 | { 44 | "cell_type": "markdown", 45 | "metadata": {}, 46 | "source": [ 47 | "With a for loop, we can iterate through this list." 48 | ] 49 | }, 50 | { 51 | "cell_type": "code", 52 | "execution_count": 4, 53 | "metadata": { 54 | "collapsed": false 55 | }, 56 | "outputs": [ 57 | { 58 | "name": "stdout", 59 | "output_type": "stream", 60 | "text": [ 61 | "Python\n", 62 | "Ruby\n", 63 | "Javascript\n", 64 | "Scala\n" 65 | ] 66 | } 67 | ], 68 | "source": [ 69 | "for item in mylist:\n", 70 | " print(item)" 71 | ] 72 | }, 73 | { 74 | "cell_type": "markdown", 75 | "metadata": {}, 76 | "source": [ 77 | "We can also write a for loop this way, known as a list comprehension when used with a list..." 78 | ] 79 | }, 80 | { 81 | "cell_type": "code", 82 | "execution_count": 5, 83 | "metadata": { 84 | "collapsed": false 85 | }, 86 | "outputs": [ 87 | { 88 | "data": { 89 | "text/plain": [ 90 | "['Python', 'Ruby', 'Javascript', 'Scala']" 91 | ] 92 | }, 93 | "execution_count": 5, 94 | "metadata": {}, 95 | "output_type": "execute_result" 96 | } 97 | ], 98 | "source": [ 99 | "[item for item in mylist]" 100 | ] 101 | }, 102 | { 103 | "cell_type": "markdown", 104 | "metadata": {}, 105 | "source": [ 106 | "We can use for loops to carry out actions."
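One extension worth knowing beyond the cells below (a supplement, not part of the original notebook): a list comprehension can include an `if` clause, combining a loop and a conditional in a single expression.

```python
mylist = ["Python", "Ruby", "Javascript", "Scala"]

# Keep only the names shorter than six characters.
short_names = [item for item in mylist if len(item) < 6]
print(short_names)  # ['Ruby', 'Scala']
```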
107 | ] 108 | }, 109 | { 110 | "cell_type": "code", 111 | "execution_count": 6, 112 | "metadata": { 113 | "collapsed": false 114 | }, 115 | "outputs": [ 116 | { 117 | "name": "stdout", 118 | "output_type": "stream", 119 | "text": [ 120 | "Python is a fun programming language.\n", 121 | "Ruby is a fun programming language.\n", 122 | "Javascript is a fun programming language.\n", 123 | "Scala is a fun programming language.\n" 124 | ] 125 | } 126 | ], 127 | "source": [ 128 | "for item in mylist:\n", 129 | " print(item + \" is a fun programming language.\")" 130 | ] 131 | }, 132 | { 133 | "cell_type": "code", 134 | "execution_count": 7, 135 | "metadata": { 136 | "collapsed": false 137 | }, 138 | "outputs": [], 139 | "source": [ 140 | "newlist = [item + \" is a fun programming language.\" for item in mylist]" 141 | ] 142 | }, 143 | { 144 | "cell_type": "code", 145 | "execution_count": 8, 146 | "metadata": { 147 | "collapsed": false 148 | }, 149 | "outputs": [ 150 | { 151 | "data": { 152 | "text/plain": [ 153 | "['Python is a fun programming language.',\n", 154 | " 'Ruby is a fun programming language.',\n", 155 | " 'Javascript is a fun programming language.',\n", 156 | " 'Scala is a fun programming language.']" 157 | ] 158 | }, 159 | "execution_count": 8, 160 | "metadata": {}, 161 | "output_type": "execute_result" 162 | } 163 | ], 164 | "source": [ 165 | "newlist" 166 | ] 167 | }, 168 | { 169 | "cell_type": "markdown", 170 | "metadata": {}, 171 | "source": [ 172 | "We can use if statements to look for special conditions." 173 | ] 174 | }, 175 | { 176 | "cell_type": "code", 177 | "execution_count": 9, 178 | "metadata": { 179 | "collapsed": true 180 | }, 181 | "outputs": [], 182 | "source": [ 183 | "x = 10" 184 | ] 185 | }, 186 | { 187 | "cell_type": "code", 188 | "execution_count": 10, 189 | "metadata": { 190 | "collapsed": false 191 | }, 192 | "outputs": [ 193 | { 194 | "name": "stdout", 195 | "output_type": "stream", 196 | "text": [ 197 | "It looks like x is greater than 5\n" 198 | ] 199 | } 200 | ], 201 | "source": [ 202 | "if x > 5:\n", 203 | " print(\"It looks like x is greater than 5\")" 204 | ] 205 | }, 206 | { 207 | "cell_type": "code", 208 | "execution_count": 12, 209 | "metadata": { 210 | "collapsed": false 211 | }, 212 | "outputs": [ 213 | { 214 | "name": "stdout", 215 | "output_type": "stream", 216 | "text": [ 217 | "Hello\n" 218 | ] 219 | } 220 | ], 221 | "source": [ 222 | "if x > 5 and x < 20:\n", 223 | " print(\"Hello\")\n" 224 | ] 225 | }, 226 | { 227 | "cell_type": "code", 228 | "execution_count": 13, 229 | "metadata": { 230 | "collapsed": true 231 | }, 232 | "outputs": [], 233 | "source": [ 234 | "number_list = [1,2,3,4,5,6,7,8,9,10]" 235 | ] 236 | }, 237 | { 238 | "cell_type": "code", 239 | "execution_count": 14, 240 | "metadata": { 241 | "collapsed": false 242 | }, 243 | "outputs": [ 244 | { 245 | "name": "stdout", 246 | "output_type": "stream", 247 | "text": [ 248 | "1 is odd\n", 249 | "2 is even\n", 250 | "3 is odd\n", 251 | "4 is even\n", 252 | "5 is odd\n", 253 | "6 is even\n", 254 | "7 is the best number!\n", 255 | "8 is even\n", 256 | "9 is odd\n", 257 | "10 is even\n" 258 | ] 259 | } 260 | ], 261 | "source": [ 262 | "for number in number_list:\n", 263 | " if number%2 == 0:\n", 264 | " print(str(number) + \" is even\")\n", 265 | " elif number == 7:\n", 266 | " print(str(number) + \" is the best number!\")\n", 267 | " else:\n", 268 | " print(str(number) + \" is odd\")" 269 | ] 270 | } 271 | ], 272 | "metadata": { 273 | "kernelspec": { 274 | "display_name": "Python 3", 275 
| "language": "python", 276 | "name": "python3" 277 | }, 278 | "language_info": { 279 | "codemirror_mode": { 280 | "name": "ipython", 281 | "version": 3 282 | }, 283 | "file_extension": ".py", 284 | "mimetype": "text/x-python", 285 | "name": "python", 286 | "nbconvert_exporter": "python", 287 | "pygments_lexer": "ipython3", 288 | "version": "3.6.0" 289 | } 290 | }, 291 | "nbformat": 4, 292 | "nbformat_minor": 2 293 | } 294 | -------------------------------------------------------------------------------- /1. Quick Python Refresher/4. Functions.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "Is some of our projects we will be using fuctions to repetedly call parts of our code.\n", 8 | "\n", 9 | "We can define a function by giving it a name, and telling the function any values we plan on passing along." 10 | ] 11 | }, 12 | { 13 | "cell_type": "code", 14 | "execution_count": 1, 15 | "metadata": { 16 | "collapsed": true 17 | }, 18 | "outputs": [], 19 | "source": [ 20 | "def my_function(something):\n", 21 | " return something" 22 | ] 23 | }, 24 | { 25 | "cell_type": "code", 26 | "execution_count": 2, 27 | "metadata": { 28 | "collapsed": false 29 | }, 30 | "outputs": [ 31 | { 32 | "data": { 33 | "text/plain": [ 34 | "'hello'" 35 | ] 36 | }, 37 | "execution_count": 2, 38 | "metadata": {}, 39 | "output_type": "execute_result" 40 | } 41 | ], 42 | "source": [ 43 | "my_function(\"hello\")" 44 | ] 45 | }, 46 | { 47 | "cell_type": "code", 48 | "execution_count": 3, 49 | "metadata": { 50 | "collapsed": false 51 | }, 52 | "outputs": [ 53 | { 54 | "data": { 55 | "text/plain": [ 56 | "2" 57 | ] 58 | }, 59 | "execution_count": 3, 60 | "metadata": {}, 61 | "output_type": "execute_result" 62 | } 63 | ], 64 | "source": [ 65 | "my_function(2)" 66 | ] 67 | }, 68 | { 69 | "cell_type": "code", 70 | "execution_count": 4, 71 | "metadata": { 72 | "collapsed": false 73 | }, 74 | "outputs": [], 75 | "source": [ 76 | "def information(word):\n", 77 | " return \"Word: \" + str(word) + \", Length: \" + str(len(word))" 78 | ] 79 | }, 80 | { 81 | "cell_type": "code", 82 | "execution_count": 5, 83 | "metadata": { 84 | "collapsed": false 85 | }, 86 | "outputs": [ 87 | { 88 | "data": { 89 | "text/plain": [ 90 | "'Word: hello, Length: 5'" 91 | ] 92 | }, 93 | "execution_count": 5, 94 | "metadata": {}, 95 | "output_type": "execute_result" 96 | } 97 | ], 98 | "source": [ 99 | "information(\"hello\")" 100 | ] 101 | }, 102 | { 103 | "cell_type": "code", 104 | "execution_count": 6, 105 | "metadata": { 106 | "collapsed": false 107 | }, 108 | "outputs": [ 109 | { 110 | "data": { 111 | "text/plain": [ 112 | "'Word: language, Length: 8'" 113 | ] 114 | }, 115 | "execution_count": 6, 116 | "metadata": {}, 117 | "output_type": "execute_result" 118 | } 119 | ], 120 | "source": [ 121 | "information(\"language\")" 122 | ] 123 | } 124 | ], 125 | "metadata": { 126 | "kernelspec": { 127 | "display_name": "Python 3", 128 | "language": "python", 129 | "name": "python3" 130 | }, 131 | "language_info": { 132 | "codemirror_mode": { 133 | "name": "ipython", 134 | "version": 3 135 | }, 136 | "file_extension": ".py", 137 | "mimetype": "text/x-python", 138 | "name": "python", 139 | "nbconvert_exporter": "python", 140 | "pygments_lexer": "ipython3", 141 | "version": "3.6.0" 142 | } 143 | }, 144 | "nbformat": 4, 145 | "nbformat_minor": 2 146 | } 147 | 
-------------------------------------------------------------------------------- /2. NLTK and the Basics/1. Counting Text.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# NLTK and the Basics - Counting Text" 8 | ] 9 | }, 10 | { 11 | "cell_type": "code", 12 | "execution_count": 1, 13 | "metadata": { 14 | "collapsed": true 15 | }, 16 | "outputs": [], 17 | "source": [ 18 | "import nltk" 19 | ] 20 | }, 21 | { 22 | "cell_type": "markdown", 23 | "metadata": {}, 24 | "source": [ 25 | "NLTK comes with pre-packaged text data. \n", 26 | "\n", 27 | "Project Gutenberg is a group that digitizes books and literature that are mostly in the public domain. These works make great examples for practicing NLP. If you're interested in Project Gutenberg, I recommend checking out their site. http://www.gutenberg.org/wiki/Main_Page" 28 | ] 29 | }, 30 | { 31 | "cell_type": "code", 32 | "execution_count": 2, 33 | "metadata": { 34 | "collapsed": false 35 | }, 36 | "outputs": [ 37 | { 38 | "data": { 39 | "text/plain": [ 40 | "['austen-emma.txt',\n", 41 | " 'austen-persuasion.txt',\n", 42 | " 'austen-sense.txt',\n", 43 | " 'bible-kjv.txt',\n", 44 | " 'blake-poems.txt',\n", 45 | " 'bryant-stories.txt',\n", 46 | " 'burgess-busterbrown.txt',\n", 47 | " 'carroll-alice.txt',\n", 48 | " 'chesterton-ball.txt',\n", 49 | " 'chesterton-brown.txt',\n", 50 | " 'chesterton-thursday.txt',\n", 51 | " 'edgeworth-parents.txt',\n", 52 | " 'melville-moby_dick.txt',\n", 53 | " 'milton-paradise.txt',\n", 54 | " 'shakespeare-caesar.txt',\n", 55 | " 'shakespeare-hamlet.txt',\n", 56 | " 'shakespeare-macbeth.txt',\n", 57 | " 'whitman-leaves.txt']" 58 | ] 59 | }, 60 | "execution_count": 2, 61 | "metadata": {}, 62 | "output_type": "execute_result" 63 | } 64 | ], 65 | "source": [ 66 | "nltk.corpus.gutenberg.fileids()" 67 | ] 68 | }, 69 | { 70 | "cell_type": "markdown", 71 | "metadata": {}, 72 | "source": [ 73 | "Under Python 2, you will notice the letter \"u\" before every file name. The 'u' is part of the external representation of the file name, marking it as a Unicode string as opposed to a byte string; it is not part of the string itself. Under Python 3 the prefix is not shown." 74 | ] 75 | }, 76 | { 77 | "cell_type": "code", 78 | "execution_count": 3, 79 | "metadata": { 80 | "collapsed": false 81 | }, 82 | "outputs": [], 83 | "source": [ 84 | "md = nltk.corpus.gutenberg.words(\"melville-moby_dick.txt\")" 85 | ] 86 | }, 87 | { 88 | "cell_type": "code", 89 | "execution_count": 4, 90 | "metadata": { 91 | "collapsed": false 92 | }, 93 | "outputs": [ 94 | { 95 | "data": { 96 | "text/plain": [ 97 | "['[', 'Moby', 'Dick', 'by', 'Herman', 'Melville', '1851', ']']" 98 | ] 99 | }, 100 | "execution_count": 4, 101 | "metadata": {}, 102 | "output_type": "execute_result" 103 | } 104 | ], 105 | "source": [ 106 | "md[:8]" 107 | ] 108 | }, 109 | { 110 | "cell_type": "markdown", 111 | "metadata": {}, 112 | "source": [ 113 | "We can count how many times a word appears in the book."
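A caveat about the counts below: `md.count("whale")` is case-sensitive, so occurrences like "Whale" at the start of a sentence are missed. A minimal case-insensitive sketch, assuming the Gutenberg corpus data has been downloaded (e.g. via `nltk.download("gutenberg")`):

```python
import nltk

md = nltk.corpus.gutenberg.words("melville-moby_dick.txt")

# Lowercase each token before comparing, so "Whale" and "whale" both count.
whale_count = sum(1 for word in md if word.lower() == "whale")
print(whale_count)  # >= md.count("whale"), which matches only the exact casing
```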
114 | ] 115 | }, 116 | { 117 | "cell_type": "code", 118 | "execution_count": 5, 119 | "metadata": { 120 | "collapsed": false 121 | }, 122 | "outputs": [ 123 | { 124 | "data": { 125 | "text/plain": [ 126 | "906" 127 | ] 128 | }, 129 | "execution_count": 5, 130 | "metadata": {}, 131 | "output_type": "execute_result" 132 | } 133 | ], 134 | "source": [ 135 | "md.count(\"whale\")" 136 | ] 137 | }, 138 | { 139 | "cell_type": "code", 140 | "execution_count": 6, 141 | "metadata": { 142 | "collapsed": false 143 | }, 144 | "outputs": [ 145 | { 146 | "data": { 147 | "text/plain": [ 148 | "330" 149 | ] 150 | }, 151 | "execution_count": 6, 152 | "metadata": {}, 153 | "output_type": "execute_result" 154 | } 155 | ], 156 | "source": [ 157 | "md.count(\"boat\")" 158 | ] 159 | }, 160 | { 161 | "cell_type": "code", 162 | "execution_count": 7, 163 | "metadata": { 164 | "collapsed": false 165 | }, 166 | "outputs": [ 167 | { 168 | "data": { 169 | "text/plain": [ 170 | "501" 171 | ] 172 | }, 173 | "execution_count": 7, 174 | "metadata": {}, 175 | "output_type": "execute_result" 176 | } 177 | ], 178 | "source": [ 179 | "md.count(\"Ahab\")" 180 | ] 181 | }, 182 | { 183 | "cell_type": "code", 184 | "execution_count": 8, 185 | "metadata": { 186 | "collapsed": false 187 | }, 188 | "outputs": [ 189 | { 190 | "data": { 191 | "text/plain": [ 192 | "0" 193 | ] 194 | }, 195 | "execution_count": 8, 196 | "metadata": {}, 197 | "output_type": "execute_result" 198 | } 199 | ], 200 | "source": [ 201 | "md.count(\"laptop\")" 202 | ] 203 | }, 204 | { 205 | "cell_type": "markdown", 206 | "metadata": {}, 207 | "source": [ 208 | "We can get an idea of how long the book is by seeing how many items are in our list." 209 | ] 210 | }, 211 | { 212 | "cell_type": "code", 213 | "execution_count": 9, 214 | "metadata": { 215 | "collapsed": false 216 | }, 217 | "outputs": [ 218 | { 219 | "data": { 220 | "text/plain": [ 221 | "260819" 222 | ] 223 | }, 224 | "execution_count": 9, 225 | "metadata": {}, 226 | "output_type": "execute_result" 227 | } 228 | ], 229 | "source": [ 230 | "len(md)" 231 | ] 232 | }, 233 | { 234 | "cell_type": "markdown", 235 | "metadata": {}, 236 | "source": [ 237 | "We can see how many unique words are used in the book." 238 | ] 239 | }, 240 | { 241 | "cell_type": "code", 242 | "execution_count": 10, 243 | "metadata": { 244 | "collapsed": true 245 | }, 246 | "outputs": [], 247 | "source": [ 248 | "md_set = set(md)" 249 | ] 250 | }, 251 | { 252 | "cell_type": "code", 253 | "execution_count": 11, 254 | "metadata": { 255 | "collapsed": false 256 | }, 257 | "outputs": [ 258 | { 259 | "data": { 260 | "text/plain": [ 261 | "19317" 262 | ] 263 | }, 264 | "execution_count": 11, 265 | "metadata": {}, 266 | "output_type": "execute_result" 267 | } 268 | ], 269 | "source": [ 270 | "len(md_set)" 271 | ] 272 | }, 273 | { 274 | "cell_type": "markdown", 275 | "metadata": {}, 276 | "source": [ 277 | "We can calculate the average number of times any given word is used in the book." 
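The ratio computed in the next cells is often wrapped in a small helper so it can be reused across texts. A sketch; the function name is ours, not NLTK's:

```python
def average_uses_per_word(tokens):
    """Total tokens divided by unique tokens: mean uses per distinct word."""
    return len(tokens) / len(set(tokens))

print(average_uses_per_word(["the", "cat", "and", "the", "dog"]))  # 1.25
```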
278 | ] 279 | }, 280 | { 281 | "cell_type": "code", 282 | "execution_count": 12, 283 | "metadata": { 284 | "collapsed": true 285 | }, 286 | "outputs": [], 287 | "source": [ 288 | "from __future__ import division # only needed under Python 2, where / is integer division; harmless under Python 3" 289 | ] 290 | }, 291 | { 292 | "cell_type": "code", 293 | "execution_count": 13, 294 | "metadata": { 295 | "collapsed": false 296 | }, 297 | "outputs": [ 298 | { 299 | "data": { 300 | "text/plain": [ 301 | "13.502044830977896" 302 | ] 303 | }, 304 | "execution_count": 13, 305 | "metadata": {}, 306 | "output_type": "execute_result" 307 | } 308 | ], 309 | "source": [ 310 | "len(md)/len(md_set)" 311 | ] 312 | }, 313 | { 314 | "cell_type": "markdown", 315 | "metadata": {}, 316 | "source": [ 317 | "We can look at the book as a lists of sentences." 318 | ] 319 | }, 320 | { 321 | "cell_type": "code", 322 | "execution_count": 14, 323 | "metadata": { 324 | "collapsed": false 325 | }, 326 | "outputs": [], 327 | "source": [ 328 | "md_sents = nltk.corpus.gutenberg.sents(\"melville-moby_dick.txt\")" 329 | ] 330 | }, 331 | { 332 | "cell_type": "markdown", 333 | "metadata": {}, 334 | "source": [ 335 | "We can calculate the average number of words per sentence in the book." 336 | ] 337 | }, 338 | { 339 | "cell_type": "code", 340 | "execution_count": 15, 341 | "metadata": { 342 | "collapsed": false 343 | }, 344 | "outputs": [ 345 | { 346 | "data": { 347 | "text/plain": [ 348 | "25.928919375683467" 349 | ] 350 | }, 351 | "execution_count": 15, 352 | "metadata": {}, 353 | "output_type": "execute_result" 354 | } 355 | ], 356 | "source": [ 357 | "len(md)/len(md_sents)" 358 | ] 359 | } 360 | ], 361 | "metadata": { 362 | "kernelspec": { 363 | "display_name": "Python 3", 364 | "language": "python", 365 | "name": "python3" 366 | }, 367 | "language_info": { 368 | "codemirror_mode": { 369 | "name": "ipython", 370 | "version": 3 371 | }, 372 | "file_extension": ".py", 373 | "mimetype": "text/x-python", 374 | "name": "python", 375 | "nbconvert_exporter": "python", 376 | "pygments_lexer": "ipython3", 377 | "version": "3.6.0" 378 | } 379 | }, 380 | "nbformat": 4, 381 | "nbformat_minor": 2 382 | } 383 | -------------------------------------------------------------------------------- /2. NLTK and the Basics/4. Conditional Frequency Distribution.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# NLTK and the Basics - Conditional Frequency Distribution" 8 | ] 9 | }, 10 | { 11 | "cell_type": "code", 12 | "execution_count": 1, 13 | "metadata": { 14 | "collapsed": true 15 | }, 16 | "outputs": [], 17 | "source": [ 18 | "import nltk" 19 | ] 20 | }, 21 | { 22 | "cell_type": "markdown", 23 | "metadata": {}, 24 | "source": [ 25 | "A conditional frequency distribution counts multiple cases or conditions. 
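As a preview of what the cells below build up to, here is how a `ConditionalFreqDist` is queried once constructed. This uses the standard NLTK API on a toy input, so no corpus downloads are needed:

```python
import nltk

names = [("Group A", "Paul"), ("Group B", "Amy"), ("Group B", "Amy")]
cfd = nltk.ConditionalFreqDist(names)

print(cfd.conditions())       # ['Group A', 'Group B']
print(cfd["Group B"]["Amy"])  # 2 -- count of 'Amy' under the 'Group B' condition
cfd.tabulate()                # a condition-by-sample table of counts
```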
" 26 | ] 27 | }, 28 | { 29 | "cell_type": "code", 30 | "execution_count": 2, 31 | "metadata": { 32 | "collapsed": false 33 | }, 34 | "outputs": [], 35 | "source": [ 36 | "names = [(\"Group A\", \"Paul\"),(\"Group A\", \"Mike\"),(\"Group A\", \"Katy\"),(\"Group B\", \"Amy\"),(\"Group B\", \"Joe\"),(\"Group B\", \"Amy\")]" 37 | ] 38 | }, 39 | { 40 | "cell_type": "code", 41 | "execution_count": 3, 42 | "metadata": { 43 | "collapsed": false 44 | }, 45 | "outputs": [ 46 | { 47 | "data": { 48 | "text/plain": [ 49 | "[('Group A', 'Paul'),\n", 50 | " ('Group A', 'Mike'),\n", 51 | " ('Group A', 'Katy'),\n", 52 | " ('Group B', 'Amy'),\n", 53 | " ('Group B', 'Joe'),\n", 54 | " ('Group B', 'Amy')]" 55 | ] 56 | }, 57 | "execution_count": 3, 58 | "metadata": {}, 59 | "output_type": "execute_result" 60 | } 61 | ], 62 | "source": [ 63 | "names" 64 | ] 65 | }, 66 | { 67 | "cell_type": "markdown", 68 | "metadata": {}, 69 | "source": [ 70 | "Running a regular frequency distribution on the list..." 71 | ] 72 | }, 73 | { 74 | "cell_type": "code", 75 | "execution_count": 4, 76 | "metadata": {}, 77 | "outputs": [ 78 | { 79 | "data": { 80 | "text/plain": [ 81 | "FreqDist({('Group A', 'Katy'): 1,\n", 82 | " ('Group A', 'Mike'): 1,\n", 83 | " ('Group A', 'Paul'): 1,\n", 84 | " ('Group B', 'Amy'): 2,\n", 85 | " ('Group B', 'Joe'): 1})" 86 | ] 87 | }, 88 | "execution_count": 4, 89 | "metadata": {}, 90 | "output_type": "execute_result" 91 | } 92 | ], 93 | "source": [ 94 | "nltk.FreqDist(names)" 95 | ] 96 | }, 97 | { 98 | "cell_type": "markdown", 99 | "metadata": {}, 100 | "source": [ 101 | "Running a conditional frequency distribution on the list..." 102 | ] 103 | }, 104 | { 105 | "cell_type": "code", 106 | "execution_count": 5, 107 | "metadata": { 108 | "collapsed": false 109 | }, 110 | "outputs": [ 111 | { 112 | "data": { 113 | "text/plain": [ 114 | "ConditionalFreqDist(nltk.probability.FreqDist,\n", 115 | " {'Group A': FreqDist({'Katy': 1, 'Mike': 1, 'Paul': 1}),\n", 116 | " 'Group B': FreqDist({'Amy': 2, 'Joe': 1})})" 117 | ] 118 | }, 119 | "execution_count": 5, 120 | "metadata": {}, 121 | "output_type": "execute_result" 122 | } 123 | ], 124 | "source": [ 125 | "nltk.ConditionalFreqDist(names)" 126 | ] 127 | } 128 | ], 129 | "metadata": { 130 | "kernelspec": { 131 | "display_name": "Python 3", 132 | "language": "python", 133 | "name": "python3" 134 | }, 135 | "language_info": { 136 | "codemirror_mode": { 137 | "name": "ipython", 138 | "version": 3 139 | }, 140 | "file_extension": ".py", 141 | "mimetype": "text/x-python", 142 | "name": "python", 143 | "nbconvert_exporter": "python", 144 | "pygments_lexer": "ipython3", 145 | "version": "3.6.0" 146 | } 147 | }, 148 | "nbformat": 4, 149 | "nbformat_minor": 2 150 | } 151 | -------------------------------------------------------------------------------- /2. NLTK and the Basics/5. 
Example - Informative Words.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# NLTK and the Basics - Example: Informative Words" 8 | ] 9 | }, 10 | { 11 | "cell_type": "code", 12 | "execution_count": 1, 13 | "metadata": { 14 | "collapsed": true 15 | }, 16 | "outputs": [], 17 | "source": [ 18 | "import nltk" 19 | ] 20 | }, 21 | { 22 | "cell_type": "code", 23 | "execution_count": 2, 24 | "metadata": { 25 | "collapsed": true 26 | }, 27 | "outputs": [], 28 | "source": [ 29 | "alice = nltk.corpus.gutenberg.words(\"carroll-alice.txt\")" 30 | ] 31 | }, 32 | { 33 | "cell_type": "code", 34 | "execution_count": 3, 35 | "metadata": { 36 | "collapsed": true 37 | }, 38 | "outputs": [], 39 | "source": [ 40 | "alice_fd = nltk.FreqDist(alice)" 41 | ] 42 | }, 43 | { 44 | "cell_type": "markdown", 45 | "metadata": {}, 46 | "source": [ 47 | "Find the 100 most common words." 48 | ] 49 | }, 50 | { 51 | "cell_type": "code", 52 | "execution_count": 4, 53 | "metadata": { 54 | "collapsed": true 55 | }, 56 | "outputs": [], 57 | "source": [ 58 | "alice_fd_100 = alice_fd.most_common(100)" 59 | ] 60 | }, 61 | { 62 | "cell_type": "code", 63 | "execution_count": 5, 64 | "metadata": { 65 | "collapsed": false 66 | }, 67 | "outputs": [], 68 | "source": [ 69 | "moby = nltk.corpus.gutenberg.words(\"melville-moby_dick.txt\")\n", 70 | "moby_fd = nltk.FreqDist(moby)\n", 71 | "moby_fd_100 = moby_fd.most_common(100)" 72 | ] 73 | }, 74 | { 75 | "cell_type": "markdown", 76 | "metadata": {}, 77 | "source": [ 78 | "We no longer care exactly how many times each word was seen and can drop the count." 79 | ] 80 | }, 81 | { 82 | "cell_type": "code", 83 | "execution_count": 6, 84 | "metadata": { 85 | "collapsed": true 86 | }, 87 | "outputs": [], 88 | "source": [ 89 | "alice_100 = [word[0] for word in alice_fd_100]\n", 90 | "moby_100 = [word[0] for word in moby_fd_100]" 91 | ] 92 | }, 93 | { 94 | "cell_type": "markdown", 95 | "metadata": {}, 96 | "source": [ 97 | "Two sets can be subtracted from one another leaving us with the difference." 
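Alongside the difference used below, two related set operations often come in handy when comparing vocabularies; a brief standard-Python illustration with made-up sets:

```python
a = {"whale", "ship", "said"}
b = {"Alice", "Queen", "said"}

print(a & b)  # {'said'} -- intersection: words common to both lists
print(a ^ b)  # symmetric difference: words distinctive to either text
```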
98 | ] 99 | }, 100 | { 101 | "cell_type": "code", 102 | "execution_count": 7, 103 | "metadata": { 104 | "collapsed": false 105 | }, 106 | "outputs": [ 107 | { 108 | "data": { 109 | "text/plain": [ 110 | "['ll',\n", 111 | " 'much',\n", 112 | " 'can',\n", 113 | " 'Alice',\n", 114 | " 'herself',\n", 115 | " 'thought',\n", 116 | " ':',\n", 117 | " 'Mock',\n", 118 | " 'again',\n", 119 | " 'she',\n", 120 | " 'Queen',\n", 121 | " 'King',\n", 122 | " '*',\n", 123 | " 'Turtle',\n", 124 | " 'way',\n", 125 | " 'could',\n", 126 | " 'did',\n", 127 | " 't',\n", 128 | " 'm',\n", 129 | " \",'\",\n", 130 | " 'see',\n", 131 | " \".'\",\n", 132 | " 'know',\n", 133 | " 'little',\n", 134 | " \"!'\",\n", 135 | " 'off',\n", 136 | " 'began',\n", 137 | " 'went',\n", 138 | " 'say',\n", 139 | " 'Hatter',\n", 140 | " \"?'\",\n", 141 | " 'quite',\n", 142 | " 'your',\n", 143 | " 'said',\n", 144 | " 'Gryphon',\n", 145 | " 'do']" 146 | ] 147 | }, 148 | "execution_count": 7, 149 | "metadata": {}, 150 | "output_type": "execute_result" 151 | } 152 | ], 153 | "source": [ 154 | "list(set(alice_100) - set(moby_100))" 155 | ] 156 | }, 157 | { 158 | "cell_type": "code", 159 | "execution_count": 8, 160 | "metadata": { 161 | "collapsed": false 162 | }, 163 | "outputs": [ 164 | { 165 | "data": { 166 | "text/plain": [ 167 | "['will',\n", 168 | " 'from',\n", 169 | " 'man',\n", 170 | " 'over',\n", 171 | " 'some',\n", 172 | " 'been',\n", 173 | " 'head',\n", 174 | " 'other',\n", 175 | " 'only',\n", 176 | " 'more',\n", 177 | " '!\"',\n", 178 | " 'who',\n", 179 | " 'are',\n", 180 | " 'him',\n", 181 | " '\"',\n", 182 | " 'we',\n", 183 | " 'such',\n", 184 | " '?',\n", 185 | " 'these',\n", 186 | " 'long',\n", 187 | " 'ye',\n", 188 | " 'ship',\n", 189 | " 'boat',\n", 190 | " 'sea',\n", 191 | " 'though',\n", 192 | " 'Ahab',\n", 193 | " 'which',\n", 194 | " 'their',\n", 195 | " 'But',\n", 196 | " '.\"',\n", 197 | " 'now',\n", 198 | " 'any',\n", 199 | " 'old',\n", 200 | " 'than',\n", 201 | " 'whale',\n", 202 | " 'upon']" 203 | ] 204 | }, 205 | "execution_count": 8, 206 | "metadata": {}, 207 | "output_type": "execute_result" 208 | } 209 | ], 210 | "source": [ 211 | "list(set(moby_100) - set(alice_100))" 212 | ] 213 | } 214 | ], 215 | "metadata": { 216 | "kernelspec": { 217 | "display_name": "Python 3", 218 | "language": "python", 219 | "name": "python3" 220 | }, 221 | "language_info": { 222 | "codemirror_mode": { 223 | "name": "ipython", 224 | "version": 3 225 | }, 226 | "file_extension": ".py", 227 | "mimetype": "text/x-python", 228 | "name": "python", 229 | "nbconvert_exporter": "python", 230 | "pygments_lexer": "ipython3", 231 | "version": "3.6.0" 232 | } 233 | }, 234 | "nbformat": 4, 235 | "nbformat_minor": 2 236 | } 237 | -------------------------------------------------------------------------------- /2. NLTK and the Basics/6. Bigrams.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# NLTK and the Basics - Bigrams" 8 | ] 9 | }, 10 | { 11 | "cell_type": "code", 12 | "execution_count": 1, 13 | "metadata": { 14 | "collapsed": true 15 | }, 16 | "outputs": [], 17 | "source": [ 18 | "import nltk" 19 | ] 20 | }, 21 | { 22 | "cell_type": "markdown", 23 | "metadata": {}, 24 | "source": [ 25 | "Bigrams, sometimes called 2-grams (or n-grams when dealing with a different number of words), are a way of looking at word sequences." 
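Once a text is split into bigrams, a natural next step (not shown in this notebook) is to count them. A sketch combining `nltk.bigrams` with `FreqDist`; it assumes the punkt tokenizer data is installed for `word_tokenize`:

```python
import nltk

tokens = nltk.word_tokenize("the cat sat on the mat and the cat slept")
bigram_counts = nltk.FreqDist(nltk.bigrams(tokens))

print(bigram_counts.most_common(2))  # [(('the', 'cat'), 2), ...]
```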
26 | ] 27 | }, 28 | { 29 | "cell_type": "code", 30 | "execution_count": 2, 31 | "metadata": { 32 | "collapsed": true 33 | }, 34 | "outputs": [], 35 | "source": [ 36 | "text = \"I think it might rain today.\"" 37 | ] 38 | }, 39 | { 40 | "cell_type": "code", 41 | "execution_count": 3, 42 | "metadata": { 43 | "collapsed": false 44 | }, 45 | "outputs": [], 46 | "source": [ 47 | "tokens = nltk.word_tokenize(text)" 48 | ] 49 | }, 50 | { 51 | "cell_type": "code", 52 | "execution_count": 4, 53 | "metadata": { 54 | "collapsed": false 55 | }, 56 | "outputs": [ 57 | { 58 | "data": { 59 | "text/plain": [ 60 | "['I', 'think', 'it', 'might', 'rain', 'today', '.']" 61 | ] 62 | }, 63 | "execution_count": 4, 64 | "metadata": {}, 65 | "output_type": "execute_result" 66 | } 67 | ], 68 | "source": [ 69 | "tokens" 70 | ] 71 | }, 72 | { 73 | "cell_type": "code", 74 | "execution_count": 5, 75 | "metadata": { 76 | "collapsed": false 77 | }, 78 | "outputs": [], 79 | "source": [ 80 | "bigrams = nltk.bigrams(tokens)" 81 | ] 82 | }, 83 | { 84 | "cell_type": "code", 85 | "execution_count": 7, 86 | "metadata": { 87 | "collapsed": false 88 | }, 89 | "outputs": [ 90 | { 91 | "name": "stdout", 92 | "output_type": "stream", 93 | "text": [ 94 | "('I', 'think')\n", 95 | "('think', 'it')\n", 96 | "('it', 'might')\n", 97 | "('might', 'rain')\n", 98 | "('rain', 'today')\n", 99 | "('today', '.')\n" 100 | ] 101 | } 102 | ], 103 | "source": [ 104 | "for item in bigrams:\n", 105 | " print(item)" 106 | ] 107 | }, 108 | { 109 | "cell_type": "code", 110 | "execution_count": 8, 111 | "metadata": { 112 | "collapsed": true 113 | }, 114 | "outputs": [], 115 | "source": [ 116 | "trigrams = nltk.trigrams(tokens)" 117 | ] 118 | }, 119 | { 120 | "cell_type": "code", 121 | "execution_count": 9, 122 | "metadata": { 123 | "collapsed": false 124 | }, 125 | "outputs": [ 126 | { 127 | "name": "stdout", 128 | "output_type": "stream", 129 | "text": [ 130 | "('I', 'think', 'it')\n", 131 | "('think', 'it', 'might')\n", 132 | "('it', 'might', 'rain')\n", 133 | "('might', 'rain', 'today')\n", 134 | "('rain', 'today', '.')\n" 135 | ] 136 | } 137 | ], 138 | "source": [ 139 | "for item in trigrams:\n", 140 | " print(item)" 141 | ] 142 | }, 143 | { 144 | "cell_type": "code", 145 | "execution_count": 10, 146 | "metadata": { 147 | "collapsed": true 148 | }, 149 | "outputs": [], 150 | "source": [ 151 | "from nltk.util import ngrams" 152 | ] 153 | }, 154 | { 155 | "cell_type": "code", 156 | "execution_count": 11, 157 | "metadata": { 158 | "collapsed": false 159 | }, 160 | "outputs": [], 161 | "source": [ 162 | "text = \"If it is nice out, I will go to the beach.\"" 163 | ] 164 | }, 165 | { 166 | "cell_type": "code", 167 | "execution_count": 12, 168 | "metadata": { 169 | "collapsed": true 170 | }, 171 | "outputs": [], 172 | "source": [ 173 | "tokens = nltk.word_tokenize(text)" 174 | ] 175 | }, 176 | { 177 | "cell_type": "code", 178 | "execution_count": 13, 179 | "metadata": { 180 | "collapsed": false 181 | }, 182 | "outputs": [], 183 | "source": [ 184 | "bigrams = ngrams(tokens,2)" 185 | ] 186 | }, 187 | { 188 | "cell_type": "code", 189 | "execution_count": 14, 190 | "metadata": { 191 | "collapsed": false 192 | }, 193 | "outputs": [ 194 | { 195 | "name": "stdout", 196 | "output_type": "stream", 197 | "text": [ 198 | "('If', 'it')\n", 199 | "('it', 'is')\n", 200 | "('is', 'nice')\n", 201 | "('nice', 'out')\n", 202 | "('out', ',')\n", 203 | "(',', 'I')\n", 204 | "('I', 'will')\n", 205 | "('will', 'go')\n", 206 | "('go', 'to')\n", 207 | "('to', 'the')\n", 208 | 
"('the', 'beach')\n", 209 | "('beach', '.')\n" 210 | ] 211 | } 212 | ], 213 | "source": [ 214 | "for item in bigrams:\n", 215 | " print(item)" 216 | ] 217 | }, 218 | { 219 | "cell_type": "code", 220 | "execution_count": 15, 221 | "metadata": { 222 | "collapsed": false 223 | }, 224 | "outputs": [], 225 | "source": [ 226 | "fourgrams = ngrams(tokens,4)" 227 | ] 228 | }, 229 | { 230 | "cell_type": "code", 231 | "execution_count": 17, 232 | "metadata": { 233 | "collapsed": false 234 | }, 235 | "outputs": [ 236 | { 237 | "name": "stdout", 238 | "output_type": "stream", 239 | "text": [ 240 | "('If', 'it', 'is', 'nice')\n", 241 | "('it', 'is', 'nice', 'out')\n", 242 | "('is', 'nice', 'out', ',')\n", 243 | "('nice', 'out', ',', 'I')\n", 244 | "('out', ',', 'I', 'will')\n", 245 | "(',', 'I', 'will', 'go')\n", 246 | "('I', 'will', 'go', 'to')\n", 247 | "('will', 'go', 'to', 'the')\n", 248 | "('go', 'to', 'the', 'beach')\n", 249 | "('to', 'the', 'beach', '.')\n" 250 | ] 251 | } 252 | ], 253 | "source": [ 254 | "for item in fourgrams:\n", 255 | " print(item)" 256 | ] 257 | }, 258 | { 259 | "cell_type": "markdown", 260 | "metadata": {}, 261 | "source": [ 262 | "We can build a function to find any ngram." 263 | ] 264 | }, 265 | { 266 | "cell_type": "code", 267 | "execution_count": 18, 268 | "metadata": { 269 | "collapsed": false 270 | }, 271 | "outputs": [], 272 | "source": [ 273 | "def n_grams(text,n):\n", 274 | " tokens = nltk.word_tokenize(text)\n", 275 | " grams = ngrams(tokens,n)\n", 276 | " return grams" 277 | ] 278 | }, 279 | { 280 | "cell_type": "code", 281 | "execution_count": 19, 282 | "metadata": { 283 | "collapsed": true 284 | }, 285 | "outputs": [], 286 | "source": [ 287 | "text = \"I think it might rain today, but if it is nice out, I will go to the beach.\"" 288 | ] 289 | }, 290 | { 291 | "cell_type": "code", 292 | "execution_count": 20, 293 | "metadata": { 294 | "collapsed": false 295 | }, 296 | "outputs": [], 297 | "source": [ 298 | "grams = n_grams(text, 5)" 299 | ] 300 | }, 301 | { 302 | "cell_type": "code", 303 | "execution_count": 21, 304 | "metadata": { 305 | "collapsed": false 306 | }, 307 | "outputs": [ 308 | { 309 | "name": "stdout", 310 | "output_type": "stream", 311 | "text": [ 312 | "('I', 'think', 'it', 'might', 'rain')\n", 313 | "('think', 'it', 'might', 'rain', 'today')\n", 314 | "('it', 'might', 'rain', 'today', ',')\n", 315 | "('might', 'rain', 'today', ',', 'but')\n", 316 | "('rain', 'today', ',', 'but', 'if')\n", 317 | "('today', ',', 'but', 'if', 'it')\n", 318 | "(',', 'but', 'if', 'it', 'is')\n", 319 | "('but', 'if', 'it', 'is', 'nice')\n", 320 | "('if', 'it', 'is', 'nice', 'out')\n", 321 | "('it', 'is', 'nice', 'out', ',')\n", 322 | "('is', 'nice', 'out', ',', 'I')\n", 323 | "('nice', 'out', ',', 'I', 'will')\n", 324 | "('out', ',', 'I', 'will', 'go')\n", 325 | "(',', 'I', 'will', 'go', 'to')\n", 326 | "('I', 'will', 'go', 'to', 'the')\n", 327 | "('will', 'go', 'to', 'the', 'beach')\n", 328 | "('go', 'to', 'the', 'beach', '.')\n" 329 | ] 330 | } 331 | ], 332 | "source": [ 333 | "for item in grams:\n", 334 | " print(item)" 335 | ] 336 | } 337 | ], 338 | "metadata": { 339 | "kernelspec": { 340 | "display_name": "Python 3", 341 | "language": "python", 342 | "name": "python3" 343 | }, 344 | "language_info": { 345 | "codemirror_mode": { 346 | "name": "ipython", 347 | "version": 3 348 | }, 349 | "file_extension": ".py", 350 | "mimetype": "text/x-python", 351 | "name": "python", 352 | "nbconvert_exporter": "python", 353 | "pygments_lexer": "ipython3", 354 | "version": 
"3.6.0" 355 | } 356 | }, 357 | "nbformat": 4, 358 | "nbformat_minor": 2 359 | } 360 | -------------------------------------------------------------------------------- /2. NLTK and the Basics/7. Regular Expressions.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# NLTK and the Basics - Regular Expressions" 8 | ] 9 | }, 10 | { 11 | "cell_type": "code", 12 | "execution_count": 8, 13 | "metadata": { 14 | "collapsed": true 15 | }, 16 | "outputs": [], 17 | "source": [ 18 | "import nltk\n", 19 | "import re" 20 | ] 21 | }, 22 | { 23 | "cell_type": "code", 24 | "execution_count": 9, 25 | "metadata": { 26 | "collapsed": true 27 | }, 28 | "outputs": [], 29 | "source": [ 30 | "alice = nltk.corpus.gutenberg.words(\"carroll-alice.txt\")" 31 | ] 32 | }, 33 | { 34 | "cell_type": "markdown", 35 | "metadata": {}, 36 | "source": [ 37 | "Finding every word that start with \"new\"." 38 | ] 39 | }, 40 | { 41 | "cell_type": "code", 42 | "execution_count": 10, 43 | "metadata": { 44 | "collapsed": false 45 | }, 46 | "outputs": [ 47 | { 48 | "data": { 49 | "text/plain": [ 50 | "{'new', 'newspapers'}" 51 | ] 52 | }, 53 | "execution_count": 10, 54 | "metadata": {}, 55 | "output_type": "execute_result" 56 | } 57 | ], 58 | "source": [ 59 | "set([word for word in alice if re.search(\"^new\",word)])" 60 | ] 61 | }, 62 | { 63 | "cell_type": "markdown", 64 | "metadata": {}, 65 | "source": [ 66 | "Finding every word that ends with \"ful\"." 67 | ] 68 | }, 69 | { 70 | "cell_type": "code", 71 | "execution_count": 11, 72 | "metadata": { 73 | "collapsed": false 74 | }, 75 | "outputs": [ 76 | { 77 | "data": { 78 | "text/plain": [ 79 | "{'Beautiful',\n", 80 | " 'barrowful',\n", 81 | " 'beautiful',\n", 82 | " 'delightful',\n", 83 | " 'doubtful',\n", 84 | " 'dreadful',\n", 85 | " 'graceful',\n", 86 | " 'hopeful',\n", 87 | " 'mournful',\n", 88 | " 'ootiful',\n", 89 | " 'respectful',\n", 90 | " 'sorrowful',\n", 91 | " 'truthful',\n", 92 | " 'useful',\n", 93 | " 'wonderful'}" 94 | ] 95 | }, 96 | "execution_count": 11, 97 | "metadata": {}, 98 | "output_type": "execute_result" 99 | } 100 | ], 101 | "source": [ 102 | "set([word for word in alice if re.search(\"ful$\",word)])" 103 | ] 104 | }, 105 | { 106 | "cell_type": "markdown", 107 | "metadata": {}, 108 | "source": [ 109 | "Finding words that are six characters long and have two n's in the middle." 110 | ] 111 | }, 112 | { 113 | "cell_type": "code", 114 | "execution_count": 12, 115 | "metadata": { 116 | "collapsed": false 117 | }, 118 | "outputs": [ 119 | { 120 | "data": { 121 | "text/plain": [ 122 | "{'cannot', 'dinner', 'fanned', 'manner', 'tunnel'}" 123 | ] 124 | }, 125 | "execution_count": 12, 126 | "metadata": {}, 127 | "output_type": "execute_result" 128 | } 129 | ], 130 | "source": [ 131 | "set([word for word in alice if re.search(\"^..nn..$\",word)])" 132 | ] 133 | }, 134 | { 135 | "cell_type": "markdown", 136 | "metadata": {}, 137 | "source": [ 138 | "Finding words that start with \"c\", \"h\", and \"t\", and end in \"at\"." 
139 | ] 140 | }, 141 | { 142 | "cell_type": "code", 143 | "execution_count": 13, 144 | "metadata": { 145 | "collapsed": false 146 | }, 147 | "outputs": [ 148 | { 149 | "data": { 150 | "text/plain": [ 151 | "{'cat', 'hat', 'rat'}" 152 | ] 153 | }, 154 | "execution_count": 13, 155 | "metadata": {}, 156 | "output_type": "execute_result" 157 | } 158 | ], 159 | "source": [ 160 | "set([word for word in alice if re.search(\"^[chr]at$\",word)])" 161 | ] 162 | }, 163 | { 164 | "cell_type": "markdown", 165 | "metadata": {}, 166 | "source": [ 167 | "Finding words of any length that contain a double \"n\"." 168 | ] 169 | }, 170 | { 171 | "cell_type": "code", 172 | "execution_count": 14, 173 | "metadata": { 174 | "collapsed": false 175 | }, 176 | "outputs": [ 177 | { 178 | "data": { 179 | "text/plain": [ 180 | "{'Ann',\n", 181 | " 'Dinn',\n", 182 | " 'Pennyworth',\n", 183 | " 'annoy',\n", 184 | " 'annoyed',\n", 185 | " 'beginning',\n", 186 | " 'cannot',\n", 187 | " 'cunning',\n", 188 | " 'dinn',\n", 189 | " 'dinner',\n", 190 | " 'fanned',\n", 191 | " 'fanning',\n", 192 | " 'funny',\n", 193 | " 'grinned',\n", 194 | " 'grinning',\n", 195 | " 'manner',\n", 196 | " 'manners',\n", 197 | " 'planning',\n", 198 | " 'running',\n", 199 | " 'tunnel'}" 200 | ] 201 | }, 202 | "execution_count": 14, 203 | "metadata": {}, 204 | "output_type": "execute_result" 205 | } 206 | ], 207 | "source": [ 208 | "set([word for word in alice if re.search(\"^.*nn.*$\",word)])" 209 | ] 210 | } 211 | ], 212 | "metadata": { 213 | "kernelspec": { 214 | "display_name": "Python 3", 215 | "language": "python", 216 | "name": "python3" 217 | }, 218 | "language_info": { 219 | "codemirror_mode": { 220 | "name": "ipython", 221 | "version": 3 222 | }, 223 | "file_extension": ".py", 224 | "mimetype": "text/x-python", 225 | "name": "python", 226 | "nbconvert_exporter": "python", 227 | "pygments_lexer": "ipython3", 228 | "version": "3.6.0" 229 | } 230 | }, 231 | "nbformat": 4, 232 | "nbformat_minor": 2 233 | } 234 | -------------------------------------------------------------------------------- /3. Tokenization, Tagging, Chunking/1. Tokenization.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Tokenization, Tagging, Chunking - Tokenization" 8 | ] 9 | }, 10 | { 11 | "cell_type": "code", 12 | "execution_count": 1, 13 | "metadata": { 14 | "collapsed": true 15 | }, 16 | "outputs": [], 17 | "source": [ 18 | "import nltk" 19 | ] 20 | }, 21 | { 22 | "cell_type": "markdown", 23 | "metadata": {}, 24 | "source": [ 25 | "Tokenization is the process of breaking up a string into a list of words and punctuation." 
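Before the cells below, it helps to see why a tokenizer beats a plain `str.split`: splitting on whitespace leaves punctuation glued to words. A small comparison (assumes the punkt tokenizer data is installed):

```python
import nltk

text = "I am learning Natural Language Processing."

print(text.split())              # [..., 'Processing.'] -- period stuck to the last word
print(nltk.word_tokenize(text))  # [..., 'Processing', '.'] -- period split off as a token
```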
26 | ] 27 | }, 28 | { 29 | "cell_type": "code", 30 | "execution_count": 2, 31 | "metadata": { 32 | "collapsed": true 33 | }, 34 | "outputs": [], 35 | "source": [ 36 | "my_string = \"I am learning Natural Language Processing.\"" 37 | ] 38 | }, 39 | { 40 | "cell_type": "code", 41 | "execution_count": 3, 42 | "metadata": { 43 | "collapsed": false 44 | }, 45 | "outputs": [], 46 | "source": [ 47 | "tokens = nltk.word_tokenize(my_string)" 48 | ] 49 | }, 50 | { 51 | "cell_type": "code", 52 | "execution_count": 4, 53 | "metadata": { 54 | "collapsed": false 55 | }, 56 | "outputs": [ 57 | { 58 | "data": { 59 | "text/plain": [ 60 | "['I', 'am', 'learning', 'Natural', 'Language', 'Processing', '.']" 61 | ] 62 | }, 63 | "execution_count": 4, 64 | "metadata": {}, 65 | "output_type": "execute_result" 66 | } 67 | ], 68 | "source": [ 69 | "tokens" 70 | ] 71 | }, 72 | { 73 | "cell_type": "code", 74 | "execution_count": 5, 75 | "metadata": { 76 | "collapsed": false 77 | }, 78 | "outputs": [ 79 | { 80 | "data": { 81 | "text/plain": [ 82 | "7" 83 | ] 84 | }, 85 | "execution_count": 5, 86 | "metadata": {}, 87 | "output_type": "execute_result" 88 | } 89 | ], 90 | "source": [ 91 | "len(tokens)" 92 | ] 93 | }, 94 | { 95 | "cell_type": "code", 96 | "execution_count": 6, 97 | "metadata": { 98 | "collapsed": true 99 | }, 100 | "outputs": [], 101 | "source": [ 102 | "phrase = \"I am learning Natural Language Processing. I am learning how to tokenize!\"" 103 | ] 104 | }, 105 | { 106 | "cell_type": "code", 107 | "execution_count": 7, 108 | "metadata": { 109 | "collapsed": false 110 | }, 111 | "outputs": [], 112 | "source": [ 113 | "tokens_sent = nltk.sent_tokenize(phrase)" 114 | ] 115 | }, 116 | { 117 | "cell_type": "code", 118 | "execution_count": 8, 119 | "metadata": { 120 | "collapsed": false 121 | }, 122 | "outputs": [ 123 | { 124 | "data": { 125 | "text/plain": [ 126 | "['I am learning Natural Language Processing.',\n", 127 | " 'I am learning how to tokenize!']" 128 | ] 129 | }, 130 | "execution_count": 8, 131 | "metadata": {}, 132 | "output_type": "execute_result" 133 | } 134 | ], 135 | "source": [ 136 | "tokens_sent" 137 | ] 138 | }, 139 | { 140 | "cell_type": "code", 141 | "execution_count": 9, 142 | "metadata": { 143 | "collapsed": false 144 | }, 145 | "outputs": [ 146 | { 147 | "data": { 148 | "text/plain": [ 149 | "2" 150 | ] 151 | }, 152 | "execution_count": 9, 153 | "metadata": {}, 154 | "output_type": "execute_result" 155 | } 156 | ], 157 | "source": [ 158 | "len(tokens_sent)" 159 | ] 160 | }, 161 | { 162 | "cell_type": "markdown", 163 | "metadata": {}, 164 | "source": [ 165 | "We can tokenize our sentence tokens." 
166 | ] 167 | }, 168 | { 169 | "cell_type": "code", 170 | "execution_count": 10, 171 | "metadata": { 172 | "collapsed": false 173 | }, 174 | "outputs": [ 175 | { 176 | "name": "stdout", 177 | "output_type": "stream", 178 | "text": [ 179 | "['I', 'am', 'learning', 'Natural', 'Language', 'Processing', '.']\n", 180 | "['I', 'am', 'learning', 'how', 'to', 'tokenize', '!']\n" 181 | ] 182 | } 183 | ], 184 | "source": [ 185 | "for item in tokens_sent:\n", 186 | " print(nltk.word_tokenize(item))" 187 | ] 188 | } 189 | ], 190 | "metadata": { 191 | "kernelspec": { 192 | "display_name": "Python 3", 193 | "language": "python", 194 | "name": "python3" 195 | }, 196 | "language_info": { 197 | "codemirror_mode": { 198 | "name": "ipython", 199 | "version": 3 200 | }, 201 | "file_extension": ".py", 202 | "mimetype": "text/x-python", 203 | "name": "python", 204 | "nbconvert_exporter": "python", 205 | "pygments_lexer": "ipython3", 206 | "version": "3.6.0" 207 | } 208 | }, 209 | "nbformat": 4, 210 | "nbformat_minor": 2 211 | } 212 | -------------------------------------------------------------------------------- /3. Tokenization, Tagging, Chunking/2. Normalization.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Tokenization, Tagging, Chunking - Normalization" 8 | ] 9 | }, 10 | { 11 | "cell_type": "code", 12 | "execution_count": 1, 13 | "metadata": { 14 | "collapsed": true 15 | }, 16 | "outputs": [], 17 | "source": [ 18 | "import nltk" 19 | ] 20 | }, 21 | { 22 | "cell_type": "markdown", 23 | "metadata": {}, 24 | "source": [ 25 | "Removing punctuation." 26 | ] 27 | }, 28 | { 29 | "cell_type": "code", 30 | "execution_count": 2, 31 | "metadata": { 32 | "collapsed": false 33 | }, 34 | "outputs": [], 35 | "source": [ 36 | "md = nltk.corpus.gutenberg.words(\"melville-moby_dick.txt\")" 37 | ] 38 | }, 39 | { 40 | "cell_type": "code", 41 | "execution_count": 3, 42 | "metadata": { 43 | "collapsed": false 44 | }, 45 | "outputs": [ 46 | { 47 | "data": { 48 | "text/plain": [ 49 | "['[', 'Moby', 'Dick', 'by', 'Herman', 'Melville', '1851', ']']" 50 | ] 51 | }, 52 | "execution_count": 3, 53 | "metadata": {}, 54 | "output_type": "execute_result" 55 | } 56 | ], 57 | "source": [ 58 | "md[:8]" 59 | ] 60 | }, 61 | { 62 | "cell_type": "code", 63 | "execution_count": 4, 64 | "metadata": { 65 | "collapsed": true 66 | }, 67 | "outputs": [], 68 | "source": [ 69 | "md_8 = md[:8]" 70 | ] 71 | }, 72 | { 73 | "cell_type": "code", 74 | "execution_count": 5, 75 | "metadata": { 76 | "collapsed": false 77 | }, 78 | "outputs": [ 79 | { 80 | "data": { 81 | "text/plain": [ 82 | "['[', 'Moby', 'Dick', 'by', 'Herman', 'Melville', '1851', ']']" 83 | ] 84 | }, 85 | "execution_count": 5, 86 | "metadata": {}, 87 | "output_type": "execute_result" 88 | } 89 | ], 90 | "source": [ 91 | "md_8" 92 | ] 93 | }, 94 | { 95 | "cell_type": "code", 96 | "execution_count": 7, 97 | "metadata": { 98 | "collapsed": false 99 | }, 100 | "outputs": [ 101 | { 102 | "name": "stdout", 103 | "output_type": "stream", 104 | "text": [ 105 | "Moby\n", 106 | "Dick\n", 107 | "by\n", 108 | "Herman\n", 109 | "Melville\n" 110 | ] 111 | } 112 | ], 113 | "source": [ 114 | "for word in md_8:\n", 115 | " if word.isalpha():\n", 116 | " print(word)" 117 | ] 118 | }, 119 | { 120 | "cell_type": "markdown", 121 | "metadata": {}, 122 | "source": [ 123 | "Making everything lower case." 
124 | ] 125 | }, 126 | { 127 | "cell_type": "code", 128 | "execution_count": 8, 129 | "metadata": { 130 | "collapsed": false 131 | }, 132 | "outputs": [ 133 | { 134 | "name": "stdout", 135 | "output_type": "stream", 136 | "text": [ 137 | "[\n", 138 | "moby\n", 139 | "dick\n", 140 | "by\n", 141 | "herman\n", 142 | "melville\n", 143 | "1851\n", 144 | "]\n" 145 | ] 146 | } 147 | ], 148 | "source": [ 149 | "for word in md_8:\n", 150 | " print(word.lower())" 151 | ] 152 | }, 153 | { 154 | "cell_type": "code", 155 | "execution_count": 9, 156 | "metadata": { 157 | "collapsed": false 158 | }, 159 | "outputs": [], 160 | "source": [ 161 | "norm = [word.lower() for word in md_8 if word.isalpha()]" 162 | ] 163 | }, 164 | { 165 | "cell_type": "code", 166 | "execution_count": 10, 167 | "metadata": { 168 | "collapsed": false 169 | }, 170 | "outputs": [ 171 | { 172 | "data": { 173 | "text/plain": [ 174 | "['moby', 'dick', 'by', 'herman', 'melville']" 175 | ] 176 | }, 177 | "execution_count": 10, 178 | "metadata": {}, 179 | "output_type": "execute_result" 180 | } 181 | ], 182 | "source": [ 183 | "norm" 184 | ] 185 | }, 186 | { 187 | "cell_type": "markdown", 188 | "metadata": {}, 189 | "source": [ 190 | "## Stemmers" 191 | ] 192 | }, 193 | { 194 | "cell_type": "markdown", 195 | "metadata": {}, 196 | "source": [ 197 | "Stemmers help further normalize text when we run into words that might be plural,\n", 198 | "for example. \n", 199 | "\n", 200 | "There are many different kinds of stemmers so you have to pick the one that works best for your use case.\n" 201 | ] 202 | }, 203 | { 204 | "cell_type": "code", 205 | "execution_count": 11, 206 | "metadata": { 207 | "collapsed": true 208 | }, 209 | "outputs": [], 210 | "source": [ 211 | "porter = nltk.PorterStemmer()" 212 | ] 213 | }, 214 | { 215 | "cell_type": "code", 216 | "execution_count": 12, 217 | "metadata": { 218 | "collapsed": true 219 | }, 220 | "outputs": [], 221 | "source": [ 222 | "my_list = [\"cat\",\"cats\",\"lie\",\"lying\",\"run\",\"running\",\"city\",\"cities\",\"month\",\"monthly\",\"woman\",\"women\"]" 223 | ] 224 | }, 225 | { 226 | "cell_type": "code", 227 | "execution_count": 14, 228 | "metadata": { 229 | "collapsed": false 230 | }, 231 | "outputs": [ 232 | { 233 | "name": "stdout", 234 | "output_type": "stream", 235 | "text": [ 236 | "cat\n", 237 | "cat\n", 238 | "lie\n", 239 | "lie\n", 240 | "run\n", 241 | "run\n", 242 | "citi\n", 243 | "citi\n", 244 | "month\n", 245 | "monthli\n", 246 | "woman\n", 247 | "women\n" 248 | ] 249 | } 250 | ], 251 | "source": [ 252 | "for word in my_list:\n", 253 | " print(porter.stem(word))" 254 | ] 255 | }, 256 | { 257 | "cell_type": "code", 258 | "execution_count": 15, 259 | "metadata": { 260 | "collapsed": true 261 | }, 262 | "outputs": [], 263 | "source": [ 264 | "lancaster = nltk.LancasterStemmer()" 265 | ] 266 | }, 267 | { 268 | "cell_type": "code", 269 | "execution_count": 17, 270 | "metadata": { 271 | "collapsed": false 272 | }, 273 | "outputs": [ 274 | { 275 | "name": "stdout", 276 | "output_type": "stream", 277 | "text": [ 278 | "cat\n", 279 | "cat\n", 280 | "lie\n", 281 | "lying\n", 282 | "run\n", 283 | "run\n", 284 | "city\n", 285 | "city\n", 286 | "mon\n", 287 | "month\n", 288 | "wom\n", 289 | "wom\n" 290 | ] 291 | } 292 | ], 293 | "source": [ 294 | "for word in my_list:\n", 295 | " print(lancaster.stem(word))" 296 | ] 297 | }, 298 | { 299 | "cell_type": "markdown", 300 | "metadata": {}, 301 | "source": [ 302 | "We can try to solve the normalization problem with Lemmatization." 
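# A minimal sketch of why some words below come back unchanged: the WordNet
# lemmatizer assumes every word is a noun unless told otherwise (this assumes
# the wordnet data is downloaded, e.g. nltk.download('wordnet')).
import nltk
wnlem = nltk.WordNetLemmatizer()
print(wnlem.lemmatize("running"))           # 'running' - treated as a noun
print(wnlem.lemmatize("running", pos="v"))  # 'run'     - treated as a verb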
303 | ] 304 | }, 305 | { 306 | "cell_type": "code", 307 | "execution_count": 18, 308 | "metadata": { 309 | "collapsed": true 310 | }, 311 | "outputs": [], 312 | "source": [ 313 | "wnlem = nltk.WordNetLemmatizer()" 314 | ] 315 | }, 316 | { 317 | "cell_type": "code", 318 | "execution_count": 20, 319 | "metadata": { 320 | "collapsed": false 321 | }, 322 | "outputs": [ 323 | { 324 | "name": "stdout", 325 | "output_type": "stream", 326 | "text": [ 327 | "cat\n", 328 | "cat\n", 329 | "lie\n", 330 | "lying\n", 331 | "run\n", 332 | "running\n", 333 | "city\n", 334 | "city\n", 335 | "month\n", 336 | "monthly\n", 337 | "woman\n", 338 | "woman\n" 339 | ] 340 | } 341 | ], 342 | "source": [ 343 | "for word in my_list:\n", 344 | " print(wnlem.lemmatize(word))" 345 | ] 346 | } 347 | ], 348 | "metadata": { 349 | "kernelspec": { 350 | "display_name": "Python 3", 351 | "language": "python", 352 | "name": "python3" 353 | }, 354 | "language_info": { 355 | "codemirror_mode": { 356 | "name": "ipython", 357 | "version": 3 358 | }, 359 | "file_extension": ".py", 360 | "mimetype": "text/x-python", 361 | "name": "python", 362 | "nbconvert_exporter": "python", 363 | "pygments_lexer": "ipython3", 364 | "version": "3.6.0" 365 | } 366 | }, 367 | "nbformat": 4, 368 | "nbformat_minor": 2 369 | } 370 | -------------------------------------------------------------------------------- /3. Tokenization, Tagging, Chunking/3. Part of Speech Tagging.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Tokenization, Tagging, Chunking - Part of Speech Tagging" 8 | ] 9 | }, 10 | { 11 | "cell_type": "code", 12 | "execution_count": 17, 13 | "metadata": { 14 | "collapsed": true 15 | }, 16 | "outputs": [], 17 | "source": [ 18 | "import nltk" 19 | ] 20 | }, 21 | { 22 | "cell_type": "markdown", 23 | "metadata": {}, 24 | "source": [ 25 | "A part of speech tagger will identify the part of speech for a sequence of words." 
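# A minimal sketch: if a tag in the output below is unfamiliar, NLTK can
# describe it (assumes the tagsets data is downloaded, e.g.
# nltk.download('tagsets')).
import nltk
nltk.help.upenn_tagset("VBD")  # prints: verb, past tense ...
nltk.help.upenn_tagset("PRP")  # prints: pronoun, personal ...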
26 | ] 27 | }, 28 | { 29 | "cell_type": "code", 30 | "execution_count": 18, 31 | "metadata": { 32 | "collapsed": true 33 | }, 34 | "outputs": [], 35 | "source": [ 36 | "text = \"I walked to the cafe to buy coffee before work.\"" 37 | ] 38 | }, 39 | { 40 | "cell_type": "code", 41 | "execution_count": 19, 42 | "metadata": { 43 | "collapsed": false 44 | }, 45 | "outputs": [], 46 | "source": [ 47 | "tokens = nltk.word_tokenize(text)" 48 | ] 49 | }, 50 | { 51 | "cell_type": "code", 52 | "execution_count": 20, 53 | "metadata": { 54 | "collapsed": false 55 | }, 56 | "outputs": [ 57 | { 58 | "data": { 59 | "text/plain": [ 60 | "[('I', 'PRP'),\n", 61 | " ('walked', 'VBD'),\n", 62 | " ('to', 'TO'),\n", 63 | " ('the', 'DT'),\n", 64 | " ('cafe', 'NN'),\n", 65 | " ('to', 'TO'),\n", 66 | " ('buy', 'VB'),\n", 67 | " ('coffee', 'NN'),\n", 68 | " ('before', 'IN'),\n", 69 | " ('work', 'NN'),\n", 70 | " ('.', '.')]" 71 | ] 72 | }, 73 | "execution_count": 20, 74 | "metadata": {}, 75 | "output_type": "execute_result" 76 | } 77 | ], 78 | "source": [ 79 | "nltk.pos_tag(tokens)" 80 | ] 81 | }, 82 | { 83 | "cell_type": "markdown", 84 | "metadata": {}, 85 | "source": [ 86 | "For an extensive list of part-of-speech tags visit:\n", 87 | "https://en.wikipedia.org/w/index.php?title=Brown_Corpus" 88 | ] 89 | }, 90 | { 91 | "cell_type": "code", 92 | "execution_count": 21, 93 | "metadata": { 94 | "collapsed": false 95 | }, 96 | "outputs": [ 97 | { 98 | "data": { 99 | "text/plain": [ 100 | "[('I', 'PRP'), ('will', 'MD'), ('have', 'VB'), ('desert', 'NN'), ('.', '.')]" 101 | ] 102 | }, 103 | "execution_count": 21, 104 | "metadata": {}, 105 | "output_type": "execute_result" 106 | } 107 | ], 108 | "source": [ 109 | "nltk.pos_tag(nltk.word_tokenize(\"I will have desert.\"))" 110 | ] 111 | }, 112 | { 113 | "cell_type": "code", 114 | "execution_count": 22, 115 | "metadata": { 116 | "collapsed": false 117 | }, 118 | "outputs": [ 119 | { 120 | "data": { 121 | "text/plain": [ 122 | "[('They', 'PRP'), ('will', 'MD'), ('desert', 'VB'), ('us', 'PRP'), ('.', '.')]" 123 | ] 124 | }, 125 | "execution_count": 22, 126 | "metadata": {}, 127 | "output_type": "execute_result" 128 | } 129 | ], 130 | "source": [ 131 | "nltk.pos_tag(nltk.word_tokenize(\"They will desert us.\"))" 132 | ] 133 | }, 134 | { 135 | "cell_type": "markdown", 136 | "metadata": {}, 137 | "source": [ 138 | "Create a list of all nouns." 
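# A minimal sketch: tagging an entire novel with nltk.pos_tag takes a while,
# so it can help to prototype on a slice before running the full text, as the
# next cells do.
import nltk
md = nltk.corpus.gutenberg.words("melville-moby_dick.txt")
sample = [word.lower() for word in md[:1000] if word.isalpha()]
sample_tags = nltk.pos_tag(sample, tagset="universal")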
139 | ] 140 | }, 141 | { 142 | "cell_type": "code", 143 | "execution_count": 23, 144 | "metadata": { 145 | "collapsed": true 146 | }, 147 | "outputs": [], 148 | "source": [ 149 | "md = nltk.corpus.gutenberg.words(\"melville-moby_dick.txt\")" 150 | ] 151 | }, 152 | { 153 | "cell_type": "code", 154 | "execution_count": 24, 155 | "metadata": { 156 | "collapsed": false 157 | }, 158 | "outputs": [], 159 | "source": [ 160 | "md_norm = [word.lower() for word in md if word.isalpha()]" 161 | ] 162 | }, 163 | { 164 | "cell_type": "code", 165 | "execution_count": 25, 166 | "metadata": { 167 | "collapsed": true 168 | }, 169 | "outputs": [], 170 | "source": [ 171 | "md_tags = nltk.pos_tag(md_norm,tagset=\"universal\")" 172 | ] 173 | }, 174 | { 175 | "cell_type": "code", 176 | "execution_count": 26, 177 | "metadata": { 178 | "collapsed": false 179 | }, 180 | "outputs": [ 181 | { 182 | "data": { 183 | "text/plain": [ 184 | "[('moby', 'NOUN'),\n", 185 | " ('dick', 'NOUN'),\n", 186 | " ('by', 'ADP'),\n", 187 | " ('herman', 'NOUN'),\n", 188 | " ('melville', 'NOUN')]" 189 | ] 190 | }, 191 | "execution_count": 26, 192 | "metadata": {}, 193 | "output_type": "execute_result" 194 | } 195 | ], 196 | "source": [ 197 | "md_tags[:5]" 198 | ] 199 | }, 200 | { 201 | "cell_type": "code", 202 | "execution_count": 27, 203 | "metadata": { 204 | "collapsed": true 205 | }, 206 | "outputs": [], 207 | "source": [ 208 | "md_nouns = [word for word in md_tags if word[1] == \"NOUN\"]" 209 | ] 210 | }, 211 | { 212 | "cell_type": "code", 213 | "execution_count": 28, 214 | "metadata": { 215 | "collapsed": false 216 | }, 217 | "outputs": [], 218 | "source": [ 219 | "nouns_fd = nltk.FreqDist(md_nouns)" 220 | ] 221 | }, 222 | { 223 | "cell_type": "code", 224 | "execution_count": 29, 225 | "metadata": { 226 | "collapsed": false 227 | }, 228 | "outputs": [ 229 | { 230 | "data": { 231 | "text/plain": [ 232 | "[(('i', 'NOUN'), 1182),\n", 233 | " (('whale', 'NOUN'), 909),\n", 234 | " (('s', 'NOUN'), 774),\n", 235 | " (('man', 'NOUN'), 527),\n", 236 | " (('ship', 'NOUN'), 498),\n", 237 | " (('sea', 'NOUN'), 435),\n", 238 | " (('head', 'NOUN'), 337),\n", 239 | " (('time', 'NOUN'), 334),\n", 240 | " (('boat', 'NOUN'), 332),\n", 241 | " (('ahab', 'NOUN'), 278)]" 242 | ] 243 | }, 244 | "execution_count": 29, 245 | "metadata": {}, 246 | "output_type": "execute_result" 247 | } 248 | ], 249 | "source": [ 250 | "nouns_fd.most_common()[:10] " 251 | ] 252 | } 253 | ], 254 | "metadata": { 255 | "kernelspec": { 256 | "display_name": "Python 3", 257 | "language": "python", 258 | "name": "python3" 259 | }, 260 | "language_info": { 261 | "codemirror_mode": { 262 | "name": "ipython", 263 | "version": 3 264 | }, 265 | "file_extension": ".py", 266 | "mimetype": "text/x-python", 267 | "name": "python", 268 | "nbconvert_exporter": "python", 269 | "pygments_lexer": "ipython3", 270 | "version": "3.6.0" 271 | } 272 | }, 273 | "nbformat": 4, 274 | "nbformat_minor": 2 275 | } 276 | -------------------------------------------------------------------------------- /3. Tokenization, Tagging, Chunking/4. 
Example - Multiple Parts of Speech.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Tokenization, Tagging, Chunking - Example: Multiple Parts of Speech" 8 | ] 9 | }, 10 | { 11 | "cell_type": "code", 12 | "execution_count": 1, 13 | "metadata": { 14 | "collapsed": true 15 | }, 16 | "outputs": [], 17 | "source": [ 18 | "import nltk" 19 | ] 20 | }, 21 | { 22 | "cell_type": "markdown", 23 | "metadata": {}, 24 | "source": [ 25 | "Words can be tagged with a different part of speech based on usage." 26 | ] 27 | }, 28 | { 29 | "cell_type": "code", 30 | "execution_count": 2, 31 | "metadata": { 32 | "collapsed": false 33 | }, 34 | "outputs": [], 35 | "source": [ 36 | "alice = nltk.corpus.gutenberg.words(\"carroll-alice.txt\")" 37 | ] 38 | }, 39 | { 40 | "cell_type": "code", 41 | "execution_count": 3, 42 | "metadata": { 43 | "collapsed": false 44 | }, 45 | "outputs": [], 46 | "source": [ 47 | "alice_norm = [word.lower() for word in alice if word.isalpha()]" 48 | ] 49 | }, 50 | { 51 | "cell_type": "code", 52 | "execution_count": 4, 53 | "metadata": { 54 | "collapsed": true 55 | }, 56 | "outputs": [], 57 | "source": [ 58 | "alice_tags = nltk.pos_tag(alice_norm,tagset=\"universal\")" 59 | ] 60 | }, 61 | { 62 | "cell_type": "code", 63 | "execution_count": 5, 64 | "metadata": { 65 | "collapsed": true 66 | }, 67 | "outputs": [], 68 | "source": [ 69 | "alice_cfd = nltk.ConditionalFreqDist(alice_tags)" 70 | ] 71 | }, 72 | { 73 | "cell_type": "code", 74 | "execution_count": 6, 75 | "metadata": { 76 | "collapsed": false 77 | }, 78 | "outputs": [ 79 | { 80 | "data": { 81 | "text/plain": [ 82 | "FreqDist({'ADP': 31, 'ADV': 4, 'PRT': 5})" 83 | ] 84 | }, 85 | "execution_count": 6, 86 | "metadata": {}, 87 | "output_type": "execute_result" 88 | } 89 | ], 90 | "source": [ 91 | "alice_cfd['over']" 92 | ] 93 | }, 94 | { 95 | "cell_type": "code", 96 | "execution_count": 7, 97 | "metadata": { 98 | "collapsed": false 99 | }, 100 | "outputs": [ 101 | { 102 | "data": { 103 | "text/plain": [ 104 | "FreqDist({'NOUN': 1, 'VERB': 16})" 105 | ] 106 | }, 107 | "execution_count": 7, 108 | "metadata": {}, 109 | "output_type": "execute_result" 110 | } 111 | ], 112 | "source": [ 113 | "alice_cfd['spoke']" 114 | ] 115 | }, 116 | { 117 | "cell_type": "code", 118 | "execution_count": 8, 119 | "metadata": { 120 | "collapsed": false 121 | }, 122 | "outputs": [ 123 | { 124 | "data": { 125 | "text/plain": [ 126 | "FreqDist({'ADP': 1, 'NOUN': 5, 'VERB': 3})" 127 | ] 128 | }, 129 | "execution_count": 8, 130 | "metadata": {}, 131 | "output_type": "execute_result" 132 | } 133 | ], 134 | "source": [ 135 | "alice_cfd['answer']" 136 | ] 137 | } 138 | ], 139 | "metadata": { 140 | "kernelspec": { 141 | "display_name": "Python 3", 142 | "language": "python", 143 | "name": "python3" 144 | }, 145 | "language_info": { 146 | "codemirror_mode": { 147 | "name": "ipython", 148 | "version": 3 149 | }, 150 | "file_extension": ".py", 151 | "mimetype": "text/x-python", 152 | "name": "python", 153 | "nbconvert_exporter": "python", 154 | "pygments_lexer": "ipython3", 155 | "version": "3.6.0" 156 | } 157 | }, 158 | "nbformat": 4, 159 | "nbformat_minor": 2 160 | } 161 | -------------------------------------------------------------------------------- /3. Tokenization, Tagging, Chunking/5. 
Example - Choices.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Tokenization, Tagging, Chunking - Example: Choices" 8 | ] 9 | }, 10 | { 11 | "cell_type": "code", 12 | "execution_count": 1, 13 | "metadata": { 14 | "collapsed": true 15 | }, 16 | "outputs": [], 17 | "source": [ 18 | "import nltk" 19 | ] 20 | }, 21 | { 22 | "cell_type": "markdown", 23 | "metadata": {}, 24 | "source": [ 25 | "Find all the cases in a given text where there is a choice between two options, i.e. the pattern \"NOUN 'or' NOUN\"." 26 | ] 27 | }, 28 | { 29 | "cell_type": "code", 30 | "execution_count": 2, 31 | "metadata": { 32 | "collapsed": false 33 | }, 34 | "outputs": [], 35 | "source": [ 36 | "stories = nltk.corpus.gutenberg.words(\"bryant-stories.txt\")\n", 37 | "tags = nltk.pos_tag(stories, tagset=\"universal\")" 38 | ] 39 | }, 40 | { 41 | "cell_type": "code", 42 | "execution_count": 3, 43 | "metadata": { 44 | "collapsed": false 45 | }, 46 | "outputs": [ 47 | { 48 | "data": { 49 | "text/plain": [ 50 | "[('[', 'NOUN'),\n", 51 | " ('Stories', 'NOUN'),\n", 52 | " ('to', 'PRT'),\n", 53 | " ('Tell', 'VERB'),\n", 54 | " ('to', 'PRT'),\n", 55 | " ('Children', 'NOUN'),\n", 56 | " ('by', 'ADP'),\n", 57 | " ('Sara', 'NOUN'),\n", 58 | " ('Cone', 'NOUN'),\n", 59 | " ('Bryant', 'NOUN')]" 60 | ] 61 | }, 62 | "execution_count": 3, 63 | "metadata": {}, 64 | "output_type": "execute_result" 65 | } 66 | ], 67 | "source": [ 68 | "tags[:10]" 69 | ] 70 | }, 71 | { 72 | "cell_type": "code", 73 | "execution_count": 5, 74 | "metadata": { 75 | "collapsed": false 76 | }, 77 | "outputs": [ 78 | { 79 | "name": "stdout", 80 | "output_type": "stream", 81 | "text": [ 82 | "ship or part\n", 83 | "food or water\n", 84 | "queens or princesses\n", 85 | "rank or wealth\n" 86 | ] 87 | } 88 | ], 89 | "source": [ 90 | "for ((word1,tag1),(word2,tag2),(word3,tag3)) in nltk.trigrams(tags):\n", 91 | " if tag1 == \"NOUN\" and word2 == \"or\" and tag3 == \"NOUN\":\n", 92 | " print(word1 + \" \" + word2 + \" \" + word3)" 93 | ] 94 | } 95 | ], 96 | "metadata": { 97 | "kernelspec": { 98 | "display_name": "Python 3", 99 | "language": "python", 100 | "name": "python3" 101 | }, 102 | "language_info": { 103 | "codemirror_mode": { 104 | "name": "ipython", 105 | "version": 3 106 | }, 107 | "file_extension": ".py", 108 | "mimetype": "text/x-python", 109 | "name": "python", 110 | "nbconvert_exporter": "python", 111 | "pygments_lexer": "ipython3", 112 | "version": "3.6.0" 113 | } 114 | }, 115 | "nbformat": 4, 116 | "nbformat_minor": 2 117 | } 118 | -------------------------------------------------------------------------------- /3. Tokenization, Tagging, Chunking/6. Chunking.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Tokenization, Tagging, Chunking - Chunking" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "Through chunking, we can keep two-word entities such as \"New York\" from being split apart; the sketch just below illustrates the tag-pattern notation the chunker uses."
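# A minimal sketch of the tag-pattern notation used below with
# nltk.RegexpParser: each {...} rule matches a sequence of POS tags, so
# {<NN>+} wraps one or more consecutive singular nouns in a chunk.
import nltk
grammar = "NP: {<DT>?<JJ>*<NN>+}"  # optional determiner, any adjectives, then nouns
chunker = nltk.RegexpParser(grammar)
print(chunker.parse(nltk.pos_tag(nltk.word_tokenize("the hot black coffee"))))
# roughly: (S (NP the/DT hot/JJ black/JJ coffee/NN))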
15 | ] 16 | }, 17 | { 18 | "cell_type": "code", 19 | "execution_count": 1, 20 | "metadata": { 21 | "collapsed": true 22 | }, 23 | "outputs": [], 24 | "source": [ 25 | "sentence = \"I will go to the coffee shop in New York after I get off the jet plane.\"" 26 | ] 27 | }, 28 | { 29 | "cell_type": "code", 30 | "execution_count": 2, 31 | "metadata": { 32 | "collapsed": true 33 | }, 34 | "outputs": [], 35 | "source": [ 36 | "import nltk" 37 | ] 38 | }, 39 | { 40 | "cell_type": "code", 41 | "execution_count": 3, 42 | "metadata": { 43 | "collapsed": true 44 | }, 45 | "outputs": [], 46 | "source": [ 47 | "sent_tag = nltk.pos_tag(nltk.word_tokenize(sentence))" 48 | ] 49 | }, 50 | { 51 | "cell_type": "code", 52 | "execution_count": 4, 53 | "metadata": { 54 | "collapsed": false 55 | }, 56 | "outputs": [ 57 | { 58 | "data": { 59 | "text/plain": [ 60 | "[('I', 'PRP'),\n", 61 | " ('will', 'MD'),\n", 62 | " ('go', 'VB'),\n", 63 | " ('to', 'TO'),\n", 64 | " ('the', 'DT'),\n", 65 | " ('coffee', 'NN'),\n", 66 | " ('shop', 'NN'),\n", 67 | " ('in', 'IN'),\n", 68 | " ('New', 'NNP'),\n", 69 | " ('York', 'NNP'),\n", 70 | " ('after', 'IN'),\n", 71 | " ('I', 'PRP'),\n", 72 | " ('get', 'VBP'),\n", 73 | " ('off', 'IN'),\n", 74 | " ('the', 'DT'),\n", 75 | " ('jet', 'NN'),\n", 76 | " ('plane', 'NN'),\n", 77 | " ('.', '.')]" 78 | ] 79 | }, 80 | "execution_count": 4, 81 | "metadata": {}, 82 | "output_type": "execute_result" 83 | } 84 | ], 85 | "source": [ 86 | "sent_tag" 87 | ] 88 | }, 89 | { 90 | "cell_type": "code", 91 | "execution_count": 5, 92 | "metadata": { 93 | "collapsed": true 94 | }, 95 | "outputs": [], 96 | "source": [ 97 | "sequence = '''\n", 98 | " CHUNK: {<NN>+}\n", 99 | " {<NNP>+}\n", 100 | " '''" 101 | ] 102 | }, 103 | { 104 | "cell_type": "code", 105 | "execution_count": 6, 106 | "metadata": { 107 | "collapsed": false 108 | }, 109 | "outputs": [], 110 | "source": [ 111 | "NPChunker = nltk.RegexpParser(sequence)" 112 | ] 113 | }, 114 | { 115 | "cell_type": "code", 116 | "execution_count": 7, 117 | "metadata": { 118 | "collapsed": false 119 | }, 120 | "outputs": [], 121 | "source": [ 122 | "result = NPChunker.parse(sent_tag)" 123 | ] 124 | }, 125 | { 126 | "cell_type": "code", 127 | "execution_count": 9, 128 | "metadata": { 129 | "collapsed": false 130 | }, 131 | "outputs": [ 132 | { 133 | "name": "stdout", 134 | "output_type": "stream", 135 | "text": [ 136 | "(S\n", 137 | " I/PRP\n", 138 | " will/MD\n", 139 | " go/VB\n", 140 | " to/TO\n", 141 | " the/DT\n", 142 | " (CHUNK coffee/NN shop/NN)\n", 143 | " in/IN\n", 144 | " (CHUNK New/NNP York/NNP)\n", 145 | " after/IN\n", 146 | " I/PRP\n", 147 | " get/VBP\n", 148 | " off/IN\n", 149 | " the/DT\n", 150 | " (CHUNK jet/NN plane/NN)\n", 151 | " ./.)\n" 152 | ] 153 | } 154 | ], 155 | "source": [ 156 | "print(result)" 157 | ] 158 | } 159 | ], 160 | "metadata": { 161 | "kernelspec": { 162 | "display_name": "Python 3", 163 | "language": "python", 164 | "name": "python3" 165 | }, 166 | "language_info": { 167 | "codemirror_mode": { 168 | "name": "ipython", 169 | "version": 3 170 | }, 171 | "file_extension": ".py", 172 | "mimetype": "text/x-python", 173 | "name": "python", 174 | "nbconvert_exporter": "python", 175 | "pygments_lexer": "ipython3", 176 | "version": "3.6.0" 177 | } 178 | }, 179 | "nbformat": 4, 180 | "nbformat_minor": 2 181 | } 182 | -------------------------------------------------------------------------------- /3. Tokenization, Tagging, Chunking/7.
Example - Named Entity Recognition.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Tokenization, Tagging, Chunking - Example: Named Entity Recognition" 8 | ] 9 | }, 10 | { 11 | "cell_type": "code", 12 | "execution_count": 9, 13 | "metadata": { 14 | "collapsed": true 15 | }, 16 | "outputs": [], 17 | "source": [ 18 | "import nltk" 19 | ] 20 | }, 21 | { 22 | "cell_type": "code", 23 | "execution_count": 12, 24 | "metadata": {}, 25 | "outputs": [], 26 | "source": [ 27 | "text = open(\"example.txt\").read()" 28 | ] 29 | }, 30 | { 31 | "cell_type": "code", 32 | "execution_count": 13, 33 | "metadata": { 34 | "collapsed": false 35 | }, 36 | "outputs": [ 37 | { 38 | "data": { 39 | "text/plain": [ 40 | "'World War II (WWII or WW2), also known as the Second World War, was a global war that lasted from 1939 to 1945, though related conflicts began earlier. It involved the vast majority of the world\\'s nations—including all of the great powers—eventually forming two opposing military alliances: the Allies and the Axis. It was the most widespread war in history, and directly involved more than 100 million people from over 30 countries. In a state of \"total war\", the major participants threw their entire economic, industrial, and scientific capabilities behind the war effort, erasing the distinction between civilian and military resources. Marked by mass deaths of civilians, including the Holocaust (in which approximately 11 million people were killed) and the strategic bombing of industrial and population centres (in which approximately one million were killed, and which included the atomic bombings of Hiroshima and Nagasaki), it resulted in an estimated 50 million to 85 million fatalities. These made World War II the deadliest conflict in human history.\\n\\nThe Empire of Japan aimed to dominate Asia and the Pacific and was already at war with the Republic of China in 1937, but the world war is generally said to have begun on 1 September 1939 with the invasion of Poland by Germany and subsequent declarations of war on Germany by France and the United Kingdom. From late 1939 to early 1941, in a series of campaigns and treaties, Germany conquered or controlled much of continental Europe, and formed the Axis alliance with Italy and Japan. Based on the Molotov–Ribbentrop Pact of August 1939, Germany and the Soviet Union partitioned and annexed territories of their European neighbours, Poland, Finland, Romania and the Baltic states. For a year starting in late June 1940, the United Kingdom and the British Commonwealth were the only Allied forces continuing the fight against the European Axis powers, with campaigns in North Africa and the Horn of Africa, the aerial Battle of Britain and the Blitz bombing campaign, as well as the long-running Battle of the Atlantic. In June 1941, the European Axis powers launched an invasion of the Soviet Union, opening the largest land theatre of war in history, which trapped the major part of the Axis\\' military forces into a war of attrition. In December 1941, Japan attacked the United States and European territories in the Pacific Ocean, and quickly conquered much of the Western Pacific.\\n\\nThe Axis advance halted in 1942 when Japan lost the critical Battle of Midway, near Hawaii, and Germany was defeated in North Africa and then, decisively, at Stalingrad in the Soviet Union. 
In 1943, with a series of German defeats on the Eastern Front, the Allied invasion of Italy which brought about Italian surrender, and Allied victories in the Pacific, the Axis lost the initiative and undertook strategic retreat on all fronts. In 1944, the Western Allies invaded German-occupied France, while the Soviet Union regained all of its territorial losses and invaded Germany and its allies. During 1944 and 1945 the Japanese suffered major reverses in mainland Asia in South Central China and Burma, while the Allies crippled the Japanese Navy and captured key Western Pacific islands.\\n\\nThe war in Europe ended with an invasion of Germany by the Western Allies and the Soviet Union culminating in the capture of Berlin by Soviet and Polish troops and the subsequent German unconditional surrender on 8 May 1945. Following the Potsdam Declaration by the Allies on 26 July 1945 and the refusal of Japan to surrender under its terms, the United States dropped atomic bombs on the Japanese cities of Hiroshima and Nagasaki on 6 August and 9 August respectively. With an invasion of the Japanese archipelago imminent, the possibility of additional atomic bombings, and the Soviet Union\\'s declaration of war on Japan and invasion of Manchuria, Japan surrendered on 15 August 1945. Thus ended the war in Asia, cementing the total victory of the Allies.\\n\\nWorld War II altered the political alignment and social structure of the world. The United Nations (UN) was established to foster international co-operation and prevent future conflicts. The victorious great powers—the United States, the Soviet Union, China, the United Kingdom, and France—became the permanent members of the United Nations Security Council. The Soviet Union and the United States emerged as rival superpowers, setting the stage for the Cold War, which lasted for the next 46 years. Meanwhile, the influence of European great powers waned, while the decolonisation of Asia and Africa began. Most countries whose industries had been damaged moved towards economic recovery. 
Political integration, especially in Europe, emerged as an effort to end pre-war enmities and to create a common identity.'" 41 | ] 42 | }, 43 | "execution_count": 13, 44 | "metadata": {}, 45 | "output_type": "execute_result" 46 | } 47 | ], 48 | "source": [ 49 | "text" 50 | ] 51 | }, 52 | { 53 | "cell_type": "code", 54 | "execution_count": 14, 55 | "metadata": { 56 | "collapsed": true 57 | }, 58 | "outputs": [], 59 | "source": [ 60 | "text_tag = nltk.pos_tag(nltk.word_tokenize(text))" 61 | ] 62 | }, 63 | { 64 | "cell_type": "code", 65 | "execution_count": 15, 66 | "metadata": { 67 | "collapsed": false 68 | }, 69 | "outputs": [], 70 | "source": [ 71 | "text_ch = nltk.ne_chunk(text_tag)" 72 | ] 73 | }, 74 | { 75 | "cell_type": "code", 76 | "execution_count": 17, 77 | "metadata": { 78 | "collapsed": false 79 | }, 80 | "outputs": [ 81 | { 82 | "name": "stdout", 83 | "output_type": "stream", 84 | "text": [ 85 | "ORGANIZATION WWII\n", 86 | "ORGANIZATION WW2\n", 87 | "ORGANIZATION Second\n", 88 | "ORGANIZATION Axis\n", 89 | "ORGANIZATION Hiroshima\n", 90 | "GPE Nagasaki\n", 91 | "ORGANIZATION Empire of Japan\n", 92 | "GPE Asia\n", 93 | "ORGANIZATION Pacific\n", 94 | "ORGANIZATION Republic\n", 95 | "GPE China\n", 96 | "GPE Poland\n", 97 | "GPE Germany\n", 98 | "GPE Germany\n", 99 | "GPE France\n", 100 | "ORGANIZATION United Kingdom\n", 101 | "GPE Germany\n", 102 | "GPE Europe\n", 103 | "ORGANIZATION Axis\n", 104 | "GPE Italy\n", 105 | "GPE Japan\n", 106 | "GPE Germany\n", 107 | "GPE Soviet Union\n", 108 | "GPE European\n", 109 | "GPE Poland\n", 110 | "GPE Finland\n", 111 | "GPE Romania\n", 112 | "GPE Baltic\n", 113 | "ORGANIZATION United Kingdom\n", 114 | "GPE British\n", 115 | "ORGANIZATION European Axis\n", 116 | "GPE North Africa\n", 117 | "ORGANIZATION Horn\n", 118 | "GPE Africa\n", 119 | "GPE Britain\n", 120 | "GPE Blitz\n", 121 | "ORGANIZATION Atlantic\n", 122 | "ORGANIZATION European Axis\n", 123 | "GPE Soviet Union\n", 124 | "ORGANIZATION Axis\n", 125 | "GPE Japan\n", 126 | "GPE United States\n", 127 | "GPE European\n", 128 | "ORGANIZATION Pacific Ocean\n", 129 | "LOCATION Western Pacific\n", 130 | "ORGANIZATION Axis\n", 131 | "PERSON Japan\n", 132 | "GPE Midway\n", 133 | "GPE Hawaii\n", 134 | "GPE Germany\n", 135 | "GPE North Africa\n", 136 | "FACILITY Stalingrad\n", 137 | "GPE Soviet Union\n", 138 | "GPE German\n", 139 | "LOCATION Eastern Front\n", 140 | "GPE Italy\n", 141 | "GPE Italian\n", 142 | "GPE Allied\n", 143 | "ORGANIZATION Pacific\n", 144 | "ORGANIZATION Axis\n", 145 | "LOCATION Western\n", 146 | "GPE France\n", 147 | "GPE Soviet Union\n", 148 | "GPE Germany\n", 149 | "GPE Japanese\n", 150 | "GPE Asia\n", 151 | "GPE South\n", 152 | "GPE China\n", 153 | "GPE Burma\n", 154 | "GPE Japanese\n", 155 | "ORGANIZATION Navy\n", 156 | "LOCATION Western Pacific\n", 157 | "GPE Europe\n", 158 | "GPE Germany\n", 159 | "LOCATION Western\n", 160 | "GPE Soviet Union\n", 161 | "GPE Berlin\n", 162 | "GPE Soviet\n", 163 | "GPE Polish\n", 164 | "GPE German\n", 165 | "ORGANIZATION Potsdam\n", 166 | "GPE Japan\n", 167 | "GPE United States\n", 168 | "GPE Japanese\n", 169 | "ORGANIZATION Hiroshima\n", 170 | "PERSON Nagasaki\n", 171 | "GPE Japanese\n", 172 | "GPE Soviet Union\n", 173 | "GPE Japan\n", 174 | "GPE Manchuria\n", 175 | "GPE Japan\n", 176 | "GPE Asia\n", 177 | "ORGANIZATION United Nations\n", 178 | "GPE United States\n", 179 | "GPE Soviet Union\n", 180 | "GPE China\n", 181 | "ORGANIZATION United Kingdom\n", 182 | "ORGANIZATION United Nations\n", 183 | "ORGANIZATION Security Council\n", 184 | 
"GPE Soviet Union\n", 185 | "GPE United States\n", 186 | "GPE European\n", 187 | "GPE Asia\n", 188 | "PERSON Africa\n", 189 | "GPE Europe\n" 190 | ] 191 | } 192 | ], 193 | "source": [ 194 | "for chunk in text_ch:\n", 195 | " if hasattr(chunk, 'label'):\n", 196 | " print(chunk.label(), ' '.join(c[0] for c in chunk.leaves()))" 197 | ] 198 | } 199 | ], 200 | "metadata": { 201 | "kernelspec": { 202 | "display_name": "Python 3", 203 | "language": "python", 204 | "name": "python3" 205 | }, 206 | "language_info": { 207 | "codemirror_mode": { 208 | "name": "ipython", 209 | "version": 3 210 | }, 211 | "file_extension": ".py", 212 | "mimetype": "text/x-python", 213 | "name": "python", 214 | "nbconvert_exporter": "python", 215 | "pygments_lexer": "ipython3", 216 | "version": "3.6.0" 217 | } 218 | }, 219 | "nbformat": 4, 220 | "nbformat_minor": 2 221 | } 222 | -------------------------------------------------------------------------------- /3. Tokenization, Tagging, Chunking/example.txt: -------------------------------------------------------------------------------- 1 | World War II (WWII or WW2), also known as the Second World War, was a global war that lasted from 1939 to 1945, though related conflicts began earlier. It involved the vast majority of the world's nations—including all of the great powers—eventually forming two opposing military alliances: the Allies and the Axis. It was the most widespread war in history, and directly involved more than 100 million people from over 30 countries. In a state of "total war", the major participants threw their entire economic, industrial, and scientific capabilities behind the war effort, erasing the distinction between civilian and military resources. Marked by mass deaths of civilians, including the Holocaust (in which approximately 11 million people were killed) and the strategic bombing of industrial and population centres (in which approximately one million were killed, and which included the atomic bombings of Hiroshima and Nagasaki), it resulted in an estimated 50 million to 85 million fatalities. These made World War II the deadliest conflict in human history. 2 | 3 | The Empire of Japan aimed to dominate Asia and the Pacific and was already at war with the Republic of China in 1937, but the world war is generally said to have begun on 1 September 1939 with the invasion of Poland by Germany and subsequent declarations of war on Germany by France and the United Kingdom. From late 1939 to early 1941, in a series of campaigns and treaties, Germany conquered or controlled much of continental Europe, and formed the Axis alliance with Italy and Japan. Based on the Molotov–Ribbentrop Pact of August 1939, Germany and the Soviet Union partitioned and annexed territories of their European neighbours, Poland, Finland, Romania and the Baltic states. For a year starting in late June 1940, the United Kingdom and the British Commonwealth were the only Allied forces continuing the fight against the European Axis powers, with campaigns in North Africa and the Horn of Africa, the aerial Battle of Britain and the Blitz bombing campaign, as well as the long-running Battle of the Atlantic. In June 1941, the European Axis powers launched an invasion of the Soviet Union, opening the largest land theatre of war in history, which trapped the major part of the Axis' military forces into a war of attrition. In December 1941, Japan attacked the United States and European territories in the Pacific Ocean, and quickly conquered much of the Western Pacific. 
4 | 5 | The Axis advance halted in 1942 when Japan lost the critical Battle of Midway, near Hawaii, and Germany was defeated in North Africa and then, decisively, at Stalingrad in the Soviet Union. In 1943, with a series of German defeats on the Eastern Front, the Allied invasion of Italy which brought about Italian surrender, and Allied victories in the Pacific, the Axis lost the initiative and undertook strategic retreat on all fronts. In 1944, the Western Allies invaded German-occupied France, while the Soviet Union regained all of its territorial losses and invaded Germany and its allies. During 1944 and 1945 the Japanese suffered major reverses in mainland Asia in South Central China and Burma, while the Allies crippled the Japanese Navy and captured key Western Pacific islands. 6 | 7 | The war in Europe ended with an invasion of Germany by the Western Allies and the Soviet Union culminating in the capture of Berlin by Soviet and Polish troops and the subsequent German unconditional surrender on 8 May 1945. Following the Potsdam Declaration by the Allies on 26 July 1945 and the refusal of Japan to surrender under its terms, the United States dropped atomic bombs on the Japanese cities of Hiroshima and Nagasaki on 6 August and 9 August respectively. With an invasion of the Japanese archipelago imminent, the possibility of additional atomic bombings, and the Soviet Union's declaration of war on Japan and invasion of Manchuria, Japan surrendered on 15 August 1945. Thus ended the war in Asia, cementing the total victory of the Allies. 8 | 9 | World War II altered the political alignment and social structure of the world. The United Nations (UN) was established to foster international co-operation and prevent future conflicts. The victorious great powers—the United States, the Soviet Union, China, the United Kingdom, and France—became the permanent members of the United Nations Security Council. The Soviet Union and the United States emerged as rival superpowers, setting the stage for the Cold War, which lasted for the next 46 years. Meanwhile, the influence of European great powers waned, while the decolonisation of Asia and Africa began. Most countries whose industries had been damaged moved towards economic recovery. Political integration, especially in Europe, emerged as an effort to end pre-war enmities and to create a common identity. -------------------------------------------------------------------------------- /4. Custom Sources/2. HTML.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Custom Sources - HTML" 8 | ] 9 | }, 10 | { 11 | "cell_type": "code", 12 | "execution_count": 1, 13 | "metadata": { 14 | "collapsed": true 15 | }, 16 | "outputs": [], 17 | "source": [ 18 | "import nltk\n", 19 | "import urllib.request" 20 | ] 21 | }, 22 | { 23 | "cell_type": "markdown", 24 | "metadata": {}, 25 | "source": [ 26 | "Websites are written in HTML, so when you pull information directly from a site, you will get all the code back along with the text." 
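# A minimal sketch of stripping the markup before tokenizing; this assumes the
# third-party BeautifulSoup package (bs4) is installed, which is not used
# elsewhere in these notebooks.
import urllib.request
from bs4 import BeautifulSoup

html = urllib.request.urlopen(
    "https://en.wikipedia.org/wiki/Python_(programming_language)").read()
raw_text = BeautifulSoup(html, "html.parser").get_text()  # markup removed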
27 | ] 28 | }, 29 | { 30 | "cell_type": "code", 31 | "execution_count": 2, 32 | "metadata": { 33 | "collapsed": true 34 | }, 35 | "outputs": [], 36 | "source": [ 37 | "url = \"https://en.wikipedia.org/wiki/Python_(programming_language)\"" 38 | ] 39 | }, 40 | { 41 | "cell_type": "code", 42 | "execution_count": 3, 43 | "metadata": {}, 44 | "outputs": [], 45 | "source": [ 46 | "response = urllib.request.urlopen(url)" 47 | ] 48 | }, 49 | { 50 | "cell_type": "code", 51 | "execution_count": 4, 52 | "metadata": {}, 53 | "outputs": [], 54 | "source": [ 55 | "html = response.read()" 56 | ] 57 | }, 58 | { 59 | "cell_type": "code", 60 | "execution_count": 5, 61 | "metadata": { 62 | "collapsed": false 63 | }, 64 | "outputs": [ 65 | { 66 | "data": { 67 | "text/plain": [ 68 | "b'\\n\\n\\n\\nPython (programming language) - Wikipedia\\n\\n