├── 0-Hello-World.ipynb ├── 1-Data-Cleaning.ipynb ├── 2-Exploratory-Data-Analysis.ipynb ├── 3-Sentiment-Analysis.ipynb ├── 4-Topic-Modeling.ipynb ├── 5-Text-Generation.ipynb ├── README.md ├── additional_resources.md ├── pickle ├── corpus.pkl ├── cv.pkl ├── cv_stop.pkl ├── data_clean.pkl ├── dtm.pkl └── dtm_stop.pkl └── transcripts ├── ali.txt ├── anthony.txt ├── bill.txt ├── bo.txt ├── dave.txt ├── hasan.txt ├── jim.txt ├── joe.txt ├── john.txt ├── louis.txt ├── mike.txt └── ricky.txt /0-Hello-World.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "### Hello World" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "Place your cursor in the cell below, and type Shift-Enter. If you see the printed statement below the cell, you are good to go!" 15 | ] 16 | }, 17 | { 18 | "cell_type": "code", 19 | "execution_count": null, 20 | "metadata": {}, 21 | "outputs": [], 22 | "source": [ 23 | "print('hello world')" 24 | ] 25 | } 26 | ], 27 | "metadata": { 28 | "kernelspec": { 29 | "display_name": "Python 3", 30 | "language": "python", 31 | "name": "python3" 32 | }, 33 | "language_info": { 34 | "codemirror_mode": { 35 | "name": "ipython", 36 | "version": 3 37 | }, 38 | "file_extension": ".py", 39 | "mimetype": "text/x-python", 40 | "name": "python", 41 | "nbconvert_exporter": "python", 42 | "pygments_lexer": "ipython3", 43 | "version": "3.6.2" 44 | }, 45 | "toc": { 46 | "nav_menu": {}, 47 | "number_sections": true, 48 | "sideBar": true, 49 | "skip_h1_title": false, 50 | "toc_cell": false, 51 | "toc_position": {}, 52 | "toc_section_display": "block", 53 | "toc_window_display": false 54 | }, 55 | "varInspector": { 56 | "cols": { 57 | "lenName": 16, 58 | "lenType": 16, 59 | "lenVar": 40 60 | }, 61 | "kernels_config": { 62 | "python": { 63 | "delete_cmd_postfix": "", 64 | "delete_cmd_prefix": "del ", 65 | "library": "var_list.py", 66 | "varRefreshCmd": "print(var_dic_list())" 67 | }, 68 | "r": { 69 | "delete_cmd_postfix": ") ", 70 | "delete_cmd_prefix": "rm(", 71 | "library": "var_list.r", 72 | "varRefreshCmd": "cat(var_dic_list()) " 73 | } 74 | }, 75 | "types_to_exclude": [ 76 | "module", 77 | "function", 78 | "builtin_function_or_method", 79 | "instance", 80 | "_Feature" 81 | ], 82 | "window_display": false 83 | } 84 | }, 85 | "nbformat": 4, 86 | "nbformat_minor": 2 87 | } 88 | -------------------------------------------------------------------------------- /1-Data-Cleaning.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Data Cleaning" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "## Introduction" 15 | ] 16 | }, 17 | { 18 | "cell_type": "markdown", 19 | "metadata": {}, 20 | "source": [ 21 | "This notebook goes through a necessary step of any data science project - data cleaning. Data cleaning is a time consuming and unenjoyable task, yet it's a very important one. Keep in mind, \"garbage in, garbage out\". Feeding dirty data into a model will give us results that are meaningless.\n", 22 | "\n", 23 | "Specifically, we'll be walking through:\n", 24 | "\n", 25 | "1. **Getting the data - **in this case, we'll be scraping data from a website\n", 26 | "2. **Cleaning the data - **we will walk through popular text pre-processing techniques\n", 27 | "3. 
**Organizing the data - **we will organize the cleaned data into a way that is easy to input into other algorithms\n", 28 | "\n", 29 | "The output of this notebook will be clean, organized data in two standard text formats:\n", 30 | "\n", 31 | "1. **Corpus** - a collection of text\n", 32 | "2. **Document-Term Matrix** - word counts in matrix format" 33 | ] 34 | }, 35 | { 36 | "cell_type": "markdown", 37 | "metadata": {}, 38 | "source": [ 39 | "## Problem Statement" 40 | ] 41 | }, 42 | { 43 | "cell_type": "markdown", 44 | "metadata": {}, 45 | "source": [ 46 | "As a reminder, our goal is to look at transcripts of various comedians and note their similarities and differences. Specifically, I'd like to know if Ali Wong's comedy style is different than other comedians, since she's the comedian that got me interested in stand up comedy." 47 | ] 48 | }, 49 | { 50 | "cell_type": "markdown", 51 | "metadata": {}, 52 | "source": [ 53 | "## Getting The Data" 54 | ] 55 | }, 56 | { 57 | "cell_type": "markdown", 58 | "metadata": {}, 59 | "source": [ 60 | "Luckily, there are wonderful people online that keep track of stand up routine transcripts. [Scraps From The Loft](http://scrapsfromtheloft.com) makes them available for non-profit and educational purposes.\n", 61 | "\n", 62 | "To decide which comedians to look into, I went on IMDB and looked specifically at comedy specials that were released in the past 5 years. To narrow it down further, I looked only at those with greater than a 7.5/10 rating and more than 2000 votes. If a comedian had multiple specials that fit those requirements, I would pick the most highly rated one. I ended up with a dozen comedy specials." 63 | ] 64 | }, 65 | { 66 | "cell_type": "code", 67 | "execution_count": null, 68 | "metadata": { 69 | "collapsed": true 70 | }, 71 | "outputs": [], 72 | "source": [ 73 | "# Web scraping, pickle imports\n", 74 | "import requests\n", 75 | "from bs4 import BeautifulSoup\n", 76 | "import pickle\n", 77 | "\n", 78 | "# Scrapes transcript data from scrapsfromtheloft.com\n", 79 | "def url_to_transcript(url):\n", 80 | " '''Returns transcript data specifically from scrapsfromtheloft.com.'''\n", 81 | " page = requests.get(url).text\n", 82 | " soup = BeautifulSoup(page, \"lxml\")\n", 83 | " text = [p.text for p in soup.find(class_=\"post-content\").find_all('p')]\n", 84 | " print(url)\n", 85 | " return text\n", 86 | "\n", 87 | "# URLs of transcripts in scope\n", 88 | "urls = ['http://scrapsfromtheloft.com/2017/05/06/louis-ck-oh-my-god-full-transcript/',\n", 89 | " 'http://scrapsfromtheloft.com/2017/04/11/dave-chappelle-age-spin-2017-full-transcript/',\n", 90 | " 'http://scrapsfromtheloft.com/2018/03/15/ricky-gervais-humanity-transcript/',\n", 91 | " 'http://scrapsfromtheloft.com/2017/08/07/bo-burnham-2013-full-transcript/',\n", 92 | " 'http://scrapsfromtheloft.com/2017/05/24/bill-burr-im-sorry-feel-way-2014-full-transcript/',\n", 93 | " 'http://scrapsfromtheloft.com/2017/04/21/jim-jefferies-bare-2014-full-transcript/',\n", 94 | " 'http://scrapsfromtheloft.com/2017/08/02/john-mulaney-comeback-kid-2015-full-transcript/',\n", 95 | " 'http://scrapsfromtheloft.com/2017/10/21/hasan-minhaj-homecoming-king-2017-full-transcript/',\n", 96 | " 'http://scrapsfromtheloft.com/2017/09/19/ali-wong-baby-cobra-2016-full-transcript/',\n", 97 | " 'http://scrapsfromtheloft.com/2017/08/03/anthony-jeselnik-thoughts-prayers-2015-full-transcript/',\n", 98 | " 'http://scrapsfromtheloft.com/2018/03/03/mike-birbiglia-my-girlfriends-boyfriend-2013-full-transcript/',\n", 99 | " 
'http://scrapsfromtheloft.com/2017/08/19/joe-rogan-triggered-2016-full-transcript/']\n", 100 | "\n", 101 | "# Comedian names\n", 102 | "comedians = ['louis', 'dave', 'ricky', 'bo', 'bill', 'jim', 'john', 'hasan', 'ali', 'anthony', 'mike', 'joe']" 103 | ] 104 | }, 105 | { 106 | "cell_type": "code", 107 | "execution_count": null, 108 | "metadata": {}, 109 | "outputs": [], 110 | "source": [ 111 | "# # Actually request transcripts (takes a few minutes to run)\n", 112 | "# transcripts = [url_to_transcript(u) for u in urls]" 113 | ] 114 | }, 115 | { 116 | "cell_type": "code", 117 | "execution_count": null, 118 | "metadata": { 119 | "collapsed": true 120 | }, 121 | "outputs": [], 122 | "source": [ 123 | "# # Pickle files for later use\n", 124 | "\n", 125 | "# # Make a new directory to hold the text files\n", 126 | "# !mkdir transcripts\n", 127 | "\n", 128 | "# for i, c in enumerate(comedians):\n", 129 | "# with open(\"transcripts/\" + c + \".txt\", \"wb\") as file:\n", 130 | "# pickle.dump(transcripts[i], file)" 131 | ] 132 | }, 133 | { 134 | "cell_type": "code", 135 | "execution_count": null, 136 | "metadata": { 137 | "collapsed": true 138 | }, 139 | "outputs": [], 140 | "source": [ 141 | "# Load pickled files\n", 142 | "data = {}\n", 143 | "for i, c in enumerate(comedians):\n", 144 | " with open(\"transcripts/\" + c + \".txt\", \"rb\") as file:\n", 145 | " data[c] = pickle.load(file)" 146 | ] 147 | }, 148 | { 149 | "cell_type": "code", 150 | "execution_count": null, 151 | "metadata": {}, 152 | "outputs": [], 153 | "source": [ 154 | "# Double check to make sure data has been loaded properly\n", 155 | "data.keys()" 156 | ] 157 | }, 158 | { 159 | "cell_type": "code", 160 | "execution_count": null, 161 | "metadata": {}, 162 | "outputs": [], 163 | "source": [ 164 | "# More checks\n", 165 | "data['louis'][:2]" 166 | ] 167 | }, 168 | { 169 | "cell_type": "markdown", 170 | "metadata": {}, 171 | "source": [ 172 | "## Cleaning The Data" 173 | ] 174 | }, 175 | { 176 | "cell_type": "markdown", 177 | "metadata": {}, 178 | "source": [ 179 | "When dealing with numerical data, data cleaning often involves removing null values and duplicate data, dealing with outliers, etc. With text data, there are some common data cleaning techniques, which are also known as text pre-processing techniques.\n", 180 | "\n", 181 | "With text data, this cleaning process can go on forever. There's always an exception to every cleaning step. So, we're going to follow the MVP (minimum viable product) approach - start simple and iterate. Here are a bunch of things you can do to clean your data. We're going to execute just the common cleaning steps here and the rest can be done at a later point to improve our results.\n", 182 | "\n", 183 | "**Common data cleaning steps on all text:**\n", 184 | "* Make text all lower case\n", 185 | "* Remove punctuation\n", 186 | "* Remove numerical values\n", 187 | "* Remove common non-sensical text (/n)\n", 188 | "* Tokenize text\n", 189 | "* Remove stop words\n", 190 | "\n", 191 | "**More data cleaning steps after tokenization:**\n", 192 | "* Stemming / lemmatization\n", 193 | "* Parts of speech tagging\n", 194 | "* Create bi-grams or tri-grams\n", 195 | "* Deal with typos\n", 196 | "* And more..." 
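None of the post-tokenization steps listed above (stemming / lemmatization, part-of-speech tagging, bi-grams) are actually applied in this notebook. As a point of reference only, here is a minimal sketch of what they might look like with NLTK, assuming NLTK is installed and its 'punkt', 'wordnet' and 'averaged_perceptron_tagger' resources have been downloaded; the sample sentence is made up.

```python
# Minimal sketch of the post-tokenization steps listed above -- not part of this
# notebook's pipeline. Assumes nltk is installed and the 'punkt', 'wordnet' and
# 'averaged_perceptron_tagger' resources have been downloaded via nltk.download().
import nltk
from nltk import word_tokenize, pos_tag
from nltk.stem import WordNetLemmatizer

sample = "thank you thank you the crowd cheered and kept cheering"  # made-up example

lemmatizer = WordNetLemmatizer()

tokens = word_tokenize(sample)                                   # tokenize text
lemmas = [lemmatizer.lemmatize(t, pos='v') for t in tokens]      # 'cheered' / 'cheering' -> 'cheer'
tags = pos_tag(tokens)                                           # part-of-speech tags, e.g. ('crowd', 'NN')
bigrams = list(nltk.bigrams(tokens))                             # bi-grams, e.g. ('thank', 'you')

print(lemmas)
print(tags)
print(bigrams[:5])
```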
197 | ] 198 | }, 199 | { 200 | "cell_type": "code", 201 | "execution_count": null, 202 | "metadata": {}, 203 | "outputs": [], 204 | "source": [ 205 | "# Let's take a look at our data again\n", 206 | "next(iter(data.keys()))" 207 | ] 208 | }, 209 | { 210 | "cell_type": "code", 211 | "execution_count": null, 212 | "metadata": {}, 213 | "outputs": [], 214 | "source": [ 215 | "# Notice that our dictionary is currently in key: comedian, value: list of text format\n", 216 | "next(iter(data.values()))" 217 | ] 218 | }, 219 | { 220 | "cell_type": "code", 221 | "execution_count": null, 222 | "metadata": { 223 | "collapsed": true 224 | }, 225 | "outputs": [], 226 | "source": [ 227 | "# We are going to change this to key: comedian, value: string format\n", 228 | "def combine_text(list_of_text):\n", 229 | " '''Takes a list of text and combines them into one large chunk of text.'''\n", 230 | " combined_text = ' '.join(list_of_text)\n", 231 | " return combined_text" 232 | ] 233 | }, 234 | { 235 | "cell_type": "code", 236 | "execution_count": null, 237 | "metadata": { 238 | "collapsed": true 239 | }, 240 | "outputs": [], 241 | "source": [ 242 | "# Combine it!\n", 243 | "data_combined = {key: [combine_text(value)] for (key, value) in data.items()}" 244 | ] 245 | }, 246 | { 247 | "cell_type": "code", 248 | "execution_count": null, 249 | "metadata": {}, 250 | "outputs": [], 251 | "source": [ 252 | "# We can either keep it in dictionary format or put it into a pandas dataframe\n", 253 | "import pandas as pd\n", 254 | "pd.set_option('max_colwidth',150)\n", 255 | "\n", 256 | "data_df = pd.DataFrame.from_dict(data_combined).transpose()\n", 257 | "data_df.columns = ['transcript']\n", 258 | "data_df = data_df.sort_index()\n", 259 | "data_df" 260 | ] 261 | }, 262 | { 263 | "cell_type": "code", 264 | "execution_count": null, 265 | "metadata": {}, 266 | "outputs": [], 267 | "source": [ 268 | "# Let's take a look at the transcript for Ali Wong\n", 269 | "data_df.transcript.loc['ali']" 270 | ] 271 | }, 272 | { 273 | "cell_type": "code", 274 | "execution_count": null, 275 | "metadata": { 276 | "collapsed": true 277 | }, 278 | "outputs": [], 279 | "source": [ 280 | "# Apply a first round of text cleaning techniques\n", 281 | "import re\n", 282 | "import string\n", 283 | "\n", 284 | "def clean_text_round1(text):\n", 285 | " '''Make text lowercase, remove text in square brackets, remove punctuation and remove words containing numbers.'''\n", 286 | " text = text.lower()\n", 287 | " text = re.sub('\\[.*?\\]', '', text)\n", 288 | " text = re.sub('[%s]' % re.escape(string.punctuation), '', text)\n", 289 | " text = re.sub('\\w*\\d\\w*', '', text)\n", 290 | " return text\n", 291 | "\n", 292 | "round1 = lambda x: clean_text_round1(x)" 293 | ] 294 | }, 295 | { 296 | "cell_type": "code", 297 | "execution_count": null, 298 | "metadata": {}, 299 | "outputs": [], 300 | "source": [ 301 | "# Let's take a look at the updated text\n", 302 | "data_clean = pd.DataFrame(data_df.transcript.apply(round1))\n", 303 | "data_clean" 304 | ] 305 | }, 306 | { 307 | "cell_type": "code", 308 | "execution_count": null, 309 | "metadata": { 310 | "collapsed": true 311 | }, 312 | "outputs": [], 313 | "source": [ 314 | "# Apply a second round of cleaning\n", 315 | "def clean_text_round2(text):\n", 316 | " '''Get rid of some additional punctuation and non-sensical text that was missed the first time around.'''\n", 317 | " text = re.sub('[‘’“”…]', '', text)\n", 318 | " text = re.sub('\\n', '', text)\n", 319 | " return text\n", 320 | "\n", 321 | "round2 = 
lambda x: clean_text_round2(x)" 322 | ] 323 | }, 324 | { 325 | "cell_type": "code", 326 | "execution_count": null, 327 | "metadata": {}, 328 | "outputs": [], 329 | "source": [ 330 | "# Let's take a look at the updated text\n", 331 | "data_clean = pd.DataFrame(data_clean.transcript.apply(round2))\n", 332 | "data_clean" 333 | ] 334 | }, 335 | { 336 | "cell_type": "markdown", 337 | "metadata": {}, 338 | "source": [ 339 | "**NOTE:** This data cleaning aka text pre-processing step could go on for a while, but we are going to stop for now. After going through some analysis techniques, if you see that the results don't make sense or could be improved, you can come back and make more edits such as:\n", 340 | "* Mark 'cheering' and 'cheer' as the same word (stemming / lemmatization)\n", 341 | "* Combine 'thank you' into one term (bi-grams)\n", 342 | "* And a lot more..." 343 | ] 344 | }, 345 | { 346 | "cell_type": "markdown", 347 | "metadata": {}, 348 | "source": [ 349 | "## Organizing The Data" 350 | ] 351 | }, 352 | { 353 | "cell_type": "markdown", 354 | "metadata": {}, 355 | "source": [ 356 | "I mentioned earlier that the output of this notebook will be clean, organized data in two standard text formats:\n", 357 | "1. **Corpus - **a collection of text\n", 358 | "2. **Document-Term Matrix - **word counts in matrix format" 359 | ] 360 | }, 361 | { 362 | "cell_type": "markdown", 363 | "metadata": {}, 364 | "source": [ 365 | "### Corpus" 366 | ] 367 | }, 368 | { 369 | "cell_type": "markdown", 370 | "metadata": {}, 371 | "source": [ 372 | "We already created a corpus in an earlier step. The definition of a corpus is a collection of texts, and they are all put together neatly in a pandas dataframe here." 373 | ] 374 | }, 375 | { 376 | "cell_type": "code", 377 | "execution_count": null, 378 | "metadata": {}, 379 | "outputs": [], 380 | "source": [ 381 | "# Let's take a look at our dataframe\n", 382 | "data_df" 383 | ] 384 | }, 385 | { 386 | "cell_type": "code", 387 | "execution_count": null, 388 | "metadata": {}, 389 | "outputs": [], 390 | "source": [ 391 | "# Let's add the comedians' full names as well\n", 392 | "full_names = ['Ali Wong', 'Anthony Jeselnik', 'Bill Burr', 'Bo Burnham', 'Dave Chappelle', 'Hasan Minhaj',\n", 393 | " 'Jim Jefferies', 'Joe Rogan', 'John Mulaney', 'Louis C.K.', 'Mike Birbiglia', 'Ricky Gervais']\n", 394 | "\n", 395 | "data_df['full_name'] = full_names\n", 396 | "data_df" 397 | ] 398 | }, 399 | { 400 | "cell_type": "code", 401 | "execution_count": null, 402 | "metadata": { 403 | "collapsed": true 404 | }, 405 | "outputs": [], 406 | "source": [ 407 | "# Let's pickle it for later use\n", 408 | "data_df.to_pickle(\"corpus.pkl\")" 409 | ] 410 | }, 411 | { 412 | "cell_type": "markdown", 413 | "metadata": {}, 414 | "source": [ 415 | "### Document-Term Matrix" 416 | ] 417 | }, 418 | { 419 | "cell_type": "markdown", 420 | "metadata": {}, 421 | "source": [ 422 | "For many of the techniques we'll be using in future notebooks, the text must be tokenized, meaning broken down into smaller pieces. The most common tokenization technique is to break down text into words. We can do this using scikit-learn's CountVectorizer, where every row will represent a different document and every column will represent a different word.\n", 423 | "\n", 424 | "In addition, with CountVectorizer, we can remove stop words. Stop words are common words that add no additional meaning to text such as 'a', 'the', etc." 
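Before fitting CountVectorizer on the real transcripts, here is a toy illustration of the document-term matrix idea using a few made-up sentences (nothing from it is reused later). It also previews the ngram_range, min_df and max_df parameters that come up in the additional exercises at the end of this notebook.

```python
# Toy example of CountVectorizer -- the sentences are made up and nothing here is
# reused later; the real document-term matrix is built from data_clean below.
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

toy_docs = ["the joke was funny", "the crowd laughed at the joke", "funny funny stuff"]

cv_toy = CountVectorizer(stop_words='english')       # drops 'the', 'was', 'at', ...
toy_dtm = pd.DataFrame(cv_toy.fit_transform(toy_docs).toarray(),
                       columns=cv_toy.get_feature_names())
print(toy_dtm)   # one row per document, one column per remaining word, values are counts

# Parameters worth experimenting with (see the exercises at the end of this notebook):
# ngram_range=(1, 2) also counts two-word phrases such as 'funny stuff',
# min_df=2 keeps only terms appearing in at least 2 documents,
# max_df=0.8 drops terms appearing in more than 80% of documents.
```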
425 | ] 426 | }, 427 | { 428 | "cell_type": "code", 429 | "execution_count": null, 430 | "metadata": {}, 431 | "outputs": [], 432 | "source": [ 433 | "# We are going to create a document-term matrix using CountVectorizer, and exclude common English stop words\n", 434 | "from sklearn.feature_extraction.text import CountVectorizer\n", 435 | "\n", 436 | "cv = CountVectorizer(stop_words='english')\n", 437 | "data_cv = cv.fit_transform(data_clean.transcript)\n", 438 | "data_dtm = pd.DataFrame(data_cv.toarray(), columns=cv.get_feature_names())\n", 439 | "data_dtm.index = data_clean.index\n", 440 | "data_dtm" 441 | ] 442 | }, 443 | { 444 | "cell_type": "code", 445 | "execution_count": null, 446 | "metadata": { 447 | "collapsed": true 448 | }, 449 | "outputs": [], 450 | "source": [ 451 | "# Let's pickle it for later use\n", 452 | "data_dtm.to_pickle(\"dtm.pkl\")" 453 | ] 454 | }, 455 | { 456 | "cell_type": "code", 457 | "execution_count": null, 458 | "metadata": { 459 | "collapsed": true 460 | }, 461 | "outputs": [], 462 | "source": [ 463 | "# Let's also pickle the cleaned data (before we put it in document-term matrix format) and the CountVectorizer object\n", 464 | "data_clean.to_pickle('data_clean.pkl')\n", 465 | "pickle.dump(cv, open(\"cv.pkl\", \"wb\"))" 466 | ] 467 | }, 468 | { 469 | "cell_type": "markdown", 470 | "metadata": { 471 | "collapsed": true 472 | }, 473 | "source": [ 474 | "## Additional Exercises" 475 | ] 476 | }, 477 | { 478 | "cell_type": "markdown", 479 | "metadata": {}, 480 | "source": [ 481 | "1. Can you add an additional regular expression to the clean_text_round2 function to further clean the text?\n", 482 | "2. Play around with CountVectorizer's parameters. What is ngram_range? What is min_df and max_df?" 483 | ] 484 | }, 485 | { 486 | "cell_type": "code", 487 | "execution_count": null, 488 | "metadata": { 489 | "collapsed": true 490 | }, 491 | "outputs": [], 492 | "source": [] 493 | } 494 | ], 495 | "metadata": { 496 | "kernelspec": { 497 | "display_name": "Python 3", 498 | "language": "python", 499 | "name": "python3" 500 | }, 501 | "language_info": { 502 | "codemirror_mode": { 503 | "name": "ipython", 504 | "version": 3 505 | }, 506 | "file_extension": ".py", 507 | "mimetype": "text/x-python", 508 | "name": "python", 509 | "nbconvert_exporter": "python", 510 | "pygments_lexer": "ipython3", 511 | "version": "3.6.2" 512 | }, 513 | "toc": { 514 | "nav_menu": {}, 515 | "number_sections": true, 516 | "sideBar": true, 517 | "skip_h1_title": false, 518 | "toc_cell": false, 519 | "toc_position": {}, 520 | "toc_section_display": "block", 521 | "toc_window_display": false 522 | }, 523 | "varInspector": { 524 | "cols": { 525 | "lenName": 16, 526 | "lenType": 16, 527 | "lenVar": 40 528 | }, 529 | "kernels_config": { 530 | "python": { 531 | "delete_cmd_postfix": "", 532 | "delete_cmd_prefix": "del ", 533 | "library": "var_list.py", 534 | "varRefreshCmd": "print(var_dic_list())" 535 | }, 536 | "r": { 537 | "delete_cmd_postfix": ") ", 538 | "delete_cmd_prefix": "rm(", 539 | "library": "var_list.r", 540 | "varRefreshCmd": "cat(var_dic_list()) " 541 | } 542 | }, 543 | "types_to_exclude": [ 544 | "module", 545 | "function", 546 | "builtin_function_or_method", 547 | "instance", 548 | "_Feature" 549 | ], 550 | "window_display": false 551 | } 552 | }, 553 | "nbformat": 4, 554 | "nbformat_minor": 2 555 | } 556 | -------------------------------------------------------------------------------- /2-Exploratory-Data-Analysis.ipynb: 
-------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": { 6 | "collapsed": true 7 | }, 8 | "source": [ 9 | "# Exploratory Data Analysis" 10 | ] 11 | }, 12 | { 13 | "cell_type": "markdown", 14 | "metadata": {}, 15 | "source": [ 16 | "## Introduction" 17 | ] 18 | }, 19 | { 20 | "cell_type": "markdown", 21 | "metadata": {}, 22 | "source": [ 23 | "After the data cleaning step where we put our data into a few standard formats, the next step is to take a look at the data and see if what we're looking at makes sense. Before applying any fancy algorithms, it's always important to explore the data first.\n", 24 | "\n", 25 | "When working with numerical data, some of the exploratory data analysis (EDA) techniques we can use include finding the average of the data set, the distribution of the data, the most common values, etc. The idea is the same when working with text data. We are going to find some more obvious patterns with EDA before identifying the hidden patterns with machines learning (ML) techniques. We are going to look at the following for each comedian:\n", 26 | "\n", 27 | "1. **Most common words** - find these and create word clouds\n", 28 | "2. **Size of vocabulary** - look number of unique words and also how quickly someone speaks\n", 29 | "3. **Amount of profanity** - most common terms" 30 | ] 31 | }, 32 | { 33 | "cell_type": "markdown", 34 | "metadata": {}, 35 | "source": [ 36 | "## Most Common Words" 37 | ] 38 | }, 39 | { 40 | "cell_type": "markdown", 41 | "metadata": {}, 42 | "source": [ 43 | "### Analysis" 44 | ] 45 | }, 46 | { 47 | "cell_type": "code", 48 | "execution_count": null, 49 | "metadata": {}, 50 | "outputs": [], 51 | "source": [ 52 | "# Read in the document-term matrix\n", 53 | "import pandas as pd\n", 54 | "\n", 55 | "data = pd.read_pickle('dtm.pkl')\n", 56 | "data = data.transpose()\n", 57 | "data.head()" 58 | ] 59 | }, 60 | { 61 | "cell_type": "code", 62 | "execution_count": null, 63 | "metadata": {}, 64 | "outputs": [], 65 | "source": [ 66 | "# Find the top 30 words said by each comedian\n", 67 | "top_dict = {}\n", 68 | "for c in data.columns:\n", 69 | " top = data[c].sort_values(ascending=False).head(30)\n", 70 | " top_dict[c]= list(zip(top.index, top.values))\n", 71 | "\n", 72 | "top_dict" 73 | ] 74 | }, 75 | { 76 | "cell_type": "code", 77 | "execution_count": null, 78 | "metadata": {}, 79 | "outputs": [], 80 | "source": [ 81 | "# Print the top 15 words said by each comedian\n", 82 | "for comedian, top_words in top_dict.items():\n", 83 | " print(comedian)\n", 84 | " print(', '.join([word for word, count in top_words[0:14]]))\n", 85 | " print('---')" 86 | ] 87 | }, 88 | { 89 | "cell_type": "markdown", 90 | "metadata": {}, 91 | "source": [ 92 | "**NOTE:** At this point, we could go on and create word clouds. 
However, by looking at these top words, you can see that some of them have very little meaning and could be added to a stop words list, so let's do just that.\n", 93 | "\n" 94 | ] 95 | }, 96 | { 97 | "cell_type": "code", 98 | "execution_count": null, 99 | "metadata": {}, 100 | "outputs": [], 101 | "source": [ 102 | "# Look at the most common top words --> add them to the stop word list\n", 103 | "from collections import Counter\n", 104 | "\n", 105 | "# Let's first pull out the top 30 words for each comedian\n", 106 | "words = []\n", 107 | "for comedian in data.columns:\n", 108 | " top = [word for (word, count) in top_dict[comedian]]\n", 109 | " for t in top:\n", 110 | " words.append(t)\n", 111 | " \n", 112 | "words" 113 | ] 114 | }, 115 | { 116 | "cell_type": "code", 117 | "execution_count": null, 118 | "metadata": {}, 119 | "outputs": [], 120 | "source": [ 121 | "# Let's aggregate this list and identify the most common words along with how many routines they occur in\n", 122 | "Counter(words).most_common()" 123 | ] 124 | }, 125 | { 126 | "cell_type": "code", 127 | "execution_count": null, 128 | "metadata": {}, 129 | "outputs": [], 130 | "source": [ 131 | "# If more than half of the comedians have it as a top word, exclude it from the list\n", 132 | "add_stop_words = [word for word, count in Counter(words).most_common() if count > 6]\n", 133 | "add_stop_words" 134 | ] 135 | }, 136 | { 137 | "cell_type": "code", 138 | "execution_count": null, 139 | "metadata": { 140 | "collapsed": true 141 | }, 142 | "outputs": [], 143 | "source": [ 144 | "# Let's update our document-term matrix with the new list of stop words\n", 145 | "from sklearn.feature_extraction import text \n", 146 | "from sklearn.feature_extraction.text import CountVectorizer\n", 147 | "\n", 148 | "# Read in cleaned data\n", 149 | "data_clean = pd.read_pickle('data_clean.pkl')\n", 150 | "\n", 151 | "# Add new stop words\n", 152 | "stop_words = text.ENGLISH_STOP_WORDS.union(add_stop_words)\n", 153 | "\n", 154 | "# Recreate document-term matrix\n", 155 | "cv = CountVectorizer(stop_words=stop_words)\n", 156 | "data_cv = cv.fit_transform(data_clean.transcript)\n", 157 | "data_stop = pd.DataFrame(data_cv.toarray(), columns=cv.get_feature_names())\n", 158 | "data_stop.index = data_clean.index\n", 159 | "\n", 160 | "# Pickle it for later use\n", 161 | "import pickle\n", 162 | "pickle.dump(cv, open(\"cv_stop.pkl\", \"wb\"))\n", 163 | "data_stop.to_pickle(\"dtm_stop.pkl\")" 164 | ] 165 | }, 166 | { 167 | "cell_type": "code", 168 | "execution_count": null, 169 | "metadata": { 170 | "collapsed": true 171 | }, 172 | "outputs": [], 173 | "source": [ 174 | "# Let's make some word clouds!\n", 175 | "# Terminal / Anaconda Prompt: conda install -c conda-forge wordcloud\n", 176 | "from wordcloud import WordCloud\n", 177 | "\n", 178 | "wc = WordCloud(stopwords=stop_words, background_color=\"white\", colormap=\"Dark2\",\n", 179 | " max_font_size=150, random_state=42)" 180 | ] 181 | }, 182 | { 183 | "cell_type": "code", 184 | "execution_count": null, 185 | "metadata": { 186 | "collapsed": true 187 | }, 188 | "outputs": [], 189 | "source": [ 190 | "# Reset the output dimensions\n", 191 | "import matplotlib.pyplot as plt\n", 192 | "\n", 193 | "plt.rcParams['figure.figsize'] = [16, 6]\n", 194 | "\n", 195 | "full_names = ['Ali Wong', 'Anthony Jeselnik', 'Bill Burr', 'Bo Burnham', 'Dave Chappelle', 'Hasan Minhaj',\n", 196 | " 'Jim Jefferies', 'Joe Rogan', 'John Mulaney', 'Louis C.K.', 'Mike Birbiglia', 'Ricky Gervais']\n", 197 | "\n", 198 | "# Create 
subplots for each comedian\n", 199 | "for index, comedian in enumerate(data.columns):\n", 200 | " wc.generate(data_clean.transcript[comedian])\n", 201 | " \n", 202 | " plt.subplot(3, 4, index+1)\n", 203 | " plt.imshow(wc, interpolation=\"bilinear\")\n", 204 | " plt.axis(\"off\")\n", 205 | " plt.title(full_names[index])\n", 206 | " \n", 207 | "plt.show()" 208 | ] 209 | }, 210 | { 211 | "cell_type": "markdown", 212 | "metadata": {}, 213 | "source": [ 214 | "### Findings" 215 | ] 216 | }, 217 | { 218 | "cell_type": "markdown", 219 | "metadata": {}, 220 | "source": [ 221 | "* Ali Wong says the s-word a lot and talks about her husband. I guess that's funny to me.\n", 222 | "* A lot of people use the F-word. Let's dig into that later." 223 | ] 224 | }, 225 | { 226 | "cell_type": "markdown", 227 | "metadata": {}, 228 | "source": [ 229 | "## Number of Words" 230 | ] 231 | }, 232 | { 233 | "cell_type": "markdown", 234 | "metadata": {}, 235 | "source": [ 236 | "### Analysis" 237 | ] 238 | }, 239 | { 240 | "cell_type": "code", 241 | "execution_count": null, 242 | "metadata": {}, 243 | "outputs": [], 244 | "source": [ 245 | "# Find the number of unique words that each comedian uses\n", 246 | "\n", 247 | "# Identify the non-zero items in the document-term matrix, meaning that the word occurs at least once\n", 248 | "unique_list = []\n", 249 | "for comedian in data.columns:\n", 250 | " uniques = data[comedian].nonzero()[0].size\n", 251 | " unique_list.append(uniques)\n", 252 | "\n", 253 | "# Create a new dataframe that contains this unique word count\n", 254 | "data_words = pd.DataFrame(list(zip(full_names, unique_list)), columns=['comedian', 'unique_words'])\n", 255 | "data_unique_sort = data_words.sort_values(by='unique_words')\n", 256 | "data_unique_sort" 257 | ] 258 | }, 259 | { 260 | "cell_type": "code", 261 | "execution_count": null, 262 | "metadata": {}, 263 | "outputs": [], 264 | "source": [ 265 | "# Calculate the words per minute of each comedian\n", 266 | "\n", 267 | "# Find the total number of words that a comedian uses\n", 268 | "total_list = []\n", 269 | "for comedian in data.columns:\n", 270 | " totals = sum(data[comedian])\n", 271 | " total_list.append(totals)\n", 272 | " \n", 273 | "# Comedy special run times from IMDB, in minutes\n", 274 | "run_times = [60, 59, 80, 60, 67, 73, 77, 63, 62, 58, 76, 79]\n", 275 | "\n", 276 | "# Let's add some columns to our dataframe\n", 277 | "data_words['total_words'] = total_list\n", 278 | "data_words['run_times'] = run_times\n", 279 | "data_words['words_per_minute'] = data_words['total_words'] / data_words['run_times']\n", 280 | "\n", 281 | "# Sort the dataframe by words per minute to see who talks the slowest and fastest\n", 282 | "data_wpm_sort = data_words.sort_values(by='words_per_minute')\n", 283 | "data_wpm_sort" 284 | ] 285 | }, 286 | { 287 | "cell_type": "code", 288 | "execution_count": null, 289 | "metadata": { 290 | "collapsed": true 291 | }, 292 | "outputs": [], 293 | "source": [ 294 | "# Let's plot our findings\n", 295 | "import numpy as np\n", 296 | "\n", 297 | "y_pos = np.arange(len(data_words))\n", 298 | "\n", 299 | "plt.subplot(1, 2, 1)\n", 300 | "plt.barh(y_pos, data_unique_sort.unique_words, align='center')\n", 301 | "plt.yticks(y_pos, data_unique_sort.comedian)\n", 302 | "plt.title('Number of Unique Words', fontsize=20)\n", 303 | "\n", 304 | "plt.subplot(1, 2, 2)\n", 305 | "plt.barh(y_pos, data_wpm_sort.words_per_minute, align='center')\n", 306 | "plt.yticks(y_pos, data_wpm_sort.comedian)\n", 307 | "plt.title('Number of Words Per 
Minute', fontsize=20)\n", 308 | "\n", 309 | "plt.tight_layout()\n", 310 | "plt.show()" 311 | ] 312 | }, 313 | { 314 | "cell_type": "markdown", 315 | "metadata": {}, 316 | "source": [ 317 | "### Findings" 318 | ] 319 | }, 320 | { 321 | "cell_type": "markdown", 322 | "metadata": {}, 323 | "source": [ 324 | "* **Vocabulary**\n", 325 | " * Ricky Gervais (British comedy) and Bill Burr (podcast host) use a lot of words in their comedy\n", 326 | " * Louis C.K. (self-depricating comedy) and Anthony Jeselnik (dark humor) have a smaller vocabulary\n", 327 | "\n", 328 | "\n", 329 | "* **Talking Speed**\n", 330 | " * Joe Rogan (blue comedy) and Bill Burr (podcast host) talk fast\n", 331 | " * Bo Burnham (musical comedy) and Anthony Jeselnik (dark humor) talk slow\n", 332 | " \n", 333 | "Ali Wong is somewhere in the middle in both cases. Nothing too interesting here." 334 | ] 335 | }, 336 | { 337 | "cell_type": "markdown", 338 | "metadata": {}, 339 | "source": [ 340 | "## Amount of Profanity" 341 | ] 342 | }, 343 | { 344 | "cell_type": "markdown", 345 | "metadata": {}, 346 | "source": [ 347 | "### Analysis" 348 | ] 349 | }, 350 | { 351 | "cell_type": "code", 352 | "execution_count": null, 353 | "metadata": {}, 354 | "outputs": [], 355 | "source": [ 356 | "# Earlier I said we'd revisit profanity. Let's take a look at the most common words again.\n", 357 | "Counter(words).most_common()" 358 | ] 359 | }, 360 | { 361 | "cell_type": "code", 362 | "execution_count": null, 363 | "metadata": {}, 364 | "outputs": [], 365 | "source": [ 366 | "# Let's isolate just these bad words\n", 367 | "data_bad_words = data.transpose()[['fucking', 'fuck', 'shit']]\n", 368 | "data_profanity = pd.concat([data_bad_words.fucking + data_bad_words.fuck, data_bad_words.shit], axis=1)\n", 369 | "data_profanity.columns = ['f_word', 's_word']\n", 370 | "data_profanity" 371 | ] 372 | }, 373 | { 374 | "cell_type": "code", 375 | "execution_count": null, 376 | "metadata": { 377 | "collapsed": true 378 | }, 379 | "outputs": [], 380 | "source": [ 381 | "# Let's create a scatter plot of our findings\n", 382 | "plt.rcParams['figure.figsize'] = [10, 8]\n", 383 | "\n", 384 | "for i, comedian in enumerate(data_profanity.index):\n", 385 | " x = data_profanity.f_word.loc[comedian]\n", 386 | " y = data_profanity.s_word.loc[comedian]\n", 387 | " plt.scatter(x, y, color='blue')\n", 388 | " plt.text(x+1.5, y+0.5, full_names[i], fontsize=10)\n", 389 | " plt.xlim(-5, 155) \n", 390 | " \n", 391 | "plt.title('Number of Bad Words Used in Routine', fontsize=20)\n", 392 | "plt.xlabel('Number of F Bombs', fontsize=15)\n", 393 | "plt.ylabel('Number of S Words', fontsize=15)\n", 394 | "\n", 395 | "plt.show()" 396 | ] 397 | }, 398 | { 399 | "cell_type": "markdown", 400 | "metadata": {}, 401 | "source": [ 402 | "### Findings" 403 | ] 404 | }, 405 | { 406 | "cell_type": "markdown", 407 | "metadata": {}, 408 | "source": [ 409 | "* **Averaging 2 F-Bombs Per Minute!** - I don't like too much swearing, especially the f-word, which is probably why I've never heard of Bill Bur, Joe Rogan and Jim Jefferies.\n", 410 | "* **Clean Humor** - It looks like profanity might be a good predictor of the type of comedy I like. Besides Ali Wong, my two other favorite comedians in this group are John Mulaney and Mike Birbiglia." 
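The f-word / s-word comparison above is hard-coded to those two columns. As a sketch for the additional exercise at the end of this notebook, a small hypothetical helper (not part of the original analysis) can generalize the same scatter-plot idea to any pair of words, assuming the `data` matrix and `full_names` list from earlier are still in memory.

```python
# Hypothetical helper (not in the original notebook) that generalizes the scatter
# plot above to any two words. Assumes `data` is the word-by-comedian matrix and
# `full_names` is the list of display names defined earlier.
import matplotlib.pyplot as plt

def compare_words(word_x, word_y, dtm, labels):
    '''Scatter-plot the counts of two words for every document in a term-document matrix.'''
    counts = dtm.transpose()[[word_x, word_y]]      # rows become documents
    plt.rcParams['figure.figsize'] = [10, 8]
    for i, doc in enumerate(counts.index):
        x, y = counts[word_x].loc[doc], counts[word_y].loc[doc]
        plt.scatter(x, y, color='blue')
        plt.text(x + 0.5, y + 0.5, labels[i], fontsize=10)
    plt.title('Word Usage in Routine', fontsize=20)
    plt.xlabel('Number of "' + word_x + '"s', fontsize=15)
    plt.ylabel('Number of "' + word_y + '"s', fontsize=15)
    plt.show()

# e.g. compare_words('mom', 'wife', data, full_names)  # any two words present in the matrix
```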
411 | ] 412 | }, 413 | { 414 | "cell_type": "markdown", 415 | "metadata": { 416 | "collapsed": true 417 | }, 418 | "source": [ 419 | "## Side Note" 420 | ] 421 | }, 422 | { 423 | "cell_type": "markdown", 424 | "metadata": {}, 425 | "source": [ 426 | "What was our goal for the EDA portion of our journey? **To be able to take an initial look at our data and see if the results of some basic analysis made sense.**\n", 427 | "\n", 428 | "My conclusion - yes, it does, for a first pass. There are definitely some things that could be better cleaned up, such as adding more stop words or including bi-grams. But we can save that for another day. The results, especially the profanity findings, are interesting and make general sense, so we're going to move on.\n", 429 | "\n", 430 | "As a reminder, the data science process is an interative one. It's better to see some non-perfect but acceptable results to help you quickly decide whether your project is a dud or not, instead of having analysis paralysis and never delivering anything.\n", 431 | "\n", 432 | "**Alice's data science (and life) motto: Let go of perfectionism!**" 433 | ] 434 | }, 435 | { 436 | "cell_type": "markdown", 437 | "metadata": { 438 | "collapsed": true 439 | }, 440 | "source": [ 441 | "## Additional Exercises" 442 | ] 443 | }, 444 | { 445 | "cell_type": "markdown", 446 | "metadata": {}, 447 | "source": [ 448 | "1. What other word counts do you think would be interesting to compare instead of the f-word and s-word? Create a scatter plot comparing them." 449 | ] 450 | }, 451 | { 452 | "cell_type": "code", 453 | "execution_count": null, 454 | "metadata": { 455 | "collapsed": true 456 | }, 457 | "outputs": [], 458 | "source": [] 459 | } 460 | ], 461 | "metadata": { 462 | "kernelspec": { 463 | "display_name": "Python 3", 464 | "language": "python", 465 | "name": "python3" 466 | }, 467 | "language_info": { 468 | "codemirror_mode": { 469 | "name": "ipython", 470 | "version": 3 471 | }, 472 | "file_extension": ".py", 473 | "mimetype": "text/x-python", 474 | "name": "python", 475 | "nbconvert_exporter": "python", 476 | "pygments_lexer": "ipython3", 477 | "version": "3.6.2" 478 | }, 479 | "toc": { 480 | "nav_menu": {}, 481 | "number_sections": true, 482 | "sideBar": true, 483 | "skip_h1_title": false, 484 | "toc_cell": false, 485 | "toc_position": {}, 486 | "toc_section_display": "block", 487 | "toc_window_display": false 488 | }, 489 | "varInspector": { 490 | "cols": { 491 | "lenName": 16, 492 | "lenType": 16, 493 | "lenVar": 40 494 | }, 495 | "kernels_config": { 496 | "python": { 497 | "delete_cmd_postfix": "", 498 | "delete_cmd_prefix": "del ", 499 | "library": "var_list.py", 500 | "varRefreshCmd": "print(var_dic_list())" 501 | }, 502 | "r": { 503 | "delete_cmd_postfix": ") ", 504 | "delete_cmd_prefix": "rm(", 505 | "library": "var_list.r", 506 | "varRefreshCmd": "cat(var_dic_list()) " 507 | } 508 | }, 509 | "types_to_exclude": [ 510 | "module", 511 | "function", 512 | "builtin_function_or_method", 513 | "instance", 514 | "_Feature" 515 | ], 516 | "window_display": false 517 | } 518 | }, 519 | "nbformat": 4, 520 | "nbformat_minor": 2 521 | } 522 | -------------------------------------------------------------------------------- /3-Sentiment-Analysis.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Sentiment Analysis" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 
| "## Introduction" 15 | ] 16 | }, 17 | { 18 | "cell_type": "markdown", 19 | "metadata": {}, 20 | "source": [ 21 | "So far, all of the analysis we've done has been pretty generic - looking at counts, creating scatter plots, etc. These techniques could be applied to numeric data as well.\n", 22 | "\n", 23 | "When it comes to text data, there are a few popular techniques that we'll be going through in the next few notebooks, starting with sentiment analysis. A few key points to remember with sentiment analysis.\n", 24 | "\n", 25 | "1. **TextBlob Module:** Linguistic researchers have labeled the sentiment of words based on their domain expertise. Sentiment of words can vary based on where it is in a sentence. The TextBlob module allows us to take advantage of these labels.\n", 26 | "2. **Sentiment Labels:** Each word in a corpus is labeled in terms of polarity and subjectivity (there are more labels as well, but we're going to ignore them for now). A corpus' sentiment is the average of these.\n", 27 | " * **Polarity**: How positive or negative a word is. -1 is very negative. +1 is very positive.\n", 28 | " * **Subjectivity**: How subjective, or opinionated a word is. 0 is fact. +1 is very much an opinion.\n", 29 | "\n", 30 | "For more info on how TextBlob coded up its [sentiment function](https://planspace.org/20150607-textblob_sentiment/).\n", 31 | "\n", 32 | "Let's take a look at the sentiment of the various transcripts, both overall and throughout the comedy routine." 33 | ] 34 | }, 35 | { 36 | "cell_type": "markdown", 37 | "metadata": {}, 38 | "source": [ 39 | "## Sentiment of Routine" 40 | ] 41 | }, 42 | { 43 | "cell_type": "code", 44 | "execution_count": null, 45 | "metadata": {}, 46 | "outputs": [], 47 | "source": [ 48 | "# We'll start by reading in the corpus, which preserves word order\n", 49 | "import pandas as pd\n", 50 | "\n", 51 | "data = pd.read_pickle('corpus.pkl')\n", 52 | "data" 53 | ] 54 | }, 55 | { 56 | "cell_type": "code", 57 | "execution_count": null, 58 | "metadata": {}, 59 | "outputs": [], 60 | "source": [ 61 | "# Create quick lambda functions to find the polarity and subjectivity of each routine\n", 62 | "# Terminal / Anaconda Navigator: conda install -c conda-forge textblob\n", 63 | "from textblob import TextBlob\n", 64 | "\n", 65 | "pol = lambda x: TextBlob(x).sentiment.polarity\n", 66 | "sub = lambda x: TextBlob(x).sentiment.subjectivity\n", 67 | "\n", 68 | "data['polarity'] = data['transcript'].apply(pol)\n", 69 | "data['subjectivity'] = data['transcript'].apply(sub)\n", 70 | "data" 71 | ] 72 | }, 73 | { 74 | "cell_type": "code", 75 | "execution_count": null, 76 | "metadata": {}, 77 | "outputs": [], 78 | "source": [ 79 | "# Let's plot the results\n", 80 | "import matplotlib.pyplot as plt\n", 81 | "\n", 82 | "plt.rcParams['figure.figsize'] = [10, 8]\n", 83 | "\n", 84 | "for index, comedian in enumerate(data.index):\n", 85 | " x = data.polarity.loc[comedian]\n", 86 | " y = data.subjectivity.loc[comedian]\n", 87 | " plt.scatter(x, y, color='blue')\n", 88 | " plt.text(x+.001, y+.001, data['full_name'][index], fontsize=10)\n", 89 | " plt.xlim(-.01, .12) \n", 90 | " \n", 91 | "plt.title('Sentiment Analysis', fontsize=20)\n", 92 | "plt.xlabel('<-- Negative -------- Positive -->', fontsize=15)\n", 93 | "plt.ylabel('<-- Facts -------- Opinions -->', fontsize=15)\n", 94 | "\n", 95 | "plt.show()" 96 | ] 97 | }, 98 | { 99 | "cell_type": "markdown", 100 | "metadata": {}, 101 | "source": [ 102 | "## Sentiment of Routine Over Time" 103 | ] 104 | }, 105 | { 106 | "cell_type": 
"markdown", 107 | "metadata": {}, 108 | "source": [ 109 | "Instead of looking at the overall sentiment, let's see if there's anything interesting about the sentiment over time throughout each routine." 110 | ] 111 | }, 112 | { 113 | "cell_type": "code", 114 | "execution_count": null, 115 | "metadata": { 116 | "collapsed": true 117 | }, 118 | "outputs": [], 119 | "source": [ 120 | "# Split each routine into 10 parts\n", 121 | "import numpy as np\n", 122 | "import math\n", 123 | "\n", 124 | "def split_text(text, n=10):\n", 125 | " '''Takes in a string of text and splits into n equal parts, with a default of 10 equal parts.'''\n", 126 | "\n", 127 | " # Calculate length of text, the size of each chunk of text and the starting points of each chunk of text\n", 128 | " length = len(text)\n", 129 | " size = math.floor(length / n)\n", 130 | " start = np.arange(0, length, size)\n", 131 | " \n", 132 | " # Pull out equally sized pieces of text and put it into a list\n", 133 | " split_list = []\n", 134 | " for piece in range(n):\n", 135 | " split_list.append(text[start[piece]:start[piece]+size])\n", 136 | " return split_list" 137 | ] 138 | }, 139 | { 140 | "cell_type": "code", 141 | "execution_count": null, 142 | "metadata": {}, 143 | "outputs": [], 144 | "source": [ 145 | "# Let's take a look at our data again\n", 146 | "data" 147 | ] 148 | }, 149 | { 150 | "cell_type": "code", 151 | "execution_count": null, 152 | "metadata": {}, 153 | "outputs": [], 154 | "source": [ 155 | "# Let's create a list to hold all of the pieces of text\n", 156 | "list_pieces = []\n", 157 | "for t in data.transcript:\n", 158 | " split = split_text(t)\n", 159 | " list_pieces.append(split)\n", 160 | " \n", 161 | "list_pieces" 162 | ] 163 | }, 164 | { 165 | "cell_type": "code", 166 | "execution_count": null, 167 | "metadata": {}, 168 | "outputs": [], 169 | "source": [ 170 | "# The list has 10 elements, one for each transcript\n", 171 | "len(list_pieces)" 172 | ] 173 | }, 174 | { 175 | "cell_type": "code", 176 | "execution_count": null, 177 | "metadata": {}, 178 | "outputs": [], 179 | "source": [ 180 | "# Each transcript has been split into 10 pieces of text\n", 181 | "len(list_pieces[0])" 182 | ] 183 | }, 184 | { 185 | "cell_type": "code", 186 | "execution_count": null, 187 | "metadata": {}, 188 | "outputs": [], 189 | "source": [ 190 | "# Calculate the polarity for each piece of text\n", 191 | "\n", 192 | "polarity_transcript = []\n", 193 | "for lp in list_pieces:\n", 194 | " polarity_piece = []\n", 195 | " for p in lp:\n", 196 | " polarity_piece.append(TextBlob(p).sentiment.polarity)\n", 197 | " polarity_transcript.append(polarity_piece)\n", 198 | " \n", 199 | "polarity_transcript" 200 | ] 201 | }, 202 | { 203 | "cell_type": "code", 204 | "execution_count": null, 205 | "metadata": { 206 | "collapsed": true 207 | }, 208 | "outputs": [], 209 | "source": [ 210 | "# Show the plot for one comedian\n", 211 | "plt.plot(polarity_transcript[0])\n", 212 | "plt.title(data['full_name'].index[0])\n", 213 | "plt.show()" 214 | ] 215 | }, 216 | { 217 | "cell_type": "code", 218 | "execution_count": null, 219 | "metadata": {}, 220 | "outputs": [], 221 | "source": [ 222 | "# Show the plot for all comedians\n", 223 | "plt.rcParams['figure.figsize'] = [16, 12]\n", 224 | "\n", 225 | "for index, comedian in enumerate(data.index): \n", 226 | " plt.subplot(3, 4, index+1)\n", 227 | " plt.plot(polarity_transcript[index])\n", 228 | " plt.plot(np.arange(0,10), np.zeros(10))\n", 229 | " plt.title(data['full_name'][index])\n", 230 | " plt.ylim(ymin=-.2, 
ymax=.3)\n", 231 | " \n", 232 | "plt.show()" 233 | ] 234 | }, 235 | { 236 | "cell_type": "markdown", 237 | "metadata": { 238 | "collapsed": true 239 | }, 240 | "source": [ 241 | "Ali Wong stays generally positive throughout her routine. Similar comedians are Louis C.K. and Mike Birbiglia.\n", 242 | "\n", 243 | "On the other hand, you have some pretty different patterns here like Bo Burnham who gets happier as time passes and Dave Chappelle who has some pretty down moments in his routine." 244 | ] 245 | }, 246 | { 247 | "cell_type": "markdown", 248 | "metadata": { 249 | "collapsed": true 250 | }, 251 | "source": [ 252 | "## Additional Exercises" 253 | ] 254 | }, 255 | { 256 | "cell_type": "markdown", 257 | "metadata": {}, 258 | "source": [ 259 | "1. Modify the number of sections the comedy routine is split into and see how the charts over time change." 260 | ] 261 | }, 262 | { 263 | "cell_type": "code", 264 | "execution_count": null, 265 | "metadata": { 266 | "collapsed": true 267 | }, 268 | "outputs": [], 269 | "source": [] 270 | } 271 | ], 272 | "metadata": { 273 | "kernelspec": { 274 | "display_name": "Python 3", 275 | "language": "python", 276 | "name": "python3" 277 | }, 278 | "language_info": { 279 | "codemirror_mode": { 280 | "name": "ipython", 281 | "version": 3 282 | }, 283 | "file_extension": ".py", 284 | "mimetype": "text/x-python", 285 | "name": "python", 286 | "nbconvert_exporter": "python", 287 | "pygments_lexer": "ipython3", 288 | "version": "3.6.2" 289 | }, 290 | "toc": { 291 | "nav_menu": {}, 292 | "number_sections": true, 293 | "sideBar": true, 294 | "skip_h1_title": false, 295 | "toc_cell": false, 296 | "toc_position": {}, 297 | "toc_section_display": "block", 298 | "toc_window_display": false 299 | }, 300 | "varInspector": { 301 | "cols": { 302 | "lenName": 16, 303 | "lenType": 16, 304 | "lenVar": 40 305 | }, 306 | "kernels_config": { 307 | "python": { 308 | "delete_cmd_postfix": "", 309 | "delete_cmd_prefix": "del ", 310 | "library": "var_list.py", 311 | "varRefreshCmd": "print(var_dic_list())" 312 | }, 313 | "r": { 314 | "delete_cmd_postfix": ") ", 315 | "delete_cmd_prefix": "rm(", 316 | "library": "var_list.r", 317 | "varRefreshCmd": "cat(var_dic_list()) " 318 | } 319 | }, 320 | "types_to_exclude": [ 321 | "module", 322 | "function", 323 | "builtin_function_or_method", 324 | "instance", 325 | "_Feature" 326 | ], 327 | "window_display": false 328 | } 329 | }, 330 | "nbformat": 4, 331 | "nbformat_minor": 2 332 | } 333 | -------------------------------------------------------------------------------- /4-Topic-Modeling.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Topic Modeling" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "## Introduction" 15 | ] 16 | }, 17 | { 18 | "cell_type": "markdown", 19 | "metadata": {}, 20 | "source": [ 21 | "Another popular text analysis technique is called topic modeling. The ultimate goal of topic modeling is to find various topics that are present in your corpus. Each document in the corpus will be made up of at least one topic, if not multiple topics.\n", 22 | "\n", 23 | "In this notebook, we will be covering the steps on how to do **Latent Dirichlet Allocation (LDA)**, which is one of many topic modeling techniques. 
It was specifically designed for text data.\n", 24 | "\n", 25 | "To use a topic modeling technique, you need to provide (1) a document-term matrix and (2) the number of topics you would like the algorithm to pick up.\n", 26 | "\n", 27 | "Once the topic modeling technique is applied, your job as a human is to interpret the results and see if the mix of words in each topic make sense. If they don't make sense, you can try changing up the number of topics, the terms in the document-term matrix, model parameters, or even try a different model." 28 | ] 29 | }, 30 | { 31 | "cell_type": "markdown", 32 | "metadata": {}, 33 | "source": [ 34 | "## Topic Modeling - Attempt #1 (All Text)" 35 | ] 36 | }, 37 | { 38 | "cell_type": "code", 39 | "execution_count": null, 40 | "metadata": {}, 41 | "outputs": [], 42 | "source": [ 43 | "# Let's read in our document-term matrix\n", 44 | "import pandas as pd\n", 45 | "import pickle\n", 46 | "\n", 47 | "data = pd.read_pickle('dtm_stop.pkl')\n", 48 | "data" 49 | ] 50 | }, 51 | { 52 | "cell_type": "code", 53 | "execution_count": null, 54 | "metadata": { 55 | "collapsed": true 56 | }, 57 | "outputs": [], 58 | "source": [ 59 | "# Import the necessary modules for LDA with gensim\n", 60 | "# Terminal / Anaconda Navigator: conda install -c conda-forge gensim\n", 61 | "from gensim import matutils, models\n", 62 | "import scipy.sparse\n", 63 | "\n", 64 | "# import logging\n", 65 | "# logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)" 66 | ] 67 | }, 68 | { 69 | "cell_type": "code", 70 | "execution_count": null, 71 | "metadata": {}, 72 | "outputs": [], 73 | "source": [ 74 | "# One of the required inputs is a term-document matrix\n", 75 | "tdm = data.transpose()\n", 76 | "tdm.head()" 77 | ] 78 | }, 79 | { 80 | "cell_type": "code", 81 | "execution_count": null, 82 | "metadata": { 83 | "collapsed": true 84 | }, 85 | "outputs": [], 86 | "source": [ 87 | "# We're going to put the term-document matrix into a new gensim format, from df --> sparse matrix --> gensim corpus\n", 88 | "sparse_counts = scipy.sparse.csr_matrix(tdm)\n", 89 | "corpus = matutils.Sparse2Corpus(sparse_counts)" 90 | ] 91 | }, 92 | { 93 | "cell_type": "code", 94 | "execution_count": null, 95 | "metadata": { 96 | "collapsed": true 97 | }, 98 | "outputs": [], 99 | "source": [ 100 | "# Gensim also requires dictionary of the all terms and their respective location in the term-document matrix\n", 101 | "cv = pickle.load(open(\"cv_stop.pkl\", \"rb\"))\n", 102 | "id2word = dict((v, k) for k, v in cv.vocabulary_.items())" 103 | ] 104 | }, 105 | { 106 | "cell_type": "markdown", 107 | "metadata": {}, 108 | "source": [ 109 | "Now that we have the corpus (term-document matrix) and id2word (dictionary of location: term), we need to specify two other parameters - the number of topics and the number of passes. Let's start the number of topics at 2, see if the results make sense, and increase the number from there." 
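Below, the choice of the number of topics is made by eyeballing the printed topics. As an optional, rough alternative (assuming the installed gensim version provides CoherenceModel and Dictionary.from_corpus), a coherence score can be attached to each choice of num_topics; this is only a sketch and not part of the original walkthrough.

```python
# Optional sketch (not in the original notebook): score a few topic counts with
# u_mass coherence instead of only eyeballing them. Assumes the installed gensim
# provides CoherenceModel and Dictionary.from_corpus.
from gensim.corpora import Dictionary
from gensim.models import CoherenceModel

dictionary = Dictionary.from_corpus(corpus, id2word=id2word)  # rebuild a gensim Dictionary

for k in [2, 3, 4]:
    lda_k = models.LdaModel(corpus=corpus, id2word=id2word, num_topics=k, passes=10)
    cm = CoherenceModel(model=lda_k, corpus=corpus, dictionary=dictionary, coherence='u_mass')
    print(k, cm.get_coherence())  # closer to zero generally means more coherent topics
```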
110 | ] 111 | }, 112 | { 113 | "cell_type": "code", 114 | "execution_count": null, 115 | "metadata": {}, 116 | "outputs": [], 117 | "source": [ 118 | "# Now that we have the corpus (term-document matrix) and id2word (dictionary of location: term),\n", 119 | "# we need to specify two other parameters as well - the number of topics and the number of passes\n", 120 | "lda = models.LdaModel(corpus=corpus, id2word=id2word, num_topics=2, passes=10)\n", 121 | "lda.print_topics()" 122 | ] 123 | }, 124 | { 125 | "cell_type": "code", 126 | "execution_count": null, 127 | "metadata": {}, 128 | "outputs": [], 129 | "source": [ 130 | "# LDA for num_topics = 3\n", 131 | "lda = models.LdaModel(corpus=corpus, id2word=id2word, num_topics=3, passes=10)\n", 132 | "lda.print_topics()" 133 | ] 134 | }, 135 | { 136 | "cell_type": "code", 137 | "execution_count": null, 138 | "metadata": {}, 139 | "outputs": [], 140 | "source": [ 141 | "# LDA for num_topics = 4\n", 142 | "lda = models.LdaModel(corpus=corpus, id2word=id2word, num_topics=4, passes=10)\n", 143 | "lda.print_topics()" 144 | ] 145 | }, 146 | { 147 | "cell_type": "markdown", 148 | "metadata": {}, 149 | "source": [ 150 | "These topics aren't looking too great. We've tried modifying our parameters. Let's try modifying our terms list as well." 151 | ] 152 | }, 153 | { 154 | "cell_type": "markdown", 155 | "metadata": {}, 156 | "source": [ 157 | "## Topic Modeling - Attempt #2 (Nouns Only)" 158 | ] 159 | }, 160 | { 161 | "cell_type": "markdown", 162 | "metadata": {}, 163 | "source": [ 164 | "One popular trick is to look only at terms that are from one part of speech (only nouns, only adjectives, etc.). Check out the UPenn tag set: https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html." 165 | ] 166 | }, 167 | { 168 | "cell_type": "code", 169 | "execution_count": null, 170 | "metadata": { 171 | "collapsed": true 172 | }, 173 | "outputs": [], 174 | "source": [ 175 | "# Let's create a function to pull out nouns from a string of text\n", 176 | "from nltk import word_tokenize, pos_tag\n", 177 | "\n", 178 | "def nouns(text):\n", 179 | " '''Given a string of text, tokenize the text and pull out only the nouns.'''\n", 180 | " is_noun = lambda pos: pos[:2] == 'NN'\n", 181 | " tokenized = word_tokenize(text)\n", 182 | " all_nouns = [word for (word, pos) in pos_tag(tokenized) if is_noun(pos)] \n", 183 | " return ' '.join(all_nouns)" 184 | ] 185 | }, 186 | { 187 | "cell_type": "code", 188 | "execution_count": null, 189 | "metadata": {}, 190 | "outputs": [], 191 | "source": [ 192 | "# Read in the cleaned data, before the CountVectorizer step\n", 193 | "data_clean = pd.read_pickle('data_clean.pkl')\n", 194 | "data_clean" 195 | ] 196 | }, 197 | { 198 | "cell_type": "code", 199 | "execution_count": null, 200 | "metadata": {}, 201 | "outputs": [], 202 | "source": [ 203 | "# Apply the nouns function to the transcripts to filter only on nouns\n", 204 | "data_nouns = pd.DataFrame(data_clean.transcript.apply(nouns))\n", 205 | "data_nouns" 206 | ] 207 | }, 208 | { 209 | "cell_type": "code", 210 | "execution_count": null, 211 | "metadata": {}, 212 | "outputs": [], 213 | "source": [ 214 | "# Create a new document-term matrix using only nouns\n", 215 | "from sklearn.feature_extraction import text\n", 216 | "from sklearn.feature_extraction.text import CountVectorizer\n", 217 | "\n", 218 | "# Re-add the additional stop words since we are recreating the document-term matrix\n", 219 | "add_stop_words = ['like', 'im', 'know', 'just', 'dont', 'thats', 'right', 
'people',\n", 220 | " 'youre', 'got', 'gonna', 'time', 'think', 'yeah', 'said']\n", 221 | "stop_words = text.ENGLISH_STOP_WORDS.union(add_stop_words)\n", 222 | "\n", 223 | "# Recreate a document-term matrix with only nouns\n", 224 | "cvn = CountVectorizer(stop_words=stop_words)\n", 225 | "data_cvn = cvn.fit_transform(data_nouns.transcript)\n", 226 | "data_dtmn = pd.DataFrame(data_cvn.toarray(), columns=cvn.get_feature_names())\n", 227 | "data_dtmn.index = data_nouns.index\n", 228 | "data_dtmn" 229 | ] 230 | }, 231 | { 232 | "cell_type": "code", 233 | "execution_count": null, 234 | "metadata": { 235 | "collapsed": true 236 | }, 237 | "outputs": [], 238 | "source": [ 239 | "# Create the gensim corpus\n", 240 | "corpusn = matutils.Sparse2Corpus(scipy.sparse.csr_matrix(data_dtmn.transpose()))\n", 241 | "\n", 242 | "# Create the vocabulary dictionary\n", 243 | "id2wordn = dict((v, k) for k, v in cvn.vocabulary_.items())" 244 | ] 245 | }, 246 | { 247 | "cell_type": "code", 248 | "execution_count": null, 249 | "metadata": {}, 250 | "outputs": [], 251 | "source": [ 252 | "# Let's start with 2 topics\n", 253 | "ldan = models.LdaModel(corpus=corpusn, num_topics=2, id2word=id2wordn, passes=10)\n", 254 | "ldan.print_topics()" 255 | ] 256 | }, 257 | { 258 | "cell_type": "code", 259 | "execution_count": null, 260 | "metadata": {}, 261 | "outputs": [], 262 | "source": [ 263 | "# Let's try topics = 3\n", 264 | "ldan = models.LdaModel(corpus=corpusn, num_topics=3, id2word=id2wordn, passes=10)\n", 265 | "ldan.print_topics()" 266 | ] 267 | }, 268 | { 269 | "cell_type": "code", 270 | "execution_count": null, 271 | "metadata": {}, 272 | "outputs": [], 273 | "source": [ 274 | "# Let's try 4 topics\n", 275 | "ldan = models.LdaModel(corpus=corpusn, num_topics=4, id2word=id2wordn, passes=10)\n", 276 | "ldan.print_topics()" 277 | ] 278 | }, 279 | { 280 | "cell_type": "markdown", 281 | "metadata": {}, 282 | "source": [ 283 | "## Topic Modeling - Attempt #3 (Nouns and Adjectives)" 284 | ] 285 | }, 286 | { 287 | "cell_type": "code", 288 | "execution_count": null, 289 | "metadata": { 290 | "collapsed": true 291 | }, 292 | "outputs": [], 293 | "source": [ 294 | "# Let's create a function to pull out nouns from a string of text\n", 295 | "def nouns_adj(text):\n", 296 | " '''Given a string of text, tokenize the text and pull out only the nouns and adjectives.'''\n", 297 | " is_noun_adj = lambda pos: pos[:2] == 'NN' or pos[:2] == 'JJ'\n", 298 | " tokenized = word_tokenize(text)\n", 299 | " nouns_adj = [word for (word, pos) in pos_tag(tokenized) if is_noun_adj(pos)] \n", 300 | " return ' '.join(nouns_adj)" 301 | ] 302 | }, 303 | { 304 | "cell_type": "code", 305 | "execution_count": null, 306 | "metadata": {}, 307 | "outputs": [], 308 | "source": [ 309 | "# Apply the nouns function to the transcripts to filter only on nouns\n", 310 | "data_nouns_adj = pd.DataFrame(data_clean.transcript.apply(nouns_adj))\n", 311 | "data_nouns_adj" 312 | ] 313 | }, 314 | { 315 | "cell_type": "code", 316 | "execution_count": null, 317 | "metadata": {}, 318 | "outputs": [], 319 | "source": [ 320 | "# Create a new document-term matrix using only nouns and adjectives, also remove common words with max_df\n", 321 | "cvna = CountVectorizer(stop_words=stop_words, max_df=.8)\n", 322 | "data_cvna = cvna.fit_transform(data_nouns_adj.transcript)\n", 323 | "data_dtmna = pd.DataFrame(data_cvna.toarray(), columns=cvna.get_feature_names())\n", 324 | "data_dtmna.index = data_nouns_adj.index\n", 325 | "data_dtmna" 326 | ] 327 | }, 328 | { 329 | "cell_type": 
"code", 330 | "execution_count": null, 331 | "metadata": { 332 | "collapsed": true 333 | }, 334 | "outputs": [], 335 | "source": [ 336 | "# Create the gensim corpus\n", 337 | "corpusna = matutils.Sparse2Corpus(scipy.sparse.csr_matrix(data_dtmna.transpose()))\n", 338 | "\n", 339 | "# Create the vocabulary dictionary\n", 340 | "id2wordna = dict((v, k) for k, v in cvna.vocabulary_.items())" 341 | ] 342 | }, 343 | { 344 | "cell_type": "code", 345 | "execution_count": null, 346 | "metadata": {}, 347 | "outputs": [], 348 | "source": [ 349 | "# Let's start with 2 topics\n", 350 | "ldana = models.LdaModel(corpus=corpusna, num_topics=2, id2word=id2wordna, passes=10)\n", 351 | "ldana.print_topics()" 352 | ] 353 | }, 354 | { 355 | "cell_type": "code", 356 | "execution_count": null, 357 | "metadata": {}, 358 | "outputs": [], 359 | "source": [ 360 | "# Let's try 3 topics\n", 361 | "ldana = models.LdaModel(corpus=corpusna, num_topics=3, id2word=id2wordna, passes=10)\n", 362 | "ldana.print_topics()" 363 | ] 364 | }, 365 | { 366 | "cell_type": "code", 367 | "execution_count": null, 368 | "metadata": {}, 369 | "outputs": [], 370 | "source": [ 371 | "# Let's try 4 topics\n", 372 | "ldana = models.LdaModel(corpus=corpusna, num_topics=4, id2word=id2wordna, passes=10)\n", 373 | "ldana.print_topics()" 374 | ] 375 | }, 376 | { 377 | "cell_type": "markdown", 378 | "metadata": {}, 379 | "source": [ 380 | "## Identify Topics in Each Document" 381 | ] 382 | }, 383 | { 384 | "cell_type": "markdown", 385 | "metadata": {}, 386 | "source": [ 387 | "Out of the 9 topic models we looked at, the nouns and adjectives, 4 topic one made the most sense. So let's pull that down here and run it through some more iterations to get more fine-tuned topics." 388 | ] 389 | }, 390 | { 391 | "cell_type": "code", 392 | "execution_count": 47, 393 | "metadata": {}, 394 | "outputs": [ 395 | { 396 | "data": { 397 | "text/plain": [ 398 | "[(0,\n", 399 | " '0.009*\"joke\" + 0.005*\"mom\" + 0.005*\"parents\" + 0.004*\"hasan\" + 0.004*\"jokes\" + 0.004*\"anthony\" + 0.003*\"nuts\" + 0.003*\"dead\" + 0.003*\"tit\" + 0.003*\"twitter\"'),\n", 400 | " (1,\n", 401 | " '0.005*\"mom\" + 0.005*\"jenny\" + 0.005*\"clinton\" + 0.004*\"friend\" + 0.004*\"parents\" + 0.003*\"husband\" + 0.003*\"cow\" + 0.003*\"ok\" + 0.003*\"wife\" + 0.003*\"john\"'),\n", 402 | " (2,\n", 403 | " '0.005*\"bo\" + 0.005*\"gun\" + 0.005*\"guns\" + 0.005*\"repeat\" + 0.004*\"um\" + 0.004*\"ass\" + 0.004*\"eye\" + 0.004*\"contact\" + 0.003*\"son\" + 0.003*\"class\"'),\n", 404 | " (3,\n", 405 | " '0.006*\"ahah\" + 0.004*\"nigga\" + 0.004*\"gay\" + 0.003*\"dick\" + 0.003*\"door\" + 0.003*\"young\" + 0.003*\"motherfucker\" + 0.003*\"stupid\" + 0.003*\"bitch\" + 0.003*\"mad\"')]" 406 | ] 407 | }, 408 | "execution_count": 47, 409 | "metadata": {}, 410 | "output_type": "execute_result" 411 | } 412 | ], 413 | "source": [ 414 | "# Our final LDA model (for now)\n", 415 | "ldana = models.LdaModel(corpus=corpusna, num_topics=4, id2word=id2wordna, passes=80)\n", 416 | "ldana.print_topics()" 417 | ] 418 | }, 419 | { 420 | "cell_type": "markdown", 421 | "metadata": {}, 422 | "source": [ 423 | "These four topics look pretty decent. 
Let's settle on these for now.\n", 424 | "* Topic 0: mom, parents\n", 425 | "* Topic 1: husband, wife\n", 426 | "* Topic 2: guns\n", 427 | "* Topic 3: profanity" 428 | ] 429 | }, 430 | { 431 | "cell_type": "code", 432 | "execution_count": 48, 433 | "metadata": {}, 434 | "outputs": [ 435 | { 436 | "data": { 437 | "text/plain": [ 438 | "[(1, 'ali'),\n", 439 | " (0, 'anthony'),\n", 440 | " (2, 'bill'),\n", 441 | " (2, 'bo'),\n", 442 | " (3, 'dave'),\n", 443 | " (0, 'hasan'),\n", 444 | " (2, 'jim'),\n", 445 | " (3, 'joe'),\n", 446 | " (1, 'john'),\n", 447 | " (0, 'louis'),\n", 448 | " (1, 'mike'),\n", 449 | " (0, 'ricky')]" 450 | ] 451 | }, 452 | "execution_count": 48, 453 | "metadata": {}, 454 | "output_type": "execute_result" 455 | } 456 | ], 457 | "source": [ 458 | "# Let's take a look at which topics each transcript contains\n", 459 | "corpus_transformed = ldana[corpusna]\n", 460 | "list(zip([a for [(a,b)] in corpus_transformed], data_dtmna.index))" 461 | ] 462 | }, 463 | { 464 | "cell_type": "markdown", 465 | "metadata": { 466 | "collapsed": true 467 | }, 468 | "source": [ 469 | "For a first pass of LDA, these kind of make sense to me, so we'll call it a day for now.\n", 470 | "* Topic 0: mom, parents [Anthony, Hasan, Louis, Ricky]\n", 471 | "* Topic 1: husband, wife [Ali, John, Mike]\n", 472 | "* Topic 2: guns [Bill, Bo, Jim]\n", 473 | "* Topic 3: profanity [Dave, Joe]" 474 | ] 475 | }, 476 | { 477 | "cell_type": "markdown", 478 | "metadata": { 479 | "collapsed": true 480 | }, 481 | "source": [ 482 | "## Additional Exercises" 483 | ] 484 | }, 485 | { 486 | "cell_type": "markdown", 487 | "metadata": {}, 488 | "source": [ 489 | "1. Try further modifying the parameters of the topic models above and see if you can get better topics.\n", 490 | "2. Create a new topic model that includes terms from a different [part of speech](https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html) and see if you can get better topics." 
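A side note on the topic-assignment cell above: the `[(a, b)]` unpacking works here because each transcript happens to be dominated by a single topic, but gensim can return several (topic, probability) pairs per document. A slightly more defensive sketch, using the `ldana`, `corpusna`, and `data_dtmna` objects built in the cells above (`dominant_topic` is a hypothetical helper name, not part of the original notebook):

```python
# Pick the highest-probability topic for each transcript, even when gensim
# returns more than one (topic, probability) pair for a document.
def dominant_topic(lda_model, bow_corpus, index):
    assignments = []
    for doc_bow, name in zip(bow_corpus, index):
        topic_probs = lda_model.get_document_topics(doc_bow)  # list of (topic_id, prob)
        best_topic, best_prob = max(topic_probs, key=lambda tp: tp[1])
        assignments.append((name, best_topic, round(best_prob, 3)))
    return assignments

dominant_topic(ldana, corpusna, data_dtmna.index)
```

This should give the same comedian-to-topic mapping as the zip above, while also showing how strongly each transcript leans toward its dominant topic.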
491 | ] 492 | }, 493 | { 494 | "cell_type": "code", 495 | "execution_count": null, 496 | "metadata": { 497 | "collapsed": true 498 | }, 499 | "outputs": [], 500 | "source": [] 501 | } 502 | ], 503 | "metadata": { 504 | "kernelspec": { 505 | "display_name": "Python 3", 506 | "language": "python", 507 | "name": "python3" 508 | }, 509 | "language_info": { 510 | "codemirror_mode": { 511 | "name": "ipython", 512 | "version": 3 513 | }, 514 | "file_extension": ".py", 515 | "mimetype": "text/x-python", 516 | "name": "python", 517 | "nbconvert_exporter": "python", 518 | "pygments_lexer": "ipython3", 519 | "version": "3.6.2" 520 | }, 521 | "toc": { 522 | "nav_menu": {}, 523 | "number_sections": true, 524 | "sideBar": true, 525 | "skip_h1_title": false, 526 | "toc_cell": false, 527 | "toc_position": {}, 528 | "toc_section_display": "block", 529 | "toc_window_display": false 530 | }, 531 | "varInspector": { 532 | "cols": { 533 | "lenName": 16, 534 | "lenType": 16, 535 | "lenVar": 40 536 | }, 537 | "kernels_config": { 538 | "python": { 539 | "delete_cmd_postfix": "", 540 | "delete_cmd_prefix": "del ", 541 | "library": "var_list.py", 542 | "varRefreshCmd": "print(var_dic_list())" 543 | }, 544 | "r": { 545 | "delete_cmd_postfix": ") ", 546 | "delete_cmd_prefix": "rm(", 547 | "library": "var_list.r", 548 | "varRefreshCmd": "cat(var_dic_list()) " 549 | } 550 | }, 551 | "types_to_exclude": [ 552 | "module", 553 | "function", 554 | "builtin_function_or_method", 555 | "instance", 556 | "_Feature" 557 | ], 558 | "window_display": false 559 | } 560 | }, 561 | "nbformat": 4, 562 | "nbformat_minor": 2 563 | } 564 | -------------------------------------------------------------------------------- /5-Text-Generation.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Text Generation" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "## Introduction" 15 | ] 16 | }, 17 | { 18 | "cell_type": "markdown", 19 | "metadata": {}, 20 | "source": [ 21 | "Markov chains can be used for very basic text generation. Think about every word in a corpus as a state. We can make a simple assumption that the next word is only dependent on the previous word - which is the basic assumption of a Markov chain.\n", 22 | "\n", 23 | "Markov chains don't generate text as well as deep learning, but it's a good (and fun!) start." 24 | ] 25 | }, 26 | { 27 | "cell_type": "markdown", 28 | "metadata": {}, 29 | "source": [ 30 | "## Select Text to Imitate" 31 | ] 32 | }, 33 | { 34 | "cell_type": "markdown", 35 | "metadata": {}, 36 | "source": [ 37 | "In this notebook, we're specifically going to generate text in the style of Ali Wong, so as a first step, let's extract the text from her comedy routine." 
38 | ] 39 | }, 40 | { 41 | "cell_type": "code", 42 | "execution_count": null, 43 | "metadata": {}, 44 | "outputs": [], 45 | "source": [ 46 | "# Read in the corpus, including punctuation!\n", 47 | "import pandas as pd\n", 48 | "\n", 49 | "data = pd.read_pickle('corpus.pkl')\n", 50 | "data" 51 | ] 52 | }, 53 | { 54 | "cell_type": "code", 55 | "execution_count": null, 56 | "metadata": {}, 57 | "outputs": [], 58 | "source": [ 59 | "# Extract only Ali Wong's text\n", 60 | "ali_text = data.transcript.loc['ali']\n", 61 | "ali_text[:200]" 62 | ] 63 | }, 64 | { 65 | "cell_type": "markdown", 66 | "metadata": {}, 67 | "source": [ 68 | "## Build a Markov Chain Function" 69 | ] 70 | }, 71 | { 72 | "cell_type": "markdown", 73 | "metadata": {}, 74 | "source": [ 75 | "We are going to build a simple Markov chain function that creates a dictionary:\n", 76 | "* The keys should be all of the words in the corpus\n", 77 | "* The values should be a list of the words that follow the keys" 78 | ] 79 | }, 80 | { 81 | "cell_type": "code", 82 | "execution_count": null, 83 | "metadata": { 84 | "collapsed": true 85 | }, 86 | "outputs": [], 87 | "source": [ 88 | "from collections import defaultdict\n", 89 | "\n", 90 | "def markov_chain(text):\n", 91 | " '''The input is a string of text and the output will be a dictionary with each word as\n", 92 | " a key and each value as the list of words that come after the key in the text.'''\n", 93 | " \n", 94 | " # Tokenize the text by word, though including punctuation\n", 95 | " words = text.split(' ')\n", 96 | " \n", 97 | " # Initialize a default dictionary to hold all of the words and next words\n", 98 | " m_dict = defaultdict(list)\n", 99 | " \n", 100 | " # Create a zipped list of all of the word pairs and put them in word: list of next words format\n", 101 | " for current_word, next_word in zip(words[0:-1], words[1:]):\n", 102 | " m_dict[current_word].append(next_word)\n", 103 | "\n", 104 | " # Convert the default dict back into a dictionary\n", 105 | " m_dict = dict(m_dict)\n", 106 | " return m_dict" 107 | ] 108 | }, 109 | { 110 | "cell_type": "code", 111 | "execution_count": null, 112 | "metadata": {}, 113 | "outputs": [], 114 | "source": [ 115 | "# Create the dictionary for Ali's routine, take a look at it\n", 116 | "ali_dict = markov_chain(ali_text)\n", 117 | "ali_dict" 118 | ] 119 | }, 120 | { 121 | "cell_type": "markdown", 122 | "metadata": {}, 123 | "source": [ 124 | "## Create a Text Generator" 125 | ] 126 | }, 127 | { 128 | "cell_type": "markdown", 129 | "metadata": {}, 130 | "source": [ 131 | "We're going to create a function that generates sentences. 
It will take two things as inputs:\n", 132 | "* The dictionary you just created\n", 133 | "* The number of words you want generated\n", 134 | "\n", 135 | "Here are some examples of generated sentences:\n", 136 | "\n", 137 | ">'Shape right turn– I also takes so that she’s got women all know that snail-trail.'\n", 138 | "\n", 139 | ">'Optimum level of early retirement, and be sure all the following Tuesday… because it’s too.'" 140 | ] 141 | }, 142 | { 143 | "cell_type": "code", 144 | "execution_count": null, 145 | "metadata": { 146 | "collapsed": true 147 | }, 148 | "outputs": [], 149 | "source": [ 150 | "import random\n", 151 | "\n", 152 | "def generate_sentence(chain, count=15):\n", 153 | " '''Input a dictionary in the format of key = current word, value = list of next words\n", 154 | " along with the number of words you would like to see in your generated sentence.'''\n", 155 | "\n", 156 | " # Capitalize the first word\n", 157 | " word1 = random.choice(list(chain.keys()))\n", 158 | " sentence = word1.capitalize()\n", 159 | "\n", 160 | " # Generate the second word from the value list. Set the new word as the first word. Repeat.\n", 161 | " for i in range(count-1):\n", 162 | " word2 = random.choice(chain[word1])\n", 163 | " word1 = word2\n", 164 | " sentence += ' ' + word2\n", 165 | "\n", 166 | " # End it with a period\n", 167 | " sentence += '.'\n", 168 | " return(sentence)" 169 | ] 170 | }, 171 | { 172 | "cell_type": "code", 173 | "execution_count": null, 174 | "metadata": {}, 175 | "outputs": [], 176 | "source": [ 177 | "generate_sentence(ali_dict)" 178 | ] 179 | }, 180 | { 181 | "cell_type": "markdown", 182 | "metadata": { 183 | "collapsed": true 184 | }, 185 | "source": [ 186 | "## Additional Exercises" 187 | ] 188 | }, 189 | { 190 | "cell_type": "markdown", 191 | "metadata": {}, 192 | "source": [ 193 | "1. Try making the generate_sentence function better. Maybe allow it to end with a random punctuation mark or end whenever it gets to a word that already ends with a punctuation mark." 
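One possible starting point for this exercise, sketched under the assumption that the dictionary keeps punctuation attached to words (which the `split(' ')` tokenization in `markov_chain` does). `generate_sentence_v2` is a hypothetical name, not part of the original notebook:

```python
import random

def generate_sentence_v2(chain, max_words=25):
    '''Like generate_sentence, but stop early when a generated word already
    ends with sentence-ending punctuation; otherwise close with a random mark.'''
    word = random.choice(list(chain.keys()))
    sentence = [word.capitalize()]

    for _ in range(max_words - 1):
        next_words = chain.get(word)
        if not next_words:                  # dead end: word only ever appears at the end of the text
            break
        word = random.choice(next_words)
        sentence.append(word)
        if word.endswith(('.', '!', '?')):  # the chain produced its own ending
            return ' '.join(sentence)

    # Never hit punctuation: finish the sentence with a random mark
    return ' '.join(sentence) + random.choice(['.', '!', '?'])
```

Calling `generate_sentence_v2(ali_dict)` a few times should produce sentences of varying length that stop at more natural points than the fixed-length version.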
194 | ] 195 | }, 196 | { 197 | "cell_type": "code", 198 | "execution_count": null, 199 | "metadata": { 200 | "collapsed": true 201 | }, 202 | "outputs": [], 203 | "source": [] 204 | } 205 | ], 206 | "metadata": { 207 | "kernelspec": { 208 | "display_name": "Python 3", 209 | "language": "python", 210 | "name": "python3" 211 | }, 212 | "language_info": { 213 | "codemirror_mode": { 214 | "name": "ipython", 215 | "version": 3 216 | }, 217 | "file_extension": ".py", 218 | "mimetype": "text/x-python", 219 | "name": "python", 220 | "nbconvert_exporter": "python", 221 | "pygments_lexer": "ipython3", 222 | "version": "3.6.2" 223 | }, 224 | "toc": { 225 | "nav_menu": {}, 226 | "number_sections": true, 227 | "sideBar": true, 228 | "skip_h1_title": false, 229 | "toc_cell": false, 230 | "toc_position": {}, 231 | "toc_section_display": "block", 232 | "toc_window_display": false 233 | }, 234 | "varInspector": { 235 | "cols": { 236 | "lenName": 16, 237 | "lenType": 16, 238 | "lenVar": 40 239 | }, 240 | "kernels_config": { 241 | "python": { 242 | "delete_cmd_postfix": "", 243 | "delete_cmd_prefix": "del ", 244 | "library": "var_list.py", 245 | "varRefreshCmd": "print(var_dic_list())" 246 | }, 247 | "r": { 248 | "delete_cmd_postfix": ") ", 249 | "delete_cmd_prefix": "rm(", 250 | "library": "var_list.r", 251 | "varRefreshCmd": "cat(var_dic_list()) " 252 | } 253 | }, 254 | "types_to_exclude": [ 255 | "module", 256 | "function", 257 | "builtin_function_or_method", 258 | "instance", 259 | "_Feature" 260 | ], 261 | "window_display": false 262 | } 263 | }, 264 | "nbformat": 4, 265 | "nbformat_minor": 2 266 | } 267 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | ## Welcome to the Introduction to Text Analytics Training! 2 | 3 | We will be going through several Jupyter Notebooks during this time and use a number of data science libraries along the way. The easiest way to get started is to download Anaconda, which is free and open source. When you download this, it comes with the Jupyter Notebook IDE and many popular data science libraries, so you don’t have to install them one by one. 4 | 5 | Here are the steps you’ll need to take before the start of the tutorial: 6 | 7 | ### 1. Download Anaconda 8 | I highly recommend that you download [the Python 3.7 version](https://www.anaconda.com/download/). 9 | 10 | ### 2. Download the Jupyter Notebooks 11 | Clone or download this [Github repository](https://github.com/adashofdata/intro-to-text-analytics), so you have access to all the Jupyter Notebooks (.ipynb extension) in the tutorial. **Note the green button on the right side of the screen that says `Clone or download`.** If you know how to use Github, go ahead and clone the repo. If you don't know how to use Github, you can also just download the zip file and unzip it on your laptop. 12 | 13 | ### 3. Launch Anaconda and Open a Jupyter Notebook 14 | 15 | *Windows:* 16 | Open the Anaconda Navigator program. You should see the Jupyter Notebook logo. Below the logo, click Launch. A browser window should open up. In the browser window, navigate to the location of the saved Jupyter Notebook files and open 0-Hello-World.ipynb. Follow the instructions in the notebook. 17 | 18 | *Mac/Linux:* 19 | Open a terminal. Type ```jupyter notebook```. A browser should open up. In the browser window, navigate to the location of the saved Jupyter Notebook files and open 0-Hello-World.ipynb. 
Follow the instructions in the notebook. 20 | 21 | ### 4. Install a Few Additional Packages 22 | 23 | There are a few additional packages we'll be using during the tutorial that are not included when you download Anaconda - wordcloud, textblob and gensim. 24 | 25 | *Windows:* 26 | Open the Anaconda Prompt program. You should see a black window pop up. Type `conda install -c conda-forge wordcloud` to download wordcloud. You will be asked whether you want to proceed or not. Type `y` for yes. Once that is done, type `conda install -c conda-forge textblob` to download textblob and `y` to proceed, and type `conda install -c conda-forge gensim` to download gensim and `y` to proceed. 27 | 28 | *Mac/Linux:* 29 | Your terminal should already be open. Type command-t to open a new tab. Type `conda install -c conda-forge wordcloud` to download wordcloud. You will be asked whether you want to proceed or not. Type `y` for yes. Once that is done, type `conda install -c conda-forge textblob` to download textblob and `y` to proceed, and type `conda install -c conda-forge gensim` to download gensim and `y` to proceed. 30 | 31 | If you have any issues, please email me at adashofdata@gmail.com or come talk to me before the start of the tutorial on Thursday morning. 32 | -------------------------------------------------------------------------------- /additional_resources.md: -------------------------------------------------------------------------------- 1 | ### Additional Resources 2 | 3 | 4 | * **Jupyter Notebook shortcuts** 5 | * Run a cell of code: Shift+Enter 6 | * Insert a cell above: Esc-A 7 | * Insert a cell below: Esc-B 8 | * See documentation: Shift+Tab 9 | 10 | 11 | * **Web scraping** 12 | * Inspect Element in Chrome: Command+Shift+C 13 | 14 | 15 | * **Regular expressions** 16 | * Website to test regular expressions: https://regex101.com 17 | -------------------------------------------------------------------------------- /pickle/corpus.pkl: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/adashofdata/intro-to-text-analytics/1e2571c2914dcb6390a6da5ba9c15f0abe2c46ee/pickle/corpus.pkl -------------------------------------------------------------------------------- /pickle/cv.pkl: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/adashofdata/intro-to-text-analytics/1e2571c2914dcb6390a6da5ba9c15f0abe2c46ee/pickle/cv.pkl -------------------------------------------------------------------------------- /pickle/cv_stop.pkl: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/adashofdata/intro-to-text-analytics/1e2571c2914dcb6390a6da5ba9c15f0abe2c46ee/pickle/cv_stop.pkl -------------------------------------------------------------------------------- /pickle/data_clean.pkl: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/adashofdata/intro-to-text-analytics/1e2571c2914dcb6390a6da5ba9c15f0abe2c46ee/pickle/data_clean.pkl -------------------------------------------------------------------------------- /pickle/dtm.pkl: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/adashofdata/intro-to-text-analytics/1e2571c2914dcb6390a6da5ba9c15f0abe2c46ee/pickle/dtm.pkl -------------------------------------------------------------------------------- /pickle/dtm_stop.pkl: 
-------------------------------------------------------------------------------- https://raw.githubusercontent.com/adashofdata/intro-to-text-analytics/1e2571c2914dcb6390a6da5ba9c15f0abe2c46ee/pickle/dtm_stop.pkl -------------------------------------------------------------------------------- /transcripts/ali.txt: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/adashofdata/intro-to-text-analytics/1e2571c2914dcb6390a6da5ba9c15f0abe2c46ee/transcripts/ali.txt -------------------------------------------------------------------------------- /transcripts/anthony.txt: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/adashofdata/intro-to-text-analytics/1e2571c2914dcb6390a6da5ba9c15f0abe2c46ee/transcripts/anthony.txt -------------------------------------------------------------------------------- /transcripts/bill.txt: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/adashofdata/intro-to-text-analytics/1e2571c2914dcb6390a6da5ba9c15f0abe2c46ee/transcripts/bill.txt -------------------------------------------------------------------------------- /transcripts/bo.txt: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/adashofdata/intro-to-text-analytics/1e2571c2914dcb6390a6da5ba9c15f0abe2c46ee/transcripts/bo.txt -------------------------------------------------------------------------------- /transcripts/dave.txt: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/adashofdata/intro-to-text-analytics/1e2571c2914dcb6390a6da5ba9c15f0abe2c46ee/transcripts/dave.txt -------------------------------------------------------------------------------- /transcripts/hasan.txt: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/adashofdata/intro-to-text-analytics/1e2571c2914dcb6390a6da5ba9c15f0abe2c46ee/transcripts/hasan.txt -------------------------------------------------------------------------------- /transcripts/jim.txt: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/adashofdata/intro-to-text-analytics/1e2571c2914dcb6390a6da5ba9c15f0abe2c46ee/transcripts/jim.txt -------------------------------------------------------------------------------- /transcripts/joe.txt: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/adashofdata/intro-to-text-analytics/1e2571c2914dcb6390a6da5ba9c15f0abe2c46ee/transcripts/joe.txt -------------------------------------------------------------------------------- /transcripts/john.txt: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/adashofdata/intro-to-text-analytics/1e2571c2914dcb6390a6da5ba9c15f0abe2c46ee/transcripts/john.txt -------------------------------------------------------------------------------- /transcripts/louis.txt: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/adashofdata/intro-to-text-analytics/1e2571c2914dcb6390a6da5ba9c15f0abe2c46ee/transcripts/louis.txt -------------------------------------------------------------------------------- /transcripts/mike.txt: 
-------------------------------------------------------------------------------- https://raw.githubusercontent.com/adashofdata/intro-to-text-analytics/1e2571c2914dcb6390a6da5ba9c15f0abe2c46ee/transcripts/mike.txt -------------------------------------------------------------------------------- /transcripts/ricky.txt: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/adashofdata/intro-to-text-analytics/1e2571c2914dcb6390a6da5ba9c15f0abe2c46ee/transcripts/ricky.txt --------------------------------------------------------------------------------