├── README.md ├── NLP_C1_W1_lecture_nb_01.ipynb └── NLP_C1_W1_lecture_nb_03.ipynb /README.md: -------------------------------------------------------------------------------- 1 | # Natural-Language-Processing-Specialization 2 | Offered by Deeplearning.ai via Coursera 3 | -------------------------------------------------------------------------------- /NLP_C1_W1_lecture_nb_01.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Preprocessing\n", 8 | "\n", 9 | "In this lab, we will be exploring how to preprocess tweets for sentiment analysis. We will provide a function for preprocessing tweets during this week's assignment, but it is still good to know what is going on under the hood. By the end of this lecture, you will see how to use the [NLTK](http://www.nltk.org) package to perform a preprocessing pipeline for Twitter datasets." 10 | ] 11 | }, 12 | { 13 | "cell_type": "markdown", 14 | "metadata": {}, 15 | "source": [ 16 | "## Setup\n", 17 | "\n", 18 | "You will be doing sentiment analysis on tweets in the first two weeks of this course. To help with that, we will be using the [Natural Language Toolkit (NLTK)](http://www.nltk.org/howto/twitter.html) package, an open-source Python library for natural language processing. It has modules for collecting, handling, and processing Twitter data, and you will be acquainted with them as we move along the course.\n", 19 | "\n", 20 | "For this exercise, we will use a Twitter dataset that comes with NLTK. This dataset has been manually annotated and serves to establish baselines for models quickly. Let us import them now as well as a few other libraries we will be using." 
21 | ] 22 | }, 23 | { 24 | "cell_type": "code", 25 | "execution_count": 1, 26 | "metadata": {}, 27 | "outputs": [], 28 | "source": [ 29 | "import nltk # Python library for NLP\n", 30 | "from nltk.corpus import twitter_samples # sample Twitter dataset from NLTK\n", 31 | "import matplotlib.pyplot as plt # library for visualization\n", 32 | "import random # pseudo-random number generator" 33 | ] 34 | }, 35 | { 36 | "cell_type": "markdown", 37 | "metadata": {}, 38 | "source": [ 39 | "## About the Twitter dataset\n", 40 | "\n", 41 | "The sample dataset from NLTK is separated into positive and negative tweets. It contains 5000 positive tweets and 5000 negative tweets exactly. The exact match between these classes is not a coincidence. The intention is to have a balanced dataset. That does not reflect the real distributions of positive and negative classes in live Twitter streams. It is just because balanced datasets simplify the design of most computational methods that are required for sentiment analysis. However, it is better to be aware that this balance of classes is artificial. \n", 42 | "\n", 43 | "The dataset is already downloaded in the Coursera workspace. In a local computer however, you can download the data by doing:" 44 | ] 45 | }, 46 | { 47 | "cell_type": "code", 48 | "execution_count": null, 49 | "metadata": {}, 50 | "outputs": [], 51 | "source": [ 52 | "# downloads sample twitter dataset. 
uncomment the line below if running on a local machine.\n", 53 | "# nltk.download('twitter_samples')" 54 | ] 55 | }, 56 | { 57 | "cell_type": "markdown", 58 | "metadata": {}, 59 | "source": [ 60 | "We can load the text fields of the positive and negative tweets by using the module's `strings()` method like this:" 61 | ] 62 | }, 63 | { 64 | "cell_type": "code", 65 | "execution_count": 2, 66 | "metadata": {}, 67 | "outputs": [], 68 | "source": [ 69 | "# select the set of positive and negative tweets\n", 70 | "all_positive_tweets = twitter_samples.strings('positive_tweets.json')\n", 71 | "all_negative_tweets = twitter_samples.strings('negative_tweets.json')" 72 | ] 73 | }, 74 | { 75 | "cell_type": "markdown", 76 | "metadata": {}, 77 | "source": [ 78 | "Next, we'll print a report with the number of positive and negative tweets. It is also useful to know how the data is structured." 79 | ] 80 | }, 81 | { 82 | "cell_type": "code", 83 | "execution_count": 5, 84 | "metadata": {}, 85 | "outputs": [ 86 | { 87 | "name": "stdout", 88 | "output_type": "stream", 89 | "text": [ 90 | "Number of positive tweets: 5000\n", 91 | "Number of negative tweets: 5000\n", 92 | "\n", 93 | "The type of all_positive_tweets is: <class 'list'>\n", 94 | "The type of a tweet entry is: <class 'str'>\n" 95 | ] 96 | } 97 | ], 98 | "source": [ 99 | "print('Number of positive tweets: ', len(all_positive_tweets))\n", 100 | "print('Number of negative tweets: ', len(all_negative_tweets))\n", 101 | "\n", 102 | "print('\\nThe type of all_positive_tweets is: ', type(all_positive_tweets))\n", 103 | "print('The type of a tweet entry is: ', type(all_negative_tweets[0]))" 104 | ] 105 | }, 106 | { 107 | "cell_type": "markdown", 108 | "metadata": {}, 109 | "source": [ 110 | "We can see that the data is stored in a list and, as you might expect, individual tweets are stored as strings.\n", 111 | "\n", 112 | "You can make a more visually appealing report by using Matplotlib's 
[pyplot](https://matplotlib.org/tutorials/introductory/pyplot.html) module. Let us see how to create a [pie chart](https://matplotlib.org/3.2.1/gallery/pie_and_polar_charts/pie_features.html#sphx-glr-gallery-pie-and-polar-charts-pie-features-py) to show the same information as above. This simple snippet will serve you in future visualizations of this kind of data." 113 | ] 114 | }, 115 | { 116 | "cell_type": "code", 117 | "execution_count": 9, 118 | "metadata": {}, 119 | "outputs": [ 120 | { 121 | "data": { 122 | "image/png": "\n", 123 | "text/plain": [ 124 | "
" 125 | ] 126 | }, 127 | "metadata": {}, 128 | "output_type": "display_data" 129 | } 130 | ], 131 | "source": [ 132 | "# Declare a figure with a custom size\n", 133 | "fig = plt.figure(figsize=(5, 5))\n", 134 | "\n", 135 | "# labels for the two classes\n", 136 | "labels = 'Positives', 'Negatives'\n", 137 | "\n", 138 | "# Sizes for each slice\n", 139 | "sizes = [len(all_positive_tweets), len(all_negative_tweets)] \n", 140 | "\n", 141 | "# Declare pie chart, where the slices will be ordered and plotted counter-clockwise:\n", 142 | "plt.pie(sizes, labels=labels, autopct='%1.1f%%',\n", 143 | " shadow=True, startangle=90)\n", 144 | "\n", 145 | "# Equal aspect ratio ensures that pie is drawn as a circle.\n", 146 | "plt.axis('equal') \n", 147 | "\n", 148 | "# Display the chart\n", 149 | "plt.show()" 150 | ] 151 | }, 152 | { 153 | "cell_type": "markdown", 154 | "metadata": {}, 155 | "source": [ 156 | "## Looking at raw texts\n", 157 | "\n", 158 | "Before anything else, we can print a couple of tweets from the dataset to see how they look. Understanding the data is often said to account for as much as 80% of the success or failure of a data science project. We can use this time to observe aspects we'd like to consider when preprocessing our data.\n", 159 | "\n", 160 | "Below, you will print one random positive and one random negative tweet. We have added a color mark at the beginning of each string to further distinguish the two." 
161 | ] 162 | }, 163 | { 164 | "cell_type": "code", 165 | "execution_count": 10, 166 | "metadata": {}, 167 | "outputs": [ 168 | { 169 | "name": "stdout", 170 | "output_type": "stream", 171 | "text": [ 172 | "\u001b[92m2 plans for the day down the drain great :):\n", 173 | "\u001b[91mi am so :((((((\n" 174 | ] 175 | } 176 | ], 177 | "source": [ 178 | "# print positive in green (randint includes both endpoints, so the upper bound is len - 1)\n", 179 | "print('\\033[92m' + all_positive_tweets[random.randint(0, len(all_positive_tweets) - 1)])\n", 180 | "\n", 181 | "# print negative in red\n", 182 | "print('\\033[91m' + all_negative_tweets[random.randint(0, len(all_negative_tweets) - 1)])" 183 | ] 184 | }, 185 | { 186 | "cell_type": "markdown", 187 | "metadata": {}, 188 | "source": [ 189 | "One observation you may have is the presence of [emoticons](https://en.wikipedia.org/wiki/Emoticon) and URLs in many of the tweets. This info will come in handy in the next steps." 190 | ] 191 | }, 192 | { 193 | "cell_type": "markdown", 194 | "metadata": {}, 195 | "source": [ 196 | "## Preprocess raw text for Sentiment analysis" 197 | ] 198 | }, 199 | { 200 | "cell_type": "markdown", 201 | "metadata": {}, 202 | "source": [ 203 | "Data preprocessing is one of the critical steps in any machine learning project. It includes cleaning and formatting the data before feeding it into a machine learning algorithm. For NLP, preprocessing comprises the following tasks:\n", 204 | "\n", 205 | "* Tokenizing the string\n", 206 | "* Lowercasing\n", 207 | "* Removing stop words and punctuation\n", 208 | "* Stemming\n", 209 | "\n", 210 | "The videos explained each of these steps and why they are important. Let's see how we can apply them to a given tweet. We will choose just one and see how it is transformed by each preprocessing step." 
211 | ] 212 | }, 213 | { 214 | "cell_type": "code", 215 | "execution_count": 11, 216 | "metadata": {}, 217 | "outputs": [ 218 | { 219 | "name": "stdout", 220 | "output_type": "stream", 221 | "text": [ 222 | "My beautiful sunflowers on a sunny Friday morning off :) #sunflowers #favourites #happy #Friday off… https://t.co/3tfYom0N1i\n" 223 | ] 224 | } 225 | ], 226 | "source": [ 227 | "# Our selected sample. Complex enough to exemplify each step\n", 228 | "tweet = all_positive_tweets[2277]\n", 229 | "print(tweet)" 230 | ] 231 | }, 232 | { 233 | "cell_type": "markdown", 234 | "metadata": {}, 235 | "source": [ 236 | "Let's import a few more libraries for this purpose." 237 | ] 238 | }, 239 | { 240 | "cell_type": "code", 241 | "execution_count": 12, 242 | "metadata": {}, 243 | "outputs": [ 244 | { 245 | "name": "stderr", 246 | "output_type": "stream", 247 | "text": [ 248 | "[nltk_data] Downloading package stopwords to /home/jovyan/nltk_data...\n", 249 | "[nltk_data] Unzipping corpora/stopwords.zip.\n" 250 | ] 251 | }, 252 | { 253 | "data": { 254 | "text/plain": [ 255 | "True" 256 | ] 257 | }, 258 | "execution_count": 12, 259 | "metadata": {}, 260 | "output_type": "execute_result" 261 | } 262 | ], 263 | "source": [ 264 | "# download the stopwords from NLTK\n", 265 | "nltk.download('stopwords')" 266 | ] 267 | }, 268 | { 269 | "cell_type": "code", 270 | "execution_count": 13, 271 | "metadata": {}, 272 | "outputs": [], 273 | "source": [ 274 | "import re # library for regular expression operations\n", 275 | "import string # for string operations\n", 276 | "\n", 277 | "from nltk.corpus import stopwords # module for stop words that come with NLTK\n", 278 | "from nltk.stem import PorterStemmer # module for stemming\n", 279 | "from nltk.tokenize import TweetTokenizer # module for tokenizing strings" 280 | ] 281 | }, 282 | { 283 | "cell_type": "markdown", 284 | "metadata": {}, 285 | "source": [ 286 | "### Remove hyperlinks, Twitter marks and styles\n", 287 | "\n", 288 | "Since we 
have a Twitter dataset, we'd like to remove some substrings commonly used on the platform, such as hashtag signs, retweet marks, and hyperlinks. We'll use the [re](https://docs.python.org/3/library/re.html) library to perform regular expression operations on our tweet. We'll define our search pattern and use the `sub()` method to remove matches by substituting them with an empty string (i.e. `''`)." 289 | ] 290 | }, 291 | { 292 | "cell_type": "code", 293 | "execution_count": 14, 294 | "metadata": {}, 295 | "outputs": [ 296 | { 297 | "name": "stdout", 298 | "output_type": "stream", 299 | "text": [ 300 | "\u001b[92mMy beautiful sunflowers on a sunny Friday morning off :) #sunflowers #favourites #happy #Friday off… https://t.co/3tfYom0N1i\n", 301 | "\u001b[94m\n", 302 | "My beautiful sunflowers on a sunny Friday morning off :) sunflowers favourites happy Friday off… \n" 303 | ] 304 | } 305 | ], 306 | "source": [ 307 | "print('\\033[92m' + tweet)\n", 308 | "print('\\033[94m')\n", 309 | "\n", 310 | "# remove old style retweet text \"RT\"\n", 311 | "tweet2 = re.sub(r'^RT[\\s]+', '', tweet)\n", 312 | "\n", 313 | "# remove hyperlinks (this pattern also drops any text that follows the link on the same line)\n", 314 | "tweet2 = re.sub(r'https?:\\/\\/.*[\\r\\n]*', '', tweet2)\n", 315 | "\n", 316 | "# remove hashtags\n", 317 | "# only removing the hash # sign from the word\n", 318 | "tweet2 = re.sub(r'#', '', tweet2)\n", 319 | "\n", 320 | "print(tweet2)" 321 | ] 322 | }, 323 | { 324 | "cell_type": "markdown", 325 | "metadata": {}, 326 | "source": [ 327 | "### Tokenize the string\n", 328 | "\n", 329 | "To tokenize means to split the strings into individual words without blanks or tabs. In this same step, we will also convert each word in the string to lower case. 
The [tokenize](https://www.nltk.org/api/nltk.tokenize.html#module-nltk.tokenize.casual) module from NLTK allows us to do these easily:" 330 | ] 331 | }, 332 | { 333 | "cell_type": "code", 334 | "execution_count": 15, 335 | "metadata": {}, 336 | "outputs": [ 337 | { 338 | "name": "stdout", 339 | "output_type": "stream", 340 | "text": [ 341 | "\n", 342 | "\u001b[92mMy beautiful sunflowers on a sunny Friday morning off :) sunflowers favourites happy Friday off… \n", 343 | "\u001b[94m\n", 344 | "\n", 345 | "Tokenized string:\n", 346 | "['my', 'beautiful', 'sunflowers', 'on', 'a', 'sunny', 'friday', 'morning', 'off', ':)', 'sunflowers', 'favourites', 'happy', 'friday', 'off', '…']\n" 347 | ] 348 | } 349 | ], 350 | "source": [ 351 | "print()\n", 352 | "print('\\033[92m' + tweet2)\n", 353 | "print('\\033[94m')\n", 354 | "\n", 355 | "# instantiate tokenizer class\n", 356 | "tokenizer = TweetTokenizer(preserve_case=False, strip_handles=True,\n", 357 | " reduce_len=True)\n", 358 | "\n", 359 | "# tokenize tweets\n", 360 | "tweet_tokens = tokenizer.tokenize(tweet2)\n", 361 | "\n", 362 | "print()\n", 363 | "print('Tokenized string:')\n", 364 | "print(tweet_tokens)" 365 | ] 366 | }, 367 | { 368 | "cell_type": "markdown", 369 | "metadata": {}, 370 | "source": [ 371 | "### Remove stop words and punctuations\n", 372 | "\n", 373 | "The next step is to remove stop words and punctuation. Stop words are words that don't add significant meaning to the text. You'll see the list provided by NLTK when you run the cells below." 
374 | ] 375 | }, 376 | { 377 | "cell_type": "code", 378 | "execution_count": 16, 379 | "metadata": {}, 380 | "outputs": [ 381 | { 382 | "name": "stdout", 383 | "output_type": "stream", 384 | "text": [ 385 | "Stop words\n", 386 | "\n", 387 | "['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', \"you're\", \"you've\", \"you'll\", \"you'd\", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', \"she's\", 'her', 'hers', 'herself', 'it', \"it's\", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', \"that'll\", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', \"don't\", 'should', \"should've\", 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', \"aren't\", 'couldn', \"couldn't\", 'didn', \"didn't\", 'doesn', \"doesn't\", 'hadn', \"hadn't\", 'hasn', \"hasn't\", 'haven', \"haven't\", 'isn', \"isn't\", 'ma', 'mightn', \"mightn't\", 'mustn', \"mustn't\", 'needn', \"needn't\", 'shan', \"shan't\", 'shouldn', \"shouldn't\", 'wasn', \"wasn't\", 'weren', \"weren't\", 'won', \"won't\", 'wouldn', \"wouldn't\"]\n", 388 | "\n", 389 | "Punctuation\n", 390 | "\n", 391 | "!\"#$%&'()*+,-./:;<=>?@[\\]^_`{|}~\n" 392 | ] 393 | } 394 | ], 395 | "source": [ 396 | "#Import the english stop words list from NLTK\n", 397 | "stopwords_english = 
stopwords.words('english') \n", 398 | "\n", 399 | "print('Stop words\\n')\n", 400 | "print(stopwords_english)\n", 401 | "\n", 402 | "print('\\nPunctuation\\n')\n", 403 | "print(string.punctuation)" 404 | ] 405 | }, 406 | { 407 | "cell_type": "markdown", 408 | "metadata": {}, 409 | "source": [ 410 | "We can see that the stop words list above contains some words that could be important in some contexts. \n", 411 | "These could be words like _i, not, between, because, won, against_. You might need to customize the stop words list for some applications. For our exercise, we will use the entire list.\n", 412 | "\n", 413 | "For the punctuation, we saw earlier that certain groupings like ':)' and '...' should be retained when dealing with tweets because they are used to express emotions. In other contexts, like medical analysis, these may need to be removed as well.\n", 414 | "\n", 415 | "Time to clean up our tokenized tweet!" 416 | ] 417 | }, 418 | { 419 | "cell_type": "code", 420 | "execution_count": 17, 421 | "metadata": {}, 422 | "outputs": [ 423 | { 424 | "name": "stdout", 425 | "output_type": "stream", 426 | "text": [ 427 | "\n", 428 | "\u001b[92m\n", 429 | "['my', 'beautiful', 'sunflowers', 'on', 'a', 'sunny', 'friday', 'morning', 'off', ':)', 'sunflowers', 'favourites', 'happy', 'friday', 'off', '…']\n", 430 | "\u001b[94m\n", 431 | "removed stop words and punctuation:\n", 432 | "['beautiful', 'sunflowers', 'sunny', 'friday', 'morning', ':)', 'sunflowers', 'favourites', 'happy', 'friday', '…']\n" 433 | ] 434 | } 435 | ], 436 | "source": [ 437 | "print()\n", 438 | "print('\\033[92m')\n", 439 | "print(tweet_tokens)\n", 440 | "print('\\033[94m')\n", 441 | "\n", 442 | "tweets_clean = []\n", 443 | "\n", 444 | "for word in tweet_tokens: # Go through every word in your tokens list\n", 445 | " if (word not in stopwords_english and # remove stopwords\n", 446 | " word not in string.punctuation): # remove punctuation\n", 447 | " tweets_clean.append(word)\n", 448 | "\n", 449 | 
"print('removed stop words and punctuation:')\n", 450 | "print(tweets_clean)" 451 | ] 452 | }, 453 | { 454 | "cell_type": "markdown", 455 | "metadata": {}, 456 | "source": [ 457 | "Please note that the words **happy** and **sunny** in this list are correctly spelled. " 458 | ] 459 | }, 460 | { 461 | "cell_type": "markdown", 462 | "metadata": {}, 463 | "source": [ 464 | "### Stemming\n", 465 | "\n", 466 | "Stemming is the process of converting a word to its most general form, or stem. This helps in reducing the size of our vocabulary.\n", 467 | "\n", 468 | "Consider the words: \n", 469 | " * **learn**\n", 470 | " * **learn**ing\n", 471 | " * **learn**ed\n", 472 | " * **learn**t\n", 473 | " \n", 474 | "All of these words share the common root **learn**. However, in some cases, the stemming process produces words that are not correct spellings of the root word, such as **happi** and **sunni**. That happens because the stemmer maps related words to a shared stem. For example, we can look at the set of words that comprises the different forms of happy:\n", 475 | "\n", 476 | " * **happ**y\n", 477 | " * **happi**ness\n", 478 | " * **happi**er\n", 479 | " \n", 480 | "We can see that the prefix **happi** is more commonly used. We cannot choose **happ** because it is the stem of unrelated words like **happen**.\n", 481 | " \n", 482 | "NLTK has different modules for stemming and we will be using the [PorterStemmer](https://www.nltk.org/api/nltk.stem.html#module-nltk.stem.porter) module, which uses the [Porter Stemming Algorithm](https://tartarus.org/martin/PorterStemmer/). Let's see how we can use it in the cell below." 
483 | ] 484 | }, 485 | { 486 | "cell_type": "code", 487 | "execution_count": 18, 488 | "metadata": {}, 489 | "outputs": [ 490 | { 491 | "name": "stdout", 492 | "output_type": "stream", 493 | "text": [ 494 | "\n", 495 | "\u001b[92m\n", 496 | "['beautiful', 'sunflowers', 'sunny', 'friday', 'morning', ':)', 'sunflowers', 'favourites', 'happy', 'friday', '…']\n", 497 | "\u001b[94m\n", 498 | "stemmed words:\n", 499 | "['beauti', 'sunflow', 'sunni', 'friday', 'morn', ':)', 'sunflow', 'favourit', 'happi', 'friday', '…']\n" 500 | ] 501 | } 502 | ], 503 | "source": [ 504 | "print()\n", 505 | "print('\\033[92m')\n", 506 | "print(tweets_clean)\n", 507 | "print('\\033[94m')\n", 508 | "\n", 509 | "# Instantiate stemming class\n", 510 | "stemmer = PorterStemmer() \n", 511 | "\n", 512 | "# Create an empty list to store the stems\n", 513 | "tweets_stem = [] \n", 514 | "\n", 515 | "for word in tweets_clean:\n", 516 | " stem_word = stemmer.stem(word) # stemming word\n", 517 | " tweets_stem.append(stem_word) # append to the list\n", 518 | "\n", 519 | "print('stemmed words:')\n", 520 | "print(tweets_stem)" 521 | ] 522 | }, 523 | { 524 | "cell_type": "markdown", 525 | "metadata": {}, 526 | "source": [ 527 | "That's it! Now we have a list of words we can feed into the next stage of our machine learning project." 528 | ] 529 | }, 530 | { 531 | "cell_type": "markdown", 532 | "metadata": {}, 533 | "source": [ 534 | "## process_tweet()\n", 535 | "\n", 536 | "As shown above, preprocessing consists of multiple steps before you arrive at the final list of words. However, we will not ask you to replicate these steps. In the week's assignment, you will use the function `process_tweet(tweet)` available in _utils.py_. We encourage you to open the file; you'll see that this function's implementation is very similar to the steps above.\n", 537 | "\n", 538 | "To obtain the same result as in the previous code cells, you will only need to call the function `process_tweet()`. 
Let's do that in the next cell." 539 | ] 540 | }, 541 | { 542 | "cell_type": "code", 543 | "execution_count": 19, 544 | "metadata": {}, 545 | "outputs": [ 546 | { 547 | "name": "stdout", 548 | "output_type": "stream", 549 | "text": [ 550 | "\n", 551 | "\u001b[92m\n", 552 | "My beautiful sunflowers on a sunny Friday morning off :) #sunflowers #favourites #happy #Friday off… https://t.co/3tfYom0N1i\n", 553 | "\u001b[94m\n", 554 | "preprocessed tweet:\n", 555 | "['beauti', 'sunflow', 'sunni', 'friday', 'morn', ':)', 'sunflow', 'favourit', 'happi', 'friday', '…']\n" 556 | ] 557 | } 558 | ], 559 | "source": [ 560 | "from utils import process_tweet # Import the process_tweet function\n", 561 | "\n", 562 | "# choose the same tweet\n", 563 | "tweet = all_positive_tweets[2277]\n", 564 | "\n", 565 | "print()\n", 566 | "print('\\033[92m')\n", 567 | "print(tweet)\n", 568 | "print('\\033[94m')\n", 569 | "\n", 570 | "# call the imported function\n", 571 | "tweets_stem = process_tweet(tweet); # Preprocess a given tweet\n", 572 | "\n", 573 | "print('preprocessed tweet:')\n", 574 | "print(tweets_stem) # Print the result" 575 | ] 576 | }, 577 | { 578 | "cell_type": "markdown", 579 | "metadata": {}, 580 | "source": [ 581 | "That's it for this lab! You now know what is going on when you call the preprocessing helper function in this week's assignment. Hopefully, this exercise has also given you some insights on how to tweak this for other types of text datasets." 
582 | ] 583 | } 584 | ], 585 | "metadata": { 586 | "kernelspec": { 587 | "display_name": "Python 3", 588 | "language": "python", 589 | "name": "python3" 590 | }, 591 | "language_info": { 592 | "codemirror_mode": { 593 | "name": "ipython", 594 | "version": 3 595 | }, 596 | "file_extension": ".py", 597 | "mimetype": "text/x-python", 598 | "name": "python", 599 | "nbconvert_exporter": "python", 600 | "pygments_lexer": "ipython3", 601 | "version": "3.7.1" 602 | } 603 | }, 604 | "nbformat": 4, 605 | "nbformat_minor": 2 606 | } 607 | -------------------------------------------------------------------------------- /NLP_C1_W1_lecture_nb_03.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Visualizing tweets and the Logistic Regression model\n", 8 | "\n", 9 | "**Objectives:** Visualize and interpret the logistic regression model\n", 10 | "\n", 11 | "**Steps:**\n", 12 | "* Plot tweets in a scatter plot using their positive and negative sums.\n", 13 | "* Plot the output of the logistic regression model in the same plot as a solid line" 14 | ] 15 | }, 16 | { 17 | "cell_type": "markdown", 18 | "metadata": {}, 19 | "source": [ 20 | "## Import the required libraries\n", 21 | "\n", 22 | "We will be using [*NLTK*](http://www.nltk.org/howto/twitter.html), an opensource NLP library, for collecting, handling, and processing Twitter data. In this lab, we will use the example dataset that comes alongside with NLTK. This dataset has been manually annotated and serves to establish baselines for models quickly. \n", 23 | "\n", 24 | "So, to start, let's import the required libraries. 
" 25 | ] 26 | }, 27 | { 28 | "cell_type": "code", 29 | "execution_count": 4, 30 | "metadata": {}, 31 | "outputs": [], 32 | "source": [ 33 | "import nltk # NLP toolbox\n", 34 | "from os import getcwd\n", 35 | "import pandas as pd # Library for Dataframes \n", 36 | "from nltk.corpus import twitter_samples \n", 37 | "import matplotlib.pyplot as plt # Library for visualization\n", 38 | "import numpy as np # Library for math functions\n", 39 | "\n", 40 | "from utils import process_tweet, build_freqs # Our functions for NLP" 41 | ] 42 | }, 43 | { 44 | "cell_type": "markdown", 45 | "metadata": {}, 46 | "source": [ 47 | "## Load the NLTK sample dataset\n", 48 | "\n", 49 | "To complete this lab, you need the sample dataset of the previous lab. Here, we assume the files are already available, and we only need to load into Python lists." 50 | ] 51 | }, 52 | { 53 | "cell_type": "code", 54 | "execution_count": 5, 55 | "metadata": {}, 56 | "outputs": [ 57 | { 58 | "name": "stdout", 59 | "output_type": "stream", 60 | "text": [ 61 | "Number of tweets: 8000\n" 62 | ] 63 | } 64 | ], 65 | "source": [ 66 | "# select the set of positive and negative tweets\n", 67 | "all_positive_tweets = twitter_samples.strings('positive_tweets.json')\n", 68 | "all_negative_tweets = twitter_samples.strings('negative_tweets.json')\n", 69 | "\n", 70 | "tweets = all_positive_tweets + all_negative_tweets ## Concatenate the lists. 
\n", 71 | "labels = np.append(np.ones((len(all_positive_tweets),1)), np.zeros((len(all_negative_tweets),1)), axis = 0)\n", 72 | "\n", 73 | "# keep the first 4000 tweets of each class for training; the remaining 1000 per class are left for testing (validation set)\n", 74 | "train_pos = all_positive_tweets[:4000]\n", 75 | "train_neg = all_negative_tweets[:4000]\n", 76 | "\n", 77 | "train_x = train_pos + train_neg \n", 78 | "\n", 79 | "print(\"Number of tweets: \", len(train_x))" 80 | ] 81 | }, 82 | { 83 | "cell_type": "markdown", 84 | "metadata": {}, 85 | "source": [ 86 | "## Load the extracted features\n", 87 | "\n", 88 | "Part of this week's assignment is the creation of the numerical features needed for the Logistic regression model. So as not to give away that part of the assignment, we have precomputed these features for the entire training set and stored them in a CSV file.\n", 89 | "\n", 90 | "So, please load the features created for the tweet sample. " 91 | ] 92 | }, 93 | { 94 | "cell_type": "code", 95 | "execution_count": 6, 96 | "metadata": {}, 97 | "outputs": [ 98 | { 99 | "data": { 100 | "text/html": [ 101 | "
" 199 | ], 200 | "text/plain": [ 201 | " bias positive negative sentiment\n", 202 | "0 1.0 3020.0 61.0 1.0\n", 203 | "1 1.0 3573.0 444.0 1.0\n", 204 | "2 1.0 3005.0 115.0 1.0\n", 205 | "3 1.0 2862.0 4.0 1.0\n", 206 | "4 1.0 3119.0 225.0 1.0\n", 207 | "5 1.0 2955.0 119.0 1.0\n", 208 | "6 1.0 3934.0 538.0 1.0\n", 209 | "7 1.0 3162.0 276.0 1.0\n", 210 | "8 1.0 628.0 189.0 1.0\n", 211 | "9 1.0 264.0 112.0 1.0" 212 | ] 213 | }, 214 | "execution_count": 6, 215 | "metadata": {}, 216 | "output_type": "execute_result" 217 | } 218 | ], 219 | "source": [ 220 | "data = pd.read_csv('logistic_features.csv') # Load the CSV file (columns: bias, positive, negative, sentiment) with pandas\n", 221 | "data.head(10) # Show the first 10 data entries" 222 | ] 223 | }, 224 | { 225 | "cell_type": "markdown", 226 | "metadata": {}, 227 | "source": [ 228 | "Now let us extract the values from the DataFrame and keep only NumPy arrays." 229 | ] 230 | }, 231 | { 232 | "cell_type": "code", 233 | "execution_count": 7, 234 | "metadata": {}, 235 | "outputs": [ 236 | { 237 | "name": "stdout", 238 | "output_type": "stream", 239 | "text": [ 240 | "(8000, 3)\n", 241 | "[[1.000e+00 3.020e+03 6.100e+01]\n", 242 | " [1.000e+00 3.573e+03 4.440e+02]\n", 243 | " [1.000e+00 3.005e+03 1.150e+02]\n", 244 | " ...\n", 245 | " [1.000e+00 1.440e+02 7.830e+02]\n", 246 | " [1.000e+00 2.050e+02 3.890e+03]\n", 247 | " [1.000e+00 1.890e+02 3.974e+03]]\n" 248 | ] 249 | } 250 | ], 251 | "source": [ 252 | "# Each feature is labeled as bias, positive and negative\n", 253 | "X = data[['bias', 'positive', 'negative']].values # Get only the numerical values of the dataframe\n", 254 | "Y = data['sentiment'].values # Put in Y the corresponding labels or sentiments\n", 255 | "\n", 256 | "print(X.shape) # Print the shape of the X part\n", 257 | "print(X) # Print some rows of X" 258 | ] 259 | }, 260 | { 261 | "cell_type": "markdown", 262 | "metadata": {}, 263 | "source": [ 264 | "## Load a pretrained Logistic Regression 
model\n", 265 | "\n", 266 | "In the same way, as 
part of this week's assignment, a Logistic regression model must be trained. The next cell contains the resulting model from such training. Notice that a list of 3 numeric values represents the whole model, that we have called _theta_ $\\theta$." 267 | ] 268 | }, 269 | { 270 | "cell_type": "code", 271 | "execution_count": 8, 272 | "metadata": {}, 273 | "outputs": [], 274 | "source": [ 275 | "theta = [7e-08, 0.0005239, -0.00055517]" 276 | ] 277 | }, 278 | { 279 | "cell_type": "markdown", 280 | "metadata": {}, 281 | "source": [ 282 | "## Plot the samples in a scatter plot\n", 283 | "\n", 284 | "The vector theta represents a plane that split our feature space into two parts. Samples located over that plane are considered positive, and samples located under that plane are considered negative. Remember that we have a 3D feature space, i.e., each tweet is represented as a vector comprised of three values: `[bias, positive_sum, negative_sum]`, always having `bias = 1`. \n", 285 | "\n", 286 | "If we ignore the bias term, we can plot each tweet in a cartesian plane, using `positive_sum` and `negative_sum`. In the cell below, we do precisely this. Additionally, we color each tweet, depending on its class. Positive tweets will be green and negative tweets will be red." 287 | ] 288 | }, 289 | { 290 | "cell_type": "code", 291 | "execution_count": 9, 292 | "metadata": {}, 293 | "outputs": [ 294 | { 295 | "data": { 296 | "image/png": "\n", 297 | "text/plain": [ 298 | "
" 299 | ] 300 | }, 301 | "metadata": { 302 | "needs_background": "light" 303 | }, 304 | "output_type": "display_data" 305 | } 306 | ], 307 | "source": [ 308 | "# Plot the samples using columns 1 and 2 of the matrix\n", 309 | "fig, ax = plt.subplots(figsize = (8, 8))\n", 310 | "\n", 311 | "colors = ['red', 'green']\n", 312 | "\n", 313 | "# Color based on the sentiment Y\n", 314 | "ax.scatter(X[:,1], X[:,2], c=[colors[int(k)] for k in Y], s = 0.1) # Plot a dot for each tweet\n", 315 | "plt.xlabel(\"Positive\")\n", 316 | "plt.ylabel(\"Negative\");" 317 | ] 318 | }, 319 | { 320 | "cell_type": "markdown", 321 | "metadata": {}, 322 | "source": [ 323 | "From the plot, it is evident that the features we have chosen to represent tweets as numerical vectors allow an almost perfect separation between positive and negative tweets. So you can expect a very high accuracy for this model! \n", 324 | "\n", 325 | "## Plot the model alongside the data\n", 326 | "\n", 327 | "We will draw a gray line to show the cutoff between the positive and negative regions. In other words, the gray line marks the line where $$ z = \\theta * x = 0.$$\n", 328 | "To draw this line, we have to solve the above equation for one of the independent variables.\n", 329 | "\n", 330 | "$$ z = \\theta * x = 0$$\n", 331 | "$$ x = [1, pos, neg] $$\n", 332 | "$$ z(\\theta, x) = \\theta_0+ \\theta_1 * pos + \\theta_2 * neg = 0 $$\n", 333 | "$$ neg = (-\\theta_0 - \\theta_1 * pos) / \\theta_2 $$\n", 334 | "\n", 335 | "The red and green arrows that point in the direction of the corresponding sentiment are drawn along a line perpendicular to the separation line given by the previous equations (the `neg` function). This perpendicular points in the same direction as the gradient of the logit $z$, although the magnitude may differ. It serves only as a visual representation of the model.
\n", 336 | "\n", 337 | "$$direction = pos * \\theta_2 / \\theta_1$$" 338 | ] 339 | }, 340 | { 341 | "cell_type": "code", 342 | "execution_count": 10, 343 | "metadata": {}, 344 | "outputs": [], 345 | "source": [ 346 | "# Equation for the separation line\n", 347 | "# It gives the value on the negative axis as a function of the positive value\n", 348 | "# f(pos, neg, W) = w0 + w1 * pos + w2 * neg = 0\n", 349 | "# s(pos, W) = (-w0 - w1 * pos) / w2\n", 350 | "def neg(theta, pos):\n", 351 | "    return (-theta[0] - pos * theta[1]) / theta[2]\n", 352 | "\n", 353 | "# Equation for the direction in which the sentiment changes\n", 354 | "# We don't care about the magnitude of the change. We are only interested \n", 355 | "# in the direction. So this direction is just a line perpendicular to the \n", 356 | "# separation line\n", 357 | "# df(pos, W) = pos * w2 / w1\n", 358 | "def direction(theta, pos):\n", 359 | "    return pos * theta[2] / theta[1]" 360 | ] 361 | }, 362 | { 363 | "cell_type": "markdown", 364 | "metadata": {}, 365 | "source": [ 366 | "The green line in the chart points in the direction where z > 0 and the red line points in the direction where z < 0. The direction of these lines is given by the weights $\\theta_1$ and $\\theta_2$." 367 | ] 368 | }, 369 | { 370 | "cell_type": "code", 371 | "execution_count": 11, 372 | "metadata": {}, 373 | "outputs": [ 374 | { 375 | "data": { 376 | "image/png": "\n", 377 | "text/plain": [ 378 | "
" 379 | ] 380 | }, 381 | "metadata": { 382 | "needs_background": "light" 383 | }, 384 | "output_type": "display_data" 385 | } 386 | ], 387 | "source": [ 388 | "# Plot the samples using columns 1 and 2 of the matrix\n", 389 | "fig, ax = plt.subplots(figsize = (8, 8))\n", 390 | "\n", 391 | "colors = ['red', 'green']\n", 392 | "\n", 393 | "# Color based on the sentiment Y\n", 394 | "ax.scatter(X[:,1], X[:,2], c=[colors[int(k)] for k in Y], s = 0.1) # Plot a dot for each tweet\n", 395 | "plt.xlabel(\"Positive\")\n", 396 | "plt.ylabel(\"Negative\")\n", 397 | "\n", 398 | "# Now let's represent the logistic regression model in this chart. \n", 399 | "maxpos = np.max(X[:,1])\n", 400 | "\n", 401 | "offset = 5000 # The pos value for the origin of the direction arrows\n", 402 | "\n", 403 | "# Plot a gray line that divides the 2 areas.\n", 404 | "ax.plot([0, maxpos], [neg(theta, 0), neg(theta, maxpos)], color = 'gray') \n", 405 | "\n", 406 | "# Plot a green arrow pointing in the positive direction\n", 407 | "ax.arrow(offset, neg(theta, offset), offset, direction(theta, offset), head_width=500, head_length=500, fc='g', ec='g')\n", 408 | "# Plot a red arrow pointing in the negative direction\n", 409 | "ax.arrow(offset, neg(theta, offset), -offset, -direction(theta, offset), head_width=500, head_length=500, fc='r', ec='r')\n", 410 | "\n", 411 | "plt.show()" 412 | ] 413 | }, 414 | { 415 | "cell_type": "markdown", 416 | "metadata": {}, 417 | "source": [ 418 | "**Note that the features extracted from the tweets matter more than the logistic regression itself: they are what make the right results possible in this exercise.**\n", 419 | "\n", 420 | "That is all, folks. Hopefully, you now have a better understanding of what the logistic regression model represents, and why it works so well for this specific problem.
" 421 | ] 422 | } 423 | ], 424 | "metadata": { 425 | "kernelspec": { 426 | "display_name": "Python 3", 427 | "language": "python", 428 | "name": "python3" 429 | }, 430 | "language_info": { 431 | "codemirror_mode": { 432 | "name": "ipython", 433 | "version": 3 434 | }, 435 | "file_extension": ".py", 436 | "mimetype": "text/x-python", 437 | "name": "python", 438 | "nbconvert_exporter": "python", 439 | "pygments_lexer": "ipython3", 440 | "version": "3.7.1" 441 | } 442 | }, 443 | "nbformat": 4, 444 | "nbformat_minor": 4 445 | } 446 | --------------------------------------------------------------------------------
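The geometry described in the last notebook above (a separation line where z = 0, plus a perpendicular direction along which z changes sign) can be checked numerically. The snippet below is a minimal sketch, not part of the original notebook: it reuses the `theta` values and the `neg` and `direction` functions shown there, while the helper `z` and the variable names `boundary_slope`, `perp_slope`, `z_green` and `z_red` are assumptions introduced only for illustration.

```python
import numpy as np

# Weights copied from the notebook: [bias, positive, negative]
theta = np.array([7e-08, 0.0005239, -0.00055517])

def neg(theta, pos):
    # Negative-sum value on the separation line (z = 0) for a given positive sum
    return (-theta[0] - pos * theta[1]) / theta[2]

def direction(theta, pos):
    # Rise of a line perpendicular to the separation line over a run of `pos`
    return pos * theta[2] / theta[1]

def z(theta, pos, neg_value):
    # Hypothetical helper: the logit theta . [1, pos, neg]
    return theta[0] + theta[1] * pos + theta[2] * neg_value

offset = 5000.0  # same arrow origin as in the notebook

# Slope of the separation line and of the perpendicular direction
boundary_slope = (neg(theta, offset) - neg(theta, 0.0)) / offset
perp_slope = direction(theta, offset) / offset

# Logit at the tips of the green and red arrows
z_green = z(theta, offset + offset, neg(theta, offset) + direction(theta, offset))
z_red = z(theta, offset - offset, neg(theta, offset) - direction(theta, offset))

print("z on the line:", z(theta, offset, neg(theta, offset)))
print("product of slopes:", boundary_slope * perp_slope)
print("z at green tip:", z_green, "| z at red tip:", z_red)
```

Two perpendicular lines have slopes whose product is -1, so the second printed value should be very close to -1, and the sign of z at each arrow tip shows which side of the line that arrow points to.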