├── README.md ├── datasets ├── art_of_war.txt └── hun_eng_pairs │ ├── hun_eng_pairs_test.txt │ ├── hun_eng_pairs_train.txt │ └── hun_eng_pairs_val.txt ├── models ├── art_of_war_char_level_lm.zip ├── nmt_no_attention │ ├── hun_eng_s2s_nmt_no_attention_model.zip │ └── hun_eng_s2s_nmt_no_attention_tokenizers.zip └── nmt_with_attention │ └── attention_weights.zip └── notebooks ├── .ipynb_checkpoints └── nlpdemystified_topic_modelling_lda-checkpoint.ipynb ├── nlpdemystified_classification_naive_bayes.ipynb ├── nlpdemystified_neural_networks_foundations.ipynb ├── nlpdemystified_preprocessing.ipynb ├── nlpdemystified_recurrent_neural_networks.ipynb ├── nlpdemystified_seq2seq_and_attention.ipynb ├── nlpdemystified_topic_modelling_lda.ipynb ├── nlpdemystified_transformers_and_pretraining.ipynb ├── nlpdemystified_vectorization.ipynb └── nlpdemystified_word_vectors.ipynb /README.md: -------------------------------------------------------------------------------- 1 | # Natural Language Processing Demystified 2 | 3 | NLP Demystified is a free, comprehensive course to turn you into an NLP expert. It covers everything from the very basics to the state-of-the-art. 4 | 5 | - 15 modules of theory and concepts, clearly explained. 6 | - 9 fully-documented notebooks with end-to-end examples of how to accomplish common NLP tasks. 7 | - No machine learning knowledge assumed. Just know Python and a bit of high school math. 8 | 9 | Visit [nlpdemystified.org](https://nlpdemystified.org) to start learning. 10 | 11 | # Content 12 | 13 | | | | | 14 | | ------------------------------------------------------------------ | ---------------------------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | 15 | | 1. Introduction | [video](https://www.youtube.com/watch?v=diOXCK7I2wA) | No notebook for this module | 16 | | 2. Tokenization | [video](https://www.youtube.com/watch?v=LZFriJ85BfM) | [notebook](https://colab.research.google.com/github/futuremojo/nlp-demystified/blob/main/notebooks/nlpdemystified_preprocessing.ipynb) | 17 | | 3. Basic Preprocessing | [video](https://www.youtube.com/watch?v=I173TmCTxpk) | [notebook](https://colab.research.google.com/github/futuremojo/nlp-demystified/blob/main/notebooks/nlpdemystified_preprocessing.ipynb#scrollTo=uUsfYCpVT4nI) | 18 | | 4. Advanced Preprocessing | [video](https://www.youtube.com/watch?v=aeUE9AXO5Ss) | [notebook](https://colab.research.google.com/github/futuremojo/nlp-demystified/blob/main/notebooks/nlpdemystified_preprocessing.ipynb#scrollTo=o9HLYYUt1kOP) | 19 | | 5. Measuring Document Similarity With Basic Bag-of-Words | [video](https://www.youtube.com/watch?v=QbPDjzk2oCA) | [notebook](https://colab.research.google.com/github/futuremojo/nlp-demystified/blob/main/notebooks/nlpdemystified_vectorization.ipynb) | 20 | | 6. Simple Document Search With TF-IDF | [video](https://www.youtube.com/watch?v=fIYSi41f1yg) | [notebook](https://colab.research.google.com/github/futuremojo/nlp-demystified/blob/main/notebooks/nlpdemystified_vectorization.ipynb#scrollTo=CnC_i4oH2ARW) | 21 | | 7. Building Models: Finding Patterns for Fun and Profit | [video](https://www.youtube.com/watch?v=-2c7bMSEAl8) | No notebook for this module | 22 | | 8. 
Naive Bayes: Fast and Simple Text Classification | [video](https://www.youtube.com/watch?v=FrWvpzoQBPQ) | [notebook](https://colab.research.google.com/github/futuremojo/nlp-demystified/blob/main/notebooks/nlpdemystified_classification_naive_bayes.ipynb) | 23 | | 9. Topic Modelling: Automatically Discovering Topics in Documents | [video](https://www.youtube.com/watch?v=9mNV4AwA9QI) | [notebook](https://colab.research.google.com/github/futuremojo/nlp-demystified/blob/main/notebooks/nlpdemystified_topic_modelling_lda.ipynb) | 24 | | 10. Neural Networks I: Core Mechanisms and Coding One From Scratch | [video](https://www.youtube.com/watch?v=VS1mgwAS8EM) | [notebook](https://colab.research.google.com/github/futuremojo/nlp-demystified/blob/main/notebooks/nlpdemystified_neural_networks_foundations.ipynb) | 25 | | 11. Neural Networks II: Effective Training Techniques | [video](https://www.youtube.com/watch?v=Pytt93Q-b2I) | [notebook](https://colab.research.google.com/github/futuremojo/nlp-demystified/blob/main/notebooks/nlpdemystified_neural_networks_foundations.ipynb#scrollTo=08E-EoqxxnVn) | 26 | | 12. Word Vectors | [video](https://www.youtube.com/watch?v=IebL0RQF5lg) | [notebook](https://colab.research.google.com/github/futuremojo/nlp-demystified/blob/main/notebooks/nlpdemystified_word_vectors.ipynb) | 27 | | 13. Recurrent Neural Networks and Language Models | [video](https://www.youtube.com/watch?v=y0FqGWbfkQw) | [notebook](https://colab.research.google.com/github/futuremojo/nlp-demystified/blob/main/notebooks/nlpdemystified_recurrent_neural_networks.ipynb) | 28 | | 14. Sequence-to-Sequence and Attention | [video](https://www.youtube.com/watch?v=tvIzBouq6lk) | [notebook](https://colab.research.google.com/github/futuremojo/nlp-demystified/blob/main/notebooks/nlpdemystified_seq2seq_and_attention.ipynb) | 29 | | 15. 
Transformers From Scratch, Pre-Training, and Transfer Learning | [video](https://www.youtube.com/watch?v=acxqoltilME) | [notebook](https://colab.research.google.com/github/futuremojo/nlp-demystified/blob/main/notebooks/nlpdemystified_transformers_and_pretraining.ipynb) | 30 | | | | | 31 | -------------------------------------------------------------------------------- /models/art_of_war_char_level_lm.zip: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/futuremojo/nlp-demystified/117e69232e91b1b72c064cf5b65533fb4dddda02/models/art_of_war_char_level_lm.zip -------------------------------------------------------------------------------- /models/nmt_no_attention/hun_eng_s2s_nmt_no_attention_model.zip: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/futuremojo/nlp-demystified/117e69232e91b1b72c064cf5b65533fb4dddda02/models/nmt_no_attention/hun_eng_s2s_nmt_no_attention_model.zip -------------------------------------------------------------------------------- /models/nmt_no_attention/hun_eng_s2s_nmt_no_attention_tokenizers.zip: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/futuremojo/nlp-demystified/117e69232e91b1b72c064cf5b65533fb4dddda02/models/nmt_no_attention/hun_eng_s2s_nmt_no_attention_tokenizers.zip -------------------------------------------------------------------------------- /models/nmt_with_attention/attention_weights.zip: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/futuremojo/nlp-demystified/117e69232e91b1b72c064cf5b65533fb4dddda02/models/nmt_with_attention/attention_weights.zip -------------------------------------------------------------------------------- /notebooks/.ipynb_checkpoints/nlpdemystified_topic_modelling_lda-checkpoint.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "nbformat": 4, 3 | "nbformat_minor": 0, 4 | "metadata": { 5 | "colab": { 6 | "name": "nlpdemystified-topic-modelling-lda.ipynb", 7 | "private_outputs": true, 8 | "provenance": [] 9 | }, 10 | "kernelspec": { 11 | "name": "python3", 12 | "display_name": "Python 3" 13 | }, 14 | "accelerator": "GPU" 15 | }, 16 | "cells": [ 17 | { 18 | "cell_type": "markdown", 19 | "metadata": { 20 | "id": "ITy3IHHU95uS" 21 | }, 22 | "source": [ 23 | "# Natural Language Processing Demystified | Topic Modelling With Latent Dirichlet Allocation\n", 24 | "https://nlpdemystified.org
\n", 25 | "https://github.com/futuremojo/nlp-demystified

\n", 26 | "Course module for this demo: https://www.nlpdemystified.org/course/topic-modelling" 27 | ] 28 | }, 29 | { 30 | "cell_type": "markdown", 31 | "metadata": { 32 | "id": "aes1ZqWZTUa5" 33 | }, 34 | "source": [ 35 | "# spaCy upgrade and package installation." 36 | ] 37 | }, 38 | { 39 | "cell_type": "markdown", 40 | "metadata": { 41 | "id": "zSVwiu4YTVDa" 42 | }, 43 | "source": [ 44 | "At the time this notebook was created, spaCy had newer releases but Colab was still using version 2.x by default. So the first step is to upgrade spaCy.\n", 45 | "

\n", 46 | "**IMPORTANT**
\n", 47 | "If you're running this in the cloud rather than a local Jupyter server on your machine, then the notebook will *timeout* after a period of inactivity. If that happens and you don't reconnect in time, you will need to upgrade spaCy again and reinstall the requisite statistical package(s).\n", 48 | "

\n", 49 | "Refer to this link on how to run Colab notebooks locally on your machine to avoid this issue:
\n", 50 | "https://research.google.com/colaboratory/local-runtimes.html\n", 51 | "\n", 52 | "---\n", 53 | "> **In the course video, I ran this demo on a local Jupyter server to take advantage of multiprocessing capabilities. It's not necessary but I recommend it.**" 54 | ] 55 | }, 56 | { 57 | "cell_type": "code", 58 | "metadata": { 59 | "id": "_VstAdWMUWvp" 60 | }, 61 | "source": [ 62 | "!pip install -U spacy==3.*\n", 63 | "!python -m spacy download en_core_web_sm\n", 64 | "!python -m spacy info" 65 | ], 66 | "execution_count": null, 67 | "outputs": [] 68 | }, 69 | { 70 | "cell_type": "markdown", 71 | "metadata": { 72 | "id": "DKZgKn9TTc9Z" 73 | }, 74 | "source": [ 75 | "For topic modelling, we'll use **Gensim**, a popular topic modelling library originally authored by Radim Řehůřek. It has implementations for LDA and other models.
\n", 76 | "https://radimrehurek.com/gensim/index.html" 77 | ] 78 | }, 79 | { 80 | "cell_type": "code", 81 | "source": [ 82 | "# Upgrade gensim in case.\n", 83 | "!pip install --upgrade numpy\n", 84 | "!pip install -U gensim==4.*" 85 | ], 86 | "metadata": { 87 | "id": "gRg7SM8qEY7o" 88 | }, 89 | "execution_count": null, 90 | "outputs": [] 91 | }, 92 | { 93 | "cell_type": "code", 94 | "source": [ 95 | "import matplotlib.pyplot as plt\n", 96 | "import pandas as pd\n", 97 | "import random\n", 98 | "import spacy\n", 99 | "\n", 100 | "from gensim import models, corpora\n", 101 | "from gensim import similarities\n", 102 | "from gensim.models.coherencemodel import CoherenceModel\n", 103 | "from wordcloud import WordCloud" 104 | ], 105 | "metadata": { 106 | "id": "YcyuLLRk9Epv" 107 | }, 108 | "execution_count": null, 109 | "outputs": [] 110 | }, 111 | { 112 | "cell_type": "markdown", 113 | "metadata": { 114 | "id": "aUqudgVeCfbM" 115 | }, 116 | "source": [ 117 | "# First pass at building an LDA topic model for our corpus" 118 | ] 119 | }, 120 | { 121 | "cell_type": "markdown", 122 | "metadata": { 123 | "id": "mHBDR4ZqVvwY" 124 | }, 125 | "source": [ 126 | "We'll use a corpus of over 90,000 CNN news articles originally compiled for training question answering models. I lightly processed them to remove some metadata and put them on Google Drive.\n", 127 | "([original source](https://cs.nyu.edu/~kcho/DMQA/))\n", 128 | "

\n", 129 | "To retrieve the corpus from Google Drive, we'll use the **gdown** library which I've already installed:
\n", 130 | "https://github.com/wkentaro/gdown" 131 | ] 132 | }, 133 | { 134 | "cell_type": "code", 135 | "source": [ 136 | "# Download the CNN corpus.\n", 137 | "!gdown 'https://drive.google.com/uc?id=122fC9XpNwFKx0ryRVKJz5MWUTzA3Vpsf'" 138 | ], 139 | "metadata": { 140 | "id": "kO0I2ThbauR3" 141 | }, 142 | "execution_count": null, 143 | "outputs": [] 144 | }, 145 | { 146 | "cell_type": "markdown", 147 | "source": [ 148 | "The corpus is one large text file with each article in the corpus separated by an *@delimiter* string. We'll split the articles and place them in a list." 149 | ], 150 | "metadata": { 151 | "id": "Gpu_Z5fdbYpU" 152 | } 153 | }, 154 | { 155 | "cell_type": "code", 156 | "source": [ 157 | "with open('cnn_articles.txt', 'r') as f:\n", 158 | " articles = f.read().split('@delimiter')" 159 | ], 160 | "metadata": { 161 | "id": "JxGeaaj4auNO" 162 | }, 163 | "execution_count": null, 164 | "outputs": [] 165 | }, 166 | { 167 | "cell_type": "code", 168 | "source": [ 169 | "print(len(articles))\n", 170 | "print(articles[0])" 171 | ], 172 | "metadata": { 173 | "id": "9QNyQo5gauIs" 174 | }, 175 | "execution_count": null, 176 | "outputs": [] 177 | }, 178 | { 179 | "cell_type": "markdown", 180 | "source": [ 181 | "For this demo, we'll use a subset of the articles to speed things up but feel free to change the dataset size." 182 | ], 183 | "metadata": { 184 | "id": "1lKZEP-J02TA" 185 | } 186 | }, 187 | { 188 | "cell_type": "code", 189 | "source": [ 190 | "DATASET_SIZE = 20000\n", 191 | "dataset = articles[:DATASET_SIZE]" 192 | ], 193 | "metadata": { 194 | "id": "YSfxX4tlbpa6" 195 | }, 196 | "execution_count": null, 197 | "outputs": [] 198 | }, 199 | { 200 | "cell_type": "markdown", 201 | "source": [ 202 | "Just like in the [Text Classification with Naive Bayes](https://github.com/futuremojo/nlp-demystified/blob/main/notebooks/nlpdemystified_classification_naive_bayes.ipynb) demo, we'll start off with a *blank* tokenizer with no further pipeline components to see if that's good enough.\n", 203 | "

\n", 204 | "We'll filter out punctuations, newlines, and any tokens containing non-alphabetic characters." 205 | ], 206 | "metadata": { 207 | "id": "qLkJz7BS6q-S" 208 | } 209 | }, 210 | { 211 | "cell_type": "code", 212 | "source": [ 213 | "nlp = spacy.blank('en')\n", 214 | "\n", 215 | "def basic_filter(tokenized_doc):\n", 216 | " return [t.text for t in tokenized_doc if\n", 217 | " not t.is_punct and \\\n", 218 | " not t.is_space and \\\n", 219 | " t.is_alpha]" 220 | ], 221 | "metadata": { 222 | "id": "g6XVBLIl0FkX" 223 | }, 224 | "execution_count": null, 225 | "outputs": [] 226 | }, 227 | { 228 | "cell_type": "markdown", 229 | "source": [ 230 | "In this demo, we'll leverage spaCy's **nlp.pipe** function which can process a corpus as a batch (or a series of batches) and use multiple processes. Here, we'll process our dataset as a batch across multiple processes, then run the tokenized **doc** objects through the *basic_filter* function. You can adjust **NUM_PROCESS** as you wish.

\n", 231 | "Take a look at these link for ways to further optimize spaCy's pipeline:
\n", 232 | "https://spacy.io/usage/processing-pipelines#processing
\n", 233 | "https://spacy.io/api/language#pipe

\n", 234 | "YouTube video from spaCy on using **nlp.pipe**: [Speed up spaCy pipelines via `nlp.pipe` - spaCy shorts](https://www.youtube.com/watch?v=OoZ-H_8vRnc)
\n", 235 | "Tuning **nlp.pipe**: https://stackoverflow.com/questions/65850018/processing-text-with-spacy-nlp-pipe" 236 | ], 237 | "metadata": { 238 | "id": "6siL9mNJxqix" 239 | } 240 | }, 241 | { 242 | "cell_type": "code", 243 | "source": [ 244 | "NUM_PROCESS = 4" 245 | ], 246 | "metadata": { 247 | "id": "L1SVzXUzxtBe" 248 | }, 249 | "execution_count": null, 250 | "outputs": [] 251 | }, 252 | { 253 | "cell_type": "code", 254 | "source": [ 255 | "%%time\n", 256 | "tokenized_articles = list(map(basic_filter, nlp.pipe(dataset, n_process=NUM_PROCESS)))" 257 | ], 258 | "metadata": { 259 | "id": "nGYhfDXcz9_V" 260 | }, 261 | "execution_count": null, 262 | "outputs": [] 263 | }, 264 | { 265 | "cell_type": "code", 266 | "metadata": { 267 | "id": "OYNK7Nd-cLsZ" 268 | }, 269 | "source": [ 270 | "print(tokenized_articles[0])" 271 | ], 272 | "execution_count": null, 273 | "outputs": [] 274 | }, 275 | { 276 | "cell_type": "markdown", 277 | "metadata": { 278 | "id": "DkopX2P4UqDK" 279 | }, 280 | "source": [ 281 | "To start off, we'll go with 20 topics. With most topic models including LDA, there isn't a clear recipe on how to pick the optimal number of topics. The nature and composition of the data (e.g. average length of each document) has a major impact on how many topics are *interpretable* by a human. Often, it's best to go with something reasonable to begin with and then try different topic numbers.

For this corpus, I'm going with 20 topics which is a small amount relative to the corpus size, but my reasoning is that since this is a general mainstream news corpus, the topics themselves are going to be fairly broad." 282 | ] 283 | }, 284 | { 285 | "cell_type": "code", 286 | "metadata": { 287 | "id": "o9RbTz3OXTuM" 288 | }, 289 | "source": [ 290 | "NUM_TOPICS = 20" 291 | ], 292 | "execution_count": null, 293 | "outputs": [] 294 | }, 295 | { 296 | "cell_type": "markdown", 297 | "metadata": { 298 | "id": "XgCbr9SJZxDQ" 299 | }, 300 | "source": [ 301 | "After tokenizing our text, the first step with Gensim is to construct a **Dictionary** mapping words to integer IDs.
\n", 302 | "https://radimrehurek.com/gensim/corpora/dictionary.html

\n", 303 | "This is similar to the *fit* step we took with scikit-learn's vectorizers." 304 | ] 305 | }, 306 | { 307 | "cell_type": "code", 308 | "metadata": { 309 | "id": "EP2db-H8cUwb" 310 | }, 311 | "source": [ 312 | "# Build a Dictionary of word<-->id mappings.\n", 313 | "%%time\n", 314 | "dictionary = corpora.Dictionary(tokenized_articles)\n", 315 | "\n", 316 | "sample_token = 'news'\n", 317 | "print(f'Id for \\'{sample_token}\\' token: {dictionary.token2id[sample_token]}')" 318 | ], 319 | "execution_count": null, 320 | "outputs": [] 321 | }, 322 | { 323 | "cell_type": "markdown", 324 | "metadata": { 325 | "id": "XyAHgUxEaXVf" 326 | }, 327 | "source": [ 328 | "The next step is to create a frequency bag-of-words from each article using the **dictionary**'s *doc2bow* method. This is similar to the *transform* step from scikit-learn's vectorizers.
\n", 329 | "https://radimrehurek.com/gensim/corpora/dictionary.html#gensim.corpora.dictionary.Dictionary.doc2bow" 330 | ] 331 | }, 332 | { 333 | "cell_type": "code", 334 | "metadata": { 335 | "id": "ZYpRy9W6cWAK" 336 | }, 337 | "source": [ 338 | "%%time\n", 339 | "corpus_bow = [dictionary.doc2bow(article) for article in tokenized_articles]" 340 | ], 341 | "execution_count": null, 342 | "outputs": [] 343 | }, 344 | { 345 | "cell_type": "markdown", 346 | "metadata": { 347 | "id": "KD9khr0RbBTq" 348 | }, 349 | "source": [ 350 | "Finally, we'll generate our base LDA model. Gensim's LDA model has a large number of optional parameters but for now, we'll keep it simple.
\n", 351 | "https://radimrehurek.com/gensim/models/ldamodel.html?highlight=lda#module-gensim.models.ldamodel" 352 | ] 353 | }, 354 | { 355 | "cell_type": "code", 356 | "metadata": { 357 | "id": "AP0MS3n7dxE_" 358 | }, 359 | "source": [ 360 | "%%time\n", 361 | "lda_model = models.LdaModel(corpus=corpus_bow, num_topics=NUM_TOPICS, id2word=dictionary, random_state=1)" 362 | ], 363 | "execution_count": null, 364 | "outputs": [] 365 | }, 366 | { 367 | "cell_type": "markdown", 368 | "metadata": { 369 | "id": "8ecFL_MSb9wp" 370 | }, 371 | "source": [ 372 | "Once our model is generated, we can view the topics inferred. By default, the model's *print_topics* method shows the top 20 topics and each topic's ten most significant words.
\n", 373 | "https://radimrehurek.com/gensim/models/ldamodel.html?highlight=lda#gensim.models.ldamodel.LdaModel.print_topics" 374 | ] 375 | }, 376 | { 377 | "cell_type": "code", 378 | "metadata": { 379 | "id": "lFTFPOb4eKUi" 380 | }, 381 | "source": [ 382 | "lda_model.print_topics()" 383 | ], 384 | "execution_count": null, 385 | "outputs": [] 386 | }, 387 | { 388 | "cell_type": "markdown", 389 | "metadata": { 390 | "id": "XYmfb5YGcSP8" 391 | }, 392 | "source": [ 393 | "The first pass is pretty awful. The topics are dominated by stop words such that they essentially look all the same. Let's see if we can do better." 394 | ] 395 | }, 396 | { 397 | "cell_type": "markdown", 398 | "metadata": { 399 | "id": "Kf0X-w47svTF" 400 | }, 401 | "source": [ 402 | "# Improving preprocessing for better results." 403 | ] 404 | }, 405 | { 406 | "cell_type": "markdown", 407 | "metadata": { 408 | "id": "AkI5wxWccz8U" 409 | }, 410 | "source": [ 411 | "For our next attempt, we'll\n", 412 | "- remove stop words using the default spaCy stopword list. Given this is a corpus of news articles, there may be other stop words to consider such as salutations (\"Mr\", \"Mrs\"), and words related to quotes and thoughts (\"say\", \"think\"). But for this, we'll stick to defaults unless we see reason to do otherwise.\n", 413 | "- consider only the words the spaCy tagger flags as *nouns, verbs,* and *adjectives*. Including words with only certain POS tags is a common approach to improving topic models.\n", 414 | "- take the lemma." 415 | ] 416 | }, 417 | { 418 | "cell_type": "code", 419 | "source": [ 420 | "nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])\n", 421 | "\n", 422 | "def improved_filter(tokenized_doc):\n", 423 | " return [t.lemma_ for t in tokenized_doc if\n", 424 | " t.is_alpha and \\\n", 425 | " not t.is_punct and \\\n", 426 | " not t.is_space and \\\n", 427 | " not t.is_stop and \\\n", 428 | " t.pos_ in ['NOUN', 'VERB', 'ADJ']]" 429 | ], 430 | "metadata": { 431 | "id": "i1emkEmz1pYd" 432 | }, 433 | "execution_count": null, 434 | "outputs": [] 435 | }, 436 | { 437 | "cell_type": "code", 438 | "metadata": { 439 | "id": "NLqKeoy9FQED" 440 | }, 441 | "source": [ 442 | "# We'll need to retokenize everything and rebuild the BOWs. Because we're now\n", 443 | "# using the POS tagger, this will take longer. 
The \"w_pos\" in the variable \n", 444 | "# names below just means \"with part-of-speech\".\n", 445 | "%%time\n", 446 | "tokenized_articles_w_pos = list(map(improved_filter, nlp.pipe(dataset, n_process=NUM_PROCESS)))\n", 447 | "dictionary_w_pos = corpora.Dictionary(tokenized_articles_w_pos)\n", 448 | "corpus_bow_w_pos = [dictionary_w_pos.doc2bow(article) for article in tokenized_articles_w_pos]" 449 | ], 450 | "execution_count": null, 451 | "outputs": [] 452 | }, 453 | { 454 | "cell_type": "code", 455 | "metadata": { 456 | "id": "5sNd_PZypu13" 457 | }, 458 | "source": [ 459 | "%%time\n", 460 | "lda_model = models.LdaModel(corpus=corpus_bow_w_pos, num_topics=NUM_TOPICS, id2word=dictionary_w_pos, random_state=1)" 461 | ], 462 | "execution_count": null, 463 | "outputs": [] 464 | }, 465 | { 466 | "cell_type": "code", 467 | "metadata": { 468 | "id": "aG5iFkrQqyx5" 469 | }, 470 | "source": [ 471 | "lda_model.print_topics()" 472 | ], 473 | "execution_count": null, 474 | "outputs": [] 475 | }, 476 | { 477 | "cell_type": "markdown", 478 | "metadata": { 479 | "id": "0ckwYRIqgOtB" 480 | }, 481 | "source": [ 482 | "This is better but there are still a few low-signal words dominating topics such as \"said\" lemmatized to \"say\" which makes sense for a news corpus. Perhaps trimming the vocabulary and tuning the model parameters themselves can lead to something more interpretable." 483 | ] 484 | }, 485 | { 486 | "cell_type": "markdown", 487 | "metadata": { 488 | "id": "w_8oBuWxvqdl" 489 | }, 490 | "source": [ 491 | "# Trimming low- and high-frequency words." 492 | ] 493 | }, 494 | { 495 | "cell_type": "markdown", 496 | "metadata": { 497 | "id": "KFDI1BSLgxJw" 498 | }, 499 | "source": [ 500 | "One thing we can try is filtering out rare and common tokens.\n", 501 | "https://radimrehurek.com/gensim/corpora/dictionary.html#gensim.corpora.dictionary.Dictionary.filter_extremes" 502 | ] 503 | }, 504 | { 505 | "cell_type": "code", 506 | "metadata": { 507 | "id": "4YQctCWVhnL6" 508 | }, 509 | "source": [ 510 | "# The size of the dictionary before filtering.\n", 511 | "len(dictionary_w_pos)" 512 | ], 513 | "execution_count": null, 514 | "outputs": [] 515 | }, 516 | { 517 | "cell_type": "markdown", 518 | "metadata": { 519 | "id": "k8tzEnZKyfeC" 520 | }, 521 | "source": [ 522 | "The filtering is a bit idiosyncratic. The lower bound is an *absolute* number, and the upper bound is a *percentage*. Here, we're saying filter out words which occur in fewer than N documents and more than M% of the documents." 
523 | ] 524 | }, 525 | { 526 | "cell_type": "code", 527 | "metadata": { 528 | "id": "lyCG8tLIp2QC" 529 | }, 530 | "source": [ 531 | "dictionary_w_pos.filter_extremes(no_below=5, no_above=0.5)" 532 | ], 533 | "execution_count": null, 534 | "outputs": [] 535 | }, 536 | { 537 | "cell_type": "code", 538 | "metadata": { 539 | "id": "6AvypKffhpyR" 540 | }, 541 | "source": [ 542 | "# The size of the dictionary after filtering.\n", 543 | "len(dictionary_w_pos)" 544 | ], 545 | "execution_count": null, 546 | "outputs": [] 547 | }, 548 | { 549 | "cell_type": "code", 550 | "metadata": { 551 | "id": "uWomtBFzhuO5" 552 | }, 553 | "source": [ 554 | "# Rebuild bag of words.\n", 555 | "corpus_bow_w_pos_filtered = [dictionary_w_pos.doc2bow(article) for article in tokenized_articles_w_pos]" 556 | ], 557 | "execution_count": null, 558 | "outputs": [] 559 | }, 560 | { 561 | "cell_type": "markdown", 562 | "metadata": { 563 | "id": "hVtALC9yYB9Z" 564 | }, 565 | "source": [ 566 | "This time, we're passing additional arguments when building the model. *alpha* is the prior on the document-topic distribution, and *eta* is the prior on the topic-word distribution (this was *beta* in the slides), and *passes* is the number of complete passes through the corpus during training.
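\n",
"\n",
"As a rough intuition for what these Dirichlet priors control, here's a small standalone sketch (it only uses NumPy and is separate from the model training below): smaller concentration values produce sparser distributions, which is why low *alpha*/*eta* values mean documents dominated by a few topics and topics dominated by a few words.\n",
"```python\n",
"import numpy as np\n",
"\n",
"rng = np.random.default_rng(1)\n",
"# Small concentration: most of the probability mass lands on a few components.\n",
"print(rng.dirichlet([0.1] * 5, size=2).round(2))\n",
"# Large concentration: mass is spread much more evenly.\n",
"print(rng.dirichlet([10.0] * 5, size=2).round(2))\n",
"```\n",
"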
\n", 567 | "https://radimrehurek.com/gensim/models/ldamodel.html?highlight=lda#module-gensim.models.ldamodel" 568 | ] 569 | }, 570 | { 571 | "cell_type": "code", 572 | "metadata": { 573 | "id": "S8_-mIdSvqUc" 574 | }, 575 | "source": [ 576 | "%%time\n", 577 | "lda_model = models.ldamodel.LdaModel(corpus=corpus_bow_w_pos_filtered,\n", 578 | " id2word=dictionary_w_pos,\n", 579 | " num_topics=NUM_TOPICS,\n", 580 | " passes=10,\n", 581 | " alpha='auto',\n", 582 | " eta='auto',\n", 583 | " random_state=1)" 584 | ], 585 | "execution_count": null, 586 | "outputs": [] 587 | }, 588 | { 589 | "cell_type": "code", 590 | "metadata": { 591 | "id": "iR2xCvNZvqDn" 592 | }, 593 | "source": [ 594 | "lda_model.print_topics()" 595 | ], 596 | "execution_count": null, 597 | "outputs": [] 598 | }, 599 | { 600 | "cell_type": "markdown", 601 | "source": [ 602 | "With improved filtering and low- and high-frequency words trimmed, we can see the topic-word distributions containing certain themes such as crime, travel, entertainment, etc.

\n", 603 | "**NOTE:** Remember that the topic model doesn't label topics for us. It just converges on collections of terms that likely form topics." 604 | ], 605 | "metadata": { 606 | "id": "o3dM-87PxPSY" 607 | } 608 | }, 609 | { 610 | "cell_type": "markdown", 611 | "source": [ 612 | "We set the training algorithm to learn priors for *alpha* and *eta*." 613 | ], 614 | "metadata": { 615 | "id": "HVipraNhL2fX" 616 | } 617 | }, 618 | { 619 | "cell_type": "code", 620 | "source": [ 621 | "print(lda_model.alpha)\n", 622 | "print(lda_model.eta)" 623 | ], 624 | "metadata": { 625 | "id": "5aimFUJGw4gT" 626 | }, 627 | "execution_count": null, 628 | "outputs": [] 629 | }, 630 | { 631 | "cell_type": "markdown", 632 | "source": [ 633 | "The *alpha* and *eta* values the training algorithm arrived at are well below 1. This translates to most articles being dominated by one or just a few topics, and most topics being dominated by a handful of words." 634 | ], 635 | "metadata": { 636 | "id": "Aj86WOUlL0zj" 637 | } 638 | }, 639 | { 640 | "cell_type": "markdown", 641 | "metadata": { 642 | "id": "auRQbV8Ajaz8" 643 | }, 644 | "source": [ 645 | "We can look at the topic distribution comprising a given article using the model's *get_document_topics* method.
\n", 646 | "https://radimrehurek.com/gensim/models/ldamodel.html#gensim.models.ldamodel.LdaModel.get_document_topics" 647 | ] 648 | }, 649 | { 650 | "cell_type": "code", 651 | "source": [ 652 | "article_idx = 0\n", 653 | "print(dataset[article_idx][:300])" 654 | ], 655 | "metadata": { 656 | "id": "7naCCCX1Nb2Z" 657 | }, 658 | "execution_count": null, 659 | "outputs": [] 660 | }, 661 | { 662 | "cell_type": "code", 663 | "source": [ 664 | "# Return topic distribution for an article sorted by probability.\n", 665 | "topics = sorted(lda_model.get_document_topics(corpus_bow_w_pos_filtered[article_idx]), key=lambda tup: tup[1])[::-1]\n", 666 | "topics" 667 | ], 668 | "metadata": { 669 | "id": "DrGy3dO019LL" 670 | }, 671 | "execution_count": null, 672 | "outputs": [] 673 | }, 674 | { 675 | "cell_type": "markdown", 676 | "source": [ 677 | "We can get the top words (10 by default) representing a topic using the model's *show_topic* method.\n", 678 | "https://radimrehurek.com/gensim/models/ldamodel.html#gensim.models.ldamodel.LdaModel.show_topic" 679 | ], 680 | "metadata": { 681 | "id": "85ztp46j13OL" 682 | } 683 | }, 684 | { 685 | "cell_type": "code", 686 | "source": [ 687 | "# View the words of the top topic from the previous article.\n", 688 | "lda_model.show_topic(topics[0][0])" 689 | ], 690 | "metadata": { 691 | "id": "aoA0ATU016Tn" 692 | }, 693 | "execution_count": null, 694 | "outputs": [] 695 | }, 696 | { 697 | "cell_type": "code", 698 | "source": [ 699 | "# View the words of the second-most prevalent topic from the previous article.\n", 700 | "lda_model.show_topic(topics[1][0])" 701 | ], 702 | "metadata": { 703 | "id": "oKJ9pvL2HQ3q" 704 | }, 705 | "execution_count": null, 706 | "outputs": [] 707 | }, 708 | { 709 | "cell_type": "markdown", 710 | "source": [ 711 | "The function below takes a document index and returns a **DataFrame** containing:\n", 712 | "1. the topics comprising the document up to a minimum probability.\n", 713 | "2. the top words of each topic.\n", 714 | "
\n", 715 | "\n", 716 | "https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html" 717 | ], 718 | "metadata": { 719 | "id": "VbsiukJ414XD" 720 | } 721 | }, 722 | { 723 | "cell_type": "code", 724 | "source": [ 725 | "def get_top_topics(article_idx, min_topic_prob):\n", 726 | "\n", 727 | " # Sort from highest to lowest topic probability.\n", 728 | " topic_prob_pairs = sorted(lda_model.get_document_topics(corpus_bow_w_pos_filtered[article_idx],\n", 729 | " minimum_probability=min_topic_prob),\n", 730 | " key=lambda tup: tup[1])[::-1]\n", 731 | "\n", 732 | " word_prob_pairs = [lda_model.show_topic(pair[0]) for pair in topic_prob_pairs]\n", 733 | " topic_words = [[pair[0] for pair in collection] for collection in word_prob_pairs]\n", 734 | "\n", 735 | " data = {\n", 736 | " 'Major Topics': topic_prob_pairs,\n", 737 | " 'Topic Words': topic_words\n", 738 | " }\n", 739 | "\n", 740 | " return pd.DataFrame(data)\n" 741 | ], 742 | "metadata": { 743 | "id": "o8F3dsBk2Oh2" 744 | }, 745 | "execution_count": null, 746 | "outputs": [] 747 | }, 748 | { 749 | "cell_type": "code", 750 | "source": [ 751 | "pd.set_option('max_colwidth', 600)\n", 752 | "snippet_length = 300\n", 753 | "min_topic_prob = 0.25\n", 754 | "\n", 755 | "article_idx = 1\n", 756 | "print(dataset[article_idx][:snippet_length])\n", 757 | "get_top_topics(article_idx, min_topic_prob)" 758 | ], 759 | "metadata": { 760 | "id": "y7HwvNlH3KNL" 761 | }, 762 | "execution_count": null, 763 | "outputs": [] 764 | }, 765 | { 766 | "cell_type": "code", 767 | "source": [ 768 | "article_idx = 10\n", 769 | "print(dataset[article_idx][:snippet_length])\n", 770 | "get_top_topics(article_idx, min_topic_prob)" 771 | ], 772 | "metadata": { 773 | "id": "RgbK19OAYD6T" 774 | }, 775 | "execution_count": null, 776 | "outputs": [] 777 | }, 778 | { 779 | "cell_type": "code", 780 | "source": [ 781 | "article_idx = 100\n", 782 | "print(dataset[article_idx][:snippet_length])\n", 783 | "get_top_topics(article_idx, min_topic_prob)" 784 | ], 785 | "metadata": { 786 | "id": "ucpGCL0cYD2V" 787 | }, 788 | "execution_count": null, 789 | "outputs": [] 790 | }, 791 | { 792 | "cell_type": "code", 793 | "source": [ 794 | "article_idx = 1000\n", 795 | "print(dataset[article_idx][:snippet_length])\n", 796 | "get_top_topics(article_idx, min_topic_prob)" 797 | ], 798 | "metadata": { 799 | "id": "KzeM3QEbYDyi" 800 | }, 801 | "execution_count": null, 802 | "outputs": [] 803 | }, 804 | { 805 | "cell_type": "code", 806 | "source": [ 807 | "article_idx = 10000\n", 808 | "print(dataset[article_idx][:snippet_length])\n", 809 | "get_top_topics(article_idx, 0.25)" 810 | ], 811 | "metadata": { 812 | "id": "vT5gxoP9YDuv" 813 | }, 814 | "execution_count": null, 815 | "outputs": [] 816 | }, 817 | { 818 | "cell_type": "markdown", 819 | "metadata": { 820 | "id": "sCr_9vWPvuU9" 821 | }, 822 | "source": [ 823 | "The results of this model look the best so far and we can see a human-interpretable link between the distribution of topics in a document, the distribution of words in each topic, and the content of the document itself." 
824 | ]
825 | },
826 | {
827 | "cell_type": "markdown",
828 | "metadata": {
829 | "id": "xRCf02nVvpfW"
830 | },
831 | "source": [
832 | "# Evaluation and Visualization"
833 | ]
834 | },
835 | {
836 | "cell_type": "markdown",
837 | "metadata": {
838 | "id": "_TXlK5gebUjB"
839 | },
840 | "source": [
841 | "## Measuring topic models with coherence.\n",
842 | "\n",
843 | "If a topic is a mixture of particular words, then one way to measure a topic's semantic coherence is to calculate co-occurrence among the words. That is, how often the top words in a topic co-occur together among the documents versus how often they occur independently.\n",
844 | "\n",
845 | "Gensim's **Coherence Model** offers coherence implemented as a pipeline:
\n", 846 | "https://radimrehurek.com/gensim/models/coherencemodel.html\n", 847 | "
\n", 848 | "
\n", 849 | "See this paper for a detailed description of the pipeline as well as different co-occurence measures proposed:
\n", 850 | "http://svn.aksw.org/papers/2015/WSDM_Topic_Evaluation/public.pdf\n", 851 | "
\n", 852 | "
\n", 853 | "Topic model evaluation is a difficult subject with no clear quantitative approach and is still debated. A higher (or lower score depending on the measure) doesn't necessarily translate to a higher *qualitative* model. That is, the score a human would give looking at the topic words and how interpretable they are.

\n", 854 | "It's possible to favour a poorer scoring model because it serves a particular purpose better. Perhaps it's better to score the effectiveness of topic models based on performance in downstream tasks? See these videos for the problems with quantitative topic model evaluation:
\n", 855 | "[Matti Lyra - Evaluating Topic Models](https://www.youtube.com/watch?v=UkmIljRIG_M)
\n", 856 | "[Is Topic Model Evaluation Broken? The Incoherence of Coherence](https://www.youtube.com/watch?v=4KO2TO_cm2I)" 857 | ] 858 | }, 859 | { 860 | "cell_type": "code", 861 | "metadata": { 862 | "id": "nHBp-ZazNZRJ" 863 | }, 864 | "source": [ 865 | "%%time\n", 866 | "coherence_model_lda = CoherenceModel(model=lda_model, texts=tokenized_articles_w_pos, dictionary=dictionary_w_pos, coherence='u_mass')\n", 867 | "coherence_lda = coherence_model_lda.get_coherence()\n", 868 | "print('\\nCoherence Score: ', coherence_lda)" 869 | ], 870 | "execution_count": null, 871 | "outputs": [] 872 | }, 873 | { 874 | "cell_type": "markdown", 875 | "metadata": { 876 | "id": "Kfunq1Su8d1r" 877 | }, 878 | "source": [ 879 | "## Human evaluation\n", 880 | "Because the quantitative metrics aren't entirely correlated with quality, human judgment still plays a large role in topic model evaluation.\n" 881 | ] 882 | }, 883 | { 884 | "cell_type": "markdown", 885 | "metadata": { 886 | "id": "crPK6zKfC1gS" 887 | }, 888 | "source": [ 889 | "We can get someone to look at the topic words to see how interpretable they are. " 890 | ] 891 | }, 892 | { 893 | "cell_type": "markdown", 894 | "source": [ 895 | "There are also subjective tests like **word intrusion** and **topic intrusion**.\n", 896 | "

\n", 897 | "**Word intrusion** is taking words which belong to a topic, injecting a word from another topic into the collection, and seeing whether a human can easily identify the intruder word. The more easily the intruder word is spotted, the more well-formed the topic. For example, which word doesn't belong in this topic?
\n", 898 | "*{apple, lemon, tomato, horse, grape}*" 899 | ], 900 | "metadata": { 901 | "id": "GRMHpNksr0bQ" 902 | } 903 | }, 904 | { 905 | "cell_type": "markdown", 906 | "metadata": { 907 | "id": "qYfEicOH8d1t" 908 | }, 909 | "source": [ 910 | "We can also visualize them with word clouds." 911 | ] 912 | }, 913 | { 914 | "cell_type": "code", 915 | "metadata": { 916 | "id": "4qY6uzIW8d1t" 917 | }, 918 | "source": [ 919 | "def render_word_cloud(model, rows, cols, max_words):\n", 920 | " word_cloud = WordCloud(background_color='white', max_words=max_words, prefer_horizontal=1.0)\n", 921 | " fig, axes = plt.subplots(rows, cols, figsize=(15,15))\n", 922 | "\n", 923 | " for i, ax in enumerate(axes.flatten()):\n", 924 | " fig.add_subplot(ax)\n", 925 | " topic_words = dict(model.show_topic(i))\n", 926 | " word_cloud.generate_from_frequencies(topic_words)\n", 927 | " plt.gca().imshow(word_cloud, interpolation='bilinear')\n", 928 | " plt.gca().set_title('Topic {id}'.format(id=i))\n", 929 | " plt.gca().axis('off')\n", 930 | "\n", 931 | " plt.axis('off')\n", 932 | " plt.show()" 933 | ], 934 | "execution_count": null, 935 | "outputs": [] 936 | }, 937 | { 938 | "cell_type": "code", 939 | "metadata": { 940 | "id": "F3e6HjGtzNnG" 941 | }, 942 | "source": [ 943 | "# Here we'll visualize the first nine topics.\n", 944 | "render_word_cloud(lda_model, 3, 3, 10)" 945 | ], 946 | "execution_count": null, 947 | "outputs": [] 948 | }, 949 | { 950 | "cell_type": "markdown", 951 | "metadata": { 952 | "id": "FpBxrPcGEOcN" 953 | }, 954 | "source": [ 955 | "# Finding similar documents." 956 | ] 957 | }, 958 | { 959 | "cell_type": "markdown", 960 | "metadata": { 961 | "id": "sJtKEzTE8TSE" 962 | }, 963 | "source": [ 964 | "Gensim has a **similarities** module which can build an index for a given set of documents. Here, we're using **MatrixSimilarity** which computes cosine similarity across a corpus and stores them in an index.
\n", 965 | "https://radimrehurek.com/gensim/similarities/docsim.html#gensim.similarities.docsim.MatrixSimilarity" 966 | ] 967 | }, 968 | { 969 | "cell_type": "code", 970 | "metadata": { 971 | "id": "vq9EYQJWkib2" 972 | }, 973 | "source": [ 974 | "lda_index = similarities.MatrixSimilarity(lda_model[corpus_bow_w_pos_filtered], num_features=len(dictionary_w_pos))" 975 | ], 976 | "execution_count": null, 977 | "outputs": [] 978 | }, 979 | { 980 | "cell_type": "markdown", 981 | "metadata": { 982 | "id": "bHorc8fN9VHu" 983 | }, 984 | "source": [ 985 | "Here's a utility function to help retrieve the *first_m_words* of the *top_n* most similar documents. If you're curious about the *\\_\\_getitem\\__* method on the LDA Model class, you can find the code here:
\n", 986 | "https://github.com/RaRe-Technologies/gensim/blob/master/gensim/models/ldamodel.py" 987 | ] 988 | }, 989 | { 990 | "cell_type": "code", 991 | "metadata": { 992 | "id": "x6hIGoVYF6Rb" 993 | }, 994 | "source": [ 995 | "def get_similar_articles(index, model, article_bow, top_n=5, first_m_words=300):\n", 996 | " # model[article_bow] retrieves the topic distribution for the BOW.\n", 997 | " # index[model[article_bow] compares the topic distribution for the BOW against the similarity index previously computed.\n", 998 | " similar_docs = index[model[article_bow]]\n", 999 | " top_n_docs = sorted(enumerate(similar_docs), key=lambda item: -item[1])[1:top_n+1]\n", 1000 | " \n", 1001 | " # Return a list of tuples with each tuple: (article id, similarity score, first_m_words of article)\n", 1002 | " return list(map(lambda entry: (entry[0], entry[1], articles[entry[0]][:first_m_words]), top_n_docs))" 1003 | ], 1004 | "execution_count": null, 1005 | "outputs": [] 1006 | }, 1007 | { 1008 | "cell_type": "code", 1009 | "metadata": { 1010 | "id": "c4GV6jxI-Q8i" 1011 | }, 1012 | "source": [ 1013 | "article_idx = 0\n", 1014 | "print(dataset[article_idx][:snippet_length], '\\n')\n", 1015 | "get_similar_articles(lda_index, lda_model, corpus_bow_w_pos_filtered[article_idx])" 1016 | ], 1017 | "execution_count": null, 1018 | "outputs": [] 1019 | }, 1020 | { 1021 | "cell_type": "code", 1022 | "source": [ 1023 | "article_idx = 10\n", 1024 | "print(dataset[article_idx][:snippet_length], '\\n')\n", 1025 | "get_similar_articles(lda_index, lda_model, corpus_bow_w_pos_filtered[article_idx])" 1026 | ], 1027 | "metadata": { 1028 | "id": "d6rlTxY5zlCe" 1029 | }, 1030 | "execution_count": null, 1031 | "outputs": [] 1032 | }, 1033 | { 1034 | "cell_type": "code", 1035 | "source": [ 1036 | "article_idx = 100\n", 1037 | "print(dataset[article_idx][:snippet_length], '\\n')\n", 1038 | "get_similar_articles(lda_index, lda_model, corpus_bow_w_pos_filtered[article_idx])" 1039 | ], 1040 | "metadata": { 1041 | "id": "JQyVGB1Kzk7Y" 1042 | }, 1043 | "execution_count": null, 1044 | "outputs": [] 1045 | }, 1046 | { 1047 | "cell_type": "markdown", 1048 | "metadata": { 1049 | "id": "-8eWhXxhBEl7" 1050 | }, 1051 | "source": [ 1052 | "We can also query for documents similar to new, unseen documents. Below are short, actual blurbs from 2021 involving stock options and crime. Keep in mind that if this were a really old news corpus, then excerpts about cryptocurrencies and social media probably won't lead to good matches. This is another aspect to keep in mind when thinking about your data and use cases." 1053 | ] 1054 | }, 1055 | { 1056 | "cell_type": "code", 1057 | "metadata": { 1058 | "id": "rs4DF3CqODIp" 1059 | }, 1060 | "source": [ 1061 | "test_article = \"Capricorn Business Acquisitions Inc. 
(TSXV: CAK.H) (the “Company“) is pleased to announce that its board has approved the issuance of 70,000 stock options (“Stock Options“) to directors on April 19, 2020.\"\n", 1062 | "\n", 1063 | "article_tokens = list(map(improved_filter, [nlp(test_article)]))[0]\n", 1064 | "article_bow = dictionary_w_pos.doc2bow(article_tokens)\n", 1065 | "get_similar_articles(lda_index, lda_model, article_bow)" 1066 | ], 1067 | "execution_count": null, 1068 | "outputs": [] 1069 | }, 1070 | { 1071 | "cell_type": "code", 1072 | "source": [ 1073 | "test_article = \"DEA agent sentenced to 12 years in prison for conspiring with Colombian drug cartel.\"\n", 1074 | "\n", 1075 | "article_tokens = list(map(improved_filter, [nlp(test_article)]))[0]\n", 1076 | "article_bow = dictionary_w_pos.doc2bow(article_tokens)\n", 1077 | "get_similar_articles(lda_index, lda_model, article_bow)" 1078 | ], 1079 | "metadata": { 1080 | "id": "NpejaKM51Sos" 1081 | }, 1082 | "execution_count": null, 1083 | "outputs": [] 1084 | }, 1085 | { 1086 | "cell_type": "markdown", 1087 | "metadata": { 1088 | "id": "NMh0xLVmuKwW" 1089 | }, 1090 | "source": [ 1091 | "# Closing Thoughts and things to explore.\n", 1092 | "- Gensim infers topic and word distributions through [Variational Bayes (VB)](https://en.wikipedia.org/wiki/Variational_Bayesian_methods), not Gibbs Sampling. From the topics I've seen, Gibbs Sampling tends to lead to more interpretable topics, but VB is faster and Gensim offers the additional benefits of streaming documents, online learning, and training across a cluster of machines.\n", 1093 | "- Another topic modelling library, [Mallet](http://mallet.cs.umass.edu/), infers through Gibbs Sampling but is Java-based. Unfortunately, Gensim 4.0+ no longer offers a wrapper around Mallet. But if you're comfortable with Java, it may be worth exploring.\n", 1094 | "- Scikit-learn offers an [LDA model](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.LatentDirichletAllocation.html). Maybe as an exercise, try using that LDA model on the [20 Newsgroups](https://scikit-learn.org/0.19/datasets/twenty_newsgroups.html) dataset (or ideally, a dataset with longer documents).\n", 1095 | "- [pyLDAvis](https://github.com/bmabey/pyLDAvis) is another means of visualizing topic models. You can see it in action in this [notebook](https://nbviewer.jupyter.org/github/bmabey/pyLDAvis/blob/master/notebooks/pyLDAvis_overview.ipynb). See if you can get it working on your own topic model.\n", 1096 | "- LDA tends to work better on longer documents, and whether a topic model is \"good\" depends on your use case rather than strictly on a quantitative metric." 
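,
"\n",
"As a rough starting point for the scikit-learn exercise mentioned above (a standalone sketch; the parameter choices are arbitrary, and it assumes a recent scikit-learn with *get_feature_names_out*):\n",
"```python\n",
"from sklearn.datasets import fetch_20newsgroups\n",
"from sklearn.decomposition import LatentDirichletAllocation\n",
"from sklearn.feature_extraction.text import CountVectorizer\n",
"\n",
"docs = fetch_20newsgroups(subset='train', remove=('headers', 'footers', 'quotes')).data\n",
"vect = CountVectorizer(stop_words='english', min_df=5, max_df=0.5)\n",
"X = vect.fit_transform(docs)\n",
"lda = LatentDirichletAllocation(n_components=20, random_state=1).fit(X)\n",
"terms = vect.get_feature_names_out()\n",
"for idx, topic in enumerate(lda.components_):\n",
"    print(idx, [terms[i] for i in topic.argsort()[-10:][::-1]])\n",
"```"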
1097 | ] 1098 | } 1099 | ] 1100 | } 1101 | -------------------------------------------------------------------------------- /notebooks/nlpdemystified_classification_naive_bayes.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "nbformat": 4, 3 | "nbformat_minor": 0, 4 | "metadata": { 5 | "colab": { 6 | "private_outputs": true, 7 | "provenance": [], 8 | "include_colab_link": true 9 | }, 10 | "kernelspec": { 11 | "name": "python3", 12 | "display_name": "Python 3" 13 | } 14 | }, 15 | "cells": [ 16 | { 17 | "cell_type": "markdown", 18 | "metadata": { 19 | "id": "view-in-github", 20 | "colab_type": "text" 21 | }, 22 | "source": [ 23 | "\"Open" 24 | ] 25 | }, 26 | { 27 | "cell_type": "markdown", 28 | "metadata": { 29 | "id": "egwknkyvG1-C" 30 | }, 31 | "source": [ 32 | "# Natural Language Processing Demystified | Classification with Naive Bayes\n", 33 | "https://nlpdemystified.org
\n", 34 | "https://github.com/futuremojo/nlp-demystified\n", 35 | "

\n", 36 | "Course module for this demo: https://www.nlpdemystified.org/course/naive-bayes" 37 | ] 38 | }, 39 | { 40 | "cell_type": "markdown", 41 | "metadata": { 42 | "id": "sHp03BS8Hhmp" 43 | }, 44 | "source": [ 45 | "# spaCy upgrade and package installation." 46 | ] 47 | }, 48 | { 49 | "cell_type": "markdown", 50 | "metadata": { 51 | "id": "TATDBJisHA_w" 52 | }, 53 | "source": [ 54 | "At the time this notebook was created, spaCy had newer releases but Colab was still using version 2.x by default. So the first step is to upgrade spaCy and download a statistical model for English.\n", 55 | "

\n", 56 | "**IMPORTANT**
\n", 57 | "If you're running this in the cloud, then the notebook will *timeout* after a period of inactivity. If that happens and you don't reconnect in time, you will need to upgrade spaCy again and reinstall the requisite statistical package(s).\n", 58 | "

\n", 59 | "Refer to this link on how to run Colab notebooks locally on your machine to avoid this issue:
\n", 60 | "https://research.google.com/colaboratory/local-runtimes.html" 61 | ] 62 | }, 63 | { 64 | "cell_type": "code", 65 | "metadata": { 66 | "id": "K61NFIfSHAn4" 67 | }, 68 | "source": [ 69 | "!pip install -U spacy==3.*\n", 70 | "!python -m spacy download en_core_web_sm\n", 71 | "!python -m spacy info" 72 | ], 73 | "execution_count": null, 74 | "outputs": [] 75 | }, 76 | { 77 | "cell_type": "code", 78 | "source": [ 79 | "import matplotlib.pyplot as plt\n", 80 | "import numpy as np\n", 81 | "import pandas as pd\n", 82 | "import spacy\n", 83 | "\n", 84 | "from sklearn import metrics\n", 85 | "from sklearn import model_selection\n", 86 | "from sklearn.datasets import fetch_20newsgroups\n", 87 | "from sklearn.dummy import DummyClassifier\n", 88 | "from sklearn.feature_extraction.text import TfidfVectorizer\n", 89 | "from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay\n", 90 | "from sklearn.model_selection import train_test_split\n", 91 | "from sklearn.naive_bayes import MultinomialNB\n", 92 | "from sklearn.pipeline import Pipeline" 93 | ], 94 | "metadata": { 95 | "id": "9AQ6Nyad3kiK" 96 | }, 97 | "execution_count": null, 98 | "outputs": [] 99 | }, 100 | { 101 | "cell_type": "markdown", 102 | "metadata": { 103 | "id": "ozlSs1Tz5f7M" 104 | }, 105 | "source": [ 106 | "# First pass at building a Naive Bayes model.\n" 107 | ] 108 | }, 109 | { 110 | "cell_type": "markdown", 111 | "metadata": { 112 | "id": "auwysA2BBfuE" 113 | }, 114 | "source": [ 115 | "As with our TF-IDF demo, we'll use the **20 newsgroups** dataset, a labelled dataset of 18,000 newsgroup posts across 20 topics.
\n", 116 | "https://scikit-learn.org/stable/datasets/real_world.html#the-20-newsgroups-text-dataset\n", 117 | "\n", 118 | "This time around, rather than fetching the posts from only one topic, we'll fetch the entire collection." 119 | ] 120 | }, 121 | { 122 | "cell_type": "code", 123 | "metadata": { 124 | "id": "eCPNpfOIB8fT" 125 | }, 126 | "source": [ 127 | "# To build our model, we want the training subset only. The training\n", 128 | "# subset is what gets downloaded by default but we explicitly\n", 129 | "# pass the parameter here for clarity.\n", 130 | "training_corpus = fetch_20newsgroups(subset='train')" 131 | ], 132 | "execution_count": null, 133 | "outputs": [] 134 | }, 135 | { 136 | "cell_type": "code", 137 | "metadata": { 138 | "id": "qjPpaJitJ0Zf" 139 | }, 140 | "source": [ 141 | "print('Training data size: {}'.format(len(training_corpus.data)))" 142 | ], 143 | "execution_count": null, 144 | "outputs": [] 145 | }, 146 | { 147 | "cell_type": "markdown", 148 | "metadata": { 149 | "id": "sLrleHWZ6b6n" 150 | }, 151 | "source": [ 152 | "The training data we downloaded not only includes the posts but also a label (\"target\") for each post representing its topic. The posts are an array of strings while the labels are a corresponding array of numeric labels." 153 | ] 154 | }, 155 | { 156 | "cell_type": "code", 157 | "metadata": { 158 | "id": "JTiTDsbiDyKW" 159 | }, 160 | "source": [ 161 | "# These are the possible topics a post can belong to.\n", 162 | "training_corpus.target_names" 163 | ], 164 | "execution_count": null, 165 | "outputs": [] 166 | }, 167 | { 168 | "cell_type": "code", 169 | "metadata": { 170 | "id": "O2CVP3GME8Yo" 171 | }, 172 | "source": [ 173 | "# These are the labels/targets for each post.\n", 174 | "print(training_corpus.target)" 175 | ], 176 | "execution_count": null, 177 | "outputs": [] 178 | }, 179 | { 180 | "cell_type": "code", 181 | "metadata": { 182 | "id": "gPuaD40XD3V-" 183 | }, 184 | "source": [ 185 | "# The first post along with its corresponding label.\n", 186 | "print(training_corpus.data[0])\n", 187 | "\n", 188 | "first_doc_label = training_corpus.target[0]\n", 189 | "print('Label for this post: {}'.format(first_doc_label))\n", 190 | "print('Corresponding topic: {}'.format(training_corpus.target_names[first_doc_label]))" 191 | ], 192 | "execution_count": null, 193 | "outputs": [] 194 | }, 195 | { 196 | "cell_type": "markdown", 197 | "metadata": { 198 | "id": "LCaA_rWSGny2" 199 | }, 200 | "source": [ 201 | "When starting off with a dataset, it's a good idea to check its distribution. In this case, we can see at a glance this dataset is relatively balanced." 202 | ] 203 | }, 204 | { 205 | "cell_type": "code", 206 | "metadata": { 207 | "id": "IRbTT7vtDbes" 208 | }, 209 | "source": [ 210 | "bins, counts = np.unique(training_corpus.target, return_counts=True)\n", 211 | "freq_series = pd.Series(counts/len(training_corpus.data))\n", 212 | "plt.figure(figsize=(12, 8))\n", 213 | "ax = freq_series.plot(kind='bar')\n", 214 | "ax.set_xticklabels(bins, rotation=0)\n", 215 | "plt.show()" 216 | ], 217 | "execution_count": null, 218 | "outputs": [] 219 | }, 220 | { 221 | "cell_type": "markdown", 222 | "metadata": { 223 | "id": "akAhiAsuH0Vm" 224 | }, 225 | "source": [ 226 | "Now that we have our training set, we can split it further into train and validation sets (remember the test set, in this case, is a separate download). Creating a validation set isn't always necessary. 
If you have a small training set like this one, you can use alternative techniques like cross-validation but we'll show a split here since we talked about it in the model building module. scikit-learn has a module to help us do this.
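\n",
"\n",
"(If you do want to go the cross-validation route instead, a rough standalone sketch using the imports above might look like this; it uses the plain TfidfVectorizer tokenizer rather than the spaCy tokenizer defined later, and we won't use it in this demo.)\n",
"```python\n",
"pipe = Pipeline([('tfidf', TfidfVectorizer()), ('nb', MultinomialNB())])\n",
"scores = model_selection.cross_val_score(pipe, training_corpus.data,\n",
"                                         training_corpus.target, cv=5)\n",
"print(scores)\n",
"```\n",
"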
\n", 227 | "https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html" 228 | ] 229 | }, 230 | { 231 | "cell_type": "code", 232 | "metadata": { 233 | "id": "u3rdIx__I7JX" 234 | }, 235 | "source": [ 236 | "# Shuffle, then split the data into train and validation sets. Set the random_state \n", 237 | "# to 1 for reproducibility.\n", 238 | "train_data, val_data, train_labels, val_labels = train_test_split(training_corpus.data, training_corpus.target, train_size=0.8, random_state=1) \n", 239 | "print('Training data size: {}'.format(len(train_data)))\n", 240 | "print('Validation data size: {}'.format(len(val_data)))" 241 | ], 242 | "execution_count": null, 243 | "outputs": [] 244 | }, 245 | { 246 | "cell_type": "markdown", 247 | "metadata": { 248 | "id": "Sg6rGId2K4eb" 249 | }, 250 | "source": [ 251 | "Now that we have our train-validation split, let's create our spaCy tokenizer. Up to this point, we've been using the **en_core_web_sm** model." 252 | ] 253 | }, 254 | { 255 | "cell_type": "code", 256 | "metadata": { 257 | "id": "P4pQw-2KK2MV" 258 | }, 259 | "source": [ 260 | "nlp = spacy.load('en_core_web_sm')" 261 | ], 262 | "execution_count": null, 263 | "outputs": [] 264 | }, 265 | { 266 | "cell_type": "markdown", 267 | "metadata": { 268 | "id": "ElofxHmW7fP_" 269 | }, 270 | "source": [ 271 | "By default, it comes up with a preprocessing pipeline with several components enabled. We can view these components through the *pipe_names* attribute." 272 | ] 273 | }, 274 | { 275 | "cell_type": "code", 276 | "metadata": { 277 | "id": "AYATX6AS4k-b" 278 | }, 279 | "source": [ 280 | "nlp.pipe_names" 281 | ], 282 | "execution_count": null, 283 | "outputs": [] 284 | }, 285 | { 286 | "cell_type": "markdown", 287 | "metadata": { 288 | "id": "k2P0S-Tm7toD" 289 | }, 290 | "source": [ 291 | "In the previous demos, we individually disabled any component we didn't need. For our first pass at building a Naive Bayes classifier, we'll try tokenizing alone. Nothing else. Since that's the case, it's easier to instantiate a blank pipeline.
\n", 292 | "https://spacy.io/api/top-level#spacy.blank" 293 | ] 294 | }, 295 | { 296 | "cell_type": "code", 297 | "metadata": { 298 | "id": "AauBdVeA4znZ" 299 | }, 300 | "source": [ 301 | "nlp = spacy.blank('en')\n", 302 | "\n", 303 | "# There should be no pipeline components.\n", 304 | "nlp.pipe_names" 305 | ], 306 | "execution_count": null, 307 | "outputs": [] 308 | }, 309 | { 310 | "cell_type": "code", 311 | "metadata": { 312 | "id": "e1tLv7nF4Za2" 313 | }, 314 | "source": [ 315 | "# For this exercise, we'll remove punctuation and spaces (which\n", 316 | "# includes newlines), filter for tokens consisting of alphabetic\n", 317 | "# characters only, and return the token text.\n", 318 | "def spacy_tokenizer(doc):\n", 319 | " return [t.text for t in nlp(doc) if \\\n", 320 | " not t.is_punct and \\\n", 321 | " not t.is_space and \\\n", 322 | " t.is_alpha]" 323 | ], 324 | "execution_count": null, 325 | "outputs": [] 326 | }, 327 | { 328 | "cell_type": "markdown", 329 | "metadata": { 330 | "id": "ylefAguOKgVk" 331 | }, 332 | "source": [ 333 | "We'll vectorize using the **TfidfVectorizer**." 334 | ] 335 | }, 336 | { 337 | "cell_type": "code", 338 | "metadata": { 339 | "id": "XpX_QndoCbaz" 340 | }, 341 | "source": [ 342 | "%%time\n", 343 | "vectorizer = TfidfVectorizer(tokenizer=spacy_tokenizer)\n", 344 | "train_feature_vects = vectorizer.fit_transform(train_data)" 345 | ], 346 | "execution_count": null, 347 | "outputs": [] 348 | }, 349 | { 350 | "cell_type": "markdown", 351 | "metadata": { 352 | "id": "qWys2B9xOXiq" 353 | }, 354 | "source": [ 355 | "Scikit-learn includes a multinomial naive bayes classifier.
\n", 356 | "https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html" 357 | ] 358 | }, 359 | { 360 | "cell_type": "markdown", 361 | "metadata": { 362 | "id": "YTo6o5LyYNnq" 363 | }, 364 | "source": [ 365 | "Calling *fit* on the classifier and passing it the feature vectors and corresponding labels kicks off the training." 366 | ] 367 | }, 368 | { 369 | "cell_type": "code", 370 | "metadata": { 371 | "id": "S4RouNa1OT_U" 372 | }, 373 | "source": [ 374 | "# Instantiate a classifier with the default settings.\n", 375 | "nb_classifier = MultinomialNB()\n", 376 | "nb_classifier.fit(train_feature_vects, train_labels)\n", 377 | "nb_classifier.get_params()" 378 | ], 379 | "execution_count": null, 380 | "outputs": [] 381 | }, 382 | { 383 | "cell_type": "markdown", 384 | "metadata": { 385 | "id": "zUWJuhJlPiHS" 386 | }, 387 | "source": [ 388 | "Now that we know about the **F1 score** and have a multiclass problem, let's look at the F1 score on the training data. Since the dataset is balanced, accuracy could work here as well but we'll look at F1 since we introduced it. scikit-learn has a module called **metrics** we can leverage. It contains a variety of scoring utilities we can use.
\n", 389 | "https://scikit-learn.org/stable/modules/classes.html#module-sklearn.metrics
\n", 390 | "https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html#sklearn.metrics.f1_score
\n", 391 | "https://scikit-learn.org/stable/modules/model_evaluation.html#precision-recall-f-measure-metrics
" 392 | ] 393 | }, 394 | { 395 | "cell_type": "code", 396 | "metadata": { 397 | "id": "vJ3MctXZYB9B" 398 | }, 399 | "source": [ 400 | "# Get predictions on training set and calculate F1 score.\n", 401 | "# See documentation above for more details on what \"macro\" means.\n", 402 | "train_preds = nb_classifier.predict(train_feature_vects)\n", 403 | "print('F1 score on initial training set: {}'.format(metrics.f1_score(train_labels, train_preds, average='macro')))" 404 | ], 405 | "execution_count": null, 406 | "outputs": [] 407 | }, 408 | { 409 | "cell_type": "markdown", 410 | "metadata": { 411 | "id": "Tw76J_P0QJa5" 412 | }, 413 | "source": [ 414 | "So right off the bat, using simple preprocessing and vectorization, and the default settings on the Naive Bayes classifier, we get a model with a decent F1 score. This looks good, but there's a problem.

\n", 415 | "When we downloaded the training data, we also included headers and footers which contain metadata like *subject*, and *email*.

\n", 416 | "This can be a problem because these fields may be highly informative, causing the model to predict mostly based on the metadata rather than the post content. But if this metadata isn't available at prediction time in production, then our model is going to perform poorly.\n", 417 | "

\n", 418 | "So let's retrieve the training data again but without the headers, footers, and post quotes this time. Just raw post text. This makes the problem notably harder for reasons we'll see soon." 419 | ] 420 | }, 421 | { 422 | "cell_type": "code", 423 | "metadata": { 424 | "id": "67YsfFTtQcxF" 425 | }, 426 | "source": [ 427 | "# Remove headers, footers, and quotes from training set and resplit.\n", 428 | "filtered_training_corpus = fetch_20newsgroups(subset='train', remove=('headers', 'footers', 'quotes'))\n", 429 | "train_data, val_data, train_labels, val_labels = train_test_split(filtered_training_corpus.data, filtered_training_corpus.target, train_size=0.8, random_state=1) " 430 | ], 431 | "execution_count": null, 432 | "outputs": [] 433 | }, 434 | { 435 | "cell_type": "code", 436 | "metadata": { 437 | "id": "0ihNfcfVQcmO" 438 | }, 439 | "source": [ 440 | "# This is what a data point looks like now. Just plain post text.\n", 441 | "train_data[0]" 442 | ], 443 | "execution_count": null, 444 | "outputs": [] 445 | }, 446 | { 447 | "cell_type": "code", 448 | "metadata": { 449 | "id": "pNF3e3jzQcZW" 450 | }, 451 | "source": [ 452 | "# Revectorize our text and retrain our model.\n", 453 | "%%time\n", 454 | "train_feature_vects = vectorizer.fit_transform(train_data)\n", 455 | "nb_classifier.fit(train_feature_vects, train_labels)" 456 | ], 457 | "execution_count": null, 458 | "outputs": [] 459 | }, 460 | { 461 | "cell_type": "code", 462 | "metadata": { 463 | "id": "z0CJ3wIuQcO3" 464 | }, 465 | "source": [ 466 | "# Recheck F1 score on training data.\n", 467 | "train_preds = nb_classifier.predict(train_feature_vects)\n", 468 | "print('F1 score on filtered training set: {}'.format(metrics.f1_score(train_labels, train_preds, average='macro'))) " 469 | ], 470 | "execution_count": null, 471 | "outputs": [] 472 | }, 473 | { 474 | "cell_type": "markdown", 475 | "metadata": { 476 | "id": "qEuysjV4QcGO" 477 | }, 478 | "source": [ 479 | "Now that we've removed metadata, our F1 score has dropped but still seems ok. The next step is to see how well the classifier performs on the validation set." 480 | ] 481 | }, 482 | { 483 | "cell_type": "code", 484 | "metadata": { 485 | "id": "BnVNZ2WaQb9O" 486 | }, 487 | "source": [ 488 | "# Vectorize the validation data.\n", 489 | "%%time\n", 490 | "val_feature_vects = vectorizer.transform(val_data)" 491 | ], 492 | "execution_count": null, 493 | "outputs": [] 494 | }, 495 | { 496 | "cell_type": "code", 497 | "metadata": { 498 | "id": "Ph8R2E6YUSfG" 499 | }, 500 | "source": [ 501 | "# Predict and evaluate.\n", 502 | "val_preds = nb_classifier.predict(val_feature_vects)\n", 503 | "print('F1 score on filtered validation set: {}'.format(metrics.f1_score(val_labels, val_preds, average='macro')))" 504 | ], 505 | "execution_count": null, 506 | "outputs": [] 507 | }, 508 | { 509 | "cell_type": "markdown", 510 | "metadata": { 511 | "id": "gBCFRGBxQbql" 512 | }, 513 | "source": [ 514 | "That's quite a drop in F1 score. Because there are 20 classes involved, let's plot a confusion matrix to see what's going on:
\n", 515 | "https://scikit-learn.org/stable/modules/generated/sklearn.metrics.ConfusionMatrixDisplay.html
\n", 516 | "https://scikit-learn.org/stable/modules/model_evaluation.html#confusion-matrix" 517 | ] 518 | }, 519 | { 520 | "cell_type": "code", 521 | "metadata": { 522 | "id": "FiITJWX3W9cD" 523 | }, 524 | "source": [ 525 | "# Set the size of the plot.\n", 526 | "fig, ax = plt.subplots(figsize=(15, 15))\n", 527 | "\n", 528 | "# Create the confusion matrix. \n", 529 | "disp = ConfusionMatrixDisplay.from_estimator(nb_classifier, val_feature_vects, val_labels, normalize='true', display_labels=filtered_training_corpus.target_names, xticks_rotation='vertical', ax=ax)" 530 | ], 531 | "execution_count": null, 532 | "outputs": [] 533 | }, 534 | { 535 | "cell_type": "markdown", 536 | "metadata": { 537 | "id": "wnc6wj02ZqYz" 538 | }, 539 | "source": [ 540 | "Similar to what we saw in the slides, the y-axis represents the true labels and the x-axis represents the predictions. Each square's brightness represents the number of posts assigned to that class. What we ideally want is brightness along the diagonal (top-left to bottom-right) which represent correct predictions, and little to no brightness anywhere else.\n", 541 | "

\n", 542 | "Looking at the confusion matrix above, we can make a few observations:\n", 543 | "1. The more specific a topic is, the better the prediction result. Hockey and cryptography are good examples. This intuitively makes sense.\n", 544 | "2. Topics with a lot of word overlap tend to have higher errors. For example, the majority of atheism and religion.misc posts are classified under christianity. In general, the christianity column has a prevalence of brighter squares with misclassified posts from politics.misc, politics.mideast, etc.\n", 545 | "3. There's a smaller, secondary cluster of errors around the computer-related topics (e.g. posts in electronics being misclassified as hardware).\n", 546 | "

\n", 547 | "Seeing the results of this matrix, at least there are plausible explanations for the discrepancies." 548 | ] 549 | }, 550 | { 551 | "cell_type": "markdown", 552 | "metadata": { 553 | "id": "k8RoMv8yg9p5" 554 | }, 555 | "source": [ 556 | "Let's take a look at **precision** and **recall** for each label:
\n", 557 | "https://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html
\n", 558 | "https://scikit-learn.org/stable/modules/model_evaluation.html#classification-report" 559 | ] 560 | }, 561 | { 562 | "cell_type": "code", 563 | "metadata": { 564 | "id": "R0P_PA6ebjAU" 565 | }, 566 | "source": [ 567 | "print(metrics.classification_report(val_labels, val_preds, target_names=filtered_training_corpus.target_names))" 568 | ], 569 | "execution_count": null, 570 | "outputs": [] 571 | }, 572 | { 573 | "cell_type": "markdown", 574 | "metadata": { 575 | "id": "Ppg_HUVee8y7" 576 | }, 577 | "source": [ 578 | "A few observations:\n", 579 | "1. Atheism has a perfect precision score but terrible recall, signalling that the model was right when it classified something as under atheism, but missed the vast majority in the corpus. The model didn't classify anything under religion.misc.\n", 580 | "2. The more specific the topic, the better it tends to do." 581 | ] 582 | }, 583 | { 584 | "cell_type": "markdown", 585 | "metadata": { 586 | "id": "Tg_-c_mueeo-" 587 | }, 588 | "source": [ 589 | "# Improving the model\n", 590 | "\n" 591 | ] 592 | }, 593 | { 594 | "cell_type": "markdown", 595 | "metadata": { 596 | "id": "_sh4NkAciNdp" 597 | }, 598 | "source": [ 599 | "Let's try to do better. One thing that's likely an issue is the sheer number of features we have relative to how little data there is." 600 | ] 601 | }, 602 | { 603 | "cell_type": "code", 604 | "metadata": { 605 | "id": "OFjW_Ee5e433" 606 | }, 607 | "source": [ 608 | "print('Training data size: {}'.format(len(train_data)))\n", 609 | "print('Number of training features: {}'.format(len(train_feature_vects[0].toarray().flatten())))" 610 | ], 611 | "execution_count": null, 612 | "outputs": [] 613 | }, 614 | { 615 | "cell_type": "markdown", 616 | "metadata": { 617 | "id": "mB5wpYxzkdFx" 618 | }, 619 | "source": [ 620 | "So we can experiment with:\n", 621 | "1. Removing stop words because topic identification likely depends more on keywords rather than sequences in this case.\n", 622 | "3. Using the token lemma rather than the text.\n", 623 | "
\n" 624 | ] 625 | }, 626 | { 627 | "cell_type": "markdown", 628 | "metadata": { 629 | "id": "NZvOroxYAYtj" 630 | }, 631 | "source": [ 632 | "We can't get away with the blank pipeline since we need a bunch of components to generate the lemma. So we'll load the **en_core_web_sm** model and disable named-entity recognition and parsing in the tokenizer callback." 633 | ] 634 | }, 635 | { 636 | "cell_type": "code", 637 | "metadata": { 638 | "id": "kuqg4x5z5R6M" 639 | }, 640 | "source": [ 641 | "nlp = spacy.load('en_core_web_sm')" 642 | ], 643 | "execution_count": null, 644 | "outputs": [] 645 | }, 646 | { 647 | "cell_type": "code", 648 | "metadata": { 649 | "id": "x2I72F5peeFR" 650 | }, 651 | "source": [ 652 | "unwanted_pipes = ['ner', 'parser']\n", 653 | "\n", 654 | "# Further remove stop words and take the lemma instead of token text.\n", 655 | "def spacy_tokenizer(doc):\n", 656 | " with nlp.disable_pipes(*unwanted_pipes):\n", 657 | " return [t.lemma_ for t in nlp(doc) if \\\n", 658 | " not t.is_punct and \\\n", 659 | " not t.is_space and \\\n", 660 | " not t.is_stop and \\\n", 661 | " t.is_alpha]" 662 | ], 663 | "execution_count": null, 664 | "outputs": [] 665 | }, 666 | { 667 | "cell_type": "markdown", 668 | "metadata": { 669 | "id": "fFFLr6TRjIkS" 670 | }, 671 | "source": [ 672 | "We need to re-vectorize the training set with the new tokenizer. Because there are certain components enabled, this is going to take longer (a few mins). Take a look at these link for ways to further optimize spaCy's pipeline:
\n", 673 | "https://spacy.io/usage/processing-pipelines#processing
\n", 674 | "https://spacy.io/api/language#pipe

\n", 675 | "YouTube video from spaCy on using **nlp.pipe**: [Speed up spaCy pipelines via `nlp.pipe` - spaCy shorts](https://www.youtube.com/watch?v=OoZ-H_8vRnc)
\n", 676 | "Tuning **nlp.pipe**: https://stackoverflow.com/questions/65850018/processing-text-with-spacy-nlp-pipe
\n", 677 | "Passing a list of pre-processed tokens to TfidfVectorizer: https://www.davidsbatista.net/blog/2018/02/28/TfidfVectorizer/" 678 | ] 679 | }, 680 | { 681 | "cell_type": "code", 682 | "metadata": { 683 | "id": "GzRMdHOg-Z1c" 684 | }, 685 | "source": [ 686 | "%%time\n", 687 | "vectorizer = TfidfVectorizer(tokenizer=spacy_tokenizer)\n", 688 | "train_feature_vects = vectorizer.fit_transform(train_data)" 689 | ], 690 | "execution_count": null, 691 | "outputs": [] 692 | }, 693 | { 694 | "cell_type": "code", 695 | "metadata": { 696 | "id": "FBY6WanaBVw5" 697 | }, 698 | "source": [ 699 | "# Check the number of features now.\n", 700 | "print('Number of training features: {}'.format(len(train_feature_vects[0].toarray().flatten())))" 701 | ], 702 | "execution_count": null, 703 | "outputs": [] 704 | }, 705 | { 706 | "cell_type": "markdown", 707 | "metadata": { 708 | "id": "7x9EYRCRlIDW" 709 | }, 710 | "source": [ 711 | "A little better but still not great. Let's retrain our classifier and see what happens." 712 | ] 713 | }, 714 | { 715 | "cell_type": "code", 716 | "metadata": { 717 | "id": "ykdDS-De9-Cy" 718 | }, 719 | "source": [ 720 | "nb_classifier.fit(train_feature_vects, train_labels)\n", 721 | "train_preds = nb_classifier.predict(train_feature_vects)\n", 722 | "print('Training F1 score with fewer features: {}'.format(metrics.f1_score(train_labels, train_preds, average='macro')))" 723 | ], 724 | "execution_count": null, 725 | "outputs": [] 726 | }, 727 | { 728 | "cell_type": "markdown", 729 | "metadata": { 730 | "id": "VBLUzjod996N" 731 | }, 732 | "source": [ 733 | "Check classifier performance on validation set." 734 | ] 735 | }, 736 | { 737 | "cell_type": "code", 738 | "metadata": { 739 | "id": "W9EH7UyvliIJ" 740 | }, 741 | "source": [ 742 | "%%time\n", 743 | "val_feature_vects = vectorizer.transform(val_data)" 744 | ], 745 | "execution_count": null, 746 | "outputs": [] 747 | }, 748 | { 749 | "cell_type": "code", 750 | "metadata": { 751 | "id": "cHntKMpPliRZ" 752 | }, 753 | "source": [ 754 | "val_preds = nb_classifier.predict(val_feature_vects)\n", 755 | "print('Validation F1 score with fewer features: {}'.format(metrics.f1_score(val_labels, val_preds, average='macro')))" 756 | ], 757 | "execution_count": null, 758 | "outputs": [] 759 | }, 760 | { 761 | "cell_type": "markdown", 762 | "metadata": { 763 | "id": "BMOCM485mPky" 764 | }, 765 | "source": [ 766 | "We managed to squeeze out a few percentage points. Let's look at the confusion matrix and classification report." 767 | ] 768 | }, 769 | { 770 | "cell_type": "code", 771 | "metadata": { 772 | "id": "cH67cCdwliZH" 773 | }, 774 | "source": [ 775 | "fig, ax = plt.subplots(figsize=(15, 15))\n", 776 | "disp = ConfusionMatrixDisplay.from_estimator(nb_classifier, val_feature_vects, val_labels, normalize='true', display_labels=filtered_training_corpus.target_names, xticks_rotation='vertical', ax=ax)" 777 | ], 778 | "execution_count": null, 779 | "outputs": [] 780 | }, 781 | { 782 | "cell_type": "code", 783 | "metadata": { 784 | "id": "F7ifQJaSlije" 785 | }, 786 | "source": [ 787 | "print(metrics.classification_report(val_labels, val_preds, target_names=filtered_training_corpus.target_names))" 788 | ], 789 | "execution_count": null, 790 | "outputs": [] 791 | }, 792 | { 793 | "cell_type": "markdown", 794 | "metadata": { 795 | "id": "3Xlt7kUmmfTB" 796 | }, 797 | "source": [ 798 | "In the confusion matrix, the squares in the christian column have dimmed, signalling fewer classification errors. 
And although atheism now classifies better, that topic and religion.misc remain big sources of overall errors.\n", 799 | "\n", 800 | "Let's assume for now that we can't get or generate more data." 801 | ] 802 | }, 803 | { 804 | "cell_type": "markdown", 805 | "metadata": { 806 | "id": "jWyhcbn9mfcL" 807 | }, 808 | "source": [ 809 | "Next, we can try tuning a hyperparameter on the classifier. For Naive Bayes, we'll adjust the *alpha* smoothing factor we discussed in the slides. But rather than trying a bunch of values ourselves, we can use a combination of **Grid Search** and **Cross Validation**.\n", 810 | "- Grid search involves having the computer try a list of hyperparameter values we supply and return the best-performing one. Grid search is a basic technique; there are other approaches such as **random search** and **Bayesian optimization**.\n", 811 | "- Cross validation is a way to evaluate machine learning models on limited datasets. It randomly splits the data into k groups. One group is set aside as the holdout set while the classifier trains a model on the remaining groups. The resulting model is then evaluated on the holdout group and the score is recorded. This repeats until every group has served as the holdout set, and the average score is returned.\n", 812 | "\n", 813 | "Scikit-learn has modules to handle both for us:
\n", 814 | "https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html
\n", 815 | "https://scikit-learn.org/stable/modules/grid_search.html#grid-search
\n", 816 | "https://scikit-learn.org/stable/modules/cross_validation.html
\n" 817 | ] 818 | }, 819 | { 820 | "cell_type": "code", 821 | "metadata": { 822 | "id": "UCSbxiQWpWNg" 823 | }, 824 | "source": [ 825 | "# The alpha values to try.\n", 826 | "params = {'alpha': [0.01, 0.1, 0.5, 1.0, 10.0,],}\n", 827 | "\n", 828 | "# Instantiate the search with the model we want to try and fit it on the training data.\n", 829 | "multinomial_nb_grid = model_selection.GridSearchCV(MultinomialNB(), param_grid=params, scoring='f1_macro', n_jobs=-1, cv=5, verbose=5)\n", 830 | "multinomial_nb_grid.fit(train_feature_vects, train_labels)" 831 | ], 832 | "execution_count": null, 833 | "outputs": [] 834 | }, 835 | { 836 | "cell_type": "markdown", 837 | "metadata": { 838 | "id": "j3NsSXcKpky2" 839 | }, 840 | "source": [ 841 | "The resulting **GridSearchCV** object has a number of attributes you can explore:
\n", 842 | "https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html\n", 843 | "

\n", 844 | "We're interested in the best performing parameter value(s)." 845 | ] 846 | }, 847 | { 848 | "cell_type": "code", 849 | "metadata": { 850 | "id": "vJwHWedlpWcM" 851 | }, 852 | "source": [ 853 | "print('Best parameter value(s): {}'.format(multinomial_nb_grid.best_params_))" 854 | ], 855 | "execution_count": null, 856 | "outputs": [] 857 | }, 858 | { 859 | "cell_type": "markdown", 860 | "metadata": { 861 | "id": "zUrc7dbzp27E" 862 | }, 863 | "source": [ 864 | "You can directly access the best estimator found by the search. Let's try using it on the validation set." 865 | ] 866 | }, 867 | { 868 | "cell_type": "code", 869 | "metadata": { 870 | "id": "TlF5Ji1vqDq0" 871 | }, 872 | "source": [ 873 | "best_nb_classifier = multinomial_nb_grid.best_estimator_\n", 874 | "val_preds = best_nb_classifier.predict(val_feature_vects)\n", 875 | "print('Validation F1 score with fewer features: {}'.format(metrics.f1_score(val_labels, val_preds, average='macro')))" 876 | ], 877 | "execution_count": null, 878 | "outputs": [] 879 | }, 880 | { 881 | "cell_type": "markdown", 882 | "metadata": { 883 | "id": "rzhjYdM-qYkk" 884 | }, 885 | "source": [ 886 | "So we got another decent jump after using the the optimal *alpha* value. Let's look at the confusion matrix (using the best estimator so far) and classification report again.\n" 887 | ] 888 | }, 889 | { 890 | "cell_type": "code", 891 | "metadata": { 892 | "id": "xxaQxpAdpWjc" 893 | }, 894 | "source": [ 895 | "fig, ax = plt.subplots(figsize=(15, 15))\n", 896 | "disp = ConfusionMatrixDisplay.from_estimator(best_nb_classifier, val_feature_vects, val_labels, normalize='true', display_labels=filtered_training_corpus.target_names, xticks_rotation='vertical', ax=ax)" 897 | ], 898 | "execution_count": null, 899 | "outputs": [] 900 | }, 901 | { 902 | "cell_type": "code", 903 | "metadata": { 904 | "id": "ndj-_mlKqnvd" 905 | }, 906 | "source": [ 907 | "print(metrics.classification_report(val_labels, val_preds, target_names=filtered_training_corpus.target_names))" 908 | ], 909 | "execution_count": null, 910 | "outputs": [] 911 | }, 912 | { 913 | "cell_type": "markdown", 914 | "metadata": { 915 | "id": "jsslL_bsqnnB" 916 | }, 917 | "source": [ 918 | "A few observations from this one:\n", 919 | "1. Atheism and religion.misc are doing much better though still a source of errors.\n", 920 | "2. The christian column has dimmed further in the other categories.\n", 921 | "\n", 922 | "Given the small data size and the soft borders around various topics, what we have now is probably good enough. A few further ideas to explore:\n", 923 | "1. Augment the training data with posts from similar subreddits.\n", 924 | "2. Incorporate n-grams.\n", 925 | "3. Remove the *misc* categories if your goal allows it.\n", 926 | "4. Merge a few categories with large overlap together if your goal allows it.\n", 927 | "5. Use the **CountVectorizer** instead of the **TfidfVectorizer**.\n", 928 | "6. Play around with adding more stop words after seeing which ones are the most prevalent.\n", 929 | "7. Play with the min_df, max_df, and max_features in the **TFidfVectorizer**.\n", 930 | "8. Use a dimensionality reduction technique like Singular Value Decomposition (SVD) or dense word vectors which we'll cover in Part II.\n", 931 | "9. 
Try other models: logistic regression, support vector machines, random forests, SGD classifier.\n", 932 | "\n", 933 | "My guess is that, aside from merging categories, it'll be hard to do much better than what we have given the nature of the data.\n", 934 | "\n", 935 | "https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html\n", 936 | "
\n", 937 | "https://scikit-learn.org/stable/modules/svm.html#svm-classification\n", 938 | "
\n", 939 | "https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html\n", 940 | "
\n", 941 | "https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDClassifier.html\n", 942 | "\n" 943 | ] 944 | }, 945 | { 946 | "cell_type": "markdown", 947 | "metadata": { 948 | "id": "nG4KYHvPuUwy" 949 | }, 950 | "source": [ 951 | "For idea **(6)**, we can use the function below to view the most commonly occurring words in each category." 952 | ] 953 | }, 954 | { 955 | "cell_type": "code", 956 | "metadata": { 957 | "id": "BO1vyE_Jqnd9" 958 | }, 959 | "source": [ 960 | "def show_top_words(classifier, vectorizer, categories, top_n):\n", 961 | " feature_names = np.asarray(vectorizer.get_feature_names_out())\n", 962 | " for i, category in enumerate(categories):\n", 963 | " prob_sorted = classifier.feature_log_prob_[i, :].argsort()[::-1]\n", 964 | " print(\"%s: %s\" % (category, \" \".join(feature_names[prob_sorted[:top_n]])))" 965 | ], 966 | "execution_count": null, 967 | "outputs": [] 968 | }, 969 | { 970 | "cell_type": "code", 971 | "metadata": { 972 | "id": "SzrTHDDktms8" 973 | }, 974 | "source": [ 975 | "show_top_words(best_nb_classifier, vectorizer, filtered_training_corpus.target_names, 10)" 976 | ], 977 | "execution_count": null, 978 | "outputs": [] 979 | }, 980 | { 981 | "cell_type": "markdown", 982 | "source": [ 983 | "As a sanity check, we can use scikit-learns **DummyClassifier** which can make predictions using strategies such as \"just guess the most frequently occurring class\" or \"make random guesses\".
\n", 984 | "https://scikit-learn.org/stable/modules/generated/sklearn.dummy.DummyClassifier.html" 985 | ], 986 | "metadata": { 987 | "id": "LtexSjOp_4wj" 988 | } 989 | }, 990 | { 991 | "cell_type": "code", 992 | "metadata": { 993 | "id": "C1e28pieaexT" 994 | }, 995 | "source": [ 996 | "# Train a dummy classifier which just guesses the most frequent class.\n", 997 | "dummy_clf = DummyClassifier(strategy=\"most_frequent\")\n", 998 | "dummy_clf.fit(train_feature_vects, train_labels)\n", 999 | "dummy_clf.score(val_feature_vects, val_labels)" 1000 | ], 1001 | "execution_count": null, 1002 | "outputs": [] 1003 | }, 1004 | { 1005 | "cell_type": "code", 1006 | "metadata": { 1007 | "id": "g53QB8nGbeoB" 1008 | }, 1009 | "source": [ 1010 | "# Train a dummy classifier which just guesses a class randomly.\n", 1011 | "dummy_clf = DummyClassifier(strategy=\"uniform\")\n", 1012 | "dummy_clf.fit(train_feature_vects, train_labels)\n", 1013 | "dummy_clf.score(val_feature_vects, val_labels)" 1014 | ], 1015 | "execution_count": null, 1016 | "outputs": [] 1017 | }, 1018 | { 1019 | "cell_type": "markdown", 1020 | "metadata": { 1021 | "id": "RXN_3sltvNWV" 1022 | }, 1023 | "source": [ 1024 | "# Creating the final Naive Bayes classifier." 1025 | ] 1026 | }, 1027 | { 1028 | "cell_type": "markdown", 1029 | "metadata": { 1030 | "id": "TBtpCqa8t5GZ" 1031 | }, 1032 | "source": [ 1033 | "Let's train the classifier we'll use on the test set. We'll use the entire original training set (including validation data) and the ideal *alpha* param.\n", 1034 | "
\n", 1035 | "We'll also use scikit-learn's **Pipeline** to specify a series of transformation and training steps so we can vectorize and fit a model with one call. Creating a few of these pipelines can help speed up your development and stay organized:
\n", 1036 | "https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html\n" 1037 | ] 1038 | }, 1039 | { 1040 | "cell_type": "code", 1041 | "metadata": { 1042 | "id": "nrj1BURWw3AP" 1043 | }, 1044 | "source": [ 1045 | "text_classifier = Pipeline([\n", 1046 | " ('vectorizer', TfidfVectorizer(tokenizer=spacy_tokenizer)),\n", 1047 | " ('classifier', MultinomialNB(alpha=0.01))\n", 1048 | "])" 1049 | ], 1050 | "execution_count": null, 1051 | "outputs": [] 1052 | }, 1053 | { 1054 | "cell_type": "code", 1055 | "metadata": { 1056 | "id": "EeznMt1Fze3Z" 1057 | }, 1058 | "source": [ 1059 | "%%time\n", 1060 | "text_classifier.fit(filtered_training_corpus.data, filtered_training_corpus.target)" 1061 | ], 1062 | "execution_count": null, 1063 | "outputs": [] 1064 | }, 1065 | { 1066 | "cell_type": "markdown", 1067 | "metadata": { 1068 | "id": "hWvrtPo90u_B" 1069 | }, 1070 | "source": [ 1071 | "Download the 20 newsgroups *test* dataset." 1072 | ] 1073 | }, 1074 | { 1075 | "cell_type": "code", 1076 | "metadata": { 1077 | "id": "Gtr4vs4du5Hl" 1078 | }, 1079 | "source": [ 1080 | "filtered_test_corpus = fetch_20newsgroups(subset='test', remove=('headers', 'footers', 'quotes'))" 1081 | ], 1082 | "execution_count": null, 1083 | "outputs": [] 1084 | }, 1085 | { 1086 | "cell_type": "markdown", 1087 | "metadata": { 1088 | "id": "BxWYFXwvNzPx" 1089 | }, 1090 | "source": [ 1091 | "We can now pass the raw test data directly to the classifier." 1092 | ] 1093 | }, 1094 | { 1095 | "cell_type": "code", 1096 | "source": [ 1097 | "%%time\n", 1098 | "test_preds = text_classifier.predict(filtered_test_corpus.data)" 1099 | ], 1100 | "metadata": { 1101 | "id": "3eDucMkzj5Lj" 1102 | }, 1103 | "execution_count": null, 1104 | "outputs": [] 1105 | }, 1106 | { 1107 | "cell_type": "code", 1108 | "source": [ 1109 | "%%time\n", 1110 | "fig, ax = plt.subplots(figsize=(15, 15))\n", 1111 | "ConfusionMatrixDisplay.from_predictions(filtered_test_corpus.target, test_preds, normalize='true', display_labels=filtered_test_corpus.target_names, xticks_rotation='vertical', ax=ax)\n", 1112 | "plt.show()" 1113 | ], 1114 | "metadata": { 1115 | "id": "k8ThbeHJj5IM" 1116 | }, 1117 | "execution_count": null, 1118 | "outputs": [] 1119 | }, 1120 | { 1121 | "cell_type": "markdown", 1122 | "metadata": { 1123 | "id": "YeKTyl-W07Wi" 1124 | }, 1125 | "source": [ 1126 | "Looking at the confusion matrix for test data classification, we see there are still a few brighter clusters around the soft politics/religion area as well as the finer-grained computer-related topics which drag the overall accuracy down. This is reflected in the classification report as well. Overall, the other topics look ok given the data we have.\n", 1127 | "

" 1128 | ] 1129 | }, 1130 | { 1131 | "cell_type": "code", 1132 | "source": [ 1133 | "print(metrics.classification_report(filtered_test_corpus.target, test_preds, target_names=filtered_test_corpus.target_names))" 1134 | ], 1135 | "metadata": { 1136 | "id": "gQ5m0fEkj5Ej" 1137 | }, 1138 | "execution_count": null, 1139 | "outputs": [] 1140 | }, 1141 | { 1142 | "cell_type": "markdown", 1143 | "source": [ 1144 | "We can now leverage our pipeline to classify new documents on the fly now.\n", 1145 | "

\n", 1146 | "The function below takes a classifier, a document to classify, and an optional set of labels. It returns a tuple of the most probable class and its probability. With this information, you can choose a probability threshold over which to accept a classification. If it falls below the threshold, perhaps you can classify it in some default bucket or pass it to a human, or to another classifier downstream. You could also require a minimum string length for classification along with other conditions.\n", 1147 | "

\n" 1148 | ], 1149 | "metadata": { 1150 | "id": "csncNP4BmUd_" 1151 | } 1152 | }, 1153 | { 1154 | "cell_type": "code", 1155 | "metadata": { 1156 | "id": "uJ2lYTdW3HXP" 1157 | }, 1158 | "source": [ 1159 | "def classify_text(clf, doc, labels=None):\n", 1160 | " probas = clf.predict_proba([doc]).flatten()\n", 1161 | " max_proba_idx = np.argmax(probas)\n", 1162 | " \n", 1163 | " if labels:\n", 1164 | " most_proba_class = labels[max_proba_idx]\n", 1165 | " else:\n", 1166 | " most_proba_class = max_proba_idx\n", 1167 | "\n", 1168 | " return (most_proba_class, probas[max_proba_idx])" 1169 | ], 1170 | "execution_count": null, 1171 | "outputs": [] 1172 | }, 1173 | { 1174 | "cell_type": "markdown", 1175 | "metadata": { 1176 | "id": "icAHgqa2PDIk" 1177 | }, 1178 | "source": [ 1179 | "The strings below were taken at random from subreddits that have corresponding topics (e.g. r/space, r/cars, etc)." 1180 | ] 1181 | }, 1182 | { 1183 | "cell_type": "code", 1184 | "metadata": { 1185 | "id": "H-YZ3euf4hg1" 1186 | }, 1187 | "source": [ 1188 | "# Post from r/medicine.\n", 1189 | "s = \"Hello everyone so am doing my thesis on Ischemic heart disease have been using online articles and textbooks mostly Harrisons internal med. could u recommended me some source specifically books where i can get more about in depth knowledge on IHD.\"\n", 1190 | "classify_text(text_classifier, s, filtered_test_corpus.target_names)" 1191 | ], 1192 | "execution_count": null, 1193 | "outputs": [] 1194 | }, 1195 | { 1196 | "cell_type": "code", 1197 | "metadata": { 1198 | "id": "GGZJYEz8Pb65" 1199 | }, 1200 | "source": [ 1201 | "# Post from r/space.\n", 1202 | "s = \"First evidence that water can be created on the lunar surface by Earth's magnetosphere. Particles from Earth can seed the moon with water, implying that other planets could also contribute water to their satellites.\"\n", 1203 | "classify_text(text_classifier, s, filtered_test_corpus.target_names)" 1204 | ], 1205 | "execution_count": null, 1206 | "outputs": [] 1207 | }, 1208 | { 1209 | "cell_type": "code", 1210 | "metadata": { 1211 | "id": "s3_ryq6VPdza" 1212 | }, 1213 | "source": [ 1214 | "# Post from r/cars.\n", 1215 | "s = \"New Toyota 86 Launch Reportedly Delayed to 2022, CEO Doesn't Want a Subaru Copy\"\n", 1216 | "classify_text(text_classifier, s, filtered_test_corpus.target_names)" 1217 | ], 1218 | "execution_count": null, 1219 | "outputs": [] 1220 | }, 1221 | { 1222 | "cell_type": "code", 1223 | "metadata": { 1224 | "id": "4mOI3OSN4k0n" 1225 | }, 1226 | "source": [ 1227 | "# Post from r/electronics.\n", 1228 | "s = \"My First Ever Homemade PCB. My SMD Soldering Skills Aren't Great, But I'm Quite Proud of it.\"\n", 1229 | "classify_text(text_classifier, s, filtered_test_corpus.target_names)" 1230 | ], 1231 | "execution_count": null, 1232 | "outputs": [] 1233 | }, 1234 | { 1235 | "cell_type": "markdown", 1236 | "metadata": { 1237 | "id": "012m-1j1Q_Mg" 1238 | }, 1239 | "source": [ 1240 | "These are a few made-up statements with low probability which could belong to anything. In these situations, they can be dealt with as special cases." 
1241 | ] 1242 | }, 1243 | { 1244 | "cell_type": "code", 1245 | "metadata": { 1246 | "id": "5Ct2AiGH5RrJ" 1247 | }, 1248 | "source": [ 1249 | "s = \"I don't know if that's a good idea.\"\n", 1250 | "classify_text(text_classifier, s, filtered_test_corpus.target_names)" 1251 | ], 1252 | "execution_count": null, 1253 | "outputs": [] 1254 | }, 1255 | { 1256 | "cell_type": "code", 1257 | "metadata": { 1258 | "id": "JYppCUPARU-N" 1259 | }, 1260 | "source": [ 1261 | "s = \"Hold on for dear life.\"\n", 1262 | "classify_text(text_classifier, s, filtered_test_corpus.target_names)" 1263 | ], 1264 | "execution_count": null, 1265 | "outputs": [] 1266 | }, 1267 | { 1268 | "cell_type": "markdown", 1269 | "metadata": { 1270 | "id": "pzxyKMHzegBI" 1271 | }, 1272 | "source": [ 1273 | "**Note:**
\n", 1274 | "Keep in mind that Naive Bayes is good at returning the most probable class but is regarded as a poor estimator because of its naive assumption of independence (i.e. the actual probability values aren't very reliable)." 1275 | ] 1276 | } 1277 | ] 1278 | } 1279 | -------------------------------------------------------------------------------- /notebooks/nlpdemystified_preprocessing.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "nbformat": 4, 3 | "nbformat_minor": 0, 4 | "metadata": { 5 | "colab": { 6 | "name": "nlpdemystified-preprocessing.ipynb", 7 | "private_outputs": true, 8 | "provenance": [], 9 | "collapsed_sections": [], 10 | "include_colab_link": true 11 | }, 12 | "kernelspec": { 13 | "name": "python3", 14 | "display_name": "Python 3" 15 | } 16 | }, 17 | "cells": [ 18 | { 19 | "cell_type": "markdown", 20 | "metadata": { 21 | "id": "view-in-github", 22 | "colab_type": "text" 23 | }, 24 | "source": [ 25 | "\"Open" 26 | ] 27 | }, 28 | { 29 | "cell_type": "markdown", 30 | "metadata": { 31 | "id": "DM8kLxUEVc3Z" 32 | }, 33 | "source": [ 34 | "# Natural Language Processing Demystified | Preprocessing\n", 35 | "https://nlpdemystified.org
\n", 36 | "https://github.com/futuremojo/nlp-demystified" 37 | ] 38 | }, 39 | { 40 | "cell_type": "markdown", 41 | "metadata": { 42 | "id": "btimL_w92Q3P" 43 | }, 44 | "source": [ 45 | "### spaCy upgrade and package installation." 46 | ] 47 | }, 48 | { 49 | "cell_type": "markdown", 50 | "metadata": { 51 | "id": "u7Ll-fUK2VZs" 52 | }, 53 | "source": [ 54 | "At the time this notebook was created, spaCy had newer releases but Colab was still using version 2.x by default. So the first step is to upgrade spaCy.\n", 55 | "

\n", 56 | "**IMPORTANT**
\n", 57 | "If you're running this in the cloud rather than using a local Jupyter server on your machine, then the notebook will **timeout** after a period of inactivity. If that happens and you don't reconnect in time, you will need to upgrade spaCy again and reinstall the requisite statistical packages.\n", 58 | "

\n", 59 | "Refer to this link on how to run Colab notebooks locally on your machine to avoid this issue:
\n", 60 | "https://research.google.com/colaboratory/local-runtimes.html" 61 | ] 62 | }, 63 | { 64 | "cell_type": "code", 65 | "metadata": { 66 | "id": "cve1-G7j2VTN" 67 | }, 68 | "source": [ 69 | "!pip install -U spacy==3.* " 70 | ], 71 | "execution_count": null, 72 | "outputs": [] 73 | }, 74 | { 75 | "cell_type": "code", 76 | "metadata": { 77 | "id": "z-FDdbc62VHd" 78 | }, 79 | "source": [ 80 | "!python -m spacy info" 81 | ], 82 | "execution_count": null, 83 | "outputs": [] 84 | }, 85 | { 86 | "cell_type": "code", 87 | "metadata": { 88 | "id": "8vW9svTE289D" 89 | }, 90 | "source": [ 91 | " import spacy " 92 | ], 93 | "execution_count": null, 94 | "outputs": [] 95 | }, 96 | { 97 | "cell_type": "markdown", 98 | "metadata": { 99 | "id": "ZfJKSJEU2U_s" 100 | }, 101 | "source": [ 102 | "After importing spaCy, the next thing we need to do is load a suitable statistical model for our project. spaCy offers a variety of models for different languages. These models help with tokenization, part-of-speech tagging, named entity recognition, and more.\n", 103 | "\n", 104 | "Here, we're loading the **en_core_web_sm** model which is the smallest English model spaCy offers and is a good starting point for NLP tasks.
\n", 105 | "https://spacy.io/models/en#en_core_web_sm" 106 | ] 107 | }, 108 | { 109 | "cell_type": "markdown", 110 | "metadata": { 111 | "id": "6v6TGQff2iu6" 112 | }, 113 | "source": [ 114 | "Since we upgraded spaCy, we'll need to download the statistical model as well." 115 | ] 116 | }, 117 | { 118 | "cell_type": "code", 119 | "metadata": { 120 | "id": "4uOyHDNb2i5d" 121 | }, 122 | "source": [ 123 | "!python -m spacy download en_core_web_sm" 124 | ], 125 | "execution_count": null, 126 | "outputs": [] 127 | }, 128 | { 129 | "cell_type": "code", 130 | "metadata": { 131 | "id": "mWDrpxDk2_r2" 132 | }, 133 | "source": [ 134 | "nlp = spacy.load('en_core_web_sm')" 135 | ], 136 | "execution_count": null, 137 | "outputs": [] 138 | }, 139 | { 140 | "cell_type": "markdown", 141 | "metadata": { 142 | "id": "K7YCbWtG3LJO" 143 | }, 144 | "source": [ 145 | "**en_core_web_sm** is trained on OntoNotes 5 which is an annotated corpus comprising news, blogs, transcripts, etc. Put simply, this means a bunch of documents were labelled with information such as how each sentence should be parsed, whether a particular word is a noun or adjective or other part-of-speech, whether a word is a special entity like a person or a real-world organization, and other language-related labels. A statistical model was then generated from these labelled documents.
\n", 146 | "https://catalog.ldc.upenn.edu/LDC2013T19\n", 147 | "

\n", 148 | "You can learn more about the available spaCy models at these links:
\n", 149 | "https://spacy.io/models
\n", 150 | "https://spacy.io/usage/models" 151 | ] 152 | }, 153 | { 154 | "cell_type": "markdown", 155 | "metadata": { 156 | "id": "dvF_udvi3OTO" 157 | }, 158 | "source": [ 159 | "After loading the model, the _nlp_ variable now references a **Language** class instance which contains language-specific rules for various tasks (e.g. tokenization) and a processing pipeline.
\n", 160 | "https://spacy.io/api/language" 161 | ] 162 | }, 163 | { 164 | "cell_type": "code", 165 | "metadata": { 166 | "id": "DAYGtQpT3UNN" 167 | }, 168 | "source": [ 169 | "type(nlp) " 170 | ], 171 | "execution_count": null, 172 | "outputs": [] 173 | }, 174 | { 175 | "cell_type": "markdown", 176 | "metadata": { 177 | "id": "unmnGRu8D-wa" 178 | }, 179 | "source": [ 180 | "# Tokenization\n", 181 | "\n", 182 | "Course module for this demo:\n", 183 | "https://www.nlpdemystified.org/course/tokenization\n" 184 | ] 185 | }, 186 | { 187 | "cell_type": "markdown", 188 | "metadata": { 189 | "id": "mLUcGm3IbQki" 190 | }, 191 | "source": [ 192 | "### Tokenization with spaCy" 193 | ] 194 | }, 195 | { 196 | "cell_type": "markdown", 197 | "metadata": { 198 | "id": "13twUCp2i_p8" 199 | }, 200 | "source": [ 201 | "We pass whatever text we want to process to _nlp_, which returns a **Doc** container object containing the tokenized text and a number of annotations for each token. These annotations are discussed in follow-up videos. You can learn more about the **Doc** object here:
\n", 202 | "https://spacy.io/api/doc" 203 | ] 204 | }, 205 | { 206 | "cell_type": "code", 207 | "metadata": { 208 | "id": "BIoEJZ-IkHQ4" 209 | }, 210 | "source": [ 211 | "# Sample sentence.\n", 212 | "s = \"He didn't want to pay $20 for this book.\"\n", 213 | "doc = nlp(s)" 214 | ], 215 | "execution_count": null, 216 | "outputs": [] 217 | }, 218 | { 219 | "cell_type": "markdown", 220 | "metadata": { 221 | "id": "MMWZK3ZSk9-f" 222 | }, 223 | "source": [ 224 | "We can iterate over this **Doc** object and view the tokens." 225 | ] 226 | }, 227 | { 228 | "cell_type": "code", 229 | "metadata": { 230 | "id": "8SzqhZuulAe1" 231 | }, 232 | "source": [ 233 | "print([t.text for t in doc])" 234 | ], 235 | "execution_count": null, 236 | "outputs": [] 237 | }, 238 | { 239 | "cell_type": "markdown", 240 | "metadata": { 241 | "id": "ai1obkB93GdD" 242 | }, 243 | "source": [ 244 | "Note how\n", 245 | "- \"didn't\" is separated into \"did\" and \"n't\".\n", 246 | "- the currency symbol and amount are separated.\n", 247 | "- the period at the end of the sentence is its own token." 248 | ] 249 | }, 250 | { 251 | "cell_type": "markdown", 252 | "metadata": { 253 | "id": "AWH49gIh3hqN" 254 | }, 255 | "source": [ 256 | "The **Doc** object can be indexed and sliced like a regular list. The **Doc** object contains **Token** and **Span** objects, which offer different views into the text." 257 | ] 258 | }, 259 | { 260 | "cell_type": "code", 261 | "metadata": { 262 | "id": "MwLrxRsE3oKI" 263 | }, 264 | "source": [ 265 | "# We can view an individual token by indexing into the Doc object.\n", 266 | "print(doc[0])" 267 | ], 268 | "execution_count": null, 269 | "outputs": [] 270 | }, 271 | { 272 | "cell_type": "code", 273 | "metadata": { 274 | "id": "bGapNHYQFYVa" 275 | }, 276 | "source": [ 277 | "# A Doc object is a container of other objects, namely Token and Span objects.\n", 278 | "print(type(doc[0]))" 279 | ], 280 | "execution_count": null, 281 | "outputs": [] 282 | }, 283 | { 284 | "cell_type": "code", 285 | "metadata": { 286 | "id": "EtL2IgIAGOd9" 287 | }, 288 | "source": [ 289 | "# Slicing a Doc object returns a Span object.\n", 290 | "print(doc[0:3])\n", 291 | "print(type(doc[0:3]))" 292 | ], 293 | "execution_count": null, 294 | "outputs": [] 295 | }, 296 | { 297 | "cell_type": "code", 298 | "metadata": { 299 | "id": "xybH4jjYGo73" 300 | }, 301 | "source": [ 302 | "# Access a token's index in a sentence.\n", 303 | "print([(t.text, t.i) for t in doc])" 304 | ], 305 | "execution_count": null, 306 | "outputs": [] 307 | }, 308 | { 309 | "cell_type": "markdown", 310 | "metadata": { 311 | "id": "_TqE980F4Vrt" 312 | }, 313 | "source": [ 314 | "Spacy's tokenization is _non-destructive_, which means the original input can be reconstructed from the tokens." 315 | ] 316 | }, 317 | { 318 | "cell_type": "code", 319 | "metadata": { 320 | "id": "OjXb8mR_DK-1" 321 | }, 322 | "source": [ 323 | "# You can view the original input like so:\n", 324 | "print(doc.text)" 325 | ], 326 | "execution_count": null, 327 | "outputs": [] 328 | }, 329 | { 330 | "cell_type": "markdown", 331 | "metadata": { 332 | "id": "73vuSX7MDK79" 333 | }, 334 | "source": [ 335 | "You can learn more about the **Token** and **Span** objects here:
\n", 336 | "https://spacy.io/api/token
\n", 337 | "https://spacy.io/api/span\n" 338 | ] 339 | }, 340 | { 341 | "cell_type": "markdown", 342 | "metadata": { 343 | "id": "lume_1UP6ySQ" 344 | }, 345 | "source": [ 346 | "We can also tokenize multiple sentences and access each sentence individually using the **Doc** object's _sents_ property." 347 | ] 348 | }, 349 | { 350 | "cell_type": "code", 351 | "metadata": { 352 | "id": "mPZ86x0hDK4m" 353 | }, 354 | "source": [ 355 | "s = \"\"\"Either the well was very deep, or she fell very slowly, for she \n", 356 | "had plenty of time as she went down to look about her and to wonder what \n", 357 | "was going to happen next. First, she tried to look down and make out what \n", 358 | "she was coming to, but it was too dark to see anything; then she looked at \n", 359 | "the sides of the well, and noticed that they were filled with cupboards and \n", 360 | "book-shelves; here and there she saw maps and pictures hung upon pegs.\"\"\"\n", 361 | "\n", 362 | "doc = nlp(s)\n", 363 | "\n", 364 | "# Look at individual sentences (there should be two 'Span' objects).\n", 365 | "print([sent for sent in doc.sents])" 366 | ], 367 | "execution_count": null, 368 | "outputs": [] 369 | }, 370 | { 371 | "cell_type": "markdown", 372 | "metadata": { 373 | "id": "DvSfDUyK06Qg" 374 | }, 375 | "source": [ 376 | "### Tokenization Exercises" 377 | ] 378 | }, 379 | { 380 | "cell_type": "code", 381 | "metadata": { 382 | "id": "fyywcBrCHzSk" 383 | }, 384 | "source": [ 385 | "#\n", 386 | "# EXERCISE:\n", 387 | "# 1) Tokenize the following text\n", 388 | "# 2) Iterate through the tokens to check whether there's a currency symbol.\n", 389 | "# 3) If there is, and the currency label is followed by a number, print\n", 390 | "# both the symbol and the number.\n", 391 | "# \n", 392 | "# Look through https://spacy.io/api/token#attributes on how to check whether\n", 393 | "# a token is a currency symbol or a number.\n", 394 | "#\n", 395 | "# Expected output: \"$20\".\n", 396 | "s = \"He didn't want to pay $20 for this book.\"\n", 397 | "doc = nlp(s)" 398 | ], 399 | "execution_count": null, 400 | "outputs": [] 401 | }, 402 | { 403 | "cell_type": "code", 404 | "metadata": { 405 | "id": "skajI-OZDK0t" 406 | }, 407 | "source": [ 408 | "#\n", 409 | "# EXERCISE: Learn how the spaCy tokenizer works and how to customize it:\n", 410 | "# https://spacy.io/usage/linguistic-features#tokenization\n", 411 | "#" 412 | ], 413 | "execution_count": null, 414 | "outputs": [] 415 | }, 416 | { 417 | "cell_type": "code", 418 | "metadata": { 419 | "id": "Ikbnyb8rDKv9" 420 | }, 421 | "source": [ 422 | "#\n", 423 | "# EXERCISE: Read through spaCy-101 and if you're interested, check out their course\n", 424 | "# on spaCy itself (link on the page).\n", 425 | "# https://spacy.io/usage/spacy-101\n", 426 | "#" 427 | ], 428 | "execution_count": null, 429 | "outputs": [] 430 | }, 431 | { 432 | "cell_type": "code", 433 | "metadata": { 434 | "id": "MMArLP91DKUW" 435 | }, 436 | "source": [ 437 | "#\n", 438 | "# EXERCISE: Look up how to tokenize the sentence below using NLTK. The imports \n", 439 | "# are done for you. Does the NLTK tokenizer handle \"N.Y.C.\" correctly?\n", 440 | "#\n", 441 | "import nltk\n", 442 | "from nltk.tokenize import TreebankWordTokenizer\n", 443 | "s = \"Let's go to N.Y.C. 
for the weekend.\"" 444 | ], 445 | "execution_count": null, 446 | "outputs": [] 447 | }, 448 | { 449 | "cell_type": "markdown", 450 | "metadata": { 451 | "id": "EMbm9tTakDdy" 452 | }, 453 | "source": [ 454 | "**NOTE**: Different tokenizers will give subtly different results based on the rules they use. Experiment with different tokenizers and use the one best suited for your project." 455 | ] 456 | }, 457 | { 458 | "cell_type": "markdown", 459 | "metadata": { 460 | "id": "uUsfYCpVT4nI" 461 | }, 462 | "source": [ 463 | "# Basic Preprocessing\n", 464 | "## Case-Folding, Stop Word Removal, Stemming, and Lemmatization.\n", 465 | "\n", 466 | "Course module for this demo:\n", 467 | "https://www.nlpdemystified.org/course/basic-preprocessing" 468 | ] 469 | }, 470 | { 471 | "cell_type": "markdown", 472 | "source": [ 473 | "**NOTE: If the notebook timed out, you may need to re-upgrade spaCy and re-install the language model as follows:**\n" 474 | ], 475 | "metadata": { 476 | "id": "5gaj23tgd7Su" 477 | } 478 | }, 479 | { 480 | "cell_type": "code", 481 | "metadata": { 482 | "id": "mg6dga4JePf2" 483 | }, 484 | "source": [ 485 | "!pip install -U spacy==3.*\n", 486 | "!python -m spacy download en_core_web_sm\n", 487 | "!python -m spacy info" 488 | ], 489 | "execution_count": null, 490 | "outputs": [] 491 | }, 492 | { 493 | "cell_type": "markdown", 494 | "metadata": { 495 | "id": "LgDDrCeI8f-4" 496 | }, 497 | "source": [ 498 | "spaCy performs all these preprocessing steps (except stemming) behind the scenes for you. Inline with its non-destructive policy, the tokens aren't modified directly. Rather, each **Token** object has a number of attributes which can help you get views of your document with these pre-processing steps applied. The attributes a **Token** has can be found here:
\n", 499 | "https://spacy.io/api/token#attributes\n", 500 | "

\n", 501 | "More information about spaCy's processing pipeline:
\n", 502 | "https://spacy.io/usage/processing-pipelines" 503 | ] 504 | }, 505 | { 506 | "cell_type": "code", 507 | "metadata": { 508 | "id": "jDEMR6En1j3H" 509 | }, 510 | "source": [ 511 | "import spacy\n", 512 | "nlp = spacy.load('en_core_web_sm')\n", 513 | "s = \"He told Dr. Lovato that he was done with the tests and would post the results shortly.\"\n", 514 | "doc = nlp(s)" 515 | ], 516 | "execution_count": null, 517 | "outputs": [] 518 | }, 519 | { 520 | "cell_type": "markdown", 521 | "metadata": { 522 | "id": "xwA1ct0obYlR" 523 | }, 524 | "source": [ 525 | "### Case-Folding" 526 | ] 527 | }, 528 | { 529 | "cell_type": "markdown", 530 | "metadata": { 531 | "id": "biBPWrVd9BrK" 532 | }, 533 | "source": [ 534 | "View your document with case-folding using the *lower_* attribute." 535 | ] 536 | }, 537 | { 538 | "cell_type": "code", 539 | "metadata": { 540 | "id": "1nt4RpzdgQQL" 541 | }, 542 | "source": [ 543 | "print([t.lower_ for t in doc])" 544 | ], 545 | "execution_count": null, 546 | "outputs": [] 547 | }, 548 | { 549 | "cell_type": "markdown", 550 | "metadata": { 551 | "id": "HL46I4sH9OMq" 552 | }, 553 | "source": [ 554 | "You can also apply conditions when generating these views. For example, we can skip case-folding if a token is the start of a sentence." 555 | ] 556 | }, 557 | { 558 | "cell_type": "code", 559 | "metadata": { 560 | "id": "IO0PQ8IFhOlZ" 561 | }, 562 | "source": [ 563 | "print([t.lower_ if not t.is_sent_start else t for t in doc])" 564 | ], 565 | "execution_count": null, 566 | "outputs": [] 567 | }, 568 | { 569 | "cell_type": "markdown", 570 | "metadata": { 571 | "id": "G7pTz8XJbmaT" 572 | }, 573 | "source": [ 574 | "### Stop Word Removal" 575 | ] 576 | }, 577 | { 578 | "cell_type": "markdown", 579 | "metadata": { 580 | "id": "tZLqqmHa9cRx" 581 | }, 582 | "source": [ 583 | "spaCy comes with a default stop word list. To view your document with stop words removed, you can use the *is_stop* attribute." 584 | ] 585 | }, 586 | { 587 | "cell_type": "code", 588 | "metadata": { 589 | "id": "9kvXbuDEhOxu" 590 | }, 591 | "source": [ 592 | "# spaCy's default stop word list.\n", 593 | "print(nlp.Defaults.stop_words)\n", 594 | "print(len(nlp.Defaults.stop_words))" 595 | ], 596 | "execution_count": null, 597 | "outputs": [] 598 | }, 599 | { 600 | "cell_type": "code", 601 | "metadata": { 602 | "id": "oAS1xmgOhO5y" 603 | }, 604 | "source": [ 605 | "print([t for t in doc if not t.is_stop])" 606 | ], 607 | "execution_count": null, 608 | "outputs": [] 609 | }, 610 | { 611 | "cell_type": "markdown", 612 | "metadata": { 613 | "id": "UPd1aiLrbqcK" 614 | }, 615 | "source": [ 616 | "### Lemmatization" 617 | ] 618 | }, 619 | { 620 | "cell_type": "markdown", 621 | "metadata": { 622 | "id": "gKidP32Y_qcE" 623 | }, 624 | "source": [ 625 | "It's similar with lemmatization. You can view your document with lemmatization applied through the *lemma_* attribute." 626 | ] 627 | }, 628 | { 629 | "cell_type": "code", 630 | "metadata": { 631 | "id": "fhdRleESkzTu" 632 | }, 633 | "source": [ 634 | "[(t.text, t.lemma_) for t in doc]" 635 | ], 636 | "execution_count": null, 637 | "outputs": [] 638 | }, 639 | { 640 | "cell_type": "markdown", 641 | "metadata": { 642 | "id": "VuaQJPjEjADE" 643 | }, 644 | "source": [ 645 | "### Basic Preprocessing Exercises" 646 | ] 647 | }, 648 | { 649 | "cell_type": "markdown", 650 | "metadata": { 651 | "id": "-uNdkuJqCA2A" 652 | }, 653 | "source": [ 654 | "spaCy doesn't support stemming natively. But for completeness, we can stem using **NLTK**. 
Specifically, we can use the *Snowball stemmer* which is an improved version of the *Porter stemmer*." 655 | ] 656 | }, 657 | { 658 | "cell_type": "code", 659 | "metadata": { 660 | "id": "_HQzMurVB13l" 661 | }, 662 | "source": [ 663 | "#\n", 664 | "# EXERCISE: Find out how to intialize the SnowballStemmer, then tokenize\n", 665 | "# and stem the sentence below.\n", 666 | "#\n", 667 | "from nltk.stem.snowball import SnowballStemmer\n", 668 | "s = 'He told Dr. Lovato that he was done with the tests and would post the results shortly.'\n", 669 | "\n", 670 | "# Initialize the stemmer here.\n", 671 | "\n", 672 | "\n", 673 | "# Tokenize, stem, and print the tokens.\n" 674 | ], 675 | "execution_count": null, 676 | "outputs": [] 677 | }, 678 | { 679 | "cell_type": "code", 680 | "metadata": { 681 | "id": "wOXJI061npqN" 682 | }, 683 | "source": [ 684 | "#\n", 685 | "# EXERCISE: Find out how to add and remove your own stop words in spaCy. Add the \n", 686 | "# word 'told' as a stop word, test that it works, then remove it from \n", 687 | "# the stop word list.\n", 688 | "#" 689 | ], 690 | "execution_count": null, 691 | "outputs": [] 692 | }, 693 | { 694 | "cell_type": "code", 695 | "metadata": { 696 | "id": "RLcCYIy-lP1u" 697 | }, 698 | "source": [ 699 | "#\n", 700 | "# EXERCISE: Read up on how to add your own custom attributes to Token objects\n", 701 | "# and try adding one of your own.\n", 702 | "# https://spacy.io/usage/processing-pipelines#custom-components-attributes\n", 703 | "#" 704 | ], 705 | "execution_count": null, 706 | "outputs": [] 707 | }, 708 | { 709 | "cell_type": "markdown", 710 | "metadata": { 711 | "id": "o9HLYYUt1kOP" 712 | }, 713 | "source": [ 714 | "#Advanced Preprocessing\n", 715 | "\n", 716 | "## Part-of-Speech Tagging, Named Entity Recognition, and Parsing.\n", 717 | "\n", 718 | "Course module for this demo:\n", 719 | "https://www.nlpdemystified.org/course/advanced-preprocessing" 720 | ] 721 | }, 722 | { 723 | "cell_type": "markdown", 724 | "source": [ 725 | "**NOTE: If the notebook timed out, you may need to re-upgrade spaCy and re-install the language model as follows:**\n" 726 | ], 727 | "metadata": { 728 | "id": "DfBqaH9feymn" 729 | } 730 | }, 731 | { 732 | "cell_type": "code", 733 | "metadata": { 734 | "id": "9xOdySsre_sP" 735 | }, 736 | "source": [ 737 | "!pip install -U spacy==3.*\n", 738 | "!python -m spacy download en_core_web_sm\n", 739 | "!python -m spacy info" 740 | ], 741 | "execution_count": null, 742 | "outputs": [] 743 | }, 744 | { 745 | "cell_type": "markdown", 746 | "metadata": { 747 | "id": "Tr5SqjHwSWpI" 748 | }, 749 | "source": [ 750 | "spaCy performs Part-of-Speech (POS) tagging, Named Entity Recognition (NER), and parsing as part of its default pipeline in the *nlp* object." 
751 | ] 752 | }, 753 | { 754 | "cell_type": "code", 755 | "metadata": { 756 | "id": "shgWRMCq1kmy" 757 | }, 758 | "source": [ 759 | "import spacy\n", 760 | "nlp = spacy.load('en_core_web_sm')\n", 761 | "s = \"John watched an old movie at the cinema.\"\n", 762 | "doc = nlp(s)" 763 | ], 764 | "execution_count": null, 765 | "outputs": [] 766 | }, 767 | { 768 | "cell_type": "markdown", 769 | "metadata": { 770 | "id": "lwMQgciGb3or" 771 | }, 772 | "source": [ 773 | "### Part-of-Speech Tagging" 774 | ] 775 | }, 776 | { 777 | "cell_type": "markdown", 778 | "metadata": { 779 | "id": "AA9LDzULTW1_" 780 | }, 781 | "source": [ 782 | "POS tags can be accessed through the *pos_* attribute." 783 | ] 784 | }, 785 | { 786 | "cell_type": "code", 787 | "metadata": { 788 | "id": "0-9YRcSZ1kqq" 789 | }, 790 | "source": [ 791 | "[(t.text, t.pos_) for t in doc]" 792 | ], 793 | "execution_count": null, 794 | "outputs": [] 795 | }, 796 | { 797 | "cell_type": "markdown", 798 | "source": [ 799 | "To get a description for a POS tag, we can use _spacy.explain_." 800 | ], 801 | "metadata": { 802 | "id": "6UZcgwejnYm8" 803 | } 804 | }, 805 | { 806 | "cell_type": "code", 807 | "source": [ 808 | "spacy.explain('PROPN')" 809 | ], 810 | "metadata": { 811 | "id": "D9SXNvnmnW5e" 812 | }, 813 | "execution_count": null, 814 | "outputs": [] 815 | }, 816 | { 817 | "cell_type": "markdown", 818 | "metadata": { 819 | "id": "5_WbFDZ-Tqu9" 820 | }, 821 | "source": [ 822 | "The POS tags above are called *coarse-grained* tags. You can also access *fine-grained* tags through the *tag_* attribute. Fine-grained tags provide more detailed information about a token such as its tense and, if a word is a pronoun, what specific type of pronoun it is." 823 | ] 824 | }, 825 | { 826 | "cell_type": "code", 827 | "metadata": { 828 | "id": "1Z5oDzNr1kt2" 829 | }, 830 | "source": [ 831 | "[(t.text, t.tag_) for t in doc]" 832 | ], 833 | "execution_count": null, 834 | "outputs": [] 835 | }, 836 | { 837 | "cell_type": "markdown", 838 | "metadata": { 839 | "id": "jPOaN9yOUN-I" 840 | }, 841 | "source": [ 842 | "So **NNP** refers specifically to a _singular proper noun_, and **VBD** is a verb in *past tense*." 843 | ] 844 | }, 845 | { 846 | "cell_type": "code", 847 | "metadata": { 848 | "id": "pnfLDxoG1kxf" 849 | }, 850 | "source": [ 851 | "print(spacy.explain('NNP'))\n", 852 | "print(spacy.explain('VBD'))" 853 | ], 854 | "execution_count": null, 855 | "outputs": [] 856 | }, 857 | { 858 | "cell_type": "markdown", 859 | "metadata": { 860 | "id": "jte6K6HJb750" 861 | }, 862 | "source": [ 863 | "### Named Entity Recognition" 864 | ] 865 | }, 866 | { 867 | "cell_type": "markdown", 868 | "metadata": { 869 | "id": "2J2BjPyqWFEf" 870 | }, 871 | "source": [ 872 | "There are multiple ways to access named entities. One way is through the *ent_type_* attribute.\n" 873 | ] 874 | }, 875 | { 876 | "cell_type": "code", 877 | "metadata": { 878 | "id": "dWjNrX6koNVj" 879 | }, 880 | "source": [ 881 | "s = \"Volkswagen is developing an electric sedan which could potentially come to America next fall.\"\n", 882 | "doc = nlp(s)\n", 883 | "\n", 884 | "[(t.text, t.ent_type_) for t in doc]" 885 | ], 886 | "execution_count": null, 887 | "outputs": [] 888 | }, 889 | { 890 | "cell_type": "markdown", 891 | "metadata": { 892 | "id": "OJL4wfS6Wp9N" 893 | }, 894 | "source": [ 895 | "You can view spaCy's named entities annotations here:
\n", 896 | "https://spacy.io/api/annotation#named-entities\n", 897 | "\n", 898 | "or use _spacy.explain_." 899 | ] 900 | }, 901 | { 902 | "cell_type": "code", 903 | "source": [ 904 | "spacy.explain('GPE')" 905 | ], 906 | "metadata": { 907 | "id": "iu4OiPwDo9So" 908 | }, 909 | "execution_count": null, 910 | "outputs": [] 911 | }, 912 | { 913 | "cell_type": "markdown", 914 | "source": [ 915 | "You can also check if a token is an entity before printing it by checking whether the _ent_type_ (note the lack of trailing underscore) attribute is non-zero." 916 | ], 917 | "metadata": { 918 | "id": "F7p8IcNGpBTP" 919 | } 920 | }, 921 | { 922 | "cell_type": "code", 923 | "metadata": { 924 | "id": "0aBng8zdvjly" 925 | }, 926 | "source": [ 927 | "print([(t.text, t.ent_type_) for t in doc if t.ent_type != 0])" 928 | ], 929 | "execution_count": null, 930 | "outputs": [] 931 | }, 932 | { 933 | "cell_type": "markdown", 934 | "metadata": { 935 | "id": "5lNS65_2XJIY" 936 | }, 937 | "source": [ 938 | "Another way is through the _ents_ property of the **Doc** object. Here, we iterate through _ents_ and print the entity itself and its label." 939 | ] 940 | }, 941 | { 942 | "cell_type": "code", 943 | "metadata": { 944 | "id": "kSCzxs02vjdL" 945 | }, 946 | "source": [ 947 | "print([(ent.text, ent.label_) for ent in doc.ents])" 948 | ], 949 | "execution_count": null, 950 | "outputs": [] 951 | }, 952 | { 953 | "cell_type": "markdown", 954 | "metadata": { 955 | "id": "dHfZmta8XX9Y" 956 | }, 957 | "source": [ 958 | "Note how \"next fall\" is outputted above as a single span when you use _ents_.\n", 959 | "

\n", 960 | "You can also access the positions of entities:" 961 | ] 962 | }, 963 | { 964 | "cell_type": "code", 965 | "metadata": { 966 | "id": "mSzwRD0MvjTN" 967 | }, 968 | "source": [ 969 | "print([(ent.text, ent.label_, ent.start_char, ent.end_char) for ent in doc.ents])" 970 | ], 971 | "execution_count": null, 972 | "outputs": [] 973 | }, 974 | { 975 | "cell_type": "markdown", 976 | "metadata": { 977 | "id": "nvvQ9_7FdEHT" 978 | }, 979 | "source": [ 980 | "spaCy is bundled with visualizers for both parsing and named entities.
\n", 981 | "https://spacy.io/usage/visualizers\n", 982 | "

\n", 983 | "Here, we visualize the entities in our sample sentence." 984 | ] 985 | }, 986 | { 987 | "cell_type": "code", 988 | "metadata": { 989 | "id": "87eLywmVZCdw" 990 | }, 991 | "source": [ 992 | "from spacy import displacy\n", 993 | "\n", 994 | "# We need to set the 'jupyter' variable to True in order to output\n", 995 | "# the visualization directly. Otherwise, you'll get raw HTML.\n", 996 | "displacy.render(doc, style='ent', jupyter=True)" 997 | ], 998 | "execution_count": null, 999 | "outputs": [] 1000 | }, 1001 | { 1002 | "cell_type": "markdown", 1003 | "source": [ 1004 | "For domain-specific corpora, an NER tagger may need to be further fine-tuned. Here, we may want _The Martian_ tagged as a \"FILM\" (assuming that's our goal)." 1005 | ], 1006 | "metadata": { 1007 | "id": "PkNmqelTwTLZ" 1008 | } 1009 | }, 1010 | { 1011 | "cell_type": "code", 1012 | "metadata": { 1013 | "id": "0bcIaah29MME" 1014 | }, 1015 | "source": [ 1016 | "s = \"Ridley Scott directed The Martian.\"\n", 1017 | "doc = nlp(s)\n", 1018 | "displacy.render(doc, style='ent', jupyter=True)" 1019 | ], 1020 | "execution_count": null, 1021 | "outputs": [] 1022 | }, 1023 | { 1024 | "cell_type": "markdown", 1025 | "metadata": { 1026 | "id": "noGuG3JvcEfs" 1027 | }, 1028 | "source": [ 1029 | "### Parsing" 1030 | ] 1031 | }, 1032 | { 1033 | "cell_type": "markdown", 1034 | "metadata": { 1035 | "id": "ppWrztdJeO3J" 1036 | }, 1037 | "source": [ 1038 | "Let's first visualize a parse to make it easier to follow." 1039 | ] 1040 | }, 1041 | { 1042 | "cell_type": "code", 1043 | "metadata": { 1044 | "id": "xrvfA1TEvjJT" 1045 | }, 1046 | "source": [ 1047 | "s = \"She enrolled in the course at the university.\"\n", 1048 | "doc = nlp(s)\n", 1049 | "\n", 1050 | "# Note the 'style' argument is assigned a 'dep' flag this time around.\n", 1051 | "displacy.render(doc, style='dep', jupyter=True)" 1052 | ], 1053 | "execution_count": null, 1054 | "outputs": [] 1055 | }, 1056 | { 1057 | "cell_type": "markdown", 1058 | "metadata": { 1059 | "id": "gRN7_SQ-fO5H" 1060 | }, 1061 | "source": [ 1062 | "The visualization above is for a dependency parse (spaCy doesn't come with a constituency parser). For each pair of depencencies, spaCy visualizes the child (pointed to), the head (pointed from), and their relationship (the label arc). You can view the dependency annotations here:
\n", 1063 | "https://spacy.io/api/annotation#dependency-parsing\n", 1064 | "\n", 1065 | "You can also use *spacy.explain* to get information on a particular annotation." 1066 | ] 1067 | }, 1068 | { 1069 | "cell_type": "code", 1070 | "metadata": { 1071 | "id": "wvz1bLTZfqmv" 1072 | }, 1073 | "source": [ 1074 | "spacy.explain('nsubj')" 1075 | ], 1076 | "execution_count": null, 1077 | "outputs": [] 1078 | }, 1079 | { 1080 | "cell_type": "markdown", 1081 | "metadata": { 1082 | "id": "dCvHyqHggIpd" 1083 | }, 1084 | "source": [ 1085 | "The dependency labels themselves can be accessed through the *dep_* attribute." 1086 | ] 1087 | }, 1088 | { 1089 | "cell_type": "code", 1090 | "metadata": { 1091 | "id": "iX_BgpMVoNaj" 1092 | }, 1093 | "source": [ 1094 | "[(t.text, t.dep_) for t in doc]" 1095 | ], 1096 | "execution_count": null, 1097 | "outputs": [] 1098 | }, 1099 | { 1100 | "cell_type": "markdown", 1101 | "metadata": { 1102 | "id": "tt7zLq0ugR7O" 1103 | }, 1104 | "source": [ 1105 | "Note how the word 'enrolled' is the _ROOT_.\n", 1106 | "

\n", 1107 | "But the labels above don't show how the words are related to each other (the arcs). To get a better idea, you can print the head of each dependency." 1108 | ] 1109 | }, 1110 | { 1111 | "cell_type": "code", 1112 | "metadata": { 1113 | "id": "X15EOIq0oNfF" 1114 | }, 1115 | "source": [ 1116 | "[(t.text, t.dep_, t.head.text) for t in doc]" 1117 | ], 1118 | "execution_count": null, 1119 | "outputs": [] 1120 | }, 1121 | { 1122 | "cell_type": "markdown", 1123 | "metadata": { 1124 | "id": "OFXPL37Rg2xm" 1125 | }, 1126 | "source": [ 1127 | "### Using spaCy's Matcher to find patterns\n", 1128 | "spaCy comes with a host of pattern-matching functionality. Beyond regex, spaCy can match on a variety of attributes such as POS tags, entity labels, lemmas, dependencies, entire phrases, and a lot more. You can learn more here:
\n", 1129 | "https://spacy.io/usage/rule-based-matching
\n", 1130 | "https://explosion.ai/demos/matcher\n", 1131 | "

\n", 1132 | "Here, we try to search for patterns that may be useful for a hospitality bot." 1133 | ] 1134 | }, 1135 | { 1136 | "cell_type": "code", 1137 | "metadata": { 1138 | "id": "6v4hVnYmJuaK" 1139 | }, 1140 | "source": [ 1141 | "# The general Matcher is one of multiple matcher objects\n", 1142 | "# included with spaCy.\n", 1143 | "from spacy.matcher import Matcher\n", 1144 | "\n", 1145 | "# We initialize the Matcher with the spaCy vocab object, which contains\n", 1146 | "# words along with their labels and entities.\n", 1147 | "matcher = Matcher(nlp.vocab)\n", 1148 | "\n", 1149 | "s = \"I want to book a hotel room.\"\n", 1150 | "doc = nlp(s)\n", 1151 | "\n", 1152 | "# Patterns are expressed as an ordered sequence. Here, we're looking\n", 1153 | "# to match occurrences starting with a 'book' string followed by\n", 1154 | "# a determiner (DET) POS tag, then a noun POS tag.\n", 1155 | "# The OP key marks the match as optional in some way.\n", 1156 | "\n", 1157 | "# Here, the DET POS (marked with '?') will match 0 or 1 times, and\n", 1158 | "# the NOUN POS (marked with '+') will match 1 or more times.\n", 1159 | "# See this link for more information:\n", 1160 | "# https://spacy.io/usage/rule-based-matching#quantifiers\n", 1161 | "pattern = [\n", 1162 | " {'TEXT': 'book'},\n", 1163 | " {'POS': 'DET', 'OP': '?'},\n", 1164 | " {'POS': 'NOUN', 'OP': '+'},\n", 1165 | "]\n", 1166 | "\n", 1167 | "# We give our pattern a label and pass it to the matcher.\n", 1168 | "matcher.add('USER_INTENT', [pattern])\n", 1169 | "\n", 1170 | "# Run the matcher over the doc.\n", 1171 | "matches = matcher(doc)\n", 1172 | "\n", 1173 | "# For each match, the matcher returns a tuple specifying a match id, start, \n", 1174 | "# and end of the match.\n", 1175 | "print(\"Matches:\", [doc[start:end].text for match_id, start, end in matches])" 1176 | ], 1177 | "execution_count": null, 1178 | "outputs": [] 1179 | }, 1180 | { 1181 | "cell_type": "markdown", 1182 | "metadata": { 1183 | "id": "dygcIKF9plib" 1184 | }, 1185 | "source": [ 1186 | "The code above demonstrates the Matcher but is brittle.\n", 1187 | "- What if \"book\" is capitalized?\n", 1188 | "- What if a user types \"reserve\" instead of \"book\"?\n", 1189 | "- How can we match on \"hotel room\" as a compound noun?\n", 1190 | "- What if a user types \"book a flight and hotel room\"?\n", 1191 | "\n", 1192 | "Can you think of how you would handle these cases?\n", 1193 | "

\n", 1194 | "We could come up more rules to match different patterns, or perhaps just search for keywords based on POS and entities (e.g. a country) and present the user with a bunch of possible intentions and let them choose one, or have a bunch of different interpretation functions submit answers and select the most likely one based on what was historically accepted most often. We can also ask clarifying questions to narrow things down.\n", 1195 | "

\n", 1196 | "For example, for the last sentence, you could have a function scan through the **Doc** object's *noun_chunks* (phrases that have a noun as their head) and isolate keywords there along with potential conjunctions (e.g. \"and\").
\n", 1197 | "https://spacy.io/usage/linguistic-features#noun-chunks\n" 1198 | ] 1199 | }, 1200 | { 1201 | "cell_type": "code", 1202 | "metadata": { 1203 | "id": "xctXGD5K5Gvr" 1204 | }, 1205 | "source": [ 1206 | "doc = nlp(\"I want to book a flight and hotel room in Berlin.\")\n", 1207 | "for noun_phrase in doc.noun_chunks:\n", 1208 | " print(\"phrase: {}, root head: {}\".format(noun_phrase, noun_phrase.root.head))" 1209 | ], 1210 | "execution_count": null, 1211 | "outputs": [] 1212 | }, 1213 | { 1214 | "cell_type": "markdown", 1215 | "metadata": { 1216 | "id": "zMsHWX-9EvXU" 1217 | }, 1218 | "source": [ 1219 | "Using pure rules is a good place to start or prototype (especially if the domain is narrow with a tight set of use cases) but as our requirements get more sophisticated, we'll need to blend in other approaches such as classical models or perhaps deep learning (at the very least, maybe tune existing neural networks). spaCy's models can be updated with more examples to fine-tune predictions.
\n", 1220 | "https://spacy.io/usage/training
\n", 1221 | "
\n", 1222 | "We'll keep learning more approaches as the course progresses." 1223 | ] 1224 | }, 1225 | { 1226 | "cell_type": "markdown", 1227 | "metadata": { 1228 | "id": "knyuUv9cqsoY" 1229 | }, 1230 | "source": [ 1231 | "### Talkin' like Yoda\n", 1232 | "Languages like English are built around the _subject-verb-object_ pattern. But if you're familiar with Yoda from Star Wars, he famously speaks in an _object-subject-verb pattern_. Using the information in a dependency parse, we can turn basic English sentences into Yoda-speak." 1233 | ] 1234 | }, 1235 | { 1236 | "cell_type": "code", 1237 | "metadata": { 1238 | "id": "L9AydbEIqsRQ" 1239 | }, 1240 | "source": [ 1241 | "def yodize(s: str):\n", 1242 | " doc = nlp(s)\n", 1243 | " for t in doc:\n", 1244 | " if t.dep_ == \"ROOT\":\n", 1245 | "\n", 1246 | " # Assuming our sentence is of the form subject-verb-object, we take \n", 1247 | " # everything after the root (likely verb) and put it in front, and \n", 1248 | " # likewise take everything before the root, and put it after.\n", 1249 | " seq = [doc[t.i + 1: -1].text, doc[0: t.i].text, t.text + '.']\n", 1250 | " seq[0] = seq[0].capitalize()\n", 1251 | " print(' '.join(seq))" 1252 | ], 1253 | "execution_count": null, 1254 | "outputs": [] 1255 | }, 1256 | { 1257 | "cell_type": "code", 1258 | "metadata": { 1259 | "id": "uIa8Cziwqqnf" 1260 | }, 1261 | "source": [ 1262 | "yodize(\"I will fly to Texas.\")" 1263 | ], 1264 | "execution_count": null, 1265 | "outputs": [] 1266 | }, 1267 | { 1268 | "cell_type": "markdown", 1269 | "metadata": { 1270 | "id": "ofEnieaJZ8eX" 1271 | }, 1272 | "source": [ 1273 | "This is ok for simple sentences but starts getting weird with longer, more convoluted sentences. What are some ways you would improve this?" 1274 | ] 1275 | }, 1276 | { 1277 | "cell_type": "markdown", 1278 | "metadata": { 1279 | "id": "V92TUxWioNtq" 1280 | }, 1281 | "source": [ 1282 | "### Advanced Preprocessing Exercises" 1283 | ] 1284 | }, 1285 | { 1286 | "cell_type": "code", 1287 | "metadata": { 1288 | "id": "6Ltil7XSyzMe" 1289 | }, 1290 | "source": [ 1291 | "#\n", 1292 | "# EXERCISE: Learn how to extend spaCy's NER models. Specifically, how to add new\n", 1293 | "# entity names and entity types. 
\n", 1294 | "#" 1295 | ], 1296 | "execution_count": null, 1297 | "outputs": [] 1298 | }, 1299 | { 1300 | "cell_type": "code", 1301 | "metadata": { 1302 | "id": "1P58pxYkoN0j" 1303 | }, 1304 | "source": [ 1305 | "#\n", 1306 | "# EXERCISE: using doc.ents, identify and print the dates in this sentence.\n", 1307 | "# Expected output: ['Feb 13th', 'Feb 24th']\n", 1308 | "#\n", 1309 | "s = \"We'll be in Osaka on Feb 13th and leave on Feb 24th.\"\n", 1310 | "doc = nlp(s)\n", 1311 | "\n" 1312 | ], 1313 | "execution_count": null, 1314 | "outputs": [] 1315 | }, 1316 | { 1317 | "cell_type": "code", 1318 | "metadata": { 1319 | "id": "OVFi0bxCoN4N" 1320 | }, 1321 | "source": [ 1322 | "#\n", 1323 | "# EXERCISE: Read about spaCy's PhraseMatcher\n", 1324 | "# https://spacy.io/usage/rule-based-matching#phrasematcher\n", 1325 | "#\n", 1326 | "# Using the PhraseMatcher, find the start and end index of all occurrences \n", 1327 | "# of 'Caesar Augustus' and 'Roman Empire' (case-insensitive).\n", 1328 | "#\n", 1329 | "# Expected output: [(0, 2), (15, 17)]\n", 1330 | "#\n", 1331 | "from spacy.matcher import PhraseMatcher\n", 1332 | "s = \"Caesar Augustus was the founder of the Roman Principate (the first phase of the Roman Empire).\"\n", 1333 | "doc = nlp(s)\n" 1334 | ], 1335 | "execution_count": null, 1336 | "outputs": [] 1337 | }, 1338 | { 1339 | "cell_type": "markdown", 1340 | "metadata": { 1341 | "id": "3nhN7p9G8taJ" 1342 | }, 1343 | "source": [ 1344 | "# Additional Reading and Resources" 1345 | ] 1346 | }, 1347 | { 1348 | "cell_type": "markdown", 1349 | "metadata": { 1350 | "id": "bM4G2KWa8wXO" 1351 | }, 1352 | "source": [ 1353 | "Read through this page to learn more about spaCy's language processing pipeline including what's going on under the hood, how to create custom components, disable certain components (e.g. NER) when they're unneeded, optimization tips, and best practices:
\n", 1354 | "https://spacy.io/usage/processing-pipelines\n", 1355 | "

\n", 1356 | "Take the free and succinct spaCy course (available in multiple languages):
\n", 1357 | "https://course.spacy.io/\n" 1358 | ] 1359 | } 1360 | ] 1361 | } 1362 | -------------------------------------------------------------------------------- /notebooks/nlpdemystified_topic_modelling_lda.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": { 6 | "id": "view-in-github", 7 | "colab_type": "text" 8 | }, 9 | "source": [ 10 | "\"Open" 11 | ] 12 | }, 13 | { 14 | "cell_type": "markdown", 15 | "metadata": { 16 | "id": "ITy3IHHU95uS" 17 | }, 18 | "source": [ 19 | "# Natural Language Processing Demystified | Topic Modelling With Latent Dirichlet Allocation\n", 20 | "https://nlpdemystified.org
\n", 21 | "https://github.com/futuremojo/nlp-demystified

\n", 22 | "Course module for this demo: https://www.nlpdemystified.org/course/topic-modelling" 23 | ] 24 | }, 25 | { 26 | "cell_type": "markdown", 27 | "metadata": { 28 | "id": "aes1ZqWZTUa5" 29 | }, 30 | "source": [ 31 | "# spaCy upgrade and package installation." 32 | ] 33 | }, 34 | { 35 | "cell_type": "markdown", 36 | "metadata": { 37 | "id": "zSVwiu4YTVDa" 38 | }, 39 | "source": [ 40 | "At the time this notebook was created, spaCy had newer releases but Colab was still using version 2.x by default. So the first step is to upgrade spaCy.\n", 41 | "

\n", 42 | "**IMPORTANT**
\n", 43 | "If you're running this in the cloud rather than a local Jupyter server on your machine, then the notebook will *timeout* after a period of inactivity. If that happens and you don't reconnect in time, you will need to upgrade spaCy again and reinstall the requisite statistical package(s).\n", 44 | "

\n", 45 | "Refer to this link on how to run Colab notebooks locally on your machine to avoid this issue:
\n", 46 | "https://research.google.com/colaboratory/local-runtimes.html\n", 47 | "\n", 48 | "---\n", 49 | "> **In the course video, I ran this demo on a local Jupyter server to take advantage of multiprocessing capabilities. It's not necessary but I recommend it.**" 50 | ] 51 | }, 52 | { 53 | "cell_type": "code", 54 | "execution_count": null, 55 | "metadata": { 56 | "id": "_VstAdWMUWvp" 57 | }, 58 | "outputs": [], 59 | "source": [ 60 | "!pip install -U spacy==3.*\n", 61 | "!python -m spacy download en_core_web_sm\n", 62 | "!python -m spacy info" 63 | ] 64 | }, 65 | { 66 | "cell_type": "markdown", 67 | "metadata": { 68 | "id": "DKZgKn9TTc9Z" 69 | }, 70 | "source": [ 71 | "For topic modelling, we'll use **Gensim**, a popular topic modelling library originally authored by Radim Řehůřek. It has implementations for LDA and other models.
\n", 72 | "https://radimrehurek.com/gensim/index.html" 73 | ] 74 | }, 75 | { 76 | "cell_type": "code", 77 | "execution_count": null, 78 | "metadata": { 79 | "id": "gRg7SM8qEY7o" 80 | }, 81 | "outputs": [], 82 | "source": [ 83 | "# Upgrade gensim in case.\n", 84 | "# !pip install --upgrade numpy\n", 85 | "!pip install -U gensim==4.*" 86 | ] 87 | }, 88 | { 89 | "cell_type": "code", 90 | "execution_count": null, 91 | "metadata": { 92 | "id": "YcyuLLRk9Epv" 93 | }, 94 | "outputs": [], 95 | "source": [ 96 | "import matplotlib.pyplot as plt\n", 97 | "import pandas as pd\n", 98 | "import random\n", 99 | "import spacy\n", 100 | "\n", 101 | "from gensim import models, corpora\n", 102 | "from gensim import similarities\n", 103 | "from gensim.models.coherencemodel import CoherenceModel\n", 104 | "from wordcloud import WordCloud" 105 | ] 106 | }, 107 | { 108 | "cell_type": "markdown", 109 | "metadata": { 110 | "id": "aUqudgVeCfbM" 111 | }, 112 | "source": [ 113 | "# First pass at building an LDA topic model for our corpus" 114 | ] 115 | }, 116 | { 117 | "cell_type": "markdown", 118 | "metadata": { 119 | "id": "mHBDR4ZqVvwY" 120 | }, 121 | "source": [ 122 | "We'll use a corpus of over 90,000 CNN news articles originally compiled for training question answering models. I lightly processed them to remove some metadata and put them on Google Drive.\n", 123 | "([original source](https://cs.nyu.edu/~kcho/DMQA/))\n", 124 | "

\n", 125 | "To retrieve the corpus from Google Drive, we'll use the **gdown** library which I've already installed:
\n", 126 | "https://github.com/wkentaro/gdown" 127 | ] 128 | }, 129 | { 130 | "cell_type": "code", 131 | "execution_count": null, 132 | "metadata": { 133 | "id": "z2ozMbpWTzAc" 134 | }, 135 | "outputs": [], 136 | "source": [ 137 | "import locale\n", 138 | "def getpreferredencoding(do_setlocale = True):\n", 139 | " return \"UTF-8\"\n", 140 | " \n", 141 | "locale.getpreferredencoding = getpreferredencoding" 142 | ] 143 | }, 144 | { 145 | "cell_type": "code", 146 | "execution_count": null, 147 | "metadata": { 148 | "id": "Sfpdk5TATzAc" 149 | }, 150 | "outputs": [], 151 | "source": [ 152 | "!pip install --upgrade --no-cache-dir gdown" 153 | ] 154 | }, 155 | { 156 | "cell_type": "code", 157 | "execution_count": null, 158 | "metadata": { 159 | "id": "kO0I2ThbauR3" 160 | }, 161 | "outputs": [], 162 | "source": [ 163 | "# Download the CNN corpus.\n", 164 | "!gdown 'https://drive.google.com/uc?id=122fC9XpNwFKx0ryRVKJz5MWUTzA3Vpsf'" 165 | ] 166 | }, 167 | { 168 | "cell_type": "markdown", 169 | "metadata": { 170 | "id": "Gpu_Z5fdbYpU" 171 | }, 172 | "source": [ 173 | "The corpus is one large text file with each article in the corpus separated by an *@delimiter* string. We'll split the articles and place them in a list." 174 | ] 175 | }, 176 | { 177 | "cell_type": "code", 178 | "execution_count": null, 179 | "metadata": { 180 | "id": "JxGeaaj4auNO" 181 | }, 182 | "outputs": [], 183 | "source": [ 184 | "with open('cnn_articles.txt', 'r', encoding='utf8') as f:\n", 185 | " articles = f.read().split('@delimiter')" 186 | ] 187 | }, 188 | { 189 | "cell_type": "code", 190 | "execution_count": null, 191 | "metadata": { 192 | "id": "9QNyQo5gauIs" 193 | }, 194 | "outputs": [], 195 | "source": [ 196 | "print(len(articles))\n", 197 | "print(articles[0])" 198 | ] 199 | }, 200 | { 201 | "cell_type": "markdown", 202 | "metadata": { 203 | "id": "1lKZEP-J02TA" 204 | }, 205 | "source": [ 206 | "For this demo, we'll use a subset of the articles to speed things up but feel free to change the dataset size." 207 | ] 208 | }, 209 | { 210 | "cell_type": "code", 211 | "execution_count": null, 212 | "metadata": { 213 | "id": "YSfxX4tlbpa6" 214 | }, 215 | "outputs": [], 216 | "source": [ 217 | "DATASET_SIZE = 20000\n", 218 | "dataset = articles[:DATASET_SIZE]" 219 | ] 220 | }, 221 | { 222 | "cell_type": "markdown", 223 | "metadata": { 224 | "id": "qLkJz7BS6q-S" 225 | }, 226 | "source": [ 227 | "Just like in the [Text Classification with Naive Bayes](https://github.com/futuremojo/nlp-demystified/blob/main/notebooks/nlpdemystified_classification_naive_bayes.ipynb) demo, we'll start off with a *blank* tokenizer with no further pipeline components to see if that's good enough.\n", 228 | "

\n", 229 | "We'll filter out punctuations, newlines, and any tokens containing non-alphabetic characters." 230 | ] 231 | }, 232 | { 233 | "cell_type": "code", 234 | "execution_count": null, 235 | "metadata": { 236 | "id": "g6XVBLIl0FkX" 237 | }, 238 | "outputs": [], 239 | "source": [ 240 | "nlp = spacy.blank('en')\n", 241 | "\n", 242 | "def basic_filter(tokenized_doc):\n", 243 | " return [t.text for t in tokenized_doc if\n", 244 | " not t.is_punct and \\\n", 245 | " not t.is_space and \\\n", 246 | " t.is_alpha]" 247 | ] 248 | }, 249 | { 250 | "cell_type": "markdown", 251 | "metadata": { 252 | "id": "6siL9mNJxqix" 253 | }, 254 | "source": [ 255 | "In this demo, we'll leverage spaCy's **nlp.pipe** function which can process a corpus as a batch (or a series of batches) and use multiple processes. Here, we'll process our dataset as a batch across multiple processes, then run the tokenized **doc** objects through the *basic_filter* function. You can adjust **NUM_PROCESS** as you wish.

\n", 256 | "Take a look at these link for ways to further optimize spaCy's pipeline:
\n", 257 | "https://spacy.io/usage/processing-pipelines#processing
\n", 258 | "https://spacy.io/api/language#pipe

\n", 259 | "YouTube video from spaCy on using **nlp.pipe**: [Speed up spaCy pipelines via `nlp.pipe` - spaCy shorts](https://www.youtube.com/watch?v=OoZ-H_8vRnc)
\n", 260 | "Tuning **nlp.pipe**: https://stackoverflow.com/questions/65850018/processing-text-with-spacy-nlp-pipe" 261 | ] 262 | }, 263 | { 264 | "cell_type": "code", 265 | "execution_count": null, 266 | "metadata": { 267 | "id": "L1SVzXUzxtBe" 268 | }, 269 | "outputs": [], 270 | "source": [ 271 | "NUM_PROCESS = 4" 272 | ] 273 | }, 274 | { 275 | "cell_type": "code", 276 | "execution_count": null, 277 | "metadata": { 278 | "id": "nGYhfDXcz9_V" 279 | }, 280 | "outputs": [], 281 | "source": [ 282 | "%%time\n", 283 | "tokenized_articles = list(map(basic_filter, nlp.pipe(dataset, n_process=NUM_PROCESS)))" 284 | ] 285 | }, 286 | { 287 | "cell_type": "code", 288 | "execution_count": null, 289 | "metadata": { 290 | "id": "OYNK7Nd-cLsZ" 291 | }, 292 | "outputs": [], 293 | "source": [ 294 | "print(tokenized_articles[0])" 295 | ] 296 | }, 297 | { 298 | "cell_type": "markdown", 299 | "metadata": { 300 | "id": "DkopX2P4UqDK" 301 | }, 302 | "source": [ 303 | "To start off, we'll go with 20 topics. With most topic models including LDA, there isn't a clear recipe on how to pick the optimal number of topics. The nature and composition of the data (e.g. average length of each document) has a major impact on how many topics are *interpretable* by a human. Often, it's best to go with something reasonable to begin with and then try different topic numbers.

For this corpus, I'm going with 20 topics which is a small amount relative to the corpus size, but my reasoning is that since this is a general mainstream news corpus, the topics themselves are going to be fairly broad." 304 | ] 305 | }, 306 | { 307 | "cell_type": "code", 308 | "execution_count": null, 309 | "metadata": { 310 | "id": "o9RbTz3OXTuM" 311 | }, 312 | "outputs": [], 313 | "source": [ 314 | "NUM_TOPICS = 20" 315 | ] 316 | }, 317 | { 318 | "cell_type": "markdown", 319 | "metadata": { 320 | "id": "XgCbr9SJZxDQ" 321 | }, 322 | "source": [ 323 | "After tokenizing our text, the first step with Gensim is to construct a **Dictionary** mapping words to integer IDs.
\n", 324 | "https://radimrehurek.com/gensim/corpora/dictionary.html

\n", 325 | "This is similar to the *fit* step we took with scikit-learn's vectorizers." 326 | ] 327 | }, 328 | { 329 | "cell_type": "code", 330 | "execution_count": null, 331 | "metadata": { 332 | "id": "EP2db-H8cUwb" 333 | }, 334 | "outputs": [], 335 | "source": [ 336 | "# Build a Dictionary of word<-->id mappings.\n", 337 | "%%time\n", 338 | "dictionary = corpora.Dictionary(tokenized_articles)\n", 339 | "\n", 340 | "sample_token = 'news'\n", 341 | "print(f'Id for \\'{sample_token}\\' token: {dictionary.token2id[sample_token]}')" 342 | ] 343 | }, 344 | { 345 | "cell_type": "markdown", 346 | "metadata": { 347 | "id": "XyAHgUxEaXVf" 348 | }, 349 | "source": [ 350 | "The next step is to create a frequency bag-of-words from each article using the **dictionary**'s *doc2bow* method. This is similar to the *transform* step from scikit-learn's vectorizers.
\n", 351 | "https://radimrehurek.com/gensim/corpora/dictionary.html#gensim.corpora.dictionary.Dictionary.doc2bow" 352 | ] 353 | }, 354 | { 355 | "cell_type": "code", 356 | "execution_count": null, 357 | "metadata": { 358 | "id": "ZYpRy9W6cWAK" 359 | }, 360 | "outputs": [], 361 | "source": [ 362 | "%%time\n", 363 | "corpus_bow = [dictionary.doc2bow(article) for article in tokenized_articles]" 364 | ] 365 | }, 366 | { 367 | "cell_type": "markdown", 368 | "metadata": { 369 | "id": "KD9khr0RbBTq" 370 | }, 371 | "source": [ 372 | "Finally, we'll generate our base LDA model. Gensim's LDA model has a large number of optional parameters but for now, we'll keep it simple.
\n", 373 | "https://radimrehurek.com/gensim/models/ldamodel.html?highlight=lda#module-gensim.models.ldamodel" 374 | ] 375 | }, 376 | { 377 | "cell_type": "code", 378 | "execution_count": null, 379 | "metadata": { 380 | "id": "AP0MS3n7dxE_" 381 | }, 382 | "outputs": [], 383 | "source": [ 384 | "%%time\n", 385 | "lda_model = models.LdaModel(corpus=corpus_bow, num_topics=NUM_TOPICS, id2word=dictionary, random_state=1)" 386 | ] 387 | }, 388 | { 389 | "cell_type": "markdown", 390 | "metadata": { 391 | "id": "8ecFL_MSb9wp" 392 | }, 393 | "source": [ 394 | "Once our model is generated, we can view the topics inferred. By default, the model's *print_topics* method shows the top 20 topics and each topic's ten most significant words.
\n", 395 | "https://radimrehurek.com/gensim/models/ldamodel.html?highlight=lda#gensim.models.ldamodel.LdaModel.print_topics" 396 | ] 397 | }, 398 | { 399 | "cell_type": "code", 400 | "execution_count": null, 401 | "metadata": { 402 | "id": "lFTFPOb4eKUi" 403 | }, 404 | "outputs": [], 405 | "source": [ 406 | "lda_model.print_topics()" 407 | ] 408 | }, 409 | { 410 | "cell_type": "markdown", 411 | "metadata": { 412 | "id": "XYmfb5YGcSP8" 413 | }, 414 | "source": [ 415 | "The first pass is pretty awful. The topics are dominated by stop words such that they essentially look all the same. Let's see if we can do better." 416 | ] 417 | }, 418 | { 419 | "cell_type": "markdown", 420 | "metadata": { 421 | "id": "Kf0X-w47svTF" 422 | }, 423 | "source": [ 424 | "# Improving preprocessing for better results." 425 | ] 426 | }, 427 | { 428 | "cell_type": "markdown", 429 | "metadata": { 430 | "id": "AkI5wxWccz8U" 431 | }, 432 | "source": [ 433 | "For our next attempt, we'll\n", 434 | "- remove stop words using the default spaCy stopword list. Given this is a corpus of news articles, there may be other stop words to consider such as salutations (\"Mr\", \"Mrs\"), and words related to quotes and thoughts (\"say\", \"think\"). But for this, we'll stick to defaults unless we see reason to do otherwise.\n", 435 | "- consider only the words the spaCy tagger flags as *nouns, verbs,* and *adjectives*. Including words with only certain POS tags is a common approach to improving topic models.\n", 436 | "- take the lemma." 437 | ] 438 | }, 439 | { 440 | "cell_type": "code", 441 | "execution_count": null, 442 | "metadata": { 443 | "id": "i1emkEmz1pYd" 444 | }, 445 | "outputs": [], 446 | "source": [ 447 | "nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])\n", 448 | "\n", 449 | "def improved_filter(tokenized_doc):\n", 450 | " return [t.lemma_ for t in tokenized_doc if\n", 451 | " t.is_alpha and \\\n", 452 | " not t.is_punct and \\\n", 453 | " not t.is_space and \\\n", 454 | " not t.is_stop and \\\n", 455 | " t.pos_ in ['NOUN', 'VERB', 'ADJ']]" 456 | ] 457 | }, 458 | { 459 | "cell_type": "code", 460 | "execution_count": null, 461 | "metadata": { 462 | "id": "NLqKeoy9FQED" 463 | }, 464 | "outputs": [], 465 | "source": [ 466 | "# We'll need to retokenize everything and rebuild the BOWs. Because we're now\n", 467 | "# using the POS tagger, this will take longer. 
The \"w_pos\" in the variable \n", 468 | "# names below just means \"with part-of-speech\".\n", 469 | "%%time\n", 470 | "tokenized_articles_w_pos = list(map(improved_filter, nlp.pipe(dataset, n_process=NUM_PROCESS)))\n", 471 | "dictionary_w_pos = corpora.Dictionary(tokenized_articles_w_pos)\n", 472 | "corpus_bow_w_pos = [dictionary_w_pos.doc2bow(article) for article in tokenized_articles_w_pos]" 473 | ] 474 | }, 475 | { 476 | "cell_type": "code", 477 | "execution_count": null, 478 | "metadata": { 479 | "id": "5sNd_PZypu13" 480 | }, 481 | "outputs": [], 482 | "source": [ 483 | "%%time\n", 484 | "lda_model = models.LdaModel(corpus=corpus_bow_w_pos, num_topics=NUM_TOPICS, id2word=dictionary_w_pos, random_state=1)" 485 | ] 486 | }, 487 | { 488 | "cell_type": "code", 489 | "execution_count": null, 490 | "metadata": { 491 | "id": "aG5iFkrQqyx5" 492 | }, 493 | "outputs": [], 494 | "source": [ 495 | "lda_model.print_topics()" 496 | ] 497 | }, 498 | { 499 | "cell_type": "markdown", 500 | "metadata": { 501 | "id": "0ckwYRIqgOtB" 502 | }, 503 | "source": [ 504 | "This is better but there are still a few low-signal words dominating topics such as \"said\" lemmatized to \"say\" which makes sense for a news corpus. Perhaps trimming the vocabulary and tuning the model parameters themselves can lead to something more interpretable." 505 | ] 506 | }, 507 | { 508 | "cell_type": "markdown", 509 | "metadata": { 510 | "id": "w_8oBuWxvqdl" 511 | }, 512 | "source": [ 513 | "# Trimming low- and high-frequency words." 514 | ] 515 | }, 516 | { 517 | "cell_type": "markdown", 518 | "metadata": { 519 | "id": "KFDI1BSLgxJw" 520 | }, 521 | "source": [ 522 | "One thing we can try is filtering out rare and common tokens.\n", 523 | "https://radimrehurek.com/gensim/corpora/dictionary.html#gensim.corpora.dictionary.Dictionary.filter_extremes" 524 | ] 525 | }, 526 | { 527 | "cell_type": "code", 528 | "execution_count": null, 529 | "metadata": { 530 | "id": "4YQctCWVhnL6" 531 | }, 532 | "outputs": [], 533 | "source": [ 534 | "# The size of the dictionary before filtering.\n", 535 | "len(dictionary_w_pos)" 536 | ] 537 | }, 538 | { 539 | "cell_type": "markdown", 540 | "metadata": { 541 | "id": "k8tzEnZKyfeC" 542 | }, 543 | "source": [ 544 | "The filtering is a bit idiosyncratic. The lower bound is an *absolute* number, and the upper bound is a *percentage*. Here, we're saying filter out words which occur in fewer than N documents and more than M% of the documents." 545 | ] 546 | }, 547 | { 548 | "cell_type": "code", 549 | "execution_count": null, 550 | "metadata": { 551 | "id": "lyCG8tLIp2QC" 552 | }, 553 | "outputs": [], 554 | "source": [ 555 | "dictionary_w_pos.filter_extremes(no_below=5, no_above=0.5)" 556 | ] 557 | }, 558 | { 559 | "cell_type": "code", 560 | "execution_count": null, 561 | "metadata": { 562 | "id": "6AvypKffhpyR" 563 | }, 564 | "outputs": [], 565 | "source": [ 566 | "# The size of the dictionary after filtering.\n", 567 | "len(dictionary_w_pos)" 568 | ] 569 | }, 570 | { 571 | "cell_type": "code", 572 | "execution_count": null, 573 | "metadata": { 574 | "id": "uWomtBFzhuO5" 575 | }, 576 | "outputs": [], 577 | "source": [ 578 | "# Rebuild bag of words.\n", 579 | "corpus_bow_w_pos_filtered = [dictionary_w_pos.doc2bow(article) for article in tokenized_articles_w_pos]" 580 | ] 581 | }, 582 | { 583 | "cell_type": "markdown", 584 | "metadata": { 585 | "id": "hVtALC9yYB9Z" 586 | }, 587 | "source": [ 588 | "This time, we're passing additional arguments when building the model. 
*alpha* is the prior on the document-topic distribution, *eta* is the prior on the topic-word distribution (this was *beta* in the slides), and *passes* is the number of complete passes through the corpus during training.
\n", 589 | "https://radimrehurek.com/gensim/models/ldamodel.html?highlight=lda#module-gensim.models.ldamodel" 590 | ] 591 | }, 592 | { 593 | "cell_type": "code", 594 | "execution_count": null, 595 | "metadata": { 596 | "id": "S8_-mIdSvqUc" 597 | }, 598 | "outputs": [], 599 | "source": [ 600 | "%%time\n", 601 | "lda_model = models.ldamodel.LdaModel(corpus=corpus_bow_w_pos_filtered,\n", 602 | " id2word=dictionary_w_pos,\n", 603 | " num_topics=NUM_TOPICS,\n", 604 | " passes=10,\n", 605 | " alpha='auto',\n", 606 | " eta='auto',\n", 607 | " random_state=1)" 608 | ] 609 | }, 610 | { 611 | "cell_type": "code", 612 | "execution_count": null, 613 | "metadata": { 614 | "id": "iR2xCvNZvqDn" 615 | }, 616 | "outputs": [], 617 | "source": [ 618 | "lda_model.print_topics()" 619 | ] 620 | }, 621 | { 622 | "cell_type": "markdown", 623 | "metadata": { 624 | "id": "o3dM-87PxPSY" 625 | }, 626 | "source": [ 627 | "With improved filtering and low- and high-frequency words trimmed, we can see the topic-word distributions containing certain themes such as crime, travel, entertainment, etc.

\n", 628 | "**NOTE:** Remember that the topic model doesn't label topics for us. It just converges on collections of terms that likely form topics." 629 | ] 630 | }, 631 | { 632 | "cell_type": "markdown", 633 | "metadata": { 634 | "id": "HVipraNhL2fX" 635 | }, 636 | "source": [ 637 | "We set the training algorithm to learn priors for *alpha* and *eta*." 638 | ] 639 | }, 640 | { 641 | "cell_type": "code", 642 | "execution_count": null, 643 | "metadata": { 644 | "id": "5aimFUJGw4gT" 645 | }, 646 | "outputs": [], 647 | "source": [ 648 | "print(lda_model.alpha)\n", 649 | "print(lda_model.eta)" 650 | ] 651 | }, 652 | { 653 | "cell_type": "markdown", 654 | "metadata": { 655 | "id": "Aj86WOUlL0zj" 656 | }, 657 | "source": [ 658 | "The *alpha* and *eta* values the training algorithm arrived at are well below 1. This translates to most articles being dominated by one or just a few topics, and most topics being dominated by a handful of words." 659 | ] 660 | }, 661 | { 662 | "cell_type": "markdown", 663 | "metadata": { 664 | "id": "auRQbV8Ajaz8" 665 | }, 666 | "source": [ 667 | "We can look at the topic distribution comprising a given article using the model's *get_document_topics* method.
\n", 668 | "https://radimrehurek.com/gensim/models/ldamodel.html#gensim.models.ldamodel.LdaModel.get_document_topics" 669 | ] 670 | }, 671 | { 672 | "cell_type": "code", 673 | "execution_count": null, 674 | "metadata": { 675 | "id": "7naCCCX1Nb2Z" 676 | }, 677 | "outputs": [], 678 | "source": [ 679 | "article_idx = 0\n", 680 | "print(dataset[article_idx][:300])" 681 | ] 682 | }, 683 | { 684 | "cell_type": "code", 685 | "execution_count": null, 686 | "metadata": { 687 | "id": "DrGy3dO019LL" 688 | }, 689 | "outputs": [], 690 | "source": [ 691 | "# Return topic distribution for an article sorted by probability.\n", 692 | "topics = sorted(lda_model.get_document_topics(corpus_bow_w_pos_filtered[article_idx]), key=lambda tup: tup[1])[::-1]\n", 693 | "topics" 694 | ] 695 | }, 696 | { 697 | "cell_type": "markdown", 698 | "metadata": { 699 | "id": "85ztp46j13OL" 700 | }, 701 | "source": [ 702 | "We can get the top words (10 by default) representing a topic using the model's *show_topic* method.\n", 703 | "https://radimrehurek.com/gensim/models/ldamodel.html#gensim.models.ldamodel.LdaModel.show_topic" 704 | ] 705 | }, 706 | { 707 | "cell_type": "code", 708 | "execution_count": null, 709 | "metadata": { 710 | "id": "aoA0ATU016Tn" 711 | }, 712 | "outputs": [], 713 | "source": [ 714 | "# View the words of the top topic from the previous article.\n", 715 | "lda_model.show_topic(topics[0][0])" 716 | ] 717 | }, 718 | { 719 | "cell_type": "code", 720 | "execution_count": null, 721 | "metadata": { 722 | "id": "oKJ9pvL2HQ3q" 723 | }, 724 | "outputs": [], 725 | "source": [ 726 | "# View the words of the second-most prevalent topic from the previous article.\n", 727 | "lda_model.show_topic(topics[1][0])" 728 | ] 729 | }, 730 | { 731 | "cell_type": "markdown", 732 | "metadata": { 733 | "id": "VbsiukJ414XD" 734 | }, 735 | "source": [ 736 | "The function below takes a document index and returns a **DataFrame** containing:\n", 737 | "1. the topics comprising the document up to a minimum probability.\n", 738 | "2. the top words of each topic.\n", 739 | "
\n", 740 | "\n", 741 | "https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html" 742 | ] 743 | }, 744 | { 745 | "cell_type": "code", 746 | "execution_count": null, 747 | "metadata": { 748 | "id": "o8F3dsBk2Oh2" 749 | }, 750 | "outputs": [], 751 | "source": [ 752 | "def get_top_topics(article_idx, min_topic_prob):\n", 753 | "\n", 754 | " # Sort from highest to lowest topic probability.\n", 755 | " topic_prob_pairs = sorted(lda_model.get_document_topics(corpus_bow_w_pos_filtered[article_idx],\n", 756 | " minimum_probability=min_topic_prob),\n", 757 | " key=lambda tup: tup[1])[::-1]\n", 758 | "\n", 759 | " word_prob_pairs = [lda_model.show_topic(pair[0]) for pair in topic_prob_pairs]\n", 760 | " topic_words = [[pair[0] for pair in collection] for collection in word_prob_pairs]\n", 761 | "\n", 762 | " data = {\n", 763 | " 'Major Topics': topic_prob_pairs,\n", 764 | " 'Topic Words': topic_words\n", 765 | " }\n", 766 | "\n", 767 | " return pd.DataFrame(data)\n" 768 | ] 769 | }, 770 | { 771 | "cell_type": "code", 772 | "execution_count": null, 773 | "metadata": { 774 | "id": "y7HwvNlH3KNL" 775 | }, 776 | "outputs": [], 777 | "source": [ 778 | "pd.set_option('max_colwidth', 600)\n", 779 | "snippet_length = 300\n", 780 | "min_topic_prob = 0.25\n", 781 | "\n", 782 | "article_idx = 1\n", 783 | "print(dataset[article_idx][:snippet_length])\n", 784 | "get_top_topics(article_idx, min_topic_prob)" 785 | ] 786 | }, 787 | { 788 | "cell_type": "code", 789 | "execution_count": null, 790 | "metadata": { 791 | "id": "RgbK19OAYD6T" 792 | }, 793 | "outputs": [], 794 | "source": [ 795 | "article_idx = 10\n", 796 | "print(dataset[article_idx][:snippet_length])\n", 797 | "get_top_topics(article_idx, min_topic_prob)" 798 | ] 799 | }, 800 | { 801 | "cell_type": "code", 802 | "execution_count": null, 803 | "metadata": { 804 | "id": "ucpGCL0cYD2V" 805 | }, 806 | "outputs": [], 807 | "source": [ 808 | "article_idx = 100\n", 809 | "print(dataset[article_idx][:snippet_length])\n", 810 | "get_top_topics(article_idx, min_topic_prob)" 811 | ] 812 | }, 813 | { 814 | "cell_type": "code", 815 | "execution_count": null, 816 | "metadata": { 817 | "id": "KzeM3QEbYDyi" 818 | }, 819 | "outputs": [], 820 | "source": [ 821 | "article_idx = 1000\n", 822 | "print(dataset[article_idx][:snippet_length])\n", 823 | "get_top_topics(article_idx, min_topic_prob)" 824 | ] 825 | }, 826 | { 827 | "cell_type": "code", 828 | "execution_count": null, 829 | "metadata": { 830 | "id": "vT5gxoP9YDuv" 831 | }, 832 | "outputs": [], 833 | "source": [ 834 | "article_idx = 10000\n", 835 | "print(dataset[article_idx][:snippet_length])\n", 836 | "get_top_topics(article_idx, 0.25)" 837 | ] 838 | }, 839 | { 840 | "cell_type": "markdown", 841 | "metadata": { 842 | "id": "sCr_9vWPvuU9" 843 | }, 844 | "source": [ 845 | "The results of this model look the best so far and we can see a human-interpretable link between the distribution of topics in a document, the distribution of words in each topic, and the content of the document itself." 846 | ] 847 | }, 848 | { 849 | "cell_type": "markdown", 850 | "metadata": { 851 | "id": "xRCf02nVvpfW" 852 | }, 853 | "source": [ 854 | "# Evaluation and Visualization" 855 | ] 856 | }, 857 | { 858 | "cell_type": "markdown", 859 | "metadata": { 860 | "id": "_TXlK5gebUjB" 861 | }, 862 | "source": [ 863 | "## Measuring topic models with coherence.\n", 864 | "\n", 865 | "If a topic is a mixture of particular words, then one way to measure how semantically coherent a topic is to calculate co-occurrence among the words. 
That is, how often the top words in a topic co-occur across the documents versus how often they occur independently.\n",
 866 | "\n",
 867 | "Gensim's **Coherence Model** offers coherence implemented as a pipeline:
\n", 868 | "https://radimrehurek.com/gensim/models/coherencemodel.html\n", 869 | "
\n", 870 | "
\n", 871 | "See this paper for a detailed description of the pipeline as well as different co-occurence measures proposed:
\n", 872 | "http://svn.aksw.org/papers/2015/WSDM_Topic_Evaluation/public.pdf\n", 873 | "
\n", 874 | "
\n", 875 | "Topic model evaluation is a difficult subject with no clear quantitative approach and is still debated. A higher (or lower score depending on the measure) doesn't necessarily translate to a higher *qualitative* model. That is, the score a human would give looking at the topic words and how interpretable they are.

\n", 876 | "It's possible to favour a poorer scoring model because it serves a particular purpose better. Perhaps it's better to score the effectiveness of topic models based on performance in downstream tasks? See these videos for the problems with quantitative topic model evaluation:
\n", 877 | "[Matti Lyra - Evaluating Topic Models](https://www.youtube.com/watch?v=UkmIljRIG_M)
\n", 878 | "[Is Topic Model Evaluation Broken? The Incoherence of Coherence](https://www.youtube.com/watch?v=4KO2TO_cm2I)" 879 | ] 880 | }, 881 | { 882 | "cell_type": "code", 883 | "execution_count": null, 884 | "metadata": { 885 | "id": "nHBp-ZazNZRJ" 886 | }, 887 | "outputs": [], 888 | "source": [ 889 | "%%time\n", 890 | "coherence_model_lda = CoherenceModel(model=lda_model, texts=tokenized_articles_w_pos, dictionary=dictionary_w_pos, coherence='u_mass')\n", 891 | "coherence_lda = coherence_model_lda.get_coherence()\n", 892 | "print('\\nCoherence Score: ', coherence_lda)" 893 | ] 894 | }, 895 | { 896 | "cell_type": "markdown", 897 | "metadata": { 898 | "id": "Kfunq1Su8d1r" 899 | }, 900 | "source": [ 901 | "## Human evaluation\n", 902 | "Because the quantitative metrics aren't entirely correlated with quality, human judgment still plays a large role in topic model evaluation.\n" 903 | ] 904 | }, 905 | { 906 | "cell_type": "markdown", 907 | "metadata": { 908 | "id": "crPK6zKfC1gS" 909 | }, 910 | "source": [ 911 | "We can get someone to look at the topic words to see how interpretable they are. " 912 | ] 913 | }, 914 | { 915 | "cell_type": "markdown", 916 | "metadata": { 917 | "id": "GRMHpNksr0bQ" 918 | }, 919 | "source": [ 920 | "There are also subjective tests like **word intrusion** and **topic intrusion**.\n", 921 | "

\n", 922 | "**Word intrusion** is taking words which belong to a topic, injecting a word from another topic into the collection, and seeing whether a human can easily identify the intruder word. The more easily the intruder word is spotted, the more well-formed the topic. For example, which word doesn't belong in this topic?
\n", 923 | "*{apple, lemon, tomato, horse, grape}*" 924 | ] 925 | }, 926 | { 927 | "cell_type": "markdown", 928 | "metadata": { 929 | "id": "qYfEicOH8d1t" 930 | }, 931 | "source": [ 932 | "We can also visualize them with word clouds." 933 | ] 934 | }, 935 | { 936 | "cell_type": "code", 937 | "execution_count": null, 938 | "metadata": { 939 | "id": "4qY6uzIW8d1t" 940 | }, 941 | "outputs": [], 942 | "source": [ 943 | "def render_word_cloud(model, rows, cols, max_words):\n", 944 | " word_cloud = WordCloud(background_color='white', max_words=max_words, prefer_horizontal=1.0)\n", 945 | " fig, axes = plt.subplots(rows, cols, figsize=(15,15))\n", 946 | "\n", 947 | " for i, ax in enumerate(axes.flatten()):\n", 948 | " fig.add_subplot(ax)\n", 949 | " topic_words = dict(model.show_topic(i))\n", 950 | " word_cloud.generate_from_frequencies(topic_words)\n", 951 | " plt.gca().imshow(word_cloud, interpolation='bilinear')\n", 952 | " plt.gca().set_title('Topic {id}'.format(id=i))\n", 953 | " plt.gca().axis('off')\n", 954 | "\n", 955 | " plt.axis('off')\n", 956 | " plt.show()" 957 | ] 958 | }, 959 | { 960 | "cell_type": "code", 961 | "execution_count": null, 962 | "metadata": { 963 | "id": "F3e6HjGtzNnG" 964 | }, 965 | "outputs": [], 966 | "source": [ 967 | "# Here we'll visualize the first nine topics.\n", 968 | "render_word_cloud(lda_model, 3, 3, 10)" 969 | ] 970 | }, 971 | { 972 | "cell_type": "markdown", 973 | "metadata": { 974 | "id": "FpBxrPcGEOcN" 975 | }, 976 | "source": [ 977 | "# Finding similar documents." 978 | ] 979 | }, 980 | { 981 | "cell_type": "markdown", 982 | "metadata": { 983 | "id": "sJtKEzTE8TSE" 984 | }, 985 | "source": [ 986 | "Gensim has a **similarities** module which can build an index for a given set of documents. Here, we're using **MatrixSimilarity** which computes cosine similarity across a corpus and stores them in an index.
\n", 987 | "https://radimrehurek.com/gensim/similarities/docsim.html#gensim.similarities.docsim.MatrixSimilarity" 988 | ] 989 | }, 990 | { 991 | "cell_type": "code", 992 | "execution_count": null, 993 | "metadata": { 994 | "id": "vq9EYQJWkib2" 995 | }, 996 | "outputs": [], 997 | "source": [ 998 | "lda_index = similarities.MatrixSimilarity(lda_model[corpus_bow_w_pos_filtered], num_features=len(dictionary_w_pos))" 999 | ] 1000 | }, 1001 | { 1002 | "cell_type": "markdown", 1003 | "metadata": { 1004 | "id": "bHorc8fN9VHu" 1005 | }, 1006 | "source": [ 1007 | "Here's a utility function to help retrieve the *first_m_words* of the *top_n* most similar documents. If you're curious about the *\\_\\_getitem\\__* method on the LDA Model class, you can find the code here:
\n", 1008 | "https://github.com/RaRe-Technologies/gensim/blob/master/gensim/models/ldamodel.py" 1009 | ] 1010 | }, 1011 | { 1012 | "cell_type": "code", 1013 | "execution_count": null, 1014 | "metadata": { 1015 | "id": "x6hIGoVYF6Rb" 1016 | }, 1017 | "outputs": [], 1018 | "source": [ 1019 | "def get_similar_articles(index, model, article_bow, top_n=5, first_m_words=300):\n", 1020 | " # model[article_bow] retrieves the topic distribution for the BOW.\n", 1021 | " # index[model[article_bow] compares the topic distribution for the BOW against the similarity index previously computed.\n", 1022 | " similar_docs = index[model[article_bow]]\n", 1023 | " top_n_docs = sorted(enumerate(similar_docs), key=lambda item: -item[1])[1:top_n+1]\n", 1024 | " \n", 1025 | " # Return a list of tuples with each tuple: (article id, similarity score, first_m_words of article)\n", 1026 | " return list(map(lambda entry: (entry[0], entry[1], articles[entry[0]][:first_m_words]), top_n_docs))" 1027 | ] 1028 | }, 1029 | { 1030 | "cell_type": "code", 1031 | "execution_count": null, 1032 | "metadata": { 1033 | "id": "c4GV6jxI-Q8i" 1034 | }, 1035 | "outputs": [], 1036 | "source": [ 1037 | "article_idx = 0\n", 1038 | "print(dataset[article_idx][:snippet_length], '\\n')\n", 1039 | "get_similar_articles(lda_index, lda_model, corpus_bow_w_pos_filtered[article_idx])" 1040 | ] 1041 | }, 1042 | { 1043 | "cell_type": "code", 1044 | "execution_count": null, 1045 | "metadata": { 1046 | "id": "d6rlTxY5zlCe" 1047 | }, 1048 | "outputs": [], 1049 | "source": [ 1050 | "article_idx = 10\n", 1051 | "print(dataset[article_idx][:snippet_length], '\\n')\n", 1052 | "get_similar_articles(lda_index, lda_model, corpus_bow_w_pos_filtered[article_idx])" 1053 | ] 1054 | }, 1055 | { 1056 | "cell_type": "code", 1057 | "execution_count": null, 1058 | "metadata": { 1059 | "id": "JQyVGB1Kzk7Y" 1060 | }, 1061 | "outputs": [], 1062 | "source": [ 1063 | "article_idx = 100\n", 1064 | "print(dataset[article_idx][:snippet_length], '\\n')\n", 1065 | "get_similar_articles(lda_index, lda_model, corpus_bow_w_pos_filtered[article_idx])" 1066 | ] 1067 | }, 1068 | { 1069 | "cell_type": "markdown", 1070 | "metadata": { 1071 | "id": "-8eWhXxhBEl7" 1072 | }, 1073 | "source": [ 1074 | "We can also query for documents similar to new, unseen documents. Below are short, actual blurbs from 2021 involving stock options and crime. Keep in mind that if this were a really old news corpus, then excerpts about cryptocurrencies and social media probably won't lead to good matches. This is another aspect to keep in mind when thinking about your data and use cases." 1075 | ] 1076 | }, 1077 | { 1078 | "cell_type": "code", 1079 | "execution_count": null, 1080 | "metadata": { 1081 | "id": "rs4DF3CqODIp" 1082 | }, 1083 | "outputs": [], 1084 | "source": [ 1085 | "test_article = \"Capricorn Business Acquisitions Inc. 
(TSXV: CAK.H) (the “Company“) is pleased to announce that its board has approved the issuance of 70,000 stock options (“Stock Options“) to directors on April 19, 2020.\"\n", 1086 | "\n", 1087 | "article_tokens = list(map(improved_filter, [nlp(test_article)]))[0]\n", 1088 | "article_bow = dictionary_w_pos.doc2bow(article_tokens)\n", 1089 | "get_similar_articles(lda_index, lda_model, article_bow)" 1090 | ] 1091 | }, 1092 | { 1093 | "cell_type": "code", 1094 | "execution_count": null, 1095 | "metadata": { 1096 | "id": "NpejaKM51Sos" 1097 | }, 1098 | "outputs": [], 1099 | "source": [ 1100 | "test_article = \"DEA agent sentenced to 12 years in prison for conspiring with Colombian drug cartel.\"\n", 1101 | "\n", 1102 | "article_tokens = list(map(improved_filter, [nlp(test_article)]))[0]\n", 1103 | "article_bow = dictionary_w_pos.doc2bow(article_tokens)\n", 1104 | "get_similar_articles(lda_index, lda_model, article_bow)" 1105 | ] 1106 | }, 1107 | { 1108 | "cell_type": "markdown", 1109 | "metadata": { 1110 | "id": "NMh0xLVmuKwW" 1111 | }, 1112 | "source": [ 1113 | "# Closing Thoughts and things to explore.\n", 1114 | "- Gensim infers topic and word distributions through [Variational Bayes (VB)](https://en.wikipedia.org/wiki/Variational_Bayesian_methods), not Gibbs Sampling. From the topics I've seen, Gibbs Sampling tends to lead to more interpretable topics, but VB is faster and Gensim offers the additional benefits of streaming documents, online learning, and training across a cluster of machines.\n", 1115 | "- Another topic modelling library, [Mallet](http://mallet.cs.umass.edu/), infers through Gibbs Sampling but is Java-based. Unfortunately, Gensim 4.0+ no longer offers a wrapper around Mallet. But if you're comfortable with Java, it may be worth exploring.\n", 1116 | "- Scikit-learn offers an [LDA model](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.LatentDirichletAllocation.html). Maybe as an exercise, try using that LDA model on the [20 Newsgroups](https://scikit-learn.org/0.19/datasets/twenty_newsgroups.html) dataset (or ideally, a dataset with longer documents).\n", 1117 | "- [pyLDAvis](https://github.com/bmabey/pyLDAvis) is another means of visualizing topic models. You can see it in action in this [notebook](https://nbviewer.jupyter.org/github/bmabey/pyLDAvis/blob/master/notebooks/pyLDAvis_overview.ipynb). See if you can get it working on your own topic model.\n", 1118 | "- LDA tends to work better on longer documents, and whether a topic model is \"good\" depends on your use case rather than strictly on a quantitative metric." 
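As a hedged starting point for the scikit-learn exercise suggested above (none of this is from the original notebook, and the preprocessing choices are assumptions), the rough shape could look like this:

```python
# Sketch only: a minimal scikit-learn LDA starting point for 20 Newsgroups.
from sklearn.datasets import fetch_20newsgroups
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

newsgroups = fetch_20newsgroups(remove=('headers', 'footers', 'quotes'))
vectorizer = CountVectorizer(max_df=0.5, min_df=5, stop_words='english')
X = vectorizer.fit_transform(newsgroups.data)

sk_lda = LatentDirichletAllocation(n_components=20, random_state=1)
sk_lda.fit(X)

# Print the top ten words for the first five topics.
terms = vectorizer.get_feature_names_out()
for idx, topic in enumerate(sk_lda.components_[:5]):
    print(idx, [terms[i] for i in topic.argsort()[-10:][::-1]])
```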
1119 | ] 1120 | } 1121 | ], 1122 | "metadata": { 1123 | "accelerator": "GPU", 1124 | "colab": { 1125 | "name": "nlpdemystified-topic-modelling-lda.ipynb", 1126 | "private_outputs": true, 1127 | "provenance": [], 1128 | "include_colab_link": true 1129 | }, 1130 | "kernelspec": { 1131 | "display_name": "Python 3 (ipykernel)", 1132 | "language": "python", 1133 | "name": "python3" 1134 | }, 1135 | "language_info": { 1136 | "codemirror_mode": { 1137 | "name": "ipython", 1138 | "version": 3 1139 | }, 1140 | "file_extension": ".py", 1141 | "mimetype": "text/x-python", 1142 | "name": "python", 1143 | "nbconvert_exporter": "python", 1144 | "pygments_lexer": "ipython3", 1145 | "version": "3.8.13" 1146 | } 1147 | }, 1148 | "nbformat": 4, 1149 | "nbformat_minor": 0 1150 | } 1151 | -------------------------------------------------------------------------------- /notebooks/nlpdemystified_vectorization.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "nbformat": 4, 3 | "nbformat_minor": 0, 4 | "metadata": { 5 | "colab": { 6 | "name": "nlpdemystified-vectorization.ipynb", 7 | "private_outputs": true, 8 | "provenance": [], 9 | "include_colab_link": true 10 | }, 11 | "kernelspec": { 12 | "name": "python3", 13 | "display_name": "Python 3" 14 | } 15 | }, 16 | "cells": [ 17 | { 18 | "cell_type": "markdown", 19 | "metadata": { 20 | "id": "view-in-github", 21 | "colab_type": "text" 22 | }, 23 | "source": [ 24 | "\"Open" 25 | ] 26 | }, 27 | { 28 | "cell_type": "markdown", 29 | "metadata": { 30 | "id": "F50G99nH112P" 31 | }, 32 | "source": [ 33 | "# Natural Language Processing Demystified | Simple Vectorization\n", 34 | "https://nlpdemystified.org
\n", 35 | "https://github.com/futuremojo/nlp-demystified" 36 | ] 37 | }, 38 | { 39 | "cell_type": "markdown", 40 | "metadata": { 41 | "id": "t9x6fL6L3zsb" 42 | }, 43 | "source": [ 44 | "### spaCy upgrade and package installation." 45 | ] 46 | }, 47 | { 48 | "cell_type": "markdown", 49 | "metadata": { 50 | "id": "88uW0zDh4BkP" 51 | }, 52 | "source": [ 53 | "At the time this notebook was created, spaCy had newer releases but Colab was still using version 2.x by default. So the first step is to upgrade spaCy and download a statisical language model.\n", 54 | "

\n", 55 | "**IMPORTANT**
\n", 56 | "If you're running this in the cloud rather than using a local Jupyter server on your machine, then the notebook will **timeout** after a period of inactivity. If that happens and you don't reconnect in time, you will need to upgrade spaCy again and reinstall the requisite statistical packages.\n", 57 | "

\n", 58 | "Refer to this link on how to run Colab notebooks locally on your machine to avoid this issue:
\n", 59 | "https://research.google.com/colaboratory/local-runtimes.html" 60 | ] 61 | }, 62 | { 63 | "cell_type": "code", 64 | "metadata": { 65 | "id": "THBGyQba4Bcm" 66 | }, 67 | "source": [ 68 | "!pip install -U spacy==3.*\n", 69 | "!python -m spacy download en_core_web_sm\n", 70 | "!python -m spacy info" 71 | ], 72 | "execution_count": null, 73 | "outputs": [] 74 | }, 75 | { 76 | "cell_type": "markdown", 77 | "metadata": { 78 | "id": "t81VT9JboTzt" 79 | }, 80 | "source": [ 81 | "# Basic Bag-of-Words (BOW)\n", 82 | "\n", 83 | "Course module for this demo: https://www.nlpdemystified.org/course/basic-bag-of-words" 84 | ] 85 | }, 86 | { 87 | "cell_type": "code", 88 | "source": [ 89 | "import spacy\n", 90 | "\n", 91 | "from scipy import spatial\n", 92 | "from sklearn.feature_extraction.text import CountVectorizer\n", 93 | "from sklearn.metrics.pairwise import cosine_similarity" 94 | ], 95 | "metadata": { 96 | "id": "u_EAof8njfHz" 97 | }, 98 | "execution_count": null, 99 | "outputs": [] 100 | }, 101 | { 102 | "cell_type": "markdown", 103 | "metadata": { 104 | "id": "b1IVdG29wyJ7" 105 | }, 106 | "source": [ 107 | "## Plain frequency BOW" 108 | ] 109 | }, 110 | { 111 | "cell_type": "code", 112 | "metadata": { 113 | "id": "2fwfWQDVyJpY" 114 | }, 115 | "source": [ 116 | "# A corpus of sentences.\n", 117 | "corpus = [\n", 118 | " \"Red Bull drops hint on F1 engine.\",\n", 119 | " \"Honda exits F1, leaving F1 partner Red Bull.\",\n", 120 | " \"Hamilton eyes record eighth F1 title.\",\n", 121 | " \"Aston Martin announces sponsor.\"\n", 122 | "]" 123 | ], 124 | "execution_count": null, 125 | "outputs": [] 126 | }, 127 | { 128 | "cell_type": "markdown", 129 | "metadata": { 130 | "id": "ILvS020Zzm6F" 131 | }, 132 | "source": [ 133 | "We want to build a basic bag-of-words (BOW) representation of our corpus. Based on what you now know from the lesson, you can probably do this from scratch using dictionaries and lists (and maybe that's a good exercise). Fortunately, there are robust libraries which make it easy.\n", 134 | "\n", 135 | "We can use the scikit-learn **CountVectorizer** which takes a collection of text documents and creates a matrix of token counts:
\n", 136 | "https://scikit-learn.org/stable/index.html
\n", 137 | "https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html\n", 138 | "\n", 139 | "\n" 140 | ] 141 | }, 142 | { 143 | "cell_type": "code", 144 | "metadata": { 145 | "id": "IRhJPxbHwuj_" 146 | }, 147 | "source": [ 148 | "vectorizer = CountVectorizer()" 149 | ], 150 | "execution_count": null, 151 | "outputs": [] 152 | }, 153 | { 154 | "cell_type": "markdown", 155 | "metadata": { 156 | "id": "iAphZMVPBX9P" 157 | }, 158 | "source": [ 159 | "The *fit_transform* method does two things:\n", 160 | "1. It learns a vocabulary dictionary from the corpus.\n", 161 | "2. It returns a matrix where each row represents a document and each column represents a token (i.e. term).
\n", 162 | "\n", 163 | "https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html#sklearn.feature_extraction.text.CountVectorizer.fit_transform\n" 164 | ] 165 | }, 166 | { 167 | "cell_type": "code", 168 | "metadata": { 169 | "id": "-5wi4_C7BAWv" 170 | }, 171 | "source": [ 172 | "bow = vectorizer.fit_transform(corpus)" 173 | ], 174 | "execution_count": null, 175 | "outputs": [] 176 | }, 177 | { 178 | "cell_type": "markdown", 179 | "metadata": { 180 | "id": "z3Bp1XNcF1FQ" 181 | }, 182 | "source": [ 183 | "We can take a look at the features and vocabulary dictionary. Notice the **CountVectorizer** took care of tokenization for us. It also removed punctuation and lower-cased everything." 184 | ] 185 | }, 186 | { 187 | "cell_type": "code", 188 | "metadata": { 189 | "id": "fQbqvLgVF8B7" 190 | }, 191 | "source": [ 192 | "# View features (tokens).\n", 193 | "print(vectorizer.get_feature_names_out())\n", 194 | "\n", 195 | "# View vocabulary dictionary.\n", 196 | "vectorizer.vocabulary_" 197 | ], 198 | "execution_count": null, 199 | "outputs": [] 200 | }, 201 | { 202 | "cell_type": "markdown", 203 | "metadata": { 204 | "id": "7dmNUkZeExam" 205 | }, 206 | "source": [ 207 | "Specifically, the **CountVectorizer** generates a sparse matrix using an efficient, compressed representation. The sparse matrix object includes a number of useful methods:\n", 208 | "https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.csr_matrix.html" 209 | ] 210 | }, 211 | { 212 | "cell_type": "code", 213 | "metadata": { 214 | "id": "Lug2-xnAExsb" 215 | }, 216 | "source": [ 217 | "print(type(bow))" 218 | ], 219 | "execution_count": null, 220 | "outputs": [] 221 | }, 222 | { 223 | "cell_type": "markdown", 224 | "metadata": { 225 | "id": "3bywJ0XnGKPQ" 226 | }, 227 | "source": [ 228 | "If we look at the raw structure, we'll see tuples where the first element represents the document, and the second element represents a token ID. It's then followed by a count of that token. So in the second document (index 1), token 8 (\"f1\") occurs twice." 229 | ] 230 | }, 231 | { 232 | "cell_type": "code", 233 | "metadata": { 234 | "id": "At6Gt4bsEx2D" 235 | }, 236 | "source": [ 237 | "print(bow)" 238 | ], 239 | "execution_count": null, 240 | "outputs": [] 241 | }, 242 | { 243 | "cell_type": "markdown", 244 | "metadata": { 245 | "id": "mv1N1Io2EyAb" 246 | }, 247 | "source": [ 248 | "Before we explore further, we want to make a few modifications.\n", 249 | "1. What if we want to use another tokenizer like spaCy's?\n", 250 | "2. Instead of frequency, what if we want to have a binary BOW?\n" 251 | ] 252 | }, 253 | { 254 | "cell_type": "markdown", 255 | "metadata": { 256 | "id": "KRgIHkzUVJtk" 257 | }, 258 | "source": [ 259 | "## Binary BOW with custom tokenizer" 260 | ] 261 | }, 262 | { 263 | "cell_type": "markdown", 264 | "metadata": { 265 | "id": "tof1PBgqEy1D" 266 | }, 267 | "source": [ 268 | "**CountVectorizer** supports using a custom tokenizer. For every document, it will call your tokenizer and expect a list of tokens returned. We'll create a simple callback below which has spaCy tokenize and filter tokens, and then return them." 269 | ] 270 | }, 271 | { 272 | "cell_type": "code", 273 | "metadata": { 274 | "id": "AcCLawrWEzC7" 275 | }, 276 | "source": [ 277 | "# As usual, we start by importing spaCy and loading a statistical model.\n", 278 | "nlp = spacy.load('en_core_web_sm')\n", 279 | "\n", 280 | "# Create a tokenizer callback using spaCy under the hood. 
Here, we tokenize\n", 281 | "# the passed-in text and return the tokens, filtering out punctuation.\n", 282 | "def spacy_tokenizer(doc):\n", 283 | " return [t.text for t in nlp(doc) if not t.is_punct]\n" 284 | ], 285 | "execution_count": null, 286 | "outputs": [] 287 | }, 288 | { 289 | "cell_type": "markdown", 290 | "metadata": { 291 | "id": "drEe1Lv_OScv" 292 | }, 293 | "source": [ 294 | "This time, we instantiate **CountVectorizer** with our custom tokenizer (*spacy_tokenizer*), turn off case-folding, and also set the *binary* parameter to *True* so we simply get 1s and 0s marking token presence rather than token frequency." 295 | ] 296 | }, 297 | { 298 | "cell_type": "code", 299 | "metadata": { 300 | "id": "1YREyWzaA-rT" 301 | }, 302 | "source": [ 303 | "vectorizer = CountVectorizer(tokenizer=spacy_tokenizer, lowercase=False, binary=True)\n", 304 | "bow = vectorizer.fit_transform(corpus)" 305 | ], 306 | "execution_count": null, 307 | "outputs": [] 308 | }, 309 | { 310 | "cell_type": "markdown", 311 | "metadata": { 312 | "id": "5jDKQkZUOysa" 313 | }, 314 | "source": [ 315 | "Looking at the resulting feature names and vocabulary dictionary, we can see our *spacy_tokenizer* being used. If you're not convinced, you can remove the punctuation filtering in our tokenizer and rerun the code." 316 | ] 317 | }, 318 | { 319 | "cell_type": "code", 320 | "metadata": { 321 | "id": "4x6RBqTGq302" 322 | }, 323 | "source": [ 324 | "print(vectorizer.get_feature_names_out())\n", 325 | "vectorizer.vocabulary_" 326 | ], 327 | "execution_count": null, 328 | "outputs": [] 329 | }, 330 | { 331 | "cell_type": "markdown", 332 | "metadata": { 333 | "id": "hFpQbdA-R3FI" 334 | }, 335 | "source": [ 336 | "To get a dense array representation of our sparse matrix, use *toarray*.
\n", 337 | "https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.csr_matrix.toarray.html#scipy.sparse.csr_matrix.toarray\n", 338 | "\n", 339 | "We can also index and slice into the sparse matrix." 340 | ] 341 | }, 342 | { 343 | "cell_type": "code", 344 | "metadata": { 345 | "id": "2yGr36aP9GCr" 346 | }, 347 | "source": [ 348 | "print('A dense representation like we saw in the slides.')\n", 349 | "print(bow.toarray())\n", 350 | "print()\n", 351 | "print('Indexing and slicing.')\n", 352 | "print(bow[0])\n", 353 | "print()\n", 354 | "print(bow[0:2])" 355 | ], 356 | "execution_count": null, 357 | "outputs": [] 358 | }, 359 | { 360 | "cell_type": "markdown", 361 | "metadata": { 362 | "id": "XF0NVhdEUR1r" 363 | }, 364 | "source": [ 365 | "## Cosine Similarity" 366 | ] 367 | }, 368 | { 369 | "cell_type": "markdown", 370 | "metadata": { 371 | "id": "leI1VuDVVP4W" 372 | }, 373 | "source": [ 374 | "Writing your own cosine similarity function is straight-forward using numpy (left as an exercise). There are multiple ways to calculate it using scipy.\n", 375 | "

\n", 376 | "One way is using the **spatial** package, which is a collection of spatial algorithms and data structures. It has a method to calculate cosine *distance*. To get the cosine *similarity*, we have to substract the distance from 1.
\n", 377 | "https://docs.scipy.org/doc/scipy/reference/spatial.html
\n", 378 | "https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.cosine.html#scipy.spatial.distance.cosine" 379 | ] 380 | }, 381 | { 382 | "cell_type": "code", 383 | "metadata": { 384 | "id": "kOQQ50IgXQfH" 385 | }, 386 | "source": [ 387 | "# The cosine method expects array_like inputs, so we need to generate\n", 388 | "# arrays from our sparse matrix.\n", 389 | "doc1_vs_doc2 = 1 - spatial.distance.cosine(bow[0].toarray()[0], bow[1].toarray()[0])\n", 390 | "doc1_vs_doc3 = 1 - spatial.distance.cosine(bow[0].toarray()[0], bow[2].toarray()[0])\n", 391 | "doc1_vs_doc4 = 1 - spatial.distance.cosine(bow[0].toarray()[0], bow[3].toarray()[0])\n", 392 | "\n", 393 | "print(corpus)\n", 394 | "\n", 395 | "print(f\"Doc 1 vs Doc 2: {doc1_vs_doc2}\")\n", 396 | "print(f\"Doc 1 vs Doc 3: {doc1_vs_doc3}\")\n", 397 | "print(f\"Doc 1 vs Doc 4: {doc1_vs_doc4}\")" 398 | ], 399 | "execution_count": null, 400 | "outputs": [] 401 | }, 402 | { 403 | "cell_type": "markdown", 404 | "metadata": { 405 | "id": "6SRDwr2gYD04" 406 | }, 407 | "source": [ 408 | "Another approach is using scikit-learn's *cosine_similarity* which computes the metric between multiple vectors. Here, we pass it our BOW and get a matrix of cosine similarities between each document.
\n", 409 | "https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.cosine_similarity.html" 410 | ] 411 | }, 412 | { 413 | "cell_type": "code", 414 | "metadata": { 415 | "id": "WwwP8-jtchSI" 416 | }, 417 | "source": [ 418 | "# cosine_similarity can take either array-likes or sparse matrices.\n", 419 | "print(cosine_similarity(bow))" 420 | ], 421 | "execution_count": null, 422 | "outputs": [] 423 | }, 424 | { 425 | "cell_type": "markdown", 426 | "metadata": { 427 | "id": "I96W6qDVdDnY" 428 | }, 429 | "source": [ 430 | "## N-grams" 431 | ] 432 | }, 433 | { 434 | "cell_type": "markdown", 435 | "metadata": { 436 | "id": "D3E_hN5Ddyae" 437 | }, 438 | "source": [ 439 | "**CountVectorizer** includes an *ngram_range* parameter to generate different n-grams. n_gram range is specified using a minimum and maximum range. By default, n_gram range is set to (1, 1) which generates unigrams. Setting it to (1, 2) generates both unigrams and bigrams." 440 | ] 441 | }, 442 | { 443 | "cell_type": "code", 444 | "metadata": { 445 | "id": "OZooyyRleHXe" 446 | }, 447 | "source": [ 448 | "vectorizer = CountVectorizer(tokenizer=spacy_tokenizer, lowercase=False, binary=True, ngram_range=(1,2))\n", 449 | "bigrams = vectorizer.fit_transform(corpus)\n", 450 | "print(vectorizer.get_feature_names_out())\n", 451 | "print('Number of features: {}'.format(len(vectorizer.get_feature_names_out())))\n", 452 | "print(vectorizer.vocabulary_)" 453 | ], 454 | "execution_count": null, 455 | "outputs": [] 456 | }, 457 | { 458 | "cell_type": "code", 459 | "metadata": { 460 | "id": "Hvtmi3negc0G" 461 | }, 462 | "source": [ 463 | "# Setting n_gram range to (2, 2) generates only bigrams.\n", 464 | "vectorizer = CountVectorizer(tokenizer=spacy_tokenizer, lowercase=False, binary=True, ngram_range=(2,2))\n", 465 | "bigrams = vectorizer.fit_transform(corpus)\n", 466 | "print(vectorizer.get_feature_names_out())\n", 467 | "print(vectorizer.vocabulary_)" 468 | ], 469 | "execution_count": null, 470 | "outputs": [] 471 | }, 472 | { 473 | "cell_type": "markdown", 474 | "metadata": { 475 | "id": "j7e40ZAKhQmm" 476 | }, 477 | "source": [ 478 | "## Basic Bag-of-Words Exercises" 479 | ] 480 | }, 481 | { 482 | "cell_type": "code", 483 | "metadata": { 484 | "id": "dbdMO0bZjROn" 485 | }, 486 | "source": [ 487 | "#\n", 488 | "# EXERCISE: Create a spacy_tokenizer callback which takes a string and returns\n", 489 | "# a list of tokens (each token's text) with punctuation filtered out.\n", 490 | "#\n", 491 | "corpus = [\n", 492 | " \"Students use their GPS-enabled cellphones to take birdview photographs of a land in order to find specific danger points such as rubbish heaps.\",\n", 493 | " \"Teenagers are enthusiastic about taking aerial photograph in order to study their neighbourhood.\",\n", 494 | " \"Aerial photography is a great way to identify terrestrial features that aren’t visible from the ground level, such as lake contours or river paths.\",\n", 495 | " \"During the early days of digital SLRs, Canon was pretty much the undisputed leader in CMOS image sensor technology.\",\n", 496 | " \"Syrian President Bashar al-Assad tells the US it will 'pay the price' if it strikes against Syria.\"\n", 497 | "]\n", 498 | "\n", 499 | "nlp = spacy.load('en_core_web_sm')\n", 500 | "\n", 501 | "def spacy_tokenizer(doc):\n", 502 | " pass\n" 503 | ], 504 | "execution_count": null, 505 | "outputs": [] 506 | }, 507 | { 508 | "cell_type": "code", 509 | "metadata": { 510 | "id": "UjBJUUpcBWp2" 511 | }, 512 | "source": [ 513 | "#\n", 514 | "# EXERCISE: 
Initialize a CountVectorizer object and set it to use\n", 515 | "# your spacy_tokenizer with lower-casing off and to create a binary BOW.\n", 516 | "#\n", 517 | "\n", 518 | "# Instantiate a CountVectorizer object called 'vectorizer'.\n", 519 | "\n", 520 | "\n", 521 | "# Create a binary BOW from the corpus using your CountVectorizer.\n", 522 | "\n" 523 | ], 524 | "execution_count": null, 525 | "outputs": [] 526 | }, 527 | { 528 | "cell_type": "code", 529 | "metadata": { 530 | "id": "os3tPj5nmRLw" 531 | }, 532 | "source": [ 533 | "#\n", 534 | "# The string below is a whole paragraph. We want to create another\n", 535 | "# binary BOW but using the vocabulary of our *current* CountVectorizer. This means\n", 536 | "# that words in this paragraph which AREN'T already in the vocabulary won't be\n", 537 | "# represented. This is to illustrate how BOW can't handle out-of-vocabulary words\n", 538 | "# unless you rebuild your whole vocabulary. Still, we'll see that if there's\n", 539 | "# enough overlapping vocabulary, some similarity can still be picked up.\n", 540 | "#\n", 541 | "# Note that we call 'transform' only instead of 'fit_transform' because the\n", 542 | "# fit step (i.e. vocabulary build) is already done and we don't want to re-fit here.\n", 543 | "#\n", 544 | "s = [\"Teenagers take aerial shots of their neighbourhood using digital cameras sitting in old bottles which are launched via kites - a common toy for children living in the favelas. They then use GPS-enabled smartphones to take pictures of specific danger points - such as rubbish heaps, which can become a breeding ground for mosquitoes carrying dengue fever.\"]\n", 545 | "new_bow = vectorizer.transform(s)\n", 546 | "\n", 547 | "#\n", 548 | "# EXERCISE: using the pairwise cosine_similarity method from sklearn,\n", 549 | "# calculate the similarities between each document in the corpus and\n", 550 | "# this new document (new_bow). HINT: You can pass two parameters to\n", 551 | "# cosine_similarity in this case. See the docs:\n", 552 | "# https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.cosine_similarity.html\n", 553 | "#\n", 554 | "# Which document is the most similar? Which is the least similar? Do the results make sense\n", 555 | "# based on what you see?\n", 556 | "#\n", 557 | "\n" 558 | ], 559 | "execution_count": null, 560 | "outputs": [] 561 | }, 562 | { 563 | "cell_type": "code", 564 | "metadata": { 565 | "id": "eXThYmDiwMmR" 566 | }, 567 | "source": [ 568 | "#\n", 569 | "# EXERCISE: Implement your own cosine similarity method using numpy.\n", 570 | "# It should take two numpy arrays and output the similarity metric.\n", 571 | "# HINTS:\n", 572 | "# https://numpy.org/doc/stable/reference/generated/numpy.dot.html\n", 573 | "# https://numpy.org/doc/stable/reference/generated/numpy.linalg.norm.html\n", 574 | "#\n", 575 | "# Verify the similarity between the first document in the corpus and the\n", 576 | "# paragraph is the same as the one you got from using pairwise cosine_similarity.\n", 577 | "#\n", 578 | "import numpy as np\n", 579 | "def cos_sim(a, b):\n", 580 | " pass\n" 581 | ], 582 | "execution_count": null, 583 | "outputs": [] 584 | }, 585 | { 586 | "cell_type": "code", 587 | "metadata": { 588 | "id": "ghlqn6l-dal4" 589 | }, 590 | "source": [ 591 | "#\n", 592 | "# EXERCISE: In spacy_tokenizer, instead of returning the plain text,\n", 593 | "# return the lemma_ attribute. How do the cosine similarity\n", 594 | "# results differ? 
What if you filter out stop words as well?\n", 595 | "#" 596 | ], 597 | "execution_count": null, 598 | "outputs": [] 599 | }, 600 | { 601 | "cell_type": "markdown", 602 | "metadata": { 603 | "id": "CnC_i4oH2ARW" 604 | }, 605 | "source": [ 606 | "# TF-IDF\n", 607 | "\n", 608 | "Course module for this demo: https://www.nlpdemystified.org/course/tf-idf" 609 | ] 610 | }, 611 | { 612 | "cell_type": "markdown", 613 | "source": [ 614 | "**NOTE: If the notebook timed out, you may need to re-upgrade spaCy and re-install the language model as follows:**" 615 | ], 616 | "metadata": { 617 | "id": "Xb7W_O_FS3H6" 618 | } 619 | }, 620 | { 621 | "cell_type": "code", 622 | "source": [ 623 | "!pip install -U spacy==3.*\n", 624 | "!python -m spacy download en_core_web_sm\n", 625 | "!python -m spacy info" 626 | ], 627 | "metadata": { 628 | "id": "rRtp9F8KS5QE" 629 | }, 630 | "execution_count": null, 631 | "outputs": [] 632 | }, 633 | { 634 | "cell_type": "code", 635 | "source": [ 636 | "import spacy\n", 637 | "\n", 638 | "from sklearn.datasets import fetch_20newsgroups\n", 639 | "from sklearn.feature_extraction.text import TfidfVectorizer\n", 640 | "from sklearn.metrics.pairwise import cosine_similarity" 641 | ], 642 | "metadata": { 643 | "id": "CMwv39AfP7Ti" 644 | }, 645 | "execution_count": null, 646 | "outputs": [] 647 | }, 648 | { 649 | "cell_type": "markdown", 650 | "metadata": { 651 | "id": "QmcTBtSx-XqZ" 652 | }, 653 | "source": [ 654 | "## Fetching datasets" 655 | ] 656 | }, 657 | { 658 | "cell_type": "markdown", 659 | "metadata": { 660 | "id": "WYkq3i7_-qhQ" 661 | }, 662 | "source": [ 663 | "This time around, rather than using a short toy corpus, let's use a larger dataset. scikit-learn has a **datasets** module with utilities to load our own datasets as well as fetch popular reference datasets online.
\n", 664 | "https://scikit-learn.org/stable/modules/classes.html#module-sklearn.datasets\n", 665 | "

\n", 666 | "We'll use the **20 newsgroups** dataset, which is a collection of 18,000 newsgroup posts across 20 topics.
\n", 667 | "https://scikit-learn.org/stable/datasets/real_world.html#the-20-newsgroups-text-dataset\n", 668 | "

\n", 669 | "List of datasets available:
\n", 670 | "https://scikit-learn.org/stable/datasets.html#datasets" 671 | ] 672 | }, 673 | { 674 | "cell_type": "markdown", 675 | "metadata": { 676 | "id": "UYjxqxVBBINV" 677 | }, 678 | "source": [ 679 | "The **datasets** module includes fetchers for each dataset in scikit-learn. For our purposes, we'll fetch only the posts from the *sci.space* topic, and skip on headers, footers, and quoting of other posts.
\n", 680 | "https://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_20newsgroups.html#sklearn.datasets.fetch_20newsgroups\n", 681 | "

\n", 682 | "By default, the fetcher retrieves the *training* subset of the data only. If you don't know what that means, it'll become clear later in the course when we discuss modelling. For now, it doesn't matter for our purposes." 683 | ] 684 | }, 685 | { 686 | "cell_type": "code", 687 | "metadata": { 688 | "id": "T9to6gQNCGiN" 689 | }, 690 | "source": [ 691 | "corpus = fetch_20newsgroups(categories=['sci.space'],\n", 692 | " remove=('headers', 'footers', 'quotes'))" 693 | ], 694 | "execution_count": null, 695 | "outputs": [] 696 | }, 697 | { 698 | "cell_type": "markdown", 699 | "metadata": { 700 | "id": "W989GHQxDvTW" 701 | }, 702 | "source": [ 703 | "We get back a **Bunch** container object containing the data as well as other information.
\n", 704 | "https://scikit-learn.org/stable/modules/generated/sklearn.utils.Bunch.html\n", 705 | "

\n", 706 | "The actual posts are accessed through the *data* attribute and is a list of strings, each one representing a post." 707 | ] 708 | }, 709 | { 710 | "cell_type": "code", 711 | "metadata": { 712 | "id": "POGdVmdIDuCK" 713 | }, 714 | "source": [ 715 | "print(type(corpus))" 716 | ], 717 | "execution_count": null, 718 | "outputs": [] 719 | }, 720 | { 721 | "cell_type": "code", 722 | "metadata": { 723 | "id": "q6AgmbL0ES9I" 724 | }, 725 | "source": [ 726 | "# Number of posts in our dataset.\n", 727 | "len(corpus.data)" 728 | ], 729 | "execution_count": null, 730 | "outputs": [] 731 | }, 732 | { 733 | "cell_type": "code", 734 | "metadata": { 735 | "id": "qAjM4uNDEXGf" 736 | }, 737 | "source": [ 738 | "# View first two posts.\n", 739 | "corpus.data[:2]" 740 | ], 741 | "execution_count": null, 742 | "outputs": [] 743 | }, 744 | { 745 | "cell_type": "markdown", 746 | "metadata": { 747 | "id": "FH99M6cxCpsz" 748 | }, 749 | "source": [ 750 | "## Creating TF-IDF features" 751 | ] 752 | }, 753 | { 754 | "cell_type": "code", 755 | "metadata": { 756 | "id": "vtnQX-wWDhGh" 757 | }, 758 | "source": [ 759 | "# Like before, if we want to use spaCy's tokenizer, we need\n", 760 | "# to create a callback. Remember to upgrade spaCy if you need\n", 761 | "# to (refer to beginnning of file for commentary and instructions).\n", 762 | "nlp = spacy.load('en_core_web_sm')\n", 763 | "\n", 764 | "# We don't need named-entity recognition nor dependency parsing for\n", 765 | "# this so these components are disabled. This will speed up the\n", 766 | "# pipeline. We do need part-of-speech tagging however.\n", 767 | "unwanted_pipes = [\"ner\", \"parser\"]\n", 768 | "\n", 769 | "# For this exercise, we'll remove punctuation and spaces (which\n", 770 | "# includes newlines), filter for tokens consisting of alphabetic\n", 771 | "# characters, and return the lemma (which require POS tagging).\n", 772 | "def spacy_tokenizer(doc):\n", 773 | " with nlp.disable_pipes(*unwanted_pipes):\n", 774 | " return [t.lemma_ for t in nlp(doc) if \\\n", 775 | " not t.is_punct and \\\n", 776 | " not t.is_space and \\\n", 777 | " t.is_alpha]" 778 | ], 779 | "execution_count": null, 780 | "outputs": [] 781 | }, 782 | { 783 | "cell_type": "markdown", 784 | "metadata": { 785 | "id": "il-0gY9LEiNv" 786 | }, 787 | "source": [ 788 | "Like the classes to create raw frequency and binary bag-of-words vectors, scikit-learn includes a similar class called **TfidfVectorizer** to create TF-IDF vectors from a corpus.
\n", 789 | "https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html\n", 790 | "

\n", 791 | "The usage pattern is similar in that we call *fit_transform* on the corpus which generates the vocabulary dictionary (fit step), and generates the TF-IDF vectors (transform step)." 792 | ] 793 | }, 794 | { 795 | "cell_type": "code", 796 | "metadata": { 797 | "id": "Shj6BS0BN6FU" 798 | }, 799 | "source": [ 800 | "%%time\n", 801 | "# Use the default settings of TfidfVectorizer.\n", 802 | "vectorizer = TfidfVectorizer(tokenizer=spacy_tokenizer)\n", 803 | "features = vectorizer.fit_transform(corpus.data)" 804 | ], 805 | "execution_count": null, 806 | "outputs": [] 807 | }, 808 | { 809 | "cell_type": "code", 810 | "metadata": { 811 | "id": "CZ9w4gh9sobB" 812 | }, 813 | "source": [ 814 | "# The number of unique tokens.\n", 815 | "print(len(vectorizer.get_feature_names_out()))" 816 | ], 817 | "execution_count": null, 818 | "outputs": [] 819 | }, 820 | { 821 | "cell_type": "code", 822 | "metadata": { 823 | "id": "6CxmKlPcNRLk" 824 | }, 825 | "source": [ 826 | "# The dimensions of our feature matrix. X rows (documents) by Y columns (tokens).\n", 827 | "print(features.shape)" 828 | ], 829 | "execution_count": null, 830 | "outputs": [] 831 | }, 832 | { 833 | "cell_type": "code", 834 | "metadata": { 835 | "id": "yJwnU8PZNdHU" 836 | }, 837 | "source": [ 838 | "# What the encoding of the first document looks like in sparse format.\n", 839 | "print(features[0])" 840 | ], 841 | "execution_count": null, 842 | "outputs": [] 843 | }, 844 | { 845 | "cell_type": "markdown", 846 | "metadata": { 847 | "id": "Vp7VTwYzONlt" 848 | }, 849 | "source": [ 850 | "As we mentioned in the slides, there are TF-IDF variations out there and scikit-learn, among other things, adds **smoothing** (adds a one to the numerator and denominator in the IDF component), and normalizes by default. These can be disabled if desired using the *smooth_idf* and *norm* parameters respectively. See here for more information:
\n", 851 | "https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html\n" 852 | ] 853 | }, 854 | { 855 | "cell_type": "markdown", 856 | "metadata": { 857 | "id": "ylKLM-IMOwbJ" 858 | }, 859 | "source": [ 860 | "## Querying the data" 861 | ] 862 | }, 863 | { 864 | "cell_type": "markdown", 865 | "metadata": { 866 | "id": "h8oTtCg0QB71" 867 | }, 868 | "source": [ 869 | "The similarity measuring techniques we learned previously can be used here in the same way. In effect, we can query our data using this sequence:\n", 870 | "1. *Transform* our query using the same vocabulary from our *fit* step on our corpus.\n", 871 | "2. Calculate the pairwise cosine similarities between each document in our corpus and our query.\n", 872 | "3. Sort them in descending order by score." 873 | ] 874 | }, 875 | { 876 | "cell_type": "code", 877 | "metadata": { 878 | "id": "qNjEUzqlP6Oy" 879 | }, 880 | "source": [ 881 | "# Transform the query into a TF-IDF vector.\n", 882 | "query = [\"lunar orbit\"]\n", 883 | "query_tfidf = vectorizer.transform(query)" 884 | ], 885 | "execution_count": null, 886 | "outputs": [] 887 | }, 888 | { 889 | "cell_type": "code", 890 | "metadata": { 891 | "id": "jEfdfkmpP8Tv" 892 | }, 893 | "source": [ 894 | "# Calculate the cosine similarities between the query and each document.\n", 895 | "# We're calling flatten() here becaue cosine_similarity returns a list\n", 896 | "# of lists and we just want a single list.\n", 897 | "cosine_similarities = cosine_similarity(features, query_tfidf).flatten()" 898 | ], 899 | "execution_count": null, 900 | "outputs": [] 901 | }, 902 | { 903 | "cell_type": "markdown", 904 | "metadata": { 905 | "id": "skuSFhLxXOMC" 906 | }, 907 | "source": [ 908 | "Now that we have our list of cosine similarities, we can use this utility function to return the indices of the top k documents with the highest cosine similarities." 909 | ] 910 | }, 911 | { 912 | "cell_type": "code", 913 | "metadata": { 914 | "id": "H0PvqRDpUSYO" 915 | }, 916 | "source": [ 917 | "import numpy as np\n", 918 | "\n", 919 | "# numpy's argsort() method returns a list of *indices* that\n", 920 | "# would sort an array:\n", 921 | "# https://numpy.org/doc/stable/reference/generated/numpy.argsort.html\n", 922 | "#\n", 923 | "# The sort is ascending, but we want the largest k cosine_similarites\n", 924 | "# at the bottom of the sort. So we negate k, and get the last k\n", 925 | "# entries of the indices list in reverse order. 
There are faster\n", 926 | "# ways to do this using things like argpartition but this is\n", 927 | "# more succinct.\n", 928 | "def top_k(arr, k):\n", 929 | " kth_largest = (k + 1) * -1\n", 930 | " return np.argsort(arr)[:kth_largest:-1]" 931 | ], 932 | "execution_count": null, 933 | "outputs": [] 934 | }, 935 | { 936 | "cell_type": "code", 937 | "metadata": { 938 | "id": "zFYpEldVUaAG" 939 | }, 940 | "source": [ 941 | "# So for our query above, these are the top five documents.\n", 942 | "top_related_indices = top_k(cosine_similarities, 5)\n", 943 | "print(top_related_indices)" 944 | ], 945 | "execution_count": null, 946 | "outputs": [] 947 | }, 948 | { 949 | "cell_type": "code", 950 | "metadata": { 951 | "id": "4e86P3bQR1ZS" 952 | }, 953 | "source": [ 954 | "# Let's take a look at their respective cosine similarities.\n", 955 | "print(cosine_similarities[top_related_indices])" 956 | ], 957 | "execution_count": null, 958 | "outputs": [] 959 | }, 960 | { 961 | "cell_type": "code", 962 | "metadata": { 963 | "id": "kzdyTptURiTQ" 964 | }, 965 | "source": [ 966 | "# Top match.\n", 967 | "print(corpus.data[top_related_indices[0]])" 968 | ], 969 | "execution_count": null, 970 | "outputs": [] 971 | }, 972 | { 973 | "cell_type": "code", 974 | "metadata": { 975 | "id": "zQwWXypfR8vh" 976 | }, 977 | "source": [ 978 | "# Second-best match.\n", 979 | "print(corpus.data[top_related_indices[1]])" 980 | ], 981 | "execution_count": null, 982 | "outputs": [] 983 | }, 984 | { 985 | "cell_type": "code", 986 | "metadata": { 987 | "id": "w-5aqUbGSM5J" 988 | }, 989 | "source": [ 990 | "# Try a different query\n", 991 | "query = [\"satellite\"]\n", 992 | "query_tfidf = vectorizer.transform(query)\n", 993 | "\n", 994 | "cosine_similarities = cosine_similarity(features, query_tfidf).flatten()\n", 995 | "top_related_indices = top_k(cosine_similarities, 5)\n", 996 | "\n", 997 | "print(top_related_indices)\n", 998 | "print(cosine_similarities[top_related_indices])" 999 | ], 1000 | "execution_count": null, 1001 | "outputs": [] 1002 | }, 1003 | { 1004 | "cell_type": "code", 1005 | "metadata": { 1006 | "id": "VHQtRQIcSbTj" 1007 | }, 1008 | "source": [ 1009 | "print(corpus.data[top_related_indices[0]])" 1010 | ], 1011 | "execution_count": null, 1012 | "outputs": [] 1013 | }, 1014 | { 1015 | "cell_type": "markdown", 1016 | "metadata": { 1017 | "id": "t4v5wQ4JaBIh" 1018 | }, 1019 | "source": [ 1020 | "So here we have the beginnings of a simple search engine but we're a far cry from competing with commercial off-the-shelf search engines, let alone Google.\n", 1021 | "
\n", 1022 | "- For each query, we're scanning through our entire corpus, but in practice, you'll want to create an **inverted index**. Search applications such as Elasticsearch do that under the hood.\n", 1023 | "- You'd also want to evaluate the efficacy of your search using metrics like **precision** and **recall**.\n", 1024 | "- Document ranking also tends to be more sophisticated, using different ranking functions like Okapi BM25. With major search engines, ranking also involves hundreds of variables such as what the user searched for previously, what do they tend to click on, where are they physically, and on and on. These variables are part of the \"secret sauce\" and are closely guarded by companies.\n", 1025 | "- Beyond word presence, intent and meaning are playing a larger role.\n", 1026 | "
\n", 1027 | "\n", 1028 | "Information Retrieval is a huge, rich topic and beyond search, it's also key in tasks such as question-answering." 1029 | ] 1030 | }, 1031 | { 1032 | "cell_type": "markdown", 1033 | "metadata": { 1034 | "id": "Ak3LXiETfGIY" 1035 | }, 1036 | "source": [ 1037 | "## TF-IDF Exercises" 1038 | ] 1039 | }, 1040 | { 1041 | "cell_type": "markdown", 1042 | "metadata": { 1043 | "id": "08nTQB7_fJU0" 1044 | }, 1045 | "source": [ 1046 | "**EXERCISE**
\n", 1047 | "Read up on these concepts we just mentioned if you're curious.
\n", 1048 | "\n", 1049 | "https://en.wikipedia.org/wiki/Inverted_index
\n", 1050 | "https://en.wikipedia.org/wiki/Precision_and_recall
\n", 1051 | "https://en.wikipedia.org/wiki/Okapi_BM25
" 1052 | ] 1053 | }, 1054 | { 1055 | "cell_type": "code", 1056 | "metadata": { 1057 | "id": "Iz2FCCq1fsjz" 1058 | }, 1059 | "source": [ 1060 | "#\n", 1061 | "# EXERCISE: fetch multiple topics from the 20 newsgroups\n", 1062 | "# dataset and query them using the approach we followed.\n", 1063 | "# A list of topics can be found here:\n", 1064 | "# https://scikit-learn.org/stable/datasets/real_world.html#the-20-newsgroups-text-dataset\n", 1065 | "#\n", 1066 | "# If you're feeling ambitious, incorporate n-grams or\n", 1067 | "# look at how you can measure precision and recall.\n", 1068 | "#" 1069 | ], 1070 | "execution_count": null, 1071 | "outputs": [] 1072 | } 1073 | ] 1074 | } --------------------------------------------------------------------------------