├── README.md
└── topic-modeling-with-colab-gensim-mallet.ipynb

/README.md:
--------------------------------------------------------------------------------
# Colab + Gensim + Mallet

September 14, 2021
Geoff Ford
https://polsci.github.io/
https://github.com/polsci/

See also: [Binder + Gensim + Mallet](https://github.com/polsci/binder-gensim-mallet)

## Introduction

This repository is designed for students in DIGI405 at the University of Canterbury to do topic modeling through their browser using [Google Colab](https://colab.research.google.com/). It is also relevant to anyone else who wants to do topic modeling through a browser with their own corpus.

Note: The notebook has been updated to enforce Gensim v3.8 (the last version to support running topic models via Mallet).

## A note to DIGI405 students

Make sure you save your notebook regularly, as Google Colab sessions time out (pretty sure this is after 90 minutes - if you can find the official Google documentation to confirm this, please let me know!).

### Steps for DIGI405:

1. Launch the notebook in Google Colab (see below).
2. Run the first cells to upgrade Gensim and install Java and Mallet.
3. Run the cell to upload and extract the corpus zip file. Warning: uploads are quite slow.
4. Use the notebook to create your topic model.

## A note to everyone

Before running the notebook, please read the [Google Colab FAQ](https://research.google.com/colaboratory/faq.html).

## Launch the notebook in Google Colab

Click here to run the notebook:
[Launch on Google Colab](https://colab.research.google.com/github/polsci/colab-gensim-mallet/blob/master/topic-modeling-with-colab-gensim-mallet.ipynb)

## Not in DIGI405?

If you are not taking this course, you can of course upload your own corpus as a zip. Your corpus should consist of a single directory of .txt files (one document per file). This isn't the fastest way to run topic models, but it allows you to create a topic model through your browser without installing any software.

## A note about pyLDAvis

The environment should support [pyLDAvis](https://github.com/bmabey/pyLDAvis); however, it is not implemented in the sample notebook. Add a cell like this to install it:
```
!pip install pyLDAvis
```
Add a cell like this to run it (note: this is sloooowwww and not recommended!):
```
import pyLDAvis.gensim as gensimvis
import pyLDAvis
vis_data30 = gensimvis.prepare(gensimmodel30, doc_term_matrix, dictionary)
pyLDAvis.display(vis_data30)
```

--------------------------------------------------------------------------------
/topic-modeling-with-colab-gensim-mallet.ipynb:
--------------------------------------------------------------------------------
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Topic Modeling with Google Colab, Gensim and Mallet\n",
    "\n",
    "This notebook uses [Gensim](https://radimrehurek.com/gensim/) and [Mallet](http://mallet.cs.umass.edu/index.php) for topic modeling on the [Google Colab](https://colab.research.google.com/) platform. The README is available at the [Colab + Gensim + Mallet GitHub repository](https://github.com/polsci/colab-gensim-mallet).",
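    "\n",
    "Your corpus zip should contain a single directory of .txt files (one document per file). If you need to create the zip on your own machine before uploading, here is a minimal sketch - the folder name `my-corpus` is a placeholder, so substitute your own:\n",
    "\n",
    "```python\n",
    "# Run locally (not in Colab): creates my-corpus.zip containing the my-corpus/ directory\n",
    "import shutil\n",
    "shutil.make_archive('my-corpus', 'zip', root_dir='.', base_dir='my-corpus')\n",
    "```\n",
    "\n",
    "The upload cell further down accepts the resulting zip."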
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Upgrade Gensim\n",
    "\n",
    "This pins Gensim to v3.8, the last version to support running topic models via Mallet."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!pip install --upgrade gensim==3.8"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Install Java"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os  # importing os to set an environment variable\n",
    "def install_java():\n",
    "    !apt-get install -y openjdk-8-jdk-headless -qq > /dev/null  # install OpenJDK 8\n",
    "    os.environ[\"JAVA_HOME\"] = \"/usr/lib/jvm/java-8-openjdk-amd64\"  # set the JAVA_HOME environment variable\n",
    "    !java -version  # check the Java version\n",
    "install_java()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Install Mallet"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!wget http://mallet.cs.umass.edu/dist/mallet-2.0.8.zip\n",
    "!unzip mallet-2.0.8.zip"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Upload and extract corpus\n",
    "\n",
    "Upload a zip file containing your corpus. The zip file should contain a single directory of .txt files. Note: it appears that if you don't select a file within a set time, the cell times out and you will need to rerun it."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import zipfile\n",
    "from google.colab import files\n",
    "\n",
    "uploaded = files.upload()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This assumes you have uploaded a file above! The cell below extracts the zip and outputs a directory listing so you can see your uploaded file and the extracted directory.",
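    "\n",
    "Once the extraction cell has run, you can optionally confirm how many documents were extracted. A minimal sketch - the folder name `transcripts` is an assumption, so adjust it to match your corpus directory:\n",
    "\n",
    "```python\n",
    "# count the .txt files in the extracted corpus directory\n",
    "import glob\n",
    "txt_files = glob.glob('transcripts/*.txt')\n",
    "print(len(txt_files), '.txt files found')\n",
    "```"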
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "path_to_zip_file = list(uploaded.keys())[0]\n",
    "\n",
    "print('Extracting', path_to_zip_file)\n",
    "\n",
    "with zipfile.ZipFile(path_to_zip_file, 'r') as zip_ref:\n",
    "    zip_ref.extractall('.')\n",
    "\n",
    "print()\n",
    "print('Here is a directory listing (you should see a directory with your corpus):')\n",
    "!ls -l"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Import required libraries for topic modeling"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import gensim\n",
    "import gensim.corpora as corpora\n",
    "from gensim.utils import simple_preprocess\n",
    "from gensim.models.wrappers import LdaMallet\n",
    "from gensim.models.coherencemodel import CoherenceModel\n",
    "from gensim import similarities\n",
    "\n",
    "import os.path\n",
    "import re\n",
    "import glob\n",
    "\n",
    "import nltk\n",
    "nltk.download('stopwords')\n",
    "\n",
    "from nltk.tokenize import RegexpTokenizer\n",
    "from nltk.corpus import stopwords"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Set the paths to the Mallet binary and the corpus"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "os.environ['MALLET_HOME'] = '/content/mallet-2.0.8'\n",
    "mallet_path = '/content/mallet-2.0.8/bin/mallet'  # you should NOT need to change this\n",
    "corpus_path = 'transcripts'  # change this to the directory containing your corpus of .txt files"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Functions to load and preprocess the corpus and create the document-term matrix\n",
    "\n",
    "The following cell contains functions to load a corpus from a directory of text files, preprocess the corpus, and create the bag-of-words document-term matrix.",
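    "\n",
    "For intuition about the preprocessing: the `RegexpTokenizer(r'\w+')` used below keeps only runs of word characters, so punctuation disappears and contractions are split apart. A small illustration, separate from the pipeline itself:\n",
    "\n",
    "```python\n",
    "from nltk.tokenize import RegexpTokenizer\n",
    "tokenizer = RegexpTokenizer(r'\w+')\n",
    "print(tokenizer.tokenize(\"don't panic!\"))  # ['don', 't', 'panic']\n",
    "```"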
" 168 | ] 169 | }, 170 | { 171 | "cell_type": "code", 172 | "execution_count": null, 173 | "metadata": {}, 174 | "outputs": [], 175 | "source": [ 176 | "def load_data_from_dir(path):\n", 177 | " file_list = glob.glob(path + '/*.txt')\n", 178 | "\n", 179 | " # create document list:\n", 180 | " documents_list = []\n", 181 | " source_list = []\n", 182 | " for filename in file_list:\n", 183 | " with open(filename, 'r', encoding='utf8') as f:\n", 184 | " text = f.read()\n", 185 | " f.close()\n", 186 | " documents_list.append(text)\n", 187 | " source_list.append(os.path.basename(filename))\n", 188 | " print(\"Total Number of Documents:\",len(documents_list))\n", 189 | " return documents_list, source_list\n", 190 | "\n", 191 | "def preprocess_data(doc_set,extra_stopwords = {}):\n", 192 | " # adapted from https://www.datacamp.com/community/tutorials/discovering-hidden-topics-python\n", 193 | " # replace all newlines or multiple sequences of spaces with a standard space\n", 194 | " doc_set = [re.sub('\\s+', ' ', doc) for doc in doc_set]\n", 195 | " # initialize regex tokenizer\n", 196 | " tokenizer = RegexpTokenizer(r'\\w+')\n", 197 | " # create English stop words list\n", 198 | " en_stop = set(stopwords.words('english'))\n", 199 | " # add any extra stopwords\n", 200 | " if (len(extra_stopwords) > 0):\n", 201 | " en_stop = en_stop.union(extra_stopwords)\n", 202 | " \n", 203 | " # list for tokenized documents in loop\n", 204 | " texts = []\n", 205 | " # loop through document list\n", 206 | " for i in doc_set:\n", 207 | " # clean and tokenize document string\n", 208 | " raw = i.lower()\n", 209 | " tokens = tokenizer.tokenize(raw)\n", 210 | " # remove stop words from tokens\n", 211 | " stopped_tokens = [i for i in tokens if not i in en_stop]\n", 212 | " # add tokens to list\n", 213 | " texts.append(stopped_tokens)\n", 214 | " return texts\n", 215 | "\n", 216 | "def prepare_corpus(doc_clean):\n", 217 | " # adapted from https://www.datacamp.com/community/tutorials/discovering-hidden-topics-python\n", 218 | " # Creating the term dictionary of our courpus, where every unique term is assigned an index. dictionary = corpora.Dictionary(doc_clean)\n", 219 | " dictionary = corpora.Dictionary(doc_clean)\n", 220 | " \n", 221 | " dictionary.filter_extremes(no_below=5, no_above=0.5)\n", 222 | " # Converting list of documents (corpus) into Document Term Matrix using dictionary prepared above.\n", 223 | " doc_term_matrix = [dictionary.doc2bow(doc) for doc in doc_clean]\n", 224 | " # generate LDA model\n", 225 | " return dictionary,doc_term_matrix" 226 | ] 227 | }, 228 | { 229 | "cell_type": "markdown", 230 | "metadata": {}, 231 | "source": [ 232 | "## Load and pre-process the corpus\n", 233 | "Load the corpus, preprocess with additional stop words and output dictionary and document-term matrix." 
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# load the corpus from corpus_path (set above)\n",
    "document_list, source_list = load_data_from_dir(corpus_path)\n",
    "\n",
    "# I've added extra stopwords here in addition to NLTK's stopword list - you could look at adding others.\n",
    "doc_clean = preprocess_data(document_list, {'laughter', 'applause'})\n",
    "dictionary, doc_term_matrix = prepare_corpus(doc_clean)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## LDA model with 30 topics\n",
    "The following cell sets the number of topics to train the model with, and the number of words to display for each topic."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "number_of_topics = 30  # adjust this to alter the number of topics\n",
    "words = 20  # adjust this to alter the number of words output for each topic below"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The following cell runs LDA with Mallet via Gensim's wrapper, using the number_of_topics specified above. This might take a few minutes!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "ldamallet30 = LdaMallet(mallet_path, corpus=doc_term_matrix, num_topics=number_of_topics, id2word=dictionary)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The following cell outputs the topics."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "ldamallet30.show_topics(num_topics=number_of_topics, num_words=words)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Convert to Gensim model format\n",
    "Convert the Mallet model to Gensim's LDA model format."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "gensimmodel30 = gensim.models.wrappers.ldamallet.malletmodel2ldamodel(ldamallet30)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Get a coherence score\n",
    "This computes a c_v coherence score for the 30-topic model. (An aside in the document preview section below sketches how to compare coherence across several topic counts.)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "coherencemodel = CoherenceModel(model=gensimmodel30, texts=doc_clean, dictionary=dictionary, coherence='c_v')\n",
    "print(coherencemodel.get_coherence())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Get the ID of a specific document"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "lookup_doc_id = source_list.index('2017-09-20-zeynep_tufekci_we_re_building_a_dystopia_just_to_make_people_click_on_ads.txt')\n",
    "print('Document ID from lookup:', lookup_doc_id)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Preview a document\n",
    "\n",
    "Preview a document - you can change doc_id below to view another document.",
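    "\n",
    "*Aside - comparing topic counts:* as flagged in the coherence section above, the sketch below shows one way to compare c_v coherence across several values of k. It is slow (one Mallet model is trained per value of k) and is an optional extra rather than part of the main workflow; the names all come from earlier cells in this notebook:\n",
    "\n",
    "```python\n",
    "# train a model per topic count and report its c_v coherence\n",
    "for k in [10, 20, 30]:\n",
    "    model_k = LdaMallet(mallet_path, corpus=doc_term_matrix, num_topics=k, id2word=dictionary)\n",
    "    gensim_k = gensim.models.wrappers.ldamallet.malletmodel2ldamodel(model_k)\n",
    "    score = CoherenceModel(model=gensim_k, texts=doc_clean, dictionary=dictionary, coherence='c_v').get_coherence()\n",
    "    print(k, 'topics ->', score)\n",
    "```\n",
    "\n",
    "The next cell previews the document selected via doc_id."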
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "doc_id = lookup_doc_id  # index of the document to explore - this can be an id number or set to lookup_doc_id\n",
    "print(re.sub(r'\s+', ' ', document_list[doc_id]))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Output the distribution of topics for the document\n",
    "\n",
    "The next cell outputs the distribution of topics for the document specified above."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "document_topics = gensimmodel30.get_document_topics(doc_term_matrix[doc_id])\n",
    "document_topics = sorted(document_topics, key=lambda x: x[1], reverse=True)  # sort topics by proportion\n",
    "\n",
    "for topic, prop in document_topics:\n",
    "    topic_words = [word[0] for word in gensimmodel30.show_topic(topic, 10)]\n",
    "    print(\"%.2f\" % prop, topic, topic_words)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Find similar documents\n",
    "This finds the 5 most similar documents to the document specified above, based on their topic distributions."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# gensimmodel30[doc_term_matrix] below represents the documents in the corpus in LDA vector space\n",
    "lda_index = similarities.MatrixSimilarity(gensimmodel30[doc_term_matrix])\n",
    "\n",
    "# query for our doc_id from above\n",
    "similarity_index = lda_index[gensimmodel30[doc_term_matrix[doc_id]]]\n",
    "\n",
    "# sort the similarity index from most to least similar\n",
    "similarity_index = sorted(enumerate(similarity_index), key=lambda item: -item[1])\n",
    "\n",
    "for i in range(1, 6):  # start at 1 to skip the query document itself\n",
    "    document_id, similarity_score = similarity_index[i]\n",
    "\n",
    "    print('Document Index:', document_id)\n",
    "    print('Document:', source_list[document_id])\n",
    "    print('Similarity Score:', similarity_score)\n",
    "\n",
    "    print(re.sub(r'\s+', ' ', document_list[document_id][:500]), '...')  # preview the first 500 characters\n",
    "\n",
    "    document_topics = gensimmodel30[doc_term_matrix[document_id]]\n",
    "    document_topics = sorted(document_topics, key=lambda x: x[1], reverse=True)\n",
    "    for topic, prop in document_topics:\n",
    "        topic_words = [word[0] for word in gensimmodel30.show_topic(topic, 10)]\n",
    "        print(\"%.2f\" % prop, topic, topic_words)\n",
    "\n",
    "    print()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
--------------------------------------------------------------------------------