├── README.md
└── topic-modeling-with-colab-gensim-mallet.ipynb


/README.md:
--------------------------------------------------------------------------------
 1 | # Colab + Gensim + Mallet
 2 | 
 3 | September 14, 2021  
 4 | Geoff Ford  
 5 | https://polsci.github.io/  
 6 | https://github.com/polsci/  
 7 | 
 8 | See also: [Binder + Gensim + Mallet](https://github.com/polsci/binder-gensim-mallet)
 9 | 
10 | ## Introduction
11 | 
12 | This repository is designed for students in DIGI405 at the University of Canterbury to do topic modeling through their browser using [Google Colab](https://colab.research.google.com/). It is relevant for others who want to do topic modeling through a browser with their own corpus.
13 | 
14 | Note: The notebook has been updated to enforce Gensim v3.8 (the last version to support running topic models via Mallet).
15 | 
16 | ## A note to DIGI405 students
17 | 
18 | Make sure you are saving your notebook regularly as Google Colab times out (pretty sure this is after 90 minutes - if you can find the official Google documentation to confirm this please let me know!).
19 | 
20 | ### Steps for DIGI405:
21 | 
22 | 1. Launch the notebook in Google Colab (see below)
23 | 2. Run the first cells to upgrade Gensim and install Java and Mallet.
24 | 3. Run the cell to upload and extract the corpus zip file. Warning: uploads are quite slow.  
25 | 4. Use the notebook to create your topic model.  
26 | 
27 | ## A note to everyone
28 | 
29 | Before running the notebook, please read the [Google Colab FAQ](https://research.google.com/colaboratory/faq.html).
30 | 
31 | ## Launch the notebook in Google Colab
32 | 
33 | Click here to run the notebook:  
34 | [Launch on Google Colab](https://colab.research.google.com/github/polsci/colab-gensim-mallet/blob/master/topic-modeling-with-colab-gensim-mallet.ipynb)
35 | 
36 | ## Not in DIGI405?
37 | 
38 | If you are not from this course, you can of course upload your own corpus as a zip. Your corpus should consist of a single directory of txt files (one document per txt file). This isn't the fastest way to run topic models, but allows you to create a topic model through your browser without installing any software.
39 | 
40 | ## A note about pyLDAvis
41 | 
42 | The environment should support [pyLDAvis](https://github.com/bmabey/pyLDAvis), however this is not implemented in the sample notebook. Add a cell like this to install it:  
43 | ```
44 | !pip install pyLDAvis
45 | ```
46 | Add a cell like this to run it (note: this is sloooowwww and not recommended!):  
47 | ```
48 | import pyLDAvis.gensim as gensimvis
49 | import pyLDAvis
50 | vis_data30 = gensimvis.prepare(gensimmodel30, doc_term_matrix, dictionary)
51 | pyLDAvis.display(vis_data30)
52 | ```
53 | 


--------------------------------------------------------------------------------
/topic-modeling-with-colab-gensim-mallet.ipynb:
--------------------------------------------------------------------------------
  1 | {
  2 |  "cells": [
  3 |   {
  4 |    "cell_type": "markdown",
  5 |    "metadata": {},
  6 |    "source": [
  7 |     "# Topic Modeling with Google Colab, Gensim and Mallet\n",
  8 |     "\n",
  9 |     "This notebook implements [Gensim](https://radimrehurek.com/gensim/) and [Mallet](http://mallet.cs.umass.edu/index.php) for topic modeling using the [Google Colab](https://colab.research.google.com/) platform. The README is available at the [Colab + Gensim + Mallet Github repository](https://github.com/polsci/colab-gensim-mallet)."
 10 |    ]
 11 |   },
 12 |   {
 13 |    "cell_type": "markdown",
 14 |    "metadata": {},
 15 |    "source": [
 16 |     "## Upgrade Gensim"
 17 |    ]
 18 |   },
 19 |   {
 20 |    "cell_type": "code",
 21 |    "execution_count": null,
 22 |    "metadata": {},
 23 |    "outputs": [],
 24 |    "source": [
 25 |     "!pip install --upgrade gensim==3.8"
 26 |    ]
 27 |   },
 28 |   {
 29 |    "cell_type": "markdown",
 30 |    "metadata": {},
 31 |    "source": [
 32 |     "## Install Java"
 33 |    ]
 34 |   },
 35 |   {
 36 |    "cell_type": "code",
 37 |    "execution_count": null,
 38 |    "metadata": {},
 39 |    "outputs": [],
 40 |    "source": [
 41 |     "import os       #importing os to set environment variable\n",
 42 |     "def install_java():\n",
 43 |     "  !apt-get install -y openjdk-8-jdk-headless -qq > /dev/null      #install openjdk\n",
 44 |     "  os.environ[\"JAVA_HOME\"] = \"/usr/lib/jvm/java-8-openjdk-amd64\"     #set environment variable\n",
 45 |     "  !java -version       #check java version\n",
 46 |     "install_java()"
 47 |    ]
 48 |   },
 49 |   {
 50 |    "cell_type": "markdown",
 51 |    "metadata": {},
 52 |    "source": [
 53 |     "## Install Mallet"
 54 |    ]
 55 |   },
 56 |   {
 57 |    "cell_type": "code",
 58 |    "execution_count": null,
 59 |    "metadata": {},
 60 |    "outputs": [],
 61 |    "source": [
 62 |     "!wget http://mallet.cs.umass.edu/dist/mallet-2.0.8.zip\n",
 63 |     "!unzip mallet-2.0.8.zip"
 64 |    ]
 65 |   },
 66 |   {
 67 |    "cell_type": "markdown",
 68 |    "metadata": {},
 69 |    "source": [
 70 |     "## Upload and extract corpus\n",
 71 |     "\n",
 72 |     "Upload a zip file with your corpus. The zip file of the corpus should contain a single directory containing .txt files. It appears that you need to rerun the cell if you don't select the file within a set time. "
 73 |    ]
 74 |   },
 75 |   {
 76 |    "cell_type": "code",
 77 |    "execution_count": null,
 78 |    "metadata": {},
 79 |    "outputs": [],
 80 |    "source": [
 81 |     "import zipfile\n",
 82 |     "from google.colab import files\n",
 83 |     "\n",
 84 |     "uploaded = files.upload()"
 85 |    ]
 86 |   },
 87 |   {
 88 |    "cell_type": "markdown",
 89 |    "metadata": {},
 90 |    "source": [
 91 |     "This assumes you have uploaded a file above! This will output a directory listing as well so you can see your uploaded file and the directory."
 92 |    ]
 93 |   },
 94 |   {
 95 |    "cell_type": "code",
 96 |    "execution_count": null,
 97 |    "metadata": {},
 98 |    "outputs": [],
 99 |    "source": [
100 |     "path_to_zip_file = list(uploaded.keys())[0]\n",
101 |     "\n",
102 |     "print ('Extracting',path_to_zip_file)\n",
103 |     "\n",
104 |     "with zipfile.ZipFile(path_to_zip_file, 'r') as zip_ref:\n",
105 |     "  zip_ref.extractall('.')\n",
106 |     "\n",
107 |     "print()\n",
108 |     "print('Here is a directory listing (you should see a directory with your corpus):')\n",
109 |     "!ls -l"
110 |    ]
111 |   },
112 |   {
113 |    "cell_type": "markdown",
114 |    "metadata": {},
115 |    "source": [
116 |     "## Import required libraries for topic modeling"
117 |    ]
118 |   },
119 |   {
120 |    "cell_type": "code",
121 |    "execution_count": null,
122 |    "metadata": {},
123 |    "outputs": [],
124 |    "source": [
125 |     "import gensim\n",
126 |     "import gensim.corpora as corpora\n",
127 |     "from gensim.utils import simple_preprocess\n",
128 |     "from gensim.models.wrappers import LdaMallet\n",
129 |     "from gensim.models.coherencemodel import CoherenceModel\n",
130 |     "from gensim import similarities\n",
131 |     "\n",
132 |     "import os.path\n",
133 |     "import re\n",
134 |     "import glob\n",
135 |     "\n",
136 |     "import nltk\n",
137 |     "nltk.download('stopwords')\n",
138 |     "\n",
139 |     "from nltk.tokenize import RegexpTokenizer\n",
140 |     "from nltk.corpus import stopwords"
141 |    ]
142 |   },
143 |   {
144 |    "cell_type": "markdown",
145 |    "metadata": {},
146 |    "source": [
147 |     "## Set the path to the Mallet binary and set the path to the corpus"
148 |    ]
149 |   },
150 |   {
151 |    "cell_type": "code",
152 |    "execution_count": null,
153 |    "metadata": {},
154 |    "outputs": [],
155 |    "source": [
156 |     "os.environ['MALLET_HOME'] = '/content/mallet-2.0.8'\n",
157 |     "mallet_path = '/content/mallet-2.0.8/bin/mallet' # you should NOT need to change this \n",
158 |     "corpus_path = 'transcripts' # you need to change this path to the directory containing your corpus of .txt files"
159 |    ]
160 |   },
161 |   {
162 |    "cell_type": "markdown",
163 |    "metadata": {},
164 |    "source": [
165 |     "## Functions to load and preprocess the corpus and create the document-term matrix\n",
166 |     "\n",
167 |     "The following cell contains functions to load a corpus from a directory of text files, preprocess the corpus and create the bag of words document-term matrix. "
168 |    ]
169 |   },
170 |   {
171 |    "cell_type": "code",
172 |    "execution_count": null,
173 |    "metadata": {},
174 |    "outputs": [],
175 |    "source": [
176 |     "def load_data_from_dir(path):\n",
177 |     "    file_list = glob.glob(path + '/*.txt')\n",
178 |     "\n",
179 |     "    # create document list:\n",
180 |     "    documents_list = []\n",
181 |     "    source_list = []\n",
182 |     "    for filename in file_list:\n",
183 |     "        with open(filename, 'r', encoding='utf8') as f:\n",
184 |     "            text = f.read()\n",
185 |     "            f.close()\n",
186 |     "            documents_list.append(text)\n",
187 |     "            source_list.append(os.path.basename(filename))\n",
188 |     "    print(\"Total Number of Documents:\",len(documents_list))\n",
189 |     "    return documents_list, source_list\n",
190 |     "\n",
191 |     "def preprocess_data(doc_set,extra_stopwords = {}):\n",
192 |     "    # adapted from https://www.datacamp.com/community/tutorials/discovering-hidden-topics-python\n",
193 |     "    # replace all newlines or multiple sequences of spaces with a standard space\n",
194 |     "    doc_set = [re.sub('\\s+', ' ', doc) for doc in doc_set]\n",
195 |     "    # initialize regex tokenizer\n",
196 |     "    tokenizer = RegexpTokenizer(r'\\w+')\n",
197 |     "    # create English stop words list\n",
198 |     "    en_stop = set(stopwords.words('english'))\n",
199 |     "    # add any extra stopwords\n",
200 |     "    if (len(extra_stopwords) > 0):\n",
201 |     "        en_stop = en_stop.union(extra_stopwords)\n",
202 |     "    \n",
203 |     "    # list for tokenized documents in loop\n",
204 |     "    texts = []\n",
205 |     "    # loop through document list\n",
206 |     "    for i in doc_set:\n",
207 |     "        # clean and tokenize document string\n",
208 |     "        raw = i.lower()\n",
209 |     "        tokens = tokenizer.tokenize(raw)\n",
210 |     "        # remove stop words from tokens\n",
211 |     "        stopped_tokens = [i for i in tokens if not i in en_stop]\n",
212 |     "        # add tokens to list\n",
213 |     "        texts.append(stopped_tokens)\n",
214 |     "    return texts\n",
215 |     "\n",
216 |     "def prepare_corpus(doc_clean):\n",
217 |     "    # adapted from https://www.datacamp.com/community/tutorials/discovering-hidden-topics-python\n",
218 |     "    # Creating the term dictionary of our courpus, where every unique term is assigned an index. dictionary = corpora.Dictionary(doc_clean)\n",
219 |     "    dictionary = corpora.Dictionary(doc_clean)\n",
220 |     "    \n",
221 |     "    dictionary.filter_extremes(no_below=5, no_above=0.5)\n",
222 |     "    # Converting list of documents (corpus) into Document Term Matrix using dictionary prepared above.\n",
223 |     "    doc_term_matrix = [dictionary.doc2bow(doc) for doc in doc_clean]\n",
224 |     "    # generate LDA model\n",
225 |     "    return dictionary,doc_term_matrix"
226 |    ]
227 |   },
228 |   {
229 |    "cell_type": "markdown",
230 |    "metadata": {},
231 |    "source": [
232 |     "## Load and pre-process the corpus\n",
233 |     "Load the corpus, preprocess with additional stop words and output dictionary and document-term matrix."
234 |    ]
235 |   },
236 |   {
237 |    "cell_type": "code",
238 |    "execution_count": null,
239 |    "metadata": {},
240 |    "outputs": [],
241 |    "source": [
242 |     "# adjust the path below to wherever you have the transcripts2018 folder\n",
243 |     "document_list, source_list = load_data_from_dir(corpus_path)\n",
244 |     "\n",
245 |     "# I've added extra stopwords here in addition to NLTK's stopword list - you could look at adding others.\n",
246 |     "doc_clean = preprocess_data(document_list,{'laughter','applause'})\n",
247 |     "dictionary, doc_term_matrix = prepare_corpus(doc_clean)"
248 |    ]
249 |   },
250 |   {
251 |    "cell_type": "markdown",
252 |    "metadata": {},
253 |    "source": [
254 |     "## LDA model with 30 topics\n",
255 |     "The following cell sets the number of topics we are training the model for. "
256 |    ]
257 |   },
258 |   {
259 |    "cell_type": "code",
260 |    "execution_count": null,
261 |    "metadata": {},
262 |    "outputs": [],
263 |    "source": [
264 |     "number_of_topics=30 # adjust this to alter the number of topics\n",
265 |     "words=20 #adjust this to alter the number of words output for the topic below"
266 |    ]
267 |   },
268 |   {
269 |    "cell_type": "markdown",
270 |    "metadata": {},
271 |    "source": [
272 |     "The following cell runs LDA using Mallet from Gensim using the number_of_topics specified above. This might take a few minutes! "
273 |    ]
274 |   },
275 |   {
276 |    "cell_type": "code",
277 |    "execution_count": null,
278 |    "metadata": {},
279 |    "outputs": [],
280 |    "source": [
281 |     "ldamallet30 = LdaMallet(mallet_path, corpus=doc_term_matrix, num_topics=number_of_topics, id2word=dictionary)"
282 |    ]
283 |   },
284 |   {
285 |    "cell_type": "markdown",
286 |    "metadata": {},
287 |    "source": [
288 |     "The following cell outputs the topics."
289 |    ]
290 |   },
291 |   {
292 |    "cell_type": "code",
293 |    "execution_count": null,
294 |    "metadata": {},
295 |    "outputs": [],
296 |    "source": [
297 |     "ldamallet30.show_topics(num_topics=number_of_topics,num_words=words)"
298 |    ]
299 |   },
300 |   {
301 |    "cell_type": "markdown",
302 |    "metadata": {},
303 |    "source": [
304 |     "## Convert to Gensim model format\n",
305 |     "Convert the Mallet model to Gensim format."
306 |    ]
307 |   },
308 |   {
309 |    "cell_type": "code",
310 |    "execution_count": null,
311 |    "metadata": {},
312 |    "outputs": [],
313 |    "source": [
314 |     "gensimmodel30 = gensim.models.wrappers.ldamallet.malletmodel2ldamodel(ldamallet30)"
315 |    ]
316 |   },
317 |   {
318 |    "cell_type": "markdown",
319 |    "metadata": {},
320 |    "source": [
321 |     "## Get a coherence score"
322 |    ]
323 |   },
324 |   {
325 |    "cell_type": "code",
326 |    "execution_count": null,
327 |    "metadata": {},
328 |    "outputs": [],
329 |    "source": [
330 |     "coherencemodel = CoherenceModel(model=gensimmodel30, texts=doc_clean, dictionary=dictionary, coherence='c_v')\n",
331 |     "print (coherencemodel.get_coherence())"
332 |    ]
333 |   },
334 |   {
335 |    "cell_type": "markdown",
336 |    "metadata": {},
337 |    "source": [
338 |     "## Get id for specific videos"
339 |    ]
340 |   },
341 |   {
342 |    "cell_type": "code",
343 |    "execution_count": null,
344 |    "metadata": {},
345 |    "outputs": [],
346 |    "source": [
347 |     "lookup_doc_id = source_list.index('2017-09-20-zeynep_tufekci_we_re_building_a_dystopia_just_to_make_people_click_on_ads.txt')\n",
348 |     "print('Document ID from lookup:', lookup_doc_id)"
349 |    ]
350 |   },
351 |   {
352 |    "cell_type": "markdown",
353 |    "metadata": {},
354 |    "source": [
355 |     "## Preview a document\n",
356 |     "\n",
357 |     "Preview a document - you can change the doc_id to view another document."
358 |    ]
359 |   },
360 |   {
361 |    "cell_type": "code",
362 |    "execution_count": null,
363 |    "metadata": {},
364 |    "outputs": [],
365 |    "source": [
366 |     "doc_id = lookup_doc_id # index of document to explore - this can be an id number or set to lookup_doc_id\n",
367 |     "print(re.sub('\\s+', ' ', document_list[doc_id])) "
368 |    ]
369 |   },
370 |   {
371 |    "cell_type": "markdown",
372 |    "metadata": {},
373 |    "source": [
374 |     "## Output the distribution of topics for the document\n",
375 |     "\n",
376 |     "The next cell outputs the distribution of topics on the document specified above."
377 |    ]
378 |   },
379 |   {
380 |    "cell_type": "code",
381 |    "execution_count": null,
382 |    "metadata": {},
383 |    "outputs": [],
384 |    "source": [
385 |     "document_topics = gensimmodel30.get_document_topics(doc_term_matrix[doc_id])\n",
386 |     "document_topics = sorted(document_topics, key=lambda x: x[1], reverse=True) # sorts document topics\n",
387 |     "\n",
388 |     "for topic, prop in document_topics:\n",
389 |     "    topic_words = [word[0] for word in gensimmodel30.show_topic(topic, 10)]\n",
390 |     "    print (\"%.2f\" % prop, topic, topic_words)"
391 |    ]
392 |   },
393 |   {
394 |    "cell_type": "markdown",
395 |    "metadata": {},
396 |    "source": [
397 |     "## Find similar documents\n",
398 |     "This will find the 5 most similar documents to the document specified above based on their topic distribution."
399 |    ]
400 |   },
401 |   {
402 |    "cell_type": "code",
403 |    "execution_count": null,
404 |    "metadata": {},
405 |    "outputs": [],
406 |    "source": [
407 |     "# gensimmodel30[doc_term_matrix] below represents the documents in the corpus in LDA vector space\n",
408 |     "lda_index = similarities.MatrixSimilarity(gensimmodel30[doc_term_matrix])\n",
409 |     "\n",
410 |     "# query for our doc_id from above\n",
411 |     "similarity_index = lda_index[gensimmodel30[doc_term_matrix[doc_id]]]\n",
412 |     "\n",
413 |     "# Sort the similarity index\n",
414 |     "similarity_index = sorted(enumerate(similarity_index), key=lambda item: -item[1])\n",
415 |     "\n",
416 |     "for i in range(1,6): \n",
417 |     "    document_id, similarity_score = similarity_index[i]\n",
418 |     "\n",
419 |     "    print('Document Index:',document_id)\n",
420 |     "    print('Document:', source_list[document_id])\n",
421 |     "    print('Similarity Score:',similarity_score)\n",
422 |     "    \n",
423 |     "    print(re.sub('\\s+', ' ', document_list[document_id][:500]), '...') # preview first 500 characters\n",
424 |     "\n",
425 |     "    document_topics = gensimmodel30[doc_term_matrix[document_id]]\n",
426 |     "    document_topics = sorted(document_topics, key=lambda x: x[1], reverse=True)\n",
427 |     "    for topic, prop in document_topics:\n",
428 |     "        topic_words = [word[0] for word in gensimmodel30.show_topic(topic, 10)]\n",
429 |     "        print (\"%.2f\" % prop, topic, topic_words)\n",
430 |     "    \n",
431 |     "    print()"
432 |    ]
433 |   },
434 |   {
435 |    "cell_type": "code",
436 |    "execution_count": null,
437 |    "metadata": {},
438 |    "outputs": [],
439 |    "source": []
440 |   }
441 |  ],
442 |  "metadata": {
443 |   "kernelspec": {
444 |    "display_name": "Python 3",
445 |    "language": "python",
446 |    "name": "python3"
447 |   },
448 |   "language_info": {
449 |    "codemirror_mode": {
450 |     "name": "ipython",
451 |     "version": 3
452 |    },
453 |    "file_extension": ".py",
454 |    "mimetype": "text/x-python",
455 |    "name": "python",
456 |    "nbconvert_exporter": "python",
457 |    "pygments_lexer": "ipython3",
458 |    "version": "3.7.3"
459 |   }
460 |  },
461 |  "nbformat": 4,
462 |  "nbformat_minor": 2
463 | }
464 | 


--------------------------------------------------------------------------------