├── README.md ├── requirements.txt ├── speech_data_extend.txt ├── tutorial.py └── tutorial_notebook.ipynb /README.md: -------------------------------------------------------------------------------- 1 | ### The topicmodels package that formed the basis of this tutorial is now retired. Archives of the code are available at: 2 | ### https://www.dropbox.com/s/klp7u6clahh1ztx/topic-modelling-tools-master.zip?dl=0 (no GSL) 3 | ### https://www.dropbox.com/s/gxkqgjwc50qjtg5/topic-modelling-tools-with_gsl.zip?dl=0 (with GSL). 4 | ### An alternative illustration of LDA is available at 5 | ### https://github.com/sekhansen/text_algorithms_econ/blob/main/notebooks/3_LDA.ipynb. 6 | 7 | ## An Introduction to Topic Modelling via Gibbs sampling: Code and Tutorial 8 | 9 | by Stephen Hansen, stephen.hansen@economics.ox.ac.uk 10 | 11 | Associate Professor of Economics, University of Oxford 12 | 13 | *** 14 | 15 | Thanks to Eric Hardy at Columbia University for collating data on speeches. 16 | 17 | *** 18 | 19 | If you use this software in research or educational projects, please cite: 20 | 21 | Hansen, Stephen, Michael McMahon, and Andrea Prat (2018), “Transparency and Deliberation on the FOMC: A Computational Linguistics Approach,” *Quarterly Journal of Economics*. 22 | 23 |
24 | 25 | ### INTRODUCTION 26 | 27 | This project introduces Latent Dirichlet Allocation (LDA) to those who do not necessarily have a background in computer science or programming. There are many implementations of LDA available online in a variety of languages, many of which are more memory- and/or computationally efficient than this one. What is much rarer than optimized code, however, is documentation and examples that allow complete novices to practice implementing topic models for themselves. The goal of this project is to provide these, thereby reducing the startup costs involved in using topic models. 28 | 29 | The contents of the tutorial folder are as follows: 30 | 31 | 1. speech\_data\_extend.txt: Data on State of the Union Addresses. 32 | 2. tutorial_notebook.ipynb: iPython notebook for the tutorial. 33 | 3. tutorial.py: Python code with the key commands for the tutorial. 34 | 35 | ### INSTALLING PYTHON 36 | 37 | The code relies on standard scientific libraries which are not part of a basic Python installation. For those who do not already have them installed, the recommended option is to download the Anaconda distribution of Python 2.7 from the Anaconda website and install it with the default settings. After installation, you should be able to launch iPython directly from the Anaconda folder on OS X and Windows. On Linux you can launch it by typing “ipython” from the command line. (iPython is an enhanced Python interpreter particularly useful for scientific computing.) 38 | 39 | If iPython does not launch, it may be that your anti-virus software considers it a risk and blocks it. For example, this may happen in some versions of Kaspersky 6.0, which, on starting iPython, quarantines the python.exe file and thereby renders other (previously working) Python operations inoperable. One option is to turn off the anti-virus software. Another is to disable the specific “Application Activity Analyzer” component, which interprets the “Input/output redirection” of the iPython notebook as a threat and quarantines the Python executable. 40 | 41 | For background on Python and iPython from an economics perspective, see, for example, the QuantEcon lectures (https://quantecon.org). 42 | 43 | ### INSTALLING TOPICMODELS PACKAGE 44 | 45 | In addition to common scientific libraries, the tutorial also requires installation of the topicmodels package (archived at the links at the top of this README). If you already have Python and pip installed (for example by installing Anaconda per the instructions above), `pip install topic-modelling-tools` should work. 46 | 47 | The only other requirement is a C++ compiler, which is needed to build the Cython code. For Mac OS X you can download Xcode, while for Windows you can download the Visual Studio C++ compiler. 48 | 49 | To improve performance, I have used the GNU Scientific Library's random number generator instead of numpy's in a separate branch (archived at the “with GSL” link at the top of this README). To use this version instead of the baseline version, install topicmodels with `pip install topic-modelling-tools_fast`. Using this version requires GSL to be installed; see the package README for further information. 50 | 51 | ### FOLLOWING THE TUTORIAL 52 | 53 | The tutorial can be followed using the plain tutorial.py script, using ipython, or using ipython with the qtconsole for enhanced graphics. To initiate the latter, type “jupyter qtconsole” (or, in older versions, “ipython qtconsole”). You should make sure that your current working directory is the tutorial folder. To check this, you can type “pwd” to see the working directory. If you need to change it, use the cd command.
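For orientation, the core estimation workflow in tutorial.py boils down to roughly the following (a condensed sketch of the full script, which additionally inspects topic content, queries aggregate documents, and writes output files):

```python
import pandas as pd
import topicmodels

# load the State of the Union paragraphs and keep the television era (1947 onwards)
data = pd.read_table("speech_data_extend.txt", encoding="utf-8")
data = data[data.year >= 1947]

# clean the raw text: tokenise, drop stopwords, stem, and remove low-information stems
docsobj = topicmodels.RawDocs(data.speech, "long")
docsobj.token_clean(1)
docsobj.stopword_remove("tokens")
docsobj.stem()
docsobj.stopword_remove("stems")
docsobj.term_rank("stems")
docsobj.rank_remove("tfidf", "stems", docsobj.tfidf_ranking[5000][1])

# estimate a 30-topic model by collapsed Gibbs sampling and keep the final samples
ldaobj = topicmodels.LDA.LDAGibbs(docsobj.stems, 30)
ldaobj.sample(0, 50, 10)
ldaobj.samples_keep(4)

# document-topic distributions, averaged over the kept samples
dt = ldaobj.dt_avg()
```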
54 | 55 | The easiest option is to copy and paste the commands from the notebook into ipython (the notebook can be viewed online and is also provided for convenience). 56 | 57 | ### PERFORMANCE 58 | 59 | While primarily written as an introduction, the code for the project should also be suitable for analysis on datasets with at least several million words, which includes many of interest to social scientists. For very large datasets, a more scalable solution is likely best (note that even when fully optimized, Gibbs sampling tends to be slow compared to other inference algorithms). 60 | 61 | In terms of memory, one should keep in mind that each sample has an associated document-topic and term-topic matrix stored in the background. For large datasets, this may become an issue when trying to store many samples concurrently. 62 | 63 | ### FEEDBACK 64 | 65 | Comments, bug reports, and ideas are more than welcome, particularly from those using topic modelling in economics and social science research. 66 | -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | topic-modelling-tools 2 | matplotlib >= 1.4.3 3 | numpy >= 1.13.3 4 | nltk >= 3.2.4 5 | pandas >= 0.20.3 6 | scipy >= 0.19.1 7 | Cython >= 0.20.1 8 | -------------------------------------------------------------------------------- /tutorial.py: -------------------------------------------------------------------------------- 1 | """ 2 | (c) 2015, Stephen Hansen, stephen.hansen@economics.ox.ac.uk 3 | 4 | Python script for tutorial illustrating collapsed Gibbs sampling 5 | for Latent Dirichlet Allocation. 6 | """ 7 | 8 | import pandas as pd 9 | import topicmodels 10 | 11 | ############### 12 | # select data on which to run topic model 13 | ############### 14 | 15 | data = pd.read_table("speech_data_extend.txt", encoding="utf-8") 16 | data = data[data.year >= 1947] 17 | 18 | ############### 19 | # clean documents 20 | ############### 21 | 22 | docsobj = topicmodels.RawDocs(data.speech, "long") 23 | docsobj.token_clean(1) 24 | docsobj.stopword_remove("tokens") 25 | docsobj.stem() 26 | docsobj.stopword_remove("stems") 27 | docsobj.term_rank("stems") 28 | docsobj.rank_remove("tfidf", "stems", docsobj.tfidf_ranking[5000][1]) 29 | 30 | all_stems = [s for d in docsobj.stems for s in d] 31 | print("number of unique stems = %d" % len(set(all_stems))) 32 | print("number of total stems = %d" % len(all_stems)) 33 | 34 | ############### 35 | # estimate topic model 36 | ############### 37 | 38 | ldaobj = topicmodels.LDA.LDAGibbs(docsobj.stems, 30) 39 | 40 | ldaobj.sample(0, 50, 10) 41 | ldaobj.sample(0, 50, 10) 42 | 43 | ldaobj.samples_keep(4) 44 | ldaobj.topic_content(20) 45 | 46 | dt = ldaobj.dt_avg() 47 | tt = ldaobj.tt_avg() 48 | ldaobj.dict_print() 49 | 50 | data = data.drop('speech', 1) 51 | for i in xrange(ldaobj.K): 52 | data['T' + str(i)] = dt[:, i] 53 | data.to_csv("final_output.csv", index=False) 54 | 55 | ############### 56 | # query aggregate documents 57 | ############### 58 | 59 | data['speech'] = [' '.join(s) for s in docsobj.stems] 60 | aggspeeches = data.groupby(['year', 'president'])['speech'].\ 61 | apply(lambda x: ' '.join(x)) 62 | aggdocs = topicmodels.RawDocs(aggspeeches) 63 | 64 | queryobj = topicmodels.LDA.QueryGibbs(aggdocs.tokens, ldaobj.token_key, 65 | ldaobj.tt) 66 | queryobj.query(10) 67 | queryobj.perplexity() 68 | queryobj.query(30) 69 | queryobj.perplexity() 70 | 71 | dt_query = queryobj.dt_avg() 72
| aggdata = pd.DataFrame(dt_query, index=aggspeeches.index, 73 | columns=['T' + str(i) for i in xrange(queryobj.K)]) 74 | aggdata.to_csv("final_output_agg.csv") 75 | 76 | ############### 77 | # top topics 78 | ############### 79 | 80 | 81 | def top_topics(x): 82 | top = x.values.argsort()[-5:][::-1] 83 | return(pd.Series(top, index=range(1, 6))) 84 | 85 | temp = aggdata.reset_index() 86 | ranking = temp.set_index('president') 87 | ranking = ranking - ranking.mean() 88 | ranking = ranking.groupby(level='president').mean() 89 | ranking = ranking.sort_values('year') 90 | ranking = ranking.drop('year', 1) 91 | ranking = ranking.apply(top_topics, axis=1) 92 | ranking.to_csv("president_top_topics.csv") 93 | -------------------------------------------------------------------------------- /tutorial_notebook.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Introduction" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "This notebook introduces how to use the topicmodels module for implementing Latent Dirichlet Allocation using the collapsed Gibbs sampling algorithm of Griffiths and Steyvers (2004). The module contains three classes: one for processing raw text, another for implementing LDA, and another for querying. This tutorial will go through the main features of each, for full details see the documented source code.\n", 15 | "\n", 16 | "To illustrate LDA, the tutorial uses text data from State of the Union Addresses at the paragraph level. These are available for download from http://www.presidency.ucsb.edu/sou.php. They are contained in the tab-separated text file speech_data_extend.txt distributed with this tutorial.\n", 17 | "\n", 18 | "To interact with this data, we begin by importing some libraries that are not strictly speaking necessary for using topicmodels." 19 | ] 20 | }, 21 | { 22 | "cell_type": "code", 23 | "execution_count": 1, 24 | "metadata": { 25 | "collapsed": false 26 | }, 27 | "outputs": [], 28 | "source": [ 29 | "%matplotlib inline\n", 30 | "import matplotlib\n", 31 | "import numpy as np\n", 32 | "import matplotlib.pyplot as plt\n", 33 | "\n", 34 | "import pandas as pd" 35 | ] 36 | }, 37 | { 38 | "cell_type": "markdown", 39 | "metadata": {}, 40 | "source": [ 41 | "To begin, we read in the data, specifying the encoding of the text data." 42 | ] 43 | }, 44 | { 45 | "cell_type": "code", 46 | "execution_count": 2, 47 | "metadata": { 48 | "collapsed": false 49 | }, 50 | "outputs": [], 51 | "source": [ 52 | "data = pd.read_table(\"speech_data_extend.txt\",encoding=\"utf-8\")" 53 | ] 54 | }, 55 | { 56 | "cell_type": "markdown", 57 | "metadata": {}, 58 | "source": [ 59 | "The data object is called a pandas DataFrame, and is similar to a Data Frame in R. One can see the data has three fields." 60 | ] 61 | }, 62 | { 63 | "cell_type": "code", 64 | "execution_count": 3, 65 | "metadata": { 66 | "collapsed": false 67 | }, 68 | "outputs": [ 69 | { 70 | "data": { 71 | "text/plain": [ 72 | "Index([u'president', u'speech', u'year'], dtype='object')" 73 | ] 74 | }, 75 | "execution_count": 3, 76 | "metadata": {}, 77 | "output_type": "execute_result" 78 | } 79 | ], 80 | "source": [ 81 | "data.columns" 82 | ] 83 | }, 84 | { 85 | "cell_type": "markdown", 86 | "metadata": {}, 87 | "source": [ 88 | "For the tutorial we focus on State of the Union addresses made since the television era, which began in 1947." 
89 | ] 90 | }, 91 | { 92 | "cell_type": "code", 93 | "execution_count": 4, 94 | "metadata": { 95 | "collapsed": false 96 | }, 97 | "outputs": [ 98 | { 99 | "data": { 100 | "text/plain": [ 101 | "9488" 102 | ] 103 | }, 104 | "execution_count": 4, 105 | "metadata": {}, 106 | "output_type": "execute_result" 107 | } 108 | ], 109 | "source": [ 110 | "data = data[data.year >= 1947]\n", 111 | "len(data) # The number of documents (paragraphs of State of the Union Addresses) in the dataset" 112 | ] 113 | }, 114 | { 115 | "cell_type": "markdown", 116 | "metadata": {}, 117 | "source": [ 118 | "# Cleaning Raw Text Data" 119 | ] 120 | }, 121 | { 122 | "cell_type": "markdown", 123 | "metadata": {}, 124 | "source": [ 125 | "We now import topicmodels, the module used in most of the analysis." 126 | ] 127 | }, 128 | { 129 | "cell_type": "code", 130 | "execution_count": 5, 131 | "metadata": { 132 | "collapsed": false 133 | }, 134 | "outputs": [], 135 | "source": [ 136 | "import topicmodels" 137 | ] 138 | }, 139 | { 140 | "cell_type": "markdown", 141 | "metadata": {}, 142 | "source": [ 143 | "Before implementing a topic model, it is important to pre-process the data. The first class in the topicmodels module is called RawDocs and facilitates this pre-processing. We are going to pass this class the text data contained in the DataFrame along with a list of stopwords specified by \"long\". Stopwords are common words in English that tend to appear in all text, and so are not helpful in describing content. There is no definitive list of stopwords, and another option we describe below is to let the data itself reveal which words are useful for discriminating among documents. The list of words comes from http://snowball.tartarus.org/algorithms/english/stop.txt, but one need not use all of them. In Hansen, McMahon, and Prat (2014), for example, we use just a subset of these, which you can use by specifying \"short\" instead of \"long\". (You can view the stopwords by typing docsobj.stopwords into the interpreter)." 144 | ] 145 | }, 146 | { 147 | "cell_type": "code", 148 | "execution_count": 6, 149 | "metadata": { 150 | "collapsed": false 151 | }, 152 | "outputs": [], 153 | "source": [ 154 | "docsobj = topicmodels.RawDocs(data.speech, \"long\")" 155 | ] 156 | }, 157 | { 158 | "cell_type": "markdown", 159 | "metadata": {}, 160 | "source": [ 161 | "Rather than passing the text as a DataFrame column, one can also pass a text file, in which case each new line will be read as a separate document.\n", 162 | "\n", 163 | "docsobj is now an object with several attributes. The most important is its tokens attribute. This is the outcome of taking each raw document, converting each contraction into its underlying words (e.g. \"don't\" into \"do not\"), coverting it into lowercase, and breaking it into its underlying linguistic elements (words, numbers, punctuation, etc.). To illustrate, compare the fourth paragraph in the 1947 State of the Union Address as a raw document and after tokenization." 164 | ] 165 | }, 166 | { 167 | "cell_type": "code", 168 | "execution_count": 7, 169 | "metadata": { 170 | "collapsed": false 171 | }, 172 | "outputs": [ 173 | { 174 | "name": "stdout", 175 | "output_type": "stream", 176 | "text": [ 177 | "I come also to welcome you as you take up your duties and to discuss with you the manner in which you and I should fulfill our obligations to the American people during the next 2 years. 
\n", 178 | "[u'i', u'come', u'also', u'to', u'welcome', u'you', u'as', u'you', u'take', u'up', u'your', u'duties', u'and', u'to', u'discuss', u'with', u'you', u'the', u'manner', u'in', u'which', u'you', u'and', u'i', u'should', u'fulfill', u'our', u'obligations', u'to', u'the', u'american', u'people', u'during', u'the', u'next', u'2', u'years', u'.']\n" 179 | ] 180 | } 181 | ], 182 | "source": [ 183 | "print(data.speech.values[3]) # fourth paragraph (note that Python uses 0-indexing)\n", 184 | "print(docsobj.tokens[3])" 185 | ] 186 | }, 187 | { 188 | "cell_type": "markdown", 189 | "metadata": {}, 190 | "source": [ 191 | "At this point, all tokens are in the dataset, which can be useful for some purposes. For example, one might want to count the number of question marks in each speech. For implementing LDA, though, one generally wishes to focus on words. docsobj has a method to clean tokens. This will remove all non-alphanumeric tokens, and, by default, all numeric tokens as well (to keep numeric tokens, pass False as the second argument in parentheses). The number passed as an argument removes all tokens whose length is less that - this can be useful if some symbols in the data like copyright signs trascribed as a single 'c' that the user would like to remove, in which case one would pass 1. In this case, we pass 1 for illustration." 192 | ] 193 | }, 194 | { 195 | "cell_type": "code", 196 | "execution_count": 8, 197 | "metadata": { 198 | "collapsed": false 199 | }, 200 | "outputs": [], 201 | "source": [ 202 | "docsobj.token_clean(1)" 203 | ] 204 | }, 205 | { 206 | "cell_type": "code", 207 | "execution_count": 9, 208 | "metadata": { 209 | "collapsed": false 210 | }, 211 | "outputs": [ 212 | { 213 | "name": "stdout", 214 | "output_type": "stream", 215 | "text": [ 216 | "[u'come', u'also', u'to', u'welcome', u'you', u'as', u'you', u'take', u'up', u'your', u'duties', u'and', u'to', u'discuss', u'with', u'you', u'the', u'manner', u'in', u'which', u'you', u'and', u'should', u'fulfill', u'our', u'obligations', u'to', u'the', u'american', u'people', u'during', u'the', u'next', u'years']\n" 217 | ] 218 | } 219 | ], 220 | "source": [ 221 | "print(docsobj.tokens[3])" 222 | ] 223 | }, 224 | { 225 | "cell_type": "markdown", 226 | "metadata": {}, 227 | "source": [ 228 | "Next, we remove the stopwords we passed when creating docsobj. Here we pass \"tokens\" as an argument to specify that the stopwords should be removed from docsobj.tokens. The other option would be to pass \"stems\", which we discuss below. Notice by how much we have reduced the size of the document by removing stopwords!" 229 | ] 230 | }, 231 | { 232 | "cell_type": "code", 233 | "execution_count": 10, 234 | "metadata": { 235 | "collapsed": false 236 | }, 237 | "outputs": [ 238 | { 239 | "name": "stdout", 240 | "output_type": "stream", 241 | "text": [ 242 | "[u'come', u'welcome', u'duties', u'discuss', u'manner', u'fulfill', u'obligations', u'american', u'people', u'next', u'years']\n" 243 | ] 244 | } 245 | ], 246 | "source": [ 247 | "docsobj.stopword_remove(\"tokens\")\n", 248 | "print(docsobj.tokens[3])" 249 | ] 250 | }, 251 | { 252 | "cell_type": "markdown", 253 | "metadata": {}, 254 | "source": [ 255 | "The next step is to attempt to group together words that are grammatically different but themeatically identical. 
For example, the document above has the token \"obligations\" but another may have \"obligation\" and yet another \"oblige.\" Ultimately these three words denote the same concept, and so we might want them to share the same symbol. One way to achieve this is through stemming, a process whereby words are transformed through a deterministic algorithm to a base form. One popular stemmer is the Porter stemmer, which docsobj applies (via its implementation in Python's Natural Language Toolkit). This creates a new stems attribute." 256 | ] 257 | }, 258 | { 259 | "cell_type": "code", 260 | "execution_count": 11, 261 | "metadata": { 262 | "collapsed": false 263 | }, 264 | "outputs": [ 265 | { 266 | "name": "stdout", 267 | "output_type": "stream", 268 | "text": [ 269 | "[u'come', u'welcom', u'duti', u'discuss', u'manner', u'fulfil', u'oblig', u'american', u'peopl', u'next', u'year']\n" 270 | ] 271 | } 272 | ], 273 | "source": [ 274 | "docsobj.stem()\n", 275 | "print(docsobj.stems[3])\n", 276 | "docsobj.stopword_remove(\"stems\")" 277 | ] 278 | }, 279 | { 280 | "cell_type": "markdown", 281 | "metadata": {}, 282 | "source": [ 283 | "Notice that the outcome of stemming need not be an English word. These stems are the data on which we will run the topic model below. We make an additional call to remove stopwords from stems, since the stemmed forms of tokens not in the stopword list may themselves be in the stopword list.\n", 284 | "\n", 285 | "The final step in pre-processing is to drop remaining words that are not useful for identifying content. We have already dropped standard stopwords, but there may also be data-dependent common words. For example, in data from Supreme Court proceedings, \"justice\" might be treated as a stopword. Also, words that appear just once or twice in the collection are not informative of content either. Ideally, one would like a measure of informativeness that both punishes common words in the data, and rare words. One such option is to give each stem a tf-idf (term frequency - inverse document frequency) score. This is standard in the language processing literature, so we omit details here." 286 | ] 287 | }, 288 | { 289 | "cell_type": "code", 290 | "execution_count": 12, 291 | "metadata": { 292 | "collapsed": false 293 | }, 294 | "outputs": [], 295 | "source": [ 296 | "docsobj.term_rank(\"stems\")" 297 | ] 298 | }, 299 | { 300 | "cell_type": "markdown", 301 | "metadata": {}, 302 | "source": [ 303 | "This call produces two .csv files in the working directory, df_ranking.csv and tfidf_ranking.csv. df_ranking.csv ranks each stem according to its document frequency, or the number of documents it appears in. tfidf_ranking.csv ranks each stem accroding to the tf-idf measure, according to which highly informative words are those that appear frequently in the entire dataset, but in relatively few documents. Stems with the highest scores include \"gun\", \"iraq\", and \"immigr\".\n", 304 | "\n", 305 | "At this stage, the user can decide how many stems to drop based on either the df or tf-idf scores. The first argument to rank_remove specifies the ranking method to use (\"df\" or \"tfidf\"), the second whether to drop from \"tokens\" or \"stems\" (since we formed the rankings based on stems above, we should specify stems), and finally the value to use for the cutoff for dropping stems. One might instead prefer to provide a number $n$ such that all stems with a tf-idf value less than or equal to the $n$th ranked stem are then dropped, which we illustrate below. 
One can determine the cutoff from exploring the csv files, but here we plot the ranking in Python, which indicates a reasonable cutoff is 5000. (When using df rather than tfidf, substitute docsobj.df_ranking)." 306 | ] 307 | }, 308 | { 309 | "cell_type": "code", 310 | "execution_count": 13, 311 | "metadata": { 312 | "collapsed": false 313 | }, 314 | "outputs": [ 315 | { 316 | "data": { 317 | "text/plain": [ 318 | "[]" 319 | ] 320 | }, 321 | "execution_count": 13, 322 | "metadata": {}, 323 | "output_type": "execute_result" 324 | }, 325 | { 326 | "data": { 327 | "image/png": "iVBORw0KGgoAAAANSUhEUgAAAXoAAAEACAYAAAC9Gb03AAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAAFwFJREFUeJzt3XuUXWV9//H3N0QmCQkxYkhCwyU2BQUJ4S4EyqCA8HMV\nsD/rBUHQH1384a9Y7ZL7WgmKXVKXCLW1rUhpwPJb1ksJIMhFHAFZ3Cck5RJBSLg0GS7BcJVC8vz+\neHZgGCbMzJlzzj57z/u11lmz9845e39zZuYzz3n2s58dKSUkSfU1ruwCJEmtZdBLUs0Z9JJUcwa9\nJNWcQS9JNWfQS1LNDRn0EdEVEbdHRG9ELI+IhcX2aRFxXUSsiIhrI2Jq68uVJI1UDGccfURMSim9\nHBGbAb8BTgb+N/BsSunvIuJUYFpK6bTWlitJGqlhdd2klF4uFruA8UACjgIWF9sXA0c3vTpJ0qgN\nK+gjYlxE9AJrgOtTSncCM1JKfQAppTXA1q0rU5LUqOG26DeklHYHZgP7RMQu5Fb9W57W7OIkSaM3\nfiRPTik9HxE9wOFAX0TMSCn1RcRM4KnBXhMR/gGQpAaklKIZ+xnOqJv3bhxRExETgUOBB4ArgBOK\npx0PLNnUPlJKHfVYuHBh6TVUoaZOrcuarGks1NVMw2nRzwIWR8Q48h+GH6WUro6I24D/iIgvAKuA\nTza1MklSUwwZ9Cml5cAeg2xfCxzSiqIkSc0zJq+M7e7uLruEt+nEmqAz67Km4bGm4evUupplWBdM\njeoAEanVx5CkuokIUrtOxkqSqs2gl6SaM+glqeYMekmqOYNekmrOoJekmjPoJanmDHpJqjmDXpJq\nzqCXpJoz6CWp5gx6Sao5g16Sas6gl6Saa0vQ/8//tOMokqTBtCXoH3mkHUeRJA2mLUHf09OOo0iS\nBtOWoL/hhnYcRZI0mLbcSnDatMSqVTBlSksPJUm1UblbCe6/P1x4YTuOJEkaqC0t+qVLE4ccAo89\nBhMntvRwklQLlWvR77Yb7L03fO1r7TiaJKm/trToU0qsXg077QQ33QTz57f0kJJUeZVr0QPMmgXn\nnAPHHQevvNKuo0qS2taiB0gpB/2qVXnIZVdXSw8tSZVVyRY9QARccgnMng177QU//WkOf0lS67S1\nRb/R+vVwzTVw1ll5/a//Go45BjbfvKWlSFJlNLNFX0rQb5QSXHstfOtb8NBDcPTRcPDBcPjhDsOU\nNLbVJug3SgmWLcut/Ouvh97ePCTzwx+GP/9z2Hnn3O0jSWNF7YJ+oNWrc/BffTUsWQJr18KOO+bH\nzjvnMfm77gozZ8I4Z9SXVEO1D/qB1q7NXTu//S0sXw533gn33w/PPw8f/CDMm5c/Aey8M2y7LWy3\nnV0/kqptzAX9prz0Eixdmlv/vb3w8MPw+OOwZg3suSd86ENw0EGwYAFsuWVLSpCkljDoh/DCC3DL\nLXD77fDrX+dPAPPm5ZO9Rx+du4AkqZMZ9CP0hz/AddfBz38OV10FkyfDEUfklv4RR+R1SeokbQ36\niJgNXALMADYA308pfTciFgJ/CTxVPPWMlNIvBnl96UHf3/r1uZvn+uvzvDs335zD/i/+Ag44ALbZ\npuwKJan9QT8TmJlSWhoRk4G7gaOATwEvpJTOG+L1HRX0Az33HFx0EfzqVzn0d9kF9t0XDj00j+mf\nNKnsCiWNRaV23UTE5cB3gQOAF1NK3x7i+R0d9P2tXZtH8yxZkvv1770X9tsPjjwSPvpRmDOn7Aol\njRWlBX1E7AD0AB8E/gY4AVgH3AX8TUpp3SCvqUzQD9TXlydfu/LK/Hj/++Gzn4XPfCbPxilJrVJK\n0BfdNj3A11NKSyJiOvBMSilFxDnArJTS/xnkdWnhwoVvrHd3d9Pd3d2M2ttq/Xr45S9h8WL48Y/h\nc5/L/fqHHeZVu5JGr6enh56enjfWzz777PYGfUSMB64CrkkpXTDIv28PXJlSmjfIv1W2Rb8pjz8O\nP/whXHZZnojtkEPgm9808CU1T9tb9BFxCbn1/pV+22amlNYUy18G9k4pHTPIa2sX9ButXw833ghf\n/WoewnnWWXDssWVXJakO2j3qZgFwE7AcSMXjDOAYYD55yOVK4KSUUt8gr69t0G/0yiv5wqxPfSpf\nkXvMMXDiiWVXJanKvGCqQz31FPzmN/lk7T775JE6xx2X596RpJGo7B2m6m7rreHjH899+CedBCtX\n5snWNt4+UZLKYIu+xXp74fzz8y0UTz4Zpk3LI3be976yK5PUyWzRV8juu+chmcuXw/Tp8MgjMH8+\nXHppXpakVrNFX4Kf/Qy+8Y18Fe4ZZ8App0BXV9lVSeoknoytid7ePF/+QQfBn/xJHrWz775lVyWp\nE9h1UxO7757n0znsMJgwAfbfHy64AF5/vezKJNWJLfoO8vOfw1e+km+WPnNmnkN/woSyq5JUBrtu\nauzll+F3v4MTTshTJp9xRp5MTdLYYtdNjU2aBLvuCldckUP/Ax/IF1ydfHLZlUmqKlv0He7pp2H1\n6nzh1c0357tgSaq/ZrboxzdjJ2qd6dPz4/vfhwMPzHe+mjUrt/LPPhvG+ZlM0hBs0VfIE0/ALbfk\nmTLPPjvPi7/XXmVXJakVPBkrzjwTrrkGvvc9mDEjb3vve2HKlHLrktQcBr1ICb7wBbj7bnjhBdiw\nAbbYApYtg/F2yEmVZ9DrbVLKc+h8+9v5jleSqs2TsXqbCPjYx+D003MXTldXbtm/612w1VZ26Uhj\nmS36GnniiTfnwd+wAV57Ld/9atYsuOuusquTNBJ23WjY1q2Dd787B77TKUjV4ZWxGrapU/MUCvff\nX3YlkspiH/0YsNde+a5W06fn9Rkz4BOfgKOPdoSONBbYdTMGrF4NDz6Yl1OCX/4SfvIT+OIXnUNH\n6lT20WvUbr
sNjj0WHn647EokDcag16i9/nqeKfO55/KFVpI6i0Gvpth339yVs/32ef0974F//uc8\nJl9SuQx6NcWTT8Ktt765/ld/BdtuCzvtBJdeauBLZTLo1RKPPgpr1sCCBdDX9+YoHUnt5xQIaok5\nc/Jj771hxQqDXqoLL5jS28ydm1v3kurBrhu9zSmnwOWX55O048blR0SeM+cHP7DvXmoH++jVUs8+\nC729eUTOhg35sX49/Nmf5Yut5s7NtzKU1DoGvUpx0kl5Fsx77oElS+DII8uuSKovg16l+ta38iRp\nF19cdiVSfTl7pUq13Xbw8stlVyFpuAx6jdjEiXl+e0nV4Dh6jdikSXDHHfDpTw/+7wcemGfGlNQZ\nhuyjj4jZwCXADGADcGFK6e8jYhrwI2B7YCXwyZTSukFebx99zbzyClx1VR6JM9Cjj+YpkO++u/11\nSXXS1pOxETETmJlSWhoRk4G7gaOAzwPPppT+LiJOBaallE4b5PUG/RiyahXstx/cfnu+SfnEiWVX\nJFVTW0/GppTWpJSWFssvAg8As8lhv7h42mLg6GYUpGqbOROmTYM99oA994Tf/77siiSN6GRsROwA\nzAduA2aklPog/zEAtm52caqeri647748/HLlynyBlaRyDTvoi26bnwBfKlr2A/tj7J/RG6ZPh09+\nEl54oexKJA1r1E1EjCeH/KUppSXF5r6ImJFS6iv68Z/a1OsXLVr0xnJ3dzfd3d0NF6zqmDzZoJeG\nq6enh56enpbse1hXxkbEJcAzKaWv9Nt2LrA2pXSuJ2M1mNNPhy23zF8ljUy7R90sAG4ClpO7ZxJw\nBnAH8B/AtsAq8vDKt516M+jHru98B846a+h70h5zDJx/fntqkqrCuW5UCRs2wDPPvPNzbr01/0H4\n9a/bU5NUFd5hSpUwbhxsPcRYrLlz4emn21OPNFY5141KNX26QS+1mkGvUm21FaxbB3/6p3DwwfCL\nX5RdkVQ/Br1KNX483HknnHMOTJ0KV19ddkVS/dhHr9Lttlv++uCDOfQlNZctenWMSZOc515qBYNe\nHWPiRO9cJbWCQa+OMWkSvPRS2VVI9WMfvTrG9OnQ2wvHHz/814wbB1//Osye3bq6pKoz6NUx5s/P\nUyG89trwX3PeebB8uUEvvRODXh1j/Pg8781ILFkCf/hDa+qR6sI+elXahAmO1JGGYtCr0iZMsEUv\nDcWgV6UZ9NLQDHpVml030tAMelXaxImweDGceCK8/nrZ1UidyRuPqNKeeAJuuSUH/apVeTZMqQ68\nw5Q0wKxZcPfdsM02ZVciNUczg96uG9WCJ2WlTTPoVQtdXfDqq2VXIXUmg161YIte2jSDXrVgi17a\nNOe6US1MmADPPpvvP9sMU6bkmTGlOjDoVQtz5458QrRNefVVOOUU+NrXmrM/qWwOr5QG+O53YcUK\n+Id/KLsSjWUOr5RaaMIE+/tVLwa9NIAjeFQ3Br00gCN4VDcGvTSALXrVjUEvDWCLXnVj0EsD2KJX\n3Rj00gBdXQa96sWglwZweKXqxqCXBrBFr7ox6KUBbNGrbgx6aYCuLnjmmTzXzeWXl12NNHpDBn1E\nXBQRfRGxrN+2hRHxRETcUzwOb22ZUvvMmAFnngmPPQbf+EbZ1UijN+SkZhFxAPAicElKaV6xbSHw\nQkrpvCEP4KRmqqilS+H44+Hee8uuRGNRWyc1SyndAjw3WB3NKEDqVJttBuvXl12FNHqj6aP/vxGx\nNCJ+EBFTm1aR1CHGj4fXXy+7Cmn0Gg367wHvSynNB9YAQ3bhSFVji1510dAdplJKT/dbvRC48p2e\nv2jRojeWu7u76e7ubuSwUlsZ9Gqnnp4eenp6WrLvYd1hKiJ2AK5MKe1arM9MKa0plr8M7J1SGvRG\nbp6MVVWtXAkHHQSrVpVdicaiZp6MHbJFHxGXAd3AVhHxGLAQODgi5gMbgJXASc0oRuoktuhVF0MG\n/SZa6he3oBapoxj0qouG+uilsWCzzfKom04bebPZZhAObtYIGPTSJmyxRZ7zZsKEsit504YN8Ld/\nC6edVnYlqhKDXtqEyZPh+efLruKtzj0X1q4tuwpVjZOaSRUyblxu1UsjYdBLFWLQqxEGvVQhBr0a\nYdBLFWLQqxEGvVQhBr0aYdBLFWLQqxEGvVQhEeDUURopg16qEFv0aoRBL1WIQa9GGPRShRj0aoRB\nL1WIQa9GGPRShRj0aoRBL1WIQa9GGPRShUQY9Bo5g16qkHHjHEevkTPopQqx60aNMOilCjHo1QiD\nXqoQg16NMOilCjHo1QiDXqoQg16NMOilCnF4pRph0EsV4vBKNcKglyrErhs1wqCXKsSgVyMMeqlC\nDHo1wqCXKsSgVyMMeqlCDHo1wqCXKsThlWqEQS9ViMMr1QiDXqoQu27UCINeqhCDXo0w6KUKMejV\nCINeqhCDXo0YMugj4qKI6IuIZf22TYuI6yJiRURcGxFTW1umJDDo1ZjhtOgvBj46YNtpwA0ppZ2A\nG4HTm12YpLdzeKUaMWTQp5RuAZ4bsPkoYHGxvBg4usl1SRqEwyvViEb76LdOKfUBpJTWAFs3ryRJ\nm2LXjRoxvkn7ecc2xqJFi95Y7u7upru7u0mHlcYWg76+enp66Onpacm+Iw3jc2BEbA9cmVKaV6w/\nAHSnlPoiYibwq5TSBzbx2jScY0ga2rJlcOyx+avqLSJIKUUz9jXcrpsoHhtdAZxQLB8PLGlGMZLe\nmS16NWI4wysvA24FdoyIxyLi88A3gUMjYgXwkWJdUosZ9GrEkH30KaVjNvFPhzS5FklDMOjVCK+M\nlSokwuGVGjmDXqoQW/RqRLOGV0pqg64uWLUK5s0ru5Kx5V/+Bfbbr+wqGjes4ZWjOoDDK6WmWrEC\nXn217CrGljlzYMqU9h6zmcMrDXpJ6kBljKOXJFWUQS9JNWfQS1LNGfSSVHMGvSTVnEEvSTVn0EtS\nzRn0klRzBr0k1ZxBL0k1Z9BLUs0Z9JJUcwa9JNWcQS9JNWfQS1LNGfSSVHMGvSTVnEEvSTVn0EtS\nzRn0klRzBr0k1ZxBL0k1Z9BLUs0Z9JJUcwa9JNWcQS9JNWfQS1LNGfSSVHMGvSTVnEEvSTU3fjQv\njoiVwDpgA/BaSmmfZhQlSWqe0bboNwDdKaXdqxTyPT09ZZfwNp1YE3RmXdY0PNY0fJ1aV7OMNuij\nCftou078pnZiTdCZdVnT8FjT8HVqXc0y2pBOwPURcWdE/GUzCpIkNdeo+uiBBSml1RExnRz4D6SU\nbmlGYZKk5oiUUnN2FLEQeCGldN6A7c05gCSNMSmlaMZ+Gm7RR8QkYFxK6cWI2AI4DDh74POaVagk\nqTGj6bqZAfxn0WIfD/x7Sum65pQlSWqWpnXdSJI6U8uGRkbE4RHxYET8NiJObdVx+h3voojoi4hl\n/bZNi4jrImJFRFwbEVP7/dvpEfFQRDwQEYf1275HRCwr6j5/FPXMjog
bI+K+iFgeESeXXVOxr66I\nuD0ieou6FnZIXeMi4p6IuKIT6in2tzIi7i3eqzs6oa6ImBoRPy6OcV9E7Fvyz/mOxftzT/F1XUSc\n3AHv05cj4r+K/f17RGxedk3F/r5U/N61NxNSSk1/kP+APAxsD7wLWAq8vxXH6nfMA4D5wLJ+284F\nTimWTwW+WSzvDPSSu5x2KGrd+OnmdmDvYvlq4KMN1jMTmF8sTwZWAO8vs6Z+tU0qvm4G3AbsU3Zd\nwJeBHwJXlP2961fTI8C0AdvKfp/+Dfh8sTwemFp2Tf1qGwf8N7BtmTUB2xTfu82L9R8Bx5f9PgG7\nAMuALvLv3nXAH7ejrlF9Y9/hP/Qh4Jp+66cBp7biWAOOuz1vDfoHgRnF8kzgwcHqAa4B9i2ec3+/\n7Z8G/qlJtV0OHNJhNU0C7gL2LrMuYDZwPdDNm0Ff+vsEPApsNWBbme/TlsDvBtle+ntV7Ocw4Oay\nayIH/SpgGjkkr+iE3z3gE8CF/dbPAr4KPNDqulrVdfNHwOP91p8otrXb1imlPoCU0hpg62L7wPqe\nLLb9EbnWjZpSd0TsQP60cRv5G1pqTUU3SS+wBrg+pXRnyXV9h/wD3/+EUenvE2+9IPDEDqhrDvBM\nRFxcdJV8P/Lot054rwA+BVxWLJdWU0rpv4FvA48V+1+XUrqhzJoK/wUcWHTVTAL+F/nTT8vrqtz0\nBaPU9jPPETEZ+AnwpZTSi4PU0PaaUkobUkq7k1vS+0TELmXVFREfA/pSSkvJU2psShmjBhaklPYg\n/0J+MSIOHKSOdtY1HtgD+MeirpfIrb7Sf6Yi4l3AkcCPN1FD22qKiHcDR5E/4W8DbBERny2zJoCU\n0oPkbprryd0tvcD6wZ7a7GO3KuifBLbrtz672NZufRExAyAiZgJPFdufJP8l3WhjfZva3pCIGE8O\n+UtTSks6oab+UkrPAz3A4SXWtQA4MiIeAf4f8OGIuBRYU/b7lFJaXXx9mtz1tg/lfv+eAB5PKd1V\nrP+UHPyd8DN1BHB3SumZYr3Mmg4BHkkprU0prQf+E9i/5JoASCldnFLaK6XUDfyefO6u5XW1Kujv\nBOZGxPYRsTm5D+mKFh2rv+CtrcIrgBOK5eOBJf22f7o4Ez8HmAvcUXxsWhcR+0REAJ/r95pG/Cu5\nL+2CTqkpIt678ax+REwEDiX3EZZSV0rpjJTSdiml95F/Tm5MKR0HXFlGPRtFxKTi0xjx5gWByynx\n+1d8vH88InYsNn0EuK/Mmvr5DPkP9UZl1vQY8KGImFDs6yPA/SXXBEDk6WKIiO2Aj5O7ulpf12hP\nwLzDiYfDyX+tHgJOa9Vx+h3vMvIZ/1fJ3+jPk0/G3FDUcR3w7n7PP518FvsB4LB+2/ck/0I/BFww\ninoWkD+WLSV/RLuneE/eU1ZNxb52LWpZSh4BcGaxvdS6iv0dxJsnY8t+n+b0+94t3/gz3AF17UZu\nSC0FfkYedVN2TZOAp4Ep/baVXdPCYv/LgMXk0X+d8DN+E7mvvpc8xXtb3isvmJKkmhtrJ2Mlacwx\n6CWp5gx6Sao5g16Sas6gl6SaM+glqeYMekmqOYNekmru/wMtMk0vzCo2twAAAABJRU5ErkJggg==\n", 328 | "text/plain": [ 329 | "" 330 | ] 331 | }, 332 | "metadata": {}, 333 | "output_type": "display_data" 334 | } 335 | ], 336 | "source": [ 337 | "plt.plot([x[1] for x in docsobj.tfidf_ranking])" 338 | ] 339 | }, 340 | { 341 | "cell_type": "code", 342 | "execution_count": 14, 343 | "metadata": { 344 | "collapsed": false 345 | }, 346 | "outputs": [ 347 | { 348 | "name": "stdout", 349 | "output_type": "stream", 350 | "text": [ 351 | "number of unique stems = 4742\n", 352 | "number of total stems = 250000\n" 353 | ] 354 | } 355 | ], 356 | "source": [ 357 | "docsobj.rank_remove(\"tfidf\",\"stems\",docsobj.tfidf_ranking[5000][1])\n", 358 | "all_stems = [s for d in docsobj.stems for s in d]\n", 359 | "print(\"number of unique stems = %d\" % len(set(all_stems)))\n", 360 | "print(\"number of total stems = %d\" % len(all_stems))" 361 | ] 362 | }, 363 | { 364 | "cell_type": "markdown", 365 | "metadata": {}, 366 | "source": [ 367 | "After pre-processing, we have 4742 unique stems, and 250000 total stems. We now proceed to estimate a topic model on them." 368 | ] 369 | }, 370 | { 371 | "cell_type": "markdown", 372 | "metadata": {}, 373 | "source": [ 374 | "# Estimating a Topic Model" 375 | ] 376 | }, 377 | { 378 | "cell_type": "markdown", 379 | "metadata": {}, 380 | "source": [ 381 | "The first step in estimation is to initialize a model via topicmodels' LDA class. We will pass docsobj.stems as the set of documents, and we also need to decide on a number of topics. Here we choose 30 topics. 
" 382 | ] 383 | }, 384 | { 385 | "cell_type": "code", 386 | "execution_count": 15, 387 | "metadata": { 388 | "collapsed": false 389 | }, 390 | "outputs": [], 391 | "source": [ 392 | "ldaobj = topicmodels.LDA.LDAGibbs(docsobj.stems,30)" 393 | ] 394 | }, 395 | { 396 | "cell_type": "markdown", 397 | "metadata": {}, 398 | "source": [ 399 | "There are three main parameters in LDA, the number of topics, and the two hyperparameters of the Dirichlet priors. topicmodels.LDA follows the advice of Griffiths and Steyvers (2004) and sets the hyperparameter of the Dirichlet prior on topics to $200/V$, where $V$ is the number of unique vocabulary elements, and the hyperparameter of the Dirichlet prior on document-topic distributions to $50/K$, where $K$ is the number of topics." 400 | ] 401 | }, 402 | { 403 | "cell_type": "code", 404 | "execution_count": 16, 405 | "metadata": { 406 | "collapsed": false 407 | }, 408 | "outputs": [ 409 | { 410 | "name": "stdout", 411 | "output_type": "stream", 412 | "text": [ 413 | "30\n", 414 | "1.66666666667\n", 415 | "0.0421762969211\n" 416 | ] 417 | } 418 | ], 419 | "source": [ 420 | "print(ldaobj.K) # number of topic, user defined.\n", 421 | "print(ldaobj.alpha) # hyperparameter for document-topic distribution, automatically defined\n", 422 | "print(ldaobj.beta) # hyperparameter for topics, automatically defined" 423 | ] 424 | }, 425 | { 426 | "cell_type": "markdown", 427 | "metadata": {}, 428 | "source": [ 429 | "Should users wish to define their own priors, they can do so by calling ldaobj.set_prior(alpha,beta). \n", 430 | "\n", 431 | "Another quantity set automatically by topicmodels.LDA is a random allocation of stems to topic assignments. It is a 250000-dimensional vector of integers in $\\{0,\\ldots,29\\}$. Should the user wish to define another seed, call ldaobj.set_seed." 432 | ] 433 | }, 434 | { 435 | "cell_type": "code", 436 | "execution_count": 17, 437 | "metadata": { 438 | "collapsed": false 439 | }, 440 | "outputs": [ 441 | { 442 | "name": "stdout", 443 | "output_type": "stream", 444 | "text": [ 445 | "[22 8 17 7 17 8 4 27 12 15]\n", 446 | "(250000,)\n" 447 | ] 448 | } 449 | ], 450 | "source": [ 451 | "print(ldaobj.topic_seed[:10])\n", 452 | "print(ldaobj.topic_seed.shape)" 453 | ] 454 | }, 455 | { 456 | "cell_type": "markdown", 457 | "metadata": {}, 458 | "source": [ 459 | "Now that we have initialized our topic model, we are ready to sample. To sample, we pass three parameters. The first is the number of iterations we want the chain to burn in before beginning to sample. The second is a thinning interval, the number of iterations to let the chain run between samples. Allowing for a thinning interval reduces autocorrelation between samples. The third is the number of samples to take. So, for example, if the user passes (1000,50,20) the following will happen. First, the chain will run for 1,000 iterations. Then 20 samples will be taken corresponding to the $\\{1050,1100\\ldots,1950,2000\\}$ iterations for a total of 2000 iterations overall.\n", 460 | "\n", 461 | "In order not to waste time in the tutorial, we start with a relatively short chain with no burnin, a thinning interval of 50, and 10 samples, for a total of 500 iterations." 
462 | ] 463 | }, 464 | { 465 | "cell_type": "code", 466 | "execution_count": 18, 467 | "metadata": { 468 | "collapsed": false 469 | }, 470 | "outputs": [ 471 | { 472 | "name": "stdout", 473 | "output_type": "stream", 474 | "text": [ 475 | "Iteration 10 of (collapsed) Gibbs sampling\n", 476 | "Iteration 20 of (collapsed) Gibbs sampling\n", 477 | "Iteration 30 of (collapsed) Gibbs sampling\n", 478 | "Iteration 40 of (collapsed) Gibbs sampling\n", 479 | "Iteration 50 of (collapsed) Gibbs sampling\n", 480 | "Iteration 60 of (collapsed) Gibbs sampling\n", 481 | "Iteration 70 of (collapsed) Gibbs sampling\n", 482 | "Iteration 80 of (collapsed) Gibbs sampling\n", 483 | "Iteration 90 of (collapsed) Gibbs sampling\n", 484 | "Iteration 100 of (collapsed) Gibbs sampling\n", 485 | "Iteration 110 of (collapsed) Gibbs sampling\n", 486 | "Iteration 120 of (collapsed) Gibbs sampling\n", 487 | "Iteration 130 of (collapsed) Gibbs sampling\n", 488 | "Iteration 140 of (collapsed) Gibbs sampling\n", 489 | "Iteration 150 of (collapsed) Gibbs sampling\n", 490 | "Iteration 160 of (collapsed) Gibbs sampling\n", 491 | "Iteration 170 of (collapsed) Gibbs sampling\n", 492 | "Iteration 180 of (collapsed) Gibbs sampling\n", 493 | "Iteration 190 of (collapsed) Gibbs sampling\n", 494 | "Iteration 200 of (collapsed) Gibbs sampling\n", 495 | "Iteration 210 of (collapsed) Gibbs sampling\n", 496 | "Iteration 220 of (collapsed) Gibbs sampling\n", 497 | "Iteration 230 of (collapsed) Gibbs sampling\n", 498 | "Iteration 240 of (collapsed) Gibbs sampling\n", 499 | "Iteration 250 of (collapsed) Gibbs sampling\n", 500 | "Iteration 260 of (collapsed) Gibbs sampling\n", 501 | "Iteration 270 of (collapsed) Gibbs sampling\n", 502 | "Iteration 280 of (collapsed) Gibbs sampling\n", 503 | "Iteration 290 of (collapsed) Gibbs sampling\n", 504 | "Iteration 300 of (collapsed) Gibbs sampling\n", 505 | "Iteration 310 of (collapsed) Gibbs sampling\n", 506 | "Iteration 320 of (collapsed) Gibbs sampling\n", 507 | "Iteration 330 of (collapsed) Gibbs sampling\n", 508 | "Iteration 340 of (collapsed) Gibbs sampling\n", 509 | "Iteration 350 of (collapsed) Gibbs sampling\n", 510 | "Iteration 360 of (collapsed) Gibbs sampling\n", 511 | "Iteration 370 of (collapsed) Gibbs sampling\n", 512 | "Iteration 380 of (collapsed) Gibbs sampling\n", 513 | "Iteration 390 of (collapsed) Gibbs sampling\n", 514 | "Iteration 400 of (collapsed) Gibbs sampling\n", 515 | "Iteration 410 of (collapsed) Gibbs sampling\n", 516 | "Iteration 420 of (collapsed) Gibbs sampling\n", 517 | "Iteration 430 of (collapsed) Gibbs sampling\n", 518 | "Iteration 440 of (collapsed) Gibbs sampling\n", 519 | "Iteration 450 of (collapsed) Gibbs sampling\n", 520 | "Iteration 460 of (collapsed) Gibbs sampling\n", 521 | "Iteration 470 of (collapsed) Gibbs sampling\n", 522 | "Iteration 480 of (collapsed) Gibbs sampling\n", 523 | "Iteration 490 of (collapsed) Gibbs sampling\n", 524 | "Iteration 500 of (collapsed) Gibbs sampling\n" 525 | ] 526 | } 527 | ], 528 | "source": [ 529 | "ldaobj.sample(0,50,10)" 530 | ] 531 | }, 532 | { 533 | "cell_type": "markdown", 534 | "metadata": {}, 535 | "source": [ 536 | "Because we allowed no burn in and started sampling straight away, one would imagine the initial draws were poor in terms of describing topics. A formalization of this idea is to compute the perplexity of each of the samples. Perplexity is a common goodness-of-fit meausure in natural language processing and information theory literature that describes how well a probability model explains data. 
Lower values indicate better goodness-of-fit. Calling ldaobj.perplexity() returns the perplexity of each sample." 537 | ] 538 | }, 539 | { 540 | "cell_type": "code", 541 | "execution_count": 19, 542 | "metadata": { 543 | "collapsed": false 544 | }, 545 | "outputs": [ 546 | { 547 | "data": { 548 | "text/plain": [ 549 | "array([910.00942281, 874.09497579, 862.04496391, 852.09503063,\n", 550 | " 847.52895528, 843.21069781, 842.30736504, 840.07617437,\n", 551 | " 840.15844958, 838.29737893])" 552 | ] 553 | }, 554 | "execution_count": 19, 555 | "metadata": {}, 556 | "output_type": "execute_result" 557 | } 558 | ], 559 | "source": [ 560 | "ldaobj.perplexity()" 561 | ] 562 | }, 563 | { 564 | "cell_type": "markdown", 565 | "metadata": {}, 566 | "source": [ 567 | "Just as we suspected, the first sample has a much higher perplexity than the last. Moreover, it might be that if we had kept sampling the chain, we could get even lower perplexity. Once we call ldaobj.sample the first time, all further calls extend the existing chain by default rather than start from scratch. So let's draw another ten samples." 568 | ] 569 | }, 570 | { 571 | "cell_type": "code", 572 | "execution_count": 20, 573 | "metadata": { 574 | "collapsed": false 575 | }, 576 | "outputs": [ 577 | { 578 | "name": "stdout", 579 | "output_type": "stream", 580 | "text": [ 581 | "Iteration 10 of (collapsed) Gibbs sampling\n", 582 | "Iteration 20 of (collapsed) Gibbs sampling\n", 583 | "Iteration 30 of (collapsed) Gibbs sampling\n", 584 | "Iteration 40 of (collapsed) Gibbs sampling\n", 585 | "Iteration 50 of (collapsed) Gibbs sampling\n", 586 | "Iteration 60 of (collapsed) Gibbs sampling\n", 587 | "Iteration 70 of (collapsed) Gibbs sampling\n", 588 | "Iteration 80 of (collapsed) Gibbs sampling\n", 589 | "Iteration 90 of (collapsed) Gibbs sampling\n", 590 | "Iteration 100 of (collapsed) Gibbs sampling\n", 591 | "Iteration 110 of (collapsed) Gibbs sampling\n", 592 | "Iteration 120 of (collapsed) Gibbs sampling\n", 593 | "Iteration 130 of (collapsed) Gibbs sampling\n", 594 | "Iteration 140 of (collapsed) Gibbs sampling\n", 595 | "Iteration 150 of (collapsed) Gibbs sampling\n", 596 | "Iteration 160 of (collapsed) Gibbs sampling\n", 597 | "Iteration 170 of (collapsed) Gibbs sampling\n", 598 | "Iteration 180 of (collapsed) Gibbs sampling\n", 599 | "Iteration 190 of (collapsed) Gibbs sampling\n", 600 | "Iteration 200 of (collapsed) Gibbs sampling\n", 601 | "Iteration 210 of (collapsed) Gibbs sampling\n", 602 | "Iteration 220 of (collapsed) Gibbs sampling\n", 603 | "Iteration 230 of (collapsed) Gibbs sampling\n", 604 | "Iteration 240 of (collapsed) Gibbs sampling\n", 605 | "Iteration 250 of (collapsed) Gibbs sampling\n", 606 | "Iteration 260 of (collapsed) Gibbs sampling\n", 607 | "Iteration 270 of (collapsed) Gibbs sampling\n", 608 | "Iteration 280 of (collapsed) Gibbs sampling\n", 609 | "Iteration 290 of (collapsed) Gibbs sampling\n", 610 | "Iteration 300 of (collapsed) Gibbs sampling\n", 611 | "Iteration 310 of (collapsed) Gibbs sampling\n", 612 | "Iteration 320 of (collapsed) Gibbs sampling\n", 613 | "Iteration 330 of (collapsed) Gibbs sampling\n", 614 | "Iteration 340 of (collapsed) Gibbs sampling\n", 615 | "Iteration 350 of (collapsed) Gibbs sampling\n", 616 | "Iteration 360 of (collapsed) Gibbs sampling\n", 617 | "Iteration 370 of (collapsed) Gibbs sampling\n", 618 | "Iteration 380 of (collapsed) Gibbs sampling\n", 619 | "Iteration 390 of (collapsed) Gibbs sampling\n", 620 | "Iteration 400 of (collapsed) Gibbs sampling\n", 621 | "Iteration 
410 of (collapsed) Gibbs sampling\n", 622 | "Iteration 420 of (collapsed) Gibbs sampling\n", 623 | "Iteration 430 of (collapsed) Gibbs sampling\n", 624 | "Iteration 440 of (collapsed) Gibbs sampling\n", 625 | "Iteration 450 of (collapsed) Gibbs sampling\n", 626 | "Iteration 460 of (collapsed) Gibbs sampling\n", 627 | "Iteration 470 of (collapsed) Gibbs sampling\n", 628 | "Iteration 480 of (collapsed) Gibbs sampling\n", 629 | "Iteration 490 of (collapsed) Gibbs sampling\n", 630 | "Iteration 500 of (collapsed) Gibbs sampling\n" 631 | ] 632 | } 633 | ], 634 | "source": [ 635 | "ldaobj.sample(0,50,10)" 636 | ] 637 | }, 638 | { 639 | "cell_type": "code", 640 | "execution_count": 21, 641 | "metadata": { 642 | "collapsed": false 643 | }, 644 | "outputs": [ 645 | { 646 | "data": { 647 | "text/plain": [ 648 | "array([910.00942281, 874.09497579, 862.04496391, 852.09503063,\n", 649 | " 847.52895528, 843.21069781, 842.30736504, 840.07617437,\n", 650 | " 840.15844958, 838.29737893, 837.77127923, 838.24312563,\n", 651 | " 835.7131597 , 834.27288412, 835.88357615, 834.14504198,\n", 652 | " 834.5741682 , 834.92096362, 836.03689584, 834.48549032])" 653 | ] 654 | }, 655 | "execution_count": 21, 656 | "metadata": {}, 657 | "output_type": "execute_result" 658 | } 659 | ], 660 | "source": [ 661 | "ldaobj.perplexity()" 662 | ] 663 | }, 664 | { 665 | "cell_type": "markdown", 666 | "metadata": {}, 667 | "source": [ 668 | "Indeed, the topic model has improved, and may yet improve more with further sampling, but at this point we will stop to continue with the tutorial. (In research applications, one would normally apply some convergence criterion to determine the stopping point.) Ideally we'd like to throw away the initial samples and only keep the last ones. ldaobj.samples_keep(n) keeps the last n samples of the chain (users can also pass a list of numbers corresponding to the indices they'd like to keep - remember that Python uses 0-indexing). We will keep the last four samples." 669 | ] 670 | }, 671 | { 672 | "cell_type": "code", 673 | "execution_count": 22, 674 | "metadata": { 675 | "collapsed": false 676 | }, 677 | "outputs": [], 678 | "source": [ 679 | "ldaobj.samples_keep(4)" 680 | ] 681 | }, 682 | { 683 | "cell_type": "markdown", 684 | "metadata": {}, 685 | "source": [ 686 | "So far we have only been sampling topic assignments for all the words in the dataset. What we really care about are the topics and the distribution of topics in each document. ldaobj has been carrying these around for us while we have been sampling. ldaobj.tt are the esitmated topics, and ldaobj.dt are the estimated document-topic distributions." 687 | ] 688 | }, 689 | { 690 | "cell_type": "code", 691 | "execution_count": 23, 692 | "metadata": { 693 | "collapsed": false 694 | }, 695 | "outputs": [ 696 | { 697 | "name": "stdout", 698 | "output_type": "stream", 699 | "text": [ 700 | "(4742, 30, 4)\n", 701 | "(9488, 30, 4)\n" 702 | ] 703 | } 704 | ], 705 | "source": [ 706 | "print(ldaobj.tt.shape)\n", 707 | "print(ldaobj.dt.shape)" 708 | ] 709 | }, 710 | { 711 | "cell_type": "markdown", 712 | "metadata": {}, 713 | "source": [ 714 | "The estimated topics are represented by $4742 \\times 30$ matrices whose columns sum to one, one for each sample, while the estimated distributions of topics within each document are represented by $9488 \\times 30$ matrices whose rows sum to one. To get an idea of the topics that have been estimated, and whether they make sense, ldaobj.topic_content(n) produces topic_description.csv in the working directory. 
Its rows contain the first n stems in each topic ranked according their probability, using the final stored sample. It's a good idea to check the topics are \"reasonable\" before proceeding with any analysis." 715 | ] 716 | }, 717 | { 718 | "cell_type": "code", 719 | "execution_count": 24, 720 | "metadata": { 721 | "collapsed": false 722 | }, 723 | "outputs": [], 724 | "source": [ 725 | "ldaobj.topic_content(20)" 726 | ] 727 | }, 728 | { 729 | "cell_type": "markdown", 730 | "metadata": {}, 731 | "source": [ 732 | "Most economics researchers will probably be most interested initially in the distributions of topics within each document. To generate these, one should average the matrices in ldaobj.dt. Here we have only taken four samples for purposes of illustration, but in actual research one should ideally take as many as is computationally feasible. ldaobj has a convenience method for doing this average, which will both return it as well as, by default, write it to dt.csv in the working directory (to disable printing, pass False to the method)." 733 | ] 734 | }, 735 | { 736 | "cell_type": "code", 737 | "execution_count": 25, 738 | "metadata": { 739 | "collapsed": false 740 | }, 741 | "outputs": [], 742 | "source": [ 743 | "dt = ldaobj.dt_avg()" 744 | ] 745 | }, 746 | { 747 | "cell_type": "markdown", 748 | "metadata": {}, 749 | "source": [ 750 | "One might also be interested in the average topics themselves, in which case there is a similar convenience function available that writes tt.csv to the working directory by default. Each unique stem in the data is associated to a number corresponding to rows of the topic matrices. Therefore, in most cases one will probably want to print out this key too, available as ldaobj.dict_print." 751 | ] 752 | }, 753 | { 754 | "cell_type": "code", 755 | "execution_count": 26, 756 | "metadata": { 757 | "collapsed": false 758 | }, 759 | "outputs": [], 760 | "source": [ 761 | "tt = ldaobj.tt_avg()\n", 762 | "ldaobj.dict_print()" 763 | ] 764 | }, 765 | { 766 | "cell_type": "markdown", 767 | "metadata": {}, 768 | "source": [ 769 | "One might also want to replace the speech field in the original dataset with the estimated topics in order to have a ready-to-go dataset for regression or other econometric analysis. The following code builds this dataset, and also writes it to file." 770 | ] 771 | }, 772 | { 773 | "cell_type": "code", 774 | "execution_count": 27, 775 | "metadata": { 776 | "collapsed": false 777 | }, 778 | "outputs": [], 779 | "source": [ 780 | "data = data.drop('speech',1)\n", 781 | "for i in range(ldaobj.K): data['T' + str(i)] = dt[:,i]\n", 782 | "data.to_csv(\"final_output.csv\",index=False)" 783 | ] 784 | }, 785 | { 786 | "cell_type": "markdown", 787 | "metadata": {}, 788 | "source": [ 789 | "If one wishes to analyze some function of the estimated document-topic distributions, this function should be computed for each separate sample and then averaged. Since the relevant functions are context-specific, topicmodels.LDA does not provide them, but it can be easily extended to accomodate this." 790 | ] 791 | }, 792 | { 793 | "cell_type": "markdown", 794 | "metadata": {}, 795 | "source": [ 796 | "# Querying Using Estimated Topics" 797 | ] 798 | }, 799 | { 800 | "cell_type": "markdown", 801 | "metadata": {}, 802 | "source": [ 803 | "After estimating a topic model, one is often interested in estimating the distribution of topics for documents not included in estimation. 
In this case, one option is to $\\textit{query}$ those documents by holding fixed the topics estimated from LDA, and only estimating the distribution of topics for the out-of-sample documents. The topicmodels module also provides a class for such querying, which this section introduces.\n", 804 | "\n", 805 | "We will apply querying to the corpus of entire State of the Union Addresses since 1947 (recall that we estimated topics on the level of the paragraph within each speech). In terms of estimating topics, the paragraph level is preferable to the speech level since individual paragraphs are more likely to be based around a single theme. But, in terms of econometric work, the entire speech is a more natural unit of analysis. At the same time, there is no general way of \"adding up\" probability distribution at the paragraph level in order to arrive at a speech-level distribution. Hence the need for querying, which allows us to estimate the speech-level distributions. (Extra credit: after the tutorial, estimate LDA on the entire speech level, and judge for yourself how the topics compare to those estimated at the paragraph level).\n", 806 | "\n", 807 | "The Query class is initialized in much the same way as LDA, but takes two additional objects: a 3-D array of estimated topics (number of tokens in the estimated topics $\\times$ number of estimated topics $\\times$ number of samples from the estimation); and a dictionary that maps tokens into an index. We can just pass these directly from ldaobj, which contains data from the above estimated LDA model." 808 | ] 809 | }, 810 | { 811 | "cell_type": "code", 812 | "execution_count": 28, 813 | "metadata": { 814 | "collapsed": false 815 | }, 816 | "outputs": [], 817 | "source": [ 818 | "data['speech'] = [' '.join(s) for s in docsobj.stems] # replace the speech field in the original data with its cleaned version from docsobj\n", 819 | "aggspeeches = data.groupby(['year','president'])['speech'].apply(lambda x: ' '.join(x)) # aggregate up to the speech level\n", 820 | "aggdocs = topicmodels.RawDocs(aggspeeches) # create new RawDocs object that contains entire speech stems in aggdocs.tokens\n", 821 | "queryobj = topicmodels.LDA.QueryGibbs(aggdocs.tokens,ldaobj.token_key,ldaobj.tt) # initialize query object with ldaobj attributes" 822 | ] 823 | }, 824 | { 825 | "cell_type": "markdown", 826 | "metadata": {}, 827 | "source": [ 828 | "Before continuing, suppose that we instead wanted to query a document whose constitutent parts had not been included in estimation, for example a State of the Union Address from the 1930s. How to proceed? First, create a RawDocs object with the text to be queried (recall that RawDocs can take a basic text file, which each new line treated as a separate documents). Second, perform the same cleaning steps as were done for the documents that went into the estimated model. However, there is no need to do any stopword removal. When you initialize a Query object, tokens in the documents to be queried that are not present in the estimated model are automatically stripped out.\n", 829 | "\n", 830 | "Since we don't need to estimate topics when querying, we can use far fewer iterations. Let's start with 10." 
831 | ] 832 | }, 833 | { 834 | "cell_type": "code", 835 | "execution_count": 29, 836 | "metadata": { 837 | "collapsed": false 838 | }, 839 | "outputs": [ 840 | { 841 | "name": "stdout", 842 | "output_type": "stream", 843 | "text": [ 844 | "Sample 0 queried\n", 845 | "Sample 1 queried\n", 846 | "Sample 2 queried\n", 847 | "Sample 3 queried\n" 848 | ] 849 | } 850 | ], 851 | "source": [ 852 | "queryobj.query(10) # query our four samples" 853 | ] 854 | }, 855 | { 856 | "cell_type": "markdown", 857 | "metadata": {}, 858 | "source": [ 859 | "To convince yourself that we don't need many iterations, let's look at the perplexity of the data at the entire speech level. Notice that it is much higher than the perplexity of the data at the paragraph level. This indicates that the topic model predicts content at the paragraph level much better." 860 | ] 861 | }, 862 | { 863 | "cell_type": "code", 864 | "execution_count": 30, 865 | "metadata": { 866 | "collapsed": false 867 | }, 868 | "outputs": [ 869 | { 870 | "data": { 871 | "text/plain": [ 872 | "array([1228.92265336, 1230.21991452, 1230.41268569, 1230.958315 ])" 873 | ] 874 | }, 875 | "execution_count": 30, 876 | "metadata": {}, 877 | "output_type": "execute_result" 878 | } 879 | ], 880 | "source": [ 881 | "queryobj.perplexity()" 882 | ] 883 | }, 884 | { 885 | "cell_type": "markdown", 886 | "metadata": {}, 887 | "source": [ 888 | "Now let's triple the number of iterations to 30 and again look at the perplexity. (Unlike LDA's sampling, each call to query starts sampling from scratch)." 889 | ] 890 | }, 891 | { 892 | "cell_type": "code", 893 | "execution_count": 31, 894 | "metadata": { 895 | "collapsed": false 896 | }, 897 | "outputs": [ 898 | { 899 | "name": "stdout", 900 | "output_type": "stream", 901 | "text": [ 902 | "Sample 0 queried\n", 903 | "Sample 1 queried\n", 904 | "Sample 2 queried\n", 905 | "Sample 3 queried\n" 906 | ] 907 | } 908 | ], 909 | "source": [ 910 | "queryobj.query(30) # query our four samples using more iterations" 911 | ] 912 | }, 913 | { 914 | "cell_type": "code", 915 | "execution_count": 32, 916 | "metadata": { 917 | "collapsed": false 918 | }, 919 | "outputs": [ 920 | { 921 | "data": { 922 | "text/plain": [ 923 | "array([1228.8451597 , 1230.29149861, 1230.47619068, 1231.00879337])" 924 | ] 925 | }, 926 | "execution_count": 32, 927 | "metadata": {}, 928 | "output_type": "execute_result" 929 | } 930 | ], 931 | "source": [ 932 | "queryobj.perplexity()" 933 | ] 934 | }, 935 | { 936 | "cell_type": "markdown", 937 | "metadata": {}, 938 | "source": [ 939 | "Note these values are nearly exactly the same as for the 10-iteration querying.\n", 940 | "\n", 941 | "Finally, we follow similar steps as for LDA to output the estimated distribution of topics for entire speeches." 942 | ] 943 | }, 944 | { 945 | "cell_type": "code", 946 | "execution_count": 33, 947 | "metadata": { 948 | "collapsed": false 949 | }, 950 | "outputs": [], 951 | "source": [ 952 | "dt_query = queryobj.dt_avg()\n", 953 | "aggdata = pd.DataFrame(dt_query,index=aggspeeches.index,columns=['T' + str(i) for i in range(queryobj.K)])\n", 954 | "aggdata.to_csv(\"final_output_agg.csv\")" 955 | ] 956 | }, 957 | { 958 | "cell_type": "markdown", 959 | "metadata": {}, 960 | "source": [ 961 | "# Assessing Output" 962 | ] 963 | }, 964 | { 965 | "cell_type": "markdown", 966 | "metadata": {}, 967 | "source": [ 968 | "At this point, you can use all of the csv files this tutorial has generated with your statistical software of choice (should this not be Python!) to analyze the topics. 
Before finishing, though, we can perform an initial test of whether our output makes sense intuitively. The following code determines each President's top topics, as measured in terms of deviations from the sample average." 969 | ] 970 | }, 971 | { 972 | "cell_type": "code", 973 | "execution_count": 34, 974 | "metadata": { 975 | "collapsed": false 976 | }, 977 | "outputs": [ 978 | { 979 | "name": "stderr", 980 | "output_type": "stream", 981 | "text": [ 982 | "/Users/stephenhansen/anaconda/lib/python2.7/site-packages/pandas/core/computation/check.py:17: UserWarning: The installed version of numexpr 2.4.4 is not supported in pandas and will be not be used\n", 983 | "The minimum supported version is 2.4.6\n", 984 | "\n", 985 | " ver=ver, min_ver=_MIN_NUMEXPR_VERSION), UserWarning)\n" 986 | ] 987 | } 988 | ], 989 | "source": [ 990 | "def top_topics(x):\n", 991 | "\ttop = x.values.argsort()[-5:][::-1]\n", 992 | "\treturn(pd.Series(top,index=range(1,6)))\n", 993 | "\n", 994 | "temp = aggdata.reset_index()\n", 995 | "ranking = temp.set_index('president')\n", 996 | "ranking = ranking - ranking.mean()\n", 997 | "ranking = ranking.groupby(level='president').mean()\n", 998 | "ranking = ranking.sort_values('year')\n", 999 | "ranking = ranking.drop('year',1)\n", 1000 | "ranking = ranking.apply(top_topics,axis=1)\n", 1001 | "ranking.to_csv(\"president_top_topics.csv\")" 1002 | ] 1003 | }, 1004 | { 1005 | "cell_type": "markdown", 1006 | "metadata": {}, 1007 | "source": [ 1008 | "For this particular topic model, for example, George W. Bush's top topic contains words relating to military force, and Obama's employment and economic activity. The topic model you estimate will of course vary, so I encourage you to open president_top_policy_topics.csv and topic_description.csv to have a look for yourself. Note too that some topics probably relate to policy, while some others relate to pure rhetoric. Depending on the nature of the analysis you want to do with the data, it may make sense to restrict attention to some subset of the estimated topics." 1009 | ] 1010 | }, 1011 | { 1012 | "cell_type": "markdown", 1013 | "metadata": {}, 1014 | "source": [ 1015 | "That's all for now, I hope you enjoyed the tutorial, and begin to use topic modelling in your own work!" 1016 | ] 1017 | } 1018 | ], 1019 | "metadata": { 1020 | "kernelspec": { 1021 | "display_name": "Python 2", 1022 | "language": "python", 1023 | "name": "python2" 1024 | }, 1025 | "language_info": { 1026 | "codemirror_mode": { 1027 | "name": "ipython", 1028 | "version": 2 1029 | }, 1030 | "file_extension": ".py", 1031 | "mimetype": "text/x-python", 1032 | "name": "python", 1033 | "nbconvert_exporter": "python", 1034 | "pygments_lexer": "ipython2", 1035 | "version": "2.7.11" 1036 | } 1037 | }, 1038 | "nbformat": 4, 1039 | "nbformat_minor": 0 1040 | } 1041 | --------------------------------------------------------------------------------