├── .gitignore
├── README.md
├── data
│   ├── academic_journals_internet_research.stm30.RData
│   ├── donors_choose_sample.csv
│   ├── stm_donor.RData
│   ├── stm_donor_content.RData
│   └── stm_donor_int.RData
├── stm_ic2s2.Rproj
├── stm_tutorial_code.Rmd
├── stm_tutorial_code.html
└── stm_tutorial_slides.pdf

/.gitignore:
--------------------------------------------------------------------------------
.Rproj.user
.Rhistory
.RData
.Ruserdata

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# stm_ic2s2

This page accompanies the tutorial "Structural topic models for enriching quantitative text analysis" at the International Conference on Computational Social Science (IC2S2) in Amsterdam, NL, on July 17th, 2019, organized by Carsten Schwemmer and Cornelius Puschmann.

For the tutorial, please make sure that you have access to a recent version of [R](https://cran.r-project.org/) and preferably [RStudio](https://www.rstudio.com/products/rstudio/download/) on your computer. To save time, you can also install the R packages we need beforehand:

```
install.packages(c('tidyverse', 'stm', 'stminsights', 'quanteda', 'rmarkdown'), dependencies = TRUE)
```

For more information on the stm package, head over to http://www.structuraltopicmodel.com/.

--------------------------------------------------------------------------------
/data/academic_journals_internet_research.stm30.RData:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cbpuschmann/stm_ic2s2/2d361b535a7b60fe3c8c22619fb283cbfc384400/data/academic_journals_internet_research.stm30.RData
--------------------------------------------------------------------------------
/data/stm_donor.RData:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cbpuschmann/stm_ic2s2/2d361b535a7b60fe3c8c22619fb283cbfc384400/data/stm_donor.RData
--------------------------------------------------------------------------------
/data/stm_donor_content.RData:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cbpuschmann/stm_ic2s2/2d361b535a7b60fe3c8c22619fb283cbfc384400/data/stm_donor_content.RData
--------------------------------------------------------------------------------
/data/stm_donor_int.RData:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cbpuschmann/stm_ic2s2/2d361b535a7b60fe3c8c22619fb283cbfc384400/data/stm_donor_int.RData
--------------------------------------------------------------------------------
/stm_ic2s2.Rproj:
--------------------------------------------------------------------------------
Version: 1.0

RestoreWorkspace: No
SaveWorkspace: No
AlwaysSaveHistory: No

EnableCodeIndexing: Yes
UseSpacesForTab: Yes
NumSpacesForTab: 2
Encoding: UTF-8

RnwWeave: Sweave
LaTeX: pdfLaTeX

--------------------------------------------------------------------------------
/stm_tutorial_code.Rmd:
--------------------------------------------------------------------------------
---
title: "Structural topic models for enriching quantitative text analysis
  material available at https://github.com/cbpuschmann/stm_ic2s2"
author: "Cornelius Puschmann & Carsten Schwemmer"
output:
  revealjs::revealjs_presentation:
    mathjax: null
    center: yes
    fig_height: 3.5
    fig_width: 7
    reveal_options:
      previewLinks: yes
      slideNumber: yes
    theme: default
    transition: fade
  html_document:
    highlight: tango
    code_folding: show
date: "July 17, 2019"
---

*A huge thanks to Brandon Stewart for maintaining the STM package and for providing some of the content which we use in this tutorial.*

# Setup

## Prepare your R environment

- Download and extract the tutorial content from our [github repository](https://github.com/cbpuschmann/stm_ic2s2)
- Open the R project file `stm_ic2s2.Rproj` in RStudio

## How R Markdown files work

- Code is placed inside code chunks; documentation is placed outside of code chunks
- Create new code chunks with CTRL/CMD + ALT + I
- Use CTRL/CMD + SHIFT + ENTER to run an entire code chunk
- Use CTRL/CMD + ENTER to run selected code

```{r}
# this is a comment inside a code chunk
2+3
5+2
```

## Install packages

- Please install the packages that we will need for the tutorial:

```{r, eval = FALSE}
install.packages(c('tidyverse', 'stm', 'stminsights',
                   'quanteda', 'rmarkdown'),
                 dependencies = TRUE)
```

## Loading data and packages

```{r , message=FALSE, warning=FALSE, results = 'hide'}
library(tidyverse)
library(stm)
library(stminsights)
library(quanteda)
library(lubridate)
theme_set(theme_light())
df <- read_csv('data/donors_choose_sample.csv')
```

## The data we use for the workshop

- We will be using a sample from a [Kaggle](https://www.kaggle.com/c/donorschoose-application-screening) Data Science for Good challenge
- [DonorsChoose.org](https://DonorsChoose.org) provided the data and hosts an online platform where teachers can post requests for resources and people can make donations to these projects
- The goal of the original challenge was to match previous donors with campaigns that would most likely inspire additional donations
- The dataset includes texts and context information which might help answer various questions in social science. A description of the variables is available [here](https://www.kaggle.com/donorschoose/io/discussion/56030)

## What could we learn from this data?

Examples of questions we might ask:

- How has classroom technology use changed over time? How does it differ by geographic location and the age of students?
- How do the requests of schools in urban areas compare to those in rural areas?
- What predicts whether a project will be funded?
- How do the predictors of funding vary by geographic location? Or by the economic status of the students?
- Do male and female teachers ask for different resources? Are there differences in the way that they ask for those resources?

# Formal background of topic models

## Formal background of topic models

See our slides

# Preprocessing and feature selection

## Preprocessing and feature selection

- Due to time constraints, we will not explain the following code for preprocessing and feature selection in detail
- You can explore it during the open coding session or after the tutorial
- Use `load("data/stm_donor.RData")` to load the fully processed R objects that we need for the tutorial

## Inspecting the data structure

```{r}
glimpse(df)
```

## Text example for one donation request

```{r}
cat(df$project_essay[1])
```

## Preparing texts

We use a [regular expression](https://en.wikipedia.org/wiki/Regular_expression) to clean up the donation texts:

```{r}
df$project_essay <- str_replace_all(df$project_essay,
                                    '<br>', '\n\n')
```

As we will incorporate contextual information of documents in our STM models, we also need to preprocess other variables.

## Working with time stamps

First, we convert the time strings to a date format and then create a numerical variable, where the earliest date corresponds to 0. We do so because estimating STM effects doesn't play nicely with `date` variables.

```{r}
# CTRL/CMD + SHIFT + M for the pipe operator
df$date <- ymd(df$project_posted_date)
min_date <- min(df$date) %>%
  as.numeric()
df$date_num <- as.numeric(df$date) - min_date
date_table <- df %>% arrange(date_num) %>% select(date, date_num)
head(date_table, 2)
```

## Example for recoding variables

We can generate a proxy for the gender of teachers by working with their name prefixes.

```{r}
df %>% count(teacher_prefix)
```

## Example for recoding variables

```{r}
df <- df %>% mutate(gender = case_when(
  teacher_prefix %in% c('Mrs.', 'Ms.') ~ 'Female',
  teacher_prefix == 'Mr.' ~ 'Male',
  TRUE ~ 'Other/Non-binary')) # TRUE -> everything else
df %>% count(gender)
```

## Other interesting variables: metro type

```{r}
df %>% count(school_metro_type)
```

## Other interesting variables: resource type

```{r}
df %>% count(project_resource_category, sort = TRUE)
```

## Other interesting variables: children eligible for free lunch

```{r}
df %>% ggplot(aes(x = school_percentage_free_lunch)) +
  geom_histogram(bins = 20)
```

## Text analysis using quanteda

![](https://avatars2.githubusercontent.com/u/34347233?s=200&v=4){width=150px}

- A variety of R packages support quantitative text analyses. We will focus on [quanteda](https://quanteda.io/), which is created and maintained by the social scientists behind the Quanteda Initiative
- Besides offering a huge number of methods for preprocessing and analysis, it also includes a function to prepare our textual data for structural topic modeling

## Quanteda corpus object

- Using `corpus()`, you can create a quanteda corpus from a character vector or a data frame, which automatically includes metadata as document variables

```{r}
donor_corp <- corpus(df, text_field = 'project_essay',
                     docid_field = 'project_id')
docvars(donor_corp)$text <- df$project_essay # we need unprocessed texts later
ndoc(donor_corp) # number of documents
```

## KWIC

- Before tokenization, corpus objects can be used to discover [keywords in context](https://en.wikipedia.org/wiki/Key_Word_in_Context) (KWIC):

```{r}
kwic_donor <- kwic(donor_corp, pattern = c("ipad"),
                   window = 5) # context window
head(kwic_donor, 3)
```

## Tokenization

- Tokens can be created from a corpus or character vector. The documentation (`?tokens()`) illustrates several options, e.g. for the removal of punctuation

```{r}
donor_tokens <- tokens(donor_corp)
donor_tokens[[1]][1:20]
```

## Basic form of tokens

- After tokenization, some terms with similar semantic meaning might be regarded as different features (e.g. `love`, `loving`)
- One solution is the application of [stemming](https://en.wikipedia.org/wiki/Stemming), which tries to reduce words to their basic form:

```{r}
words <- c("love", "loving", "lovingly", "loved", "lover", "lovely")
char_wordstem(words, 'english')
```

## To stem or not to stem?

- In the context of topic modeling, a [recent study](https://www.transacl.org/ojs/index.php/tacl/article/view/868/196) suggests that stemmers produce no meaningful improvement (for the English language)
- Ultimately, whether stemming generates useful features or not varies by use case
- An alternative that we won't cover in this tutorial is [lemmatization](https://en.wikipedia.org/wiki/Lemmatisation), available via packages like [spacyr](https://cran.r-project.org/web/packages/spacyr/index.html) and [udpipe](https://cran.r-project.org/web/packages/udpipe/index.html) (a short sketch follows on the next slide)
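## Lemmatization with udpipe (sketch)

A rough sketch of what lemmatization could look like with udpipe. The model download, the `ud_en` object name and the example sentence are our own assumptions for illustration, and the chunk is not evaluated during the tutorial:

```{r, eval = FALSE}
library(udpipe)
# download and load a pre-trained English model (one-time download)
ud_en <- udpipe_download_model(language = "english")
ud_en <- udpipe_load_model(file = ud_en$file_model)
# annotate an example sentence and inspect the lemmas
anno <- as.data.frame(udpipe_annotate(ud_en,
                                      x = "She was loving the lovely lovers"))
anno$lemma
```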
## More preprocessing

- Multiple preprocessing steps can be chained via the pipe operator, e.g. normalizing to lowercase and removing common English stopwords:

```{r}
donor_tokens <- donor_tokens %>%
  tokens_tolower() %>%
  tokens_remove(stopwords('english'), padding = TRUE)

donor_tokens[[1]][1:10]
```

## Detecting collocations

- Collocations (phrases) are sequences of tokens which carry a shared semantic meaning, e.g. `United States`
- Quanteda can detect collocations with log-linear models. An important parameter is the minimum collocation frequency, which can be used to fine-tune results

```{r}
colls <- textstat_collocations(donor_tokens,
                               min_count = 200) # minimum frequency
donor_tokens <- tokens_compound(donor_tokens, colls, join = FALSE) %>%
  tokens_remove('') # remove empty strings

donor_tokens[[1]][1:5]
```

## Document-Feature Matrix (DFM)

- Most models for automated text analysis require matrices as input format
- A common variant which directly translates to the bag-of-words format is the [document term matrix](https://en.wikipedia.org/wiki/Document-term_matrix) (in quanteda: document-feature matrix; a toy quanteda example follows on the next slide):

doc_id   I   like   hate   currywurst
-------- --- ------ ------ -------------
1        1   1      0      1
2        1   0      1      1
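## DFM toy example

A minimal sketch that rebuilds the matrix above in quanteda (`toy_texts` and its two example sentences are made up for illustration):

```{r, eval = FALSE}
toy_texts <- c(d1 = "I like currywurst", d2 = "I hate currywurst")
toy_dfm <- dfm(tokens(toy_texts), tolower = FALSE) # keep case to match the table
convert(toy_dfm, to = 'matrix') # dense matrix, just for display
```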
## Creating a Document-Feature Matrix (dfm)

- Problem: textual data is high-dimensional -> DFMs can grow to millions of rows & columns -> dense matrices for large text corpora don't fit into memory
- Features are not evenly distributed (see e.g. [Zipf's law](https://en.wikipedia.org/wiki/Zipf%27s_law)), so most of these cells contain zeroes
- Solution: a sparse data format, which does not store zero counts. Quanteda natively implements DFMs as [sparse matrices](https://en.wikipedia.org/wiki/Sparse_matrix)

## DFMs in quanteda

- Quanteda can create DFMs from character vectors, corpora and token objects
- Preprocessing that does not need to account for word order can also be done during or after the creation of DFMs (see the documentation for `dfm()`)

```{r}
dfm_donor <- dfm(donor_tokens, remove_numbers = TRUE)
dim(dfm_donor)
```

## More preprocessing - feature trimming

- As an alternative (or complement) to manually defining stopwords, terms occurring in a large proportion of documents can be removed automatically. Rationale: if almost every document includes a term, it is not a useful feature for categorization
- Very rare terms are often removed as well, as they are also not very helpful for categorization and can lead to overfitting

```{r}
dfm_donor <- dfm_donor %>%
  dfm_keep(min_nchar = 2) %>% # remove features with only one character
  dfm_trim(min_docfreq = 0.002, max_docfreq = 0.50, # 0.2% min, 50% max
           docfreq_type = 'prop') # proportions instead of counts
dim(dfm_donor)
```

## Prepare textual data for STM

- You can provide input data for the stm package in several ways:

  - via STM's own functions for text pre-processing (see the sketch on the next slide)
  - via directly passing quanteda DFMs
  - using quanteda's `convert()` function to prepare DFMs (recommended option)

```{r}
out <- convert(dfm_donor, to = 'stm')
names(out)
```
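## Prepare textual data for STM - stm's own functions

For completeness, a minimal sketch of the first option above, using stm's built-in `textProcessor()` and `prepDocuments()` with their default preprocessing settings (we stick with the quanteda route in this tutorial; `out_stm` is our own object name):

```{r, eval = FALSE}
processed <- textProcessor(documents = df$project_essay, metadata = df)
out_stm <- prepDocuments(documents = processed$documents,
                         vocab = processed$vocab,
                         meta = processed$meta)
names(out_stm) # also contains documents, vocab and meta
```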
# Introducing structural topic models and tuning parameters

## Introducing structural topic models and tuning parameters

See our slides

## STM - model fitting

- For our first model, we will choose 30 topics and include school metro type, teacher gender and a flexible [spline](https://en.wikipedia.org/wiki/Spline_(mathematics)) for date as prevalence covariates:

```{r, eval = FALSE}
stm_30 <- stm(documents = out$documents,
              vocab = out$vocab,
              data = out$meta,
              K = 30,
              prevalence = ~ school_metro_type + gender + s(date_num),
              verbose = TRUE) # show progress

stm_effects30 <- estimateEffect(1:30 ~ school_metro_type +
                                  gender + s(date_num),
                                stmobj = stm_30, metadata = out$meta)
```

## Saving and restoring models

- Depending on the number of documents and the vocabulary size, fitting STM models can require a lot of memory and computation time
- It can be useful to save model objects as R binaries and reload them as needed:

```{r, eval = FALSE }
save(out, stm_30, stm_effects30, file = "data/stm_donor.RData")
```

```{r}
load("data/stm_donor.RData") # reload data
```

# Model validation and interactively exploring STM models

## Interpreting structural topic models - topic proportions

- `plot.STM()` implements several options for model interpretation
- *summary* plots show proportions and the most likely terms for each topic:

## Model interpretation - topic proportions

```{r fig.height=5, fig.width=9}
plot.STM(stm_30, type = 'summary', text.cex = 0.8)
```

## Model interpretation - probability terms

`labels` plots show terms for each topic, with (again) the most likely terms as the default:

## Model interpretation - probability terms

```{r fig.height=4, fig.width=7}
plot.STM(stm_30, type = 'labels', n = 8,
         text.cex = 0.8, width = 100, topics = 1:5)
```

## Model interpretation - frex terms

One strength of STM is that it also offers other metrics for topic terms. `frex` terms are both frequent and exclusive to a topic.

## Model interpretation - frex terms

```{r fig.height=4, fig.width=7}
plot.STM(stm_30, type = 'labels', n = 8, text.cex = 0.8,
         width = 100, topics = 1:5, labeltype = 'frex')
```

## Model interpretation - don't rely on terms only

- Assigning labels to topics only by looking at the most likely terms is generally not a good idea
- Sometimes these terms contain domain-specific stop words. Sometimes they are hard to make sense of by themselves
- Recommendation:

  - use probability (most likely) terms
  - use frex terms (`labelTopics()` prints both at once, see the sketch on the next slide)
  - **qualitatively examine representative documents**
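## Model interpretation - labelTopics()

As a compact text alternative to the label plots, `labelTopics()` prints probability, FREX, lift and score terms per topic, shown here as a sketch for the first three topics (output omitted):

```{r, eval = FALSE}
labelTopics(stm_30, topics = 1:3, n = 8)
```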
## Model interpretation - representative documents

- STM allows you to find representative (unprocessed) documents for each topic with `findThoughts()`, which can then be plotted with `plotQuote()`:

```{r }
thoughts <- findThoughts(stm_30,
                         texts = out$meta$text, # unprocessed documents
                         topics = 1:3, n = 2) # topics and number of documents
```

## Model interpretation - representative documents

```{r fig.height=5, fig.width=9}
plotQuote(thoughts$docs[[3]][1], # topic 3
          width = 80, text.cex = 0.75)
```

## Model interpretation - perspective plot

- It is also possible to visualize differences in word usage between two topics:

```{r fig.height=5, fig.width=8}
plot.STM(stm_30, type = 'perspectives', topics = c(2,3))
```

## Interactive model validation - stminsights

- You can interactively validate and explore structural topic models using the R package *stminsights*. What you need:

  - one or several stm models and corresponding effect estimates
  - the `out` object used to fit the models, which includes documents, vocabulary and metadata
- The example `stm_donor.RData` includes all required objects

```{r eval=FALSE, message=FALSE, warning=FALSE}
run_stminsights()
```

# Interpreting and visualizing prevalence and content effects

- You already estimated a model with prevalence effects. Now we'll see how to also estimate content effects, and how to visualize both types of effects

- There are several options for interpreting and visualizing effects:

  - using functions of the STM package
  - using the stminsights function `get_effects()`
  - using stminsights' interactive mode

## Prevalence effects (stm package)

## Options for visualizing prevalence effects

- Prevalence covariates affect topic proportions
- They can be visualized in three ways:

  - `pointestimate`: point estimates for categorical variables
  - `difference`: differences between topic proportions for two categories of one variable
  - `continuous`: line plots for continuous variables
- You can also visualize interaction effects if you integrated them in your STM model (see `?plot.estimateEffect()` and the sketch on the next slide)
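## Prevalence effects - interactions (sketch)

A sketch of an interaction plot, assuming an effects object that was estimated with an interaction term, like `stm_effects10` from the appendix (`gender * s(date_num)`). The `moderator` arguments fix one covariate while plotting the effect of the other:

```{r, eval = FALSE}
plot.estimateEffect(stm_effects10, covariate = "date_num",
                    model = stm_10, method = "continuous",
                    topics = 9, moderator = "gender",
                    moderator.value = "Female", printlegend = FALSE)
```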
## Prevalence effects - pointestimate

```{r}
plot.estimateEffect(stm_effects30, topic = 3,
                    covariate = 'school_metro_type', method = 'pointestimate')
```

## Prevalence effects - difference

```{r}
plot.estimateEffect(stm_effects30, covariate = "gender",
                    topics = c(5:10), method = "difference",
                    model = stm_30, # to show labels alongside
                    cov.value1 = "Female", cov.value2 = "Male",
                    xlab = "Male <---> Female", xlim = c(-0.08, 0.08),
                    labeltype = "frex", n = 3,
                    width = 100, verbose.labels = FALSE)
```

## Prevalence effects - continuous

```{r}
plot.estimateEffect(stm_effects30, covariate = "date_num",
                    topics = c(9:10), method = "continuous")
```

## Prevalence effects with stminsights

- You can use `get_effects()` to store prevalence effects in a tidy data frame:

```{r}
gender_effects <- get_effects(estimates = stm_effects30,
                              variable = 'gender',
                              type = 'pointestimate')

date_effects <- get_effects(estimates = stm_effects30,
                            variable = 'date_num',
                            type = 'continuous')
```

- Afterwards, effects can, for instance, be visualized with `ggplot2`

## Prevalence effects with stminsights - categorical

```{r}
gender_effects %>% filter(topic == 3) %>%
  ggplot(aes(x = value, y = proportion)) + geom_point() +
  geom_errorbar(aes(ymin = lower, ymax = upper), width = 0.1) +
  coord_flip() + labs(x = 'Gender', y = 'Topic Proportion')
```

## Prevalence effects with stminsights - continuous (date)

- STM doesn't work well for visualizing continuous date variables
- For visualization purposes, we can convert our numeric date identifier back to its original form:

```{r message=FALSE, warning=FALSE}
date_effects <- date_effects %>%
  mutate(date_num = round(value, 0)) %>%
  left_join(out$meta %>% select(date, date_num)) %>% distinct()
```

## Prevalence effects with stminsights - continuous (date)

```{r fig.height=3, fig.width=7}
date_effects %>% filter(topic %in% c(9,10)) %>%
  ggplot(aes(x = date, y = proportion,
             group = topic, color = topic, fill = topic)) +
  geom_line() + scale_x_date(date_breaks = '3 months', date_labels = "%b/%y") +
  geom_ribbon(aes(ymin = lower, ymax = upper), alpha = 0.2) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
```

## STM content effects

- Content effects allow covariates to affect word distributions **within a topic** (e.g. female teachers talk differently about sports in comparison to male teachers). Example model formula: ``content = ~ gender``
- This feature is powerful but comes with some disadvantages:

  - You can only use one discrete variable for content effects
  - Interpreting the model is more complicated (see `labelTopics()` and `sageLabels()`, and the sketch after the next two slides)
- We will focus on visualizing content effects with `perspective` plots

## Fitting content models

- Content effects can (but do not have to) be combined with prevalence effects. We fit a model with 20 topics and teacher gender as content covariate
- Important note: this is a new model and can show different results, even when you compare it to a model with the same number of topics

```{r, eval = FALSE}
stm_20_content <- stm(documents = out$documents,
                      vocab = out$vocab,
                      data = out$meta,
                      K = 20,
                      prevalence = ~ school_metro_type + gender + s(date_num),
                      content = ~ gender,
                      verbose = FALSE) # suppress progress output
stm_effects20 <- estimateEffect(1:20 ~ school_metro_type +
                                  gender + s(date_num),
                                stmobj = stm_20_content, metadata = out$meta)
save(stm_20_content, stm_effects20, file = "data/stm_donor_content.RData")
```

## Load content model

```{r}
load("data/stm_donor_content.RData") # reload data
```
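## Inspecting content models with sageLabels (sketch)

A sketch of text-based inspection for content models: `sageLabels()` prints marginal topic terms together with the covariate-specific terms of each topic. The output is lengthy, so the chunk is not evaluated here:

```{r, eval = FALSE}
sageLabels(stm_20_content, n = 5)
```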
## Visualizing content effects

```{r fig.height=6, fig.width=8}
plot.STM(stm_20_content, topics = c(2), type = 'perspectives',
         covarlevels = c('Female', 'Male'))
```

# Open coding session

## Open coding session - your turn

- For the open coding session, you can choose to either play around with data and models from the tutorial, or try fitting stm models on your own data
- We will be around to help you out and answer questions

# Appendix - code to play around with

## Appendix - more about comparing models

As for statistical diagnostics, the STM authors recommend inspecting semantic coherence and exclusivity (see the [STM vignette](https://github.com/bstewart/stm/blob/master/inst/doc/stmVignette.pdf?raw=true); a `searchK()` sketch follows on the next slide):

- Semantic coherence is maximized when the most probable words in a given topic frequently co-occur together
- Exclusivity (FREX) is maximized when a topic includes many exclusive terms
- Coherence and exclusivity cannot be compared for models with content effects
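## Appendix - searching over the number of topics (sketch)

A sketch of stm's `searchK()`, which fits several candidate models and collects diagnostics such as held-out likelihood, residuals and semantic coherence. The choice of K values here is arbitrary, and the chunk can run for a long time on this dataset:

```{r, eval = FALSE}
k_search <- searchK(documents = out$documents, vocab = out$vocab,
                    K = c(10, 20, 30),
                    prevalence = ~ school_metro_type + gender + s(date_num),
                    data = out$meta)
plot(k_search) # one panel per diagnostic, by number of topics
```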
## Appendix - fitting another model for comparisons
```{r, eval = FALSE}
stm_10 <- stm(documents = out$documents,
              vocab = out$vocab,
              data = out$meta,
              K = 10,
              prevalence = ~ school_metro_type * s(date_num), # interaction
              verbose = FALSE) # suppress progress output

stm_effects10 <- estimateEffect(1:10 ~ school_metro_type +
                                  gender * s(date_num),
                                stmobj = stm_10, metadata = out$meta)

save(stm_10, stm_effects10, file = "data/stm_donor_int.RData")
```

```{r}
load("data/stm_donor_int.RData") # reload data
```

## Appendix - calculating diagnostics (stminsights)

```{r}
diag <- get_diag(models = list(
  model10 = stm_10, model30 = stm_30), out)
diag %>%
  ggplot(aes(x = coherence, y = exclusivity, color = statistic)) +
  geom_text(aes(label = name), nudge_x = 2) + geom_point() +
  labs(x = 'Semantic Coherence', y = 'Exclusivity')
```

## Appendix - correlation networks (stminsights)

```{r, message=FALSE, warning=FALSE}
library(ggraph)
stm_corrs <- get_network(model = stm_30,
                         method = 'simple', # correlation criterion
                         cutoff = 0.05, # minimum correlation
                         labels = paste('T', 1:30),
                         cutiso = FALSE) # keep isolated nodes
```

## Appendix - correlation networks (stminsights)

```{r fig.height=5, fig.width=9, message=FALSE, warning=FALSE}
ggraph(stm_corrs, layout = 'fr') + geom_edge_link(
  aes(edge_width = weight), label_colour = '#fc8d62',
  edge_colour = '#377eb8') + geom_node_point(size = 4, colour = 'black') +
  geom_node_label(aes(label = name, size = props),
                  colour = 'black', repel = TRUE, alpha = 0.85) +
  scale_size(range = c(2, 10), labels = scales::percent) +
  labs(size = 'Topic Proportion', edge_width = 'Topic Correlation') + theme_graph()
```

--------------------------------------------------------------------------------
/stm_tutorial_slides.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cbpuschmann/stm_ic2s2/2d361b535a7b60fe3c8c22619fb283cbfc384400/stm_tutorial_slides.pdf
--------------------------------------------------------------------------------