├── .gitignore
├── README.md
├── data
│   ├── academic_journals_internet_research.stm30.RData
│   ├── donors_choose_sample.csv
│   ├── stm_donor.RData
│   ├── stm_donor_content.RData
│   └── stm_donor_int.RData
├── stm_ic2s2.Rproj
├── stm_tutorial_code.Rmd
├── stm_tutorial_code.html
└── stm_tutorial_slides.pdf
/.gitignore:
--------------------------------------------------------------------------------
1 | .Rproj.user
2 | .Rhistory
3 | .RData
4 | .Ruserdata
5 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # stm_ic2s2
2 | This page accompanies the tutorial “Structural topic models for enriching quantitative text analysis” at the International Conference on Computational Social Science (IC2S2) in Amsterdam, NL, on July 17, 2019, organized by Carsten Schwemmer and Cornelius Puschmann.
3 |
4 |
5 | For the tutorial, please make sure that you have access to a recent version of [R](https://cran.r-project.org/) and preferably [RStudio](https://www.rstudio.com/products/rstudio/download/) on your computer. To save time, you can also install the R packages we need beforehand:
6 |
7 | ```r
8 | install.packages(c('tidyverse', 'stm', 'stminsights', 'quanteda', 'ggraph', 'rmarkdown'), dependencies = TRUE)
9 | ```
10 |
11 | For more information on the stm package, head over to http://www.structuraltopicmodel.com/.
12 |
13 |
--------------------------------------------------------------------------------
/data/academic_journals_internet_research.stm30.RData:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cbpuschmann/stm_ic2s2/2d361b535a7b60fe3c8c22619fb283cbfc384400/data/academic_journals_internet_research.stm30.RData
--------------------------------------------------------------------------------
/data/stm_donor.RData:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cbpuschmann/stm_ic2s2/2d361b535a7b60fe3c8c22619fb283cbfc384400/data/stm_donor.RData
--------------------------------------------------------------------------------
/data/stm_donor_content.RData:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cbpuschmann/stm_ic2s2/2d361b535a7b60fe3c8c22619fb283cbfc384400/data/stm_donor_content.RData
--------------------------------------------------------------------------------
/data/stm_donor_int.RData:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cbpuschmann/stm_ic2s2/2d361b535a7b60fe3c8c22619fb283cbfc384400/data/stm_donor_int.RData
--------------------------------------------------------------------------------
/stm_ic2s2.Rproj:
--------------------------------------------------------------------------------
1 | Version: 1.0
2 |
3 | RestoreWorkspace: No
4 | SaveWorkspace: No
5 | AlwaysSaveHistory: No
6 |
7 | EnableCodeIndexing: Yes
8 | UseSpacesForTab: Yes
9 | NumSpacesForTab: 2
10 | Encoding: UTF-8
11 |
12 | RnwWeave: Sweave
13 | LaTeX: pdfLaTeX
14 |
--------------------------------------------------------------------------------
/stm_tutorial_code.Rmd:
--------------------------------------------------------------------------------
1 | ---
2 | title: "Structural topic models for enriching quantitative text analysis
material available at https://github.com/cbpuschmann/stm_ic2s2"
3 | author: "Cornelius Puschmann & Carsten Schwemmer"
4 | output:
5 | revealjs::revealjs_presentation:
6 | mathjax: null
7 | center: yes
8 | fig_height: 3.5
9 | fig_width: 7
10 | reveal_options:
11 | previewLinks: yes
12 | slideNumber: yes
13 | theme: default
14 | transition: fade
15 | html_document:
16 | highlight: tango
17 | code_folding: show
18 | date: "July 17, 2019"
19 | ---
20 |
21 | *A huge thanks to Brandon Stewart for maintaining the STM package and for providing some of the content which we use in this tutorial.*
22 |
23 | # Setup
24 |
25 | ## Prepare your R environment
26 |
27 | - Download and extract the tutorial content from our [github repository](https://github.com/cbpuschmann/stm_ic2s2)
28 | - Open the R project file `stm_ic2s2.Rproj` in RStudio
29 |
30 | ## How R Markdown files work
31 |
32 | - Code is placed inside code chunks; documentation is placed outside of them.
33 | - Create a new code chunk with CTRL/CMD + ALT + I
34 | - Use CTRL/CMD + SHIFT + ENTER to run an entire code chunk
35 | - Use CTRL/CMD + ENTER to run the selected code
36 |
37 | ```{r}
38 | # this is a comment inside a code chunk
39 | 2+3
40 | 5+2
41 | ```
42 |
43 | ## Install packages
44 |
45 | - Please install the packages that we will need for the tutorial:
46 |
47 | ```{r, eval = FALSE}
48 | install.packages(c('tidyverse', 'stm', 'stminsights',
49 | 'quanteda', 'ggraph', 'rmarkdown'),
50 | dependencies = TRUE)
51 | ```
52 |
53 | ## Loading data and packages
54 |
55 | ```{r , message=FALSE, warning=FALSE, results = 'hide'}
56 | library(tidyverse)
57 | library(stm)
58 | library(stminsights)
59 | library(quanteda)
60 | library(lubridate)
61 | theme_set(theme_light())
62 | df <- read_csv('data/donors_choose_sample.csv')
63 | ```
64 |
65 | ## The data we use for the workshop
66 |
67 | - We will be using a sample from a [Kaggle](https://www.kaggle.com/c/donorschoose-application-screening) Data Science for Good challenge
68 | - [DonorsChoose.org](https://DonorsChoose.org) provided the data and hosts an online platform where teachers can post requests for resources and people can make donations to these projects
69 | - The goal of the original challenge was to match previous donors with campaigns that would most likely inspire additional donations
70 | - The dataset includes texts and context information which might help answer various questions in social science. A description of variables is available [here](https://www.kaggle.com/donorschoose/io/discussion/56030)
71 |
72 | ## What could we learn from this data?
73 |
74 | Examples of questions we might ask:
75 |
76 | - How has classroom technology use changed over time? How does it differ by geographic location and the age of students?
77 | - How do the requests of schools in urban areas compare to those in rural areas?
78 | - What predicts whether a project will be funded?
79 | - How do the predictors of funding vary by geographic location? Or by economic status of the students?
80 | - Do male and female teachers ask for different resources? Are there differences in the way that they ask for those resources?
81 |
82 | # Formal background of topic models
83 |
84 | ## Formal background of topic models
85 |
86 | See our slides
87 |
88 | # Preprocessing and feature selection
89 |
90 | ## Preprocessing and feature selection
91 |
92 | - Due to time constraints, we will not explain the following code for preprocessing and feature selection in detail.
93 | - You can explore it during the open coding session or after the tutorial.
94 | - Use `load("data/stm_donor.RData")` to load the pre-processed R objects that we need for the tutorial.
95 |
96 | ## Inspecting the data structure
97 |
98 | ```{r}
99 | glimpse(df)
100 | ```
101 |
102 | ## Text example for one donation request
103 |
104 | ```{r}
105 | cat(df$project_essay[1])
106 | ```
107 |
108 | ## Preparing texts
109 |
110 | We use a [regular expression](https://en.wikipedia.org/wiki/Regular_expression) to clean up the donation texts:
111 |
112 | ```{r}
113 | df$project_essay <- str_replace_all(df$project_essay,
114 |                    '<!--DONOTREMOVEESSAYDIVIDER-->', '\n\n') # divider marker in the raw essays
115 | ```
116 |
117 | As we will incorporate contextual information of documents in our STM models, we also need to preprocess other variables.
118 |
119 | ## Working with time stamps
120 |
121 | First, we convert the time strings to a date format and then create a numerical variable, where the earliest date corresponds to 0. We do so because estimating STM effects doesn't play nicely with `date` variables.
122 |
123 | ```{r}
124 | # CTRL/CMD + SHIFT + M for the pipe operator
125 | df$date <- ymd(df$project_posted_date)
126 | min_date <- min(df$date) %>%
127 | as.numeric()
128 | df$date_num <- as.numeric(df$date) - min_date
129 | date_table <- df %>% arrange(date_num) %>% select(date, date_num)
130 | head(date_table, 2)
131 | ```
132 |
133 |
134 | ## Example for recoding variables
135 |
136 | We can generate a proxy for the gender of teachers by working with their name prefixes.
137 |
138 | ```{r}
139 | df %>% count(teacher_prefix)
140 | ```
141 |
142 | ## Example for recoding variables
143 |
144 | ```{r}
145 | df <- df %>% mutate(gender = case_when(
146 | teacher_prefix %in% c('Mrs.', 'Ms.') ~ 'Female',
147 | teacher_prefix == 'Mr.' ~ 'Male',
148 | TRUE ~ 'Other/Non-binary')) # TRUE -> everything else
149 | df %>% count(gender)
150 | ```
151 |
152 | ## Other interesting variables: metro type
153 |
154 | ```{r}
155 | df %>% count(school_metro_type)
156 | ```
157 |
158 | ## Other interesting variables: resource type
159 |
160 | ```{r}
161 | df %>% count(project_resource_category, sort = TRUE)
162 | ```
163 |
164 | ## Other interesting variables: children eligible for free lunch
165 |
166 | ```{r}
167 | df %>% ggplot(aes(x = school_percentage_free_lunch)) +
168 | geom_histogram(bins = 20)
169 | ```
170 |
171 | ## Text analysis using quanteda
172 |
174 |
175 | - A variety of R packages support quantitative text analysis. We will focus on [quanteda](https://quanteda.io/), which is created and maintained by the social scientists behind the Quanteda Initiative
176 | - Besides offering a huge number of methods for preprocessing and analysis, it also includes a function to prepare our textual data for structural topic modeling
177 |
178 | ## Quanteda corpus object
179 |
180 | - Using `corpus()`, you can create a quanteda corpus from a character vector or a data frame, which automatically includes metadata as document variables
181 |
182 |
183 | ```{r}
184 | donor_corp <- corpus(df, text_field = 'project_essay',
185 | docid_field = 'project_id')
186 | docvars(donor_corp)$text <- df$project_essay # we need unprocessed texts later
187 | ndoc(donor_corp) # no. of documents
188 | ```
189 |
190 | ## KWIC
191 |
192 | - Before tokenization, corpus objects can be used to discover [keywords in context](https://en.wikipedia.org/wiki/Key_Word_in_Context) (KWIC):
193 |
194 | ```{r}
195 | kwic_donor <- kwic(donor_corp, pattern = c("ipad"),
196 | window = 5) # context window
197 | head(kwic_donor, 3)
198 | ```
199 |
200 |
201 | ## Tokenization
202 |
203 | - Tokens can be created from a corpus or character vector. The documentation (`?tokens()`) illustrates several options, e.g. for the removal of punctuation
204 |
205 | ```{r}
206 | donor_tokens <- tokens(donor_corp)
207 | donor_tokens[[1]][1:20]
208 | ```
209 |
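    | As a quick illustration of one such option (not run here; single-character punctuation tokens are in any case dropped later via `dfm_keep()`):
    | 
    | ```{r, eval = FALSE}
    | tokens(donor_corp, remove_punct = TRUE)[[1]][1:20] # same document, punctuation removed
    | ```
    | 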
210 | ## Basic form of tokens
211 |
212 | - After tokenization, some terms with similar semantic meaning might be regarded as different features (e.g. `love`, `loving`)
213 | - One solution is the application of [stemming](https://en.wikipedia.org/wiki/Stemming), which tries to reduce words to their basic form:
214 |
215 | ```{r}
216 | words <- c("love", "loving", "lovingly", "loved", "lover", "lovely")
217 | char_wordstem(words, 'english')
218 | ```
219 |
220 | ## To stem or not to stem?
221 |
222 | - In the context of topic modeling, a [recent study](https://www.transacl.org/ojs/index.php/tacl/article/view/868/196) suggests that stemmers produce no meaningful improvement (for the English language)
223 | - Ultimately, whether stemming generates useful features or not varies by use case
224 | - An alternative that we won't cover in this tutorial is [lemmatization](https://en.wikipedia.org/wiki/Lemmatisation), available via packages like [spacyr](https://cran.r-project.org/web/packages/spacyr/index.html) and [udpipe](https://cran.r-project.org/web/packages/udpipe/index.html); a minimal sketch follows below
225 |
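    | For the curious, a minimal udpipe sketch (not part of the tutorial; the English model file is downloaded on first use):
    | 
    | ```{r, eval = FALSE}
    | library(udpipe)
    | ud_model <- udpipe_download_model(language = "english") # one-time download
    | ud_en <- udpipe_load_model(ud_model$file_model)
    | anno <- udpipe_annotate(ud_en, x = c("love", "loving", "loved"))
    | as.data.frame(anno)$lemma # lemmas instead of stems
    | ```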
226 |
227 | ## More preprocessing
228 |
229 | - Multiple preprocessing steps can be chained via the pipe operator, e.g. normalizing to lowercase and removing common English stopwords:
230 |
231 | ```{r}
232 | donor_tokens <- donor_tokens %>%
233 | tokens_tolower() %>%
234 | tokens_remove(stopwords('english'), padding = TRUE)
235 |
236 | donor_tokens[[1]][1:10]
237 | ```
238 |
239 | ## Detecting collocations
240 |
241 | - Collocations (phrases) are sequences of tokens which carry a shared semantic meaning, e.g. `United States`
242 | - Quanteda can detect collocations with log-linear models. An important parameter is the minimum collocation frequency, which can be used to fine-tune results
243 |
244 | ```{r}
245 | colls <- textstat_collocations(donor_tokens,
246 | min_count = 200) # minimum frequency
247 | donor_tokens <- tokens_compound(donor_tokens, colls, join = FALSE) %>%
248 | tokens_remove('') # remove empty strings
249 |
250 | donor_tokens[[1]][1:5]
251 | ```
252 |
253 | ## Document-Feature Matrix (DFM)
254 |
255 | - Most models for automated text analysis require matrices as input format
256 | - A common variant which directly translates to the bag-of-words format is the [document term matrix](https://en.wikipedia.org/wiki/Document-term_matrix) (in quanteda: document-feature matrix):
257 |
258 | doc_id I like hate currywurst
259 | -------- --- ------ ------ -------------
260 | 1 1 1 0 1
261 | 2 1 0 1 1
262 |
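    | To make this concrete, a two-document toy corpus (our own invented sentences, chosen to reproduce the matrix above) can be converted directly:
    | 
    | ```{r, eval = FALSE}
    | toy <- c("I like currywurst", "I hate currywurst")
    | dfm(tokens(toy)) # yields the matrix above (lowercased by default)
    | ```
    | 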
263 | ## Creating a Document-Feature Matrix (dfm)
264 |
265 | - Problem: textual data is high-dimensional -> DFMs potentially grow to millions of rows & columns -> matrices for large text corpora don't fit in memory
266 | - Features are not evenly distributed (see e.g. [Zipf's law](https://en.wikipedia.org/wiki/Zipf%27s_law)), so most of these cells contain zeroes
267 | - Solution: a sparse data format, which does not store zero counts. Quanteda natively implements DFMs as [sparse matrices](https://en.wikipedia.org/wiki/Sparse_matrix)
268 |
269 | ## DFMs in quanteda
270 | 
271 | - Quanteda can create DFMs from character vectors, corpora and token objects
272 | - Preprocessing that does not need to account for word order can also be done during or after the creation of DFMs (see documentation for `dfm()`)
273 |
274 | ```{r}
275 | dfm_donor <- dfm(donor_tokens, remove_numbers = TRUE)
276 | dim(dfm_donor)
277 | ```
278 |
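    | To see how much the sparse format matters here, quanteda's `sparsity()` reports the share of zero-valued cells:
    | 
    | ```{r}
    | sparsity(dfm_donor) # proportion of cells that are zero
    | ```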
279 |
280 | ## More preprocessing - feature trimming
281 |
282 | - As an alternative (or complement) to manually defining stopwords, terms occurring in a large proportion of documents can be removed automatically. Rationale: if almost every document includes a term, it is not a useful feature for categorization
283 | - Very rare terms are often removed, as they are also not very helpful for categorization and can lead to overfitting
284 |
285 | ```{r}
286 | dfm_donor <- dfm_donor %>%
287 |   dfm_keep(min_nchar = 2) %>% # remove features with only one character
288 |   dfm_trim(min_docfreq = 0.002, max_docfreq = 0.50, # 0.2% min, 50% max
289 | docfreq_type = 'prop') # proportions instead of counts
290 | dim(dfm_donor)
291 | ```
292 |
293 |
294 | ## Prepare textual data for STM
295 |
296 | - You can provide input data for the stm package in several ways:
297 |
298 | - via STM's own functions for text pre-processing
299 | - via directly passing quanteda DFMs
300 | - via quanteda's `convert()` function to prepare DFMs (the recommended option)
301 |
302 | ```{r}
303 | out <- convert(dfm_donor, to = 'stm')
304 | names(out)
305 | ```
306 |
307 | # Introducing structural topic models and tuning parameters
308 |
309 | ## Introducing structural topic models and tuning parameters
310 |
311 | See our slides
312 |
313 | ## STM - model fitting
314 |
315 | - For our first model, we will choose 30 topics and include school metro type, teacher gender and a flexible [spline](https://en.wikipedia.org/wiki/Spline_(mathematics)) for date as prevalence covariates:
316 |
317 | ```{r, eval = FALSE}
318 | stm_30 <- stm(documents = out$documents,
319 | vocab = out$vocab,
320 | data = out$meta,
321 | K = 30,
322 | prevalence = ~ school_metro_type + gender + s(date_num),
323 | verbose = TRUE) # show progress
324 |
325 | stm_effects30 <- estimateEffect(1:30 ~ school_metro_type +
326 | gender + s(date_num),
327 | stmobj = stm_30, metadata = out$meta)
328 | ```
329 |
330 | ## Saving and restoring models
331 |
332 | - Depending on the number of documents and the vocabulary size, fitting STM models can require a lot of memory and computation time
333 | - It can be useful to save model objects as R binaries and reload them as needed:
334 |
335 | ```{r, eval = FALSE }
336 | save(out, stm_30, stm_effects30, file = "data/stm_donor.RData")
337 | ```
338 |
339 | ```{r}
340 | load("data/stm_donor.RData") # reload data
341 | ```
342 |
343 | # Model validation and interactively exploring STM models
344 |
345 | ## Interpreting structural topic models - topic proportions
346 |
347 | - `plot.STM()` implements several options for model interpretation.
348 | - *summary* plots show proportions and the most likely terms for each topic:
349 |
350 |
351 | ## Model interpretation - topic proportions
352 |
353 | ```{r fig.height=5, fig.width=9}
354 | plot.STM(stm_30, type = 'summary', text.cex = 0.8)
355 | ```
356 |
357 |
358 | ## Model interpretation - probability terms
359 |
360 | `labels` plots show terms for each topic, with (again) the most likely terms as the default:
361 |
362 |
363 | ## Model interpretation - probability terms
364 |
365 | ```{r fig.height=4, fig.width=7}
366 | plot.STM(stm_30, type = 'labels', n = 8,
367 | text.cex = 0.8, width = 100, topics = 1:5)
368 | ```
369 |
370 |
371 | ## Model interpretation - frex terms
372 |
373 | One strength of STM is that it also offers other metrics for topic terms. `frex` terms are both frequent and exclusive to a topic.
374 |
375 |
376 | ## Model interpretation - frex terms
377 |
378 | ```{r fig.height=4, fig.width=7}
379 | plot.STM(stm_30, type = 'labels', n = 8, text.cex = 0.8,
380 | width = 100, topics = 1:5, labeltype = 'frex')
381 | ```
382 |
383 |
384 | ## Model interpretation - don't rely on terms only
385 |
386 | - Assigning labels for topics only by looking at the most likely terms is generally not a good idea
387 | - Sometimes these terms contain domain-specific stop words. Sometimes they are hard to make sense of by themselves
388 | - Recommendation:
389 |
390 | - use probability (most likely) terms
391 | - use frex terms
392 | - **qualitatively examine representative documents**
393 |
394 |
395 | ## Model interpretation - representative documents
396 |
397 | - STM allows you to find representative (unprocessed) documents for each topic with `findThoughts()`, which can then be plotted with `plotQuote()`:
398 |
399 | ```{r }
400 | thoughts <- findThoughts(stm_30,
401 | texts = out$meta$text, # unprocessed documents
402 | topics = 1:3, n = 2) # topics and number of documents
403 | ```
404 |
405 |
406 | ## Model interpretation - representative documents
407 |
408 | ```{r fig.height=5, fig.width=9}
409 | plotQuote(thoughts$docs[[3]][1], # topic 3
410 | width = 80, text.cex = 0.75)
411 | ```
412 |
413 | ## Model interpretation - perspective plot
414 |
415 | - It is also possible to visualize differences in word usage between two topics:
416 |
417 | ```{r fig.height=5, fig.width=8}
418 | plot.STM(stm_30, type = 'perspectives', topics = c(2,3))
419 | ```
420 |
421 | ## Interactive model validation - stminsights
422 |
423 | - You can interactively validate and explore structural topic models using the R package *stminsights*. What you need:
424 |
425 | - one or several stm models and corresponding effect estimates
426 | - the `out` object used to fit the models, which includes documents, vocabulary and metadata
427 | - The example `stm_donor.RData` includes all required objects
428 |
429 |
430 | ```{r eval=FALSE, message=FALSE, warning=FALSE}
431 | run_stminsights()
432 | ```
433 |
434 | # Interpreting and visualizing prevalence and content effects
435 |
436 | - You already estimated a model with prevalence effects. Now we'll see how to also estimate content effects and how to visualize prevalence and content effects
437 |
438 | - There are several options for interpreting and visualizing effects:
439 |
440 | - using functions of the STM package
441 | - using the stminsights function `get_effects()`
442 | - using the stminsights interactive mode
443 |
444 | ## Prevalence effects (stm package)
445 |
446 | ## Options for visualizing prevalence effects
447 |
448 | - Prevalence covariates affect topic proportions
449 | - They can be visualized in three ways:
450 |
451 | - `pointestimate`: point estimates for categorical variables
452 | - `difference`: differences between topic proportions for two categories of one variable
453 | - `continuous`: line plots for continuous variables
454 | - You can also visualize interaction effects if you integrated them in your STM model (see `?plot.estimateEffect()`; a sketch follows below)
455 |
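    | A minimal sketch of such an interaction plot (not run here; it uses the appendix objects `stm_10` and `stm_effects10`, whose effect formula contains `gender * s(date_num)`):
    | 
    | ```{r, eval = FALSE}
    | plot.estimateEffect(stm_effects10, covariate = 'date_num', topics = 1,
    |                     method = 'continuous',
    |                     moderator = 'gender', # the interacting covariate
    |                     moderator.value = 'Female') # level to condition on
    | ```
    | 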
456 | ## Prevalence effects - pointestimate
457 |
458 | ```{r}
459 | plot.estimateEffect(stm_effects30, topic = 3,
460 | covariate = 'school_metro_type', method = 'pointestimate')
461 | ```
462 |
463 |
464 | ## Prevalence effects - difference
465 |
466 | ```{r}
467 | plot.estimateEffect(stm_effects30, covariate = "gender",
468 | topics = c(5:10), method = "difference",
469 | model = stm_30, # to show labels alongside
470 | cov.value1 = "Female", cov.value2 = "Male",
471 | xlab = "Male <---> Female", xlim = c(-0.08, 0.08),
472 | labeltype = "frex", n = 3,
473 | width = 100, verbose.labels = FALSE)
474 | ```
475 |
476 | ## Prevalence effects - continuous
477 |
478 | ```{r}
479 | plot.estimateEffect(stm_effects30, covariate = "date_num",
480 | topics = c(9:10), method = "continuous")
481 | ```
482 |
483 | ## Prevalence effects with stminsights
484 |
485 | - You can use `get_effects()` to store prevalence effects in a tidy data frame:
486 |
487 | ```{r}
488 | gender_effects <- get_effects(estimates = stm_effects30,
489 | variable = 'gender',
490 | type = 'pointestimate')
491 |
492 | date_effects <- get_effects(estimates = stm_effects30,
493 | variable = 'date_num',
494 | type = 'continuous')
495 | ```
496 |
497 | - Afterwards, effects can, for instance, be visualized with `ggplot2`
498 |
499 | ## Prevalence effects with stminsights - categorical
500 |
501 | ```{r}
502 | gender_effects %>% filter(topic == 3) %>%
503 | ggplot(aes(x = value, y = proportion)) + geom_point() +
504 | geom_errorbar(aes(ymin = lower, ymax = upper), width = 0.1) +
505 | coord_flip() + labs(x = 'Gender', y = 'Topic Proportion')
506 | ```
507 |
508 | ## Prevalence effects with stminsights - continuous (date)
509 |
510 | - STM doesn't work well with visualizing continuous date variables.
511 | - For visualization purposes, we can convert our numeric date identifier back to its original form:
512 |
513 | ```{r message=FALSE, warning=FALSE}
514 | date_effects <- date_effects %>%
515 | mutate(date_num = round(value, 0)) %>%
516 | left_join(out$meta %>% select(date, date_num)) %>% distinct()
517 | ```
518 |
519 | ## Prevalence effects with stminsights - continuous (date)
520 |
521 | ```{r fig.height=3, fig.width=7}
522 | date_effects %>% filter(topic %in% c(9,10)) %>%
523 | ggplot(aes(x = date, y = proportion,
524 | group = topic, color = topic, fill = topic)) +
525 | geom_line() + scale_x_date(date_breaks = '3 months', date_labels = "%b/%y") +
526 | geom_ribbon(aes(ymin = lower, ymax = upper), alpha = 0.2) +
527 | theme(axis.text.x = element_text(angle = 45, hjust = 1))
528 | ```
529 |
530 | ## STM content effects
531 |
532 | - Content effects allow covariates to affect word distributions **within a topic** (e.g. female teachers talk differently about sports in comparison to male teachers). Example model formula: ``content = ~ gender``
533 | - This feature is powerful but comes with some disadvantages:
534 |
535 | - You can only use one discrete variable for content effects
536 | - Interpreting the model is more complicated (see `labelTopics()` and `sageLabels()`; a short sketch follows below)
537 | - We will focus on visualizing content effects with `perspective` plots
538 |
539 | ## Fitting content models
540 |
541 | - Content effects can (but do not have to) be combined with prevalence effects. We fit a model with 20 topics and teacher gender as the content covariate
542 | - Important note: this is a new model and can show different results, even if you compare it to a model with the same number of topics
543 |
544 | ```{r, eval = FALSE}
545 | stm_20_content <- stm(documents = out$documents,
546 | vocab = out$vocab,
547 | data = out$meta,
548 | K = 20,
549 | prevalence = ~ school_metro_type + gender + s(date_num),
550 | content = ~ gender,
551 | verbose = FALSE) # suppress progress output
552 | stm_effects20 <- estimateEffect(1:20 ~ school_metro_type +
553 | gender + s(date_num),
554 | stmobj = stm_20_content, metadata = out$meta)
555 | save(stm_20_content, stm_effects20, file = "data/stm_donor_content.RData")
556 | ```
557 |
558 | ## Load content model
559 |
560 | ```{r}
561 | load("data/stm_donor_content.RData") # reload data
562 | ```
563 |
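    | With the content model loaded, the covariate-specific topic terms mentioned earlier can be inspected; a minimal sketch (the output is lengthy, so it is not run here):
    | 
    | ```{r, eval = FALSE}
    | sageLabels(stm_20_content, n = 5) # topic terms, split by covariate level
    | ```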
564 |
565 | ## Visualizing content effects
566 |
567 | ```{r fig.height=6, fig.width=8}
568 | plot.STM(stm_20_content, topics = c(2), type = 'perspectives',
569 | covarlevels = c('Female', 'Male'))
570 | ```
571 |
572 |
573 | # Open coding session
574 |
575 | ## Open coding session - your turn
576 |
577 | - For the open coding session, you can choose to either play around with data and models from the tutorial, or try fitting stm models on your own data
578 | - We will be around to help you out and answer questions
579 |
580 | # Appendix - code to play around with
581 |
582 | ## Appendix - more about comparing models
583 |
584 | As for statistical diagnostics, the STM authors recommend inspecting semantic coherence and exclusivity (see the [STM vignette](https://github.com/bstewart/stm/blob/master/inst/doc/stmVignette.pdf?raw=true) and the sketch below):
585 | 
586 | - Semantic coherence is maximized when the most probable words in a given topic frequently co-occur together
587 | - Exclusivity (FREX) is maximized when a topic includes many exclusive terms
588 | - Coherence and exclusivity cannot be compared for models with content effects
589 |
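    | Both diagnostics can also be computed directly with stm functions; a minimal sketch for the 30-topic model (the column binding is our own addition):
    | 
    | ```{r, eval = FALSE}
    | coh <- semanticCoherence(stm_30, documents = out$documents)
    | exc <- exclusivity(stm_30)
    | head(cbind(coherence = coh, exclusivity = exc)) # one row per topic
    | ```
    | 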
590 | ## Appendix - fitting another model for comparisons
591 |
592 | ```{r, eval = FALSE}
593 | stm_10 <- stm(documents = out$documents,
594 | vocab = out$vocab,
595 | data = out$meta,
596 | K = 10,
597 | prevalence = ~ school_metro_type * s(date_num), # interaction
598 | verbose = FALSE) # suppress progress output
599 |
600 | stm_effects10 <- estimateEffect(1:10 ~ school_metro_type +
601 | gender * s(date_num),
602 | stmobj = stm_10, metadata = out$meta)
603 |
604 | save(stm_10, stm_effects10, file = "data/stm_donor_int.RData")
605 | ```
606 |
607 | ```{r}
608 | load("data/stm_donor_int.RData") # reload data
609 | ```
610 |
611 | ## Appendix - calculating diagnostics (stminsights)
612 |
613 | ```{r}
614 | diag <- get_diag(models = list(
615 | model10 = stm_10, model30 = stm_30), out)
616 | diag %>%
617 | ggplot(aes(x = coherence, y = exclusivity, color = statistic)) +
618 | geom_text(aes(label = name), nudge_x = 2) + geom_point() +
619 | labs(x = 'Semantic Coherence', y = 'Exclusivity')
620 | ```
621 |
622 | ## Appendix - correlation networks (stminsights)
623 |
624 | ```{r, message=FALSE, warning=FALSE}
625 | library(ggraph)
626 | stm_corrs <- get_network(model = stm_30,
627 | method = 'simple', # correlation criterion
628 | cutoff = 0.05, # minimum correlation
629 | labels = paste('T', 1:30),
630 | cutiso = FALSE) # keep isolated nodes
631 | ```
632 |
633 | ## Appendix - correlation networks (stminsights)
634 |
635 | ```{r fig.height=5, fig.width=9, message=FALSE, warning=FALSE}
636 | ggraph(stm_corrs, layout = 'fr') + geom_edge_link(
637 | aes(edge_width = weight), label_colour = '#fc8d62',
638 | edge_colour = '#377eb8') + geom_node_point(size = 4, colour = 'black') +
639 | geom_node_label(aes(label = name, size = props),
640 | colour = 'black', repel = TRUE, alpha = 0.85) +
641 | scale_size(range = c(2, 10), labels = scales::percent) +
642 | labs(size = 'Topic Proportion', edge_width = 'Topic Correlation') + theme_graph()
643 | ```
644 |
645 |
646 |
647 |
--------------------------------------------------------------------------------
/stm_tutorial_slides.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cbpuschmann/stm_ic2s2/2d361b535a7b60fe3c8c22619fb283cbfc384400/stm_tutorial_slides.pdf
--------------------------------------------------------------------------------