├── .gitignore
├── README.md
├── README.qmd
├── media
    ├── RStudio-Shortcut-1.png
    ├── RStudio-Shortcut-2.png
    ├── install-os.png
    ├── new_project.png
    ├── reticulate.jpg
    ├── spacy-install.png
    ├── wizard-2.png
    ├── wizard-3.png
    └── wizard.png
├── python-in-r.html
├── python-in-r.qmd
└── reticulate_workshop.Rproj


/.gitignore:
--------------------------------------------------------------------------------
 1 | .Rproj.user
 2 | .Rhistory
 3 | .RData
 4 | .Ruserdata
 5 | flyer
 6 | .Renviron
 7 | cache_dir
 8 | outputs
 9 | python-env
10 | python-in-r_cache
11 | python-in-r_files
12 | 


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
 1 | README
 2 | ================
 3 | 
 4 | # WORKSHOPS FOR UKRAINE: Python for R users
 5 | 
 6 | Material for the workshop *Python for R users* on Thursday, February
 7 | 16th, 18:00 - 20:00 CET (Rome, Berlin, Paris timezone). More info on the
 8 | course and how to sign up/get the recording at
 9 | https://sites.google.com/view/dariia-mykhailyshyna/main/r-workshops-for-ukraine#h.36dsv5tl42am
10 | 
11 | # Download the project
12 | 
13 | In RStudio go to “Create a project” (top left corner with this symbol
14 | ![](media/new_project.png)). Then select “Version Control”:
15 | 
16 | ![](media/wizard.png)
17 | 
18 | In the next window, select “Git”:
19 | 
20 | ![](media/wizard-2.png)
21 | 
22 | Then copy the URL `https://github.com/JBGruber/python_for_r_users` into
23 | the URL field and select where to download the project to.
24 | 
25 | ![](media/wizard-3.png)
26 | 
27 | After clicking “Create Project”, a new session should open. Navigate to
28 | the file “python-in-r.qmd” and open it. That’s it!
29 | 
30 | # Install dependencies
31 | 
 32 | The short code below checks the main python-in-r.qmd file for the R
 33 | packages it mentions and installs the missing ones on your computer:
34 | 
35 | ``` r
36 | if (!requireNamespace("rlang", quietly = TRUE)) install.packages("rlang", dependencies = TRUE)
37 | if (!rlang::is_installed("quanteda.corpora")) remotes::install_github("quanteda/quanteda.corpora")
38 | rlang::check_installed("attachment")
39 | rlang::check_installed(attachment::att_from_rmds("python-in-r.qmd"))
40 | ```
41 | 
42 | If there is no output, you are good to go.
43 | 
44 | We will install the Python packages during the workshop, but if you want
45 | to get a head start, you can follow the “Getting started” section in the
46 | “python-in-r.qmd” file.
47 | 


--------------------------------------------------------------------------------
/README.qmd:
--------------------------------------------------------------------------------
 1 | ---
 2 | title: "README"
 3 | format: gfm
 4 | ---
 5 | 
 6 | # WORKSHOPS FOR UKRAINE: Python for R users
 7 | 
 8 | Material for the workshop *Python for R users* on Thursday, February 16th, 18:00 - 20:00 CET (Rome, Berlin, Paris timezone).
 9 | More info on the course and how to sign up/get the recording at https://sites.google.com/view/dariia-mykhailyshyna/main/r-workshops-for-ukraine#h.36dsv5tl42am
10 | 
11 | # Download the project
12 | 
13 | In RStudio go to "Create a project" (top left corner with this symbol ![](media/new_project.png)).
14 | Then select "Version Control":
15 | 
16 | ![](media/wizard.png)
17 | 
18 | In the next window, select "Git":
19 | 
20 | ![](media/wizard-2.png)
21 | 
22 | Then copy the URL `https://github.com/JBGruber/python_for_r_users` into the URL field and select where to download the project to.
23 | 
24 | ![](media/wizard-3.png)
25 | 
26 | After clicking "Create Project", a new session should open.
27 | Navigate to the file "python-in-r.qmd" and open it.
28 | That's it!
29 | 
30 | # Install dependencies
31 | 
 32 | The short code below checks the main python-in-r.qmd file for the R packages it mentions and installs the missing ones on your computer:
33 | 
34 | ```{r}
35 | if (!requireNamespace("rlang", quietly = TRUE)) install.packages("rlang", dependencies = TRUE)
36 | if (!rlang::is_installed("quanteda.corpora")) remotes::install_github("quanteda/quanteda.corpora")
37 | rlang::check_installed("attachment")
38 | rlang::check_installed(attachment::att_from_rmds("python-in-r.qmd"))
39 | ```
40 | 
41 | If there is no output, you are good to go.
42 | 
43 | We will install the Python packages during the workshop, but if you want to get a head start, you can follow the "Getting started" section in the "python-in-r.qmd" file. 
44 | 
45 | 


--------------------------------------------------------------------------------
/media/RStudio-Shortcut-1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/JBGruber/python_for_r_users/ef3702206290de8a78cd3e35821c4f113b22fff4/media/RStudio-Shortcut-1.png


--------------------------------------------------------------------------------
/media/RStudio-Shortcut-2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/JBGruber/python_for_r_users/ef3702206290de8a78cd3e35821c4f113b22fff4/media/RStudio-Shortcut-2.png


--------------------------------------------------------------------------------
/media/install-os.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/JBGruber/python_for_r_users/ef3702206290de8a78cd3e35821c4f113b22fff4/media/install-os.png


--------------------------------------------------------------------------------
/media/new_project.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/JBGruber/python_for_r_users/ef3702206290de8a78cd3e35821c4f113b22fff4/media/new_project.png


--------------------------------------------------------------------------------
/media/reticulate.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/JBGruber/python_for_r_users/ef3702206290de8a78cd3e35821c4f113b22fff4/media/reticulate.jpg


--------------------------------------------------------------------------------
/media/spacy-install.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/JBGruber/python_for_r_users/ef3702206290de8a78cd3e35821c4f113b22fff4/media/spacy-install.png


--------------------------------------------------------------------------------
/media/wizard-2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/JBGruber/python_for_r_users/ef3702206290de8a78cd3e35821c4f113b22fff4/media/wizard-2.png


--------------------------------------------------------------------------------
/media/wizard-3.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/JBGruber/python_for_r_users/ef3702206290de8a78cd3e35821c4f113b22fff4/media/wizard-3.png


--------------------------------------------------------------------------------
/media/wizard.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/JBGruber/python_for_r_users/ef3702206290de8a78cd3e35821c4f113b22fff4/media/wizard.png


--------------------------------------------------------------------------------
/python-in-r.qmd:
--------------------------------------------------------------------------------
  1 | ---
  2 | title: "UA Workshop: Python for R users"
  3 | author: "Johannes B. Gruber"
  4 | format: html
  5 | ---
  6 | 
  7 | # Outline {#outline}
  8 | 
  9 | 1. [Why combine Python with R?](#why-combine-python-with-r)
 10 | 2. [Getting started](#getting-started)
 11 | 3. [Workflow](#workflow)
 12 | 4. [Example 1: `spaCy`](#example-1-spacy)
 13 | 5. [Example 2: NMF Topic Models from `scikit-learn`](#example-2-nmf-topic-models-from-scikit-learn)
 14 | 6. [Example 3: `BERTopic`](#example-3-bertopic)
 15 | 7. [Example 4: Supervised Learning with RoBERTa](#example-4-supervised-learning-with-roberta)
 16 | 8. [Example 5: Zero-Shot Classification](#example-5-zero-shot-classification)
 17 | 
 18 | # Why combine Python with R? {#why-combine-python-with-r}
 19 | 
 20 | ![](media/reticulate.jpg){fig-align="center"}
 21 | 
 22 | Why not just switch to Python?
 23 | 
 24 | 1.  If you're here, you probably already know R so why re-learn things from scratch?
 25 | 2.  R is a programming language specifically for statistics with some great built-in functionality that you would miss in Python.
 26 | 3.  R has absolutely outstanding packages for data science with no drop-in replacement in Python (e.g., ggplot2, dplyr, tidytext).
 27 | 
 28 | Why not just stick with R then?
 29 | 
 30 | 1.  Newer models and methods in machine learning are often Python only (as advancements are made by big companies who rely on Python)
 31 | 2.  You might want to collaborate with someone who uses Python and need to run their code
 32 | 3.  Learning a new (programming) language is always a good way to extend your skills (including in the language(s) you already know)
 33 | 
 34 | # Getting started {#getting-started}
 35 | 
 36 | We start by installing the necessary Python packages, for which you should use a virtual environment (so we set that one up first).
 37 | 
 38 | ## Create a Virtual Environment {#virtual-environment}
 39 | 
 40 | **Before** you load `reticulate` for the first time, we need to create a virtual environment. This is a folder in your project directory with a link to Python and the packages you want to use in this project. Why?
 41 | 
 42 | -   Packages (or their dependencies) on the [Python Package Index](https://pypi.org/) can be incompatible with each other -- meaning you can break things by updating.
 43 | 
 44 | -   Your operating system might keep older versions of some packages around, which means you could break your OS with an accidental update!
 45 | 
 46 | -   This also adds to projects being reproducible on other systems, as you keep track of the specific version of each package used in your project (you could do this in R with the `renv` package).
 47 | 
 48 | To grab the correct version of Python to link to in the virtual environment:
 49 | 
 50 | ```{r}
 51 | if (R.Version()$os == "mingw32") {
 52 |   system("where python") # for Windows
 53 | } else {
 54 |   system("whereis python")
 55 | }
 56 | ```
 57 | 
 58 | I choose the main Python installation (e.g., "/usr/bin/python" on Linux) and use it as the base for a virtual environment; adjust the `python` argument below to the path found on your system. If you don't have any Python version on your system, you can install one with `reticulate::install_miniconda()`.
 59 | 
 60 | ```{r}
 61 | # this if condition prevents accidentally overwriting the environment when rerunning the notebook
 62 | if (!reticulate::virtualenv_exists(envname = "./python-env/")) {
 63 |   reticulate::virtualenv_create("./python-env/", python = "C:/Users/johannes/AppData/Local/r-miniconda/python.exe")
 64 |   # for Windows the path is usually "C:/Users/{user}/AppData/Local/r-miniconda/python.exe"
 65 | }
 66 | reticulate::virtualenv_exists(envname = "./python-env/")
 67 | ```
 68 | 
 69 | `reticulate` is supposed to automatically pick this up when started, but to make sure, I set the environment variable `RETICULATE_PYTHON` to the binary of Python in the new environment:
 70 | 
 71 | ```{r}
 72 | if (R.Version()$os == "mingw32") {
 73 |   python_path <- file.path(getwd(), "python-env/Scripts/python.exe")
 74 | } else {
 75 |   python_path <- file.path(getwd(), "python-env/bin/python")
 76 | }
 77 | file.exists(python_path)
 78 | Sys.setenv(RETICULATE_PYTHON = python_path)
 79 | ```
 80 | 
 81 | Optional: make this persist across restarts of RStudio by saving the environment variable into an `.Renviron` file (otherwise the `Sys.setenv()` line above needs to be in every script):
 82 | 
 83 | ```{r eval=FALSE}
 84 | # open the .Renviron file
 85 | usethis::edit_r_environ(scope = "project")
 86 | # or directly append it with the necessary line
 87 | readr::write_lines(
 88 |   x = paste0("RETICULATE_PYTHON=", python_path),
 89 |   file = ".Renviron",
 90 |   append = TRUE
 91 | )
 92 | ```
 93 | 
 94 | Now reticulate should pick up the correct binary in the project folder:
 95 | 
 96 | ```{r}
 97 | library(reticulate)
 98 | py_config()
 99 | ```
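
As a cross-check from the Python side, the interpreter itself can report whether it belongs to a virtual environment. The following is a minimal sketch, not something `reticulate` provides: `in_virtual_environment` is a hypothetical helper name, and the comparison it relies on only covers environments created with `venv` or a recent `virtualenv` (conda environments are not detected this way).

```python
import sys


def in_virtual_environment() -> bool:
    """Return True when the running interpreter belongs to a virtual environment."""
    # inside a venv/virtualenv, sys.prefix points at the environment folder,
    # while sys.base_prefix still points at the base Python installation
    return sys.prefix != sys.base_prefix


print("interpreter:", sys.executable)
print("inside a virtual environment:", in_virtual_environment())
```

If the printed interpreter path does not sit inside the project's "python-env" folder, the `RETICULATE_PYTHON` setting has probably not been picked up.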
100 | 
101 | ## Installing Packages {#packages}
102 | 
103 | `reticulate::py_install()` installs packages, similar to `install.packages()`. Let's install the packages we need:
104 | 
105 | ```{r}
106 | #| eval: false
107 | reticulate::py_install(c("spacy",
108 |                          "scikit-learn",
109 |                          "pandas",
110 |                          "bertopic", # this one requires some build tools not usually available on Windows, comment out to install the rest
111 |                          "sentence_transformers",
112 |                          "simpletransformers"))
113 | ```
114 | 
115 | But there are some caveats:
116 | 
117 | -   not all packages can be installed with the name you see in scripts (e.g., to install the package, use "scikit-learn"; to load it, you need `sklearn`)
118 | -   you might need a specific version of a package to follow a specific tutorial
119 | -   there can be different flavours of the same package (e.g., `bertopic`, `bertopic[gensim]`, `bertopic[spacy]`)
120 | -   you will get a cryptic warning if you attempt to install base Python packages
121 | 
122 | ```{r}
123 | #| error: true
124 | reticulate::py_install("os")
125 | ```
126 | 
127 | General tip: see if the software distributor has instructions, like the excellent ones from [`spacy`](https://spacy.io/usage):
128 | 
129 | ![](media/spacy-install.png){fig-align="center"}
130 | 
131 | If you see the `$` in the beginning, these are command line/bash commands. Use a ```` ```{bash} ```` chunk to run these commands with the pip and python versions in your virtual environment (you could also [activate the environment](https://docs.python.org/3/tutorial/venv.html) instead).
132 | 
133 | ```{bash}
134 | #| eval: false
135 | ./python-env/bin/pip install -U pip setuptools wheel
136 | ./python-env/bin/pip install -U 'spacy'
137 | ./python-env/bin/python -m spacy download en_core_web_sm
138 | ./python-env/bin/python -m spacy download de_core_news_sm
139 | ```
140 | 
141 | On Windows, the binary files are in a different location:
142 | 
143 | ```{bash}
144 | #| eval: false
145 | ./python-env/Scripts/pip.exe install -U pip setuptools wheel
146 | ./python-env/Scripts/pip.exe install -U 'spacy'
147 | ./python-env/Scripts/python.exe -m spacy download en_core_web_sm
148 | ./python-env/Scripts/python.exe -m spacy download de_core_news_sm
149 | ```
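
The difference between the two layouts can also be expressed in code. The sketch below mirrors the R logic used earlier for `RETICULATE_PYTHON`, but in Python; `venv_python` is a hypothetical helper, and the default folder name "python-env" is taken from this project.

```python
import os
from pathlib import Path


def venv_python(env_dir: str = "python-env") -> Path:
    """Build the expected path to the Python binary inside a virtual environment."""
    env = Path(env_dir)
    if os.name == "nt":  # Windows keeps binaries in Scripts/
        return env / "Scripts" / "python.exe"
    return env / "bin" / "python"  # Linux and macOS use bin/


print(venv_python())
```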
150 | 
151 | # Workflow {#workflow}
152 | 
153 | In my opinion, a nice workflow is to use R and Python together in a Quarto Document. All you need to do to tell Quarto to run a Python chunk instead of an R chunk is to replace ```` ```{r} ```` with ```` ```{python} ````.
154 | 
155 | ```{r}
156 | text <- "Hello World! From R"
157 | print(text)
158 | ```
159 | 
160 | ```{python}
161 | text = "Hello World! From Python"
162 | print(text)
163 | ```
164 | 
165 | You can even set up a shortcut to make these chunks (I like `Ctrl+Alt+P`):
166 | 
167 | ![](media/RStudio-Shortcut-1.png){fig-align="center"}
168 | 
169 | ![](media/RStudio-Shortcut-2.png){fig-align="center"}
170 | 
171 | To get an interactive Python session in your Console, you can use `reticulate::repl_python()`.
172 | 
173 | As you've seen above, the code is pretty similar, with a few key differences:
174 | 
175 | -   `=` instead of `<-`
176 | -   code formatting is part of the syntax!
177 | -   base Python does not have a `data.frame` class, instead you have dictionaries or the DataFrame from the Pandas package
178 | -   Python lists are the equivalent of R vectors
179 | -   the `*apply` family of functions and vectorised code does not exist as such -- everything is a for loop! <!-- - the equivalent of `$`, `%>% ` and `::` in R is `.` in Python (but not always) -->
180 | -   a lot of packages are written in an object-oriented instead of a functional style
181 | -   many more!
182 | 
183 | ```{python}
184 | #| error: true
185 | my_list = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
186 | my_list + 2 # does not work in Python
187 | for i in my_list:
188 |     print(i + 2)
189 | ```
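
Besides an explicit for loop, a list comprehension is a common way to get something close to R's vectorised arithmetic with plain Python lists. A minimal sketch (the variable name `shifted` is only for illustration):

```python
my_list = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

# a list comprehension applies the operation to every element and returns a new list
shifted = [i + 2 for i in my_list]
print(shifted)

# note: `+` between two lists concatenates them instead of adding element-wise
print(my_list[:2] + [2])
```

Arrays from the NumPy package do support element-wise arithmetic, which is why many data science packages build on them rather than on plain lists.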
190 | 
191 | ```{python}
192 | my_dict = {'name': ['John', 'Jane', 'Jim', 'Joan'],
193 |           'age': [32, 28, 40, 35],
194 |           'city': ['New York', 'London', 'Paris', 'Berlin']}
195 | my_dict
196 | ```
197 | 
198 | The truly magical thing about `reticulate` is how seamlessly it hands objects back and forth between Python and R:
199 | 
200 | ```{r}
201 | py$text
202 | py$my_list
203 | py$my_dict
204 | ```
205 | 
206 | ```{r}
207 | my_df <- data.frame(num = 1:10,
208 |                     let = LETTERS[1:10])
209 | my_list <- list(df = my_df, 11:20)
210 | ```
211 | 
212 | ```{python}
213 | r.text
214 | r.my_df
215 | r.my_list
216 | ```
217 | 
218 | What I think is especially cool is that this even works with functions:
219 | 
220 | ```{python}
221 | def hello(x=None):
222 |   """
223 |   :param x: name of the person to say hello to.
224 |   """
225 |   if not x:
226 |     print("Hello World!")
227 |   else:
228 |     print("Hello " + x + "!")
229 | ```
230 | 
231 | ```{r}
232 | py$hello()
233 | py$hello("Class")
234 | reticulate::py_help(py$hello)
235 | ```
236 | 
237 | # Example 1: `spaCy` {#example-1-spacy}
238 | 
239 | The `spacyr` package is a good example of an R wrapper for a popular Python package. So comparing the functionality is a good starting point to understand what is happening. We can replicate the [`spacyr` tutorial](https://spacyr.quanteda.io/articles/using_spacyr.html) directly with reticulate to get going.
240 | 
241 | ```{r}
242 | txt <- c(d1 = "spaCy is great at fast natural language processing.",
243 |          d2 = "Mr. Smith spent two years in North Carolina. One in New York.")
244 | doc_ids <- names(txt)
245 | ```
246 | 
247 | ```{python}
248 | import spacy
249 | nlp = spacy.load("en_core_web_sm")
250 | doc = nlp(r.txt[1])
251 | x = doc[1]
252 | for token in doc:
253 |   print(token.text, "|", token.lemma_, "|", token.pos_, "|", token.ent_type_)
254 | ```
255 | 
256 | ```{r}
257 | doc <- py$doc
258 | doc
259 | doc[1]
260 | doc[1]$pos_
261 | ```
262 | 
263 | ```{r}
264 | tibble::tibble(
265 |   token = sapply(seq_along(doc) - 1, function(i) doc[i]$text),
266 |   lemma = sapply(seq_along(doc) - 1, function(i) doc[i]$lemma_),
267 |   pos = sapply(seq_along(doc) - 1, function(i) doc[i]$pos_),
268 |   entity = sapply(seq_along(doc) - 1, function(i) doc[i]$ent_type_)
269 | )
270 | ```
271 | 
272 | Another awesome way to run the Python code from R is to define a Python function that returns R-compatible objects:
273 | 
274 | ```{python}
275 | def spacy_parse(doc_id, text):
276 |   doc = nlp(text)
277 |   toks = [] # make empty list to fill
278 |   for sent_id, sent in enumerate(doc.sents): # loop over sentences
279 |     for token in sent: # loop over tokens
280 |       toks.append({
281 |         "doc_id": doc_id,
282 |         'sentence_id': sent_id + 1, # python numbers start at 0, we want to start at 1
283 |         'token_id': token.i + 1,
284 |         'token': token.text,
285 |         'lemma': token.lemma_,
286 |         'pos': token.pos_,
287 |         'entity': token.ent_type_
288 |         })
289 |   return toks
290 | ```
291 | 
292 | Now we can call this function directly from R:
293 | 
294 | ```{r}
295 | py$spacy_parse(1, txt[2])[[1]]
296 | ```
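
To see the shape of the returned object without loading a spaCy model, here is a stand-in that follows the same pattern: a list with one dictionary per token. `toy_parse` is a hypothetical function for illustration only; it splits on whitespace and knows nothing about lemmas, parts of speech, or entities.

```python
def toy_parse(doc_id, text):
    """Stand-in for spacy_parse(): one dict per whitespace-separated token."""
    rows = []
    for position, token in enumerate(text.split()):
        rows.append({
            "doc_id": doc_id,
            "token_id": position + 1,  # shift from 0-based to 1-based counting
            "token": token,
        })
    return rows


rows = toy_parse("d1", "spaCy is great")
print(rows[0])
```

A list of dictionaries like this arrives in R as a list of named lists, which is why indexing the result with `[[1]]` yields the first token.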
297 | 
298 | Or even wrap it in an R function to make it run on an entire vector at once:
299 | 
300 | ```{r}
301 | #| message: false
302 | library(tidyverse)
303 | spacy_parse <- function(text, doc_id = names(text)) {
304 |   result_list <- map2(doc_id, text, function(x, y) py$spacy_parse(x, y))
305 |   map_df(unlist(result_list, recursive = FALSE), as_tibble)
306 | }
307 | spacy_parse(txt)
308 | ```
309 | 
310 | # Example 2: NMF Topic Models from `scikit-learn` {#example-2-nmf-topic-models-from-scikit-learn}
311 | 
312 | Inspired by [Text Mining with R](https://www.tidytextmining.com/topicmodeling.html)
313 | 
314 | ```{r}
315 | library(janeaustenr)
316 | books <- austen_books() %>%
317 |   mutate(paragraph = cumsum(text == "" & lag(text) != "")) %>%
318 |   group_by(paragraph) %>%
319 |   summarise(book = head(book, 1),
320 |             text = trimws(paste(text, collapse = " ")),
321 |             .groups = "drop")
322 | 
323 | glimpse(books)
324 | ```
325 | 
326 | ```{r}
327 | library(tidytext)
328 | austen_dfm <- books %>%
329 |   unnest_tokens(output = feature, input = text) %>%
330 |   filter(!feature %in% stop_words$word) %>% 
331 |   count(book, paragraph, feature) %>%
332 |   mutate(doc_id = paste0(book, "_", paragraph)) %>%
333 |   cast_dfm(document = doc_id, term = feature, value = n)
334 | ```
335 | 
336 | Instead of importing individual functions, you can also just grab an entire Python package and use it from R:
337 | 
338 | ```{r}
339 | sklearn <- import("sklearn")
340 | model <- sklearn$decomposition$NMF( # functions are often elements of objects in Python and can be called like this
341 |   n_components = 6L,  # number of topics
342 |   random_state  =  5L, # equivalent of seed for reproducibility
343 |   max_iter = 400L
344 | )$fit(austen_dfm) # here the $ essentially works like a pipe
345 | 
346 | beta <- model$components_
347 | colnames(beta) <- colnames(austen_dfm)
348 | rownames(beta) <- paste0("topic_", seq_len(nrow(beta)))
349 | glimpse(beta)
350 | 
351 | gamma <- model$transform(austen_dfm)
352 | colnames(gamma) <- paste0("topic_", seq_len(ncol(gamma)))
353 | rownames(gamma) <- paste0("text_", seq_len(nrow(gamma)))
354 | glimpse(gamma)
355 | ```
356 | 
357 | ```{r}
358 | beta %>%
359 |   as_tibble(rownames = "topic") %>%
360 |   pivot_longer(cols = -topic, names_to = "feature", values_to = "beta") %>%
361 |   mutate(topic = fct_inorder(topic)) %>%
362 |   group_by(topic) %>%
363 |   slice_max(beta, n = 10) %>%
364 |   arrange(topic, -beta) %>%
365 |   mutate(feature = reorder_within(feature, beta, topic)) %>%
366 |   ggplot(aes(x = beta, y = feature, fill = topic)) +
367 |   geom_col() +
368 |   facet_wrap(~topic, ncol = 2, scales = "free") +
369 |   theme_minimal() +
370 |   labs(x = NULL, y = NULL, title = "Top-features per topic") +
371 |   scale_y_reordered()
372 | ```
373 | 
374 | # Example 3: `BERTopic` {#example-3-bertopic}
375 | 
376 | I use the quanteda tutorial about [topicmodels](https://tutorials.quanteda.io/machine-learning/topicmodel/) to show an example workflow for `BERTopic`. 
377 | 
378 | ```{r}
379 | library(quanteda.corpora)
380 | corp_news <- download("data_corpus_guardian")[["documents"]]
381 | ```
382 | 
383 | ```{python}
384 | from bertopic import BERTopic
385 | from sentence_transformers import SentenceTransformer
386 | from umap import UMAP
387 | 
388 | # confusingly, this is the setup part
389 | topic_model = BERTopic(language="english",
390 |                        top_n_words=5,
391 |                        n_gram_range=(1, 2),
392 |                        nr_topics="auto", # change if you want a specific nr of topics
393 |                        calculate_probabilities=True,
394 |                        umap_model=UMAP(random_state=42)) # make reproducible
395 | 
396 | # and only here we actually run something
397 | topics, doc_topic = topic_model.fit_transform(r.corp_news.texts)
398 | ```
399 | 
400 | Unlike traditional topic models, BERTopic uses an algorithm that automatically determines a sensible number of topics and also automatically labels topics:
401 | 
402 | ```{r}
403 | topic_model <- py$topic_model
404 | topic_labels <- tibble(topic = as.integer(names(topic_model$topic_labels_)),
405 |                        label = unlist(topic_model$topic_labels_ )) %>%
406 |   mutate(label = fct_reorder(label, topic))
407 | topic_labels
408 | ```
409 | 
410 | Note that -1 describes a trash topic with words and documents that do not really belong anywhere. BERTopic also supplies the top words, i.e., the ones that most likely belong to each topic. In the code above I requested 5 words for each topic:
411 | 
412 | ```{r}
413 | top_words <- map_df(names(topic_model$topic_representations_), function(t) {
414 |   map_df(topic_model$topic_representations_[[t]], function(y)
415 |     tibble(feature = y[[1]], prob = y[[2]])) %>%
416 |     mutate(topic = as.integer(t), .before = 1L)
417 | })
418 | ```
419 | 
420 | We can plot them in the same way as above:
421 | 
422 | ```{r}
423 | top_words %>%
424 |   filter(topic %in% c(1, 7, 44, 53, 65, 66)) %>% # select a couple of topics
425 |   left_join(topic_labels, by = "topic") %>%
426 |   mutate(feature = reorder_within(feature, prob, topic)) %>%
427 |   ggplot(aes(x = prob, y = feature, fill = topic, label = label)) +
428 |   geom_col(show.legend = FALSE) +
429 |   facet_wrap(vars(label), ncol = 2, scales = "free_y") +
430 |   scale_y_reordered() +
431 |   labs(x = NULL, y = NULL)
432 | ```
433 | 
434 | We can use a nice little visualization built into BERTopic to show how topics are linked to one another:
435 | 
436 | ```{python}
437 | # map intertopic distance
438 | intertopic_distance = topic_model.visualize_topics(width=700, height=700)
439 | # save fig
440 | intertopic_distance.write_html("python-in-r_files/figure-html/bert_corp_news_intertopic.html")
441 | ```
442 | 
443 | ```{r}
444 | htmltools::includeHTML("python-in-r_files/figure-html/bert_corp_news_intertopic.html")
445 | ```
446 | 
447 | BERTopic also classifies documents into the topic categories (again, not really how you should use LDA topic models) and provides a nice visualisation for trends over time. Unfortunately, the date format in R does not translate automagically to Python, hence we need to convert the dates to strings:
448 | 
449 | ```{r}
450 | corp_news_t <- corp_news %>%
451 |   mutate(date_chr = as.character(date))
452 | ```
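
By default, `as.character()` on an R `Date` produces strings in the form "YYYY-MM-DD". If real date objects are needed on the Python side, such strings can be parsed with the standard library. This is a sketch with made-up example values, not dates from the corpus:

```python
from datetime import date

# strings in the form as.character() produces for an R Date (values made up for illustration)
date_chr = ["2016-01-04", "2016-03-15"]

# parse the ISO 8601 strings back into Python date objects
parsed = [date.fromisoformat(d) for d in date_chr]
print(parsed[0].year, parsed[1].month)
```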
453 | 
454 | ```{python}
455 | topics_over_time = topic_model.topics_over_time(docs=r.corp_news_t.texts,
456 |                                                 timestamps=r.corp_news_t.date_chr,
457 |                                                 global_tuning=True,
458 |                                                 evolution_tuning=True,
459 |                                                 nr_bins=20)
460 | #plot figure
461 | fig_overtime = topic_model.visualize_topics_over_time(topics_over_time,
462 |                                                       topics=[1, 7, 44, 53, 65, 66])
463 | #save figure
464 | fig_overtime.write_html("python-in-r_files/figure-html/fig_overtime.html")
465 | ```
466 | 
467 | ```{r}
468 | htmltools::includeHTML("python-in-r_files/figure-html/fig_overtime.html")
469 | ```
470 | 
471 | # Example 4: Supervised Learning with RoBERTa {#example-4-supervised-learning-with-roberta}
472 | 
473 | To demonstrate the workflow of supervised learning, I'm replicating the example from [the naive bayes quanteda tutorial](https://tutorials.quanteda.io/machine-learning/nb/).
474 | 
475 | ```{python}
476 | #| message: false
477 | #| warning: false
478 | #| output: false
479 | import pandas as pd
480 | import os
481 | import torch
482 | from simpletransformers.classification import ClassificationModel
483 | 
484 | # args copied from grafzahl
485 | model_args = {
486 |   "num_train_epochs": 1, # increase for multiple runs, which can yield better performance
487 |   "use_multiprocessing": False,
488 |   "use_multiprocessing_for_evaluation": False,
489 |   "overwrite_output_dir": True,
490 |   "reprocess_input_data":  True,
492 |   "fp16":  True,
493 |   "save_steps":  -1,
494 |   "save_eval_checkpoints":  False,
495 |   "save_model_every_epoch":  False,
496 |   "silent":  True,
497 | }
498 | 
499 | os.environ["TOKENIZERS_PARALLELISM"] = "false"
500 | 
501 | roberta_model = ClassificationModel(model_type="roberta",
502 |                                     model_name="roberta-base",
503 |                                     # Use GPU if available
504 |                                     use_cuda=torch.cuda.is_available(),
505 |                                     args=model_args)
506 | ```
507 | 
508 | We construct a training and test set from the movie review corpus in R:
509 | 
510 | ```{r}
511 | corp_movies <- quanteda.textmodels::data_corpus_moviereviews %>%
512 |   tibble(quanteda::docvars(x = .), text = .)
513 | 
514 | corp_movies %>%
515 |   count(sentiment)
516 | 
517 | set.seed(1)
518 | corp_movies_train <- corp_movies %>%
519 |   slice_sample(prop = 0.9)
520 | 
521 | corp_movies_test <- corp_movies %>%
522 |   filter(!id2 %in% corp_movies_train$id2)
523 | ```
524 | 
525 | Now we can train the model on the coded training set and predict the classes for the test set (if you do not have a GPU, this will take a long time, so maybe do it after the course):
526 | 
527 | ```{python}
528 | #| output: false
529 | # process data to the form simpletransformers needs
530 | train_df = r.corp_movies_train
531 | train_df['labels'] = train_df['sentiment'].astype('category').cat.codes
532 | train_df = train_df[['text', 'labels']]
533 | 
534 | roberta_model.train_model(train_df)
535 | 
536 | # test data needs to be a list
537 | test_l = r.corp_movies_test["text"].tolist()
538 | predictions, raw_outputs = roberta_model.predict(test_l)
539 | ```
540 | 
541 | ```{r}
542 | results <- tibble(
543 |   truth = corp_movies_test$sentiment,
544 |   estimate = factor(c("neg", "pos"))[py$predictions + 1]
545 | )
546 | conf_mat <- yardstick::conf_mat(results, truth, estimate)
547 | summary(conf_mat)
548 | ```
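
The `+ 1` in the R chunk above bridges Python's 0-based and R's 1-based indexing. The following plain-Python sketch mimics the label encoding and decoding with made-up values, assuming (as with alphabetically ordered categories) that "neg" is coded as 0 and "pos" as 1:

```python
sentiment = ["pos", "neg", "pos", "neg"]  # made-up labels for illustration

# categories are ordered alphabetically, so "neg" -> 0 and "pos" -> 1
categories = sorted(set(sentiment))
codes = [categories.index(s) for s in sentiment]
print(codes)

# decoding model output: Python indexes from 0, hence the `+ 1` on the R side
predictions = [0, 1, 1]  # made-up predictions
decoded = [categories[p] for p in predictions]
print(decoded)
```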
549 | 
550 | # Example 5: Zero-Shot Classification {#example-5-zero-shot-classification}
551 | 
552 | Something I learned about recently is zero-shot classification: these models do not need to be trained on new categories, but can infer category-text relationships from the data they were trained with.
553 | You can get one such model from https://huggingface.co/MoritzLaurer/xlm-v-base-mnli-xnli.
554 | 
555 | ```{python}
556 | from transformers import pipeline
557 | classifier = pipeline("zero-shot-classification",
558 |                       model="MoritzLaurer/xlm-v-base-mnli-xnli")
559 | 
560 | sequence_to_classify = "Angela Merkel ist eine Politikerin in Deutschland und Vorsitzende der CDU"
561 | candidate_labels = ["politics", "economy", "entertainment", "environment"]
562 | output = classifier(sequence_to_classify, candidate_labels, multi_label=False)
563 | print(output)
564 | ```
565 | 
566 | ```{r}
567 | #| cache: true
568 | zero_shot_classification <- function(text, labels) {
569 |   res <- py$classifier(text, labels, multi_label=FALSE)
570 |   map_df(seq_along(res), function(i) {
571 |     as_tibble(res[[i]]) %>%
572 |       mutate(id = i)
573 |   }) %>%
574 |     group_by(id) %>%
575 |     slice_max(scores, n = 1)
576 | }
577 | 
578 | set.seed(3)
579 | test <- corp_movies_test %>%
580 |   sample_n(10)
581 | 
582 | pred <- zero_shot_classification(
583 |   as.character(test$text),
584 |   c("negative", "positive")
585 | )
586 | 
587 | results <- pred %>%
588 |   ungroup() %>%
589 |   mutate(estimate = factor(labels),
590 |          estimate = fct_recode(estimate,
591 |                                neg = "negative",
592 |                                pos = "positive")) %>%
593 |   mutate(truth = test$sentiment[1:10])
594 | 
595 | conf_mat <- yardstick::conf_mat(results, truth, estimate)
596 | summary(conf_mat)
597 | ```
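
What `slice_max(scores, n = 1)` does in the R function above can also be done directly in Python. For a single text, the pipeline returns a dictionary with the keys `sequence`, `labels` and `scores`; the sketch below uses a made-up result in that layout (the scores are invented for illustration, not model output):

```python
# made-up result in the layout the pipeline returns for a single text
output = {
    "sequence": "An example sentence",
    "labels": ["politics", "economy", "entertainment"],
    "scores": [0.7, 0.2, 0.1],
}

# pair each label with its score and keep the pair with the highest score
top_label, top_score = max(
    zip(output["labels"], output["scores"]),
    key=lambda pair: pair[1],
)
print(top_label, top_score)
```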
598 | 
599 | # Further Learning
600 | 
601 | - [Computational Analysis of Communication](https://cssbook.net/): a free book on communication science with Python and/or R with side-by-side code examples in both languages
602 | - [Doing Computational Social Science with Python: An Introduction](https://papers.ssrn.com/sol3/papers.cfm?abstract_id=2737682): a free book on social science data wrangling and analyses in Python (you can skip chapters 1-4)
603 | - [A Whirlwind Tour of Python](https://jakevdp.github.io/WhirlwindTourOfPython/): a free book
604 | - <https://www.youtube.com/watch?v=YmcA4ODpiqA&t=3679s>: 4.5h workshop introducing Python (with some hints for R users sprinkled throughout the examples)
605 | - [ChatGPT](https://chat.openai.com/chat) is really good at translating/explaining Python code!
606 | 
607 | # Wrap up {#wrap-up}
608 | 
609 | Some information about the session.
610 | 
611 | ```{r}
612 | Sys.time()
613 | sessionInfo()
614 | py_list_packages() %>% 
615 |   as_tibble() %>% 
616 |   select(-requirement) %>% 
617 |   print(n = Inf)
618 | ```
619 | 


--------------------------------------------------------------------------------
/reticulate_workshop.Rproj:
--------------------------------------------------------------------------------
 1 | Version: 1.0
 2 | 
 3 | RestoreWorkspace: Default
 4 | SaveWorkspace: Default
 5 | AlwaysSaveHistory: Default
 6 | 
 7 | EnableCodeIndexing: Yes
 8 | UseSpacesForTab: Yes
 9 | NumSpacesForTab: 2
10 | Encoding: UTF-8
11 | 
12 | RnwWeave: Sweave
13 | LaTeX: pdfLaTeX
14 | 
15 | AutoAppendNewline: Yes
16 | StripTrailingWhitespace: Yes
17 | 
18 | UseNativePipeOperator: No
19 | 


--------------------------------------------------------------------------------