├── .gitignore ├── README.md ├── README.qmd ├── media ├── RStudio-Shortcut-1.png ├── RStudio-Shortcut-2.png ├── install-os.png ├── new_project.png ├── reticulate.jpg ├── spacy-install.png ├── wizard-2.png ├── wizard-3.png └── wizard.png ├── python-in-r.html ├── python-in-r.qmd └── reticulate_workshop.Rproj /.gitignore: -------------------------------------------------------------------------------- 1 | .Rproj.user 2 | .Rhistory 3 | .RData 4 | .Ruserdata 5 | flyer 6 | .Renviron 7 | cache_dir 8 | outputs 9 | python-env 10 | python-in-r_cache 11 | python-in-r_files 12 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | README 2 | ================ 3 | 4 | # WORKSHOPS FOR UKRAINE: Python for R users 5 | 6 | Material for the workshop *Python for R users* on Thursday, February 7 | 16th, 18:00 - 20:00 CET (Rome, Berlin, Paris timezone). More info on the 8 | course and how to sign up/get the recording at 9 | https://sites.google.com/view/dariia-mykhailyshyna/main/r-workshops-for-ukraine#h.36dsv5tl42am 10 | 11 | # Download the project 12 | 13 | In RStudio go to “Create a project” (top left corner with this symbol 14 | ![](media/new_project.png)). Then select “Version Control”: 15 | 16 | ![](media/wizard.png) 17 | 18 | In the next window, select “Git”: 19 | 20 | ![](media/wizard-2.png) 21 | 22 | Then copy the URL `https://github.com/JBGruber/python_for_r_users` into 23 | the URL field and select where to download the project to. 24 | 25 | ![](media/wizard-3.png) 26 | 27 | After clicking “Create Project”, a new session should open. Navigate to 28 | the file “python-in-r.qmd” and open it. That’s it! 
29 | 30 | # Install dependencies 31 | 32 | The short code below will check the main python-in-r.qmd file for 33 | mentioned R packages and install the missing ones on your computer: 34 | 35 | ``` r 36 | if (!requireNamespace("rlang", quietly = TRUE)) install.packages("rlang", dependencies = TRUE) 37 | if (!rlang::is_installed("quanteda.corpora")) remotes::install_github("quanteda/quanteda.corpora") 38 | rlang::check_installed("attachment") 39 | rlang::check_installed(attachment::att_from_rmds("python-in-r.qmd")) 40 | ``` 41 | 42 | If there is no output, you are good to go. 43 | 44 | We will install the Python packages during the workshop, but if you want 45 | to get a head start, you can follow the “Getting started” section in the 46 | “python-in-r.qmd” file. 47 | -------------------------------------------------------------------------------- /README.qmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "README" 3 | format: gfm 4 | --- 5 | 6 | # WORKSHOPS FOR UKRAINE: Python for R users 7 | 8 | Material for the workshop *Python for R users* on Thursday, February 16th, 18:00 - 20:00 CET (Rome, Berlin, Paris timezone). 9 | More info on the course and how to sign up/get the recording at https://sites.google.com/view/dariia-mykhailyshyna/main/r-workshops-for-ukraine#h.36dsv5tl42am 10 | 11 | # Download the project 12 | 13 | In RStudio go to "Create a project" (top left corner with this symbol ![](media/new_project.png)). 14 | Then select "Version Control": 15 | 16 | ![](media/wizard.png) 17 | 18 | In the next window, select "Git": 19 | 20 | ![](media/wizard-2.png) 21 | 22 | Then copy the URL `https://github.com/JBGruber/python_for_r_users` into the URL field and select where to download the project to. 23 | 24 | ![](media/wizard-3.png) 25 | 26 | After clicking "Create Project", a new session should open. 27 | Navigate to the file "python-in-r.qmd" and open it. 28 | That's it! 
29 | 30 | # Install dependencies 31 | 32 | The short code below will check the main python-in-r.qmd file for mentioned R packages and install the missing ones on your computer: 33 | 34 | ```{r} 35 | if (!requireNamespace("rlang", quietly = TRUE)) install.packages("rlang", dependencies = TRUE) 36 | if (!rlang::is_installed("quanteda.corpora")) remotes::install_github("quanteda/quanteda.corpora") 37 | rlang::check_installed("attachment") 38 | rlang::check_installed(attachment::att_from_rmds("python-in-r.qmd")) 39 | ``` 40 | 41 | If there is no output, you are good to go. 42 | 43 | We will install the Python packages during the workshop, but if you want to get a head start, you can follow the "Getting started" section in the "python-in-r.qmd" file. 44 | 45 | -------------------------------------------------------------------------------- /media/RStudio-Shortcut-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JBGruber/python_for_r_users/ef3702206290de8a78cd3e35821c4f113b22fff4/media/RStudio-Shortcut-1.png -------------------------------------------------------------------------------- /media/RStudio-Shortcut-2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JBGruber/python_for_r_users/ef3702206290de8a78cd3e35821c4f113b22fff4/media/RStudio-Shortcut-2.png -------------------------------------------------------------------------------- /media/install-os.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JBGruber/python_for_r_users/ef3702206290de8a78cd3e35821c4f113b22fff4/media/install-os.png -------------------------------------------------------------------------------- /media/new_project.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/JBGruber/python_for_r_users/ef3702206290de8a78cd3e35821c4f113b22fff4/media/new_project.png -------------------------------------------------------------------------------- /media/reticulate.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JBGruber/python_for_r_users/ef3702206290de8a78cd3e35821c4f113b22fff4/media/reticulate.jpg -------------------------------------------------------------------------------- /media/spacy-install.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JBGruber/python_for_r_users/ef3702206290de8a78cd3e35821c4f113b22fff4/media/spacy-install.png -------------------------------------------------------------------------------- /media/wizard-2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JBGruber/python_for_r_users/ef3702206290de8a78cd3e35821c4f113b22fff4/media/wizard-2.png -------------------------------------------------------------------------------- /media/wizard-3.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JBGruber/python_for_r_users/ef3702206290de8a78cd3e35821c4f113b22fff4/media/wizard-3.png -------------------------------------------------------------------------------- /media/wizard.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JBGruber/python_for_r_users/ef3702206290de8a78cd3e35821c4f113b22fff4/media/wizard.png -------------------------------------------------------------------------------- /python-in-r.qmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "UA Workshop: Python for R users" 3 | author: "Johannes B. Gruber" 4 | format: html 5 | --- 6 | 7 | # Outline {#outline} 8 | 9 | 1. 
[Why combine Python with R?](#why-combine-python-with-r) 10 | 2. [Getting started](#getting-started) 11 | 3. [Workflow](#workflow) 12 | 4. [Example 1: `spaCy`](#example-1-spacy) 13 | 5. [Example 2: NMF Topic Models from `scikit-learn`](#example-2-nmf-topic-models-from-scikit-learn) 14 | 6. [Example 3: `BERTopic`](#example-3-bertopic) 15 | 7. [Example 4: Supervised Learning with RoBERTa](#example-4-supervised-learning-with-roberta) 16 | 8. [Example 5: Zero-Shot Classification](#example-5-zero-shot-classification) 17 | 18 | # Why combine Python with R? {#why-combine-python-with-r} 19 | 20 | ![](media/reticulate.jpg){fig-align="center"} 21 | 22 | Why not just switch to Python? 23 | 24 | 1. If you're here, you probably already know R, so why re-learn things from scratch? 25 | 2. R is a programming language specifically for statistics with some great built-in functionality that you would miss in Python. 26 | 3. R has absolutely outstanding packages for data science with no drop-in replacement in Python (e.g., ggplot2, dplyr, tidytext). 27 | 28 | Why not just stick with R then? 29 | 30 | 1. Newer models and methods in machine learning are often Python-only (as advancements are made by big companies who rely on Python) 31 | 2. You might want to collaborate with someone who uses Python and need to run their code 32 | 3. Learning a new (programming) language is always good to extend your skills (also in the language(s) you already know) 33 | 34 | # Getting started {#getting-started} 35 | 36 | We start by installing the necessary Python packages, for which you should use a virtual environment (so we set that one up first). 37 | 38 | ## Create a Virtual Environment {#virtual-environment} 39 | 40 | **Before** you load `reticulate` for the first time, we need to create a virtual environment. This is a folder in your project directory with a link to Python and the packages you want to use in this project. Why? 
41 | 42 | - Packages (or their dependencies) on the [Python Package Index](https://pypi.org/) can be incompatible with each other -- meaning you can break things by updating. 43 | 44 | - Your operating system might keep older versions of some packages around, which means you could break your OS with an accidental update! 45 | 46 | - This also makes projects more reproducible on other systems, as you keep track of the specific version of each package used in your project (you could do this in R with the `renv` package). 47 | 48 | To find the Python installations you could link to in the virtual environment: 49 | 50 | ```{r} 51 | if (R.Version()$os == "mingw32") { 52 | system("where python") # for Windows 53 | } else { 54 | system("whereis python") 55 | } 56 | ``` 57 | 58 | I choose the main Python installation in "/usr/bin/python" and use it as the base for a virtual environment. If you don't have any Python version on your system, you can install one with `reticulate::install_miniconda()`. 
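As a side note, the same lookup can be sketched from the Python side with the standard library only. This is a minimal illustration, not part of the workshop code: `shutil.which` plays the role of `where`/`whereis`, `sys.executable` reports the interpreter that is currently running, and the helper name `find_python` is made up for this example.

```python
import shutil
import sys


def find_python(candidates=("python3", "python")):
    """Return the first interpreter found on PATH, or None if there is none."""
    for name in candidates:
        path = shutil.which(name)  # same idea as `where` / `whereis`
        if path is not None:
            return path
    return None


found = find_python()
print("on PATH:", found)
print("currently running:", sys.executable)
```

Which interpreter is found first depends on your `PATH`, so the result can differ between a terminal and RStudio.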
59 | 60 | ```{r} 61 | # I built in this if condition to not accidentally overwrite the environment when rerunning the notebook 62 | if (!reticulate::virtualenv_exists(envname = "./python-env/")) { 63 | reticulate::virtualenv_create("./python-env/", python = "C:/Users/johannes/AppData/Local/r-miniconda/python.exe") 64 | # for Windows the path is usually "C:/Users/{user}/AppData/Local/r-miniconda/python.exe" 65 | } 66 | reticulate::virtualenv_exists(envname = "./python-env/") 67 | ``` 68 | 69 | `reticulate` is supposed to automatically pick this up when started, but to make sure, I set the environment variable `RETICULATE_PYTHON` to the binary of Python in the new environment: 70 | 71 | ```{r} 72 | if (R.Version()$os == "mingw32") { 73 | python_path <- file.path(getwd(), "python-env/Scripts/python.exe") 74 | } else { 75 | python_path <- file.path(getwd(), "python-env/bin/python") 76 | } 77 | file.exists(python_path) 78 | Sys.setenv(RETICULATE_PYTHON = python_path) 79 | ``` 80 | 81 | Optional: make this persist across restarts of RStudio by saving the environment variable into an `.Renviron` file (otherwise the `Sys.setenv()` line above needs to be in every script): 82 | 83 | ```{r eval=FALSE} 84 | # open the .Renviron file 85 | usethis::edit_r_environ(scope = "project") 86 | # or directly append the necessary line 87 | readr::write_lines( 88 | x = paste0("RETICULATE_PYTHON=", python_path), 89 | file = ".Renviron", 90 | append = TRUE 91 | ) 92 | ``` 93 | 94 | Now reticulate should pick up the correct binary in the project folder: 95 | 96 | ```{r} 97 | library(reticulate) 98 | py_config() 99 | ``` 100 | 101 | ## Installing Packages {#packages} 102 | 103 | `reticulate::py_install()` installs packages, similar to `install.packages()`. 
Let's install the packages we need: 104 | 105 | ```{r} 106 | #| eval: false 107 | reticulate::py_install(c("spacy", 108 | "scikit-learn", 109 | "pandas", 110 | "bertopic", # this one requires some build tools not usually available on Windows, comment out to install the rest 111 | "sentence_transformers", 112 | "simpletransformers")) 113 | ``` 114 | 115 | But there are some caveats: 116 | 117 | - not all packages can be installed with the name you see in scripts (e.g., the package is installed as "scikit-learn", but you load it as `sklearn`) 118 | - you might need a specific version of a package to follow a specific tutorial 119 | - there can be different flavours of the same package (e.g., `bertopic`, `bertopic[gensim]`, `bertopic[spacy]`) 120 | - you will get a cryptic warning if you attempt to install base Python packages (modules such as `os` that already ship with Python) 121 | 122 | ```{r} 123 | #| error: true 124 | reticulate::py_install("os") 125 | ``` 126 | 127 | General tip: see if the software distributor has instructions, like the excellent ones from [`spacy`](https://spacy.io/usage): 128 | 129 | ![](media/spacy-install.png){fig-align="center"} 130 | 131 | If you see a `$` at the beginning of a line, these are command line/bash commands. Use a ```` ```{bash} ```` chunk to run these commands and use the pip and python versions in your virtual environment (you could also [activate the environment](https://docs.python.org/3/tutorial/venv.html) instead). 
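Where the pip and python binaries live inside the environment depends on the operating system: `bin/` on Linux and macOS, `Scripts/` with an `.exe` suffix on Windows. A minimal sketch of that rule in Python, assuming the environment folder is called `python-env` as in this project; the helper `venv_binary` is hypothetical and only builds the path, it does not check that the file exists.

```python
import os
from pathlib import Path


def venv_binary(env_dir, tool, windows=(os.name == "nt")):
    """Build the path to a tool (e.g. pip or python) inside a virtual environment."""
    env = Path(env_dir)
    if windows:
        # Windows environments keep executables in Scripts/ with an .exe suffix
        return env / "Scripts" / f"{tool}.exe"
    # Linux and macOS environments keep executables in bin/
    return env / "bin" / tool


print(venv_binary("python-env", "pip", windows=False).as_posix())
print(venv_binary("python-env", "pip", windows=True).as_posix())
```

Passing `windows` explicitly is only there to show both layouts on one machine; by default the helper follows the system it runs on.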
132 | 133 | ```{bash} 134 | #| eval: false 135 | ./python-env/bin/pip install -U pip setuptools wheel 136 | ./python-env/bin/pip install -U 'spacy' 137 | ./python-env/bin/python -m spacy download en_core_web_sm 138 | ./python-env/bin/python -m spacy download de_core_news_sm 139 | ``` 140 | 141 | On Windows, the binary files are in a different location: 142 | 143 | ```{bash} 144 | #| eval: false 145 | ./python-env/Scripts/pip.exe install -U pip setuptools wheel 146 | ./python-env/Scripts/pip.exe install -U 'spacy' 147 | ./python-env/Scripts/python.exe -m spacy download en_core_web_sm 148 | ./python-env/Scripts/python.exe -m spacy download de_core_news_sm 149 | ``` 150 | 151 | # Workflow {#workflow} 152 | 153 | In my opinion, a nice workflow is to use R and Python together in a Quarto Document. To tell Quarto to run a Python chunk instead of an R chunk, all you need to do is replace ```` ```{r} ```` with ```` ```{python} ````. 154 | 155 | ```{r} 156 | text <- "Hello World! From R" 157 | print(text) 158 | ``` 159 | 160 | ```{python} 161 | text = "Hello World! From Python" 162 | print(text) 163 | ``` 164 | 165 | You can even set up a shortcut to make these chunks (I like `Ctrl+Alt+P`): 166 | 167 | ![](media/RStudio-Shortcut-1.png){fig-align="center"} 168 | 169 | ![](media/RStudio-Shortcut-2.png){fig-align="center"} 170 | 171 | To get an interactive Python session in your Console, you can use `reticulate::repl_python()`. 172 | 173 | As you've seen above, the code is pretty similar, with a few key differences: 174 | 175 | - `=` instead of `<-` 176 | - code formatting (indentation) is part of the syntax! 177 | - base Python does not have a `data.frame` class, instead you have dictionaries or the DataFrame from the Pandas package 178 | - Python lists are the equivalent of R vectors 179 | - the `*apply` family of functions and vectorised code do not exist as such -- everything is a for loop! 180 | - a lot of packages are written in an object-oriented rather than a functional style 181 | - many more! 
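A few of these differences, sketched in plain Python without any packages. This is only an illustration with made-up data: the list comprehension is probably the closest base-Python analogue to `sapply()`, and the dictionary of equal-length lists is a rough stand-in for a `data.frame`.

```python
numbers = [1, 2, 3, 4, 5]

# a list comprehension is the closest base-Python analogue to sapply()
plus_two = [n + 2 for n in numbers]

# indentation is syntax: the indented line forms the body of the loop
total = 0
for n in numbers:
    total += n

# a dictionary of equal-length lists is a rough stand-in for a data.frame
people = {"name": ["John", "Jane"], "age": [32, 28]}
ages_next_year = [age + 1 for age in people["age"]]

print(plus_two)
print(total)
print(ages_next_year)
```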
182 | 183 | ```{python} 184 | #| error: true 185 | my_list = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10] 186 | my_list + 2 # does not work in Python 187 | for i in my_list: 188 | print(i + 2) 189 | ``` 190 | 191 | ```{python} 192 | my_dict = {'name': ['John', 'Jane', 'Jim', 'Joan'], 193 | 'age': [32, 28, 40, 35], 194 | 'city': ['New York', 'London', 'Paris', 'Berlin']} 195 | my_dict 196 | ``` 197 | 198 | The truly magical thing about `reticulate` is how seamlessly it hands objects back and forth between Python and R: 199 | 200 | ```{r} 201 | py$text 202 | py$my_list 203 | py$my_dict 204 | ``` 205 | 206 | ```{r} 207 | my_df <- data.frame(num = 1:10, 208 | let = LETTERS[1:10]) 209 | my_list <- list(df = my_df, 11:20) 210 | ``` 211 | 212 | ```{python} 213 | r.text 214 | r.my_df 215 | r.my_list 216 | ``` 217 | 218 | What I think is especially cool is that this even works with functions: 219 | 220 | ```{python} 221 | def hello(x=None): 222 | """ 223 | :param x: name of the person to say hello to. 224 | """ 225 | if not x: 226 | print("Hello World!") 227 | else: 228 | print("Hello " + x + "!") 229 | ``` 230 | 231 | ```{r} 232 | py$hello() 233 | py$hello("Class") 234 | reticulate::py_help(py$hello) 235 | ``` 236 | 237 | # Example 1: `spaCy` {#example-1-spacy} 238 | 239 | The `spacyr` package is a good example of an R wrapper for a popular Python package, so comparing the functionality is a good starting point to understand what is happening. We can replicate the [`spacyr` tutorial](https://spacyr.quanteda.io/articles/using_spacyr.html) directly with reticulate to get going. 240 | 241 | ```{r} 242 | txt <- c(d1 = "spaCy is great at fast natural language processing.", 243 | d2 = "Mr. Smith spent two years in North Carolina. 
One in New York.") 244 | doc_ids <- names(txt) 245 | ``` 246 | 247 | ```{python} 248 | import spacy 249 | nlp = spacy.load("en_core_web_sm") 250 | doc = nlp(r.txt[1]) 251 | x = doc[1] 252 | for token in doc: 253 | print(token.text, "|", token.lemma_, "|", token.pos_, "|", token.ent_type_) 254 | ``` 255 | 256 | ```{r} 257 | doc <- py$doc 258 | doc 259 | doc[1] 260 | doc[1]$pos_ 261 | ``` 262 | 263 | ```{r} 264 | tibble::tibble( 265 | token = sapply(seq_along(doc) - 1, function(i) doc[i]$text), 266 | lemma = sapply(seq_along(doc) - 1, function(i) doc[i]$lemma_), 267 | pos = sapply(seq_along(doc) - 1, function(i) doc[i]$pos_), 268 | entity = sapply(seq_along(doc) - 1, function(i) doc[i]$ent_type_) 269 | ) 270 | ``` 271 | 272 | Another awesome way to run the Python code from R is to define a Python function that returns R-compatible objects: 273 | 274 | ```{python} 275 | def spacy_parse(doc_id, text): 276 | doc = nlp(text) 277 | toks = [] # make empty list to fill 278 | for sent_id, sent in enumerate(doc.sents): # loop over sentences 279 | for token in sent: # loop over tokens 280 | toks.append({ 281 | "doc_id": doc_id, 282 | 'sentence_id': sent_id + 1, # python numbers start at 0, we want to start at 1 283 | 'token_id': token.i + 1, 284 | 'token': token.text, 285 | 'lemma': token.lemma_, 286 | 'pos': token.pos_, 287 | 'entity': token.ent_type_ 288 | }) 289 | return toks 290 | ``` 291 | 292 | Now we can call this function directly from R: 293 | 294 | ```{r} 295 | py$spacy_parse(1, txt[2])[[1]] 296 | ``` 297 | 298 | Or even wrap it in an R function to make it run on an entire vector at once: 299 | 300 | ```{r} 301 | #| message: false 302 | library(tidyverse) 303 | spacy_parse <- function(text, doc_id = names(text)) { 304 | result_list <- map2(doc_id, text, function(x, y) py$spacy_parse(x, y)) 305 | map_df(unlist(result_list, recursive = FALSE), as_tibble) 306 | } 307 | spacy_parse(txt) 308 | ``` 309 | 310 | # Example 2: NMF Topic Models from `scikit-learn` 
{#example-2-nmf-topic-models-from-scikit-learn} 311 | 312 | Inspired by [Text Mining with R](https://www.tidytextmining.com/topicmodeling.html) 313 | 314 | ```{r} 315 | library(janeaustenr) 316 | books <- austen_books() %>% 317 | mutate(paragraph = cumsum(text == "" & lag(text) != "")) %>% 318 | group_by(paragraph) %>% 319 | summarise(book = head(book, 1), 320 | text = trimws(paste(text, collapse = " ")), 321 | .groups = "drop") 322 | 323 | glimpse(books) 324 | ``` 325 | 326 | ```{r} 327 | library(tidytext) 328 | austen_dfm <- books %>% 329 | unnest_tokens(output = feature, input = text) %>% 330 | filter(!feature %in% stop_words$word) %>% 331 | count(book, paragraph, feature) %>% 332 | mutate(doc_id = paste0(book, "_", paragraph)) %>% 333 | cast_dfm(document = doc_id, term = feature, value = n) 334 | ``` 335 | 336 | Instead of importing individual functions, you can also just grab an entire Python package and use it from R: 337 | 338 | ```{r} 339 | sklearn <- import("sklearn") 340 | model <- sklearn$decomposition$NMF( # functions are often elements of objects in Python and can be called like this 341 | n_components = 6L, # number of topics 342 | random_state = 5L, # equivalent of seed for reproducibility 343 | max_iter = 400L 344 | )$fit(austen_dfm) # here the $ essentially works like a pipe 345 | 346 | beta <- model$components_ 347 | colnames(beta) <- colnames(austen_dfm) 348 | rownames(beta) <- paste0("topic_", seq_len(nrow(beta))) 349 | glimpse(beta) 350 | 351 | gamma <- model$transform(austen_dfm) 352 | colnames(gamma) <- paste0("topic_", seq_len(ncol(gamma))) 353 | rownames(gamma) <- paste0("text_", seq_len(nrow(gamma))) 354 | glimpse(gamma) 355 | ``` 356 | 357 | ```{r} 358 | beta %>% 359 | as_tibble(rownames = "topic") %>% 360 | pivot_longer(cols = -topic, names_to = "feature", values_to = "beta") %>% 361 | mutate(topic = fct_inorder(topic)) %>% 362 | group_by(topic) %>% 363 | slice_max(beta, n = 10) %>% 364 | arrange(topic, -beta) %>% 365 | mutate(feature = 
reorder_within(feature, beta, topic)) %>% 366 | ggplot(aes(x = beta, y = feature, fill = topic)) + 367 | geom_col() + 368 | facet_wrap(~topic, ncol = 2, scales = "free") + 369 | theme_minimal() + 370 | labs(x = NULL, y = NULL, title = "Top-features per topic") + 371 | scale_y_reordered() 372 | ``` 373 | 374 | # Example 3: `BERTopic` {#example-3-bertopic} 375 | 376 | I use the quanteda tutorial about [topicmodels](https://tutorials.quanteda.io/machine-learning/topicmodel/) to show an example workflow for `BERTopic`. 377 | 378 | ```{r} 379 | library(quanteda.corpora) 380 | corp_news <- download("data_corpus_guardian")[["documents"]] 381 | ``` 382 | 383 | ```{python} 384 | from bertopic import BERTopic 385 | from sentence_transformers import SentenceTransformer 386 | from umap import UMAP 387 | 388 | # confusingly, this is the setup part 389 | topic_model = BERTopic(language="english", 390 | top_n_words=5, 391 | n_gram_range=(1, 2), 392 | nr_topics="auto", # change if you want a specific nr of topics 393 | calculate_probabilities=True, 394 | umap_model=UMAP(random_state=42)) # make reproducible 395 | 396 | # and only here we actually run something 397 | topics, doc_topic = topic_model.fit_transform(r.corp_news.texts) 398 | ``` 399 | 400 | Unlike traditional topic models, BERTopic uses an algorithm that automatically determines a sensible number of topics and also automatically labels topics: 401 | 402 | ```{r} 403 | topic_model <- py$topic_model 404 | topic_labels <- tibble(topic = as.integer(names(topic_model$topic_labels_)), 405 | label = unlist(topic_model$topic_labels_ )) %>% 406 | mutate(label = fct_reorder(label, topic)) 407 | topic_labels 408 | ``` 409 | 410 | Note that -1 describes a trash topic with words and documents that do not really belong anywhere. BERTopic also supplies the top words, i.e., the ones that most likely belong to each topic. 
In the code above I requested 5 words for each topic: 411 | 412 | ```{r} 413 | top_words <- map_df(names(topic_model$topic_representations_), function(t) { 414 | map_df(topic_model$topic_representations_[[t]], function(y) 415 | tibble(feature = y[[1]], prob = y[[2]])) %>% 416 | mutate(topic = as.integer(t), .before = 1L) 417 | }) 418 | ``` 419 | 420 | We can plot them in the same way as above: 421 | 422 | ```{r} 423 | top_words %>% 424 | filter(topic %in% c(1, 7, 44, 53, 65, 66)) %>% # select a couple of topics 425 | left_join(topic_labels, by = "topic") %>% 426 | mutate(feature = reorder_within(feature, prob, topic)) %>% 427 | ggplot(aes(x = prob, y = feature, fill = topic, label = label)) + 428 | geom_col(show.legend = FALSE) + 429 | facet_wrap(vars(label), ncol = 2, scales = "free_y") + 430 | scale_y_reordered() + 431 | labs(x = NULL, y = NULL) 432 | ``` 433 | 434 | We can use a nice little visualization built into BERTopic to show how topics are linked to one another: 435 | 436 | ```{python} 437 | # map intertopic distance 438 | intertopic_distance = topic_model.visualize_topics(width=700, height=700) 439 | # save fig 440 | intertopic_distance.write_html("python-in-r_files/figure-html/bert_corp_news_intertopic.html") 441 | ``` 442 | 443 | ```{r} 444 | htmltools::includeHTML("python-in-r_files/figure-html/bert_corp_news_intertopic.html") 445 | ``` 446 | 447 | BERTopic also classifies documents into the topic categories (again not really how you should use LDA topic models) and provides a nice visualisation of trends over time. 
Unfortunately, the date format in R does not translate automagically to Python, hence we need to convert the dates to strings: 448 | 449 | ```{r} 450 | corp_news_t <- corp_news %>% 451 | mutate(date_chr = as.character(date)) 452 | ``` 453 | 454 | ```{python} 455 | topics_over_time = topic_model.topics_over_time(docs=r.corp_news_t.texts, 456 | timestamps=r.corp_news_t.date_chr, 457 | global_tuning=True, 458 | evolution_tuning=True, 459 | nr_bins=20) 460 | #plot figure 461 | fig_overtime = topic_model.visualize_topics_over_time(topics_over_time, 462 | topics=[1, 7, 44, 53, 65, 66]) 463 | #save figure 464 | fig_overtime.write_html("python-in-r_files/figure-html/fig_overtime.html") 465 | ``` 466 | 467 | ```{r} 468 | htmltools::includeHTML("python-in-r_files/figure-html/fig_overtime.html") 469 | ``` 470 | 471 | # Example 4: Supervised Learning with RoBERTa {#example-4-supervised-learning-with-roberta} 472 | 473 | To demonstrate the workflow of supervised learning, I'm replicating the example from [the naive bayes quanteda tutorial](https://tutorials.quanteda.io/machine-learning/nb/). 
474 | 475 | ```{python} 476 | #| message: false 477 | #| warning: false 478 | #| output: false 479 | import pandas as pd 480 | import os 481 | import torch 482 | from simpletransformers.classification import ClassificationModel 483 | 484 | # args copied from grafzahl 485 | model_args = { 486 | "num_train_epochs": 1, # increase for multiple runs, which can yield better performance 487 | "use_multiprocessing": False, 488 | "use_multiprocessing_for_evaluation": False, 489 | "overwrite_output_dir": True, 490 | "reprocess_input_data": True, 491 | "overwrite_output_dir": True, 492 | "fp16": True, 493 | "save_steps": -1, 494 | "save_eval_checkpoints": False, 495 | "save_model_every_epoch": False, 496 | "silent": True, 497 | } 498 | 499 | os.environ["TOKENIZERS_PARALLELISM"] = "false" 500 | 501 | roberta_model = ClassificationModel(model_type="roberta", 502 | model_name="roberta-base", 503 | # Use GPU if available 504 | use_cuda=torch.cuda.is_available(), 505 | args=model_args) 506 | ``` 507 | 508 | We construct a training and test set from the movie review corpus in R: 509 | 510 | ```{r} 511 | corp_movies <- quanteda.textmodels::data_corpus_moviereviews %>% 512 | tibble(quanteda::docvars(x = .), text = .) 
513 | 514 | corp_movies %>% 515 | count(sentiment) 516 | 517 | set.seed(1) 518 | corp_movies_train <- corp_movies %>% 519 | slice_sample(prop = 0.9) 520 | 521 | corp_movies_test <- corp_movies %>% 522 | filter(!id2 %in% corp_movies_train$id2) 523 | ``` 524 | 525 | Now we can train the model on the coded training set and predict the classes for the test set (if you do not have a GPU, this will take a long time, so maybe do it after the course): 526 | 527 | ```{python} 528 | #| output: false 529 | # process data to the form simpletransformers needs 530 | train_df = r.corp_movies_train 531 | train_df['labels'] = train_df['sentiment'].astype('category').cat.codes 532 | train_df = train_df[['text', 'labels']] 533 | 534 | roberta_model.train_model(train_df) 535 | 536 | # test data needs to be a list 537 | test_l = r.corp_movies_test["text"].tolist() 538 | predictions, raw_outputs = roberta_model.predict(test_l) 539 | ``` 540 | 541 | ```{r} 542 | results <- tibble( 543 | truth = corp_movies_test$sentiment, 544 | estimate = factor(c("neg", "pos"))[py$predictions + 1] 545 | ) 546 | conf_mat <- yardstick::conf_mat(results, truth, estimate) 547 | summary(conf_mat) 548 | ``` 549 | 550 | # Example 5: Zero-Shot Classification {#example-5-zero-shot-classification} 551 | 552 | Something I learned about recently is zero-shot classification. These models do not need to be trained on new categories, but can infer category-text relationships from the data they were trained with. 553 | You can get one such model from https://huggingface.co/MoritzLaurer/xlm-v-base-mnli-xnli. 
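For a single text, the classifier used in this section returns a dictionary that pairs each candidate label with a score. The sketch below shows how the best label can be picked from such a result. It uses a hand-written dictionary with invented scores, so it runs without downloading a model; the helper `top_label` is hypothetical and not part of `transformers`.

```python
def top_label(result):
    """Return the (label, score) pair with the highest score."""
    pairs = zip(result["labels"], result["scores"])
    return max(pairs, key=lambda pair: pair[1])


# hand-written stand-in for a classifier result; the scores are invented
fake_result = {
    "sequence": "Angela Merkel ist eine Politikerin in Deutschland",
    "labels": ["politics", "economy", "entertainment", "environment"],
    "scores": [0.91, 0.05, 0.03, 0.01],
}

label, score = top_label(fake_result)
print(label, score)
```

This is the same idea as the `slice_max(scores, n = 1)` step in the R wrapper further down, just on one result instead of a data frame.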
554 | 555 | ```{python} 556 | from transformers import pipeline 557 | classifier = pipeline("zero-shot-classification", 558 | model="MoritzLaurer/xlm-v-base-mnli-xnli") 559 | 560 | sequence_to_classify = "Angela Merkel ist eine Politikerin in Deutschland und Vorsitzende der CDU" 561 | candidate_labels = ["politics", "economy", "entertainment", "environment"] 562 | output = classifier(sequence_to_classify, candidate_labels, multi_label=False) 563 | print(output) 564 | ``` 565 | 566 | ```{r} 567 | #| cache: true 568 | zero_shot_classification <- function(text, labels) { 569 | res <- py$classifier(text, labels, multi_label=FALSE) 570 | map_df(seq_along(res), function(i) { 571 | as_tibble(res[[i]]) %>% 572 | mutate(id = i) 573 | }) %>% 574 | group_by(id) %>% 575 | slice_max(scores, n = 1) 576 | } 577 | 578 | set.seed(3) 579 | test <- corp_movies_test %>% 580 | sample_n(10) 581 | 582 | pred <- zero_shot_classification( 583 | as.character(test$text), 584 | c("negative", "positive") 585 | ) 586 | 587 | results <- pred %>% 588 | ungroup() %>% 589 | mutate(estimate = factor(labels), 590 | estimate = fct_recode(estimate, 591 | neg = "negative", 592 | pos = "positive")) %>% 593 | mutate(truth = test$sentiment[1:10]) 594 | 595 | conf_mat <- yardstick::conf_mat(results, truth, estimate) 596 | summary(conf_mat) 597 | ``` 598 | 599 | # Further Learning 600 | 601 | - [Computational Analysis of Communication](https://cssbook.net/): a free book on communication science with Python and/or R with side-by-side code examples in both languages 602 | - [Doing Computational Social Science with Python: An Introduction](https://papers.ssrn.com/sol3/papers.cfm?abstract_id=2737682): a free book on social science data wrangling and analyses in Python (you can skip chapters 1-4) 603 | - [A Whirlwind Tour of Python](https://jakevdp.github.io/WhirlwindTourOfPython/): a free book with 604 | - (https://www.youtube.com/watch?v=YmcA4ODpiqA&t=3679s): 4.5h workshop introducing Python (from with some 
hints for R users sprinkled throughout the examples) 605 | - [ChatGPT](https://chat.openai.com/chat) is really good at translating/explaining Python code! 606 | 607 | # Wrap up {#wrap-up} 608 | 609 | Some information about the session. 610 | 611 | ```{r} 612 | Sys.time() 613 | sessionInfo() 614 | py_list_packages() %>% 615 | as_tibble() %>% 616 | select(-requirement) %>% 617 | print(n = Inf) 618 | ``` 619 | -------------------------------------------------------------------------------- /reticulate_workshop.Rproj: -------------------------------------------------------------------------------- 1 | Version: 1.0 2 | 3 | RestoreWorkspace: Default 4 | SaveWorkspace: Default 5 | AlwaysSaveHistory: Default 6 | 7 | EnableCodeIndexing: Yes 8 | UseSpacesForTab: Yes 9 | NumSpacesForTab: 2 10 | Encoding: UTF-8 11 | 12 | RnwWeave: Sweave 13 | LaTeX: pdfLaTeX 14 | 15 | AutoAppendNewline: Yes 16 | StripTrailingWhitespace: Yes 17 | 18 | UseNativePipeOperator: No 19 | --------------------------------------------------------------------------------