├── presentation.pdf ├── figures ├── anti_join.png ├── full_join.png ├── left_join.png ├── semi_join.png ├── inner_join.png ├── right_join.png ├── geom_cheatsheet.png ├── pivot_longer_wider.png └── README.md ├── logo ├── SPAAM-Logo-Full-Colour_ShortName.png ├── WSS-SPAAM-summerschool_logo_name.png ├── SPAAM-Logo-Full-Colour_ShortName.svg └── WSS-SPAAM-summerschool_logo_name.svg ├── README.md ├── spaam_theme ├── beamerthemespaam.sty ├── beamerouterthemespaam.sty ├── beamerfontthemespaam.sty ├── beamercolorthemespaam.sty └── beamerinnerthemespaam.sty ├── spaam_r_tidyverse_intro_2h.Rproj ├── .gitignore ├── preamble.tex └── presentation.Rmd /presentation.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/nevrome/SPAAM.R.tidyverse.intro.2h.2022/main/presentation.pdf -------------------------------------------------------------------------------- /figures/anti_join.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/nevrome/SPAAM.R.tidyverse.intro.2h.2022/main/figures/anti_join.png -------------------------------------------------------------------------------- /figures/full_join.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/nevrome/SPAAM.R.tidyverse.intro.2h.2022/main/figures/full_join.png -------------------------------------------------------------------------------- /figures/left_join.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/nevrome/SPAAM.R.tidyverse.intro.2h.2022/main/figures/left_join.png -------------------------------------------------------------------------------- /figures/semi_join.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/nevrome/SPAAM.R.tidyverse.intro.2h.2022/main/figures/semi_join.png 
-------------------------------------------------------------------------------- /figures/inner_join.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/nevrome/SPAAM.R.tidyverse.intro.2h.2022/main/figures/inner_join.png -------------------------------------------------------------------------------- /figures/right_join.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/nevrome/SPAAM.R.tidyverse.intro.2h.2022/main/figures/right_join.png -------------------------------------------------------------------------------- /figures/geom_cheatsheet.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/nevrome/SPAAM.R.tidyverse.intro.2h.2022/main/figures/geom_cheatsheet.png -------------------------------------------------------------------------------- /figures/pivot_longer_wider.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/nevrome/SPAAM.R.tidyverse.intro.2h.2022/main/figures/pivot_longer_wider.png -------------------------------------------------------------------------------- /logo/SPAAM-Logo-Full-Colour_ShortName.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/nevrome/SPAAM.R.tidyverse.intro.2h.2022/main/logo/SPAAM-Logo-Full-Colour_ShortName.png -------------------------------------------------------------------------------- /logo/WSS-SPAAM-summerschool_logo_name.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/nevrome/SPAAM.R.tidyverse.intro.2h.2022/main/logo/WSS-SPAAM-summerschool_logo_name.png -------------------------------------------------------------------------------- /figures/README.md: 
-------------------------------------------------------------------------------- 1 | The join schemata are heavily inspired by [RStudio's dplyr cheatsheet](https://raw.githubusercontent.com/rstudio/cheatsheets/main/data-transformation.pdf). -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | This repository stores a 2h intro course for data analysis with R and the tidyverse. 2 | 3 | `presentation.Rmd` is an RMarkdown file with the instructions and code. `presentation.pdf` is a rendered version. -------------------------------------------------------------------------------- /spaam_theme/beamerthemespaam.sty: -------------------------------------------------------------------------------- 1 | \mode 2 | 3 | % Requirement 4 | \RequirePackage{tikz} 5 | 6 | % Settings 7 | \useinnertheme{spaam} 8 | \useoutertheme{spaam} 9 | \usecolortheme{spaam} 10 | \usefonttheme{spaam} 11 | 12 | \mode 13 | 14 | -------------------------------------------------------------------------------- /spaam_r_tidyverse_intro_2h.Rproj: -------------------------------------------------------------------------------- 1 | Version: 1.0 2 | 3 | RestoreWorkspace: Default 4 | SaveWorkspace: Default 5 | AlwaysSaveHistory: Default 6 | 7 | EnableCodeIndexing: Yes 8 | UseSpacesForTab: Yes 9 | NumSpacesForTab: 2 10 | Encoding: UTF-8 11 | 12 | RnwWeave: Sweave 13 | LaTeX: pdfLaTeX 14 | -------------------------------------------------------------------------------- /spaam_theme/beamerouterthemespaam.sty: -------------------------------------------------------------------------------- 1 | \mode 2 | 3 | % Frame title 4 | \defbeamertemplate*{frametitle}{spaam}[1][]{ 5 | \vskip 0.2cm 6 | \begin{beamercolorbox}[wd = \paperwidth, sep = 0.5cm]{frametitle} 7 | \insertframetitle\par 8 | \end{beamercolorbox} 9 | \vskip -0.5cm 10 | } 11 | 12 | \mode 13 | 14 | 
-------------------------------------------------------------------------------- /spaam_theme/beamerfontthemespaam.sty: -------------------------------------------------------------------------------- 1 | \mode 2 | 3 | % Settings 4 | \setbeamerfont{title}{size = {\fontsize{18}{18}}} 5 | \setbeamerfont{sectiontitle}{size = {\fontsize{18}{18}}} 6 | \setbeamerfont{frametitle}{size = {\fontsize{14}{14}}} 7 | \setbeamerfont{pagenumber}{size = {\fontsize{10}{10}}} 8 | \setbeamerfont{headline}{size = {\fontsize{6}{6}},shape=\itshape} 9 | 10 | \mode 11 | 12 | 13 | -------------------------------------------------------------------------------- /spaam_theme/beamercolorthemespaam.sty: -------------------------------------------------------------------------------- 1 | \mode 2 | 3 | \definecolor{spaampurple}{RGB}{115,42,130} 4 | 5 | % Settings 6 | \setbeamercolor{title page header}{fg = black} 7 | \setbeamercolor{author}{fg = black} 8 | \setbeamercolor{sectiontitle}{fg = black} 9 | \setbeamercolor{frametitle}{fg = black} 10 | \setbeamercolor{pagenumber}{fg = gray} 11 | \setbeamercolor{item}{fg = spaampurple} 12 | \setbeamercolor{section in toc}{fg = black} 13 | \setbeamercolor{headline}{fg = gray} 14 | 15 | \mode 16 | 17 | 18 | -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | # History files 2 | .Rhistory 3 | .Rapp.history 4 | 5 | # Session Data files 6 | .RData 7 | 8 | # User-specific files 9 | .Ruserdata 10 | 11 | # Example code in package build process 12 | *-Ex.R 13 | 14 | # Output files from R CMD build 15 | /*.tar.gz 16 | 17 | # Output files from R CMD check 18 | /*.Rcheck/ 19 | 20 | # RStudio files 21 | .Rproj.user/ 22 | 23 | # produced vignettes 24 | vignettes/*.html 25 | vignettes/*.pdf 26 | 27 | # OAuth2 token, see https://github.com/hadley/httr/releases/tag/v0.3 28 | .httr-oauth 29 | 30 | # knitr and R markdown default cache directories 
31 | *_cache/ 32 | /cache/ 33 | 34 | # Temporary files created by R markdown 35 | *.utf8.md 36 | *.knit.md 37 | 38 | # R Environment Variables 39 | .Renviron 40 | 41 | # Rendered output 42 | *.pdf 43 | !presentation.pdf 44 | *.log 45 | presentation.tex 46 | 47 | -------------------------------------------------------------------------------- /preamble.tex: -------------------------------------------------------------------------------- 1 | \titlegraphic{\begin{picture}(0,0)\put(305,-90){\makebox(0,0)[rt]{\includegraphics[width=8cm]{figures/niels_bach_pyre.jpg}}}\put(200,-230){\tiny \textcopyright \ Niels Bach - http://nielsbach.blogspot.com}\end{picture}} 2 | 3 | \makeatletter 4 | \let\@@magyar@captionfix\relax 5 | \makeatother 6 | 7 | \makeatletter 8 | \def\beamer@calltheme#1#2#3{% 9 | \def\beamer@themelist{#2} 10 | \@for\beamer@themename:=\beamer@themelist\do 11 | {\usepackage[{#1}]{\beamer@themelocation/#3\beamer@themename}}} 12 | 13 | \def\usefolder#1{ 14 | \def\beamer@themelocation{#1} 15 | } 16 | \def\beamer@themelocation{} 17 | 18 | \usefolder{spaam_theme} 19 | \usetheme{spaam} 20 | 21 | \usepackage[british]{babel} 22 | \usepackage{subfig} 23 | \usepackage[justification=centering]{caption} 24 | \usepackage{textcomp} 25 | \usepackage{changepage} 26 | \usepackage[absolute,overlay]{textpos} 27 | %\usepackage[texcoord,grid,gridunit=pt,gridcolor=red!10,subgridcolor=green!10]{eso-pic} 28 | 29 | \definecolor{burialconstruction}{HTML}{00A074} 30 | \definecolor{burialtype}{HTML}{0072B1} 31 | 32 | \definecolor{niceorange}{HTML}{D65D00} 33 | \definecolor{nicegrey}{HTML}{636363} 34 | -------------------------------------------------------------------------------- /spaam_theme/beamerinnerthemespaam.sty: -------------------------------------------------------------------------------- 1 | \mode 2 | 3 | % General 4 | \setbeamertemplate{navigation symbols}{} 5 | \setbeamertemplate{blocks}[rounded][shadow = true] 6 | 7 | % Slide background 8 | \setbeamertemplate{background}{ 
9 | \begin{tikzpicture} 10 | % Title page 11 | \useasboundingbox (0,0) rectangle(\the\paperwidth,\the\paperheight); 12 | \ifnum\thepage=1\relax 13 | \node[inner sep=0pt] (whitehead) at (4,7){ 14 | \includegraphics[height = .30\textheight]{logo/SPAAM-Logo-Full-Colour_ShortName.png} 15 | }; 16 | \fi 17 | % Non title page 18 | \ifnum\thepage>1\relax 19 | % nothing here 20 | \fi 21 | \node[inner sep=0pt] (whitehead) at (0.6,0.4){ 22 | \includegraphics[height = 0.06\textheight]{logo/WSS-SPAAM-summerschool_logo_name.png} 23 | }; 24 | \end{tikzpicture} 25 | } 26 | 27 | % Header 28 | \setbeamertemplate{headline}[text line]{ 29 | \ifnum\thepage>1\relax 30 | \parbox{\linewidth}{ 31 | \vspace*{0.3cm} 32 | \hspace*{\fill}\usebeamerfont{headline} 33 | SPAAM Summer School: A crash course on R for data analysis | 2022 | Clemens Schmid | CC BY 4.0 34 | \hspace*{\fill} 35 | } 36 | \fi 37 | } 38 | 39 | % Footer 40 | \setbeamertemplate{footline}[text line]{ 41 | \ifnum\thepage>1\relax 42 | \parbox{\linewidth}{\vspace*{-15pt}\hfill\usebeamercolor[fg]{pagenumber}\usebeamerfont{pagenumber}\insertpagenumber\hspace*{-20pt}} 43 | \fi 44 | } 45 | 46 | % Title page 47 | \defbeamertemplate*{title page}{spaam}[1][]{ 48 | \vfill 49 | \begin{columns} 50 | \column{1\linewidth} 51 | \vskip3cm 52 | \begin{beamercolorbox}[wd=16cm,center,sep=8pt]{title page header} 53 | \usebeamerfont{title}\inserttitle\par 54 | \end{beamercolorbox} 55 | \vskip0.3cm 56 | \begin{beamercolorbox}[wd=16cm,center]{author} 57 | \usebeamerfont{author}\insertauthor 58 | \end{beamercolorbox} 59 | \end{columns} 60 | % \vskip0.2cm% 61 | %\begin{beamercolorbox}[wd=12cm,center,#1]{date} 62 | % \usebeamerfont{author}\insertdate% 63 | %\end{beamercolorbox} 64 | \vfill 65 | } 66 | 67 | % Items 68 | \setbeamertemplate{items}[square] 69 | 70 | % TOC 71 | \setbeamertemplate{section in toc}{ 72 | \inserttocsectionnumber. 
\inserttocsection 73 | } 74 | 75 | % Inter section slides 76 | \AtBeginSection[]{ 77 | \begin{frame} 78 | \vfill 79 | \centering 80 | \begin{beamercolorbox}[sep=8pt,center,shadow=true,rounded=true]{sectiontitle} 81 | \usebeamerfont{sectiontitle}\secname\par 82 | \end{beamercolorbox} 83 | \vfill 84 | \end{frame} 85 | } 86 | 87 | \mode 88 | 89 | 90 | -------------------------------------------------------------------------------- /logo/SPAAM-Logo-Full-Colour_ShortName.svg: -------------------------------------------------------------------------------- 1 | 2 | 21 | 39 | 41 | 43 | 44 | SPAAM-Logo-Full-Colour 46 | 50 | 54 | 58 | 62 | 66 | 70 | 74 | 77 | 81 | 84 | 88 | 91 | 95 | 98 | 102 | 105 | 107 | 108 | 110 | SPAAM-Logo-Full-Colour 111 | 112 | 113 | 114 | 115 | -------------------------------------------------------------------------------- /presentation.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "A crash course on R for data analysis" 3 | author: "Clemens Schmid" 4 | date: "2019/06/05" 5 | fontsize: 9pt 6 | output: 7 | beamer_presentation: 8 | includes: 9 | in_header: preamble.tex 10 | keep_tex: true 11 | classoption: "aspectratio=169" 12 | fig_caption: yes 13 | editor_options: 14 | chunk_output_type: console 15 | --- 16 | 17 | ```{r, echo = FALSE} 18 | # https://stackoverflow.com/questions/25646333/code-chunk-font-size-in-rmarkdown-with-knitr-and-latex 19 | def.chunk.hook <- knitr::knit_hooks$get("chunk") 20 | knitr::knit_hooks$set(chunk = function(x, options) { 21 | x <- def.chunk.hook(x, options) 22 | ifelse(options$size != "normalsize", paste0("\\", options$size,"\n\n", x, "\n\n \\normalsize"), x) 23 | }) 24 | knitr::opts_chunk$set( 25 | echo = TRUE, eval = TRUE, cache = TRUE, results = FALSE, warning = FALSE, message = FALSE, 26 | fig.align = "center", out.width = "60%", fig.dim = c(7, 4) 27 | ) 28 | ``` 29 | 30 | ## Data recovery 31 | 32 | Run the following script to recover the relevant section of 
your data directory 33 | 34 | ``` 35 | curl -s https://share.eva.mpg.de/index.php/s/dQJe7TKB8iBG6Wc/download | bash 36 | ``` 37 | 38 | If this does not work, please download the content of the following Git repository: 39 | https://github.com/nevrome/spaam_r_tidyverse_intro_2h 40 | 41 | ## Getting started for this workshop 42 | 43 | - Activate the relevant conda environment (don't forget to deactivate it later!) 44 | 45 | ``` 46 | conda activate r-python 47 | ``` 48 | 49 | - Navigate to 50 | 51 | ``` 52 | /vol/volume/3b-1-introduction-to-r-and-the-tidyverse/spaam_r_tidyverse_intro_2h 53 | ``` 54 | 55 | - Pull the latest changes in this Git repository 56 | 57 | ``` 58 | git pull 59 | ``` 60 | 61 | - Open RStudio 62 | - Load the project with `File > Open Project...` 63 | - Open the file `presentation.Rmd` in RStudio 64 | 65 | # A crash course on R for data analysis 66 | 67 | ## TOC 68 | 69 | - The working environment 70 | - Loading data into tibbles 71 | - Plotting data in tibbles 72 | - Conditional queries on tibbles 73 | - Transforming and manipulating tibbles 74 | - Combining tibbles with join operations 75 | 76 | # The working environment 77 | 78 | ## R, RStudio and the tidyverse 79 | 80 | - R is a fully featured programming language, but it excels as an environment for (statistical) data analysis (https://www.r-project.org) 81 | 82 | - RStudio is an integrated development environment (IDE) for R (and other languages): (https://www.rstudio.com/products/rstudio) 83 | 84 | - The tidyverse is a collection of R packages with well-designed and consistent interfaces for the main steps of data analysis: loading, transforming and plotting data (https://www.tidyverse.org) 85 | - This introduction works with tidyverse ~v1.3.0 86 | - We will learn about `readr`, `tibble`, `ggplot2`, `dplyr`, `magrittr` and `tidyr` 87 | - `forcats` will be briefly mentioned 88 | - `purrr` and `stringr` are left out 89 | 90 | # Loading data into tibbles 91 | 92 | ## Reading data with readr 93 
| 94 | - With R we usually operate on data in our computer's memory 95 | - The tidyverse provides the package `readr` to read data from text files into the memory 96 | - `readr` can read from our file system or the internet 97 | - It provides functions to read data in almost any (text) format: 98 | 99 | ```{r eval=FALSE} 100 | readr::read_csv() # .csv files 101 | readr::read_tsv() # .tsv files 102 | readr::read_delim() # tabular files with an arbitrary separator 103 | readr::read_fwf() # fixed width files 104 | readr::read_lines() # read linewise to parse yourself 105 | ``` 106 | 107 | - `readr` automatically detects column types -- but you can also define them manually 108 | 109 | ## How does the interface of `read_csv` work? 110 | 111 | - We can learn more about a function with `?`. To open a help file: `?readr::read_csv` 112 | - `readr::read_csv` has many options to specify how to read a text file 113 | 114 | ```{r eval=FALSE} 115 | read_csv( 116 | file, # The path to the file we want to read 117 | col_names = TRUE, # Are there column names? 118 | col_types = NULL, # Which types do the columns have? NULL -> auto 119 | locale = default_locale(), # How is information encoded in this file? 120 | na = c("", "NA"), # Which values mean "no data" 121 | trim_ws = TRUE, # Should superfluous white-spaces be removed? 122 | skip = 0, # Skip X lines at the beginning of the file 123 | n_max = Inf, # Only read X lines 124 | skip_empty_rows = TRUE, # Should empty lines be ignored? 125 | comment = "", # Should comment lines be ignored? 126 | name_repair = "unique", # How should "broken" column names be fixed 127 | ... 128 | ) 129 | ``` 130 | 131 | ## What does `readr` produce? The `tibble`! 
132 | 133 | ```{r echo = FALSE} 134 | sample_table_path <- "/vol/volume/3b-1-introduction-to-r-and-the-tidyverse/ancientmetagenome-hostassociated_samples.tsv" 135 | sample_table_url <- "https://raw.githubusercontent.com/SPAAM-community/AncientMetagenomeDir/b187df6ebd23dfeb42935fd5020cb615ead3f164/ancientmetagenome-hostassociated/samples/ancientmetagenome-hostassociated_samples.tsv" 136 | ``` 137 | 138 | ```{r} 139 | samples <- readr::read_tsv(sample_table_url) 140 | ``` 141 | 142 | - The `tibble` is a "data frame", a tabular data structure with rows and columns 143 | - Unlike a simple array, each column can have another data type 144 | 145 | ```{r results='markup'} 146 | print(samples, n = 3) 147 | ``` 148 | 149 | ## How to look at a `tibble`? 150 | 151 | ```{r, eval=FALSE} 152 | samples # Typing the name of an object will print it to the console 153 | str(samples) # A structural overview of an object 154 | summary(samples) # A human-readable summary of an object 155 | View(samples) # RStudio's interactive data browser 156 | ``` 157 | 158 | - R provides a very flexible indexing operation for `data.frame`s and `tibble`s 159 | 160 | ```{r, eval=FALSE} 161 | samples[1,1] # Access the first row and column 162 | samples[1,] # Access the first row 163 | samples[,1] # Access the first column 164 | samples[c(1,2,3),c(2,3,4)] # Access values from rows and columns 165 | samples[,-c(1,2)] # Remove the first two columns 166 | samples[,c("site_name", "material")] # Columns can be selected by name 167 | ``` 168 | 169 | - `tibble`s are mutable data structures, so their content can be overwritten 170 | 171 | ```{r, eval=FALSE} 172 | samples[1,1] <- "Cheesecake2015" # replace the first value in the first column 173 | ``` 174 | 175 | # Plotting data in `tibble`s 176 | 177 | ## `ggplot2` and the "grammar of graphics" 178 | 179 | - `ggplot2` offers an unusual, but powerful and logical interface 180 | - The following example describes a stacked bar chart 181 | 182 | ```{r} 183 | 
library(ggplot2) # Loading a library to use its functions without :: 184 | ``` 185 | 186 | ```{r eval=FALSE} 187 | ggplot( # Every plot starts with a call to the ggplot() function 188 | data = samples # This function can also take the input tibble 189 | ) + # The plot consists of functions linked with + 190 | geom_bar( # "geoms" define the plot layers we want to draw 191 | mapping = aes( # The aes() function maps variables to visual properties 192 | x = publication_year, # publication_year -> x-axis 193 | fill = community_type # community_type -> fill color 194 | ) 195 | ) 196 | ``` 197 | 198 | - `geom_*`: data + geometry (bars) + statistical transformation (count) 199 | 200 | ## `ggplot2` and the "grammar of graphics" 201 | 202 | - This is the plot described above: number of samples per community type through time 203 | 204 | ```{r} 205 | ggplot(samples) + 206 | geom_bar(aes(x = publication_year, fill = community_type)) 207 | ``` 208 | 209 | ## `ggplot2` features many geoms 210 | 211 | \centering 212 | ![](figures/geom_cheatsheet.png){width=55%} 213 | 214 | - RStudio shares helpful cheatsheets for the tidyverse and beyond: https://www.rstudio.com/resources/cheatsheets 215 | 216 | ## `scale`s control the behaviour of visual elements 217 | 218 | - Another plot: Boxplots of sample age through time 219 | 220 | ```{r} 221 | ggplot(samples) + 222 | geom_boxplot(aes(x = as.factor(publication_year), y = sample_age)) 223 | ``` 224 | 225 | - This is hard to read, because extreme outliers dictate the scale 226 | 227 | ## `scale`s control the behaviour of visual elements 228 | 229 | - We can change the **scale** of different visual elements - e.g.
the y-axis 230 | 231 | ```{r} 232 | ggplot(samples) + 233 | geom_boxplot(aes(x = as.factor(publication_year), y = sample_age)) + 234 | scale_y_log10() 235 | ``` 236 | 237 | - The log-scale improves readability 238 | 239 | ## `scale`s control the behaviour of visual elements 240 | 241 | - (Fill) color is a visual element of the plot and its scaling can be adjusted 242 | 243 | ```{r} 244 | ggplot(samples) + 245 | geom_boxplot(aes(x = as.factor(publication_year), y = sample_age, 246 | fill = as.factor(publication_year))) + 247 | scale_y_log10() + scale_fill_viridis_d(option = "C") 248 | ``` 249 | 250 | ## Defining plot matrices via `facet`s 251 | 252 | - Splitting up the plot by categories into **facets** is another way to visualize more variables at once 253 | 254 | ```{r} 255 | ggplot(samples) + 256 | geom_count(aes(x = as.factor(publication_year), y = material)) + 257 | facet_wrap(~archive) 258 | ``` 259 | 260 | - Unfortunately the x-axis became unreadable 261 | 262 | ## Setting purely aesthetic settings with `theme` 263 | 264 | - Aesthetic changes like this can be applied as part of the `theme` 265 | 266 | ```{r} 267 | ggplot(samples) + 268 | geom_count(aes(x = as.factor(publication_year), y = material)) + 269 | facet_wrap(~archive) + 270 | theme(axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1)) 271 | ``` 272 | 273 | ## Exercise 1 274 | 275 | 1. Look at the `mtcars` dataset and read up on the meaning of its variables 276 | 277 | ```{r} 278 | 279 | ``` 280 | 281 | 2. Visualize the relationship between *Gross horsepower* and *1/4 mile time* 282 | 283 | ```{r} 284 | 285 | ``` 286 | 287 | 3. Integrate the *Number of cylinders* into your plot 288 | 289 | ```{r} 290 | 291 | ``` 292 | 293 | 294 | 295 | 296 | 297 | 298 | 299 | 300 | 301 | 302 | 303 | 304 | 305 | 306 | 307 | ## Possible solutions 1 308 | 309 | 1. Look at the `mtcars` dataset and read up on the meaning of its variables 310 | 311 | ```{r, eval=FALSE} 312 | ?mtcars 313 | ``` 314 | 315 | 2. 
Visualize the relationship between *Gross horsepower* and *1/4 mile time* 316 | 317 | ```{r, eval=FALSE} 318 | ggplot(mtcars) + geom_point(aes(x = hp, y = qsec)) 319 | ``` 320 | 321 | 3. Integrate the *Number of cylinders* into your plot 322 | 323 | ```{r, eval=FALSE} 324 | ggplot(mtcars) + geom_point(aes(x = hp, y = qsec, color = as.factor(cyl))) 325 | ``` 326 | 327 | 328 | 329 | # Conditional queries on tibbles 330 | 331 | ## Selecting columns and filtering rows with `select` and `filter` 332 | 333 | - The `dplyr` package includes powerful functions to subset data in tibbles based on conditions 334 | - `dplyr::select` selects columns 335 | 336 | ```{r} 337 | dplyr::select(samples, project_name, sample_age) # reduce to two columns 338 | dplyr::select(samples, -project_name, -sample_age) # remove two columns 339 | ``` 340 | 341 | - `dplyr::filter` allows for conditional filtering of rows 342 | 343 | ```{r} 344 | dplyr::filter(samples, publication_year == 2014) # samples published in 2014 345 | dplyr::filter(samples, publication_year == 2014 | 346 | publication_year == 2018) # samples from 2014 OR 2018 347 | dplyr::filter(samples, publication_year %in% c(2014, 2018)) # match operator: %in% 348 | dplyr::filter(samples, sample_host == "Homo sapiens" & 349 | community_type == "oral") # oral samples from modern humans 350 | ``` 351 | 352 | ## Chaining functions together with the pipe `%>%` 353 | 354 | - The pipe `%>%` in the `magrittr` package is a clever infix operator to chain data and operations 355 | 356 | ```{r} 357 | library(magrittr) 358 | samples %>% dplyr::filter(publication_year == 2014) 359 | ``` 360 | 361 | - It forwards the LHS as the first argument of the function appearing on the RHS 362 | - That allows for sequences of functions ("tidyverse style") 363 | 364 | ```{r} 365 | samples %>% 366 | dplyr::select(sample_host, community_type) %>% 367 | dplyr::filter(sample_host == "Homo sapiens" & community_type == "oral") %>% 368 | nrow() # count the
rows 369 | ``` 370 | 371 | - `magrittr` also offers some more operators, among which the extraction `%$%` is particularly useful 372 | 373 | ```{r} 374 | samples %>% 375 | dplyr::filter(material == "tooth") %$% 376 | sample_age %>% # extract the sample_age column as a vector 377 | max() # get the maximum of said vector 378 | ``` 379 | 380 | ## Summary statistics in `base` R 381 | 382 | - Summarising and counting data is indispensable and R offers all operations you would expect in its `base` and `stats` packages 383 | 384 | ```{r} 385 | nrow(samples) # number of rows in a tibble 386 | length(samples$site_name) # length/size of a vector 387 | unique(samples$material) # unique elements of a vector 388 | 389 | min(samples$sample_age) # minimum 390 | max(samples$sample_age) # maximum 391 | 392 | mean(samples$sample_age) # mean 393 | median(samples$sample_age) # median 394 | 395 | var(samples$sample_age) # variance 396 | sd(samples$sample_age) # standard deviation 397 | quantile(samples$sample_age, probs = 0.75) # sample quantiles for the given probs 398 | ``` 399 | 400 | - Many of these functions can ignore missing values via the option `na.rm = TRUE` 401 | 402 | ## Group-wise summaries with `group_by` and `summarise` 403 | 404 | - These summary statistics are particularly useful when applied to conditional subsets of a dataset 405 | - `dplyr` allows such summary operations with a combination of `group_by` and `summarise` 406 | 407 | ```{r} 408 | samples %>% 409 | dplyr::group_by(material) %>% # group the tibble by the material column 410 | dplyr::summarise( 411 | min_age = min(sample_age), # a new column: min age for each group 412 | median_age = median(sample_age), # a new column: median age for each group 413 | max_age = max(sample_age) # a new column: max age for each group 414 | ) 415 | ``` 416 | 417 | - Grouping can be applied across multiple columns 418 | 419 | ```{r} 420 | samples %>% 421 | dplyr::group_by(material, sample_host) %>% # group by material and host 422 |
dplyr::summarise( 423 | n = dplyr::n(), # a new column: number of samples for each group 424 | .groups = "drop" # drop the grouping after this summary operation 425 | ) 426 | ``` 427 | 428 | ## Sorting and slicing tibbles with `arrange` and `slice` 429 | 430 | - `dplyr` can `arrange` (sort) tibbles by one or multiple columns 431 | 432 | ```{r} 433 | samples %>% dplyr::arrange(publication_year) # sort by publication year 434 | samples %>% dplyr::arrange(publication_year, 435 | sample_age) # ... and sample age 436 | samples %>% dplyr::arrange(dplyr::desc(sample_age)) # sort descending on sample age 437 | ``` 438 | 439 | - Sorting can be combined with grouping and `slice` to extract extreme values per group 440 | 441 | ```{r} 442 | samples %>% 443 | dplyr::group_by(publication_year) %>% # group by publication year 444 | dplyr::arrange(dplyr::desc(sample_age)) %>% # sort by age (arrange ignores the grouping) 445 | dplyr::slice_head(n = 2) %>% # keep the first two samples per group 446 | dplyr::ungroup() # remove the still lingering grouping 447 | ``` 448 | 449 | - Slicing is also the relevant operation to take random samples from the observations in a tibble 450 | 451 | ```{r} 452 | samples %>% dplyr::slice_sample(n = 20) 453 | ``` 454 | 455 | ## Exercise 2 456 | 457 | 1. Determine the number of cars with four *forward gears* (`gear`) in the `mtcars` dataset 458 | 459 | ```{r} 460 | 461 | ``` 462 | 463 | 2. Determine the mean *1/4 mile time* (`qsec`) per *Number of cylinders* (`cyl`) group 464 | 465 | ```{r} 466 | 467 | ``` 468 | 469 | 3. Identify the least efficient cars for both *transmission types* (`am`) 470 | 471 | ```{r} 472 | 473 | ``` 474 | 475 | 476 | 477 | 478 | 479 | 480 | 481 | 482 | 483 | 484 | 485 | 486 | 487 | 488 | 489 | ## Possible solutions 2 490 | 491 | 1. Determine the number of cars with four *forward gears* (`gear`) in the `mtcars` dataset 492 | 493 | ```{r, eval=FALSE} 494 | mtcars %>% dplyr::filter(gear == 4) %>% nrow() 495 | ``` 496 | 497 | 2.
Determine the mean *1/4 mile time* (`qsec`) per *Number of cylinders* (`cyl`) group 498 | 499 | ```{r, eval=FALSE} 500 | mtcars %>% dplyr::group_by(cyl) %>% dplyr::summarise(qsec_mean = mean(qsec)) 501 | ``` 502 | 503 | 3. Identify the least efficient cars for both *transmission types* (`am`) 504 | 505 | ```{r, eval=FALSE} 506 | #mtcars3 <- tibble::rownames_to_column(mtcars, var = "car") %>% tibble::as_tibble() 507 | mtcars %>% dplyr::group_by(am) %>% dplyr::arrange(mpg) %>% dplyr::slice_head() 508 | ``` 509 | 510 | 511 | 512 | # Transforming and manipulating tibbles 513 | 514 | ## Renaming and reordering columns and values with `rename`, `relocate` and `recode` 515 | 516 | - Columns in tibbles can be renamed with `dplyr::rename` and reordered with `dplyr::relocate` 517 | 518 | ```{r} 519 | samples %>% dplyr::rename(country = geo_loc_name) # rename a column 520 | samples %>% dplyr::relocate(site_name, .before = project_name) # reorder columns 521 | ``` 522 | 523 | - Values in columns can also be changed with `dplyr::recode` 524 | 525 | ```{r} 526 | samples$sample_host %>% dplyr::recode(`Homo sapiens` = "modern human") 527 | ``` 528 | 529 | - R supports explicitly ordinal data with `factor`s, which can be reordered as well 530 | - `factor`s can be handled more easily with the `forcats` package 531 | 532 | ```{r, fig.show='hide'} 533 | ggplot(samples) + geom_bar(aes(x = community_type)) # bars are alphabetically ordered 534 | sa2 <- samples 535 | sa2$cto <- forcats::fct_reorder(sa2$community_type, sa2$community_type, length) 536 | # fct_reorder: reorder the input factor by a summary statistic on another vector 537 | ggplot(sa2) + geom_bar(aes(x = cto)) # bars are ordered by size 538 | ``` 539 | 540 | ## Adding columns to tibbles with `mutate` and `transmute` 541 | 542 | - A common application of data manipulation is adding derived columns.
`dplyr` offers that with `mutate` 543 | 544 | ```{r} 545 | samples %>% 546 | dplyr::mutate( # add a column that 547 | archive_summary = paste0(archive, ": ", archive_accession) # combines two other 548 | ) %$% archive_summary # columns 549 | ``` 550 | 551 | - `dplyr::transmute` removes all columns but the newly created ones 552 | 553 | ```{r} 554 | samples %>% 555 | dplyr::transmute( 556 | sample_name = tolower(sample_name), # overwrite this column 557 | publication_doi # select this column 558 | ) 559 | ``` 560 | 561 | - `tibble::add_column` behaves like `dplyr::mutate`, but gives more control over column position 562 | 563 | ```{r} 564 | samples %>% tibble::add_column(., id = 1:nrow(.), .before = "project_name") 565 | ``` 566 | 567 | ## Conditional operations with `ifelse` and `case_when` 568 | 569 | - `ifelse` makes it possible to implement conditional `mutate` operations that consider information from other columns, but nested calls quickly get cumbersome 570 | 571 | ```{r} 572 | samples %>% dplyr::mutate(hemi = ifelse(latitude >= 0, "North", "South")) %$% hemi 573 | ``` 574 | 575 | ```{r} 576 | samples %>% dplyr::mutate( 577 | hemi = ifelse(is.na(latitude), "unknown", ifelse(latitude >= 0, "North", "South")) 578 | ) %$% hemi 579 | ``` 580 | 581 | - `dplyr::case_when` is a much more readable solution for this application 582 | 583 | ```{r} 584 | samples %>% dplyr::mutate( 585 | hemi = dplyr::case_when( 586 | latitude >= 0 ~ "North", 587 | latitude < 0 ~ "South", 588 | TRUE ~ "unknown" # TRUE catches all remaining cases 589 | ) 590 | ) %$% hemi 591 | ``` 592 | 593 | ## Long and wide data formats 594 | 595 | - For different applications, or to simplify certain analysis or plotting operations, data often has to be transformed from a **wide** to a **long** format or vice versa 596 | 597 | ![](figures/pivot_longer_wider.png){height=80px} 598 | 599 | - A table in **wide** format has N key columns and N value columns 600 | - A table in **long** format has N key columns, one descriptor
column and one value column

## A wide dataset

```{r}
carsales <- tibble::tribble(
  ~brand, ~`2014`, ~`2015`, ~`2016`, ~`2017`,
  "BMW",  20,      25,      30,      45,
  "VW",   67,      40,      120,     55
)
```

```{r, results="markup", echo=FALSE}
carsales
```

- Wide format becomes a problem when the columns are semantically identical. This dataset is in wide format and we cannot easily plot it
- We generally prefer data in long format, although it is more verbose with more duplication. "Long" format data is more "tidy"

## Making a wide dataset long with `pivot_longer`

```{r}
carsales_long <- carsales %>% tidyr::pivot_longer(
  cols = tidyselect::num_range("", range = 2014:2017), # set of columns to transform
  names_to = "year",            # the name of the descriptor column we want
  names_transform = as.integer, # a transformation function to apply to the names
  values_to = "sales"           # the name of the value column we want
)
```

```{r, results="markup", echo=FALSE}
carsales_long
```

## Making a long dataset wide with `pivot_wider`

```{r}
carsales_wide <- carsales_long %>% tidyr::pivot_wider(
  id_cols = "brand",  # the set of id columns that should not be changed
  names_from = year,  # the descriptor column with the names of the new columns
  values_from = sales # the value column from which the values should be extracted
)
```

```{r, results="markup", echo=FALSE}
carsales_wide
```

- Typical applications of wide datasets are adjacency matrices to represent graphs, covariance matrices or other pairwise statistics
- When data gets big, wide formats can be significantly more efficient (e.g. for spatial data)

## Exercise 3

1.
Move the column `gear` to the first position of the mtcars dataset

```{r}

```

2. Make a new dataset `mtcars2` with the column `gear` and an additional column `am_v`, which encodes the *transmission type* (`am`) as either `"manual"` or `"automatic"`

```{r}

```

3. Count the number of cars per *transmission type* (`am_v`) and *number of gears* (`gear`). Then transform the result to a wide format, with one column per *transmission type*.

```{r}

```



## Possible solutions 3

1. Move the column `gear` to the first position of the mtcars dataset

```{r, eval=FALSE}
mtcars %>% dplyr::relocate(gear, .before = mpg)
```

2. Make a new dataset `mtcars2` with the column `gear` and an additional column `am_v`, which encodes the *transmission type* (`am`) as either `"manual"` or `"automatic"`

```{r, eval=FALSE}
# transmute keeps only the columns listed here: gear and the new am_v
mtcars2 <- mtcars %>% dplyr::transmute(
  gear, am_v = dplyr::case_when(am == 0 ~ "automatic", am == 1 ~ "manual")
)
```

3. Count the number of cars in `mtcars2` per *transmission type* (`am_v`) and *number of gears* (`gear`). Then transform the result to a wide format, with one column per *transmission type*.

```{r, eval=FALSE}
mtcars2 %>% dplyr::group_by(am_v, gear) %>% dplyr::tally() %>%
  tidyr::pivot_wider(names_from = am_v, values_from = n)
```



# Combining tibbles with join operations

## Types of joins

Joins combine two datasets x and y based on key columns

- Mutating joins add columns from one dataset to the other
    - Left join: Take observations from x and add fitting information from y
    - Right join: Take observations from y and add fitting information from x
    - Inner join: Join the overlapping observations from x and y
    - Full join: Join all observations from x and y, even if information is missing
- Filtering joins remove observations from x based on their presence in y
    - Semi join: Keep every observation in x that is in y
    - Anti join: Keep every observation in x that is not in y

## A second dataset

```{r echo=FALSE}
library_table_path <- "/vol/volume/3b-1-introduction-to-r-and-the-tidyverse/ancientmetagenome-hostassociated_libraries.tsv"
library_table_url <- "https://raw.githubusercontent.com/SPAAM-community/AncientMetagenomeDir/b187df6ebd23dfeb42935fd5020cb615ead3f164/ancientmetagenome-hostassociated/libraries/ancientmetagenome-hostassociated_libraries.tsv"
```

```{r, results='markup'}
libraries <- readr::read_tsv(library_table_url)
print(libraries, n = 3)
```

## Meaningful subsets

```{r echo=FALSE}
samsub <- samples %>% dplyr::select(project_name, sample_name, sample_age)
libsub <- libraries %>% dplyr::select(project_name, sample_name, library_name, read_count)
```

```{r, results='markup'}
print(samsub, n = 3)
print(libsub, n = 3)
```

## Left join

Take observations from x and add fitting information from y

![](figures/left_join.png){height=70px}

```{r}
left <- dplyr::left_join(
  x = samsub, # 1060 observations
  y = libsub, # 1657 observations
  by = c("project_name", "sample_name") # the key columns by which to join
)
```

```{r, echo=FALSE, results='markup'}
print(left, n = 1)
```

- Left joins are the most common join operation: add information from another dataset

## Right join

Take observations from y and add fitting information from x

![](figures/right_join.png){height=70px}

```{r}
right <- dplyr::right_join(
  x = samsub, # 1060 observations
  y = libsub, # 1657 observations
  by = c("project_name", "sample_name")
)
```

```{r, echo=FALSE, results='markup'}
print(right, n = 1)
```

- Right joins are almost identical to left joins -- only x and y have reversed roles

## Inner join

Join the overlapping observations from x and y

![](figures/inner_join.png){height=70px}

```{r}
inner <- dplyr::inner_join(
  x = samsub, # 1060 observations
  y = libsub, # 1657 observations
  by = c("project_name", "sample_name")
)
```

```{r, echo=FALSE, results='markup'}
print(inner, n = 1)
```

- Inner joins are a fast and easy way to check to which degree two datasets overlap

## Full join

Join all observations from x and y, even if information is missing

![](figures/full_join.png){height=70px}

```{r}
full <- dplyr::full_join(
  x = samsub, # 1060 observations
  y = libsub, # 1657 observations
  by = c("project_name", "sample_name")
)
```

```{r, echo=FALSE, results='markup'}
print(full, n = 1)
```

- Full joins preserve every bit of information from both datasets

## Semi join

Keep every observation in x that is
in y

![](figures/semi_join.png){height=70px}

```{r}
semi <- dplyr::semi_join(
  x = samsub, # 1060 observations
  y = libsub, # 1657 observations
  by = c("project_name", "sample_name")
)
```

```{r, echo=FALSE, results='markup'}
print(semi, n = 1)
```

- Semi joins are underused operations to filter datasets

## Anti join

Keep every observation in x that is not in y

![](figures/anti_join.png){height=70px}

```{r}
anti <- dplyr::anti_join(
  x = samsub, # 1060 observations
  y = libsub, # 1657 observations
  by = c("project_name", "sample_name")
)
```

```{r, echo=FALSE, results='markup'}
print(anti, n = 1)
```

- Anti joins make it easy to quickly identify incomplete datasets and missing information

## Exercise 4

Consider the following additional dataset:

```{r}
gear_opinions <- tibble::tibble(gear = c(3, 5), opinion = c("boring", "wow"))
```

1. Add my opinions about gears to the `mtcars` dataset

```{r}

```

2. Remove all cars from the dataset for which I don't have an opinion

```{r}

```



## Possible Solutions 4

1. Add my opinions about gears to the `mtcars` dataset

```{r, eval=FALSE}
dplyr::left_join(mtcars, gear_opinions, by = "gear")
```

2.
Remove all cars from the dataset for which I don't have an opinion

```{r, eval=FALSE}
# semi_join keeps only the cars whose gear value appears in gear_opinions
dplyr::semi_join(mtcars, gear_opinions, by = "gear")
```

--------------------------------------------------------------------------------
/logo/WSS-SPAAM-summerschool_logo_name.svg:
--------------------------------------------------------------------------------