├── figures ├── anti_join.png ├── full_join.png ├── left_join.png ├── semi_join.png ├── inner_join.png ├── right_join.png ├── geom_cheatsheet.png ├── pivot_longer_wider.png └── README.md ├── data ├── README.md ├── data_prep.R ├── penguin_bills_2009.csv └── penguins.csv ├── README.md ├── NTU.R.tidyverse.intro.2h.2022.Rproj ├── .gitignore └── presentation.Rmd /figures/anti_join.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/nevrome/NTU.R.tidyverse.intro.2h.2022/master/figures/anti_join.png -------------------------------------------------------------------------------- /figures/full_join.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/nevrome/NTU.R.tidyverse.intro.2h.2022/master/figures/full_join.png -------------------------------------------------------------------------------- /figures/left_join.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/nevrome/NTU.R.tidyverse.intro.2h.2022/master/figures/left_join.png -------------------------------------------------------------------------------- /figures/semi_join.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/nevrome/NTU.R.tidyverse.intro.2h.2022/master/figures/semi_join.png -------------------------------------------------------------------------------- /figures/inner_join.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/nevrome/NTU.R.tidyverse.intro.2h.2022/master/figures/inner_join.png -------------------------------------------------------------------------------- /figures/right_join.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/nevrome/NTU.R.tidyverse.intro.2h.2022/master/figures/right_join.png -------------------------------------------------------------------------------- /data/README.md: -------------------------------------------------------------------------------- 1 | This data was taken as a random subset of the palmerpenguins R package: https://allisonhorst.github.io/palmerpenguins -------------------------------------------------------------------------------- /figures/geom_cheatsheet.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/nevrome/NTU.R.tidyverse.intro.2h.2022/master/figures/geom_cheatsheet.png -------------------------------------------------------------------------------- /figures/pivot_longer_wider.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/nevrome/NTU.R.tidyverse.intro.2h.2022/master/figures/pivot_longer_wider.png -------------------------------------------------------------------------------- /figures/README.md: -------------------------------------------------------------------------------- 1 | The join schemata are heavily inspired by [RStudio's dplyr cheatsheet](https://raw.githubusercontent.com/rstudio/cheatsheets/main/data-transformation.pdf). -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | This repository stores a 2h intro course for data analysis with R and the tidyverse. `presentation.Rmd` is an RMarkdown file with both instructions and code. 
2 | 3 | The course was forked and modified from this repository: https://github.com/nevrome/SPAAM.R.tidyverse.intro.2h.2022 4 | -------------------------------------------------------------------------------- /NTU.R.tidyverse.intro.2h.2022.Rproj: -------------------------------------------------------------------------------- 1 | Version: 1.0 2 | 3 | RestoreWorkspace: Default 4 | SaveWorkspace: Default 5 | AlwaysSaveHistory: Default 6 | 7 | EnableCodeIndexing: Yes 8 | UseSpacesForTab: Yes 9 | NumSpacesForTab: 2 10 | Encoding: UTF-8 11 | 12 | RnwWeave: Sweave 13 | LaTeX: pdfLaTeX 14 | -------------------------------------------------------------------------------- /data/data_prep.R: -------------------------------------------------------------------------------- 1 | library(magrittr) 2 | 3 | peng <- palmerpenguins::penguins 4 | 5 | peng_prepped <- peng %>% 6 | dplyr::filter(!dplyr::if_any(.fns = is.na)) %>% 7 | tibble::add_column(., id = 1:nrow(.), .before = "species") 8 | 9 | peng_prepped %>% 10 | dplyr::slice_sample(n = 300) %>% 11 | dplyr::arrange(id) %>% 12 | dplyr::select(-bill_length_mm, -bill_depth_mm) %>% 13 | readr::write_csv("data/penguins.csv") 14 | 15 | peng_prepped %>% 16 | dplyr::slice_sample(n = 300) %>% 17 | dplyr::arrange(id) %>% 18 | dplyr::select(id, bill_length_mm, bill_depth_mm) %>% 19 | readr::write_csv("data/penguin_bills_2009.csv") 20 | -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | # History files 2 | .Rhistory 3 | .Rapp.history 4 | 5 | # Session Data files 6 | .RData 7 | 8 | # User-specific files 9 | .Ruserdata 10 | 11 | # Example code in package build process 12 | *-Ex.R 13 | 14 | # Output files from R CMD build 15 | /*.tar.gz 16 | 17 | # Output files from R CMD check 18 | /*.Rcheck/ 19 | 20 | # RStudio files 21 | .Rproj.user/ 22 | 23 | # produced vignettes 24 | vignettes/*.html 25 | vignettes/*.pdf 26 | 27 | 
# OAuth2 token, see https://github.com/hadley/httr/releases/tag/v0.3 28 | .httr-oauth 29 | 30 | # knitr and R markdown default cache directories 31 | *_cache/ 32 | /cache/ 33 | 34 | # Temporary files created by R markdown 35 | *.utf8.md 36 | *.knit.md 37 | 38 | # R Environment Variables 39 | .Renviron 40 | 41 | # Rendered output 42 | *.pdf 43 | *.log 44 | 45 | -------------------------------------------------------------------------------- /data/penguin_bills_2009.csv: -------------------------------------------------------------------------------- 1 | id,bill_length_mm,bill_depth_mm 2 | 1,39.1,18.7 3 | 2,39.5,17.4 4 | 3,40.3,18 5 | 4,36.7,19.3 6 | 5,39.3,20.6 7 | 6,38.9,17.8 8 | 7,39.2,19.6 9 | 8,41.1,17.6 10 | 9,38.6,21.2 11 | 10,34.6,21.1 12 | 11,36.6,17.8 13 | 12,38.7,19 14 | 13,42.5,20.7 15 | 14,34.4,18.4 16 | 15,46,21.5 17 | 16,37.8,18.3 18 | 18,35.9,19.2 19 | 19,38.2,18.1 20 | 20,38.8,17.2 21 | 21,35.3,18.9 22 | 22,40.6,18.6 23 | 23,40.5,17.9 24 | 24,37.9,18.6 25 | 25,40.5,18.9 26 | 26,39.5,16.7 27 | 27,37.2,18.1 28 | 28,39.5,17.8 29 | 29,40.9,18.9 30 | 30,36.4,17 31 | 31,39.2,21.1 32 | 32,38.8,20 33 | 33,42.2,18.5 34 | 34,37.6,19.3 35 | 35,39.8,19.1 36 | 36,36.5,18 37 | 38,36,18.5 38 | 39,44.1,19.7 39 | 40,37,16.9 40 | 41,39.6,18.8 41 | 42,41.1,19 42 | 44,42.3,21.2 43 | 45,39.6,17.7 44 | 46,40.1,18.9 45 | 47,35,17.9 46 | 48,42,19.5 47 | 49,34.5,18.1 48 | 50,41.4,18.6 49 | 51,39,17.5 50 | 52,40.6,18.8 51 | 54,37.6,19.1 52 | 56,41.3,21.1 53 | 57,37.6,17 54 | 58,41.1,18.2 55 | 59,36.4,17.1 56 | 60,41.6,18 57 | 61,35.5,16.2 58 | 62,41.1,19.1 59 | 64,41.8,19.4 60 | 65,33.5,19 61 | 66,39.7,18.4 62 | 67,39.6,17.2 63 | 68,45.8,18.9 64 | 69,35.5,17.5 65 | 71,40.9,16.8 66 | 72,37.2,19.4 67 | 73,36.2,16.1 68 | 75,34.6,17.2 69 | 76,42.9,17.6 70 | 77,36.7,18.8 71 | 78,35.1,19.4 72 | 79,37.3,17.8 73 | 80,41.3,20.3 74 | 81,36.3,19.5 75 | 82,36.9,18.6 76 | 83,38.3,19.2 77 | 84,38.9,18.8 78 | 85,35.7,18 79 | 86,41.1,18.1 80 | 87,34,17.1 81 | 88,39.6,18.1 82 | 89,36.2,17.3 
83 | 90,40.8,18.9 84 | 91,38.1,18.6 85 | 92,40.3,18.5 86 | 93,33.1,16.1 87 | 94,43.2,18.5 88 | 96,41,20 89 | 97,37.7,16 90 | 99,37.9,18.6 91 | 100,39.7,18.9 92 | 101,38.6,17.2 93 | 102,38.2,20 94 | 103,38.1,17 95 | 104,43.2,19 96 | 106,45.6,20.3 97 | 107,39.7,17.7 98 | 108,42.2,19.5 99 | 109,39.6,20.7 100 | 110,42.7,18.3 101 | 111,38.6,17 102 | 112,37.3,20.5 103 | 114,41.1,18.6 104 | 115,36.2,17.2 105 | 116,37.7,19.8 106 | 117,40.2,17 107 | 118,41.4,18.5 108 | 119,35.2,15.9 109 | 120,40.6,19 110 | 121,38.8,17.6 111 | 122,41.5,18.3 112 | 123,39,17.1 113 | 124,44.1,18 114 | 125,38.5,17.9 115 | 126,43.1,19.2 116 | 127,36.8,18.5 117 | 128,37.5,18.5 118 | 129,38.1,17.6 119 | 130,41.1,17.5 120 | 132,40.2,20.1 121 | 133,37,16.5 122 | 134,39.7,17.9 123 | 135,40.2,17.1 124 | 136,40.6,17.2 125 | 137,32.1,15.5 126 | 138,40.7,17 127 | 139,37.3,16.8 128 | 140,39,18.7 129 | 141,39.2,18.6 130 | 142,36.6,18.4 131 | 143,36,17.8 132 | 144,37.8,18.1 133 | 145,36,17.1 134 | 146,41.5,18.5 135 | 148,50,16.3 136 | 149,48.7,14.1 137 | 150,50,15.2 138 | 151,47.6,14.5 139 | 152,46.5,13.5 140 | 153,45.4,14.6 141 | 154,46.7,15.3 142 | 155,43.3,13.4 143 | 156,46.8,15.4 144 | 157,40.9,13.7 145 | 158,49,16.1 146 | 159,45.5,13.7 147 | 161,45.8,14.6 148 | 162,49.3,15.7 149 | 163,42,13.5 150 | 164,49.2,15.2 151 | 165,46.2,14.5 152 | 166,48.7,15.1 153 | 167,50.2,14.3 154 | 168,45.1,14.5 155 | 169,46.5,14.5 156 | 170,46.3,15.8 157 | 171,42.9,13.1 158 | 172,46.1,15.1 159 | 173,47.8,15 160 | 174,48.2,14.3 161 | 175,50,15.3 162 | 176,47.3,15.3 163 | 177,42.8,14.2 164 | 178,45.1,14.5 165 | 180,49.1,14.8 166 | 181,48.4,16.3 167 | 182,42.6,13.7 168 | 183,44.4,17.3 169 | 184,44,13.6 170 | 185,48.7,15.7 171 | 186,42.7,13.7 172 | 187,49.6,16 173 | 188,45.3,13.7 174 | 189,49.6,15 175 | 190,50.5,15.9 176 | 191,43.6,13.9 177 | 192,45.5,13.9 178 | 193,50.5,15.9 179 | 195,45.2,15.8 180 | 196,46.6,14.2 181 | 198,45.1,14.4 182 | 200,46.5,14.4 183 | 201,45,15.4 184 | 202,43.8,13.9 185 | 203,45.5,15 186 | 
204,43.2,14.5 187 | 206,45.3,13.8 188 | 207,46.2,14.9 189 | 209,54.3,15.7 190 | 210,45.8,14.2 191 | 211,49.8,16.8 192 | 212,49.5,16.2 193 | 213,43.5,14.2 194 | 214,50.7,15 195 | 215,47.7,15 196 | 216,46.4,15.6 197 | 217,48.2,15.6 198 | 218,46.5,14.8 199 | 219,46.4,15 200 | 220,48.6,16 201 | 221,47.5,14.2 202 | 222,51.1,16.3 203 | 223,45.2,13.8 204 | 224,45.2,16.4 205 | 225,49.1,14.5 206 | 226,52.5,15.6 207 | 227,47.4,14.6 208 | 229,44.9,13.8 209 | 230,50.8,17.3 210 | 231,43.4,14.4 211 | 232,51.3,14.2 212 | 233,47.5,14 213 | 234,52.1,17 214 | 235,47.5,15 215 | 236,52.2,17.1 216 | 237,45.5,14.5 217 | 238,49.5,16.1 218 | 239,44.5,14.7 219 | 241,49.4,15.8 220 | 242,46.9,14.6 221 | 244,51.1,16.5 222 | 245,48.5,15 223 | 246,55.9,17 224 | 247,47.2,15.5 225 | 248,49.1,15 226 | 249,46.8,16.1 227 | 250,41.7,14.7 228 | 252,43.3,14 229 | 253,48.1,15.1 230 | 255,49.8,15.9 231 | 256,43.5,15.2 232 | 258,46.2,14.1 233 | 259,55.1,16 234 | 260,48.8,16.2 235 | 261,47.2,13.7 236 | 262,46.8,14.3 237 | 263,50.4,15.7 238 | 264,45.2,14.8 239 | 265,49.9,16.1 240 | 266,46.5,17.9 241 | 267,50,19.5 242 | 269,45.4,18.7 243 | 270,52.7,19.8 244 | 271,45.2,17.8 245 | 272,46.1,18.2 246 | 273,51.3,18.2 247 | 274,46,18.9 248 | 275,51.3,19.9 249 | 277,51.7,20.3 250 | 278,47,17.3 251 | 279,52,18.1 252 | 281,50.5,19.6 253 | 282,50.3,20 254 | 283,58,17.8 255 | 284,46.4,18.6 256 | 285,49.2,18.2 257 | 286,42.4,17.3 258 | 287,48.5,17.5 259 | 288,43.2,16.6 260 | 289,50.6,19.4 261 | 290,46.7,17.9 262 | 291,52,19 263 | 292,50.5,18.4 264 | 293,49.5,19 265 | 294,46.4,17.8 266 | 295,52.8,20 267 | 296,40.9,16.6 268 | 297,54.2,20.8 269 | 298,42.5,16.7 270 | 299,51,18.8 271 | 300,49.7,18.6 272 | 301,47.5,16.8 273 | 302,47.6,18.3 274 | 303,52,20.7 275 | 304,46.9,16.6 276 | 306,49,19.5 277 | 308,50.9,19.1 278 | 309,45.5,17 279 | 310,50.9,17.9 280 | 311,50.8,18.5 281 | 312,50.1,17.9 282 | 313,49,19.6 283 | 314,51.5,18.7 284 | 315,49.8,17.3 285 | 316,48.1,16.4 286 | 317,51.4,19 287 | 319,50.7,19.7 288 | 320,42.5,17.3 
289 | 321,52.2,18.8 290 | 322,45.2,16.6 291 | 323,49.3,19.9 292 | 324,50.2,18.8 293 | 325,45.6,19.4 294 | 326,51.9,19.5 295 | 327,46.8,16.5 296 | 328,45.7,17 297 | 329,55.8,19.8 298 | 330,43.5,18.1 299 | 331,49.6,18.2 300 | 332,50.8,19 301 | 333,50.2,18.7 302 | -------------------------------------------------------------------------------- /data/penguins.csv: -------------------------------------------------------------------------------- 1 | id,species,island,flipper_length_mm,body_mass_g,sex,year 2 | 1,Adelie,Torgersen,181,3750,male,2007 3 | 2,Adelie,Torgersen,186,3800,female,2007 4 | 4,Adelie,Torgersen,193,3450,female,2007 5 | 5,Adelie,Torgersen,190,3650,male,2007 6 | 6,Adelie,Torgersen,181,3625,female,2007 7 | 7,Adelie,Torgersen,195,4675,male,2007 8 | 8,Adelie,Torgersen,182,3200,female,2007 9 | 9,Adelie,Torgersen,191,3800,male,2007 10 | 10,Adelie,Torgersen,198,4400,male,2007 11 | 12,Adelie,Torgersen,195,3450,female,2007 12 | 13,Adelie,Torgersen,197,4500,male,2007 13 | 14,Adelie,Torgersen,184,3325,female,2007 14 | 15,Adelie,Torgersen,194,4200,male,2007 15 | 17,Adelie,Biscoe,180,3600,male,2007 16 | 18,Adelie,Biscoe,189,3800,female,2007 17 | 19,Adelie,Biscoe,185,3950,male,2007 18 | 20,Adelie,Biscoe,180,3800,male,2007 19 | 21,Adelie,Biscoe,187,3800,female,2007 20 | 22,Adelie,Biscoe,183,3550,male,2007 21 | 24,Adelie,Biscoe,172,3150,female,2007 22 | 25,Adelie,Biscoe,180,3950,male,2007 23 | 27,Adelie,Dream,178,3900,male,2007 24 | 28,Adelie,Dream,188,3300,female,2007 25 | 29,Adelie,Dream,184,3900,male,2007 26 | 30,Adelie,Dream,195,3325,female,2007 27 | 32,Adelie,Dream,190,3950,male,2007 28 | 33,Adelie,Dream,180,3550,female,2007 29 | 34,Adelie,Dream,181,3300,female,2007 30 | 35,Adelie,Dream,184,4650,male,2007 31 | 36,Adelie,Dream,182,3150,female,2007 32 | 38,Adelie,Dream,186,3100,female,2007 33 | 39,Adelie,Dream,196,4400,male,2007 34 | 40,Adelie,Dream,185,3000,female,2007 35 | 41,Adelie,Dream,190,4600,male,2007 36 | 42,Adelie,Dream,182,3425,male,2007 37 | 
44,Adelie,Dream,191,4150,male,2007 38 | 45,Adelie,Biscoe,186,3500,female,2008 39 | 46,Adelie,Biscoe,188,4300,male,2008 40 | 48,Adelie,Biscoe,200,4050,male,2008 41 | 49,Adelie,Biscoe,187,2900,female,2008 42 | 50,Adelie,Biscoe,191,3700,male,2008 43 | 51,Adelie,Biscoe,186,3550,female,2008 44 | 52,Adelie,Biscoe,193,3800,male,2008 45 | 53,Adelie,Biscoe,181,2850,female,2008 46 | 54,Adelie,Biscoe,194,3750,male,2008 47 | 55,Adelie,Biscoe,185,3150,female,2008 48 | 56,Adelie,Biscoe,195,4400,male,2008 49 | 57,Adelie,Biscoe,185,3600,female,2008 50 | 58,Adelie,Biscoe,192,4050,male,2008 51 | 59,Adelie,Biscoe,184,2850,female,2008 52 | 60,Adelie,Biscoe,192,3950,male,2008 53 | 61,Adelie,Biscoe,195,3350,female,2008 54 | 62,Adelie,Biscoe,188,4100,male,2008 55 | 63,Adelie,Torgersen,190,3050,female,2008 56 | 64,Adelie,Torgersen,198,4450,male,2008 57 | 65,Adelie,Torgersen,190,3600,female,2008 58 | 66,Adelie,Torgersen,190,3900,male,2008 59 | 67,Adelie,Torgersen,196,3550,female,2008 60 | 68,Adelie,Torgersen,197,4150,male,2008 61 | 69,Adelie,Torgersen,190,3700,female,2008 62 | 70,Adelie,Torgersen,195,4250,male,2008 63 | 71,Adelie,Torgersen,191,3700,female,2008 64 | 72,Adelie,Torgersen,184,3900,male,2008 65 | 73,Adelie,Torgersen,187,3550,female,2008 66 | 74,Adelie,Torgersen,195,4000,male,2008 67 | 76,Adelie,Torgersen,196,4700,male,2008 68 | 77,Adelie,Torgersen,187,3800,female,2008 69 | 78,Adelie,Torgersen,193,4200,male,2008 70 | 79,Adelie,Dream,191,3350,female,2008 71 | 80,Adelie,Dream,194,3550,male,2008 72 | 81,Adelie,Dream,190,3800,male,2008 73 | 82,Adelie,Dream,189,3500,female,2008 74 | 83,Adelie,Dream,189,3950,male,2008 75 | 84,Adelie,Dream,190,3600,female,2008 76 | 85,Adelie,Dream,202,3550,female,2008 77 | 86,Adelie,Dream,205,4300,male,2008 78 | 87,Adelie,Dream,185,3400,female,2008 79 | 88,Adelie,Dream,186,4450,male,2008 80 | 89,Adelie,Dream,187,3300,female,2008 81 | 90,Adelie,Dream,208,4300,male,2008 82 | 91,Adelie,Dream,190,3700,female,2008 83 | 92,Adelie,Dream,196,4350,male,2008 84 
| 94,Adelie,Dream,192,4100,male,2008 85 | 95,Adelie,Biscoe,192,3725,female,2009 86 | 97,Adelie,Biscoe,183,3075,female,2009 87 | 98,Adelie,Biscoe,190,4250,male,2009 88 | 99,Adelie,Biscoe,193,2925,female,2009 89 | 100,Adelie,Biscoe,184,3550,male,2009 90 | 101,Adelie,Biscoe,199,3750,female,2009 91 | 102,Adelie,Biscoe,190,3900,male,2009 92 | 103,Adelie,Biscoe,181,3175,female,2009 93 | 104,Adelie,Biscoe,197,4775,male,2009 94 | 105,Adelie,Biscoe,198,3825,female,2009 95 | 106,Adelie,Biscoe,191,4600,male,2009 96 | 107,Adelie,Biscoe,193,3200,female,2009 97 | 108,Adelie,Biscoe,197,4275,male,2009 98 | 109,Adelie,Biscoe,191,3900,female,2009 99 | 110,Adelie,Biscoe,196,4075,male,2009 100 | 111,Adelie,Torgersen,188,2900,female,2009 101 | 112,Adelie,Torgersen,199,3775,male,2009 102 | 113,Adelie,Torgersen,189,3350,female,2009 103 | 114,Adelie,Torgersen,189,3325,male,2009 104 | 116,Adelie,Torgersen,198,3500,male,2009 105 | 117,Adelie,Torgersen,176,3450,female,2009 106 | 118,Adelie,Torgersen,202,3875,male,2009 107 | 119,Adelie,Torgersen,186,3050,female,2009 108 | 120,Adelie,Torgersen,199,4000,male,2009 109 | 121,Adelie,Torgersen,191,3275,female,2009 110 | 122,Adelie,Torgersen,195,4300,male,2009 111 | 123,Adelie,Torgersen,191,3050,female,2009 112 | 124,Adelie,Torgersen,210,4000,male,2009 113 | 125,Adelie,Torgersen,190,3325,female,2009 114 | 126,Adelie,Torgersen,197,3500,male,2009 115 | 128,Adelie,Dream,199,4475,male,2009 116 | 129,Adelie,Dream,187,3425,female,2009 117 | 130,Adelie,Dream,190,3900,male,2009 118 | 131,Adelie,Dream,191,3175,female,2009 119 | 132,Adelie,Dream,200,3975,male,2009 120 | 133,Adelie,Dream,185,3400,female,2009 121 | 134,Adelie,Dream,193,4250,male,2009 122 | 135,Adelie,Dream,193,3400,female,2009 123 | 136,Adelie,Dream,187,3475,male,2009 124 | 137,Adelie,Dream,188,3050,female,2009 125 | 138,Adelie,Dream,190,3725,male,2009 126 | 139,Adelie,Dream,192,3000,female,2009 127 | 140,Adelie,Dream,185,3650,male,2009 128 | 141,Adelie,Dream,190,4250,male,2009 129 | 
142,Adelie,Dream,184,3475,female,2009 130 | 143,Adelie,Dream,195,3450,female,2009 131 | 144,Adelie,Dream,193,3750,male,2009 132 | 145,Adelie,Dream,187,3700,female,2009 133 | 146,Adelie,Dream,201,4000,male,2009 134 | 147,Gentoo,Biscoe,211,4500,female,2007 135 | 148,Gentoo,Biscoe,230,5700,male,2007 136 | 149,Gentoo,Biscoe,210,4450,female,2007 137 | 150,Gentoo,Biscoe,218,5700,male,2007 138 | 151,Gentoo,Biscoe,215,5400,male,2007 139 | 152,Gentoo,Biscoe,210,4550,female,2007 140 | 153,Gentoo,Biscoe,211,4800,female,2007 141 | 154,Gentoo,Biscoe,219,5200,male,2007 142 | 155,Gentoo,Biscoe,209,4400,female,2007 143 | 156,Gentoo,Biscoe,215,5150,male,2007 144 | 157,Gentoo,Biscoe,214,4650,female,2007 145 | 158,Gentoo,Biscoe,216,5550,male,2007 146 | 159,Gentoo,Biscoe,214,4650,female,2007 147 | 160,Gentoo,Biscoe,213,5850,male,2007 148 | 161,Gentoo,Biscoe,210,4200,female,2007 149 | 162,Gentoo,Biscoe,217,5850,male,2007 150 | 163,Gentoo,Biscoe,210,4150,female,2007 151 | 164,Gentoo,Biscoe,221,6300,male,2007 152 | 165,Gentoo,Biscoe,209,4800,female,2007 153 | 166,Gentoo,Biscoe,222,5350,male,2007 154 | 167,Gentoo,Biscoe,218,5700,male,2007 155 | 168,Gentoo,Biscoe,215,5000,female,2007 156 | 169,Gentoo,Biscoe,213,4400,female,2007 157 | 170,Gentoo,Biscoe,215,5050,male,2007 158 | 171,Gentoo,Biscoe,215,5000,female,2007 159 | 172,Gentoo,Biscoe,215,5100,male,2007 160 | 175,Gentoo,Biscoe,220,5550,male,2007 161 | 177,Gentoo,Biscoe,209,4700,female,2007 162 | 178,Gentoo,Biscoe,207,5050,female,2007 163 | 179,Gentoo,Biscoe,230,6050,male,2007 164 | 180,Gentoo,Biscoe,220,5150,female,2008 165 | 181,Gentoo,Biscoe,220,5400,male,2008 166 | 182,Gentoo,Biscoe,213,4950,female,2008 167 | 183,Gentoo,Biscoe,219,5250,male,2008 168 | 184,Gentoo,Biscoe,208,4350,female,2008 169 | 185,Gentoo,Biscoe,208,5350,male,2008 170 | 186,Gentoo,Biscoe,208,3950,female,2008 171 | 187,Gentoo,Biscoe,225,5700,male,2008 172 | 188,Gentoo,Biscoe,210,4300,female,2008 173 | 189,Gentoo,Biscoe,216,4750,male,2008 174 | 
190,Gentoo,Biscoe,222,5550,male,2008 175 | 191,Gentoo,Biscoe,217,4900,female,2008 176 | 192,Gentoo,Biscoe,210,4200,female,2008 177 | 193,Gentoo,Biscoe,225,5400,male,2008 178 | 194,Gentoo,Biscoe,213,5100,female,2008 179 | 195,Gentoo,Biscoe,215,5300,male,2008 180 | 196,Gentoo,Biscoe,210,4850,female,2008 181 | 197,Gentoo,Biscoe,220,5300,male,2008 182 | 198,Gentoo,Biscoe,210,4400,female,2008 183 | 199,Gentoo,Biscoe,225,5000,male,2008 184 | 200,Gentoo,Biscoe,217,4900,female,2008 185 | 201,Gentoo,Biscoe,220,5050,male,2008 186 | 202,Gentoo,Biscoe,208,4300,female,2008 187 | 203,Gentoo,Biscoe,220,5000,male,2008 188 | 205,Gentoo,Biscoe,224,5550,male,2008 189 | 206,Gentoo,Biscoe,208,4200,female,2008 190 | 207,Gentoo,Biscoe,221,5300,male,2008 191 | 208,Gentoo,Biscoe,214,4400,female,2008 192 | 209,Gentoo,Biscoe,231,5650,male,2008 193 | 210,Gentoo,Biscoe,219,4700,female,2008 194 | 211,Gentoo,Biscoe,230,5700,male,2008 195 | 213,Gentoo,Biscoe,220,4700,female,2008 196 | 214,Gentoo,Biscoe,223,5550,male,2008 197 | 215,Gentoo,Biscoe,216,4750,female,2008 198 | 216,Gentoo,Biscoe,221,5000,male,2008 199 | 217,Gentoo,Biscoe,221,5100,male,2008 200 | 218,Gentoo,Biscoe,217,5200,female,2008 201 | 219,Gentoo,Biscoe,216,4700,female,2008 202 | 220,Gentoo,Biscoe,230,5800,male,2008 203 | 221,Gentoo,Biscoe,209,4600,female,2008 204 | 222,Gentoo,Biscoe,220,6000,male,2008 205 | 223,Gentoo,Biscoe,215,4750,female,2008 206 | 224,Gentoo,Biscoe,223,5950,male,2008 207 | 225,Gentoo,Biscoe,212,4625,female,2009 208 | 226,Gentoo,Biscoe,221,5450,male,2009 209 | 227,Gentoo,Biscoe,212,4725,female,2009 210 | 230,Gentoo,Biscoe,228,5600,male,2009 211 | 231,Gentoo,Biscoe,218,4600,female,2009 212 | 232,Gentoo,Biscoe,218,5300,male,2009 213 | 233,Gentoo,Biscoe,212,4875,female,2009 214 | 234,Gentoo,Biscoe,230,5550,male,2009 215 | 235,Gentoo,Biscoe,218,4950,female,2009 216 | 236,Gentoo,Biscoe,228,5400,male,2009 217 | 237,Gentoo,Biscoe,212,4750,female,2009 218 | 238,Gentoo,Biscoe,224,5650,male,2009 219 | 
239,Gentoo,Biscoe,214,4850,female,2009 220 | 240,Gentoo,Biscoe,226,5200,male,2009 221 | 241,Gentoo,Biscoe,216,4925,male,2009 222 | 242,Gentoo,Biscoe,222,4875,female,2009 223 | 243,Gentoo,Biscoe,203,4625,female,2009 224 | 245,Gentoo,Biscoe,219,4850,female,2009 225 | 246,Gentoo,Biscoe,228,5600,male,2009 226 | 247,Gentoo,Biscoe,215,4975,female,2009 227 | 248,Gentoo,Biscoe,228,5500,male,2009 228 | 249,Gentoo,Biscoe,215,5500,male,2009 229 | 250,Gentoo,Biscoe,210,4700,female,2009 230 | 251,Gentoo,Biscoe,219,5500,male,2009 231 | 252,Gentoo,Biscoe,208,4575,female,2009 232 | 253,Gentoo,Biscoe,209,5500,male,2009 233 | 254,Gentoo,Biscoe,216,5000,female,2009 234 | 255,Gentoo,Biscoe,229,5950,male,2009 235 | 256,Gentoo,Biscoe,213,4650,female,2009 236 | 257,Gentoo,Biscoe,230,5500,male,2009 237 | 258,Gentoo,Biscoe,217,4375,female,2009 238 | 259,Gentoo,Biscoe,230,5850,male,2009 239 | 260,Gentoo,Biscoe,222,6000,male,2009 240 | 261,Gentoo,Biscoe,214,4925,female,2009 241 | 262,Gentoo,Biscoe,215,4850,female,2009 242 | 263,Gentoo,Biscoe,222,5750,male,2009 243 | 264,Gentoo,Biscoe,212,5200,female,2009 244 | 265,Gentoo,Biscoe,213,5400,male,2009 245 | 266,Chinstrap,Dream,192,3500,female,2007 246 | 267,Chinstrap,Dream,196,3900,male,2007 247 | 268,Chinstrap,Dream,193,3650,male,2007 248 | 269,Chinstrap,Dream,188,3525,female,2007 249 | 270,Chinstrap,Dream,197,3725,male,2007 250 | 271,Chinstrap,Dream,198,3950,female,2007 251 | 273,Chinstrap,Dream,197,3750,male,2007 252 | 274,Chinstrap,Dream,195,4150,female,2007 253 | 275,Chinstrap,Dream,198,3700,male,2007 254 | 279,Chinstrap,Dream,201,4050,male,2007 255 | 280,Chinstrap,Dream,190,3575,female,2007 256 | 281,Chinstrap,Dream,201,4050,male,2007 257 | 282,Chinstrap,Dream,197,3300,male,2007 258 | 283,Chinstrap,Dream,181,3700,female,2007 259 | 285,Chinstrap,Dream,195,4400,male,2007 260 | 286,Chinstrap,Dream,181,3600,female,2007 261 | 287,Chinstrap,Dream,191,3400,male,2007 262 | 288,Chinstrap,Dream,187,2900,female,2007 263 | 
289,Chinstrap,Dream,193,3800,male,2007 264 | 290,Chinstrap,Dream,195,3300,female,2007 265 | 291,Chinstrap,Dream,197,4150,male,2007 266 | 292,Chinstrap,Dream,200,3400,female,2008 267 | 293,Chinstrap,Dream,200,3800,male,2008 268 | 294,Chinstrap,Dream,191,3700,female,2008 269 | 295,Chinstrap,Dream,205,4550,male,2008 270 | 296,Chinstrap,Dream,187,3200,female,2008 271 | 298,Chinstrap,Dream,187,3350,female,2008 272 | 299,Chinstrap,Dream,203,4100,male,2008 273 | 300,Chinstrap,Dream,195,3600,male,2008 274 | 301,Chinstrap,Dream,199,3900,female,2008 275 | 302,Chinstrap,Dream,195,3850,female,2008 276 | 303,Chinstrap,Dream,210,4800,male,2008 277 | 304,Chinstrap,Dream,192,2700,female,2008 278 | 305,Chinstrap,Dream,205,4500,male,2008 279 | 306,Chinstrap,Dream,210,3950,male,2008 280 | 307,Chinstrap,Dream,187,3650,female,2008 281 | 308,Chinstrap,Dream,196,3550,male,2008 282 | 309,Chinstrap,Dream,196,3500,female,2008 283 | 310,Chinstrap,Dream,196,3675,female,2009 284 | 312,Chinstrap,Dream,190,3400,female,2009 285 | 313,Chinstrap,Dream,212,4300,male,2009 286 | 315,Chinstrap,Dream,198,3675,female,2009 287 | 317,Chinstrap,Dream,201,3950,male,2009 288 | 318,Chinstrap,Dream,193,3600,female,2009 289 | 320,Chinstrap,Dream,187,3350,female,2009 290 | 321,Chinstrap,Dream,197,3450,male,2009 291 | 322,Chinstrap,Dream,191,3250,female,2009 292 | 323,Chinstrap,Dream,203,4050,male,2009 293 | 324,Chinstrap,Dream,202,3800,male,2009 294 | 325,Chinstrap,Dream,194,3525,female,2009 295 | 326,Chinstrap,Dream,206,3950,male,2009 296 | 327,Chinstrap,Dream,189,3650,female,2009 297 | 328,Chinstrap,Dream,195,3650,female,2009 298 | 329,Chinstrap,Dream,207,4000,male,2009 299 | 330,Chinstrap,Dream,202,3400,female,2009 300 | 331,Chinstrap,Dream,193,3775,male,2009 301 | 333,Chinstrap,Dream,198,3775,female,2009 302 | -------------------------------------------------------------------------------- /presentation.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "A 
crash course on R for data analysis" 3 | author: "Clemens Schmid" 4 | date: "2022/11/18" 5 | --- 6 | 7 | ## Getting started 8 | 9 | - Open RStudio 10 | - Load the project with `File > Open Project...` 11 | - Open the file `presentation.Rmd` in RStudio 12 | 13 | # A crash course on R for data analysis 14 | 15 | ## TOC 16 | 17 | - The working environment 18 | - Loading data into tibbles 19 | - Plotting data in tibbles 20 | - Conditional queries on tibbles 21 | - Transforming and manipulating tibbles 22 | - Combining tibbles with join operations 23 | 24 | # The working environment 25 | 26 | ## R, RStudio and the tidyverse 27 | 28 | - R is a fully featured programming language, but it excels as an environment for (statistical) data analysis (https://www.r-project.org) 29 | 30 | - RStudio is an integrated development environment (IDE) for R (and other languages): (https://www.rstudio.com/products/rstudio) 31 | 32 | - The tidyverse is a collection of R packages with well-designed and consistent interfaces for the main steps of data analysis: loading, transforming and plotting data (https://www.tidyverse.org) 33 | - This introduction works with tidyverse ~v1.3.0 34 | - We will learn about `readr`, `tibble`, `ggplot2`, `dplyr`, `magrittr` and `tidyr` 35 | - `forcats` will be briefly mentioned 36 | - `purrr` and `stringr` are left out 37 | 38 | # Loading data into tibbles 39 | 40 | ## Reading data with readr 41 | 42 | - With R we usually operate on data in our computer's memory 43 | - The tidyverse provides the package `readr` to read data from text files into the memory 44 | - `readr` can read from our file system or the internet 45 | - It provides functions to read data in almost any (text) format: 46 | 47 | ```{r eval=FALSE} 48 | readr::read_csv() # .csv files -> see data/penguins.csv 49 | readr::read_tsv() # .tsv files 50 | readr::read_delim() # tabular files with an arbitrary separator 51 | readr::read_fwf() # fixed width files 52 | readr::read_lines() # read linewise 
to parse yourself 53 | ``` 54 | 55 | - `readr` automatically detects column types -- but you can also define them manually 56 | 57 | ## How does the interface of `read_csv` work? 58 | 59 | - We can learn more about a function with `?`. To open a help file: `?readr::read_csv` 60 | - `readr::read_csv` has many options to specify how to read a text file 61 | 62 | ```{r eval=FALSE} 63 | read_csv( 64 | file, # The path to the file we want to read 65 | col_names = TRUE, # Are there column names? 66 | col_types = NULL, # Which types do the columns have? NULL -> auto 67 | locale = default_locale(), # How is information encoded in this file? 68 | na = c("", "NA"), # Which values mean "no data"? 69 | trim_ws = TRUE, # Should superfluous white-spaces be removed? 70 | skip = 0, # Skip X lines at the beginning of the file 71 | n_max = Inf, # Only read X lines 72 | skip_empty_rows = TRUE, # Should empty lines be ignored? 73 | comment = "", # Which string marks comment lines to be ignored? 74 | name_repair = "unique", # How should "broken" column names be fixed? 75 | ... 76 | ) 77 | ``` 78 | 79 | ## What does `readr` produce? The `tibble`! 80 | 81 | ```{r} 82 | peng <- readr::read_csv("data/penguins.csv") 83 | ``` 84 | 85 | - The `tibble` is a "data frame", a tabular data structure with rows and columns 86 | - Unlike a simple array, each column can have a different data type 87 | 88 | ## How to look at a `tibble`?
89 | 90 | ```{r, eval=FALSE} 91 | peng # Typing the name of an object will print it to the console 92 | str(peng) # A structural overview of an object 93 | summary(peng) # A human-readable summary of an object 94 | View(peng) # RStudio's interactive data browser 95 | ``` 96 | 97 | 98 | 99 | # Plotting data in `tibble`s 100 | 101 | ## `ggplot2` and the "grammar of graphics" 102 | 103 | - `ggplot2` offers an unusual, but powerful and logical interface 104 | - The following example describes a stacked bar chart 105 | 106 | ```{r} 107 | library(ggplot2) # Loading a library to use its functions without :: 108 | ``` 109 | 110 | ```{r eval=FALSE} 111 | ggplot( # Every plot starts with a call to the ggplot() function 112 | data = peng # This function can also take the input tibble 113 | ) + # The plot consists of functions linked with + 114 | geom_bar( # "geoms" define the plot layers we want to draw 115 | mapping = aes( # The aes() function maps variables to visual properties 116 | x = island, # island -> x-axis 117 | fill = species # species -> fill color 118 | ) 119 | ) 120 | ``` 121 | 122 | - `geom_*`: data + geometry (bars) + statistical transformation (count) 123 | - `ggplot2` features many geoms: A good overview is provided by [this cheatsheet](https://www.rstudio.com/resources/cheatsheets) 124 | 125 | ## `scale`s control the behaviour of visual elements 126 | 127 | - Another plot: Boxplots of penguin weight per species 128 | 129 | ```{r} 130 | ggplot(peng) + 131 | geom_boxplot(aes(x = species, y = body_mass_g)) 132 | ``` 133 | 134 | - Let's assume we had some extreme outliers in this dataset 135 | 136 | ```{r} 137 | peng_out <- peng 138 | peng_out$body_mass_g[sample(1:nrow(peng_out), 10)] <- 100000 + 100000 * runif(10) 139 | ggplot(peng_out) + 140 | geom_boxplot(aes(x = species, y = body_mass_g)) 141 | ``` 142 | 143 | - This is hard to read, because the extreme outliers dictate the scale 144 | - A 100kg penguin is a scary thought and we would
probably remove these outliers, but let's assume they are valid observations we want to include in the plot 145 | 146 | - To mitigate this issue we can change the **scale** of different visual elements - e.g. the y-axis 147 | 148 | ```{r} 149 | ggplot(peng_out) + 150 | geom_boxplot(aes(x = species, y = body_mass_g)) + 151 | scale_y_log10() 152 | ``` 153 | 154 | - The log-scale improves readability 155 | 156 | ## Other `scale`s 157 | 158 | - (Fill) color is a visual element of the plot and its scaling can be adjusted 159 | 160 | ```{r} 161 | ggplot(peng) + 162 | geom_boxplot(aes(x = species, y = body_mass_g, fill = species)) + 163 | scale_y_log10() + 164 | scale_fill_viridis_d(option = "C") 165 | ``` 166 | 167 | ## Defining plot matrices via `facet`s 168 | 169 | - Splitting up the plot by categories into **facets** is another way to visualize more variables at once 170 | 171 | ```{r} 172 | ggplot(peng) + 173 | geom_boxplot(aes(x = species, y = body_mass_g)) + 174 | facet_wrap(~year) 175 | ``` 176 | 177 | - Unfortunately the x-axis became unreadable 178 | 179 | ## Setting purely aesthetic settings with `theme` 180 | 181 | - Aesthetic changes like this can be applied as part of the `theme` 182 | 183 | ```{r} 184 | ggplot(peng) + 185 | geom_boxplot(aes(x = species, y = body_mass_g)) + 186 | facet_wrap(~year) + 187 | theme(axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1)) 188 | ``` 189 | 190 | ## Exercise 1 191 | 192 | 1. Look at the `mtcars` dataset and read up on the meaning of its variables with the help operator `?`. `mtcars` is part of R and can be accessed simply by typing `mtcars` in the console. 193 | 194 | 2. Visualize the relationship between *Gross horsepower* and *1/4 mile time* 195 | 196 | ```{r} 197 | 198 | ``` 199 | 200 | 3. Integrate the *Number of cylinders* into your plot 201 | 202 | ```{r} 203 | 204 | ``` 205 | 206 | 207 | 208 | 209 | 210 | 211 | 212 | 213 | 214 | 215 | 216 | 217 | 218 | 219 | 220 | ## Possible solutions 1 221 | 222 | 1.
Look at the `mtcars` dataset and read up on the meaning of its variables 223 | 224 | ```{r, eval=FALSE} 225 | ?mtcars 226 | ``` 227 | 228 | 2. Visualize the relationship between *Gross horsepower (hp)* and *1/4 mile time (qsec)* 229 | 230 | ```{r, eval=FALSE} 231 | ggplot(mtcars) + geom_point(aes(x = hp, y = qsec)) 232 | ``` 233 | 234 | 3. Integrate the *Number of cylinders* into your plot 235 | 236 | ```{r, eval=FALSE} 237 | ggplot(mtcars) + geom_point(aes(x = hp, y = qsec, color = as.factor(cyl))) 238 | ``` 239 | 240 | 241 | 242 | # Conditional queries on tibbles 243 | 244 | ## Selecting columns and filtering rows with `select` and `filter` 245 | 246 | - The `dplyr` package includes powerful functions to subset data in tibbles based on conditions 247 | - `dplyr::select` selects columns 248 | 249 | ```{r} 250 | dplyr::select(peng, id, flipper_length_mm) # reduce to two columns 251 | dplyr::select(peng, -island, -flipper_length_mm) # remove two columns 252 | ``` 253 | 254 | - `dplyr::filter` allows for conditional filtering of rows 255 | 256 | ```{r} 257 | dplyr::filter(peng, year == 2007) # penguins measured in 2007 258 | dplyr::filter(peng, year == 2007 | 259 | year == 2009) # penguins measured in 2007 OR 2009 260 | dplyr::filter(peng, year %in% c(2007, 2009)) # match operator: %in% 261 | dplyr::filter(peng, species == "Adelie" & 262 | body_mass_g >= 4000) # Adelie penguins weighing at least 4kg 263 | ``` 264 | 265 | ## Chaining functions together with the pipe `%>%` 266 | 267 | - The pipe `%>%` in the `magrittr` package is a clever infix operator to chain data and operations 268 | 269 | ```{r} 270 | library(magrittr) 271 | peng %>% dplyr::filter(year == 2007) 272 | ``` 273 | 274 | - It forwards the LHS as the first argument of the function appearing on the RHS 275 | - That allows for sequences of functions ("tidyverse style") 276 | 277 | ```{r} 278 | peng %>% 279 | dplyr::select(id, species, body_mass_g) %>% 280 | dplyr::filter(species == "Adelie" & 
body_mass_g >= 4000) %>% 281 | nrow() # count the rows 282 | ``` 283 | 284 | - `magrittr` also offers some more operators, among which the extraction `%$%` is particularly useful 285 | 286 | ```{r} 287 | peng %>% 288 | dplyr::filter(island == "Biscoe") %$% 289 | species %>% # extract the species column as a vector 290 | unique() # get the unique elements of said vector 291 | ``` 292 | 293 | ## Summary statistics in `base` R 294 | 295 | - Summarising and counting data is indispensable and R offers all operations you would expect in its `base` package 296 | 297 | ```{r} 298 | chinstraps <- peng %>% dplyr::filter(species == "Chinstrap") 299 | 300 | nrow(chinstraps) # number of rows in a tibble 301 | length(chinstraps$species) # length/size of a vector 302 | unique(chinstraps$species) # unique elements of a vector 303 | 304 | min(chinstraps$body_mass_g) # minimum 305 | max(chinstraps$body_mass_g) # maximum 306 | 307 | mean(chinstraps$body_mass_g) # mean 308 | median(chinstraps$body_mass_g) # median 309 | 310 | var(chinstraps$body_mass_g) # variance 311 | sd(chinstraps$body_mass_g) # standard deviation 312 | quantile(chinstraps$body_mass_g, probs = 0.75) # weight quantiles for the given probs 313 | ``` 314 | 315 | - Many of these functions can ignore missing values with the option `na.rm = TRUE` 316 | 317 | ## Group-wise summaries with `group_by` and `summarise` 318 | 319 | - These summary statistics are particularly useful when applied to conditional subsets of a dataset 320 | - `dplyr` allows such summary operations with a combination of `group_by` and `summarise` 321 | 322 | ```{r} 323 | peng %>% 324 | dplyr::group_by(species) %>% # group the tibble by the species column 325 | dplyr::summarise( 326 | min_weight = min(body_mass_g), # a new column: min weight for each group 327 | median_weight = median(body_mass_g), # a new column: median weight for each group 328 | max_weight = max(body_mass_g) # a new column: max weight for each group 329 | ) 330 | ``` 331 | 332 | - 
Grouping can be applied across multiple columns 333 | 334 | ```{r} 335 | peng %>% 336 | dplyr::group_by(species, year) %>% # group by species and year 337 | dplyr::summarise( 338 | n = dplyr::n(), # a new column: number of penguins for each group 339 | .groups = "drop" # drop the grouping after this summary operation 340 | ) 341 | ``` 342 | 343 | ## Sorting and slicing tibbles with `arrange` and `slice` 344 | 345 | - `dplyr` can `arrange` tibbles by one or multiple columns 346 | 347 | ```{r} 348 | peng %>% dplyr::arrange(sex) # sort by sex 349 | peng %>% dplyr::arrange(sex, body_mass_g) # ... and weight 350 | peng %>% dplyr::arrange(dplyr::desc(body_mass_g)) # sort by weight in descending order 351 | ``` 352 | 353 | - Sorting can be combined with grouping and `slice` to extract extreme values per group 354 | 355 | ```{r} 356 | peng %>% 357 | dplyr::group_by(species) %>% # group by species 358 | dplyr::arrange(dplyr::desc(body_mass_g)) %>% # sort by weight (arrange ignores groups unless .by_group = TRUE) 359 | dplyr::slice_head(n = 3) %>% # keep the first three penguins per group 360 | dplyr::ungroup() # remove the still lingering grouping 361 | ``` 362 | 363 | - Slicing is also the relevant operation to take random samples from the observations in a tibble 364 | 365 | ```{r} 366 | peng %>% dplyr::slice_sample(n = 10) 367 | ``` 368 | 369 | 370 | 371 | ## Exercise 2 372 | 373 | 1. Determine the number of cars with four *forward gears* (`gear`) in the `mtcars` dataset 374 | 375 | ```{r} 376 | 377 | ``` 378 | 379 | 2. Determine the mean *1/4 mile time* (`qsec`) per *Number of cylinders* (`cyl`) group 380 | 381 | ```{r} 382 | 383 | ``` 384 | 385 | 3. Identify the least efficient cars for both *transmission types* (`am`) 386 | 387 | ```{r} 388 | 389 | ``` 390 | 391 | 392 | 393 | 394 | 395 | 396 | 397 | 398 | 399 | 400 | 401 | 402 | 403 | 404 | 405 | ## Possible solutions 2 406 | 407 | 1. 
Determine the number of cars with four *forward gears* (`gear`) in the `mtcars` dataset 408 | 409 | ```{r, eval=FALSE} 410 | mtcars %>% dplyr::filter(gear == 4) %>% nrow() 411 | ``` 412 | 413 | 2. Determine the mean *1/4 mile time* (`qsec`) per *Number of cylinders* (`cyl`) group 414 | 415 | ```{r, eval=FALSE} 416 | mtcars %>% dplyr::group_by(cyl) %>% dplyr::summarise(qsec_mean = mean(qsec)) 417 | ``` 418 | 419 | 3. Identify the least efficient cars for both *transmission types* (`am`) 420 | 421 | ```{r, eval=FALSE} 422 | #mtcars3 <- tibble::rownames_to_column(mtcars, var = "car") %>% tibble::as_tibble() 423 | mtcars %>% dplyr::group_by(am) %>% dplyr::arrange(mpg) %>% dplyr::slice_head() 424 | ``` 425 | 426 | 427 | 428 | # Transforming and manipulating tibbles 429 | 430 | ## Renaming and reordering columns and values with `rename`, `relocate` and `recode` 431 | 432 | - Columns in tibbles can be renamed with `dplyr::rename` and reordered with `dplyr::relocate` 433 | 434 | ```{r} 435 | peng %>% dplyr::rename(penguin_name = id) # rename a column 436 | peng %>% dplyr::relocate(year, .before = species) # reorder columns 437 | ``` 438 | 439 | - Values in columns can be changed with `dplyr::recode` 440 | 441 | ```{r} 442 | peng$island %>% dplyr::recode(Torgersen = "My favourite Island") 443 | ``` 444 | 445 | - R supports explicitly ordinal data with `factor`s, which can be reordered as well 446 | - `factor`s can be handled more easily with the `forcats` package 447 | 448 | ```{r, fig.show='hide'} 449 | ggplot(peng) + geom_bar(aes(x = species)) # bars are alphabetically ordered 450 | peng2 <- peng 451 | peng2$species_ordered <- forcats::fct_reorder(peng2$species, peng2$species, length) 452 | # fct_reorder: reorder the input factor by a summary statistic on another vector 453 | ggplot(peng2) + geom_bar(aes(x = species_ordered)) # bars are ordered by size 454 | ``` 455 | 456 | ## Adding columns to tibbles with `mutate` and `transmute` 457 | 458 | - A common application of 
data manipulation is adding derived columns. `dplyr` offers that with `mutate` 459 | 460 | ```{r} 461 | peng %>% 462 | dplyr::mutate( # add a column that 463 | body_mass_kg = body_mass_g/1000 # manipulates an existing column 464 | ) 465 | ``` 466 | 467 | - `dplyr::transmute` removes all columns but the newly created ones 468 | 469 | ```{r} 470 | peng %>% 471 | dplyr::transmute( 472 | id = paste("Penguin Nr.", id), # overwrite this column 473 | flipper_length_mm # select this column 474 | ) 475 | ``` 476 | 477 | - `dplyr::mutate` can also be combined with `dplyr::group_by` (instead of `dplyr::summarise`) 478 | 479 | ```{r} 480 | peng %>% 481 | dplyr::group_by(species, sex, year) %>% 482 | dplyr::mutate( # mutate does not collapse rows, unlike summarise 483 | mean_weight = mean(body_mass_g, na.rm = TRUE) 484 | ) %>% 485 | dplyr::arrange(body_mass_g) 486 | ``` 487 | 488 | - `tibble::add_column` behaves like `dplyr::mutate`, but gives more control over column position 489 | 490 | ```{r} 491 | peng %>% tibble::add_column( 492 | flipper_length_cm = .$flipper_length_mm/10, # note the . representing the LHS of the pipe 493 | .after = "flipper_length_mm" 494 | ) 495 | ``` 496 | 497 | ## Conditional operations with `ifelse` and `case_when` 498 | 499 | - `ifelse` allows implementing conditional `mutate` operations that consider information from other columns, but it quickly gets cumbersome 500 | 501 | ```{r} 502 | peng %>% dplyr::mutate( 503 | weight = ifelse( 504 | test = body_mass_g >= 4200, # is weight below or above mean weight? 
505 | yes = "above mean", 506 | no = "below mean" 507 | ) 508 | ) 509 | ``` 510 | 511 | - `dplyr::case_when` is more readable and scales much better for this application 512 | 513 | ```{r} 514 | peng %>% dplyr::mutate( 515 | weight = dplyr::case_when( 516 | body_mass_g >= 4200 ~ "above mean", # the number of conditions is arbitrary 517 | body_mass_g < 4200 ~ "below mean", 518 | TRUE ~ "unknown" # TRUE catches all remaining cases 519 | ) 520 | ) 521 | ``` 522 | 523 | ## Long and wide data formats 524 | 525 | - For different applications or to simplify certain analysis or plotting operations, data often has to be transformed from a **wide** to a **long** format or vice versa 526 | 527 | ![](figures/pivot_longer_wider.png){height=150px} 528 | 529 | - A table in **wide** format has N key columns and N value columns 530 | - A table in **long** format has N key columns, one descriptor column and one value column 531 | 532 | ## A wide dataset 533 | 534 | ```{r} 535 | carsales <- tibble::tribble( 536 | ~brand, ~`2014`, ~`2015`, ~`2016`, ~`2017`, 537 | "BMW", 20, 25, 30, 45, 538 | "VW", 67, 40, 120, 55 539 | ) 540 | ``` 541 | 542 | - Wide format becomes a problem when the columns are semantically identical. This dataset is in wide format and we cannot easily plot it 543 | - We generally prefer data in long format, although it is more verbose with more duplication. 
"Long" format data is more "tidy" 544 | 545 | ## Making a wide dataset long with `pivot_longer` 546 | 547 | ```{r} 548 | carsales_long <- carsales %>% tidyr::pivot_longer( 549 | cols = tidyselect::num_range("", range = 2014:2017), # set of columns to transform 550 | names_to = "year", # the name of the descriptor column we want 551 | names_transform = as.integer, # a transformation function to apply to the names 552 | values_to = "sales" # the name of the value column we want 553 | ) 554 | ``` 555 | 556 | ## Making a long dataset wide with `pivot_wider` 557 | 558 | ```{r} 559 | carsales_wide <- carsales_long %>% tidyr::pivot_wider( 560 | id_cols = "brand", # the set of id columns that should not be changed 561 | names_from = year, # the descriptor column with the names of the new columns 562 | values_from = sales # the value column from which the values should be extracted 563 | ) 564 | ``` 565 | 566 | - Applications of wide datasets are adjacency matrices to represent graphs, covariance matrices or other pairwise statistics 567 | - When data gets big, then wide formats can be significantly more efficient (e.g. for spatial data) 568 | 569 | 570 | 571 | ## Exercise 3 572 | 573 | 1. Move the column `gear` to the first position of the mtcars dataset 574 | 575 | ```{r} 576 | 577 | ``` 578 | 579 | 2. Make a new dataset `mtcars2` with the column `gear` and an additional column `am_v`, which encodes the *transmission type* (`am`) as either `"manual"` or `"automatic"` 580 | 581 | ```{r} 582 | 583 | ``` 584 | 585 | 3. Count the number of cars in `mtcars2` per *transmission type* (`am_v`) and *number of gears* (`gear`). Then transform the result to a wide format, with one column per *transmission type*. 586 | 587 | ```{r} 588 | 589 | ``` 590 | 591 | 592 | 593 | 594 | 595 | 596 | 597 | 598 | 599 | 600 | 601 | 602 | 603 | 604 | 605 | ## Possible solutions 3 606 | 607 | 1. 
Move the column `gear` to the first position of the mtcars dataset 608 | 609 | ```{r, eval=FALSE} 610 | mtcars %>% dplyr::relocate(gear, .before = mpg) 611 | ``` 612 | 613 | 2. Make a new dataset `mtcars2` with the column `gear` and an additional column `am_v`, which encodes the *transmission type* (`am`) as either `"manual"` or `"automatic"` 614 | 615 | ```{r, eval=FALSE} 616 | mtcars2 <- mtcars %>% dplyr::transmute( 617 | gear, am_v = dplyr::case_when(am == 0 ~ "automatic", am == 1 ~ "manual") 618 | ) 619 | ``` 620 | 621 | 3. Count the number of cars in `mtcars2` per *transmission type* (`am_v`) and *number of gears* (`gear`). Then transform the result to a wide format, with one column per *transmission type*. 622 | 623 | ```{r, eval=FALSE} 624 | mtcars2 %>% dplyr::group_by(am_v, gear) %>% dplyr::tally() %>% 625 | tidyr::pivot_wider(names_from = am_v, values_from = n) 626 | ``` 627 | 628 | 629 | 630 | # Combining tibbles with join operations 631 | 632 | ## Types of joins 633 | 634 | Joins combine two datasets x and y based on key columns 635 | 636 | - Mutating joins add columns from one dataset to the other 637 | - Left join: Take observations from x and add fitting information from y 638 | - Right join: Take observations from y and add fitting information from x 639 | - Inner join: Join the overlapping observations from x and y 640 | - Full join: Join all observations from x and y, even if information is missing 641 | - Filtering joins remove observations from x based on their presence in y 642 | - Semi join: Keep every observation in x that is in y 643 | - Anti join: Keep every observation in x that is not in y 644 | 645 | ## A second dataset 646 | 647 | - Contains additional variables for a subset of penguins 648 | - Both datasets feature 300 penguins, but only with a partial overlap of individuals 649 | 650 | ```{r echo=FALSE} 651 | bills <- readr::read_csv("data/penguin_bills_2009.csv") 652 | print(bills, n = 5) 653 | ``` 654 | 655 | ## Left join 656 | 657 | 
Take observations from x and add fitting information from y 658 | 659 | ![](figures/left_join.png){height=150px} 660 | 661 | ```{r} 662 | dplyr::left_join( 663 | x = peng, # 300 observations 664 | y = bills, # 300 observations 665 | by = "id" # the key column by which to join 666 | ) 667 | ``` 668 | 669 | - Left joins are the most common join operation: Add information from y to the main dataset x 670 | 671 | ## Right join 672 | 673 | Take observations from y and add fitting information from x 674 | 675 | ![](figures/right_join.png){height=150px} 676 | 677 | ```{r} 678 | dplyr::right_join( 679 | x = peng, # 300 observations 680 | y = bills, # 300 observations 681 | by = "id" 682 | ) %>% dplyr::arrange(id) 683 | ``` 684 | 685 | - Right joins are almost identical to left joins -- only x and y have reversed roles 686 | 687 | ## Inner join 688 | 689 | Join the overlapping observations from x and y 690 | 691 | ![](figures/inner_join.png){height=150px} 692 | 693 | ```{r} 694 | dplyr::inner_join( 695 | x = peng, # 300 observations 696 | y = bills, # 300 observations 697 | by = "id" 698 | ) 699 | ``` 700 | 701 | - Inner joins are a fast and easy way to check to which degree two datasets overlap 702 | 703 | ## Full join 704 | 705 | Join all observations from x and y, even if information is missing 706 | 707 | ![](figures/full_join.png){height=150px} 708 | 709 | ```{r} 710 | dplyr::full_join( 711 | x = peng, # 300 observations 712 | y = bills, # 300 observations 713 | by = "id" 714 | ) 715 | ``` 716 | 717 | - Full joins preserve every bit of information 718 | 719 | ## Semi join 720 | 721 | Keep every observation in x that is in y 722 | 723 | ![](figures/semi_join.png){height=150px} 724 | 725 | ```{r} 726 | dplyr::semi_join( 727 | x = peng, # 300 observations 728 | y = bills, # 300 observations 729 | by = "id" 730 | ) 731 | ``` 732 | 733 | - Semi joins are underused operations to filter datasets 734 | 735 | ## Anti join 736 | 737 | Keep every observation in x that is 
not in y 738 | 739 | ![](figures/anti_join.png){height=150px} 740 | 741 | ```{r} 742 | dplyr::anti_join( 743 | x = peng, # 300 observations 744 | y = bills, # 300 observations 745 | by = "id" 746 | ) 747 | ``` 748 | 749 | - Anti joins quickly identify incomplete datasets and missing information 750 | 751 | 752 | 753 | ## Exercise 4 754 | 755 | Consider the following additional dataset: 756 | 757 | ```{r} 758 | gear_opinions <- tibble::tibble(gear = c(3, 5), opinion = c("boring", "wow")) 759 | ``` 760 | 761 | 1. Add my opinions about gears to the `mtcars` dataset 762 | 763 | ```{r} 764 | 765 | ``` 766 | 767 | 2. Remove all cars from the dataset for which I don't have an opinion 768 | 769 | ```{r} 770 | 771 | ``` 772 | 773 | 774 | 775 | 776 | 777 | 778 | 779 | 780 | 781 | 782 | 783 | 784 | 785 | 786 | 787 | ## Possible Solutions 4 788 | 789 | 1. Add my opinions about gears to the `mtcars` dataset 790 | 791 | ```{r, eval=FALSE} 792 | dplyr::left_join(mtcars, gear_opinions, by = "gear") 793 | ``` 794 | 795 | 2. Remove all cars from the dataset for which I don't have an opinion 796 | 797 | ```{r, eval=FALSE} 798 | dplyr::semi_join(mtcars, gear_opinions, by = "gear") 799 | ``` 800 | --------------------------------------------------------------------------------