├── .gitignore ├── README.md ├── assets ├── mess_up_iris.R ├── project_files.png ├── project_screen1.png ├── project_screen2.png ├── project_screen3.png └── tidy_data.png ├── installing_software.md ├── iris.csv ├── loading_data.md ├── next_steps.md ├── plotting.Rmd ├── plotting.md ├── plotting_files └── figure-markdown_strict │ ├── unnamed-chunk-1-1.png │ └── unnamed-chunk-2-1.png ├── r_markdown.md ├── r_project.md ├── reproduciblecodeR.Rproj ├── summarising_data.Rmd ├── summarising_data.md ├── tidying_data.Rmd ├── tidying_data.md ├── tidying_output.Rmd └── tidying_output.md /.gitignore: -------------------------------------------------------------------------------- 1 | .Rproj.user 2 | .Rhistory 3 | .RData 4 | .Ruserdata 5 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Writing reproducible code in R 2 | 3 | In this session, we are going to go through the basics of writing reproducible code in R, using an RMarkdown document. 4 | 5 | The material is self-paced and includes an example analysis. It is suggested that you work through the sections in order. 
6 | 7 | * [Setting up an R project](./r_project.md) - Create a self contained project in RStudio 8 | * [Creating an RMarkdown notebook](./r_markdown.md) - Create a notebook to have all analyses and notes in one place 9 | * [Loading Data](./loading_data.md) - Getting your own data into R 10 | * [Tidying Data](./tidying_data.md) - A quick example of transforming a messy dataset into something workable 11 | * [Manipulating and summarising data](./summarising_data.md) - How to take our tidy data and create some useful summaries 12 | * [Tidying model output](./tidying_output.md) - An example of how to tidy the output from a basic statistical model 13 | * [Plotting](./plotting.md) - To finish up, how to plot a summary of a model using ggplot2 14 | * [Additional resources](./next_steps.md) - Steps to further learning 15 | 16 | My version of the R Notebook we have been working on can be found [here](https://github.com/laurajanegraham/reproducible_r). 17 | -------------------------------------------------------------------------------- /assets/mess_up_iris.R: -------------------------------------------------------------------------------- 1 | library(tidyr) 2 | library(dplyr) 3 | library(readr) 4 | 5 | iris$ID <- rep(paste0("sample", 1:50), 3) 6 | iris_mess <- gather(iris, measurement, value, -Species, -ID) %>% 7 | spread(Species, value) %>% 8 | unite(measurement, ID, measurement) 9 | 10 | write.csv(iris_mess, file="iris.csv", row.names = FALSE) 11 | 12 | 13 | -------------------------------------------------------------------------------- /assets/project_files.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/BES2016Workshop/reproduciblecodeR/97002f73e7587450107b66216555a08b1e182d9f/assets/project_files.png -------------------------------------------------------------------------------- /assets/project_screen1.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/BES2016Workshop/reproduciblecodeR/97002f73e7587450107b66216555a08b1e182d9f/assets/project_screen1.png -------------------------------------------------------------------------------- /assets/project_screen2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/BES2016Workshop/reproduciblecodeR/97002f73e7587450107b66216555a08b1e182d9f/assets/project_screen2.png -------------------------------------------------------------------------------- /assets/project_screen3.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/BES2016Workshop/reproduciblecodeR/97002f73e7587450107b66216555a08b1e182d9f/assets/project_screen3.png -------------------------------------------------------------------------------- /assets/tidy_data.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/BES2016Workshop/reproduciblecodeR/97002f73e7587450107b66216555a08b1e182d9f/assets/tidy_data.png -------------------------------------------------------------------------------- /installing_software.md: -------------------------------------------------------------------------------- 1 | # Installing the software 2 | 3 | Many users of R use it from within another free piece of software called **RStudio.** 4 | RStudio is a powerful and productive user interface for R. It’s free and open source, and works great on Windows, Mac, and Linux. 5 | 6 | RStudio's version control functionality is provided by yet another piece of software called **git**. 7 | 8 | Our first task, therefore, is to install R, RStudio and git. 9 | 10 | ### Install R 11 | 12 | Install R first. 
Downloads are available at https://cran.rstudio.com/ 13 | * Direct link for Windows https://cran.r-project.org/bin/windows/base/ 14 | * Direct link for MacOS X https://cran.r-project.org/bin/macosx/ 15 | * Direct link for Linux https://cran.r-project.org/bin/linux/ 16 | 17 | ### Install RStudio 18 | 19 | Next, install RStudio. **If you already have RStudio, make sure you have the latest version (1.0.44).** The R Notebook used in later lessons will not work in earlier versions. 20 | 21 | * Downloads are available at https://www.rstudio.com/products/rstudio/download/ 22 | 23 | ### Installing R packages 24 | 25 | We are going to be using a number of packages in the following example. To install these packages, run the following code in the R console. 26 | 27 | `install.packages(c("RCurl", "readr", "tidyr", "dplyr", "broom", "ggplot2", "cowplot"))` 28 | 29 | ### Install git 30 | 31 | Git is one of the most popular version control systems in the world. It is free and open source. 32 | 33 | * Windows & OS X: http://git-scm.com/downloads 34 | * Debian/Ubuntu: `sudo apt-get install git-core` 35 | * Fedora/RedHat: `sudo yum install git-core` 36 | 37 | To check that the installation worked, open a terminal or command prompt: 38 | 39 | **Windows** 40 | 41 | * Go to the Start menu 42 | * In the Search or Run line type **cmd** and press enter. 43 | 44 | **Mac** 45 | 46 | * Go to **Applications** -> **Utilities** -> **Terminal** 47 | 48 | Type `git version`. You should see a short message containing some version information. 49 | 50 | ### Configure git 51 | 52 | After installing git, you need to tell it who you are. Open a terminal window or command prompt (see above) and type the following: 53 | 54 | ``` 55 | git config --global user.email "you@youremail.com" 56 | git config --global user.name "Your Name" 57 | ``` 58 | 59 | On successful completion, you should see no output from these commands. 
60 | 61 | You can also configure git to use your preferred editor for commit messages, e.g. on a Mac: 62 | 63 | ``` 64 | git config --global core.editor nano 65 | ``` 66 | 67 | or on Windows: 68 | 69 | ``` 70 | git config --global core.editor notepad 71 | ``` 72 | 73 | It's a good idea to follow this step since the default editor selected by git is quite difficult to use! 74 | 75 | ### Sign up for an account on GitHub 76 | 77 | GitHub is a popular online hosting service for git repositories. It provides a useful interface for collaboration and code sharing. 78 | 79 | Create a free account on GitHub: 80 | 81 | [https://github.com/join](https://github.com/join) 82 | 83 | ***If you have an academic email account you should use it here.*** 84 | GitHub users can create an unlimited number of free, public repositories but only a limited number of private repositories. However, academic users can request access to an unlimited number of free private repositories. 85 | 86 | **Optional** 87 | 88 | If you are an academic user, sign up for free private repositories here: 89 | 90 | [https://education.github.com/discount_requests/new](https://education.github.com/discount_requests/new) 91 | 92 | ***This requires your account to be associated with an academic email address.*** 93 | 94 | It may take a while to receive the verification email for this step. Don't worry, we won't need this for the tutorial. 
-------------------------------------------------------------------------------- /iris.csv: -------------------------------------------------------------------------------- 1 | "measurement","setosa","versicolor","virginica" 2 | "sample1_Petal.Length",1.4,4.7,6 3 | "sample1_Petal.Width",0.2,1.4,2.5 4 | "sample1_Sepal.Length",5.1,7,6.3 5 | "sample1_Sepal.Width",3.5,3.2,3.3 6 | "sample10_Petal.Length",1.5,3.9,6.1 7 | "sample10_Petal.Width",0.1,1.4,2.5 8 | "sample10_Sepal.Length",4.9,5.2,7.2 9 | "sample10_Sepal.Width",3.1,2.7,3.6 10 | "sample11_Petal.Length",1.5,3.5,5.1 11 | "sample11_Petal.Width",0.2,1,2 12 | "sample11_Sepal.Length",5.4,5,6.5 13 | "sample11_Sepal.Width",3.7,2,3.2 14 | "sample12_Petal.Length",1.6,4.2,5.3 15 | "sample12_Petal.Width",0.2,1.5,1.9 16 | "sample12_Sepal.Length",4.8,5.9,6.4 17 | "sample12_Sepal.Width",3.4,3,2.7 18 | "sample13_Petal.Length",1.4,4,5.5 19 | "sample13_Petal.Width",0.1,1,2.1 20 | "sample13_Sepal.Length",4.8,6,6.8 21 | "sample13_Sepal.Width",3,2.2,3 22 | "sample14_Petal.Length",1.1,4.7,5 23 | "sample14_Petal.Width",0.1,1.4,2 24 | "sample14_Sepal.Length",4.3,6.1,5.7 25 | "sample14_Sepal.Width",3,2.9,2.5 26 | "sample15_Petal.Length",1.2,3.6,5.1 27 | "sample15_Petal.Width",0.2,1.3,2.4 28 | "sample15_Sepal.Length",5.8,5.6,5.8 29 | "sample15_Sepal.Width",4,2.9,2.8 30 | "sample16_Petal.Length",1.5,4.4,5.3 31 | "sample16_Petal.Width",0.4,1.4,2.3 32 | "sample16_Sepal.Length",5.7,6.7,6.4 33 | "sample16_Sepal.Width",4.4,3.1,3.2 34 | "sample17_Petal.Length",1.3,4.5,5.5 35 | "sample17_Petal.Width",0.4,1.5,1.8 36 | "sample17_Sepal.Length",5.4,5.6,6.5 37 | "sample17_Sepal.Width",3.9,3,3 38 | "sample18_Petal.Length",1.4,4.1,6.7 39 | "sample18_Petal.Width",0.3,1,2.2 40 | "sample18_Sepal.Length",5.1,5.8,7.7 41 | "sample18_Sepal.Width",3.5,2.7,3.8 42 | "sample19_Petal.Length",1.7,4.5,6.9 43 | "sample19_Petal.Width",0.3,1.5,2.3 44 | "sample19_Sepal.Length",5.7,6.2,7.7 45 | "sample19_Sepal.Width",3.8,2.2,2.6 46 | "sample2_Petal.Length",1.4,4.5,5.1 47 
| "sample2_Petal.Width",0.2,1.5,1.9 48 | "sample2_Sepal.Length",4.9,6.4,5.8 49 | "sample2_Sepal.Width",3,3.2,2.7 50 | "sample20_Petal.Length",1.5,3.9,5 51 | "sample20_Petal.Width",0.3,1.1,1.5 52 | "sample20_Sepal.Length",5.1,5.6,6 53 | "sample20_Sepal.Width",3.8,2.5,2.2 54 | "sample21_Petal.Length",1.7,4.8,5.7 55 | "sample21_Petal.Width",0.2,1.8,2.3 56 | "sample21_Sepal.Length",5.4,5.9,6.9 57 | "sample21_Sepal.Width",3.4,3.2,3.2 58 | "sample22_Petal.Length",1.5,4,4.9 59 | "sample22_Petal.Width",0.4,1.3,2 60 | "sample22_Sepal.Length",5.1,6.1,5.6 61 | "sample22_Sepal.Width",3.7,2.8,2.8 62 | "sample23_Petal.Length",1,4.9,6.7 63 | "sample23_Petal.Width",0.2,1.5,2 64 | "sample23_Sepal.Length",4.6,6.3,7.7 65 | "sample23_Sepal.Width",3.6,2.5,2.8 66 | "sample24_Petal.Length",1.7,4.7,4.9 67 | "sample24_Petal.Width",0.5,1.2,1.8 68 | "sample24_Sepal.Length",5.1,6.1,6.3 69 | "sample24_Sepal.Width",3.3,2.8,2.7 70 | "sample25_Petal.Length",1.9,4.3,5.7 71 | "sample25_Petal.Width",0.2,1.3,2.1 72 | "sample25_Sepal.Length",4.8,6.4,6.7 73 | "sample25_Sepal.Width",3.4,2.9,3.3 74 | "sample26_Petal.Length",1.6,4.4,6 75 | "sample26_Petal.Width",0.2,1.4,1.8 76 | "sample26_Sepal.Length",5,6.6,7.2 77 | "sample26_Sepal.Width",3,3,3.2 78 | "sample27_Petal.Length",1.6,4.8,4.8 79 | "sample27_Petal.Width",0.4,1.4,1.8 80 | "sample27_Sepal.Length",5,6.8,6.2 81 | "sample27_Sepal.Width",3.4,2.8,2.8 82 | "sample28_Petal.Length",1.5,5,4.9 83 | "sample28_Petal.Width",0.2,1.7,1.8 84 | "sample28_Sepal.Length",5.2,6.7,6.1 85 | "sample28_Sepal.Width",3.5,3,3 86 | "sample29_Petal.Length",1.4,4.5,5.6 87 | "sample29_Petal.Width",0.2,1.5,2.1 88 | "sample29_Sepal.Length",5.2,6,6.4 89 | "sample29_Sepal.Width",3.4,2.9,2.8 90 | "sample3_Petal.Length",1.3,4.9,5.9 91 | "sample3_Petal.Width",0.2,1.5,2.1 92 | "sample3_Sepal.Length",4.7,6.9,7.1 93 | "sample3_Sepal.Width",3.2,3.1,3 94 | "sample30_Petal.Length",1.6,3.5,5.8 95 | "sample30_Petal.Width",0.2,1,1.6 96 | "sample30_Sepal.Length",4.7,5.7,7.2 97 | 
"sample30_Sepal.Width",3.2,2.6,3 98 | "sample31_Petal.Length",1.6,3.8,6.1 99 | "sample31_Petal.Width",0.2,1.1,1.9 100 | "sample31_Sepal.Length",4.8,5.5,7.4 101 | "sample31_Sepal.Width",3.1,2.4,2.8 102 | "sample32_Petal.Length",1.5,3.7,6.4 103 | "sample32_Petal.Width",0.4,1,2 104 | "sample32_Sepal.Length",5.4,5.5,7.9 105 | "sample32_Sepal.Width",3.4,2.4,3.8 106 | "sample33_Petal.Length",1.5,3.9,5.6 107 | "sample33_Petal.Width",0.1,1.2,2.2 108 | "sample33_Sepal.Length",5.2,5.8,6.4 109 | "sample33_Sepal.Width",4.1,2.7,2.8 110 | "sample34_Petal.Length",1.4,5.1,5.1 111 | "sample34_Petal.Width",0.2,1.6,1.5 112 | "sample34_Sepal.Length",5.5,6,6.3 113 | "sample34_Sepal.Width",4.2,2.7,2.8 114 | "sample35_Petal.Length",1.5,4.5,5.6 115 | "sample35_Petal.Width",0.2,1.5,1.4 116 | "sample35_Sepal.Length",4.9,5.4,6.1 117 | "sample35_Sepal.Width",3.1,3,2.6 118 | "sample36_Petal.Length",1.2,4.5,6.1 119 | "sample36_Petal.Width",0.2,1.6,2.3 120 | "sample36_Sepal.Length",5,6,7.7 121 | "sample36_Sepal.Width",3.2,3.4,3 122 | "sample37_Petal.Length",1.3,4.7,5.6 123 | "sample37_Petal.Width",0.2,1.5,2.4 124 | "sample37_Sepal.Length",5.5,6.7,6.3 125 | "sample37_Sepal.Width",3.5,3.1,3.4 126 | "sample38_Petal.Length",1.4,4.4,5.5 127 | "sample38_Petal.Width",0.1,1.3,1.8 128 | "sample38_Sepal.Length",4.9,6.3,6.4 129 | "sample38_Sepal.Width",3.6,2.3,3.1 130 | "sample39_Petal.Length",1.3,4.1,4.8 131 | "sample39_Petal.Width",0.2,1.3,1.8 132 | "sample39_Sepal.Length",4.4,5.6,6 133 | "sample39_Sepal.Width",3,3,3 134 | "sample4_Petal.Length",1.5,4,5.6 135 | "sample4_Petal.Width",0.2,1.3,1.8 136 | "sample4_Sepal.Length",4.6,5.5,6.3 137 | "sample4_Sepal.Width",3.1,2.3,2.9 138 | "sample40_Petal.Length",1.5,4,5.4 139 | "sample40_Petal.Width",0.2,1.3,2.1 140 | "sample40_Sepal.Length",5.1,5.5,6.9 141 | "sample40_Sepal.Width",3.4,2.5,3.1 142 | "sample41_Petal.Length",1.3,4.4,5.6 143 | "sample41_Petal.Width",0.3,1.2,2.4 144 | "sample41_Sepal.Length",5,5.5,6.7 145 | "sample41_Sepal.Width",3.5,2.6,3.1 146 | 
"sample42_Petal.Length",1.3,4.6,5.1 147 | "sample42_Petal.Width",0.3,1.4,2.3 148 | "sample42_Sepal.Length",4.5,6.1,6.9 149 | "sample42_Sepal.Width",2.3,3,3.1 150 | "sample43_Petal.Length",1.3,4,5.1 151 | "sample43_Petal.Width",0.2,1.2,1.9 152 | "sample43_Sepal.Length",4.4,5.8,5.8 153 | "sample43_Sepal.Width",3.2,2.6,2.7 154 | "sample44_Petal.Length",1.6,3.3,5.9 155 | "sample44_Petal.Width",0.6,1,2.3 156 | "sample44_Sepal.Length",5,5,6.8 157 | "sample44_Sepal.Width",3.5,2.3,3.2 158 | "sample45_Petal.Length",1.9,4.2,5.7 159 | "sample45_Petal.Width",0.4,1.3,2.5 160 | "sample45_Sepal.Length",5.1,5.6,6.7 161 | "sample45_Sepal.Width",3.8,2.7,3.3 162 | "sample46_Petal.Length",1.4,4.2,5.2 163 | "sample46_Petal.Width",0.3,1.2,2.3 164 | "sample46_Sepal.Length",4.8,5.7,6.7 165 | "sample46_Sepal.Width",3,3,3 166 | "sample47_Petal.Length",1.6,4.2,5 167 | "sample47_Petal.Width",0.2,1.3,1.9 168 | "sample47_Sepal.Length",5.1,5.7,6.3 169 | "sample47_Sepal.Width",3.8,2.9,2.5 170 | "sample48_Petal.Length",1.4,4.3,5.2 171 | "sample48_Petal.Width",0.2,1.3,2 172 | "sample48_Sepal.Length",4.6,6.2,6.5 173 | "sample48_Sepal.Width",3.2,2.9,3 174 | "sample49_Petal.Length",1.5,3,5.4 175 | "sample49_Petal.Width",0.2,1.1,2.3 176 | "sample49_Sepal.Length",5.3,5.1,6.2 177 | "sample49_Sepal.Width",3.7,2.5,3.4 178 | "sample5_Petal.Length",1.4,4.6,5.8 179 | "sample5_Petal.Width",0.2,1.5,2.2 180 | "sample5_Sepal.Length",5,6.5,6.5 181 | "sample5_Sepal.Width",3.6,2.8,3 182 | "sample50_Petal.Length",1.4,4.1,5.1 183 | "sample50_Petal.Width",0.2,1.3,1.8 184 | "sample50_Sepal.Length",5,5.7,5.9 185 | "sample50_Sepal.Width",3.3,2.8,3 186 | "sample6_Petal.Length",1.7,4.5,6.6 187 | "sample6_Petal.Width",0.4,1.3,2.1 188 | "sample6_Sepal.Length",5.4,5.7,7.6 189 | "sample6_Sepal.Width",3.9,2.8,3 190 | "sample7_Petal.Length",1.4,4.7,4.5 191 | "sample7_Petal.Width",0.3,1.6,1.7 192 | "sample7_Sepal.Length",4.6,6.3,4.9 193 | "sample7_Sepal.Width",3.4,3.3,2.5 194 | "sample8_Petal.Length",1.5,3.3,6.3 195 | 
"sample8_Petal.Width",0.2,1,1.8 196 | "sample8_Sepal.Length",5,4.9,7.3 197 | "sample8_Sepal.Width",3.4,2.4,2.9 198 | "sample9_Petal.Length",1.4,4.6,5.8 199 | "sample9_Petal.Width",0.2,1.3,1.8 200 | "sample9_Sepal.Length",4.4,6.6,6.7 201 | "sample9_Sepal.Width",2.9,2.9,2.5 202 | -------------------------------------------------------------------------------- /loading_data.md: -------------------------------------------------------------------------------- 1 | # Loading Data 2 | 3 | R packages exist to load in pretty much any form of data you can think of. Some key examples include: 4 | 5 | - [readr](https://cran.r-project.org/web/packages/readr/README.html) tends to work faster and have more functionality for flat files (.csv, .txt) than base R (useful for big files) 6 | - [readxl](https://blog.rstudio.org/2015/04/15/readxl-0-1-0/) for Excel spreadsheets 7 | - [RODBC](https://cran.r-project.org/web/packages/RODBC/RODBC.pdf) for many types of database including Access 8 | - [RPostgreSQL](https://www.r-bloggers.com/getting-started-with-postgresql-in-r/) for PostgreSQL databases 9 | - [googlesheets](https://cran.r-project.org/web/packages/googlesheets/googlesheets.pdf) to interface with Google sheets 10 | - [raster](https://cran.r-project.org/web/packages/raster/raster.pdf) and [rgdal](https://cran.r-project.org/web/packages/rgdal/rgdal.pdf) for spatial data 11 | - [RCurl](https://cran.r-project.org/web/packages/RCurl/RCurl.pdf) contains functions to fetch data from webpages (along with lots more functionality for interfacing with webpages) 12 | 13 | To load these packages into your R session use `library()` e.g. `library(RCurl)` 14 | 15 | > ### Challenge 16 | > 17 | > In a new code chunk in your R Notebook, download [iris.csv](https://raw.githubusercontent.com/BES2016Workshop/reproduciblecodeR/master/iris.csv) using `getURL()` from the RCurl package, read into R using `read.csv()` and assign to the object name `iris`. 
18 | > 19 | > **HINT** Use `df <- read.csv(text = getURL("/url/of/file"))` to read straight into R from webpage (replace `/url/of/file` with location of file). 20 | 21 | **Next:** [Tidying Data](./tidying_data.md) -------------------------------------------------------------------------------- /next_steps.md: -------------------------------------------------------------------------------- 1 | # Additional resources 2 | 3 | I hope that you’ve found this guide to making your research more reproducible with R helpful. There are plenty of links throughout to learn more about the packages I've talked about. If you’d like to learn more about R or about reproducibility, I’d highly recommend the following resources: 4 | 5 | - [RStudio cheatsheets](https://www.rstudio.com/resources/cheatsheets/): includes ggplot2, RMarkdown, dplyr, tidyr and more 6 | 7 | - [Swirl](http://swirlstats.com/): tutorials for tidyr, dplyr and much more directly in the R console 8 | 9 | - [Python pandas comparison with R](http://pandas.pydata.org/pandas-docs/stable/comparison_with_r.html): Python can be quicker than R, and so is particularly useful when you have a large dataset. This website provides a more detailed look at the R language and its many third party libraries as they relate to the python pandas library 10 | 11 | - [Software Carpentry lessons](http://software-carpentry.org/lessons/): freely available lessons taught on the Software Carpentry courses. To host or run a workshop also see this site. 12 | 13 | - [Reproducible Research on Coursera](https://www.coursera.org/learn/reproducible-research): taught by Roger Peng, Jeff Leek and Brian Caffo at Johns Hopkins University -------------------------------------------------------------------------------- /plotting.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | output: md_document 3 | --- 4 | # Plotting 5 | 6 | Finally, we want to plot our data to summarise the model from the previous step. 
[ggplot2](https://cran.r-project.org/web/packages/ggplot2/ggplot2.pdf) is designed to work with tidy data formats and is based on the idea of the [grammar of graphics](https://ramnathv.github.io/pycon2014-r/visualize/ggplot2.html). This concept makes building up graphs from very simple to complex quite straightforward by adding additional layers. However, ggplot2 does have some less than ideal formatting like a grey gridded background. The [cowplot]() package overrides some of these settings to make publication quality plots. Cowplot also has some nice functionality for arranging plots. The [R graphics cookbook](http://www.cookbook-r.com/Graphs/) provides some helpful tutorials for building up plots using ggplot2. 7 | 8 | The features that I find most useful in ggplot2 are: 9 | 10 | - Build up plots layer-by-layer 11 | - Can use `facet_wrap()` and `facet_grid()` to create separate plots by a factor in the dataframe 12 | 13 | Let's make a plot of the `mtcars` model from the previous step: 14 | 15 | ```{r} 16 | data(mtcars) 17 | library(ggplot2) 18 | library(cowplot) 19 | 20 | p <- ggplot(mtcars, aes(x = wt, y = mpg)) + 21 | geom_point() + 22 | geom_smooth(method = "lm") 23 | 24 | p 25 | ``` 26 | 27 | We can then use `facet_wrap()` to get a separate plot for each number of cylinders: 28 | 29 | ```{r} 30 | p <- p + facet_wrap(~cyl) 31 | 32 | p 33 | ``` 34 | 35 | The `aes()` part of the call to `ggplot()` allows us to set the aesthetics of the plot, for example the `colour`, based on variables in the dataframe. 36 | 37 | > ### Challenge 38 | > 39 | > In a new code chunk in your R Notebook, load ggplot2 using `library(ggplot2)` and make a plot of the linear model created in the previous step. Colour the points by species name. 40 | > 41 | > **HINT** Loading the cowplot package will change the look of the plots to be more suitable for publication. 
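As a minimal sketch of mapping an aesthetic to a variable, the `mtcars` plot above could colour its points by number of cylinders. The object name `p_colour` and the choice of `factor(cyl)` are assumptions made for illustration only; wrapping `cyl` in `factor()` asks ggplot2 for a discrete colour scale rather than a continuous gradient.

```r
library(ggplot2)

data(mtcars)

# Map the colour aesthetic to the number of cylinders.
# factor() treats cyl as categorical, so each value gets its own colour.
p_colour <- ggplot(mtcars, aes(x = wt, y = mpg, colour = factor(cyl))) +
  geom_point()

# Printing the object draws the plot
p_colour
```

Because the mapping sits inside `aes()`, ggplot2 also builds the legend automatically.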
42 | 43 | **Next:** [Additional resources](./next_steps.md) 44 | 45 | -------------------------------------------------------------------------------- /plotting.md: -------------------------------------------------------------------------------- 1 | Plotting 2 | ======== 3 | 4 | Finally, we want to plot our data to summarise the model from the 5 | previous step. 6 | [ggplot2](https://cran.r-project.org/web/packages/ggplot2/ggplot2.pdf) 7 | is designed to work with tidy data formats and is based on the idea of 8 | the [grammar of 9 | graphics](https://ramnathv.github.io/pycon2014-r/visualize/ggplot2.html). 10 | This concept makes building up graphs from very simple to complex quite 11 | straightforward by adding additional layers. However, ggplot2 does have 12 | some less than ideal formatting like a grey gridded background. The 13 | [cowplot]() package overrides some of these settings to make publication 14 | quality plots. Cowplot also has some nice functionality for arranging 15 | plots. The [R graphics cookbook](http://www.cookbook-r.com/Graphs/) 16 | provides some helpful tutorials for building up plots using ggplot2. 
17 | 18 | The features that I find most useful in ggplot2 are: 19 | 20 | - Build up plots layer-by-layer 21 | - Can use `facet_wrap()` and `facet_grid()` to create separate plots 22 | by a factor in the dataframe 23 | 24 | Let's make a plot of the `mtcars` model from the previous step: 25 | 26 | data(mtcars) 27 | library(ggplot2) 28 | 29 | ## Warning: package 'ggplot2' was built under R version 3.3.2 30 | 31 | library(cowplot) 32 | 33 | ## Warning: package 'cowplot' was built under R version 3.3.2 34 | 35 | ## 36 | ## Attaching package: 'cowplot' 37 | 38 | ## The following object is masked from 'package:ggplot2': 39 | ## 40 | ## ggsave 41 | 42 | p <- ggplot(mtcars, aes(x = wt, y = mpg)) + 43 | geom_point() + 44 | geom_smooth(method = "lm") 45 | 46 | p 47 | 48 | ![](plotting_files/figure-markdown_strict/unnamed-chunk-1-1.png) 49 | 50 | We can then use `facet_wrap()` to get a separate plot for each number of 51 | cylinders: 52 | 53 | p <- p + facet_wrap(~cyl) 54 | 55 | p 56 | 57 | ![](plotting_files/figure-markdown_strict/unnamed-chunk-2-1.png) 58 | 59 | The `aes()` part of the call to `ggplot()` allows us to set the 60 | aesthetics of the plot, for example the `colour`, based on variables in 61 | the dataframe. 62 | 63 | > ### Challenge 64 | > 65 | > In a new code chunk in your R Notebook, load ggplot2 using 66 | > `library(ggplot2)` and make a plot of the linear model created in the 67 | > previous step. Colour the points by species name. 68 | > 69 | > **HINT** Loading the cowplot package will change the look of the plots 70 | > to be more suitable for publication. 
71 | 72 | **Next:** [Additional resources](./next_steps.md) 73 | -------------------------------------------------------------------------------- /plotting_files/figure-markdown_strict/unnamed-chunk-1-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/BES2016Workshop/reproduciblecodeR/97002f73e7587450107b66216555a08b1e182d9f/plotting_files/figure-markdown_strict/unnamed-chunk-1-1.png -------------------------------------------------------------------------------- /plotting_files/figure-markdown_strict/unnamed-chunk-2-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/BES2016Workshop/reproduciblecodeR/97002f73e7587450107b66216555a08b1e182d9f/plotting_files/figure-markdown_strict/unnamed-chunk-2-1.png -------------------------------------------------------------------------------- /r_markdown.md: -------------------------------------------------------------------------------- 1 | # Creating an RMarkdown notebook 2 | 3 | [R Notebooks](http://rmarkdown.rstudio.com/r_notebooks.html) are Markdown documents which allow users to execute chunks of R code independently and interactively while producing publication quality output. They are an example of [literate programming](https://en.wikipedia.org/wiki/Literate_programming). 4 | 5 | Create an R Notebook as follows: 6 | 7 | **File** -> **New File** -> **R Notebook** 8 | 9 | Edit the title of the notebook at the top of the document and try following some of the automatically generated instructions within the notebook. 10 | 11 | In the worked example that follows, we can enter R commands into code chunks and make notes using Markdown code in the main part of the document. 
12 | 13 | Markdown is intended to be as easy-to-read and easy-to-write as possible: [handy guide to Markdown syntax](https://guides.github.com/pdfs/markdown-cheatsheet-online.pdf) 14 | 15 | **Next:** [Loading Data](./loading_data.md) -------------------------------------------------------------------------------- /r_project.md: -------------------------------------------------------------------------------- 1 | # Setting up an R project 2 | 3 | A project is a folder that contains everything concerning your analysis and may include code, data and documentation. It is a complete research object that can be used to describe and reproduce your research. 4 | 5 | Create a new project in RStudio as follows: 6 | 7 | **File** -> **New Project** -> **New Directory** 8 | 9 | ![](./assets/project_screen1.png) 10 | 11 | In the **Project Type** screen, click on **Empty Project**. 12 | 13 | ![](./assets/project_screen2.png) 14 | 15 | In the **Create New Project** screen, give your project a name, set the folder to an appropriate location by clicking browse, and ensure that **create a git repository** is checked. Click on **Create Project**. 16 | 17 | ![](./assets/project_screen3.png) 18 | 19 | RStudio will create a new folder containing an empty project and set R's working directory to it. 20 | 21 | ![](./assets/project_files.png) 22 | 23 | Two files are created in the otherwise empty project: 24 | 25 | * **.gitignore** - Specifies files that should be ignored by the version control system. 26 | * **reproducible_r.Rproj** - Configuration information for the RStudio project 27 | 28 | There is no need to worry about the contents of either of these for the purposes of this tutorial. Tamora will be covering how to use git for version control in one of the other breakout sessions. 
29 | 30 | **Next:** [Creating an RMarkdown notebook](./r_markdown.md) -------------------------------------------------------------------------------- /reproduciblecodeR.Rproj: -------------------------------------------------------------------------------- 1 | Version: 1.0 2 | 3 | RestoreWorkspace: Default 4 | SaveWorkspace: Default 5 | AlwaysSaveHistory: Default 6 | 7 | EnableCodeIndexing: Yes 8 | UseSpacesForTab: Yes 9 | NumSpacesForTab: 2 10 | Encoding: UTF-8 11 | 12 | RnwWeave: Sweave 13 | LaTeX: pdfLaTeX 14 | -------------------------------------------------------------------------------- /summarising_data.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | output: md_document 3 | --- 4 | # Manipulating and summarising data 5 | 6 | Once we have tidy data, we need to be able to apply data transformation functions to subset, order, summarise and create new variables. tidyr has as its complement the [dplyr](https://cran.rstudio.com/web/packages/dplyr/vignettes/introduction.html) package. 7 | 8 | dplyr includes the following data transformation functions: 9 | 10 | - `select()` subset the columns of the data by selecting variables 11 | - `filter()` subset the rows of the data by a condition 12 | - `group_by()` groups data by one or more variables 13 | - `summarise()` summarise data by functions of choice (e.g. `mean()`, `max()`, `sd()`) 14 | - `arrange()` order data by a variable 15 | - `left_join()`, `inner_join()` and the other `*_join()` functions join two dataframes 16 | - `mutate()` create new variables 17 | - `summarise_each()` and `mutate_each()` allow for applying functions to one or more columns 18 | 19 | tidyr and dplyr also include the `%>%` pipe function. This takes the output of the previous command and 'pipes' it as the input into the next command. This is neater than using a nested approach to commands, and removes the need to create intermediate output files. 
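The difference between the nested and piped styles can be sketched with the built-in `mtcars` data. This is a minimal illustration rather than part of the original lesson: the `wt < 4` filter and the object names `nested` and `piped` are assumptions, chosen only so that the pipeline has three steps.

```r
library(dplyr)

data(mtcars)

# Nested approach: the calls have to be read from the inside out
nested <- summarise(group_by(filter(mtcars, wt < 4), cyl), mean_mpg = mean(mpg))

# Piped approach: each step takes the result of the previous one,
# so the code reads top to bottom with no intermediate objects
piped <- mtcars %>%
  filter(wt < 4) %>%
  group_by(cyl) %>%
  summarise(mean_mpg = mean(mpg))

# Check that both styles give the same summary
identical(as.data.frame(nested), as.data.frame(piped))
```

Both versions compute the same summary; the piped form simply lists the steps in the order they are applied.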
20 | 21 | A brief example using the built-in `mtcars` data: 22 | 23 | ```{r, echo=FALSE} 24 | library(dplyr) 25 | ``` 26 | 27 | ```{r} 28 | data(mtcars) 29 | mtcars_summary <- group_by(mtcars, cyl) %>% 30 | summarise(mean_mpg = mean(mpg), sd_mpg = sd(mpg)) 31 | mtcars_summary 32 | ``` 33 | 34 | Here we have the mean and standard deviation of MPG for each number of cylinders. 35 | 36 | > ### Challenge 37 | > 38 | > In a new code chunk in your R Notebook, load the dplyr package using `library(dplyr)` and calculate the mean and standard deviation for each of the measured variables, grouped by species. 39 | > 40 | > **HINT** Use `summarise_each()` rather than multiple calls to `summarise()`. 41 | 42 | **Next:** [Tidying model output](./tidying_output.md) 43 | 44 | -------------------------------------------------------------------------------- /summarising_data.md: -------------------------------------------------------------------------------- 1 | Manipulating and summarising data 2 | ================================= 3 | 4 | Once we have tidy data, we need to be able to apply data transformation 5 | functions to subset, order, summarise and create new variables. tidyr 6 | has as its complement the 7 | [dplyr](https://cran.rstudio.com/web/packages/dplyr/vignettes/introduction.html) 8 | package. 9 | 10 | dplyr includes the following data transformation functions: 11 | 12 | - `select()` subset the columns of the data by selecting variables 13 | - `filter()` subset the rows of the data by a condition 14 | - `group_by()` groups data by one or more variables 15 | - `summarise()` summarise data by functions of choice (e.g. `mean()`, 16 | `max()`, `sd()`) 17 | - `arrange()` order data by a variable 18 | - `left_join()`, `inner_join()` and the other `*_join()` functions join two dataframes 19 | - `mutate()` create new variables 20 | - `summarise_each()` and `mutate_each()` allow for applying functions to 21 | one or more columns 22 | 23 | tidyr and dplyr also include the `%>%` pipe function. 
This takes the 24 | output of the previous command and 'pipes' it as the input into the next 25 | command. This is neater than using a nested approach to commands, and 26 | removes the need to create intermediate objects. 27 | 28 | A brief example using the built-in `mtcars` data: 29 | 30 | ## Warning: package 'dplyr' was built under R version 3.3.2 31 | 32 | ## 33 | ## Attaching package: 'dplyr' 34 | 35 | ## The following objects are masked from 'package:stats': 36 | ## 37 | ## filter, lag 38 | 39 | ## The following objects are masked from 'package:base': 40 | ## 41 | ## intersect, setdiff, setequal, union 42 | 43 | data(mtcars) 44 | mtcars_summary <- group_by(mtcars, cyl) %>% 45 | summarise(mean_mpg = mean(mpg), sd_mpg = sd(mpg)) 46 | mtcars_summary 47 | 48 | ## # A tibble: 3 × 3 49 | ## cyl mean_mpg sd_mpg 50 | ## 51 | ## 1 4 26.66364 4.509828 52 | ## 2 6 19.74286 1.453567 53 | ## 3 8 15.10000 2.560048 54 | 55 | Here we have the mean and standard deviation of MPG for each number of 56 | cylinders. 57 | 58 | > ### Challenge 59 | > 60 | > In a new code chunk in your R Notebook, load the dplyr package using 61 | > `library(dplyr)` and calculate the mean and standard deviation for 62 | > each of the measured variables, grouped by species. 63 | > 64 | > **HINT** Use `summarise_each()` rather than multiple calls to 65 | > `summarise()`.
66 | 67 | **Next:** [Tidying model output](./tidying_output.md) 68 | -------------------------------------------------------------------------------- /tidying_data.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | output: md_document 3 | --- 4 | # Tidying Data 5 | 6 | [“Tidy datasets are all alike but every messy dataset is messy in its own way” (Hadley Wickham, 2014)](http://vita.had.co.nz/papers/tidy-data.html) 7 | 8 | Key features of tidy data are: 9 | 10 | - Observations in rows 11 | - Variables in columns 12 | - Each type of observational unit is a table 13 | 14 | ![](./assets/tidy_data.png) 15 | 16 | Messy data can take many forms. For example: 17 | 18 | - Column headers are values, not variable names 19 | - Multiple variables stored in one column 20 | - Variables stored in both rows and columns 21 | - Multiple observational unit types in the same table 22 | - Single observational unit in multiple tables 23 | 24 | Let's explore the data we loaded in the last exercise: 25 | ```{r, echo=FALSE} 26 | iris <- read.csv("iris.csv") 27 | ``` 28 | 29 | ```{r} 30 | # get the structure of the dataframe 31 | str(iris) 32 | 33 | # head gives us the first 6 rows to explore 34 | head(iris) 35 | ``` 36 | 37 | Here we have three characteristics of messy data: 38 | 39 | - Species names (which are values) as column headers 40 | - Multiple variables stored in one column: sample number and measurement type as a compound variable 41 | - Variables (the measurement types) are stored in rows instead of columns 42 | 43 | The [tidyr](https://blog.rstudio.org/2014/07/22/introducing-tidyr/) package provides functions to fix many of the issues in messy datasets. 44 | 45 | - `gather()` takes multiple columns and gathers them into key-value pairs. We can use this to get the species names into rows. 46 | 47 | - `separate()` takes one column and separates it into multiple columns. We can use this to split the sample number from the measurement type.
48 | 49 | - `spread()` takes two columns (a key-value pair) and spreads them into multiple columns. We can use this to get the measurement types to form columns. 50 | 51 | Two other useful packages for tidying data are [lubridate](https://cran.r-project.org/web/packages/lubridate/lubridate.pdf) for working with dates and [taxize](https://ropensci.org/tutorials/taxize_tutorial.html) for cleaning taxonomic information. 52 | 53 | > ### Challenge 54 | > 55 | > In a new code chunk in your R Notebook, load the tidyr package using `library(tidyr)` and use the suggested functions to get the data into tidy data format. 56 | > 57 | > **HINT** Use `?` to get help on how to use a function (e.g. `?separate`) 58 | 59 | **Next:** [Manipulating and summarising data](./summarising_data.md) -------------------------------------------------------------------------------- /tidying_data.md: -------------------------------------------------------------------------------- 1 | Tidying Data 2 | ============ 3 | 4 | [“Tidy datasets are all alike but every messy dataset is messy in its own 5 | way” (Hadley Wickham, 6 | 2014)](http://vita.had.co.nz/papers/tidy-data.html) 7 | 8 | Key features of tidy data are: 9 | 10 | - Observations in rows 11 | - Variables in columns 12 | - Each type of observational unit is a table 13 | 14 | ![](./assets/tidy_data.png) 15 | 16 | Messy data can take many forms. For example: 17 | 18 | - Column headers are values, not variable names 19 | - Multiple variables stored in one column 20 | - Variables stored in both rows and columns 21 | - Multiple observational unit types in the same table 22 | - Single observational unit in multiple tables 23 | 24 | Let's explore the data we loaded in the last exercise: 25 | 26 | # get the structure of the dataframe 27 | str(iris) 28 | 29 | ## 'data.frame': 200 obs. of 4 variables: 30 | ## $ measurement: Factor w/ 200 levels "sample1_Petal.Length",..: 1 2 3 4 5 6 7 8 9 10 ...
31 | ## $ setosa : num 1.4 0.2 5.1 3.5 1.5 0.1 4.9 3.1 1.5 0.2 ... 32 | ## $ versicolor : num 4.7 1.4 7 3.2 3.9 1.4 5.2 2.7 3.5 1 ... 33 | ## $ virginica : num 6 2.5 6.3 3.3 6.1 2.5 7.2 3.6 5.1 2 ... 34 | 35 | # head gives us the first 6 rows to explore 36 | head(iris) 37 | 38 | ## measurement setosa versicolor virginica 39 | ## 1 sample1_Petal.Length 1.4 4.7 6.0 40 | ## 2 sample1_Petal.Width 0.2 1.4 2.5 41 | ## 3 sample1_Sepal.Length 5.1 7.0 6.3 42 | ## 4 sample1_Sepal.Width 3.5 3.2 3.3 43 | ## 5 sample10_Petal.Length 1.5 3.9 6.1 44 | ## 6 sample10_Petal.Width 0.1 1.4 2.5 45 | 46 | Here we have three characteristics of messy data: 47 | 48 | - Species names (which are values) as column headers 49 | - Multiple variables stored in one column: sample number and 50 | measurement type as a compound variable 51 | - Variables (the measurement types) are stored in rows instead of 52 | columns 53 | 54 | The [tidyr](https://blog.rstudio.org/2014/07/22/introducing-tidyr/) 55 | package provides functions to fix many of the issues in messy datasets. 56 | 57 | - `gather()` takes multiple columns and gathers them into 58 | key-value pairs. We can use this to get the species names into rows. 59 | 60 | - `separate()` takes one column and separates into multiple columns. 61 | We can use this to split the sample number from the 62 | measurement type. 63 | 64 | - `spread()` takes two columns (a key-value pair) and spreads them 65 | into multiple columns. We can use this to get the measurement types 66 | to form columns. 67 | 68 | Two other useful packages for tidying data are 69 | [lubridate](https://cran.r-project.org/web/packages/lubridate/lubridate.pdf) 70 | for working with dates and 71 | [taxize](https://ropensci.org/tutorials/taxize_tutorial.html) for 72 | cleaning taxonomic information. 
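The three tidyr functions described above can be sketched on a small toy data frame. This is a hedged illustration only: the data frame `messy`, its column names (`key`, `siteA`, `siteB`) and the result name `tidy_df` are hypothetical and deliberately differ from the workshop's `iris.csv`. Note also that in tidyr 1.0 and later, `gather()` and `spread()` still work but are superseded by `pivot_longer()` and `pivot_wider()`.

```r
library(tidyr)

# Hypothetical toy data: site names as column headers and a compound
# "plot_variable" key stored in a single column
messy <- data.frame(
  key   = c("p1_height", "p1_width", "p2_height", "p2_width"),
  siteA = c(10, 2, 12, 3),
  siteB = c(20, 4, 22, 5),
  stringsAsFactors = FALSE
)

tidy_df <- messy %>%
  # column headers (siteA, siteB) become values in a 'site' column
  gather(site, value, -key) %>%
  # split the compound key into plot and variable
  separate(key, c("plot", "variable"), sep = "_") %>%
  # measurement types (height, width) become columns
  spread(variable, value)

tidy_df
```

Each row of `tidy_df` is then one plot at one site, with one column per measurement type.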
73 | 74 | > ### Challenge 75 | > 76 | > In a new code chunk in your R Notebook, load the tidyr package using 77 | > `library(tidyr)` and use the suggested functions to get the data into 78 | > tidy data format. 79 | > 80 | > **HINT** Use `?` to get help on how to use a function (e.g. 81 | > `?separate`) 82 | 83 | **Next:** [Manipulating and summarising data](./summarising_data.md) 84 | -------------------------------------------------------------------------------- /tidying_output.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | output: md_document 3 | --- 4 | # Tidying model output 5 | 6 | In this section, we will run a basic linear model and programmatically tidy the output from the model. 7 | 8 | Again using the `mtcars` data, here is an example of the output from a linear model: 9 | 10 | ```{r} 11 | data(mtcars) 12 | lmfit <- lm(mpg ~ wt, data = mtcars) 13 | summary(lmfit) 14 | ``` 15 | 16 | While this summary is useful for assessing the output of a single model, it becomes difficult to work with once the number of models starts to increase. This is where the [broom](https://cran.r-project.org/web/packages/broom/vignettes/broom.html) package comes in handy. This package provides functions to convert model coefficient estimates, predicted values and residuals, and summary statistics to data frames. 17 | 18 | ```{r} 19 | library(broom) 20 | # we can view a table of the coefficient estimates and p values 21 | tidy(lmfit) 22 | 23 | # we can view a table of the fit statistics 24 | glance(lmfit) 25 | ``` 26 | 27 | The functions from the broom package work on most classes of model output. 28 | 29 | > ### Challenge 30 | > 31 | > In a new code chunk in your R Notebook, load the broom package with `library(broom)` and, using the `lm()` and `tidy()` functions, fit a linear model relating petal length to petal width and output the table of coefficients.
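As a sketch of why tidy output helps once there is more than one model, the coefficient tables from several fits can be stacked into a single data frame. The second model (`mpg ~ hp`) and the names `fits` and `coefs` are assumptions made for illustration; they are not part of the workshop material.

```r
library(broom)
library(dplyr)

# Two simple linear models on the built-in mtcars data, kept in a named list
fits <- list(
  wt = lm(mpg ~ wt, data = mtcars),
  hp = lm(mpg ~ hp, data = mtcars)
)

# tidy() each model, then stack the tables; the list names are stored
# in a 'model' column so each row can be traced back to its fit
coefs <- bind_rows(lapply(fits, tidy), .id = "model")
coefs
```

Because every model contributes rows with the same columns, the combined table can be filtered, summarised or plotted like any other data frame.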
32 | 33 | **Next:** [Plotting](./plotting.md) -------------------------------------------------------------------------------- /tidying_output.md: -------------------------------------------------------------------------------- 1 | Tidying model output 2 | ==================== 3 | 4 | In this section, we will run a basic linear model and programmatically 5 | tidy the output from the model. 6 | 7 | Again using the `mtcars` data, here is an example of the output from a 8 | linear model: 9 | 10 | data(mtcars) 11 | lmfit <- lm(mpg ~ wt, data = mtcars) 12 | summary(lmfit) 13 | 14 | ## 15 | ## Call: 16 | ## lm(formula = mpg ~ wt, data = mtcars) 17 | ## 18 | ## Residuals: 19 | ## Min 1Q Median 3Q Max 20 | ## -4.5432 -2.3647 -0.1252 1.4096 6.8727 21 | ## 22 | ## Coefficients: 23 | ## Estimate Std. Error t value Pr(>|t|) 24 | ## (Intercept) 37.2851 1.8776 19.858 < 2e-16 *** 25 | ## wt -5.3445 0.5591 -9.559 1.29e-10 *** 26 | ## --- 27 | ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 28 | ## 29 | ## Residual standard error: 3.046 on 30 degrees of freedom 30 | ## Multiple R-squared: 0.7528, Adjusted R-squared: 0.7446 31 | ## F-statistic: 91.38 on 1 and 30 DF, p-value: 1.294e-10 32 | 33 | While this summary is useful for assessing the output of a single model, 34 | it becomes difficult to work with once the number of models starts to 35 | increase. This is where the 36 | [broom](https://cran.r-project.org/web/packages/broom/vignettes/broom.html) 37 | package comes in handy. This package provides functions to convert model 38 | coefficient estimates, predicted values and residuals, and summary 39 | statistics to data frames.
40 | 41 | library(broom) 42 | # we can view a table of the coefficient estimates and p values 43 | tidy(lmfit) 44 | 45 | ## term estimate std.error statistic p.value 46 | ## 1 (Intercept) 37.285126 1.877627 19.857575 8.241799e-19 47 | ## 2 wt -5.344472 0.559101 -9.559044 1.293959e-10 48 | 49 | # we can view a table of the fit statistics 50 | glance(lmfit) 51 | 52 | ## r.squared adj.r.squared sigma statistic p.value df logLik 53 | ## 1 0.7528328 0.7445939 3.045882 91.37533 1.293959e-10 2 -80.01471 54 | ## AIC BIC deviance df.residual 55 | ## 1 166.0294 170.4266 278.3219 30 56 | 57 | The functions from the broom package work on most classes of model 58 | output. 59 | 60 | > ### Challenge 61 | > 62 | > In a new code chunk in your R Notebook load the broom package with 63 | > `library(broom)` and using the `lm()` and `tidy()` functions, fit a 64 | > linear model relating petal length to petal width and output the table 65 | > of coefficients. 66 | 67 | **Next:** [Plotting](./plotting.md) 68 | --------------------------------------------------------------------------------