124 | Suggested reading
125 |
126 | - *Programming Skills for Data Science: Start Writing Code to Wrangle, Analyze, and Visualize Data with R* by Michael Freeman and Joel Ross, Addison-Wesley, 2019. See book [webpage](https://www.pearson.com/us/higher-education/program/Freeman-Programming-Skills-for-Data-Science-Start-Writing-Code-to-Wrangle-Analyze-and-Visualize-Data-with-R/PGM2047488.html) and [repository](https://programming-for-data-science.github.io/).
127 | - *R for Data Science* by Garrett Grolemund and Hadley Wickham, O'Reilly Media, 2016. See [online book](https://r4ds.had.co.nz/).
128 | - *Discovering Statistics Using R* by Andy Field, Jeremy Miles and Zoë Field, SAGE Publications Ltd, 2012. See book [webpage](https://www.discoveringstatistics.com/books/discovering-statistics-using-r/).
129 | - *Machine Learning with R: Expert techniques for predictive modeling* by Brett Lantz, Packt Publishing, 2019. See book [webpage](https://subscription.packtpub.com/book/big_data_and_business_intelligence/9781788295864).
130 |
131 | Further reading
132 |
133 | - *The Art of R Programming: A Tour of Statistical Software Design* by Norman Matloff, No Starch Press, 2011. See book [webpage](https://nostarch.com/artofr.htm).
134 | - *An Introduction to R for Spatial Analysis and Mapping* by Chris Brunsdon and Lex Comber, Sage, 2015. See book [webpage](https://uk.sagepub.com/en-gb/eur/an-introduction-to-r-for-spatial-analysis-and-mapping/book241031).
135 | - *Geocomputation with R* by Robin Lovelace, Jakub Nowosad and Jannes Muenchow, CRC Press, 2019. See [online book](https://bookdown.org/robinlovelace/geocompr/).
136 |
137 |
138 |
139 | ## R
140 |
141 | Created in 1992 by Ross Ihaka and Robert Gentleman at the University of Auckland, New Zealand
142 |
143 | - Free, open-source implementation of *S*
144 |     - *S* is a statistical programming language
145 |     - developed at Bell Labs
146 |
147 |
148 |
149 | - Functional programming language
150 | - Supports (and is commonly used for) procedural (i.e., imperative) programming
151 | - Object-oriented
152 | - Interpreted (not compiled)
153 |
154 |
155 |
156 | ## Interpreting values
157 |
158 | When values and operations are entered in the *Console*, the interpreter evaluates the expression and returns the result
159 |
160 | ```{r, echo=TRUE}
161 | 2
162 | "String value"
163 | # comments are ignored
164 | ```
165 |
166 |
167 |
168 | ## Basic types
169 |
170 | R provides three core data types
171 |
172 | - numeric
173 |     - both integer and real numbers
174 | - character
175 |     - i.e., text, also called *strings*
176 | - logical
177 |     - `TRUE` or `FALSE`
178 |
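As a minimal sketch of the three types above, the base function `class` can be used to check the type of a value. Note that whole numbers are stored as real numbers unless the `L` suffix is used, and that `is.numeric` is `TRUE` for both.

```{r, echo=TRUE}
class(2)          # a real number, reported as "numeric"
class(2L)         # the L suffix creates an integer
is.numeric(2L)    # integers are numeric values too
class("String value")
class(TRUE)
```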
179 |
180 | ## Numeric operators
181 |
182 | R provides a series of basic numeric operators
183 |
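A short sketch of the most common arithmetic operators in base R (the values are arbitrary examples):

```{r, echo=TRUE}
5 + 2     # addition
5 - 2     # subtraction
5 * 2     # multiplication
5 / 2     # division
5 %/% 2   # integer division
5 %% 2    # remainder (modulo)
5 ^ 2     # exponentiation
```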
184 |
223 |
224 |
225 | ```{r, echo=TRUE}
226 | 5 >= 2
227 | ```
228 |
229 |
230 |
231 | ## Summary
232 |
233 | An introduction to R
234 |
235 | - Basic types
236 | - Basic operators
237 |
238 | **Next**: Core concepts
239 |
240 | - Variables
241 | - Functions
242 | - Libraries
243 |
244 | ```{r cleanup, include=FALSE}
245 | rm(list = ls())
246 | ```
--------------------------------------------------------------------------------
/src/lectures/contents/102_L_CoreConcepts.Rmd:
--------------------------------------------------------------------------------
1 | ```{r setup, include=FALSE}
2 | knitr::opts_chunk$set(echo = FALSE)
3 | knitr::opts_knit$set(root.dir = Sys.getenv("GRANOLARR_HOME"))
4 | rm(list = ls())
5 | ```
6 |
7 |
8 |
9 | # Core concepts
10 |
11 |
12 |
13 | ## Recap
14 |
15 | **Prev**: An introduction to R
16 |
17 | - Basic types
18 | - Basic operators
19 |
20 | **Now**: Core concepts
21 |
22 | - Variables
23 | - Functions
24 | - Libraries
25 |
26 |
27 |
28 | ## Variables
29 |
30 | Variables **store data** and can be defined
31 |
32 | - using an *identifier* (e.g., `a_variable`)
33 | - on the left of an *assignment operator* `<-`
34 | - followed by the object to be linked to the identifier
35 |     - such as a *value* (e.g., `1`)
36 |
37 | ```{r, echo=TRUE}
38 | a_variable <- 1
39 | ```
40 |
41 | The value of the variable can be invoked by simply specifying the **identifier**.
42 |
43 | ```{r, echo=TRUE}
44 | a_variable
45 | ```
46 |
47 |
48 |
49 | ## Algorithms and functions
50 |
51 | *An* **algorithm** *or effective procedure is a mechanical rule, or automatic method, or programme for performing some mathematical operation* (Cutland, 1980).
52 |
53 | A **program** is a specific set of instructions that implement an abstract algorithm.
54 |
55 | The definition of an algorithm (and thus a program) can consist of one or more **function**s
56 |
57 | - a set of instructions that perform a task
58 | - possibly using an input, possibly returning an output value
59 |
60 | Programming languages usually provide pre-defined functions that implement common algorithms (e.g., to find the square root of a number or to calculate a linear regression)
61 |
62 |
63 |
64 | ## Functions
65 |
66 | Functions execute complex operations and can be invoked
67 |
68 | - specifying the *function name*
69 | - followed by the *arguments* (input values) between simple brackets
70 |     - each *argument* corresponds to a *parameter*
71 |     - sometimes the *parameter* name must be specified
72 |
73 | ```{r, echo=TRUE}
74 | sqrt(2)
75 | round(1.414214, digits = 2)
76 | ```
77 |
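As a sketch of how arguments correspond to parameters: the second parameter of `round` is named `digits`, so the two calls below are equivalent. Arguments are matched by position unless the parameter name is given.

```{r, echo=TRUE}
round(1.414214, 2)             # second argument matched by position
round(1.414214, digits = 2)    # second argument matched by name
```

Stating the parameter name makes the code easier to read, which is why it is used throughout these slides.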
78 |
79 |
80 | ## Functions and variables
81 |
82 | - functions can be used on the right side of `<-`
83 | - variables and functions can be used as *arguments*
84 |
85 | ```{r, echo=TRUE}
86 | sqrt_of_two <- sqrt(2)
87 | sqrt_of_two
88 | round(sqrt_of_two, digits = 2)
89 | round(sqrt(2), digits = 2)
90 | ```
91 |
92 |
93 |
94 | ## Naming
95 |
96 | When creating an identifier for a variable or function
97 |
98 | - R is a **case sensitive** language
99 |     - UPPER and lower case are not the same
100 |     - `a_variable` is different from `a_VARIABLE`
101 | - names can include
102 |     - alphanumeric symbols
103 |     - `.` and `_`
104 | - names must start with
105 |     - a letter (or a `.` not followed by a digit)
106 |
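A brief sketch of these rules, using hypothetical identifiers: the first two names differ only in case and therefore refer to two distinct variables.

```{r, echo=TRUE}
a_variable <- 1
a_VARIABLE <- 2
a_variable == a_VARIABLE    # the two identifiers hold different values

# Letters, digits, `.` and `_` are all allowed after the first letter
value.2_b <- 3
value.2_b
```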
107 |
108 |
109 | ## Libraries
110 |
111 | Once a number of related, reusable functions are created
112 |
113 | - they can be collected and stored in **libraries** (a.k.a. *packages*)
114 | - `install.packages` is a function that can be used to install libraries (i.e., download them onto your computer)
115 | - `library` is a function that *loads* a library (i.e., makes it available to a script)
116 |
117 | Libraries can be of any size and complexity, e.g.:
118 |
119 | - `base`: base R functions, including the `sqrt` function above
120 | - `rgdal`: implementation of the [GDAL (Geospatial Data Abstraction Library)](https://gdal.org/) functionalities
121 |
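Functions can also be invoked by stating the library explicitly with the `::` operator, in the form `library_name::function_name`. A minimal sketch using `base` and `stats`, two libraries distributed with R:

```{r, echo=TRUE}
# Equivalent to sqrt(2), stating the library explicitly
base::sqrt(2)

# median is part of the stats library
stats::median(c(1, 5, 3))
```

This notation makes clear where a function comes from and avoids ambiguity when two libraries define functions with the same name.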
122 |
123 | ## stringr
124 |
125 | R provides some basic functions to manipulate strings, but the `stringr` library provides a more consistent and well-defined set
126 |
127 | ```{r, echo=TRUE}
128 | library(stringr)
129 |
130 | str_length("Leicester")
131 | str_detect("Leicester", "e")
132 | str_replace_all("Leicester", "e", "x")
133 | ```
134 |
135 |
136 |
137 |
138 | ## Summary
139 |
140 | Core concepts
141 |
142 | - Variables
143 | - Functions
144 | - Libraries
145 |
146 | **Next**: Tidyverse
147 |
148 | - Tidyverse libraries
149 | - *pipe* operator
150 |
151 | ```{r cleanup, include=FALSE}
152 | rm(list = ls())
153 | ```
--------------------------------------------------------------------------------
/src/lectures/contents/103_L_Tidyverse.Rmd:
--------------------------------------------------------------------------------
1 | ```{r setup, include=FALSE}
2 | knitr::opts_chunk$set(echo = FALSE)
3 | knitr::opts_knit$set(root.dir = Sys.getenv("GRANOLARR_HOME"))
4 | rm(list = ls())
5 | ```
6 |
7 |
8 |
9 | # Tidyverse
10 |
11 |
12 |
13 | ## Recap
14 |
15 | **Prev**: Core concepts
16 |
17 | - Variables
18 | - Functions
19 | - Libraries
20 |
21 | **Now**: Tidyverse
22 |
23 | - Tidyverse libraries
24 | - *pipe* operator
25 |
26 |
27 | ## Tidyverse
28 |
29 | The [Tidyverse](https://www.tidyverse.org/) was introduced by statistician [Hadley Wickham](https://t.co/DWqWlxbOKK?amp=1), Chief Scientist at [RStudio](https://rstudio.com/) (worth following him on [twitter](https://twitter.com/hadleywickham)).
30 |
31 | *"The tidyverse is an opinionated collection of R packages designed for data science. All packages share an underlying design philosophy, grammar, and data structures."* ([Tidyverse homepage](https://www.tidyverse.org/)).
32 |
33 | **Core libraries**
34 |
35 | :::::: {.cols data-latex=""}
36 |
37 | ::: {.col data-latex="{0.5\textwidth}" style="width: 50%"}
38 |
39 |
40 | - [`tibble`](https://tibble.tidyverse.org/)
41 | - [`tidyr`](https://tidyr.tidyverse.org/)
42 | - [`stringr`](https://stringr.tidyverse.org/)
43 | - [`dplyr`](https://dplyr.tidyverse.org/)
44 |
45 |
46 | :::
47 |
48 | ::: {.col data-latex="{0.5\textwidth}" style="width: 50%"}
49 |
50 | - [`readr`](https://readr.tidyverse.org/)
51 | - [`ggplot2`](https://ggplot2.tidyverse.org/)
52 | - [`purrr`](https://purrr.tidyverse.org/)
53 | - [`forcats`](https://forcats.tidyverse.org/)
54 |
55 | :::
56 | ::::::
57 |
58 | Also, imports [`magrittr`](https://magrittr.tidyverse.org/), which plays an important role.
59 |
60 | ## Tidyverse core libraries
61 |
62 | The meta-library [Tidyverse](https://www.tidyverse.org/) includes:
63 |
64 | - **[`tibble`](https://tibble.tidyverse.org/)** is a modern re-imagining of the data frame, keeping what time has proven to be effective, and throwing out what it has not. Tibbles are data.frames that are lazy and surly: they do less and complain more forcing you to confront problems earlier, typically leading to cleaner, more expressive code.
65 | - **[`tidyr`](https://tidyr.tidyverse.org/)** provides a set of functions that help you get to tidy data. Tidy data is data with a consistent form: in brief, every variable goes in a column, and every column is a variable.
66 | - **[`stringr`](https://stringr.tidyverse.org/)** provides a cohesive set of functions designed to make working with strings as easy as possible. It is built on top of stringi, which uses the ICU C library to provide fast, correct implementations of common string manipulations.
67 |
68 |
69 | ## Tidyverse core libraries
70 |
71 | The meta-library [Tidyverse](https://www.tidyverse.org/) includes:
72 |
73 | - **[`dplyr`](https://dplyr.tidyverse.org/)** provides a grammar of data manipulation, providing a consistent set of verbs that solve the most common data manipulation challenges.
74 | - **[`readr`](https://readr.tidyverse.org/)** provides a fast and friendly way to read rectangular data (like csv, tsv, and fwf). It is designed to flexibly parse many types of data found in the wild, while still cleanly failing when data unexpectedly changes.
75 | - **[`ggplot2`](https://ggplot2.tidyverse.org/)** is a system for declaratively creating graphics, based on The Grammar of Graphics. You provide the data, tell ggplot2 how to map variables to aesthetics, what graphical primitives to use, and it takes care of the details.
76 |
77 |
78 |
79 | ## Tidyverse core libraries
80 |
81 | The meta-library [Tidyverse](https://www.tidyverse.org/) contains the following libraries:
82 |
83 | - **[`purrr`](https://purrr.tidyverse.org/)** enhances R’s functional programming (FP) toolkit by providing a complete and consistent set of tools for working with functions and vectors. Once you master the basic concepts, purrr allows you to replace many for loops with code that is easier to write and more expressive.
84 | - **[`forcats`](https://forcats.tidyverse.org/)** provides a suite of useful tools that solve common problems with factors. R uses factors to handle categorical variables, variables that have a fixed and known set of possible values.
85 |
86 |
87 |
88 | ## The pipe operator
89 |
90 | The [Tidyverse](https://www.tidyverse.org/) (via [`magrittr`](https://magrittr.tidyverse.org/)) also provides a clean and effective way of combining multiple manipulation steps
91 |
92 | The pipe operator `%>%`
93 |
94 | - takes the result from one function
95 | - and passes it to the next function
96 |     - as the **first argument**
97 |     - that doesn't need to be included in the code anymore
98 |
99 |
100 |
105 |
106 |
107 | ## Pipe example
108 |
109 | The two code snippets below are equivalent
110 |
111 | - the first simply invokes the functions
112 | - the second uses the pipe operator `%>%`
113 |
114 | ```{r, echo=TRUE}
115 | round(sqrt(2), digits = 2)
116 | ```
117 |
118 | ```{r, echo=TRUE, message=FALSE, warning=FALSE}
119 | library(tidyverse)
120 |
121 | sqrt(2) %>%
122 |   round(digits = 2)
123 | ```
124 |
125 |
126 |
127 | ## Coding style
128 |
129 | A *coding style* is a way of writing the code, including
130 |
131 | - how variables and functions are named
132 |     - lower case and `_`
133 | - how spaces are used in the code
134 | - which libraries are used
135 |
136 | ```{r, echo=TRUE, eval=FALSE}
137 | # Bad
138 | X<-round(sqrt(2),2)
139 |
140 | # Good
141 | sqrt_of_two <- sqrt(2) %>%
142 |   round(digits = 2)
143 | ```
144 |
145 | Study the [Tidyverse Style Guide](http://style.tidyverse.org/) and use it consistently!
146 |
147 |
148 | ## Summary
149 |
150 | Tidyverse
151 |
152 | - Tidyverse libraries
153 | - *pipe* operator
154 | - Coding style
155 |
156 | **Next**: Practical session
157 |
158 | - The R programming language
159 | - Interpreting values
160 | - Variables
161 | - Basic types
162 | - Tidyverse
163 | - Coding style
164 |
165 | ```{r cleanup, include=FALSE}
166 | rm(list = ls())
167 | ```
--------------------------------------------------------------------------------
/src/lectures/contents/112_L_ControlStructures.Rmd:
--------------------------------------------------------------------------------
1 | ```{r setup, include=FALSE}
2 | knitr::opts_chunk$set(echo = FALSE)
3 | knitr::opts_knit$set(root.dir = Sys.getenv("GRANOLARR_HOME"))
4 | rm(list = ls())
5 | ```
6 |
7 |
8 |
9 | # Control structures
10 |
11 |
12 |
13 | ## Recap
14 |
15 | **Prev**: Data types
16 |
17 | - Vectors
18 | - Factors
19 | - Matrices and arrays
20 | - Lists
21 |
22 | **Now**: Control structures
23 |
24 | - Conditional statements
25 | - Loops
26 |
27 |
28 |
29 | ## If
30 |
31 | Format: **if** (*condition*) *statement*
32 |
33 | - *condition*: expression returning a logical value (`TRUE` or `FALSE`)
34 | - *statement*: any valid R statement
35 | - *statement* only executed if *condition* is `TRUE`
36 |
37 |
38 | ```{r, echo=TRUE}
39 | a_value <- -7
40 | if (a_value < 0) cat("Negative")
41 | a_value <- 8
42 | if (a_value < 0) cat("Negative")
43 | ```
44 |
45 |
46 | ## Else
47 | Format: **if** (*condition*) *statement1* **else** *statement2*
48 |
49 | - *condition*: expression returning a logical value (`TRUE` or `FALSE`)
50 | - *statement1* and *statement2*: any valid R statements
51 | - *statement1* executed if *condition* is `TRUE`
52 | - *statement2* executed if *condition* is `FALSE`
53 |
54 |
55 | ```{r, echo=TRUE}
56 | a_value <- -7
57 | if (a_value < 0) cat("Negative") else cat("Positive")
58 | a_value <- 8
59 | if (a_value < 0) cat("Negative") else cat("Positive")
60 | ```
61 |
62 |
70 |
71 |
72 | ## Code blocks
73 |
74 | **Code blocks** allow **several** statements to be encapsulated in a single group
75 |
76 | - `{` and `}` contain code blocks
77 | - the statements are executed together
78 |
79 | ```{r, echo=TRUE}
80 | first_value <- 8
81 | second_value <- 5
82 | if (first_value > second_value) {
83 |   cat("First is greater than second\n")
84 |   difference <- first_value - second_value
85 |   cat("Their difference is", difference)
86 | }
87 | ```
88 |
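Code blocks can also be combined with `else`, and conditions can be chained using `else if`. A minimal sketch, where `sign_label` is a hypothetical variable name:

```{r, echo=TRUE}
a_value <- 0
if (a_value < 0) {
  sign_label <- "Negative"
} else if (a_value == 0) {
  sign_label <- "Zero"
} else {
  sign_label <- "Positive"
}
cat(sign_label)
```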
89 |
90 |
91 | ## Loops
92 | Loops are a fundamental component of (procedural) programming.
93 |
94 |
95 | There are two main types of loops:
96 |
97 | - **conditional** loops are executed as long as a defined condition holds true
98 |     - construct `while`
99 |     - construct `repeat`
100 | - **deterministic** loops are executed a pre-determined number of times
101 |     - construct `for`
102 |
103 |
104 | ## While
105 |
106 | The *while* construct can be defined using the `while` reserved word, followed by a condition between simple brackets, and a code block. The instructions in the code block are re-executed as long as the condition evaluates to `TRUE`.
107 |
108 | ```{r, echo=TRUE}
109 | current_value <- 0
110 |
111 | while (current_value < 3) {
112 |   cat("Current value is", current_value, "\n")
113 |   current_value <- current_value + 1
114 | }
115 | ```
116 |
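The `repeat` construct has no condition of its own: the code block is re-executed until a `break` statement is reached. A sketch that mirrors the `while` example, under that assumption:

```{r, echo=TRUE}
current_value <- 0

repeat {
  cat("Current value is", current_value, "\n")
  current_value <- current_value + 1
  # Without this check the loop would never end
  if (current_value >= 3) break
}
```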
117 |
139 |
140 | ## For
141 |
142 | The *for* construct can be defined using the `for` reserved word, followed by the definition of an **iterator**. The iterator is a variable that is assigned each element of a vector in turn, as the construct iterates through all elements of the vector. This definition is followed by a code block, whose instructions are executed once for each element of the vector.
143 |
144 | ```{r, echo=TRUE}
145 | cities <- c("Derby", "Leicester", "Lincoln", "Nottingham")
146 | for (city in cities) {
147 |   cat("Do you live in", city, "?\n")
148 | }
149 | ```
150 |
151 |
152 | ## For
153 |
154 | It is common practice to create a vector of integers on the spot in order to execute a certain sequence of steps a pre-defined number of times.
155 |
156 | ```{r, echo=TRUE}
157 | for (i in 1:3) {
158 |   cat("This is execution number", i, ":\n")
159 |   cat(" See you later!\n")
160 | }
161 | ```
162 |
163 |
164 | ## Loops with conditional statements
165 |
166 | ```{r, echo=TRUE}
167 | 3:0
168 | # Example: countdown!
169 | for (i in 3:0) {
170 |   if (i == 0) {
171 |     cat("Go!\n")
172 |   } else {
173 |     cat(i, "\n")
174 |   }
175 | }
176 | ```
177 |
178 |
179 |
180 | ## Summary
181 |
182 | Control structures
183 |
184 | - Conditional statements
185 | - Loops
186 |
187 | **Next**: Functions
188 |
189 | - Defining functions
190 | - Scope of a variable
191 |
192 | ```{r cleanup, include=FALSE}
193 | rm(list = ls())
194 | ```
--------------------------------------------------------------------------------
/src/lectures/contents/113_L_Functions.Rmd:
--------------------------------------------------------------------------------
1 | ```{r setup, include=FALSE}
2 | knitr::opts_chunk$set(echo = FALSE)
3 | knitr::opts_knit$set(root.dir = Sys.getenv("GRANOLARR_HOME"))
4 | rm(list = ls())
5 | ```
6 |
7 |
8 |
9 | # Functions
10 |
11 |
12 | ## Recap
13 |
14 | **Prev**: Control structures
15 |
16 | - Conditional statements
17 | - Loops
18 |
19 | **Now**: Functions
20 |
21 | - Defining functions
22 | - Scope of a variable
23 |
24 |
25 |
26 | ## Defining functions
27 |
28 | A function can be defined
29 |
30 | - using an **identifier** (e.g., `add_one`)
31 | - on the left of an **assignment operator** `<-`
32 | - followed by the definition of the function
33 |
34 | ```{r, echo=TRUE}
35 | add_one <- function (input_value) {
36 |   output_value <- input_value + 1
37 |   output_value
38 | }
39 | ```
40 |
41 | ## Defining functions
42 |
43 | The function definition
44 |
45 | - starts with the reserved word `function`
46 | - followed by the **parameter(s)** (e.g., `input_value`) between simple brackets
47 | - and the instruction(s) to be executed in a code block
48 |     - the value of the last statement is returned as output
49 |
50 | ```{r, echo=TRUE}
51 | add_one <- function (input_value) {
52 |   output_value <- input_value + 1
53 |   output_value
54 | }
55 | ```
56 |
57 |
58 | ## Defining functions
59 |
60 | After being defined
61 |
62 | - a function can be invoked by specifying
63 |     - the **identifier**
64 |     - the necessary **argument(s)**
65 |
66 | ```{r, echo=TRUE}
67 | add_one(3)
68 | add_one(1024)
69 | ```
70 |
71 |
72 | ## More parameters
73 |
74 | - A function can be defined as having two or more **parameters**
75 |     - by specifying more than one parameter name (separated by **commas**) in the function definition
76 | - A function call must provide a value for each parameter specified in the definition that has no default value
77 |     - otherwise an error is generated when the missing value is used
78 |
79 | ```{r, echo=TRUE}
80 | area_rectangle <- function (height, width) {
81 |   area <- height * width
82 |   area
83 | }
84 |
85 | area_rectangle(3, 2)
86 | ```
87 |
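Parameters can be given a default value in the definition, which is used when the corresponding argument is omitted, and arguments can be passed by name in any order. A minimal sketch, where `area_rectangle_default` is a hypothetical variant of the function above:

```{r, echo=TRUE}
area_rectangle_default <- function (height, width = 1) {
  height * width
}

area_rectangle_default(3)                      # width defaults to 1
area_rectangle_default(width = 2, height = 3)  # arguments passed by name
```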
88 |
89 | ## Functions and control structures
90 |
91 | Functions can contain both loops and conditional statements
92 |
93 | ```{r, echo=TRUE}
94 | factorial <- function (input_value) {
95 |   result <- 1
96 |   for (i in seq_len(input_value)) { # seq_len(0) is empty, so factorial(0) returns 1
97 |     cat("current:", result, " | i:", i, "\n")
98 |     result <- result * i
99 |   }
100 |   result
101 | }
102 | factorial(3)
103 | ```
104 |
105 |
106 |
124 |
125 |
126 |
127 | ## Scope
128 |
129 | The **scope of a variable** is the part of code in which the variable is "visible"
130 |
131 | In R, variables have a **hierarchical** scope:
132 |
133 | - a variable defined in a script can be referred to from within a definition of a function in the same script
134 | - a variable defined within a definition of a function will **not** be referable from outside the definition
135 | - `if` and loop constructs do **not** create a new scope
136 |
137 |
138 | ## Example
139 |
140 | In the case below
141 |
142 | - `x_value` is **global** to the function `times_x`
143 | - `new_value` and `input_value` are **local** to the function `times_x`
144 | - referring to `new_value` or `input_value` from outside the definition of `times_x` would result in an error
145 |
146 | ```{r, echo=TRUE}
147 | x_value <- 10
148 | times_x <- function (input_value) {
149 |   new_value <- input_value * x_value
150 |   new_value
151 | }
152 | times_x(2)
153 | ```
154 |
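A sketch of the two remaining cases, using hypothetical names: a variable created inside a function is not visible outside it, whereas a variable created inside an `if` block is.

```{r, echo=TRUE}
scope_demo <- function () {
  local_value <- 1
  local_value
}
scope_demo()

# local_value only exists inside scope_demo
# (assuming no variable of that name was defined elsewhere in the script)
exists("local_value")

if (TRUE) {
  block_value <- 2
}
# block_value is still visible: if blocks do not create a new scope
block_value
```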
157 |
158 |
159 |
205 |
206 |
207 | ## Summary
208 |
209 | Functions
210 |
211 | - Defining functions
212 | - Scope of a variable
213 |
214 | **Next**: Practical session
215 |
216 | - Conditional statements
217 | - Loops
218 | - While
219 | - For
220 | - Functions
221 | - Loading functions from scripts
222 | - Debugging
223 |
224 | ```{r cleanup, include=FALSE}
225 | rm(list = ls())
226 | ```
--------------------------------------------------------------------------------
/src/lectures/contents/201_L_DataFrames.Rmd:
--------------------------------------------------------------------------------
1 | ```{r setup, include=FALSE}
2 | knitr::opts_chunk$set(echo = FALSE)
3 | knitr::opts_knit$set(root.dir = Sys.getenv("GRANOLARR_HOME"))
4 | rm(list = ls())
5 | ```
6 |
7 |
8 |
9 | # Data Frames
10 |
11 |
12 |
13 | ## Recap
14 |
15 | **Prev**: R programming
16 |
17 | - 111 Lecture: Data types
18 | - 112 Lecture: Control structures
19 | - 113 Lecture: Functions
20 | - 114 Practical session
21 |
22 | **Now**: Data Frames
23 |
24 | - Data Frames
25 | - Tibbles
26 |
27 |
28 | ## Lists and named lists
29 |
30 | **List**
31 |
32 | - can contain elements of different types
33 |     - whereas elements of vectors are all of the same type
34 | - in **named lists**, each element has a name
35 |     - elements can be selected using the operator `$`
36 |
37 | ```{r, echo=TRUE}
38 | employee <- list(employee_name = "Stef", start_year = 2015)
39 | employee[[1]]
40 | employee$employee_name
41 | ```
42 |
43 |
44 |
45 | ## Data Frames
46 |
47 | A **data frame** is equivalent to a *named list* where all elements are *vectors of the same length*.
48 |
49 | ```{r, echo=TRUE}
50 | employees <- data.frame(
51 |   EmployeeName = c("Maria", "Pete", "Sarah"),
52 |   Age = c(47, 34, 32),
53 |   Role = c("Professor", "Researcher", "Researcher"))
54 | employees
55 | ```
56 |
57 | Data frames are the most common way to represent tabular data in R. Matrices and lists can be converted to data frames.
58 |
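The conversions mentioned above can be sketched with the base function `as.data.frame` (the matrix and the list below are made-up examples):

```{r, echo=TRUE}
# A matrix with two rows and three columns
a_matrix <- matrix(1:6, nrow = 2)
matrix_df <- as.data.frame(a_matrix)
matrix_df

# A named list whose elements are vectors of the same length
a_named_list <- list(
  city = c("Derby", "Leicester"),
  visited = c(TRUE, FALSE)
)
list_df <- as.data.frame(a_named_list)
list_df
```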
59 |
62 |
63 | ## Selection
64 |
65 | Selection is similar to vectors and lists.
66 |
67 | ```{r, echo=TRUE}
68 | employees[1, 1] # value selection
69 | employees[1, ] # row selection
70 | employees[, 1] # column selection
71 | ```
72 |
73 |
74 | ## Selection
75 |
76 | Selection is similar to vectors and lists.
77 |
78 | ```{r, echo=TRUE}
79 | employees$EmployeeName # column selection, as for named lists
80 | employees$EmployeeName[1]
81 | ```
82 |
83 |
84 |
85 | ## Table manipulation
86 |
87 | - Values can be assigned to cells
88 |     - using any selection method
89 |     - and the assignment operator `<-`
90 | - New columns can be defined
91 |     - assigning a vector to a new name
92 |
93 | ```{r, echo=TRUE}
94 | employees$Age[3] <- 33
95 | employees$Place <- c("Leicester", "Leicester", "Leicester")
96 | employees
97 | ```
98 |
99 |
100 |
101 | ## Column processing
102 |
103 | Operations can be performed on columns as if they were vectors
104 |
105 | ```{r, echo=TRUE}
106 | 10 - c(1, 2, 3)
107 | ```
108 |
109 | ```{r, echo=TRUE}
110 | # Use Sys.Date to retrieve the current year
111 | current_year <- as.integer(format(Sys.Date(), "%Y"))
112 |
113 | # Calculate employee year of birth
114 | employees$Year_of_birth <- current_year - employees$Age
115 | employees
116 | ```
117 |
118 |
119 |
120 | ## tibble
121 |
122 | A [tibble](https://tibble.tidyverse.org/) is a modern reimagining of the data.frame within `tidyverse`
123 |
124 | - they do less
125 |     - don’t change column names or types
126 |     - don’t do partial matching
127 | - complain more
128 |     - e.g. when referring to a column that does not exist
129 |
130 | That forces you to confront problems earlier, typically leading to cleaner, more expressive code.
131 |
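A small sketch of the first point, assuming the `tibble` library is installed: `data.frame` rewrites a column name that is not a valid R name, whereas `tibble` keeps it unchanged. The column name is a made-up example.

```{r, echo=TRUE}
library(tibble)

# data.frame converts the space into a dot
names(data.frame(`employee name` = c("Maria", "Pete")))

# tibble keeps the column name as given
names(tibble(`employee name` = c("Maria", "Pete")))
```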
132 |
133 |
134 | ## Summary
135 |
136 | Data Frames
137 |
138 | - Data Frames
139 | - Tibbles
140 |
141 | **Next**: Data selection and filtering
142 |
143 | - dplyr
144 | - dplyr::select
145 | - dplyr::filter
146 |
147 | ```{r cleanup, include=FALSE}
148 | rm(list = ls())
149 | ```
--------------------------------------------------------------------------------
/src/lectures/contents/202_L_SelectionFiltering.Rmd:
--------------------------------------------------------------------------------
1 | ```{r setup, include=FALSE}
2 | knitr::opts_chunk$set(echo = FALSE)
3 | knitr::opts_knit$set(root.dir = Sys.getenv("GRANOLARR_HOME"))
4 | rm(list = ls())
5 | ```
6 |
7 |
8 |
9 | # Selection and filtering
10 |
11 |
12 |
13 | ## Recap
14 |
15 | **Prev**: Data Frames
16 |
17 | - Data Frames
18 | - Tibbles
19 |
20 | **Now**: Data selection and filtering
21 |
22 | - dplyr
23 | - dplyr::select
24 | - dplyr::filter
25 |
26 |
27 |
28 | ## dplyr
29 |
30 | The `dplyr` (pronounced *dee-ply-er*) library is part of `tidyverse` and it offers a grammar for data manipulation
31 |
32 | - `select`: select specific columns
33 | - `filter`: select specific rows
34 | - `arrange`: arrange rows in a particular order
35 | - `summarise`: calculate aggregated values (e.g., mean, max, etc)
36 | - `group_by`: group data based on common column values
37 | - `mutate`: add columns
38 | - `join` functions (e.g., `left_join`): merge tables (`tibbles` or `data.frames`)
39 |
40 | ```{r, echo=TRUE, message=FALSE, warning=FALSE}
41 | library(tidyverse)
42 | ```
43 |
44 |
45 |
46 | ## Example dataset
47 |
48 | The library `nycflights13` contains a dataset storing data about all the flights that departed from New York City in 2013
49 |
54 | ```{r, echo=TRUE, eval=FALSE, message=FALSE, warning=FALSE}
55 | library(nycflights13)
56 |
57 | nycflights13::flights
58 | ```
59 | ```{r, echo=FALSE}
60 | library(nycflights13)
61 |
62 | nycflights13::flights %>%
63 | print(n= 3, width = 70)
64 | ```
65 |
66 |
67 |
68 | ## Selecting table columns
69 |
70 | Columns of **data frames** and **tibbles** can be selected
71 |
72 | - specifying the column index
73 |
74 | ```{r, echo=TRUE, eval=FALSE}
75 | nycflights13::flights[, c(13, 14)]
76 | ```
77 |
78 | - specifying the column name
79 |
80 | ```{r, echo=TRUE, eval=FALSE}
81 | nycflights13::flights[, c("origin", "dest")]
82 | ```
83 | ```{r, echo=FALSE}
84 | nycflights13::flights[, c("origin", "dest")] %>%
85 | print(n = 3)
86 | ```
87 |
88 |
89 |
90 | ## dplyr::select
91 |
92 | `select` can be used to specify which columns to retain
93 |
94 | ```{r, echo=TRUE, eval=FALSE}
95 | nycflights13::flights %>%
96 |   dplyr::select(
97 |     origin, dest, dep_delay, arr_delay, year:day
98 |   )
99 | ```
100 | ```{r, echo=FALSE}
101 | nycflights13::flights %>%
102 | dplyr::select(
103 | origin, dest, dep_delay, arr_delay, year:day
104 | ) %>%
105 | print(n = 5)
106 | ```
107 |
108 |
109 | ## dplyr::select
110 |
111 | ... or which ones to drop, using `-` in front of the column name
112 |
113 | ```{r, echo=TRUE, eval=FALSE}
114 | nycflights13::flights %>%
115 |   dplyr::select(origin, dest, dep_delay, arr_delay, year:day) %>%
116 |   dplyr::select(-arr_delay)
117 | ```
118 | ```{r, echo=FALSE}
119 | nycflights13::flights %>%
120 | dplyr::select(
121 | origin, dest, dep_delay, arr_delay, year:day
122 | ) %>%
123 | dplyr::select(
124 | -arr_delay
125 | ) %>%
126 | print(n = 3)
127 | ```
128 |
129 |
130 | ## Logical filtering
131 |
132 | Logical vectors can be used to filter a vector
133 |
134 | - i.e. to retain only certain values
135 |     - those where the corresponding logical value is `TRUE`
136 |
137 | ```{r, echo=TRUE}
138 | a_numeric_vector <- -3:3
139 | a_numeric_vector
140 | a_numeric_vector[c(FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, TRUE)]
141 | ```
142 |
143 |
144 |
145 | ## Conditional filtering
146 |
147 | As a conditional expression results in a logical vector...
148 |
149 | ```{r, echo=TRUE}
150 | a_numeric_vector > 0
151 | ```
152 |
153 |
154 | ... conditional expressions can be used for filtering
155 |
156 | ```{r, echo=TRUE}
157 | a_numeric_vector[a_numeric_vector > 0]
158 | ```
159 |
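Conditions can be combined using the logical operators `&` (and), `|` (or) and `!` (not). A minimal sketch on the same vector, where `between_values` and `extreme_values` are hypothetical names:

```{r, echo=TRUE}
a_numeric_vector <- -3:3

# Values greater than 0 and lower than 3
between_values <- a_numeric_vector[a_numeric_vector > 0 & a_numeric_vector < 3]
between_values

# Values lower than -2 or greater than 2
extreme_values <- a_numeric_vector[a_numeric_vector < -2 | a_numeric_vector > 2]
extreme_values
```

Note the space in `< -2`: written without it, `<-` would be read as the assignment operator.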
160 |
161 |
162 | ## Filtering data frames
163 |
164 | The same approach can be applied to **data frames** and **tibbles**
165 |
166 | ```{r, echo=TRUE, eval=FALSE}
167 | nycflights13::flights$month
168 | ```
169 | ```{r, echo=FALSE}
170 | capture.output(nycflights13::flights$month)[1] %>% str_trunc(52) %>% cat()
171 | ```
172 |
173 | ```{r, echo=TRUE, eval=FALSE}
174 | nycflights13::flights$month == 11
175 | ```
176 | ```{r, echo=FALSE}
177 | capture.output(nycflights13::flights$month == 11)[1] %>% str_trunc(52) %>% cat()
178 | ```
179 |
180 | ```{r, echo=TRUE, eval=FALSE}
181 | nycflights13::flights[nycflights13::flights$month == 11, ]
182 | ```
183 | ```{r, echo=FALSE}
184 | nycflights13::flights[nycflights13::flights$month == 11, ] %>%
185 | print(n = 1, width = 52)
186 | ```
187 |
188 |
189 |
190 | ## dplyr::filter
191 |
192 | ```{r, echo=TRUE, eval=FALSE}
193 | nycflights13::flights %>%
194 |   # Flights in November
195 |   dplyr::filter(month == 11)
196 | ```
197 | ```{r, echo=FALSE}
198 | nycflights13::flights %>%
199 | dplyr::filter(month == 11) %>%
200 | print(n = 3, width = 52)
201 | ```
202 |
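Several conditions can be passed to `filter`, separated by commas, and a row is retained only if all of them are `TRUE`. A sketch on a small made-up table (`toy_flights` and `nov_atl` are hypothetical names), assuming `dplyr` and `tibble` are installed:

```{r, echo=TRUE, message=FALSE, warning=FALSE}
library(dplyr)

# A made-up table with the same column names as the flights dataset
toy_flights <- tibble::tibble(
  month = c(11, 11, 12),
  dest = c("ATL", "ORD", "ATL"),
  dep_delay = c(5, -2, 30)
)

nov_atl <- toy_flights %>%
  # November flights to ATL
  dplyr::filter(month == 11, dest == "ATL")

nov_atl
```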
203 |
204 |
205 |
206 | ## Select and filter
207 |
208 | ```{r, echo=TRUE, eval=FALSE}
209 | nycflights13::flights %>%
210 |   # Select the columns you need
211 |   dplyr::select(origin, dest, dep_delay, arr_delay, year:day) %>%
212 |   # Drop arr_delay... because you don't need it after all
213 |   dplyr::select(-arr_delay) %>%
214 |   # Keep only November flights
215 |   dplyr::filter(month == 11)
216 | ```
217 | ```{r, echo=FALSE}
218 | nycflights13::flights %>%
219 | dplyr::select(origin, dest, dep_delay, arr_delay, year:day) %>%
220 | dplyr::select(-arr_delay) %>%
221 | dplyr::filter(month == 11) %>%
222 | print(n = 3, width = 52)
223 | ```
224 |
225 |
226 |
227 | ## Summary
228 |
229 | Data selection and filtering
230 |
231 | - dplyr
232 | - dplyr::select
233 | - dplyr::filter
234 |
235 | **Next**: Data manipulation
236 |
237 | - dplyr::arrange
238 | - dplyr::summarise
239 | - dplyr::group_by
240 | - dplyr::mutate
241 |
242 |
243 |
244 | ```{r cleanup, include=FALSE}
245 | rm(list = ls())
246 | ```
--------------------------------------------------------------------------------
/src/lectures/contents/203_L_DataManipulation.Rmd:
--------------------------------------------------------------------------------
1 | ```{r setup, include=FALSE}
2 | knitr::opts_chunk$set(echo = FALSE)
3 | knitr::opts_knit$set(root.dir = Sys.getenv("GRANOLARR_HOME"))
4 | rm(list = ls())
5 | ```
6 |
7 |
8 |
9 | # Data manipulation
10 |
11 |
12 |
13 | ## Recap
14 |
15 | **Prev**: Data selection and filtering
16 |
17 | - dplyr
18 | - dplyr::select
19 | - dplyr::filter
20 |
21 | **Now**: Data manipulation
22 |
23 | - dplyr::arrange
24 | - dplyr::summarise
25 | - dplyr::group_by
26 | - dplyr::mutate
27 |
28 |
29 | ## Example
30 |
31 | ```{r, echo=TRUE, eval=FALSE}
32 | library(tidyverse)
33 | library(nycflights13)
34 |
35 | nov_dep_delays <-
36 | nycflights13::flights %>%
37 | dplyr::select(origin, dest, dep_delay, year:day) %>%
38 | dplyr::filter(month == 11)
39 |
40 | nov_dep_delays
41 | ```
42 | ```{r, echo=FALSE, message=FALSE, warning=FALSE}
43 | library(tidyverse)
44 | library(nycflights13)
45 |
46 | nov_dep_delays <-
47 | nycflights13::flights %>%
48 | dplyr::select(origin, dest, dep_delay, year:day) %>%
49 | dplyr::filter(month == 11)
50 |
51 | nov_dep_delays %>% print(n = 3)
52 | ```
53 |
54 |
55 |
56 | ## dplyr::arrange
57 |
58 | Arranges rows in a particular order
59 |
60 | - descending order is specified using `-` (minus symbol) before a numeric column
61 |
62 | ```{r, echo=TRUE, eval=FALSE}
63 | nov_dep_delays %>%
64 | dplyr::arrange(
65 | # Ascending destination name
66 | dest,
67 | # Descending delay
68 | -dep_delay
69 | )
70 | ```
71 | ```{r, echo=FALSE, message=FALSE, warning=FALSE}
72 | nov_dep_delays %>%
73 | dplyr::arrange(
74 | dest, # Ascending destination name
75 | -dep_delay # Descending delay
76 | ) %>%
77 | print(n = 2)
78 | ```
79 |
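The minus sign only works on numeric columns. A sketch of the more general `dplyr::desc`, assuming a made-up `toy_delays` table (not part of the lecture data):

```r
library(dplyr)

# Hypothetical toy data
toy_delays <- data.frame(
  dest      = c("LAX", "ATL", "ATL", "BOS"),
  carrier   = c("AA", "DL", "UA", "B6"),
  dep_delay = c(30, 5, 12, -3)
)

# -dep_delay and dplyr::desc(dep_delay) give the same order for numbers
by_minus <- toy_delays %>% dplyr::arrange(dest, -dep_delay)
by_desc  <- toy_delays %>% dplyr::arrange(dest, dplyr::desc(dep_delay))

# desc() also works on character columns, where unary minus would fail
by_carrier <- toy_delays %>% dplyr::arrange(dplyr::desc(carrier))
```

Using `desc` throughout keeps the code uniform when sorting on a mix of numeric and text columns.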
80 |
81 |
82 | ## dplyr::summarise
83 |
84 | Calculates aggregated values
85 |
86 | - e.g., using functions such as mean, max, etc.
87 |
88 | ```{r, echo=TRUE}
89 | nov_dep_delays %>%
90 | # Need to filter out rows where delay is NA
91 | dplyr::filter(!is.na(dep_delay)) %>%
92 | # Create two aggregated columns
93 | dplyr::summarise(
94 | avg_dep_delay = mean(dep_delay),
95 | tot_dep_delay = sum(dep_delay)
96 | )
97 | ```
98 |
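As an alternative to filtering beforehand, `mean` and `sum` accept `na.rm = TRUE`. A sketch on made-up values (`toy_delays` is hypothetical):

```r
library(dplyr)

# Hypothetical toy data with one missing delay
toy_delays <- data.frame(dep_delay = c(10, NA, 20, -6))

# Without handling NA, the aggregated value is NA
with_na <- toy_delays %>%
  dplyr::summarise(avg_dep_delay = mean(dep_delay))

# na.rm = TRUE drops missing values inside each aggregation function
without_na <- toy_delays %>%
  dplyr::summarise(
    avg_dep_delay = mean(dep_delay, na.rm = TRUE),
    tot_dep_delay = sum(dep_delay, na.rm = TRUE),
    n_missing     = sum(is.na(dep_delay))
  )
```

Keeping the rows also makes it possible to report how many values were missing, as `n_missing` does here.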
99 |
100 |
101 | ## dplyr::group_by
102 |
103 | Groups rows based on common values for specified column(s)
104 |
105 | - combined with `summarise`, calculates aggregated values per group
106 |
107 | ```{r, echo=TRUE, eval=FALSE}
108 | nov_dep_delays %>%
109 |   # First group by same destination
110 |   dplyr::group_by(dest) %>%
111 |   # Drop rows where delay is NA, then calculate aggregated value
112 |   dplyr::filter(!is.na(dep_delay)) %>%
113 |   dplyr::summarise(tot_dep_delay = sum(dep_delay))
114 | ```
115 | ```{r, echo=FALSE, message=FALSE, warning=FALSE}
116 | nov_dep_delays %>%
117 | # First group by same destination
118 | dplyr::group_by(dest) %>%
119 | # Need to filter out rows where delay is NA
120 | dplyr::filter(!is.na(dep_delay)) %>%
121 | # Then calculate aggregated value
122 | dplyr::summarise(
123 | tot_dep_delay = sum(dep_delay)
124 | ) %>%
125 | print(n = 2)
126 | ```
127 |
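Several aggregated values can be computed per group in one `summarise` call. A minimal sketch, assuming a made-up `toy_delays` table:

```r
library(dplyr)

# Hypothetical toy data
toy_delays <- data.frame(
  dest      = c("ATL", "ATL", "LAX", "LAX", "LAX"),
  dep_delay = c(5, 15, NA, 30, 0)
)

per_dest <- toy_delays %>%
  # Drop rows where delay is NA
  dplyr::filter(!is.na(dep_delay)) %>%
  # One group per destination
  dplyr::group_by(dest) %>%
  # One output row per group
  dplyr::summarise(
    n_flights     = dplyr::n(),
    avg_dep_delay = mean(dep_delay),
    tot_dep_delay = sum(dep_delay)
  )
```

The result has one row per destination, here one for `ATL` and one for `LAX`.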
128 |
129 |
130 | ## dplyr::tally and dplyr::count
131 |
132 | - `dplyr::tally` short-hand for `summarise` with `n()`
133 |     - number of rows
134 | - `dplyr::count` short-hand for `group_by` and `tally`
135 |     - number of rows per group
136 |
137 |
138 | ```{r, echo=TRUE, eval=FALSE}
139 | nov_dep_delays %>%
140 | # Count flights by same destination
141 | dplyr::count(dest)
142 | ```
143 | ```{r, echo=FALSE, message=FALSE, warning=FALSE}
144 | nov_dep_delays %>%
145 | # Count flights by same destination
146 | dplyr::count(dest) %>%
147 | print(n = 3)
148 | ```
149 |
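The three forms below should give the same counts. A sketch on a made-up `toy_delays` table, to make the short-hands explicit:

```r
library(dplyr)

# Hypothetical toy data
toy_delays <- data.frame(dest = c("ATL", "ATL", "LAX", "LAX", "LAX"))

# Long form: group, then summarise with n()
long_form <- toy_delays %>%
  dplyr::group_by(dest) %>%
  dplyr::summarise(n = dplyr::n())

# tally() replaces summarise(n = n())
with_tally <- toy_delays %>%
  dplyr::group_by(dest) %>%
  dplyr::tally()

# count() replaces group_by() followed by tally()
with_count <- toy_delays %>%
  dplyr::count(dest)
```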
150 |
151 | ## dplyr::mutate
152 |
153 | Calculates values for new columns based on existing columns
154 |
155 | ```{r, echo=TRUE, eval=FALSE}
156 | nov_dep_delays %>%
157 |   dplyr::mutate(
158 |     # Combine origin and destination into one column
159 |     orig_dest = stringr::str_c(origin, dest, sep = "->"),
160 |     # Departure delay in days (rather than minutes)
161 |     delay_days = ((dep_delay / 60) / 24)
162 |   )
163 | ```
164 | ```{r, echo=FALSE, message=FALSE, warning=FALSE}
165 | nov_dep_delays %>%
166 |   dplyr::mutate(
167 |     orig_dest = stringr::str_c(origin, dest, sep = "->"),
168 |     delay_days = ((dep_delay / 60) / 24)
169 |   ) %>%
170 |   print(n = 3)
171 | ```
172 |
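Columns created in a `mutate` call can be reused later in the same call. A sketch on made-up values, using base `paste` in place of `str_c` only to keep the example free of extra packages:

```r
library(dplyr)

# Hypothetical toy data: 1440 minutes is exactly one day
toy_delays <- data.frame(
  origin    = c("JFK", "EWR"),
  dest      = c("ATL", "LAX"),
  dep_delay = c(1440, 90)
)

with_new_cols <- toy_delays %>%
  dplyr::mutate(
    orig_dest   = paste(origin, dest, sep = "->"),
    delay_hours = dep_delay / 60,
    # delay_hours was created one line above and can be reused here
    delay_days  = delay_hours / 24
  )
```

Splitting the conversion into two named steps makes the unit of each column explicit.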
173 |
174 | ## Full pipe example
175 |
176 | ```{r, echo=TRUE, eval=FALSE}
177 | nycflights13::flights %>%
178 | dplyr::select(
179 | origin, dest, dep_delay, arr_delay,
180 | year:day
181 | ) %>%
182 | dplyr::select(-arr_delay) %>%
183 | dplyr::filter(month == 11) %>%
184 | dplyr::filter(!is.na(dep_delay)) %>%
185 | dplyr::arrange(dest, -dep_delay) %>%
186 | dplyr::group_by(dest) %>%
187 | dplyr::summarise(
188 | tot_dep_delay = sum(dep_delay)
189 | ) %>%
190 | dplyr::mutate(
191 | tot_dep_delay_days = ((tot_dep_delay / 60) / 24)
192 | )
193 | ```
194 |
195 |
196 |
197 | ## Full pipe example
198 |
199 | ```{r, echo=FALSE, message=FALSE, warning=FALSE}
200 | nycflights13::flights %>%
201 | dplyr::select(
202 | origin, dest, dep_delay, arr_delay,
203 | year:day
204 | ) %>%
205 | dplyr::select(-arr_delay) %>%
206 | dplyr::filter(month == 11) %>%
207 | dplyr::filter(!is.na(dep_delay)) %>%
208 | dplyr::arrange(dest, -dep_delay) %>%
209 | dplyr::group_by(dest) %>%
210 | dplyr::summarise(
211 | tot_dep_delay = sum(dep_delay)
212 | ) %>%
213 | dplyr::mutate(
214 | tot_dep_delay_days = ((tot_dep_delay / 60) / 24)
215 | )
216 | ```
217 |
218 |
219 |
220 | ## Summary
221 |
222 | Data manipulation
223 |
224 | - dplyr::arrange
225 | - dplyr::summarise
226 | - dplyr::group_by
227 | - dplyr::mutate
228 |
229 | **Next**: Practical session
230 |
231 | - Creating R projects
232 | - Creating R scripts
233 | - Data wrangling script
234 |
235 | ```{r cleanup, include=FALSE}
236 | rm(list = ls())
237 | ```
--------------------------------------------------------------------------------
/src/lectures/contents/211_L_DataJoin.Rmd:
--------------------------------------------------------------------------------
1 | ```{r setup, include=FALSE}
2 | knitr::opts_chunk$set(echo = FALSE)
3 | knitr::opts_knit$set(root.dir = Sys.getenv("GRANOLARR_HOME"))
4 | rm(list = ls())
5 | ```
6 |
7 |
8 |
9 | # Join operations
10 |
11 |
12 |
13 | ## Recap
14 |
15 | **Prev**: Selection and manipulation
16 |
17 | - Data frames and tibbles
18 | - Data selection and filtering
19 | - Data manipulation
20 |
21 | **Now**: Join operations
22 |
23 | - Joining data
24 | - dplyr join functions
25 |
26 |
27 | ## Example
28 |
29 | ```{r, echo=TRUE}
30 | cities <- data.frame(
31 | city_name = c("Barcelona", "London", "Rome", "Los Angeles"),
32 | country_name = c("Spain", "UK", "Italy", "US"),
33 | city_pop_M = c(1.62, 8.98, 4.34, 3.99)
34 | )
35 |
36 | cities_area <- data.frame(
37 | city_name = c("Barcelona", "London", "Rome", "Munich"),
38 | city_area_km2 = c(101, 1572, 496, 310)
39 | )
40 | ```
41 |
42 | ## Example
43 |
44 | ```{r, echo=FALSE}
45 | library(knitr)
46 |
47 | knitr::kable(cities)
48 | ```
49 |
50 |
51 |
52 | ```{r, echo=FALSE}
53 | knitr::kable(cities_area)
54 | ```
55 |
56 |
57 |
58 |
59 |
60 | ## Joining data
61 |
62 | Tables can be joined (or 'merged')
63 |
64 | - information from two tables can be combined
65 | - specifying **column(s) from two tables with common values**
66 | - usually one with a unique identifier of an entity
67 | - rows having the same value are joined
68 | - depending on parameters
69 | - a row from one table can be merged with multiple rows from the other table
70 | - rows with no matching values in the other table can be retained
71 | - `merge` base function or join functions in `dplyr`
72 |
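A sketch of how the base `merge` function covers these cases, reusing the two example tables defined earlier (repeated here so the block runs on its own):

```r
cities <- data.frame(
  city_name = c("Barcelona", "London", "Rome", "Los Angeles"),
  country_name = c("Spain", "UK", "Italy", "US"),
  city_pop_M = c(1.62, 8.98, 4.34, 3.99)
)

cities_area <- data.frame(
  city_name = c("Barcelona", "London", "Rome", "Munich"),
  city_area_km2 = c(101, 1572, 496, 310)
)

# Default: only rows with a match in both tables (inner join)
inner <- merge(cities, cities_area, by = "city_name")

# all.x = TRUE retains unmatched rows from the first table (left join)
left <- merge(cities, cities_area, by = "city_name", all.x = TRUE)

# all = TRUE retains unmatched rows from both tables (full join)
full <- merge(cities, cities_area, by = "city_name", all = TRUE)
```

Unmatched rows that are retained get `NA` in the columns coming from the other table.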
73 |
74 |
75 | ## Join types
76 |
77 |
78 | {width=75%}
79 |
80 |
81 |
82 |
83 |
84 | ## dplyr joins
85 |
86 | `dplyr` provides [a series of join verbs](https://dplyr.tidyverse.org/reference/join.html)
87 |
88 | - **Mutating joins**
89 |     - `inner_join`: inner join
90 |     - `left_join`: left join
91 |     - `right_join`: right join
92 |     - `full_join`: full join
93 | - **Nesting joins**
94 |     - `nest_join`: all rows and columns from the left table, plus a column of tibbles with the matching rows from the right
95 | - **Filtering joins** (keep only columns from left)
96 |     - `semi_join`: rows from left where match with right
97 |     - `anti_join`: rows from left where no match with right
98 |
99 |
100 | ```{r, echo=FALSE, message=FALSE, warning=FALSE}
101 | library(tidyverse)
102 | ```
103 |
104 |
105 | ## dplyr::full_join
106 |
107 | - `full_join` combines all the available data
108 |
109 | ```{r, echo=TRUE, eval=FALSE}
110 | dplyr::full_join(
111 | # first argument, left table
112 | # second argument, right table
113 | cities, cities_area,
114 | # specify which column to be matched
115 | by = c("city_name" = "city_name")
116 | )
117 | ```
118 | ```{r, echo=FALSE, message=FALSE, warning=FALSE}
119 | cities %>%
120 | dplyr::full_join(cities_area) %>%
121 | knitr::kable()
122 | ```
123 |
124 |
125 | ## Pipes and shorthands
126 |
127 | When using (all) join verbs in `dplyr`
128 |
129 | ```{r, echo=TRUE, eval=FALSE}
130 | # using pipe, left table is "coming down the pipe"
131 | cities %>%
132 | dplyr::full_join(cities_area, by = c("city_name" = "city_name"))
133 |
134 | # if no columns specified, columns with the same name are matched
135 | cities %>%
136 | dplyr::full_join(cities_area)
137 | ```
138 | ```{r, echo=FALSE, message=FALSE, warning=FALSE}
139 | cities %>%
140 | dplyr::full_join(cities_area) %>%
141 | knitr::kable()
142 | ```
143 |
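The named form of `by` matters when the key columns have different names in the two tables. A sketch, assuming a hypothetical `areas` table whose key column is called `name`:

```r
library(dplyr)

cities <- data.frame(
  city_name = c("Barcelona", "London"),
  city_pop_M = c(1.62, 8.98)
)

# Hypothetical variant of cities_area with a differently named key column
areas <- data.frame(
  name = c("London", "Munich"),
  city_area_km2 = c(1572, 310)
)

# Left-hand name refers to the left table, right-hand name to the right table
joined <- cities %>%
  dplyr::full_join(areas, by = c("city_name" = "name"))
```

The key column in the result takes its name from the left table.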
144 |
145 |
146 | ## dplyr::left_join
147 |
148 | - keeps all the data from the **left** table
149 | - first argument or *"coming down the pipe"*
150 | - rows from the right table without a match are dropped
151 | - second argument (or first when using *pipes*)
152 |
153 | ```{r, echo=TRUE, eval=FALSE}
154 | cities %>%
155 | dplyr::left_join(cities_area)
156 | ```
157 | ```{r, echo=FALSE, message=FALSE, warning=FALSE}
158 | cities %>%
159 | dplyr::left_join(cities_area) %>%
160 | knitr::kable()
161 | ```
162 |
163 |
164 |
165 | ## dplyr::right_join
166 |
167 | - keeps all the data from the **right** table
168 | - second argument (or first when using *pipes*)
169 | - rows from the left table without a match are dropped
170 | - first argument or *"coming down the pipe"*
171 |
172 | ```{r, echo=TRUE, eval=FALSE}
173 | cities %>%
174 | dplyr::right_join(cities_area)
175 | ```
176 | ```{r, echo=FALSE, message=FALSE, warning=FALSE}
177 | cities %>%
178 | dplyr::right_join(cities_area) %>%
179 | knitr::kable()
180 | ```
181 |
182 |
183 |
184 |
185 | ## dplyr::inner_join
186 |
187 | - keeps only rows that have a match in **both** tables
188 | - rows without a match either way are dropped
189 |
190 | ```{r, echo=TRUE, eval=FALSE}
191 | cities %>%
192 | dplyr::inner_join(cities_area)
193 | ```
194 | ```{r, echo=FALSE, message=FALSE, warning=FALSE}
195 | cities %>%
196 | dplyr::inner_join(cities_area) %>%
197 | knitr::kable()
198 | ```
199 |
200 |
201 |
202 | ## dplyr::semi_join and anti_join
203 |
204 | ```{r, echo=TRUE, eval=FALSE}
205 | cities %>%
206 | dplyr::semi_join(cities_area)
207 | ```
208 | ```{r, echo=FALSE, message=FALSE, warning=FALSE}
209 | cities %>%
210 | dplyr::semi_join(cities_area) %>%
211 | knitr::kable()
212 | ```
213 |
214 |
215 |
216 | ```{r, echo=TRUE, eval=FALSE}
217 | cities %>%
218 | dplyr::anti_join(cities_area)
219 | ```
220 | ```{r, echo=FALSE, message=FALSE, warning=FALSE}
221 | cities %>%
222 | dplyr::anti_join(cities_area) %>%
223 | knitr::kable()
224 | ```
225 |
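Unlike the mutating joins, the filtering joins never add columns from the right table. A sketch on the two example tables (repeated here so the block runs on its own):

```r
library(dplyr)

cities <- data.frame(
  city_name = c("Barcelona", "London", "Rome", "Los Angeles"),
  country_name = c("Spain", "UK", "Italy", "US"),
  city_pop_M = c(1.62, 8.98, 4.34, 3.99)
)

cities_area <- data.frame(
  city_name = c("Barcelona", "London", "Rome", "Munich"),
  city_area_km2 = c(101, 1572, 496, 310)
)

# Cities for which an area is available
matched <- cities %>%
  dplyr::semi_join(cities_area, by = "city_name")

# Cities for which no area is available
unmatched <- cities %>%
  dplyr::anti_join(cities_area, by = "city_name")
```

An `anti_join` like this is a quick way to check which rows would be lost by an `inner_join`.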
226 |
227 |
228 | ## Summary
229 |
230 | Join operations
231 |
232 | - Joining data
233 | - dplyr join functions
234 |
235 | **Next**: Tidy-up your data
236 |
237 | - Wide and long data
238 | - Re-shape data
239 | - Handle missing values
240 |
241 | ```{r cleanup, include=FALSE}
242 | rm(list = ls())
243 | ```
--------------------------------------------------------------------------------
/src/lectures/contents/221_L_Reproducibility.Rmd:
--------------------------------------------------------------------------------
1 | ```{r setup, include=FALSE}
2 | knitr::opts_chunk$set(echo = FALSE)
3 | knitr::opts_knit$set(root.dir = Sys.getenv("GRANOLARR_HOME"))
4 | rm(list = ls())
5 | ```
6 |
7 |
8 |
9 | # Reproducibility
10 |
11 |
12 |
13 | ## Recap
14 |
15 | **Prev**: Table operations
16 |
17 | - 211 Join operations
18 | - 212 Data pivot
19 | - 213 Read and write data
20 | - 214 Practical session
21 |
22 | **Now**: Reproducibility
23 |
24 | - Reproducibility and software engineering
25 | - Reproducibility in GIScience
26 | - Guidelines
27 |
28 |
29 | ## Reproducibility
30 |
31 | In quantitative research, an analysis or project is considered to be **reproducible** if:
32 |
33 | - *"the data and code used to make a finding are available and they are sufficient for an independent researcher to recreate the finding."* [Christopher Gandrud, *Reproducible Research with R and R Studio*](https://www.crcpress.com/Reproducible-Research-with-R-and-R-Studio/Gandrud/p/book/9781498715379)
34 |
35 | Reproducibility is becoming more and more important in science:
36 |
37 | - as programming and scripting are becoming integral in most disciplines
38 | - as the amount of data increases
39 |
40 |
41 |
42 | ## Why?
43 |
44 | In **scientific research**:
45 |
46 | - verifiability of claims through replication
47 | - incremental work, avoid duplication
48 |
49 | For your **working practice**:
50 |
51 | - better working practices
52 | - coding
53 | - project structure
54 | - versioning
55 | - better teamwork
56 | - higher impact (not just results, but code, data, etc.)
57 |
58 |
59 |
60 | ## Reproducibility and software engineering
61 |
62 | Core aspects of **software engineering** are:
63 |
64 | - project design
65 | - software **readability**
66 | - testing
67 | - **versioning**
68 |
69 | As programming becomes integral to research, similar necessities arise among scientists and data analysts.
70 |
71 |
72 |
73 | ## Reproducibility and "big data"
74 |
75 | There has been a lot of discussion about **"big data"**...
76 |
77 | - volume, velocity, variety, ...
78 |
79 | Beyond the hype of the moment, as the **amount** and **complexity** of data increases
80 |
81 | - the time required to replicate an analysis using point-and-click software becomes unsustainable
82 | - room for error increases
83 |
84 | Workflow management software (e.g., ArcGIS ModelBuilder) is one answer; reproducible data analysis based on scripting languages like R is another.
85 |
86 |
87 |
88 | ## Reproducibility in GIScience
89 |
90 | [Singleton *et al.*](https://www.tandfonline.com/doi/abs/10.1080/13658816.2015.1137579) have discussed the issue of reproducibility in GIScience, identifying the following best practices:
91 |
92 | 1. Data should be accessible within the public domain and available to researchers.
93 | 2. Software used should have open code and be scrutable.
94 | 3. Workflows should be public and link data, software, methods of analysis and presentation with discursive narrative.
95 | 4. The peer review process and academic publishing should require submission of a workflow model and ideally open archiving of those materials necessary for
96 | replication.
97 | 5. Where full reproducibility is not possible (e.g., commercial software or sensitive data), aim to adopt those aspects attainable within the circumstances.
98 |
99 |
100 |
101 | ## Document everything
102 |
103 | In order to be reproducible, every step of your project should be documented in detail
104 |
105 | - data gathering
106 | - data analysis
107 | - results presentation
108 |
109 | Well-documented R scripts are an excellent way to document your project.
110 |
111 |
112 |
113 | ## Document well
114 |
115 | Create code that can be **easily understood** by someone outside your project, including yourself in six months' time!
116 |
117 | - use a style guide (e.g. [tidyverse](http://style.tidyverse.org/)) consistently
118 | - also add a **comment** before any line that could be ambiguous or particularly difficult or important
119 | - add a **comment** before each code block, describing what the code does
120 | - add a **comment** at the beginning of a file, including
121 | - date
122 | - contributors
123 | - other files the current file depends on
124 | - materials, sources and other references
125 |
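A sketch of what such comments might look like in a short script. The file name, date, contributor and values are placeholders, not taken from a real project:

```r
# ------------------------------------------------------------
# clean_delays.R (hypothetical file name)
# Date: 2020-01-15 (placeholder)
# Contributors: A. Analyst (placeholder)
# Depends on: no other files in this sketch
# Sources: values below are made up for illustration
# ------------------------------------------------------------

# Create a small table of departure delays in minutes
delays <- data.frame(
  dest = c("ATL", "LAX", "BOS"),
  dep_delay_min = c(12, NA, 45)
)

# Keep only rows with a recorded delay
# (NA means that no departure delay was recorded for the flight)
delays_clean <- delays[!is.na(delays$dep_delay_min), ]
```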
126 |
127 |
128 | ## Workflow
129 |
130 | Relationships between files in a project are not simple:
131 |
132 | - in which order are files executed?
133 | - when to copy files from one folder to another, and where?
134 |
135 | A common solution is using **make files**
136 |
137 | - commonly written in *bash* on Linux systems
138 | - they can be written in R, using commands like
139 | - *source* to execute R scripts
140 | - *system* to interact with the operating system
141 |
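A minimal sketch of a make file written in R. The two step scripts are hypothetical and are written to a temporary folder only so that the block runs on its own; a real project would already contain them:

```r
# Hypothetical project steps, created here for illustration only
project_dir <- file.path(tempdir(), "make_sketch")
dir.create(project_dir, showWarnings = FALSE)

writeLines(
  "prepared <- data.frame(x = 1:3)",
  file.path(project_dir, "01_prepare.R")
)
writeLines(
  "result <- sum(prepared$x)",
  file.path(project_dir, "02_analyse.R")
)

# The make file proper: run each step in an explicit order
steps <- c("01_prepare.R", "02_analyse.R")
for (step in steps) {
  cat(">>> Running", step, "<<<\n")
  source(file.path(project_dir, step))
}
```

Listing the steps in one vector documents the execution order, which answers the first question above.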
142 |
143 |
144 | ## granolarr Make.R
145 |
146 | Section of the [*granolarr*](https://sdesabbata.github.io/granolarr/) project make file [Make.R](https://github.com/sdesabbata/granolarr/blob/master/Make.R) that generates the current slides for the lecture session 221
147 |
148 | ```{}
149 | cat("\n\n>>> Rendering 221_L_Reproducibility.Rmd <<<\n\n")
150 | rmarkdown::render(
151 | paste0(
152 | Sys.getenv("GRANOLARR_HOME"),
153 | "/src/lectures/221_L_Reproducibility.Rmd"
154 | ),
155 | quiet = TRUE,
156 | output_dir = paste0(
157 | Sys.getenv("GRANOLARR_HOME"),
158 | "/docs/lectures/html"
159 | )
160 | )
161 | ```
162 |
163 |
164 | ## Future-proof formats
165 |
166 | Complex formats (e.g., .docx, .xlsx, .shp, ArcGIS .mxd)
167 |
168 | - can become obsolete
169 | - are not always portable
170 | - usually require proprietary software
171 |
172 | Use the simplest format to **future-proof** your analysis. **Text files** are the most versatile
173 |
174 | - data: .txt, .csv, .tsv
175 | - analysis: R scripts, Python scripts
176 | - write-up: LaTeX, Markdown, HTML
177 |
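A sketch of storing a table as plain text and reading it back with base R functions. The table is made up and the file goes to a temporary folder:

```r
# Hypothetical example table
cities <- data.frame(
  city_name = c("Barcelona", "London"),
  city_pop_M = c(1.62, 8.98)
)

# Write to a comma-separated text file, without row names
csv_path <- file.path(tempdir(), "cities.csv")
write.csv(cities, csv_path, row.names = FALSE)

# Any text editor or statistical package can read the file back
cities_back <- read.csv(csv_path)
```

The resulting file contains a header line followed by one line per row, readable without any specific software.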
178 |
179 |
180 | ## Store and share
181 |
182 | Reproducible data analysis is particularly important when working in teams, to share and communicate your work.
183 |
184 | - [Dropbox](https://www.dropbox.com)
185 | - good option to work in teams, initially free
186 | - no versioning, branches
187 | - [Git](https://git-scm.com)
188 | - free and open-source version control system
189 | - great to work in teams and share your work publicly
190 | - can be more difficult at first
191 | - [GitHub](https://github.com) offers free public and private repositories
192 | - [GitLab](https://about.gitlab.com/) offers free private repositories
193 |
194 |
195 |
196 | ## Summary
197 |
198 | Reproducibility
199 |
200 | - Reproducibility and software engineering
201 | - Reproducibility in GIScience
202 | - Guidelines
203 |
204 | **Next**: RMarkdown
205 |
206 | - Markdown
207 | - RMarkdown
208 |
209 | ```{r cleanup, include=FALSE}
210 | rm(list = ls())
211 | ```
--------------------------------------------------------------------------------
/src/lectures/contents/222_L_RMarkdown.Rmd:
--------------------------------------------------------------------------------
1 | ```{r setup, include=FALSE}
2 | knitr::opts_chunk$set(echo = FALSE)
3 | knitr::opts_knit$set(root.dir = Sys.getenv("GRANOLARR_HOME"))
4 | rm(list = ls())
5 | ```
6 |
7 |
8 |
9 | # RMarkdown
10 |
11 |
12 |
13 | ## Recap
14 |
15 | **Prev**: Reproducibility
16 |
17 | - Reproducibility and software engineering
18 | - Reproducibility in GIScience
19 | - Guidelines
20 |
21 | **Now**: RMarkdown
22 |
23 | - Markdown
24 | - RMarkdown
25 |
26 |
27 |
28 | ## Markdown
29 |
30 | **Markdown** is a simple markup language
31 |
32 | - allows marking up plain text
33 | - to specify more complex features (such as *italic text*)
34 | - using a very simple [syntax](https://daringfireball.net/projects/markdown/syntax)
35 |
36 | Markdown can be used in conjunction with numerous tools
37 |
38 | - to produce HTML pages
39 | - or even more complex formats (such as PDF)
40 |
41 | These slides are written in Markdown
42 |
43 |
44 |
45 | ## Markdown example code
46 |
47 | ```
48 | ### This is a third level heading
49 |
50 | Text can be specified as *italic* or **bold**
51 |
52 | - and lists can be created
53 | - very simply
54 |
55 | 1. also numbered lists
56 | 1. [add a link like this](http://le.ac.uk)
57 |
58 | |Tables |Can |Be |
59 | |-------|------------|---------|
60 | |a bit |complicated |at first |
61 | |but |it gets |easier |
62 | ```
63 |
64 |
65 |
66 | ## Markdown example output
67 |
68 | ### This is a third level heading
69 |
70 | Text can be specified as *italic* or **bold**
71 |
72 | - and lists can be created
73 | - very simply
74 |
75 | 1. also numbered lists
76 | 1. [add a link like this](http://le.ac.uk)
77 |
78 | |Tables |Can |Be |
79 | |-------|------------|---------|
80 | |a bit |complicated |at first |
81 | |but |it gets |easier |
82 |
83 |
84 |
85 | ## RMarkdown
86 |
87 | The [rmarkdown](https://rmarkdown.rstudio.com/docs/) library and its [RStudio plug-in](https://rmarkdown.rstudio.com/)
88 |
89 | - provide functionalities to *compile* scripts containing
90 | - **Markdown** text
91 | - rendered to documents (e.g., *.pdf* and *.docx*)
92 | - chunks of **R** code (other languages supported, e.g., Python, SQL)
93 | - included in output document
94 | - interpreted
95 | - results included in output document
96 |
97 | ````
98 | `r ''````{r, echo=TRUE}
99 | # Example of R chunk
100 | sqrt(2)
101 | `r ''````
102 | ````
103 |
104 |
105 | ## RMarkdown example
106 |
107 | Content of an RMarkdown file: `First_example.Rmd`
108 |
109 | ````
110 | This is an **RMarkdown** document. The *code chunk* below:
111 |
112 | - loads the necessary libraries
113 | - loads the flights from New York City in 2013
114 | - presents a few columns from the first row
115 |
116 | `r ''````{r, echo=TRUE, message=FALSE, warning=FALSE}
117 | library(tidyverse)
118 | library(nycflights13)
119 |
120 | nycflights13::flights %>%
121 | dplyr::select(year:day, origin, dest, flight) %>%
122 | dplyr::slice_head(n = 1) %>%
123 | knitr::kable()
124 | `r ''````
125 | ````
126 |
127 |
128 | ## RMarkdown example
129 |
130 | This is an **RMarkdown** document. The *code chunk* below:
131 |
132 | - loads the necessary libraries
133 | - loads the flights from New York City in 2013
134 | - presents a few columns from the first row
135 |
136 | ```{r, echo=TRUE, message=FALSE, warning=FALSE}
137 | library(tidyverse)
138 | library(nycflights13)
139 |
140 | nycflights13::flights %>%
141 | dplyr::select(year:day, origin, dest, flight) %>%
142 | dplyr::slice_head(n = 1) %>%
143 | knitr::kable()
144 | ```
145 |
146 |
147 | ## The Definitive Guide
148 |
149 | :::::: {.cols data-latex=""}
150 | ::: {.col style="width: 60%;" data-latex="{0.5\textwidth}"}
151 |
152 | Markdown is rather simple for a markup language, but it can still become fairly complex, especially when used in combination with R.
153 |
154 | For a complete guide to RMarkdown, please see:
155 |
156 | [R Markdown: The Definitive Guide](https://bookdown.org/yihui/rmarkdown/)
157 |
158 | by
159 | Yihui Xie,
160 | J. J. Allaire,
161 | Garrett Grolemund.
162 |
163 | :::
164 | ::: {.col style="width: 50%; text-align: right;" data-latex="{0.5\textwidth}"}
165 |
166 | 
167 |
168 | :::
169 | ::::::
170 |
171 | ## Summary
172 |
173 | RMarkdown
174 |
175 | - Markdown
176 | - RMarkdown
177 |
178 | **Next**: Git and Docker
179 |
180 | - Git operations
181 | - Git and RStudio
182 | - Docker
183 |
184 | ```{r cleanup, include=FALSE}
185 | rm(list = ls())
186 | ```
--------------------------------------------------------------------------------
/src/lectures/contents/223_L_Git.Rmd:
--------------------------------------------------------------------------------
1 | ```{r setup, include=FALSE}
2 | knitr::opts_chunk$set(echo = FALSE)
3 | knitr::opts_knit$set(root.dir = Sys.getenv("GRANOLARR_HOME"))
4 | rm(list = ls())
5 | ```
6 |
7 |
8 |
9 | # Git
10 |
11 |
12 | ## Recap
13 |
14 | **Prev**: RMarkdown
15 |
16 | - Markdown
17 | - RMarkdown
18 |
19 | **Now**: Git and Docker
20 |
21 | - Git operations
22 | - Git and RStudio
23 | - Docker
24 |
25 |
26 |
27 | ## What's git?
28 |
29 | **Git** is a free and open-source version control system
30 |
31 | - commonly used through a server
32 | - where a master copy of a project is kept
33 | - can also be used locally
34 | - allows storing versions of a project
35 | - synchronisation
36 | - consistency
37 | - history
38 | - multiple branches
39 |
40 |
41 |
42 | ## How git works
43 |
44 | A series of snapshots
45 |
46 | - each commit is a snapshot of all files
47 | - if no change to a file, link to previous commit
48 | - all history stored locally
49 |
50 |
58 |
59 |
60 |
61 | ## Three stages
62 |
63 | When working with a git repository
64 |
65 | - first check out the latest version
66 | - select the edits to stage
67 | - commit what has been staged in a permanent snapshot
68 |
69 |