├── .gitignore ├── CITATION.md ├── LICENSE.md ├── CONDUCT.md └── README.md /.gitignore: -------------------------------------------------------------------------------- 1 | *~ 2 | *.bak 3 | -------------------------------------------------------------------------------- /CITATION.md: -------------------------------------------------------------------------------- 1 | # Citation 2 | 3 | Please cite this work as: 4 | 5 | > Greg Wilson (ed.): "R for Data Science Instructors' Guide". , 2018. 6 | -------------------------------------------------------------------------------- /LICENSE.md: -------------------------------------------------------------------------------- 1 | # License 2 | 3 | *This is a human-readable summary of (and not a substitute for) the license. 4 | Please see for the full legal text.* 5 | 6 | This work is licensed under the Creative Commons Attribution 4.0 7 | International license (CC-BY-4.0). 8 | 9 | **You are free to:** 10 | 11 | - **Share**---copy and redistribute the material in any medium or 12 | format 13 | 14 | - **Remix**---remix, transform, and build upon the material for any 15 | purpose, even commercially. 16 | 17 | The licensor cannot revoke these freedoms as long as you follow the 18 | license terms. 19 | 20 | **Under the following terms:** 21 | 22 | - **Attribution**---You must give appropriate credit, provide a link 23 | to the license, and indicate if changes were made. You may do so in 24 | any reasonable manner, but not in any way that suggests the licensor 25 | endorses you or your use. 26 | 27 | - **No additional restrictions**---You may not apply legal terms or 28 | technological measures that legally restrict others from doing 29 | anything the license permits. 30 | 31 | **Notices:** 32 | 33 | You do not have to comply with the license for elements of the 34 | material in the public domain or where your use is permitted by an 35 | applicable exception or limitation. 36 | 37 | No warranties are given. The license may not give you all of the 38 | permissions necessary for your intended use. For example, other rights 39 | such as publicity, privacy, or moral rights may limit how you use the 40 | material. 41 | -------------------------------------------------------------------------------- /CONDUCT.md: -------------------------------------------------------------------------------- 1 | # Code of Conduct 2 | 3 | In the interest of fostering an open and welcoming environment, we as 4 | contributors and maintainers pledge to making participation in our 5 | project and our community a harassment-free experience for everyone, 6 | regardless of age, body size, disability, ethnicity, gender identity 7 | and expression, level of experience, education, socioeconomic status, 8 | nationality, personal appearance, race, religion, or sexual identity 9 | and orientation. 10 | 11 | ## Our Standards 12 | 13 | Examples of behavior that contributes to creating a positive 14 | environment include: 15 | 16 | * using welcoming and inclusive language, 17 | * being respectful of differing viewpoints and experiences, 18 | * gracefully accepting constructive criticism, 19 | * focusing on what is best for the community, and 20 | * showing empathy towards other community members. 
21 | 22 | Examples of unacceptable behavior by participants include: 23 | 24 | * the use of sexualized language or imagery and unwelcome sexual 25 | attention or advances, 26 | * trolling, insulting/derogatory comments, and personal or political 27 | attacks, 28 | * public or private harassment, 29 | * publishing others' private information, such as a physical or 30 | electronic address, without explicit permission, and 31 | * other conduct which could reasonably be considered inappropriate in 32 | a professional setting 33 | 34 | ## Our Responsibilities 35 | 36 | Project maintainers are responsible for clarifying the standards of 37 | acceptable behavior and are expected to take appropriate and fair 38 | corrective action in response to any instances of unacceptable 39 | behavior. 40 | 41 | Project maintainers have the right and responsibility to remove, edit, 42 | or reject comments, commits, code, wiki edits, issues, and other 43 | contributions that are not aligned to this Code of Conduct, or to ban 44 | temporarily or permanently any contributor for other behaviors that 45 | they deem inappropriate, threatening, offensive, or harmful. 46 | 47 | ## Scope 48 | 49 | This Code of Conduct applies both within project spaces and in public 50 | spaces when an individual is representing the project or its 51 | community. Examples of representing a project or community include 52 | using an official project e-mail address, posting via an official 53 | social media account, or acting as an appointed representative at an 54 | online or offline event. Representation of a project may be further 55 | defined and clarified by project maintainers. 56 | 57 | ## Enforcement 58 | 59 | Instances of abusive, harassing, or otherwise unacceptable behavior 60 | may be reported by [emailing the project team](mailto:gvwilson@third-bit.com). 61 | All complaints will be reviewed and investigated and will result in a 62 | response that is deemed necessary and appropriate to the 63 | circumstances. The project team is obligated to maintain 64 | confidentiality with regard to the reporter of an incident. Further 65 | details of specific enforcement policies may be posted separately. 66 | 67 | Project maintainers who do not follow or enforce the Code of Conduct 68 | in good faith may face temporary or permanent repercussions as 69 | determined by other members of the project's leadership. 70 | 71 | ## Attribution 72 | 73 | This Code of Conduct is adapted from the [Contributor 74 | Covenant](https://www.contributor-covenant.org) version 1.4. 75 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # *R for Data Science Instructor's Guide* 2 | 3 | **DRAFT:** notes for people teaching [R4DS](http://r4ds.had.co.nz/) 4 | with each chapter's learning objectives and key points. 5 | 6 | This work is licensed under a Creative Commons Attribution 4.0 International License. 7 | 8 | **References** 9 | 10 | Hadley Wickham and Garrett Grolemund: 11 | *R for Data Science: Import, Tidy, Transform, Visualize, and Model Data*. 12 | 1st ed., O'Reilly Media, 2017. 13 | 14 | ## Learner Personas 15 | 16 | **Nethira**, 27, is wrapping up a PhD in nursing and trying to decide 17 | whether to do a post-doc or take a data analyst position with an NGO 18 | in Tamil Nadu. 
She did two courses on statistics as an undergraduate, 19 | both using Stata, and picked up a bit of R from a labmate in grad 20 | school, but has never really come to grips with it as a tool. Nethira 21 | would like to improve her skills so that she can finish analyzing the 22 | data she collected for her thesis and get a couple of papers out, and 23 | to prepare herself for a possible change of career. These lessons 24 | will show her how to use the tidyverse in R to clean up, analyze, 25 | visualize, and model her data without working long nights or weekends. 26 | 27 | **Hannu**, 40, has worked as a traffic engineer for the Finnish 28 | Ministry of Transportation for the past 12 years, during which time he 29 | has become proficient with SQL and Python. As part of an open data 30 | initiative, his department has decided to build a traffic capacity 31 | dashboard using Shiny, and Hannu wants to learn the basics of modern R 32 | in a hurry so that he can join this project. These lessons will 33 | introduce him to the packages that make up the tidyverse, and prepare 34 | him for a deeper dive into more advanced R programming. 35 | 36 | Derived constraints: 37 | 38 | - Learners know what variables are, how to index a list, how loops and 39 | conditionals work, and grasp the basics of programming language syntax, such 40 | as how to write a string or a list (both). 41 | - Learners only have a shaky grasp of variable scope and the call stack, and 42 | will not understand closures or higher-order functions without detailed 43 | exposition (Nethira). 44 | - Learners know very basic statistics (mean, standard deviation, linear 45 | regression), but do not understand what a *p*-value is or why an observation 46 | can only be used once during hypothesis confirmation (Hannu). 47 | - Learners have 20-40 hours to work through this material. They may be able to 48 | ask more advanced friends or colleagues for help, but will primarily be 49 | learning on their own and by searching online (both). 50 | 51 | Note: definitions of terms are marked with `_single underscores_`, while other 52 | form of emphasis uses `*single asterisks*`. This makes it easy to extract 53 | definitions for glossary construction. 54 | 55 | ## 1. Introduction 56 | 57 | ### Objectives 58 | 59 | - Describe the steps in the basic data analysis cycle. 60 | - Explain the relative strengths and weaknesses of visualization and modeling. 61 | - Explain when techniques beyond those described in these lessons may be needed. 62 | - Explain the differences between hypothesis generation and hypothesis confirmation. 63 | - Describe and install the prerequisites for these lessons. 64 | - Explain where and how to get help. 65 | 66 | ### Key Points 67 | 68 | - The basic data analysis cycle is import, tidy, repeatedly transform, visualize, and model, and then communicate. 69 | - Visualizations provide novel insight, but don't scale well. 70 | - Models scale well, but cannot provide unexpected insight. 71 | - The techniques described in these lessons are good for tabular (rectangular) data up to about a gigabyte in size. 72 | - Hypothesis generation is the ad hoc process of exploring data to find possible hypotheses. 73 | An observation can be used many times during hypothesis generation. 74 | - Hypothesis confirmation is the rigorous application of mathematics to test falsifiable hypotheses. 75 | An observation can only be used once in hypothesis confirmation. 76 | - These lessons require R, RStudio, and a set of R packages called the tidyverse. 
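For example, a minimal setup sketch (run the install step once, from the console rather than from a saved script):

```
# Install the tidyverse meta-package once, from an interactive prompt.
install.packages("tidyverse")

# Load it at the start of every session or script.
library(tidyverse)
```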
77 | - Help can be found by:
78 |   - Typing `?name` at an interactive R prompt (where `name` identifies a package, function, or variable).
79 |   - Copying and pasting the error message into a web search.
80 |   - Searching on Stack Overflow.
81 | - When asking for help, create a reproducible example that:
82 |   - Loads packages.
83 |   - Includes a small amount of data.
84 |   - Has short, readable code.
85 |
86 | ## 2. Introduction
87 |
88 | - See above.
89 |
90 | ## 3. Data Visualization
91 |
92 | ### Objectives
93 |
94 | - Explain what a data frame is and how to access the ones included with the tidyverse.
95 | - Explore the properties of a data frame.
96 | - Explain what geometries and mappings are.
97 | - Create a basic visualization of a single data frame with a single geometry and a single mapping.
98 | - Create visualizations using the `x`, `y`, `color`, `size`, `alpha`, and `shape` properties.
99 | - Explain ways in which continuous and discrete variables should and shouldn't be visualized.
100 | - Explain what facets are and use them to display subsets of data in a single plot.
101 | - Create scatterplots and continuous line charts.
102 | - Create plots that represent data in two or more ways.
103 | - Describe three places in which the visual aspects of a plot can be specified.
104 | - Explain what a stat is in a plot, and how stats relate to geometries.
105 | - Create stacked and side-by-side bar charts.
106 | - Create a scatterplot to display data with many repeated values.
107 | - Flip the XY axes in a plot.
108 | - Use polar coordinates in a plot.
109 | - Describe the seven parameters to a full ggplot2 visualization.
110 |
111 | ### Key Points
112 |
113 | - A data frame is a rectangular set of observations (rows) of the same variables (columns).
114 | - Example data frames can be loaded using `library(tidyverse)` and then referred to by name (e.g., `mpg`) or by fully-qualified name (e.g., `ggplot2::mpg`).
115 | - To explore a data frame:
116 |   - Type the name of the frame to see its shape, column titles, and first few rows.
117 |   - Use `nrow(frame)` to get the number of rows.
118 |   - Use `ncol(frame)` to get the number of columns.
119 |   - Use `View(frame)` in RStudio to visualize the data frame as a table.
120 | - A geometry is an object that a plot uses to represent data, such as a scatterplot or a line.
121 | - A mapping describes how to connect features of a data frame to properties of a geometry, and is described by an aesthetic.
122 | - A very simple visualization has the form `ggplot(data = <DATA>) + <GEOM_FUNCTION>(mapping = aes(<MAPPINGS>))`.
123 | - Continuous variables should be visualized using smoothly-varying properties such as size and color.
124 | - Discrete variables should be visualized using properties such as shape and line type.
125 | - A facet is a subplot that displays a subset of the overall data.
126 | - Use `facet_wrap(<FORMULA>, nrow=<N>)` with a single discrete variable as a formula (such as `~COLUMN`)
127 |   to display a single sub-plot for each value of the discrete variable.
128 | - Use `facet_grid(<FORMULA>)` with a formula of two discrete variables (such as `FIRST ~ SECOND`)
129 |   to display a single sub-plot for each unique combination of the two variables.
130 |   - Use `. ~ COLUMN` or `COLUMN ~ .` as a formula to display facets in rows or columns only.
131 | - Use `geom_point` to create a scatterplot and `geom_smooth` to create a smoothed line fitted to the data.
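A small sketch that puts several of these pieces together, using the `mpg` data frame that ships with ggplot2:

```
library(tidyverse)

# Scatterplot of engine displacement against highway mileage, with a smoothed
# line layered on top and one facet per type of drive train.
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point(mapping = aes(color = class)) +
  geom_smooth() +
  facet_wrap(~ drv, nrow = 1)
```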
132 | - Add multiple geometries after the initial call to `ggplot`.
133 | - The visual aspects of a plot can be specified as follows (each overrides the one(s) before):
134 |   - Globally by specifying a value in the initial `ggplot` function.
135 |   - For a particular geometry by specifying a value outside an aesthetic.
136 |   - In a data-dependent way by setting a property of an aesthetic.
137 | - A stat performs a data transformation, such as counting the number of elements in a subset of the data.
138 | - Stats and geometries can often be used interchangeably since each stat has a default geometry and each geometry has a default stat.
139 | - Map one variable to `x` and another to `fill` in the aesthetic for `geom_bar` to create a stacked bar chart.
140 | - Set `position="dodge"` in `geom_bar` (outside the aesthetic) to create a side-by-side bar chart.
141 | - Use `position="jitter"` in `geom_point` (outside the aesthetic) to add randomization to a scatterplot to show data with duplication.
142 | - Add `coord_flip` to a visualization to flip the XY axes.
143 | - Add `coord_polar` to a visualization to use polar coordinates instead of Cartesian coordinates.
144 | - The seven parameters of a full ggplot2 visualization are:
145 |   - data: the data frame to be plotted
146 |   - geometry: how the data is to be displayed (e.g., scatterplot or line)
147 |   - mapping: how the properties of the data map to the properties of the geometry (e.g., which columns map to X and Y coordinates)
148 |   - stat: the transformation to apply to the data (e.g., count the number of observations)
149 |   - position: how to adjust the positions of displayed elements (e.g., jittering points in a scatterplot)
150 |   - coordinate function: whether to use Cartesian coordinates or polar coordinates
151 |   - facet: how to subset the data to create multiple subplots
152 |
153 | ```
154 | ggplot(data = <DATA>) +
155 |   <GEOM_FUNCTION>(
156 |     mapping = aes(<MAPPINGS>),
157 |     stat = <STAT>,
158 |     position = <POSITION>
159 |   ) +
160 |   <COORDINATE_FUNCTION> +
161 |   <FACET_FUNCTION>
162 | ```
163 |
164 | ## 4. Workflow: Basics
165 |
166 | ### Objectives
167 |
168 | - Assign values to variables.
169 | - Call functions.
170 | - Write readable code.
171 |
172 | ### Key Points
173 |
174 | - Use `name <- value` to assign a value to a variable (do not use `=`).
175 | - Use `function_name(value1, value2, ...)` to call a function.
176 | - Construct variable names out of words joined with underscores `like_this_example`.
177 |
178 | ## 5. Data Transformation
179 |
180 | ### Objectives
181 |
182 | - Describe the five basic data transformation operations in the tidyverse and explain their purpose.
183 | - Choose records by value using comparisons and logical operators.
184 | - Explain why filter conditions shouldn't compare floating-point numbers with `==`, and how to use `%in%` to match any of a set of values.
185 | - Explain the purpose of `NA`, how it affects arithmetic and logical operations, and how to test for it.
186 | - Explain how filtering treats `NA` and how to obtain different behavior.
187 | - Reorder records in ascending or descending order according to the values of one or more variables.
188 | - Select a subset of variables for all records by variable name.
189 | - Select and rename a subset of variables for all records by variable name.
190 | - Add new variables to a frame by deriving new values from existing ones.
191 | - Combine the values in a data frame to create one new value, or one new value per group.
192 | - Explain how summarization treats missing values (`NA`) and how to change this behavior.
193 | - Combine multiple transformations in order using pipe notation.
194 | - Explain why and how to include counts in summarization as a check on the validity of conclusions.
195 | - Name and describe half a dozen common summarization functions.
196 | - Explain the relationship between grouping and summarization.
197 |
198 | ### Key Points
199 |
200 | - The five basic data transformation operations in the tidyverse are:
201 |   - `filter`: choose records by value(s).
202 |   - `arrange`: reorder records.
203 |   - `select`: choose variables by name.
204 |   - `mutate`: derive new variables from existing ones.
205 |   - `summarize`: combine many values to create a single new value.
206 | - `filter(frame, ...criteria...)` keeps records that pass all of the specified criteria:
207 |   - `name == value`: records must have the specified value for the named variable (note `==` rather than `=`).
208 |   - `name > value`: the records' values must be greater than the given value (and similarly for `>=`, `!=`, `<`, and `<=`).
209 |   - `min_rank(name)` ranks values, giving the smallest values the smallest ranks.
210 | - Use `near(expression, value)` to compare floating-point numbers rather than `==`.
211 | - Use `&` (and) to require both conditions, `|` (or) to accept either condition, and `!` (not) to invert the sense of a condition.
212 | - Use `name %in% c(value1, value2, ...)` to accept any of a fixed set of values for a variable.
213 | - `NA` (meaning "not available") represents unknown values.
214 | - Most operations involving `NA` produce `NA`, because there's no way to know the output without knowing the input.
215 | - Use `is.na(value)` to determine if a value is `NA`.
216 | - `filter` discards records with `FALSE` and `NA` results in tests.
217 |   - Use `is.na(value)` to include `NA`'s explicitly.
218 | - Use `desc(name)` to order by descending value of the variable `name` instead of by ascending value.
219 | - Use `select(frame, name1, name2, ...)` to select only the named variables from the given frame.
220 | - Use `name1:name2` to select all variables from `name1` to `name2` (inclusive).
221 | - Use `-(name1:name2)` to unselect all variables from `name1` to `name2` (inclusive).
222 | - Use `rename(frame, new_name = old_name)` to rename variables while keeping all of the others.
223 | - Use `everything()` to select every variable that hasn't otherwise been selected.
224 | - Use `one_of(c("name1", "name2"))` to select all of the variables named in the given vector.
225 | - Use `mutate(frame, name1=expression1, name2=expression2, ...)` to add new variables to the end of the given frame.
226 | - Use `transmute(frame, name1=expression1, name2=expression2, ...)` to create a new data frame whose values are derived from the values of an existing frame.
227 | - Use `group_by(frame, name1, name2, ...)` to group the values of `frame` according to distinct combinations of the values of the named variables.
228 |   - Use `ungroup` to restore the original data.
229 | - Use `summarize(frame, name = function(...))` to aggregate the values in an entire data frame or by group within a data frame.
230 | - By default `summarize` produces `NA` as output when there are `NA`s in the input.
231 |   - Use `na.rm = TRUE` to remove `NA`s from the data before summarization.
232 | - Use `frame %>% operation1(...) %>% operation2(...) %>% ...` to produce a new data frame by applying each operation to an existing one in order.
233 | - Use `n()` (for a simple count) or `sum(!is.na(name))` (to count the number of non-`NA`s) when summarizing values in order to see how many records contribute to an aggregated result.
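For example, a sketch of a pipeline over the `mpg` data (standing in for the flight data used in the book):

```
library(tidyverse)

# Group cars by class, then summarize each group, keeping a count of how many
# records contribute to each mean as a sanity check.
mpg %>%
  filter(!is.na(hwy)) %>%
  group_by(class) %>%
  summarize(count = n(), mean_hwy = mean(hwy, na.rm = TRUE)) %>%
  arrange(desc(mean_hwy))
```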
234 | - Common summarization functions include:
235 |   - `mean` and `median`
236 |   - `sd` for standard deviation
237 |   - `min`, `quantile`, and `max` for extrema and intermediate values
238 |   - `first`, `nth`, and `last` for positional extrema and intermediate values
239 |   - `n_distinct` for the number of distinct values
240 |   - `count` to calculate counts or weighted sums
241 | - Each summarization peels off one layer of grouping.
242 |
243 | ## 6. Workflow: Scripts
244 |
245 | ### Objectives
246 |
247 | - Use the RStudio editor to write, save, and run R scripts.
248 | - Describe two things that should *not* be put in scripts.
249 | - Explain how to spot and fix syntax errors in the RStudio editor.
250 |
251 | ### Key Points
252 |
253 | - Use Cmd/Ctrl + Enter in the editor to run the current R expression in the console.
254 | - Use Cmd/Ctrl + Shift + S to run the complete script in the console.
255 | - Do not put `install.packages` or `setwd` in scripts, since they will affect other people's machines when run.
256 | - The RStudio editor uses white-on-red X's and red squiggly underlining to highlight syntax errors.
257 |
258 | ## 7. Exploratory Data Analysis
259 |
260 | ### Objectives
261 |
262 | - Describe the steps in exploratory data analysis (EDA).
263 | - Describe two types of questions that are useful to ask during EDA.
264 | - Correctly define *variable*, *value*, *observation*, *variation*, and *tidy data*.
265 | - Explain what a *categorical variable* is and how best to store and visualize one.
266 | - Explain what a *continuous variable* is and how best to store and visualize one.
267 | - Explain why it is important to use a variety of bin widths when visualizing continuous variables as histograms.
268 | - List three questions whose answers will help you understand your data.
269 | - Describe and use a heuristic for identifying subgroups in data.
270 | - Explain how to handle outliers or unusual values in data.
271 | - Define *covariation* and describe how to visualize it for different combinations of categorical and continuous variables.
272 | - Explain how to make code clearer to experienced readers by omitting information.
273 |
274 | ### Key Points
275 |
276 | - Exploratory data analysis consists of:
277 |   - Generating questions about data.
278 |   - Searching for answers by visualizing, transforming, and modeling data.
279 |   - Using what is found to refine questions or generate new ones.
280 | - Two questions that are always useful to ask during EDA are:
281 |   - What type of variation occurs within my variables?
282 |   - What type of covariation occurs between my variables?
283 | - A _variable_ is something that can be measured.
284 | - A _value_ is the state of a variable when measured.
285 | - An _observation_ is a set of measurements made under similar conditions, and may contain several values (each associated with a different variable).
286 | - _Variation_ is the tendency of values to differ from measurement to measurement.
287 | - _Tidy data_ is a set of values, each of which is associated with exactly one variable and observation.
288 |   Tidy data is usually displayed in tabular form: each observation (or record) is a row, while each variable is a column with a name and a type.
289 | - A _categorical variable_ is one that takes on only one of a small set of values.
290 |   - Categorical variables are best represented using factors or character strings.
291 |   - The distribution of a categorical variable is best visualized using a bar chart (created using `geom_bar`).
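For instance, a sketch using the `diamonds` data from ggplot2:

```
library(tidyverse)

# Distribution of a categorical variable: one bar per value of cut...
ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut))

# ...and the same counts as a table.
diamonds %>% count(cut)
```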
292 |   - `dplyr::count(name)` counts the number of occurrences of each value of a categorical variable.
293 | - A _continuous variable_ is one that takes on any of an infinite set of ordered values.
294 |   - Continuous variables are best represented using numbers or date-times.
295 |   - The distribution of a continuous variable is best visualized using a histogram.
296 |   - `dplyr::count(ggplot2::cut_width(name, width))` divides occurrences into bins and counts the number of occurrences in each bin.
297 | - Histograms with different bin widths can have very different visual appearances, so varying the bin width provides insight that no single bin width can.
298 |   - Use `geom_histogram(mapping=..., binwidth=value)` to vary the width of histogram bins.
299 |   - Or use `geom_freqpoly` to display histograms using lines instead of bars.
300 | - Three questions to ask of any dataset are:
301 |   - Which values are most common (and why)?
302 |   - Which values are rare (and why)?
303 |   - What patterns are present in the data?
304 |     - How can you describe the pattern?
305 |     - How strong is it?
306 |     - Is it a coincidence?
307 |     - Does the pattern change if you examine subgroups of the data?
308 | - Clusters of similar values suggest that data contains subgroups. To characterize these subgroups, ask:
309 |   - How are observations in each cluster similar?
310 |   - How do observations in different clusters differ?
311 |   - What might explain the existence of these clusters?
312 |   - How might the appearance of these clusters be misleading (e.g., an artifact of the visualization used)?
313 | - If outliers are present, repeat each analysis with and without them.
314 |   - If there are only a few, and dropping them doesn't affect results, use `mutate` and `ifelse` to replace them with `NA`.
315 |   - If there are many, or dropping them changes results, account for them in analysis and reporting.
316 | - _Covariation_ is the tendency for some variables to vary in related ways.
317 | - When visualizing the relationship between continuous and categorical variables:
318 |   - Displaying raw counts can be misleading if the number of items in different categories varies widely.
319 |   - Displaying _densities_ (i.e., counts standardized so that the area of each curve is the same) can be more informative.
320 |   - Boxplots show less of the raw data, but are easier to interpret when there are many categories.
321 |   - Reorder unordered categorical variables to make trends easier to see.
322 | - When visualizing the relationship between two categorical variables:
323 |   - Display the counts for each pairing of values (e.g., using `geom_count` or `geom_tile`).
324 |   - In general, put the categorical variable with the greater number of categories or the longer labels on the Y axis.
325 | - When visualizing the relationship between two continuous variables:
326 |   - Use a scatterplot with jittering or transparency to handle datasets with up to hundreds of points.
327 |   - Use `geom_bin2d` or `geom_hex` to bin values in two dimensions.
328 |   - Bin one or both of the continuous variables so that the techniques for categorical variables can be used.
329 |   - Use `cut_width` or `cut_number` to bin continuous values by value range or number of values respectively.
330 | - Omitting argument names for commonly-used functions makes code easier for experienced programmers to understand.
331 |   - The first two arguments to `ggplot` are the dataset and the mapping.
332 |
333 | ## 8.
Workflow: Projects 334 | 335 | ### Objectives 336 | 337 | - Explain why analysts should save scripts rather than environments. 338 | - Explain what a *working directory* is and how to find what yours is. 339 | - Explain why setting your working directory from within your script is a bad idea. 340 | - Explain the difference between an *absolute path* and a *relative path* and the meaning of the symbol `~` in a path. 341 | - Explain what an RStudio project is and how one is stored. 342 | 343 | ### Key Points 344 | 345 | - Analysts should save scripts rather than environments because it is much easier to reconstruct an environment from a script than to reconstruct a script from an environment. 346 | - The _working directory_ is the directory where R looks for and saves files by default, and is displayed by calling `getwd()`. 347 | - Setting the working directory from within a script with `setwd` makes reproducibility more difficult because that directory may not exist on some other (person's) machine. 348 | - An _absolute path_ specifies a single location starting from the top of the filesystem. 349 | - A _relative path_ specifies a location starting from the current directory, and may identify different locations depending on where it is used. 350 | - The symbol `~` refers to the user's home directory on macOS and Linux, and to the user's `Documents` directory on Windows. 351 | - An RStudio project is a directory that contains the scripts and other files involved in an analysis. 352 | - Each RStudio project contains a `.Rproj` file with information about the project. 353 | 354 | ## 10. Tibbles 355 | 356 | ### Objectives 357 | 358 | - Explain the relationship between a tibble and a `data.frame` and the main ways in which tibbles differ from `data.frame`s. 359 | - Create tibbles from `data.frame`s and from scratch. 360 | - Explain what a *non-syntactic name* is and how to create tibble columns with non-syntactic names. 361 | - FIXME: explain how to use `tribble` (which requires an understanding of `~`). 362 | - Display an arbitrary number of rows and columns of a tibble. 363 | - Subset tibbles using `[[...]]`. 364 | - Subset tibbles using `$`. 365 | - FIXME: explain use of `[...]` (single bracket). 366 | 367 | ### Key Points 368 | 369 | - A tibble is a `data.frame` whose behaviors have been modified to work better with the tidyverse. 370 | - Tibbles never change their inputs' types. 371 | - Tibbles never adjust the names of variables. 372 | - Tibbles evaluate their constructor arguments lazily and sequentially, so that later variables can use the values of earlier variables. 373 | - Tibbles do not create row names. 374 | - Tibbles only recycle inputs of length 1, because recycling longer inputs has been a frequent source of bugs. 375 | - Tibbles can be created from `data.frame`s using `as_tibble` or from scratch using `tibble`. 376 | - Use `is_tibble` to determine if something is a tibble or not. 377 | - Use `class` to determine the classes of something. 378 | - A _non-syntactic name_ is one which is not a valid R variable name. 379 | - To create a non-syntactic column name, enclose the name in back-quotes. 380 | - Use `print` with `n` to set the number of rows and `width` to set the number of character columns. 381 | - Use `name[["variable"]]` or `name$variable` to extract the column named `variable` from a tibble. 382 | - Use `name[[N]]` to extract column `N` (integer) from a tibble. 383 | 384 | ## 11. 
Data Import 385 | 386 | ### Objectives 387 | 388 | - Name six functions for reading tabular data and explain their use. 389 | - Read CSV data files with multiple header lines, comments, missing headers, and/or markers for missing data. 390 | - Explain how data reader functions determine whether they have extra and missing values, and how they handle them. 391 | - Name four functions used to parse individual values and explain their use. 392 | - Explain how to obtain a summary of parsing problems encountered by data reading functions. 393 | - Define *locale* and explain its purpose and use. 394 | - Define *encoding* and explain its purpose and use. 395 | - Explain how `readr` functions determine column types. 396 | - Set the data types of columns explicitly while reading data. 397 | - Explain how to write well-formatted tabular data. 398 | - Describe what information is lost when writing tibbles to delimited files and what formats can be used instead. 399 | 400 | ### Key Points 401 | 402 | - Use the following functions to read tabular data in common formats: 403 | - `read_csv`: comma-delimited files. 404 | - `read_csv2`: semicolon-delimited files. 405 | - `read_tsv`: tab-delimited files. 406 | - `read_delim`: files using an arbitrary delimiter. 407 | - `read_fwf`: files with fixed-width fields. 408 | - `read_table`: read common fixed-width tabular formats with whitespace separators. 409 | - Use `skip=n` to skip the first N lines of a file. 410 | - Use `comment="#"` (or something similar) to ignore lines starting with `#`. 411 | - Use `col_names=FALSE` to stop `read_csv` from interpreting the first row as column headers. 412 | - Use `col_names=c("first", "second", "third")` to specify column names by hand. 413 | - Use `na="."` (or something similar) to specify the value(s) used to mark missing data. 414 | - Data reader functions use the number of values in the first row to determine the number of columns. 415 | - Extra values in subsequent rows are omitted. 416 | - Missing values in subsequent rows are set to `NA`. 417 | - Use `parse_integer`, `parse_number`, `parse_logical`, and `parse_date` to parse strings containing integers, general numbers, Booleans, and dates. 418 | - Use `na="."` (or similar) to specify the value(s) that should be interpreted as missing data. 419 | - Use `problems(name)` to access the `problems` attribute of the output of data reading functions. 420 | - A _locale_ is a collection of linguistic and/or regional settings for information formats, such as Canadian English or Brazilian Portuguese. 421 | - Use `locale(...)` to specify such things as the separator character used in long numbers. 422 | - An _encoding_ is a specification of how characters are represented digitally, such as ASCII or UTF-8. 423 | - Specify `encoding="name"` when parsing data to interpret the character data correctly. 424 | - UTF-8 is now the most commonly-used character encoding scheme. 425 | - Data reading functions read the first 1000 rows of the dataset and use the heuristics embodied in `guess_parser` to guess the type of the column. 426 | - Use `col_types=cols(...)` to manually specify the types of the columns of a data file. 427 | - Use `name = col_double()` to set the column's name to `name` and the type to `double`. 428 | - Use `name = col_date()` to set the name to `name` and the type to `date`. 429 | - Use `write_csv` to write data in comma-separated format and `write_tsv` to write it in tab-separated format. 
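A reading-and-writing sketch (the file names and columns here are invented for illustration):

```
library(tidyverse)

# Read a CSV that has two comment lines at the top, uses "." to mark missing
# data, and whose column types we want to pin down explicitly.
measurements <- read_csv("measurements.csv",   # hypothetical file
                         skip = 2,
                         na = ".",
                         col_types = cols(
                           site  = col_character(),
                           date  = col_date(),
                           value = col_double()
                         ))

# Write the cleaned-up table back out.
write_csv(measurements, "measurements-clean.csv")
```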
430 | - Use `write_excel_csv` to write CSV with extra information so that it can immediately be loaded by Microsoft Excel. 431 | - Use `na="marker"` to specify how `NA` should be shown in the output. 432 | - Delimited file formats only store column names, not column types, so the latter have to be re-guessed when the file is re-read. 433 | - (Old) Saving data in R's custom binary format RDS will save type information. 434 | - (New) Saving data in the cross-language Feather format will also save type information, and this data can be read in multiple languages. 435 | 436 | ## 12. Tidy Data 437 | 438 | ### Objectives 439 | 440 | - Describe the three rules tabular data must obey to be considered "tidy", and the advantages of storing data this way. 441 | - Explain what *gathering* data means and use gather operations to tidy datasets. 442 | - Explain what *spreading* data means and use spread operations to tidy datasets. 443 | - Explain what *separating* data means and use separate operations to tidy datasets. 444 | - Explain what *uniting* data means and use unite operations to tidy datasets. 445 | - Describe two ways in which values can be missing from a dataset. 446 | - Explain how to *complete* a dataset and use completion operations to tidy datasets. 447 | - Explain why it can be useful to carry values forward and use this to tidy datasets. 448 | 449 | ### Key Points 450 | 451 | - Tidy data obeys three rules: 452 | - Each variable has its own column. 453 | - Each observation has its own row. 454 | - Each value has its own cell. 455 | - Tidy data is easier to process because: 456 | - No subsidiary processing is required (e.g., to split names into personal and family names). 457 | - Each column can be processed independently (e.g., there's no need to choose the type of processing based on a "type" field in another column). 458 | - To _gather_ data means to take N columns whose names are actually values and transform them into 2 columns where the first column holds the former column names and the second holds the values. 459 | - Use `gather(name, name, ..., key="key_name", value="value_name")` to transform the named columns into two columns with names `key_name` and `value_name`. 460 | - To _spread_ data means to take two columns, the N values in the first of which identify the meanings of the values in the second, and create N+1 columns, one for each of the distinct values in the first column. 461 | - Use `spread(key=first, value=second) to spread the values in `second` according to the keys in `first`. 462 | - To _separate_ data means to split one column into multiple values. 463 | - Use `separate(name, into=c("first", "second", ...))` to separate the values in one column to create multiple new columns. 464 | - To _unite_ data means to combine the values of two or more columns into a single column. 465 | - Use `unite(new_name, first, second, ...)` to combine the named columns to create a column named `new_name`. 466 | - Values will be combined with `_` unless `sep="#"` (or similar) is used (with `sep=""` to unite without a separator). 467 | - Use `convert=TRUE` with these functions to (try to) convert data types. 468 | - Values can be explicitly missing (the presence of an absence) or their entries can be missing entirely (the absence of a presence). 469 | - To _complete_ a dataset means to fill in missing combinations of values. 470 | - Use `complete(first, second, ...)` to fill in missing combinations of the values from the named columns. 
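For example, a small tidying sketch on an invented table:

```
library(tidyverse)

# An untidy table: the 2015 and 2016 columns are really values of a "year" variable.
cases <- tribble(
  ~country, ~`2015`, ~`2016`,
  "Canada",     120,     140,
  "Chile",       75,      80
)

# gather turns the year columns into key/value pairs...
tidy_cases <- cases %>%
  gather(`2015`, `2016`, key = "year", value = "count", convert = TRUE)

# ...and spread reverses the operation.
tidy_cases %>% spread(key = year, value = count)
```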
471 | - Missing values sometimes indicate that the most recent value should be carried forward. 472 | - Use `fill(first, second, ...)` to carry the most recent observation(s) forward in the named column(s). 473 | 474 | ## 13. Relational Data 475 | 476 | ### Objectives 477 | 478 | - Define *relational data* and explain what *keys* are and how they are used when processing it. 479 | - Explain the difference between a *primary key* and a *foreign key*, and explain how to determine whether a key is actually a primary key. 480 | - Explain what a *surrogate key* is and why surrogate keys are sometimes needed. 481 | - Explain how relations are represented in relational data and describe three types of relations. 482 | - Define a *mutating join* and use mutating joins to combine information from two tables. 483 | - Define four kinds of joins and use each to combine information from two tables. 484 | - Explain what joins do if some keys are duplicated, and when this might occur. 485 | - Describe and use some common criteria for joins. 486 | - Define a *filtering join*, describe two types of filtering joins, and use them to combine information from two tables. 487 | - Describe the difference between how mutating joins and filtering joins behave in the presence of duplicated keys. 488 | - Describe three steps for identifying keys in tables that can be used in joins. 489 | - Describe and use three set operations on records. 490 | 491 | ### Key Points 492 | 493 | - _Relational data_ is made up of sets of tables that are related in some way. 494 | - A _key_ is a variable or set of variables whose values uniquely identify observations in a table. 495 | - Keys are used to connect observations in one table to observations in another. 496 | - A _primary key_ uniquely identifies an observation in its own table. 497 | - Use `count(name)` and `filter(n > 1)` to identify multiple occurrences of what is supposed to be a primary key. 498 | - A _foreign key_ uniquely identifies an observation in some other table, and is used to connect information between those tables. 499 | - A _surrogate key_ is an arbitrary identifier associated with an observation (such as a row number) that has no real-world meaning. 500 | - Surrogate keys are sometimes added to data when the data itself has no valid primary keys. 501 | - Relations are represented by matching primary keys in one table to foreign keys in another. Relations can be: 502 | - _One-to-one_ (or 1-1), meaning there is exactly one matching value in each table. 503 | - _One-to-many_ (or 1-N), meaning that each value in one table may have any number of matching values in another. 504 | - _Many-to-many_ (or N-N), meaning that there may be many matching values in each table. 505 | - A _mutating join_ updates one table with corresponding information from another table. 506 | - An _inner join_ combines observations from two tables when their keys are equal, discarding any unmatched rows. 507 | - Use `inner_join(left, right, by="name")` to join tables `left` and `right` on equal values of the column `name`. 508 | - A _left outer join_ (or simply _left join_) combines observations when keys are equal, keeping rows from the left table even if there are no corresponding values from the right table. 509 | - Missing values from the right table are assigned `NA` in the result. 510 | - Use `left_join` with arguments as above. 511 | - A _right outer join_ (or simply _right join_) does the same, but keeps rows from the right table even when rows from the left are missing. 
512 |   - Use `right_join` with arguments as above.
513 | - A _full outer join_ (or simply _full join_) keeps all rows from both tables, filling in for gaps in either.
514 |   - Use `full_join` with arguments as above.
515 | - If a key is duplicated in one or both tables, a join will produce all combinations of records with that key.
516 |   - This often arises when a key is a primary key in one table and a foreign key in another.
517 |   - If keys are duplicated in both tables, it may be a sign that the data is corrupt or that the supposed key actually isn't one.
518 | - A _natural join_ combines tables using equal values for all columns with identical names.
519 |   - Use `by=NULL` in a join function to force a natural join.
520 |   - Use `by=c("name1", "name2", ...)` to join on equal values of named columns.
521 |   - Use `by=c("a" = "b", "c" = "d", ...)` to join on columns with different names.
522 |   - Use `suffix=c("name1", "name2")` to override the default `.x`, `.y` suffixes used for name collisions.
523 | - A _filtering join_ is one that keeps (or discards) observations from one table based on whether they match (or do not match) observations in a second table.
524 |   - Use `semi_join(left, right)` to keep rows in `left` that have matches in `right`.
525 |   - Use `anti_join(left, right)` to keep rows in `left` that do *not* have matches in `right`.
526 | - Because they only keep or discard rows, filtering joins never create duplicate entries, while mutating joins can if keys are duplicated.
527 | - Three steps for identifying keys in tables that can be used in joins are:
528 |   - Identify the variable or variables that form the primary key for each table based on an understanding of the data.
529 |   - Check that each table's primary key has no missing values.
530 |   - Check that possible foreign keys match primary keys in other tables (e.g., by using `anti_join` to look for missing matches).
531 | - Three set operations that work on entire records are:
532 |   - `union(left, right)`: returns unique observations that are in either or both tables.
533 |   - `intersect(left, right)`: returns unique observations that are in both tables.
534 |   - `setdiff(left, right)`: returns unique observations that are in `left` but not in `right`.
535 |
536 | ## 14. Strings
537 |
538 | ### Objectives
539 |
540 | - Write character strings in R, including ones that contain special characters.
541 | - Write multiple strings to the terminal, respecting escaped characters.
542 | - Use functions from the `stringr` package to perform basic operations on strings.
543 | - Explain what a *regular expression* is and what kinds of patterns it can match.
544 | - Describe two functions that implement regular expressions and use them to match simple patterns against text.
545 | - Describe nine patterns provided by regular expressions.
546 | - Capture subsections of matched text in regular expressions and re-match captured text within a pattern.
547 | - Detect and extract matches between a pattern and the strings in a vector.
548 | - Replace substrings that match regular expressions.
549 | - Split strings based on regular expression matches.
550 | - Locate substrings that match regular expressions.
551 | - Control matching options in regular expressions.
552 | - Find objects in the global environment whose names match a regular expression.
553 | - Find files and directories whose names match a regular expression.
554 |
555 | ### Key Points
556 |
557 | - Character strings in R are enclosed in matching single or double quotes.
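For example:

```
s1 <- "a string in double quotes"
s2 <- 'a single-quoted string containing a "double-quoted" phrase'
s3 <- "the same thing with an escaped \"double quote\" inside"
```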
558 | - Use backslash to escape special characters such as `\"`, `\n`, and `\\`.
559 | - Use `writeLines` to display a string or a vector of strings with special characters interpreted.
560 | - Use `str_length` to get a string's length.
561 | - Use `str_c` to concatenate strings.
562 | - Use `str_sub` to extract or replace substrings.
563 | - Use `str_to_lower`, `str_to_upper`, and `str_to_title` to change the case of strings.
564 | - Use `str_sort` to sort a vector of strings.
565 | - Use `str_order` to get the ordered indices of the strings in a vector.
566 | - Use `str_pad` to pad a string to a specified width and `str_trim` to remove leading and trailing whitespace.
567 | - A _regular expression_ is a pattern that matches text.
568 | - Regular expressions are written as text using punctuation and other characters to express choice, repetition, and other operations.
569 | - Regular expressions can express patterns that have fixed nesting, but not patterns that have unlimited nesting (such as nested parenthesization).
570 | - Use `str_view(text, pattern)` to find the first match of `pattern` to `text` and `str_view_all` to view all matches.
571 | - Nine patterns used in regular expressions are:
572 |   - `.` matches any single character.
573 |   - `\` escapes the character that follows it.
574 |   - `^` and `$` match the beginning and end of the string respectively (without consuming any characters).
575 |   - Use `\d` to match digits and `\s` to match whitespace.
576 |   - Use `[abc]` to match any single character in a set and `[^abc]` to match any character *not* in a set.
577 |   - Use `left|right` to match either of two patterns.
578 |   - Use `{M,N}` to repeat a pattern M to N times.
579 |   - Use `?` to signal that a pattern is optional (i.e., repeated zero or one times), `*` to repeat a pattern zero or more times, and `+` to repeat a pattern at least once.
580 |   - Use parentheses `(...)` for grouping, just as in mathematics.
581 | - Every set of parentheses in a regular expression creates a numbered _capture group_.
582 |   - Use `\1`, `\2`, etc. to refer to capture groups within a pattern in order to match the same actual text two or more times.
583 | - Use `str_detect(strings, pattern)` to create a logical vector showing where a pattern does or doesn't match.
584 | - Use `str_subset(strings, pattern)` to select the subset of strings that match a pattern and `str_count` to count the number of matches.
585 | - Use `str_extract(strings, pattern)` to extract the first match for the pattern in each string.
586 | - Use `str_extract_all(strings, pattern)` to extract all matches for the pattern in each string.
587 | - Use `str_match(string, pattern)` to extract parenthesized sub-matches for a pattern.
588 | - Use `tidyr::extract` to extract parenthesized sub-matches from a tibble into new columns.
589 | - Use `str_replace` or `str_replace_all` to replace substrings that match regular expressions.
590 | - Use `str_split` to split a string based on regular expression matches.
591 | - Use `str_locate` and `str_locate_all` to find the starting and ending positions of substrings that match regular expressions.
592 | - Use `regex` explicitly to construct a regular expression and control options such as multi-line matches and embedded comments.
593 | - Use `apropos` to find objects in the global environment whose names match a regular expression.
594 | - Use `dir` to find files and directories whose names match a regular expression.
595 |
596 | ## 15.
Factors 597 | 598 | ### Objectives 599 | 600 | - Define *factor* and explain the purpose of factors in R. 601 | - Create and (re-)order factors. 602 | - Determine the valid levels of a factor. 603 | - Rename the levels of a factor. 604 | 605 | ### Key Points 606 | 607 | - A _factor_ is a variable that can take on one of a fixed set of values. 608 | - Factors are ordered, but the order is not necessarily alphabetical. 609 | - Use `factor(values, levels)` to create a vector of factors by matching strings in `values` to level names in `levels`. 610 | - Values that don't match level names are converted to `NA`. 611 | - The idiom `factor(values, unique(values))` orders the factors according to their first appearance in the data. 612 | - Use `fct_reorder(factor, values)` to reorder a factor according to a set of numeric values. 613 | - Use `levels(factor)` to recover the valid levels of a factor. 614 | - Use `fct_relevel(factors, "levels")` to move the named levels to the front of the list of factors (e.g., for display purposes). 615 | - Use `fct_infreq` to reorder factors by frequency. 616 | - Use `fct_rev` to reverse the order of factors. 617 | - Use `fct_recode(factor, "new_name_1" = "old_name_1", "new_name_2" = "old_name_2", ...)` to rename some or all factors. 618 | - Assigning several old levels to a single new level combines entries. 619 | - Use `fct_collapse(factors, new_name = c("old_name_1", "old_name_2"), ...)` to collapse many levels at once. 620 | - Use `fct_lump(factor, n=N)` to combine the smallest factors, leaving `N` groups. 621 | 622 | ## 16. Dates and Times 623 | 624 | ### Objectives 625 | 626 | - Describe three types of data that refer to an instant in time. 627 | - Get the current date and time. 628 | - Describe and use three ways to create a date-time. 629 | - Convert dates to date-times and date-times to dates. 630 | - Describe and use eight accessor functions to extract components of dates and date-times. 631 | - Describe and use three functions for rounding dates. 632 | - Explain how to modify components of dates and date-times. 633 | - Explain an idiom for exploring patterns in the lower-order components of date-times. 634 | - Explain how the difference between two moments in time is represented in base R and when using `lubridate`. 635 | - Explain the difference between a difftime, a *period*, and an *interval*. 636 | - Determine your current timezone. 637 | 638 | ### Key Points 639 | 640 | - Instants in time are described by _date_, _time_, and _date-time_. 641 | - Use `today` to get the current date and `now` to get the current date-time. 642 | - A date-time can be created from a string, from individual date and time components, or from an existing date-time. 643 | - Use `lubridate` functions such as `ymd` or `dmy` to parse year-month-day dates. 644 | - Use functions such as `ymd_hms` to parse full date-times. 645 | - Supplying a timezone with `tz="XYZ"` forces the creation of a date-time instead of just a date. 646 | - Use `make_date` or `make_datetime` to construct a date or date-time from numeric components. 647 | - Use `as_datetime` to convert a date to a date-time and `as_date` to convert a date-time to a date. 
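For example, a sketch using lubridate (installed as part of the tidyverse but loaded separately):

```
library(lubridate)

today()                             # the current date
ymd("2018-03-05")                   # parse a year-month-day string
ymd_hms("2018-03-05 10:15:30")      # parse a full date-time
make_datetime(2018, 3, 5, 10, 15)   # build a date-time from numeric components
as_date(now())                      # convert a date-time to a date
```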
648 | - Use the following accessor functions to extract components from dates and date-times:
649 |   - `year` and `month`
650 |   - `yday` (day of the year), `mday` (day of the month), and `wday` (day of the week)
651 |   - `hour`, `minute`, and `second`
652 | - Use `floor_date`, `round_date`, and `ceiling_date` to round dates down, to the nearest, or up to a specified unit.
653 | - Use an accessor function on the left side of assignment to modify a portion of a date or date-time in place.
654 |   - E.g., use `year(x) <- 2018` to set the year of a date or date-time to 2018.
655 | - Use `update(existing, name=value, ...)` to create a new date-time with modified values.
656 |   - Use `update` to set the higher-order components of date-times to a constant in order to explore the variation in the lower-order components.
657 | - A _difftime_ represents the difference between two moments in time, in units that may range from seconds to weeks.
658 |   - Use `as.duration(difftime)` to convert to a `lubridate` `duration`, which always uses seconds to represent differences in times.
659 |   - Use `dyears`, `dseconds`, etc. to construct durations explicitly.
660 | - A _period_ represents the difference between two times taking human factors into account (such as daylight saving time).
661 | - An _interval_ is a duration with a starting point, which makes it precise enough that its exact length can be determined.
662 | - Use `Sys.timezone()` to determine your current timezone.
663 |
664 | ## 18. Pipes
665 |
666 | ### Objectives
667 |
668 | - Describe the pros and cons of four ways to write successive operations on data.
669 | - Explain the use of `%T>%`, `%$%`, and `%<>%`.
670 |
671 | ### Key Points
672 |
673 | - Four ways to write successive operations on data are:
674 |   - Save each intermediate step as a new object: a lot of typing with many opportunities for transposition mistakes.
675 |   - Overwrite the original object many times: loss of originals makes debugging difficult, and repetition of a single name makes reading difficult.
676 |   - Compose functions: unnatural reading order and parameters widely separated from function names.
677 |   - Use the pipe `%>%`: simple to read *if* the transformations are sequential and applied to a single main stream of data.
678 | - `%T>%` ("tee") returns its left side rather than its right.
679 | - `%$%` unpacks the variables in a `data.frame` (which is useful when calling functions in base R that don't rely on `data.frame`s).
680 | - `%<>%` assigns the result back to the starting variable.
681 |
682 | ## 19. Functions (and Control Flow)
683 |
684 | ### Objectives
685 |
686 | - Explain the benefits of creating functions.
687 | - Describe three steps in the creation of a function.
688 | - Define functions of zero or more arguments.
689 | - Describe three rules that function names should follow.
690 | - Describe the difference between data and details in function arguments.
691 | - Define *conditional statement* and write conditional statements with multiple branches and a default branch.
692 | - Explain what a *short-circuit operator* is and write conditions using these operators.
693 | - Define *precondition* and implement preconditions in functions.
694 | - Write functions that take (and pass on) a varying number of arguments.
695 | - Describe and use two ways to return values from functions.
696 | - Implement pipeable functions that perform transformations or have side effects.
697 |
698 | ### Key Points
699 |
700 | - Create functions to eliminate duplicated code, make programs more readable, and simplify maintenance and evolution.
701 | - When creating a function, select a name, decide on its arguments, and write its body.
702 | - Function names should:
703 |   1. Prefer verbs (actions) to nouns (things).
704 |   2. Use full words and consistent typography.
705 |   3. Be consistent with other functions in the same package or application.
706 | - Arguments to functions are (broadly speaking) either:
707 |   - Data to be operated on (come first).
708 |   - Details controlling how the function operates (come last, and should have default values).
709 |   - When overriding the value of a default, use the full name of the argument.
710 | - A _conditional statement_ may or may not execute code depending on whether a condition is true or false.
711 |   - Each conditional statement must have one `if`, zero or more `else if`, and zero or one `else` in that order.
712 |   - Each branch except the `else` must have a logical condition that determines whether it is selected.
713 |   - Branch conditions are tested in order, and only the code associated with the first branch whose condition is true is executed.
714 |   - If no condition is true, and an `else` is present, the code in the `else` branch is executed.
715 |   - Conditions must be `TRUE` or `FALSE`, not vectors or `NA`s.
716 | - A _short-circuit operator_ stops evaluating terms as soon as it knows whether the overall value is `TRUE` or `FALSE`.
717 |   - "and", written `&&`, stops as soon as a term is `FALSE`.
718 |   - "or", written `||`, stops as soon as a term is `TRUE`.
719 |   - Use the functions `any`, `all`, and `identical` to collapse vectors into single values for testing.
720 | - Always indent the bodies of conditionals and functions (preferably by two spaces) and obey style rules for placement of curly braces.
721 | - A _precondition_ is something that must be true of a function's inputs in order for the function to work correctly.
722 |   - Use `if` and `stop` to check that inputs are sensible before processing them, and to generate a meaningful error message when they are not.
723 |   - Or use `stopifnot` to check that one or more conditions are true (without generating a custom error message).
724 | - Use `...` (three dots) as a placeholder for zero or more arguments, which can then be passed into other functions.
725 |   - Use `list(...)` to convert the actual arguments to a list for processing.
726 | - A function in R returns either:
727 |   - An explicit value when `return(value)` is called.
728 |   - The value of the last expression evaluated if no explicit `return` was executed.
729 | - To make a function pipeable:
730 |   - For a transformation, take the data to be transformed as the first argument and return a modified object.
731 |   - For a side effect, perform the operation (e.g., save to a file) and use `invisible(value)` to return the value without printing it.
732 |
733 | ## 20. Vectors
734 |
735 | ### Objectives
736 |
737 | - Define *atomic vector* and *list*, explain the differences between them, and give examples of each.
738 | - Explain what `NULL` is used for and how it differs from `NA`.
739 | - Determine the type and length of an arbitrary value.
740 | - Describe the values that logical vectors can contain and how they are usually constructed.
741 | - Describe the values that integer and double vectors can contain and the special values that each type can contain.
742 | - Describe the values that character vectors can contain.
743 | - Explain the difference between *explicit coercion* and *implicit coercion* and use the former to convert values from one type to another.
744 | - Explain the rule used to determine the type of a vector explicitly constructed out of values of different types.
745 | - Define *recycling* and correctly identify and interpret uses of it.
746 | - Recycle values explicitly.
747 | - Give vector elements names and explain when and why this is useful.
748 | - Describe six ways to subset a vector.
749 | - Explain the difference between single (`[...]`) and double (`[[...]]`) subsetting.
750 | - Create and inspect lists.
751 | - Subset lists.
752 | - Define *attribute* and *augmented vector*.
753 | - Describe three ways vector attributes are used in R.
754 | - Explain how factors are implemented using augmented vectors.
755 | - Explain how tibbles are implemented using augmented vectors.
756 |
757 | ### Key Points
758 |
759 | - An _atomic vector_ is a homogeneous structure that holds logical, integer, double, character, complex, or raw data.
760 | - A _list_ (sometimes called a _recursive vector_) is a vector that can hold heterogeneous data, including other vectors.
761 | - The special value `NULL` represents the absence of a vector, or a vector of length zero, while `NA` represents the absence of a value.
762 | - Use `typeof(thing)` to obtain the name of the type of `thing`.
763 | - Use `is_logical`, `is_integer`, and similarly-named functions to test the types of values.
764 | - Use `is_scalar_integer` and similarly-named functions to test the type of a value and whether it is scalar or vector.
765 | - Use `length(thing)` to obtain the (integer) length of `thing`.
766 | - Logical vectors can contain `TRUE`, `FALSE`, and `NA`, and are often constructed using Boolean expressions such as comparisons.
767 | - Integer vectors contain integer values, which should be used for counting.
768 | - To force a value to be stored as an integer, write it without a decimal portion and put `L` after it (for "long").
769 | - Integer vectors can contain the special value `NA`.
770 | - Use `is.na` to check for this special value.
771 | - Double vectors contain floating-point numbers, which should be used for measurement.
772 | - Double vectors can contain the special values `NA`, `NaN` (not a number), `Inf` (infinity), and `-Inf` (negative infinity).
773 | - Use `is.finite`, `is.infinite`, and `is.nan` to check for these special values.
774 | - Character vectors can contain character strings, each of which can be arbitrarily long.
775 | - _Explicit coercion_ is the use of a function to convert values from one type to another.
776 | - Use `as.logical`, `as.character`, `as.integer`, or `as.double` to create a new vector containing the converted values from an original.
777 | - _Implicit coercion_ occurs when a value or vector of one type is used where another type is expected.
778 | - The function `c(value1, value2, ...)` creates a vector whose type is the most complex of the types of the provided values.
779 | - In order of increasing complexity, the types are logical < integer < double < character.
780 | - To _recycle_ values is to re-use elements of the shorter vector involved in an operation to match the length of the longer one.
781 | - A "scalar" in R is actually a vector of length 1, and most recycling involves replicating a scalar to have the same length as a vector.
782 | - Base R produces a warning if the length of the longer vector is not an integer multiple of the length of the shorter one.
783 | - Tidyverse functions throw errors in this case to forestall unexpected results.
784 | - Use `rep(values, times)` to recycle (or repeat) values explicitly.
785 | - Some or all vector elements can be given names when the vector is constructed using `c(name1=value1, ...)`.
786 | - Use `purrr::set_names(vector, names)` to set the names of a vector's values after the fact.
787 | - A vector can be subsetted using `[...]` in five ways:
788 |     - Subsetting with a vector of positive integers selects those elements in order (possibly with repeats).
789 |     - Subsetting with a vector of negative integers selects all elements *except* those identified.
790 |     - Subsetting with zero creates an empty vector.
791 |     - Subsetting with a logical vector keeps values corresponding to `TRUE` elements of the logical vector.
792 |     - Subsetting with a character vector keeps only those values with the given names (possibly with repeats).
793 | - Using an empty subscript `[]` returns the entire vector.
794 | - Create lists using `list(value1, value2, ...)` and inspect their structure with `str(name)`.
795 | - Subsetting a list with `[...]` always returns a list.
796 | - Subsetting a list with `[[...]]` returns a single component (i.e., has one less level of nesting than the original).
797 | - `list$name` is equivalent to `list[["name"]]` for a named element of a list.
798 | - An _augmented vector_ is one that has extra named _attributes_ attached to it.
799 | - Get the value of a vector attribute using `attr(vector, name)`.
800 | - Set the value of a vector attribute using `attr(vector, name) <- value`.
801 | - Use `attributes(name)` to display all of the attributes of a vector.
802 | - Vector attributes are used to:
803 |     - Name the elements of a vector (`names`).
804 |     - Store dimensions to make a vector behave like a matrix.
805 |     - Store a class name to implement classes in the S3 object-oriented system.
806 | - A factor is an integer vector that has the class `factor` and a `levels` attribute storing the names of its levels.
807 | - A tibble is a list with three classes (`tbl_df`, `tbl`, and `data.frame`) and `names` and `row.names` attributes.
808 | - All elements of a tibble must be vectors having identical lengths.
809 |
810 | ## 21. Iteration
811 |
812 | ### Objectives
813 |
814 | - Describe the parts of a simple `for` loop.
815 | - Create empty vectors of a given type and length.
816 | - Explain why it is safer to use `seq_along(x)` than `1:length(x)`.
817 | - Write loops that iterate over the columns of a tibble using either indices or names.
818 | - Describe and use an efficient way to write loops when the size of the eventual output cannot be known in advance.
819 | - Explain how to write a loop when the number of required iterations is not known in advance.
820 | - Describe what happens when looping over the names of a vector that has some unnamed elements.
821 | - Explain what *higher-order functions* are and why they're useful, and write higher-order functions.
822 | - Describe the `map` family of functions and their purpose, and rewrite simple `for` loops to use `map` functions.
823 | - Describe the purpose and use of the `safely`, `possibly`, and `quietly` functions.
824 | - Describe and use `map2` and `pmap`.
825 | - Describe and use `walk`, `walk2`, and `pwalk`.
826 | - Define *predicate function* and describe higher-order functions that work with predicate functions.
827 | - Define *reduction* and use `reduce` to implement it.
828 | - Define *accumulation* and use `accumulate` to implement it.
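A minimal sketch that puts a few of these objectives side by side before the key points below (the data frame `df` and the inputs passed to `safely` are made up for illustration): a preallocated `for` loop, its `map_dbl` equivalent, `safely` turning errors into data, and `reduce`/`accumulate`.

```r
library(purrr)

df <- data.frame(a = rnorm(10), b = rnorm(10), c = rnorm(10))

# A for loop: preallocate the output, iterate with seq_along(), fill by index.
out <- vector("double", length(df))
for (i in seq_along(df)) {
  out[i] <- median(df[[i]])
}

# The same computation with a higher-order function.
out2 <- map_dbl(df, median)

# safely() wraps a function so errors become data instead of stopping the loop.
safe_log <- safely(log)
results <- map(list(1, 10, "oops"), safe_log)
str(transpose(results)$error)  # NULL, NULL, then the error object

# reduce() collapses a vector with a binary function; accumulate() keeps every step.
reduce(1:5, `+`)      # 15
accumulate(1:5, `+`)  # 1  3  6 10 15
```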
829 |
830 | ### Key Points
831 |
832 | - A `for` loop usually has:
833 |     - A variable whose value changes for each iteration of the loop.
834 |     - A set of values being iterated over (such as the indices of a vector).
835 |     - A body that is executed once for each iteration.
836 |     - An output variable where results are stored (whose space is usually preallocated for efficiency).
837 | - Use `vector("type", length)` to generate a vector of the specified type and length (usually to be filled in later).
838 | - `1:length(x)` is non-empty when `x` is empty; `seq_along(x)` is empty when `x` is empty, and so is better to use in loop controls.
839 | - To loop over the columns of a tibble:
840 |     - Use `for (variable in seq_along(tibble))` to loop over the numeric indices of the columns.
841 |     - Use `for (variable in names(tibble))` to loop over the names of the columns.
842 | - When the size of a loop's eventual output cannot be known in advance, use a list to collect partial results and then `unlist` or `purrr::flatten_dbl` to combine them into a vector.
843 | - If the values are tables, collect them in a list and use `bind_rows` to combine them after the loop.
844 | - If the number of required iterations is not known in advance, use a `while` loop instead of a `for` loop.
845 | - Make sure that the condition of the `while` loop can be changed by the loop body so that the loop does not run forever.
846 | - If none of the elements of a vector have names, `names(vector)` returns `NULL`, so a `for` loop over those names doesn't execute any iterations.
847 | - If some of the elements of a vector have names and some don't, `names(vector)` returns empty strings for the unnamed elements.
848 | - This means that a `for` loop will execute, but that attempts to access unnamed vector elements by name will fail.
849 | - A _higher-order function_ is one that takes other functions as arguments.
850 | - Higher-order functions allow programmers to write control flow once and re-use it with different operations.
851 | - `map(object, function)` applies `function` to each element of `object` and returns a list of results.
852 | - The specialized functions `map_lgl`, `map_int`, etc., operate on and return vectors of specific types (logical, integer, etc.).
853 | - These functions preserve names and pass extra arguments through to the function provided.
854 | - FIXME: go through "21.5.1 Shortcuts" after learning about formulas.
855 | - `safely(func)` creates a new function that never throws an error, but instead always returns a list of two values:
856 |     - `result` is either the original result (if the original function ran without an error) or `NULL` (if there was an error).
857 |     - `error` is either `NULL` (if the original function ran without an error) or the error object (if there was an error).
858 | - `map(data, safely(func))` will therefore return a list of pairs.
859 | - And `transpose(map(data, safely(func)))` will return a pair of lists.
860 | - `possibly(func)` creates a new function that returns a user-supplied default value instead of throwing an error.
861 | - `quietly(func)` works like `safely` but captures printed output, messages, and warnings.
862 | - `map2(vec1, vec2, function)` applies `function` to corresponding elements from `vec1` and `vec2`.
863 | - `pmap(list_of_lists, function)` applies `function` to the values in each of the sub-lists.
864 | - It is safest to give the sub-lists names that match the names of the function's parameters rather than relying on positional matching.
865 | - The `walk` family of functions executes functions for their side effects, without collecting and returning their results.
866 | - A _predicate function_ is one that returns a single logical value.
867 | - `keep` and `discard` keep elements of the input where a predicate function returns `TRUE` or `FALSE` respectively.
868 | - `some` and `every` determine whether a predicate is true for any or all elements of the input data.
869 | - `detect` returns the first element for which a predicate is true.
870 | - `detect_index` returns the index of the first element for which a predicate is true.
871 | - `head_while` and `tail_while` collect runs of values from the start or end of a structure for which a predicate is true.
872 | - _Reduction_ combines many values using a binary (two-argument) function to create a single resulting value.
873 | - Use `reduce(data, function)` to do this.
874 | - `reduce` throws an error if `data` is empty unless an initial value `init` is provided.
875 | - _Accumulation_ performs the same operation as reduction, but keeps the intermediate results (e.g., it calculates a running sum instead of a single total).
876 |
--------------------------------------------------------------------------------