├── .gitignore ├── CITATION.md ├── LICENSE.md ├── CONDUCT.md └── README.md /.gitignore: -------------------------------------------------------------------------------- 1 | *~ 2 | *.bak 3 | -------------------------------------------------------------------------------- /CITATION.md: -------------------------------------------------------------------------------- 1 | # Citation 2 | 3 | Please cite this work as: 4 | 5 | > Greg Wilson (ed.): "R for Data Science Instructors' Guide". , 2018. 6 | -------------------------------------------------------------------------------- /LICENSE.md: -------------------------------------------------------------------------------- 1 | # License 2 | 3 | *This is a human-readable summary of (and not a substitute for) the license. 4 | Please see for the full legal text.* 5 | 6 | This work is licensed under the Creative Commons Attribution 4.0 7 | International license (CC-BY-4.0). 8 | 9 | **You are free to:** 10 | 11 | - **Share**---copy and redistribute the material in any medium or 12 | format 13 | 14 | - **Remix**---remix, transform, and build upon the material for any 15 | purpose, even commercially. 16 | 17 | The licensor cannot revoke these freedoms as long as you follow the 18 | license terms. 19 | 20 | **Under the following terms:** 21 | 22 | - **Attribution**---You must give appropriate credit, provide a link 23 | to the license, and indicate if changes were made. You may do so in 24 | any reasonable manner, but not in any way that suggests the licensor 25 | endorses you or your use. 26 | 27 | - **No additional restrictions**---You may not apply legal terms or 28 | technological measures that legally restrict others from doing 29 | anything the license permits. 30 | 31 | **Notices:** 32 | 33 | You do not have to comply with the license for elements of the 34 | material in the public domain or where your use is permitted by an 35 | applicable exception or limitation. 36 | 37 | No warranties are given. The license may not give you all of the 38 | permissions necessary for your intended use. For example, other rights 39 | such as publicity, privacy, or moral rights may limit how you use the 40 | material. 41 | -------------------------------------------------------------------------------- /CONDUCT.md: -------------------------------------------------------------------------------- 1 | # Code of Conduct 2 | 3 | In the interest of fostering an open and welcoming environment, we as 4 | contributors and maintainers pledge to making participation in our 5 | project and our community a harassment-free experience for everyone, 6 | regardless of age, body size, disability, ethnicity, gender identity 7 | and expression, level of experience, education, socioeconomic status, 8 | nationality, personal appearance, race, religion, or sexual identity 9 | and orientation. 10 | 11 | ## Our Standards 12 | 13 | Examples of behavior that contributes to creating a positive 14 | environment include: 15 | 16 | * using welcoming and inclusive language, 17 | * being respectful of differing viewpoints and experiences, 18 | * gracefully accepting constructive criticism, 19 | * focusing on what is best for the community, and 20 | * showing empathy towards other community members. 
21 | 22 | Examples of unacceptable behavior by participants include: 23 | 24 | * the use of sexualized language or imagery and unwelcome sexual 25 | attention or advances, 26 | * trolling, insulting/derogatory comments, and personal or political 27 | attacks, 28 | * public or private harassment, 29 | * publishing others' private information, such as a physical or 30 | electronic address, without explicit permission, and 31 | * other conduct which could reasonably be considered inappropriate in 32 | a professional setting 33 | 34 | ## Our Responsibilities 35 | 36 | Project maintainers are responsible for clarifying the standards of 37 | acceptable behavior and are expected to take appropriate and fair 38 | corrective action in response to any instances of unacceptable 39 | behavior. 40 | 41 | Project maintainers have the right and responsibility to remove, edit, 42 | or reject comments, commits, code, wiki edits, issues, and other 43 | contributions that are not aligned to this Code of Conduct, or to ban 44 | temporarily or permanently any contributor for other behaviors that 45 | they deem inappropriate, threatening, offensive, or harmful. 46 | 47 | ## Scope 48 | 49 | This Code of Conduct applies both within project spaces and in public 50 | spaces when an individual is representing the project or its 51 | community. Examples of representing a project or community include 52 | using an official project e-mail address, posting via an official 53 | social media account, or acting as an appointed representative at an 54 | online or offline event. Representation of a project may be further 55 | defined and clarified by project maintainers. 56 | 57 | ## Enforcement 58 | 59 | Instances of abusive, harassing, or otherwise unacceptable behavior 60 | may be reported by [emailing the project team](mailto:gvwilson@third-bit.com). 61 | All complaints will be reviewed and investigated and will result in a 62 | response that is deemed necessary and appropriate to the 63 | circumstances. The project team is obligated to maintain 64 | confidentiality with regard to the reporter of an incident. Further 65 | details of specific enforcement policies may be posted separately. 66 | 67 | Project maintainers who do not follow or enforce the Code of Conduct 68 | in good faith may face temporary or permanent repercussions as 69 | determined by other members of the project's leadership. 70 | 71 | ## Attribution 72 | 73 | This Code of Conduct is adapted from the [Contributor 74 | Covenant](https://www.contributor-covenant.org) version 1.4. 75 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # *R for Data Science Instructor's Guide* 2 | 3 | **DRAFT:** notes for people teaching [R4DS](http://r4ds.had.co.nz/) 4 | with each chapter's learning objectives and key points. 5 | 6 | This work is licensed under a Creative Commons Attribution 4.0 International License. 7 | 8 | **References** 9 | 10 | Hadley Wickham and Garrett Grolemund: 11 | *R for Data Science: Import, Tidy, Transform, Visualize, and Model Data*. 12 | 1st ed., O'Reilly Media, 2017. 13 | 14 | ## Learner Personas 15 | 16 | **Nethira**, 27, is wrapping up a PhD in nursing and trying to decide 17 | whether to do a post-doc or take a data analyst position with an NGO 18 | in Tamil Nadu. 
She did two courses on statistics as an undergraduate, 19 | both using Stata, and picked up a bit of R from a labmate in grad 20 | school, but has never really come to grips with it as a tool. Nethira 21 | would like to improve her skills so that she can finish analyzing the 22 | data she collected for her thesis and get a couple of papers out, and 23 | to prepare herself for a possible change of career. These lessons 24 | will show her how to use the tidyverse in R to clean up, analyze, 25 | visualize, and model her data without working long nights or weekends. 26 | 27 | **Hannu**, 40, has worked as a traffic engineer for the Finnish 28 | Ministry of Transportation for the past 12 years, during which time he 29 | has become proficient with SQL and Python. As part of an open data 30 | initiative, his department has decided to build a traffic capacity 31 | dashboard using Shiny, and Hannu wants to learn the basics of modern R 32 | in a hurry so that he can join this project. These lessons will 33 | introduce him to the packages that make up the tidyverse, and prepare 34 | him for a deeper dive into more advanced R programming. 35 | 36 | Derived constraints: 37 | 38 | - Learners know what variables are, how to index a list, how loops and 39 | conditionals work, and grasp the basics of programming language syntax, such 40 | as how to write a string or a list (both). 41 | - Learners only have a shaky grasp of variable scope and the call stack, and 42 | will not understand closures or higher-order functions without detailed 43 | exposition (Nethira). 44 | - Learners know very basic statistics (mean, standard deviation, linear 45 | regression), but do not understand what a *p*-value is or why an observation 46 | can only be used once during hypothesis confirmation (Hannu). 47 | - Learners have 20-40 hours to work through this material. They may be able to 48 | ask more advanced friends or colleagues for help, but will primarily be 49 | learning on their own and by searching online (both). 50 | 51 | Note: definitions of terms are marked with `_single underscores_`, while other 52 | form of emphasis uses `*single asterisks*`. This makes it easy to extract 53 | definitions for glossary construction. 54 | 55 | ## 1. Introduction 56 | 57 | ### Objectives 58 | 59 | - Describe the steps in the basic data analysis cycle. 60 | - Explain the relative strengths and weaknesses of visualization and modeling. 61 | - Explain when techniques beyond those described in these lessons may be needed. 62 | - Explain the differences between hypothesis generation and hypothesis confirmation. 63 | - Describe and install the prerequisites for these lessons. 64 | - Explain where and how to get help. 65 | 66 | ### Key Points 67 | 68 | - The basic data analysis cycle is import, tidy, repeatedly transform, visualize, and model, and then communicate. 69 | - Visualizations provide novel insight, but don't scale well. 70 | - Models scale well, but cannot provide unexpected insight. 71 | - The techniques described in these lessons are good for tabular (rectangular) data up to about a gigabyte in size. 72 | - Hypothesis generation is the ad hoc process of exploring data to find possible hypotheses. 73 | An observation can be used many times during hypothesis generation. 74 | - Hypothesis confirmation is the rigorous application of mathematics to test falsifiable hypotheses. 75 | An observation can only be used once in hypothesis confirmation. 76 | - These lessons require R, RStudio, and a set of R packages called the tidyverse. 
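For example, a minimal setup sketch (run the install step once, from the console rather than from a saved script):

```
# Install the tidyverse meta-package once, from an interactive prompt.
install.packages("tidyverse")

# Load it at the start of every session or script.
library(tidyverse)
```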
77 | - Help can be found by:
78 |   - Typing `?name` at an interactive R prompt (where `name` identifies a package, function, or variable).
79 |   - Copying and pasting the error message into a web search.
80 |   - Searching on Stack Overflow.
81 | - When asking for help, create a reproducible example that:
82 |   - Loads packages.
83 |   - Includes a small amount of data.
84 |   - Has short, readable code.
85 |
86 | ## 2. Introduction
87 |
88 | - See above.
89 |
90 | ## 3. Data Visualization
91 |
92 | ### Objectives
93 |
94 | - Explain what a data frame is and how to access the ones included with the tidyverse.
95 | - Explore the properties of a data frame.
96 | - Explain what geometries and mappings are.
97 | - Create a basic visualization of a single data frame with a single geometry and a single mapping.
98 | - Create visualizations using the `x`, `y`, `color`, `size`, `alpha`, and `shape` properties.
99 | - Explain ways in which continuous and discrete variables should and shouldn't be visualized.
100 | - Explain what facets are and use them to display subsets of data in a single plot.
101 | - Create scatterplots and continuous line charts.
102 | - Create plots that represent data in two or more ways.
103 | - Describe three places in which the visual aspects of a plot can be specified.
104 | - Explain what a stat is in a plot, and how stats relate to geometries.
105 | - Create stacked and side-by-side bar charts.
106 | - Create a scatterplot to display data with many repeated values.
107 | - Flip the XY axes in a plot.
108 | - Use polar coordinates in a plot.
109 | - Describe the seven parameters to a full ggplot2 visualization.
110 |
111 | ### Key Points
112 |
113 | - A data frame is a rectangular set of observations (rows) of the same variables (columns).
114 | - Example data frames can be loaded using `library(tidyverse)` and then referred to by name (e.g., `mpg`) or by fully-qualified name (e.g., `ggplot2::mpg`).
115 | - To explore a data frame:
116 |   - Type the name of the frame to see its shape, column titles, and first few rows.
117 |   - Use `nrow(frame)` to get the number of rows.
118 |   - Use `ncol(frame)` to get the number of columns.
119 |   - Use `View(frame)` in RStudio to visualize the data frame as a table.
120 | - A geometry is an object that a plot uses to represent data, such as a scatterplot or a line.
121 | - A mapping describes how to connect features of a data frame to properties of a geometry, and is described by an aesthetic.
122 | - A very simple visualization has the form `ggplot(data = <DATA>) + <GEOM_FUNCTION>(mapping = aes(<MAPPINGS>))`.
123 | - Continuous variables should be visualized using smoothly-varying properties such as size and color.
124 | - Discrete variables should be visualized using properties such as shape and line type.
125 | - A facet is a subplot that displays a subset of the overall data.
126 | - Use `facet_wrap(<FORMULA>, nrow=<N>)` with a single discrete variable as a formula (such as `~COLUMN`)
127 |   to display a single sub-plot for each value of the discrete variable.
128 | - Use `facet_grid(<FORMULA>)` with a formula of two discrete variables (such as `FIRST ~ SECOND`)
129 |   to display a single sub-plot for each unique combination of the two variables.
130 |   - Use `. ~ COLUMN` or `COLUMN ~ .` as a formula to display facets in rows or columns only.
131 | - Use `geom_point` to create a scatterplot and `geom_smooth` to create a smoothed line fitted to the data.
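A small sketch that puts several of these pieces together, using the `mpg` data frame that ships with ggplot2:

```
library(tidyverse)

# Scatterplot of engine displacement against highway mileage, with a smoothed
# line layered on top and one facet per type of drive train.
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point(mapping = aes(color = class)) +
  geom_smooth() +
  facet_wrap(~ drv, nrow = 1)
```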
132 | - Add multiple geometries after the initial call to `ggplot`.
133 | - The visual aspects of a plot can be specified as follows (each overrides the one(s) before):
134 |   - Globally by specifying a value in the initial `ggplot` function.
135 |   - For a particular geometry by specifying a value outside an aesthetic.
136 |   - In a data-dependent way by setting a property of an aesthetic.
137 | - A stat performs a data transformation, such as counting the number of elements in a subset of the data.
138 | - Stats and geometries can often be used interchangeably since each stat has a default geometry and each geometry has a default stat.
139 | - Map one variable to `x` and another to `fill` in the aesthetic for `geom_bar` to create a stacked bar chart.
140 | - Set `position="dodge"` in `geom_bar` (outside the aesthetic) to create a side-by-side bar chart.
141 | - Use `position="jitter"` in `geom_point` (outside the aesthetic) to add randomization to a scatterplot to show data with duplication.
142 | - Add `coord_flip` to a visualization to flip the XY axes.
143 | - Add `coord_polar` to a visualization to use polar coordinates instead of Cartesian coordinates.
144 | - The seven parameters of a full ggplot2 visualization are:
145 |   - data: the data frame to be plotted
146 |   - geometry: how the data is to be displayed (e.g., scatterplot or line)
147 |   - mapping: how the properties of the data map to the properties of the geometry (e.g., which columns map to X and Y coordinates)
148 |   - stat: the transformation to apply to the data (e.g., count the number of observations)
149 |   - position: how to adjust the positions of displayed elements (e.g., jittering points in a scatterplot)
150 |   - coordinate function: whether to use Cartesian coordinates or polar coordinates
151 |   - facet: how to subset the data to create multiple subplots
152 |
153 | ```
154 | ggplot(data = <DATA>) +
155 |   <GEOM_FUNCTION>(
156 |     mapping = aes(<MAPPINGS>),
157 |     stat = <STAT>,
158 |     position = <POSITION>
159 |   ) +
160 |   <COORDINATE_FUNCTION> +
161 |   <FACET_FUNCTION>
162 | ```
163 |
164 | ## 4. Workflow: Basics
165 |
166 | ### Objectives
167 |
168 | - Assign values to variables.
169 | - Call functions.
170 | - Write readable code.
171 |
172 | ### Key Points
173 |
174 | - Use `name <- value` to assign a value to a variable (do not use `=`).
175 | - Use `function_name(value1, value2, ...)` to call a function.
176 | - Construct variable names out of words joined with underscores `like_this_example`.
177 |
178 | ## 5. Data Transformation
179 |
180 | ### Objectives
181 |
182 | - Describe the five basic data transformation operations in the tidyverse and explain their purpose.
183 | - Choose records by value using comparisons and logical operators.
184 | - Explain why filter conditions shouldn't compare floating-point numbers with `==`, and how to use `%in%` to match any of a set of values.
185 | - Explain the purpose of `NA`, how it affects arithmetic and logical operations, and how to test for it.
186 | - Explain how filtering treats `NA` and how to obtain different behavior.
187 | - Reorder records in ascending or descending order according to the values of one or more variables.
188 | - Select a subset of variables for all records by variable name.
189 | - Select and rename a subset of variables for all records by variable name.
190 | - Add new variables to a frame by deriving new values from existing ones.
191 | - Combine the values in a data frame to create one new value, or one new value per group.
192 | - Explain how summarization treats missing values (`NA`) and how to change this behavior.
193 | - Combine multiple transformations in order using pipe notation.
194 | - Explain why and how to include counts in summarization as a check on the validity of conclusions.
195 | - Name and describe half a dozen common summarization functions.
196 | - Explain the relationship between grouping and summarization.
197 |
198 | ### Key Points
199 |
200 | - The five basic data transformation operations in the tidyverse are:
201 |   - `filter`: choose records by value(s).
202 |   - `arrange`: reorder records.
203 |   - `select`: choose variables by name.
204 |   - `mutate`: derive new variables from existing ones.
205 |   - `summarize`: combine many values to create a single new value.
206 | - `filter(frame, ...criteria...)` keeps records that pass all of the specified criteria:
207 |   - `name == value`: records must have the specified value for the named variable (note `==` rather than `=`).
208 |   - `name > value`: the records' values must be greater than the given value (and similarly for `>=`, `!=`, `<`, and `<=`).
209 |   - `min_rank(name)` ranks values, giving the smallest values the smallest ranks.
210 | - Use `near(expression, value)` to compare floating-point numbers rather than `==`.
211 | - Use `&` (and) to require both conditions, `|` (or) to accept either condition, and `!` (not) to invert the sense of a condition.
212 | - Use `name %in% c(value1, value2, ...)` to accept any of a fixed set of values for a variable.
213 | - `NA` (meaning "not available") represents unknown values.
214 | - Most operations involving `NA` produce `NA`, because there's no way to know the output without knowing the input.
215 | - Use `is.na(value)` to determine if a value is `NA`.
216 | - `filter` discards records with `FALSE` and `NA` results in tests.
217 |   - Use `is.na(value)` to include `NA`'s explicitly.
218 | - Use `desc(name)` to order by descending value of the variable `name` instead of by ascending value.
219 | - Use `select(frame, name1, name2, ...)` to select only the named variables from the given frame.
220 | - Use `name1:name2` to select all variables from `name1` to `name2` (inclusive).
221 | - Use `-(name1:name2)` to unselect all variables from `name1` to `name2` (inclusive).
222 | - Use `rename(frame, new_name = old_name)` to rename variables while keeping all of the others.
223 | - Use `everything()` to select every variable that hasn't otherwise been selected.
224 | - Use `one_of(c("name1", "name2"))` to select all of the variables named in the given vector.
225 | - Use `mutate(frame, name1=expression1, name2=expression2, ...)` to add new variables to the end of the given frame.
226 | - Use `transmute(frame, name1=expression1, name2=expression2, ...)` to create a new data frame whose values are derived from the values of an existing frame.
227 | - Use `group_by(frame, name1, name2, ...)` to group the values of `frame` according to distinct combinations of the values of the named variables.
228 |   - Use `ungroup` to restore the original data.
229 | - Use `summarize(frame, name = function(...))` to aggregate the values in an entire data frame or by group within a data frame.
230 | - By default `summarize` produces `NA` as output when there are `NA`s in the input.
231 |   - Use `na.rm = TRUE` to remove `NA`s from the data before summarization.
232 | - Use `frame %>% operation1(...) %>% operation2(...) %>% ...` to produce a new data frame by applying each operation to an existing one in order.
233 | - Use `n()` (for a simple count) or `sum(!is.na(name))` (to count the number of non-`NA`s) when summarizing values in order to see how many records contribute to an aggregated result.
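For example, a sketch of a pipeline over the `mpg` data (standing in for the flight data used in the book):

```
library(tidyverse)

# Group cars by class, then summarize each group, keeping a count of how many
# records contribute to each mean as a sanity check.
mpg %>%
  filter(!is.na(hwy)) %>%
  group_by(class) %>%
  summarize(count = n(), mean_hwy = mean(hwy, na.rm = TRUE)) %>%
  arrange(desc(mean_hwy))
```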
234 | - Common summarization functions include:
235 |   - `mean` and `median`
236 |   - `sd` for standard deviation
237 |   - `min`, `quantile`, and `max` for extrema and intermediate values
238 |   - `first`, `nth`, and `last` for positional extrema and intermediate values
239 |   - `n_distinct` for the number of distinct values
240 |   - `count` to calculate counts or weighted sums
241 | - Each summarization peels off one layer of grouping.
242 |
243 | ## 6. Workflow: Scripts
244 |
245 | ### Objectives
246 |
247 | - Use the RStudio editor to write, save, and run R scripts.
248 | - Describe two things that should *not* be put in scripts.
249 | - Explain how to spot and fix syntax errors in the RStudio editor.
250 |
251 | ### Key Points
252 |
253 | - Use Cmd/Ctrl + Enter in the editor to run the current R expression in the console.
254 | - Use Cmd/Ctrl + Shift + S to run the complete script in the console.
255 | - Do not put `install.packages` or `setwd` in scripts, since they will affect other people's machines when run.
256 | - The RStudio editor uses white-on-red X's and red squiggly underlining to highlight syntax errors.
257 |
258 | ## 7. Exploratory Data Analysis
259 |
260 | ### Objectives
261 |
262 | - Describe the steps in exploratory data analysis (EDA).
263 | - Describe two types of questions that are useful to ask during EDA.
264 | - Correctly define *variable*, *value*, *observation*, *variation*, and *tidy data*.
265 | - Explain what a *categorical variable* is and how best to store and visualize one.
266 | - Explain what a *continuous variable* is and how best to store and visualize one.
267 | - Explain why it is important to use a variety of bin widths when visualizing continuous variables as histograms.
268 | - List three questions whose answers will help you understand your data.
269 | - Describe and use a heuristic for identifying subgroups in data.
270 | - Explain how to handle outliers or unusual values in data.
271 | - Define *covariation* and describe how to visualize it for different combinations of categorical and continuous variables.
272 | - Explain how to make code clearer to experienced readers by omitting information.
273 |
274 | ### Key Points
275 |
276 | - Exploratory data analysis consists of:
277 |   - Generating questions about data.
278 |   - Searching for answers by visualizing, transforming, and modeling data.
279 |   - Using what is found to refine questions or generate new ones.
280 | - Two questions that are always useful to ask during EDA are:
281 |   - What type of variation occurs within my variables?
282 |   - What type of covariation occurs between my variables?
283 | - A _variable_ is something that can be measured.
284 | - A _value_ is the state of a variable when measured.
285 | - An _observation_ is a set of measurements made under similar conditions, and may contain several values (each associated with a different variable).
286 | - _Variation_ is the tendency of values to differ from measurement to measurement.
287 | - _Tidy data_ is a set of values, each of which is associated with exactly one variable and observation.
288 |   Tidy data is usually displayed in tabular form: each observation (or record) is a row, while each variable is a column with a name and a type.
289 | - A _categorical variable_ is one that takes on only one of a small set of values.
290 |   - Categorical variables are best represented using factors or character strings.
291 |   - The distribution of a categorical variable is best visualized using a bar chart (created using `geom_bar`).
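For instance, a sketch using the `diamonds` data from ggplot2:

```
library(tidyverse)

# Distribution of a categorical variable: one bar per value of cut...
ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut))

# ...and the same counts as a table.
diamonds %>% count(cut)
```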
292 |   - `dplyr::count(name)` counts the number of occurrences of each value of a categorical variable.
293 | - A _continuous variable_ is one that takes on any of an infinite set of ordered values.
294 |   - Continuous variables are best represented using numbers or date-times.
295 |   - The distribution of a continuous variable is best visualized using a histogram.
296 |   - `dplyr::count(ggplot2::cut_width(name, width))` divides occurrences into bins and counts the number of occurrences in each bin.
297 | - Histograms with different bin widths can have very different visual appearances, so varying the bin width provides insight that no single bin width can.
298 |   - Use `geom_histogram(mapping=..., binwidth=value)` to vary the width of histogram bins.
299 |   - Or use `geom_freqpoly` to display histograms using lines instead of bars.
300 | - Three questions to ask of any dataset are:
301 |   - Which values are most common (and why)?
302 |   - Which values are rare (and why)?
303 |   - What patterns are present in the data?
304 |     - How can you describe the pattern?
305 |     - How strong is it?
306 |     - Is it a coincidence?
307 |     - Does the pattern change if you examine subgroups of the data?
308 | - Clusters of similar values suggest that data contains subgroups. To characterize these subgroups, ask:
309 |   - How are observations in each cluster similar?
310 |   - How do observations in different clusters differ?
311 |   - What might explain the existence of these clusters?
312 |   - How might the appearance of these clusters be misleading (e.g., an artifact of the visualization used)?
313 | - If outliers are present, repeat each analysis with and without them.
314 |   - If there are only a few, and dropping them doesn't affect results, use `mutate` and `ifelse` to replace them with `NA`.
315 |   - If there are many, or dropping them changes results, account for them in analysis and reporting.
316 | - _Covariation_ is the tendency for some variables to vary in related ways.
317 | - When visualizing the relationship between continuous and categorical variables:
318 |   - Displaying raw counts can be misleading if the number of items in different categories varies widely.
319 |   - Displaying _densities_ (i.e., counts standardized so that the area of each curve is the same) can be more informative.
320 |   - Boxplots show less of the raw data, but are easier to interpret when there are many categories.
321 |   - Reorder unordered categorical variables to make trends easier to see.
322 | - When visualizing the relationship between two categorical variables:
323 |   - Display the counts for each pairing of values (e.g., using `geom_count` or `geom_tile`).
324 |   - In general, put the categorical variable with the greater number of categories or the longer labels on the Y axis.
325 | - When visualizing the relationship between two continuous variables:
326 |   - Use a scatterplot with jittering or transparency to handle datasets with up to hundreds of points.
327 |   - Use `geom_bin2d` or `geom_hex` to bin values in two dimensions.
328 |   - Bin one or both of the continuous variables so that the techniques for categorical variables can be used.
329 |   - Use `cut_width` or `cut_number` to bin continuous values by value range or number of values respectively.
330 | - Omitting argument names for commonly-used functions makes code easier for experienced programmers to understand.
331 |   - The first two arguments to `ggplot` are the dataset and the mapping.
332 |
333 | ## 8.
Workflow: Projects 334 | 335 | ### Objectives 336 | 337 | - Explain why analysts should save scripts rather than environments. 338 | - Explain what a *working directory* is and how to find what yours is. 339 | - Explain why setting your working directory from within your script is a bad idea. 340 | - Explain the difference between an *absolute path* and a *relative path* and the meaning of the symbol `~` in a path. 341 | - Explain what an RStudio project is and how one is stored. 342 | 343 | ### Key Points 344 | 345 | - Analysts should save scripts rather than environments because it is much easier to reconstruct an environment from a script than to reconstruct a script from an environment. 346 | - The _working directory_ is the directory where R looks for and saves files by default, and is displayed by calling `getwd()`. 347 | - Setting the working directory from within a script with `setwd` makes reproducibility more difficult because that directory may not exist on some other (person's) machine. 348 | - An _absolute path_ specifies a single location starting from the top of the filesystem. 349 | - A _relative path_ specifies a location starting from the current directory, and may identify different locations depending on where it is used. 350 | - The symbol `~` refers to the user's home directory on macOS and Linux, and to the user's `Documents` directory on Windows. 351 | - An RStudio project is a directory that contains the scripts and other files involved in an analysis. 352 | - Each RStudio project contains a `.Rproj` file with information about the project. 353 | 354 | ## 10. Tibbles 355 | 356 | ### Objectives 357 | 358 | - Explain the relationship between a tibble and a `data.frame` and the main ways in which tibbles differ from `data.frame`s. 359 | - Create tibbles from `data.frame`s and from scratch. 360 | - Explain what a *non-syntactic name* is and how to create tibble columns with non-syntactic names. 361 | - FIXME: explain how to use `tribble` (which requires an understanding of `~`). 362 | - Display an arbitrary number of rows and columns of a tibble. 363 | - Subset tibbles using `[[...]]`. 364 | - Subset tibbles using `$`. 365 | - FIXME: explain use of `[...]` (single bracket). 366 | 367 | ### Key Points 368 | 369 | - A tibble is a `data.frame` whose behaviors have been modified to work better with the tidyverse. 370 | - Tibbles never change their inputs' types. 371 | - Tibbles never adjust the names of variables. 372 | - Tibbles evaluate their constructor arguments lazily and sequentially, so that later variables can use the values of earlier variables. 373 | - Tibbles do not create row names. 374 | - Tibbles only recycle inputs of length 1, because recycling longer inputs has been a frequent source of bugs. 375 | - Tibbles can be created from `data.frame`s using `as_tibble` or from scratch using `tibble`. 376 | - Use `is_tibble` to determine if something is a tibble or not. 377 | - Use `class` to determine the classes of something. 378 | - A _non-syntactic name_ is one which is not a valid R variable name. 379 | - To create a non-syntactic column name, enclose the name in back-quotes. 380 | - Use `print` with `n` to set the number of rows and `width` to set the number of character columns. 381 | - Use `name[["variable"]]` or `name$variable` to extract the column named `variable` from a tibble. 382 | - Use `name[[N]]` to extract column `N` (integer) from a tibble. 383 | 384 | ## 11. 
Data Import 385 | 386 | ### Objectives 387 | 388 | - Name six functions for reading tabular data and explain their use. 389 | - Read CSV data files with multiple header lines, comments, missing headers, and/or markers for missing data. 390 | - Explain how data reader functions determine whether they have extra and missing values, and how they handle them. 391 | - Name four functions used to parse individual values and explain their use. 392 | - Explain how to obtain a summary of parsing problems encountered by data reading functions. 393 | - Define *locale* and explain its purpose and use. 394 | - Define *encoding* and explain its purpose and use. 395 | - Explain how `readr` functions determine column types. 396 | - Set the data types of columns explicitly while reading data. 397 | - Explain how to write well-formatted tabular data. 398 | - Describe what information is lost when writing tibbles to delimited files and what formats can be used instead. 399 | 400 | ### Key Points 401 | 402 | - Use the following functions to read tabular data in common formats: 403 | - `read_csv`: comma-delimited files. 404 | - `read_csv2`: semicolon-delimited files. 405 | - `read_tsv`: tab-delimited files. 406 | - `read_delim`: files using an arbitrary delimiter. 407 | - `read_fwf`: files with fixed-width fields. 408 | - `read_table`: read common fixed-width tabular formats with whitespace separators. 409 | - Use `skip=n` to skip the first N lines of a file. 410 | - Use `comment="#"` (or something similar) to ignore lines starting with `#`. 411 | - Use `col_names=FALSE` to stop `read_csv` from interpreting the first row as column headers. 412 | - Use `col_names=c("first", "second", "third")` to specify column names by hand. 413 | - Use `na="."` (or something similar) to specify the value(s) used to mark missing data. 414 | - Data reader functions use the number of values in the first row to determine the number of columns. 415 | - Extra values in subsequent rows are omitted. 416 | - Missing values in subsequent rows are set to `NA`. 417 | - Use `parse_integer`, `parse_number`, `parse_logical`, and `parse_date` to parse strings containing integers, general numbers, Booleans, and dates. 418 | - Use `na="."` (or similar) to specify the value(s) that should be interpreted as missing data. 419 | - Use `problems(name)` to access the `problems` attribute of the output of data reading functions. 420 | - A _locale_ is a collection of linguistic and/or regional settings for information formats, such as Canadian English or Brazilian Portuguese. 421 | - Use `locale(...)` to specify such things as the separator character used in long numbers. 422 | - An _encoding_ is a specification of how characters are represented digitally, such as ASCII or UTF-8. 423 | - Specify `encoding="name"` when parsing data to interpret the character data correctly. 424 | - UTF-8 is now the most commonly-used character encoding scheme. 425 | - Data reading functions read the first 1000 rows of the dataset and use the heuristics embodied in `guess_parser` to guess the type of the column. 426 | - Use `col_types=cols(...)` to manually specify the types of the columns of a data file. 427 | - Use `name = col_double()` to set the column's name to `name` and the type to `double`. 428 | - Use `name = col_date()` to set the name to `name` and the type to `date`. 429 | - Use `write_csv` to write data in comma-separated format and `write_tsv` to write it in tab-separated format. 
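A reading-and-writing sketch (the file names and columns here are invented for illustration):

```
library(tidyverse)

# Read a CSV that has two comment lines at the top, uses "." to mark missing
# data, and whose column types we want to pin down explicitly.
measurements <- read_csv("measurements.csv",   # hypothetical file
                         skip = 2,
                         na = ".",
                         col_types = cols(
                           site  = col_character(),
                           date  = col_date(),
                           value = col_double()
                         ))

# Write the cleaned-up table back out.
write_csv(measurements, "measurements-clean.csv")
```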
430 | - Use `write_excel_csv` to write CSV with extra information so that it can immediately be loaded by Microsoft Excel. 431 | - Use `na="marker"` to specify how `NA` should be shown in the output. 432 | - Delimited file formats only store column names, not column types, so the latter have to be re-guessed when the file is re-read. 433 | - (Old) Saving data in R's custom binary format RDS will save type information. 434 | - (New) Saving data in the cross-language Feather format will also save type information, and this data can be read in multiple languages. 435 | 436 | ## 12. Tidy Data 437 | 438 | ### Objectives 439 | 440 | - Describe the three rules tabular data must obey to be considered "tidy", and the advantages of storing data this way. 441 | - Explain what *gathering* data means and use gather operations to tidy datasets. 442 | - Explain what *spreading* data means and use spread operations to tidy datasets. 443 | - Explain what *separating* data means and use separate operations to tidy datasets. 444 | - Explain what *uniting* data means and use unite operations to tidy datasets. 445 | - Describe two ways in which values can be missing from a dataset. 446 | - Explain how to *complete* a dataset and use completion operations to tidy datasets. 447 | - Explain why it can be useful to carry values forward and use this to tidy datasets. 448 | 449 | ### Key Points 450 | 451 | - Tidy data obeys three rules: 452 | - Each variable has its own column. 453 | - Each observation has its own row. 454 | - Each value has its own cell. 455 | - Tidy data is easier to process because: 456 | - No subsidiary processing is required (e.g., to split names into personal and family names). 457 | - Each column can be processed independently (e.g., there's no need to choose the type of processing based on a "type" field in another column). 458 | - To _gather_ data means to take N columns whose names are actually values and transform them into 2 columns where the first column holds the former column names and the second holds the values. 459 | - Use `gather(name, name, ..., key="key_name", value="value_name")` to transform the named columns into two columns with names `key_name` and `value_name`. 460 | - To _spread_ data means to take two columns, the N values in the first of which identify the meanings of the values in the second, and create N+1 columns, one for each of the distinct values in the first column. 461 | - Use `spread(key=first, value=second) to spread the values in `second` according to the keys in `first`. 462 | - To _separate_ data means to split one column into multiple values. 463 | - Use `separate(name, into=c("first", "second", ...))` to separate the values in one column to create multiple new columns. 464 | - To _unite_ data means to combine the values of two or more columns into a single column. 465 | - Use `unite(new_name, first, second, ...)` to combine the named columns to create a column named `new_name`. 466 | - Values will be combined with `_` unless `sep="#"` (or similar) is used (with `sep=""` to unite without a separator). 467 | - Use `convert=TRUE` with these functions to (try to) convert data types. 468 | - Values can be explicitly missing (the presence of an absence) or their entries can be missing entirely (the absence of a presence). 469 | - To _complete_ a dataset means to fill in missing combinations of values. 470 | - Use `complete(first, second, ...)` to fill in missing combinations of the values from the named columns. 
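For example, a small tidying sketch on an invented table:

```
library(tidyverse)

# An untidy table: the 2015 and 2016 columns are really values of a "year" variable.
cases <- tribble(
  ~country, ~`2015`, ~`2016`,
  "Canada",     120,     140,
  "Chile",       75,      80
)

# gather turns the year columns into key/value pairs...
tidy_cases <- cases %>%
  gather(`2015`, `2016`, key = "year", value = "count", convert = TRUE)

# ...and spread reverses the operation.
tidy_cases %>% spread(key = year, value = count)
```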
471 | - Missing values sometimes indicate that the most recent value should be carried forward. 472 | - Use `fill(first, second, ...)` to carry the most recent observation(s) forward in the named column(s). 473 | 474 | ## 13. Relational Data 475 | 476 | ### Objectives 477 | 478 | - Define *relational data* and explain what *keys* are and how they are used when processing it. 479 | - Explain the difference between a *primary key* and a *foreign key*, and explain how to determine whether a key is actually a primary key. 480 | - Explain what a *surrogate key* is and why surrogate keys are sometimes needed. 481 | - Explain how relations are represented in relational data and describe three types of relations. 482 | - Define a *mutating join* and use mutating joins to combine information from two tables. 483 | - Define four kinds of joins and use each to combine information from two tables. 484 | - Explain what joins do if some keys are duplicated, and when this might occur. 485 | - Describe and use some common criteria for joins. 486 | - Define a *filtering join*, describe two types of filtering joins, and use them to combine information from two tables. 487 | - Describe the difference between how mutating joins and filtering joins behave in the presence of duplicated keys. 488 | - Describe three steps for identifying keys in tables that can be used in joins. 489 | - Describe and use three set operations on records. 490 | 491 | ### Key Points 492 | 493 | - _Relational data_ is made up of sets of tables that are related in some way. 494 | - A _key_ is a variable or set of variables whose values uniquely identify observations in a table. 495 | - Keys are used to connect observations in one table to observations in another. 496 | - A _primary key_ uniquely identifies an observation in its own table. 497 | - Use `count(name)` and `filter(n > 1)` to identify multiple occurrences of what is supposed to be a primary key. 498 | - A _foreign key_ uniquely identifies an observation in some other table, and is used to connect information between those tables. 499 | - A _surrogate key_ is an arbitrary identifier associated with an observation (such as a row number) that has no real-world meaning. 500 | - Surrogate keys are sometimes added to data when the data itself has no valid primary keys. 501 | - Relations are represented by matching primary keys in one table to foreign keys in another. Relations can be: 502 | - _One-to-one_ (or 1-1), meaning there is exactly one matching value in each table. 503 | - _One-to-many_ (or 1-N), meaning that each value in one table may have any number of matching values in another. 504 | - _Many-to-many_ (or N-N), meaning that there may be many matching values in each table. 505 | - A _mutating join_ updates one table with corresponding information from another table. 506 | - An _inner join_ combines observations from two tables when their keys are equal, discarding any unmatched rows. 507 | - Use `inner_join(left, right, by="name")` to join tables `left` and `right` on equal values of the column `name`. 508 | - A _left outer join_ (or simply _left join_) combines observations when keys are equal, keeping rows from the left table even if there are no corresponding values from the right table. 509 | - Missing values from the right table are assigned `NA` in the result. 510 | - Use `left_join` with arguments as above. 511 | - A _right outer join_ (or simply _right join_) does the same, but keeps rows from the right table even when rows from the left are missing. 
512 |   - Use `right_join` with arguments as above.
513 | - A _full outer join_ (or simply _full join_) keeps all rows from both tables, filling in for gaps in either.
514 |   - Use `full_join` with arguments as above.
515 | - If a key is duplicated in one or both tables, a join will produce all combinations of records with that key.
516 |   - This often arises when a key is a primary key in one table and a foreign key in another.
517 |   - If keys are duplicated in both tables, it may be a sign that the data is corrupt or that the supposed key actually isn't one.
518 | - A _natural join_ combines tables using equal values for all columns with identical names.
519 |   - Use `by=NULL` in a join function to force a natural join.
520 |   - Use `by=c("name1", "name2", ...)` to join on equal values of named columns.
521 |   - Use `by=c("a" = "b", "c" = "d", ...)` to join on columns with different names.
522 |   - Use `suffix=c("name1", "name2")` to override the default `.x`, `.y` suffixes used for name collisions.
523 | - A _filtering join_ is one that keeps (or discards) observations from one table based on whether they match (or do not match) observations in a second table.
524 |   - Use `semi_join(left, right)` to keep rows in `left` that have matches in `right`.
525 |   - Use `anti_join(left, right)` to keep rows in `left` that do *not* have matches in `right`.
526 | - Because they only keep or discard rows, filtering joins never create duplicate entries, while mutating joins can if keys are duplicated.
527 | - Three steps for identifying keys in tables that can be used in joins are:
528 |   - Identify the variable or variables that form the primary key for each table based on an understanding of the data.
529 |   - Check that each table's primary key has no missing values.
530 |   - Check that possible foreign keys match primary keys in other tables (e.g., by using `anti_join` to look for missing matches).
531 | - Three set operations that work on entire records are:
532 |   - `union(left, right)`: returns unique observations that are in either or both tables.
533 |   - `intersect(left, right)`: returns unique observations that are in both tables.
534 |   - `setdiff(left, right)`: returns unique observations that are in `left` but not in `right`.
535 |
536 | ## 14. Strings
537 |
538 | ### Objectives
539 |
540 | - Write character strings in R, including ones that contain special characters.
541 | - Write multiple strings to the terminal, respecting escaped characters.
542 | - Use functions from the `stringr` package to perform basic operations on strings.
543 | - Explain what a *regular expression* is and what kinds of patterns it can match.
544 | - Describe two functions that implement regular expressions and use them to match simple patterns against text.
545 | - Describe nine patterns provided by regular expressions.
546 | - Capture subsections of matched text in regular expressions and re-match captured text within a pattern.
547 | - Detect and extract matches between a pattern and the strings in a vector.
548 | - Replace substrings that match regular expressions.
549 | - Split strings based on regular expression matches.
550 | - Locate substrings that match regular expressions.
551 | - Control matching options in regular expressions.
552 | - Find objects in the global environment whose names match a regular expression.
553 | - Find files and directories whose names match a regular expression.
554 |
555 | ### Key Points
556 |
557 | - Character strings in R are enclosed in matching single or double quotes.
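For example:

```
s1 <- "a string in double quotes"
s2 <- 'a single-quoted string containing a "double-quoted" phrase'
s3 <- "the same thing with an escaped \"double quote\" inside"
```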
558 | - Use backslash to escape special characters such as `\"`, `\n`, and `\\`.
559 | - Use `writeLines` to display a string or a vector of strings with special characters interpreted.
560 | - Use `str_length` to get a string's length.
561 | - Use `str_c` to concatenate strings.
562 | - Use `str_sub` to extract or replace substrings.
563 | - Use `str_to_lower`, `str_to_upper`, and `str_to_title` to change the case of strings.
564 | - Use `str_sort` to sort a vector of strings.
565 | - Use `str_order` to get the ordered indices of the strings in a vector.
566 | - Use `str_pad` to pad a string to a specified width and `str_trim` to remove leading and trailing whitespace.
567 | - A _regular expression_ is a pattern that matches text.
568 | - Regular expressions are written as text using punctuation and other characters to express choice, repetition, and other operations.
569 | - Regular expressions can express patterns that have fixed nesting, but not patterns that have unlimited nesting (such as nested parenthesization).
570 | - Use `str_view(text, pattern)` to find the first match of `pattern` to `text` and `str_view_all` to view all matches.
571 | - Nine patterns used in regular expressions are:
572 |   - `.` matches any single character.
573 |   - `\` escapes the character that follows it.
574 |   - `^` and `$` match the beginning and end of the string respectively (without consuming any characters).
575 |   - Use `\d` to match digits and `\s` to match whitespace.
576 |   - Use `[abc]` to match any single character in a set and `[^abc]` to match any character *not* in a set.
577 |   - Use `left|right` to match either of two patterns.
578 |   - Use `{M,N}` to repeat a pattern M to N times.
579 |   - Use `?` to signal that a pattern is optional (i.e., repeated zero or one times), `*` to repeat a pattern zero or more times, and `+` to repeat a pattern at least once.
580 |   - Use parentheses `(...)` for grouping, just as in mathematics.
581 | - Every set of parentheses in a regular expression creates a numbered _capture group_.
582 |   - Use `\1`, `\2`, etc. to refer to capture groups within a pattern in order to match the same actual text two or more times.
583 | - Use `str_detect(strings, pattern)` to create a logical vector showing where a pattern does or doesn't match.
584 | - Use `str_subset(strings, pattern)` to select the subset of strings that match a pattern and `str_count` to count the number of matches.
585 | - Use `str_extract(strings, pattern)` to extract the first match for the pattern in each string.
586 | - Use `str_extract_all(strings, pattern)` to extract all matches for the pattern in each string.
587 | - Use `str_match(string, pattern)` to extract parenthesized sub-matches for a pattern.
588 | - Use `tidyr::extract` to extract parenthesized sub-matches from a tibble into new columns.
589 | - Use `str_replace` or `str_replace_all` to replace substrings that match regular expressions.
590 | - Use `str_split` to split a string based on regular expression matches.
591 | - Use `str_locate` and `str_locate_all` to find the starting and ending positions of substrings that match regular expressions.
592 | - Use `regex` explicitly to construct a regular expression and control options such as multi-line matches and embedded comments.
593 | - Use `apropos` to find objects in the global environment whose names match a regular expression.
594 | - Use `dir` to find files and directories whose names match a regular expression.
595 |
596 | ## 15.
Factors 597 | 598 | ### Objectives 599 | 600 | - Define *factor* and explain the purpose of factors in R. 601 | - Create and (re-)order factors. 602 | - Determine the valid levels of a factor. 603 | - Rename the levels of a factor. 604 | 605 | ### Key Points 606 | 607 | - A _factor_ is a variable that can take on one of a fixed set of values. 608 | - Factors are ordered, but the order is not necessarily alphabetical. 609 | - Use `factor(values, levels)` to create a vector of factors by matching strings in `values` to level names in `levels`. 610 | - Values that don't match level names are converted to `NA`. 611 | - The idiom `factor(values, unique(values))` orders the factors according to their first appearance in the data. 612 | - Use `fct_reorder(factor, values)` to reorder a factor according to a set of numeric values. 613 | - Use `levels(factor)` to recover the valid levels of a factor. 614 | - Use `fct_relevel(factors, "levels")` to move the named levels to the front of the list of factors (e.g., for display purposes). 615 | - Use `fct_infreq` to reorder factors by frequency. 616 | - Use `fct_rev` to reverse the order of factors. 617 | - Use `fct_recode(factor, "new_name_1" = "old_name_1", "new_name_2" = "old_name_2", ...)` to rename some or all factors. 618 | - Assigning several old levels to a single new level combines entries. 619 | - Use `fct_collapse(factors, new_name = c("old_name_1", "old_name_2"), ...)` to collapse many levels at once. 620 | - Use `fct_lump(factor, n=N)` to combine the smallest factors, leaving `N` groups. 621 | 622 | ## 16. Dates and Times 623 | 624 | ### Objectives 625 | 626 | - Describe three types of data that refer to an instant in time. 627 | - Get the current date and time. 628 | - Describe and use three ways to create a date-time. 629 | - Convert dates to date-times and date-times to dates. 630 | - Describe and use eight accessor functions to extract components of dates and date-times. 631 | - Describe and use three functions for rounding dates. 632 | - Explain how to modify components of dates and date-times. 633 | - Explain an idiom for exploring patterns in the lower-order components of date-times. 634 | - Explain how the difference between two moments in time is represented in base R and when using `lubridate`. 635 | - Explain the difference between a difftime, a *period*, and an *interval*. 636 | - Determine your current timezone. 637 | 638 | ### Key Points 639 | 640 | - Instants in time are described by _date_, _time_, and _date-time_. 641 | - Use `today` to get the current date and `now` to get the current date-time. 642 | - A date-time can be created from a string, from individual date and time components, or from an existing date-time. 643 | - Use `lubridate` functions such as `ymd` or `dmy` to parse year-month-day dates. 644 | - Use functions such as `ymd_hms` to parse full date-times. 645 | - Supplying a timezone with `tz="XYZ"` forces the creation of a date-time instead of just a date. 646 | - Use `make_date` or `make_datetime` to construct a date or date-time from numeric components. 647 | - Use `as_datetime` to convert a date to a date-time and `as_date` to convert a date-time to a date. 
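For example, a sketch using lubridate (installed as part of the tidyverse but loaded separately):

```
library(lubridate)

today()                             # the current date
ymd("2018-03-05")                   # parse a year-month-day string
ymd_hms("2018-03-05 10:15:30")      # parse a full date-time
make_datetime(2018, 3, 5, 10, 15)   # build a date-time from numeric components
as_date(now())                      # convert a date-time to a date
```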
648 | - Use the following accessor functions to extract components from dates and date-times:
649 |   - `year` and `month`
650 |   - `yday` (day of the year), `mday` (day of the month), and `wday` (day of the week)
651 |   - `hour`, `minute`, and `second`
652 | - Use `floor_date`, `round_date`, and `ceiling_date` to round dates down, to the nearest, or up to a specified unit.
653 | - Use an accessor function on the left side of assignment to modify a portion of a date or date-time in place.
654 |   - E.g., use `year(x) <- 2018` to set the year of a date or date-time to 2018.
655 | - Use `update(existing, name=value, ...)` to create a new date-time with modified values.
656 |   - Use `update` to set the higher-order components of date-times to a constant in order to explore the variation in the lower-order components.
657 | - A _difftime_ represents the difference between two moments in time, in units that may range from seconds to weeks.
658 |   - Use `as.duration(difftime)` to convert to a `lubridate` `duration`, which always uses seconds to represent differences in times.
659 |   - Use `dyears`, `dseconds`, etc. to construct durations explicitly.
660 | - A _period_ represents the difference between two times taking human factors into account (such as daylight saving time).
661 | - An _interval_ is a duration with a starting point, which makes it precise enough that its exact length can be determined.
662 | - Use `Sys.timezone()` to determine your current timezone.
663 |
664 | ## 18. Pipes
665 |
666 | ### Objectives
667 |
668 | - Describe the pros and cons of four ways to write successive operations on data.
669 | - Explain the use of `%T>%`, `%$%`, and `%<>%`.
670 |
671 | ### Key Points
672 |
673 | - Four ways to write successive operations on data are:
674 |   - Save each intermediate step as a new object: a lot of typing with many opportunities for transposition mistakes.
675 |   - Overwrite the original object many times: loss of originals makes debugging difficult, and repetition of a single name makes reading difficult.
676 |   - Compose functions: unnatural reading order and parameters widely separated from function names.
677 |   - Use the pipe `%>%`: simple to read *if* the transformations are sequential and applied to a single main stream of data.
678 | - `%T>%` ("tee") returns its left side rather than its right.
679 | - `%$%` unpacks the variables in a `data.frame` (which is useful when calling functions in base R that don't rely on `data.frame`s).
680 | - `%<>%` assigns the result back to the starting variable.
681 |
682 | ## 19. Functions (and Control Flow)
683 |
684 | ### Objectives
685 |
686 | - Explain the benefits of creating functions.
687 | - Describe three steps in the creation of a function.
688 | - Define functions of zero or more arguments.
689 | - Describe three rules that function names should follow.
690 | - Describe the difference between data and details in function arguments.
691 | - Define *conditional statement* and write conditional statements with multiple branches and a default branch.
692 | - Explain what a *short-circuit operator* is and write conditions using these operators.
693 | - Define *precondition* and implement preconditions in functions.
694 | - Write functions that take (and pass on) a varying number of arguments.
695 | - Describe and use two ways to return values from functions.
696 | - Implement pipeable functions that perform transformations or have side effects.
697 |
698 | ### Key Points
699 |
700 | - Create functions to eliminate duplicated code, make programs more readable, and simplify maintenance and evolution.
701 | - When creating a function, select a name, decide on its arguments, and write its body.
702 | - Function names should:
703 |   1. Prefer verbs (actions) to nouns (things).
704 |   2. Use full words and consistent typography.
705 |   3. Be consistent with other functions in the same package or application.
706 | - Arguments to functions are (broadly speaking) either:
707 |   - Data to be operated on (come first).
708 |   - Details controlling how the function operates (come last, and should have default values).
709 |   - When overriding the value of a default, use the full name of the argument.
710 | - A _conditional statement_ may or may not execute code depending on whether a condition is true or false.
711 |   - Each conditional statement must have one `if`, zero or more `else if`, and zero or one `else` in that order.
712 |   - Each branch except the `else` must have a logical condition that determines whether it is selected.
713 |   - Branch conditions are tested in order, and only the code associated with the first branch whose condition is true is executed.
714 |   - If no condition is true, and an `else` is present, the code in the `else` branch is executed.
715 |   - Conditions must be `TRUE` or `FALSE`, not vectors or `NA`s.
716 | - A _short-circuit operator_ stops evaluating terms as soon as it knows whether the overall value is `TRUE` or `FALSE`.
717 |   - "and", written `&&`, stops as soon as a term is `FALSE`.
718 |   - "or", written `||`, stops as soon as a term is `TRUE`.
719 |   - Use the functions `any`, `all`, and `identical` to collapse vectors into single values for testing.
720 | - Always indent the bodies of conditionals and functions (preferably by two spaces) and obey style rules for placement of curly braces.
721 | - A _precondition_ is something that must be true of a function's inputs in order for the function to work correctly.
722 |   - Use `if` and `stop` to check that inputs are sensible before processing them, and to generate a meaningful error message when they are not.
723 |   - Or use `stopifnot` to check that one or more conditions are true (without generating a custom error message).
724 | - Use `...` (three dots) as a placeholder for zero or more arguments, which can then be passed into other functions.
725 |   - Use `list(...)` to convert the actual arguments to a list for processing.
726 | - A function in R returns either:
727 |   - An explicit value when `return(value)` is called.
728 |   - The value of the last expression evaluated if no explicit `return` was executed.
729 | - To make a function pipeable:
730 |   - For a transformation, take the data to be transformed as the first argument and return a modified object.
731 |   - For a side effect, perform the operation (e.g., save to a file) and use `invisible(value)` to return the value without printing it.
732 |
733 | ## 20. Vectors
734 |
735 | ### Objectives
736 |
737 | - Define *atomic vector* and *list*, explain the differences between them, and give examples of each.
738 | - Explain what `NULL` is used for and how it differs from `NA`.
739 | - Determine the type and length of an arbitrary value.
740 | - Describe the values that logical vectors can contain and how they are usually constructed.
741 | - Describe the values that integer and double vectors can contain and the special values that each type can contain.
742 | - Describe the values that character vectors can contain.
743 | - Explain the difference between *explicit coercion* and *implicit coercion* and use the former to convert values from one type to another.
744 | - Explain the rule used to determine the type of a vector explicitly constructed out of values of different types.
745 | - Define *recycling* and correctly identify and interpret uses of it.
746 | - Recycle values explicitly.
747 | - Give vector elements names and explain when and why this is useful.
748 | - Describe six ways to subset a vector.
749 | - Explain the difference between single (`[...]`) and double (`[[...]]`) subsetting.
750 | - Create and inspect lists.
751 | - Subset lists.
752 | - Define *attribute* and *augmented vector*.
753 | - Describe three ways vector attributes are used in R.
754 | - Explain how factors are implemented using augmented vectors.
755 | - Explain how tibbles are implemented using augmented vectors.
756 |
757 | ### Key Points
758 |
759 | - An _atomic vector_ is a homogeneous structure that holds logical, integer, double, character, complex, or raw data.
760 | - A _list_ (sometimes called a _recursive vector_) is a vector that can hold heterogeneous data, including other vectors.
761 | - The special value `NULL` represents the absence of a vector, or a vector of length zero, while `NA` represents the absence of a value.
762 | - Use `typeof(thing)` to obtain the name of the type of `thing`.
763 | - Use `is_logical`, `is_integer`, and similarly-named functions to test the types of values.
764 | - Use `is_scalar_integer` and similarly-named functions to test the type of a value and whether it is scalar or vector.
765 | - Use `length(thing)` to obtain the (integer) length of `thing`.
766 | - Logical vectors can contain `TRUE`, `FALSE`, and `NA`, and are often constructed using Boolean expressions such as comparisons.
767 | - Integer vectors contain integer values, which should be used for counting.
768 | - To force a value to be stored as an integer, write it without a decimal portion and put `L` after it (for "long").
769 | - Integer vectors can contain the special value `NA`.
770 | - Use `is.na` to check for this special value.
771 | - Double vectors contain floating-point numbers, which should be used for measurement.
772 | - Double vectors can contain the special values `NA`, `NaN` (not a number), `Inf` (infinity), and `-Inf` (negative infinity).
773 | - Use `is.finite`, `is.infinite`, and `is.nan` to check for these special values.
774 | - Character vectors can contain character strings, each of which can be arbitrarily long.
775 | - _Explicit coercion_ is the use of a function to convert values from one type to another.
776 | - Use `as.logical`, `as.character`, `as.integer`, or `as.double` to create a new vector containing the converted values from an original.
777 | - _Implicit coercion_ occurs when a value or vector of one type is used where another type is expected.
778 | - The function `c(value1, value2, ...)` creates a vector whose type is the most complex of the types of the provided values.
779 | - In order of increasing complexity, the types are logical < integer < double < character.
780 | - To _recycle_ values is to re-use elements of the shorter vector involved in an operation to match the length of the longer one.
781 | - A "scalar" in R is actually a vector of length 1, and most recycling involves replicating a scalar to have the same length as a vector.
782 | - Base R produces a warning if the length of the longer vector is not an integer multiple of the length of the shorter one.
783 | - Tidyverse functions throw errors in this case to forestall unexpected results.
784 | - Use `rep(values, times)` to recycle (or repeat) values explicitly.
785 | - Some or all vector elements can be given names when the vector is constructed using `c(name1=value1, ...)`.
786 | - Use `purrr::set_names(vector, names)` to set the names of a vector's values after the fact.
787 | - A vector can be subsetted using `[...]` in five ways:
788 |     - Subsetting with a vector of positive integers selects those elements in order (possibly with repeats).
789 |     - Subsetting with a vector of negative integers selects all elements *except* those identified.
790 |     - Subsetting with zero creates an empty vector.
791 |     - Subsetting with a logical vector keeps values corresponding to `TRUE` elements of the logical vector.
792 |     - Subsetting with a character vector keeps only those values with the given names (possibly with repeats).
793 | - Using an empty subscript `[]` returns the entire vector.
794 | - Create lists using `list(value1, value2, ...)` and inspect their structure with `str(name)`.
795 | - Subsetting a list with `[...]` always returns a list.
796 | - Subsetting a list with `[[...]]` returns a single component (i.e., has one less level of nesting than the original).
797 | - `list$name` is equivalent to `list[["name"]]` for a named element of a list.
798 | - An _augmented vector_ is one that has extra named _attributes_ attached to it.
799 | - Get the value of a vector attribute using `attr(vector, name)`.
800 | - Set the value of a vector attribute using `attr(vector, name) <- value`.
801 | - Use `attributes(name)` to display all of the attributes of a vector.
802 | - Vector attributes are used to:
803 |     - Name the elements of a vector (`names`).
804 |     - Store dimensions to make a vector behave like a matrix.
805 |     - Store a class name to implement classes in the S3 object-oriented system.
806 | - A factor is an integer vector that has the class `factor` and a `levels` attribute storing the names of its levels.
807 | - A tibble is a list with three classes (`tbl_df`, `tbl`, and `data.frame`) and `names` and `row.names` attributes.
808 | - All elements of a tibble must be vectors having identical lengths.
809 |
810 | ## 21. Iteration
811 |
812 | ### Objectives
813 |
814 | - Describe the parts of a simple `for` loop.
815 | - Create empty vectors of a given type and length.
816 | - Explain why it is safer to use `seq_along(x)` than `1:length(x)`.
817 | - Write loops that iterate over the columns of a tibble using either indices or names.
818 | - Describe and use an efficient way to write loops when the size of the eventual output cannot be known in advance.
819 | - Explain how to write a loop when the number of required iterations is not known in advance.
820 | - Describe what happens when looping over the names of a vector that has some unnamed elements.
821 | - Explain what *higher-order functions* are and why they're useful, and write higher-order functions.
822 | - Describe the `map` family of functions and their purpose, and rewrite simple `for` loops to use `map` functions.
823 | - Describe the purpose and use of the `safely`, `possibly`, and `quietly` functions.
824 | - Describe and use `map2` and `pmap`.
825 | - Describe and use `walk`, `walk2`, and `pwalk`.
826 | - Define *predicate function* and describe higher-order functions that work with predicate functions.
827 | - Define *reduction* and use `reduce` to implement it.
828 | - Define *accumulation* and use `accumulate` to implement it.
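A minimal sketch that puts a few of these objectives side by side before the key points below (the data frame `df` and the inputs passed to `safely` are made up for illustration): a preallocated `for` loop, its `map_dbl` equivalent, `safely` turning errors into data, and `reduce`/`accumulate`.

```r
library(purrr)

df <- data.frame(a = rnorm(10), b = rnorm(10), c = rnorm(10))

# A for loop: preallocate the output, iterate with seq_along(), fill by index.
out <- vector("double", length(df))
for (i in seq_along(df)) {
  out[i] <- median(df[[i]])
}

# The same computation with a higher-order function.
out2 <- map_dbl(df, median)

# safely() wraps a function so errors become data instead of stopping the loop.
safe_log <- safely(log)
results <- map(list(1, 10, "oops"), safe_log)
str(transpose(results)$error)  # NULL, NULL, then the error object

# reduce() collapses a vector with a binary function; accumulate() keeps every step.
reduce(1:5, `+`)      # 15
accumulate(1:5, `+`)  # 1  3  6 10 15
```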
829 |
830 | ### Key Points
831 |
832 | - A `for` loop usually has:
833 |     - A variable whose value changes for each iteration of the loop.
834 |     - A set of values being iterated over (such as the indices of a vector).
835 |     - A body that is executed once for each iteration.
836 |     - An output variable where results are stored (whose space is usually preallocated for efficiency).
837 | - Use `vector("type", length)` to generate a vector of the specified type and length (usually to be filled in later).
838 | - `1:length(x)` is non-empty when `x` is empty; `seq_along(x)` is empty when `x` is empty, and so is better to use in loop controls.
839 | - To loop over the columns of a tibble:
840 |     - Use `for (variable in seq_along(tibble))` to loop over the numeric indices of the columns.
841 |     - Use `for (variable in names(tibble))` to loop over the names of the columns.
842 | - When the size of a loop's eventual output cannot be known in advance, use a list to collect partial results and then `unlist` or `purrr::flatten_dbl` to combine them into a vector.
843 | - If the values are tables, collect them in a list and use `bind_rows` to combine them after the loop.
844 | - If the number of required iterations is not known in advance, use a `while` loop instead of a `for` loop.
845 | - Make sure that the condition of the `while` loop can be changed by the loop body so that the loop does not run forever.
846 | - If none of the elements of a vector have names, `names(vector)` returns `NULL`, so a `for` loop over those names doesn't execute any iterations.
847 | - If some of the elements of a vector have names and some don't, `names(vector)` returns empty strings for the unnamed elements.
848 | - This means that a `for` loop will execute, but that attempts to access unnamed vector elements by name will fail.
849 | - A _higher-order function_ is one that takes other functions as arguments.
850 | - Higher-order functions allow programmers to write control flow once and re-use it with different operations.
851 | - `map(object, function)` applies `function` to each element of `object` and returns a list of results.
852 | - The specialized functions `map_lgl`, `map_int`, etc., operate on and return vectors of specific types (logical, integer, etc.).
853 | - These functions preserve names and pass extra arguments through to the function provided.
854 | - FIXME: go through "21.5.1 Shortcuts" after learning about formulas.
855 | - `safely(func)` creates a new function that never throws an error, but instead always returns a list of two values:
856 |     - `result` is either the original result (if the original function ran without an error) or `NULL` (if there was an error).
857 |     - `error` is either `NULL` (if the original function ran without an error) or the error object (if there was an error).
858 | - `map(data, safely(func))` will therefore return a list of pairs.
859 | - And `transpose(map(data, safely(func)))` will return a pair of lists.
860 | - `possibly(func)` creates a new function that returns a user-supplied default value instead of throwing an error.
861 | - `quietly(func)` works like `safely` but captures printed output, messages, and warnings.
862 | - `map2(vec1, vec2, function)` applies `function` to corresponding elements from `vec1` and `vec2`.
863 | - `pmap(list_of_lists, function)` applies `function` to the values in each of the sub-lists.
864 | - It is safest to give the sub-lists names that match the names of the function's parameters rather than relying on positional matching.
865 | - The `walk` family of functions executes functions for their side effects, without collecting and returning their results.
866 | - A _predicate function_ is one that returns a single logical value.
867 | - `keep` and `discard` keep elements of the input where a predicate function returns `TRUE` or `FALSE` respectively.
868 | - `some` and `every` determine whether a predicate is true for any or all elements of the input data.
869 | - `detect` returns the first element for which a predicate is true.
870 | - `detect_index` returns the index of the first element for which a predicate is true.
871 | - `head_while` and `tail_while` collect runs of values from the start or end of a structure for which a predicate is true.
872 | - _Reduction_ combines many values using a binary (two-argument) function to create a single resulting value.
873 | - Use `reduce(data, function)` to do this.
874 | - `reduce` throws an error if `data` is empty unless an initial value `init` is provided.
875 | - _Accumulation_ performs the same operation as reduction, but keeps the intermediate results (e.g., it calculates a running sum instead of a single total).
876 |
--------------------------------------------------------------------------------