├── .Rbuildignore ├── .dockerignore ├── .github ├── CODE_OF_CONDUCT.md ├── CONTRIBUTING.md ├── ISSUE_TEMPLATE.md └── workflows │ └── bookdown.yaml ├── .gitignore ├── .htmlhintrc ├── .remarkrc ├── DESCRIPTION ├── Dockerfile ├── EDA.Rmd ├── LICENSE ├── NAMESPACE ├── README.md ├── WORDLIST ├── _bookdown.yml ├── _common.R ├── _config.yml ├── _output.yaml ├── appendixes.Rmd ├── bin ├── add-r4ds-links.R ├── build.sh ├── check-r4ds-sections.R ├── check-spelling.R ├── create-sitemap.R ├── create-toc.R ├── deploy.sh ├── hypothesis.r ├── is_primary_key.R ├── r4ds-toc.R ├── render.R ├── serve.R └── style.R ├── communicate.Rmd ├── contributions.Rmd ├── data ├── README.md ├── file1.csv ├── file2.csv └── file3.csv ├── datetimes.Rmd ├── diagrams ├── Lahman1.graffle ├── Lahman1.png ├── Lahman2.graffle ├── Lahman2.png ├── Lahman3.graffle ├── Lahman3.png ├── master-batting-salaries.graffle ├── master-batting-salaries.png ├── nested_set_1.dot ├── nested_set_2.dot ├── nycflights.graffle └── nycflights.png ├── docker-compose.yml ├── explore.Rmd ├── factors.Rmd ├── functions.Rmd ├── graphics-for-communication.Rmd ├── img ├── cover.png ├── r4ds-exercise-solutions-cover.key ├── r4ds-exercise-solutions-cover.png ├── rmarkdown-file.png ├── rmarkdown-knit-button.png ├── rmarkdown-notebook.png └── visualize │ ├── unnamed-chunk-29-1.png │ ├── unnamed-chunk-29-2.png │ ├── unnamed-chunk-29-3.png │ ├── unnamed-chunk-29-4.png │ ├── unnamed-chunk-29-5.png │ └── unnamed-chunk-29-6.png ├── import.Rmd ├── includes ├── hypothesis.html ├── ort.css ├── r4ds-solutions.css └── r4ds.css ├── index.Rmd ├── intro.Rmd ├── iteration.Rmd ├── many-models.Rmd ├── model-basics.Rmd ├── model-building.Rmd ├── model.Rmd ├── pipes.Rmd ├── program.Rmd ├── r4ds-exercise-solutions.Rproj ├── r4ds-toc.csv ├── r4ds.bib ├── relational-data.Rmd ├── rmarkdown-formats.Rmd ├── rmarkdown-workflow.Rmd ├── rmarkdown.Rmd ├── rmarkdown ├── caching.Rmd ├── cv.Rmd ├── diamond-sizes.Rmd └── example.Rmd ├── strings.Rmd ├── tibble.Rmd ├── 
tidy.Rmd ├── transform.Rmd ├── vectors.Rmd ├── visualize.Rmd ├── workflow-basics.Rmd ├── workflow-projects.Rmd ├── workflow-scripts.Rmd └── wrangle.Rmd /.Rbuildignore: -------------------------------------------------------------------------------- 1 | ^\.github$ 2 | ^.*\.Rproj$ 3 | ^\.Rproj\.user$ 4 | ^.*\.R?md$ 5 | ^.*\.R$ 6 | ^.*\.ya?ml$ 7 | ^.*\.json$ 8 | ^_bookdown_files$ 9 | ^WORDLIST$ 10 | ^.*\.html$ 11 | ^.*\.css$ 12 | ^bookdown[0-9a-f]+$ 13 | ^docs/$ 14 | ^node_modules$ 15 | ^bin/$ 16 | ^diagrams/$ 17 | ^.*\.rds$ 18 | ^r4ds\.(tex|bib)$ 19 | ^\.dockerignore$ 20 | ^\.remarkrc$ 21 | ^_build/$ 22 | ^.*\.utf8\.md$ 23 | ^_build$ 24 | ^img$ 25 | ^includes$ 26 | ^Dockerfile$ 27 | ^diagrams$ 28 | ^bin$ 29 | ^Makefile$ 30 | -------------------------------------------------------------------------------- /.dockerignore: -------------------------------------------------------------------------------- 1 | * 2 | !DESCRIPTION 3 | -------------------------------------------------------------------------------- /.github/CODE_OF_CONDUCT.md: -------------------------------------------------------------------------------- 1 | # Contributor Code of Conduct 2 | 3 | As contributors and maintainers of this project, we pledge to respect all people who 4 | contribute through reporting issues, posting feature requests, updating documentation, 5 | submitting pull requests or patches, and other activities. 6 | 7 | We are committed to making participation in this project a harassment-free experience for 8 | everyone, regardless of level of experience, gender, gender identity and expression, 9 | sexual orientation, disability, personal appearance, body size, race, ethnicity, age, or religion. 10 | 11 | Examples of unacceptable behavior by participants include the use of sexual language or 12 | imagery, derogatory comments or personal attacks, trolling, public or private harassment, 13 | insults, or other unprofessional conduct. 
14 | 15 | Project maintainers have the right and responsibility to remove, edit, or reject comments, 16 | commits, code, wiki edits, issues, and other contributions that are not aligned to this 17 | Code of Conduct. Project maintainers who do not follow the Code of Conduct may be removed 18 | from the project team. 19 | 20 | Instances of abusive, harassing, or otherwise unacceptable behavior may be reported by 21 | opening an issue or contacting one or more of the project maintainers. 22 | 23 | This Code of Conduct is adapted from the [Contributor Covenant](http://contributor-covenant.org), version 1.0.0, available at 24 | . 25 | -------------------------------------------------------------------------------- /.github/CONTRIBUTING.md: -------------------------------------------------------------------------------- 1 | # Contributing to R4DS Exercise Solutions 2 | 3 | This outlines how to propose a change to *R for Data Science Exercise Solutions*. 4 | 5 | ## Fixing typos 6 | 7 | Small typos or grammatical errors in the book may be edited directly using 8 | the GitHub web interface, so long as the changes are made in the _source_ file. 9 | 10 | ## Prerequisites 11 | 12 | Before you make a substantial pull request, you should always file an issue and 13 | make sure someone from the team agrees that it’s a problem. If you’ve found a 14 | bug, create an associated issue and illustrate the bug with a minimal 15 | [reprex](https://www.tidyverse.org/help/#reprex). 16 | 17 | ## Pull request process 18 | 19 | - We recommend that you create a Git branch for each pull request (PR). 20 | 21 | - Look at the Travis build status before and after making changes. 22 | The `README` should contain badges for any continuous integration services used 23 | by the package. 24 | 25 | - New code should follow the tidyverse [style guide](http://style.tidyverse.org). 
26 | You can use the [styler](https://CRAN.R-project.org/package=styler) package to 27 | apply these styles, but please don't restyle code that has nothing to do with 28 | your PR. 29 | 30 | ## Code of Conduct 31 | 32 | Please note that the R4DSSolutions project is released with a 33 | [Contributor Code of Conduct](CODE_OF_CONDUCT.md). By contributing to this 34 | project you agree to abide by its terms. 35 | -------------------------------------------------------------------------------- /.github/ISSUE_TEMPLATE.md: -------------------------------------------------------------------------------- 1 | Please briefly describe your problem and what output you expect. 2 | 3 | - Include a link to the page and reference the exercise number (if applicable). 4 | 5 | - If it is a typo, include the quoted text where it occurs. 6 | 7 | - If code is producing an error, include both the code and the error message, as well as the 8 | output of `sessionInfo()`. 9 | -------------------------------------------------------------------------------- /.github/workflows/bookdown.yaml: -------------------------------------------------------------------------------- 1 | # Workflow derived from https://github.com/r-lib/actions/tree/v2/examples 2 | # Need help debugging build failures? 
Start at https://github.com/r-lib/actions#where-to-find-help 3 | on: 4 | push: 5 | branches: [main, master] 6 | pull_request: 7 | branches: [main, master] 8 | workflow_dispatch: 9 | 10 | name: bookdown 11 | 12 | jobs: 13 | bookdown: 14 | runs-on: ubuntu-latest 15 | # Only restrict concurrency for non-PR jobs 16 | concurrency: 17 | group: pkgdown-${{ github.event_name != 'pull_request' || github.run_id }} 18 | env: 19 | GITHUB_PAT: ${{ secrets.GITHUB_TOKEN }} 20 | steps: 21 | - uses: actions/checkout@v4 22 | 23 | - uses: r-lib/actions/setup-pandoc@v2 24 | 25 | - uses: r-lib/actions/setup-r@v2 26 | with: 27 | use-public-rspm: true 28 | 29 | - uses: r-lib/actions/setup-renv@v2 30 | 31 | - name: Cache bookdown results 32 | uses: actions/cache@v4 33 | with: 34 | path: _bookdown_files 35 | key: bookdown-${{ hashFiles('**/*Rmd') }} 36 | restore-keys: bookdown- 37 | 38 | - name: Build site 39 | run: bookdown::render_book("index.Rmd", quiet = TRUE) 40 | shell: Rscript {0} 41 | 42 | - name: Deploy to GitHub pages 🚀 43 | if: github.event_name != 'pull_request' 44 | uses: JamesIves/github-pages-deploy-action@v4.5.0 45 | with: 46 | branch: gh-pages 47 | folder: _book 48 | -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | .Rhistory 2 | .Rproj.user 3 | .RData 4 | _bookdown_files/* 5 | 6 | # Don't ignore all _cache, _files folders because they are needed for output in the docs folder 7 | /*_cache/ 8 | /*_files/ 9 | /rmarkdown/*_cache/ 10 | /rmarkdown/*_files/ 11 | /rmarkdown/*.html 12 | 13 | figures 14 | 15 | # Ignore intermediate 16 | *.knit.md 17 | *.utf8.md 18 | /*.md 19 | !README.md 20 | !NEWS.md 21 | /libs 22 | 23 | # ignore bookdown cache 24 | /bookdown[0-9a-f]* 25 | 26 | *.aux 27 | *.out 28 | *.fls 29 | *.toc 30 | *.log 31 | *.fdb_latexmk 32 | _bookdown_files 33 | .ipynb_checkpoints 34 | *.rds 35 | node_modules 36 | Rplots.pdf 37 | *.bib.sav 38 | 
_build 39 | *.bak 40 | .drake 41 | /cache 42 | /figure 43 | -------------------------------------------------------------------------------- /.htmlhintrc: -------------------------------------------------------------------------------- 1 | { 2 | "doctype-html5": false, 3 | "tag-pair": false 4 | } 5 | -------------------------------------------------------------------------------- /.remarkrc: -------------------------------------------------------------------------------- 1 | { 2 | "plugins": [ 3 | "remark-preset-lint-recommended", 4 | "remark-preset-lint-consistent", 5 | "remark-preset-lint-markdown-style-guide", 6 | "remark-frontmatter", 7 | "remark-math", 8 | ["remark-lint-file-extension", false], 9 | ["remark-lint-maximum-line-length", 500], 10 | ["remark-lint-no-shortcut-reference-link", false], 11 | ["remark-lint-list-item-indent", "tab-size"], 12 | ["remark-lint-no-undefined-references", false], 13 | ["remark-lint-emphasis-marker", false], 14 | ["remark-lint-fenced-code-flag", false], 15 | ["remark-lint-no-duplicate-headings", false], 16 | ["remark-lint-maximum-heading-length", false], 17 | ["remark-lint-no-multiple-toplevel-headings", false], 18 | ["remark-lint-no-file-name-irregular-characters", false] 19 | ] 20 | } 21 | -------------------------------------------------------------------------------- /DESCRIPTION: -------------------------------------------------------------------------------- 1 | Package: r4ds.exercise.solutions 2 | Title: Exercise Solutions to "R for Data Science" 3 | Version: 0.1 4 | Authors@R: c(person("Jeffrey", "Arnold", , "jeffrey.arnold@gmail.com", c("aut", "cre"))) 5 | Description: Solutions to Wickham and Grolemund, "R for Data Science". 6 | This package installs the packages necessary to build the solutions book. 
7 | License: file LICENSE 8 | URL: https://github.com/jrnold/r4ds-exercise-solutions 9 | Depends: 10 | R (>= 3.1.0) 11 | Imports: 12 | babynames, 13 | datamodelr, 14 | DiagrammeR, 15 | fueleconomy, 16 | gapminder, 17 | ggbeeswarm, 18 | ggplot2 (>= 3.3.0), 19 | ggstance, 20 | hexbin, 21 | here, 22 | lvplot, 23 | Lahman, 24 | MASS, 25 | maps, 26 | microbenchmark, 27 | nasaweather, 28 | nycflights13, 29 | stringi, 30 | tidyr (>= 1.0.0), 31 | tidyverse, 32 | viridis 33 | Suggests: 34 | bookdown (>= 0.7.17), 35 | devtools, 36 | fs, 37 | gh, 38 | git2r, 39 | glue, 40 | hypothesisr, 41 | jsonlite, 42 | lintr, 43 | magrittr, 44 | optparse, 45 | rmarkdown (>= 1.10.11), 46 | rvest, 47 | spelling, 48 | styler, 49 | urltools, 50 | webshot, 51 | xml2, 52 | yaml 53 | Remotes: 54 | github::bergant/datamodelr, 55 | github::mdlincoln/hypothesisr, 56 | github::rstudio/bookdown 57 | RoxygenNote: 6.1.1 58 | -------------------------------------------------------------------------------- /Dockerfile: -------------------------------------------------------------------------------- 1 | FROM rocker/verse:latest 2 | 3 | ENV PROJ_DIR /home/rstudio/r4ds-exercise-solutions 4 | ENV PANDOC_VERSION 2.5 5 | ENV PANDOC_FILENAME pandoc-${PANDOC_VERSION}-1-amd64.deb 6 | ENV APT_KEY_DONT_WARN_ON_DANGEROUS_USAGE 1 7 | 8 | # Install pandoc and nodejs 9 | RUN apt-get update && apt-get install -y \ 10 | gnupg \ 11 | curl 12 | 13 | RUN curl -sL https://deb.nodesource.com/setup_10.x | sudo -E bash - && \ 14 | apt-get install -y nodejs 15 | 16 | RUN curl -sL -o ${PANDOC_FILENAME} https://github.com/jgm/pandoc/releases/download/${PANDOC_VERSION}/${PANDOC_FILENAME} && \ 17 | dpkg -i ${PANDOC_FILENAME} && \ 18 | rm ${PANDOC_FILENAME} && \ 19 | rm /usr/local/bin/pandoc /usr/local/bin/pandoc-citeproc 20 | 21 | # Install dependencies needed to run code and build package 22 | RUN mkdir install 23 | COPY DESCRIPTION install 24 | RUN Rscript -e "devtools::install('install', dependencies=TRUE)" 25 | RUN rm -rf 
install 26 | 27 | RUN mkdir ${PROJ_DIR} 28 | WORKDIR ${PROJ_DIR} 29 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | Attribution 4.0 International 2 | 3 | ======================================================================= 4 | 5 | Creative Commons Corporation ("Creative Commons") is not a law firm and 6 | does not provide legal services or legal advice. Distribution of 7 | Creative Commons public licenses does not create a lawyer-client or 8 | other relationship. Creative Commons makes its licenses and related 9 | information available on an "as-is" basis. Creative Commons gives no 10 | warranties regarding its licenses, any material licensed under their 11 | terms and conditions, or any related information. Creative Commons 12 | disclaims all liability for damages resulting from their use to the 13 | fullest extent possible. 14 | 15 | Using Creative Commons Public Licenses 16 | 17 | Creative Commons public licenses provide a standard set of terms and 18 | conditions that creators and other rights holders may use to share 19 | original works of authorship and other material subject to copyright 20 | and certain other rights specified in the public license below. The 21 | following considerations are for informational purposes only, are not 22 | exhaustive, and do not form part of our licenses. 23 | 24 | Considerations for licensors: Our public licenses are 25 | intended for use by those authorized to give the public 26 | permission to use material in ways otherwise restricted by 27 | copyright and certain other rights. Our licenses are 28 | irrevocable. Licensors should read and understand the terms 29 | and conditions of the license they choose before applying it. 30 | Licensors should also secure all rights necessary before 31 | applying our licenses so that the public can reuse the 32 | material as expected. 
Licensors should clearly mark any 33 | material not subject to the license. This includes other CC- 34 | licensed material, or material used under an exception or 35 | limitation to copyright. More considerations for licensors: 36 | wiki.creativecommons.org/Considerations_for_licensors 37 | 38 | Considerations for the public: By using one of our public 39 | licenses, a licensor grants the public permission to use the 40 | licensed material under specified terms and conditions. If 41 | the licensor's permission is not necessary for any reason--for 42 | example, because of any applicable exception or limitation to 43 | copyright--then that use is not regulated by the license. Our 44 | licenses grant only permissions under copyright and certain 45 | other rights that a licensor has authority to grant. Use of 46 | the licensed material may still be restricted for other 47 | reasons, including because others have copyright or other 48 | rights in the material. A licensor may make special requests, 49 | such as asking that all changes be marked or described. 50 | Although not required by our licenses, you are encouraged to 51 | respect those requests where reasonable. More considerations 52 | for the public: 53 | wiki.creativecommons.org/Considerations_for_licensees 54 | 55 | ======================================================================= 56 | 57 | Creative Commons Attribution 4.0 International Public License 58 | 59 | By exercising the Licensed Rights (defined below), You accept and agree 60 | to be bound by the terms and conditions of this Creative Commons 61 | Attribution 4.0 International Public License ("Public License"). 
To the 62 | extent this Public License may be interpreted as a contract, You are 63 | granted the Licensed Rights in consideration of Your acceptance of 64 | these terms and conditions, and the Licensor grants You such rights in 65 | consideration of benefits the Licensor receives from making the 66 | Licensed Material available under these terms and conditions. 67 | 68 | 69 | Section 1 -- Definitions. 70 | 71 | a. Adapted Material means material subject to Copyright and Similar 72 | Rights that is derived from or based upon the Licensed Material 73 | and in which the Licensed Material is translated, altered, 74 | arranged, transformed, or otherwise modified in a manner requiring 75 | permission under the Copyright and Similar Rights held by the 76 | Licensor. For purposes of this Public License, where the Licensed 77 | Material is a musical work, performance, or sound recording, 78 | Adapted Material is always produced where the Licensed Material is 79 | synched in timed relation with a moving image. 80 | 81 | b. Adapter's License means the license You apply to Your Copyright 82 | and Similar Rights in Your contributions to Adapted Material in 83 | accordance with the terms and conditions of this Public License. 84 | 85 | c. Copyright and Similar Rights means copyright and/or similar rights 86 | closely related to copyright including, without limitation, 87 | performance, broadcast, sound recording, and Sui Generis Database 88 | Rights, without regard to how the rights are labeled or 89 | categorized. For purposes of this Public License, the rights 90 | specified in Section 2(b)(1)-(2) are not Copyright and Similar 91 | Rights. 92 | 93 | d. Effective Technological Measures means those measures that, in the 94 | absence of proper authority, may not be circumvented under laws 95 | fulfilling obligations under Article 11 of the WIPO Copyright 96 | Treaty adopted on December 20, 1996, and/or similar international 97 | agreements. 98 | 99 | e. 
Exceptions and Limitations means fair use, fair dealing, and/or 100 | any other exception or limitation to Copyright and Similar Rights 101 | that applies to Your use of the Licensed Material. 102 | 103 | f. Licensed Material means the artistic or literary work, database, 104 | or other material to which the Licensor applied this Public 105 | License. 106 | 107 | g. Licensed Rights means the rights granted to You subject to the 108 | terms and conditions of this Public License, which are limited to 109 | all Copyright and Similar Rights that apply to Your use of the 110 | Licensed Material and that the Licensor has authority to license. 111 | 112 | h. Licensor means the individual(s) or entity(ies) granting rights 113 | under this Public License. 114 | 115 | i. Share means to provide material to the public by any means or 116 | process that requires permission under the Licensed Rights, such 117 | as reproduction, public display, public performance, distribution, 118 | dissemination, communication, or importation, and to make material 119 | available to the public including in ways that members of the 120 | public may access the material from a place and at a time 121 | individually chosen by them. 122 | 123 | j. Sui Generis Database Rights means rights other than copyright 124 | resulting from Directive 96/9/EC of the European Parliament and of 125 | the Council of 11 March 1996 on the legal protection of databases, 126 | as amended and/or succeeded, as well as other essentially 127 | equivalent rights anywhere in the world. 128 | 129 | k. You means the individual or entity exercising the Licensed Rights 130 | under this Public License. Your has a corresponding meaning. 131 | 132 | 133 | Section 2 -- Scope. 134 | 135 | a. License grant. 136 | 137 | 1. 
Subject to the terms and conditions of this Public License, 138 | the Licensor hereby grants You a worldwide, royalty-free, 139 | non-sublicensable, non-exclusive, irrevocable license to 140 | exercise the Licensed Rights in the Licensed Material to: 141 | 142 | a. reproduce and Share the Licensed Material, in whole or 143 | in part; and 144 | 145 | b. produce, reproduce, and Share Adapted Material. 146 | 147 | 2. Exceptions and Limitations. For the avoidance of doubt, where 148 | Exceptions and Limitations apply to Your use, this Public 149 | License does not apply, and You do not need to comply with 150 | its terms and conditions. 151 | 152 | 3. Term. The term of this Public License is specified in Section 153 | 6(a). 154 | 155 | 4. Media and formats; technical modifications allowed. The 156 | Licensor authorizes You to exercise the Licensed Rights in 157 | all media and formats whether now known or hereafter created, 158 | and to make technical modifications necessary to do so. The 159 | Licensor waives and/or agrees not to assert any right or 160 | authority to forbid You from making technical modifications 161 | necessary to exercise the Licensed Rights, including 162 | technical modifications necessary to circumvent Effective 163 | Technological Measures. For purposes of this Public License, 164 | simply making modifications authorized by this Section 2(a) 165 | (4) never produces Adapted Material. 166 | 167 | 5. Downstream recipients. 168 | 169 | a. Offer from the Licensor -- Licensed Material. Every 170 | recipient of the Licensed Material automatically 171 | receives an offer from the Licensor to exercise the 172 | Licensed Rights under the terms and conditions of this 173 | Public License. 174 | 175 | b. No downstream restrictions. 
You may not offer or impose 176 | any additional or different terms or conditions on, or 177 | apply any Effective Technological Measures to, the 178 | Licensed Material if doing so restricts exercise of the 179 | Licensed Rights by any recipient of the Licensed 180 | Material. 181 | 182 | 6. No endorsement. Nothing in this Public License constitutes or 183 | may be construed as permission to assert or imply that You 184 | are, or that Your use of the Licensed Material is, connected 185 | with, or sponsored, endorsed, or granted official status by, 186 | the Licensor or others designated to receive attribution as 187 | provided in Section 3(a)(1)(A)(i). 188 | 189 | b. Other rights. 190 | 191 | 1. Moral rights, such as the right of integrity, are not 192 | licensed under this Public License, nor are publicity, 193 | privacy, and/or other similar personality rights; however, to 194 | the extent possible, the Licensor waives and/or agrees not to 195 | assert any such rights held by the Licensor to the limited 196 | extent necessary to allow You to exercise the Licensed 197 | Rights, but not otherwise. 198 | 199 | 2. Patent and trademark rights are not licensed under this 200 | Public License. 201 | 202 | 3. To the extent possible, the Licensor waives any right to 203 | collect royalties from You for the exercise of the Licensed 204 | Rights, whether directly or through a collecting society 205 | under any voluntary or waivable statutory or compulsory 206 | licensing scheme. In all other cases the Licensor expressly 207 | reserves any right to collect such royalties. 208 | 209 | 210 | Section 3 -- License Conditions. 211 | 212 | Your exercise of the Licensed Rights is expressly made subject to the 213 | following conditions. 214 | 215 | a. Attribution. 216 | 217 | 1. If You Share the Licensed Material (including in modified 218 | form), You must: 219 | 220 | a. retain the following if it is supplied by the Licensor 221 | with the Licensed Material: 222 | 223 | i. 
identification of the creator(s) of the Licensed 224 | Material and any others designated to receive 225 | attribution, in any reasonable manner requested by 226 | the Licensor (including by pseudonym if 227 | designated); 228 | 229 | ii. a copyright notice; 230 | 231 | iii. a notice that refers to this Public License; 232 | 233 | iv. a notice that refers to the disclaimer of 234 | warranties; 235 | 236 | v. a URI or hyperlink to the Licensed Material to the 237 | extent reasonably practicable; 238 | 239 | b. indicate if You modified the Licensed Material and 240 | retain an indication of any previous modifications; and 241 | 242 | c. indicate the Licensed Material is licensed under this 243 | Public License, and include the text of, or the URI or 244 | hyperlink to, this Public License. 245 | 246 | 2. You may satisfy the conditions in Section 3(a)(1) in any 247 | reasonable manner based on the medium, means, and context in 248 | which You Share the Licensed Material. For example, it may be 249 | reasonable to satisfy the conditions by providing a URI or 250 | hyperlink to a resource that includes the required 251 | information. 252 | 253 | 3. If requested by the Licensor, You must remove any of the 254 | information required by Section 3(a)(1)(A) to the extent 255 | reasonably practicable. 256 | 257 | 4. If You Share Adapted Material You produce, the Adapter's 258 | License You apply must not prevent recipients of the Adapted 259 | Material from complying with this Public License. 260 | 261 | 262 | Section 4 -- Sui Generis Database Rights. 263 | 264 | Where the Licensed Rights include Sui Generis Database Rights that 265 | apply to Your use of the Licensed Material: 266 | 267 | a. for the avoidance of doubt, Section 2(a)(1) grants You the right 268 | to extract, reuse, reproduce, and Share all or a substantial 269 | portion of the contents of the database; 270 | 271 | b. 
if You include all or a substantial portion of the database 272 | contents in a database in which You have Sui Generis Database 273 | Rights, then the database in which You have Sui Generis Database 274 | Rights (but not its individual contents) is Adapted Material; and 275 | 276 | c. You must comply with the conditions in Section 3(a) if You Share 277 | all or a substantial portion of the contents of the database. 278 | 279 | For the avoidance of doubt, this Section 4 supplements and does not 280 | replace Your obligations under this Public License where the Licensed 281 | Rights include other Copyright and Similar Rights. 282 | 283 | 284 | Section 5 -- Disclaimer of Warranties and Limitation of Liability. 285 | 286 | a. UNLESS OTHERWISE SEPARATELY UNDERTAKEN BY THE LICENSOR, TO THE 287 | EXTENT POSSIBLE, THE LICENSOR OFFERS THE LICENSED MATERIAL AS-IS 288 | AND AS-AVAILABLE, AND MAKES NO REPRESENTATIONS OR WARRANTIES OF 289 | ANY KIND CONCERNING THE LICENSED MATERIAL, WHETHER EXPRESS, 290 | IMPLIED, STATUTORY, OR OTHER. THIS INCLUDES, WITHOUT LIMITATION, 291 | WARRANTIES OF TITLE, MERCHANTABILITY, FITNESS FOR A PARTICULAR 292 | PURPOSE, NON-INFRINGEMENT, ABSENCE OF LATENT OR OTHER DEFECTS, 293 | ACCURACY, OR THE PRESENCE OR ABSENCE OF ERRORS, WHETHER OR NOT 294 | KNOWN OR DISCOVERABLE. WHERE DISCLAIMERS OF WARRANTIES ARE NOT 295 | ALLOWED IN FULL OR IN PART, THIS DISCLAIMER MAY NOT APPLY TO YOU. 296 | 297 | b. TO THE EXTENT POSSIBLE, IN NO EVENT WILL THE LICENSOR BE LIABLE 298 | TO YOU ON ANY LEGAL THEORY (INCLUDING, WITHOUT LIMITATION, 299 | NEGLIGENCE) OR OTHERWISE FOR ANY DIRECT, SPECIAL, INDIRECT, 300 | INCIDENTAL, CONSEQUENTIAL, PUNITIVE, EXEMPLARY, OR OTHER LOSSES, 301 | COSTS, EXPENSES, OR DAMAGES ARISING OUT OF THIS PUBLIC LICENSE OR 302 | USE OF THE LICENSED MATERIAL, EVEN IF THE LICENSOR HAS BEEN 303 | ADVISED OF THE POSSIBILITY OF SUCH LOSSES, COSTS, EXPENSES, OR 304 | DAMAGES. 
WHERE A LIMITATION OF LIABILITY IS NOT ALLOWED IN FULL OR 305 | IN PART, THIS LIMITATION MAY NOT APPLY TO YOU. 306 | 307 | c. The disclaimer of warranties and limitation of liability provided 308 | above shall be interpreted in a manner that, to the extent 309 | possible, most closely approximates an absolute disclaimer and 310 | waiver of all liability. 311 | 312 | 313 | Section 6 -- Term and Termination. 314 | 315 | a. This Public License applies for the term of the Copyright and 316 | Similar Rights licensed here. However, if You fail to comply with 317 | this Public License, then Your rights under this Public License 318 | terminate automatically. 319 | 320 | b. Where Your right to use the Licensed Material has terminated under 321 | Section 6(a), it reinstates: 322 | 323 | 1. automatically as of the date the violation is cured, provided 324 | it is cured within 30 days of Your discovery of the 325 | violation; or 326 | 327 | 2. upon express reinstatement by the Licensor. 328 | 329 | For the avoidance of doubt, this Section 6(b) does not affect any 330 | right the Licensor may have to seek remedies for Your violations 331 | of this Public License. 332 | 333 | c. For the avoidance of doubt, the Licensor may also offer the 334 | Licensed Material under separate terms or conditions or stop 335 | distributing the Licensed Material at any time; however, doing so 336 | will not terminate this Public License. 337 | 338 | d. Sections 1, 5, 6, 7, and 8 survive termination of this Public 339 | License. 340 | 341 | 342 | Section 7 -- Other Terms and Conditions. 343 | 344 | a. The Licensor shall not be bound by any additional or different 345 | terms or conditions communicated by You unless expressly agreed. 346 | 347 | b. Any arrangements, understandings, or agreements regarding the 348 | Licensed Material not stated herein are separate from and 349 | independent of the terms and conditions of this Public License. 350 | 351 | 352 | Section 8 -- Interpretation. 
353 | 354 | a. For the avoidance of doubt, this Public License does not, and 355 | shall not be interpreted to, reduce, limit, restrict, or impose 356 | conditions on any use of the Licensed Material that could lawfully 357 | be made without permission under this Public License. 358 | 359 | b. To the extent possible, if any provision of this Public License is 360 | deemed unenforceable, it shall be automatically reformed to the 361 | minimum extent necessary to make it enforceable. If the provision 362 | cannot be reformed, it shall be severed from this Public License 363 | without affecting the enforceability of the remaining terms and 364 | conditions. 365 | 366 | c. No term or condition of this Public License will be waived and no 367 | failure to comply consented to unless expressly agreed to by the 368 | Licensor. 369 | 370 | d. Nothing in this Public License constitutes or may be interpreted 371 | as a limitation upon, or waiver of, any privileges and immunities 372 | that apply to the Licensor or You, including from the legal 373 | processes of any jurisdiction or authority. 374 | 375 | 376 | ======================================================================= 377 | 378 | Creative Commons is not a party to its public 379 | licenses. Notwithstanding, Creative Commons may elect to apply one of 380 | its public licenses to material it publishes and in those instances 381 | will be considered the “Licensor.” The text of the Creative Commons 382 | public licenses is dedicated to the public domain under the CC0 Public 383 | Domain Dedication. 
Except for the limited purpose of indicating that 384 | material is shared under a Creative Commons public license or as 385 | otherwise permitted by the Creative Commons policies published at 386 | creativecommons.org/policies, Creative Commons does not authorize the 387 | use of the trademark "Creative Commons" or any other trademark or logo 388 | of Creative Commons without its prior written consent including, 389 | without limitation, in connection with any unauthorized modifications 390 | to any of its public licenses or any other arrangements, 391 | understandings, or agreements concerning use of licensed material. For 392 | the avoidance of doubt, this paragraph does not form part of the 393 | public licenses. 394 | 395 | Creative Commons may be contacted at creativecommons.org. 396 | -------------------------------------------------------------------------------- /NAMESPACE: -------------------------------------------------------------------------------- 1 | # Generated by roxygen2: do not edit by hand 2 | 3 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | [![Lifecycle: superseded](https://img.shields.io/badge/lifecycle-superseded-blue.svg)](https://lifecycle.r-lib.org/articles/stages.html#superseded) 2 | 3 | # Exercise Solutions to R for Data Science 4 | 5 | These are solutions to the **1st edition** of R for Data Science. The solutions to the 2nd edition of [R for Data Science](https://r4ds.hadley.nz/) are available at [R for Data Science (2e) - Solutions to Exercises](https://mine-cetinkaya-rundel.github.io/r4ds-solutions/). 6 | 7 | This repository contains the code and text behind the [Solutions for R for Data Science](https://jrnold.github.io/r4ds-exercise-solutions/), which, as its name suggests, has solutions to the exercises in [R for Data Science](https://r4ds.had.co.nz/) by Garrett Grolemund and Hadley Wickham. 
8 | 9 | The R packages used in this book can be installed via 10 | ```r 11 | devtools::install_github("jrnold/r4ds-exercise-solutions") 12 | ``` 13 | 14 | ## Contributing 15 | 16 | Work on this repo has effectively stopped since the 2nd edition of R for Data Science has been published. Please direct your contributions to [R for Data Science (2e) - Solutions to Exercises](https://mine-cetinkaya-rundel.github.io/r4ds-solutions/). 17 | 18 | ## Build 19 | 20 | The site is built using the [bookdown](https://bookdown.org/yihui/bookdown/) package and pandoc. 21 | -------------------------------------------------------------------------------- /WORDLIST: -------------------------------------------------------------------------------- 1 | aaa 2 | abaca 3 | abba 4 | abc 5 | abcsgasgddsadgsdgcba 6 | abline 7 | accba 8 | ae 9 | Ames 10 | ATL 11 | BDL 12 | benchmarking 13 | Berkson's 14 | bimodal 15 | bimodality 16 | binwidth 17 | BNA 18 | bookdown 19 | BOS 20 | BQN 21 | bts 22 | burqa 23 | cancelled 24 | carajoos 25 | chardet 26 | Chua 27 | Chuan 28 | cinq 29 | CLDR 30 | ClevelandMcGillMcGill 31 | CN 32 | colonophon 33 | colour 34 | Computerphile 35 | Consolas 36 | coord 37 | counterintuitive 38 | covariation 39 | Covariation 40 | csv 41 | cyclicality 42 | datamodelr 43 | datasets 44 | datetimes 45 | De 46 | Deja 47 | denormal 48 | derecho 49 | derechos 50 | dest 51 | DiagrammeR 52 | dialr 53 | disaggregate 54 | DL 55 | DoaneSeward 56 | DOCX 57 | dongzhuoer 58 | DOTLESS 59 | dplyr 60 | ds 61 | DS 62 | DSSolutions 63 | duplicative 64 | Durations 65 | dx 66 | dy 67 | eed 68 | EPUB 69 | EUC 70 | EV 71 | EWR 72 | ExpressJet 73 | Fabien 74 | facetted 75 | facetting 76 | FiraCode 77 | FiveThirtyEight 78 | fizzbuzz 79 | FizzBuzz 80 | forcats 81 | frac 82 | gapminder 83 | Gapminder 84 | GBK 85 | GeeksforGeeks 86 | geoms 87 | ggbeeswarm 88 | ggplot 89 | ggrepel 90 | ggstance 91 | github 92 | Gliffy 93 | glyphs 94 | Graffle 95 | Graphviz 96 | Grolemund 97 | GSP 98 | gss 99 | gvwilson 
100 | hadley 101 | Hadley 102 | hc 103 | Hecker 104 | Hecker 105 | HeerAgrawala 106 | Heike 107 | heterogeneous 108 | heteroskedasticity 109 | Hintze 110 | HintzeNelson 111 | HNL 112 | Hofman 113 | Hofmann 114 | HofmannWickhamKafadar 115 | honours 116 | HOU 117 | hpdiamonds 118 | htm 119 | http 120 | hypothes 121 | ı 122 | IAH 123 | iconv 124 | IEC 125 | infty 126 | Inkscape 127 | interpretable 128 | io 129 | ipsum 130 | ise 131 | ized 132 | JamesCuster 133 | JIS 134 | jitter 135 | JL 136 | jrnold 137 | Kafadar 138 | Karlton 139 | kididdles 140 | KleinGeard 141 | Krywinski 142 | Krywinsky 143 | kunststube 144 | Lahman 145 | lceil 146 | Leste 147 | LGA 148 | lifecycle 149 | lintr 150 | lm 151 | loess 152 | lorem 153 | lubridate 154 | lvplot 155 | MarkDown 156 | mathrm 157 | mathsf 158 | md 159 | Menlo 160 | merely 161 | microbenchmark 162 | modelr 163 | MQ 164 | mvhone 165 | nd 166 | nicercode 167 | normals 168 | NoteBook 169 | NSE 170 | nycflights 171 | nz 172 | nzxwang 173 | O'Reilly 174 | oe 175 | OmniGraffle 176 | openflights 177 | ou 178 | pandoc 179 | Ph 180 | PhD 181 | PHI 182 | PHL 183 | pointrange 184 | postfixed 185 | PowerPoint 186 | pre 187 | preallocate 188 | preprocess 189 | preprocessed 190 | programmatically 191 | PSE 192 | purrr 193 | purrr's 194 | quartile 195 | quartiles 196 | Quora 197 | radix 198 | rceil 199 | readr 200 | regexcrossword 201 | reimplementing 202 | representable 203 | rescale 204 | rmarkdown 205 | rmarkdown 206 | Rmd 207 | rstudio 208 | RStudio 209 | rstudiotips 210 | Sanglard 211 | sd 212 | SelectFields 213 | Shalloway 214 | SJU 215 | StackOverflow 216 | stringi 217 | stringr 218 | STT 219 | suboptimal 220 | summarise 221 | th 222 | tibble 223 | tibbles 224 | Tibbles 225 | tidyr 226 | tidyverse 227 | Timezones 228 | transtats 229 | un 230 | undercarat 231 | underpredicting 232 | unicode 233 | unintuitively 234 | unpivots 235 | vectorized 236 | vectorizes 237 | viridis 238 | visualisation 239 | visualising 240 | Vu 241 | VVS 242 | 
WD 243 | Wickham 244 | wikipedia 245 | WordNet 246 | Wut 247 | www 248 | yaml 249 | YAML 250 | yyyy 251 | YYYY 252 | -------------------------------------------------------------------------------- /_bookdown.yml: -------------------------------------------------------------------------------- 1 | output_dir: "_build" 2 | new_session: yes 3 | 4 | rmd_files: 5 | - "index.Rmd" 6 | - "intro.Rmd" 7 | 8 | - "explore.Rmd" 9 | - "visualize.Rmd" 10 | - "workflow-basics.Rmd" 11 | - "transform.Rmd" 12 | - "workflow-scripts.Rmd" 13 | - "EDA.Rmd" 14 | - "workflow-projects.Rmd" 15 | 16 | - "wrangle.Rmd" 17 | - "tibble.Rmd" 18 | - "import.Rmd" 19 | - "tidy.Rmd" 20 | - "relational-data.Rmd" 21 | - "strings.Rmd" 22 | - "factors.Rmd" 23 | - "datetimes.Rmd" 24 | 25 | - "program.Rmd" 26 | - "pipes.Rmd" 27 | - "functions.Rmd" 28 | - "vectors.Rmd" 29 | - "iteration.Rmd" 30 | 31 | - "model.Rmd" 32 | - "model-basics.Rmd" 33 | - "model-building.Rmd" 34 | - "many-models.Rmd" 35 | 36 | - "communicate.Rmd" 37 | - "rmarkdown.Rmd" 38 | - "graphics-for-communication.Rmd" 39 | - "rmarkdown-formats.Rmd" 40 | - "rmarkdown-workflow.Rmd" 41 | 42 | - "appendixes.Rmd" 43 | 44 | before_chapter_script: "_common.R" 45 | book_filename: "r4ds-solutions" 46 | delete_merged_file: true 47 | edit: "https://github.com/jrnold/r4ds-exercise-solutions/edit/master/%s" 48 | history: "https://github.com/jrnold/r4ds-exercise-solutions/commits/master/%s" 49 | -------------------------------------------------------------------------------- /_common.R: -------------------------------------------------------------------------------- 1 | set.seed(1014) 2 | options(digits = 3) 3 | 4 | knitr::opts_chunk$set( 5 | comment = "#>", 6 | collapse = TRUE, 7 | cache = TRUE, 8 | autodep = TRUE, 9 | # need to save cache 10 | cache.extra = knitr::rand_seed, 11 | out.width = "70%", 12 | fig.align = "center", 13 | fig.width = 6, 14 | fig.asp = 0.618, # 1 / phi 15 | fig.show = "hold", 16 | # styler 17 | tidy = 'styler' 18 | ) 19 | 20 | 
options(dplyr.print_min = 6, dplyr.print_max = 6) 21 | 22 | is_html <- knitr::opts_knit$get("rmarkdown.pandoc.to") == "html" 23 | 24 | # Info and useful links 25 | SOURCE_URL <- stringr::str_c("https:/", "github.com", "jrnold", 26 | "r4ds-exercise-solutions", 27 | sep = "/" 28 | ) 29 | PUB_URL <- stringr::str_c("https:/", "jrnold.github.io", 30 | "r4ds-exercise-solutions", 31 | sep = "/" 32 | ) 33 | 34 | R4DS_URL <- "https://r4ds.had.co.nz" 35 | 36 | r4ds_url <- function(...) { 37 | stringr::str_c(R4DS_URL, ..., sep = "/") 38 | } 39 | 40 | comma_int <- function(x) { 41 | prettyNum(x, big.interval = 3, big.mark = ",") 42 | } 43 | 44 | no_exercises <- function() { 45 | tags <- htmltools::tags 46 | tags$div( 47 | class = 'alert alert-warning hints-alert', 48 | tags$div( 49 | class = "hints-icon", 50 | tags$i( 51 | class = "fa fa-exclamation-circle" 52 | ) 53 | ), 54 | tags$div( 55 | class = "hints-container", 56 | "No exercises" 57 | ) 58 | ) 59 | } 60 | -------------------------------------------------------------------------------- /_config.yml: -------------------------------------------------------------------------------- 1 | github_repo: "jrnold/r4ds-exercise-solutions" 2 | deploy_url: "https://jrnold.github.io/r4ds-exercise-solutions/" 3 | r4ds: 4 | github_repo: "hadley/r4ds" 5 | url: "https://r4ds.had.co.nz" 6 | chapters: 7 | - {"rmd": "intro.Rmd", "html": "introduction.html"} 8 | - {"rmd": "explore-intro.Rmd", "html": "explore-intro.html"} 9 | - {"rmd": "visualize.Rmd", "html": "data-visualisation.html"} 10 | - {"rmd": "workflow-basics.Rmd", "html": "workflow-basics.html"} 11 | - {"rmd": "transform.Rmd", "html": "transform.html"} 12 | - {"rmd": "workflow-scripts.Rmd", "html": "workflow-scripts.html"} 13 | - {"rmd": "EDA.Rmd", "html": "exploratory-data-analysis.html"} 14 | - {"rmd": "workflow-projects.Rmd", "html": "workflow-projects.html"} 15 | - {"rmd": "wrangle-intro.Rmd", "html": "wrangle-intro.html"} 16 | - {"rmd": "tibble.Rmd", "html": "tibbles.html"} 17 
| - {"rmd": "import.Rmd", "html": "data-import.html"} 18 | - {"rmd": "tidy.Rmd", "html": "tidy-data.html"} 19 | - {"rmd": "relational-data.Rmd", "html": "relational-data.html"} 20 | - {"rmd": "strings.Rmd", "html": "strings.html"} 21 | - {"rmd": "factors.Rmd", "html": "factors.html"} 22 | - {"rmd": "datetimes.Rmd", "html": "dates-and-times.html"} 23 | - {"rmd": "program-intro.Rmd", "html": "program-intro.html"} 24 | - {"rmd": "pipes.Rmd", "html": "pipes.html"} 25 | - {"rmd": "functions.Rmd", "html": "functions.html"} 26 | - {"rmd": "vectors.Rmd", "html": "vectors.html"} 27 | - {"rmd": "iteration.Rmd", "html": "iteration.html"} 28 | - {"rmd": "model-intro.Rmd", "html": "model-intro.html"} 29 | - {"rmd": "model-basics.Rmd", "html": "model-basics.html"} 30 | - {"rmd": "model-building.Rmd", "html": "model-building.html"} 31 | - {"rmd": "many-models.Rmd", "html": "many-models.html"} 32 | - {"rmd": "communicate-intro.Rmd", "html": "communicate-intro.html"} 33 | - {"rmd": "rmarkdown.Rmd", "html": "r-markdown.html"} 34 | - {"rmd": "graphics-for-communication.Rmd", "html": "graphics-for-communication.html"} 35 | - {"rmd": "rmarkdown-formats.Rmd", "html": "r-markdown-formats.html"} 36 | - {"rmd": "rmarkdown-workflow.Rmd", "html": "r-markdown-workflow.html"} 37 | exercise_sections: 38 | # Mapping from subsections to the Exercise subsubsection in R4DS 39 | 3.1: 1 40 | 3.2: 4 41 | 3.3: 1 42 | 3.5: 1 43 | 3.6: 1 44 | 3.8: 1 45 | 3.9: 1 46 | 5.2: 4 47 | 5.3: 1 48 | 5.4: 1 49 | 5.5: 2 50 | 5.6: 7 51 | 5.7: 1 52 | 7.3: 4 53 | 7.4: 1 54 | 7.5.1: 1 55 | 7.5.2: 1 56 | 7.5.3: 1 57 | 11.2: 2 58 | 11.3: 5 59 | 11.4: 1 60 | 12.2: 1 61 | 12.3: 3 62 | 12.4: 3 63 | 12.5: 1 64 | 12.6: 1 65 | 13.2: 1 66 | 13.3: 1 67 | 13.4: 6 68 | 13.5: 1 69 | 14.2: 5 70 | 14.3.1: 1 71 | 14.3.2: 1 72 | 14.3.3: 1 73 | 14.3.4: 1 74 | 14.3.5: 1 75 | 14.4.3: 1 76 | 14.4.4: 1 77 | 14.4.5: 1 78 | 14.5: 1 79 | 14.7: 1 80 | 15.3: 1 81 | 15.4: 1 82 | 15.5: 1 83 | 16.2: 4 84 | 16.3: 4 85 | 16.4: 5 86 | 19.2: 1 87 | 
19.3: 1 88 | 19.4: 4 89 | 19.5: 5 90 | 20.3: 5 91 | 20.4: 6 92 | 20.5: 4 93 | 20.7: 4 94 | 21.2: 1 95 | 21.3: 5 96 | 21.4: 1 97 | 21.5: 3 98 | 21.9: 3 99 | 23.2: 1 100 | 23.3: 3 101 | 23.4: 5 102 | 24.2: 3 103 | 24.3: 5 104 | 25.2: 5 105 | 25.4: 5 106 | 25.5: 3 107 | 27.2: 1 108 | 27.3: 1 109 | 27.4: 7 110 | 28.2: 1 111 | 28.3: 1 112 | 28.4: 4 113 | -------------------------------------------------------------------------------- /_output.yaml: -------------------------------------------------------------------------------- 1 | bookdown::gitbook: 2 | config: 3 | toc: 4 | collapse: section 5 | before: | 6 |
  • R for Data Science:
    Exercise Solutions
  • 7 | 8 | after: | 9 |
  • Proudly published with bookdown
  • 10 | 11 | edit: 12 | link: https://github.com/jrnold/r4ds-exercise-solutions/edit/master/%s 13 | text: "Edit" 14 | sharing: 15 | facebook: no 16 | twitter: yes 17 | google: no 18 | linkedin: yes 19 | weibo: no 20 | instapaper: no 21 | vk: no 22 | css: 23 | - "includes/r4ds.css" 24 | - "includes/r4ds-solutions.css" 25 | - "includes/ort.css" 26 | md_extensions: "+native_divs+native_spans+escaped_line_breaks+smart" 27 | download: no 28 | includes: 29 | in_header: 30 | - "includes/hypothesis.html" 31 | -------------------------------------------------------------------------------- /appendixes.Rmd: -------------------------------------------------------------------------------- 1 | # (PART) Appendixes {-} 2 | 3 | # References {-} 4 | -------------------------------------------------------------------------------- /bin/add-r4ds-links.R: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env Rscript 2 | #' For all sections anchor links and links to the relevant R4DS section 3 | suppressPackageStartupMessages({ 4 | library("rvest") 5 | library("fs") 6 | library("rlang") 7 | library("jsonlite") 8 | library("purrr") 9 | library("glue") 10 | }) 11 | 12 | # Add link to r4ds after each section 13 | add_r4ds_link_section <- function(x, r4ds_url) { 14 | # first child should be the heading 15 | header <- html_node(x, "h1,h2,h3,h4") 16 | href <- httr::modify_url(r4ds_url, frag = xml_attr(x, "id")) 17 | a <- xml_add_child(header, "a", href = href, 18 | class = "r4ds-section-link section-link", 19 | `aria-hidden` = "true") 20 | icon <- xml_add_child(a, "i", 21 | class = "fa fa-external-link", 22 | `aria-hidden` = "true") 23 | } 24 | 25 | add_r4ds_links <- function(doc, path, r4ds_url) { 26 | # r4ds-section used to indicate R4DS 27 | r4ds_sections <- html_nodes(doc, ".r4ds-section") 28 | r4ds_path_url <- httr::modify_url(r4ds_url, 29 | path = as.character(path)) 30 | walk(r4ds_sections, add_r4ds_link_section, r4ds_url = r4ds_path_url) 
31 | } 32 | 33 | # Add an anchor link at the start of each section heading 34 | add_anchor_section <- function(x) { 35 | # first child should be the heading 36 | header <- html_node(x, "h1,h2,h3,h4") 37 | href <- paste0("#", xml_attr(x, "id")) 38 | a <- xml_add_child(header, "a", href = href, class = "anchor section-link", 39 | `aria-hidden` = "true", .where = 0) 40 | icon <- xml_add_child(a, "i", 41 | class = "fa fa-link", 42 | `aria-hidden` = "true") 43 | } 44 | 45 | # Add anchor links to every section in the document 46 | add_anchors <- function(doc) { 47 | walk(html_nodes(doc, "div.section"), add_anchor_section) 48 | } 49 | 50 | handle_page <- function(path, r4ds_url, output_dir) { 51 | # read HTML for a file 52 | filename <- fs::path(output_dir, path) 53 | doc <- read_html(filename) 54 | add_r4ds_links(doc, path = path, r4ds_url = r4ds_url) 55 | add_anchors(doc) 56 | cat(glue("Adding links to {filename}"), "\n\n") 57 | write_html(doc, filename, options = c("as_html", "format")) 58 | } 59 | 60 | main <- function() { 61 | # read table of contents to get list of HTML files 62 | toc_filename <- "toc.json" 63 | output_dir <- bookdown:::load_config()[["output_dir"]] 64 | toc <- read_json(fs::path(output_dir, toc_filename)) 65 | 66 | # Get URL of r4ds 67 | config <- yaml::read_yaml(fs::path("_config.yml")) 68 | r4ds_url <- config$r4ds$url 69 | walk(unlist(map(toc, "path")), handle_page, r4ds_url, output_dir) 70 | } 71 | 72 | main() 73 | -------------------------------------------------------------------------------- /bin/build.sh: -------------------------------------------------------------------------------- 1 | #!/bin/sh 2 | # build website 3 | # exit when any command fails 4 | set -e 5 | set +x 6 | 7 | Rscript bin/render.R --force --quiet 8 | Rscript bin/create-sitemap.R 9 | Rscript bin/create-toc.R 10 | Rscript bin/add-r4ds-links.R 11 | -------------------------------------------------------------------------------- /bin/check-r4ds-sections.R:
-------------------------------------------------------------------------------- 1 | #!/usr/bin/env Rscript 2 | #' Check that sections in Solutions match those in R4DS 3 | #' 4 | #' For all sections in the solutions, check that 5 | #' 6 | #' - It appears in R4DS 7 | #' - It has the same title as in R4DS 8 | #' - It has the same heading level as R4DS 9 | #' 10 | #' It does not check whether all sections in R4DS appear in the Solutions 11 | #' 12 | suppressPackageStartupMessages({ 13 | library("xml2") 14 | library("rvest") 15 | library("purrr") 16 | library("tibble") 17 | library("dplyr") 18 | library("stringr") 19 | library("glue") 20 | }) 21 | 22 | handle_section <- function(x) { 23 | header <- html_node(x, "h1,h2,h3,h4,h5") 24 | tibble(id = as.character(html_attr(x, "id")), 25 | title = str_replace(str_trim(html_text(header)), 26 | "^\\d+(\\.\\d+)*\\s+", ""), 27 | tag = xml_name(header)) 28 | } 29 | 30 | handle_html_file <- function(path, output_dir, r4ds_url) { 31 | doc <- read_html(fs::path(output_dir, path)) 32 | solution_sections <- map_dfr(html_nodes(doc, "div.section.r4ds-section"), 33 | handle_section) 34 | if (nrow(solution_sections)) { 35 | solution_sections$path <- path 36 | 37 | r4ds_url <- httr::modify_url(r4ds_url, path = path) 38 | r4ds_doc <- read_html(r4ds_url) 39 | r4ds_sections <- map_dfr(html_nodes(r4ds_doc, "div.section"), 40 | handle_section) 41 | 42 | left_join(solution_sections, r4ds_sections, 43 | by = "id", suffix = c("_solutions", "_r4ds")) 44 | } 45 | } 46 | 47 | main <- function() { 48 | # Find any sections in solutions not in R4DS 49 | output_dir <- bookdown:::load_config()$output_dir 50 | filenames <- map(jsonlite::read_json(fs::path(output_dir, "toc.json")), 51 | "path") %>% 52 | map_chr(1) %>% 53 | keep(~ !.x %in% c("references.html")) 54 | 55 | config <- yaml::read_yaml("_config.yml") 56 | 57 | sections <- map_dfr(filenames, handle_html_file, output_dir = output_dir, 58 | r4ds_url = config$r4ds$url) 59 | status <- 0 60 | 61 |
missing_ids <- filter(sections, is.na(title_r4ds)) %>% 62 | select(path, id) 63 | if (nrow(missing_ids) > 0) { 64 | cat("These sections have IDs not in R4DS", file = stderr()) 65 | cat(glue_data(missing_ids, "{path}#{id}"), file = stderr()) 66 | status <- 1 67 | } 68 | 69 | different_titles <- filter(sections, title_solutions != title_r4ds) %>% 70 | select(path, id, title_solutions, title_r4ds) 71 | if (nrow(different_titles) > 0) { 72 | cat("These sections have different titles than R4DS:") 73 | cat(glue_data(different_titles, "{path}#{id}: '{title_solutions}' (solutions) '{title_r4ds}' (R4DS)"), 74 | file = stderr()) 75 | status <- 1 76 | } 77 | 78 | different_headings <- filter(sections, tag_solutions != tag_r4ds) %>% 79 | select(path, id, tag_solutions, tag_r4ds) 80 | if (nrow(different_headings) > 0) { 81 | cat("These sections have different heading levels than R4DS:") 82 | cat(glue_data(different_headings, "{path}#{id}: '{tag_solutions}' (solutions) '{tag_r4ds}' (R4DS)"), 83 | file = stderr()) 84 | status <- 1 85 | } 86 | 87 | quit(save = "no", status = status) 88 | } 89 | 90 | main() 91 | -------------------------------------------------------------------------------- /bin/check-spelling.R: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env Rscript 2 | # Spell check 3 | suppressPackageStartupMessages({ 4 | library("spelling") 5 | library("magrittr") 6 | }) 7 | wordlist_file <- "WORDLIST" 8 | 9 | wordlist <- stringr::str_trim(readLines(wordlist_file)) 10 | 11 | files <- c( 12 | list.files(here::here("."), pattern = "\\.(Rnw|Rmd)$", full.names = TRUE), 13 | list.files(here::here("rmarkdown"), 14 | pattern = "\\.(Rmd)$", full.names = TRUE 15 | ), 16 | here::here("README.md") 17 | ) %>% 18 | normalizePath() %>% 19 | unique() 20 | 21 | misspelled_words <- spell_check_files(sort(files), ignore = wordlist) 22 | any_mispelled <- as.logical(nrow(misspelled_words)) 23 | 24 | if (any_mispelled) { 25 | sink(file = stderr())
26 | print(misspelled_words) 27 | sink() 28 | quit(save = "no", status = 1) 29 | } 30 | -------------------------------------------------------------------------------- /bin/create-sitemap.R: -------------------------------------------------------------------------------- 1 | suppressPackageStartupMessages({ 2 | library("xml2") 3 | library("optparse") 4 | library("glue") 5 | }) 6 | 7 | sitemap_url_info <- function(path, base_url, priority = 0.5, 8 | changefreq = "daily") { 9 | lastmod <- format(file.info(path)$mtime, format = "%Y-%m-%d", 10 | tz = "UTC") 11 | loc <- paste0(stringr::str_replace(base_url, "/$", ""), "/", basename(path)) 12 | list(lastmod = lastmod, loc = loc, priority = priority, 13 | changefreq = changefreq) 14 | } 15 | 16 | create_sitemap <- function(output_dir, base_url, 17 | pattern = "^.*\\.html$", 18 | excludes = character(), 19 | changefreq = "daily", priority = 0.5) { 20 | SITEMAP_XMLNS <- "http://www.sitemaps.org/schemas/sitemap/0.9" 21 | filenames <- dir(output_dir, pattern = pattern) 22 | filenames <- base::setdiff(filenames, excludes) 23 | filenames <- file.path(output_dir, filenames) 24 | sitemap <- xml_new_root("urlset", xmlns = SITEMAP_XMLNS) 25 | for (file in filenames) { 26 | info <- sitemap_url_info(file, base_url = base_url, priority = priority, changefreq = changefreq) 27 | url <- xml_add_child(sitemap, "url") 28 | xml_add_child(url, "loc", info$loc) 29 | xml_add_child(url, "lastmod", info$lastmod) 30 | xml_add_child(url, "priority", info$priority) 31 | xml_add_child(url, "changefreq", info$changefreq) 32 | } 33 | sitemap_loc <- file.path(output_dir, "sitemap.xml") 34 | cat(glue("Writing to {sitemap_loc}\n")) 35 | write_xml(sitemap, sitemap_loc) 36 | } 37 | 38 | main <- function() { 39 | output_dir <- bookdown:::load_config()$output_dir 40 | config <- yaml::read_yaml(here::here("_config.yml")) 41 | create_sitemap(output_dir, config$deploy_url) 42 | } 43 | 44 | main() 45 | -------------------------------------------------------------------------------- /bin/create-toc.R:
-------------------------------------------------------------------------------- 1 | #!/usr/bin/env Rscript 2 | #' Write a table of contents 3 | #' 4 | #' This writes out a table of contents for R4DS Solutions to a JSON file. 5 | #' This is useful for quickly navigating or looking up exercises. 6 | #' 7 | #' 8 | suppressPackageStartupMessages({ 9 | library("rvest") 10 | library("purrr") 11 | library("stringr") 12 | library("jsonlite") 13 | library("tibble") 14 | library("fs") 15 | library("dplyr") 16 | library("glue") 17 | library("optparse") 18 | }) 19 | 20 | get_section_title <- function(x) { 21 | is_section_number <- function(x) { 22 | (xml_type(x) == "element") && 23 | xml_name(x) == "span" && 24 | html_has_class(x, "header-section-number") 25 | } 26 | header <- html_children(x)[[1]] 27 | title <- discard(xml_contents(header), is_section_number) %>% 28 | map_chr(html_text) %>% 29 | str_c(collapse = "") %>% 30 | str_trim() 31 | number <- html_text(html_node(header, "span.header-section-number")) 32 | list(title = title, number = number) 33 | } 34 | 35 | #' Return the classes of a HTML element 36 | html_classes <- function(x) { 37 | klass <- html_attr(x, "class") 38 | if (!length(klass) || is.na(klass)) { 39 | character() 40 | } else { 41 | # Space separated - includes space, tab, and newlines 42 | # https://www.w3.org/TR/2011/WD-html5-20110525/elements.html#classes 43 | unique(str_split(str_trim(klass), "\\s")[[1]]) 44 | } 45 | } 46 | 47 | #' Check whether a HTML element has a class 48 | html_has_class <- function(x, kls) kls %in% html_classes(x) 49 | 50 | # Get section level number 51 | get_level <- function(x) { 52 | lvl <- str_extract(str_subset(html_classes(x), "^level\\d+$"), 53 | "\\d+") 54 | as.integer(lvl) 55 | } 56 | 57 | #' parse each section 58 | process_section <- function(x, path = "/") { 59 | lvl <- get_level(x) 60 | current <- rlang::list2(id = html_attr(x, "id"), 61 | !!!get_section_title(x), 62 | path = path) 63 | # find next level of nodes 64 | 
sections <- map(html_nodes(x, glue("div.section.level{lvl + 1}")), 65 | process_section, path = path) 66 | names(sections) <- map_chr(sections, "id") 67 | current[["sections"]] <- sections 68 | current 69 | } 70 | 71 | process_page <- function(path, output_dir) { 72 | doc <- read_html(fs::path(output_dir, path)) 73 | # find top level heading 74 | process_section(html_node(doc, "div.section.level1"), path = path) 75 | } 76 | 77 | process_chapter <- function(x, output_dir) { 78 | a <- html_node(x, "a") 79 | data_level <- html_attr(x, "data-level") 80 | out <- list(chapter = if_else(data_level == "", NA_integer_, 81 | as.integer(data_level)), 82 | path = html_attr(x, "data-path"), 83 | href = html_attr(a, "href"), 84 | name = str_replace(html_text(a), "^[\\d.]+ ", "")) 85 | out$sections <- process_page(out[["path"]], output_dir) 86 | out 87 | } 88 | 89 | #' create and write the table of contents 90 | write_toc <- function(output_dir, path) { 91 | index <- read_html(file.path(output_dir, "index.html")) 92 | book_summary <- html_nodes(index, "div.book-summary>nav>ul") 93 | chapters <- map(html_nodes(book_summary, xpath = "./li[@class='chapter']"), 94 | process_chapter, output_dir) 95 | outfile <- fs::path(output_dir, path) 96 | cat(glue("Writing to {outfile}\n")) 97 | write_json(chapters, outfile) 98 | } 99 | 100 | main <- function() { 101 | description <- str_c( 102 | "Create a JSON table of contents for R4DS containing the sections ", 103 | "for all HTML files" 104 | ) 105 | parser <- OptionParser(description = description) 106 | opts <- parse_args(parser, positional_arguments = TRUE) 107 | path <- if (length(opts$args) < 1) { 108 | "toc.json" 109 | } else { 110 | opts$args[[1]] 111 | } 112 | output_dir <- bookdown:::load_config()$output_dir 113 | write_toc(output_dir, path) 114 | } 115 | 116 | main() 117 | -------------------------------------------------------------------------------- /bin/deploy.sh: 
-------------------------------------------------------------------------------- 1 | #!/usr/bin/env bash 2 | # From https://github.com/X1011/git-directory-deploy 3 | # License: BSD 3-Clause License 4 | # Copyright: Daniel Smith 5 | set -o errexit #abort if any command fails 6 | me=$(basename "$0") 7 | 8 | help_message="\ 9 | Usage: $me [-c FILE] [<options>] 10 | Deploy generated files to a git branch. 11 | 12 | Options: 13 | 14 | -h, --help Show this help information. 15 | -v, --verbose Increase verbosity. Useful for debugging. 16 | -e, --allow-empty Allow deployment of an empty directory. 17 | -m, --message MESSAGE Specify the message used when committing on the 18 | deploy branch. 19 | -n, --no-hash Don't append the source commit's hash to the deploy 20 | commit's message. 21 | -c, --config-file PATH Override default & environment variables' values 22 | with those set in the file at 'PATH'. Must be the 23 | first option specified. 24 | 25 | Variables: 26 | 27 | GIT_DEPLOY_DIR Folder path containing the files to deploy. 28 | GIT_DEPLOY_BRANCH Commit deployable files to this branch. 29 | GIT_DEPLOY_REPO Push the deploy branch to this repository. 30 | 31 | These variables have default values defined in the script. The defaults can be 32 | overridden by environment variables. Any environment variables are overridden 33 | by values set in a '.env' file (if it exists), and in turn by those set in a 34 | file specified by the '--config-file' option." 35 | 36 | parse_args() { 37 | # Set args from a local environment file. 38 | if [ -e ".env" ]; then 39 | source .env 40 | fi 41 | 42 | # Set args from file specified on the command-line. 43 | if [[ $1 = "-c" || $1 = "--config-file" ]]; then 44 | source "$2" 45 | shift 2 46 | fi 47 | 48 | # Parse arg flags 49 | # If something is exposed as an environment variable, set/overwrite it 50 | # here. Otherwise, set/overwrite the internal variable instead.
51 | while : ; do 52 | if [[ $1 = "-h" || $1 = "--help" ]]; then 53 | echo "$help_message" 54 | return 0 55 | elif [[ $1 = "-v" || $1 = "--verbose" ]]; then 56 | verbose=true 57 | shift 58 | elif [[ $1 = "-e" || $1 = "--allow-empty" ]]; then 59 | allow_empty=true 60 | shift 61 | elif [[ ( $1 = "-m" || $1 = "--message" ) && -n $2 ]]; then 62 | commit_message=$2 63 | shift 2 64 | elif [[ $1 = "-n" || $1 = "--no-hash" ]]; then 65 | GIT_DEPLOY_APPEND_HASH=false 66 | shift 67 | else 68 | break 69 | fi 70 | done 71 | 72 | # Set internal option vars from the environment and arg flags. All internal 73 | # vars should be declared here, with sane defaults if applicable. 74 | 75 | # Source directory & target branch. 76 | deploy_directory=${GIT_DEPLOY_DIR:-dist} 77 | deploy_branch=${GIT_DEPLOY_BRANCH:-gh-pages} 78 | 79 | #if no user identity is already set in the current git environment, use this: 80 | default_username=${GIT_DEPLOY_USERNAME:-deploy.sh} 81 | default_email=${GIT_DEPLOY_EMAIL:-} 82 | 83 | #repository to deploy to. must be readable and writable. 84 | repo=${GIT_DEPLOY_REPO:-origin} 85 | 86 | #append commit hash to the end of message by default 87 | append_hash=${GIT_DEPLOY_APPEND_HASH:-true} 88 | } 89 | 90 | main() { 91 | parse_args "$@" 92 | 93 | enable_expanded_output 94 | 95 | if ! 
git diff --exit-code --quiet --cached; then 96 | echo Aborting due to uncommitted changes in the index >&2 97 | return 1 98 | fi 99 | 100 | commit_title=`git log -n 1 --format="%s" HEAD` 101 | commit_hash=` git log -n 1 --format="%H" HEAD` 102 | 103 | #default commit message uses last title if a custom one is not supplied 104 | if [[ -z $commit_message ]]; then 105 | commit_message="publish: $commit_title" 106 | fi 107 | 108 | #append hash to commit message unless no hash flag was found 109 | if [ $append_hash = true ]; then 110 | commit_message="$commit_message"$'\n\n'"generated from commit $commit_hash" 111 | fi 112 | 113 | previous_branch=`git rev-parse --abbrev-ref HEAD` 114 | 115 | if [ ! -d "$deploy_directory" ]; then 116 | echo "Deploy directory '$deploy_directory' does not exist. Aborting." >&2 117 | return 1 118 | fi 119 | 120 | # must use short form of flag in ls for compatibility with OS X and BSD 121 | if [[ -z `ls -A "$deploy_directory" 2> /dev/null` && -z $allow_empty ]]; then 122 | echo "Deploy directory '$deploy_directory' is empty. Aborting. If you're sure you want to deploy an empty tree, use the --allow-empty / -e flag." 
>&2 123 | return 1 124 | fi 125 | 126 | if git ls-remote --exit-code $repo "refs/heads/$deploy_branch" ; then 127 | # deploy_branch exists in $repo; make sure we have the latest version 128 | 129 | disable_expanded_output 130 | git fetch --force $repo $deploy_branch:$deploy_branch 131 | enable_expanded_output 132 | fi 133 | 134 | # check if deploy_branch exists locally 135 | if git show-ref --verify --quiet "refs/heads/$deploy_branch" 136 | then incremental_deploy 137 | else initial_deploy 138 | fi 139 | 140 | restore_head 141 | } 142 | 143 | initial_deploy() { 144 | git --work-tree "$deploy_directory" checkout --orphan $deploy_branch 145 | git --work-tree "$deploy_directory" add --all 146 | commit+push 147 | } 148 | 149 | incremental_deploy() { 150 | #make deploy_branch the current branch 151 | git symbolic-ref HEAD refs/heads/$deploy_branch 152 | #put the previously committed contents of deploy_branch into the index 153 | git --work-tree "$deploy_directory" reset --mixed --quiet 154 | git --work-tree "$deploy_directory" add --all 155 | 156 | set +o errexit 157 | diff=$(git --work-tree "$deploy_directory" diff --exit-code --quiet HEAD --)$? 158 | set -o errexit 159 | case $diff in 160 | 0) echo No changes to files in $deploy_directory. Skipping commit.;; 161 | 1) commit+push;; 162 | *) 163 | echo git diff exited with code $diff. Aborting. Staying on branch $deploy_branch so you can debug. 
To switch back to master, use: git symbolic-ref HEAD refs/heads/master && git reset --mixed >&2 164 | return $diff 165 | ;; 166 | esac 167 | } 168 | 169 | commit+push() { 170 | set_user_id 171 | git --work-tree "$deploy_directory" commit -m "$commit_message" 172 | 173 | disable_expanded_output 174 | #--quiet is important here to avoid outputting the repo URL, which may contain a secret token 175 | git push --quiet $repo $deploy_branch 176 | enable_expanded_output 177 | } 178 | 179 | #echo expanded commands as they are executed (for debugging) 180 | enable_expanded_output() { 181 | if [ $verbose ]; then 182 | set -o xtrace 183 | set +o verbose 184 | fi 185 | } 186 | 187 | #this is used to avoid outputting the repo URL, which may contain a secret token 188 | disable_expanded_output() { 189 | if [ $verbose ]; then 190 | set +o xtrace 191 | set -o verbose 192 | fi 193 | } 194 | 195 | set_user_id() { 196 | if [[ -z `git config user.name` ]]; then 197 | git config user.name "$default_username" 198 | fi 199 | if [[ -z `git config user.email` ]]; then 200 | git config user.email "$default_email" 201 | fi 202 | } 203 | 204 | restore_head() { 205 | if [[ $previous_branch = "HEAD" ]]; then 206 | #we weren't on any branch before, so just set HEAD back to the commit it was on 207 | git update-ref --no-deref HEAD $commit_hash $deploy_branch 208 | else 209 | git symbolic-ref HEAD refs/heads/$previous_branch 210 | fi 211 | 212 | git reset --mixed 213 | } 214 | 215 | filter() { 216 | sed -e "s|$repo|\$repo|g" 217 | } 218 | 219 | sanitize() { 220 | "$@" 2> >(filter 1>&2) | filter 221 | } 222 | 223 | [[ $1 = --source-only ]] || main "$@" 224 | -------------------------------------------------------------------------------- /bin/hypothesis.r: -------------------------------------------------------------------------------- 1 | # Create GitHub issues from hypothes.is annotations 2 | suppressPackageStartupMessages({ 3 | library("rapiclient") 4 | library("httr") 5 | library("ghql") 6 |
library('purrr') 7 | library("stringr") 8 | library("glue") 9 | library("ghql") 10 | }) 11 | 12 | 13 | 14 | HYPOTHESIS_USER <- "acct:jrnold@hypothes.is" 15 | GITHUB_LABEL_INFO <- list(id="MDU6TGFiZWwxMzE5MTEyNTM4", name="hypothes.is") 16 | GITHUB_REPO_ID <- "MDEwOlJlcG9zaXRvcnk3NjgzMTMxMQ==" 17 | 18 | 19 | get_annotations <- function(search_after, limit=200) { 20 | hypothesis_api <- suppressWarnings(get_api(url = "https://h.readthedocs.io/en/latest/api-reference/hypothesis-v1.yaml")) 21 | hypothesis_api$host <- "hypothes.is" 22 | hypothesis_api$basePath <- "/api" 23 | operations <- get_operations(hypothesis_api) 24 | res <- operations$Search_for_annotations(search_after=search_after, sort="created", order="asc", 25 | limit=limit, 26 | wildcard_uri = "https://jrnold.github.io/r4ds-exercise-solutions/*") 27 | content(res) %>% 28 | pluck("rows") %>% 29 | keep(function(x) { 30 | x$user != HYPOTHESIS_USER 31 | }) 32 | } 33 | 34 | annotation_to_issue <- function(annotation, con) { 35 | x <- annotation[c("user", "id", "created", "updated", "uri", "text")] 36 | x[["link_incontext"]] <- annotation$links$incontext 37 | x[["link_html"]] <- annotation$links$html 38 | x[["text"]] <- str_c("> ", str_split(x$text, "\n")[[1]], collapse = "\n") 39 | title <- glue("Respond to {link_html}", .envir = x) 40 | body <- glue("Respond to hypothes.is annotation [{id}]({link_html}) on {uri} (user: {user}, created: {created}, updated: {updated})\n", "{text}", 41 | "\n", "Link: {link_incontext}", 42 | .envir = x, .sep = "\n") 43 | 44 | qry <- Query$new() 45 | qry$query('createNewIssue', " 46 | mutation createNewIssue($repositoryId: String!, $title: String!, $body: String!) { 47 | createIssue(input: {repositoryId: $repositoryId, title: $title, body: $body}) { 48 | issue{ 49 | id 50 | title 51 | } 52 | } 53 | }") 54 | 55 | qry$query('addLabelsToIssue', " 56 | mutation addLabelsToIssue($issueId: String!, $labels: [String!]!) 
{ 57 | addLabelsToLabelable(input: {labelableId: $issueId, labelIds: $labels}) { 58 | labelable { 59 | ... on Issue { 60 | id 61 | } 62 | } 63 | } 64 | }") 65 | 66 | res <- con$exec(qry$queries$createNewIssue, list(title = title, body = body, repositoryId = GITHUB_REPO_ID)) 67 | data <- jsonlite::parse_json(res)$data 68 | issue_id <- data$createIssue$issue$id 69 | cat("Created issue ", data$createIssue$issue$title, "\n") 70 | res2 <- con$exec(qry$queries$addLabelsToIssue, list(issueId = issue_id, labels = list(GITHUB_LABEL_INFO$id))) 71 | cat("Added label to issue ", data$createIssue$issue$title, "\n") 72 | issue_id 73 | } 74 | 75 | connect_github <- function() { 76 | token <- Sys.getenv("GITHUB_PAT") 77 | con <- GraphqlClient$new( 78 | url = "https://api.github.com/graphql", 79 | headers = list(Authorization = paste0("Bearer ", token)) 80 | ) 81 | con$load_schema() 82 | con 83 | } 84 | 85 | main <- function() { 86 | search_after <- "2020-06-25T00:00:00" 87 | annotations <- get_annotations(search_after = search_after) 88 | con <- connect_github() 89 | map(annotations, annotation_to_issue, con) 90 | } 91 | 92 | annotations <- main() 93 | 94 | # con$load_schema() 95 | # qry <- Query$new() 96 | # qry$query("getRepoId", "{ 97 | # repository(owner: \"jrnold\", name: \"r4ds-exercise-solutions\") { 98 | # id 99 | # name 100 | # labels { 101 | # edges { 102 | # node { 103 | # id 104 | # name 105 | # } 106 | # } 107 | # } 108 | # } 109 | # }") 110 | # (x <- con$exec(qry$queries$getRepoId)) 111 | 112 | #con <- connect_github() 113 | # #con$load_schema() 114 | # qry <- Query$new() 115 | # qry$query("getRepoIssues", " 116 | # query getRepoIssues($owner: String!, $name: String!, $label: String!) 
{ 117 | # repository(name: $name, owner: $owner) { 118 | # issues (last:1, filterBy: {labels: [$label], createdBy: $owner}) { 119 | # pageInfo { 120 | # startCursor 121 | # endCursor 122 | # hasNextPage 123 | # } 124 | # nodes { 125 | # author { 126 | # login 127 | # } 128 | # title 129 | # createdAt 130 | # } 131 | # totalCount 132 | # } 133 | # } 134 | # }") 135 | # (x <- con$exec(qry$queries$getRepoIssues, list(owner="jrnold", name = "r4ds-exercise-solutions", label = GITHUB_LABEL_INFO$name))) 136 | -------------------------------------------------------------------------------- /bin/is_primary_key.R: -------------------------------------------------------------------------------- 1 | any_missing <- function(x) any(is.na(x)) 2 | 3 | #' Check whether variables are a primary key 4 | #' 5 | #' Check whether a set of variables is a primary key for a data frame. 6 | #' Unlike SQL databases, R data frames do not enforce a primary key 7 | #' constraint. This function checks whether a set of variables uniquely 8 | #' identify a row. 9 | #' 10 | #' @param tbl A tbl. 11 | #' @param ... One or more unquoted expressions separated by commas. 12 | #' You can treat variable names like they are positions. This uses 13 | #' The same semantics as [dplyr::select()]. 14 | #' @return A logical vector of length one that is `TRUE` if the 15 | #' variables are primary key, and `FALSE` otherwise. 16 | is_primary_key <- function(tbl, ...) { 17 | variables <- quos(...) 
 18 | # no elements can be missing 19 | has_nulls <- summarise_at(tbl, vars(!!!variables), any_missing) 20 | if (any(as.logical(has_nulls))) { 21 | return(FALSE) 22 | } 23 | nrow(distinct(tbl, !!!variables)) == nrow(tbl) 24 | } 25 | 26 | foo <- tribble( 27 | ~a, ~b, ~c, 28 | 1, NA, 1, 29 | 2, 2, 1, 30 | 3, 3, 3 31 | ) 32 | 33 | is_primary_key(foo, a) 34 | is_primary_key(foo, b) 35 | is_primary_key(foo, c) 36 | 37 | is_primary_key(foo, 1:3) 38 | 39 | # check that columns in y are a foreign key of x 40 | is_foreign_key <- function(x, y, by = NULL) { 41 | # check that by is a primary key for y 42 | if (!rlang::eval_tidy(quo(is_primary_key(y, !!!by)))) { 43 | return(FALSE) 44 | } 45 | # check that all x are found in y 46 | !nrow(anti_join(x, y, by = by)) 47 | } 48 | 49 | -------------------------------------------------------------------------------- /bin/r4ds-toc.R: -------------------------------------------------------------------------------- 1 | #' extract the table of contents from "R for Data Science" and save to a csv file 2 | library("rvest") 3 | library("purrr") 4 | library("stringr") 5 | library("jsonlite") 6 | library("tibble") 7 | library("dplyr") 8 | library("readr") 9 | 10 | R4DS_INDEX <- "https://r4ds.had.co.nz/" 11 | 12 | r4ds_chapters <- function() { 13 | index <- read_html(R4DS_INDEX) 14 | book_summary <- html_nodes(index, "div.book-summary") 15 | chapters <- html_nodes(book_summary, "li.chapter") %>% 16 | keep(~ str_detect(html_attr(.x, "data-level"), "^\\d+$")) %>% 17 | map_dfr(~tibble(path = html_attr(.x, "data-path"), 18 | chapter = html_attr(.x, "data-level"))) 19 | chapters 20 | } 21 | 22 | process_section <- function(section) { 23 | header <- html_children(section)[[1]] 24 | lvl <- str_split(html_attr(section, "class"), " ")[[1]] %>% 25 | str_subset("^level(\\d+)$") %>% 26 | str_extract("\\d+") 27 | tibble( 28 | section_level = lvl, 29 | section_id = html_attr(section, "id"), 30 | section_number = html_text(html_node(header, "span.header-section-number")), 31 |
section_name = str_replace(html_text(header), "^[\\d.]+\\s+", "") 32 | ) 33 | } 34 | 35 | process_path <- function(path) { 36 | doc <- read_html(str_c(R4DS_INDEX, path)) 37 | map_dfr(html_nodes(doc, "div.section"), process_section) %>% 38 | mutate(path = path) 39 | } 40 | 41 | create_toc <- function() { 42 | chapters <- r4ds_chapters() 43 | map_dfr(chapters$path, process_path) %>% 44 | mutate(url = str_c(R4DS_INDEX, path, "#", section_id)) 45 | } 46 | 47 | main <- function() { 48 | toc <- create_toc() 49 | write_csv(toc, "r4ds-toc.csv") 50 | } 51 | 52 | main() 53 | -------------------------------------------------------------------------------- /bin/render.R: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env Rscript 2 | # Script to render book 3 | suppressPackageStartupMessages({ 4 | library("optparse") 5 | library("xml2") 6 | }) 7 | 8 | # From devtools:::git_uncommitted 9 | git_uncommitted <- function(path = ".") { 10 | r <- git2r::repository(path, discover = TRUE) 11 | st <- vapply(git2r::status(r), length, integer(1)) 12 | any(st != 0) 13 | } 14 | 15 | # Adapted from devtools:::git_uncommitted 16 | check_uncommitted <- function(path = ".") { 17 | if (git_uncommitted(path)) { 18 | stop(paste("Uncommitted files.", 19 | "All files should be committed before release.", 20 | "Please add and commit.", 21 | sep = " " 22 | )) 23 | } 24 | } 25 | 26 | create_outdir <- function(output_dir) { 27 | # create nojekyll if it doesn't exist 28 | dir.create(output_dir, recursive = TRUE, showWarnings = FALSE) 29 | nojekyll <- file.path(output_dir, ".nojekyll") 30 | if (!file.exists(nojekyll)) { 31 | cat("Creating ", nojekyll, "\n") 32 | con <- file(nojekyll, "w") 33 | close(con) 34 | } 35 | } 36 | 37 | render <- function(input, output_format = "all", force = FALSE, 38 | config = "_config.yml", ...) 
 { 39 | if (rmarkdown::pandoc_version() < 2) { 40 | stop("This book requires pandoc >= 2") 41 | } 42 | if (!force) { 43 | check_uncommitted(dirname(input[[1]])) 44 | } 45 | output_dir <- yaml::read_yaml("_bookdown.yml")$output_dir 46 | create_outdir(output_dir) 47 | bookdown::render_book(input = "index.Rmd", output_format = output_format, ..., 48 | envir = new.env(), clean_envir = FALSE) 49 | } 50 | 51 | main <- function(args = NULL) { 52 | option_list <- list( 53 | make_option(c("-f", "--force"), 54 | action = "store_true", default = FALSE, 55 | help = "Render even if there are uncommitted changes." 56 | ), 57 | make_option(c("-q", "--quiet"), 58 | action = "store_true", default = FALSE, 59 | help = "Do not use verbose output" 60 | ), 61 | make_option("--to", default = "all", 62 | help = "Bookdown output format to use"), 63 | make_option("--config", default = "_config.yml", 64 | help = "Path to project config file. Needed for output URL.") 65 | ) 66 | if (is.null(args)) { 67 | args <- commandArgs(TRUE) 68 | } 69 | opts <- parse_args(OptionParser( 70 | usage = "%prog [options] [output_format|all|html|pdf]", 71 | option_list = option_list 72 | ), 73 | args = args, 74 | positional_arguments = TRUE, 75 | convert_hyphens_to_underscores = TRUE 76 | ) 77 | output_format <- opts$options$to 78 | if (output_format == "html") { 79 | output_format <- "bookdown::gitbook" 80 | } else if (output_format == "pdf") { 81 | output_format <- "bookdown::pdf_book" 82 | } 83 | input <- if (!length(opts$args)) { 84 | "index.Rmd" 85 | } else { 86 | opts$args 87 | } 88 | render( 89 | input, 90 | output_format = output_format, 91 | force = opts$options$force, 92 | quiet = opts$options$quiet 93 | ) 94 | } 95 | 96 | main() 97 | -------------------------------------------------------------------------------- /bin/serve.R: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env Rscript 2 | bookdown::serve_book(".", preview = TRUE, daemon = TRUE, 
in_session = FALSE) 3 | -------------------------------------------------------------------------------- /bin/style.R: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env Rscript 2 | styler::style_dir(".", filetype = c("R", "Rmd")) 3 | -------------------------------------------------------------------------------- /communicate.Rmd: -------------------------------------------------------------------------------- 1 | # (PART) Communicate {-} 2 | 3 | # Introduction {#communicate-intro .r4ds-section} 4 | 5 | `r no_exercises()` 6 | -------------------------------------------------------------------------------- /contributions.Rmd: -------------------------------------------------------------------------------- 1 | # Contributing 2 | 3 | Like *R for Data Science*, the solutions to it have been developed in the open, and they can improve with your contributions. There are a number of ways you can help make the solutions even better: 4 | 5 | - If you don't understand something, please 6 | [let me know](mailto:jeffrey.arnold@gmail.com). Your feedback on what is confusing or hard to understand is valuable. 7 | 8 | - If you spot a typo, feel free to edit the underlying page and send a pull 9 | request. If you've never done this before, the process is very easy: 10 | 11 | - Click the edit this page on the sidebar. 12 | 13 | - Make the changes using GitHub's in-page editor and save. 14 | 15 | - Submit a pull request and include a brief description of your changes. 16 | "Fixing typos" is perfectly adequate. 17 | -------------------------------------------------------------------------------- /data/README.md: -------------------------------------------------------------------------------- 1 | This directory contains dummy CSV files so that Exercise 21.3.1 can run code. 
2 | -------------------------------------------------------------------------------- /data/file1.csv: -------------------------------------------------------------------------------- 1 | X1,X2 2 | 1,"a" 3 | 2,"b" 4 | -------------------------------------------------------------------------------- /data/file2.csv: -------------------------------------------------------------------------------- 1 | X1,X2 2 | 3,"c" 3 | 4,"d" 4 | -------------------------------------------------------------------------------- /data/file3.csv: -------------------------------------------------------------------------------- 1 | X1,X2 2 | 5,"e" 3 | 6,"f" 4 | -------------------------------------------------------------------------------- /datetimes.Rmd: -------------------------------------------------------------------------------- 1 | # Dates and times {#dates-and-times .r4ds-section} 2 | 3 | ## Introduction {#introduction-10 .r4ds-section} 4 | 5 | ```{r message=FALSE,cache=FALSE} 6 | library("tidyverse") 7 | library("lubridate") 8 | library("nycflights13") 9 | ``` 10 | 11 | ## Creating date/times {#creating-datetimes .r4ds-section} 12 | 13 | This code is needed by exercises. 14 | ```{r} 15 | make_datetime_100 <- function(year, month, day, time) { 16 | make_datetime(year, month, day, time %/% 100, time %% 100) 17 | } 18 | 19 | flights_dt <- flights %>% 20 | filter(!is.na(dep_time), !is.na(arr_time)) %>% 21 | mutate( 22 | dep_time = make_datetime_100(year, month, day, dep_time), 23 | arr_time = make_datetime_100(year, month, day, arr_time), 24 | sched_dep_time = make_datetime_100(year, month, day, sched_dep_time), 25 | sched_arr_time = make_datetime_100(year, month, day, sched_arr_time) 26 | ) %>% 27 | select(origin, dest, ends_with("delay"), ends_with("time")) 28 | ``` 29 | 30 | ### Exercise 16.2.1 {.unnumbered .exercise data-number="16.2.1"} 31 | 32 |
    33 | 34 | What happens if you parse a string that 35 | contains invalid dates? 36 | 37 |
    38 | 39 |
    40 | 41 | ```{r} 42 | ret <- ymd(c("2010-10-10", "bananas")) 43 | print(class(ret)) 44 | ret 45 | ``` 46 | 47 | It produces an `NA` and a warning message. 48 | 49 |
    50 | 51 | ### Exercise 16.2.2 {.unnumbered .exercise data-number="16.2.2"} 52 | 53 |
    54 | 55 | What does the `tzone` argument to `today()` do? 56 | Why is it important? 57 | 58 |
    59 | 60 |
    61 | 62 | It determines the time-zone of the date. 63 | Since different time-zones can have different dates, the value of `today()` can vary depending on the time-zone specified. 64 | 65 |
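The point about dates differing across time zones can be illustrated with a fixed instant. This is a minimal sketch, assuming the lubridate package is installed; the instant and the two time zones are arbitrary example values, not taken from the book.

```{r}
library("lubridate")

# A single instant in time (a hypothetical example value)
instant <- ymd_hms("2020-01-01 02:00:00", tz = "UTC")

# The calendar date of that same instant depends on the time zone
date_auckland <- as_date(with_tz(instant, "Pacific/Auckland"))
date_los_angeles <- as_date(with_tz(instant, "America/Los_Angeles"))
date_auckland
date_los_angeles

# today() applies the same idea to the current instant
today(tzone = "Pacific/Auckland")
today(tzone = "America/Los_Angeles")
```

For part of every day, the last two calls return different dates, which is why `tzone` matters whenever a date is compared against `today()`.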
    66 | 67 | ### Exercise 16.2.3 {.unnumbered .exercise data-number="16.2.3"} 68 | 69 |
    70 | 71 | Use the appropriate lubridate function to parse each of the following dates: 72 | 73 | ```{r} 74 | d1 <- "January 1, 2010" 75 | d2 <- "2015-Mar-07" 76 | d3 <- "06-Jun-2017" 77 | d4 <- c("August 19 (2015)", "July 1 (2015)") 78 | d5 <- "12/30/14" 79 | ``` 80 | 81 |
    82 | 83 |
    84 | 85 | ```{r} 86 | mdy(d1) 87 | ymd(d2) 88 | dmy(d3) 89 | mdy(d4) 90 | mdy(d5) 91 | ``` 92 | 93 |
    94 | 95 | ## Date-time components {#date-time-components .r4ds-section} 96 | 97 | The following code from the chapter is needed by the exercises. 98 | 99 | ```{r} 100 | sched_dep <- flights_dt %>% 101 | mutate(minute = minute(sched_dep_time)) %>% 102 | group_by(minute) %>% 103 | summarise( 104 | avg_delay = mean(arr_delay, na.rm = TRUE), 105 | n = n() 106 | ) 107 | ``` 108 | In the previous code, the difference between rounded and un-rounded dates provides the within-period time. 109 | 110 | ### Exercise 16.3.1 {.unnumbered .exercise data-number="16.3.1"} 111 | 112 |
    113 | 114 | How does the distribution of flight times 115 | within a day change over the course of the year? 116 | 117 |
    118 | 119 |
    119 | 120 | 121 | Let's try plotting this by month: 122 | ```{r} 123 | flights_dt %>% 124 | filter(!is.na(dep_time)) %>% 125 | mutate(dep_hour = update(dep_time, yday = 1)) %>% 126 | mutate(month = factor(month(dep_time))) %>% 127 | ggplot(aes(dep_hour, color = month)) + 128 | geom_freqpoly(binwidth = 60 * 60) 129 | ``` 130 | 131 | This will look better if everything is normalized within groups. The reason 132 | that February is lower is that there are fewer days and thus fewer flights. 133 | ```{r} 134 | flights_dt %>% 135 | filter(!is.na(dep_time)) %>% 136 | mutate(dep_hour = update(dep_time, yday = 1)) %>% 137 | mutate(month = factor(month(dep_time))) %>% 138 | ggplot(aes(dep_hour, color = month)) + 139 | geom_freqpoly(aes(y = ..density..), binwidth = 60 * 60) 140 | ``` 141 | 142 | At least to me, there doesn't appear to be much difference in the within-day distribution over the year, but I may be thinking about it incorrectly. 143 | 144 |
    145 | 146 | ### Exercise 16.3.2 {.unnumbered .exercise data-number="16.3.2"} 147 | 148 |
    149 | 150 | Compare `dep_time`, `sched_dep_time` and `dep_delay`. Are they consistent? Explain your findings. 151 | 152 |
    153 | 154 |
    155 | 156 | If they are consistent, then `dep_time = sched_dep_time + dep_delay`. 157 | 158 | ```{r} 159 | flights_dt %>% 160 | mutate(dep_time_ = sched_dep_time + dep_delay * 60) %>% 161 | filter(dep_time_ != dep_time) %>% 162 | select(dep_time_, dep_time, sched_dep_time, dep_delay) 163 | ``` 164 | 165 | There exist discrepancies. It looks like there are mistakes in the dates. These 166 | are flights in which the actual departure time is on the *next* day relative to 167 | the scheduled departure time. We forgot to account for this when creating the 168 | date-times using `make_datetime_100()` function in [16.2.2 From individual components](https://r4ds.had.co.nz/dates-and-times.html#from-individual-components). The code would have had to check if the departure time is less than 169 | the scheduled departure time plus departure delay (in minutes). Alternatively, simply adding the departure delay to the scheduled departure time is a more robust way to construct the departure time because it will automatically account for crossing into the next day. 170 | 171 |
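The next-day problem described above can be reproduced on a single made-up flight. This is a minimal sketch, assuming lubridate is installed; the scheduled time, delay, and recorded departure time are hypothetical values chosen to cross midnight, and the variable names are illustrative only.

```{r}
library("lubridate")

# Hypothetical flight: scheduled for 23:50 and delayed by 25 minutes
sched_dep_example <- ymd_hm("2013-01-01 23:50")
dep_delay_example <- 25
# The departure time as stored in `flights`: 00:15 becomes the integer 15
recorded_dep_time <- 15

# Combining the scheduled date with the recorded clock time lands on the wrong day
naive_dep <- make_datetime(
  2013, 1, 1,
  recorded_dep_time %/% 100, recorded_dep_time %% 100
)

# Adding the delay to the scheduled date-time rolls over midnight correctly
robust_dep <- sched_dep_example + minutes(dep_delay_example)

naive_dep
robust_dep
```

The two results differ by exactly one day, which is the discrepancy the filter in the solution picks up.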
    172 | 173 | ### Exercise 16.3.3 {.unnumbered .exercise data-number="16.3.3"} 174 | 175 |
    176 | 177 | Compare `air_time` with the duration between the departure and arrival. 178 | Explain your findings. 179 | 180 |
    181 | 182 |
    183 | 184 | ```{r} 185 | flights_dt %>% 186 | mutate( 187 | flight_duration = as.numeric(difftime(arr_time, dep_time, units = "mins")), 188 | air_time_mins = air_time, 189 | diff = flight_duration - air_time_mins 190 | ) %>% 191 | select(origin, dest, flight_duration, air_time_mins, diff) 192 | ``` 193 | 194 |
    195 | 196 | ### Exercise 16.3.4 {.unnumbered .exercise data-number="16.3.4"} 197 | 198 |
    199 | 200 | How does the average delay time change over the course of a day? Should you use `dep_time` or `sched_dep_time`? Why? 201 | 202 |
    203 | 204 |
    205 | 206 | Use `sched_dep_time` because that is the relevant metric for someone scheduling a flight. Also, using `dep_time` will always bias delays to later in the day since delays will push flights later. 207 | 208 | ```{r} 209 | flights_dt %>% 210 | mutate(sched_dep_hour = hour(sched_dep_time)) %>% 211 | group_by(sched_dep_hour) %>% 212 | summarise(dep_delay = mean(dep_delay)) %>% 213 | ggplot(aes(y = dep_delay, x = sched_dep_hour)) + 214 | geom_point() + 215 | geom_smooth() 216 | ``` 217 | 218 |
    219 | 220 | ### Exercise 16.3.5 {.unnumbered .exercise data-number="16.3.5"} 221 | 222 |
    223 | 224 | On what day of the week should you leave if you want to minimize the chance of a delay? 225 | 226 |
    227 | 228 |
    229 | 230 | Saturday has the lowest average departure delay time and the lowest average arrival delay time. 231 | 232 | ```{r 16.3.5.tbl} 233 | flights_dt %>% 234 | mutate(dow = wday(sched_dep_time)) %>% 235 | group_by(dow) %>% 236 | summarise( 237 | dep_delay = mean(dep_delay), 238 | arr_delay = mean(arr_delay, na.rm = TRUE) 239 | ) %>% 240 | print(n = Inf) 241 | ``` 242 | 243 | ```{r 16.3.5.fig1} 244 | flights_dt %>% 245 | mutate(wday = wday(dep_time, label = TRUE)) %>% 246 | group_by(wday) %>% 247 | summarize(ave_dep_delay = mean(dep_delay, na.rm = TRUE)) %>% 248 | ggplot(aes(x = wday, y = ave_dep_delay)) + 249 | geom_bar(stat = "identity") 250 | ``` 251 | 252 | ```{r 16.3.5.fig2} 253 | flights_dt %>% 254 | mutate(wday = wday(dep_time, label = TRUE)) %>% 255 | group_by(wday) %>% 256 | summarize(ave_arr_delay = mean(arr_delay, na.rm = TRUE)) %>% 257 | ggplot(aes(x = wday, y = ave_arr_delay)) + 258 | geom_bar(stat = "identity") 259 | ``` 260 | 261 |
    262 | 263 | ### Exercise 16.3.6 {.unnumbered .exercise data-number="16.3.6"} 264 | 265 |
    266 | 267 | What makes the distribution of `diamonds$carat` and `flights$sched_dep_time` similar? 268 | 269 |
    270 | 271 |
    271 | 272 | 273 | ```{r} 274 | ggplot(diamonds, aes(x = carat)) + 275 | geom_density() 276 | ``` 277 | 278 | In both `carat` and `sched_dep_time`, an abnormally large number of values fall at nice "human" numbers. In `sched_dep_time`, these are 00 and 30 minutes. In `carat`, these are 0, 1/3, 1/2, and 2/3. 279 | 280 | ```{r} 281 | ggplot(diamonds, aes(x = carat %% 1 * 100)) + 282 | geom_histogram(binwidth = 1) 283 | ``` 284 | 285 | In scheduled departure times it is 00 and 30 minutes, and minutes 286 | ending in 0 and 5. 287 | 288 | ```{r} 289 | ggplot(flights_dt, aes(x = minute(sched_dep_time))) + 290 | geom_histogram(binwidth = 1) 291 | ``` 292 | 293 |
    294 | 295 | ### Exercise 16.3.7 {.unnumbered .exercise data-number="16.3.7"} 296 | 297 |
    298 | 299 | Confirm my hypothesis that the early departures of flights in minutes 20-30 and 50-60 are caused by scheduled flights that leave early. 300 | Hint: create a binary variable that tells you whether or not a flight was delayed. 301 | 302 |
    303 | 304 |
    305 | 306 | First, I create a binary variable `early` that is equal to 1 if a flight leaves early, and 0 if it does not. 307 | Then, I group flights by the minute of departure. 308 | This shows that the proportion of flights that are early departures is highest between minutes 20--30 and 50--60. 309 | ```{r} 310 | flights_dt %>% 311 | mutate(minute = minute(dep_time), 312 | early = dep_delay < 0) %>% 313 | group_by(minute) %>% 314 | summarise( 315 | early = mean(early, na.rm = TRUE), 316 | n = n()) %>% 317 | ggplot(aes(minute, early)) + 318 | geom_line() 319 | ``` 320 | 321 |
    322 | 323 | ## Time spans {#time-spans .r4ds-section} 324 | 325 | ### Exercise 16.4.1 {.unnumbered .exercise data-number="16.4.1"} 326 | 327 |
    328 | 329 | Why is there `months()` but no `dmonths()`? 330 | 331 |
    332 | 333 |
    334 | 335 | There is no unambiguous value of months in terms of seconds since months have differing numbers of days. 336 | 337 | - 31 days: January, March, May, July, August, October, December 338 | - 30 days: April, June, September, November 339 | - 28 or 29 days: February 340 | 341 | The month is not a duration of time defined independently of when it occurs, but a special interval between two dates. 342 | 343 |
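The varying month lengths listed above can be checked directly. This is a minimal sketch, assuming lubridate is installed; the dates are arbitrary examples.

```{r}
library("lubridate")

# Month lengths differ, and February depends on the year
month_lengths <- days_in_month(ymd(c("2015-01-01", "2015-02-01", "2016-02-01")))
month_lengths

# A period of one month is resolved relative to its starting date
first_of_feb <- ymd("2015-01-01") + months(1)
first_of_feb

# There is no February 31st, so this addition yields NA
no_such_date <- ymd("2015-01-31") + months(1)
no_such_date
```

Because the result of adding `months(1)` depends on the starting date, a month behaves as a period rather than as a fixed number of seconds.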
    344 | 345 | ### Exercise 16.4.2 {.unnumbered .exercise data-number="16.4.2"} 346 | 347 |
    348 | 349 | Explain `days(overnight * 1)` to someone who has just started learning R. 350 | How does it work? 351 | 352 |
    353 | 354 |
    355 | 356 | The variable `overnight` is equal to `TRUE` or `FALSE`. 357 | If it is an overnight flight, this becomes 1 day, and if not, then overnight = 0, and no days are added to the date. 358 | 359 |
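The conversion from a logical value to a number of days can be seen in a small example. This is a minimal sketch, assuming lubridate is installed; the `overnight_example` vector and the date are made up for illustration.

```{r}
library("lubridate")

# Hypothetical flags: the first flight is overnight, the second is not
overnight_example <- c(TRUE, FALSE)

# Multiplying by 1 turns TRUE into 1 and FALSE into 0
overnight_example * 1

# days() turns those numbers into periods of 1 day and 0 days
shifted <- ymd("2013-01-01") + days(overnight_example * 1)
shifted
```

Only the first date moves forward by a day; the second is unchanged because zero days are added.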
    360 | 361 | ### Exercise 16.4.3 {.unnumbered .exercise data-number="16.4.3"} 362 | 363 |
    364 | 365 | Create a vector of dates giving the first day of every month in 2015. 366 | Create a vector of dates giving the first day of every month in the current year. 367 | 368 |
    369 | 370 |
    371 | 372 | A vector of the first day of the month for every month in 2015: 373 | ```{r} 374 | ymd("2015-01-01") + months(0:11) 375 | ``` 376 | 377 | To get the vector of the first day of the month for *this* year, we first need to figure out what this year is, and get January 1st of it. 378 | I can do that by taking `today()` and truncating it to the year using `floor_date()`: 379 | ```{r} 380 | floor_date(today(), unit = "year") + months(0:11) 381 | ``` 382 | 383 |
    384 | 385 | ### Exercise 16.4.4 {.unnumbered .exercise data-number="16.4.4"} 386 | 387 |
    388 | 389 | Write a function that given your birthday (as a date), returns how old you are in years. 390 | 391 |
    392 | 393 |
    394 | 395 | ```{r} 396 | age <- function(bday) { 397 | (bday %--% today()) %/% years(1) 398 | } 399 | age(ymd("1990-10-12")) 400 | ``` 401 | 402 |
    403 | 404 | ### Exercise 16.4.5 {.unnumbered .exercise data-number="16.4.5"} 405 | 406 |
    407 | 408 | Why can’t `(today() %--% (today() + years(1)) / months(1)` work? 409 | 410 |
    411 | 412 |
    413 | 414 | The code in the question is missing a parenthesis. 415 | So, I will assume that the correct code is, 416 | ```{r} 417 | (today() %--% (today() + years(1))) / months(1) 418 | ``` 419 | 420 | While this code will not display a warning or message, it does not work exactly as 421 | expected. The problem is discussed in the [Intervals](https://r4ds.had.co.nz/dates-and-times.html#intervals) section. 422 | 423 | The numerator of the expression, `(today() %--% (today() + years(1)))`, is an *interval*, which includes both a duration of time and a starting point. The interval has an exact number of seconds. 424 | The denominator of the expression, `months(1)`, is a period, which is meaningful to humans but not defined in terms of an exact number of seconds. 425 | Months can be 28, 29, 30, or 31 days long, so it is not clear what length of time `months(1)` should represent when dividing. 426 | The code does not produce a warning message, but it will not always produce the correct result. 427 | 428 | To find the number of months within an interval, use `%/%` instead of `/`, 429 | ```{r} 430 | (today() %--% (today() + years(1))) %/% months(1) 431 | ``` 432 | 433 | Alternatively, we could define a "month" as 30 days, and run 434 | ```{r} 435 | (today() %--% (today() + years(1))) / days(30) 436 | ``` 437 | 438 | This approach will not work with `today() + years(1)`, which is not defined for February 29th on leap years: 439 | ```{r} 440 | as.Date("2016-02-29") + years(1) 441 | ``` 442 | 443 |
    444 | 445 | ## Time zones {#time-zones .r4ds-section} 446 | 447 | `r no_exercises()` 448 | -------------------------------------------------------------------------------- /diagrams/Lahman1.graffle: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jrnold/r4ds-exercise-solutions/e5605c5d032bcab5780766e23cd230118ffb44ba/diagrams/Lahman1.graffle -------------------------------------------------------------------------------- /diagrams/Lahman1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jrnold/r4ds-exercise-solutions/e5605c5d032bcab5780766e23cd230118ffb44ba/diagrams/Lahman1.png -------------------------------------------------------------------------------- /diagrams/Lahman2.graffle: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jrnold/r4ds-exercise-solutions/e5605c5d032bcab5780766e23cd230118ffb44ba/diagrams/Lahman2.graffle -------------------------------------------------------------------------------- /diagrams/Lahman2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jrnold/r4ds-exercise-solutions/e5605c5d032bcab5780766e23cd230118ffb44ba/diagrams/Lahman2.png -------------------------------------------------------------------------------- /diagrams/Lahman3.graffle: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jrnold/r4ds-exercise-solutions/e5605c5d032bcab5780766e23cd230118ffb44ba/diagrams/Lahman3.graffle -------------------------------------------------------------------------------- /diagrams/Lahman3.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jrnold/r4ds-exercise-solutions/e5605c5d032bcab5780766e23cd230118ffb44ba/diagrams/Lahman3.png 
-------------------------------------------------------------------------------- /diagrams/master-batting-salaries.graffle: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jrnold/r4ds-exercise-solutions/e5605c5d032bcab5780766e23cd230118ffb44ba/diagrams/master-batting-salaries.graffle -------------------------------------------------------------------------------- /diagrams/master-batting-salaries.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jrnold/r4ds-exercise-solutions/e5605c5d032bcab5780766e23cd230118ffb44ba/diagrams/master-batting-salaries.png -------------------------------------------------------------------------------- /diagrams/nested_set_1.dot: -------------------------------------------------------------------------------- 1 | digraph nested_set_1 { 2 | node[shape=box] 3 | graph[style=rounded] 4 | # subgraph for R information 5 | subgraph cluster_abcdef { 6 | node[style=filled,fillcolor=gray90] 7 | "b" 8 | "a" 9 | subgraph cluster_ef { 10 | graph[fillcolor=gray90,style="rounded,filled"] 11 | node[fillcolor=gray80] 12 | "f" 13 | "e" 14 | } 15 | subgraph cluster_cd { 16 | graph[fillcolor=gray90,style="rounded,filled"] 17 | node[fillcolor=gray80] 18 | "d" 19 | "c" 20 | } 21 | } 22 | } 23 | -------------------------------------------------------------------------------- /diagrams/nested_set_2.dot: -------------------------------------------------------------------------------- 1 | digraph nested_set_2 { 2 | node[shape=box] 3 | graph[style=rounded] 4 | # subgraph for R information 5 | subgraph cluster_1 { 6 | subgraph cluster_2 { 7 | graph[fillcolor=gray90,style="rounded,filled"] 8 | subgraph cluster_3 { 9 | graph[fillcolor=gray80] 10 | subgraph cluster_4 { 11 | graph[fillcolor=gray70] 12 | subgraph cluster_5 { 13 | graph[fillcolor=gray60] 14 | subgraph cluster_6 { 15 | graph[fillcolor=gray50] 16 | 
node[style=filled,fillcolor=gray40] 17 | "a" 18 | } 19 | } 20 | } 21 | } 22 | } 23 | } 24 | } 25 | -------------------------------------------------------------------------------- /diagrams/nycflights.graffle: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jrnold/r4ds-exercise-solutions/e5605c5d032bcab5780766e23cd230118ffb44ba/diagrams/nycflights.graffle -------------------------------------------------------------------------------- /diagrams/nycflights.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jrnold/r4ds-exercise-solutions/e5605c5d032bcab5780766e23cd230118ffb44ba/diagrams/nycflights.png -------------------------------------------------------------------------------- /docker-compose.yml: -------------------------------------------------------------------------------- 1 | version: '3.2' 2 | services: 3 | nodejs: 4 | image: "node:latest" 5 | command: "npm test" 6 | volumes: 7 | - type: bind 8 | target: /app 9 | source: "." 10 | working_dir: /app 11 | 12 | r: 13 | build: "." 14 | volumes: 15 | - type: bind 16 | target: /home/rstudio/r4ds-exercise-solutions 17 | source: "." 
18 | ports: 19 | - "8787:8787" 20 | working_dir: /home/rstudio/r4ds-exercise-solutions 21 | -------------------------------------------------------------------------------- /explore.Rmd: -------------------------------------------------------------------------------- 1 | # (PART) Explore {-} 2 | 3 | # Introduction {#explore-intro .r4ds-section} 4 | 5 | `r no_exercises()` 6 | -------------------------------------------------------------------------------- /factors.Rmd: -------------------------------------------------------------------------------- 1 | # Factors {#factors .r4ds-section} 2 | 3 | ## Introduction {#introduction-9 .r4ds-section} 4 | 5 | Functions and packages: 6 | 7 | ```{r setup,message = FALSE,cache=FALSE} 8 | library("tidyverse") 9 | ``` 10 | The forcats package does not need to be explicitly loaded, since the recent versions of the tidyverse package now attach it. 11 | 12 | ## Creating factors {#creating-factors .r4ds-section} 13 | 14 | `r no_exercises()` 15 | 16 | ## General Social Survey {#general-social-survey .r4ds-section} 17 | 18 | ### Exercise 15.3.1 {.unnumbered .exercise data-number="15.3.1"} 19 | 20 |
    21 | 22 | Explore the distribution of `rincome` (reported income). 23 | What makes the default bar chart hard to understand? 24 | How could you improve the plot? 25 | 26 |
    27 | 28 |
    29 | 30 | My first attempt is to use `geom_bar()` with the default settings. 31 | ```{r} 32 | rincome_plot <- 33 | gss_cat %>% 34 | ggplot(aes(x = rincome)) + 35 | geom_bar() 36 | rincome_plot 37 | ``` 38 | 39 | The problem with the default bar chart settings is that the labels overlap and are impossible to read. 40 | I'll try changing the angle of the x-axis labels to vertical so that they will not overlap. 41 | ```{r} 42 | rincome_plot + 43 | theme(axis.text.x = element_text(angle = 90, hjust = 1)) 44 | ``` 45 | 46 | This is better because the labels no longer overlap, but they are still difficult to read because they are vertical. 47 | I could try angling the labels so that they are easier to read, but not overlapping. 48 | ```{r} 49 | rincome_plot + 50 | theme(axis.text.x = element_text(angle = 45, hjust = 1)) 51 | ``` 52 | 53 | But the solution I prefer for bar charts with long labels is to flip the axes, so that the bars are horizontal. 54 | Then the category labels are also horizontal, and easy to read. 55 | ```{r} 56 | rincome_plot + 57 | coord_flip() 58 | ``` 59 | 60 | Though more than asked for in this question, I could further improve this plot by 61 | 62 | 1. removing the "Not applicable" responses, 63 | 1. renaming "Lt \$1000" to "Less than \$1000", 64 | 1. using color to distinguish non-response categories ("Refused", "Don't know", and "No answer") from income levels ("Lt $1000", ...), 65 | 1. adding meaningful y- and x-axis titles, and 66 | 1. formatting the counts axis labels to use commas. 
67 | 68 | ```{r} 69 | gss_cat %>% 70 | filter(!rincome %in% c("Not applicable")) %>% 71 | mutate(rincome = fct_recode(rincome, 72 | "Less than $1000" = "Lt $1000" 73 | )) %>% 74 | mutate(rincome_na = rincome %in% c("Refused", "Don't know", "No answer")) %>% 75 | ggplot(aes(x = rincome, fill = rincome_na)) + 76 | geom_bar() + 77 | coord_flip() + 78 | scale_y_continuous("Number of Respondents", labels = scales::comma) + 79 | scale_x_discrete("Respondent's Income") + 80 | scale_fill_manual(values = c("FALSE" = "black", "TRUE" = "gray")) + 81 | theme(legend.position = "none") 82 | ``` 83 | 84 | If I were only interested in non-missing responses, then I could drop all respondents who answered "Not applicable", "Refused", "Don't know", or "No answer". 85 | ```{r} 86 | gss_cat %>% 87 | filter(!rincome %in% c("Not applicable", "Don't know", "No answer", "Refused")) %>% 88 | mutate(rincome = fct_recode(rincome, 89 | "Less than $1000" = "Lt $1000" 90 | )) %>% 91 | ggplot(aes(x = rincome)) + 92 | geom_bar() + 93 | coord_flip() + 94 | scale_y_continuous("Number of Respondents", labels = scales::comma) + 95 | scale_x_discrete("Respondent's Income") 96 | ``` 97 | 98 | A side-effect of `coord_flip()` is that the label ordering on the x-axis, from lowest (top) to highest (bottom), is counterintuitive. 99 | The next section introduces a function `fct_reorder()` which can help with this. 100 | 101 |
    102 | 103 | ### Exercise 15.3.2 {.unnumbered .exercise data-number="15.3.2"} 104 | 105 |
    106 | What is the most common `relig` in this survey? 107 | What’s the most common `partyid`? 108 |
    109 | 110 |
    111 | 112 | The most common `relig` is "Protestant". 113 | ```{r} 114 | gss_cat %>% 115 | count(relig) %>% 116 | arrange(desc(n)) %>% 117 | head(1) 118 | ``` 119 | 120 | The most common `partyid` is "Independent". 121 | ```{r} 122 | gss_cat %>% 123 | count(partyid) %>% 124 | arrange(desc(n)) %>% 125 | head(1) 126 | ``` 127 | 128 |
    129 | 130 | ### Exercise 15.3.3 {.unnumbered .exercise data-number="15.3.3"} 131 | 132 |
    133 | Which `relig` does `denom` (denomination) apply to? 134 | How can you find out with a table? 135 | How can you find out with a visualization? 136 |
    137 | 138 |
    139 | 140 | ```{r} 141 | levels(gss_cat$denom) 142 | ``` 143 | 144 | From the context it is clear that `denom` refers to "Protestant" (which is unsurprising given that it is the largest category in `relig`). 145 | Let's filter out the non-responses, no answers, others, not-applicable, or 146 | no denomination, to leave only answers to denominations. 147 | After doing that, the only remaining responses are "Protestant". 148 | ```{r} 149 | gss_cat %>% 150 | filter(!denom %in% c( 151 | "No answer", "Other", "Don't know", "Not applicable", 152 | "No denomination" 153 | )) %>% 154 | count(relig) 155 | ``` 156 | 157 | This is also clear in a scatter plot of `relig` vs. `denom` where the size of each point is 158 | proportional to the number of answers (since otherwise there would be overplotting). 159 | ```{r} 160 | gss_cat %>% 161 | count(relig, denom) %>% 162 | ggplot(aes(x = relig, y = denom, size = n)) + 163 | geom_point() + 164 | theme(axis.text.x = element_text(angle = 90)) 165 | ``` 166 | 167 |
    168 | 169 | ## Modifying factor order {#modifying-factor-order .r4ds-section} 170 | 171 | ### Exercise 15.4.1 {.unnumbered .exercise data-number="15.4.1"} 172 | 173 |
    174 | There are some suspiciously high numbers in `tvhours`. 175 | Is the `mean` a good summary? 176 |
    177 | 178 |
    179 | 180 | ```{r} 181 | summary(gss_cat[["tvhours"]]) 182 | ``` 183 | 184 | ```{r} 185 | gss_cat %>% 186 | filter(!is.na(tvhours)) %>% 187 | ggplot(aes(x = tvhours)) + 188 | geom_histogram(binwidth = 1) 189 | ``` 190 | 191 | Whether the mean is the best summary depends on what you are using it for :-), i.e. your objective. 192 | But the median is probably what most people would prefer, since a few very large values pull the mean upward. 193 | And the hours of TV don't look that surprising to me. 194 | 195 |
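To put numbers behind the comparison of mean and median, here is a minimal sketch that computes both summaries side by side. The column names `mean_tv`, `median_tv`, and `max_tv` are my own labels, not names from the book.

```r
library(dplyr)
library(forcats) # provides the gss_cat data

# Compare the mean with the median, ignoring missing values.
# A mean noticeably above the median suggests a right-skewed distribution.
tv_summary <- gss_cat %>%
  summarise(
    mean_tv = mean(tvhours, na.rm = TRUE),
    median_tv = median(tvhours, na.rm = TRUE),
    max_tv = max(tvhours, na.rm = TRUE)
  )
tv_summary
```

If the goal is to describe a typical respondent, the median is the safer choice; if the goal is total viewing time, the mean is still the relevant quantity.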
    196 | 197 | ### Exercise 15.4.2 {.unnumbered .exercise data-number="15.4.2"} 198 | 199 |
    200 | For each factor in `gss_cat` identify whether the order of the levels is arbitrary or principled. 201 |
    202 | 203 |
    204 | 205 | The following piece of code uses functions introduced in Chapter 21 to print out the names of only the factors. 206 | ```{r} 207 | keep(gss_cat, is.factor) %>% names() 208 | ``` 209 | 210 | There are six categorical variables: `marital`, `race`, `rincome`, `partyid`, `relig`, and `denom`. 211 | 212 | The ordering of marital is "somewhat principled". There is some sort of logic 213 | in that the levels are grouped "never married", married at some point 214 | (separated, divorced, widowed), and "married"; though it would seem that "Never 215 | Married", "Divorced", "Widowed", "Separated", "Married" might be more natural. 216 | I find that the question of ordering can be determined by the level of 217 | aggregation in a categorical variable, and there can be more "partially 218 | ordered" factors than one would expect. 219 | 220 | ```{r} 221 | levels(gss_cat[["marital"]]) 222 | ``` 223 | ```{r} 224 | gss_cat %>% 225 | ggplot(aes(x = marital)) + 226 | geom_bar() 227 | ``` 228 | 229 | The ordering of race is principled in that the categories are ordered by count of observations in the data. 230 | ```{r} 231 | levels(gss_cat$race) 232 | ``` 233 | ```{r} 234 | gss_cat %>% 235 | ggplot(aes(race)) + 236 | geom_bar() + 237 | scale_x_discrete(drop = FALSE) 238 | ``` 239 | 240 | The levels of `rincome` are ordered in decreasing order of the income; however 241 | the placement of "No answer", "Don't know", and "Refused" before, and "Not 242 | applicable" after the income levels is arbitrary. It would be better to place 243 | all the missing income level categories either before or after all the known 244 | values. 245 | 246 | ```{r} 247 | levels(gss_cat$rincome) 248 | ``` 249 | 250 | The levels of `relig` are arbitrary: there is no natural ordering, and they don't appear to be ordered by stats within the dataset. 
251 | ```{r} 252 | levels(gss_cat$relig) 253 | ``` 254 | 255 | ```{r} 256 | gss_cat %>% 257 | ggplot(aes(relig)) + 258 | geom_bar() + 259 | coord_flip() 260 | ``` 261 | 262 | The same goes for `denom`. 263 | ```{r} 264 | levels(gss_cat$denom) 265 | ``` 266 | 267 | Ignoring "No answer", "Don't know", and "Other party", the levels of `partyid` are ordered from "Strong Republican" to "Strong Democrat". 268 | ```{r} 269 | levels(gss_cat$partyid) 270 | ``` 271 | 272 |
    273 | 274 | ### Exercise 15.4.3 {.unnumbered .exercise data-number="15.4.3"} 275 | 276 |
    277 | Why did moving “Not applicable” to the front of the levels move it to the bottom of the plot? 278 |
    279 | 280 |
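    281 | 282 | Because that gives the level "Not applicable" an integer value of 1, and ggplot2 places the first level of a discrete y-axis nearest the origin, at the bottom of the plot. 283 | 284 |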
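The integer coding can be checked directly. This is a small sketch on a toy factor (the values `"a"` and `"b"` are made up for illustration) rather than on `rincome` itself.

```r
library(forcats)

# A toy factor whose "Not applicable" level starts out last.
f <- factor(
  c("a", "b", "Not applicable"),
  levels = c("a", "b", "Not applicable")
)

# fct_relevel() moves the named level to the front by default.
g <- fct_relevel(f, "Not applicable")

levels(g)
as.integer(g) # "Not applicable" is now coded as 1
```

Because discrete scales position levels by these integer codes, the level coded 1 is the one drawn closest to the origin.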
    285 | 286 | ## Modifying factor levels {#modifying-factor-levels .r4ds-section} 287 | 288 | ### Exercise 15.5.1 {.unnumbered .exercise data-number="15.5.1"} 289 | 290 |
    291 | How have the proportions of people identifying as Democrat, Republican, and Independent changed over time? 292 |
    293 | 294 |
    295 | 296 | To answer that, we need to combine the multiple levels into Democrat, Republican, and Independent 297 | ```{r} 298 | levels(gss_cat$partyid) 299 | ``` 300 | 301 | ```{r} 302 | gss_cat %>% 303 | mutate( 304 | partyid = 305 | fct_collapse(partyid, 306 | other = c("No answer", "Don't know", "Other party"), 307 | rep = c("Strong republican", "Not str republican"), 308 | ind = c("Ind,near rep", "Independent", "Ind,near dem"), 309 | dem = c("Not str democrat", "Strong democrat") 310 | ) 311 | ) %>% 312 | count(year, partyid) %>% 313 | group_by(year) %>% 314 | mutate(p = n / sum(n)) %>% 315 | ggplot(aes( 316 | x = year, y = p, 317 | colour = fct_reorder2(partyid, year, p) 318 | )) + 319 | geom_point() + 320 | geom_line() + 321 | labs(colour = "Party ID.") 322 | ``` 323 | 324 |
    325 | 326 | ### Exercise 15.5.2 {.unnumbered .exercise data-number="15.5.2"} 327 | 328 |
    329 | How could you collapse `rincome` into a small set of categories? 330 |
    331 | 332 |
    333 | 334 | Group all the non-responses into one category, and then group other categories into a smaller number. Since there is a clear ordering, we would not use `fct_lump()`, which lumps levels by frequency rather than by order. 335 | ```{r} 336 | levels(gss_cat$rincome) 337 | ``` 338 | 339 | ```{r} 340 | library("stringr") 341 | gss_cat %>% 342 | mutate( 343 | rincome = 344 | fct_collapse( 345 | rincome, 346 | `Unknown` = c("No answer", "Don't know", "Refused", "Not applicable"), 347 | `Lt $5000` = c("Lt $1000", str_c( 348 | "$", c("1000", "3000", "4000"), 349 | " to ", c("2999", "3999", "4999") 350 | )), 351 | `$5000 to 10000` = str_c( 352 | "$", c("5000", "6000", "7000", "8000"), 353 | " to ", c("5999", "6999", "7999", "9999") 354 | ) 355 | ) 356 | ) %>% 357 | ggplot(aes(x = rincome)) + 358 | geom_bar() + 359 | coord_flip() 360 | ``` 361 |
    362 | -------------------------------------------------------------------------------- /graphics-for-communication.Rmd: -------------------------------------------------------------------------------- 1 | # Graphics for communication {#graphics-for-communication .r4ds-section} 2 | 3 | ## Introduction {#introduction-19 .r4ds-section} 4 | 5 | ```{r setup,message=FALSE,cache=FALSE} 6 | library("tidyverse") 7 | library("modelr") 8 | library("lubridate") 9 | ``` 10 | 11 | ## Label {#label .r4ds-section} 12 | 13 | ### Exercise 28.2.1 {.unnumbered .exercise data-number="28.2.1"} 14 | 15 |
    16 | Create one plot on the fuel economy data with customized `title`, 17 | `subtitle`, `caption`, `x`, `y`, and `colour` labels. 18 |
    19 | 20 |
    21 | 22 | ```{r} 23 | ggplot( 24 | data = mpg, 25 | mapping = aes(x = fct_reorder(class, hwy), y = hwy) 26 | ) + 27 | geom_boxplot() + 28 | coord_flip() + 29 | labs( 30 | title = "Compact Cars Get over 10 More Hwy MPG than Pickup Trucks", 31 | subtitle = "Comparing the median highway mpg in each class", 32 | caption = "Data from fueleconomy.gov", 33 | x = "Car Class", 34 | y = "Highway Miles per Gallon" 35 | ) 36 | ``` 37 | 38 |
    39 | 40 | ### Exercise 28.2.2 {.unnumbered .exercise data-number="28.2.2"} 41 | 42 |
    43 | The `geom_smooth()` is somewhat misleading because the `hwy` for large engines is skewed upwards due to the inclusion of lightweight sports cars with big engines. 44 | Use your modeling tools to fit and display a better model. 45 |
    46 | 47 |
    48 | 49 | First, I'll plot the relationship between fuel efficiency and engine size (displacement) using all cars. 50 | The plot shows a strong negative relationship. 51 | ```{r} 52 | ggplot(mpg, aes(displ, hwy)) + 53 | geom_point() + 54 | geom_smooth(method = "lm", se = FALSE) + 55 | labs( 56 | title = "Fuel Efficiency Decreases with Engine Size", 57 | caption = "Data from fueleconomy.gov", 58 | y = "Highway Miles per Gallon", 59 | x = "Engine Displacement" 60 | ) 61 | ``` 62 | 63 | However, if I disaggregate by car class, and plot the relationship between 64 | fuel efficiency and engine displacement within each class, I see a different 65 | relationship. 66 | 67 | 1. For all car classes except subcompact cars, there is no relationship or only 68 | a small negative relationship between fuel efficiency and engine size. 69 | 70 | 1. For subcompact cars, there is a strong negative relationship between fuel 71 | efficiency and engine size. As the question noted, this is because the 72 | subcompact car class includes both small cheap cars, and sports cars with 73 | large engines. 74 | 75 | ```{r} 76 | ggplot(mpg, aes(displ, hwy, colour = class)) + 77 | geom_point() + 78 | geom_smooth(method = "lm", se = FALSE) + 79 | labs( 80 | title = "Fuel Efficiency Mostly Varies by Car Class", 81 | subtitle = "Subcompact car fuel efficiency varies by engine size", 82 | caption = "Data from fueleconomy.gov", 83 | y = "Highway Miles per Gallon", 84 | x = "Engine Displacement" 85 | ) 86 | ``` 87 | 88 | Another way to model and visualize the relationship between fuel efficiency 89 | and engine displacement after accounting for car class is to regress 90 | fuel efficiency on car class, and plot the residuals of that regression against 91 | engine displacement. 92 | The residuals of that regression are the variation in fuel efficiency 93 | not explained by car class. 
94 | The relationship between fuel efficiency and engine displacement is attenuated 95 | after accounting for car class. 96 | 97 | ```{r} 98 | mod <- lm(hwy ~ class, data = mpg) 99 | mpg %>% 100 | add_residuals(mod) %>% 101 | ggplot(aes(x = displ, y = resid)) + 102 | geom_point() + 103 | geom_smooth(method = "lm", se = FALSE) + 104 | labs( 105 | title = "Engine size has little effect on fuel efficiency", 106 | subtitle = "After accounting for car class", 107 | caption = "Data from fueleconomy.gov", 108 | y = "Highway MPG Relative to Class Average", 109 | x = "Engine Displacement" 110 | ) 111 | ``` 112 | 113 |
    114 | 115 | ### Exercise 28.2.3 {.unnumbered .exercise data-number="28.2.3"} 116 | 117 |
    118 | Take an exploratory graphic that you've created in the last month, and add informative titles to make it easier for others to understand. 119 |
    120 | 121 |
    122 | 123 | By its very nature, this exercise is left to readers. 124 | 125 |
    126 | 127 | ## Annotations {#annotations .r4ds-section} 128 | 129 | ### Exercise 28.3.1 {.unnumbered .exercise data-number="28.3.1"} 130 | 131 |
    132 | Use `geom_text()` with infinite positions to place text at the four corners of the plot. 133 |
    134 | 135 |
    136 | 137 | I can use code similar to the example in the text. 138 | However, I need to use `vjust` and `hjust` in order for the text to appear in the plot, and these need to be different for each corner. 139 | Since `geom_text()` takes `hjust` and `vjust` as aesthetics, I can add them to the data and mappings, and use a single `geom_text()` call instead of four different `geom_text()` calls with four different data arguments and four different values of the `hjust` and `vjust` arguments. 140 | ```{r} 141 | label <- tribble( 142 | ~displ, ~hwy, ~label, ~vjust, ~hjust, 143 | Inf, Inf, "Top right", "top", "right", 144 | Inf, -Inf, "Bottom right", "bottom", "right", 145 | -Inf, Inf, "Top left", "top", "left", 146 | -Inf, -Inf, "Bottom left", "bottom", "left" 147 | ) 148 | 149 | ggplot(mpg, aes(displ, hwy)) + 150 | geom_point() + 151 | geom_text(aes(label = label, vjust = vjust, hjust = hjust), data = label) 152 | ``` 153 | 154 |
    155 | 156 | ### Exercise 28.3.2 {.unnumbered .exercise data-number="28.3.2"} 157 | 158 |
    159 | Read the documentation for `annotate()`. How can you use it to add a text label to a plot without having to create a tibble? 160 |
    161 | 162 |
    163 | 164 | With `annotate()`, you pass what would be aesthetic mappings directly as arguments: 165 | ```{r} 166 | ggplot(mpg, aes(displ, hwy)) + 167 | geom_point() + 168 | annotate("text", 169 | x = Inf, y = Inf, 170 | label = "Increasing engine size is \nrelated to decreasing fuel economy.", vjust = "top", hjust = "right" 171 | ) 172 | ``` 173 | 174 |
    175 | 176 | ### Exercise 28.3.3 {.unnumbered .exercise data-number="28.3.3"} 177 | 178 |
    179 | How do labels with `geom_text()` interact with faceting? 180 | How can you add a label to a single facet? 181 | How can you put a different label in each facet? 182 | (Hint: think about the underlying data.) 183 |
    184 | 185 |
    186 | 187 | If the label data frame does not contain the faceting variable, the text is drawn in all facets. 188 | ```{r} 189 | label <- tibble( 190 | displ = Inf, 191 | hwy = Inf, 192 | label = "Increasing engine size is \nrelated to decreasing fuel economy." 193 | ) 194 | 195 | ggplot(mpg, aes(displ, hwy)) + 196 | geom_point() + 197 | geom_text(aes(label = label), 198 | data = label, vjust = "top", hjust = "right", 199 | size = 2 200 | ) + 201 | facet_wrap(~class) 202 | ``` 203 | 204 | To draw the label in only one facet, add a column to the label data frame with the value of the faceting variable(s) in which to draw it. 205 | ```{r} 206 | label <- tibble( 207 | displ = Inf, 208 | hwy = Inf, 209 | class = "2seater", 210 | label = "Increasing engine size is \nrelated to decreasing fuel economy." 211 | ) 212 | 213 | ggplot(mpg, aes(displ, hwy)) + 214 | geom_point() + 215 | geom_text(aes(label = label), 216 | data = label, vjust = "top", hjust = "right", 217 | size = 2 218 | ) + 219 | facet_wrap(~class) 220 | ``` 221 | 222 | To draw a different label in each facet, include one row per value of the faceting variable(s) in the label data frame: 223 | ```{r} 224 | label <- tibble( 225 | displ = Inf, 226 | hwy = Inf, 227 | class = unique(mpg$class), 228 | label = str_c("Label for ", class) 229 | ) 230 | 231 | ggplot(mpg, aes(displ, hwy)) + 232 | geom_point() + 233 | geom_text(aes(label = label), 234 | data = label, vjust = "top", hjust = "right", 235 | size = 3 236 | ) + 237 | facet_wrap(~class) 238 | ``` 239 | 240 |
    241 | 242 | ### Exercise 28.3.4 {.unnumbered .exercise data-number="28.3.4"} 243 | 244 |
    245 | What arguments to `geom_label()` control the appearance of the background box? 246 |
    247 | 248 |
    249 | 250 | - `label.padding`: padding around label 251 | - `label.r`: amount of rounding in the corners 252 | - `label.size`: size of label border 253 | 254 |
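As a rough illustration of these arguments, the sketch below draws one label with extra padding and square corners. The label text, its position, and the data frame name `label_df` are my own choices for the example.

```r
library(ggplot2)

# One label placed by hand; the position and text are arbitrary.
label_df <- data.frame(
  displ = 5,
  hwy = 40,
  label = "Square corners,\nextra padding"
)

p <- ggplot(mpg, aes(displ, hwy)) +
  geom_point() +
  geom_label(
    aes(label = label),
    data = label_df,
    label.padding = unit(0.75, "lines"), # more space between text and border
    label.r = unit(0, "lines") # radius 0 gives square corners
  )
p
```

`label.size` is passed the same way (for example `label.size = 1`), though its handling may differ across ggplot2 versions, so I left it out of the sketch.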
    255 | 256 | ### Exercise 28.3.5 {.unnumbered .exercise data-number="28.3.5"} 257 | 258 |
    259 | What are the four arguments to `arrow()`? How do they work? 260 | Create a series of plots that demonstrate the most important options. 261 |
    262 | 263 |
    264 | 265 | The four arguments are (from the help for `arrow()`): 266 | 267 | - `angle`: angle of the arrow head in degrees (smaller values give a narrower head) 268 | - `length`: length of the arrow head, given as a `unit()` 269 | - `ends`: which ends of the line get an arrow head (`"last"`, `"first"`, or `"both"`) 270 | - `type`: `"open"` or `"closed"`: whether the arrow head is an open or closed triangle 271 | 272 |
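The exercise also asks for a series of plots. One possible sketch, assuming a helper I am calling `arrow_demo()` (not a ggplot2 function) and an arbitrary segment position, varies one argument at a time:

```r
library(ggplot2)

base_plot <- ggplot(mpg, aes(displ, hwy)) +
  geom_point(colour = "grey70")

# Hypothetical helper: draw the same segment with a given arrow specification.
arrow_demo <- function(arr, title) {
  base_plot +
    annotate("segment", x = 6, xend = 6.9, y = 40, yend = 27, arrow = arr) +
    ggtitle(title)
}

plots <- list(
  arrow_demo(arrow(), "Defaults"),
  arrow_demo(arrow(angle = 60), "angle = 60: wider head"),
  arrow_demo(arrow(length = unit(0.5, "inches")), "length = 0.5 inches: longer head"),
  arrow_demo(arrow(ends = "both"), "ends = \"both\": head at each end"),
  arrow_demo(arrow(type = "closed"), "type = \"closed\": filled triangle")
)

for (pl in plots) print(pl)
```

Changing a single argument per plot makes it easier to see what each one does relative to the defaults.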
    273 | 274 | ## Scales {#scales .r4ds-section} 275 | 276 | ### Exercise 28.4.1 {.unnumbered .exercise data-number="28.4.1"} 277 | 278 |
    279 | Why doesn’t the following code override the default scale? 280 |
    281 | 282 |
    283 | 284 | ```{r} 285 | df <- tibble( 286 | x = rnorm(10000), 287 | y = rnorm(10000) 288 | ) 289 | ggplot(df, aes(x, y)) + 290 | geom_hex() + 291 | scale_colour_gradient(low = "white", high = "red") + 292 | coord_fixed() 293 | ``` 294 | 295 | It does not override the default scale because the colors in `geom_hex()` are set by the `fill` aesthetic, not the `color` aesthetic. 296 | 297 | ```{r} 298 | ggplot(df, aes(x, y)) + 299 | geom_hex() + 300 | scale_fill_gradient(low = "white", high = "red") + 301 | coord_fixed() 302 | ``` 303 | 304 |
    305 | 306 | ### Exercise 28.4.2 {.unnumbered .exercise data-number="28.4.2"} 307 | 308 |
    309 | The first argument to every scale is the label for the scale. 310 | It is equivalent to using the `labs` function. 311 |
    312 | 313 |
    314 | 315 | ```{r} 316 | ggplot(mpg, aes(displ, hwy)) + 317 | geom_point(aes(colour = class)) + 318 | geom_smooth(se = FALSE) + 319 | labs( 320 | x = "Engine displacement (L)", 321 | y = "Highway fuel economy (mpg)", 322 | colour = "Car type" 323 | ) 324 | ``` 325 | 326 | ```{r} 327 | ggplot(mpg, aes(displ, hwy)) + 328 | geom_point(aes(colour = class)) + 329 | geom_smooth(se = FALSE) + 330 | scale_x_continuous("Engine displacement (L)") + 331 | scale_y_continuous("Highway fuel economy (mpg)") + 332 | scale_colour_discrete("Car type") 333 | ``` 334 | 335 |
    336 | 337 | ### Exercise 28.4.3 {.unnumbered .exercise data-number="28.4.3"} 338 | 339 |
    340 | Change the display of the presidential terms by: 341 | 342 | 1. Combining the two variants shown above. 343 | 1. Improving the display of the y axis. 344 | 1. Labeling each term with the name of the president. 345 | 1. Adding informative plot labels. 346 | 1. Placing breaks every 4 years (this is trickier than it seems!). 347 | 348 |
    349 | 350 |
    351 | 352 | ```{r} 353 | fouryears <- lubridate::make_date(seq(year(min(presidential$start)), 354 | year(max(presidential$end)), 355 | by = 4 356 | ), 1, 1) 357 | 358 | presidential %>% 359 | mutate( 360 | id = 33 + row_number(), 361 | name_id = fct_inorder(str_c(name, " (", id, ")")) 362 | ) %>% 363 | ggplot(aes(start, name_id, colour = party)) + 364 | geom_point() + 365 | geom_segment(aes(xend = end, yend = name_id)) + 366 | scale_colour_manual("Party", values = c(Republican = "red", Democratic = "blue")) + 367 | scale_y_discrete(NULL) + 368 | scale_x_date(NULL, 369 | breaks = presidential$start, date_labels = "'%y", 370 | minor_breaks = fouryears 371 | ) + 372 | ggtitle("Terms of US Presidents", 373 | subtitle = "Eisenhower (34th) to Obama (44th)" 374 | ) + 375 | theme( 376 | panel.grid.minor = element_blank(), 377 | axis.ticks.y = element_blank() 378 | ) 379 | ``` 380 | 381 | To include both the start dates of presidential terms and every 382 | four years, I use different levels of emphasis. 383 | The presidential term start years are used as major breaks with thicker lines and x-axis labels. 384 | Lines for every four years are indicated with minor breaks that use thinner lines to distinguish them from presidential term start years and to avoid cluttering the plot. 385 | 386 |
    387 | 388 | ### Exercise 28.4.4 {.unnumbered .exercise data-number="28.4.4"} 389 | 390 |
    391 | Use `override.aes` to make the legend on the following plot easier to see. 392 |
    393 | 394 |
    395 | 396 | ```{r} 397 | ggplot(diamonds, aes(carat, price)) + 398 | geom_point(aes(colour = cut), alpha = 1 / 20) 399 | ``` 400 | 401 | The problem with the legend is that the `alpha` value makes the colors hard to see. So I'll override the alpha value to make the points solid in the legend. 402 | ```{r} 403 | ggplot(diamonds, aes(carat, price)) + 404 | geom_point(aes(colour = cut), alpha = 1 / 20) + 405 | theme(legend.position = "bottom") + 406 | guides(colour = guide_legend(nrow = 1, override.aes = list(alpha = 1))) 407 | ``` 408 | 409 |
    410 | 411 | ## Zooming {#zooming .r4ds-section} 412 | 413 | `r no_exercises()` 414 | 415 | ## Themes {#themes .r4ds-section} 416 | 417 | `r no_exercises()` 418 | 419 | ## Saving your plots {#saving-your-plots .r4ds-section} 420 | 421 | `r no_exercises()` 422 | 423 | ## Learning more {#learning-more-4 .r4ds-section} 424 | 425 | `r no_exercises()` 426 | -------------------------------------------------------------------------------- /img/cover.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jrnold/r4ds-exercise-solutions/e5605c5d032bcab5780766e23cd230118ffb44ba/img/cover.png -------------------------------------------------------------------------------- /img/r4ds-exercise-solutions-cover.key: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jrnold/r4ds-exercise-solutions/e5605c5d032bcab5780766e23cd230118ffb44ba/img/r4ds-exercise-solutions-cover.key -------------------------------------------------------------------------------- /img/r4ds-exercise-solutions-cover.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jrnold/r4ds-exercise-solutions/e5605c5d032bcab5780766e23cd230118ffb44ba/img/r4ds-exercise-solutions-cover.png -------------------------------------------------------------------------------- /img/rmarkdown-file.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jrnold/r4ds-exercise-solutions/e5605c5d032bcab5780766e23cd230118ffb44ba/img/rmarkdown-file.png -------------------------------------------------------------------------------- /img/rmarkdown-knit-button.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jrnold/r4ds-exercise-solutions/e5605c5d032bcab5780766e23cd230118ffb44ba/img/rmarkdown-knit-button.png 
-------------------------------------------------------------------------------- /img/rmarkdown-notebook.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jrnold/r4ds-exercise-solutions/e5605c5d032bcab5780766e23cd230118ffb44ba/img/rmarkdown-notebook.png -------------------------------------------------------------------------------- /img/visualize/unnamed-chunk-29-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jrnold/r4ds-exercise-solutions/e5605c5d032bcab5780766e23cd230118ffb44ba/img/visualize/unnamed-chunk-29-1.png -------------------------------------------------------------------------------- /img/visualize/unnamed-chunk-29-2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jrnold/r4ds-exercise-solutions/e5605c5d032bcab5780766e23cd230118ffb44ba/img/visualize/unnamed-chunk-29-2.png -------------------------------------------------------------------------------- /img/visualize/unnamed-chunk-29-3.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jrnold/r4ds-exercise-solutions/e5605c5d032bcab5780766e23cd230118ffb44ba/img/visualize/unnamed-chunk-29-3.png -------------------------------------------------------------------------------- /img/visualize/unnamed-chunk-29-4.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jrnold/r4ds-exercise-solutions/e5605c5d032bcab5780766e23cd230118ffb44ba/img/visualize/unnamed-chunk-29-4.png -------------------------------------------------------------------------------- /img/visualize/unnamed-chunk-29-5.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/jrnold/r4ds-exercise-solutions/e5605c5d032bcab5780766e23cd230118ffb44ba/img/visualize/unnamed-chunk-29-5.png -------------------------------------------------------------------------------- /img/visualize/unnamed-chunk-29-6.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jrnold/r4ds-exercise-solutions/e5605c5d032bcab5780766e23cd230118ffb44ba/img/visualize/unnamed-chunk-29-6.png -------------------------------------------------------------------------------- /import.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | output: html_document 3 | editor_options: 4 | chunk_output_type: console 5 | --- 6 | # Data import {#data-import .r4ds-section} 7 | 8 | ## Introduction {#introduction-5 .r4ds-section} 9 | 10 | ```{r results='hide',message=FALSE,cache=FALSE} 11 | library("tidyverse") 12 | ``` 13 | 14 | ## Getting started {#getting-started .r4ds-section} 15 | 16 | ### Exercise 11.2.1 {.unnumbered .exercise data-number="11.2.1"} 17 | 18 |
    19 | What function would you use to read a file where fields were separated with “|”? 20 |
    21 | 22 |
    23 | 24 | Use the `read_delim()` function with the argument `delim="|"`. 25 | ```{r eval=FALSE} 26 | read_delim(file, delim = "|") 27 | ``` 28 | 29 |
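For a version that runs without a file on disk, here is a small sketch that passes literal text instead. The contents of `pipe_text` and the column names are invented for the example.

```r
library(readr)

# A string containing a newline is treated by readr as literal data.
pipe_text <- "name|score\nalice|12\nbob|7\n"

# col_types = "cd" reads a character column followed by a double column.
scores <- read_delim(pipe_text, delim = "|", col_types = "cd")
scores
```

The same call works on a real file by replacing `pipe_text` with a path.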
    30 | 31 | ### Exercise 11.2.2 {.unnumbered .exercise data-number="11.2.2"} 32 | 33 |
    34 | Apart from `file`, `skip`, and `comment`, what other arguments do `read_csv()` and `read_tsv()` have in common? 35 |
    36 | 37 |
    38 | 39 | They have the following arguments in common: 40 | ```{r} 41 | intersect(names(formals(read_csv)), names(formals(read_tsv))) 42 | ``` 43 | 44 | - `col_names` and `col_types` are used to specify the column names and how to parse the columns 45 | - `locale` is important for determining things like the encoding and whether "." or "," is used as a decimal mark. 46 | - `na` and `quoted_na` control which strings are treated as missing values when parsing vectors 47 | - `trim_ws` trims whitespace before and after cells before parsing 48 | - `n_max` sets how many rows to read 49 | - `guess_max` sets how many rows to use when guessing the column type 50 | - `progress` determines whether a progress bar is shown. 51 | 52 | In fact, the two functions have the exact same arguments: 53 | ```{r} 54 | identical(names(formals(read_csv)), names(formals(read_tsv))) 55 | ``` 56 | 57 |
    58 | 59 | ### Exercise 11.2.3 {.unnumbered .exercise data-number="11.2.3"} 60 | 61 |
    62 | What are the most important arguments to `read_fwf()`? 63 |
    64 | 65 |
    66 | 67 | The most important argument to `read_fwf()`, which reads "fixed-width formats", is `col_positions`, which tells the function where data columns begin and end. 68 | 69 |
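A minimal sketch of `col_positions` in use, assuming a made-up two-column layout (a 6-character name field followed by a 2-character number field) and the helper `fwf_widths()` to describe it:

```r
library(readr)

# Each row is exactly 8 characters: 6 for the name, 2 for the number.
fwf_text <- "alice 12\nbob    7\n"

people <- read_fwf(
  fwf_text,
  col_positions = fwf_widths(c(6, 2), col_names = c("name", "n")),
  col_types = "cd" # character, then double
)
people
```

`fwf_positions()` is an alternative when it is easier to give start and end positions than widths.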
    70 | 71 | ### Exercise 11.2.4 {.unnumbered .exercise data-number="11.2.4"} 72 | 73 |
    74 | Sometimes strings in a CSV file contain commas. 75 | To prevent them from causing problems they need to be surrounded by a quoting character, like `"` or `'`. 76 | By convention, `read_csv()` assumes that the quoting character will be `"`, and if you want to change it you’ll need to use `read_delim()` instead. 77 | What arguments do you need to specify to read the following text into a data frame? 78 | 79 | ``` 80 | "x,y\n1,'a,b'" 81 | ``` 82 | 83 |
    84 | 85 |
    86 | 87 | For `read_delim()`, we need to specify a delimiter, in this case `","`, and a quote argument. 88 | ```{r} 89 | x <- "x,y\n1,'a,b'" 90 | read_delim(x, ",", quote = "'") 91 | ``` 92 | 93 | However, this question is out of date. `read_csv()` now supports a quote argument, so the following code works. 94 | ```{r} 95 | read_csv(x, quote = "'") 96 | ``` 97 | 98 |
    99 | 100 | ### Exercise 11.2.5 {.unnumbered .exercise data-number="11.2.5"} 101 | 102 |
    103 | Identify what is wrong with each of the following inline CSV files. 104 | What happens when you run the code? 105 |
    106 | 107 |
    108 | 109 | ```{r} 110 | read_csv("a,b\n1,2,3\n4,5,6") 111 | ``` 112 | 113 | Only two columns are specified in the header "a" and "b", but the rows have three columns, so the last column is dropped. 114 | 115 | ```{r} 116 | read_csv("a,b,c\n1,2\n1,2,3,4") 117 | ``` 118 | 119 | The numbers of columns in the data do not match the number of columns in the header (three). 120 | In row one, there are only two values, so column `c` is set to missing. 121 | In row two, there is an extra value, and that value is dropped. 122 | 123 | ```{r} 124 | read_csv("a,b\n\"1") 125 | ``` 126 | It's not clear what the intent was here. 127 | The opening quote `"1` is dropped because it is not closed, and `a` is treated as an integer. 128 | 129 | ```{r} 130 | read_csv("a,b\n1,2\na,b") 131 | ``` 132 | Both "a" and "b" are treated as character vectors since they contain non-numeric strings. 133 | This may have been intentional, or the author may have intended the values of the columns to be "1,2" and "a,b". 134 | 135 | ```{r} 136 | read_csv("a;b\n1;3") 137 | ``` 138 | 139 | The values are separated by ";" rather than ",". Use `read_csv2()` instead: 140 | ```{r} 141 | read_csv2("a;b\n1;3") 142 | ``` 143 | 144 |
    145 | 146 | ## Parsing a vector {#parsing-a-vector .r4ds-section} 147 | 148 | ### Exercise 11.3.1 {.unnumbered .exercise data-number="11.3.1"} 149 | 150 |
    151 | What are the most important arguments to `locale()`? 152 |
    153 | 154 |
    155 | 156 | The locale object has arguments to set the following: 157 | 158 | - date and time formats: `date_names`, `date_format`, and `time_format` 159 | - time zone: `tz` 160 | - numbers: `decimal_mark`, `grouping_mark` 161 | - encoding: `encoding` 162 | 163 |
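To illustrate the number-related arguments, here is a short sketch that parses values written with a comma as the decimal mark and a period as the grouping mark. The object names `eu_locale`, `x_num`, and `x_dbl` are hypothetical and chosen only for this example.

```{r}
library("readr")

# Hypothetical locale for files that use "," for decimals and "." for grouping
eu_locale <- locale(decimal_mark = ",", grouping_mark = ".")

# parse_number() ignores the grouping mark; parse_double() is stricter
x_num <- parse_number("1.234,56", locale = eu_locale)
x_dbl <- parse_double("1,5", locale = eu_locale)
c(x_num, x_dbl)
```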
    164 | 165 | ### Exercise 11.3.2 {.unnumbered .exercise data-number="11.3.2"} 166 | 167 |
    168 | What happens if you try and set `decimal_mark` and `grouping_mark` to the same character? 169 | What happens to the default value of `grouping_mark` when you set `decimal_mark` to `","`? 170 | What happens to the default value of `decimal_mark` when you set the `grouping_mark` to `"."`? 171 |
    172 | 173 |
    174 | 175 | If the decimal and grouping marks are set to the same character, `locale()` throws an error: 176 | ```{r error=TRUE} 177 | locale(decimal_mark = ".", grouping_mark = ".") 178 | ``` 179 | 180 | If the `decimal_mark` is set to the comma `","`, then the grouping mark is set to the period `"."`: 181 | ```{r} 182 | locale(decimal_mark = ",") 183 | ``` 184 | 185 | If the grouping mark is set to a period, then the decimal mark is set to a comma: 186 | ```{r} 187 | locale(grouping_mark = ".") 188 | ``` 189 | 190 |
    191 | 192 | ### Exercise 11.3.3 {.unnumbered .exercise data-number="11.3.3"} 193 | 194 |
    195 | I didn’t discuss the `date_format` and `time_format` options to `locale()`. 196 | What do they do? 197 | Construct an example that shows when they might be useful. 198 |
    199 | 200 |
    201 | 202 | They provide default date and time formats. 203 | The [readr vignette](https://cran.r-project.org/web/packages/readr/vignettes/locales.html) discusses using these to parse dates, since dates can include language-specific weekday and month names and different conventions for specifying AM/PM. 204 | ```{r} 205 | locale() 206 | ``` 207 | 208 | Examples from the readr vignette of parsing French dates: 209 | ```{r} 210 | parse_date("1 janvier 2015", "%d %B %Y", locale = locale("fr")) 211 | parse_date("14 oct. 1979", "%d %b %Y", locale = locale("fr")) 212 | ``` 213 | 214 | Both the date format and time format are used for guessing column types. 215 | Thus if you were often parsing data that had non-standard formats for the date and time, you could specify custom values for `date_format` and `time_format`. 216 | ```{r} 217 | locale_custom <- locale(date_format = "Day %d Mon %m Year %y", 218 | time_format = "Sec %S Min %M Hour %H") 219 | date_custom <- c("Day 01 Mon 02 Year 03", "Day 03 Mon 01 Year 01") 220 | parse_date(date_custom) 221 | parse_date(date_custom, locale = locale_custom) 222 | time_custom <- c("Sec 01 Min 02 Hour 03", "Sec 03 Min 02 Hour 01") 223 | parse_time(time_custom) 224 | parse_time(time_custom, locale = locale_custom) 225 | ``` 226 | 227 |
    228 | 229 | ### Exercise 11.3.4 {.unnumbered .exercise data-number="11.3.4"} 230 | 231 |
    232 | If you live outside the US, create a new locale object that encapsulates the settings for the types of file you read most commonly. 233 |
    234 | 235 |
    236 | 237 | Read the help page for `locale()` using `?locale` to learn about the different variables that can be set. 238 | 239 | As an example, consider Australia. 240 | Most of the default values are valid, except that the date format is "(d)d/mm/yyyy", meaning that January 2, 2006 is written as `02/01/2006`. 241 | 242 | However, the default locale does not parse that date correctly. 243 | 244 | ```{r} 245 | parse_date("02/01/2006") 246 | ``` 247 | 248 | To correctly parse Australian dates, define a new `locale` object. 249 | 250 | ```{r} 251 | au_locale <- locale(date_format = "%d/%m/%Y") 252 | ``` 253 | 254 | Using `parse_date()` with the `au_locale` as its locale will correctly parse our example date. 255 | 256 | ```{r} 257 | parse_date("02/01/2006", locale = au_locale) 258 | ``` 259 | 260 |
    261 | 262 | ### Exercise 11.3.5 {.unnumbered .exercise data-number="11.3.5"} 263 | 264 |
    265 | What’s the difference between `read_csv()` and `read_csv2()`? 266 |
    267 | 268 |
    269 | 270 | The delimiter. The function `read_csv()` uses a comma, while `read_csv2()` uses a semi-colon (`;`). Using a semi-colon is useful when commas are used as the decimal point (as in Europe). 271 | 272 |
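The difference shows up most clearly with decimal commas. The sketch below, which uses an invented two-column file written to a temporary path, reads semicolon-delimited data whose numbers use a comma as the decimal mark.

```{r}
library("readr")

# Hypothetical file: ";" separates fields, "," is the decimal mark
csv2_file <- tempfile(fileext = ".csv")
writeLines(c("a;b", "1,5;2,25"), csv2_file)

# read_csv2() splits on ";" and treats "," as the decimal mark
df2 <- read_csv2(csv2_file)
df2
```

Reading the same file with `read_csv()` would instead split each row at the commas, which is not what the file intends.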
    273 | 274 | ### Exercise 11.3.6 {.unnumbered .exercise data-number="11.3.6"} 275 | 276 |
    277 | What are the most common encodings used in Europe? 278 | What are the most common encodings used in Asia? 279 | Do some googling to find out. 280 |
    281 | 282 |
    283 | 284 | UTF-8 is standard now, and ASCII has been around forever. 285 | 286 | For the European languages, there are separate encodings for Romance languages and Eastern European languages using Latin script, Cyrillic, Greek, Hebrew, Turkish: usually with separate ISO and Windows encoding standards. 287 | There is also Mac OS Roman. 288 | 289 | For Asian languages, Arabic and Vietnamese have ISO and Windows standards. The other major Asian scripts have their own: 290 | 291 | - Japanese: JIS X 0208, Shift JIS, ISO-2022-JP 292 | - Chinese: GB 2312, GBK, GB 18030 293 | - Korean: KS X 1001, EUC-KR, ISO-2022-KR 294 | 295 | The documentation for `stringi::stri_enc_detect()` provides a good list of encodings, since it supports the most common ones. 296 | 297 | - Western European Latin script languages: ISO-8859-1, Windows-1252 (also CP-1252, for "code page") 298 | - Eastern European Latin script languages: ISO-8859-2, Windows-1250 299 | - Greek: ISO-8859-7 300 | - Turkish: ISO-8859-9, Windows-1254 301 | - Hebrew: ISO-8859-8, IBM424, Windows-1255 302 | - Russian: Windows-1251 303 | - Japanese: Shift JIS, ISO-2022-JP, EUC-JP 304 | - Korean: ISO-2022-KR, EUC-KR 305 | - Chinese: GB18030, ISO-2022-CN (Simplified), Big5 (Traditional) 306 | - Arabic: ISO-8859-6, IBM420, Windows-1256 307 | 308 | For more information on character encodings see the following sources. 309 | 310 | - The Wikipedia page [Character encoding](https://en.wikipedia.org/wiki/Character_encoding) has a good list of encodings. 311 | - Unicode [CLDR](http://cldr.unicode.org/) project 312 | - [What is the most common encoding of each language](https://stackoverflow.com/questions/8509339/what-is-the-most-common-encoding-of-each-language) (Stack Overflow) 313 | - "What Every Programmer Absolutely, Positively Needs To Know About Encodings And Character Sets To Work With Text"
314 | 315 | Programs that identify the encoding of text include: 316 | 317 | - `readr::guess_encoding()` 318 | - `stringi::stri_enc_detect()` 319 | - [iconv](https://en.wikipedia.org/wiki/Iconv) 320 | - [chardet](https://github.com/chardet/chardet) (Python) 321 | 322 |
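As a brief illustration of the first tool in the list, the following sketch guesses the encoding of a Latin-1 string and then parses it with an explicit encoding. The string is the example used in the corresponding chapter of *R for Data Science*; the object names `x1`, `enc_guess`, and `x1_fixed` are only for this example, and the guesses returned depend on the input and are not guaranteed to be correct.

```{r}
library("readr")

# A string whose bytes are encoded as Latin-1 (\xf1 is "n" with a tilde)
x1 <- "El Ni\xf1o was particularly bad this year"

# guess_encoding() works on raw bytes (or a file path) and returns candidates
enc_guess <- guess_encoding(charToRaw(x1))
enc_guess

# Once an encoding is chosen, pass it through the locale
x1_fixed <- parse_character(x1, locale = locale(encoding = "Latin1"))
x1_fixed
```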
    323 | 324 | ### Exercise 11.3.7 {.unnumbered .exercise data-number="11.3.7"} 325 | 326 |
    327 | Generate the correct format string to parse each of the following dates and times: 328 |
    329 | 330 |
    331 | 332 | ```{r} 333 | d1 <- "January 1, 2010" 334 | d2 <- "2015-Mar-07" 335 | d3 <- "06-Jun-2017" 336 | d4 <- c("August 19 (2015)", "July 1 (2015)") 337 | d5 <- "12/30/14" # Dec 30, 2014 338 | t1 <- "1705" 339 | t2 <- "11:15:10.12 PM" 340 | ``` 341 | 342 | The correct formats are: 343 | ```{r} 344 | parse_date(d1, "%B %d, %Y") 345 | parse_date(d2, "%Y-%b-%d") 346 | parse_date(d3, "%d-%b-%Y") 347 | parse_date(d4, "%B %d (%Y)") 348 | parse_date(d5, "%m/%d/%y") 349 | parse_time(t1, "%H%M") 350 | ``` 351 | The time `t2` includes fractional seconds, so it uses `%OS`, and it uses `%I` for the 12-hour clock together with `%p`: 352 | ```{r} 353 | parse_time(t2, "%I:%M:%OS %p") 354 | ``` 355 | 356 |
    357 | 358 | ## Parsing a file {#parsing-a-file .r4ds-section} 359 | 360 | `r no_exercises()` 361 | 362 | ## Writing to a file {#writing-to-a-file .r4ds-section} 363 | 364 | `r no_exercises()` 365 | 366 | ## Other types of data {#other-types-of-data .r4ds-section} 367 | 368 | `r no_exercises()` 369 | -------------------------------------------------------------------------------- /includes/hypothesis.html: -------------------------------------------------------------------------------- 1 | 7 | 8 | 9 | 10 | -------------------------------------------------------------------------------- /includes/ort.css: -------------------------------------------------------------------------------- 1 | /* 2 | License: MIT 3 | Copyright (c) 2016 Open Review Toolkit & Ben Marwick 4 | 5 | Source: https://github.com/benmarwick/bookdown-ort/blob/a240344b05c75dc6f40012d5cc8bfab31e4dc7b3/style.css. With additional editing. 6 | */ 7 | 8 | .h-icon-chevron-left { 9 | background: white; 10 | padding: 3px; 11 | border: #eee 1px solid; 12 | color: #666; 13 | } 14 | 15 | .fa-rotate-315 { 16 | -webkit-transform: rotate(315deg); 17 | -moz-transform: rotate(315deg); 18 | -ms-transform: rotate(315deg); 19 | -o-transform: rotate(315deg); 20 | transform: rotate(315deg); 21 | } 22 | 23 | .rmdreview { 24 | padding: 1em 1em 1em 5em; 25 | margin-bottom: 0px; 26 | background: #f5f5f5 5px center/3em no-repeat; 27 | position:relative; 28 | } 29 | 30 | .rmdreview:before { 31 | content: "\f0e6"; 32 | font-family: FontAwesome; 33 | left:10px; 34 | position:absolute; 35 | top:10px; 36 | bottom: 0px; 37 | font-size: 60px; 38 | } 39 | -------------------------------------------------------------------------------- /includes/r4ds-solutions.css: -------------------------------------------------------------------------------- 1 | .hints-icon { 2 | display: table-cell; 3 | padding-right: 15px; 4 | padding-left: 5px; 5 | } 6 | 7 | .hints-container { 8 | display: table-cell; 9 | } 10 | 11 | .noexercises, .question 
{ 12 | margin: 20px 0; 13 | padding: 15px 30px 15px 15px; 14 | } 15 | 16 | .question { 17 | border-left: 5px solid #eee; 18 | } 19 | 20 | .question blockquote:first-child { 21 | margin-top: 0 22 | } 23 | 24 | .question blockquote:last-child { 25 | margin-bottom: 0 26 | } 27 | 28 | /* Anchors and external links for sections */ 29 | /* Showing anchors for sections idea comes from the CSS GitHub uses to render READMEs 30 | See https://github.com/sindresorhus/github-markdown-css */ 31 | .section-link { 32 | font-size: smaller; 33 | vertical-align: middle; 34 | /* yeah, imporant is bad, but I can't figure out how to override this one otherwise */ 35 | color: #a9a9a9 !important; 36 | } 37 | 38 | .anchor { 39 | float: left; 40 | padding-right: 4px; 41 | margin-left: -20px; 42 | } 43 | 44 | .r4ds-section-link { 45 | padding-left: 4px; 46 | } 47 | 48 | h1 a.section-link, 49 | h2 a.section-link, 50 | h3 a.section-link, 51 | h4 a.section-link, 52 | h5 a.section-link, 53 | h6 a.section-link { 54 | visibility: hidden; 55 | } 56 | 57 | h1:hover a.section-link, 58 | h2:hover a.section-link, 59 | h3:hover a.section-link, 60 | h4:hover a.section-link, 61 | h5:hover a.section-link, 62 | h6:hover a.section-link { 63 | visibility: visible; 64 | } 65 | 66 | /* space for the anchor links icons on the left */ 67 | .book .book-body .page-wrapper .page-inner section { 68 | margin-left: 20px; 69 | } 70 | -------------------------------------------------------------------------------- /includes/r4ds.css: -------------------------------------------------------------------------------- 1 | .book .book-header { 2 | opacity: 1; 3 | text-align: left; 4 | } 5 | #header .title { 6 | margin-bottom: 0; 7 | } 8 | #header .author { 9 | margin: 0; 10 | color: #666; 11 | } 12 | #header .author em { 13 | font-style: normal; 14 | } 15 | -------------------------------------------------------------------------------- /index.Rmd: 
-------------------------------------------------------------------------------- 1 | --- 2 | knit: "bookdown::render_book" 3 | title: "R for Data Science: Exercise Solutions" 4 | date: >- 5 | `r format(Sys.Date(), "%B %d, %Y")` 6 | author: ["Jeffrey B. Arnold"] 7 | description: > 8 | Solutions to the exercises in 9 | "R for Data Science" by Garrett Grolemund and Hadley Wickham. 10 | site: bookdown::bookdown_site 11 | github-repo: "jrnold/r4ds-exercise-solutions" 12 | url: 'http\://jrnold.github.io/r4ds-exercise-solutions' 13 | twitter-handle: jrnld 14 | documentclass: book 15 | bibliography: ["r4ds.bib"] 16 | link-citations: true 17 | biblio-style: apalike 18 | cover-image: /img/r4ds-exercise-solutions-cover.png 19 | --- 20 | 21 | ```{r include=FALSE,cache=FALSE,purl=FALSE} 22 | # don't cache anything on this page 23 | knitr::opts_chunk$set(cache = FALSE) 24 | ``` 25 | 26 | # Welcome {-} 27 | 28 | Cover image 29 | 30 | This book contains the **exercise solutions** for the book [*R for Data Science*](https://amzn.to/2aHLAQ1), by Hadley Wickham and Garrett Grolemund [@WickhamGrolemund2017]. 31 | 32 | *R for Data Science* itself is available online at [r4ds.had.co.nz](https://r4ds.had.co.nz/), and a physical copy is published by O'Reilly Media and available from [Amazon](https://amzn.to/2aHLAQ1). 
33 | 34 | ## Acknowledgments {-} 35 | 36 | ```{r include=FALSE,purl=FALSE,cache=FALSE} 37 | library("magrittr") 38 | # adapted from usethis:::github_repo_spec 39 | github_repo_spec <- function(path = here::here()) { 40 | stringr::str_c(gh::gh_tree_remote(path), collapse = "/") 41 | } 42 | 43 | # copied from usethis:::parse_repo_spec 44 | parse_repo_spec <- function(repo_spec) { 45 | repo_split <- stringr::str_split(repo_spec, "/")[[1]] 46 | if (length(repo_split) != 2) { 47 | stop("`repo_spec` must be of the form 'owner/repo'") 48 | } 49 | list(owner = repo_split[[1]], repo = repo_split[[2]]) 50 | } 51 | 52 | # copied from usethis:::spec_owner 53 | spec_owner <- function(repo_spec) { 54 | parse_repo_spec(repo_spec)$owner 55 | } 56 | 57 | # copied from usethis:::spec_repo 58 | spec_repo <- function(repo_spec) { 59 | parse_repo_spec(repo_spec)$repo 60 | } 61 | 62 | # Need to use the github API because this info isn't included in the 63 | # commits for GitHub pulls: Github 64 | 65 | # adapted from usethis:::use_tidy_thanks 66 | github_contribs <- function(repo_spec = github_repo_spec(), 67 | excluded = NULL) { 68 | if (is.null(excluded)) { 69 | excluded <- spec_owner(repo_spec) 70 | } 71 | res <- gh::gh("/repos/:owner/:repo/issues", 72 | owner = spec_owner(repo_spec), 73 | repo = spec_repo(repo_spec), state = "all", 74 | filter = "all", .limit = Inf 75 | ) 76 | if (identical(res[[1]], "")) { 77 | message("No matching issues/PRs found.") 78 | return(invisible()) 79 | } 80 | contributors <- purrr::map_chr(res, c("user", "login")) %>% 81 | purrr::discard(~.x %in% excluded) %>% 82 | unique() %>% 83 | sort() 84 | glue::glue("[\\@{contributors}](https://github.com/{contributors})") %>% 85 | glue::glue_collapse(sep = ", ", width = Inf, last = ", and ") 86 | } 87 | 88 | hypothesis_contribs <- function() { 89 | hypothesis_user_url <- function(x) { 90 | username <- stringr::str_match(x, "acct:(.*)@")[1, 2] 91 | url <- stringr::str_c("https://hypothes.is/users/", username) 
92 | stringr::str_c("[\\@", username, "](", url, ")") 93 | } 94 | 95 | hypothesis_url <- "https://hypothes.is/api/search" 96 | url_pattern <- "https://jrnold.github.io/r4ds-exercise-solutions/*" 97 | annotations <- httr::GET(hypothesis_url, 98 | query = list(wildcard_uri = url_pattern)) %>% 99 | httr::content() 100 | 101 | annotations %>% 102 | purrr::pluck("rows") %>% 103 | purrr::keep(~ !.x$flagged) %>% 104 | purrr::map_chr("user") %>% 105 | unique() %>% 106 | purrr::discard(~ .x == "acct:jrnold@hypothes.is") %>% 107 | purrr::map_chr(hypothesis_user_url) %>% 108 | sort() %>% 109 | glue::glue_collapse(sep = ", ", width = Inf, last = ", and ") 110 | } 111 | ``` 112 | 113 | These solutions have benefited from many contributors. 114 | A special thanks to: 115 | 116 | - Garrett Grolemund and Hadley Wickham for writing the truly fantastic *R for Data Science*, without whom these solutions would not exist---literally. 117 | - [\@dongzhuoer](https://github.com/dongzhuoer) and [\@cfgauss](https://hypothes.is/users/cfgauss) for careful readings of the book and noticing numerous issues and proposing fixes. 118 | 119 | Thank you to all of those who contributed issues or pull-requests on 120 | [GitHub](https://github.com/jrnold/r4ds-exercise-solutions/graphs/contributors) 121 | (in alphabetical order): `r github_contribs()` 122 | Thank you to all of you who contributed annotations on [hypothes.is](https://hypothes.is/search?q=url%3Ajrnold.github.io%2Fr4ds-exercise-solutions%2F*) (in alphabetical order): `r hypothesis_contribs()`. 123 | 124 | For another set of solutions for and notes on *R for Data Science* see [Yet Another 'R for Data Science' Study Guide](https://brshallo.github.io/r4ds_solutions/) by [Bryan Shalloway](https://github.com/brshallo). 125 | 126 | ## License {-} 127 | 128 | This work is licensed under a Creative Commons Attribution 4.0 International License. 
129 | -------------------------------------------------------------------------------- /intro.Rmd: -------------------------------------------------------------------------------- 1 | # Introduction {#introduction .r4ds-section} 2 | 3 | ```{r include=FALSE,cache=FALSE,purl=FALSE} 4 | # don't cache anything on this page 5 | knitr::opts_chunk$set(cache = FALSE) 6 | ``` 7 | 8 | ## How this book is organized {-} 9 | 10 | The book is divided into sections with the same numbers and titles as those in *R for Data Science*. 11 | Not all sections have exercises. 12 | Those sections without exercises have placeholder text indicating that there are no exercises. 13 | The text for each exercise is followed by the solution. 14 | 15 | Like *R for Data Science*, packages used in each chapter are loaded in a code chunk at the start of the chapter in a section titled "Prerequisites". 16 | If exercises depend on code in a section of *R for Data Science*, it is either provided before the exercises or within the exercise solution. 17 | 18 | If a package is used infrequently in solutions it may not be loaded, and functions using it will be called using the package name followed by two colons, as in `dplyr::mutate()` (see the *R for Data Science* [Introduction](https://r4ds.had.co.nz/introduction.html#running-r-code)). 19 | The double colon may also be used to be explicit about the package from which a function comes. 20 | 21 | ## Prerequisites {-} 22 | 23 | This book is a complement to, not a substitute for, *R for Data Science*. 24 | It only provides the exercise solutions for it. 25 | See the [R for Data Science](https://r4ds.had.co.nz/introduction.html#prerequisites) prerequisites. 26 | 27 | Additionally, the solutions use several packages that are not used in *R4DS*. 28 | You can install all packages required to run the code in this book with the following line of code. 
29 | ```{r eval=FALSE} 30 | devtools::install_github("jrnold/r4ds-exercise-solutions") 31 | ``` 32 | 33 | ## Bugs/Contributing {-} 34 | 35 | If you find any typos, errors in the solutions, have an alternative solution, 36 | or think the solution could be improved, I would love your contributions. 37 | The best way to contribute is through GitHub. 38 | Please open an issue at or a pull request at 39 | . 40 | 41 | ## Colophon {-} 42 | 43 | ```{r include=FALSE, purl = FALSE} 44 | r_head <- git2r::repository_head() 45 | r_branch <- r_head$name 46 | r_sha <- git2r::sha(r_head) 47 | r_sha_short <- stringr::str_sub(r_sha, 1, 7) 48 | github_full_url <- stringr::str_c(SOURCE_URL, "tree", r_sha, sep = "/") 49 | ``` 50 | 51 | HTML and PDF versions of this book are available at <`r PUB_URL`>. 52 | The book is powered by [bookdown](https://bookdown.org/home) which makes it easy to turn R markdown files into HTML, PDF, and EPUB. 53 | 54 | The source of this book is available on GitHub at <`r SOURCE_URL`>. 55 | This book was built from commit [`r r_sha_short`](`r github_full_url`). 56 | 57 | This book was built with these R packages. 58 | ```{r colonophon} 59 | devtools::session_info("r4ds.exercise.solutions") 60 | ``` 61 | -------------------------------------------------------------------------------- /many-models.Rmd: -------------------------------------------------------------------------------- 1 | # Many models {#many-models .r4ds-section} 2 | 3 | ## Introduction {#introduction-17 .r4ds-section} 4 | 5 | ```{r setup,message=FALSE,cache=FALSE} 6 | library("modelr") 7 | library("tidyverse") 8 | library("gapminder") 9 | ``` 10 | 11 | ## gapminder {#gapminder .r4ds-section} 12 | 13 | ### Exercise 25.2.1 {.unnumbered .exercise data-number="25.2.1"} 14 | 15 |
    16 | 17 | A linear trend seems to be slightly too simple for the overall trend. 18 | Can you do better with a quadratic polynomial? 19 | How can you interpret the coefficients of the quadratic? 20 | (Hint: you might want to transform `year` so that it has mean zero.) 21 | 22 |
    23 | 24 |
    25 | 26 | The following code replicates the analysis in the chapter but replaces the function `country_model()` with a regression that includes the year squared. 27 | ```{r, eval = FALSE} 28 | lifeExp ~ poly(year, 2) 29 | ``` 30 | 31 | ```{r} 32 | country_model <- function(df) { 33 | lm(lifeExp ~ poly(year - median(year), 2), data = df) 34 | } 35 | 36 | by_country <- gapminder %>% 37 | group_by(country, continent) %>% 38 | nest() 39 | 40 | by_country <- by_country %>% 41 | mutate(model = map(data, country_model)) 42 | ``` 43 | 44 | ```{r} 45 | by_country <- by_country %>% 46 | mutate( 47 | resids = map2(data, model, add_residuals) 48 | ) 49 | by_country 50 | ``` 51 | 52 | ```{r} 53 | unnest(by_country, resids) %>% 54 | ggplot(aes(year, resid)) + 55 | geom_line(aes(group = country), alpha = 1 / 3) + 56 | geom_smooth(se = FALSE) 57 | ``` 58 | 59 | ```{r} 60 | by_country %>% 61 | mutate(glance = map(model, broom::glance)) %>% 62 | unnest(glance, .drop = TRUE) %>% 63 | ggplot(aes(continent, r.squared)) + 64 | geom_jitter(width = 0.5) 65 | ``` 66 | 67 |
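The solution above does not address how to interpret the coefficients. Because `poly()` builds orthogonal polynomials by default, its coefficients do not map directly onto `year` and `year^2`. One way to obtain interpretable coefficients, sketched here as an assumption rather than as the chapter's approach, is to fit a raw quadratic in a centered year. The helper name `country_model_raw()`, the choice to center on the mean year, and the use of New Zealand as the example country are all illustrative.

```{r}
library("dplyr")
library("gapminder")

# Hypothetical helper: raw (non-orthogonal) quadratic in centered year
country_model_raw <- function(df) {
  df <- mutate(df, year_c = year - mean(year))
  lm(lifeExp ~ year_c + I(year_c^2), data = df)
}

nz_mod <- country_model_raw(filter(gapminder, country == "New Zealand"))
coef(nz_mod)
```

With this parameterization, the intercept is the predicted life expectancy at the mean year, the coefficient on `year_c` is the slope (change in life expectancy per year) at the mean year, and twice the coefficient on `I(year_c^2)` is the change in that slope per year. A negative quadratic coefficient would indicate that gains in life expectancy are slowing over time.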
    68 | 69 | ### Exercise 25.2.2 {.unnumbered .exercise data-number="25.2.2"} 70 | 71 |
    72 | Explore other methods for visualizing the distribution of $R^2$ per continent. 73 | You might want to try the ggbeeswarm package, which provides similar methods for avoiding overlaps as jitter, but uses deterministic methods. 74 | 75 |
    76 | 77 |
    78 | 79 | See exercise 7.5.1.1.6 for more on ggbeeswarm. 80 | 81 | ```{r} 82 | library("ggbeeswarm") 83 | by_country %>% 84 | mutate(glance = map(model, broom::glance)) %>% 85 | unnest(glance, .drop = TRUE) %>% 86 | ggplot(aes(continent, r.squared)) + 87 | geom_beeswarm() 88 | ``` 89 | 90 |
    91 | 92 | ### Exercise 25.2.3 {.unnumbered .exercise data-number="25.2.3"} 93 | 94 |
    95 | 96 | To create the last plot (showing the data for the countries with the worst model fits), 97 | we needed two steps: 98 | we created a data frame with one row per country 99 | and then semi-joined it to the original dataset. 100 | It’s possible to avoid this join if we use `unnest()` instead of `unnest(.drop = TRUE)`. 101 | How? 102 | 103 |
    104 | 105 |
    106 | 107 | ```{r} 108 | gapminder %>% 109 | group_by(country, continent) %>% 110 | nest() %>% 111 | mutate(model = map(data, ~lm(lifeExp ~ year, .))) %>% 112 | mutate(glance = map(model, broom::glance)) %>% 113 | unnest(glance) %>% 114 | unnest(data) %>% 115 | filter(r.squared < 0.25) %>% 116 | ggplot(aes(year, lifeExp)) + 117 | geom_line(aes(color = country)) 118 | ``` 119 | 120 |
    121 | 122 | ## List-columns {#list-columns-1 .r4ds-section} 123 | 124 | `r no_exercises()` 125 | 126 | ## Creating list-columns {#creating-list-columns .r4ds-section} 127 | 128 | ### Exercise 25.4.1 {.unnumbered .exercise data-number="25.4.1"} 129 | 130 |
    131 | 132 | List all the functions that you can think of that take an atomic vector and return a list. 133 | 134 |
    135 | 136 |
    137 | 138 | Many functions in the stringr package take a character vector as input and return a list. 139 | ```{r} 140 | str_split(sentences[1:3], " ") 141 | str_match_all(c("abc", "aa", "aabaa", "abbbc"), "a+") 142 | ``` 143 | The `map()` function takes a vector and always returns a list. 144 | ```{r} 145 | map(1:3, runif) 146 | ``` 147 | 148 |
    149 | 150 | ### Exercise 25.4.2 {.unnumbered .exercise data-number="25.4.2"} 151 | 152 |
    153 | 154 | Brainstorm useful summary functions that, like `quantile()`, return multiple values. 155 | 156 |
    157 | 158 |
    159 | 160 | Some examples of summary functions that return multiple values are the following. 161 | ```{r} 162 | range(mtcars$mpg) 163 | fivenum(mtcars$mpg) 164 | boxplot.stats(mtcars$mpg) 165 | ``` 166 | 167 |
    168 | 169 | ### Exercise 25.4.3 {.unnumbered .exercise data-number="25.4.3"} 170 | 171 |
    172 | 173 | What’s missing in the following data frame? 174 | How does `quantile()` return that missing piece? 175 | Why isn’t that helpful here? 176 | 177 | ```{r} 178 | mtcars %>% 179 | group_by(cyl) %>% 180 | summarise(q = list(quantile(mpg))) %>% 181 | unnest() 182 | ``` 183 | 184 |
    185 | 186 |
    187 | 188 | The particular quantiles of the values are missing, e.g. `0%`, `25%`, `50%`, `75%`, `100%`. `quantile()` returns these in the names of the vector. 189 | ```{r} 190 | quantile(mtcars$mpg) 191 | ``` 192 | 193 | Since the `unnest` function drops the names of the vector, they aren't useful here. 194 | 195 |
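One way to keep the quantile labels, sketched below, is to convert the named vector into a two-column tibble with `tibble::enframe()` before unnesting. The column names `quantile` and `mpg` and the object name `q_by_cyl` are chosen for this example.

```{r}
library("dplyr")
library("tidyr")
library("tibble")

# enframe() moves the vector names into their own column,
# so they survive unnesting
q_by_cyl <- mtcars %>%
  group_by(cyl) %>%
  summarise(
    q = list(enframe(quantile(mpg), name = "quantile", value = "mpg"))
  ) %>%
  unnest(q)
q_by_cyl
```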
    196 | 197 | ### Exercise 25.4.4 {.unnumbered .exercise data-number="25.4.4"} 198 | 199 |
    200 | 201 | What does this code do? 202 | Why might it be useful? 203 | 204 | ```r 205 | mtcars %>% 206 | group_by(cyl) %>% 207 | summarise_each(funs(list)) 208 | ``` 209 | 210 |
    211 | 212 |
    213 | 214 | ```{r} 215 | mtcars %>% 216 | group_by(cyl) %>% 217 | summarise_each(funs(list)) 218 | ``` 219 | 220 | It creates a data frame in which each row corresponds to a value of `cyl`, 221 | and each observation for each column (other than `cyl`) is a vector of all the values of that column for that value of `cyl`. 222 | It seems like it should be useful to have all the observations of each variable for each group, but off the top of my head, I can't think of a specific use for this. 223 | But, it seems that it may do many things that `dplyr::do` does. 224 | 225 |
    226 | 227 | ## Simplifying list-columns {#simplifying-list-columns .r4ds-section} 228 | 229 | ### Exercise 25.5.1 {.unnumbered .exercise data-number="25.5.1"} 230 | 231 |
    232 | 233 | Why might the `lengths()` function be useful for creating atomic vector columns from list-columns? 234 | 235 |
    236 | 237 |
    238 | 239 | The `lengths()` function returns the lengths of each element in a list. 240 | It could be useful for testing whether all elements in a list-column are the same length. 241 | You could get the maximum length to determine how many atomic vector columns to create. 242 | It is also a replacement for something like `map_int(x, length)` or `sapply(x, length)`. 243 | 244 |
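A small sketch of this use, with an invented list-column, is shown below. The names `df_list` and `df_len` are hypothetical.

```{r}
library("tibble")
library("dplyr")

# A list-column whose elements have different lengths, including an empty one
df_list <- tibble(x = list(1:3, letters[1:2], NULL))

# lengths() returns one integer per list element, so it yields an atomic column
df_len <- mutate(df_list, n = lengths(x))
df_len$n
```

The resulting `n` column can then be used to filter out empty elements or to check that all elements have the same length before unnesting.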
    245 | 246 | ### Exercise 25.5.2 {.unnumbered .exercise data-number="25.5.2"} 247 | 248 |
    249 | 250 | List the most common types of vector found in a data frame. 251 | What makes lists different? 252 | 253 |
    254 | 255 |
    256 | 257 | The common types of vectors in data frames are: 258 | 259 | - `logical` 260 | - `numeric` 261 | - `integer` 262 | - `character` 263 | - `factor` 264 | 265 | All of the common types of vectors in data frames are atomic. 266 | Lists are not atomic since they can contain other lists and other vectors. 267 | 268 |
    269 | 270 | ## Making tidy data with broom {#making-tidy-data-with-broom .r4ds-section} 271 | 272 | `r no_exercises()` 273 | -------------------------------------------------------------------------------- /model-basics.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | output: html_document 3 | editor_options: 4 | chunk_output_type: console 5 | --- 6 | 7 | # Model basics {#model-basics .r4ds-section} 8 | 9 | ## Introduction {#introduction-15 .r4ds-section} 10 | 11 | ```{r setup,message=FALSE,cache=FALSE} 12 | library("tidyverse") 13 | library("modelr") 14 | ``` 15 | 16 | The option `na.action` determines how missing values are handled. 17 | It is a function. 18 | `na.warn` sets it so that there is a warning if there are any missing values. 19 | If it is not set (the default), R will silently drop them. 20 | 21 | ```{r} 22 | options(na.action = na.warn) 23 | ``` 24 | 25 | ## A simple model {#a-simple-model .r4ds-section} 26 | 27 | ### Exercise 23.2.1 {.unnumbered .exercise data-number="23.2.1"} 28 | 29 |
    30 | One downside of the linear model is that it is sensitive to unusual values because the distance incorporates a squared term. Fit a linear model to the simulated data below, and visualize the results. Rerun a few times to generate different simulated datasets. What do you notice about the model? 31 |
    32 | 33 |
    34 | 35 | ```{r} 36 | sim1a <- tibble( 37 | x = rep(1:10, each = 3), 38 | y = x * 1.5 + 6 + rt(length(x), df = 2) 39 | ) 40 | ``` 41 | 42 | Let's run it once and plot the results: 43 | ```{r} 44 | ggplot(sim1a, aes(x = x, y = y)) + 45 | geom_point() + 46 | geom_smooth(method = "lm", se = FALSE) 47 | ``` 48 | We can also do this more systematically, by generating several simulations 49 | and plotting the line. 50 | 51 | ```{r} 52 | simt <- function(i) { 53 | tibble( 54 | x = rep(1:10, each = 3), 55 | y = x * 1.5 + 6 + rt(length(x), df = 2), 56 | .id = i 57 | ) 58 | } 59 | 60 | sims <- map_df(1:12, simt) 61 | 62 | ggplot(sims, aes(x = x, y = y)) + 63 | geom_point() + 64 | geom_smooth(method = "lm", colour = "red") + 65 | facet_wrap(~.id, ncol = 4) 66 | ``` 67 | 68 | What if we did the same things with normal distributions? 69 | ```{r} 70 | sim_norm <- function(i) { 71 | tibble( 72 | x = rep(1:10, each = 3), 73 | y = x * 1.5 + 6 + rnorm(length(x)), 74 | .id = i 75 | ) 76 | } 77 | 78 | simdf_norm <- map_df(1:12, sim_norm) 79 | 80 | ggplot(simdf_norm, aes(x = x, y = y)) + 81 | geom_point() + 82 | geom_smooth(method = "lm", colour = "red") + 83 | facet_wrap(~.id, ncol = 4) 84 | ``` 85 | There are not large outliers, and the slopes are more similar. 86 | 87 | The reason for this is that the Student's $t$-distribution, from which we sample with `rt()` has heavier tails than the normal distribution (`rnorm()`). This means that the Student's t-distribution 88 | assigns a larger probability to values further from the center of the distribution. 
89 | ```{r} 90 | tibble( 91 | x = seq(-5, 5, length.out = 100), 92 | normal = dnorm(x), 93 | student_t = dt(x, df = 2) 94 | ) %>% 95 | pivot_longer(-x, names_to = "distribution", values_to = "density") %>% 96 | ggplot(aes(x = x, y = density, colour = distribution)) + 97 | geom_line() 98 | ``` 99 | 100 | For a normal distribution with mean zero and standard deviation one, the probability of being greater than 2 is, 101 | ```{r} 102 | pnorm(2, lower.tail = FALSE) 103 | ``` 104 | For a Student's $t$-distribution with 2 degrees of freedom, it is about four times higher (roughly 0.092 versus 0.023), 105 | ```{r} 106 | pt(2, df = 2, lower.tail = FALSE) 107 | ``` 108 | 109 |
    110 | 111 | ### Exercise 23.2.2 {.unnumbered .exercise data-number="23.2.2"} 112 | 113 |
    114 | One way to make linear models more robust is to use a different distance measure. For example, instead of root-mean-squared distance, you could use mean-absolute distance: 115 |
    116 | 117 |
    118 | 119 | ```{r} 120 | measure_distance <- function(mod, data) { 121 | diff <- data$y - make_prediction(mod, data) 122 | mean(abs(diff)) 123 | } 124 | ``` 125 | 126 | For the above function to work, we need to define a function, `make_prediction()`, that 127 | takes a numeric vector of length two (the intercept and slope) and returns the predictions, 128 | ```{r} 129 | make_prediction <- function(mod, data) { 130 | mod[1] + mod[2] * data$x 131 | } 132 | ``` 133 | 134 | Using the `sim1a` data, the parameters that minimize the mean absolute distance are: 135 | ```{r} 136 | best <- optim(c(0, 0), measure_distance, data = sim1a) 137 | best$par 138 | ``` 139 | In contrast, the parameters that minimize the root-mean-squared distance (the least squares objective function) for the same `sim1a` data are: 140 | ```{r} 141 | measure_distance_ls <- function(mod, data) { 142 | diff <- data$y - (mod[1] + mod[2] * data$x) 143 | sqrt(mean(diff^2)) 144 | } 145 | 146 | best <- optim(c(0, 0), measure_distance_ls, data = sim1a) 147 | best$par 148 | ``` 149 | 150 | In practice, I suggest not using `optim()` to fit this model, and instead using an existing implementation. 151 | The `rlm()` and `lqs()` functions in the [MASS](https://CRAN.R-project.org/package=MASS) package fit robust and resistant linear models. 152 | 153 |
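As a rough sketch of what using the `rlm()` and `lqs()` functions mentioned above might look like, the following fits heavy-tailed simulated data of the same form as `sim1a` with least squares, `MASS::rlm()`, and `MASS::lqs()`. The seed and the object names (`sim`, `fit_ls`, `fit_rlm`, `fit_lqs`) are assumptions made for illustration and are not part of the original solution. The functions are called with the `MASS::` prefix so that loading MASS does not mask `dplyr::select()`.

```r
# Simulated data of the same form as sim1a; the seed is arbitrary and only
# makes this sketch reproducible.
set.seed(123)
x <- rep(1:10, each = 3)
sim <- data.frame(x = x, y = x * 1.5 + 6 + rt(length(x), df = 2))

fit_ls <- lm(y ~ x, data = sim)         # ordinary least squares
fit_rlm <- MASS::rlm(y ~ x, data = sim) # robust M-estimation (Huber psi by default)
fit_lqs <- MASS::lqs(y ~ x, data = sim) # resistant regression (least trimmed squares by default)

# Intercept and slope from each fit, side by side
rbind(ls = coef(fit_ls), rlm = coef(fit_rlm), lqs = coef(fit_lqs))
```

When the simulated data contain large outliers, the robust and resistant fits will usually stay closer to the true intercept (6) and slope (1.5) than least squares does, although this depends on the particular draw.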
    154 | 155 | ### Exercise 23.2.3 {.unnumbered .exercise data-number="23.2.3"} 156 | 157 |
    158 | One challenge with performing numerical optimization is that it’s only guaranteed to find a local optimum. What’s the problem with optimizing a three parameter model like this? 159 |
    160 | 161 |
    162 | 163 | ```{r} 164 | model3 <- function(a, data) { 165 | a[1] + data$x * a[2] + a[3] 166 | } 167 | ``` 168 | 169 | The problem is that the model depends on `a[1]` and `a[3]` only through their sum: for any values `a[1] = a1` and `a[3] = a3`, any other values of `a[1]` and `a[3]` where `a[1] + a[3] == (a1 + a3)` will have the same fit. 170 | 171 | ```{r} 172 | measure_distance_3 <- function(a, data) { 173 | diff <- data$y - model3(a, data) 174 | sqrt(mean(diff^2)) 175 | } 176 | ``` 177 | Depending on our starting points, we can find different optimal values: 178 | ```{r} 179 | best3a <- optim(c(0, 0, 0), measure_distance_3, data = sim1) 180 | best3a$par 181 | ``` 182 | ```{r} 183 | best3b <- optim(c(0, 0, 1), measure_distance_3, data = sim1) 184 | best3b$par 185 | ``` 186 | ```{r} 187 | best3c <- optim(c(0, 0, 5), measure_distance_3, data = sim1) 188 | best3c$par 189 | ``` 190 | In fact, there are an infinite number of optimal values for this model. 191 | 192 | 204 | 205 |
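The lack of identifiability can also be checked directly, without running `optim()`. The sketch below uses a small simulated data frame (`dat`, an assumption standing in for `sim1` so that the example is self-contained) and a helper named `rmse3()` (also hypothetical) to show that two parameter vectors with the same slope and the same value of `a[1] + a[3]` give the same distance.

```r
# Simulated stand-in for sim1; the seed and coefficients are arbitrary.
set.seed(1)
dat <- data.frame(x = rep(1:10, each = 3))
dat$y <- 4 + 2 * dat$x + rnorm(nrow(dat))

model3 <- function(a, data) {
  a[1] + data$x * a[2] + a[3]
}

# Root-mean-squared distance between the data and the model predictions
rmse3 <- function(a, data) {
  sqrt(mean((data$y - model3(a, data))^2))
}

# Same slope, and a[1] + a[3] equals 4 in both cases
d1 <- rmse3(c(4, 2, 0), dat)
d2 <- rmse3(c(-6, 2, 10), dat)
all.equal(d1, d2)
```

Because the two distances agree, the optimizer has no basis for preferring one parameter vector over the other, which is why the result depends on the starting point.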
    206 | 207 | ## Visualising models {#visualising-models .r4ds-section} 208 | 209 | ### Exercise 23.3.1 {.unnumbered .exercise data-number="23.3.1"} 210 | 211 |
    212 | Instead of using `lm()` to fit a straight line, you can use `loess()` to fit a smooth curve. Repeat the process of model fitting, grid generation, predictions, and visualization on `sim1` using `loess()` instead of `lm()`. How does the result compare to `geom_smooth()`? 213 |
    214 | 215 |
    216 | 217 | I'll use `add_predictions()` and `add_residuals()` to add the predictions and residuals from a loess regression to the `sim1` data. 218 | 219 | ```{r} 220 | sim1_loess <- loess(y ~ x, data = sim1) 221 | sim1_lm <- lm(y ~ x, data = sim1) 222 | 223 | grid_loess <- sim1 %>% 224 | add_predictions(sim1_loess) 225 | 226 | sim1 <- sim1 %>% 227 | add_residuals(sim1_lm) %>% 228 | add_predictions(sim1_lm) %>% 229 | add_residuals(sim1_loess, var = "resid_loess") %>% 230 | add_predictions(sim1_loess, var = "pred_loess") 231 | ``` 232 | 233 | This plots the loess predictions. 234 | The loess produces a nonlinear, smooth line through the data. 235 | ```{r} 236 | plot_sim1_loess <- 237 | ggplot(sim1, aes(x = x, y = y)) + 238 | geom_point() + 239 | geom_line(aes(x = x, y = pred), data = grid_loess, colour = "red") 240 | plot_sim1_loess 241 | ``` 242 | 243 | The predictions of loess are the same as the default method for `geom_smooth()` because `geom_smooth()` uses `loess()` by default; the message even tells us that. 244 | ```{r message=TRUE} 245 | plot_sim1_loess + 246 | geom_smooth(method = "loess", colour = "blue", se = FALSE, alpha = 0.20) 247 | ``` 248 | 249 | We can plot the residuals (red), and compare them to the residuals from `lm()` (black). 250 | In general, the loess model has smaller residuals within the sample (out of sample is a different issue, and we haven't considered the uncertainty of these estimates). 251 | 252 | ```{r} 253 | ggplot(sim1, aes(x = x)) + 254 | geom_ref_line(h = 0) + 255 | geom_point(aes(y = resid)) + 256 | geom_point(aes(y = resid_loess), colour = "red") 257 | ``` 258 | 259 |
    260 | 261 | ### Exercise 23.3.2 {.unnumbered .exercise data-number="23.3.2"} 262 | 263 |
    264 | `add_predictions()` is paired with `gather_predictions()` and `spread_predictions()`. 265 | How do these three functions differ? 266 |
    267 | 268 |
    269 | 270 | The functions `gather_predictions()` and `spread_predictions()` allow for adding predictions from multiple models at once. 271 | 272 | Using `sim1_mod` as an example, 273 | ```{r} 274 | sim1_mod <- lm(y ~ x, data = sim1) 275 | grid <- sim1 %>% 276 | data_grid(x) 277 | ``` 278 | 279 | The function `add_predictions()` adds predictions from only a single model at a time. 280 | To add predictions from two models, call it twice: 281 | ```{r} 282 | grid %>% 283 | add_predictions(sim1_mod, var = "pred_lm") %>% 284 | add_predictions(sim1_loess, var = "pred_loess") 285 | ``` 286 | The function `gather_predictions()` adds predictions from multiple models by 287 | stacking the results and adding a column with the model name, 288 | ```{r} 289 | grid %>% 290 | gather_predictions(sim1_mod, sim1_loess) 291 | ``` 292 | The function `spread_predictions()` adds predictions from multiple models by 293 | adding one column per model, named after the model, with the predictions from that model. 294 | ```{r} 295 | grid %>% 296 | spread_predictions(sim1_mod, sim1_loess) 297 | ``` 298 | The function `spread_predictions()` is similar to the example which runs `add_predictions()` for each model, and is equivalent to running `spread()` after 299 | running `gather_predictions()`: 300 | ```{r} 301 | grid %>% 302 | gather_predictions(sim1_mod, sim1_loess) %>% 303 | spread(model, pred) 304 | ``` 305 | 306 |
    307 | 308 | ### Exercise 23.3.3 {.unnumbered .exercise data-number="23.3.3"} 309 | 310 |
    311 | What does `geom_ref_line()` do? What package does it come from? 312 | Why is displaying a reference line in plots showing residuals useful and important? 313 |
    314 | 315 |
    316 | 317 | The geom `geom_ref_line()` adds a reference line to a plot. 318 | It comes from the modelr package, and is equivalent to running `geom_hline()` or `geom_vline()` with default settings that are useful for visualizing models. 319 | Putting a reference line at zero for residuals is important because good models (generally) should have residuals centered at zero, with approximately the same variance (or distribution) over the support of x, and no correlation. 320 | A zero reference line makes it easier to judge these characteristics visually. 321 | 322 |
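A minimal sketch of the typical use of `geom_ref_line()` is a residual plot with a horizontal reference line at zero. The object names `mod`, `resids`, and `p` below are assumptions chosen for this illustration; `sim1` is the dataset shipped with modelr.

```r
library("ggplot2")
library("modelr")

# Fit a linear model and attach its residuals to the data
mod <- lm(y ~ x, data = sim1)
resids <- add_residuals(sim1, mod)

# Draw the reference line first so that the points are plotted on top of it
p <- ggplot(resids, aes(x = x, y = resid)) +
  geom_ref_line(h = 0) +
  geom_point()
p
```

Adding the reference line before `geom_point()` keeps the line underneath the points, so it guides the eye without hiding any observations.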
    323 | 324 | ### Exercise 23.3.4 {.unnumbered .exercise data-number="23.3.4"} 325 | 326 |
    327 | Why might you want to look at a frequency polygon of absolute residuals? 328 | What are the pros and cons compared to looking at the raw residuals? 329 |
    330 | 331 |
    332 | 333 | Showing the absolute values of the residuals makes it easier to view the spread of the residuals. 334 | The model assumes that the residuals have mean zero, so taking absolute values folds the negative residuals onto the positive ones, which effectively doubles the number of observations used to draw each part of the frequency polygon. 335 | ```{r} 336 | sim1_mod <- lm(y ~ x, data = sim1) 337 | 338 | sim1 <- sim1 %>% 339 | add_residuals(sim1_mod) 340 | 341 | ggplot(sim1, aes(x = abs(resid))) + 342 | geom_freqpoly(binwidth = 0.5) 343 | ``` 344 | 345 | However, using the absolute values of residuals throws away information about the sign, meaning that the 346 | frequency polygon cannot show whether the model systematically over- or under-estimates the response. 347 | 348 |
    349 | 350 | ## Formulas and model families {#formulas-and-model-families .r4ds-section} 351 | 352 | ### Exercise 23.4.1 {.unnumbered .exercise data-number="23.4.1"} 353 | 354 |
    355 | What happens if you repeat the analysis of `sim2` using a model without an intercept. What happens to the model equation? 356 | What happens to the predictions? 357 |
    358 | 359 |
    360 | 361 | To run a model without an intercept, add `- 1` or `+ 0` to the right-hand side of the formula: 362 | ```{r} 363 | mod2a <- lm(y ~ x - 1, data = sim2) 364 | ``` 365 | ```{r} 366 | mod2 <- lm(y ~ x, data = sim2) 367 | ``` 368 | 369 | Without an intercept, the model equation has one coefficient for each level of `x`, instead of an intercept plus offsets for the non-reference levels. The predictions are exactly the same in the models with and without an intercept: 370 | ```{r} 371 | grid <- sim2 %>% 372 | data_grid(x) %>% 373 | spread_predictions(mod2, mod2a) 374 | grid 375 | ``` 376 | 377 |
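To see what changes in the model equation, it may help to compare the coefficients of the two parameterizations. The sketch below refits both models on `sim2` under the hypothetical names `mod_int` and `mod_noint`, so that it does not depend on objects created earlier.

```r
library("modelr")

mod_int <- lm(y ~ x, data = sim2)       # intercept plus offsets for levels b, c, d
mod_noint <- lm(y ~ x - 1, data = sim2) # one coefficient per level of x

coef(mod_int)
coef(mod_noint)

# The fitted values agree even though the coefficients differ
all.equal(fitted(mod_int), fitted(mod_noint))
```

In the model with an intercept, the intercept is the mean of `y` for the reference level `a` and the other coefficients are differences from it. In the model without an intercept, each coefficient is the mean of `y` for its own level.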
    378 | 379 | ### Exercise 23.4.2 {.unnumbered .exercise data-number="23.4.2"} 380 | 381 |
    382 | Use `model_matrix()` to explore the equations generated for the models I fit to `sim3` and `sim4`. 383 | Why is `*` a good shorthand for interaction? 384 |
    385 | 386 |
    387 | 388 | For `x1 * x2` when `x2` is a categorical variable, `model_matrix()` produces indicator variables `x2b`, `x2c`, `x2d` and 389 | variables `x1:x2b`, `x1:x2c`, and `x1:x2d`, which are the products of `x1` and the `x2*` indicator variables: 390 | ```{r} 391 | x3 <- model_matrix(y ~ x1 * x2, data = sim3) 392 | x3 393 | ``` 394 | We can confirm that the variable `x1:x2b` is the product of `x1` and `x2b`, 395 | ```{r} 396 | all(x3[["x1:x2b"]] == (x3[["x1"]] * x3[["x2b"]])) 397 | ``` 398 | and similarly for `x1:x2c` and `x2c`, and `x1:x2d` and `x2d`: 399 | ```{r} 400 | all(x3[["x1:x2c"]] == (x3[["x1"]] * x3[["x2c"]])) 401 | all(x3[["x1:x2d"]] == (x3[["x1"]] * x3[["x2d"]])) 402 | ``` 403 | 404 | For `x1 * x2` where both `x1` and `x2` are continuous variables, `model_matrix()` creates variables 405 | `x1`, `x2`, and `x1:x2`: 406 | ```{r} 407 | x4 <- model_matrix(y ~ x1 * x2, data = sim4) 408 | x4 409 | ``` 410 | We can confirm that `x1:x2` is the product of `x1` and `x2`, 411 | ```{r} 412 | all(x4[["x1"]] * x4[["x2"]] == x4[["x1:x2"]]) 413 | ``` 414 | 415 | The asterisk `*` is good shorthand for an interaction since an interaction between `x1` and `x2` includes 416 | terms for `x1`, `x2`, and the product of `x1` and `x2`. 417 | 418 |
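One way to check the shorthand is to write out the expanded formula and compare the two model matrices. The sketch below uses base R's `model.matrix()` and a small made-up data frame (`df`, an assumption for illustration only) with two continuous variables.

```r
# A small made-up data frame with two continuous predictors
df <- data.frame(x1 = c(1, 2, 3, 4), x2 = c(0.5, -1, 2, 0))

# Shorthand and fully written-out versions of the same formula
m_star <- model.matrix(~ x1 * x2, data = df)
m_long <- model.matrix(~ x1 + x2 + x1:x2, data = df)

colnames(m_star)
all.equal(m_star, m_long)
```

Both formulas produce an intercept column, the two main effects, and the `x1:x2` product column, which is why `x1 * x2` can be read as "`x1`, `x2`, and their interaction".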
    419 | 420 | ### Exercise 23.4.3 {.unnumbered .exercise data-number="23.4.3"} 421 | 422 |
    423 | Using the basic principles, convert the formulas in the following two models into functions. 424 | (Hint: start by converting the categorical variable into 0-1 variables.) 425 |
    426 | 427 | ```{r} 428 | mod1 <- lm(y ~ x1 + x2, data = sim3) 429 | mod2 <- lm(y ~ x1 * x2, data = sim3) 430 | ``` 431 | 432 |
    433 | 434 | The problem is to convert the formulas in the models into functions. 435 | I will assume that the function is only handling the conversion of the right hand side of the formula into a model matrix. 436 | The functions will take one argument, a data frame with `x1` and `x2` columns, 437 | and it will return a data frame. 438 | In other words, the functions will be special cases of the `model_matrix()` function. 439 | 440 | Consider the right hand side of the first formula, `~ x1 + x2`. 441 | In the `sim3` data frame, the column `x1` is an integer, and the variable `x2` is a factor with four levels. 442 | ```{r} 443 | levels(sim3$x2) 444 | ``` 445 | 446 | Since `x1` is numeric it is unchanged. 447 | Since `x2` is a factor it is replaced with columns of indicator variables for all but one of its levels. 448 | I will first consider the special case in which `x2` only takes the levels of `x2` in `sim3`. 449 | In this case, "a" is considered the reference level and omitted, and new columns are made for "b", "c", and "d". 450 | ```{r} 451 | model_matrix_mod1 <- function(.data) { 452 | mutate(.data, 453 | x2b = as.numeric(x2 == "b"), 454 | x2c = as.numeric(x2 == "c"), 455 | x2d = as.numeric(x2 == "d"), 456 | `(Intercept)` = 1 457 | ) %>% 458 | select(`(Intercept)`, x1, x2b, x2c, x2d) 459 | } 460 | ``` 461 | ```{r} 462 | model_matrix_mod1(sim3) 463 | ``` 464 | 465 | A more general function for `~ x1 + x2` would not hard-code the specific levels in `x2`. 
466 | ```{r} 467 | model_matrix_mod1b <- function(.data) { 468 | # the levels of x2 469 | lvls <- levels(.data$x2) 470 | # drop the first level 471 | # this assumes that there are at least two levels 472 | lvls <- lvls[2:length(lvls)] 473 | # create an indicator variable for each level of x2 474 | for (lvl in lvls) { 475 | # new column name x2 + level name 476 | varname <- str_c("x2", lvl) 477 | # add indicator variable for lvl 478 | .data[[varname]] <- as.numeric(.data$x2 == lvl) 479 | } 480 | # generate the list of variables to keep 481 | x2_variables <- str_c("x2", lvls) 482 | # Add an intercept 483 | .data[["(Intercept)"]] <- 1 484 | # keep x1 and x2 indicator variables 485 | select(.data, `(Intercept)`, x1, all_of(x2_variables)) 486 | } 487 | ``` 488 | ```{r} 489 | model_matrix_mod1b(sim3) 490 | ``` 491 | 492 | Consider the right hand side of the second formula, `~ x1 * x2`. 493 | The output data frame will consist of `x1`, columns with indicator variables for each level (except the reference level) of `x2`, 494 | and columns with the `x2` indicator variables multiplied by `x1`. 495 | 496 | As with the previous formula, first I'll write a function that hard-codes the levels of `x2`. 497 | ```{r} 498 | model_matrix_mod2 <- function(.data) { 499 | mutate(.data, 500 | `(Intercept)` = 1, 501 | x2b = as.numeric(x2 == "b"), 502 | x2c = as.numeric(x2 == "c"), 503 | x2d = as.numeric(x2 == "d"), 504 | `x1:x2b` = x1 * x2b, 505 | `x1:x2c` = x1 * x2c, 506 | `x1:x2d` = x1 * x2d 507 | ) %>% 508 | select(`(Intercept)`, x1, x2b, x2c, x2d, `x1:x2b`, `x1:x2c`, `x1:x2d`) 509 | } 510 | ``` 511 | ```{r} 512 | model_matrix_mod2(sim3) 513 | ``` 514 | 515 | For a more general function which will handle arbitrary levels in `x2`, I will 516 | extend the `model_matrix_mod1b()` function that I wrote earlier. 
517 | ```{r} 518 | model_matrix_mod2b <- function(.data) { 519 | # get dataset with x1 and x2 indicator variables 520 | out <- model_matrix_mod1b(.data) 521 | # get names of the x2 indicator columns 522 | x2cols <- str_subset(colnames(out), "^x2") 523 | # create interactions between x1 and the x2 indicator columns 524 | for (varname in x2cols) { 525 | # name of the interaction variable 526 | newvar <- str_c("x1:", varname) 527 | out[[newvar]] <- out$x1 * out[[varname]] 528 | } 529 | out 530 | } 531 | ``` 532 | ```{r} 533 | model_matrix_mod2b(sim3) 534 | ``` 535 | 536 | These functions could be further generalized to allow for `x1` and `x2` to 537 | be either numeric or factors. However, if we generalize much more than that, 538 | we will soon start reimplementing all of the `model_matrix()` function. 539 | 540 |
    541 | 542 | ### Exercise 23.4.4 {.unnumbered .exercise data-number="23.4.4"} 543 | 544 |
    545 | For `sim4`, which of `mod1` and `mod2` is better? 546 | I think `mod2` does a slightly better job at removing patterns, but it’s pretty subtle. 547 | Can you come up with a plot to support my claim? 548 |
    549 | 550 |
    551 | 552 | Estimate models `mod1` and `mod2` on `sim4`, 553 | ```{r} 554 | mod1 <- lm(y ~ x1 + x2, data = sim4) 555 | mod2 <- lm(y ~ x1 * x2, data = sim4) 556 | ``` 557 | and add the residuals from these models to the `sim4` data, 558 | ```{r} 559 | sim4_mods <- gather_residuals(sim4, mod1, mod2) 560 | ``` 561 | 562 | Frequency polygons of both the residuals, 563 | ```{r} 564 | 565 | ggplot(sim4_mods, aes(x = resid, colour = model)) + 566 | geom_freqpoly(binwidth = 0.5) + 567 | geom_rug() 568 | ``` 569 | and the absolute values of the residuals, 570 | ```{r} 571 | ggplot(sim4_mods, aes(x = abs(resid), colour = model)) + 572 | geom_freqpoly(binwidth = 0.5) + 573 | geom_rug() 574 | ``` 575 | do not show much difference in the residuals between the models. 576 | However, `mod2` appears to have fewer residuals in the tails of the distribution between 2.5 and 5 (although the most extreme residuals are from `mod2`). 577 | 578 | This is confirmed by checking the standard deviation of the residuals of these models, 579 | ```{r} 580 | sim4_mods %>% 581 | group_by(model) %>% 582 | summarise(resid = sd(resid)) 583 | ``` 584 | The standard deviation of the residuals of `mod2` is smaller than that of `mod1`. 585 | 586 |
    587 | 588 | ## Missing values {#missing-values-5 .r4ds-section} 589 | 590 | `r no_exercises()` 591 | 592 | ## Other model families {#other-model-families .r4ds-section} 593 | 594 | `r no_exercises()` 595 | -------------------------------------------------------------------------------- /model.Rmd: -------------------------------------------------------------------------------- 1 | # (PART) Model {-} 2 | 3 | # Introduction {#model-intro .r4ds-section} 4 | 5 | `r no_exercises()` 6 | -------------------------------------------------------------------------------- /pipes.Rmd: -------------------------------------------------------------------------------- 1 | # Pipes {#pipes .r4ds-section} 2 | 3 | `r no_exercises()` 4 | -------------------------------------------------------------------------------- /program.Rmd: -------------------------------------------------------------------------------- 1 | # (PART) Program {-} 2 | 3 | # Introduction {#program-intro .r4ds-section} 4 | 5 | `r no_exercises()` 6 | -------------------------------------------------------------------------------- /r4ds-exercise-solutions.Rproj: -------------------------------------------------------------------------------- 1 | Version: 1.0 2 | 3 | RestoreWorkspace: No 4 | SaveWorkspace: No 5 | AlwaysSaveHistory: No 6 | 7 | EnableCodeIndexing: Yes 8 | UseSpacesForTab: Yes 9 | NumSpacesForTab: 2 10 | Encoding: UTF-8 11 | 12 | RnwWeave: knitr 13 | LaTeX: XeLaTeX 14 | 15 | AutoAppendNewline: Yes 16 | StripTrailingWhitespace: Yes 17 | LineEndingConversion: Posix 18 | 19 | BuildType: Custom 20 | CustomScriptPath: bin/render.R 21 | -------------------------------------------------------------------------------- /r4ds.bib: -------------------------------------------------------------------------------- 1 | % Encoding: UTF-8 2 | 3 | @Article{ClevelandMcGillMcGill1988, 4 | author = {William S. Cleveland and Marylyn E. 
McGill and Robert McGill}, 5 | title = {The shape parameter of a two-variable graph}, 6 | journal = {Journal of the American Statistical Association}, 7 | year = {1988}, 8 | volume = {83}, 9 | number = {402}, 10 | pages = {289--300}, 11 | issn = {01621459}, 12 | url = {https://www.jstor.org/stable/2288843}, 13 | abstract = {The shape parameter of a two-variable graph is the ratio of the horizontal and vertical distances spanned by the data. For at least 70 years this parameter has received much attention in writings on data display, because it is a critical factor on two-variable graphs that show how one variable depends on the other. But despite the attention, there has been little systematic study. In this article the shape parameter and its effect on the visual decoding of slope information are studied through historical, empirical, theoretical, and experimental investigations. These investigations lead to a method for choosing the shape that maximizes the accuracy of slope judgments.}, 14 | publisher = {[American Statistical Association, Taylor \& Francis, Ltd.]}, 15 | timestamp = {2018-08-03}, 16 | } 17 | 18 | @Book{WickhamGrolemund2017, 19 | author = {Wickham, Hadley and Grolemund, Garrett}, 20 | title = {{R} for data science: import, tidy, transform, visualize, and model data}, 21 | date = {2017-01-05}, 22 | edition = {1}, 23 | publisher = {O'Reilly Media}, 24 | isbn = {978-1491910399}, 25 | timestamp = {2018-08-03}, 26 | } 27 | 28 | @Article{HeerAgrawala2006, 29 | author = {Heer, Jeffrey and Agrawala, Maneesh}, 30 | title = {Multi-scale banking to 45º}, 31 | journaltitle = {Ieee Transactions on Visualization and Computer Graphics}, 32 | year = {2006}, 33 | volume = {12}, 34 | number = {5}, 35 | issue = {September/October}, 36 | doi = {10.1109/TVCG.2006.163}, 37 | url = {https://dx.doi.org/10.1109/TVCG.2006.163}, 38 | timestamp = {2018-08-03}, 39 | } 40 | 41 | @Book{Cleveland1993, 42 | author = {Cleveland, William S.}, 43 | title = {Visualizing information}, 
44 | year = {1993}, 45 | publisher = {Hobart Press}, 46 | timestamp = {2018-08-03}, 47 | } 48 | 49 | @Book{Cleveland1994, 50 | author = {Cleveland, William S.}, 51 | title = {The elements of graphing data}, 52 | year = {1994}, 53 | publisher = {Hobart Press}, 54 | timestamp = {2018-08-03}, 55 | } 56 | 57 | @Article{Cleveland1993a, 58 | author = {William S. Cleveland}, 59 | title = {A model for studying display methods of statistical graphics}, 60 | journal = {Journal of Computational and Graphical Statistics}, 61 | year = {1993}, 62 | volume = {2}, 63 | number = {4}, 64 | pages = {323-343}, 65 | doi = {10.1080/10618600.1993.10474616}, 66 | url = { 67 | https://dx.doi.org/10.1080/10618600.1993.10474616 68 | 69 | }, 70 | publisher = {Taylor \& Francis}, 71 | timestamp = {2018-08-03}, 72 | } 73 | 74 | @Article{DoaneSeward2011, 75 | author = {David P. Doane and Lori E. Seward}, 76 | title = {Measuring skewness: a forgotten statistic?}, 77 | journal = {Journal of Statistics Education}, 78 | year = {2011}, 79 | volume = {19}, 80 | number = {2}, 81 | pages = {null}, 82 | doi = {10.1080/10691898.2011.11889611}, 83 | eprint = {https://doi.org/10.1080/10691898.2011.11889611}, 84 | url = { 85 | https://doi.org/10.1080/10691898.2011.11889611 86 | 87 | }, 88 | publisher = {Taylor \& Francis}, 89 | timestamp = {2018-08-03}, 90 | } 91 | 92 | @Article{HintzeNelson1998, 93 | author = {Jerry L. Hintze and Ray D. 
Nelson}, 94 | title = {Violin Plots: A Box Plot-Density Trace Synergism}, 95 | journal = {The American Statistician}, 96 | year = {1998}, 97 | volume = {52}, 98 | number = {2}, 99 | pages = {181-184}, 100 | doi = {10.1080/00031305.1998.10480559}, 101 | eprint = {https://amstat.tandfonline.com/doi/pdf/10.1080/00031305.1998.10480559}, 102 | url = { 103 | https://amstat.tandfonline.com/doi/abs/10.1080/00031305.1998.10480559 104 | 105 | }, 106 | publisher = {Taylor \& Francis}, 107 | timestamp = {2018-08-10}, 108 | } 109 | 110 | @Article{HofmannWickhamKafadar2017, 111 | author = {Heike Hofmann and Hadley Wickham and Karen Kafadar}, 112 | title = {Letter-Value Plots: Boxplots for Large Data}, 113 | journal = {Journal of Computational and Graphical Statistics}, 114 | year = {2017}, 115 | volume = {26}, 116 | number = {3}, 117 | pages = {469-477}, 118 | doi = {10.1080/10618600.2017.1305277}, 119 | eprint = {https://doi.org/10.1080/10618600.2017.1305277}, 120 | url = { 121 | https://doi.org/10.1080/10618600.2017.1305277 122 | 123 | }, 124 | publisher = {Taylor \& Francis}, 125 | timestamp = {2018-08-10}, 126 | } 127 | 128 | @Comment{jabref-meta: databaseType:biblatex;} 129 | -------------------------------------------------------------------------------- /rmarkdown-formats.Rmd: -------------------------------------------------------------------------------- 1 | # R Markdown formats {#r-markdown-formats .r4ds-section} 2 | 3 | `r no_exercises()` 4 | -------------------------------------------------------------------------------- /rmarkdown-workflow.Rmd: -------------------------------------------------------------------------------- 1 | # R Markdown workflow {#r-markdown-workflow .r4ds-section} 2 | 3 | `r no_exercises()` 4 | -------------------------------------------------------------------------------- /rmarkdown.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | output: html_document 3 | editor_options: 4 | chunk_output_type: 
console 5 | --- 6 | 7 | # R Markdown {#r-markdown .r4ds-section} 8 | 9 | ## Introduction {#introduction-18 .r4ds-section} 10 | 11 | ## R Markdown basics {#r-markdown-basics .r4ds-section} 12 | 13 | ### Exercise 27.2.1 {.unnumbered .exercise data-number="27.2.1"} 14 | 15 |
    16 | 17 | Create a new notebook using *File > New File > R Notebook*. Read the instructions. Practice running the chunks. Verify that you can modify the code, re-run it, and see modified output. 18 | 19 |
    20 | 21 |
    22 | 23 | This exercise is left to the reader. 24 | 25 |
    26 | 27 | ### Exercise 27.2.2 {.unnumbered .exercise data-number="27.2.2"} 28 | 29 |
    30 | 31 | Create a new R Markdown document with *File > New File > R Markdown ...*. 32 | Knit it by clicking the appropriate button. 33 | Knit it by using the appropriate keyboard short cut. 34 | Verify that you can modify the input and see the output update. 35 | 36 |
    37 | 38 |
    39 | 40 | This exercise is mostly left to the reader. 41 | Recall that the keyboard shortcut to knit a file is `Cmd/Ctrl + Shift + K`. 42 | 43 |
    44 | 45 | ### Exercise 27.2.3 {.unnumbered .exercise data-number="27.2.3"} 46 | 47 |
    48 | Compare and contrast the R notebook and R markdown files you created above. 49 | How are the outputs similar? How are they different? 50 | How are the inputs similar? How are they different? 51 | What happens if you copy the YAML header from one to the other? 52 |
    53 | 54 |
    55 | 56 | R notebook files show the output of code chunks inside the editor, while hiding the console, when they are edited in RStudio. 57 | This contrasts with R markdown files, which show their output inside the console, and do not show output inside the editor. 58 | This makes R notebook documents appealing for interactive exploration. 59 | In this R markdown file, the plot is displayed in the "Plots" tab, while the output of `summary()` is displayed in the "Console" tab. 60 | ```{r echo=FALSE,purl=FALSE} 61 | knitr::include_graphics("img/rmarkdown-file.png") 62 | ``` 63 | However, when this same file is converted to an R notebook, the plot and `summary()` output are displayed in the "Editor" below the chunk of code which created them. 64 | ```{r echo=FALSE,purl=FALSE} 65 | knitr::include_graphics("img/rmarkdown-notebook.png") 66 | ``` 67 | 68 | Both R notebooks and R markdown files can be knit to produce HTML output. 69 | R markdown files can be knit to a variety of formats including HTML, PDF, and DOCX. 70 | R notebooks, however, can only be knit to HTML files, which are given the extension `.nb.html`. 71 | Unlike R markdown files knit to HTML, the HTML output of an R notebook includes a copy of the original `.Rmd` source. 72 | If a `.nb.html` file is opened in RStudio, the source of the `.Rmd` file can be extracted and edited. 73 | In contrast, there is no way to recover the original source of an R markdown file from its output, except through the parts that are displayed in the output itself. 74 | 75 | R markdown files and R notebooks differ in the value of `output` in their YAML headers. 76 | The YAML header for the R notebook will have the line, 77 | ``` 78 | --- 79 | output: html_notebook 80 | --- 81 | ``` 82 | For example, this is an R notebook, 83 | ``` 84 | --- 85 | title: "Diamond sizes" 86 | date: 2016-08-25 87 | output: html_notebook 88 | --- 89 | 90 | Text of the document. 
91 | ``` 92 | 93 | The YAML header for the R markdown file will have the line, 94 | ``` 95 | output: html_document 96 | ``` 97 | For example, this is an R markdown file. 98 | ``` 99 | --- 100 | title: "Diamond sizes" 101 | date: 2016-08-25 102 | output: html_document 103 | --- 104 | 105 | Text of the document. 106 | ``` 107 | 108 | Copying the YAML header from an R notebook to an R markdown file changes it to an R notebook, and vice-versa. 109 | More specifically, an `.Rmd` file can be changed to an R markdown file or an R notebook by changing the value of the `output` key in the header. 110 | 111 | The RStudio IDE and the rmarkdown package both use the YAML header of an `.Rmd` file to determine the document type of the file. 112 | 113 | For more information on R markdown notebooks see the following sources: 114 | 115 | - [R Markdown: The Definitive Guide](https://bookdown.org/yihui/rmarkdown/), chapter [Notebook](https://bookdown.org/yihui/rmarkdown/notebook.html) 116 | - [Difference between R MarkDown and R NoteBook](https://stackoverflow.com/questions/43820483/difference-between-r-markdown-and-r-notebook/43898504#43898504) StackOverflow thread. 117 | 118 |
    119 | 120 | ### Exercise 27.2.4 {.unnumbered .exercise data-number="27.2.4"} 121 | 122 |
    123 | 124 | Create one new R Markdown document for each of the three built-in formats: 125 | HTML, PDF and Word. 126 | Knit each of the three documents. 127 | How does the output 128 | differ? How does the input differ? 129 | (You may need to install LaTeX in order to 130 | build the PDF output — RStudio will prompt you if this is necessary.) 131 | 132 |
    133 | 134 |
    135 | 136 | They produce different outputs, both in the final documents and intermediate 137 | files (notably the type of plots produced). The only difference in the inputs 138 | is the value of `output` in the YAML header. 139 | The following `.Rmd` would be knit to HTML. 140 | ``` 141 | --- 142 | title: "Diamond sizes" 143 | date: 2016-08-25 144 | output: html_document 145 | --- 146 | 147 | Text of the document. 148 | ``` 149 | If the value of the `output` key is changed to `word_document`, knitting the file will create a Word document (DOCX). 150 | ``` 151 | --- 152 | title: "Diamond sizes" 153 | date: 2016-08-25 154 | output: word_document 155 | --- 156 | 157 | Text of the document. 158 | ``` 159 | Similarly, if the value of the `output` key is changed to `pdf_document`, knitting the file will create a PDF. 160 | ``` 161 | --- 162 | title: "Diamond sizes" 163 | date: 2016-08-25 164 | output: pdf_document 165 | --- 166 | 167 | Text of the document. 168 | ``` 169 | 170 | If you click on the *Knit* menu button and then on one of *Knit to HTML*, *Knit to PDF*, or *Knit to Word*, 171 | you will see that the value of the `output` key will change to `html_document`, `pdf_document`, or `word_document`, respectively. 172 | 173 | ```{r echo=FALSE,purl=FALSE} 174 | knitr::include_graphics("img/rmarkdown-knit-button.png") 175 | ``` 176 | 177 | You will see that the value of `output` will look a little different than in the previous examples. 178 | It will add a new line with a value like `pdf_document: default`. 179 | 180 | ```yaml 181 | --- 182 | title: "Diamond sizes" 183 | date: 2016-08-25 184 | output: 185 | pdf_document: default 186 | --- 187 | 188 | Text of the document. 189 | ``` 190 | 191 | This format is more general: it allows the document to have multiple output formats, as well as configuration settings that give finer control over the look of each output format. 
192 | The chapter [R Markdown Formats](https://r4ds.had.co.nz/r-markdown-formats.html) discusses output formats for R markdown files in more detail. 193 | 194 |
    195 | 196 | ## Text formatting with Markdown {#text-formatting-with-markdown .r4ds-section} 197 | 198 | ### Exercise 27.3.1 {.unnumbered .exercise data-number="27.3.1"} 199 | 200 |
    201 | Practice what you’ve learned by creating a brief CV. 202 | The title should be your name, and you should include headings for (at least) education or employment. 203 | Each of the sections should include a bulleted list of jobs/degrees. 204 | Highlight the year in bold. 205 |
    206 | 207 |
    208 | 209 | A minimal example is the following CV. 210 | ```{r cv,echo=FALSE,comment='',purl=FALSE} 211 | cat(readr::read_file(here::here("rmarkdown", "cv.Rmd"))) 212 | ``` 213 | 214 | Your own example could be much more detailed. 215 | 216 |
    217 | 218 | ### Exercise 27.3.2 {.unnumbered .exercise data-number="27.3.2"} 219 | 220 |
    221 | 222 | Using the R Markdown quick reference, figure out how to: 223 | 224 | 1. Add a footnote. 225 | 1. Add a horizontal rule. 226 | 1. Add a block quote. 227 | 228 |
    229 | 230 |
    231 | 232 | ```{r example,echo=FALSE,comment='',purl=FALSE} 233 | cat(readr::read_file(here::here("rmarkdown", "example.Rmd"))) 234 | ``` 235 | 236 |
    237 | 238 | ### Exercise 27.3.3 {.unnumbered .exercise data-number="27.3.3"} 239 | 240 |
    241 | 242 | Copy and paste the contents of `diamond-sizes.Rmd` from in to a local R markdown document. 243 | Check that you can run it, then add text after the frequency polygon that describes its most striking features. 244 |
    245 | 246 |
    247 | 248 | The following R markdown document answers this question as well as exercises [Exercise 27.4.1](#exercise-27.4.1), [Exercise 27.4.2](#exercise-27.4.2), and [Exercise 27.4.3](#exercise-27.4.3). 249 | 250 | ```{r diamond-sizes,echo=FALSE,comment='',purl=FALSE} 251 | cat(readr::read_file(here::here("rmarkdown", "diamond-sizes.Rmd"))) 252 | ``` 253 | 254 |
    255 | 256 | ## Code chunks {#code-chunks .r4ds-section} 257 | 258 | ### Exercise 27.4.1 {.unnumbered .exercise data-number="27.4.1"} 259 | 260 |
    261 | Add a section that explores how diamond sizes vary by cut, color, and clarity. 262 | Assume you’re writing a report for someone who doesn’t know R, and instead of setting `echo = FALSE` on each chunk, set a global option. 263 |
    264 | 265 |
    266 | 267 | See the answer to [Exercise 27.3.3](#exercise-27.3.3). 268 | 269 |
    270 | 271 | ### Exercise 27.4.2 {.unnumbered .exercise data-number="27.4.2"} 272 | 273 |
    274 | Download `diamond-sizes.Rmd` from . 275 | Add a section that describes the largest 20 diamonds, including a table that displays their most important attributes. 276 |
    277 | 278 |
    279 | 280 | See the answer to [Exercise 27.3.3](#exercise-27.3.3). 281 | I use `arrange()` and `slice()` to select the largest twenty diamonds, and 282 | `knitr::kable()` to produce a formatted table. 283 | 284 |
    285 | 286 | ### Exercise 27.4.3 {.unnumbered .exercise data-number="27.4.3"} 287 | 288 |
    289 | Modify `diamond-sizes.Rmd` to use `comma()` to produce nicely formatted output. 290 | Also include the percentage of diamonds that are larger than 2.5 carats. 291 |
    292 | 293 |
    294 | 295 | See the answer to [Exercise 27.3.3](#exercise-27.3.3). 296 | 297 | I moved the computation of the number and percentage of diamonds larger than 2.5 carats into a code chunk. 298 | I find that it is best to keep inline R expressions simple, usually consisting of an object and a formatting function. 299 | This makes the R code easier to read and test, and it makes the prose easier to read as well. 300 | It also helps the readability of both the code and the document to compute objects used in prose close to where they are used. 301 | Calculating those objects in a code chunk with the `include = FALSE` option (as is done in `diamond-sizes.Rmd`) is useful in this regard. 302 | 303 |
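The split between computation and formatting can be sketched in plain R. This is an illustration only: the `carat` vector below is made up and stands in for `diamonds$carat`, while `comma()` is the helper defined in `diamond-sizes.Rmd`.

```r
# Same helper as in diamond-sizes.Rmd.
comma <- function(x) {
  format(x, digits = 2, big.mark = ",")
}

# Made-up carat values standing in for diamonds$carat.
carat <- c(0.23, 0.31, 1.01, 2.75, 0.9, 3.4, 0.52, 1.5)

# Do the arithmetic once, in what would be an include = FALSE chunk.
n_total <- length(carat)
n_larger <- sum(carat > 2.5)
pct_larger <- n_larger / n_total * 100

# Inline expressions then only format a finished object,
# for example `r n_larger` or `r comma(pct_larger)`.
n_larger
comma(pct_larger)
comma(53940L)
```

Because the inline expressions contain no logic of their own, each intermediate object can be checked interactively in the console before the document is knit.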
    304 | 305 | ### Exercise 27.4.4 {.unnumbered .exercise data-number="27.4.4"} 306 | 307 |
    308 | 309 | Set up a network of chunks where `d` depends on `c` and `b`, and both `b` and `c` depend on `a`. Have each chunk print `lubridate::now()`, set `cache = TRUE`, then verify your understanding of caching. 310 | 311 |
    312 | 313 |
    314 | 315 | ```{r caching,echo=FALSE,comment='',purl=FALSE} 316 | cat(readr::read_file(here::here("rmarkdown", "caching.Rmd"))) 317 | ``` 318 | 319 |
    320 | 321 | ## Troubleshooting {#troubleshooting .r4ds-section} 322 | 323 | `r no_exercises()` 324 | 325 | ## YAML header {#yaml-header .r4ds-section} 326 | 327 | `r no_exercises()` 328 | 329 | ## Learning more {#learning-more-3 .r4ds-section} 330 | 331 | `r no_exercises()` 332 | -------------------------------------------------------------------------------- /rmarkdown/caching.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "Exercise 24.4.7.4" 3 | author: "Jeffrey Arnold" 4 | date: "2/1/2018" 5 | output: html_document 6 | --- 7 | 8 | ```{r setup, include=FALSE} 9 | knitr::opts_chunk$set(echo = TRUE, cache = TRUE) 10 | ``` 11 | 12 | The chunk `a` has no dependencies. 13 | ```{r a} 14 | print(lubridate::now()) 15 | x <- 1 16 | ``` 17 | 18 | The chunk `b` depends on `a`. 19 | ```{r b, dependson = c("a")} 20 | print(lubridate::now()) 21 | y <- x + 1 22 | ``` 23 | 24 | The chunk `c` depends on `a`. 25 | ```{r c, dependson = c("a")} 26 | print(lubridate::now()) 27 | z <- x * 2 28 | ``` 29 | 30 | The chunk `d` depends on `c` and `b`: 31 | ```{r d, dependson = c("c", "b")} 32 | print(lubridate::now()) 33 | w <- y + z 34 | ``` 35 | 36 | If this document is knit repeatedly, the value printed by `lubridate::now()` 37 | in each chunk will not change between runs: it stays the same as the first time 38 | the document was knit with caching enabled. 39 | -------------------------------------------------------------------------------- /rmarkdown/cv.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "Hadley Wickham" 3 | --- 4 | 5 | ## Employment 6 | 7 | - Chief Scientist, RStudio, **2013--present**. 8 | - Adjunct Professor, Rice University, Houston, TX, **2013--present**. 9 | - Assistant Professor, Rice University, Houston, TX, **2008--12**. 10 | 11 | ## Education 12 | 13 | - Ph.D. in Statistics, Iowa State University, Ames, IA, **2008**. 14 | 15 | - M.Sc.
in Statistics, University of Auckland, New Zealand, **2004**. 16 | 17 | - B.Sc. in Statistics and Computer Science, First Class Honours, The 18 | University of Auckland, New Zealand, **2002**. 19 | 20 | - Bachelor of Human Biology, First Class Honours, The University of Auckland, 21 | Auckland, New Zealand, **1999**. 22 | -------------------------------------------------------------------------------- /rmarkdown/diamond-sizes.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "Diamond sizes" 3 | output: html_document 4 | date: '2018-07-15' 5 | --- 6 | 7 | ```{r knitr_opts, include = FALSE} 8 | knitr::opts_chunk$set(echo = FALSE) 9 | ``` 10 | 11 | ```{r setup, message = FALSE, cache=FALSE} 12 | library("ggplot2") 13 | library("dplyr") 14 | ``` 15 | 16 | ```{r} 17 | smaller <- diamonds %>% 18 | filter(carat <= 2.5) 19 | ``` 20 | 21 | ```{r include = FALSE, purl = FALSE} 22 | # Hide objects and functions ONLY used inline 23 | n_larger <- nrow(diamonds) - nrow(smaller) 24 | pct_larger <- n_larger / nrow(diamonds) * 100 25 | 26 | comma <- function(x) { 27 | format(x, digits = 2, big.mark = ",") 28 | } 29 | ``` 30 | 31 | ## Size and Cut, Color, and Clarity 32 | 33 | Diamonds with lower quality cuts (cuts are ranked from "Ideal" to "Fair") tend 34 | to be larger. 35 | ```{r} 36 | ggplot(diamonds, aes(y = carat, x = cut)) + 37 | geom_boxplot() 38 | ``` 39 | Likewise, diamonds with worse color (diamond colors are ranked from J (worst) 40 | to D (best)) tend to be larger: 41 | 42 | ```{r} 43 | ggplot(diamonds, aes(y = carat, x = color)) + 44 | geom_boxplot() 45 | ``` 46 | 47 | The pattern present in cut and color is also present in clarity.
Diamonds with 48 | worse clarity (I1 (worst), SI2, SI1, VS2, VS1, VVS2, VVS1, IF (best)) tend to 49 | be larger: 50 | 51 | ```{r} 52 | ggplot(diamonds, aes(y = carat, x = clarity)) + 53 | geom_boxplot() 54 | ``` 55 | 56 | These patterns are consistent with there being a profitability threshold for 57 | retail diamonds that is a function of carat, clarity, color, cut and other 58 | characteristics. A diamond may be profitable to sell if a poor value of one 59 | feature, for example, poor clarity, color, or cut, is offset by a good value 60 | of another feature, such as a large size. This can be considered an example 61 | of [Berkson's paradox](https://en.wikipedia.org/wiki/Berkson%27s_paradox). 62 | 63 | ## Largest Diamonds 64 | 65 | We have data about `r comma(nrow(diamonds))` diamonds. Only 66 | `r n_larger` (`r round(pct_larger, 1)`%) are larger 67 | than 2.5 carats. The distribution of the remainder is shown below: 68 | 69 | ```{r} 70 | smaller %>% 71 | ggplot(aes(carat)) + 72 | geom_freqpoly(binwidth = 0.01) 73 | ``` 74 | 75 | The frequency distribution of diamond sizes is marked by spikes at 76 | whole-number and half-carat values, as well as several other carat values 77 | corresponding to fractions. 78 | 79 | The largest twenty diamonds (by carat) in the dataset are: 80 | 81 | ```{r results = "asis"} 82 | diamonds %>% 83 | arrange(desc(carat)) %>% 84 | slice(1:20) %>% 85 | select(carat, cut, color, clarity) %>% 86 | knitr::kable( 87 | caption = "The largest 20 diamonds in the `diamonds` dataset." 88 | ) 89 | ``` 90 | 91 | Most of the twenty largest diamonds are in the lowest clarity category ("I1"), 92 | with one being in the third-best category ("VVS2"). The top twenty diamonds 93 | have colors ranging from the worst category, "J", to the best, "D", though most 94 | are in the lower categories "J" and "I".
The top twenty diamonds are more evenly 95 | distributed among the cut categories, from "Fair" to "Ideal", although the worst 96 | category ("Fair") is the most common. 97 | -------------------------------------------------------------------------------- /rmarkdown/example.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: Horizontal Rules, Block Quotes, and Footnotes 3 | --- 4 | 5 | The quick brown fox jumps over the lazy dog.[^quick-fox] 6 | 7 | Use three or more `-` for a horizontal rule. For example, 8 | 9 | --- 10 | 11 | The horizontal rule uses the same syntax as a YAML block. So how does R Markdown 12 | distinguish between the two? Three dashes ("---") are only treated as the start of 13 | a YAML block if they are at the start of the document. 14 | 15 | > This would be a block quote. Generally, block quotes are used to indicate 16 | > quotes longer than three or four lines. 17 | 18 | [^quick-fox]: This is an example of a footnote. The sentence this is footnoting 19 | is often used for displaying fonts because it includes all 26 letters of the 20 | English alphabet. 21 | -------------------------------------------------------------------------------- /tibble.Rmd: -------------------------------------------------------------------------------- 1 | # Tibbles {#tibbles .r4ds-section} 2 | 3 | ```{r setup,message=FALSE,cache=FALSE} 4 | library("tidyverse") 5 | ``` 6 | 7 | ## Exercise 10.1 {.unnumbered .exercise data-number="10.1"} 8 | 9 |
    10 | How can you tell if an object is a tibble? (Hint: try printing `mtcars`, which is a regular data frame). 11 |
    12 | 13 |
    14 | 15 | When we print `mtcars`, it prints all the rows and columns. 16 | ```{r} 17 | mtcars 18 | ``` 19 | 20 | But when we first convert `mtcars` to a tibble using `as_tibble()`, it prints only the first ten observations. 21 | There are also some other differences in formatting of the printed data frame. 22 | It prints the number of rows and columns and the data type of each column. 23 | ```{r} 24 | as_tibble(mtcars) 25 | ``` 26 | 27 | You can use the function `is_tibble()` to check whether a data frame is a tibble or not. 28 | The `mtcars` data frame is not a tibble. 29 | ```{r} 30 | is_tibble(mtcars) 31 | ``` 32 | But the `diamonds` and `flights` data are tibbles. 33 | ```{r} 34 | is_tibble(ggplot2::diamonds) 35 | is_tibble(nycflights13::flights) 36 | is_tibble(as_tibble(mtcars)) 37 | ``` 38 | 39 | More generally, you can use the `class()` function to find out the class of an 40 | object. Tibbles have the classes `c("tbl_df", "tbl", "data.frame")`, while old 41 | data frames will only have the class `"data.frame"`. 42 | ```{r} 43 | class(mtcars) 44 | class(ggplot2::diamonds) 45 | class(nycflights13::flights) 46 | ``` 47 | 48 | If you are interested in reading more on R's classes, read the chapters on 49 | object oriented programming in [Advanced R](http://adv-r.had.co.nz/S3.html). 50 | 51 |
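Since a tibble is identified by its class vector, the same check can be written with base R's `inherits()`. The sketch below defines a small helper, `is_tbl()`, which is a made-up name for this illustration and not part of any package. The tibble part only runs if the tibble package happens to be installed.

```r
# inherits() tests for a class without needing the tibble package.
# is_tbl() is a hypothetical helper written for this example.
is_tbl <- function(x) inherits(x, "tbl_df")

is_tbl(mtcars)
class(mtcars)

# Only run the tibble part if the package is installed.
if (requireNamespace("tibble", quietly = TRUE)) {
  tbl <- tibble::as_tibble(mtcars)
  print(is_tbl(tbl))
  print(class(tbl))
}
```

In practice `tibble::is_tibble()` is the clearer choice. The helper only shows that the check amounts to looking for `"tbl_df"` among the object's classes.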
    52 | 53 | ## Exercise 10.2 {.unnumbered .exercise data-number="10.2"} 54 | 55 |
    56 | Compare and contrast the following operations on a `data.frame` and equivalent tibble. What is different? Why might the default data frame behaviors cause you frustration? 57 |
    58 | 59 |
    60 | 61 | ```{r} 62 | df <- data.frame(abc = 1, xyz = "a") 63 | df$x 64 | df[, "xyz"] 65 | df[, c("abc", "xyz")] 66 | ``` 67 | 68 | ```{r} 69 | tbl <- as_tibble(df) 70 | tbl$x 71 | tbl[, "xyz"] 72 | tbl[, c("abc", "xyz")] 73 | ``` 74 | 75 | On a data frame, the `$` operator will match any column name that starts with the name following it. 76 | Since there is a column named `xyz`, the expression `df$x` will be expanded to `df$xyz`. 77 | This behavior of the `$` operator saves a few keystrokes, but it can result in accidentally using a different column than you thought you were using. 78 | A tibble does not do partial matching: `tbl$x` returns `NULL` with a warning. 79 | 80 | With a data frame, the type of object that `[` returns depends on the 81 | number of columns selected. If it is one column, it won't return a data frame, but 82 | instead will return a vector. With more than one column, it will return a 83 | data frame. This is fine if you know what you are passing in, but suppose you 84 | did `df[ , vars]` where `vars` was a variable. Then what that code does 85 | depends on `length(vars)` and you'd have to write code to account for those 86 | situations or risk bugs. With a tibble, `[` always returns another tibble. 87 | 88 |
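Both data frame behaviors can be avoided in base R itself. The following sketch uses the same `df` as above and shows exact matching with `[[` and the `drop = FALSE` argument of `[`. The `warnPartialMatchDollar` option at the end is a base R option that asks for a warning whenever `$` matches partially.

```r
df <- data.frame(abc = 1, xyz = "a")

# `$` partially matches "x" to the column xyz.
df$x

# `[[` matches names exactly by default, so there is no column "x".
df[["x"]]

# A single column is simplified to a vector unless drop = FALSE.
vars <- "xyz"
is.data.frame(df[, vars])
is.data.frame(df[, vars, drop = FALSE])

# Optionally ask R to warn whenever `$` matches partially.
options(warnPartialMatchDollar = TRUE)
```

Writing `df[, vars, drop = FALSE]` makes the result a data frame regardless of `length(vars)`, which is the behavior a tibble gives by default.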
    88 | 89 | ## Exercise 10.3 {.unnumbered .exercise data-number="10.3"} 90 | 91 |
    92 | If you have the name of a variable stored in an object, e.g. `var <- "mpg"`, how can you extract the reference variable from a tibble? 93 |
    94 | 95 |
    96 | 97 | You can use the double bracket, like `df[[var]]`. You cannot use the dollar sign, because `df$var` would look for a column named `var`. 98 | 99 |
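The difference is easy to see with the built-in `mtcars` data frame, used here so that the sketch runs without any packages. The `[[` lookup works the same way on a tibble, although a tibble also warns when `$` is given an unknown column.

```r
var <- "mpg"

# `[[` evaluates var and looks up the column named "mpg".
head(mtcars[[var]])

# `$` takes the name literally and looks for a column called "var".
is.null(mtcars$var)
```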
    100 | 101 | ## Exercise 10.4 {.unnumbered .exercise data-number="10.4"} 102 | 103 |
    104 | 105 | Practice referring to non-syntactic names in the following data frame by: 106 | 107 | 1. Extracting the variable called 1. 108 | 1. Plotting a scatterplot of 1 vs 2. 109 | 1. Creating a new column called 3 which is 2 divided by 1. 110 | 1. Renaming the columns to one, two and three. 111 | 112 |
    113 | 114 |
    115 | 116 | For this example, I'll create a dataset called annoying with 117 | columns named `1` and `2`. 118 | 119 | ```{r} 120 | annoying <- tibble( 121 | `1` = 1:10, 122 | `2` = `1` * 2 + rnorm(length(`1`)) 123 | ) 124 | ``` 125 | 126 | 1. To extract the variable named `1`: 127 | 128 | ```{r} 129 | annoying[["1"]] 130 | ``` 131 | 132 | or 133 | 134 | ```{r} 135 | annoying$`1` 136 | ``` 137 | 138 | 1. To create a scatter plot of `1` vs. `2`: 139 | 140 | ```{r} 141 | ggplot(annoying, aes(x = `1`, y = `2`)) + 142 | geom_point() 143 | ``` 144 | 145 | 1. To add a new column `3` which is `2` divided by `1`: 146 | 147 | ```{r} 148 | mutate(annoying, `3` = `2` / `1`) 149 | ``` 150 | 151 | or 152 | 153 | ```{r} 154 | annoying[["3"]] <- annoying$`2` / annoying$`1` 155 | ``` 156 | 157 | or 158 | 159 | ```{r} 160 | annoying[["3"]] <- annoying[["2"]] / annoying[["1"]] 161 | ``` 162 | 163 | 1. To rename the columns to `one`, `two`, and `three`, run: 164 | 165 | ```{r} 166 | annoying <- rename(annoying, one = `1`, two = `2`, three = `3`) 167 | glimpse(annoying) 168 | ``` 169 | 170 |
    171 | 172 | ## Exercise 10.5 {.unnumbered .exercise data-number="10.5"} 173 | 174 |
    175 | What does `tibble::enframe()` do? When might you use it? 176 |
    177 | 178 |
    179 | 180 | The function `tibble::enframe()` converts a named vector to a data frame with one column for the names and one column for the values. 181 | 182 | ```{r} 183 | enframe(c(a = 1, b = 2, c = 3)) 184 | ``` 185 | 186 |
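To make the transformation concrete, here is a rough base R stand-in. It is only a sketch: `enframe_base()` is a made-up helper for this illustration, it returns a plain data frame instead of a tibble, and it ignores the other features of the real `enframe()`.

```r
x <- c(a = 1, b = 2, c = 3)

# Hypothetical base R stand-in for enframe():
# one column holding the names, one holding the values.
enframe_base <- function(x, name = "name", value = "value") {
  out <- data.frame(names(x), unname(x), stringsAsFactors = FALSE)
  names(out) <- c(name, value)
  out
}

enframe_base(x)

# Functions such as quantile() return named vectors.
enframe_base(quantile(mtcars$mpg))
```

This suggests when `enframe()` is useful: whenever a function returns a named vector, such as `quantile()`, and the result needs to be a data frame for plotting or joining.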
    187 | 188 | ## Exercise 10.6 {.unnumbered .exercise data-number="10.6"} 189 | 190 |
    191 | What option controls how many additional column names are printed at the footer of a tibble? 192 |
    193 | 194 |
    195 | 196 | The `print()` method for tibble objects is documented in `?print.tbl`. 197 | The `n_extra` argument determines the number of extra columns to print information for. 198 | 199 |
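The argument above applies to a single `print()` call. For a session-wide default, the tibble documentation has listed a package option, assumed here to be named `tibble.max_extra_cols`. Newer versions of tibble and pillar may use a different argument or option name, so treat this as a sketch and check `?print.tbl` for the installed version.

```r
# Set the option (name assumed from the tibble documentation) and
# keep the previous value so it can be restored afterwards.
old <- options(tibble.max_extra_cols = 5)
current <- getOption("tibble.max_extra_cols")
current

# Restore the previous setting.
options(old)
```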
    200 | -------------------------------------------------------------------------------- /workflow-basics.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | output: html_document 3 | editor_options: 4 | chunk_output_type: console 5 | --- 6 | # Workflow: basics {#workflow-basics .r4ds-section} 7 | 8 | ```{r message=FALSE,cache=FALSE} 9 | library("tidyverse") 10 | ``` 11 | 12 | ## Exercise 4.1 {.unnumbered .exercise data-number="4.1"} 13 | 14 |
    15 | Why does this code not work? 16 | ```{r error=TRUE} 17 | my_variable <- 10 18 | my_varıable 19 | ``` 20 |
    21 | 22 |
    23 | 24 | The variable being printed is `my_varıable`, not `my_variable`: 25 | the seventh character is "ı" ("[LATIN SMALL LETTER DOTLESS I](https://en.wikipedia.org/wiki/Dotted_and_dotless_I)"), not "i". 26 | 27 | While it wouldn't have helped much in this case, the importance of 28 | distinguishing characters in code is one reason why fonts that clearly 29 | distinguish similar characters are preferred in programming. 30 | It is especially important to distinguish between two sets of similar looking characters: 31 | 32 | - the numeral zero (0), the Latin small letter O (o), and the Latin capital letter O (O), 33 | - the numeral one (1), the Latin small letter I (i), the Latin capital letter I (I), and Latin small letter L (l). 34 | 35 | In these fonts, zero and the Latin letter O are often distinguished by using a glyph for zero that uses either a dot in the interior or a slash through it. 36 | Some examples of fonts with dotted or slashed zero glyphs are Consolas, DejaVu Sans Mono, Monaco, Menlo, [Source Sans Pro](https://adobe-fonts.github.io/source-sans-pro/), and FiraCode. 37 | 38 | Error messages of the form `"object '...' not found"` mean exactly what they say. 39 | R cannot find an object with that name. 40 | Unfortunately, the error does not tell you why that object cannot be found, because R does not know the reason that the object does not exist. 41 | The most common scenarios in which I encounter this error message are: 42 | 43 | 1. I forgot to create the object, or an error prevented the object from being created. 44 | 45 | 1. I made a typo in the object's name, either when using it or when I created it (as in the example above), or I forgot what I had originally named it. 46 | If you find yourself often writing the wrong name for an object, 47 | it is a good indication that the original name was not a good one. 48 | 49 | 1. I forgot to load the package that contains the object using `library()`. 50 | 51 |
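When two names look identical on screen, base R can help confirm that they differ. The following sketch builds the misspelled name with an explicit Unicode escape, checks which names exist, and compares the code points of the two characters. The variable `dotless` is introduced only for this illustration.

```r
my_variable <- 10

# The name with a dotless i (U+0131) is a different object name.
dotless <- "my_var\u0131able"
exists("my_variable")
exists(dotless)

# Code points make the difference visible even when the glyphs are not.
utf8ToInt("i")
utf8ToInt("\u0131")

# Capture the error message instead of stopping the script.
msg <- tryCatch(get(dotless), error = function(e) conditionMessage(e))
msg
```

Checking `exists()` or listing objects with `ls()` is a quick way to tell apart the first two scenarios above: an object that was never created versus one whose name was mistyped.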
    52 | 53 | ## Exercise 4.2 {.unnumbered .exercise data-number="4.2"} 54 | 55 |
    56 | 57 | Tweak each of the following R commands so that they run correctly: 58 | 59 | ```{r, eval = FALSE} 60 | ggplot(dota = mpg) + 61 | geom_point(mapping = aes(x = displ, y = hwy)) 62 | 63 | fliter(mpg, cyl = 8) 64 | filter(diamond, carat > 3) 65 | ``` 66 | 67 |
    68 | 69 |
    70 | 71 | ```{r error=TRUE} 72 | ggplot(dota = mpg) + 73 | geom_point(mapping = aes(x = displ, y = hwy)) 74 | ``` 75 | The error message is `argument "data" is missing, with no default`. 76 | This error is a result of a typo, `dota` instead of `data`. 77 | ```{r error=TRUE} 78 | ggplot(data = mpg) + 79 | geom_point(mapping = aes(x = displ, y = hwy)) 80 | ``` 81 | 82 | ```{r error=TRUE} 83 | fliter(mpg, cyl = 8) 84 | ``` 85 | 86 | R could not find the function `fliter()` because we made a typo: `fliter` instead of `filter`. 87 | 88 | ```{r error=TRUE} 89 | filter(mpg, cyl = 8) 90 | ``` 91 | 92 | We aren't done yet. But the error message gives a suggestion. Let's follow it. 93 | 94 | ```{r error=TRUE} 95 | filter(mpg, cyl == 8) 96 | ``` 97 | 98 | ```{r error=TRUE} 99 | filter(diamond, carat > 3) 100 | ``` 101 | 102 | R says it can't find the object `diamond`. 103 | This is a typo; the data frame is named `diamonds`. 104 | ```{r error=TRUE} 105 | filter(diamonds, carat > 3) 106 | ``` 107 | 108 | How did I know? I started typing in `diamond` and RStudio completed it to `diamonds`. 109 | Since `diamonds` includes the variable `carat` and the code works, that appears to have been the problem. 110 | 111 |
    112 | 113 | ## Exercise 4.3 {.unnumbered .exercise data-number="4.3"} 114 | 115 |
    116 | Press *Alt + Shift + K*. What happens? How can you get to the same place using the menus? 117 |
    118 | 119 |
    120 | 121 | This gives a menu with keyboard shortcuts. This can be found in the menu under `Tools -> Keyboard Shortcuts Help`. 122 | 123 |
    124 | -------------------------------------------------------------------------------- /workflow-projects.Rmd: -------------------------------------------------------------------------------- 1 | # Workflow: projects {#workflow-projects .r4ds-section} 2 | 3 | `r no_exercises()` 4 | -------------------------------------------------------------------------------- /workflow-scripts.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | output: html_document 3 | editor_options: 4 | chunk_output_type: console 5 | --- 6 | 7 | # Workflow: scripts {#workflow-scripts .r4ds-section} 8 | 9 | ## Exercise 6.1 {.unnumbered .exercise data-number="6.1"} 10 | 11 |
    12 | 13 | Go to the RStudio Tips twitter account, and find one tip that looks interesting. 14 | Practice using it! 15 | 16 |
    17 | 18 |
    19 | 20 | The current timeline of [\@rstudiotips](https://twitter.com/rstudiotips) is displayed below. 21 | 22 | 23 | 24 | 25 |
    26 | 27 | ## Exercise 6.2 {.unnumbered .exercise data-number="6.2"} 28 | 29 |
    30 | 31 | What other common mistakes will RStudio diagnostics report? 32 | Read to find out. 33 | 34 |
    35 | 36 |
    37 | 38 | You should read that page, but some other diagnostics for R code include the following. 39 | 40 | 1. Check for missing, unmatched, partially matched, and too many arguments to functions. 41 | 1. Warn if a variable is not defined. 42 | 1. Warn if a variable is defined but not used. 43 | 1. Check that the code style conforms to the [tidyverse style guide](https://style.tidyverse.org/). 44 | 45 |
    46 | -------------------------------------------------------------------------------- /wrangle.Rmd: -------------------------------------------------------------------------------- 1 | # (PART) Wrangle {-} 2 | 3 | # Introduction {#wrangle-intro .r4ds-section} 4 | 5 | `r no_exercises()` 6 | --------------------------------------------------------------------------------