├── README.md
├── binder-notes.md
└── reproducible-data-analysis.pdf

/README.md:
--------------------------------------------------------------------------------
# How To Make Your Data Analysis Notebooks More Reproducible

[![rstudio_talk_slides](https://i.imgur.com/fYGze6k.png)](http://inundata.org/talks/rstd19/#/)

[Slide deck](http://inundata.org/talks/rstd19/#/) | Slide deck as [PDF](https://github.com/karthik/rstudio2019/blob/master/reproducible-data-analysis.pdf)

[🎥 Video of talk at rstudio::conf(2019)](https://resources.rstudio.com/rstudio-conf-2019/a-guide-to-modern-reproducible-data-science-with-r)

## Resources
I have included a handful of links to papers, software packages and tutorials/manuals about some of the tools I mention in my talk. Pull requests or issues suggesting additional ones are welcome.

### Research Compendia

- [Statistical Analyses and Reproducible Research](https://biostats.bepress.com/bioconductor/paper2/)
- [Packaging Data Analytical Work Reproducibly Using R (and Friends)](https://www.tandfonline.com/doi/abs/10.1080/00031305.2017.1375986) ([OA preprint](https://peerj.com/preprints/3192/)). A practical introduction to setting up a research compendium in R.
- [The rOpenSci reproducibility guide](https://ropensci.github.io/reproducibility-guide/) *Slightly dated but still very useful*

**Examples of Research Compendia on GitHub**
Below are a few links to real-world examples of research compendia in R. For a minimal compendium, all you really need is a valid [`DESCRIPTION`](https://github.com/boettiger-lab/pomdp-intro/blob/master/DESCRIPTION) file containing a handful of fields such as type, name, version and dependencies. See Marwick et al. 2017 for a detailed description of the different types of compendia.

**Small**
- [Code and data associated with Duffy, James, and Longworth Applied and Environmental Microbiology paper describing the ecology, virulence, and phylogeny of a brood parasite of Daphnia, Blastulidium paedophthorum](https://github.com/duffymeg/BroodParasiteDescription)

**Medium**
- [Resolving the measurement uncertainty paradox in ecological management](https://github.com/boettiger-lab/pomdp-intro)

**Large**

- [Non-parametric Bayesian Inference for Conservation Decisions](https://github.com/cboettig/nonparametric-bayes)

- Find various other compendia on [GitHub](https://github.com/topics/research-compendium) and [Zenodo](https://zenodo.org/communities/research-compendium?page=1&size=20) using the `research-compendium` tag.

**Software packages related to research compendia**

- 📦 [`rrtools`](https://github.com/benmarwick/rrtools) by Ben Marwick (also the author of the packaging data analysis paper mentioned above) *extends functions in `devtools` and provides instructions, templates, and functions to make a basic compendium suitable for doing reproducible research with R.*
- Also see 📦 [workflowr](https://jdblischak.github.io/workflowr/) by John Blischak and the [task view](https://github.com/jdblischak/ctv-project-workflows) on R-based data analysis projects maintained by John Blischak, Anna Krystalli, Ben Marwick, and Daniel Nüst.
- 📦 [`usethis`](https://github.com/r-lib/usethis) *Many of the major functions in `rrtools` are imported from `usethis`. A savvy user can get by setting up and maintaining a compendium purely with `usethis` functions.*
- 📦 [`goodpractice`](https://github.com/MangoTheCat/goodpractice) - Designed to help you build more robust packages, this package does a deep dive on your package contents, provides advice on syntax pitfalls to avoid, suggests code formatting improvements, and helps you improve overall package structure.
- The 📦 [`rticles`](https://github.com/rstudio/rticles) package by JJ has numerous journal templates and can be used together with RStudio addins like [`wordcountaddin`](https://github.com/benmarwick/wordcountaddin) and [`citr`](https://github.com/crsh/citr) + [`knitcitations`](https://github.com/cboettig/knitcitations).


### 📈 Data management

- 📦 [`piggyback`](https://github.com/ropensci/piggyback), [[docs]](https://ropensci.github.io/piggyback/): This clever R package allows you to attach arbitrary data (or other) files (up to 2 GB each) to a GitHub release. Given GitHub's fast [CDN](https://en.wikipedia.org/wiki/Content_delivery_network), this is an easy way to attach large files to a compendium and read them back in a local, collaborator or remote environment. As always, be sure to archive a long-term copy on [Zenodo](https://zenodo.org/).
- 📦 [`arkdb`](https://github.com/ropensci/arkdb) [[docs]](https://ropensci.github.io/arkdb/): This package allows you to archive and unarchive databases as flat text files.
- 🎥 For more on setting up data packages, see this [excellent talk by Noam Ross](https://www.youtube.com/watch?v=zsEsh5QpN0U) at New York R.

### Computational environments: Binder and friends

- [My Binder](https://mybinder.org/) is a free binderhub deployment that turns any Git repo into a collection of interactive notebooks. Now with better R support!
- For instructions on how to set this up for your R project, see [my notes here](https://github.com/karthik/rstudio2019/blob/master/binder-notes.md)
- [Introducing Binder 2.0 — share your interactive research environment](https://elifesciences.org/labs/8653a61d/introducing-binder-2-0-share-your-interactive-research-environment) Paper describing the architecture of Binder, in case you are interested in what happens under the hood.
- 🎥 [A talk about Binder at SciPy 2018](https://www.youtube.com/watch?v=KcC0W5LP9GM).
Also see [conference proceedings PDF](http://conference.scipy.org/proceedings/scipy2018/pdfs/project_jupyter.pdf).
- [`repo2docker`](https://github.com/jupyter/repo2docker) A Python module that will turn any repo (or local folder) into a Docker image.

**Other hosted Binder hubs**

- [Pangeo binder](https://binder.pangeo.io/) *Pangeo encourages everyone to use it.*
- [gesis](https://notebooks.gesis.org/)
- [Syzygy](http://syzygy.ca/) *Binder + JupyterHub for Compute Canada*

**Setting up Binder for your analysis**

I have captured all the various ways to set up mybinder with an R project in a [separate document](binder-notes.md).

Are you interested in setting up or hosting a binderhub for the R community? Get in touch via the issues.


**Also see**
- [Whole Tale](https://wholetale.org/)
- [Computing environments for reproducibility: Capturing the “Whole Tale”](https://www.sciencedirect.com/science/article/pii/S0167739X17310695) - OA paper describing the Whole Tale project.
- [Code Ocean](https://codeocean.com/) - A commercial, black-box, full-stack service that will accomplish something similar to the above two projects. Code Ocean links will likely start appearing in papers soon.


**Software packages related to setting up computational environments**

- 📦 [`containerit`](https://github.com/o2r-project/containerit). [Detailed blog post](https://o2r.info/2017/05/30/containerit-package/) This sweet package will generate a Dockerfile for you by examining the code inside a folder or just from your session info. This is analogous to `repo2docker` but is very R-centric.
- [`stevedore`](https://github.com/richfitz/stevedore) Although there are a few other Docker clients (`docker`, `harbor`), this is my recommendation for managing Docker containers from inside R.


### 🔨 Workflows: drake and friends

- 📦 [`drake`](https://github.com/ropensci/drake) - An R-focused pipeline toolkit for reproducibility and high-performance computing. Install the package from here or CRAN.
- [The prequel to the drake R package](https://ropensci.org/blog/2018/02/06/drake/) *A blog post by the creator of drake describing his motivation for the package.*
- [drake manual](https://ropenscilabs.github.io/drake-manual/) A detailed `bookdown` guide on how to set up and use drake for projects of varying levels of complexity.
- [Presentation on drake](https://wlandau.github.io/drake-datafest-2019/#/) Slides from a talk by Will Landau (who is here at the conference so go pick his brain if you want to learn more!)

**Real-world drake examples**
- [Pathogen modeling study](https://github.com/pat-s/pathogen-modeling)

**Miscellaneous**
- IKEA diagram inspired by [IDEA instructions](https://idea-instructions.com/)

---

### Acknowledgments

Many thanks to [Chris Holdgraf](https://bids.berkeley.edu/people/chris-holdgraf), [Carl Boettiger](https://www.carlboettiger.info/), [Will Landau](https://wlandau.github.io/), and [Ben Marwick](http://faculty.washington.edu/bmarwick/) for various discussions on these topics. Also thanks to Ciera Martinez, Kara Woo, and Nick Tierney for comments on the presentation.


--------------------------------------------------------------------------------
/binder-notes.md:
--------------------------------------------------------------------------------
![](https://mybinder.org/static/logo.svg?v=f9f0d927b67cc9dc99d788c822ca21c0)

# How to set up a My Binder for your R project

[My Binder](https://mybinder.org/) has had R support for about a year, and it keeps getting better.
When passed a Git repo (hosted on GitHub, GitLab or any arbitrary location), it looks for patterns (such as the presence of a `DESCRIPTION` file, `install.R`, etc.), creates a Dockerfile and starts building an instance from a pre-configured build pack (there is one for R projects, and this is extensible to meet other needs).

## ⚙ Simplest setup

For the simplest setup, you'll just need two files in your repo.
- `runtime.txt` Here you can specify your R version by date (e.g. `r-2018-12-31`) and also versions of Python and Julia. In the case of R, this will also lock down CRAN packages to this R release (Rocker takes care of pulling packages from MRAN matching this date). You should specify the version based on a date:

- Select an MRAN date corresponding to your desired R version or just choose latest.

| R version | MRAN date  | RStudio version | RStudio date |
|-----------|------------|-----------------|--------------|
| 3.1.0     | 2014-09-17 | NA              | NA           |
| 3.2.0     | 2015-06-18 | NA              | NA           |
| 3.2.5     | 2016-05-03 | NA              | NA           |
| 3.3.0     | 2016-06-21 | NA              | NA           |
| 3.3.1     | 2016-10-31 | 1.0.44          | 2016-11-01   |
| 3.3.2     | 2017-03-06 | 1.0.136         | 2016-12-21   |
| 3.3.3     | 2017-04-21 | 1.0.143         | 2017-04-19   |
| 3.4.0     | 2017-06-30 | 1.0.143         | 2017-04-19   |
| 3.4.1     | 2017-09-28 | 1.0.153         | 2017-07-20   |
| 3.4.2     | 2017-11-30 | 1.1.383         | 2017-10-09   |
| 3.4.3     | 2018-03-15 | 1.1.442         | 2018-03-12   |
| 3.4.4     | 2018-04-23 | 1.1.447         | 2018-04-18   |
| 3.5.0     | 2018-07-02 | 1.1.447         | 2018-04-18   |
| 3.5.1     | 2018-12-20 | 1.1.463         | 2018-10-29   |
| 3.5.2     | latest     | latest          | latest       |

[More information here](https://github.com/rocker-org/rocker-versioned/blob/master/VERSIONS.md).
- `install.R` A list of `install.packages('package_name')` commands, one per line.
- If you have non-R binaries you'd like installed, just add them to a file called `apt.txt` and those will be installed before these steps. You just need to name one per line (e.g. `openssl`).
- All of these files will need to be committed to the top level of your repo.

You can now head to `mybinder.org`, paste in your URL, specify a branch if it is something other than master, and then build. There is also a badge you can copy into your README (`usethis` also has a `use_binder_badge()` function which can do this for you; it needs some updating, more below).

To launch an RStudio server instead of a Jupyter notebook, append `?urlpath=rstudio` to the URL after `/master/`.

📉 **Downsides to this approach**

This is by far the slowest way to set up a binder. Depending on the packages you need (e.g. tidyverse), it can take several hours (only for R packages) for the first image to be built, and the build may stall sometimes, requiring you to kick off a new build all over again.

## 🏇 A faster approach

A much faster approach is to include a Dockerfile. If mybinder detects one, it will skip over all other configuration methods and proceed to build your container. If using a Rocker image, you can also install additional package dependencies with [`installGithub.r`](https://github.com/eddelbuettel/littler/blob/master/inst/examples/installGithub.r) (the most concise way to install CRAN and GitHub packages; comes bundled with Rocker images).

You will need to gather the dependencies used in your scripts/notebooks. This can be done using helper functions (more notes on this here), or using packages like `containerit` that will write a Dockerfile for you.

I also prefer this approach because it provides an additional layer of usefulness for a user who may not want to use binder or Docker. For such users the `DESCRIPTION` file is a concise way to install the compendium locally.
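
A minimal `DESCRIPTION` of the kind mentioned above can be sketched in a few lines of shell. The package name, title, version and listed dependencies below are placeholders (assumptions for illustration, not taken from any real compendium); `devtools::install_deps()` reads dependency fields such as `Depends` and `Imports` from a file like this.

```shell
# Write a minimal DESCRIPTION to the top level of the project.
# All field values are hypothetical placeholders; replace them with your own.
cat > DESCRIPTION <<'EOF'
Package: mycompendium
Title: Code and Data for an Example Analysis
Version: 0.0.1
Depends:
    R (>= 3.5.0)
Imports:
    dplyr,
    ggplot2
EOF

# Quick check that the identifying fields are present.
grep -E '^(Package|Version):' DESCRIPTION
```

With a file like this at the top level of the repo, a collaborator should be able to run `devtools::install_deps()` from the project directory to install the listed packages without going through binder or Docker.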

Example minimal Dockerfile (thanks Carl) to make this work

```
## Start from the Rocker image that already includes the binder components
FROM rocker/binder:latest

## Copy the repo into the container and hand ownership to the notebook user
USER root
COPY . ${HOME}
RUN chown -R ${NB_USER} ${HOME}

## Become normal user again
USER ${NB_USER}
## Fetch the DESCRIPTION file and install the dependencies it lists
RUN wget https://github.com/karthik/binder-test-fastest/raw/master/DESCRIPTION && \
  R -e "devtools::install_deps()"
```

### ⚠ Caution

*If your base Docker image does not include binder-specific components, you will not be able to launch RStudio server (or even a Jupyter notebook): Binder will get through the setup but not launch an instance for you. To prepare your Dockerfile for binder, read [this guide](https://mybinder.readthedocs.io/en/latest/tutorials/dockerfile.html#preparing-your-dockerfile). My recommendation is to use the Rocker binder image as your base and then add other packages you need.*


## 🚀 Fastest and best recommendation for a compendium

- Create a minimal `DESCRIPTION` file for your project (TODO: more on this soon)
- Create a minimal Dockerfile that uses rocker/binder as base, copies files into your container and then uses `devtools`' `install_deps` to install everything declared in your `DESCRIPTION`.

This will build the container on the first run (and cache it after that, as long as your Git repo does not accrue further commits) and launch quickly from then on.

### 💥 Holepunch ⚠️ 📦

Also check out this new and experimental package called [`holepunch`](https://github.com/karthik/holepunch) to help you automate setup for a binder project.

### Limitations of mybinder
- Each instance is limited to 2 GB of RAM and will get destroyed after 10 minutes of inactivity. More on [memory issues](https://mybinder.readthedocs.io/en/latest/faq.html#how-much-memory-am-i-given-when-using-binder).
- Each instance can run for a maximum of 24 hours before it will get killed.
- You can get around these limitations by hosting your own BinderHub, but this will require compute + devops resources on your side. Read more at the [binderhub deployment guide](https://binderhub.readthedocs.io/en/latest/).

--------------------------------------------------------------------------------
/reproducible-data-analysis.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/karthik/rstudio2019/9a2265959e8fb747a103b627ce7bfe8a674cd6b7/reproducible-data-analysis.pdf
--------------------------------------------------------------------------------