├── case-studies
│   ├── .gitignore
│   ├── dturek.pdf
│   ├── jkitzes.pdf
│   ├── dturek.md
│   └── jkitzes.md
├── template.md
└── README.md

--------------------------------------------------------------------------------
/case-studies/.gitignore:
--------------------------------------------------------------------------------
mynotes.txt

--------------------------------------------------------------------------------
/case-studies/dturek.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/barbagroup/repro-case-studies/master/case-studies/dturek.pdf

--------------------------------------------------------------------------------
/case-studies/jkitzes.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/barbagroup/repro-case-studies/master/case-studies/jkitzes.pdf

--------------------------------------------------------------------------------
/template.md:
--------------------------------------------------------------------------------
##### Introduction
*Please answer these introductory questions for your case study in a few sentences.*

1) Who are you and what is your research field? Include your name, affiliation, discipline, and the background or context of your overall research that is necessary specifically to introduce your specific case study.

2) Define what the term "reproducibility" means to you generally and/or in the particular context of your case study.

##### Workflow diagram

The core of your case study is a diagrammatic workflow sketch, which depicts the entire pipeline of your workflow. Think of this like a circuit diagram: boxes represent steps, tools, or other disjoint components, and arrows show how outputs of particular steps feed as inputs to others. This diagram will be complemented by a textual narrative.

We recommend the site draw.io for creating this diagram, though you are also welcome to sketch it by hand. While creating your diagram, be sure to consider:

* specialized tools and where they enter your workflow
* the "state" of the data at each stage
* collaborators
* version control repos
* products produced at various stages (graphics, summaries, etc.)
* databases
* whitepapers
* customers

Each of the two example case studies includes a workflow diagram you can also use for guidance.

Please save your diagram alongside this completed case study template.

##### Workflow narrative

Referring to your diagram, describe your workflow for this specific project, from soup to nuts. Imagine walking a friend or a colleague through the basic steps, paying particular attention to links between steps. Don't forget to include "messy parts", loops, aborted efforts, and failures.

It may be helpful to consider the following questions, where interesting, applicable, and non-obvious from context. For each part of your workflow:

* **Frequency:** How often does the step happen and how long does it take?
* **Who:** Which members of your team participate (or not)?
* **Manual/Automated:** Is the step automated or does it involve human intervention (if so, is it recorded)?
* **Tools:** Which software or online tools are used in the step? How are they used?

In addition to detailing the steps of the workflow, you may wish to consider the following questions about the workflow as a whole:

* **Data:** Is your raw data online?
    * Is it citeable?
    * Does the license allow external researchers to publish a replication/confirmation of your published work?
* **Software:** Is the software online?
    * Is there documentation?
    * Are there tests?
    * Are there example input files alongside the code?
* **Processing:** Is your data processing workflow online?
    * Are the scripts documented?
    * Would an external researcher know what order to run them in?
    * Would they know what parameters to use?

*(500-800 words)*

##### Pain points
*Describe in detail the steps of a reproducible workflow which you consider to be particularly painful. How do you handle these? How do you avoid them? (200-400 words)*

##### Key benefits
*Discuss one or several sections of your workflow that you feel make your approach better than the "normal" non-reproducible workflow that others might use in your field. What does your workflow do better than the one used by your lesser-skilled colleagues and students, and why? What would you want them to learn from your example? (200-400 words)*

##### Key tools
*If applicable, provide a detailed description of a particular specialized tool that plays a key role in making your workflow reproducible, if you think that the tool might be of broader interest or relevance to a general audience. (200-400 words)*

##### General questions about reproducibility

*Please provide short answers (a few sentences each) to these general questions about reproducibility and scientific research. Rough ideas are appropriate here, as these will not be published with the case study. Please feel free to answer all or only some of these questions.*

1) Why do you think that reproducibility in your domain is important?

2) How or where did you learn the reproducible practices described in your case study? Mentors, classes, workshops, etc.

3) What do you see as the major pitfalls to doing reproducible research in your domain, and do you have any suggestions for working around these? Examples could include legal, logistical, human, or technical challenges.

4) What do you view as the major incentives for doing reproducible research?

5) Are there any broad reproducibility best practices that you'd recommend for researchers in your field?

6) Would you recommend any specific websites, training courses, or books for learning more about reproducibility?

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# Reproducibility Case Study

Thank you for volunteering to contribute to our reproducibility case study project. This document contains the following sections:

- [Project goal](#project-goal)
- [What is a case study?](#what-is-a-case-study)
- [Case study instructions](#case-study-instructions)
- [How to submit your case study](#submitting-your-case-study)

## Project goal

The overall goal of this project is to better understand and communicate existing practices surrounding reproducibility in academic research. Our strategy is to gather a collection of several dozen case studies, authored by individual researchers from across disciplines who volunteer to share their practices. Each case study will provide a single, concrete example of a workflow pipeline that embodies the principles of reproducible research.
We do not intend to create another anecdotally derived list of best practices or a discussion of the _why_ of reproducibility, but rather to gather and share examples of the _how_ of reproducibility practices in academia.

As you prepare your case study, consider your audience to be advanced undergraduates, graduate students, and other disciplinary scientists who want to learn the "how" of reproducible research by example. We will work to publish these case studies as an edited volume, where you will be credited as a chapter author for your case study, and in a GitHub repo or other similar open online platform.

For more information about the project goals or framework, contact Justin Kitzes (jkitzes@berkeley.edu).

## What is a case study?

A case study describes the computational workflow that you used to complete a single well-defined research output (e.g., a manuscript or equivalent). A case study specifically should not describe a lab, a person, a concept, or a general hardware/software stack. It should be an example of a concrete research activity, not a reflection. The goal is to capture your web of practices surrounding a specific research pipeline and its associated benefits and challenges.

Broadly, there are two types of topics that may be the subject of case studies. The first is a manuscript or equivalent "traditional" publication. The second is a piece of software that you have developed as part of your research. In both cases, the case study should describe the workflow used to create the final product. Examples of a [publication-based case study](case-studies/jkitzes.md) and a [software-based case study](case-studies/dturek.md) are included in this repository for your reference.

When choosing a subject for your case study, __think small and concrete__. For a sense of scale, a case study should cover research representing approximately 6 months to 1 year of work for a single researcher. If your projects tend to be large or complex, you may choose to focus on only a portion of a larger project for your case study. If you do not have a single project that you feel would be useful to share, you can describe an "idealized" workflow that combines features of several different real projects, although we are particularly interested in hearing about projects that were partially successful or that involved significant difficulties.

## Case study instructions

The case studies should be written using Markdown. To help you complete your case study, we have provided a template that will guide you through a series of standard short-answer questions. To complete your case study, simply open [this template](https://raw.githubusercontent.com/BIDS/repro-case-studies/master/template.md), save a copy of the template with a file name consisting of your first initial and last name (e.g. ``alovelace.md``), and replace the sections labeled ``[Answer]`` with your responses. Length suggestions are provided for each question. You may also wish to refer to the two sample case studies ([publication-based](case-studies/jkitzes.md) and [software-based](case-studies/dturek.md)) for guidance.

In summary, the core of the case study is a visual diagram showing your workflow with an accompanying narrative description. You'll also be asked for additional detail on specific pain points (portions of the reproducible workflow that failed, were incomplete, or were particularly challenging), key achievements, and key tools.
Several general questions on your views of reproducibility will complete the case study.

Each case study will be short, with a target of ~1500-2000 words (~6-9 pages) for the final version. We encourage you to write more than this if needed in your initial submission, recognizing that the editors will later help you to reduce this total.

## Submitting your case study

If possible, please submit your case study as a pull request to this repository, as described in the instructions below. Alternatively, you may email your contribution, consisting of the completed template file and a workflow diagram, to Justin Kitzes (jkitzes@berkeley.edu). Please feel free to contact Justin with any questions on submission, the case study, or the larger project.

Here are the steps to submit a case study as a pull request:

- [Fork](https://help.github.com/articles/fork-a-repo/) the [repro-case-studies](https://github.com/BIDS/repro-case-studies) repository on GitHub.

- Clone that repository onto your local machine using git.

- Two examples are provided in the ``case-studies`` directory. These are at ``case-studies/jkitzes.md`` and ``case-studies/dturek.md``. Associated workflow diagrams are in the corresponding PDF files.

- Copy the template.md file into the ``case-studies`` directory, and name this file with your first initial and last name (e.g. ``alovelace.md``).

- Open the file and respond to each question. Experience has shown that it is easiest to answer the questions in the order presented.

- Add an image file (preferably PDF) containing your workflow diagram.

- Please do not modify any files other than your case study.

- Once you are ready to submit your case study, file a [pull request](https://help.github.com/articles/using-pull-requests/) on GitHub.

## Previewing your case study

We recommend three ways to view your case study document.

1. [Commit and push your changes](http://readwrite.com/2013/10/02/github-for-beginners-part-2) to your GitHub fork. Then open a browser and navigate to ``https://github.com/YOURUSERNAME/repro-case-studies/blob/master/case-studies/YOURFILENAME.md``
2. Open it on your computer with an application like [Atom](https://atom.io/)
3. Install Pandoc and, in your ``case-studies`` directory, run ``pandoc YOURFILENAME.md -o YOURFILENAME.pdf``

--------------------------------------------------------------------------------
/case-studies/dturek.md:
--------------------------------------------------------------------------------
##### Introduction

*Please answer these introductory questions for your case study in a few sentences.*

1) Who are you and what is your research field? Include your name, affiliation, discipline, and the background or context of your overall research that is necessary specifically to introduce your specific case study.

My name is Daniel Turek, and my area of research is computational statistics and algorithms, generally with application to problems of ecological statistics. The workflow I describe is the process, from idea inception to publication, of creating an automated procedure to improve the sampling efficiency of Markov chain Monte Carlo (MCMC) sampling. MCMC is an accessible and commonly used statistical technique for performing inference on hierarchical model structures.

2) Define what the term "reproducibility" means to you generally and/or in the particular context of your case study.

In the context of my case study, reproducibility means that users and reviewers can re-create the results (improvements in MCMC efficiency) presented in our manuscript. However, the results will not match exactly, due to small differences in algorithm runtime.

##### Workflow diagram

[Diagram](dturek.pdf)

##### Workflow narrative

The process begins with team brainstorming of how an automated procedure for improving MCMC efficiency could work. This is arguably the most fun part of the entire process: anywhere from two to four people hitting the whiteboard to discuss ideas. Each of several sessions lasts a few hours. We review theory and literature between these sessions, too. This initial exploration occurs over one or two weeks.

A plausible idea is hatched, and now must be prototyped to assess effectiveness. The project lead implements the algorithm in R, since our engine for doing MCMC runs natively there. This works well for our team, since everyone is comfortable in R, and code may be shared and reviewed easily. We create a private GitHub repository where members of our team write, review, and modify the algorithm. This is a private repo amongst us, since it’s entirely experimental at this point and not intended for the public. There is little (or no) documentation at this point.

Multiple iterations are possible at this stage, whereby ideas are implemented and undergo preliminary testing. Depending on the results of each iteration, we go back to the drawing board several times, to figure out where the previous algorithm failed, or how it can be improved. Once again, we implement a newer version, and test it using a small number of tests we’ve devised. This part of the process is time-consuming, and potentially frustrating, as many dead ends are reached. The path forward is not always clear. This process of revising and testing our algorithm may take 3 to 6 months.

Eventually, this process converges to a functional algorithm. All members of our team are satisfied with the results, and agree the algorithm is ready for a more professional implementation, and hopefully publication.

One or two team members (those most familiar with the MCMC engine) produce a more formal implementation of our algorithm. This implementation is added to an existing public GitHub repository, which contains the basis of the MCMC engine for public use. This step should only take a few weeks, since the algorithm is well defined and finalized. Appropriate documentation is also written in the form of R help files, which are also added to the public repo.

The next goal is to produce a published research paper describing the algorithm and results. Towards this end, we assemble a suite of reproducible examples. These come from known, standard, existing models and published data sets, which are chosen as being either “common” or “difficult” applications of MCMC. A new public GitHub repository is created, and these example models and data sets are added in the form of R data files. Bash scripts for running our new algorithm on these examples are added as well, along with a helpful README file. The sole purpose of this repo is to be referenced in our manuscript, as a place containing fully automated scripts for reproducing the results presented in the manuscript.
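
To give a sense of the shape of these scripts, the driver for a single example might look roughly like the sketch below. The function and file names here are hypothetical; the actual, complete scripts live in the public examples repository.

```r
## Minimal sketch of one example's driver script (hypothetical names).
library(coda)                               # provides effectiveSize() and as.mcmc()

set.seed(1)                                 # fix the RNG so the samples themselves are identical across runs
model <- readRDS("models/example1.rds")     # example model and data, stored as an R data file

runtime <- system.time(
  samples <- runAutomatedMCMC(model, niter = 100000)  # hypothetical wrapper around our algorithm
)["elapsed"]

ess <- effectiveSize(as.mcmc(samples))      # effective sample size for each parameter
efficiency <- min(ess) / runtime            # ESS per second: the quantity that depends on hardware speed

write.csv(data.frame(min_ess = min(ess), seconds = runtime, efficiency = efficiency),
          "results/example1_efficiency.csv", row.names = FALSE)
```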

Our reproducible examples include fixing the random number generator seed in the executable scripts, so we can guarantee the same sampling results for each MCMC run (the algorithm is otherwise stochastic). However, the exact *timing* of each MCMC run will vary between runs and computing platforms, and hence the final measure of efficiency will vary, too. Thus, the exact results are not perfectly reproducible, but vary by approximately 5% between runs.

Team members jointly contribute to drafting a manuscript describing our new algorithm, which presents the results from the suite of example models. The manuscript is written collaboratively in LaTeX. It specifically references the repository of reproducible examples, and also explains the caveat about exact reproduction of the results — namely, that they will vary slightly from those presented, and why. The reviewers are nonetheless thrilled with the algorithm and the reproducible nature of our research, and readily accept the manuscript for publication.

##### Pain points

*Describe in detail the steps of a reproducible workflow which you consider to be particularly painful. How do you handle these? How do you avoid them?*

The iterative process of devising and testing our algorithm is not well documented or particularly reproducible. The only saving grace is that GitHub is used for version control, so in theory we could look backwards at previous work, if necessary. But in practice, the commit messages are short and not very descriptive, since everything is experimental at this point. What’s more, there’s basically no documentation accompanying our code. It would be difficult to actually review previous versions of the algorithm or results, if it were necessary.

In addition, the fact that our set of “reproducible” examples is not perfectly reproducible is a small point of contention. We are conflicted about calling these examples reproducible, since the results presented in our manuscript cannot actually be recreated exactly. Team members agree that this appears to be unavoidable. We explain this in the manuscript, and call our results “reproducible” nonetheless.

For preparation of our manuscript, numerical results are manually typed into a tex document. Tools such as knitr and Sweave exist for automating this process, incorporating numeric and graphical results directly from R into LaTeX. We opted not to use these tools to automate the interaction between R and our manuscript, since not all team members are familiar or comfortable with them. Preparation of the manuscript would have been simpler and less error-prone had we used these tools, which probably would have been a wise decision, but the learning curve deterred our team from doing so.

##### Key benefits

*Discuss one or several sections of your workflow that you feel make your approach better than the "normal" non-reproducible workflow that others might use in your field. What does your workflow do better than the one used by your lesser-skilled colleagues and students, and why? What would you want them to learn from your example?*

The most notably reproducible aspect of this project is the public repo containing input data and scripts for re-running all analyses appearing in our manuscript. This includes individual bash scripts for running each particular analysis, as well as a single “master” script which re-runs all analyses.
A reviewer can easily reproduce (to within a small margin of error) all numerical results appearing in the manuscript, and researchers reading the ensuing publication have an easy path forward to using the algorithm themselves.

##### General questions about reproducibility [Optional]

1) Why do you think that reproducibility in your domain is important?

Reproducibility is important so that others may verify the results given in our publication. This ensures that the results are genuine, and also gives a clear path forward for others to use our algorithm.

2) How or where did you learn the reproducible practices described in your case study? Mentors, classes, workshops, etc.

Mostly through the use of GitHub, from colleagues at Berkeley, and through general programming experience. No specific classes or workshops come to mind.

3) What do you see as the major pitfalls to doing reproducible research in your domain, and do you have any suggestions for working around these? Examples could include legal, logistical, human, or technical challenges.

In the area of computational statistics, there are not many barriers to reproducible research aside from ignorance or technical inability. However, this case study does highlight one genuine obstacle: performance differences between various machines and computing platforms, which affect algorithm runtime and therefore factor into our measure of efficiency.

4) What do you view as the major incentives for doing reproducible research?

Primarily so that others may actually (and easily) verify our results, if they so choose.

5) Are there any broad reproducibility best practices that you'd recommend for researchers in your field?

6) Would you recommend any specific websites, training courses, or books for learning more about reproducibility?

--------------------------------------------------------------------------------
/case-studies/jkitzes.md:
--------------------------------------------------------------------------------
##### Introduction
*Please answer these introductory questions for your case study in a few sentences.*

1) Who are you and what is your research field? Include your name, affiliation, discipline, and the background or context of your overall research that is necessary specifically to introduce your specific case study.

My name is Justin Kitzes, and I am a quantitative ecologist who develops theory and models to predict the effects of land use and climate change on biodiversity. I am currently a postdoc in the Energy and Resources Group and a Data Science Fellow in the Institute for Data Science at the University of California, Berkeley.

2) Define what the term "reproducibility" means to you generally and/or in the particular context of your case study.

I consider a study to be (computationally) reproducible when I can send a colleague a zip file containing my raw data and code and he or she can push a single button to create all of the results, tables, and figures in my analysis. It can, of course, be quite challenging to achieve this goal with anything short of the simplest scientific workflows.

##### Workflow diagram

[Diagram](jkitzes.pdf)

##### Workflow narrative

This study investigated whether several common species of bats showed decreased activity near three large highways around San Francisco Bay.
Activity in this study was defined as the number of ultrasonic foraging calls recorded by autonomous acoustic detectors. The core analysis tasks involved collecting raw call data using the detectors, extracting features of the recorded calls, classifying the calls to the species level, and performing statistical analysis on the resulting nightly call counts as a function of predictor variables, including distance from the road. The final analysis was published in PLoS ONE (doi:10.1371/journal.pone.0096341).

Two different types of detectors were used, one of which recorded data in zero-crossing format and the other in full spectrum format. The full spectrum data was converted to zero-crossing using a closed-source utility provided by the detector manufacturer, with conversion parameters selected to produce output similar to the recordings from the native zero-crossing detector. The free, closed-source software AnalookW, developed by an individual researcher who has been active for many years in bat call analysis, was used to filter out files containing only noise and, for the remaining calls, to extract twelve features describing each call. These features were saved in a CSV table.

The calls were then classified by species, which required the construction of a custom random forest classifier. A reference library containing zero-crossing calls made by individual bats identified in hand was obtained from a personal contact, and the same twelve features were extracted for these calls using AnalookW. A random forest classifier was trained on this data using the Python package scikit-learn v0.12. Classifier accuracy was evaluated using cross validation and a confusion matrix.

The classifier was then applied to identify the recorded calls to the species level, creating a classified call table. This table was summarized into a nightly pass table, which aggregated the calls into passes consisting of multiple, closely spaced calls, and counted the number of passes of each species recorded in each night, at each distance from the road. Environmental and site variables were joined to this pass data to create the final table for statistical analysis.

As functions for fitting generalized linear mixed models (GLMMs) were not available in Python, statistical analysis was carried out in R. Exploratory analysis showed that a Poisson regression was not appropriate for the data, so a negative binomial GLMM was fit to the nightly counts of all species and separately to those of four common species. The model results were saved as a table that later appeared in the final manuscript. The model result table was then read by a Python script, which created and saved several figures, one of which appeared in the final manuscript. The final manuscript was written in LaTeX and submitted to journals in that format.

In addition to the manuscript, this analysis also led to the creation of the open source software BatID, which bundled the classifier object with a web-based GUI to enable non-programmers to classify California bat calls using a workflow similar to the one presented here.

##### Pain points
*Describe in detail the steps of a reproducible workflow which you consider to be particularly painful. How do you handle these?
How do you avoid them?*

At the beginning of the pipeline, two graphical programs had to be used, one to convert a proprietary data format to the zero-crossing format and the second (AnalookW) to perform feature extraction on the zero-crossing calls. Both of these steps required parameters to be entered into these programs, which I was careful to document manually, as this information can otherwise easily be lost. AnalookW runs only on Windows, which required me (and any analysts wishing to use my later software BatID) to locate a Windows computer to complete this step. Although I code faster and more accurately in Python, I needed to switch to R for statistical analysis, as the necessary packages were not (and still are not) available for Python. A major headache at the manuscript stage arose because the R statistical functions reported output only as a non-machine-readable text file or as an object, which required me to create the final table, containing coefficients and standard errors for 14 variables across 5 models, by hand.

Once I created and released the BatID software, a problem immediately arose when scikit-learn updated to v0.13, which could not read the classifier object created during my analysis. Additionally, the original BatID package required a user to install a full scientific Python stack, a task that proved difficult for precisely the audience of non-programmers that I was hoping to reach. I eventually used pyinstaller to create a standalone binary executable for Windows, reasoning that users of the software needed a Windows computer anyway in order to run AnalookW as a prior step in the analysis.

##### Key benefits
*Discuss one or several sections of your workflow that you feel make your approach better than the "normal" non-reproducible workflow that others might use in your field. What does your workflow do better than the one used by your lesser-skilled colleagues and students, and why? What would you want them to learn from your example?*

Of all aspects of the analysis, I am particularly happy about the effort that I put into creating the BatID standalone classifier software. As the majority of my colleagues in ecology are non-programmers or novice programmers, I believe that these types of user-friendly tools are critical to advancing the state of science in my field, as well as the uptake of new methods by non-academic non-profit and agency scientists, and I hope that more of my computationally oriented colleagues will engage in similar activities in the future.

Ironically, of course, in creating a tool for non-programmers, I also created another graphical program that cannot easily be scripted into a workflow such as the one described here. I attempted to ameliorate this concern in the most recent BatID version by requiring users to create a text file containing all parameters, which is read by the program along with the data file, and by having the program save all results in the same directory as the parameter file, along with a log file. This at least ensures that there is, by default, some record of the program version, time, and parameters used to process the raw data into classified results tables.
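
This "parameter file plus log file" pattern is simple enough to sketch generically. The snippet below is an illustration of the idea only, written in R (not BatID's actual Python code) and using hypothetical file and field names: the program reads a key-value parameter file, writes its outputs next to that file, and records the version, time, and parameters used.

```r
## Illustrative sketch of the "parameter file plus log file" pattern (hypothetical names).
param_file <- "survey2011/batid_params.txt"   # simple "Key: value" lines supplied by the user
run_dir    <- dirname(param_file)             # all outputs are saved beside the parameter file

params <- read.dcf(param_file)                # one-row matrix of named parameter values
calls  <- read.csv(file.path(run_dir, "call_features.csv"))

## ... classification would happen here ...
classified <- cbind(calls, species = NA)      # placeholder for the classifier's output
write.csv(classified, file.path(run_dir, "classified_calls.csv"), row.names = FALSE)

## Record the program version, run time, and every parameter used, so the output
## table can always be traced back to the exact inputs that produced it.
writeLines(c(paste("program version:", "1.0.0"),
             paste("run time:", format(Sys.time())),
             paste(colnames(params), params[1, ], sep = ": ")),
           file.path(run_dir, "run_log.txt"))
```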

##### Key tools [Optional]
*If applicable, provide a detailed description of a particular specialized tool that plays a key role in making your workflow reproducible, if you think that the tool might be of broader interest or relevance to a general audience.*

[Answer, 200-400 words]

##### General questions about reproducibility [Optional]

*Please provide short answers (a few sentences each) to these general questions about reproducibility and scientific research. Rough ideas are appropriate here, as these will not be published with the case study. Please feel free to answer all or only some of these questions.*

1) Why do you think that reproducibility in your domain is important?

I think that reproducibility is particularly important in fields like ecology in which researchers are constantly striving to make increasingly detailed inference and predictions using relatively scarce data. Although I do not have evidence to this effect, it seems logical to me that in these "high leverage" types of analyses, small analytical decisions (how data are cleaned, the options passed to optimizers, etc.) could play a disproportionate role in influencing the eventual study outcomes, and thus need to be fully documented and shared. An easy way to guarantee that all of these decisions are recorded is to make one's entire analysis reproducible by others. More broadly, I feel strongly that reproducibility is a basic component of good science. Now that "doing science" requires communicating more detail than can be easily expressed in narrative form in a manuscript, releasing code and data seems completely necessary across all domains.

2) How or where did you learn the reproducible practices described in your case study? Mentors, classes, workshops, etc.

I started learning these tools through workshops, in particular Josh Bloom's Python bootcamp and through teaching Software Carpentry bootcamps with other experienced instructors. I also learned a great deal by working closely with Chloe Lewis, a former student in my PhD lab who had previously worked at Microsoft, on the development of our software package _macroeco_, which I continue to maintain. I picked up more advanced techniques and ideas mostly through web searches while attempting to get unstuck from issues that arose in my own research and software development.

3) What do you see as the major pitfalls to doing reproducible research in your domain, and do you have any suggestions for working around these? Examples could include legal, logistical, human, or technical challenges.

A major challenge in ecology continues to be data sharing and access. Many field ecologists, understandably, are reluctant to share their hard-won raw data with other researchers. I suspect that this caution arises both from a sense of professional necessity (i.e., I invested a ton of time collecting this data and I am going to be the one to publish all of the analyses using it) and from the feeling that the numbers alone cannot possibly capture all the subtle nuances they observed in the field that are important to truly understanding the data, its potential, and its limitations.
In particular, information about sampling bias (as derived from choices of study site, sampling techniques, season, missing data, and many other factors) cannot easily be described in numeric form -- I also suspect that many field ecologists believe that it cannot even be fully conveyed in the text of a manuscript, implying, very unfortunately for reproducibility, that the person who collected the data is the only one qualified to analyze it. What data is published and available tends to be relatively small and of heterogeneous format, and thus tends to be locked up in printed PDF tables and other non-machine-readable formats.

4) What do you view as the major incentives for doing reproducible research?

I'm not sure that there are any major external incentives in my field -- certainly, in principle, releasing reproducible research could increase the number of researchers who end up citing your manuscript, but this seems somewhat indirect. Some journals, like PLoS, are now mandating that all novel computer code be uploaded as manuscript SI, but it seems that this requirement is not checked particularly thoroughly.

5) Are there any broad reproducibility best practices that you'd recommend for researchers in your field?

[Answer]

6) Would you recommend any specific websites, training courses, or books for learning more about reproducibility?

[Answer]

--------------------------------------------------------------------------------