├── .gitignore ├── convert_html └── README ├── coverimageprint.png ├── full cover.png ├── isbn_barcode.pdf ├── manuscript ├── Book.txt ├── EDA.Rmd ├── EDA.md ├── EDAchecklist.Rmd ├── EDAchecklist.md ├── Interpreting_Your_Results.Rmd ├── Interpreting_Your_Results.md ├── Sample.txt ├── StatingandRefiningTheQuestion.Rmd ├── StatingandRefiningTheQuestion.md ├── Subset.txt ├── TODO ├── _book │ ├── EDA.md │ ├── Interpreting_Your_Results.md │ ├── StatingandRefiningTheQuestion.md │ ├── a-quick-example.html │ ├── abc-always-be-checking-your-ns.html │ ├── about-the-authors.html │ ├── about.md │ ├── air-pollution-and-mortality-in-new-york-city.html │ ├── applying-the-epicycle-of-analysis-process.html │ ├── applying-the-epicycle-to-stating-and-refining-your-question.html │ ├── associational-analyses.html │ ├── attitude.html │ ├── case-study-non-diet-soda-consumption-and-body-mass-index.html │ ├── case-study.html │ ├── chapter-communication.html │ ├── chapter-formal.html │ ├── chapter-inference.html │ ├── chapter-interpretation.html │ ├── chapter-question.html │ ├── characteristics-of-a-good-question.html │ ├── check-the-packaging.html │ ├── collecting-information.html │ ├── communication.md │ ├── comparing-expectations-to-data.html │ ├── comparing-model-expectations-to-reality.html │ ├── concluding-thoughts.html │ ├── content.html │ ├── css │ │ └── style.css │ ├── describe-a-model-for-the-population.html │ ├── describe-the-sampling-process.html │ ├── epicycle-of-analysis.html │ ├── epicycles-of-analysis.html │ ├── epicycles.md │ ├── examining-linear-relationships.html │ ├── example-apple-music-usage.html │ ├── exploratory-data-analysis-checklist-a-case-study.html │ ├── exploratory-data-analysis.html │ ├── factors-affecting-the-quality-of-inference.html │ ├── follow-up-questions.html │ ├── formalmodeling.md │ ├── formulate-your-question.html │ ├── general-framework.html │ ├── identify-the-population.html │ ├── images │ │ ├── EDA-unnamed-chunk-16-1.png │ │ ├── 
EDA-unnamed-chunk-17-1.png │ │ ├── EDA-unnamed-chunk-20-1.png │ │ ├── epicycle.png │ │ ├── formal-unnamed-chunk-10-1.png │ │ ├── formal-unnamed-chunk-12-1.png │ │ ├── formal-unnamed-chunk-2-1.png │ │ ├── formal-unnamed-chunk-5-1.png │ │ ├── inferencepred-unnamed-chunk-10-1.png │ │ ├── inferencepred-unnamed-chunk-11-1.png │ │ ├── inferencepred-unnamed-chunk-3-1.png │ │ ├── inferencepred-unnamed-chunk-4-1.png │ │ ├── inferencepred-unnamed-chunk-5-1.png │ │ ├── interpret-unnamed-chunk-1-1.png │ │ ├── model-unnamed-chunk-10-1.png │ │ ├── model-unnamed-chunk-11-1.png │ │ ├── model-unnamed-chunk-12-1.png │ │ ├── model-unnamed-chunk-3-1.png │ │ ├── model-unnamed-chunk-5-1.png │ │ ├── model-unnamed-chunk-6-1.png │ │ ├── model-unnamed-chunk-7-1.png │ │ ├── penguin1.png │ │ ├── penguin2.png │ │ ├── penguin3.png │ │ └── table-epicycles.png │ ├── index.html │ ├── index.md │ ├── inference-vs-prediction-implications-for-modeling-strategy.html │ ├── inference.md │ ├── inferencepred.md │ ├── inferring-an-association.html │ ├── libs │ │ ├── gitbook-2.6.7 │ │ │ ├── css │ │ │ │ ├── fontawesome │ │ │ │ │ └── fontawesome-webfont.ttf │ │ │ │ ├── plugin-bookdown.css │ │ │ │ ├── plugin-fontsettings.css │ │ │ │ ├── plugin-highlight.css │ │ │ │ ├── plugin-search.css │ │ │ │ └── style.css │ │ │ └── js │ │ │ │ ├── app.min.js │ │ │ │ ├── jquery.highlight.js │ │ │ │ ├── lunr.js │ │ │ │ ├── plugin-bookdown.js │ │ │ │ ├── plugin-fontsettings.js │ │ │ │ ├── plugin-search.js │ │ │ │ └── plugin-sharing.js │ │ └── jquery-2.2.3 │ │ │ └── jquery.min.js │ ├── look-at-the-top-and-the-bottom-of-your-data.html │ ├── make-a-plot.html │ ├── modelbuilding.md │ ├── models-as-expectations.html │ ├── populations-come-in-many-forms.html │ ├── predicting-the-outcome.html │ ├── prediction-analyses.html │ ├── principles-of-interpretation.html │ ├── reacting-to-data-refining-our-expectations.html │ ├── read-in-your-data.html │ ├── routine-communication.html │ ├── search_index.json │ ├── setting-expectations.html │ 
├── setting-the-scene.html │ ├── style.html │ ├── summary-1.html │ ├── summary-2.html │ ├── summary.html │ ├── the-audience.html │ ├── translating-a-question-into-a-data-problem.html │ ├── try-the-easy-solution-first.html │ ├── types-of-questions.html │ ├── using-models-to-explore-your-data.html │ ├── validate-with-at-least-one-external-data-source.html │ ├── what-are-the-goals-of-formal-modeling.html │ └── when-do-we-stop.html ├── _bookdown_files │ └── EDA_cache │ │ └── html │ │ ├── __packages │ │ ├── unnamed-chunk-16_dce4065f9d0c9cd692c2eea208ef00f2.RData │ │ ├── unnamed-chunk-16_dce4065f9d0c9cd692c2eea208ef00f2.rdb │ │ └── unnamed-chunk-16_dce4065f9d0c9cd692c2eea208ef00f2.rdx ├── about.md ├── artofdatascience.Rproj ├── artofdatascience.rds ├── backmatter.txt ├── buildbook.R ├── causalinference.Rmd ├── causalinference.md ├── causalinference.qmd ├── collaboration.md ├── communication.Rmd ├── communication.md ├── conclusion.md ├── data │ ├── avgpm25.csv │ ├── bmi_pm25_no2_sim.csv │ ├── face.rda │ ├── hourly_ozone.rds │ ├── leanpub-rprogramming-interested-readers-20150430.csv │ └── maacs_sim.csv ├── epicycles.md ├── epicycles_table.Rmd ├── equation.pl ├── fixmath.R ├── formalmodeling.Rmd ├── formalmodeling.md ├── frontmatter.txt ├── images │ ├── 6x9 HQ cover.jpg │ ├── 6x9 cover 300dpi.jpg │ ├── EDA-unnamed-chunk-16-1.png │ ├── EDA-unnamed-chunk-17-1.png │ ├── EDA-unnamed-chunk-20-1.png │ ├── epicycle.jpg │ ├── epicycle.png │ ├── formal-unnamed-chunk-10-1.png │ ├── formal-unnamed-chunk-12-1.png │ ├── formal-unnamed-chunk-2-1.png │ ├── formal-unnamed-chunk-4-1.png │ ├── formal-unnamed-chunk-5-1.png │ ├── inferencepred-unnamed-chunk-10-1.png │ ├── inferencepred-unnamed-chunk-11-1.png │ ├── inferencepred-unnamed-chunk-3-1.png │ ├── inferencepred-unnamed-chunk-4-1.png │ ├── inferencepred-unnamed-chunk-5-1.png │ ├── interpret-unnamed-chunk-1-1.png │ ├── model-unnamed-chunk-10-1.png │ ├── model-unnamed-chunk-11-1.png │ ├── model-unnamed-chunk-12-1.png │ ├── 
model-unnamed-chunk-3-1.png │ ├── model-unnamed-chunk-4-1.png │ ├── model-unnamed-chunk-5-1.png │ ├── model-unnamed-chunk-6-1.png │ ├── model-unnamed-chunk-7-1.png │ ├── penguin1.png │ ├── penguin2.png │ ├── penguin3.png │ ├── table-epicycles.png │ ├── title_page.png │ └── title_page_orig.png ├── inference.Rmd ├── inference.md ├── inferencepred.Rmd ├── inferencepred.md ├── listPackages.R ├── mainmatter.txt ├── modelbuilding.Rmd ├── modelbuilding.md ├── ny.rds ├── o3_ny_1999.csv └── preface.md ├── preview └── artofdatascience-preview.pdf ├── problems ├── figures │ ├── unnamed-chunk-2-1.png │ ├── unnamed-chunk-3-1.png │ ├── unnamed-chunk-4-1.png │ └── unnamed-chunk-5-1.png ├── problems.Rmd ├── problems.html └── problems.md └── published └── artofdatascience.pdf /.gitignore: -------------------------------------------------------------------------------- 1 | _book 2 | _bookdown_files 3 | artofdatascience.rds 4 | rsconnect 5 | .Rproj.user 6 | .Rhistory 7 | .RData 8 | .Ruserdata 9 | -------------------------------------------------------------------------------- /convert_html/README: -------------------------------------------------------------------------------- 1 | This folder is used to convert HTML files to Markdown and place them in your book. 2 | 3 | To do this: 4 | 5 | 1) Place the HTML files in this folder 6 | 2) Go to the import page for your book (E.g. http://leanpub.com/yourbookname/import) 7 | 3) Click on the "Start Conversion" button. 8 | 9 | We will take all files that end in .html, convert them to markdown and place them in your manuscript folder. 10 | 11 | The HTML files will be placed in convert_done after they are converted. 
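If you want to preview roughly what the conversion will produce before uploading, the core HTML-to-Markdown step can be sketched locally. The sketch below is an illustration only — it is not Leanpub's converter — and it assumes very simple HTML (headings, paragraphs, bold/italic only):

```python
# Minimal local sketch of the HTML -> Markdown step (illustration only;
# the Leanpub importer handles far more than this subset).
from html.parser import HTMLParser


class SimpleMarkdownConverter(HTMLParser):
    """Convert a small subset of HTML (h1-h6, p, em/i, strong/b) to Markdown."""

    def __init__(self):
        super().__init__()
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag in ("h1", "h2", "h3", "h4", "h5", "h6"):
            self.out.append("#" * int(tag[1]) + " ")   # e.g. <h2> -> "## "
        elif tag in ("em", "i"):
            self.out.append("*")
        elif tag in ("strong", "b"):
            self.out.append("**")

    def handle_endtag(self, tag):
        if tag in ("em", "i"):
            self.out.append("*")
        elif tag in ("strong", "b"):
            self.out.append("**")
        elif tag in ("p", "h1", "h2", "h3", "h4", "h5", "h6"):
            self.out.append("\n\n")                    # block elements end a paragraph

    def handle_data(self, data):
        self.out.append(data)


def html_to_markdown(html: str) -> str:
    conv = SimpleMarkdownConverter()
    conv.feed(html)
    return "".join(conv.out).strip()


print(html_to_markdown("<h1>Preface</h1><p>This is <strong>bold</strong>.</p>"))
```

For anything beyond trivial markup you would use the import page above (or a full converter such as pandoc) rather than a hand-rolled parser like this.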
12 | 13 | -------------------------------------------------------------------------------- /coverimageprint.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/rdpeng/artofdatascience/095fcf05885181cbca6d245f463e9cf3dfe5543e/coverimageprint.png -------------------------------------------------------------------------------- /full cover.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/rdpeng/artofdatascience/095fcf05885181cbca6d245f463e9cf3dfe5543e/full cover.png -------------------------------------------------------------------------------- /isbn_barcode.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/rdpeng/artofdatascience/095fcf05885181cbca6d245f463e9cf3dfe5543e/isbn_barcode.pdf -------------------------------------------------------------------------------- /manuscript/Book.txt: -------------------------------------------------------------------------------- 1 | preface.md 2 | epicycles.md 3 | StatingandRefiningTheQuestion.md 4 | EDA.md 5 | modelbuilding.md 6 | inference.md 7 | formalmodeling.md 8 | inferencepred.md 9 | causalinference.md 10 | Interpreting_Your_Results.md 11 | communication.md 12 | collaboration.md 13 | conclusion.md 14 | backmatter.txt 15 | about.md 16 | -------------------------------------------------------------------------------- /manuscript/Sample.txt: -------------------------------------------------------------------------------- 1 | preface.md 2 | epicycles.md 3 | -------------------------------------------------------------------------------- /manuscript/Subset.txt: -------------------------------------------------------------------------------- 1 | epicycles.md 2 | -------------------------------------------------------------------------------- /manuscript/TODO: 
-------------------------------------------------------------------------------- 1 | Chapter on "Evaluating Success of Data Science Project" 2 | 3 | Add something on model fit to Formal Modeling 4 | 5 | Short section/chapter on making data analysis presentations 6 | -------------------------------------------------------------------------------- /manuscript/_book/about.md: -------------------------------------------------------------------------------- 1 | # About the Authors 2 | 3 | **Roger D. Peng** is a Professor of Biostatistics at the 4 | Johns Hopkins Bloomberg School of Public Health. He is also a 5 | Co-Founder of the [Johns Hopkins Data Science 6 | Specialization](http://www.coursera.org/specialization/jhudatascience/1), 7 | which has enrolled over 1.5 million students, the [Johns Hopkins 8 | Executive Data Science 9 | Specialization](https://www.coursera.org/specializations/executive-data-science), 10 | the [Simply Statistics blog](http://simplystatistics.org/) where he 11 | writes about statistics and data science for the general public, and 12 | the [Not So Standard Deviations](https://soundcloud.com/nssd-podcast) 13 | podcast. Roger can be found on Twitter and GitHub under the user name 14 | [@rdpeng](https://twitter.com/rdpeng). 15 | 16 | **Elizabeth Matsui** is a Professor of Pediatrics, Epidemiology and 17 | Environmental Health Sciences at Johns Hopkins University and a 18 | practicing pediatric allergist/immunologist. She directs a data 19 | management and analysis center with Dr. Peng that supports 20 | epidemiologic studies and clinical trials and is co-founder of 21 | [Skybrude Consulting, LLC](http://skybrudeconsulting.com), a data 22 | science consulting firm. Elizabeth can be found on Twitter 23 | [@eliza68](https://twitter.com/eliza68). 
24 | -------------------------------------------------------------------------------- /manuscript/_book/communication.md: -------------------------------------------------------------------------------- 1 | --- 2 | output: 3 | html_document: 4 | keep_md: yes 5 | --- 6 | # Communication {#chapter-communication} 7 | 8 | Communication is fundamental to good data analysis. What we aim to address in this chapter is the role of routine communication in the process of doing your data analysis and in disseminating your final results in a more formal setting, often to an external, larger audience. There are lots of good books that address the "how-to" of giving formal presentations, either in the form of a talk or a written piece, such as a white paper or scientific paper. In this chapter, though, we will focus on: 9 | 10 | 1. How to use routine communication as one of the tools needed to perform a good data analysis; and 11 | 12 | 2. How to convey the key points of your data analysis when communicating informally and formally. 13 | 14 | Communication is both one of the tools of data analysis, and also the final product of data analysis: there is no point in doing a data analysis if you're not going to communicate your process and results to an audience. A good data analyst communicates informally multiple times during the data analysis process and also gives careful thought to communicating the final results so that the analysis is as useful and informative as possible to the wider audience it was intended for. 15 | 16 | ## Routine communication 17 | 18 | The main purpose of routine communication is to gather data, which is part of the epicyclic process for each core activity. You gather data by communicating your results, and the responses you receive from your audience should inform the next steps in your data analysis.
The types of responses you receive include not only answers to specific questions, but also commentary and questions your audience has in response to your report (either written or oral). The form that your routine communication takes depends on what the goal of the communication is. If your goal, for example, is to get clarity on how a variable is coded because when you explore the dataset it appears to be an ordinal variable, but you had understood that it was a continuous variable, your communication is brief and to the point. 19 | 20 | If, on the other hand, some results from your exploratory data analysis are not what you expected, your communication may take the form of a small, informal meeting that includes displaying tables and/or figures pertinent to your issue. A third type of informal communication is one in which you may not have specific questions to ask of your audience, but instead are seeking feedback on the data analysis process and/or results to help you refine the process and/or to inform your next steps. 21 | 22 | In sum, there are three main types of informal communication and they are classified based on the objectives you have for the communication: (1) to answer a very focused question, which is often a technical question or a question aimed at gathering a fact, (2) to help you work through some results that are puzzling or not quite what you expected, and (3) to get general impressions and feedback as a means of identifying issues that had not occurred to you so that you can refine your data analysis. 23 | 24 | Focusing on a few core concepts will help you achieve your objectives when planning routine communication. These concepts are: 25 | 26 | 1. **Audience**: Know your audience and when you have control over who the audience is, select the right audience for the kind of feedback you are looking for. 27 | 28 | 2. 
**Content**: Be focused and concise, but provide sufficient information for the audience to understand the information you are presenting and the question(s) you are asking. 29 | 30 | 3. **Style**: Avoid jargon. Unless you are communicating about a focused, highly technical issue to a highly technical audience, it is best to use language, figures, and tables that can be understood by a more general audience. 31 | 32 | 4. **Attitude**: Have an open, collaborative attitude so that you are ready to fully engage in a dialogue and so that your audience gets the message that your goal is not to "defend" your question or work, but rather to get their input so that you can do your best work. 33 | 34 | ## The Audience 35 | 36 | For many types of routine communication, you will have the ability to select your audience, but in some cases, such as when you are delivering an interim report to your boss or your team, the audience may be pre-determined. Your audience may be composed of other data analysts, the individual(s) who initiated the question, your boss and/or other managers or executive team members, non-data analysts who are content experts, and/or someone representing the general public. 37 | 38 | For the first type of routine communication, in which you are primarily seeking factual knowledge or clarification about the dataset or related information, selecting a person (or people) who have the factual knowledge to answer the question and are responsive to queries is most appropriate. For a question about how the data for a variable in the dataset were collected, you might approach a person who collected the data or a person who has worked with the dataset before or was responsible for compiling the data. If the question is about the command to use in a statistical programming language in order to run a certain type of statistical test, this information is often easily found by an internet search.
But if this fails, querying a person who uses the particular programming language would be appropriate. 39 | 40 | For the second type of routine communication, in which you have some results and you are either unsure whether they are what you'd expect, or they are not what you expected, you'll likely be most helped if you engage more than one person and they represent a range of perspectives. The most productive and helpful meetings typically include people with data analysis and content area expertise. As a rule of thumb, the more types of stakeholders you communicate with while you are doing your data analysis project, the better your final product will be. For example, if you only communicate with other data analysts, you may overlook some important aspects of your data analysis that would have been discovered had you communicated with your boss, content experts, or other people. 41 | 42 | The third type of routine communication typically occurs when you have come to a natural place for pausing your data analysis. Although when and where in your data analysis these pauses occur are dictated by the specific analysis you are doing, one very common place to pause and take stock is after completing at least some exploratory data analysis. It's important to pause and ask for feedback at this point as this exercise will often identify additional exploratory analyses that are important for informing next steps, such as model building, and therefore prevent you from sinking time and effort into pursuing models that are not relevant, not appropriate, or both. This sort of communication is most effective when it takes the form of a face-to-face meeting, but video conferencing and phone conversations can also be effective. When selecting your audience, think about who among the people available to you would give the most helpful feedback and which perspectives will be important for informing the next steps of your analysis.
At a minimum, you should have both data analysis and content expertise represented, but in this type of meeting it may also be helpful to hear from people who share, or at least understand, the perspective of the larger target audience for the formal communication of the results of your data analysis. 43 | 44 | 45 | ## Content 46 | 47 | The most important guiding principle is to tailor the information you deliver to the objective of the communication. For a targeted question aimed at getting clarification about the coding of a variable, the recipient of your communication does not need to know the overall objective of your analysis, what you have done up to this point, or see any figures or tables. A specific, pointed question will do, along the lines of: "I'm analyzing the crime dataset that you sent me last week and am looking at the variable 'education' and see that it is coded 0, 1, and 2, but I don't see any labels for those codes. Do you know what these codes for the 'education' variable stand for?" 48 | 49 | For the second type of communication, in which you are seeking feedback because of a puzzling or unexpected issue with your analysis, more background information will be needed, but complete background information for the overall project may not be. To illustrate this concept, let's assume that you have been examining the relationship between height and lung function and you construct a scatterplot, which suggests that the relationship is non-linear as there appears to be curvature to the relationship. Although you have some ideas about approaches for handling non-linear relationships, you appropriately seek input from others.
After giving some thought to your objectives for the communication, you settle on two primary objectives: (1) To understand if there is a best approach for handling the non-linearity of the relationship, and if so, how to determine which approach is best, and (2) To understand more about the non-linear relationship you observe, including whether this is expected and/or known and whether the non-linearity is important to capture in your analyses. 50 | 51 | To achieve your objectives, you will need to provide your audience with some context and background, but providing a comprehensive background for the data analysis project and a review of all of the steps you've taken so far is unnecessary and likely to absorb time and effort that would be better devoted to your specific objectives. In this example, appropriate context and background might include the following: (1) the overall objective of the data analysis, (2) how height and lung function fit into the overall objective of the data analysis, for example, height may be a potential confounder, or the major predictor of interest, and (3) what you have done so far with respect to height and lung function and what you've learned. This final step should include some visual display of data, such as the aforementioned scatterplot. The final content of your presentation, then, would include a statement of the objectives for the discussion, a brief overview of the data analysis project, how the specific issue you are facing fits into the overall data analysis project, and then finally, pertinent findings from your analysis related to height and lung function. 52 | 53 | If you were developing a slide presentation, fewer slides should be devoted to the background and context than to the presentation of the data analysis findings for height and lung function.
One slide should be sufficient for the data analysis overview, and 1-2 slides should be sufficient for explaining the context of the height-lung function issue within the larger data analysis project. The meat of the presentation shouldn't require more than 5-8 slides, so that the total presentation time should be no more than 10-15 minutes. Although slides are certainly not necessary, a visual tool for presenting this information is very helpful, and using one need not make the presentation "formal." Instead, the idea is to provide the group sufficient information to generate discussion that is focused on your objectives, which is best achieved by an informal presentation. 54 | 55 | These same principles apply to the third type of communication, except that you may not have focused objectives and instead you may be seeking general feedback on your data analysis project from your audience. If this is the case, this more general objective should be stated and the remainder of the content should include a statement of the question you are seeking to answer with the analysis, the objective(s) of the data analysis, a summary of the characteristics of the data set (source of the data, number of observations, etc.), a summary of your exploratory analyses, a summary of your model building, your interpretation of your results, and conclusions. By providing key points from your entire data analysis, your audience will be able to provide feedback about the overall project as well as each of the steps of data analysis. A well-planned discussion yields helpful, thoughtful feedback and should be considered a success if you are left armed with additional refinements to make to your data analysis and a thoughtful perspective about what should be included in the more formal presentation of your final results to an external audience.
56 | 57 | 58 | ## Style 59 | 60 | Although the style of communication increases in formality from the first to the third type of routine communication, all of these communications should largely be informal and, except for perhaps the focused communication about a small technical issue, jargon should be avoided. Because the primary purpose of routine communication is to get feedback, your communication style should encourage discussion. Some approaches to encourage discussion include stating up front that you would like the bulk of the meeting to include active discussion and that you welcome questions during your presentation rather than asking the audience to hold them until the end of your presentation. If an audience member provides commentary, asking what others in the audience think will also promote discussion. In essence, to get the best feedback you want to hear what your audience members are thinking, and this is most likely accomplished by setting an informal tone and actively encouraging discussion. 61 | 62 | 63 | ## Attitude 64 | 65 | A defensive or off-putting attitude can sabotage all the work you've put into carefully selecting the audience, thoughtfully identifying your objectives and preparing your content, and stating that you are seeking discussion. Your audience will be reluctant to offer constructive feedback if they sense that their feedback will not be well received, and you will leave the meeting without achieving your objectives, and ill-prepared to make any refinements or additions to your data analysis. And when it comes time to deliver a formal presentation to an external audience, you will not be well prepared and won't be able to present your best work. To avoid this pitfall, deliberately cultivate a receptive and positive attitude prior to communicating by putting your ego and insecurities aside. If you can do this successfully, it will serve you well.
In fact, we both know people who have had highly successful careers based largely on their positive and welcoming attitude towards feedback, including constructive criticism. 66 | -------------------------------------------------------------------------------- /manuscript/_book/css/style.css: -------------------------------------------------------------------------------- 1 | .rmdcaution, .rmdimportant, .rmdnote, .rmdtip, .rmdwarning { 2 | padding: 1em 1em 1em 4em; 3 | margin-bottom: 10px; 4 | background: #f5f5f5 5px center/3em no-repeat; 5 | } 6 | .rmdcaution { 7 | background-image: url("../images/caution.png"); 8 | } 9 | .rmdimportant { 10 | background-image: url("../images/important.png"); 11 | } 12 | .rmdnote { 13 | background-image: url("../images/note.png"); 14 | } 15 | .rmdtip { 16 | background-image: url("../images/tip.png"); 17 | } 18 | .rmdwarning { 19 | background-image: url("../images/warning.png"); 20 | } 21 | .kable_wrapper { 22 | border-spacing: 20px 0; 23 | border-collapse: separate; 24 | border: none; 25 | margin: auto; 26 | } 27 | .kable_wrapper > tbody > tr > td { 28 | vertical-align: top; 29 | } 30 | p.caption { 31 | color: #777; 32 | margin-top: 10px; 33 | } 34 | p code { 35 | white-space: inherit; 36 | } 37 | pre { 38 | word-break: normal; 39 | word-wrap: normal; 40 | } 41 | pre code { 42 | white-space: inherit; 43 | } -------------------------------------------------------------------------------- /manuscript/_book/images/EDA-unnamed-chunk-16-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/rdpeng/artofdatascience/095fcf05885181cbca6d245f463e9cf3dfe5543e/manuscript/_book/images/EDA-unnamed-chunk-16-1.png -------------------------------------------------------------------------------- /manuscript/_book/images/EDA-unnamed-chunk-17-1.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/rdpeng/artofdatascience/095fcf05885181cbca6d245f463e9cf3dfe5543e/manuscript/_book/images/EDA-unnamed-chunk-17-1.png -------------------------------------------------------------------------------- /manuscript/_book/images/EDA-unnamed-chunk-20-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/rdpeng/artofdatascience/095fcf05885181cbca6d245f463e9cf3dfe5543e/manuscript/_book/images/EDA-unnamed-chunk-20-1.png -------------------------------------------------------------------------------- /manuscript/_book/images/epicycle.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/rdpeng/artofdatascience/095fcf05885181cbca6d245f463e9cf3dfe5543e/manuscript/_book/images/epicycle.png -------------------------------------------------------------------------------- /manuscript/_book/images/formal-unnamed-chunk-10-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/rdpeng/artofdatascience/095fcf05885181cbca6d245f463e9cf3dfe5543e/manuscript/_book/images/formal-unnamed-chunk-10-1.png -------------------------------------------------------------------------------- /manuscript/_book/images/formal-unnamed-chunk-12-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/rdpeng/artofdatascience/095fcf05885181cbca6d245f463e9cf3dfe5543e/manuscript/_book/images/formal-unnamed-chunk-12-1.png -------------------------------------------------------------------------------- /manuscript/_book/images/formal-unnamed-chunk-2-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/rdpeng/artofdatascience/095fcf05885181cbca6d245f463e9cf3dfe5543e/manuscript/_book/images/formal-unnamed-chunk-2-1.png 
-------------------------------------------------------------------------------- /manuscript/_book/images/formal-unnamed-chunk-5-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/rdpeng/artofdatascience/095fcf05885181cbca6d245f463e9cf3dfe5543e/manuscript/_book/images/formal-unnamed-chunk-5-1.png -------------------------------------------------------------------------------- /manuscript/_book/images/inferencepred-unnamed-chunk-10-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/rdpeng/artofdatascience/095fcf05885181cbca6d245f463e9cf3dfe5543e/manuscript/_book/images/inferencepred-unnamed-chunk-10-1.png -------------------------------------------------------------------------------- /manuscript/_book/images/inferencepred-unnamed-chunk-11-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/rdpeng/artofdatascience/095fcf05885181cbca6d245f463e9cf3dfe5543e/manuscript/_book/images/inferencepred-unnamed-chunk-11-1.png -------------------------------------------------------------------------------- /manuscript/_book/images/inferencepred-unnamed-chunk-3-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/rdpeng/artofdatascience/095fcf05885181cbca6d245f463e9cf3dfe5543e/manuscript/_book/images/inferencepred-unnamed-chunk-3-1.png -------------------------------------------------------------------------------- /manuscript/_book/images/inferencepred-unnamed-chunk-4-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/rdpeng/artofdatascience/095fcf05885181cbca6d245f463e9cf3dfe5543e/manuscript/_book/images/inferencepred-unnamed-chunk-4-1.png -------------------------------------------------------------------------------- 
/manuscript/_book/images/inferencepred-unnamed-chunk-5-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/rdpeng/artofdatascience/095fcf05885181cbca6d245f463e9cf3dfe5543e/manuscript/_book/images/inferencepred-unnamed-chunk-5-1.png -------------------------------------------------------------------------------- /manuscript/_book/images/interpret-unnamed-chunk-1-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/rdpeng/artofdatascience/095fcf05885181cbca6d245f463e9cf3dfe5543e/manuscript/_book/images/interpret-unnamed-chunk-1-1.png -------------------------------------------------------------------------------- /manuscript/_book/images/model-unnamed-chunk-10-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/rdpeng/artofdatascience/095fcf05885181cbca6d245f463e9cf3dfe5543e/manuscript/_book/images/model-unnamed-chunk-10-1.png -------------------------------------------------------------------------------- /manuscript/_book/images/model-unnamed-chunk-11-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/rdpeng/artofdatascience/095fcf05885181cbca6d245f463e9cf3dfe5543e/manuscript/_book/images/model-unnamed-chunk-11-1.png -------------------------------------------------------------------------------- /manuscript/_book/images/model-unnamed-chunk-12-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/rdpeng/artofdatascience/095fcf05885181cbca6d245f463e9cf3dfe5543e/manuscript/_book/images/model-unnamed-chunk-12-1.png -------------------------------------------------------------------------------- /manuscript/_book/images/model-unnamed-chunk-3-1.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/rdpeng/artofdatascience/095fcf05885181cbca6d245f463e9cf3dfe5543e/manuscript/_book/images/model-unnamed-chunk-3-1.png -------------------------------------------------------------------------------- /manuscript/_book/images/model-unnamed-chunk-5-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/rdpeng/artofdatascience/095fcf05885181cbca6d245f463e9cf3dfe5543e/manuscript/_book/images/model-unnamed-chunk-5-1.png -------------------------------------------------------------------------------- /manuscript/_book/images/model-unnamed-chunk-6-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/rdpeng/artofdatascience/095fcf05885181cbca6d245f463e9cf3dfe5543e/manuscript/_book/images/model-unnamed-chunk-6-1.png -------------------------------------------------------------------------------- /manuscript/_book/images/model-unnamed-chunk-7-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/rdpeng/artofdatascience/095fcf05885181cbca6d245f463e9cf3dfe5543e/manuscript/_book/images/model-unnamed-chunk-7-1.png -------------------------------------------------------------------------------- /manuscript/_book/images/penguin1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/rdpeng/artofdatascience/095fcf05885181cbca6d245f463e9cf3dfe5543e/manuscript/_book/images/penguin1.png -------------------------------------------------------------------------------- /manuscript/_book/images/penguin2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/rdpeng/artofdatascience/095fcf05885181cbca6d245f463e9cf3dfe5543e/manuscript/_book/images/penguin2.png 
-------------------------------------------------------------------------------- /manuscript/_book/images/penguin3.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/rdpeng/artofdatascience/095fcf05885181cbca6d245f463e9cf3dfe5543e/manuscript/_book/images/penguin3.png -------------------------------------------------------------------------------- /manuscript/_book/images/table-epicycles.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/rdpeng/artofdatascience/095fcf05885181cbca6d245f463e9cf3dfe5543e/manuscript/_book/images/table-epicycles.png -------------------------------------------------------------------------------- /manuscript/_book/index.md: -------------------------------------------------------------------------------- 1 | --- 2 | title: "The Art of Data Science" 3 | author: "Roger D. Peng and Elizabeth Matsui" 4 | date: "2017-01-20" 5 | site: bookdown::bookdown_site 6 | output: 7 | bookdown::html_chapters: 8 | includes: 9 | in_header: style.css 10 | documentclass: book 11 | link-citations: yes 12 | github-repo: rdpeng/artofdatascience 13 | url: 'https\://github.com/rdpeng/artofdatascience' 14 | description: "The book covers R software development for building data science tools. As the field of data science evolves, it has become clear that software development skills are essential for producing useful data science results and products. You will obtain rigorous training in the R language, including the skills for handling complex data, building R packages and developing custom data visualizations. You will learn modern software development practices to build tools that are highly reusable, modular, and suitable for use in a team-based environment or a community of developers." 15 | --- 16 | 17 | # Data Analysis as Art 18 | 19 | Data analysis is hard, and part of the problem is that few people can explain how to do it. 
It's not that there aren't any people doing data analysis on a regular basis. It's that the people who are really good at it have yet to enlighten us about the thought process that goes on in their heads. 20 | 21 | Imagine you were to ask a songwriter how she writes her songs. There are many tools upon which she can draw. We have a general understanding of how a good song should be structured: how long it should be, how many verses, maybe there's a verse followed by a chorus, etc. In other words, there's an abstract framework for songs in general. Similarly, we have music theory that tells us that certain combinations of notes and chords work well together and other combinations don't sound good. As good as these tools might be, ultimately, knowledge of song structure and music theory alone doesn't make for a good song. Something else is needed. 22 | 23 | In his legendary 1974 essay [*Computer Programming as an Art*](http://www.paulgraham.com/knuth.html), Donald Knuth talks about the difference between art and science. He was trying to get across the idea that although computer programming involved complex machines and very technical knowledge, the act of writing a computer program had an artistic component. In the essay, he says that 24 | 25 | > Science is knowledge which we understand so well that we can teach it to a computer. 26 | 27 | > Everything else is art. 28 | 29 | At some point, the songwriter must inject a creative spark into the process to bring all the songwriting tools together to make something that people want to listen to. This is a key part of the *art* of songwriting. That creative spark is difficult to describe, much less write down, but it's clearly essential to writing good songs. If it weren't, then we'd have computer programs regularly writing hit songs. For better or for worse, that hasn't happened yet.
30 | 31 | Much like songwriting (and computer programming, for that matter), it's important to realize that *data analysis is an art*. It is not yet something that we can teach to a computer. Data analysts have many *tools* at their disposal, from linear regression to classification trees and even deep learning, and these tools have all been carefully taught to computers. But ultimately, a data analyst must find a way to assemble all of the tools and apply them to data to answer a relevant question---a question of interest to people. 32 | 33 | Unfortunately, the process of data analysis is not one that we have been able to write down effectively. It's true that there are many statistics textbooks out there, many lining our own shelves. But in our opinion, none of these really addresses the core problems involved in conducting real-world data analyses. In 1991, Daryl Pregibon, a prominent statistician previously of AT&T Research and now of Google, [said in reference to the process of data analysis](http://www.nap.edu/catalog/1910/the-future-of-statistical-software-proceedings-of-a-forum) that "statisticians have a process that they espouse but do not fully understand". 34 | 35 | Describing data analysis presents a difficult conundrum. On the one hand, developing a useful framework involves characterizing the elements of a data analysis using abstract language in order to find the commonalities across different kinds of analyses. Sometimes, this language is the language of mathematics. On the other hand, it is often the very details of an analysis that make each one so difficult and yet interesting. How can one effectively generalize across many different data analyses, each of which has important unique aspects? 36 | 37 | What we have set out to do in this book is to write down the process of data analysis.
What we describe is not a specific "formula" for data analysis---something like "apply this method and then run that test"---but rather a general process that can be applied in a variety of situations. Through our extensive experience both managing data analysts and conducting our own data analyses, we have carefully observed what produces coherent results and what fails to produce useful insights into data. Our goal is to write down what we have learned in the hopes that others may find it useful. 38 | -------------------------------------------------------------------------------- /manuscript/_book/inferencepred.md: -------------------------------------------------------------------------------- 1 | # Inference vs. Prediction: Implications for Modeling Strategy 2 | 3 | 4 | 5 | Understanding whether you're answering an inferential question versus a prediction question is an important concept because the type of question you're answering can greatly influence the modeling strategy you pursue. If you do not clearly understand which type of question you are asking, you may end up using the wrong type of modeling approach and ultimately draw the wrong conclusions from your data. The purpose of this chapter is to show you what can happen when you confuse one question for another. 6 | 7 | The key things to remember are 8 | 9 | 1. For **inferential questions** the goal is typically to estimate an association between a predictor of interest and the outcome. There is usually only a handful of predictors of interest (or even just one); however, there are typically many potential confounding variables to consider. The key goal of modeling is to estimate an association while making sure you appropriately adjust for any potential confounders. Often, sensitivity analyses are conducted to see if associations of interest are robust to different sets of confounders. 10 | 11 | 2. For **prediction questions** the goal is to identify a model that *best predicts* the outcome.
Typically we do not place any *a priori* importance on the predictors, so long as they are good at predicting the outcome. There is no notion of "confounder" or "predictors of interest" because all predictors are potentially useful for predicting the outcome. Also, we often do not care about "how the model works" or telling a detailed story about the predictors. The key goal is to develop a model with good prediction skill and to estimate a reasonable error rate from the data. 12 | 13 | 14 | ## Air Pollution and Mortality in New York City 15 | 16 | The following example shows how different types of questions and corresponding modeling approaches can lead to different conclusions. The example uses air pollution and mortality data for New York City. The data were originally used as part of the [National Morbidity, Mortality, and Air Pollution Study](http://www.ihapss.jhsph.edu) (NMMAPS). 17 | 18 | 19 | 20 | 21 | Below is a plot of the daily mortality from all causes for the years 2001--2005. 22 | 23 |
24 | ![(\#fig:unnamed-chunk-3)Daily Mortality in New York City, 2001--2005](images/inferencepred-unnamed-chunk-3-1.png) 25 | 26 |
27 | 28 | And here is a plot of 24-hour average levels of particulate matter with aerodynamic diameter less than or equal to 10 microns (PM10). 29 | 30 |
31 | ![(\#fig:unnamed-chunk-4)Daily PM10 in New York City, 2001--2005](images/inferencepred-unnamed-chunk-4-1.png) 32 | 33 |
34 | 35 | Note that there are many fewer points on the plot above than there were on the plot of the mortality data. This is because PM10 is not measured every day. Also note that there are negative values in the PM10 plot---this is because the PM10 data were mean-subtracted. In general, negative values of PM10 are not possible. 36 | 37 | ## Inferring an Association 38 | 39 | The first approach we will take is to ask, "Is there an association between daily 24-hour average PM10 levels and daily mortality?" This is an inferential question and we are attempting to estimate an association. In addition, for this question, we know there are a number of potential confounders that we will have to deal with. 40 | 41 | Let's take a look at the bivariate association between PM10 and mortality. Here is a scatterplot of the two variables. 42 | 43 |
44 | ![(\#fig:unnamed-chunk-5)PM10 and Mortality in New York City](images/inferencepred-unnamed-chunk-5-1.png) 45 | 46 |
47 | 48 | There doesn't appear to be much going on there, and a simple linear regression model of the log of daily mortality and PM10 seems to confirm that. 49 | 50 | 51 | ``` 52 | Estimate Std. Error t value Pr(>|t|) 53 | (Intercept) 5.08884308354 0.0069353779 733.75138151 0.0000000 54 | pm10tmean 0.00004033446 0.0006913941 0.05833786 0.9535247 55 | ``` 56 | 57 | In the table of coefficients above, the coefficient for `pm10tmean` is quite small and its standard error is relatively large. Effectively, this estimate of the association is zero. 58 | 59 | However, we know quite a bit about both PM10 and daily mortality, and one thing we do know is that *season* plays a large role in both variables. In particular, we know that mortality tends to be higher in the winter and lower in the summer. PM10 tends to show the reverse pattern, being higher in the summer and lower in the winter. Because season is related to *both* PM10 and mortality, it is a good candidate for a confounder and it would make sense to adjust for it in the model. 60 | 61 | Here are the results for a second model, which includes both PM10 and season. Season is included as an indicator variable with 4 levels. 62 | 63 | 64 | 65 | ``` 66 | Estimate Std. Error t value Pr(>|t|) 67 | (Intercept) 5.166484285 0.0112629532 458.714886 0.000000e+00 68 | seasonQ2 -0.109271301 0.0166902948 -6.546996 3.209291e-10 69 | seasonQ3 -0.155503242 0.0169729148 -9.161847 1.736346e-17 70 | seasonQ4 -0.060317619 0.0167189714 -3.607735 3.716291e-04 71 | pm10tmean 0.001499111 0.0006156902 2.434847 1.558453e-02 72 | ``` 73 | 74 | Notice now that the `pm10tmean` coefficient is quite a bit larger than before and its `t value` is large, suggesting a strong association. How is this possible? 75 | 76 | It turns out that we have a classic example of [Simpson's Paradox](https://en.wikipedia.org/wiki/Simpson%27s_paradox) here. 
The overall relationship between PM10 and mortality is null, but when we account for the seasonal variation in both mortality and PM10, the association is positive. The surprising result comes from the opposite ways in which season is related to mortality and PM10. 77 | 78 | So far we have accounted for season, but there are other potential confounders. In particular, weather variables, such as temperature and dew point temperature, are also related to both PM10 formation and mortality. 79 | 80 | In the following model we include temperature (`tmpd`) and dew point temperature (`dptp`). We also include the `date` variable in case there are any long-term trends that need to be accounted for. 81 | 82 | 83 | 84 | ``` 85 | Estimate Std. Error t value Pr(>|t|) 86 | (Intercept) 5.62066568788 0.16471183741 34.1242365 1.851690e-96 87 | date -0.00002984198 0.00001315212 -2.2689856 2.411521e-02 88 | seasonQ2 -0.05805970053 0.02299356287 -2.5250415 1.218288e-02 89 | seasonQ3 -0.07655519887 0.02904104658 -2.6361033 8.906912e-03 90 | seasonQ4 -0.03154694305 0.01832712585 -1.7213252 8.641910e-02 91 | tmpd -0.00295931276 0.00128835065 -2.2969777 2.244054e-02 92 | dptp 0.00068342228 0.00103489541 0.6603781 5.096144e-01 93 | pm10tmean 0.00237049992 0.00065856022 3.5995189 3.837886e-04 94 | ``` 95 | 96 | Notice that the `pm10tmean` coefficient is even bigger than it was in the previous model. There still appears to be an association between PM10 and mortality. The effect size is small, but we will discuss that later. 97 | 98 | Finally, another class of potential confounders includes other pollutants. Before we place blame on PM10 as a harmful pollutant, it's important that we examine whether there might be another pollutant that can explain what we're observing. NO2 is a good candidate because it shares some of the same sources as PM10 and is known to be related to mortality. Let's see what happens when we include that in the model. 99 | 100 | 101 | ``` 102 | Estimate Std.
Error t value Pr(>|t|) 103 | (Intercept) 5.61378604085 0.16440280471 34.1465345 2.548704e-96 104 | date -0.00002973484 0.00001312231 -2.2659756 2.430503e-02 105 | seasonQ2 -0.05143935218 0.02338034983 -2.2001105 2.871069e-02 106 | seasonQ3 -0.06569205605 0.02990520457 -2.1966764 2.895825e-02 107 | seasonQ4 -0.02750381423 0.01849165119 -1.4873639 1.381739e-01 108 | tmpd -0.00296833498 0.00128542535 -2.3092239 2.174371e-02 109 | dptp 0.00070306996 0.00103262057 0.6808599 4.965877e-01 110 | no2tmean 0.00126556418 0.00086229169 1.4676753 1.434444e-01 111 | pm10tmean 0.00174189857 0.00078432327 2.2208937 2.725117e-02 112 | ``` 113 | 114 | Notice in the table of coefficients that the `no2tmean` coefficient is similar in magnitude to the `pm10tmean` coefficient, although its `t value` is not as large. The `pm10tmean` coefficient appears to be statistically significant, but it is somewhat smaller in magnitude now. 115 | 116 | Below is a plot of the PM10 coefficient from all four of the models that we tried. 117 | 118 |
119 | ![(\#fig:unnamed-chunk-10)Association Between PM10 and Mortality Under Different Models](images/inferencepred-unnamed-chunk-10-1.png) 120 | 121 |
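The season-adjustment pattern seen across these models can be reproduced with a few lines of simulated data. This is a hedged sketch, not the NMMAPS analysis: the variable names echo the chapter, but the sample size and effect sizes are invented, and season is adjusted for by within-season demeaning, which (by the Frisch--Waugh--Lovell theorem) yields the same slope as adding a season indicator to the regression.

```python
import random

random.seed(1)

def slope(x, y):
    """Ordinary least-squares slope of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

# Simulated daily data: season drives mortality up in "winter" (s = 0)
# and PM10 up in "summer" (s = 1), while within each season PM10 and
# mortality are positively associated (true slope 0.3).
pm10, mort, season = [], [], []
for day in range(2000):
    s = day % 2
    p = 10 * s + random.gauss(0, 1)
    m = 5 - 5 * s + 0.3 * p + random.gauss(0, 1)
    season.append(s); pm10.append(p); mort.append(m)

def demean_by(z, g):
    """Subtract the group mean of z within each level of g."""
    means = {k: sum(v for v, gg in zip(z, g) if gg == k) / g.count(k)
             for k in set(g)}
    return [v - means[gg] for v, gg in zip(z, g)]

# The marginal slope is pulled negative by season (the Simpson's-paradox
# pattern); the within-season slope recovers the positive association.
print("marginal slope:", round(slope(pm10, mort), 3))
print("season-adjusted slope:",
      round(slope(demean_by(pm10, season), demean_by(mort, season)), 3))
```

In this toy setup, the marginal slope plays the role of Model 1 above, and the season-adjusted slope plays the role of Models 2--4.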
122 | 123 | With the exception of Model 1, which did not account for any potential confounders, there appears to be a positive association between PM10 and mortality across Models 2--4. What this means and what we should do about it depends on what our ultimate goal is, and we do not discuss that in detail here. It's notable that the effect size is generally small, especially compared to some of the other predictors in the model. However, it's also worth noting that, presumably, everyone in New York City breathes, and so a small effect could have a large impact. 124 | 125 | 126 | ## Predicting the Outcome 127 | 128 | Another strategy we could have taken is to ask, "What best predicts mortality in New York City?" This is clearly a prediction question and we can use the data on hand to build a model. Here, we will use the [random forests](https://en.wikipedia.org/wiki/Random_forest) modeling strategy, which is a machine learning approach that performs well when there are a large number of predictors. One type of output we can obtain from the random forest procedure is a measure of *variable importance*. Roughly speaking, this measure indicates how important a given variable is to improving the prediction skill of the model. 129 | 130 | Below is a variable importance plot, which is obtained after fitting a random forest model. Larger values on the x-axis indicate greater importance. 131 | 132 |
133 | ![(\#fig:unnamed-chunk-11)Random Forest Variable Importance Plot for Predicting Mortality](images/inferencepred-unnamed-chunk-11-1.png) 134 | 135 |
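The idea behind a variable importance plot can be illustrated with a small permutation-importance sketch: shuffle one predictor and measure how much the prediction error grows. This is an assumption-laden toy, not the chapter's random forest: a two-predictor linear model stands in for the forest, and the simulated effect sizes (temperature strong, PM10 weak) are invented for illustration.

```python
import random

random.seed(2)
n = 1000
# Simulated data: mortality depends strongly on temperature, weakly on PM10.
tmpd = [random.gauss(0, 1) for _ in range(n)]
pm10 = [random.gauss(0, 1) for _ in range(n)]
mort = [2.0 * t + 0.1 * p + random.gauss(0, 1) for t, p in zip(tmpd, pm10)]

def fit2(x1, x2, y):
    """OLS slopes for y ~ x1 + x2; variables are ~mean-zero by construction."""
    s11 = sum(a * a for a in x1)
    s22 = sum(b * b for b in x2)
    s12 = sum(a * b for a, b in zip(x1, x2))
    s1y = sum(a * c for a, c in zip(x1, y))
    s2y = sum(b * c for b, c in zip(x2, y))
    det = s11 * s22 - s12 * s12
    return (s22 * s1y - s12 * s2y) / det, (s11 * s2y - s12 * s1y) / det

b1, b2 = fit2(tmpd, pm10, mort)

def mse(x1, x2):
    """Mean squared prediction error using the fitted slopes."""
    return sum((y - b1 * a - b2 * b) ** 2
               for a, b, y in zip(x1, x2, mort)) / n

base = mse(tmpd, pm10)
perm_t = random.sample(tmpd, n)  # permuting a column breaks its link to the outcome
perm_p = random.sample(pm10, n)

# Importance = increase in prediction error when a predictor is shuffled.
print("tmpd importance:", round(mse(perm_t, pm10) - base, 2))
print("pm10 importance:", round(mse(tmpd, perm_p) - base, 2))
```

Shuffling temperature ruins the predictions, while shuffling PM10 barely matters, mirroring why `pm10tmean` lands near the bottom of the importance plot even though its association with the outcome is real.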
136 | 137 | Notice that the variable `pm10tmean` comes near the bottom of the list in terms of importance. That is because it does not contribute much to predicting the outcome, mortality. Recall in the previous section that the effect size appeared to be small, meaning that it didn't really explain much variability in mortality. Predictors like temperature and dew point temperature are more useful as predictors of daily mortality. Even NO2 is a better predictor than PM10. 138 | 139 | However, just because PM10 is not a strong predictor of mortality doesn't mean that it does not have a relevant association with mortality. Given the tradeoffs that have to be made when developing a prediction model, PM10 is not high on the list of predictors that we would include---we simply cannot include every predictor. 140 | 141 | 142 | 143 | 144 | ## Summary 145 | 146 | In any data analysis, you want to ask yourself, "Am I asking an inferential question or a prediction question?" This should be cleared up *before* any data are analyzed, as the answer to the question can guide the entire modeling strategy. In the example here, if we had decided on a prediction approach, we might have erroneously thought that PM10 was not relevant to mortality. However, the inferential approach suggested a statistically significant association with mortality. Framing the question correctly, and applying the appropriate modeling strategy, can play a large role in the kinds of conclusions you draw from the data.
147 | -------------------------------------------------------------------------------- /manuscript/_book/libs/gitbook-2.6.7/css/fontawesome/fontawesome-webfont.ttf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/rdpeng/artofdatascience/095fcf05885181cbca6d245f463e9cf3dfe5543e/manuscript/_book/libs/gitbook-2.6.7/css/fontawesome/fontawesome-webfont.ttf -------------------------------------------------------------------------------- /manuscript/_book/libs/gitbook-2.6.7/css/plugin-bookdown.css: -------------------------------------------------------------------------------- 1 | .book .book-header h1 { 2 | padding-left: 20px; 3 | padding-right: 20px; 4 | } 5 | .book .book-header.fixed { 6 | position: fixed; 7 | right: 0; 8 | top: 0; 9 | left: 0; 10 | border-bottom: 1px solid rgba(0,0,0,.07); 11 | } 12 | span.search-highlight { 13 | background-color: #ffff88; 14 | } 15 | @media (min-width: 600px) { 16 | .book.with-summary .book-header.fixed { 17 | left: 300px; 18 | } 19 | } 20 | @media (max-width: 1240px) { 21 | .book .book-body.fixed { 22 | top: 50px; 23 | } 24 | .book .book-body.fixed .body-inner { 25 | top: auto; 26 | } 27 | } 28 | @media (max-width: 600px) { 29 | .book.with-summary .book-header.fixed { 30 | left: calc(100% - 60px); 31 | min-width: 300px; 32 | } 33 | .book.with-summary .book-body { 34 | transform: none; 35 | left: calc(100% - 60px); 36 | min-width: 300px; 37 | } 38 | .book .book-body.fixed { 39 | top: 0; 40 | } 41 | } 42 | 43 | .book .book-body.fixed .body-inner { 44 | top: 50px; 45 | } 46 | .book .book-body .page-wrapper .page-inner section.normal sub, .book .book-body .page-wrapper .page-inner section.normal sup { 47 | font-size: 85%; 48 | } 49 | 50 | @media print { 51 | .book .book-summary, .book .book-body .book-header, .fa { 52 | display: none !important; 53 | } 54 | .book .book-body.fixed { 55 | left: 0px; 56 | } 57 | .book .book-body,.book .book-body .body-inner, 
.book.with-summary { 58 | overflow: visible !important; 59 | } 60 | } 61 | .kable_wrapper { 62 | border-spacing: 20px 0; 63 | border-collapse: separate; 64 | border: none; 65 | margin: auto; 66 | } 67 | .kable_wrapper > tbody > tr > td { 68 | vertical-align: top; 69 | } 70 | .book .book-body .page-wrapper .page-inner section.normal table tr.header { 71 | border-top-width: 2px; 72 | } 73 | .book .book-body .page-wrapper .page-inner section.normal table tr:last-child td { 74 | border-bottom-width: 2px; 75 | } 76 | .book .book-body .page-wrapper .page-inner section.normal table td, .book .book-body .page-wrapper .page-inner section.normal table th { 77 | border-left: none; 78 | border-right: none; 79 | } 80 | .book .book-body .page-wrapper .page-inner section.normal table.kable_wrapper > tbody > tr, .book .book-body .page-wrapper .page-inner section.normal table.kable_wrapper > tbody > tr > td { 81 | border-top: none; 82 | } 83 | .book .book-body .page-wrapper .page-inner section.normal table.kable_wrapper > tbody > tr:last-child > td { 84 | border-bottom: none; 85 | } 86 | 87 | div.theorem, div.lemma, div.corollary, div.proposition { 88 | font-style: italic; 89 | } 90 | span.theorem, span.lemma, span.corollary, span.proposition { 91 | font-style: normal; 92 | } 93 | div.proof:after { 94 | content: "\25a2"; 95 | float: right; 96 | } 97 | -------------------------------------------------------------------------------- /manuscript/_book/libs/gitbook-2.6.7/css/plugin-fontsettings.css: -------------------------------------------------------------------------------- 1 | /* 2 | * Theme 1 3 | */ 4 | .color-theme-1 .dropdown-menu { 5 | background-color: #111111; 6 | border-color: #7e888b; 7 | } 8 | .color-theme-1 .dropdown-menu .dropdown-caret .caret-inner { 9 | border-bottom: 9px solid #111111; 10 | } 11 | .color-theme-1 .dropdown-menu .buttons { 12 | border-color: #7e888b; 13 | } 14 | .color-theme-1 .dropdown-menu .button { 15 | color: #afa790; 16 | } 17 | .color-theme-1 
.dropdown-menu .button:hover { 18 | color: #73553c; 19 | } 20 | /* 21 | * Theme 2 22 | */ 23 | .color-theme-2 .dropdown-menu { 24 | background-color: #2d3143; 25 | border-color: #272a3a; 26 | } 27 | .color-theme-2 .dropdown-menu .dropdown-caret .caret-inner { 28 | border-bottom: 9px solid #2d3143; 29 | } 30 | .color-theme-2 .dropdown-menu .buttons { 31 | border-color: #272a3a; 32 | } 33 | .color-theme-2 .dropdown-menu .button { 34 | color: #62677f; 35 | } 36 | .color-theme-2 .dropdown-menu .button:hover { 37 | color: #f4f4f5; 38 | } 39 | .book .book-header .font-settings .font-enlarge { 40 | line-height: 30px; 41 | font-size: 1.4em; 42 | } 43 | .book .book-header .font-settings .font-reduce { 44 | line-height: 30px; 45 | font-size: 1em; 46 | } 47 | .book.color-theme-1 .book-body { 48 | color: #704214; 49 | background: #f3eacb; 50 | } 51 | .book.color-theme-1 .book-body .page-wrapper .page-inner section { 52 | background: #f3eacb; 53 | } 54 | .book.color-theme-2 .book-body { 55 | color: #bdcadb; 56 | background: #1c1f2b; 57 | } 58 | .book.color-theme-2 .book-body .page-wrapper .page-inner section { 59 | background: #1c1f2b; 60 | } 61 | .book.font-size-0 .book-body .page-inner section { 62 | font-size: 1.2rem; 63 | } 64 | .book.font-size-1 .book-body .page-inner section { 65 | font-size: 1.4rem; 66 | } 67 | .book.font-size-2 .book-body .page-inner section { 68 | font-size: 1.6rem; 69 | } 70 | .book.font-size-3 .book-body .page-inner section { 71 | font-size: 2.2rem; 72 | } 73 | .book.font-size-4 .book-body .page-inner section { 74 | font-size: 4rem; 75 | } 76 | .book.font-family-0 { 77 | font-family: Georgia, serif; 78 | } 79 | .book.font-family-1 { 80 | font-family: "Helvetica Neue", Helvetica, Arial, sans-serif; 81 | } 82 | .book.color-theme-1 .book-body .page-wrapper .page-inner section.normal { 83 | color: #704214; 84 | } 85 | .book.color-theme-1 .book-body .page-wrapper .page-inner section.normal a { 86 | color: inherit; 87 | } 88 | .book.color-theme-1 
.book-body .page-wrapper .page-inner section.normal h1, 89 | .book.color-theme-1 .book-body .page-wrapper .page-inner section.normal h2, 90 | .book.color-theme-1 .book-body .page-wrapper .page-inner section.normal h3, 91 | .book.color-theme-1 .book-body .page-wrapper .page-inner section.normal h4, 92 | .book.color-theme-1 .book-body .page-wrapper .page-inner section.normal h5, 93 | .book.color-theme-1 .book-body .page-wrapper .page-inner section.normal h6 { 94 | color: inherit; 95 | } 96 | .book.color-theme-1 .book-body .page-wrapper .page-inner section.normal h1, 97 | .book.color-theme-1 .book-body .page-wrapper .page-inner section.normal h2 { 98 | border-color: inherit; 99 | } 100 | .book.color-theme-1 .book-body .page-wrapper .page-inner section.normal h6 { 101 | color: inherit; 102 | } 103 | .book.color-theme-1 .book-body .page-wrapper .page-inner section.normal hr { 104 | background-color: inherit; 105 | } 106 | .book.color-theme-1 .book-body .page-wrapper .page-inner section.normal blockquote { 107 | border-color: inherit; 108 | } 109 | .book.color-theme-1 .book-body .page-wrapper .page-inner section.normal pre, 110 | .book.color-theme-1 .book-body .page-wrapper .page-inner section.normal code { 111 | background: #fdf6e3; 112 | color: #657b83; 113 | border-color: #f8df9c; 114 | } 115 | .book.color-theme-1 .book-body .page-wrapper .page-inner section.normal .highlight { 116 | background-color: inherit; 117 | } 118 | .book.color-theme-1 .book-body .page-wrapper .page-inner section.normal table th, 119 | .book.color-theme-1 .book-body .page-wrapper .page-inner section.normal table td { 120 | border-color: #f5d06c; 121 | } 122 | .book.color-theme-1 .book-body .page-wrapper .page-inner section.normal table tr { 123 | color: inherit; 124 | background-color: #fdf6e3; 125 | border-color: #444444; 126 | } 127 | .book.color-theme-1 .book-body .page-wrapper .page-inner section.normal table tr:nth-child(2n) { 128 | background-color: #fbeecb; 129 | } 130 | 
.book.color-theme-2 .book-body .page-wrapper .page-inner section.normal { 131 | color: #bdcadb; 132 | } 133 | .book.color-theme-2 .book-body .page-wrapper .page-inner section.normal a { 134 | color: #3eb1d0; 135 | } 136 | .book.color-theme-2 .book-body .page-wrapper .page-inner section.normal h1, 137 | .book.color-theme-2 .book-body .page-wrapper .page-inner section.normal h2, 138 | .book.color-theme-2 .book-body .page-wrapper .page-inner section.normal h3, 139 | .book.color-theme-2 .book-body .page-wrapper .page-inner section.normal h4, 140 | .book.color-theme-2 .book-body .page-wrapper .page-inner section.normal h5, 141 | .book.color-theme-2 .book-body .page-wrapper .page-inner section.normal h6 { 142 | color: #fffffa; 143 | } 144 | .book.color-theme-2 .book-body .page-wrapper .page-inner section.normal h1, 145 | .book.color-theme-2 .book-body .page-wrapper .page-inner section.normal h2 { 146 | border-color: #373b4e; 147 | } 148 | .book.color-theme-2 .book-body .page-wrapper .page-inner section.normal h6 { 149 | color: #373b4e; 150 | } 151 | .book.color-theme-2 .book-body .page-wrapper .page-inner section.normal hr { 152 | background-color: #373b4e; 153 | } 154 | .book.color-theme-2 .book-body .page-wrapper .page-inner section.normal blockquote { 155 | border-color: #373b4e; 156 | } 157 | .book.color-theme-2 .book-body .page-wrapper .page-inner section.normal pre, 158 | .book.color-theme-2 .book-body .page-wrapper .page-inner section.normal code { 159 | color: #9dbed8; 160 | background: #2d3143; 161 | border-color: #2d3143; 162 | } 163 | .book.color-theme-2 .book-body .page-wrapper .page-inner section.normal .highlight { 164 | background-color: #282a39; 165 | } 166 | .book.color-theme-2 .book-body .page-wrapper .page-inner section.normal table th, 167 | .book.color-theme-2 .book-body .page-wrapper .page-inner section.normal table td { 168 | border-color: #3b3f54; 169 | } 170 | .book.color-theme-2 .book-body .page-wrapper .page-inner section.normal table tr { 171 
| color: #b6c2d2; 172 | background-color: #2d3143; 173 | border-color: #3b3f54; 174 | } 175 | .book.color-theme-2 .book-body .page-wrapper .page-inner section.normal table tr:nth-child(2n) { 176 | background-color: #35394b; 177 | } 178 | .book.color-theme-1 .book-header { 179 | color: #afa790; 180 | background: transparent; 181 | } 182 | .book.color-theme-1 .book-header .btn { 183 | color: #afa790; 184 | } 185 | .book.color-theme-1 .book-header .btn:hover { 186 | color: #73553c; 187 | background: none; 188 | } 189 | .book.color-theme-1 .book-header h1 { 190 | color: #704214; 191 | } 192 | .book.color-theme-2 .book-header { 193 | color: #7e888b; 194 | background: transparent; 195 | } 196 | .book.color-theme-2 .book-header .btn { 197 | color: #3b3f54; 198 | } 199 | .book.color-theme-2 .book-header .btn:hover { 200 | color: #fffff5; 201 | background: none; 202 | } 203 | .book.color-theme-2 .book-header h1 { 204 | color: #bdcadb; 205 | } 206 | .book.color-theme-1 .book-body .navigation { 207 | color: #afa790; 208 | } 209 | .book.color-theme-1 .book-body .navigation:hover { 210 | color: #73553c; 211 | } 212 | .book.color-theme-2 .book-body .navigation { 213 | color: #383f52; 214 | } 215 | .book.color-theme-2 .book-body .navigation:hover { 216 | color: #fffff5; 217 | } 218 | /* 219 | * Theme 1 220 | */ 221 | .book.color-theme-1 .book-summary { 222 | color: #afa790; 223 | background: #111111; 224 | border-right: 1px solid rgba(0, 0, 0, 0.07); 225 | } 226 | .book.color-theme-1 .book-summary .book-search { 227 | background: transparent; 228 | } 229 | .book.color-theme-1 .book-summary .book-search input, 230 | .book.color-theme-1 .book-summary .book-search input:focus { 231 | border: 1px solid transparent; 232 | } 233 | .book.color-theme-1 .book-summary ul.summary li.divider { 234 | background: #7e888b; 235 | box-shadow: none; 236 | } 237 | .book.color-theme-1 .book-summary ul.summary li i.fa-check { 238 | color: #33cc33; 239 | } 240 | .book.color-theme-1 .book-summary 
ul.summary li.done > a { 241 | color: #877f6a; 242 | } 243 | .book.color-theme-1 .book-summary ul.summary li a, 244 | .book.color-theme-1 .book-summary ul.summary li span { 245 | color: #877f6a; 246 | background: transparent; 247 | font-weight: normal; 248 | } 249 | .book.color-theme-1 .book-summary ul.summary li.active > a, 250 | .book.color-theme-1 .book-summary ul.summary li a:hover { 251 | color: #704214; 252 | background: transparent; 253 | font-weight: normal; 254 | } 255 | /* 256 | * Theme 2 257 | */ 258 | .book.color-theme-2 .book-summary { 259 | color: #bcc1d2; 260 | background: #2d3143; 261 | border-right: none; 262 | } 263 | .book.color-theme-2 .book-summary .book-search { 264 | background: transparent; 265 | } 266 | .book.color-theme-2 .book-summary .book-search input, 267 | .book.color-theme-2 .book-summary .book-search input:focus { 268 | border: 1px solid transparent; 269 | } 270 | .book.color-theme-2 .book-summary ul.summary li.divider { 271 | background: #272a3a; 272 | box-shadow: none; 273 | } 274 | .book.color-theme-2 .book-summary ul.summary li i.fa-check { 275 | color: #33cc33; 276 | } 277 | .book.color-theme-2 .book-summary ul.summary li.done > a { 278 | color: #62687f; 279 | } 280 | .book.color-theme-2 .book-summary ul.summary li a, 281 | .book.color-theme-2 .book-summary ul.summary li span { 282 | color: #c1c6d7; 283 | background: transparent; 284 | font-weight: 600; 285 | } 286 | .book.color-theme-2 .book-summary ul.summary li.active > a, 287 | .book.color-theme-2 .book-summary ul.summary li a:hover { 288 | color: #f4f4f5; 289 | background: #252737; 290 | font-weight: 600; 291 | } 292 | -------------------------------------------------------------------------------- /manuscript/_book/libs/gitbook-2.6.7/css/plugin-search.css: -------------------------------------------------------------------------------- 1 | .book .book-summary .book-search { 2 | padding: 6px; 3 | background: transparent; 4 | position: absolute; 5 | top: -50px; 6 | left: 
0px; 7 | right: 0px; 8 | transition: top 0.5s ease; 9 | } 10 | .book .book-summary .book-search input, 11 | .book .book-summary .book-search input:focus, 12 | .book .book-summary .book-search input:hover { 13 | width: 100%; 14 | background: transparent; 15 | border: 1px solid transparent; 16 | box-shadow: none; 17 | outline: none; 18 | line-height: 22px; 19 | padding: 7px 4px; 20 | color: inherit; 21 | } 22 | .book.with-search .book-summary .book-search { 23 | top: 0px; 24 | } 25 | .book.with-search .book-summary ul.summary { 26 | top: 50px; 27 | } 28 | -------------------------------------------------------------------------------- /manuscript/_book/libs/gitbook-2.6.7/js/jquery.highlight.js: -------------------------------------------------------------------------------- 1 | require(["jQuery"], function(jQuery) { 2 | 3 | /* 4 | * jQuery Highlight plugin 5 | * 6 | * Based on highlight v3 by Johann Burkard 7 | * http://johannburkard.de/blog/programming/javascript/highlight-javascript-text-higlighting-jquery-plugin.html 8 | * 9 | * Code a little bit refactored and cleaned (in my humble opinion). 10 | * Most important changes: 11 | * - has an option to highlight only entire words (wordsOnly - false by default), 12 | * - has an option to be case sensitive (caseSensitive - false by default) 13 | * - highlight element tag and class names can be specified in options 14 | * 15 | * Copyright (c) 2009 Bartek Szopka 16 | * 17 | * Licensed under MIT license. 
18 | * 19 | */ 20 | 21 | jQuery.extend({ 22 | highlight: function (node, re, nodeName, className) { 23 | if (node.nodeType === 3) { 24 | var match = node.data.match(re); 25 | if (match) { 26 | var highlight = document.createElement(nodeName || 'span'); 27 | highlight.className = className || 'highlight'; 28 | var wordNode = node.splitText(match.index); 29 | wordNode.splitText(match[0].length); 30 | var wordClone = wordNode.cloneNode(true); 31 | highlight.appendChild(wordClone); 32 | wordNode.parentNode.replaceChild(highlight, wordNode); 33 | return 1; //skip added node in parent 34 | } 35 | } else if ((node.nodeType === 1 && node.childNodes) && // only element nodes that have children 36 | !/(script|style)/i.test(node.tagName) && // ignore script and style nodes 37 | !(node.tagName === nodeName.toUpperCase() && node.className === className)) { // skip if already highlighted 38 | for (var i = 0; i < node.childNodes.length; i++) { 39 | i += jQuery.highlight(node.childNodes[i], re, nodeName, className); 40 | } 41 | } 42 | return 0; 43 | } 44 | }); 45 | 46 | jQuery.fn.unhighlight = function (options) { 47 | var settings = { className: 'highlight', element: 'span' }; 48 | jQuery.extend(settings, options); 49 | 50 | return this.find(settings.element + "." 
+ settings.className).each(function () { 51 | var parent = this.parentNode; 52 | parent.replaceChild(this.firstChild, this); 53 | parent.normalize(); 54 | }).end(); 55 | }; 56 | 57 | jQuery.fn.highlight = function (words, options) { 58 | var settings = { className: 'highlight', element: 'span', caseSensitive: false, wordsOnly: false }; 59 | jQuery.extend(settings, options); 60 | 61 | if (words.constructor === String) { 62 | words = [words]; 63 | } 64 | words = jQuery.grep(words, function(word, i){ 65 | return word !== ''; 66 | }); 67 | words = jQuery.map(words, function(word, i) { 68 | return word.replace(/[-[\]{}()*+?.,\\^$|#\s]/g, "\\$&"); 69 | }); 70 | if (words.length === 0) { return this; } 71 | 72 | var flag = settings.caseSensitive ? "" : "i"; 73 | var pattern = "(" + words.join("|") + ")"; 74 | if (settings.wordsOnly) { 75 | pattern = "\\b" + pattern + "\\b"; 76 | } 77 | var re = new RegExp(pattern, flag); 78 | 79 | return this.each(function () { 80 | jQuery.highlight(this, re, settings.element, settings.className); 81 | }); 82 | }; 83 | 84 | }); 85 | -------------------------------------------------------------------------------- /manuscript/_book/libs/gitbook-2.6.7/js/lunr.js: -------------------------------------------------------------------------------- 1 | /** 2 | * lunr - http://lunrjs.com - A bit like Solr, but much smaller and not as bright - 0.5.12 3 | * Copyright (C) 2015 Oliver Nightingale 4 | * MIT Licensed 5 | * @license 6 | */ 7 | !function(){var t=function(e){var n=new t.Index;return n.pipeline.add(t.trimmer,t.stopWordFilter,t.stemmer),e&&e.call(n,n),n};t.version="0.5.12",t.utils={},t.utils.warn=function(t){return function(e){t.console&&console.warn&&console.warn(e)}}(this),t.EventEmitter=function(){this.events={}},t.EventEmitter.prototype.addListener=function(){var t=Array.prototype.slice.call(arguments),e=t.pop(),n=t;if("function"!=typeof e)throw new TypeError("last argument must be a 
function");n.forEach(function(t){this.hasHandler(t)||(this.events[t]=[]),this.events[t].push(e)},this)},t.EventEmitter.prototype.removeListener=function(t,e){if(this.hasHandler(t)){var n=this.events[t].indexOf(e);this.events[t].splice(n,1),this.events[t].length||delete this.events[t]}},t.EventEmitter.prototype.emit=function(t){if(this.hasHandler(t)){var e=Array.prototype.slice.call(arguments,1);this.events[t].forEach(function(t){t.apply(void 0,e)})}},t.EventEmitter.prototype.hasHandler=function(t){return t in this.events},t.tokenizer=function(t){return arguments.length&&null!=t&&void 0!=t?Array.isArray(t)?t.map(function(t){return t.toLowerCase()}):t.toString().trim().toLowerCase().split(/[\s\-]+/):[]},t.Pipeline=function(){this._stack=[]},t.Pipeline.registeredFunctions={},t.Pipeline.registerFunction=function(e,n){n in this.registeredFunctions&&t.utils.warn("Overwriting existing registered function: "+n),e.label=n,t.Pipeline.registeredFunctions[e.label]=e},t.Pipeline.warnIfFunctionNotRegistered=function(e){var n=e.label&&e.label in this.registeredFunctions;n||t.utils.warn("Function is not registered with pipeline. 
This may cause problems when serialising the index.\n",e)},t.Pipeline.load=function(e){var n=new t.Pipeline;return e.forEach(function(e){var i=t.Pipeline.registeredFunctions[e];if(!i)throw new Error("Cannot load un-registered function: "+e);n.add(i)}),n},t.Pipeline.prototype.add=function(){var e=Array.prototype.slice.call(arguments);e.forEach(function(e){t.Pipeline.warnIfFunctionNotRegistered(e),this._stack.push(e)},this)},t.Pipeline.prototype.after=function(e,n){t.Pipeline.warnIfFunctionNotRegistered(n);var i=this._stack.indexOf(e);if(-1==i)throw new Error("Cannot find existingFn");i+=1,this._stack.splice(i,0,n)},t.Pipeline.prototype.before=function(e,n){t.Pipeline.warnIfFunctionNotRegistered(n);var i=this._stack.indexOf(e);if(-1==i)throw new Error("Cannot find existingFn");this._stack.splice(i,0,n)},t.Pipeline.prototype.remove=function(t){var e=this._stack.indexOf(t);-1!=e&&this._stack.splice(e,1)},t.Pipeline.prototype.run=function(t){for(var e=[],n=t.length,i=this._stack.length,o=0;n>o;o++){for(var r=t[o],s=0;i>s&&(r=this._stack[s](r,o,t),void 0!==r);s++);void 0!==r&&e.push(r)}return e},t.Pipeline.prototype.reset=function(){this._stack=[]},t.Pipeline.prototype.toJSON=function(){return this._stack.map(function(e){return t.Pipeline.warnIfFunctionNotRegistered(e),e.label})},t.Vector=function(){this._magnitude=null,this.list=void 0,this.length=0},t.Vector.Node=function(t,e,n){this.idx=t,this.val=e,this.next=n},t.Vector.prototype.insert=function(e,n){this._magnitude=void 0;var i=this.list;if(!i)return this.list=new t.Vector.Node(e,n,i),this.length++;if(en.idx?n=n.next:(i+=e.val*n.val,e=e.next,n=n.next);return i},t.Vector.prototype.similarity=function(t){return this.dot(t)/(this.magnitude()*t.magnitude())},t.SortedSet=function(){this.length=0,this.elements=[]},t.SortedSet.load=function(t){var e=new this;return e.elements=t,e.length=t.length,e},t.SortedSet.prototype.add=function(){var t,e;for(t=0;t1;){if(r===t)return 
o;t>r&&(e=o),r>t&&(n=o),i=n-e,o=e+Math.floor(i/2),r=this.elements[o]}return r===t?o:-1},t.SortedSet.prototype.locationFor=function(t){for(var e=0,n=this.elements.length,i=n-e,o=e+Math.floor(i/2),r=this.elements[o];i>1;)t>r&&(e=o),r>t&&(n=o),i=n-e,o=e+Math.floor(i/2),r=this.elements[o];return r>t?o:t>r?o+1:void 0},t.SortedSet.prototype.intersect=function(e){for(var n=new t.SortedSet,i=0,o=0,r=this.length,s=e.length,a=this.elements,h=e.elements;;){if(i>r-1||o>s-1)break;a[i]!==h[o]?a[i]<h[o]?i++:a[i]>h[o]&&o++:(n.add(a[i]),i++,o++)}return n},t.SortedSet.prototype.clone=function(){var e=new t.SortedSet;return e.elements=this.toArray(),e.length=e.elements.length,e},t.SortedSet.prototype.union=function(t){var e,n,i;return this.length>=t.length?(e=this,n=t):(e=t,n=this),i=e.clone(),i.add.apply(i,n.toArray()),i},t.SortedSet.prototype.toJSON=function(){return this.toArray()},t.Index=function(){this._fields=[],this._ref="id",this.pipeline=new t.Pipeline,this.documentStore=new t.Store,this.tokenStore=new t.TokenStore,this.corpusTokens=new t.SortedSet,this.eventEmitter=new t.EventEmitter,this._idfCache={},this.on("add","remove","update",function(){this._idfCache={}}.bind(this))},t.Index.prototype.on=function(){var t=Array.prototype.slice.call(arguments);return this.eventEmitter.addListener.apply(this.eventEmitter,t)},t.Index.prototype.off=function(t,e){return this.eventEmitter.removeListener(t,e)},t.Index.load=function(e){e.version!==t.version&&t.utils.warn("version mismatch: current "+t.version+" importing "+e.version);var n=new this;return n._fields=e.fields,n._ref=e.ref,n.documentStore=t.Store.load(e.documentStore),n.tokenStore=t.TokenStore.load(e.tokenStore),n.corpusTokens=t.SortedSet.load(e.corpusTokens),n.pipeline=t.Pipeline.load(e.pipeline),n},t.Index.prototype.field=function(t,e){var e=e||{},n={name:t,boost:e.boost||1};return this._fields.push(n),this},t.Index.prototype.ref=function(t){return this._ref=t,this},t.Index.prototype.add=function(e,n){var i={},o=new
t.SortedSet,r=e[this._ref],n=void 0===n?!0:n;this._fields.forEach(function(n){var r=this.pipeline.run(t.tokenizer(e[n.name]));i[n.name]=r,t.SortedSet.prototype.add.apply(o,r)},this),this.documentStore.set(r,o),t.SortedSet.prototype.add.apply(this.corpusTokens,o.toArray());for(var s=0;s0&&(i=1+Math.log(this.documentStore.length/n)),this._idfCache[e]=i},t.Index.prototype.search=function(e){var n=this.pipeline.run(t.tokenizer(e)),i=new t.Vector,o=[],r=this._fields.reduce(function(t,e){return t+e.boost},0),s=n.some(function(t){return this.tokenStore.has(t)},this);if(!s)return[];n.forEach(function(e,n,s){var a=1/s.length*this._fields.length*r,h=this,l=this.tokenStore.expand(e).reduce(function(n,o){var r=h.corpusTokens.indexOf(o),s=h.idf(o),l=1,u=new t.SortedSet;if(o!==e){var c=Math.max(3,o.length-e.length);l=1/Math.log(c)}return r>-1&&i.insert(r,a*s*l),Object.keys(h.tokenStore.get(o)).forEach(function(t){u.add(t)}),n.union(u)},new t.SortedSet);o.push(l)},this);var a=o.reduce(function(t,e){return t.intersect(e)});return a.map(function(t){return{ref:t,score:i.similarity(this.documentVector(t))}},this).sort(function(t,e){return e.score-t.score})},t.Index.prototype.documentVector=function(e){for(var n=this.documentStore.get(e),i=n.length,o=new t.Vector,r=0;i>r;r++){var s=n.elements[r],a=this.tokenStore.get(s)[e].tf,h=this.idf(s);o.insert(this.corpusTokens.indexOf(s),a*h)}return o},t.Index.prototype.toJSON=function(){return{version:t.version,fields:this._fields,ref:this._ref,documentStore:this.documentStore.toJSON(),tokenStore:this.tokenStore.toJSON(),corpusTokens:this.corpusTokens.toJSON(),pipeline:this.pipeline.toJSON()}},t.Index.prototype.use=function(t){var e=Array.prototype.slice.call(arguments,1);e.unshift(this),t.apply(this,e)},t.Store=function(){this.store={},this.length=0},t.Store.load=function(e){var n=new this;return n.length=e.length,n.store=Object.keys(e.store).reduce(function(n,i){return 
n[i]=t.SortedSet.load(e.store[i]),n},{}),n},t.Store.prototype.set=function(t,e){this.has(t)||this.length++,this.store[t]=e},t.Store.prototype.get=function(t){return this.store[t]},t.Store.prototype.has=function(t){return t in this.store},t.Store.prototype.remove=function(t){this.has(t)&&(delete this.store[t],this.length--)},t.Store.prototype.toJSON=function(){return{store:this.store,length:this.length}},t.stemmer=function(){var t={ational:"ate",tional:"tion",enci:"ence",anci:"ance",izer:"ize",bli:"ble",alli:"al",entli:"ent",eli:"e",ousli:"ous",ization:"ize",ation:"ate",ator:"ate",alism:"al",iveness:"ive",fulness:"ful",ousness:"ous",aliti:"al",iviti:"ive",biliti:"ble",logi:"log"},e={icate:"ic",ative:"",alize:"al",iciti:"ic",ical:"ic",ful:"",ness:""},n="[^aeiou]",i="[aeiouy]",o=n+"[^aeiouy]*",r=i+"[aeiou]*",s="^("+o+")?"+r+o,a="^("+o+")?"+r+o+"("+r+")?$",h="^("+o+")?"+r+o+r+o,l="^("+o+")?"+i,u=new RegExp(s),c=new RegExp(h),f=new RegExp(a),d=new RegExp(l),p=/^(.+?)(ss|i)es$/,m=/^(.+?)([^s])s$/,v=/^(.+?)eed$/,y=/^(.+?)(ed|ing)$/,g=/.$/,S=/(at|bl|iz)$/,w=new RegExp("([^aeiouylsz])\\1$"),x=new RegExp("^"+o+i+"[^aeiouwxy]$"),k=/^(.+?[^aeiou])y$/,b=/^(.+?)(ational|tional|enci|anci|izer|bli|alli|entli|eli|ousli|ization|ation|ator|alism|iveness|fulness|ousness|aliti|iviti|biliti|logi)$/,E=/^(.+?)(icate|ative|alize|iciti|ical|ful|ness)$/,_=/^(.+?)(al|ance|ence|er|ic|able|ible|ant|ement|ment|ent|ou|ism|ate|iti|ous|ive|ize)$/,F=/^(.+?)(s|t)(ion)$/,O=/^(.+?)e$/,P=/ll$/,N=new RegExp("^"+o+i+"[^aeiouwxy]$"),T=function(n){var i,o,r,s,a,h,l;if(n.length<3)return n;if(r=n.substr(0,1),"y"==r&&(n=r.toUpperCase()+n.substr(1)),s=p,a=m,s.test(n)?n=n.replace(s,"$1$2"):a.test(n)&&(n=n.replace(a,"$1$2")),s=v,a=y,s.test(n)){var T=s.exec(n);s=u,s.test(T[1])&&(s=g,n=n.replace(s,""))}else if(a.test(n)){var T=a.exec(n);i=T[1],a=d,a.test(i)&&(n=i,a=S,h=w,l=x,a.test(n)?n+="e":h.test(n)?(s=g,n=n.replace(s,"")):l.test(n)&&(n+="e"))}if(s=k,s.test(n)){var T=s.exec(n);i=T[1],n=i+"i"}if(s=b,s.test(n)){var 
T=s.exec(n);i=T[1],o=T[2],s=u,s.test(i)&&(n=i+t[o])}if(s=E,s.test(n)){var T=s.exec(n);i=T[1],o=T[2],s=u,s.test(i)&&(n=i+e[o])}if(s=_,a=F,s.test(n)){var T=s.exec(n);i=T[1],s=c,s.test(i)&&(n=i)}else if(a.test(n)){var T=a.exec(n);i=T[1]+T[2],a=c,a.test(i)&&(n=i)}if(s=O,s.test(n)){var T=s.exec(n);i=T[1],s=c,a=f,h=N,(s.test(i)||a.test(i)&&!h.test(i))&&(n=i)}return s=P,a=c,s.test(n)&&a.test(n)&&(s=g,n=n.replace(s,"")),"y"==r&&(n=r.toLowerCase()+n.substr(1)),n};return T}(),t.Pipeline.registerFunction(t.stemmer,"stemmer"),t.stopWordFilter=function(e){return e&&t.stopWordFilter.stopWords[e]!==e?e:void 0},t.stopWordFilter.stopWords={a:"a",able:"able",about:"about",across:"across",after:"after",all:"all",almost:"almost",also:"also",am:"am",among:"among",an:"an",and:"and",any:"any",are:"are",as:"as",at:"at",be:"be",because:"because",been:"been",but:"but",by:"by",can:"can",cannot:"cannot",could:"could",dear:"dear",did:"did","do":"do",does:"does",either:"either","else":"else",ever:"ever",every:"every","for":"for",from:"from",get:"get",got:"got",had:"had",has:"has",have:"have",he:"he",her:"her",hers:"hers",him:"him",his:"his",how:"how",however:"however",i:"i","if":"if","in":"in",into:"into",is:"is",it:"it",its:"its",just:"just",least:"least",let:"let",like:"like",likely:"likely",may:"may",me:"me",might:"might",most:"most",must:"must",my:"my",neither:"neither",no:"no",nor:"nor",not:"not",of:"of",off:"off",often:"often",on:"on",only:"only",or:"or",other:"other",our:"our",own:"own",rather:"rather",said:"said",say:"say",says:"says",she:"she",should:"should",since:"since",so:"so",some:"some",than:"than",that:"that",the:"the",their:"their",them:"them",then:"then",there:"there",these:"these",they:"they","this":"this",tis:"tis",to:"to",too:"too",twas:"twas",us:"us",wants:"wants",was:"was",we:"we",were:"were",what:"what",when:"when",where:"where",which:"which","while":"while",who:"who",whom:"whom",why:"why",will:"will","with":"with",would:"would",yet:"yet",you:"you",your:"your"},t.Pipeline
.registerFunction(t.stopWordFilter,"stopWordFilter"),t.trimmer=function(t){var e=t.replace(/^\W+/,"").replace(/\W+$/,"");return""===e?void 0:e},t.Pipeline.registerFunction(t.trimmer,"trimmer"),t.TokenStore=function(){this.root={docs:{}},this.length=0},t.TokenStore.load=function(t){var e=new this;return e.root=t.root,e.length=t.length,e},t.TokenStore.prototype.add=function(t,e,n){var n=n||this.root,i=t[0],o=t.slice(1);return i in n||(n[i]={docs:{}}),0===o.length?(n[i].docs[e.ref]=e,void(this.length+=1)):this.add(o,e,n[i])},t.TokenStore.prototype.has=function(t){if(!t)return!1;for(var e=this.root,n=0;n's that are truncated 95 | $('a').each(function(i, el) { 96 | if (el.offsetWidth >= el.scrollWidth) return; 97 | if (typeof el.title === 'undefined') return; 98 | el.title = el.text; 99 | }); 100 | 101 | // restore TOC scroll position 102 | var pos = gs.get('tocScrollTop'); 103 | if (typeof pos !== 'undefined') summary.scrollTop(pos); 104 | 105 | // highlight the TOC item that has same text as the heading in view as scrolling 106 | if (toc && toc.scroll_highlight !== false) (function() { 107 | // scroll the current TOC item into viewport 108 | var ht = $(window).height(), rect = li[0].getBoundingClientRect(); 109 | if (rect.top >= ht || rect.top <= 0 || rect.bottom <= 0) { 110 | summary.scrollTop(li[0].offsetTop); 111 | } 112 | // current chapter TOC items 113 | var items = $('a[href^="' + href + '"]').parent('li.chapter'), 114 | m = items.length; 115 | if (m === 0) { 116 | items = summary.find('li.chapter'); 117 | m = items.length; 118 | } 119 | if (m === 0) return; 120 | // all section titles on current page 121 | var hs = bookInner.find('.page-inner').find('h1,h2,h3'), n = hs.length, 122 | ts = hs.map(function(i, el) { return $(el).text(); }); 123 | if (n === 0) return; 124 | var scrollHandler = function(e) { 125 | var ht = $(window).height(); 126 | clearTimeout($.data(this, 'scrollTimer')); 127 | $.data(this, 'scrollTimer', setTimeout(function() { 128 | // find the 
first visible title in the viewport 129 | for (var i = 0; i < n; i++) { 130 | var rect = hs[i].getBoundingClientRect(); 131 | if (rect.top >= 0 && rect.bottom <= ht) break; 132 | } 133 | if (i === n) return; 134 | items.removeClass('active'); 135 | for (var j = 0; j < m; j++) { 136 | if (items.eq(j).children('a').first().text() === ts[i]) break; 137 | } 138 | if (j === m) j = 0; // highlight the chapter title 139 | // search bottom-up for a visible TOC item to highlight; if an item is 140 | // hidden, we check if its parent is visible, and so on 141 | while (j > 0 && items.eq(j).is(':hidden')) j--; 142 | items.eq(j).addClass('active'); 143 | }, 250)); 144 | }; 145 | bookInner.on('scroll.bookdown', scrollHandler); 146 | bookBody.on('scroll.bookdown', scrollHandler); 147 | })(); 148 | 149 | // do not refresh the page if the TOC item points to the current page 150 | $('a[href="' + href + '"]').parent('li.chapter').children('a') 151 | .on('click', function(e) { 152 | bookInner.scrollTop(0); 153 | bookBody.scrollTop(0); 154 | return false; 155 | }); 156 | 157 | var toolbar = config.toolbar; 158 | if (!toolbar || toolbar.position !== 'static') { 159 | var bookHeader = $('.book-header'); 160 | bookBody.addClass('fixed'); 161 | bookHeader.addClass('fixed') 162 | .css('background-color', bookBody.css('background-color')) 163 | .on('click.bookdown', function(e) { 164 | // the theme may have changed after user clicks the theme button 165 | bookHeader.css('background-color', bookBody.css('background-color')); 166 | }); 167 | } 168 | 169 | }); 170 | 171 | gitbook.events.bind("page.change", function(e) { 172 | // store TOC scroll position 173 | var summary = $('ul.summary'); 174 | gs.set('tocScrollTop', summary.scrollTop()); 175 | }); 176 | 177 | var bookBody = $('.book-body'), bookInner = bookBody.find('.body-inner'); 178 | var saveScrollPos = function(e) { 179 | // save scroll position before page is reloaded 180 | gs.set('bodyScrollTop', { 181 | body: bookBody.scrollTop(), 
182 | inner: bookInner.scrollTop(), 183 | focused: document.hasFocus(), 184 | title: bookInner.find('.page-inner').find('h1,h2').first().text() 185 | }); 186 | }; 187 | $(document).on('servr:reload', saveScrollPos); 188 | 189 | // check if the page is loaded in an iframe (e.g. the RStudio preview window) 190 | var inIFrame = function() { 191 | var inIframe = true; 192 | try { inIframe = window.self !== window.top; } catch (e) {} 193 | return inIframe; 194 | }; 195 | if (inIFrame()) { 196 | $(window).on('blur unload', saveScrollPos); 197 | } 198 | 199 | $(function(e) { 200 | var pos = gs.get('bodyScrollTop'); 201 | if (pos) { 202 | if (pos.title === bookInner.find('.page-inner').find('h1,h2').first().text()) { 203 | if (pos.body !== 0) bookBody.scrollTop(pos.body); 204 | if (pos.inner !== 0) bookInner.scrollTop(pos.inner); 205 | } 206 | if (pos.focused) bookInner.find('.page-wrapper').focus(); 207 | } 208 | // clear book body scroll position 209 | gs.remove('bodyScrollTop'); 210 | }); 211 | 212 | }); 213 | -------------------------------------------------------------------------------- /manuscript/_book/libs/gitbook-2.6.7/js/plugin-fontsettings.js: -------------------------------------------------------------------------------- 1 | require(["gitbook", "lodash", "jQuery"], function(gitbook, _, $) { 2 | var fontState; 3 | 4 | var THEMES = { 5 | "white": 0, 6 | "sepia": 1, 7 | "night": 2 8 | }; 9 | 10 | var FAMILY = { 11 | "serif": 0, 12 | "sans": 1 13 | }; 14 | 15 | // Save current font settings 16 | function saveFontSettings() { 17 | gitbook.storage.set("fontState", fontState); 18 | update(); 19 | } 20 | 21 | // Increase font size 22 | function enlargeFontSize(e) { 23 | e.preventDefault(); 24 | if (fontState.size >= 4) return; 25 | 26 | fontState.size++; 27 | saveFontSettings(); 28 | }; 29 | 30 | // Decrease font size 31 | function reduceFontSize(e) { 32 | e.preventDefault(); 33 | if (fontState.size <= 0) return; 34 | 35 | fontState.size--; 36 | saveFontSettings(); 
37 | }; 38 | 39 | // Change font family 40 | function changeFontFamily(index, e) { 41 | e.preventDefault(); 42 | 43 | fontState.family = index; 44 | saveFontSettings(); 45 | }; 46 | 47 | // Change type of color 48 | function changeColorTheme(index, e) { 49 | e.preventDefault(); 50 | 51 | var $book = $(".book"); 52 | 53 | if (fontState.theme !== 0) 54 | $book.removeClass("color-theme-"+fontState.theme); 55 | 56 | fontState.theme = index; 57 | if (fontState.theme !== 0) 58 | $book.addClass("color-theme-"+fontState.theme); 59 | 60 | saveFontSettings(); 61 | }; 62 | 63 | function update() { 64 | var $book = gitbook.state.$book; 65 | 66 | $(".font-settings .font-family-list li").removeClass("active"); 67 | $(".font-settings .font-family-list li:nth-child("+(fontState.family+1)+")").addClass("active"); 68 | 69 | $book[0].className = $book[0].className.replace(/\bfont-\S+/g, ''); 70 | $book.addClass("font-size-"+fontState.size); 71 | $book.addClass("font-family-"+fontState.family); 72 | 73 | if(fontState.theme !== 0) { 74 | $book[0].className = $book[0].className.replace(/\bcolor-theme-\S+/g, ''); 75 | $book.addClass("color-theme-"+fontState.theme); 76 | } 77 | }; 78 | 79 | function init(config) { 80 | var $bookBody, $book; 81 | 82 | //Find DOM elements. 
83 | $book = gitbook.state.$book; 84 | $bookBody = $book.find(".book-body"); 85 | 86 | // Instantiate font state object 87 | fontState = gitbook.storage.get("fontState", { 88 | size: config.size || 2, 89 | family: FAMILY[config.family || "sans"], 90 | theme: THEMES[config.theme || "white"] 91 | }); 92 | 93 | update(); 94 | }; 95 | 96 | 97 | gitbook.events.bind("start", function(e, config) { 98 | var opts = config.fontsettings; 99 | 100 | // Create buttons in toolbar 101 | gitbook.toolbar.createButton({ 102 | icon: 'fa fa-font', 103 | label: 'Font Settings', 104 | className: 'font-settings', 105 | dropdown: [ 106 | [ 107 | { 108 | text: 'A', 109 | className: 'font-reduce', 110 | onClick: reduceFontSize 111 | }, 112 | { 113 | text: 'A', 114 | className: 'font-enlarge', 115 | onClick: enlargeFontSize 116 | } 117 | ], 118 | [ 119 | { 120 | text: 'Serif', 121 | onClick: _.partial(changeFontFamily, 0) 122 | }, 123 | { 124 | text: 'Sans', 125 | onClick: _.partial(changeFontFamily, 1) 126 | } 127 | ], 128 | [ 129 | { 130 | text: 'White', 131 | onClick: _.partial(changeColorTheme, 0) 132 | }, 133 | { 134 | text: 'Sepia', 135 | onClick: _.partial(changeColorTheme, 1) 136 | }, 137 | { 138 | text: 'Night', 139 | onClick: _.partial(changeColorTheme, 2) 140 | } 141 | ] 142 | ] 143 | }); 144 | 145 | 146 | // Init current settings 147 | init(opts); 148 | }); 149 | }); 150 | 151 | 152 | -------------------------------------------------------------------------------- /manuscript/_book/libs/gitbook-2.6.7/js/plugin-search.js: -------------------------------------------------------------------------------- 1 | require(["gitbook", "lodash", "jQuery"], function(gitbook, _, $) { 2 | var index = null; 3 | var $searchInput, $searchForm; 4 | var $highlighted, hi = 0, hiOpts = { className: 'search-highlight' }; 5 | var collapse = false; 6 | 7 | // Use a specific index 8 | function loadIndex(data) { 9 | // [Yihui] In bookdown, I use a character matrix to store the chapter 10 | // content, and 
the index is dynamically built on the client side. 11 | // Gitbook prebuilds the index data instead: https://github.com/GitbookIO/plugin-search 12 | // We can certainly do that via R packages V8 and jsonlite, but let's 13 | // see how slow it really is before improving it. On the other hand, 14 | // lunr cannot handle non-English text very well, e.g. the default 15 | // tokenizer cannot deal with Chinese text, so we may want to replace 16 | // lunr with a dumb simple text matching approach. 17 | index = lunr(function () { 18 | this.ref('url'); 19 | this.field('title', { boost: 10 }); 20 | this.field('body'); 21 | }); 22 | data.map(function(item) { 23 | index.add({ 24 | url: item[0], 25 | title: item[1], 26 | body: item[2] 27 | }); 28 | }); 29 | } 30 | 31 | // Fetch the search index 32 | function fetchIndex() { 33 | return $.getJSON(gitbook.state.basePath+"/search_index.json") 34 | .then(loadIndex); // [Yihui] we need to use this object later 35 | } 36 | 37 | // Search for a term and return results 38 | function search(q) { 39 | if (!index) return; 40 | 41 | var results = _.chain(index.search(q)) 42 | .map(function(result) { 43 | var parts = result.ref.split("#"); 44 | return { 45 | path: parts[0], 46 | hash: parts[1] 47 | }; 48 | }) 49 | .value(); 50 | 51 | // [Yihui] Highlight the search keyword on current page 52 | hi = 0; 53 | $highlighted = results.length === 0 ? undefined : $('.page-inner') 54 | .unhighlight(hiOpts).highlight(q, hiOpts).find('span.search-highlight'); 55 | scrollToHighlighted(); 56 | toggleTOC(results.length > 0); 57 | 58 | return results; 59 | } 60 | 61 | // [Yihui] Scroll the chapter body to the i-th highlighted string 62 | function scrollToHighlighted() { 63 | if (!$highlighted) return; 64 | var n = $highlighted.length; 65 | if (n === 0) return; 66 | var $p = $highlighted.eq(hi), p = $p[0], rect = p.getBoundingClientRect(); 67 | if (rect.top < 0 || rect.bottom > $(window).height()) { 68 | ($(window).width() >= 1240 ? 
$('.body-inner') : $('.book-body')) 69 | .scrollTop(p.offsetTop - 100); 70 | } 71 | $highlighted.css('background-color', ''); 72 | // an orange background color on the current item and removed later 73 | $p.css('background-color', 'orange'); 74 | setTimeout(function() { 75 | $p.css('background-color', ''); 76 | }, 2000); 77 | } 78 | 79 | // [Yihui] Expand/collapse TOC 80 | function toggleTOC(show) { 81 | if (!collapse) return; 82 | var toc_sub = $('ul.summary').children('li[data-level]').children('ul'); 83 | if (show) return toc_sub.show(); 84 | var href = window.location.pathname; 85 | href = href.substr(href.lastIndexOf('/') + 1); 86 | if (href === '') href = 'index.html'; 87 | var li = $('a[href^="' + href + location.hash + '"]').parent('li.chapter').first(); 88 | toc_sub.hide().parent().has(li).children('ul').show(); 89 | li.children('ul').show(); 90 | } 91 | 92 | // Create search form 93 | function createForm(value) { 94 | if ($searchForm) $searchForm.remove(); 95 | if ($searchInput) $searchInput.remove(); 96 | 97 | $searchForm = $('
<div>', { 98 | 'class': 'book-search', 99 | 'role': 'search' 100 | }); 101 | 102 | $searchInput = $('<input>', { 103 | 'type': 'search', 104 | 'class': 'form-control', 105 | 'val': value, 106 | 'placeholder': 'Type to search' 107 | }); 108 | 109 | $searchInput.appendTo($searchForm); 110 | $searchForm.prependTo(gitbook.state.$book.find('.book-summary')); 111 | } 112 | 113 | // Return true if search is open 114 | function isSearchOpen() { 115 | return gitbook.state.$book.hasClass("with-search"); 116 | } 117 | 118 | // Toggle the search 119 | function toggleSearch(_state) { 120 | if (isSearchOpen() === _state) return; 121 | if (!$searchInput) return; 122 | 123 | gitbook.state.$book.toggleClass("with-search", _state); 124 | 125 | // If search bar is open: focus input 126 | if (isSearchOpen()) { 127 | gitbook.sidebar.toggle(true); 128 | $searchInput.focus(); 129 | } else { 130 | $searchInput.blur(); 131 | $searchInput.val(""); 132 | gitbook.storage.remove("keyword"); 133 | gitbook.sidebar.filter(null); 134 | $('.page-inner').unhighlight(hiOpts); 135 | toggleTOC(false); 136 | } 137 | } 138 | 139 | // Recover current search when page changed 140 | function recoverSearch() { 141 | var keyword = gitbook.storage.get("keyword", ""); 142 | 143 | createForm(keyword); 144 | 145 | if (keyword.length > 0) { 146 | if(!isSearchOpen()) { 147 | toggleSearch(true); // [Yihui] open the search box 148 | } 149 | gitbook.sidebar.filter(_.pluck(search(keyword), "path")); 150 | } 151 | } 152 | 153 | 154 | gitbook.events.bind("start", function(e, config) { 155 | // [Yihui] disable search 156 | if (config.search === false) return; 157 | collapse = !config.toc || config.toc.collapse === 'section' || 158 | config.toc.collapse === 'subsection'; 159 | 160 | // Pre-fetch search index and create the form 161 | fetchIndex() 162 | // [Yihui] recover search after the page is loaded 163 | .then(recoverSearch); 164 | 165 | 166 | // Type in search bar 167 | $(document).on("keyup", ".book-search input", function(e) {
168 | var key = (e.keyCode ? e.keyCode : e.which); 169 | // [Yihui] Escape -> close search box; Up/Down: previous/next highlighted 170 | if (key == 27) { 171 | e.preventDefault(); 172 | toggleSearch(false); 173 | } else if (key == 38) { 174 | if (hi <= 0 && $highlighted) hi = $highlighted.length; 175 | hi--; 176 | scrollToHighlighted(); 177 | } else if (key == 40) { 178 | hi++; 179 | if ($highlighted && hi >= $highlighted.length) hi = 0; 180 | scrollToHighlighted(); 181 | } 182 | }).on("input", ".book-search input", function(e) { 183 | var q = $(this).val(); 184 | if (q.length === 0) { 185 | gitbook.sidebar.filter(null); 186 | gitbook.storage.remove("keyword"); 187 | $('.page-inner').unhighlight(hiOpts); 188 | toggleTOC(false); 189 | } else { 190 | var results = search(q); 191 | gitbook.sidebar.filter( 192 | _.pluck(results, "path") 193 | ); 194 | gitbook.storage.set("keyword", q); 195 | } 196 | }); 197 | 198 | // Create the toggle search button 199 | gitbook.toolbar.createButton({ 200 | icon: 'fa fa-search', 201 | label: 'Search', 202 | position: 'left', 203 | onClick: toggleSearch 204 | }); 205 | 206 | // Bind keyboard to toggle search 207 | gitbook.keyboard.bind(['f'], toggleSearch); 208 | }); 209 | 210 | // [Yihui] do not try to recover search; always start fresh 211 | // gitbook.events.bind("page.change", recoverSearch); 212 | }); 213 | -------------------------------------------------------------------------------- /manuscript/_book/libs/gitbook-2.6.7/js/plugin-sharing.js: -------------------------------------------------------------------------------- 1 | require(["gitbook", "lodash", "jQuery"], function(gitbook, _, $) { 2 | var SITES = { 3 | 'github': { 4 | 'label': 'Github', 5 | 'icon': 'fa fa-github', 6 | 'onClick': function(e) { 7 | e.preventDefault(); 8 | var repo = $('meta[name="github-repo"]').attr('content'); 9 | if (typeof repo === 'undefined') throw("Github repo not defined"); 10 | window.open("https://github.com/"+repo); 11 | } 12 | }, 13 | 
'facebook': { 14 | 'label': 'Facebook', 15 | 'icon': 'fa fa-facebook', 16 | 'onClick': function(e) { 17 | e.preventDefault(); 18 | window.open("http://www.facebook.com/sharer/sharer.php?s=100&p[url]="+encodeURIComponent(location.href)); 19 | } 20 | }, 21 | 'twitter': { 22 | 'label': 'Twitter', 23 | 'icon': 'fa fa-twitter', 24 | 'onClick': function(e) { 25 | e.preventDefault(); 26 | window.open("http://twitter.com/home?status="+encodeURIComponent(document.title+" "+location.href)); 27 | } 28 | }, 29 | 'google': { 30 | 'label': 'Google+', 31 | 'icon': 'fa fa-google-plus', 32 | 'onClick': function(e) { 33 | e.preventDefault(); 34 | window.open("https://plus.google.com/share?url="+encodeURIComponent(location.href)); 35 | } 36 | }, 37 | 'weibo': { 38 | 'label': 'Weibo', 39 | 'icon': 'fa fa-weibo', 40 | 'onClick': function(e) { 41 | e.preventDefault(); 42 | window.open("http://service.weibo.com/share/share.php?content=utf-8&url="+encodeURIComponent(location.href)+"&title="+encodeURIComponent(document.title)); 43 | } 44 | }, 45 | 'instapaper': { 46 | 'label': 'Instapaper', 47 | 'icon': 'fa fa-instapaper', 48 | 'onClick': function(e) { 49 | e.preventDefault(); 50 | window.open("http://www.instapaper.com/text?u="+encodeURIComponent(location.href)); 51 | } 52 | }, 53 | 'vk': { 54 | 'label': 'VK', 55 | 'icon': 'fa fa-vk', 56 | 'onClick': function(e) { 57 | e.preventDefault(); 58 | window.open("http://vkontakte.ru/share.php?url="+encodeURIComponent(location.href)); 59 | } 60 | } 61 | }; 62 | 63 | 64 | 65 | gitbook.events.bind("start", function(e, config) { 66 | var opts = config.sharing; 67 | if (!opts) return; 68 | 69 | // Create dropdown menu 70 | var menu = _.chain(opts.all) 71 | .map(function(id) { 72 | var site = SITES[id]; 73 | 74 | return { 75 | text: site.label, 76 | onClick: site.onClick 77 | }; 78 | }) 79 | .compact() 80 | .value(); 81 | 82 | // Create main button with dropdown 83 | if (menu.length > 0) { 84 | gitbook.toolbar.createButton({ 85 | icon: 'fa 
fa-share-alt', 86 | label: 'Share', 87 | position: 'right', 88 | dropdown: [menu] 89 | }); 90 | } 91 | 92 | // Direct actions to share 93 | _.each(SITES, function(site, sideId) { 94 | if (!opts[sideId]) return; 95 | 96 | gitbook.toolbar.createButton({ 97 | icon: site.icon, 98 | label: site.text, 99 | position: 'right', 100 | onClick: site.onClick 101 | }); 102 | }); 103 | }); 104 | }); 105 | -------------------------------------------------------------------------------- /manuscript/_bookdown_files/EDA_cache/html/__packages: -------------------------------------------------------------------------------- 1 | base 2 | datasets 3 | utils 4 | grDevices 5 | graphics 6 | stats 7 | dplyr 8 | -------------------------------------------------------------------------------- /manuscript/_bookdown_files/EDA_cache/html/unnamed-chunk-16_dce4065f9d0c9cd692c2eea208ef00f2.RData: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/rdpeng/artofdatascience/095fcf05885181cbca6d245f463e9cf3dfe5543e/manuscript/_bookdown_files/EDA_cache/html/unnamed-chunk-16_dce4065f9d0c9cd692c2eea208ef00f2.RData -------------------------------------------------------------------------------- /manuscript/_bookdown_files/EDA_cache/html/unnamed-chunk-16_dce4065f9d0c9cd692c2eea208ef00f2.rdb: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/rdpeng/artofdatascience/095fcf05885181cbca6d245f463e9cf3dfe5543e/manuscript/_bookdown_files/EDA_cache/html/unnamed-chunk-16_dce4065f9d0c9cd692c2eea208ef00f2.rdb -------------------------------------------------------------------------------- /manuscript/_bookdown_files/EDA_cache/html/unnamed-chunk-16_dce4065f9d0c9cd692c2eea208ef00f2.rdx: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/rdpeng/artofdatascience/095fcf05885181cbca6d245f463e9cf3dfe5543e/manuscript/_bookdown_files/EDA_cache/html/unnamed-chunk-16_dce4065f9d0c9cd692c2eea208ef00f2.rdx -------------------------------------------------------------------------------- /manuscript/about.md: -------------------------------------------------------------------------------- 1 | # About the Authors 2 | 3 | **Roger D. Peng** is a Professor of [Statistics and Data Sciences](https://stat.utexas.edu) at the University of Texas, Austin. Previously, he was [Professor of Biostatistics](https://publichealth.jhu.edu/departments/biostatistics) at the Johns Hopkins Bloomberg School of Public Health and the Co-Director of the [Johns Hopkins Data Science Lab](https://jhudatascience.org). His current research focuses on developing theory and methods for building successful data analyses and on the development of statistical methods for addressing environmental health problems. He is the author of the popular book [*R Programming for Data Science*](https://leanpub.com/rprogramming) and 10 other books on data science and statistics. He is also the co-creator of the [Simply Statistics blog](https://simplystatistics.org) where he writes about statistics for the public, the [Not So Standard Deviations podcast](https://nssdeviations.com) with Hilary Parker, and [The Effort Report podcast](https://effortreport.libsyn.com) with Elizabeth Matsui. Roger is a Fellow of the American Statistical Association and is the recipient of the Mortimer Spiegelman Award from the American Public Health Association, which honors a statistician who has made outstanding contributions to public health. Roger received a PhD in Statistics from the University of California, Los Angeles. 4 | 5 | **Elizabeth Matsui** is a Professor of Population Health and Pediatrics at Dell Medical School at the University of Texas, Austin and an Adjunct Professor of Pediatrics at Johns Hopkins University. 
She is also a practicing pediatric allergist/immunologist and epidemiologist and directs a research program focused on environmental exposures and lung health. Elizabeth can be found on Twitter at [@elizabethmatsui](https://twitter.com/elizabethmatsui). -------------------------------------------------------------------------------- /manuscript/artofdatascience.Rproj: -------------------------------------------------------------------------------- 1 | Version: 1.0 2 | 3 | RestoreWorkspace: Default 4 | SaveWorkspace: Default 5 | AlwaysSaveHistory: Default 6 | 7 | EnableCodeIndexing: Yes 8 | UseSpacesForTab: Yes 9 | NumSpacesForTab: 8 10 | Encoding: UTF-8 11 | 12 | RnwWeave: Sweave 13 | LaTeX: pdfLaTeX 14 | -------------------------------------------------------------------------------- /manuscript/artofdatascience.rds: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/rdpeng/artofdatascience/095fcf05885181cbca6d245f463e9cf3dfe5543e/manuscript/artofdatascience.rds -------------------------------------------------------------------------------- /manuscript/backmatter.txt: -------------------------------------------------------------------------------- 1 | 2 | {backmatter} 3 | 4 | 5 | -------------------------------------------------------------------------------- /manuscript/buildbook.R: -------------------------------------------------------------------------------- 1 | ## Build Book 2 | 3 | chapters <- readLines("Book.txt") 4 | chap.rmd <- sub("\\.md$", ".Rmd", chapters) 5 | use <- file.exists(chap.rmd) 6 | chap.rmd <- chap.rmd[use] 7 | chapters <- chapters[use] 8 | 9 | library(knitr) 10 | source("fixmath.R") 11 | 12 | for(i in seq_along(chap.rmd)) { 13 | fi.rmd <- file.info(chap.rmd[i]) 14 | fi.md <- file.info(chapters[i]) 15 | if(fi.rmd$mtime > fi.md$mtime) { 16 | message("knitting ", chap.rmd[i], "...") 17 | knit(chap.rmd[i], quiet = TRUE) 18 | 19 | if(chap.rmd[i] %in% c("formalmodeling.Rmd", 20 | 
"causalinference.Rmd")) 21 | fixmath(chapters[i]) 22 | if(chap.rmd[i] == "inference.Rmd") 23 | system(sprintf("./equation.pl %s", chapters[i])) 24 | } else 25 | message("skipping ", chap.rmd[i], "...") 26 | } 27 | -------------------------------------------------------------------------------- /manuscript/causalinference.qmd: -------------------------------------------------------------------------------- 1 | ../../elements_of_data_science/causalinference.qmd -------------------------------------------------------------------------------- /manuscript/collaboration.md: -------------------------------------------------------------------------------- 1 | # Collaboration 2 | 3 | We (Elizabeth and Roger) have been working together for a decade now and the story of how we started working together is perhaps a brief lesson on collaboration in and of itself. The short version is that Elizabeth emailed someone who didn’t have time to help her, so that person emailed someone else who didn’t have time, so that person emailed someone else who didn’t have time, so that person emailed Roger, who as a mere assistant professor had plenty of time! Some people are irked by this process because it feels like you’re someone’s fourth choice. But many good collaborations come about this way and it either works or it doesn’t work, regardless of where on the list you were when you were contacted. 4 | 5 | 6 | ## The Care and Feeding of Your Scientist Collaborator 7 | 8 | Once you’ve found yourself a collaborator, either in the scientific, business, or any other domain, what should you do with this person? Here is some advice that we have accumulated over the years. In this section we may refer to things in the scientific or academic context, because that is where we have the most experience, but we believe that the ideas are general and extend to other domains where collaboration is critical.
9 | 10 | 11 | ### A Scientist is Not a Fountain From Which “The Numbers” Flow 12 | 13 | Most statisticians and data scientists like to work with data, and some even need it to demonstrate the usefulness of their methods or theories. So there’s a temptation to go “find a scientist” to give you some data. This is starting off on the wrong foot. If you picture your collaborator as a person who hands over the data and then you never talk to that person again, then things will probably not end up so great. 14 | 15 | There are at least two ways in which the experience will be sub-optimal. First, your scientist collaborator may feel miffed that you basically went off and did your own thing, making them less inclined to work with you in the future. Second, the product you end up with (paper, software, algorithm, etc.) might not have the same impact on science as it would have had if you’d worked together more closely. Perhaps most importantly, “the data” have important context and come from somewhere and someone---data do not live in isolation. Your collaborator is in the best position to explain the details of that to you. 16 | 17 | 18 | ### Be Patient, Not Patronizing 19 | 20 | Statisticians are often annoyed that “So-and-so didn’t even know this” or are frustrated that “They tried to do this with a sample size of 3!” True, there are egregious cases of scientists with a lack of basic statistical knowledge, but *all good collaborations involve some teaching*. Otherwise, why would you collaborate with someone who knows exactly the same things that you know? 21 | 22 | Just like it’s important to take some time to learn the discipline that you are applying your statistical methods to, it’s important to take some time to describe to your collaborator how those statistical methods you’re using really work. Where does the information in the data come from? What aspects are important; what aspects are not important? 
What do parameter estimates mean in the context of this problem? If you find you can’t actually explain these concepts, or become very impatient when they don’t understand, that may be an indication that there’s a problem with the method itself that may need rethinking. Or maybe you just need a simpler method. 23 | 24 | 25 | ### Go To Where They Are 26 | 27 | This is a bit of advice that Roger got when he was just starting out at Johns Hopkins as a lowly assistant professor. The bottom line is that if you understand where the data come from (as in literally, the data come from this organ in this person’s body), then you might not be so flippant about asking for an extra 100 subjects to have a sufficient sample size. The context from which the data originate is always important to understand. 28 | 29 | In biomedical science, the data usually come from people. Real people. And the job of collecting that data, the scientist’s job, is usually not easy. So if you have a chance, go see how the data are collected and what needs to be done. Even just going to their office or lab for a meeting rather than having them come to you can be helpful for understanding the environment and context in which they work. 30 | 31 | It can feel nice (and convenient) to have everyone coming to you, but that’s just vanity. Take the time and go to where they are. 32 | 33 | 34 | ### Their Business Is Your Business, So Pitch In 35 | 36 | A lot of research (and actually most jobs) involves doing things that are not specifically relevant to your primary goal (such as a published paper in a good journal). But sometimes you do those things to achieve broader goals, like building better relationships and networks of contacts. This may involve, say, doing a sample size calculation once in a while for a new grant that’s going in. That may not be pertinent to your current project, but it’s not that hard to do, and it’ll help your collaborator a lot. 
37 | 38 | You’re part of a team here, so everyone has to pitch in. In a restaurant kitchen, even the Chef works the line once in a while. Another way to think of this is as an investment. Particularly in the early stages there’s going to be a lot of ambiguity about what should be done and what is the best way to proceed. Sometimes the ideal solution won’t show itself until much later (the so-called “j-shaped curve” of investment). In the meantime, pitch in and keep things going. 39 | 40 | 41 | ### Your Job Is To Advance The Science 42 | 43 | In a good collaboration, everyone should be focused on the same goal. In my area, that goal is improving public health. If I have to prove a theorem or develop a new method to do that, then I will (or at least try). But if I’m collaborating with a biomedical scientist, there has to be an alignment of long-term goals. Otherwise, if the goals are scattered, the science tends to be scattered, and ultimately sub-optimal with respect to impact. If you think of your job in this way (to advance the science), then often you end up with better collaborations. Why? Because you start looking for people who are similarly advancing the science and having an impact, rather than looking for people who have “good data”, whatever that means, for applying your methods. 44 | 45 | In the end, I think statisticians and data scientists should focus on two things: Go out and find the best people to work with and then help them advance the science. 46 | 47 | 48 | 49 | ## The Care and Feeding of the Data Scientist 50 | 51 | Data scientists and statisticians aren’t the only ones out looking for collaborators. Collaboration is a two-way street and there are similarly scientists or business partners out there looking for help with their data. 
In the current environment, where large complex datasets and sophisticated quantitative and data visualization methods are increasingly common, collaboration with data scientists and statisticians is necessary to harness the full potential of your data and to have the greatest impact. In some cases, not engaging a data scientist collaborator may put you at risk of making statistical missteps that could result in erroneous results and poor decisions. 52 | 53 | In the previous section, we discussed how data scientists can build productive relationships with their scientist collaborators. Over time, we have also learned some key lessons regarding how scientists can best cultivate a productive relationship with a statistician or data scientist. The following points are made from the point of view of the scientist or business partner. 54 | 55 | ### Be Respectful of Time 56 | 57 | This tenet, of course, is applicable to all collaborations, but may be a more common stumbling block for scientists working with statisticians. Most power estimates and sample size calculations, for example, are more complex than appreciated by the scientist.  A discussion about the research question, primary outcome, etc. is required and some thought has to go into determining the most appropriate approach before your statistician collaborator has even laid hands on the keyboard and fired up R.  At a minimum, engage your statistician collaborator earlier than you might think necessary, and ideally, solicit their input during the planning stages.   58 | 59 | Engaging a statistician sooner rather than later not only fosters good will, but will also improve your science. A statistician’s time, like yours, is valuable, so respect their time by allocating an appropriate level of salary support on grants or budgets.  Most people appreciate that budgets are tight, so they understand that they may not get the level of salary support that they think is most appropriate.  
However, “finding room” in the budget for 1% salary support for a statistician sends the message that the statistician is an afterthought, a necessity for a sample size calculation and a competitive grant application, but in the end, just a formality.    60 | 61 | Instead, dedicate sufficient salary support in your grant to support the level of statistical effort that will be needed.  This sends the message that you would like your statistician collaborator to be an integral part of the team and provides an opportunity for the kind of regular, ongoing interactions that are needed for productive collaborations. 62 | 63 | ### The Statistician is Not a Computational Tool 64 | 65 | Although sample size and power calculations are probably the most common service solicited from statisticians (especially in clinical research), and statisticians can be enormously helpful in this arena, they have the most impact when they are engaged in discussions about study designs and analytic approaches for a scientific question. Their quantitative approach to scientific problems provides a fresh perspective that can increase the scientific impact of your work.  My sense is that this is also much more interesting work for a statistician than sample size and power calculations, and engaging them in interesting work goes a long way towards cementing a mutually productive collaboration. 66 | 67 | ### Learn the Lingo of Statistics 68 | 69 | Technical jargon is a serious impediment to successful collaboration. This is true of all cross-discipline collaborations, but may be particularly true in collaborations with statisticians.  The field has a penchant for eponymous methods (Hosmer-Lemeshow, Wald, etc.) and terminology that is entertaining, but not intuitive (jackknife, bootstrapping, lasso).  
While I am not suggesting that a scientist needs to enroll in statistics courses (why gain expertise in a field when your collaborator provides this expertise), I am advocating for educating yourself about the basic concepts and terminology of statistics.   70 | 71 | Know what is meant by: distribution of a variable, predictor variable, outcome variable, and variance, for example. There are some terrific “Statistics 101”-type lectures and course materials online that are excellent resources.  But also lean on your statistician collaborators by asking them to explain terminology and teach you these basics. Do not be afraid to ask questions. 72 | 73 | ### When All Else Fails (and Even When All Else Doesn’t Fail), Draw Pictures 74 | 75 | In truth, this is often the place where I start when I first engage a statistician. Showing your statistician collaborator what you expect your data to look like in a figure or conceptual diagram simplifies communication as it avoids use of jargon. Statisticians can readily grasp the key information they need from a figure or diagram to come up with a sample size estimate or analytic approach. 76 | 77 | If you are in a setting where you cannot draw things in person (e.g. working in two different institutions), try your best to draw things out. If you have a tablet-like device, there are numerous apps for drawing figures that can then be sent to collaborators. Video chat software sometimes has a mechanism for sharing a drawing in real time. We have even resorted to drawing things on a piece of paper, taking a picture of the paper, and then texting it to each other. It works! 78 | 79 | 80 | ### Teach Them Your Language 81 | 82 | Clinical medicine (the area in which Elizabeth works) is rife with jargon, and just as statistical jargon can make it difficult to communicate clearly with a statistician, so can clinical or domain-area jargon. Avoid technical jargon where possible, and define terminology where it is not possible to avoid it.  
Educate your collaborator about the background, context and rationale for your scientific question and encourage questions from them. 83 | 84 | 85 | ### Generously Share Your Data and Ideas 86 | 87 | In many organizations, statisticians are very interested in developing new methods, applying more sophisticated methods to an “old” problem, and/or answering their own scientific questions. Do what you can to support these career interests, such as sharing your data and your ideas. Sharing data opens up avenues for increasing the impact of your work, as your statistician collaborator has opportunities to develop quantitative approaches to answering research questions related to your own interests.   88 | 89 | Sharing data alone is not sufficient, though. Discussions about what you see as the important, unanswered questions will help provide the necessary background and context for the statistician to make the most of the available data.  As highlighted in a recent [book](http://www.nytimes.com/2013/03/31/magazine/is-giving-the-secret-to-getting-ahead.html), giving may be an important and overlooked component of success, and I would argue, also a cornerstone of a successful collaboration. 90 | 91 | 92 | 93 | -------------------------------------------------------------------------------- /manuscript/communication.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | output: 3 | html_document: 4 | keep_md: yes 5 | --- 6 | # Communication {#chapter-communication} 7 | 8 | Communication is fundamental to good data analysis. What we aim to address in this chapter is the role of routine communication in the process of doing your data analysis and in disseminating your final results in a more formal setting, often to an external, larger audience. There are lots of good books that address the "how-to" of giving formal presentations, either in the form of a talk or a written piece, such as a white paper or scientific paper. 
In this chapter, though, we will focus on: 9 | 10 | 1. How to use routine communication as one of the tools needed to perform a good data analysis; and 11 | 12 | 2. How to convey the key points of your data analysis when communicating informally and formally. 13 | 14 | Communication is both one of the tools of data analysis, and also the final product of data analysis: there is no point in doing a data analysis if you're not going to communicate your process and results to an audience. A good data analyst communicates informally multiple times during the data analysis process and also gives careful thought to communicating the final results so that the analysis is as useful and informative as possible to the wider audience it was intended for. 15 | 16 | ## Routine communication 17 | 18 | The main purpose of routine communication is to gather data, which is part of the epicyclic process for each core activity. You gather data by communicating your results, and the responses you receive from your audience should inform the next steps in your data analysis. The types of responses you receive include not only answers to specific questions, but also commentary and questions your audience has in response to your report (either written or oral). The form that your routine communication takes depends on what the goal of the communication is. If your goal, for example, is to get clarity on how a variable is coded because, when you explore the dataset, it appears to be an ordinal variable even though you had understood it to be a continuous variable, your communication is brief and to the point. 19 | 20 | If, on the other hand, some results from your exploratory data analysis are not what you expected, your communication may take the form of a small, informal meeting that includes displaying tables and/or figures pertinent to your issue.
A third type of informal communication is one in which you may not have specific questions to ask of your audience, but instead are seeking feedback on the data analysis process and/or results to help you refine the process and/or to inform your next steps. 21 | 22 | In sum, there are three main types of informal communication and they are classified based on the objectives you have for the communication: (1) to answer a very focused question, which is often a technical question or a question aimed at gathering a fact, (2) to help you work through some results that are puzzling or not quite what you expected, and (3) to get general impressions and feedback as a means of identifying issues that had not occurred to you so that you can refine your data analysis. 23 | 24 | Focusing on a few core concepts will help you achieve your objectives when planning routine communication. These concepts are: 25 | 26 | 1. **Audience**: Know your audience and when you have control over who the audience is, select the right audience for the kind of feedback you are looking for. 27 | 28 | 2. **Content**: Be focused and concise, but provide sufficient information for the audience to understand the information you are presenting and question(s) you are asking. 29 | 30 | 3. **Style**: Avoid jargon. Unless you are communicating about a focused highly technical issue to a highly technical audience, it is best to use language and figures and tables that can be understood by a more general audience. 31 | 32 | 4. **Attitude**: Have an open, collaborative attitude so that you are ready to fully engage in a dialogue and so that your audience gets the message that your goal is not to "defend" your question or work, but rather to get their input so that you can do your best work. 
33 | 34 | ## The Audience 35 | 36 | For many types of routine communication, you will have the ability to select your audience, but in some cases, such as when you are delivering an interim report to your boss or your team, the audience may be pre-determined. Your audience may be composed of other data analysts, the individual(s) who initiated the question, your boss and/or other managers or executive team members, non-data analysts who are content experts, and/or someone representing the general public. 37 | 38 | For the first type of routine communication, in which you are primarily seeking factual knowledge or clarification about the dataset or related information, selecting a person (or people) who have the factual knowledge to answer the question and are responsive to queries is most appropriate. For a question about how the data for a variable in the dataset were collected, you might approach a person who collected the data or a person who has worked with the dataset before or was responsible for compiling the data. If the question is about the command to use in a statistical programming language in order to run a certain type of statistical test, this information is often easily found by an internet search. But if this fails, querying a person who uses the particular programming language would be appropriate. 39 | 40 | For the second type of routine communication, in which you have some results and you are either unsure whether they are what you'd expect, or they are not what you expected, you'll likely be most helped if you engage more than one person and they represent a range of perspectives. The most productive and helpful meetings typically include people with data analysis and content area expertise. As a rule of thumb, the more types of stakeholders you communicate with while you are doing your data analysis project, the better your final product will be. 
For example, if you only communicate with other data analysts, you may overlook some important aspects of your data analysis that would have been discovered had you communicated with your boss, content experts, or other people. 41 | 42 | The third type of routine communication typically occurs when you have come to a natural place for pausing your data analysis. Although when and where in your data analysis these pauses occur are dictated by the specific analysis you are doing, one very common place to pause and take stock is after completing at least some exploratory data analysis. It's important to pause and ask for feedback at this point as this exercise will often identify additional exploratory analyses that are important for informing next steps, such as model building, and therefore prevent you from sinking time and effort into pursuing models that are not relevant, not appropriate, or both. This sort of communication is most effective when it takes the form of a face-to-face meeting, but video conferencing and phone conversations can also be effective. When selecting your audience, think about who among the people available to you will give the most helpful feedback and which perspectives will be important for informing the next steps of your analysis. At a minimum, you should have both data analysis and content expertise represented, but in this type of meeting it may also be helpful to hear from people who share, or at least understand, the perspective of the larger target audience for the formal communication of the results of your data analysis. 43 | 44 | 45 | ## Content 46 | 47 | The most important guiding principle is to tailor the information you deliver to the objective of the communication. For a targeted question aimed at getting clarification about the coding of a variable, the recipient of your communication does not need to know the overall objective of your analysis, what you have done up to this point, or see any figures or tables.
A specific, pointed question is sufficient, along the lines of: "I'm analyzing the crime dataset that you sent me last week and am looking at the variable 'education' and see that it is coded 0, 1, and 2, but I don't see any labels for those codes. Do you know what these codes for the 'education' variable stand for?" 48 | 49 | For the second type of communication, in which you are seeking feedback because of a puzzling or unexpected issue with your analysis, more background information will be needed, but complete background information for the overall project may not be. To illustrate this concept, let's assume that you have been examining the relationship between height and lung function and you construct a scatterplot, which suggests that the relationship is non-linear, as there appears to be curvature. Although you have some ideas about approaches for handling non-linear relationships, you appropriately seek input from others. After giving some thought to your objectives for the communication, you settle on two primary objectives: (1) To understand if there is a best approach for handling the non-linearity of the relationship, and if so, how to determine which approach is best, and (2) To understand more about the non-linear relationship you observe, including whether this is expected and/or known and whether the non-linearity is important to capture in your analyses. 50 | 51 | To achieve your objectives, you will need to provide your audience with some context and background, but providing a comprehensive background for the data analysis project and a review of all of the steps you've taken so far is unnecessary and likely to absorb time and effort that would be better devoted to your specific objectives.
In this example, appropriate context and background might include the following: (1) the overall objective of the data analysis, (2) how height and lung function fit into the overall objective of the data analysis, for example, height may be a potential confounder, or the major predictor of interest, and (3) what you have done so far with respect to height and lung function and what you've learned. This final step should include some visual display of data, such as the aforementioned scatterplot. The final content of your presentation, then, would include a statement of the objectives for the discussion, a brief overview of the data analysis project, how the specific issue you are facing fits into the overall data analysis project, and then finally, pertinent findings from your analysis related to height and lung function. 52 | 53 | If you were developing a slide presentation, fewer slides should be devoted to the background and context than to the presentation of the data analysis findings for height and lung function. One slide should be sufficient for the data analysis overview, and 1-2 slides should be sufficient for explaining the context of the height-lung function issue within the larger data analysis project. The meat of the presentation shouldn't require more than 5-8 slides, so that the total presentation takes no more than 10-15 minutes. Although slides are certainly not necessary, a visual tool for presenting this information is very helpful, and using one does not imply that the presentation should be "formal." Instead, the idea is to provide the group sufficient information to generate discussion that is focused on your objectives, which is best achieved by an informal presentation. 54 | 55 | These same principles apply to the third type of communication, except that you may not have focused objectives and instead you may be seeking general feedback on your data analysis project from your audience.
If this is the case, this more general objective should be stated and the remainder of the content should include a statement of the question you are seeking to answer with the analysis, the objective(s) of the data analysis, a summary of the characteristics of the data set (source of the data, number of observations, etc.), a summary of your exploratory analyses, a summary of your model building, your interpretation of your results, and conclusions. By providing key points from your entire data analysis, your audience will be able to provide feedback about the overall project as well as each of the steps of data analysis. A well-planned discussion yields helpful, thoughtful feedback and should be considered a success if you are left armed with additional refinements to make to your data analysis and a clearer perspective about what should be included in the more formal presentation of your final results to an external audience. 56 | 57 | 58 | ## Style 59 | 60 | Although the style of communication increases in formality from the first to the third type of routine communication, all of these communications should largely be informal and, except for perhaps the focused communication about a small technical issue, jargon should be avoided. Because the primary purpose of routine communication is to get feedback, your communication style should encourage discussion. Some approaches to encourage discussion include stating up front that you would like the bulk of the meeting to include active discussion and that you welcome questions during your presentation rather than asking the audience to hold them until the end. If an audience member provides commentary, asking what others in the audience think will also promote discussion. In essence, to get the best feedback you want to hear what your audience members are thinking, and this is most likely accomplished by setting an informal tone and actively encouraging discussion.
61 | 62 | 63 | ## Attitude 64 | 65 | A defensive or off-putting attitude can sabotage all the work you've put into carefully selecting the audience, thoughtfully identifying your objectives and preparing your content, and stating that you are seeking discussion. Your audience will be reluctant to offer constructive feedback if they sense that their feedback will not be well received, and you will leave the meeting without achieving your objectives and ill-prepared to make any refinements or additions to your data analysis. And when it comes time to deliver a formal presentation to an external audience, you will not be well prepared and won't be able to present your best work. To avoid this pitfall, deliberately cultivate a receptive and positive attitude prior to communicating by putting your ego and insecurities aside. If you can do this successfully, it will serve you well. In fact, we both know people who have had highly successful careers based largely on their positive and welcoming attitude towards feedback, including constructive criticism. 66 | -------------------------------------------------------------------------------- /manuscript/communication.md: -------------------------------------------------------------------------------- 1 | --- 2 | output: 3 | html_document: 4 | keep_md: yes 5 | --- 6 | # Communication {#chapter-communication} 7 | 8 | Communication is fundamental to good data analysis. What we aim to address in this chapter is the role of routine communication in the process of doing your data analysis and in disseminating your final results in a more formal setting, often to an external, larger audience. There are lots of good books that address the "how-to" of giving formal presentations, either in the form of a talk or a written piece, such as a white paper or scientific paper. In this chapter, though, we will focus on: 9 | 10 | 1. How to use routine communication as one of the tools needed to perform a good data analysis; and 11 | 12 | 2.
How to convey the key points of your data analysis when communicating informally and formally. 13 | 14 | Communication is both one of the tools of data analysis, and also the final product of data analysis: there is no point in doing a data analysis if you're not going to communicate your process and results to an audience. A good data analyst communicates informally multiple times during the data analysis process and also gives careful thought to communicating the final results so that the analysis is as useful and informative as possible to the wider audience it was intended for. 15 | 16 | ## Routine communication 17 | 18 | The main purpose of routine communication is to gather data, which is part of the epicyclic process for each core activity. You gather data by communicating your results and the responses you receive from your audience should inform the next steps in your data analysis. The types of responses you receive include not only answers to specific questions, but also commentary and questions your audience has in response to your report (either written or oral). The form that your routine communication takes depends on what the goal of the communication is. If your goal, for example, is to get clarity on how a variable is coded because when you explore the dataset it appears to be an ordinal variable, but you had understood that it was a continuous variable, your communication is brief and to the point. 19 | 20 | If, on the other hand, some results from your exploratory data analysis are not what you expected, your communication may take the form of a small, informal meeting that includes displaying tables and/or figures pertinent to your issue. A third type of informal communication is one in which you may not have specific questions to ask of your audience, but instead are seeking feedback on the data analysis process and/or results to help you refine the process and/or to inform your next steps. 
21 | 22 | In sum, there are three main types of informal communication and they are classified based on the objectives you have for the communication: (1) to answer a very focused question, which is often a technical question or a question aimed at gathering a fact, (2) to help you work through some results that are puzzling or not quite what you expected, and (3) to get general impressions and feedback as a means of identifying issues that had not occurred to you so that you can refine your data analysis. 23 | 24 | Focusing on a few core concepts will help you achieve your objectives when planning routine communication. These concepts are: 25 | 26 | 1. **Audience**: Know your audience and when you have control over who the audience is, select the right audience for the kind of feedback you are looking for. 27 | 28 | 2. **Content**: Be focused and concise, but provide sufficient information for the audience to understand the information you are presenting and question(s) you are asking. 29 | 30 | 3. **Style**: Avoid jargon. Unless you are communicating about a focused highly technical issue to a highly technical audience, it is best to use language and figures and tables that can be understood by a more general audience. 31 | 32 | 4. **Attitude**: Have an open, collaborative attitude so that you are ready to fully engage in a dialogue and so that your audience gets the message that your goal is not to "defend" your question or work, but rather to get their input so that you can do your best work. 33 | 34 | ## The Audience 35 | 36 | For many types of routine communication, you will have the ability to select your audience, but in some cases, such as when you are delivering an interim report to your boss or your team, the audience may be pre-determined. 
Your audience may be composed of other data analysts, the individual(s) who initiated the question, your boss and/or other managers or executive team members, non-data analysts who are content experts, and/or someone representing the general public. 37 | 38 | For the first type of routine communication, in which you are primarily seeking factual knowledge or clarification about the dataset or related information, selecting a person (or people) who have the factual knowledge to answer the question and are responsive to queries is most appropriate. For a question about how the data for a variable in the dataset were collected, you might approach a person who collected the data or a person who has worked with the dataset before or was responsible for compiling the data. If the question is about the command to use in a statistical programming language in order to run a certain type of statistical test, this information is often easily found by an internet search. But if this fails, querying a person who uses the particular programming language would be appropriate. 39 | 40 | For the second type of routine communication, in which you have some results and you are either unsure whether they are what you'd expect, or they are not what you expected, you'll likely be most helped if you engage more than one person representing a range of perspectives. The most productive and helpful meetings typically include people with data analysis and content area expertise. As a rule of thumb, the more types of stakeholders you communicate with while you are doing your data analysis project, the better your final product will be. For example, if you only communicate with other data analysts, you may overlook some important aspects of your data analysis that would have been discovered had you communicated with your boss, content experts, or other people.
41 | 42 | The third type of routine communication typically occurs when you have come to a natural place to pause your data analysis. Although when and where in your data analysis these pauses occur are dictated by the specific analysis you are doing, one very common place to pause and take stock is after completing at least some exploratory data analysis. It's important to pause and ask for feedback at this point as this exercise will often identify additional exploratory analyses that are important for informing next steps, such as model building, and therefore prevent you from sinking time and effort into pursuing models that are not relevant, not appropriate, or both. This sort of communication is most effective when it takes the form of a face-to-face meeting, but video conferencing and phone conversations can also be effective. When selecting your audience, think about who among the people available to you will give the most helpful feedback and which perspectives will be important for informing the next steps of your analysis. At a minimum, you should have both data analysis and content expertise represented, but in this type of meeting it may also be helpful to hear from people who share, or at least understand, the perspective of the larger target audience for the formal communication of the results of your data analysis. 43 | 44 | 45 | ## Content 46 | 47 | The most important guiding principle is to tailor the information you deliver to the objective of the communication. For a targeted question aimed at getting clarification about the coding of a variable, the recipient of your communication does not need to know the overall objective of your analysis, what you have done up to this point, or see any figures or tables. A specific, pointed question is sufficient, along the lines of: "I'm analyzing the crime dataset that you sent me last week and am looking at the variable 'education' and see that it is coded 0, 1, and 2, but I don't see any labels for those codes.
Do you know what these codes for the 'education' variable stand for?" 48 | 49 | For the second type of communication, in which you are seeking feedback because of a puzzling or unexpected issue with your analysis, more background information will be needed, but complete background information for the overall project may not be. To illustrate this concept, let's assume that you have been examining the relationship between height and lung function and you construct a scatterplot, which suggests that the relationship is non-linear, as there appears to be curvature. Although you have some ideas about approaches for handling non-linear relationships, you appropriately seek input from others. After giving some thought to your objectives for the communication, you settle on two primary objectives: (1) To understand if there is a best approach for handling the non-linearity of the relationship, and if so, how to determine which approach is best, and (2) To understand more about the non-linear relationship you observe, including whether this is expected and/or known and whether the non-linearity is important to capture in your analyses. 50 | 51 | To achieve your objectives, you will need to provide your audience with some context and background, but providing a comprehensive background for the data analysis project and a review of all of the steps you've taken so far is unnecessary and likely to absorb time and effort that would be better devoted to your specific objectives. In this example, appropriate context and background might include the following: (1) the overall objective of the data analysis, (2) how height and lung function fit into the overall objective of the data analysis, for example, height may be a potential confounder, or the major predictor of interest, and (3) what you have done so far with respect to height and lung function and what you've learned.
This final step should include some visual display of data, such as the aforementioned scatterplot. The final content of your presentation, then, would include a statement of the objectives for the discussion, a brief overview of the data analysis project, how the specific issue you are facing fits into the overall data analysis project, and then finally, pertinent findings from your analysis related to height and lung function. 52 | 53 | If you were developing a slide presentation, fewer slides should be devoted to the background and context than to the presentation of the data analysis findings for height and lung function. One slide should be sufficient for the data analysis overview, and 1-2 slides should be sufficient for explaining the context of the height-lung function issue within the larger data analysis project. The meat of the presentation shouldn't require more than 5-8 slides, so that the total presentation takes no more than 10-15 minutes. Although slides are certainly not necessary, a visual tool for presenting this information is very helpful, and using one does not imply that the presentation should be "formal." Instead, the idea is to provide the group sufficient information to generate discussion that is focused on your objectives, which is best achieved by an informal presentation. 54 | 55 | These same principles apply to the third type of communication, except that you may not have focused objectives and instead you may be seeking general feedback on your data analysis project from your audience. If this is the case, this more general objective should be stated and the remainder of the content should include a statement of the question you are seeking to answer with the analysis, the objective(s) of the data analysis, a summary of the characteristics of the data set (source of the data, number of observations, etc.), a summary of your exploratory analyses, a summary of your model building, your interpretation of your results, and conclusions.
By providing key points from your entire data analysis, your audience will be able to provide feedback about the overall project as well as each of the steps of data analysis. A well-planned discussion yields helpful, thoughtful feedback and should be considered a success if you are left armed with additional refinements to make to your data analysis and a clearer perspective about what should be included in the more formal presentation of your final results to an external audience. 56 | 57 | 58 | ## Style 59 | 60 | Although the style of communication increases in formality from the first to the third type of routine communication, all of these communications should largely be informal and, except for perhaps the focused communication about a small technical issue, jargon should be avoided. Because the primary purpose of routine communication is to get feedback, your communication style should encourage discussion. Some approaches to encourage discussion include stating up front that you would like the bulk of the meeting to include active discussion and that you welcome questions during your presentation rather than asking the audience to hold them until the end. If an audience member provides commentary, asking what others in the audience think will also promote discussion. In essence, to get the best feedback you want to hear what your audience members are thinking, and this is most likely accomplished by setting an informal tone and actively encouraging discussion. 61 | 62 | 63 | ## Attitude 64 | 65 | A defensive or off-putting attitude can sabotage all the work you've put into carefully selecting the audience, thoughtfully identifying your objectives and preparing your content, and stating that you are seeking discussion.
Your audience will be reluctant to offer constructive feedback if they sense that their feedback will not be well received, and you will leave the meeting without achieving your objectives and ill-prepared to make any refinements or additions to your data analysis. And when it comes time to deliver a formal presentation to an external audience, you will not be well prepared and won't be able to present your best work. To avoid this pitfall, deliberately cultivate a receptive and positive attitude prior to communicating by putting your ego and insecurities aside. If you can do this successfully, it will serve you well. In fact, we both know people who have had highly successful careers based largely on their positive and welcoming attitude towards feedback, including constructive criticism. 66 | -------------------------------------------------------------------------------- /manuscript/conclusion.md: -------------------------------------------------------------------------------- 1 | # Concluding Thoughts 2 | 3 | You should now be armed with an approach that you can apply to your data analyses. Although each data set is its own unique organism and each analysis has its own specific issues to contend with, tackling each step with the epicycle framework is useful for any analysis. As you work through developing your question, exploring your data, modeling your data, interpreting your results, and communicating your results, remember to always set expectations and then compare the result of your action to your expectations. If they don’t match, identify whether the problem is with the result of your action or your expectations and fix the problem so that they do match. If you can’t identify the problem, seek input from others, and then when you’ve fixed the problem, move on to the next action. This epicycle framework will help to keep you on a course that will end at a useful answer to your question.
4 | 5 | In addition to the epicycle framework, there are also activities of data analysis that we discussed throughout the book. Although all of the analysis activities are important, if we had to identify the ones that are most important for ensuring that your data analysis provides a valid, meaningful, and interpretable answer to your question, we would include the following: 6 | 7 | 1. Be thoughtful about developing your question and use the question to guide you throughout all of the analysis steps. 8 | 9 | 2. Follow the ABCs: 10 | 1. Always be checking 11 | 2. Always be challenging 12 | 3. Always be communicating 13 | 14 | The best way for the epicycle framework and these activities to become second nature is to do a lot of data analysis, so we encourage you to take advantage of the data analysis opportunities that come your way. Although many of these principles will become second nature to you with practice, we have found that revisiting them has helped to resolve a range of issues we’ve faced in our own analyses. We hope, then, that the book continues to serve as a useful resource after you’re done reading it, when you hit the stumbling blocks that occur in every analysis.
15 | -------------------------------------------------------------------------------- /manuscript/data/face.rda: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/rdpeng/artofdatascience/095fcf05885181cbca6d245f463e9cf3dfe5543e/manuscript/data/face.rda -------------------------------------------------------------------------------- /manuscript/data/hourly_ozone.rds: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/rdpeng/artofdatascience/095fcf05885181cbca6d245f463e9cf3dfe5543e/manuscript/data/hourly_ozone.rds -------------------------------------------------------------------------------- /manuscript/data/leanpub-rprogramming-interested-readers-20150430.csv: -------------------------------------------------------------------------------- 1 | Name,Willing to pay,Email 2 | Mark Rodighiero,50.0,markatango@gmail.com 3 | Blake Decker,50.0,"" 4 | Jacqueline Smith,50.0,notjesmith@gmail.com 5 | Joyce Hsiao,50.0,joyce.hsiao1@gmail.com 6 | Amira Shelby,45.0,"" 7 | Drew Steen,40.0,"" 8 | Drew Steen,40.0,asteen1@utk.edu 9 | Saravanan Chinnachamy,40.0,Mitsaravanan@hotmail.com 10 | Luke,35.0,lromeo16@gmail.com 11 | Tamás Kenesei,30.0,takenesei@gmail.com 12 | Kostini,30.0,kokrits@gmail.com 13 | Laurence Liew,30.0,laurenceliew@postbox.sg 14 | Jeff Hedberg,30.0,jhedbergfd3s@gmail.com 15 | Azeem Ansar,30.0,azeem.ansar@gmail.com 16 | Max Hashim,30.0,maxhashim@gmail.com 17 | Rodrigo Ruben Santiago Nieves,30.0,nievesnievesnieves@gmail.com 18 | priyatham,30.0,priyatham.lallu@gmail.com 19 | Paul Kaefer,30.0,paul.kaefer@gmail.com 20 | Adrienn Szabo,30.0,"" 21 | guillaume D,30.0,"" 22 | Ted Scott,25.0,fishdaddyo@gmail.com 23 | Peter Hahn,25.0,hahn@hand.franken.de 24 | Stephanie Fulton,25.0,sgfulton@charter.net 25 | Mike reid,25.0,"" 26 | Dr. 
Connie Brett,25.0,"" 27 | Robert Boissy,25.0,rboissy@unmc.edu 28 | Lorena,25.0,lorena.zuniga@gmail.com 29 | Lorena,25.0,lorena.zuniga@gmail.com 30 | guillaume D,25.0,"" 31 | Will Blake,25.0,"" 32 | Carlos Santos,25.0,carlos.s.santos@gmail.com 33 | Ciera Martinez,20.0,ciera.martinez@gmail.com 34 | Aaron Kub,20.0,aaronkub@gmail.com 35 | Ilker Arslan,20.0,ilkarslan@gmail.com 36 | Ian McCarthy,20.0,isnmccarthyonline@gmail.com 37 | Sigve Nakken,20.0,"" 38 | V DeCristoforo,20.0,vdcnorth@gmail.com 39 | Mark Mounts,20.0,"" 40 | Johnny,20.0,"" 41 | Wagner Queiroz,20.0,wagnersq@gmail.com 42 | Michael Wimsatt,20.0,"" 43 | Walker Hamby,20.0,wker.hamby@gmail.com 44 | J Galbraith,20.0,"" 45 | dan omalley,20.0,"" 46 | Jan,15.0,Jan.multmeier@web.de 47 | Hiroaki Yutani,15.0,"" 48 | Dmitriy,15.0,kirillovde@gmail.com 49 | Simbhu,15.0,Simbhu@gmail.com 50 | Alex Spurrier,15.0,alex.spurrier@gmail.com 51 | Brent Brewington,15.0,brent.brewington@gmail.com 52 | Barochat,15.0,Barochat@gmail.com 53 | Gary,15.0,gary.netherton@live.com 54 | hj maas,12.0,"" 55 | Margarida Freire,10.0,freire.margarida@gmail.com 56 | Stephen Newhouse,10.0,stephen.j.newhouse@gmail.com 57 | Lincoln Mullen,10.0,lincoln@lincolnmullen.com 58 | Fabian garcia,10.0,"" 59 | Jp,10.0,"" 60 | Geoff Babb,10.0,geoffbabb@yahoo.com 61 | Sean De La Torre,10.0,"" 62 | Toni Borrallo,10.0,tomi.borrallo@gmail.com 63 | Lea Hildebrandt,10.0,"" 64 | Bruno Henrique,10.0,squall.bruno@gmail.com 65 | Kamal,10.0,kamal@hothi.nz 66 | Pat,10.0,"" 67 | Sean Leonard,10.0,sean.p.leonard@utexas.edu 68 | david barr,10.0,davidadambarr@hotmail.com.com 69 | George Luft,10.0,gluft3@gmail.com 70 | david barr,10.0,davidadambarr@hotmail.com.com 71 | Sébastien Portebois,10.0,sportebois@gmail.com 72 | Carlos Garcidueñas,10.0,carlos_garciduenas@hotmail.com 73 | Jonathan,10.0,"" 74 | Motasim,10.0,"" 75 | Raphael Schagerl,10.0,"" 76 | Andrew,10.0,"" 77 | kevin,9.99,kevinholden@gmail.com 78 | Eric Wallace,8.0,"" 79 | Jordi Torrubiano,7.0,"" 80 | Ashish 
Rane,5.0,ashish.d.rane@gmail.com 81 | blessy,5.0,"" 82 | Ben,5.0,"" 83 | Lei,5.0,"" 84 | Rail Suleymanov,5.0,rail.suleymanov@gmail.com 85 | Randy Blair,5.0,"" 86 | Mark Orndorff,5.0,morndorff@stat.fsu.edu 87 | Valentino Hudhra,3.0,v.hudhra@gmail.com 88 | Kavin,2.0,"" 89 | rain sun,0.99,jiali.vt@gmail.com 90 | Paulo Atreides,0.0,pauloatreides.filho@gmail.com 91 | Loai,0.0,"" 92 | Amine Mh,0.0,"" 93 | Keith Lane,,keith.computer@gmail.com 94 | ashfaq ali,,ashfaq.ali@med.lu.se 95 | "",,"" 96 | Arpit Arora,,aroraarpit.93@gmail.com 97 | Tom,,"" 98 | Matt Fabish,,"" 99 | gayathri,,gayathrisudershan@gmail.com 100 | molly,,"" 101 | Juan,,juancamilos@gmail.com 102 | Andy Rhinehart,,"" 103 | Anthony Pece,,anthony.pece@gmail.com 104 | Willber Nascimento,,willberufal@gmail.com 105 | Pete,,"" 106 | Matt,,"" 107 | Madan,,madan.tecno@gmail.com 108 | Cristian,,"" 109 | Ricardo,,rvidal@gmail.com 110 | Steve,,steven.l.senior@gmail.com 111 | Roger Fried,,rlewisf@yahoo.com 112 | Consuelo Arellano,,"" 113 | tj,,"" 114 | Marek,,"" 115 | Sam Williams,,samual.t.williams@gmail.com 116 | eli knaap,,eknaap@umd.edu 117 | kayode Ayankoya,,s212400096@nmmu.ac.za 118 | "",,"" 119 | Mark Ruddy,,"" 120 | Steeve Brechmann,,"" 121 | dwayne paschall,,dwaynepaschall@gmail.com 122 | Antonio Barros,,antonio.barros@outlook.com 123 | Ulises Gonzalez,,gonzalez.ulises@gmail.com 124 | Ross Gayler,,"" 125 | Roxana,,"" 126 | Matt Moramarco,,"" 127 | Mark Daniel Ward,,mdw@purdue.edu 128 | Hugo Estrada,,hugoestr@gmail.com 129 | itahisa marcelino,,itahisa@gmail.com 130 | Ivan Oscar Asensio,,ivan.o.asensio@us.hsbc.com 131 | Rehab Rahman,,"" 132 | Matías Ocampo,,matiocre@gmail.com 133 | -------------------------------------------------------------------------------- /manuscript/epicycles_table.Rmd: -------------------------------------------------------------------------------- 1 | ## Table for Epicycles of Analysis 2 | 3 | | | Set Expectations | Collect Information | Revise Expectations | 4 | |:---|:---------------- 
|:------------------- |:------------------- | 5 | | **Question** | Question is of interest to audience | Literature Search/Experts | Sharpen question | 6 | | **EDA** | Data are appropriate for question | Make exploratory plots of data | Refine question or collect more data | 7 | | **Formal Modeling** | Primary model answers question | Fit secondary models, sensitivity analysis | Revise formal model to include more predictors | 8 | | **Interpretation** | Interpretation of analyses provides a specific & meaningful answer to the question | Interpret totality of analyses with focus on effect sizes & uncertainty | Revise EDA and/or models to provide specific & interpretable answer | 9 | | **Communication** | Process & results of analysis are understood, complete & meaningful to audience | Seek feedback | Revise analyses or approach to presentation | 10 | -------------------------------------------------------------------------------- /manuscript/equation.pl: -------------------------------------------------------------------------------- 1 | #!/usr/bin/perl -i 2 | # In-place filter: convert LaTeX display-math fences \[ and \] into Leanpub's {$$} and {/$$} markup 3 | while(<>) { 4 | s/^\\\[$/{\$\$}/; 5 | s/^\\\]$/{\/\$\$}/; 6 | print; 7 | } 8 | -------------------------------------------------------------------------------- /manuscript/fixmath.R: -------------------------------------------------------------------------------- 1 | ## Convert LaTeX math delimiters (display \[ \] fences and inline $...$ spans) to Leanpub's {$$}...{/$$} markup, rewriting the file in place 2 | fixmath <- function(infile) { 3 | doc0 <- readLines(infile) 4 | 5 | doc <- sub("^\\\\\\[$", "{\\$\\$}", doc0, perl = TRUE) 6 | doc <- sub("^\\\\\\]$", "{\\/\\$\\$}", doc, perl = TRUE) 7 | doc <- gsub("\\$(\\S+)\\$", "{\\$\\$}\\1{\\/\\$\\$}", doc, perl = TRUE) 8 | 9 | writeLines(doc, infile) 10 | } -------------------------------------------------------------------------------- /manuscript/frontmatter.txt: -------------------------------------------------------------------------------- 1 | 2 | {frontmatter} 3 | 4 | 5 | -------------------------------------------------------------------------------- /manuscript/images/6x9 HQ cover.jpg:
-------------------------------------------------------------------------------- https://raw.githubusercontent.com/rdpeng/artofdatascience/095fcf05885181cbca6d245f463e9cf3dfe5543e/manuscript/images/6x9 HQ cover.jpg -------------------------------------------------------------------------------- /manuscript/images/6x9 cover 300dpi.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/rdpeng/artofdatascience/095fcf05885181cbca6d245f463e9cf3dfe5543e/manuscript/images/6x9 cover 300dpi.jpg -------------------------------------------------------------------------------- /manuscript/images/EDA-unnamed-chunk-16-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/rdpeng/artofdatascience/095fcf05885181cbca6d245f463e9cf3dfe5543e/manuscript/images/EDA-unnamed-chunk-16-1.png -------------------------------------------------------------------------------- /manuscript/images/EDA-unnamed-chunk-17-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/rdpeng/artofdatascience/095fcf05885181cbca6d245f463e9cf3dfe5543e/manuscript/images/EDA-unnamed-chunk-17-1.png -------------------------------------------------------------------------------- /manuscript/images/EDA-unnamed-chunk-20-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/rdpeng/artofdatascience/095fcf05885181cbca6d245f463e9cf3dfe5543e/manuscript/images/EDA-unnamed-chunk-20-1.png -------------------------------------------------------------------------------- /manuscript/images/epicycle.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/rdpeng/artofdatascience/095fcf05885181cbca6d245f463e9cf3dfe5543e/manuscript/images/epicycle.jpg 
-------------------------------------------------------------------------------- /manuscript/images/epicycle.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/rdpeng/artofdatascience/095fcf05885181cbca6d245f463e9cf3dfe5543e/manuscript/images/epicycle.png -------------------------------------------------------------------------------- /manuscript/images/formal-unnamed-chunk-10-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/rdpeng/artofdatascience/095fcf05885181cbca6d245f463e9cf3dfe5543e/manuscript/images/formal-unnamed-chunk-10-1.png -------------------------------------------------------------------------------- /manuscript/images/formal-unnamed-chunk-12-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/rdpeng/artofdatascience/095fcf05885181cbca6d245f463e9cf3dfe5543e/manuscript/images/formal-unnamed-chunk-12-1.png -------------------------------------------------------------------------------- /manuscript/images/formal-unnamed-chunk-2-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/rdpeng/artofdatascience/095fcf05885181cbca6d245f463e9cf3dfe5543e/manuscript/images/formal-unnamed-chunk-2-1.png -------------------------------------------------------------------------------- /manuscript/images/formal-unnamed-chunk-4-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/rdpeng/artofdatascience/095fcf05885181cbca6d245f463e9cf3dfe5543e/manuscript/images/formal-unnamed-chunk-4-1.png -------------------------------------------------------------------------------- /manuscript/images/formal-unnamed-chunk-5-1.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/rdpeng/artofdatascience/095fcf05885181cbca6d245f463e9cf3dfe5543e/manuscript/images/formal-unnamed-chunk-5-1.png -------------------------------------------------------------------------------- /manuscript/images/inferencepred-unnamed-chunk-10-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/rdpeng/artofdatascience/095fcf05885181cbca6d245f463e9cf3dfe5543e/manuscript/images/inferencepred-unnamed-chunk-10-1.png -------------------------------------------------------------------------------- /manuscript/images/inferencepred-unnamed-chunk-11-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/rdpeng/artofdatascience/095fcf05885181cbca6d245f463e9cf3dfe5543e/manuscript/images/inferencepred-unnamed-chunk-11-1.png -------------------------------------------------------------------------------- /manuscript/images/inferencepred-unnamed-chunk-3-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/rdpeng/artofdatascience/095fcf05885181cbca6d245f463e9cf3dfe5543e/manuscript/images/inferencepred-unnamed-chunk-3-1.png -------------------------------------------------------------------------------- /manuscript/images/inferencepred-unnamed-chunk-4-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/rdpeng/artofdatascience/095fcf05885181cbca6d245f463e9cf3dfe5543e/manuscript/images/inferencepred-unnamed-chunk-4-1.png -------------------------------------------------------------------------------- /manuscript/images/inferencepred-unnamed-chunk-5-1.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/rdpeng/artofdatascience/095fcf05885181cbca6d245f463e9cf3dfe5543e/manuscript/images/inferencepred-unnamed-chunk-5-1.png -------------------------------------------------------------------------------- /manuscript/images/interpret-unnamed-chunk-1-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/rdpeng/artofdatascience/095fcf05885181cbca6d245f463e9cf3dfe5543e/manuscript/images/interpret-unnamed-chunk-1-1.png -------------------------------------------------------------------------------- /manuscript/images/model-unnamed-chunk-10-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/rdpeng/artofdatascience/095fcf05885181cbca6d245f463e9cf3dfe5543e/manuscript/images/model-unnamed-chunk-10-1.png -------------------------------------------------------------------------------- /manuscript/images/model-unnamed-chunk-11-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/rdpeng/artofdatascience/095fcf05885181cbca6d245f463e9cf3dfe5543e/manuscript/images/model-unnamed-chunk-11-1.png -------------------------------------------------------------------------------- /manuscript/images/model-unnamed-chunk-12-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/rdpeng/artofdatascience/095fcf05885181cbca6d245f463e9cf3dfe5543e/manuscript/images/model-unnamed-chunk-12-1.png -------------------------------------------------------------------------------- /manuscript/images/model-unnamed-chunk-3-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/rdpeng/artofdatascience/095fcf05885181cbca6d245f463e9cf3dfe5543e/manuscript/images/model-unnamed-chunk-3-1.png 
-------------------------------------------------------------------------------- /manuscript/images/model-unnamed-chunk-4-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/rdpeng/artofdatascience/095fcf05885181cbca6d245f463e9cf3dfe5543e/manuscript/images/model-unnamed-chunk-4-1.png -------------------------------------------------------------------------------- /manuscript/images/model-unnamed-chunk-5-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/rdpeng/artofdatascience/095fcf05885181cbca6d245f463e9cf3dfe5543e/manuscript/images/model-unnamed-chunk-5-1.png -------------------------------------------------------------------------------- /manuscript/images/model-unnamed-chunk-6-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/rdpeng/artofdatascience/095fcf05885181cbca6d245f463e9cf3dfe5543e/manuscript/images/model-unnamed-chunk-6-1.png -------------------------------------------------------------------------------- /manuscript/images/model-unnamed-chunk-7-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/rdpeng/artofdatascience/095fcf05885181cbca6d245f463e9cf3dfe5543e/manuscript/images/model-unnamed-chunk-7-1.png -------------------------------------------------------------------------------- /manuscript/images/penguin1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/rdpeng/artofdatascience/095fcf05885181cbca6d245f463e9cf3dfe5543e/manuscript/images/penguin1.png -------------------------------------------------------------------------------- /manuscript/images/penguin2.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/rdpeng/artofdatascience/095fcf05885181cbca6d245f463e9cf3dfe5543e/manuscript/images/penguin2.png -------------------------------------------------------------------------------- /manuscript/images/penguin3.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/rdpeng/artofdatascience/095fcf05885181cbca6d245f463e9cf3dfe5543e/manuscript/images/penguin3.png -------------------------------------------------------------------------------- /manuscript/images/table-epicycles.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/rdpeng/artofdatascience/095fcf05885181cbca6d245f463e9cf3dfe5543e/manuscript/images/table-epicycles.png -------------------------------------------------------------------------------- /manuscript/images/title_page.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/rdpeng/artofdatascience/095fcf05885181cbca6d245f463e9cf3dfe5543e/manuscript/images/title_page.png -------------------------------------------------------------------------------- /manuscript/images/title_page_orig.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/rdpeng/artofdatascience/095fcf05885181cbca6d245f463e9cf3dfe5543e/manuscript/images/title_page_orig.png -------------------------------------------------------------------------------- /manuscript/inferencepred.Rmd: -------------------------------------------------------------------------------- 1 | # Inference vs. 
Prediction: Implications for Modeling Strategy 2 | 3 | ```{r,include=FALSE} 4 | knitr::opts_chunk$set(echo = FALSE, message = FALSE, warning = FALSE, 5 | comment = NA, fig.path = "images/inferencepred-") 6 | library(tidyverse) 7 | library(broom) 8 | library(knitr) 9 | ``` 10 | 11 | Understanding whether you're answering an inferential question versus a prediction question is an important concept because the type of question you're answering can greatly influence the modeling strategy you pursue. If you do not clearly understand which type of question you are asking, you may end up using the wrong type of modeling approach and ultimately draw the wrong conclusions from your data. The purpose of this chapter is to show you what can happen when you confuse one question for another. 12 | 13 | The key things to remember are 14 | 15 | 1. For **inferential questions** the goal is typically to estimate an association between a predictor of interest and the outcome. There is usually only a handful of predictors of interest (or even just one); however, there are typically many potential confounding variables to consider. The key goal of modeling is to estimate an association while making sure you appropriately adjust for any potential confounders. Often, sensitivity analyses are conducted to see if associations of interest are robust to different sets of confounders. 16 | 17 | 2. For **prediction questions** the goal is to identify a model that *best predicts* the outcome. Typically we do not place any *a priori* importance on the predictors, so long as they are good at predicting the outcome. There is no notion of "confounder" or "predictors of interest" because all predictors are potentially useful for predicting the outcome. Also, we often do not care about "how the model works" or telling a detailed story about the predictors. The key goal is to develop a model with good prediction skill and to estimate a reasonable error rate from the data.
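To make the contrast concrete, here is a minimal sketch in base R on simulated data (the variable names and effect sizes are hypothetical, not taken from the data analyzed below): the inferential goal ends with an adjusted coefficient, while the prediction goal ends with an estimated error rate.

```r
## Toy illustration of the two goals on simulated (hypothetical) data:
## inference yields an adjusted coefficient; prediction yields an error rate.
set.seed(42)
n <- 500
z <- rnorm(n)                      # a potential confounder
x <- 0.8 * z + rnorm(n)            # the predictor of interest
y <- 0.3 * x + 1.5 * z + rnorm(n)  # outcome; true association with x is 0.3
d <- data.frame(x, y, z)

## Inferential question: estimate the x-y association, adjusting for z
fit <- lm(y ~ x + z, data = d)
coef(fit)["x"]                     # should land near the true value 0.3

## Prediction question: estimate out-of-sample error with a holdout split
train <- d[1:400, ]
test  <- d[401:500, ]
pred  <- predict(lm(y ~ x + z, data = train), newdata = test)
rmse  <- sqrt(mean((test$y - pred)^2))
rmse                               # near the residual standard deviation of 1
```

In a real inferential analysis the coefficient and its uncertainty are the deliverable; in a real prediction analysis the holdout (or cross-validated) error is, and the coefficients themselves are rarely of interest.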
18 | 19 | 20 | ## Air Pollution and Mortality in New York 21 | 22 | The following example shows how different types of questions and corresponding modeling approaches can lead to different conclusions. The example uses air pollution and mortality data for New York City. The data were originally used as part of the [National Morbidity, Mortality, and Air Pollution Study](http://www.ihapss.jhsph.edu) (NMMAPS). 23 | 24 | 25 | ```{r} 26 | ny <- readRDS("ny.rds") %>% 27 | rename(pm10 = pm10tmean, 28 | no2 = no2tmean) 29 | sub <- select(ny, c(date, death)) %>% 30 | filter(date >= "2001-01-01") %>% 31 | group_by(date) %>% summarize(death = sum(death)) 32 | env <- select(ny, -death) %>% 33 | filter(agecat == "under65" & date >= "2001-01-01") 34 | ny <- merge(sub, env, by = "date") 35 | ny <- mutate(ny, season = factor(quarters(ny$date)), dow = factor(weekdays(ny$date))) 36 | ``` 37 | 38 | Below is a plot of the daily mortality from all causes for the years 2001--2005. 39 | 40 | ```{r,fig.height=4,fig.cap="Daily Mortality in New York City, 2001--2005"} 41 | ny %>% 42 | ggplot(aes(date, death)) + 43 | geom_point() + 44 | ylab("Daily mortality") + 45 | xlab("Date") + 46 | theme_bw() 47 | ``` 48 | 49 | And here is a plot of 24-hour average levels of particulate matter with aerodynamic diameter less than or equal to 10 microns (PM10). 50 | 51 | ```{r,fig.height=4,fig.cap="Daily PM10 in New York City, 2001--2005"} 52 | ny %>% 53 | ggplot(aes(date, pm10)) + 54 | geom_point() + 55 | xlab("Date") + 56 | ylab(expression("24-hr average " * PM[10])) + 57 | theme_bw() 58 | ``` 59 | 60 | Note that there are many fewer points on the plot above than there were on the plot of the mortality data. This is because PM10 is not measured every day. Also note that there are negative values in the PM10 plot--this is because the PM10 data were mean-subtracted. In general, negative values of PM10 are not possible.
61 | 62 | ## Inferring an Association 63 | 64 | The first approach we will take is to ask, "Is there an association between daily 24-hour average PM10 levels and daily mortality?" This is an inferential question and we are attempting to estimate an association. In addition, for this question, we know there are a number of potential confounders that we will have to deal with. 65 | 66 | Let's take a look at the bivariate association between PM10 and mortality. Here is a scatterplot of the two variables. 67 | 68 | ```{r,fig.cap="PM10 and Mortality in New York City"} 69 | qplot(pm10, death, data = ny) + 70 | theme_bw() 71 | ``` 72 | 73 | There doesn't appear to be much going on there, and a simple linear regression model of the log of daily mortality and PM10 seems to confirm that. 74 | 75 | ```{r} 76 | fit0 <- lm(log(death) ~ pm10, data = ny) 77 | summary(fit0)$coefficients %>% 78 | kable(digits = 3) 79 | ``` 80 | 81 | In the table of coefficients above, the coefficient for `pm10` is quite small and its standard error is relatively large. Effectively, this estimate of the association is zero. 82 | 83 | However, we know quite a bit about both PM10 and daily mortality, and one thing we do know is that *season* plays a large role in both variables. In particular, we know that mortality tends to be higher in the winter and lower in the summer. PM10 tends to show the reverse pattern, being higher in the summer and lower in the winter. Because season is related to *both* PM10 and mortality, it is a good candidate for a confounder and it would make sense to adjust for it in the model. 84 | 85 | Here are the results for a second model, which includes both PM10 and season. Season is included as an indicator variable with 4 levels.
86 | 87 | 88 | ```{r} 89 | fit1 <- lm(log(death) ~ season + pm10, data = ny) 90 | summary(fit1)$coefficients %>% 91 | kable(digits = 4) 92 | ``` 93 | 94 | Notice now that the `pm10` coefficient is quite a bit larger than before and its `t value` is large, suggesting a strong association. How is this possible? 95 | 96 | It turns out that we have a classic example of [Simpson's Paradox](https://en.wikipedia.org/wiki/Simpson%27s_paradox) here. The overall relationship between PM10 and mortality is null, but when we account for the seasonal variation in both mortality and PM10, the association is positive. The surprising result comes from the opposite ways in which season is related to mortality and PM10. 97 | 98 | So far we have accounted for season, but there are other potential confounders. In particular, weather variables, such as temperature and dew point temperature, are also both related to PM10 formation and mortality. 99 | 100 | In the following model we include temperature (`tmpd`) and dew point temperature (`dptp`). We also include the `date` variable in case there are any long-term trends that need to be accounted for. 101 | 102 | 103 | ```{r} 104 | fit2 <- lm(log(death) ~ date + season + tmpd + dptp + pm10, data = ny) 105 | summary(fit2)$coefficients %>% 106 | kable(digits = 4) 107 | ``` 108 | 109 | Notice that the `pm10` coefficient is even bigger than it was in the previous model. There appears to still be an association between PM10 and mortality. The effect size is small, but we will discuss that later. 110 | 111 | Finally, another class of potential confounders includes other pollutants. Before we place blame on PM10 as a harmful pollutant, it's important that we examine whether there might be another pollutant that can explain what we're observing. NO2 is a good candidate because it shares some of the same sources as PM10 and is known to be related to mortality. Let's see what happens when we include that in the model.
112 | 113 | ```{r} 114 | fit3 <- lm(log(death) ~ date + season + tmpd + dptp + no2 + pm10, data = ny) 115 | summary(fit3)$coefficients %>% 116 | kable(digits = 4) 117 | ``` 118 | 119 | Notice in the table of coefficients that the `no2` coefficient is similar in magnitude to the `pm10` coefficient, although its `t value` is not as large. The `pm10` coefficient appears to be statistically significant, but it is somewhat smaller in magnitude now. 120 | 121 | Below is a plot of the PM10 coefficient from all four of the models that we tried. 122 | 123 | ```{r,fig.cap="Association Between PM10 and Mortality Under Different Models"} 124 | pm10 = c(coef(fit0)["pm10"], 125 | coef(fit1)["pm10"], 126 | coef(fit2)["pm10"], 127 | coef(fit3)["pm10"]) 128 | ci <- rbind(confint(fit0)["pm10",], 129 | confint(fit1)["pm10",], 130 | confint(fit2)["pm10",], 131 | confint(fit3)["pm10",]) 132 | model <- 1:4 133 | plot(model, pm10, pch = 19, ylim = range(ci), xlab = "Model", ylab = expression(PM[10] * " coefficient"), xaxt = "n") 134 | axis(1, 1:4, as.character(1:4)) 135 | segments(model, ci[, 1], model, ci[, 2], col = "blue") 136 | abline(h = 0, lty = 2) 137 | ``` 138 | 139 | With the exception of Model 1, which did not account for any potential confounders, there appears to be a positive association between PM10 and mortality across Models 2--4. What this means and what we should do about it depends on what our ultimate goal is and we do not discuss that in detail here. It's notable that the effect size is generally small, especially compared to some of the other predictors in the model. However, it's also worth noting that presumably, everyone in New York City breathes, and so a small effect could have a large impact. 140 | 141 | 142 | ## Predicting the Outcome 143 | 144 | Another strategy we could have taken is to ask, "What best predicts mortality in New York City?" This is clearly a prediction question and we can use the data on hand to build a model. 
Here, we will use the [random forests](https://en.wikipedia.org/wiki/Random_forest) modeling strategy, which is a machine learning approach that performs well when there are a large number of predictors. One type of output we can obtain from the random forest procedure is a measure of *variable importance*. Roughly speaking, this measure indicates how important a given variable is to improving the prediction skill of the model. 145 | 146 | Below is a variable importance plot, which is obtained after fitting a random forest model. Larger values on the x-axis indicate greater importance. 147 | 148 | ```{r, fig.cap="Random Forest Variable Importance Plot for Predicting Mortality"} 149 | set.seed(1) 150 | library(randomForest) 151 | rf <- randomForest(death ~ tmpd + dptp + dow + date + season + pm10 + o3tmean + no2, data = ny, na.action = na.omit) 152 | varImpPlot(rf, main = "Variable Importance") 153 | ``` 154 | 155 | Notice that the variable `pm10` comes near the bottom of the list in terms of importance. That is because it does not contribute much to predicting the outcome, mortality. Recall in the previous section that the effect size appeared to be small, meaning that it didn't really explain much variability in mortality. Predictors like temperature and dew point temperature are more useful as predictors of daily mortality. Even NO2 is a better predictor than PM10. 156 | 157 | However, just because PM10 is not a strong predictor of mortality doesn't mean that it does not have a relevant association with mortality. Given the tradeoffs that have to be made when developing a prediction model, PM10 is not high on the list of predictors that we would include--we simply cannot include every predictor. 158 | 159 | ## Frequently Asked Questions 160 | 161 | 162 | 163 | ## Summary 164 | 165 | In any data analysis, you want to ask yourself "Am I asking an inferential question or a prediction question?" 
This should be cleared up *before* any data are analyzed, as the answer to the question can guide the entire modeling strategy. In the example here, if we had decided on a prediction approach, we might have erroneously thought that PM10 was not relevant to mortality. However, the inferential approach suggested a statistically significant association with mortality. Framing the question right, and applying the appropriate modeling strategy, can play a large role in the kinds of conclusions you draw from the data. 166 | -------------------------------------------------------------------------------- /manuscript/inferencepred.md: -------------------------------------------------------------------------------- 1 | # Inference vs. Prediction: Implications for Modeling Strategy 2 | 3 | 4 | 5 | Understanding whether you're answering an inferential question versus a prediction question is an important concept because the type of question you're answering can greatly influence the modeling strategy you pursue. If you do not clearly understand which type of question you are asking, you may end up using the wrong type of modeling approach and ultimately draw the wrong conclusions from your data. The purpose of this chapter is to show you what can happen when you confuse one question for another. 6 | 7 | The key things to remember are 8 | 9 | 1. For **inferential questions** the goal is typically to estimate an association between a predictor of interest and the outcome. There is usually only a handful of predictors of interest (or even just one); however, there are typically many potential confounding variables to consider. The key goal of modeling is to estimate an association while making sure you appropriately adjust for any potential confounders. Often, sensitivity analyses are conducted to see if associations of interest are robust to different sets of confounders. 10 | 11 | 2. For **prediction questions** the goal is to identify a model that *best predicts* the outcome.
Typically we do not place any *a priori* importance on the predictors, so long as they are good at predicting the outcome. There is no notion of "confounder" or "predictors of interest" because all predictors are potentially useful for predicting the outcome. Also, we often do not care about "how the model works" or telling a detailed story about the predictors. The key goal is to develop a model with good prediction skill and to estimate a reasonable error rate from the data. 12 | 13 | 14 | ## Air Pollution and Mortality in New York 15 | 16 | The following example shows how different types of questions and corresponding modeling approaches can lead to different conclusions. The example uses air pollution and mortality data for New York City. The data were originally used as part of the [National Morbidity, Mortality, and Air Pollution Study](http://www.ihapss.jhsph.edu) (NMMAPS). 17 | 18 | 19 | 20 | 21 | Below is a plot of the daily mortality from all causes for the years 2001--2005. 22 | 23 | ![Daily Mortality in New York City, 2001--2005](images/inferencepred-unnamed-chunk-3-1.png) 24 | 25 | And here is a plot of 24-hour average levels of particulate matter with aerodynamic diameter less than or equal to 10 microns (PM10). 26 | 27 | ![Daily PM10 in New York City, 2001--2005](images/inferencepred-unnamed-chunk-4-1.png) 28 | 29 | Note that there are many fewer points on the plot above than there were on the plot of the mortality data. This is because PM10 is not measured every day. Also note that there are negative values in the PM10 plot--this is because the PM10 data were mean-subtracted. In general, negative values of PM10 are not possible. 30 | 31 | ## Inferring an Association 32 | 33 | The first approach we will take is to ask, "Is there an association between daily 24-hour average PM10 levels and daily mortality?" This is an inferential question and we are attempting to estimate an association.
In addition, for this question, we know there are a number of potential confounders that we will have to deal with. 34 | 35 | Let's take a look at the bivariate association between PM10 and mortality. Here is a scatterplot of the two variables. 36 | 37 | ![PM10 and Mortality in New York City](images/inferencepred-unnamed-chunk-5-1.png) 38 | 39 | There doesn't appear to be much going on there, and a simple linear regression model of the log of daily mortality and PM10 seems to confirm that. 40 | 41 | 42 | | | Estimate| Std. Error| t value| Pr(>|t|)| 43 | |:-----------|--------:|----------:|-------:|------------------:| 44 | |(Intercept) | 5.089| 0.007| 733.751| 0.000| 45 | |pm10 | 0.000| 0.001| 0.058| 0.954| 46 | 47 | In the table of coefficients above, the coefficient for `pm10` is quite small and its standard error is relatively large. Effectively, this estimate of the association is zero. 48 | 49 | However, we know quite a bit about both PM10 and daily mortality, and one thing we do know is that *season* plays a large role in both variables. In particular, we know that mortality tends to be higher in the winter and lower in the summer. PM10 tends to show the reverse pattern, being higher in the summer and lower in the winter. Because season is related to *both* PM10 and mortality, it is a good candidate for a confounder and it would make sense to adjust for it in the model. 50 | 51 | Here are the results for a second model, which includes both PM10 and season. Season is included as an indicator variable with 4 levels. 52 | 53 | 54 | 55 | | | Estimate| Std. 
Error| t value| Pr(>|t|)| 56 | |:-----------|--------:|----------:|--------:|------------------:| 57 | |(Intercept) | 5.1665| 0.0113| 458.7149| 0.0000| 58 | |seasonQ2 | -0.1093| 0.0167| -6.5470| 0.0000| 59 | |seasonQ3 | -0.1555| 0.0170| -9.1618| 0.0000| 60 | |seasonQ4 | -0.0603| 0.0167| -3.6077| 0.0004| 61 | |pm10 | 0.0015| 0.0006| 2.4348| 0.0156| 62 | 63 | Notice now that the `pm10` coefficient is quite a bit larger than before and its `t value` is large, suggesting a strong association. How is this possible? 64 | 65 | It turns out that we have a classic example of [Simpson's Paradox](https://en.wikipedia.org/wiki/Simpson%27s_paradox) here. The overall relationship between PM10 and mortality is null, but when we account for the seasonal variation in both mortality and PM10, the association is positive. The surprising result comes from the opposite ways in which season is related to mortality and PM10. 66 | 67 | So far we have accounted for season, but there are other potential confounders. In particular, weather variables, such as temperature and dew point temperature, are also both related to PM10 formation and mortality. 68 | 69 | In the following model we include temperature (`tmpd`) and dew point temperature (`dptp`). We also include the `date` variable in case there are any long-term trends that need to be accounted for. 70 | 71 | 72 | 73 | | | Estimate| Std. Error| t value| Pr(>|t|)| 74 | |:-----------|--------:|----------:|-------:|------------------:| 75 | |(Intercept) | 5.6207| 0.1647| 34.1242| 0.0000| 76 | |date | 0.0000| 0.0000| -2.2690| 0.0241| 77 | |seasonQ2 | -0.0581| 0.0230| -2.5250| 0.0122| 78 | |seasonQ3 | -0.0766| 0.0290| -2.6361| 0.0089| 79 | |seasonQ4 | -0.0315| 0.0183| -1.7213| 0.0864| 80 | |tmpd | -0.0030| 0.0013| -2.2970| 0.0224| 81 | |dptp | 0.0007| 0.0010| 0.6604| 0.5096| 82 | |pm10 | 0.0024| 0.0007| 3.5995| 0.0004| 83 | 84 | Notice that the `pm10` coefficient is even bigger than it was in the previous model.
There appears to still be an association between PM10 and mortality. The effect size is small, but we will discuss that later. 85 | 86 | Finally, another class of potential confounders includes other pollutants. Before we place blame on PM10 as a harmful pollutant, it's important that we examine whether there might be another pollutant that can explain what we're observing. NO2 is a good candidate because it shares some of the same sources as PM10 and is known to be related to mortality. Let's see what happens when we include that in the model. 87 | 88 | 89 | | | Estimate| Std. Error| t value| Pr(>|t|)| 90 | |:-----------|--------:|----------:|-------:|------------------:| 91 | |(Intercept) | 5.6138| 0.1644| 34.1465| 0.0000| 92 | |date | 0.0000| 0.0000| -2.2660| 0.0243| 93 | |seasonQ2 | -0.0514| 0.0234| -2.2001| 0.0287| 94 | |seasonQ3 | -0.0657| 0.0299| -2.1967| 0.0290| 95 | |seasonQ4 | -0.0275| 0.0185| -1.4874| 0.1382| 96 | |tmpd | -0.0030| 0.0013| -2.3092| 0.0217| 97 | |dptp | 0.0007| 0.0010| 0.6809| 0.4966| 98 | |no2 | 0.0013| 0.0009| 1.4677| 0.1434| 99 | |pm10 | 0.0017| 0.0008| 2.2209| 0.0273| 100 | 101 | Notice in the table of coefficients that the `no2` coefficient is similar in magnitude to the `pm10` coefficient, although its `t value` is not as large. The `pm10` coefficient appears to be statistically significant, but it is somewhat smaller in magnitude now. 102 | 103 | Below is a plot of the PM10 coefficient from all four of the models that we tried. 104 | 105 | ![Association Between PM10 and Mortality Under Different Models](images/inferencepred-unnamed-chunk-10-1.png) 106 | 107 | With the exception of Model 1, which did not account for any potential confounders, there appears to be a positive association between PM10 and mortality across Models 2--4. What this means and what we should do about it depends on what our ultimate goal is and we do not discuss that in detail here. 
It's notable that the effect size is generally small, especially compared to some of the other predictors in the model. However, it's also worth noting that presumably, everyone in New York City breathes, and so a small effect could have a large impact. 108 | 109 | 110 | ## Predicting the Outcome 111 | 112 | Another strategy we could have taken is to ask, "What best predicts mortality in New York City?" This is clearly a prediction question and we can use the data on hand to build a model. Here, we will use the [random forests](https://en.wikipedia.org/wiki/Random_forest) modeling strategy, which is a machine learning approach that performs well when there are a large number of predictors. One type of output we can obtain from the random forest procedure is a measure of *variable importance*. Roughly speaking, this measure indicates how important a given variable is to improving the prediction skill of the model. 113 | 114 | Below is a variable importance plot, which is obtained after fitting a random forest model. Larger values on the x-axis indicate greater importance. 115 | 116 | ![Random Forest Variable Importance Plot for Predicting Mortality](images/inferencepred-unnamed-chunk-11-1.png) 117 | 118 | Notice that the variable `pm10` comes near the bottom of the list in terms of importance. That is because it does not contribute much to predicting the outcome, mortality. Recall in the previous section that the effect size appeared to be small, meaning that it didn't really explain much variability in mortality. Predictors like temperature and dew point temperature are more useful as predictors of daily mortality. Even NO2 is a better predictor than PM10. 119 | 120 | However, just because PM10 is not a strong predictor of mortality doesn't mean that it does not have a relevant association with mortality. 
Given the tradeoffs that have to be made when developing a prediction model, PM10 is not high on the list of predictors that we would include--we simply cannot include every predictor. 121 | 122 | ## Frequently Asked Questions 123 | 124 | 125 | 126 | ## Summary 127 | 128 | In any data analysis, you want to ask yourself "Am I asking an inferential question or a prediction question?" This should be cleared up *before* any data are analyzed, as the answer to the question can guide the entire modeling strategy. In the example here, if we had decided on a prediction approach, we might have erroneously thought that PM10 was not relevant to mortality. However, the inferential approach suggested a statistically significant association with mortality. Framing the question right, and applying the appropriate modeling strategy, can play a large role in the kinds of conclusions you draw from the data. 129 | -------------------------------------------------------------------------------- /manuscript/listPackages.R: -------------------------------------------------------------------------------- 1 | library(dplyr, warn.conflicts = FALSE) 2 | local({ 3 | files <- dir(pattern = glob2rx("*.Rmd")) 4 | code <- lapply(files, readLines) %>% unlist 5 | r <- regexec("^library\\((.*?)\\)", code, perl = TRUE) 6 | m <- regmatches(code, r) 7 | u <- sapply(m, length) > 0 8 | pkgs <- sapply(m[u], function(x) x[2]) %>% unique 9 | 10 | int <- installed.packages()[, 1] 11 | need <- setdiff(pkgs, int) 12 | if(length(need) > 0L) { 13 | install.packages(need, dependencies = TRUE) 14 | } 15 | }) 16 | 17 | -------------------------------------------------------------------------------- /manuscript/mainmatter.txt: -------------------------------------------------------------------------------- 1 | 2 | {mainmatter} 3 | 4 | 5 | -------------------------------------------------------------------------------- /manuscript/ny.rds: 
-------------------------------------------------------------------------------- https://raw.githubusercontent.com/rdpeng/artofdatascience/095fcf05885181cbca6d245f463e9cf3dfe5543e/manuscript/ny.rds -------------------------------------------------------------------------------- /manuscript/o3_ny_1999.csv: -------------------------------------------------------------------------------- 1 | "ozone","temp","month" 2 | 25.372619048,55.333333333,5 3 | 32.833333333,57.666666667,5 4 | 28.886672851,56.666666667,5 5 | 12.068537415,56.666666667,5 6 | 11.219202899,63.666666667,5 7 | 13.191102757,60,5 8 | 8.1708074534,56.666666667,5 9 | 8.2391304348,58,5 10 | 18.136904762,64,5 11 | 23.769762846,64,5 12 | 20.920918367,59.333333333,5 13 | 18.736632342,62.666666667,5 14 | 29.316558442,59.333333333,5 15 | 20.001253133,56,5 16 | 17.196428571,59,5 17 | 22.410714286,58.333333333,5 18 | 22.04333751,61.333333333,5 19 | 13.679679448,60.333333333,5 20 | 12.417857143,62.666666667,5 21 | 26.114035088,65.333333333,5 22 | 24,65,5 23 | 29.226190476,67.666666667,5 24 | 34.946428571,61.333333333,5 25 | 11.762631416,62,5 26 | 21.538419913,62.666666667,5 27 | 24.452380952,66,5 28 | 20.022619048,64.333333333,5 29 | 25.660714286,69.666666667,5 30 | 34.970238095,74.666666667,5 31 | 35.724789916,75.666666667,5 32 | 41.840277778,75.333333333,5 33 | 40.212770563,75.666666667,6 34 | 29.338065661,78.666666667,6 35 | 35.386038961,75.666666667,6 36 | 28.394557823,70.333333333,6 37 | 26.229166667,68.333333333,6 38 | 25.194444444,68,6 39 | 49.992064612,82.333333333,6 40 | 45.337060714,85,6 41 | 35.970031056,75.333333333,6 42 | 24.278215832,66.333333333,6 43 | 23.703970719,64.666666667,6 44 | 18.726190476,67,6 45 | 19.25,73,6 46 | 26.126391466,72.666666667,6 47 | 32.656617865,74.333333333,6 48 | 26.359693878,66,6 49 | 15.294372294,63,6 50 | 21.034722222,67,6 51 | 29.131944444,68.666666667,6 52 | 33.041666667,67.333333333,6 53 | 14.768055556,63.666666667,6 54 | 27.296717172,70.666666667,6 55 | 
42.381944444,78.333333333,6 56 | 26.034722222,77,6 57 | 23.783730159,73.333333333,6 58 | 43.583333333,82,6 59 | 24.854166667,81.333333333,6 60 | 22.416666667,81,6 61 | 24.596242707,81,6 62 | 24.690277778,74.666666667,6 63 | 13.698863636,73.333333333,7 64 | 29.674603175,79,7 65 | 43.011645963,80,7 66 | 36.891045549,88,7 67 | 37.19047619,92,7 68 | 49.558148088,92,7 69 | 38.09047619,84.333333333,7 70 | 27.976708075,80.333333333,7 71 | 32.326388889,77.666666667,7 72 | 41.715277778,77,7 73 | 25.881944444,72.333333333,7 74 | 18.929112554,71,7 75 | 25.392934447,70.333333333,7 76 | 14.625,70,7 77 | 38.138888889,75.666666667,7 78 | 58.951388889,83.333333333,7 79 | 56.972222222,85.666666667,7 80 | 59.256944444,87,7 81 | 37.991071429,85.666666667,7 82 | 28.148611111,78,7 83 | 19.201785714,76,7 84 | 18.6125,77,7 85 | 39.091666667,80.333333333,7 86 | 56.6,85.666666667,7 87 | 47.258333333,86.333333333,7 88 | 47.528333333,83.666666667,7 89 | 37.238839482,85,7 90 | 34.192550505,85.666666667,7 91 | 37.409722222,82,7 92 | 52.201388889,81.666666667,7 93 | 52.825483092,81,7 94 | 44.041666667,88.333333333,8 95 | 35.085912698,81.333333333,8 96 | 33.520833333,78,8 97 | 33.60736715,78.333333333,8 98 | 27.639880952,78.333333333,8 99 | 36.5,78.666666667,8 100 | 46.430555556,78,8 101 | 47.9375,78,8 102 | 25.966540404,74,8 103 | 21.493055556,70.666666667,8 104 | 34.479166667,77.333333333,8 105 | 41.847222222,80.333333333,8 106 | 23.538149351,80.666666667,8 107 | 42.680555556,78,8 108 | 19.680555556,75,8 109 | 22.78321256,75.666666667,8 110 | 30.319444444,78.666666667,8 111 | 33.35218254,81,8 112 | 25.052020202,75.333333333,8 113 | 19.134920635,68.333333333,8 114 | 16.243055556,62,8 115 | 14.583333333,66.666666667,8 116 | 20.942424242,74,8 117 | 26.379761905,77,8 118 | 30.110257074,74.666666667,8 119 | 19.020833333,74.666666667,8 120 | 28.152777778,76,8 121 | 34.222222222,80,8 122 | 35.4375,77.333333333,8 123 | 13.243686869,65.666666667,8 124 | 20.573412698,69.333333333,8 125 | 
18.98989899,70.666666667,9 126 | 17.680555556,73.666666667,9 127 | 10.381313131,72.666666667,9 128 | 10.576388889,74.666666667,9 129 | 14.409722222,75.333333333,9 130 | 17.944444444,78.666666667,9 131 | 23.400252525,76.666666667,9 132 | 21.131944444,76.333333333,9 133 | 21.240530303,76.333333333,9 134 | 8.586489899,73,9 135 | 25.194444444,71.666666667,9 136 | 28.034722222,71.333333333,9 137 | 28.536976912,70.666666667,9 138 | 16.832976504,70.666666667,9 139 | 7.5277777778,69.666666667,9 140 | 18.360119048,67.666666667,9 141 | 19.878968254,66.333333333,9 142 | 15.611111111,66,9 143 | 26.868055556,66.666666667,9 144 | 25.992424242,68,9 145 | 10.599240402,61.333333333,9 146 | 11.363707729,55.333333333,9 147 | 14.961251167,62,9 148 | 22.258333333,67.333333333,9 149 | 25.433333333,69,9 150 | 14.775,62,9 151 | 14.041883117,66,9 152 | 26.199494949,71.333333333,9 153 | 29.169082126,71,9 154 | 20.188888889,65,9 155 | 25.372619048,55.333333333,5 156 | 32.833333333,57.666666667,5 157 | 28.886672851,56.666666667,5 158 | 12.068537415,56.666666667,5 159 | 11.219202899,63.666666667,5 160 | 13.191102757,60,5 161 | 8.1708074534,56.666666667,5 162 | 8.2391304348,58,5 163 | 18.136904762,64,5 164 | 23.769762846,64,5 165 | 20.920918367,59.333333333,5 166 | 18.736632342,62.666666667,5 167 | 29.316558442,59.333333333,5 168 | 20.001253133,56,5 169 | 17.196428571,59,5 170 | 22.410714286,58.333333333,5 171 | 22.04333751,61.333333333,5 172 | 13.679679448,60.333333333,5 173 | 12.417857143,62.666666667,5 174 | 26.114035088,65.333333333,5 175 | 24,65,5 176 | 29.226190476,67.666666667,5 177 | 34.946428571,61.333333333,5 178 | 11.762631416,62,5 179 | 21.538419913,62.666666667,5 180 | 24.452380952,66,5 181 | 20.022619048,64.333333333,5 182 | 25.660714286,69.666666667,5 183 | 34.970238095,74.666666667,5 184 | 35.724789916,75.666666667,5 185 | 41.840277778,75.333333333,5 186 | 40.212770563,75.666666667,6 187 | 29.338065661,78.666666667,6 188 | 35.386038961,75.666666667,6 189 | 
28.394557823,70.333333333,6 190 | 26.229166667,68.333333333,6 191 | 25.194444444,68,6 192 | 49.992064612,82.333333333,6 193 | 45.337060714,85,6 194 | 35.970031056,75.333333333,6 195 | 24.278215832,66.333333333,6 196 | 23.703970719,64.666666667,6 197 | 18.726190476,67,6 198 | 19.25,73,6 199 | 26.126391466,72.666666667,6 200 | 32.656617865,74.333333333,6 201 | 26.359693878,66,6 202 | 15.294372294,63,6 203 | 21.034722222,67,6 204 | 29.131944444,68.666666667,6 205 | 33.041666667,67.333333333,6 206 | 14.768055556,63.666666667,6 207 | 27.296717172,70.666666667,6 208 | 42.381944444,78.333333333,6 209 | 26.034722222,77,6 210 | 23.783730159,73.333333333,6 211 | 43.583333333,82,6 212 | 24.854166667,81.333333333,6 213 | 22.416666667,81,6 214 | 24.596242707,81,6 215 | 24.690277778,74.666666667,6 216 | 13.698863636,73.333333333,7 217 | 29.674603175,79,7 218 | 43.011645963,80,7 219 | 36.891045549,88,7 220 | 37.19047619,92,7 221 | 49.558148088,92,7 222 | 38.09047619,84.333333333,7 223 | 27.976708075,80.333333333,7 224 | 32.326388889,77.666666667,7 225 | 41.715277778,77,7 226 | 25.881944444,72.333333333,7 227 | 18.929112554,71,7 228 | 25.392934447,70.333333333,7 229 | 14.625,70,7 230 | 38.138888889,75.666666667,7 231 | 58.951388889,83.333333333,7 232 | 56.972222222,85.666666667,7 233 | 59.256944444,87,7 234 | 37.991071429,85.666666667,7 235 | 28.148611111,78,7 236 | 19.201785714,76,7 237 | 18.6125,77,7 238 | 39.091666667,80.333333333,7 239 | 56.6,85.666666667,7 240 | 47.258333333,86.333333333,7 241 | 47.528333333,83.666666667,7 242 | 37.238839482,85,7 243 | 34.192550505,85.666666667,7 244 | 37.409722222,82,7 245 | 52.201388889,81.666666667,7 246 | 52.825483092,81,7 247 | 44.041666667,88.333333333,8 248 | 35.085912698,81.333333333,8 249 | 33.520833333,78,8 250 | 33.60736715,78.333333333,8 251 | 27.639880952,78.333333333,8 252 | 36.5,78.666666667,8 253 | 46.430555556,78,8 254 | 47.9375,78,8 255 | 25.966540404,74,8 256 | 21.493055556,70.666666667,8 257 | 34.479166667,77.333333333,8 
258 | 41.847222222,80.333333333,8 259 | 23.538149351,80.666666667,8 260 | 42.680555556,78,8 261 | 19.680555556,75,8 262 | 22.78321256,75.666666667,8 263 | 30.319444444,78.666666667,8 264 | 33.35218254,81,8 265 | 25.052020202,75.333333333,8 266 | 19.134920635,68.333333333,8 267 | 16.243055556,62,8 268 | 14.583333333,66.666666667,8 269 | 20.942424242,74,8 270 | 26.379761905,77,8 271 | 30.110257074,74.666666667,8 272 | 19.020833333,74.666666667,8 273 | 28.152777778,76,8 274 | 34.222222222,80,8 275 | 35.4375,77.333333333,8 276 | 13.243686869,65.666666667,8 277 | 20.573412698,69.333333333,8 278 | 18.98989899,70.666666667,9 279 | 17.680555556,73.666666667,9 280 | 10.381313131,72.666666667,9 281 | 10.576388889,74.666666667,9 282 | 14.409722222,75.333333333,9 283 | 17.944444444,78.666666667,9 284 | 23.400252525,76.666666667,9 285 | 21.131944444,76.333333333,9 286 | 21.240530303,76.333333333,9 287 | 8.586489899,73,9 288 | 25.194444444,71.666666667,9 289 | 28.034722222,71.333333333,9 290 | 28.536976912,70.666666667,9 291 | 16.832976504,70.666666667,9 292 | 7.5277777778,69.666666667,9 293 | 18.360119048,67.666666667,9 294 | 19.878968254,66.333333333,9 295 | 15.611111111,66,9 296 | 26.868055556,66.666666667,9 297 | 25.992424242,68,9 298 | 10.599240402,61.333333333,9 299 | 11.363707729,55.333333333,9 300 | 14.961251167,62,9 301 | 22.258333333,67.333333333,9 302 | 25.433333333,69,9 303 | 14.775,62,9 304 | 14.041883117,66,9 305 | 26.199494949,71.333333333,9 306 | 29.169082126,71,9 307 | 20.188888889,65,9 308 | 25.372619048,55.333333333,5 309 | 32.833333333,57.666666667,5 310 | 28.886672851,56.666666667,5 311 | 12.068537415,56.666666667,5 312 | 11.219202899,63.666666667,5 313 | 13.191102757,60,5 314 | 8.1708074534,56.666666667,5 315 | 8.2391304348,58,5 316 | 18.136904762,64,5 317 | 23.769762846,64,5 318 | 20.920918367,59.333333333,5 319 | 18.736632342,62.666666667,5 320 | 29.316558442,59.333333333,5 321 | 20.001253133,56,5 322 | 17.196428571,59,5 323 | 22.410714286,58.333333333,5 
324 | 22.04333751,61.333333333,5 325 | 13.679679448,60.333333333,5 326 | 12.417857143,62.666666667,5 327 | 26.114035088,65.333333333,5 328 | 24,65,5 329 | 29.226190476,67.666666667,5 330 | 34.946428571,61.333333333,5 331 | 11.762631416,62,5 332 | 21.538419913,62.666666667,5 333 | 24.452380952,66,5 334 | 20.022619048,64.333333333,5 335 | 25.660714286,69.666666667,5 336 | 34.970238095,74.666666667,5 337 | 35.724789916,75.666666667,5 338 | 41.840277778,75.333333333,5 339 | 40.212770563,75.666666667,6 340 | 29.338065661,78.666666667,6 341 | 35.386038961,75.666666667,6 342 | 28.394557823,70.333333333,6 343 | 26.229166667,68.333333333,6 344 | 25.194444444,68,6 345 | 49.992064612,82.333333333,6 346 | 45.337060714,85,6 347 | 35.970031056,75.333333333,6 348 | 24.278215832,66.333333333,6 349 | 23.703970719,64.666666667,6 350 | 18.726190476,67,6 351 | 19.25,73,6 352 | 26.126391466,72.666666667,6 353 | 32.656617865,74.333333333,6 354 | 26.359693878,66,6 355 | 15.294372294,63,6 356 | 21.034722222,67,6 357 | 29.131944444,68.666666667,6 358 | 33.041666667,67.333333333,6 359 | 14.768055556,63.666666667,6 360 | 27.296717172,70.666666667,6 361 | 42.381944444,78.333333333,6 362 | 26.034722222,77,6 363 | 23.783730159,73.333333333,6 364 | 43.583333333,82,6 365 | 24.854166667,81.333333333,6 366 | 22.416666667,81,6 367 | 24.596242707,81,6 368 | 24.690277778,74.666666667,6 369 | 13.698863636,73.333333333,7 370 | 29.674603175,79,7 371 | 43.011645963,80,7 372 | 36.891045549,88,7 373 | 37.19047619,92,7 374 | 49.558148088,92,7 375 | 38.09047619,84.333333333,7 376 | 27.976708075,80.333333333,7 377 | 32.326388889,77.666666667,7 378 | 41.715277778,77,7 379 | 25.881944444,72.333333333,7 380 | 18.929112554,71,7 381 | 25.392934447,70.333333333,7 382 | 14.625,70,7 383 | 38.138888889,75.666666667,7 384 | 58.951388889,83.333333333,7 385 | 56.972222222,85.666666667,7 386 | 59.256944444,87,7 387 | 37.991071429,85.666666667,7 388 | 28.148611111,78,7 389 | 19.201785714,76,7 390 | 18.6125,77,7 391 | 
39.091666667,80.333333333,7 392 | 56.6,85.666666667,7 393 | 47.258333333,86.333333333,7 394 | 47.528333333,83.666666667,7 395 | 37.238839482,85,7 396 | 34.192550505,85.666666667,7 397 | 37.409722222,82,7 398 | 52.201388889,81.666666667,7 399 | 52.825483092,81,7 400 | 44.041666667,88.333333333,8 401 | 35.085912698,81.333333333,8 402 | 33.520833333,78,8 403 | 33.60736715,78.333333333,8 404 | 27.639880952,78.333333333,8 405 | 36.5,78.666666667,8 406 | 46.430555556,78,8 407 | 47.9375,78,8 408 | 25.966540404,74,8 409 | 21.493055556,70.666666667,8 410 | 34.479166667,77.333333333,8 411 | 41.847222222,80.333333333,8 412 | 23.538149351,80.666666667,8 413 | 42.680555556,78,8 414 | 19.680555556,75,8 415 | 22.78321256,75.666666667,8 416 | 30.319444444,78.666666667,8 417 | 33.35218254,81,8 418 | 25.052020202,75.333333333,8 419 | 19.134920635,68.333333333,8 420 | 16.243055556,62,8 421 | 14.583333333,66.666666667,8 422 | 20.942424242,74,8 423 | 26.379761905,77,8 424 | 30.110257074,74.666666667,8 425 | 19.020833333,74.666666667,8 426 | 28.152777778,76,8 427 | 34.222222222,80,8 428 | 35.4375,77.333333333,8 429 | 13.243686869,65.666666667,8 430 | 20.573412698,69.333333333,8 431 | 18.98989899,70.666666667,9 432 | 17.680555556,73.666666667,9 433 | 10.381313131,72.666666667,9 434 | 10.576388889,74.666666667,9 435 | 14.409722222,75.333333333,9 436 | 17.944444444,78.666666667,9 437 | 23.400252525,76.666666667,9 438 | 21.131944444,76.333333333,9 439 | 21.240530303,76.333333333,9 440 | 8.586489899,73,9 441 | 25.194444444,71.666666667,9 442 | 28.034722222,71.333333333,9 443 | 28.536976912,70.666666667,9 444 | 16.832976504,70.666666667,9 445 | 7.5277777778,69.666666667,9 446 | 18.360119048,67.666666667,9 447 | 19.878968254,66.333333333,9 448 | 15.611111111,66,9 449 | 26.868055556,66.666666667,9 450 | 25.992424242,68,9 451 | 10.599240402,61.333333333,9 452 | 11.363707729,55.333333333,9 453 | 14.961251167,62,9 454 | 22.258333333,67.333333333,9 455 | 25.433333333,69,9 456 | 14.775,62,9 457 | 
14.041883117,66,9 458 | 26.199494949,71.333333333,9 459 | 29.169082126,71,9 460 | 20.188888889,65,9 461 | -------------------------------------------------------------------------------- /manuscript/preface.md: -------------------------------------------------------------------------------- 1 | --- 2 | output: pdf_document 3 | --- 4 | # Data Analysis as Art 5 | 6 | Data analysis is hard, and part of the problem is that few people can explain how to do it. It's not that there aren't any people doing data analysis on a regular basis. It's that the people who are really good at it have yet to enlighten us about the thought process that goes on in their heads. 7 | 8 | Imagine you were to ask a songwriter how she writes her songs. There are many tools upon which she can draw. We have a general understanding of how a good song should be structured: how long it should be, how many verses, maybe there's a verse followed by a chorus, etc. In other words, there's an abstract framework for songs in general. Similarly, we have music theory that tells us that certain combinations of notes and chords work well together and other combinations don't sound good. As good as these tools might be, ultimately, knowledge of song structure and music theory alone doesn't make for a good song. Something else is needed. 9 | 10 | In his legendary 1974 essay [*Computer Programming as an Art*](http://www.paulgraham.com/knuth.html), Donald Knuth talks about the difference between art and science. In that essay, he was trying to get across the idea that although computer programming involved complex machines and very technical knowledge, the act of writing a computer program had an artistic component. There, he says that 11 | 12 | > Science is knowledge which we understand so well that we can teach it to a computer. 13 | 14 | Everything else is art.
15 | 16 | At some point, the songwriter must inject a creative spark into the process to bring all the songwriting tools together to make something that people want to listen to. This is a key part of the *art* of songwriting. That creative spark is difficult to describe, much less write down, but it's clearly essential to writing good songs. If it weren't, then we'd have computer programs regularly writing hit songs. For better or for worse, that hasn't happened yet. 17 | 18 | Much like songwriting (and computer programming, for that matter), it's important to realize that *data analysis is an art*. It is not yet something that we can teach to a computer. Data analysts have many *tools* at their disposal, from linear regression to classification trees and even deep learning, and these tools have all been carefully taught to computers. But ultimately, a data analyst must find a way to assemble all of the tools and apply them to data to answer a relevant question---a question of interest to people. 19 | 20 | Unfortunately, the process of data analysis is not one that we have been able to write down effectively. It's true that there are many statistics textbooks out there, many lining our own shelves. But in our opinion, none of these really addresses the core problems involved in conducting real-world data analyses. In 1991, Daryl Pregibon, a prominent statistician previously of AT&T Research and now of Google, [said in reference to the process of data analysis](http://www.nap.edu/catalog/1910/the-future-of-statistical-software-proceedings-of-a-forum) that "statisticians have a process that they espouse but do not fully understand". 21 | 22 | Describing data analysis presents a difficult conundrum. On the one hand, developing a useful framework involves characterizing the elements of a data analysis using abstract language in order to find the commonalities across different kinds of analyses. Sometimes, this language is the language of mathematics.
On the other hand, it is often the very details of an analysis that make each one so difficult and yet interesting. How can one effectively generalize across many different data analyses, each of which has important unique aspects? 23 | 24 | What we have set out to do in this book is to write down the process of data analysis. What we describe is not a specific "formula" for data analysis---something like "apply this method and then run that test"---but rather a general process that can be applied in a variety of situations. Through our extensive experience both managing data analysts and conducting our own data analyses, we have carefully observed what produces coherent results and what fails to produce useful insights into data. Our goal is to write down what we have learned in the hopes that others may find it useful. 25 | -------------------------------------------------------------------------------- /preview/artofdatascience-preview.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/rdpeng/artofdatascience/095fcf05885181cbca6d245f463e9cf3dfe5543e/preview/artofdatascience-preview.pdf -------------------------------------------------------------------------------- /problems/figures/unnamed-chunk-2-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/rdpeng/artofdatascience/095fcf05885181cbca6d245f463e9cf3dfe5543e/problems/figures/unnamed-chunk-2-1.png -------------------------------------------------------------------------------- /problems/figures/unnamed-chunk-3-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/rdpeng/artofdatascience/095fcf05885181cbca6d245f463e9cf3dfe5543e/problems/figures/unnamed-chunk-3-1.png -------------------------------------------------------------------------------- /problems/figures/unnamed-chunk-4-1.png:
-------------------------------------------------------------------------------- https://raw.githubusercontent.com/rdpeng/artofdatascience/095fcf05885181cbca6d245f463e9cf3dfe5543e/problems/figures/unnamed-chunk-4-1.png -------------------------------------------------------------------------------- /problems/figures/unnamed-chunk-5-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/rdpeng/artofdatascience/095fcf05885181cbca6d245f463e9cf3dfe5543e/problems/figures/unnamed-chunk-5-1.png -------------------------------------------------------------------------------- /problems/problems.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "Problems" 3 | output: 4 | html_document: 5 | keep_md: yes 6 | --- 7 | 8 | ```{r,echo=FALSE} 9 | knitr::opts_chunk$set(echo = FALSE, fig.path = "figures/") 10 | ``` 11 | 12 | ## Question 13 | 14 | 15 | A data analyst who works for you is looking at data from 100 individuals on exposure to a potentially harmful substance and the levels of a blood biomarker measured on each individual. In this case, higher levels of the biomarker indicate greater harm from the substance. The analyst shows you the following scatterplot of each individual's biomarker level vs. the dosage of the substance that individual experienced. 16 | 17 | ```{r} 18 | set.seed(2) 19 | x <- rnorm(100, 10) 20 | y <- x - (x - 10)^2 + rnorm(100) 21 | library(ggplot2) 22 | qplot(x, y, xlab = "Dosage", ylab = "Response") + geom_smooth(method = "lm") 23 | ``` 24 | 25 | 26 | ## Question 27 | 28 | You are conducting an associational analysis and attempting to estimate the association between soda consumption in the U.S. and body mass index (BMI). A scatterplot of the data from 100 individuals is shown below. 29 | 30 | What is the directionality of the association depicted in this plot?
31 | 32 | ```{r} 33 | set.seed(5) 34 | x <- rpois(100, 2) 35 | y <- 25 + 2 * x + rnorm(100, sd = 3) 36 | qplot(x, y, xlab = "# sodas per day", ylab = "BMI") + geom_smooth(method = "lm") 37 | ``` 38 | 39 | 40 | ## Question 41 | 42 | 43 | You are conducting an associational analysis and attempting to estimate the association between soda consumption in the U.S. and body mass index (BMI). As part of a sensitivity analysis, you fit a series of 5 different models, each of which produces a different estimate of the association, which is the change in BMI per 12 oz can of soda consumed daily. A plot of the estimates from the 5 models is shown below. 44 | 45 | From the plot, which of the following properties is 46 | 47 | ```{r} 48 | set.seed(3) 49 | x <- 1:5 50 | y <- rnorm(5, 2, 1) 51 | se <- 0.5 52 | lo <- y - 1.96 * se 53 | hi <- y + 1.96 * se 54 | par(mar = c(5, 4, 2, 1)) 55 | plot(x, y, ylim = range(lo, hi, 0), pch = 19, xaxt = "n") 56 | axis(1, 1:5, paste("Model", 1:5)) 57 | abline(h = 0, lty = 2) 58 | segments(x, lo, x, hi) 59 | ``` 60 | 61 | ## Question 62 | 63 | You are building a web site and are A/B testing two different designs for the site. You want to know which designs lead to people taking a certain action on your web site (e.g. buying a product, clicking a button). The tests for the different designs are conducted separately over time because you do not want there to be two different versions of the web site available at any given time. Therefore, one issue may be the presence of time trends that can confound the relationship between a given design and the number of actions taken. The target of estimation is the average difference in actions taken on the web site per visit between design A and design B. 64 | 65 | In your formal modeling, you fit a series of models to account for possible temporal trends in different ways. The results of your primary model and 4 secondary models are shown below. 66 | 67 | What could you conclude from these results?
68 | 69 | ```{r} 70 | set.seed(1000) 71 | x <- 1:5 72 | y <- rnorm(5, 0, 5) 73 | se <- 0.5 * 5 74 | lo <- y - 1.96 * se 75 | hi <- y + 1.96 * se 76 | par(mar = c(5, 4, 2, 1)) 77 | plot(x, y, ylim = range(lo, hi, 0), pch = 19, xaxt = "n", ylab = "Design A - Design B") 78 | axis(1, 1:5, c("Primary", paste("Secondary", 1:4))) 79 | abline(h = 0, lty = 2) 80 | segments(x, lo, x, hi) 81 | ``` 82 | 83 | 84 | 85 | 86 | -------------------------------------------------------------------------------- /problems/problems.md: -------------------------------------------------------------------------------- 1 | # Problems 2 | 3 | 4 | 5 | ## Question 6 | 7 | 8 | A data analyst who works for you is looking at data from 100 individuals on exposure to a potentially harmful substance and the levels of a blood biomarker measured on each individual. In this case, higher levels of the biomarker indicate greater harm from the substance. The analyst shows you the following scatterplot of each individual's biomarker level vs. the dosage of the substance that individual experienced. 9 | 10 | ![](figures/unnamed-chunk-2-1.png) 11 | 12 | 13 | ## Question 14 | 15 | You are conducting an associational analysis and attempting to estimate the association between soda consumption in the U.S. and body mass index (BMI). A scatterplot of the data from 100 individuals is shown below. 16 | 17 | What is the directionality of the association depicted in this plot? 18 | 19 | ![](figures/unnamed-chunk-3-1.png) 20 | 21 | 22 | ## Question 23 | 24 | 25 | You are conducting an associational analysis and attempting to estimate the association between soda consumption in the U.S. and body mass index (BMI). As part of a sensitivity analysis, you fit a series of 5 different models, each of which produces a different estimate of the association, which is the change in BMI per 12 oz can of soda consumed daily. A plot of the estimates from the 5 models is shown below.
26 | 27 | From the plot, which of the following properties is 28 | 29 | ![](figures/unnamed-chunk-4-1.png) 30 | 31 | ## Question 32 | 33 | You are building a web site and are A/B testing two different designs for the site. You want to know which designs lead to people taking a certain action on your web site (e.g. buying a product, clicking a button). The tests for the different designs are conducted separately over time because you do not want there to be two different versions of the web site available at any given time. Therefore, one issue may be the presence of time trends that can confound the relationship between a given design and the number of actions taken. The target of estimation is the average difference in actions taken on the web site per visit between design A and design B. 34 | 35 | In your formal modeling, you fit a series of models to account for possible temporal trends in different ways. The results of your primary model and 4 secondary models are shown below. 36 | 37 | What could you conclude from these results? 38 | 39 | ![](figures/unnamed-chunk-5-1.png) 40 | 41 | 42 | 43 | 44 | -------------------------------------------------------------------------------- /published/artofdatascience.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/rdpeng/artofdatascience/095fcf05885181cbca6d245f463e9cf3dfe5543e/published/artofdatascience.pdf --------------------------------------------------------------------------------