├── .gitignore ├── LICENSE.md ├── README.md ├── annotated-bibs ├── blameless-postmortems.Rmd ├── blameless-postmortems.md ├── cogsci.pdf ├── communication.pdf ├── diversity.pdf ├── ds-academic-field.pdf ├── ethics.pdf ├── graphical-advice.pdf ├── modern-medicine.pdf ├── sharing-analyses.pdf ├── tailored-elearning.pdf └── teaching.pdf ├── applicants ├── accept-email.md ├── mail-merge.R └── pass-email.md ├── course-feedback.md ├── stats337.Rproj ├── week-01 ├── week-01-1.jpg ├── week-01-2.jpg ├── week-01-3.jpg ├── week-01-4.jpg ├── week-01.key └── week-01.pdf ├── week-02 ├── 01-most-important.jpg ├── 02-summary.jpg ├── 03-questions.jpg ├── 04-reading-process.jpg ├── README.Rmd └── README.md ├── week-03 └── google-doc.md ├── week-04 ├── google-doc.md └── quotes.pages └── week-05 ├── conversational-roles.pages └── quotes.pages /.gitignore: -------------------------------------------------------------------------------- 1 | .Rproj.user 2 | .Rhistory 3 | .RData 4 | .httr-oauth 5 | permission-numbers.xlsx 6 | students 7 | -------------------------------------------------------------------------------- /LICENSE.md: -------------------------------------------------------------------------------- 1 | # Creative commons Attribution-ShareAlike 4.0 International 2 | 3 | Creative Commons Corporation (“Creative Commons”) is not a law firm and does not provide legal services or legal advice. Distribution of Creative Commons public licenses does not create a lawyer-client or other relationship. Creative Commons makes its licenses and related information available on an “as-is” basis. Creative Commons gives no warranties regarding its licenses, any material licensed under their terms and conditions, or any related information. Creative Commons disclaims all liability for damages resulting from their use to the fullest extent possible. 
4 | 5 | ### Using Creative Commons Public Licenses 6 | 7 | Creative Commons public licenses provide a standard set of terms and conditions that creators and other rights holders may use to share original works of authorship and other material subject to copyright and certain other rights specified in the public license below. The following considerations are for informational purposes only, are not exhaustive, and do not form part of our licenses. 8 | 9 | * __Considerations for licensors:__ Our public licenses are intended for use by those authorized to give the public permission to use material in ways otherwise restricted by copyright and certain other rights. Our licenses are irrevocable. Licensors should read and understand the terms and conditions of the license they choose before applying it. Licensors should also secure all rights necessary before applying our licenses so that the public can reuse the material as expected. Licensors should clearly mark any material not subject to the license. This includes other CC-licensed material, or material used under an exception or limitation to copyright. [More considerations for licensors](http://wiki.creativecommons.org/Considerations_for_licensors_and_licensees#Considerations_for_licensors). 10 | 11 | * __Considerations for the public:__ By using one of our public licenses, a licensor grants the public permission to use the licensed material under specified terms and conditions. If the licensor’s permission is not necessary for any reason–for example, because of any applicable exception or limitation to copyright–then that use is not regulated by the license. Our licenses grant only permissions under copyright and certain other rights that a licensor has authority to grant. Use of the licensed material may still be restricted for other reasons, including because others have copyright or other rights in the material. A licensor may make special requests, such as asking that all changes be marked or described. 
Although not required by our licenses, you are encouraged to respect those requests where reasonable. [More considerations for the public](http://wiki.creativecommons.org/Considerations_for_licensors_and_licensees#Considerations_for_licensees). 12 | 13 | ## Creative Commons Attribution-ShareAlike 4.0 International Public License 14 | 15 | By exercising the Licensed Rights (defined below), You accept and agree to be bound by the terms and conditions of this Creative Commons Attribution-ShareAlike 4.0 International Public License ("Public License"). To the extent this Public License may be interpreted as a contract, You are granted the Licensed Rights in consideration of Your acceptance of these terms and conditions, and the Licensor grants You such rights in consideration of benefits the Licensor receives from making the Licensed Material available under these terms and conditions. 16 | 17 | ### Section 1 – Definitions. 18 | 19 | a. __Adapted Material__ means material subject to Copyright and Similar Rights that is derived from or based upon the Licensed Material and in which the Licensed Material is translated, altered, arranged, transformed, or otherwise modified in a manner requiring permission under the Copyright and Similar Rights held by the Licensor. For purposes of this Public License, where the Licensed Material is a musical work, performance, or sound recording, Adapted Material is always produced where the Licensed Material is synched in timed relation with a moving image. 20 | 21 | b. __Adapter's License__ means the license You apply to Your Copyright and Similar Rights in Your contributions to Adapted Material in accordance with the terms and conditions of this Public License. 22 | 23 | c. __BY-SA Compatible License__ means a license listed at [creativecommons.org/compatiblelicenses](http://creativecommons.org/compatiblelicenses), approved by Creative Commons as essentially the equivalent of this Public License. 24 | 25 | d. 
__Copyright and Similar Rights__ means copyright and/or similar rights closely related to copyright including, without limitation, performance, broadcast, sound recording, and Sui Generis Database Rights, without regard to how the rights are labeled or categorized. For purposes of this Public License, the rights specified in Section 2(b)(1)-(2) are not Copyright and Similar Rights. 26 | 27 | e. __Effective Technological Measures__ means those measures that, in the absence of proper authority, may not be circumvented under laws fulfilling obligations under Article 11 of the WIPO Copyright Treaty adopted on December 20, 1996, and/or similar international agreements. 28 | 29 | f. __Exceptions and Limitations__ means fair use, fair dealing, and/or any other exception or limitation to Copyright and Similar Rights that applies to Your use of the Licensed Material. 30 | 31 | g. __License Elements__ means the license attributes listed in the name of a Creative Commons Public License. The License Elements of this Public License are Attribution and ShareAlike. 32 | 33 | h. __Licensed Material__ means the artistic or literary work, database, or other material to which the Licensor applied this Public License. 34 | 35 | i. __Licensed Rights__ means the rights granted to You subject to the terms and conditions of this Public License, which are limited to all Copyright and Similar Rights that apply to Your use of the Licensed Material and that the Licensor has authority to license. 36 | 37 | j. __Licensor__ means the individual(s) or entity(ies) granting rights under this Public License. 38 | 39 | k. 
__Share__ means to provide material to the public by any means or process that requires permission under the Licensed Rights, such as reproduction, public display, public performance, distribution, dissemination, communication, or importation, and to make material available to the public including in ways that members of the public may access the material from a place and at a time individually chosen by them. 40 | 41 | l. __Sui Generis Database Rights__ means rights other than copyright resulting from Directive 96/9/EC of the European Parliament and of the Council of 11 March 1996 on the legal protection of databases, as amended and/or succeeded, as well as other essentially equivalent rights anywhere in the world. 42 | 43 | m. __You__ means the individual or entity exercising the Licensed Rights under this Public License. Your has a corresponding meaning. 44 | 45 | ### Section 2 – Scope. 46 | 47 | a. ___License grant.___ 48 | 49 | 1. Subject to the terms and conditions of this Public License, the Licensor hereby grants You a worldwide, royalty-free, non-sublicensable, non-exclusive, irrevocable license to exercise the Licensed Rights in the Licensed Material to: 50 | 51 | A. reproduce and Share the Licensed Material, in whole or in part; and 52 | 53 | B. produce, reproduce, and Share Adapted Material. 54 | 55 | 2. __Exceptions and Limitations.__ For the avoidance of doubt, where Exceptions and Limitations apply to Your use, this Public License does not apply, and You do not need to comply with its terms and conditions. 56 | 57 | 3. __Term.__ The term of this Public License is specified in Section 6(a). 58 | 59 | 4. __Media and formats; technical modifications allowed.__ The Licensor authorizes You to exercise the Licensed Rights in all media and formats whether now known or hereafter created, and to make technical modifications necessary to do so. 
The Licensor waives and/or agrees not to assert any right or authority to forbid You from making technical modifications necessary to exercise the Licensed Rights, including technical modifications necessary to circumvent Effective Technological Measures. For purposes of this Public License, simply making modifications authorized by this Section 2(a)(4) never produces Adapted Material. 60 | 61 | 5. __Downstream recipients.__ 62 | 63 | A. __Offer from the Licensor – Licensed Material.__ Every recipient of the Licensed Material automatically receives an offer from the Licensor to exercise the Licensed Rights under the terms and conditions of this Public License. 64 | 65 | B. __Additional offer from the Licensor – Adapted Material.__ Every recipient of Adapted Material from You automatically receives an offer from the Licensor to exercise the Licensed Rights in the Adapted Material under the conditions of the Adapter’s License You apply. 66 | 67 | C. __No downstream restrictions.__ You may not offer or impose any additional or different terms or conditions on, or apply any Effective Technological Measures to, the Licensed Material if doing so restricts exercise of the Licensed Rights by any recipient of the Licensed Material. 68 | 69 | 6. __No endorsement.__ Nothing in this Public License constitutes or may be construed as permission to assert or imply that You are, or that Your use of the Licensed Material is, connected with, or sponsored, endorsed, or granted official status by, the Licensor or others designated to receive attribution as provided in Section 3(a)(1)(A)(i). 70 | 71 | b. ___Other rights.___ 72 | 73 | 1. 
Moral rights, such as the right of integrity, are not licensed under this Public License, nor are publicity, privacy, and/or other similar personality rights; however, to the extent possible, the Licensor waives and/or agrees not to assert any such rights held by the Licensor to the limited extent necessary to allow You to exercise the Licensed Rights, but not otherwise. 74 | 75 | 2. Patent and trademark rights are not licensed under this Public License. 76 | 77 | 3. To the extent possible, the Licensor waives any right to collect royalties from You for the exercise of the Licensed Rights, whether directly or through a collecting society under any voluntary or waivable statutory or compulsory licensing scheme. In all other cases the Licensor expressly reserves any right to collect such royalties. 78 | 79 | ### Section 3 – License Conditions. 80 | 81 | Your exercise of the Licensed Rights is expressly made subject to the following conditions. 82 | 83 | a. ___Attribution.___ 84 | 85 | 1. If You Share the Licensed Material (including in modified form), You must: 86 | 87 | A. retain the following if it is supplied by the Licensor with the Licensed Material: 88 | 89 | i. identification of the creator(s) of the Licensed Material and any others designated to receive attribution, in any reasonable manner requested by the Licensor (including by pseudonym if designated); 90 | 91 | ii. a copyright notice; 92 | 93 | iii. a notice that refers to this Public License; 94 | 95 | iv. a notice that refers to the disclaimer of warranties; 96 | 97 | v. a URI or hyperlink to the Licensed Material to the extent reasonably practicable; 98 | 99 | B. indicate if You modified the Licensed Material and retain an indication of any previous modifications; and 100 | 101 | C. indicate the Licensed Material is licensed under this Public License, and include the text of, or the URI or hyperlink to, this Public License. 102 | 103 | 2. 
You may satisfy the conditions in Section 3(a)(1) in any reasonable manner based on the medium, means, and context in which You Share the Licensed Material. For example, it may be reasonable to satisfy the conditions by providing a URI or hyperlink to a resource that includes the required information. 104 | 105 | 3. If requested by the Licensor, You must remove any of the information required by Section 3(a)(1)(A) to the extent reasonably practicable. 106 | 107 | b. ___ShareAlike.___ 108 | 109 | In addition to the conditions in Section 3(a), if You Share Adapted Material You produce, the following conditions also apply. 110 | 111 | 1. The Adapter’s License You apply must be a Creative Commons license with the same License Elements, this version or later, or a BY-SA Compatible License. 112 | 113 | 2. You must include the text of, or the URI or hyperlink to, the Adapter's License You apply. You may satisfy this condition in any reasonable manner based on the medium, means, and context in which You Share Adapted Material. 114 | 115 | 3. You may not offer or impose any additional or different terms or conditions on, or apply any Effective Technological Measures to, Adapted Material that restrict exercise of the rights granted under the Adapter's License You apply. 116 | 117 | ### Section 4 – Sui Generis Database Rights. 118 | 119 | Where the Licensed Rights include Sui Generis Database Rights that apply to Your use of the Licensed Material: 120 | 121 | a. for the avoidance of doubt, Section 2(a)(1) grants You the right to extract, reuse, reproduce, and Share all or a substantial portion of the contents of the database; 122 | 123 | b. if You include all or a substantial portion of the database contents in a database in which You have Sui Generis Database Rights, then the database in which You have Sui Generis Database Rights (but not its individual contents) is Adapted Material, including for purposes of Section 3(b); and 124 | 125 | c. 
You must comply with the conditions in Section 3(a) if You Share all or a substantial portion of the contents of the database. 126 | 127 | For the avoidance of doubt, this Section 4 supplements and does not replace Your obligations under this Public License where the Licensed Rights include other Copyright and Similar Rights. 128 | 129 | ### Section 5 – Disclaimer of Warranties and Limitation of Liability. 130 | 131 | a. __Unless otherwise separately undertaken by the Licensor, to the extent possible, the Licensor offers the Licensed Material as-is and as-available, and makes no representations or warranties of any kind concerning the Licensed Material, whether express, implied, statutory, or other. This includes, without limitation, warranties of title, merchantability, fitness for a particular purpose, non-infringement, absence of latent or other defects, accuracy, or the presence or absence of errors, whether or not known or discoverable. Where disclaimers of warranties are not allowed in full or in part, this disclaimer may not apply to You.__ 132 | 133 | b. __To the extent possible, in no event will the Licensor be liable to You on any legal theory (including, without limitation, negligence) or otherwise for any direct, special, indirect, incidental, consequential, punitive, exemplary, or other losses, costs, expenses, or damages arising out of this Public License or use of the Licensed Material, even if the Licensor has been advised of the possibility of such losses, costs, expenses, or damages. Where a limitation of liability is not allowed in full or in part, this limitation may not apply to You.__ 134 | 135 | c. The disclaimer of warranties and limitation of liability provided above shall be interpreted in a manner that, to the extent possible, most closely approximates an absolute disclaimer and waiver of all liability. 136 | 137 | ### Section 6 – Term and Termination. 138 | 139 | a. 
This Public License applies for the term of the Copyright and Similar Rights licensed here. However, if You fail to comply with this Public License, then Your rights under this Public License terminate automatically. 140 | 141 | b. Where Your right to use the Licensed Material has terminated under Section 6(a), it reinstates: 142 | 143 | 1. automatically as of the date the violation is cured, provided it is cured within 30 days of Your discovery of the violation; or 144 | 145 | 2. upon express reinstatement by the Licensor. 146 | 147 | For the avoidance of doubt, this Section 6(b) does not affect any right the Licensor may have to seek remedies for Your violations of this Public License. 148 | 149 | c. For the avoidance of doubt, the Licensor may also offer the Licensed Material under separate terms or conditions or stop distributing the Licensed Material at any time; however, doing so will not terminate this Public License. 150 | 151 | d. Sections 1, 5, 6, 7, and 8 survive termination of this Public License. 152 | 153 | ### Section 7 – Other Terms and Conditions. 154 | 155 | a. The Licensor shall not be bound by any additional or different terms or conditions communicated by You unless expressly agreed. 156 | 157 | b. Any arrangements, understandings, or agreements regarding the Licensed Material not stated herein are separate from and independent of the terms and conditions of this Public License. 158 | 159 | ### Section 8 – Interpretation. 160 | 161 | a. For the avoidance of doubt, this Public License does not, and shall not be interpreted to, reduce, limit, restrict, or impose conditions on any use of the Licensed Material that could lawfully be made without permission under this Public License. 162 | 163 | b. 
To the extent possible, if any provision of this Public License is deemed unenforceable, it shall be automatically reformed to the minimum extent necessary to make it enforceable. If the provision cannot be reformed, it shall be severed from this Public License without affecting the enforceability of the remaining terms and conditions. 164 | 165 | c. No term or condition of this Public License will be waived and no failure to comply consented to unless expressly agreed to by the Licensor. 166 | 167 | d. Nothing in this Public License constitutes or may be interpreted as a limitation upon, or waiver of, any privileges and immunities that apply to the Licensor or You, including from the legal processes of any jurisdiction or authority. 168 | 169 | > Creative Commons is not a party to its public licenses. Notwithstanding, Creative Commons may elect to apply one of its public licenses to material it publishes and in those instances will be considered the “Licensor.” Except for the limited purpose of indicating that material is shared under a Creative Commons public license or as otherwise permitted by the Creative Commons policies published at [creativecommons.org/policies](http://creativecommons.org/policies), Creative Commons does not authorize the use of the trademark “Creative Commons” or any other trademark or logo of Creative Commons without its prior written consent including, without limitation, in connection with any unauthorized modifications to any of its public licenses or any other arrangements, understandings, or agreements concerning use of licensed material. For the avoidance of doubt, this paragraph does not form part of the public licenses. 
170 | > 171 | > Creative Commons may be contacted at creativecommons.org 172 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Stats 337: Readings in Applied Data Science 2 | 3 | [Stats 337](https://explorecourses.stanford.edu/search?view=catalog&filter-coursestatus-Active=on&page=0&catalog=&academicYear=&q=stats+337&collapse=%2C2%2C) is a small discussion class available to Stanford students in Spring 2018. Students in this class will read 3-4 papers (or equivalent) per week, write a brief response, and then discuss the papers (and related ideas) in class. 4 | 5 | ## Readings 6 | 7 | These readings reflect my personal thoughts about applied data science, and are skewed towards topics that I think are important but are generally underappreciated. They are not a systematic attempt to survey the field. That said, if you think there's something major that I've missed, please feel free to submit [an issue](https://github.com/hadley/stats337/issues) (or [pull request](https://github.com/hadley/stats337/edit/master/README.md)!). These readings will evolve as the quarter goes by. 8 | 9 | Many of the readings come from [Practical Data Science for Stats](https://peerj.com/collections/50-practicaldatascistats/), a joint PeerJ collection and special issue of the American Statistician. Jenny Bryan and I pulled this collection together in order to publish some of the important parts of data science that were previously unpublished. Other readings are blog posts because so much of applied data science is outside the comfort zone of traditional academic fields. 10 | 11 | The development of much of this course has been driven by conversations on Twitter. A big thanks goes to everyone who has helped me out! 
Key threads: [classroom discussion](https://twitter.com/hadleywickham/status/964650890593538048), [ethics](https://twitter.com/hadleywickham/status/978712074434957313), [google sheets](https://twitter.com/hadleywickham/status/978401746182549504), [citation management](https://twitter.com/hadleywickham/status/978752525493915648). 12 | 13 | ### What the *&!% is data science? (Apr 2) 14 | 15 | * [Data scientists mostly just do arithmetic and that’s a good thing](https://m.signalvnoise.com/data-scientists-mostly-just-do-arithmetic-and-that-s-a-good-thing-c6371885f7f6); 16 | Noah Lorang (2016). 17 | 18 | * Optional: [Enterprise Data Analysis and Visualization: An Interview Study](https://idl.cs.washington.edu/papers/enterprise-analysis-interviews); 19 | Sean Kandel, Andreas Paepcke, Joseph Hellerstein, Jeffrey Heer (2012). 20 | 21 | * Optional: [50 years of data science](https://www.tandfonline.com/doi/abs/10.1080/10618600.2017.1384734) 22 | ([OA preprint](https://courses.csail.mit.edu/18.337/2015/docs/50YearsDataScience.pdf)); 23 | David Donoho (2017). This is a discussion paper, and a number of notable 24 | statisticians have contributed commentary. Make sure to read some of these 25 | as well. 26 | 27 | [In-class resources](week-01/) 28 | 29 | ### Data collection and collaboration (Apr 9) 30 | 31 | * [Tidy data](https://www.jstatsoft.org/article/view/v059i10/); 32 | Hadley Wickham (2013). 33 | 34 | * [Data organization in spreadsheets](https://peerj.com/preprints/3183/); 35 | Karl W Broman, Kara Woo (2017). 36 | 37 | * [Best practices for using google sheets in your data project](https://matthewlincoln.net/2018/03/26/best-practices-for-using-google-sheets-in-your-data-project.html); 38 | Matthew Lincoln (2018). 
39 | 40 | * Bonus: [Modeling as a core component of structuring data](https://iase-web.org/documents/SERJ/SERJ16(2)_Konold.pdf); 41 | Clifford Konold, William Finzer, Kozoom Kreetong (2017) 42 | 43 | [In-class photos](week-02/) 44 | 45 | 46 | Spend 3-5 minutes filling out [class feedback](https://goo.gl/forms/py92VLLqodxuBU8z1). 47 | 48 | ### Software engineering (Apr 16) 49 | 50 | * [Software development skills for data scientists](http://treycausey.com/software_dev_skills.html); 51 | Trey Causey (2015). 52 | 53 | * [Excuse me, do you have a moment to talk about version control?](https://peerj.com/preprints/3159/); 54 | Jennifer Bryan (2017). 55 | 56 | * [Good enough practices in scientific computing](http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1005510); 57 | Greg Wilson, Jennifer Bryan, Karen Cranston, Justin Kitzes, Lex Nederbragt, Tracy K. Teal (2017). 58 | 59 | [Collaborative google doc](week-03/google-doc.md) 60 | 61 | ### DevOps (Apr 23) 62 | 63 | * [Opinionated analysis development](https://peerj.com/preprints/3210/); 64 | Hillary Parker (2017) 65 | 66 | * [An introduction to Docker for reproducible research, with examples from the R environment](https://arxiv.org/abs/1410.0846); 67 | Carl Boettiger (2014). 68 | 69 | * [Hidden Technical Debt in Machine Learning Systems](https://papers.nips.cc/paper/5656-hidden-technical-debt-in-machine-learning-systems.pdf); 70 | D. Sculley, Gary Holt, Daniel Golovin, Eugene Davydov, Todd Phillips, 71 | Dietmar Ebner, Vinay Chaudhary, Michael Young, Jean-Francois Crespo, Dan Dennison (2015). 72 | 73 | [Collaborative google doc](week-04/google-doc.md) 74 | 75 | ### Teaching (Apr 30) 76 | 77 | * [The Introductory Statistics Course: A Ptolemaic Curriculum?](https://escholarship.org/uc/item/6hb3k0nz). 78 | George W Cobb (2007). 79 | 80 | * [The democratization of data science education](https://peerj.com/preprints/3195/); 81 | Sean Kross, Roger D Peng, Brian S Caffo, Ira Gooding, Jeffrey T Leek (2017). 
82 | 83 | * [Teaching stats for data science](https://peerj.com/preprints/3205/); 84 | Danny Kaplan (2017). 85 | 86 | * [Ten quick tips for teaching programming](http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1006023); Neil C. C. Brown, Greg Wilson (2018). 87 | 88 | ### Reproducibility (May 7) 89 | 90 | * [Best practices for computational science](https://openresearchsoftware.metajnl.com/articles/10.5334/jors.ay/); 91 | Victoria Stodden, Sheila Miguez (2014). 92 | 93 | * [How rOpenSci uses Code review to promote reproducible science](https://ropensci.org/blog/2017/09/01/nf-softwarereview/); 94 | Noam Ross, Scott Chamberlain, Karthik Ram, Maëlle Salmon (2017). 95 | 96 | * [A practical guide for transparency in psychological science](https://psyarxiv.com/rtygm/); 97 | Olivier Klein, Tom Hardwicke, Frederik Aust, Johannes Breuer, Henrik Danielsson, Alicia Hofelich Mohr, Hans IJzerman, Gustav Nilsonne, Wolf Vanpaemel, Michael Frank (2018). 98 | 99 | * [Lessons Learned Reproducing a Deep Reinforcement Learning Paper](http://amid.fish/reproducing-deep-rl); Matthew Rahtz (2018). 100 | 101 | * Bonus: [The Practice of Reproducible Research](https://www.practicereproducibleresearch.org); 102 | Justin Kitzes, Daniel Turek, Fatma Deniz (2018). 103 | 104 | ### Ethics (May 14) 105 | 106 | * [The Ethical Data Scientist](http://www.slate.com/articles/technology/future_tense/2016/02/how_to_bring_better_ethics_to_data_science.html); 107 | Cathy O'Neil (2016). 108 | 109 | * [Big data, machine learning, and the social sciences](https://medium.com/@hannawallach/big-data-machine-learning-and-the-social-sciences-927a8e20460d); 110 | Hanna Wallach (2014). 111 | 112 | * [A Code of Ethics for Data Science](https://medium.com/@dpatil/a-code-of-ethics-for-data-science-cda27d1fac1); 113 | DJ Patil (2018). 114 | 115 | * [An ethical code can’t be about ethics](https://towardsdatascience.com/an-ethical-code-cant-be-about-ethics-66acaea6f16f); 116 | Schaun Wheeler (2018). 
117 | 118 | * [Ethical Guidelines for Statistical Practice](http://www.amstat.org/ASA/Your-Career/Ethical-Guidelines-for-Statistical-Practice.aspx); 119 | Committee on Professional Ethics of the American Statistical Association (2016). 120 | 121 | * [Journalism as a Professional Model for Data Science](https://www.brianckeegan.com/2016/02/journalism-as-a-professional-model-for-data-science/); 122 | Brian C. Keegan (2016) 123 | 124 | ### Career (May 21) 125 | 126 | * [What it's like to be on the data science job market](http://treycausey.com/data_science_interviews.html); 127 | Trey Causey (2015) 128 | 129 | * [Academic job search advice](http://matt.might.net/articles/advice-for-academic-job-hunt/); 130 | Matt Might (????). 131 | 132 | * [Importance of sponsorship](https://robinsones.github.io/The-Importance-of-Sponsorship/); 133 | Emily Robinson (2018). 134 | 135 | * [Imposter syndrome in data science](https://caitlinhudon.com/2018/01/19/imposter-syndrome-in-data-science/); 136 | Caitlin Hudon (2018). 137 | 138 | ### Industry 139 | 140 | * [Doing data science at twitter](https://medium.com/@rchang/my-two-year-journey-as-a-data-scientist-at-twitter-f0c13298aee6); 141 | Robert Chang (2015). 142 | 143 | * [Engineers shouldn’t write ETL: A guide to building a high functioning data science Department](https://multithreaded.stitchfix.com/blog/2016/03/16/engineers-shouldnt-write-etl/); 144 | Jeff Magnusson (2016). 145 | 146 | * [Using R packages and education to scale data science at Airbnb](https://medium.com/airbnb-engineering/using-r-packages-and-education-to-scale-data-science-at-airbnb-906faa58e12d); 147 | Ricardo Bion (2016). 148 | 149 | * [Data science at Instacart](https://tech.instacart.com/data-science-at-instacart-dabbd2d3f279); 150 | Jeremy Stanley (2017). 
151 | 152 | * [.rprofile: Jenny Bryan](https://ropensci.org/blog/2017/12/08/rprofile-jenny-bryan/); 153 | Kelly O'Briant (2017) 154 | 155 | * [Marketing for data science](https://medium.com/indeed-data-science/marketing-for-data-science-a-7-step-go-to-market-plan-for-your-next-data-product-60c034c34d55). Erik Oberg (2018). 156 | 157 | ### Workflow 158 | 159 | * [The plain person's guide to plain text social science](http://plain-text.co); 160 | Kieran Healy (2016). 161 | 162 | * [Open notebook history](http://wcm1.web.rice.edu/open-notebook-history.html); 163 | Caleb McDaniel (2013). 164 | 165 | * Optional: [How to be a modern scientist](https://leanpub.com/modernscientist); 166 | Jeff Leek (2016). 167 | 168 | ## Annotated bibliographies 169 | 170 | Many students in the spring 2018 class elected to share their final annotated bibliographies: 171 | 172 | * [Blameless post-mortems](annotated-bibs/blameless-postmortems.md) by 173 | [Jennifer Wang](https://github.com/jw0506) 174 | 175 | * [Communication and visualization](annotated-bibs/communication.pdf) 176 | by Kenneth Tay 177 | 178 | * [Connections to cognitive science](annotated-bibs/cogsci.pdf) 179 | by Sara Altman. 180 | 181 | * [Data science as an academic field](annotated-bibs/ds-academic-field.pdf) (pdf) 182 | by [Stephen Bates](https://github.com/stephenbates19) 183 | 184 | * [Data science in modern medicine](annotated-bibs/modern-medicine.pdf) 185 | by Sean R. Zion. 186 | 187 | * [Diversity in hiring and employment](annotated-bibs/diversity.pdf) (pdf) 188 | 189 | * [Ethics in data science](annotated-bibs/ethics.pdf) (pdf) 190 | 191 | * [Graphical advice](annotated-bibs/graphical-advice.pdf) 192 | by Nick Hershey 193 | 194 | * [Sharing analyses across research groups](annotated-bibs/sharing-analyses.pdf) 195 | by Hershel Mehta. 196 | 197 | * [Tailoring learning experiences for adults through data analytics](annotated-bibs/tailored-elearning.pdf) (pdf) 198 | by Anna Khazenzon. 
199 | 200 | * [Teaching data science](annotated-bibs/teaching.pdf) (pdf) 201 | by Ben Stenhaug. 202 | 203 | ## Grading 204 | 205 | This is a discussion-based class, so the majority of your final grade will come from your preparation for discussion (weekly 1-page responses, 30%), and your in-class participation (also 30%). This class is not meant to be self-contained, so the final component of your grade will be an annotated bibliography (40%) describing other papers that you read outside of this class. The goal of these assessments is to force you to do things that are in your own best interests, and to encourage you to learn helpful workflows that will stand you in good stead outside of this class. 206 | 207 | I am not interested in policing excuses so no late responses will be accepted, and absences from class will count as a zero for participation. That said, I also don't want one bad week to affect your final grade, so your lowest two scores from each will be dropped. 208 | 209 | ### Responses 210 | 211 | Each week (after the first week), you need to turn in a 1-2 page written response to the papers that you read that week. The goal of the response is to ensure that you've read the weekly readings, thought about them, and connected them to your existing knowledge, interests, and experience. In your response, you should briefly summarise the paper (1-2 sentences to jog your memory when you re-read your notes), and then focus on _your_ response to the paper: How did it make you feel? What questions were you left with? What do you think it got wrong? If you found one of the readings to be particularly thought-provoking, feel free to devote your entire response to that paper. 212 | 213 | Each response will be graded on the check/plus/minus system. You will get a check if you briefly summarise the readings and add your own commentary. You will get a check-plus if you synthesize the readings, and combine them with outside knowledge/experience. 
You will get a check-minus if you only summarise the paper. (I will likely evolve these guidelines to be more concrete once I've read a few responses.) 214 | 215 | If you're not familiar with reading academic papers (or you want to polish your skills), you might want to read these guidelines from [Jeff Leek](https://github.com/jtleek/readingpapers). I'd also highly recommend that you learn and use a citation management system. Having a system for managing citations is crucial if you plan to write a thesis. If you don't have an existing system, start by reading the advice of [Caleb McDaniel](http://wcm1.web.rice.edu/plain-text-citations.html). 216 | 217 | ### Participation 218 | 219 | This is a discussion class so your classroom participation is essential. But don't worry if you're an introvert, if you're shy, or if English is your second language: there will be plenty of opportunities to participate that don't require verbal agility. In this class, I'll be drawing on the techniques described in [The Discussion Book](http://a.co/dar4MT1) by Stephen D. Brookfield and Stephen Preskill to make sure that everyone gets a chance to participate. I'll also collect regular feedback to make sure that everything is going well. 220 | 221 | ### Annotated bibliography 222 | 223 | Your final project will be an annotated bibliography containing at least 20 papers or blog posts related to data science that we did not cover in this course. (See [citation tracing](http://www.raulpacheco.org/2018/02/forward-citation-tracing-and-backwards-citation-tracing-in-literature-reviews/)) 224 | 225 | Due June 6 (electronically) 226 | 227 | There are three components to the bibliography: 228 | 229 | * Executive summary (25%). Introduce the overall theme of your bibliography in 1-2 230 | paragraphs. Then use 1-2 pages to synthesise the most important or interesting ideas 231 | from your annotated bibliography. 232 | 233 | * Top 3 (25%). 
List the three papers that you would most highly recommend and 234 | describe briefly why. 235 | 236 | * Bibliography (50%). List all the papers you have read with a proper reference 237 | and any notes you find helpful. 238 | 239 | Each component will be graded 1 (C), 2 (B), or 3 (A): 240 | 241 | * Executive summary: 242 | * 3: 243 | * 2: 244 | * 1: 245 | 246 | * Top 3: 247 | * 3: Your description of the top 3 papers makes me want to run out and 248 | read them immediately, and you make that easy with impeccable citations 249 | and links to pdfs 250 | 251 | * 2: 252 | 253 | * 1: You have listed 3 papers and briefly described why they are interesting. 254 | 255 | * Bibliography: 256 | * 1: 6-10 papers 257 | * 2: 11-16 papers 258 | * 3: >25 papers 259 | 260 | ## License 261 | 262 | Creative Commons License
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. 263 | -------------------------------------------------------------------------------- /annotated-bibs/blameless-postmortems.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "Stats 337, Applied Readings in Data Science (Spring 2018)" 3 | output: github_document 4 | --- 5 | 6 | ```{r setup, include=FALSE} 7 | knitr::opts_chunk$set(echo = TRUE) 8 | ``` 9 | 10 | # Annotated bibliography 11 | 12 | ****** 13 | 14 | ### Theme: Blameless postmortems for data science? 15 | 16 | ### Executive summary 17 | 18 | The idea of postmortems to evaluate failure events has long been considered an 19 | important practice for effective risk management. The idea of blameless 20 | postmortems goes further to emphasize the need to create and facilitate a 21 | postmortem process where participants are incentivized to provide detailed 22 | accounts and analyses of what happened without fear of punishment. 23 | 24 | My readings for this annotated bibliography were guided by a desire to learn 25 | more about the idea of blameless postmortems, with particular attention to how 26 | they might be implemented for data analysis errors and data science issues. I’ve 27 | outlined a number of potential reasons that blameless postmortems may not yet 28 | have been widely adopted within data science practices, as well as some potential 29 | suggestions for trying to encourage this practice. 30 | 31 | ### Barriers & Interventions to blameless postmortems for data science 32 | 33 | While the idea of blameless postmortems has been adopted by many software 34 | engineering and DevOps teams (with many referring to Etsy’s process as a model), 35 | it seems that blameless postmortems have not yet infiltrated standard data 36 | science practices. 
37 | 38 | This may be for a number of reasons, such as the following: 39 | 40 | * Data science is a relatively new field and there are not yet set “standard” 41 | data science practices. 42 | * The definition of a data analysis success or failure may be more nebulous, 43 | such that the errors or failures due to data science may be less clearly 44 | identifiable - contrast this with some of the obvious software engineering 45 | failures (e.g. cloud service is disrupted). 46 | * There may be fewer incentives for small data science teams or data scientists 47 | spread across an organization to prioritize postmortem processes and learning 48 | over efficiency, and/or they may be in a weaker position to establish a 49 | culture of blameless postmortems. 50 | * There is a lack of examples or case studies about blameless postmortems as 51 | applied to data analysis errors – and likewise, there is a lack of templates 52 | for conducting these kinds of postmortems. 53 | * Data scientists may underestimate the degree to which decisions made as part 54 | of data science practices are subject to human bias and error. 55 | * Changing organizational culture is hard work! And managing blameless 56 | postmortem processes effectively may sometimes be delicate, and/or require 57 | specific training and practice. These skills likely lie outside the typical 58 | scope of what a data scientist thinks their job constitutes. 59 | 60 | From this list and from the readings, some thoughts about potential 61 | interventions to accelerate the adoption of blameless postmortems in data 62 | science are the following: 63 | 64 | * Have data scientists create a blameless post-mortem template for data 65 | science failures within their own organization. 
Doing so would likely catalyze 66 | thoughtful and explicit discussions on what data science success and failure 67 | look like, as well as help establish group norms about what a blameless 68 | post-mortem process looks like before a crisis forces the issue. 69 | * If they already exist elsewhere in the organization, communicate and learn 70 | about the postmortem processes already in place within software engineering 71 | groups, etc. – and use these resources as potential templates for data science 72 | postmortems, or if appropriate, see if data science postmortems would belong 73 | within existing processes. 74 | * Consider when it would be appropriate and/or beneficial to the community to 75 | make a data science post-mortem public. 76 | * Consider conducting systematic and internally reviewed premortems to 77 | identify potential risks and human biases before embarking on a data science 78 | project; revisit and iterate as necessary as the project unfolds. 79 | 80 | Any feedback, thoughts, critiques, or additions are welcome! 81 | 82 | 83 | ### Top 3 articles 84 | 85 | **1. [Blameless PostMortems and a Just Culture](https://codeascraft.com/2012/05/22/blameless-postmortems/). John** 86 | **Allspaw (from Etsy). *Code as Craft* (May 2012).** 87 | 88 | *Why you should read this*: If there were a canon of readings on blameless 89 | postmortems, this article would be in it. The article is relatively short, but 90 | lays out the philosophy behind blameless postmortems in a cogent and persuasive 91 | manner and at a digestible pace – it’s a great way to quickly get up to speed on 92 | the ideas as well as the actions that blameless postmortems involve. That is, John 93 | not only presents simple explanations of key principles from the literature on 94 | risk management and safety (e.g. from Sidney Dekker), but also lays out concrete 95 | steps that Etsy takes to implement these ideas. 
And it seems like everyone 96 | writing about blameless postmortems links to this article… so don’t be out of 97 | the loop! 98 | 99 | *Winning Quotations*: 100 | 101 | * “So technically, engineers are not at all “off the hook” with a blameless 102 | PostMortem process. They are very much on the hook for helping Etsy become safer 103 | and more resilient, in the end.” 104 | * “We enable and encourage people who do make mistakes to be the experts on 105 | educating the rest of the organization how not to make them in the future.” 106 | 107 | **2. [What is a Successful Data Analysis?](https://simplystatistics.org/2018/04/17/what-is-a-successful-data-analysis/)** **Roger Peng. *Simplystats* (Apr 2018).** 108 | 109 | *Why you should read this*: Maybe this article should come first, because 110 | fundamental to the question of postmortems for data science is the question: 111 | what does data analysis failure look like? What metrics do we use to identify it 112 | when we see it? 113 | 114 | This article is a great entry into these questions – you’ll inevitably push your 115 | thinking by observing your own reactions and thoughts in response to Roger’s 116 | proposed definition, which he suggests might be unsettling (or not!). 117 | 118 | In terms of content: Roger presents a framework with which to think about the 119 | question of success in data analysis, and contrasts his ideas about “acceptance” 120 | and “audience” to other notions such as that of using internal and external 121 | validity as a measure of successful data analysis. He also brings two critical 122 | yet underappreciated points into the conversation: 1) the importance of 123 | considering the context in which an analysis is performed when trying to 124 | evaluate what analysis is appropriate; and 2) that human nature plays a big role 125 | in defining the success of data analysis. 
126 | 127 | *Winning quotations*: 128 | 129 | * “Success depends on human beings, unfortunately, and this is something 130 | analysts must be prepared to deal with.” 131 | * “When an audience is upset by a data analysis, and they are being honest, 132 | they are usually upset with the chosen narrative, not with the facts per se.” 133 | 134 | 135 | **3. [Fearless shared postmortems – CRE life lessons](https://cloudplatform.googleblog.com/2017/11/fearless-shared-postmortems-CRE-life-lessons.html).** 136 | **Adrian Hilton, Gwendolyn Stockman. *Google Cloud Platform Blog* (Nov 2017).** 137 | 138 | *Why you should read this*: This is a bit of an oddball reading suggestion (so 139 | maybe that’s reason enough!). While the reason that teams in Google’s 140 | Site Reliability Engineering are thinking about the mechanics of writing an 141 | external postmortem may be obvious, it is less obvious why data scientists may 142 | want to think about the value of external postmortems. So here are two reasons 143 | to read this article: 1) As the importance and role of data science grow, the 144 | likelihood that data science decisions and failures will affect customers more 145 | directly and obviously may also grow (e.g. think of the Facebook experiments that the 146 | public has pushed back on) – and thus the value of external postmortems. And 2) 147 | this article has a nice section at the very bottom called “A side note on the 148 | role of luck”, which offers something both wise and rarely found in descriptions 149 | of postmortem write-ups. 150 | 151 | *Winning quotations*: 152 | 153 | * “We have found that, with a combination of automation and practice, we can 154 | produce a shareable version of an internal postmortem with about 10% 155 | additional work, plus internal review.” 156 | * “An internal postmortem assumes the reader has basic knowledge of the 157 | technical and operational background; this is unlikely to be true for your 158 | customer. 
We try to write the least detailed explanation that still allows the 159 | reader to understand why the incident happened; too much detail here is more 160 | likely to be off-putting than helpful.” 161 | 162 | ## Bibliography 163 | *Note about citation formats*: 164 | 165 | * Most citations follow the convention used in the GitHub syllabus, 166 | reverting to a more traditional academic citation format for academic 167 | publications. 168 | * Readings are generally grouped by topic and listed in reverse chronological 169 | order, except for the priority readings which are placed first. 170 | 171 | **General** 172 | 173 | * [What is a Successful Data Analysis?](https://simplystatistics.org/2018/04/17/what-is-a-successful-data-analysis/) 174 | Roger Peng. *Simplystats* (Apr 2018). 175 | * Parker H. (2017) Opinionated analysis development. PeerJ Preprints 5:e3210v1 176 | https://doi.org/10.7287/peerj.preprints.3210v1 177 | * [Why ‘Blameless’ ‘Postmortems’ Can Feel Wrong](https://medium.com/@jpaulreed/why-blameless-postmortems-might-feel-wrong-cbeee00d51b2). 178 | Paul Reed. *Medium* (Aug 2016) 179 | * [It’s Not Your Fault Blameless 180 | post-mortems](https://www.slideshare.net/jhand2/its-not-your-fault-blameless-post-mortems). 181 | Jason Hand. (Jul 2014) 182 | * Dekker, S., Cilliers, P., & Hofmeyr, J-H. (2011) [The complexity of failure: Implications of complexity theory for safety 183 | investigations](https://www.sciencedirect.com/science/article/pii/S0925753511000105). 184 | Safety Science, 49: 939-945. 185 | * Dekker, S. (2002) [Reconstructing human contributions to accidents: the new view on error and performance](https://ac.els-cdn.com/S0022437502000324/1-s2.0-S0022437502000324-main.pdf?_tid=a5bf53cb-4092-4abe-bbfd-ec56bac96588&acdnat=1528328624_9f90fafde4c783a8175f94bce923aced). 186 | Journal of Safety Research, 33: 371-385. 
187 | 188 | **Company case studies** 189 | 190 | * [Fearless shared postmortems – CRE life 191 | lessons](https://cloudplatform.googleblog.com/2017/11/fearless-shared-postmortems-CRE-life-lessons.html). 192 | Adrian Hilton, Gwendolyn Stockman. *Google Cloud Platform Blog* (Nov 2017). 193 | + This blog post also points to other great examples of public postmortems by 194 | [Google Cloud 195 | Platform](https://status.cloud.google.com/incident/compute/16007), 196 | [Gitlab](https://about.gitlab.com/2017/02/10/postmortem-of-database-outage-of-january-31/), 197 | [CloudFlare](https://blog.cloudflare.com/incident-report-on-memory-leak-caused-by-cloudflare-parser-bug/), 198 | and 199 | [Honeycomb.io](https://www.honeycomb.io/blog/2017/10/bitten-by-a-kafka-bug-postmortem/) 200 | * [5 Whys – how we conduct blameless post-mortems after something goes 201 | wrong](http://code.hootsuite.com/blameless-post-mortems/). Noel Pullen. 202 | *Hootsuite Development* (2017) 203 | * [Chapter 15: Postmortem Culture: Learning from 204 | Failure](https://landing.google.com/sre/book/chapters/postmortem-culture.html). 205 | John Lunney, Sue Lueder, edited by Gary O’Connor. (2017) 206 | + [Postmortem culture: how you can learn from failure](https://rework.withgoogle.com/blog/postmortem-culture-how-you-can-learn-from-failure/). 207 | John Lunney, Sue Lueder, Gary O’Connor. *Re:Work* (Apr 2018). 208 | + Provides an [example postmortem](https://landing.google.com/sre/book/chapters/postmortem.html) and 209 | this [postmortem exercise template](https://docs.google.com/document/d/1ob0dfG_gefr_gQ8kbKr0kS4XpaKbc0oVAk4Te9tbDqM/edit) 210 | + [Google: Engineering excellence requires a “blameless post-mortem culture” for fault fixing](https://www.v3.co.uk/v3-uk/news/3013962/google-engineering-excellence-requires-a-blameless-post-mortem-culture-for-fault-fixing). Stuart Sumner. V3 (July 2017). 211 | * [Postmortems at Airbnb](https://medium.com/airbnb-engineering/postmortems-at-airbnb-dde936fd7877). 
212 | Ben Hughes. Medium (Oct 2013). 213 | * [Blameless PostMortems and a Just Culture](https://codeascraft.com/2012/05/22/blameless-postmortems/). John 214 | Allspaw (from Etsy). Code as Craft (May 2012). 215 | + Dekker, S. & Breakey, H. (2016) [‘Just culture:’ Improving safety by achieving substantive, procedural 216 | and restorative justice](https://www.sciencedirect.com/science/article/pii/S0925753516000321). 217 | Safety Science. 85: 187-193. 218 | + [What blameless really means](http://www.jessicaharllee.com/notes/what-blameless-really-means/). 219 | Jessica Harlee (Mar 2014) 220 | 221 | **How to run a postmortem debrief and other postmortem resources** 222 | 223 | * [A collection of postmortem templates](https://github.com/dastergon/postmortem-templates). 224 | Dastergon GitHub repo. 225 | * [What Etsy Does When Things Go Wrong: A 7-Step Guide](https://www.fastcodesign.com/3064726/what-etsy-does-when-things-go-wrong-a-7-step-guide). 226 | John Allspaw, Morgan Evans, Daniel Schauenberg. *Co.Design* (Nov 2016). 227 | + [Etsy github Morgue](https://github.com/etsy/morgue) 228 | + [Practical Postmortems at Etsy](https://www.infoq.com/articles/postmortems-etsy). 229 | Daniel Schauenberg. *InfoQ* (Aug 2015) 230 | * [A Project Postmortem Toolkit: Apps and Approaches that Help You Learn More 231 | from Retrospectives](https://zapier.com/blog/project-retrospective-postmortem/). 232 | Genevieve Conti. *Zapier* (Nov 2015). 233 | * [A Leader’s Guide to After-Action Reviews](http://www.au.af.mil/au/awc/awcgate/army/tc_25-20/tc25-20.pdf). 234 | Headquarters, Department of the Army (Sep 1993). 235 | 236 | **Other** 237 | 238 | * [After a Major Cyberattack, Does the Public Deserve an Explanation?](https://www.nextgov.com/cybersecurity/2018/06/after-major-cyber-attack-does-public-deserve-explanation/148692/) Mitch 239 | Herckis. Nextgov (Jun 4, 2018). 
240 | + [The City of Atlanta should publish a blameless post-mortem of the 241 | ransomware attack](https://www.change.org/p/mayor-keisha-lance-bottoms-the-city-of-atlanta-should-publish-a-blameless-post-mortem-of-the-ransomware-attack?recruiter=867558611&utm_source=share_petition&utm_medium=twitter&utm_campaign=share_petition). *Change.org petition*. (Mar 2018) 242 | * [Tool: Foster psychological safety](https://rework.withgoogle.com/guides/understanding-team-effectiveness/steps/foster-psychological-safety/). *Re:Work* 243 | + [The Head of “X” Explains How To Make Audacity the Path of Least Resistance](https://www.wired.com/2016/04/the-head-of-x-explains-how-to-make-audacity-the-path-of-least-resistance/#.aeio3w645). Astro Teller. *Wired* (Apr 2016). 244 | * [Upserve – Software Engineer](https://jobs.lever.co/upserve/ad9b5e26-3118-430c-aeae-d5331c41a5d3) – mentioned in the job posting directly as part of what a day might look like 245 | 246 | 247 | ****** 248 | 249 | #### Articles about "a case for data literacy" 250 | I didn't go with this topic, but in case this is helpful to anyone...! 251 | 252 | **General:** 253 | 254 | * [Why companies must close the data literacy divide](https://www.forbes.com/sites/brentdykes/2017/03/09/why-companies-must-close-the-data-literacy-divide/#75639da369d9). Brent Dykes. *Forbes* (March 9, 2017). 255 | * [Beyond Data Literacy: Reinventing Community Engagement and Empowerment in the Age of Data](http://datapopalliance.org/wp-content/uploads/2015/11/Beyond-Data-Literacy-2015.pdf). *Data-Pop Alliance* (Oct 2015). 256 | * [Facebook Spawned a Data Crisis. Here’s What We Do Next](https://magenta.as/thank-you-facebook-now-suffer-the-consequences-beee86038439). Michael Horn. *Magenta* (Apr 4, 2018). 257 | * Matthews, P. (2016) [Data literacy conceptions, community capabilities](http://eprints.uwe.ac.uk/30506/1/ci-journal-datalit-matthews-preprintep16.pdf). The Journal of Community Informatics, 12 (3). 
ISSN 1712-4441 Available from: http://eprints.uwe.ac.uk/30506 258 | + Helpful framing: four varieties of data competencies including research (academic), classroom (secondary education), carpentry (practical training), and inclusion (community development). 259 | * [Becoming Data Literate in 3 Simple Steps](http://datajournalismhandbook.org/1.0/en/understanding_data_0.html). Nicolas Kayser-Bril. *Data Journalism Handbook 1.0 Beta* 260 | * [Data Literacy – Quantitative Research Part 2](https://uxknowledgebase.com/data-literacy-quantitative-research-part-2-de07607f1127). Krisztina Szerovay. *Medium* (May 16, 2018). 261 | * [Where is Your Organization on the Marketing Data Literacy Spectrum?](https://medium.com/aimarketingassociation/where-is-your-organization-on-the-marketing-data-literacy-spectrum-b0988740b9e2) Jim Sterne. *Medium* (Apr 9, 2018). 262 | * [Why Data Science and UX Research Teams are Better Together](https://www.mindtheproduct.com/2018/02/data-science-ux-research-teams-better-together/). Julie Stanescu. (Feb 7, 2018). 263 | * [Data literacy: Your data-driven advantage starts with your people](https://www.bdcnetwork.com/blog/data-literacy-your-data-driven-advantage-starts-your-people). Nathan Miller. *Building Design + Construction* (May 24, 2017). 264 | * [Why Data Literacy Matters](https://data36.com/why-data-literacy-matters/). Gabor Papp. *Data36* (Oct 17, 2016). 265 | * [Why We Should All Be Data Literate](http://alistapart.com/article/why-we-should-all-be-data-literate). Dan Turner. *A List Apart* (Sep 20, 2016). 266 | * [Is Design Metrically Opposed?](https://www.uie.com/jared-live/transcripts/Is_Design_Metrically_Opposed.html) Jared Spool. *Transcript of talk, UXIM Salt Lake City* (Apr 2015) 267 | * Martin, Elaine R. (2014). [What is Data Literacy?](https://escholarship.umassmed.edu/cgi/viewcontent.cgi?article=1069&context=jeslib) Journal of eScience Librarianship 3(1): e1069. 
http://dx.doi.org/10.7191/jeslib.2014.1069 268 | * [Data Literacy: Definition, Importance and scope](http://epgp.inflibnet.ac.in/epgpdata/uploads/epgp_content/S000021LI/P001449/M021913/ET/1503055537ModuleID-MIL-10-etext-DataLiteracyDefinition,Importanceandscope.pdf). Anubhuti Yadav. 269 | 270 | **In higher ed:** 271 | 272 | * [Strategies and Best Practices for Data Literacy Education, Knowledge Synthesis Report](http://www.mikesmit.com/wp-content/papercite-data/pdf/data_literacy.pdf). Chantel Ridsdale, James Rothwell, Mike Smit, Hossam Ali-Hassan, Michael Bliemel, Dean Irvine, Daniel Kelley, Stan Matwin, and Brad Wuetherick. *Dalhousie University* (2015). * [The Data Literacy Disruption: It’s Time to Change the Education Mindset](https://medium.com/@sarah_32155/the-data-literacy-disruption-its-time-to-change-the-education-mindset-d0dfcaa0782c). Sarah Nell. *Medium* (April 30, 2018). 273 | * [The importance of data literacy in secondary education](https://edexec.co.uk/the-importance-of-data-literacy-in-secondary-education/). Executive Education. (Mar 14, 2018). 274 | * [The importance of data literacy in higher education](http://edquarter.com/Article/the-importance-of-data-literacy-in-higher-education). Charley Rogers. *edquarter* (Jan 5, 2018). 275 | 276 | **For educators:** 277 | 278 | * [Why data literacy training is important, and how I-TECH helps](https://edscoop.com/why-training-and-data-literacy-are-important-and-how-i-tech-helps). Paige Kowalski. *edscoop* (Oct 30, 2015). 279 | + [State Progress, Data Quality Campaign](https://dataqualitycampaign.org/why-education-data/state-progress). *Data Quality Campaign* (2015). 280 | * [Ethical and appropriate data use requires data literacy](https://datafordecisions.wested.org/wp-content/uploads/2015/03/Kappan-Ethical-and-Appropriate-Data-Use-Requires-Data-Literacy.pdf). Ellen Mandinach, Brennan Parton, Edith Gummer, Rachel Anderson. 
*Data for Decisions, WestEd* (March 2015) 281 | -------------------------------------------------------------------------------- /annotated-bibs/blameless-postmortems.md: -------------------------------------------------------------------------------- 1 | Stats 337, Applied Readings in Data Science (Spring 2018) 2 | ================ 3 | 4 | # Annotated bibliography 5 | 6 | ----- 7 | 8 | ### Theme: Blameless postmortems for data science? 9 | 10 | ### Executive summary 11 | 12 | The idea of postmortems to evaluate failure events has long been 13 | considered an important practice for effective risk management. The idea 14 | of blameless postmortems goes further to emphasize the need to create 15 | and facilitate a postmortem process where participants are incentivized 16 | to provide detailed accounts and analyses of what happened without fear 17 | of punishment. 18 | 19 | My readings for this annotated bibliography were guided by a desire to 20 | learn more about the idea of blameless postmortems, with particular 21 | attention to how they might be implemented for data analysis errors and 22 | data science issues. I’ve outlined a number of potential reasons that 23 | blameless postmortems may not yet have been widely adopted within data 24 | science practices, as well as some potential suggestions for trying to 25 | encourage this practice. 26 | 27 | ### Barriers & Interventions to blameless postmortems for data science 28 | 29 | While the idea of blameless postmortems has been adopted by many 30 | software engineering and DevOps teams (with many referring to Etsy’s 31 | process as a model), it seems that blameless postmortems have not yet 32 | infiltrated standard data science practices. 
33 | 34 | This may be for a number of reasons, such as the following: 35 | 36 | - Data science is a relatively new field and there are not yet set 37 | “standard” data science practices. 38 | - The definition of a data analysis success or failure may be more 39 | nebulous, such that the errors or failures due to data science may 40 | be less clearly identifiable - contrast this with some of the 41 | obvious software engineering failures (e.g. cloud service is 42 | disrupted). 43 | - There may be fewer incentives for small data science teams or data 44 | scientists spread across an organization to prioritize postmortem 45 | processes and learning over efficiency, and/or they may be in a 46 | weaker position to establish a culture of blameless postmortems. 47 | - There is a lack of examples or case studies about blameless 48 | postmortems as applied to data analysis errors – and likewise, there 49 | is a lack of templates for conducting these kinds of postmortems. 50 | - Data scientists may underestimate the degree to which decisions made 51 | as part of data science practices are subject to human bias and 52 | error. 53 | - Changing organizational culture is hard work\! And managing 54 | blameless postmortem processes effectively may sometimes be 55 | delicate, and/or require specific training and practice. These 56 | skills likely lie outside the typical scope of what a data scientist 57 | thinks their job constitutes. 58 | 59 | From this list and from the readings, some thoughts about potential 60 | interventions to accelerate the adoption of blameless postmortems in 61 | data science are the following: 62 | 63 | - Have data scientists create a blameless post-mortem template for 64 | data science failures within their own organization. 
Doing so would 65 | likely catalyze thoughtful and explicit discussions on what data 66 | science success and failure look like, as well as help establish 67 | group norms about what a blameless post-mortem process looks like 68 | before a crisis forces the issue. 69 | - If they already exist elsewhere in the organization, communicate and 70 | learn about the postmortem processes already in place within 71 | software engineering groups, etc. – and use these resources as 72 | potential templates for data science postmortems, or if appropriate, 73 | see if data science postmortems would belong within existing 74 | processes. 75 | - Consider when it would be appropriate and/or beneficial to the 76 | community to make a data science post-mortem public. 77 | - Consider conducting systematic and internally reviewed premortems to 78 | identify potential risks and human biases before embarking on a data 79 | science project; revisit and iterate as necessary as the project 80 | unfolds. 81 | 82 | Any feedback, thoughts, critiques, or additions are welcome\! 83 | 84 | ### Top 3 articles 85 | 86 | **1. [Blameless PostMortems and a Just 87 | Culture](https://codeascraft.com/2012/05/22/blameless-postmortems/). 88 | John** **Allspaw (from Etsy). *Code as Craft* (May 2012).** 89 | 90 | *Why you should read this*: If there were a canon of readings on 91 | blameless postmortems, this article would be in it. The article is 92 | relatively short, but lays out the philosophy behind blameless 93 | postmortems in a cogent and persuasive manner and at a digestible pace – 94 | it’s a great way to quickly get up to speed on the ideas as well as the 95 | actions that blameless postmortems involve. That is, John not only 96 | presents simple explanations of key principles from the literature on 97 | risk management and safety (e.g. from Sidney Dekker), but also lays out 98 | concrete steps that Etsy takes to implement these ideas. 
And it seems 99 | like everyone writing about blameless postmortems links to this article… 100 | so don’t be out of the loop\! 101 | 102 | *Winning Quotations*: 103 | 104 | - “So technically, engineers are not at all “off the hook” with a 105 | blameless PostMortem process. They are very much on the hook for 106 | helping Etsy become safer and more resilient, in the end.” 107 | - “We enable and encourage people who do make mistakes to be the 108 | experts on educating the rest of the organization how not to make 109 | them in the future.” 110 | 111 | **2. [What is a Successful Data 112 | Analysis?](https://simplystatistics.org/2018/04/17/what-is-a-successful-data-analysis/)** 113 | **Roger Peng. *Simplystats* (Apr 2018).** 114 | 115 | *Why you should read this*: Maybe this article should come first, 116 | because fundamental to the question of postmortems for data science is 117 | the question: what does data analysis failure look like? What metrics do 118 | we use to identify it when we see it? 119 | 120 | This article is a great entry into these questions – you’ll inevitably 121 | push your thinking by observing your own reactions and thoughts in 122 | response to Roger’s proposed definition, which he suggests might be 123 | unsettling (or not\!). 124 | 125 | In terms of content: Roger presents a framework with which to think 126 | about the question of success in data analysis, and contrasts his ideas 127 | about “acceptance” and “audience” to other notions such as that of using 128 | internal and external validity as a measure of successful data analysis. 129 | He also brings two critical yet underappreciated points into the 130 | conversation: 1) the importance of considering the context in which an 131 | analysis is performed when trying to evaluate what analysis is 132 | appropriate; and 2) that human nature plays a big role in defining the 133 | success of data analysis. 
134 | 135 | *Winning quotations*: 136 | 137 | - “Success depends on human beings, unfortunately, and this is 138 | something analysts must be prepared to deal with.” 139 | - “When an audience is upset by a data analysis, and they are being 140 | honest, they are usually upset with the chosen narrative, not with 141 | the facts per se.” 142 | 143 | **3. [Fearless shared postmortems – CRE life 144 | lessons](https://cloudplatform.googleblog.com/2017/11/fearless-shared-postmortems-CRE-life-lessons.html).** 145 | **Adrian Hilton, Gwendolyn Stockman. *Google Cloud Platform Blog* (Nov 146 | 2017).** 147 | 148 | *Why you should read this*: This is a bit of an oddball reading 149 | suggestion (so maybe that’s reason enough\!). While the reason that 150 | teams in Google’s Site Reliability Engineering are thinking about 151 | the mechanics of writing an external postmortem may be obvious, it is 152 | less obvious why data scientists may want to think about the value of 153 | external postmortems. So here are two reasons to read this article: 1) 154 | As the importance and role of data science grow, the likelihood that 155 | data science decisions and failures will affect customers more directly 156 | and obviously may also grow (e.g. think of the Facebook experiments that the 157 | public has pushed back on) – and thus the value of external postmortems. 158 | And 2) this article has a nice section at the very bottom called “A side 159 | note on the role of luck”, which offers something both wise and rarely 160 | found in descriptions of postmortem write-ups. 161 | 162 | *Winning quotations*: 163 | 164 | - “We have found that, with a combination of automation and practice, 165 | we can produce a shareable version of an internal postmortem with 166 | about 10% additional work, plus internal review.” 167 | - “An internal postmortem assumes the reader has basic knowledge of 168 | the technical and operational background; this is unlikely to be 169 | true for your customer. 
We try to write the least detailed 170 | explanation that still allows the reader to understand why the 171 | incident happened; too much detail here is more likely to be 172 | off-putting than helpful.” 173 | 174 | ## Bibliography 175 | 176 | *Note about citation formats*: 177 | 178 | - Most citations follow the convention used in the GitHub syllabus, 179 | reverting to a more traditional academic citation format for 180 | academic publications. 181 | - Readings are generally grouped by topic and listed in reverse 182 | chronological order, except for the priority readings which are 183 | placed first. 184 | 185 | **General** 186 | 187 | - [What is a Successful Data 188 | Analysis?](https://simplystatistics.org/2018/04/17/what-is-a-successful-data-analysis/) 189 | Roger Peng. *Simplystats* (Apr 2018). 190 | - Parker H. (2017) Opinionated analysis development. PeerJ Preprints 191 | 5:e3210v1 192 | - [Why ‘Blameless’ ‘Postmortems’ Can Feel 193 | Wrong](https://medium.com/@jpaulreed/why-blameless-postmortems-might-feel-wrong-cbeee00d51b2). 194 | Paul Reed. *Medium* (Aug 2016) 195 | - [It’s Not Your Fault Blameless 196 | post-mortems](https://www.slideshare.net/jhand2/its-not-your-fault-blameless-post-mortems). 197 | Jason Hand. (Jul 2014) 198 | - Dekker, S., Paul, C., & Hofmeyr, J-H. (2011) [The complexity of 199 | failure: Implications of complexity theory for safety 200 | investigations](https://www.sciencedirect.com/science/article/pii/S0925753511000105). 201 | Safety Science, 49: 939-945. 202 | - Dekker, S. (2002) [Reconstructing human contributions to accidents: 203 | the new view on error and 204 | performance](https://ac.els-cdn.com/S0022437502000324/1-s2.0-S0022437502000324-main.pdf?_tid=a5bf53cb-4092-4a%20be-bbfd-ec56bac96588&acdnat=1528328624_9f90fafde4c783a8175f94bce923aced). 205 | Journal of Safety Research, 33: 371-385. 
206 | 207 | **Company case studies** 208 | 209 | - [Fearless shared postmortems – CRE life 210 | lessons](https://cloudplatform.googleblog.com/2017/11/fearless-shared-postmortems-CRE-life-lessons.html). 211 | Adrian Hilton, Gwendolyn Stockman. *Google Cloud Platform Blog* (Nov 212 | 2017). 213 | - This blog post also points to other great examples of public 214 | postmortems by [Google Cloud 215 | Platform](https://status.cloud.google.com/incident/compute/16007), 216 | [Gitlab](https://about.gitlab.com/2017/02/10/postmortem-of-database-outage-of-january-31/), 217 | [CloudFlare](https://blog.cloudflare.com/incident-report-on-memory-leak-caused-by-cloudflare-parser-bug/), 218 | and 219 | [Honeycomb.io](https://www.honeycomb.io/blog/2017/10/bitten-by-a-kafka-bug-postmortem/) 220 | - [5 Whys – how we conduct blameless post-mortems after something goes 221 | wrong](http://code.hootsuite.com/blameless-post-mortems/). Noel 222 | Pullen. *Hootsuite Development* (2017) 223 | - [Chapter 15: Postmortem Culture: Learning from 224 | Failure](https://landing.google.com/sre/book/chapters/postmortem-culture.html). 225 | John Lunney, Sue Lueder, edited by Gary O’Connor. (2017) 226 | - [Postmortem culture: how you can learn from 227 | failure](https://rework.withgoogle.com/blog/postmortem-culture-how-you-can-learn-from-failure/). 228 | John Lunney, Sue Lueder, Gary O’Connor. *Re:Work* (Apr 2018). 229 | - Provides an [example 230 | postmortem](https://landing.google.com/sre/book/chapters/postmortem.html) 231 | and this [postmortem exercise 232 | template](https://docs.google.com/document/d/1ob0dfG_gefr_gQ8kbKr0kS4XpaKbc0oVAk4Te9tbDqM/edit) 233 | - [Google: Engineering excellence requires a “blameless 234 | post-mortem culture” for fault 235 | fixing](https://www.v3.co.uk/v3-uk/news/3013962/google-engineering-excellence-requires-a-blameless-post-mortem-culture-fo%20r-fault-fixing). 236 | Stuart Sumner. V3 (July 2017).
237 | - [Postmortems at 238 | Airbnb](https://medium.com/airbnb-engineering/postmortems-at-airbnb-dde936fd7877). 239 | Ben Hughes. Medium (Oct 2013). 240 | - [Blameless PostMortems and a Just 241 | Culture](https://codeascraft.com/2012/05/22/blameless-postmortems/). 242 | John Allspaw (from Etsy). Code as Craft (May 2012). 243 | - Dekker, S. & Breakey, H. (2016) [‘Just culture:’ Improving 244 | safety by achieving substantive, procedural and restorative 245 | justice](https://www.sciencedirect.com/science/article/pii/S0925753516000321). 246 | Safety Science. 85: 187-193. 247 | - [What blameless really 248 | means](http://www.jessicaharllee.com/notes/what-blameless-really-means/). 249 | Jessica Harlee (Mar 2014) 250 | 251 | **How to run a postmortem debrief and other postmortem resources** 252 | 253 | - [A collection of postmortem 254 | templates](https://github.com/dastergon/postmortem-templates). 255 | Dastergon GitHub repo. 256 | - [What Etsy Does When Things Go Wrong: A 7-Step 257 | Guide](https://www.fastcodesign.com/3064726/what-etsy-does-when-things-go-wrong-a-7-step-guide). 258 | John Allspaw, Morgan Evans, Daniel Schauenberg. *Co.Design* (Nov 259 | 2016). 260 | - [Etsy github Morgue](https://github.com/etsy/morgue) 261 | - [Practical Postmortems at 262 | Etsy](https://www.infoq.com/articles/postmortems-etsy). Daniel 263 | Schauenberg. *InfoQ* (Aug 2015) 264 | - [A Project Postmortem Toolkit: Apps and Approaches that Help You 265 | Learn More from 266 | Retrospectives](https://zapier.com/blog/project-retrospective-postmortem/). 267 | Genevieve Conti. *Zapier* (Nov 2015). 268 | - [A Leader’s Guide to After-Action 269 | Reviews](http://www.au.af.mil/au/awc/awcgate/army/tc_25-20/tc25-20.pdf). 270 | Headquarters, Department of the Army (Sep 1993). 
271 | 272 | **Other** 273 | 274 | - [After a Major Cyberattack, Does the Public Deserve an 275 | Explanation?](https://www.nextgov.com/cybersecurity/2018/06/after-major-cyber-attack-does-public-deserve-explanation/148692/) 276 | Mitch Herckis. Nextgov (Jun 4, 2018). 277 | - [The City of Atlanta should publish a blameless post-mortem of 278 | the ransomware 279 | attack](https://www.change.org/p/mayor-keisha-lance-bottoms-the-city-of-atlanta-should-publish-a-blameless-post-mortem-of-the-ransomware-attack?recruiter=867558611&utm_source=share_petition&utm_medium=twitter&utm_campaign=share_petition). 280 | *Change.org petition*. (Mar 2018) 281 | - [Tool: Foster psychological 282 | safety](https://rework.withgoogle.com/guides/understanding-team-effectiveness/steps/foster-psychological-safety/). 283 | *Re:Work* 284 | - [The Head of “X” Explains How To Make Audacity the Path of Least 285 | Resistance](https://www.wired.com/2016/04/the-head-of-x-explains-how-to-make-audacity-the-path-of-least-resistance/#.aeio3w645). 286 | Astro Teller. *Wired* (Apr 2016). 287 | - [Upserve – Software 288 | Engineer](https://jobs.lever.co/upserve/ad9b5e26-3118-430c-aeae-d5331c41a5d3) 289 | – mentioned in the job posting directly as part of what a day might 290 | look like 291 | 292 | ----- 293 | 294 | #### Articles about “a case for data literacy” 295 | 296 | I didn’t go with this topic, but in case this is helpful to anyone…\! 297 | 298 | **General:** 299 | 300 | - [Why companies must close the data literacy 301 | divide](https://www.forbes.com/sites/brentdykes/2017/03/09/why-companies-must-close-the-data-literacy-divide/#75639da369d9). 302 | Brent Dykes. *Forbes* (March 9, 2017). 303 | - [Beyond Data Literacy: Reinventing Community Engagement and 304 | Empowerment in the Age of 305 | Data](http://datapopalliance.org/wp-content/uploads/2015/11/Beyond-Data-Literacy-2015.pdf). 306 | *Data-Pop Alliance* (Oct 2015). 307 | - [Facebook Spawned a Data Crisis. 
Here’s What We Do 308 | Next](https://magenta.as/thank-you-facebook-now-suffer-the-consequences-beee86038439). 309 | Michael Horn. *Magenta* (Apr 4, 2018). 310 | - Matthews, P. (2016) [Data literacy conceptions, community 311 | capabilities](http://eprints.uwe.ac.uk/30506/1/ci-journal-datalit-matthews-preprintep16.pdf). 312 | The Journal of Community Informatics, 12 (3). ISSN 1712-4441 313 | Available from: 314 | - Helpful framing: four varieties of data competencies including 315 | research (academic), classroom (secondary education), carpentry 316 | (practical training), and inclusion (community development). 317 | - [Becoming Data Literate in 3 Simple 318 | Steps](http://datajournalismhandbook.org/1.0/en/understanding_data_0.html). 319 | Nicolas Kayser-Bril. *Data Journalism Handbook 1.0 Beta* 320 | - [Data Literacy – Quantitative Research 321 | Part 2](https://uxknowledgebase.com/data-literacy-quantitative-research-part-2-de07607f1127). 322 | Krisztina Szerovay. *Medium* (May 16, 2018). 323 | - [Where is Your Organization on the Marketing Data Literacy 324 | Spectrum?](https://medium.com/aimarketingassociation/where-is-your-organization-on-the-marketing-data-literacy-spectrum-b0988740b9e2) 325 | Jim Sterne. *Medium* (Apr 9, 2018). 326 | - [Why Data Science and UX Research Teams are Better 327 | Together](https://www.mindtheproduct.com/2018/02/data-science-ux-research-teams-better-together/). 328 | Julie Stanescu. (Feb 7, 2018). 329 | - [Data literacy: Your data-driven advantage starts with your 330 | people](https://www.bdcnetwork.com/blog/data-literacy-your-data-driven-advantage-starts-your-people). 331 | Nathan Miller. *Building Design + Construction* (May 24, 2017). 332 | - [Why Data Literacy 333 | Matters](https://data36.com/why-data-literacy-matters/). Gabor Papp. 334 | *Data36* (Oct 17, 2016). 335 | - [Why We Should All Be Data 336 | Literate](http://alistapart.com/article/why-we-should-all-be-data-literate). 337 | Dan Turner.
*A List Apart* (Sep 20, 2016). 338 | - [Is Design Metrically 339 | Opposed?](https://www.uie.com/jared-live/transcripts/Is_Design_Metrically_Opposed.html) 340 | Jared Spool. *Transcript of talk, UXIM Salt Lake City* (Apr 2015) 341 | - Martin, Elaine R. (2014). [What is Data 342 | Literacy?](https://escholarship.umassmed.edu/cgi/viewcontent.cgi?article=1069&context=jeslib) 343 | Journal of eScience Librarianship 3(1): e1069. 344 | jeslib.2014.1069 345 | - [Data Literacy: Definition, Importance and 346 | scope](http://epgp.inflibnet.ac.in/epgpdata/uploads/epgp_content/S000021LI/P001449/M021913/ET/1503055537ModuleID-MIL-10-etext-DataLiteracyDefinition,Importanceandscope.pdf). 347 | Anubhuti Yadav. 348 | 349 | **In higher ed:** 350 | 351 | - [Strategies and Best Practices for Data Literacy Education, 352 | Knowledge Synthesis 353 | Report](http://www.mikesmit.com/wp-content/papercite-data/pdf/data_literacy.pdf). 354 | Chantel Ridsdale, James Rothwell, Mike Smit, Hossam Ali-Hassan, 355 | Michael Bliemel, Dean Irvine, Daniel Kelley, Stan Matwin, and Brad 356 | Wuetherick. *Dalhousie University* (2015). \* [The Data Literacy 357 | Disruption: It’s Time to Change the Education 358 | Mindset](https://medium.com/@sarah_32155/the-data-literacy-disruption-its-time-to-change-the-education-mindset-d0dfcaa0782c). 359 | Sarah Nell. *Medium* (April 30, 2018). 360 | - [The importance of data literacy in secondary 361 | education](https://edexec.co.uk/the-importance-of-data-literacy-in-secondary-education/). 362 | Executive Education. (Mar 14, 2018). 363 | - [The importance of data literacy in higher 364 | education](http://edquarter.com/Article/the-importance-of-data-literacy-in-higher-education). 365 | Charley Rogers. *edquarter* (Jan 5, 2018). 366 | 367 | **For educators:** 368 | 369 | - [Why data literacy training is important, and how I-TECH 370 | helps](https://edscoop.com/why-training-and-data-literacy-are-important-and-how-i-tech-helps). 371 | Paige Kowalski. 
*edscoop* (Oct 30, 2015). 372 | - [State Progress, Data Quality 373 | Campaign](https://dataqualitycampaign.org/why-education-data/state-progress). 374 | *Data Quality Campaign* (2015). 375 | - [Ethical and appropriate data use requires data 376 | literacy](https://datafordecisions.wested.org/wp-content/uploads/2015/03/Kappan-Ethical-and-Appropriate-Data-Use-Requires-Data-Literacy.pdf). 377 | Ellen Mandinach, Brennan Parton, Edith Gummer, Rachel Anderson. 378 | *Data for Decisions, WestEd* (March 2015) 379 | -------------------------------------------------------------------------------- /annotated-bibs/cogsci.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hadley/stats337/c285022b83e4fbc58e86e3eec2bb08c359475368/annotated-bibs/cogsci.pdf -------------------------------------------------------------------------------- /annotated-bibs/communication.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hadley/stats337/c285022b83e4fbc58e86e3eec2bb08c359475368/annotated-bibs/communication.pdf -------------------------------------------------------------------------------- /annotated-bibs/diversity.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hadley/stats337/c285022b83e4fbc58e86e3eec2bb08c359475368/annotated-bibs/diversity.pdf -------------------------------------------------------------------------------- /annotated-bibs/ds-academic-field.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hadley/stats337/c285022b83e4fbc58e86e3eec2bb08c359475368/annotated-bibs/ds-academic-field.pdf -------------------------------------------------------------------------------- /annotated-bibs/ethics.pdf: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/hadley/stats337/c285022b83e4fbc58e86e3eec2bb08c359475368/annotated-bibs/ethics.pdf -------------------------------------------------------------------------------- /annotated-bibs/graphical-advice.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hadley/stats337/c285022b83e4fbc58e86e3eec2bb08c359475368/annotated-bibs/graphical-advice.pdf -------------------------------------------------------------------------------- /annotated-bibs/modern-medicine.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hadley/stats337/c285022b83e4fbc58e86e3eec2bb08c359475368/annotated-bibs/modern-medicine.pdf -------------------------------------------------------------------------------- /annotated-bibs/sharing-analyses.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hadley/stats337/c285022b83e4fbc58e86e3eec2bb08c359475368/annotated-bibs/sharing-analyses.pdf -------------------------------------------------------------------------------- /annotated-bibs/tailored-elearning.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hadley/stats337/c285022b83e4fbc58e86e3eec2bb08c359475368/annotated-bibs/tailored-elearning.pdf -------------------------------------------------------------------------------- /annotated-bibs/teaching.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hadley/stats337/c285022b83e4fbc58e86e3eec2bb08c359475368/annotated-bibs/teaching.pdf -------------------------------------------------------------------------------- /applicants/accept-email.md: -------------------------------------------------------------------------------- 1 | Hi! 
2 | 3 | Thanks for applying to take Stats337: I'd love to have you in my class. I appreciate your interest knowing very little about the class, and I now have a syllabus with more information available: https://github.com/hadley/stats337#readings. 4 | 5 | Please read the syllabus through, and if you do still want to take the class: 6 | 7 | 1. Register using permission number: {num} 8 | 9 | 2. Fill out https://goo.gl/forms/Dsz286pUg3msNm2g2: I'll share this 10 | only within the class so we can learn each others' names. (Let me 11 | know if you don't want to share your information this way and we 12 | can make alternative arrangements) 13 | 14 | 3. Prepare for class by reading https://m.signalvnoise.com/c6371885f7f6: 15 | this should only take a couple of minutes. If you're bored this week 16 | feel free to read the optional readings for week 1. 17 | 18 | If you can no longer take the class, no hard feelings, just let me know 19 | as soon as possible so I can pass on your slot to someone else. 20 | 21 | Thanks, and looking forward to meeting you next week! 
22 | 23 | Hadley 24 | -------------------------------------------------------------------------------- /applicants/mail-merge.R: -------------------------------------------------------------------------------- 1 | library(googlesheets) 2 | library(gmailr) 3 | library(glue) 4 | library(purrr) 5 | 6 | sheet <- gs_key("1sAUqRxH1n5im8YtJiPY6IBSHDtPhF8EL9LmeH9K6Y14", visibility = "private") 7 | applicants <- gs_read(sheet) 8 | 9 | email <- applicants$`Email Address` 10 | num <- applicants$`Permission number` 11 | decision <- applicants$Decision 12 | 13 | 14 | # Accept ------------------------------------------------------------------ 15 | 16 | accept_template <- paste(readLines("applicants/accept-email.md"), collapse = "\n") 17 | 18 | accept_send <- function(email, num) { 19 | body <- glue(accept_template, num = num) 20 | 21 | message <- mime() %>% 22 | from("h.wickham@gmail.com") %>% 23 | to(email) %>% 24 | subject("Stats337") %>% 25 | text_body(body) 26 | 27 | message %>% create_draft() 28 | 29 | email 30 | } 31 | 32 | sent <- character() 33 | to_send <- decision == "Accept" & !is.na(decision) & !(email %in% sent) 34 | sent <- map2_chr(email[to_send], num[to_send], possibly(accept_send, NA_character_)) 35 | 36 | 37 | # Fail to accept ---------------------------------------------------------- 38 | 39 | pass_template <- paste(readLines("applicants/pass-email.md"), collapse = "\n") 40 | 41 | pass_send <- function(email) { 42 | message <- mime() %>% 43 | from("h.wickham@gmail.com") %>% 44 | to(email) %>% 45 | subject("Stats337") %>% 46 | text_body(pass_template) 47 | 48 | message %>% create_draft() 49 | 50 | email 51 | } 52 | 53 | sent <- character() 54 | to_send <- decision == "Pass" & !is.na(decision) & !(email %in% sent) 55 | sent <- map_chr(email[to_send], possibly(pass_send, NA_character_)) 56 | -------------------------------------------------------------------------------- /applicants/pass-email.md:
-------------------------------------------------------------------------------- 1 | Hi! 2 | 3 | Thanks for applying to take Stats337. I appreciate your enthusiasm for the class, but unfortunately the class was oversubscribed and I could not accept everyone who applied. 4 | 5 | Hadley 6 | 7 | PS. The syllabus is available at https://github.com/hadley/stats337#readings if you want to do the readings in your own time 8 | -------------------------------------------------------------------------------- /course-feedback.md: -------------------------------------------------------------------------------- 1 | --- 2 | title: 'What could I improve next time?' 3 | --- 4 | 5 | # Content 6 | 7 | Some way to summarize the "best practices" or top thoughts at the end of each class was appreciated (we sometimes did this through the whiteboard summaries, etc). (+2) 8 | 9 | Sometimes the classes felt like they ended on the note of... well, things are complex! Without necessarily a sense of how to move forward on it (if we were embedded within an organization, etc). (+2) 10 | 11 | I think that the range of topics covered provided a pretty good overview of data science, what it is, the roles of data scientists in various industries, typical work responsibilities and workflows, and what tools exist. (+1) 12 | 13 | It would be nice to get Hadley's thoughts on each of the sets of readings for maybe 10 - 15 minutes sometime in class. (+3) 14 | 15 | Definitely more experts. I wish there was more insight from those who understand and know the field well (Hadley, Eduardo, etc.) (+2) 16 | 17 | I really liked the presentation by Ed from Facebook (+9), and it would've been cool to have other presenters or maybe other speaking events outside of class if we want to spend more time actually discussing the readings in class. (or to be able to ask the paper / blog authors questions) 18 | 19 | - One thing that was great about Ed's presentation was that he gave a lot of concrete examples (e.g.
Examples of specific questions he asked when interviewing at Facebook; Concrete example of what a career path at Facebook looks like) 20 | 21 | One possibility is to have students bring in a relevant article that they liked and share it with the class. 22 | 23 | Hearing more expert opinions would be great! Whether it's Hadley speaking on his thoughts, bringing in more industry speakers (the FB lecture was great!), or having Stanford faculty speak. (+1 having current faculty who are thinking about these topics would've been great to have in class to hear opinions from) 24 | 25 | Less google docs. 26 | 27 | The DevOps and software engineering sessions had a lot of overlapping content. 28 | 29 | - Also felt like content in many weeks was overlapping 30 | 31 | I would have liked a working definition for "data science" at the beginning of the quarter, so that even if we disagree on what it is, we can be referring to the same thing throughout the quarter. 32 | 33 | - Might have been interesting to get our naive definitions during class one; and compare that to a definition at the end of the class 34 | 35 | I would appreciate if we could spend more time on data structure, such as relational databases. (+1 seems data engineering practices are important to know) 36 | 37 | Case studies of really well-done and poorly done analysis might have been nice rather than high-level overviews of process suggestions. (+4) 38 | 39 | Did we discuss everything in the course overview? I feel like we didn't spend any time on visualization for explanation and exploration, which would have been great to cover with Hadley. 40 | 41 | # Organisation 42 | 43 | I think a group project would have been fun. 44 | 45 | I found the various activities to be very useful for idea generation, but they were less effective at distilling and refining ideas. It would be useful to have a few activities where the final product is some nice, compact artifact. (+4) 46 | 47 | - I actually wouldn't want this. 
I don't know if it's useful to create a compact artifact for an entire class of people, and it seems like it would take a lot of time to distill the thoughts of 20 different people into a single thing. Maybe instead of having us write the responses before each class, we could write them after, so that we could distill our own ideas after the discussion. 48 | 49 | I would have liked more concrete discussions. Perhaps the  in-class prompts could be made more specific. 50 | 51 | I liked some of the interactive class discussions, but would have liked more of an "intro" and "summary" from Hadley (+3). I still feel that I got the range of opinions and thoughts from people not at the center of some of the debates. In this context, I think that the learning curve plateaued for me and I got more out of the readings than the class itself. 52 | 53 | I found the response papers to be quite helpful as a part of the organization of the course, and was happy with them being quite nebulous, and not really that highly criticized in terms of assessment. They helped me synthesize my thoughts in ways that I would not have done if I were only reading the papers. (+4) 54 | 55 | I've saved or bookmarked some of the readings for future reference as I thought that some were very good at discussing and summarizing a particular aspect. 56 | 57 | # Assessment 58 | 59 | It might have been useful to turn the weekly responses in earlier, so you could tailor discussion based on students' interest. Maybe just have a discussion board (+1) on which we all respond? Then we may have avoided some of the less fruitful discussions. 60 | 61 | For the response papers, perhaps the default guideline should be to pick just one article and discuss it, rather than going through all the articles. 62 | 63 | More clear feedback on responses would have been helpful to know how to improve from a check to a check plus. 
But might suggest re-thinking the weekly response paper in favor of discussion board / question posting option. 64 | 65 | - I preferred the response paper to an online discussion board. (+2) 66 | 67 | It would be great if we could turn in the reflections online both for (a) greater ease of submission/assessment and (b) we could view each other's reviews. (+3) (+3 mainly printing was such a pain) 68 | 69 | Would much rather post questions / comments to a group discussion board or Canvas than write response papers. (+2) 70 | 71 | A concrete question prompt rather than an open-ended reflection would have been nice. (I actually liked the open-ended reflection. I may have enjoyed the responses less if I couldn't focus on a tangential topic of personal interest) 72 | 73 | Annotated bibliography seems rushed? Maybe if we did it in two stages (develop idea, preliminary reading) earlier in the quarter, then wrote up a couple pages on it? 74 | 75 | Drop the annotated bibliography, or give it a clear purpose. (+1 yes - I would have liked some feedback from other people about what kinds of questions would be of greater interest to not just me; want to make it useful for more than just a weird curiosity) 76 | 77 | An option would be to have an opinion of a data analysis made by us in the past. 78 | 79 | # Other 80 | 81 | Working with a few examples of real code would have been interesting. Perhaps a peer code review or a presentation of some good code by Hadley. (+5) Code from Hadley! 82 | 83 | One of my favourite moments of the whole class was early on when we traded our own practices for workflow, reading, etc. (+1) More of that would have been great (i.e. process things that are super helpful but rarely come up in class or with other students) 84 | 85 | Perhaps some of the tools and workflow discussions could have been illustrated by talking through some real-life examples of projects that might come up for the people in the room.
For example: "Imagine you're tasked with doing XYZ for a company that does ABC, how would you go about doing that?" (+1 a few more concrete discussions!) 86 | 87 | Any thoughts on maintaining some sort of post-class community? (e.g. e-mail list, etc)? 88 | 89 | It would have been cool to get ".rprofiles" of some of the data scientists Hadley knows, with more nuggets of personal experience mixed in. (+2) 90 | 91 | A clear goal/expected learning outcomes for the course from the start. 92 | 93 | # Hadley's takeaways 94 | 95 | Emphasise ambiguity and why I'm not talking 96 | 97 | More history/context? Data science is very new - conventions less established. 98 | 99 | Start with Tukey's data science + statistics. 100 | 101 | More concrete data science process --- case studies + examples. 102 | 103 | More sharing from discipline on experience. 104 | 105 | Annotated bibliography should be multiple assessment. 106 | 107 | Online discussion so time in class for other activities? 108 | -------------------------------------------------------------------------------- /stats337.Rproj: -------------------------------------------------------------------------------- 1 | Version: 1.0 2 | 3 | RestoreWorkspace: No 4 | SaveWorkspace: No 5 | AlwaysSaveHistory: Default 6 | 7 | EnableCodeIndexing: Yes 8 | UseSpacesForTab: Yes 9 | NumSpacesForTab: 2 10 | Encoding: UTF-8 11 | 12 | RnwWeave: knitr 13 | LaTeX: XeLaTeX 14 | 15 | AutoAppendNewline: Yes 16 | StripTrailingWhitespace: Yes 17 | 18 | BuildType: Package 19 | PackageUseDevtools: Yes 20 | PackageInstallArgs: --no-multiarch --with-keep.source 21 | PackageRoxygenize: rd,collate,namespace 22 | -------------------------------------------------------------------------------- /week-01/week-01-1.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hadley/stats337/c285022b83e4fbc58e86e3eec2bb08c359475368/week-01/week-01-1.jpg 
-------------------------------------------------------------------------------- /week-01/week-01-2.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hadley/stats337/c285022b83e4fbc58e86e3eec2bb08c359475368/week-01/week-01-2.jpg -------------------------------------------------------------------------------- /week-01/week-01-3.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hadley/stats337/c285022b83e4fbc58e86e3eec2bb08c359475368/week-01/week-01-3.jpg -------------------------------------------------------------------------------- /week-01/week-01-4.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hadley/stats337/c285022b83e4fbc58e86e3eec2bb08c359475368/week-01/week-01-4.jpg -------------------------------------------------------------------------------- /week-01/week-01.key: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hadley/stats337/c285022b83e4fbc58e86e3eec2bb08c359475368/week-01/week-01.key -------------------------------------------------------------------------------- /week-01/week-01.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hadley/stats337/c285022b83e4fbc58e86e3eec2bb08c359475368/week-01/week-01.pdf -------------------------------------------------------------------------------- /week-02/01-most-important.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hadley/stats337/c285022b83e4fbc58e86e3eec2bb08c359475368/week-02/01-most-important.jpg -------------------------------------------------------------------------------- /week-02/02-summary.jpg: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/hadley/stats337/c285022b83e4fbc58e86e3eec2bb08c359475368/week-02/02-summary.jpg -------------------------------------------------------------------------------- /week-02/03-questions.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hadley/stats337/c285022b83e4fbc58e86e3eec2bb08c359475368/week-02/03-questions.jpg -------------------------------------------------------------------------------- /week-02/04-reading-process.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hadley/stats337/c285022b83e4fbc58e86e3eec2bb08c359475368/week-02/04-reading-process.jpg -------------------------------------------------------------------------------- /week-02/README.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | output: github_document 3 | --- 4 | 5 | ```{r, results = "asis", echo = FALSE} 6 | fig_files <- dir(pattern = "jpg$") 7 | cat(paste0("* ", fig_files, " ![](", fig_files, ")\n")) 8 | ``` 9 | -------------------------------------------------------------------------------- /week-02/README.md: -------------------------------------------------------------------------------- 1 | 2 | - 01-most-important.jpg ![](01-most-important.jpg) 3 | - 02-summary.jpg ![](02-summary.jpg) 4 | - 03-questions.jpg ![](03-questions.jpg) 5 | - 04-reading-process.jpg ![](04-reading-process.jpg) 6 | -------------------------------------------------------------------------------- /week-03/google-doc.md: -------------------------------------------------------------------------------- 1 | --- 2 | title: 'Stats337: Software engineering' 3 | --- 4 | 5 | # Working well with others 6 | 7 | ## How do I get senior advisors on board with using these tools / best practices? 
(SZ) 8 | 9 | - Investigate existing practices and identify areas that could benefit from automation 10 | - Then create some sort of quantitative metric for time and resources saved by adopting version control tools 11 | - - Maybe show them the results... (For example, showing them the output of profvis when you talk about how you went about optimizing your code.) 12 | - Show them the article? 13 | - Organize a meeting/skillshare to make the team transition seamless(ish) 14 | - Seems like there is room for "transition" tools... i.e. tools that let you use these tools and, in parallel, generate more traditional formats to share (like natural gas as a transition fuel out of fossil fuels) 15 | - There's a good paper by Adam Grant on how to get organizations to collaborate on quasi-experiments. Maybe this can be applied to version control too: 16 | 17 | 18 | 19 | -   20 | - (Page 674-678) 21 | 22 | ## How do you get collaborators on board with tools like GitHub? 23 | 24 | In cases where there are significant barriers to collaborator buy-in (e.g., working with government, where individuals might not even be authorized to use platforms like GitHub), is it worth using these kinds of tools for collaboration? (Autumn) 25 | 26 | - Frame it as time saving / cost cutting / in terms that your superiors will understand. 27 | - Often not worth the effort (if it is just for document editing, much easier to just use emails -Aaron) 28 | - If I'm the project statistician, then I think it could be relatively straightforward, because my collaborators will care more about the output I'm producing than the tools and code I'm writing. So if they can see the output on GitHub, they'll probably be happy. 29 | 30 | # Teaching/learning 31 | 32 | ## What is the best way to learn best practices in a classroom setting? (Emily) 33 | 34 | - Team teaching / small groups work well. 35 | - Forcing yourself to use it on a project, perhaps one of smaller or medium stakes, has been the best way for me. 
36 | - Seeing other students' work and asking those who did something you admire. 37 | - Beyond the basics, coaching small teams 38 | - Seeing other people's code makes people comfortable and opens up new approaches 39 | 40 | ## 41 | 42 | ## How to move an existing course to GitHub 43 | 44 | (I've got my R Markdown files, homework assignments in LaTeX, and some data sets in Excel, csv, and txt.) Where do I best start transitioning stuff? 45 | 46 | - Bill mentioned that Hadley might have some good resources on this 47 | 48 | ## 49 | 50 | ## What are people's best practices for how you stay on top of new developments? (can I second this?)  - specific examples / recs? 51 | 52 | - RStudio newsletter 53 | - Twitter 54 | 55 | 56 | 57 | - Any good accounts to follow? 58 | 59 | 60 | 61 | - \@rstudiotips 62 | - \@hadleywickham (+1) 63 | - \@treycausey 64 | - \@vsbuffalo 65 | - \@jakevdp 66 | - \@randal\_olson 67 | - \@jeremystan 68 | - \@luispedrocoelho 69 | - \@seantaylor 70 | - \@dataandme 71 | 72 | 73 | 74 | - How to prune? 75 | 76 | 77 | 78 | - code mentors (+4) 79 | 80 | 81 | 82 | - How active is this mentorship? 83 | 84 | 85 | 86 | - Stanford Stats 337 87 | - Reading the scientific computing literature/blogs/news websites 88 | -  --- particularly the weekly highlights 89 | - [Rbloggers](https://www.r-bloggers.com) 90 | 91 | ## How to learn new features of git incrementally? 92 | 93 | - https://www.atlassian.com/git/tutorials 94 | 95 | 96 | 97 | - Maybe as they come up? 98 | 99 | I've heard Make mentioned a few times - any resources on learning this? 100 | 101 | ## \^Relatedly, how do you implement widespread education of best practices for data science across disciplines (not just establishing opportunities but convincing people to take advantage of them)? (Autumn) 102 | 103 | - I think it always has to start with showing people how much value it can bring to them (cost, time savings, etc.). 104 | 105 | ## How and when did you all learn e.g. version control?  
(community poll... might be interesting to see years too) 106 | 107 | - Learned subversion doing research in undergrad. Learned git working in industry. 108 | 109 | 110 | 111 | - +1 to undergrad research project 112 | - What is subversion? Hadley: it's one of the precursors to git 113 | 114 | 115 | 116 | - Learned version control (git) in undergraduate coursework 117 | - Got much much better at git from being the only person on a team who knew anything about git (turns out when you have to step up, you can learn a lot from the interwebs!) 118 | - Grad school (first quarter), in an intro grad programming course (x3) 119 | - Final projects in undergraduate class 120 | - On my own, probably through a MOOC 121 | - Internship 122 | - In conjunction with back-ups of my computer files (also learned subversion first, then Git) 123 | - DCL- 2017 (+1) 124 | - CS107 (+1) 125 | - Data Science course in undergrad, and then just reading blogs/tutorials online out of self interest 126 | - On my own in \~2015, got much better at it while working with a team in an internship in 2016 127 | 128 | # Investment 129 | 130 | ## 131 | 132 | ## How do you decide when it's worth it to use higher cost tools like writing tests or github etc? 133 | 134 | At a topline level - the benefits of implementing these tools should outweigh the costs. 135 | 136 | Here are some ways to think about answering this question: 137 | 138 | - Do you care about version control? If you don't care... probably not worth it. 139 | - How long are you spending on the project, vs. how long are you spending on overhead? 140 | - How long did it take to diagnose this problem? How likely is it to come up again? 141 | - Does future you want to collaborate with past you? 142 | 143 | 144 | 145 | - Will you be grateful later on that you documented process? 146 | - Will you need to leave and come back to it? 147 | - Are these valuable thinking processes or reasonings that aren't captured in the final output? 
148 | 149 | 150 | 151 | - Are you working with other people on the project?  If you are... 152 | 153 | When shouldn't you take these up? Or why isn't my grocery list on github? 154 | 155 | - Fixed costs and maintenance costs. 156 | - How long will you be spending on the project? 157 | - Don't have any knowledge of how to use version control, or someone you want to work with doesn't have any knowledge of version control 158 | 159 | ## What are the benefits to implementing best practices? 160 | 161 | - General 162 | 163 | 164 | 165 | - If your project is becoming sufficiently complicated that you're having trouble staying organized 166 | - I think I might start using it more for projects/studies I aim to publish: just seems like it can save me time and headaches when the reviewer comments are coming in and I need to go back to the analyses and do more/change things 167 | - If I think that I'll have the opportunity (or will create the opportunities) to share the results with people who appreciate these practices - generating opportunities for recognition is helpful motivation 168 | - I think even for your "future self" GitHub seems like a worthwhile upfront investment in terms of saving time and reducing confusion. 169 | 170 | 171 | 172 | - GitHub 173 | 174 | 175 | 176 | - GitHub for a project seems almost always worthwhile (if format is compatible) 177 | 178 | 179 | 180 | - Doesn't have to be a collaborative project for it to be useful 181 | 182 | 183 | 184 | - Github seems worthwhile when people you may collaborate with have existing workflows for processing/wrangling data that you also want to leverage. This can help ensure data consistency across analyses where multiple people want to work with the same data. 
185 | - BitBucket is another option, especially if you would rather use mercurial 186 | 187 | 188 | 189 | - Writing tests / documentation 190 | 191 | 192 | 193 | - Documenting: when the thinking isn't clear from the output 194 | - In terms of tests: if you think it's possible you could break your code when you're working on it, once something works and you spent more than a bit getting it to work (ex. hours vs. minutes) 195 | - Tests are useful to make your code "fail fast" in the sense that you can catch errors at their source, which makes debugging a lot easier. If you find yourself looking at your log and not knowing where down the pipeline errors occur, you might find testing useful 196 | - Assertions 197 | 198 | ## A lot of my work is (essentially) solo, or just collaboration with my advisor. Should I still use git? I don't personally find it that beneficial in non-collaborative projects. 199 | 200 | - \^I've had this same experience working in industry as the only data scientist on a team. There are sometimes unanticipated collaborative efforts that need my prior analyses for other teams, but this is hard to anticipate 201 | - Yes! (or some other VC system - no real benefits over subversion for solo work) It is hugely advantageous in ways you don't realize until you start using it. It gives you a lot more flexibility in working with your code, and making changes that could potentially break things without having to worry about getting the system back to a working state. 202 | - I have found it very helpful for small-scale projects for self-documentation, versioning/the ability to rollback changes, and the ability to test out new direction on branches without committing to that direction. 203 | - Can be more easily transitioned into a portfolio if you need to demonstrate your work or skills in/for future settings 204 | - I think it is extremely helpful solo! I use it solo most of the time for the version control aspect! 
205 | 206 | ## How do you build your utility library? How do you distribute? (Aaron) 207 | 208 | - Docker!!! 209 | 210 | # Patterns, smells and refactorings 211 | 212 | Pattern = identify what is wrong (smell) and then fix it (with a refactoring). 213 | 214 | Having written a lot of code, how can I extract the common patterns? 215 | 216 | How to identify when it is worthwhile to do the right thing? 217 | 218 | What are common problems? 219 | 220 | - systematically exploring hyperparameter tuning 221 | - managing 100s of files 222 | - copying and pasting code --- time to write a function? 223 | 224 | ## There is a set of standard design patterns for software engineering. Is there something analogous for data science? (Alex) +1 225 | 226 | - Exploration versus fixed question 227 | - Single data set versus multi data set 228 | 229 | ## What are data science "code smells"? What are common data science refactorings? (Hadley) 230 | 231 | - Smell: Repeated code (Aaron) 232 | - Refactor: vectorization of loops in R (Aaron) 233 | - Smell: Hard-coded elements within functions 234 | - Smell: Significant varied table shaping --- "letting the horse win" 235 | - Smell: Long (100+ line) functions 236 | - Smell: Places in the code with many nested indents and loops (\~4+ levels deep). Refactor by moving some of these blocks into their own functions. 237 | - Smell: Variables whose names do not make their meaning clear 238 | - Smell: long files  (or all code in one long file) 239 | 240 | # Other 241 | 242 | - 243 | 244 | 245 | 246 | - 247 | 248 | ## What is the best way to manage modeling workflows & versioning? 249 | 250 | This is different from just software versioning since you need to version input and output data, and may need to be running multiple models at the same time to do parameter sweeps or try out different options. I've seen systems built on Make, and built my own, but it seems like there should be more about best practices here. 
(Aaron) 251 | 252 | ## Out of all the best practices listed, are there criteria that we can use to assess which is best suited for particular situations? 253 | 254 | - Does it make your life or your collaborator's life easier? 255 | - If hard to assess the above, was it recommended by someone doing something similar to you? 256 | 257 | ## Related to the previous question, what tools exist to manage data and analyses that are proprietary (e.g., company-owned in a consulting setting) or otherwise subject to sharing and identifiability restrictions? 258 | 259 | - I think git can work on local servers as well? \<\--Yes (Yup! Have done this, you install git locally and don't link it to GitHub) 260 | 261 | ## How do you write word documents for projects with collaborators? 262 | 263 | - [Overleaf](https://www.overleaf.com/) (like Google Docs for Latex) (X3) 264 | 265 | 266 | 267 | - It has lightweight version control options too! 268 | - And preformatted templates for lots and lots of journals! 269 | - And rich-text/WYSIWYG editor for those unfamiliar with Latex 270 | 271 | 272 | 273 | - Sharelatex 274 | - Google Docs...? 275 | - Word + dropbox 276 | - TexMacs is a great WYSIWYG for solo latex work 277 | 278 | ## How to handle backwards compatibility with your personal utilities library? (Nick) 279 | 280 | - I think someone mentioned Make 281 | - I use something called Docker... 282 | - What is in your personal utilities library? I want to hear more about this! 283 | 284 | ## There are lots of mentions of Docker. It seemed like it was a bit much for most analysis workflows. Maybe I'm wrong? What are people's use cases for this? Are people using it in academics on a regular basis? (Aaron) 285 | 286 | - Looks like we actually have readings on Docker next week... \*angel emoji\* 287 | 288 | ## Who should be responsible for teaching data scientists how to use best practices? Schools? Industry? The individual? 
\[Eli\] 289 | 290 | - If there are data science courses at schools, they should at least try to teach best practices. 291 | - I think it should always be the individual.... Every organization is different\... 292 | 293 | # GLOSSARY 294 | 295 | ## Code Smell 296 | 297 | Any symptom in the source code of a program that possibly indicates a deeper problem. 298 | 299 | Pattern 300 | 301 | ## 302 | 303 | ## Docker 304 | 305 | ![](https://lh6.googleusercontent.com/BjD4Plk7dE3WBne1xPzgCzdZuv0bodItOGTR0AvCa38EqxDxTmUlEPRy1uMG7MnCWrD3kybrm0DcrMa0pR6jGN06Gkzt1DUP8yPBs8ENmUVfSx5jvPfoqbIrolXGs07PehqUZt7T)\<- That is not helpful :( (+1 even though it's really cute!) 306 | 307 | ## 308 | 309 | ## Make 310 | 311 | A Unix utility to automate processes. You specify dependencies, and when something becomes out of date, a specified action is taken, such as compiling code or running a script. 312 | 313 | ## Unit test 314 | 315 | Self-contained code that is used to test whether your "production" code is working as intended. It is primarily used in the world of software engineering, where functions should have deterministic outputs based on the inputs, but is less well-defined in data science, where there may not be deterministic outputs, due to randomness in the algorithm, changing data, etc. 316 | 317 | ## Utility library 318 | 319 | A collection of functions that you find yourself using across many projects. 320 | 321 | ## \~\~ this is fun and oddly soothing \~\ 322 | 323 | ![](https://lh4.googleusercontent.com/cI4GH1qfUs9QZ8msgTSm8GXJpsgwRXW0RyC0JW7zwup0ixyXCcoelBwh-V8v0JYSlLJt-1pY_7qfcwQQ59M1szIiptd1-xWI7_e6DAITNCxOjbVWEeEejyXUDVUZyyVx3vWzhdfo) 324 | -------------------------------------------------------------------------------- /week-04/google-doc.md: -------------------------------------------------------------------------------- 1 | --- 2 | title: 'Stats337: DevOps' 3 | --- 4 | 5 | # Questions 6 | 7 | How can we map the idea of 'technical debt' onto statistical analyses? 
(++++++) 8 | 9 | - Run time can be one way? 10 | - A related question could be: how do we think about quality control in the context of statistical analysis? 11 | - Can't it also be in terms of how much work you're going to have to put in later? 12 | - Possible definition: if the input data changes slightly, how long would it take to re-do the original analysis? (+) 13 | - HW: Another possible definition (more DevOps-y): if your computer was destroyed, how long would it take you to recreate the analysis? 14 | 15 | What is DevOps, and how is it different from software engineering? (++++++) 16 | 17 | - Basically same: Is there an agreed upon/decisive definition of DevOps philosophy? (++) 18 | - I would argue that DevOps is a subset of software engineering. Basically, DevOps engineers are folks who support the infrastructure and systems for the software developers to do their work. For instance, if you need to deploy a piece of software into Amazon Web Services, DevOps engineers are usually the ones who will set up the permissions, databases, etc. for the developers. 19 | 20 | 21 | 22 | - So DevOps sets up the architecture / structure within which (or on top of which?) software engineers build things? (the distinction is still not clear to me) 23 | 24 | 25 | 26 | - HW: I'd say the philosophy of DevOps is to automate everything. "If something is worth doing, it's worth automating" 27 | 28 | 29 | 30 | - I think there's also a piece of efficiency too 31 | 32 | 33 | 34 | - This series of posts cleared things up for me:   35 | 36 | Do you know of any examples (e.g., repos, live videos) of great data analyses? Preferably something where you could see a bit of someone's workflow (+++++) 37 | 38 | - Basically same Q: Is there something like [this](https://www.practicereproducibleresearch.org/) for data science generally? Case studies or process bios would be really helpful to make DevOps thinking more tangible. 39 | - Examples: 40 | 41 | 1. 42 | 2. 
43 | 44 | - HW: I made this YouTube video of me doing a basic data analysis:   45 | - HW: More historical, and more focussed on visualisation (rather than data science): . But neat to see state of the art in (e.g.) the 1970s (+) 46 | - Data and replication code of a researcher here at Stanford: 47 | 48 | How can one develop unit tests to catch statistical bugs rather than programming bugs? (+++++) 49 | 50 | - In the ML technical debt paper, the author recommends checking that the average of the predicted values matches the population average value. (This is somewhat specific to prediction settings.) 51 | - HW: I think there's no analogy to lab experiments where you might have positive and negative controls for every experiment. 52 | - You can generate synthetic data and do basic sanity checks. 53 | 54 | How is 'technical debt' measured in CS? Lines of code? (+++) 55 | 56 | - HW: I'm not aware of any precise measures; it's more of a metaphor than a specific analysis technique. 57 | - Code run-time is one way in which technical debt is measured in CS. If you install a bunch of packages that are not useful, or have code that is not used, then you'd have higher run time than if your code were efficient. 58 | - Other general things to look for: 59 | 60 | 61 | 62 | - indentation counts (high levels of indentation might indicate overly nested loops which make the code very complicated) 63 | -  file churn (how often some files are changed)\--if some files are changed a lot by many people, that might signal that the files should be refactored into more modularized components with more specific functionality 64 | - function/method lengths/number of lines (degree of decomposition) 65 | 66 | For people who have used Docker \-- what were your original use cases specifically and did you ultimately find Docker to be the best solution for your problem? (trying to figure out if it's worth learning) (++) What created the impetus for you to learn Docker? 
(++) For people who have tried to use Docker, what did you decide to go with instead? (+) 67 | 68 | - HW: I used it to recreate a package build failure that I had on Travis. The only way I got it working was carefully following the instructions in   69 | - I tried to use it in order to play around with a neuroimaging analysis pipeline w/o having to install all of the dependencies, but ran into a snag setting it up, and didn't find it worth the time to troubleshoot. 70 | 71 | DevOps thinking seems most helpful for large-scale collaborative analysis projects. Is it that necessary for much smaller-scale or short-term projects? (+) 72 | 73 | - I think automated testing and infrastructure as code are just as important for ensuring accuracy and reproducibility in small-scale projects. Continuous integration and deployment, performance monitoring, load testing... probably not. 74 | 75 | Do you generally read research articles like this or do something else (e.g., read the docs / tutorials) to learn new technologies like Docker? (+) 76 | 77 | - HW: papers are often good if you can find them because they'll survey a bunch of the field, and give cohesive recommendations; but very new technologies will not have papers yet 78 | 79 | 80 | 81 | - I search for tutorials 82 | - I generally search for tutorials or check out blogs that I like (I find tutorials more useful than academic papers to learn a new skill). 83 | - Generally prefer tutorials 84 | - I think a better way of thinking about Docker is that it's an implementation of what's called containers. You can just read up on containers instead. AWS has a good background on this stuff. 85 | 86 | What is a good source to explain modular code and how to use it in social sciences? (+) 87 | 88 | - 89 | 90 | 91 | 92 | - Not that. I mean something like how to use that in the workflow of your data analysis. 
93 | 94 | Are there any best practices related to software engineering/DevOps that haven't been in the readings that anyone thinks are important to keep in mind? (++) 95 | 96 | - Comment generously to communicate what the code is doing 97 | - This is kind of beside the point, but a big part of what DevOps folks do is to keep down the spend on servers, etc. so there's almost an implicit cost-saving aspect to DevOps work -\> perhaps this relates to the concept of debt (though it's literal in this sense). 98 | 99 | How do you implement code review in your current setting? Are there lightweight versions you could recommend? (++) 100 | 101 | - I use it in a teaching setting every day as people learn new packages / software in R. I've found it useful to have a set curriculum (e.g., practice problems from a book), and use code review to debug problems that everyone has worked on in a weekly meeting or something like that 102 | 103 | 104 | 105 | - Is this done in GitHub? 106 | 107 | Have you tried reproducing someone else's analysis before? What were the obstacles you faced? (+) 108 | 109 | - Didn't have the pipelines they used to produce their dataset 110 | 111 | 112 | 113 | - Similar: didn't have a script showing what manipulations they did to the data before running analysis and couldn't reproduce analysis even with analysis code (+) (same) 114 | 115 | 116 | 117 | - Didn't have their code (just had results) and wasn't sure exactly what analysis was run 118 | - Tried reproducing someone's analysis and found that their experimental display code wasn't close to functional, so lost faith in the published paper. Probably should have contacted the authors, but didn't care enough. Did model my analysis of the reproduction with their code though. Lots of hacky manual bits with no explanation. 119 | - No idea what the columns in the dataset were 120 | - Related to this question, has anyone been able to find a GitHub repo that has data + code for a project? 
Kind of creepy but might be an interesting idea to download everything and see if you get the same results 121 | 122 | POLL. Let's assume that researchers made their code and data available. 123 | 124 | What percent of studies do you think would contain errors in the code? 125 | 126 | - Probs high (+++) 127 | 128 | What percent of errors would change the analytical outcome (small or large)? 129 | 130 | - Probs somewhat less high, since manual testing and convergent results are common. But depends how upstream the error is. Simulating synthetic data FTW. 131 | 132 | What percent of errors would change the direction of the analytical outcome? \[do we think that people's willingness to support process changes in their field correlates with their predictions?\] 133 | 134 | - I think that if transparency and reproducibility were required, there probably would be far fewer errors in published studies. 😶 135 | 136 | # Glossary 137 | 138 | Virtual Machine: Software to run one operating system on top of another, such as running Windows on a Mac. Can be used to distribute configured software. 
139 | 140 | eta(?)-Features from the Sculley piece 141 | 142 | Provisioning tools 143 | -------------------------------------------------------------------------------- /week-04/quotes.pages: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hadley/stats337/c285022b83e4fbc58e86e3eec2bb08c359475368/week-04/quotes.pages -------------------------------------------------------------------------------- /week-05/conversational-roles.pages: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hadley/stats337/c285022b83e4fbc58e86e3eec2bb08c359475368/week-05/conversational-roles.pages -------------------------------------------------------------------------------- /week-05/quotes.pages: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hadley/stats337/c285022b83e4fbc58e86e3eec2bb08c359475368/week-05/quotes.pages --------------------------------------------------------------------------------