├── All_documents.pdf
├── README.md
├── StrategicPlan_NIH_review30March2018.pdf
├── combined_response.md
├── combined_response.pdf
├── jasons_submission_comment.md
├── preamble.pdf
└── prelim_drafts
    ├── goal_01_infrastructure_response.md
    ├── goal_02_dataecosystem_response.md
    ├── goal_03_datamanagement_response.md
    ├── goal_04_workforce_response.md
    ├── goal_05_datastewardship_response.md
    └── preamble.md

--------------------------------------------------------------------------------
/All_documents.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/JasonJWilliamsNY/2018_nih_datascience_rfi/64dc52a8e2153ca55f1d6fb2ce766fb4dde954f9/All_documents.pdf

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# 2018_nih_datascience_rfi
This is a short response to the 2018 RFI on the NIH Strategic Plan for Data Science.

# Purpose
On March 5th, 2018, [this post](https://nexus.od.nih.gov/all/2018/03/05/requesting-your-input-on-the-draft-nih-strategic-plan-for-data-science/) announced a request for information (RFI) on the
NIH [Strategic Plan for Data Science](https://grants.nih.gov/grants/rfi/NIH-Strategic-Plan-for-Data-Science.pdf).
Responses are due April 2nd, 2018 on [this form](https://grants.nih.gov/grants/rfi/rfi.cfm?ID=73).

**This repo contains language that you can use in your response to this RFI.**

## How to use these comments

**A [similar RFI](https://nlmdirector.nlm.nih.gov/2018/03/20/next-generation-data-science-research-challenges/) generated only 53 responses. Your voice will be heard!**

1. If possible, read the entire [Strategic Plan](https://grants.nih.gov/grants/rfi/NIH-Strategic-Plan-for-Data-Science.pdf).
   If you don't have that much time, read one of the Goal sections most relevant to
   you (individual sections are as short as 10 minutes of reading).

2. We have created specific [responses](./combined_response.md) to each of the
   Plan's stated goals. You may copy and paste all/some/none of what you agree with
   into the online NIH form (step 3).
   **The NIH form has a 500-word limit, so consider using one of the attachments**
   (you can write "see attached" in the comment box):

   - **[All_documents.pdf](./All_documents.pdf)**: PDF of the Preamble, sample NIH
     review form, and all collected comments
   - **[combined_response.pdf](combined_response.pdf)**: PDF of the Preamble and all
     collected comments
   - **[preamble.pdf](preamble.pdf)**: PDF of the Preamble alone
   - **[StrategicPlan_NIH_review30March2018.pdf](StrategicPlan_NIH_review30March2018.pdf)**: PDF of the sample NIH review

   You may modify anything you find here, or write your own response. (**Please
   consider submitting your own responses here for the community to use or review,
   even after the RFI due date.**)

3. By **April 2, 2018 11:59:59 PM EDT**, access the NIH [online form](https://grants.nih.gov/grants/rfi/rfi.cfm?ID=73) and write in and/or
   copy-paste your responses. Please follow all of the form instructions/guidelines.
   Invalid responses (e.g., automated or otherwise disqualified) will not
   be accepted.
   Only submit comments that represent your
   views; see the [guide notice instructions](https://grants.nih.gov/grants/guide/notice-files/NOT-OD-18-134.html).

**Thanks for your contribution!** We share a responsibility to ensure NIH can act
on the most accurate and best possible advice.

## How to contribute

Please submit a pull request, or (especially if you aren't familiar with
GitHub) click the issues tab to open an issue, or Tweet/email me (email =
my last name @ cshl.edu).

**Even after the due date**, please feel free to post here for others to see.

## Contributors

**Disclaimer**: Contributions are only the personal opinions of the individuals
who made them, and are not the opinions of other contributors, their employers,
or anyone else.

- Jason Williams - Cold Spring Harbor Laboratory, NY
  - **Bio sentence**: Diversity advocate, founder of the CSHL Biological Data Science Meeting,
    Software Carpentry instructor and former Foundation Chair, external consultant to the
    NIH Data Commons, #underrepresentedinSTEM

- Rochelle E. Tractenberg - Georgetown University, Washington, DC
  - **Bio sentence**: Psychometrician, biostatistician, and research methodologist, specializing
    in developing and validating difficult-to-measure outcomes, including those in biomedical research and in higher/graduate/postgraduate education. Chair (2017-2019) of the Committee on Professional Ethics of the American Statistical Association.

- Bastian Greshake Tzovaras - Lawrence Berkeley National Laboratory, CA
  - **Bio sentence**: Co-founder of openSNP, Director of Research at Open Humans, and
    Visiting Scholar at the Berkeley Lab.

- Alexander (Sasha) Wait Zaranek - Harvard University, MA
  - **Bio sentence**: Co-founder, Harvard Personal Genome Project (PGP); Chief Scientist, Curoverse Research; founder, https://arvados.org; Head of Quantified Biology, Veritas Genetics

## Other public responses

These are other responses to the RFI, posted here to represent the diversity
of opinions and in the hope of continuing the conversation. They are not
related to the contribution in this repo, and are only the opinions of their
respective authors.

- Harry Hochheiser et al., University of Pittsburgh: [link to document](https://docs.google.com/document/d/1ibfFtWu75OQelcOUj-EXLCWc1qox0xugE_bqfdVF8Kc/edit)
- Dimitri Yatsenko, Vathes LLC/Baylor College of Medicine: [link to document](https://github.com/dimitri-yatsenko/NIH-RFI-Strategic-Plan-for-Data-Science)
- Michael Love, UNC Chapel Hill: [link to document](https://github.com/mikelove/2018_nih_datascience_rfi)
- DJ Patil, former U.S. Chief Data Scientist: [link to document](https://www.linkedin.com/pulse/data-science-nih-healthcare-dj-patil/)
- Rafael Irizarry, Harvard T.H. Chan School of Public Health: [link to document](https://simplystatistics.org/2018/04/02/input-on-the-draft-nih-strategic-plan-for-data-science/)
--------------------------------------------------------------------------------
/StrategicPlan_NIH_review30March2018.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/JasonJWilliamsNY/2018_nih_datascience_rfi/64dc52a8e2153ca55f1d6fb2ce766fb4dde954f9/StrategicPlan_NIH_review30March2018.pdf

--------------------------------------------------------------------------------
/combined_response.md:
--------------------------------------------------------------------------------
## Preamble

NIH's mission is "to seek fundamental knowledge about the nature and behavior of
living systems and the application of that knowledge to enhance health, lengthen
life, and reduce illness and disability." Data Science will play a prominent
role in fulfilling that mission in the 21st Century. Unfortunately, and for
several reasons, the current NIH Strategic Plan for Data Science (SPDS) will not
further, and may even act against, NIH's mission. On the surface, the SPDS has
identified important goals. However, careful review of the current Plan reveals
the impossibility of translating anything articulated into impactful,
actionable, and meaningfully measurable implementations. If this document came
from any organization other than NIH, it might be ignored. Instead, the awesome
weight of NIH recommendations - which demands an equally awesome obligation to
the community for productive criticism - invites the plausible scenario that
many investigators (within and outside of NIH) will be guided by an incomplete,
inaccurate, and dangerously misguided document. While the tone of this document
is necessarily blunt (far more so than if feedback were addressed to an individual
investigator), it is offered in the belief that NIH does not simply want feedback
that says what it wants to hear - the NIH mission is too important for
criticisms to go unheard.

To conclude that the SPDS is fundamentally flawed, consider these
representative metrics copied verbatim from the document:

- Goal 1: Support a Highly Efficient and Effective Biomedical Research Data
  Infrastructure
  - Sample evaluation metric: "quantity of cloud storage and computing used by
    NIH and by NIH-funded researchers"

- Goal 2: Promote Modernization of the Data-Resources Ecosystem
  - Sample evaluation metric: "quantity of databases and knowledgebases
    supported using resource-based funding mechanisms"

- Goal 3: Support the Development and Dissemination of Advanced Data Management,
  Analytics, and Visualization Tools
  - Sample evaluation metric: "quantity of new software tools developed"

- Goal 4: Enhance Workforce Development for Biomedical Data Science
  - Sample evaluation metric: "quantity of new data science-related training
    programs for NIH staff and participation in these programs"

- Goal 5: Enact Appropriate Policies to Promote Stewardship and Sustainability
  - Sample evaluation metric: "establishment and use of open-data licenses"
    (a count metric)

These metrics (which are wholly representative of the evaluations proposed) would
never be accepted by a scientific reviewer.
If the SPDS were to be considered
by one of the NIH's scientific review panels, it would certainly be designated
"Not Recommended for Further Consideration (NRFC)". Simply put, there is no
perceivable likelihood that this Strategic Plan will exert any positive influence
on the fields of Data Science or data-intensive biomedical research. The
majority of metrics proposed are simply "counts" of the number of activities
created or performed, without any meaningful and/or independent evaluation of
their benefit. Most of these activities have to happen anyway - e.g., the
"quantity of databases and knowledgebases supported" will increment every time
the NIH gives a grant to someone proposing to collect data and create a database
(following the NIH policy on data sharing,
https://grants.nih.gov/policy/sharing.htm). There is no suggestion that the
increments these metrics represent are meaningful - or that they do or might
contribute to biomedical work to "enhance health, lengthen life, and reduce
illness and disability." These evaluation metrics are both tautological (i.e.,
if the NIH funds research that generates data, these metrics will increment and
the Goal will appear to have been "achieved") and vague (as long as at least
one of each of these events occurs, the Goal could be argued to have been
"achieved"). Thus, almost any funded proposal to collect a large quantity of
data will be branded as "successful". While these goals are important for
the NIH to function as a data broker, they actually represent characteristics
to strive for throughout NIH, rather than a Strategic Plan. This plan is
surprisingly unreflective of prior NIH Data Science activities (NCBC, BD2K) and
even of praiseworthy and well-informed planning documents emerging in parallel
(NLM Strategic Plan: https://www.nlm.nih.gov/pubs/plan/lrp17/NLM_StrategicReport2017_2027.pdf).

Data Science is not the progressive extension of clinical research applications,
and a truly impactful NIH strategy for Data Science must by definition come from
experts in Data Science - preferably, those with expertise in both Data Science
and biomedical research. As stated in SPDS Goal 4, NIH as an organization does
not have the expertise in Data Science to plan its strategic direction in this
area. We recommend that if NIH is committed to a transformative strategy for
Data Science (and it should be), the current SPDS should be discarded. A
Strategic Plan with meaningful goals that promote the thoughtful integration of
Data Science with biomedical research, representing measurable (not "countable")
impact, together with formal and independent evaluation, should be developed
instead.

Such a Plan would require a sustainable vision that accurately reflects the
discipline of Data Science and its potential role in biomedical research. Such
a vision must be community-driven, perhaps through a call for nominations that
would assemble national and international community members with evidence of
expertise in the needed areas. Given the weaknesses in the current SPDS, this
RFI has every chance of resulting in an inward-looking, self-fulfilling prophecy
(e.g., such that any grant made to any data-generating proposal results
in "success" according to these self-defined metrics).
If the Data Science and
data-intensive biomedical research community comments are ignored, and priority
is given to existing investigators and internal stakeholders, the NIH will
promote an inward-looking and essentially irrelevant program for integrating
Data Science into biomedical research.

The Plan fails on several technical merits, and the community members
who have authored this document have assembled additional detailed remarks at
this URL:

https://github.com/JasonJWilliamsNY/2018_nih_datascience_rfi

Additionally, as an exercise, the NIH review form has been filled out for this
Plan to explore how a realistic NIH scientific reviewer (who did the exercise)
might score such a proposal:

https://docs.google.com/document/d/1XxQLORoTm2lkucQz6k3QdgNOARDrvU3zx_svT0PX_c0/edit#

These comments are provided to support a new effort at a realistic, plausible,
and community-driven Strategic Plan for Data-Intensive Biomedical Research. The
community stands ready to assist NIH in this important work, and we urge the
organization to commit to making this a community effort. We thank NIH for the
work done so far and look forward to the development of the next stages of this plan.

## The appropriateness of the goals of the plan and of the strategies and implementation tactics proposed to achieve them

Goal 1: NIH should be applauded for recognizing that, historically, funding of
data resources used funding approaches that were appropriate for research
projects. We agree this must change. We also agree that tool development
and data resources may often need distinct funding contexts/expectations.

Goal 1: We agree with the stated need to support the hardening and optimization
(user-friendliness, computational efficiency, etc.) of innovative tools and
algorithms.

Goal 1: The Data Commons has been driven by bottom-up development from
the community. The Commons is in its early stages and should be allowed to
function as is. The Commons is the appropriate testbed for many of the
technological innovations and processes that NIH may ultimately wish to explore
at broader scales after sufficient development.

Goal 1: The implementation tactics proposed seem almost randomly selected,
making it nearly impractical to criticize them. For example, one of the three
bullet points indicates that new technologies should be adapted (somewhat
meaningless, but not disagreeable), but then goes on to mention that GPUs should
be used. That is weirdly specific, and not necessarily wrong, but why mention it at
this level of detail without some specific vision for the real biomedical
informatics problems that are relevant here? This is like saying "calculus"
should be used. Maybe; but such an out-of-context statement is ultimately a
collection of buzzwords. At best, this is an indication that greater expertise
is needed to reformulate this document.

Goal 1: Although this is a strategy document, it is implausible to imagine that the
linking of NIH datasets as described in objective 1-2 can be elucidated in a
single thin paragraph. We have no idea how the "Biomedical Data Translator" will
work, but can only imagine it will need to function at least as well as the
"Universal Translator" of Star Trek.
Either enough detail for the strategy needs
to be proposed here for the document to serve its purpose, or the space is
better spent articulating the hard problems that need fixing.

Goal 2: The overall goal recognized (the need to avoid data siloing) is
absolutely a correct (but difficult) target for NIH. It is impossible to believe
that anything in the implementation or metrics will demonstrably achieve this.
What appears to be one problem is likely several problems, each of which is
worthy of study to generate actionable solutions. Usability is identified as a
target to optimize for, and while this may be correct, no metrics proposed seem
to measure it. While there may be real and important distinctions between
databases and knowledgebases, we again suggest no convincing metrics have been
proposed here. The statement that "Funding approaches used for databases and
knowledgebases will be appropriate..." conveys no usable information.

Goal 2: For some reason, objective 2-2 contrasts the "small-scale" datasets
produced by individual laboratories with the "high-value" datasets generated
by NIH-funded consortia. Besides the needless condescension here,
this kind of contrast betrays the problem of designing data systems in
which there are various classes of "citizenship." Treating the much larger
quantity of data generated by the plurality of extramural investigators as
somehow different may lead to policies which work for the NIH but don't work for
the community, leading to irreconcilable standards.

Goal 2: The implementation tactics proposed are bewildering. Under this
subheading appears the bullet "Ensure privacy and security." While it is
gratifying that the word privacy finally appears 11 pages into this 26-page
document, this is not in any way an implementation tactic.

Goal 2: The evaluation metrics are horrific. Quantity of datasets and funded
databases does nothing more than account for the fact that NIH spent money.

Goal 3: The strategy to leverage and support existing tool-sharing systems to
encourage "marketplaces" for tool developers, and to separate funding of tool
development and data generation, is very important, and we support this direction
of the NIH. This is a proven strategy to elevate high-quality tools; for
example, in the world of high-throughput genomics, one can consider the NHGRI-funded
Bioconductor Project as a decade-long successful use case in providing a
unified interface for more than 1,000 "high-quality, open-source data
management, analytics, and visualization tools".

Goal 3: In general, the implementation tactics are plausible, but not evidence-based.
Rather than propose that each step NIH takes to develop a tactic be supported by
a body of research, the lowest-hanging fruit (and most productive solution) is
to have the community develop an actual set of strategic targets, with clear
metrics for evaluation.

Goal 3: The SPDS plan underestimates the pervasiveness and persistence of
bad/outdated software and methods (see: https://www.the-scientist.com/?articles.view/articleNo/51260/title/Scientists-Continue-to-Use-Outdated-Methods/).
It is completely unclear how separating evaluation and
funding for tool development and dissemination from support for databases and
knowledgebases (this sentence from the SPDS is itself unclear) will address
this problem. This may help, but is, to our knowledge, an unvetted hypothesis.

Goal 3: Although the SPDS does not make any strategy clear, the goal of
supporting tools and workflows (objective 3-1) is a good one. We further agree
that partnership is exactly the way that this needs to be pursued.

Goal 3: The metrics proposed for this sophisticated set of objectives are
catastrophic. There is no way that the objectives stated for Goal 3 can be
effectively measured or set a useful standard for success.

Goal 4: Feldon et al. (PNAS, 2017; https://doi.org/10.1073/pnas.1705783114)
conclude that despite $28 million in investment by NSF and NIH in training
(including workshops/boot camps relevant to biomedical data science), much of
this training is "not associated with observable benefits related to skill
development, scholarly productivity, or socialization into the academic
community." Clearly, if NIH intends to do better, it needs to completely
reconceive how it approaches training in data science.

Goal 4: Several reasonable priorities are identified, but only
demonstrably ineffective and/or inappropriate approaches and evaluation metrics
are proposed to achieve/evaluate these goals.

Goal 4: The inadequacy of the strategies suggested is epitomized by the proposed
evaluation metric: "the quantity of new data science-related training programs
for NIH staff and participation in these programs." This is as unconvincing as
suggesting a research proposal be measured by the number of experiments
performed. In fact, the only metric proposed for evaluation of training is the
number of training opportunities created. Such an arbitrary and crude metric
would get any research proposal returned from study section without discussion.
Number of training opportunities cannot be a plausible metric. Despite there
being no shortage of training opportunities, from MOOCs to workshops, there is a
persistent, apparent, and urgent training gap. This inadequate metric is a clear
red flag that training guided by the proposed plan will accomplish very little.

Goal 4: It is clear that training is under-prioritized by NIH. In the largest
survey on the unmet needs of life science investigators, NSF investigators report
in Barone et al. (PLOS Comp. Bio, 2017; https://doi.org/10.1371/journal.pcbi.1005755)
that their most unmet computational needs are not software, infrastructure, or
compute. Instead, it is training; specifically, training in the
integration of multiple data types, data and metadata management, and scaling
analyses to HPC and cloud.

Goal 4: The strategy fails to properly understand the role of training in
biomedical data science and the need to define and measure what constitutes
effective training. There is no mention of the serious commitment to
evidence-based teaching that is needed to design effective short-format courses
and workshops.
While the educational pipeline from at least the undergraduate
level and beyond needs serious improvement to address biomedical data science,
short-format training and workshops will play an important role. These workshops
must not be constructed ad hoc. Typically, training is delivered by
staff/faculty with a high level of bioinformatics/data science domain expertise,
but little to no guidance in andragogy, cognitive science, or evaluation.

Goal 5: This, and one instance in Goal 3, are the only mentions in the document
of community-driven activity (an approach that actually needs to be brought to
the entire SPDS). The FAIR Data Ecosystem is a laudable goal, but the idea that
NIH should "Strive to ensure that all data in NIH-supported data resources are
FAIR" is still a goal without a plan. More than technological advances or
implementation, this is a training activity that requires community awareness,
understanding, input, and buy-in on FAIR principles. The implementation tactics
are plausible but, without appropriate evaluation, ultimately too vague to
establish success. Establishing open-source licenses, or even promoting their
use, won't in itself achieve FAIR. This section also misses the hard question of
ELSI/privacy and biomedical data, for which NIH policy has not yet been updated
to accommodate the vision of what biomedical data science might achieve.

Goal 5: The evaluation consists of inappropriate, unrevealing count metrics that
will not indicate whether FAIR principles are realized or not.

## Opportunities for NIH to partner in achieving these goals

Goal 1: NSF has been exploring centralized computing models through XSEDE, and
open-science clouds (CyVerse Atmosphere, XSEDE-Jetstream), for many years. These
groups would be natural partners in addition to commercial cloud providers. The
NSF resources will not match the capacity of commercial cloud, but they have been
optimized for the science use-cases and user profiles relevant to biomedical research.

Goal 2: ASAPbio, Force11, Open Science Framework, Zenodo, FigShare, bioRxiv,
and many other community-driven organizations are exploring data lifecycle issues
and metrics that are relevant to this discussion. The entire SPDS needs to be
completely reconceived to include representation from individuals within these
organizations who have scholarly reputations in data management and life science publication/communication.

Goal 3: There are a variety of groups the NIH can partner with. The
potential individual investigators are too numerous to list, but these
individuals should be relatively easy to identify by means of their scholarly
contributions (carefully avoiding journal publications as a primary metric).
Reaching out and partnering with groups such as the Open Bioinformatics
Foundation and societies like ISCB would be an ideal way for NIH to foster deep
community involvement.

Goal 4: We present comments here in the hope that NIH will consider bold action
in this area, because the problem is solvable and the community of investigators
with experience in training relevant to biomedical data science has been
thinking deeply on the topic. The community is relatively small and well-connected,
and should be extensively leveraged in developing robust, scalable solutions.
It
should be easy to assemble the 10-20 most important practitioners and educators
in biomedical data science, confident that they and their second-order
collaborators would constitute a reasonably sized working group that can bring
in much-needed solutions. Understandably - in fact, by definition, as an
aspirational goal - NIH has identified the value in prioritizing data science
as an area it needs to be at the forefront of, but it does not have the
expertise to achieve this alone. The community does.

Goal 4: Right now, the single best target for collaboration is the Software and
Data Carpentry community (Greg Wilson: "Software Carpentry: Lessons Learned",
F1000Research, 2016, 3:62; doi: 10.12688/f1000research.3-62.v2). There are many
reasons why collaboration here will be tremendously important for NIH to
succeed. First, the Carpentry community itself represents a global federation
of researchers in the space of computation, data science, bioinformatics, and
related fields with a strong interest in education. In short, this is a
self-selected community of hundreds to thousands of researchers; the
expertise available in biomedical data science is well-covered in this community.
Additionally, there simply is no other community that has built a sustainable
and scalable approach to building educational content relevant to biomedical
data science with strong grounding in assessment and pedagogy. It would be
a tremendous squandering of resources not to build on this foundation.

Goal 4: This is an area that especially calls for collaboration with the
National Science Foundation. NSF already has several strong funded programs that
are dedicated to understanding the problems of bioinformatics education and
biomedical data science and to developing solutions - the Network for Integrating
Bioinformatics Education into Life Science
(NIBLSE, https://qubeshub.org/groups/niblse) is just one of many. NIBLSE has,
for example, identified that the most recently trained faculty with bioinformatics
expertise are not bringing that training into the classroom, and that lack of
training is the biggest barrier to bioinformatics education
(https://www.biorxiv.org/content/early/2017/10/19/204420). The potential for
synergy here is enormous for developing the K-16 pipeline. This alliance
(especially leveraging NSF INCLUDES) could be a tremendous opportunity to do so
in a way that enhances diversity. And while there are distinct career paths for
biomedical vs. non-human life sciences, there is almost complete overlap in the
study, preparation, and training for both student groups.

## Additional concepts that should be included in the plan

Goal 1: Grafting "Data Science" onto NIH is essentially a massive retrofitting
exercise. If we had to pick one area to focus on, it is emerging techniques
(long-read sequencing, machine learning approaches, CRISPR, etc.) and how NIH
manages these data; this could be a primary target for envisioning
a data science-friendly ecosystem. The community of users is smaller, and
fixing emerging challenges seems like a manageable focus for fostering
community consensus.

Goal 2: Conceptually, this goal needs to clearly differentiate technological
obstacles from process obstacles.
In reality, many of the needed technologies
are either in place or will be generated by sectors outside of biomedicine.
A few, perhaps, will be unique to the NIH use cases. More effort needs to be
put into understanding the workflows and processes that investigators in a
variety of contexts use to produce and consume data. This is an unaddressed
research question in this document.

Goal 3: The plan as currently written does not mention (1) computational
reproducibility, or (2) exploratory data analysis for data quality control.
These two topics are critical for the high-level goal of "extracting
understanding from large-scale or complex biomedical research data".

Goal 3: Computational reproducibility can be defined as the ability to produce
identical results from identical data input or "raw data", and relies on
biomedical researchers keeping track of metadata regarding the versions of
tools that were used, the way in which tools were run, and the provenance and
version of publicly available annotation files, if these were used. This is very
important for data science: if two groups observe discrepancies between their
results, they absolutely must be able to identify the source, whether it be
methodological or due to different versions of software or annotation data.

Goal 3: Exploratory data analysis (EDA) needs to be a key component of the data
science plan, as it should be the first step of any data analysis involving
complex biological data. EDA is often how a data scientist will identify data
artifacts, technical biases, batch effects, outliers, unaccounted-for or
unexpected heterogeneity, the need for data transformation, or other data
quality issues that will cause serious problems for downstream methods, whether
they be statistical methods, machine learning, deep learning, artificial
intelligence, or otherwise. In particular, machine learning and statistical
methods rely on the quality of the metadata and the ability to provide
consistent terms and judgments in describing samples across all datasets, from
consortia and from individual labs. Downstream methods may either fail to detect
the relevant signal (loosely categorized as "false negatives") or may produce
many spurious results which are purely associations with technical aspects of
the data ("false positives"). Furthermore, basic EDA can uncover biological
signal that may otherwise be missed, such as biologically relevant heterogeneity,
e.g., subtypes of disease with signal present in molecular data.

Goal 3: Computational reproducibility and support for EDA should be components of
both NIH-funded tool development and the plan to "Enhance Workforce
Development for Biomedical Data Science" (Goal 4).

Goal 4: There are a few basic concepts that must be included, and these are
potentially at the right level for a strategic vision document:

- Training is the most unmet need of investigators. Investments that
  under-prioritize training will not realize the value of the computational and
  data infrastructure developed.
- Biomedical data science education must not be solely delivered by domain
  experts/investigators training according to what they think is best. Instead,
  curriculum must be developed using evidence-based pedagogical principles.
- Collaboration is key. Training that is developed as the unitary creation of
  NIH will fail. Training must be developed by a community that can maintain and
  sustain learning content.
- Assessment is an integral part of training and cannot be ad hoc; it must
  generate evidence that learning has occurred, and it must be developed within a
  community-of-practice framework. This is hard - it is easy to count the number
  of CPU cycles paid for on the cloud, or the size of a database.
- Data science must be about science, and not just data. It is easy to
  accumulate datasets, but not easy to develop training that is measurably
  effective - however, it is definitely possible.
- Citizen Science is more than "individuals giving their brains for analyzing
  data using computer games". There are growing communities of patients and
  healthy individuals who are coming together to analyze biomedical data, either
  their own or using public data resources, to perform science under their own
  lead (cf. http://jme.bmj.com/content/early/2015/03/30/medethics-2015-102663
  on participant-led research). These community efforts are growing
  substantially, and these groups are bound to become important stakeholders in
  performing additional biomedical research. The needs of these communities
  should thus be addressed.
- Diversity is a highly attainable goal for biomedical data science education.

## Performance measures and milestones that could be used to gauge the success of elements of the plan and inform course corrections

Goal 1: The proposed evaluation metrics are horrific. These metrics are more
appropriate for cloud providers, who capture the described metrics to develop
their invoices. While NIH should look to drive down communal costs, Goal 1,
like all of the goals in this document, is a hard research problem that
requires deep thought to understand what success is.

Goal 2: Without specific research questions, or the identification of relevant
research that can be directly applied to the use cases NIH wants to advance,
there are no additional milestones except a clearer definition of the goal.

Goal 4: This question is difficult to answer because there needs to be a defined
and agreed-upon set of competencies for biomedical data science. From these will
follow learning objectives and assessments for those objectives. At the next
stage will come dissemination targets and measures of community use and buy-in.
A workshop could resolve and develop these over the course of a few months.

## Any other topic the respondent feels is relevant for NIH to consider in developing this strategic plan

Goal 1: The SPDS correctly identifies that "The generation of most
biomedical data is highly distributed and is accomplished mainly by individual
scientists or relatively small groups of researchers." This should be followed
by the conclusion that any top-down approach must be matched by a
correspondingly large-scale bottom-up approach. Individual investigators need
the training and support to generate data in a way that fulfills the promise of
FAIR principles. The Strategic Plan quotes the famous (and regrettable)
statistic that 80% of Data Science is cleaning data, yet nothing proposed in
this document will solve this.
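To illustrate what this routine "cleaning" looks like in practice - and the kind
of teachable, assessable skill the training discussed here should impart -
consider a minimal sketch of a metadata completeness check. The field names and
controlled vocabulary below are invented for illustration, not a proposed
standard:

```python
# Illustrative sketch only: a minimal metadata completeness/consistency check.
# The required fields and the controlled vocabulary are hypothetical.
REQUIRED_FIELDS = {"sample_id", "organism", "tissue", "assay", "collection_date"}
ALLOWED_ASSAYS = {"RNA-seq", "ATAC-seq", "WGS"}

def validate_record(record):
    """Return a list of problems found in one sample's metadata dict."""
    problems = [f"missing field: {f}" for f in sorted(REQUIRED_FIELDS - record.keys())]
    if "assay" in record and record["assay"] not in ALLOWED_ASSAYS:
        problems.append(f"uncontrolled assay term: {record['assay']!r}")
    return problems

samples = [
    {"sample_id": "S1", "organism": "human", "tissue": "liver",
     "assay": "RNA-seq", "collection_date": "2018-01-15"},
    {"sample_id": "S2", "organism": "human", "assay": "rnaseq"},  # a "dirty" record
]

for sample in samples:
    for problem in validate_record(sample):
        print(sample.get("sample_id", "?"), "-", problem)
```

Nothing in such a check is technically hard; what is missing for most
investigators is the training and community support that make it habitual.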
While NIH is in a position to pioneer (and
appropriately fund) hard infrastructure (computation/storage, etc.), greater
attention must be paid to funding soft infrastructure - the training,
documentation, and support that bring investigators into a community of
practice. Note that Barone et al. (https://doi.org/10.1371/journal.pcbi.1005755)
replicate the earlier findings of EDUCAUSE
(https://net.educause.edu/ir/library/pdf/ers0605/rs/ers0605w.pdf): organizations
planning for cyberinfrastructure development tend to underestimate
and underfund the training needed to use that infrastructure. Any infrastructure
development must be matched by clear, measurable learning outcomes to ensure
that investigators can actually make the intended use of the investments.

Goal 3: There are many potential emerging technologies and themes;
understandably, this plan should not be a laundry list of things to try. This
objective needs to be reconceived to articulate how those solutions will be
collected, pursued, and evaluated. No such vision is clearly present.

Goal 4: The work being done by NSF, as well as the Data Science recommendations
developed by the National Academies, is highly relevant and needs to be better
integrated. There is also a unique connection between Data Science and industry.
Industry partners will continue to lead advances in Data Science relevant to
biomedicine.

--------------------------------------------------------------------------------
/combined_response.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/JasonJWilliamsNY/2018_nih_datascience_rfi/64dc52a8e2153ca55f1d6fb2ce766fb4dde954f9/combined_response.pdf

--------------------------------------------------------------------------------
/jasons_submission_comment.md:
--------------------------------------------------------------------------------
Please see the attachment, containing a collaboratively developed response that
I led. Community updates to that response will be maintained at this URL: https://github.com/JasonJWilliamsNY/2018_nih_datascience_rfi. These
to-the-point comments are compiled in the realization that NIH has a major influence
over the direction of biomedical data science, and with sincere trust that NIH
is committed to the highest standards of academic rigor, commitment to its own
mission, and responsibility to the U.S. taxpayer. Although every effort has been
made for this response to consist only of productive and objective criticisms,
I will express a personal and subjective comment here: my current and past
dealings with NIH personnel have demonstrated them to be passionately committed
to the same high standards. Every person and every organization
that aspires to be more than it is will reach a point where its own resources
cannot take it in the direction it needs to go. I hope that NIH will consider
engaging the entire, diverse biomedical data science community to achieve its
greatest success in leveraging data science for the betterment of human health;
it *will* take all of us.
--------------------------------------------------------------------------------
/preamble.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/JasonJWilliamsNY/2018_nih_datascience_rfi/64dc52a8e2153ca55f1d6fb2ce766fb4dde954f9/preamble.pdf

--------------------------------------------------------------------------------
/prelim_drafts/goal_01_infrastructure_response.md:
--------------------------------------------------------------------------------
# Goal 1: "Support a Highly Efficient and Effective Biomedical Research Data Infrastructure"

## The appropriateness of the goals of the plan and of the strategies and implementation tactics proposed to achieve them

Goal 1: NIH should be applauded for recognizing that, historically, funding of
data resources used funding approaches that were appropriate for research
projects. We agree this must change. We also agree that tool development and
data resources may often need distinct funding contexts/expectations.

Goal 1: We agree with the stated need to support the hardening and optimization
(user-friendliness, computational efficiency, etc.) of innovative tools and
algorithms.

Goal 1: The Data Commons has generally been driven by bottom-up development from
the community. The Commons is in its early stages and should be allowed to
function as is. The Commons is the appropriate testbed for many of the
technological innovations and processes that NIH may ultimately wish to explore
at broader scales after sufficient development.

Goal 1: The implementation tactics proposed seem almost randomly selected,
making it nearly impractical to criticize them. For example, one of the three
bullet points indicates that new technologies should be adapted (somewhat
meaningless, but not disagreeable), but then goes on to mention that GPUs should
be used. That is weirdly specific, and not necessarily wrong, but why mention it at
this level of detail without some specific vision for the real biomedical
informatics problems that are relevant here? This is like saying "calculus"
should be used. Maybe; but such an out-of-context statement is ultimately a
collection of buzzwords. At best, this is an indication that greater expertise
is needed to reformulate this document.

Goal 1: Although this is a strategy document, it is bewildering to imagine that the
linking of NIH datasets as described in objective 1-2 can be elucidated in a
single thin paragraph. I have no idea how the "Biomedical Data Translator" will
work, but can only imagine it will need to function at least as well as the
"Universal Translator" of Star Trek. Either enough detail for the strategy needs
to be proposed here for the document to serve its purpose, or the space is
better spent articulating the hard problems that need fixing.

## Opportunities for NIH to partner in achieving these goals

Goal 1: NSF has been exploring centralized computing models through XSEDE, and
open-science clouds (CyVerse Atmosphere, XSEDE-Jetstream), for many years. These
groups would be natural partners in addition to commercial cloud providers. The
NSF resources will not match the capacity of commercial cloud, but they have been
optimized for the science use-cases and user profiles relevant to biomedical research.
## Additional concepts that should be included in the plan

Goal 1: Grafting "Data Science" onto NIH is essentially a massive retrofitting
exercise. If I had to pick one area to focus on, it is emerging techniques
(long-read sequencing, machine learning approaches, CRISPR, etc.) and how we
treat the related data; this should be a primary target before these data
essentially get too big to go back and fix. The community of users is smaller, and
fixing emerging challenges seems like a more manageable focus for fostering
community consensus.

## Performance measures and milestones that could be used to gauge the success of elements of the plan and inform course corrections

Goal 1: The proposed evaluation metrics are horrific. These metrics are more
appropriate for cloud providers, who capture the described metrics to develop
their invoices. While NIH should look to drive down communal costs, Goal 1,
like all of the goals in this document, is a hard research problem that
requires deep thought to understand what success is.

## Any other topic the respondent feels is relevant for NIH to consider in developing this strategic plan

Goal 1: The SPDS correctly identifies that "The generation of most
biomedical data is highly distributed and is accomplished mainly by individual
scientists or relatively small groups of researchers." This should be followed
by the conclusion that any top-down approach must be matched by a
correspondingly large-scale bottom-up approach. Individual investigators need
the training and support to generate data in a way that fulfills the promise of
FAIR principles. The Strategic Plan quotes the famous (and regrettable)
statistic that 80% of Data Science is cleaning data, yet nothing proposed in
this document will solve this. While NIH is in a position to pioneer (and
appropriately fund) hard infrastructure (computation/storage, etc.), greater
attention must be paid to funding soft infrastructure - the training,
documentation, and support that bring investigators into a community of
practice. Note that Barone et al. (https://doi.org/10.1371/journal.pcbi.1005755)
replicate the earlier findings of EDUCAUSE
(https://net.educause.edu/ir/library/pdf/ers0605/rs/ers0605w.pdf): organizations
planning for cyberinfrastructure development tend to underestimate
and underfund the training needed to use that infrastructure. Any infrastructure
development must be matched by clear, measurable learning outcomes to ensure
that investigators can actually make the intended use of the investments.

--------------------------------------------------------------------------------
/prelim_drafts/goal_02_dataecosystem_response.md:
--------------------------------------------------------------------------------
# Goal 2: "Promote Modernization of the Data-Resources Ecosystem"

## The appropriateness of the goals of the plan and of the strategies and implementation tactics proposed to achieve them

Goal 2: The overall goal recognized - the need to avoid data siloing - is
absolutely a correct (but difficult) target for NIH. It is impossible to believe
that anything in the implementation or metrics will demonstrably achieve this.
What appears to be one problem is likely several problems, each of which is
worthy of study to generate actionable solutions.
Usability is identified as a
target to optimize for, and while this may be correct, no metrics proposed seem
to measure it. While there may be real and important distinctions between
databases and knowledgebases, we again suggest no convincing metrics have been
proposed here. The statement that "Funding approaches used for databases and
knowledgebases will be appropriate..." conveys no usable information.

Goal 2: For some reason, objective 2-2 contrasts the "small-scale" datasets
produced by individual laboratories with the "high-value" datasets generated
by NIH-funded consortia. Besides the needless condescension here,
this kind of contrast betrays the problem of designing data systems in
which there are various classes of "citizenship." Treating the much larger
quantity of data generated by the plurality of extramural investigators as
somehow different may lead to policies which work for the NIH but don't work for
the community, leading to irreconcilable standards.

Goal 2: The implementation tactics proposed are bewilderingly lazy. Under this
subheading appears the bullet "Ensure privacy and security." While it is
gratifying that the word privacy finally appears 11 pages into this 26-page
document, this is not a tactic and transgresses the bounds of sense.

Goal 2: The evaluation metrics are horrific. Quantity of datasets and funded
databases does nothing more than account for the fact that NIH spent money.

## Opportunities for NIH to partner in achieving these goals

Goal 2: ASAPbio, Force11, Open Science Framework, Zenodo, FigShare, bioRxiv,
and many other community-driven organizations are exploring data lifecycle issues
and metrics that are relevant to this discussion. The entire SPDS needs to be
completely reconceived to include representation from individuals within these
organizations who have scholarly reputations in data management and life science publication/communication.

## Additional concepts that should be included in the plan

Goal 2: Conceptually, this goal needs to clearly differentiate technological
obstacles from process obstacles. In reality, many of the needed technologies
are either in place or will be generated by sectors outside of biomedicine.
A few, perhaps, will be unique to the NIH use cases. More effort needs to be
put into understanding the workflows and processes that investigators in a
variety of contexts use to produce and consume data. This is an unaddressed
research question in this document.

## Performance measures and milestones that could be used to gauge the success of elements of the plan and inform course corrections

Goal 2: Without specific research questions, or the identification of relevant
research that can be directly applied to the use cases NIH wants to advance,
there are no additional milestones except a clearer definition of the goal.
## Any other topic the respondent feels is relevant for NIH to consider in developing this strategic plan

--------------------------------------------------------------------------------
/prelim_drafts/goal_03_datamanagement_response.md:
--------------------------------------------------------------------------------
# Goal 3: "Support the Development and Dissemination of Advanced Data Management, Analytics, and Visualization Tools"

## The appropriateness of the goals of the plan and of the strategies and implementation tactics proposed to achieve them

Goal 3: The strategy to leverage and support existing tool-sharing systems to
encourage "marketplaces" for tool developers, and to separate funding of tool
development and data generation, is very important, and we support this direction
of the NIH. This is a proven strategy to elevate high-quality tools; for
example, in the world of high-throughput genomics, one can consider the NHGRI-funded
Bioconductor Project as a decade-long successful use case in providing a
unified interface for more than 1,000 "high-quality, open-source data
management, analytics, and visualization tools".

Goal 3: In general, the implementation tactics are plausible, but not evidence-based.
Rather than propose that each step NIH takes to develop a tactic be supported by
a body of research, the lowest-hanging fruit (and most productive solution) is
to have the community develop an actual set of strategic targets, with clear
metrics for evaluation.

Goal 3: The SPDS plan underestimates the pervasiveness and persistence of
bad/outdated software and methods (see: https://www.the-scientist.com/?articles.view/articleNo/51260/title/Scientists-Continue-to-Use-Outdated-Methods/).
It is completely unclear how separating evaluation and
funding for tool development and dissemination from support for databases and
knowledgebases (this sentence from the SPDS is itself unclear) will address
this problem. This may help, but is, to our knowledge, an unvetted hypothesis.

Goal 3: Although the SPDS does not make any strategy clear, the goal of
supporting tools and workflows (objective 3-1) is a good one. We further agree
that partnership is exactly the way that this needs to be pursued.

Goal 3: The metrics for this sophisticated set of objectives are catastrophic.
There is no way that the objectives stated for Goal 3 can be effectively
measured or achieved.

## Opportunities for NIH to partner in achieving these goals

Goal 3: There are a variety of groups the NIH can partner with. The
potential individual investigators are too numerous to list, but these
individuals should be relatively easy to identify by means of their scholarly
contributions (carefully avoiding journal publications as a primary metric).
Reaching out and partnering with groups such as the Open Bioinformatics
Foundation and societies like ISCB would be an ideal way for NIH to foster deep
community involvement.

## Additional concepts that should be included in the plan

Goal 3: The plan as currently written does not mention (1) computational
reproducibility, or (2) exploratory data analysis for data quality control.
These two topics are critical for the high-level goal of "extracting
understanding from large-scale or complex biomedical research data".
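To make point (1) concrete before the detailed comments below, here is a
minimal, hypothetical sketch of the kind of provenance record that supports
computational reproducibility. The tool names, files, and record fields are
illustrative assumptions, not a proposed NIH schema:

```python
# Illustrative sketch only: capture the metadata needed to rerun an
# analysis identically. Fields and tool names are hypothetical.
import hashlib
import json
import platform
import sys
from datetime import datetime, timezone

def sha256_of(path):
    """Checksum an input file so 'identical raw data' can be verified later."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(8192), b""):
            digest.update(chunk)
    return digest.hexdigest()

def provenance_record(input_paths, tool_versions, command_line):
    """Bundle versions, checksums, and the exact invocation into one record."""
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "python_version": sys.version.split()[0],
        "platform": platform.platform(),
        "inputs": {path: sha256_of(path) for path in input_paths},
        "tool_versions": tool_versions,  # e.g., {"aligner": "2.7.1"}
        "command_line": command_line,
    }

if __name__ == "__main__":
    # Hypothetical usage: record provenance next to the analysis output.
    record = provenance_record(
        input_paths=[],  # e.g., ["counts.tsv"] for a real analysis
        tool_versions={"aligner": "2.7.1", "annotation": "GRCh38.92"},
        command_line="align --reference GRCh38 ...",
    )
    print(json.dumps(record, indent=2))
```

The comments that follow define these two topics, and their importance, in more
detail.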
Goal 3: Computational reproducibility can be defined as the ability to produce
identical results from identical data input or "raw data", and relies on
biomedical researchers keeping track of metadata regarding the versions of
tools that were used, the way in which tools were run, and the provenance and
version of publicly available annotation files, if these were used. This is very
important for data science: if two groups observe discrepancies between their
results, they absolutely must be able to identify the source, whether it be
methodological or due to different versions of software or annotation data.

Goal 3: Exploratory data analysis (EDA) needs to be a key component of the data
science plan, as it should be the first step of any data analysis involving
complex biological data. EDA is often how a data scientist will identify data
artifacts, technical biases, batch effects, outliers, unaccounted-for or
unexpected heterogeneity, the need for data transformation, or other data
quality issues that will cause serious problems for downstream methods, whether
they be statistical methods, machine learning, deep learning, artificial
intelligence, or otherwise. Downstream methods may either fail to detect the
relevant signal (loosely categorized as "false negatives") or may produce many
spurious results which are purely associations with technical aspects of the
data ("false positives"). Furthermore, basic EDA can uncover biological signal
that may otherwise be missed, such as biologically relevant heterogeneity, e.g.,
subtypes of disease with signal present in molecular data.

Goal 3: Computational reproducibility and support for EDA should be components of
both NIH-funded tool development and the plan to "Enhance Workforce
Development for Biomedical Data Science" (Goal 4).

## Performance measures and milestones that could be used to gauge the success of elements of the plan and inform course corrections

## Any other topic the respondent feels is relevant for NIH to consider in developing this strategic plan

Goal 3: There are many potential emerging technologies and themes;
understandably, this plan should not be a laundry list of things to try. This
objective needs to be reconceived to articulate how those solutions will be
collected, pursued, and evaluated. No such vision is clearly present.

--------------------------------------------------------------------------------
/prelim_drafts/goal_04_workforce_response.md:
--------------------------------------------------------------------------------
# Goal 4: "Enhance Workforce Development for Biomedical Data Science"

## The appropriateness of the goals of the plan and of the strategies and implementation tactics proposed to achieve them

Goal 4: Feldon et al. (PNAS, 2017; https://doi.org/10.1073/pnas.1705783114)
conclude that despite $28 million in investment by NSF and NIH in training
(including workshops/boot camps relevant to biomedical data science), much of
this training is "not associated with observable benefits related to skill
development, scholarly productivity, or socialization into the academic
community." Clearly, if NIH intends to do better, it needs to completely
reconceive how it approaches training in data science.

Goal 3: Computational reproducibility and support for EDA should be components
both of NIH-funded tool development and of the plan to "Enhance Workforce
Development for Biomedical Data Science" (Goal 4).

## Performance measures and milestones that could be used to gauge the success of elements of the plan and inform course corrections


## Any other topic the respondent feels is relevant for NIH to consider in developing this strategic plan

Goal 3: There are many potential emerging technologies and themes, and
understandably this plan should not be a laundry list of things to try. This
objective needs to be reconceived to articulate how those solutions will be
collected, pursued, and evaluated. No such vision is clearly present.

--------------------------------------------------------------------------------
/prelim_drafts/goal_04_workforce_response.md:
--------------------------------------------------------------------------------
# Goal 4: "Enhance Workforce Development for Biomedical Data Science"


## The appropriateness of the goals of the plan and of the strategies and implementation tactics proposed to achieve them

Goal 4: Feldon et al. (PNAS, 2017; https://doi.org/10.1073/pnas.1705783114)
conclude that despite $28 million in investment by NSF and NIH in training
(including workshops/boot camps relevant to biomedical data science), much of
this training is "not associated with observable benefits related to skill
development, scholarly productivity, or socialization into the academic
community." Clearly, if NIH intends to do better it needs to completely
reconceive how it approaches training in data science.

Goal 4: Several reasonable priorities are identified, but only demonstrably
ineffective and/or inappropriate approaches and evaluation metrics are
proposed to achieve and evaluate them.

Goal 4: The inadequacy of the suggested strategies is epitomized by the
proposed evaluation metric: "the quantity of new data science-related training
programs for NIH staff and participation in these programs." This is as
unconvincing as suggesting that a research proposal be measured by the number
of experiments performed. In fact, the only metric proposed for evaluation of
training is the number of training opportunities created. Such an arbitrary
and crude metric would get any research proposal returned from study section
without discussion. The number of training opportunities cannot be a plausible
metric: despite there being no shortage of training opportunities, from MOOCs
to workshops, there is a persistent, apparent, and urgent training gap. This
inadequate metric is a clear red flag that training guided by the proposed
plan will accomplish very little.

Goal 4: It is clear that training is under-prioritized by NIH. In the largest
survey of unmet needs among life science investigators (Barone et al., PLOS
Comp. Bio., 2017; https://doi.org/10.1371/journal.pcbi.1005755), NSF-funded
investigators report that their most unmet computational needs are not
software, infrastructure, or compute, but training: specifically, training in
the integration of multiple data types, in data and metadata management, and
in scaling analyses to HPC and cloud.

Goal 4: The strategy fails to properly understand the role of training in
biomedical data science and the need to define and measure what constitutes
effective training. There is no mention of the serious commitment to
evidence-based teaching that is needed to design effective short-format
courses and workshops. While the educational pipeline from at least the
undergraduate level onward needs serious improvement to address biomedical
data science, short-format training and workshops will play an important
role. These workshops must not be constructed ad hoc. Typically, training is
delivered by staff and faculty with a high level of bioinformatics/data
science domain expertise, but little to no grounding in andragogy, cognitive
science, or evaluation.

## Opportunities for NIH to partner in achieving these goals

Goal 4: We present these comments in the hope that NIH will consider bold
action in this area, because the problem is solvable and the community of
investigators with experience in training relevant to biomedical data science
has been thinking deeply on the topic. The community is relatively small and
well connected, and should be extensively leveraged in developing robust,
scalable solutions. It should be easy to assemble the 10-20 most important
practitioners and educators in biomedical data science, confident that they
and their second-order collaborators would constitute a reasonably sized
working group that can bring in much-needed solutions. Understandably, and by
definition for an aspirational goal, NIH has identified data science as an
area it needs to be at the forefront of but does not have the expertise to
reach alone. The community does.

Goal 4: Right now, the single best target for collaboration is the Software
and Data Carpentry community (Greg Wilson, "Software Carpentry: Lessons
Learned", F1000Research, 2016, 3:62; doi: 10.12688/f1000research.3-62.v2).
There are many reasons why collaboration here will be tremendously important
for NIH to succeed. First, the Carpentry community itself represents a global
federation of researchers in the space of computation, data science,
bioinformatics, and related fields with a strong interest in education. In
short, this is a self-selected community of hundreds to thousands of
researchers, and the available expertise in biomedical data science is well
covered within it. Additionally, there simply is no other community that has
built a sustainable and scalable approach to developing educational content
relevant to biomedical data science with strong grounding in assessment and
pedagogy. It would be a tremendous squandering of resources not to build on
this foundation.

Goal 4: This is an area that especially calls for collaboration with the
National Science Foundation. NSF already funds several strong programs
dedicated to understanding the problems of bioinformatics and biomedical data
science education and to developing solutions; the Network for Integrating
Bioinformatics into Life Sciences Education
(NIBLSE, https://qubeshub.org/groups/niblse) is just one of many. NIBLSE has,
for example, identified that the most recently trained faculty with
bioinformatics expertise are not bringing that training into the classroom,
and that lack of training is the biggest barrier to bioinformatics education
(https://www.biorxiv.org/content/early/2017/10/19/204420). The potential for
synergy here in developing the K-16 pipeline is enormous. This alliance
(especially leveraging NSF INCLUDES) could be a tremendous opportunity to do
so in a way that enhances diversity. And while there are distinct career paths
for the biomedical vs. non-human life sciences, there is almost complete
overlap in the study, preparation, and training for both student groups.

## Additional concepts that should be included in the plan

Goal 4: There are a few basic concepts that must be included, and these are at
the right level for a strategic vision document.

- Training is the most unmet need of investigators. Investments that
under-prioritize training will not realize the value of the computational and
data infrastructure they develop.
- Biomedical data science education must not be delivered solely by domain
experts/investigators training according to what they think is best.
Curriculum must be developed using evidence-based pedagogical principles.
- Collaboration is key. Training developed as the unitary creation of NIH
will fail. Training must be developed by a community that can maintain and
sustain learning content.
- Assessment is an integral part of training and cannot be ad hoc: it must
generate evidence that learning has occurred, and it must be developed within
a community of practice. This is hard; it is easy to count the number of CPU
cycles paid for on the cloud, or the size of a database, but hard to measure
learning.
- Data science must be about science, and not just data.
It is easy to
accumulate datasets, but not easy to develop training that is measurably
effective; however, it is definitely possible.
- Citizen Science is more than "individuals giving their brains for analyzing
data using computer games". There are growing communities of patients and
healthy individuals who are coming together to analyze biomedical data,
either their own or from public data resources, performing science under
their own lead (cf.
http://jme.bmj.com/content/early/2015/03/30/medethics-2015-102663 on
participant-led research). These community efforts are growing substantially
and are bound to become important stakeholders in performing additional
biomedical research. The needs of these communities should thus be addressed
as well.
- Diversity is a highly attainable goal for biomedical data science education.

## Performance measures and milestones that could be used to gauge the success of elements of the plan and inform course corrections

Goal 4: This question is difficult to answer because there first needs to be
a defined and agreed-upon set of competencies for biomedical data science.
From these will follow learning objectives and assessments of those
objectives. At the next stage will come dissemination targets and measures of
community use and buy-in. A workshop could resolve and develop these over the
course of a few months.

## Any other topic the respondent feels is relevant for NIH to consider in developing this strategic plan

Goal 4: The work being done by NSF, as well as the Data Science
recommendations developed by the National Academies, is highly relevant and
needs to be better integrated. There is also a unique connection between Data
Science and industry: industry partners will continue to lead advances in
Data Science relevant to biomedicine.

--------------------------------------------------------------------------------
/prelim_drafts/goal_05_datastewardship_response.md:
--------------------------------------------------------------------------------
# Goal 5: "Enact Appropriate Policies to Promote Stewardship and Sustainability"


## The appropriateness of the goals of the plan and of the strategies and implementation tactics proposed to achieve them

Goal 5: Here, and once in Goal 3, are the only mentions in the document of
community-driven activity (an approach that actually needs to be brought to
the entire SPDS). The FAIR Data Ecosystem is a laudable goal, but the idea
that NIH should "Strive to ensure that all data in NIH-supported data
resources are FAIR" is still a goal without a plan. More than technological
advances or implementation, this is a training activity that requires
community awareness, understanding, input, and buy-in on FAIR principles. The
implementation tactics are plausible, but without appropriate evaluation they
are ultimately too vague to establish success. Establishing open-source
licenses, or even promoting their use, will not in itself make data FAIR.
This section also misses the hard question of ELSI/privacy for biomedical
data, an area of policy NIH has not yet updated to accommodate the vision of
what biomedical data science might achieve.

Goal 5: The proposed evaluation consists of inappropriate and unrevealing
count metrics that will not indicate whether or not FAIR principles are
realized.

## Opportunities for NIH to partner in achieving these goals


## Additional concepts that should be included in the plan


## Performance measures and milestones that could be used to gauge the success of elements of the plan and inform course corrections


## Any other topic the respondent feels is relevant for NIH to consider in developing this strategic plan

--------------------------------------------------------------------------------
/prelim_drafts/preamble.md:
--------------------------------------------------------------------------------
Draft Preamble:

NIH's mission is "to seek fundamental knowledge about the nature and behavior
of living systems and the application of that knowledge to enhance health,
lengthen life, and reduce illness and disability." Data Science will play a
prominent role in fulfilling that mission in the 21st Century. Unfortunately,
and for several reasons, the current NIH Strategic Plan for Data Science
(SPDS) will not further, and may even act against, NIH's mission. On the
surface, the SPDS has identified important goals. However, careful review of
the current Plan reveals the impossibility of translating anything it
articulates into impactful, actionable, and meaningfully measurable
implementations. If this document came from any organization other than NIH,
it might be ignored. Instead, the awesome weight of NIH recommendations sets
up the plausible scenario that many investigators (within and outside of NIH)
will be guided by an incomplete, inaccurate, and dangerously misguided
document.

To see that the SPDS is fundamentally flawed, consider these representative
metrics, copied verbatim from the document:

Goal 1: Support a Highly Efficient and Effective Biomedical Research Data
Infrastructure
Sample evaluation metric: "quantity of cloud storage and computing used by
NIH and by NIH-funded researchers"

Goal 2: Promote Modernization of the Data-Resources Ecosystem
Sample evaluation metric: "quantity of databases and knowledgebases supported
using resource-based funding mechanisms"

Goal 3: Support the Development and Dissemination of Advanced Data
Management, Analytics, and Visualization Tools
Sample evaluation metric: "quantity of new software tools developed"

Goal 4: Enhance Workforce Development for Biomedical Data Science
Sample evaluation metric: "quantity of new data science-related training
programs for NIH staff and participation in these programs"

Goal 5: Enact Appropriate Policies to Promote Stewardship and Sustainability
Sample evaluation metric: "establishment and use of open-data licenses" (a
count metric)

These metrics (which are wholly representative of the evaluations proposed)
would never be accepted by a scientific reviewer. If the SPDS were considered
by one of the NIH's scientific review panels, it would certainly be designated
"Not Recommended for Further Consideration (NRFC)". Simply put, there is no
perceivable likelihood that this Strategic Plan will exert any positive
influence on the fields of Data Science or data-intensive biomedical research.
The majority of the metrics proposed are simply "counts" of the number of
activities created or performed, without any meaningful and/or independent
evaluation of their benefit.
Most of these activities have to happen anyway; e.g., the "quantity of
databases and knowledgebases supported" will increment every time NIH gives a
grant to someone proposing to collect data and create a database (following
the NIH policy on data sharing, https://grants.nih.gov/policy/sharing.htm).
There is no suggestion that the increments these metrics represent are
meaningful, or that they do or might contribute to biomedical work to "enhance
health, lengthen life, and reduce illness and disability." These evaluation
metrics are both tautological (i.e., if NIH funds research that generates
data, these metrics will increment and the Goal will appear to have been
"achieved") and vague (as long as at least one of each of these events occurs,
the Goal could be argued to have been "achieved"). Thus, almost any funded
proposal to collect a large quantity of data will be branded "successful".
While these goals are important for NIH to function as a data broker, they
actually represent characteristics to strive for throughout NIH rather than a
Strategic Plan. The plan is also surprisingly unreflective of prior NIH Data
Science activities (NCBC, BD2K) and of well-informed planning documents
emerging in parallel (e.g., the NLM Strategic Plan:
https://www.nlm.nih.gov/pubs/plan/lrp17/NLM_StrategicReport2017_2027.pdf).

Data Science is not the progressive extension of clinical research
applications, and a truly impactful NIH strategy for Data Science must by
definition come from experts in Data Science, preferably those with expertise
in both Data Science and biomedical research. As stated in SPDS Goal 4, NIH
as an organization does not have the expertise in Data Science to plan its
strategic direction in this area. We recommend that if NIH is committed to a
transformative strategy for Data Science (and it should be), the current SPDS
be discarded. A Strategic Plan with meaningful goals that promote the
thoughtful integration of Data Science with biomedical research, representing
measurable (not merely countable) impact, together with formal and
independent evaluation, should be developed instead.

Such a Plan would require a sustainable vision that accurately reflects the
discipline of Data Science and its potential role in biomedical research.
Such a vision must be community-driven, perhaps through a call for
nominations that would assemble national and international community members
with evidence of expertise in the needed areas. Given the weaknesses in the
current SPDS, this RFI has every chance of resulting in an inward-looking,
self-fulfilling prophecy (e.g., one in which any grant made to any
data-generating proposal results in "success" according to these self-defined
metrics). If comments from the Data Science and data-intensive biomedical
research community are ignored, and priority is given to existing
investigators and internal stakeholders, NIH will promote an inward-looking
and essentially irrelevant program for integrating Data Science into
biomedical research.

The Plan fails on multiple technical merits, and the community members who
have authored this document have assembled additional detailed remarks at
this URL:

https://github.com/JasonJWilliamsNY/2018_nih_datascience_rfi

Additionally, as an exercise, the NIH review form has been filled out for
this Plan to explore how an NIH scientific reviewer might realistically score
such a proposal:

https://docs.google.com/document/d/1XxQLORoTm2lkucQz6k3QdgNOARDrvU3zx_svT0PX_c0/edit

These comments are provided to support a new effort at a realistic,
plausible, and community-driven Strategic Plan for Data-Intensive Biomedical
Research. The community stands ready to assist NIH in this important work,
and we urge the organization to commit to making this a community effort.
--------------------------------------------------------------------------------