├── All_documents.pdf
├── README.md
├── StrategicPlan_NIH_review30March2018.pdf
├── combined_response.md
├── combined_response.pdf
├── jasons_submission_comment.md
├── preamble.pdf
└── prelim_drafts
    ├── goal_01_infrastructure_response.md
    ├── goal_02_dataecosystem_response.md
    ├── goal_03_datamanagement_response.md
    ├── goal_04_workforce_response.md
    ├── goal_05_datastewardship_response.md
    └── preamble.md

--------------------------------------------------------------------------------
/All_documents.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/JasonJWilliamsNY/2018_nih_datascience_rfi/64dc52a8e2153ca55f1d6fb2ce766fb4dde954f9/All_documents.pdf

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# 2018_nih_datascience_rfi
This is a short response to the 2018 RFI on the NIH Strategic Plan for Data Science.

# Purpose
On March 5th, 2018, [this post](https://nexus.od.nih.gov/all/2018/03/05/requesting-your-input-on-the-draft-nih-strategic-plan-for-data-science/) announced a request for information (RFI) on the
NIH [Strategic Plan for Data Science](https://grants.nih.gov/grants/rfi/NIH-Strategic-Plan-for-Data-Science.pdf).
Responses are due April 2nd, 2018 on [this form](https://grants.nih.gov/grants/rfi/rfi.cfm?ID=73).

**This repo contains language that you can use in your response to this RFI.**

## How to use these comments

**A [similar RFI](https://nlmdirector.nlm.nih.gov/2018/03/20/next-generation-data-science-research-challenges/) generated only 53 responses. Your voice will be heard!**

1. If possible, read the entire [Strategic Plan](https://grants.nih.gov/grants/rfi/NIH-Strategic-Plan-for-Data-Science.pdf).
   If you don't have that much time, read one of the Goal sections most relevant to
   you (individual sections are as short as 10 minutes of reading).

2. We have created specific [responses](./combined_response.md) to each of the
   Plan's stated goals. You may copy and paste all/some/none of what you agree with
   into the online NIH form (step 3).
   **The NIH form has a 500-word limit, so consider using one of the attachments**
   (you can write "see attached" in the comment box):

   - **[All_documents.pdf](./All_documents.pdf)**: PDF of the Preamble, sample NIH
     review form, and all collected comments
   - **[combined_response.pdf](combined_response.pdf)**: PDF of the Preamble and all
     collected comments
   - **[preamble.pdf](preamble.pdf)**: PDF of the Preamble alone
   - **[StrategicPlan_NIH_review30March2018.pdf](StrategicPlan_NIH_review30March2018.pdf)**: PDF of the sample NIH review

   You may modify anything you find here, or write your own response. (**Please
   consider submitting your own responses here for the community to use or review,
   even after the RFI due date.**)

3. By **April 2, 2018 11:59:59 PM EDT**, access the NIH [online form](https://grants.nih.gov/grants/rfi/rfi.cfm?ID=73) and write in and/or
   copy-paste your responses. Please follow all of the form instructions/guidelines.
   Invalid responses (e.g., automated or otherwise disqualified) will not
   be accepted.
   Only submit comments that represent your
   views; see the [guide notice instructions](https://grants.nih.gov/grants/guide/notice-files/NOT-OD-18-134.html).

**Thanks for your contribution!** We share a responsibility to ensure NIH can act
on the most accurate and best possible advice.

## How to contribute

Please submit a pull request, or (especially if you aren't familiar with
GitHub) click the issues tab to open an issue, or Tweet/email me (email =
my last name @ cshl.edu).

**Even after the due date**, please feel free to post here for others to see.

## Contributors

**Disclaimer**: Contributions are only the personal opinions of the individuals
who made them, and are not the opinions of other contributors, their employers,
or anyone else.

- Jason Williams - Cold Spring Harbor Laboratory, NY
  - **Bio sentence**: Diversity advocate, founder of the CSHL Biological Data Science Meeting,
    Software Carpentry instructor and former Foundation Chair, external consultant to the
    NIH Data Commons, #underrepresentedinSTEM

- Rochelle E. Tractenberg - Georgetown University, Washington, DC
  - **Bio sentence**: Psychometrician, biostatistician, and research methodologist, specializing
    in developing and validating difficult-to-measure outcomes, including those in biomedical research and in higher/graduate/postgraduate education. Chair (2017-2019) of the Committee on Professional Ethics of the American Statistical Association.

- Bastian Greshake Tzovaras - Lawrence Berkeley National Laboratory, CA
  - **Bio sentence**: Co-founder of openSNP, Director of Research at Open Humans, and
    Visiting Scholar at the Berkeley Lab.

- Alexander (Sasha) Wait Zaranek - Harvard University, MA
  - **Bio sentence**: Co-founder, Harvard Personal Genome Project (PGP); Chief Scientist, Curoverse Research; founder, https://arvados.org; Head of Quantified Biology, Veritas Genetics

## Other public responses

These are other responses to the RFI, posted here to represent the diversity
of opinions and in the hope of continuing the conversation. They are not
related to the contribution in this repo, and are only the opinions of their
respective authors.

- Harry Hochheiser et al., University of Pittsburgh: [link to document](https://docs.google.com/document/d/1ibfFtWu75OQelcOUj-EXLCWc1qox0xugE_bqfdVF8Kc/edit)
- Dimitri Yatsenko, Vathes LLC/Baylor College of Medicine: [link to document](https://github.com/dimitri-yatsenko/NIH-RFI-Strategic-Plan-for-Data-Science)
- Michael Love, UNC Chapel Hill: [link to document](https://github.com/mikelove/2018_nih_datascience_rfi)
- DJ Patil, former U.S. Chief Data Scientist: [link to document](https://www.linkedin.com/pulse/data-science-nih-healthcare-dj-patil/)
- Rafael Irizarry, Harvard T.H. Chan School of Public Health: [link to document](https://simplystatistics.org/2018/04/02/input-on-the-draft-nih-strategic-plan-for-data-science/)
--------------------------------------------------------------------------------
/StrategicPlan_NIH_review30March2018.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/JasonJWilliamsNY/2018_nih_datascience_rfi/64dc52a8e2153ca55f1d6fb2ce766fb4dde954f9/StrategicPlan_NIH_review30March2018.pdf

--------------------------------------------------------------------------------
/combined_response.md:
--------------------------------------------------------------------------------
## Preamble

NIH's mission is "to seek fundamental knowledge about the nature and behavior of
living systems and the application of that knowledge to enhance health, lengthen
life, and reduce illness and disability." Data Science will play a prominent
role in fulfilling that mission in the 21st Century. Unfortunately, and for
several reasons, the current NIH Strategic Plan for Data Science (SPDS) will not
further, and may even act against, NIH's mission. On the surface, the SPDS has
identified important goals. However, careful review of the current Plan reveals
the impossibility of translating anything articulated into impactful,
actionable, and meaningfully measurable implementations. If this document came
from any organization other than NIH, it might be ignored. Instead, the awesome
weight of NIH recommendations - which demands an equally awesome obligation to
the community for productive criticism - invites the plausible scenario that
many investigators (within and outside of NIH) will be guided by an incomplete,
inaccurate, and dangerously misguided document. While the tone of this document
is necessarily blunt (far more so than if feedback were addressed to an individual
investigator), it is offered in the belief that NIH does not simply want feedback
that says what it wants to hear - the NIH mission is too important for
criticisms to go unheard.

To conclude that the SPDS is fundamentally flawed, consider these
representative metrics copied verbatim from the document:

- Goal 1: Support a Highly Efficient and Effective Biomedical Research Data
  Infrastructure
  - Sample evaluation metric: "quantity of cloud storage and computing used by
    NIH and by NIH-funded researchers"

- Goal 2: Promote Modernization of the Data-Resources Ecosystem
  - Sample evaluation metric: "quantity of databases and knowledgebases
    supported using resource-based funding mechanisms"

- Goal 3: Support the Development and Dissemination of Advanced Data Management,
  Analytics, and Visualization Tools
  - Sample evaluation metric: "quantity of new software tools developed"

- Goal 4: Enhance Workforce Development for Biomedical Data Science
  - Sample evaluation metric: "quantity of new data science-related training
    programs for NIH staff and participation in these programs"

- Goal 5: Enact Appropriate Policies to Promote Stewardship and Sustainability
  - Sample evaluation metric: "establishment and use of open-data licenses"
    (a count metric)

These metrics (which are wholly representative of the evaluations proposed) would
never be accepted by a scientific reviewer.
If the SPDS were to be considered
by one of the NIH's scientific review panels, it would certainly be designated
"Not Recommended for Further Consideration (NRFC)". Simply put, there is no
perceivable likelihood that this Strategic Plan will exert any positive influence
on the fields of Data Science or data-intensive biomedical research. The
majority of metrics proposed are simply "counts" of the number of activities
created or performed, without any meaningful and/or independent evaluation of
their benefit. Most of these activities have to happen anyway - e.g., the
"quantity of databases and knowledgebases supported" will increment every time
the NIH gives a grant to someone proposing to collect data and create a database
(following the NIH policy on data sharing,
https://grants.nih.gov/policy/sharing.htm). There is no suggestion that the
increments these metrics represent are meaningful - or that they do or might
contribute to biomedical work to "enhance health, lengthen life, and reduce
illness and disability." These evaluation metrics are both tautological (i.e.,
if the NIH funds research that generates data, these metrics will increment and
the Goal will appear to have been "achieved") and vague (as long as at least
one of each of these events occurs, the Goal could be argued to have been
"achieved"). Thus, almost any funded proposal to collect a large quantity of
data will be branded as "successful". While these goals are important for
the NIH to function as a data broker, they actually represent characteristics
to strive for throughout NIH, rather than a Strategic Plan. This plan is
surprisingly unreflective of prior NIH Data Science activities (NCBC, BD2K) and
even of praiseworthy and well-informed planning documents emerging in parallel
(NLM Strategic Plan: https://www.nlm.nih.gov/pubs/plan/lrp17/NLM_StrategicReport2017_2027.pdf).

Data Science is not the progressive extension of clinical research applications,
and a truly impactful NIH strategy for Data Science must by definition come from
experts in Data Science - preferably, those with expertise in both Data Science
and biomedical research. As stated in SPDS Goal 4, NIH as an organization does
not have the expertise in Data Science to plan its strategic direction in this
area. We recommend that if NIH is committed to a transformative strategy for
Data Science (and it should be), the current SPDS should be discarded. A
Strategic Plan with meaningful goals that promote the thoughtful integration of
Data Science with biomedical research, representing measurable (not "countable")
impact, together with formal and independent evaluation, should be developed
instead.

Such a Plan would require a sustainable vision that accurately reflects the
discipline of Data Science and its potential role in biomedical research. Such
a vision must be community-driven, perhaps through a call for nominations that
would assemble national and international community members with evidence of
expertise in the needed areas. Given the weaknesses in the current SPDS, this
RFI has every chance of resulting in an inward-looking, self-fulfilling prophecy
(e.g., such that any grant made to any data-generating proposal results
in "success" according to these self-defined metrics).
If the Data Science and
data-intensive biomedical research community comments are ignored, and priority
is given to existing investigators and internal stakeholders, the NIH will
promote an inward-looking and essentially irrelevant program for integrating
Data Science into biomedical research.

The Plan fails on several technical merits, and the community members
who have authored this document have assembled additional detailed remarks at
this URL:

https://github.com/JasonJWilliamsNY/2018_nih_datascience_rfi

Additionally, as an exercise, the NIH review form has been filled out for this
Plan to explore how a realistic NIH scientific reviewer (who did the exercise)
might score such a proposal:

https://docs.google.com/document/d/1XxQLORoTm2lkucQz6k3QdgNOARDrvU3zx_svT0PX_c0/edit#

These comments are provided to support a new effort at a realistic, plausible,
and community-driven Strategic Plan for Data-Intensive Biomedical Research. The
community stands ready to assist NIH in this important work, and we urge the
organization to commit to making this a community effort. We thank NIH for the
work done so far and look forward to the development of the next stages of this plan.

## The appropriateness of the goals of the plan and of the strategies and implementation tactics proposed to achieve them

Goal 1: NIH should be applauded for recognizing that, historically, funding of
data resources used funding approaches that were appropriate for research
projects. We agree this must change. We also agree that tool development
and data resources may often need distinct funding contexts/expectations.

Goal 1: We agree with the stated need to support the hardening and optimization
(user-friendliness, computational efficiency, etc.) of innovative tools and
algorithms.

Goal 1: The Data Commons has been driven by bottom-up development from
the community. The Commons is in its early stages and should be allowed to
function as is. The Commons is the appropriate testbed for many of the
technological innovations and processes that NIH may ultimately wish to explore
at broader scales after sufficient development.

Goal 1: The implementation tactics proposed seem almost randomly selected,
making it nearly impractical to criticize them. For example, one of the three
bullet points indicates that new technologies should be adapted (somewhat
meaningless, but not disagreeable), but then goes on to mention that GPUs should
be used. That is weirdly specific, and not necessarily wrong, but why mention it at
this level of detail without some specific vision for the real biomedical
informatics problems that are relevant here? This is like saying "calculus"
should be used. Maybe; but such an out-of-context statement is ultimately a
collection of buzzwords. At best, this is an indication that greater expertise
is needed to reformulate this document.

Goal 1: Although this is a strategy document, it is implausible to imagine that the
linking of NIH datasets as described in objective 1-2 can be elucidated in a
single thin paragraph. We have no idea how the "Biomedical Data Translator" will
work, but can only imagine it will need to function at least as well as the
"Universal Translator" of Star Trek.
Either enough detail for the strategy needs
to be proposed here for the document to serve its purpose, or the space is
better spent articulating the hard problems that need fixing.

Goal 2: The overall goal recognized (the need to avoid data siloing) is
absolutely a correct (but difficult) target for NIH. It is impossible to believe
that anything in the implementation or metrics will demonstrably achieve this.
What appears to be one problem is likely several problems, each of which is
worthy of study to generate actionable solutions. Usability is identified as a
target to optimize for, and while this may be correct, no metrics proposed seem
to measure it. While there may be real and important distinctions between
databases and knowledgebases, we again suggest no convincing metrics have been
proposed here. The statement that "Funding approaches used for databases and
knowledgebases will be appropriate..." conveys no usable information.

Goal 2: For some reason, objective 2-2 contrasts the "small-scale" datasets
produced by individual laboratories with the "high-value" datasets generated
by NIH-funded consortia. Besides the needless condescension here,
this kind of contrast betrays the problem of designing data systems in
which there are various classes of "citizenship." Treating the much larger
quantity of data generated by the plurality of extramural investigators as
somehow different may lead to policies which work for the NIH but don't work for
the community, leading to irreconcilable standards.

Goal 2: The implementation tactics proposed are bewildering. Under this
subheading appears the bullet "Ensure privacy and security." While it is
gratifying that the word privacy finally appears 11 pages into this 26-page
document, this is not in any way an implementation tactic.

Goal 2: The evaluation metrics are horrific. Quantity of datasets and funded
databases does nothing more than account for the fact that NIH spent money.

Goal 3: The strategy to leverage and support existing tool-sharing systems to
encourage "marketplaces" for tool developers, and to separate funding of tool
development and data generation, is very important, and we support this direction
of the NIH. This is a proven strategy to elevate high-quality tools; for
example, in the world of high-throughput genomics, one can consider the NHGRI-funded
Bioconductor Project as a decade-long successful use case in providing a
unified interface for more than 1,000 "high-quality, open-source data
management, analytics, and visualization tools".

Goal 3: In general, the implementation tactics are plausible, but not evidence-based.
Rather than propose that each step NIH takes to develop a tactic be supported by
a body of research, the lowest-hanging fruit (and most productive solution) is
to have the community develop an actual set of strategic targets, with clear
metrics for evaluation.

Goal 3: The SPDS plan underestimates the pervasiveness and persistence of
bad/outdated software and methods (see: https://www.the-scientist.com/?articles.view/articleNo/51260/title/Scientists-Continue-to-Use-Outdated-Methods/).
It is completely unclear how separating evaluation and
funding for tool development and dissemination from support for databases and
knowledgebases (this sentence from the SPDS is itself unclear) will address
this problem. This may help, but is, to our knowledge, an unvetted hypothesis.

Goal 3: Although the SPDS does not make any strategy clear, the goal of
supporting tools and workflows (objective 3-1) is a good one. We further agree
that partnership is exactly the way that this needs to be pursued.

Goal 3: The metrics proposed for this sophisticated set of objectives are
catastrophic. There is no way that the objectives stated for Goal 3 can be
effectively measured or set a useful standard for success.

Goal 4: Feldon et al. (PNAS, 2017; https://doi.org/10.1073/pnas.1705783114)
conclude that despite $28 million in investment by NSF and NIH in training
(including workshops/boot camps relevant to biomedical data science), much of
this training is "not associated with observable benefits related to skill
development, scholarly productivity, or socialization into the academic
community." Clearly, if NIH intends to do better, it needs to completely
reconceive how it approaches training in data science.

Goal 4: Several reasonable priorities are identified, but only
demonstrably ineffective and/or inappropriate approaches and evaluation metrics
are proposed to achieve/evaluate these goals.

Goal 4: The inadequacy of the strategies suggested is epitomized by the proposed
evaluation metric: "the quantity of new data science-related training programs
for NIH staff and participation in these programs." This is as unconvincing as
suggesting a research proposal be measured by the number of experiments
performed. In fact, the only metric proposed for evaluation of training is the
number of training opportunities created. Such an arbitrary and crude metric
would get any research proposal returned from study section without discussion.
Number of training opportunities cannot be a plausible metric. Despite there
being no shortage of training opportunities, from MOOCs to workshops, there is a
persistent, apparent, and urgent training gap. This inadequate metric is a clear
red flag that training guided by the proposed plan will accomplish very little.

Goal 4: It is clear that training is under-prioritized by NIH. In the largest
survey on the unmet needs of life science investigators, NSF investigators report
in Barone et al. (PLOS Comp. Bio, 2017; https://doi.org/10.1371/journal.pcbi.1005755)
that their most unmet computational needs are not software, infrastructure, or
compute. Instead, it is training; specifically, training in the
integration of multiple data types, data and metadata management, and scaling
analyses to HPC and cloud.

Goal 4: The strategy fails to properly understand the role of training in
biomedical data science and the need to define and measure what constitutes
effective training. There is no mention of the serious commitment to
evidence-based teaching that is needed to design effective short-format courses
and workshops.
While the educational pipeline from at least the undergraduate
level and beyond needs serious improvement to address biomedical data science,
short-format training and workshops will play an important role. These workshops
must not be constructed ad hoc. Typically, training is delivered by
staff/faculty with a high level of bioinformatics/data science domain expertise,
but little to no guidance in andragogy, cognitive science, or evaluation.

Goal 5: This, and one instance in Goal 3, are the only mentions in the document
of community-driven activity (an approach that actually needs to be brought to
the entire SPDS). The FAIR Data Ecosystem is a laudable goal, but the idea that
NIH should "Strive to ensure that all data in NIH-supported data resources are
FAIR" is still a goal without a plan. More than technological advances or
implementation, this is a training activity that requires community awareness,
understanding, input, and buy-in on FAIR principles. The implementation tactics
are plausible but, without appropriate evaluation, ultimately too vague to
establish success. Establishing open-source licenses, or even promoting their
use, won't in itself achieve FAIR. This section also misses the hard question of
ELSI/privacy and biomedical data, for which NIH policy has not yet been updated
to accommodate the vision of what biomedical data science might achieve.

Goal 5: The evaluation consists of inappropriate, unrevealing count metrics that
will not indicate whether FAIR principles are realized or not.

## Opportunities for NIH to partner in achieving these goals

Goal 1: NSF has been exploring centralized computing models through XSEDE, and
open-science clouds (CyVerse Atmosphere, XSEDE-Jetstream), for many years. These
groups would be natural partners in addition to commercial cloud providers. The
NSF resources will not match the capacity of commercial cloud, but they have been
optimized for the science use-cases and user profiles relevant to biomedical research.

Goal 2: ASAPbio, Force11, Open Science Framework, Zenodo, FigShare, bioRxiv,
and many other community-driven organizations are exploring data lifecycle issues
and metrics that are relevant to this discussion. The entire SPDS needs to be
completely reconceived to include representation from individuals within these
organizations who have scholarly reputations in data management and life science publication/communication.

Goal 3: There are a variety of groups the NIH can partner with. The
potential individual investigators are too numerous to list, but these
individuals should be relatively easy to identify by means of their scholarly
contributions (carefully avoiding journal publications as a primary metric).
Reaching out and partnering with groups such as the Open Bioinformatics
Foundation and societies like ISCB would be an ideal way for NIH to foster deep
community involvement.

Goal 4: We present comments here in the hope that NIH will consider bold action
in this area, because the problem is solvable and the community of investigators
with experience in training relevant to biomedical data science has been
thinking deeply on the topic. The community is relatively small and well-connected,
and should be extensively leveraged in developing robust, scalable solutions.
It
should be easy to assemble the 10-20 most important practitioners and educators
in biomedical data science, confident that they and their second-order
collaborators would constitute a reasonably sized working group that can bring
in much-needed solutions. Understandably - in fact, by definition, as an
aspirational goal - NIH has identified the value in prioritizing data science
as an area it needs to be at the forefront of, but it does not have the
expertise to achieve this alone. The community does.

Goal 4: Right now, the single best target for collaboration is the Software and
Data Carpentry community (Greg Wilson: "Software Carpentry: Lessons Learned",
F1000Research, 2016, 3:62; doi: 10.12688/f1000research.3-62.v2). There are many
reasons why collaboration here will be tremendously important for NIH to
succeed. First, the Carpentry community itself represents a global federation
of researchers in the space of computation, data science, bioinformatics, and
related fields with a strong interest in education. In short, this is a
self-selected community of hundreds to thousands of researchers; the
expertise available in biomedical data science is well-covered in this community.
Additionally, there simply is no other community that has built a sustainable
and scalable approach to building educational content relevant to biomedical
data science with strong grounding in assessment and pedagogy. It would be
a tremendous squandering of resources not to build on this foundation.

Goal 4: This is an area that especially calls for collaboration with the
National Science Foundation. NSF already has several strong funded programs that
are dedicated to understanding the problems of bioinformatics education and
biomedical data science and to developing solutions - the Network for Integrating
Bioinformatics Education into Life Science
(NIBLSE, https://qubeshub.org/groups/niblse) is just one of many. NIBLSE has,
for example, identified that the most recently trained faculty with bioinformatics
expertise are not bringing that training into the classroom, and that lack of
training is the biggest barrier to bioinformatics education
(https://www.biorxiv.org/content/early/2017/10/19/204420). The potential for
synergy here is enormous for developing the K-16 pipeline. This alliance
(especially leveraging NSF INCLUDES) could be a tremendous opportunity to do so
in a way that enhances diversity. And while there are distinct career paths for
biomedical vs. non-human life sciences, there is almost complete overlap in the
study, preparation, and training for both student groups.

## Additional concepts that should be included in the plan

Goal 1: Grafting "Data Science" onto NIH is essentially a massive retrofitting
exercise. If we had to pick one area to focus on, it is emerging techniques
(long-read sequencing, machine learning approaches, CRISPR, etc.) and how NIH
manages these data; this could be a primary target for envisioning
a data science-friendly ecosystem. The community of users is smaller, and
fixing emerging challenges seems like a manageable focus for fostering
community consensus.

Goal 2: Conceptually, this goal needs to clearly differentiate technological
obstacles from process obstacles.
In reality, many of the needed technologies
are either in place or will be generated by sectors outside of biomedicine.
A few, perhaps, will be unique to the NIH use cases. More effort needs to be
put into understanding the workflows and processes that investigators in a
variety of contexts use to produce and consume data. This is an unaddressed
research question in this document.

Goal 3: The plan as currently written does not mention (1) computational
reproducibility, or (2) exploratory data analysis for data quality control.
These two topics are critical for the high-level goal of "extracting
understanding from large-scale or complex biomedical research data".

Goal 3: Computational reproducibility can be defined as the ability to produce
identical results from identical data input or "raw data", and relies on
biomedical researchers keeping track of metadata regarding the versions of
tools that were used, the way in which tools were run, and the provenance and
version of publicly available annotation files, if these were used. This is very
important for data science: if two groups observe discrepancies between their
results, they absolutely must be able to identify the source, whether it be
methodological or due to different versions of software or annotation data.

Goal 3: Exploratory data analysis (EDA) needs to be a key component of the data
science plan, as it should be the first step of any data analysis involving
complex biological data. EDA is often how a data scientist will identify data
artifacts, technical biases, batch effects, outliers, unaccounted-for or
unexpected heterogeneity, the need for data transformation, or other data
quality issues that will cause serious problems for downstream methods, whether
they be statistical methods, machine learning, deep learning, artificial
intelligence, or otherwise. In particular, machine learning and statistical
methods rely on the quality of the metadata and the ability to provide
consistent terms and judgments in describing samples across all datasets, from
consortia and from individual labs. Downstream methods may either fail to detect
the relevant signal (loosely categorized as "false negatives") or may produce
many spurious results which are purely associations with technical aspects of
the data ("false positives"). Furthermore, basic EDA can uncover biological
signal that may otherwise be missed, such as biologically relevant heterogeneity,
e.g., subtypes of disease with signal present in molecular data.

Goal 3: Computational reproducibility and support for EDA should be components of
both NIH-funded tool development and the plan to "Enhance Workforce
Development for Biomedical Data Science" (Goal 4).

Goal 4: There are a few basic concepts that must be included, and these are
potentially at the right level for a strategic vision document:

- Training is the most unmet need of investigators. Investments that
  under-prioritize training will not realize the value of the computational and
  data infrastructure developed.
- Biomedical data science education must not be solely delivered by domain
  experts/investigators training according to what they think is best. Instead,
  curriculum must be developed using evidence-based pedagogical principles.
- Collaboration is key. Training that is developed as the unitary creation of
  NIH will fail. Training must be developed by a community that can maintain and
  sustain learning content.
- Assessment is an integral part of training and cannot be ad hoc; it must
  generate evidence that learning has occurred, and it must be developed within a
  community-of-practice framework. This is hard - it is easy to count the number
  of CPU cycles paid for on the cloud, or the size of a database.
- Data science must be about science, and not just data. It is easy to
  accumulate datasets, but not easy to develop training that is measurably
  effective - however, it is definitely possible.
- Citizen Science is more than "individuals giving their brains for analyzing
  data using computer games". There are growing communities of patients and
  healthy individuals who are coming together to analyze biomedical data, either
  their own or using public data resources, to perform science under their own
  lead (cf. http://jme.bmj.com/content/early/2015/03/30/medethics-2015-102663
  on participant-led research). These community efforts are growing
  substantially, and these groups are bound to become important stakeholders in
  performing additional biomedical research. The needs of these communities
  should thus be addressed.
- Diversity is a highly attainable goal for biomedical data science education.

## Performance measures and milestones that could be used to gauge the success of elements of the plan and inform course corrections

Goal 1: The proposed evaluation metrics are horrific. These metrics are more
appropriate for cloud providers, who capture the described metrics to develop
their invoices. While NIH should look to drive down communal costs, Goal 1,
like all of the goals in this document, is a hard research problem that
requires deep thought to understand what success is.

Goal 2: Without specific research questions, or the identification of relevant
research that can be directly applied to the use cases NIH wants to advance,
there are no additional milestones except a clearer definition of the goal.

Goal 4: This question is difficult to answer because there needs to be a defined
and agreed-upon set of competencies for biomedical data science. From these will
follow learning objectives and assessments for those objectives. At the next
stage will come dissemination targets and measures of community use and buy-in.
A workshop could resolve and develop these over the course of a few months.

## Any other topic the respondent feels is relevant for NIH to consider in developing this strategic plan

Goal 1: The SPDS correctly identifies that "The generation of most
biomedical data is highly distributed and is accomplished mainly by individual
scientists or relatively small groups of researchers." This should be followed
by the conclusion that any top-down approach must be matched by a
correspondingly large-scale bottom-up approach. Individual investigators need
the training and support to generate data in a way that fulfills the promise of
FAIR principles. The Strategic Plan quotes the famous (and regrettable)
statistic that 80% of Data Science is cleaning data, yet nothing proposed in
this document will solve this.
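To illustrate what this routine "cleaning" looks like in practice - and the kind
of teachable, assessable skill the training discussed here should impart -
consider a minimal sketch of a metadata completeness check. The field names and
controlled vocabulary below are invented for illustration, not a proposed
standard:

```python
# Illustrative sketch only: a minimal metadata completeness/consistency check.
# The required fields and the controlled vocabulary are hypothetical.
REQUIRED_FIELDS = {"sample_id", "organism", "tissue", "assay", "collection_date"}
ALLOWED_ASSAYS = {"RNA-seq", "ATAC-seq", "WGS"}

def validate_record(record):
    """Return a list of problems found in one sample's metadata dict."""
    problems = [f"missing field: {f}" for f in sorted(REQUIRED_FIELDS - record.keys())]
    if "assay" in record and record["assay"] not in ALLOWED_ASSAYS:
        problems.append(f"uncontrolled assay term: {record['assay']!r}")
    return problems

samples = [
    {"sample_id": "S1", "organism": "human", "tissue": "liver",
     "assay": "RNA-seq", "collection_date": "2018-01-15"},
    {"sample_id": "S2", "organism": "human", "assay": "rnaseq"},  # a "dirty" record
]

for sample in samples:
    for problem in validate_record(sample):
        print(sample.get("sample_id", "?"), "-", problem)
```

Nothing in such a check is technically hard; what is missing for most
investigators is the training and community support that make it habitual.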
While NIH is in a position to pioneer (and
appropriately fund) hard infrastructure (computation/storage, etc.), greater
attention must be paid to funding soft infrastructure - the training,
documentation, and support that bring investigators into a community of
practice. Note that Barone et al. (https://doi.org/10.1371/journal.pcbi.1005755)
replicate the earlier findings of EDUCAUSE
(https://net.educause.edu/ir/library/pdf/ers0605/rs/ers0605w.pdf): organizations
planning for cyberinfrastructure development tend to underestimate
and underfund the training needed to use that infrastructure. Any infrastructure
development must be matched by clear, measurable learning outcomes to ensure
that investigators can actually make the intended use of the investments.

Goal 3: There are many potential emerging technologies and themes;
understandably, this plan should not be a laundry list of things to try. This
objective needs to be reconceived to articulate how those solutions will be
collected, pursued, and evaluated. No such vision is clearly present.

Goal 4: The work being done by NSF, as well as the Data Science recommendations
developed by the National Academies, is highly relevant and needs to be better
integrated. There is also a unique connection between Data Science and industry.
Industry partners will continue to lead advances in Data Science relevant to
biomedicine.

--------------------------------------------------------------------------------
/combined_response.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/JasonJWilliamsNY/2018_nih_datascience_rfi/64dc52a8e2153ca55f1d6fb2ce766fb4dde954f9/combined_response.pdf

--------------------------------------------------------------------------------
/jasons_submission_comment.md:
--------------------------------------------------------------------------------
Please see the attachment, containing a collaboratively developed response that
I led. Community updates to that response will be maintained at this URL: https://github.com/JasonJWilliamsNY/2018_nih_datascience_rfi. These
to-the-point comments are compiled in the realization that NIH has a major influence
over the direction of biomedical data science, and with sincere trust that NIH
is committed to the highest standards of academic rigor, commitment to its own
mission, and responsibility to the U.S. taxpayer. Although every effort has been
made for this response to consist only of productive and objective criticisms,
I will express a personal and subjective comment here: my current and past
dealings with NIH personnel have demonstrated them to be passionately committed
to the same high standards. Every person and every organization
that aspires to be more than it is will reach a point where its own resources
cannot take it in the direction it needs to go. I hope that NIH will consider
engaging the entire, diverse biomedical data science community to achieve its
greatest success in leveraging data science for the betterment of human health;
it *will* take all of us.
--------------------------------------------------------------------------------
/preamble.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/JasonJWilliamsNY/2018_nih_datascience_rfi/64dc52a8e2153ca55f1d6fb2ce766fb4dde954f9/preamble.pdf

--------------------------------------------------------------------------------
/prelim_drafts/goal_01_infrastructure_response.md:
--------------------------------------------------------------------------------
# Goal 1: "Support a Highly Efficient and Effective Biomedical Research Data Infrastructure"

## The appropriateness of the goals of the plan and of the strategies and implementation tactics proposed to achieve them

Goal 1: NIH should be applauded for recognizing that, historically, funding of
data resources used funding approaches that were appropriate for research
projects. We agree this must change. We also agree that tool development and
data resources may often need distinct funding contexts/expectations.

Goal 1: We agree with the stated need to support the hardening and optimization
(user-friendliness, computational efficiency, etc.) of innovative tools and
algorithms.

Goal 1: The Data Commons has generally been driven by bottom-up development from
the community. The Commons is in its early stages and should be allowed to
function as is. The Commons is the appropriate testbed for many of the
technological innovations and processes that NIH may ultimately wish to explore
at broader scales after sufficient development.

Goal 1: The implementation tactics proposed seem almost randomly selected,
making it nearly impractical to criticize them. For example, one of the three
bullet points indicates that new technologies should be adapted (somewhat
meaningless, but not disagreeable), but then goes on to mention that GPUs should
be used. That is weirdly specific, and not necessarily wrong, but why mention it at
this level of detail without some specific vision for the real biomedical
informatics problems that are relevant here? This is like saying "calculus"
should be used. Maybe; but such an out-of-context statement is ultimately a
collection of buzzwords. At best, this is an indication that greater expertise
is needed to reformulate this document.

Goal 1: Although this is a strategy document, it is bewildering to imagine that the
linking of NIH datasets as described in objective 1-2 can be elucidated in a
single thin paragraph. I have no idea how the "Biomedical Data Translator" will
work, but can only imagine it will need to function at least as well as the
"Universal Translator" of Star Trek. Either enough detail for the strategy needs
to be proposed here for the document to serve its purpose, or the space is
better spent articulating the hard problems that need fixing.

## Opportunities for NIH to partner in achieving these goals

Goal 1: NSF has been exploring centralized computing models through XSEDE, and
open-science clouds (CyVerse Atmosphere, XSEDE-Jetstream), for many years. These
groups would be natural partners in addition to commercial cloud providers. The
NSF resources will not match the capacity of commercial cloud, but they have been
optimized for the science use-cases and user profiles relevant to biomedical research.
## Additional concepts that should be included in the plan

Goal 1: Grafting "Data Science" onto NIH is essentially a massive retrofitting
exercise. If I had to pick one area to focus on, it is emerging techniques
(long-read sequencing, machine learning approaches, CRISPR, etc.) and how we
treat the related data; this should be a primary target before these data
essentially get too big to go back and fix. The community of users is smaller, and
fixing emerging challenges seems like a more manageable focus for fostering
community consensus.

## Performance measures and milestones that could be used to gauge the success of elements of the plan and inform course corrections

Goal 1: The proposed evaluation metrics are horrific. These metrics are more
appropriate for cloud providers, who capture the described metrics to develop
their invoices. While NIH should look to drive down communal costs, Goal 1,
like all of the goals in this document, is a hard research problem that
requires deep thought to understand what success is.

## Any other topic the respondent feels is relevant for NIH to consider in developing this strategic plan

Goal 1: The SPDS correctly identifies that "The generation of most
biomedical data is highly distributed and is accomplished mainly by individual
scientists or relatively small groups of researchers." This should be followed
by the conclusion that any top-down approach must be matched by a
correspondingly large-scale bottom-up approach. Individual investigators need
the training and support to generate data in a way that fulfills the promise of
FAIR principles. The Strategic Plan quotes the famous (and regrettable)
statistic that 80% of Data Science is cleaning data, yet nothing proposed in
this document will solve this. While NIH is in a position to pioneer (and
appropriately fund) hard infrastructure (computation/storage, etc.), greater
attention must be paid to funding soft infrastructure - the training,
documentation, and support that bring investigators into a community of
practice. Note that Barone et al. (https://doi.org/10.1371/journal.pcbi.1005755)
replicate the earlier findings of EDUCAUSE
(https://net.educause.edu/ir/library/pdf/ers0605/rs/ers0605w.pdf): organizations
planning for cyberinfrastructure development tend to underestimate
and underfund the training needed to use that infrastructure. Any infrastructure
development must be matched by clear, measurable learning outcomes to ensure
that investigators can actually make the intended use of the investments.

--------------------------------------------------------------------------------
/prelim_drafts/goal_02_dataecosystem_response.md:
--------------------------------------------------------------------------------
# Goal 2: "Promote Modernization of the Data-Resources Ecosystem"

## The appropriateness of the goals of the plan and of the strategies and implementation tactics proposed to achieve them

Goal 2: The overall goal recognized - the need to avoid data siloing - is
absolutely a correct (but difficult) target for NIH. It is impossible to believe
that anything in the implementation or metrics will demonstrably achieve this.
What appears to be one problem is likely several problems, each of which is
worthy of study to generate actionable solutions.
Usability is identified as a
target to optimize for, and while this may be correct, no metrics proposed seem
to measure it. While there may be real and important distinctions between
databases and knowledgebases, we again suggest no convincing metrics have been
proposed here. The statement that "Funding approaches used for databases and
knowledgebases will be appropriate..." conveys no usable information.

Goal 2: For some reason, objective 2-2 contrasts the "small-scale" datasets
produced by individual laboratories with the "high-value" datasets generated
by NIH-funded consortia. Besides the needless condescension here,
this kind of contrast betrays the problem of designing data systems in
which there are various classes of "citizenship." Treating the much larger
quantity of data generated by the plurality of extramural investigators as
somehow different may lead to policies which work for the NIH but don't work for
the community, leading to irreconcilable standards.

Goal 2: The implementation tactics proposed are bewilderingly lazy. Under this
subheading appears the bullet "Ensure privacy and security." While it is
gratifying that the word privacy finally appears 11 pages into this 26-page
document, this is not a tactic and transgresses the bounds of sense.

Goal 2: The evaluation metrics are horrific. Quantity of datasets and funded
databases does nothing more than account for the fact that NIH spent money.

## Opportunities for NIH to partner in achieving these goals

Goal 2: ASAPbio, Force11, Open Science Framework, Zenodo, FigShare, bioRxiv,
and many other community-driven organizations are exploring data lifecycle issues
and metrics that are relevant to this discussion. The entire SPDS needs to be
completely reconceived to include representation from individuals within these
organizations who have scholarly reputations in data management and life science publication/communication.

## Additional concepts that should be included in the plan

Goal 2: Conceptually, this goal needs to clearly differentiate technological
obstacles from process obstacles. In reality, many of the needed technologies
are either in place or will be generated by sectors outside of biomedicine.
A few, perhaps, will be unique to the NIH use cases. More effort needs to be
put into understanding the workflows and processes that investigators in a
variety of contexts use to produce and consume data. This is an unaddressed
research question in this document.

## Performance measures and milestones that could be used to gauge the success of elements of the plan and inform course corrections

Goal 2: Without specific research questions, or the identification of relevant
research that can be directly applied to the use cases NIH wants to advance,
there are no additional milestones except a clearer definition of the goal.
## Any other topic the respondent feels is relevant for NIH to consider in developing this strategic plan

--------------------------------------------------------------------------------
/prelim_drafts/goal_03_datamanagement_response.md:
--------------------------------------------------------------------------------
# Goal 3: "Support the Development and Dissemination of Advanced Data Management, Analytics, and Visualization Tools"

## The appropriateness of the goals of the plan and of the strategies and implementation tactics proposed to achieve them

Goal 3: The strategy to leverage and support existing tool-sharing systems to
encourage "marketplaces" for tool developers, and to separate funding of tool
development and data generation, is very important, and we support this direction
of the NIH. This is a proven strategy to elevate high-quality tools; for
example, in the world of high-throughput genomics, one can consider the NHGRI-funded
Bioconductor Project as a decade-long successful use case in providing a
unified interface for more than 1,000 "high-quality, open-source data
management, analytics, and visualization tools".

Goal 3: In general, the implementation tactics are plausible, but not evidence-based.
Rather than propose that each step NIH takes to develop a tactic be supported by
a body of research, the lowest-hanging fruit (and most productive solution) is
to have the community develop an actual set of strategic targets, with clear
metrics for evaluation.

Goal 3: The SPDS plan underestimates the pervasiveness and persistence of
bad/outdated software and methods (see: https://www.the-scientist.com/?articles.view/articleNo/51260/title/Scientists-Continue-to-Use-Outdated-Methods/).
It is completely unclear how separating evaluation and
funding for tool development and dissemination from support for databases and
knowledgebases (this sentence from the SPDS is itself unclear) will address
this problem. This may help, but is, to our knowledge, an unvetted hypothesis.

Goal 3: Although the SPDS does not make any strategy clear, the goal of
supporting tools and workflows (objective 3-1) is a good one. We further agree
that partnership is exactly the way that this needs to be pursued.

Goal 3: The metrics for this sophisticated set of objectives are catastrophic.
There is no way that the objectives stated for Goal 3 can be effectively
measured or achieved.

## Opportunities for NIH to partner in achieving these goals

Goal 3: There are a variety of groups the NIH can partner with. The
potential individual investigators are too numerous to list, but these
individuals should be relatively easy to identify by means of their scholarly
contributions (carefully avoiding journal publications as a primary metric).
Reaching out and partnering with groups such as the Open Bioinformatics
Foundation and societies like ISCB would be an ideal way for NIH to foster deep
community involvement.

## Additional concepts that should be included in the plan

Goal 3: The plan as currently written does not mention (1) computational
reproducibility, or (2) exploratory data analysis for data quality control.
These two topics are critical for the high-level goal of "extracting
understanding from large-scale or complex biomedical research data".
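To make point (1) concrete before the detailed comments below, here is a
minimal, hypothetical sketch of the kind of provenance record that supports
computational reproducibility. The tool names, files, and record fields are
illustrative assumptions, not a proposed NIH schema:

```python
# Illustrative sketch only: capture the metadata needed to rerun an
# analysis identically. Fields and tool names are hypothetical.
import hashlib
import json
import platform
import sys
from datetime import datetime, timezone

def sha256_of(path):
    """Checksum an input file so 'identical raw data' can be verified later."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(8192), b""):
            digest.update(chunk)
    return digest.hexdigest()

def provenance_record(input_paths, tool_versions, command_line):
    """Bundle versions, checksums, and the exact invocation into one record."""
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "python_version": sys.version.split()[0],
        "platform": platform.platform(),
        "inputs": {path: sha256_of(path) for path in input_paths},
        "tool_versions": tool_versions,  # e.g., {"aligner": "2.7.1"}
        "command_line": command_line,
    }

if __name__ == "__main__":
    # Hypothetical usage: record provenance next to the analysis output.
    record = provenance_record(
        input_paths=[],  # e.g., ["counts.tsv"] for a real analysis
        tool_versions={"aligner": "2.7.1", "annotation": "GRCh38.92"},
        command_line="align --reference GRCh38 ...",
    )
    print(json.dumps(record, indent=2))
```

The comments that follow define these two topics, and their importance, in more
detail.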
Goal 3: Computational reproducibility can be defined as the ability to produce
identical results from identical data input or "raw data", and relies on
biomedical researchers keeping track of metadata regarding the versions of
tools that were used, the way in which tools were run, and the provenance and
version of publicly available annotation files, if these were used. This is very
important for data science: if two groups observe discrepancies between their
results, they absolutely must be able to identify the source, whether it be
methodological or due to different versions of software or annotation data.

Goal 3: Exploratory data analysis (EDA) needs to be a key component of the data
science plan, as it should be the first step of any data analysis involving
complex biological data. EDA is often how a data scientist will identify data
artifacts, technical biases, batch effects, outliers, unaccounted-for or
unexpected heterogeneity, the need for data transformation, or other data
quality issues that will cause serious problems for downstream methods, whether
they be statistical methods, machine learning, deep learning, artificial
intelligence, or otherwise. Downstream methods may either fail to detect the
relevant signal (loosely categorized as "false negatives") or may produce many
spurious results which are purely associations with technical aspects of the
data ("false positives"). Furthermore, basic EDA can uncover biological signal
that may otherwise be missed, such as biologically relevant heterogeneity, e.g.,
subtypes of disease with signal present in molecular data.

Goal 3: Computational reproducibility and support for EDA should be components of
both NIH-funded tool development and the plan to "Enhance Workforce
Development for Biomedical Data Science" (Goal 4).

## Performance measures and milestones that could be used to gauge the success of elements of the plan and inform course corrections

## Any other topic the respondent feels is relevant for NIH to consider in developing this strategic plan

Goal 3: There are many potential emerging technologies and themes;
understandably, this plan should not be a laundry list of things to try. This
objective needs to be reconceived to articulate how those solutions will be
collected, pursued, and evaluated. No such vision is clearly present.

--------------------------------------------------------------------------------
/prelim_drafts/goal_04_workforce_response.md:
--------------------------------------------------------------------------------
# Goal 4: "Enhance Workforce Development for Biomedical Data Science"

## The appropriateness of the goals of the plan and of the strategies and implementation tactics proposed to achieve them

Goal 4: Feldon et al. (PNAS, 2017; https://doi.org/10.1073/pnas.1705783114)
conclude that despite $28 million in investment by NSF and NIH in training
(including workshops/boot camps relevant to biomedical data science), much of
this training is "not associated with observable benefits related to skill
development, scholarly productivity, or socialization into the academic
community." Clearly, if NIH intends to do better, it needs to completely
reconceive how it approaches training in data science.

Goal 3: Computational reproducibility and support for EDA should be components
both of NIH-funded tool development and of the plan to "Enhance Workforce
Development for Biomedical Data Science" (Goal 4).

## Performance measures and milestones that could be used to gauge the success of elements of the plan and inform course corrections


## Any other topic the respondent feels is relevant for NIH to consider in developing this strategic plan

Goal 3: There are many potential emerging technologies and themes, and
understandably this plan should not be a laundry list of things to try. This
objective needs to be reconceived to articulate how those solutions will be
collected, pursued, and evaluated. No such vision is clearly present.

--------------------------------------------------------------------------------
/prelim_drafts/goal_04_workforce_response.md:
--------------------------------------------------------------------------------
# Goal 4: "Enhance Workforce Development for Biomedical Data Science"


## The appropriateness of the goals of the plan and of the strategies and implementation tactics proposed to achieve them

Goal 4: Feldon et al. (PNAS, 2017; https://doi.org/10.1073/pnas.1705783114)
conclude that despite $28 million in investment by NSF and NIH in training
(including workshops/boot camps relevant to biomedical data science), much of
this training is "not associated with observable benefits related to skill
development, scholarly productivity, or socialization into the academic
community." Clearly, if NIH intends to do better it needs to completely
reconceive how it approaches training in data science.

Goal 4: Several reasonable priorities are identified, but only demonstrably
ineffective and/or inappropriate approaches and evaluation metrics are
proposed to achieve and evaluate them.

Goal 4: The inadequacy of the suggested strategies is epitomized by the
proposed evaluation metric: "the quantity of new data science-related training
programs for NIH staff and participation in these programs." This is as
unconvincing as suggesting that a research proposal be measured by the number
of experiments performed. In fact, the only metric proposed for evaluation of
training is the number of training opportunities created. Such an arbitrary
and crude metric would get any research proposal returned from study section
without discussion. The number of training opportunities cannot be a plausible
metric: despite there being no shortage of training opportunities, from MOOCs
to workshops, there is a persistent, apparent, and urgent training gap. This
inadequate metric is a clear red flag that training guided by the proposed
plan will accomplish very little.

Goal 4: It is clear that training is under-prioritized by NIH. In the largest
survey of unmet needs among life science investigators (Barone et al., PLOS
Comp. Bio., 2017; https://doi.org/10.1371/journal.pcbi.1005755), NSF-funded
investigators report that their most unmet computational needs are not
software, infrastructure, or compute, but training: specifically, training in
the integration of multiple data types, in data and metadata management, and
in scaling analyses to HPC and cloud.

Goal 4: The strategy fails to properly understand the role of training in
biomedical data science and the need to define and measure what constitutes
effective training. There is no mention of the serious commitment to
evidence-based teaching that is needed to design effective short-format
courses and workshops. While the educational pipeline from at least the
undergraduate level onward needs serious improvement to address biomedical
data science, short-format training and workshops will play an important
role. These workshops must not be constructed ad hoc. Typically, training is
delivered by staff and faculty with a high level of bioinformatics/data
science domain expertise, but little to no grounding in andragogy, cognitive
science, or evaluation.

## Opportunities for NIH to partner in achieving these goals

Goal 4: We present these comments in the hope that NIH will consider bold
action in this area, because the problem is solvable and the community of
investigators with experience in training relevant to biomedical data science
has been thinking deeply on the topic. The community is relatively small and
well connected, and should be extensively leveraged in developing robust,
scalable solutions. It should be easy to assemble the 10-20 most important
practitioners and educators in biomedical data science, confident that they
and their second-order collaborators would constitute a reasonably sized
working group that can bring in much-needed solutions. Understandably, and by
definition for an aspirational goal, NIH has identified data science as an
area it needs to be at the forefront of but does not have the expertise to
reach alone. The community does.

Goal 4: Right now, the single best target for collaboration is the Software
and Data Carpentry community (Greg Wilson, "Software Carpentry: Lessons
Learned", F1000Research, 2016, 3:62; doi: 10.12688/f1000research.3-62.v2).
There are many reasons why collaboration here will be tremendously important
for NIH to succeed. First, the Carpentry community itself represents a global
federation of researchers in the space of computation, data science,
bioinformatics, and related fields with a strong interest in education. In
short, this is a self-selected community of hundreds to thousands of
researchers, and the available expertise in biomedical data science is well
covered within it. Additionally, there simply is no other community that has
built a sustainable and scalable approach to developing educational content
relevant to biomedical data science with strong grounding in assessment and
pedagogy. It would be a tremendous squandering of resources not to build on
this foundation.

Goal 4: This is an area that especially calls for collaboration with the
National Science Foundation. NSF already funds several strong programs
dedicated to understanding the problems of bioinformatics and biomedical data
science education and to developing solutions; the Network for Integrating
Bioinformatics into Life Sciences Education
(NIBLSE, https://qubeshub.org/groups/niblse) is just one of many. NIBLSE has,
for example, identified that the most recently trained faculty with
bioinformatics expertise are not bringing that training into the classroom,
and that lack of training is the biggest barrier to bioinformatics education
(https://www.biorxiv.org/content/early/2017/10/19/204420). The potential for
synergy here in developing the K-16 pipeline is enormous. This alliance
(especially leveraging NSF INCLUDES) could be a tremendous opportunity to do
so in a way that enhances diversity. And while there are distinct career paths
for the biomedical vs. non-human life sciences, there is almost complete
overlap in the study, preparation, and training for both student groups.

## Additional concepts that should be included in the plan

Goal 4: There are a few basic concepts that must be included, and these are at
the right level for a strategic vision document.

- Training is the most unmet need of investigators. Investments that
under-prioritize training will not realize the value of the computational and
data infrastructure they develop.
- Biomedical data science education must not be delivered solely by domain
experts/investigators training according to what they think is best.
Curriculum must be developed using evidence-based pedagogical principles.
- Collaboration is key. Training developed as the unitary creation of NIH
will fail. Training must be developed by a community that can maintain and
sustain learning content.
- Assessment is an integral part of training and cannot be ad hoc: it must
generate evidence that learning has occurred, and it must be developed within
a community of practice. This is hard; it is easy to count the number of CPU
cycles paid for on the cloud, or the size of a database, but hard to measure
learning.
- Data science must be about science, and not just data.
It is easy to
accumulate datasets, but not easy to develop training that is measurably
effective; however, it is definitely possible.
- Citizen Science is more than "individuals giving their brains for analyzing
data using computer games". There are growing communities of patients and
healthy individuals who are coming together to analyze biomedical data,
either their own or from public data resources, performing science under
their own lead (cf.
http://jme.bmj.com/content/early/2015/03/30/medethics-2015-102663 on
participant-led research). These community efforts are growing substantially
and are bound to become important stakeholders in performing additional
biomedical research. The needs of these communities should thus be addressed
as well.
- Diversity is a highly attainable goal for biomedical data science education.

## Performance measures and milestones that could be used to gauge the success of elements of the plan and inform course corrections

Goal 4: This question is difficult to answer because there first needs to be
a defined and agreed-upon set of competencies for biomedical data science.
From these will follow learning objectives and assessments of those
objectives. At the next stage will come dissemination targets and measures of
community use and buy-in. A workshop could resolve and develop these over the
course of a few months.

## Any other topic the respondent feels is relevant for NIH to consider in developing this strategic plan

Goal 4: The work being done by NSF, as well as the Data Science
recommendations developed by the National Academies, is highly relevant and
needs to be better integrated. There is also a unique connection between Data
Science and industry: industry partners will continue to lead advances in
Data Science relevant to biomedicine.

--------------------------------------------------------------------------------
/prelim_drafts/goal_05_datastewardship_response.md:
--------------------------------------------------------------------------------
# Goal 5: "Enact Appropriate Policies to Promote Stewardship and Sustainability"


## The appropriateness of the goals of the plan and of the strategies and implementation tactics proposed to achieve them

Goal 5: Here, and once in Goal 3, are the only mentions in the document of
community-driven activity (an approach that actually needs to be brought to
the entire SPDS). The FAIR Data Ecosystem is a laudable goal, but the idea
that NIH should "Strive to ensure that all data in NIH-supported data
resources are FAIR" is still a goal without a plan. More than technological
advances or implementation, this is a training activity that requires
community awareness, understanding, input, and buy-in on FAIR principles. The
implementation tactics are plausible, but without appropriate evaluation they
are ultimately too vague to establish success. Establishing open-source
licenses, or even promoting their use, will not in itself make data FAIR.
This section also misses the hard question of ELSI/privacy for biomedical
data, an area of policy NIH has not yet updated to accommodate the vision of
what biomedical data science might achieve.

Goal 5: The proposed evaluation consists of inappropriate and unrevealing
count metrics that will not indicate whether or not FAIR principles are
realized.

## Opportunities for NIH to partner in achieving these goals


## Additional concepts that should be included in the plan


## Performance measures and milestones that could be used to gauge the success of elements of the plan and inform course corrections


## Any other topic the respondent feels is relevant for NIH to consider in developing this strategic plan

--------------------------------------------------------------------------------
/prelim_drafts/preamble.md:
--------------------------------------------------------------------------------
Draft Preamble:

NIH's mission is "to seek fundamental knowledge about the nature and behavior
of living systems and the application of that knowledge to enhance health,
lengthen life, and reduce illness and disability." Data Science will play a
prominent role in fulfilling that mission in the 21st Century. Unfortunately,
and for several reasons, the current NIH Strategic Plan for Data Science
(SPDS) will not further, and may even act against, NIH's mission. On the
surface, the SPDS has identified important goals. However, careful review of
the current Plan reveals the impossibility of translating anything it
articulates into impactful, actionable, and meaningfully measurable
implementations. If this document came from any organization other than NIH,
it might be ignored. Instead, the awesome weight of NIH recommendations sets
up the plausible scenario that many investigators (within and outside of NIH)
will be guided by an incomplete, inaccurate, and dangerously misguided
document.

To see that the SPDS is fundamentally flawed, consider these representative
metrics, copied verbatim from the document:

Goal 1: Support a Highly Efficient and Effective Biomedical Research Data
Infrastructure
Sample evaluation metric: "quantity of cloud storage and computing used by
NIH and by NIH-funded researchers"

Goal 2: Promote Modernization of the Data-Resources Ecosystem
Sample evaluation metric: "quantity of databases and knowledgebases supported
using resource-based funding mechanisms"

Goal 3: Support the Development and Dissemination of Advanced Data
Management, Analytics, and Visualization Tools
Sample evaluation metric: "quantity of new software tools developed"

Goal 4: Enhance Workforce Development for Biomedical Data Science
Sample evaluation metric: "quantity of new data science-related training
programs for NIH staff and participation in these programs"

Goal 5: Enact Appropriate Policies to Promote Stewardship and Sustainability
Sample evaluation metric: "establishment and use of open-data licenses" (a
count metric)

These metrics (which are wholly representative of the evaluations proposed)
would never be accepted by a scientific reviewer. If the SPDS were considered
by one of the NIH's scientific review panels, it would certainly be designated
"Not Recommended for Further Consideration (NRFC)". Simply put, there is no
perceivable likelihood that this Strategic Plan will exert any positive
influence on the fields of Data Science or data-intensive biomedical research.
The majority of the metrics proposed are simply "counts" of the number of
activities created or performed, without any meaningful and/or independent
evaluation of their benefit.
Most of these activities have to happen anyway; e.g., the "quantity of
databases and knowledgebases supported" will increment every time NIH gives a
grant to someone proposing to collect data and create a database (following
the NIH policy on data sharing, https://grants.nih.gov/policy/sharing.htm).
There is no suggestion that the increments these metrics represent are
meaningful, or that they do or might contribute to biomedical work to "enhance
health, lengthen life, and reduce illness and disability." These evaluation
metrics are both tautological (i.e., if NIH funds research that generates
data, these metrics will increment and the Goal will appear to have been
"achieved") and vague (as long as at least one of each of these events occurs,
the Goal could be argued to have been "achieved"). Thus, almost any funded
proposal to collect a large quantity of data will be branded "successful".
While these goals are important for NIH to function as a data broker, they
actually represent characteristics to strive for throughout NIH rather than a
Strategic Plan. The plan is also surprisingly unreflective of prior NIH Data
Science activities (NCBC, BD2K) and of well-informed planning documents
emerging in parallel (e.g., the NLM Strategic Plan:
https://www.nlm.nih.gov/pubs/plan/lrp17/NLM_StrategicReport2017_2027.pdf).

Data Science is not the progressive extension of clinical research
applications, and a truly impactful NIH strategy for Data Science must by
definition come from experts in Data Science, preferably those with expertise
in both Data Science and biomedical research. As stated in SPDS Goal 4, NIH
as an organization does not have the expertise in Data Science to plan its
strategic direction in this area. We recommend that if NIH is committed to a
transformative strategy for Data Science (and it should be), the current SPDS
be discarded. A Strategic Plan with meaningful goals that promote the
thoughtful integration of Data Science with biomedical research, representing
measurable (not merely countable) impact, together with formal and
independent evaluation, should be developed instead.

Such a Plan would require a sustainable vision that accurately reflects the
discipline of Data Science and its potential role in biomedical research.
Such a vision must be community-driven, perhaps through a call for
nominations that would assemble national and international community members
with evidence of expertise in the needed areas. Given the weaknesses in the
current SPDS, this RFI has every chance of resulting in an inward-looking,
self-fulfilling prophecy (e.g., one in which any grant made to any
data-generating proposal results in "success" according to these self-defined
metrics). If comments from the Data Science and data-intensive biomedical
research community are ignored, and priority is given to existing
investigators and internal stakeholders, NIH will promote an inward-looking
and essentially irrelevant program for integrating Data Science into
biomedical research.

The Plan fails on multiple technical merits, and the community members who
have authored this document have assembled additional detailed remarks at
this URL:

https://github.com/JasonJWilliamsNY/2018_nih_datascience_rfi

Additionally, as an exercise, the NIH review form has been filled out for
this Plan to explore how an NIH scientific reviewer might realistically score
such a proposal:

https://docs.google.com/document/d/1XxQLORoTm2lkucQz6k3QdgNOARDrvU3zx_svT0PX_c0/edit

These comments are provided to support a new effort at a realistic,
plausible, and community-driven Strategic Plan for Data-Intensive Biomedical
Research. The community stands ready to assist NIH in this important work,
and we urge the organization to commit to making this a community effort.
--------------------------------------------------------------------------------