├── .github
│   └── CODEOWNERS
├── meetings
│   ├── 3-meeting-2023-03-21
│   │   ├── scythe-overview.pdf
│   │   └── minutes.md
│   ├── 4-meeting-2023-05-02
│   │   └── minutes.md
│   ├── 2-meeting-2023-01-19
│   │   └── minutes.md
│   ├── 5-meeting-2023-05-30
│   │   └── minutes.md
│   └── 1-meeting-2022-11-30
│       └── minutes.md
├── .gitignore
├── LICENSE
└── README.md
/.github/CODEOWNERS: -------------------------------------------------------------------------------- 1 | * @ml-evs @PeterKraus @davidelbert 2 | -------------------------------------------------------------------------------- /meetings/3-meeting-2023-03-21/scythe-overview.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/marda-alliance/metadata_extractors/HEAD/meetings/3-meeting-2023-03-21/scythe-overview.pdf -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | # Compiled Object files 2 | *.o 3 | *.obj 4 | 5 | # Compiled Dynamic libraries 6 | *.so 7 | *.dylib 8 | *.dll 9 | 10 | # Compiled Static libraries 11 | *.a 12 | *.lib 13 | 14 | # Executables 15 | *.exe 16 | 17 | # DUB 18 | .dub 19 | docs.json 20 | __dummy.html 21 | docs/ 22 | 23 | # Code coverage 24 | *.lst 25 | -------------------------------------------------------------------------------- /meetings/4-meeting-2023-05-02/minutes.md: -------------------------------------------------------------------------------- 1 | # Progress meeting 4 2 | 3 | 02/05/2023 4 | 5 | ## Agenda 6 | 7 | Intro from Peter 8 | ELN roundtable discussion of feature wishlist 9 | 10 | ## Minutes 11 | 12 | Peter Kraus: Introduced the current state of the registry, API prototype and schema. 13 | 14 | Suggestions from Kevin Jablonka: 15 | 16 | - Reusability and validation as main focus: is the parser tested, how is it 17 | tested and how reliable is it? 
18 | - Able to use code without providing much boilerplate 19 | 20 | Peter: 21 | - possibility of MaRDA permanent infrastructure for running the registry 22 | - output schema is still a target but perhaps not in current WG 23 | 24 | Matthew: containerisation and construction of environments is important: how 25 | about things outside Python: JS/WASM/docker? 26 | 27 | Discussion about output formats: option to specify some useful common formats 28 | 29 | Markus: wishlist -- must be open in the output format 100% 30 | 31 | Nicolas: ELN developer's wishlist -- adding custom JSON to arbitrary files that 32 | are added to a sample/experiment. Want to deploy a container that can scrape 33 | arbitrary files. 34 | 35 | Steffen: also ELN developer, in agreement with Nicolas but needs metadata instead of data -- i.e., 36 | should be a case switch, wishes the ability to see a thumbnail of the data. Shouldn't 37 | decide meaning of data, instead allow it to be mutable. Also something pip 38 | installable for programmatic use. 39 | 40 | Matthew: maybe having a monolithic container baked with the registry as a build engine 41 | 42 | Markus: 43 | - Three execution pathways proposed: "external" service, local docker, local python package. 44 | - Very tricky to provide all three reliably, especially due to dependencies. 45 | - For Nomad, important to ensure scalability to ~million files -> single docker instance might not work 46 | 47 | Steffen: how to deal with multiple parsers for a given file: some rating 48 | process. Also all the different APIs for parsers are an issue, can we capture 49 | that? 50 | 51 | Peter: API and schema repository have some progress towards this 52 | 53 | Nicolas: 54 | - common API is not feasible for all parsers, 55 | - ELN users should not be exposed to API related issues at all 56 | - keep it simple by choosing/bundling one extractor per filetype 57 | 58 | Markus: file type detection is crucial, NOMAD could contribute its own detection system. 
Needs to be a combination of file extension, regex etc. 60 | -------------------------------------------------------------------------------- /meetings/3-meeting-2023-03-21/minutes.md: -------------------------------------------------------------------------------- 1 | # Progress meeting 2 | 3 | 21/03/2023 4 | 5 | ## Agenda 6 | 7 | - Introductions (5 mins) 8 | - Case study: Scythe (Logan Ward) 9 | - Discussion 10 | - Summary of progress on `api` in MaRDA Extractors WG (Matthew Evans) 11 | - Open discussion 12 | 13 | ## Case study: Scythe: an extractor library you might like 14 | 15 | Presented by [Logan Ward](https://github.com/WardLT). Unfortunately we forgot to record the talk. See [Scythe GH repo](https://github.com/materials-data-facility/scythe) as well as the [slides](scythe-overview.pdf). 16 | 17 | - originally called MaterialsIO ➝ renamed to Scythe due to Google's material design SEO 18 | - designed to introduce some standardisation into internal metadata extraction at Argonne 19 | - group ➝ extract ➝ adapt pipeline 20 | - Group files that belong together, e.g. in/out files ➝ logic is extractor specific 21 | - Extract into a documentable format - currently JSON 22 | - Adapt for translating between known filetypes/formats 23 | - Python interface 24 | - Design principle: filesystem ➝ database should work in 1 line of code 25 | 26 | 1. Question (Matthew Evans): Self describing schema - how is this implemented? 27 | 28 | Answer: For now, in a JSONSchema as it is verifiable and human-readable. 29 | 30 | 2. Question (Matthew Evans): Handling groups of files - how does the user select the appropriate filetype? 31 | 32 | Answer: Better illustrated using an example: 33 | - 1) each file is treated separately 34 | - 2) each particular datatype has its own extractor 35 | - 3) supplemental files are pulled out by the extractor itself 36 | Filetype matching is done by prefixes or postfixes (e.g. for VASP). 
37 | 38 | - Manifesto: summaries of data, not necessarily lossless 39 | - Well designed contributor guide 40 | - Key feature: autodiscovery of Scythe-compatible Extractors on the current host via Stevedore 41 | 42 | 3. Question (Peter Kraus): Is the "losslessness" of the data described somewhere? 43 | 44 | Answer: If it's not in the documentation, then it is not. 45 | 46 | 4. Question (Ken Kroenlein): Universal datastructures might not work well. How about data dictionaries and standards or similar technologies to describe the structure of the data? 47 | 48 | Answer: This is currently an Extractor-level decision, to drive adoption. 49 | 50 | ## Summary of progress on `api` in MaRDA Extractors WG 51 | 52 | Matthew gave a quick run-down of the current draft of the `Extractor` execution API. Please see the PR [#5](https://github.com/marda-alliance/metadata_extractors_api/pull/5) for details, review and comments. 53 | 54 | -------------------------------------------------------------------------------- /meetings/2-meeting-2023-01-19/minutes.md: -------------------------------------------------------------------------------- 1 | # Progress meeting 2 | 3 | 19/01/2023 4 | 5 | ## Agenda 6 | 7 | - Introductions (5 mins) 8 | - Summary of main goals (Matthew Evans) 9 | - Progress in WP1, WP2, WP3 (Matthew Evans) 10 | - Case study: yadg (Peter Kraus) 11 | - Open discussion 12 | 13 | ## Minutes from Discussion 14 | 15 | 1. David Elbert: 16 | - Suggestion: Please register for the [annual MaRDA meeting](https://www.marda-alliance.org/blog-2/marda2023/). Poster sessions for ECRs are scheduled, with prizes! 17 | 18 | --- 19 | 20 | 2. Matthew Evans & Peter Kraus: 21 | - Presentation of current progress and case study on [yadg](https://github.com/dgbowl/yadg) ([slides](https://docs.google.com/presentation/d/1nQD7cwEG67W5MAVjXXb-6nUklDg1AcxYs8j7WRsJjIc/edit#slide=id.g1e03a09dbfb_0_16)) 22 | 23 | 2. 
a) Casper Andresen: 24 | - Question: Will the transformation step in our pipeline be based on semantics, or only a pure syntax translation? 25 | 26 | 2. b) Matthew Evans: 27 | - Answer: LinkML allows linking to other schemas, such as Dublin Core or Schema.org. Therefore, capturing semantics should be possible. Contributions from experts are welcome! 28 | 29 | --- 30 | 31 | 3. a) Logan Ward: 32 | - Question: Interesting choice of going with a "dataschema" for yadg. What exactly is it? What are the limitations? 33 | 34 | 3. b) Peter Kraus: 35 | - Answer: It defines the "folder structure" of files related to a single experiment, as opposed to the layout within the individual files. User adoption is quite hard, and very often you just want the raw data, without bundling it. 36 | 37 | 3. c) Logan Ward: 38 | - Suggestion: Efforts such as this one (metadata extractors) could lead to a nice recursion, where your extractor taps into other linked tools. 39 | 40 | --- 41 | 42 | 4. a) Steffen Brinckmann: 43 | - Question: Is passing parameters into extractors (e.g. information about background etc.) considered for the schemas? 44 | 45 | 4. b) Peter Kraus: 46 | - Answer: From my point of view, not at the extractor level, which should not be post-processing the data values. 47 | 48 | 4. c) Matthew Evans: 49 | - Answer: This is a very application-dependent question. Parameters are likely to strongly depend on the extractors themselves. If we support it, it must be optional. 50 | 51 | --- 52 | 53 | 5. a) Ken Kroenlein: 54 | - Suggestion: For extractor output, the data dictionaries WG of MaRDA can provide a controlled vocabulary. 55 | 56 | 5. b) Peter Kraus: 57 | - Answer: I will be in touch to figure out how I can implement this in my code! 58 | 59 | --- 60 | 61 | 6. Closing remarks: 62 | - MaRDA annual meeting on 25. - 27. Feb. 2023 63 | - have a first tagged schema by then (ME) 64 | - have an example implementation in yadg (PK) 65 | - Office hours: 66 | - fortnightly (approximately), informal 67 | - next one on 25. Jan 2023, 15:00 UTC 68 | - suggest a date/time if the dates planned don't suit! 69 | 70 | 71 | 72 | -------------------------------------------------------------------------------- /meetings/5-meeting-2023-05-30/minutes.md: -------------------------------------------------------------------------------- 1 | # Progress meeting 5 2 | 3 | 30/05/2023 4 | 5 | ## Agenda 6 | 7 | - Intro from Matthew Evans 8 | - Overview of NOMAD Parsers (Lauri Himanen) 9 | - Demo of API from Matthew Evans 10 | 11 | ## Minutes 12 | 13 | ### MaRDA WG status summary - Matthew Evans 14 | - Quick summary of MaRDA Extractors WG 15 | - Overview of the 3 repos, incl. [schema](https://github.com/marda-alliance/metadata_extractors_schema), [registry](https://github.com/marda-alliance/metadata_extractors_registry), and [api](https://github.com/marda-alliance/metadata_extractors_api) 16 | - Overview of the current `Filetypes` and `Extractors` in the registry 17 | 18 | **See also [Demo of API](#demo-of-api-videos) videos below!** 19 | 20 | ### NOMAD Parsers - Lauri Himanen 21 | 22 | 1. What is NOMAD: 23 | - RDM platform 24 | - covers simulations, experiments, workflows 25 | - funded by German NFDI 26 | - [nomad-lab.eu](https://nomad-lab.eu) is a freely available, open, central repository 27 | - NOMAD Oasis is a self-hosted instance 28 | 2. Getting started with NOMAD: 29 | - `upload` files 30 | - add / edit using ELN interface 31 | - publish to get a DOI 32 | 3. 
Parser infrastructure in NOMAD: 33 | - act on uploaded files and turn them into `entries` 34 | - `entries` can be searched, analysed, have a known structure -> implies a `schema` 35 | - each parser has to define its own `schema` 36 | - NOMAD has its own Pydantic-like schema language called NOMAD metainfo 37 | - parsers are triggered on upload of a file, matching by using: 38 | - file extension 39 | - file mimetype 40 | - file contents (e.g. header) 41 | - one file is usually one `entry`, but sometimes one file is many `entries` 42 | - reading of auxiliary files in an upload is handled by the parser 43 | 4. Parser plugins 44 | - basic NOMAD has ~60 parsers pre-installed, mostly for electronic structure calculations 45 | - defining custom parsers is possible via a `plugin` mechanism 46 | - `plugins` may be integrated into the central service after a review 47 | - plugins have to have: 48 | - a schema definition in a specified location 49 | - the parsing code and file matching logic in a specified location 50 | - the general `schema` can be extended by the `parser` 51 | - the infrastructure uses a lot of regex to perform matching of quantities and filetypes 52 | - parsing is performed by passing the path of the file 53 | - `nomad.yaml`: a configuration file for the plugin 54 | 55 | ### Q/A: 56 | - Peter: About auxiliary files - must be uploaded together, or can NOMAD ask for them? 57 | - Lauri: They must be uploaded together. Their usage may be documented in the README. The parser can emit a debug/log message, but the user is responsible for uploading all files together. 58 | 59 | --- 60 | - Nicolas: Use of regex on plaintext files does not sound efficient. Is this really the best way? Wouldn't it be better to fix QM codes upstream? 61 | - Lauri: Some QM codes are moving away from text files, but progress is slow. There is also the important issue of legacy QM data, which has to be addressed somehow. 62 | 63 | --- 64 | - Peter: Are you aware of QCSchema? 
65 | - Lauri: No, not yet. 66 | 67 | --- 68 | - Matthew: Overlap with MaRDA WG. How would we go about validating plugins? 69 | - Lauri: The plugin mechanism in NOMAD is very new. A registry and an authority marking/reviewing plugins would be useful. 70 | 71 | --- 72 | - Matthew: How about sandboxing plugin code? 73 | - Lauri: Sandboxing is tricky; on a self-hosted instance it is a question for the Oasis admin 74 | 75 | 76 | ### Open Discussion: 77 | - Matthew: Suggestion to skip next month's (July) meeting in favour of writing & working. 78 | - Peter: Agreed. 79 | 80 | --- 81 | - Steffen: Multiple people have different goals. Focusing on a single extractor for a single filetype is perhaps too ambitious. 82 | - Matthew: Yes, this is currently the goal of the WG. 83 | 84 | --- 85 | - Steffen: Focus should be on getting more examples. This requires making the parser submission process simple and comfortable enough even for lazy people... 86 | - Peter: A website frontend to avoid boilerplate is on the TODO list, however, manpower is a problem. 87 | 88 | --- 89 | - Steffen: Review process of extractors should currently be very streamlined; as long as things don't overwrite other people's work, we should allow things in. 90 | - Matthew: Sandboxing at some level might be necessary, as otherwise it's a big safety issue. 
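The matching strategy described in the parser-infrastructure notes above (file extension, mimetype, file contents) can be sketched roughly as follows. The parser names, extensions and header signatures below are invented for illustration, and mimetype checking is omitted for brevity:

```python
import re
from pathlib import Path

# Illustrative table only: real matching rules live in each parser's
# plugin metadata, not in a hard-coded list like this.
MATCHERS = [
    ("demo-vasp", {".xml"}, re.compile(rb"<modeling>")),
    ("demo-biologic", {".mpr"}, re.compile(rb"BIO-LOGIC MODULAR FILE")),
]


def match_parsers(path: Path) -> list[str]:
    """Return names of parsers whose extension AND header signature both match."""
    header = path.read_bytes()[:2048]  # only sniff the start of the file
    return [
        name
        for name, extensions, signature in MATCHERS
        if path.suffix.lower() in extensions and signature.search(header)
    ]
```

Requiring both the extension and a content signature to agree is one way to avoid firing a parser on a file that merely has a suggestive name.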
91 | 92 | ### Demo of API videos: 93 | - [API Intro and `biologic-mpr` example](https://drive.google.com/file/d/10v6uE6By0bm3Z2Sfno1ULM6I8itX75uF/view?usp=drive_web) 94 | - [Parsing `agilent-dx` using API](https://drive.google.com/file/d/1XeTYR14dJORUGIZnYzoc1FRM0eWUNSVu/view?usp=drive_web) -------------------------------------------------------------------------------- /meetings/1-meeting-2022-11-30/minutes.md: -------------------------------------------------------------------------------- 1 | # Kick-off meeting 2 | 3 | 30/11/2022 4 | 5 | ## Agenda 6 | 7 | - Introductions (5 mins) 8 | - Initial slides by Matthew Evans (15 mins) 9 | - Q&A 10 | - Open discussion 11 | 12 | ## Minutes from Q&A & Discussion 13 | 14 | 1. Matthew Evans: Introductory presentation. Recorded ([video](https://www.youtube.com/watch?v=6x5Ow-CLRWg), [slides](https://docs.google.com/presentation/d/1QTqDEO3H1s_wAtcE9DD6k1eCJ7GzvdV2spjHE94bEQw/edit?usp=sharing)) 15 | 16 | 2. a) Markus Scheidgen: 17 | - Question: Will slides be shared? 18 | - Suggestion: A goal can be to establish a service registry with a searchable `filetype -> tools available` mapping. 19 | - Comment: Skeptical about "chaining parsers" due to low chance of success. 20 | 21 | 2. b) Matthew Evans: 22 | - Comment: A way of coordination might be to have a discussion for each WP. 23 | - Answer: Slides will be published. 24 | 25 | 4. a) Steffen Brinckmann: 26 | - Question: Transformation WP concerns not only data, but also filetype? 27 | - Suggestion: CWL is used in the chemistry field: 28 | - wrappable around Python 29 | - requires a server/service 30 | - Question: What happens when the scientist wants to "add" processing (area under curve) as metadata? 31 | - Question: Will it be possible to edit/customise extractors? 32 | 33 | 4. b) Matthew Evans: 34 | - Answer: Yes, filetypes are included; workflow & interoperability is well covered by the WPs. 
35 | - Answer: Optional features for extractors: 36 | - individual decision of parsers, implemented in parsers 37 | - good use-case for parser chaining (see 2a) 38 | - Comment: Note the discussion "Prior Art" on GitHub. 39 | 40 | 5. a) Ken Kroenlein: 41 | - Comment: Concerned that a parser that does `file -> JSON/XML` is already a full ETL pipeline. 42 | - Suggestion: Reworking of schema (WP1) to be a lot tighter & low level, as the `file -> bytes in memory` step is where the "magic" and pipelining might happen. This would also allow piecing parsers together. 43 | 44 | 5. b) David Elbert: 45 | - Answer: Important to check scope of WPs as we go. 46 | - Answer: Boundary of data vs metadata, or what is defined as a derived value, is currently defined by the filetypes or parsers. Stick to that definition for now. 47 | 48 | 6. a) Ken Kroenlein & Jim Warren: 49 | - Comment: Just achieving a schema (WP1) is a lot of work and would be a win. 50 | 51 | 6. b) Peter Kraus: 52 | - Comment: Registry (WP3) is a "low hanging fruit" that can be done with little work. 53 | 54 | 6. c) Steffen Brinckmann: 55 | - Question: Order of action: Should we start with schema (WP1) and then move to API (WP2), a "discussion based" approach, or should we start with WP2 and move to WP1, an "evolutionary approach"? 56 | 57 | 6. d) Markus Scheidgen: 58 | - Comment: Schema (WP1) is key to be able to find the right parser. 59 | - Comment: Registry (WP3) is not super valuable without WP1, as it's just a page with tools; it would need at least WP1, but ideally WP2, to be useful. 60 | - Comment: We shouldn't worry about chaining tools at this moment, as that avoids specifying a parser output schema. 61 | 62 | 6. e) Jim Warren: 63 | - Comment: Figuring out schema (WP1) is key. 64 | - Examples are very important. 65 | - Need to identify what's broken in each parser. 66 | 67 | 7. 
a) Nicholas Carpi: 68 | - Suggestion: Vision, as an ELN developer, is to be able to: 69 | - parse a user-supplied file using an "external service" 70 | - added value for the user, e.g. visualisation, post-processing 71 | - Comment: Huge amount of prior art, but no common framework. 72 | - Question: What's the difference between the goals of this WG and the Clowder framework? 73 | 74 | 7. b) David Elbert: 75 | - Answer: Unfortunate that Clowder folk are not present. Will try to get them involved. 76 | 77 | 7. c) Matthew Evans: 78 | - Answer: Point taken about existing projects & Clowder. Risk of creating yet another standard. 79 | - Question: What would an ELN developer like from this "external service"? Would we need to provide hardening? 80 | 81 | 7. d) Nicholas Carpi: 82 | - Answer: "External" as in not within the ELN, but not off-site. Everything should be done locally, via e.g. an HTTPS REST service: `post file -> get data` 83 | - Answer: It would be great to only have to use one service, with reasonable and understandable errors. 84 | - Suggestion: Maintenance of a single repo including all parsers can be tricky, so a plugin architecture might be the way to go. 85 | 86 | 7. e) Peter Kraus: 87 | - Comment: There is a lot of prior art, some of it at various stages of bit rot. Added benefit of this WG would be testing of parsers via CI to detect broken/bitrotten parts of code. 88 | 89 | 7. f) Ken Kroenlein: 90 | - Concern: Think about a long term home for CI/CD. Relying on external services & running out of budget is a good recipe for failure. 91 | 92 | 7. g) David Elbert: 93 | - Answer: While MaRDA itself cannot provide funding, we can try to arrange long-term funding as part of the work of this or a future WG. 94 | 95 | 
Closing remarks: 96 | - Next meeting: January 97 | - should have some work done by then 98 | - a prototype for our own parsers (PK, ME) 99 | - MaRDA annual meeting is also happening 100 | - Suggestion: Weekly open hour on Jitsi - will be communicated in due course. 101 | 102 | 103 | 104 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | Creative Commons Legal Code 2 | 3 | CC0 1.0 Universal 4 | 5 | CREATIVE COMMONS CORPORATION IS NOT A LAW FIRM AND DOES NOT PROVIDE 6 | LEGAL SERVICES. DISTRIBUTION OF THIS DOCUMENT DOES NOT CREATE AN 7 | ATTORNEY-CLIENT RELATIONSHIP. CREATIVE COMMONS PROVIDES THIS 8 | INFORMATION ON AN "AS-IS" BASIS. CREATIVE COMMONS MAKES NO WARRANTIES 9 | REGARDING THE USE OF THIS DOCUMENT OR THE INFORMATION OR WORKS 10 | PROVIDED HEREUNDER, AND DISCLAIMS LIABILITY FOR DAMAGES RESULTING FROM 11 | THE USE OF THIS DOCUMENT OR THE INFORMATION OR WORKS PROVIDED 12 | HEREUNDER. 13 | 14 | Statement of Purpose 15 | 16 | The laws of most jurisdictions throughout the world automatically confer 17 | exclusive Copyright and Related Rights (defined below) upon the creator 18 | and subsequent owner(s) (each and all, an "owner") of an original work of 19 | authorship and/or a database (each, a "Work"). 20 | 21 | Certain owners wish to permanently relinquish those rights to a Work for 22 | the purpose of contributing to a commons of creative, cultural and 23 | scientific works ("Commons") that the public can reliably and without fear 24 | of later claims of infringement build upon, modify, incorporate in other 25 | works, reuse and redistribute as freely as possible in any form whatsoever 26 | and for any purposes, including without limitation commercial purposes. 
27 | These owners may contribute to the Commons to promote the ideal of a free 28 | culture and the further production of creative, cultural and scientific 29 | works, or to gain reputation or greater distribution for their Work in 30 | part through the use and efforts of others. 31 | 32 | For these and/or other purposes and motivations, and without any 33 | expectation of additional consideration or compensation, the person 34 | associating CC0 with a Work (the "Affirmer"), to the extent that he or she 35 | is an owner of Copyright and Related Rights in the Work, voluntarily 36 | elects to apply CC0 to the Work and publicly distribute the Work under its 37 | terms, with knowledge of his or her Copyright and Related Rights in the 38 | Work and the meaning and intended legal effect of CC0 on those rights. 39 | 40 | 1. Copyright and Related Rights. A Work made available under CC0 may be 41 | protected by copyright and related or neighboring rights ("Copyright and 42 | Related Rights"). Copyright and Related Rights include, but are not 43 | limited to, the following: 44 | 45 | i. the right to reproduce, adapt, distribute, perform, display, 46 | communicate, and translate a Work; 47 | ii. moral rights retained by the original author(s) and/or performer(s); 48 | iii. publicity and privacy rights pertaining to a person's image or 49 | likeness depicted in a Work; 50 | iv. rights protecting against unfair competition in regards to a Work, 51 | subject to the limitations in paragraph 4(a), below; 52 | v. rights protecting the extraction, dissemination, use and reuse of data 53 | in a Work; 54 | vi. database rights (such as those arising under Directive 96/9/EC of the 55 | European Parliament and of the Council of 11 March 1996 on the legal 56 | protection of databases, and under any national implementation 57 | thereof, including any amended or successor version of such 58 | directive); and 59 | vii. 
other similar, equivalent or corresponding rights throughout the 60 | world based on applicable law or treaty, and any national 61 | implementations thereof. 62 | 63 | 2. Waiver. To the greatest extent permitted by, but not in contravention 64 | of, applicable law, Affirmer hereby overtly, fully, permanently, 65 | irrevocably and unconditionally waives, abandons, and surrenders all of 66 | Affirmer's Copyright and Related Rights and associated claims and causes 67 | of action, whether now known or unknown (including existing as well as 68 | future claims and causes of action), in the Work (i) in all territories 69 | worldwide, (ii) for the maximum duration provided by applicable law or 70 | treaty (including future time extensions), (iii) in any current or future 71 | medium and for any number of copies, and (iv) for any purpose whatsoever, 72 | including without limitation commercial, advertising or promotional 73 | purposes (the "Waiver"). Affirmer makes the Waiver for the benefit of each 74 | member of the public at large and to the detriment of Affirmer's heirs and 75 | successors, fully intending that such Waiver shall not be subject to 76 | revocation, rescission, cancellation, termination, or any other legal or 77 | equitable action to disrupt the quiet enjoyment of the Work by the public 78 | as contemplated by Affirmer's express Statement of Purpose. 79 | 80 | 3. Public License Fallback. Should any part of the Waiver for any reason 81 | be judged legally invalid or ineffective under applicable law, then the 82 | Waiver shall be preserved to the maximum extent permitted taking into 83 | account Affirmer's express Statement of Purpose. 
In addition, to the 84 | extent the Waiver is so judged Affirmer hereby grants to each affected 85 | person a royalty-free, non transferable, non sublicensable, non exclusive, 86 | irrevocable and unconditional license to exercise Affirmer's Copyright and 87 | Related Rights in the Work (i) in all territories worldwide, (ii) for the 88 | maximum duration provided by applicable law or treaty (including future 89 | time extensions), (iii) in any current or future medium and for any number 90 | of copies, and (iv) for any purpose whatsoever, including without 91 | limitation commercial, advertising or promotional purposes (the 92 | "License"). The License shall be deemed effective as of the date CC0 was 93 | applied by Affirmer to the Work. Should any part of the License for any 94 | reason be judged legally invalid or ineffective under applicable law, such 95 | partial invalidity or ineffectiveness shall not invalidate the remainder 96 | of the License, and in such case Affirmer hereby affirms that he or she 97 | will not (i) exercise any of his or her remaining Copyright and Related 98 | Rights in the Work or (ii) assert any associated claims and causes of 99 | action with respect to the Work, in either case contrary to Affirmer's 100 | express Statement of Purpose. 101 | 102 | 4. Limitations and Disclaimers. 103 | 104 | a. No trademark or patent rights held by Affirmer are waived, abandoned, 105 | surrendered, licensed or otherwise affected by this document. 106 | b. Affirmer offers the Work as-is and makes no representations or 107 | warranties of any kind concerning the Work, express, implied, 108 | statutory or otherwise, including without limitation warranties of 109 | title, merchantability, fitness for a particular purpose, non 110 | infringement, or the absence of latent or other defects, accuracy, or 111 | the present or absence of errors, whether or not discoverable, all to 112 | the greatest extent permissible under applicable law. 113 | c. 
Affirmer disclaims responsibility for clearing rights of other persons 114 | that may apply to the Work or any use thereof, including without 115 | limitation any person's Copyright and Related Rights in the Work. 116 | Further, Affirmer disclaims responsibility for obtaining any necessary 117 | consents, permissions or other rights required for any use of the 118 | Work. 119 | d. Affirmer understands and acknowledges that Creative Commons is not a 120 | party to this document and has no duty or obligation with respect to 121 | this CC0 or use of the Work. 122 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 |
2 | 3 |
4 | 5 | # MaRDA Metadata Extractors: Discussions & Minutes
6 | 7 |
8 | 9 | [![Documentation](https://badgen.net/badge/docs/marda-alliance.github.io/blue?icon=firefox)](https://marda-alliance.github.io/metadata_extractors/) 10 | 11 |
12 | 13 | > [!IMPORTANT] 14 | > The MaRDA Metadata Extractors working group has now ended; development will continue under the [datatractor](https://github.com/datatractor/) organisation, but discussions can continue in this repository. 15 | 16 | This repository contains organizational info for a [MaRDA](https://www.marda-alliance.org/) working group (WG) focused on connecting and advancing interoperability of efforts on automated extraction of metadata from materials files. 17 | 18 | **Contacts**: 19 | 20 | - *[Matthew Evans](https://ml-evs.science), UCLouvain* 21 | (`matthew.evans[at]uclouvain.be`) 22 | - *Peter Kraus, TU Berlin* 23 | (`peter.kraus[at]ceramics.tu-berlin.de`) 24 | - *David Elbert, Johns Hopkins University* 25 | (`elbert[at]jhu.edu`) 26 | 27 | 28 | ## Contributing 29 | 30 | This working group is completely open. 31 | If you would like to be added to the mailing list, please reach out to us over email, or just turn up at a meeting! 32 | 33 | The GitHub discussions on this repo can be used for pretty much any related chat, and specific code suggestions/feedback can be made as pull requests to the GitHub repos for each subproject: 34 | 35 | - [marda-alliance/metadata_extractors_schema](https://github.com/marda-alliance/metadata_extractors_schema) 36 | - [marda-alliance/metadata_extractors_registry](https://github.com/marda-alliance/metadata_extractors_registry) 37 | - [marda-alliance/metadata_extractors_api](https://github.com/marda-alliance/metadata_extractors_api) 38 | 39 | 40 | ## Organization & logistics 41 | 42 | 43 | - All meetings will be open and will take place on [Jitsi](https://meet.jit.si/) 44 | with the room code `marda-extractors`. Meeting dates will be arranged at the 45 | previous meeting, and will be announced on the mailing list. 46 | - Meeting minutes and relevant links will be made available in this repository under [`./meetings`](https://github.com/marda-alliance/metadata_extractors/tree/main/meetings). 
47 | - The [GitHub discussions on this 48 | repo](https://github.com/marda-alliance/metadata_extractors/discussions) will be used for asynchronous 49 | communications about the design, development and logistics of the overall 50 | working group - please introduce yourself in the [announcement 51 | thread](https://github.com/marda-alliance/metadata_extractors/discussions/1)! 52 | 53 | ## Proposal 54 | 55 | ![image](https://user-images.githubusercontent.com/7916000/205093968-632da485-b922-4244-8d8d-51ff85c92d35.png) 56 | 57 | 58 | The text below is taken from the initial working group proposal. 59 | 60 | ### Motivation 61 | 62 | Enabling interoperability between experimental apparatus, software, scientists and informaticians requires a lot of plumbing. 63 | This plumbing is often characterized as an **Extract-Transform-Load** (ETL) pipeline, whereby data is first extracted from some heterogeneous sources, transformed into a suitable format for the use case (querying, archival, further analysis) and then loaded onto a storage platform (a database, a filesystem, an archive repository, supplementary info for a publication). 64 | Such pipelines are often opaque, not reproducible and not sufficiently modular to be reusable; these shortcomings penalize groups with fewer resources to devote to data management, leading to much duplicated effort on reimplementing parsers or extractors for different file types and transformed data models. 65 | 66 | This working group aims to address these issues and promote FAIR practices by designing, creating and reusing software and infrastructure to streamline the process of describing ETL pipelines in such a way that they can be more broadly reused, and associated tooling for automated execution and discovery of said pipelines. 
67 | Following the name of the WG, it will primarily target the **Extraction** step, i.e., the literal parsing of unstructured data files, as well as the description of the output Transformed data model and encouragement of FAIR representations (cf. the upcoming recommendations from MaRDA WG5).
68 | This is a different approach from the typical target of creating a unifying output data schema; instead, we will remain agnostic to the output format provided it is sufficiently well-described in a machine-actionable manner.
69 | 
70 | ### Work plan
71 | 
72 | This working group aims to investigate, design, and implement or re-use a hierarchy of open software tooling and open infrastructure to achieve the goals outlined above. The initial 6 months of meetings will primarily involve scoping the overall project, and the group members, to ensure that the WG is representative of existing efforts, thus maximizing the chances of adoption. The three main technical thrusts are expected to be:
73 | 
74 | 1. **A lightweight metadata schema for parsers** and associated tooling for software libraries to self-report:
75 |    - The file formats they support (e.g., output files from a particular experimental apparatus, or log files from a computational chemistry code)
76 |    - The shape and semantics of the data models produced, via existing formats and tooling for data description: for example, self-contained formats such as HDF5 and STAR (and their respective domain-specific derivatives NeXus and CIF), schemas (JSONSchema, XSD), and semantic data (RDF, JSON-LD & CSV-LD). Depending on the output format, such metadata can be provided in-band or out-of-band in a well-defined location (e.g., a separate file, a persistent URL), following the upcoming recommendations of the MaRDA Data Dictionaries WG5.
77 |    - Any additional metadata required for re-use, such as code versions and environments, source and code archive URLs, and bibliographic data following, for example, Dublin Core.
78 | 1. 
**A common API specification for executing parser code**, and associated tooling.
79 |    - Parsers could be run natively (in the language they were written in) on local files via the creation of a language-specific harness to which the parsers contribute plugins. This harness will execute the parser in such a way as to maintain the link with its reported schema, as above.
80 |    - Parsers could be packaged into containers (Docker or otherwise) that provide reproducible environments for their execution. Another harness for executing the containers from various languages, or via an HTTP API, will be investigated. Such containers can then be deployed as part of larger infrastructure, or used in a serverless fashion. This approach is well-suited to asynchronous message queues and streaming of data.
81 | 1. **A searchable registry of parsers**
82 |    - Any parser code that implements the above functionality can be added (automatically or otherwise) to a registry of parsers that can be filtered against the metadata schema from thrust I. This could allow for automated support of a new file type (i.e., filter for parsers that support the file type and provide a container, then automatically download and deploy the container and use the expected endpoints to parse the data file).
83 |    - Such a registry would provide discoverability and automated validation of resources, accelerating the proliferation of shared or overlapping data models, schemas and semantics.
84 |      The registry could then be used to chain parsers together as composable blocks: for example, a custom semantic layer could be added to a data model by converting a JSON file with a well-defined schema into a JSON-LD file.
85 | 
86 | ### Goals & expected impact
87 | 
88 | - Choose some “anointed” parsing libraries across underserved disciplines (primarily for experimental data), and use them as prototypical cases for the above tooling.
There are already several options within the list of interested working group members that could be fruitful. The impact here should not only be a more reusable packaging of existing parsers, but also upstream improvements to the parsers themselves as schemas are formalized for the first time.
89 | - Once a simple registry has been set up, WG members will trial its use within their own existing infrastructure projects across materials science and chemical data management. If the trial is successful, some of the maintenance burden for these steps could be shared amongst the interested parties, such that future development benefits all.
90 | 
91 | ### Deliverables
92 | 
93 | Within 12 months of the commencement of the working group, we will deliver:
94 | 
95 | - A Working Group Note, published on the MaRDA Alliance website
96 | - A proof-of-concept implementation of I, II and III: documented, open source and available under the MaRDA Alliance GitHub organization.
97 | 
--------------------------------------------------------------------------------