├── .github
│   └── CODEOWNERS
├── meetings
│   ├── 3-meeting-2023-03-21
│   │   ├── scythe-overview.pdf
│   │   └── minutes.md
│   ├── 4-meeting-2023-05-02
│   │   └── minutes.md
│   ├── 2-meeting-2023-01-19
│   │   └── minutes.md
│   ├── 5-meeting-2023-05-30
│   │   └── minutes.md
│   └── 1-meeting-2022-11-30
│       └── minutes.md
├── .gitignore
├── LICENSE
└── README.md
/.github/CODEOWNERS: -------------------------------------------------------------------------------- 1 | * @ml-evs @PeterKraus @davidelbert 2 | -------------------------------------------------------------------------------- /meetings/3-meeting-2023-03-21/scythe-overview.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/marda-alliance/metadata_extractors/HEAD/meetings/3-meeting-2023-03-21/scythe-overview.pdf -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | # Compiled Object files 2 | *.o 3 | *.obj 4 | 5 | # Compiled Dynamic libraries 6 | *.so 7 | *.dylib 8 | *.dll 9 | 10 | # Compiled Static libraries 11 | *.a 12 | *.lib 13 | 14 | # Executables 15 | *.exe 16 | 17 | # DUB 18 | .dub 19 | docs.json 20 | __dummy.html 21 | docs/ 22 | 23 | # Code coverage 24 | *.lst 25 | -------------------------------------------------------------------------------- /meetings/4-meeting-2023-05-02/minutes.md: -------------------------------------------------------------------------------- 1 | # Progress meeting 4 2 | 3 | 02/05/2023 4 | 5 | ## Agenda 6 | 7 | Intro from Peter 8 | ELN roundtable discussion of feature wishlist 9 | 10 | ## Minutes 11 | 12 | Peter Kraus: Introduced the current state of the registry, API prototype and schema. 13 | 14 | Suggestions from Kevin Jablonka: 15 | 16 | - Reusability and validation as main focus: is the parser tested, how is it 17 | tested and how reliable is it? 
18 | - Able to use code without providing much boilerplate 19 | 20 | Peter: 21 | - possibility of MaRDA permanent infrastructure for running the registry 22 | - output schema is still a target but perhaps not in current WG 23 | 24 | Matthew: containerisation and construction of environments is important: how 25 | about things outside Python: JS/WASM/docker? 26 | 27 | Discussion about output formats: option to specify some useful common formats 28 | 29 | Markus: wishlist -- must be open in the output format 100% 30 | 31 | Nicolas: ELN developer's wishlist -- adding custom JSON to arbitrary files that 32 | are added to a sample/experiment. Want to deploy a container that can scrape 33 | arbitrary files. 34 | 35 | Steffen: also ELN developer, in agreement with Nicolas but needs metadata instead of data -- i.e., 36 | should be a case switch, wishes the ability to see a thumbnail of the data. Shouldn't 37 | decide meaning of data, instead allow it to be mutable. Also something pip 38 | installable for programmatic use. 39 | 40 | Matthew: maybe having a monolithic container baked with the registry as a build engine 41 | 42 | Markus: 43 | - Three execution pathways proposed: "external" service, local docker, local python package. 44 | - Very tricky to provide all three reliably, especially due to dependencies. 45 | - For Nomad, important to ensure scalability to ~million files -> single docker instance might not work 46 | 47 | Steffen: how to deal with multiple parsers for a given file: some rating 48 | process. Also all the different APIs for parsers are an issue, can we capture 49 | that? 50 | 51 | Peter: API and schema repository have some progress towards this 52 | 53 | Nicolas: 54 | - common API is not feasible for all parsers, 55 | - ELN users should not be exposed to API related issues at all 56 | - keep it simple by choosing/bundling one extractor per filetype 57 | 58 | Markus: file type detection is crucial, NOMAD could contribute its own detection system. 
Needs to be a combination of file extension, regex etc. 60 | -------------------------------------------------------------------------------- /meetings/3-meeting-2023-03-21/minutes.md: -------------------------------------------------------------------------------- 1 | # Progress meeting 2 | 3 | 21/03/2023 4 | 5 | ## Agenda 6 | 7 | - Introductions (5 mins) 8 | - Case study: Scythe (Logan Ward) 9 | - Discussion 10 | - Summary of progress on `api` in MaRDA Extractors WG (Matthew Evans) 11 | - Open discussion 12 | 13 | ## Case study: Scythe: an extractor library you might like 14 | 15 | Presented by [Logan Ward](https://github.com/WardLT). Unfortunately we forgot to record the talk. See [Scythe GH repo](https://github.com/materials-data-facility/scythe) as well as the [slides](scythe-overview.pdf). 16 | 17 | - originally called MaterialsIO ➝ renamed to Scythe due to Google's material design SEO 18 | - designed to introduce some standardisation into internal metadata extraction at Argonne 19 | - group ➝ extract ➝ adapt pipeline 20 | - Group files that belong together, e.g. in/out files ➝ logic is extractor specific 21 | - Extract into a documentable format - currently JSON 22 | - Adapt for translating between known filetypes/formats 23 | - Python interface 24 | - Design principle: filesystem ➝ database should work in 1 line of code 25 | 26 | 1. Question (Matthew Evans): Self describing schema - how is this implemented? 27 | 28 | Answer: For now, in a JSONSchema as it is verifiable and human-readable. 29 | 30 | 2. Question (Matthew Evans): Handling groups of files - how does the user select the appropriate filetype? 31 | 32 | Answer: Better illustrated using an example: 33 | - 1) each file is treated separately 34 | - 2) each particular datatype has its own extractor 35 | - 3) supplemental files are pulled out by the extractor itself 36 | Filetype matching is done by prefixes or postfixes (e.g. for VASP). 
37 | 38 | - Manifesto: summaries of data, not necessarily lossless 39 | - Well designed contributor guide 40 | - Key feature: autodiscovery of Scythe-compatible Extractors on the current host via Stevedore 41 | 42 | 3. Question (Peter Kraus): Is the "losslessness" of the data described somewhere? 43 | 44 | Answer: If it's not in the documentation, then it is not. 45 | 46 | 4. Question (Ken Kroenlein): Universal datastructures might not work well. How about data dictionaries and standards or similar technologies to describe the structure of the data? 47 | 48 | Answer: This is currently an Extractor-level decision, to drive adoption. 49 | 50 | ## Summary of progress on `api` in MaRDA Extractors WG 51 | 52 | Matthew gave a quick run-down of the current draft of the `Extractor` execution API. Please see the PR [#5](https://github.com/marda-alliance/metadata_extractors_api/pull/5) for details, review and comments. 53 | 54 | -------------------------------------------------------------------------------- /meetings/2-meeting-2023-01-19/minutes.md: -------------------------------------------------------------------------------- 1 | # Progress meeting 2 | 3 | 19/01/2023 4 | 5 | ## Agenda 6 | 7 | - Introductions (5 mins) 8 | - Summary of main goals (Matthew Evans) 9 | - Progress in WP1, WP2, WP3 (Matthew Evans) 10 | - Case study: yadg (Peter Kraus) 11 | - Open discussion 12 | 13 | ## Minutes from Discussion 14 | 15 | 1. David Elbert: 16 | - Suggestion: Please register for the [annual MaRDA meeting](https://www.marda-alliance.org/blog-2/marda2023/). Poster sessions for ECRs are scheduled, with prizes! 17 | 18 | --- 19 | 20 | 2. Matthew Evans & Peter Kraus: 21 | - Presentation of current progress and case study on [yadg](https://github.com/dgbowl/yadg) ([slides](https://docs.google.com/presentation/d/1nQD7cwEG67W5MAVjXXb-6nUklDg1AcxYs8j7WRsJjIc/edit#slide=id.g1e03a09dbfb_0_16)) 22 | 23 | 2. 
a) Casper Andresen: 24 | - Question: Will the transformation step in our pipeline be based on semantics, or only a pure syntax translation? 25 | 26 | 2. b) Matthew Evans: 27 | - Answer: LinkML allows linking to other schemas, such as Dublin Core or Schema.org. Therefore, capturing semantics should be possible. Contributions from experts are welcome! 28 | 29 | --- 30 | 31 | 3. a) Logan Ward: 32 | - Question: Interesting choice of going with a "dataschema" for yadg. What exactly is it? What are the limitations? 33 | 34 | 3. b) Peter Kraus: 35 | - Answer: It defines the "folder structure" of files related to a single experiment, as opposed to the layout within the individual files. User adoption is quite hard, and very often you just want the raw data, without bundling it. 36 | 37 | 3. c) Logan Ward: 38 | - Suggestion: Efforts such as this one (metadata extractors) could lead to a nice recursion, where your extractor taps into other linked tools. 39 | 40 | --- 41 | 42 | 4. a) Steffen Brinckmann: 43 | - Question: Is passing parameters into extractors (e.g. information about background etc.) considered for the schemas? 44 | 45 | 4. b) Peter Kraus: 46 | - Answer: From my point of view, not at the extractor level, which should not be post-processing the data values. 47 | 48 | 4. c) Matthew Evans: 49 | - Answer: This is a very application-dependent question. Parameters are likely to strongly depend on the extractors themselves. If we support it, it must be optional. 50 | 51 | --- 52 | 53 | 5. a) Ken Kroenlein: 54 | - Suggestion: For extractor output, the data dictionaries WG of MaRDA can provide a controlled vocabulary. 55 | 56 | 5. b) Peter Kraus: 57 | - Answer: I will be in touch to figure out how I can implement this in my code! 58 | 59 | --- 60 | 61 | 6. Closing remarks: 62 | - MaRDA annual meeting on 25. - 27. Feb. 2023 63 | - have a first tagged schema by then (ME) 64 | - have an example implementation in yadg (PK) 65 | - Office hours: 66 | - fortnightly (approximately), informal 67 | - next one on 25. Jan 2023, 15:00 UTC 68 | - suggest a date/time if the dates planned don't suit! 69 | 70 | 71 | 72 | -------------------------------------------------------------------------------- /meetings/5-meeting-2023-05-30/minutes.md: -------------------------------------------------------------------------------- 1 | # Progress meeting 5 2 | 3 | 30/05/2023 4 | 5 | ## Agenda 6 | 7 | - Intro from Matthew Evans 8 | - Overview of NOMAD Parsers (Lauri Himanen) 9 | - Demo of API from Matthew Evans 10 | 11 | ## Minutes 12 | 13 | ### MaRDA WG status summary - Matthew Evans 14 | - Quick summary of MaRDA Extractors WG 15 | - Overview of the 3 repos, incl. [schema](https://github.com/marda-alliance/metadata_extractors_schema), [registry](https://github.com/marda-alliance/metadata_extractors_registry), and [api](https://github.com/marda-alliance/metadata_extractors_api) 16 | - Overview of the current `Filetypes` and `Extractors` in the registry 17 | 18 | **See also [Demo of API](#demo-of-api-videos) videos below!** 19 | 20 | ### NOMAD Parsers - Lauri Himanen 21 | 22 | 1. What is NOMAD: 23 | - RDM platform 24 | - covers simulations, experiments, workflows 25 | - funded by German NFDI 26 | - [nomad-lab.eu](https://nomad-lab.eu) is a freely available, open, central repository 27 | - NOMAD Oasis is a self-hosted instance 28 | 2. Getting started with NOMAD: 29 | - `upload` files 30 | - add / edit using ELN interface 31 | - publish to get a DOI 32 | 3. 
Parser infrastructure in NOMAD: 33 | - act on uploaded files and turn them into `entries` 34 | - `entries` can be searched, analysed, have a known structure -> implies a `schema` 35 | - each parser has to define its own `schema` 36 | - NOMAD has its own Pydantic-like schema language called NOMAD metainfo 37 | - parsers are triggered on upload of a file, matching by using: 38 | - file extension 39 | - file mimetype 40 | - file contents (e.g. header) 41 | - one file is usually one `entry`, but sometimes one file is many `entries` 42 | - reading of auxiliary files in an upload is handled by the parser 43 | 4. Parser plugins 44 | - basic NOMAD has ~60 parsers pre-installed, mostly for electronic structure calculations 45 | - defining custom parsers is possible via a `plugin` mechanism 46 | - `plugins` may be integrated into the central service after a review 47 | - plugins have to have: 48 | - a schema definition in a specified location 49 | - the parsing code and file matching logic in a specified location 50 | - the general `schema` can be extended by the `parser` 51 | - the infrastructure uses a lot of regex to perform matching of quantities and filetypes 52 | - parsing is performed by passing the path of the file 53 | - `nomad.yaml`: a configuration file for the plugin 54 | 55 | ### Q/A: 56 | - Peter: About auxiliary files - must be uploaded together, or can NOMAD ask for them? 57 | - Lauri: They must be uploaded together. Their usage may be documented in the README. The parser can emit a debug/log message, but the user is responsible for uploading all files together. 58 | 59 | --- 60 | - Nicolas: Use of regex on plaintext files does not sound efficient. Is this really the best way? Wouldn't it be better to fix QM codes upstream? 61 | - Lauri: Some QM codes are moving away from text files, but progress is slow. There is also the important issue of legacy QM data, which has to be addressed somehow. 62 | 63 | --- 64 | - Peter: Are you aware of QCSchema? 
65 | - Lauri: No, not yet. 66 | 67 | --- 68 | - Matthew: Overlap with MaRDA WG. How would we go about validating plugins? 69 | - Lauri: The plugin mechanism in NOMAD is very new. A registry and an authority marking/reviewing plugins would be useful. 70 | 71 | --- 72 | - Matthew: How about sandboxing plugin code? 73 | - Lauri: Sandboxing is tricky; on a self-hosted instance it is a question for the Oasis admin 74 | 75 | 76 | ### Open Discussion: 77 | - Matthew: Suggestion to skip next month's (July) meeting in favour of writing & working. 78 | - Peter: Agreed. 79 | 80 | --- 81 | - Steffen: Multiple people have different goals. Focusing on a single extractor for a single filetype is perhaps too ambitious. 82 | - Matthew: Yes, this is currently the goal of the WG. 83 | 84 | --- 85 | - Steffen: Focus should be on getting more examples. This requires making the parser submission process simple and comfortable enough even for lazy people... 86 | - Peter: A website frontend to avoid boilerplate is on the TODO list, however, manpower is a problem. 87 | 88 | --- 89 | - Steffen: Review process of extractors should currently be very streamlined; as long as things don't overwrite other people's work, we should allow things in. 90 | - Matthew: Sandboxing at some level might be necessary, as otherwise it's a big safety issue. 
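The matching strategy described in the parser-infrastructure notes above (file extension, mimetype, file contents) can be sketched roughly as follows. The parser names, extensions and header signatures below are invented for illustration, and mimetype checking is omitted for brevity:

```python
import re
from pathlib import Path

# Illustrative table only: real matching rules live in each parser's
# plugin metadata, not in a hard-coded list like this.
MATCHERS = [
    ("demo-vasp", {".xml"}, re.compile(rb"<modeling>")),
    ("demo-biologic", {".mpr"}, re.compile(rb"BIO-LOGIC MODULAR FILE")),
]


def match_parsers(path: Path) -> list[str]:
    """Return names of parsers whose extension AND header signature both match."""
    header = path.read_bytes()[:2048]  # only sniff the start of the file
    return [
        name
        for name, extensions, signature in MATCHERS
        if path.suffix.lower() in extensions and signature.search(header)
    ]
```

Requiring both the extension and a content signature to agree is one way to avoid firing a parser on a file that merely has a suggestive name.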
91 | 92 | ### Demo of API videos: 93 | - [API Intro and `biologic-mpr` example](https://drive.google.com/file/d/10v6uE6By0bm3Z2Sfno1ULM6I8itX75uF/view?usp=drive_web) 94 | - [Parsing `agilent-dx` using API](https://drive.google.com/file/d/1XeTYR14dJORUGIZnYzoc1FRM0eWUNSVu/view?usp=drive_web) -------------------------------------------------------------------------------- /meetings/1-meeting-2022-11-30/minutes.md: -------------------------------------------------------------------------------- 1 | # Kick-off meeting 2 | 3 | 30/11/2022 4 | 5 | ## Agenda 6 | 7 | - Introductions (5 mins) 8 | - Initial slides by Matthew Evans (15 mins) 9 | - Q&A 10 | - Open discussion 11 | 12 | ## Minutes from Q&A & Discussion 13 | 14 | 1. Matthew Evans: Introductory presentation. Recorded ([video](https://www.youtube.com/watch?v=6x5Ow-CLRWg), [slides](https://docs.google.com/presentation/d/1QTqDEO3H1s_wAtcE9DD6k1eCJ7GzvdV2spjHE94bEQw/edit?usp=sharing)) 15 | 16 | 2. a) Markus Scheidgen: 17 | - Question: Will slides be shared? 18 | - Suggestion: A goal can be to establish a service registry with a searchable `filetype -> tools available` mapping. 19 | - Comment: Skeptical about "chaining parsers" due to low chance of success. 20 | 21 | 2. b) Matthew Evans: 22 | - Comment: A way of coordination might be to have a discussion for each WP. 23 | - Answer: Slides will be published. 24 | 25 | 4. a) Steffen Brinckmann: 26 | - Question: Transformation WP concerns not only data, but also filetype? 27 | - Suggestion: CWL is used in the chemistry field: 28 | - wrappable around Python 29 | - requires a server/service 30 | - Question: What happens when the scientist wants to "add" processing (area under curve) as metadata? 31 | - Question: Will it be possible to edit/customise extractors? 32 | 33 | 4. b) Matthew Evans: 34 | - Answer: Yes, filetypes are included; workflow & interoperability is well covered by the WPs. 
35 | - Answer: Optional features for extractors: 36 | - individual decision of parsers, implemented in parsers 37 | - good use-case for parser chaining (see 2a) 38 | - Comment: Note the discussion "Prior Art" on GitHub. 39 | 40 | 5. a) Ken Kroenlein: 41 | - Comment: Concerned that a parser that does `file -> JSON/XML` is already a full ETL pipeline. 42 | - Suggestion: Reworking of schema (WP1) to be a lot tighter & low level, as the `file -> bytes in memory` step is where the "magic" and pipelining might happen. This would also allow piecing parsers together. 43 | 44 | 5. b) David Elbert: 45 | - Answer: Important to check scope of WPs as we go. 46 | - Answer: Boundary of data vs metadata, or what is defined as a derived value, is currently defined by the filetypes or parsers. Stick to that definition for now. 47 | 48 | 6. a) Ken Kroenlein & Jim Warren: 49 | - Comment: Just achieving a schema (WP1) is a lot of work and would be a win. 50 | 51 | 6. b) Peter Kraus: 52 | - Comment: Registry (WP3) is a "low hanging fruit" that can be done with little work. 53 | 54 | 6. c) Steffen Brinckmann: 55 | - Question: Order of action: Should we start with schema (WP1) and then move to API (WP2), a "discussion based" approach, or should we start with WP2 and move to WP1, an "evolutionary approach"? 56 | 57 | 6. d) Markus Scheidgen: 58 | - Comment: Schema (WP1) is key to be able to find the right parser. 59 | - Comment: Registry (WP3) is not super valuable without WP1, as it's just a page with tools; it would need at least WP1, but ideally WP2, to be useful. 60 | - Comment: We shouldn't worry about chaining tools at this moment, as that avoids specifying a parser output schema. 61 | 62 | 6. e) Jim Warren: 63 | - Comment: Figuring out schema (WP1) is key. 64 | - Examples are very important. 65 | - Need to identify what's broken in each parser. 66 | 67 | 7. 
a) Nicholas Carpi: 68 | - Suggestion: Vision, as an ELN developer, is to be able to: 69 | - parse a user-supplied file using an "external service" 70 | - added value for the user, e.g. visualisation, post-processing 71 | - Comment: Huge amount of prior art, but no common framework. 72 | - Question: What's the difference between the goals of this WG and the Clowder framework? 73 | 74 | 7. b) David Elbert: 75 | - Answer: Unfortunate that Clowder folk are not present. Will try to get them involved. 76 | 77 | 7. c) Matthew Evans: 78 | - Answer: Point taken about existing projects & Clowder. Risk of creating yet another standard. 79 | - Question: What would an ELN developer like from this "external service"? Would we need to provide hardening? 80 | 81 | 7. d) Nicholas Carpi: 82 | - Answer: "External" as in not within the ELN, but not off-site. Everything should be done locally, via e.g. an HTTPS REST service: `post file -> get data` 83 | - Answer: It would be great to only have to use one service, with reasonable and understandable errors. 84 | - Suggestion: Maintenance of a single repo including all parsers can be tricky, so a plugin architecture might be the way to go. 85 | 86 | 7. e) Peter Kraus: 87 | - Comment: There is a lot of prior art, some of it at various stages of bit rot. Added benefit of this WG would be testing of parsers via CI to detect broken/bitrotten parts of code. 88 | 89 | 7. f) Ken Kroenlein: 90 | - Concern: Think about a long term home for CI/CD. Relying on external services & running out of budget is a good recipe for failure. 91 | 92 | 7. g) David Elbert: 93 | - Answer: While MaRDA itself cannot provide funding, we can try to arrange long-term funding as part of the work of this or a future WG. 94 | 95 | 
Closing remarks: 96 | - Next meeting: January 97 | - should have some work done by then 98 | - a prototype for our own parsers (PK, ME) 99 | - MaRDA annual meeting is also happening 100 | - Suggestion: Weekly open hour on Jitsi - will be communicated in due course. 101 | 102 | 103 | 104 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | Creative Commons Legal Code 2 | 3 | CC0 1.0 Universal 4 | 5 | CREATIVE COMMONS CORPORATION IS NOT A LAW FIRM AND DOES NOT PROVIDE 6 | LEGAL SERVICES. DISTRIBUTION OF THIS DOCUMENT DOES NOT CREATE AN 7 | ATTORNEY-CLIENT RELATIONSHIP. CREATIVE COMMONS PROVIDES THIS 8 | INFORMATION ON AN "AS-IS" BASIS. CREATIVE COMMONS MAKES NO WARRANTIES 9 | REGARDING THE USE OF THIS DOCUMENT OR THE INFORMATION OR WORKS 10 | PROVIDED HEREUNDER, AND DISCLAIMS LIABILITY FOR DAMAGES RESULTING FROM 11 | THE USE OF THIS DOCUMENT OR THE INFORMATION OR WORKS PROVIDED 12 | HEREUNDER. 13 | 14 | Statement of Purpose 15 | 16 | The laws of most jurisdictions throughout the world automatically confer 17 | exclusive Copyright and Related Rights (defined below) upon the creator 18 | and subsequent owner(s) (each and all, an "owner") of an original work of 19 | authorship and/or a database (each, a "Work"). 20 | 21 | Certain owners wish to permanently relinquish those rights to a Work for 22 | the purpose of contributing to a commons of creative, cultural and 23 | scientific works ("Commons") that the public can reliably and without fear 24 | of later claims of infringement build upon, modify, incorporate in other 25 | works, reuse and redistribute as freely as possible in any form whatsoever 26 | and for any purposes, including without limitation commercial purposes. 
27 | These owners may contribute to the Commons to promote the ideal of a free 28 | culture and the further production of creative, cultural and scientific 29 | works, or to gain reputation or greater distribution for their Work in 30 | part through the use and efforts of others. 31 | 32 | For these and/or other purposes and motivations, and without any 33 | expectation of additional consideration or compensation, the person 34 | associating CC0 with a Work (the "Affirmer"), to the extent that he or she 35 | is an owner of Copyright and Related Rights in the Work, voluntarily 36 | elects to apply CC0 to the Work and publicly distribute the Work under its 37 | terms, with knowledge of his or her Copyright and Related Rights in the 38 | Work and the meaning and intended legal effect of CC0 on those rights. 39 | 40 | 1. Copyright and Related Rights. A Work made available under CC0 may be 41 | protected by copyright and related or neighboring rights ("Copyright and 42 | Related Rights"). Copyright and Related Rights include, but are not 43 | limited to, the following: 44 | 45 | i. the right to reproduce, adapt, distribute, perform, display, 46 | communicate, and translate a Work; 47 | ii. moral rights retained by the original author(s) and/or performer(s); 48 | iii. publicity and privacy rights pertaining to a person's image or 49 | likeness depicted in a Work; 50 | iv. rights protecting against unfair competition in regards to a Work, 51 | subject to the limitations in paragraph 4(a), below; 52 | v. rights protecting the extraction, dissemination, use and reuse of data 53 | in a Work; 54 | vi. database rights (such as those arising under Directive 96/9/EC of the 55 | European Parliament and of the Council of 11 March 1996 on the legal 56 | protection of databases, and under any national implementation 57 | thereof, including any amended or successor version of such 58 | directive); and 59 | vii. 
other similar, equivalent or corresponding rights throughout the 60 | world based on applicable law or treaty, and any national 61 | implementations thereof. 62 | 63 | 2. Waiver. To the greatest extent permitted by, but not in contravention 64 | of, applicable law, Affirmer hereby overtly, fully, permanently, 65 | irrevocably and unconditionally waives, abandons, and surrenders all of 66 | Affirmer's Copyright and Related Rights and associated claims and causes 67 | of action, whether now known or unknown (including existing as well as 68 | future claims and causes of action), in the Work (i) in all territories 69 | worldwide, (ii) for the maximum duration provided by applicable law or 70 | treaty (including future time extensions), (iii) in any current or future 71 | medium and for any number of copies, and (iv) for any purpose whatsoever, 72 | including without limitation commercial, advertising or promotional 73 | purposes (the "Waiver"). Affirmer makes the Waiver for the benefit of each 74 | member of the public at large and to the detriment of Affirmer's heirs and 75 | successors, fully intending that such Waiver shall not be subject to 76 | revocation, rescission, cancellation, termination, or any other legal or 77 | equitable action to disrupt the quiet enjoyment of the Work by the public 78 | as contemplated by Affirmer's express Statement of Purpose. 79 | 80 | 3. Public License Fallback. Should any part of the Waiver for any reason 81 | be judged legally invalid or ineffective under applicable law, then the 82 | Waiver shall be preserved to the maximum extent permitted taking into 83 | account Affirmer's express Statement of Purpose. 
In addition, to the 84 | extent the Waiver is so judged Affirmer hereby grants to each affected 85 | person a royalty-free, non transferable, non sublicensable, non exclusive, 86 | irrevocable and unconditional license to exercise Affirmer's Copyright and 87 | Related Rights in the Work (i) in all territories worldwide, (ii) for the 88 | maximum duration provided by applicable law or treaty (including future 89 | time extensions), (iii) in any current or future medium and for any number 90 | of copies, and (iv) for any purpose whatsoever, including without 91 | limitation commercial, advertising or promotional purposes (the 92 | "License"). The License shall be deemed effective as of the date CC0 was 93 | applied by Affirmer to the Work. Should any part of the License for any 94 | reason be judged legally invalid or ineffective under applicable law, such 95 | partial invalidity or ineffectiveness shall not invalidate the remainder 96 | of the License, and in such case Affirmer hereby affirms that he or she 97 | will not (i) exercise any of his or her remaining Copyright and Related 98 | Rights in the Work or (ii) assert any associated claims and causes of 99 | action with respect to the Work, in either case contrary to Affirmer's 100 | express Statement of Purpose. 101 | 102 | 4. Limitations and Disclaimers. 103 | 104 | a. No trademark or patent rights held by Affirmer are waived, abandoned, 105 | surrendered, licensed or otherwise affected by this document. 106 | b. Affirmer offers the Work as-is and makes no representations or 107 | warranties of any kind concerning the Work, express, implied, 108 | statutory or otherwise, including without limitation warranties of 109 | title, merchantability, fitness for a particular purpose, non 110 | infringement, or the absence of latent or other defects, accuracy, or 111 | the present or absence of errors, whether or not discoverable, all to 112 | the greatest extent permissible under applicable law. 113 | c. 
Affirmer disclaims responsibility for clearing rights of other persons 114 | that may apply to the Work or any use thereof, including without 115 | limitation any person's Copyright and Related Rights in the Work. 116 | Further, Affirmer disclaims responsibility for obtaining any necessary 117 | consents, permissions or other rights required for any use of the 118 | Work. 119 | d. Affirmer understands and acknowledges that Creative Commons is not a 120 | party to this document and has no duty or obligation with respect to 121 | this CC0 or use of the Work. 122 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 |
2 | 3 |
4 | 5 | # MaRDA Metadata Extractors: Discussions & Minutes
6 | 7 |
8 | 9 | [![Documentation](https://badgen.net/badge/docs/marda-alliance.github.io/blue?icon=firefox)](https://marda-alliance.github.io/metadata_extractors/) 10 | 11 |
12 | 13 | > [!IMPORTANT] 14 | > The MaRDA Metadata Extractors working group has now ended; development will continue under the [datatractor](https://github.com/datatractor/) organisation, but discussions can continue in this repository. 15 | 16 | This repository contains organizational info for a [MaRDA](https://www.marda-alliance.org/) working group (WG) focused on connecting and advancing interoperability of efforts on automated extraction of metadata from materials files. 17 | 18 | **Contacts**: 19 | 20 | - *[Matthew Evans](https://ml-evs.science), UCLouvain* 21 | (`matthew.evans[at]uclouvain.be`) 22 | - *Peter Kraus, TU Berlin* 23 | (`peter.kraus[at]ceramics.tu-berlin.de`) 24 | - *David Elbert, Johns Hopkins University* 25 | (`elbert[at]jhu.edu`) 26 | 27 | 28 | ## Contributing 29 | 30 | This working group is completely open. 31 | If you would like to be added to the mailing list, please reach out to us over email, or just turn up at a meeting! 32 | 33 | The GitHub discussions on this repo can be used for pretty much any related chat, and specific code suggestions/feedback can be made as pull requests to the GitHub repos for each subproject: 34 | 35 | - [marda-alliance/metadata_extractors_schema](https://github.com/marda-alliance/metadata_extractors_schema) 36 | - [marda-alliance/metadata_extractors_registry](https://github.com/marda-alliance/metadata_extractors_registry) 37 | - [marda-alliance/metadata_extractors_api](https://github.com/marda-alliance/metadata_extractors_api) 38 | 39 | 40 | ## Organization & logistics 41 | 42 | 43 | - All meetings will be open and will take place on [Jitsi](https://meet.jit.si/) 44 | with the room code `marda-extractors`. Meeting dates will be arranged at the 45 | previous meeting, and will be announced on the mailing list. 46 | - Meeting minutes and relevant links will be made available in this repository under [`./meetings`](https://github.com/marda-alliance/metadata_extractors/tree/main/meetings). 
47 | - The [GitHub discussions on this 48 | repo](https://github.com/marda-alliance/metadata_extractors/discussions) will be used for asynchronous 49 | communications about the design, development and logistics of the overall 50 | working group - please introduce yourself in the [announcement 51 | thread](https://github.com/marda-alliance/metadata_extractors/discussions/1)! 52 | 53 | ## Proposal 54 | 55 | ![image](https://user-images.githubusercontent.com/7916000/205093968-632da485-b922-4244-8d8d-51ff85c92d35.png) 56 | 57 | 58 | The text below is taken from the initial working group proposal. 59 | 60 | ### Motivation 61 | 62 | Enabling interoperability between experimental apparatus, software, scientists and informaticians requires a lot of plumbing. 63 | This plumbing is often characterized as an **Extract-Transform-Load** (ETL) pipeline, whereby data is first extracted from some heterogeneous sources, transformed into a suitable format for the use case (querying, archival, further analysis) and then loaded onto a storage platform (a database, a filesystem, an archive repository, supplementary info for a publication). 64 | Such pipelines are often opaque, not reproducible and not sufficiently modular to be reusable; these shortcomings penalize groups with fewer resources to devote to data management, leading to much duplicated effort on reimplementing parsers or extractors for different file types and transformed data models. 65 | 66 | This working group aims to address these issues and promote FAIR practices by designing, creating and reusing software and infrastructure to streamline the process of describing ETL pipelines in such a way that they can be more broadly reused, and associated tooling for automated execution and discovery of said pipelines. 
67 | Following the name of the WG, it will primarily target the **Extraction** step, i.e., the literal parsing of unstructured data files, as well as the description of the output Transformed data model and encouragement of FAIR representations (cf. the upcoming recommendations from MaRDA WG5).
68 | This is a different approach from the typical target of creating a unifying output data schema; instead, we will remain agnostic to the output format provided it is sufficiently well-described in a machine-actionable manner.
69 | 
70 | ### Work plan
71 | 
72 | This working group aims to investigate, design, and implement or re-use a hierarchy of open software tooling and open infrastructure to achieve the goals outlined above. The initial 6 months of meetings will primarily involve scoping the overall project, and the group members, to ensure that the WG is representative of existing efforts, thus maximizing the chances of adoption. The three main technical thrusts are expected to be:
73 | 
74 | 1. **A lightweight metadata schema for parsers** and associated tooling for software libraries to self-report:
75 |    - The file formats they support (e.g., output files from a particular experimental apparatus, or log files from a computational chemistry code)
76 |    - The shape and semantics of the data models produced, via existing formats and tooling for data description: for example, self-contained formats such as HDF5 and STAR (and their respective domain-specific derivatives NeXus and CIF), schemas (JSONSchema, XSD), and semantic data (RDF, JSON-LD & CSV-LD). Depending on the output format, such metadata can be provided in-band or out-of-band in a well-defined location (e.g., a separate file, a persistent URL), following the upcoming recommendations of the MaRDA Data Dictionaries WG5.
77 |    - Any additional metadata required for re-use, such as code versions and environments, source and code archive URLs, and bibliographic data following, for example, Dublin Core.
78 | 1. 
**A common API specification for executing parser code**, and associated tooling.
79 |    - Parsers could be run natively (in the language they were written in) on local files via the creation of a language-specific harness to which the parsers contribute plugins. This harness will execute the parser in such a way as to maintain the link with its reported schema, as above.
80 |    - Parsers could be packaged into containers (Docker or otherwise) that provide reproducible environments for their execution. Another harness for executing the containers from various languages, or via an HTTP API, will be investigated. Such containers can then be deployed as part of larger infrastructure, or used in a serverless fashion. This approach is well-suited to asynchronous message queues and streaming of data.
81 | 1. **A searchable registry of parsers**
82 |    - Any parser code that implements the above functionality can be added (automatically or otherwise) to a registry of parsers that can be filtered against the metadata schema from thrust I. This could allow for automated support of a new file type (i.e., filter for parsers that support the file type and provide a container, then automatically download and deploy the container and use the expected endpoints to parse the data file).
83 |    - Such a registry would provide discoverability and automated validation of resources, accelerating the proliferation of shared or overlapping data models, schemas and semantics.
84 |      The registry could then be used to chain parsers together as composable blocks: for example, a custom semantic layer could be added to a data model by converting a JSON file with a well-defined schema into a JSON-LD file.
85 | 
86 | ### Goals & expected impact
87 | 
88 | - Choose some “anointed” parsing libraries across underserved disciplines (primarily for experimental data), and use them as prototypical cases for the above tooling.
There are already several options within the list of interested working group members that could be fruitful. The impact here should not only be a more reusable packaging of existing parsers, but also upstream improvements to the parsers themselves as schemas are formalized for the first time.
89 | - Once a simple registry has been set up, WG members will trial its use within their own existing infrastructure projects across materials science and chemical data management. If the trial is successful, some of the maintenance burden for these steps could be shared amongst the interested parties, such that future development benefits all.
90 | 
91 | ### Deliverables
92 | 
93 | Within 12 months of the commencement of the working group, we will deliver:
94 | 
95 | - A Working Group Note, published on the MaRDA Alliance website
96 | - A proof-of-concept implementation of I, II and III: documented, open source and available under the MaRDA Alliance GitHub organization.
97 | 
--------------------------------------------------------------------------------