├── 7.md
├── 16.md
├── 17.md
├── 9.md
├── 20.md
├── 5.md
├── 32.md
├── 3.md
├── 13.md
├── 14.md
├── 21.md
├── 26.md
├── 24.md
├── 2.md
├── 25.md
├── 8.md
├── 19.md
├── 6.md
├── 23.md
├── 29.md
├── 12.md
├── 4.md
├── 22.md
├── 10.md
├── README.md
├── 11.md
├── 30.md
├── 15.md
├── 18.md
├── 27.md
└── 31.md

/7.md:
--------------------------------------------------------------------------------
1 | # Project 7: Enhancing the Training metrics database (TMD) for improved reporting
2 |
3 | ## Abstract
4 |
5 | This project aims to extend the functionality of the current Training Metrics Database to allow uploading of node-specific questions and to establish a seamless connection to TeSS.
6 |
7 | The ELIXIR Training Metrics Database serves as a valuable resource in the ELIXIR Training ecosystem. It collects and aggregates the training metrics from the different nodes and is an invaluable resource for generating statistics and reports. Extending the functionality to allow for node-specific questions will further increase its usability for the nodes, catering to their unique reporting needs. Moreover, integrating it with TeSS would facilitate automatic data exchange between the two systems, further connecting the ELIXIR ecosystem of resources.
8 |
9 | We will achieve this by assembling a team of web developers and training coordinators and by collaborating with the TeSS development team.
10 |
11 | ## Lead(s)
12 |
13 | Nina Norgren, Eleni Adamidi
14 |
15 |
--------------------------------------------------------------------------------
/16.md:
--------------------------------------------------------------------------------
1 | # Project 16: Enhancing bio.tools by Semantic Literature Mining
2 |
3 | ## Abstract
4 |
5 | This project aims to improve and extend bio.tools metadata through fine-tuned named-entity recognition (NER) from Europe PMC and other established literature mining software.
This will help researchers find uses of particular software and measure the impact of research software beyond paper citations, thus providing a better indicator of its impact. Text mining mentions of software is a non-trivial problem, as software names are often homonymous with other entities, such as chemicals, genes or organisms. However, NER of software is facilitated by frequent context words such as “version”, “software” or “program”.
6 |
7 | This will be further exploited by integration of the often very detailed bio.tools annotations to enhance software recognition. We expect to identify ensembles of publications for thousands of software tools annotated in bio.tools, adding valuable information about tool usage to Europe PMC and providing relevant background data for more accurate and deeper tool categorization and annotations, as well as improved benchmarking of the tools themselves.
8 |
9 | ## Lead(s)
10 |
11 | Veit Schwämmle, Magnus Palmblad
12 |
13 |
--------------------------------------------------------------------------------
/17.md:
--------------------------------------------------------------------------------
1 | # Project 17: Development of FAIR image analysis workflows and training in Galaxy
2 |
3 | ## Abstract
4 |
5 | Image analysis tools within Galaxy are currently available but remain underutilised. Last year, our participation in the BioHackathon aimed to enhance the image analysis community in Galaxy. Our focus was on analysing the landscape of tools, gathering and annotating them. It involved community discussions to establish naming conventions, fostering greater standardization of these tools (see outcomes at https://github.com/beatrizserrano/bh2023-preprint/blob/main/BH2023_preprint_project16.pdf). Since then, the integration of tools into Galaxy has continued to expand. This year, our efforts will focus on the exploitation of such tools and showcasing Galaxy’s capabilities in meeting the needs of the imaging community.
6 |
7 | In this project, we will develop FAIR image analysis workflows and create comprehensive tutorials within the Galaxy Training Network. We will use sample datasets from public repositories to illustrate diverse image analysis tasks and build the corresponding Galaxy workflows. The resulting workflows will be made available on Workflowhub.eu. Tutorials will serve as documentation to facilitate the utilisation of these workflows.
8 |
9 | ## Lead(s)
10 |
11 | Beatriz Serrano-Solano, Anne Fouilloux
12 |
13 |
--------------------------------------------------------------------------------
/9.md:
--------------------------------------------------------------------------------
1 | # Project 9: BioHackrXiv: improving biohackathon publications
2 |
3 | ## Abstract
4 |
5 | BioHackrXiv.org is the Markdown-based pre-publishing platform for biohackathon project reporting that is used by the Elixir and Japanese biohackathons and other venues.
6 | The goal of BioHackrXiv is not only to expose work executed during and after biohackathons, but also to increase the scientific profile of participants by giving them citable publications.
7 | During the BioHackathon we want to improve the user experience of submitting publications to biohackrxiv.org. Starting at a Japanese biohackathon, this work has been mapped out in previous Elixir biohackathons, and it is great that we can get together and focus our attention on this project.
8 | As a new idea, we would like to allow authors to add metadata on their information, websites, source code, data, blogs, videos etc. as a FAIR resource.
9 |
10 | Our work has resulted in 73 publications so far, with 200+ authors. The number of submissions is increasing every year. The quality of the publications is very high, in our experience.
11 |
12 | A full list of supported hackathons can be found [here](http://preview.biohackrxiv.org/) and a list of publications [here](https://biohackrxiv.org/discover).
13 | Previous publications are [here](https://biohackrxiv.org/discover?q=biohackrxiv).
14 |
15 | ## Lead(s)
16 |
17 | Pjotr Prins, Arun Isaac
18 |
19 |
--------------------------------------------------------------------------------
/20.md:
--------------------------------------------------------------------------------
1 | # Project 20: Structuring Clinical Reports into OMOP Common Data Model (CDM)
2 |
3 | ## Abstract
4 |
5 | A clinical case report is a detailed report of the diagnosis, treatment, signs, symptoms and follow-up of a single patient. Case report forms (CRFs) are used to standardize the collection of these patient data in clinical research studies and trials. CRFs provide a semi-structured approach to collecting data, in which a combination of structured categories of patient data and free-text content is defined. Since such CRFs are predominantly attached to publications as supplementary files, they come in various data formats (e.g., PDF, XLS, CSV, GIF).
6 |
7 | With the increase of Open Access publications, the number of supplementary data files keeps growing, while the ability of researchers to find and reuse this information remains severely limited. Beyond keyword queries of PMC/MEDLINE article indexes, researchers are unable to find CRFs using standard clinical terms to search for common clinical concepts.
8 |
9 | The inability to adequately search supplementary files, where CRFs can accompany a clinical research trial or study publication, requires researchers to manually locate and evaluate the contents of individual CRFs. This project aims at enhancing the FAIRness (Findable, Accessible, Interoperable, Reusable) of CRFs by transforming them into a structured Common Data Model (CDM) like OMOP (Observational Medical Outcomes Partnership).
10 |
11 | ## Lead(s)
12 |
13 | Venkata Satagopam, Tim Beck
14 |
15 |
--------------------------------------------------------------------------------
/5.md:
--------------------------------------------------------------------------------
1 | # Project 5: Mapping of research software quality indicators across the ELIXIR Research Software Ecosystem
2 |
3 | ## Abstract
4 |
5 | The primary goal of this project is to perform a cross-walk of indicators around research software quality, creating a comprehensive catalogue that can be used in the context of the ELIXIR Research Software Ecosystem. Reaching a common understanding of quality indicators is a well-understood and acknowledged challenge in research software, with different levels of maturity across domains. This project will primarily focus on the particular aspects of the ELIXIR community, while aiming for an outcome applicable to the wider Life Science community.
6 |
7 | This catalogue will be extremely useful in raising awareness of the existing services, knowing their requirements and expectations, and identifying the optimal service for the particular case. Moreover, it will allow us to identify what could be potential gaps in a particular community, as well as indicators that could be adopted across communities.
8 |
9 | The project directly ties into various activities and efforts, both within ELIXIR (Tools Platform Software Best Practices, OpenEBench, Software Observatory, SMPs, STEERS WP2, etc.) as well as beyond (EOSC, EVERSE, NFDI4DataScience, etc.).
10 |
11 | We plan to engage participants in essentially all activities. Newcomers can share their experience with research software development, software management, and research software quality.
12 |
13 | ## Lead(s)
14 |
15 | Fotis Psomopoulos, Eva Martin del Pico
16 |
17 |
--------------------------------------------------------------------------------
/32.md:
--------------------------------------------------------------------------------
1 | # Project 32: VCF Explorer: Empowering Genomic Data Interaction
2 |
3 | ## Abstract
4 |
5 | Genomic data plays a pivotal role in understanding genetic variations, disease associations, and personalised medicine. However, managing and querying Variant Call Format (VCF) files efficiently remains a challenge due to their large size and complex structure. In this project, we propose the development of a VCF file explorer: a web-based application that facilitates seamless interaction with VCF files. Our approach leverages the array-based TileDB VCF data model to create a scalable and efficient database for storing VCF files, overcoming the limitations of traditional data storage systems like relational databases.
6 |
7 | The tool will allow users to search for specific variants based on custom filters and perform aggregate analyses while providing interactive visualisations. The VCF Explorer will empower researchers, clinicians, and bioinformaticians to efficiently explore and analyse genomic variants. By combining the robustness of TileDB with a user-friendly web interface, we aim to accelerate genomics research, variant interpretation, and clinical decision-making. A basic proof of concept of the tool has been developed by the authors for the Lineberger Comprehensive Center Bioinformatics core.
8 |
9 | Reference:
10 |
11 | [https://docs.tiledb.com/main/integrations-and-extensions/genomics/population-genomics](https://docs.tiledb.com/main/integrations-and-extensions/genomics/population-genomics)
12 |
13 | ## Lead(s)
14 |
15 | Sarang Bhutada, Vibhor Gupta
16 |
17 |
--------------------------------------------------------------------------------
/3.md:
--------------------------------------------------------------------------------
1 | # Project 3: Reusable RDM Planning Environments for Trainings and Workshops
2 |
3 | ## Abstract
4 |
5 | Training sessions on Research Data Management (RDM) and data management planning are gaining traction within ELIXIR and beyond. One of the primary platforms for Data Management Plans (DMPs), known as Data Stewardship Wizard (DSW), plays a crucial role in educating researchers on DMP practices. Often utilized across various ELIXIR Nodes and Communities, DSW not only facilitates DMP creation but also serves as an educational tool.
6 |
7 | Organizations conducting such training sessions typically need to establish a separate DSW instance dedicated to training purposes. However, the repetitive nature of setting up and cleaning these instances, along with preparing content and user accounts, can be cumbersome. While some organizations have developed custom scripts leveraging the REST API to streamline these tasks, they require ongoing maintenance to align with DSW's monthly release cycle.
8 |
9 | In this project, our aim is to develop a service that provides pre-configured sets of content for bootstrapping, cleaning, and verifying DSW instances for training and workshops. These sets will be packaged in a shareable and reusable format, following FAIR principles, allowing organizations to manage their own sets while easily sharing or customizing existing ones. By integrating this service into the open-source codebase, we ensure compatibility and technical readiness aligned with the DSW platform itself.
As a result, setting up a testing environment will take mere minutes, if not less.
10 |
11 | ## Lead(s)
12 |
13 | Kryštof Komanec, Jana Martínková
14 |
15 |
--------------------------------------------------------------------------------
/13.md:
--------------------------------------------------------------------------------
1 | # Project 13: Interconnecting identifiers.org into a broader metadata connectivity
2 |
3 | ## Abstract
4 |
5 | Identifiers.org, an Elixir Recommended Interoperability Service, is a meta-resolver based on a registry which acts as a source of truth. It provides a resolution service for compact identifiers, as well as a harmonisation service, based on records which store the final resolving locations associated with assigned prefixes. The registry contains metadata on its namespace and resource entries, which include valuable information on the data collections, such as ID regex patterns, online resources where identified data objects can be resolved, and associated institutions for these resources.
6 |
7 | We propose exposing this information in RDF format, greatly expanding the interoperability of this service and allowing direct consumption in a variety of ways, for example into Knowledge Graphs. The resulting RDF could be supplemented with several schemas such as DCAT, VoID, and schemas available in BioSchemas. Supplementing the same dataset using different schemas would remove the need for the often time-consuming practice of mapping. Writing REST APIs and enabling a SPARQL endpoint for this information would be a more technical challenge.
8 |
9 | Furthermore, expanding the resolver to find related metadata for compact identifiers will enable support for additional use cases, matching the EOSC PID Meta Resolver. For this, we intend to connect our metadata resolver with the BridgeDB ELIXIR resource and the TogoID system from DBCLS.
Through these additions, identifiers.org will become more useful for its users as an interoperability service that is easily consumable.
10 |
11 | ## Lead(s)
12 |
13 | Renato Juaçaba Neto, Nick Juty
14 |
15 |
--------------------------------------------------------------------------------
/14.md:
--------------------------------------------------------------------------------
1 | # Project 14: FAIRly easy APIs for research data in (Bio)Schema.org and RDF
2 |
3 | ## Abstract
4 |
5 | One third of the Elixir Core Data Resources (CDRs) provide their data as RDF, coupled with a SPARQL endpoint to query this data. While SPARQL is a powerful query language, only a minority of all data scientists and bioinformaticians are familiar with it. Therefore, to enable a wider reuse of RDF data, complementary data access interfaces are much needed.
6 |
7 | Python and R are two of the most popular programming languages among data scientists and bioinformaticians respectively. Therefore, to give this target audience easy access to public RDF data, we have been experimenting with generating R and Python APIs for each of these datasets in a fully automatic manner. We do so by leveraging automatically generated descriptions of each dataset, i.e. information regarding the available classes and properties, as well as their cardinalities.
8 |
9 | This auto-generation is important because:
10 |
11 | * 1) it significantly speeds up the API creation: a dataset maintainer will only need to verify the auto-generated code, without the need to actually write it by hand;
12 |
13 | * 2) it significantly enhances dataset Findability and Reuse: even Elixir CDRs have uneven representation of API packages across different programming languages, making their reusability depend on a technical hurdle, namely how familiar a given user is with the range of available programming languages.
14 |
15 | An under-resourced or less well-known dataset will likely have at most one API package.
Being able to quickly generate complete APIs given only an RDF file or SPARQL endpoint will help better connect data providers with their users.
16 |
17 | ## Lead(s)
18 |
19 | Ana Claudia Sima, Jerven Bolleman
20 |
21 |
--------------------------------------------------------------------------------
/21.md:
--------------------------------------------------------------------------------
1 | # Project 21: Enhancing multi-omic analyses through a federated microbiome analysis service
2 |
3 | ## Abstract
4 |
5 | Multi-omics datasets are an increasingly prevalent and necessary resource for achieving scientific advances in microbial ecosystem research. However, they present twin challenges to research infrastructures. Firstly, the utility of multi-omics datasets relies entirely on the interoperability of omics layers, i.e. on formalised data linking. Secondly, microbiome-derived data typically lead to computationally expensive analyses, which depend on the availability of powerful compute infrastructures.
6 |
7 | Historically, these challenges have been met within the context of individual database resources or projects. These confines limit the FAIRness of datasets (since they typically aren’t interlinked, directly comparable, or collectively indexed), and mean the scope to analyse such datasets is governed by the available resources of the given project or service. Removing these confines, by establishing a model for the federated analysis of microbiome-derived data, will allow these challenges to be met by the community as a whole. More compute can be brought to bear by combining EOSC and ELIXIR infrastructures, Galaxy instances, and existing resources like EMBL-EBI’s MGnify, but this requires adopting a common schema for sharing analysed datasets, including their provenance.
8 |
9 | Such a schema can also directly contribute to the interlinking of omics layers, using research objects to connect linked open datasets.
We aim to design and implement a schema for this purpose, and use it to allow the generation of comparable analyses on heterogeneous compute infrastructures. Doing so will streamline the deposition of accessioned analysis products into public databases.
10 |
11 | ## Lead(s)
12 |
13 | Alexander Rogers, Alexander Sczyrba
14 |
15 |
--------------------------------------------------------------------------------
/26.md:
--------------------------------------------------------------------------------
1 | # Project 26: Reducing the environmental impact of Galaxy
2 |
3 | ## Abstract
4 |
5 | Workflow management systems (WMSs) such as Galaxy are uniquely positioned to enable researchers to perform more environmentally-sustainable computational data analysis as they have full control of the resources used for a given workflow.
6 | In this project we want to reduce Galaxy resource usage by focusing on: 1) job caching to enable the reuse of tool outputs, and 2) environmentally-friendly job scheduling.
7 |
8 | Job caching uses the provenance information stored in Galaxy’s database for each tool execution to avoid unnecessary recalculations when the relevant parameters match. An initial implementation is already available and we will work on making the job cache apply in more scenarios. In particular, we will infer how strictly dataset metadata needs to match for a job to be considered identical. We will also enable sharing the job cache for users that have opted in to this feature, making it possible to run large-scale analyses during training sessions without consuming an unnecessary amount of computing resources.
9 |
10 | Advanced job scheduling is made possible in Galaxy through the Total Perspective Vortex (TPV) plugin. TPV can route entities (tools, users) to selected destinations with appropriate resource allocations (cores, GPUs, memory). It additionally allows arbitrary Python-based rules, e.g. custom ranking functions for choosing between destinations.
We will specifically rank destinations in an order that promotes sustainability. Expanding on our initial implementation, we will collect (job-related) statistics and information from the Galaxy database and (Pulsar) compute destinations in a central location and add further ranking algorithms based on these statistics.
11 |
12 | ## Lead(s)
13 |
14 | Nicola Soranzo, Paul De Geest
15 |
16 |
--------------------------------------------------------------------------------
/24.md:
--------------------------------------------------------------------------------
1 | # Project 24: Increasing FAIRness of digital agrosystem resources by extending Bioschemas
2 |
3 | ## Abstract
4 |
5 | Research Data Infrastructures (RDIs) provide crucial publication services for researchers in the agrosystem domain. Due to their heterogeneous user communities and requirements, metadata standardization approaches to increase the FAIRness of resources can be a catalyst in simplifying data reusability and enabling cross-domain research.
6 |
7 | One way for RDIs to increase the Findability of their resources is to provide metadata markup via Schema.org, a vocabulary consumed by well-known search engines. Bioschemas, an extension of schema.org focussed on the life sciences, is an open community effort aiming at increasing the adoption of key metadata properties in a domain-targeted manner, through the creation of needed domain-associated types, agreed properties and usable metadata profiles for describing those life science resources. The project will work on developing new resources (types, properties, profiles) for Bioschemas, which will help to describe agrosystem datasets in a FAIR manner.
8 |
9 | Participants will work on different topics, ranging from evaluating the current state of developed types and properties relevant to agrosystem resources to drafting new ones following use-case requirements and using example datasets.
To increase metadata quality and support RDIs and their users in adopting the extension, participants will link properties to domain ontologies and facilitate mappings to other metadata standards, bolstering interoperability while following mapping frameworks like FAIR-IMPACT’s approach. To further ease the adoption for RDIs, participants will work on creating guidance and best practice documents on how to implement the extension into existing metadata description processes.
10 |
11 | ## Lead(s)
12 |
13 | Gabriel Schneider, Marco Brandizi
14 |
15 |
--------------------------------------------------------------------------------
/2.md:
--------------------------------------------------------------------------------
1 | # Project 2: A curated assessment of metadata descriptors of AI-ready datasets
2 |
3 | ## Abstract
4 |
5 | To advance the use of Machine Learning for the understanding of diseases and conservation of biodiversity, it is important to promote FAIR AI-ready datasets, since data scientists and bioinformaticians spend 80% of their time on finding and preparing data. Metadata descriptors for datasets are pivotal for the creation of Machine Learning models as they facilitate the definition of strategies for data discovery, feature selection, data cleaning and data pre-processing.
6 |
7 | Once a dataset is AI-ready, such metadata descriptors change with respect to the initial version of the raw data. What can we learn from the metadata of raw vs AI-ready datasets? What transformations from raw to AI-ready could be (semi)automated based on metadata descriptors? In this project, we will manually analyze and curate metadata descriptors before and after AI-readiness. Based on our analysis, we will identify dataset transformations that could be (semi)automated by software pipelines with the aim of alleviating the effort and time invested in data pre-processing for Machine Learning.
8 |
9 | The results will later be integrated into a metadata-based reproducibility assessment cycle, part of the NFDI4DataScience project in Germany. To facilitate the work during the BioHackathon, we will focus on datasets from the DOME registry, as this already indicates some level of availability for the metadata (even if hidden in a scholarly article). The AI-ready metadata descriptors will use the Croissant schema proposed by the ML Commons. This project will also take into account previous work done at the BioHackathon 2022 on metadata for synthetic data.
10 |
11 | ## Lead(s)
12 |
13 | Leyla Jael Castro, Nuria Queralt Rosinach
14 |
15 | ## Project repository
16 |
17 | https://github.com/zbmed-semtec/bheu24-cm4mlds (with updated information and developments)
18 |
19 |
--------------------------------------------------------------------------------
/25.md:
--------------------------------------------------------------------------------
1 | # Project 25: Recognising research software contributions leveraging the ELIXIR infrastructure
2 |
3 | ## Abstract
4 |
5 | The proposed project aims to enhance the capabilities of the ELIXIR infrastructure to track, credit, and recognise software contributions made by research software engineers. The project seeks to foster collaboration and engagement within the ELIXIR Communities and platforms by working with different stakeholders, including individual contributors to research software, to define indicators that help measure the value of these contributions.
6 |
7 | The primary objective of this project is to promote a strong sense of community by recognising individual software contributions. We plan to connect APICURON and GitHub to track open-source research software contributions and reward each contributor for their efforts. We will also link OpenEBench evaluation data to this process by crosslinking GitHub repositories available in the Software Observatory section with APICURON.
This will involve retrieving activity data from GitHub and OpenEBench, processing it, and integrating it into the APICURON platform. Recognition items will be pushed to ORCID from APICURON and made available for third-party services.
8 |
9 | We will leverage the involvement of APICURON and OpenEBench in the Data and Tools platforms and in the ELIXIR STEERS and EVERSE projects. These platforms and projects provide a network of stakeholders and guidance for implementing a fair recognition infrastructure.
10 |
11 | The project's usefulness lies in its ability to address a significant gap in the recognition and valuation of research software contributions. By providing a framework for recognition, we can incentivise and motivate developers to contribute to open-source software projects, leading to improved software quality and reproducibility and positively impacting the environment by reducing the carbon footprint.
12 |
13 | ## Lead(s)
14 |
15 | Adel Bouhraoua, Gavin Farell, José Mª Fernández
16 |
17 |
--------------------------------------------------------------------------------
/8.md:
--------------------------------------------------------------------------------
1 | # Project 8: Data Model Converter: Bridging Cohort Information Across Biomedical Data Models
2 |
3 | ## Abstract
4 |
5 | In the era of precision medicine, interoperability of biomedical data is crucial for facilitating collaborative research. The concept of a minimum data set (MDS) has arisen as a collection of data elements that uses a standard approach to allow clinical data sharing and its use for research purposes. Health data are typically voluminous, complex, and sometimes too ambiguous to generate indicators that can provide knowledge and information on health.
Our project aims to address this challenge by developing a versatile web-based tool, called "DataModel Converter", which enables conversion of cohort information to different biomedical minimum data sets. It offers an intuitive interface for users to select individuals from a cohort in a clinical database (e.g. OMOP CDM, OpenEHR) and effortlessly transform their structured data into various other standard formats, including B1MG Minimal Dataset for Cancer, BBMRI cohort definitions, OMOP cohorts, Phenopackets and Beacon v2.
6 |
7 | The key objectives of our project include:
8 | 1. Designing an interactive and user-friendly web application for cohort selection and data conversion.
9 |
10 | 2. Implementing backend functionalities to retrieve and manipulate data from clinical databases.
11 |
12 | 3. Developing semantic mappings between different data models while preserving data integrity and semantics (OMOP CDM, OpenEHR, Phenopackets, B1MG, BBMRI, etc.).
13 |
14 | 4. Ensuring the scalability and performance of the DataModel Converter platform to handle large-scale datasets.
15 |
16 |
17 | By providing researchers and healthcare professionals with a flexible and efficient means to harmonize data across disparate data models, our project aims to accelerate biomedical research, enhance collaboration, and ultimately contribute to advancements in personalized medicine and patient care.
18 |
19 | ## Lead(s)
20 |
21 | Sergi Aguiló-Castillo, Alberto Labarga
22 |
23 |
--------------------------------------------------------------------------------
/19.md:
--------------------------------------------------------------------------------
1 | # Project 19: Creating user benefit from ARC-ISA RO-Crate machine-actionability
2 |
3 | ## Abstract
4 |
5 | The development of FAIR Digital Objects (FDOs) holds immense promise for advancing scientific research, yet one critical challenge persists: despite efforts to create FDOs, achieving true machine-actionability remains elusive.
6 |
7 | We will address this pressing issue by focusing on the integration of Annotated Research Contexts (ARCs) within the scientific community. Recognizing the substantial efforts in annotating research and packaging it as RO-Crate FDOs, it is imperative to incentivize and leverage these endeavors to yield benefits transcending mere data management. ARCs as FDOs excel in meticulous record-keeping, rendering them indispensable in the realm of research data management.
8 |
9 | However, dissemination and practical actionability of ARCs across diverse services, tools and repositories is pivotal in engendering user benefits. These platforms require the capacity to comprehend and interpret RO-Crates, enabling seamless interaction with FDOs. Drawing on ARC FDO consumption, search and indexing platforms must provide users with comprehensive search results, while the service infrastructure can offer customised services tailored to the data described in the FDO.
10 |
11 | Therefore, we will build a robust content-based recommendation framework. This approach promises to furnish users with enriched representations of ARC RO-Crate content, facilitating content-based filtering tailored to individual user needs.
12 |
13 | To substantiate the efficacy of this framework, Galaxy will serve as the representative workflow engine in a proof-of-concept endeavor aimed at suggesting workflows based on data annotated and encapsulated within ARC RO-Crates. Leveraging collaborative efforts uniting domain experts, developers, and stakeholders across diverse backgrounds, our objective is to engineer practical solutions that render ARC-ISA RO-Crates actionable across pivotal platforms.
14 |
15 | ## Lead(s)
16 |
17 | Angela Kranz, Eli Chadwick
18 |
19 |
--------------------------------------------------------------------------------
/6.md:
--------------------------------------------------------------------------------
1 | # Project 6: Gender representation in ELIXIR-supported publications: a visibility analysis across academic search engines
2 |
3 | ## Abstract
4 |
5 | An equitable gender representation in ELIXIR-supported publications is crucial for fostering diversity and inclusivity within the ELIXIR community. Recognizing the potential for gender bias in popular bibliographic information retrieval systems, such as Google Scholar, the Bioinfo4Women initiative at the Life Sciences Department of the Barcelona Supercomputing Center (BSC) has developed a system to automatically retrieve comprehensive bibliographic data from Google Scholar queries, equipped with the capability to infer the gender of publication authors. Given Google Scholar's widespread use and its opaque "relevance" ranking criteria, our tool presents a significant opportunity to scrutinize and understand potential gender and visibility biases in ELIXIR-supported publications, particularly focusing on the underrepresentation of women as leading authors in specific domains.
6 |
7 | The project aims to rigorously test and utilize the capabilities of our system to specifically explore ELIXIR-supported publications, drawing on the existing compilation created by the ELIXIR Impact Group ([https://elixir-europe.org/about-us/impact/publications](https://elixir-europe.org/about-us/impact/publications)). The challenge's objective is twofold: to assess the impact of Google Scholar's algorithm on the visibility of ELIXIR publications authored by women and to benchmark these findings against more transparent and FAIR-aligned bibliographic engines, such as the BIP! Finder developed by ELIXIR Greece.
8 | 9 | This endeavor will not only highlight discrepancies in gender representation but also foster the development of more equitable information retrieval practices. By leveraging our system's unique functionalities, participants will contribute to a more inclusive understanding of scholarly impact, paving the way for interventions to mitigate bias in academic literature production and discovery. 10 | 11 | ## Lead(s) 12 | 13 | Davide Cirillo, María Morales Martínez 14 | 15 | -------------------------------------------------------------------------------- /23.md: -------------------------------------------------------------------------------- 1 | # Project 23: MARS: Multi-omics Adapter for Repository Submissions, preparing for launch 2 | 3 | ## Abstract 4 | 5 | Multimodality studies are a reality, with scientists commonly using several different data acquisition techniques to characterise biological systems under various experimental conditions. Yet, the deposition of such studies to public repositories remains a challenge for scientists who need familiarity with individual repositories to achieve these data publication requirements. Started during [the Biohackathon 2023](https://github.com/elixir-europe/biohackathon-projects-2023/tree/main/27), [the MARS project](https://github.com/elixir-europe/MARS) (Multi-omics Adapter for Repository Submissions) made great strides in producing a proof of concept for dispatching metadata to BioSamples, ENA and MetaboLights using the ISA-JSON format. ISA-JSON, designed for multi-omics studies, has clear specifications and is used as an output format by ISA tools, DataPLANT's ARC and FAIRDOM-SEEK software. 6 | 7 | Following this success, there is now interest in extending the service to support functional genomics data, hosted by ArrayExpress or BioStudies, EVA, EGA and e!DAL-PGP. Therefore, the objectives of this project are the following: 8 | 9 | 1.
Consolidate the current proof of concept and bring it closer to a functional prototype by further testing and streamlining it. 10 | 11 | 2. Extend functionalities to submit actual data files along with the metadata. 12 | 13 | 3. Collaborate further with repositories to support ISA-JSON format for programmatic submission via their API endpoint. 14 | 15 | 4. Extend the MARS component and CLI to include additional data types and repositories, such as transcriptomics and possibly proteomics as a stretch goal. 16 | 17 | 5. Develop domain specific minimal annotation profiles, building on the experience gained with MetaboLights for MS and NMR based assay definitions. 18 | 19 | To this end, we have assembled a team of subject matter experts to deliver on the task. 20 | 21 | 22 | ## Lead(s) 23 | 24 | Bert Droesbeke, Philippe Rocca-Serra 25 | 26 | -------------------------------------------------------------------------------- /29.md: -------------------------------------------------------------------------------- 1 | # Project 29: ELIXIR FAIR Lesson Plan Handbook: advancing researchers’ & data stewards’ FAIR skills 2 | 3 | ## Abstract 4 | 5 | Community activities (e.g., FAIRsFAIR, CONVERGE, ELIXIR-NL’s FAIR Data Day) have signalled the need for a framework on how to teach FAIR skills to researchers and data stewards. Via hackathons (e.g., CONVERGE, BioHackathon 2023) with over 50 participants in total, a minimum viable product (MVP), the [ELIXIR FAIR Lesson Plan Handbook](https://elixir-europe-training.github.io/ELIXIR-TrP-FAIR-Converge/), was created. Although lesson plans aren’t ready-to-go courses yet, they offer the basic framework for FAIR training. What do researchers and data stewards have to be skilled in to apply FAIR to datasets? 6 | 7 | We propose a BioHackathon 2024 project for two reasons: 8 | 9 | * To improve the user-friendliness. At BioHackathon 2023, a new format was applied to some lesson plans.
This was received well, and we will apply it to other lesson plans as well. 10 | 11 | * To align with developments in ELIXIR training: 12 | 13 | * [Learning paths](https://elixir-europe.org/focus-groups/learning-paths) for data stewards and researchers, as part of the new Learning Path FG. Providing users with a “pathway” that connects lesson plans to cater to their needs makes teaching FAIR more trainee-oriented. Many organisations struggle with training on how to go/do FAIR. 14 | 15 | * [The FAIR Metroline](https://zenodo.org/records/10850958) (in development), an ELIXIR-NL initiative for a unified FAIRification workflow, based on a comparison of FAIR models. The FAIR Metroline smartly combines practical FAIR steps with training needs/gaining competences in organisations. 16 | 17 | BioHackathon 2023 resulted in the creation of the GitHub repository, with corresponding website, for the ELIXIR FAIR Lesson Plans. It will be BioHackathon 2024 that enables trainers to start using it, as we will work hard to restructure the content and engage with the community. 18 | 19 | ## Lead(s) 20 | 21 | Mijke Jetten, Martijn Kersloot 22 | 23 | ## Relevant links 24 | [Agenda document](https://docs.google.com/document/d/1N0qC44g9ijd1kCeLgKNZ7sHrlPXLycspQTSXnEqgwc4/edit) with links 25 | -------------------------------------------------------------------------------- /12.md: -------------------------------------------------------------------------------- 1 | # Project 12: Perturb-Bench: large-scale benchmarking of perturbational modelling tools in complex single-cell data 2 | 3 | ## Abstract 4 | 5 | Single-cell perturbation modelling delineates how perturbations affect cellular and molecular physiology, such as transcription factors, kinases, and signalling pathways. Perturbation modelling aims to understand the molecular impacts of pharmaceutical compounds or cellular stimulants, dissect disease pathobiology, and facilitate drug repurposing.
6 | 7 | Our BioHackathon project aims to address the current lack of independent benchmarking and best practices for perturbation modelling tools, which hinders their broader adoption by the single-cell community. We will conduct an extensive benchmarking study for various perturbation modelling tools, including variational autoencoders, graph-based models for gene-regulatory networks, Optimal Transport tools deciphering cell states, and foundational models. 8 | 9 | The benchmarking study will focus on out-of-distribution predictions for unseen events, drug synergy scores, and distilling perturbation effects from confounding sources of variation. We will adopt workflow management systems compatible with community-driven benchmarking frameworks, such as OpenEBench. 10 | 11 | We will utilise harmonised single-cell datasets from scPerturb (containing control/disease samples and CRISPR/compound treatments, e.g., sci-Plex, Perturb-seq). The project will standardise emerging metrics (e.g. gene expression correlation, distribution distances, clustering separation) concerning datasets and perturbational tasks and assemble a multidisciplinary group of participants to address biological and computational-mathematical challenges. 12 | 13 | Another goal will be the creation of a continuous repository to further develop benchmarking efforts beyond the BioHackathon’s duration. The project's feasibility is supported by the expertise of the leads, who are members of the ELIXIR Single Cell Omics Community/Machine-Learning Focus group, and their ongoing research initiatives, e.g. Mongoose ELIXIR Staff Exchange Project (GR-DE-NL nodes, Feb-Jul 2024). 
14 | 15 | ## Lead(s) 16 | 17 | Georgios Gavriilidis, Marina Esteban-Medina 18 | 19 | -------------------------------------------------------------------------------- /4.md: -------------------------------------------------------------------------------- 1 | # Project 4: SPARQL Query Generation for Efficient Scientific Data Access of ELIXIR resources 2 | 3 | ## Abstract 4 | 5 | The Swiss Institute of Bioinformatics (SIB/ELIXIR-CH), Database Center for Life Science (DBCLS-Japan) and RIKEN-Japan join efforts to develop an open-source artificial intelligence (AI)-driven system for intuitive querying of scientific datasets to accelerate scientific innovation. We call for contributions to these efforts, which align with the BioHackathon's goal of fostering an open-source infrastructure for data integration and address the urgent need for effective data retrieval methods. 6 | 7 | Our goal is to make it easier for life scientists to use databases by converting their questions into SPARQL queries using large language models (LLMs). We understand the difficulties researchers face with SPARQL's complexity and knowledge base schemas, so we suggest a user interface that combines LLMs and knowledge bases. This will allow for direct data interaction in natural language, simplifying the research process. Our approach will facilitate data discovery and retrieval with the necessary accuracy for scientific research, as it leverages LLMs to generate SPARQL queries grounded in validated scientific data. 8 | 9 | Despite LLMs’ abilities in areas like code generation, they often struggle with the semantic accuracy of SPARQL queries. Our project is focused on addressing these limitations, ensuring that conversational AI can accurately interpret and translate research inquiries into precise queries.
It aligns with the objectives of the ELIXIR 2024-26 Programme and lays the groundwork for future research collaborations, offering a practical solution for data-driven discovery in the life sciences. 10 | 11 | ## Project GitHub Repository 12 | If you want to contribute or are just curious about our work, see: 13 | 14 | 👩‍💻 Project code: [https://github.com/jcrangel/SPARQL4ELIXIR 15 | ](https://github.com/jcrangel/SPARQL4ELIXIR.git) 16 | 17 | 📝 Project backlog: https://github.com/users/jcrangel/projects/9 18 | 19 | ## Lead(s) 20 | 21 | - Tarcisio Mendes de Farias, SIB Swiss Institute of Bioinformatics (ELIXIR-CH) 22 | - Julio Rangel, RIKEN - JAPAN 23 | 24 | ## Team members 25 | - Vincent Emonet, SIB Swiss Institute of Bioinformatics (ELIXIR-CH) 26 | - TBD 27 | 28 | -------------------------------------------------------------------------------- /22.md: -------------------------------------------------------------------------------- 1 | # Project 22: Enabling Secure Data Access from Galaxy to (F)EGA 2 | 3 | ## Abstract 4 | 5 | In an era marked by the continuous growth of precision medicine and the emergence of regulations such as the GDPR and EHDS, the implementation of secure repositories to enable data sharing has become essential. These protocols play a crucial role in preserving the confidentiality of sensitive information and effectively mitigating risks associated with unauthorised access and data breaches. 6 | 7 | Galaxy is one of the most popular analysis platforms, especially among non-bioinformatics specialists. Thus, to increase Galaxy's integration within environments that require stringent data security measures, this proposal devises a comprehensive strategy that would facilitate the secure and scalable access and processing of sensitive datasets (and their derivative sensitive results) within Galaxy. 
8 | 9 | Integral to this endeavour is the European Genome-Phenome Archive (EGA), recognised as the predominant repository within Europe for the secure storage of phenoclinical and genomics data, thus underscoring its significance in biomedical data security considerations. As data housed within EGA federated repositories is encrypted in accordance with the GA4GH Crypt4GH standard, the proposed strategy is the development of a protocol tailored to enable the secure access, transfer, and processing of encrypted datasets, thereby leveraging the capabilities of a multi-user public Galaxy platform. 10 | 11 | ## Project Objectives 12 | 13 | The objective of the project can be divided into two milestones: 14 | 1. Development of a workflow that connects EGA, either central or any federated node, and Galaxy through Crypt4GH protocols. 15 | 2. Galaxy's secure processing protocol: Sensitive datasets are kept encrypted throughout, with sensitive derivative results labelled as sensitive. 16 | 17 | ## Resources 18 | 19 | 1. [About Galaxy](https://docs.galaxyproject.org/en/master/). 20 | 2. [About EGA](https://localega.readthedocs.io/en/latest/). 21 | 3. [About GA4GH Crypt4GH](https://crypt4gh.readthedocs.io/en/latest/). 22 | 4. [Slack Channel](https://biohackeu.slack.com/archives/C07NBGJKE0Z) - Use this as the main source of communication between in-person and virtual participants during the hackathon.
23 | 24 | ## Leads 25 | 26 | María Chavero-Díez (ELIXIR-ES), Sveinung Gundersen, Pável Vázquez Faci (ELIXIR-NO) 27 | -------------------------------------------------------------------------------- /10.md: -------------------------------------------------------------------------------- 1 | # Project 10: BioSchemas for mortals 2 | 3 | ## Abstract 4 | 5 | Bioschemas is a community effort to improve the FAIRness of web-based resources. Established by ELIXIR over 7 years ago, it has good adoption by the technical communities in workflows, software and tools but less adoption than there should be, particularly in less technical communities such as training or even data services. 6 | 7 | The Bioschemas website hosts tooling, training and guidance materials. Practical 'how to use Bioschemas' help and examples have been neglected. Guidance is technical - written by techies for techies - and inappropriate or inaccessible for a large cohort of potential Bioschemas users. Examples focus on simple use cases and not real set-ups that users actually encounter in their work. This makes access to directly usable markup impossible, leaving users confused about where to go next. This lack of helpful support poses an unacceptably high technical barrier for the broader user community, and means we are not fully exploiting Bioschemas. Common complaints frequently cite a technical ‘barrier’, the lack of ‘lightweight’ guidance, the need to ‘demystify’ and lack of assistance to users who have the desire to implement Bioschemas, but not the ‘how’. 8 | 9 | The goal of this project is to reimagine, reframe and supplement the existing Bioschemas guidance available. Working with non-technical users from the data and training platforms, we will identify patterns of use, tasks (different CMS; properties) and user personas.
This will be used to provide users with specific code examples that can be copy/pasted, documented examples for different web setups, and customised guidance for different personas, and will be validated by non-technical users in the data and training platforms. 10 | 11 | ## Lead(s) 12 | 13 | Nick Juty, Helena Schnitzer 14 | 15 | ### Key participant 16 | 17 | Phil Reed 18 | 19 | ## Complete our survey 20 | 21 | Have you used Bioschemas to mark up your data, tools, training materials or other content? 22 | Was it cumbersome, tedious or a joyous experience, or did you fail to get started? 23 | Please share your experiences and help us to run this project by [completing our survey](https://bit.ly/bh2410s). 24 | Every response really helps our analysis and takes less than 10 minutes. 25 | Your contribution will directly feed into our work as we reimagine, reframe and supplement the existing Bioschemas guidance available. 26 | 27 | ## Document links 28 | 29 | - [Activities for Mortals (including sign-up sheet)](https://docs.google.com/document/d/15inqwNojNYkcookFkrngezsAhdze2mZTcNpY6r46blY/edit?tab=t.0) 30 | 31 | We will conduct most of our work in the Google Docs link above. 32 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | 2 | 3 | # BioHackathon Europe projects 2024 4 | This repository is intended for the BioHackathon Europe participants to share ideas. The event will take place in Campus Belloch, 4-8 November 2024. For more information, please see the [BioHackathon Europe website](https://biohackathon-europe.org/index.html).
5 | 6 | ## Accepted projects 7 | 8 | 40 | -------------------------------------------------------------------------------- /11.md: -------------------------------------------------------------------------------- 1 | # Project 11: Galaxy CoDex - Ensuring Galaxy community sustainability through resource aggregation and annotation 2 | 3 | ## Abstract 4 | 5 | Galaxy hosts a vast array of tools, tutorials, and workflows, with the exact number of workflows remaining uncertain. To address the challenge of enhancing tool visibility within this expansive ecosystem, a pipeline called the Galaxy Tool Metadata Extractor was created during the BioHackathon Europe 2023. This pipeline aggregates Galaxy tool suites from various sources, automatically extracts metadata such as bio.tools identifiers and EDAM ontology, and presents the information in an interactive table. Users can filter this table to find tools relevant to their research community. Throughout development, it was noted that many tools lack EDAM annotations. Efforts by the microbial community during both BioHackathon 2023 and a subsequent community-hosted online hackathon in 2024 have improved EDAM annotations for over 200 tools. 6 | 7 | However, Galaxy communities also offer training materials and workflows, which, like software, may be scattered across different platforms and lack EDAM annotations. 8 | 9 | Building upon the achievements of BioHackathon Europe 2023, this new initiative seeks to expand the capabilities of the existing Galaxy tool list table by introducing the Galaxy Communities Dock (Galaxy Codex). Galaxy Codex will involve enhancing and implementing webpage templates and files that enable domain communities to efficiently gather, organize, integrate, and deploy pertinent tools, workflows, and training materials across various Galaxy servers. Concurrently, best practices for resource annotation will be developed and integrated into different levels of the Galaxy ecosystem. 
10 | 11 | In essence, the growth of Galaxy Communities necessitates the adoption of sustainable practices to ensure their continued advancement. 12 | 13 | ## Scope/Tasks 14 | 15 | 1. Establishing the **infrastructure for Galaxy CoDex** to enhance the discoverability of tools, workflows, and training materials within the Galaxy ecosystem 16 | 2. Ensuring the sustainability of Galaxy CoDex by implementing comprehensive **resource annotations** for communities (microGalaxy, single-cells and yours?) 17 | 3. Establishing ongoing resource **annotation best practices** within the Galaxy ecosystem 18 | 19 | ## Useful skills 20 | 21 | GitHub, Markdown 22 | 23 | 1. Improve the infrastructure of the Galaxy Codex: **Python**, **Web development**, Jekyll, Galaxy tools, Galaxy workflows, Databases 24 | 2. Annotation of community resources: **Interest** in microbiology, single-cells, or proteomics 25 | 3. Annotation best practices: **EDAM**, Galaxy tools, workflows 26 | 27 | ## Coordination 28 | 29 | We use an [organization document](https://bit.ly/gxy-codex-bh-2024) to share all useful links (especially to coordination spreadsheets), share detailed tasks, keep notes, and coordinate. 30 | 31 | ## Schedule 32 | The schedule is quite flexible. There is no requirement to join the whole week.
One daily stand-up will be run to coordinate with online participants and the Australian outpost. 33 | 34 | ### Monday, November 4th 35 | 36 | Time (CET) | Topic 37 | --- | --- 38 | 15:30-16:00 | Presentation of the process to link Galaxy tools to bio.tools and improve bio.tools annotations 39 | 16:00-18:00 | Hacking 40 | 41 | ### Tuesday, November 5th - Thursday, November 7th 42 | 43 | Time (CET) | Topic 44 | --- | --- 45 | 8:30-10:00 | Stand-up & New participant onboarding 46 | 10:00-12:30 | Hacking with open Zoom 47 | 13:45-15:30 | Hacking 48 | 49 | ### Friday, November 8th 50 | 51 | Time (CET) | Topic 52 | --- | --- 53 | 8:30-10:00 | Stand-up & Outcome collection 54 | 55 | ## Leads 56 | 57 | Bérénice Batut, Wendi Bacon 58 | 59 | -------------------------------------------------------------------------------- /30.md: -------------------------------------------------------------------------------- 1 | # Project 30: The BioHackathon Cloud 2 | 3 | ## Abstract 4 | 5 | We propose the further consolidation of [the BioHackCloud](https://biohack.cloud/) (BHC) - a cloud-based infrastructure for the federated analysis of biological/biomedical data based on Global Alliance for Genomics and Health (GA4GH) standards and other relevant open community standards. The [ELIXIR Cloud](https://elixir-cloud.dcc.sib.swiss/) infrastructure serves as the initial BHC, with mid- to long-term plans of integrating with other cloud infrastructures like [Sapporo](https://github.com/sapporo-wes/sapporo), [Galaxy](https://galaxyproject.org/)/[Pulsar](https://pulsar.readthedocs.io/en/latest/) and [Microsoft Azure](https://github.com/microsoft/ga4gh-tes). 6 | 7 | The BHC will be offered to interested BioHackathon participants, with BHC project participants providing training and support for the realization of individual use cases (e.g., construction of API calls, making workflows cloud-ready).
Feasibility will be evaluated on a case-by-case basis, but where use cases cannot be realized with the current infrastructure, their support will be considered for the BHC roadmap or, where possible, will be implemented on site. 8 | 9 | Next to supporting other BioHackathon projects and integrating with other cloud infrastructures, another main goal for the BHC project is the implementation of additional features. In this hackathon, a focus will be placed on enhancing data privacy and security features of the BHC by providing Confidential Computing support as developed by [GENXT](https://www.genxt.network/). However, integration with other community standards (e.g., [RO-Crate](https://www.researchobject.org/ro-crate/)) and implementations (e.g., [WorkflowHub](https://workflowhub.eu/), [BioContainers](https://biocontainers.pro/)) is also possible, depending on participants’ interests. 10 | 11 | ## Lead(s) 12 | 13 | Alexander Kanitz, Pavel Nikonorov 14 | 15 | ## Event logistics 16 | 17 | The project will feature 3 main **topics** (more details will follow): 18 | - Work around the **[GA4GH-SDK](https://github.com/elixir-cloud-aai/ga4gh-sdk) and [confidential computing](https://en.wikipedia.org/wiki/Confidential_computing)** (led by Pavel Nikonorov) 19 | - Work around the **[WfExS workflow execution service](https://github.com/inab/WfExS-backend)** (led by Paula Iborra) 20 | - Work around **[JupyterHub](https://jupyter.org/hub)** (led by Viktória Spišaková) 21 | 22 | Moreover, there will be a half-day **workshop**: 23 | - **[ELIXIR On Cloud](https://elixir-cloud.dcc.sib.swiss/) onboarding** for compute and data centers (led by @Alex Kanitz) 24 | 25 | Once the leads provide more detailed descriptions on each of the topics, **we will create a poll on our Slack channel where participants can select their favorite topics**. Note that this is just to give leads an idea of how many issues to prepare - it is **not binding**.
26 | 27 | We will have a **[centralized project board](https://github.com/orgs/elixir-cloud-aai/projects/23/views/1)** that will feature detailed, bite-sized issues for the various topics, including labels for orientation ("good first issue"). The project board is currently still empty but will be filled up successively until the start of the event (a dedicated repo for creating meta issues or issues for existing code projects not hosted under the ELIXIR Cloud & AAI organization has been created [here](https://github.com/elixir-cloud-aai/biohackeu24-issues)). 28 | 29 | **If you are interested in joining the project, please join the [dedicated Slack channel](https://biohackeu.slack.com/archives/C03HQPMEN81)!** Say hi, be a fly on the wall or suggest additional topics, workshops, issues, connections/integrations with other BioHackathon projects etc. Everyone is welcome :) 30 | -------------------------------------------------------------------------------- /15.md: -------------------------------------------------------------------------------- 1 | # Project 15: Enhancing interoperability of biomedical resources using ontologies. 2 | 3 | ## Abstract 4 | 5 | In Japan, there are two medical expense subsidy systems, “specific chronic pediatric disease system” and “designated intractable disease system”, for some rare diseases called “Nanbyo”. With these medical subsidies, many medical examinations and research studies have been conducted on patients with Nanbyo diseases, and various databases have been constructed to summarize those results. However, there has been no systematic investigation of the correspondence between Nanbyo diseases and diseases in international rare genetic disease databases so far. 6 | 7 | This makes it difficult to share biomedical data on Nanbyo diseases with the rest of the world.
To address this problem, we have developed a Nanbyo Disease Ontology (NANDO) with over 2700 entities which represent individual Nanbyo diseases in the two medical expense subsidy systems [https://rdfportal.org/dataset/nando](https://rdfportal.org/dataset/nando). Furthermore, using this first ontology of Nanbyo diseases in Japan, we constructed NanbyoData [http://nanbyodata.jp](http://nanbyodata.jp), which integrates biomedical data such as variants, genetic tests, bio-resources, and patient numbers on Nanbyo diseases in Japan. 8 | 9 | In this hackathon, we will try to map NANDO entities to Orphanet Rare Disease Ontology (ORDO) entities and Monarch Disease Ontology entities by using a semi-automatic approach. We believe that this will enhance interoperability of biomedical resources related to rare diseases between Europe, US, and Japan. 10 | 11 | ## Lead(s) 12 | * Toyofumi Fujiwara 13 | * David Lagorce 14 | 15 | ## Members 16 | * Terue Takatsuki 17 | 18 | ## Achievements/Outcomes 19 | By integrating the rare disease ontologies provided by the following organizations into a cross-referenced database, we aim to enhance the interoperability of the biomedical resources maintained by each organization. 
20 | 21 | - **ORDO** - Orphanet Rare Disease ontology 22 | Provided by: [Orphanet](https://www.orpha.net/) 23 | - Total Number of Entities: 9,622 - Version 4.5 24 | 25 | - **Mondo** - Mondo Disease Ontology 26 | Provided by: [Monarch Initiative](https://monarchinitiative.org/) 27 | - Total Number of Entities: 29,392 - releases/2024-09-03 28 | 29 | - **NANDO** - Nanbyo Disease Ontology 30 | Provided by: [DBCLS](https://dbcls.rois.ac.jp/) 31 | - Total Number of Entities: 2,784 - releases/2023-11-27 32 | 33 | The mapping results for each ontology have been created in [SSSOM](https://mapping-commons.github.io/sssom/) format: 34 | 35 | - **Mapping between ORDO and Mondo**: [SSSOM_ORDO_MONDO_20241107](https://drive.google.com/file/d/1hxPk0xoHqw1Ti7kstuYfSy_tSV9Y6FkM/view?usp=drive_link) 36 | - Machine-generated mappings were manually curated. 37 | - Number of Mappings: 8,053 38 | - Number of Mapped ORDO Entities: 8,053 39 | - Number of Mapped Mondo Entities: 8,039 40 | 41 | - **Mapping between NANDO and Mondo**: [SSSOM_NANDO_MONDO_20241107](https://drive.google.com/file/d/1M52MUa-YSabjFBZQRTnkgwNvF_9uAIO3/view?usp=drive_link) 42 | - Machine-generated mappings were manually curated. 43 | - Number of Mappings: 2,187 44 | - Number of Mapped NANDO Entities: 2,036 45 | - Number of Mapped Mondo Entities: 1,549 46 | 47 | - **Mapping between ORDO and NANDO**: [SSSOM_ORDO_NANDO_20241107](https://drive.google.com/file/d/1hban8Q6fp9d2hWEghzlZz5EFqEOYVqSs/view?usp=drive_link) 48 | - Mapped through Mondo by leveraging the mapping results of ORDO to Mondo and NANDO to Mondo. 49 | - Number of Mappings: 1,658 50 | - Number of Mapped ORDO Entities: 1,130 51 | - Number of Mapped NANDO Entities: 1,616 52 | 53 | ## Future plans 54 | In this BioHackathon, we mapped entities with similar notations and manually curated them. Going forward, we will attempt to map entities that were not covered in this BioHackathon. 
55 | - Number of NANDO entities not mapped to ORDO in this BioHackathon: 1,168 56 | - Number of NANDO entities not mapped to Mondo in this BioHackathon: 748 57 | 58 | ## Acknowledgement 59 | - Orphanet 60 | - Monarch Initiative 61 | - DBCLS 62 | 63 | -------------------------------------------------------------------------------- /18.md: -------------------------------------------------------------------------------- 1 | # Project 18: Expanding FAIR database integration through elucidation and transformation of underlying graph schemas. 2 | 3 | ## Table of Contents 4 | 5 | * [Abstract](#abstract) 6 | * [Project overview](#project-flash-presentation) 7 | * [Communication and resources](#resources) 8 | * [Working Ethics](#working-ethics) 9 | * [Project Leads](#leads) 10 | * [Team Member](#members) 11 | 12 | ## Abstract 13 | 14 | The integration of life science data from different biomedical resources has been a major challenge attributed to fragmented data sources, the use of multiple data formats, and the existence of multiple ontologies for a single context among others. To address this problem, we launched the BioDataFuse (BDF) project, which employs a modular framework for integrating data from different sources into context-specific knowledge graphs. Through this project, we have currently been able to integrate and harmonise data from ten databases. However, the integration of such resources requires a detailed understanding of underlying graph schemas. 15 | 16 | In this biohackathon, we would like to streamline the data integration process such that any FAIR-compliant biological database can be easily converted to a graph.
This robust process would involve two steps: first, understanding of the underlying graph schemas of data resources using the RDF-config (https://github.com/dbcls/rdf-config/) and VoID generator (https://github.com/JervenBolleman/void-generator) and second, the conversion of graph data into multiple compatible formats for improving accessibility and usability using G2G Mapper (https://g2gml.readthedocs.io/), LinkML (https://linkml.io/) and BDF (https://github.com/BioDataFuse/pyBiodatafuse). Moreover, we would test the resilience of the process by demonstrating the ease-of-integration of multiple data sources within the RDF Portal (https://rdfportal.org) and beyond. Through this test, we would essentially attract database owners to include additional biomedical data sources in BDF, thus expanding the applicability of their resource beyond the “yet-another-resource” paradigm. 17 | 18 | ## Project flash presentation 19 | 20 | ![overview slide](https://github.com/user-attachments/assets/5ae90c38-effb-4927-b0fc-c619a32d185e) 21 | 22 | ## Resources 23 | 24 | 1. [Project GitHub Repo](https://github.com/BioDataFuse/elixir_biohackathon_2024). 25 | 2. [BioDataFuse Web Interface](https://biodatafuse.org/). 26 | 3. [BioDataFuse Python package](https://github.com/BioDataFuse/pyBiodatafuse). 27 | 4. [BioDataFuse Web Interface codes](https://github.com/BioDataFuse/biodatafuseUI). 28 | 5. [Biohackarvix](https://github.com/BioDataFuse/biohackarvix-2024). 29 | 6. [Slack](https://biohackeu.slack.com/archives/C07MYNT0CHH) - This will be the main source of communication between in-person and virtual participants throughout the hackathon. 30 | 31 | ## Working ethics 32 | 33 | * :balance_scale: The use of GitHub issues and pull requests will be done to ensure the efficient working of multiple people on the GitHub repository. 34 | * :no_entry_sign: No commits to be made directly to the `main` branch of the GitHub repository. 
35 | * :gear: Adding new Python functions should inherently involve writing subsequent unit test functions and documentation for the same. 36 | * :handshake: The main aim of the hackathon is collaboration, so please feel free to ask questions or provide feedback whenever in doubt. `We believe that there are no dumb questions that exist.` 37 | * :calendar: To ensure good communication among the team members, we would have two daily stand-ups (pre and post-hacking) allowing all participants to provide a less than 1-minute update on work done and work in the pipeline. 38 | 39 | ## Leads 40 | 41 | | Name | Affiliation | GitHub | LinkedIn | 42 | | --- | --- | --- | --- | 43 | | [**Tooba Abbassi-Daloii**](https://orcid.org/0000-0002-4904-3269) | Maastricht University, NL | [@tabbassidaloii](https://github.com/tabbassidaloii) | [Link](https://www.linkedin.com/in/tooba-abbassi-daloii/) | 44 | | [**Yojana Gadiya**](https://orcid.org/0000-0002-7683-0452) | Fraunhofer ITMP ScreeningPort, DE | [@YojanaGadiya](https://github.com/YojanaGadiya) | [Link](https://www.linkedin.com/in/yojana-gadiya-477739113/) 45 | 46 | ## Members 47 | * Toshiaki Katayama 48 | * Javier Millan Acosta 49 | * [Egon Willighagen](https://orcid.org/0000-0001-7542-0286), [@egonw](https://github.com/egonw), [LinkedIn](https://www.linkedin.com/in/egon-willighagen/) 50 | * Dominik Martinat, [@dominikmartinat](https://github.com/dominikmartinat) 51 | * Shuichi Kawashima 52 | 53 | -------------------------------------------------------------------------------- /27.md: -------------------------------------------------------------------------------- 1 | # Project 27: Integrating Bioconductor packages with the ELIXIR Research Software Ecosystem using EDAM 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
[Logos: Bioconductor, ELIXIR bio.tools, EDAM]
10 | 11 | ## Main goals 12 | 13 | [Bioconductor](https://zenodo.org/records/8400205) is a global open-source software project that provides tools for the analysis and comprehension of high-throughput genomic data within the R statistical programming environment. In this project, we aim to enhance the [ELIXIR Research Software Ecosystem](https://f1000research.com/posters/12-1026) (RSEc) by increasing the findability, accessibility, interoperability, and reusability of over 2,000 [Bioconductor](https://bioconductor.org/) packages. Aligning them with the FAIR principles, as well as improving their description in the RSEc, particularly in the [bio.tools registry](https://bio.tools/), are key objectives. 14 | 15 | Additionally, this project aims to advance [EDAM](https://edamontology.org/page) as a standard, by applying it to a large genomic data science software/data ecosystem. By extending the EDAM ontology and the processes available through the RSEc to cover these Bioconductor packages, and designing automated mechanisms for synchronising descriptions between Bioconductor and the ELIXIR RSEc, we will significantly improve the search and discovery process for users, and strengthen the bioinformatics research infrastructure. 16 | 17 | This project will kick-start a long-term mutually beneficial collaboration between the ELIXIR Tools Platform and the Bioconductor community. 
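As a rough illustration of the automated suggestion idea described above, the sketch below looks up a package's biocViews terms in a small mapping table and reports which terms still lack an EDAM counterpart. The table `BIOCVIEWS_TO_EDAM` and the function `suggest_edam` are hypothetical names introduced here, and the three example pairs are assumptions that would need to be checked against the EDAM ontology; this is not the project's actual tooling.

```python
# Illustrative only: the biocViews-to-EDAM pairs below are assumed examples
# and should be verified against the EDAM ontology before any real use.
BIOCVIEWS_TO_EDAM = {
    "RNASeq": ("topic_3170", "RNA-Seq"),
    "Proteomics": ("topic_0121", "Proteomics"),
    "Metabolomics": ("topic_3172", "Metabolomics"),
}


def suggest_edam(biocviews):
    """Split a package's biocViews terms into EDAM suggestions and unmapped terms."""
    suggestions, unmapped = [], []
    for term in biocviews:
        if term in BIOCVIEWS_TO_EDAM:
            edam_id, label = BIOCVIEWS_TO_EDAM[term]
            suggestions.append({"biocViews": term, "edam_id": edam_id, "label": label})
        else:
            # Terms without a mapping are kept so curators can review them.
            unmapped.append(term)
    return suggestions, unmapped


if __name__ == "__main__":
    found, missing = suggest_edam(["RNASeq", "DifferentialExpression"])
    print(found)
    print(missing)
```

Returning the unmapped terms alongside the suggestions keeps gaps in the mapping visible, which is where manual curation would be needed.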
18 | 19 | :dart: **Short-term BioHackathon goals:** 20 | 21 | * Mapping of EDAM and biocViews terms 22 | * "Gold standard" manual annotation of a subset of Bioconductor packages in bio.tools 23 | * Assessing development or adaptation of a tool for automated EDAM suggestions from biocViews or package content 24 | 25 | :dart: **Long-term goals:** 26 | 27 | * Extend EDAM to all Bioconductor software packages, and also the thousands of Bioconductor annotation and experiment resources 28 | * Phase out biocViews in favour of systematic EDAM annotation 29 | * Synchronise Bioconductor packages with bio.tools (via automated integration with ELIXIR RSEc) 30 | 31 | ## Interested in contributing? 32 | 33 | :loudspeaker: Reach out! 34 | 35 | * [Bioconductor slack community](https://community-bioc.slack.com/): **#edam-collaboration** 36 | * [ELIXIR Europe slack community](https://elixir-europe.slack.com/): **#edam_ontology** 37 | 38 | Our project is committed to inclusivity, guided by the [Bioconductor Code of Conduct](https://www.bioconductor.org/about/code-of-conduct), as well as the [ELIXIR code of conduct for events](https://elixir-europe.org/events/code-of-conduct) and the [ELIXIR RSEc code of conduct](https://github.com/research-software-ecosystem/content/blob/master/CODE_OF_CONDUCT.md). We value inputs from different perspectives - from ontology experts to developers to end users - across a diversity of professional, personal, cultural, or linguistic backgrounds. 39 | 40 | Remote participation is welcome. Ideally, it should be planned with the interested parties ahead of the event, so that the necessary setup is ready and everybody can take full advantage of the event. 41 | 42 | ## Abstract 43 | 44 | This project aims to enhance the ELIXIR Research Software Ecosystem (RSEc) by improving the accessibility, interoperability, and reusability of over 2,000 Bioconductor packages.
This involves aligning their description with FAIR principles and setting up their synchronisation with the bio.tools registry. Additionally, this project aims to enhance EDAM's utility by applying the EDAM standard to the large Bioconductor ecosystem. The project utilises structured integration processes and community-centric development to achieve these goals. 45 | 46 | It aligns with the ELIXIR 2024-26 program objectives through the standardisation of Bioconductor software metadata, their inclusion in the RSEc infrastructure, and the community-based improvement of EDAM (Tools platform WP2 and WP3). Feasibility is ensured through planned deliverables, including mapping EDAM and biocViews, manual annotation, and exploring automated mapping tools. Long-term plans involve systematic annotation and synchronisation of Bioconductor packages with bio.tools. 47 | 48 | ## Lead(s) 49 | 50 | Claire Rioualen 🇫🇷, Maria Doyle 🇮🇪 51 | 52 | -------------------------------------------------------------------------------- /31.md: -------------------------------------------------------------------------------- 1 | # Project 31: Executable metadata mappings to FAIRify Biodiversity Genome Annotations 2 | 3 | :loudspeaker: [Project Repository](https://github.com/fairtracks/biohackathon-2024-project-31) 4 | 5 | ## Main goals 6 | 7 | The [FAIRification of Genomic Annotations Working Group (FGA-WG)](https://www.rd-alliance.org/groups/fairification-genomic-annotations-wg/) in the Research Data Alliance will focus on the challenges of harmonising metadata and software solutions to improve the discovery and reuse of publicly available genomic annotation data. 8 | 9 | Our Biohackathon project aims to: 10 | 11 | - Define minimal metadata to support genome annotations as FAIR objects, and 12 | - Develop interoperable executable mappings from bioinformatics case-studies to the [FAIRtracks model](https://github.com/fairtracks/fairtracks_standard#overview-of-structure-of-the-fairtracks-standard). 
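To make the idea of an "executable mapping" in the second goal more concrete, here is a minimal sketch in plain Python. It does not use the Omnipy or Whyqd APIs, and the field names in `CROSSWALK`, the helper `apply_crosswalk`, and the sample record are all hypothetical; they are not taken from the FAIRtracks standard. The sketch renames mapped fields and reports the source fields that had no target, so that any loss is recorded rather than silent.

```python
# Hypothetical crosswalk: source field -> target field. The field names are
# invented for illustration and are not taken from the FAIRtracks standard.
CROSSWALK = {
    "species_name": "organism",
    "assembly": "genome_assembly",
    "file": "file_url",
}


def apply_crosswalk(record, crosswalk=CROSSWALK):
    """Rename mapped fields and list the source fields that were dropped."""
    mapped = {crosswalk[key]: value for key, value in record.items() if key in crosswalk}
    dropped = sorted(key for key in record if key not in crosswalk)
    return mapped, dropped


if __name__ == "__main__":
    # Made-up source record, used only to exercise the mapping.
    source = {
        "species_name": "Example species",
        "assembly": "example-assembly-v1",
        "lab_notes": "free text with no target field",
    }
    mapped, dropped = apply_crosswalk(source)
    print(mapped)
    print(dropped)
```

Because the mapping is ordinary data plus a small function, it can be versioned, shared, and re-run on new records, which is the property the project description is after.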
13 | 14 | Our plan during the BioHackathon is to address the following questions and tasks: 15 | 16 | - What research data / metadata do we have that we can use as a case study? 17 | - What do we want in terms of interoperability, and will the FAIRtracks schema provide sufficient coverage for the source metadata in our case study? 18 | - What definitions are missing, or what level of lossiness is "acceptable"? How do we document this loss? 19 | - What tools and processes are needed to algorithmically produce a transformation? 20 | - Review [Omnipy](https://omnipy.readthedocs.io/) / [Whyqd (/wɪkɪd/)](https://whyqd.readthedocs.io/) (two Python-based libraries for data wrangling) as ways to algorithmically produce a transformation. 21 | 22 | ## Interested in contributing? 23 | 24 | We have a diverse group of people participating, both on-site and remotely - including collaborators calling in from Australia - and we would appreciate contributions from people with any of the following skills or resources: 25 | 26 | - Schema.org / Bioschemas familiarity (or metadata for research annotations) 27 | - Metadata modelling for interoperability 28 | - Bioinformatics research data / metadata to contribute as case studies for transformation 29 | - Python & JSON / JSON-LD 30 | 31 | :loudspeaker: Get hold of us: 32 | 33 | - [Biohackathon slack community](https://biohackeu.slack.com/archives/C07MS890N6S): **Sveinung Gundersen** 34 | - Co-lead emails: 35 | - [**Gavin Chait**](mailto:gchait@whythawk.com) 36 | - [**Sveinung Gundersen**](mailto:sveinugu@uio.no) 37 | 38 | Our project is committed to inclusivity, guided by the [ELIXIR code of conduct for events](https://elixir-europe.org/events/code-of-conduct) and the [ELIXIR RSEc code of conduct](https://github.com/research-software-ecosystem/content/blob/master/CODE_OF_CONDUCT.md). We value inputs from a multitude of perspectives, levels of experience and skill, and across a diversity of professional, personal, cultural, or linguistic backgrounds.
39 | 40 | Remote participation is welcome, and we are supporting cross-over time-zones for our Australian contributors. Expect us online from about 7am CET from Tuesday, 6 November. 41 | 42 | ## Resources 43 | 44 | - [Current list of remote and on-site contributors](https://docs.google.com/spreadsheets/d/10wO-5kNdaTUpsZ3C0z5bsaYnf5EbDAWPw86wTmeHPkI/edit?gid=946925182#gid=946925182) 45 | - [Resource reading list](https://docs.google.com/spreadsheets/d/10wO-5kNdaTUpsZ3C0z5bsaYnf5EbDAWPw86wTmeHPkI/edit?gid=750772179#gid=750772179) 46 | - [Potential genome annotation metadata case-studies](https://docs.google.com/spreadsheets/d/10wO-5kNdaTUpsZ3C0z5bsaYnf5EbDAWPw86wTmeHPkI/edit?gid=0#gid=0) 47 | - [Rolling collaboration notes](https://docs.google.com/document/d/1xT-45UgIp-ujkudaN589RgJdvib3vVj_N4L9SQAgltw/edit?usp=sharing) 48 | 49 | ## Abstract 50 | 51 | Advances in sequencing technologies and assembly algorithms have enabled an explosion in diverse reference genomes across the tree of life, together with a need to annotate functional and structural features. There is no current set of minimal metadata to support genome annotations as FAIR objects, limiting their reproducibility and reliability. 52 | 53 | The FAIRification of Genomic Annotations Working Group (FGA-WG) in the Research Data Alliance (RDA) will develop a harmonised metadata model and recommended infrastructure to improve discovery and reuse of publicly available genomic annotations/tracks, supporting harmonised metadata for GFF3 files. Such metadata exists in e.g. project-specific databases or spreadsheets, workflow systems, repositories, exchange formats, and linked data. 54 | 55 | Harmonising metadata according to a unified data model requires the extraction, transformation and integration of data sourced in different research contexts, including "messy" data, using schema mappings or "crosswalks". These operations are time-consuming and may introduce opaque errors. 
FAIR principles emphasise reproducibility and trust in data analyses with persisted and shared accessible, auditable and executable data transformation and validation methods. 56 | 57 | [Omnipy](https://omnipy.readthedocs.io/) and [Whyqd (/wɪkɪd/)](https://whyqd.readthedocs.io/) are independently-developed Python libraries offering general functionality for auditable and executable metadata mappings. Each is pragmatically designed to ensure transformations are executable on real-world data, with validation and feedback. They differ in scope and users, and provide complementary functionalities. 58 | 59 | In this project, we will integrate Omnipy and Whyqd to develop executable mappings that transform existing metadata from biodiversity projects, such as ERGA, to conform to the FGA-WG metadata model, kickstarting the process of FAIRifying genome annotation GFF3 files. 60 | 61 | ## Lead(s) 62 | 63 | Sveinung Gundersen ɴᴏ, Gavin Chait ᴢᴀ 64 | 65 | --------------------------------------------------------------------------------