├── 7.md ├── 16.md ├── 17.md ├── 9.md ├── 20.md ├── 5.md ├── 32.md ├── 3.md ├── 13.md ├── 14.md ├── 21.md ├── 26.md ├── 24.md ├── 2.md ├── 25.md ├── 8.md ├── 19.md ├── 6.md ├── 23.md ├── 29.md ├── 12.md ├── 4.md ├── 22.md ├── 10.md ├── README.md ├── 11.md ├── 30.md ├── 15.md ├── 18.md ├── 27.md └── 31.md /7.md: -------------------------------------------------------------------------------- 1 | # Project 7: Enhancing the Training metrics database (TMD) for improved reporting 2 | 3 | ## Abstract 4 | 5 | This project aims to extend the functionality of the current Training Metrics Database, to allow for uploading of node specific questions, and establishing a seamless connection to TeSS. 6 | 7 | The ELIXIR Training Metrics Database serves as a valuable resource in the ELIXIR Training ecosystem. It collects and aggregates all the training metrics from the different nodes, and is an invaluable resource for generating statistics and reports. Extending the functionality to allow for node specific questions will further increase the usability for the nodes, catering to their unique reporting needs. Moreover, integrating it with TeSS would facilitate automatic data exchange between the two systems, further connecting the ELIXIR ecosystem of resources. 8 | 9 | We will achieve this by assembling a team of web developers and training coordinators, creating interactions and collaborations with the TeSS development team. 10 | 11 | ## Lead(s) 12 | 13 | Nina Norgren, Eleni Adamidi 14 | 15 | -------------------------------------------------------------------------------- /16.md: -------------------------------------------------------------------------------- 1 | # Project 16: Enhancing bio.tools by Semantic Literature Mining 2 | 3 | ## Abstract 4 | 5 | This project aims to improve and extend bio.tools metadata through fine-tuned named-entity recognition (NER) from Europe PMC and other established literature mining software. This will help researchers find uses of particular software and measure the impact of research software beyond paper citations, thus providing a better indicator of their impact. Text mining mentions of software is a non-trivial problem, as the software often is homonymous with other entities, such as chemicals , genes or organisms. However, NER of software is facilitated by frequent context words such as “version”, “software” or “program”. 6 | 7 | This will be further exploited by integration of the often very detailed bio.tools annotations to enhance software recognition. We expect to identify ensembles of publications for thousands of software tools annotated in bio.tools, adding valuable information about tool usage to Europe PMC and providing relevant background data for more accurate and deeper tool categorization and annotations, as well as improved benchmarking of the tools themselves. 8 | 9 | ## Lead(s) 10 | 11 | Veit Schwämmle, Magnus Palmblad 12 | 13 | -------------------------------------------------------------------------------- /17.md: -------------------------------------------------------------------------------- 1 | # Project 17: Development of FAIR image analysis workflows and training in Galaxy 2 | 3 | ## Abstract 4 | 5 | Image analysis tools within Galaxy are currently available but remain underutilised. Last year, our participation in the BioHackathon aimed to enhance the image analysis community in Galaxy. Our focus was on analysing the landscape of tools, gathering and annotating them. 
It involved community discussions to establish naming conventions, fostering greater standardization of these tools (see outcomes at https://github.com/beatrizserrano/bh2023-preprint/blob/main/BH2023_preprint_project16.pdf). Since then, the integration of tools into Galaxy has continued to expand. This year, our efforts will focus on exploiting these tools and showcasing Galaxy’s capabilities in meeting the needs of the imaging community. 6 | 7 | In this project, we will develop FAIR image analysis workflows and create comprehensive tutorials within the Galaxy Training Network. We will use sample datasets from public repositories to illustrate diverse image analysis tasks and build the corresponding Galaxy workflows. The resulting workflows will be made available on WorkflowHub.eu. Tutorials will serve as documentation to facilitate the utilisation of these workflows. 8 | 9 | ## Lead(s) 10 | 11 | Beatriz Serrano-Solano, Anne Fouilloux 12 | 13 | -------------------------------------------------------------------------------- /9.md: -------------------------------------------------------------------------------- 1 | # Project 9: BioHackrXiv: improving biohackathon publications 2 | 3 | ## Abstract 4 | 5 | BioHackrXiv.org is the Markdown-based pre-publication platform for biohackathon project reporting, used by the ELIXIR and Japanese biohackathons and other venues. 6 | The goal of BioHackrXiv is not only to expose work carried out during and after biohackathons, but also to raise the scientific profile of participants by providing citable publications. 7 | During the BioHackathon we want to improve the user experience of submitting publications to biohackrxiv.org. This work started at a Japanese biohackathon and has been mapped out in previous ELIXIR biohackathons, and it is great that we can get together and focus our attention on this project. 8 | As a new idea, we would like to allow authors to add metadata about their websites, source code, data, blogs, videos, etc. as a FAIR resource. 9 | 10 | Our work has resulted in 73 publications so far, with 200+ authors. The number of submissions is increasing every year. The quality of the publications is very high, in our experience. 11 | 12 | A full list of supported hackathons can be found [here](http://preview.biohackrxiv.org/) and a list of publications [here](https://biohackrxiv.org/discover). 13 | Previous publications are [here](https://biohackrxiv.org/discover?q=biohackrxiv). 14 | 15 | ## Lead(s) 16 | 17 | Pjotr Prins, Arun Isaac 18 | 19 | -------------------------------------------------------------------------------- /20.md: -------------------------------------------------------------------------------- 1 | # Project 20: Structuring Clinical Reports into OMOP Common Data Model (CDM) 2 | 3 | ## Abstract 4 | 5 | A clinical case report is a detailed report of the diagnosis, treatment, signs, symptoms and follow-up of a single patient. Case report forms (CRFs) are used to standardize the collection of these patient data in clinical research studies and trials. CRFs provide a semi-structured approach for collecting data, where a combination of structured categories of patient data along with free-text content is defined. Since such CRFs are predominantly attached to publications as supplementary files, they come in various data formats (e.g., PDF, XLS, CSV, GIF, etc.).
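To illustrate the kind of structuring this project targets, the sketch below maps a single semi-structured CRF entry onto simplified OMOP CDM records. It is a minimal sketch only: the CRF field names, the column subsets and the concept lookup are illustrative assumptions, not part of the project or of the official OMOP specification.

```python
from datetime import date

# Toy CRF content as it might be extracted from a supplementary file
# (all field names and values are illustrative).
crf_entry = {
    "patient_id": "case-001",
    "sex": "female",
    "year_of_birth": 1972,
    "diagnosis": "type 2 diabetes mellitus",
    "diagnosis_date": "2023-05-10",
}

# Minimal lookup from free text to standard concept IDs; a real pipeline
# would use the OHDSI vocabularies (e.g. SNOMED via Athena).
CONCEPTS = {"type 2 diabetes mellitus": 201826, "female": 8532}

def crf_to_omop(crf: dict) -> dict:
    """Map one CRF entry onto simplified OMOP CDM 'person' and
    'condition_occurrence' records (column subset is illustrative)."""
    person = {
        "person_id": crf["patient_id"],
        "gender_concept_id": CONCEPTS.get(crf["sex"], 0),
        "year_of_birth": crf["year_of_birth"],
    }
    condition = {
        "person_id": crf["patient_id"],
        "condition_concept_id": CONCEPTS.get(crf["diagnosis"], 0),
        "condition_start_date": date.fromisoformat(crf["diagnosis_date"]),
    }
    return {"person": person, "condition_occurrence": condition}

print(crf_to_omop(crf_entry))
```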
6 | 7 | With the increase in Open Access publications, the number of supplementary data files keeps growing, while the ability of researchers to find and reuse this information remains severely limited. Beyond keyword queries of PMC/MEDLINE article indexes, researchers are unable to find CRFs using standard clinical terms to search for common clinical concepts. 8 | 9 | The inability to adequately search supplementary files - where CRFs can accompany a clinical research trial or study publication - requires researchers to manually locate and evaluate the contents of individual CRFs. This project aims at enhancing the FAIRness (Findable, Accessible, Interoperable, Reusable) of CRFs by transforming them into a structured Common Data Model (CDM) such as OMOP (Observational Medical Outcomes Partnership). 10 | 11 | ## Lead(s) 12 | 13 | Venkata Satagopam, Tim Beck 14 | 15 | -------------------------------------------------------------------------------- /5.md: -------------------------------------------------------------------------------- 1 | # Project 5: Mapping of research software quality indicators across the ELIXIR Research Software Ecosystem 2 | 3 | ## Abstract 4 | 5 | The primary goal of this project is to perform a cross-walk of indicators around research software quality, creating a comprehensive catalogue that can be used in the context of the ELIXIR Research Software Ecosystem. A common understanding of quality indicators is a well-understood and acknowledged challenge in research software, with different levels of maturity across domains. This project will primarily focus on the particular aspects of the ELIXIR community, while aiming for an outcome applicable to the wider Life Science community. 6 | 7 | This catalogue will be extremely useful in raising awareness of the existing services, knowing their requirements and expectations, and identifying the optimal service for a particular use case. Moreover, it will allow us to identify potential gaps in a particular community, as well as indicators that could be adopted across communities. 8 | 9 | The project directly ties into various activities and efforts, both within ELIXIR (Tools Platform Software Best Practices, OpenEBench, Software Observatory, SMPs, STEERS WP2, etc.) as well as beyond (EOSC, EVERSE, NFDI4DataScience, etc.). 10 | 11 | We plan to engage participants in all activities. Newcomers can share their experience with research software development, software management, and research software quality. 12 | 13 | ## Lead(s) 14 | 15 | Fotis Psomopoulos, Eva Martin del Pico 16 | 17 | -------------------------------------------------------------------------------- /32.md: -------------------------------------------------------------------------------- 1 | # Project 32: VCF Explorer: Empowering Genomic Data Interaction 2 | 3 | ## Abstract 4 | 5 | Genomic data plays a pivotal role in understanding genetic variations, disease associations, and personalised medicine. However, managing and querying Variant Call Format (VCF) files efficiently remains a challenge due to their large size and complex structure. In this project, we propose the development of a VCF file explorer: a web-based application that facilitates seamless interaction with VCF files. Our approach leverages the array-based TileDB-VCF data model to create a scalable and efficient database for storing VCF files, overcoming the limitations of traditional data storage systems such as relational databases.
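As a rough illustration of how such a TileDB-backed store could be queried from the application's backend, the sketch below uses the tiledbvcf Python package. The dataset URI, sample names, attribute list and region are placeholders, and the exact call signatures should be checked against the TileDB-VCF documentation.

```python
# Minimal sketch of the query layer the explorer could wrap, assuming a
# TileDB-VCF dataset has already been ingested at the given URI.
import tiledbvcf

ds = tiledbvcf.Dataset("s3://example-bucket/cohort-vcf-dataset", mode="r")

# Read one genomic region for selected samples into a pandas DataFrame.
df = ds.read(
    attrs=["sample_name", "contig", "pos_start", "alleles", "fmt_GT"],
    regions=["chr1:100000-200000"],
    samples=["SAMPLE_A", "SAMPLE_B"],
)

# Simple aggregate analysis: variant count per sample in the region.
print(df.groupby("sample_name").size())
```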
6 | 7 | The tool will allow users to search for specific variants based on custom filters and perform aggregate analyses while providing interactive visualisations. The VCF Explorer will empower researchers, clinicians, and bioinformaticians to efficiently explore and analyse genomic variants. By combining the robustness of TileDB with a user-friendly web interface, we aim to accelerate genomics research, variant interpretation, and clinical decision-making. A base proof of concept of the tool has been developed for Lineberger Comprehensive Center Bioinformatics core by the authors. 8 | 9 | Reference: 10 | 11 | [https://docs.tiledb.com/main/integrations-and-extensions/genomics/population-genomics](https://docs.tiledb.com/main/integrations-and-extensions/genomics/population-genomics) 12 | 13 | ## Lead(s) 14 | 15 | Sarang Bhutada, Vibhor Gupta 16 | 17 | -------------------------------------------------------------------------------- /3.md: -------------------------------------------------------------------------------- 1 | # Project 3: Reusable RDM Planning Environments for Trainings and Workshops 2 | 3 | ## Abstract 4 | 5 | Training sessions on Research Data Management (RDM) and data management planning are gaining traction within ELIXIR and beyond. One of the primary platforms for Data Management Plans (DMPs), known as Data Stewardship Wizard (DSW), plays a crucial role in educating researchers on DMP practices. Often utilized across various ELIXIR Nodes and Communities, DSW not only facilitates DMP creation but also serves as an educational tool. 6 | 7 | Organizations conducting such training sessions typically need to establish a separate DSW instance dedicated to training purposes. However, the repetitive nature of setting up and cleaning these instances, along with preparing content and user accounts, can be cumbersome. While some organizations have developed custom scripts leveraging the REST API to streamline these tasks, they require ongoing maintenance to align with DSW's monthly release cycle. 8 | 9 | In this project, our aim is to develop a service that provides pre-configured sets of content for bootstrapping, cleaning, and verifying DSW instances for training and workshops. These sets will be packaged in a shareable and reusable format, following FAIR principles, allowing organizations to manage their own sets while easily sharing or customizing existing ones. By integrating this service into the open-source codebase, we ensure compatibility and technical readiness aligned with the DSW platform itself. As a result, setting up a testing environment will take mere minutes, if not less. 10 | 11 | ## Lead(s) 12 | 13 | Kryštof Komanec, Jana Martínková 14 | 15 | -------------------------------------------------------------------------------- /13.md: -------------------------------------------------------------------------------- 1 | # Project 13: Interconnecting identifiers.org into a broader metadata connectivity 2 | 3 | ## Abstract 4 | 5 | Identifiers.org, an Elixir Recommended Interoperability Service, is a meta-resolver based on a registry which acts as a source of truth, providing a resolution service for compact identifiers, as well as a harmonisation service, based on records which store final resolving locations associated with assigned prefixes. 
The registry contains metadata on its namespace and resource entries which include valuable information on the data collections, such as ID regex patterns, online resources where identified data objects can be resolved, and associated institutions for these resources. 6 | 7 | We propose exposing this information in RDF format, greatly expanding the interoperability of this service, allowing direct consumption in a variety of ways, for example into Knowledge Graphs. The resulting RDF could be supplemented with several schemas such as DCAT, VoID, and schemas available in BioSchemas. Supplementing the same dataset using different schemas would negate the often-needed practice of mapping, which can be time-consuming. Writing Rest APIs and enabling a SPARQL endpoint for this information would be a more technical challenge. 8 | 9 | Furthermore, expanding the resolver to find related metadata for compact identifiers will enable support for additional use cases, matching the EOSC PID Meta Resolver. For this, we intend to connect our metadata resolver with the BridgeDB ELIXIR resource and the TogoID system from DBCLS. Through additions, identifiers.org will become more useful for its users as an interoperability service that is easily consumable. 10 | 11 | ## Lead(s) 12 | 13 | Renato Juaçaba Neto, Nick Juty 14 | 15 | -------------------------------------------------------------------------------- /14.md: -------------------------------------------------------------------------------- 1 | # Project 14: FAIRly easy APIs for research data in (Bio)Schema.org and RDF 2 | 3 | ## Abstract 4 | 5 | One third of the Elixir Core Data Resources (CDRs) provide their data as RDF, coupled with a SPARQL endpoint to query this data. While SPARQL is a powerful query language, only a minority of all data scientists and bioinformaticians are familiar with it. Therefore, to enable a wider reuse of RDF data, complementary data access interfaces are highly required. 6 | 7 | Python and R are two of the most popular programming languages among data scientists and bioinformaticians respectively. Therefore, to enable this target audience easy access to public RDF data, we have been experimenting with generating R and Python APIs for each of these datasets in a fully automatic manner. We do so by leveraging automatically generated descriptions of each dataset, i.e. information regarding the available classes and properties, as well as their cardinalities. 8 | 9 | This auto generation is important because: 10 | 11 | * 1) it significantly speeds up the API creation: a dataset maintainer will only need to verify the auto-generated code, without the need to actually write it by hand; 12 | 13 | * 2) it significantly enhances dataset Findability and Reuse - even Elixir CDRs have uneven representation of API packages across different programming languages, making their reusability depend on a technical hurdle - how familiar a given user is with the range of available programming languages. 14 | 15 | An under-resourced or less well-known dataset will likely have at most one API package. Being able to quickly generate complete APIs given only an RDF file or SPARQL endpoint will help better connect data providers with their users. 
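A minimal sketch of the introspection step is shown below, assuming a generic SPARQL endpoint (the URL is a placeholder) and using the SPARQLWrapper package: it lists the classes present in a dataset and wraps each one in a tiny auto-generated accessor function. A real generator would additionally exploit property and cardinality information, as described above.

```python
# Sketch: introspect the classes exposed by a SPARQL endpoint and emit a
# tiny Python accessor for each one. The endpoint URL is a placeholder.
from SPARQLWrapper import SPARQLWrapper, JSON

ENDPOINT = "https://sparql.example.org/sparql"  # placeholder

def list_classes(endpoint: str, limit: int = 20) -> list[str]:
    sparql = SPARQLWrapper(endpoint)
    sparql.setReturnFormat(JSON)
    sparql.setQuery(f"""
        SELECT DISTINCT ?cls (COUNT(?s) AS ?n)
        WHERE {{ ?s a ?cls }}
        GROUP BY ?cls ORDER BY DESC(?n) LIMIT {limit}
    """)
    rows = sparql.query().convert()["results"]["bindings"]
    return [r["cls"]["value"] for r in rows]

def make_accessor(cls_iri: str):
    """Return a function that fetches instances of one class."""
    def fetch(limit: int = 10) -> list[str]:
        sparql = SPARQLWrapper(ENDPOINT)
        sparql.setReturnFormat(JSON)
        sparql.setQuery(f"SELECT ?s WHERE {{ ?s a <{cls_iri}> }} LIMIT {limit}")
        rows = sparql.query().convert()["results"]["bindings"]
        return [r["s"]["value"] for r in rows]
    return fetch

if __name__ == "__main__":
    for cls in list_classes(ENDPOINT):
        print(cls)
```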
16 | 17 | ## Lead(s) 18 | 19 | Ana Claudia Sima, Jerven Bolleman 20 | 21 | -------------------------------------------------------------------------------- /21.md: -------------------------------------------------------------------------------- 1 | # Project 21: Enhancing multi-omic analyses through a federated microbiome analysis service 2 | 3 | ## Abstract 4 | 5 | Multi-omics datasets are an increasingly prevalent and necessary resource for achieving scientific advances in microbial ecosystem research. However, they present twin challenges to research infrastructures: firstly the utility of multi-omics datasets relies entirely on interoperability of omics layers, i.e. on formalised data linking. Secondly, microbiome derived data typically lead to computationally expensive analyses, i.e. on the availability of powerful compute infrastructures. 6 | 7 | Historically, these challenges have been met within the context of individual database resources or projects. These confines limit the FAIRness of datasets (since they typically aren’t interlinked, directly comparable, or collectively indexed), and mean the scope to analyse such datasets is governed by the available resources of the given project or service. Removing these confines, by establishing a model for the federated analysis of microbiome derived data, will allow these challenges to be met by the community as a whole. More compute can be brought to bear by combining EOSC and ELIXIR infrastructures, Galaxy instances, and existing resources like EMBL-EBI’s MGnify, but this requires adopting a common schema for sharing analysed datasets, including their provenance. 8 | 9 | Such a schema can also directly contribute to the interlinking of omics layers, using research objects to connect linked open datasets. We aim to design and implement a schema for this purpose, and use it to allow the generation of comparable analyses on heterogeneous compute infrastructures. By doing so, it will streamline the deposition of accessioned analysis products into public databases. 10 | 11 | ## Lead(s) 12 | 13 | Alexander Rogers, Alexander Sczyrba 14 | 15 | -------------------------------------------------------------------------------- /26.md: -------------------------------------------------------------------------------- 1 | # Project 26: Reducing the environmental impact of Galaxy 2 | 3 | ## Abstract 4 | 5 | Workflow management systems (WMSs) such as Galaxy are uniquely positioned to enable researchers to perform more environmentally-sustainable computational data analysis as they have full control of the resources used for a given workflow. 6 | In this project we want to reduce Galaxy resource usage by focusing on: 1) job caching to enable the reuse of tool outputs, and 2) environmentally-friendly job scheduling. 7 | 8 | Job caching uses the provenance information stored in Galaxy’s database for each tool execution to avoid unnecessary recalculations when the relevant parameters match. An initial implementation is already available and we will work on making the job cache apply in more scenarios. In particular, we will infer how strict dataset metadata needs to match for a job to be considered identical. We will also enable sharing the job cache for users that have opted-in to this feature, making it possible to run large-scale analyses during training sessions without consuming an unnecessary amount of computing resources. 9 | 10 | Advanced job scheduling is made possible in Galaxy through the Total Perspective Vortex (TPV) plugin. 
TPV can route entities (tools, users) to selected destinations with appropriate resource allocations (cores, GPUs, memory). It additionally allows arbitrary Python-based rules for e.g. custom ranking functions for choosing between destinations. We will specifically rank destinations in an order that promotes sustainability. Expanding on our initial implementation, we will collect (job-related) statistics and information from the Galaxy database and (Pulsar) compute destinations in a central location and add additional algorithms for the ranking based on these statistics. 11 | 12 | ## Lead(s) 13 | 14 | Nicola Soranzo, Paul De Geest 15 | 16 | -------------------------------------------------------------------------------- /24.md: -------------------------------------------------------------------------------- 1 | # Project 24: Increasing FAIRness of digital agrosystem resources by extending Bioschemas 2 | 3 | ## Abstract 4 | 5 | Research Data Infrastructures (RDIs) provide crucial publication services for researchers in the agrosystem domain. Due to their heterogeneous user communities and requirements, metadata standardization approaches to increase the FAIRness of resources can be a catalyst in simplifying data reusability and enabling cross-domain research. 6 | 7 | One way for RDIs to increase the Findability of their resources is to provide metadata markup via Schema.org, a vocabulary consumed by well-known search engines. Bioschemas, an extension of schema.org focussed on the life sciences, is an open community effort, aiming at increasing the adoption of key metadata properties in a domain-targeted manner, through the creation of needed domain-associated types, agreed properties and usable metadata profiles for describing those life science resources. The project will work on developing new resources (types, properties, profiles) for Bioschemas, which will help to describe agrosystem datasets in a FAIR manner. 8 | 9 | Participants will work on different topics, ranging from evaluating the current state of developed types and properties relevant to agrosystem resources to drafting new ones following use-case requirements and using example datasets. For increasing the metadata quality and supporting RDIs and their users in adopting the extension, participants will link properties to domain ontologies and facilitate mappings to other metadata standards, bolstering interoperability while following mapping frameworks like FAIR-IMPACT’s approach. To further ease the adoption for RDIs, participants will work on creating guidance and best practice documents on how to implement the extension into existing metadata description processes. 10 | 11 | ## Lead(s) 12 | 13 | Gabriel Schneider, Marco Brandizi 14 | 15 | -------------------------------------------------------------------------------- /2.md: -------------------------------------------------------------------------------- 1 | # Project 2: A curated assessment of metadata descriptors of AI-ready datasets 2 | 3 | ## Abstract 4 | 5 | To advance the use of Machine Learning for the understanding of diseases and conservation of biodiversity is important to promote FAIR AI-ready datasets since data scientists and bioinformaticians spend 80% of their time in data finding and preparation. Metadata descriptors for datasets are pivotal for the creation of Machine Learning Models as they facilitate the definition of strategies for data discovery, feature selection, data cleaning and data pre-processing. 
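The kind of before/after comparison the project will curate can be pictured with a small sketch: two toy descriptor sets for the same dataset, one raw and one AI-ready, plus a helper that surfaces what changed. Every field name and value here is an illustrative assumption, not a standard vocabulary.

```python
# Toy comparison of metadata descriptors before and after AI-readiness.
# All descriptor fields and values are illustrative, not a validated schema.
raw = {
    "format": "CSV",
    "missing_values": "unknown",
    "license": None,
    "columns": {"age": "string", "diagnosis": "free text"},
}
ai_ready = {
    "format": "Parquet",
    "missing_values": "imputed (median)",
    "license": "CC-BY-4.0",
    "columns": {"age": "int64", "diagnosis": "ICD-10 code"},
    "splits": ["train", "validation", "test"],
}

def descriptor_diff(before: dict, after: dict) -> dict:
    """List descriptors that were added or changed, i.e. candidate
    transformations that a pipeline might (semi)automate."""
    changes = {}
    for key in after:
        if key not in before:
            changes[key] = ("<absent>", after[key])
        elif before[key] != after[key]:
            changes[key] = (before[key], after[key])
    return changes

for field, (old, new) in descriptor_diff(raw, ai_ready).items():
    print(f"{field}: {old} -> {new}")
```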
6 | 7 | Once a dataset is AI-ready, such metadata descriptors change with respect to the initial version of the raw data. What can we learn from the metadata of raw vs AI-ready datasets? What transformations from raw to AI-ready could be (semi)automated based on metadata descriptors? In this project, we will manually analyze and curate metadata descriptors before and after AI-readiness. Based on our analysis, we will identify dataset transformations that could be (semi)automated by software pipelines, with the aim of alleviating the effort and time invested in data pre-processing for Machine Learning. 8 | 9 | The results will later be integrated into a metadata-based reproducibility assessment cycle, part of the NFDI4DataScience project in Germany. To facilitate the work during the BioHackathon, we will focus on datasets from the DOME registry, as this indicates that some metadata is already available (even if hidden in a scholarly article). The AI-ready metadata descriptors will use the Croissant schema proposed by MLCommons. This project will also take into account previous work done at the BioHackathon 2022 on metadata for synthetic data. 10 | 11 | ## Lead(s) 12 | 13 | Leyla Jael Castro, Nuria Queralt Rosinach 14 | 15 | ## Project repository 16 | 17 | https://github.com/zbmed-semtec/bheu24-cm4mlds (with updated information and developments) 18 | 19 | -------------------------------------------------------------------------------- /25.md: -------------------------------------------------------------------------------- 1 | # Project 25: Recognising research software contributions leveraging the ELIXIR infrastructure 2 | 3 | ## Abstract 4 | 5 | The proposed project aims to enhance the capabilities of the ELIXIR infrastructure to track, credit, and recognise software contributions made by research software engineers. The project seeks to foster collaboration and engagement within the ELIXIR Communities and platforms by working with different stakeholders, including individual contributors to research software, to define indicators that help measure the value of these contributions. 6 | 7 | The primary objective of this project is to promote a strong sense of community by recognising individual software contributions. We plan to connect APICURON and GitHub to track open-source research software contributions and reward each contributor for their efforts. We will also link OpenEBench evaluation data to this process by crosslinking GitHub repositories available in the Software Observatory section with APICURON. This will involve retrieving activity data from GitHub and OpenEBench, processing it, and integrating it into the APICURON platform. Recognition items will be pushed to ORCID from APICURON and made available for third-party services. 8 | 9 | We will leverage the involvement of APICURON and OpenEBench in the Data and Tools platforms and in the ELIXIR STEERS and EVERSE projects. These platforms and projects provide a network of stakeholders and guidance for implementing a fair recognition infrastructure. 10 | 11 | The project's usefulness lies in its ability to address a significant gap in the recognition and valuation of research software contributions. By providing a framework for recognition, we can incentivise and motivate developers to contribute to open-source software projects, leading to improved software quality and reproducibility and positively impacting the environment by reducing the carbon footprint.
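As a rough sketch of the retrieval step, the snippet below pulls contributor activity for one repository from the public GitHub REST API using the requests library. The repository name is a placeholder, and the call that would push recognition records to APICURON is left as a stub, since that part of the integration is exactly what the project will define.

```python
# Sketch: fetch per-contributor activity for one repository from the
# public GitHub REST API. The repository name is a placeholder; the
# APICURON submission step is a stub to be designed during the project.
import requests

REPO = "elixir-europe/example-repo"  # placeholder

def github_contributions(repo: str) -> list[dict]:
    url = f"https://api.github.com/repos/{repo}/contributors"
    resp = requests.get(url, headers={"Accept": "application/vnd.github+json"}, timeout=30)
    resp.raise_for_status()
    return [
        {"login": c["login"], "contributions": c["contributions"]}
        for c in resp.json()
    ]

def push_to_apicuron(records: list[dict]) -> None:
    # Placeholder: the actual APICURON API call and payload format
    # would be agreed with the APICURON team.
    for r in records:
        print(f"would credit {r['login']} with {r['contributions']} contributions")

push_to_apicuron(github_contributions(REPO))
```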
12 | 13 | ## Lead(s) 14 | 15 | Adel Bouhraoua, Gavin Farell, José Mª Fernández 16 | 17 | -------------------------------------------------------------------------------- /8.md: -------------------------------------------------------------------------------- 1 | # Project 8: Data Model Converter: Bridging Cohort Information Across Biomedical Data Models 2 | 3 | ## Abstract 4 | 5 | In the era of precision medicine, interoperability of biomedical data is crucial for facilitating collaborative research, and the concept of a minimum data set (MDS) has arisen as a collection of data elements using a standard approach to allow clinical data sharing and its use for research purposes. Health data are typically voluminous, complex, and sometimes too ambiguous to generate indicators that can provide knowledge and information on health. Our project aims to address this challenge by developing a versatile web-based tool, called "DataModel Converter", which enables conversion of cohort information into different biomedical minimum data sets, offering an intuitive interface for users to select individuals from a cohort in a clinical database (e.g. OMOP CDM, OpenEHR) and effortlessly transform their structured data into various other standard formats, including the B1MG Minimal Dataset for Cancer, BBMRI cohort definitions, OMOP cohorts, Phenopackets, Beacon v2, etc. 6 | 7 | The key objectives of our project include: 8 | 1. Designing an interactive and user-friendly web application for cohort selection and data conversion. 9 | 10 | 2. Implementing backend functionalities to retrieve and manipulate data from clinical databases. 11 | 12 | 3. Developing semantic mappings between different data models while preserving data integrity and semantics (OMOP CDM, OpenEHR, Phenopackets, B1MG, BBMRI, etc.). 13 | 14 | 4. Ensuring the scalability and performance of the DataModel Converter platform to handle large-scale datasets. 15 | 16 | 17 | By providing researchers and healthcare professionals with a flexible and efficient means to harmonize data across disparate data models, our project aims to accelerate biomedical research, enhance collaboration, and ultimately contribute to advancements in personalized medicine and patient care. 18 | 19 | ## Lead(s) 20 | 21 | Sergi Aguiló-Castillo, Alberto Labarga 22 | 23 | -------------------------------------------------------------------------------- /19.md: -------------------------------------------------------------------------------- 1 | # Project 19: Creating user benefit from ARC-ISA RO-Crate machine-actionability 2 | 3 | ## Abstract 4 | 5 | The development of FAIR Digital Objects (FDOs) holds immense promise for advancing scientific research, yet one critical challenge persists: despite efforts to create FDOs, achieving true machine-actionability remains elusive. 6 | 7 | We will address this pressing issue by focusing on the integration of Annotated Research Contexts (ARCs) within the scientific community. Recognizing the substantial efforts in annotating research and packaging it as RO-Crate FDOs, it is imperative to incentivize and leverage these endeavors to yield benefits transcending mere data management. ARCs as FDOs excel in meticulous record-keeping, rendering them indispensable in the realm of research data management. 8 | 9 | However, the dissemination and practical actionability of ARCs across diverse services, tools and repositories is pivotal in engendering user benefits.
These platforms require the capacity to comprehend and interpret RO-Crates, enabling seamless interaction with FDOs. Building on the consumption of ARC FDOs, search and indexing platforms must provide users with comprehensive search results, while the service infrastructure can offer customised services tailored to the data described in the FDO. 10 | 11 | Therefore, we will build a robust content-based recommendation framework. This approach promises to furnish users with enriched representations of ARC RO-Crate content, facilitating content-based filtering tailored to individual user needs. 12 | 13 | To substantiate the efficacy of this framework, Galaxy will serve as the representative workflow engine in a proof-of-concept endeavor aimed at suggesting workflows based on data annotated and encapsulated within ARC RO-Crates. Leveraging collaborative efforts uniting domain experts, developers, and stakeholders across diverse backgrounds, our objective is to engineer practical solutions that render ARC-ISA RO-Crates actionable across pivotal platforms. 14 | 15 | ## Lead(s) 16 | 17 | Angela Kranz, Eli Chadwick 18 | 19 | -------------------------------------------------------------------------------- /6.md: -------------------------------------------------------------------------------- 1 | # Project 6: Gender representation in ELIXIR-supported publications: a visibility analysis across academic search engines 2 | 3 | ## Abstract 4 | 5 | Equitable gender representation in ELIXIR-supported publications is crucial for fostering diversity and inclusivity within the ELIXIR community. Recognizing the potential for gender bias in popular bibliographic information retrieval systems, such as Google Scholar, the Bioinfo4Women initiative at the Life Sciences Department of the Barcelona Supercomputing Center (BSC) has developed a system to automatically retrieve comprehensive bibliographic data from Google Scholar queries, equipped with the capability to infer the gender of publication authors. Given Google Scholar's widespread use and its opaque "relevance" ranking criteria, our tool presents a significant opportunity to scrutinize and understand potential gender and visibility biases in ELIXIR-supported publications, particularly focusing on the underrepresentation of women as leading authors in specific domains. 6 | 7 | The project aims to rigorously test and utilize the capabilities of our system to specifically explore ELIXIR-supported publications, drawing on the existing compilation created by the ELIXIR Impact Group [https://elixir-europe.org/about-us/impact/publications](https://elixir-europe.org/about-us/impact/publications). The challenge's objective is twofold: to assess the impact of Google Scholar's algorithm on the visibility of ELIXIR publications authored by women, and to benchmark these findings against more transparent and FAIR-aligned bibliographic engines, such as the BIP! Finder developed by ELIXIR Greece. 8 | 9 | This endeavor will not only highlight discrepancies in gender representation but also foster the development of more equitable information retrieval practices. By leveraging our system's unique functionalities, participants will contribute to a more inclusive understanding of scholarly impact, paving the way for interventions to mitigate bias in academic literature production and discovery.
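One concrete way to compare engines, sketched below under the assumption that search results have already been retrieved and author genders inferred, is to measure the share of woman-led publications among the top-k ranked results for each engine. The column names and toy data are illustrative, not the project's actual pipeline.

```python
# Toy visibility comparison: share of woman-led publications in the
# top-k results per engine. Column names and values are illustrative.
import pandas as pd

results = pd.DataFrame({
    "engine": ["scholar"] * 4 + ["bip_finder"] * 4,
    "rank": [1, 2, 3, 4, 1, 2, 3, 4],
    "first_author_gender": ["m", "m", "f", "f", "f", "m", "f", "m"],
})

def woman_led_share(df: pd.DataFrame, k: int) -> pd.Series:
    top = df[df["rank"] <= k]
    return top.groupby("engine")["first_author_gender"].apply(
        lambda g: (g == "f").mean()
    )

print(woman_led_share(results, k=3))
```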
10 | 11 | ## Lead(s) 12 | 13 | Davide Cirillo, María Morales Martínez 14 | 15 | -------------------------------------------------------------------------------- /23.md: -------------------------------------------------------------------------------- 1 | # Project 23: MARS: Multi-omics Adapter for Repository Submissions, preparing for launch 2 | 3 | ## Abstract 4 | 5 | Multimodality studies are a reality, with scientists commonly using several different data acquisition techniques to characterise biological systems under various experimental conditions. Yet, the deposition of such studies to public repositories remains a challenge for scientists, who need familiarity with individual repositories to meet these data publication requirements. Started during [the BioHackathon 2023](https://github.com/elixir-europe/biohackathon-projects-2023/tree/main/27), [the MARS project](https://github.com/elixir-europe/MARS) (Multi-omics Adapter for Repository Submissions) made great strides in producing a proof of concept for dispatching metadata to BioSamples, ENA and MetaboLights using the ISA-JSON format. ISA-JSON, designed for multi-omics studies, has clear specifications and is used as an output format by the ISA tools, DataPLANT's ARC and FAIRDOM-SEEK software. 6 | 7 | Following this success, there is now interest in extending the service to support functional genomics data, hosted by ArrayExpress or BioStudies, EVA, EGA and e!DAL-PGP. Therefore, the objectives of this project are the following: 8 | 9 | 1. Consolidate the current proof of concept and bring it closer to a functional prototype by further testing and streamlining it. 10 | 11 | 2. Extend functionalities to submit actual data files along with the metadata. 12 | 13 | 3. Collaborate further with repositories to support the ISA-JSON format for programmatic submission via their API endpoints. 14 | 15 | 4. Extend the MARS component and CLI to include additional data types and repositories, such as transcriptomics and possibly proteomics as a stretch goal. 16 | 17 | 5. Develop domain-specific minimal annotation profiles, building on the experience gained with MetaboLights for MS- and NMR-based assay definitions. 18 | 19 | To this end, we have assembled a team of subject matter experts to deliver on the task. 20 | 21 | 22 | ## Lead(s) 23 | 24 | Bert Droesbeke, Philippe Rocca-Serra 25 | 26 | -------------------------------------------------------------------------------- /29.md: -------------------------------------------------------------------------------- 1 | # Project 29: ELIXIR FAIR Lesson Plan Handbook: advancing researchers’ & data stewards’ FAIR skills 2 | 3 | ## Abstract 4 | 5 | Community activities (e.g., FAIRsFAIR, CONVERGE, ELIXIR-NL’s FAIR Data Day) have signalled the need for a framework on how to teach FAIR skills to researchers and data stewards. Via hackathons (e.g., CONVERGE, BioHackathon 2023) with over 50 participants in total, a minimum viable product (MVP), the [ELIXIR FAIR Lesson Plan Handbook](https://elixir-europe-training.github.io/ELIXIR-TrP-FAIR-Converge/), was created. Although lesson plans aren’t ready-to-go courses yet, they offer the basic framework for FAIR training. What do researchers and data stewards have to be skilled in to apply FAIR to datasets? 6 | 7 | We propose a BioHackathon 2024 project for two reasons: 8 | 9 | * To improve user-friendliness. At BioHackathon 2023, a new format was applied to some lesson plans. This was received well, and we will apply it to other lesson plans as well.
10 | 11 | * Align with developments in ELIXIR training: 12 | 13 | * [Learning paths](https://elixir-europe.org/focus-groups/learning-paths) for data stewards and researchers, as part of the new Learning Path FG. Providing users with a “pathway” that connects lesson plans so that it caters to their needs makes teaching FAIR more trainee-oriented. Many organisations struggle with training on how to go/do FAIR. 14 | 15 | * [The FAIR Metroline](https://zenodo.org/records/10850958) (in development), an ELIXIR-NL initiative for a unified FAIRification workflow, based on a comparison of FAIR models. The FAIR Metroline smartly combines practical FAIR steps with training needs/gaining competences in organisations. 16 | 17 | BioHackathon 2023 resulted in the creation of the GitHub repository, with a corresponding website, for the ELIXIR FAIR Lesson Plans. It will be BioHackathon 2024 that enables trainers to start using it, as we will work hard to restructure the content and engage with the community. 18 | 19 | ## Lead(s) 20 | 21 | Mijke Jetten, Martijn Kersloot 22 | 23 | ## Relevant links 24 | [Agenda document](https://docs.google.com/document/d/1N0qC44g9ijd1kCeLgKNZ7sHrlPXLycspQTSXnEqgwc4/edit) with links 25 | -------------------------------------------------------------------------------- /12.md: -------------------------------------------------------------------------------- 1 | # Project 12: Perturb-Bench: large-scale benchmarking of perturbational modelling tools in complex single-cell data 2 | 3 | ## Abstract 4 | 5 | Single-cell perturbation modelling delineates how perturbations affect cellular and molecular physiology, such as transcription factors, kinases, and signalling pathways. Perturbation modelling aims to understand the molecular impacts of pharmaceutical compounds or cellular stimulants, dissect disease pathobiology, and facilitate drug repurposing. 6 | 7 | Our BioHackathon project aims to address the current lack of independent benchmarking and best practices for perturbation modelling tools, which hinders their broader adoption by the single-cell community. We will conduct an extensive benchmarking study for various perturbation modelling tools, including variational autoencoders, graph-based models for gene-regulatory networks, Optimal Transport tools deciphering cell states, and foundation models. 8 | 9 | The benchmarking study will focus on out-of-distribution predictions for unseen events, drug synergy scores, and distilling perturbation effects from confounding sources of variation. We will adopt workflow management systems compatible with community-driven benchmarking frameworks, such as OpenEBench. 10 | 11 | We will utilise harmonised single-cell datasets from scPerturb (containing control/disease samples and CRISPR/compound treatments, e.g., sci-Plex, Perturb-seq). The project will standardise emerging metrics (e.g. gene expression correlation, distribution distances, clustering separation) across datasets and perturbational tasks, and assemble a multidisciplinary group of participants to address biological and computational-mathematical challenges. 12 | 13 | Another goal will be the creation of a continuous repository to further develop benchmarking efforts beyond the BioHackathon’s duration. The project's feasibility is supported by the expertise of the leads, who are members of the ELIXIR Single Cell Omics Community/Machine-Learning Focus group, and their ongoing research initiatives, e.g. the Mongoose ELIXIR Staff Exchange Project (GR-DE-NL nodes, Feb-Jul 2024).
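To make the metric standardisation concrete, the sketch below computes two commonly used quantities for a predicted versus observed perturbation response: the Pearson correlation of per-gene mean expression, and the energy distance between the two cell populations. It is a minimal illustration on random toy data, not the project's benchmarking code.

```python
# Minimal illustration of two candidate benchmarking metrics for
# perturbation prediction, computed on random toy data.
import numpy as np
from scipy.stats import pearsonr
from scipy.spatial.distance import cdist

rng = np.random.default_rng(0)
observed = rng.normal(size=(200, 50))   # cells x genes, observed response
predicted = observed + rng.normal(scale=0.5, size=observed.shape)

# 1) Correlation of per-gene mean expression between prediction and truth.
r, _ = pearsonr(observed.mean(axis=0), predicted.mean(axis=0))

# 2) Energy distance between the two cell populations.
def energy_distance(x: np.ndarray, y: np.ndarray) -> float:
    d_xy = cdist(x, y).mean()
    d_xx = cdist(x, x).mean()
    d_yy = cdist(y, y).mean()
    return 2 * d_xy - d_xx - d_yy

print(f"per-gene mean correlation: {r:.3f}")
print(f"energy distance: {energy_distance(observed, predicted):.3f}")
```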
14 | 15 | ## Lead(s) 16 | 17 | Georgios Gavriilidis, Marina Esteban-Medina 18 | 19 | -------------------------------------------------------------------------------- /4.md: -------------------------------------------------------------------------------- 1 | # Project 4: SPARQL Query Generation for Efficient Scientific Data Access of ELIXIR resources 2 | 3 | ## Abstract 4 | 5 | The Swiss Institute of Bioinformatics (SIB/ELIXIR-CH), Database Center for Life Science (DBCLS-Japan) and RIKEN-Japan join efforts to develop an open-source artificial intelligence (AI)-driven system for intuitive querying of scientific datasets to accelerate scientific innovation. We call for contributions in these efforts that align with the BioHackathon's goal of fostering an open-source infrastructure for data integration and addresses the urgent need for effective data retrieval methods. 6 | 7 | Our goal is to make it easier for life scientists to use databases by converting their questions into SPARQL queries using large language models (LLMs). We understand the difficulties researchers face with SPARQL's complexity and knowledge base schemas, so we suggest a user interface that combines LLMs and knowledge bases. This will allow for direct data interaction in natural language, simplifying the research process. Our approach will facilitate data discovery and retrieval with the necessary accuracy for scientific research, as it leverages LLMs to generate SPARQL queries grounded in validated scientific data. 8 | 9 | Despite LLMs’ abilities in areas like code generation, they often struggle with the semantic accuracy of SPARQL queries. Our project is focused on addressing these limitations, ensuring that conversational AI can accurately interpret and translate research inquiries into precise queries. It aligns with the objectives of the ELIXIR 2024-26 Programme and lays the groundwork for future research collaborations, offering a practical solution for data-driven discovery in the life sciences. 10 | 11 | ## Project GitHub Repository 12 | If you want to contribute or are just curious about our work, see: 13 | 14 | 👩💻 Project code: [https://github.com/jcrangel/SPARQL4ELIXIR 15 | ](https://github.com/jcrangel/SPARQL4ELIXIR.git) 16 | 17 | 📝 Project backlog: https://github.com/users/jcrangel/projects/9 18 | 19 | ## Lead(s) 20 | 21 | - Tarcisio Mendes de Farias, SIB Swiss Institute of Bioinformatics (ELIXIR-CH) 22 | - Julio Rangel, RIKEN - JAPAN 23 | 24 | ## Team members 25 | - Vincent Emonet, SIB Swiss Institute of Bioinformatics (ELIXIR-CH) 26 | - TBD 27 | 28 | -------------------------------------------------------------------------------- /22.md: -------------------------------------------------------------------------------- 1 | # Project 22: Enabling Secure Data Access from Galaxy to (F)EGA 2 | 3 | ## Abstract 4 | 5 | In an era marked by the continuous growth of precision medicine and the emergence of regulations such as the GDPR and EHDS, the implementation of secure repositories to enable data sharing has become essential. These protocols play a crucial role in preserving the confidentiality of sensitive information and effectively mitigating risks associated with unauthorised access and data breaches. 6 | 7 | Galaxy is one of the most popular analysis platforms, especially among non-bioinformatics specialists. 
Thus, to increase Galaxy's integration within environments that require stringent data security measures, this proposal devises a comprehensive strategy that would facilitate the secure and scalable access and processing of sensitive datasets (and their derivative sensitive results) within Galaxy. 8 | 9 | Integral to this endeavour is the European Genome-Phenome Archive (EGA), recognised as the predominant repository within Europe for the secure storage of phenoclinical and genomic data, thus underscoring its significance in biomedical data security considerations. As data housed within EGA federated repositories is encrypted in accordance with the GA4GH Crypt4GH standard, the proposed strategy is the development of a protocol tailored to enable the secure access, transfer, and processing of encrypted datasets, thereby leveraging the capabilities of a multi-user public Galaxy platform. 10 | 11 | ## Project Objectives 12 | 13 | The objective of the project can be divided into two milestones: 14 | 1. Development of a workflow that connects EGA, either central or any federated node, and Galaxy through Crypt4GH protocols. 15 | 2. Galaxy's secure processing protocol: sensitive datasets are kept encrypted throughout, with sensitive derivative results labelled as sensitive. 16 | 17 | ## Resources 18 | 19 | 1. [About Galaxy](https://docs.galaxyproject.org/en/master/). 20 | 2. [About EGA](https://localega.readthedocs.io/en/latest/). 21 | 3. [About GA4GH Crypt4GH](https://crypt4gh.readthedocs.io/en/latest/). 22 | 4. [Slack Channel](https://biohackeu.slack.com/archives/C07NBGJKE0Z) - Use this as the main source of communication between in-person and virtual participants during the hackathon. 23 | 24 | ## Leads 25 | 26 | María Chavero-Díez (ELIXIR-ES), Sveinung Gundersen, Pável Vázquez Faci (ELIXIR-NO) 27 | -------------------------------------------------------------------------------- /10.md: -------------------------------------------------------------------------------- 1 | # Project 10: BioSchemas for mortals 2 | 3 | ## Abstract 4 | 5 | Bioschemas is a community effort to improve the FAIRness of web-based resources. Established by ELIXIR over 7 years ago, it has good adoption by the technical communities in workflows, software and tools, but less adoption than there should be, particularly in less technical communities such as training or even data services. 6 | 7 | The Bioschemas website hosts tooling, training and guidance materials. Practical 'how to use Bioschemas' help and examples have been neglected. Guidance is technical - written by techies for techies - and inappropriate or inaccessible for a large cohort of potential Bioschemas users. Examples focus on simple use cases and not the real set-ups that users actually encounter in their work. This makes access to directly usable markup impossible, leaving users confused about where to go next. This lack of helpful support poses an unacceptably high technical barrier for the broader user community, and means we are not fully exploiting Bioschemas. Common complaints frequently cite a technical ‘barrier’, the lack of ‘lightweight’ guidance, the need to ‘demystify’, and a lack of assistance for users who have the desire to implement Bioschemas, but not the ‘how’. 8 | 9 | The goal of this project is to reimagine, reframe and supplement the existing Bioschemas guidance available. Working with non-technical users from the data and training platforms, we will gather patterns of use, tasks (different CMSs; properties) and user personas. These will be used to provide users with specific code examples that can be copy/pasted, documented examples for different web setups, and customised guidance for different personas, all validated by non-technical users in the data and training platforms.
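The sort of copy/paste-ready snippet the project wants to provide can be pictured as below: a minimal schema.org/Bioschemas-style JSON-LD block for a training page, built and serialised in Python. The property set shown is a deliberately small, illustrative subset and not a complete or validated Bioschemas profile.

```python
# Build a minimal schema.org/Bioschemas-style JSON-LD description of a
# training page. The property subset is illustrative, not a full profile.
import json

markup = {
    "@context": "https://schema.org",
    "@type": "LearningResource",
    "name": "Introduction to Bioschemas markup",
    "description": "A hands-on lesson on adding structured metadata to web pages.",
    "keywords": "FAIR, metadata, schema.org",
    "author": {"@type": "Person", "name": "Jane Doe"},
    "license": "https://creativecommons.org/licenses/by/4.0/",
}

# The resulting block can be pasted into the page's HTML head inside a
# <script type="application/ld+json"> element.
print(json.dumps(markup, indent=2))
```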
10 | 11 | ## Lead(s) 12 | 13 | Nick Juty, Helena Schnitzer 14 | 15 | ### Key participant 16 | 17 | Phil Reed 18 | 19 | ## Complete our survey 20 | 21 | Have you used Bioschemas to mark up your data, tools, training materials or other content? 22 | Was it cumbersome, tedious or a joyous experience, or did you fail to get started? 23 | Please share your experiences and help us to run this project by [completing our survey](https://bit.ly/bh2410s). 24 | Every response really helps our analysis and takes less than 10 minutes. 25 | Your contribution will directly feed into our work as we reimagine, reframe and supplement the existing Bioschemas guidance available. 26 | 27 | ## Document links 28 | 29 | - [Activities for Mortals (including sign-up sheet)](https://docs.google.com/document/d/15inqwNojNYkcookFkrngezsAhdze2mZTcNpY6r46blY/edit?tab=t.0) 30 | 31 | We will conduct most of our work in the Google Docs link above. 32 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | 2 | 3 | # BioHackathon Europe projects 2024 4 | This repository is intended for the BioHackathon Europe participants to share ideas. The event will take place at Campus Belloch, 4-8 November 2024. For more information, please see the [BioHackathon Europe website](https://biohackathon-europe.org/index.html). 5 | 6 | ## Accepted projects 7 | 8 |