├── .gitignore ├── docs ├── test-model.md ├── deploy-model.md ├── monitor-model.md ├── model-ops-lifecycle.md ├── add-data-pipeline-ingestion-patterns-batch.md ├── add-data-pipeline-ingestion-patterns-event.md ├── add-data-pipeline-ingestion-patterns-federation.md ├── component-architecture-data-science.md ├── explore-notebooks-and-manage-dependencies.md ├── add-data-infra-as-a-platform.md ├── profiles.yml ├── push-changes.md ├── credentials.env ├── metadata-management.md ├── setup-gitops-pipeline.md ├── building-images-for-openshift.md ├── data-ingestion-pipeline.md ├── data-loading.md ├── component-architecture-data-platform.md ├── data-transformation.md ├── setup-initial-environment.md ├── pre-requisite.md ├── add-federated-governance.md ├── data-extraction.md └── add-data-pipeline-management.md ├── .DS_Store ├── images ├── .DS_Store ├── architecture │ ├── COP26-Overview.png │ ├── COP26-Ingestion-Flow.png │ ├── COP26-Overview-Business.png │ ├── Data-Commons-Pipeline.png │ ├── COP26-Overview-Technical.png │ ├── COP26-Federated-Governance.png │ ├── Data-Commons-Platform-Overview.png │ ├── Data-Commons-Platform-Ingestion.png │ ├── Data-Ingestion-Federation-pattern.png │ ├── Data-Ingestion-Batch-ingestion-pattern.png │ ├── Data-Commons-Overall-Component-Architecture.png │ ├── Data-Commons-Platform-Federated-Governance.png │ ├── Data-Ingestion-Event-driven-ingestion-pattern.png │ ├── Data-Commons-Data-Platform-Component-Architecture.png │ └── Data-Commons-Data-Science-Platform-Component-Architecture.png ├── developer_guide │ ├── jupyterhub-login.png │ ├── jupyterhub-gitclone.png │ ├── jupyterhub-launcher.png │ ├── repo-choose-license.png │ ├── Data-Ingestion-Process.png │ ├── aicoe-project-template.png │ └── jupyterhub-startserver.png ├── OS-C Data Commons Data Platform Layered Architecture.png └── OS-Climate Data Ops Process Flow & Security Management.png ├── os-climate-diagrams.pdf ├── platform-resourcing.md ├── README.md ├── os-c-data-commons-developer-guide.md ├── os-c-data-commons-architecture-blueprint.md └── LICENSE /.gitignore: -------------------------------------------------------------------------------- 1 | .env* 2 | -------------------------------------------------------------------------------- /docs/test-model.md: -------------------------------------------------------------------------------- 1 | # OS-Climate Data Commons Developer Guide: Test model -------------------------------------------------------------------------------- /docs/deploy-model.md: -------------------------------------------------------------------------------- 1 | # OS-Climate Data Commons Developer Guide: Deploy model -------------------------------------------------------------------------------- /docs/monitor-model.md: -------------------------------------------------------------------------------- 1 | # OS-Climate Data Commons Developer Guide: Monitor model -------------------------------------------------------------------------------- /.DS_Store: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/os-climate/os_c_data_commons/HEAD/.DS_Store -------------------------------------------------------------------------------- /images/.DS_Store: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/os-climate/os_c_data_commons/HEAD/images/.DS_Store -------------------------------------------------------------------------------- /os-climate-diagrams.pdf: 
-------------------------------------------------------------------------------- https://raw.githubusercontent.com/os-climate/os_c_data_commons/HEAD/os-climate-diagrams.pdf -------------------------------------------------------------------------------- /images/architecture/COP26-Overview.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/os-climate/os_c_data_commons/HEAD/images/architecture/COP26-Overview.png -------------------------------------------------------------------------------- /images/architecture/COP26-Ingestion-Flow.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/os-climate/os_c_data_commons/HEAD/images/architecture/COP26-Ingestion-Flow.png -------------------------------------------------------------------------------- /images/developer_guide/jupyterhub-login.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/os-climate/os_c_data_commons/HEAD/images/developer_guide/jupyterhub-login.png -------------------------------------------------------------------------------- /images/architecture/COP26-Overview-Business.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/os-climate/os_c_data_commons/HEAD/images/architecture/COP26-Overview-Business.png -------------------------------------------------------------------------------- /images/architecture/Data-Commons-Pipeline.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/os-climate/os_c_data_commons/HEAD/images/architecture/Data-Commons-Pipeline.png -------------------------------------------------------------------------------- /images/developer_guide/jupyterhub-gitclone.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/os-climate/os_c_data_commons/HEAD/images/developer_guide/jupyterhub-gitclone.png -------------------------------------------------------------------------------- /images/developer_guide/jupyterhub-launcher.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/os-climate/os_c_data_commons/HEAD/images/developer_guide/jupyterhub-launcher.png -------------------------------------------------------------------------------- /images/developer_guide/repo-choose-license.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/os-climate/os_c_data_commons/HEAD/images/developer_guide/repo-choose-license.png -------------------------------------------------------------------------------- /images/architecture/COP26-Overview-Technical.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/os-climate/os_c_data_commons/HEAD/images/architecture/COP26-Overview-Technical.png -------------------------------------------------------------------------------- /images/developer_guide/Data-Ingestion-Process.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/os-climate/os_c_data_commons/HEAD/images/developer_guide/Data-Ingestion-Process.png -------------------------------------------------------------------------------- /images/developer_guide/aicoe-project-template.png: 
-------------------------------------------------------------------------------- https://raw.githubusercontent.com/os-climate/os_c_data_commons/HEAD/images/developer_guide/aicoe-project-template.png -------------------------------------------------------------------------------- /images/developer_guide/jupyterhub-startserver.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/os-climate/os_c_data_commons/HEAD/images/developer_guide/jupyterhub-startserver.png -------------------------------------------------------------------------------- /images/architecture/COP26-Federated-Governance.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/os-climate/os_c_data_commons/HEAD/images/architecture/COP26-Federated-Governance.png -------------------------------------------------------------------------------- /images/architecture/Data-Commons-Platform-Overview.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/os-climate/os_c_data_commons/HEAD/images/architecture/Data-Commons-Platform-Overview.png -------------------------------------------------------------------------------- /images/architecture/Data-Commons-Platform-Ingestion.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/os-climate/os_c_data_commons/HEAD/images/architecture/Data-Commons-Platform-Ingestion.png -------------------------------------------------------------------------------- /images/architecture/Data-Ingestion-Federation-pattern.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/os-climate/os_c_data_commons/HEAD/images/architecture/Data-Ingestion-Federation-pattern.png -------------------------------------------------------------------------------- /docs/model-ops-lifecycle.md: -------------------------------------------------------------------------------- 1 | # OS-Climate Data Commons Developer Guide: ModelOps Lifecycle Overview 2 | 3 | # Next Step 4 | 5 | [Setup and Deploy Inference Application](./deploy-model.md) -------------------------------------------------------------------------------- /images/architecture/Data-Ingestion-Batch-ingestion-pattern.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/os-climate/os_c_data_commons/HEAD/images/architecture/Data-Ingestion-Batch-ingestion-pattern.png -------------------------------------------------------------------------------- /images/OS-C Data Commons Data Platform Layered Architecture.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/os-climate/os_c_data_commons/HEAD/images/OS-C Data Commons Data Platform Layered Architecture.png -------------------------------------------------------------------------------- /images/OS-Climate Data Ops Process Flow & Security Management.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/os-climate/os_c_data_commons/HEAD/images/OS-Climate Data Ops Process Flow & Security Management.png -------------------------------------------------------------------------------- /images/architecture/Data-Commons-Overall-Component-Architecture.png: 
-------------------------------------------------------------------------------- https://raw.githubusercontent.com/os-climate/os_c_data_commons/HEAD/images/architecture/Data-Commons-Overall-Component-Architecture.png -------------------------------------------------------------------------------- /images/architecture/Data-Commons-Platform-Federated-Governance.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/os-climate/os_c_data_commons/HEAD/images/architecture/Data-Commons-Platform-Federated-Governance.png -------------------------------------------------------------------------------- /images/architecture/Data-Ingestion-Event-driven-ingestion-pattern.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/os-climate/os_c_data_commons/HEAD/images/architecture/Data-Ingestion-Event-driven-ingestion-pattern.png -------------------------------------------------------------------------------- /images/architecture/Data-Commons-Data-Platform-Component-Architecture.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/os-climate/os_c_data_commons/HEAD/images/architecture/Data-Commons-Data-Platform-Component-Architecture.png -------------------------------------------------------------------------------- /images/architecture/Data-Commons-Data-Science-Platform-Component-Architecture.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/os-climate/os_c_data_commons/HEAD/images/architecture/Data-Commons-Data-Science-Platform-Component-Architecture.png -------------------------------------------------------------------------------- /docs/add-data-pipeline-ingestion-patterns-batch.md: -------------------------------------------------------------------------------- 1 | # Data Ingestion Patterns: Batch 2 | 3 | ![Data pipeline ingestion pattern: Batch](https://github.com/os-climate/os_c_data_commons/blob/main/images/architecture/Data-Ingestion-Batch-ingestion-pattern.png) 4 | -------------------------------------------------------------------------------- /docs/add-data-pipeline-ingestion-patterns-event.md: -------------------------------------------------------------------------------- 1 | # Data Ingestion Patterns: Event-Driven 2 | 3 | ![Data pipeline ingestion pattern: Batch](https://github.com/os-climate/os_c_data_commons/blob/main/images/architecture/Data-Ingestion-Event-driven-ingestion-pattern.png) -------------------------------------------------------------------------------- /docs/add-data-pipeline-ingestion-patterns-federation.md: -------------------------------------------------------------------------------- 1 | # Data Ingestion Patterns: Data Federation 2 | 3 | ![Data pipeline ingestion pattern: Data Federation](https://github.com/os-climate/os_c_data_commons/blob/main/images/architecture/Data-Ingestion-Federation-pattern.png) -------------------------------------------------------------------------------- /docs/component-architecture-data-science.md: -------------------------------------------------------------------------------- 1 | # Component Architecture: Data Science Platform 2 | 3 | ## Overview 4 | 5 | ![Data Science Plaform Component Architecture](https://github.com/os-climate/os_c_data_commons/blob/main/images/architecture/Data-Commons-Data-Science-Platform-Component-Architecture.png) 6 | 
-------------------------------------------------------------------------------- /docs/explore-notebooks-and-manage-dependencies.md: -------------------------------------------------------------------------------- 1 | # OS-Climate Data Commons Developer Guide: Explore notebooks and manage dependencies 2 | 3 | 1. Add dependencies to Pipfile in the [packages] section. 4 | 2. Create custom images (is this the same as creating virtual envs?). 5 | 6 | ## Next Step 7 | 8 | [Push changes to GitHub](./push-changes.md) 9 | -------------------------------------------------------------------------------- /docs/add-data-infra-as-a-platform.md: -------------------------------------------------------------------------------- 1 | # Architectural Domain Driver: Data Infrastructure-as-a-Platform 2 | 3 | Data infrastructure as a platform provides common tools and capabilities for data storage and management on a self-service basis in order to speed implementation and remove from data producers and owners the burden of building their own data-asset platform. 4 | -------------------------------------------------------------------------------- /docs/profiles.yml: -------------------------------------------------------------------------------- 1 | dbt_project_name: 2 | target: dev 3 | outputs: 4 | dev: 5 | type: trino 6 | method: jwt 7 | user: user_name 8 | jwt_token: jwt_token 9 | database: osc_datacommons_dev 10 | host: trino-secure-odh-trino.apps.odh-cl2.apps.os-climate.org 11 | port: 443 12 | schema: dbt_project_name 13 | threads: 1 -------------------------------------------------------------------------------- /platform-resourcing.md: -------------------------------------------------------------------------------- 1 | This file will contain list of tools, operators, and other components, for tracking resources. 2 | Something like a google sheet doc might be easier for totaling columns. If one is created we can link to it from here. 3 | 4 | - jupyter / jupyter-hub / elyra / manifests for common libraries 5 | - object storage / trino / iceberg / tileDB 6 | - metadata / apache ranger / linked-in data hub 7 | - dask 8 | - postgresql or other SQL 9 | - sqlite / datasette / sqlalchemy 10 | - superset / d3 observable / plotly 11 | - spark 12 | - seldon 13 | - add more here... 14 | -------------------------------------------------------------------------------- /docs/push-changes.md: -------------------------------------------------------------------------------- 1 | # OS-Climate Data Commons Developer Guide: Push changes 2 | 3 | Note to say a few words about pull-requests, commits, pushes, etc. and how to play nicely with CD/CI environment. 4 | 5 | It is a Red Hat best practice for individual contributors to fork, rather than clone, their repositories: https://redhat-cop.github.io/contrib/local-setup.html. TL;DR: if you are maintining the repo for the project, it's OK to clone. But in all other cases, it's better to fork and submit pull requests that convey the changes in your forked repository to the branch against which you wish to merge. 
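For illustration, a minimal sketch of that fork-based workflow from the command line might look as follows; the repository name, branch names and URLs below are placeholders rather than actual OS-Climate repositories.

```sh
# Clone your fork, not the upstream OS-Climate repository (names below are placeholders)
git clone git@github.com:your-github-username/example-ingestion-pipeline.git
cd example-ingestion-pipeline

# Keep a reference to the upstream repository so your fork can stay in sync
git remote add upstream https://github.com/os-climate/example-ingestion-pipeline.git
git fetch upstream

# Do your work on a topic branch rather than on the default branch
git checkout -b my-feature upstream/main
# ...commit your changes, then publish the branch to your fork...
git push origin my-feature
# Finally, open a pull request from your fork's branch against the upstream branch you wish to merge into
```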
6 | -------------------------------------------------------------------------------- /docs/credentials.env: -------------------------------------------------------------------------------- 1 | # user trino credentials 2 | # Get your JWT token from: 3 | # new: https://das-odh-trino.apps.odh-cl2.apps.os-climate.org 4 | TRINO_HOST=trino-secure-odh-trino.apps.odh-cl2.apps.os-climate.org 5 | TRINO_PORT=443 6 | TRINO_USER=your_github_username 7 | TRINO_PASSWD=your_passwd_or_jwt_token 8 | 9 | # S3 Credentials for 'osc physical landing' bucket 10 | # for OSC landing bucket credentials, file an issue at: 11 | # https://github.com/os-climate/OS-Climate-Community-Hub 12 | S3_LANDING_ENDPOINT=https://s3.us-east-1.amazonaws.com 13 | S3_LANDING_BUCKET=redhat-osc-physical-landing-647521352890 14 | S3_LANDING_ACCESS_KEY=on_request 15 | S3_LANDING_SECRET_KEY=on_request 16 | 17 | # S3 Credentials for Data Loading bucket on Hive / Iceberg 18 | # for Data Loading bucket credentials, file an issue at: 19 | # https://github.com/os-climate/OS-Climate-Community-Hub 20 | S3_HIVE_ENDPOINT=https://s3.us-east-1.amazonaws.com 21 | S3_HIVE_BUCKET=osc-datacommons-s3-bucket-dev02 22 | S3_HIVE_ACCESS_KEY=on_request 23 | S3_HIVE_SECRET_KEY=on_request 24 | 25 | # Pachyderm Credentials 26 | PACH_ENDPOINT="pachd.pachyderm.svc.cluster.local" 27 | PACH_PORT="30650" 28 | -------------------------------------------------------------------------------- /docs/metadata-management.md: -------------------------------------------------------------------------------- 1 | # OS-Climate Data Commons Developer Guide: Data Quality Validation and Metadata Management 2 | 3 | ## 1. Data Quality Validation 4 | 5 | From the previous page you have already run `dbt test --profiles-dir /opt/app-root/src/` which provides a preliminary data quality check. 6 | 7 | What more needs to be said? 8 | 9 | ## 2. Metadata Management 10 | 11 | ## 3. Data Profiling and Sample Data 12 | 13 | ## 4. Sources and Exposures 14 | 15 | [Sources and Exposures](https://timeflow.academy/dbt/labs/sources-exposures) put the DBT pipeline into a larger context. Source tables and views can be referenced in our DBT pipelines, but they are never created or materialsied by DBT. Exposures are tables containing the data we actually want our user community to use and which meet our desired standards for accuracy and completeness. We can add metadata to Exposures to expose not only what the data is, but also how it is used or expected to be used (a dashboard, report or notebook), and an email address for the owner of the downstream consumer, facilitating communication and collaboration as new data arrives and as schemas evolve. 
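For illustration, an exposure is declared in a YAML file alongside the models; the sketch below is a hypothetical example (the exposure, model and owner names are placeholders, not part of an actual pipeline) showing the kind of metadata that can be attached.

```yaml
# models/exposures.yml -- hypothetical example, all names are placeholders
version: 2

exposures:
  - name: power_plant_dashboard
    type: dashboard            # other exposure types include notebook, analysis, ml, application
    maturity: medium
    url: https://superset.example.org/dashboard/power-plants
    description: >
      Dashboard built on top of the curated power plant tables.
    depends_on:
      - ref('gppd_plants')     # placeholder model name
    owner:
      name: Data Product Owner
      email: owner@example.org
```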
16 | 17 | # Next Step 18 | 19 | [Data Quality Validation and Metadata Management](./metadata-management.md) 20 | -------------------------------------------------------------------------------- /docs/setup-gitops-pipeline.md: -------------------------------------------------------------------------------- 1 | # OS-Climate Data Commons Developer Guide: Setup GitOps pipeline 2 | 3 | This section needs to define how developers can name and use pipelines for 4 | * Development on their own branch (or the branch that results from accepting a pull request) 5 | * CD/CI testing (we don't want automated testing to break every branch that every developer is working on) 6 | * General release to the Data Commons community (it's fine to have a canonical name mean "latest", but releases also need to be versioned) 7 | 8 | Presently developers are using canonical release names in their notesbooks (such as `osc_datacommons_dev.gleif`), which is all well and good for a single developer working by themselves. But really there needs to be some way for developers to tag their intended canonical name with some additional information so that the overall system can do the right thing when multiple developers and/or release engineers are all working together in a variety of environments that are all connected. Now that we have several ingestion streams with multiple developers working (or wanting to work on them), this needs to be defined and implemented as a library that all developers of ingestion pipelines (as well as all users of ingestion pipelines) can use. 9 | 10 | The currently-defined `osc_datacommons_dev`, `osc_datacommons_test`, and whatever we call the OS-C production environment does not provide enough granularity, and expecting developers to manually tweak strings in notebooks is asking for trouble. 11 | -------------------------------------------------------------------------------- /docs/building-images-for-openshift.md: -------------------------------------------------------------------------------- 1 | # Building Images For OpenShift 2 | 3 | This file has some tips and tricks for understanding what OpenShift expects when building and running container images, 4 | and how to write your `Dockerfile` and organize your directory layout to make your builds and deployments run 5 | smoothly with OpenShift. 6 | 7 | ### Do not assume root user 8 | 9 | OpenShift runs all its images with an anonymous and random UID, and group id 0: 10 | ```sh 11 | $ id 12 | uid=1000640000(1000640000) gid=0(root) groups=0(root),1000640000 13 | ``` 14 | 15 | This is the single most common source of problems when new OpenShift users try to run images they were running using `docker` or `kubernetes`. 16 | Typically these images will fail in OpenShift because they were build to assume they would run as `root:0`. 17 | 18 | Here are some tips for writing your `Dockerfile` to work on OpenShift: 19 | 20 | #### Ensure group 0 can access your deps 21 | Install your image dependencies into non-root directories, and ensure that they are universally accessible by `gid 0`. 
22 | One example of a common install pattern is the following: 23 | 24 | ``` 25 | RUN \ 26 | mkdir -p /opt/app \ 27 | && cd /opt/app \ 28 | && install-my-stuff \ 29 | && chgrp -R 0 /opt/app \ 30 | && chmod g+rwX /opt/app 31 | ``` 32 | 33 | #### Locally test with a different uid 34 | 35 | If you wish to test your images locally using `docker` or `podman`, 36 | it is best to add a final line to your `Dockerfile`, like this: 37 | 38 | ``` 39 | # Emulate an anonymous uid 40 | USER 9999:0 41 | ``` 42 | 43 | This will cause `docker` or `podman` (or `kubernetes`) to run your container as `uid 9999` and `gid 0`. 44 | If your image runs this way locally, it is likely to run when OpenShift assigns it a different random uid. 45 | 46 | 47 | 48 | ### Directory layouts 49 | 50 | Most OpenShift build-related commands (e.g. `oc new-build` or `oc new-app`) provide a way to specify the context directory to use from a repository. 51 | The command argument is typically `--context-dir=your/context/path`, and is a relative path (not beginning with `/`) from the root directory of your repository. 52 | 53 | These command line utilities assume that your dockerfile is named `Dockerfile`, and is in your context directory: `your/context/path/Dockerfile` 54 | OpenShift's build systems further assume that *all* of your build dependencies are underneath `your/context/path/...`. 55 | 56 | Deviating from this directory layout is possible, but it generally requires editing your `BuildConfig` as a YAML file and setting the `dockerfilePath` field. 57 | See: https://docs.openshift.com/container-platform/3.4/dev_guide/builds/build_strategies.html#dockerfile-path 58 | 59 | ### Credential management 60 | 61 | Say some smart things about this topic when we have smart things to say. 62 | -------------------------------------------------------------------------------- /docs/data-ingestion-pipeline.md: -------------------------------------------------------------------------------- 1 | # OS-Climate Data Commons Developer Guide: Data Ingestion Pipeline Overview 2 | 3 | OS-Climate Data Commons provides a backbone for building, maintaining and operating data pipelines at scale. Our data ingestion pipelines are structured following an Extraction, Loading, Transformation (ELT) pattern with key steps represented in the diagram below. 4 | 5 | ![Data Ingestion Pipeline Process Overview](../images/developer_guide/Data-Ingestion-Process.png) 6 | 7 | The key steps are as follows: 8 | 9 | - A data ingestion trigger is used to create an instance of a Directed Acyclic Graph (DAG) in Airflow. A DAG is a collection of tasks organized with dependencies and relationships that describe how the ingestion should run. Note that as per our principle of Data as Code, the schema of the DAG as well as the code blueprint of all tasks are maintained in a source code repository. 10 | - The ingestion flow always starts with the validation of source data, which is a check for minimum data requirements as well as ingestion format. Once the source data is validated, data is either federated (if directly accessible to our data federation layer) or extracted and put under version control. 11 | - The next step is loading of the data into cloud storage for further processing, leveraging Trino as an ingestion engine. Again, this is required only in the case where data needs to be extracted from the source. 12 | - From the versioned data source, the transformation pipeline is then triggered. At this stage all transformations are done as repeatable and versioned SQL code leveraging DBT.
13 | - Finally, the generation of data set metadata as well as the execution of data validation checks are performed, and the resulting metadata and quality control results are loaded into our metadata catalogue together with the data lineage of the pipeline. 14 | 15 | In the above process, all data, metadata, and data quality checks are automatically documented and made available in the Data Commons layer as a new version-controlled data set. Any exception in the data ingestion pipeline or, if the ingestion is successful, a report on the data ingestion and processing including results of quality tests is generated and sent to the data owner. This approach allows us to easily and transparently correct erroneous data inputs and trigger re-processing as required, since the resulting data load is provided and made accessible to the data owner even in cases where the data quality level is not good enough for distribution. 16 | 17 | Next, we will go through the various steps of a data pipeline with code examples. For this, we will use an ingestion example from our [WRI Global Power Plant Database ingestion pipeline repository](https://github.com/os-climate/wri-gppd-ingestion-pipeline), which shows the ingestion process from file-based source data into a dedicated Trino schema as well as all the steps required for data quality and metadata management. 18 | 19 | ## Next Step 20 | 21 | [Data Extraction](./data-extraction.md) 22 | -------------------------------------------------------------------------------- README.md: -------------------------------------------------------------------------------- 1 | 2 | 3 | > [!IMPORTANT] 4 | > On June 26 2024, Linux Foundation announced the merger of its financial services umbrella, the Fintech Open Source Foundation ([FINOS](https://finos.org)), with OS-Climate, an open source community dedicated to building data technologies, modeling, and analytic tools that will drive global capital flows into climate change mitigation and resilience; OS-Climate projects are in the process of transitioning to the [FINOS governance framework](https://community.finos.org/docs/governance); read more on [finos.org/press/finos-join-forces-os-open-source-climate-sustainability-esg](https://finos.org/press/finos-join-forces-os-open-source-climate-sustainability-esg) 5 | 6 | 7 | 8 | # OS-Climate Data Commons 9 | 10 | OS-Climate Data Commons is a unified, open Multimodal Data Processing platform used by OS-Climate members to collect, normalize and integrate climate and ESG data from public and private sources, in support of: 11 | 12 | - Corporates in efficiently disclosing and managing their own climate and ESG data, including correcting, reporting and confirming the information in an auditable and secure manner. 13 | - Data scientists in collaboratively solving data collection, cleaning and normalization issues, based on shared modeling standards, tooling and community development following a data pipeline as code approach. 14 | - Decision makers such as investors, financial institutions, and regulators in integrating new or existing scenario-based predictive analytics with an open repository of trustworthy climate data. 15 | 16 | ## Overview 17 | 18 | ![OS-C Data Commons Platform Overview](https://github.com/os-climate/os_c_data_commons/blob/main/images/architecture/COP26-Overview-Business.png) 19 | 20 | The Data Commons platform aims at bridging climate-related data gaps across 3 dimensions: 21 | 22 | 1.
Data Availability: The platform supports data availability through data democratization via self-service data infrastructure as a platform. A self-service platform is fundamental to a successful data mesh architectural approach where existing data sources are federated and can be made discoverable and shareable easily across an organization and ecosystem through open tools and readily available infrastructure supporting data creation, storage, transformation and distribution. 23 | 24 | 2. Data Comparability: The platform supports data comparability through domain-oriented decentralized data ownership and architecture, i.e. data is treated like a product. The goal is to stop the proliferation of data puddles and instead “connect” the data with proper referential and relevant industry identifiers in order to have collections of data aligned with business goals. 25 | 26 | 3. Data Reliability: The platform supports data reliability through federated data access, data lifecycle management, security and compliance. This supports a data as code approach where the data pipeline code, the data itself and the data schema are versioned so as to have transparency and reproducibility (time machine), while enforcing the authentication and authorization required for data access management with consistent policies across the platform and throughout the data lineage. 27 | 28 | For more information on this and how Data Commons fits into the picture, good introduction links include the official [Data Commons page](https://os-climate.org/data-commons/) on the OS-Climate website, as well as the video recording of the [Data Commons Platform Overview](https://vimeo.com/645282758) at COP26 in Glasgow. Detailed platform documentation maintained by our community is available in this repository and accessible through the links below. 29 | 30 | ## Architecture 31 | 32 | [Data Commons Architecture Blueprint](https://github.com/os-climate/os_c_data_commons/blob/main/os-c-data-commons-architecture-blueprint.md) 33 | 34 | ## Developer Resources 35 | 36 | [Data Commons Developer Guide](https://github.com/os-climate/os_c_data_commons/blob/main/os-c-data-commons-developer-guide.md) 37 | -------------------------------------------------------------------------------- /docs/data-loading.md: -------------------------------------------------------------------------------- 1 | # OS-Climate Data Commons Developer Guide: Data Loading 2 | 3 | In the Data Commons architecture, the Distributed SQL Query engine provided by Trino is leveraged both as a centralized layer for federated queries and as a data loading engine for the data processed in our data pipelines. In the latter case, it fulfills the role of a data processing engine and abstracts the complexity of writing and versioning data over the underlying cloud object storage and Apache Iceberg, used to handle the partitioning and versioning of the data schema and the data itself. Trino and Iceberg basically bring the simplicity of SQL tables to the work we do with big data, and provide the reliability and scale we need for high volume data processing. 4 | 5 | The steps required for data loading are demonstrated in the [data loading notebook for WRI Global Power Plant Database][1]. 6 | 7 | ## 1. Use osc-ingest-tools to connect into the Hive ingestion bucket 8 | 9 | For efficient data loading at scale it is advised to use our osc-ingest-tools python library and connect into the Hive ingestion bucket, as shown in the sample notebook linked above.
This requires proper setup of credentials.env as explained in our [environment setup guide](./setup-initial-environment.md) 10 | 11 | ``` 12 | import osc_ingest_trino as osc 13 | hive_bucket = osc.attach_s3_bucket('S3_HIVE') 14 | ``` 15 | 16 | ## 2. Open a Trino connection 17 | 18 | In this case we use Trino as an ingestion engine, so we need to open a connection: 19 | 20 | ``` 21 | ingest_catalog = 'catalog_name' 22 | engine = osc.attach_trino_engine(verbose=True, catalog=ingest_catalog) 23 | ``` 24 | 25 | ## 3. Ingest the data via Trino 26 | 27 | In this case we use Trino as an ingestion engine, which allows creating a new table schema with a SQL statement and directly loading the data from a partitioned set of parquet files written in our Hive storage bucket. We may first need to create a new schema if it does not exist. Note that we use `osc._do_sql` instead of calling `engine.execute` directly, because since Trino version 395 and the trino python client version 0.317.0, some calls complete without returning rows, and others won't complete unless rows are fetched via fetchall. The `osc._do_sql` command abstracts that away. 28 | 29 | ``` 30 | ingest_schema = 'schema_name' 31 | ingest_table = 'table_name' 32 | ingest_bucket = 'osc-datacommons-s3-bucket-dev02' 33 | schema_create_sql = f""" 34 | create schema if not exists {ingest_catalog}.{ingest_schema} with ( 35 | location ='s3a://{ingest_bucket}/data/{ingest_schema}.db/' 36 | ) 37 | """ 38 | osc._do_sql(schema_create_sql, engine, verbose=True) 39 | ``` 40 | 41 | Once done, osc-ingest-tools provides a fast loading method via Hive: 42 | 43 | ``` 44 | osc.fast_pandas_ingest_via_hive( 45 | df_gppd, 46 | engine, 47 | ingest_catalog, ingest_schema, ingest_table, 48 | hive_bucket, hive_catalog, hive_schema, 49 | partition_columns = ['field_name'], 50 | overwrite = True, 51 | verbose = True 52 | ) 53 | ``` 54 | From there the data should be available in Trino and further data processing can be done, in particular data transformation if required. Note that the data transformation setup needs to be executed regardless in order to have the data set metadata ingested, as we leverage the data transformation part of the pipeline with dbt to push the data set metadata into our metadata catalogue based on OpenMetadata. 55 | 56 | ## 4. Using Seed Data 57 | 58 | [As pointed out in the previously mentioned dbt training modules](https://timeflow.academy/dbt/labs/dbt-seed-data-lab), having access to some small data tables (such as country codes) can greatly simplify the ingestion of principal data tables. The `dbt seed` command allows the loading of small CSV files that can be stored in the dbt project repository. Of course it is also possible to reference, load, and use such data within the ingestion notebooks themselves, but if you think it makes more sense to use the data only within the context of dbt, the `dbt seed` command makes that possible.
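For example, assuming a small lookup file such as `country_codes.csv` has been committed to the dbt project's seed directory, it could be loaded using the same profiles directory used elsewhere in this guide:

```
dbt seed --profiles-dir /opt/app-root/src/
```

The seed can then be referenced from any model like a regular table, for example via `{{ ref('country_codes') }}`, which keeps small lookup data versioned alongside the transformation code.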
59 | 60 | 61 | ## Next Step 62 | 63 | [Data Transformation](./data-transformation.md) 64 | 65 | [1]: https://github.com/os-climate/wri-gppd-ingestion-pipeline/blob/master/notebooks/wri-gppd-02-loading.ipynb 66 | -------------------------------------------------------------------------------- /docs/component-architecture-data-platform.md: -------------------------------------------------------------------------------- 1 | # Component Architecture: Data Management Platform 2 | 3 | ## Overview 4 | 5 | ![Data Management Platform Component Architecture](https://github.com/os-climate/os_c_data_commons/blob/main/images/architecture/Data-Commons-Data-Platform-Component-Architecture.png) 6 | 7 | The above view is a component architecture diagram of the Data Commons platform focusing on data management capabilities. Our approach is to closely integrate technology components across 3 layers: 8 | 9 | - Physical layer: responsible for storage and accessibility of raw data, both structured and unstructured. 10 | 11 | - Virtual layer: responsible for data virtualization, federation and access to data pipelines and other internal technology services. 12 | 13 | - Access layer: responsible for data distribution including access control and compliance management for external data consumers (including external applications integrating with the Data Commons). 14 | 15 | In addition, a single layer of Identity & Access Management driving data access controls, security and compliance is implemented across the physical, virtual and access layers to ensure consistency across them, and all data transactions are audited in real-time through a single consolidated monitoring pipeline. 16 | 17 | ## Data Access Management Components 18 | 19 | The key components of our data platform architecture are as follows: 20 | 21 | 1. Object storage: Responsible for secure storage of raw data and object-level data access for data ingestion / loading. This is a system layer where data transactions are system-based, considered to be privileged activity and typically handled via automated processes. The data security at this layer is to be governed through secure code management validating the logic of the intended data transaction and secrets management to authenticate all access requests for privileged credentials and then enforce data security policies. 22 | 23 | 2. Data serving: Open standard, high-performance table format responsible for managing the availability, integrity and consistency of data transactions and data schema for huge analytic datasets. This includes the management of schema evolution (supporting add, drop, update, rename), transparent management of hidden partitioning / partitioning layout for data objects, support of ACID data transactions (particularly multiple concurrent writer processes) through eventually-consistent cloud-object stores, and time travel / version rollback for data. This layer is critical to our data-as-code approach as it enables reproducible queries based on any past table snapshot as well as examination of historical data changes. 24 | 25 | 3. Distributed SQL Query Engine: Centralized data access and analytics layer with query federation capability. This component ensures that authentication and data access query management is standardized and centralized for all data clients consuming data, by providing management capabilities for role-based access control at the data element level in Data Commons.
This being a federation layer, it can support queries across multiple external distributed data sources for cross-querying requirements. 26 | 27 | 4. Metadata management: Component responsible for data pipeline and metadata management. By tying together data pipeline code and execution, it provides file-based automated data and metadata versioning across all stages of our data pipelines (including intermediate results), including immutable data lineage tracking for every version of code, models and data to ensure full reproducibility of data and code for transparency and compliance purposes. 28 | 29 | 5. Data security management: Framework responsible for the monitoring and management of comprehensive data security across the entire Data Commons, as a multi-tenant environment. It provides centralized security administration for all security-related tasks in a management UI, and fine-grained authorization for specific actions / operations on data at the data set / data element level including metadata-based authorization and controls. It also centralizes the auditing of user / process-level access for all data transactions. Ultimately, we want this layer to provide data access management and visibility directly to the data / data product owners as a self-service capability. 30 | 31 | 6. API Gateway: API management tool that aggregates the access to various external data APIs used in distributing data to external clients. It is responsible for managing external authentication and authorization for external services integrating with Data Commons, in particular external partner applications and tools. 32 | -------------------------------------------------------------------------------- /docs/data-transformation.md: -------------------------------------------------------------------------------- 1 | # OS-Climate Data Commons Developer Guide: Data Transformation 2 | 3 | We leverage DBT to build and execute advanced data transformation pipelines with clean and maintainable models, and version control all DBT artifacts in the pipeline repository, following our data-as-code principle. Before going through these instructions, we recommend going through a DBT overview and training; a good link for this is [DBT For Data Engineers provided by timeflow academy][1]. This is a free training and covers the DBT capabilities we need for building our pipelines. 4 | 5 | The following steps required for data transformation would typically be implemented in an Airflow DAG for a production run, but we cover here the details of each step from the point of view of a manual run in a development environment. 6 | 7 | ## 1. Setup your development environment and repository 8 | 9 | Considering the required structure for DBT projects to run, and the typical development process which may require re-runs of DBT pipelines specific to data schemas, we recommend setting up a specific DBT project for each specific data schema that the data pipeline is working on (as opposed to centralizing all DBT pipelines into a dedicated repository). This means we advise the following: 10 | 11 | - Have a `profiles.yml` file to manage all DBT connections, broken down by DBT project / schema. Here is a link to a sample base [profiles.yml](profiles.yml) file. 12 | - In your data pipeline repository, create one subdirectory with the name of the DBT project / schema. Another subdirectory that will be automatically created here is the `logs` directory. 13 | - In the project subdirectory, a lot of log data is typically generated as part of dbt command runs.
We typically want to exclude these from git versioning, having a `.gitignore` defined as per below (this has been done by default for you in our data pipeline project template): 14 | 15 | ``` 16 | target/ 17 | !target 18 | target/* 19 | !target/manifest.json 20 | !target/catalog.json 21 | dbt_packages/ 22 | logs/ 23 | ``` 24 | 25 | ## 2. Setup your DBT pipeline 26 | 27 | We leverage baseline DBT capabilities here with a few key aspects to note: 28 | 29 | - In `dbt_project.yml`, we prioritize outputs leveraging materialized views rather than new tables when possible if transformations are simple. If transformations involve combining data from multiple schemas and / or complex transformation rules, it can be acceptable to use new physical tables as an output but bear in mind this means having duplicated data stored. 30 | - In the `models` subdirectory, we leverage the YAML file for schema definition to manage the data set metadata definition, and maintain / version it as code in our repository. This means if you need to access metadata from an external source, we recommend having one step in the data pipeline to extract metadata and automatically produce the YAML file. This allows scalable maintenance of the metadata model by adding steps in the pipeline to augment the metadata from source while keeping every change versioned if needed. 31 | - Some simple data quality tests can be run in the DBT pipeline itself, if defined together with the schema. We recommend leveraging this for simple data checks performed for intermediate transformation steps. The testing done to assess the quality of the final data set, in particular business rules, should be done with great_expectations (as shown in the next step). 32 | - The dbt-utils package provides a richer set of functions for simple data quality checks, for example performant functionality for testing uniqueness of compound (multi-column) primary keys. To install dbt-utils it is necessary to create a `packages.yml` file in the same directory as the `dbt_project.yml` file. See instructions [here](https://hub.getdbt.com/dbt-labs/dbt_utils/latest/). 33 | - The various steps in the pipeline and SQL scripts will be automatically acquired by OpenMetadata in the next step to build a lineage visible from our catalogue in OpenMetadata itself. So we recommend simpler steps in sequence and simple SQL statements rather than a single, convoluted SQL statement for the pipeline, as it increases visibility for external data users. 34 | 35 | ## 3. Execute DBT pipeline 36 | 37 | Finally, the execution of the DBT pipeline can be run from the command line in 4 steps: 38 | 39 | ``` 40 | dbt debug --profiles-dir /opt/app-root/src/ 41 | dbt run --profiles-dir /opt/app-root/src/ 42 | dbt test --profiles-dir /opt/app-root/src/ 43 | dbt docs generate --profiles-dir /opt/app-root/src/ 44 | ``` 45 | 46 | ## Next Step 47 | 48 | [Data Quality Validation and Metadata Management](./metadata-management.md) 49 | 50 | [1]: https://timeflow.academy/categories/dbt 51 | -------------------------------------------------------------------------------- /docs/setup-initial-environment.md: -------------------------------------------------------------------------------- 1 | # OS-Climate Data Commons Developer Guide: Setup Initial Environment 2 | 3 | This section covers how to set up your development environment as a contributor to OS-Climate Data Commons developing a data ingestion or processing pipeline.
The setup is based around use of OS-Climate GitHub, and a Jupyter Hub / Elyra service provided as a management development platform built to support the needs of our contributors. 4 | 5 | ## 1. Create new GitHub repository for pipeline development 6 | 7 | In order to have a standardized structure that can be easily understood by data scientists, devops engineers and developers, repositories should be created by using one of the project templates: 8 | 9 | - [Template for Data Science projects][1] 10 | - [Template for Data Pipeline projects][4] 11 | 12 | You can click the `Use the template` button provided in the repository and create the structure for your repo this way. *Take care to select **OS-Climate** as the owner; the default is to create under your own GitHub ID, which may not be your intention if you are contributing to OS-Climate.* 13 | 14 | ![Repository Template](../images/developer_guide/aicoe-project-template.png) 15 | 16 | Having a defined structure in a project ensures all the pieces required for the ML and DevOps lifecycles are present and easily discoverable and allows managing library dependencies, notebooks, test data, documentation, etc. For more information on this topic, we recommend reading and understanding the [Cookiecutter Data Science][2] documentation, on which our standard repository template is inspired. 17 | 18 | ## 2. Make your repository open-source friendly 19 | 20 | 1. Add the APL 2.0 Open Source license to your repository. This is done by going into your repository, creating a new file called "LICENSE", clicking the button `Choose a license template` on the right, selecting the Apache License 2.0 template and then committing the proposed change. 21 | 22 | ![Open Source license](../images/developer_guide/repo-choose-license.png) 23 | 24 | 2. Explain what the repository is about in the "README.md" file created in the repository root. 25 | 26 | 3. Document the contribution structure of your repository for you and your trusted collaborators in the "OWNERS" file found in the repository root. 27 | 28 | 4. Confirm whether DCO / CLA covers this repository. 29 | 30 | ## 3. Access the development environment with Elyra images on JupyterHub 31 | 32 | With your GitHub credentials and once you are part of the team odh-env-users, you will be able to access the development environment. 33 | 34 | 1. Click this [link][2] to access and select githubidp for authentication. 35 | 36 | ![Jupyter Hub Login](../images/developer_guide/jupyterhub-login.png) 37 | 38 | 2. Select the image called `Elyra Notebook Image` and `Large` for container size. 39 | 40 | ![Jupyter Hub Server Start](../images/developer_guide/jupyterhub-startserver.png) 41 | 42 | 3. Your server should start automatically after a couple of minutes and the Jupyter launcher appear. 43 | 44 | ![Jupyter Hub Launcher](../images/developer_guide/jupyterhub-launcher.png) 45 | 46 | ## 4. Set your credentials environment variables 47 | 48 | From the File menu, create a new text file called `credentials.env`. You can copy this file from [this link](https://github.com/os-climate/os_c_data_commons/blob/main/docs/credentials.env). This example file includes a link to the JWT token retrieval client for Trino access. 49 | 50 | To secure and restrict the access to data based on user profiles, we have defined role-based access controls to specific schemas in Trino based on your team assignments. 
Therefore, authentication with the Trino service has been federated with GitHub SSO and on a weekly basis you will need to retrieve a JWT token from this [Token Retrieval Client][3]. Get the token and cut / paste the token string as your TRINO_PASSWD in the credentials file. 51 | 52 | Care should be taken to never commit the credentials.env file to your repository. Our template repos have this file listed in the .gitignore, so that it cannot be committed. If any changes are made to the repository structure or .gitignore file, it is important to make sure that this exclusion is still in place. If credentials are ever exposed publicly, please contact security@lists.os-climate.org immediately. 53 | 54 | ## 5. Access your repo using Jupyterlab Git Extension 55 | 56 | Once you are in the Jupyterlab UI, you can use the Git extension provided to clone this repo. 57 | 58 | 1. Click the Git extension button from Jupyterlab UI and select `Clone a repository`: 59 | 60 | ![Cloning a Git repository](../images/developer_guide/jupyterhub-gitclone.png) 61 | 62 | 2. Enter the HTTPS address of the repository you want to clone. If it is private and you have access, enter your credentials when requested. 63 | 64 | 3. You are ready to go! 65 | 66 | ## Next Step 67 | 68 | [Explore notebooks and manage dependencies](./explore-notebooks-and-manage-dependencies.md) 69 | 70 | [1]: https://github.com/os-climate/data-science-template 71 | [2]: https://jupyterhub-odh-jupyterhub.apps.odh-cl2.apps.os-climate.org/ 72 | [3]: https://das-odh-trino.apps.odh-cl2.apps.os-climate.org/ 73 | [4]: https://github.com/os-climate/data-pipeline-template 74 | -------------------------------------------------------------------------------- /docs/pre-requisite.md: -------------------------------------------------------------------------------- 1 | # OS-Climate Data Commons Developer Guide: Pre-Requisite 2 | 3 | This list provides the requirements for newly onboarded contributors to the OS-Climate project who need to leverage the Data Commons platform for their data management requirements. 4 | 5 | ## GitHub Setup 6 | 7 | 1. A GitHub account with your name and contributing organization updated in your profile. This is the profile that will be invited to OS-Climate GitHub organization and will provide with immediate access (Read-only mode) to all repositories. 8 | 9 | 2. The access to every repository in the organization is managed through specific team structures. A project team is created for every delivery stream (naming convention: os-climate-stream-name-project-team) which provide the ability to raise and contribute issues in the applicable repository or set of repositories, sub-teams providing privileged administrative access to a given repository (naming convention: repository-name-admin) and sub-teams providing development access to a given repository (naming convention: repository-name-developers). Check that your team assignment is correct and if not, please raise an issue against [OS-Climate Data Commons Repository][1] with the subject as "Access Request for OS-C GitHub Organization repositories". 10 | 11 | 3. For data pipeline developers, access to the Data Commons platform built on Open Data Hub is configured to work with GitHub SSO and requires the user to be part of the team named odh-env-users. If you require access to the platform, please raise an issue against [OS-Climate Data Commons Repository][1] with the subject as "Access Request for Data Commons platform". 
The access provided will be based on the developer teams you are assigned to at creation time (see item 2 above). In the same process, access to the relevant S3 buckets (for source data) and Trino catalogs (for data ingestion and processing) would be created for your user. Any additional access request or request for new S3 buckets and Trino catalogs to be created would have to be raised separately. 12 | 13 | 4. For data pipeline developers who require SFTP access to a secure S3 bucket where source data is to be uploaded, you will also need to create specific SSH keys in your GitHub account for SFTP access to the bucket. This can be done following the documented process: [Adding a new SSH key to your GitHub account](https://docs.github.com/en/github/authenticating-to-github/connecting-to-github-with-ssh/adding-a-new-ssh-key-to-your-github-account). Name the new key OS-Climate SFTP Key and raise an issue against [OS-Climate Data Commons Repository][1] with the subject as "Access request for Source Data SFTP", indicating the repository for the data pipeline and the source of the data. 14 | 15 | ## Access to S3 Buckets 16 | 17 | 1. When development access to the Data Commons platform is provided (item 2 of the GitHub Setup above), Read access to relevant source data and R/W access to the relevant S3 buckets for data ingestion & processing may also be required. In order to get the required S3 credentials, including access and secret keys, you should raise an issue against [OS-Climate Data Commons Repository][1] with the subject as "Access request for S3 Buckets" and information about the access required, including the data pipeline repositories you need to contribute to and the relevant data sets that need to be accessed. 18 | 19 | 2. There are two main buckets potentially required for data pipeline developers: 20 | - The S3 bucket which is our Source Data Landing Zone 21 | - The S3 Iceberg Development Data bucket which is used to load data into Trino 22 | 23 | ## Access to Trino 24 | 25 | 1. Access to relevant Trino catalogs and schemas is done via pull requests on the relevant Operate First overlay files. A good example of such a pull request can be found at [this link][2]. 26 | 27 | 2. In order to perform the change, the following steps are required: 28 | - If required (i.e. a new pipeline and catalogs / schemas / tables have to be created), create a new group under group-mapping.properties with the users who need access to the schema. A specific group needs to be created for each type of access (e.g. if you need to segregate people who have Read access vs Read / Write access to a schema) 29 | - Add your GitHub User Name to the required groups in group-mapping.properties 30 | - If required, use the file trino-acl-dsl.yaml to define the access rights for the new or existing groups into the new catalogs / schema / tables 31 | - Then on the command line use the trino-dsl-to-rules python module to produce and format the rules.json configuration file used by Trino: 32 | 33 | ```text 34 | trino-dsl-to-rules kfdefs/overlays/osc/osc-cl2/trino/configs/trino-acl-dsl.yaml > kfdefs/overlays/osc/osc-cl2/trino/configs/rules.json 35 | ``` 36 | 37 | When the above is done, commit the pull request as per the example shared above. Do note there are commit hooks in place to ensure consistency between the DSL and rules.json, so that access controls follow the appropriate structure and syntax.
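As a purely illustrative sketch (the group and user names below are placeholders), entries in group-mapping.properties typically follow Trino's file-based group provider format of one group per line, with a colon separating the group name from a comma-separated list of users:

```text
wri-gppd-developers:alice-gh,bob-gh
wri-gppd-readers:carol-gh
```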
38 | 39 | ## Next Step 40 | 41 | [Setup your initial environment](./setup-initial-environment.md) 42 | 43 | [1]: https://github.com/os-climate/os_c_data_commons 44 | [2]: https://github.com/operate-first/apps/commit/cc3d4c456d5d0c40882971f700a89ad16b0bc111 45 | -------------------------------------------------------------------------------- /docs/add-federated-governance.md: -------------------------------------------------------------------------------- 1 | # Architectural Domain Driver: Federated Data Management and Governance 2 | 3 | ## Overview 4 | 5 | As the data mesh architecture approach we use to build the Data Commons enables decentralisation and organization of data along domain-driven lines, we also need to define common data governance guidelines, standards and controls that local domain teams of contributors can follow as part of their implementation. This covers both the data management processes and practices (data schema management, metadata management, data lineage management), mainly governed through shared documentation and code reviews by the community, but also the shared data infrastructure and services layer that various domains can leverage to build their own pipelines from pre-approved templates and guidelines that ensure security and compliance. In this section, we focus on architecture domain drivers related to the federated governance process as well as the design of the shared platform supporting it. 6 | 7 | - **FG-001 - Shared model for federated governance:** We aim to support a high level of independence and accountability by the domain product owners from production to consumption in terms of how to manage their data and how they can best scale. Rather than enforcing a command-and-control centralized governance function without consideration for the nuances of each domain, the data mesh paradigm requires a federated governance model where responsibility for governance is shared. In practice, this means: 8 | 9 | 1. We centralize governance in terms of managing global risk and compliance policies and standards of the common technology platform, including data security, owning inter-domain data and data pipeline management standards, data lineage reporting and auditing. Also in terms of security implementation, identity management (authentication) is managed centrally while access management is provided and delegated to the respective domains. 10 | 11 | 2. The domains therefore directly own data provenance, data quality (both definition and measurement), data classification (in the form of a data dictionary and data set metadata communicated to a cross-domain data catalog), authorization entitlements (via role-based access control management), adherence to compliance and terminology standards, and definition of inter-domain data entity standardization. 12 | 13 | This means that some governance aspects would be set at the data mesh level whereas others are managed at the discretion of the domain, within common standards and best practices. We elaborate further on this for each specific governance aspect in the following list of drivers. 14 | 15 | - **FG-002 - Shared built-in delegated authorization system for unified security management:** Data platforms and their ELT pipelines by definition increase data sprawl across the organization, due to the need to access and replicate data and / or generate derived data across pipelines.
We use a shared delegated authorization system built into the single data access layer (at the level of the Distributed SQL Query Engine) to minimize the need for data duplication, therefore reducing data sprawl, and to ensure consistency in authorization and entitlements across the organization. This model of federated access is represented in the diagram below. 16 | 17 | ![Data Commons Platform Federated Governance](https://github.com/os-climate/os_c_data_commons/blob/main/images/architecture/Data-Commons-Platform-Federated-Governance.png) 18 | 19 | - **FG-003 - Data compliance through automated and centralized data lineage management:** Data governance management is the ability to assess and monitor whether the data follows any and all required policies, e.g. GDPR, SOX, etc. A critical dependency is the ability to capture and store data origin, what happens to it in data pipelines over time (and why), and trace it all the way to its distribution, namely the ability to produce and maintain a data lineage. In the context of our data-as-code approach, we therefore ensure compliance through an automated and centralized data lineage management capability, [closely integrated with the authorization system](https://github.com/os-climate/os_c_data_commons/blob/main/component-architecture-data-platform.md), which provides an immutable record for all data transactions / activities through the following capabilities: 20 | 21 | 1. Every version of data pipeline code, models (if relevant) and data is captured and tracked, using automated data versioning which provides a complete audit trail for all data and artifacts across pipeline stages, including intermediate results 22 | 23 | 2. The platform maintains historical reproducibility of data and code within compliance requirements (in particular the required time period) 24 | 25 | 3. The platform manages relationships between all historical data states (data lineage). This includes capturing and storing key metadata attributes of the pipeline execution such as source data systems involved, rules or models used to process the data, time stamps for each state when data is created, added, processed or deleted, and finally organizational information such as domain owner, data format and documentation, and retention policies. 26 | 27 | - **FG-004 - Data quality is quantified and communicated to data users:** The domain owner is responsible for establishing a data quality assessment framework that is implemented into the pipelines, which requires the ability to measure and report on quality, and also to manage distribution based on clearly established quality gates that are implemented as part of data testing. 28 | -------------------------------------------------------------------------------- /docs/data-extraction.md: -------------------------------------------------------------------------------- 1 | # OS-Climate Data Commons Developer Guide: Data Extraction 2 | 3 | Note that for the examples in this documentation, we are primarily using a data pipeline example from our [WRI Global Power Plant Database ingestion pipeline repository](https://github.com/os-climate/wri-gppd-ingestion-pipeline). Other repositories and code bases will be referenced for specific handling not used in the context of the WRI Global Power Plant Database ingestion pipeline. 4 | 5 | We cover below the most common use case of data extraction from a managed file source.
For federated data, the connection to the data source has to be handled based on the specific client used to connect (preferably a python library). Ultimately, for all types of sources, we want to be able to extract and version the source data, and be able to load the data into a Pandas DataFrame for the next step, which is loading the data into Trino, before transformation can happen. 6 | 7 | ## 1. Connect to and retrieve source data from an S3 source data bucket 8 | 9 | This is the most common case of source data extraction for large, batch-based extractions which need to be triggered on a regular but not continuous basis. S3 source data buckets are read-only buckets used purely for source data retrieval. This can be done from: 10 | 11 | 1. A private S3 bucket for an OS-Climate Source Data Landing Zone. We have an S3 bucket (redhat-osc-physical-landing-647521352890) for public and private sources that need to be downloaded physically for ingestion, where we restrict the ability to read / write data to authorized processes. Access to the S3 bucket is managed via dedicated secrets embedding the required credentials - if you need access to specific source data please check the section on [OS-Climate Data Commons Developer Guide: Pre-Requisite](./pre-requisite.md). For retrieval of data files we advise using the boto3 resource API which provides efficient handling of bucket and object-level data. The following code shows how to access a given bucket: 12 | 13 | ``` 14 | s3_resource = boto3.resource( 15 | service_name="s3", 16 | endpoint_url=os.environ['S3_LANDING_ENDPOINT'], 17 | aws_access_key_id=os.environ['S3_LANDING_ACCESS_KEY'], 18 | aws_secret_access_key=os.environ['S3_LANDING_SECRET_KEY'], 19 | ) 20 | bucket = s3_resource.Bucket(os.environ['S3_LANDING_BUCKET']) 21 | ``` 22 | 23 | From there, a specific file object such as a csv file can be retrieved with the following code (a consolidated, runnable version of these snippets, including the required imports, is shown at the end of this section): 24 | 25 | ``` 26 | csv_object = s3_resource.Object(os.environ['S3_LANDING_BUCKET'], 'filepath') 27 | csv_file = BytesIO(csv_object.get()['Body'].read()) 28 | ``` 29 | 30 | 2. A public S3 bucket for any data source that can be directly "federated" in the ingestion process from the source data provider. In this case, you can typically access the bucket through unsigned access as shown in our [GLEIF Data Ingestion Sample](https://github.com/os-climate/data-platform-demo/blob/master/notebooks/gleif_ingestion_sample.ipynb). 31 | 32 | 3. As a side note, we also include in the [Template for Data Pipeline projects][1] a simple data directory structure for simple file-based upload from the GitHub repository of the pipeline itself. This structure can be used as an alternative for simple pipelines where: 33 | - The data is public and can be made available on our public repositories. 34 | - The data is not large (<10000 records) 35 | - The data is rarely updated 36 | - There is value in having the whole pipeline inclusive of the data available for external parties to replicate the example, for example in the case of pipelines used as a demonstration for data engineering training 37 | 38 | In this structure, we have 4 folders: 39 | - external: used for data from third party sources which is needed in the pipeline (e.g. lookup) but is not the main processing target 40 | - raw: the original data to be processed (should be immutable) 41 | - interim: intermediate data that has been transformed but not to be used for analysis 42 | - processed: the final, canonical data set for analysis 43 |
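For reference, the snippets above can be combined into a single, self-contained cell. The sketch below assumes the variables from your credentials.env file are already set in the environment and uses a placeholder object key; the final read_csv call produces the Pandas DataFrame expected by the data loading step:

```
import os
from io import BytesIO

import boto3
import pandas as pd

# Connect to the landing bucket using the credentials provided via credentials.env
s3_resource = boto3.resource(
    service_name="s3",
    endpoint_url=os.environ['S3_LANDING_ENDPOINT'],
    aws_access_key_id=os.environ['S3_LANDING_ACCESS_KEY'],
    aws_secret_access_key=os.environ['S3_LANDING_SECRET_KEY'],
)

# Retrieve a CSV object (placeholder key) and read it into a DataFrame
csv_object = s3_resource.Object(os.environ['S3_LANDING_BUCKET'], 'path/to/source_file.csv')
csv_file = BytesIO(csv_object.get()['Body'].read())
df = pd.read_csv(csv_file)
```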
44 | ## 2. Version the source data on Pachyderm 45 | 46 | All source data triggering a data pipeline should then be versioned in a dedicated data versioning repository on the Pachyderm service. Pachyderm provides version control and lineage management for source data, and a good example of this can be found in the [data extraction notebook for WRI Global Power Plant Database][2]. The code leverages the python_pachyderm library and versioning a source file requires only 3 steps: 47 | 48 | 1. Creating a Pachyderm client based on connection details set in your credentials.env 49 | 50 | ``` 51 | client = python_pachyderm.Client(os.environ['PACH_ENDPOINT'], os.environ['PACH_PORT']) 52 | ``` 53 | 54 | 2. Creating a new repository (if required) - following the naming of your pipeline repository is recommended here 55 | 56 | ``` 57 | client.create_repo("wri-gppd") 58 | ``` 59 | 60 | 3. A simple commit of the file can then be done 61 | 62 | ``` 63 | with client.commit("wri-gppd", "master") as commit: 64 | # Add all the files, recursively inserting from the directory 65 | # Alternatively, we could use `client.put_file_url` or 66 | # `client.put_file_bytes`. 67 | python_pachyderm.put_files(client, path, commit, "/global_power_plant_database_v_1_3/") 68 | ``` 69 | Note that we do not restrict the type of file format to be used for versioning purposes. A recommendation is to use a column-oriented data file format (such as Parquet or ORC), while bearing in mind that the data loading step will require reading the data into a Pandas dataframe for subsequent loading into Trino. 70 | 71 | It should also be noted that for data that should carry units (i.e., metric tons of CO2e, petajoules, Euros, etc.), Pint-Pandas works best with column data all of the same type. Thus a timeseries of companies in rows and years in columns is an anti-pattern (because different companies have units particular to their sectors) while companies in columns and years in rows typically leads to homogeneous (and thus PintArray-friendly) data columns. 72 | 73 | ## Next Step 74 | 75 | [Data Loading](./data-loading.md) 76 | 77 | [1]: https://github.com/os-climate/data-pipeline-template 78 | [2]: https://github.com/os-climate/wri-gppd-ingestion-pipeline/blob/master/notebooks/wri-gppd-01-extraction.ipynb 79 | -------------------------------------------------------------------------------- /docs/add-data-pipeline-management.md: -------------------------------------------------------------------------------- 1 | # Architectural Domain Driver: Data Pipeline Management 2 | 3 | - **DPM-001 - Distributed data mesh architecture:** Pipeline management in the OS-C Data Commons platform is based on a distributed data mesh architectural paradigm, in order to support rapid onboarding of an ever-growing number of distributed domain data sets, and the expected proliferation of consumption scenarios such as reporting, analytical tools and machine learning across the growing OS-C community. This means having small units of data pipelines that are highly re-usable across multiple development streams with well-defined and documented integration / consumption models, all leveraging a shared infrastructure that takes care of scalability, governance and security. 4 | 5 | ![OS-C Data Commons Data Pipeline Architecture](https://github.com/os-climate/os_c_data_commons/blob/main/images/architecture/Data-Commons-Pipeline.png) 6 | 7 | - **DPM-002 - Query-driven data mesh implementation strategy:** The primary way data products should be made accessible to the organization is query-driven. In the Data Commons query-driven data mesh, interested consumers are expected to query the data through a data federation layer using SQL, including queries that federate multiple sources of data, both external federated data sources and loaded / transformed data stored in the Data Commons data vault. This has several advantages, including: 8 | 9 | Improving data quality: By keeping the data close to the data product owner and limiting data copies across multiple repositories, we can reduce data errors as well as the cost associated with data movement. 10 | 11 | Improving availability / usability: As most data science users are already familiar with SQL, they can use it to get insights by combining a wide range of heterogeneous data sources. 12 | 13 | Increasing agility: Having the ability to create ad-hoc queries gives users the freedom to easily integrate and leverage data sources in other domains.
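As a concrete illustration of this query-driven consumption model, the sketch below uses the Trino Python client to run, from a consumer notebook, a single SQL query joining an external federated catalog with data already loaded in the data vault. It is illustrative only: the catalog, schema, table and environment variable names are hypothetical placeholders, and the JWT token is the one retrieved as described in the developer guide's environment setup:

```
import os

import trino
from trino.auth import JWTAuthentication

# Connect to the Trino service using the JWT token stored as TRINO_PASSWD
conn = trino.dbapi.connect(
    host=os.environ['TRINO_HOST'],    # hypothetical variable holding the Trino hostname
    port=443,
    http_scheme='https',
    user=os.environ['TRINO_USER'],    # hypothetical variable holding your user name
    auth=JWTAuthentication(os.environ['TRINO_PASSWD']),
)
cur = conn.cursor()

# Hypothetical federated query joining an external catalog with the data vault
cur.execute("""
    SELECT v.company_id, v.reported_emissions, f.sector
    FROM datavault_catalog.ingested_schema.company_emissions AS v
    JOIN federated_catalog.external_schema.company_reference AS f
      ON v.company_id = f.company_id
""")
rows = cur.fetchall()
```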
14 | 15 | - **DPM-003 - Data pipelines based on a FLT / ELT approach:** For data pipelines, the use of an FLT (Federate / Load / Transform) or ELT (Extract / Load / Transform) approach is preferred over a traditional ETL approach. This means that: 16 | 17 | We prioritize loading data from a federated source as much as possible in order to promote use of the data maintained by the domain owner, and limit data duplication. 18 | 19 | We perform transformation only in consumer data pipelines that require the transformed data for their own use (e.g. feeding a model, or data distribution). In this case data is loaded only once by the consumer data pipeline and all transformation logic is managed according to our data pipeline management principles, in particular data-as-code for transparency and reproducibility. In addition, the consumer pipeline is typically developed output-backward, loading only the relevant data from the source. 20 | 21 | - **DPM-004 - Use a multi-layered approach for data processing / engineering:** Data Engineering Pipelines, which focus on handling ELT from multimodal external data sources and data normalisation, shall have a dedicated pipeline per business domain layer, managed in a dedicated code repository. The data processing should follow a multi-layered approach for decoupling and easy maintenance of the processing logic: 22 | 23 | Ingestion: This layer handles the collection of raw data in a wide variety of forms (structured, semi-structured, and unstructured) from various sources (such as external file-based systems, external APIs, IoT devices) at any speed (batch or stream) and scale. 24 | 25 | Storage: This layer takes care of storing the data in secure, flexible, and efficient storage for internal processing. This can typically be a relational database, a graph database, a NoSQL database, an in-memory data cache or even an event streaming pipeline. This should include managing relevant business and technical metadata that allows us to understand the data’s origin, format, lineage, and how it is organised, classified and connected. 26 | 27 | Processing: This layer turns the raw data into consumable form, by sorting, filtering, splitting, enriching, aggregating, joining, and applying any required business logic to produce new meaningful data sets. At this layer, we harmonise and simplify the disparate data sources by combining various data sets and build a unified business domain layer, which can be reused for various analytics and reporting use cases.
28 | 29 | Distribution: This layer provides the consumer of the data with the ability to use the post-processed data, by performing ad-hoc queries, producing views which are organised into reports and dashboards, or passing it upstream for ML use in other pipelines. This layer is designed for reusability, discoverability, and backfilling. 30 | 31 | - **DPM-005 - Manage data pipelines and training data sets as code for reproducibility:** Data pipeline design should ensure both audit-ability and reproducibility, which is the ability to re-process the same source data with the same workflow / model version to reach the same conclusion as a previous work state. This means in particular that for pipelines leveraging machine learning, the data pipeline implementation should allow a snapshot of the raw, curated and model input data to be saved / versioned / metadata-tagged every time a model is trained and associated with a specific version of the pipeline source code (maintained in the GitHub repository). Training data should also be made available for external consumption by other work streams requiring similar model training. 32 | 33 | - **DPM-006 - Recommended patterns for ingestion:** Based on the data source, we recommend 3 main types of ingestion patterns driving the underlying design of the data pipelines, namely: 34 | 35 | 1. [Batch ingestion pattern](add-data-pipeline-ingestion-patterns-batch.md): Used when entire, consistent data sets need to be loaded into Data Commons consistently with the source. 36 | 2. [Event-driven ingestion pattern](add-data-pipeline-ingestion-patterns-event.md): Used when there is a need to synchronize the data with an external source that goes through regular data changes, be it as append-only changes that need to be captured as a persistent and unchangeable log, or data unit level transactions that would impact existing data. 37 | 3. [Data federation pattern](add-data-pipeline-ingestion-patterns-federation.md): Used when the external data set is vast, available through direct connectivity, and the data required in the pipeline is typically a subset (data elements / records) of the entire data set. 38 | -------------------------------------------------------------------------------- /os-c-data-commons-developer-guide.md: -------------------------------------------------------------------------------- 1 | # OS-Climate Data Commons Developer Guide 2 | 3 | This developer guide is for data engineers, data scientists and developers of the OS-Climate community who are looking to leverage the OS-Climate Data Commons to build data ingestion and processing pipelines, as well as AI / ML pipelines. It shows step-by-step how to configure your development environment, structure projects, and manage data and code in a way that complies with our Architecture Blueprint. 4 | 5 | **Need Help?** 6 | 7 | - Outage / System failure: File a Linux Foundation (LF) [outage ticket](https://jira.linuxfoundation.org/plugins/servlet/desk/portal/2/create/30) (note: select OS-Climate from project list) 8 | - New infrastructure request (e.g.
software upgrade): File an LF [ticket](https://jira.linuxfoundation.org/plugins/servlet/desk/portal/2) (note: select OS-Climate from project list) 9 | - General infrastructure support: Get help on OS-Climate Slack [Data Commons channel](https://os-climate.slack.com/archives/C034SCF92BU) 10 | - Data Commons developer support: Get help on OS-Climate Slack [Developers channel](https://os-climate.slack.com/archives/C034SCQU919) 11 | 12 | **OS-Climate's Cluster Information** 13 | - Cluster 1 [CL1](https://console-openshift-console.apps.odh-cl1.apps.os-climate.org/): used for development and initial upgrades of applications 14 | - Cluster 2 [CL2](https://console-openshift-console.apps.odh-cl2.apps.os-climate.org/): stable cluster, sandbox UI and released versions of tools are available from cluster 2 15 | - Cluster 3 [CL3](https://console-openshift-console.apps.odh-cl3.apps.os-climate.org/): administrative cluster, managed by Red Hat and Linux Foundation IT org 16 | - Cluster 4 [CL4](https://console-openshift-console.apps.odh-cl4.apps.os-climate.org/): latest implementation of Red Hat's Data Mesh pattern - under construction. Follows the Open Data Hub [Data Mesh Pattern](https://github.com/opendatahub-io-contrib/data-mesh-pattern). 17 | 18 | ## Tools 19 | 20 | Pipeline development leverages a number of tools provided by Data Commons. The list below provides an overview of key technologies involved as well as links to development instances: 21 | 22 | | Technology | Description | Link | 23 | | ---------- | ----------- | ---- | 24 | | [GitHub][2] | Version control tool used to maintain the pipelines as code | [OS-Climate GitHub](https://github.com/os-climate) | 25 | | [GitHub Projects][3] | Project tracking tool that integrates issues and pull requests | [Data Commons Project Board](https://github.com/orgs/os-climate/projects/7) | 26 | | [JupyterHub][4] | Self-service environment for Jupyter notebooks used to develop pipelines | [JupyterHub Development Instance](https://jupyterhub-odh-jupyterhub.apps.odh-cl2.apps.os-climate.org/) | 27 | | [Kubeflow Pipelines][5] | MLOps tool to support model development, training, serving and automated machine learning | | 28 | | [Trino][7] | Distributed SQL Query Engine for big data, used for data ingestion and distributed queries | [Trino Console](https://trino-secure-odh-trino.apps.odh-cl2.apps.os-climate.org/) | 29 | | [CloudBeaver][8] | Web-based database GUI tool which provides a rich web interface to Trino | [CloudBeaver Development Instance](https://cloudbeaver-odh-trino.apps.odh-cl2.apps.os-climate.org/) | 30 | | [Pachyderm][9] | Data-driven pipeline management tool for machine learning, providing version control for data | | 31 | | [dbt][10] | SQL-based data transformation tool providing git-enabled version control of data transformation pipelines | | 32 | | [Great Expectations][11] | Data quality tool providing git-enabled data quality pipelines management | | 33 | | [OpenMetadata][12] | Centralized metadata store providing data discovery, data collaboration, metadata versioning and data lineage | [OpenMetadata Development Instance](https://openmetadata-openmetadata.apps.odh-cl2.apps.os-climate.org) | 34 | | [Airflow][13] | Workflow management platform for data engineering pipelines | [Airflow Development Instance](https://airflow-openmetadata.apps.odh-cl2.apps.os-climate.org/home) | 35 | | [Apache Superset][6] | Data exploration and visualization platform | [Superset Development Instance](https://superset-secure-odh-superset.apps.odh-cl2.apps.os-climate.org/)
| 36 | | [Grafana][14] | Analytics and interactive visualization platform | [Grafana Development Instance](https://grafana-opf-monitoring.apps.odh-cl2.apps.os-climate.org/login) | 37 | | [INCEpTION][15] | Text-annotation environment primarily used by OS-C for machine learning-based data extraction | [INCEpTION Development Instance](https://inception-inception.apps.odh-cl2.apps.os-climate.org/) | 38 | 39 | ## GitOps for reproducibility, portability, traceability with AI support 40 | 41 | Nowadays, developers (including data scientists) use Git and GitOps practices to store and share code on development platforms such as GitHub. GitOps best practices allow for reproducibility and traceability in projects. For this reason, we have decided to adopt a GitOps approach toward managing the platform, data pipeline code as well as data and related artifacts. 42 | 43 | One of the most important requirements to ensure data quality through reproducibility is dependency management. Having dependencies clearly managed in audited configuration artifacts allows portability of notebooks, so they can be shared safely with others and reused in other projects. 44 | 45 | ## Project templates 46 | 47 | We use two project templates as starting points for new repositories: 48 | 49 | - A project template for data pipelines, specific to OS-Climate Data Commons, can be found here: [Data Pipelines Template][16] 50 | - A project template specifically for AI/ML pipelines can be found here: [Data Science Template][1]. 51 | 52 | The use of these templates ties together data scientist needs (e.g. notebooks, models) and data engineer needs (e.g. data and metadata pipelines). Having structure in a project ensures all the pieces required for the Data and MLOps lifecycles are present and easily discoverable. 53 | 54 | ## Tutorial Steps 55 | 56 | 0. [Pre-requisites](./docs/pre-requisite.md) 57 | 58 | ### ML Lifecycle / Source Lifecycle 59 | 60 | 1. [Setup your initial environment](./docs/setup-initial-environment.md) 61 | 62 | 2. [Explore notebooks and manage dependencies](./docs/explore-notebooks-and-manage-dependencies.md) 63 | 64 | 3. [Push changes to GitHub](./docs/push-changes.md) 65 | 66 | 4. [Setup pipelines to create releases, build images and enable dependency management](./docs/setup-gitops-pipeline.md) 67 | 68 | ### DataOps Lifecycle 69 | 70 | 5. [Data Ingestion Pipeline Overview](./docs/data-ingestion-pipeline.md) 71 | 72 | 6. [Data Extraction](./docs/data-extraction.md) 73 | 74 | 7. [Data Loading](./docs/data-loading.md) 75 | 76 | 8. [Data Transformation](./docs/data-transformation.md) 77 | 78 | 9. [Metadata Management](./docs/metadata-management.md) 79 | 80 | ### ModelOps Lifecycle 81 | 82 | 10. [ModelOps Lifecycle Overview](./docs/model-ops-lifecycle.md) 83 | 84 | 11. [Setup and Deploy Inference Application](./docs/deploy-model.md) 85 | 86 | 12. [Test Deployed inference application](./docs/test-model.md) 87 | 88 | 13.
[Monitor your inference application](./docs/monitor-model.md) 89 | 90 | [1]: https://github.com/aicoe-aiops/project-template 91 | [2]: https://github.com/ 92 | [3]: https://docs.github.com/en/issues/trying-out-the-new-projects-experience/about-projects 93 | [4]: https://jupyter.org/hub 94 | [5]: https://www.kubeflow.org/docs/pipelines/overview/pipelines-overview/ 95 | [6]: https://superset.apache.org/ 96 | [7]: https://trino.io/ 97 | [8]: https://dbeaver.com/ 98 | [9]: https://www.pachyderm.com/ 99 | [10]: https://www.getdbt.com/ 100 | [11]: https://greatexpectations.io/ 101 | [12]: https://open-metadata.org/ 102 | [13]: https://airflow.apache.org/ 103 | [14]: https://grafana.com/ 104 | [15]: https://inception-project.github.io/ 105 | [16]: https://github.com/os-climate/data-pipeline-template 106 | -------------------------------------------------------------------------------- /os-c-data-commons-architecture-blueprint.md: -------------------------------------------------------------------------------- 1 | # OS-C Data Commons Architecture Blueprint 2 | 3 | ## Overview 4 | 5 | ![OS-C Data Commons Platform Overview](https://github.com/os-climate/os_c_data_commons/blob/main/images/architecture/COP26-Overview-Technical.png) 6 | 7 | The overall architecture approach is based on a distributed data mesh architecture, with a focus on addressing the key problems we face in the climate data domain with regards to the proliferation of data sources and consumers, the diversity and complexity of data transformation requirements, and the speed of both scaling and change required. It is recommended to read the following foundational articles to understand the concept and approach towards building a data mesh: 8 | 9 | > [How to Move Beyond a Monolithic Data Lake to a Distributed Data Mesh][1] 10 | > -- Zhamak Dehghani, Thoughtworks 11 | 12 | > [Data Mesh Principles and Logical Architecture][2] 13 | > -- Zhamak Dehghani, Thoughtworks 14 | 15 | The Data Commons implementation is based on 3 foundational principles: 16 | 17 | 1. **Self-service data infrastructure as a platform:** The platform provides standardized self-service infrastructure and tooling for creating, maintaining and managing data products, for communities of data scientists and developers who may not have the technology capability or time to handle infrastructure provisioning, data storage and security, data pipeline orchestration and management or product integration. 18 | 19 | 2. **Domain-oriented decentralized data product ownership:** Data ownership is decentralized and domain data product owners are responsible for all capabilities within a given domain, including discoverability, understandability, quality and security of the data. In effect this is achieved by having agile, autonomous, distributed teams building data pipelines as code and data product interfaces on standard tooling and shared infrastructure services. These teams own the code for data pipelines loading, transforming and serving the data as well as the related metadata, and drive the development cycle for their own data domains. This is reflected and built into our DataOps / DevOps organization and processes, with contributor team structure, access to the platform and data management capabilities supporting autonomous collaborative development for the various data streams within OS-Climate. 20 | 21 | 3. 
**Federated governance:** The platform requires a layer providing a federated view of the data domains while being able to support the establishment of common operating standards around data / metadata / data lineage management, quality assurance, security and compliance policies, and by extension any cross-cutting supervision concern across all data products. 22 | 23 | ## Component Architecture 24 | 25 | This section covers the key open source technology components on which the Data Commons platform is built. To do this, we segregate two views: 26 | 27 | 1. [Data Management Platform](./docs/component-architecture-data-platform.md) 28 | 2. [Data Science Platform](./docs/component-architecture-data-science.md) 29 | 30 | It is important to understand that all technology components are not only evaluated and selected based on features and design requirements, but also on whether the underlying open source projects: 31 | 32 | - Have an "upstream first" approach. For more information on open source upstream and why upstream first is an important approach, this [blog post](https://www.redhat.com/en/blog/what-open-source-upstream) does a good job of explaining the various concepts. 33 | - Have an active, diverse community contributing to the development and use of the technology, including some enterprise usage at scale. 34 | - Ideally, for critical online components of the system, offer the possibility to get enterprise versions and support, considering that a medium to long term requirement could be for some OS-Climate members to run the platform within their own regulated environment. 35 | 36 | ## Guiding Principles 37 | 38 | These guiding principles are used for a consistent approach in linking together not just the platform / tools components but also the people and process aspects of data as managed through OS-Climate Data Commons: 39 | 40 | 1. **Data as code:** All data and data handling logic should be treated as code. Creation, deprecation, and critical changes to data artifacts should go through a design review process with appropriate written documents where the community of contributors as well as data consumers’ views are taken into account. Changes in data sources, schema, or pipeline logic have mandatory reviewers who sign off before changes are landed. Data artifacts have tests associated with them and are continuously tested. The same practices we apply to software code versioning and management are employed. 41 | 42 | 2. **Data is owned:** Data and data pipelines are code and all code must be owned. Source data loading and normalisation pipelines are managed in dedicated code repositories with a clear owner, purpose and list of contributors. Likewise, data analytics pipelines providing inputs to scenario analysis and alignment tooling are managed in dedicated code repositories with a clear owner, purpose, list of contributors and a defined data management lifecycle including data security and a documented process for publishing, sharing and using the data. 43 | 44 | 3. **Data quality is measured, reviewed and managed:** Data artifacts must have SLAs for data quality, SLAs for data issues remediation, and incident reporting and management just like for any technology service.
This means that for machine learning models, it is required to implement business-driven data validation rules on the independent and target variables as part of a continuous integration and deployment pipeline, by monitoring the statistical properties of variables over time and by continuously re-fitting the models to handle potential drift. The owner is responsible for quality and upholding those SLAs to the rest of the OS-C Data Commons user community. 45 | 46 | 4. **Accelerate data productivity:** The Data Commons platform and data tooling provided must be designed to optimise collaboration between data producers and consumers. Data tooling includes ETL tools, data integration, data management including data storage, data lineage and data access management, data pipeline development tools supporting the writing and execution of tests, automation of the data pipeline by following MLOps methodology, and integration with the existing monitoring / auditing capabilities of the platform. All the tooling used should be open source so as to not restrict usage for any contributing organization, and delivered following the Operate First principle (more at [https://www.operate-first.cloud/](https://www.operate-first.cloud/)), which means the platform development and productisation process must include building the required operational knowledge for deploying and managing the platform, and encapsulating it in the software itself. 47 | 48 | 5. **Organise for data:** Teams contributing to data pipeline development on the OS-C Data Commons platform should be self-sufficient in managing the whole lifecycle of development, which includes onboarding contributors, provisioning required infrastructure for development, testing and data processing, managing functional dependencies such as dependencies on other data pipeline streams, and managing their code repositories. In order to support this, the OS-C Data Commons platform follows a self-service model built around the OS-C Data Commons platform GitHub repository, where contributors and all code (platform, data pipelines) are managed, and working instances of the Data Commons platform run on self-service infrastructure tightly integrated with the code base. 49 | 50 | ## Architectural Domain Design Drivers 51 | 52 | This section covers key design domains in the platform, and for each of these domains the key architectural design drivers behind the target architecture blueprint. 53 | 54 | 1. [Data Infrastructure-as-a-Platform](./docs/add-data-infra-as-a-platform.md) 55 | 2. [Data Pipeline Management](./docs/add-data-pipeline-management.md) 56 | 3. [Federated Data Management and Governance](./docs/add-federated-governance.md) 57 | 58 | [1]: https://martinfowler.com/articles/data-monolith-to-mesh.html 59 | [2]: https://martinfowler.com/articles/data-mesh-principles.html -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | Apache License 2 | Version 2.0, January 2004 3 | http://www.apache.org/licenses/ 4 | 5 | TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION 6 | 7 | 1. Definitions. 8 | 9 | "License" shall mean the terms and conditions for use, reproduction, 10 | and distribution as defined by Sections 1 through 9 of this document. 11 | 12 | "Licensor" shall mean the copyright owner or entity authorized by 13 | the copyright owner that is granting the License.
14 | 15 | "Legal Entity" shall mean the union of the acting entity and all 16 | other entities that control, are controlled by, or are under common 17 | control with that entity. For the purposes of this definition, 18 | "control" means (i) the power, direct or indirect, to cause the 19 | direction or management of such entity, whether by contract or 20 | otherwise, or (ii) ownership of fifty percent (50%) or more of the 21 | outstanding shares, or (iii) beneficial ownership of such entity. 22 | 23 | "You" (or "Your") shall mean an individual or Legal Entity 24 | exercising permissions granted by this License. 25 | 26 | "Source" form shall mean the preferred form for making modifications, 27 | including but not limited to software source code, documentation 28 | source, and configuration files. 29 | 30 | "Object" form shall mean any form resulting from mechanical 31 | transformation or translation of a Source form, including but 32 | not limited to compiled object code, generated documentation, 33 | and conversions to other media types. 34 | 35 | "Work" shall mean the work of authorship, whether in Source or 36 | Object form, made available under the License, as indicated by a 37 | copyright notice that is included in or attached to the work 38 | (an example is provided in the Appendix below). 39 | 40 | "Derivative Works" shall mean any work, whether in Source or Object 41 | form, that is based on (or derived from) the Work and for which the 42 | editorial revisions, annotations, elaborations, or other modifications 43 | represent, as a whole, an original work of authorship. For the purposes 44 | of this License, Derivative Works shall not include works that remain 45 | separable from, or merely link (or bind by name) to the interfaces of, 46 | the Work and Derivative Works thereof. 47 | 48 | "Contribution" shall mean any work of authorship, including 49 | the original version of the Work and any modifications or additions 50 | to that Work or Derivative Works thereof, that is intentionally 51 | submitted to Licensor for inclusion in the Work by the copyright owner 52 | or by an individual or Legal Entity authorized to submit on behalf of 53 | the copyright owner. For the purposes of this definition, "submitted" 54 | means any form of electronic, verbal, or written communication sent 55 | to the Licensor or its representatives, including but not limited to 56 | communication on electronic mailing lists, source code control systems, 57 | and issue tracking systems that are managed by, or on behalf of, the 58 | Licensor for the purpose of discussing and improving the Work, but 59 | excluding communication that is conspicuously marked or otherwise 60 | designated in writing by the copyright owner as "Not a Contribution." 61 | 62 | "Contributor" shall mean Licensor and any individual or Legal Entity 63 | on behalf of whom a Contribution has been received by Licensor and 64 | subsequently incorporated within the Work. 65 | 66 | 2. Grant of Copyright License. Subject to the terms and conditions of 67 | this License, each Contributor hereby grants to You a perpetual, 68 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 69 | copyright license to reproduce, prepare Derivative Works of, 70 | publicly display, publicly perform, sublicense, and distribute the 71 | Work and such Derivative Works in Source or Object form. 72 | 73 | 3. Grant of Patent License. 
Subject to the terms and conditions of 74 | this License, each Contributor hereby grants to You a perpetual, 75 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 76 | (except as stated in this section) patent license to make, have made, 77 | use, offer to sell, sell, import, and otherwise transfer the Work, 78 | where such license applies only to those patent claims licensable 79 | by such Contributor that are necessarily infringed by their 80 | Contribution(s) alone or by combination of their Contribution(s) 81 | with the Work to which such Contribution(s) was submitted. If You 82 | institute patent litigation against any entity (including a 83 | cross-claim or counterclaim in a lawsuit) alleging that the Work 84 | or a Contribution incorporated within the Work constitutes direct 85 | or contributory patent infringement, then any patent licenses 86 | granted to You under this License for that Work shall terminate 87 | as of the date such litigation is filed. 88 | 89 | 4. Redistribution. You may reproduce and distribute copies of the 90 | Work or Derivative Works thereof in any medium, with or without 91 | modifications, and in Source or Object form, provided that You 92 | meet the following conditions: 93 | 94 | (a) You must give any other recipients of the Work or 95 | Derivative Works a copy of this License; and 96 | 97 | (b) You must cause any modified files to carry prominent notices 98 | stating that You changed the files; and 99 | 100 | (c) You must retain, in the Source form of any Derivative Works 101 | that You distribute, all copyright, patent, trademark, and 102 | attribution notices from the Source form of the Work, 103 | excluding those notices that do not pertain to any part of 104 | the Derivative Works; and 105 | 106 | (d) If the Work includes a "NOTICE" text file as part of its 107 | distribution, then any Derivative Works that You distribute must 108 | include a readable copy of the attribution notices contained 109 | within such NOTICE file, excluding those notices that do not 110 | pertain to any part of the Derivative Works, in at least one 111 | of the following places: within a NOTICE text file distributed 112 | as part of the Derivative Works; within the Source form or 113 | documentation, if provided along with the Derivative Works; or, 114 | within a display generated by the Derivative Works, if and 115 | wherever such third-party notices normally appear. The contents 116 | of the NOTICE file are for informational purposes only and 117 | do not modify the License. You may add Your own attribution 118 | notices within Derivative Works that You distribute, alongside 119 | or as an addendum to the NOTICE text from the Work, provided 120 | that such additional attribution notices cannot be construed 121 | as modifying the License. 122 | 123 | You may add Your own copyright statement to Your modifications and 124 | may provide additional or different license terms and conditions 125 | for use, reproduction, or distribution of Your modifications, or 126 | for any such Derivative Works as a whole, provided Your use, 127 | reproduction, and distribution of the Work otherwise complies with 128 | the conditions stated in this License. 129 | 130 | 5. Submission of Contributions. Unless You explicitly state otherwise, 131 | any Contribution intentionally submitted for inclusion in the Work 132 | by You to the Licensor shall be under the terms and conditions of 133 | this License, without any additional terms or conditions. 
134 | Notwithstanding the above, nothing herein shall supersede or modify 135 | the terms of any separate license agreement you may have executed 136 | with Licensor regarding such Contributions. 137 | 138 | 6. Trademarks. This License does not grant permission to use the trade 139 | names, trademarks, service marks, or product names of the Licensor, 140 | except as required for reasonable and customary use in describing the 141 | origin of the Work and reproducing the content of the NOTICE file. 142 | 143 | 7. Disclaimer of Warranty. Unless required by applicable law or 144 | agreed to in writing, Licensor provides the Work (and each 145 | Contributor provides its Contributions) on an "AS IS" BASIS, 146 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or 147 | implied, including, without limitation, any warranties or conditions 148 | of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A 149 | PARTICULAR PURPOSE. You are solely responsible for determining the 150 | appropriateness of using or redistributing the Work and assume any 151 | risks associated with Your exercise of permissions under this License. 152 | 153 | 8. Limitation of Liability. In no event and under no legal theory, 154 | whether in tort (including negligence), contract, or otherwise, 155 | unless required by applicable law (such as deliberate and grossly 156 | negligent acts) or agreed to in writing, shall any Contributor be 157 | liable to You for damages, including any direct, indirect, special, 158 | incidental, or consequential damages of any character arising as a 159 | result of this License or out of the use or inability to use the 160 | Work (including but not limited to damages for loss of goodwill, 161 | work stoppage, computer failure or malfunction, or any and all 162 | other commercial damages or losses), even if such Contributor 163 | has been advised of the possibility of such damages. 164 | 165 | 9. Accepting Warranty or Additional Liability. While redistributing 166 | the Work or Derivative Works thereof, You may choose to offer, 167 | and charge a fee for, acceptance of support, warranty, indemnity, 168 | or other liability obligations and/or rights consistent with this 169 | License. However, in accepting such obligations, You may act only 170 | on Your own behalf and on Your sole responsibility, not on behalf 171 | of any other Contributor, and only if You agree to indemnify, 172 | defend, and hold each Contributor harmless for any liability 173 | incurred by, or claims asserted against, such Contributor by reason 174 | of your accepting any such warranty or additional liability. 175 | 176 | END OF TERMS AND CONDITIONS 177 | 178 | APPENDIX: How to apply the Apache License to your work. 179 | 180 | To apply the Apache License to your work, attach the following 181 | boilerplate notice, with the fields enclosed by brackets "[]" 182 | replaced with your own identifying information. (Don't include 183 | the brackets!) The text should be enclosed in the appropriate 184 | comment syntax for the file format. We also recommend that a 185 | file or class name and description of purpose be included on the 186 | same "printed page" as the copyright notice for easier 187 | identification within third-party archives. 188 | 189 | Copyright [yyyy] [name of copyright owner] 190 | 191 | Licensed under the Apache License, Version 2.0 (the "License"); 192 | you may not use this file except in compliance with the License. 
193 | You may obtain a copy of the License at 194 | 195 | http://www.apache.org/licenses/LICENSE-2.0 196 | 197 | Unless required by applicable law or agreed to in writing, software 198 | distributed under the License is distributed on an "AS IS" BASIS, 199 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 200 | See the License for the specific language governing permissions and 201 | limitations under the License. 202 | --------------------------------------------------------------------------------