├── .github ├── validate-examples └── workflows │ └── ci.yaml ├── .gitignore ├── CHANGELOG.md ├── CNAME ├── LICENSE ├── README.md ├── _config.yml ├── _layouts └── default.html ├── datacontract.init.yaml ├── datacontract.schema.json ├── definition.schema.json ├── diagrams ├── automation.drawio ├── datacontract.drawio └── favicon.drawio ├── examples ├── covid-cases │ ├── datacontract.html │ └── datacontract.yaml ├── datacontract.html ├── generate-catalog ├── index.html ├── muellimperium │ ├── data.csv │ ├── datacontract.html │ └── datacontract.yaml ├── orders-latest-nested │ ├── datacontract.html │ └── datacontract.yaml └── orders-latest │ ├── datacontract.html │ └── datacontract.yaml ├── gen-openapi-yaml ├── images ├── categories.png ├── datacontract-logo.png ├── datacontract-preview.png ├── datacontract.png ├── favicon.png └── supported-by-innoq--petrol-apricot.svg ├── versions ├── 0.9.0 │ ├── README.md │ ├── datacontract.init.yaml │ └── datacontract.schema.json ├── 0.9.1 │ ├── README.md │ ├── datacontract.init.yaml │ └── datacontract.schema.json ├── 0.9.2 │ ├── README.md │ ├── datacontract.init.yaml │ └── datacontract.schema.json └── 0.9.3 │ ├── README.md │ ├── datacontract.init.yaml │ ├── datacontract.schema.json │ └── definition.schema.json └── workshop.md /.github/validate-examples: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | set -ex 4 | 5 | #function datacontract() { 6 | # docker run --rm -v "${PWD}:/home/datacontract" --platform linux/amd64 datacontract/cli:latest "$@" 7 | #} 8 | 9 | datacontract --version 10 | 11 | SCHEMA=datacontract.schema.json 12 | 13 | awk '/^```yaml$/{flag=1; next} /^```$/{print ""; flag=0; exit} flag' README.md > datacontract-from-readme.yaml 14 | datacontract lint datacontract-from-readme.yaml --schema $SCHEMA 15 | datacontract test --examples datacontract-from-readme.yaml --schema $SCHEMA 16 | # Compare with example? 
17 | 18 | datacontract lint examples/orders-latest/datacontract.yaml --schema $SCHEMA 19 | datacontract test --examples examples/orders-latest/datacontract.yaml --schema $SCHEMA 20 | 21 | datacontract lint examples/orders-latest-nested/datacontract.yaml --schema $SCHEMA 22 | datacontract test --examples examples/orders-latest-nested/datacontract.yaml --schema $SCHEMA || true # examples are not nested 23 | 24 | datacontract lint examples/covid-cases/datacontract.yaml --schema $SCHEMA 25 | datacontract test --examples examples/covid-cases/datacontract.yaml --schema $SCHEMA || true 26 | 27 | -------------------------------------------------------------------------------- /.github/workflows/ci.yaml: -------------------------------------------------------------------------------- 1 | on: 2 | push: 3 | pull_request: 4 | workflow_call: 5 | 6 | name: CI 7 | jobs: 8 | test: 9 | if: false # skip as the example structure has changed with v1.1.0 10 | runs-on: ubuntu-latest 11 | steps: 12 | - uses: actions/checkout@v4 13 | - name: Set up Python 14 | uses: actions/setup-python@v5 15 | with: 16 | python-version: 3.11 17 | - name: Install dependencies 18 | run: | 19 | python -m pip install --upgrade pip 20 | pip install datacontract-cli[all] 21 | datacontract --version 22 | - name: Validate examples 23 | run: .github/validate-examples 24 | -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | .idea/ 2 | *.bkp 3 | datacontract.schema.openapi-format.* 4 | .soda/ 5 | datacontract-from-readme.yaml 6 | .duckdb/ 7 | -------------------------------------------------------------------------------- /CHANGELOG.md: -------------------------------------------------------------------------------- 1 | # Changelog 2 | 3 | All notable changes to this project will be documented in this file. 
4 | 5 | The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/), 6 | and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html). 7 | 8 | ## [Unreleased] 9 | 10 | ## [1.1.0] - 2024-10-30 11 | 12 | ### Added 13 | - Data quality on model and field level ([#55](https://github.com/datacontract/datacontract-specification/issues/55)) 14 | - Lineage support ([#90](https://github.com/datacontract/datacontract-specification/issues/90)) 15 | - Field and definition `examples` as array of any type, instead of `example` as a single value ([#29](https://github.com/datacontract/datacontract-specification/issues/29)) 16 | - Support for server-specific data types as config map ([#63](https://github.com/datacontract/datacontract-specification/issues/63)) 17 | - AWS Glue Catalog server support 18 | - sftp server support 19 | - info.status field 20 | - oracle server support 21 | - field.title attribute 22 | - model.title attribute 23 | - AWS Kinesis Data Streams server support 24 | - field.links attribute 25 | - Trino support 26 | - Field `type: map` support with properties `keys` and `values` 27 | - Definitions: `fields`, for type `object`, `record`, and `struct` 28 | - Field `field.primaryKey` (Replaces `field.primary`) 29 | - Field `model.primaryKey` to describe a composite primary key 30 | - Add Redshift server properties `clusterIdentifier`, `endpoint`, `host` and `port`. 31 | 32 | ### Removed 33 | 34 | - `definitions.domain` removed (use a hierarchical structure instead) 35 | - `definitions.name` removed (use a hierarchical structure instead) 36 | - `quality` on top-level removed 37 | - `examples` on top-level removed 38 | - `schema` removed in favor of encoding any physical schema configuration in the `model` using the `config` map at the field level and supporting import/export ([#21](https://github.com/datacontract/datacontract-specification/issues/21)). 
39 | 40 | ### Deprecated 41 | 42 | - `field.primary` (use `field.primaryKey` instead) 43 | 44 | 45 | ## [0.9.3] - 2024-03-06 46 | 47 | ### Added 48 | 49 | - Service levels as a top level `servicelevels` element 50 | - pubsub server support 51 | - primary key and relationship support via `field.primary` and `field.references` attributes 52 | - databricks server support improved 53 | 54 | ## [0.9.2] - 2024-01-04 55 | 56 | ### Added 57 | 58 | - Format and validation attributes to fields in models and definitions 59 | - Postgres support 60 | - Databricks support 61 | 62 | ## [0.9.1] - 2023-11-19 63 | 64 | ### Added 65 | 66 | - A logical data model (#13), mainly to simplify editor support with a defined schema, easier to detect breaking changes, and better Databricks support. 67 | - Definitions (#14) for reusable semantic definitions within one data contract or across data contracts. 68 | 69 | ### Removed 70 | 71 | - Property `info.dataProduct` as data products should define which data contracts they implement. 72 | - Property `info.outputPort` as data products should define which data contracts they implement. 73 | 74 | Those removals are not considered as breaking changes, as these attributes are now treated as specification extensions. 75 | 76 | ## [0.9.0] - 2023-09-12 77 | 78 | First public release. 
79 | -------------------------------------------------------------------------------- /CNAME: -------------------------------------------------------------------------------- 1 | datacontract.com -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2023 Data Mesh Architecture 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /_config.yml: -------------------------------------------------------------------------------- 1 | plugins: 2 | - jekyll-sitemap 3 | name: Data Contract Specification 4 | title: null 5 | description: Data contracts bring data providers and data consumers together. 
6 | -------------------------------------------------------------------------------- /_layouts/default.html: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | {% seo %} 15 | 16 | 26 | 27 | 28 |
29 | {% if site.title and site.title != page.title %} 30 |

{{ site.title }}

31 | {% endif %} 32 | 33 | {{ content }} 34 | 35 | {% if site.github.private != true and site.github.license %} 36 | 39 | {% endif %} 40 |
41 | 53 | 54 | 55 | {% if site.google_analytics %} 56 | 64 | {% endif %} 65 | 66 | 67 | 68 | 69 | 70 | -------------------------------------------------------------------------------- /datacontract.init.yaml: -------------------------------------------------------------------------------- 1 | dataContractSpecification: 1.1.0 2 | id: my-data-contract-id 3 | info: 4 | title: My Data Contract 5 | version: 0.0.1 6 | # description: 7 | # owner: 8 | # contact: 9 | # name: 10 | # url: 11 | # email: 12 | 13 | 14 | ### servers 15 | 16 | #servers: 17 | # production: 18 | # type: s3 19 | # location: s3:// 20 | # format: parquet 21 | # delimiter: new_line 22 | 23 | ### terms 24 | 25 | #terms: 26 | # usage: 27 | # limitations: 28 | # billing: 29 | # noticePeriod: 30 | 31 | 32 | ### models 33 | 34 | # models: 35 | # my_model: 36 | # description: 37 | # type: 38 | # fields: 39 | # my_field: 40 | # type: 41 | # description: 42 | 43 | 44 | ### definitions 45 | 46 | # definitions: 47 | # my_field: 48 | # domain: 49 | # name: 50 | # title: 51 | # type: 52 | # description: 53 | # example: 54 | # pii: 55 | # classification: 56 | 57 | 58 | ### servicelevels 59 | 60 | #servicelevels: 61 | # availability: 62 | # description: The server is available during support hours 63 | # percentage: 99.9% 64 | # retention: 65 | # description: Data is retained for one year because! 66 | # period: P1Y 67 | # unlimited: false 68 | # latency: 69 | # description: Data is available within 25 hours after the order was placed 70 | # threshold: 25h 71 | # sourceTimestampField: orders.order_timestamp 72 | # processedTimestampField: orders.processed_timestamp 73 | # freshness: 74 | # description: The age of the youngest row in a table. 
75 | # threshold: 25h 76 | # timestampField: orders.order_timestamp 77 | # frequency: 78 | # description: Data is delivered once a day 79 | # type: batch # or streaming 80 | # interval: daily # for batch, either or cron 81 | # cron: 0 0 * * * # for batch, either or interval 82 | # support: 83 | # description: The data is available during typical business hours at headquarters 84 | # time: 9am to 5pm in EST on business days 85 | # responseTime: 1h 86 | # backup: 87 | # description: Data is backed up once a week, every Sunday at 0:00 UTC. 88 | # interval: weekly 89 | # cron: 0 0 * * 0 90 | # recoveryTime: 24 hours 91 | # recoveryPoint: 1 week 92 | -------------------------------------------------------------------------------- /definition.schema.json: -------------------------------------------------------------------------------- 1 | { 2 | "$schema": "http://json-schema.org/draft-07/schema#", 3 | "type": "object", 4 | "description": "Clear and concise explanations of syntax, semantic, and classification of business objects in a given domain.", 5 | "properties": { 6 | "id": { 7 | "type": "string", 8 | "description": "A unique identifier for this definition. Encode the domain into the ID, separated by slashes.", 9 | "examples": [ 10 | "checkout/order_id" 11 | ] 12 | }, 13 | "title": { 14 | "type": "string", 15 | "description": "The business name of this definition." 16 | }, 17 | "description": { 18 | "type": "string", 19 | "description": "Clear and concise explanations related to the domain." 20 | }, 21 | "type": { 22 | "type": "string", 23 | "description": "The logical data type." 24 | }, 25 | "minLength": { 26 | "type": "integer", 27 | "description": "A value must be greater than or equal to this value. Applies only to string types." 28 | }, 29 | "maxLength": { 30 | "type": "integer", 31 | "description": "A value must be less than or equal to this value. Applies only to string types." 
32 | }, 33 | "format": { 34 | "type": "string", 35 | "description": "Specific format requirements for the value (e.g., 'email', 'uri', 'uuid')." 36 | }, 37 | "precision": { 38 | "type": "integer", 39 | "examples": [ 40 | 38 41 | ], 42 | "description": "The maximum number of digits in a number. Only applies to numeric values. Defaults to 38." 43 | }, 44 | "scale": { 45 | "type": "integer", 46 | "examples": [ 47 | 0 48 | ], 49 | "description": "The maximum number of decimal places in a number. Only applies to numeric values. Defaults to 0." 50 | }, 51 | "pattern": { 52 | "type": "string", 53 | "description": "A regular expression pattern the value must match. Applies only to string types." 54 | }, 55 | "example": { 56 | "type": "string", 57 | "description": "An example value for this field.", 58 | "deprecationMessage": "Use the examples field instead." 59 | }, 60 | "examples": { 61 | "type": "array", 62 | "description": "Example values for this field." 63 | }, 64 | "pii": { 65 | "type": "boolean", 66 | "description": "Indicates if the field contains Personally Identifiable Information (PII)." 67 | }, 68 | "classification": { 69 | "type": "string", 70 | "description": "The data class defining the sensitivity level for this field." 71 | }, 72 | "tags": { 73 | "type": "array", 74 | "items": { 75 | "type": "string" 76 | }, 77 | "description": "Custom metadata to provide additional context." 
78 | }, 79 | "links": { 80 | "type": "object", 81 | "description": "Links to external resources.", 82 | "minProperties": 1, 83 | "propertyNames": { 84 | "pattern": "^[a-zA-Z0-9_-]+$" 85 | }, 86 | "additionalProperties": { 87 | "type": "string", 88 | "title": "Link", 89 | "description": "A URL to an external resource.", 90 | "format": "uri", 91 | "examples": [ 92 | "https://example.com" 93 | ] 94 | } 95 | } 96 | }, 97 | "required": [ 98 | "type" 99 | ] 100 | } 101 | -------------------------------------------------------------------------------- /diagrams/favicon.drawio: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /examples/covid-cases/datacontract.yaml: -------------------------------------------------------------------------------- 1 | dataContractSpecification: 0.9.3 2 | id: covid_cases 3 | info: 4 | title: COVID-19 cases 5 | description: Johns Hopkins University Consolidated data on COVID-19 cases, sourced from Enigma 6 | version: "0.0.1" 7 | links: 8 | blog: https://aws.amazon.com/blogs/big-data/a-public-data-lake-for-analysis-of-covid-19-data/ 9 | data-explorer: https://dj2taa9i652rf.cloudfront.net/ 10 | data: https://covid19-lake.s3.us-east-2.amazonaws.com/enigma-jhu/json/part-00000-adec1cd2-96df-4c6b-a5f2-780f092951ba-c000.json 11 | servers: 12 | s3-json: 13 | type: s3 14 | location: s3://covid19-lake/enigma-jhu/json/*.json 15 | format: json 16 | delimiter: new_line 17 | models: 18 | covid_cases: 19 | description: the number of confirmed covid cases reported for a specified region, with location and county/province/country information. 
20 | fields: 21 | fips: 22 | type: string 23 | description: state and county two digits code 24 | admin2: 25 | type: string 26 | description: county name 27 | province_state: 28 | type: string 29 | description: province name or state name 30 | country_region: 31 | type: string 32 | description: country name or region name 33 | last_update: 34 | type: timestamp_ntz 35 | description: last update timestamp 36 | latitude: 37 | type: double 38 | description: location (latitude) 39 | longitude: 40 | type: double 41 | description: location (longitude) 42 | confirmed: 43 | type: int 44 | description: number of confirmed cases 45 | combined_key: 46 | type: string 47 | description: county name+state name+country name 48 | quality: 49 | type: SodaCL 50 | specification: 51 | checks for covid_cases: 52 | - freshness(last_update::datetime) < 5000d # dataset is not updated anymore 53 | - row_count > 1000 54 | -------------------------------------------------------------------------------- /examples/generate-catalog: -------------------------------------------------------------------------------- 1 | datacontract catalog --files "**/*.yaml" --output "." 
2 | -------------------------------------------------------------------------------- /examples/muellimperium/data.csv: -------------------------------------------------------------------------------- 1 | Pluto,residual_waste,2021-01-09 2 | Pluto,bio_waste,2021-01-02 3 | Pluto,paper,2021-01-11 4 | Pluto,plastic,2021-01-12 5 | Pluto,bulky_waste,2021-02-04 6 | Earth,residual_waste,2021-01-14 7 | Earth,bio_waste,2021-01-08 8 | Earth,paper,2021-01-12 9 | Earth,plastic,2021-01-27 10 | Earth,bulky_waste,2021-02-03 11 | -------------------------------------------------------------------------------- /examples/muellimperium/datacontract.yaml: -------------------------------------------------------------------------------- 1 | dataContractSpecification: 0.9.3 2 | id: muellimperium-exchange-format 3 | info: 4 | title: Muellimperium Exchange Format 5 | version: 0.0.1 6 | description: | 7 | The Muellimperium Exchange Format is a data contract for exchanging data between the Muellimperium and its partners. 8 | owner: Emperor of the Muellimperium 9 | contact: 10 | name: The Emperor 11 | email: the-emperor@muellimperium.com 12 | servers: 13 | exchange: 14 | type: local 15 | path: data.csv 16 | format: csv 17 | models: 18 | garbage_collection: 19 | type: table 20 | fields: 21 | location: 22 | type: text 23 | required: true 24 | description: The location where the garbage is collected. 25 | garbage_type: 26 | type: text 27 | required: true 28 | description: The type of garbage that is collected. 29 | enum: 30 | - paper 31 | - plastic 32 | - residual_waste 33 | - bio_waste 34 | - bulky_waste 35 | - hazardous_waste 36 | collection_date: 37 | type: date 38 | required: true 39 | description: The date when the garbage is collected. 
40 | examples: 41 | - model: garbage_collection 42 | type: json 43 | data: 44 | - location: "Musterstadt" 45 | garbage_type: "paper" 46 | collection_date: "2022-01-01" 47 | - location: "Musterstadt" 48 | garbage_type: "plastic" 49 | collection_date: "2022-01-02" 50 | - location: "Musterstadt" 51 | garbage_type: "residual_waste" 52 | collection_date: "2022-01-03" 53 | -------------------------------------------------------------------------------- /examples/orders-latest-nested/datacontract.yaml: -------------------------------------------------------------------------------- 1 | dataContractSpecification: 0.9.3 2 | id: urn:datacontract:checkout:orders-latest-nested 3 | info: 4 | title: Orders Latest (Nested) 5 | version: 1.0.0 6 | description: | 7 | Successful customer orders in the webshop. 8 | All orders since 2020-01-01. 9 | Orders with their line items are in their current state (no history included). 10 | owner: Checkout Team 11 | contact: 12 | name: John Doe (Data Product Owner) 13 | url: https://teams.microsoft.com/l/channel/example/checkout 14 | terms: 15 | usage: | 16 | Data can be used for reports, analytics and machine learning use cases. 17 | Order may be linked and joined by other tables 18 | limitations: | 19 | Not suitable for real-time use cases. 20 | Data may not be used to identify individual customers. 21 | Max data processing per day: 10 TiB 22 | billing: 5000 USD per month 23 | noticePeriod: P3M 24 | models: 25 | orders: 26 | description: One record per order. Includes cancelled and deleted orders. 27 | type: table 28 | fields: 29 | order_id: 30 | $ref: '#/definitions/order_id' 31 | required: true 32 | unique: true 33 | primary: true 34 | order_timestamp: 35 | description: The business timestamp in UTC when the order was successfully registered in the source system and the payment was successful. 36 | type: timestamp 37 | required: true 38 | order_total: 39 | description: Total amount the smallest monetary unit (e.g., cents). 
40 | type: long 41 | required: true 42 | customer_id: 43 | description: Unique identifier for the customer. 44 | type: text 45 | minLength: 10 46 | maxLength: 20 47 | customer_email_address: 48 | description: The email address, as entered by the customer. The email address was not verified. 49 | type: text 50 | format: email 51 | required: true 52 | address: 53 | type: object 54 | description: The delivery address of the customer. 55 | fields: 56 | street: 57 | description: The street name and house number. 58 | type: text 59 | city: 60 | description: The city name. 61 | type: text 62 | additional_lines: 63 | description: Additional address lines, such as floor, apartment, or company name. 64 | type: array 65 | items: 66 | type: text 67 | description: Additional line 68 | processed_timestamp: 69 | description: The timestamp when the record was processed by the data platform. 70 | type: timestamp 71 | required: true 72 | line_items: 73 | description: A single article that is part of an order. 74 | type: table 75 | fields: 76 | lines_item_id: 77 | type: text 78 | description: Primary key of the lines_item_id table 79 | required: true 80 | unique: true 81 | primary: true 82 | order_id: 83 | $ref: '#/definitions/order_id' 84 | references: orders.order_id 85 | sku: 86 | description: The purchased article number 87 | $ref: '#/definitions/sku' 88 | definitions: 89 | order_id: 90 | domain: checkout 91 | name: order_id 92 | title: Order ID 93 | type: text 94 | format: uuid 95 | description: An internal ID that identifies an order in the online shop. 96 | example: 243c25e5-a081-43a9-aeab-6d5d5b6cb5e2 97 | pii: true 98 | classification: restricted 99 | sku: 100 | domain: inventory 101 | name: sku 102 | title: Stock Keeping Unit 103 | type: text 104 | pattern: ^[A-Za-z0-9]{8,14}$ 105 | example: "96385074" 106 | description: | 107 | A Stock Keeping Unit (SKU) is an internal unique identifier for an article. 
108 | It is typically associated with an article's barcode, such as the EAN/GTIN. 109 | -------------------------------------------------------------------------------- /examples/orders-latest/datacontract.yaml: -------------------------------------------------------------------------------- 1 | dataContractSpecification: 1.1.0 2 | id: urn:datacontract:checkout:orders-latest 3 | info: 4 | title: Orders Latest 5 | version: 2.0.0 6 | description: | 7 | Successful customer orders in the webshop. 8 | All orders since 2020-01-01. 9 | Orders with their line items are in their current state (no history included). 10 | owner: Checkout Team 11 | contact: 12 | name: John Doe (Data Product Owner) 13 | url: https://teams.microsoft.com/l/channel/example/checkout 14 | servers: 15 | production: 16 | type: s3 17 | environment: prod 18 | location: s3://datacontract-example-orders-latest/v2/{model}/*.json 19 | format: json 20 | delimiter: new_line 21 | description: "One folder per model. One file per day." 22 | roles: 23 | - name: analyst_us 24 | description: Access to the data for US region 25 | - name: analyst_cn 26 | description: Access to the data for China region 27 | terms: 28 | usage: | 29 | Data can be used for reports, analytics and machine learning use cases. 30 | Order may be linked and joined by other tables 31 | limitations: | 32 | Not suitable for real-time use cases. 33 | Data may not be used to identify individual customers. 34 | Max data processing per day: 10 TiB 35 | policies: 36 | - name: privacy-policy 37 | url: https://example.com/privacy-policy 38 | - name: license 39 | description: External data is licensed under agreement 1234. 40 | url: https://example.com/license/1234 41 | billing: 5000 USD per month 42 | noticePeriod: P3M 43 | models: 44 | orders: 45 | description: One record per order. Includes cancelled and deleted orders. 
46 | type: table 47 | fields: 48 | order_id: 49 | $ref: '#/definitions/order_id' 50 | required: true 51 | unique: true 52 | primaryKey: true 53 | order_timestamp: 54 | description: The business timestamp in UTC when the order was successfully registered in the source system and the payment was successful. 55 | type: timestamp 56 | required: true 57 | examples: 58 | - "2024-09-09T08:30:00Z" 59 | tags: ["business-timestamp"] 60 | order_total: 61 | description: Total amount the smallest monetary unit (e.g., cents). 62 | type: long 63 | required: true 64 | examples: 65 | - 9999 66 | quality: 67 | - type: sql 68 | description: 95% of all order total values are expected to be between 10 and 499 EUR. 69 | query: | 70 | SELECT quantile_cont(order_total, 0.95) AS percentile_95 71 | FROM orders 72 | mustBeBetween: [1000, 49900] 73 | customer_id: 74 | description: Unique identifier for the customer. 75 | type: text 76 | minLength: 10 77 | maxLength: 20 78 | customer_email_address: 79 | description: The email address, as entered by the customer. 80 | type: text 81 | format: email 82 | required: true 83 | pii: true 84 | classification: sensitive 85 | quality: 86 | - type: text 87 | description: The email address is not verified and may be invalid. 88 | lineage: 89 | inputFields: 90 | - namespace: com.example.service.checkout 91 | name: checkout_db.orders 92 | field: email_address 93 | processed_timestamp: 94 | description: The timestamp when the record was processed by the data platform. 
95 | type: timestamp 96 | required: true 97 | config: 98 | jsonType: string 99 | jsonFormat: date-time 100 | quality: 101 | - type: sql 102 | description: The maximum duration between two orders should be less than 3600 seconds 103 | query: | 104 | SELECT MAX(duration) AS max_duration FROM (SELECT EXTRACT(EPOCH FROM (order_timestamp - LAG(order_timestamp) 105 | OVER (ORDER BY order_timestamp))) AS duration FROM orders) 106 | mustBeLessThan: 3600 107 | - type: sql 108 | description: Row Count 109 | query: | 110 | SELECT count(*) as row_count 111 | FROM orders 112 | mustBeGreaterThan: 5 113 | examples: 114 | - | 115 | order_id,order_timestamp,order_total,customer_id,customer_email_address,processed_timestamp 116 | "1001","2030-09-09T08:30:00Z",2500,"1000000001","mary.taylor82@example.com","2030-09-09T08:31:00Z" 117 | "1002","2030-09-08T15:45:00Z",1800,"1000000002","michael.miller83@example.com","2030-09-09T08:31:00Z" 118 | "1003","2030-09-07T12:15:00Z",3200,"1000000003","michael.smith5@example.com","2030-09-09T08:31:00Z" 119 | "1004","2030-09-06T19:20:00Z",1500,"1000000004","elizabeth.moore80@example.com","2030-09-09T08:31:00Z" 120 | "1005","2030-09-05T10:10:00Z",4200,"1000000004","elizabeth.moore80@example.com","2030-09-09T08:31:00Z" 121 | "1006","2030-09-04T14:55:00Z",2800,"1000000005","john.davis28@example.com","2030-09-09T08:31:00Z" 122 | "1007","2030-09-03T21:05:00Z",1900,"1000000006","linda.brown67@example.com","2030-09-09T08:31:00Z" 123 | "1008","2030-09-02T17:40:00Z",3600,"1000000007","patricia.smith40@example.com","2030-09-09T08:31:00Z" 124 | "1009","2030-09-01T09:25:00Z",3100,"1000000008","linda.wilson43@example.com","2030-09-09T08:31:00Z" 125 | "1010","2030-08-31T22:50:00Z",2700,"1000000009","mary.smith98@example.com","2030-09-09T08:31:00Z" 126 | line_items: 127 | description: A single article that is part of an order. 
128 | type: table 129 | fields: 130 | line_item_id: 131 | type: text 132 | description: Primary key of the lines_item_id table 133 | required: true 134 | order_id: 135 | $ref: '#/definitions/order_id' 136 | references: orders.order_id 137 | sku: 138 | description: The purchased article number 139 | $ref: '#/definitions/sku' 140 | primaryKey: ["order_id", "line_item_id"] 141 | examples: 142 | - | 143 | line_item_id,order_id,sku 144 | "LI-1","1001","5901234123457" 145 | "LI-2","1001","4001234567890" 146 | "LI-3","1002","5901234123457" 147 | "LI-4","1002","2001234567893" 148 | "LI-5","1003","4001234567890" 149 | "LI-6","1003","5001234567892" 150 | "LI-7","1004","5901234123457" 151 | "LI-8","1005","2001234567893" 152 | "LI-9","1005","5001234567892" 153 | "LI-10","1005","6001234567891" 154 | definitions: 155 | order_id: 156 | title: Order ID 157 | type: text 158 | format: uuid 159 | description: An internal ID that identifies an order in the online shop. 160 | examples: 161 | - 243c25e5-a081-43a9-aeab-6d5d5b6cb5e2 162 | pii: true 163 | classification: restricted 164 | tags: 165 | - orders 166 | sku: 167 | title: Stock Keeping Unit 168 | type: text 169 | pattern: ^[A-Za-z0-9]{8,14}$ 170 | examples: 171 | - "96385074" 172 | description: | 173 | A Stock Keeping Unit (SKU) is an internal unique identifier for an article. 174 | It is typically associated with an article's barcode, such as the EAN/GTIN. 
175 | links: 176 | wikipedia: https://en.wikipedia.org/wiki/Stock_keeping_unit 177 | tags: 178 | - inventory 179 | servicelevels: 180 | availability: 181 | description: The server is available during support hours 182 | percentage: 99.9% 183 | retention: 184 | description: Data is retained for one year 185 | period: P1Y 186 | unlimited: false 187 | latency: 188 | description: Data is available within 25 hours after the order was placed 189 | threshold: 25h 190 | sourceTimestampField: orders.order_timestamp 191 | processedTimestampField: orders.processed_timestamp 192 | freshness: 193 | description: The age of the youngest row in a table. 194 | threshold: 25h 195 | timestampField: orders.order_timestamp 196 | frequency: 197 | description: Data is delivered once a day 198 | type: batch # or streaming 199 | interval: daily # for batch, either or cron 200 | cron: 0 0 * * * # for batch, either or interval 201 | support: 202 | description: The data is available during typical business hours at headquarters 203 | time: 9am to 5pm in EST on business days 204 | responseTime: 1h 205 | backup: 206 | description: Data is backed up once a week, every Sunday at 0:00 UTC. 
207 | interval: weekly 208 | cron: 0 0 * * 0 209 | recoveryTime: 24 hours 210 | recoveryPoint: 1 week 211 | tags: 212 | - checkout 213 | - orders 214 | - s3 215 | links: 216 | datacontractCli: https://cli.datacontract.com -------------------------------------------------------------------------------- /gen-openapi-yaml: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | # INSTALL BEFORE 4 | # npm install -g @openapi-contrib/json-schema-to-openapi-schema 5 | # brew install yq 6 | 7 | json-schema-to-openapi-schema convert datacontract.schema.json > datacontract.schema.openapi-format.json 8 | yq --input-format=json --output-format=yaml --prettyPrint datacontract.schema.openapi-format.json > datacontract.schema.openapi-format.yaml 9 | echo "Compare 'datacontract.schema.openapi-format.yaml' with openapi.yaml of the Data Mesh Manager" 10 | echo "Prepend 'DataContract:\\n' and match the indentation correctly. Then, compare in IntelliJ" 11 | -------------------------------------------------------------------------------- /images/categories.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/datacontract/datacontract-specification/e7a0259a002b5b82ba3bb8a02323cc47a20d374c/images/categories.png -------------------------------------------------------------------------------- /images/datacontract-logo.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/datacontract/datacontract-specification/e7a0259a002b5b82ba3bb8a02323cc47a20d374c/images/datacontract-logo.png -------------------------------------------------------------------------------- /images/datacontract-preview.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/datacontract/datacontract-specification/e7a0259a002b5b82ba3bb8a02323cc47a20d374c/images/datacontract-preview.png -------------------------------------------------------------------------------- /images/datacontract.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/datacontract/datacontract-specification/e7a0259a002b5b82ba3bb8a02323cc47a20d374c/images/datacontract.png -------------------------------------------------------------------------------- /images/favicon.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/datacontract/datacontract-specification/e7a0259a002b5b82ba3bb8a02323cc47a20d374c/images/favicon.png -------------------------------------------------------------------------------- /images/supported-by-innoq--petrol-apricot.svg: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /versions/0.9.0/README.md: -------------------------------------------------------------------------------- 1 | # Data Contract Specification 2 | 3 | ![datacontract.png](images/datacontract.png) 4 | 5 | Data contracts bring data providers and data consumers together. 6 | 7 | A _data contract_ is a document that defines the structure, format, semantics, quality, and terms of use for exchanging data between a data provider and their consumers. A data contract is implemented by a data product's output port or other data technologies. Data contracts can also be used for the input port to specify the expectations of data dependencies and verify given guarantees. 8 | 9 | The _data contract specification_ defines a YAML format to describe attributes of provided data sets. 
It is data platform neutral, yet supports well-known formats to express schemas (e.g., dbt models, JSON Schema, Protobuf, SQL DDL) and quality tests (e.g., SodaCL, SQL queries) to avoid unnecessary abstractions. The data contract specification is an open initiative to define a common data contract format. Think of an [OpenAPI specification](https://www.openapis.org/), but for data sets. 10 | 11 | Data contracts come into play when data is exchanged between different teams or organizational units, such as in a [data mesh architecture](https://www.datamesh-architecture.com/). 12 | First, and foremost, data contracts are a communication tool to express a common understanding of how data should be structured and interpreted. They make semantic and quality expectations explicit. They are often created in [workshops](/workshop). Later in development and production, they also serve as the basis for code generation, testing, schema validations, quality checks, monitoring, access control, and computational governance policies. 13 | 14 | _Note: The term "data contract" refers to a specification that is usually owned by the data provider and thus does not align with a "contract" in a legal sense as a mutual agreement between two parties. The term "contract" may be somewhat misleading, but it is how it is used in practice. The mutual agreement between one data provider and one data consumer is the "data usage agreement" that refers to a data contract. Data usage agreements have a defined lifecycle, start/end date, and help the data provider to track who accesses their data and for which purposes._ 15 | 16 | The specification is inspired by [AIDA User Group's Open Data Contract Standard](https://github.com/AIDAUserGroup/open-data-contract-standard), (formerly [PayPal's Data Contract Template](https://github.com/paypal/data-contract-template/blob/main/docs/README.md)) and Data Mesh Manager's [Data Contract API](https://www.datamesh-manager.com). 
17 | It follows [OpenAPI](https://www.openapis.org/) and [AsyncAPI](https://www.asyncapi.com/) conventions. 18 | 19 | Version 20 | --- 21 | 22 | 0.9.0 23 | 24 | Example 25 | --- 26 | 27 | [![Open in Data Contract Studio](https://img.shields.io/badge/open%20in-Data%20Contract%20Studio-blue)](https://studio.datacontract.com/) 28 | 29 | ```yaml 30 | dataContractSpecification: 0.9.0 31 | id: urn:datacontract:checkout:orders-latest-npii 32 | info: 33 | title: Orders Latest NPII 34 | version: 1.0.0 35 | description: Successful customer orders in the webshop. All orders since 2020-01-01. Orders with their line items are in their current state (no history included). PII data is removed. 36 | owner: Checkout Team 37 | contact: 38 | name: John Doe (Data Product Owner) 39 | email: john.doe@example.com 40 | servers: 41 | production: 42 | type: BigQuery 43 | project: acme_orders_prod 44 | dataset: bigquery_orders_latest_npii_v1 45 | terms: 46 | usage: > 47 | Data can be used for reports, analytics and machine learning use cases. 48 | Order may be linked and joined by other tables 49 | limitations: > 50 | Not suitable for real-time use cases. 51 | Data may not be used to identify individual customers. 52 | Max data processing per day: 10 TiB 53 | billing: 5000 USD per month 54 | noticePeriod: P3M 55 | schema: 56 | type: dbt # the specification format: dbt, bigquery, avro, protobuf, sql, json-schema, custom 57 | specification: # expressed as string or inline yaml or via "$ref: model.yaml" 58 | version: 2 59 | description: The subset of the output port's data model that we agree to use 60 | models: 61 | - name: orders 62 | description: > 63 | One record per order. Includes cancelled and deleted orders. 
64 | columns: 65 | - name: order_id 66 | data_type: string 67 | description: Primary key of the orders table 68 | - name: order_timestamp 69 | data_type: timestamptz 70 | description: The business timestamp in UTC when the order was successfully registered in the source system and the payment was successful. 71 | - name: order_total 72 | data_type: integer 73 | description: "Total amount of the order in the smallest monetary unit (e.g., cents)." 74 | - name: line_items 75 | description: > 76 | The items that are part of an order 77 | columns: 78 | - name: lines_item_id 79 | data_type: string 80 | description: Primary key of the lines_item_id table 81 | - name: order_id 82 | data_type: string 83 | description: Foreign key to the orders table 84 | - name: sku 85 | data_type: string 86 | description: The purchased article number 87 | examples: 88 | - type: csv # csv, json, yaml, custom 89 | model: orders 90 | data: |- # expressed as string or inline yaml or via "$ref: data.csv" 91 | order_id,order_timestamp,order_total 92 | "1001","2023-09-09T08:30:00Z",2500 93 | "1002","2023-09-08T15:45:00Z",1800 94 | "1003","2023-09-07T12:15:00Z",3200 95 | "1004","2023-09-06T19:20:00Z",1500 96 | "1005","2023-09-05T10:10:00Z",4200 97 | "1006","2023-09-04T14:55:00Z",2800 98 | "1007","2023-09-03T21:05:00Z",1900 99 | "1008","2023-09-02T17:40:00Z",3600 100 | "1009","2023-09-01T09:25:00Z",3100 101 | "1010","2023-08-31T22:50:00Z",2700 102 | - type: csv 103 | model: line_items 104 | data: |- 105 | lines_item_id,order_id,sku 106 | "1","1001","5901234123457" 107 | "2","1001","4001234567890" 108 | "3","1002","5901234123457" 109 | "4","1002","2001234567893" 110 | "5","1003","4001234567890" 111 | "6","1003","5001234567892" 112 | "7","1004","5901234123457" 113 | "8","1005","2001234567893" 114 | "9","1005","5001234567892" 115 | "10","1005","6001234567891" 116 | quality: 117 | type: SodaCL # data quality check format: SodaCL, montecarlo, custom 118 | specification: # expressed as string or inline 
yaml or via "$ref: checks.yaml" 119 | checks for orders: 120 | - freshness(order_timestamp) < 24h 121 | - row_count > 500000 122 | - duplicate_count(order_id) = 0 123 | checks for line_items: 124 | - row_count > 500000 125 | ``` 126 | 127 | Schema 128 | --- 129 | 130 | [JSON Schema](https://github.com/datacontract/datacontract-specification/blob/main/datacontract.schema.json) of the Data Contract Specification. 131 | 132 | ### Data Contract Object 133 | 134 | This is the root document. 135 | 136 | It is _RECOMMENDED_ that the root document be named: `datacontract.yaml`. 137 | 138 | | Field | Type | Description | 139 | |---------------------------|------------------------------------|-------------------------------------------------------------------------------------------------------| 140 | | dataContractSpecification | `string` | REQUIRED. Specifies the Data Contract Specification being used. | 141 | | id | `string` | REQUIRED. An organization-wide unique technical identifier, such as a UUID, URN, slug, string, or number | 142 | | info | [Info Object](#info-object) | REQUIRED. Specifies the metadata of the data contract. May be used by tooling. | 143 | | servers | [Servers Object](#servers-object) | Specifies the servers of the data contract. | 144 | | terms | [Terms Object](#terms-object) | Specifies the terms and conditions of the data contract. | 145 | | schema | [Schema Object](#schema-object) | Specifies the data contract schema. The specification supports different schemas. | 146 | | examples | [Examples Object](#examples-object) | Specifies example data sets for the schema. The specification supports different example types. | 147 | | quality | [Quality Object](#quality-object) | Specifies the quality attributes and checks. The specification supports different quality check DSLs. | 148 | 149 | This object _MAY_ be extended with [Specification Extensions](#specification-extensions). 
150 | 151 | 152 | 153 | 154 | ### Info Object 155 | 156 | Metadata and life cycle information about the data contract. 157 | 158 | 159 | | Field | Type | Description | 160 | |---------|--------|------------------------------------------------------------------------------------------------------------------------------------------------------------------| 161 | | title | `string` | REQUIRED. The title of the data contract. | 162 | | version | `string` | REQUIRED. The version of the data contract document (which is distinct from the Data Contract Specification version or the Data Product implementation version). | 163 | | description | `string` | A description of the data contract. | 164 | | owner | `string` | The owner or team responsible for managing the data contract and providing the data. | 165 | | dataProduct | `string` | The identifier of the data product that contains the output port providing the data. | 166 | | outputPort | `string` | DEPRECATED. The identifier of the output port that implements the data contract. | 167 | | contact | [Contact Object](#contact-object) | Contact information for the data contract. | 168 | 169 | 170 | 171 | 172 | ### Contact Object 173 | 174 | Contact information for the data contract. 175 | 176 | | Field | Type | Description | 177 | |-------|----------|-------------------------------------------------------------------------------------------------------| 178 | | name | `string` | The identifying name of the contact person/organization. | 179 | | url | `string` | The URL pointing to the contact information. This _MUST_ be in the form of a URL. | 180 | | email | `string` | The email address of the contact person/organization. This _MUST_ be in the form of an email address. | 181 | 182 | This object _MAY_ be extended with [Specification Extensions](#specification-extensions). 183 | 184 | ### Servers Object 185 | 186 | Information about the servers. 187 | 188 | The Servers Object is a map of [Server Objects](#server-object). 
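As a minimal sketch of that map structure: each key is a freely chosen server name, and each value is a Server Object. The names `production` and `development` and all values below are illustrative assumptions, not something the specification prescribes.

```yaml
servers:
  production:              # hypothetical server name (map key)
    type: bigquery
    project: acme_orders_prod
    dataset: bigquery_orders_latest_npii_v1
  development:             # hypothetical second entry in the same map
    type: s3
    location: s3://acme-orders-dev/orders/
```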
189 | 190 | ### Server Object 191 | 192 | The fields are dependent on the defined type. 193 | 194 | | Field | Type | Description | 195 | |-------------|----------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------| 196 | | type | `string` | The type of the data product technology that implements the data contract. Well-known server types are: `bigquery`, `s3`, `redshift`, `snowflake`, `databricks`, `kafka` | 197 | | description | `string` | An optional string describing the server. | 198 | 199 | 200 | This object _MAY_ be extended with [Specification Extensions](#specification-extensions). 201 | 202 | #### BigQuery Server Object 203 | 204 | | Field | Type | Description | 205 | |---------|----------|-------------| 206 | | type | `string` | `bigquery` | 207 | | project | `string` | | 208 | | dataset | `string` | | 209 | 210 | #### S3 Server Object 211 | 212 | | Field | Type | Description | 213 | |----------|----------|--------------------------------| 214 | | type | `string` | `s3` | 215 | | location | `string` | S3 URL, starting with `s3://` | 216 | 217 | Example: 218 | 219 | ```yaml 220 | servers: 221 | production: 222 | type: s3 223 | location: s3://acme-orders-prod/orders/ 224 | ``` 225 | 226 | 227 | #### Redshift Server Object 228 | 229 | | Field | Type | Description | 230 | |----------|----------|-------------| 231 | | type | `string` | `redshift` | 232 | | account | `string` | | 233 | | database | `string` | | 234 | | schema | `string` | | 235 | 236 | #### Snowflake Server Object 237 | 238 | | Field | Type | Description | 239 | |----------|----------|-------------| 240 | | type | `string` | `snowflake` | 241 | | account | `string` | | 242 | | database | `string` | | 243 | | schema | `string` | | 244 | 245 | #### Databricks Server Object 246 | 247 | | Field | Type | Description | 248 | |----------|----------|--------------| 249 | | type | 
`string` | `databricks` | 250 | | share | `string` | | 251 | 252 | #### Kafka Server Object 253 | 254 | | Field | Type | Description | 255 | |-------|----------|-------------| 256 | | type | `string` | `kafka` | 257 | | host | `string` | | 258 | | topic | `string` | | 259 | 260 | 261 | 262 | ### Terms Object 263 | 264 | The terms and conditions of the data contract. 265 | 266 | | Field | Type | Description | 267 | |----------------------|--------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------| 268 | | usage | `string` | The usage describes the way the data is expected to be used. Can contain business and technical information. | 269 | | limitations | `string` | The limitations describe the restrictions on how the data can be used, can be technical or restrictions on what the data may not be used for. | 270 | | billing | `string` | The billing describes the pricing model for using the data, such as whether it's free, having a monthly fee, or metered pay-per-use. | 271 | | noticePeriod | `string` | The period of time that must be given by either party to terminate or modify a data usage agreement. Uses ISO-8601 period format, e.g., `P3M` for a period of three months. | 272 | 273 | 274 | ### Schema Object 275 | 276 | The schema of the data contract describes the syntax and semantics of provided data sets. 277 | As the type of the output port depends on the data platform, multiple schema specifications are supported. 278 | 279 | A schema may define a single table, a collection of tables as a dataset, a file structure, or any arbitrary structure. 280 | 281 | To avoid unnecessary abstractions, the data contract specification supports existing well-known formats. Some schema types, such as `dbt`, also support defining tests and additional metadata. 
282 | 283 | 284 | | Field | Type | Description | 285 | | ----- |----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------| 286 | | type | `string` | REQUIRED. The type of the schema.
Typical values are: `dbt`, `bigquery`, `json-schema`, `sql-ddl`, `avro`, `protobuf`, `custom` | 287 | | specification | [dbt Schema Object](#dbt-schema-object) \|
[BigQuery Schema Object](#bigquery-schema-object) \|
[JSON Schema Schema Object](#json-schema-schema-object) \|
[SQL DDL Schema Object](#sql-ddl-schema-object) \|
`string` | REQUIRED. The specification of the schema. The schema specification can be encoded as a string or as inline YAML. | 288 | 289 | 290 | #### dbt Schema Object 291 | 292 | https://docs.getdbt.com/reference/model-properties 293 | 294 | Example (inline YAML): 295 | 296 | ```yaml 297 | schema: 298 | type: dbt 299 | specification: 300 | version: 2 301 | models: 302 | - name: "My Table" 303 | description: "My description" 304 | columns: 305 | - name: "My column" 306 | data_type: text 307 | description: "My description" 308 | ``` 309 | 310 | Example (string): 311 | 312 | ```yaml 313 | schema: 314 | type: dbt 315 | specification: |- 316 | version: 2 317 | models: 318 | - name: "My Table" 319 | description: "My description" 320 | columns: 321 | - name: "My column" 322 | data_type: text 323 | description: "My description" 324 | ``` 325 | 326 | #### BigQuery Schema Object 327 | 328 | The schema structure is defined by the [Google BigQuery Table](https://cloud.google.com/bigquery/docs/reference/rest/v2/tables#resource:-table) object. You can extract such a Table object via the [tables.get](https://cloud.google.com/bigquery/docs/reference/rest/v2/tables/get) endpoint. 329 | 330 | Instead of providing a single Table object, you can also provide an array of such objects. Be aware that [tables.list](https://cloud.google.com/bigquery/docs/reference/rest/v2/tables/list) only returns a subset of the full Table object. You need to call every Table object via [tables.get](https://cloud.google.com/bigquery/docs/reference/rest/v2/tables/get) to get the full Table object, including the actual schema. 
331 | 332 | Learn more: [Google BigQuery REST Reference v2](https://cloud.google.com/bigquery/docs/reference/rest) 333 | 334 | 335 | 336 | Example: 337 | 338 | ```yaml 339 | schema: 340 | type: bigquery 341 | specification: |- 342 | { 343 | "tableReference": { 344 | "projectId": "my-project", 345 | "datasetId": "my_dataset", 346 | "tableId": "my_table" 347 | }, 348 | "description": "This is a description", 349 | "type": "TABLE", 350 | "schema": { 351 | "fields": [ 352 | { 353 | "name": "name", 354 | "type": "STRING", 355 | "mode": "NULLABLE", 356 | "description": "This is a description" 357 | } 358 | ] 359 | } 360 | } 361 | ``` 362 | 363 | #### JSON Schema Schema Object 364 | 365 | JSON Schema can be defined as JSON or rendered as YAML, following the [OpenAPI Schema Object dialect](https://spec.openapis.org/oas/v3.1.0#properties) 366 | 367 | Example (inline YAML): 368 | 369 | ```yaml 370 | schema: 371 | type: json-schema 372 | specification: 373 | orders: 374 | description: One record per order. Includes cancelled and deleted orders. 375 | type: object 376 | properties: 377 | order_id: 378 | type: string 379 | description: Primary key of the orders table 380 | order_timestamp: 381 | type: string 382 | format: date-time 383 | description: The business timestamp in UTC when the order was successfully registered in the source system and the payment was successful. 384 | order_total: 385 | type: integer 386 | description: Total amount of the order in the smallest monetary unit (e.g., cents). 
387 | line_items: 388 | type: object 389 | properties: 390 | lines_item_id: 391 | type: string 392 | description: Primary key of the lines_item_id table 393 | order_id: 394 | type: string 395 | description: Foreign key to the orders table 396 | sku: 397 | type: string 398 | description: The purchased article number 399 | ``` 400 | 401 | Example (string): 402 | 403 | ```yaml 404 | schema: 405 | type: json-schema 406 | specification: |- 407 | { 408 | "$schema": "http://json-schema.org/draft-07/schema#", 409 | "type": "object", 410 | "properties": { 411 | "orders": { 412 | "type": "object", 413 | "description": "One record per order. Includes cancelled and deleted orders.", 414 | "properties": { 415 | "order_id": { 416 | "type": "string", 417 | "description": "Primary key of the orders table" 418 | }, 419 | "order_timestamp": { 420 | "type": "string", 421 | "format": "date-time", 422 | "description": "The business timestamp in UTC when the order was successfully registered in the source system and the payment was successful." 423 | }, 424 | "order_total": { 425 | "type": "integer", 426 | "description": "Total amount of the order in the smallest monetary unit (e.g., cents)." 427 | } 428 | }, 429 | "required": ["order_id", "order_timestamp", "order_total"] 430 | }, 431 | "line_items": { 432 | "type": "object", 433 | "properties": { 434 | "lines_item_id": { 435 | "type": "string", 436 | "description": "Primary key of the lines_item_id table" 437 | }, 438 | "order_id": { 439 | "type": "string", 440 | "description": "Foreign key to the orders table" 441 | }, 442 | "sku": { 443 | "type": "string", 444 | "description": "The purchased article number" 445 | } 446 | }, 447 | "required": ["lines_item_id", "order_id", "sku"] 448 | } 449 | }, 450 | "required": ["orders", "line_items"] 451 | } 452 | ``` 453 | 454 | #### SQL DDL Schema Object 455 | 456 | Classical SQL DDLs can be used to describe the structure. 
457 | 458 | 459 | Example (string): 460 | 461 | ```yaml 462 | schema: 463 | type: sql-ddl 464 | specification: |- 465 | -- One record per order. Includes cancelled and deleted orders. 466 | CREATE TABLE orders ( 467 | order_id TEXT PRIMARY KEY, -- Primary key of the orders table 468 | order_timestamp TIMESTAMPTZ NOT NULL, -- The business timestamp in UTC when the order was successfully registered in the source system and the payment was successful. 469 | order_total INTEGER NOT NULL -- Total amount of the order in the smallest monetary unit (e.g., cents) 470 | ); 471 | 472 | -- The items that are part of an order 473 | CREATE TABLE line_items ( 474 | lines_item_id TEXT PRIMARY KEY, -- Primary key of the lines_item_id table 475 | order_id TEXT REFERENCES orders(order_id), -- Foreign key to the orders table 476 | sku TEXT NOT NULL -- The purchased article number 477 | ); 478 | 479 | ``` 480 | 481 | ### Examples Object 482 | 483 | The Examples Object is an array of [Example Objects](#example-object). 484 | 485 | ### Example Object 486 | 487 | | Field | Type | Description | 488 | |-------------|----------|-----------------------------------------------------------------------------------------------------------------------------------------| 489 | | type | `string` | The type of the example data. Well-known example types are: `csv`, `json`, `yaml`, `custom` | 490 | | description | `string` | An optional string describing the example. | 491 | | model | `string` | The reference to the model in the schema, e.g. a table name. | 492 | | data | `string` | Example data for this model.
| 493 | 494 | Example: 495 | 496 | ```yaml 497 | examples: 498 | - type: csv 499 | model: orders 500 | data: |- 501 | order_id,order_timestamp,order_total 502 | "1001","2023-09-09T08:30:00Z",2500 503 | "1002","2023-09-08T15:45:00Z",1800 504 | "1003","2023-09-07T12:15:00Z",3200 505 | "1004","2023-09-06T19:20:00Z",1500 506 | "1005","2023-09-05T10:10:00Z",4200 507 | "1006","2023-09-04T14:55:00Z",2800 508 | "1007","2023-09-03T21:05:00Z",1900 509 | "1008","2023-09-02T17:40:00Z",3600 510 | "1009","2023-09-01T09:25:00Z",3100 511 | "1010","2023-08-31T22:50:00Z",2700 512 | ``` 513 | 514 | ### Quality Object 515 | 516 | The quality object contains quality attributes and checks. 517 | 518 | | Field | Type | Description | 519 | | ----- |-------------------------------------------------------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------------------------------------------| 520 | | type | `string` | REQUIRED. The type of the schema.
Typical values are: `SodaCL`, `montecarlo`, `custom` | 521 | | specification | [SodaCL Quality Object](#sodacl-quality-object) \|
[Monte Carlo Quality Object](#monte-carlo-quality-object) \|
`string` | REQUIRED. The specification of the quality attributes. The quality specification can be encoded as a string or as inline YAML. | 522 | 523 | 524 | #### SodaCL Quality Object 525 | 526 | Quality attributes in [Soda Checks Language](https://docs.soda.io/soda-cl/soda-cl-overview.html). 527 | 528 | The `specification` represents the content of a `checks.yml` file. 529 | 530 | Example (inline): 531 | 532 | ```yaml 533 | quality: 534 | type: SodaCL # data quality check format: SodaCL, montecarlo, dbt-tests, custom 535 | specification: # expressed as string or inline yaml or via "$ref: checks.yaml" 536 | checks for orders: 537 | - row_count > 0 538 | - duplicate_count(order_id) = 0 539 | checks for line_items: 540 | - row_count > 0 541 | ``` 542 | 543 | Example (string): 544 | 545 | ```yaml 546 | quality: 547 | type: SodaCL 548 | specification: |- 549 | checks for search_queries: 550 | - freshness(search_timestamp) < 1d 551 | - row_count > 100000 552 | - missing_count(search_query) = 0 553 | ``` 554 | 555 | #### Monte Carlo Quality Object 556 | 557 | Quality attributes defined as Monte Carlo's [Monitors as Code](https://docs.getmontecarlo.com/docs/monitors-as-code). 558 | 559 | The `specification` represents the content of a `montecarlo.yml` file. 560 | 561 | Example (string): 562 | 563 | ```yaml 564 | quality: 565 | type: montecarlo 566 | specification: |- 567 | montecarlo: 568 | field_health: 569 | - table: project:dataset.table_name 570 | timestamp_field: created 571 | dimension_tracking: 572 | - table: project:dataset.table_name 573 | timestamp_field: created 574 | field: order_status 575 | ``` 576 | 577 | ### Specification Extensions 578 | 579 | While the Data Contract Specification tries to accommodate most use cases, additional data can be added to extend the specification at certain points. 580 | 581 | Custom fields can be added with any name. The value can be null, a primitive, an array, or an object.
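A short sketch of what such an extension might look like, placed on the Contact Object and on the root document, both of which are marked as extensible. The field names `slackChannel`, `escalationContacts`, and `costCenter` are hypothetical and chosen only to show a primitive, an array, and a null value; the specification does not define them.

```yaml
dataContractSpecification: 0.9.0
id: urn:datacontract:checkout:orders-latest-npii
info:
  title: Orders Latest NPII
  version: 1.0.0
  contact:
    name: John Doe (Data Product Owner)
    email: john.doe@example.com
    slackChannel: "#checkout-data"   # hypothetical custom field (primitive)
    escalationContacts:              # hypothetical custom field (array)
      - jane.doe@example.com
costCenter: null                     # hypothetical custom field on the root object (null)
```

Whether a given tool preserves or ignores such fields is tool-specific, so it may be worth checking before relying on them.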
582 | 583 | ### Design Principles 584 | 585 | The Data Contract Specification follows these design principles: 586 | 587 | - Is an open standard and its serialization can be versioned in git 588 | - Follows OpenAPI and AsyncAPI conventions so that it feels immediately familiar 589 | - Supports tooling by being machine-readable 590 | - Supports existing well-known formats to avoid unnecessary abstractions 591 | - Supports contract-first approaches 592 | - Supports code-first approaches 593 | 594 | Tooling 595 | --- 596 | - [Data Contract Studio](https://studio.datacontract.com/) is a free web tool to develop and share data contracts. 597 | - [Data Contract CLI](https://github.com/datacontract/cli) is a free CLI tool to help you create, develop, and maintain your data contracts. 598 | - [Data Mesh Manager](https://www.datamesh-manager.com/) is a commercial tool to manage data products and data contracts. It supports the data contract specification and allows the user to import or export data contracts using this specification. 599 | 600 | 601 | Other Data Contract Specifications 602 | --- 603 | - [AIDA User Group's Open Data Contract Standard](https://github.com/AIDAUserGroup/open-data-contract-standard) 604 | - [PayPal's Data Contract Template](https://github.com/paypal/data-contract-template/blob/main/docs/README.md) 605 | 606 | Literature 607 | --- 608 | - [Driving Data Quality with Data Contracts](https://www.amazon.com/dp/B0C37FPH3D) by Andrew Jones 609 | 610 | Authors 611 | --- 612 | The Data Contract Specification was originally created by [Jochen Christ](https://www.linkedin.com/in/jochenchrist/) and [Dr. Simon Harrer](https://www.linkedin.com/in/simonharrer/), and is currently maintained by them. 613 | 614 | 615 | Contributing 616 | --- 617 | Contributions are welcome! Please open an issue or a pull request. 
618 | 619 | License 620 | --- 621 | [MIT License](LICENSE) 622 | 623 | 624 | 625 | -------------------------------------------------------------------------------- /versions/0.9.0/datacontract.init.yaml: -------------------------------------------------------------------------------- 1 | dataContractSpecification: 0.9.0 2 | id: my-data-contract-id 3 | info: 4 | title: My Data Contract 5 | version: 0.0.1 6 | # description: 7 | # owner: 8 | # contact: 9 | # name: 10 | # url: 11 | # email: 12 | 13 | 14 | ### servers 15 | 16 | #servers: 17 | # my-stage: 18 | # type: bigquery 19 | # project: 20 | # dataset: 21 | 22 | #servers: 23 | # my-stage: 24 | # type: s3 25 | # location: s3:// 26 | 27 | #servers: 28 | # my-stage: 29 | # type: redshift 30 | # account: 31 | # database: 32 | # schema: 33 | 34 | #servers: 35 | # my-stage: 36 | # type: snowflake 37 | # account: 38 | # database: 39 | # schema: 40 | 41 | #servers: 42 | # my-stage: 43 | # type: databricks 44 | # share: 45 | 46 | #servers: 47 | # my-stage: 48 | # type: kafka 49 | # host: 50 | # topic: 51 | 52 | 53 | ### terms 54 | 55 | #terms: 56 | # usage: 57 | # limitations: 58 | # billing: 59 | # noticePeriod: 60 | 61 | 62 | ### schema 63 | 64 | #schema: 65 | # type: dbt 66 | # specification: 67 | # version: 68 | # models: 69 | # - name: 70 | # description: 71 | # columns: 72 | # - name: 73 | # type: 74 | # description: 75 | # tests: 76 | 77 | #schema: 78 | # type: dbt 79 | # specification: |- 80 | # version: 81 | # models: 82 | # - name: 83 | # description: 84 | # columns: 85 | # - name: 86 | # type: 87 | # description: 88 | # tests: 89 | 90 | #schema: 91 | # type: dbt 92 | # specification: "$ref: model.yaml" 93 | 94 | #schema: 95 | # type: bigquery 96 | # specification: |- 97 | # { 98 | # "tableReference": { 99 | # "projectId": "my-project", 100 | # "datasetId": "my_dataset", 101 | # "tableId": "my_table" 102 | # }, 103 | # "description": "This is a description", 104 | # "type": "TABLE", 105 | # "schema": { 106 | # 
"fields": [ 107 | # { 108 | # "name": "name", 109 | # "type": "STRING", 110 | # "mode": "NULLABLE", 111 | # "description": "This is a description" 112 | # } 113 | # ] 114 | # } 115 | # } 116 | 117 | #schema: 118 | # type: json-schema 119 | # specification: 120 | # my-table: 121 | # description: 122 | # type: object 123 | # properties: 124 | # id: 125 | # type: string 126 | # description: 127 | 128 | #schema: 129 | # type: json-schema 130 | # specification: |- 131 | # { 132 | # "$schema": "http://json-schema.org/draft-07/schema#", 133 | # "type": "object", 134 | # "properties": { 135 | # "my_table": { 136 | # "type": "object", 137 | # "description": "", 138 | # "properties": { 139 | # "id": { 140 | # "type": "string", 141 | # "description": "" 142 | # }, 143 | # "required": ["id"] 144 | # } 145 | # }, 146 | # "required": ["my-table"] 147 | # } 148 | 149 | #schema: 150 | # type: sql-ddl 151 | # specification: |- 152 | # CREATE TABLE my_table ( 153 | # id TEXT PRIMARY KEY 154 | # ); 155 | 156 | #schema: 157 | # type: avro 158 | # specification: 159 | # User: 160 | # type: record 161 | # name: MyTable 162 | # fields: 163 | # - name: id 164 | # type: string 165 | 166 | #schema: 167 | # type: avro 168 | # specification: |- 169 | # { 170 | # "type": "record", 171 | # "name": "MyTable", 172 | # "fields": [ 173 | # { 174 | # "name": "name", 175 | # "type": "string" 176 | # } 177 | # ] 178 | # } 179 | 180 | #schema: 181 | # type: protobuf 182 | # specification: |- 183 | # message MyTable { 184 | # string id = 1; 185 | # } 186 | 187 | #schema: 188 | # type: custom 189 | # specification: 190 | 191 | 192 | ### examples 193 | 194 | #examples: 195 | # - type: csv 196 | # model: my_table 197 | # data: |- 198 | # id,timestamp,amount 199 | # "1001","2023-09-09T08:30:00Z",2500 200 | # "1002","2023-09-08T15:45:00Z",1800 201 | # 202 | #examples: 203 | # - type: csv 204 | # model: my_table 205 | # data: "$ref: data.csv" 206 | 207 | #examples: 208 | # - type: json 209 | # model: my_table 
210 | # data: |- 211 | # [ 212 | # { 213 | # "id": "1001", 214 | # "timestamp": "2023-09-09T08:30:00Z", 215 | # "amount": 2500 216 | # }, 217 | # { 218 | # "id": "1002", 219 | # "timestamp": "2023-09-08T15:45:00Z", 220 | # "amount": 1800 221 | # } 222 | # ] 223 | 224 | #examples: 225 | # - type: yaml 226 | # model: my_table 227 | # data: 228 | # - id: 1001 229 | # timestamp: 2023-09-09T08:30:00Z 230 | # amount: 2500 231 | # - id: 1002 232 | # timestamp: 2023-09-08T15:45:00Z 233 | # amount: 1800 234 | 235 | #examples: 236 | # - type: custom 237 | # model: my_table 238 | # data: |- 239 | 240 | 241 | ### quality 242 | 243 | #quality: 244 | # type: SodaCL 245 | # specification: 246 | # checks for my_table: 247 | # - duplicate_count(order_id) = 0 248 | 249 | #quality: 250 | # type: SodaCL 251 | # specification: 252 | # checks for my_table: |- 253 | # - duplicate_count(id) = 0 254 | 255 | #quality: 256 | # type: SodaCL 257 | # specification: 258 | # checks for my_table: "$ref: checks.yaml" 259 | 260 | #quality: 261 | # type: montecarlo 262 | # specification: |- 263 | # montecarlo: 264 | # field_health: 265 | # - table: my_project:my_dataset.my_table 266 | # fields: 267 | # - id 268 | # - timestamp 269 | # - amount 270 | # timestamp_field: timestamp 271 | 272 | #quality: 273 | # type: custom 274 | # specification: |- 275 | -------------------------------------------------------------------------------- /versions/0.9.0/datacontract.schema.json: -------------------------------------------------------------------------------- 1 | { 2 | "$schema": "http://json-schema.org/draft-07/schema#", 3 | "type": "object", 4 | "properties": { 5 | "dataContractSpecification": { 6 | "type": "string", 7 | "enum": [ 8 | "0.9.0" 9 | ], 10 | "description": "Specifies the Data Contract Specification being used." 11 | }, 12 | "id": { 13 | "type": "string", 14 | "description": "Specifies the identifier of the data contract." 
15 | }, 16 | "info": { 17 | "type": "object", 18 | "properties": { 19 | "title": { 20 | "type": "string", 21 | "description": "The title of the data contract." 22 | }, 23 | "version": { 24 | "type": "string", 25 | "description": "The version of the data contract document (which is distinct from the Data Contract Specification version or the Data Product implementation version)." 26 | }, 27 | "description": { 28 | "type": "string", 29 | "description": "A description of the data contract." 30 | }, 31 | "owner": { 32 | "type": "string", 33 | "description": "The owner or team responsible for managing the data contract and providing the data." 34 | }, 35 | "dataProduct": { 36 | "type": "string", 37 | "description": "The data product that contains the output port providing the data." 38 | }, 39 | "outputPort": { 40 | "type": "string", 41 | "description": "The output port that implements the data contract." 42 | }, 43 | "contact": { 44 | "type": "object", 45 | "properties": { 46 | "name": { 47 | "type": "string", 48 | "description": "The identifying name of the contact person/organization." 49 | }, 50 | "url": { 51 | "type": "string", 52 | "format": "uri", 53 | "description": "The URL pointing to the contact information. This MUST be in the form of a URL." 54 | }, 55 | "email": { 56 | "type": "string", 57 | "format": "email", 58 | "description": "The email address of the contact person/organization. This MUST be in the form of an email address." 59 | } 60 | }, 61 | "description": "Contact information for the data contract." 62 | } 63 | }, 64 | "required": [ 65 | "title", 66 | "version" 67 | ], 68 | "description": "Metadata and life cycle information about the data contract." 
69 | }, 70 | "servers": { 71 | "type": "object", 72 | "additionalProperties": { 73 | "anyOf": [ 74 | { 75 | "type": "object", 76 | "properties": { 77 | "type": { 78 | "type": "string", 79 | "enum": [ 80 | "bigquery", 81 | "BigQuery" 82 | ], 83 | "description": "The type of the data product technology that implements the data contract." 84 | }, 85 | "project": { 86 | "type": "string", 87 | "description": "An optional string describing the server." 88 | }, 89 | "dataset": { 90 | "type": "string", 91 | "description": "An optional string describing the server." 92 | } 93 | }, 94 | "required": [ 95 | "type", 96 | "project", 97 | "dataset" 98 | ] 99 | }, 100 | { 101 | "type": "object", 102 | "properties": { 103 | "type": { 104 | "type": "string", 105 | "enum": [ 106 | "s3" 107 | ], 108 | "description": "The type of the data product technology that implements the data contract." 109 | }, 110 | "location": { 111 | "type": "string", 112 | "format": "uri", 113 | "description": "An optional string describing the server. Must be in the form of a URL." 114 | } 115 | }, 116 | "required": [ 117 | "type", 118 | "location" 119 | ] 120 | }, 121 | { 122 | "type": "object", 123 | "properties": { 124 | "type": { 125 | "type": "string", 126 | "enum": [ 127 | "redshift" 128 | ], 129 | "description": "The type of the data product technology that implements the data contract." 130 | }, 131 | "account": { 132 | "type": "string", 133 | "description": "An optional string describing the server." 134 | }, 135 | "database": { 136 | "type": "string", 137 | "description": "An optional string describing the server." 138 | }, 139 | "schema": { 140 | "type": "string", 141 | "description": "An optional string describing the server." 
142 | } 143 | }, 144 | "required": [ 145 | "type", 146 | "account", 147 | "database", 148 | "schema" 149 | ] 150 | }, 151 | { 152 | "type": "object", 153 | "properties": { 154 | "type": { 155 | "type": "string", 156 | "enum": [ 157 | "snowflake" 158 | ], 159 | "description": "The type of the data product technology that implements the data contract." 160 | }, 161 | "account": { 162 | "type": "string", 163 | "description": "An optional string describing the server." 164 | }, 165 | "database": { 166 | "type": "string", 167 | "description": "An optional string describing the server." 168 | }, 169 | "schema": { 170 | "type": "string", 171 | "description": "An optional string describing the server." 172 | } 173 | }, 174 | "required": [ 175 | "type", 176 | "account", 177 | "database", 178 | "schema" 179 | ] 180 | }, 181 | { 182 | "type": "object", 183 | "properties": { 184 | "type": { 185 | "type": "string", 186 | "enum": [ 187 | "databricks" 188 | ], 189 | "description": "The type of the data product technology that implements the data contract." 190 | }, 191 | "share": { 192 | "type": "string", 193 | "description": "An optional string describing the server." 194 | } 195 | }, 196 | "required": [ 197 | "type", 198 | "share" 199 | ] 200 | }, 201 | { 202 | "type": "object", 203 | "properties": { 204 | "type": { 205 | "type": "string", 206 | "enum": [ 207 | "kafka" 208 | ], 209 | "description": "The type of the data product technology that implements the data contract." 210 | }, 211 | "host": { 212 | "type": "string", 213 | "description": "An optional string describing the server." 214 | }, 215 | "topic": { 216 | "type": "string", 217 | "description": "An optional string describing the server." 218 | } 219 | }, 220 | "required": [ 221 | "type", 222 | "host", 223 | "topic" 224 | ] 225 | } 226 | ] 227 | }, 228 | "description": "Information about the servers." 
229 | }, 230 | "terms": { 231 | "type": "object", 232 | "description": "The terms and conditions of the data contract.", 233 | "properties": { 234 | "usage": { 235 | "type": "string", 236 | "description": "The usage describes the way the data is expected to be used. Can contain business and technical information." 237 | }, 238 | "limitations": { 239 | "type": "string", 240 | "description": "The limitations describe the restrictions on how the data can be used, can be technical or restrictions on what the data may not be used for." 241 | }, 242 | "billing": { 243 | "type": "string", 244 | "description": "The billing describes the pricing model for using the data, such as whether it's free, having a monthly fee, or metered pay-per-use." 245 | }, 246 | "noticePeriod": { 247 | "type": "string", 248 | "description": "The period of time that must be given by either party to terminate or modify a data usage agreement. Uses ISO-8601 period format, e.g., 'P3M' for a period of three months." 249 | } 250 | } 251 | }, 252 | "schema": { 253 | "type": "object", 254 | "properties": { 255 | "type": { 256 | "type": "string", 257 | "enum": [ 258 | "dbt", 259 | "bigquery", 260 | "json-schema", 261 | "sql-ddl", 262 | "avro", 263 | "protobuf", 264 | "custom" 265 | ], 266 | "description": "The type of the schema. Typical values are: dbt, bigquery, json-schema, sql-ddl, avro, protobuf, custom." 267 | }, 268 | "specification": { 269 | "anyOf": [ 270 | { 271 | "type": "string", 272 | "description": "The specification of the schema as a string." 273 | }, 274 | { 275 | "type": "object", 276 | "description": "The specification of the schema as an object." 277 | } 278 | ] 279 | } 280 | }, 281 | "required": [ 282 | "type", 283 | "specification" 284 | ], 285 | "description": "The schema of the data contract describes the syntax and semantics of provided data sets. It supports different schema types." 
286 | }, 287 | "examples": { 288 | "type": "array", 289 | "items": { 290 | "type": "object", 291 | "properties": { 292 | "type": { 293 | "type": "string", 294 | "enum": [ 295 | "csv", 296 | "json", 297 | "yaml", 298 | "custom" 299 | ], 300 | "description": "The type of the example data. Well-known types are: csv, json, yaml, custom." 301 | }, 302 | "description": { 303 | "type": "string", 304 | "description": "An optional string describing the example." 305 | }, 306 | "model": { 307 | "type": "string", 308 | "description": "The reference to the model in the schema, e.g., a table name." 309 | }, 310 | "data": { 311 | "type": "string", 312 | "description": "Example data for this model." 313 | } 314 | }, 315 | "required": [ 316 | "type", 317 | "model", 318 | "data" 319 | ] 320 | }, 321 | "description": "The Examples Object is an array of Example Objects." 322 | }, 323 | "quality": { 324 | "type": "object", 325 | "properties": { 326 | "type": { 327 | "type": "string", 328 | "enum": [ 329 | "SodaCL", 330 | "montecarlo", 331 | "custom" 332 | ], 333 | "description": "The type of the quality check. Typical values are: SodaCL, montecarlo, custom." 334 | }, 335 | "specification": { 336 | "anyOf": [ 337 | { 338 | "type": "string", 339 | "description": "The specification of the quality attributes as a string." 340 | }, 341 | { 342 | "type": "object", 343 | "description": "The specification of the quality attributes as an object." 344 | } 345 | ] 346 | } 347 | }, 348 | "required": [ 349 | "type", 350 | "specification" 351 | ], 352 | "description": "The quality object contains quality attributes and checks." 
353 | } 354 | }, 355 | "required": [ 356 | "dataContractSpecification", 357 | "id", 358 | "info" 359 | ] 360 | } 361 | -------------------------------------------------------------------------------- /versions/0.9.1/README.md: -------------------------------------------------------------------------------- 1 | # Data Contract Specification 2 | 3 | ![datacontract.png](images/datacontract.png) 4 | 5 | Data contracts bring data providers and data consumers together. 6 | 7 | A _data contract_ is a document that defines the structure, format, semantics, quality, and terms of use for exchanging data between a data provider and their consumers. 8 | A data contract is implemented by a data product's output port or other data technologies. 9 | Data contracts can also be used for the input port to specify the expectations of data dependencies and verify given guarantees. 10 | 11 | The _data contract specification_ defines a YAML format to describe attributes of provided data sets. 12 | It is data platform neutral and can be used with any data platform, such as AWS S3, Google BigQuery, Microsoft Fabric, Databricks, and Snowflake. 13 | The data contract specification is an open initiative to define a common data contract format. 14 | It follows [OpenAPI](https://www.openapis.org/) and [AsyncAPI](https://www.asyncapi.com/) conventions. 15 | 16 | Data contracts come into play when data is exchanged between different teams or organizational units, such as in a [data mesh architecture](https://www.datamesh-architecture.com/). 17 | First, and foremost, data contracts are a communication tool to express a common understanding of how data should be structured and interpreted. 18 | They make semantic and quality expectations explicit. 19 | They are often created collaboratively in [workshops](/workshop) together with data providers and data consumers. 
20 | Later in development and production, they also serve as the basis for code generation, testing, schema validations, quality checks, monitoring, access control, and computational governance policies. 21 | 22 | The specification comes along with the [Data Contract CLI](https://github.com/datacontract/cli), an open-source tool to develop, validate, and enforce data contracts. 23 | 24 | _Note: The term "data contract" refers to a specification that is usually owned by the data provider and thus does not align with a "contract" in a legal sense as a mutual agreement between two parties. 25 | The term "contract" may be somewhat misleading, but it is how it is used in practice. 26 | The mutual agreement between one data provider and one data consumer is the "data usage agreement" that refers to a data contract. 27 | Data usage agreements have a defined lifecycle, start/end date, and help the data provider to track who accesses their data and for which purposes._ 28 | 29 | Version 30 | --- 31 | 32 | 0.9.1 ([Changelog](CHANGELOG.md)) 33 | 34 | Example 35 | --- 36 | 37 | [![Open in Data Contract Studio](https://img.shields.io/badge/open%20in-Data%20Contract%20Studio-blue)](https://studio.datacontract.com/) 38 | 39 | ```yaml 40 | dataContractSpecification: 0.9.1 41 | id: urn:datacontract:checkout:orders-latest-npii 42 | info: 43 | title: Orders Latest NPII 44 | version: 1.0.0 45 | description: Successful customer orders in the webshop. All orders since 2020-01-01. Orders with their line items are in their current state (no history included). PII data is removed. 46 | owner: Checkout Team 47 | contact: 48 | name: John Doe (Data Product Owner) 49 | email: john.doe@example.com 50 | servers: 51 | production: 52 | type: BigQuery 53 | project: acme_orders_prod 54 | dataset: bigquery_orders_latest_npii_v1 55 | terms: 56 | usage: > 57 | Data can be used for reports, analytics and machine learning use cases. 
58 | Orders may be linked and joined with other tables. 59 | limitations: > 60 | Not suitable for real-time use cases. 61 | Data may not be used to identify individual customers. 62 | Max data processing per day: 10 TiB 63 | billing: 5000 USD per month 64 | noticePeriod: P3M 65 | models: 66 | orders: 67 | description: One record per order. Includes cancelled and deleted orders. 68 | type: table 69 | fields: 70 | order_id: 71 | $ref: '#/definitions/order_id' 72 | order_timestamp: 73 | type: timestamp 74 | description: The business timestamp in UTC when the order was successfully registered in the source system and the payment was successful. 75 | order_total: 76 | type: long 77 | description: Total amount in the smallest monetary unit (e.g., cents). 78 | line_items: 79 | description: A single article that is part of an order. 80 | type: table 81 | fields: 82 | lines_item_id: 83 | type: string 84 | description: Primary key of the line_items table 85 | order_id: 86 | $ref: '#/definitions/order_id' 87 | sku: 88 | description: The purchased article number 89 | $ref: '#/definitions/sku' 90 | definitions: 91 | order_id: 92 | domain: checkout 93 | name: order_id 94 | title: Order ID 95 | type: string 96 | description: An internal ID that identifies an order in the online shop. 97 | example: 243c25e5-a081-43a9-aeab-6d5d5b6cb5e2 98 | pii: true 99 | classification: restricted 100 | sku: 101 | domain: inventory 102 | name: sku 103 | title: Stock Keeping Unit 104 | type: string 105 | example: AC1212ME1 106 | description: | 107 | A Stock Keeping Unit (SKU) is an internal unique identifier for an article. 108 | It is typically associated with an article's barcode, such as the EAN/GTIN.
109 | examples: 110 | - type: csv # csv, json, yaml, custom 111 | model: orders 112 | data: |- # expressed as string or inline yaml or via "$ref: data.csv" 113 | order_id,order_timestamp,order_total 114 | "1001","2023-09-09T08:30:00Z",2500 115 | "1002","2023-09-08T15:45:00Z",1800 116 | "1003","2023-09-07T12:15:00Z",3200 117 | "1004","2023-09-06T19:20:00Z",1500 118 | "1005","2023-09-05T10:10:00Z",4200 119 | "1006","2023-09-04T14:55:00Z",2800 120 | "1007","2023-09-03T21:05:00Z",1900 121 | "1008","2023-09-02T17:40:00Z",3600 122 | "1009","2023-09-01T09:25:00Z",3100 123 | "1010","2023-08-31T22:50:00Z",2700 124 | - type: csv 125 | model: line_items 126 | data: |- 127 | lines_item_id,order_id,sku 128 | "1","1001","5901234123457" 129 | "2","1001","4001234567890" 130 | "3","1002","5901234123457" 131 | "4","1002","2001234567893" 132 | "5","1003","4001234567890" 133 | "6","1003","5001234567892" 134 | "7","1004","5901234123457" 135 | "8","1005","2001234567893" 136 | "9","1005","5001234567892" 137 | "10","1005","6001234567891" 138 | quality: 139 | type: SodaCL # data quality check format: SodaCL, montecarlo, custom 140 | specification: # expressed as string or inline yaml or via "$ref: checks.yaml" 141 | checks for orders: 142 | - freshness(order_timestamp) < 24h 143 | - row_count > 500000 144 | - duplicate_count(order_id) = 0 145 | checks for line_items: 146 | - row_count > 500000 147 | ``` 148 | 149 | Schema 150 | --- 151 | 152 | - [Data Contract Object](#data-contract-object) 153 | - [Info Object](#info-object) 154 | - [Contact Object](#contact-object) 155 | - [Server Object](#server-object) 156 | - [Terms Object](#terms-object) 157 | - [Model Object](#model-object) 158 | - [Field Object](#field-object) 159 | - [Definition Object](#definition-object) 160 | - [Schema Object](#schema-object) 161 | - [Example Object](#example-object) 162 | - [Quality Object](#quality-object) 163 | - [Data Types](#data-types) 164 | - [Specification Extensions](#specification-extensions) 165 | 
166 | 167 | [JSON Schema](https://github.com/datacontract/datacontract-specification/blob/main/datacontract.schema.json) of the Data Contract Specification. 168 | 169 | ### Data Contract Object 170 | 171 | This is the root document. 172 | 173 | It is _RECOMMENDED_ that the root document be named: `datacontract.yaml`. 174 | 175 | | Field | Type | Description | 176 | |---------------------------|------------------------------------------------------|----------------------------------------------------------------------------------------------------------| 177 | | dataContractSpecification | `string` | REQUIRED. Specifies the Data Contract Specification being used. | 178 | | id | `string` | REQUIRED. An organization-wide unique technical identifier, such as a UUID, URN, slug, string, or number | 179 | | info | [Info Object](#info-object) | REQUIRED. Specifies the metadata of the data contract. May be used by tooling. | 180 | | servers | Map[string, [Server Object](#server-object)] | Specifies the servers of the data contract. | 181 | | terms | [Terms Object](#terms-object) | Specifies the terms and conditions of the data contract. | 182 | | models | Map[string, [Model Object](#model-object)] | Specifies the logical data model. | 183 | | definitions | Map[string, [Definition Object](#definition-object)] | Specifies definitions. | 184 | | schema | [Schema Object](#schema-object) | Specifies the physical schema. The specification supports different schema format. | 185 | | examples | Array of [Example Objects](#example-object) | Specifies example data sets for the data model. The specification supports different example types. | 186 | | quality | [Quality Object](#quality-object) | Specifies the quality attributes and checks. The specification supports different quality check DSLs. | 187 | 188 | This object _MAY_ be extended with [Specification Extensions](#specification-extensions). 
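The REQUIRED markers in the table above can be checked mechanically. The following is a minimal Python sketch, assuming the contract has already been parsed from YAML into a plain dictionary. The helper name `missing_required_fields` is hypothetical and is not part of the specification or the Data Contract CLI.

```python
# Root-level fields marked REQUIRED in the Data Contract Object table.
REQUIRED_ROOT_FIELDS = ("dataContractSpecification", "id", "info")


def missing_required_fields(contract: dict) -> list[str]:
    """Return the REQUIRED root fields that are absent from a parsed contract.

    Hypothetical helper for illustration only; a full validation would use
    the published JSON Schema instead.
    """
    return [name for name in REQUIRED_ROOT_FIELDS if name not in contract]


# Values taken from the example contract in this document.
contract = {
    "dataContractSpecification": "0.9.1",
    "id": "urn:datacontract:checkout:orders-latest-npii",
    "info": {"title": "Orders Latest NPII", "version": "1.0.0"},
}

print(missing_required_fields(contract))        # complete contract
print(missing_required_fields({"id": "demo"}))  # incomplete contract
```

This only checks that the keys are present. It does not check types or nested objects, which the JSON Schema linked above covers.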
189 | 190 | 191 | 192 | 193 | ### Info Object 194 | 195 | Metadata and life cycle information about the data contract. 196 | 197 | 198 | | Field | Type | Description | 199 | |---------|--------|------------------------------------------------------------------------------------------------------------------------------------------------------------------| 200 | | title | `string` | REQUIRED. The title of the data contract. | 201 | | version | `string` | REQUIRED. The version of the data contract document (which is distinct from the Data Contract Specification version or the Data Product implementation version). | 202 | | description | `string` | A description of the data contract. | 203 | | owner | `string` | The owner or team responsible for managing the data contract and providing the data. | 204 | | contact | [Contact Object](#contact-object) | Contact information for the data contract. | 205 | 206 | This object _MAY_ be extended with [Specification Extensions](#specification-extensions). 207 | 208 | 209 | ### Contact Object 210 | 211 | Contact information for the data contract. 212 | 213 | | Field | Type | Description | 214 | |-------|----------|-------------------------------------------------------------------------------------------------------| 215 | | name | `string` | The identifying name of the contact person/organization. | 216 | | url | `string` | The URL pointing to the contact information. This _MUST_ be in the form of a URL. | 217 | | email | `string` | The email address of the contact person/organization. This _MUST_ be in the form of an email address. | 218 | 219 | This object _MAY_ be extended with [Specification Extensions](#specification-extensions). 220 | 221 | ### Server Object 222 | 223 | The fields are dependent on the defined type. 
224 | 225 | | Field | Type | Description | 226 | |-------------|----------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------| 227 | | type | `string` | The type of the data product technology that implements the data contract. Well-known server types are: `bigquery`, `s3`, `redshift`, `snowflake`, `databricks`, `kafka` | 228 | | description | `string` | An optional string describing the server. | 229 | 230 | This object _MAY_ be extended with [Specification Extensions](#specification-extensions). 231 | 232 | #### BigQuery Server Object 233 | 234 | | Field | Type | Description | 235 | |---------|----------|-------------| 236 | | type | `string` | `bigquery` | 237 | | project | `string` | | 238 | | dataset | `string` | | 239 | 240 | #### S3 Server Object 241 | 242 | | Field | Type | Description | 243 | |----------|----------|--------------------------------| 244 | | type | `string` | `s3` | 245 | | location | `string` | S3 URL, starting with `s3://` | 246 | 247 | Example: 248 | 249 | ```yaml 250 | servers: 251 | production: 252 | type: s3 253 | location: s3://acme-orders-prod/orders/ 254 | ``` 255 | 256 | 257 | #### Redshift Server Object 258 | 259 | | Field | Type | Description | 260 | |----------|----------|-------------| 261 | | type | `string` | `redshift` | 262 | | account | `string` | | 263 | | database | `string` | | 264 | | schema | `string` | | 265 | 266 | #### Snowflake Server Object 267 | 268 | | Field | Type | Description | 269 | |----------|----------|-------------| 270 | | type | `string` | `snowflake` | 271 | | account | `string` | | 272 | | database | `string` | | 273 | | schema | `string` | | 274 | 275 | #### Databricks Server Object 276 | 277 | | Field | Type | Description | 278 | |----------|----------|--------------| 279 | | type | `string` | `databricks` | 280 | | share | `string` | | 281 | 282 | #### Kafka Server Object 283 | 
284 | | Field | Type | Description | 285 | |-------|----------|-------------| 286 | | type | `string` | `kafka` | 287 | | host | `string` | | 288 | | topic | `string` | | 289 | 290 | 291 | 292 | ### Terms Object 293 | 294 | The terms and conditions of the data contract. 295 | 296 | | Field | Type | Description | 297 | |----------------------|--------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------| 298 | | usage | `string` | The usage describes the way the data is expected to be used. Can contain business and technical information. | 299 | | limitations | `string` | The limitations describe the restrictions on how the data can be used, can be technical or restrictions on what the data may not be used for. | 300 | | billing | `string` | The billing describes the pricing model for using the data, such as whether it's free, having a monthly fee, or metered pay-per-use. | 301 | | noticePeriod | `string` | The period of time that must be given by either party to terminate or modify a data usage agreement. Uses ISO-8601 period format, e.g., `P3M` for a period of three months. | 302 | 303 | 304 | ### Model Object 305 | 306 | The Model Object describes the structure and semantics of a data model, such as tables, views, or structured files. 307 | 308 | The name of the data model (table name) is defined by the key that refers to this Model Object. 309 | 310 | | Field | Type | Description | 311 | |-------------|----------------------------------------------|-----------------------------------------------------------------------| 312 | | type | `string` | The type of the model. Examples: `table`, `object`. Default: `table`. | 313 | | description | `string` | An optional string describing the data model. | 314 | | fields | Map[`string`, [Field Object](#field-object)] | The fields (e.g. columns) of the data model. 
| 315 | 316 | 317 | 318 | ### Field Object 319 | 320 | The Field Object describes one field (column, property, nested field) of a data model. 321 | 322 | | Field | Type | Description | 323 | |----------------|--------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| 324 | | type | [Data Type](#data-types) | The logical data type of the field. | 325 | | description | `string` | An optional string describing the semantics of the data in this field. | 326 | | pii | `boolean` | An indication, if this field contains Personal Identifiable Information (PII). | 327 | | classification | `string` | The data class defining the sensitivity level for this field, according to the organization's classification scheme. Examples may be: `sensitive`, `restricted`, `internal`, `public`. | 328 | | tags | Array of `string` | Custom metadata to provide additional context. | 329 | | $ref | `string` | A reference URI to a definition in the specification, internally or externally. Properties will be inherited from the definition. | 330 | 331 | 332 | ### Definition Object 333 | 334 | The Definition Object includes clear and concise explanations of the syntax, semantics, and classification of a business object in a given domain. 335 | It serves as a reference for a common understanding of terminology, helps ensure consistent usage, and helps identify joinable fields. 336 | Model fields can refer to definitions using the `$ref` field to link to existing definitions and avoid duplicate documentation. 337 | 338 | | Field | Type | Description | 339 | |----------------|--------------------------|----------------------------------------------------------------------------------------------------------------------| 340 | | domain | `string` | The domain in which this definition is valid. Default: `global`.
| 341 | | name | `string` | The technical name of this definition. | 342 | | title | `string` | The business name of this definition. | 343 | | type | [Data Type](#data-types) | The logical data type | 344 | | description | `string` | Clear and concise explanations related to the domain | 345 | | example | `string` | An example value. | 346 | | pii | `boolean` | An indication, if this field contains Personal Identifiable Information (PII). | 347 | | classification | `string` | The data class defining the sensitivity level for this field, according to the organization's classification scheme. | 348 | | tags | Array of `string` | Custom metadata to provide additional context. | 349 | 350 | 351 | ### Schema Object 352 | 353 | The schema of the data contract describes the physical schema. 354 | The type of the schema depends on the data platform. 355 | 356 | | Field | Type | Description | 357 | | ----- |----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------| 358 | | type | `string` | REQUIRED. The type of the schema.
Typical values are: `dbt`, `bigquery`, `json-schema`, `sql-ddl`, `avro`, `protobuf`, `custom` | 359 | | specification | [dbt Schema Object](#dbt-schema-object) \|
[BigQuery Schema Object](#bigquery-schema-object) \|
[JSON Schema Schema Object](#json-schema-schema-object) \|
[SQL DDL Schema Object](#sql-ddl-schema-object) \|
`string` | REQUIRED. The specification of the schema. The schema specification can be encoded as a string or as inline YAML. | 360 | 361 | 362 | #### dbt Schema Object 363 | 364 | https://docs.getdbt.com/reference/model-properties 365 | 366 | Example (inline YAML): 367 | 368 | ```yaml 369 | schema: 370 | type: dbt 371 | specification: 372 | version: 2 373 | models: 374 | - name: "My Table" 375 | description: "My description" 376 | columns: 377 | - name: "My column" 378 | data_type: text 379 | description: "My description" 380 | ``` 381 | 382 | Example (string): 383 | 384 | ```yaml 385 | schema: 386 | type: dbt 387 | specification: |- 388 | version: 2 389 | models: 390 | - name: "My Table" 391 | description: "My description" 392 | columns: 393 | - name: "My column" 394 | data_type: text 395 | description: "My description" 396 | ``` 397 | 398 | #### BigQuery Schema Object 399 | 400 | The schema structure is defined by the [Google BigQuery Table](https://cloud.google.com/bigquery/docs/reference/rest/v2/tables#resource:-table) object. You can extract such a Table object via the [tables.get](https://cloud.google.com/bigquery/docs/reference/rest/v2/tables/get) endpoint. 401 | 402 | Instead of providing a single Table object, you can also provide an array of such objects. Be aware that [tables.list](https://cloud.google.com/bigquery/docs/reference/rest/v2/tables/list) only returns a subset of the full Table object. You need to call every Table object via [tables.get](https://cloud.google.com/bigquery/docs/reference/rest/v2/tables/get) to get the full Table object, including the actual schema. 
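The two-step retrieval described above (list the tables, then fetch each one in full) might look like the following Python sketch. It assumes a client object shaped like `google.cloud.bigquery.Client`, whose `list_tables` and `get_table` methods wrap the `tables.list` and `tables.get` endpoints. The function names `fetch_full_tables` and `to_specification` are hypothetical, and the client is passed in as a parameter so the sketch itself needs no credentials.

```python
import json


def fetch_full_tables(client, dataset_id: str) -> list[dict]:
    """Return the full Table resource for every table in a dataset.

    `client` is assumed to behave like google.cloud.bigquery.Client:
    list_tables() yields lightweight list items (no schema), and
    get_table() returns the full table, including its schema.
    """
    tables = []
    for item in client.list_tables(dataset_id):
        table = client.get_table(item.reference)  # one tables.get call per table
        tables.append(table.to_api_repr())        # REST representation as a dict
    return tables


def to_specification(tables: list[dict]) -> str:
    """Serialize the Table objects as a JSON string for `specification`."""
    return json.dumps(tables, indent=2)
```

With the real library, a call could look like `fetch_full_tables(bigquery.Client(), "my-project.my_dataset")`, which requires the `google-cloud-bigquery` package and valid credentials.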
403 | 404 | Learn more: [Google BigQuery REST Reference v2](https://cloud.google.com/bigquery/docs/reference/rest) 405 | 406 | 407 | 408 | Example: 409 | 410 | ```yaml 411 | schema: 412 | type: bigquery 413 | specification: |- 414 | { 415 | "tableReference": { 416 | "projectId": "my-project", 417 | "datasetId": "my_dataset", 418 | "tableId": "my_table" 419 | }, 420 | "description": "This is a description", 421 | "type": "TABLE", 422 | "schema": { 423 | "fields": [ 424 | { 425 | "name": "name", 426 | "type": "STRING", 427 | "mode": "NULLABLE", 428 | "description": "This is a description" 429 | } 430 | ] 431 | } 432 | } 433 | ``` 434 | 435 | #### JSON Schema Schema Object 436 | 437 | JSON Schema can be defined as JSON or rendered as YAML, following the [OpenAPI Schema Object dialect](https://spec.openapis.org/oas/v3.1.0#properties) 438 | 439 | Example (inline YAML): 440 | 441 | ```yaml 442 | schema: 443 | type: json-schema 444 | specification: 445 | orders: 446 | description: One record per order. Includes cancelled and deleted orders. 447 | type: object 448 | properties: 449 | order_id: 450 | type: string 451 | description: Primary key of the orders table 452 | order_timestamp: 453 | type: string 454 | format: date-time 455 | description: The business timestamp in UTC when the order was successfully registered in the source system and the payment was successful. 456 | order_total: 457 | type: integer 458 | description: Total amount of the order in the smallest monetary unit (e.g., cents). 
459 | line_items: 460 | type: object 461 | properties: 462 | lines_item_id: 463 | type: string 464 | description: Primary key of the line_items table 465 | order_id: 466 | type: string 467 | description: Foreign key to the orders table 468 | sku: 469 | type: string 470 | description: The purchased article number 471 | ``` 472 | 473 | Example (string): 474 | 475 | ```yaml 476 | schema: 477 | type: json-schema 478 | specification: |- 479 | { 480 | "$schema": "http://json-schema.org/draft-07/schema#", 481 | "type": "object", 482 | "properties": { 483 | "orders": { 484 | "type": "object", 485 | "description": "One record per order. Includes cancelled and deleted orders.", 486 | "properties": { 487 | "order_id": { 488 | "type": "string", 489 | "description": "Primary key of the orders table" 490 | }, 491 | "order_timestamp": { 492 | "type": "string", 493 | "format": "date-time", 494 | "description": "The business timestamp in UTC when the order was successfully registered in the source system and the payment was successful." 495 | }, 496 | "order_total": { 497 | "type": "integer", 498 | "description": "Total amount of the order in the smallest monetary unit (e.g., cents)." 499 | } 500 | }, 501 | "required": ["order_id", "order_timestamp", "order_total"] 502 | }, 503 | "line_items": { 504 | "type": "object", 505 | "properties": { 506 | "lines_item_id": { 507 | "type": "string", 508 | "description": "Primary key of the line_items table" 509 | }, 510 | "order_id": { 511 | "type": "string", 512 | "description": "Foreign key to the orders table" 513 | }, 514 | "sku": { 515 | "type": "string", 516 | "description": "The purchased article number" 517 | } 518 | }, 519 | "required": ["lines_item_id", "order_id", "sku"] 520 | } 521 | }, 522 | "required": ["orders", "line_items"] 523 | } 524 | ``` 525 | 526 | #### SQL DDL Schema Object 527 | 528 | Classic SQL DDL statements can be used to describe the structure.
529 | 530 | 531 | Example (string): 532 | 533 | ```yaml 534 | schema: 535 | type: sql-ddl 536 | specification: |- 537 | -- One record per order. Includes cancelled and deleted orders. 538 | CREATE TABLE orders ( 539 | order_id TEXT PRIMARY KEY, -- Primary key of the orders table 540 | order_timestamp TIMESTAMPTZ NOT NULL, -- The business timestamp in UTC when the order was successfully registered in the source system and the payment was successful. 541 | order_total INTEGER NOT NULL -- Total amount of the order in the smallest monetary unit (e.g., cents) 542 | ); 543 | 544 | -- The items that are part of an order 545 | CREATE TABLE line_items ( 546 | lines_item_id TEXT PRIMARY KEY, -- Primary key of the line_items table 547 | order_id TEXT REFERENCES orders(order_id), -- Foreign key to the orders table 548 | sku TEXT NOT NULL -- The purchased article number 549 | ); 550 | 551 | ``` 552 | 553 | ### Example Object 554 | 555 | | Field | Type | Description | 556 | |-------------|----------|-----------------------------------------------------------------------------------------------------------------------------------------| 557 | | type | `string` | The type of the example data. Well-known types are: `csv`, `json`, `yaml`, `custom` | 558 | | description | `string` | An optional string describing the example. | 559 | | model | `string` | The reference to the model in the schema, e.g. a table name. | 560 | | data | `string` | Example data for this model.
| 561 | 562 | Example: 563 | 564 | ```yaml 565 | examples: 566 | - type: csv 567 | model: orders 568 | data: |- 569 | order_id,order_timestamp,order_total 570 | "1001","2023-09-09T08:30:00Z",2500 571 | "1002","2023-09-08T15:45:00Z",1800 572 | "1003","2023-09-07T12:15:00Z",3200 573 | "1004","2023-09-06T19:20:00Z",1500 574 | "1005","2023-09-05T10:10:00Z",4200 575 | "1006","2023-09-04T14:55:00Z",2800 576 | "1007","2023-09-03T21:05:00Z",1900 577 | "1008","2023-09-02T17:40:00Z",3600 578 | "1009","2023-09-01T09:25:00Z",3100 579 | "1010","2023-08-31T22:50:00Z",2700 580 | ``` 581 | 582 | ### Quality Object 583 | 584 | The quality object contains quality attributes and checks. 585 | 586 | | Field | Type | Description | 587 | | ----- |-------------------------------------------------------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------------------------------------------| 588 | | type | `string` | REQUIRED. The type of the quality check.
Typical values are: `SodaCL`, `montecarlo`, `custom` | 589 | | specification | [SodaCL Quality Object](#sodacl-quality-object) \|
[Monte Carlo Quality Object](#monte-carlo-quality-object) \|
`string` | REQUIRED. The specification of the quality attributes. The quality specification can be encoded as a string or as inline YAML. | 590 | 591 | 592 | #### SodaCL Quality Object 593 | 594 | Quality attributes expressed in the [Soda Checks Language](https://docs.soda.io/soda-cl/soda-cl-overview.html) (SodaCL). 595 | 596 | The `specification` represents the content of a `checks.yml` file. 597 | 598 | Example (inline): 599 | 600 | ```yaml 601 | quality: 602 | type: SodaCL # data quality check format: SodaCL, montecarlo, custom 603 | specification: # expressed as string or inline yaml or via "$ref: checks.yaml" 604 | checks for orders: 605 | - row_count > 0 606 | - duplicate_count(order_id) = 0 607 | checks for line_items: 608 | - row_count > 0 609 | ``` 610 | 611 | Example (string): 612 | 613 | ```yaml 614 | quality: 615 | type: SodaCL 616 | specification: |- 617 | checks for search_queries: 618 | - freshness(search_timestamp) < 1d 619 | - row_count > 100000 620 | - missing_count(search_query) = 0 621 | ``` 622 | 623 | #### Monte Carlo Quality Object 624 | 625 | Quality attributes defined as Monte Carlo's [Monitors as Code](https://docs.getmontecarlo.com/docs/monitors-as-code). 626 | 627 | The `specification` represents the content of a `montecarlo.yml` file.
628 | 629 | Example (string): 630 | 631 | ```yaml 632 | quality: 633 | type: montecarlo 634 | specification: |- 635 | montecarlo: 636 | field_health: 637 | - table: project:dataset.table_name 638 | timestamp_field: created 639 | dimension_tracking: 640 | - table: project:dataset.table_name 641 | timestamp_field: created 642 | field: order_status 643 | ``` 644 | 645 | ### Data Types 646 | 647 | The following data types are supported for model fields and definitions: 648 | 649 | - Unicode character sequence: `string`, `text`, `varchar` 650 | - Any numeric type, either integers or floating-point numbers: `number`, `decimal`, `numeric` 651 | - 32-bit signed integer: `int`, `integer` 652 | - 64-bit signed integer: `long`, `bigint` 653 | - Single precision (32-bit) IEEE 754 floating-point number: `float` 654 | - Double precision (64-bit) IEEE 754 floating-point number: `double` 655 | - Boolean value (true or false): `boolean` 656 | - Timestamp with timezone: `timestamp`, `timestamp_tz` 657 | - Timestamp with no timezone: `timestamp_ntz` 658 | - Date with no time information: `date` 659 | - Array: `array` 660 | - Sequence of 8-bit unsigned bytes: `bytes` 661 | - Complex type: `object`, `record`, `struct` 662 | - No value: `null` 663 | 664 | ### Specification Extensions 665 | 666 | While the Data Contract Specification tries to accommodate most use cases, additional data can be added to extend the specification at certain points. 667 | 668 | Custom fields can be added with any name. The value can be null, a primitive, an array or an object.
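
Example (a minimal sketch of custom fields): the names `costCenter`, `reviewers`, `retention`, and `legalHold` are made up for illustration and are not defined by the specification. Placing them under `info` is also an assumption, since this section does not list the exact extension points.

```yaml
dataContractSpecification: 0.9.1
id: my-data-contract-id
info:
  title: My Data Contract
  version: 0.0.1
  # Hypothetical custom fields below; none of these names come from the specification.
  costCenter: CC-4711   # primitive value
  reviewers:            # array value
    - alice
    - bob
  retention:            # object value
    period: P1Y
    legalHold: null     # null value
```

Tools that do not know these fields would be expected to ignore them, so it is safest to pick names that are unlikely to clash with fields added in later versions of the specification.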
669 | 670 | ### Design Principles 671 | 672 | The Data Contract Specification follows these design principles: 673 | 674 | - A free, open, and open-sourced standard 675 | - Follow OpenAPI and AsyncAPI conventions so that it feels immediately familiar 676 | - Support contract-first approaches 677 | - Support code-first approaches 678 | - Support tooling by being machine-readable 679 | 680 | Tooling 681 | --- 682 | - [Data Contract CLI](https://github.com/datacontract/cli) is a free CLI tool to help you create, develop, and maintain your data contracts. 683 | - [Data Contract Studio](https://studio.datacontract.com/) is a free web tool to develop and share data contracts. 684 | - [Data Mesh Manager](https://www.datamesh-manager.com/) is a commercial tool to manage data products and data contracts. It supports the data contract specification and allows the user to import or export data contracts using this specification. 685 | 686 | 687 | Other Data Contract Specifications 688 | --- 689 | - [AIDA User Group's Open Data Contract Standard](https://github.com/AIDAUserGroup/open-data-contract-standard) 690 | - [PayPal's Data Contract Template](https://github.com/paypal/data-contract-template/blob/main/docs/README.md) 691 | 692 | Literature 693 | --- 694 | - [Driving Data Quality with Data Contracts](https://www.amazon.com/dp/B0C37FPH3D) by Andrew Jones 695 | 696 | Authors 697 | --- 698 | The Data Contract Specification was originally created by [Jochen Christ](https://www.linkedin.com/in/jochenchrist/) and [Dr. Simon Harrer](https://www.linkedin.com/in/simonharrer/), and is currently maintained by them. 699 | 700 | 701 | Contributing 702 | --- 703 | Contributions are welcome! Please open an issue or a pull request. 
704 | 705 | License 706 | --- 707 | [MIT License](LICENSE) 708 | 709 | 710 | 711 | -------------------------------------------------------------------------------- /versions/0.9.1/datacontract.init.yaml: -------------------------------------------------------------------------------- 1 | dataContractSpecification: 0.9.1 2 | id: my-data-contract-id 3 | info: 4 | title: My Data Contract 5 | version: 0.0.1 6 | # description: 7 | # owner: 8 | # contact: 9 | # name: 10 | # url: 11 | # email: 12 | 13 | 14 | ### servers 15 | 16 | #servers: 17 | # my-stage: 18 | # type: bigquery 19 | # project: 20 | # dataset: 21 | 22 | #servers: 23 | # my-stage: 24 | # type: s3 25 | # location: s3:// 26 | 27 | #servers: 28 | # my-stage: 29 | # type: redshift 30 | # account: 31 | # database: 32 | # schema: 33 | 34 | #servers: 35 | # my-stage: 36 | # type: snowflake 37 | # account: 38 | # database: 39 | # schema: 40 | 41 | #servers: 42 | # my-stage: 43 | # type: databricks 44 | # share: 45 | 46 | #servers: 47 | # my-stage: 48 | # type: kafka 49 | # host: 50 | # topic: 51 | 52 | 53 | ### terms 54 | 55 | #terms: 56 | # usage: 57 | # limitations: 58 | # billing: 59 | # noticePeriod: 60 | 61 | ### models 62 | # models: 63 | # my_model: 64 | # description: 65 | # type: 66 | # fields: 67 | # my_field: 68 | # type: 69 | # description: 70 | 71 | ### definitions 72 | # definitions: 73 | # my_field: 74 | # domain: 75 | # name: 76 | # title: 77 | # type: 78 | # description: 79 | # example: 80 | # pii: 81 | # classification: 82 | 83 | ### schema 84 | 85 | #schema: 86 | # type: dbt 87 | # specification: 88 | # version: 89 | # models: 90 | # - name: 91 | # description: 92 | # columns: 93 | # - name: 94 | # type: 95 | # description: 96 | # tests: 97 | 98 | #schema: 99 | # type: dbt 100 | # specification: |- 101 | # version: 102 | # models: 103 | # - name: 104 | # description: 105 | # columns: 106 | # - name: 107 | # type: 108 | # description: 109 | # tests: 110 | 111 | #schema: 112 | # type: dbt 113 | # 
specification: "$ref: model.yaml" 114 | 115 | #schema: 116 | # type: bigquery 117 | # specification: |- 118 | # { 119 | # "tableReference": { 120 | # "projectId": "my-project", 121 | # "datasetId": "my_dataset", 122 | # "tableId": "my_table" 123 | # }, 124 | # "description": "This is a description", 125 | # "type": "TABLE", 126 | # "schema": { 127 | # "fields": [ 128 | # { 129 | # "name": "name", 130 | # "type": "STRING", 131 | # "mode": "NULLABLE", 132 | # "description": "This is a description" 133 | # } 134 | # ] 135 | # } 136 | # } 137 | 138 | #schema: 139 | # type: json-schema 140 | # specification: 141 | # my-table: 142 | # description: 143 | # type: object 144 | # properties: 145 | # id: 146 | # type: string 147 | # description: 148 | 149 | #schema: 150 | # type: json-schema 151 | # specification: |- 152 | # { 153 | # "$schema": "http://json-schema.org/draft-07/schema#", 154 | # "type": "object", 155 | # "properties": { 156 | # "my_table": { 157 | # "type": "object", 158 | # "description": "", 159 | # "properties": { 160 | # "id": { 161 | # "type": "string", 162 | # "description": "" 163 | # }, 164 | # "required": ["id"] 165 | # } 166 | # }, 167 | # "required": ["my-table"] 168 | # } 169 | 170 | #schema: 171 | # type: sql-ddl 172 | # specification: |- 173 | # CREATE TABLE my_table ( 174 | # id TEXT PRIMARY KEY 175 | # ); 176 | 177 | #schema: 178 | # type: avro 179 | # specification: 180 | # User: 181 | # type: record 182 | # name: MyTable 183 | # fields: 184 | # - name: id 185 | # type: string 186 | 187 | #schema: 188 | # type: avro 189 | # specification: |- 190 | # { 191 | # "type": "record", 192 | # "name": "MyTable", 193 | # "fields": [ 194 | # { 195 | # "name": "name", 196 | # "type": "string" 197 | # } 198 | # ] 199 | # } 200 | 201 | #schema: 202 | # type: protobuf 203 | # specification: |- 204 | # message MyTable { 205 | # string id = 1; 206 | # } 207 | 208 | #schema: 209 | # type: custom 210 | # specification: 211 | 212 | 213 | ### examples 214 | 215 | 
#examples: 216 | # - type: csv 217 | # model: my_table 218 | # data: |- 219 | # id,timestamp,amount 220 | # "1001","2023-09-09T08:30:00Z",2500 221 | # "1002","2023-09-08T15:45:00Z",1800 222 | # 223 | #examples: 224 | # - type: csv 225 | # model: my_table 226 | # data: "$ref: data.csv" 227 | 228 | #examples: 229 | # - type: json 230 | # model: my_table 231 | # data: |- 232 | # [ 233 | # { 234 | # "id": "1001", 235 | # "timestamp": "2023-09-09T08:30:00Z", 236 | # "amount": 2500 237 | # }, 238 | # { 239 | # "id": "1002", 240 | # "timestamp": "2023-09-08T15:45:00Z", 241 | # "amount": 1800 242 | # } 243 | # ] 244 | 245 | #examples: 246 | # - type: yaml 247 | # model: my_table 248 | # data: 249 | # - id: 1001 250 | # timestamp: 2023-09-09T08:30:00Z 251 | # amount: 2500 252 | # - id: 1002 253 | # timestamp: 2023-09-08T15:45:00Z 254 | # amount: 1800 255 | 256 | #examples: 257 | # - type: custom 258 | # model: my_table 259 | # data: |- 260 | 261 | 262 | ### quality 263 | 264 | #quality: 265 | # type: SodaCL 266 | # specification: 267 | # checks for my_table: 268 | # - duplicate_count(order_id) = 0 269 | 270 | #quality: 271 | # type: SodaCL 272 | # specification: 273 | # checks for my_table: |- 274 | # - duplicate_count(id) = 0 275 | 276 | #quality: 277 | # type: SodaCL 278 | # specification: 279 | # checks for my_table: "$ref: checks.yaml" 280 | 281 | #quality: 282 | # type: montecarlo 283 | # specification: |- 284 | # montecarlo: 285 | # field_health: 286 | # - table: my_project:my_dataset.my_table 287 | # fields: 288 | # - id 289 | # - timestamp 290 | # - amount 291 | # timestamp_field: timestamp 292 | 293 | #quality: 294 | # type: custom 295 | # specification: |- 296 | -------------------------------------------------------------------------------- /versions/0.9.1/datacontract.schema.json: -------------------------------------------------------------------------------- 1 | { 2 | "$schema": "http://json-schema.org/draft-07/schema#", 3 | "type": "object", 4 | "properties": 
{ 5 | "dataContractSpecification": { 6 | "type": "string", 7 | "enum": [ 8 | "0.9.1", 9 | "0.9.0" 10 | ], 11 | "description": "Specifies the Data Contract Specification being used." 12 | }, 13 | "id": { 14 | "type": "string", 15 | "description": "Specifies the identifier of the data contract." 16 | }, 17 | "info": { 18 | "type": "object", 19 | "properties": { 20 | "title": { 21 | "type": "string", 22 | "description": "The title of the data contract." 23 | }, 24 | "version": { 25 | "type": "string", 26 | "description": "The version of the data contract document (which is distinct from the Data Contract Specification version or the Data Product implementation version)." 27 | }, 28 | "description": { 29 | "type": "string", 30 | "description": "A description of the data contract." 31 | }, 32 | "owner": { 33 | "type": "string", 34 | "description": "The owner or team responsible for managing the data contract and providing the data." 35 | }, 36 | "contact": { 37 | "type": "object", 38 | "properties": { 39 | "name": { 40 | "type": "string", 41 | "description": "The identifying name of the contact person/organization." 42 | }, 43 | "url": { 44 | "type": "string", 45 | "format": "uri", 46 | "description": "The URL pointing to the contact information. This MUST be in the form of a URL." 47 | }, 48 | "email": { 49 | "type": "string", 50 | "format": "email", 51 | "description": "The email address of the contact person/organization. This MUST be in the form of an email address." 52 | } 53 | }, 54 | "description": "Contact information for the data contract." 55 | } 56 | }, 57 | "required": [ 58 | "title", 59 | "version" 60 | ], 61 | "description": "Metadata and life cycle information about the data contract." 
62 | }, 63 | "servers": { 64 | "type": "object", 65 | "additionalProperties": { 66 | "anyOf": [ 67 | { 68 | "type": "object", 69 | "properties": { 70 | "type": { 71 | "type": "string", 72 | "enum": [ 73 | "bigquery", 74 | "BigQuery" 75 | ], 76 | "description": "The type of the data product technology that implements the data contract." 77 | }, 78 | "project": { 79 | "type": "string", 80 | "description": "An optional string describing the server." 81 | }, 82 | "dataset": { 83 | "type": "string", 84 | "description": "An optional string describing the server." 85 | } 86 | }, 87 | "required": [ 88 | "type", 89 | "project", 90 | "dataset" 91 | ] 92 | }, 93 | { 94 | "type": "object", 95 | "properties": { 96 | "type": { 97 | "type": "string", 98 | "enum": [ 99 | "s3" 100 | ], 101 | "description": "The type of the data product technology that implements the data contract." 102 | }, 103 | "location": { 104 | "type": "string", 105 | "format": "uri", 106 | "description": "An optional string describing the server. Must be in the form of a URL." 107 | } 108 | }, 109 | "required": [ 110 | "type", 111 | "location" 112 | ] 113 | }, 114 | { 115 | "type": "object", 116 | "properties": { 117 | "type": { 118 | "type": "string", 119 | "enum": [ 120 | "redshift" 121 | ], 122 | "description": "The type of the data product technology that implements the data contract." 123 | }, 124 | "account": { 125 | "type": "string", 126 | "description": "An optional string describing the server." 127 | }, 128 | "database": { 129 | "type": "string", 130 | "description": "An optional string describing the server." 131 | }, 132 | "schema": { 133 | "type": "string", 134 | "description": "An optional string describing the server." 
135 | } 136 | }, 137 | "required": [ 138 | "type", 139 | "account", 140 | "database", 141 | "schema" 142 | ] 143 | }, 144 | { 145 | "type": "object", 146 | "properties": { 147 | "type": { 148 | "type": "string", 149 | "enum": [ 150 | "snowflake" 151 | ], 152 | "description": "The type of the data product technology that implements the data contract." 153 | }, 154 | "account": { 155 | "type": "string", 156 | "description": "An optional string describing the server." 157 | }, 158 | "database": { 159 | "type": "string", 160 | "description": "An optional string describing the server." 161 | }, 162 | "schema": { 163 | "type": "string", 164 | "description": "An optional string describing the server." 165 | } 166 | }, 167 | "required": [ 168 | "type", 169 | "account", 170 | "database", 171 | "schema" 172 | ] 173 | }, 174 | { 175 | "type": "object", 176 | "properties": { 177 | "type": { 178 | "type": "string", 179 | "enum": [ 180 | "databricks" 181 | ], 182 | "description": "The type of the data product technology that implements the data contract." 183 | }, 184 | "share": { 185 | "type": "string", 186 | "description": "An optional string describing the server." 187 | } 188 | }, 189 | "required": [ 190 | "type", 191 | "share" 192 | ] 193 | }, 194 | { 195 | "type": "object", 196 | "properties": { 197 | "type": { 198 | "type": "string", 199 | "enum": [ 200 | "kafka" 201 | ], 202 | "description": "The type of the data product technology that implements the data contract." 203 | }, 204 | "host": { 205 | "type": "string", 206 | "description": "An optional string describing the server." 207 | }, 208 | "topic": { 209 | "type": "string", 210 | "description": "An optional string describing the server." 211 | } 212 | }, 213 | "required": [ 214 | "type", 215 | "host", 216 | "topic" 217 | ] 218 | } 219 | ] 220 | }, 221 | "description": "Information about the servers." 
222 | }, 223 | "terms": { 224 | "type": "object", 225 | "description": "The terms and conditions of the data contract.", 226 | "properties": { 227 | "usage": { 228 | "type": "string", 229 | "description": "The usage describes the way the data is expected to be used. Can contain business and technical information." 230 | }, 231 | "limitations": { 232 | "type": "string", 233 | "description": "The limitations describe the restrictions on how the data can be used, can be technical or restrictions on what the data may not be used for." 234 | }, 235 | "billing": { 236 | "type": "string", 237 | "description": "The billing describes the pricing model for using the data, such as whether it's free, having a monthly fee, or metered pay-per-use." 238 | }, 239 | "noticePeriod": { 240 | "type": "string", 241 | "description": "The period of time that must be given by either party to terminate or modify a data usage agreement. Uses ISO-8601 period format, e.g., 'P3M' for a period of three months." 242 | } 243 | } 244 | }, 245 | "models": { 246 | "description": "Specifies the logical data model. Use the models name (e.g., the table name) as the key.", 247 | "type": "object", 248 | "minProperties": 1, 249 | "propertyNames": { 250 | "pattern": "^[a-zA-Z0-9_-]+$" 251 | }, 252 | "additionalProperties": { 253 | "type": "object", 254 | "properties": { 255 | "description": { 256 | "type": "string" 257 | }, 258 | "type": { 259 | "description": "The type of the model. Examples: table, object. Default: table.", 260 | "type": "string", 261 | "default": "table" 262 | }, 263 | "fields": { 264 | "description": "Specifies a field in the data model. Use the field name (e.g., the column name) as the key.", 265 | "type": "object", 266 | "additionalProperties": { 267 | "type": "object", 268 | "properties": { 269 | "$ref": { 270 | "type": "string", 271 | "description": "A reference URI to a definition in the specification, internally or externally. Properties will be inherited from the definition." 
272 | }, 273 | "type": { 274 | "type": "string", 275 | "description": "The logical data type of the field.", 276 | "enum": [ 277 | "number", "decimal", "numeric", 278 | "int", "integer", 279 | "long", "bigint", 280 | "float", 281 | "double", 282 | "string", "text", "varchar", 283 | "boolean", 284 | "timestamp", "timestamp_tz", 285 | "timestamp_ntz", 286 | "date", 287 | "array", 288 | "object", "record", "struct", 289 | "bytes", 290 | "null" 291 | ] 292 | }, 293 | "description": { 294 | "type": "string", 295 | "description": "An optional string describing the semantic of the data in this field." 296 | }, 297 | "pii": { 298 | "type": "boolean", 299 | "description": "An indication, if this field contains Personal Identifiable Information (PII)." 300 | }, 301 | "classification": { 302 | "type": "string", 303 | "description": "The data class defining the sensitivity level for this field, according to the organization's classification scheme.", 304 | "examples": ["sensitive", "restricted", "internal", "public"] 305 | }, 306 | "tags": { 307 | "type": "array", 308 | "items": { 309 | "type": "string" 310 | }, 311 | "description": "Custom metadata to provide additional context." 312 | } 313 | } 314 | } 315 | } 316 | } 317 | } 318 | }, 319 | "schema": { 320 | "type": "object", 321 | "properties": { 322 | "type": { 323 | "type": "string", 324 | "enum": [ 325 | "dbt", 326 | "bigquery", 327 | "json-schema", 328 | "sql-ddl", 329 | "avro", 330 | "protobuf", 331 | "custom" 332 | ], 333 | "description": "The type of the schema. Typical values are: dbt, bigquery, json-schema, sql-ddl, avro, protobuf, custom." 334 | }, 335 | "specification": { 336 | "anyOf": [ 337 | { 338 | "type": "string", 339 | "description": "The specification of the schema as a string." 340 | }, 341 | { 342 | "type": "object", 343 | "description": "The specification of the schema as an object." 
344 | } 345 | ] 346 | } 347 | }, 348 | "required": [ 349 | "type", 350 | "specification" 351 | ], 352 | "description": "The schema of the data contract describes the syntax and semantics of provided data sets. It supports different schema types." 353 | }, 354 | "examples": { 355 | "type": "array", 356 | "items": { 357 | "type": "object", 358 | "properties": { 359 | "type": { 360 | "type": "string", 361 | "enum": [ 362 | "csv", 363 | "json", 364 | "yaml", 365 | "custom" 366 | ], 367 | "description": "The type of the example data. Well-known types are: csv, json, yaml, custom." 368 | }, 369 | "description": { 370 | "type": "string", 371 | "description": "An optional string describing the example." 372 | }, 373 | "model": { 374 | "type": "string", 375 | "description": "The reference to the model in the schema, e.g., a table name." 376 | }, 377 | "data": { 378 | "type": "string", 379 | "description": "Example data for this model." 380 | } 381 | }, 382 | "required": [ 383 | "type", 384 | "data" 385 | ] 386 | }, 387 | "description": "The Examples Object is an array of Example Objects." 388 | }, 389 | "quality": { 390 | "type": "object", 391 | "properties": { 392 | "type": { 393 | "type": "string", 394 | "enum": [ 395 | "SodaCL", 396 | "montecarlo", 397 | "custom" 398 | ], 399 | "description": "The type of the quality check. Typical values are: SodaCL, montecarlo, custom." 400 | }, 401 | "specification": { 402 | "anyOf": [ 403 | { 404 | "type": "string", 405 | "description": "The specification of the quality attributes as a string." 406 | }, 407 | { 408 | "type": "object", 409 | "description": "The specification of the quality attributes as an object." 410 | } 411 | ] 412 | } 413 | }, 414 | "required": [ 415 | "type", 416 | "specification" 417 | ], 418 | "description": "The quality object contains quality attributes and checks." 
419 | } 420 | }, 421 | "required": [ 422 | "dataContractSpecification", 423 | "id", 424 | "info" 425 | ] 426 | } 427 | -------------------------------------------------------------------------------- /versions/0.9.2/datacontract.init.yaml: -------------------------------------------------------------------------------- 1 | dataContractSpecification: 0.9.2 2 | id: my-data-contract-id 3 | info: 4 | title: My Data Contract 5 | version: 0.0.1 6 | # description: 7 | # owner: 8 | # contact: 9 | # name: 10 | # url: 11 | # email: 12 | 13 | 14 | ### servers 15 | 16 | #servers: 17 | # production: 18 | # type: s3 19 | # location: s3:// 20 | # format: parquet 21 | # delimiter: new_line 22 | 23 | ### terms 24 | 25 | #terms: 26 | # usage: 27 | # limitations: 28 | # billing: 29 | # noticePeriod: 30 | 31 | 32 | ### models 33 | 34 | # models: 35 | # my_model: 36 | # description: 37 | # type: 38 | # fields: 39 | # my_field: 40 | # type: 41 | # description: 42 | 43 | 44 | ### definitions 45 | 46 | # definitions: 47 | # my_field: 48 | # domain: 49 | # name: 50 | # title: 51 | # type: 52 | # description: 53 | # example: 54 | # pii: 55 | # classification: 56 | 57 | 58 | ### examples 59 | 60 | #examples: 61 | # - type: csv 62 | # model: my_model 63 | # data: |- 64 | # id,timestamp,amount 65 | # "1001","2023-09-09T08:30:00Z",2500 66 | # "1002","2023-09-08T15:45:00Z",1800 67 | 68 | 69 | ### quality 70 | 71 | #quality: 72 | # type: SodaCL 73 | # specification: 74 | # checks for my_model: |- 75 | # - duplicate_count(id) = 0 76 | -------------------------------------------------------------------------------- /versions/0.9.2/datacontract.schema.json: -------------------------------------------------------------------------------- 1 | { 2 | "$schema": "http://json-schema.org/draft-07/schema#", 3 | "type": "object", 4 | "title": "DataContractSpecification", 5 | "properties": { 6 | "dataContractSpecification": { 7 | "type": "string", 8 | "title": "DataContractSpecificationVersion", 9 | "enum": [ 
10 | "0.9.2", 11 | "0.9.1", 12 | "0.9.0" 13 | ], 14 | "description": "Specifies the Data Contract Specification being used." 15 | }, 16 | "id": { 17 | "type": "string", 18 | "description": "Specifies the identifier of the data contract." 19 | }, 20 | "info": { 21 | "type": "object", 22 | "properties": { 23 | "title": { 24 | "type": "string", 25 | "description": "The title of the data contract." 26 | }, 27 | "version": { 28 | "type": "string", 29 | "description": "The version of the data contract document (which is distinct from the Data Contract Specification version or the Data Product implementation version)." 30 | }, 31 | "description": { 32 | "type": "string", 33 | "description": "A description of the data contract." 34 | }, 35 | "owner": { 36 | "type": "string", 37 | "description": "The owner or team responsible for managing the data contract and providing the data." 38 | }, 39 | "contact": { 40 | "type": "object", 41 | "properties": { 42 | "name": { 43 | "type": "string", 44 | "description": "The identifying name of the contact person/organization." 45 | }, 46 | "url": { 47 | "type": "string", 48 | "format": "uri", 49 | "description": "The URL pointing to the contact information. This MUST be in the form of a URL." 50 | }, 51 | "email": { 52 | "type": "string", 53 | "format": "email", 54 | "description": "The email address of the contact person/organization. This MUST be in the form of an email address." 55 | } 56 | }, 57 | "description": "Contact information for the data contract." 58 | } 59 | }, 60 | "required": [ 61 | "title", 62 | "version" 63 | ], 64 | "description": "Metadata and life cycle information about the data contract." 
65 | }, 66 | "servers": { 67 | "type": "object", 68 | "additionalProperties": { 69 | "oneOf": [ 70 | { 71 | "type": "object", 72 | "title": "BigQueryServer", 73 | "properties": { 74 | "type": { 75 | "type": "string", 76 | "enum": [ 77 | "bigquery", 78 | "BigQuery" 79 | ], 80 | "description": "The type of the data product technology that implements the data contract." 81 | }, 82 | "project": { 83 | "type": "string", 84 | "description": "An optional string describing the server." 85 | }, 86 | "dataset": { 87 | "type": "string", 88 | "description": "An optional string describing the server." 89 | } 90 | }, 91 | "additionalProperties": true, 92 | "required": [ 93 | "type", 94 | "project", 95 | "dataset" 96 | ] 97 | }, 98 | { 99 | "type": "object", 100 | "title": "S3Server", 101 | "properties": { 102 | "type": { 103 | "type": "string", 104 | "enum": [ 105 | "s3" 106 | ], 107 | "description": "The type of the data product technology that implements the data contract." 108 | }, 109 | "location": { 110 | "type": "string", 111 | "format": "uri", 112 | "description": "An optional string describing the server. Must be in the form of a URL." 113 | } 114 | }, 115 | "additionalProperties": true, 116 | "required": [ 117 | "type", 118 | "location" 119 | ] 120 | }, 121 | { 122 | "type": "object", 123 | "title": "RedshiftServer", 124 | "properties": { 125 | "type": { 126 | "type": "string", 127 | "enum": [ 128 | "redshift" 129 | ], 130 | "description": "The type of the data product technology that implements the data contract." 131 | }, 132 | "account": { 133 | "type": "string", 134 | "description": "An optional string describing the server." 135 | }, 136 | "database": { 137 | "type": "string", 138 | "description": "An optional string describing the server." 139 | }, 140 | "schema": { 141 | "type": "string", 142 | "description": "An optional string describing the server." 
143 | } 144 | }, 145 | "additionalProperties": true, 146 | "required": [ 147 | "type", 148 | "account", 149 | "database", 150 | "schema" 151 | ] 152 | }, 153 | { 154 | "type": "object", 155 | "title": "SnowflakeServer", 156 | "properties": { 157 | "type": { 158 | "type": "string", 159 | "enum": [ 160 | "snowflake" 161 | ], 162 | "description": "The type of the data product technology that implements the data contract." 163 | }, 164 | "account": { 165 | "type": "string", 166 | "description": "An optional string describing the server." 167 | }, 168 | "database": { 169 | "type": "string", 170 | "description": "An optional string describing the server." 171 | }, 172 | "schema": { 173 | "type": "string", 174 | "description": "An optional string describing the server." 175 | } 176 | }, 177 | "additionalProperties": true, 178 | "required": [ 179 | "type", 180 | "account", 181 | "database", 182 | "schema" 183 | ] 184 | }, 185 | { 186 | "type": "object", 187 | "title": "DatabricksServer", 188 | "properties": { 189 | "type": { 190 | "type": "string", 191 | "const": "databricks", 192 | "description": "The type of the data product technology that implements the data contract." 193 | }, 194 | "host": { 195 | "type": "string", 196 | "description": "The Databricks host", 197 | "examples": ["dbc-abcdefgh-1234.cloud.databricks.com"] 198 | }, 199 | "catalog": { 200 | "type": "string", 201 | "description": "The name of the Hive or Unity catalog" 202 | }, 203 | "schema": { 204 | "type": "string", 205 | "description": "The schema name in the catalog" 206 | } 207 | }, 208 | "additionalProperties": true, 209 | "required": [ 210 | "type", 211 | "host", 212 | "catalog", 213 | "schema" 214 | ] 215 | }, 216 | { 217 | "type": "object", 218 | "title": "PostgresServer", 219 | "properties": { 220 | "type": { 221 | "type": "string", 222 | "const": "postgres", 223 | "description": "The type of the data product technology that implements the data contract." 
224 | }, 225 | "host": { 226 | "type": "string", 227 | "description": "The host to the database server", 228 | "examples": ["localhost"] 229 | }, 230 | "port": { 231 | "type": "integer", 232 | "description": "The port to the database server." 233 | }, 234 | "database": { 235 | "type": "string", 236 | "description": "The name of the database.", 237 | "examples": ["postgres"] 238 | }, 239 | "schema": { 240 | "type": "string", 241 | "description": "The name of the schema in the database.", 242 | "examples": ["public"] 243 | } 244 | }, 245 | "additionalProperties": true, 246 | "required": [ 247 | "type", 248 | "host", 249 | "port", 250 | "database", 251 | "schema" 252 | ] 253 | }, 254 | { 255 | "type": "object", 256 | "title": "KafkaServer", 257 | "description": "Kafka Server", 258 | "properties": { 259 | "type": { 260 | "type": "string", 261 | "enum": [ 262 | "kafka" 263 | ], 264 | "description": "The type of the data product technology that implements the data contract." 265 | }, 266 | "host": { 267 | "type": "string", 268 | "description": "The bootstrap server of the kafka cluster." 269 | }, 270 | "topic": { 271 | "type": "string", 272 | "description": "The topic name." 273 | }, 274 | "format": { 275 | "type": "string", 276 | "description": "The format of the message. Examples: json, avro, protobuf. Default: json.", 277 | "default": "json" 278 | } 279 | }, 280 | "additionalProperties": true, 281 | "required": [ 282 | "type", 283 | "host", 284 | "topic" 285 | ] 286 | }, 287 | { 288 | "type": "object", 289 | "title": "PubSubServer", 290 | "properties": { 291 | "type": { 292 | "type": "string", 293 | "enum": [ 294 | "pubsub" 295 | ], 296 | "description": "The type of the data product technology that implements the data contract." 297 | }, 298 | "project": { 299 | "type": "string", 300 | "description": "The GCP project name." 301 | }, 302 | "topic": { 303 | "type": "string", 304 | "description": "The topic name." 
305 | } 306 | }, 307 | "additionalProperties": true, 308 | "required": [ 309 | "type", 310 | "project", 311 | "topic" 312 | ] 313 | }, 314 | { 315 | "type": "object", 316 | "title": "LocalServer", 317 | "properties": { 318 | "type": { 319 | "type": "string", 320 | "enum": [ 321 | "local" 322 | ], 323 | "description": "The type of the data product technology that implements the data contract." 324 | }, 325 | "path": { 326 | "type": "string", 327 | "description": "The relative or absolute path to the data file(s).", 328 | "examples": [ 329 | "./folder/data.parquet", 330 | "./folder/*.parquet" 331 | ] 332 | }, 333 | "format": { 334 | "type": "string", 335 | "description": "The format of the file(s)", 336 | "examples": ["json", "parquet", "csv"] 337 | } 338 | }, 339 | "additionalProperties": true, 340 | "required": [ 341 | "type", 342 | "path", 343 | "format" 344 | ] 345 | } 346 | 347 | ] 348 | }, 349 | "description": "Information about the servers." 350 | }, 351 | "terms": { 352 | "type": "object", 353 | "description": "The terms and conditions of the data contract.", 354 | "properties": { 355 | "usage": { 356 | "type": "string", 357 | "description": "The usage describes the way the data is expected to be used. Can contain business and technical information." 358 | }, 359 | "limitations": { 360 | "type": "string", 361 | "description": "The limitations describe the restrictions on how the data can be used, can be technical or restrictions on what the data may not be used for." 362 | }, 363 | "billing": { 364 | "type": "string", 365 | "description": "The billing describes the pricing model for using the data, such as whether it's free, having a monthly fee, or metered pay-per-use." 366 | }, 367 | "noticePeriod": { 368 | "type": "string", 369 | "description": "The period of time that must be given by either party to terminate or modify a data usage agreement. Uses ISO-8601 period format, e.g., 'P3M' for a period of three months." 
370 | } 371 | } 372 | }, 373 | "models": { 374 | "description": "Specifies the logical data model. Use the models name (e.g., the table name) as the key.", 375 | "type": "object", 376 | "minProperties": 1, 377 | "propertyNames": { 378 | "pattern": "^[a-zA-Z0-9_-]+$" 379 | }, 380 | "additionalProperties": { 381 | "type": "object", 382 | "title": "Model", 383 | "properties": { 384 | "description": { 385 | "type": "string" 386 | }, 387 | "type": { 388 | "description": "The type of the model. Examples: table, view, object. Default: table.", 389 | "type": "string", 390 | "title": "ModelType", 391 | "default": "table", 392 | "enum": [ 393 | "table", 394 | "view", 395 | "object" 396 | ] 397 | }, 398 | "fields": { 399 | "description": "Specifies a field in the data model. Use the field name (e.g., the column name) as the key.", 400 | "type": "object", 401 | "additionalProperties": { 402 | "type": "object", 403 | "title": "Field", 404 | "properties": { 405 | "description": { 406 | "type": "string", 407 | "description": "An optional string describing the semantic of the data in this field." 408 | }, 409 | "type": { 410 | "type": "string", 411 | "title": "FieldType", 412 | "description": "The logical data type of the field.", 413 | "enum": [ 414 | "number", 415 | "decimal", 416 | "numeric", 417 | "int", 418 | "integer", 419 | "long", 420 | "bigint", 421 | "float", 422 | "double", 423 | "string", 424 | "text", 425 | "varchar", 426 | "boolean", 427 | "timestamp", 428 | "timestamp_tz", 429 | "timestamp_ntz", 430 | "date", 431 | "array", 432 | "object", 433 | "record", 434 | "struct", 435 | "bytes", 436 | "null" 437 | ] 438 | }, 439 | "required": { 440 | "type": "boolean", 441 | "default": false, 442 | "description": "An indication, if this field must contain a value and may not be null." 443 | }, 444 | "primary": { 445 | "type": "boolean", 446 | "default": false, 447 | "description": "If this field is a primary key." 
448 | }, 449 | "unique": { 450 | "type": "boolean", 451 | "default": false, 452 | "description": "An indication, if the value must be unique within the model." 453 | }, 454 | "enum": { 455 | "type": "array", 456 | "items": { 457 | "type": "string" 458 | }, 459 | "uniqueItems": true, 460 | "description": "A value must be equal to one of the elements in this array value. Only evaluated if the value is not null." 461 | }, 462 | "minLength": { 463 | "type": "number", 464 | "description": "A value must greater than, or equal to, the value of this. Only applies to string types." 465 | }, 466 | "maxLength": { 467 | "type": "number", 468 | "description": "A value must less than, or equal to, the value of this. Only applies to string types." 469 | }, 470 | "format": { 471 | "type": "string", 472 | "description": "A specific format the value must comply with (e.g., 'email', 'uri', 'uuid')." 473 | }, 474 | "pattern": { 475 | "type": "string", 476 | "description": "A regular expression the value must match. Only applies to string types." 477 | }, 478 | "minimum": { 479 | "type": "number", 480 | "description": "A value of a number must greater than, or equal to, the value of this. Only evaluated if the value is not null. Only applies to numeric values." 481 | }, 482 | "exclusiveMinimum": { 483 | "type": "number", 484 | "description": "A value of a number must greater than the value of this. Only evaluated if the value is not null. Only applies to numeric values." 485 | }, 486 | "maximum": { 487 | "type": "number", 488 | "description": "A value of a number must less than, or equal to, the value of this. Only evaluated if the value is not null. Only applies to numeric values." 489 | }, 490 | "exclusiveMaximum": { 491 | "type": "number", 492 | "description": "A value of a number must less than the value of this. Only evaluated if the value is not null. Only applies to numeric values." 
493 | }, 494 | "example": { 495 | "type": "string", 496 | "description": "An example value for this field." 497 | }, 498 | "pii": { 499 | "type": "boolean", 500 | "description": "An indication, if this field contains Personal Identifiable Information (PII)." 501 | }, 502 | "classification": { 503 | "type": "string", 504 | "description": "The data class defining the sensitivity level for this field, according to the organization's classification scheme.", 505 | "examples": [ 506 | "sensitive", 507 | "restricted", 508 | "internal", 509 | "public" 510 | ] 511 | }, 512 | "tags": { 513 | "type": "array", 514 | "items": { 515 | "type": "string" 516 | }, 517 | "description": "Custom metadata to provide additional context." 518 | }, 519 | "$ref": { 520 | "type": "string", 521 | "description": "A reference URI to a definition in the specification, internally or externally. Properties will be inherited from the definition." 522 | } 523 | } 524 | } 525 | } 526 | } 527 | } 528 | }, 529 | "definitions": { 530 | "description": "Clear and concise explanations of syntax, semantic, and classification of business objects in a given domain.", 531 | "type": "object", 532 | "propertyNames": { 533 | "pattern": "^[a-zA-Z0-9_-]+$" 534 | }, 535 | "additionalProperties": { 536 | "type": "object", 537 | "title": "Definition", 538 | "properties": { 539 | "domain": { 540 | "type": "string", 541 | "description": "The domain in which this definition is valid.", 542 | "default": "global" 543 | }, 544 | "name": { 545 | "type": "string", 546 | "description": "The technical name of this definition." 547 | }, 548 | "title": { 549 | "type": "string", 550 | "description": "The business name of this definition." 551 | }, 552 | "description": { 553 | "type": "string", 554 | "description": "Clear and concise explanations related to the domain." 555 | }, 556 | "type": { 557 | "type": "string", 558 | "description": "The logical data type." 
559 | }, 560 | "minLength": { 561 | "type": "number", 562 | "description": "A value must be greater than or equal to this value. Applies only to string types." 563 | }, 564 | "maxLength": { 565 | "type": "number", 566 | "description": "A value must be less than or equal to this value. Applies only to string types." 567 | }, 568 | "format": { 569 | "type": "string", 570 | "description": "Specific format requirements for the value (e.g., 'email', 'uri', 'uuid')." 571 | }, 572 | "pattern": { 573 | "type": "string", 574 | "description": "A regular expression pattern the value must match. Applies only to string types." 575 | }, 576 | "example": { 577 | "type": "string", 578 | "description": "An example value." 579 | }, 580 | "pii": { 581 | "type": "boolean", 582 | "description": "Indicates if the field contains Personal Identifiable Information (PII)." 583 | }, 584 | "classification": { 585 | "type": "string", 586 | "description": "The data class defining the sensitivity level for this field." 587 | }, 588 | "tags": { 589 | "type": "array", 590 | "items": { 591 | "type": "string" 592 | }, 593 | "description": "Custom metadata to provide additional context." 594 | } 595 | }, 596 | "required": [ 597 | "name", 598 | "type" 599 | ] 600 | } 601 | }, 602 | "schema": { 603 | "type": "object", 604 | "properties": { 605 | "type": { 606 | "type": "string", 607 | "title": "SchemaType", 608 | "enum": [ 609 | "dbt", 610 | "bigquery", 611 | "json-schema", 612 | "sql-ddl", 613 | "avro", 614 | "protobuf", 615 | "custom" 616 | ], 617 | "description": "The type of the schema. Typical values are dbt, bigquery, json-schema, sql-ddl, avro, protobuf, custom." 618 | }, 619 | "specification": { 620 | "oneOf": [ 621 | { 622 | "type": "string", 623 | "description": "The specification of the schema as a string." 624 | }, 625 | { 626 | "type": "object", 627 | "description": "The specification of the schema as an object." 
628 | } 629 | ] 630 | } 631 | }, 632 | "required": [ 633 | "type", 634 | "specification" 635 | ], 636 | "description": "The schema of the data contract describes the syntax and semantics of provided data sets. It supports different schema types." 637 | }, 638 | "examples": { 639 | "type": "array", 640 | "items": { 641 | "type": "object", 642 | "properties": { 643 | "type": { 644 | "type": "string", 645 | "title": "ExampleType", 646 | "enum": [ 647 | "csv", 648 | "json", 649 | "yaml", 650 | "custom" 651 | ], 652 | "description": "The type of the example data. Well-known types are csv, json, yaml, custom." 653 | }, 654 | "description": { 655 | "type": "string", 656 | "description": "An optional string describing the example." 657 | }, 658 | "model": { 659 | "type": "string", 660 | "description": "The reference to the model in the schema, e.g., a table name." 661 | }, 662 | "data": { 663 | "oneOf": [{ 664 | "type": "string", 665 | "description": "Example data for this model." 666 | },{ 667 | "type": "array", 668 | "description": "Example data for this model in a structured format. Use this for type json or yaml." 669 | }] 670 | } 671 | }, 672 | "required": [ 673 | "type", 674 | "data" 675 | ] 676 | }, 677 | "description": "The Examples Object is an array of Example Objects." 678 | }, 679 | "quality": { 680 | "type": "object", 681 | "properties": { 682 | "type": { 683 | "type": "string", 684 | "title": "QualityType", 685 | "enum": [ 686 | "SodaCL", 687 | "montecarlo", 688 | "custom" 689 | ], 690 | "description": "The type of the quality check. Typical values are SodaCL, montecarlo, custom." 691 | }, 692 | "specification": { 693 | "oneOf": [ 694 | { 695 | "type": "string", 696 | "description": "The specification of the quality attributes as a string." 697 | }, 698 | { 699 | "type": "object", 700 | "description": "The specification of the quality attributes as an object." 
701 | } 702 | ] 703 | } 704 | }, 705 | "required": [ 706 | "type", 707 | "specification" 708 | ], 709 | "description": "The quality object contains quality attributes and checks." 710 | } 711 | }, 712 | "required": [ 713 | "dataContractSpecification", 714 | "id", 715 | "info" 716 | ] 717 | } 718 | -------------------------------------------------------------------------------- /versions/0.9.3/datacontract.init.yaml: -------------------------------------------------------------------------------- 1 | dataContractSpecification: 0.9.3 2 | id: my-data-contract-id 3 | info: 4 | title: My Data Contract 5 | version: 0.0.1 6 | # description: 7 | # owner: 8 | # contact: 9 | # name: 10 | # url: 11 | # email: 12 | 13 | 14 | ### servers 15 | 16 | #servers: 17 | # production: 18 | # type: s3 19 | # location: s3:// 20 | # format: parquet 21 | # delimiter: new_line 22 | 23 | ### terms 24 | 25 | #terms: 26 | # usage: 27 | # limitations: 28 | # billing: 29 | # noticePeriod: 30 | 31 | 32 | ### models 33 | 34 | # models: 35 | # my_model: 36 | # description: 37 | # type: 38 | # fields: 39 | # my_field: 40 | # type: 41 | # description: 42 | 43 | 44 | ### definitions 45 | 46 | # definitions: 47 | # my_field: 48 | # domain: 49 | # name: 50 | # title: 51 | # type: 52 | # description: 53 | # example: 54 | # pii: 55 | # classification: 56 | 57 | 58 | ### examples 59 | 60 | #examples: 61 | # - type: csv 62 | # model: my_model 63 | # data: |- 64 | # id,timestamp,amount 65 | # "1001","2023-09-09T08:30:00Z",2500 66 | # "1002","2023-09-08T15:45:00Z",1800 67 | 68 | ### servicelevels 69 | 70 | #servicelevels: 71 | # availability: 72 | # description: The server is available during support hours 73 | # percentage: 99.9% 74 | # retention: 75 | # description: Data is retained for one year because! 
76 | # period: P1Y 77 | # unlimited: false 78 | # latency: 79 | # description: Data is available within 25 hours after the order was placed 80 | # threshold: 25h 81 | # sourceTimestampField: orders.order_timestamp 82 | # processedTimestampField: orders.processed_timestamp 83 | # freshness: 84 | # description: The age of the youngest row in a table. 85 | # threshold: 25h 86 | # timestampField: orders.order_timestamp 87 | # frequency: 88 | # description: Data is delivered once a day 89 | # type: batch # or streaming 90 | # interval: daily # for batch, either or cron 91 | # cron: 0 0 * * * # for batch, either or interval 92 | # support: 93 | # description: The data is available during typical business hours at headquarters 94 | # time: 9am to 5pm in EST on business days 95 | # responseTime: 1h 96 | # backup: 97 | # description: Data is backed up once a week, every Sunday at 0:00 UTC. 98 | # interval: weekly 99 | # cron: 0 0 * * 0 100 | # recoveryTime: 24 hours 101 | # recoveryPoint: 1 week 102 | 103 | ### quality 104 | 105 | #quality: 106 | # type: SodaCL 107 | # specification: 108 | # checks for my_model: |- 109 | # - duplicate_count(id) = 0 -------------------------------------------------------------------------------- /versions/0.9.3/definition.schema.json: -------------------------------------------------------------------------------- 1 | { 2 | "$schema": "http://json-schema.org/draft-07/schema#", 3 | "type": "object", 4 | "description": "Clear and concise explanations of syntax, semantic, and classification of business objects in a given domain.", 5 | "properties": { 6 | "domain": { 7 | "type": "string", 8 | "description": "The domain in which this definition is valid.", 9 | "default": "global" 10 | }, 11 | "name": { 12 | "type": "string", 13 | "description": "The technical name of this definition." 14 | }, 15 | "title": { 16 | "type": "string", 17 | "description": "The business name of this definition." 
18 | }, 19 | "description": { 20 | "type": "string", 21 | "description": "Clear and concise explanations related to the domain." 22 | }, 23 | "type": { 24 | "type": "string", 25 | "description": "The logical data type." 26 | }, 27 | "minLength": { 28 | "type": "integer", 29 | "description": "A value must be greater than or equal to this value. Applies only to string types." 30 | }, 31 | "maxLength": { 32 | "type": "integer", 33 | "description": "A value must be less than or equal to this value. Applies only to string types." 34 | }, 35 | "format": { 36 | "type": "string", 37 | "description": "Specific format requirements for the value (e.g., 'email', 'uri', 'uuid')." 38 | }, 39 | "precision": { 40 | "type": "integer", 41 | "examples": [ 42 | 38 43 | ], 44 | "description": "The maximum number of digits in a number. Only applies to numeric values. Defaults to 38." 45 | }, 46 | "scale": { 47 | "type": "integer", 48 | "examples": [ 49 | 0 50 | ], 51 | "description": "The maximum number of decimal places in a number. Only applies to numeric values. Defaults to 0." 52 | }, 53 | "pattern": { 54 | "type": "string", 55 | "description": "A regular expression pattern the value must match. Applies only to string types." 56 | }, 57 | "example": { 58 | "type": "string", 59 | "description": "An example value." 60 | }, 61 | "pii": { 62 | "type": "boolean", 63 | "description": "Indicates if the field contains Personal Identifiable Information (PII)." 64 | }, 65 | "classification": { 66 | "type": "string", 67 | "description": "The data class defining the sensitivity level for this field." 68 | }, 69 | "tags": { 70 | "type": "array", 71 | "items": { 72 | "type": "string" 73 | }, 74 | "description": "Custom metadata to provide additional context." 
75 | }, 76 | "links": { 77 | "type": "object", 78 | "description": "Links to external resources.", 79 | "minProperties": 1, 80 | "propertyNames": { 81 | "pattern": "^[a-zA-Z0-9_-]+$" 82 | }, 83 | "additionalProperties": { 84 | "type": "string", 85 | "title": "Link", 86 | "description": "A URL to an external resource.", 87 | "format": "uri", 88 | "examples": [ 89 | "https://example.com" 90 | ] 91 | } 92 | } 93 | }, 94 | "required": [ 95 | "name", 96 | "type" 97 | ] 98 | } -------------------------------------------------------------------------------- /workshop.md: -------------------------------------------------------------------------------- 1 | # Data Contract Workshop 2 | 3 | Bring data producers and consumers together to define data contracts in a facilitated workshop. 4 | 5 | ## Goal 6 | 7 | A defined and agreed-upon data contract between data producers and consumers. 8 | 9 | ## Participants 10 | 11 | - Facilitator 12 | - Neutral moderator and typist 13 | - Should know the data contract format in use ([Data Contract Specification](https://datacontract.com) or [ODCS](https://bitol-io.github.io/open-data-contract-standard/latest/)) and its tools well 14 | - Get the [authors of the Data Contract Specification](https://datacontract.com/#authors) as facilitators for your workshop. 15 | - Data producer 16 | - Product Owner 17 | - Software Engineers 18 | - Data consumers 19 | - Product Owner 20 | - Data Engineers / Scientists / Analysts 21 | 22 | Recommendation: keep the group small (no more than 5 people) 23 | 24 | ## Settings 25 | 26 | - Show the data contract on the screen for the whole workshop (projector, screenshare, ...) 27 | - Facilitator is the typist 28 | - Facilitator is the moderator 29 | - Data Producer and Data Consumers discuss and give commands to the facilitator 30 | 31 | ## Guidelines for the Data Contract Specification 32 | 33 | ### Recommended Order of Completion (Data Contract Specification) 34 | 35 | 1. Info (get the context) 36 | 2. 
Examples (example-driven facilitation) 37 | 3. Model (you will spend most of your time here) 38 | - Use the [Data Contract CLI](https://cli.datacontract.com) to test the model against the previously created examples:\\ 39 | `datacontract test --examples datacontract.yaml` 40 | 4. Quality 41 | 5. Terms 42 | 6. Servers (if already applicable) 43 | - Start with a "local" server with actual, real data you downloaded 44 | - Use the [Data Contract CLI](https://cli.datacontract.com) to test the model against the actual data on a specific server:\\ 45 | `datacontract test datacontract.yaml` 46 | - Switch to the actual remote server, if applicable 47 | 48 | ### Tooling (Data Contract Specification) 49 | 50 | - Open the [starter template](https://datacontract.com/datacontract.init.yaml) in the [Data Contract Editor](https://editor.datacontract.com) and get going. If you lack an experienced facilitator, ignore any validation errors and warnings within the editor. 51 | - Use the [Data Contract Editor](https://editor.datacontract.com) to share the results of the workshop afterward with the participants and other stakeholders. 52 | - Use the [Data Contract CLI](https://cli.datacontract.com) to validate the data contract after the workshop. 53 | - Use the [Data Mesh Manager](https://www.datamesh-manager.com) to publish the data contract and have it in a central place 54 | 55 | ## Guidelines for ODCS 56 | 57 | We recommend using the [Excel template](https://github.com/datacontract/open-data-contract-standard-excel-template) for workshops: its visualization makes it easier to work with in such a setting. 58 | 59 | ### Recommended Order of Completion (ODCS) 60 | 61 | 1. Fundamentals (get the context) 62 | - **[Fill in the fundamentals](https://bitol-io.github.io/open-data-contract-standard/latest/#fundamentals)** consisting of id, name, version, status, and description. 63 | 2. 
Schema (you will spend most of your time here) 64 | - **[Fill in the schemas](https://bitol-io.github.io/open-data-contract-standard/latest/#schema)** (tables) and their properties (columns) along with their name and logicalType as a start in the schema part. 65 | - After that, add information like `description`, `classification`, ... 66 | - Use tags or customProperties to add additional metadata where ODCS has no direct support 67 | 3. Quality 68 | - **[Add quality checks](https://bitol-io.github.io/open-data-contract-standard/latest/#data-quality)** at the schema or the property level. Start with quality checks of type text first to capture the requirements. 69 | - OPTIONAL: Convert the text-based requirements into automated sql-based quality checks 70 | 4. SLAs 71 | - **[Add SLAs](https://bitol-io.github.io/open-data-contract-standard/latest/#service-level-agreement-sla)** that the data provider guarantees towards all data consumers. 72 | 5. Team & Support 73 | - **[Add the team members](https://bitol-io.github.io/open-data-contract-standard/latest/#team)** so that the data consumer knows who is part of the team that owns the data protected by the data contracts. 74 | - **[Add a support channel](https://bitol-io.github.io/open-data-contract-standard/latest/#support-and-communication-channels)** so (potential) data consumers know how to get support and reach the data owners. 75 | 6. 
Servers (if already applicable) 76 | - **[Add the server information](https://bitol-io.github.io/open-data-contract-standard/latest/#infrastructure-and-servers)** on where the data is available 77 | - Use the [Data Contract CLI](https://cli.datacontract.com) to test the schema against the actual data on a specific server:\\ 78 | `datacontract test datacontract.yaml` 79 | 80 | ### Tooling (ODCS) 81 | 82 | - Use the [Excel template](https://github.com/datacontract/open-data-contract-standard-excel-template) for the workshop 83 | - Use the [Data Contract CLI](https://cli.datacontract.com) to validate the data contract after the workshop. 84 | - Use the [Data Mesh Manager](https://www.datamesh-manager.com) to publish the data contract and have it in a central place 85 | 86 | ## Related 87 | 88 | - This data contract workshop could be a follow-up to a data product design workshop using the [Data Product Canvas](https://www.datamesh-architecture.com/data-product-canvas), making the offered contract at the output port of the designed data product more concrete. 89 | --------------------------------------------------------------------------------