├── sources
│   ├── INSTRUCTIONS.md
│   ├── datacontract.com
│   │   ├── CHANGELOG.md
│   │   ├── workshop.md
│   │   ├── datacontract.init.yaml
│   │   ├── definition.schema.json
│   │   ├── datacontract.yaml
│   │   └── datacontract.schema.json
│   └── cli.datacontract.com
│       ├── CHANGELOG.md
│       └── README.md
├── CNAME
├── images
│   ├── favicon.png
│   ├── datacontract-gpt-browser.png
│   ├── datacontract-gpt-social-media.png
│   └── supported-by-innoq--petrol-apricot.svg
├── _config.yml
├── example_shipment.yaml
├── _layouts
│   └── default.html
├── example_datacontract.yaml
└── README.md
/sources/INSTRUCTIONS.md: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /CNAME: -------------------------------------------------------------------------------- 1 | gpt.datacontract.com -------------------------------------------------------------------------------- /images/favicon.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/datacontract/datacontract-gpt/main/images/favicon.png -------------------------------------------------------------------------------- /images/datacontract-gpt-browser.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/datacontract/datacontract-gpt/main/images/datacontract-gpt-browser.png -------------------------------------------------------------------------------- /images/datacontract-gpt-social-media.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/datacontract/datacontract-gpt/main/images/datacontract-gpt-social-media.png -------------------------------------------------------------------------------- /_config.yml: -------------------------------------------------------------------------------- 1 | plugins: 2 | - jekyll-sitemap 3 | name: Data Contract GPT 4 | title: null 5 | description: Data Contract GPT for the Data Contract Specification to interactively create and modify data contracts. 6 | -------------------------------------------------------------------------------- /example_shipment.yaml: -------------------------------------------------------------------------------- 1 | shipment_id: "123e4567-e89b-12d3-a456-426614174000" 2 | origin: "New York, NY" 3 | destination: "Los Angeles, CA" 4 | shipment_date: "2024-01-01T10:00:00Z" 5 | delivery_date: "2024-01-05T15:00:00Z" 6 | status: "delivered" -------------------------------------------------------------------------------- /sources/datacontract.com/CHANGELOG.md: -------------------------------------------------------------------------------- 1 | # Changelog 2 | 3 | All notable changes to this project will be documented in this file. 4 | 5 | The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/), 6 | and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html). 7 | 8 | ## [Unreleased] 9 | 10 | Please note, while the major version is zero (0.y.z), anything MAY change at any time. The public API SHOULD NOT be considered stable.
12 | 13 | ### Added 14 | - Support for server-specific data types as config map ([#63](https://github.com/datacontract/datacontract-specification/issues/63)) 15 | - AWS Glue Catalog server support 16 | - sftp server support 17 | - info.status field 18 | - oracle server support 19 | - field.title attribute 20 | - model.title attribute 21 | - AWS Kinesis Data Streams server support 22 | 23 | ## [0.9.3] - 2024-03-06 24 | 25 | ### Added 26 | 27 | - Service levels as a top level `servicelevels` element 28 | - pubsub server support 29 | - primary key and relationship support via `field.primary` and `field.references` attributes 30 | - databricks server support improved 31 | 32 | ## [0.9.2] - 2024-01-04 33 | 34 | ### Added 35 | 36 | - Format and validation attributes to fields in models and definitions 37 | - Postgres support 38 | - Databricks support 39 | 40 | ## [0.9.1] - 2023-11-19 41 | 42 | ### Added 43 | 44 | - A logical data model (#13), mainly to simplify editor support with a defined schema, make it easier to detect breaking changes, and improve Databricks support. 45 | - Definitions (#14) for reusable semantic definitions within one data contract or across data contracts. 46 | 47 | ### Removed 48 | 49 | - Property `info.dataProduct` as data products should define which data contracts they implement. 50 | - Property `info.outputPort` as data products should define which data contracts they implement. 51 | 52 | Those removals are not considered breaking changes, as these attributes are now treated as specification extensions. 53 | 54 | ## [0.9.0] - 2023-09-12 55 | 56 | First public release. 57 | -------------------------------------------------------------------------------- /sources/datacontract.com/workshop.md: -------------------------------------------------------------------------------- 1 | # Data Contract Workshop 2 | 3 | Bring data producers and consumers together to define data contracts in a facilitated workshop. 4 | 5 | ## Goal 6 | 7 | A defined and agreed-upon data contract between data producers and consumers. 8 | 9 | ## Participants 10 | 11 | - Facilitator 12 | - Neutral moderator and typist 13 | - Should know the [Data Contract Specification](https://datacontract.com) and its tools well 14 | - Get the [authors of the Data Contract Specification](https://datacontract.com/#authors) as facilitators for your workshop. 15 | - Data producer 16 | - Product Owner 17 | - Software Engineers 18 | - Data consumers 19 | - Product Owner 20 | - Data Engineers / Scientists / Analysts 21 | 22 | Recommendation: keep the group small (not more than 5 people) 23 | 24 | ## Settings 25 | 26 | - Show the data contract on the screen for the whole workshop (projector, screenshare, ...) 27 | - Facilitator is the typist 28 | - Facilitator is the moderator 29 | - Data Producer and Data Consumers discuss and give commands to the facilitator 30 | 31 | ## Recommended Order of Completion 32 | 33 | 1. Info (get the context) 34 | 2. Examples (example-driven facilitation) 35 | 3. Model (you will spend most of your time here) 36 | - Use the [Data Contract CLI](https://cli.datacontract.com) to test the model against the previously created examples:\\ 37 | `datacontract test --examples datacontract.yaml` 38 | 4. Quality 39 | 5. Terms 40 | 6. 
Servers (if already applicable) 41 | - Start with a "local" server with actual, real data you downloaded 42 | - Use the [Data Contract CLI](https://cli.datacontract.com) to test the model against the actual data on a specific server:\\ 43 | `datacontract test datacontract.yaml` 44 | - Switch to the actual remote server, if applicable 45 | 46 | ## Tooling 47 | 48 | - Open the [starter template](https://datacontract.com/datacontract.init.yaml) in the [Data Contract Studio](https://studio.datacontract.com) and get going. If you lack an experienced facilitator, ignore any validation errors and warnings within the studio. 49 | - Use the [Data Contract Studio](https://studio.datacontract.com) to share the results of the workshop afterward with the participants and other stakeholders. 50 | - Use the [Data Contract CLI](https://cli.datacontract.com) to validate the data contract after the workshop. 51 | 52 | ## Related 53 | 54 | - This data contract workshop could be a followup to a data product design workshop using the [Data Product Canvas](https://www.datamesh-architecture.com/data-product-canvas), making the offered contract at the output port of the designed data product more concrete. 55 | -------------------------------------------------------------------------------- /sources/datacontract.com/datacontract.init.yaml: -------------------------------------------------------------------------------- 1 | dataContractSpecification: 0.9.3 2 | id: my-data-contract-id 3 | info: 4 | title: My Data Contract 5 | version: 0.0.1 6 | # description: 7 | # owner: 8 | # contact: 9 | # name: 10 | # url: 11 | # email: 12 | 13 | 14 | ### servers 15 | 16 | #servers: 17 | # production: 18 | # type: s3 19 | # location: s3:// 20 | # format: parquet 21 | # delimiter: new_line 22 | 23 | ### terms 24 | 25 | #terms: 26 | # usage: 27 | # limitations: 28 | # billing: 29 | # noticePeriod: 30 | 31 | 32 | ### models 33 | 34 | # models: 35 | # my_model: 36 | # description: 37 | # type: 38 | # fields: 39 | # my_field: 40 | # type: 41 | # description: 42 | 43 | 44 | ### definitions 45 | 46 | # definitions: 47 | # my_field: 48 | # domain: 49 | # name: 50 | # title: 51 | # type: 52 | # description: 53 | # example: 54 | # pii: 55 | # classification: 56 | 57 | 58 | ### examples 59 | 60 | #examples: 61 | # - type: csv 62 | # model: my_model 63 | # data: |- 64 | # id,timestamp,amount 65 | # "1001","2023-09-09T08:30:00Z",2500 66 | # "1002","2023-09-08T15:45:00Z",1800 67 | 68 | ### servicelevels 69 | 70 | #servicelevels: 71 | # availability: 72 | # description: The server is available during support hours 73 | # percentage: 99.9% 74 | # retention: 75 | # description: Data is retained for one year because! 76 | # period: P1Y 77 | # unlimited: false 78 | # latency: 79 | # description: Data is available within 25 hours after the order was placed 80 | # threshold: 25h 81 | # sourceTimestampField: orders.order_timestamp 82 | # processedTimestampField: orders.processed_timestamp 83 | # freshness: 84 | # description: The age of the youngest row in a table. 
85 | # threshold: 25h 86 | # timestampField: orders.order_timestamp 87 | # frequency: 88 | # description: Data is delivered once a day 89 | # type: batch # or streaming 90 | # interval: daily # for batch, either or cron 91 | # cron: 0 0 * * * # for batch, either or interval 92 | # support: 93 | # description: The data is available during typical business hours at headquarters 94 | # time: 9am to 5pm in EST on business days 95 | # responseTime: 1h 96 | # backup: 97 | # description: Data is backed up once a week, every Sunday at 0:00 UTC. 98 | # interval: weekly 99 | # cron: 0 0 * * 0 100 | # recoveryTime: 24 hours 101 | # recoveryPoint: 1 week 102 | 103 | ### quality 104 | 105 | #quality: 106 | # type: SodaCL 107 | # specification: 108 | # checks for my_model: |- 109 | # - duplicate_count(id) = 0 110 | -------------------------------------------------------------------------------- /sources/datacontract.com/definition.schema.json: -------------------------------------------------------------------------------- 1 | { 2 | "$schema": "http://json-schema.org/draft-07/schema#", 3 | "type": "object", 4 | "description": "Clear and concise explanations of syntax, semantic, and classification of business objects in a given domain.", 5 | "properties": { 6 | "domain": { 7 | "type": "string", 8 | "description": "The domain in which this definition is valid.", 9 | "default": "global" 10 | }, 11 | "name": { 12 | "type": "string", 13 | "description": "The technical name of this definition." 14 | }, 15 | "title": { 16 | "type": "string", 17 | "description": "The business name of this definition." 18 | }, 19 | "description": { 20 | "type": "string", 21 | "description": "Clear and concise explanations related to the domain." 22 | }, 23 | "type": { 24 | "type": "string", 25 | "description": "The logical data type." 26 | }, 27 | "minLength": { 28 | "type": "integer", 29 | "description": "A value must be greater than or equal to this value. Applies only to string types." 30 | }, 31 | "maxLength": { 32 | "type": "integer", 33 | "description": "A value must be less than or equal to this value. Applies only to string types." 34 | }, 35 | "format": { 36 | "type": "string", 37 | "description": "Specific format requirements for the value (e.g., 'email', 'uri', 'uuid')." 38 | }, 39 | "precision": { 40 | "type": "integer", 41 | "examples": [ 42 | 38 43 | ], 44 | "description": "The maximum number of digits in a number. Only applies to numeric values. Defaults to 38." 45 | }, 46 | "scale": { 47 | "type": "integer", 48 | "examples": [ 49 | 0 50 | ], 51 | "description": "The maximum number of decimal places in a number. Only applies to numeric values. Defaults to 0." 52 | }, 53 | "pattern": { 54 | "type": "string", 55 | "description": "A regular expression pattern the value must match. Applies only to string types." 56 | }, 57 | "example": { 58 | "type": "string", 59 | "description": "An example value." 60 | }, 61 | "pii": { 62 | "type": "boolean", 63 | "description": "Indicates if the field contains Personal Identifiable Information (PII)." 64 | }, 65 | "classification": { 66 | "type": "string", 67 | "description": "The data class defining the sensitivity level for this field." 68 | }, 69 | "tags": { 70 | "type": "array", 71 | "items": { 72 | "type": "string" 73 | }, 74 | "description": "Custom metadata to provide additional context." 
75 | } 76 | }, 77 | "required": [ 78 | "name", 79 | "type" 80 | ] 81 | } 82 | -------------------------------------------------------------------------------- /_layouts/default.html: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | {% seo %} 15 | 16 | 26 | 27 | 28 |
29 | {% if site.title and site.title != page.title %} 30 | {{ site.title }} 31 | {% endif %} 32 | 33 | {{ content }} 34 | 35 | {% if site.github.private != true and site.github.license %} 36 | 39 | {% endif %} 40 |
41 | 53 | 54 | 55 | {% if site.google_analytics %} 56 | 64 | {% endif %} 65 | 66 | 67 | 68 | 69 | 70 | -------------------------------------------------------------------------------- /images/supported-by-innoq--petrol-apricot.svg: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /example_datacontract.yaml: -------------------------------------------------------------------------------- 1 | dataContractSpecification: "0.9.3" 2 | id: "shipment-data-contract" 3 | info: 4 | title: "Shipment Data Contract" 5 | version: "1.0.0" 6 | status: "active" 7 | description: "Data contract for shipment information." 8 | owner: "Logistics Team" 9 | contact: 10 | name: "Logistics Support" 11 | email: "logistics-support@example.com" 12 | servers: 13 | S3: 14 | type: "s3" 15 | location: "s3://logistics-data/shipments" 16 | format: "json" 17 | models: 18 | shipments: 19 | description: "Details of each shipment." 20 | type: "table" 21 | fields: 22 | shipment_id: 23 | type: "string" 24 | description: "Unique identifier for the shipment." 25 | primary: true 26 | required: true 27 | sender: 28 | type: "string" 29 | description: "Name of the sender." 30 | required: true 31 | recipient: 32 | type: "string" 33 | description: "Name of the recipient." 34 | required: true 35 | origin: 36 | type: "string" 37 | description: "Origin location of the shipment." 38 | required: true 39 | destination: 40 | type: "string" 41 | description: "Destination location of the shipment." 42 | required: true 43 | weight: 44 | type: "decimal" 45 | description: "Weight of the shipment in kilograms." 46 | precision: 10 47 | scale: 2 48 | required: true 49 | shipped_date: 50 | type: "timestamp" 51 | description: "Date and time when the shipment was sent." 52 | required: true 53 | delivery_date: 54 | type: "timestamp" 55 | description: "Date and time when the shipment was delivered." 56 | status: 57 | type: "string" 58 | description: "Current status of the shipment." 
59 | enum: ["pending", "shipped", "in_transit", "delivered", "canceled"] 60 | required: true 61 | examples: 62 | - type: "json" 63 | model: "shipments" 64 | data: 65 | - shipment_id: "SHIP12345" 66 | sender: "Company A" 67 | recipient: "Customer B" 68 | origin: "New York, NY" 69 | destination: "San Francisco, CA" 70 | weight: 15.75 71 | shipped_date: "2023-06-01T10:00:00Z" 72 | delivery_date: "2023-06-03T15:30:00Z" 73 | status: "delivered" 74 | - type: "json" 75 | model: "shipments" 76 | data: 77 | - shipment_id: "SHIP12346" 78 | sender: "Company C" 79 | recipient: "Customer D" 80 | origin: "Los Angeles, CA" 81 | destination: "Seattle, WA" 82 | weight: 5.20 83 | shipped_date: "2023-06-01T11:00:00Z" 84 | status: "in_transit" 85 | quality: 86 | type: "SodaCL" 87 | specification: | 88 | checks for shipments: 89 | - schema: 90 | fail: 91 | when required column missing: 92 | - shipment_id 93 | - sender 94 | - recipient 95 | - origin 96 | - destination 97 | - weight 98 | - shipped_date 99 | - status 100 | - values in shipment_id: 101 | must not be null: 102 | fail: 103 | count: 1 104 | - values in weight: 105 | must not be null: 106 | fail: 107 | count: 1 108 | must be between 0 and 1000: 109 | fail: 110 | count: 1 111 | - values in status: 112 | must be in: 113 | - pending 114 | - shipped 115 | - in_transit 116 | - delivered 117 | - canceled 118 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Data Contract GPT 2 | 3 |

4 | Slack Status 5 |

6 | 7 | [Data Contract GPT](https://chatgpt.com/g/g-QGMQrqm3p-data-contract-gpt) for the [Data Contract Specification](https://datacontract.com) to create data contracts from your example data, chat with your data contracts, or export in other formats. And yes, we know that GPTs can be wrong. That's why we also offer a [deterministic open source tool around data contracts written in Python](https://cli.datacontract.com). 8 | 9 | [![Data Contract GPT Browser Window](/images/datacontract-gpt-browser.png){: width="100%"}](https://chatgpt.com/g/g-QGMQrqm3p-data-contract-gpt) 10 | 11 | ## Create an initial data contract for your domain 12 | 13 | [Open Example in Data Contract GPT (Requires free ChatGPT account)](https://chatgpt.com/share/9695cd45-0352-4d9a-bae2-6ee1b236a1a0) 14 | 15 | > **User** 16 | > 17 | > Create a data contract for data in the logistics domain for my table shipments in Snowflake. 18 | 19 | > **Data Contract GPT** 20 | > 21 | > ```yaml 22 | > dataContractSpecification: "0.9.3" 23 | > id: "logistics_shipments_001" 24 | > info: 25 | > title: "Logistics Shipments" 26 | > version: "1.0.0" 27 | > status: "active" 28 | > description: "Data contract for the shipments table in the logistics domain." 29 | > owner: "Logistics Team" 30 | > ... 31 | > ``` 32 | 33 | ## Create a data contract from example data 34 | 35 | [Open Example in Data Contract GPT (Requires free ChatGPT account)](https://chatgpt.com/share/82614125-2152-4603-b52d-03619e59699e) 36 | 37 | > **User** 38 | > 39 | > *Upload [example data](example_shipment.yaml) to ChatGPT* 40 | > 41 | > Create a data contract for the uploaded example data. 42 | 43 | > **Data Contract GPT** 44 | > 45 | > ```yaml 46 | > dataContractSpecification: "0.9.3" 47 | > id: "shipment-data-contract" 48 | > info: 49 | > title: "Shipment Data Contract" 50 | > version: "1.0.0" 51 | > status: "active" 52 | > description: | 53 | > This data contract defines the structure and requirements for shipment data, including details 54 | > about the shipment ID, order ID, shipping address, shipment date, and delivery date. 55 | > owner: "Logistics Team" 56 | > ... 57 | > ``` 58 | 59 | ## Extend an existing data contract interactively 60 | 61 | [Open Example in Data Contract GPT (Requires free ChatGPT account)](https://chatgpt.com/share/fc3486e4-c289-4314-8e29-e1ea26b744dd) 62 | 63 | > **User** 64 | > 65 | > *Paste [example contract](example_datacontract.yaml) into prompt* 66 | > 67 | > Add new field location to this data contract. 68 | 69 | > **Data Contract GPT** 70 | > 71 | > ```yaml 72 | > ... 73 | > models: 74 | > shipments: 75 | > fields: 76 | > location: 77 | > type: "string" 78 | > description: "Current location of the shipment." 79 | > ... 80 | > examples: 81 | > - type: "json" 82 | > model: "shipments" 83 | > data: 84 | > - shipment_id: "SHIP12345" 85 | > ... 86 | > location: "San Francisco, CA" 87 | > ... 88 | > ``` 89 | 90 | > **User** 91 | > 92 | > Can you give recommendations for the service levels? 93 | 94 | > **Data Contract GPT** 95 | > 96 | > ```yaml 97 | > servicelevels: 98 | > availability: 99 | > description: "The server is available 24/7 with a guaranteed uptime." 100 | > percentage: "99.9%" 101 | > retention: 102 | > description: "Data is retained for one year from the shipment date." 103 | > period: "P1Y" 104 | > latency: 105 | > description: "Data is available within 24 hours after the shipment is processed." 106 | > threshold: "PT24H" 107 | > sourceTimestampField: "shipped_date" 108 | > processedTimestampField: "delivery_date" 109 | > ... 
110 | > ``` 111 | 112 | ## You have even better prompt ideas? 113 | 114 | Help others and contribute your great prompt ideas to [datacontract-gpt repository](https://github.com/datacontract/datacontract-gpt). 115 | 116 | ## Credits 117 | 118 | Created by [Simon Harrer](https://www.linkedin.com/in/simonharrer/) and [Jochen Christ](https://www.linkedin.com/in/jochenchrist/), with the help of many folks from INNOQ. 119 | 120 | 121 | 122 | -------------------------------------------------------------------------------- /sources/datacontract.com/datacontract.yaml: -------------------------------------------------------------------------------- 1 | dataContractSpecification: 0.9.3 2 | id: urn:datacontract:checkout:orders-latest 3 | info: 4 | title: Orders Latest 5 | version: 1.0.0 6 | description: | 7 | Successful customer orders in the webshop. 8 | All orders since 2020-01-01. 9 | Orders with their line items are in their current state (no history included). 10 | owner: Checkout Team 11 | contact: 12 | name: John Doe (Data Product Owner) 13 | url: https://teams.microsoft.com/l/channel/example/checkout 14 | servers: 15 | production: 16 | type: s3 17 | location: s3://datacontract-example-orders-latest/data/{model}/*.json 18 | format: json 19 | delimiter: new_line 20 | terms: 21 | usage: | 22 | Data can be used for reports, analytics and machine learning use cases. 23 | Order may be linked and joined by other tables 24 | limitations: | 25 | Not suitable for real-time use cases. 26 | Data may not be used to identify individual customers. 27 | Max data processing per day: 10 TiB 28 | billing: 5000 USD per month 29 | noticePeriod: P3M 30 | models: 31 | orders: 32 | description: One record per order. Includes cancelled and deleted orders. 33 | type: table 34 | fields: 35 | order_id: 36 | $ref: '#/definitions/order_id' 37 | required: true 38 | unique: true 39 | primary: true 40 | order_timestamp: 41 | description: The business timestamp in UTC when the order was successfully registered in the source system and the payment was successful. 42 | type: timestamp 43 | required: true 44 | example: "2024-09-09T08:30:00Z" 45 | order_total: 46 | description: Total amount the smallest monetary unit (e.g., cents). 47 | type: long 48 | required: true 49 | example: "9999" 50 | customer_id: 51 | description: Unique identifier for the customer. 52 | type: text 53 | minLength: 10 54 | maxLength: 20 55 | customer_email_address: 56 | description: The email address, as entered by the customer. The email address was not verified. 57 | type: text 58 | format: email 59 | required: true 60 | pii: true 61 | classification: sensitive 62 | processed_timestamp: 63 | description: The timestamp when the record was processed by the data platform. 64 | type: timestamp 65 | required: true 66 | config: 67 | jsonType: string 68 | jsonFormat: date-time 69 | line_items: 70 | description: A single article that is part of an order. 71 | type: table 72 | fields: 73 | lines_item_id: 74 | type: text 75 | description: Primary key of the lines_item_id table 76 | required: true 77 | unique: true 78 | primary: true 79 | order_id: 80 | $ref: '#/definitions/order_id' 81 | references: orders.order_id 82 | sku: 83 | description: The purchased article number 84 | $ref: '#/definitions/sku' 85 | definitions: 86 | order_id: 87 | domain: checkout 88 | name: order_id 89 | title: Order ID 90 | type: text 91 | format: uuid 92 | description: An internal ID that identifies an order in the online shop. 
93 | example: 243c25e5-a081-43a9-aeab-6d5d5b6cb5e2 94 | pii: true 95 | classification: restricted 96 | sku: 97 | domain: inventory 98 | name: sku 99 | title: Stock Keeping Unit 100 | type: text 101 | pattern: ^[A-Za-z0-9]{8,14}$ 102 | example: "96385074" 103 | description: | 104 | A Stock Keeping Unit (SKU) is an internal unique identifier for an article. 105 | It is typically associated with an article's barcode, such as the EAN/GTIN. 106 | examples: 107 | - type: csv # csv, json, yaml, custom 108 | model: orders 109 | description: An example list of order records. 110 | data: | # expressed as string or inline yaml or via "$ref: data.csv" 111 | order_id,order_timestamp,order_total,customer_id,customer_email_address,processed_timestamp 112 | "1001","2030-09-09T08:30:00Z",2500,"1000000001","mary.taylor82@example.com","2030-09-09T08:31:00Z" 113 | "1002","2030-09-08T15:45:00Z",1800,"1000000002","michael.miller83@example.com","2030-09-09T08:31:00Z" 114 | "1003","2030-09-07T12:15:00Z",3200,"1000000003","michael.smith5@example.com","2030-09-09T08:31:00Z" 115 | "1004","2030-09-06T19:20:00Z",1500,"1000000004","elizabeth.moore80@example.com","2030-09-09T08:31:00Z" 116 | "1005","2030-09-05T10:10:00Z",4200,"1000000004","elizabeth.moore80@example.com","2030-09-09T08:31:00Z" 117 | "1006","2030-09-04T14:55:00Z",2800,"1000000005","john.davis28@example.com","2030-09-09T08:31:00Z" 118 | "1007","2030-09-03T21:05:00Z",1900,"1000000006","linda.brown67@example.com","2030-09-09T08:31:00Z" 119 | "1008","2030-09-02T17:40:00Z",3600,"1000000007","patricia.smith40@example.com","2030-09-09T08:31:00Z" 120 | "1009","2030-09-01T09:25:00Z",3100,"1000000008","linda.wilson43@example.com","2030-09-09T08:31:00Z" 121 | "1010","2030-08-31T22:50:00Z",2700,"1000000009","mary.smith98@example.com","2030-09-09T08:31:00Z" 122 | - type: csv 123 | model: line_items 124 | description: An example list of line items. 125 | data: | 126 | lines_item_id,order_id,sku 127 | "LI-1","1001","5901234123457" 128 | "LI-2","1001","4001234567890" 129 | "LI-3","1002","5901234123457" 130 | "LI-4","1002","2001234567893" 131 | "LI-5","1003","4001234567890" 132 | "LI-6","1003","5001234567892" 133 | "LI-7","1004","5901234123457" 134 | "LI-8","1005","2001234567893" 135 | "LI-9","1005","5001234567892" 136 | "LI-10","1005","6001234567891" 137 | servicelevels: 138 | availability: 139 | description: The server is available during support hours 140 | percentage: 99.9% 141 | retention: 142 | description: Data is retained for one year 143 | period: P1Y 144 | unlimited: false 145 | latency: 146 | description: Data is available within 25 hours after the order was placed 147 | threshold: 25h 148 | sourceTimestampField: orders.order_timestamp 149 | processedTimestampField: orders.processed_timestamp 150 | freshness: 151 | description: The age of the youngest row in a table. 152 | threshold: 25h 153 | timestampField: orders.order_timestamp 154 | frequency: 155 | description: Data is delivered once a day 156 | type: batch # or streaming 157 | interval: daily # for batch, either or cron 158 | cron: 0 0 * * * # for batch, either or interval 159 | support: 160 | description: The data is available during typical business hours at headquarters 161 | time: 9am to 5pm in EST on business days 162 | responseTime: 1h 163 | backup: 164 | description: Data is backed up once a week, every Sunday at 0:00 UTC. 
165 | interval: weekly 166 | cron: 0 0 * * 0 167 | recoveryTime: 24 hours 168 | recoveryPoint: 1 week 169 | quality: 170 | type: SodaCL # data quality check format: SodaCL, montecarlo, custom 171 | specification: # expressed as string or inline yaml or via "$ref: checks.yaml" 172 | checks for orders: 173 | - row_count >= 5 174 | - duplicate_count(order_id) = 0 175 | checks for line_items: 176 | - values in (order_id) must exist in orders (order_id) 177 | - row_count >= 5 178 | -------------------------------------------------------------------------------- /sources/cli.datacontract.com/CHANGELOG.md: -------------------------------------------------------------------------------- 1 | # Changelog 2 | 3 | All notable changes to this project will be documented in this file. 4 | 5 | The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/), 6 | and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html). 7 | 8 | ## [Unreleased] 9 | 10 | ## [0.10.7] - 2024-05-31 11 | 12 | ### Added 13 | - Test data contract against dataframes / temporary views (#175) 14 | 15 | ### Fixed 16 | - AVRO export: Logical Types should be nested (#233) 17 | 18 | ## [0.10.6] - 2024-05-29 19 | 20 | ### Fixed 21 | 22 | - Fixed Docker build by removing msodbcsql18 dependency (temporary workaround) 23 | 24 | ## [0.10.5] - 2024-05-29 25 | 26 | ### Added 27 | - Added support for `sqlserver` (#196) 28 | - `datacontract export --format dbml`: Export to [Database Markup Language (DBML)](https://dbml.dbdiagram.io/home/) (#135) 29 | - `datacontract export --format avro`: Now supports config map on field level for logicalTypes and default values [Custom Avro Properties](./README.md#custom-avro-properties) 30 | - `datacontract import --format avro`: Now supports importing logicalType and default definition on avro files [Custom Avro Properties](./README.md#custom-avro-properties) 31 | - Support `config.bigqueryType` for testing BigQuery types 32 | - Added support for selecting specific tables in an AWS Glue `import` through the `glue-table` parameter (#122) 33 | 34 | ### Fixed 35 | 36 | - Fixed jsonschema export for models with empty object-typed fields (#218) 37 | - Fixed testing BigQuery tables with BOOL fields 38 | - `datacontract catalog` Show search bar also on mobile 39 | 40 | ## [0.10.4] - 2024-05-17 41 | 42 | ### Added 43 | 44 | - `datacontract catalog` Search 45 | - `datacontract publish`: Publish the data contract to the Data Mesh Manager 46 | - `datacontract import --format bigquery`: Import from BigQuery format (#110) 47 | - `datacontract export --format bigquery`: Export to BigQuery format (#111) 48 | - `datacontract export --format avro`: Now supports [Avro logical types](https://avro.apache.org/docs/1.11.1/specification/#logical-types) to better model date types. `date`, `timestamp`/`timestamp-tz` and `timestamp-ntz` are now mapped to the appropriate logical types. 
(#141) 49 | - `datacontract import --format jsonschema`: Import from JSON schema (#91) 50 | - `datacontract export --format jsonschema`: Improved export by exporting more additional information 51 | - `datacontract export --format html`: Added support for Service Levels, Definitions, Examples and nested Fields 52 | - `datacontract export --format go`: Export to go types format 53 | 54 | ## [0.10.3] - 2024-05-05 55 | 56 | ### Fixed 57 | - datacontract catalog: Add index.html to manifest 58 | 59 | ## [0.10.2] - 2024-05-05 60 | 61 | ### Added 62 | 63 | - Added import glue (#166) 64 | - Added test support for `azure` (#146) 65 | - Added support for `delta` tables on S3 (#24) 66 | - Added new command `datacontract catalog` that generates a data contract catalog with an `index.html` file. 67 | - Added field format information to HTML export 68 | 69 | ### Fixed 70 | - RDF Export: Fix error if owner is not a URI/URN 71 | 72 | 73 | ## [0.10.1] - 2024-04-19 74 | 75 | ### Fixed 76 | 77 | - Fixed docker columns 78 | 79 | ## [0.10.0] - 2024-04-19 80 | 81 | ### Added 82 | 83 | - Added timestamp when ah HTML export was created 84 | 85 | ### Fixed 86 | 87 | - Fixed export format **html** 88 | 89 | ## [0.9.9] - 2024-04-18 90 | 91 | ### Added 92 | 93 | - Added export format **html** (#15) 94 | - Added descriptions as comments to `datacontract export --format sql` for Databricks dialects 95 | - Added import of arrays in Avro import 96 | 97 | ## [0.9.8] - 2024-04-01 98 | 99 | ### Added 100 | 101 | - Added export format **great-expectations**: `datacontract export --format great-expectations` 102 | - Added gRPC support to OpenTelemetry integration for publishing test results 103 | - Added AVRO import support for namespace (#121) 104 | - Added handling for optional fields in avro import (#112) 105 | - Added Databricks SQL dialect for `datacontract export --format sql` 106 | 107 | ### Fixed 108 | 109 | - Use `sql_type_converter` to build checks. 110 | - Fixed AVRO import when doc is missing (#121) 111 | 112 | ## [0.9.7] - 2024-03-15 113 | 114 | ### Added 115 | 116 | - Added option publish test results to **OpenTelemetry**: `datacontract test --publish-to-opentelemetry` 117 | - Added export format **protobuf**: `datacontract export --format protobuf` 118 | - Added export format **terraform**: `datacontract export --format terraform` (limitation: only works for AWS S3 right now) 119 | - Added export format **sql**: `datacontract export --format sql` 120 | - Added export format **sql-query**: `datacontract export --format sql-query` 121 | - Added export format **avro-idl**: `datacontract export --format avro-idl`: Generates an Avro IDL file containing records for each model. 122 | - Added new command **changelog**: `datacontract changelog datacontract1.yaml datacontract2.yaml` will now generate a changelog based on the changes in the data contract. This will be useful for keeping track of changes in the data contract over time. 123 | - Added extensive linting on data contracts. `datacontract lint` will now check for a variety of possible errors in the data contract, such as missing descriptions, incorrect references to models or fields, nonsensical constraints, and more. 124 | - Added importer for avro schemas. `datacontract import --format avro` will now import avro schemas into a data contract. 125 | 126 | ### Fixed 127 | 128 | - Fixed a bug where the export to YAML always escaped the unicode characters. 
129 | 130 | 131 | ## [0.9.6-2] - 2024-03-04 132 | 133 | ### Added 134 | 135 | - test kafka for avro messages 136 | - added export format **avro**: `datacontract export --format avro` 137 | 138 | ## [0.9.6] - 2024-03-04 139 | 140 | This is a huge step forward, we now support testing Kafka messages. 141 | We start with JSON messages and avro, and Protobuf will follow. 142 | 143 | ### Added 144 | - test kafka for JSON messages 145 | - added import format **sql**: `datacontract import --format sql` (#51) 146 | - added export format **dbt-sources**: `datacontract export --format dbt-sources` 147 | - added export format **dbt-staging-sql**: `datacontract export --format dbt-staging-sql` 148 | - added export format **rdf**: `datacontract export --format rdf` (#52) 149 | - added command `datacontract breaking` to detect breaking changes in between two data contracts. 150 | 151 | ## [0.9.5] - 2024-02-22 152 | 153 | ### Added 154 | - export to dbt models (#37). 155 | - export to ODCS (#49). 156 | - test - show a test summary table. 157 | - lint - Support local schema (#46). 158 | 159 | ## [0.9.4] - 2024-02-18 160 | 161 | ### Added 162 | - Support for Postgres 163 | - Support for Databricks 164 | 165 | ## [0.9.3] - 2024-02-10 166 | 167 | ### Added 168 | - Support for BigQuery data connection 169 | - Support for multiple models with S3 170 | 171 | ### Fixed 172 | 173 | - Fix Docker images. Disable builds for linux/amd64. 174 | 175 | ## [0.9.2] - 2024-01-31 176 | 177 | ### Added 178 | - Publish to Docker Hub 179 | 180 | ## [0.9.0] - 2024-01-26 - BREAKING 181 | 182 | This is a breaking change (we are still on a 0.x.x version). 183 | The project migrated from Golang to Python. 184 | The Golang version can be found at [cli-go](https://github.com/datacontract/cli-go) 185 | 186 | ### Added 187 | - `test` Support to directly run tests and connect to data sources defined in servers section. 188 | - `test` generated schema tests from the model definition. 189 | - `test --publish URL` Publish test results to a server URL. 190 | - `export` now exports the data contract so format jsonschema and sodacl. 191 | 192 | ### Changed 193 | - The `--file` option removed in favor of a direct argument.: Use `datacontract test datacontract.yaml` instead of `datacontract test --file datacontract.yaml`. 194 | 195 | ### Removed 196 | - `model` is now part of `export` 197 | - `quality` is now part of `export` 198 | - Temporary Removed: `diff` needs to be migrated to Python. 199 | - Temporary Removed: `breaking` needs to be migrated to Python. 200 | - Temporary Removed: `inline` needs to be migrated to Python. 201 | 202 | ## [0.6.0] 203 | ### Added 204 | - Support local json schema in lint command. 205 | - Update to specification 0.9.2. 206 | 207 | ## [0.5.3] 208 | ### Fixed 209 | - Fix format flag bug in model (print) command. 210 | 211 | ## [0.5.2] 212 | ### Changed 213 | - Log to STDOUT. 214 | - Rename `model` command parameter, `type` -> `format`. 215 | 216 | ## [0.5.1] 217 | ### Removed 218 | - Remove `schema` command. 219 | 220 | ### Fixed 221 | - Fix documentation. 222 | - Security update of x/sys. 223 | 224 | ## [0.5.0] 225 | ### Added 226 | - Adapt Data Contract Specification in version 0.9.2. 227 | - Use `models` section for `diff`/`breaking`. 228 | - Add `model` command. 229 | - Let `inline` print to STDOUT instead of overwriting datacontract file. 230 | - Let `quality` write input from STDIN if present. 231 | 232 | ## [0.4.0] 233 | ### Added 234 | - Basic implementation of `test` command for Soda Core. 
235 | 236 | ### Changed 237 | - Change package structure to allow usage as library. 238 | 239 | ## [0.3.2] 240 | ### Fixed 241 | - Fix field parsing for dbt models, affects stability of `diff`/`breaking`. 242 | 243 | ## [0.3.1] 244 | ### Fixed 245 | - Fix comparing order of contracts in `diff`/`breaking`. 246 | 247 | ## [0.3.0] 248 | ### Added 249 | - Handle non-existent schema specification when using `diff`/`breaking`. 250 | - Resolve local and remote resources such as schema specifications when using "$ref: ..." notation. 251 | - Implement `schema` command: prints your schema. 252 | - Implement `quality` command: prints your quality definitions. 253 | - Implement the `inline` command: resolves all references using the "$ref: ..." notation and writes them to your data contract. 254 | 255 | ### Changed 256 | - Allow remote and local location for all data contract inputs (`--file`, `--with`). 257 | 258 | ## [0.2.0] 259 | ### Added 260 | - Add `diff` command for dbt schema specification. 261 | - Add `breaking` command for dbt schema specification. 262 | 263 | ### Changed 264 | - Suggest a fix during `init` when the file already exists. 265 | - Rename `validate` command to `lint`. 266 | 267 | ### Removed 268 | - Remove `check-compatibility` command. 269 | 270 | ### Fixed 271 | - Improve usage documentation. 272 | 273 | ## [0.1.1] 274 | ### Added 275 | - Initial release. 276 | -------------------------------------------------------------------------------- /sources/datacontract.com/datacontract.schema.json: -------------------------------------------------------------------------------- 1 | { 2 | "$schema": "http://json-schema.org/draft-07/schema#", 3 | "type": "object", 4 | "title": "DataContractSpecification", 5 | "properties": { 6 | "dataContractSpecification": { 7 | "type": "string", 8 | "title": "DataContractSpecificationVersion", 9 | "enum": [ 10 | "0.9.3", 11 | "0.9.2", 12 | "0.9.1", 13 | "0.9.0" 14 | ], 15 | "description": "Specifies the Data Contract Specification being used." 16 | }, 17 | "id": { 18 | "type": "string", 19 | "description": "Specifies the identifier of the data contract." 20 | }, 21 | "info": { 22 | "type": "object", 23 | "properties": { 24 | "title": { 25 | "type": "string", 26 | "description": "The title of the data contract." 27 | }, 28 | "version": { 29 | "type": "string", 30 | "description": "The version of the data contract document (which is distinct from the Data Contract Specification version or the Data Product implementation version)." 31 | }, 32 | "status": { 33 | "type": "string", 34 | "description": "The status of the data contract. Can be proposed, in development, active, retired.", 35 | "x-extensible-enum": [ 36 | "proposed", 37 | "in development", 38 | "active", 39 | "retired" 40 | ] 41 | }, 42 | "description": { 43 | "type": "string", 44 | "description": "A description of the data contract." 45 | }, 46 | "owner": { 47 | "type": "string", 48 | "description": "The owner or team responsible for managing the data contract and providing the data." 49 | }, 50 | "contact": { 51 | "type": "object", 52 | "properties": { 53 | "name": { 54 | "type": "string", 55 | "description": "The identifying name of the contact person/organization." 56 | }, 57 | "url": { 58 | "type": "string", 59 | "format": "uri", 60 | "description": "The URL pointing to the contact information. This MUST be in the form of a URL." 61 | }, 62 | "email": { 63 | "type": "string", 64 | "format": "email", 65 | "description": "The email address of the contact person/organization. 
This MUST be in the form of an email address." 66 | } 67 | }, 68 | "description": "Contact information for the data contract.", 69 | "additionalProperties": true 70 | } 71 | }, 72 | "additionalProperties": true, 73 | "required": [ 74 | "title", 75 | "version" 76 | ], 77 | "description": "Metadata and life cycle information about the data contract." 78 | }, 79 | "servers": { 80 | "type": "object", 81 | "additionalProperties": { 82 | "oneOf": [ 83 | { 84 | "type": "object", 85 | "title": "BigQueryServer", 86 | "properties": { 87 | "type": { 88 | "type": "string", 89 | "enum": [ 90 | "bigquery", 91 | "BigQuery" 92 | ], 93 | "description": "The type of the data product technology that implements the data contract." 94 | }, 95 | "project": { 96 | "type": "string", 97 | "description": "An optional string describing the server." 98 | }, 99 | "dataset": { 100 | "type": "string", 101 | "description": "An optional string describing the server." 102 | } 103 | }, 104 | "additionalProperties": true, 105 | "required": [ 106 | "type", 107 | "project", 108 | "dataset" 109 | ] 110 | }, 111 | { 112 | "type": "object", 113 | "title": "S3Server", 114 | "properties": { 115 | "type": { 116 | "type": "string", 117 | "enum": [ 118 | "s3" 119 | ], 120 | "description": "The type of the data product technology that implements the data contract." 121 | }, 122 | "location": { 123 | "type": "string", 124 | "format": "uri", 125 | "description": "An optional string describing the server. Must be in the form of a URL.", 126 | "examples": [ 127 | "s3://datacontract-example-orders-latest/data/{model}/*.json" 128 | ] 129 | }, 130 | "endpointUrl": { 131 | "type": "string", 132 | "format": "uri", 133 | "description": "The server endpoint for S3-compatible servers.", 134 | "examples": ["https://minio.example.com"] 135 | }, 136 | "format": { 137 | "type": "string", 138 | "enum": [ 139 | "parquet", 140 | "delta", 141 | "json", 142 | "csv" 143 | ], 144 | "description": "File format." 145 | }, 146 | "delimiter": { 147 | "type": "string", 148 | "enum": [ 149 | "new_line", 150 | "array" 151 | ], 152 | "description": "Only for format = json. How multiple json documents are delimited within one file" 153 | } 154 | }, 155 | "additionalProperties": true, 156 | "required": [ 157 | "type", 158 | "location" 159 | ] 160 | }, 161 | { 162 | "type": "object", 163 | "title": "SftpServer", 164 | "properties": { 165 | "type": { 166 | "type": "string", 167 | "enum": [ 168 | "sftp" 169 | ], 170 | "description": "The type of the data product technology that implements the data contract." 171 | }, 172 | "location": { 173 | "type": "string", 174 | "format": "uri", 175 | "description": "An optional string describing the server. Must be in the form of a sftp URL.", 176 | "examples": [ 177 | "sftp://123.123.12.123/{model}/*.json" 178 | ] 179 | }, 180 | "format": { 181 | "type": "string", 182 | "enum": [ 183 | "parquet", 184 | "delta", 185 | "json", 186 | "csv" 187 | ], 188 | "description": "File format." 189 | }, 190 | "delimiter": { 191 | "type": "string", 192 | "enum": [ 193 | "new_line", 194 | "array" 195 | ], 196 | "description": "Only for format = json. 
How multiple json documents are delimited within one file" 197 | } 198 | }, 199 | "additionalProperties": true, 200 | "required": [ 201 | "type", 202 | "location" 203 | ] 204 | }, 205 | { 206 | "type": "object", 207 | "title": "RedshiftServer", 208 | "properties": { 209 | "type": { 210 | "type": "string", 211 | "enum": [ 212 | "redshift" 213 | ], 214 | "description": "The type of the data product technology that implements the data contract." 215 | }, 216 | "account": { 217 | "type": "string", 218 | "description": "An optional string describing the server." 219 | }, 220 | "database": { 221 | "type": "string", 222 | "description": "An optional string describing the server." 223 | }, 224 | "schema": { 225 | "type": "string", 226 | "description": "An optional string describing the server." 227 | } 228 | }, 229 | "additionalProperties": true, 230 | "required": [ 231 | "type", 232 | "account", 233 | "database", 234 | "schema" 235 | ] 236 | }, 237 | { 238 | "type": "object", 239 | "title": "AzureServer", 240 | "properties": { 241 | "type": { 242 | "type": "string", 243 | "enum": [ 244 | "azure" 245 | ], 246 | "description": "The type of the data product technology that implements the data contract." 247 | }, 248 | "location": { 249 | "type": "string", 250 | "format": "uri", 251 | "description": "Fully qualified path to Azure Blob Storage or Azure Data Lake Storage (ADLS), supports globs.", 252 | "examples": [ 253 | "az://my_storage_account_name.blob.core.windows.net/my_container/path/*.parquet", 254 | "abfss://my_storage_account_name.dfs.core.windows.net/my_container_name/path/*.parquet" 255 | ] 256 | }, 257 | "format": { 258 | "type": "string", 259 | "enum": [ 260 | "parquet", 261 | "delta", 262 | "json", 263 | "csv" 264 | ], 265 | "description": "File format." 266 | }, 267 | "delimiter": { 268 | "type": "string", 269 | "enum": [ 270 | "new_line", 271 | "array" 272 | ], 273 | "description": "Only for format = json. How multiple json documents are delimited within one file" 274 | } 275 | }, 276 | "additionalProperties": true, 277 | "required": [ 278 | "type", 279 | "location", 280 | "format" 281 | ] 282 | }, 283 | { 284 | "type": "object", 285 | "title": "SqlserverServer", 286 | "properties": { 287 | "type": { 288 | "type": "string", 289 | "enum": [ 290 | "sqlserver" 291 | ], 292 | "description": "The type of the data product technology that implements the data contract." 293 | }, 294 | "host": { 295 | "type": "string", 296 | "description": "The host to the database server", 297 | "examples": [ 298 | "localhost" 299 | ] 300 | }, 301 | "port": { 302 | "type": "integer", 303 | "description": "The port to the database server.", 304 | "default": 1433, 305 | "examples": [ 306 | 1433 307 | ] 308 | }, 309 | "database": { 310 | "type": "string", 311 | "description": "The name of the database.", 312 | "examples": [ 313 | "database" 314 | ] 315 | }, 316 | "schema": { 317 | "type": "string", 318 | "description": "The name of the schema in the database.", 319 | "examples": [ 320 | "dbo" 321 | ] 322 | } 323 | }, 324 | "additionalProperties": true, 325 | "required": [ 326 | "type", 327 | "host", 328 | "database", 329 | "schema" 330 | ] 331 | }, 332 | { 333 | "type": "object", 334 | "title": "SnowflakeServer", 335 | "properties": { 336 | "type": { 337 | "type": "string", 338 | "enum": [ 339 | "snowflake" 340 | ], 341 | "description": "The type of the data product technology that implements the data contract." 
342 | }, 343 | "account": { 344 | "type": "string", 345 | "description": "An optional string describing the server." 346 | }, 347 | "database": { 348 | "type": "string", 349 | "description": "An optional string describing the server." 350 | }, 351 | "schema": { 352 | "type": "string", 353 | "description": "An optional string describing the server." 354 | } 355 | }, 356 | "additionalProperties": true, 357 | "required": [ 358 | "type", 359 | "account", 360 | "database", 361 | "schema" 362 | ] 363 | }, 364 | { 365 | "type": "object", 366 | "title": "DatabricksServer", 367 | "properties": { 368 | "type": { 369 | "type": "string", 370 | "const": "databricks", 371 | "description": "The type of the data product technology that implements the data contract." 372 | }, 373 | "host": { 374 | "type": "string", 375 | "description": "The Databricks host", 376 | "examples": [ 377 | "dbc-abcdefgh-1234.cloud.databricks.com" 378 | ] 379 | }, 380 | "catalog": { 381 | "type": "string", 382 | "description": "The name of the Hive or Unity catalog" 383 | }, 384 | "schema": { 385 | "type": "string", 386 | "description": "The schema name in the catalog" 387 | } 388 | }, 389 | "additionalProperties": true, 390 | "required": [ 391 | "type", 392 | "host", 393 | "catalog", 394 | "schema" 395 | ] 396 | }, 397 | { 398 | "type": "object", 399 | "title": "DataframeServer", 400 | "properties": { 401 | "type": { 402 | "type": "string", 403 | "const": "dataframe", 404 | "description": "The type of the data product technology that implements the data contract." 405 | } 406 | }, 407 | "additionalProperties": true, 408 | "required": [ 409 | "type" 410 | ] 411 | }, 412 | { 413 | "type": "object", 414 | "title": "GlueServer", 415 | "properties": { 416 | "type": { 417 | "type": "string", 418 | "const": "glue", 419 | "description": "The type of the data product technology that implements the data contract." 420 | }, 421 | "account": { 422 | "type": "string", 423 | "description": "The AWS Glue account", 424 | "examples": [ 425 | "1234-5678-9012" 426 | ] 427 | }, 428 | "database": { 429 | "type": "string", 430 | "description": "The AWS Glue database name", 431 | "examples": [ 432 | "my_database" 433 | ] 434 | }, 435 | "location": { 436 | "type": "string", 437 | "format": "uri", 438 | "description": "The AWS S3 path. Must be in the form of a URL.", 439 | "examples": [ 440 | "s3://datacontract-example-orders-latest/data/{model}" 441 | ] 442 | }, 443 | "format": { 444 | "type": "string", 445 | "description": "The format of the files", 446 | "examples": [ 447 | "parquet", 448 | "csv", 449 | "json", 450 | "delta" 451 | ] 452 | } 453 | }, 454 | "additionalProperties": true, 455 | "required": [ 456 | "type", 457 | "account", 458 | "database" 459 | ] 460 | }, 461 | { 462 | "type": "object", 463 | "title": "PostgresServer", 464 | "properties": { 465 | "type": { 466 | "type": "string", 467 | "const": "postgres", 468 | "description": "The type of the data product technology that implements the data contract." 469 | }, 470 | "host": { 471 | "type": "string", 472 | "description": "The host to the database server", 473 | "examples": [ 474 | "localhost" 475 | ] 476 | }, 477 | "port": { 478 | "type": "integer", 479 | "description": "The port to the database server." 
480 | }, 481 | "database": { 482 | "type": "string", 483 | "description": "The name of the database.", 484 | "examples": [ 485 | "postgres" 486 | ] 487 | }, 488 | "schema": { 489 | "type": "string", 490 | "description": "The name of the schema in the database.", 491 | "examples": [ 492 | "public" 493 | ] 494 | } 495 | }, 496 | "additionalProperties": true, 497 | "required": [ 498 | "type", 499 | "host", 500 | "port", 501 | "database", 502 | "schema" 503 | ] 504 | }, 505 | { 506 | "type": "object", 507 | "title": "OracleServer", 508 | "properties": { 509 | "type": { 510 | "type": "string", 511 | "const": "oracle", 512 | "description": "The type of the data product technology that implements the data contract." 513 | }, 514 | "host": { 515 | "type": "string", 516 | "description": "The host to the oracle server", 517 | "examples": [ 518 | "localhost" 519 | ] 520 | }, 521 | "port": { 522 | "type": "integer", 523 | "description": "The port to the oracle server.", 524 | "examples": [ 525 | 1523 526 | ] 527 | }, 528 | "serviceName": { 529 | "type": "string", 530 | "description": "The name of the service.", 531 | "examples": [ 532 | "service" 533 | ] 534 | } 535 | }, 536 | "additionalProperties": true, 537 | "required": [ 538 | "type", 539 | "host", 540 | "port", 541 | "serviceName" 542 | ] 543 | }, 544 | { 545 | "type": "object", 546 | "title": "KafkaServer", 547 | "description": "Kafka Server", 548 | "properties": { 549 | "type": { 550 | "type": "string", 551 | "enum": [ 552 | "kafka" 553 | ], 554 | "description": "The type of the data product technology that implements the data contract." 555 | }, 556 | "host": { 557 | "type": "string", 558 | "description": "The bootstrap server of the kafka cluster." 559 | }, 560 | "topic": { 561 | "type": "string", 562 | "description": "The topic name." 563 | }, 564 | "format": { 565 | "type": "string", 566 | "description": "The format of the message. Examples: json, avro, protobuf. Default: json.", 567 | "default": "json" 568 | } 569 | }, 570 | "additionalProperties": true, 571 | "required": [ 572 | "type", 573 | "host", 574 | "topic" 575 | ] 576 | }, 577 | { 578 | "type": "object", 579 | "title": "PubSubServer", 580 | "properties": { 581 | "type": { 582 | "type": "string", 583 | "enum": [ 584 | "pubsub" 585 | ], 586 | "description": "The type of the data product technology that implements the data contract." 587 | }, 588 | "project": { 589 | "type": "string", 590 | "description": "The GCP project name." 591 | }, 592 | "topic": { 593 | "type": "string", 594 | "description": "The topic name." 595 | } 596 | }, 597 | "additionalProperties": true, 598 | "required": [ 599 | "type", 600 | "project", 601 | "topic" 602 | ] 603 | }, 604 | { 605 | "type": "object", 606 | "title": "KinesisDataStreamsServer", 607 | "description": "Kinesis Data Streams Server", 608 | "properties": { 609 | "type": { 610 | "type": "string", 611 | "enum": [ 612 | "kinesis" 613 | ], 614 | "description": "The type of the data product technology that implements the data contract." 615 | }, 616 | "stream": { 617 | "type": "string", 618 | "description": "The name of the Kinesis data stream." 
619 | }, 620 | "region": { 621 | "type": "string", 622 | "description": "AWS region.", 623 | "examples": [ 624 | "eu-west-1" 625 | ] 626 | }, 627 | "format": { 628 | "type": "string", 629 | "description": "The format of the record", 630 | "examples": [ 631 | "json", 632 | "avro", 633 | "protobuf" 634 | ] 635 | } 636 | }, 637 | "additionalProperties": true, 638 | "required": [ 639 | "type", 640 | "stream" 641 | ] 642 | }, 643 | { 644 | "type": "object", 645 | "title": "LocalServer", 646 | "properties": { 647 | "type": { 648 | "type": "string", 649 | "enum": [ 650 | "local" 651 | ], 652 | "description": "The type of the data product technology that implements the data contract." 653 | }, 654 | "path": { 655 | "type": "string", 656 | "description": "The relative or absolute path to the data file(s).", 657 | "examples": [ 658 | "./folder/data.parquet", 659 | "./folder/*.parquet" 660 | ] 661 | }, 662 | "format": { 663 | "type": "string", 664 | "description": "The format of the file(s)", 665 | "examples": [ 666 | "json", 667 | "parquet", 668 | "delta", 669 | "csv" 670 | ] 671 | } 672 | }, 673 | "additionalProperties": true, 674 | "required": [ 675 | "type", 676 | "path", 677 | "format" 678 | ] 679 | } 680 | ] 681 | }, 682 | "description": "Information about the servers." 683 | }, 684 | "terms": { 685 | "type": "object", 686 | "description": "The terms and conditions of the data contract.", 687 | "properties": { 688 | "usage": { 689 | "type": "string", 690 | "description": "The usage describes the way the data is expected to be used. Can contain business and technical information." 691 | }, 692 | "limitations": { 693 | "type": "string", 694 | "description": "The limitations describe the restrictions on how the data can be used, can be technical or restrictions on what the data may not be used for." 695 | }, 696 | "billing": { 697 | "type": "string", 698 | "description": "The billing describes the pricing model for using the data, such as whether it's free, having a monthly fee, or metered pay-per-use." 699 | }, 700 | "noticePeriod": { 701 | "type": "string", 702 | "description": "The period of time that must be given by either party to terminate or modify a data usage agreement. Uses ISO-8601 period format, e.g., 'P3M' for a period of three months." 703 | } 704 | }, 705 | "additionalProperties": true 706 | }, 707 | "models": { 708 | "description": "Specifies the logical data model. Use the models name (e.g., the table name) as the key.", 709 | "type": "object", 710 | "minProperties": 1, 711 | "propertyNames": { 712 | "pattern": "^[a-zA-Z0-9_-]+$" 713 | }, 714 | "additionalProperties": { 715 | "type": "object", 716 | "title": "Model", 717 | "properties": { 718 | "description": { 719 | "type": "string" 720 | }, 721 | "type": { 722 | "description": "The type of the model. Examples: table, view, object. Default: table.", 723 | "type": "string", 724 | "title": "ModelType", 725 | "default": "table", 726 | "enum": [ 727 | "table", 728 | "view", 729 | "object" 730 | ] 731 | }, 732 | "title": { 733 | "type": "string", 734 | "description": "An optional string providing a human readable name for the model. Especially useful if the model name is cryptic or contains abbreviations.", 735 | "examples": ["Purchase Orders", "Air Shipments"] 736 | }, 737 | "fields": { 738 | "description": "Specifies a field in the data model. 
Use the field name (e.g., the column name) as the key.", 739 | "type": "object", 740 | "additionalProperties": { 741 | "type": "object", 742 | "title": "Field", 743 | "properties": { 744 | "description": { 745 | "type": "string", 746 | "description": "An optional string describing the semantic of the data in this field." 747 | }, 748 | "title": { 749 | "type": "string", 750 | "description": "An optional string providing a human readable name for the field. Especially useful if the field name is cryptic or contains abbreviations." 751 | }, 752 | "type": { 753 | "$ref": "#/$defs/FieldType" 754 | }, 755 | "required": { 756 | "type": "boolean", 757 | "default": false, 758 | "description": "An indication, if this field must contain a value and may not be null." 759 | }, 760 | "fields": { 761 | "description": "The nested fields (e.g. columns) of the object, record, or struct.", 762 | "type": "object", 763 | "additionalProperties": { 764 | "$ref": "#/properties/models/additionalProperties/properties/fields/additionalProperties" 765 | } 766 | }, 767 | "items": { 768 | "$ref": "#/properties/models/additionalProperties/properties/fields/additionalProperties" 769 | }, 770 | "primary": { 771 | "type": "boolean", 772 | "default": false, 773 | "description": "If this field is a primary key." 774 | }, 775 | "references": { 776 | "type": "string", 777 | "description": "The reference to a field in another model. E.g. use 'orders.order_id' to reference the order_id field of the model orders. Think of defining a foreign key relationship.", 778 | "examples": [ 779 | "orders.order_id", 780 | "model.nested_field.field" 781 | ] 782 | }, 783 | "unique": { 784 | "type": "boolean", 785 | "default": false, 786 | "description": "An indication, if the value must be unique within the model." 787 | }, 788 | "enum": { 789 | "type": "array", 790 | "items": { 791 | "type": "string" 792 | }, 793 | "uniqueItems": true, 794 | "description": "A value must be equal to one of the elements in this array value. Only evaluated if the value is not null." 795 | }, 796 | "minLength": { 797 | "type": "integer", 798 | "description": "A value must greater than, or equal to, the value of this. Only applies to string types." 799 | }, 800 | "maxLength": { 801 | "type": "integer", 802 | "description": "A value must less than, or equal to, the value of this. Only applies to string types." 803 | }, 804 | "format": { 805 | "type": "string", 806 | "description": "A specific format the value must comply with (e.g., 'email', 'uri', 'uuid').", 807 | "examples": [ 808 | "email", 809 | "uri", 810 | "uuid" 811 | ] 812 | }, 813 | "precision": { 814 | "type": "number", 815 | "examples": [ 816 | 38 817 | ], 818 | "description": "The maximum number of digits in a number. Only applies to numeric values. Defaults to 38." 819 | }, 820 | "scale": { 821 | "type": "number", 822 | "examples": [ 823 | 0 824 | ], 825 | "description": "The maximum number of decimal places in a number. Only applies to numeric values. Defaults to 0." 826 | }, 827 | "pattern": { 828 | "type": "string", 829 | "description": "A regular expression the value must match. Only applies to string types.", 830 | "examples": [ 831 | "^[a-zA-Z0-9_-]+$" 832 | ] 833 | }, 834 | "minimum": { 835 | "type": "number", 836 | "description": "A value of a number must greater than, or equal to, the value of this. Only evaluated if the value is not null. Only applies to numeric values." 
837 | }, 838 | "exclusiveMinimum": { 839 | "type": "number", 840 | "description": "A value of a number must greater than the value of this. Only evaluated if the value is not null. Only applies to numeric values." 841 | }, 842 | "maximum": { 843 | "type": "number", 844 | "description": "A value of a number must less than, or equal to, the value of this. Only evaluated if the value is not null. Only applies to numeric values." 845 | }, 846 | "exclusiveMaximum": { 847 | "type": "number", 848 | "description": "A value of a number must less than the value of this. Only evaluated if the value is not null. Only applies to numeric values." 849 | }, 850 | "example": { 851 | "type": "string", 852 | "description": "An example value for this field." 853 | }, 854 | "pii": { 855 | "type": "boolean", 856 | "description": "An indication, if this field contains Personal Identifiable Information (PII)." 857 | }, 858 | "classification": { 859 | "type": "string", 860 | "description": "The data class defining the sensitivity level for this field, according to the organization's classification scheme.", 861 | "examples": [ 862 | "sensitive", 863 | "restricted", 864 | "internal", 865 | "public" 866 | ] 867 | }, 868 | "tags": { 869 | "type": "array", 870 | "items": { 871 | "type": "string" 872 | }, 873 | "description": "Custom metadata to provide additional context." 874 | }, 875 | "$ref": { 876 | "type": "string", 877 | "description": "A reference URI to a definition in the specification, internally or externally. Properties will be inherited from the definition." 878 | }, 879 | "config": { 880 | "type": "object", 881 | "description": "Additional metadata for field configuration.", 882 | "additionalProperties": { 883 | "type": [ 884 | "string", 885 | "number", 886 | "boolean", 887 | "object", 888 | "array", 889 | "null" 890 | ] 891 | }, 892 | "properties": { 893 | "avroType": { 894 | "type": "string", 895 | "description": "Specify the field type to use when exporting the data model to Apache Avro." 896 | }, 897 | "avroLogicalType": { 898 | "type": "string", 899 | "description": "Specify the logical field type to use when exporting the data model to Apache Avro." 900 | }, 901 | "bigqueryType": { 902 | "type": "string", 903 | "description": "Specify the physical column type that is used in a BigQuery table, e.g., `NUMERIC(5, 2)`." 904 | }, 905 | "snowflakeType": { 906 | "type": "string", 907 | "description": "Specify the physical column type that is used in a Snowflake table, e.g., `TIMESTAMP_LTZ`." 908 | }, 909 | "redshiftType": { 910 | "type": "string", 911 | "description": "Specify the physical column type that is used in a Redshift table, e.g., `SMALLINT`." 912 | }, 913 | "sqlserverType": { 914 | "type": "string", 915 | "description": "Specify the physical column type that is used in a SQL Server table, e.g., `DATETIME2`." 916 | }, 917 | "unityType": { 918 | "type": "string", 919 | "description": "Specify the physical column type that is used in a Databricks Unity Catalog table." 920 | }, 921 | "glueType": { 922 | "type": "string", 923 | "description": "Specify the physical column type that is used in an AWS Glue Data Catalog table." 
924 | } 925 | } 926 | } 927 | } 928 | } 929 | }, 930 | "config": { 931 | "type": "object", 932 | "description": "Additional metadata for model configuration.", 933 | "additionalProperties": { 934 | "type": [ 935 | "string", 936 | "number", 937 | "boolean", 938 | "object", 939 | "array", 940 | "null" 941 | 942 | 943 | ] 944 | }, 945 | "properties": { 946 | "avroNamespace": { 947 | "type": "string", 948 | "description": "The namespace to use when importing and exporting the data model from / to Apache Avro." 949 | } 950 | } 951 | } 952 | } 953 | } 954 | }, 955 | "definitions": { 956 | "description": "Clear and concise explanations of syntax, semantic, and classification of business objects in a given domain.", 957 | "type": "object", 958 | "propertyNames": { 959 | "pattern": "^[a-zA-Z0-9_-]+$" 960 | }, 961 | "additionalProperties": { 962 | "type": "object", 963 | "title": "Definition", 964 | "properties": { 965 | "domain": { 966 | "type": "string", 967 | "description": "The domain in which this definition is valid.", 968 | "default": "global" 969 | }, 970 | "name": { 971 | "type": "string", 972 | "description": "The technical name of this definition." 973 | }, 974 | "title": { 975 | "type": "string", 976 | "description": "The business name of this definition." 977 | }, 978 | "description": { 979 | "type": "string", 980 | "description": "Clear and concise explanations related to the domain." 981 | }, 982 | "type": { 983 | "$ref": "#/$defs/FieldType" 984 | }, 985 | "minLength": { 986 | "type": "integer", 987 | "description": "A value must be greater than or equal to this value. Applies only to string types." 988 | }, 989 | "maxLength": { 990 | "type": "integer", 991 | "description": "A value must be less than or equal to this value. Applies only to string types." 992 | }, 993 | "format": { 994 | "type": "string", 995 | "description": "Specific format requirements for the value (e.g., 'email', 'uri', 'uuid')." 996 | }, 997 | "precision": { 998 | "type": "integer", 999 | "examples": [ 1000 | 38 1001 | ], 1002 | "description": "The maximum number of digits in a number. Only applies to numeric values. Defaults to 38." 1003 | }, 1004 | "scale": { 1005 | "type": "integer", 1006 | "examples": [ 1007 | 0 1008 | ], 1009 | "description": "The maximum number of decimal places in a number. Only applies to numeric values. Defaults to 0." 1010 | }, 1011 | "pattern": { 1012 | "type": "string", 1013 | "description": "A regular expression pattern the value must match. Applies only to string types." 1014 | }, 1015 | "minimum": { 1016 | "type": "number", 1017 | "description": "A value of a number must greater than, or equal to, the value of this. Only evaluated if the value is not null. Only applies to numeric values." 1018 | }, 1019 | "exclusiveMinimum": { 1020 | "type": "number", 1021 | "description": "A value of a number must greater than the value of this. Only evaluated if the value is not null. Only applies to numeric values." 1022 | }, 1023 | "maximum": { 1024 | "type": "number", 1025 | "description": "A value of a number must less than, or equal to, the value of this. Only evaluated if the value is not null. Only applies to numeric values." 1026 | }, 1027 | "exclusiveMaximum": { 1028 | "type": "number", 1029 | "description": "A value of a number must less than the value of this. Only evaluated if the value is not null. Only applies to numeric values." 1030 | }, 1031 | "example": { 1032 | "type": "string", 1033 | "description": "An example value." 
1034 | }, 1035 | "pii": { 1036 | "type": "boolean", 1037 | "description": "Indicates if the field contains Personal Identifiable Information (PII)." 1038 | }, 1039 | "classification": { 1040 | "type": "string", 1041 | "description": "The data class defining the sensitivity level for this field." 1042 | }, 1043 | "tags": { 1044 | "type": "array", 1045 | "items": { 1046 | "type": "string" 1047 | }, 1048 | "description": "Custom metadata to provide additional context." 1049 | } 1050 | }, 1051 | "required": [ 1052 | "name", 1053 | "type" 1054 | ] 1055 | } 1056 | }, 1057 | "schema": { 1058 | "type": "object", 1059 | "properties": { 1060 | "type": { 1061 | "type": "string", 1062 | "title": "SchemaType", 1063 | "enum": [ 1064 | "dbt", 1065 | "bigquery", 1066 | "json-schema", 1067 | "sql-ddl", 1068 | "avro", 1069 | "protobuf", 1070 | "custom" 1071 | ], 1072 | "description": "The type of the schema. Typical values are dbt, bigquery, json-schema, sql-ddl, avro, protobuf, custom." 1073 | }, 1074 | "specification": { 1075 | "oneOf": [ 1076 | { 1077 | "type": "string", 1078 | "description": "The specification of the schema as a string." 1079 | }, 1080 | { 1081 | "type": "object", 1082 | "description": "The specification of the schema as an object." 1083 | } 1084 | ] 1085 | } 1086 | }, 1087 | "required": [ 1088 | "type", 1089 | "specification" 1090 | ], 1091 | "description": "The schema of the data contract describes the syntax and semantics of provided data sets. It supports different schema types." 1092 | }, 1093 | "examples": { 1094 | "type": "array", 1095 | "items": { 1096 | "type": "object", 1097 | "properties": { 1098 | "type": { 1099 | "type": "string", 1100 | "title": "ExampleType", 1101 | "enum": [ 1102 | "csv", 1103 | "json", 1104 | "yaml", 1105 | "custom" 1106 | ], 1107 | "description": "The type of the example data. Well-known types are csv, json, yaml, custom." 1108 | }, 1109 | "description": { 1110 | "type": "string", 1111 | "description": "An optional string describing the example." 1112 | }, 1113 | "model": { 1114 | "type": "string", 1115 | "description": "The reference to the model in the schema, e.g., a table name." 1116 | }, 1117 | "data": { 1118 | "oneOf": [ 1119 | { 1120 | "type": "string", 1121 | "description": "Example data for this model." 1122 | }, 1123 | { 1124 | "type": "array", 1125 | "description": "Example data for this model in a structured format. Use this for type json or yaml." 1126 | } 1127 | ] 1128 | } 1129 | }, 1130 | "required": [ 1131 | "type", 1132 | "data" 1133 | ] 1134 | }, 1135 | "description": "The Examples Object is an array of Example Objects." 
1136 | }, 1137 | "servicelevels": { 1138 | "type": "object", 1139 | "description": "Specifies the service level agreements for the provided data, including availability, data retention policies, latency requirements, data freshness, update frequency, support availability, and backup policies.", 1140 | "properties": { 1141 | "availability": { 1142 | "type": "object", 1143 | "description": "Availability refers to the promise or guarantee by the service provider about the uptime of the system that provides the data.", 1144 | "properties": { 1145 | "description": { 1146 | "type": "string", 1147 | "description": "An optional string describing the availability service level.", 1148 | "example": "The server is available during support hours" 1149 | }, 1150 | "percentage": { 1151 | "type": "string", 1152 | "description": "An optional string describing the guaranteed uptime in percent (e.g., `99.9%`)", 1153 | "pattern": "^\\d+(\\.\\d+)?%$", 1154 | "example": "99.9%" 1155 | } 1156 | } 1157 | }, 1158 | "retention": { 1159 | "type": "object", 1160 | "description": "Retention covers the period how long data will be available.", 1161 | "properties": { 1162 | "description": { 1163 | "type": "string", 1164 | "description": "An optional string describing the retention service level.", 1165 | "example": "Data is retained for one year." 1166 | }, 1167 | "period": { 1168 | "type": "string", 1169 | "description": "An optional period of time, how long data is available. Supported formats: Simple duration (e.g., `1 year`, `30d`) and ISO 8601 duration (e.g, `P1Y`).", 1170 | "example": "P1Y" 1171 | }, 1172 | "unlimited": { 1173 | "type": "boolean", 1174 | "description": "An optional indicator that data is kept forever.", 1175 | "example": false 1176 | }, 1177 | "timestampField": { 1178 | "type": "string", 1179 | "description": "An optional reference to the field that contains the timestamp that the period refers to.", 1180 | "example": "orders.order_timestamp" 1181 | } 1182 | } 1183 | }, 1184 | "latency": { 1185 | "type": "object", 1186 | "description": "Latency refers to the maximum amount of time from the source to its destination.", 1187 | "properties": { 1188 | "description": { 1189 | "type": "string", 1190 | "description": "An optional string describing the latency service level.", 1191 | "example": "Data is available within 25 hours after the order was placed." 1192 | }, 1193 | "threshold": { 1194 | "type": "string", 1195 | "description": "An optional maximum duration between the source timestamp and the processed timestamp. 
Supported formats: Simple duration (e.g., `24 hours`, `5s`) and ISO 8601 duration (e.g, `PT24H`).", 1196 | "example": "25h" 1197 | }, 1198 | "sourceTimestampField": { 1199 | "type": "string", 1200 | "description": "An optional reference to the field that contains the timestamp when the data was provided at the source.", 1201 | "example": "orders.order_timestamp" 1202 | }, 1203 | "processedTimestampField": { 1204 | "type": "string", 1205 | "description": "An optional reference to the field that contains the processing timestamp, which denotes when the data is made available to consumers of this data contract.", 1206 | "example": "orders.processed_timestamp" 1207 | } 1208 | } 1209 | }, 1210 | "freshness": { 1211 | "type": "object", 1212 | "description": "The maximum age of the youngest row in a table.", 1213 | "properties": { 1214 | "description": { 1215 | "type": "string", 1216 | "description": "An optional string describing the freshness service level.", 1217 | "example": "The age of the youngest row in a table is within 25 hours." 1218 | }, 1219 | "threshold": { 1220 | "type": "string", 1221 | "description": "An optional maximum age of the youngest entry. Supported formats: Simple duration (e.g., `24 hours`, `5s`) and ISO 8601 duration (e.g., `PT24H`).", 1222 | "example": "25h" 1223 | }, 1224 | "timestampField": { 1225 | "type": "string", 1226 | "description": "An optional reference to the field that contains the timestamp that the threshold refers to.", 1227 | "example": "orders.order_timestamp" 1228 | } 1229 | } 1230 | }, 1231 | "frequency": { 1232 | "type": "object", 1233 | "description": "Frequency describes how often data is updated.", 1234 | "properties": { 1235 | "description": { 1236 | "type": "string", 1237 | "description": "An optional string describing the frequency service level.", 1238 | "example": "Data is delivered once a day." 1239 | }, 1240 | "type": { 1241 | "type": "string", 1242 | "enum": [ 1243 | "batch", 1244 | "micro-batching", 1245 | "streaming", 1246 | "manual" 1247 | ], 1248 | "description": "The method of data processing.", 1249 | "example": "batch" 1250 | }, 1251 | "interval": { 1252 | "type": "string", 1253 | "description": "Optional. Only for batch: How often the pipeline is triggered, e.g., `daily`.", 1254 | "example": "daily" 1255 | }, 1256 | "cron": { 1257 | "type": "string", 1258 | "description": "Optional. Only for batch: A cron expression when the pipelines is triggered. E.g., `0 0 * * *`.", 1259 | "example": "0 0 * * *" 1260 | } 1261 | } 1262 | }, 1263 | "support": { 1264 | "type": "object", 1265 | "description": "Support describes the times when support will be available for contact.", 1266 | "properties": { 1267 | "description": { 1268 | "type": "string", 1269 | "description": "An optional string describing the support service level.", 1270 | "example": "The data is available during typical business hours at headquarters." 1271 | }, 1272 | "time": { 1273 | "type": "string", 1274 | "description": "An optional string describing the times when support will be available for contact such as `24/7` or `business hours only`.", 1275 | "example": "9am to 5pm in EST on business days" 1276 | }, 1277 | "responseTime": { 1278 | "type": "string", 1279 | "description": "An optional string describing the time it takes for the support team to acknowledge a request. 
This does not mean the issue will be resolved immediately, but it assures users that their request has been received and will be dealt with.", 1280 | "example": "24 hours" 1281 | } 1282 | } 1283 | }, 1284 | "backup": { 1285 | "type": "object", 1286 | "description": "Backup specifies details about data backup procedures.", 1287 | "properties": { 1288 | "description": { 1289 | "type": "string", 1290 | "description": "An optional string describing the backup service level.", 1291 | "example": "Data is backed up once a week, every Sunday at 0:00 UTC." 1292 | }, 1293 | "interval": { 1294 | "type": "string", 1295 | "description": "An optional interval that defines how often data will be backed up, e.g., `daily`.", 1296 | "example": "weekly" 1297 | }, 1298 | "cron": { 1299 | "type": "string", 1300 | "description": "An optional cron expression when data will be backed up, e.g., `0 0 * * *`.", 1301 | "example": "0 0 * * 0" 1302 | }, 1303 | "recoveryTime": { 1304 | "type": "string", 1305 | "description": "An optional Recovery Time Objective (RTO) specifies the maximum amount of time allowed to restore data from a backup after a failure or loss event (e.g., 4 hours, 24 hours).", 1306 | "example": "24 hours" 1307 | }, 1308 | "recoveryPoint": { 1309 | "type": "string", 1310 | "description": "An optional Recovery Point Objective (RPO) defines the maximum acceptable age of files that must be recovered from backup storage for normal operations to resume after a disaster or data loss event. This essentially measures how much data you can afford to lose, measured in time (e.g., 4 hours, 24 hours).", 1311 | "example": "1 week" 1312 | } 1313 | } 1314 | } 1315 | } 1316 | }, 1317 | "quality": { 1318 | "type": "object", 1319 | "properties": { 1320 | "type": { 1321 | "type": "string", 1322 | "title": "QualityType", 1323 | "enum": [ 1324 | "SodaCL", 1325 | "montecarlo", 1326 | "great-expectations", 1327 | "custom" 1328 | ], 1329 | "description": "The type of the quality check. Typical values are SodaCL, montecarlo, great-expectations, custom." 1330 | }, 1331 | "specification": { 1332 | "oneOf": [ 1333 | { 1334 | "type": "string", 1335 | "description": "The specification of the quality attributes as a string." 1336 | }, 1337 | { 1338 | "type": "object", 1339 | "description": "The specification of the quality attributes as an object." 1340 | } 1341 | ] 1342 | } 1343 | }, 1344 | "required": [ 1345 | "type", 1346 | "specification" 1347 | ], 1348 | "description": "The quality object contains quality attributes and checks." 1349 | } 1350 | }, 1351 | "required": [ 1352 | "dataContractSpecification", 1353 | "id", 1354 | "info" 1355 | ], 1356 | "$defs": { 1357 | "FieldType": { 1358 | "type": "string", 1359 | "title": "FieldType", 1360 | "description": "The logical data type of the field.", 1361 | "enum": [ 1362 | "number", 1363 | "decimal", 1364 | "numeric", 1365 | "int", 1366 | "integer", 1367 | "long", 1368 | "bigint", 1369 | "float", 1370 | "double", 1371 | "string", 1372 | "text", 1373 | "varchar", 1374 | "boolean", 1375 | "timestamp", 1376 | "timestamp_tz", 1377 | "timestamp_ntz", 1378 | "date", 1379 | "array", 1380 | "object", 1381 | "record", 1382 | "struct", 1383 | "bytes", 1384 | "null" 1385 | ] 1386 | } 1387 | } 1388 | } 1389 | -------------------------------------------------------------------------------- /sources/cli.datacontract.com/README.md: -------------------------------------------------------------------------------- 1 | # Data Contract CLI 2 | 3 |


10 | 11 | The `datacontract` CLI is an open source command-line tool for working with [Data Contracts](https://datacontract.com/). 12 | It uses data contract YAML files to lint the data contract, connect to data sources and execute schema and quality tests, detect breaking changes, and export to different formats. The tool is written in Python. It can be used as a standalone CLI tool, in a CI/CD pipeline, or directly as a Python library. 13 | 14 | ![Main features of the Data Contract CLI](datacontractcli.png) 15 | 16 | 17 | ## Getting started 18 | 19 | Let's look at this data contract: 20 | [https://datacontract.com/examples/orders-latest/datacontract.yaml](https://datacontract.com/examples/orders-latest/datacontract.yaml) 21 | 22 | We have a _servers_ section with endpoint details to the S3 bucket, _models_ for the structure of the data, _servicelevels_ and _quality_ attributes that describe the expected freshness and number of rows. 23 | 24 | This data contract contains all information to connect to S3 and check that the actual data meets the defined schema and quality requirements. We can use this information to test if the actual data set in S3 is compliant to the data contract. 25 | 26 | Let's use [pip](https://pip.pypa.io/en/stable/getting-started/) to install the CLI (or use the [Docker image](#docker), if you prefer). 27 | ```bash 28 | $ python3 -m pip install datacontract-cli 29 | ``` 30 | 31 | We run the tests: 32 | 33 | ```bash 34 | $ datacontract test https://datacontract.com/examples/orders-latest/datacontract.yaml 35 | 36 | # returns: 37 | Testing https://datacontract.com/examples/orders-latest/datacontract.yaml 38 | ╭────────┬─────────────────────────────────────────────────────────────────────┬───────────────────────────────┬─────────╮ 39 | │ Result │ Check │ Field │ Details │ 40 | ├────────┼─────────────────────────────────────────────────────────────────────┼───────────────────────────────┼─────────┤ 41 | │ passed │ Check that JSON has valid schema │ orders │ │ 42 | │ passed │ Check that JSON has valid schema │ line_items │ │ 43 | │ passed │ Check that field order_id is present │ orders │ │ 44 | │ passed │ Check that field order_timestamp is present │ orders │ │ 45 | │ passed │ Check that field order_total is present │ orders │ │ 46 | │ passed │ Check that field customer_id is present │ orders │ │ 47 | │ passed │ Check that field customer_email_address is present │ orders │ │ 48 | │ passed │ row_count >= 5000 │ orders │ │ 49 | │ passed │ Check that required field order_id has no null values │ orders.order_id │ │ 50 | │ passed │ Check that unique field order_id has no duplicate values │ orders.order_id │ │ 51 | │ passed │ duplicate_count(order_id) = 0 │ orders.order_id │ │ 52 | │ passed │ Check that required field order_timestamp has no null values │ orders.order_timestamp │ │ 53 | │ passed │ freshness(order_timestamp) < 24h │ orders.order_timestamp │ │ 54 | │ passed │ Check that required field order_total has no null values │ orders.order_total │ │ 55 | │ passed │ Check that required field customer_email_address has no null values │ orders.customer_email_address │ │ 56 | │ passed │ Check that field lines_item_id is present │ line_items │ │ 57 | │ passed │ Check that field order_id is present │ line_items │ │ 58 | │ passed │ Check that field sku is present │ line_items │ │ 59 | │ passed │ values in (order_id) must exist in orders (order_id) │ line_items.order_id │ │ 60 | │ passed │ row_count >= 5000 │ line_items │ │ 61 | │ passed │ Check that required field lines_item_id 
has no null values │ line_items.lines_item_id │ │
62 | │ passed │ Check that unique field lines_item_id has no duplicate values │ line_items.lines_item_id │ │
63 | ╰────────┴─────────────────────────────────────────────────────────────────────┴───────────────────────────────┴─────────╯
64 | 🟢 data contract is valid. Run 22 checks. Took 6.739514 seconds.
65 | ```
66 |
67 | Voilà, the CLI tested that the _datacontract.yaml_ itself is valid, all records comply with the schema, and all quality attributes are met.
68 |
69 | We can also use the datacontract.yaml to export to many [formats](#format), e.g., to SQL:
70 |
71 | ```bash
72 | $ datacontract export --format sql https://datacontract.com/examples/orders-latest/datacontract.yaml
73 |
74 | # returns:
75 | -- Data Contract: urn:datacontract:checkout:orders-latest
76 | -- SQL Dialect: snowflake
77 | CREATE TABLE orders (
78 |   order_id TEXT not null primary key,
79 |   order_timestamp TIMESTAMP_TZ not null,
80 |   order_total NUMBER not null,
81 |   customer_id TEXT,
82 |   customer_email_address TEXT not null,
83 |   processed_timestamp TIMESTAMP_TZ not null
84 | );
85 | CREATE TABLE line_items (
86 |   lines_item_id TEXT not null primary key,
87 |   order_id TEXT,
88 |   sku TEXT
89 | );
90 | ```
91 |
92 | Or generate an HTML export:
93 |
94 | ```bash
95 | $ datacontract export --format html https://datacontract.com/examples/orders-latest/datacontract.yaml > datacontract.html
96 | ```
97 |
98 | which will create this [HTML export](https://datacontract.com/examples/orders-latest/datacontract.html).
99 |
100 | ## Usage
101 |
102 | ```bash
103 | # create a new data contract from an example and write it to datacontract.yaml
104 | $ datacontract init datacontract.yaml
105 |
106 | # lint the datacontract.yaml
107 | $ datacontract lint datacontract.yaml
108 |
109 | # execute schema and quality checks
110 | $ datacontract test datacontract.yaml
111 |
112 | # execute schema and quality checks on the examples within the contract
113 | $ datacontract test --examples datacontract.yaml
114 |
115 | # export data contract as html (other formats: avro, dbt, dbt-sources, dbt-staging-sql, jsonschema, odcs, rdf, sql, sodacl, terraform, ...)
116 | $ datacontract export --format html datacontract.yaml > datacontract.html
117 |
118 | # import avro (other formats: sql, glue, bigquery...)
119 | $ datacontract import --format avro --source avro_schema.avsc
120 |
121 | # find differences between two data contracts
122 | $ datacontract diff datacontract-v1.yaml datacontract-v2.yaml
123 |
124 | # find differences between two data contracts, categorized into error, warning, and info
125 | $ datacontract changelog datacontract-v1.yaml datacontract-v2.yaml
126 |
127 | # fail the pipeline on breaking changes. Uses changelog internally and shows only errors and warnings.
128 | $ datacontract breaking datacontract-v1.yaml datacontract-v2.yaml
129 | ```
130 |
131 | ## Programmatic (Python)
132 | ```python
133 | from datacontract.data_contract import DataContract
134 |
135 | data_contract = DataContract(data_contract_file="datacontract.yaml")
136 | run = data_contract.test()
137 | if not run.has_passed():
138 |     print("Data quality validation failed.")
139 |     # Abort pipeline, alert, or take corrective actions...
140 | ```
141 |
142 |
143 | ## Installation
144 |
145 | Choose the most appropriate installation method for your needs:
146 |
147 | ### pip
148 | Python 3.11 recommended.
149 | Python 3.12 available as pre-release release candidate for 0.9.3 150 | 151 | ```bash 152 | python3 -m pip install datacontract-cli 153 | ``` 154 | 155 | ### pipx 156 | pipx installs into an isolated environment. 157 | ```bash 158 | pipx install datacontract-cli 159 | ``` 160 | 161 | ### Docker 162 | 163 | ```bash 164 | docker pull datacontract/cli 165 | docker run --rm -v ${PWD}:/home/datacontract datacontract/cli 166 | ``` 167 | 168 | Or via an alias that automatically uses the latest version: 169 | 170 | ```bash 171 | alias datacontract='docker run --rm -v "${PWD}:/home/datacontract" datacontract/cli:latest' 172 | ``` 173 | 174 | ## Documentation 175 | 176 | Commands 177 | 178 | - [init](#init) 179 | - [lint](#lint) 180 | - [test](#test) 181 | - [export](#export) 182 | - [import](#import) 183 | - [breaking](#breaking) 184 | - [changelog](#changelog) 185 | - [diff](#diff) 186 | - [catalog](#catalog) 187 | - [publish](#publish) 188 | 189 | ### init 190 | 191 | ``` 192 | Usage: datacontract init [OPTIONS] [LOCATION] 193 | 194 | Download a datacontract.yaml template and write it to file. 195 | 196 | ╭─ Arguments ──────────────────────────────────────────────────────────────────────────────────╮ 197 | │ location [LOCATION] The location (url or path) of the data contract yaml to create. │ 198 | │ [default: datacontract.yaml] │ 199 | ╰──────────────────────────────────────────────────────────────────────────────────────────────╯ 200 | ╭─ Options ────────────────────────────────────────────────────────────────────────────────────╮ 201 | │ --template TEXT URL of a template or data contract │ 202 | │ [default: │ 203 | │ https://datacontract.com/datacontract.init.yaml] │ 204 | │ --overwrite --no-overwrite Replace the existing datacontract.yaml │ 205 | │ [default: no-overwrite] │ 206 | │ --help Show this message and exit. │ 207 | ╰──────────────────────────────────────────────────────────────────────────────────────────────╯ 208 | ``` 209 | 210 | ### lint 211 | 212 | ``` 213 | Usage: datacontract lint [OPTIONS] [LOCATION] 214 | 215 | Validate that the datacontract.yaml is correctly formatted. 216 | 217 | ╭─ Arguments ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮ 218 | │ location [LOCATION] The location (url or path) of the data contract yaml. [default: datacontract.yaml] │ 219 | ╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯ 220 | ╭─ Options ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮ 221 | │ --schema TEXT The location (url or path) of the Data Contract Specification JSON Schema │ 222 | │ [default: https://datacontract.com/datacontract.schema.json] │ 223 | │ --help Show this message and exit. │ 224 | ╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯ 225 | ``` 226 | 227 | ### test 228 | 229 | ``` 230 | Usage: datacontract test [OPTIONS] [LOCATION] 231 | 232 | Run schema and quality tests on configured servers. 233 | 234 | ╭─ Arguments ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮ 235 | │ location [LOCATION] The location (url or path) of the data contract yaml. 
[default: datacontract.yaml] │
236 | ╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
237 | ╭─ Options ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
238 | │ --schema TEXT The location (url or path) of the Data Contract │
239 | │ Specification JSON Schema │
240 | │ [default: │
241 | │ https://datacontract.com/datacontract.schema.json] │
242 | │ --server TEXT The server configuration to run the schema and quality │
243 | │ tests. Use the key of the server object in the data │
244 | │ contract yaml file to refer to a server, e.g., │
245 | │ `production`, or `all` for all servers (default). │
246 | │ [default: all] │
247 | │ --examples --no-examples Run the schema and quality tests on the example data │
248 | │ within the data contract. │
249 | │ [default: no-examples] │
250 | │ --publish TEXT The url to publish the results after the test │
251 | │ [default: None] │
252 | │ --publish-to-opentelemetry --no-publish-to-opentelemetry Publish the results to opentelemetry. Use environment │
253 | │ variables to configure the OTLP endpoint, headers, etc. │
254 | │ [default: no-publish-to-opentelemetry] │
255 | │ --logs --no-logs Print logs [default: no-logs] │
256 | │ --help Show this message and exit. │
257 | ╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
258 | ```
259 |
260 | Data Contract CLI connects to a data source and runs schema and quality tests to verify that the data contract is valid.
261 |
262 | ```bash
263 | $ datacontract test --server production datacontract.yaml
264 | ```
265 |
266 | To connect to the databases, the `server` block in the datacontract.yaml is used to set up the connection.
267 | In addition, credentials, such as usernames and passwords, may be defined with environment variables.
268 |
269 | The application uses different engines, based on the server `type`.
270 | Internally, it connects with DuckDB, Spark, or a native connection and executes most tests with _soda-core_ and _fastjsonschema_.
271 |
272 | Credentials are provided with environment variables.
273 |
274 | Supported server types:
275 |
276 | - [s3](#S3)
277 | - [bigquery](#bigquery)
278 | - [azure](#azure)
279 | - [sqlserver](#sqlserver)
280 | - [databricks](#databricks)
281 | - [databricks (programmatic)](#databricks-programmatic)
282 | - [dataframe (programmatic)](#dataframe-programmatic)
283 | - [snowflake](#snowflake)
284 | - [kafka](#kafka)
285 | - [postgres](#postgres)
286 | - [local](#local)
287 |
288 | Supported formats:
289 |
290 | - parquet
291 | - json
292 | - csv
293 | - delta
294 | - iceberg (coming soon)
295 |
296 | Feel free to create an [issue](https://github.com/datacontract/datacontract-cli/issues) if you need support for additional types or formats.
297 |
298 | ### S3
299 |
300 | Data Contract CLI can test data that is stored in S3 buckets or any S3-compliant endpoints in various formats.
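As a quick, hedged illustration of the environment-variable-based credentials mentioned above, the following sketch provides S3 credentials and runs the tests against the server named `production`. The variable names are taken from the Environment Variables table below; the key values are placeholders only.

```bash
# Sketch only: replace the placeholder values with real credentials.
export DATACONTRACT_S3_REGION=eu-central-1
export DATACONTRACT_S3_ACCESS_KEY_ID=AKIAXV5Q5QABCDEFGH
export DATACONTRACT_S3_SECRET_ACCESS_KEY=93S7LRrJcqLaaaa/XXXXXXXXXXXXX

# run the schema and quality tests against the server named "production"
datacontract test --server production datacontract.yaml
```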
301 | 302 | #### Examples 303 | 304 | ##### JSON 305 | 306 | datacontract.yaml 307 | ```yaml 308 | servers: 309 | production: 310 | type: s3 311 | endpointUrl: https://minio.example.com # not needed with AWS S3 312 | location: s3://bucket-name/path/*/*.json 313 | format: json 314 | delimiter: new_line # new_line, array, or none 315 | ``` 316 | 317 | ##### Delta Tables 318 | 319 | datacontract.yaml 320 | ```yaml 321 | servers: 322 | production: 323 | type: s3 324 | endpointUrl: https://minio.example.com # not needed with AWS S3 325 | location: s3://bucket-name/path/table.delta # path to the Delta table folder containing parquet data files and the _delta_log 326 | format: delta 327 | ``` 328 | 329 | #### Environment Variables 330 | 331 | | Environment Variable | Example | Description | 332 | |-----------------------------------|-------------------------------|-----------------------| 333 | | `DATACONTRACT_S3_REGION` | `eu-central-1` | Region of S3 bucket | 334 | | `DATACONTRACT_S3_ACCESS_KEY_ID` | `AKIAXV5Q5QABCDEFGH` | AWS Access Key ID | 335 | | `DATACONTRACT_S3_SECRET_ACCESS_KEY` | `93S7LRrJcqLaaaa/XXXXXXXXXXXXX` | AWS Secret Access Key | 336 | 337 | 338 | 339 | ### BigQuery 340 | 341 | We support authentication to BigQuery using Service Account Key. The used Service Account should include the roles: 342 | * BigQuery Job User 343 | * BigQuery Data Viewer 344 | 345 | 346 | #### Example 347 | 348 | datacontract.yaml 349 | ```yaml 350 | servers: 351 | production: 352 | type: bigquery 353 | project: datameshexample-product 354 | dataset: datacontract_cli_test_dataset 355 | models: 356 | datacontract_cli_test_table: # corresponds to a BigQuery table 357 | type: table 358 | fields: ... 359 | ``` 360 | 361 | #### Environment Variables 362 | 363 | | Environment Variable | Example | Description | 364 | |----------------------------------------------|---------------------------|---------------------------------------------------------| 365 | | `DATACONTRACT_BIGQUERY_ACCOUNT_INFO_JSON_PATH` | `~/service-access-key.json` | Service Access key as saved on key creation by BigQuery | 366 | 367 | 368 | 369 | ### Azure 370 | 371 | Data Contract CLI can test data that is stored in Azure Blob storage or Azure Data Lake Storage (Gen2) (ADLS) in various formats. 372 | 373 | #### Example 374 | 375 | datacontract.yaml 376 | ```yaml 377 | servers: 378 | production: 379 | type: azure 380 | location: abfss://datameshdatabricksdemo.dfs.core.windows.net/dataproducts/inventory_events/*.parquet 381 | format: parquet 382 | ``` 383 | 384 | #### Environment Variables 385 | 386 | Authentication works with an Azure Service Principal (SPN) aka App Registration with a secret. 387 | 388 | | Environment Variable | Example | Description | 389 | |-----------------------------------|-------------------------------|------------------------------------------------------| 390 | | `DATACONTRACT_AZURE_TENANT_ID` | `79f5b80f-10ff-40b9-9d1f-774b42d605fc` | The Azure Tenant ID | 391 | | `DATACONTRACT_AZURE_CLIENT_ID` | `3cf7ce49-e2e9-4cbc-a922-4328d4a58622` | The ApplicationID / ClientID of the app registration | 392 | | `DATACONTRACT_AZURE_CLIENT_SECRET` | `yZK8Q~GWO1MMXXXXXXXXXXXXX` | The Client Secret value | 393 | 394 | 395 | 396 | ### Sqlserver 397 | 398 | Data Contract CLI can test data in MS SQL Server (including Azure SQL, Synapse Analytics SQL Pool). 
399 |
400 | #### Example
401 |
402 | datacontract.yaml
403 | ```yaml
404 | servers:
405 |   production:
406 |     type: sqlserver
407 |     host: localhost
408 |     port: 5432
409 |     database: tempdb
410 |     schema: dbo
411 |     driver: ODBC Driver 18 for SQL Server
412 | models:
413 |   my_table_1: # corresponds to a table
414 |     type: table
415 |     fields:
416 |       my_column_1: # corresponds to a column
417 |         type: varchar
418 | ```
419 |
420 | #### Environment Variables
421 |
422 | | Environment Variable | Example | Description |
423 | |----------------------------------|--------------------|-------------|
424 | | `DATACONTRACT_SQLSERVER_USERNAME` | `root` | Username |
425 | | `DATACONTRACT_SQLSERVER_PASSWORD` | `toor` | Password |
426 | | `DATACONTRACT_SQLSERVER_TRUSTED_CONNECTION` | `True` | Use Windows authentication instead of a login |
427 | | `DATACONTRACT_SQLSERVER_TRUST_SERVER_CERTIFICATE` | `True` | Trust self-signed certificate |
428 | | `DATACONTRACT_SQLSERVER_ENCRYPTED_CONNECTION` | `True` | Use SSL |
429 |
430 |
431 |
432 |
433 | ### Databricks
434 |
435 | Works with Unity Catalog and Hive metastore.
436 |
437 | Needs a running SQL warehouse or compute cluster.
438 |
439 | #### Example
440 |
441 | datacontract.yaml
442 | ```yaml
443 | servers:
444 |   production:
445 |     type: databricks
446 |     host: dbc-abcdefgh-1234.cloud.databricks.com
447 |     catalog: acme_catalog_prod
448 |     schema: orders_latest
449 | models:
450 |   orders: # corresponds to a table
451 |     type: table
452 |     fields: ...
453 | ```
454 |
455 | #### Environment Variables
456 |
457 | | Environment Variable | Example | Description |
458 | |----------------------------------------------|--------------------------------------|-------------------------------------------------------|
459 | | `DATACONTRACT_DATABRICKS_TOKEN` | `dapia00000000000000000000000000000` | The personal access token to authenticate |
460 | | `DATACONTRACT_DATABRICKS_HTTP_PATH` | `/sql/1.0/warehouses/b053a3ffffffff` | The HTTP path to the SQL warehouse or compute cluster |
461 |
462 |
463 | ### Databricks (programmatic)
464 |
465 | Works with Unity Catalog and Hive metastore.
466 | When running in a notebook or pipeline, the provided `spark` session can be used.
467 | Additional authentication is not required.
468 |
469 | Requires a Databricks Runtime with Python >= 3.10.
470 |
471 | #### Example
472 |
473 | datacontract.yaml
474 | ```yaml
475 | servers:
476 |   production:
477 |     type: databricks
478 |     host: dbc-abcdefgh-1234.cloud.databricks.com # ignored, always uses the current host
479 |     catalog: acme_catalog_prod
480 |     schema: orders_latest
481 | models:
482 |   orders: # corresponds to a table
483 |     type: table
484 |     fields: ...
485 | ```
486 |
487 | Notebook
488 | ```python
489 | %pip install datacontract-cli
490 | dbutils.library.restartPython()
491 |
492 | from datacontract.data_contract import DataContract
493 |
494 | data_contract = DataContract(
495 |     data_contract_file="/Volumes/acme_catalog_prod/orders_latest/datacontract/datacontract.yaml",
496 |     spark=spark)
497 | run = data_contract.test()
498 | run.result
499 | ```
500 |
501 | ### Dataframe (programmatic)
502 |
503 | Works with Spark DataFrames.
504 | DataFrames need to be created as named temporary views.
505 | Multiple temporary views are supported if your data contract contains multiple models.
506 |
507 | Testing DataFrames is useful for testing your datasets in a pipeline before writing them to a data source.
508 | 509 | #### Example 510 | 511 | datacontract.yaml 512 | ```yaml 513 | servers: 514 | production: 515 | type: dataframe 516 | models: 517 | my_table: # corresponds to a temporary view 518 | type: table 519 | fields: ... 520 | ``` 521 | 522 | Example code 523 | ```python 524 | from datacontract.data_contract import DataContract 525 | 526 | df.createOrReplaceTempView("my_table") 527 | 528 | data_contract = DataContract( 529 | data_contract_file="datacontract.yaml", 530 | spark=spark, 531 | ) 532 | run = data_contract.test() 533 | assert run.result == "passed" 534 | ``` 535 | 536 | 537 | ### Snowflake 538 | 539 | Data Contract CLI can test data in Snowflake. 540 | 541 | #### Example 542 | 543 | datacontract.yaml 544 | ```yaml 545 | 546 | servers: 547 | snowflake: 548 | type: snowflake 549 | account: abcdefg-xn12345 550 | database: ORDER_DB 551 | schema: ORDERS_PII_V2 552 | models: 553 | my_table_1: # corresponds to a table 554 | type: table 555 | fields: 556 | my_column_1: # corresponds to a column 557 | type: varchar 558 | ``` 559 | 560 | #### Environment Variables 561 | 562 | | Environment Variable | Example | Description | 563 | |------------------------------------|--------------------|-----------------------------------------------------| 564 | | `DATACONTRACT_SNOWFLAKE_USERNAME` | `datacontract` | Username | 565 | | `DATACONTRACT_SNOWFLAKE_PASSWORD` | `mysecretpassword` | Password | 566 | | `DATACONTRACT_SNOWFLAKE_ROLE` | `DATAVALIDATION` | The snowflake role to use. | 567 | | `DATACONTRACT_SNOWFLAKE_WAREHOUSE` | `COMPUTE_WH` | The Snowflake Warehouse to use executing the tests. | 568 | 569 | 570 | 571 | ### Kafka 572 | 573 | Kafka support is currently considered experimental. 574 | 575 | #### Example 576 | 577 | datacontract.yaml 578 | ```yaml 579 | servers: 580 | production: 581 | type: kafka 582 | host: abc-12345.eu-central-1.aws.confluent.cloud:9092 583 | topic: my-topic-name 584 | format: json 585 | ``` 586 | 587 | #### Environment Variables 588 | 589 | | Environment Variable | Example | Description | 590 | |------------------------------------|---------|-----------------------------| 591 | | `DATACONTRACT_KAFKA_SASL_USERNAME` | `xxx` | The SASL username (key). | 592 | | `DATACONTRACT_KAFKA_SASL_PASSWORD` | `xxx` | The SASL password (secret). | 593 | 594 | 595 | ### Postgres 596 | 597 | Data Contract CLI can test data in Postgres or Postgres-compliant databases (e.g., RisingWave). 598 | 599 | #### Example 600 | 601 | datacontract.yaml 602 | ```yaml 603 | servers: 604 | postgres: 605 | type: postgres 606 | host: localhost 607 | port: 5432 608 | database: postgres 609 | schema: public 610 | models: 611 | my_table_1: # corresponds to a table 612 | type: table 613 | fields: 614 | my_column_1: # corresponds to a column 615 | type: varchar 616 | ``` 617 | 618 | #### Environment Variables 619 | 620 | | Environment Variable | Example | Description | 621 | |----------------------------------|--------------------|-------------| 622 | | `DATACONTRACT_POSTGRES_USERNAME` | `postgres` | Username | 623 | | `DATACONTRACT_POSTGRES_PASSWORD` | `mysecretpassword` | Password | 624 | 625 | 626 | 627 | 628 | 629 | ### export 630 | 631 | ``` 632 | 633 | Usage: datacontract export [OPTIONS] [LOCATION] 634 | 635 | Convert data contract to a specific format. Prints to stdout or to the specified output file. 
636 | 637 | ╭─ Arguments ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮ 638 | │ location [LOCATION] The location (url or path) of the data contract yaml. [default: datacontract.yaml] │ 639 | ╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯ 640 | ╭─ Options ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮ 641 | │ * --format [jsonschema|pydantic-model|sodacl|dbt|dbt-sources|db The export format. [default: None] [required] │ 642 | │ t-staging-sql|odcs|rdf|avro|protobuf|great-expectati │ 643 | │ ons|terraform|avro-idl|sql|sql-query|html|go|bigquer │ 644 | │ y|dbml] │ 645 | │ --output PATH Specify the file path where the exported data will be │ 646 | │ saved. If no path is provided, the output will be │ 647 | │ printed to stdout. │ 648 | │ [default: None] │ 649 | │ --server TEXT The server name to export. [default: None] │ 650 | │ --model TEXT Use the key of the model in the data contract yaml │ 651 | │ file to refer to a model, e.g., `orders`, or `all` │ 652 | │ for all models (default). │ 653 | │ [default: all] │ 654 | │ --help Show this message and exit. │ 655 | ╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯ 656 | ╭─ RDF Options ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮ 657 | │ --rdf-base TEXT [rdf] The base URI used to generate the RDF graph. [default: None] │ 658 | ╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯ 659 | ╭─ SQL Options ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮ 660 | │ --sql-server-type TEXT [sql] The server type to determine the sql dialect. By default, it uses 'auto' to automatically │ 661 | │ detect the sql dialect via the specified servers in the data contract. 
│
662 | │ [default: auto] │
663 | ╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
664 |
665 | ```
666 |
667 | ```bash
668 | # Example: export data contract as HTML
669 | datacontract export --format html > datacontract.html
670 | ```
671 |
672 | Available export options:
673 |
674 | | Type | Description | Status |
675 | |----------------------|---------------------------------------------------------|--------|
676 | | `html` | Export to HTML | ✅ |
677 | | `jsonschema` | Export to JSON Schema | ✅ |
678 | | `odcs` | Export to Open Data Contract Standard (ODCS) | ✅ |
679 | | `sodacl` | Export to SodaCL quality checks in YAML format | ✅ |
680 | | `dbt` | Export to dbt models in YAML format | ✅ |
681 | | `dbt-sources` | Export to dbt sources in YAML format | ✅ |
682 | | `dbt-staging-sql` | Export to dbt staging SQL models | ✅ |
683 | | `rdf` | Export data contract to RDF representation in N3 format | ✅ |
684 | | `avro` | Export to AVRO models | ✅ |
685 | | `protobuf` | Export to Protobuf | ✅ |
686 | | `terraform` | Export to terraform resources | ✅ |
687 | | `sql` | Export to SQL DDL | ✅ |
688 | | `sql-query` | Export to SQL Query | ✅ |
689 | | `great-expectations` | Export to Great Expectations Suites in JSON Format | ✅ |
690 | | `bigquery` | Export to BigQuery Schemas | ✅ |
691 | | `go` | Export to Go types | ✅ |
692 | | `pydantic-model` | Export to pydantic models | ✅ |
693 | | `dbml` | Export to a DBML Diagram description | ✅ |
694 | | Missing something? | Please create an issue on GitHub | TBD |
695 |
696 | #### Great Expectations
697 |
698 | The export function transforms a specified data contract into a comprehensive Great Expectations JSON suite.
699 | If the contract includes multiple models, you need to specify the name of the model you wish to export.
700 |
701 | ```shell
702 | datacontract export datacontract.yaml --format great-expectations --model orders
703 | ```
704 |
705 | The export creates a list of expectations by utilizing:
706 |
707 | - The data from the Model definition with a fixed mapping
708 | - The expectations provided in the quality field for each model (see the expectations gallery at https://greatexpectations.io/expectations/)
709 |
710 | #### RDF
711 |
712 | The export function converts a given data contract into an RDF representation. You have the option to
713 | add a base_url which will be used as the default prefix to resolve relative IRIs inside the document.
714 |
715 | ```shell
716 | datacontract export --format rdf --rdf-base https://www.example.com/ datacontract.yaml
717 | ```
718 |
719 | The data contract is mapped onto the following concepts of a yet-to-be-defined Data Contract
720 | Ontology named https://datacontract.com/DataContractSpecification/:
721 | - DataContract
722 | - Server
723 | - Model
724 |
725 | Having the data contract inside an RDF Graph gives us access to the following use cases:
726 | - Interoperability with other data contract specification formats
727 | - Store data contracts inside a knowledge graph
728 | - Enhance a semantic search to find and retrieve data contracts
729 | - Linking model elements to already established ontologies and knowledge
730 | - Using the full power of OWL to reason about the graph structure of data contracts
731 | - Apply graph algorithms on multiple data contracts (find similar data contracts, find "gatekeeper"
732 | data products, find the true domain owner of a field attribute)
733 |
734 | #### DBML
735 |
736 | The export function converts the logical data types of the data contract into the specific types of a concrete database
737 | if a server is selected via the `--server` option (based on the `type` of that server). If no server is selected, the
738 | logical data types are exported.
739 |
740 |
741 | #### Avro
742 |
743 | The export function converts the data contract specification into an Avro schema. It supports specifying custom Avro properties for logicalTypes and default values.
744 |
745 | ##### Custom Avro Properties
746 |
747 | We support a **config map on field level**. A config map may include any additional key-value pairs and support multiple server type bindings.
748 |
749 | To specify custom Avro properties in your data contract, you can define them within the `config` section of your field definition. Below is an example of how to structure your YAML configuration to include custom Avro properties, such as `avroLogicalType` and `avroDefault`.
750 |
751 | > NOTE: At this moment, we only support [logicalType](https://avro.apache.org/docs/1.11.0/spec.html#Logical+Types) and [default](https://avro.apache.org/docs/1.11.0/spec.htm)
752 |
753 | #### Example Configuration
754 |
755 | ```yaml
756 | models:
757 |   orders:
758 |     fields:
759 |       my_field_1:
760 |         description: Example for AVRO with Timestamp (microsecond precision) https://avro.apache.org/docs/current/spec.html#Local+timestamp+%28microsecond+precision%29
761 |         type: long
762 |         example: 1672534861000000 # Equivalent to 2023-01-01 01:01:01 in microseconds
763 |         config:
764 |           avroLogicalType: local-timestamp-micros
765 |           avroDefault: 1672534861000000
766 | ```
767 |
768 | #### Explanation
769 |
770 | - **models**: The top-level key that contains different models (tables or objects) in your data contract.
771 | - **orders**: A specific model name. Replace this with the name of your model.
772 | - **fields**: The fields within the model. Each field can have various properties defined.
773 | - **my_field_1**: The name of a specific field. Replace this with your field name.
774 | - **description**: A textual description of the field.
775 | - **type**: The data type of the field. In this example, it is `long`.
776 | - **example**: An example value for the field.
777 | - **config**: Section to specify custom Avro properties.
778 | - **avroLogicalType**: Specifies the logical type of the field in Avro. In this example, it is `local-timestamp-micros`.
779 | - **avroDefault**: Specifies the default value for the field in Avro.
In this example, it is 1672534861000000 which corresponds to ` 2023-01-01 01:01:01 UTC`. 780 | 781 | 782 | ### import 783 | 784 | ``` 785 | Usage: datacontract import [OPTIONS] 786 | 787 | Create a data contract from the given source location. Prints to stdout. 788 | 789 | ╭─ Options ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮ 790 | │ * --format [sql|avro|glue|bigquery|jsonschema] The format of the source file. [default: None] [required] │ 791 | │ --source TEXT The path to the file or Glue Database that should be imported. │ 792 | │ [default: None] │ 793 | │ --glue-table TEXT List of table ids to import from the Glue Database (repeat for │ 794 | │ multiple table ids, leave empty for all tables in the dataset). │ 795 | │ [default: None] │ 796 | │ --bigquery-project TEXT The bigquery project id. [default: None] │ 797 | │ --bigquery-dataset TEXT The bigquery dataset id. [default: None] │ 798 | │ --bigquery-table TEXT List of table ids to import from the bigquery API (repeat for │ 799 | │ multiple table ids, leave empty for all tables in the dataset). │ 800 | │ [default: None] │ 801 | │ --help Show this message and exit. │ 802 | ╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯ 803 | ``` 804 | 805 | Example: 806 | ```bash 807 | # Example import from SQL DDL 808 | datacontract import --format sql --source my_ddl.sql 809 | ``` 810 | 811 | Available import options: 812 | 813 | | Type | Description | Status | 814 | |--------------------|------------------------------------------------|---------| 815 | | `sql` | Import from SQL DDL | ✅ | 816 | | `avro` | Import from AVRO schemas | ✅ | 817 | | `glue` | Import from AWS Glue DataCatalog | ✅ | 818 | | `protobuf` | Import from Protobuf schemas | TBD | 819 | | `jsonschema` | Import from JSON Schemas | ✅ | 820 | | `bigquery` | Import from BigQuery Schemas | ✅ | 821 | | `dbt` | Import from dbt models | TBD | 822 | | `odcs` | Import from Open Data Contract Standard (ODCS) | TBD | 823 | | Missing something? | Please create an issue on GitHub | TBD | 824 | 825 | 826 | #### BigQuery 827 | 828 | Bigquery data can either be imported off of JSON Files generated from the table descriptions or directly from the Bigquery API. In case you want to use JSON Files, specify the `source` parameter with a path to the JSON File. 829 | 830 | To import from the Bigquery API, you have to _omit_ `source` and instead need to provide `bigquery-project` and `bigquery-dataset`. Additionally you may specify `bigquery-table` to enumerate the tables that should be imported. If no tables are given, _all_ available tables of the dataset will be imported. 831 | 832 | For providing authentication to the Client, please see [the google documentation](https://cloud.google.com/docs/authentication/provide-credentials-adc#how-to) or the one [about authorizing client libraries](https://cloud.google.com/bigquery/docs/authentication#client-libs). 

Examples:

```bash
# Example import from BigQuery JSON
datacontract import --format bigquery --source my_bigquery_table.json
```

```bash
# Example import from the BigQuery API, specifying the tables to import
datacontract import --format bigquery --bigquery-project <project_id> --bigquery-dataset <dataset_id> --bigquery-table <table_id> --bigquery-table <table_id> --bigquery-table <table_id>
```

```bash
# Example import from the BigQuery API, importing all tables in the dataset
datacontract import --format bigquery --bigquery-project <project_id> --bigquery-dataset <dataset_id>
```

#### Glue

Importing from Glue reads the necessary data directly from the AWS API.
You may pass the `glue-table` parameter to enumerate the tables that should be imported. If no tables are given, _all_ available tables of the database will be imported.

Examples:

```bash
# Example import from AWS Glue, specifying the tables to import
datacontract import --format glue --source <database_name> --glue-table <table_id> --glue-table <table_id> --glue-table <table_id>
```

```bash
# Example import from AWS Glue, importing all tables in the database
datacontract import --format glue --source <database_name>
```


### breaking

```
 Usage: datacontract breaking [OPTIONS] LOCATION_OLD LOCATION_NEW

 Identifies breaking changes between data contracts. Prints to stdout.

╭─ Arguments ─────────────────────────────────────────────────────────────────────────────────────────────────╮
│ *  location_old  TEXT  The location (url or path) of the old data contract yaml. [default: None] [required]  │
│ *  location_new  TEXT  The location (url or path) of the new data contract yaml. [default: None] [required]  │
╰───────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
╭─ Options ─────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ --help  Show this message and exit.                                                                           │
╰───────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
```

### changelog

```
 Usage: datacontract changelog [OPTIONS] LOCATION_OLD LOCATION_NEW

 Generate a changelog between data contracts. Prints to stdout.

╭─ Arguments ─────────────────────────────────────────────────────────────────────────────────────────────────╮
│ *  location_old  TEXT  The location (url or path) of the old data contract yaml. [default: None] [required]  │
│ *  location_new  TEXT  The location (url or path) of the new data contract yaml. [default: None] [required]  │
╰───────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
╭─ Options ─────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ --help  Show this message and exit.                                                                           │
╰───────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
```
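
A typical invocation compares an older version of a contract with a newer one (the file names here are purely illustrative):

```bash
# Exit with an error if the new version introduces breaking changes
datacontract breaking datacontract-v1.yaml datacontract-v2.yaml

# Print a changelog between the two versions
datacontract changelog datacontract-v1.yaml datacontract-v2.yaml
```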

### diff

```
 Usage: datacontract diff [OPTIONS] LOCATION_OLD LOCATION_NEW

 PLACEHOLDER. Currently works as 'changelog' does.

╭─ Arguments ─────────────────────────────────────────────────────────────────────────────────────────────────╮
│ *  location_old  TEXT  The location (url or path) of the old data contract yaml. [default: None] [required]  │
│ *  location_new  TEXT  The location (url or path) of the new data contract yaml. [default: None] [required]  │
╰───────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
╭─ Options ─────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ --help  Show this message and exit.                                                                           │
╰───────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
```

### catalog

```
 Usage: datacontract catalog [OPTIONS]

 Create an html catalog of data contracts.

╭─ Options ────────────────────────────────────────────────────────────────────────────────────────────╮
│ --files   TEXT  Glob pattern for the data contract files to include in the catalog. [default: *.yaml]  │
│ --output  TEXT  Output directory for the catalog html files. [default: catalog/]                       │
│ --help          Show this message and exit.                                                            │
╰──────────────────────────────────────────────────────────────────────────────────────────────────────────╯
```
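
For example, to build a catalog from all contracts in the current directory (both options shown use their documented defaults):

```bash
# Generate an HTML catalog for all data contracts matching the glob pattern
datacontract catalog --files "*.yaml" --output catalog/
```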

### publish

```
 Usage: datacontract publish [OPTIONS] [LOCATION]

 Publish the data contract to the Data Mesh Manager.

╭─ Arguments ────────────────────────────────────────────────────────────────────────────────────────────╮
│ location  [LOCATION]  The location (url or path) of the data contract yaml. [default: datacontract.yaml] │
╰────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
╭─ Options ────────────────────────────────────────────────────────────────────────────────────────────────╮
│ --help  Show this message and exit.                                                                       │
╰────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
```

## Integrations

| Integration       | Option                       | Description                                                                                            |
|-------------------|------------------------------|--------------------------------------------------------------------------------------------------------|
| Data Mesh Manager | `--publish`                  | Push full results to the [Data Mesh Manager API](https://api.datamesh-manager.com/swagger/index.html)  |
| OpenTelemetry     | `--publish-to-opentelemetry` | Push result as gauge metrics                                                                            |

### Integration with Data Mesh Manager

If you use [Data Mesh Manager](https://datamesh-manager.com/), you can use the data contract URL and append the `--publish` option to send and display the test results. Set an environment variable for your API key.

```bash
# Fetch current data contract, execute tests on production, and publish the result to the Data Mesh Manager
$ export DATAMESH_MANAGER_API_KEY=xxx
$ datacontract test https://demo.datamesh-manager.com/demo279750347121/datacontracts/4df9d6ee-e55d-4088-9598-b635b2fdcbbc/datacontract.yaml --server production --publish
```

### Integration with OpenTelemetry

If you use OpenTelemetry, you can use the data contract URL and append the `--publish-to-opentelemetry` option to send the test results to your OTLP-compatible instance, e.g., Prometheus.

The metric name is `datacontract.cli.test.result`, and it uses the following encoding for the result:

| datacontract.cli.test.result | Description                           |
|------------------------------|---------------------------------------|
| 0                            | test run passed, no warnings          |
| 1                            | test run has warnings                 |
| 2                            | test run failed                       |
| 3                            | test run not possible due to an error |
| 4                            | test status unknown                   |


```bash
# Fetch current data contract, execute tests on production, and publish the result to OpenTelemetry
$ export OTEL_SERVICE_NAME=datacontract-cli
$ export OTEL_EXPORTER_OTLP_ENDPOINT=https://YOUR_ID.apm.westeurope.azure.elastic-cloud.com:443
$ export OTEL_EXPORTER_OTLP_HEADERS=Authorization=Bearer%20secret # Optional, when using SaaS products
$ export OTEL_EXPORTER_OTLP_PROTOCOL=http/protobuf # Optional, default is http/protobuf - use value grpc to use the gRPC protocol instead
# Send to OpenTelemetry
$ datacontract test https://demo.datamesh-manager.com/demo279750347121/datacontracts/4df9d6ee-e55d-4088-9598-b635b2fdcbbc/datacontract.yaml --server production --publish-to-opentelemetry
```

Current limitations:
- Currently, only the ConsoleExporter and the OTLP exporter are supported.
- Metrics only, no logs yet (but loosely planned).


## Best Practices

We share best practices in using the Data Contract CLI.

### Data-first Approach

Create a data contract based on the actual data. This is the fastest way to get started and to get feedback from the data consumers.

1. Use an existing physical schema (e.g., SQL DDL) as a starting point to define your logical data model in the contract. Double-check right after the import whether the actual data meets the imported logical data model. Just to be sure.
   ```bash
   $ datacontract import --format sql --source ddl.sql
   $ datacontract test
   ```

2. Add examples to the `datacontract.yaml`. If you can, use actual data and anonymize it. Make sure that the examples match the imported logical data model.
   ```bash
   $ datacontract test --examples
   ```

3. Add quality checks and additional type constraints one by one to the contract and make sure the examples and the actual data still adhere to the contract. Check against the examples for a very fast feedback loop.
   ```bash
   $ datacontract test --examples
   $ datacontract test
   ```

4. Make sure that all the best practices for a `datacontract.yaml` are met by using the linter. You probably forgot to document some fields and to add the terms and conditions.
   ```bash
   $ datacontract lint
   ```

5. Set up a CI pipeline that executes daily and reports the results to the [Data Mesh Manager](https://datamesh-manager.com), or to some other place. You can even publish to any OpenTelemetry-compatible system.
   ```bash
   $ datacontract test --publish https://api.datamesh-manager.com/api/runs
   ```
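
   If you would rather publish these daily runs to OpenTelemetry, the same step can point at your own collector (the endpoint below is a placeholder; the environment variables are described in the OpenTelemetry integration above):
   ```bash
   # Publish the daily test results to an OpenTelemetry collector instead of the Data Mesh Manager
   $ export OTEL_EXPORTER_OTLP_ENDPOINT=https://collector.example.com:4318
   $ datacontract test --publish-to-opentelemetry
   ```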

### Contract-First

Create a data contract based on the requirements from use cases.

1. Start with a `datacontract.yaml` template.
   ```bash
   $ datacontract init
   ```

2. Add examples to the `datacontract.yaml`. Do not start with the data model, although you are probably tempted to do that. Examples are the fastest way to get feedback from everybody and not lose anyone in the discussion.

3. Create the model based on the examples. Test the model against the examples to double-check whether the model matches the examples.
   ```bash
   $ datacontract test --examples
   ```

4. Add quality checks and additional type constraints one by one to the contract and make sure the examples and the actual data still adhere to the contract. Check against the examples for a very fast feedback loop.
   ```bash
   $ datacontract test --examples
   ```

5. Fill in the terms, descriptions, etc. Make sure you follow all best practices for a `datacontract.yaml` by using the linter.
   ```bash
   $ datacontract lint
   ```

6. Set up a CI pipeline that lints and tests the examples, so you make sure that later changes do not decrease the quality of the contract.
   ```bash
   $ datacontract lint
   $ datacontract test --examples
   ```

7. Use the export function to start building the providing data product as well as the integration into the consuming data products.
   ```bash
   # data provider
   $ datacontract export --format dbt
   # data consumer
   $ datacontract export --format dbt-sources
   $ datacontract export --format dbt-staging-sql
   ```

### Schema Evolution

#### Non-breaking Changes

Examples: adding models or fields

- Add the models or fields in the datacontract.yaml.
- Increment the minor version of the datacontract.yaml on any change. Simply edit the datacontract.yaml for this.
- You need a policy that these changes are non-breaking. That means, for example, that one cannot use the star expression (`SELECT *`) in SQL to query a table under contract. Make the consequences known.
- Fail the build in the pull request if a datacontract.yaml accidentally adds a breaking change, even though only the minor version changed.
  ```bash
  $ datacontract breaking datacontract-from-main.yaml datacontract-from-pr.yaml
  ```
- Create a changelog of this minor change.
  ```bash
  $ datacontract changelog datacontract-from-main.yaml datacontract-from-pr.yaml
  ```

#### Breaking Changes

Examples: removing or renaming models and fields

- Remove or rename models and fields in the datacontract.yaml, together with any other change that should be part of this new major version of the data contract.
- Increment the major version of the datacontract.yaml and create a new file for the new major version, because you need to offer an upgrade path for the data consumers from the old to the new major version.
- As data consumers need to migrate, try to reduce the frequency of major versions by bundling multiple breaking changes together if possible.
- Be aware of the notice period in the data contract, as this is the minimum amount of time you have to offer both the old and the new version for a migration path.
- Do not fear making breaking changes with data contracts. It's okay to do them in this controlled way. Really!
- Create a changelog of this major change.
  ```bash
  $ datacontract changelog datacontract-from-main.yaml datacontract-from-pr.yaml
  ```
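
A sketch of what this can look like in a repository (file names and version numbers are purely illustrative): keep one file per major version so that both can be served during the notice period, and generate the migration changelog between them.

```bash
# Hypothetical layout: one file per major version, both offered during the notice period
#   datacontract-1.2.0.yaml   <- old major version, still served
#   datacontract-2.0.0.yaml   <- new major version
$ datacontract changelog datacontract-1.2.0.yaml datacontract-2.0.0.yaml
```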

## Development Setup

The Python base interpreter should be 3.11.x (unless you are working on the 3.12 release candidate).

```bash
# Create venv
python3 -m venv venv
source venv/bin/activate

# Install requirements
pip install --upgrade pip setuptools wheel
pip install -e '.[dev]'
ruff check --fix
ruff format
pytest
```


### Docker Build

```bash
docker build -t datacontract/cli .
docker run --rm -v ${PWD}:/home/datacontract datacontract/cli
```

#### Docker Compose Integration

We've included a [docker-compose.yml](./docker-compose.yml) configuration to simplify the build, test, and deployment of the image.

##### Building the Image with Docker Compose

To build the Docker image using Docker Compose, run the following command:

```bash
docker compose build
```

This command uses the `docker-compose.yml` to build the image, leveraging predefined settings such as the build context and Dockerfile location. This approach streamlines the image creation process, avoiding the need for manual build specifications each time.

##### Testing the Image

After building the image, you can test it directly with Docker Compose:

```bash
docker compose run --rm datacontract --version
```

This command runs the container momentarily to check the version of the `datacontract` CLI. The `--rm` flag ensures that the container is automatically removed after the command executes, keeping your environment clean.


## Release Steps

1. Update the version in `pyproject.toml`
2. Have a look at the `CHANGELOG.md`
3. Create the release commit manually
4. Execute `./release`
5. Wait until the GitHub Release is created
6. Add the release notes to the GitHub Release

## Contribution

We are happy to receive your contributions. Propose your change in an issue or directly create a pull request with your improvements.

## Companies Using This Tool

- [INNOQ](https://innoq.com)
- And many more. To add your company, please create a pull request.

## License

[MIT License](LICENSE)

## Credits

Created by [Stefan Negele](https://www.linkedin.com/in/stefan-negele-573153112/) and [Jochen Christ](https://www.linkedin.com/in/jochenchrist/).


--------------------------------------------------------------------------------