├── sources
│   ├── INSTRUCTIONS.md
│   ├── datacontract.com
│   │   ├── CHANGELOG.md
│   │   ├── workshop.md
│   │   ├── datacontract.init.yaml
│   │   ├── definition.schema.json
│   │   ├── datacontract.yaml
│   │   └── datacontract.schema.json
│   └── cli.datacontract.com
│       ├── CHANGELOG.md
│       └── README.md
├── CNAME
├── images
│   ├── favicon.png
│   ├── datacontract-gpt-browser.png
│   ├── datacontract-gpt-social-media.png
│   └── supported-by-innoq--petrol-apricot.svg
├── _config.yml
├── example_shipment.yaml
├── _layouts
│   └── default.html
├── example_datacontract.yaml
└── README.md
/sources/INSTRUCTIONS.md:
--------------------------------------------------------------------------------
1 |
--------------------------------------------------------------------------------
/CNAME:
--------------------------------------------------------------------------------
1 | gpt.datacontract.com
--------------------------------------------------------------------------------
/images/favicon.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/datacontract/datacontract-gpt/main/images/favicon.png
--------------------------------------------------------------------------------
/images/datacontract-gpt-browser.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/datacontract/datacontract-gpt/main/images/datacontract-gpt-browser.png
--------------------------------------------------------------------------------
/images/datacontract-gpt-social-media.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/datacontract/datacontract-gpt/main/images/datacontract-gpt-social-media.png
--------------------------------------------------------------------------------
/_config.yml:
--------------------------------------------------------------------------------
1 | plugins:
2 | - jekyll-sitemap
3 | name: Data Contract GPT
4 | title: null
5 | description: Data Contract GPT for the Data Contract Specification to interactively create and modify data contracts.
6 |
--------------------------------------------------------------------------------
/example_shipment.yaml:
--------------------------------------------------------------------------------
1 | shipment_id: "123e4567-e89b-12d3-a456-426614174000"
2 | origin: "New York, NY"
3 | destination: "Los Angeles, CA"
4 | shipment_date: "2024-01-01T10:00:00Z"
5 | delivery_date: "2024-01-05T15:00:00Z"
6 | status: "delivered"
--------------------------------------------------------------------------------
/sources/datacontract.com/CHANGELOG.md:
--------------------------------------------------------------------------------
1 | # Changelog
2 |
3 | All notable changes to this project will be documented in this file.
4 |
5 | The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/),
6 | and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
7 |
8 | ## [Unreleased]
9 |
10 | Please note: while the major version is zero (0.y.z), anything MAY change at any time.
11 | The public API SHOULD NOT be considered stable.
12 |
13 | ### Added
14 | - Support for server-specific data types as config map ([#63](https://github.com/datacontract/datacontract-specification/issues/63))
15 | - AWS Glue Catalog server support
16 | - sftp server support
17 | - info.status field
18 | - oracle server support
19 | - field.title attribute
20 | - model.title attribute
21 | - AWS Kinesis Data Streams server support
22 |
23 | ## [0.9.3] - 2024-03-06
24 |
25 | ### Added
26 |
27 | - Service levels as a top level `servicelevels` element
28 | - pubsub server support
29 | - primary key and relationship support via `field.primary` and `field.references` attributes
30 | - databricks server support improved
31 |
32 | ## [0.9.2] - 2024-01-04
33 |
34 | ### Added
35 |
36 | - Format and validation attributes to fields in models and definitions
37 | - Postgres support
38 | - Databricks support
39 |
40 | ## [0.9.1] - 2023-11-19
41 |
42 | ### Added
43 |
44 | - A logical data model (#13), mainly to simplify editor support with a defined schema, make breaking changes easier to detect, and improve Databricks support.
45 | - Definitions (#14) for reusable semantic definitions within one data contract or across data contracts.
46 |
47 | ### Removed
48 |
49 | - Property `info.dataProduct` as data products should define which data contracts they implement.
50 | - Property `info.outputPort` as data products should define which data contracts they implement.
51 |
52 | These removals are not considered breaking changes, as these attributes are now treated as specification extensions.
53 |
54 | ## [0.9.0] - 2023-09-12
55 |
56 | First public release.
57 |
--------------------------------------------------------------------------------
/sources/datacontract.com/workshop.md:
--------------------------------------------------------------------------------
1 | # Data Contract Workshop
2 |
3 | Bring data producers and consumers together to define data contracts in a facilitated workshop.
4 |
5 | ## Goal
6 |
7 | A defined and agreed upon data contract between data producers and consumers.
8 |
9 | ## Participants
10 |
11 | - Facilitator
12 |   - Neutral moderator and typist
13 |   - Should know the [Data Contract Specification](https://datacontract.com) and its tools well
14 |   - Get the [authors of the Data Contract Specification](https://datacontract.com/#authors) as facilitators for your workshop.
15 | - Data producer
16 |   - Product Owner
17 |   - Software Engineers
18 | - Data consumers
19 |   - Product Owner
20 |   - Data Engineers / Scientists / Analysts
21 |
22 | Recommendation: keep the group small (no more than 5 people).
23 |
24 | ## Settings
25 |
26 | - Show the data contract on the screen for the whole workshop (projector, screen share, ...)
27 | - The facilitator is the typist
28 | - The facilitator is the moderator
29 | - The data producer and data consumers discuss and dictate changes to the facilitator
30 |
31 | ## Recommended Order of Completion
32 |
33 | 1. Info (get the context)
34 | 2. Examples (example-driven facilitation)
35 | 3. Model (you will spend most of your time here)
36 |    - Use the [Data Contract CLI](https://cli.datacontract.com) to test the model against the previously created examples:\
37 |      `datacontract test --examples datacontract.yaml`
38 | 4. Quality
39 | 5. Terms
40 | 6. Servers (if already applicable)
41 |    - Start with a "local" server with actual, real data you downloaded
42 |    - Use the [Data Contract CLI](https://cli.datacontract.com) to test the model against the actual data on a specific server:\
43 |      `datacontract test datacontract.yaml`
44 |    - Switch to the actual remote server, if applicable
45 |
46 | ## Tooling
47 |
48 | - Open the [starter template](https://datacontract.com/datacontract.init.yaml) in the [Data Contract Studio](https://studio.datacontract.com) and get going. If you lack an experienced facilitator, ignore any validation errors and warnings within the studio.
49 | - Use the [Data Contract Studio](https://studio.datacontract.com) to share the results of the workshop afterward with the participants and other stakeholders.
50 | - Use the [Data Contract CLI](https://cli.datacontract.com) to validate the data contract after the workshop.
51 |
52 | ## Related
53 |
54 | - This data contract workshop could be a follow-up to a data product design workshop using the [Data Product Canvas](https://www.datamesh-architecture.com/data-product-canvas), making the offered contract at the output port of the designed data product more concrete.
55 |
--------------------------------------------------------------------------------
/sources/datacontract.com/datacontract.init.yaml:
--------------------------------------------------------------------------------
1 | dataContractSpecification: 0.9.3
2 | id: my-data-contract-id
3 | info:
4 |   title: My Data Contract
5 |   version: 0.0.1
6 |   # description:
7 |   # owner:
8 |   # contact:
9 |   #   name:
10 |   #   url:
11 |   #   email:
12 | 
13 | 
14 | ### servers
15 | 
16 | #servers:
17 | #  production:
18 | #    type: s3
19 | #    location: s3://
20 | #    format: parquet
21 | #    delimiter: new_line
22 | 
23 | ### terms
24 | 
25 | #terms:
26 | #  usage:
27 | #  limitations:
28 | #  billing:
29 | #  noticePeriod:
30 | 
31 | 
32 | ### models
33 | 
34 | # models:
35 | #   my_model:
36 | #     description:
37 | #     type:
38 | #     fields:
39 | #       my_field:
40 | #         type:
41 | #         description:
42 | 
43 | 
44 | ### definitions
45 | 
46 | # definitions:
47 | #   my_field:
48 | #     domain:
49 | #     name:
50 | #     title:
51 | #     type:
52 | #     description:
53 | #     example:
54 | #     pii:
55 | #     classification:
56 | 
57 | 
58 | ### examples
59 | 
60 | #examples:
61 | #  - type: csv
62 | #    model: my_model
63 | #    data: |-
64 | #      id,timestamp,amount
65 | #      "1001","2023-09-09T08:30:00Z",2500
66 | #      "1002","2023-09-08T15:45:00Z",1800
67 | 
68 | ### servicelevels
69 | 
70 | #servicelevels:
71 | #  availability:
72 | #    description: The server is available during support hours
73 | #    percentage: 99.9%
74 | #  retention:
75 | #    description: Data is retained for one year
76 | #    period: P1Y
77 | #    unlimited: false
78 | #  latency:
79 | #    description: Data is available within 25 hours after the order was placed
80 | #    threshold: 25h
81 | #    sourceTimestampField: orders.order_timestamp
82 | #    processedTimestampField: orders.processed_timestamp
83 | #  freshness:
84 | #    description: The age of the youngest row in a table.
85 | #    threshold: 25h
86 | #    timestampField: orders.order_timestamp
87 | #  frequency:
88 | #    description: Data is delivered once a day
89 | #    type: batch # or streaming
90 | #    interval: daily # for batch, either interval or cron
91 | #    cron: 0 0 * * * # for batch, either interval or cron
92 | #  support:
93 | #    description: The data is available during typical business hours at headquarters
94 | #    time: 9am to 5pm in EST on business days
95 | #    responseTime: 1h
96 | #  backup:
97 | #    description: Data is backed up once a week, every Sunday at 0:00 UTC.
98 | #    interval: weekly
99 | #    cron: 0 0 * * 0
100 | #    recoveryTime: 24 hours
101 | #    recoveryPoint: 1 week
102 | 
103 | ### quality
104 | 
105 | #quality:
106 | #  type: SodaCL
107 | #  specification:
108 | #    checks for my_model: |-
109 | #      - duplicate_count(id) = 0
110 | 
--------------------------------------------------------------------------------
/sources/datacontract.com/definition.schema.json:
--------------------------------------------------------------------------------
1 | {
2 |   "$schema": "http://json-schema.org/draft-07/schema#",
3 |   "type": "object",
4 |   "description": "Clear and concise explanations of syntax, semantics, and classification of business objects in a given domain.",
5 |   "properties": {
6 |     "domain": {
7 |       "type": "string",
8 |       "description": "The domain in which this definition is valid.",
9 |       "default": "global"
10 |     },
11 |     "name": {
12 |       "type": "string",
13 |       "description": "The technical name of this definition."
14 |     },
15 |     "title": {
16 |       "type": "string",
17 |       "description": "The business name of this definition."
18 |     },
19 |     "description": {
20 |       "type": "string",
21 |       "description": "Clear and concise explanations related to the domain."
22 |     },
23 |     "type": {
24 |       "type": "string",
25 |       "description": "The logical data type."
26 |     },
27 |     "minLength": {
28 |       "type": "integer",
29 |       "description": "A value must be greater than or equal to this value. Applies only to string types."
30 |     },
31 |     "maxLength": {
32 |       "type": "integer",
33 |       "description": "A value must be less than or equal to this value. Applies only to string types."
34 |     },
35 |     "format": {
36 |       "type": "string",
37 |       "description": "Specific format requirements for the value (e.g., 'email', 'uri', 'uuid')."
38 |     },
39 |     "precision": {
40 |       "type": "integer",
41 |       "examples": [
42 |         38
43 |       ],
44 |       "description": "The maximum number of digits in a number. Only applies to numeric values. Defaults to 38."
45 |     },
46 |     "scale": {
47 |       "type": "integer",
48 |       "examples": [
49 |         0
50 |       ],
51 |       "description": "The maximum number of decimal places in a number. Only applies to numeric values. Defaults to 0."
52 |     },
53 |     "pattern": {
54 |       "type": "string",
55 |       "description": "A regular expression pattern the value must match. Applies only to string types."
56 |     },
57 |     "example": {
58 |       "type": "string",
59 |       "description": "An example value."
60 |     },
61 |     "pii": {
62 |       "type": "boolean",
63 |       "description": "Indicates if the field contains Personal Identifiable Information (PII)."
64 |     },
65 |     "classification": {
66 |       "type": "string",
67 |       "description": "The data class defining the sensitivity level for this field."
68 |     },
69 |     "tags": {
70 |       "type": "array",
71 |       "items": {
72 |         "type": "string"
73 |       },
74 |       "description": "Custom metadata to provide additional context."
75 |     }
76 |   },
77 |   "required": [
78 |     "name",
79 |     "type"
80 |   ]
81 | }
82 | 
--------------------------------------------------------------------------------
/_layouts/default.html:
--------------------------------------------------------------------------------
1 | 
2 | 
3 | 
--------------------------------------------------------------------------------
/sources/cli.datacontract.com/README.md:
--------------------------------------------------------------------------------
10 | 
11 | The `datacontract` CLI is an open source command-line tool for working with [Data Contracts](https://datacontract.com/).
12 | It uses data contract YAML files to lint the data contract, connect to data sources and execute schema and quality tests, detect breaking changes, and export to different formats. The tool is written in Python. It can be used as a standalone CLI tool, in a CI/CD pipeline, or directly as a Python library.
13 |
14 | 
15 |
16 |
17 | ## Getting started
18 |
19 | Let's look at this data contract:
20 | [https://datacontract.com/examples/orders-latest/datacontract.yaml](https://datacontract.com/examples/orders-latest/datacontract.yaml)
21 |
22 | We have a _servers_ section with endpoint details for the S3 bucket, _models_ describing the structure of the data, and _servicelevels_ and _quality_ attributes describing the expected freshness and number of rows.
23 |
24 | This data contract contains all the information needed to connect to S3 and check that the actual data meets the defined schema and quality requirements. We can use this information to test whether the actual data set in S3 is compliant with the data contract.
25 |
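In abbreviated form, the contract looks roughly like this (a trimmed sketch for illustration; see the link above for the full file):

```yaml
dataContractSpecification: 0.9.3
id: urn:datacontract:checkout:orders-latest
info:
  title: Orders Latest
  version: 1.0.0
servers:
  production:
    type: s3
    location: s3://datacontract-example-orders-latest/data/{model}/*.json # illustrative bucket location
    format: json
    delimiter: new_line
models:
  orders:
    type: table
    fields:
      order_id:
        type: text
        required: true
        unique: true
```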
26 | Let's use [pip](https://pip.pypa.io/en/stable/getting-started/) to install the CLI (or use the [Docker image](#docker), if you prefer).
27 | ```bash
28 | $ python3 -m pip install datacontract-cli
29 | ```
30 |
31 | We run the tests:
32 |
33 | ```bash
34 | $ datacontract test https://datacontract.com/examples/orders-latest/datacontract.yaml
35 |
36 | # returns:
37 | Testing https://datacontract.com/examples/orders-latest/datacontract.yaml
38 | ╭────────┬─────────────────────────────────────────────────────────────────────┬───────────────────────────────┬─────────╮
39 | │ Result │ Check │ Field │ Details │
40 | ├────────┼─────────────────────────────────────────────────────────────────────┼───────────────────────────────┼─────────┤
41 | │ passed │ Check that JSON has valid schema │ orders │ │
42 | │ passed │ Check that JSON has valid schema │ line_items │ │
43 | │ passed │ Check that field order_id is present │ orders │ │
44 | │ passed │ Check that field order_timestamp is present │ orders │ │
45 | │ passed │ Check that field order_total is present │ orders │ │
46 | │ passed │ Check that field customer_id is present │ orders │ │
47 | │ passed │ Check that field customer_email_address is present │ orders │ │
48 | │ passed │ row_count >= 5000 │ orders │ │
49 | │ passed │ Check that required field order_id has no null values │ orders.order_id │ │
50 | │ passed │ Check that unique field order_id has no duplicate values │ orders.order_id │ │
51 | │ passed │ duplicate_count(order_id) = 0 │ orders.order_id │ │
52 | │ passed │ Check that required field order_timestamp has no null values │ orders.order_timestamp │ │
53 | │ passed │ freshness(order_timestamp) < 24h │ orders.order_timestamp │ │
54 | │ passed │ Check that required field order_total has no null values │ orders.order_total │ │
55 | │ passed │ Check that required field customer_email_address has no null values │ orders.customer_email_address │ │
56 | │ passed │ Check that field lines_item_id is present │ line_items │ │
57 | │ passed │ Check that field order_id is present │ line_items │ │
58 | │ passed │ Check that field sku is present │ line_items │ │
59 | │ passed │ values in (order_id) must exist in orders (order_id) │ line_items.order_id │ │
60 | │ passed │ row_count >= 5000 │ line_items │ │
61 | │ passed │ Check that required field lines_item_id has no null values │ line_items.lines_item_id │ │
62 | │ passed │ Check that unique field lines_item_id has no duplicate values │ line_items.lines_item_id │ │
63 | ╰────────┴─────────────────────────────────────────────────────────────────────┴───────────────────────────────┴─────────╯
64 | 🟢 data contract is valid. Run 22 checks. Took 6.739514 seconds.
65 | ```
66 |
67 | Voilà, the CLI tested that the _datacontract.yaml_ itself is valid, all records comply with the schema, and all quality attributes are met.
68 |
69 | We can also use the datacontract.yaml to export to many [formats](#export), e.g., to SQL:
70 |
71 | ```bash
72 | $ datacontract export --format sql https://datacontract.com/examples/orders-latest/datacontract.yaml
73 |
74 | # returns:
75 | -- Data Contract: urn:datacontract:checkout:orders-latest
76 | -- SQL Dialect: snowflake
77 | CREATE TABLE orders (
78 |   order_id TEXT not null primary key,
79 |   order_timestamp TIMESTAMP_TZ not null,
80 |   order_total NUMBER not null,
81 |   customer_id TEXT,
82 |   customer_email_address TEXT not null,
83 |   processed_timestamp TIMESTAMP_TZ not null
84 | );
85 | CREATE TABLE line_items (
86 |   lines_item_id TEXT not null primary key,
87 |   order_id TEXT,
88 |   sku TEXT
89 | );
90 | ```
91 |
92 | Or generate an HTML export:
93 |
94 | ```bash
95 | $ datacontract export --format html https://datacontract.com/examples/orders-latest/datacontract.yaml > datacontract.html
96 | ```
97 |
98 | which will create this [HTML export](https://datacontract.com/examples/orders-latest/datacontract.html).
99 |
100 | ## Usage
101 |
102 | ```bash
103 | # create a new data contract from example and write it to datacontract.yaml
104 | $ datacontract init datacontract.yaml
105 |
106 | # lint the datacontract.yaml
107 | $ datacontract lint datacontract.yaml
108 |
109 | # execute schema and quality checks
110 | $ datacontract test datacontract.yaml
111 |
112 | # execute schema and quality checks on the examples within the contract
113 | $ datacontract test --examples datacontract.yaml
114 |
115 | # export data contract as html (other formats: avro, dbt, dbt-sources, dbt-staging-sql, jsonschema, odcs, rdf, sql, sodacl, terraform, ...)
116 | $ datacontract export --format html datacontract.yaml > datacontract.html
117 |
118 | # import avro (other formats: sql, glue, bigquery...)
119 | $ datacontract import --format avro --source avro_schema.avsc
120 |
121 | # find differences between two data contracts
122 | $ datacontract diff datacontract-v1.yaml datacontract-v2.yaml
123 | 
124 | # find differences between two data contracts, categorized into error, warning, and info
125 | $ datacontract changelog datacontract-v1.yaml datacontract-v2.yaml
126 | 
127 | # fail the pipeline on breaking changes (uses changelog internally, showing only errors and warnings)
128 | $ datacontract breaking datacontract-v1.yaml datacontract-v2.yaml
129 | ```
130 |
131 | ## Programmatic (Python)
132 | ```python
133 | from datacontract.data_contract import DataContract
134 |
135 | data_contract = DataContract(data_contract_file="datacontract.yaml")
136 | run = data_contract.test()
137 | if not run.has_passed():
138 |     print("Data quality validation failed.")
139 |     # Abort pipeline, alert, or take corrective actions...
140 | ```
141 |
142 |
143 | ## Installation
144 |
145 | Choose the most appropriate installation method for your needs:
146 |
147 | ### pip
148 | Python 3.11 recommended.
149 | Python 3.12 support is available as a pre-release (release candidate for 0.9.3).
150 |
151 | ```bash
152 | python3 -m pip install datacontract-cli
153 | ```
154 |
155 | ### pipx
156 | pipx installs into an isolated environment.
157 | ```bash
158 | pipx install datacontract-cli
159 | ```
160 |
161 | ### Docker
162 |
163 | ```bash
164 | docker pull datacontract/cli
165 | docker run --rm -v ${PWD}:/home/datacontract datacontract/cli
166 | ```
167 |
168 | Or via an alias that automatically uses the latest version:
169 |
170 | ```bash
171 | alias datacontract='docker run --rm -v "${PWD}:/home/datacontract" datacontract/cli:latest'
172 | ```
173 |
174 | ## Documentation
175 |
176 | Commands
177 |
178 | - [init](#init)
179 | - [lint](#lint)
180 | - [test](#test)
181 | - [export](#export)
182 | - [import](#import)
183 | - [breaking](#breaking)
184 | - [changelog](#changelog)
185 | - [diff](#diff)
186 | - [catalog](#catalog)
187 | - [publish](#publish)
188 |
189 | ### init
190 |
191 | ```
192 | Usage: datacontract init [OPTIONS] [LOCATION]
193 |
194 | Download a datacontract.yaml template and write it to file.
195 |
196 | ╭─ Arguments ──────────────────────────────────────────────────────────────────────────────────╮
197 | │ location [LOCATION] The location (url or path) of the data contract yaml to create. │
198 | │ [default: datacontract.yaml] │
199 | ╰──────────────────────────────────────────────────────────────────────────────────────────────╯
200 | ╭─ Options ────────────────────────────────────────────────────────────────────────────────────╮
201 | │ --template TEXT URL of a template or data contract │
202 | │ [default: │
203 | │ https://datacontract.com/datacontract.init.yaml] │
204 | │ --overwrite --no-overwrite Replace the existing datacontract.yaml │
205 | │ [default: no-overwrite] │
206 | │ --help Show this message and exit. │
207 | ╰──────────────────────────────────────────────────────────────────────────────────────────────╯
208 | ```
209 |
210 | ### lint
211 |
212 | ```
213 | Usage: datacontract lint [OPTIONS] [LOCATION]
214 |
215 | Validate that the datacontract.yaml is correctly formatted.
216 |
217 | ╭─ Arguments ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
218 | │ location [LOCATION] The location (url or path) of the data contract yaml. [default: datacontract.yaml] │
219 | ╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
220 | ╭─ Options ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
221 | │ --schema TEXT The location (url or path) of the Data Contract Specification JSON Schema │
222 | │ [default: https://datacontract.com/datacontract.schema.json] │
223 | │ --help Show this message and exit. │
224 | ╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
225 | ```
226 |
227 | ### test
228 |
229 | ```
230 | Usage: datacontract test [OPTIONS] [LOCATION]
231 |
232 | Run schema and quality tests on configured servers.
233 |
234 | ╭─ Arguments ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
235 | │ location [LOCATION] The location (url or path) of the data contract yaml. [default: datacontract.yaml] │
236 | ╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
237 | ╭─ Options ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
238 | │ --schema TEXT The location (url or path) of the Data Contract │
239 | │ Specification JSON Schema │
240 | │ [default: │
241 | │ https://datacontract.com/datacontract.schema.json] │
242 | │ --server TEXT The server configuration to run the schema and quality │
243 | │ tests. Use the key of the server object in the data │
244 | │ contract yaml file to refer to a server, e.g., │
245 | │ `production`, or `all` for all servers (default). │
246 | │ [default: all] │
247 | │ --examples --no-examples Run the schema and quality tests on the example data │
248 | │ within the data contract. │
249 | │ [default: no-examples] │
250 | │ --publish TEXT The url to publish the results after the test │
251 | │ [default: None] │
252 | │ --publish-to-opentelemetry --no-publish-to-opentelemetry Publish the results to opentelemetry. Use environment │
253 | │ variables to configure the OTLP endpoint, headers, etc. │
254 | │ [default: no-publish-to-opentelemetry] │
255 | │ --logs --no-logs Print logs [default: no-logs] │
256 | │ --help Show this message and exit. │
257 | ╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
258 | ```
259 |
260 | Data Contract CLI connects to a data source and runs schema and quality tests to verify that the data contract is valid.
261 |
262 | ```bash
263 | $ datacontract test --server production datacontract.yaml
264 | ```
265 |
266 | To connect to the databases, the `server` block in the datacontract.yaml is used to set up the connection.
267 | In addition, credentials, such as usernames and passwords, may be defined with environment variables.
268 | 
269 | The application uses different engines, based on the server `type`.
270 | Internally, it connects with DuckDB, Spark, or a native connection and executes most tests with _soda-core_ and _fastjsonschema_.
271 | 
272 | Credentials are provided with environment variables.
273 |
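For example, for a Postgres server (the variable names for each server type are listed in the sections below; the server key `postgres` is taken from the Postgres example):

```bash
# provide credentials via environment variables, then run the tests
export DATACONTRACT_POSTGRES_USERNAME=postgres
export DATACONTRACT_POSTGRES_PASSWORD=mysecretpassword
datacontract test --server postgres datacontract.yaml
```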
274 | Supported server types:
275 |
276 | - [s3](#S3)
277 | - [bigquery](#bigquery)
278 | - [azure](#azure)
279 | - [sqlserver](#sqlserver)
280 | - [databricks](#databricks)
281 | - [databricks (programmatic)](#databricks-programmatic)
282 | - [dataframe (programmatic)](#dataframe-programmatic)
283 | - [snowflake](#snowflake)
284 | - [kafka](#kafka)
285 | - [postgres](#postgres)
286 | - [local](#local)
287 |
288 | Supported formats:
289 |
290 | - parquet
291 | - json
292 | - csv
293 | - delta
294 | - iceberg (coming soon)
295 |
296 | Feel free to create an [issue](https://github.com/datacontract/datacontract-cli/issues) if you need support for additional types and formats.
297 |
298 | ### S3
299 |
300 | Data Contract CLI can test data that is stored in S3 buckets or any S3-compliant endpoints in various formats.
301 |
302 | #### Examples
303 |
304 | ##### JSON
305 |
306 | datacontract.yaml
307 | ```yaml
308 | servers:
309 |   production:
310 |     type: s3
311 |     endpointUrl: https://minio.example.com # not needed with AWS S3
312 |     location: s3://bucket-name/path/*/*.json
313 |     format: json
314 |     delimiter: new_line # new_line, array, or none
315 | ```
316 |
317 | ##### Delta Tables
318 |
319 | datacontract.yaml
320 | ```yaml
321 | servers:
322 |   production:
323 |     type: s3
324 |     endpointUrl: https://minio.example.com # not needed with AWS S3
325 |     location: s3://bucket-name/path/table.delta # path to the Delta table folder containing parquet data files and the _delta_log
326 |     format: delta
327 | ```
328 |
329 | #### Environment Variables
330 |
331 | | Environment Variable | Example | Description |
332 | |-----------------------------------|-------------------------------|-----------------------|
333 | | `DATACONTRACT_S3_REGION` | `eu-central-1` | Region of S3 bucket |
334 | | `DATACONTRACT_S3_ACCESS_KEY_ID` | `AKIAXV5Q5QABCDEFGH` | AWS Access Key ID |
335 | | `DATACONTRACT_S3_SECRET_ACCESS_KEY` | `93S7LRrJcqLaaaa/XXXXXXXXXXXXX` | AWS Secret Access Key |
336 |
337 |
338 |
339 | ### BigQuery
340 |
341 | We support authentication to BigQuery using a Service Account Key. The Service Account used should include the following roles:
342 | * BigQuery Job User
343 | * BigQuery Data Viewer
344 |
345 |
346 | #### Example
347 |
348 | datacontract.yaml
349 | ```yaml
350 | servers:
351 |   production:
352 |     type: bigquery
353 |     project: datameshexample-product
354 |     dataset: datacontract_cli_test_dataset
355 | models:
356 |   datacontract_cli_test_table: # corresponds to a BigQuery table
357 |     type: table
358 |     fields: ...
359 | ```
360 |
361 | #### Environment Variables
362 |
363 | | Environment Variable | Example | Description |
364 | |----------------------------------------------|---------------------------|---------------------------------------------------------|
365 | | `DATACONTRACT_BIGQUERY_ACCOUNT_INFO_JSON_PATH` | `~/service-access-key.json` | Service Access key as saved on key creation by BigQuery |
366 |
367 |
368 |
369 | ### Azure
370 |
371 | Data Contract CLI can test data that is stored in Azure Blob Storage or Azure Data Lake Storage Gen2 (ADLS) in various formats.
372 |
373 | #### Example
374 |
375 | datacontract.yaml
376 | ```yaml
377 | servers:
378 |   production:
379 |     type: azure
380 |     location: abfss://datameshdatabricksdemo.dfs.core.windows.net/dataproducts/inventory_events/*.parquet
381 |     format: parquet
382 | ```
383 |
384 | #### Environment Variables
385 |
386 | Authentication works with an Azure Service Principal (SPN), a.k.a. App Registration, with a secret.
387 |
388 | | Environment Variable | Example | Description |
389 | |-----------------------------------|-------------------------------|------------------------------------------------------|
390 | | `DATACONTRACT_AZURE_TENANT_ID` | `79f5b80f-10ff-40b9-9d1f-774b42d605fc` | The Azure Tenant ID |
391 | | `DATACONTRACT_AZURE_CLIENT_ID` | `3cf7ce49-e2e9-4cbc-a922-4328d4a58622` | The ApplicationID / ClientID of the app registration |
392 | | `DATACONTRACT_AZURE_CLIENT_SECRET` | `yZK8Q~GWO1MMXXXXXXXXXXXXX` | The Client Secret value |
393 |
394 |
395 |
396 | ### Sqlserver
397 |
398 | Data Contract CLI can test data in MS SQL Server (including Azure SQL, Synapse Analytics SQL Pool).
399 |
400 | #### Example
401 |
402 | datacontract.yaml
403 | ```yaml
404 | servers:
405 |   production:
406 |     type: sqlserver
407 |     host: localhost
408 |     port: 1433
409 |     database: tempdb
410 |     schema: dbo
411 |     driver: ODBC Driver 18 for SQL Server
412 | models:
413 |   my_table_1: # corresponds to a table
414 |     type: table
415 |     fields:
416 |       my_column_1: # corresponds to a column
417 |         type: varchar
418 | ```
419 |
420 | #### Environment Variables
421 |
422 | | Environment Variable | Example | Description |
423 | |----------------------------------|--------------------|-------------|
424 | | `DATACONTRACT_SQLSERVER_USERNAME` | `root` | Username |
425 | | `DATACONTRACT_SQLSERVER_PASSWORD` | `toor` | Password |
426 | | `DATACONTRACT_SQLSERVER_TRUSTED_CONNECTION` | `True` | Use Windows authentication instead of username/password login |
427 | | `DATACONTRACT_SQLSERVER_TRUST_SERVER_CERTIFICATE` | `True` | Trust self-signed certificate |
428 | | `DATACONTRACT_SQLSERVER_ENCRYPTED_CONNECTION` | `True` | Use SSL |
429 |
430 |
431 |
432 |
433 | ### Databricks
434 |
435 | Works with Unity Catalog and Hive metastore.
436 |
437 | Needs a running SQL warehouse or compute cluster.
438 |
439 | #### Example
440 |
441 | datacontract.yaml
442 | ```yaml
443 | servers:
444 |   production:
445 |     type: databricks
446 |     host: dbc-abcdefgh-1234.cloud.databricks.com
447 |     catalog: acme_catalog_prod
448 |     schema: orders_latest
449 | models:
450 |   orders: # corresponds to a table
451 |     type: table
452 |     fields: ...
453 | ```
454 |
455 | #### Environment Variables
456 |
457 | | Environment Variable | Example | Description |
458 | |----------------------------------------------|--------------------------------------|-------------------------------------------------------|
459 | | `DATACONTRACT_DATABRICKS_TOKEN` | `dapia00000000000000000000000000000` | The personal access token to authenticate |
460 | | `DATACONTRACT_DATABRICKS_HTTP_PATH` | `/sql/1.0/warehouses/b053a3ffffffff` | The HTTP path to the SQL warehouse or compute cluster |
461 |
462 |
463 | ### Databricks (programmatic)
464 |
465 | Works with Unity Catalog and Hive metastore.
466 | When running in a notebook or pipeline, the provided `spark` session can be used.
467 | Additional authentication is not required.
468 |
469 | Requires a Databricks Runtime with Python >= 3.10.
470 |
471 | #### Example
472 |
473 | datacontract.yaml
474 | ```yaml
475 | servers:
476 |   production:
477 |     type: databricks
478 |     host: dbc-abcdefgh-1234.cloud.databricks.com # ignored, always use current host
479 |     catalog: acme_catalog_prod
480 |     schema: orders_latest
481 | models:
482 |   orders: # corresponds to a table
483 |     type: table
484 |     fields: ...
485 | ```
486 |
487 | Notebook
488 | ```python
489 | %pip install datacontract-cli
490 | dbutils.library.restartPython()
491 |
492 | from datacontract.data_contract import DataContract
493 |
494 | data_contract = DataContract(
495 |     data_contract_file="/Volumes/acme_catalog_prod/orders_latest/datacontract/datacontract.yaml",
496 |     spark=spark)
497 | run = data_contract.test()
498 | run.result
499 | ```
500 |
501 | ### Dataframe (programmatic)
502 |
503 | Works with Spark DataFrames.
504 | DataFrames need to be registered as named temporary views.
505 | Multiple temporary views are supported if your data contract contains multiple models.
506 | 
507 | Testing DataFrames is useful for validating your datasets in a pipeline before writing them to a data source.
508 |
509 | #### Example
510 |
511 | datacontract.yaml
512 | ```yaml
513 | servers:
514 |   production:
515 |     type: dataframe
516 | models:
517 |   my_table: # corresponds to a temporary view
518 |     type: table
519 |     fields: ...
520 | ```
521 |
522 | Example code
523 | ```python
524 | from datacontract.data_contract import DataContract
525 |
526 | df.createOrReplaceTempView("my_table")
527 |
528 | data_contract = DataContract(
529 |     data_contract_file="datacontract.yaml",
530 |     spark=spark,
531 | )
532 | run = data_contract.test()
533 | assert run.result == "passed"
534 | ```
535 |
536 |
537 | ### Snowflake
538 |
539 | Data Contract CLI can test data in Snowflake.
540 |
541 | #### Example
542 |
543 | datacontract.yaml
544 | ```yaml
545 |
546 | servers:
547 |   snowflake:
548 |     type: snowflake
549 |     account: abcdefg-xn12345
550 |     database: ORDER_DB
551 |     schema: ORDERS_PII_V2
552 | models:
553 |   my_table_1: # corresponds to a table
554 |     type: table
555 |     fields:
556 |       my_column_1: # corresponds to a column
557 |         type: varchar
558 | ```
559 |
560 | #### Environment Variables
561 |
562 | | Environment Variable | Example | Description |
563 | |------------------------------------|--------------------|-----------------------------------------------------|
564 | | `DATACONTRACT_SNOWFLAKE_USERNAME` | `datacontract` | Username |
565 | | `DATACONTRACT_SNOWFLAKE_PASSWORD` | `mysecretpassword` | Password |
566 | | `DATACONTRACT_SNOWFLAKE_ROLE` | `DATAVALIDATION` | The Snowflake role to use. |
567 | | `DATACONTRACT_SNOWFLAKE_WAREHOUSE` | `COMPUTE_WH` | The Snowflake warehouse to use when executing the tests. |
568 |
569 |
570 |
571 | ### Kafka
572 |
573 | Kafka support is currently considered experimental.
574 |
575 | #### Example
576 |
577 | datacontract.yaml
578 | ```yaml
579 | servers:
580 |   production:
581 |     type: kafka
582 |     host: abc-12345.eu-central-1.aws.confluent.cloud:9092
583 |     topic: my-topic-name
584 |     format: json
585 | ```
586 |
587 | #### Environment Variables
588 |
589 | | Environment Variable | Example | Description |
590 | |------------------------------------|---------|-----------------------------|
591 | | `DATACONTRACT_KAFKA_SASL_USERNAME` | `xxx` | The SASL username (key). |
592 | | `DATACONTRACT_KAFKA_SASL_PASSWORD` | `xxx` | The SASL password (secret). |
593 |
594 |
595 | ### Postgres
596 |
597 | Data Contract CLI can test data in Postgres or Postgres-compliant databases (e.g., RisingWave).
598 |
599 | #### Example
600 |
601 | datacontract.yaml
602 | ```yaml
603 | servers:
604 |   postgres:
605 |     type: postgres
606 |     host: localhost
607 |     port: 5432
608 |     database: postgres
609 |     schema: public
610 | models:
611 |   my_table_1: # corresponds to a table
612 |     type: table
613 |     fields:
614 |       my_column_1: # corresponds to a column
615 |         type: varchar
616 | ```
617 |
618 | #### Environment Variables
619 |
620 | | Environment Variable | Example | Description |
621 | |----------------------------------|--------------------|-------------|
622 | | `DATACONTRACT_POSTGRES_USERNAME` | `postgres` | Username |
623 | | `DATACONTRACT_POSTGRES_PASSWORD` | `mysecretpassword` | Password |
624 |
625 |
626 |
627 |
628 |
629 | ### export
630 |
631 | ```
632 |
633 | Usage: datacontract export [OPTIONS] [LOCATION]
634 |
635 | Convert data contract to a specific format. Prints to stdout or to the specified output file.
636 |
637 | ╭─ Arguments ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
638 | │ location [LOCATION] The location (url or path) of the data contract yaml. [default: datacontract.yaml] │
639 | ╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
640 | ╭─ Options ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
641 | │ * --format [jsonschema|pydantic-model|sodacl|dbt|dbt-sources|db The export format. [default: None] [required] │
642 | │ t-staging-sql|odcs|rdf|avro|protobuf|great-expectati │
643 | │ ons|terraform|avro-idl|sql|sql-query|html|go|bigquer │
644 | │ y|dbml] │
645 | │ --output PATH Specify the file path where the exported data will be │
646 | │ saved. If no path is provided, the output will be │
647 | │ printed to stdout. │
648 | │ [default: None] │
649 | │ --server TEXT The server name to export. [default: None] │
650 | │ --model TEXT Use the key of the model in the data contract yaml │
651 | │ file to refer to a model, e.g., `orders`, or `all` │
652 | │ for all models (default). │
653 | │ [default: all] │
654 | │ --help Show this message and exit. │
655 | ╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
656 | ╭─ RDF Options ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
657 | │ --rdf-base TEXT [rdf] The base URI used to generate the RDF graph. [default: None] │
658 | ╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
659 | ╭─ SQL Options ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
660 | │ --sql-server-type TEXT [sql] The server type to determine the sql dialect. By default, it uses 'auto' to automatically │
661 | │ detect the sql dialect via the specified servers in the data contract. │
662 | │ [default: auto] │
663 | ╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
664 |
665 | ```
666 |
667 | ```bash
668 | # Example export data contract as HTML
669 | datacontract export --format html > datacontract.html
670 | ```
671 |
672 | Available export options:
673 |
674 | | Type | Description | Status |
675 | |----------------------|---------------------------------------------------------|--------|
676 | | `html` | Export to HTML | ✅ |
677 | | `jsonschema` | Export to JSON Schema | ✅ |
678 | | `odcs` | Export to Open Data Contract Standard (ODCS) | ✅ |
679 | | `sodacl` | Export to SodaCL quality checks in YAML format | ✅ |
680 | | `dbt` | Export to dbt models in YAML format | ✅ |
681 | | `dbt-sources` | Export to dbt sources in YAML format | ✅ |
682 | | `dbt-staging-sql` | Export to dbt staging SQL models | ✅ |
683 | | `rdf` | Export data contract to RDF representation in N3 format | ✅ |
684 | | `avro` | Export to AVRO models | ✅ |
685 | | `protobuf` | Export to Protobuf | ✅ |
686 | | `terraform` | Export to terraform resources | ✅ |
687 | | `sql` | Export to SQL DDL | ✅ |
688 | | `sql-query` | Export to SQL Query | ✅ |
689 | | `great-expectations` | Export to Great Expectations Suites in JSON Format | ✅ |
690 | | `bigquery` | Export to BigQuery Schemas | ✅ |
691 | | `go` | Export to Go types | ✅ |
692 | | `pydantic-model` | Export to pydantic models | ✅ |
693 | | `dbml` | Export to a DBML diagram description | ✅ |
694 | | Missing something? | Please create an issue on GitHub | TBD |
695 |
696 | #### Great Expectations
697 |
698 | The export function transforms a specified data contract into a comprehensive Great Expectations JSON suite.
699 | If the contract includes multiple models, you need to specify the name of the model you wish to export.
700 |
701 | ```shell
702 | datacontract export datacontract.yaml --format great-expectations --model orders
703 | ```
704 |
705 | The export creates a list of expectations by utilizing:
706 |
707 | - The data from the Model definition with a fixed mapping
708 | - The expectations provided in the quality field for each model (see the [expectations gallery](https://greatexpectations.io/expectations/))
709 |
710 | #### RDF
711 |
712 | The export function converts a given data contract into an RDF representation. You have the option to
713 | add a base URL, which is used as the default prefix to resolve relative IRIs inside the document.
714 |
715 | ```shell
716 | datacontract export --format rdf --rdf-base https://www.example.com/ datacontract.yaml
717 | ```
718 |
719 | The data contract is mapped onto the following concepts of a yet-to-be-defined Data Contract
720 | Ontology named https://datacontract.com/DataContractSpecification/ :
721 | - DataContract
722 | - Server
723 | - Model
724 |
725 | Having the data contract inside an RDF graph gives us access to the following use cases:
726 | - Interoperability with other data contract specification formats
727 | - Storing data contracts inside a knowledge graph
728 | - Enhancing semantic search to find and retrieve data contracts
729 | - Linking model elements to already established ontologies and knowledge
730 | - Using the full power of OWL to reason about the graph structure of data contracts
731 | - Applying graph algorithms on multiple data contracts (find similar data contracts, find "gatekeeper"
732 |   data products, find the true domain owner of a field attribute)
733 |
734 | #### DBML
735 |
736 | The export function converts the logical data types of the data contract into the specific ones of a concrete database
737 | if a server is selected via the `--server` option (based on the `type` of that server). If no server is selected, the
738 | logical data types are exported.
739 |
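For example (assuming the data contract defines a server named `snowflake`):

```bash
# export a DBML description using the data types of the "snowflake" server
datacontract export --format dbml --server snowflake datacontract.yaml > datacontract.dbml
```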
740 |
741 | #### Avro
742 |
743 | The export function converts the data contract specification into an Avro schema. It supports specifying custom Avro properties for logical types and default values.
744 |
745 | ##### Custom Avro Properties
746 |
747 | We support a **config map on field level**. A config map may include any additional key-value pairs and supports multiple server type bindings.
748 |
749 | To specify custom Avro properties in your data contract, you can define them within the `config` section of your field definition. Below is an example of how to structure your YAML configuration to include custom Avro properties, such as `avroLogicalType` and `avroDefault`.
750 |
751 | >NOTE: At this moment, we just support [logicalType](https://avro.apache.org/docs/1.11.0/spec.html#Logical+Types) and [default](https://avro.apache.org/docs/1.11.0/spec.html)
752 |
753 | #### Example Configuration
754 |
755 | ```yaml
756 | models:
757 |   orders:
758 |     fields:
759 |       my_field_1:
760 |         description: Example for AVRO with Timestamp (microsecond precision) https://avro.apache.org/docs/current/spec.html#Local+timestamp+%28microsecond+precision%29
761 |         type: long
762 |         example: 1672534861000000 # Equivalent to 2023-01-01 01:01:01 in microseconds
763 |         config:
764 |           avroLogicalType: local-timestamp-micros
765 |           avroDefault: 1672534861000000
766 | ```
767 |
768 | #### Explanation
769 |
770 | - **models**: The top-level key that contains different models (tables or objects) in your data contract.
771 |   - **orders**: A specific model name. Replace this with the name of your model.
772 |     - **fields**: The fields within the model. Each field can have various properties defined.
773 |       - **my_field_1**: The name of a specific field. Replace this with your field name.
774 |         - **description**: A textual description of the field.
775 |         - **type**: The data type of the field. In this example, it is `long`.
776 |         - **example**: An example value for the field.
777 |         - **config**: Section to specify custom Avro properties.
778 |           - **avroLogicalType**: Specifies the logical type of the field in Avro. In this example, it is `local-timestamp-micros`.
779 |           - **avroDefault**: Specifies the default value for the field in Avro. In this example, it is `1672534861000000`, which corresponds to `2023-01-01 01:01:01 UTC`.
780 |
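Given this configuration, the exported Avro field would look roughly like this (a sketch; the exact output may vary by CLI version):

```json
{
  "name": "my_field_1",
  "type": {
    "type": "long",
    "logicalType": "local-timestamp-micros"
  },
  "default": 1672534861000000
}
```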
781 |
782 | ### import
783 |
784 | ```
785 | Usage: datacontract import [OPTIONS]
786 |
787 | Create a data contract from the given source location. Prints to stdout.
788 |
789 | ╭─ Options ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
790 | │ * --format [sql|avro|glue|bigquery|jsonschema] The format of the source file. [default: None] [required] │
791 | │ --source TEXT The path to the file or Glue Database that should be imported. │
792 | │ [default: None] │
793 | │ --glue-table TEXT List of table ids to import from the Glue Database (repeat for │
794 | │ multiple table ids, leave empty for all tables in the dataset). │
795 | │ [default: None] │
796 | │ --bigquery-project TEXT The bigquery project id. [default: None] │
797 | │ --bigquery-dataset TEXT The bigquery dataset id. [default: None] │
798 | │ --bigquery-table TEXT List of table ids to import from the bigquery API (repeat for │
799 | │ multiple table ids, leave empty for all tables in the dataset). │
800 | │ [default: None] │
801 | │ --help Show this message and exit. │
802 | ╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
803 | ```
804 |
805 | Example:
806 | ```bash
807 | # Example import from SQL DDL
808 | datacontract import --format sql --source my_ddl.sql
809 | ```
810 |
811 | Available import options:
812 |
813 | | Type | Description | Status |
814 | |--------------------|------------------------------------------------|---------|
815 | | `sql` | Import from SQL DDL | ✅ |
816 | | `avro` | Import from AVRO schemas | ✅ |
817 | | `glue` | Import from AWS Glue DataCatalog | ✅ |
818 | | `protobuf` | Import from Protobuf schemas | TBD |
819 | | `jsonschema` | Import from JSON Schemas | ✅ |
820 | | `bigquery` | Import from BigQuery Schemas | ✅ |
821 | | `dbt` | Import from dbt models | TBD |
822 | | `odcs` | Import from Open Data Contract Standard (ODCS) | TBD |
823 | | Missing something? | Please create an issue on GitHub | TBD |
824 |
825 |
826 | #### BigQuery
827 |
828 | BigQuery data can either be imported from JSON files generated from the table descriptions or directly from the BigQuery API. In case you want to use JSON files, specify the `source` parameter with a path to the JSON file.
829 | 
830 | To import from the BigQuery API, you have to _omit_ `source` and instead need to provide `bigquery-project` and `bigquery-dataset`. Additionally, you may specify `bigquery-table` to enumerate the tables that should be imported. If no tables are given, _all_ available tables of the dataset will be imported.
831 |
832 | For providing authentication to the client, please see [the Google documentation](https://cloud.google.com/docs/authentication/provide-credentials-adc#how-to) or the one [about authorizing client libraries](https://cloud.google.com/bigquery/docs/authentication#client-libs).
833 |
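For local development, Application Default Credentials can be set up, for example, with the gcloud CLI (one of several options described in the linked documentation):

```bash
gcloud auth application-default login
```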
834 | Examples:
835 |
836 | ```bash
837 | # Example import from Bigquery JSON
838 | datacontract import --format bigquery --source my_bigquery_table.json
839 | ```
840 |
841 | ```bash
842 | # Example import from Bigquery API with specifying the tables to import
843 | datacontract import --format bigquery --bigquery-project --bigquery-dataset --bigquery-table --bigquery-table --bigquery-table
844 | ```
845 |
846 | ```bash
847 | # Example import from Bigquery API importing all tables in the dataset
848 | datacontract import --format bigquery --bigquery-project --bigquery-dataset
849 | ```
850 |
851 | #### Glue
852 |
853 | Importing from Glue reads the necessary data directly from the AWS API.
854 | You may pass the `glue-table` parameter to enumerate the tables that should be imported. If no tables are given, _all_ available tables of the database will be imported.
855 |
856 | Examples:
857 |
858 | ```bash
859 | # Example import from AWS Glue with specifying the tables to import
860 | datacontract import --format glue --source --glue-table --glue-table --glue-table
861 | ```
862 |
863 | ```bash
864 | # Example import from AWS Glue importing all tables in the database
865 | datacontract import --format glue --source
866 | ```
867 |
868 |
869 | ### breaking
870 |
871 | ```
872 | Usage: datacontract breaking [OPTIONS] LOCATION_OLD LOCATION_NEW
873 |
874 | Identifies breaking changes between data contracts. Prints to stdout.
875 |
876 | ╭─ Arguments ───────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
877 | │ * location_old TEXT The location (url or path) of the old data contract yaml. [default: None] [required] │
878 | │ * location_new TEXT The location (url or path) of the new data contract yaml. [default: None] [required] │
879 | ╰───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
880 | ╭─ Options ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
881 | │ --help Show this message and exit. │
882 | ╰───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
883 | ```
884 |
885 | ### changelog
886 |
887 | ```
888 | Usage: datacontract changelog [OPTIONS] LOCATION_OLD LOCATION_NEW
889 |
890 | Generate a changelog between data contracts. Prints to stdout.
891 |
892 | ╭─ Arguments ───────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
893 | │ * location_old TEXT The location (url or path) of the old data contract yaml. [default: None] [required] │
894 | │ * location_new TEXT The location (url or path) of the new data contract yaml. [default: None] [required] │
895 | ╰───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
896 | ╭─ Options ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
897 | │ --help Show this message and exit. │
898 | ╰───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
899 | ```
900 |
901 | ### diff
902 |
903 | ```
904 | Usage: datacontract diff [OPTIONS] LOCATION_OLD LOCATION_NEW
905 |
906 | PLACEHOLDER. Currently works as 'changelog' does.
907 |
908 | ╭─ Arguments ───────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
909 | │ * location_old TEXT The location (url or path) of the old data contract yaml. [default: None] [required] │
910 | │ * location_new TEXT The location (url or path) of the new data contract yaml. [default: None] [required] │
911 | ╰───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
912 | ╭─ Options ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
913 | │ --help Show this message and exit. │
914 | ╰───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
915 | ```
916 |
917 | ### catalog
918 |
919 | ```
920 |
921 | Usage: datacontract catalog [OPTIONS]
922 |
923 | Create an html catalog of data contracts.
924 |
925 | ╭─ Options ────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
926 | │ --files TEXT Glob pattern for the data contract files to include in the catalog. [default: *.yaml] │
927 | │ --output TEXT Output directory for the catalog html files. [default: catalog/] │
928 | │ --help Show this message and exit. │
929 | ╰──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
930 | ```
931 |
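For example, to build a catalog from all data contracts in a folder (the paths are illustrative):

```bash
# build an HTML catalog from all data contracts in the contracts/ folder
datacontract catalog --files "contracts/*.yaml" --output catalog/
```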
932 | ### publish
933 |
934 | ```
935 |
936 | Usage: datacontract publish [OPTIONS] [LOCATION]
937 |
938 | Publish the data contract to the Data Mesh Manager.
939 |
940 | ╭─ Arguments ────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
941 | │ location [LOCATION] The location (url or path) of the data contract yaml. [default: datacontract.yaml] │
942 | ╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
943 | ╭─ Options ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
944 | │ --help Show this message and exit. │
945 | ╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
946 | ```
947 |
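A minimal sketch, assuming the API key is provided via the `DATAMESH_MANAGER_API_KEY` environment variable (see the Data Mesh Manager integration below):

```bash
export DATAMESH_MANAGER_API_KEY=xxx
datacontract publish datacontract.yaml
```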
948 | ## Integrations
949 |
950 | | Integration | Option | Description |
951 | |-------------------|------------------------------|-------------------------------------------------------------------------------------------------------|
952 | | Data Mesh Manager | `--publish` | Push full results to the [Data Mesh Manager API](https://api.datamesh-manager.com/swagger/index.html) |
953 | | OpenTelemetry | `--publish-to-opentelemetry` | Push result as gauge metrics |
954 |
955 | ### Integration with Data Mesh Manager
956 |
957 | If you use [Data Mesh Manager](https://datamesh-manager.com/), you can use the data contract URL and append the `--publish` option to send and display the test results. Set an environment variable for your API key.
958 |
959 | ```bash
960 | # Fetch current data contract, execute tests on production, and publish result to data mesh manager
961 | $ export DATAMESH_MANAGER_API_KEY=xxx
962 | $ datacontract test https://demo.datamesh-manager.com/demo279750347121/datacontracts/4df9d6ee-e55d-4088-9598-b635b2fdcbbc/datacontract.yaml --server production --publish
963 | ```
964 |
965 | ### Integration with OpenTelemetry
966 |
967 | If you use OpenTelemetry, you can use the data contract URL and append the `--publish-to-opentelemetry` option to send the test results to your OTLP-compatible instance, e.g., Prometheus.
968 |
969 | The metric name is `datacontract.cli.test.result`, and it uses the following encoding for the result:
970 |
971 | | datacontract.cli.test.result | Description |
972 | |-------|---------------------------------------|
973 | | 0 | test run passed, no warnings |
974 | | 1 | test run has warnings |
975 | | 2 | test run failed |
976 | | 3 | test run not possible due to an error |
977 | | 4 | test status unknown |
978 |
979 |
980 | ```bash
981 | # Fetch current data contract, execute tests on production, and publish result to open telemetry
982 | $ export OTEL_SERVICE_NAME=datacontract-cli
983 | $ export OTEL_EXPORTER_OTLP_ENDPOINT=https://YOUR_ID.apm.westeurope.azure.elastic-cloud.com:443
984 | $ export OTEL_EXPORTER_OTLP_HEADERS=Authorization=Bearer%20secret # Optional, when using SaaS products
985 | $ export OTEL_EXPORTER_OTLP_PROTOCOL=http/protobuf # Optional, default is http/protobuf - use value grpc to use the gRPC protocol instead
986 | # Send to OpenTelemetry
987 | $ datacontract test https://demo.datamesh-manager.com/demo279750347121/datacontracts/4df9d6ee-e55d-4088-9598-b635b2fdcbbc/datacontract.yaml --server production --publish-to-opentelemetry
988 | ```
989 |
990 | Current limitations:
991 | - Currently, only the ConsoleExporter and OTLP exporter are supported
992 | - Metrics only, no logs yet (but loosely planned)
993 |
994 |
995 | ## Best Practices
996 |
997 | We share best practices in using the Data Contract CLI.
998 |
999 | ### Data-first Approach
1000 |
1001 | Create a data contract based on the actual data. This is the fastest way to get started and to get feedback from the data consumers.
1002 |
1003 | 1. Use an existing physical schema (e.g., SQL DDL) as a starting point to define your logical data model in the contract. Double-check right after the import whether the actual data meets the imported logical data model. Just to be sure.
1004 | ```bash
1005 | $ datacontract import --format sql ddl.sql
1006 | $ datacontract test
1007 | ```
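
The import fills the `models` section of the contract. As a rough sketch of what that might look like (the table and field names here are illustrative; the actual output depends on your DDL):

```yaml
# Sketch of a models section as an SQL import might produce it (illustrative names)
models:
  orders:
    type: table
    fields:
      order_id:
        type: varchar
        required: true
      order_total:
        type: bigint
```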
1008 |
1009 | 2. Add examples to the `datacontract.yaml`. If you can, use actual data and anonymize it. Make sure that the examples match the imported logical data model (see the sketch below).
1010 | ```bash
1011 | $ datacontract test --examples
1012 | ```
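
A minimal sketch of an inline example entry, with an illustrative model name and data:

```yaml
# Illustrative inline example for a model named orders
examples:
  - type: csv
    model: orders
    description: Two anonymized orders
    data: |-
      order_id,order_total
      "1001",2500
      "1002",1800
```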
1013 |
1014 |
1015 | 3. Add quality checks and additional type constraints one by one to the contract and make sure that the examples and the actual data still adhere to the contract. Check against the examples for a very fast feedback loop (see the sketch below).
1016 | ```bash
1017 | $ datacontract test --examples
1018 | $ datacontract test
1019 | ```
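
What such incremental tightening might look like, with an illustrative field constraint and a SodaCL quality check (names and thresholds are made up):

```yaml
# Illustrative type constraint and quality check for a model named orders
models:
  orders:
    type: table
    fields:
      order_id:
        type: varchar
        required: true
        pattern: ^[0-9]{4}$ # illustrative format constraint
quality:
  type: SodaCL
  specification:
    checks for orders:
      - row_count > 0
```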
1020 |
1021 | 4. Use the linter to make sure that all best practices for a `datacontract.yaml` are met. You probably still need to document some fields and add the terms and conditions.
1022 | ```bash
1023 | $ datacontract lint
1024 | ```
1025 |
1026 | 5. Set up a CI pipeline that executes the tests daily and reports the results to [Data Mesh Manager](https://datamesh-manager.com), or to any other OpenTelemetry-compatible system (a hypothetical workflow sketch follows below).
1027 | ```bash
1028 | $ datacontract test --publish https://api.datamesh-manager.com/api/runs
1029 | ```
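
A hypothetical GitHub Actions workflow for such a daily run (the workflow file name, schedule, and secret name are assumptions, not part of the CLI):

```yaml
# .github/workflows/datacontract-test.yml (hypothetical)
name: Test data contract
on:
  schedule:
    - cron: "0 6 * * *" # daily at 06:00 UTC
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: pip install datacontract-cli
      # assumes the API key is configured as a repository secret
      - run: datacontract test --server production --publish https://api.datamesh-manager.com/api/runs
        env:
          DATAMESH_MANAGER_API_KEY: ${{ secrets.DATAMESH_MANAGER_API_KEY }}
```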
1030 |
1031 | ### Contract-first Approach
1032 |
1033 | Create a data contract based on the requirements from use cases.
1034 |
1035 | 1. Start with a `datacontract.yaml` template.
1036 | ```bash
1037 | $ datacontract init
1038 | ```
1039 |
1040 | 2. Add examples to the `datacontract.yaml`. Do not start with the data model, although you are probably tempted to do that. Examples are the fastest way to get feedback from everybody and not lose anyone in the discussion.
1041 |
1042 | 3. Create the model based on the examples, and test the model against the examples to double-check that they match.
1043 | ```bash
1044 | $ datacontract test --examples
1045 | ```
1046 |
1047 | 4. Add quality checks and additional type constraints one by one to the contract and make sure that the examples and the actual data still adhere to the contract. Check against the examples for a very fast feedback loop.
1048 | ```bash
1049 | $ datacontract test --examples
1050 | ```
1051 |
1052 | 5. Fill in the terms, descriptions, etc. Make sure you follow all best practices for a `datacontract.yaml` using the linter.
1053 | ```bash
1054 | $ datacontract lint
1055 | ```
1056 |
1057 | 6. Set up a CI pipeline that lints the contract and tests the examples, so that later changes do not decrease the quality of the contract.
1058 | ```bash
1059 | $ datacontract lint
1060 | $ datacontract test --examples
1061 | ```
1062 |
1063 | 7. Use the export function to start building the data product that provides the data, as well as the integration into the consuming data products.
1064 | ```bash
1065 | # data provider
1066 | $ datacontract export --format dbt
1067 | # data consumer
1068 | $ datacontract export --format dbt-sources
1069 | $ datacontract export --format dbt-staging-sql
1070 | ```
1071 |
1072 | ### Schema Evolution
1073 |
1074 | #### Non-breaking Changes
1075 | Examples: Adding models or fields.
1076 |
1077 | - Add the models or fields in the `datacontract.yaml`.
1078 | - Increment the minor version of the `datacontract.yaml` on any change; simply edit the file.
1079 | - You need a policy that defines these changes as non-breaking. This implies that consumers must not use the star expression (`SELECT *`) in SQL to query a table under contract. Make these consequences known.
1080 | - Fail the build in the pull request if a `datacontract.yaml` accidentally introduces a breaking change despite only a minor version bump (see the CI sketch after this list).
1081 | ```bash
1082 | $ datacontract breaking datacontract-from-pr.yaml datacontract-from-main.yaml
1083 | ```
1084 | - Create a changelog of this minor change.
1085 | ```bash
1086 | $ datacontract changelog datacontract-from-pr.yaml datacontract-from-main.yaml
1087 | ```
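
A hypothetical GitHub Actions job that guards pull requests against accidental breaking changes (the file name and steps are assumptions):

```yaml
# .github/workflows/datacontract-breaking.yml (hypothetical)
name: Check for breaking changes
on: [pull_request]
jobs:
  breaking:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # fetch the contract as it looks on main for comparison
      - run: git fetch origin main && git show origin/main:datacontract.yaml > datacontract-from-main.yaml
      - run: pip install datacontract-cli
      # assumes the command exits non-zero when it detects breaking changes
      - run: datacontract breaking datacontract.yaml datacontract-from-main.yaml
```
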
1088 | #### Breaking Changes
1089 | Examples: Removing or renaming models and fields.
1090 |
1091 | - Remove or rename models and fields in the `datacontract.yaml`, and apply any other change that is part of this new major version of the data contract.
1092 | - Increment the major version of the `datacontract.yaml` and create a new file for the new major version, as you need to offer data consumers an upgrade path from the old to the new major version.
1093 | - As data consumers need to migrate, try to reduce the frequency of major versions by bundling multiple breaking changes together, if possible.
1094 | - Be aware of the notice period in the data contract, as this is the minimum amount of time you must offer both the old and the new version as a migration path.
1095 | - Do not fear making breaking changes with data contracts. It's okay to do them in this controlled way. Really!
1096 | - Create a changelog of this major change.
1097 | ```bash
1098 | $ datacontract changelog datacontract-from-pr.yaml datacontract-from-main.yaml
1099 | ```
1100 |
1101 | ## Development Setup
1102 |
1103 | The Python base interpreter should be 3.11.x (unless you are working on the 3.12 release candidate).
1104 |
1105 | ```bash
1106 | # create venv
1107 | python3 -m venv venv
1108 | source venv/bin/activate
1109 |
1110 | # Install Requirements
1111 | pip install --upgrade pip setuptools wheel
1112 | pip install -e '.[dev]'
1113 | ruff check --fix
1114 | ruff format
1115 | pytest
1116 | ```
1117 |
1118 |
1119 | ### Docker Build
1120 |
1121 | ```bash
1122 | docker build -t datacontract/cli .
1123 | docker run --rm -v ${PWD}:/home/datacontract datacontract/cli
1124 | ```
1125 |
1126 | #### Docker Compose Integration
1127 |
1128 | We've included a [docker-compose.yml](./docker-compose.yml) configuration to simplify the build, test, and deployment of the image.
1129 |
1130 | ##### Building the Image with Docker Compose
1131 |
1132 | To build the Docker image using Docker Compose, run the following command:
1133 |
1134 | ```bash
1135 | docker compose build
1136 | ```
1137 |
1138 | This command uses the `docker-compose.yml` to build the image with predefined settings such as the build context and Dockerfile location, so you do not have to specify them manually each time.
1139 |
1140 | ##### Testing the Image
1141 |
1142 | After building the image, you can test it directly with Docker Compose:
1143 |
1144 | ```bash
1145 | docker compose run --rm datacontract --version
1146 | ```
1147 |
1148 | This command runs the container momentarily to check the version of the `datacontract` CLI. The `--rm` flag ensures that the container is automatically removed after the command executes, keeping your environment clean.
1149 |
1150 |
1151 |
1152 | ## Release Steps
1153 |
1154 | 1. Update the version in `pyproject.toml`
1155 | 2. Have a look at the `CHANGELOG.md`
1156 | 3. Create release commit manually
1157 | 4. Execute `./release`
1158 | 5. Wait until GitHub Release is created
1159 | 6. Add the release notes to the GitHub Release
1160 |
1161 | ## Contribution
1162 |
1163 | We are happy to receive your contributions. Propose your change in an issue or directly create a pull request with your improvements.
1164 |
1165 | ## Companies using this tool
1166 |
1167 | - [INNOQ](https://innoq.com)
1168 | - And many more. To add your company, please create a pull request.
1169 |
1170 | ## License
1171 |
1172 | [MIT License](LICENSE)
1173 |
1174 | ## Credits
1175 |
1176 | Created by [Stefan Negele](https://www.linkedin.com/in/stefan-negele-573153112/) and [Jochen Christ](https://www.linkedin.com/in/jochenchrist/).
1177 |
1178 |
1179 |
1180 |
1181 |
--------------------------------------------------------------------------------