├── .gitignore
├── CONTRIBUTING.md
├── README.md
├── versions
│   ├── v0.2.0
│   │   ├── CHANGELOG.md
│   │   └── transformation-specification.md
│   ├── v0.2.1
│   │   └── CHANGELOG.md
│   └── v0.1.0
│       └── transformation-specification.md
├── CODE_OF_CONDUCT.md
└── LICENSE
/.gitignore:
--------------------------------------------------------------------------------
1 | # Draft content - not included in the specification
2 | draft/
3 |
4 | # Common files
5 | *.pyc
6 | *.pyo
7 | *.pyd
8 | __pycache__/
9 | .pytest_cache/
10 | *.egg-info/
11 | dist/
12 | build/
13 |
14 | # IDE
15 | .vscode/
16 | .idea/
17 | *.swp
18 | *.swo
19 | *~
20 |
21 | # OS files
22 | .DS_Store
23 | Thumbs.db
24 |
25 |
--------------------------------------------------------------------------------
/CONTRIBUTING.md:
--------------------------------------------------------------------------------
1 | # Contributing to the Open Transformation Specification
2 |
3 | We welcome contributions to the Open Transformation Specification! This document provides guidelines for contributing to the project.
4 |
5 | ## How to Contribute
6 |
7 | We welcome new contributors to the project, whether you have changes to suggest, problems to report, or feedback to share. Please choose the most relevant option from the list below:
8 |
9 | - Ask a question or offer feedback: use a discussion
10 | - Suggest a change or report a problem: open an issue
11 | - Contribute a change to the repository: open a pull request
12 | - Or just get in touch
13 |
14 | ## Development Process
15 |
16 | The Open Transformation Specification is currently in early development. The development process is evolving as we establish the foundation for the specification.
17 |
18 | ### Current Focus Areas
19 |
20 | - **Core Specification**: Defining the fundamental structure and concepts
21 | - **Examples**: Creating clear examples of transformation specifications
22 | - **Documentation**: Building comprehensive documentation
23 | - **Community**: Establishing governance and contribution processes
24 |
25 | ## Specification Guidelines
26 |
27 | When contributing to the specification:
28 |
29 | - Use clear, unambiguous language
30 | - Provide examples for complex concepts
31 | - Consider backward compatibility
32 | - Follow the established structure in the latest version directory under `versions/`
33 |
34 | ## Code Standards
35 |
36 | - Use clear, descriptive commit messages
37 | - Follow existing formatting conventions
38 | - Include documentation for new features
39 | - Test any code changes thoroughly
40 |
41 | ## Getting Help
42 |
43 | - Check existing [issues](https://github.com/francescomucio/open-transformation-specification/issues)
44 | - Start a [discussion](https://github.com/francescomucio/open-transformation-specification/discussions)
45 | - Review the [Code of Conduct](CODE_OF_CONDUCT.md)
46 |
47 | ## License
48 |
49 | By contributing to the Open Transformation Specification, you agree that your contributions will be licensed under the Apache 2.0 License.
50 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # The Open Transformation Specification
2 |
3 | [License: Apache 2.0](https://opensource.org/licenses/Apache-2.0)
4 |
5 | The Open Transformation Specification (OTS) is a community-driven open specification that defines a standard, programming language-agnostic interface description for data transformation pipelines and workflows, including transformations, data quality tests, and user-defined functions (UDFs).
6 |
7 | This specification enables **interoperability** between tools and platforms, allowing data transformations to be shared, understood, and executed across different systems without vendor lock-in. By providing a common standard, OTS shifts the data transformation ecosystem from isolated, proprietary tools to an **open core** where tools can seamlessly work together around a shared specification.
8 |
9 | This specification allows both humans and computers to discover and understand the capabilities of data transformations without requiring access to source code, additional documentation, or inspection of execution logs. When properly defined via OTS, a consumer can understand and interact with data transformation pipelines with a minimal amount of implementation logic.
10 |
11 | ## Versions
12 |
13 | This repository contains the Markdown sources for all published Open Transformation Specification versions. For release notes and release candidate versions, refer to the [releases page](https://github.com/francescomucio/open-transformation-specification/releases).
14 |
15 | - **Current Version**: [v0.2.1](versions/v0.2.1/transformation-specification.md)
16 | - **Previous Versions**:
17 | - [v0.2.0](versions/v0.2.0/transformation-specification.md)
18 | - [v0.1.0](versions/v0.1.0/transformation-specification.md)
19 |
20 | ## Interoperability and Ecosystem
21 |
22 | The Open Transformation Specification enables true **interoperability** in the data transformation space. By defining a common standard, OTS allows:
23 |
24 | - **Cross-tool compatibility**: Transformations defined in one tool can be consumed and executed by any OTS-compliant tool
25 | - **Vendor independence**: Avoid lock-in to proprietary formats and tools
26 | - **Ecosystem growth**: An open core standard enables a thriving ecosystem of compatible tools, libraries, and services
27 | - **Seamless integration**: Tools can work together, sharing transformations, tests, and functions without conversion or manual intervention
28 |
29 | ## Tools and Libraries
30 |
31 | The OTS ecosystem is growing with tools and libraries that implement the specification. These tools demonstrate the **interoperability** benefits of the standard:
32 |
33 | - **[Tee for Transform](https://github.com/francescomucio/tee-for-transform)** - A Python framework for managing SQL data transformations with support for multiple database backends. Fully OTS-compliant with import/export capabilities.
34 |
35 | *More tools and libraries will be documented as they become available. If you're building an OTS-compliant tool, we'd love to feature it here!*
36 |
37 | ## Participation
38 |
39 | The current process for developing the Open Transformation Specification is described in the [Contributing Guidelines](CONTRIBUTING.md).
40 |
41 | ## Community
42 |
43 | Join our community on Discord to discuss the specification, ask questions, share ideas, and connect with other contributors: [Discord Community](https://discord.gg/gKm6KgACY7)
44 |
45 | ## Licensing
46 |
47 | See: [License](LICENSE) (Apache-2.0)
48 |
--------------------------------------------------------------------------------
/versions/v0.2.0/CHANGELOG.md:
--------------------------------------------------------------------------------
1 | # OTS v0.2.0 Changelog
2 |
3 | ## New Features
4 |
5 | ### User-Defined Functions (UDFs) Support
6 |
7 | **Added `source_functions` field to SQL transformation code structure**
8 |
9 | - **Field**: `source_functions` (optional array of strings)
10 | - **Purpose**: Track user-defined functions (UDFs) called in SQL transformations
11 | - **Location**: `code.sql.source_functions`
12 | - **Format**: Array of function names (preferably fully qualified: `schema.function_name`)
13 |
14 | ### Changes to SQL Transformation Structure
15 |
16 | The `code.sql` object now includes:
17 |
18 | ```yaml
19 | code:
20 | sql:
21 | original_sql: string
22 | resolved_sql: string
23 | source_tables: [string] # Existing field
24 | source_functions: [string] # NEW in v0.2.0
25 | ```
26 |
27 | ### Benefits
28 |
29 | 1. **Dependency Analysis**: Enables accurate dependency graph building including function dependencies
30 | 2. **Execution Order**: Ensures functions are created before transformations that use them
31 | 3. **Validation**: Allows tools to verify all required functions exist before execution
32 | 4. **Function Chains**: Supports function-to-function dependencies
33 |
34 | ### Backward Compatibility
35 |
36 | - `source_functions` is **optional** - existing v0.1.0 modules remain valid
37 | - If omitted, tools should assume an empty array `[]`
38 | - All examples in v0.2.0 include `source_functions: []` for transformations without function dependencies
39 |
40 | ### Functions Array in OTS Modules
41 |
42 | **Added `functions` array to OTS Module structure**
43 |
44 | - **Field**: `functions` (optional array of function definitions)
45 | - **Purpose**: Define user-defined functions (UDFs) that can be used in transformations
46 | - **Location**: Top-level in OTS Module (same level as `transformations`)
47 | - **Structure**: Array of function definitions with `function_id`, `function_type`, `language`, `code`, `parameters`, `return_type`, `deterministic`, `dependencies`, and `metadata`
48 |
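For illustration, a single entry in the `functions` array might look like the following sketch. The field values, the `scalar` function type, and the parameter layout are assumptions for the example, not normative values.

```yaml
functions:
  - function_id: "analytics.normalize_email"   # hypothetical fully qualified function name
    function_type: "scalar"                     # assumed type value for this sketch
    language: "sql"
    code: "SELECT lower(trim(email))"           # illustrative function body
    parameters:                                 # assumed parameter layout
      - name: "email"
        datatype: "string"
    return_type: "string"
    deterministic: true
    dependencies: []                            # other functions this function calls
    metadata:
      owner: "data-team"
```
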
49 | **Function Execution Order:**
50 | - Functions are created before transformations in dependency order
51 | - Function-to-function dependencies are resolved automatically
52 | - Execution order: Seeds → Functions → Transformations
53 |
54 | **Function Overloading:**
55 | - OTS 0.2.0 supports function overloading (multiple functions with the same name but different signatures)
56 | - Functions are identified by fully qualified name and parameter signature
57 | - Each overloaded function is tracked separately in the `functions` array
58 |
59 | ### Test Library Structure Updates
60 |
61 | **Added `ots_version` field to Test Library structure**
62 |
63 | - **Field**: `ots_version` (required string)
64 | - **Purpose**: Indicates which version of the OTS standard the test library follows
65 | - **Location**: Top-level in Test Library file
66 | - **Format**: OTS version string (e.g., "0.2.0")
67 |
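A minimal sketch of a test library file header with the new field; the empty test maps are placeholders showing where the library's test definitions would go.

```yaml
ots_version: "0.2.0"    # NEW in v0.2.0: version of the OTS standard this library follows

generic_tests: {}       # generic SQL tests defined by the library
singular_tests: {}      # singular SQL tests defined by the library
```
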
68 | ### Documentation Updates
69 |
70 | - Added new section: "User-Defined Functions (UDFs)" with examples
71 | - Updated all code examples to include `source_functions` field
72 | - Added example transformation using UDFs
73 | - Added `functions` array definition to OTS Module structure
74 | - Updated test library examples to include `ots_version` field
75 |
76 | ## Migration from v0.1.0
77 |
78 | To migrate an OTS Module from v0.1.0 to v0.2.0:
79 |
80 | 1. Update `ots_version` from `"0.1.0"` to `"0.2.0"`
81 | 2. Add `source_functions: []` to all `code.sql` objects (if no functions are used)
82 | 3. If transformations use UDFs, populate `source_functions` with function names
83 |
84 | Example migration:
85 |
86 | ```yaml
87 | # v0.1.0
88 | code:
89 | sql:
90 | original_sql: "..."
91 | resolved_sql: "..."
92 | source_tables: ["table1"]
93 |
94 | # v0.2.0
95 | code:
96 | sql:
97 | original_sql: "..."
98 | resolved_sql: "..."
99 | source_tables: ["table1"]
100 | source_functions: [] # Add this line
101 | ```
102 |
103 |
--------------------------------------------------------------------------------
/versions/v0.2.1/CHANGELOG.md:
--------------------------------------------------------------------------------
1 | # OTS v0.2.1 Changelog
2 |
3 | ## New Features
4 |
5 | ### Schema Change Handling for Incremental Materialization
6 |
7 | **Added `on_schema_change` field to incremental materialization configuration**
8 |
9 | - **Field**: `on_schema_change` (optional string)
10 | - **Purpose**: Control how schema differences between transformation output and existing target table are handled
11 | - **Location**: `materialization.incremental_details.on_schema_change`
12 | - **Default**: `"fail"` (fail transformation if schema changes detected)
13 | - **Options**:
14 | - `"fail"`: Fail the transformation if any schema differences are detected (default)
15 | - `"ignore"`: Ignore schema differences and proceed (may cause errors if columns don't match)
16 | - `"append_new_columns"`: Automatically add new columns, keep existing columns
17 | - `"sync_all_columns"`: Add new columns and remove missing columns (may cause data loss)
18 | - `"full_refresh"`: Drop and recreate table with full transformation output
19 | - `"full_incremental_refresh"`: Drop, recreate, then run incremental strategy in chunks
20 | - `"recreate_empty"`: Drop and recreate as empty table (for external backfilling)
21 |
22 | ### Full Incremental Refresh Configuration
23 |
24 | **Added `full_incremental_refresh` field for chunked incremental execution**
25 |
26 | - **Field**: `full_incremental_refresh` (optional object)
27 | - **Purpose**: Configure parameter-based incremental chunking after table recreation
28 | - **Location**: Top-level in transformation definition (same level as `materialization`)
29 | - **Required when**: `on_schema_change` is set to `"full_incremental_refresh"`
30 |
31 | **Structure:**
32 | ```yaml
33 | full_incremental_refresh:
34 | parameters:
35 | - name: string # Parameter name (matches placeholder, e.g., "@start_date")
36 | start_value: string # Initial value for the parameter
37 | end_value: string # End condition: hardcoded value or expression evaluated against source table (e.g., "max(event_date)" from source.events)
38 | step: string # Increment step (SQL interval or numeric value)
39 | ```
40 |
41 | **Use Cases:**
42 | - Single parameter: One parameter in the array (e.g., `@start_date`)
43 | - Multiple parameters: Multiple parameters for boundary-based queries (e.g., `@start_date` and `@end_date`)
44 | - Ideally, both parameters use the same `step` value, but they can differ if needed
45 | - Parameter names must match placeholders in the query (e.g., `"@start_date"` matches `'@start_date'` in SQL)
46 |
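As a sketch of the multi-parameter case (the dates and step values are illustrative):

```yaml
full_incremental_refresh:
  parameters:
    - name: "@start_date"
      start_value: "2024-01-01"
      end_value: "max(event_date)"   # evaluated against the source table
      step: "INTERVAL 1 DAY"
    - name: "@end_date"
      start_value: "2024-01-02"
      end_value: "max(event_date)"
      step: "INTERVAL 1 DAY"
```
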
47 | ### Schema Change Behavior Details
48 |
49 | **Type Mismatches:**
49 | - Columns with the same name but a different data type are treated as schema changes
50 | - With `on_schema_change="fail"`, type mismatches cause immediate failure
51 | - A column whose data type has changed is treated as a different column, so the transformation fails immediately
53 |
54 | **Column Order:**
55 | - Column order differences are detected and logged as warnings
56 | - Tools should rely on explicit column lists in INSERT/MERGE statements
57 | - No automatic reordering is performed
58 |
59 | **Schema Comparison:**
60 | - Schema comparison happens after time-based filtering but before execution
61 | - Uses database-specific `DESCRIBE` or equivalent methods for precision
62 | - If table doesn't exist, schema comparison is skipped and table is created normally
63 |
64 | ## Changes to Incremental Materialization Structure
65 |
66 | The `incremental_details` object now includes:
67 |
68 | ```yaml
69 | incremental_details:
70 | strategy: string
71 | delete_condition: string
72 | filter_condition: string
73 | merge_key: [string]
74 | update_columns: [string]
75 | on_schema_change: string # NEW in v0.2.1
76 | ```
77 |
78 | The transformation definition now supports:
79 |
80 | ```yaml
81 | materialization:
82 | type: "incremental"
83 | incremental_details: {...}
84 |
85 | full_incremental_refresh: # NEW in v0.2.1 (optional)
86 | parameters: [...]
87 | ```
88 |
89 | ## Backward Compatibility
90 |
91 | - `on_schema_change` is **optional** - existing v0.2.0 modules remain valid
92 | - If omitted, default behavior is `"fail"` (matches previous implicit behavior)
93 | - `full_incremental_refresh` is **optional** - only required when `on_schema_change="full_incremental_refresh"`
94 | - All existing v0.2.0 examples remain valid in v0.2.1
95 |
96 | ## Migration from v0.2.0
97 |
98 | To migrate an OTS Module from v0.2.0 to v0.2.1:
99 |
100 | 1. Update `ots_version` from `"0.2.0"` to `"0.2.1"`
101 | 2. Optionally add `on_schema_change` to `incremental_details` if you want explicit schema change handling
102 | 3. If using `full_incremental_refresh`, add the `full_incremental_refresh` configuration
103 |
104 | Example migration:
105 |
106 | ```yaml
107 | # v0.2.0
108 | materialization:
109 | type: "incremental"
110 | incremental_details:
111 | strategy: "append"
112 | filter_condition: "created_at >= '@start_date'"
113 |
114 | # v0.2.1 (explicit default)
115 | materialization:
116 | type: "incremental"
117 | incremental_details:
118 | strategy: "append"
119 | filter_condition: "created_at >= '@start_date'"
120 | on_schema_change: "fail" # Optional: explicit default
121 |
122 | # v0.2.1 (with schema change handling)
123 | materialization:
124 | type: "incremental"
125 | incremental_details:
126 | strategy: "append"
127 | filter_condition: "created_at >= '@start_date'"
128 | on_schema_change: "append_new_columns" # Auto-add new columns
129 |
130 | # v0.2.1 (with full incremental refresh)
131 | materialization:
132 | type: "incremental"
133 | incremental_details:
134 | strategy: "append"
135 | filter_condition: "event_date >= '@start_date'"
136 | on_schema_change: "full_incremental_refresh"
137 |
138 | full_incremental_refresh:
139 | parameters:
140 | - name: "@start_date"
141 | start_value: "2024-01-01"
142 | end_value: "max(event_date)" # Evaluated against source table
143 | step: "INTERVAL 1 DAY"
144 | ```
145 |
146 | ## Documentation Updates
147 |
148 | - Added "Schema Change Handling" section to Incremental Materialization documentation
149 | - Added examples for all `on_schema_change` options
150 | - Added `full_incremental_refresh` configuration examples
151 | - Updated append strategy example to show `on_schema_change` usage
152 | - Added documentation for type mismatch and column order handling
153 |
154 |
155 |
--------------------------------------------------------------------------------
/CODE_OF_CONDUCT.md:
--------------------------------------------------------------------------------
1 | # Contributor Covenant 3.0 Code of Conduct
2 |
3 | ## Our Pledge
4 |
5 | We pledge to make our community welcoming, safe, and equitable for all.
6 |
7 | We are committed to fostering an environment that respects and promotes the dignity, rights, and contributions of all individuals, regardless of characteristics including race, ethnicity, caste, color, age, physical characteristics, neurodiversity, disability, sex or gender, gender identity or expression, sexual orientation, language, philosophy or religion, national or social origin, socio-economic position, level of education, or other status. The same privileges of participation are extended to everyone who participates in good faith and in accordance with this Covenant.
8 |
9 | ## Encouraged Behaviors
10 |
11 | While acknowledging differences in social norms, we all strive to meet our community's expectations for positive behavior. We also understand that our words and actions may be interpreted differently than we intend based on culture, background, or native language.
12 |
13 | With these considerations in mind, we agree to behave mindfully toward each other and act in ways that center our shared values, including:
14 |
15 | 1. Respecting the **purpose of our community**, our activities, and our ways of gathering.
16 | 2. Engaging **kindly and honestly** with others.
17 | 3. Respecting **different viewpoints** and experiences.
18 | 4. **Taking responsibility** for our actions and contributions.
19 | 5. Gracefully giving and accepting **constructive feedback**.
20 | 6. Committing to **repairing harm** when it occurs.
21 | 7. Behaving in other ways that promote and sustain the **well-being of our community**.
22 |
23 | ## Restricted Behaviors
24 |
25 | We agree to restrict the following behaviors in our community. Instances, threats, and promotion of these behaviors are violations of this Code of Conduct.
26 |
27 | 1. **Harassment.** Violating explicitly expressed boundaries or engaging in unnecessary personal attention after any clear request to stop.
28 | 2. **Character attacks.** Making insulting, demeaning, or pejorative comments directed at a community member or group of people.
29 | 3. **Stereotyping or discrimination.** Characterizing anyone's personality or behavior on the basis of immutable identities or traits.
30 | 4. **Sexualization.** Behaving in a way that would generally be considered inappropriately intimate in the context or purpose of the community.
31 | 5. **Violating confidentiality**. Sharing or acting on someone's personal or private information without their permission.
32 | 6. **Endangerment.** Causing, encouraging, or threatening violence or other harm toward any person or group.
33 | 7. Behaving in other ways that **threaten the well-being** of our community.
34 |
35 | ### Other Restrictions
36 |
37 | 1. **Misleading identity.** Impersonating someone else for any reason, or pretending to be someone else to evade enforcement actions.
38 | 2. **Failing to credit sources.** Not properly crediting the sources of content you contribute.
39 | 3. **Promotional materials**. Sharing marketing or other commercial content in a way that is outside the norms of the community.
40 | 4. **Irresponsible communication.** Failing to responsibly present content which includes, links or describes any other restricted behaviors.
41 |
42 | ## Reporting an Issue
43 |
44 | Tensions can occur between community members even when they are trying their best to collaborate. Not every conflict represents a code of conduct violation, and this Code of Conduct reinforces encouraged behaviors and norms that can help avoid conflicts and minimize harm.
45 |
46 | When an incident does occur, it is important to report it promptly. To report a possible violation, please contact the project maintainers at [contact email].
47 |
48 | Community Moderators take reports of violations seriously and will make every effort to respond in a timely manner. They will investigate all reports of code of conduct violations, reviewing messages, logs, and recordings, or interviewing witnesses and other participants. Community Moderators will keep investigation and enforcement actions as transparent as possible while prioritizing safety and confidentiality. In order to honor these values, enforcement actions are carried out in private with the involved parties, but communicating to the whole community may be part of a mutually agreed upon resolution.
49 |
50 | ## Addressing and Repairing Harm
51 |
52 | If an investigation by the Community Moderators finds that this Code of Conduct has been violated, the following enforcement ladder may be used to determine how best to repair harm, based on the incident's impact on the individuals involved and the community as a whole. Depending on the severity of a violation, lower rungs on the ladder may be skipped.
53 |
54 | 1. **Warning**
55 | 1. Event: A violation involving a single incident or series of incidents.
56 | 2. Consequence: A private, written warning from the Community Moderators.
57 | 3. Repair: Examples of repair include a private written apology, acknowledgement of responsibility, and seeking clarification on expectations.
58 |
59 | 2. **Temporarily Limited Activities**
60 | 1. Event: A repeated incidence of a violation that previously resulted in a warning, or the first incidence of a more serious violation.
61 | 2. Consequence: A private, written warning with a time-limited cooldown period designed to underscore the seriousness of the situation and give the community members involved time to process the incident. The cooldown period may be limited to particular communication channels or interactions with particular community members.
62 | 3. Repair: Examples of repair may include making an apology, using the cooldown period to reflect on actions and impact, and being thoughtful about re-entering community spaces after the period is over.
63 |
64 | 3. **Temporary Suspension**
65 | 1. Event: A pattern of repeated violation which the Community Moderators have tried to address with warnings, or a single serious violation.
66 | 2. Consequence: A private written warning with conditions for return from suspension. In general, temporary suspensions give the person being suspended time to reflect upon their behavior and possible corrective actions.
67 | 3. Repair: Examples of repair include respecting the spirit of the suspension, meeting the specified conditions for return, and being thoughtful about how to reintegrate with the community when the suspension is lifted.
68 |
69 | 4. **Permanent Ban**
70 | 1. Event: A pattern of repeated code of conduct violations that other steps on the ladder have failed to resolve, or a violation so serious that the Community Moderators determine there is no way to keep the community safe with this person as a member.
71 | 2. Consequence: Access to all community spaces, tools, and communication channels is removed. In general, permanent bans should be rarely used, should have strong reasoning behind them, and should only be resorted to if working through other remedies has failed to change the behavior.
72 | 3. Repair: There is no possible repair in cases of this severity.
73 |
74 | This enforcement ladder is intended as a guideline. It does not limit the ability of Community Moderators to use their discretion and judgment, in keeping with the best interests of our community.
75 |
76 | ## Scope
77 |
78 | This Code of Conduct applies within all community spaces, and also applies when an individual is officially representing the community in public or other spaces. Examples of representing our community include using an official email address, posting via an official social media account, or acting as an appointed representative at an online or offline event.
79 |
80 | ## Attribution
81 |
82 | This Code of Conduct is adapted from the [Contributor Covenant, version 3.0](https://www.contributor-covenant.org/version/3/0/code_of_conduct/), permanently available at https://www.contributor-covenant.org/version/3/0/.
83 |
84 | Contributor Covenant is stewarded by the Organization for Ethical Source and licensed under CC BY-SA 4.0. To view a copy of this license, visit https://creativecommons.org/licenses/by-sa/4.0/
85 |
86 | For answers to common questions about Contributor Covenant, see the FAQ at https://www.contributor-covenant.org/faq. Translations are provided at https://www.contributor-covenant.org/translations. Additional enforcement and community guideline resources can be found at https://www.contributor-covenant.org/resources. The enforcement ladder was inspired by the work of Mozilla's code of conduct team.
87 |
88 |
89 |
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 | Apache License
2 | Version 2.0, January 2004
3 | http://www.apache.org/licenses/
4 |
5 | TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
6 |
7 | 1. Definitions.
8 |
9 | "License" shall mean the terms and conditions for use, reproduction,
10 | and distribution as defined by Sections 1 through 9 of this document.
11 |
12 | "Licensor" shall mean the copyright owner or entity granting the License.
13 |
14 | "Legal Entity" shall mean the union of the acting entity and all
15 | other entities that control, are controlled by, or are under common
16 | control with that entity. For the purposes of this definition,
17 | "control" means (i) the power, direct or indirect, to cause the
18 | direction or management of such entity, whether by contract or
19 | otherwise, or (ii) ownership of fifty percent (50%) or more of the
20 | outstanding shares, or (iii) beneficial ownership of such entity.
21 |
22 | "You" (or "Your") shall mean an individual or Legal Entity
23 | exercising permissions granted by this License.
24 |
25 | "Source" shall mean the preferred form for making modifications,
26 | including but not limited to software source code, documentation
27 | source, and configuration files.
28 |
29 | "Object" shall mean any form resulting from mechanical
30 | transformation or translation of a Source form, including but
31 | not limited to compiled object code, generated documentation,
32 | and conversions to other media types.
33 |
34 | "Work" shall mean the work of authorship, whether in Source or
35 | Object form, made available under the License, as indicated by a
36 | copyright notice that is included in or attached to the work
37 | (which shall not include communications that are clearly marked or
38 | otherwise designated in writing by the copyright owner as "Not a Work").
39 |
40 | "Derivative Works" shall mean any work, whether in Source or Object
41 | form, that is based upon (or derived from) the Work and for which the
42 | editorial revisions, annotations, elaborations, or other modifications
43 | represent, as a whole, an original work of authorship. For the purposes
44 | of this License, Derivative Works shall not include works that remain
45 | separable from, or merely link (or bind by name) to the interfaces of,
46 | the Work and derivative works thereof.
47 |
48 | "Contribution" shall mean any work of authorship, including
49 | the original version of the Work and any modifications or additions
50 | to that Work or Derivative Works thereof, that is intentionally
51 | submitted to Licensor for inclusion in the Work by the copyright owner
52 | or by an individual or Legal Entity authorized to submit on behalf of
53 | the copyright owner. For the purposes of this definition, "submitted"
54 | means any form of electronic, verbal, or written communication sent
55 | to the Licensor or its representatives, including but not limited to
56 | communication on electronic mailing lists, source code control
57 | systems, and issue tracking systems that are managed by, or on behalf
58 | of, the Licensor for the purpose of discussing and improving the Work,
59 | but excluding communication that is conspicuously marked or otherwise
60 | designated in writing by the copyright owner as "Not a Work".
61 |
62 | 2. Grant of Copyright License. Subject to the terms and conditions of
63 | this License, each Contributor hereby grants to You a perpetual,
64 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable
65 | copyright license to use, reproduce, modify, distribute, and prepare
66 | Derivative Works of, and to display, perform, and distribute the
67 | Work and such Derivative Works in Source or Object form.
68 |
69 | 3. Grant of Patent License. Subject to the terms and conditions of
70 | this License, each Contributor hereby grants to You a perpetual,
71 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable
72 | (except as stated in this section) patent license to make, have made,
73 | use, offer to sell, sell, import, and otherwise transfer the Work,
74 | where such license applies only to those patent claims licensable
75 | by such Contributor that are necessarily infringed by their
76 | Contribution(s) alone or by combination of their Contribution(s)
77 | with the Work to which such Contribution(s) was submitted. If You
78 | institute patent litigation against any entity (including a
79 | cross-claim or counterclaim in a lawsuit) alleging that the Work
80 | or a Contribution incorporated within the Work constitutes direct
81 | or contributory patent infringement, then any patent licenses
82 | granted to You under this License for that Work shall terminate
83 | as of the date such litigation is filed.
84 |
85 | 4. Redistribution. You may reproduce and distribute copies of the
86 | Work or Derivative Works thereof in any medium, with or without
87 | modifications, and in Source or Object form, provided that You
88 | meet the following conditions:
89 |
90 | (a) You must give any other recipients of the Work or
91 | Derivative Works a copy of this License; and
92 |
93 | (b) You must cause any modified files to carry prominent notices
94 | stating that You changed the files; and
95 |
96 | (c) You must retain, in the Source form of any Derivative Works
97 | that You distribute, all copyright, patent, trademark, and
98 | attribution notices from the Source form of the Work,
99 | excluding those notices that do not pertain to any part of
100 | the Derivative Works; and
101 |
102 | (d) If the Work includes a "NOTICE" text file as part of its
103 | distribution, then any Derivative Works that You distribute must
104 | include a readable copy of the attribution notices contained
105 | within such NOTICE file, excluding those notices that do not
106 | pertain to any part of the Derivative Works, in at least one
107 | of the following places: within a NOTICE text file distributed
108 | as part of the Derivative Works; within the Source form or
109 | documentation, if provided along with the Derivative Works; or,
110 | within a display generated by the Derivative Works, if and
111 | wherever such third-party notices normally appear. The contents
112 | of the NOTICE file are for informational purposes only and
113 | do not modify the License. You may add Your own attribution
114 | notices within Derivative Works that You distribute, alongside
115 | or as an addendum to the NOTICE text from the Work, provided
116 | that such additional attribution notices cannot be construed
117 | as modifying the License.
118 |
119 | You may add Your own copyright notice to Your modifications and
120 | may provide additional or different license terms and conditions
121 | for use, reproduction, or distribution of Your modifications, or
122 | for any such Derivative Works as a whole, provided Your use,
123 | reproduction, and distribution of the Work otherwise complies with
124 | the conditions stated in this License.
125 |
126 | 5. Submission of Contributions. Unless You explicitly state otherwise,
127 | any Contribution intentionally submitted for inclusion in the Work
128 | by You to the Licensor shall be under the terms and conditions of
129 | this License, without any additional terms or conditions.
130 | Notwithstanding the above, nothing herein shall supersede or modify
131 | the terms of any separate license agreement you may have executed
132 | with Licensor regarding such Contributions.
133 |
134 | 6. Trademarks. This License does not grant permission to use the trade
135 | names, trademarks, service marks, or product names of the Licensor,
136 | except as required for reasonable and customary use in describing the
137 | origin of the Work and reproducing the content of the NOTICE file.
138 |
139 | 7. Disclaimer of Warranty. Unless required by applicable law or
140 | agreed to in writing, Licensor provides the Work (and each
141 | Contributor provides its Contributions) on an "AS IS" BASIS,
142 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
143 | implied, including, without limitation, any warranties or conditions
144 | of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
145 | PARTICULAR PURPOSE. You are solely responsible for determining the
146 | appropriateness of using or redistributing the Work and assume any
147 | risks associated with Your exercise of permissions under this License.
148 |
149 | 8. Limitation of Liability. In no event and under no legal theory,
150 | whether in tort (including negligence), contract, or otherwise,
151 | unless required by applicable law (such as deliberate and grossly
152 | negligent acts) or agreed to in writing, shall any Contributor be
153 | liable to You for damages, including any direct, indirect, special,
154 | incidental, or consequential damages of any character arising as a
155 | result of this License or out of the use or inability to use the
156 | Work (including but not limited to damages for loss of goodwill,
157 | work stoppage, computer failure or malfunction, or any and all
158 | other commercial damages or losses), even if such Contributor
159 | has been advised of the possibility of such damages.
160 |
161 | 9. Accepting Warranty or Support. You may choose to offer, and to
162 | charge a fee for, warranty, support, indemnity or other liability
163 | obligations and/or rights consistent with this License. However, in
164 | accepting such obligations, You may act only on Your own behalf and
165 | on Your sole responsibility, not on behalf of any other Contributor,
166 | and only if You agree to indemnify, defend, and hold each Contributor
167 | harmless for any liability incurred by, or claims asserted against,
168 | such Contributor by reason of your accepting any such warranty or support.
169 |
170 | END OF TERMS AND CONDITIONS
171 |
172 | APPENDIX: How to apply the Apache License to your work.
173 |
174 | To apply the Apache License to your work, attach the following
175 | boilerplate notice, with the fields enclosed by brackets "[]"
176 | replaced with your own identifying information. (Don't include
177 | the brackets!) The text should be enclosed in the appropriate
178 | comment syntax for the file format. We also recommend that a
179 | file or class name and description of purpose be included on the
180 | same "printed page" as the copyright notice for easier
181 | identification within third-party archives.
182 |
183 | Copyright [yyyy] [name of copyright owner]
184 |
185 | Licensed under the Apache License, Version 2.0 (the "License");
186 | you may not use this file except in compliance with the License.
187 | You may obtain a copy of the License at
188 |
189 | http://www.apache.org/licenses/LICENSE-2.0
190 |
191 | Unless required by applicable law or agreed to in writing, software
192 | distributed under the License is distributed on an "AS IS" BASIS,
193 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
194 | See the License for the specific language governing permissions and
195 | limitations under the License.
196 |
197 |
198 |
--------------------------------------------------------------------------------
/versions/v0.1.0/transformation-specification.md:
--------------------------------------------------------------------------------
1 | # Open Transformation Specification v0.1.0
2 |
3 | ## Table of Contents
4 |
5 | 1. [Introduction](#introduction)
6 | 2. [Core Concepts](#core-concepts)
7 | 3. [Components of an OTD](#components-of-an-otd)
8 | 4. [Materialization Types](#materialization-types)
9 | 5. [Data Quality Tests](#data-quality-tests)
10 | 6. [Examples](#examples)
11 |
12 | ## Introduction
13 |
14 | The Open Transformation Specification (OTS) defines a standard, programming language-agnostic interface description for data transformations. This specification allows both humans and computers to discover and understand how transformations behave, what outputs they produce, and how those outputs are materialized (as tables, views, incremental updates, SCD2, etc.) without requiring additional documentation or configuration.
15 |
16 | An OTS-based transformation must include both the code that transforms the data and metadata about the transformation. A tool implementing OTS should be able to execute an OTS transformation with no additional code or information beyond what's specified in the OTS document.
17 |
18 | ## Core Concepts
19 |
20 | ### Open Transformation Definition (OTD)
21 |
22 | An **Open Transformation Definition (OTD)** is a concrete instance of the Open Transformation Specification - a single file or document that describes a specific data transformation using the OTS format.
23 |
24 | A transformation is a unit of data processing that takes one or more data sources as input and produces one data output. Right now, transformations are SQL queries, but we plan to add support for other programming languages in the future.
25 |
26 | ### Open Transformation Specification Module
27 |
28 | An **Open Transformation Specification Module (OTS Module)** is a collection of related transformations that target the same database and schema. An OTS Module can contain one or more transformations, much like how an OpenAPI specification can contain multiple endpoints.
29 |
30 | Key characteristics of an OTS Module:
31 | - **Single target**: All transformations in a module target the same database and schema
32 | - **Logical grouping**: Related transformations are organized together
33 | - **Deployment unit**: The entire module can be deployed as a single unit
34 |
41 | ### Test Library
42 |
43 | A **Test Library** is a project-level collection of reusable test definitions (generic and singular SQL tests) that can be shared across multiple OTS modules. Test libraries are defined separately from transformation modules and are referenced by modules that need to use them.
44 |
45 | Key characteristics of a Test Library:
46 | - **Project-level scope**: Test libraries are defined at the project/workspace level, separate from OTS modules
47 | - **Reusability**: Tests defined in a library can be referenced by any OTS module in the project
48 | - **Test types**: Contains both generic SQL tests (with placeholders) and singular SQL tests (table-specific)
49 | - **Optional**: Modules can define tests inline or reference a test library, or both
50 |
51 | #### OTS vs OTD vs OTS Module vs Test Library
52 |
53 | - **Open Transformation Specification (OTS)**: The standard that defines the structure and rules
54 | - **Open Transformation Definition (OTD)**: A specific transformation within a module
55 | - **Open Transformation Specification Module (OTS Module)**: A collection of related transformations targeting the same database and schema
56 | - **Test Library**: A project-level collection of reusable test definitions that can be shared across modules
57 |
58 | Think of it this way: OTS is like the blueprint, an OTS Module is the house (a complete set of transformations), each OTD is a room within that house (an individual transformation), and a Test Library is like a shared toolbox of quality checks that can be used across multiple houses.
59 |
60 | ## Components of an OTD
61 |
62 | An Open Transformation Definition consists of several key components that work together to define an executable transformation:
63 |
64 | 1. **Transformation Code**: The transformation logic (SQL, Python, PySpark, etc.) stored in a type-based structure
65 | 2. **Schema Definition**: The structure of the output data including column definitions, types, and validation rules
66 | 3. **Materialization Strategy**: How the output is stored and updated (table, view, incremental, SCD2)
67 | 4. **Tests**: Validation rules that ensure data quality at table level
68 | 5. **Metadata**: Additional information about the transformation (owner, tags, creation date, etc.)
69 |
70 | ### Transformation Code
71 |
72 | Transformations can be written in different languages (SQL, Python, PySpark, etc.). The transformation code is stored in a type-based structure that supports multiple transformation types while maintaining a consistent interface.
73 |
74 | #### SQL Transformations
75 |
76 | For SQL transformations, the code is stored with the following structure:
77 | - `original_sql`: The original SQL query as written (typically a SELECT statement). This preserves the original transformation code as authored.
78 | - `resolved_sql`: SQL with fully qualified table names (schema.table format). This is the preferred version for execution as it eliminates ambiguity in table references. Tools should use `resolved_sql` when executing transformations.
79 | - `source_tables`: List of input tables referenced in the query (required for dependency analysis)
80 |
81 | **When to use each:**
82 | - Use `original_sql` for: displaying the original code to users, version control, understanding the transformation logic
83 | - Use `resolved_sql` for: actual execution, dependency resolution, cross-database compatibility
84 |
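For example (table and column names are illustrative), the same query is carried in both forms:

```yaml
code:
  sql:
    original_sql: "SELECT id, amount FROM source.orders"
    resolved_sql: "SELECT id, amount FROM warehouse.source.orders"
    source_tables: ["source.orders"]
```
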
85 | #### Non-SQL Transformations
86 |
87 | Support for non-SQL transformation types (Python, PySpark, R, etc.) is planned for future versions of the specification. The current v0.1.0 specification focuses on SQL transformations.
88 |
89 | ### Schema Definition
90 |
91 | Schema defines the structure of the output data, including column names, data types, descriptions, partitioning, indexes, and other properties of the physical table. The schema is essential for understanding what the transformation produces without executing it. For example, it enables generating DDL statements for creating the output table.
92 |
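As an illustrative sketch, a tool could derive a `CREATE TABLE` statement from the schema alone; the type mapping shown in the comment is an assumption about one possible target dialect.

```yaml
schema:
  columns:
    - name: "id"
      datatype: "number"
      description: "Unique customer identifier"
    - name: "email"
      datatype: "string"
      description: "Customer email address"
# A tool might emit, for example:
#   CREATE TABLE analytics.customers (id NUMERIC, email TEXT);
```
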
93 | ### Materialization Strategy
94 | Materialization defines how the transformation output is stored and updated. Common types include:
95 | - **table**: Full table replacement on each run
96 | - **view**: Virtual table that queries underlying data
97 | - **incremental**: Partial updates using strategies like delete+insert or merge
98 | - **scd2**: Slowly Changing Dimension type 2 for tracking historical changes
99 |
100 | ### Tests
101 | Tests are validation rules that ensure data quality. They can be defined at two levels:
102 | - **Column-level tests**: Applied to individual columns (e.g., `not_null`, `unique`)
103 | - **Table-level tests**: Applied to the entire output (e.g., `row_count_gt_0`, `unique`)
104 |
105 | Tests enable automated data quality validation without manual inspection. OTS supports three types of tests:
106 | 1. **Standard Tests**: Built-in tests defined in the OTS specification (e.g., `not_null`, `unique`, `row_count_gt_0`)
107 | 2. **Generic SQL Tests**: Reusable SQL tests with placeholders that can be applied to multiple transformations
108 | 3. **Singular SQL Tests**: Table-specific SQL tests with hardcoded table references
109 |
110 | Generic SQL Tests and Singular SQL Tests can be defined in a project Test Library (see [Test Library](#test-library)) or inline within the current OTS Module.
111 |
112 | For detailed information about test types, definitions, and usage, see the [Data Quality Tests](#data-quality-tests) section.
113 |
114 | ### Metadata
115 | Metadata provides additional information about the transformation including:
116 | - **file_path**: Location of the source transformation file
117 | - **owner**: Person or team responsible for the transformation
118 | - **tags**: List of string tags for categorization and discovery (e.g., ["analytics", "fct", "production"])
119 | - **object_tags**: Dictionary of key-value pairs for database object tagging (e.g., {"sensitivity_tag": "pii", "classification": "public"})
120 |
121 | **Tag Types:**
122 | - **tags** (dbt-style): Simple string tags used for filtering, categorization, and discovery. These are typically used for model selection and organization.
123 | - **Module-level tags**: Tags defined at the module level apply to all transformations in the module. They can be inherited by transformations or merged with transformation-specific tags.
124 | - **Transformation-level tags**: Tags defined at the transformation level are specific to that transformation. They can be merged with module-level tags.
125 | - **object_tags** (database-style): Key-value pairs that are attached directly to database objects (tables, views) in databases that support object tagging (e.g., Snowflake). These are used for data governance, compliance, and metadata management. Unlike `tags`, `object_tags` are always transformation-specific and are not inherited from module level.
126 |
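A brief sketch of how the two tag kinds can appear together; the merged result shown in the comment illustrates one possible outcome, and the transformation name is hypothetical.

```yaml
tags: ["analytics"]                        # module-level tags

transformations:
  - transformation_id: "analytics.orders"  # hypothetical transformation
    metadata:
      tags: ["fct", "daily"]               # transformation-level tags; merged with the
                                           # module tags they could yield ["analytics", "fct", "daily"]
      object_tags:
        sensitivity_tag: "pii"             # attached to the database object; never inherited
```
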
127 | ## OTS Module Structure
128 |
129 | An OTS Module is a YAML or JSON document that can contain one or more transformations. Below is the complete structure:
130 |
131 | ### Complete OTS Module Structure
132 |
133 | ```yaml
134 | # OTS version
135 | ots_version: string # OTS specification version (e.g., "0.1.0") - indicates which version of the OTS standard this module follows
136 |
137 | # Module metadata
138 | module_name: string # Module name (e.g., "ecommerce_analytics")
139 | module_description: string # Optional: Description of the module
140 | version: string # Optional: Module/package version (e.g., "1.0.0") - version of this specific module, independent of OTS version
141 | tags: [string] # Optional: Module-level tags for categorization (e.g., ["analytics", "fct"]). These can be inherited or merged with transformation-level tags.
142 | test_library_path: string # Optional: Path to test library file (relative to module file or absolute path)
143 |
144 | # Optional: Inline test definitions (same structure as test library)
145 | generic_tests: # Optional: Module-specific generic SQL tests
146 | test_name:
147 | type: "sql"
148 | level: "table" | "column"
149 | description: string
150 | sql: string
151 | parameters: {}
152 | singular_tests: # Optional: Module-specific singular SQL tests
153 | test_name:
154 | type: "sql"
155 | level: "table" | "column"
156 | description: string
157 | sql: string
158 | target_transformation: string
159 |
160 | target:
161 | database: string # Target database name
162 | schema: string # Target schema name
163 | sql_dialect: string # Optional: SQL dialect (e.g., "postgres", "bigquery", "snowflake", "spark", etc.)
164 | connection_profile: string # Optional: connection profile reference
165 |
166 | # Transformations
167 | transformations: # Array of transformation definitions
168 | - transformation_id: string # Fully qualified identifier (e.g., "analytics.my_first_table")
169 | description: string # Optional: Description of what the transformation does
170 | transformation_type: string # Type of transformation: "sql" (default: "sql"). Non-SQL types (python, pyspark, r) are planned for future versions.
171 | sql_dialect: string # Optional: SQL dialect of the transformation code (for translation to target dialect)
172 |
173 | # Transformation code (type-based structure)
174 | code:
175 | # For SQL transformations (transformation_type: "sql")
176 | sql:
177 | original_sql: string # The original SQL query as written (typically a SELECT statement)
178 | resolved_sql: string # SQL with fully qualified table names (schema.table) - preferred for execution
179 | source_tables: [string] # List of input tables referenced (required for dependency analysis)
180 |
181 | # Note: Non-SQL transformation types (python, pyspark, r) are planned for future versions
182 |
183 | # Schema definition
184 | schema:
185 | columns: # Array of column definitions
186 | - name: string # Column name
187 | datatype: string # Data type ("number", "string", "date", etc.)
188 | description: string # Column description
189 | partitioning: [string] # Optional: Partition keys
190 | indexes: # Optional: Array of index definitions
191 | - name: string # Index name (optional, auto-generated if not provided)
192 | columns: [string] # Columns to index
193 |
194 | # Materialization strategy
195 | materialization:
196 | type: string # "table", "view", "incremental", "scd2"
197 | incremental_details: # Required if type is "incremental"
198 | strategy: string # "delete_insert", "append", "merge"
199 | delete_condition: string # SQL condition for delete (delete_insert only)
200 | filter_condition: string # SQL condition for filtering data
201 | merge_key: [string] # Primary key columns for matching records (merge only)
202 | update_columns: [string] # (Optional) List of columns to be updated in merge strategy
203 | scd2_details: # Optional if type is "scd2"
204 | start_column: string # Name of the start column (default: "valid_from")
205 | end_column: string # Name of the end column (default: "valid_to")
206 | unique_key: [string] # Array of columns that uniquely identify a record in SCD2 modeling (optional)
207 |
208 | # Tests: both column-level and table-level
209 | tests:
210 | columns: # Optional: Column-level tests
211 | column_name: # Tests for specific columns
212 | - string # Simple test name (e.g., "not_null", "unique")
213 | - object # Test with parameters: {name: string, params?: object, severity?: "error"|"warning"}
214 | table: # Optional: Table-level tests
215 | - string # Simple test name (e.g., "row_count_gt_0")
216 | - object # Test with parameters: {name: string, params?: object, severity?: "error"|"warning"}
217 |
218 | # Metadata
219 | metadata:
220 | file_path: string # Path to the source transformation file
221 | owner: string # Optional: Person or team responsible
222 | tags: [string] # Optional: List of string tags for categorization and discovery (e.g., ["analytics", "fct"])
223 | object_tags: dict # Optional: Dictionary of key-value pairs for database object tagging (e.g., {"sensitivity_tag": "pii", "classification": "public"})
224 | ```
225 |
226 | ## Simple Table Transformation
227 |
228 |
229 | ### JSON Format
230 |
231 | ```json
232 | {
233 | "ots_version": "0.1.0",
234 | "module_name": "analytics_customers",
235 | "module_description": "Customer analytics transformations",
236 | "tags": ["analytics", "production"],
237 | "test_library_path": "../tests/test_library.yaml",
238 | "target": {
239 | "database": "warehouse",
240 | "schema": "analytics",
241 | "sql_dialect": "postgres"
242 | },
243 | "transformations": [
244 | {
245 | "transformation_id": "analytics.customers",
246 | "description": "Customer data table",
247 | "transformation_type": "sql",
248 | "code": {
249 | "sql": {
250 | "original_sql": "SELECT id, name, email, created_at FROM source.customers WHERE active = true",
251 | "resolved_sql": "SELECT id, name, email, created_at FROM warehouse.source.customers WHERE active = true",
252 | "source_tables": ["source.customers"]
253 | }
254 | },
255 | "schema": {
256 | "columns": [
257 | {
258 | "name": "id",
259 | "datatype": "number",
260 | "description": "Unique customer identifier"
261 | },
262 | {
263 | "name": "name",
264 | "datatype": "string",
265 | "description": "Customer name"
266 | },
267 | {
268 | "name": "email",
269 | "datatype": "string",
270 | "description": "Customer email address"
271 | },
272 | {
273 | "name": "created_at",
274 | "datatype": "date",
275 | "description": "Customer creation date"
276 | }
277 | ],
278 | "partitioning": [],
279 | "indexes": [
280 | {
281 | "name": "idx_customers_id",
282 | "columns": ["id"]
283 | },
284 | {
285 | "name": "idx_customers_email",
286 | "columns": ["email"]
287 | }
288 | ]
289 | },
290 | "materialization": {
291 | "type": "table"
292 | },
293 | "tests": {
294 | "columns": {
295 | "id": ["not_null", "unique"],
296 | "email": ["not_null", "unique"],
297 | "created_at": ["not_null"]
298 | },
299 | "table": ["row_count_gt_0"]
300 | },
301 | "metadata": {
302 | "file_path": "/models/analytics/customers.sql",
303 | "owner": "data-team",
304 | "tags": ["customer", "core"],
305 | "object_tags": {
306 | "sensitivity_tag": "pii",
307 | "classification": "internal"
308 | }
309 | }
310 | }
311 | ]
312 | }
313 | ```
314 |
315 |
316 |
317 |
318 | ### YAML Format
319 |
320 | ```yaml
321 | ots_version: "0.1.0"
322 | module_name: "analytics_customers"
323 | module_description: "Customer analytics transformations"
324 | tags: ["analytics", "production"]
325 |
326 | target:
327 | database: "warehouse"
328 | schema: "analytics"
329 | sql_dialect: "postgres"
330 |
331 | transformations:
332 | - transformation_id: "analytics.customers"
333 | description: "Customer data table"
334 | transformation_type: "sql"
335 |
336 | code:
337 | sql:
338 | original_sql: "SELECT id, name, email, created_at FROM source.customers WHERE active = true"
339 | resolved_sql: "SELECT id, name, email, created_at FROM warehouse.source.customers WHERE active = true"
340 | source_tables: ["source.customers"]
341 |
342 | schema:
343 | columns:
344 | - name: "id"
345 | datatype: "number"
346 | description: "Unique customer identifier"
347 | - name: "name"
348 | datatype: "string"
349 | description: "Customer name"
350 | - name: "email"
351 | datatype: "string"
352 | description: "Customer email address"
353 | - name: "created_at"
354 | datatype: "date"
355 | description: "Customer creation date"
356 | partitioning: []
357 | indexes:
358 | - name: "idx_customers_id"
359 | columns: ["id"]
360 | - name: "idx_customers_email"
361 | columns: ["email"]
362 |
363 | materialization:
364 | type: "table"
365 |
366 | tests:
367 | columns:
368 | id: ["not_null", "unique"]
369 | email: ["not_null", "unique"]
370 | created_at: ["not_null"]
371 | table: ["row_count_gt_0"]
372 |
373 | metadata:
374 | file_path: "/models/analytics/customers.sql"
375 | owner: "data-team"
376 | tags: ["customer", "core"]
377 | object_tags:
378 | sensitivity_tag: "pii"
379 | classification: "internal"
380 | ```
381 |
382 |
383 |
384 | ## Materialization Types
385 |
386 | ### Incremental Materialization
387 |
388 | Incremental materialization updates only changed data using one of three strategies:
389 | - **delete_insert**: Deletes rows matching a condition and inserts new data
390 | - **append**: Simply appends new data without removing existing rows
391 | - **merge**: Performs an upsert operation using a merge statement
392 |
393 | #### Delete-Insert Strategy
394 |
395 | ```yaml
396 | materialization:
397 | type: "incremental"
398 | incremental_details:
399 | strategy: "delete_insert"
400 | delete_condition: "to_date(updated_ts) = '@start_date'"
401 | filter_condition: "to_date(updated_ts) = '@start_date'"
402 | ```
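
With a configuration like the one above, an OTS-compliant tool might execute a delete followed by an insert roughly as sketched below. This is illustrative only: the table and column names, the resolved value of `@start_date`, and the exact DML depend on the transformation, the tool, and the target SQL dialect.

```sql
-- Illustrative sketch of the delete_insert strategy; names and dates are hypothetical.
DELETE FROM analytics.recent_orders
WHERE to_date(updated_ts) = '2024-01-01';           -- delete_condition with @start_date resolved

INSERT INTO analytics.recent_orders
SELECT order_id, customer_id, amount, updated_ts    -- the transformation's resolved_sql ...
FROM source.orders
WHERE to_date(updated_ts) = '2024-01-01';           -- ... restricted by filter_condition
```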
403 |
404 | #### Append Strategy
405 |
406 | ```yaml
407 | materialization:
408 | type: "incremental"
409 | incremental_details:
410 | strategy: "append"
411 | filter_condition: "created_at >= '@start_date'"
412 | ```
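
For the append strategy, the tool only needs to insert the filtered slice of new data; a hypothetical sketch (names and the resolved `@start_date` value are illustrative):

```sql
-- Illustrative sketch of the append strategy; no existing rows are removed.
INSERT INTO logs.event_stream
SELECT event_id, user_id, event_type, created_at    -- the transformation's resolved_sql ...
FROM source.events
WHERE created_at >= '2024-01-01';                   -- ... restricted by filter_condition
```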
413 |
414 | #### Merge Strategy
415 |
416 | ```yaml
417 | materialization:
418 | type: "incremental"
419 | incremental_details:
420 | strategy: "merge"
421 |     merge_key: ["customer_id"] # Primary key or unique identifier for matching
422 | filter_condition: "updated_at >= '@start_date'"
423 | update_columns: ["name", "email"] # Optional: specific columns to update
424 | ```
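
A tool might translate this configuration into a `MERGE` statement along the following lines (a sketch with illustrative names; dialects without native `MERGE` support may emulate it with separate update and insert statements):

```sql
-- Illustrative sketch of the merge strategy; names and the resolved @start_date are hypothetical.
MERGE INTO analytics.customers AS tgt
USING (
    SELECT customer_id, name, email, updated_at     -- the transformation's resolved_sql ...
    FROM source.customers
    WHERE updated_at >= '2024-01-01'                 -- ... restricted by filter_condition
) AS src
ON tgt.customer_id = src.customer_id                 -- merge_key
WHEN MATCHED THEN
    UPDATE SET name = src.name, email = src.email    -- update_columns
WHEN NOT MATCHED THEN
    INSERT (customer_id, name, email, updated_at)
    VALUES (src.customer_id, src.name, src.email, src.updated_at);
```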
425 |
426 | ### SCD2 Materialization
427 |
428 | SCD2 (Slowly Changing Dimension Type 2) materialization tracks historical changes with valid date ranges. It requires a unique key to identify each record.
429 |
430 | ```yaml
431 | materialization:
432 | type: "scd2"
433 | scd2_details:
434 | unique_key: ["product_id"] # Primary key or unique identifier
435 | start_column: "valid_from" # Optional, defaults to "valid_from"
436 | end_column: "valid_to" # Optional, defaults to "valid_to"
437 | ```
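
Conceptually, an SCD2 run closes the currently valid version of each changed record and inserts a new version. The sketch below shows one common pattern, assuming `valid_to IS NULL` marks the current version; tools may instead use a far-future end date, hash-based change detection, or a native `MERGE`, and all names are illustrative.

```sql
-- Illustrative SCD2 sketch: close changed versions, then insert new ones.
-- 1. Close the currently valid row for products whose tracked attributes changed.
UPDATE dim.products_scd2 AS tgt
SET valid_to = CURRENT_DATE
FROM source.products AS src
WHERE tgt.product_id = src.product_id                -- unique_key
  AND tgt.valid_to IS NULL                           -- currently valid version
  AND (tgt.price <> src.price OR tgt.category <> src.category);

-- 2. Insert a new version for products with no open row (new, or just closed above).
INSERT INTO dim.products_scd2 (product_id, price, category, valid_from, valid_to)
SELECT src.product_id, src.price, src.category, CURRENT_DATE, NULL
FROM source.products AS src
LEFT JOIN dim.products_scd2 AS tgt
       ON tgt.product_id = src.product_id AND tgt.valid_to IS NULL
WHERE tgt.product_id IS NULL;
```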
438 |
439 | ### Schema Column Definition
440 |
441 | A schema column in an OTD defines the structure and properties of a single column in the output table:
442 |
443 | ```yaml
444 | columns:
445 | - name: string # Column name
446 | datatype: string # Data type ("number", "string", "date", etc.)
447 | description: string # Column description
448 | ```
449 |
450 | **Common Data Types:**
451 | - `number`: Numeric values
452 | - `string`: Text values
453 | - `date`: Date and timestamp values
454 | - `boolean`: True/false values
455 | - `array`: Array of values
456 | - `object`: Complex nested objects
457 |
458 | ## Data Quality Tests
459 |
460 | Data quality tests are validation rules that ensure the correctness and quality of transformation outputs. Tests can be defined at two levels:
461 | - **Column-level tests**: Applied to individual columns (e.g., `not_null`, `unique`)
462 | - **Table-level tests**: Applied to the entire output (e.g., `row_count_gt_0`, `unique`)
463 |
464 | Tests enable automated data quality validation without manual inspection. OTS supports three types of tests:
465 |
466 | 1. **Standard Tests**: Built-in tests defined in the OTS specification (e.g., `not_null`, `unique`, `row_count_gt_0`)
467 | 2. **Generic SQL Tests**: Reusable SQL tests with placeholders that can be applied to multiple transformations
468 | 3. **Singular SQL Tests**: Table-specific SQL tests with hardcoded table references
469 |
470 | ### Standard Tests
471 |
472 | Standard tests are built into the OTS specification and must be implemented by all OTS-compliant tools. These tests provide common data quality checks that are widely applicable across different transformations.
473 |
474 | #### Column-Level Standard Tests
475 |
476 | **`not_null`**
477 | - **Description**: Ensures a column contains no NULL values
478 | - **Level**: Column
479 | - **Parameters**: None
480 | - **Implementation**: Returns rows where the column is NULL (test fails if any rows returned)
481 | - **Example**:
482 | ```yaml
483 | tests:
484 | columns:
485 | id: ["not_null"]
486 | ```
487 |
488 | **`unique`**
489 | - **Description**: Ensures column values are unique across all rows
490 | - **Level**: Column or Table
491 | - **Parameters**:
492 | - `columns` (array, optional): For table-level tests, specifies which columns to check for uniqueness. If omitted at table level, checks all columns (entire row uniqueness)
493 | - **Implementation**: Returns duplicate values (test fails if any duplicates found)
494 | - **Examples**:
495 | ```yaml
496 | tests:
497 | columns:
498 | # Column-level: single column uniqueness
499 | id: ["not_null", "unique"]
500 |
501 | table:
502 | # Table-level: composite uniqueness on specific columns
503 | - name: "unique"
504 | params:
505 | columns: ["customer_id", "order_date"]
506 |
507 | # Table-level: entire row uniqueness (all columns)
508 | - "unique"
509 | ```
510 |
511 | **`accepted_values`**
512 | - **Description**: Ensures column values are within a specified list of acceptable values
513 | - **Level**: Column
514 | - **Parameters**:
515 | - `values` (array, required): List of acceptable values
516 | - **Implementation**: Returns rows where column value is not in the accepted list
517 | - **Example**:
518 | ```yaml
519 | tests:
520 | columns:
521 | status:
522 | - name: "accepted_values"
523 | params:
524 | values: ["active", "inactive", "pending"]
525 | ```
526 |
527 | **`relationships`**
528 | - **Description**: Ensures referential integrity between tables (foreign key validation)
529 | - **Level**: Column
530 | - **Parameters**:
531 | - `to` (string, required): Target transformation ID (e.g., "analytics.customers")
532 | - `field` (string, required): Column name in the target transformation
533 | - **Implementation**: Returns rows where the column value doesn't exist in the target table's specified field
534 | - **Example**:
535 | ```yaml
536 | tests:
537 | columns:
538 | customer_id:
539 | - name: "relationships"
540 | params:
541 | to: "analytics.customers"
542 | field: "id"
543 | ```
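
To give an idea of the generated SQL, a tool might implement this check as an anti-join similar to the sketch below (the transformation under test, here `analytics.orders`, is an illustrative name; the exact query is implementation-specific):

```sql
-- Hypothetical query for the relationships test above: any returned row is a violation.
SELECT child.customer_id
FROM analytics.orders AS child                -- transformation under test (illustrative)
LEFT JOIN analytics.customers AS parent       -- params.to
       ON child.customer_id = parent.id       -- params.field
WHERE child.customer_id IS NOT NULL
  AND parent.id IS NULL;
```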
544 |
545 | #### Table-Level Standard Tests
546 |
547 | **`row_count_gt_0`**
548 | - **Description**: Ensures the table has at least one row
549 | - **Level**: Table
550 | - **Parameters**: None
551 | - **Implementation**: Returns a count result (test fails if count = 0)
552 | - **Example**:
553 | ```yaml
554 | tests:
555 | table:
556 | - "row_count_gt_0"
557 | ```
558 |
559 | ### Test Libraries
560 |
561 | Test libraries are project-level collections of custom test definitions (generic and singular SQL tests) that can be shared across multiple OTS modules. For a detailed introduction to Test Libraries, see the [Test Library](#test-library) section in Core Concepts.
562 |
563 | #### Test Library Structure
564 |
565 | A test library is a YAML or JSON file that defines reusable test definitions. The file can be named anything (e.g., `test_library.yaml`, `tests.yaml`, `data_quality_tests.json`), but must follow the structure below.
566 |
567 | **Test Library File Structure:**
568 | ```yaml
569 | # test_library.yaml
570 | test_library_version: string # Optional: Version identifier for the test library (e.g., "1.0", "2.1")
571 | description: string # Optional: Human-readable description of the test library
572 |
573 | generic_tests:
574 | check_minimum_rows:
575 | type: "sql"
576 | level: "table"
577 | description: "Ensures table has minimum number of rows"
578 | sql: |
579 | SELECT 1 as violation
580 | FROM @table_name
581 | GROUP BY 1
582 | HAVING COUNT(*) < @min_rows:10
583 | parameters:
584 | min_rows:
585 | type: "number"
586 | default: 10
587 | description: "Minimum number of rows required"
588 |
589 | column_not_negative:
590 | type: "sql"
591 | level: "column"
592 | description: "Ensures numeric column has no negative values"
593 | sql: |
594 | SELECT @column_name
595 | FROM @table_name
596 | WHERE @column_name < 0
597 | parameters: []
598 |
599 | singular_tests:
600 | test_customers_email_format:
601 | type: "sql"
602 | level: "table"
603 | description: "Validates email format for customers table"
604 | sql: |
605 | SELECT id, email
606 | FROM analytics.customers
607 | WHERE email NOT LIKE '%@%.%'
608 | target_transformation: "analytics.customers"
609 | ```
610 |
611 | #### Generic SQL Tests
612 |
613 | Generic SQL tests are reusable tests that use placeholders (variables) to make them applicable to multiple transformations. They follow the dbt pattern where:
614 | - The query returns rows when the test fails
615 | - 0 rows returned = test passes
616 | - 1+ rows returned = test fails
617 |
618 | **Placeholders:**
619 | - `@table_name` or `{{ table_name }}`: Replaced with the fully qualified transformation ID. The `@` syntax is recommended for cleaner SQL.
620 | - `@column_name` or `{{ column_name }}`: Replaced with the column name (for column-level tests). The `@` syntax is recommended.
621 | - Custom parameters: Available as `@parameter_name` or `{{ parameter_name }}` with optional defaults using `@param:default` syntax (e.g., `@min_rows:10`)
622 |
623 | **Structure:**
624 | ```yaml
625 | generic_tests:
626 | test_name: # Required: Unique test name (used for referencing)
627 | type: "sql" # Required: Always "sql" for SQL tests
628 | level: "table" | "column" # Required: Test level
629 | description: string # Optional: Human-readable description
630 | sql: string # Required: SQL query (returns rows on failure)
631 | parameters: # Optional: Parameter definitions
632 | param_name:
633 | type: "number" | "string" | "boolean" | "array" # Required: Parameter type
634 | default: value # Optional: Default value
635 | description: string # Optional: Parameter description
636 | ```
637 |
638 | **Example Generic Test:**
639 | ```yaml
640 | check_minimum_rows:
641 | type: "sql"
642 | level: "table"
643 | description: "Ensures table has minimum number of rows"
644 | sql: |
645 | SELECT 1 as violation
646 | FROM @table_name
647 | GROUP BY 1
648 | HAVING COUNT(*) < @min_rows:10
649 | parameters:
650 | min_rows:
651 | type: "number"
652 | default: 10
653 | description: "Minimum number of rows required"
654 | ```
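
When this test is applied to a transformation, say `analytics.customers`, with `params: {min_rows: 100}` (both illustrative), a tool would render the placeholders roughly as follows:

```sql
-- Hypothetical rendering of check_minimum_rows after placeholder substitution.
SELECT 1 as violation
FROM analytics.customers        -- @table_name
GROUP BY 1
HAVING COUNT(*) < 100;          -- @min_rows:10, overridden by the supplied parameter
```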
655 |
656 | #### Singular SQL Tests
657 |
658 | Singular SQL tests are table-specific tests with hardcoded table references. They are useful for:
659 | - Complex business logic specific to one transformation
660 | - Tests that reference multiple tables
661 | - Table-specific validation rules
662 |
663 | **Structure:**
664 | ```yaml
665 | singular_tests:
666 | test_name: # Required: Unique test name (used for referencing)
667 | type: "sql" # Required: Always "sql" for SQL tests
668 | level: "table" | "column" # Required: Test level
669 | description: string # Optional: Human-readable description
670 | sql: string # Required: SQL query with hardcoded table names
671 | target_transformation: string # Required: Transformation ID this test applies to (used for validation and discovery)
672 | ```
673 |
674 | **Example Singular Test:**
675 | ```yaml
676 | test_customers_email_format:
677 | type: "sql"
678 | level: "table"
679 | description: "Validates email format for customers table"
680 | sql: |
681 | SELECT id, email
682 | FROM analytics.customers
683 | WHERE email NOT LIKE '%@%.%'
684 | target_transformation: "analytics.customers"
685 | ```
686 |
687 | ### Referencing Tests in Transformations
688 |
689 | Transformations reference tests from:
690 | 1. **Standard tests**: Referenced by name (e.g., `"not_null"`, `"unique"`)
691 | 2. **Test library tests**: Referenced by name from the test library (e.g., `"check_minimum_rows"`)
692 |
693 | **Module Structure with Test Library Reference:**
694 |
695 | ```yaml
696 | ots_version: "0.1.0"
697 | module_name: "analytics_customers"
698 | test_library_path: "../tests/test_library.yaml" # Optional: Path to test library
699 |
700 | target:
701 | database: "warehouse"
702 | schema: "analytics"
703 |
704 | transformations:
705 | - transformation_id: "analytics.customers"
706 | tests:
707 | columns:
708 | id:
709 | - "not_null" # Standard test
710 | - "unique" # Standard test (column-level)
711 | email:
712 | - "not_null"
713 | - name: "accepted_values" # Standard test with params
714 | params:
715 | values: ["gmail.com", "yahoo.com"]
716 | amount:
717 | - name: "column_not_negative" # Generic test from library
718 | table:
719 | - "row_count_gt_0" # Standard test
720 | - "unique" # Standard test (table-level, checks all columns)
721 | - name: "unique" # Standard test (table-level, composite on specific columns)
722 | params:
723 | columns: ["customer_id", "order_date"]
724 | - name: "check_minimum_rows" # Generic test with params
725 | params:
726 | min_rows: 100
727 | - "test_customers_email_format" # Singular test from library
728 | ```
729 |
730 | **Test Reference Formats:**
731 |
732 | 1. **Simple string** (standard test, no parameters):
733 | ```yaml
734 | tests:
735 | columns:
736 | id: ["not_null", "unique"]
737 | table:
738 | - "row_count_gt_0"
739 | ```
740 |
741 | 2. **Object with name** (standard test with parameters):
742 | ```yaml
743 | tests:
744 | columns:
745 | status:
746 | - name: "accepted_values"
747 | params:
748 | values: ["active", "inactive"]
749 | ```
750 |
751 | 3. **Object with name** (generic/singular test from library):
752 | ```yaml
753 | tests:
754 | table:
755 | - name: "check_minimum_rows"
756 | params:
757 | min_rows: 100
758 | ```
759 |
760 | ### Test Execution Model
761 |
762 | Tests follow the dbt execution model:
763 | - **0 rows returned** = test passes
764 | - **1+ rows returned** = test fails
765 |
766 | For standard tests, tools generate SQL queries that return violating rows. For SQL tests (generic and singular), the SQL query itself returns rows when violations are found.
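
For example, for the standard `not_null` and `unique` column tests on an `id` column, a tool might generate queries along the lines of the sketch below (illustrative; the exact SQL is implementation-specific):

```sql
-- Hypothetical queries for standard column tests on analytics.customers.id.
-- not_null: any returned row is a violation.
SELECT id
FROM analytics.customers
WHERE id IS NULL;

-- unique: any returned row is a duplicated value.
SELECT id, COUNT(*) AS occurrences
FROM analytics.customers
GROUP BY id
HAVING COUNT(*) > 1;
```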
767 |
768 | **Test Severity:**
769 | - Tests can have a `severity` level: `"error"` (default) or `"warning"`
770 | - `error`: Test failure stops execution and fails the build
771 | - `warning`: Test failure is logged but doesn't stop execution
772 |
773 | **Severity in Test References:**
774 | ```yaml
775 | tests:
776 | columns:
777 | id:
778 | - name: "not_null"
779 | severity: "error" # Default, can be omitted
780 | - name: "unique"
781 | severity: "warning" # Non-blocking
782 | table:
783 | - name: "row_count_gt_0"
784 | severity: "error" # Default, can be omitted
785 | ```
786 |
787 | ### Inline Test Definitions in OTS Modules
788 |
789 | Generic and singular SQL tests can also be defined directly within an OTS Module, using the same structure as test libraries. This is useful for module-specific tests that don't need to be shared across modules.
790 |
791 | **Module Structure with Inline Tests:**
792 | ```yaml
793 | ots_version: "0.1.0"
794 | module_name: "analytics_customers"
795 |
796 | # Optional: Inline test definitions (same structure as test library)
797 | generic_tests:
798 | check_recent_data:
799 | type: "sql"
800 | level: "table"
801 |       description: "Ensures all rows have been updated within the configured number of days"
802 | sql: |
803 | SELECT 1 as violation
804 | FROM @table_name
805 | WHERE updated_at < CURRENT_DATE - INTERVAL '@days:7' DAY
806 | parameters:
807 | days:
808 | type: "number"
809 | default: 7
810 |
811 | singular_tests:
812 | test_customers_specific:
813 | type: "sql"
814 | level: "table"
815 | description: "Module-specific test"
816 | sql: |
817 | SELECT id FROM analytics.customers WHERE status = 'invalid'
818 | target_transformation: "analytics.customers"
819 |
820 | target:
821 | database: "warehouse"
822 | schema: "analytics"
823 |
824 | transformations:
825 | - transformation_id: "analytics.customers"
826 | tests:
827 | table:
828 | - name: "check_recent_data" # References inline generic test
829 | params:
830 | days: 3
831 | - "test_customers_specific" # References inline singular test
832 | ```
833 |
834 | **Test Resolution Priority:**
835 | When resolving test names, tools should check in the following order:
836 | 1. **Standard tests** (built into OTS specification)
837 | 2. **Inline tests** (defined in the current OTS Module)
838 | 3. **Test library tests** (from referenced test library)
839 |
840 | If a test name exists in multiple locations, the first match takes precedence. This allows modules to override test library tests with module-specific implementations.
841 |
842 | ### Test Library Resolution
843 |
844 | When a transformation module references a test library:
845 | 1. The tool resolves the `test_library_path` (relative to the module file or absolute path)
846 | 2. Loads the test library file (YAML or JSON format)
847 | 3. Validates test definitions
848 | 4. Makes tests available for reference in transformations (after inline tests)
849 |
850 | **Test Discovery:**
851 | - **Standard tests**: Always available, no discovery needed
852 | - **Generic tests**: Discovered from test library or inline module definitions
853 | - **Singular tests**: Discovered from test library or inline module definitions. The `target_transformation` field helps tools validate that the test is applied to the correct transformation.
854 |
855 | If a referenced test is not found among the standard tests, the inline tests, or the test library, the tool must raise an error.
856 |
857 | ## Complete Examples: Incremental Strategies
858 |
859 | ### Delete-Insert Example
860 |
861 |
862 | **YAML Format**
863 |
864 | ```yaml
865 | ots_version: "0.1.0"
866 | transformation_id: "analytics.recent_orders"
867 | description: "Orders updated in the last 7 days"
868 |
869 | transformation_type: "sql"
870 | code:
871 | sql:
872 | original_sql: "SELECT order_id, customer_id, order_date, amount, status FROM source.orders WHERE updated_at >= '@start_date'"
873 | resolved_sql: "SELECT order_id, customer_id, order_date, amount, status FROM warehouse.source.orders WHERE updated_at >= '@start_date'"
874 | source_tables: ["source.orders"]
875 |
876 | schema:
877 | columns:
878 | - name: "order_id"
879 | datatype: "number"
880 | description: "Unique order identifier"
881 | - name: "customer_id"
882 | datatype: "number"
883 | description: "Customer ID"
884 | - name: "order_date"
885 | datatype: "date"
886 | description: "Order date"
887 | - name: "amount"
888 | datatype: "number"
889 | description: "Order amount"
890 | - name: "status"
891 | datatype: "string"
892 | description: "Order status"
893 | partitioning: ["order_date"]
894 | indexes:
895 | - name: "idx_order_id"
896 | columns: ["order_id"]
897 |
898 | materialization:
899 | type: "incremental"
900 | incremental_details:
901 | strategy: "delete_insert"
902 | delete_condition: "to_date(updated_at) = '@start_date'"
903 | filter_condition: "to_date(updated_at) = '@start_date'"
904 |
905 | tests:
906 | columns:
907 | order_id: ["not_null", "unique"]
908 | order_date: ["not_null"]
909 | table: ["row_count_gt_0"]
910 |
911 | metadata:
912 | file_path: "/models/analytics/recent_orders.sql"
913 | owner: "analytics-team"
914 | tags: ["orders", "incremental"]
915 | ```
916 |
917 |
918 |
919 |
920 | **JSON Format**
921 |
922 | ```json
923 | {
924 | "ots_version": "0.1.0",
925 | "transformation_id": "analytics.recent_orders",
926 | "description": "Orders updated in the last 7 days",
927 |
928 | "transformation_type": "sql",
929 | "code": {
930 | "sql": {
931 | "original_sql": "SELECT order_id, customer_id, order_date, amount, status FROM source.orders WHERE updated_at >= '@start_date'",
932 | "resolved_sql": "SELECT order_id, customer_id, order_date, amount, status FROM warehouse.source.orders WHERE updated_at >= '@start_date'",
933 | "source_tables": ["source.orders"]
934 | }
935 | },
936 |
937 | "schema": {
938 | "columns": [
939 | {
940 | "name": "order_id",
941 | "datatype": "number",
942 | "description": "Unique order identifier"
943 | },
944 | {
945 | "name": "customer_id",
946 | "datatype": "number",
947 | "description": "Customer ID"
948 | },
949 | {
950 | "name": "order_date",
951 | "datatype": "date",
952 | "description": "Order date"
953 | },
954 | {
955 | "name": "amount",
956 | "datatype": "number",
957 | "description": "Order amount"
958 | },
959 | {
960 | "name": "status",
961 | "datatype": "string",
962 | "description": "Order status"
963 | }
964 | ],
965 | "partitioning": ["order_date"],
966 | "indexes": [
967 | {
968 | "name": "idx_order_id",
969 | "columns": ["order_id"]
970 | }
971 | ]
972 | },
973 |
974 | "materialization": {
975 | "type": "incremental",
976 | "incremental_details": {
977 | "strategy": "delete_insert",
978 | "delete_condition": "to_date(updated_at) = '@start_date'",
979 | "filter_condition": "to_date(updated_at) = '@start_date'"
980 | }
981 | },
982 |
983 | "tests": {
984 | "columns": {
985 | "order_id": ["not_null", "unique"],
986 | "order_date": ["not_null"]
987 | },
988 | "table": ["row_count_gt_0"]
989 | },
990 |
991 | "metadata": {
992 | "file_path": "/models/analytics/recent_orders.sql",
993 | "owner": "analytics-team",
994 | "tags": ["orders", "incremental"]
995 | }
996 | }
997 | ```
998 |
999 |
1000 |
1001 | ### Append Example
1002 |
1003 |
1004 | **YAML Format**
1005 |
1006 | ```yaml
1007 | ots_version: "0.1.0"
1008 | transformation_id: "logs.event_stream"
1009 | description: "Append-only event log"
1010 |
1011 | transformation_type: "sql"
1012 | code:
1013 | sql:
1014 | original_sql: "SELECT event_id, timestamp, user_id, event_type, payload FROM source.events WHERE timestamp >= '@start_date'"
1015 | resolved_sql: "SELECT event_id, timestamp, user_id, event_type, payload FROM warehouse.source.events WHERE timestamp >= '@start_date'"
1016 | source_tables: ["source.events"]
1017 |
1018 | schema:
1019 | columns:
1020 | - name: "event_id"
1021 | datatype: "string"
1022 | description: "Unique event identifier"
1023 | - name: "timestamp"
1024 | datatype: "date"
1025 | description: "Event timestamp"
1026 | - name: "user_id"
1027 | datatype: "string"
1028 | description: "User who triggered the event"
1029 | - name: "event_type"
1030 | datatype: "string"
1031 | description: "Type of event"
1032 | - name: "payload"
1033 | datatype: "object"
1034 | description: "Event payload data"
1035 | partitioning: ["timestamp"]
1036 | indexes:
1037 | - name: "idx_timestamp"
1038 | columns: ["timestamp"]
1039 | - name: "idx_user_id"
1040 | columns: ["user_id"]
1041 |
1042 | materialization:
1043 | type: "incremental"
1044 | incremental_details:
1045 | strategy: "append"
1046 | filter_condition: "timestamp >= '@start_date'"
1047 |
1048 | tests:
1049 | columns:
1050 | event_id: ["not_null", "unique"]
1051 | timestamp: ["not_null"]
1052 | table: ["row_count_gt_0"]
1053 |
1054 | metadata:
1055 | file_path: "/models/logs/event_stream.sql"
1056 | owner: "data-engineering"
1057 | tags: ["events", "append-only"]
1058 | ```
1059 |
1060 |
1061 |
1062 |
1063 | **JSON Format**
1064 |
1065 | ```json
1066 | {
1067 | "ots_version": "0.1.0",
1068 | "transformation_id": "logs.event_stream",
1069 | "description": "Append-only event log",
1070 |
1071 | "transformation_type": "sql",
1072 | "code": {
1073 | "sql": {
1074 | "original_sql": "SELECT event_id, timestamp, user_id, event_type, payload FROM source.events WHERE timestamp >= '@start_date'",
1075 | "resolved_sql": "SELECT event_id, timestamp, user_id, event_type, payload FROM warehouse.source.events WHERE timestamp >= '@start_date'",
1076 | "source_tables": ["source.events"]
1077 | }
1078 | },
1079 |
1080 | "schema": {
1081 | "columns": [
1082 | {
1083 | "name": "event_id",
1084 | "datatype": "string",
1085 | "description": "Unique event identifier"
1086 | },
1087 | {
1088 | "name": "timestamp",
1089 | "datatype": "date",
1090 | "description": "Event timestamp"
1091 | },
1092 | {
1093 | "name": "user_id",
1094 | "datatype": "string",
1095 | "description": "User who triggered the event"
1096 | },
1097 | {
1098 | "name": "event_type",
1099 | "datatype": "string",
1100 | "description": "Type of event"
1101 | },
1102 | {
1103 | "name": "payload",
1104 | "datatype": "object",
1105 | "description": "Event payload data"
1106 | }
1107 | ],
1108 | "partitioning": ["timestamp"],
1109 | "indexes": [
1110 | {
1111 | "name": "idx_timestamp",
1112 | "columns": ["timestamp"]
1113 | },
1114 | {
1115 | "name": "idx_user_id",
1116 | "columns": ["user_id"]
1117 | }
1118 | ]
1119 | },
1120 |
1121 | "materialization": {
1122 | "type": "incremental",
1123 | "incremental_details": {
1124 | "strategy": "append",
1125 | "filter_condition": "timestamp >= '@start_date'"
1126 | }
1127 | },
1128 |
1129 | "tests": {
1130 | "columns": {
1131 | "event_id": ["not_null", "unique"],
1132 | "timestamp": ["not_null"]
1133 | },
1134 | "table": ["row_count_gt_0"]
1135 | },
1136 |
1137 | "metadata": {
1138 | "file_path": "/models/logs/event_stream.sql",
1139 | "owner": "data-engineering",
1140 | "tags": ["events", "append-only"]
1141 | }
1142 | }
1143 | ```
1144 |
1145 |
1146 |
1147 | ### Merge Example
1148 |
1149 |
1150 | **YAML Format**
1151 |
1152 | ```yaml
1153 | ots_version: "0.1.0"
1154 | transformation_id: "product.master_data"
1155 | description: "Customer master data with upsert logic"
1156 |
1157 | transformation_type: "sql"
1158 | code:
1159 | sql:
1160 | original_sql: "SELECT customer_id, name, email, phone, updated_at FROM source.customers WHERE updated_at >= '@start_date'"
1161 | resolved_sql: "SELECT customer_id, name, email, phone, updated_at FROM warehouse.source.customers WHERE updated_at >= '@start_date'"
1162 | source_tables: ["source.customers"]
1163 |
1164 | schema:
1165 | columns:
1166 | - name: "customer_id"
1167 | datatype: "number"
1168 | description: "Unique customer identifier"
1169 | - name: "name"
1170 | datatype: "string"
1171 | description: "Customer name"
1172 | - name: "email"
1173 | datatype: "string"
1174 | description: "Customer email"
1175 | - name: "phone"
1176 | datatype: "string"
1177 | description: "Customer phone number"
1178 | - name: "updated_at"
1179 | datatype: "date"
1180 | description: "Last update timestamp"
1181 | partitioning: []
1182 | indexes:
1183 | - name: "idx_customer_id"
1184 | columns: ["customer_id"]
1185 | - name: "idx_email"
1186 | columns: ["email"]
1187 |
1188 | materialization:
1189 | type: "incremental"
1190 | incremental_details:
1191 | strategy: "merge"
1192 | filter_condition: "updated_at >= '@start_date'"
1193 | merge_key: ["customer_id"]
1194 | update_columns: ["name", "email", "phone", "updated_at"]
1195 |
1196 | tests:
1197 | columns:
1198 | customer_id: ["not_null", "unique"]
1199 | email: ["not_null"]
1200 | table: ["row_count_gt_0", "unique"] # unique at table level checks all columns for row uniqueness
1201 |
1202 | metadata:
1203 | file_path: "/models/product/master_data.sql"
1204 | owner: "product-team"
1205 | tags: ["customers", "master-data"]
1206 | ```
1207 |
1208 |
1209 |
1210 |
1211 | **JSON Format**
1212 |
1213 | ```json
1214 | {
1215 | "ots_version": "0.1.0",
1216 | "transformation_id": "product.master_data",
1217 | "description": "Customer master data with upsert logic",
1218 |
1219 | "transformation_type": "sql",
1220 | "code": {
1221 | "sql": {
1222 | "original_sql": "SELECT customer_id, name, email, phone, updated_at FROM source.customers WHERE updated_at >= '@start_date'",
1223 | "resolved_sql": "SELECT customer_id, name, email, phone, updated_at FROM warehouse.source.customers WHERE updated_at >= '@start_date'",
1224 | "source_tables": ["source.customers"]
1225 | }
1226 | },
1227 |
1228 | "schema": {
1229 | "columns": [
1230 | {
1231 | "name": "customer_id",
1232 | "datatype": "number",
1233 | "description": "Unique customer identifier"
1234 | },
1235 | {
1236 | "name": "name",
1237 | "datatype": "string",
1238 | "description": "Customer name"
1239 | },
1240 | {
1241 | "name": "email",
1242 | "datatype": "string",
1243 | "description": "Customer email"
1244 | },
1245 | {
1246 | "name": "phone",
1247 | "datatype": "string",
1248 | "description": "Customer phone number"
1249 | },
1250 | {
1251 | "name": "updated_at",
1252 | "datatype": "date",
1253 | "description": "Last update timestamp"
1254 | }
1255 | ],
1256 | "partitioning": [],
1257 | "indexes": [
1258 | {
1259 | "name": "idx_customer_id",
1260 | "columns": ["customer_id"]
1261 | },
1262 | {
1263 | "name": "idx_email",
1264 | "columns": ["email"]
1265 | }
1266 | ]
1267 | },
1268 |
1269 | "materialization": {
1270 | "type": "incremental",
1271 | "incremental_details": {
1272 | "strategy": "merge",
1273 | "filter_condition": "updated_at >= '@start_date'",
1274 | "merge_key": ["customer_id"],
1275 | "update_columns": ["name", "email", "phone", "updated_at"]
1276 | }
1277 | },
1278 |
1279 | "tests": {
1280 | "columns": {
1281 | "customer_id": ["not_null", "unique"],
1282 | "email": ["not_null"]
1283 | },
1284 | "table": ["row_count_gt_0", "unique"]
1285 | },
1286 |
1287 | "metadata": {
1288 | "file_path": "/models/product/master_data.sql",
1289 | "owner": "product-team",
1290 | "tags": ["customers", "master-data"]
1291 | }
1292 | }
1293 | ```
1294 |
1295 |
1296 |
1297 | ### SCD2 Example
1298 |
1299 |
1300 | **YAML Format**
1301 |
1302 | ```yaml
1303 | ots_version: "0.1.0"
1304 | transformation_id: "dim.products_scd2"
1305 | description: "Product dimension with full history tracking"
1306 |
1307 | transformation_type: "sql"
1308 | code:
1309 | sql:
1310 | original_sql: "SELECT product_id, product_name, price, category, updated_at FROM source.products WHERE updated_at >= '@start_date'"
1311 | resolved_sql: "SELECT product_id, product_name, price, category, updated_at FROM warehouse.source.products WHERE updated_at >= '@start_date'"
1312 | source_tables: ["source.products"]
1313 |
1314 | schema:
1315 | columns:
1316 | - name: "product_id"
1317 | datatype: "number"
1318 | description: "Unique product identifier"
1319 | - name: "product_name"
1320 | datatype: "string"
1321 | description: "Product name"
1322 | - name: "price"
1323 | datatype: "number"
1324 | description: "Product price"
1325 | - name: "category"
1326 | datatype: "string"
1327 | description: "Product category"
1328 | - name: "updated_at"
1329 | datatype: "date"
1330 | description: "Last update timestamp"
1331 | - name: "valid_from"
1332 | datatype: "date"
1333 | description: "Record validity start date"
1334 | - name: "valid_to"
1335 | datatype: "date"
1336 | description: "Record validity end date"
1337 | partitioning: []
1338 | indexes:
1339 | - name: "idx_product_id"
1340 | columns: ["product_id"]
1341 | - name: "idx_valid_from"
1342 | columns: ["valid_from"]
1343 |
1344 | materialization:
1345 | type: "scd2"
1346 | scd2_details:
1347 | unique_key: ["product_id"]
1348 | start_column: "valid_from"
1349 | end_column: "valid_to"
1350 |
1351 | tests:
1352 | columns:
1353 | product_id: ["not_null", "unique"]
1354 | valid_from: ["not_null"]
1355 | table: ["row_count_gt_0"]
1356 |
1357 | metadata:
1358 | file_path: "/models/dim/products_scd2.sql"
1359 | owner: "data-engineering"
1360 | tags: ["products", "scd2", "dimension"]
1361 | ```
1362 |
1363 |
1364 |
1365 |
1366 | **JSON Format**
1367 |
1368 | ```json
1369 | {
1370 | "ots_version": "0.1.0",
1371 | "transformation_id": "dim.products_scd2",
1372 | "description": "Product dimension with full history tracking",
1373 |
1374 | "transformation_type": "sql",
1375 | "code": {
1376 | "sql": {
1377 | "original_sql": "SELECT product_id, product_name, price, category, updated_at FROM source.products WHERE updated_at >= '@start_date'",
1378 | "resolved_sql": "SELECT product_id, product_name, price, category, updated_at FROM warehouse.source.products WHERE updated_at >= '@start_date'",
1379 | "source_tables": ["source.products"]
1380 | }
1381 | },
1382 |
1383 | "schema": {
1384 | "columns": [
1385 | {
1386 | "name": "product_id",
1387 | "datatype": "number",
1388 | "description": "Unique product identifier"
1389 | },
1390 | {
1391 | "name": "product_name",
1392 | "datatype": "string",
1393 | "description": "Product name"
1394 | },
1395 | {
1396 | "name": "price",
1397 | "datatype": "number",
1398 | "description": "Product price"
1399 | },
1400 | {
1401 | "name": "category",
1402 | "datatype": "string",
1403 | "description": "Product category"
1404 | },
1405 | {
1406 | "name": "updated_at",
1407 | "datatype": "date",
1408 | "description": "Last update timestamp"
1409 | },
1410 | {
1411 | "name": "valid_from",
1412 | "datatype": "date",
1413 | "description": "Record validity start date"
1414 | },
1415 | {
1416 | "name": "valid_to",
1417 | "datatype": "date",
1418 | "description": "Record validity end date"
1419 | }
1420 | ],
1421 | "partitioning": [],
1422 | "indexes": [
1423 | {
1424 | "name": "idx_product_id",
1425 | "columns": ["product_id"]
1426 | },
1427 | {
1428 | "name": "idx_valid_from",
1429 | "columns": ["valid_from"]
1430 | }
1431 | ]
1432 | },
1433 |
1434 | "materialization": {
1435 | "type": "scd2",
1436 | "scd2_details": {
1437 | "unique_key": ["product_id"],
1438 | "start_column": "valid_from",
1439 | "end_column": "valid_to"
1440 | }
1441 | },
1442 |
1443 | "tests": {
1444 | "columns": {
1445 | "product_id": ["not_null", "unique"],
1446 | "valid_from": ["not_null"]
1447 | },
1448 | "table": ["row_count_gt_0"]
1449 | },
1450 |
1451 | "metadata": {
1452 | "file_path": "/models/dim/products_scd2.sql",
1453 | "owner": "data-engineering",
1454 | "tags": ["products", "scd2", "dimension"]
1455 | }
1456 | }
1457 | ```
1458 |
1459 |
1460 |
1461 | ## Complete Example: Test Library and Module
1462 |
1463 | This example demonstrates a complete setup with a test library and a transformation module that uses both standard and custom tests.
1464 |
1465 | ### Test Library Example
1466 |
1467 |
1468 | **YAML Format**
1469 |
1470 | ```yaml
1471 | # tests/test_library.yaml
1472 | test_library_version: "1.0"
1473 | description: "Shared data quality tests for analytics project"
1474 |
1475 | generic_tests:
1476 | check_minimum_rows:
1477 | type: "sql"
1478 | level: "table"
1479 | description: "Ensures table has minimum number of rows"
1480 | sql: |
1481 | SELECT 1 as violation
1482 | FROM @table_name
1483 | GROUP BY 1
1484 | HAVING COUNT(*) < @min_rows:10
1485 | parameters:
1486 | min_rows:
1487 | type: "number"
1488 | default: 10
1489 | description: "Minimum number of rows required"
1490 |
1491 | column_not_negative:
1492 | type: "sql"
1493 | level: "column"
1494 | description: "Ensures numeric column has no negative values"
1495 | sql: |
1496 | SELECT @column_name
1497 | FROM @table_name
1498 | WHERE @column_name < 0
1499 | parameters: []
1500 |
1501 | singular_tests:
1502 | test_customers_email_format:
1503 | type: "sql"
1504 | level: "table"
1505 | description: "Validates email format for customers table"
1506 | sql: |
1507 | SELECT id, email
1508 | FROM analytics.customers
1509 | WHERE email NOT LIKE '%@%.%'
1510 | target_transformation: "analytics.customers"
1511 | ```
1512 |
1513 |
1514 |
1515 |
1516 | **JSON Format**
1517 |
1518 | ```json
1519 | {
1520 | "test_library_version": "1.0",
1521 | "description": "Shared data quality tests for analytics project",
1522 | "generic_tests": {
1523 | "check_minimum_rows": {
1524 | "type": "sql",
1525 | "level": "table",
1526 | "description": "Ensures table has minimum number of rows",
1527 | "sql": "SELECT 1 as violation\nFROM @table_name\nGROUP BY 1\nHAVING COUNT(*) < @min_rows:10",
1528 | "parameters": {
1529 | "min_rows": {
1530 | "type": "number",
1531 | "default": 10,
1532 | "description": "Minimum number of rows required"
1533 | }
1534 | }
1535 | },
1536 | "column_not_negative": {
1537 | "type": "sql",
1538 | "level": "column",
1539 | "description": "Ensures numeric column has no negative values",
1540 | "sql": "SELECT @column_name\nFROM @table_name\nWHERE @column_name < 0",
1541 | "parameters": []
1542 | }
1543 | },
1544 | "singular_tests": {
1545 | "test_customers_email_format": {
1546 | "type": "sql",
1547 | "level": "table",
1548 | "description": "Validates email format for customers table",
1549 | "sql": "SELECT id, email\nFROM analytics.customers\nWHERE email NOT LIKE '%@%.%'",
1550 | "target_transformation": "analytics.customers"
1551 | }
1552 | }
1553 | }
1554 | ```
1555 |
1556 |
1557 |
1558 | ### Module Using Test Library
1559 |
1560 |
1561 | **YAML Format**
1562 |
1563 | ```yaml
1564 | ots_version: "0.1.0"
1565 | module_name: "analytics_customers"
1566 | module_description: "Customer analytics transformations"
1567 | test_library_path: "../tests/test_library.yaml"
1568 | tags: ["analytics", "production"]
1569 |
1570 | target:
1571 | database: "warehouse"
1572 | schema: "analytics"
1573 | sql_dialect: "postgres"
1574 |
1575 | transformations:
1576 | - transformation_id: "analytics.customers"
1577 | description: "Customer data table"
1578 | transformation_type: "sql"
1579 |
1580 | code:
1581 | sql:
1582 | original_sql: "SELECT id, name, email, created_at, amount FROM source.customers WHERE active = true"
1583 | resolved_sql: "SELECT id, name, email, created_at, amount FROM warehouse.source.customers WHERE active = true"
1584 | source_tables: ["source.customers"]
1585 |
1586 | schema:
1587 | columns:
1588 | - name: "id"
1589 | datatype: "number"
1590 | description: "Unique customer identifier"
1591 | - name: "name"
1592 | datatype: "string"
1593 | description: "Customer name"
1594 | - name: "email"
1595 | datatype: "string"
1596 | description: "Customer email address"
1597 | - name: "created_at"
1598 | datatype: "date"
1599 | description: "Customer creation date"
1600 | - name: "amount"
1601 | datatype: "number"
1602 | description: "Customer account balance"
1603 | partitioning: []
1604 | indexes:
1605 | - name: "idx_customers_id"
1606 | columns: ["id"]
1607 | - name: "idx_customers_email"
1608 | columns: ["email"]
1609 |
1610 | materialization:
1611 | type: "table"
1612 |
1613 | tests:
1614 | columns:
1615 | id:
1616 | - "not_null" # Standard test
1617 | - "unique" # Standard test (column-level)
1618 | email:
1619 | - "not_null"
1620 | - name: "accepted_values" # Standard test with params
1621 | params:
1622 | values: ["gmail.com", "yahoo.com", "company.com"]
1623 | amount:
1624 | - name: "column_not_negative" # Generic test from library
1625 | table:
1626 | - "row_count_gt_0" # Standard test
1627 | - "unique" # Standard test (table-level, checks all columns for row uniqueness)
1628 | - name: "check_minimum_rows" # Generic test with params
1629 | params:
1630 | min_rows: 100
1631 | - "test_customers_email_format" # Singular test from library
1632 |
1633 | metadata:
1634 | file_path: "/models/analytics/customers.sql"
1635 | owner: "data-team"
1636 | tags: ["customer", "core"]
1637 | object_tags:
1638 | sensitivity_tag: "pii"
1639 | classification: "internal"
1640 | ```
1641 |
1642 |
1643 |
1644 |
1645 | **JSON Format**
1646 |
1647 | ```json
1648 | {
1649 | "ots_version": "0.1.0",
1650 | "module_name": "analytics_customers",
1651 | "module_description": "Customer analytics transformations",
1652 | "test_library_path": "../tests/test_library.yaml",
1653 | "tags": ["analytics", "production"],
1654 | "target": {
1655 | "database": "warehouse",
1656 | "schema": "analytics",
1657 | "sql_dialect": "postgres"
1658 | },
1659 | "transformations": [
1660 | {
1661 | "transformation_id": "analytics.customers",
1662 | "description": "Customer data table",
1663 | "transformation_type": "sql",
1664 | "code": {
1665 | "sql": {
1666 | "original_sql": "SELECT id, name, email, created_at, amount FROM source.customers WHERE active = true",
1667 | "resolved_sql": "SELECT id, name, email, created_at, amount FROM warehouse.source.customers WHERE active = true",
1668 | "source_tables": ["source.customers"]
1669 | }
1670 | },
1671 | "schema": {
1672 | "columns": [
1673 | {
1674 | "name": "id",
1675 | "datatype": "number",
1676 | "description": "Unique customer identifier"
1677 | },
1678 | {
1679 | "name": "name",
1680 | "datatype": "string",
1681 | "description": "Customer name"
1682 | },
1683 | {
1684 | "name": "email",
1685 | "datatype": "string",
1686 | "description": "Customer email address"
1687 | },
1688 | {
1689 | "name": "created_at",
1690 | "datatype": "date",
1691 | "description": "Customer creation date"
1692 | },
1693 | {
1694 | "name": "amount",
1695 | "datatype": "number",
1696 | "description": "Customer account balance"
1697 | }
1698 | ],
1699 | "partitioning": [],
1700 | "indexes": [
1701 | {
1702 | "name": "idx_customers_id",
1703 | "columns": ["id"]
1704 | },
1705 | {
1706 | "name": "idx_customers_email",
1707 | "columns": ["email"]
1708 | }
1709 | ]
1710 | },
1711 | "materialization": {
1712 | "type": "table"
1713 | },
1714 | "tests": {
1715 | "columns": {
1716 | "id": ["not_null", "unique"],
1717 | "email": [
1718 | "not_null",
1719 | {
1720 | "name": "accepted_values",
1721 | "params": {
1722 | "values": ["gmail.com", "yahoo.com", "company.com"]
1723 | }
1724 | }
1725 | ],
1726 | "amount": [
1727 | {
1728 | "name": "column_not_negative"
1729 | }
1730 | ]
1731 | },
1732 | "table": [
1733 | "row_count_gt_0",
1734 | "unique",
1735 | {
1736 | "name": "check_minimum_rows",
1737 | "params": {
1738 | "min_rows": 100
1739 | }
1740 | },
1741 | "test_customers_email_format"
1742 | ]
1743 | },
1744 | "metadata": {
1745 | "file_path": "/models/analytics/customers.sql",
1746 | "owner": "data-team",
1747 | "tags": ["customer", "core"],
1748 | "object_tags": {
1749 | "sensitivity_tag": "pii",
1750 | "classification": "internal"
1751 | }
1752 | }
1753 | }
1754 | ]
1755 | }
1756 | ```
1757 |
1758 |
1759 |
--------------------------------------------------------------------------------
/versions/v0.2.0/transformation-specification.md:
--------------------------------------------------------------------------------
1 | # Open Transformation Specification v0.2.0
2 |
3 | ## Table of Contents
4 |
5 | 1. [Introduction](#introduction)
6 | 2. [Core Concepts](#core-concepts)
7 | 3. [Open Transformation Definition (OTD) Structure](#open-transformation-definition-otd-structure)
8 | 4. [Materialization Types](#materialization-types)
9 | 5. [Data Quality Tests](#data-quality-tests)
10 | 6. [User-Defined Functions (UDFs)](#user-defined-functions-udfs)
11 | 7. [Examples](#examples)
12 |
13 | ## Introduction
14 |
15 | The Open Transformation Specification (OTS) defines a standard, programming language-agnostic interface description for data transformations, data quality tests, and user-defined functions (UDFs). This specification enables **interoperability** between tools and platforms, shifting the data transformation ecosystem from isolated, proprietary tools to an **open core** where tools can seamlessly work together around a shared specification.
16 |
17 | This specification allows both humans and computers to discover and understand how transformations behave, what outputs they produce, and how those outputs are materialized (as tables, views, incremental updates, SCD2, etc.) without requiring additional documentation or configuration. By providing a common standard, OTS ensures that transformations defined in one tool can be consumed, understood, and executed by any OTS-compliant tool.
18 |
19 | The OTS standard encompasses three types of artifacts: **Open Transformation Definitions (OTDs)** for transformations, **UDF Definitions** for user-defined functions, and **Test Definitions** for data quality tests. Together, these form the complete set of **OTS Artifacts** that can be defined and managed within an OTS Module.
20 |
21 | An OTS-based transformation must include both the code that transforms the data and metadata about the transformation. A tool implementing OTS should be able to execute an OTS transformation with no additional code or information beyond what's specified in the OTS document. This **interoperability** ensures that the transformation ecosystem can grow organically, with tools building on each other's capabilities rather than creating isolated silos.
22 |
23 | ## Core Concepts
24 |
25 | ### OTS Artifacts
26 |
27 | **OTS Artifacts** is the umbrella term for all concrete instances of the Open Transformation Specification. The OTS standard defines three types of artifacts:
28 |
29 | 1. **Open Transformation Definition (OTD)**: A structured definition that describes a specific data transformation
30 | 2. **UDF Definition**: A structured definition that describes a user-defined function
31 | 3. **Test Definition**: A structured definition that describes a data quality test
32 |
33 | All OTS Artifacts follow the OTS format and can be defined within an OTS Module. Together, they form a complete data transformation pipeline with reusable functions and quality validation.
34 |
35 | ### Open Transformation Definition (OTD)
36 |
37 | An **Open Transformation Definition (OTD)** is a concrete instance of the Open Transformation Specification that describes a specific data transformation using the OTS format. An OTD exists as a structured definition within an OTS Module, which is the file or document that contains one or more transformation definitions.
38 |
39 | A transformation is a unit of data processing that takes one or more data sources as input and produces one data output. Currently, transformations are SQL queries; support for other programming languages is planned for future versions of the specification.
40 |
41 | ### Open Transformation Specification Module
42 |
43 | An **Open Transformation Specification Module (OTS Module)** is a collection of related OTS Artifacts (transformations, UDFs, and tests) that target the same database and schema. An OTS Module can contain one or more transformations, UDF definitions, and test definitions, much like how an OpenAPI specification can contain multiple endpoints.
44 |
45 | Key characteristics of an OTS Module:
46 | - **Single target**: All transformations in a module target the same database and schema
47 | - **Logical grouping**: Related transformations are organized together
48 | - **Deployment unit**: The entire module can be deployed as a single unit
49 |
50 | ### Test Library
51 |
52 | A **Test Library** is a project-level collection of reusable Test Definitions (generic and singular SQL tests) that can be shared across multiple OTS modules. Test libraries are defined separately from transformation modules and are referenced by modules that need to use them.
53 |
54 | Key characteristics of a Test Library:
55 | - **Project-level scope**: Test libraries are defined at the project/workspace level, separate from OTS modules
56 | - **Reusability**: Test Definitions in a library can be referenced by any OTS module in the project
57 | - **Test types**: Contains both generic SQL tests (with placeholders) and singular SQL tests (table-specific)
58 | - **Optional**: Modules can define tests inline or reference a test library, or both
59 |
60 | ### UDF Definition
61 |
62 | A **UDF Definition** is a concrete instance of the Open Transformation Specification that describes a user-defined function using the OTS format. A UDF Definition exists as a structured definition within an OTS Module, defining a custom function that can be called within SQL transformations.
63 |
64 | UDF Definitions include the function's signature (parameters and return type), implementation code, dependencies, and metadata. They enable reusable business logic and calculations that can be shared across multiple transformations.
65 |
66 | ### Test Definition
67 |
68 | A **Test Definition** is a concrete instance of the Open Transformation Specification that describes a data quality test using the OTS format. Test Definitions can exist in two contexts:
69 |
70 | 1. **Within a Test Library**: Reusable test definitions (generic and singular SQL tests) that can be shared across multiple OTS modules
71 | 2. **Inline within an OTS Module**: Module-specific test definitions that are defined directly in the transformation module
72 |
73 | Test Definitions include the test logic (SQL queries for generic/singular tests, or test type for standard tests), parameters, target scope (table or column level), and metadata. They enable automated data quality validation without manual inspection.
74 |
75 | #### OTS vs OTD vs OTS Module vs Test Library vs OTS Artifacts
76 |
77 | - **Open Transformation Specification (OTS)**: The standard that defines the structure and rules
78 | - **OTS Artifacts**: The umbrella term for all concrete instances of OTS (OTDs, UDF Definitions, and Test Definitions)
79 | - **Open Transformation Definition (OTD)**: A specific transformation within a module
80 | - **UDF Definition**: A specific user-defined function within a module
81 | - **Test Definition**: A specific data quality test (in a Test Library or inline in a module)
82 | - **Open Transformation Specification Module (OTS Module)**: A collection of related OTS Artifacts targeting the same database and schema
83 | - **Test Library**: A project-level collection of reusable Test Definitions that can be shared across modules
84 |
85 | ## Components of an OTD
86 |
87 | An Open Transformation Definition consists of several key components that work together to define an executable transformation:
88 |
89 | 1. **Transformation Code**: The transformation logic (SQL, Python, PySpark, etc.) stored in a type-based structure
90 | 2. **Schema Definition**: The structure of the output data including column definitions, types, and validation rules
91 | 3. **Materialization Strategy**: How the output is stored and updated (table, view, incremental, SCD2)
92 | 4. **Tests**: Validation rules that ensure data quality at column and table level
93 | 5. **Metadata**: Additional information about the transformation (owner, tags, creation date, etc.)
94 |
95 | ### Transformation Code
96 |
97 | Transformations can be written in different languages (SQL, Python, PySpark, etc.). The transformation code is stored in a type-based structure that supports multiple transformation types while maintaining a consistent interface.
98 |
99 | #### SQL Transformations
100 |
101 | For SQL transformations, the code is stored with the following structure:
102 | - `original_sql`: The original SQL query as written (typically a SELECT statement). This preserves the original transformation code as authored.
103 | - `resolved_sql`: SQL with fully qualified table names (schema.table format). This is the preferred version for execution as it eliminates ambiguity in table references. Tools should use `resolved_sql` when executing transformations.
104 | - `source_tables`: List of input tables referenced in the query (required for dependency analysis)
105 | - `source_functions`: List of user-defined functions (UDFs) called in the query (used for dependency analysis). This field is optional and may be empty if no user-defined functions are used. Function names should be fully qualified (schema.function_name) when possible, or unqualified if the function is resolved by the database.
106 |
107 | **When to use each:**
108 | - Use `original_sql` for: displaying the original code to users, version control, understanding the transformation logic
109 | - Use `resolved_sql` for: actual execution, dependency resolution, cross-database compatibility
110 | - Use `source_tables` and `source_functions` for: dependency graph building, execution order determination, and understanding transformation dependencies
111 |
112 | #### Non-SQL Transformations
113 |
114 | Support for non-SQL transformation types (Python, PySpark, R, etc.) is planned for future versions of the specification. The current v0.2.0 specification focuses on SQL transformations and adds support for user-defined functions (UDFs).
115 |
116 | ### Schema Definition
117 |
118 | Schema defines the structure of the output data, including column names, data types, descriptions, partitioning, indexes, and other properties of the physical table. The schema is essential for understanding what the transformation produces without executing it. For example, it enables generating DDL statements for creating the output table.
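
As an illustration, from the column, partitioning, and index definitions a tool could derive DDL similar to the sketch below; the datatype mapping and syntax are dialect-specific, and the table and index names are hypothetical.

```sql
-- Hypothetical DDL derived from a schema definition (PostgreSQL-flavoured).
CREATE TABLE analytics.customers (
    id          NUMERIC,   -- datatype: "number"
    name        TEXT,      -- datatype: "string"
    email       TEXT,      -- datatype: "string"
    created_at  DATE       -- datatype: "date"
);

CREATE INDEX idx_customers_id    ON analytics.customers (id);
CREATE INDEX idx_customers_email ON analytics.customers (email);
```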
119 |
120 | ### Materialization Strategy
121 | Materialization defines how the transformation output is stored and updated. Common types include:
122 | - **table**: Full table replacement on each run
123 | - **view**: Virtual table that queries underlying data
124 | - **incremental**: Partial updates using strategies like delete+insert or merge
125 | - **scd2**: Slowly Changing Dimension type 2 for tracking historical changes
126 |
127 | ### Tests
128 | Tests are validation rules that ensure data quality. They can be defined at two levels:
129 | - **Column-level tests**: Applied to individual columns (e.g., `not_null`, `unique`)
130 | - **Table-level tests**: Applied to the entire output (e.g., `row_count_gt_0`, `unique`)
131 |
132 | Tests enable automated data quality validation without manual inspection. OTS supports three types of tests:
133 | 1. **Standard Tests**: Built-in tests defined in the OTS specification (e.g., `not_null`, `unique`, `row_count_gt_0`)
134 | 2. **Generic SQL Tests**: Reusable SQL tests with placeholders that can be applied to multiple transformations
135 | 3. **Singular SQL Tests**: Table-specific SQL tests with hardcoded table references
136 |
137 | Generic SQL Tests and Singular SQL Tests can be defined in a project Test Library (see [Test Library](#test-library)) or inline within the current OTS Module.
138 |
139 | For detailed information about test types, definitions, and usage, see the [Data Quality Tests](#data-quality-tests) section.
140 |
141 | ### Metadata
142 | Metadata provides additional information about the transformation including:
143 | - **file_path**: Location of the source transformation file
144 | - **owner**: Person or team responsible for the transformation
145 | - **tags**: List of string tags for categorization and discovery (e.g., ["analytics", "fct", "production"])
146 | - **object_tags**: Dictionary of key-value pairs for database object tagging (e.g., {"sensitivity_tag": "pii", "classification": "public"})
147 |
148 | **Tag Types:**
149 | - **tags** (dbt-style): Simple string tags used for filtering, categorization, and discovery. These are typically used for model selection and organization.
150 | - **Module-level tags**: Tags defined at the module level apply to all transformations in the module. They can be inherited by transformations or merged with transformation-specific tags.
151 | - **Transformation-level tags**: Tags defined at the transformation level are specific to that transformation. They can be merged with module-level tags.
152 | - **object_tags** (database-style): Key-value pairs that are attached directly to database objects (tables, views) in databases that support object tagging (e.g., Snowflake). These are used for data governance, compliance, and metadata management. Unlike `tags`, `object_tags` are always transformation-specific and are not inherited from module level.
153 |
154 | ## OTS Module Structure
155 |
156 | An OTS Module is a YAML or JSON document that can contain one or more OTS Artifacts (transformations, UDF Definitions, and Test Definitions). Below is the complete structure:
157 |
158 | ### Complete OTS Module Structure
159 |
160 | ```yaml
161 | # OTS version
162 | ots_version: string # OTS specification version (e.g., "0.1.0") - indicates which version of the OTS standard this module follows
163 |
164 | # Module metadata
165 | module_name: string # Module name (e.g., "ecommerce_analytics")
166 | module_description: string # Description of the module (optional)
167 | version: string # Optional: Module/package version (e.g., "1.0.0") - version of this specific module, independent of OTS version
168 | tags: [string] # Optional: Module-level tags for categorization (e.g., ["analytics", "fct"]). These can be inherited or merged with transformation-level tags.
169 | test_library_path: string # Optional: Path to test library file (relative to module file or absolute path)
170 |
171 | # Optional: Inline test definitions (same structure as test library)
172 | generic_tests: # Optional: Module-specific generic SQL tests
173 | test_name:
174 | type: "sql"
175 | level: "table" | "column"
176 | description: string
177 | sql: string
178 | parameters: {}
179 | singular_tests: # Optional: Module-specific singular SQL tests
180 | test_name:
181 | type: "sql"
182 | level: "table" | "column"
183 | description: string
184 | sql: string
185 | target_transformation: string
186 |
187 | target:
188 | database: string # Target database name
189 | schema: string # Target schema name
190 | sql_dialect: string # Optional: SQL dialect (e.g., "postgres", "bigquery", "snowflake", "spark", etc.)
191 | connection_profile: string # Optional: connection profile reference
192 |
193 | # Transformations
194 | transformations: # Array of transformation definitions
195 | - transformation_id: string # Fully qualified identifier (e.g., "analytics.my_first_table")
196 |     description: string # Optional: Description of what the transformation does
197 | transformation_type: string # Type of transformation: "sql" (default: "sql"). Non-SQL types (python, pyspark, r) are planned for future versions.
198 | sql_dialect: string # Optional: SQL dialect of the transformation code (for translation to target dialect)
199 |
200 | # Transformation code (type-based structure)
201 | code:
202 | # For SQL transformations (transformation_type: "sql")
203 | sql:
204 | original_sql: string # The original SQL query as written (typically a SELECT statement)
205 | resolved_sql: string # SQL with fully qualified table names (schema.table) - preferred for execution
206 | source_tables: [string] # List of input tables referenced (required for dependency analysis)
207 |       source_functions: [string] # Optional: List of user-defined functions (UDFs) called in the query (used for dependency analysis)
208 |
209 | # Note: Non-SQL transformation types (python, pyspark, r) are planned for future versions
210 |
211 | # Schema definition
212 | schema:
213 | columns: # Array of column definitions
214 | - name: string # Column name
215 | datatype: string # Data type ("number", "string", "date", etc.)
216 | description: string # Column description
217 | partitioning: [string] # Optional: Partition keys
218 | indexes: # Optional: Array of index definitions
219 | - name: string # Index name (optional, auto-generated if not provided)
220 | columns: [string] # Columns to index
221 |
222 | # Materialization strategy
223 | materialization:
224 | type: string # "table", "view", "incremental", "scd2"
225 | incremental_details: # Required if type is "incremental"
226 | strategy: string # "delete_insert", "append", "merge"
227 | delete_condition: string # SQL condition for delete (delete_insert only)
228 | filter_condition: string # SQL condition for filtering data
229 | merge_key: [string] # Primary key columns for matching records (merge only)
230 | update_columns: [string] # (Optional) List of columns to be updated in merge strategy
231 | scd2_details: # Optional if type is "scd2"
232 | start_column: string # Name of the start column (default: "valid_from")
233 | end_column: string # Name of the end column (default: "valid_to")
234 | unique_key: [string] # Array of columns that uniquely identify a record in SCD2 modeling (optional)
235 |
236 | # Tests: both column-level and table-level
237 | tests:
238 | columns: # Optional: Column-level tests
239 | column_name: # Tests for specific columns
240 | - string # Simple test name (e.g., "not_null", "unique")
241 | - object # Test with parameters: {name: string, params?: object, severity?: "error"|"warning"}
242 | table: # Optional: Table-level tests
243 | - string # Simple test name (e.g., "row_count_gt_0")
244 | - object # Test with parameters: {name: string, params?: object, severity?: "error"|"warning"}
245 |
246 | # Metadata
247 | metadata:
248 | file_path: string # Path to the source transformation file
249 |       owner: string  # Optional: Person or team responsible
250 | tags: [string] # Optional: List of string tags for categorization and discovery (e.g., ["analytics", "fct"])
251 | object_tags: dict # Optional: Dictionary of key-value pairs for database object tagging (e.g., {"sensitivity_tag": "pii", "classification": "public"})
252 |
253 | # Functions (OTS 0.2.0+)
254 | functions: # Optional: Array of user-defined function definitions (OTS 0.2.0+)
255 | - function_id: string # Fully qualified function name (e.g., "schema.function_name")
256 | description: string # Optional: Description of what the function does
257 | function_type: string # Function type: "scalar", "aggregate", or "table"
258 | language: string # Programming language: "sql", "python", "javascript", etc.
259 | parameters: # Optional: Array of function parameters
260 | - name: string # Parameter name
261 | type: string # Parameter data type (e.g., "DOUBLE", "VARCHAR")
262 | description: string # Optional: Parameter description
263 | return_type: string # Optional: Return type for scalar/aggregate functions (e.g., "DOUBLE", "VARCHAR")
264 | return_table_schema: # Optional: Schema definition for table functions (same structure as transformation schema)
265 | columns: [ColumnDefinition]
266 | deterministic: bool # Optional: Whether the function is deterministic (same inputs always produce same outputs)
267 | code: # Function code (type-based structure)
268 | generic_sql: string # Generic SQL code that works across databases (for SQL functions)
269 | database_specific: dict # Database-specific implementations (keyed by database name)
270 | dependencies: # Optional: Function dependencies
271 | tables: [string] # List of tables the function depends on
272 | functions: [string] # List of other functions this function depends on
273 | metadata: # Function metadata
274 | file_path: string # Path to the source function file
275 | tags: [string] # Optional: List of string tags for categorization
276 | object_tags: dict # Optional: Dictionary of key-value pairs for database object tagging
277 | ```
278 |
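For illustration, the sketch below shows how a tool might load a module file and check its top-level shape. It is a minimal sketch, not a reference implementation: it assumes PyYAML is available, and it treats `ots_version`, `module_name`, `target`, and `transformations` as the required top-level keys, which is an assumption based on the structure above rather than a normative rule.

```python
# Minimal sketch of loading and sanity-checking an OTS Module file.
# Assumes PyYAML is installed; the required-key set below is an assumption,
# not a normative rule from the specification.
import json
from pathlib import Path

import yaml

REQUIRED_KEYS = {"ots_version", "module_name", "target", "transformations"}


def load_ots_module(path: str) -> dict:
    text = Path(path).read_text(encoding="utf-8")
    module = json.loads(text) if path.endswith(".json") else yaml.safe_load(text)

    missing = REQUIRED_KEYS - module.keys()
    if missing:
        raise ValueError(f"Module {path} is missing required keys: {sorted(missing)}")

    for transformation in module["transformations"]:
        if "transformation_id" not in transformation:
            raise ValueError("Every transformation needs a transformation_id")
    return module


# Usage (hypothetical file name):
# module = load_ots_module("analytics_customers.yaml")
```
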
279 | ## Simple Table Transformation
280 |
281 |
282 | ### JSON Format
283 |
284 | ```json
285 | {
286 | "ots_version": "0.2.0",
287 | "module_name": "analytics_customers",
288 | "module_description": "Customer analytics transformations",
289 | "tags": ["analytics", "production"],
290 | "test_library_path": "../tests/test_library.yaml",
291 | "target": {
292 | "database": "warehouse",
293 | "schema": "analytics",
294 | "sql_dialect": "postgres"
295 | },
296 | "transformations": [
297 | {
298 | "transformation_id": "analytics.customers",
299 | "description": "Customer data table",
300 | "transformation_type": "sql",
301 | "code": {
302 | "sql": {
303 | "original_sql": "SELECT id, name, email, created_at FROM source.customers WHERE active = true",
304 | "resolved_sql": "SELECT id, name, email, created_at FROM warehouse.source.customers WHERE active = true",
305 | "source_tables": ["source.customers"],
306 | "source_functions": []
307 | }
308 | },
309 | "schema": {
310 | "columns": [
311 | {
312 | "name": "id",
313 | "datatype": "number",
314 | "description": "Unique customer identifier"
315 | },
316 | {
317 | "name": "name",
318 | "datatype": "string",
319 | "description": "Customer name"
320 | },
321 | {
322 | "name": "email",
323 | "datatype": "string",
324 | "description": "Customer email address"
325 | },
326 | {
327 | "name": "created_at",
328 | "datatype": "date",
329 | "description": "Customer creation date"
330 | }
331 | ],
332 | "partitioning": [],
333 | "indexes": [
334 | {
335 | "name": "idx_customers_id",
336 | "columns": ["id"]
337 | },
338 | {
339 | "name": "idx_customers_email",
340 | "columns": ["email"]
341 | }
342 | ]
343 | },
344 | "materialization": {
345 | "type": "table"
346 | },
347 | "tests": {
348 | "columns": {
349 | "id": ["not_null", "unique"],
350 | "email": ["not_null", "unique"],
351 | "created_at": ["not_null"]
352 | },
353 | "table": ["row_count_gt_0"]
354 | },
355 | "metadata": {
356 | "file_path": "/models/analytics/customers.sql",
357 | "owner": "data-team",
358 | "tags": ["customer", "core"],
359 | "object_tags": {
360 | "sensitivity_tag": "pii",
361 | "classification": "internal"
362 | }
363 | }
364 | }
365 | ]
366 | }
367 | ```
368 |
369 |
370 |
371 |
372 | ### YAML Format
373 |
374 | ```yaml
375 | ots_version: "0.2.0"
376 | module_name: "analytics_customers"
377 | module_description: "Customer analytics transformations"
378 | tags: ["analytics", "production"]
379 |
380 | target:
381 | database: "warehouse"
382 | schema: "analytics"
383 | sql_dialect: "postgres"
384 |
385 | transformations:
386 | - transformation_id: "analytics.customers"
387 | description: "Customer data table"
388 | transformation_type: "sql"
389 |
390 | code:
391 | sql:
392 | original_sql: "SELECT id, name, email, created_at FROM source.customers WHERE active = true"
393 | resolved_sql: "SELECT id, name, email, created_at FROM warehouse.source.customers WHERE active = true"
394 | source_tables: ["source.customers"]
395 | source_functions: []
396 |
397 | schema:
398 | columns:
399 | - name: "id"
400 | datatype: "number"
401 | description: "Unique customer identifier"
402 | - name: "name"
403 | datatype: "string"
404 | description: "Customer name"
405 | - name: "email"
406 | datatype: "string"
407 | description: "Customer email address"
408 | - name: "created_at"
409 | datatype: "date"
410 | description: "Customer creation date"
411 | partitioning: []
412 | indexes:
413 | - name: "idx_customers_id"
414 | columns: ["id"]
415 | - name: "idx_customers_email"
416 | columns: ["email"]
417 |
418 | materialization:
419 | type: "table"
420 |
421 | tests:
422 | columns:
423 | id: ["not_null", "unique"]
424 | email: ["not_null", "unique"]
425 | created_at: ["not_null"]
426 | table: ["row_count_gt_0"]
427 |
428 | metadata:
429 | file_path: "/models/analytics/customers.sql"
430 | owner: "data-team"
431 | tags: ["customer", "core"]
432 | object_tags:
433 | sensitivity_tag: "pii"
434 | classification: "internal"
435 | ```
436 |
437 |
438 |
439 | ## Materialization Types
440 |
441 | ### Incremental Materialization
442 |
443 | Incremental materialization updates only changed data using one of three strategies:
444 | - **delete_insert**: Deletes rows matching a condition and inserts new data
445 | - **append**: Simply appends new data without removing existing rows
446 | - **merge**: Performs an upsert operation using a merge statement
447 |
448 | #### Delete-Insert Strategy
449 |
450 | ```yaml
451 | materialization:
452 | type: "incremental"
453 | incremental_details:
454 | strategy: "delete_insert"
455 | delete_condition: "to_date(updated_ts) = '@start_date'"
456 | filter_condition: "to_date(updated_ts) = '@start_date'"
457 | ```
458 |
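For illustration, a tool could translate this configuration into a DELETE followed by an INSERT ... SELECT. The sketch below shows one possible rendering; the `@start_date` substitution, the function name, and the generated SQL shape are assumptions for the example, not part of the specification.

```python
# Illustrative sketch: one way a tool could render the delete_insert strategy
# as a DELETE followed by an INSERT ... SELECT. Placeholder handling (the
# @start_date variable) and the exact SQL shape are implementation choices.
def render_delete_insert(target_table, resolved_sql, details, variables):
    delete_condition = details["delete_condition"]
    filter_condition = details["filter_condition"]
    for name, value in variables.items():
        delete_condition = delete_condition.replace(f"@{name}", value)
        filter_condition = filter_condition.replace(f"@{name}", value)
    return [
        f"DELETE FROM {target_table} WHERE {delete_condition};",
        f"INSERT INTO {target_table} SELECT * FROM ({resolved_sql}) AS src WHERE {filter_condition};",
    ]


for statement in render_delete_insert(
    "warehouse.analytics.recent_orders",
    "SELECT order_id, customer_id, updated_ts FROM warehouse.source.orders",
    {"delete_condition": "to_date(updated_ts) = '@start_date'",
     "filter_condition": "to_date(updated_ts) = '@start_date'"},
    {"start_date": "2024-01-01"},
):
    print(statement)
```
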
459 | #### Append Strategy
460 |
461 | ```yaml
462 | materialization:
463 | type: "incremental"
464 | incremental_details:
465 | strategy: "append"
466 | filter_condition: "created_at >= '@start_date'"
467 | ```
468 |
469 | #### Merge Strategy
470 |
471 | ```yaml
472 | materialization:
473 | type: "incremental"
474 | incremental_details:
475 | strategy: "merge"
476 |     merge_key: ["customer_id"]  # Primary key or unique identifier for matching
477 | filter_condition: "updated_at >= '@start_date'"
478 | update_columns: ["name", "email"] # Optional: specific columns to update
479 | ```
480 |
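Similarly, a tool might turn `merge_key` and `update_columns` into a MERGE statement. The sketch below is one possible rendering; the column list and table names passed to it are assumptions for the example.

```python
# Illustrative sketch: constructing a MERGE statement from merge_key and
# update_columns. The column list and SQL shape are assumptions for the
# example; the spec only defines the configuration fields.
def render_merge(target_table, source_sql, merge_key, columns, update_columns):
    on_clause = " AND ".join(f"tgt.{k} = src.{k}" for k in merge_key)
    set_clause = ", ".join(f"tgt.{c} = src.{c}" for c in update_columns)
    insert_cols = ", ".join(columns)
    insert_vals = ", ".join(f"src.{c}" for c in columns)
    return (
        f"MERGE INTO {target_table} AS tgt\n"
        f"USING ({source_sql}) AS src\n"
        f"ON {on_clause}\n"
        f"WHEN MATCHED THEN UPDATE SET {set_clause}\n"
        f"WHEN NOT MATCHED THEN INSERT ({insert_cols}) VALUES ({insert_vals});"
    )


print(render_merge(
    "warehouse.analytics.master_data",
    "SELECT customer_id, name, email FROM warehouse.source.customers",
    merge_key=["customer_id"],
    columns=["customer_id", "name", "email"],
    update_columns=["name", "email"],
))
```
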
481 | ### SCD2 Materialization
482 |
483 | SCD2 (Slowly Changing Dimension Type 2) materialization tracks historical changes with valid date ranges. Requires a unique key to identify records.
484 |
485 | ```yaml
486 | materialization:
487 | type: "scd2"
488 | scd2_details:
489 | unique_key: ["product_id"] # Primary key or unique identifier
490 | start_column: "valid_from" # Optional, defaults to "valid_from"
491 | end_column: "valid_to" # Optional, defaults to "valid_to"
492 | ```
493 |
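Conceptually, an SCD2 load closes the current version of changed records and inserts new versions. The sketch below only illustrates that two-step flow under simplified assumptions (it skips change detection and closes every key present in the source); the SQL a real tool generates will differ per database.

```python
# Conceptual sketch of an SCD2 load as two statements: close out current
# versions, then insert new versions. For brevity this closes every key
# present in the source and skips change detection; real tools generate
# database-specific SQL with proper change comparison.
def render_scd2(target_table, source_sql, unique_key, start_column, end_column):
    key_match = " AND ".join(f"tgt.{k} = src.{k}" for k in unique_key)
    close_current = (
        f"UPDATE {target_table} AS tgt\n"
        f"SET {end_column} = CURRENT_DATE\n"
        f"WHERE tgt.{end_column} IS NULL\n"
        f"  AND EXISTS (SELECT 1 FROM ({source_sql}) AS src WHERE {key_match});"
    )
    insert_new = (
        f"INSERT INTO {target_table}\n"
        f"SELECT src.*, CURRENT_DATE AS {start_column}, NULL AS {end_column}\n"
        f"FROM ({source_sql}) AS src;"
    )
    return [close_current, insert_new]


for stmt in render_scd2(
    "warehouse.dim.products_scd2",
    "SELECT product_id, product_name, price FROM warehouse.source.products",
    unique_key=["product_id"],
    start_column="valid_from",
    end_column="valid_to",
):
    print(stmt, end="\n\n")
```
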
494 | ### Schema Column Definition
495 |
496 | A schema column in an OTS transformation defines the structure and properties of a single column in the output table:
497 |
498 | ```yaml
499 | columns:
500 | - name: string # Column name
501 | datatype: string # Data type ("number", "string", "date", etc.)
502 | description: string # Column description
503 | ```
504 |
505 | **Common Data Types:**
506 | - `number`: Numeric values
507 | - `string`: Text values
508 | - `date`: Date and timestamp values
509 | - `boolean`: True/false values
510 | - `array`: Array of values
511 | - `object`: Complex nested objects
512 |
513 | ## Data Quality Tests
514 |
515 | Data quality tests are validation rules that ensure the correctness and quality of transformation outputs. Tests can be defined at two levels:
516 | - **Column-level tests**: Applied to individual columns (e.g., `not_null`, `unique`)
517 | - **Table-level tests**: Applied to the entire output (e.g., `row_count_gt_0`, `unique`)
518 |
519 | Tests enable automated data quality validation without manual inspection. OTS supports three types of tests:
520 |
521 | 1. **Standard Tests**: Built-in tests defined in the OTS specification (e.g., `not_null`, `unique`, `row_count_gt_0`)
522 | 2. **Generic SQL Tests**: Reusable SQL tests with placeholders that can be applied to multiple transformations
523 | 3. **Singular SQL Tests**: Table-specific SQL tests with hardcoded table references
524 |
525 | ### Standard Tests
526 |
527 | Standard tests are built into the OTS specification and must be implemented by all OTS-compliant tools. These tests provide common data quality checks that are widely applicable across different transformations.
528 |
529 | #### Column-Level Standard Tests
530 |
531 | **`not_null`**
532 | - **Description**: Ensures a column contains no NULL values
533 | - **Level**: Column
534 | - **Parameters**: None
535 | - **Implementation**: Returns rows where the column is NULL (test fails if any rows returned)
536 | - **Example**:
537 | ```yaml
538 | tests:
539 | columns:
540 | id: ["not_null"]
541 | ```
542 |
543 | **`unique`**
544 | - **Description**: Ensures column values are unique across all rows
545 | - **Level**: Column or Table
546 | - **Parameters**:
547 | - `columns` (array, optional): For table-level tests, specifies which columns to check for uniqueness. If omitted at table level, checks all columns (entire row uniqueness)
548 | - **Implementation**: Returns duplicate values (test fails if any duplicates found)
549 | - **Examples**:
550 | ```yaml
551 | tests:
552 | columns:
553 | # Column-level: single column uniqueness
554 | id: ["not_null", "unique"]
555 |
556 | table:
557 | # Table-level: composite uniqueness on specific columns
558 | - name: "unique"
559 | params:
560 | columns: ["customer_id", "order_date"]
561 |
562 | # Table-level: entire row uniqueness (all columns)
563 | - "unique"
564 | ```
565 |
566 | **`accepted_values`**
567 | - **Description**: Ensures column values are within a specified list of acceptable values
568 | - **Level**: Column
569 | - **Parameters**:
570 | - `values` (array, required): List of acceptable values
571 | - **Implementation**: Returns rows where column value is not in the accepted list
572 | - **Example**:
573 | ```yaml
574 | tests:
575 | columns:
576 | status:
577 | - name: "accepted_values"
578 | params:
579 | values: ["active", "inactive", "pending"]
580 | ```
581 |
582 | **`relationships`**
583 | - **Description**: Ensures referential integrity between tables (foreign key validation)
584 | - **Level**: Column
585 | - **Parameters**:
586 | - `to` (string, required): Target transformation ID (e.g., "analytics.customers")
587 | - `field` (string, required): Column name in the target transformation
588 | - **Implementation**: Returns rows where the column value doesn't exist in the target table's specified field
589 | - **Example**:
590 | ```yaml
591 | tests:
592 | columns:
593 | customer_id:
594 | - name: "relationships"
595 | params:
596 | to: "analytics.customers"
597 | field: "id"
598 | ```
599 |
600 | #### Table-Level Standard Tests
601 |
602 | **`row_count_gt_0`**
603 | - **Description**: Ensures the table has at least one row
604 | - **Level**: Table
605 | - **Parameters**: None
606 | - **Implementation**: Returns a count result (test fails if count = 0)
607 | - **Example**:
608 | ```yaml
609 | tests:
610 | table:
611 | - "row_count_gt_0"
612 | ```
613 |
614 | ### Test Libraries
615 |
616 | Test libraries are project-level collections of reusable Test Definitions (generic and singular SQL tests) that can be shared across multiple OTS modules. For a detailed introduction to Test Libraries, see the [Test Library](#test-library) section in Core Concepts.
617 |
618 | #### Test Library Structure
619 |
620 | A test library is a YAML or JSON file that contains reusable Test Definitions. The file can be named anything (e.g., `test_library.yaml`, `tests.yaml`, `data_quality_tests.json`), but must follow the structure below.
621 |
622 | **Test Library File Structure:**
623 | ```yaml
624 | # test_library.yaml
625 | ots_version: string # OTS specification version (e.g., "0.2.0") - indicates which version of the OTS standard this test library follows
626 | test_library_version: string # Optional: Version identifier for the test library (e.g., "1.0", "2.1")
627 | description: string # Optional: Human-readable description of the test library
628 |
629 | generic_tests:
630 | check_minimum_rows:
631 | type: "sql"
632 | level: "table"
633 | description: "Ensures table has minimum number of rows"
634 | sql: |
635 | SELECT 1 as violation
636 | FROM @table_name
637 | GROUP BY 1
638 | HAVING COUNT(*) < @min_rows:10
639 | parameters:
640 | min_rows:
641 | type: "number"
642 | default: 10
643 | description: "Minimum number of rows required"
644 |
645 | column_not_negative:
646 | type: "sql"
647 | level: "column"
648 | description: "Ensures numeric column has no negative values"
649 | sql: |
650 | SELECT @column_name
651 | FROM @table_name
652 | WHERE @column_name < 0
653 |     parameters: {}
654 |
655 | singular_tests:
656 | test_customers_email_format:
657 | type: "sql"
658 | level: "table"
659 | description: "Validates email format for customers table"
660 | sql: |
661 | SELECT id, email
662 | FROM analytics.customers
663 | WHERE email NOT LIKE '%@%.%'
664 | target_transformation: "analytics.customers"
665 | ```
666 |
667 | #### Generic SQL Tests
668 |
669 | Generic SQL tests are reusable tests that use placeholders (variables) to make them applicable to multiple transformations. They follow the dbt pattern where:
670 | - The query returns rows when the test fails
671 | - 0 rows returned = test passes
672 | - 1+ rows returned = test fails
673 |
674 | **Placeholders:**
675 | - `@table_name` or `{{ table_name }}`: Replaced with the fully qualified transformation ID. The `@` syntax is recommended for cleaner SQL.
676 | - `@column_name` or `{{ column_name }}`: Replaced with the column name (for column-level tests). The `@` syntax is recommended.
677 | - Custom parameters: Available as `@parameter_name` or `{{ parameter_name }}` with optional defaults using `@param:default` syntax (e.g., `@min_rows:10`)
678 |
679 | **Structure:**
680 | ```yaml
681 | generic_tests:
682 | test_name: # Required: Unique test name (used for referencing)
683 | type: "sql" # Required: Always "sql" for SQL tests
684 | level: "table" | "column" # Required: Test level
685 | description: string # Optional: Human-readable description
686 | sql: string # Required: SQL query (returns rows on failure)
687 | parameters: # Optional: Parameter definitions
688 | param_name:
689 | type: "number" | "string" | "boolean" | "array" # Required: Parameter type
690 | default: value # Optional: Default value
691 | description: string # Optional: Parameter description
692 | ```
693 |
694 | **Example Generic Test:**
695 | ```yaml
696 | check_minimum_rows:
697 | type: "sql"
698 | level: "table"
699 | description: "Ensures table has minimum number of rows"
700 | sql: |
701 | SELECT 1 as violation
702 | FROM @table_name
703 | GROUP BY 1
704 | HAVING COUNT(*) < @min_rows:10
705 | parameters:
706 | min_rows:
707 | type: "number"
708 | default: 10
709 | description: "Minimum number of rows required"
710 | ```
711 |
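The sketch below shows one way a tool might expand these placeholders, including the `@param:default` form, before running the generic test above. The regular expression and precedence rules (explicit parameter value first, then the inline default) are illustrative, not normative.

```python
# Illustrative placeholder expansion for generic SQL tests: @table_name,
# @column_name, and custom @param:default parameters. The regex and precedence
# shown here are one possible implementation, not a normative rule.
import re

PLACEHOLDER = re.compile(r"@(\w+)(?::(\w+))?")


def expand_placeholders(sql, table_name, column_name=None, params=None):
    params = params or {}

    def substitute(match):
        name, default = match.group(1), match.group(2)
        if name == "table_name":
            return table_name
        if name == "column_name":
            return column_name or ""
        if name in params:
            return str(params[name])
        if default is not None:
            return default
        raise ValueError(f"No value supplied for placeholder @{name}")

    return PLACEHOLDER.sub(substitute, sql)


sql = "SELECT 1 as violation\nFROM @table_name\nGROUP BY 1\nHAVING COUNT(*) < @min_rows:10"
print(expand_placeholders(sql, "warehouse.analytics.customers", params={"min_rows": 100}))
```
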
712 | #### Singular SQL Tests
713 |
714 | Singular SQL tests are table-specific tests with hardcoded table references. They are useful for:
715 | - Complex business logic specific to one transformation
716 | - Tests that reference multiple tables
717 | - Table-specific validation rules
718 |
719 | **Structure:**
720 | ```yaml
721 | singular_tests:
722 | test_name: # Required: Unique test name (used for referencing)
723 | type: "sql" # Required: Always "sql" for SQL tests
724 | level: "table" | "column" # Required: Test level
725 | description: string # Optional: Human-readable description
726 | sql: string # Required: SQL query with hardcoded table names
727 | target_transformation: string # Required: Transformation ID this test applies to (used for validation and discovery)
728 | ```
729 |
730 | **Example Singular Test:**
731 | ```yaml
732 | test_customers_email_format:
733 | type: "sql"
734 | level: "table"
735 | description: "Validates email format for customers table"
736 | sql: |
737 | SELECT id, email
738 | FROM analytics.customers
739 | WHERE email NOT LIKE '%@%.%'
740 | target_transformation: "analytics.customers"
741 | ```
742 |
743 | ### Referencing Tests in Transformations
744 |
745 | Transformations reference tests from:
746 | 1. **Standard tests**: Referenced by name (e.g., `"not_null"`, `"unique"`)
747 | 2. **Test library tests**: Referenced by name from the test library (e.g., `"check_minimum_rows"`)
748 |
749 | **Module Structure with Test Library Reference:**
750 |
751 | ```yaml
752 | ots_version: "0.2.0"
753 | module_name: "analytics_customers"
754 | test_library_path: "../tests/test_library.yaml" # Optional: Path to test library
755 |
756 | target:
757 | database: "warehouse"
758 | schema: "analytics"
759 |
760 | transformations:
761 | - transformation_id: "analytics.customers"
762 | tests:
763 | columns:
764 | id:
765 | - "not_null" # Standard test
766 | - "unique" # Standard test (column-level)
767 | email:
768 | - "not_null"
769 | - name: "accepted_values" # Standard test with params
770 | params:
771 | values: ["gmail.com", "yahoo.com"]
772 | amount:
773 | - name: "column_not_negative" # Generic test from library
774 | table:
775 | - "row_count_gt_0" # Standard test
776 | - "unique" # Standard test (table-level, checks all columns)
777 | - name: "unique" # Standard test (table-level, composite on specific columns)
778 | params:
779 | columns: ["customer_id", "order_date"]
780 | - name: "check_minimum_rows" # Generic test with params
781 | params:
782 | min_rows: 100
783 | - "test_customers_email_format" # Singular test from library
784 | ```
785 |
786 | **Test Reference Formats:**
787 |
788 | 1. **Simple string** (standard test, no parameters):
789 | ```yaml
790 | tests:
791 | columns:
792 | id: ["not_null", "unique"]
793 | table:
794 | - "row_count_gt_0"
795 | ```
796 |
797 | 2. **Object with name** (standard test with parameters):
798 | ```yaml
799 | tests:
800 | columns:
801 | status:
802 | - name: "accepted_values"
803 | params:
804 | values: ["active", "inactive"]
805 | ```
806 |
807 | 3. **Object with name** (generic/singular test from library):
808 | ```yaml
809 | tests:
810 | table:
811 | - name: "check_minimum_rows"
812 | params:
813 | min_rows: 100
814 | ```
815 |
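Because test references can be plain strings or objects, tools typically normalize both forms into one canonical shape before execution. The sketch below shows one possible normalization using the defaults described in this section; the canonical dictionary itself is an implementation choice.

```python
# Normalizing the two test-reference forms (plain string vs. object with
# name/params/severity) into one canonical shape. Defaults follow the
# structure above; the canonical dict layout is an implementation choice.
def normalize_test_reference(ref):
    if isinstance(ref, str):
        return {"name": ref, "params": {}, "severity": "error"}
    return {
        "name": ref["name"],
        "params": ref.get("params", {}),
        "severity": ref.get("severity", "error"),
    }


refs = [
    "not_null",
    {"name": "accepted_values", "params": {"values": ["active", "inactive"]}},
    {"name": "unique", "severity": "warning"},
]
print([normalize_test_reference(r) for r in refs])
```
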
816 | ### Test Execution Model
817 |
818 | Tests follow the dbt execution model:
819 | - **0 rows returned** = test passes
820 | - **1+ rows returned** = test fails
821 |
822 | For standard tests, tools generate SQL queries that return violating rows. For SQL tests (generic and singular), the SQL query itself returns rows when violations are found.
823 |
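For illustration, the sketch below generates violation queries for a few standard tests and applies the 0-rows-pass rule. The generated SQL is one possible implementation; only the pass/fail semantics come from the specification.

```python
# Illustrative violation-query generation for a few standard tests, plus the
# 0-rows-pass / 1+-rows-fail rule. The generated SQL is an implementation
# detail; only the pass/fail semantics come from the spec.
def standard_test_sql(test_name, table, column=None, params=None):
    params = params or {}
    if test_name == "not_null":
        return f"SELECT * FROM {table} WHERE {column} IS NULL"
    if test_name == "unique":
        cols = ", ".join(params.get("columns", [column]))
        return (f"SELECT {cols}, COUNT(*) AS n FROM {table} "
                f"GROUP BY {cols} HAVING COUNT(*) > 1")
    if test_name == "row_count_gt_0":
        return f"SELECT 1 FROM {table} HAVING COUNT(*) = 0"
    raise ValueError(f"Unknown standard test: {test_name}")


def evaluate(rows_returned: int) -> str:
    return "pass" if rows_returned == 0 else "fail"


print(standard_test_sql("not_null", "warehouse.analytics.customers", "id"))
print(evaluate(0), evaluate(3))
```
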
824 | **Test Severity:**
825 | - Tests can have a `severity` level: `"error"` (default) or `"warning"`
826 | - `error`: Test failure stops execution and fails the build
827 | - `warning`: Test failure is logged but doesn't stop execution
828 |
829 | **Severity in Test References:**
830 | ```yaml
831 | tests:
832 | columns:
833 | id:
834 | - name: "not_null"
835 | severity: "error" # Default, can be omitted
836 | - name: "unique"
837 | severity: "warning" # Non-blocking
838 | table:
839 | - name: "row_count_gt_0"
840 | severity: "error" # Default, can be omitted
841 | ```
842 |
843 | ### Inline Test Definitions in OTS Modules
844 |
845 | Test Definitions (generic and singular SQL tests) can also be defined directly within an OTS Module, using the same structure as test libraries. This is useful for module-specific Test Definitions that don't need to be shared across modules.
846 |
847 | **Module Structure with Inline Tests:**
848 | ```yaml
849 | ots_version: "0.2.0"
850 | module_name: "analytics_customers"
851 |
852 | # Optional: Inline test definitions (same structure as test library)
853 | generic_tests:
854 | check_recent_data:
855 | type: "sql"
856 | level: "table"
857 | description: "Ensures table has recent data"
858 | sql: |
859 | SELECT 1 as violation
860 | FROM @table_name
861 | WHERE updated_at < CURRENT_DATE - INTERVAL '@days:7' DAY
862 | parameters:
863 | days:
864 | type: "number"
865 | default: 7
866 |
867 | singular_tests:
868 | test_customers_specific:
869 | type: "sql"
870 | level: "table"
871 | description: "Module-specific test"
872 | sql: |
873 | SELECT id FROM analytics.customers WHERE status = 'invalid'
874 | target_transformation: "analytics.customers"
875 |
876 | target:
877 | database: "warehouse"
878 | schema: "analytics"
879 |
880 | transformations:
881 | - transformation_id: "analytics.customers"
882 | tests:
883 | table:
884 | - name: "check_recent_data" # References inline generic test
885 | params:
886 | days: 3
887 | - "test_customers_specific" # References inline singular test
888 | ```
889 |
890 | **Test Resolution Priority:**
891 | When resolving test names, tools should check in the following order:
892 | 1. **Standard tests** (built into OTS specification)
893 | 2. **Inline tests** (defined in the current OTS Module)
894 | 3. **Test library tests** (from referenced test library)
895 |
896 | If a test name exists in multiple locations, the first match takes precedence. This allows modules to override test library tests with module-specific implementations.
897 |
898 | ### Test Library Resolution
899 |
900 | When a transformation module references a test library:
901 | 1. The tool resolves the `test_library_path` (relative to the module file or absolute path)
902 | 2. Loads the test library file (YAML or JSON format)
903 | 3. Validates test definitions
904 | 4. Makes tests available for reference in transformations (after inline tests)
905 |
906 | **Test Discovery:**
907 | - **Standard tests**: Always available, no discovery needed
908 | - **Generic tests**: Discovered from test library or inline module definitions
909 | - **Singular tests**: Discovered from test library or inline module definitions. The `target_transformation` field helps tools validate that the test is applied to the correct transformation.
910 |
911 | If a referenced test is not found among the standard tests, inline tests, or the test library, tools must raise an error.
912 |
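The sketch below outlines this flow: it resolves `test_library_path` relative to the module file, loads the library with PyYAML (an assumed dependency), and looks names up in the standard, inline, library order. The `STANDARD_TESTS` set stands in for the tests built into a compliant tool; it is not defined by the specification in code form.

```python
# Sketch of test-library resolution and the standard -> inline -> library
# lookup order. Assumes PyYAML; STANDARD_TESTS is a stand-in for the tests
# built into an OTS-compliant tool.
from pathlib import Path

import yaml

STANDARD_TESTS = {"not_null", "unique", "accepted_values", "relationships", "row_count_gt_0"}


def load_test_library(module_path: str, module: dict) -> dict:
    library_path = module.get("test_library_path")
    if not library_path:
        return {}
    resolved = (Path(module_path).parent / library_path).resolve()
    return yaml.safe_load(resolved.read_text(encoding="utf-8"))


def resolve_test(name: str, module: dict, library: dict):
    """Return (scope, definition) using the standard -> inline -> library order."""
    if name in STANDARD_TESTS:
        return "standard", name
    for scope, source in (("inline", module), ("library", library)):
        for kind in ("generic_tests", "singular_tests"):
            if name in source.get(kind, {}):
                return scope, source[kind][name]
    raise ValueError(f"Test '{name}' not found among standard, inline, or library tests")
```
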
913 | ## User-Defined Functions (UDFs)
914 |
915 | ### Overview
916 |
917 | User-Defined Functions (UDFs) are custom functions that can be called within SQL transformations. OTS v0.2.0 adds support for defining UDFs as **UDF Definitions** within OTS Modules and tracking UDF dependencies in transformations, enabling proper dependency graph building and execution order determination.
918 |
919 | ### Function Dependencies
920 |
921 | When a transformation calls a user-defined function, the function name should be listed in the `source_functions` array. This allows tools to:
922 |
923 | - Build accurate dependency graphs that include function dependencies
924 | - Determine correct execution order (functions must be created before transformations that use them)
925 | - Validate that all required functions exist before executing transformations
926 | - Support function-to-function dependencies (functions calling other functions)
927 |
928 | ### Function Naming
929 |
930 | Function names in `source_functions` should follow these conventions:
931 | - **Fully qualified names** (preferred): `schema.function_name` (e.g., `analytics.calculate_percentage`)
932 | - **Unqualified names**: `function_name` (when the function is resolved by the database's search path)
933 |
934 | Tools should resolve unqualified function names to fully qualified names when building dependency graphs.
935 |
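A minimal sketch of such resolution, assuming the module's `target.schema` is used as the default namespace for unqualified names (the specific resolution rule is left to tools):

```python
# Minimal sketch of qualifying function names. The rule below (fall back to the
# module's target schema for unqualified names) is an assumption; the spec only
# says tools should resolve unqualified names to fully qualified ones.
def qualify_function_name(name: str, default_schema: str) -> str:
    return name if "." in name else f"{default_schema}.{name}"


print(qualify_function_name("calculate_percentage", "analytics"))        # analytics.calculate_percentage
print(qualify_function_name("analytics.format_currency", "analytics"))   # unchanged
```
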
936 | ### Example: Transformation Using a Function
937 |
938 | ```yaml
939 | ots_version: "0.2.0"
940 | transformation_id: "analytics.order_summary"
941 | description: "Order summary with calculated metrics"
942 |
943 | transformation_type: "sql"
944 | code:
945 | sql:
946 | original_sql: |
947 | SELECT
948 | order_id,
949 | customer_id,
950 | analytics.calculate_percentage(discount_amount, total_amount) as discount_pct,
951 | analytics.format_currency(total_amount) as formatted_total
952 | FROM source.orders
953 | resolved_sql: |
954 | SELECT
955 | order_id,
956 | customer_id,
957 | analytics.calculate_percentage(discount_amount, total_amount) as discount_pct,
958 | analytics.format_currency(total_amount) as formatted_total
959 | FROM warehouse.source.orders
960 | source_tables: ["source.orders"]
961 | source_functions: ["analytics.calculate_percentage", "analytics.format_currency"]
962 |
963 | schema:
964 | columns:
965 | - name: "order_id"
966 | datatype: "number"
967 | - name: "customer_id"
968 | datatype: "number"
969 | - name: "discount_pct"
970 | datatype: "number"
971 | description: "Discount percentage calculated using UDF"
972 | - name: "formatted_total"
973 | datatype: "string"
974 | description: "Formatted currency using UDF"
975 |
976 | materialization:
977 | type: "table"
978 | ```
979 |
980 | In this example, the transformation depends on two user-defined functions:
981 | - `analytics.calculate_percentage`: Calculates percentage values
982 | - `analytics.format_currency`: Formats numeric values as currency strings
983 |
984 | These dependencies are tracked in `source_functions`, allowing the dependency graph to ensure these functions are created before the transformation executes.
985 |
986 | ### Function Execution Order
987 |
988 | Functions are executed in dependency order, before transformations:
989 | 1. **Seeds** are loaded first (if any)
990 | 2. **Functions** are created in dependency order (functions that depend on other functions are created after their dependencies)
991 | 3. **Transformations** are executed (can use functions created in step 2)
992 |
993 | Function-to-function dependencies are resolved automatically based on the `dependencies.functions` array in each function definition.
994 |
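For illustration, the sketch below derives an execution order for functions and transformations with a topological sort over `dependencies.functions` and `source_functions`. It ignores seeds and table-level dependencies to stay short, and the sorting approach is an implementation choice, not a requirement of the specification.

```python
# Sketch of deriving an execution order: functions first (in dependency order),
# then the transformations that call them. Input shapes mirror the module
# structure above; seeds and table dependencies are omitted for brevity.
from graphlib import TopologicalSorter


def execution_order(module: dict) -> list:
    graph = {}
    for fn in module.get("functions", []):
        graph[fn["function_id"]] = set(fn.get("dependencies", {}).get("functions", []))
    for tr in module.get("transformations", []):
        deps = tr.get("code", {}).get("sql", {}).get("source_functions", [])
        graph[tr["transformation_id"]] = set(deps)
    return list(TopologicalSorter(graph).static_order())


module = {
    "functions": [
        {"function_id": "analytics.calculate_percentage", "dependencies": {"functions": []}},
    ],
    "transformations": [
        {"transformation_id": "analytics.order_summary",
         "code": {"sql": {"source_functions": ["analytics.calculate_percentage"]}}},
    ],
}
print(execution_order(module))
# ['analytics.calculate_percentage', 'analytics.order_summary']
```
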
995 | ### Function Overloading
996 |
997 | Some databases (e.g., Snowflake, DuckDB, PostgreSQL) support function overloading: multiple functions with the same name but different parameter signatures. OTS 0.2.0 supports this by:
998 |
999 | - **Function identification**: Functions are identified by their fully qualified name (`schema.function_name`) and parameter signature
1000 | - **Signature matching**: When a function is called, the database matches the call to the appropriate overload based on parameter types
1001 | - **Dependency tracking**: Each overloaded function is tracked separately in the `functions` array with its unique signature
1002 |
1003 | Tools implementing OTS should handle function overloading according to the target database's capabilities and requirements.
1004 |
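For illustration, one way a tool might keep overloads distinct is to key each function by its name plus its parameter-type signature. The tuple key below is an example; the specification does not prescribe a key format.

```python
# Illustrative sketch: tracking overloaded functions by (name, parameter types),
# so two overloads of the same name stay distinct in a dependency graph.
# The tuple key is an example; the spec does not prescribe a key format.
def function_signature(function_definition: dict) -> tuple:
    parameter_types = tuple(
        p["type"].upper() for p in function_definition.get("parameters", [])
    )
    return (function_definition["function_id"], parameter_types)


overloads = [
    {"function_id": "analytics.calculate_percentage",
     "parameters": [{"type": "DOUBLE"}, {"type": "DOUBLE"}]},
    {"function_id": "analytics.calculate_percentage",
     "parameters": [{"type": "INTEGER"}, {"type": "INTEGER"}]},
]
print({function_signature(f) for f in overloads})  # two distinct entries
```
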
1005 | ### Example: OTS Module with Functions
1006 |
1007 | The following example shows a complete OTS Module that includes both transformations and UDF Definitions:
1008 |
1009 | ```yaml
1010 | ots_version: "0.2.0"
1011 | module_name: "analytics_calculations"
1012 | module_description: "Analytics module with custom calculation functions"
1013 |
1014 | target:
1015 | database: "warehouse"
1016 | schema: "analytics"
1017 | sql_dialect: "postgres"
1018 |
1019 | transformations:
1020 | - transformation_id: "analytics.order_summary"
1021 | description: "Order summary with calculated metrics"
1022 | transformation_type: "sql"
1023 | code:
1024 | sql:
1025 | original_sql: |
1026 | SELECT
1027 | order_id,
1028 | customer_id,
1029 | analytics.calculate_percentage(discount_amount, total_amount) as discount_pct,
1030 | analytics.format_currency(total_amount) as formatted_total
1031 | FROM source.orders
1032 | resolved_sql: |
1033 | SELECT
1034 | order_id,
1035 | customer_id,
1036 | analytics.calculate_percentage(discount_amount, total_amount) as discount_pct,
1037 | analytics.format_currency(total_amount) as formatted_total
1038 | FROM warehouse.source.orders
1039 | source_tables: ["source.orders"]
1040 | source_functions: ["analytics.calculate_percentage", "analytics.format_currency"]
1041 | materialization:
1042 | type: "table"
1043 |
1044 | functions:
1045 | - function_id: "analytics.calculate_percentage"
1046 | description: "Calculates the percentage of a numerator over a denominator"
1047 | function_type: "scalar"
1048 | language: "sql"
1049 | parameters:
1050 | - name: "numerator"
1051 | type: "DOUBLE"
1052 | description: "The numerator value"
1053 | - name: "denominator"
1054 | type: "DOUBLE"
1055 | description: "The denominator value"
1056 | return_type: "DOUBLE"
1057 | deterministic: true
1058 | code:
1059 | generic_sql: |
1060 | CREATE OR REPLACE FUNCTION calculate_percentage(
1061 | numerator DOUBLE,
1062 | denominator DOUBLE
1063 | ) RETURNS DOUBLE AS $$
1064 | SELECT
1065 | CASE
1066 | WHEN denominator = 0 OR denominator IS NULL THEN NULL
1067 | ELSE (numerator / denominator) * 100.0
1068 | END
1069 | $$;
1070 | database_specific: {}
1071 | dependencies:
1072 | tables: []
1073 | functions: []
1074 | metadata:
1075 | file_path: "/functions/analytics/calculate_percentage.sql"
1076 | tags: ["math", "utility"]
1077 | object_tags:
1078 | category: "calculation"
1079 | complexity: "simple"
1080 |
1081 | - function_id: "analytics.format_currency"
1082 | description: "Formats a numeric value as currency string"
1083 | function_type: "scalar"
1084 | language: "sql"
1085 | parameters:
1086 | - name: "amount"
1087 | type: "DOUBLE"
1088 | description: "The amount to format"
1089 | return_type: "VARCHAR"
1090 | code:
1091 | generic_sql: |
1092 | CREATE OR REPLACE FUNCTION format_currency(amount DOUBLE)
1093 | RETURNS VARCHAR AS $$
1094 | SELECT '$' || TO_CHAR(amount, 'FM999,999,999.00')
1095 | $$;
1096 | database_specific: {}
1097 | dependencies:
1098 | tables: []
1099 | functions: []
1100 | metadata:
1101 | file_path: "/functions/analytics/format_currency.sql"
1102 | tags: ["formatting", "utility"]
1103 | ```
1104 |
1105 | In this example:
1106 | - The module defines two functions: `analytics.calculate_percentage` and `analytics.format_currency`
1107 | - The transformation `analytics.order_summary` uses both functions and lists them in `source_functions`
1108 | - Functions are defined with their complete structure including parameters, return types, code, and metadata
1109 | - The `functions` array is at the same level as `transformations`, maintaining consistency in the module structure
1110 |
1111 | ## Complete Examples: Incremental Strategies
1112 |
1113 | ### Delete-Insert Example
1114 |
1115 |
1116 | #### YAML Format
1117 |
1118 | ```yaml
1119 | ots_version: "0.2.0"
1120 | transformation_id: "analytics.recent_orders"
1121 | description: "Orders updated in the last 7 days"
1122 |
1123 | transformation_type: "sql"
1124 | code:
1125 | sql:
1126 | original_sql: "SELECT order_id, customer_id, order_date, amount, status FROM source.orders WHERE updated_at >= '@start_date'"
1127 | resolved_sql: "SELECT order_id, customer_id, order_date, amount, status FROM warehouse.source.orders WHERE updated_at >= '@start_date'"
1128 | source_tables: ["source.orders"]
1129 | source_functions: []
1130 |
1131 | schema:
1132 | columns:
1133 | - name: "order_id"
1134 | datatype: "number"
1135 | description: "Unique order identifier"
1136 | - name: "customer_id"
1137 | datatype: "number"
1138 | description: "Customer ID"
1139 | - name: "order_date"
1140 | datatype: "date"
1141 | description: "Order date"
1142 | - name: "amount"
1143 | datatype: "number"
1144 | description: "Order amount"
1145 | - name: "status"
1146 | datatype: "string"
1147 | description: "Order status"
1148 | partitioning: ["order_date"]
1149 | indexes:
1150 | - name: "idx_order_id"
1151 | columns: ["order_id"]
1152 |
1153 | materialization:
1154 | type: "incremental"
1155 | incremental_details:
1156 | strategy: "delete_insert"
1157 | delete_condition: "to_date(updated_at) = '@start_date'"
1158 | filter_condition: "to_date(updated_at) = '@start_date'"
1159 |
1160 | tests:
1161 | columns:
1162 | order_id: ["not_null", "unique"]
1163 | order_date: ["not_null"]
1164 | table: ["row_count_gt_0"]
1165 |
1166 | metadata:
1167 | file_path: "/models/analytics/recent_orders.sql"
1168 | owner: "analytics-team"
1169 | tags: ["orders", "incremental"]
1170 | ```
1171 |
1172 |
1173 |
1174 |
1175 | #### JSON Format
1176 |
1177 | ```json
1178 | {
1179 | "ots_version": "0.2.0",
1180 | "transformation_id": "analytics.recent_orders",
1181 | "description": "Orders updated in the last 7 days",
1182 |
1183 | "transformation_type": "sql",
1184 | "code": {
1185 | "sql": {
1186 | "original_sql": "SELECT order_id, customer_id, order_date, amount, status FROM source.orders WHERE updated_at >= '@start_date'",
1187 | "resolved_sql": "SELECT order_id, customer_id, order_date, amount, status FROM warehouse.source.orders WHERE updated_at >= '@start_date'",
1188 | "source_tables": ["source.orders"],
1189 | "source_functions": []
1190 | }
1191 | },
1192 |
1193 | "schema": {
1194 | "columns": [
1195 | {
1196 | "name": "order_id",
1197 | "datatype": "number",
1198 | "description": "Unique order identifier"
1199 | },
1200 | {
1201 | "name": "customer_id",
1202 | "datatype": "number",
1203 | "description": "Customer ID"
1204 | },
1205 | {
1206 | "name": "order_date",
1207 | "datatype": "date",
1208 | "description": "Order date"
1209 | },
1210 | {
1211 | "name": "amount",
1212 | "datatype": "number",
1213 | "description": "Order amount"
1214 | },
1215 | {
1216 | "name": "status",
1217 | "datatype": "string",
1218 | "description": "Order status"
1219 | }
1220 | ],
1221 | "partitioning": ["order_date"],
1222 | "indexes": [
1223 | {
1224 | "name": "idx_order_id",
1225 | "columns": ["order_id"]
1226 | }
1227 | ]
1228 | },
1229 |
1230 | "materialization": {
1231 | "type": "incremental",
1232 | "incremental_details": {
1233 | "strategy": "delete_insert",
1234 | "delete_condition": "to_date(updated_at) = '@start_date'",
1235 | "filter_condition": "to_date(updated_at) = '@start_date'"
1236 | }
1237 | },
1238 |
1239 | "tests": {
1240 | "columns": {
1241 | "order_id": ["not_null", "unique"],
1242 | "order_date": ["not_null"]
1243 | },
1244 | "table": ["row_count_gt_0"]
1245 | },
1246 |
1247 | "metadata": {
1248 | "file_path": "/models/analytics/recent_orders.sql",
1249 | "owner": "analytics-team",
1250 | "tags": ["orders", "incremental"]
1251 | }
1252 | }
1253 | ```
1254 |
1255 |
1256 |
1257 | ### Append Example
1258 |
1259 |
1260 | #### YAML Format
1261 |
1262 | ```yaml
1263 | ots_version: "0.2.0"
1264 | transformation_id: "logs.event_stream"
1265 | description: "Append-only event log"
1266 |
1267 | transformation_type: "sql"
1268 | code:
1269 | sql:
1270 | original_sql: "SELECT event_id, timestamp, user_id, event_type, payload FROM source.events WHERE timestamp >= '@start_date'"
1271 | resolved_sql: "SELECT event_id, timestamp, user_id, event_type, payload FROM warehouse.source.events WHERE timestamp >= '@start_date'"
1272 | source_tables: ["source.events"]
1273 | source_functions: []
1274 |
1275 | schema:
1276 | columns:
1277 | - name: "event_id"
1278 | datatype: "string"
1279 | description: "Unique event identifier"
1280 | - name: "timestamp"
1281 | datatype: "date"
1282 | description: "Event timestamp"
1283 | - name: "user_id"
1284 | datatype: "string"
1285 | description: "User who triggered the event"
1286 | - name: "event_type"
1287 | datatype: "string"
1288 | description: "Type of event"
1289 | - name: "payload"
1290 | datatype: "object"
1291 | description: "Event payload data"
1292 | partitioning: ["timestamp"]
1293 | indexes:
1294 | - name: "idx_timestamp"
1295 | columns: ["timestamp"]
1296 | - name: "idx_user_id"
1297 | columns: ["user_id"]
1298 |
1299 | materialization:
1300 | type: "incremental"
1301 | incremental_details:
1302 | strategy: "append"
1303 | filter_condition: "timestamp >= '@start_date'"
1304 |
1305 | tests:
1306 | columns:
1307 | event_id: ["not_null", "unique"]
1308 | timestamp: ["not_null"]
1309 | table: ["row_count_gt_0"]
1310 |
1311 | metadata:
1312 | file_path: "/models/logs/event_stream.sql"
1313 | owner: "data-engineering"
1314 | tags: ["events", "append-only"]
1315 | ```
1316 |
1317 |
1318 |
1319 |
1320 | #### JSON Format
1321 |
1322 | ```json
1323 | {
1324 | "ots_version": "0.2.0",
1325 | "transformation_id": "logs.event_stream",
1326 | "description": "Append-only event log",
1327 |
1328 | "transformation_type": "sql",
1329 | "code": {
1330 | "sql": {
1331 | "original_sql": "SELECT event_id, timestamp, user_id, event_type, payload FROM source.events WHERE timestamp >= '@start_date'",
1332 | "resolved_sql": "SELECT event_id, timestamp, user_id, event_type, payload FROM warehouse.source.events WHERE timestamp >= '@start_date'",
1333 | "source_tables": ["source.events"],
1334 | "source_functions": []
1335 | }
1336 | },
1337 |
1338 | "schema": {
1339 | "columns": [
1340 | {
1341 | "name": "event_id",
1342 | "datatype": "string",
1343 | "description": "Unique event identifier"
1344 | },
1345 | {
1346 | "name": "timestamp",
1347 | "datatype": "date",
1348 | "description": "Event timestamp"
1349 | },
1350 | {
1351 | "name": "user_id",
1352 | "datatype": "string",
1353 | "description": "User who triggered the event"
1354 | },
1355 | {
1356 | "name": "event_type",
1357 | "datatype": "string",
1358 | "description": "Type of event"
1359 | },
1360 | {
1361 | "name": "payload",
1362 | "datatype": "object",
1363 | "description": "Event payload data"
1364 | }
1365 | ],
1366 | "partitioning": ["timestamp"],
1367 | "indexes": [
1368 | {
1369 | "name": "idx_timestamp",
1370 | "columns": ["timestamp"]
1371 | },
1372 | {
1373 | "name": "idx_user_id",
1374 | "columns": ["user_id"]
1375 | }
1376 | ]
1377 | },
1378 |
1379 | "materialization": {
1380 | "type": "incremental",
1381 | "incremental_details": {
1382 | "strategy": "append",
1383 | "filter_condition": "timestamp >= '@start_date'"
1384 | }
1385 | },
1386 |
1387 | "tests": {
1388 | "columns": {
1389 | "event_id": ["not_null", "unique"],
1390 | "timestamp": ["not_null"]
1391 | },
1392 | "table": ["row_count_gt_0"]
1393 | },
1394 |
1395 | "metadata": {
1396 | "file_path": "/models/logs/event_stream.sql",
1397 | "owner": "data-engineering",
1398 | "tags": ["events", "append-only"]
1399 | }
1400 | }
1401 | ```
1402 |
1403 |
1404 |
1405 | ### Merge Example
1406 |
1407 |
1408 | #### YAML Format
1409 |
1410 | ```yaml
1411 | ots_version: "0.2.0"
1412 | transformation_id: "product.master_data"
1413 | description: "Customer master data with upsert logic"
1414 |
1415 | transformation_type: "sql"
1416 | code:
1417 | sql:
1418 | original_sql: "SELECT customer_id, name, email, phone, updated_at FROM source.customers WHERE updated_at >= '@start_date'"
1419 | resolved_sql: "SELECT customer_id, name, email, phone, updated_at FROM warehouse.source.customers WHERE updated_at >= '@start_date'"
1420 | source_tables: ["source.customers"]
1421 | source_functions: []
1422 |
1423 | schema:
1424 | columns:
1425 | - name: "customer_id"
1426 | datatype: "number"
1427 | description: "Unique customer identifier"
1428 | - name: "name"
1429 | datatype: "string"
1430 | description: "Customer name"
1431 | - name: "email"
1432 | datatype: "string"
1433 | description: "Customer email"
1434 | - name: "phone"
1435 | datatype: "string"
1436 | description: "Customer phone number"
1437 | - name: "updated_at"
1438 | datatype: "date"
1439 | description: "Last update timestamp"
1440 | partitioning: []
1441 | indexes:
1442 | - name: "idx_customer_id"
1443 | columns: ["customer_id"]
1444 | - name: "idx_email"
1445 | columns: ["email"]
1446 |
1447 | materialization:
1448 | type: "incremental"
1449 | incremental_details:
1450 | strategy: "merge"
1451 | filter_condition: "updated_at >= '@start_date'"
1452 | merge_key: ["customer_id"]
1453 | update_columns: ["name", "email", "phone", "updated_at"]
1454 |
1455 | tests:
1456 | columns:
1457 | customer_id: ["not_null", "unique"]
1458 | email: ["not_null"]
1459 | table: ["row_count_gt_0", "unique"] # unique at table level checks all columns for row uniqueness
1460 |
1461 | metadata:
1462 | file_path: "/models/product/master_data.sql"
1463 | owner: "product-team"
1464 | tags: ["customers", "master-data"]
1465 | ```
1466 |
1467 |
1468 |
1469 |
1470 | #### JSON Format
1471 |
1472 | ```json
1473 | {
1474 | "ots_version": "0.2.0",
1475 | "transformation_id": "product.master_data",
1476 | "description": "Customer master data with upsert logic",
1477 |
1478 | "transformation_type": "sql",
1479 | "code": {
1480 | "sql": {
1481 | "original_sql": "SELECT customer_id, name, email, phone, updated_at FROM source.customers WHERE updated_at >= '@start_date'",
1482 | "resolved_sql": "SELECT customer_id, name, email, phone, updated_at FROM warehouse.source.customers WHERE updated_at >= '@start_date'",
1483 | "source_tables": ["source.customers"],
1484 | "source_functions": []
1485 | }
1486 | },
1487 |
1488 | "schema": {
1489 | "columns": [
1490 | {
1491 | "name": "customer_id",
1492 | "datatype": "number",
1493 | "description": "Unique customer identifier"
1494 | },
1495 | {
1496 | "name": "name",
1497 | "datatype": "string",
1498 | "description": "Customer name"
1499 | },
1500 | {
1501 | "name": "email",
1502 | "datatype": "string",
1503 | "description": "Customer email"
1504 | },
1505 | {
1506 | "name": "phone",
1507 | "datatype": "string",
1508 | "description": "Customer phone number"
1509 | },
1510 | {
1511 | "name": "updated_at",
1512 | "datatype": "date",
1513 | "description": "Last update timestamp"
1514 | }
1515 | ],
1516 | "partitioning": [],
1517 | "indexes": [
1518 | {
1519 | "name": "idx_customer_id",
1520 | "columns": ["customer_id"]
1521 | },
1522 | {
1523 | "name": "idx_email",
1524 | "columns": ["email"]
1525 | }
1526 | ]
1527 | },
1528 |
1529 | "materialization": {
1530 | "type": "incremental",
1531 | "incremental_details": {
1532 | "strategy": "merge",
1533 | "filter_condition": "updated_at >= '@start_date'",
1534 | "merge_key": ["customer_id"],
1535 | "update_columns": ["name", "email", "phone", "updated_at"]
1536 | }
1537 | },
1538 |
1539 | "tests": {
1540 | "columns": {
1541 | "customer_id": ["not_null", "unique"],
1542 | "email": ["not_null"]
1543 | },
1544 | "table": ["row_count_gt_0", "unique"]
1545 | },
1546 |
1547 | "metadata": {
1548 | "file_path": "/models/product/master_data.sql",
1549 | "owner": "product-team",
1550 | "tags": ["customers", "master-data"]
1551 | }
1552 | }
1553 | ```
1554 |
1555 |
1556 |
1557 | ### SCD2 Example
1558 |
1559 |
1560 | #### YAML Format
1561 |
1562 | ```yaml
1563 | ots_version: "0.2.0"
1564 | transformation_id: "dim.products_scd2"
1565 | description: "Product dimension with full history tracking"
1566 |
1567 | transformation_type: "sql"
1568 | code:
1569 | sql:
1570 | original_sql: "SELECT product_id, product_name, price, category, updated_at FROM source.products WHERE updated_at >= '@start_date'"
1571 | resolved_sql: "SELECT product_id, product_name, price, category, updated_at FROM warehouse.source.products WHERE updated_at >= '@start_date'"
1572 | source_tables: ["source.products"]
1573 | source_functions: []
1574 |
1575 | schema:
1576 | columns:
1577 | - name: "product_id"
1578 | datatype: "number"
1579 | description: "Unique product identifier"
1580 | - name: "product_name"
1581 | datatype: "string"
1582 | description: "Product name"
1583 | - name: "price"
1584 | datatype: "number"
1585 | description: "Product price"
1586 | - name: "category"
1587 | datatype: "string"
1588 | description: "Product category"
1589 | - name: "updated_at"
1590 | datatype: "date"
1591 | description: "Last update timestamp"
1592 | - name: "valid_from"
1593 | datatype: "date"
1594 | description: "Record validity start date"
1595 | - name: "valid_to"
1596 | datatype: "date"
1597 | description: "Record validity end date"
1598 | partitioning: []
1599 | indexes:
1600 | - name: "idx_product_id"
1601 | columns: ["product_id"]
1602 | - name: "idx_valid_from"
1603 | columns: ["valid_from"]
1604 |
1605 | materialization:
1606 | type: "scd2"
1607 | scd2_details:
1608 | unique_key: ["product_id"]
1609 | start_column: "valid_from"
1610 | end_column: "valid_to"
1611 |
1612 | tests:
1613 | columns:
1614 | product_id: ["not_null", "unique"]
1615 | valid_from: ["not_null"]
1616 | table: ["row_count_gt_0"]
1617 |
1618 | metadata:
1619 | file_path: "/models/dim/products_scd2.sql"
1620 | owner: "data-engineering"
1621 | tags: ["products", "scd2", "dimension"]
1622 | ```
1623 |
1624 |
1625 |
1626 |
1627 | #### JSON Format
1628 |
1629 | ```json
1630 | {
1631 | "ots_version": "0.2.0",
1632 | "transformation_id": "dim.products_scd2",
1633 | "description": "Product dimension with full history tracking",
1634 |
1635 | "transformation_type": "sql",
1636 | "code": {
1637 | "sql": {
1638 | "original_sql": "SELECT product_id, product_name, price, category, updated_at FROM source.products WHERE updated_at >= '@start_date'",
1639 | "resolved_sql": "SELECT product_id, product_name, price, category, updated_at FROM warehouse.source.products WHERE updated_at >= '@start_date'",
1640 | "source_tables": ["source.products"],
1641 | "source_functions": []
1642 | }
1643 | },
1644 |
1645 | "schema": {
1646 | "columns": [
1647 | {
1648 | "name": "product_id",
1649 | "datatype": "number",
1650 | "description": "Unique product identifier"
1651 | },
1652 | {
1653 | "name": "product_name",
1654 | "datatype": "string",
1655 | "description": "Product name"
1656 | },
1657 | {
1658 | "name": "price",
1659 | "datatype": "number",
1660 | "description": "Product price"
1661 | },
1662 | {
1663 | "name": "category",
1664 | "datatype": "string",
1665 | "description": "Product category"
1666 | },
1667 | {
1668 | "name": "updated_at",
1669 | "datatype": "date",
1670 | "description": "Last update timestamp"
1671 | },
1672 | {
1673 | "name": "valid_from",
1674 | "datatype": "date",
1675 | "description": "Record validity start date"
1676 | },
1677 | {
1678 | "name": "valid_to",
1679 | "datatype": "date",
1680 | "description": "Record validity end date"
1681 | }
1682 | ],
1683 | "partitioning": [],
1684 | "indexes": [
1685 | {
1686 | "name": "idx_product_id",
1687 | "columns": ["product_id"]
1688 | },
1689 | {
1690 | "name": "idx_valid_from",
1691 | "columns": ["valid_from"]
1692 | }
1693 | ]
1694 | },
1695 |
1696 | "materialization": {
1697 | "type": "scd2",
1698 | "scd2_details": {
1699 | "unique_key": ["product_id"],
1700 | "start_column": "valid_from",
1701 | "end_column": "valid_to"
1702 | }
1703 | },
1704 |
1705 | "tests": {
1706 | "columns": {
1707 | "product_id": ["not_null", "unique"],
1708 | "valid_from": ["not_null"]
1709 | },
1710 | "table": ["row_count_gt_0"]
1711 | },
1712 |
1713 | "metadata": {
1714 | "file_path": "/models/dim/products_scd2.sql",
1715 | "owner": "data-engineering",
1716 | "tags": ["products", "scd2", "dimension"]
1717 | }
1718 | }
1719 | ```
1720 |
1721 |
1722 |
1723 | ## Complete Example: Test Library and Module
1724 |
1725 | This example demonstrates a complete setup with a test library and a transformation module that uses both standard and custom tests.
1726 |
1727 | ### Test Library Example
1728 |
1729 |
1730 | #### YAML Format
1731 |
1732 | ```yaml
1733 | # tests/test_library.yaml
1734 | ots_version: "0.2.0"
1735 | test_library_version: "1.0"
1736 | description: "Shared data quality tests for analytics project"
1737 |
1738 | generic_tests:
1739 | check_minimum_rows:
1740 | type: "sql"
1741 | level: "table"
1742 | description: "Ensures table has minimum number of rows"
1743 | sql: |
1744 | SELECT 1 as violation
1745 | FROM @table_name
1746 | GROUP BY 1
1747 | HAVING COUNT(*) < @min_rows:10
1748 | parameters:
1749 | min_rows:
1750 | type: "number"
1751 | default: 10
1752 | description: "Minimum number of rows required"
1753 |
1754 | column_not_negative:
1755 | type: "sql"
1756 | level: "column"
1757 | description: "Ensures numeric column has no negative values"
1758 | sql: |
1759 | SELECT @column_name
1760 | FROM @table_name
1761 | WHERE @column_name < 0
1762 |     parameters: {}
1763 |
1764 | singular_tests:
1765 | test_customers_email_format:
1766 | type: "sql"
1767 | level: "table"
1768 | description: "Validates email format for customers table"
1769 | sql: |
1770 | SELECT id, email
1771 | FROM analytics.customers
1772 | WHERE email NOT LIKE '%@%.%'
1773 | target_transformation: "analytics.customers"
1774 | ```
1775 |
1776 |
1777 |
1778 |
1779 | #### JSON Format
1780 |
1781 | ```json
1782 | {
1783 | "ots_version": "0.2.0",
1784 | "test_library_version": "1.0",
1785 | "description": "Shared data quality tests for analytics project",
1786 | "generic_tests": {
1787 | "check_minimum_rows": {
1788 | "type": "sql",
1789 | "level": "table",
1790 | "description": "Ensures table has minimum number of rows",
1791 | "sql": "SELECT 1 as violation\nFROM @table_name\nGROUP BY 1\nHAVING COUNT(*) < @min_rows:10",
1792 | "parameters": {
1793 | "min_rows": {
1794 | "type": "number",
1795 | "default": 10,
1796 | "description": "Minimum number of rows required"
1797 | }
1798 | }
1799 | },
1800 | "column_not_negative": {
1801 | "type": "sql",
1802 | "level": "column",
1803 | "description": "Ensures numeric column has no negative values",
1804 | "sql": "SELECT @column_name\nFROM @table_name\nWHERE @column_name < 0",
1805 |       "parameters": {}
1806 | }
1807 | },
1808 | "singular_tests": {
1809 | "test_customers_email_format": {
1810 | "type": "sql",
1811 | "level": "table",
1812 | "description": "Validates email format for customers table",
1813 | "sql": "SELECT id, email\nFROM analytics.customers\nWHERE email NOT LIKE '%@%.%'",
1814 | "target_transformation": "analytics.customers"
1815 | }
1816 | }
1817 | }
1818 | ```
1819 |
1820 |
1821 |
1822 | ### Module Using Test Library
1823 |
1824 |
1825 | #### YAML Format
1826 |
1827 | ```yaml
1828 | ots_version: "0.2.0"
1829 | module_name: "analytics_customers"
1830 | module_description: "Customer analytics transformations"
1831 | test_library_path: "../tests/test_library.yaml"
1832 | tags: ["analytics", "production"]
1833 |
1834 | target:
1835 | database: "warehouse"
1836 | schema: "analytics"
1837 | sql_dialect: "postgres"
1838 |
1839 | transformations:
1840 | - transformation_id: "analytics.customers"
1841 | description: "Customer data table"
1842 | transformation_type: "sql"
1843 |
1844 | code:
1845 | sql:
1846 | original_sql: "SELECT id, name, email, created_at, amount FROM source.customers WHERE active = true"
1847 | resolved_sql: "SELECT id, name, email, created_at, amount FROM warehouse.source.customers WHERE active = true"
1848 | source_tables: ["source.customers"]
1849 | source_functions: []
1850 |
1851 | schema:
1852 | columns:
1853 | - name: "id"
1854 | datatype: "number"
1855 | description: "Unique customer identifier"
1856 | - name: "name"
1857 | datatype: "string"
1858 | description: "Customer name"
1859 | - name: "email"
1860 | datatype: "string"
1861 | description: "Customer email address"
1862 | - name: "created_at"
1863 | datatype: "date"
1864 | description: "Customer creation date"
1865 | - name: "amount"
1866 | datatype: "number"
1867 | description: "Customer account balance"
1868 | partitioning: []
1869 | indexes:
1870 | - name: "idx_customers_id"
1871 | columns: ["id"]
1872 | - name: "idx_customers_email"
1873 | columns: ["email"]
1874 |
1875 | materialization:
1876 | type: "table"
1877 |
1878 | tests:
1879 | columns:
1880 | id:
1881 | - "not_null" # Standard test
1882 | - "unique" # Standard test (column-level)
1883 | email:
1884 | - "not_null"
1885 | - name: "accepted_values" # Standard test with params
1886 | params:
1887 | values: ["gmail.com", "yahoo.com", "company.com"]
1888 | amount:
1889 | - name: "column_not_negative" # Generic test from library
1890 | table:
1891 | - "row_count_gt_0" # Standard test
1892 | - "unique" # Standard test (table-level, checks all columns for row uniqueness)
1893 | - name: "check_minimum_rows" # Generic test with params
1894 | params:
1895 | min_rows: 100
1896 | - "test_customers_email_format" # Singular test from library
1897 |
1898 | metadata:
1899 | file_path: "/models/analytics/customers.sql"
1900 | owner: "data-team"
1901 | tags: ["customer", "core"]
1902 | object_tags:
1903 | sensitivity_tag: "pii"
1904 | classification: "internal"
1905 | ```
1906 |
1907 |
1908 |
1909 |
1910 | #### JSON Format
1911 |
1912 | ```json
1913 | {
1914 | "ots_version": "0.2.0",
1915 | "module_name": "analytics_customers",
1916 | "module_description": "Customer analytics transformations",
1917 | "test_library_path": "../tests/test_library.yaml",
1918 | "tags": ["analytics", "production"],
1919 | "target": {
1920 | "database": "warehouse",
1921 | "schema": "analytics",
1922 | "sql_dialect": "postgres"
1923 | },
1924 | "transformations": [
1925 | {
1926 | "transformation_id": "analytics.customers",
1927 | "description": "Customer data table",
1928 | "transformation_type": "sql",
1929 | "code": {
1930 | "sql": {
1931 | "original_sql": "SELECT id, name, email, created_at, amount FROM source.customers WHERE active = true",
1932 | "resolved_sql": "SELECT id, name, email, created_at, amount FROM warehouse.source.customers WHERE active = true",
1933 | "source_tables": ["source.customers"],
1934 | "source_functions": []
1935 | }
1936 | },
1937 | "schema": {
1938 | "columns": [
1939 | {
1940 | "name": "id",
1941 | "datatype": "number",
1942 | "description": "Unique customer identifier"
1943 | },
1944 | {
1945 | "name": "name",
1946 | "datatype": "string",
1947 | "description": "Customer name"
1948 | },
1949 | {
1950 | "name": "email",
1951 | "datatype": "string",
1952 | "description": "Customer email address"
1953 | },
1954 | {
1955 | "name": "created_at",
1956 | "datatype": "date",
1957 | "description": "Customer creation date"
1958 | },
1959 | {
1960 | "name": "amount",
1961 | "datatype": "number",
1962 | "description": "Customer account balance"
1963 | }
1964 | ],
1965 | "partitioning": [],
1966 | "indexes": [
1967 | {
1968 | "name": "idx_customers_id",
1969 | "columns": ["id"]
1970 | },
1971 | {
1972 | "name": "idx_customers_email",
1973 | "columns": ["email"]
1974 | }
1975 | ]
1976 | },
1977 | "materialization": {
1978 | "type": "table"
1979 | },
1980 | "tests": {
1981 | "columns": {
1982 | "id": ["not_null", "unique"],
1983 | "email": [
1984 | "not_null",
1985 | {
1986 | "name": "accepted_values",
1987 | "params": {
1988 | "values": ["gmail.com", "yahoo.com", "company.com"]
1989 | }
1990 | }
1991 | ],
1992 | "amount": [
1993 | {
1994 | "name": "column_not_negative"
1995 | }
1996 | ]
1997 | },
1998 | "table": [
1999 | "row_count_gt_0",
2000 | "unique",
2001 | {
2002 | "name": "check_minimum_rows",
2003 | "params": {
2004 | "min_rows": 100
2005 | }
2006 | },
2007 | "test_customers_email_format"
2008 | ]
2009 | },
2010 | "metadata": {
2011 | "file_path": "/models/analytics/customers.sql",
2012 | "owner": "data-team",
2013 | "tags": ["customer", "core"],
2014 | "object_tags": {
2015 | "sensitivity_tag": "pii",
2016 | "classification": "internal"
2017 | }
2018 | }
2019 | }
2020 | ]
2021 | }
2022 | ```
2023 |
2024 |
2025 |
--------------------------------------------------------------------------------