├── .gitignore ├── CONTRIBUTING.md ├── README.md ├── versions ├── v0.2.0 │ ├── CHANGELOG.md │ └── transformation-specification.md ├── v0.2.1 │ └── CHANGELOG.md └── v0.1.0 │ └── transformation-specification.md ├── CODE_OF_CONDUCT.md └── LICENSE /.gitignore: -------------------------------------------------------------------------------- 1 | # Draft content - not included in the specification 2 | draft/ 3 | 4 | # Common files 5 | *.pyc 6 | *.pyo 7 | *.pyd 8 | __pycache__/ 9 | .pytest_cache/ 10 | *.egg-info/ 11 | dist/ 12 | build/ 13 | 14 | # IDE 15 | .vscode/ 16 | .idea/ 17 | *.swp 18 | *.swo 19 | *~ 20 | 21 | # OS files 22 | .DS_Store 23 | Thumbs.db 24 | 25 | -------------------------------------------------------------------------------- /CONTRIBUTING.md: -------------------------------------------------------------------------------- 1 | # Contributing to the Open Transformation Specification 2 | 3 | We welcome contributions to the Open Transformation Specification! This document provides guidelines for contributing to the project. 4 | 5 | ## How to Contribute 6 | 7 | We welcome new contributors to the project whether you have changes to suggest, problems to report, or some feedback for us. Please jump to the most relevant section from the list below: 8 | 9 | - Ask a question or offer feedback: use a discussion 10 | - Suggest a change or report a problem: open an issue 11 | - Contribute a change to the repository: open a pull request 12 | - Or just get in touch 13 | 14 | ## Development Process 15 | 16 | The Open Transformation Specification is currently in early development (v0.1.0). The development process is evolving as we establish the foundation for the specification. 17 | 18 | ### Current Focus Areas 19 | 20 | - **Core Specification**: Defining the fundamental structure and concepts 21 | - **Examples**: Creating clear examples of transformation specifications 22 | - **Documentation**: Building comprehensive documentation 23 | - **Community**: Establishing governance and contribution processes 24 | 25 | ## Specification Guidelines 26 | 27 | When contributing to the specification: 28 | 29 | - Use clear, unambiguous language 30 | - Provide examples for complex concepts 31 | - Consider backward compatibility 32 | - Follow the established structure in `versions/v0.1.0/` 33 | 34 | ## Code Standards 35 | 36 | - Use clear, descriptive commit messages 37 | - Follow existing formatting conventions 38 | - Include documentation for new features 39 | - Test any code changes thoroughly 40 | 41 | ## Getting Help 42 | 43 | - Check existing [issues](https://github.com/francescomucio/open-transformation-specification/issues) 44 | - Start a [discussion](https://github.com/francescomucio/open-transformation-specification/discussions) 45 | - Review the [Code of Conduct](CODE_OF_CONDUCT.md) 46 | 47 | ## License 48 | 49 | By contributing to the Open Transformation Specification, you agree that your contributions will be licensed under the Apache 2.0 License. 
50 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # The Open Transformation Specification 2 | 3 | [![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://opensource.org/licenses/Apache-2.0) 4 | 5 | The Open Transformation Specification (OTS) is a community-driven open specification that defines a standard, programming language-agnostic interface description for data transformation pipelines and workflows, including transformations, data quality tests, and user-defined functions (UDFs). 6 | 7 | This specification enables **interoperability** between tools and platforms, allowing data transformations to be shared, understood, and executed across different systems without vendor lock-in. By providing a common standard, OTS shifts the data transformation ecosystem from isolated, proprietary tools to an **open core** where tools can seamlessly work together around a shared specification. 8 | 9 | This specification allows both humans and computers to discover and understand the capabilities of data transformations without requiring access to source code, additional documentation, or inspection of execution logs. When properly defined via OTS, a consumer can understand and interact with data transformation pipelines with a minimal amount of implementation logic. 10 | 11 | ## Versions 12 | 13 | This repository contains the Markdown sources for all published Open Transformation Specification versions. For release notes and release candidate versions, refer to the [releases page](https://github.com/francescomucio/open-transformation-specification/releases). 14 | 15 | - **Current Version**: [v0.2.1](versions/v0.2.1/transformation-specification.md) 16 | - **Previous Versions**: 17 | - [v0.2.0](versions/v0.2.0/transformation-specification.md) 18 | - [v0.1.0](versions/v0.1.0/transformation-specification.md) 19 | 20 | ## Interoperability and Ecosystem 21 | 22 | The Open Transformation Specification enables true **interoperability** in the data transformation space. By defining a common standard, OTS allows: 23 | 24 | - **Cross-tool compatibility**: Transformations defined in one tool can be consumed and executed by any OTS-compliant tool 25 | - **Vendor independence**: Avoid lock-in to proprietary formats and tools 26 | - **Ecosystem growth**: An open core standard enables a thriving ecosystem of compatible tools, libraries, and services 27 | - **Seamless integration**: Tools can work together, sharing transformations, tests, and functions without conversion or manual intervention 28 | 29 | ## Tools and Libraries 30 | 31 | The OTS ecosystem is growing with tools and libraries that implement the specification. These tools demonstrate the **interoperability** benefits of the standard: 32 | 33 | - **[Tee for Transform](https://github.com/francescomucio/tee-for-transform)** - A Python framework for managing SQL data transformations with support for multiple database backends. Fully OTS-compliant with import/export capabilities. 34 | 35 | *More tools and libraries will be documented as they become available. If you're building an OTS-compliant tool, we'd love to feature it here!* 36 | 37 | ## Participation 38 | 39 | The current process for developing the Open Transformation Specification is described in the [Contributing Guidelines](CONTRIBUTING.md). 
40 | 41 | ## Community 42 | 43 | Join our community on Discord to discuss the specification, ask questions, share ideas, and connect with other contributors: [Discord Community](https://discord.gg/gKm6KgACY7) 44 | 45 | ## Licensing 46 | 47 | See: [License](LICENSE) (Apache-2.0) 48 | -------------------------------------------------------------------------------- /versions/v0.2.0/CHANGELOG.md: -------------------------------------------------------------------------------- 1 | # OTS v0.2.0 Changelog 2 | 3 | ## New Features 4 | 5 | ### User-Defined Functions (UDFs) Support 6 | 7 | **Added `source_functions` field to SQL transformation code structure** 8 | 9 | - **Field**: `source_functions` (optional array of strings) 10 | - **Purpose**: Track user-defined functions (UDFs) called in SQL transformations 11 | - **Location**: `code.sql.source_functions` 12 | - **Format**: Array of function names (preferably fully qualified: `schema.function_name`) 13 | 14 | ### Changes to SQL Transformation Structure 15 | 16 | The `code.sql` object now includes: 17 | 18 | ```yaml 19 | code: 20 | sql: 21 | original_sql: string 22 | resolved_sql: string 23 | source_tables: [string] # Existing field 24 | source_functions: [string] # NEW in v0.2.0 25 | ``` 26 | 27 | ### Benefits 28 | 29 | 1. **Dependency Analysis**: Enables accurate dependency graph building including function dependencies 30 | 2. **Execution Order**: Ensures functions are created before transformations that use them 31 | 3. **Validation**: Allows tools to verify all required functions exist before execution 32 | 4. **Function Chains**: Supports function-to-function dependencies 33 | 34 | ### Backward Compatibility 35 | 36 | - `source_functions` is **optional** - existing v0.1.0 modules remain valid 37 | - If omitted, tools should assume an empty array `[]` 38 | - All examples in v0.2.0 include `source_functions: []` for transformations without function dependencies 39 | 40 | ### Functions Array in OTS Modules 41 | 42 | **Added `functions` array to OTS Module structure** 43 | 44 | - **Field**: `functions` (optional array of function definitions) 45 | - **Purpose**: Define user-defined functions (UDFs) that can be used in transformations 46 | - **Location**: Top-level in OTS Module (same level as `transformations`) 47 | - **Structure**: Array of function definitions with `function_id`, `function_type`, `language`, `code`, `parameters`, `return_type`, `deterministic`, `dependencies`, and `metadata` 48 | 49 | **Function Execution Order:** 50 | - Functions are created before transformations in dependency order 51 | - Function-to-function dependencies are resolved automatically 52 | - Execution order: Seeds → Functions → Transformations 53 | 54 | **Function Overloading:** 55 | - OTS 0.2.0 supports function overloading (multiple functions with same name, different signatures) 56 | - Functions are identified by fully qualified name and parameter signature 57 | - Each overloaded function is tracked separately in the `functions` array 58 | 59 | ### Test Library Structure Updates 60 | 61 | **Added `ots_version` field to Test Library structure** 62 | 63 | - **Field**: `ots_version` (required string) 64 | - **Purpose**: Indicates which version of the OTS standard the test library follows 65 | - **Location**: Top-level in Test Library file 66 | - **Format**: OTS version string (e.g., "0.2.0") 67 | 68 | ### Documentation Updates 69 | 70 | - Added new section: "User-Defined Functions (UDFs)" with examples 71 | - Updated all code examples to include 
`source_functions` field 72 | - Added example transformation using UDFs 73 | - Added `functions` array definition to OTS Module structure 74 | - Updated test library examples to include `ots_version` field 75 | 76 | ## Migration from v0.1.0 77 | 78 | To migrate an OTS Module from v0.1.0 to v0.2.0: 79 | 80 | 1. Update `ots_version` from `"0.1.0"` to `"0.2.0"` 81 | 2. Add `source_functions: []` to all `code.sql` objects (if no functions are used) 82 | 3. If transformations use UDFs, populate `source_functions` with function names 83 | 84 | Example migration: 85 | 86 | ```yaml 87 | # v0.1.0 88 | code: 89 | sql: 90 | original_sql: "..." 91 | resolved_sql: "..." 92 | source_tables: ["table1"] 93 | 94 | # v0.2.0 95 | code: 96 | sql: 97 | original_sql: "..." 98 | resolved_sql: "..." 99 | source_tables: ["table1"] 100 | source_functions: [] # Add this line 101 | ``` 102 | 103 | -------------------------------------------------------------------------------- /versions/v0.2.1/CHANGELOG.md: -------------------------------------------------------------------------------- 1 | # OTS v0.2.1 Changelog 2 | 3 | ## New Features 4 | 5 | ### Schema Change Handling for Incremental Materialization 6 | 7 | **Added `on_schema_change` field to incremental materialization configuration** 8 | 9 | - **Field**: `on_schema_change` (optional string) 10 | - **Purpose**: Control how schema differences between transformation output and existing target table are handled 11 | - **Location**: `materialization.incremental_details.on_schema_change` 12 | - **Default**: `"fail"` (fail transformation if schema changes detected) 13 | - **Options**: 14 | - `"fail"`: Fail the transformation if any schema differences are detected (default) 15 | - `"ignore"`: Ignore schema differences and proceed (may cause errors if columns don't match) 16 | - `"append_new_columns"`: Automatically add new columns, keep existing columns 17 | - `"sync_all_columns"`: Add new columns and remove missing columns (may cause data loss) 18 | - `"full_refresh"`: Drop and recreate table with full transformation output 19 | - `"full_incremental_refresh"`: Drop, recreate, then run incremental strategy in chunks 20 | - `"recreate_empty"`: Drop and recreate as empty table (for external backfilling) 21 | 22 | ### Full Incremental Refresh Configuration 23 | 24 | **Added `full_incremental_refresh` field for chunked incremental execution** 25 | 26 | - **Field**: `full_incremental_refresh` (optional object) 27 | - **Purpose**: Configure parameter-based incremental chunking after table recreation 28 | - **Location**: Top-level in transformation definition (same level as `materialization`) 29 | - **Required when**: `on_schema_change` is set to `"full_incremental_refresh"` 30 | 31 | **Structure:** 32 | ```yaml 33 | full_incremental_refresh: 34 | parameters: 35 | - name: string # Parameter name (matches placeholder, e.g., "@start_date") 36 | start_value: string # Initial value for the parameter 37 | end_value: string # End condition: hardcoded value or expression evaluated against source table (e.g., "max(event_date)" from source.events) 38 | step: string # Increment step (SQL interval or numeric value) 39 | ``` 40 | 41 | **Use Cases:** 42 | - Single parameter: One parameter in the array (e.g., `@start_date`) 43 | - Multiple parameters: Multiple parameters for boundary-based queries (e.g., `@start_date` and `@end_date`) 44 | - Both parameters use the same `step` value (ideally), but can differ if needed 45 | - Parameter names must match placeholders in the query (e.g., 
`"@start_date"` matches `'@start_date'` in SQL) 46 | 47 | ### Schema Change Behavior Details 48 | 49 | **Type Mismatches:** 50 | - Columns with same name but different data type are treated as schema changes 51 | - With `on_schema_change="fail"`, type mismatches cause immediate failure 52 | - Different data type = different column (fail immediately) 53 | 54 | **Column Order:** 55 | - Column order differences are detected and logged as warnings 56 | - Tools should rely on explicit column lists in INSERT/MERGE statements 57 | - No automatic reordering is performed 58 | 59 | **Schema Comparison:** 60 | - Schema comparison happens after time-based filtering but before execution 61 | - Uses database-specific `DESCRIBE` or equivalent methods for precision 62 | - If table doesn't exist, schema comparison is skipped and table is created normally 63 | 64 | ## Changes to Incremental Materialization Structure 65 | 66 | The `incremental_details` object now includes: 67 | 68 | ```yaml 69 | incremental_details: 70 | strategy: string 71 | delete_condition: string 72 | filter_condition: string 73 | merge_key: [string] 74 | update_columns: [string] 75 | on_schema_change: string # NEW in v0.2.1 76 | ``` 77 | 78 | The transformation definition now supports: 79 | 80 | ```yaml 81 | materialization: 82 | type: "incremental" 83 | incremental_details: {...} 84 | 85 | full_incremental_refresh: # NEW in v0.2.1 (optional) 86 | parameters: [...] 87 | ``` 88 | 89 | ## Backward Compatibility 90 | 91 | - `on_schema_change` is **optional** - existing v0.2.0 modules remain valid 92 | - If omitted, default behavior is `"fail"` (matches previous implicit behavior) 93 | - `full_incremental_refresh` is **optional** - only required when `on_schema_change="full_incremental_refresh"` 94 | - All existing v0.2.0 examples remain valid in v0.2.1 95 | 96 | ## Migration from v0.2.0 97 | 98 | To migrate an OTS Module from v0.2.0 to v0.2.1: 99 | 100 | 1. Update `ots_version` from `"0.2.0"` to `"0.2.1"` 101 | 2. Optionally add `on_schema_change` to `incremental_details` if you want explicit schema change handling 102 | 3. 
If using `full_incremental_refresh`, add the `full_incremental_refresh` configuration 103 | 104 | Example migration: 105 | 106 | ```yaml 107 | # v0.2.0 108 | materialization: 109 | type: "incremental" 110 | incremental_details: 111 | strategy: "append" 112 | filter_condition: "created_at >= '@start_date'" 113 | 114 | # v0.2.1 (explicit default) 115 | materialization: 116 | type: "incremental" 117 | incremental_details: 118 | strategy: "append" 119 | filter_condition: "created_at >= '@start_date'" 120 | on_schema_change: "fail" # Optional: explicit default 121 | 122 | # v0.2.1 (with schema change handling) 123 | materialization: 124 | type: "incremental" 125 | incremental_details: 126 | strategy: "append" 127 | filter_condition: "created_at >= '@start_date'" 128 | on_schema_change: "append_new_columns" # Auto-add new columns 129 | 130 | # v0.2.1 (with full incremental refresh) 131 | materialization: 132 | type: "incremental" 133 | incremental_details: 134 | strategy: "append" 135 | filter_condition: "event_date >= '@start_date'" 136 | on_schema_change: "full_incremental_refresh" 137 | 138 | full_incremental_refresh: 139 | parameters: 140 | - name: "@start_date" 141 | start_value: "2024-01-01" 142 | end_value: "max(event_date)" # Evaluated against source table 143 | step: "INTERVAL 1 DAY" 144 | ``` 145 | 146 | ## Documentation Updates 147 | 148 | - Added "Schema Change Handling" section to Incremental Materialization documentation 149 | - Added examples for all `on_schema_change` options 150 | - Added `full_incremental_refresh` configuration examples 151 | - Updated append strategy example to show `on_schema_change` usage 152 | - Added documentation for type mismatch and column order handling 153 | 154 | 155 | -------------------------------------------------------------------------------- /CODE_OF_CONDUCT.md: -------------------------------------------------------------------------------- 1 | # Contributor Covenant 3.0 Code of Conduct 2 | 3 | ## Our Pledge 4 | 5 | We pledge to make our community welcoming, safe, and equitable for all. 6 | 7 | We are committed to fostering an environment that respects and promotes the dignity, rights, and contributions of all individuals, regardless of characteristics including race, ethnicity, caste, color, age, physical characteristics, neurodiversity, disability, sex or gender, gender identity or expression, sexual orientation, language, philosophy or religion, national or social origin, socio-economic position, level of education, or other status. The same privileges of participation are extended to everyone who participates in good faith and in accordance with this Covenant. 8 | 9 | ## Encouraged Behaviors 10 | 11 | While acknowledging differences in social norms, we all strive to meet our community's expectations for positive behavior. We also understand that our words and actions may be interpreted differently than we intend based on culture, background, or native language. 12 | 13 | With these considerations in mind, we agree to behave mindfully toward each other and act in ways that center our shared values, including: 14 | 15 | 1. Respecting the **purpose of our community**, our activities, and our ways of gathering. 16 | 2. Engaging **kindly and honestly** with others. 17 | 3. Respecting **different viewpoints** and experiences. 18 | 4. **Taking responsibility** for our actions and contributions. 19 | 5. Gracefully giving and accepting **constructive feedback**. 20 | 6. Committing to **repairing harm** when it occurs. 21 | 7. 
Behaving in other ways that promote and sustain the **well-being of our community**. 22 | 23 | ## Restricted Behaviors 24 | 25 | We agree to restrict the following behaviors in our community. Instances, threats, and promotion of these behaviors are violations of this Code of Conduct. 26 | 27 | 1. **Harassment.** Violating explicitly expressed boundaries or engaging in unnecessary personal attention after any clear request to stop. 28 | 2. **Character attacks.** Making insulting, demeaning, or pejorative comments directed at a community member or group of people. 29 | 3. **Stereotyping or discrimination.** Characterizing anyone's personality or behavior on the basis of immutable identities or traits. 30 | 4. **Sexualization.** Behaving in a way that would generally be considered inappropriately intimate in the context or purpose of the community. 31 | 5. **Violating confidentiality**. Sharing or acting on someone's personal or private information without their permission. 32 | 6. **Endangerment.** Causing, encouraging, or threatening violence or other harm toward any person or group. 33 | 7. Behaving in other ways that **threaten the well-being** of our community. 34 | 35 | ### Other Restrictions 36 | 37 | 1. **Misleading identity.** Impersonating someone else for any reason, or pretending to be someone else to evade enforcement actions. 38 | 2. **Failing to credit sources.** Not properly crediting the sources of content you contribute. 39 | 3. **Promotional materials**. Sharing marketing or other commercial content in a way that is outside the norms of the community. 40 | 4. **Irresponsible communication.** Failing to responsibly present content which includes, links or describes any other restricted behaviors. 41 | 42 | ## Reporting an Issue 43 | 44 | Tensions can occur between community members even when they are trying their best to collaborate. Not every conflict represents a code of conduct violation, and this Code of Conduct reinforces encouraged behaviors and norms that can help avoid conflicts and minimize harm. 45 | 46 | When an incident does occur, it is important to report it promptly. To report a possible violation, please contact the project maintainers at [contact email]. 47 | 48 | Community Moderators take reports of violations seriously and will make every effort to respond in a timely manner. They will investigate all reports of code of conduct violations, reviewing messages, logs, and recordings, or interviewing witnesses and other participants. Community Moderators will keep investigation and enforcement actions as transparent as possible while prioritizing safety and confidentiality. In order to honor these values, enforcement actions are carried out in private with the involved parties, but communicating to the whole community may be part of a mutually agreed upon resolution. 49 | 50 | ## Addressing and Repairing Harm 51 | 52 | If an investigation by the Community Moderators finds that this Code of Conduct has been violated, the following enforcement ladder may be used to determine how best to repair harm, based on the incident's impact on the individuals involved and the community as a whole. Depending on the severity of a violation, lower rungs on the ladder may be skipped. 53 | 54 | 1. **Warning** 55 | 1. Event: A violation involving a single incident or series of incidents. 56 | 2. Consequence: A private, written warning from the Community Moderators. 57 | 3. 
Repair: Examples of repair include a private written apology, acknowledgement of responsibility, and seeking clarification on expectations. 58 | 59 | 2. **Temporarily Limited Activities** 60 | 1. Event: A repeated incidence of a violation that previously resulted in a warning, or the first incidence of a more serious violation. 61 | 2. Consequence: A private, written warning with a time-limited cooldown period designed to underscore the seriousness of the situation and give the community members involved time to process the incident. The cooldown period may be limited to particular communication channels or interactions with particular community members. 62 | 3. Repair: Examples of repair may include making an apology, using the cooldown period to reflect on actions and impact, and being thoughtful about re-entering community spaces after the period is over. 63 | 64 | 3. **Temporary Suspension** 65 | 1. Event: A pattern of repeated violation which the Community Moderators have tried to address with warnings, or a single serious violation. 66 | 2. Consequence: A private written warning with conditions for return from suspension. In general, temporary suspensions give the person being suspended time to reflect upon their behavior and possible corrective actions. 67 | 3. Repair: Examples of repair include respecting the spirit of the suspension, meeting the specified conditions for return, and being thoughtful about how to reintegrate with the community when the suspension is lifted. 68 | 69 | 4. **Permanent Ban** 70 | 1. Event: A pattern of repeated code of conduct violations that other steps on the ladder have failed to resolve, or a violation so serious that the Community Moderators determine there is no way to keep the community safe with this person as a member. 71 | 2. Consequence: Access to all community spaces, tools, and communication channels is removed. In general, permanent bans should be rarely used, should have strong reasoning behind them, and should only be resorted to if working through other remedies has failed to change the behavior. 72 | 3. Repair: There is no possible repair in cases of this severity. 73 | 74 | This enforcement ladder is intended as a guideline. It does not limit the ability of Community Managers to use their discretion and judgment, in keeping with the best interests of our community. 75 | 76 | ## Scope 77 | 78 | This Code of Conduct applies within all community spaces, and also applies when an individual is officially representing the community in public or other spaces. Examples of representing our community include using an official email address, posting via an official social media account, or acting as an appointed representative at an online or offline event. 79 | 80 | ## Attribution 81 | 82 | This Code of Conduct is adapted from the [Contributor Covenant, version 3.0](https://www.contributor-covenant.org/version/3/0/code_of_conduct/), permanently available at https://www.contributor-covenant.org/version/3/0/. 83 | 84 | Contributor Covenant is stewarded by the Organization for Ethical Source and licensed under CC BY-SA 4.0. To view a copy of this license, visit https://creativecommons.org/licenses/by-sa/4.0/ 85 | 86 | For answers to common questions about Contributor Covenant, see the FAQ at https://www.contributor-covenant.org/faq. Translations are provided at https://www.contributor-covenant.org/translations. Additional enforcement and community guideline resources can be found at https://www.contributor-covenant.org/resources. 
The enforcement ladder was inspired by the work of Mozilla's code of conduct team. 87 | 88 | 89 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | Apache License 2 | Version 2.0, January 2004 3 | http://www.apache.org/licenses/ 4 | 5 | TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION 6 | 7 | 1. Definitions. 8 | 9 | "License" shall mean the terms and conditions for use, reproduction, 10 | and distribution as defined by Sections 1 through 9 of this document. 11 | 12 | "Licensor" shall mean the copyright owner or entity granting the License. 13 | 14 | "Legal Entity" shall mean the union of the acting entity and all 15 | other entities that control, are controlled by, or are under common 16 | control with that entity. For the purposes of this definition, 17 | "control" means (i) the power, direct or indirect, to cause the 18 | direction or management of such entity, whether by contract or 19 | otherwise, or (ii) ownership of fifty percent (50%) or more of the 20 | outstanding shares, or (iii) beneficial ownership of such entity. 21 | 22 | "You" (or "Your") shall mean an individual or Legal Entity 23 | exercising permissions granted by this License. 24 | 25 | "Source" shall mean the preferred form for making modifications, 26 | including but not limited to software source code, documentation 27 | source, and configuration files. 28 | 29 | "Object" shall mean any form resulting from mechanical 30 | transformation or translation of a Source form, including but 31 | not limited to compiled object code, generated documentation, 32 | and conversions to other media types. 33 | 34 | "Work" shall mean the work of authorship, whether in Source or 35 | Object form, made available under the License, as indicated by a 36 | copyright notice that is included in or attached to the work 37 | (which shall not include communications that are clearly marked or 38 | otherwise designated in writing by the copyright owner as "Not a Work"). 39 | 40 | "Derivative Works" shall mean any work, whether in Source or Object 41 | form, that is based upon (or derived from) the Work and for which the 42 | editorial revisions, annotations, elaborations, or other modifications 43 | represent, as a whole, an original work of authorship. For the purposes 44 | of this License, Derivative Works shall not include works that remain 45 | separable from, or merely link (or bind by name) to the interfaces of, 46 | the Work and derivative works thereof. 47 | 48 | "Contribution" shall mean any work of authorship, including 49 | the original version of the Work and any modifications or additions 50 | to that Work or Derivative Works thereof, that is intentionally 51 | submitted to Licensor for inclusion in the Work by the copyright owner 52 | or by an individual or Legal Entity authorized to submit on behalf of 53 | the copyright owner. For the purposes of this definition, "submitted" 54 | means any form of electronic, verbal, or written communication sent 55 | to the Licensor or its representatives, including but not limited to 56 | communication on electronic mailing lists, source code control 57 | systems, and issue tracking systems that are managed by, or on behalf 58 | of, the Licensor for the purpose of discussing and improving the Work, 59 | but excluding communication that is conspicuously marked or otherwise 60 | designated in writing by the copyright owner as "Not a Work". 61 | 62 | 2. 
Grant of Copyright License. Subject to the terms and conditions of 63 | this License, each Contributor hereby grants to You a perpetual, 64 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 65 | copyright license to use, reproduce, modify, distribute, and prepare 66 | Derivative Works of, and to display, perform, and distribute the 67 | Work and such Derivative Works in Source or Object form. 68 | 69 | 3. Grant of Patent License. Subject to the terms and conditions of 70 | this License, each Contributor hereby grants to You a perpetual, 71 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 72 | (except as stated in this section) patent license to make, have made, 73 | use, offer to sell, sell, import, and otherwise transfer the Work, 74 | where such license applies only to those patent claims licensable 75 | by such Contributor that are necessarily infringed by their 76 | Contribution(s) alone or by combination of their Contribution(s) 77 | with the Work to which such Contribution(s) was submitted. If You 78 | institute patent litigation against any entity (including a 79 | cross-claim or counterclaim in a lawsuit) alleging that the Work 80 | or a Contribution incorporated within the Work constitutes direct 81 | or contributory patent infringement, then any patent licenses 82 | granted to You under this License for that Work shall terminate 83 | as of the date such litigation is filed. 84 | 85 | 4. Redistribution. You may reproduce and distribute copies of the 86 | Work or Derivative Works thereof in any medium, with or without 87 | modifications, and in Source or Object form, provided that You 88 | meet the following conditions: 89 | 90 | (a) You must give any other recipients of the Work or 91 | Derivative Works a copy of this License; and 92 | 93 | (b) You must cause any modified files to carry prominent notices 94 | stating that You changed the files; and 95 | 96 | (c) You must retain, in the Source form of any Derivative Works 97 | that You distribute, all copyright, patent, trademark, and 98 | attribution notices from the Source form of the Work, 99 | excluding those notices that do not pertain to any part of 100 | the Derivative Works; and 101 | 102 | (d) If the Work includes a "NOTICE" text file as part of its 103 | distribution, then any Derivative Works that You distribute must 104 | include a readable copy of the attribution notices contained 105 | within such NOTICE file, excluding those notices that do not 106 | pertain to any part of the Derivative Works, in at least one 107 | of the following places: within a NOTICE text file distributed 108 | as part of the Derivative Works; within the Source form or 109 | documentation, if provided along with the Derivative Works; or, 110 | within a display generated by the Derivative Works, if and 111 | wherever such third-party notices normally appear. The contents 112 | of the NOTICE file are for informational purposes only and 113 | do not modify the License. You may add Your own attribution 114 | notices within Derivative Works that You distribute, alongside 115 | or as an addendum to the NOTICE text from the Work, provided 116 | that such additional attribution notices cannot be construed 117 | as modifying the License. 
118 | 119 | You may add Your own copyright notice to Your modifications and 120 | may provide additional or different license terms and conditions 121 | for use, reproduction, or distribution of Your modifications, or 122 | for any such Derivative Works as a whole, provided Your use, 123 | reproduction, and distribution of the Work otherwise complies with 124 | the conditions stated in this License. 125 | 126 | 5. Submission of Contributions. Unless You explicitly state otherwise, 127 | any Contribution intentionally submitted for inclusion in the Work 128 | by You to the Licensor shall be under the terms and conditions of 129 | this License, without any additional terms or conditions. 130 | Notwithstanding the above, nothing herein shall supersede or modify 131 | the terms of any separate license agreement you may have executed 132 | with Licensor regarding such Contributions. 133 | 134 | 6. Trademarks. This License does not grant permission to use the trade 135 | names, trademarks, service marks, or product names of the Licensor, 136 | except as required for reasonable and customary use in describing the 137 | origin of the Work and reproducing the content of the NOTICE file. 138 | 139 | 7. Disclaimer of Warranty. Unless required by applicable law or 140 | agreed to in writing, Licensor provides the Work (and each 141 | Contributor provides its Contributions) on an "AS IS" BASIS, 142 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or 143 | implied, including, without limitation, any warranties or conditions 144 | of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A 145 | PARTICULAR PURPOSE. You are solely responsible for determining the 146 | appropriateness of using or redistributing the Work and assume any 147 | risks associated with Your exercise of permissions under this License. 148 | 149 | 8. Limitation of Liability. In no event and under no legal theory, 150 | whether in tort (including negligence), contract, or otherwise, 151 | unless required by applicable law (such as deliberate and grossly 152 | negligent acts) or agreed to in writing, shall any Contributor be 153 | liable to You for damages, including any direct, indirect, special, 154 | incidental, or consequential damages of any character arising as a 155 | result of this License or out of the use or inability to use the 156 | Work (including but not limited to damages for loss of goodwill, 157 | work stoppage, computer failure or malfunction, or any and all 158 | other commercial damages or losses), even if such Contributor 159 | has been advised of the possibility of such damages. 160 | 161 | 9. Accepting Warranty or Support. You may choose to offer, and to 162 | charge a fee for, warranty, support, indemnity or other liability 163 | obligations and/or rights consistent with this License. However, in 164 | accepting such obligations, You may act only on Your own behalf and 165 | on Your sole responsibility, not on behalf of any other Contributor, 166 | and only if You agree to indemnify, defend, and hold each Contributor 167 | harmless for any liability incurred by, or claims asserted against, 168 | such Contributor by reason of your accepting any such warranty or support. 169 | 170 | END OF TERMS AND CONDITIONS 171 | 172 | APPENDIX: How to apply the Apache License to your work. 173 | 174 | To apply the Apache License to your work, attach the following 175 | boilerplate notice, with the fields enclosed by brackets "[]" 176 | replaced with your own identifying information. 
(Don't include 177 | the brackets!) The text should be enclosed in the appropriate 178 | comment syntax for the file format. We also recommend that a 179 | file or class name and description of purpose be included on the 180 | same "printed page" as the copyright notice for easier 181 | identification within third-party archives. 182 | 183 | Copyright [yyyy] [name of copyright owner] 184 | 185 | Licensed under the Apache License, Version 2.0 (the "License"); 186 | you may not use this file except in compliance with the License. 187 | You may obtain a copy of the License at 188 | 189 | http://www.apache.org/licenses/LICENSE-2.0 190 | 191 | Unless required by applicable law or agreed to in writing, software 192 | distributed under the License is distributed on an "AS IS" BASIS, 193 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 194 | See the License for the specific language governing permissions and 195 | limitations under the License. 196 | 197 | 198 | -------------------------------------------------------------------------------- /versions/v0.1.0/transformation-specification.md: -------------------------------------------------------------------------------- 1 | # Open Transformation Specification v0.1.0 2 | 3 | ## Table of Contents 4 | 5 | 1. [Introduction](#introduction) 6 | 2. [Core Concepts](#core-concepts) 7 | 3. [Open Transformation Definition (OTD) Structure](#open-transformation-definition-otd-structure) 8 | 4. [Materialization Types](#materialization-types) 9 | 5. [Data Quality Tests](#data-quality-tests) 10 | 6. [Examples](#examples) 11 | 12 | ## Introduction 13 | 14 | The Open Transformation Specification (OTS) defines a standard, programming language-agnostic interface description for data transformations. This specification allows both humans and computers to discover and understand how transformations behave, what outputs they produce, and how those outputs are materialized (as tables, views, incremental updates, SCD2, etc.) without requiring additional documentation or configuration. 15 | 16 | An OTS-based transformation must include both the code that transforms the data and metadata about the transformation. A tool implementing OTS should be able to execute an OTS transformation with no additional code or information beyond what's specified in the OTS document. 17 | 18 | ## Core Concepts 19 | 20 | ### Open Transformation Definition (OTD) 21 | 22 | An **Open Transformation Definition (OTD)** is a concrete instance of the Open Transformation Specification - a single file or document that describes a specific data transformation using the OTS format. 23 | 24 | A transformation is a unit of data processing that takes one or more data sources as input and produces one data output. Right now, transformations are SQL queries, but we plan to add support for other programming languages in the future. 25 | 26 | ### Open Transformation Specification Module 27 | 28 | An **Open Transformation Specification Module (OTS Module)** is a collection of related transformations that target the same database and schema. An OTS Module can contain one or more transformations, much like how an OpenAPI specification can contain multiple endpoints. 
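For orientation, here is a minimal, hypothetical sketch of a module that bundles two transformations sharing one target (field-by-field details are defined in the OTS Module Structure section below; the transformation bodies are elided and the names are illustrative):

```yaml
# Hypothetical minimal OTS Module sketch - names are illustrative
ots_version: "0.1.0"
module_name: "ecommerce_analytics"
module_description: "Order and customer analytics for the warehouse"
target:
  database: "warehouse"
  schema: "analytics"
transformations:
  - transformation_id: "analytics.orders"
    transformation_type: "sql"
    # code, schema, materialization, tests, metadata ...
  - transformation_id: "analytics.customers"
    transformation_type: "sql"
    # code, schema, materialization, tests, metadata ...
```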
29 | 30 | Key characteristics of an OTS Module: 31 | - **Single target**: All transformations in a module target the same database and schema 32 | - **Logical grouping**: Related transformations are organized together 33 | - **Deployment unit**: The entire module can be deployed as a single unit 34 | 35 | #### OTS vs OTD vs OTS Module 36 | 37 | - **Open Transformation Specification (OTS)**: The standard that defines the structure and rules 38 | - **Open Transformation Definition (OTD)**: A specific transformation within a module 39 | - **Open Transformation Specification Module (OTS Module)**: A collection of related transformations targeting the same database and schema 40 | 41 | ### Test Library 42 | 43 | A **Test Library** is a project-level collection of reusable test definitions (generic and singular SQL tests) that can be shared across multiple OTS modules. Test libraries are defined separately from transformation modules and are referenced by modules that need to use them. 44 | 45 | Key characteristics of a Test Library: 46 | - **Project-level scope**: Test libraries are defined at the project/workspace level, separate from OTS modules 47 | - **Reusability**: Tests defined in a library can be referenced by any OTS module in the project 48 | - **Test types**: Contains both generic SQL tests (with placeholders) and singular SQL tests (table-specific) 49 | - **Optional**: Modules can define tests inline or reference a test library, or both 50 | 51 | #### OTS vs OTD vs OTS Module vs Test Library 52 | 53 | - **Open Transformation Specification (OTS)**: The standard that defines the structure and rules 54 | - **Open Transformation Definition (OTD)**: A specific transformation within a module 55 | - **Open Transformation Specification Module (OTS Module)**: A collection of related transformations targeting the same database and schema 56 | - **Test Library**: A project-level collection of reusable test definitions that can be shared across modules 57 | 58 | Think of it this way: OTS is like the blueprint, an OTS Module is the house (a complete set of transformations), each OTD is a room within that house (an individual transformation), and a Test Library is like a shared toolbox of quality checks that can be used across multiple houses. 59 | 60 | ## Components of an OTD 61 | 62 | An Open Transformation Definition consists of several key components that work together to define an executable transformation: 63 | 64 | 1. **Transformation Code**: The transformation logic (SQL, Python, PySpark, etc.) stored in a type-based structure 65 | 2. **Schema Definition**: The structure of the output data including column definitions, types, and validation rules 66 | 3. **Materialization Strategy**: How the output is stored and updated (table, view, incremental, SCD2) 67 | 4. **Tests**: Validation rules that ensure data quality at table level 68 | 5. **Metadata**: Additional information about the transformation (owner, tags, creation date, etc.) 69 | 70 | ### Transformation Code 71 | 72 | Transformations can be written in different languages (SQL, Python, PySpark, etc.). The transformation code is stored in a type-based structure that supports multiple transformation types while maintaining a consistent interface. 73 | 74 | #### SQL Transformations 75 | 76 | For SQL transformations, the code is stored with the following structure: 77 | - `original_sql`: The original SQL query as written (typically a SELECT statement). This preserves the original transformation code as authored. 
78 | - `resolved_sql`: SQL with fully qualified table names (schema.table format). This is the preferred version for execution as it eliminates ambiguity in table references. Tools should use `resolved_sql` when executing transformations. 79 | - `source_tables`: List of input tables referenced in the query (required for dependency analysis) 80 | 81 | **When to use each:** 82 | - Use `original_sql` for: displaying the original code to users, version control, understanding the transformation logic 83 | - Use `resolved_sql` for: actual execution, dependency resolution, cross-database compatibility 84 | 85 | #### Non-SQL Transformations 86 | 87 | Support for non-SQL transformation types (Python, PySpark, R, etc.) is planned for future versions of the specification. The current v0.1.0 specification focuses on SQL transformations. 88 | 89 | ### Schema Definition 90 | 91 | Schema defines the structure of the output data, including column names, data types, descriptions, partitioning, indexes, and other properties of the physical table. The schema is essential for understanding what the transformation produces without executing it. For example, it enables generating DDL statements for creating the output table. 92 | 93 | ### Materialization Strategy 94 | Materialization defines how the transformation output is stored and updated. Common types include: 95 | - **table**: Full table replacement on each run 96 | - **view**: Virtual table that queries underlying data 97 | - **incremental**: Partial updates using strategies like delete+insert or merge 98 | - **scd2**: Slowly Changing Dimension type 2 for tracking historical changes 99 | 100 | ### Tests 101 | Tests are validation rules that ensure data quality. They can be defined at two levels: 102 | - **Column-level tests**: Applied to individual columns (e.g., `not_null`, `unique`) 103 | - **Table-level tests**: Applied to the entire output (e.g., `row_count_gt_0`, `unique`) 104 | 105 | Tests enable automated data quality validation without manual inspection. OTS supports three types of tests: 106 | 1. **Standard Tests**: Built-in tests defined in the OTS specification (e.g., `not_null`, `unique`, `row_count_gt_0`) 107 | 2. **Generic SQL Tests**: Reusable SQL tests with placeholders that can be applied to multiple transformations 108 | 3. **Singular SQL Tests**: Table-specific SQL tests with hardcoded table references 109 | 110 | Generic SQL Tests and Singular SQL Tests can be defined in a project Test Library (see [Test Library](#test-library)) or inline within the current OTS Module. 111 | 112 | For detailed information about test types, definitions, and usage, see the [Data Quality Tests](#data-quality-tests) section. 113 | 114 | ### Metadata 115 | Metadata provides additional information about the transformation including: 116 | - **file_path**: Location of the source transformation file 117 | - **owner**: Person or team responsible for the transformation 118 | - **tags**: List of string tags for categorization and discovery (e.g., ["analytics", "fct", "production"]) 119 | - **object_tags**: Dictionary of key-value pairs for database object tagging (e.g., {"sensitivity_tag": "pii", "classification": "public"}) 120 | 121 | **Tag Types:** 122 | - **tags** (dbt-style): Simple string tags used for filtering, categorization, and discovery. These are typically used for model selection and organization. 123 | - **Module-level tags**: Tags defined at the module level apply to all transformations in the module. 
They can be inherited by transformations or merged with transformation-specific tags. 124 | - **Transformation-level tags**: Tags defined at the transformation level are specific to that transformation. They can be merged with module-level tags. 125 | - **object_tags** (database-style): Key-value pairs that are attached directly to database objects (tables, views) in databases that support object tagging (e.g., Snowflake). These are used for data governance, compliance, and metadata management. Unlike `tags`, `object_tags` are always transformation-specific and are not inherited from module level. 126 | 127 | ## OTS Module Structure 128 | 129 | An OTS Module is a YAML or JSON document that can contain one or more transformations. Below is the complete structure: 130 | 131 | ### Complete OTS Module Structure 132 | 133 | ```yaml 134 | # OTS version 135 | ots_version: string # OTS specification version (e.g., "0.1.0") - indicates which version of the OTS standard this module follows 136 | 137 | # Module metadata 138 | module_name: string # Module name (e.g., "ecommerce_analytics") 139 | module_description: string # Description of the module (optional) 140 | version: string # Optional: Module/package version (e.g., "1.0.0") - version of this specific module, independent of OTS version 141 | tags: [string] # Optional: Module-level tags for categorization (e.g., ["analytics", "fct"]). These can be inherited or merged with transformation-level tags. 142 | test_library_path: string # Optional: Path to test library file (relative to module file or absolute path) 143 | 144 | # Optional: Inline test definitions (same structure as test library) 145 | generic_tests: # Optional: Module-specific generic SQL tests 146 | test_name: 147 | type: "sql" 148 | level: "table" | "column" 149 | description: string 150 | sql: string 151 | parameters: {} 152 | singular_tests: # Optional: Module-specific singular SQL tests 153 | test_name: 154 | type: "sql" 155 | level: "table" | "column" 156 | description: string 157 | sql: string 158 | target_transformation: string 159 | 160 | target: 161 | database: string # Target database name 162 | schema: string # Target schema name 163 | sql_dialect: string # Optional: SQL dialect (e.g., "postgres", "bigquery", "snowflake", "spark", etc.) 164 | connection_profile: string # Optional: connection profile reference 165 | 166 | # Transformations 167 | transformations: # Array of transformation definitions 168 | - transformation_id: string # Fully qualified identifier (e.g., "analytics.my_first_table") 169 | description: string # Optional: Description of what the transformation does (optional) 170 | transformation_type: string # Type of transformation: "sql" (default: "sql"). Non-SQL types (python, pyspark, r) are planned for future versions. 
171 | sql_dialect: string # Optional: SQL dialect of the transformation code (for translation to target dialect) 172 | 173 | # Transformation code (type-based structure) 174 | code: 175 | # For SQL transformations (transformation_type: "sql") 176 | sql: 177 | original_sql: string # The original SQL query as written (typically a SELECT statement) 178 | resolved_sql: string # SQL with fully qualified table names (schema.table) - preferred for execution 179 | source_tables: [string] # List of input tables referenced (required for dependency analysis) 180 | 181 | # Note: Non-SQL transformation types (python, pyspark, r) are planned for future versions 182 | 183 | # Schema definition 184 | schema: 185 | columns: # Array of column definitions 186 | - name: string # Column name 187 | datatype: string # Data type ("number", "string", "date", etc.) 188 | description: string # Column description 189 | partitioning: [string] # Optional: Partition keys 190 | indexes: # Optional: Array of index definitions 191 | - name: string # Index name (optional, auto-generated if not provided) 192 | columns: [string] # Columns to index 193 | 194 | # Materialization strategy 195 | materialization: 196 | type: string # "table", "view", "incremental", "scd2" 197 | incremental_details: # Required if type is "incremental" 198 | strategy: string # "delete_insert", "append", "merge" 199 | delete_condition: string # SQL condition for delete (delete_insert only) 200 | filter_condition: string # SQL condition for filtering data 201 | merge_key: [string] # Primary key columns for matching records (merge only) 202 | update_columns: [string] # (Optional) List of columns to be updated in merge strategy 203 | scd2_details: # Optional if type is "scd2" 204 | start_column: string # Name of the start column (default: "valid_from") 205 | end_column: string # Name of the end column (default: "valid_to") 206 | unique_key: [string] # Array of columns that uniquely identify a record in SCD2 modeling (optional) 207 | 208 | # Tests: both column-level and table-level 209 | tests: 210 | columns: # Optional: Column-level tests 211 | column_name: # Tests for specific columns 212 | - string # Simple test name (e.g., "not_null", "unique") 213 | - object # Test with parameters: {name: string, params?: object, severity?: "error"|"warning"} 214 | table: # Optional: Table-level tests 215 | - string # Simple test name (e.g., "row_count_gt_0") 216 | - object # Test with parameters: {name: string, params?: object, severity?: "error"|"warning"} 217 | 218 | # Metadata 219 | metadata: 220 | file_path: string # Path to the source transformation file 221 | owner: string # Optional: Person or team responsible (optional) 222 | tags: [string] # Optional: List of string tags for categorization and discovery (e.g., ["analytics", "fct"]) 223 | object_tags: dict # Optional: Dictionary of key-value pairs for database object tagging (e.g., {"sensitivity_tag": "pii", "classification": "public"}) 224 | ``` 225 | 226 | ## Simple Table Transformation 227 | 228 |
229 | JSON Format 230 | 231 | ```json 232 | { 233 | "ots_version": "0.1.0", 234 | "module_name": "analytics_customers", 235 | "module_description": "Customer analytics transformations", 236 | "tags": ["analytics", "production"], 237 | "test_library_path": "../tests/test_library.yaml", 238 | "target": { 239 | "database": "warehouse", 240 | "schema": "analytics", 241 | "sql_dialect": "postgres" 242 | }, 243 | "transformations": [ 244 | { 245 | "transformation_id": "analytics.customers", 246 | "description": "Customer data table", 247 | "transformation_type": "sql", 248 | "code": { 249 | "sql": { 250 | "original_sql": "SELECT id, name, email, created_at FROM source.customers WHERE active = true", 251 | "resolved_sql": "SELECT id, name, email, created_at FROM warehouse.source.customers WHERE active = true", 252 | "source_tables": ["source.customers"] 253 | } 254 | }, 255 | "schema": { 256 | "columns": [ 257 | { 258 | "name": "id", 259 | "datatype": "number", 260 | "description": "Unique customer identifier" 261 | }, 262 | { 263 | "name": "name", 264 | "datatype": "string", 265 | "description": "Customer name" 266 | }, 267 | { 268 | "name": "email", 269 | "datatype": "string", 270 | "description": "Customer email address" 271 | }, 272 | { 273 | "name": "created_at", 274 | "datatype": "date", 275 | "description": "Customer creation date" 276 | } 277 | ], 278 | "partitioning": [], 279 | "indexes": [ 280 | { 281 | "name": "idx_customers_id", 282 | "columns": ["id"] 283 | }, 284 | { 285 | "name": "idx_customers_email", 286 | "columns": ["email"] 287 | } 288 | ] 289 | }, 290 | "materialization": { 291 | "type": "table" 292 | }, 293 | "tests": { 294 | "columns": { 295 | "id": ["not_null", "unique"], 296 | "email": ["not_null", "unique"], 297 | "created_at": ["not_null"] 298 | }, 299 | "table": ["row_count_gt_0"] 300 | }, 301 | "metadata": { 302 | "file_path": "/models/analytics/customers.sql", 303 | "owner": "data-team", 304 | "tags": ["customer", "core"], 305 | "object_tags": { 306 | "sensitivity_tag": "pii", 307 | "classification": "internal" 308 | } 309 | } 310 | } 311 | ] 312 | } 313 | ``` 314 | 315 |
316 | 317 |
318 | YAML Format 319 | 320 | ```yaml 321 | ots_version: "0.1.0" 322 | module_name: "analytics_customers" 323 | module_description: "Customer analytics transformations" 324 | tags: ["analytics", "production"] 325 | 326 | target: 327 | database: "warehouse" 328 | schema: "analytics" 329 | sql_dialect: "postgres" 330 | 331 | transformations: 332 | - transformation_id: "analytics.customers" 333 | description: "Customer data table" 334 | transformation_type: "sql" 335 | 336 | code: 337 | sql: 338 | original_sql: "SELECT id, name, email, created_at FROM source.customers WHERE active = true" 339 | resolved_sql: "SELECT id, name, email, created_at FROM warehouse.source.customers WHERE active = true" 340 | source_tables: ["source.customers"] 341 | 342 | schema: 343 | columns: 344 | - name: "id" 345 | datatype: "number" 346 | description: "Unique customer identifier" 347 | - name: "name" 348 | datatype: "string" 349 | description: "Customer name" 350 | - name: "email" 351 | datatype: "string" 352 | description: "Customer email address" 353 | - name: "created_at" 354 | datatype: "date" 355 | description: "Customer creation date" 356 | partitioning: [] 357 | indexes: 358 | - name: "idx_customers_id" 359 | columns: ["id"] 360 | - name: "idx_customers_email" 361 | columns: ["email"] 362 | 363 | materialization: 364 | type: "table" 365 | 366 | tests: 367 | columns: 368 | id: ["not_null", "unique"] 369 | email: ["not_null", "unique"] 370 | created_at: ["not_null"] 371 | table: ["row_count_gt_0"] 372 | 373 | metadata: 374 | file_path: "/models/analytics/customers.sql" 375 | owner: "data-team" 376 | tags: ["customer", "core"] 377 | object_tags: 378 | sensitivity_tag: "pii" 379 | classification: "internal" 380 | ``` 381 | 382 |
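As an illustration of how a tool might consume this module, the `target` and `schema` sections above are enough to derive DDL for the output table without running the transformation (as noted in the Schema Definition section). The sketch below assumes a PostgreSQL-style mapping of the generic datatypes (`number` → `NUMERIC`, `string` → `TEXT`, `date` → `DATE`); the actual type mapping and qualification rules are tool- and dialect-specific:

```sql
-- Hypothetical DDL an OTS-compliant tool might generate for analytics.customers
-- (target database "warehouse", materialization type "table")
CREATE TABLE analytics.customers (
    id         NUMERIC,  -- Unique customer identifier
    name       TEXT,     -- Customer name
    email      TEXT,     -- Customer email address
    created_at DATE      -- Customer creation date
);

CREATE INDEX idx_customers_id    ON analytics.customers (id);
CREATE INDEX idx_customers_email ON analytics.customers (email);
```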
383 | 384 | ## Materialization Types 385 | 386 | ### Incremental Materialization 387 | 388 | Incremental materialization updates only changed data using one of three strategies: 389 | - **delete_insert**: Deletes rows matching a condition and inserts new data 390 | - **append**: Simply appends new data without removing existing rows 391 | - **merge**: Performs an upsert operation using a merge statement 392 | 393 | #### Delete-Insert Strategy 394 | 395 | ```yaml 396 | materialization: 397 | type: "incremental" 398 | incremental_details: 399 | strategy: "delete_insert" 400 | delete_condition: "to_date(updated_ts) = '@start_date'" 401 | filter_condition: "to_date(updated_ts) = '@start_date'" 402 | ``` 403 | 404 | #### Append Strategy 405 | 406 | ```yaml 407 | materialization: 408 | type: "incremental" 409 | incremental_details: 410 | strategy: "append" 411 | filter_condition: "created_at >= '@start_date'" 412 | ``` 413 | 414 | #### Merge Strategy 415 | 416 | ```yaml 417 | materialization: 418 | type: "incremental" 419 | incremental_details: 420 | strategy: "merge" 421 | merge_key: ["customer_id"] # Primary key columns for matching records 422 | filter_condition: "updated_at >= '@start_date'" 423 | update_columns: ["name", "email"] # Optional: specific columns to update 424 | ``` 425 | 426 | ### SCD2 Materialization 427 | 428 | SCD2 (Slowly Changing Dimension Type 2) materialization tracks historical changes with valid date ranges. It requires a unique key to identify records. 429 | 430 | ```yaml 431 | materialization: 432 | type: "scd2" 433 | scd2_details: 434 | unique_key: ["product_id"] # Primary key or unique identifier 435 | start_column: "valid_from" # Optional, defaults to "valid_from" 436 | end_column: "valid_to" # Optional, defaults to "valid_to" 437 | ``` 438 | 439 | ### Schema Column Definition 440 | 441 | A schema column in an OTD defines the structure and properties of a single column in the output table: 442 | 443 | ```yaml 444 | columns: 445 | - name: string # Column name 446 | datatype: string # Data type ("number", "string", "date", etc.) 447 | description: string # Column description 448 | ``` 449 | 450 | **Common Data Types:** 451 | - `number`: Numeric values 452 | - `string`: Text values 453 | - `date`: Date and timestamp values 454 | - `boolean`: True/false values 455 | - `array`: Array of values 456 | - `object`: Complex nested objects 457 | 458 | ## Data Quality Tests 459 | 460 | Data quality tests are validation rules that ensure the correctness and quality of transformation outputs. Tests can be defined at two levels: 461 | - **Column-level tests**: Applied to individual columns (e.g., `not_null`, `unique`) 462 | - **Table-level tests**: Applied to the entire output (e.g., `row_count_gt_0`, `unique`) 463 | 464 | Tests enable automated data quality validation without manual inspection. OTS supports three types of tests: 465 | 466 | 1. **Standard Tests**: Built-in tests defined in the OTS specification (e.g., `not_null`, `unique`, `row_count_gt_0`) 467 | 2. **Generic SQL Tests**: Reusable SQL tests with placeholders that can be applied to multiple transformations 468 | 3. **Singular SQL Tests**: Table-specific SQL tests with hardcoded table references 469 | 470 | ### Standard Tests 471 | 472 | Standard tests are built into the OTS specification and must be implemented by all OTS-compliant tools. These tests provide common data quality checks that are widely applicable across different transformations. 
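As a rough illustration of the execution convention shared by the tests listed below (a test query returns rows only for violations, so zero rows means the test passes), an OTS-compliant tool might compile a `not_null` test on `analytics.customers.id` into a query along these lines; the exact SQL generated is tool-specific:

```sql
-- Hypothetical check generated for the standard test not_null on analytics.customers.id;
-- any returned row is a violation, so the test fails if the result set is non-empty.
SELECT id
FROM analytics.customers
WHERE id IS NULL;
```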
473 | 474 | #### Column-Level Standard Tests 475 | 476 | **`not_null`** 477 | - **Description**: Ensures a column contains no NULL values 478 | - **Level**: Column 479 | - **Parameters**: None 480 | - **Implementation**: Returns rows where the column is NULL (test fails if any rows returned) 481 | - **Example**: 482 | ```yaml 483 | tests: 484 | columns: 485 | id: ["not_null"] 486 | ``` 487 | 488 | **`unique`** 489 | - **Description**: Ensures column values are unique across all rows 490 | - **Level**: Column or Table 491 | - **Parameters**: 492 | - `columns` (array, optional): For table-level tests, specifies which columns to check for uniqueness. If omitted at table level, checks all columns (entire row uniqueness) 493 | - **Implementation**: Returns duplicate values (test fails if any duplicates found) 494 | - **Examples**: 495 | ```yaml 496 | tests: 497 | columns: 498 | # Column-level: single column uniqueness 499 | id: ["not_null", "unique"] 500 | 501 | table: 502 | # Table-level: composite uniqueness on specific columns 503 | - name: "unique" 504 | params: 505 | columns: ["customer_id", "order_date"] 506 | 507 | # Table-level: entire row uniqueness (all columns) 508 | - "unique" 509 | ``` 510 | 511 | **`accepted_values`** 512 | - **Description**: Ensures column values are within a specified list of acceptable values 513 | - **Level**: Column 514 | - **Parameters**: 515 | - `values` (array, required): List of acceptable values 516 | - **Implementation**: Returns rows where column value is not in the accepted list 517 | - **Example**: 518 | ```yaml 519 | tests: 520 | columns: 521 | status: 522 | - name: "accepted_values" 523 | params: 524 | values: ["active", "inactive", "pending"] 525 | ``` 526 | 527 | **`relationships`** 528 | - **Description**: Ensures referential integrity between tables (foreign key validation) 529 | - **Level**: Column 530 | - **Parameters**: 531 | - `to` (string, required): Target transformation ID (e.g., "analytics.customers") 532 | - `field` (string, required): Column name in the target transformation 533 | - **Implementation**: Returns rows where the column value doesn't exist in the target table's specified field 534 | - **Example**: 535 | ```yaml 536 | tests: 537 | columns: 538 | customer_id: 539 | - name: "relationships" 540 | params: 541 | to: "analytics.customers" 542 | field: "id" 543 | ``` 544 | 545 | #### Table-Level Standard Tests 546 | 547 | **`row_count_gt_0`** 548 | - **Description**: Ensures the table has at least one row 549 | - **Level**: Table 550 | - **Parameters**: None 551 | - **Implementation**: Returns a count result (test fails if count = 0) 552 | - **Example**: 553 | ```yaml 554 | tests: 555 | table: 556 | - "row_count_gt_0" 557 | ``` 558 | 559 | ### Test Libraries 560 | 561 | Test libraries are project-level collections of custom test definitions (generic and singular SQL tests) that can be shared across multiple OTS modules. For a detailed introduction to Test Libraries, see the [Test Library](#test-library) section in Core Concepts. 562 | 563 | #### Test Library Structure 564 | 565 | A test library is a YAML or JSON file that defines reusable test definitions. The file can be named anything (e.g., `test_library.yaml`, `tests.yaml`, `data_quality_tests.json`), but must follow the structure below. 
566 | 567 | **Test Library File Structure:** 568 | ```yaml 569 | # test_library.yaml 570 | test_library_version: string # Optional: Version identifier for the test library (e.g., "1.0", "2.1") 571 | description: string # Optional: Human-readable description of the test library 572 | 573 | generic_tests: 574 | check_minimum_rows: 575 | type: "sql" 576 | level: "table" 577 | description: "Ensures table has minimum number of rows" 578 | sql: | 579 | SELECT 1 as violation 580 | FROM @table_name 581 | GROUP BY 1 582 | HAVING COUNT(*) < @min_rows:10 583 | parameters: 584 | min_rows: 585 | type: "number" 586 | default: 10 587 | description: "Minimum number of rows required" 588 | 589 | column_not_negative: 590 | type: "sql" 591 | level: "column" 592 | description: "Ensures numeric column has no negative values" 593 | sql: | 594 | SELECT @column_name 595 | FROM @table_name 596 | WHERE @column_name < 0 597 | parameters: [] 598 | 599 | singular_tests: 600 | test_customers_email_format: 601 | type: "sql" 602 | level: "table" 603 | description: "Validates email format for customers table" 604 | sql: | 605 | SELECT id, email 606 | FROM analytics.customers 607 | WHERE email NOT LIKE '%@%.%' 608 | target_transformation: "analytics.customers" 609 | ``` 610 | 611 | #### Generic SQL Tests 612 | 613 | Generic SQL tests are reusable tests that use placeholders (variables) to make them applicable to multiple transformations. They follow the dbt pattern where: 614 | - The query returns rows when the test fails 615 | - 0 rows returned = test passes 616 | - 1+ rows returned = test fails 617 | 618 | **Placeholders:** 619 | - `@table_name` or `{{ table_name }}`: Replaced with the fully qualified transformation ID. The `@` syntax is recommended for cleaner SQL. 620 | - `@column_name` or `{{ column_name }}`: Replaced with the column name (for column-level tests). The `@` syntax is recommended. 621 | - Custom parameters: Available as `@parameter_name` or `{{ parameter_name }}` with optional defaults using `@param:default` syntax (e.g., `@min_rows:10`) 622 | 623 | **Structure:** 624 | ```yaml 625 | generic_tests: 626 | test_name: # Required: Unique test name (used for referencing) 627 | type: "sql" # Required: Always "sql" for SQL tests 628 | level: "table" | "column" # Required: Test level 629 | description: string # Optional: Human-readable description 630 | sql: string # Required: SQL query (returns rows on failure) 631 | parameters: # Optional: Parameter definitions 632 | param_name: 633 | type: "number" | "string" | "boolean" | "array" # Required: Parameter type 634 | default: value # Optional: Default value 635 | description: string # Optional: Parameter description 636 | ``` 637 | 638 | **Example Generic Test:** 639 | ```yaml 640 | check_minimum_rows: 641 | type: "sql" 642 | level: "table" 643 | description: "Ensures table has minimum number of rows" 644 | sql: | 645 | SELECT 1 as violation 646 | FROM @table_name 647 | GROUP BY 1 648 | HAVING COUNT(*) < @min_rows:10 649 | parameters: 650 | min_rows: 651 | type: "number" 652 | default: 10 653 | description: "Minimum number of rows required" 654 | ``` 655 | 656 | #### Singular SQL Tests 657 | 658 | Singular SQL tests are table-specific tests with hardcoded table references. 
They are useful for: 659 | - Complex business logic specific to one transformation 660 | - Tests that reference multiple tables 661 | - Table-specific validation rules 662 | 663 | **Structure:** 664 | ```yaml 665 | singular_tests: 666 | test_name: # Required: Unique test name (used for referencing) 667 | type: "sql" # Required: Always "sql" for SQL tests 668 | level: "table" | "column" # Required: Test level 669 | description: string # Optional: Human-readable description 670 | sql: string # Required: SQL query with hardcoded table names 671 | target_transformation: string # Required: Transformation ID this test applies to (used for validation and discovery) 672 | ``` 673 | 674 | **Example Singular Test:** 675 | ```yaml 676 | test_customers_email_format: 677 | type: "sql" 678 | level: "table" 679 | description: "Validates email format for customers table" 680 | sql: | 681 | SELECT id, email 682 | FROM analytics.customers 683 | WHERE email NOT LIKE '%@%.%' 684 | target_transformation: "analytics.customers" 685 | ``` 686 | 687 | ### Referencing Tests in Transformations 688 | 689 | Transformations reference tests from: 690 | 1. **Standard tests**: Referenced by name (e.g., `"not_null"`, `"unique"`) 691 | 2. **Test library tests**: Referenced by name from the test library (e.g., `"check_minimum_rows"`) 692 | 693 | **Module Structure with Test Library Reference:** 694 | 695 | ```yaml 696 | ots_version: "0.1.0" 697 | module_name: "analytics_customers" 698 | test_library_path: "../tests/test_library.yaml" # Optional: Path to test library 699 | 700 | target: 701 | database: "warehouse" 702 | schema: "analytics" 703 | 704 | transformations: 705 | - transformation_id: "analytics.customers" 706 | tests: 707 | columns: 708 | id: 709 | - "not_null" # Standard test 710 | - "unique" # Standard test (column-level) 711 | email: 712 | - "not_null" 713 | - name: "accepted_values" # Standard test with params 714 | params: 715 | values: ["gmail.com", "yahoo.com"] 716 | amount: 717 | - name: "column_not_negative" # Generic test from library 718 | table: 719 | - "row_count_gt_0" # Standard test 720 | - "unique" # Standard test (table-level, checks all columns) 721 | - name: "unique" # Standard test (table-level, composite on specific columns) 722 | params: 723 | columns: ["customer_id", "order_date"] 724 | - name: "check_minimum_rows" # Generic test with params 725 | params: 726 | min_rows: 100 727 | - "test_customers_email_format" # Singular test from library 728 | ``` 729 | 730 | **Test Reference Formats:** 731 | 732 | 1. **Simple string** (standard test, no parameters): 733 | ```yaml 734 | tests: 735 | columns: 736 | id: ["not_null", "unique"] 737 | table: 738 | - "row_count_gt_0" 739 | ``` 740 | 741 | 2. **Object with name** (standard test with parameters): 742 | ```yaml 743 | tests: 744 | columns: 745 | status: 746 | - name: "accepted_values" 747 | params: 748 | values: ["active", "inactive"] 749 | ``` 750 | 751 | 3. **Object with name** (generic/singular test from library): 752 | ```yaml 753 | tests: 754 | table: 755 | - name: "check_minimum_rows" 756 | params: 757 | min_rows: 100 758 | ``` 759 | 760 | ### Test Execution Model 761 | 762 | Tests follow the dbt execution model: 763 | - **0 rows returned** = test passes 764 | - **1+ rows returned** = test fails 765 | 766 | For standard tests, tools generate SQL queries that return violating rows. For SQL tests (generic and singular), the SQL query itself returns rows when violations are found. 
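The placeholder conventions described above (`@table_name`, `@column_name`, `@param:default`) are simple enough to resolve with string substitution. The snippet below is a hedged sketch of how a tool might expand a generic test and evaluate the result; the substitution order and the regular expression are illustrative choices, not requirements of the specification.

```python
import re

def expand_generic_test(sql: str, table: str, column: str | None = None,
                        params: dict | None = None) -> str:
    """Expand @-style placeholders in a generic SQL test (illustrative)."""
    params = dict(params or {})
    # Resolve the built-in placeholders first so they are not mistaken for parameters.
    sql = sql.replace("@table_name", table)
    if column is not None:
        sql = sql.replace("@column_name", column)

    # @param or @param:default -> supplied value, else the inline default.
    def substitute(match: re.Match) -> str:
        name, default = match.group(1), match.group(2)
        value = params.get(name, default)
        if value is None:
            raise ValueError(f"Missing value for parameter '{name}'")
        return str(value)

    return re.sub(r"@(\w+)(?::(\w+))?", substitute, sql)

def test_passes(rows_returned: int) -> bool:
    # dbt-style execution model: 0 rows = pass, 1+ rows = fail.
    return rows_returned == 0

expanded = expand_generic_test(
    "SELECT 1 AS violation FROM @table_name GROUP BY 1 HAVING COUNT(*) < @min_rows:10",
    table="warehouse.analytics.customers",
    params={"min_rows": 100},
)
```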
767 | 768 | **Test Severity:** 769 | - Tests can have a `severity` level: `"error"` (default) or `"warning"` 770 | - `error`: Test failure stops execution and fails the build 771 | - `warning`: Test failure is logged but doesn't stop execution 772 | 773 | **Severity in Test References:** 774 | ```yaml 775 | tests: 776 | columns: 777 | id: 778 | - name: "not_null" 779 | severity: "error" # Default, can be omitted 780 | - name: "unique" 781 | severity: "warning" # Non-blocking 782 | table: 783 | - name: "row_count_gt_0" 784 | severity: "error" # Default, can be omitted 785 | ``` 786 | 787 | ### Inline Test Definitions in OTS Modules 788 | 789 | Generic and singular SQL tests can also be defined directly within an OTS Module, using the same structure as test libraries. This is useful for module-specific tests that don't need to be shared across modules. 790 | 791 | **Module Structure with Inline Tests:** 792 | ```yaml 793 | ots_version: "0.1.0" 794 | module_name: "analytics_customers" 795 | 796 | # Optional: Inline test definitions (same structure as test library) 797 | generic_tests: 798 | check_recent_data: 799 | type: "sql" 800 | level: "table" 801 | description: "Ensures table has recent data" 802 | sql: | 803 | SELECT 1 as violation 804 | FROM @table_name 805 | WHERE updated_at < CURRENT_DATE - INTERVAL '@days:7' DAY 806 | parameters: 807 | days: 808 | type: "number" 809 | default: 7 810 | 811 | singular_tests: 812 | test_customers_specific: 813 | type: "sql" 814 | level: "table" 815 | description: "Module-specific test" 816 | sql: | 817 | SELECT id FROM analytics.customers WHERE status = 'invalid' 818 | target_transformation: "analytics.customers" 819 | 820 | target: 821 | database: "warehouse" 822 | schema: "analytics" 823 | 824 | transformations: 825 | - transformation_id: "analytics.customers" 826 | tests: 827 | table: 828 | - name: "check_recent_data" # References inline generic test 829 | params: 830 | days: 3 831 | - "test_customers_specific" # References inline singular test 832 | ``` 833 | 834 | **Test Resolution Priority:** 835 | When resolving test names, tools should check in the following order: 836 | 1. **Standard tests** (built into OTS specification) 837 | 2. **Inline tests** (defined in the current OTS Module) 838 | 3. **Test library tests** (from referenced test library) 839 | 840 | If a test name exists in multiple locations, the first match takes precedence. This allows modules to override test library tests with module-specific implementations. 841 | 842 | ### Test Library Resolution 843 | 844 | When a transformation module references a test library: 845 | 1. The tool resolves the `test_library_path` (relative to the module file or absolute path) 846 | 2. Loads the test library file (YAML or JSON format) 847 | 3. Validates test definitions 848 | 4. Makes tests available for reference in transformations (after inline tests) 849 | 850 | **Test Discovery:** 851 | - **Standard tests**: Always available, no discovery needed 852 | - **Generic tests**: Discovered from test library or inline module definitions 853 | - **Singular tests**: Discovered from test library or inline module definitions. The `target_transformation` field helps tools validate that the test is applied to the correct transformation. 854 | 855 | If a test is referenced but not found among the Standard tests, inline tests, or Test library, it must result in an error. 856 | 857 | ## Complete Examples: Incremental Strategies 858 | 859 | ### Delete-Insert Example 860 | 861 |
862 | YAML Format 863 | 864 | ```yaml 865 | ots_version: "0.1.0" 866 | transformation_id: "analytics.recent_orders" 867 | description: "Orders updated in the last 7 days" 868 | 869 | transformation_type: "sql" 870 | code: 871 | sql: 872 | original_sql: "SELECT order_id, customer_id, order_date, amount, status FROM source.orders WHERE updated_at >= '@start_date'" 873 | resolved_sql: "SELECT order_id, customer_id, order_date, amount, status FROM warehouse.source.orders WHERE updated_at >= '@start_date'" 874 | source_tables: ["source.orders"] 875 | 876 | schema: 877 | columns: 878 | - name: "order_id" 879 | datatype: "number" 880 | description: "Unique order identifier" 881 | - name: "customer_id" 882 | datatype: "number" 883 | description: "Customer ID" 884 | - name: "order_date" 885 | datatype: "date" 886 | description: "Order date" 887 | - name: "amount" 888 | datatype: "number" 889 | description: "Order amount" 890 | - name: "status" 891 | datatype: "string" 892 | description: "Order status" 893 | partitioning: ["order_date"] 894 | indexes: 895 | - name: "idx_order_id" 896 | columns: ["order_id"] 897 | 898 | materialization: 899 | type: "incremental" 900 | incremental_details: 901 | strategy: "delete_insert" 902 | delete_condition: "to_date(updated_at) = '@start_date'" 903 | filter_condition: "to_date(updated_at) = '@start_date'" 904 | 905 | tests: 906 | columns: 907 | order_id: ["not_null", "unique"] 908 | order_date: ["not_null"] 909 | table: ["row_count_gt_0"] 910 | 911 | metadata: 912 | file_path: "/models/analytics/recent_orders.sql" 913 | owner: "analytics-team" 914 | tags: ["orders", "incremental"] 915 | ``` 916 | 917 |
918 | 919 |
920 | JSON Format 921 | 922 | ```json 923 | { 924 | "ots_version": "0.1.0", 925 | "transformation_id": "analytics.recent_orders", 926 | "description": "Orders updated in the last 7 days", 927 | 928 | "transformation_type": "sql", 929 | "code": { 930 | "sql": { 931 | "original_sql": "SELECT order_id, customer_id, order_date, amount, status FROM source.orders WHERE updated_at >= '@start_date'", 932 | "resolved_sql": "SELECT order_id, customer_id, order_date, amount, status FROM warehouse.source.orders WHERE updated_at >= '@start_date'", 933 | "source_tables": ["source.orders"] 934 | } 935 | }, 936 | 937 | "schema": { 938 | "columns": [ 939 | { 940 | "name": "order_id", 941 | "datatype": "number", 942 | "description": "Unique order identifier" 943 | }, 944 | { 945 | "name": "customer_id", 946 | "datatype": "number", 947 | "description": "Customer ID" 948 | }, 949 | { 950 | "name": "order_date", 951 | "datatype": "date", 952 | "description": "Order date" 953 | }, 954 | { 955 | "name": "amount", 956 | "datatype": "number", 957 | "description": "Order amount" 958 | }, 959 | { 960 | "name": "status", 961 | "datatype": "string", 962 | "description": "Order status" 963 | } 964 | ], 965 | "partitioning": ["order_date"], 966 | "indexes": [ 967 | { 968 | "name": "idx_order_id", 969 | "columns": ["order_id"] 970 | } 971 | ] 972 | }, 973 | 974 | "materialization": { 975 | "type": "incremental", 976 | "incremental_details": { 977 | "strategy": "delete_insert", 978 | "delete_condition": "to_date(updated_at) = '@start_date'", 979 | "filter_condition": "to_date(updated_at) = '@start_date'" 980 | } 981 | }, 982 | 983 | "tests": { 984 | "columns": { 985 | "order_id": ["not_null", "unique"], 986 | "order_date": ["not_null"] 987 | }, 988 | "table": ["row_count_gt_0"] 989 | }, 990 | 991 | "metadata": { 992 | "file_path": "/models/analytics/recent_orders.sql", 993 | "owner": "analytics-team", 994 | "tags": ["orders", "incremental"] 995 | } 996 | } 997 | ``` 998 | 999 |
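For orientation, a delete_insert definition like the one above typically translates into two statements run together: a DELETE driven by `delete_condition`, followed by an INSERT of the transformation query. The sketch below shows one plausible rendering; the fully qualified target table name, the transaction handling, and the binding of `@start_date` are assumptions that vary by tool and SQL dialect.

```python
def render_delete_insert(target_table: str, resolved_sql: str,
                         delete_condition: str, start_date: str) -> list[str]:
    """Illustrative delete+insert rendering for the definition above."""
    def bind(sql: str) -> str:
        return sql.replace("@start_date", start_date)

    # Many tools also push `filter_condition` into the source query; in this example
    # the resolved SQL already filters on '@start_date', so it is used as-is.
    return [
        "BEGIN",
        f"DELETE FROM {target_table} WHERE {bind(delete_condition)}",
        f"INSERT INTO {target_table} {bind(resolved_sql)}",
        "COMMIT",
    ]


statements = render_delete_insert(
    target_table="warehouse.analytics.recent_orders",
    resolved_sql=(
        "SELECT order_id, customer_id, order_date, amount, status "
        "FROM warehouse.source.orders WHERE updated_at >= '@start_date'"
    ),
    delete_condition="to_date(updated_at) = '@start_date'",
    start_date="2024-01-01",
)
```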
1000 | 1001 | ### Append Example 1002 | 1003 |
1004 | YAML Format 1005 | 1006 | ```yaml 1007 | ots_version: "0.1.0" 1008 | transformation_id: "logs.event_stream" 1009 | description: "Append-only event log" 1010 | 1011 | transformation_type: "sql" 1012 | code: 1013 | sql: 1014 | original_sql: "SELECT event_id, timestamp, user_id, event_type, payload FROM source.events WHERE timestamp >= '@start_date'" 1015 | resolved_sql: "SELECT event_id, timestamp, user_id, event_type, payload FROM warehouse.source.events WHERE timestamp >= '@start_date'" 1016 | source_tables: ["source.events"] 1017 | 1018 | schema: 1019 | columns: 1020 | - name: "event_id" 1021 | datatype: "string" 1022 | description: "Unique event identifier" 1023 | - name: "timestamp" 1024 | datatype: "date" 1025 | description: "Event timestamp" 1026 | - name: "user_id" 1027 | datatype: "string" 1028 | description: "User who triggered the event" 1029 | - name: "event_type" 1030 | datatype: "string" 1031 | description: "Type of event" 1032 | - name: "payload" 1033 | datatype: "object" 1034 | description: "Event payload data" 1035 | partitioning: ["timestamp"] 1036 | indexes: 1037 | - name: "idx_timestamp" 1038 | columns: ["timestamp"] 1039 | - name: "idx_user_id" 1040 | columns: ["user_id"] 1041 | 1042 | materialization: 1043 | type: "incremental" 1044 | incremental_details: 1045 | strategy: "append" 1046 | filter_condition: "timestamp >= '@start_date'" 1047 | 1048 | tests: 1049 | columns: 1050 | event_id: ["not_null", "unique"] 1051 | timestamp: ["not_null"] 1052 | table: ["row_count_gt_0"] 1053 | 1054 | metadata: 1055 | file_path: "/models/logs/event_stream.sql" 1056 | owner: "data-engineering" 1057 | tags: ["events", "append-only"] 1058 | ``` 1059 | 1060 |
1061 | 1062 |
1063 | JSON Format 1064 | 1065 | ```json 1066 | { 1067 | "ots_version": "0.1.0", 1068 | "transformation_id": "logs.event_stream", 1069 | "description": "Append-only event log", 1070 | 1071 | "transformation_type": "sql", 1072 | "code": { 1073 | "sql": { 1074 | "original_sql": "SELECT event_id, timestamp, user_id, event_type, payload FROM source.events WHERE timestamp >= '@start_date'", 1075 | "resolved_sql": "SELECT event_id, timestamp, user_id, event_type, payload FROM warehouse.source.events WHERE timestamp >= '@start_date'", 1076 | "source_tables": ["source.events"] 1077 | } 1078 | }, 1079 | 1080 | "schema": { 1081 | "columns": [ 1082 | { 1083 | "name": "event_id", 1084 | "datatype": "string", 1085 | "description": "Unique event identifier" 1086 | }, 1087 | { 1088 | "name": "timestamp", 1089 | "datatype": "date", 1090 | "description": "Event timestamp" 1091 | }, 1092 | { 1093 | "name": "user_id", 1094 | "datatype": "string", 1095 | "description": "User who triggered the event" 1096 | }, 1097 | { 1098 | "name": "event_type", 1099 | "datatype": "string", 1100 | "description": "Type of event" 1101 | }, 1102 | { 1103 | "name": "payload", 1104 | "datatype": "object", 1105 | "description": "Event payload data" 1106 | } 1107 | ], 1108 | "partitioning": ["timestamp"], 1109 | "indexes": [ 1110 | { 1111 | "name": "idx_timestamp", 1112 | "columns": ["timestamp"] 1113 | }, 1114 | { 1115 | "name": "idx_user_id", 1116 | "columns": ["user_id"] 1117 | } 1118 | ] 1119 | }, 1120 | 1121 | "materialization": { 1122 | "type": "incremental", 1123 | "incremental_details": { 1124 | "strategy": "append", 1125 | "filter_condition": "timestamp >= '@start_date'" 1126 | } 1127 | }, 1128 | 1129 | "tests": { 1130 | "columns": { 1131 | "event_id": ["not_null", "unique"], 1132 | "timestamp": ["not_null"] 1133 | }, 1134 | "table": ["row_count_gt_0"] 1135 | }, 1136 | 1137 | "metadata": { 1138 | "file_path": "/models/logs/event_stream.sql", 1139 | "owner": "data-engineering", 1140 | "tags": ["events", "append-only"] 1141 | } 1142 | } 1143 | ``` 1144 | 1145 |
1146 | 1147 | ### Merge Example 1148 | 1149 |
1150 | YAML Format 1151 | 1152 | ```yaml 1153 | ots_version: "0.1.0" 1154 | transformation_id: "product.master_data" 1155 | description: "Customer master data with upsert logic" 1156 | 1157 | transformation_type: "sql" 1158 | code: 1159 | sql: 1160 | original_sql: "SELECT customer_id, name, email, phone, updated_at FROM source.customers WHERE updated_at >= '@start_date'" 1161 | resolved_sql: "SELECT customer_id, name, email, phone, updated_at FROM warehouse.source.customers WHERE updated_at >= '@start_date'" 1162 | source_tables: ["source.customers"] 1163 | 1164 | schema: 1165 | columns: 1166 | - name: "customer_id" 1167 | datatype: "number" 1168 | description: "Unique customer identifier" 1169 | - name: "name" 1170 | datatype: "string" 1171 | description: "Customer name" 1172 | - name: "email" 1173 | datatype: "string" 1174 | description: "Customer email" 1175 | - name: "phone" 1176 | datatype: "string" 1177 | description: "Customer phone number" 1178 | - name: "updated_at" 1179 | datatype: "date" 1180 | description: "Last update timestamp" 1181 | partitioning: [] 1182 | indexes: 1183 | - name: "idx_customer_id" 1184 | columns: ["customer_id"] 1185 | - name: "idx_email" 1186 | columns: ["email"] 1187 | 1188 | materialization: 1189 | type: "incremental" 1190 | incremental_details: 1191 | strategy: "merge" 1192 | filter_condition: "updated_at >= '@start_date'" 1193 | merge_key: ["customer_id"] 1194 | update_columns: ["name", "email", "phone", "updated_at"] 1195 | 1196 | tests: 1197 | columns: 1198 | customer_id: ["not_null", "unique"] 1199 | email: ["not_null"] 1200 | table: ["row_count_gt_0", "unique"] # unique at table level checks all columns for row uniqueness 1201 | 1202 | metadata: 1203 | file_path: "/models/product/master_data.sql" 1204 | owner: "product-team" 1205 | tags: ["customers", "master-data"] 1206 | ``` 1207 | 1208 |
1209 | 1210 |
1211 | JSON Format 1212 | 1213 | ```json 1214 | { 1215 | "ots_version": "0.1.0", 1216 | "transformation_id": "product.master_data", 1217 | "description": "Customer master data with upsert logic", 1218 | 1219 | "transformation_type": "sql", 1220 | "code": { 1221 | "sql": { 1222 | "original_sql": "SELECT customer_id, name, email, phone, updated_at FROM source.customers WHERE updated_at >= '@start_date'", 1223 | "resolved_sql": "SELECT customer_id, name, email, phone, updated_at FROM warehouse.source.customers WHERE updated_at >= '@start_date'", 1224 | "source_tables": ["source.customers"] 1225 | } 1226 | }, 1227 | 1228 | "schema": { 1229 | "columns": [ 1230 | { 1231 | "name": "customer_id", 1232 | "datatype": "number", 1233 | "description": "Unique customer identifier" 1234 | }, 1235 | { 1236 | "name": "name", 1237 | "datatype": "string", 1238 | "description": "Customer name" 1239 | }, 1240 | { 1241 | "name": "email", 1242 | "datatype": "string", 1243 | "description": "Customer email" 1244 | }, 1245 | { 1246 | "name": "phone", 1247 | "datatype": "string", 1248 | "description": "Customer phone number" 1249 | }, 1250 | { 1251 | "name": "updated_at", 1252 | "datatype": "date", 1253 | "description": "Last update timestamp" 1254 | } 1255 | ], 1256 | "partitioning": [], 1257 | "indexes": [ 1258 | { 1259 | "name": "idx_customer_id", 1260 | "columns": ["customer_id"] 1261 | }, 1262 | { 1263 | "name": "idx_email", 1264 | "columns": ["email"] 1265 | } 1266 | ] 1267 | }, 1268 | 1269 | "materialization": { 1270 | "type": "incremental", 1271 | "incremental_details": { 1272 | "strategy": "merge", 1273 | "filter_condition": "updated_at >= '@start_date'", 1274 | "merge_key": ["customer_id"], 1275 | "update_columns": ["name", "email", "phone", "updated_at"] 1276 | } 1277 | }, 1278 | 1279 | "tests": { 1280 | "columns": { 1281 | "customer_id": ["not_null", "unique"], 1282 | "email": ["not_null"] 1283 | }, 1284 | "table": ["row_count_gt_0", "unique"] 1285 | }, 1286 | 1287 | "metadata": { 1288 | "file_path": "/models/product/master_data.sql", 1289 | "owner": "product-team", 1290 | "tags": ["customers", "master-data"] 1291 | } 1292 | } 1293 | ``` 1294 | 1295 |
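On engines that support a MERGE statement, the definition above has a fairly direct translation. The following sketch is illustrative only: using `merge_key` for the ON clause and `update_columns` for the UPDATE SET list is a reasonable reading of the fields, but the exact SQL is dialect-specific and not prescribed by OTS.

```python
def render_merge(target_table: str, resolved_sql: str, merge_key: list[str],
                 update_columns: list[str], insert_columns: list[str],
                 start_date: str) -> str:
    """Illustrative MERGE rendering for the incremental 'merge' strategy above."""
    source_sql = resolved_sql.replace("@start_date", start_date)
    on_clause = " AND ".join(f"t.{k} = s.{k}" for k in merge_key)
    set_clause = ", ".join(f"t.{c} = s.{c}" for c in update_columns)
    cols = ", ".join(insert_columns)
    vals = ", ".join(f"s.{c}" for c in insert_columns)
    return (
        f"MERGE INTO {target_table} AS t USING ({source_sql}) AS s ON {on_clause} "
        f"WHEN MATCHED THEN UPDATE SET {set_clause} "
        f"WHEN NOT MATCHED THEN INSERT ({cols}) VALUES ({vals})"
    )


merge_sql = render_merge(
    target_table="warehouse.product.master_data",
    resolved_sql=(
        "SELECT customer_id, name, email, phone, updated_at "
        "FROM warehouse.source.customers WHERE updated_at >= '@start_date'"
    ),
    merge_key=["customer_id"],
    update_columns=["name", "email", "phone", "updated_at"],
    insert_columns=["customer_id", "name", "email", "phone", "updated_at"],
    start_date="2024-01-01",
)
```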
1296 | 1297 | ### SCD2 Example 1298 | 1299 |
1300 | YAML Format 1301 | 1302 | ```yaml 1303 | ots_version: "0.1.0" 1304 | transformation_id: "dim.products_scd2" 1305 | description: "Product dimension with full history tracking" 1306 | 1307 | transformation_type: "sql" 1308 | code: 1309 | sql: 1310 | original_sql: "SELECT product_id, product_name, price, category, updated_at FROM source.products WHERE updated_at >= '@start_date'" 1311 | resolved_sql: "SELECT product_id, product_name, price, category, updated_at FROM warehouse.source.products WHERE updated_at >= '@start_date'" 1312 | source_tables: ["source.products"] 1313 | 1314 | schema: 1315 | columns: 1316 | - name: "product_id" 1317 | datatype: "number" 1318 | description: "Unique product identifier" 1319 | - name: "product_name" 1320 | datatype: "string" 1321 | description: "Product name" 1322 | - name: "price" 1323 | datatype: "number" 1324 | description: "Product price" 1325 | - name: "category" 1326 | datatype: "string" 1327 | description: "Product category" 1328 | - name: "updated_at" 1329 | datatype: "date" 1330 | description: "Last update timestamp" 1331 | - name: "valid_from" 1332 | datatype: "date" 1333 | description: "Record validity start date" 1334 | - name: "valid_to" 1335 | datatype: "date" 1336 | description: "Record validity end date" 1337 | partitioning: [] 1338 | indexes: 1339 | - name: "idx_product_id" 1340 | columns: ["product_id"] 1341 | - name: "idx_valid_from" 1342 | columns: ["valid_from"] 1343 | 1344 | materialization: 1345 | type: "scd2" 1346 | scd2_details: 1347 | unique_key: ["product_id"] 1348 | start_column: "valid_from" 1349 | end_column: "valid_to" 1350 | 1351 | tests: 1352 | columns: 1353 | product_id: ["not_null", "unique"] 1354 | valid_from: ["not_null"] 1355 | table: ["row_count_gt_0"] 1356 | 1357 | metadata: 1358 | file_path: "/models/dim/products_scd2.sql" 1359 | owner: "data-engineering" 1360 | tags: ["products", "scd2", "dimension"] 1361 | ``` 1362 | 1363 |
1364 | 1365 |
1366 | JSON Format 1367 | 1368 | ```json 1369 | { 1370 | "ots_version": "0.1.0", 1371 | "transformation_id": "dim.products_scd2", 1372 | "description": "Product dimension with full history tracking", 1373 | 1374 | "transformation_type": "sql", 1375 | "code": { 1376 | "sql": { 1377 | "original_sql": "SELECT product_id, product_name, price, category, updated_at FROM source.products WHERE updated_at >= '@start_date'", 1378 | "resolved_sql": "SELECT product_id, product_name, price, category, updated_at FROM warehouse.source.products WHERE updated_at >= '@start_date'", 1379 | "source_tables": ["source.products"] 1380 | } 1381 | }, 1382 | 1383 | "schema": { 1384 | "columns": [ 1385 | { 1386 | "name": "product_id", 1387 | "datatype": "number", 1388 | "description": "Unique product identifier" 1389 | }, 1390 | { 1391 | "name": "product_name", 1392 | "datatype": "string", 1393 | "description": "Product name" 1394 | }, 1395 | { 1396 | "name": "price", 1397 | "datatype": "number", 1398 | "description": "Product price" 1399 | }, 1400 | { 1401 | "name": "category", 1402 | "datatype": "string", 1403 | "description": "Product category" 1404 | }, 1405 | { 1406 | "name": "updated_at", 1407 | "datatype": "date", 1408 | "description": "Last update timestamp" 1409 | }, 1410 | { 1411 | "name": "valid_from", 1412 | "datatype": "date", 1413 | "description": "Record validity start date" 1414 | }, 1415 | { 1416 | "name": "valid_to", 1417 | "datatype": "date", 1418 | "description": "Record validity end date" 1419 | } 1420 | ], 1421 | "partitioning": [], 1422 | "indexes": [ 1423 | { 1424 | "name": "idx_product_id", 1425 | "columns": ["product_id"] 1426 | }, 1427 | { 1428 | "name": "idx_valid_from", 1429 | "columns": ["valid_from"] 1430 | } 1431 | ] 1432 | }, 1433 | 1434 | "materialization": { 1435 | "type": "scd2", 1436 | "scd2_details": { 1437 | "unique_key": ["product_id"], 1438 | "start_column": "valid_from", 1439 | "end_column": "valid_to" 1440 | } 1441 | }, 1442 | 1443 | "tests": { 1444 | "columns": { 1445 | "product_id": ["not_null", "unique"], 1446 | "valid_from": ["not_null"] 1447 | }, 1448 | "table": ["row_count_gt_0"] 1449 | }, 1450 | 1451 | "metadata": { 1452 | "file_path": "/models/dim/products_scd2.sql", 1453 | "owner": "data-engineering", 1454 | "tags": ["products", "scd2", "dimension"] 1455 | } 1456 | } 1457 | ``` 1458 | 1459 |
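For SCD2, the specification only declares the unique key and the validity columns; how rows are versioned is left to the tool. As a rough, hedged sketch of the usual pattern (close the current version of a changed record, then insert the new version), a tool might emit something along these lines. The change-detection logic, the PostgreSQL-flavoured `UPDATE ... FROM` form, and the use of NULL for the open-ended `valid_to` value are illustrative assumptions.

```python
def render_scd2(target_table: str, source_sql: str, unique_key: list[str],
                tracked_columns: list[str], start_column: str = "valid_from",
                end_column: str = "valid_to") -> list[str]:
    """Illustrative two-step SCD2 load for the definition above."""
    key_join = " AND ".join(f"t.{k} = s.{k}" for k in unique_key)
    changed = " OR ".join(f"t.{c} IS DISTINCT FROM s.{c}" for c in tracked_columns)
    select_cols = ", ".join(unique_key + tracked_columns)
    return [
        # Step 1: close the currently valid row for records whose tracked columns changed.
        f"UPDATE {target_table} t SET {end_column} = CURRENT_DATE "
        f"FROM ({source_sql}) s WHERE {key_join} AND t.{end_column} IS NULL AND ({changed})",
        # Step 2: insert a new open-ended version for new or changed records.
        f"INSERT INTO {target_table} ({select_cols}, {start_column}, {end_column}) "
        f"SELECT {select_cols}, CURRENT_DATE, NULL FROM ({source_sql}) s "
        f"WHERE NOT EXISTS (SELECT 1 FROM {target_table} t "
        f"WHERE {key_join} AND t.{end_column} IS NULL)",
    ]
```

Some implementations use a far-future date instead of NULL for open-ended rows, or add a current-row flag; the specification text above does not prescribe either choice.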
1460 | 1461 | ## Complete Example: Test Library and Module 1462 | 1463 | This example demonstrates a complete setup with a test library and a transformation module that uses both standard and custom tests. 1464 | 1465 | ### Test Library Example 1466 | 1467 |
1468 | YAML Format 1469 | 1470 | ```yaml 1471 | # tests/test_library.yaml 1472 | test_library_version: "1.0" 1473 | description: "Shared data quality tests for analytics project" 1474 | 1475 | generic_tests: 1476 | check_minimum_rows: 1477 | type: "sql" 1478 | level: "table" 1479 | description: "Ensures table has minimum number of rows" 1480 | sql: | 1481 | SELECT 1 as violation 1482 | FROM @table_name 1483 | GROUP BY 1 1484 | HAVING COUNT(*) < @min_rows:10 1485 | parameters: 1486 | min_rows: 1487 | type: "number" 1488 | default: 10 1489 | description: "Minimum number of rows required" 1490 | 1491 | column_not_negative: 1492 | type: "sql" 1493 | level: "column" 1494 | description: "Ensures numeric column has no negative values" 1495 | sql: | 1496 | SELECT @column_name 1497 | FROM @table_name 1498 | WHERE @column_name < 0 1499 | parameters: [] 1500 | 1501 | singular_tests: 1502 | test_customers_email_format: 1503 | type: "sql" 1504 | level: "table" 1505 | description: "Validates email format for customers table" 1506 | sql: | 1507 | SELECT id, email 1508 | FROM analytics.customers 1509 | WHERE email NOT LIKE '%@%.%' 1510 | target_transformation: "analytics.customers" 1511 | ``` 1512 | 1513 |
1514 | 1515 |
1516 | JSON Format 1517 | 1518 | ```json 1519 | { 1520 | "test_library_version": "1.0", 1521 | "description": "Shared data quality tests for analytics project", 1522 | "generic_tests": { 1523 | "check_minimum_rows": { 1524 | "type": "sql", 1525 | "level": "table", 1526 | "description": "Ensures table has minimum number of rows", 1527 | "sql": "SELECT 1 as violation\nFROM @table_name\nGROUP BY 1\nHAVING COUNT(*) < @min_rows:10", 1528 | "parameters": { 1529 | "min_rows": { 1530 | "type": "number", 1531 | "default": 10, 1532 | "description": "Minimum number of rows required" 1533 | } 1534 | } 1535 | }, 1536 | "column_not_negative": { 1537 | "type": "sql", 1538 | "level": "column", 1539 | "description": "Ensures numeric column has no negative values", 1540 | "sql": "SELECT @column_name\nFROM @table_name\nWHERE @column_name < 0", 1541 | "parameters": [] 1542 | } 1543 | }, 1544 | "singular_tests": { 1545 | "test_customers_email_format": { 1546 | "type": "sql", 1547 | "level": "table", 1548 | "description": "Validates email format for customers table", 1549 | "sql": "SELECT id, email\nFROM analytics.customers\nWHERE email NOT LIKE '%@%.%'", 1550 | "target_transformation": "analytics.customers" 1551 | } 1552 | } 1553 | } 1554 | ``` 1555 | 1556 |
1557 | 1558 | ### Module Using Test Library 1559 | 1560 |
1561 | YAML Format 1562 | 1563 | ```yaml 1564 | ots_version: "0.1.0" 1565 | module_name: "analytics_customers" 1566 | module_description: "Customer analytics transformations" 1567 | test_library_path: "../tests/test_library.yaml" 1568 | tags: ["analytics", "production"] 1569 | 1570 | target: 1571 | database: "warehouse" 1572 | schema: "analytics" 1573 | sql_dialect: "postgres" 1574 | 1575 | transformations: 1576 | - transformation_id: "analytics.customers" 1577 | description: "Customer data table" 1578 | transformation_type: "sql" 1579 | 1580 | code: 1581 | sql: 1582 | original_sql: "SELECT id, name, email, created_at, amount FROM source.customers WHERE active = true" 1583 | resolved_sql: "SELECT id, name, email, created_at, amount FROM warehouse.source.customers WHERE active = true" 1584 | source_tables: ["source.customers"] 1585 | 1586 | schema: 1587 | columns: 1588 | - name: "id" 1589 | datatype: "number" 1590 | description: "Unique customer identifier" 1591 | - name: "name" 1592 | datatype: "string" 1593 | description: "Customer name" 1594 | - name: "email" 1595 | datatype: "string" 1596 | description: "Customer email address" 1597 | - name: "created_at" 1598 | datatype: "date" 1599 | description: "Customer creation date" 1600 | - name: "amount" 1601 | datatype: "number" 1602 | description: "Customer account balance" 1603 | partitioning: [] 1604 | indexes: 1605 | - name: "idx_customers_id" 1606 | columns: ["id"] 1607 | - name: "idx_customers_email" 1608 | columns: ["email"] 1609 | 1610 | materialization: 1611 | type: "table" 1612 | 1613 | tests: 1614 | columns: 1615 | id: 1616 | - "not_null" # Standard test 1617 | - "unique" # Standard test (column-level) 1618 | email: 1619 | - "not_null" 1620 | - name: "accepted_values" # Standard test with params 1621 | params: 1622 | values: ["gmail.com", "yahoo.com", "company.com"] 1623 | amount: 1624 | - name: "column_not_negative" # Generic test from library 1625 | table: 1626 | - "row_count_gt_0" # Standard test 1627 | - "unique" # Standard test (table-level, checks all columns for row uniqueness) 1628 | - name: "check_minimum_rows" # Generic test with params 1629 | params: 1630 | min_rows: 100 1631 | - "test_customers_email_format" # Singular test from library 1632 | 1633 | metadata: 1634 | file_path: "/models/analytics/customers.sql" 1635 | owner: "data-team" 1636 | tags: ["customer", "core"] 1637 | object_tags: 1638 | sensitivity_tag: "pii" 1639 | classification: "internal" 1640 | ``` 1641 | 1642 |
1643 | 1644 |
1645 | JSON Format 1646 | 1647 | ```json 1648 | { 1649 | "ots_version": "0.1.0", 1650 | "module_name": "analytics_customers", 1651 | "module_description": "Customer analytics transformations", 1652 | "test_library_path": "../tests/test_library.yaml", 1653 | "tags": ["analytics", "production"], 1654 | "target": { 1655 | "database": "warehouse", 1656 | "schema": "analytics", 1657 | "sql_dialect": "postgres" 1658 | }, 1659 | "transformations": [ 1660 | { 1661 | "transformation_id": "analytics.customers", 1662 | "description": "Customer data table", 1663 | "transformation_type": "sql", 1664 | "code": { 1665 | "sql": { 1666 | "original_sql": "SELECT id, name, email, created_at, amount FROM source.customers WHERE active = true", 1667 | "resolved_sql": "SELECT id, name, email, created_at, amount FROM warehouse.source.customers WHERE active = true", 1668 | "source_tables": ["source.customers"] 1669 | } 1670 | }, 1671 | "schema": { 1672 | "columns": [ 1673 | { 1674 | "name": "id", 1675 | "datatype": "number", 1676 | "description": "Unique customer identifier" 1677 | }, 1678 | { 1679 | "name": "name", 1680 | "datatype": "string", 1681 | "description": "Customer name" 1682 | }, 1683 | { 1684 | "name": "email", 1685 | "datatype": "string", 1686 | "description": "Customer email address" 1687 | }, 1688 | { 1689 | "name": "created_at", 1690 | "datatype": "date", 1691 | "description": "Customer creation date" 1692 | }, 1693 | { 1694 | "name": "amount", 1695 | "datatype": "number", 1696 | "description": "Customer account balance" 1697 | } 1698 | ], 1699 | "partitioning": [], 1700 | "indexes": [ 1701 | { 1702 | "name": "idx_customers_id", 1703 | "columns": ["id"] 1704 | }, 1705 | { 1706 | "name": "idx_customers_email", 1707 | "columns": ["email"] 1708 | } 1709 | ] 1710 | }, 1711 | "materialization": { 1712 | "type": "table" 1713 | }, 1714 | "tests": { 1715 | "columns": { 1716 | "id": ["not_null", "unique"], 1717 | "email": [ 1718 | "not_null", 1719 | { 1720 | "name": "accepted_values", 1721 | "params": { 1722 | "values": ["gmail.com", "yahoo.com", "company.com"] 1723 | } 1724 | } 1725 | ], 1726 | "amount": [ 1727 | { 1728 | "name": "column_not_negative" 1729 | } 1730 | ] 1731 | }, 1732 | "table": [ 1733 | "row_count_gt_0", 1734 | "unique", 1735 | { 1736 | "name": "check_minimum_rows", 1737 | "params": { 1738 | "min_rows": 100 1739 | } 1740 | }, 1741 | "test_customers_email_format" 1742 | ] 1743 | }, 1744 | "metadata": { 1745 | "file_path": "/models/analytics/customers.sql", 1746 | "owner": "data-team", 1747 | "tags": ["customer", "core"], 1748 | "object_tags": { 1749 | "sensitivity_tag": "pii", 1750 | "classification": "internal" 1751 | } 1752 | } 1753 | } 1754 | ] 1755 | } 1756 | ``` 1757 | 1758 |
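Tying the example together, the test references in this module resolve against three sources in the priority order described earlier: standard tests first, then inline tests, then the referenced test library. A small, hedged sketch of that lookup is shown below; the data structures are illustrative, not part of the specification.

```python
STANDARD_TESTS = {"not_null", "unique", "accepted_values", "relationships", "row_count_gt_0"}

def resolve_test(name: str, module: dict, test_library: dict) -> tuple[str, dict | None]:
    """Return (source, definition) for a referenced test name (illustrative)."""
    if name in STANDARD_TESTS:
        return "standard", None
    for scope, source in ((module, "inline"), (test_library, "library")):
        for section in ("generic_tests", "singular_tests"):
            definition = (scope or {}).get(section, {}).get(name)
            if definition is not None:
                return source, definition
    raise LookupError(f"Test '{name}' not found in standard, inline, or library tests")

# resolve_test("check_minimum_rows", module, test_library) -> ("library", {...})
# resolve_test("row_count_gt_0", module, test_library)     -> ("standard", None)
```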
1759 | -------------------------------------------------------------------------------- /versions/v0.2.0/transformation-specification.md: -------------------------------------------------------------------------------- 1 | # Open Transformation Specification v0.2.0 2 | 3 | ## Table of Contents 4 | 5 | 1. [Introduction](#introduction) 6 | 2. [Core Concepts](#core-concepts) 7 | 3. [Open Transformation Definition (OTD) Structure](#open-transformation-definition-otd-structure) 8 | 4. [Materialization Types](#materialization-types) 9 | 5. [Data Quality Tests](#data-quality-tests) 10 | 6. [User-Defined Functions (UDFs)](#user-defined-functions-udfs) 11 | 7. [Examples](#examples) 12 | 13 | ## Introduction 14 | 15 | The Open Transformation Specification (OTS) defines a standard, programming language-agnostic interface description for data transformations, data quality tests, and user-defined functions (UDFs). This specification enables **interoperability** between tools and platforms, shifting the data transformation ecosystem from isolated, proprietary tools to an **open core** where tools can seamlessly work together around a shared specification. 16 | 17 | This specification allows both humans and computers to discover and understand how transformations behave, what outputs they produce, and how those outputs are materialized (as tables, views, incremental updates, SCD2, etc.) without requiring additional documentation or configuration. By providing a common standard, OTS ensures that transformations defined in one tool can be consumed, understood, and executed by any OTS-compliant tool. 18 | 19 | The OTS standard encompasses three types of artifacts: **Open Transformation Definitions (OTDs)** for transformations, **UDF Definitions** for user-defined functions, and **Test Definitions** for data quality tests. Together, these form the complete set of **OTS Artifacts** that can be defined and managed within an OTS Module. 20 | 21 | An OTS-based transformation must include both the code that transforms the data and metadata about the transformation. A tool implementing OTS should be able to execute an OTS transformation with no additional code or information beyond what's specified in the OTS document. This **interoperability** ensures that the transformation ecosystem can grow organically, with tools building on each other's capabilities rather than creating isolated silos. 22 | 23 | ## Core Concepts 24 | 25 | ### OTS Artifacts 26 | 27 | **OTS Artifacts** is the umbrella term for all concrete instances of the Open Transformation Specification. The OTS standard defines three types of artifacts: 28 | 29 | 1. **Open Transformation Definition (OTD)**: A structured definition that describes a specific data transformation 30 | 2. **UDF Definition**: A structured definition that describes a user-defined function 31 | 3. **Test Definition**: A structured definition that describes a data quality test 32 | 33 | All OTS Artifacts follow the OTS format and can be defined within an OTS Module. Together, they form a complete data transformation pipeline with reusable functions and quality validation. 34 | 35 | ### Open Transformation Definition (OTD) 36 | 37 | An **Open Transformation Definition (OTD)** is a concrete instance of the Open Transformation Specification that describes a specific data transformation using the OTS format. An OTD exists as a structured definition within an OTS Module, which is the file or document that contains one or more transformation definitions. 
38 | 39 | A transformation is a unit of data processing that takes one or more data sources as input and produces one data output. Right now, transformations are SQL queries, but we plan to add support for other programming languages in the future. 40 | 41 | ### Open Transformation Specification Module 42 | 43 | An **Open Transformation Specification Module (OTS Module)** is a collection of related OTS Artifacts (transformations, UDFs, and tests) that target the same database and schema. An OTS Module can contain one or more transformations, UDF definitions, and test definitions, much like how an OpenAPI specification can contain multiple endpoints. 44 | 45 | Key characteristics of an OTS Module: 46 | - **Single target**: All transformations in a module target the same database and schema 47 | - **Logical grouping**: Related transformations are organized together 48 | - **Deployment unit**: The entire module can be deployed as a single unit 49 | 50 | ### Test Library 51 | 52 | A **Test Library** is a project-level collection of reusable Test Definitions (generic and singular SQL tests) that can be shared across multiple OTS modules. Test libraries are defined separately from transformation modules and are referenced by modules that need to use them. 53 | 54 | Key characteristics of a Test Library: 55 | - **Project-level scope**: Test libraries are defined at the project/workspace level, separate from OTS modules 56 | - **Reusability**: Test Definitions in a library can be referenced by any OTS module in the project 57 | - **Test types**: Contains both generic SQL tests (with placeholders) and singular SQL tests (table-specific) 58 | - **Optional**: Modules can define tests inline or reference a test library, or both 59 | 60 | ### UDF Definition 61 | 62 | A **UDF Definition** is a concrete instance of the Open Transformation Specification that describes a user-defined function using the OTS format. A UDF Definition exists as a structured definition within an OTS Module, defining a custom function that can be called within SQL transformations. 63 | 64 | UDF Definitions include the function's signature (parameters and return type), implementation code, dependencies, and metadata. They enable reusable business logic and calculations that can be shared across multiple transformations. 65 | 66 | ### Test Definition 67 | 68 | A **Test Definition** is a concrete instance of the Open Transformation Specification that describes a data quality test using the OTS format. Test Definitions can exist in two contexts: 69 | 70 | 1. **Within a Test Library**: Reusable test definitions (generic and singular SQL tests) that can be shared across multiple OTS modules 71 | 2. **Inline within an OTS Module**: Module-specific test definitions that are defined directly in the transformation module 72 | 73 | Test Definitions include the test logic (SQL queries for generic/singular tests, or test type for standard tests), parameters, target scope (table or column level), and metadata. They enable automated data quality validation without manual inspection. 
#### OTS vs OTD vs OTS Module vs Test Library vs OTS Artifacts

- **Open Transformation Specification (OTS)**: The standard that defines the structure and rules
- **OTS Artifacts**: The umbrella term for all concrete instances of OTS (OTDs, UDF Definitions, and Test Definitions)
- **Open Transformation Definition (OTD)**: A specific transformation within a module
- **UDF Definition**: A specific user-defined function within a module
- **Test Definition**: A specific data quality test (in a Test Library or inline in a module)
- **Open Transformation Specification Module (OTS Module)**: A collection of related OTS Artifacts targeting the same database and schema
- **Test Library**: A project-level collection of reusable Test Definitions that can be shared across modules

## Components of an OTD

An Open Transformation Definition consists of several key components that work together to define an executable transformation:

1. **Transformation Code**: The transformation logic (SQL, Python, PySpark, etc.) stored in a type-based structure
2. **Schema Definition**: The structure of the output data including column definitions, types, and validation rules
3. **Materialization Strategy**: How the output is stored and updated (table, view, incremental, SCD2)
4. **Tests**: Validation rules that ensure data quality at column and table level
5. **Metadata**: Additional information about the transformation (owner, tags, creation date, etc.)

### Transformation Code

Transformations can be written in different languages (SQL, Python, PySpark, etc.). The transformation code is stored in a type-based structure that supports multiple transformation types while maintaining a consistent interface.

#### SQL Transformations

For SQL transformations, the code is stored with the following structure:
- `original_sql`: The original SQL query as written (typically a SELECT statement). This preserves the original transformation code as authored.
- `resolved_sql`: SQL with fully qualified table names (schema.table format). This is the preferred version for execution as it eliminates ambiguity in table references. Tools should use `resolved_sql` when executing transformations.
- `source_tables`: List of input tables referenced in the query (required for dependency analysis)
- `source_functions`: List of user-defined functions (UDFs) called in the query, used for dependency analysis. This field is optional and may be empty if no user-defined functions are used. Function names should be fully qualified (schema.function_name) when possible, or unqualified if the function is resolved by the database.

**When to use each:**
- Use `original_sql` for: displaying the original code to users, version control, understanding the transformation logic
- Use `resolved_sql` for: actual execution, dependency resolution, cross-database compatibility
- Use `source_tables` and `source_functions` for: dependency graph building, execution order determination, and understanding transformation dependencies (a sketch follows at the end of this section)

#### Non-SQL Transformations

Support for non-SQL transformation types (Python, PySpark, R, etc.) is planned for future versions of the specification. The current v0.2.0 specification focuses on SQL transformations and adds support for user-defined functions (UDFs).
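As a concrete illustration of the dependency-analysis use case, the sketch below derives an execution order for a module's transformations from their `source_tables` (and, where present, `source_functions`). It is a minimal example under the assumption that `transformation_id` values and source table names share the same `schema.table` naming; real tools typically add cycle detection, cross-module resolution, and UDF ordering.

```python
from graphlib import TopologicalSorter

def execution_order(module: dict) -> list[str]:
    """Order transformations so that upstream OTDs run before their consumers (sketch)."""
    ids = {t["transformation_id"] for t in module["transformations"]}
    graph = {}
    for t in module["transformations"]:
        sql = t["code"]["sql"]
        # Only dependencies that are themselves defined in this module are ordered here;
        # everything else (external sources, database objects) is treated as already present.
        graph[t["transformation_id"]] = {
            dep for dep in sql.get("source_tables", []) + sql.get("source_functions", [])
            if dep in ids
        }
    return list(TopologicalSorter(graph).static_order())
```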
115 | 116 | ### Schema Definition 117 | 118 | Schema defines the structure of the output data, including column names, data types, descriptions, partitioning, indexes, and other properties of the physical table. The schema is essential for understanding what the transformation produces without executing it. For example, it enables generating DDL statements for creating the output table. 119 | 120 | ### Materialization Strategy 121 | Materialization defines how the transformation output is stored and updated. Common types include: 122 | - **table**: Full table replacement on each run 123 | - **view**: Virtual table that queries underlying data 124 | - **incremental**: Partial updates using strategies like delete+insert or merge 125 | - **scd2**: Slowly Changing Dimension type 2 for tracking historical changes 126 | 127 | ### Tests 128 | Tests are validation rules that ensure data quality. They can be defined at two levels: 129 | - **Column-level tests**: Applied to individual columns (e.g., `not_null`, `unique`) 130 | - **Table-level tests**: Applied to the entire output (e.g., `row_count_gt_0`, `unique`) 131 | 132 | Tests enable automated data quality validation without manual inspection. OTS supports three types of tests: 133 | 1. **Standard Tests**: Built-in tests defined in the OTS specification (e.g., `not_null`, `unique`, `row_count_gt_0`) 134 | 2. **Generic SQL Tests**: Reusable SQL tests with placeholders that can be applied to multiple transformations 135 | 3. **Singular SQL Tests**: Table-specific SQL tests with hardcoded table references 136 | 137 | Generic SQL Tests and Singular SQL Tests can be defined in a project Test Library (see [Test Library](#test-library)) or inline within the current OTS Module. 138 | 139 | For detailed information about test types, definitions, and usage, see the [Data Quality Tests](#data-quality-tests) section. 140 | 141 | ### Metadata 142 | Metadata provides additional information about the transformation including: 143 | - **file_path**: Location of the source transformation file 144 | - **owner**: Person or team responsible for the transformation 145 | - **tags**: List of string tags for categorization and discovery (e.g., ["analytics", "fct", "production"]) 146 | - **object_tags**: Dictionary of key-value pairs for database object tagging (e.g., {"sensitivity_tag": "pii", "classification": "public"}) 147 | 148 | **Tag Types:** 149 | - **tags** (dbt-style): Simple string tags used for filtering, categorization, and discovery. These are typically used for model selection and organization. 150 | - **Module-level tags**: Tags defined at the module level apply to all transformations in the module. They can be inherited by transformations or merged with transformation-specific tags. 151 | - **Transformation-level tags**: Tags defined at the transformation level are specific to that transformation. They can be merged with module-level tags. 152 | - **object_tags** (database-style): Key-value pairs that are attached directly to database objects (tables, views) in databases that support object tagging (e.g., Snowflake). These are used for data governance, compliance, and metadata management. Unlike `tags`, `object_tags` are always transformation-specific and are not inherited from module level. 153 | 154 | ## OTS Module Structure 155 | 156 | An OTS Module is a YAML or JSON document that can contain one or more OTS Artifacts (transformations, UDF Definitions, and Test Definitions). 
Below is the complete structure: 157 | 158 | ### Complete OTS Module Structure 159 | 160 | ```yaml 161 | # OTS version 162 | ots_version: string # OTS specification version (e.g., "0.1.0") - indicates which version of the OTS standard this module follows 163 | 164 | # Module metadata 165 | module_name: string # Module name (e.g., "ecommerce_analytics") 166 | module_description: string # Description of the module (optional) 167 | version: string # Optional: Module/package version (e.g., "1.0.0") - version of this specific module, independent of OTS version 168 | tags: [string] # Optional: Module-level tags for categorization (e.g., ["analytics", "fct"]). These can be inherited or merged with transformation-level tags. 169 | test_library_path: string # Optional: Path to test library file (relative to module file or absolute path) 170 | 171 | # Optional: Inline test definitions (same structure as test library) 172 | generic_tests: # Optional: Module-specific generic SQL tests 173 | test_name: 174 | type: "sql" 175 | level: "table" | "column" 176 | description: string 177 | sql: string 178 | parameters: {} 179 | singular_tests: # Optional: Module-specific singular SQL tests 180 | test_name: 181 | type: "sql" 182 | level: "table" | "column" 183 | description: string 184 | sql: string 185 | target_transformation: string 186 | 187 | target: 188 | database: string # Target database name 189 | schema: string # Target schema name 190 | sql_dialect: string # Optional: SQL dialect (e.g., "postgres", "bigquery", "snowflake", "spark", etc.) 191 | connection_profile: string # Optional: connection profile reference 192 | 193 | # Transformations 194 | transformations: # Array of transformation definitions 195 | - transformation_id: string # Fully qualified identifier (e.g., "analytics.my_first_table") 196 | description: string # Optional: Description of what the transformation does (optional) 197 | transformation_type: string # Type of transformation: "sql" (default: "sql"). Non-SQL types (python, pyspark, r) are planned for future versions. 198 | sql_dialect: string # Optional: SQL dialect of the transformation code (for translation to target dialect) 199 | 200 | # Transformation code (type-based structure) 201 | code: 202 | # For SQL transformations (transformation_type: "sql") 203 | sql: 204 | original_sql: string # The original SQL query as written (typically a SELECT statement) 205 | resolved_sql: string # SQL with fully qualified table names (schema.table) - preferred for execution 206 | source_tables: [string] # List of input tables referenced (required for dependency analysis) 207 | source_functions: [string] # Optional: List of user-defined functions (UDFs) called in the query (required for dependency analysis) 208 | 209 | # Note: Non-SQL transformation types (python, pyspark, r) are planned for future versions 210 | 211 | # Schema definition 212 | schema: 213 | columns: # Array of column definitions 214 | - name: string # Column name 215 | datatype: string # Data type ("number", "string", "date", etc.) 
216 | description: string # Column description 217 | partitioning: [string] # Optional: Partition keys 218 | indexes: # Optional: Array of index definitions 219 | - name: string # Index name (optional, auto-generated if not provided) 220 | columns: [string] # Columns to index 221 | 222 | # Materialization strategy 223 | materialization: 224 | type: string # "table", "view", "incremental", "scd2" 225 | incremental_details: # Required if type is "incremental" 226 | strategy: string # "delete_insert", "append", "merge" 227 | delete_condition: string # SQL condition for delete (delete_insert only) 228 | filter_condition: string # SQL condition for filtering data 229 | merge_key: [string] # Primary key columns for matching records (merge only) 230 | update_columns: [string] # (Optional) List of columns to be updated in merge strategy 231 | scd2_details: # Optional if type is "scd2" 232 | start_column: string # Name of the start column (default: "valid_from") 233 | end_column: string # Name of the end column (default: "valid_to") 234 | unique_key: [string] # Array of columns that uniquely identify a record in SCD2 modeling (optional) 235 | 236 | # Tests: both column-level and table-level 237 | tests: 238 | columns: # Optional: Column-level tests 239 | column_name: # Tests for specific columns 240 | - string # Simple test name (e.g., "not_null", "unique") 241 | - object # Test with parameters: {name: string, params?: object, severity?: "error"|"warning"} 242 | table: # Optional: Table-level tests 243 | - string # Simple test name (e.g., "row_count_gt_0") 244 | - object # Test with parameters: {name: string, params?: object, severity?: "error"|"warning"} 245 | 246 | # Metadata 247 | metadata: 248 | file_path: string # Path to the source transformation file 249 | owner: string # Optional: Person or team responsible (optional) 250 | tags: [string] # Optional: List of string tags for categorization and discovery (e.g., ["analytics", "fct"]) 251 | object_tags: dict # Optional: Dictionary of key-value pairs for database object tagging (e.g., {"sensitivity_tag": "pii", "classification": "public"}) 252 | 253 | # Functions (OTS 0.2.0+) 254 | functions: # Optional: Array of user-defined function definitions (OTS 0.2.0+) 255 | - function_id: string # Fully qualified function name (e.g., "schema.function_name") 256 | description: string # Optional: Description of what the function does 257 | function_type: string # Function type: "scalar", "aggregate", or "table" 258 | language: string # Programming language: "sql", "python", "javascript", etc. 
259 | parameters: # Optional: Array of function parameters 260 | - name: string # Parameter name 261 | type: string # Parameter data type (e.g., "DOUBLE", "VARCHAR") 262 | description: string # Optional: Parameter description 263 | return_type: string # Optional: Return type for scalar/aggregate functions (e.g., "DOUBLE", "VARCHAR") 264 | return_table_schema: # Optional: Schema definition for table functions (same structure as transformation schema) 265 | columns: [ColumnDefinition] 266 | deterministic: bool # Optional: Whether the function is deterministic (same inputs always produce same outputs) 267 | code: # Function code (type-based structure) 268 | generic_sql: string # Generic SQL code that works across databases (for SQL functions) 269 | database_specific: dict # Database-specific implementations (keyed by database name) 270 | dependencies: # Optional: Function dependencies 271 | tables: [string] # List of tables the function depends on 272 | functions: [string] # List of other functions this function depends on 273 | metadata: # Function metadata 274 | file_path: string # Path to the source function file 275 | tags: [string] # Optional: List of string tags for categorization 276 | object_tags: dict # Optional: Dictionary of key-value pairs for database object tagging 277 | ``` 278 | 279 | ## Simple Table Transformation 280 | 281 |
282 | JSON Format 283 | 284 | ```json 285 | { 286 | "ots_version": "0.2.0", 287 | "module_name": "analytics_customers", 288 | "module_description": "Customer analytics transformations", 289 | "tags": ["analytics", "production"], 290 | "test_library_path": "../tests/test_library.yaml", 291 | "target": { 292 | "database": "warehouse", 293 | "schema": "analytics", 294 | "sql_dialect": "postgres" 295 | }, 296 | "transformations": [ 297 | { 298 | "transformation_id": "analytics.customers", 299 | "description": "Customer data table", 300 | "transformation_type": "sql", 301 | "code": { 302 | "sql": { 303 | "original_sql": "SELECT id, name, email, created_at FROM source.customers WHERE active = true", 304 | "resolved_sql": "SELECT id, name, email, created_at FROM warehouse.source.customers WHERE active = true", 305 | "source_tables": ["source.customers"], 306 | "source_functions": [] 307 | } 308 | }, 309 | "schema": { 310 | "columns": [ 311 | { 312 | "name": "id", 313 | "datatype": "number", 314 | "description": "Unique customer identifier" 315 | }, 316 | { 317 | "name": "name", 318 | "datatype": "string", 319 | "description": "Customer name" 320 | }, 321 | { 322 | "name": "email", 323 | "datatype": "string", 324 | "description": "Customer email address" 325 | }, 326 | { 327 | "name": "created_at", 328 | "datatype": "date", 329 | "description": "Customer creation date" 330 | } 331 | ], 332 | "partitioning": [], 333 | "indexes": [ 334 | { 335 | "name": "idx_customers_id", 336 | "columns": ["id"] 337 | }, 338 | { 339 | "name": "idx_customers_email", 340 | "columns": ["email"] 341 | } 342 | ] 343 | }, 344 | "materialization": { 345 | "type": "table" 346 | }, 347 | "tests": { 348 | "columns": { 349 | "id": ["not_null", "unique"], 350 | "email": ["not_null", "unique"], 351 | "created_at": ["not_null"] 352 | }, 353 | "table": ["row_count_gt_0"] 354 | }, 355 | "metadata": { 356 | "file_path": "/models/analytics/customers.sql", 357 | "owner": "data-team", 358 | "tags": ["customer", "core"], 359 | "object_tags": { 360 | "sensitivity_tag": "pii", 361 | "classification": "internal" 362 | } 363 | } 364 | } 365 | ] 366 | } 367 | ``` 368 | 369 |
370 | 371 |
372 | YAML Format 373 | 374 | ```yaml 375 | ots_version: "0.2.0" 376 | module_name: "analytics_customers" 377 | module_description: "Customer analytics transformations" 378 | tags: ["analytics", "production"] 379 | 380 | target: 381 | database: "warehouse" 382 | schema: "analytics" 383 | sql_dialect: "postgres" 384 | 385 | transformations: 386 | - transformation_id: "analytics.customers" 387 | description: "Customer data table" 388 | transformation_type: "sql" 389 | 390 | code: 391 | sql: 392 | original_sql: "SELECT id, name, email, created_at FROM source.customers WHERE active = true" 393 | resolved_sql: "SELECT id, name, email, created_at FROM warehouse.source.customers WHERE active = true" 394 | source_tables: ["source.customers"] 395 | source_functions: [] 396 | 397 | schema: 398 | columns: 399 | - name: "id" 400 | datatype: "number" 401 | description: "Unique customer identifier" 402 | - name: "name" 403 | datatype: "string" 404 | description: "Customer name" 405 | - name: "email" 406 | datatype: "string" 407 | description: "Customer email address" 408 | - name: "created_at" 409 | datatype: "date" 410 | description: "Customer creation date" 411 | partitioning: [] 412 | indexes: 413 | - name: "idx_customers_id" 414 | columns: ["id"] 415 | - name: "idx_customers_email" 416 | columns: ["email"] 417 | 418 | materialization: 419 | type: "table" 420 | 421 | tests: 422 | columns: 423 | id: ["not_null", "unique"] 424 | email: ["not_null", "unique"] 425 | created_at: ["not_null"] 426 | table: ["row_count_gt_0"] 427 | 428 | metadata: 429 | file_path: "/models/analytics/customers.sql" 430 | owner: "data-team" 431 | tags: ["customer", "core"] 432 | object_tags: 433 | sensitivity_tag: "pii" 434 | classification: "internal" 435 | ``` 436 | 437 |
438 | 
439 | ## Materialization Types
440 | 
441 | ### Incremental Materialization
442 | 
443 | Incremental materialization updates only changed data using one of three strategies:
444 | - **delete_insert**: Deletes rows matching a condition and inserts new data
445 | - **append**: Appends new data without removing existing rows
446 | - **merge**: Performs an upsert operation using a merge statement
447 | 
448 | #### Delete-Insert Strategy
449 | 
450 | ```yaml
451 | materialization:
452 |   type: "incremental"
453 |   incremental_details:
454 |     strategy: "delete_insert"
455 |     delete_condition: "to_date(updated_ts) = '@start_date'"
456 |     filter_condition: "to_date(updated_ts) = '@start_date'"
457 | ```
458 | 
459 | #### Append Strategy
460 | 
461 | ```yaml
462 | materialization:
463 |   type: "incremental"
464 |   incremental_details:
465 |     strategy: "append"
466 |     filter_condition: "created_at >= '@start_date'"
467 | ```
468 | 
469 | #### Merge Strategy
470 | 
471 | ```yaml
472 | materialization:
473 |   type: "incremental"
474 |   incremental_details:
475 |     strategy: "merge"
476 |     merge_key: ["customer_id"] # Primary key columns for matching records
477 |     filter_condition: "updated_at >= '@start_date'"
478 |     update_columns: ["name", "email"] # Optional: specific columns to update
479 | ```
480 | 
481 | ### SCD2 Materialization
482 | 
483 | SCD2 (Slowly Changing Dimension Type 2) materialization tracks historical changes with valid date ranges. It requires a unique key to identify records.
484 | 
485 | ```yaml
486 | materialization:
487 |   type: "scd2"
488 |   scd2_details:
489 |     unique_key: ["product_id"] # Primary key or unique identifier
490 |     start_column: "valid_from" # Optional, defaults to "valid_from"
491 |     end_column: "valid_to" # Optional, defaults to "valid_to"
492 | ```
493 | 
494 | ### Schema Column Definition
495 | 
496 | A schema column in an OTS transformation defines the structure and properties of a single column in the output table:
497 | 
498 | ```yaml
499 | columns:
500 |   - name: string # Column name
501 |     datatype: string # Data type ("number", "string", "date", etc.)
502 |     description: string # Column description
503 | ```
504 | 
505 | **Common Data Types:**
506 | - `number`: Numeric values
507 | - `string`: Text values
508 | - `date`: Date and timestamp values
509 | - `boolean`: True/false values
510 | - `array`: Array of values
511 | - `object`: Complex nested objects
512 | 
513 | ## Data Quality Tests
514 | 
515 | Data quality tests are validation rules that ensure the correctness and quality of transformation outputs. Tests can be defined at two levels:
516 | - **Column-level tests**: Applied to individual columns (e.g., `not_null`, `unique`)
517 | - **Table-level tests**: Applied to the entire output (e.g., `row_count_gt_0`, `unique`)
518 | 
519 | Tests enable automated data quality validation without manual inspection. OTS supports three types of tests:
520 | 
521 | 1. **Standard Tests**: Built-in tests defined in the OTS specification (e.g., `not_null`, `unique`, `row_count_gt_0`)
522 | 2. **Generic SQL Tests**: Reusable SQL tests with placeholders that can be applied to multiple transformations
523 | 3. **Singular SQL Tests**: Table-specific SQL tests with hardcoded table references
524 | 
525 | ### Standard Tests
526 | 
527 | Standard tests are built into the OTS specification and must be implemented by all OTS-compliant tools. These tests provide common data quality checks that are widely applicable across different transformations. 
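
As an illustration of what "implemented" means in practice, the sketch below shows the kind of violation query a tool might generate for the `not_null` test. It is illustrative only: the wrapping `generated_test` structure and the exact SQL are tool-internal and not part of the OTS schema; only the semantics (returned rows are violations, zero rows means the test passes, per the execution model described later in this section) come from the specification.

```yaml
# Illustration only: a violation query a tool might generate for the
# standard test "not_null" on column "id" of "analytics.customers".
# The "generated_test" wrapper is hypothetical and not part of the OTS schema.
generated_test:
  test: "not_null"
  transformation_id: "analytics.customers"
  column: "id"
  sql: |
    SELECT *
    FROM analytics.customers
    WHERE id IS NULL
```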
528 | 529 | #### Column-Level Standard Tests 530 | 531 | **`not_null`** 532 | - **Description**: Ensures a column contains no NULL values 533 | - **Level**: Column 534 | - **Parameters**: None 535 | - **Implementation**: Returns rows where the column is NULL (test fails if any rows returned) 536 | - **Example**: 537 | ```yaml 538 | tests: 539 | columns: 540 | id: ["not_null"] 541 | ``` 542 | 543 | **`unique`** 544 | - **Description**: Ensures column values are unique across all rows 545 | - **Level**: Column or Table 546 | - **Parameters**: 547 | - `columns` (array, optional): For table-level tests, specifies which columns to check for uniqueness. If omitted at table level, checks all columns (entire row uniqueness) 548 | - **Implementation**: Returns duplicate values (test fails if any duplicates found) 549 | - **Examples**: 550 | ```yaml 551 | tests: 552 | columns: 553 | # Column-level: single column uniqueness 554 | id: ["not_null", "unique"] 555 | 556 | table: 557 | # Table-level: composite uniqueness on specific columns 558 | - name: "unique" 559 | params: 560 | columns: ["customer_id", "order_date"] 561 | 562 | # Table-level: entire row uniqueness (all columns) 563 | - "unique" 564 | ``` 565 | 566 | **`accepted_values`** 567 | - **Description**: Ensures column values are within a specified list of acceptable values 568 | - **Level**: Column 569 | - **Parameters**: 570 | - `values` (array, required): List of acceptable values 571 | - **Implementation**: Returns rows where column value is not in the accepted list 572 | - **Example**: 573 | ```yaml 574 | tests: 575 | columns: 576 | status: 577 | - name: "accepted_values" 578 | params: 579 | values: ["active", "inactive", "pending"] 580 | ``` 581 | 582 | **`relationships`** 583 | - **Description**: Ensures referential integrity between tables (foreign key validation) 584 | - **Level**: Column 585 | - **Parameters**: 586 | - `to` (string, required): Target transformation ID (e.g., "analytics.customers") 587 | - `field` (string, required): Column name in the target transformation 588 | - **Implementation**: Returns rows where the column value doesn't exist in the target table's specified field 589 | - **Example**: 590 | ```yaml 591 | tests: 592 | columns: 593 | customer_id: 594 | - name: "relationships" 595 | params: 596 | to: "analytics.customers" 597 | field: "id" 598 | ``` 599 | 600 | #### Table-Level Standard Tests 601 | 602 | **`row_count_gt_0`** 603 | - **Description**: Ensures the table has at least one row 604 | - **Level**: Table 605 | - **Parameters**: None 606 | - **Implementation**: Returns a count result (test fails if count = 0) 607 | - **Example**: 608 | ```yaml 609 | tests: 610 | table: 611 | - "row_count_gt_0" 612 | ``` 613 | 614 | ### Test Libraries 615 | 616 | Test libraries are project-level collections of reusable Test Definitions (generic and singular SQL tests) that can be shared across multiple OTS modules. For a detailed introduction to Test Libraries, see the [Test Library](#test-library) section in Core Concepts. 617 | 618 | #### Test Library Structure 619 | 620 | A test library is a YAML or JSON file that contains reusable Test Definitions. The file can be named anything (e.g., `test_library.yaml`, `tests.yaml`, `data_quality_tests.json`), but must follow the structure below. 
621 | 622 | **Test Library File Structure:** 623 | ```yaml 624 | # test_library.yaml 625 | ots_version: string # OTS specification version (e.g., "0.2.0") - indicates which version of the OTS standard this test library follows 626 | test_library_version: string # Optional: Version identifier for the test library (e.g., "1.0", "2.1") 627 | description: string # Optional: Human-readable description of the test library 628 | 629 | generic_tests: 630 | check_minimum_rows: 631 | type: "sql" 632 | level: "table" 633 | description: "Ensures table has minimum number of rows" 634 | sql: | 635 | SELECT 1 as violation 636 | FROM @table_name 637 | GROUP BY 1 638 | HAVING COUNT(*) < @min_rows:10 639 | parameters: 640 | min_rows: 641 | type: "number" 642 | default: 10 643 | description: "Minimum number of rows required" 644 | 645 | column_not_negative: 646 | type: "sql" 647 | level: "column" 648 | description: "Ensures numeric column has no negative values" 649 | sql: | 650 | SELECT @column_name 651 | FROM @table_name 652 | WHERE @column_name < 0 653 | parameters: [] 654 | 655 | singular_tests: 656 | test_customers_email_format: 657 | type: "sql" 658 | level: "table" 659 | description: "Validates email format for customers table" 660 | sql: | 661 | SELECT id, email 662 | FROM analytics.customers 663 | WHERE email NOT LIKE '%@%.%' 664 | target_transformation: "analytics.customers" 665 | ``` 666 | 667 | #### Generic SQL Tests 668 | 669 | Generic SQL tests are reusable tests that use placeholders (variables) to make them applicable to multiple transformations. They follow the dbt pattern where: 670 | - The query returns rows when the test fails 671 | - 0 rows returned = test passes 672 | - 1+ rows returned = test fails 673 | 674 | **Placeholders:** 675 | - `@table_name` or `{{ table_name }}`: Replaced with the fully qualified transformation ID. The `@` syntax is recommended for cleaner SQL. 676 | - `@column_name` or `{{ column_name }}`: Replaced with the column name (for column-level tests). The `@` syntax is recommended. 677 | - Custom parameters: Available as `@parameter_name` or `{{ parameter_name }}` with optional defaults using `@param:default` syntax (e.g., `@min_rows:10`) 678 | 679 | **Structure:** 680 | ```yaml 681 | generic_tests: 682 | test_name: # Required: Unique test name (used for referencing) 683 | type: "sql" # Required: Always "sql" for SQL tests 684 | level: "table" | "column" # Required: Test level 685 | description: string # Optional: Human-readable description 686 | sql: string # Required: SQL query (returns rows on failure) 687 | parameters: # Optional: Parameter definitions 688 | param_name: 689 | type: "number" | "string" | "boolean" | "array" # Required: Parameter type 690 | default: value # Optional: Default value 691 | description: string # Optional: Parameter description 692 | ``` 693 | 694 | **Example Generic Test:** 695 | ```yaml 696 | check_minimum_rows: 697 | type: "sql" 698 | level: "table" 699 | description: "Ensures table has minimum number of rows" 700 | sql: | 701 | SELECT 1 as violation 702 | FROM @table_name 703 | GROUP BY 1 704 | HAVING COUNT(*) < @min_rows:10 705 | parameters: 706 | min_rows: 707 | type: "number" 708 | default: 10 709 | description: "Minimum number of rows required" 710 | ``` 711 | 712 | #### Singular SQL Tests 713 | 714 | Singular SQL tests are table-specific tests with hardcoded table references. 
They are useful for: 715 | - Complex business logic specific to one transformation 716 | - Tests that reference multiple tables 717 | - Table-specific validation rules 718 | 719 | **Structure:** 720 | ```yaml 721 | singular_tests: 722 | test_name: # Required: Unique test name (used for referencing) 723 | type: "sql" # Required: Always "sql" for SQL tests 724 | level: "table" | "column" # Required: Test level 725 | description: string # Optional: Human-readable description 726 | sql: string # Required: SQL query with hardcoded table names 727 | target_transformation: string # Required: Transformation ID this test applies to (used for validation and discovery) 728 | ``` 729 | 730 | **Example Singular Test:** 731 | ```yaml 732 | test_customers_email_format: 733 | type: "sql" 734 | level: "table" 735 | description: "Validates email format for customers table" 736 | sql: | 737 | SELECT id, email 738 | FROM analytics.customers 739 | WHERE email NOT LIKE '%@%.%' 740 | target_transformation: "analytics.customers" 741 | ``` 742 | 743 | ### Referencing Tests in Transformations 744 | 745 | Transformations reference tests from: 746 | 1. **Standard tests**: Referenced by name (e.g., `"not_null"`, `"unique"`) 747 | 2. **Test library tests**: Referenced by name from the test library (e.g., `"check_minimum_rows"`) 748 | 749 | **Module Structure with Test Library Reference:** 750 | 751 | ```yaml 752 | ots_version: "0.2.0" 753 | module_name: "analytics_customers" 754 | test_library_path: "../tests/test_library.yaml" # Optional: Path to test library 755 | 756 | target: 757 | database: "warehouse" 758 | schema: "analytics" 759 | 760 | transformations: 761 | - transformation_id: "analytics.customers" 762 | tests: 763 | columns: 764 | id: 765 | - "not_null" # Standard test 766 | - "unique" # Standard test (column-level) 767 | email: 768 | - "not_null" 769 | - name: "accepted_values" # Standard test with params 770 | params: 771 | values: ["gmail.com", "yahoo.com"] 772 | amount: 773 | - name: "column_not_negative" # Generic test from library 774 | table: 775 | - "row_count_gt_0" # Standard test 776 | - "unique" # Standard test (table-level, checks all columns) 777 | - name: "unique" # Standard test (table-level, composite on specific columns) 778 | params: 779 | columns: ["customer_id", "order_date"] 780 | - name: "check_minimum_rows" # Generic test with params 781 | params: 782 | min_rows: 100 783 | - "test_customers_email_format" # Singular test from library 784 | ``` 785 | 786 | **Test Reference Formats:** 787 | 788 | 1. **Simple string** (standard test, no parameters): 789 | ```yaml 790 | tests: 791 | columns: 792 | id: ["not_null", "unique"] 793 | table: 794 | - "row_count_gt_0" 795 | ``` 796 | 797 | 2. **Object with name** (standard test with parameters): 798 | ```yaml 799 | tests: 800 | columns: 801 | status: 802 | - name: "accepted_values" 803 | params: 804 | values: ["active", "inactive"] 805 | ``` 806 | 807 | 3. **Object with name** (generic/singular test from library): 808 | ```yaml 809 | tests: 810 | table: 811 | - name: "check_minimum_rows" 812 | params: 813 | min_rows: 100 814 | ``` 815 | 816 | ### Test Execution Model 817 | 818 | Tests follow the dbt execution model: 819 | - **0 rows returned** = test passes 820 | - **1+ rows returned** = test fails 821 | 822 | For standard tests, tools generate SQL queries that return violating rows. For SQL tests (generic and singular), the SQL query itself returns rows when violations are found. 
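
For generic SQL tests, execution also involves placeholder substitution. The sketch below is illustrative only (the `resolved_generic_test` structure shown is tool-internal, not part of the OTS schema): it shows how `check_minimum_rows` from the test library, referenced with `min_rows: 100` against `analytics.customers`, might look after `@table_name` is replaced with the fully qualified transformation ID and the `@min_rows:10` default is overridden by the supplied parameter.

```yaml
# Illustration only: possible resolved form of the generic test
# "check_minimum_rows" applied to "analytics.customers" with min_rows: 100.
# The "resolved_generic_test" wrapper is hypothetical, not part of the OTS schema.
resolved_generic_test:
  test: "check_minimum_rows"
  transformation_id: "analytics.customers"
  params:
    min_rows: 100
  sql: |
    SELECT 1 as violation
    FROM analytics.customers
    GROUP BY 1
    HAVING COUNT(*) < 100
```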
823 | 824 | **Test Severity:** 825 | - Tests can have a `severity` level: `"error"` (default) or `"warning"` 826 | - `error`: Test failure stops execution and fails the build 827 | - `warning`: Test failure is logged but doesn't stop execution 828 | 829 | **Severity in Test References:** 830 | ```yaml 831 | tests: 832 | columns: 833 | id: 834 | - name: "not_null" 835 | severity: "error" # Default, can be omitted 836 | - name: "unique" 837 | severity: "warning" # Non-blocking 838 | table: 839 | - name: "row_count_gt_0" 840 | severity: "error" # Default, can be omitted 841 | ``` 842 | 843 | ### Inline Test Definitions in OTS Modules 844 | 845 | Test Definitions (generic and singular SQL tests) can also be defined directly within an OTS Module, using the same structure as test libraries. This is useful for module-specific Test Definitions that don't need to be shared across modules. 846 | 847 | **Module Structure with Inline Tests:** 848 | ```yaml 849 | ots_version: "0.2.0" 850 | module_name: "analytics_customers" 851 | 852 | # Optional: Inline test definitions (same structure as test library) 853 | generic_tests: 854 | check_recent_data: 855 | type: "sql" 856 | level: "table" 857 | description: "Ensures table has recent data" 858 | sql: | 859 | SELECT 1 as violation 860 | FROM @table_name 861 | WHERE updated_at < CURRENT_DATE - INTERVAL '@days:7' DAY 862 | parameters: 863 | days: 864 | type: "number" 865 | default: 7 866 | 867 | singular_tests: 868 | test_customers_specific: 869 | type: "sql" 870 | level: "table" 871 | description: "Module-specific test" 872 | sql: | 873 | SELECT id FROM analytics.customers WHERE status = 'invalid' 874 | target_transformation: "analytics.customers" 875 | 876 | target: 877 | database: "warehouse" 878 | schema: "analytics" 879 | 880 | transformations: 881 | - transformation_id: "analytics.customers" 882 | tests: 883 | table: 884 | - name: "check_recent_data" # References inline generic test 885 | params: 886 | days: 3 887 | - "test_customers_specific" # References inline singular test 888 | ``` 889 | 890 | **Test Resolution Priority:** 891 | When resolving test names, tools should check in the following order: 892 | 1. **Standard tests** (built into OTS specification) 893 | 2. **Inline tests** (defined in the current OTS Module) 894 | 3. **Test library tests** (from referenced test library) 895 | 896 | If a test name exists in multiple locations, the first match takes precedence. This allows modules to override test library tests with module-specific implementations. 897 | 898 | ### Test Library Resolution 899 | 900 | When a transformation module references a test library: 901 | 1. The tool resolves the `test_library_path` (relative to the module file or absolute path) 902 | 2. Loads the test library file (YAML or JSON format) 903 | 3. Validates test definitions 904 | 4. Makes tests available for reference in transformations (after inline tests) 905 | 906 | **Test Discovery:** 907 | - **Standard tests**: Always available, no discovery needed 908 | - **Generic tests**: Discovered from test library or inline module definitions 909 | - **Singular tests**: Discovered from test library or inline module definitions. The `target_transformation` field helps tools validate that the test is applied to the correct transformation. 910 | 911 | If a test is referenced but not found among the Standard tests, inline tests, or Test library, it must result in an error. 
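
To make the resolution order concrete, the sketch below shows a module that defines an inline generic test with the same name as a test in its referenced library. Because inline tests are resolved before the test library, the inline definition takes precedence for this module; the test body and default values here are illustrative.

```yaml
# Sketch: an inline generic test shadowing a test library definition.
# Per the resolution order (standard -> inline -> library), this inline
# "check_minimum_rows" is used instead of the library test of the same name.
ots_version: "0.2.0"
module_name: "analytics_customers"
test_library_path: "../tests/test_library.yaml"

generic_tests:
  check_minimum_rows: # shadows the definition in the test library
    type: "sql"
    level: "table"
    description: "Stricter minimum-row check for this module"
    sql: |
      SELECT 1 as violation
      FROM @table_name
      GROUP BY 1
      HAVING COUNT(*) < @min_rows:50
    parameters:
      min_rows:
        type: "number"
        default: 50
```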
912 | 913 | ## User-Defined Functions (UDFs) 914 | 915 | ### Overview 916 | 917 | User-Defined Functions (UDFs) are custom functions that can be called within SQL transformations. OTS v0.2.0 adds support for defining UDFs as **UDF Definitions** within OTS Modules and tracking UDF dependencies in transformations, enabling proper dependency graph building and execution order determination. 918 | 919 | ### Function Dependencies 920 | 921 | When a transformation calls a user-defined function, the function name should be listed in the `source_functions` array. This allows tools to: 922 | 923 | - Build accurate dependency graphs that include function dependencies 924 | - Determine correct execution order (functions must be created before transformations that use them) 925 | - Validate that all required functions exist before executing transformations 926 | - Support function-to-function dependencies (functions calling other functions) 927 | 928 | ### Function Naming 929 | 930 | Function names in `source_functions` should follow these conventions: 931 | - **Fully qualified names** (preferred): `schema.function_name` (e.g., `analytics.calculate_percentage`) 932 | - **Unqualified names**: `function_name` (when the function is resolved by the database's search path) 933 | 934 | Tools should resolve unqualified function names to fully qualified names when building dependency graphs. 935 | 936 | ### Example: Transformation Using a Function 937 | 938 | ```yaml 939 | ots_version: "0.2.0" 940 | transformation_id: "analytics.order_summary" 941 | description: "Order summary with calculated metrics" 942 | 943 | transformation_type: "sql" 944 | code: 945 | sql: 946 | original_sql: | 947 | SELECT 948 | order_id, 949 | customer_id, 950 | analytics.calculate_percentage(discount_amount, total_amount) as discount_pct, 951 | analytics.format_currency(total_amount) as formatted_total 952 | FROM source.orders 953 | resolved_sql: | 954 | SELECT 955 | order_id, 956 | customer_id, 957 | analytics.calculate_percentage(discount_amount, total_amount) as discount_pct, 958 | analytics.format_currency(total_amount) as formatted_total 959 | FROM warehouse.source.orders 960 | source_tables: ["source.orders"] 961 | source_functions: ["analytics.calculate_percentage", "analytics.format_currency"] 962 | 963 | schema: 964 | columns: 965 | - name: "order_id" 966 | datatype: "number" 967 | - name: "customer_id" 968 | datatype: "number" 969 | - name: "discount_pct" 970 | datatype: "number" 971 | description: "Discount percentage calculated using UDF" 972 | - name: "formatted_total" 973 | datatype: "string" 974 | description: "Formatted currency using UDF" 975 | 976 | materialization: 977 | type: "table" 978 | ``` 979 | 980 | In this example, the transformation depends on two user-defined functions: 981 | - `analytics.calculate_percentage`: Calculates percentage values 982 | - `analytics.format_currency`: Formats numeric values as currency strings 983 | 984 | These dependencies are tracked in `source_functions`, allowing the dependency graph to ensure these functions are created before the transformation executes. 985 | 986 | ### Function Execution Order 987 | 988 | Functions are executed in dependency order, before transformations: 989 | 1. **Seeds** are loaded first (if any) 990 | 2. **Functions** are created in dependency order (functions that depend on other functions are created after their dependencies) 991 | 3. 
**Transformations** are executed (can use functions created in step 2) 992 | 993 | Function-to-function dependencies are resolved automatically based on the `dependencies.functions` array in each function definition. 994 | 995 | ### Function Overloading 996 | 997 | Some databases (e.g., Snowflake, DuckDB, PostgreSQL) support function overloading - multiple functions with the same name but different parameter signatures. OTS 0.2.0 supports this by: 998 | 999 | - **Function identification**: Functions are identified by their fully qualified name (`schema.function_name`) and parameter signature 1000 | - **Signature matching**: When a function is called, the database matches the call to the appropriate overload based on parameter types 1001 | - **Dependency tracking**: Each overloaded function is tracked separately in the `functions` array with its unique signature 1002 | 1003 | Tools implementing OTS should handle function overloading according to the target database's capabilities and requirements. 1004 | 1005 | ### Example: OTS Module with Functions 1006 | 1007 | The following example shows a complete OTS Module that includes both transformations and UDF Definitions: 1008 | 1009 | ```yaml 1010 | ots_version: "0.2.0" 1011 | module_name: "analytics_calculations" 1012 | module_description: "Analytics module with custom calculation functions" 1013 | 1014 | target: 1015 | database: "warehouse" 1016 | schema: "analytics" 1017 | sql_dialect: "postgres" 1018 | 1019 | transformations: 1020 | - transformation_id: "analytics.order_summary" 1021 | description: "Order summary with calculated metrics" 1022 | transformation_type: "sql" 1023 | code: 1024 | sql: 1025 | original_sql: | 1026 | SELECT 1027 | order_id, 1028 | customer_id, 1029 | analytics.calculate_percentage(discount_amount, total_amount) as discount_pct, 1030 | analytics.format_currency(total_amount) as formatted_total 1031 | FROM source.orders 1032 | resolved_sql: | 1033 | SELECT 1034 | order_id, 1035 | customer_id, 1036 | analytics.calculate_percentage(discount_amount, total_amount) as discount_pct, 1037 | analytics.format_currency(total_amount) as formatted_total 1038 | FROM warehouse.source.orders 1039 | source_tables: ["source.orders"] 1040 | source_functions: ["analytics.calculate_percentage", "analytics.format_currency"] 1041 | materialization: 1042 | type: "table" 1043 | 1044 | functions: 1045 | - function_id: "analytics.calculate_percentage" 1046 | description: "Calculates the percentage of a numerator over a denominator" 1047 | function_type: "scalar" 1048 | language: "sql" 1049 | parameters: 1050 | - name: "numerator" 1051 | type: "DOUBLE" 1052 | description: "The numerator value" 1053 | - name: "denominator" 1054 | type: "DOUBLE" 1055 | description: "The denominator value" 1056 | return_type: "DOUBLE" 1057 | deterministic: true 1058 | code: 1059 | generic_sql: | 1060 | CREATE OR REPLACE FUNCTION calculate_percentage( 1061 | numerator DOUBLE, 1062 | denominator DOUBLE 1063 | ) RETURNS DOUBLE AS $$ 1064 | SELECT 1065 | CASE 1066 | WHEN denominator = 0 OR denominator IS NULL THEN NULL 1067 | ELSE (numerator / denominator) * 100.0 1068 | END 1069 | $$; 1070 | database_specific: {} 1071 | dependencies: 1072 | tables: [] 1073 | functions: [] 1074 | metadata: 1075 | file_path: "/functions/analytics/calculate_percentage.sql" 1076 | tags: ["math", "utility"] 1077 | object_tags: 1078 | category: "calculation" 1079 | complexity: "simple" 1080 | 1081 | - function_id: "analytics.format_currency" 1082 | description: "Formats a numeric value as 
currency string" 1083 | function_type: "scalar" 1084 | language: "sql" 1085 | parameters: 1086 | - name: "amount" 1087 | type: "DOUBLE" 1088 | description: "The amount to format" 1089 | return_type: "VARCHAR" 1090 | code: 1091 | generic_sql: | 1092 | CREATE OR REPLACE FUNCTION format_currency(amount DOUBLE) 1093 | RETURNS VARCHAR AS $$ 1094 | SELECT '$' || TO_CHAR(amount, 'FM999,999,999.00') 1095 | $$; 1096 | database_specific: {} 1097 | dependencies: 1098 | tables: [] 1099 | functions: [] 1100 | metadata: 1101 | file_path: "/functions/analytics/format_currency.sql" 1102 | tags: ["formatting", "utility"] 1103 | ``` 1104 | 1105 | In this example: 1106 | - The module defines two functions: `analytics.calculate_percentage` and `analytics.format_currency` 1107 | - The transformation `analytics.order_summary` uses both functions and lists them in `source_functions` 1108 | - Functions are defined with their complete structure including parameters, return types, code, and metadata 1109 | - The `functions` array is at the same level as `transformations`, maintaining consistency in the module structure 1110 | 1111 | ## Complete Examples: Incremental Strategies 1112 | 1113 | ### Delete-Insert Example 1114 | 1115 |
1116 | YAML Format 1117 | 1118 | ```yaml 1119 | ots_version: "0.2.0" 1120 | transformation_id: "analytics.recent_orders" 1121 | description: "Orders updated in the last 7 days" 1122 | 1123 | transformation_type: "sql" 1124 | code: 1125 | sql: 1126 | original_sql: "SELECT order_id, customer_id, order_date, amount, status FROM source.orders WHERE updated_at >= '@start_date'" 1127 | resolved_sql: "SELECT order_id, customer_id, order_date, amount, status FROM warehouse.source.orders WHERE updated_at >= '@start_date'" 1128 | source_tables: ["source.orders"] 1129 | source_functions: [] 1130 | 1131 | schema: 1132 | columns: 1133 | - name: "order_id" 1134 | datatype: "number" 1135 | description: "Unique order identifier" 1136 | - name: "customer_id" 1137 | datatype: "number" 1138 | description: "Customer ID" 1139 | - name: "order_date" 1140 | datatype: "date" 1141 | description: "Order date" 1142 | - name: "amount" 1143 | datatype: "number" 1144 | description: "Order amount" 1145 | - name: "status" 1146 | datatype: "string" 1147 | description: "Order status" 1148 | partitioning: ["order_date"] 1149 | indexes: 1150 | - name: "idx_order_id" 1151 | columns: ["order_id"] 1152 | 1153 | materialization: 1154 | type: "incremental" 1155 | incremental_details: 1156 | strategy: "delete_insert" 1157 | delete_condition: "to_date(updated_at) = '@start_date'" 1158 | filter_condition: "to_date(updated_at) = '@start_date'" 1159 | 1160 | tests: 1161 | columns: 1162 | order_id: ["not_null", "unique"] 1163 | order_date: ["not_null"] 1164 | table: ["row_count_gt_0"] 1165 | 1166 | metadata: 1167 | file_path: "/models/analytics/recent_orders.sql" 1168 | owner: "analytics-team" 1169 | tags: ["orders", "incremental"] 1170 | ``` 1171 | 1172 |
1173 | 1174 |
1175 | JSON Format 1176 | 1177 | ```json 1178 | { 1179 | "ots_version": "0.2.0", 1180 | "transformation_id": "analytics.recent_orders", 1181 | "description": "Orders updated in the last 7 days", 1182 | 1183 | "transformation_type": "sql", 1184 | "code": { 1185 | "sql": { 1186 | "original_sql": "SELECT order_id, customer_id, order_date, amount, status FROM source.orders WHERE updated_at >= '@start_date'", 1187 | "resolved_sql": "SELECT order_id, customer_id, order_date, amount, status FROM warehouse.source.orders WHERE updated_at >= '@start_date'", 1188 | "source_tables": ["source.orders"], 1189 | "source_functions": [] 1190 | } 1191 | }, 1192 | 1193 | "schema": { 1194 | "columns": [ 1195 | { 1196 | "name": "order_id", 1197 | "datatype": "number", 1198 | "description": "Unique order identifier" 1199 | }, 1200 | { 1201 | "name": "customer_id", 1202 | "datatype": "number", 1203 | "description": "Customer ID" 1204 | }, 1205 | { 1206 | "name": "order_date", 1207 | "datatype": "date", 1208 | "description": "Order date" 1209 | }, 1210 | { 1211 | "name": "amount", 1212 | "datatype": "number", 1213 | "description": "Order amount" 1214 | }, 1215 | { 1216 | "name": "status", 1217 | "datatype": "string", 1218 | "description": "Order status" 1219 | } 1220 | ], 1221 | "partitioning": ["order_date"], 1222 | "indexes": [ 1223 | { 1224 | "name": "idx_order_id", 1225 | "columns": ["order_id"] 1226 | } 1227 | ] 1228 | }, 1229 | 1230 | "materialization": { 1231 | "type": "incremental", 1232 | "incremental_details": { 1233 | "strategy": "delete_insert", 1234 | "delete_condition": "to_date(updated_at) = '@start_date'", 1235 | "filter_condition": "to_date(updated_at) = '@start_date'" 1236 | } 1237 | }, 1238 | 1239 | "tests": { 1240 | "columns": { 1241 | "order_id": ["not_null", "unique"], 1242 | "order_date": ["not_null"] 1243 | }, 1244 | "table": ["row_count_gt_0"] 1245 | }, 1246 | 1247 | "metadata": { 1248 | "file_path": "/models/analytics/recent_orders.sql", 1249 | "owner": "analytics-team", 1250 | "tags": ["orders", "incremental"] 1251 | } 1252 | } 1253 | ``` 1254 | 1255 |
1256 | 1257 | ### Append Example 1258 | 1259 |
1260 | YAML Format 1261 | 1262 | ```yaml 1263 | ots_version: "0.2.0" 1264 | transformation_id: "logs.event_stream" 1265 | description: "Append-only event log" 1266 | 1267 | transformation_type: "sql" 1268 | code: 1269 | sql: 1270 | original_sql: "SELECT event_id, timestamp, user_id, event_type, payload FROM source.events WHERE timestamp >= '@start_date'" 1271 | resolved_sql: "SELECT event_id, timestamp, user_id, event_type, payload FROM warehouse.source.events WHERE timestamp >= '@start_date'" 1272 | source_tables: ["source.events"] 1273 | source_functions: [] 1274 | 1275 | schema: 1276 | columns: 1277 | - name: "event_id" 1278 | datatype: "string" 1279 | description: "Unique event identifier" 1280 | - name: "timestamp" 1281 | datatype: "date" 1282 | description: "Event timestamp" 1283 | - name: "user_id" 1284 | datatype: "string" 1285 | description: "User who triggered the event" 1286 | - name: "event_type" 1287 | datatype: "string" 1288 | description: "Type of event" 1289 | - name: "payload" 1290 | datatype: "object" 1291 | description: "Event payload data" 1292 | partitioning: ["timestamp"] 1293 | indexes: 1294 | - name: "idx_timestamp" 1295 | columns: ["timestamp"] 1296 | - name: "idx_user_id" 1297 | columns: ["user_id"] 1298 | 1299 | materialization: 1300 | type: "incremental" 1301 | incremental_details: 1302 | strategy: "append" 1303 | filter_condition: "timestamp >= '@start_date'" 1304 | 1305 | tests: 1306 | columns: 1307 | event_id: ["not_null", "unique"] 1308 | timestamp: ["not_null"] 1309 | table: ["row_count_gt_0"] 1310 | 1311 | metadata: 1312 | file_path: "/models/logs/event_stream.sql" 1313 | owner: "data-engineering" 1314 | tags: ["events", "append-only"] 1315 | ``` 1316 | 1317 |
1318 | 1319 |
1320 | JSON Format 1321 | 1322 | ```json 1323 | { 1324 | "ots_version": "0.2.0", 1325 | "transformation_id": "logs.event_stream", 1326 | "description": "Append-only event log", 1327 | 1328 | "transformation_type": "sql", 1329 | "code": { 1330 | "sql": { 1331 | "original_sql": "SELECT event_id, timestamp, user_id, event_type, payload FROM source.events WHERE timestamp >= '@start_date'", 1332 | "resolved_sql": "SELECT event_id, timestamp, user_id, event_type, payload FROM warehouse.source.events WHERE timestamp >= '@start_date'", 1333 | "source_tables": ["source.events"], 1334 | "source_functions": [] 1335 | } 1336 | }, 1337 | 1338 | "schema": { 1339 | "columns": [ 1340 | { 1341 | "name": "event_id", 1342 | "datatype": "string", 1343 | "description": "Unique event identifier" 1344 | }, 1345 | { 1346 | "name": "timestamp", 1347 | "datatype": "date", 1348 | "description": "Event timestamp" 1349 | }, 1350 | { 1351 | "name": "user_id", 1352 | "datatype": "string", 1353 | "description": "User who triggered the event" 1354 | }, 1355 | { 1356 | "name": "event_type", 1357 | "datatype": "string", 1358 | "description": "Type of event" 1359 | }, 1360 | { 1361 | "name": "payload", 1362 | "datatype": "object", 1363 | "description": "Event payload data" 1364 | } 1365 | ], 1366 | "partitioning": ["timestamp"], 1367 | "indexes": [ 1368 | { 1369 | "name": "idx_timestamp", 1370 | "columns": ["timestamp"] 1371 | }, 1372 | { 1373 | "name": "idx_user_id", 1374 | "columns": ["user_id"] 1375 | } 1376 | ] 1377 | }, 1378 | 1379 | "materialization": { 1380 | "type": "incremental", 1381 | "incremental_details": { 1382 | "strategy": "append", 1383 | "filter_condition": "timestamp >= '@start_date'" 1384 | } 1385 | }, 1386 | 1387 | "tests": { 1388 | "columns": { 1389 | "event_id": ["not_null", "unique"], 1390 | "timestamp": ["not_null"] 1391 | }, 1392 | "table": ["row_count_gt_0"] 1393 | }, 1394 | 1395 | "metadata": { 1396 | "file_path": "/models/logs/event_stream.sql", 1397 | "owner": "data-engineering", 1398 | "tags": ["events", "append-only"] 1399 | } 1400 | } 1401 | ``` 1402 | 1403 |
1404 | 1405 | ### Merge Example 1406 | 1407 |
1408 | YAML Format 1409 | 1410 | ```yaml 1411 | ots_version: "0.2.0" 1412 | transformation_id: "product.master_data" 1413 | description: "Customer master data with upsert logic" 1414 | 1415 | transformation_type: "sql" 1416 | code: 1417 | sql: 1418 | original_sql: "SELECT customer_id, name, email, phone, updated_at FROM source.customers WHERE updated_at >= '@start_date'" 1419 | resolved_sql: "SELECT customer_id, name, email, phone, updated_at FROM warehouse.source.customers WHERE updated_at >= '@start_date'" 1420 | source_tables: ["source.customers"] 1421 | source_functions: [] 1422 | 1423 | schema: 1424 | columns: 1425 | - name: "customer_id" 1426 | datatype: "number" 1427 | description: "Unique customer identifier" 1428 | - name: "name" 1429 | datatype: "string" 1430 | description: "Customer name" 1431 | - name: "email" 1432 | datatype: "string" 1433 | description: "Customer email" 1434 | - name: "phone" 1435 | datatype: "string" 1436 | description: "Customer phone number" 1437 | - name: "updated_at" 1438 | datatype: "date" 1439 | description: "Last update timestamp" 1440 | partitioning: [] 1441 | indexes: 1442 | - name: "idx_customer_id" 1443 | columns: ["customer_id"] 1444 | - name: "idx_email" 1445 | columns: ["email"] 1446 | 1447 | materialization: 1448 | type: "incremental" 1449 | incremental_details: 1450 | strategy: "merge" 1451 | filter_condition: "updated_at >= '@start_date'" 1452 | merge_key: ["customer_id"] 1453 | update_columns: ["name", "email", "phone", "updated_at"] 1454 | 1455 | tests: 1456 | columns: 1457 | customer_id: ["not_null", "unique"] 1458 | email: ["not_null"] 1459 | table: ["row_count_gt_0", "unique"] # unique at table level checks all columns for row uniqueness 1460 | 1461 | metadata: 1462 | file_path: "/models/product/master_data.sql" 1463 | owner: "product-team" 1464 | tags: ["customers", "master-data"] 1465 | ``` 1466 | 1467 |
1468 | 1469 |
1470 | JSON Format 1471 | 1472 | ```json 1473 | { 1474 | "ots_version": "0.2.0", 1475 | "transformation_id": "product.master_data", 1476 | "description": "Customer master data with upsert logic", 1477 | 1478 | "transformation_type": "sql", 1479 | "code": { 1480 | "sql": { 1481 | "original_sql": "SELECT customer_id, name, email, phone, updated_at FROM source.customers WHERE updated_at >= '@start_date'", 1482 | "resolved_sql": "SELECT customer_id, name, email, phone, updated_at FROM warehouse.source.customers WHERE updated_at >= '@start_date'", 1483 | "source_tables": ["source.customers"], 1484 | "source_functions": [] 1485 | } 1486 | }, 1487 | 1488 | "schema": { 1489 | "columns": [ 1490 | { 1491 | "name": "customer_id", 1492 | "datatype": "number", 1493 | "description": "Unique customer identifier" 1494 | }, 1495 | { 1496 | "name": "name", 1497 | "datatype": "string", 1498 | "description": "Customer name" 1499 | }, 1500 | { 1501 | "name": "email", 1502 | "datatype": "string", 1503 | "description": "Customer email" 1504 | }, 1505 | { 1506 | "name": "phone", 1507 | "datatype": "string", 1508 | "description": "Customer phone number" 1509 | }, 1510 | { 1511 | "name": "updated_at", 1512 | "datatype": "date", 1513 | "description": "Last update timestamp" 1514 | } 1515 | ], 1516 | "partitioning": [], 1517 | "indexes": [ 1518 | { 1519 | "name": "idx_customer_id", 1520 | "columns": ["customer_id"] 1521 | }, 1522 | { 1523 | "name": "idx_email", 1524 | "columns": ["email"] 1525 | } 1526 | ] 1527 | }, 1528 | 1529 | "materialization": { 1530 | "type": "incremental", 1531 | "incremental_details": { 1532 | "strategy": "merge", 1533 | "filter_condition": "updated_at >= '@start_date'", 1534 | "merge_key": ["customer_id"], 1535 | "update_columns": ["name", "email", "phone", "updated_at"] 1536 | } 1537 | }, 1538 | 1539 | "tests": { 1540 | "columns": { 1541 | "customer_id": ["not_null", "unique"], 1542 | "email": ["not_null"] 1543 | }, 1544 | "table": ["row_count_gt_0", "unique"] 1545 | }, 1546 | 1547 | "metadata": { 1548 | "file_path": "/models/product/master_data.sql", 1549 | "owner": "product-team", 1550 | "tags": ["customers", "master-data"] 1551 | } 1552 | } 1553 | ``` 1554 | 1555 |
1556 | 1557 | ### SCD2 Example 1558 | 1559 |
1560 | YAML Format 1561 | 1562 | ```yaml 1563 | ots_version: "0.2.0" 1564 | transformation_id: "dim.products_scd2" 1565 | description: "Product dimension with full history tracking" 1566 | 1567 | transformation_type: "sql" 1568 | code: 1569 | sql: 1570 | original_sql: "SELECT product_id, product_name, price, category, updated_at FROM source.products WHERE updated_at >= '@start_date'" 1571 | resolved_sql: "SELECT product_id, product_name, price, category, updated_at FROM warehouse.source.products WHERE updated_at >= '@start_date'" 1572 | source_tables: ["source.products"] 1573 | source_functions: [] 1574 | 1575 | schema: 1576 | columns: 1577 | - name: "product_id" 1578 | datatype: "number" 1579 | description: "Unique product identifier" 1580 | - name: "product_name" 1581 | datatype: "string" 1582 | description: "Product name" 1583 | - name: "price" 1584 | datatype: "number" 1585 | description: "Product price" 1586 | - name: "category" 1587 | datatype: "string" 1588 | description: "Product category" 1589 | - name: "updated_at" 1590 | datatype: "date" 1591 | description: "Last update timestamp" 1592 | - name: "valid_from" 1593 | datatype: "date" 1594 | description: "Record validity start date" 1595 | - name: "valid_to" 1596 | datatype: "date" 1597 | description: "Record validity end date" 1598 | partitioning: [] 1599 | indexes: 1600 | - name: "idx_product_id" 1601 | columns: ["product_id"] 1602 | - name: "idx_valid_from" 1603 | columns: ["valid_from"] 1604 | 1605 | materialization: 1606 | type: "scd2" 1607 | scd2_details: 1608 | unique_key: ["product_id"] 1609 | start_column: "valid_from" 1610 | end_column: "valid_to" 1611 | 1612 | tests: 1613 | columns: 1614 | product_id: ["not_null", "unique"] 1615 | valid_from: ["not_null"] 1616 | table: ["row_count_gt_0"] 1617 | 1618 | metadata: 1619 | file_path: "/models/dim/products_scd2.sql" 1620 | owner: "data-engineering" 1621 | tags: ["products", "scd2", "dimension"] 1622 | ``` 1623 | 1624 |
1625 | 1626 |
1627 | JSON Format 1628 | 1629 | ```json 1630 | { 1631 | "ots_version": "0.2.0", 1632 | "transformation_id": "dim.products_scd2", 1633 | "description": "Product dimension with full history tracking", 1634 | 1635 | "transformation_type": "sql", 1636 | "code": { 1637 | "sql": { 1638 | "original_sql": "SELECT product_id, product_name, price, category, updated_at FROM source.products WHERE updated_at >= '@start_date'", 1639 | "resolved_sql": "SELECT product_id, product_name, price, category, updated_at FROM warehouse.source.products WHERE updated_at >= '@start_date'", 1640 | "source_tables": ["source.products"], 1641 | "source_functions": [] 1642 | } 1643 | }, 1644 | 1645 | "schema": { 1646 | "columns": [ 1647 | { 1648 | "name": "product_id", 1649 | "datatype": "number", 1650 | "description": "Unique product identifier" 1651 | }, 1652 | { 1653 | "name": "product_name", 1654 | "datatype": "string", 1655 | "description": "Product name" 1656 | }, 1657 | { 1658 | "name": "price", 1659 | "datatype": "number", 1660 | "description": "Product price" 1661 | }, 1662 | { 1663 | "name": "category", 1664 | "datatype": "string", 1665 | "description": "Product category" 1666 | }, 1667 | { 1668 | "name": "updated_at", 1669 | "datatype": "date", 1670 | "description": "Last update timestamp" 1671 | }, 1672 | { 1673 | "name": "valid_from", 1674 | "datatype": "date", 1675 | "description": "Record validity start date" 1676 | }, 1677 | { 1678 | "name": "valid_to", 1679 | "datatype": "date", 1680 | "description": "Record validity end date" 1681 | } 1682 | ], 1683 | "partitioning": [], 1684 | "indexes": [ 1685 | { 1686 | "name": "idx_product_id", 1687 | "columns": ["product_id"] 1688 | }, 1689 | { 1690 | "name": "idx_valid_from", 1691 | "columns": ["valid_from"] 1692 | } 1693 | ] 1694 | }, 1695 | 1696 | "materialization": { 1697 | "type": "scd2", 1698 | "scd2_details": { 1699 | "unique_key": ["product_id"], 1700 | "start_column": "valid_from", 1701 | "end_column": "valid_to" 1702 | } 1703 | }, 1704 | 1705 | "tests": { 1706 | "columns": { 1707 | "product_id": ["not_null", "unique"], 1708 | "valid_from": ["not_null"] 1709 | }, 1710 | "table": ["row_count_gt_0"] 1711 | }, 1712 | 1713 | "metadata": { 1714 | "file_path": "/models/dim/products_scd2.sql", 1715 | "owner": "data-engineering", 1716 | "tags": ["products", "scd2", "dimension"] 1717 | } 1718 | } 1719 | ``` 1720 | 1721 |
1722 | 1723 | ## Complete Example: Test Library and Module 1724 | 1725 | This example demonstrates a complete setup with a test library and a transformation module that uses both standard and custom tests. 1726 | 1727 | ### Test Library Example 1728 | 1729 |
1730 | YAML Format 1731 | 1732 | ```yaml 1733 | # tests/test_library.yaml 1734 | ots_version: "0.2.0" 1735 | test_library_version: "1.0" 1736 | description: "Shared data quality tests for analytics project" 1737 | 1738 | generic_tests: 1739 | check_minimum_rows: 1740 | type: "sql" 1741 | level: "table" 1742 | description: "Ensures table has minimum number of rows" 1743 | sql: | 1744 | SELECT 1 as violation 1745 | FROM @table_name 1746 | GROUP BY 1 1747 | HAVING COUNT(*) < @min_rows:10 1748 | parameters: 1749 | min_rows: 1750 | type: "number" 1751 | default: 10 1752 | description: "Minimum number of rows required" 1753 | 1754 | column_not_negative: 1755 | type: "sql" 1756 | level: "column" 1757 | description: "Ensures numeric column has no negative values" 1758 | sql: | 1759 | SELECT @column_name 1760 | FROM @table_name 1761 | WHERE @column_name < 0 1762 | parameters: [] 1763 | 1764 | singular_tests: 1765 | test_customers_email_format: 1766 | type: "sql" 1767 | level: "table" 1768 | description: "Validates email format for customers table" 1769 | sql: | 1770 | SELECT id, email 1771 | FROM analytics.customers 1772 | WHERE email NOT LIKE '%@%.%' 1773 | target_transformation: "analytics.customers" 1774 | ``` 1775 | 1776 |
1777 | 1778 |
1779 | JSON Format 1780 | 1781 | ```json 1782 | { 1783 | "ots_version": "0.2.0", 1784 | "test_library_version": "1.0", 1785 | "description": "Shared data quality tests for analytics project", 1786 | "generic_tests": { 1787 | "check_minimum_rows": { 1788 | "type": "sql", 1789 | "level": "table", 1790 | "description": "Ensures table has minimum number of rows", 1791 | "sql": "SELECT 1 as violation\nFROM @table_name\nGROUP BY 1\nHAVING COUNT(*) < @min_rows:10", 1792 | "parameters": { 1793 | "min_rows": { 1794 | "type": "number", 1795 | "default": 10, 1796 | "description": "Minimum number of rows required" 1797 | } 1798 | } 1799 | }, 1800 | "column_not_negative": { 1801 | "type": "sql", 1802 | "level": "column", 1803 | "description": "Ensures numeric column has no negative values", 1804 | "sql": "SELECT @column_name\nFROM @table_name\nWHERE @column_name < 0", 1805 | "parameters": [] 1806 | } 1807 | }, 1808 | "singular_tests": { 1809 | "test_customers_email_format": { 1810 | "type": "sql", 1811 | "level": "table", 1812 | "description": "Validates email format for customers table", 1813 | "sql": "SELECT id, email\nFROM analytics.customers\nWHERE email NOT LIKE '%@%.%'", 1814 | "target_transformation": "analytics.customers" 1815 | } 1816 | } 1817 | } 1818 | ``` 1819 | 1820 |
1821 | 1822 | ### Module Using Test Library 1823 | 1824 |
1825 | YAML Format 1826 | 1827 | ```yaml 1828 | ots_version: "0.2.0" 1829 | module_name: "analytics_customers" 1830 | module_description: "Customer analytics transformations" 1831 | test_library_path: "../tests/test_library.yaml" 1832 | tags: ["analytics", "production"] 1833 | 1834 | target: 1835 | database: "warehouse" 1836 | schema: "analytics" 1837 | sql_dialect: "postgres" 1838 | 1839 | transformations: 1840 | - transformation_id: "analytics.customers" 1841 | description: "Customer data table" 1842 | transformation_type: "sql" 1843 | 1844 | code: 1845 | sql: 1846 | original_sql: "SELECT id, name, email, created_at, amount FROM source.customers WHERE active = true" 1847 | resolved_sql: "SELECT id, name, email, created_at, amount FROM warehouse.source.customers WHERE active = true" 1848 | source_tables: ["source.customers"] 1849 | source_functions: [] 1850 | 1851 | schema: 1852 | columns: 1853 | - name: "id" 1854 | datatype: "number" 1855 | description: "Unique customer identifier" 1856 | - name: "name" 1857 | datatype: "string" 1858 | description: "Customer name" 1859 | - name: "email" 1860 | datatype: "string" 1861 | description: "Customer email address" 1862 | - name: "created_at" 1863 | datatype: "date" 1864 | description: "Customer creation date" 1865 | - name: "amount" 1866 | datatype: "number" 1867 | description: "Customer account balance" 1868 | partitioning: [] 1869 | indexes: 1870 | - name: "idx_customers_id" 1871 | columns: ["id"] 1872 | - name: "idx_customers_email" 1873 | columns: ["email"] 1874 | 1875 | materialization: 1876 | type: "table" 1877 | 1878 | tests: 1879 | columns: 1880 | id: 1881 | - "not_null" # Standard test 1882 | - "unique" # Standard test (column-level) 1883 | email: 1884 | - "not_null" 1885 | - name: "accepted_values" # Standard test with params 1886 | params: 1887 | values: ["gmail.com", "yahoo.com", "company.com"] 1888 | amount: 1889 | - name: "column_not_negative" # Generic test from library 1890 | table: 1891 | - "row_count_gt_0" # Standard test 1892 | - "unique" # Standard test (table-level, checks all columns for row uniqueness) 1893 | - name: "check_minimum_rows" # Generic test with params 1894 | params: 1895 | min_rows: 100 1896 | - "test_customers_email_format" # Singular test from library 1897 | 1898 | metadata: 1899 | file_path: "/models/analytics/customers.sql" 1900 | owner: "data-team" 1901 | tags: ["customer", "core"] 1902 | object_tags: 1903 | sensitivity_tag: "pii" 1904 | classification: "internal" 1905 | ``` 1906 | 1907 |
1908 | 1909 |
1910 | JSON Format 1911 | 1912 | ```json 1913 | { 1914 | "ots_version": "0.2.0", 1915 | "module_name": "analytics_customers", 1916 | "module_description": "Customer analytics transformations", 1917 | "test_library_path": "../tests/test_library.yaml", 1918 | "tags": ["analytics", "production"], 1919 | "target": { 1920 | "database": "warehouse", 1921 | "schema": "analytics", 1922 | "sql_dialect": "postgres" 1923 | }, 1924 | "transformations": [ 1925 | { 1926 | "transformation_id": "analytics.customers", 1927 | "description": "Customer data table", 1928 | "transformation_type": "sql", 1929 | "code": { 1930 | "sql": { 1931 | "original_sql": "SELECT id, name, email, created_at, amount FROM source.customers WHERE active = true", 1932 | "resolved_sql": "SELECT id, name, email, created_at, amount FROM warehouse.source.customers WHERE active = true", 1933 | "source_tables": ["source.customers"], 1934 | "source_functions": [] 1935 | } 1936 | }, 1937 | "schema": { 1938 | "columns": [ 1939 | { 1940 | "name": "id", 1941 | "datatype": "number", 1942 | "description": "Unique customer identifier" 1943 | }, 1944 | { 1945 | "name": "name", 1946 | "datatype": "string", 1947 | "description": "Customer name" 1948 | }, 1949 | { 1950 | "name": "email", 1951 | "datatype": "string", 1952 | "description": "Customer email address" 1953 | }, 1954 | { 1955 | "name": "created_at", 1956 | "datatype": "date", 1957 | "description": "Customer creation date" 1958 | }, 1959 | { 1960 | "name": "amount", 1961 | "datatype": "number", 1962 | "description": "Customer account balance" 1963 | } 1964 | ], 1965 | "partitioning": [], 1966 | "indexes": [ 1967 | { 1968 | "name": "idx_customers_id", 1969 | "columns": ["id"] 1970 | }, 1971 | { 1972 | "name": "idx_customers_email", 1973 | "columns": ["email"] 1974 | } 1975 | ] 1976 | }, 1977 | "materialization": { 1978 | "type": "table" 1979 | }, 1980 | "tests": { 1981 | "columns": { 1982 | "id": ["not_null", "unique"], 1983 | "email": [ 1984 | "not_null", 1985 | { 1986 | "name": "accepted_values", 1987 | "params": { 1988 | "values": ["gmail.com", "yahoo.com", "company.com"] 1989 | } 1990 | } 1991 | ], 1992 | "amount": [ 1993 | { 1994 | "name": "column_not_negative" 1995 | } 1996 | ] 1997 | }, 1998 | "table": [ 1999 | "row_count_gt_0", 2000 | "unique", 2001 | { 2002 | "name": "check_minimum_rows", 2003 | "params": { 2004 | "min_rows": 100 2005 | } 2006 | }, 2007 | "test_customers_email_format" 2008 | ] 2009 | }, 2010 | "metadata": { 2011 | "file_path": "/models/analytics/customers.sql", 2012 | "owner": "data-team", 2013 | "tags": ["customer", "core"], 2014 | "object_tags": { 2015 | "sensitivity_tag": "pii", 2016 | "classification": "internal" 2017 | } 2018 | } 2019 | } 2020 | ] 2021 | } 2022 | ``` 2023 | 2024 |
2025 | --------------------------------------------------------------------------------