├── LICENSE ├── README.md ├── data_dictionary_scripts.md ├── data_validation_scripts.md ├── greenplum ├── data_dictionary │ └── data_dict_dump.sql └── schemadiff │ └── simple_schema_dump.sql ├── img ├── 01_schemadiff_side_by_side.png ├── 02_data_dictionary_in_xl.png ├── 03_data_dictionary_edit_and_run_script.png ├── 04_data_validation_scripts.png └── db_icons │ ├── gp_icon.png │ ├── mssql_icon.png │ ├── mysql_icon.png │ ├── oracle_icon.png │ └── pgsql_icon1.png ├── mssql ├── data_dictionary │ └── data_dict_dump.sql ├── data_validation │ ├── best_practices.md │ ├── demo_data │ │ ├── demo_hr_01_create_database.sql │ │ ├── demo_hr_02_create_tables.sql │ │ └── demo_hr_03_populate_tables.sql │ ├── how_to_use_advanced_script.md │ ├── how_to_use_basic_script.md │ ├── img │ │ ├── 01_data_val_mssql_example_test_case_sql_code.png │ │ ├── 02_data_val_mssql_run_results1.png │ │ ├── 03_data_val_mssql_adv_test_case_ex.png │ │ └── 04_data_val_mssql_run_results_adv.png │ ├── rule_set_01__row_counts.md │ ├── rule_set_02__keys.md │ ├── rule_set_03__heuristic_thresholds.md │ ├── rule_set_04__numeric_values.md │ ├── rule_set_05__date_values.md │ ├── rule_set_06__text_values.md │ ├── rule_set_07__regular_expressions.md │ ├── rule_set_08__diff_checks.md │ ├── rule_set_09__defect_regression.md │ └── sql_scripts │ │ ├── dv_advanced_test_cases.sql │ │ ├── dv_basic_test_cases.sql │ │ └── dv_test_case_snippets.sql └── schemadiff │ └── simple_schema_dump.sql ├── mysql ├── data_dictionary │ └── data_dict_dump.sql ├── data_validation │ ├── best_practices.md │ ├── demo_data │ │ ├── demo_hr_01_create_database.sql │ │ ├── demo_hr_02_create_tables.sql │ │ └── demo_hr_03_populate_tables.sql │ ├── how_to_use_advanced_script.md │ ├── how_to_use_basic_script.md │ ├── img │ │ ├── 01_data_val_mysql_example_test_case_sql_code.png │ │ ├── 02_data_val_mysql_run_results.png │ │ ├── 03_data_val_mysql_adv_test_case.png │ │ └── 04_data_val_mysql_run_results_adv.png │ ├── rule_set_01__row_counts.md │ ├── 
rule_set_02__keys.md │ ├── rule_set_03__heuristic_thresholds.md │ ├── rule_set_04__numeric_values.md │ ├── rule_set_05__date_values.md │ ├── rule_set_06__text_values.md │ ├── rule_set_07__regular_expressions.md │ ├── rule_set_08__diff_checks.md │ ├── rule_set_09__defect_regression.md │ └── sql_scripts │ │ ├── dv_advanced_test_cases.sql │ │ ├── dv_basic_test_cases.sql │ │ └── dv_test_case_snippets.sql └── schemadiff │ └── simple_schema_dump.sql ├── oracle ├── data_dictionary │ └── data_dict_dump.sql ├── data_validation │ ├── best_practices.md │ ├── demo_data │ │ ├── demo_hr_01_create_tables.sql │ │ └── demo_hr_02_populate_tables.sql │ ├── how_to_use_advanced_script.md │ ├── how_to_use_basic_script.md │ ├── img │ │ ├── 01_data_val_oracle_example_test_case_sql_code.png │ │ ├── 02_data_val_oracle_run_results1.png │ │ ├── 03_data_val_oracle_adv_test_case_ex.png │ │ └── 04_data_val_oracle_run_results_adv.png │ ├── rule_set_01__row_counts.md │ ├── rule_set_02__keys.md │ ├── rule_set_03__heuristic_thresholds.md │ ├── rule_set_04__numeric_values.md │ ├── rule_set_05__date_values.md │ ├── rule_set_06__text_values.md │ ├── rule_set_07__regular_expressions.md │ ├── rule_set_08__diff_checks.md │ ├── rule_set_09__defect_regression.md │ └── sql_scripts │ │ ├── dv_advanced_01_setup.sql │ │ ├── dv_advanced_02_test_cases.sql │ │ ├── dv_basic_test_cases.sql │ │ └── dv_test_case_snippets.sql └── schemadiff │ └── simple_schema_dump.sql ├── postgresql ├── data_dictionary │ └── data_dict_dump.sql ├── data_validation │ ├── best_practices.md │ ├── demo_data │ │ ├── demo_hr_01_create_database.sql │ │ ├── demo_hr_02_create_tables.sql │ │ └── demo_hr_03_populate_tables.sql │ ├── how_to_use_advanced_script.md │ ├── how_to_use_basic_script.md │ ├── img │ │ ├── 01_data_val_postgresql_example_test_case_sql_code.png │ │ ├── 03_data_val_postgresql_adv_test_case_ex2.png │ │ └── 04_data_val_postgresql_run_results_adv.png │ ├── rule_set_01__row_counts.md │ ├── rule_set_02__keys.md │ ├── 
rule_set_03__heuristic_thresholds.md │ ├── rule_set_04__numeric_values.md │ ├── rule_set_05__date_values.md │ ├── rule_set_06__text_values.md │ ├── rule_set_07__regular_expressions.md │ ├── rule_set_08__diff_checks.md │ ├── rule_set_09__defect_regression.md │ └── sql_scripts │ │ ├── dv_advanced_test_cases.sql │ │ ├── dv_basic_test_cases.sql │ │ └── dv_test_case_snippets.sql └── schemadiff │ └── simple_schema_dump.sql └── schemadiff_scripts.md /LICENSE: -------------------------------------------------------------------------------- 1 | Creative Commons Legal Code 2 | 3 | CC0 1.0 Universal 4 | 5 | CREATIVE COMMONS CORPORATION IS NOT A LAW FIRM AND DOES NOT PROVIDE 6 | LEGAL SERVICES. DISTRIBUTION OF THIS DOCUMENT DOES NOT CREATE AN 7 | ATTORNEY-CLIENT RELATIONSHIP. CREATIVE COMMONS PROVIDES THIS 8 | INFORMATION ON AN "AS-IS" BASIS. CREATIVE COMMONS MAKES NO WARRANTIES 9 | REGARDING THE USE OF THIS DOCUMENT OR THE INFORMATION OR WORKS 10 | PROVIDED HEREUNDER, AND DISCLAIMS LIABILITY FOR DAMAGES RESULTING FROM 11 | THE USE OF THIS DOCUMENT OR THE INFORMATION OR WORKS PROVIDED 12 | HEREUNDER. 13 | 14 | Statement of Purpose 15 | 16 | The laws of most jurisdictions throughout the world automatically confer 17 | exclusive Copyright and Related Rights (defined below) upon the creator 18 | and subsequent owner(s) (each and all, an "owner") of an original work of 19 | authorship and/or a database (each, a "Work"). 20 | 21 | Certain owners wish to permanently relinquish those rights to a Work for 22 | the purpose of contributing to a commons of creative, cultural and 23 | scientific works ("Commons") that the public can reliably and without fear 24 | of later claims of infringement build upon, modify, incorporate in other 25 | works, reuse and redistribute as freely as possible in any form whatsoever 26 | and for any purposes, including without limitation commercial purposes. 
27 | These owners may contribute to the Commons to promote the ideal of a free 28 | culture and the further production of creative, cultural and scientific 29 | works, or to gain reputation or greater distribution for their Work in 30 | part through the use and efforts of others. 31 | 32 | For these and/or other purposes and motivations, and without any 33 | expectation of additional consideration or compensation, the person 34 | associating CC0 with a Work (the "Affirmer"), to the extent that he or she 35 | is an owner of Copyright and Related Rights in the Work, voluntarily 36 | elects to apply CC0 to the Work and publicly distribute the Work under its 37 | terms, with knowledge of his or her Copyright and Related Rights in the 38 | Work and the meaning and intended legal effect of CC0 on those rights. 39 | 40 | 1. Copyright and Related Rights. A Work made available under CC0 may be 41 | protected by copyright and related or neighboring rights ("Copyright and 42 | Related Rights"). Copyright and Related Rights include, but are not 43 | limited to, the following: 44 | 45 | i. the right to reproduce, adapt, distribute, perform, display, 46 | communicate, and translate a Work; 47 | ii. moral rights retained by the original author(s) and/or performer(s); 48 | iii. publicity and privacy rights pertaining to a person's image or 49 | likeness depicted in a Work; 50 | iv. rights protecting against unfair competition in regards to a Work, 51 | subject to the limitations in paragraph 4(a), below; 52 | v. rights protecting the extraction, dissemination, use and reuse of data 53 | in a Work; 54 | vi. database rights (such as those arising under Directive 96/9/EC of the 55 | European Parliament and of the Council of 11 March 1996 on the legal 56 | protection of databases, and under any national implementation 57 | thereof, including any amended or successor version of such 58 | directive); and 59 | vii. 
other similar, equivalent or corresponding rights throughout the 60 | world based on applicable law or treaty, and any national 61 | implementations thereof. 62 | 63 | 2. Waiver. To the greatest extent permitted by, but not in contravention 64 | of, applicable law, Affirmer hereby overtly, fully, permanently, 65 | irrevocably and unconditionally waives, abandons, and surrenders all of 66 | Affirmer's Copyright and Related Rights and associated claims and causes 67 | of action, whether now known or unknown (including existing as well as 68 | future claims and causes of action), in the Work (i) in all territories 69 | worldwide, (ii) for the maximum duration provided by applicable law or 70 | treaty (including future time extensions), (iii) in any current or future 71 | medium and for any number of copies, and (iv) for any purpose whatsoever, 72 | including without limitation commercial, advertising or promotional 73 | purposes (the "Waiver"). Affirmer makes the Waiver for the benefit of each 74 | member of the public at large and to the detriment of Affirmer's heirs and 75 | successors, fully intending that such Waiver shall not be subject to 76 | revocation, rescission, cancellation, termination, or any other legal or 77 | equitable action to disrupt the quiet enjoyment of the Work by the public 78 | as contemplated by Affirmer's express Statement of Purpose. 79 | 80 | 3. Public License Fallback. Should any part of the Waiver for any reason 81 | be judged legally invalid or ineffective under applicable law, then the 82 | Waiver shall be preserved to the maximum extent permitted taking into 83 | account Affirmer's express Statement of Purpose. 
In addition, to the 84 | extent the Waiver is so judged Affirmer hereby grants to each affected 85 | person a royalty-free, non transferable, non sublicensable, non exclusive, 86 | irrevocable and unconditional license to exercise Affirmer's Copyright and 87 | Related Rights in the Work (i) in all territories worldwide, (ii) for the 88 | maximum duration provided by applicable law or treaty (including future 89 | time extensions), (iii) in any current or future medium and for any number 90 | of copies, and (iv) for any purpose whatsoever, including without 91 | limitation commercial, advertising or promotional purposes (the 92 | "License"). The License shall be deemed effective as of the date CC0 was 93 | applied by Affirmer to the Work. Should any part of the License for any 94 | reason be judged legally invalid or ineffective under applicable law, such 95 | partial invalidity or ineffectiveness shall not invalidate the remainder 96 | of the License, and in such case Affirmer hereby affirms that he or she 97 | will not (i) exercise any of his or her remaining Copyright and Related 98 | Rights in the Work or (ii) assert any associated claims and causes of 99 | action with respect to the Work, in either case contrary to Affirmer's 100 | express Statement of Purpose. 101 | 102 | 4. Limitations and Disclaimers. 103 | 104 | a. No trademark or patent rights held by Affirmer are waived, abandoned, 105 | surrendered, licensed or otherwise affected by this document. 106 | b. Affirmer offers the Work as-is and makes no representations or 107 | warranties of any kind concerning the Work, express, implied, 108 | statutory or otherwise, including without limitation warranties of 109 | title, merchantability, fitness for a particular purpose, non 110 | infringement, or the absence of latent or other defects, accuracy, or 111 | the present or absence of errors, whether or not discoverable, all to 112 | the greatest extent permissible under applicable law. 113 | c. 
Affirmer disclaims responsibility for clearing rights of other persons 114 | that may apply to the Work or any use thereof, including without 115 | limitation any person's Copyright and Related Rights in the Work. 116 | Further, Affirmer disclaims responsibility for obtaining any necessary 117 | consents, permissions or other rights required for any use of the 118 | Work. 119 | d. Affirmer understands and acknowledges that Creative Commons is not a 120 | party to this document and has no duty or obligation with respect to 121 | this CC0 or use of the Work. 122 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # SQL Scripts 2 | Useful SQL scripts for MSSQL, MySQL, Oracle, PostgreSQL, and Greenplum, based on years of usage and refinement. These common scripts are used by data analysts, software testers, and other database professionals.
3 |
4 | 5 | 6 | 7 | ## [Data Dictionary](https://github.com/DataResearchLabs/sql_scripts/blob/main/data_dictionary_scripts.md) 8 | 9 | 10 | 11 | The Data Dictionary script and tutorials enable you to easily document an existing database schema. You can dump the tables, views, column descriptions, data types/lengths/sizes/precision, key constraints, and other information. Export to Excel for pretty output and simple filtering, searching, and sharing.
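At its core, each platform's data dictionary script is a query against the database's catalog metadata. As a simplified sketch of the idea (the schema name `your_schema` is a placeholder; the full scripts in this repo layer on data types with precision, key flags, and column descriptions):

```sql
-- Simplified data dictionary sketch using the ANSI INFORMATION_SCHEMA views.
-- SQL Server, MySQL, PostgreSQL, and Greenplum expose these; the Oracle
-- script uses Oracle's own catalog views (e.g. ALL_TAB_COLUMNS) instead.
-- 'your_schema' is an illustrative placeholder.
SELECT c.table_name
     , c.ordinal_position AS ord_pos
     , c.column_name
     , c.data_type
     , CASE WHEN c.is_nullable = 'YES' THEN 'NULL' ELSE 'NOT NULL' END AS nullable
FROM information_schema.columns c
WHERE c.table_schema = 'your_schema'
ORDER BY c.table_name, c.ordinal_position;
```

Paste the result grid into Excel for the pretty, filterable output shown in the tutorials.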
12 |
13 |
14 |
15 |
16 | 17 | 18 | ## [Data Validation Scripts](https://github.com/DataResearchLabs/sql_scripts/blob/main/data_validation_scripts.md) 19 | 20 | 21 | 22 | The Data Validation Framework scripts and tutorials enable you to easily stand up a battery of automated data validation tests in your environment. Use the DVF for prod checkouts, regression tests in test and stage, unit tests in dev, or automated data validation after each data load. There are nine rule sets with 66 sample test cases demonstrating how to test for row counts, keys, heuristic thresholds, numeric/date/text values, regular expressions, and data or schema diffs. The basic data validation script executes all 66 sample validation tests, emitting one line of text per test with the test id, status, and test description -- a nice, simple way to organize your tests. The advanced data validation scripts execute the same 66 sample validation tests, but push output to a table and add the execution time in seconds, plus supporting detail rows on failures with the rejection code and reason, the expected value, the actual value, and the SQL to look up the rejected row so you can copy, paste, execute, and troubleshoot.
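To make the basic-script pattern concrete, here is a simplified sketch of one row-count test case in that one-line-per-test style (the table name and threshold are illustrative placeholders, not taken verbatim from `dv_basic_test_cases.sql`):

```sql
-- Sketch of a basic validation test case: each test emits exactly one row
-- carrying a test id, a pass/fail status, and a test description.
-- "demo_hr.employees" and the threshold of 25 are illustrative placeholders.
SELECT 'T001' AS tst_id
     , CASE WHEN COUNT(*) >= 25 THEN 'P' ELSE 'FAIL' END AS status
     , 'RS-1 Row Counts: verify minimum row count in table employees' AS tst_descr
FROM demo_hr.employees;
```

Union dozens of such SELECTs together and the whole suite reads as one compact pass/fail report.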
23 |
24 | 25 | 26 | ## [SchemaDiff](https://github.com/DataResearchLabs/sql_scripts/blob/main/schemadiff_scripts.md) 27 | 28 | 29 | 30 | The SchemaDiff script and tutorials enable you to track changes to your schema over time or between environments. You'll know exactly what changed last night with a deployment vs. the prior night's stable baseline. You'll know when folks are changing your development or test environment every morning rather than 15 days later, avoiding all the troubleshooting red herrings.
31 |
32 |
33 |
34 |
35 | 36 | 37 | ***If you like these scripts, please be sure to click the "Star" button above in GitHub.***
38 |
39 | ***Also, be sure to visit or subscribe to our YouTube channel*** www.DataResearchLabs.com!
40 |
41 |
42 | -------------------------------------------------------------------------------- /greenplum/data_dictionary/data_dict_dump.sql: -------------------------------------------------------------------------------- 1 | ------------------------------------------------------------------------------------ 2 | -- Data Dictionary Dump: 3 | -- This SQL script will dump table, column, key, and description design related 4 | -- metadata so that you can copy-paste or export to Excel as a Data Dictionary. 5 | ------------------------------------------------------------------------------------ 6 | -- Platform: Greenplum 7 | -- Author: DataResearchLabs 8 | -- GitHub: https://github.com/DataResearchLabs/sql_scripts 9 | -- YouTube Tutorials: https://www.youtube.com/channel/UCQciXv3xaBykeUFc04GxSXA 10 | ---------------------------------------------------------------------------------- 11 | WITH vars 12 | AS ( 13 | SELECT 14 | 'acq_cms' :: CHARACTER VARYING(50) AS v_SchemaName -- <<<<<<<<<<<<<<<< Set to the schema you wish to document 15 | , 'NO' :: CHARACTER VARYING(10) AS v_TablesOnly -- YES=Limit To Tables only; NO=Include views too 16 | ) 17 | 18 | , baseTbl 19 | AS ( 20 | SELECT table_schema AS SchemaName 21 | , table_catalog 22 | , table_type, table_name, table_schema 23 | FROM INFORMATION_SCHEMA.TABLES 24 | WHERE TABLE_SCHEMA = (SELECT v_SchemaName FROM vars) 25 | AND ( (TABLE_TYPE = 'BASE TABLE') 26 | OR ((SELECT v_TablesOnly FROM vars) = 'NO') 27 | ) 28 | ) 29 | 30 | , metadata 31 | AS ( 32 | SELECT 33 | bt.SchemaName AS schema_nm 34 | , bt.table_name AS table_nm 35 | , CASE WHEN bt.TABLE_TYPE = 'BASE TABLE' THEN 'TBL' 36 | WHEN bt.TABLE_TYPE = 'VIEW' THEN 'VW' 37 | ELSE 'UK' 38 | END AS obj_typ 39 | , tut.ordinal_position AS ord_pos 40 | , tut.column_name AS column_nm 41 | , tut.data_type || 42 | CASE WHEN tut.data_type IN('varchar','char') THEN '(' || tut.CHARACTER_MAXIMUM_LENGTH || ')' 43 | WHEN tut.data_type IN('date','time') THEN '(3)' 44 | WHEN tut.data_type = 
'datetime' THEN '(8)' 45 | WHEN tut.data_type = 'timestamp' THEN '(4)' 46 | WHEN tut.data_type in('bigint','int8','integer','int','int4','smallint','int2') THEN '(' || tut.NUMERIC_PRECISION || ')' 47 | WHEN tut.data_type = 'decimal' THEN '(' || tut.NUMERIC_PRECISION || ',' || tut.NUMERIC_SCALE || ')' 48 | WHEN tut.CHARACTER_MAXIMUM_LENGTH IS NOT NULL THEN '(' || tut.CHARACTER_MAXIMUM_LENGTH || ')' 49 | WHEN tut.DATETIME_PRECISION IS NOT NULL THEN '(' || tut.DATETIME_PRECISION || ')' 50 | WHEN tut.NUMERIC_PRECISION IS NOT NULL 51 | AND tut.NUMERIC_SCALE IS NULL THEN '(' || tut.NUMERIC_PRECISION || ')' 52 | WHEN tut.NUMERIC_PRECISION IS NOT NULL 53 | AND tut.NUMERIC_SCALE IS NOT NULL THEN '(' || tut.NUMERIC_PRECISION || ',' || tut.NUMERIC_SCALE || ')' 54 | ELSE '' 55 | END AS data_typ 56 | , CASE WHEN tut.IS_NULLABLE = 'YES' THEN 'NULL' ELSE 'NOT NULL' END AS nullable 57 | FROM INFORMATION_SCHEMA.COLUMNS tut 58 | INNER JOIN baseTbl bt ON bt.table_catalog = tut.TABLE_CATALOG AND bt.table_name = tut.table_name 59 | ) 60 | 61 | 62 | , meta_for_keys 63 | AS ( 64 | SELECT schema_nm, table_nm 65 | , COALESCE(distribution_keys,'DISTRIBUTED RANDOMLY') :: CHARACTER VARYING(255) AS column_nm 66 | , 'DK' :: CHARACTER VARYING(10) AS is_key 67 | FROM 68 | ( SELECT pgn.nspname AS schema_nm, 69 | pgc.relname AS table_nm, 70 | pga.attname AS distribution_keys 71 | FROM 72 | ( SELECT gdp.localoid 73 | , CASE WHEN ( Array_upper(gdp.attrnums, 1) > 0 ) THEN Unnest(gdp.attrnums) 74 | ELSE NULL 75 | END As attnum 76 | FROM gp_distribution_policy gdp 77 | ORDER BY gdp.localoid 78 | ) AS distrokey 79 | INNER JOIN pg_class pgc ON distrokey.localoid = pgc.oid 80 | INNER JOIN pg_namespace pgn ON pgc.relnamespace = pgn.oid 81 | LEFT OUTER JOIN pg_attribute pga ON distrokey.attnum = pga.attnum AND distrokey.localoid = pga.attrelid 82 | WHERE pgn.nspname = (SELECT v_SchemaName FROM vars) 83 | AND pgc.relname IN(SELECT DISTINCT table_name FROM baseTbl) 84 | ) AS a 85 | ) 86 | 87 | , col_descr 88 | 
AS ( 89 | SELECT schema_nm 90 | , REPLACE(table_nm, schema_nm || '.', '') AS table_nm 91 | , column_nm, column_descr 92 | FROM ( 93 | SELECT 94 | ns.nspname AS schema_nm 95 | , (a.attrelid :: regclass) :: VARCHAR(150) AS table_nm 96 | , a.attname AS column_nm 97 | , d.description AS column_descr 98 | FROM pg_catalog.pg_attribute a 99 | LEFT JOIN pg_catalog.pg_class c ON a.attrelid = c.oid 100 | LEFT JOIN pg_catalog.pg_namespace ns ON c.relnamespace = ns.oid 101 | LEFT JOIN pg_catalog.pg_description d ON d.objoid = a.attrelid AND d.objsubid = a.attnum 102 | WHERE a.attnum > 0 103 | AND NOT a.attisdropped 104 | ) t 105 | WHERE schema_nm = (SELECT v_SchemaName FROM vars) 106 | AND REPLACE(table_nm, schema_nm || '.', '') IN(SELECT DISTINCT table_name FROM baseTbl) 107 | ) 108 | 109 | , combined_data 110 | AS ( 111 | SELECT md.schema_nm 112 | , md.table_nm 113 | , md.obj_typ 114 | , md.ord_pos AS ord 115 | , COALESCE(dk.is_key, ' ') AS is_key 116 | , md.column_nm 117 | , md.data_typ 118 | , md.nullable 119 | , cd.column_descr 120 | FROM metadata md 121 | LEFT JOIN col_descr cd ON cd.schema_nm = md.schema_nm AND cd.table_nm = md.table_nm AND cd.column_nm = md.column_nm 122 | LEFT JOIN meta_for_keys dk ON dk.schema_nm = md.schema_nm AND dk.table_nm = md.table_nm AND dk.column_nm = md.column_nm 123 | ) 124 | 125 | SELECT * 126 | FROM combined_data 127 | ORDER BY schema_nm, table_nm, ord -------------------------------------------------------------------------------- /greenplum/schemadiff/simple_schema_dump.sql: -------------------------------------------------------------------------------- 1 | ------------------------------------------------------------------------------------ 2 | -- Simple Schema Dump: 3 | -- This SQL script will dump table, column, key, and index design related metadata 4 | -- so that you can copy-paste or export to a text file. 
5 | -- Even better, you can make other snapshots over time (same database schema at earlier 6 | points in time), OR in different environments (DEV, PPMO, STAGE, PROD). Then, 7 | using your favorite free Text File Diff Tool (DiffMerge, ExamDiff, etc.) you 8 | can compare snapshots to quickly isolate and identify what changed over time 9 | or is different between environments. 10 | ------------------------------------------------------------------------------------ 11 | -- Platform: Greenplum Server 12 | -- Author: DataResearchLabs 13 | -- GitHub: https://github.com/DataResearchLabs/sql_scripts 14 | -- YouTube Tutorials: https://www.youtube.com/playlist?list=PLVHoUDdbskUSWcPnDztPhXOnQT1R-rg0t 15 | ---------------------------------------------------------------------------------- 16 | WITH vars 17 | AS ( 18 | SELECT 'acq_cms' :: CHARACTER VARYING(50) AS v_SchemaName 19 | ) 20 | 21 | 22 | , baseTbl 23 | AS ( 24 | SELECT table_schema AS SchemaName, table_type, table_name 25 | FROM INFORMATION_SCHEMA.TABLES 26 | WHERE table_schema = (SELECT v_SchemaName FROM vars) 27 | ) 28 | 29 | 30 | , metaForTbl 31 | AS ( 32 | SELECT SchemaName 33 | , table_name AS TableName 34 | , '(' || CASE WHEN table_type = 'BASE TABLE' THEN 'Table' 35 | WHEN table_type = 'VIEW' THEN 'View' 36 | ELSE 'UK' 37 | END 38 | || ')' :: CHARACTER VARYING(20) AS ObjectType 39 | , table_name AS ObjectName 40 | , '(Exists)' :: CHARACTER VARYING(20) AS PropertyName 41 | , '' :: CHARACTER VARYING(50) AS PropertyValue 42 | FROM baseTbl 43 | ) 44 | 45 | , metaForCol_dataType 46 | AS ( 47 | SELECT ft.SchemaName, ft.table_name AS TableName 48 | , 'Column' :: CHARACTER VARYING(15) AS ObjectType 49 | , tut.column_name AS ObjectName 50 | , '2' :: CHARACTER VARYING(10) AS PropertyName 51 | , COALESCE(tut.data_type, 'unknown') 52 | || CASE WHEN tut.CHARACTER_MAXIMUM_LENGTH IS NOT NULL 53 | OR tut.NUMERIC_PRECISION IS NOT NULL 54 | OR tut.NUMERIC_SCALE IS NOT NULL THEN '(' 55 | ELSE '' 56 | END 57 | || CASE WHEN
tut.CHARACTER_MAXIMUM_LENGTH IS NOT NULL THEN CAST(tut.CHARACTER_MAXIMUM_LENGTH AS VARCHAR(10)) 58 | ELSE '' 59 | END 60 | || CASE WHEN tut.DATA_TYPE IN('date','datetime','timestamp') THEN '' 61 | WHEN tut.NUMERIC_PRECISION IS NULL THEN '' 62 | ELSE CAST(tut.NUMERIC_PRECISION AS VARCHAR(10)) 63 | END 64 | || CASE WHEN tut.NUMERIC_SCALE IS NULL THEN '' 65 | WHEN tut.NUMERIC_SCALE >0 THEN ',' || CAST(tut.NUMERIC_SCALE AS VARCHAR(10)) 66 | ELSE '' 67 | END 68 | || CASE WHEN tut.CHARACTER_MAXIMUM_LENGTH IS NOT NULL 69 | OR tut.NUMERIC_PRECISION IS NOT NULL 70 | OR tut.NUMERIC_SCALE IS NOT NULL THEN ')' 71 | ELSE '' 72 | END 73 | :: CHARACTER VARYING(255) AS PropertyValue 74 | FROM INFORMATION_SCHEMA.COLUMNS tut 75 | INNER JOIN baseTbl ft ON ft.SchemaName = tut.TABLE_SCHEMA AND ft.TABLE_NAME = tut.TABLE_NAME 76 | ) 77 | 78 | 79 | , metaForCol_nullable 80 | AS ( 81 | SELECT ft.SchemaName, ft.table_name AS TableName 82 | , 'Column' :: CHARACTER VARYING(20) AS ObjectType 83 | , tut.column_name AS ObjectName 84 | , '3' :: CHARACTER VARYING(20) AS PropertyName 85 | , CASE WHEN tut.IS_NULLABLE = 'YES' THEN 'NULL' ELSE 'NOT NULL' END :: CHARACTER VARYING(20) AS PropertyValue 86 | FROM INFORMATION_SCHEMA.COLUMNS tut 87 | INNER JOIN baseTbl ft ON ft.SchemaName = tut.TABLE_SCHEMA AND ft.table_name = tut.table_name 88 | ) 89 | 90 | 91 | , metaForCol_ordpos 92 | AS ( 93 | SELECT ft.SchemaName, ft.table_name AS TableName 94 | , 'Column' :: CHARACTER VARYING(20) AS ObjectType 95 | , tut.column_name AS ObjectName 96 | , '1' :: CHARACTER VARYING(20) AS PropertyName 97 | , CASE WHEN tut.ORDINAL_POSITION IS NULL THEN '' 98 | ELSE LPAD( CAST(tut.ORDINAL_POSITION AS VARCHAR(3)), 3, '0') 99 | END AS PropertyValue 100 | FROM INFORMATION_SCHEMA.COLUMNS tut 101 | INNER JOIN baseTbl ft ON ft.SchemaName = tut.TABLE_SCHEMA AND ft.table_name = tut.table_name 102 | ) 103 | 104 | 105 | , metaAllCols 106 | AS ( 107 | SELECT schemaname, tablename, objecttype, objectname 108 | , 'Properties' :: 
CHARACTER VARYING(20) AS propertyname 109 | , STRING_AGG(propertyvalue, ' | ' ORDER BY propertyname, propertyvalue) :: CHARACTER VARYING(255) AS propertyvalue 110 | FROM ( 111 | SELECT * FROM metaForCol_dataType 112 | UNION SELECT * FROM metaForCol_nullable 113 | UNION SELECT * FROM metaForCol_ordpos 114 | ) t 115 | GROUP BY schemaname, tablename, objecttype, objectname 116 | ) 117 | 118 | 119 | , metaForKeys 120 | AS ( 121 | SELECT SchemaName, TableName 122 | , 'DistribKey' :: CHARACTER VARYING(20) AS ObjectType 123 | , 'n/a' :: CHARACTER VARYING(10) AS ObjectName 124 | , 'FieldList' :: CHARACTER VARYING(10) AS PropertyName 125 | , COALESCE(distribution_keys,'DISTRIBUTED RANDOMLY') :: CHARACTER VARYING(255) AS PropertyValue 126 | FROM 127 | ( SELECT pgn.nspname AS SchemaName, 128 | pgc.relname AS TableName, 129 | pga.attname AS distribution_keys 130 | FROM 131 | ( SELECT gdp.localoid 132 | , CASE WHEN ( Array_upper(gdp.attrnums, 1) > 0 ) THEN Unnest(gdp.attrnums) 133 | ELSE NULL 134 | END As attnum 135 | FROM gp_distribution_policy gdp 136 | ORDER BY gdp.localoid 137 | ) AS distrokey 138 | INNER JOIN pg_class pgc ON distrokey.localoid = pgc.oid 139 | INNER JOIN pg_namespace pgn ON pgc.relnamespace = pgn.oid 140 | LEFT OUTER JOIN pg_attribute pga ON distrokey.attnum = pga.attnum AND distrokey.localoid = pga.attrelid 141 | WHERE pgn.nspname = (SELECT v_SchemaName FROM vars) 142 | AND pgc.relname IN(SELECT DISTINCT table_name FROM baseTbl) 143 | ) AS a 144 | ) 145 | 146 | 147 | , metaForIdxs 148 | AS ( 149 | SELECT (SELECT v_SchemaName FROM vars) AS SchemaName 150 | , tablename AS TableName 151 | , 'Index' :: CHARACTER VARYING(20) AS ObjectType 152 | , indexname AS ObjectName 153 | , 'FieldList' :: CHARACTER VARYING(20) AS PropertyName 154 | , REPLACE(SUBSTR(indexdef, POSITION('(' IN indexdef)+ 1), ')', '') :: CHARACTER VARYING(255) AS PropertyValue 155 | FROM pg_catalog.pg_indexes 156 | WHERE schemaname = (SELECT v_SchemaName FROM vars) 157 | AND tablename IN(SELECT 
DISTINCT table_name FROM baseTbl) 158 | ) 159 | 160 | 161 | , allMetadata 162 | AS ( 163 | SELECT * FROM metaAllCols 164 | UNION SELECT * FROM metaForTbl /* not first, b/c propertyvalue column all nulls = no data type default = error */ 165 | UNION SELECT * FROM metaForKeys 166 | UNION SELECT * FROM metaForIdxs 167 | ) 168 | 169 | 170 | SELECT CASE WHEN objecttype IN('(Table)','(View)') THEN schemaname ELSE '' END AS schema_nm 171 | , CASE WHEN objecttype IN('(Table)','(View)') THEN tablename ELSE '' END AS tbl_nm 172 | , objecttype AS obj_typ, objectname AS obj_nm, /*propertyname,*/ propertyvalue AS properties 173 | FROM allMetadata 174 | ORDER BY schemaname, tablename, objecttype 175 | , CASE WHEN objecttype='Column' THEN propertyvalue ELSE '' END 176 | , objectname, propertyname 177 | -------------------------------------------------------------------------------- /img/01_schemadiff_side_by_side.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/DataResearchLabs/sql_scripts/3eaa17caf248d3b58763534d62ef4d9ac5729a46/img/01_schemadiff_side_by_side.png -------------------------------------------------------------------------------- /img/02_data_dictionary_in_xl.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/DataResearchLabs/sql_scripts/3eaa17caf248d3b58763534d62ef4d9ac5729a46/img/02_data_dictionary_in_xl.png -------------------------------------------------------------------------------- /img/03_data_dictionary_edit_and_run_script.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/DataResearchLabs/sql_scripts/3eaa17caf248d3b58763534d62ef4d9ac5729a46/img/03_data_dictionary_edit_and_run_script.png -------------------------------------------------------------------------------- /img/04_data_validation_scripts.png: 
-------------------------------------------------------------------------------- https://raw.githubusercontent.com/DataResearchLabs/sql_scripts/3eaa17caf248d3b58763534d62ef4d9ac5729a46/img/04_data_validation_scripts.png -------------------------------------------------------------------------------- /img/db_icons/gp_icon.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/DataResearchLabs/sql_scripts/3eaa17caf248d3b58763534d62ef4d9ac5729a46/img/db_icons/gp_icon.png -------------------------------------------------------------------------------- /img/db_icons/mssql_icon.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/DataResearchLabs/sql_scripts/3eaa17caf248d3b58763534d62ef4d9ac5729a46/img/db_icons/mssql_icon.png -------------------------------------------------------------------------------- /img/db_icons/mysql_icon.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/DataResearchLabs/sql_scripts/3eaa17caf248d3b58763534d62ef4d9ac5729a46/img/db_icons/mysql_icon.png -------------------------------------------------------------------------------- /img/db_icons/oracle_icon.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/DataResearchLabs/sql_scripts/3eaa17caf248d3b58763534d62ef4d9ac5729a46/img/db_icons/oracle_icon.png -------------------------------------------------------------------------------- /img/db_icons/pgsql_icon1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/DataResearchLabs/sql_scripts/3eaa17caf248d3b58763534d62ef4d9ac5729a46/img/db_icons/pgsql_icon1.png -------------------------------------------------------------------------------- /mssql/data_dictionary/data_dict_dump.sql: 
-------------------------------------------------------------------------------- 1 | 2 | ------------------------------------------------------------------------------------ 3 | -- Data Dictionary Dump: 4 | -- This SQL script will dump table, column, key, and description design related 5 | -- metadata so that you can copy-paste or export to Excel as a Data Dictionary. 6 | ------------------------------------------------------------------------------------ 7 | -- Platform: SQL Server 8 | -- Author: DataResearchLabs 9 | -- GitHub: https://github.com/DataResearchLabs/sql_scripts 10 | -- GitHub Tool: https://github.com/DataResearchLabs/data_analysts_toolbox/blob/main/data_dictionary_generator/readme.md 11 | -- YouTube Tutorials: https://www.youtube.com/channel/UCQciXv3xaBykeUFc04GxSXA 12 | ---------------------------------------------------------------------------------- 13 | 14 | -- IMPORTANT 15 | USE AdventureWorksLT2019; -- <<<<<<<<<<<< CHANGE THIS VALUE to Schema you want to dump 16 | ; 17 | 18 | -- All variables are consolidated here in the first CTE (Common Table Expression) 19 | -- Each given row is a variable, with the value you change preceding the "AS" command 20 | WITH vars 21 | AS ( 22 | SELECT 23 | DB_NAME() AS v_SchemaName -- (Do not change this value, it is picked up from changes above) 24 | , 'NO' AS v_TablesOnly -- Change this setting: YES=Limit To Tables only; NO=Include views too 25 | ) 26 | 27 | 28 | , baseTbl 29 | AS ( 30 | SELECT TABLE_CATALOG AS SchemaName, table_type, table_name, table_schema 31 | FROM INFORMATION_SCHEMA.TABLES 32 | WHERE TABLE_CATALOG = (SELECT v_SchemaName FROM vars) 33 | AND ( (TABLE_TYPE = 'BASE TABLE') 34 | OR ((SELECT v_TablesOnly FROM vars) = 'NO') 35 | ) 36 | ) 37 | 38 | 39 | , metadata 40 | AS ( 41 | SELECT 42 | bt.SchemaName AS schema_nm 43 | , bt.table_name AS table_nm 44 | , CASE WHEN bt.table_type = 'BASE TABLE' THEN 'TBL' 45 | WHEN bt.table_type = 'VIEW' THEN 'VW' 46 | ELSE 'UK' 47 | END AS obj_typ 48 | , 
RIGHT('000' + CAST(tut.ORDINAL_POSITION AS VARCHAR(3)), 3) AS ord_pos 49 | , tut.column_name AS column_nm 50 | , COALESCE(tut.data_type, 'unknown') + 51 | CASE WHEN tut.data_type IN('varchar','nvarchar') THEN '(' + CAST(tut.CHARACTER_MAXIMUM_LENGTH AS VARCHAR(10)) + ')' 52 | WHEN tut.data_type IN('char','nchar') THEN '(' + CAST(tut.CHARACTER_MAXIMUM_LENGTH AS VARCHAR(10)) + ')' 53 | WHEN tut.data_type ='date' THEN '(' + CAST(tut.DATETIME_PRECISION AS VARCHAR(10)) + ')' 54 | WHEN tut.data_type ='datetime' THEN '(' + CAST(tut.DATETIME_PRECISION AS VARCHAR(10)) + ')' 55 | WHEN tut.data_type in('bigint','int','smallint', 'tinyint') THEN '(' + CAST(tut.NUMERIC_PRECISION AS VARCHAR(10)) + ')' 56 | WHEN tut.data_type = 'uniqueidentifier' THEN '(16)' 57 | WHEN tut.data_type = 'money' THEN '(' + CAST(tut.NUMERIC_PRECISION AS VARCHAR(10)) + ')' 58 | WHEN tut.data_type = 'decimal' THEN '(' + CAST(tut.NUMERIC_PRECISION AS VARCHAR(10)) + ',' + CAST(tut.NUMERIC_SCALE AS VARCHAR(10)) + ')' 59 | WHEN tut.data_type = 'numeric' THEN '(' + CAST(tut.NUMERIC_PRECISION AS VARCHAR(10)) + ',' + CAST(tut.NUMERIC_SCALE AS VARCHAR(10)) + ')' 60 | WHEN tut.data_type = 'varbinary' THEN '(' + CAST(tut.CHARACTER_MAXIMUM_LENGTH AS VARCHAR(10)) + ')' 61 | WHEN tut.data_type = 'xml' THEN '(' + CAST(tut.CHARACTER_MAXIMUM_LENGTH AS VARCHAR(10)) + ')' 62 | WHEN tut.data_type IN('char','nchar') THEN '(' + CAST(tut.CHARACTER_MAXIMUM_LENGTH AS VARCHAR(10)) + ')' 63 | WHEN tut.CHARACTER_MAXIMUM_LENGTH IS NOT NULL THEN '(' + CAST(tut.CHARACTER_MAXIMUM_LENGTH AS VARCHAR(10)) + ')' 64 | WHEN tut.DATETIME_PRECISION IS NOT NULL THEN '(' + CAST(tut.DATETIME_PRECISION AS VARCHAR(10)) + ')' 65 | WHEN tut.NUMERIC_PRECISION IS NOT NULL 66 | AND tut.NUMERIC_SCALE IS NULL THEN '(' + CAST(tut.NUMERIC_PRECISION AS VARCHAR(10)) + ')' 67 | WHEN tut.NUMERIC_PRECISION IS NOT NULL 68 | AND tut.NUMERIC_SCALE IS NOT NULL THEN '(' + CAST(tut.NUMERIC_PRECISION AS VARCHAR(10)) + ',' + CAST(tut.NUMERIC_SCALE AS VARCHAR(10)) + ')' 
69 | ELSE '' 70 | END AS data_typ 71 | , CASE WHEN tut.IS_NULLABLE = 'YES' THEN 'NULL' ELSE 'NOT NULL' END AS nullable 72 | FROM INFORMATION_SCHEMA.COLUMNS tut 73 | INNER JOIN baseTbl bt ON bt.SchemaName = tut.TABLE_CATALOG AND bt.table_name = tut.table_name 74 | ) 75 | 76 | , descr 77 | AS ( 78 | SELECT 79 | bt.SchemaName AS schema_nm 80 | , bt.table_name AS table_nm 81 | , tut.column_name AS column_nm 82 | , STRING_AGG(CAST(de.value AS VARCHAR(1024)), '. ') WITHIN GROUP (ORDER BY de.value) AS description 83 | FROM INFORMATION_SCHEMA.COLUMNS tut 84 | INNER JOIN baseTbl bt ON bt.SchemaName = tut.TABLE_CATALOG AND bt.table_name = tut.table_name 85 | LEFT JOIN sys.extended_properties de ON de.major_id = OBJECT_ID(bt.table_schema + '.' + bt.table_name) 86 | AND de.minor_id = tut.ORDINAL_POSITION 87 | AND de.name = 'MS_Description' 88 | GROUP BY bt.SchemaName, bt.table_name, tut.column_name 89 | ) 90 | 91 | 92 | , metadata_keys 93 | AS ( 94 | SELECT schema_nm, table_nm, column_nm 95 | , STRING_AGG(key_typ, ',') WITHIN GROUP (ORDER BY key_typ) AS is_key 96 | FROM ( 97 | SELECT 98 | cons.TABLE_CATALOG AS schema_nm 99 | , cons.TABLE_NAME AS table_nm 100 | , kcu.COLUMN_NAME AS column_nm 101 | , CASE WHEN cons.CONSTRAINT_TYPE = 'PRIMARY KEY' THEN 'PK' 102 | WHEN cons.CONSTRAINT_TYPE = 'UNIQUE' THEN 'UK' 103 | WHEN cons.CONSTRAINT_TYPE = 'FOREIGN KEY' THEN 'FK' 104 | ELSE 'X' 105 | END AS key_typ 106 | FROM INFORMATION_SCHEMA.TABLE_CONSTRAINTS cons 107 | INNER JOIN INFORMATION_SCHEMA.KEY_COLUMN_USAGE kcu 108 | ON cons.TABLE_CATALOG = kcu.TABLE_CATALOG 109 | AND cons.TABLE_NAME = kcu.TABLE_NAME 110 | AND cons.CONSTRAINT_NAME = kcu.CONSTRAINT_NAME 111 | WHERE cons.TABLE_CATALOG = (SELECT v_SchemaName FROM vars) 112 | AND cons.table_name IN(SELECT DISTINCT table_name FROM baseTbl) 113 | AND cons.constraint_type IN('PRIMARY KEY','FOREIGN KEY','UNIQUE') 114 | GROUP BY cons.TABLE_CATALOG, cons.TABLE_NAME, kcu.COLUMN_NAME, cons.CONSTRAINT_TYPE 115 | ) AS t 116 | GROUP BY schema_nm, 
table_nm, column_nm 117 | ) 118 | 119 | 120 | SELECT md.schema_nm, md.table_nm, md.obj_typ, md.ord_pos 121 | , COALESCE(mk.is_key, ' ') AS keys 122 | , md.column_nm, md.data_typ, md.nullable 123 | , de.[description] 124 | FROM metadata md 125 | LEFT JOIN descr de ON de.schema_nm = md.schema_nm AND de.table_nm = md.table_nm AND de.column_nm = md.column_nm 126 | LEFT JOIN metadata_keys mk ON mk.schema_nm = md.schema_nm AND mk.table_nm = md.table_nm AND mk.column_nm = md.column_nm 127 | ORDER BY schema_nm, table_nm, ord_pos 128 | 129 | -------------------------------------------------------------------------------- /mssql/data_validation/demo_data/demo_hr_01_create_database.sql: -------------------------------------------------------------------------------- 1 | USE [master] 2 | GO 3 | 4 | CREATE DATABASE [Demo_HR] 5 | CONTAINMENT = NONE 6 | ON PRIMARY 7 | ( NAME = N'Demo_HR', FILENAME = N'C:\Program Files\Microsoft SQL Server\MSSQL15.SQLEXPRESS\MSSQL\DATA\Demo_HR.mdf' , SIZE = 8192KB , MAXSIZE = UNLIMITED, FILEGROWTH = 65536KB ) 8 | LOG ON 9 | ( NAME = N'Demo_HR_log', FILENAME = N'C:\Program Files\Microsoft SQL Server\MSSQL15.SQLEXPRESS\MSSQL\DATA\Demo_HR_log.ldf' , SIZE = 8192KB , MAXSIZE = 2048GB , FILEGROWTH = 65536KB ) 10 | WITH CATALOG_COLLATION = DATABASE_DEFAULT 11 | GO 12 | 13 | /* 14 | ALTER DATABASE [Demo_HR] SET ANSI_NULL_DEFAULT OFF 15 | GO 16 | ALTER DATABASE [Demo_HR] SET ANSI_NULLS OFF 17 | GO 18 | ALTER DATABASE [Demo_HR] SET ANSI_PADDING OFF 19 | GO 20 | ALTER DATABASE [Demo_HR] SET ANSI_WARNINGS OFF 21 | GO 22 | ALTER DATABASE [Demo_HR] SET ARITHABORT OFF 23 | GO 24 | ALTER DATABASE [Demo_HR] SET AUTO_CLOSE OFF 25 | GO 26 | ALTER DATABASE [Demo_HR] SET AUTO_SHRINK OFF 27 | GO 28 | ALTER DATABASE [Demo_HR] SET AUTO_UPDATE_STATISTICS ON 29 | GO 30 | ALTER DATABASE [Demo_HR] SET CURSOR_CLOSE_ON_COMMIT OFF 31 | GO 32 | ALTER DATABASE [Demo_HR] SET CURSOR_DEFAULT GLOBAL 33 | GO 34 | ALTER DATABASE [Demo_HR] SET CONCAT_NULL_YIELDS_NULL OFF 35 | GO 36 | 
ALTER DATABASE [Demo_HR] SET NUMERIC_ROUNDABORT OFF 37 | GO 38 | ALTER DATABASE [Demo_HR] SET QUOTED_IDENTIFIER OFF 39 | GO 40 | ALTER DATABASE [Demo_HR] SET RECURSIVE_TRIGGERS OFF 41 | GO 42 | ALTER DATABASE [Demo_HR] SET DISABLE_BROKER 43 | GO 44 | ALTER DATABASE [Demo_HR] SET AUTO_UPDATE_STATISTICS_ASYNC OFF 45 | GO 46 | ALTER DATABASE [Demo_HR] SET DATE_CORRELATION_OPTIMIZATION OFF 47 | GO 48 | ALTER DATABASE [Demo_HR] SET TRUSTWORTHY OFF 49 | GO 50 | ALTER DATABASE [Demo_HR] SET ALLOW_SNAPSHOT_ISOLATION OFF 51 | GO 52 | ALTER DATABASE [Demo_HR] SET PARAMETERIZATION SIMPLE 53 | GO 54 | ALTER DATABASE [Demo_HR] SET READ_COMMITTED_SNAPSHOT OFF 55 | GO 56 | ALTER DATABASE [Demo_HR] SET HONOR_BROKER_PRIORITY OFF 57 | GO 58 | ALTER DATABASE [Demo_HR] SET RECOVERY SIMPLE 59 | GO 60 | ALTER DATABASE [Demo_HR] SET MULTI_USER 61 | GO 62 | ALTER DATABASE [Demo_HR] SET PAGE_VERIFY CHECKSUM 63 | GO 64 | ALTER DATABASE [Demo_HR] SET DB_CHAINING OFF 65 | GO 66 | ALTER DATABASE [Demo_HR] SET FILESTREAM( NON_TRANSACTED_ACCESS = OFF ) 67 | GO 68 | ALTER DATABASE [Demo_HR] SET TARGET_RECOVERY_TIME = 60 SECONDS 69 | GO 70 | ALTER DATABASE [Demo_HR] SET DELAYED_DURABILITY = DISABLED 71 | GO 72 | ALTER DATABASE [Demo_HR] SET ACCELERATED_DATABASE_RECOVERY = OFF 73 | GO 74 | ALTER DATABASE [Demo_HR] SET QUERY_STORE = OFF 75 | GO 76 | ALTER DATABASE [Demo_HR] SET READ_WRITE 77 | GO 78 | */ 79 | -------------------------------------------------------------------------------- /mssql/data_validation/demo_data/demo_hr_02_create_tables.sql: -------------------------------------------------------------------------------- 1 | 2 | -- Demo_HR..JOBS definition 3 | CREATE TABLE Demo_HR..JOBS 4 | (JOB_ID VARCHAR(10), 5 | JOB_TITLE VARCHAR(35) NOT NULL, 6 | MIN_SALARY NUMERIC(6,0), 7 | MAX_SALARY NUMERIC(6,0), 8 | CONSTRAINT JOB_ID_PK PRIMARY KEY (JOB_ID) 9 | ); 10 | 11 | 12 | 13 | -- Demo_HR..JOBS_SNAPSHOT definition 14 | CREATE TABLE Demo_HR..JOBS_SNAPSHOT 15 | (JOB_ID VARCHAR(10), 16 | 
JOB_TITLE VARCHAR(35) NOT NULL, 17 | MIN_SALARY NUMERIC(6,0), 18 | MAX_SALARY NUMERIC(6,0) 19 | ); 20 | 21 | 22 | 23 | -- Demo_HR..REGIONS definition 24 | CREATE TABLE Demo_HR..REGIONS 25 | (REGION_ID NUMERIC CONSTRAINT REGION_ID_NN NOT NULL, 26 | REGION_NAME VARCHAR(25), 27 | CONSTRAINT REG_ID_PK PRIMARY KEY (REGION_ID) 28 | ); 29 | 30 | 31 | 32 | -- Demo_HR..COUNTRIES definition 33 | CREATE TABLE Demo_HR..COUNTRIES 34 | (COUNTRY_ID CHAR(2) CONSTRAINT COUNTRY_ID_NN NOT NULL, 35 | COUNTRY_NAME VARCHAR(40), 36 | REGION_ID NUMERIC, 37 | DATE_LAST_UPDATED DATETIME DEFAULT GETDATE(), 38 | CONSTRAINT COUNTRY_C_ID_PK PRIMARY KEY (COUNTRY_ID), 39 | CONSTRAINT COUNTR_REG_FK FOREIGN KEY (REGION_ID) REFERENCES Demo_HR..REGIONS (REGION_ID) 40 | ); 41 | 42 | 43 | 44 | 45 | -- Demo_HR..LOCATIONS definition 46 | CREATE TABLE Demo_HR..LOCATIONS 47 | (LOCATION_ID NUMERIC(4,0), 48 | STREET_ADDRESS VARCHAR(40), 49 | POSTAL_CODE VARCHAR(12), 50 | CITY VARCHAR(30) CONSTRAINT LOC_CITY_NN NOT NULL, 51 | STATE_PROVINCE VARCHAR(25), 52 | COUNTRY_ID CHAR(2), 53 | CONSTRAINT LOC_ID_PK PRIMARY KEY (LOCATION_ID), 54 | CONSTRAINT LOC_C_ID_FK FOREIGN KEY (COUNTRY_ID) REFERENCES Demo_HR..COUNTRIES (COUNTRY_ID) 55 | ); 56 | 57 | CREATE INDEX LOC_CITY_IX ON Demo_HR..LOCATIONS (CITY); 58 | CREATE INDEX LOC_STATE_PROVINCE_IX ON Demo_HR..LOCATIONS (STATE_PROVINCE); 59 | CREATE INDEX LOC_COUNTRY_IX ON Demo_HR..LOCATIONS (COUNTRY_ID); 60 | 61 | 62 | 63 | -- Demo_HR..DEPARTMENTS definition 64 | CREATE TABLE Demo_HR..DEPARTMENTS 65 | (DEPARTMENT_ID NUMERIC(4,0), 66 | DEPARTMENT_NAME VARCHAR(30) CONSTRAINT DEPT_NAME_NN NOT NULL, 67 | MANAGER_ID NUMERIC(6,0), 68 | LOCATION_ID NUMERIC(4,0), 69 | URL VARCHAR(255), 70 | CONSTRAINT DEPT_ID_PK PRIMARY KEY (DEPARTMENT_ID), 71 | CONSTRAINT DEPT_LOC_FK FOREIGN KEY (LOCATION_ID) REFERENCES Demo_HR..LOCATIONS (LOCATION_ID) 72 | ) ; 73 | 74 | CREATE INDEX DEPT_LOCATION_IX ON Demo_HR..DEPARTMENTS (LOCATION_ID); 75 | 76 | 77 | 78 | -- Demo_HR..JOB_HISTORY definition 79 
| CREATE TABLE Demo_HR..JOB_HISTORY 80 | (EMPLOYEE_ID NUMERIC(6,0) CONSTRAINT JHIST_EMPLOYEE_NN NOT NULL, 81 | START_DATE DATETIME CONSTRAINT JHIST_START_DATE_NN NOT NULL, 82 | END_DATE DATETIME CONSTRAINT JHIST_END_DATE_NN NOT NULL, 83 | JOB_ID VARCHAR(10) CONSTRAINT JHIST_JOB_NN NOT NULL, 84 | DEPARTMENT_ID NUMERIC(4,0), 85 | CONSTRAINT JHIST_DATE_INTERVAL CHECK (end_date > start_date), 86 | CONSTRAINT JHIST_EMP_ID_ST_DATE_PK PRIMARY KEY (EMPLOYEE_ID, START_DATE), 87 | CONSTRAINT JHIST_JOB_FK FOREIGN KEY (JOB_ID) REFERENCES Demo_HR..JOBS (JOB_ID), 88 | CONSTRAINT JHIST_DEPT_FK FOREIGN KEY (DEPARTMENT_ID) REFERENCES Demo_HR..DEPARTMENTS (DEPARTMENT_ID) 89 | ) ; 90 | 91 | CREATE INDEX JHIST_JOB_IX ON Demo_HR..JOB_HISTORY (JOB_ID); 92 | CREATE INDEX JHIST_EMPLOYEE_IX ON Demo_HR..JOB_HISTORY (EMPLOYEE_ID); 93 | CREATE INDEX JHIST_DEPARTMENT_IX ON Demo_HR..JOB_HISTORY (DEPARTMENT_ID); 94 | 95 | 96 | 97 | 98 | -- Demo_HR..EMPLOYEES definition 99 | CREATE TABLE Demo_HR..EMPLOYEES 100 | (EMPLOYEE_ID NUMERIC(6,0), 101 | FIRST_NAME VARCHAR(20), 102 | LAST_NAME VARCHAR(25) CONSTRAINT EMP_LAST_NAME_NN NOT NULL, 103 | EMAIL VARCHAR(25) CONSTRAINT EMP_EMAIL_NN NOT NULL, 104 | PHONE_NUMBER VARCHAR(20), 105 | HIRE_DATE DATETIME CONSTRAINT EMP_HIRE_DATE_NN NOT NULL, 106 | JOB_ID VARCHAR(10) CONSTRAINT EMP_JOB_NN NOT NULL, 107 | SALARY NUMERIC(8,2), 108 | COMMISSION_PCT NUMERIC(2,2), 109 | MANAGER_ID NUMERIC(6,0), 110 | DEPARTMENT_ID NUMERIC(4,0), 111 | SOME_DATE_FMT1 VARCHAR(50), 112 | SOME_DATE_FMT2 VARCHAR(50), 113 | SOME_DATE_FMT3 VARCHAR(50), 114 | SOME_DATE_FMT4 VARCHAR(50), 115 | FAKE_SSN VARCHAR(11), 116 | ZIP5 VARCHAR(5), 117 | ZIP5OR9 VARCHAR(10), 118 | ZIP9 VARCHAR(10), 119 | EMAIL_ADDRESS VARCHAR(100), 120 | CONSTRAINT EMP_SALARY_MIN CHECK (salary > 0), 121 | CONSTRAINT EMP_EMAIL_UK UNIQUE (EMAIL), 122 | CONSTRAINT EMP_EMP_ID_PK PRIMARY KEY (EMPLOYEE_ID), 123 | CONSTRAINT EMP_DEPT_FK FOREIGN KEY (DEPARTMENT_ID) REFERENCES Demo_HR..DEPARTMENTS (DEPARTMENT_ID), 124 | 
CONSTRAINT EMP_JOB_FK FOREIGN KEY (JOB_ID) REFERENCES Demo_HR..JOBS (JOB_ID), 125 | CONSTRAINT EMP_MANAGER_FK FOREIGN KEY (MANAGER_ID) REFERENCES Demo_HR..EMPLOYEES (EMPLOYEE_ID) 126 | ) ; 127 | 128 | CREATE INDEX EMP_DEPARTMENT_IX ON Demo_HR..EMPLOYEES (DEPARTMENT_ID); 129 | CREATE INDEX EMP_JOB_IX ON Demo_HR..EMPLOYEES (JOB_ID); 130 | CREATE INDEX EMP_MANAGER_IX ON Demo_HR..EMPLOYEES (MANAGER_ID); 131 | CREATE INDEX EMP_NAME_IX ON Demo_HR..EMPLOYEES (LAST_NAME, FIRST_NAME); 132 | 133 | 134 | 135 | 136 | 137 | 138 | -------------------------------------------------------------------------------- /mssql/data_validation/img/01_data_val_mssql_example_test_case_sql_code.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/DataResearchLabs/sql_scripts/3eaa17caf248d3b58763534d62ef4d9ac5729a46/mssql/data_validation/img/01_data_val_mssql_example_test_case_sql_code.png -------------------------------------------------------------------------------- /mssql/data_validation/img/02_data_val_mssql_run_results1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/DataResearchLabs/sql_scripts/3eaa17caf248d3b58763534d62ef4d9ac5729a46/mssql/data_validation/img/02_data_val_mssql_run_results1.png -------------------------------------------------------------------------------- /mssql/data_validation/img/03_data_val_mssql_adv_test_case_ex.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/DataResearchLabs/sql_scripts/3eaa17caf248d3b58763534d62ef4d9ac5729a46/mssql/data_validation/img/03_data_val_mssql_adv_test_case_ex.png -------------------------------------------------------------------------------- /mssql/data_validation/img/04_data_val_mssql_run_results_adv.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/DataResearchLabs/sql_scripts/3eaa17caf248d3b58763534d62ef4d9ac5729a46/mssql/data_validation/img/04_data_val_mssql_run_results_adv.png -------------------------------------------------------------------------------- /mssql/data_validation/rule_set_01__row_counts.md: -------------------------------------------------------------------------------- 1 | ### Data Validation Examples - MS SQL Server 2 | [![License: CC0](https://img.shields.io/badge/License-CC0-red)](LICENSE "Creative Commons Zero License by DataResearchLabs (effectively = Public Domain") 3 | [![YouTube](https://img.shields.io/badge/YouTube-DataResearchLabs-brightgreen)](http://www.DataResearchLabs.com) 4 | #### [Return to Data Validation Home Page](https://github.com/DataResearchLabs/sql_scripts/blob/main/data_validation_scripts.md) 5 | 6 | 7 | # Rule Set #1 - Row Counts 8 | 9 | ## Table of Contents 10 | - T001 - "Full" Row Count 11 | - T002 - "Partial" Row Count 12 | - T003 - "Relative" Row Count 13 | - T004 - "Recent" Row Count 14 |
15 | 16 | 17 | 18 | ### T001 - "Full" Row Count 19 | Verify full row count for a table or view. For example, to verify that table countries has exactly 25 rows: 20 | ```sql 21 | SELECT CASE WHEN COUNT(*) <> 25 THEN 'FAIL' ELSE 'P' END AS status 22 | FROM demo_hr..countries; 23 | ``` 24 |
25 | 26 | 27 | 28 | ### T002 - "Partial" Row Count 29 | Verify partial row count for a subset of a table or view. For example, to verify that table countries has exactly 8 rows where region_id = 1 (Europe): 30 | ```sql 31 | SELECT CASE WHEN COUNT(*) <> 8 THEN 'FAIL' ELSE 'P' END AS status 32 | FROM demo_hr..countries 33 | WHERE region_id = 1; 34 | ``` 35 |
36 | 37 | 38 | 39 | ### T003 - "Relative" Row Count 40 | Verify relative row counts between tables or views. The example below verifies that table countries has at least 5 times the number of rows as table regions. The inner query returns one row with two columns: (a) count of countries, and (b) count of regions. The outer query can then apply the business logic of comparing the counts and returning a Pass or Fail. 41 | ```sql 42 | SELECT CASE WHEN countries_count < 5 * regions_count THEN 'FAIL' ELSE 'P' END AS status 43 | FROM ( 44 | SELECT (SELECT COUNT(*) AS row_count FROM demo_hr..countries) AS countries_count 45 | , (SELECT COUNT(*) AS row_count FROM demo_hr..regions) AS regions_count 46 | ) t; 47 | ``` 48 |
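A variant worth considering (the bounds below are illustrative, not part of the original script): if the ratio between the two tables should stay within a band rather than just above a floor, bound it on both sides so the test also fails when the child table grows disproportionately large.

```sql
-- Hypothetical two-sided version of T003: countries row count must fall
-- between 5x and 10x the regions row count (both multipliers are examples).
SELECT CASE WHEN countries_count NOT BETWEEN 5 * regions_count AND 10 * regions_count
            THEN 'FAIL' ELSE 'P'
       END AS status
FROM (
  SELECT (SELECT COUNT(*) FROM demo_hr..countries) AS countries_count
       , (SELECT COUNT(*) FROM demo_hr..regions)   AS regions_count
) t;
```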
49 | 50 | 51 | 52 | ### T004 - "Recent" Row Count 53 | Verify recent row counts. The example below verifies that table countries has had at least 5 rows updated in the past 10 days based on the date stamp in field date_last_updated. The inner query uses GETDATE() (SQL Server current date time stamp) and subtracts 10 days. The inner query counts how many rows in table countries have a date_last_updated within the past 10 days. The outer query can then confirm that at least 5 rows are recent for a Pass...if not then it returns a Fail. 54 | ```sql 55 | SELECT CASE WHEN row_count < 5 THEN 'FAIL' ELSE 'P' END AS status 56 | FROM ( 57 | SELECT COUNT(*) AS row_count 58 | FROM demo_hr..countries 59 | WHERE date_last_updated >= GETDATE() - 10 60 | ) t; 61 | ``` 62 | 63 | 64 | 65 | -------------------------------------------------------------------------------- /mssql/data_validation/rule_set_02__keys.md: -------------------------------------------------------------------------------- 1 | ### Data Validation Examples - MS SQL Server 2 | [![License: CC0](https://img.shields.io/badge/License-CC0-red)](LICENSE "Creative Commons Zero License by DataResearchLabs (effectively = Public Domain)") 3 | [![YouTube](https://img.shields.io/badge/YouTube-DataResearchLabs-brightgreen)](http://www.DataResearchLabs.com) 4 | #### [Return to Data Validation Home Page](https://github.com/DataResearchLabs/sql_scripts/blob/main/data_validation_scripts.md) 5 | 6 | 7 | # Rule Set #2 - Keys (Foreign & Unique) 8 | 9 | ## Table of Contents 10 | - T005 - Unique (Native) Key Has No Duplicates 11 | - T006 - Foreign Key Children Have Orphans 12 | - T007 - Foreign Key Parent is Childless 13 |
14 | 15 | 16 | 17 | ### T005 - Unique Key Has No Duplicates 18 | Sure, good database design implies that unique keys be enforced by a constraint so that you do not need to test for it. However, there are times where a decision is made to **not** add a constraint to enforce the unique key (e.g., the table is replicated from a source that already enforces the constraint, so it was skipped for performance). This does happen! At work just last week, two of my unique key regression tests started failing -- without these checks in place as a wide net, the downstream defects would have been around a lot longer. 19 | In the example below, the inner query does a group by on the unique key fields, then using a HAVING clause filters down to those key-values with a count of more than 1 -- the dups. The outer query returns a fail if any rows come back with dups (match_count >= 2), or a pass if no dups found. 20 | ```sql 21 | SELECT CASE WHEN COUNT(*) > 0 THEN 'FAIL' ELSE 'P' END AS status 22 | FROM ( 23 | SELECT country_name -- UKey fields separated by comma 24 | , COUNT(*) AS match_count 25 | FROM demo_hr..countries -- UKey fields separated by comma 26 | GROUP BY country_name 27 | HAVING COUNT(*) > 1 28 | ) t; 29 | ``` 30 |
31 | 32 | 33 | 34 | ### T006 - Foreign Key Children Have Orphans 35 | Sure, as with T005 UKeys above, good database design implies that foreign keys be enforced by a constraint so that you do not need to test for it. However, there are times where for whatever reason the constraints do not exist. In those instances, you will want to periodically run a data validation test to ensure that this core assumption is not being violated (of course adding a foreign key constraint would be best, but if that is not an option then periodically check). 36 | In the example below, the inner query pulls from the child table countries as the anchor, then left joins out to the parent table regions on the key field region_id. If region_id does not exist in the parent table (p.region_id IS NULL), then the child region_id is an orphan. The outer query checks the count() of orphaned child rows: if it is >= 1 then the test fails, but if the count() = 0 then it passes. 37 | ```sql 38 | SELECT CASE WHEN COUNT(*) > 0 THEN 'FAIL' ELSE 'P' END AS status 39 | FROM ( 40 | SELECT DISTINCT c.region_id AS child_id, p.region_id AS parent_id 41 | FROM demo_hr..countries c 42 | LEFT JOIN demo_hr..regions p ON p.region_id = c.region_id 43 | WHERE p.region_id IS NULL 44 | ) t; 45 | ``` 46 |
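When T006 fails, a quick way to troubleshoot is to run a lightly trimmed copy of its inner query on its own, which lists the orphaned key values directly (a sketch, using the same tables as the example above):

```sql
-- List the child region_id values that have no matching parent row.
SELECT DISTINCT c.region_id AS orphaned_child_id
FROM demo_hr..countries c
LEFT JOIN demo_hr..regions p ON p.region_id = c.region_id
WHERE p.region_id IS NULL;
```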
47 | 48 | 49 | 50 | ### T007 - Foreign Key Parent Is Childless 51 | I'm not sure this particular test is all that useful because often it is okay for the parent-side of a foreign key relationship to not have children. But, if for some reason you need to be sure there is data present on both sides (parent **and** child), then this test is for you. You will notice in the example below that the query is very similar to T006 above, but the parent and child tables have switched positions in the FROM and LEFT JOIN lines. This is because we want to first pull all parent rows, then left join to find missing (FKey field IS NULL) child rows. 52 | ```sql 53 | SELECT CASE WHEN COUNT(*) > 0 THEN 'FAIL' ELSE 'P' END AS status 54 | FROM ( 55 | SELECT DISTINCT c.country_id AS child_id, p.country_id AS parent_id 56 | FROM demo_hr..countries p 57 | LEFT JOIN demo_hr..locations c ON p.country_id = c.country_id 58 | WHERE c.country_id IS NULL 59 | ) t; 60 | ``` 61 | -------------------------------------------------------------------------------- /mssql/data_validation/rule_set_03__heuristic_thresholds.md: -------------------------------------------------------------------------------- 1 | ### Data Validation Examples - MS SQL Server 2 | [![License: CC0](https://img.shields.io/badge/License-CC0-red)](LICENSE "Creative Commons Zero License by DataResearchLabs (effectively = Public Domain)") 3 | [![YouTube](https://img.shields.io/badge/YouTube-DataResearchLabs-brightgreen)](http://www.DataResearchLabs.com) 4 | #### [Return to Data Validation Home Page](https://github.com/DataResearchLabs/sql_scripts/blob/main/data_validation_scripts.md) 5 | 6 | # Rule Set #3 - Heuristic Thresholds (Column Null Rates & Value Rates) 7 | 8 | 9 | ## Table of Contents 10 | - T008 - Null Rate Thresholds 11 | - T009 - Value Frequency Thresholds 12 | 13 | Note: Although the terms "rate" and "frequency" are mostly synonymous, nulls are usually referred to as "rates" and values as "frequencies".
I don't know why, but I'm following the convention I've heard over and over. 14 |
15 | 16 | 17 | 18 | ### T008 - Null Rate Thresholds 19 | There is a lot going on in this "Null Rate Threshold" data validation query below. The basic goal is to validate that a given table's columns do not have too many, or too few NULLs. 20 |
21 | In the example below...
22 | ...we are checking columns department_name, manager_id, and url in the table departments. We want to ensure that column department_name has no NULLs, and that column manager_id is NULL less than 65% of the time and column url is NULL less than 80% of the time. 23 | 24 | This check is like the proverbial miner's canary in that it is a trip wire triggered when something goes awry in a data feed. I've used this test scenario to great effect when coupled with a create-date or last-updated-date to monitor the past week's data loads for any unexpected upticks in null rates. 25 | 26 | There is a downside to this test scenario, however: it can fire false alarms, and you find yourself tinkering with the threshold values (0.0000, 0.6500, and 0.8000 cutoffs below), raising and lowering them over and over. If this happens, chances are the test failures are neither actionable nor important, and you should not waste your time applying this test scenario to that given table and field. Be careful to only pick fields that truly matter. 27 | 28 | Below, there is an upper CTE (common table expression) named "dtls" at the WITH clause, and a lower wrapper that applies the business logic (if any null rate rejections were found, fail the case). Inside the dtls CTE, there is an inner query at the bottom (at the FROM clause) doing a single table scan to calculate a null rate per column by counting nulls in each column and dividing by the total table row count. The SELECT CASE logic at the top applies the business logic; comparing the actual calculated null rates (nr_dept_nm, nr_mgr_id, and nr_url) against the expected threshold rates (hard-coded as 0.0000, 0.6500, and 0.8000). The returned value is a rejection code (REJ-01, REJ-02, etc.) clearly indicating which field failed the null rate check, what the actual null rate was, and what the expected null rate threshold was. If no rejections are triggered, then status returns a "P" for pass. 29 |
30 | 31 | ```sql 32 | WITH dtls AS ( 33 | SELECT CASE WHEN nr_dept_nm > 0.0000 THEN 'REJ-01: Null rate too high at department_name. Exp=0.0000 / Act=' + CAST(nr_dept_nm AS VARCHAR(8)) 34 | WHEN nr_mgr_id > 0.6500 THEN 'REJ-02: Null rate too high at manager_id. Exp<=0.6500 / Act=' + CAST(nr_mgr_id AS VARCHAR(8)) 35 | WHEN nr_url > 0.8000 THEN 'REJ-03: Null rate too high at url. Exp<=0.8000 / Act=' + CAST(nr_url AS VARCHAR(8)) 36 | ELSE 'P' 37 | END AS status 38 | FROM ( 39 | SELECT CAST(SUM(CASE WHEN department_name IS NULL THEN 1 ELSE 0 END) AS FLOAT) / CAST(COUNT(*) AS FLOAT) AS nr_dept_nm 40 | , CAST(SUM(CASE WHEN manager_id IS NULL THEN 1 ELSE 0 END) AS FLOAT) / CAST(COUNT(*) AS FLOAT) AS nr_mgr_id 41 | , CAST(SUM(CASE WHEN url IS NULL THEN 1 ELSE 0 END) AS FLOAT) / CAST(COUNT(*) AS FLOAT) AS nr_url 42 | FROM demo_hr..departments 43 | ) t 44 | ) 45 | 46 | SELECT CASE WHEN COUNT(*) > 0 THEN 'FAIL' ELSE 'P' END AS status 47 | FROM dtls 48 | WHERE status <> 'P'; 49 | ``` 50 |
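To see the raw numbers behind a T008 failure, the inner query above can be run by itself; it returns the actual null rate per column before any thresholds are applied:

```sql
-- One table scan; each column's null count divided by the total row count.
SELECT CAST(SUM(CASE WHEN department_name IS NULL THEN 1 ELSE 0 END) AS FLOAT) / CAST(COUNT(*) AS FLOAT) AS nr_dept_nm
     , CAST(SUM(CASE WHEN manager_id IS NULL THEN 1 ELSE 0 END) AS FLOAT) / CAST(COUNT(*) AS FLOAT) AS nr_mgr_id
     , CAST(SUM(CASE WHEN url IS NULL THEN 1 ELSE 0 END) AS FLOAT) / CAST(COUNT(*) AS FLOAT) AS nr_url
FROM demo_hr..departments;
```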
51 | 52 | 53 | 54 | ### T009 - Value Frequency Thresholds 55 | "Value Frequency Threshold" tests are fairly similar to null rates above (T008). The difference is that we are checking the frequency (or rate) at which a column's values occur. 56 |
57 | In the example below...
58 | ...we are checking the frequencies with which the values 1, 2, 3, and 4 occur in field region_id of table countries. There is an upper CTE (common table expression) named "dtls" at the WITH clause, and a lower wrapper that applies the business logic (if any value frequency rejections were found, fail the case). Inside the dtls CTE, there is an inner query at the bottom (at the FROM clause) doing a single table scan to calculate a frequency for each value in the column's GROUP BY. The GROUP BY value count (field "freq") is then divided by the total table row count (field "den") to calculate field "freq_rt". The SELECT CASE logic at the top applies the business logic; comparing the actual value frequencies (freq_rt when region_id = 1, or =2, etc.) against the expected threshold frequencies (hard-coded as 0.28 to 0.36, 0.16 to 0.24, and so on). The returned value is a rejection code (REJ-01, REJ-02, etc.) clearly indicating which field failed the value frequency check, what the actual value frequency was, and what the expected value frequency threshold ranges were. If no rejections are triggered, then status returns a "P" for pass. 59 |
60 | 61 | ```sql 62 | WITH dtls AS ( 63 | SELECT region_id, freq_rt 64 | , CASE WHEN region_id = 1 AND freq_rt NOT BETWEEN 0.28 AND 0.36 THEN 'REJ-01: Frequency occurrence of region_id=1 is outside threshold|exp=0.28 thru 0.36|act=' + CAST(freq_rt AS VARCHAR(8)) 65 | WHEN region_id = 2 AND freq_rt NOT BETWEEN 0.16 AND 0.24 THEN 'REJ-02: Frequency occurrence of region_id=2 is outside threshold|exp=0.16 thru 0.24|act=' + CAST(freq_rt AS VARCHAR(8)) 66 | WHEN region_id = 3 AND freq_rt NOT BETWEEN 0.20 AND 0.28 THEN 'REJ-03: Frequency occurrence of region_id=3 is outside threshold|exp=0.20 thru 0.28|act=' + CAST(freq_rt AS VARCHAR(8)) 67 | WHEN region_id = 4 AND freq_rt NOT BETWEEN 0.20 AND 0.28 THEN 'REJ-04: Frequency occurrence of region_id=4 is outside threshold|exp=0.20 thru 0.28|act=' + CAST(freq_rt AS VARCHAR(8)) 68 | ELSE 'P' 69 | END AS status 70 | FROM ( 71 | SELECT region_id, CAST(freq AS FLOAT) / CAST(den AS FLOAT) AS freq_rt 72 | FROM ( 73 | SELECT region_id, COUNT(*) AS freq 74 | , (SELECT COUNT(*) FROM demo_hr..countries) AS den 75 | FROM demo_hr..countries 76 | GROUP BY region_id 77 | ) t 78 | ) t2 79 | ) 80 | 81 | SELECT CASE WHEN COUNT(*) > 0 THEN 'FAIL' ELSE 'P' END AS status 82 | FROM dtls 83 | WHERE status <> 'P'; 84 | ``` 85 | -------------------------------------------------------------------------------- /mssql/data_validation/rule_set_04__numeric_values.md: -------------------------------------------------------------------------------- 1 | ### Data Validation Examples - MS SQL Server 2 | [![License: CC0](https://img.shields.io/badge/License-CC0-red)](LICENSE "Creative Commons Zero License by DataResearchLabs (effectively = Public Domain") 3 | [![YouTube](https://img.shields.io/badge/YouTube-DataResearchLabs-brightgreen)](http://www.DataResearchLabs.com) 4 | #### [Return to Data Validation Home Page](https://github.com/DataResearchLabs/sql_scripts/blob/main/data_validation_scripts.md) 5 | 6 | # Rule Set #4 - Numeric Values 7 | 8 | ## Table of 
Contents 9 | - T010 - Not Null 10 | - T011 - Not Negative 11 | - T012 - Numeric Range 12 | - T013 - In Value List 13 | - T014 - Not In Value List 14 | - T015 - Multi Field Compare 15 |
16 | 17 | 18 | 19 | ### T010 - Not Null 20 | Verify numeric field is not null. For example, to verify that table countries has no NULLs in field region_id: 21 | ```sql 22 | SELECT CASE WHEN COUNT(*) > 0 THEN 'FAIL' ELSE 'P' END AS status 23 | FROM demo_hr..countries 24 | WHERE region_id IS NULL; 25 | ``` 26 |
27 | 28 | 29 | 30 | ### T011 - Not Negative 31 | Verify numeric field is not negative. For example, to verify that table countries has no field region_id negative values: 32 | ```sql 33 | SELECT CASE WHEN COUNT(*) > 0 THEN 'FAIL' ELSE 'P' END AS status 34 | FROM demo_hr..countries 35 | WHERE region_id < 0; 36 | ``` 37 |
38 | 39 | 40 | 41 | ### T012 - Numeric Range 42 | Verify numeric field value is within a range. In the example below, we verify that field employee_id is between 100 and 999 in table employees. Note that you can run the inner query yourself to return the actual rejection code (whether the value is too low or too high) along with the actual value and the expected value...all nicely packaged for easy troubleshooting. 43 | ```sql 44 | SELECT CASE WHEN COUNT(*) > 0 THEN 'FAIL' ELSE 'P' END AS status 45 | FROM ( 46 | SELECT employee_id 47 | , CASE WHEN employee_id < 100 THEN 'REJ-01: Verify employee_id > 99|exp>99|act=' + CAST(employee_id AS VARCHAR(10)) 48 | WHEN employee_id > 999 THEN 'REJ-02: Verify employee_id < 1000|exp<1000|act=' + CAST(employee_id AS VARCHAR(10)) 49 | ELSE 'P' 50 | END AS status 51 | FROM demo_hr..employees 52 | ) t 53 | WHERE status <> 'P'; 54 | ``` 55 |
56 | 57 | 58 | 59 | ### T013 - In Value List 60 | Verify numeric field is **in** the list of values. For example, to verify that table countries field region_id is always values 1, 2, 3, or 4 we use the IN() clause as shown below: 61 | ```sql 62 | SELECT CASE WHEN COUNT(*) > 0 THEN 'FAIL' ELSE 'P' END AS status 63 | FROM ( 64 | SELECT region_id 65 | , CASE WHEN region_id NOT IN(1,2,3,4) THEN 'FAIL' ELSE 'P' END AS status 66 | FROM demo_hr..countries 67 | ) t 68 | WHERE status <> 'P'; 69 | ``` 70 |
71 | 72 | 73 | 74 | ### T014 - Not In Value List 75 | Verify numeric field is **not** in the list of values. For example, to verify that table countries field region_id is never in values 97, 98, or 99 we use the NOT IN() clauses as shown below: 76 | ```sql 77 | SELECT CASE WHEN COUNT(*) > 0 THEN 'FAIL' ELSE 'P' END AS status 78 | FROM ( 79 | SELECT region_id 80 | , CASE WHEN region_id IN(97,98,99) THEN 'FAIL' ELSE 'P' END AS status 81 | FROM demo_hr..countries 82 | ) t 83 | WHERE status <> 'P'; 84 | ``` 85 |
86 | 87 | 88 | 89 | ### T015 - Multi Field Compare 90 | Verify numeric field values in relation to one another. For example, to verify that salary times commission_pct is always less than $10,000 in table employees: 91 | ```sql 92 | SELECT CASE WHEN COUNT(*) > 0 THEN 'FAIL' ELSE 'P' END AS status 93 | FROM ( 94 | SELECT salary, commission_pct 95 | , CASE WHEN salary * commission_pct > 10000 THEN 'FAIL' ELSE 'P' END AS status 96 | FROM demo_hr..employees 97 | ) t 98 | WHERE status <> 'P'; 99 | ``` 100 |
101 | -------------------------------------------------------------------------------- /mssql/data_validation/rule_set_05__date_values.md: -------------------------------------------------------------------------------- 1 | ### Data Validation Examples - MS SQL Server 2 | [![License: CC0](https://img.shields.io/badge/License-CC0-red)](LICENSE "Creative Commons Zero License by DataResearchLabs (effectively = Public Domain") 3 | [![YouTube](https://img.shields.io/badge/YouTube-DataResearchLabs-brightgreen)](http://www.DataResearchLabs.com) 4 | #### [Return to Data Validation Home Page](https://github.com/DataResearchLabs/sql_scripts/blob/main/data_validation_scripts.md) 5 | 6 | 7 | # Rule Set #5 - Date Values 8 | 9 | ## Table of Contents 10 | - T016 - Not Null 11 | - T017 - Date Range 12 | - T018 - No Time Part 13 | - T019 - Has Time Part 14 | - T020 - Multi Field Compare 15 | - Bonus Tip - Joining Tables with 2 Pairs of Start-End Date Overlaps 16 |
17 | 18 | 19 | 20 | ### T016 - Not Null 21 | Verify date field is not null. For example, to verify that table countries has no NULLs in field date_last_updated: 22 | ```sql 23 | SELECT date_last_updated 24 | , CASE WHEN date_last_updated IS NULL THEN 'FAIL' ELSE 'P' END AS status 25 | FROM demo_hr..countries; 26 | ``` 27 |
28 | 29 | 30 | 31 | ### T017 - Date Range 32 | Verify date field is within specified range. For example, you can run the sql below to verify that table countries field date_last_updated is between 1/1/2021 and today. Note the use of GETDATE() to represent today's date dynamically in SQL Server. Notice the query uses a CASE...WHEN...ELSE structure to identify two rejection codes: (1) date is too high, and (2) date is too low. Expected and actual values are displayed in the output for each rejected row, all nicely packaged for easy troubleshooting. 33 | ```sql 34 | SELECT date_last_updated 35 | , CASE WHEN date_last_updated > GETDATE() THEN 'REJ-01: Field date_last_updated cannot be in the future|exp<=' + CAST(GETDATE() AS VARCHAR(20)) + '|act=' + CAST(date_last_updated AS VARCHAR(20)) 36 | WHEN date_last_updated < '01/01/2021' THEN 'REJ-02: Field date_last_updated cannot be too old|exp>=1/1/2021|act=' + CAST(date_last_updated AS VARCHAR(20)) 37 | ELSE 'P' 38 | END AS status 39 | FROM demo_hr..countries; 40 | ``` 41 |
42 | 43 | 44 | 45 | ### T018 - No Time Part 46 | Verify date field is a date only, no time part present. For example, to verify that table employees has no time part in field hire_date (time part must be "00:00:00"): 47 | ```sql 48 | SELECT hire_date 49 | , CASE WHEN CONVERT(VARCHAR(8), hire_date, 108) <> '00:00:00' THEN 'FAIL' ELSE 'P' END AS status 50 | FROM demo_hr..employees; 51 | ``` 52 |
53 | 54 | 55 | 56 | ### T019 - Has Time Part 57 | Verify date field is a date **and** time. For example, to verify that table countries has a time part in field date_last_updated (time part cannot be "00:00:00"): 58 | ```sql 59 | SELECT date_last_updated 60 | , CASE WHEN CONVERT(VARCHAR(8), date_last_updated, 108) = '00:00:00' THEN 'FAIL' ELSE 'P' END AS status 61 | FROM demo_hr..countries; 62 | ``` 63 |
64 | 65 | 66 | 67 | ### T020 - Multi Field Compare 68 | Verify multiple date fields relative to each other. For example, to verify that field start_date must be < field end_date in table job_history (thus if start_date is >= end_date the test case fails): 69 | ```sql 70 | SELECT start_date, end_date 71 | , CASE WHEN start_date >= end_date THEN 'FAIL' ELSE 'P' END AS status 72 | FROM demo_hr..job_history; 73 | ``` 74 |
75 | 76 | 77 | 78 | ### Bonus Tip - Joining Tables with 2 Pairs of Start-End Date Overlaps 79 | So this is a fantastic tip I learned from a co-worker in healthcare back in 2011 (was it Jennifer C.? or Matt G.? or Jonathon P.? I can't remember). 80 | 81 | 82 | The **problem** is that you are trying to join two tables with logic where the table1.start_dt/end_dt's overlap with the table2.start_dt/end_dt. 83 | 84 | 85 | The **solution** is to **join on `table1.start_dt <= table2.end_dt AND table1.end_dt >= table2.start_dt`**. 86 | 87 | Here is why: 88 | 89 | ``` 90 | Scenario #1 = "Discard - No Overlap" Table #1 date range is completely before Table #2's (Dt#1.End is NOT > Dt#2.Start) 91 | Table#1: |Start ----------- End| 92 | Table#2: |Start ----------- End| 93 | 94 | 95 | Scenario #2 = "Include - Table #1 End = Table#2 Start" Table #1 date range ends exactly where Table #2's starts (Tbl#1.End = Tbl#2.Start AND Tbl#1.Start < Tbl#2.End) 96 | Table#1: |Start ----------- End| 97 | Table#2: |Start ----------- End| 98 | 99 | 100 | Scenario #3 = "Include - Table #1 Ends Midway Table #2 Span" Table #1 date range nicely overlaps Table #2's (Tbl#1.End > Tbl#2.Start AND Tbl#1.Start < Tbl#2.End) 101 | Table#1: |Start ----------- End| 102 | Table#2: |Start ----------- End| 103 | 104 | 105 | Scenario #4 = "Include - Table #1 Starts Midway Table #2 Span" Table #1 date range nicely overlaps Table #2's (Tbl#1.End > Tbl#2.Start AND Tbl#1.Start < Tbl#2.End) 106 | Table#1: |Start ----------- End| 107 | Table#2: |Start ----------- End| 108 | 109 | 110 | Scenario #5 = "Include - Table #1 Starts at Table #2 End" Table #1 date range start exactly at Table #2's End (Tbl#1.End > Tbl#2.St AND Tbl#1.St = Tbl#2.End) 111 | Table#1: |Start ----------- End| 112 | Table#2: |Start ----------- End| 113 | 114 | 115 | Scenario #6 = "Discard - No Overlap" Date #1 is entirely after Date #2 (Tbl#1.Start is > Tbl#2.End) 116 | Table#1: |Start ----------- End| 117 | Table#2: |Start ----------- End| 118 | 119 
| ``` 120 | 121 | 122 |
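The tip above can be sketched as a join. The table and column names below (`table1`, `table2`, `member_id`, `start_dt`, `end_dt`) are hypothetical placeholders, not objects from the demo_hr schema:

```sql
-- Keeps scenarios #2 thru #5 above (any overlap, endpoints inclusive)
-- and discards the non-overlapping scenarios #1 and #6.
-- All table/column names are illustrative.
SELECT t1.member_id
     , t1.start_dt AS t1_start, t1.end_dt AS t1_end
     , t2.start_dt AS t2_start, t2.end_dt AS t2_end
FROM table1 t1
INNER JOIN table2 t2
        ON t2.member_id = t1.member_id
       AND t1.start_dt <= t2.end_dt
       AND t1.end_dt   >= t2.start_dt;
```

If you need to exclude the single-point overlaps (scenarios #2 and #5), tighten the two comparisons to strict `<` and `>`.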
123 | -------------------------------------------------------------------------------- /mssql/data_validation/rule_set_09__defect_regression.md: -------------------------------------------------------------------------------- 1 | ### Data Validation Examples - MS SQL Server 2 | [![License: CC0](https://img.shields.io/badge/License-CC0-red)](LICENSE "Creative Commons Zero License by DataResearchLabs (effectively = Public Domain)") 3 | [![YouTube](https://img.shields.io/badge/YouTube-DataResearchLabs-brightgreen)](http://www.DataResearchLabs.com) 4 | #### [Return to Data Validation Home Page](https://github.com/DataResearchLabs/sql_scripts/blob/main/data_validation_scripts.md) 5 |
6 | 7 | # Rule Set #9 - Defect Regression 8 | 9 | ## Summary 10 | Where possible, it is a good idea to cover known defects with a test case. This enables automated regression testing for free. You can simply run this script to quickly re-check whether a defect has been re-introduced. 11 | 12 | There are a lot of caveats that will block you from covering every defect: 13 | * If anything but the simplest of setup and tear-down data is required 14 | * If processes or jobs or applications or ETL must be triggered 15 | * If there are interactions with other databases or files not on the current server 16 | 17 | That said, you can still frequently set up static trip wires to monitor the data for recurrence of old bugs. 18 | 19 | Some real-world examples from my past include: 20 | * A defect where email addresses stored in the database had TAB characters. I set up a simple CHAR(9) check that ran daily and cross-referenced the defect number in the SQL return results. 21 | * A defect where customer-specific HTML snippets for email reports had bad tags manually entered by company analysts working with the client. Obviously, fixing the app to avoid those data entries would be best, but that belonged to a different department that was going to get to it later. So, I set up a daily alert to check for those HTML tags and some other similar ones that might be accidentally introduced. 22 | * A defect where internal data entry folks on rare occasions had to key in customer names. They were supposed to be all upper case for other reasons in the system, but the analysts sometimes entered lower case. Developers were not going to get around to fixing it for a couple of months (other priorities; this was a very low frequency issue). So I set up a daily check to trip an alarm and notify via email the appropriate person to correct their data. 23 | * Basically, you can take any of the test case examples in Rule Sets #1 thru #8 and apply them as regression tests for defects when appropriate. 
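As a sketch of what such a trip wire might look like in this script's pass/fail style, the first example above (TAB characters in email addresses) could be covered as follows. The defect number "DEF-123" is a hypothetical placeholder:

```sql
-- Regression check for hypothetical defect DEF-123: email addresses
-- must never contain TAB characters (CHAR(9)).
SELECT CASE WHEN COUNT(*) > 0 THEN 'FAIL: DEF-123 regressed' ELSE 'P' END AS status
FROM demo_hr..employees
WHERE email LIKE '%' + CHAR(9) + '%';
```

Citing the defect number in the status text means a failing run immediately points back to the original bug report.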
24 | -------------------------------------------------------------------------------- /mssql/schemadiff/simple_schema_dump.sql: -------------------------------------------------------------------------------- 1 | ------------------------------------------------------------------------------------ 2 | -- Simple Schema Dump: 3 | -- This SQL script will dump table, column, key, and index design related metadata 4 | -- so that you can copy-paste or export to a text file. 5 | -- Even better, you can make other snapshots over time (same database schema at earlier 6 | -- points in time), OR in different environments (DEV, PPMO, STAGE, PROD). Then, 7 | -- using your favorite free Text File Diff Tool (DiffMerge, ExamDiff, etc.) you 8 | -- can compare snapshots to quickly isolate and identify what changed over time 9 | -- or is different between environments. 10 | ------------------------------------------------------------------------------------ 11 | -- Platform: Microsoft SQL Server 12 | -- Author: DataResearchLabs 13 | -- GitHub: https://github.com/DataResearchLabs/sql_scripts 14 | -- YouTube Tutorials: https://www.youtube.com/playlist?list=PLVHoUDdbskUSlZTVZmllGTdSuvyI4LBiE 15 | ---------------------------------------------------------------------------------- 16 | 17 | -- IMPORTANT 18 | USE AdventureWorksLT2019; -- <<<<<<<<<<<< CHANGE THIS VALUE to Schema you want to dump 19 | ; 20 | 21 | 22 | WITH vars 23 | AS ( 24 | SELECT DB_NAME() AS v_SchemaName 25 | ) 26 | 27 | , baseTbl 28 | AS ( 29 | SELECT TABLE_CATALOG AS SchemaName, table_type, table_name 30 | FROM INFORMATION_SCHEMA.TABLES 31 | WHERE TABLE_CATALOG = (SELECT v_SchemaName FROM vars) 32 | ) 33 | 34 | , metaForTbl 35 | AS ( 36 | SELECT t.SchemaName 37 | , t.table_name AS TableName 38 | , '(' + CASE WHEN t.table_type = 'BASE TABLE' THEN 'Table' WHEN t.table_type = 'VIEW' THEN 'View' ELSE 'UK' END + ')' AS ObjectType 39 | , t.table_name AS ObjectName 40 | , '(Exists)' AS PropertyName 41 | , ' ' AS PropertyValue 42 | 
FROM baseTbl t 43 | ) 44 | 45 | , metaForCol_dataType 46 | AS ( 47 | SELECT ft.SchemaName, ft.table_name AS TableName, 'Column' AS ObjectType, tut.column_name AS ObjectName 48 | , '2' AS PropertyName 49 | , COALESCE(tut.data_type, 'unknown') + '(' 50 | + CASE WHEN tut.CHARACTER_MAXIMUM_LENGTH IS NOT NULL THEN CAST(tut.CHARACTER_MAXIMUM_LENGTH AS VARCHAR(10)) ELSE '' END 51 | + CASE WHEN tut.DATA_TYPE IN('date','datetime') THEN ',' + CAST(tut.DATETIME_PRECISION AS VARCHAR(10)) 52 | WHEN tut.NUMERIC_PRECISION IS NULL THEN '' 53 | ELSE ',' + CAST(tut.NUMERIC_PRECISION AS VARCHAR(10)) 54 | END 55 | + CASE WHEN tut.NUMERIC_SCALE IS NOT NULL THEN ',' + CAST(tut.NUMERIC_SCALE AS VARCHAR(10)) ELSE '' END 56 | + ')' AS PropertyValue 57 | FROM INFORMATION_SCHEMA.COLUMNS tut 58 | INNER JOIN baseTbl ft ON ft.SchemaName = tut.TABLE_CATALOG AND ft.table_name = tut.table_name 59 | ) 60 | 61 | , metaForCol_nullable 62 | AS ( 63 | SELECT ft.SchemaName, ft.table_name AS TableName, 'Column' AS ObjectType, tut.column_name AS ObjectName 64 | , '3' AS PropertyName, CASE WHEN tut.IS_NULLABLE = 'YES' THEN 'NULL' ELSE 'NOT NULL' END AS PropertyValue 65 | FROM INFORMATION_SCHEMA.COLUMNS tut 66 | INNER JOIN baseTbl ft ON ft.SchemaName = tut.TABLE_CATALOG AND ft.table_name = tut.table_name 67 | ) 68 | 69 | , metaForCol_ordpos 70 | AS ( 71 | SELECT ft.SchemaName, ft.table_name AS TableName, 'Column' AS ObjectType, tut.column_name AS ObjectName 72 | , '1' AS PropertyName, RIGHT('000' + CAST(tut.ORDINAL_POSITION AS VARCHAR(3)), 3) AS PropertyValue 73 | FROM INFORMATION_SCHEMA.COLUMNS tut 74 | INNER JOIN baseTbl ft ON ft.SchemaName = tut.TABLE_CATALOG AND ft.table_name = tut.table_name 75 | ) 76 | 77 | , metaAllCols 78 | AS ( 79 | SELECT schemaname, tablename, objecttype, objectname, 'Properties' AS propertyname 80 | /* NOTE: STRING_AGG was added in SQL Server 2017 and forward. 
81 | If you have an older version, use the FOR XML PATH approach here: https://stackoverflow.com/questions/15477743/listagg-in-sqlserver 82 | */ 83 | , STRING_AGG(propertyvalue, ' | ') 84 | WITHIN GROUP (ORDER BY propertyname, propertyvalue) AS propertyvalue 85 | FROM ( 86 | SELECT * FROM metaForCol_dataType 87 | UNION SELECT * FROM metaForCol_nullable 88 | UNION SELECT * FROM metaForCol_ordpos 89 | ) t 90 | GROUP BY schemaname, tablename, objecttype, objectname 91 | ) 92 | 93 | , metaForKeys 94 | AS ( 95 | SELECT cons.TABLE_CATALOG AS SchemaName, cons.TABLE_NAME AS TableName 96 | , CASE WHEN cons.constraint_type = 'PRIMARY KEY' THEN 'PKey' 97 | WHEN cons.constraint_type = 'UNIQUE' THEN 'UKey' 98 | WHEN cons.constraint_type = 'FOREIGN KEY' THEN 'FKey' 99 | ELSE 'X' 100 | END AS ObjectType 101 | , cons.constraint_name AS ObjectName 102 | , 'FieldList' AS PropertyName 103 | /* NOTE: STRING_AGG was added in SQL Server 2017 and forward. 104 | If you have an older version, use the FOR XML PATH approach here: https://stackoverflow.com/questions/15477743/listagg-in-sqlserver 105 | */ 106 | , STRING_AGG(kcu.COLUMN_NAME, ',') WITHIN GROUP (ORDER BY kcu.ORDINAL_POSITION) AS PropertyValue 107 | FROM INFORMATION_SCHEMA.TABLE_CONSTRAINTS cons 108 | INNER JOIN INFORMATION_SCHEMA.KEY_COLUMN_USAGE kcu 109 | ON cons.TABLE_CATALOG = kcu.TABLE_CATALOG 110 | AND cons.TABLE_NAME = kcu.TABLE_NAME 111 | AND cons.CONSTRAINT_NAME = kcu.CONSTRAINT_NAME 112 | WHERE cons.TABLE_CATALOG = (SELECT v_SchemaName FROM vars) 113 | AND cons.table_name IN(SELECT DISTINCT table_name FROM baseTbl) 114 | AND cons.constraint_type IN('PRIMARY KEY','FOREIGN KEY','UNIQUE') 115 | GROUP BY cons.TABLE_CATALOG, cons.TABLE_NAME, cons.CONSTRAINT_TYPE, cons.CONSTRAINT_NAME 116 | ) 117 | 118 | , metaForIdxs 119 | AS ( 120 | SELECT (SELECT v_SchemaName FROM vars) AS SchemaName, o.name AS TableName 121 | , 'Index' AS ObjectType, i.name AS ObjectName 122 | , 'FieldList' AS PropertyName 123 | /* NOTE: STRING_AGG was added in 
SQL Server 2017 and forward. 124 | If you have an older version, use the FOR XML PATH approach here: https://stackoverflow.com/questions/15477743/listagg-in-sqlserver 125 | */ 126 | , STRING_AGG(c.name, ',') WITHIN GROUP (ORDER BY ic.key_ordinal /*column_store_order_ordinal*/) AS PropertyValue 127 | FROM sys.indexes i 128 | INNER JOIN sys.objects o ON i.object_id = o.object_id 129 | INNER JOIN sys.index_columns ic ON ic.object_id = i.object_id AND ic.index_id = i.index_id 130 | INNER JOIN sys.columns c ON c.object_id = i.object_id AND c.column_id = ic.column_id 131 | WHERE i.[Type] = 2 132 | AND i.is_unique = 0 133 | AND i.is_primary_key = 0 134 | AND o.[type] = 'U' 135 | GROUP BY o.name, i.name 136 | ) 137 | 138 | , allMetadata 139 | AS ( 140 | SELECT * FROM metaForTbl 141 | UNION SELECT * FROM metaAllCols 142 | UNION SELECT * FROM metaForKeys 143 | UNION SELECT * FROM metaForIdxs 144 | ) 145 | 146 | SELECT CASE WHEN objecttype IN('(Table)','(View)') THEN schemaname ELSE ' ' END AS schema_nm 147 | , CASE WHEN objecttype IN('(Table)','(View)') THEN tablename ELSE ' ' END AS tbl_nm 148 | , objecttype AS obj_typ, objectname AS obj_nm, /*propertyname,*/ propertyvalue AS properties 149 | FROM allMetadata 150 | ORDER BY schemaname, tablename, objecttype 151 | , CASE WHEN objecttype='Column' THEN propertyvalue ELSE ' ' END 152 | , objectname, propertyname 153 | 154 | 155 | -------------------------------------------------------------------------------- /mysql/data_dictionary/data_dict_dump.sql: -------------------------------------------------------------------------------- 1 | ------------------------------------------------------------------------------------ 2 | -- Data Dictionary Dump: 3 | -- This SQL script will dump table, column, key, and description design related 4 | -- metadata so that you can copy-paste or export to Excel as a Data Dictionary. 
5 | ------------------------------------------------------------------------------------ 6 | -- Platform: MySQL 7 | -- Author: DataResearchLabs 8 | -- GitHub: https://github.com/DataResearchLabs/sql_scripts 9 | -- YouTube Tutorials: https://www.youtube.com/channel/UCQciXv3xaBykeUFc04GxSXA 10 | ---------------------------------------------------------------------------------- 11 | USE sakila -- <<<<<<<<<<<<< Change schema here 12 | ; 13 | 14 | WITH vars 15 | AS ( 16 | SELECT 17 | DATABASE() AS v_SchemaName -- Do not change this value...it is changed up above on line 11 and passed thru to here 18 | , 'NO' AS v_TablesOnly -- YES=Limit To Tables only; NO=Include views too 19 | ) 20 | 21 | , baseTbl 22 | AS ( 23 | SELECT table_schema AS SchemaName 24 | , table_catalog 25 | , table_type, table_name, table_schema 26 | FROM INFORMATION_SCHEMA.TABLES 27 | WHERE TABLE_SCHEMA = (SELECT v_SchemaName FROM vars) 28 | AND ( (TABLE_TYPE = 'BASE TABLE') 29 | OR ((SELECT v_TablesOnly FROM vars) = 'NO') 30 | ) 31 | ) 32 | 33 | , metadata 34 | AS ( 35 | SELECT 36 | bt.SchemaName AS schema_nm 37 | , bt.table_name AS table_nm 38 | , CASE WHEN bt.TABLE_TYPE = 'BASE TABLE' THEN 'TBL' 39 | WHEN bt.TABLE_TYPE = 'VIEW' THEN 'VW' 40 | ELSE 'UK' 41 | END AS obj_typ 42 | , tut.ordinal_position AS ord_pos 43 | , tut.column_name AS column_nm 44 | , CONCAT(COALESCE(tut.data_type, 'unknown'), 45 | CASE WHEN tut.data_type IN('varchar','char') THEN CONCAT('(', tut.CHARACTER_MAXIMUM_LENGTH, ')') 46 | WHEN tut.data_type IN('date','time') THEN CONCAT('(3)') 47 | WHEN tut.data_type = 'datetime' THEN CONCAT('(8)') 48 | WHEN tut.data_type = 'timestamp' THEN CONCAT('(4)') 49 | WHEN tut.data_type in('tinyint','smallint','mediumint','int','bigint') THEN CONCAT('(', tut.NUMERIC_PRECISION, ')') 50 | WHEN tut.data_type = 'decimal' THEN CONCAT('(', tut.NUMERIC_PRECISION, ',', tut.NUMERIC_SCALE, ')') 51 | WHEN tut.CHARACTER_MAXIMUM_LENGTH IS NOT NULL THEN CONCAT('(', tut.CHARACTER_MAXIMUM_LENGTH, ')') 52 | WHEN 
tut.DATETIME_PRECISION IS NOT NULL THEN CONCAT('(', tut.DATETIME_PRECISION, ')') 53 | WHEN tut.NUMERIC_PRECISION IS NOT NULL 54 | AND tut.NUMERIC_SCALE IS NULL THEN CONCAT('(', tut.NUMERIC_PRECISION, ')') 55 | WHEN tut.NUMERIC_PRECISION IS NOT NULL 56 | AND tut.NUMERIC_SCALE IS NOT NULL THEN CONCAT('(', tut.NUMERIC_PRECISION, ',', tut.NUMERIC_SCALE, ')') 57 | ELSE '' 58 | END ) AS data_typ 59 | , CASE WHEN tut.IS_NULLABLE = 'YES' THEN 'NULL' ELSE 'NOT NULL' END AS nullable 60 | FROM INFORMATION_SCHEMA.COLUMNS tut 61 | INNER JOIN baseTbl bt ON bt.table_catalog = tut.TABLE_CATALOG AND bt.table_name = tut.table_name 62 | ) 63 | 64 | , meta_for_keys 65 | AS ( 66 | SELECT schema_nm, table_nm, column_nm 67 | , GROUP_CONCAT(is_key ORDER BY is_key SEPARATOR ',') AS is_key 68 | FROM ( 69 | SELECT cons.TABLE_SCHEMA AS schema_nm 70 | , cons.TABLE_NAME AS table_nm 71 | , kcu.COLUMN_NAME AS column_nm 72 | , CASE WHEN cons.constraint_type = 'PRIMARY KEY' THEN 'PK' 73 | WHEN cons.constraint_type = 'UNIQUE' THEN 'UK' 74 | WHEN cons.constraint_type = 'FOREIGN KEY' THEN 'FK' 75 | ELSE 'X' 76 | END AS is_key 77 | FROM INFORMATION_SCHEMA.TABLE_CONSTRAINTS cons 78 | INNER JOIN INFORMATION_SCHEMA.KEY_COLUMN_USAGE kcu 79 | ON cons.TABLE_SCHEMA = kcu.TABLE_SCHEMA 80 | AND cons.TABLE_NAME = kcu.TABLE_NAME 81 | AND cons.CONSTRAINT_NAME = kcu.CONSTRAINT_NAME 82 | WHERE cons.table_schema = (SELECT v_SchemaName FROM vars) 83 | AND cons.table_name IN(SELECT DISTINCT table_name FROM baseTbl) 84 | AND cons.constraint_type IN('PRIMARY KEY','FOREIGN KEY','UNIQUE') 85 | GROUP BY cons.TABLE_SCHEMA, cons.TABLE_NAME, kcu.COLUMN_NAME, cons.constraint_type 86 | ) t 87 | GROUP BY schema_nm, table_nm, column_nm 88 | ) 89 | 90 | , col_comm 91 | AS ( 92 | SELECT TABLE_SCHEMA AS SCHEMA_NM 93 | , TABLE_NAME AS TABLE_NM 94 | , COLUMN_NAME AS COLUMN_NM 95 | , COLUMN_COMMENT AS column_descr 96 | FROM INFORMATION_SCHEMA.COLUMNS 97 | WHERE table_schema = (SELECT v_SchemaName FROM vars) 98 | AND table_name IN(SELECT 
DISTINCT table_name FROM baseTbl) 99 | ) 100 | 101 | SELECT md.SCHEMA_NM, md.TABLE_NM, md.OBJ_TYP 102 | , md.ORD_POS AS ord 103 | , COALESCE(pk.is_key, ' ') AS is_key 104 | , md.COLUMN_NM, md.DATA_TYP, md.NULLABLE, c.column_descr 105 | FROM metadata md 106 | LEFT JOIN meta_for_keys pk ON pk.SCHEMA_NM = md.SCHEMA_NM AND pk.TABLE_NM = md.TABLE_NM AND pk.COLUMN_NM = md.COLUMN_NM 107 | LEFT JOIN col_comm c ON c.SCHEMA_NM = md.SCHEMA_NM AND c.TABLE_NM = md.TABLE_NM AND c.COLUMN_NM = md.COLUMN_NM 108 | ORDER BY md.SCHEMA_NM, md.TABLE_NM, md.ORD_POS 109 | -------------------------------------------------------------------------------- /mysql/data_validation/demo_data/demo_hr_01_create_database.sql: -------------------------------------------------------------------------------- 1 | CREATE DATABASE Demo_HR; 2 | -------------------------------------------------------------------------------- /mysql/data_validation/demo_data/demo_hr_02_create_tables.sql: -------------------------------------------------------------------------------- 1 | USE Demo_HR; 2 | 3 | CREATE TABLE Demo_HR.JOBS 4 | (JOB_ID VARCHAR(10), 5 | JOB_TITLE VARCHAR(35) NOT NULL, 6 | MIN_SALARY NUMERIC(6,0), 7 | MAX_SALARY NUMERIC(6,0), 8 | CONSTRAINT JOB_ID_PK PRIMARY KEY (JOB_ID) 9 | ); 10 | 11 | 12 | 13 | -- Demo_HR..JOBS_SNAPSHOT definition 14 | CREATE TABLE Demo_HR.JOBS_SNAPSHOT 15 | (JOB_ID VARCHAR(10), 16 | JOB_TITLE VARCHAR(35) NOT NULL, 17 | MIN_SALARY NUMERIC(6,0), 18 | MAX_SALARY NUMERIC(6,0) 19 | ); 20 | 21 | 22 | 23 | -- Demo_HR..REGIONS definition 24 | CREATE TABLE Demo_HR.REGIONS 25 | (REGION_ID INTEGER NOT NULL PRIMARY KEY, 26 | REGION_NAME VARCHAR(25) 27 | ); 28 | 29 | 30 | 31 | -- Demo_HR..COUNTRIES definition 32 | CREATE TABLE Demo_HR.COUNTRIES 33 | (COUNTRY_ID CHAR(2) NOT NULL PRIMARY KEY, 34 | COUNTRY_NAME VARCHAR(40), 35 | REGION_ID NUMERIC REFERENCES Demo_HR.REGIONS (REGION_ID), 36 | DATE_LAST_UPDATED DATETIME DEFAULT CURRENT_TIMESTAMP 37 | ); 38 | 39 | 40 | 41 | 42 | -- 
Demo_HR..LOCATIONS definition 43 | CREATE TABLE Demo_HR.LOCATIONS 44 | (LOCATION_ID NUMERIC(4,0) PRIMARY KEY, 45 | STREET_ADDRESS VARCHAR(40), 46 | POSTAL_CODE VARCHAR(12), 47 | CITY VARCHAR(30) NOT NULL, 48 | STATE_PROVINCE VARCHAR(25), 49 | COUNTRY_ID CHAR(2) REFERENCES Demo_HR.COUNTRIES (COUNTRY_ID) 50 | ); 51 | 52 | CREATE INDEX LOC_CITY_IX ON Demo_HR.LOCATIONS (CITY); 53 | CREATE INDEX LOC_STATE_PROVINCE_IX ON Demo_HR.LOCATIONS (STATE_PROVINCE); 54 | CREATE INDEX LOC_COUNTRY_IX ON Demo_HR.LOCATIONS (COUNTRY_ID); 55 | 56 | 57 | 58 | -- Demo_HR..DEPARTMENTS definition 59 | CREATE TABLE Demo_HR.DEPARTMENTS 60 | (DEPARTMENT_ID NUMERIC(4,0) PRIMARY KEY, 61 | DEPARTMENT_NAME VARCHAR(30) NOT NULL, 62 | MANAGER_ID NUMERIC(6,0), 63 | LOCATION_ID NUMERIC(4,0) REFERENCES Demo_HR.LOCATIONS (LOCATION_ID) , 64 | URL VARCHAR(255) 65 | ) ; 66 | 67 | CREATE INDEX DEPT_LOCATION_IX ON Demo_HR.DEPARTMENTS (LOCATION_ID); 68 | 69 | 70 | 71 | -- Demo_HR..JOB_HISTORY definition 72 | CREATE TABLE Demo_HR.JOB_HISTORY 73 | (EMPLOYEE_ID NUMERIC(6,0) NOT NULL, 74 | START_DATE DATETIME NOT NULL, 75 | END_DATE DATETIME NOT NULL, 76 | JOB_ID VARCHAR(10) NOT NULL, 77 | DEPARTMENT_ID NUMERIC(4,0), 78 | CONSTRAINT JHIST_DATE_INTERVAL CHECK (end_date > start_date), 79 | CONSTRAINT JHIST_EMP_ID_ST_DATE_PK PRIMARY KEY (EMPLOYEE_ID, START_DATE), 80 | CONSTRAINT JHIST_JOB_FK FOREIGN KEY (JOB_ID) REFERENCES Demo_HR.JOBS (JOB_ID), 81 | CONSTRAINT JHIST_DEPT_FK FOREIGN KEY (DEPARTMENT_ID) REFERENCES Demo_HR.DEPARTMENTS (DEPARTMENT_ID) 82 | ) ; 83 | 84 | CREATE INDEX JHIST_JOB_IX ON Demo_HR.JOB_HISTORY (JOB_ID); 85 | CREATE INDEX JHIST_EMPLOYEE_IX ON Demo_HR.JOB_HISTORY (EMPLOYEE_ID); 86 | CREATE INDEX JHIST_DEPARTMENT_IX ON Demo_HR.JOB_HISTORY (DEPARTMENT_ID); 87 | 88 | 89 | 90 | 91 | -- Demo_HR..EMPLOYEES definition 92 | CREATE TABLE Demo_HR.EMPLOYEES 93 | (EMPLOYEE_ID NUMERIC(6,0), 94 | FIRST_NAME VARCHAR(20), 95 | LAST_NAME VARCHAR(25) NOT NULL, 96 | EMAIL VARCHAR(25) NOT NULL, 97 | PHONE_NUMBER 
VARCHAR(20), 98 | HIRE_DATE DATETIME NOT NULL, 99 | JOB_ID VARCHAR(10) NOT NULL, 100 | SALARY NUMERIC(8,2), 101 | COMMISSION_PCT NUMERIC(2,2), 102 | MANAGER_ID NUMERIC(6,0), 103 | DEPARTMENT_ID NUMERIC(4,0), 104 | SOME_DATE_FMT1 VARCHAR(50), 105 | SOME_DATE_FMT2 VARCHAR(50), 106 | SOME_DATE_FMT3 VARCHAR(50), 107 | SOME_DATE_FMT4 VARCHAR(50), 108 | FAKE_SSN VARCHAR(11), 109 | ZIP5 VARCHAR(5), 110 | ZIP5OR9 VARCHAR(10), 111 | ZIP9 VARCHAR(10), 112 | EMAIL_ADDRESS VARCHAR(100), 113 | CONSTRAINT EMP_SALARY_MIN CHECK (salary > 0), 114 | CONSTRAINT EMP_EMAIL_UK UNIQUE (EMAIL), 115 | CONSTRAINT EMP_EMP_ID_PK PRIMARY KEY (EMPLOYEE_ID), 116 | CONSTRAINT EMP_DEPT_FK FOREIGN KEY (DEPARTMENT_ID) REFERENCES Demo_HR.DEPARTMENTS (DEPARTMENT_ID), 117 | CONSTRAINT EMP_JOB_FK FOREIGN KEY (JOB_ID) REFERENCES Demo_HR.JOBS (JOB_ID), 118 | CONSTRAINT EMP_MANAGER_FK FOREIGN KEY (MANAGER_ID) REFERENCES Demo_HR.EMPLOYEES (EMPLOYEE_ID) 119 | ) ; 120 | 121 | CREATE INDEX EMP_DEPARTMENT_IX ON Demo_HR.EMPLOYEES (DEPARTMENT_ID); 122 | CREATE INDEX EMP_JOB_IX ON Demo_HR.EMPLOYEES (JOB_ID); 123 | CREATE INDEX EMP_MANAGER_IX ON Demo_HR.EMPLOYEES (MANAGER_ID); 124 | CREATE INDEX EMP_NAME_IX ON Demo_HR.EMPLOYEES (LAST_NAME, FIRST_NAME); 125 | 126 | 127 | 128 | 129 | 130 | 131 | -------------------------------------------------------------------------------- /mysql/data_validation/img/01_data_val_mysql_example_test_case_sql_code.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/DataResearchLabs/sql_scripts/3eaa17caf248d3b58763534d62ef4d9ac5729a46/mysql/data_validation/img/01_data_val_mysql_example_test_case_sql_code.png -------------------------------------------------------------------------------- /mysql/data_validation/img/02_data_val_mysql_run_results.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/DataResearchLabs/sql_scripts/3eaa17caf248d3b58763534d62ef4d9ac5729a46/mysql/data_validation/img/02_data_val_mysql_run_results.png -------------------------------------------------------------------------------- /mysql/data_validation/img/03_data_val_mysql_adv_test_case.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/DataResearchLabs/sql_scripts/3eaa17caf248d3b58763534d62ef4d9ac5729a46/mysql/data_validation/img/03_data_val_mysql_adv_test_case.png -------------------------------------------------------------------------------- /mysql/data_validation/img/04_data_val_mysql_run_results_adv.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/DataResearchLabs/sql_scripts/3eaa17caf248d3b58763534d62ef4d9ac5729a46/mysql/data_validation/img/04_data_val_mysql_run_results_adv.png -------------------------------------------------------------------------------- /mysql/data_validation/rule_set_01__row_counts.md: -------------------------------------------------------------------------------- 1 | ### Data Validation Examples - MySQL 2 | [![License: CC0](https://img.shields.io/badge/License-CC0-red)](LICENSE "Creative Commons Zero License by DataResearchLabs (effectively = Public Domain") 3 | [![YouTube](https://img.shields.io/badge/YouTube-DataResearchLabs-brightgreen)](http://www.DataResearchLabs.com) 4 | #### [Return to Data Validation Home Page](https://github.com/DataResearchLabs/sql_scripts/blob/main/data_validation_scripts.md) 5 | 6 | 7 | # Rule Set #1 - Row Counts 8 | 9 | ## Table of Contents 10 | - T001 - "Full" Row Count 11 | - T002 - "Partial" Row Count 12 | - T003 - "Relative" Row Count 13 | - T004 - "Recent" Row Count 14 |
15 | 16 | 17 | 18 | ### T001 - "Full" Row Count 19 | Verify full row count for a table or view. For example, to verify that table countries has exactly 25 rows: 20 | ```sql 21 | SELECT CASE WHEN COUNT(*) <> 25 THEN 'FAIL' ELSE 'P' END AS status 22 | FROM demo_hr.countries; 23 | ``` 24 |
25 | 26 | 27 | 28 | ### T002 - "Partial" Row Count 29 | Verify partial row count for a subset of a table or view. For example, to verify that table countries has exactly 8 rows where region_id = 1 (Europe): 30 | ```sql 31 | SELECT CASE WHEN COUNT(*) <> 8 THEN 'FAIL' ELSE 'P' END AS status 32 | FROM demo_hr.countries 33 | WHERE region_id = 1; 34 | ``` 35 |
36 | 37 | 38 | 39 | ### T003 - "Relative" Row Count 40 | Verify relative row counts between tables or views. The example below verifies that table countries has at least 5 times the number of rows as table regions. The inner query returns one row with two columns: (a) count of countries, and (b) count of regions. The outer query can then apply the business logic of comparing the counts and returning a Pass or Fail. 41 | ```sql 42 | SELECT CASE WHEN countries_count < 5 * regions_count THEN 'FAIL' ELSE 'P' END AS status 43 | FROM ( 44 | SELECT (SELECT COUNT(*) AS row_count FROM demo_hr.countries) AS countries_count 45 | , (SELECT COUNT(*) AS row_count FROM demo_hr.regions) AS regions_count 46 | ) t; 47 | ``` 48 |
49 | 50 | 51 | 52 | ### T004 - "Recent" Row Count 53 | Verify recent row counts. The example below verifies that table countries has had at least 5 rows updated in the past 150 days based on the date stamp in field date_last_updated. The inner query uses DATE_SUB(NOW(), INTERVAL 150 DAY) to subtract 150 days from MySQL's current date-time stamp, and counts how many rows in table countries have a date_last_updated within the past 150 days. The outer query can then confirm that at least 5 rows are recent for a Pass...if not then it returns a Fail. 54 | ```sql 55 | SELECT CASE WHEN row_count < 5 THEN 'FAIL' ELSE 'P' END AS status 56 | FROM ( 57 | SELECT COUNT(*) AS row_count 58 | FROM demo_hr.countries 59 | WHERE date_last_updated >= DATE_SUB(NOW(), INTERVAL 150 DAY) 60 | ) t; 61 | ``` 62 | 63 | 64 | -------------------------------------------------------------------------------- /mysql/data_validation/rule_set_02__keys.md: -------------------------------------------------------------------------------- 1 | ### Data Validation Examples - MySQL 2 | [![License: CC0](https://img.shields.io/badge/License-CC0-red)](LICENSE "Creative Commons Zero License by DataResearchLabs (effectively = Public Domain)") 3 | [![YouTube](https://img.shields.io/badge/YouTube-DataResearchLabs-brightgreen)](http://www.DataResearchLabs.com) 4 | #### [Return to Data Validation Home Page](https://github.com/DataResearchLabs/sql_scripts/blob/main/data_validation_scripts.md) 5 | 6 | 7 | # Rule Set #2 - Keys (Foreign & Unique) 8 | 9 | ## Table of Contents 10 | - T005 - Unique (Native) Key Has No Duplicates 11 | - T006 - Foreign Key Children Have Orphans 12 | - T007 - Foreign Key Parent is Childless 13 |
14 | 15 | 16 | 17 | ### T005 - Unique Key Has No Duplicates 18 | Sure, good database design implies that unique keys be enforced by a constraint so that you do not need to test for them. However, there are times when a decision is made to **not** add a constraint to enforce the unique key (e.g., the table is replicated from a source that already has the constraint, so it is skipped for performance). This does happen! At work just last week, two of my unique key regression tests started failing -- without these checks in place as a wide net, the downstream defects would have been around a lot longer. 19 | The query below does a group by on the unique key fields, then uses a HAVING clause to filter down to those key values with a count of more than 1 -- the dups. Any rows returned are duplicates (fail); if no rows come back, the test passes. 20 | ```sql 21 | SELECT country_name -- UKey fields 22 | , COUNT(*) AS match_count 23 | FROM demo_hr.countries 24 | GROUP BY country_name -- UKey fields 25 | HAVING COUNT(*) > 1; 26 | ``` 27 |
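If you prefer a single pass/fail status row, matching the wrapper pattern used throughout the other rule sets, the duplicate check can be nested as an inner query:

```sql
-- Inner query finds unique-key values appearing more than once;
-- outer query collapses the result to one pass/fail status row.
SELECT CASE WHEN COUNT(*) > 0 THEN 'FAIL' ELSE 'P' END AS status
FROM (
  SELECT country_name, COUNT(*) AS match_count
  FROM demo_hr.countries
  GROUP BY country_name
  HAVING COUNT(*) > 1
) t;
```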
28 | 29 | 30 | 31 | ### T006 - Foreign Key Children Have Orphans 32 | Sure, as with the T005 unique keys above, good database design implies that foreign keys be enforced by a constraint so that you do not need to test for them. However, there are times when, for whatever reason, the constraints do not exist. In those instances, you will want to periodically run a data validation test to ensure that this core assumption is not being violated (of course adding a foreign key constraint would be best, but if that is not an option then periodically check). 33 | The query below pulls from the child table countries as the anchor, then left joins out to the parent table regions on the key field region_id. If a region_id does not exist in the parent table (p.region_id IS NULL), then the child region_id is an orphan. Any rows returned are orphans (fail); no rows returned means the test passes. 34 | ```sql 35 | SELECT DISTINCT c.region_id AS child_id, p.region_id AS parent_id 36 | FROM demo_hr.countries c 37 | LEFT JOIN demo_hr.regions p ON p.region_id = c.region_id 38 | WHERE p.region_id IS NULL; 39 | ``` 40 |
41 | 42 | 43 | 44 | ### T007 - Foreign Key Parent Is Childless 45 | I'm not sure this particular test is all that useful because often it is okay for the parent side of a foreign key relationship to not have children. But if for some reason you need to be sure there is data present on both sides (parent **and** child), then this test is for you. You will notice in the example below that the query is very similar to T006 above, but the parent and child tables have switched positions in the FROM and LEFT JOIN lines. This is because we want to first pull all parent rows, then left join to find missing (FKey field IS NULL) child rows. 46 | ```sql 47 | SELECT DISTINCT c.country_id AS child_id, p.country_id AS parent_id 48 | FROM demo_hr.countries p 49 | LEFT JOIN demo_hr.locations c ON p.country_id = c.country_id 50 | WHERE c.country_id IS NULL 51 | AND p.country_id IN('IT','JP','US','CA','CN','IN','AU','SG','UK','DE','CH','NL','MX'); 52 | ``` 53 | -------------------------------------------------------------------------------- /mysql/data_validation/rule_set_03__heuristic_thresholds.md: -------------------------------------------------------------------------------- 1 | ### Data Validation Examples - MySQL 2 | [![License: CC0](https://img.shields.io/badge/License-CC0-red)](LICENSE "Creative Commons Zero License by DataResearchLabs (effectively = Public Domain)") 3 | [![YouTube](https://img.shields.io/badge/YouTube-DataResearchLabs-brightgreen)](http://www.DataResearchLabs.com) 4 | #### [Return to Data Validation Home Page](https://github.com/DataResearchLabs/sql_scripts/blob/main/data_validation_scripts.md) 5 | 6 | # Rule Set #3 - Heuristic Thresholds (Column Null Rates & Value Rates) 7 | 8 | 9 | ## Table of Contents 10 | - T008 - Null Rate Thresholds 11 | - T009 - Value Frequency Thresholds 12 | 13 | Note: Although the terms "rate" and "frequency" are mostly synonymous, nulls are usually referred to as "rates" and values as "frequencies".
I don't know why, but I'm following the convention I've heard over and over. 14 |
15 | 16 | 17 | 18 | ### T008 - Null Rate Thresholds 19 | There is a lot going on in this "Null Rate Threshold" data validation query below. The basic goal is to validate that a given table's columns do not have too many (or too few) NULLs. 20 |
21 | In the example below...
22 | ...we are checking columns department_name, manager_id, and url in the table departments. We want to ensure that column department_name has no NULLs, that column manager_id is NULL less than 65% of the time, and that column url is NULL less than 80% of the time. 23 | 24 | This check is like the proverbial miner's canary in that it is a trip wire triggered when something goes awry in a data feed. I've used this test scenario to great effect when coupled with a create-date or last-updated-date to monitor the past week's data loads for any unexpected upticks in null rates. 25 | 26 | There is a downside to this test scenario too, however: it can fire false alarms, leaving you tinkering with the threshold values (the 0.0000, 0.6500, and 0.8000 cutoffs below), raising and lowering them over and over. If this happens, chances are the test failures are neither actionable nor important, and you should not waste your time applying this test scenario to that given table and field. Be careful to only pick fields that truly matter. 27 | 28 | Below, there is an upper CTE (common table expression) named "dtls" at the WITH clause, and a lower wrapper that applies the business logic (if any null rate rejections were found, fail the case). Inside the dtls CTE, there is an inner query at the bottom (at the FROM clause) doing a single table scan to calculate a null rate per column by counting nulls in each column and dividing by the total table row count. The SELECT CASE logic at the top applies the business logic; comparing the actual calculated null rates (nr_dept_nm, nr_mgr_id, and nr_url) against the expected threshold rates (hard-coded as 0.0000, 0.6500, and 0.8000). The returned value is a rejection code (REJ-01, REJ-02, etc.) clearly indicating which field failed the null rate check, what the actual null rate was, and what the expected null rate threshold was. If no rejections are triggered, then status returns a "P" for pass. 29 |
30 | 31 | ```sql 32 | WITH dtls AS ( 33 | SELECT CASE WHEN nr_dept_nm > 0.0000 THEN CONCAT('REJ-01: Null rate too high at department_name|exp=0.0000|act=', CAST(nr_dept_nm AS CHAR(8)) ) 34 | WHEN nr_mgr_id > 0.6500 THEN CONCAT('REJ-02: Null rate too high at manager_id|exp<=0.6500|act=', CAST(nr_mgr_id AS CHAR(8)) ) 35 | WHEN nr_url > 0.8000 THEN CONCAT('REJ-03: Null rate too high at url|exp<=0.8000|act=', CAST(nr_url AS CHAR(8)) ) 36 | ELSE 'P' 37 | END AS status 38 | FROM ( 39 | SELECT CAST(SUM(CASE WHEN department_name IS NULL THEN 1 ELSE 0 END) AS DECIMAL(10, 5)) / CAST(COUNT(*) AS DECIMAL(10, 5)) AS nr_dept_nm 40 | , CAST(SUM(CASE WHEN manager_id IS NULL THEN 1 ELSE 0 END) AS DECIMAL(10, 5)) / CAST(COUNT(*) AS DECIMAL(10, 5)) AS nr_mgr_id 41 | , CAST(SUM(CASE WHEN url IS NULL THEN 1 ELSE 0 END) AS DECIMAL(10, 5)) / CAST(COUNT(*) AS DECIMAL(10, 5)) AS nr_url 42 | FROM demo_hr.departments 43 | ) t 44 | ) 45 | 46 | SELECT CASE WHEN COUNT(*) > 0 THEN 'FAIL' ELSE 'P' END AS status 47 | FROM dtls 48 | WHERE status <> 'P'; 49 | ``` 50 |
51 | 52 | 53 | 54 | ### T009 - Value Frequency Thresholds 55 | "Value Frequency Threshold" tests are fairly similar to null rates above (T008). The difference is that we are checking the frequency (or rate) at which a column's values occur. 56 |
57 | In the example below...
58 | ...we are checking the frequencies with which the values 1, 2, 3, and 4 occur in field region_id of table countries. There is an upper CTE (common table expression) named "dtls" at the WITH clause, and a lower wrapper that applies the business logic (if any value frequency rejections were found, fail the case). Inside the dtls CTE, there is an inner query at the bottom (at the FROM clause) doing a single table scan to calculate a frequency for each value of the column in the GROUP BY. The GROUP BY value count (field "freq") is divided by the total table row count (field "den") to calculate field "freq_rt". The SELECT CASE logic at the top applies the business logic; comparing the actual value frequencies (freq_rt when region_id = 1, or =2, etc.) against the expected threshold frequencies (hard-coded as 0.28 to 0.36, 0.16 to 0.24, and so on). The returned value is a rejection code (REJ-01, REJ-02, etc.) clearly indicating which field failed the value frequency check, what the actual value frequency was, and what the expected value frequency threshold ranges were. If no rejections are triggered, then status returns a "P" for pass. 59 |
60 | 61 | ```sql 62 | WITH dtls AS ( 63 | SELECT region_id, freq_rt 64 | , CASE WHEN region_id = 1 AND freq_rt NOT BETWEEN 0.28 AND 0.36 THEN CONCAT('REJ-01: Frequency occurrence of region_id=1 is outside threshold|exp=0.28 thru 0.36|act=' , CAST(freq_rt AS CHAR(8))) 65 | WHEN region_id = 2 AND freq_rt NOT BETWEEN 0.16 AND 0.24 THEN CONCAT('REJ-02: Frequency occurrence of region_id=2 is outside threshold|exp=0.16 thru 0.24|act=' , CAST(freq_rt AS CHAR(8))) 66 | WHEN region_id = 3 AND freq_rt NOT BETWEEN 0.20 AND 0.28 THEN CONCAT('REJ-03: Frequency occurrence of region_id=3 is outside threshold|exp=0.20 thru 0.28|act=' , CAST(freq_rt AS CHAR(8))) 67 | WHEN region_id = 4 AND freq_rt NOT BETWEEN 0.20 AND 0.28 THEN CONCAT('REJ-04: Frequency occurrence of region_id=4 is outside threshold|exp=0.20 thru 0.28|act=' , CAST(freq_rt AS CHAR(8))) 68 | ELSE 'P' 69 | END AS status 70 | FROM ( 71 | SELECT region_id, CAST(freq AS FLOAT) / CAST(den AS FLOAT) AS freq_rt 72 | FROM ( 73 | SELECT region_id, COUNT(*) AS freq 74 | , (SELECT COUNT(*) FROM demo_hr.countries) AS den 75 | FROM demo_hr.countries 76 | GROUP BY region_id 77 | ) t 78 | ) t2 79 | ) 80 | 81 | SELECT CASE WHEN COUNT(*) > 0 THEN 'FAIL' ELSE 'P' END AS status 82 | FROM dtls 83 | WHERE status <> 'P'; 84 | ``` 85 | -------------------------------------------------------------------------------- /mysql/data_validation/rule_set_04__numeric_values.md: -------------------------------------------------------------------------------- 1 | ### Data Validation Examples - MySQL 2 | [![License: CC0](https://img.shields.io/badge/License-CC0-red)](LICENSE "Creative Commons Zero License by DataResearchLabs (effectively = Public Domain)") 3 | [![YouTube](https://img.shields.io/badge/YouTube-DataResearchLabs-brightgreen)](http://www.DataResearchLabs.com) 4 | #### [Return to Data Validation Home Page](https://github.com/DataResearchLabs/sql_scripts/blob/main/data_validation_scripts.md) 5 | 6 | # Rule Set #4 - Numeric Values 7 | 8 | ##
Table of Contents 9 | - T010 - Not Null 10 | - T011 - Not Negative 11 | - T012 - Numeric Range 12 | - T013 - In Value List 13 | - T014 - Not In Value List 14 | - T015 - Multi Field Compare 15 |
16 | 17 | 18 | 19 | ### T010 - Not Null 20 | Verify numeric field is not null. For example, to verify that table countries has no NULLs in field region_id: 21 | ```sql 22 | SELECT CASE WHEN COUNT(*) > 0 THEN 'FAIL' ELSE 'P' END AS status 23 | FROM demo_hr.countries 24 | WHERE region_id IS NULL; 25 | ``` 26 |
27 | 28 | 29 | 30 | ### T011 - Not Negative 31 | Verify numeric field is not negative. For example, to verify that table countries has no negative values in field region_id: 32 | ```sql 33 | SELECT CASE WHEN COUNT(*) > 0 THEN 'FAIL' ELSE 'P' END AS status 34 | FROM demo_hr.countries 35 | WHERE region_id < 0; 36 | ``` 37 |
38 | 39 | 40 | 41 | ### T012 - Numeric Range 42 | Verify numeric field value is within a range. In the example below, we verify that field employee_id is between 100 and 999 in table employees. Note that you can run the inner query yourself to return the actual rejection code (is too low or too high) along with the actual value and the expected value...all nicely packaged for easy troubleshooting. 43 | ```sql 44 | SELECT employee_id 45 | , CASE WHEN employee_id < 100 THEN CONCAT('REJ-01: Verify employee_id > 99|exp>99|act=', CAST(employee_id AS CHAR(10)) ) 46 | WHEN employee_id > 999 THEN CONCAT('REJ-02: Verify employee_id < 1000|exp<1000|act=', CAST(employee_id AS CHAR(10)) ) 47 | ELSE 'P' 48 | END AS status 49 | FROM demo_hr.employees; 50 | ``` 51 |
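To roll the per-row statuses above up into a single pass/fail result, one possible outer wrapper (a sketch following the same pattern used elsewhere in these scripts) is:

```sql
-- Sketch only: wrap the per-row query and fail if any row was rejected
SELECT CASE WHEN COUNT(*) > 0 THEN 'FAIL' ELSE 'P' END AS status
FROM (
  SELECT CASE WHEN employee_id < 100 THEN 'REJ-01'
              WHEN employee_id > 999 THEN 'REJ-02'
              ELSE 'P'
         END AS status
  FROM demo_hr.employees
) t
WHERE status <> 'P';
```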
52 | 53 | 54 | 55 | ### T013 - In Value List 56 | Verify numeric field is **in** the list of values. For example, to verify that table countries field region_id is always values 1, 2, 3, or 4 we use the IN() clause as shown below: 57 | ```sql 58 | SELECT region_id 59 | , CASE WHEN region_id NOT IN(1,2,3,4) THEN 'FAIL' ELSE 'P' END AS status 60 | FROM demo_hr.countries; 61 | ``` 62 |
63 | 64 | 65 | 66 | ### T014 - Not In Value List 67 | Verify numeric field is **not** in the list of values. For example, to verify that table countries field region_id is never 97, 98, or 99, we fail any row whose region_id appears in the IN() list, as shown below: 68 | ```sql 69 | SELECT region_id 70 | , CASE WHEN region_id IN(97,98,99) THEN 'FAIL' ELSE 'P' END AS status 71 | FROM demo_hr.countries; 72 | ``` 73 |
74 | 75 | 76 | 77 | ### T015 - Multi Field Compare 78 | Verify numeric field values in relation to one another. For example, to verify that salary times commission_pct is always less than $10,000 in table employees: 79 | ```sql 80 | SELECT salary, commission_pct 81 | , CASE WHEN salary * commission_pct > 10000 THEN 'FAIL' ELSE 'P' END AS status 82 | FROM demo_hr.employees; 83 | ``` 84 |
85 | -------------------------------------------------------------------------------- /mysql/data_validation/rule_set_05__date_values.md: -------------------------------------------------------------------------------- 1 | ### Data Validation Examples - MySQL 2 | [![License: CC0](https://img.shields.io/badge/License-CC0-red)](LICENSE "Creative Commons Zero License by DataResearchLabs (effectively = Public Domain)") 3 | [![YouTube](https://img.shields.io/badge/YouTube-DataResearchLabs-brightgreen)](http://www.DataResearchLabs.com) 4 | #### [Return to Data Validation Home Page](https://github.com/DataResearchLabs/sql_scripts/blob/main/data_validation_scripts.md) 5 | 6 | 7 | # Rule Set #5 - Date Values 8 | 9 | ## Table of Contents 10 | - T016 - Not Null 11 | - T017 - Date Range 12 | - T018 - No Time Part 13 | - T019 - Has Time Part 14 | - T020 - Multi Field Compare 15 | - Bonus Tip - Joining Tables with 2 Pairs of Start-End Date Overlaps 16 |
17 | 18 | 19 | 20 | ### T016 - Not Null 21 | Verify date field is not null. For example, to verify that table countries has no NULLs in field date_last_updated: 22 | ```sql 23 | SELECT date_last_updated 24 | , CASE WHEN date_last_updated IS NULL THEN 'FAIL' ELSE 'P' END AS status 25 | FROM demo_hr.countries; 26 | ``` 27 |
28 | 29 | 30 | 31 | ### T017 - Date Range 32 | Verify date field is within specified range. For example, you can run the SQL below to verify that table countries field date_last_updated is between 1/1/2021 and today. Note the use of NOW() to represent today's date dynamically in MySQL. Notice the inner query uses a CASE...WHEN...ELSE structure to identify two rejection codes: (1) date is too high, and (2) date is too low. Expected and actual values are displayed in the output if you run the inner query only. The outer query is a wrapper to determine whether the test passed or failed. 33 | ```sql 34 | SELECT date_last_updated 35 | , CASE WHEN date_last_updated > NOW() THEN CONCAT('REJ-01: Field date_last_updated cannot be in the future|exp<=', CAST(NOW() AS CHAR(20)), '|act=', CAST(date_last_updated AS CHAR(20)) ) 36 | WHEN date_last_updated < '2021-01-01' THEN CONCAT('REJ-02: Field date_last_updated cannot be too old|exp>=1/1/2021|act=', CAST(date_last_updated AS CHAR(20)) ) 37 | ELSE 'P' 38 | END AS status 39 | FROM demo_hr.countries; 40 | ``` 41 |
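The outer wrapper mentioned above is not shown in the snippet; a minimal sketch of it, reusing the same rejection logic, could be:

```sql
-- Sketch only: collapse the per-row rejection codes into one pass/fail status
SELECT CASE WHEN COUNT(*) > 0 THEN 'FAIL' ELSE 'P' END AS status
FROM (
  SELECT CASE WHEN date_last_updated > NOW()        THEN 'REJ-01'
              WHEN date_last_updated < '2021-01-01' THEN 'REJ-02'
              ELSE 'P'
         END AS status
  FROM demo_hr.countries
) t
WHERE status <> 'P';
```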
42 | 43 | 44 | 45 | ### T018 - No Time Part 46 | Verify date field is a date only, with no time part present. For example, to verify that table employees has no time part in field hire_date (time part must be "00:00:00"): 47 | ```sql 48 | SELECT hire_date 49 | , CASE WHEN DATE_FORMAT(hire_date, '%H:%i:%s') <> '00:00:00' THEN 'FAIL' ELSE 'P' END AS status 50 | FROM demo_hr.employees; 51 | ``` 52 |
53 | 54 | 55 | 56 | ### T019 - Has Time Part 57 | Verify date field is a date **and** time. For example, to verify that table countries has a time part in field date_last_updated (time part cannot be "00:00:00"): 58 | ```sql 59 | SELECT date_last_updated 60 | , CASE WHEN DATE_FORMAT(date_last_updated, '%H:%i:%s') = '00:00:00' THEN 'FAIL' ELSE 'P' END AS status 61 | FROM demo_hr.countries; 62 | ``` 63 |
64 | 65 | 66 | 67 | ### T020 - Multi Field Compare 68 | Verify multiple date fields relative to each other. For example, to verify that field start_date must be < field end_date in table job_history (thus if start_date is >= end_date the test case fails): 69 | ```sql 70 | SELECT start_date, end_date 71 | , CASE WHEN start_date >= end_date THEN 'FAIL' ELSE 'P' END AS status 72 | FROM demo_hr.job_history; 73 | ``` 74 |
75 | 76 | 77 | 78 | ### Bonus Tip - Joining Tables with 2 Pairs of Start-End Date Overlaps 79 | So this is a fantastic tip I learned from a co-worker in healthcare back in 2011 (was it Jennifer C.? or Matt G.? or Jonathon P.? I can't remember). 80 | 81 | 82 | The **problem** is that you are trying to join two tables with logic where the table1.start_dt/end_dt's overlap with the table2.start_dt/end_dt. 83 | 84 | 85 | The **solution** is to **join on `table1.start_dt <= table2.end_dt AND table1.end_dt >= table2.start_dt`**. 86 | 87 | Here is why: 88 | 89 | ``` 90 | Scenario #1 = "Discard - No Overlap" Table #1 date range is completely before Table #2's (Tbl#1.End < Tbl#2.Start) 91 | Table#1: |Start ----------- End| 92 | Table#2:                           |Start ----------- End| 93 | 94 | 95 | Scenario #2 = "Include - Table #1 End = Table #2 Start" Table #1 date range ends exactly where Table #2's starts (Tbl#1.End = Tbl#2.Start AND Tbl#1.Start < Tbl#2.End) 96 | Table#1: |Start ----------- End| 97 | Table#2:                       |Start ----------- End| 98 | 99 | 100 | Scenario #3 = "Include - Table #1 Ends Midway Table #2 Span" Table #1 date range nicely overlaps Table #2's (Tbl#1.End > Tbl#2.Start AND Tbl#1.Start < Tbl#2.End) 101 | Table#1: |Start ----------- End| 102 | Table#2:            |Start ----------- End| 103 | 104 | 105 | Scenario #4 = "Include - Table #1 Starts Midway Table #2 Span" Table #1 date range nicely overlaps Table #2's (Tbl#1.End > Tbl#2.Start AND Tbl#1.Start < Tbl#2.End) 106 | Table#1:            |Start ----------- End| 107 | Table#2: |Start ----------- End| 108 | 109 | 110 | Scenario #5 = "Include - Table #1 Starts at Table #2 End" Table #1 date range starts exactly at Table #2's end (Tbl#1.End > Tbl#2.Start AND Tbl#1.Start = Tbl#2.End) 111 | Table#1:                       |Start ----------- End| 112 | Table#2: |Start ----------- End| 113 | 114 | 115 | Scenario #6 = "Discard - No Overlap" Table #1 date range is entirely after Table #2's (Tbl#1.Start > Tbl#2.End) 116 | Table#1:                           |Start ----------- End| 117 | Table#2: |Start ----------- End| 118 | 119
| ``` 120 | 121 | 122 |
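Putting the solution above into SQL, a join on the overlap condition might be sketched as follows. Note that the table and column names (enrollments, coverages, member_id) are hypothetical; the demo_hr schema does not ship these tables.

```sql
-- Hypothetical tables, for illustration of the overlap join only
SELECT e.member_id
     , e.start_dt AS enroll_start, e.end_dt AS enroll_end
     , c.start_dt AS cover_start,  c.end_dt AS cover_end
FROM enrollments e
INNER JOIN coverages c
        ON c.member_id = e.member_id
       AND e.start_dt <= c.end_dt    -- Table #1 starts on/before Table #2 ends
       AND e.end_dt   >= c.start_dt; -- Table #1 ends on/after Table #2 starts
```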
123 | -------------------------------------------------------------------------------- /mysql/data_validation/rule_set_09__defect_regression.md: -------------------------------------------------------------------------------- 1 | ### Data Validation Examples - MySQL 2 | [![License: CC0](https://img.shields.io/badge/License-CC0-red)](LICENSE "Creative Commons Zero License by DataResearchLabs (effectively = Public Domain)") 3 | [![YouTube](https://img.shields.io/badge/YouTube-DataResearchLabs-brightgreen)](http://www.DataResearchLabs.com) 4 | #### [Return to Data Validation Home Page](https://github.com/DataResearchLabs/sql_scripts/blob/main/data_validation_scripts.md) 5 |
6 | 7 | # Rule Set #9 - Defect Regression 8 | 9 | ## Summary 10 | Where possible, it is a good idea to cover known defects with a test case. This enables automated regression testing for free. You can simply run this script to quickly re-check whether a defect has been re-introduced. 11 | 12 | There are a lot of caveats that will block you from covering every defect: 13 | * If anything but the simplest of setup and tear-down data is required 14 | * If processes or jobs or applications or ETL must be triggered 15 | * If there are interactions with other databases or files not on the current server 16 | 17 | That said, you can still frequently set up static trip wires to monitor the data for recurrence of old bugs. 18 | 19 | Some real-world examples from my past include: 20 | * A defect where email addresses stored in the database had TAB characters. I set up a simple CHAR(9) check that ran daily and cross-referenced the defect number in the SQL return results. 21 | * A defect where customer-specific HTML snippets for email reports had bad tags manually entered by company analysts working with the client. Obviously fixing the app to avoid those data entries would be best, but that was a different department that was going to get to it later. So, I set up a daily alert to check for those HTML tags and some other similar ones that might be accidentally introduced. 22 | * A defect where internal data entry folks on rare occasions had to key in customer names. They were supposed to be all upper case for other reasons in the system, but the analysts sometimes entered lower case. Developers were not going to get around to fixing it for a couple of months (other priorities, this was a very low frequency issue). So I set up a daily check to trip an alarm and notify via email the appropriate person to correct their data. 23 | * Basically, you can take any of the test case examples in Rule Sets #1 thru #8 and apply them as regression tests for defects when appropriate.
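For instance, the TAB-character defect above could be covered with a trip wire along these lines (a sketch only; the table/column names and the defect number in the status text are placeholders):

```sql
-- Regression trip wire: email addresses must never contain a TAB character (CHAR(9))
SELECT CASE WHEN COUNT(*) > 0 THEN 'FAIL (defect DFCT-NNN recurred)' ELSE 'P' END AS status
FROM demo_hr.employees
WHERE email_address LIKE CONCAT('%', CHAR(9), '%');
```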
24 | -------------------------------------------------------------------------------- /mysql/schemadiff/simple_schema_dump.sql: -------------------------------------------------------------------------------- 1 | ------------------------------------------------------------------------------------ 2 | -- Simple Schema Dump: 3 | -- This SQL script will dump table, column, key, and index design related metadata 4 | -- so that you can copy-paste or export to a text file. 5 | -- Even better, you can make other snapshots over time (same database schema at earlier 6 | -- points in time), OR in different environments (DEV, PPMO, STAGE, PROD). Then, 7 | -- using your favorite free Text File Diff Tool (DiffMerge, ExamDiff, etc.) you 8 | -- can compare snapshots to quickly isolate and identify what changed over time 9 | -- or is different between environments. 10 | ------------------------------------------------------------------------------------ 11 | -- Platform: MySQL Server 12 | -- Author: DataResearchLabs 13 | -- GitHub: https://github.com/DataResearchLabs/sql_scripts 14 | -- YouTube Tutorials: https://www.youtube.com/playlist?list=PLVHoUDdbskUT0kz-4aB68EsollbmelcjT 15 | ---------------------------------------------------------------------------------- 16 | USE sakila -- <<<<<<<<<<<<< Change schema here 17 | ; 18 | 19 | WITH vars 20 | AS ( 21 | SELECT DATABASE() AS v_SchemaName 22 | ) 23 | 24 | , baseTbl 25 | AS ( 26 | SELECT table_schema AS SchemaName, table_type, table_name 27 | FROM INFORMATION_SCHEMA.TABLES 28 | WHERE table_schema = (SELECT v_SchemaName FROM vars) 29 | ) 30 | 31 | , metaForTbl 32 | AS ( 33 | SELECT t.SchemaName 34 | , t.table_name AS TableName 35 | , CONCAT( '(' 36 | , CASE WHEN t.table_type <=> 'BASE TABLE' THEN 'Table' 37 | WHEN t.table_type <=> 'VIEW' THEN 'View' 38 | ELSE 'UK' 39 | END 40 | , ')' 41 | ) AS ObjectType 42 | , t.table_name AS ObjectName 43 | , '(Exists)' AS PropertyName 44 | , ' ' AS PropertyValue 45 | FROM baseTbl t 46 | ) 47 | 48 | ,
metaForCol_dataType 49 | AS ( 50 | SELECT ft.SchemaName, ft.table_name AS TableName, 'Column' AS ObjectType, tut.column_name AS ObjectName 51 | , '2' AS PropertyName 52 | , CONCAT(COALESCE(tut.data_type, 'unknown') 53 | , '(' 54 | , CASE WHEN tut.CHARACTER_MAXIMUM_LENGTH IS NOT NULL THEN CAST(tut.CHARACTER_MAXIMUM_LENGTH AS CHAR(10)) ELSE '' END 55 | , CASE WHEN tut.DATA_TYPE IN('date','datetime','timestamp') THEN CAST(tut.DATETIME_PRECISION AS CHAR(10)) 56 | WHEN tut.NUMERIC_PRECISION IS NULL THEN '' 57 | ELSE CONCAT(CAST(tut.NUMERIC_PRECISION AS CHAR(10))) 58 | END 59 | , CASE WHEN tut.NUMERIC_SCALE IS NOT NULL THEN CONCAT(',', CAST(tut.NUMERIC_SCALE AS CHAR(10))) ELSE '' END 60 | , ')' 61 | ) AS PropertyValue 62 | FROM INFORMATION_SCHEMA.COLUMNS tut 63 | INNER JOIN baseTbl ft ON ft.SchemaName = tut.TABLE_SCHEMA AND ft.TABLE_NAME = tut.TABLE_NAME 64 | ) 65 | 66 | , metaForCol_nullable 67 | AS ( 68 | SELECT ft.SchemaName, ft.table_name AS TableName, 'Column' AS ObjectType, tut.column_name AS ObjectName 69 | , '3' AS PropertyName, CASE WHEN tut.IS_NULLABLE = 'YES' THEN 'NULL' ELSE 'NOT NULL' END AS PropertyValue 70 | FROM INFORMATION_SCHEMA.COLUMNS tut 71 | INNER JOIN baseTbl ft ON ft.SchemaName = tut.TABLE_SCHEMA AND ft.table_name = tut.table_name 72 | ) 73 | 74 | , metaForCol_ordpos 75 | AS ( 76 | SELECT ft.SchemaName, ft.table_name AS TableName, 'Column' AS ObjectType, tut.column_name AS ObjectName 77 | , '1' AS PropertyName 78 | , LPAD(CAST(tut.ORDINAL_POSITION AS CHAR(3)), 3, 0) AS PropertyValue 79 | FROM INFORMATION_SCHEMA.COLUMNS tut 80 | INNER JOIN baseTbl ft ON ft.SchemaName = tut.TABLE_SCHEMA AND ft.table_name = tut.table_name 81 | ) 82 | 83 | , metaAllCols 84 | AS ( 85 | SELECT schemaname, tablename, objecttype, objectname, 'Properties' AS propertyname 86 | , GROUP_CONCAT(propertyvalue ORDER BY propertyname, propertyvalue SEPARATOR ' | ') AS propertyvalue 87 | FROM ( 88 | SELECT * FROM metaForCol_dataType 89 | UNION SELECT * FROM metaForCol_nullable 90 | 
UNION SELECT * FROM metaForCol_ordpos 91 | ) t 92 | GROUP BY schemaname, tablename, objecttype, objectname 93 | ) 94 | 95 | , metaForKeys 96 | AS ( 97 | SELECT cons.TABLE_SCHEMA AS SchemaName, cons.TABLE_NAME AS TableName 98 | , CASE WHEN cons.constraint_type = 'PRIMARY KEY' THEN 'PKey' 99 | WHEN cons.constraint_type = 'UNIQUE' THEN 'UKey' 100 | WHEN cons.constraint_type = 'FOREIGN KEY' THEN 'FKey' 101 | ELSE 'X' 102 | END AS ObjectType 103 | , cons.constraint_name AS ObjectName 104 | , 'FieldList' AS PropertyName 105 | , GROUP_CONCAT(kcu.COLUMN_NAME ORDER BY kcu.ORDINAL_POSITION SEPARATOR ',') AS PropertyValue 106 | FROM INFORMATION_SCHEMA.TABLE_CONSTRAINTS cons 107 | INNER JOIN INFORMATION_SCHEMA.KEY_COLUMN_USAGE kcu 108 | ON cons.TABLE_SCHEMA = kcu.TABLE_SCHEMA 109 | AND cons.TABLE_NAME = kcu.TABLE_NAME 110 | AND cons.CONSTRAINT_NAME = kcu.CONSTRAINT_NAME 111 | WHERE cons.table_schema = (SELECT v_SchemaName FROM vars) 112 | AND cons.table_name IN(SELECT DISTINCT table_name FROM baseTbl) 113 | AND cons.constraint_type IN('PRIMARY KEY','FOREIGN KEY','UNIQUE') 114 | GROUP BY cons.TABLE_SCHEMA, cons.TABLE_NAME, cons.CONSTRAINT_TYPE, cons.CONSTRAINT_NAME 115 | ) 116 | 117 | , metaForIdxs 118 | AS ( 119 | SELECT (SELECT v_SchemaName FROM vars) AS SchemaName 120 | , table_name AS TableName 121 | , 'Index' AS ObjectType 122 | , index_name AS ObjectName 123 | , 'FieldList' AS PropertyName 124 | , GROUP_CONCAT(column_name ORDER BY seq_in_index SEPARATOR ',') AS PropertyValue 125 | FROM information_schema.statistics 126 | WHERE table_schema = (SELECT v_SchemaName FROM vars) 127 | GROUP BY table_name, index_name 128 | ) 129 | 130 | , allMetadata 131 | AS ( 132 | SELECT * FROM metaForTbl 133 | UNION SELECT * FROM metaAllCols 134 | UNION SELECT * FROM metaForKeys 135 | UNION SELECT * FROM metaForIdxs 136 | ) 137 | 138 | SELECT CASE WHEN objecttype IN('(Table)','(View)') THEN schemaname ELSE ' ' END AS schema_nm 139 | , CASE WHEN objecttype IN('(Table)','(View)') THEN 
tablename ELSE ' ' END AS tbl_nm 140 | , objecttype AS obj_typ, objectname AS obj_nm, /*propertyname,*/ propertyvalue AS properties 141 | FROM allMetadata 142 | ORDER BY schemaname, tablename, objecttype 143 | , CASE WHEN objecttype='Column' THEN propertyvalue ELSE ' ' END 144 | , objectname, propertyname 145 | 146 | 147 | -------------------------------------------------------------------------------- /oracle/data_dictionary/data_dict_dump.sql: -------------------------------------------------------------------------------- 1 | 2 | ------------------------------------------------------------------------------------ 3 | -- Data Dictionary Dump: 4 | -- This SQL script will dump table, column, key, and description design related 5 | -- metadata so that you can copy-paste or export to Excel as a Data Dictionary. 6 | ------------------------------------------------------------------------------------ 7 | -- Platform: Oracle 8 | -- Author: DataResearchLabs 9 | -- GitHub: https://github.com/DataResearchLabs/sql_scripts 10 | -- YouTube Tutorials: https://www.youtube.com/channel/UCQciXv3xaBykeUFc04GxSXA 11 | ---------------------------------------------------------------------------------- 12 | 13 | -- All variables are consolidated here in the first CTE (Common Table Expression) 14 | -- Each given row is a variable, with the value you change preceding the "AS" command 15 | WITH vars 16 | AS ( 17 | SELECT 18 | 'HR' AS v_SchemaName -- <<<<<<<<<<<< CHANGE THIS VALUE to Schema you want to dump 19 | , 'NO' AS v_TablesOnly -- YES=Limit To Tables only; NO=Include views too 20 | FROM dual 21 | ) 22 | 23 | , meta_data 24 | AS ( 25 | SELECT 26 | atc.owner AS SCHEMA_NM 27 | , atc.table_name AS TABLE_NM 28 | , CASE WHEN t.TABLE_TYPE = 'TABLE' THEN 'TBL' 29 | WHEN t.TABLE_TYPE = 'VIEW' THEN 'VW' 30 | ELSE 'UK' 31 | END AS OBJ_TYP 32 | , atc.column_id AS ORD_POS 33 | , atc.column_name AS COLUMN_NM 34 | , (COALESCE(atc.data_type, 'UNKNOWN') || 35 | decode(atc.data_type, 36 | 'NUMBER', 
37 | decode(atc.data_precision, null, '', 38 | '(' || to_char(atc.data_precision) || decode(atc.data_scale,null,'',0,'',',' || to_char(atc.data_scale) ) 39 | || ')' ), 40 | 'FLOAT', '(' || to_char(atc.data_precision) || ')', 41 | 'VARCHAR2', '(' || to_char(atc.data_length) || ')', 42 | 'NVARCHAR2', '(' || to_char(atc.data_length) || ')', 43 | 'VARCHAR', '(' || to_char(atc.data_length) || ')', 44 | 'CHAR', '(' || to_char(atc.data_length) || ')', 45 | 'RAW', '(' || to_char(atc.data_length) || ')', 46 | 'MLSLABEL',decode(atc.data_length,null,'',0,'','(' || to_char(atc.data_length) || ')'), 47 | '') 48 | ) AS DATA_TYP 49 | , CASE WHEN atc.nullable = 'Y' THEN 'NULL' ELSE 'NOT NULL' END AS NULLABLE 50 | , dcc.comments AS DESCRIPTION 51 | FROM all_tab_columns atc 52 | INNER JOIN all_col_comments dcc ON atc.owner = dcc.owner AND atc.table_name = dcc.table_name AND atc.column_name = dcc.column_name 53 | INNER JOIN all_tab_comments t ON t.OWNER = atc.owner AND t.TABLE_NAME = atc.table_name 54 | WHERE atc.owner = (SELECT vars.v_SchemaName FROM vars) 55 | AND ( (t.TABLE_TYPE = 'TABLE') 56 | OR ((SELECT v_TablesOnly FROM vars) = 'NO') 57 | ) 58 | ) 59 | 60 | 61 | , meta_for_keys 62 | AS ( 63 | SELECT SCHEMA_NM, TABLE_NM, COLUMN_NM 64 | , LISTAGG(IS_KEY, ', ') 65 | WITHIN GROUP(ORDER BY IS_KEY DESC) AS IS_KEY 66 | FROM ( 67 | SELECT cons.owner AS SCHEMA_NM 68 | , cols.table_name AS TABLE_NM 69 | , cols.column_name AS COLUMN_NM 70 | , CASE WHEN cons.constraint_type = 'P' THEN 'PK' 71 | WHEN cons.constraint_type = 'U' THEN 'UK' 72 | WHEN cons.constraint_type = 'R' THEN 'FK' 73 | ELSE 'X' 74 | END AS IS_KEY 75 | FROM all_constraints cons 76 | INNER JOIN all_cons_columns cols ON cons.constraint_name = cols.constraint_name AND cons.owner = cols.owner 77 | WHERE cons.owner = (SELECT vars.v_SchemaName FROM vars) 78 | AND cons.table_name IN(SELECT DISTINCT TABLE_NM FROM meta_data) 79 | AND cons.constraint_type IN('P','R','U') 80 | GROUP BY cons.owner, cols.table_name, cols.column_name, 
cons.constraint_type 81 | ) t 82 | GROUP BY SCHEMA_NM, TABLE_NM, COLUMN_NM 83 | ) 84 | 85 | 86 | SELECT md.SCHEMA_NM, md.TABLE_NM, md.OBJ_TYP 87 | , md.ORD_POS AS ORD 88 | , COALESCE(pk.IS_KEY, ' ') AS KEYS 89 | , md.COLUMN_NM, md.DATA_TYP, md.NULLABLE, md.DESCRIPTION 90 | FROM meta_data md 91 | LEFT JOIN meta_for_keys pk ON pk.SCHEMA_NM = md.SCHEMA_NM AND pk.TABLE_NM = md.TABLE_NM AND pk.COLUMN_NM = md.COLUMN_NM 92 | ORDER BY md.SCHEMA_NM, md.TABLE_NM, md.ORD_POS 93 | 94 | -------------------------------------------------------------------------------- /oracle/data_validation/demo_data/demo_hr_01_create_tables.sql: -------------------------------------------------------------------------------- 1 | CREATE USER DEMO_HR IDENTIFIED BY DEMO_HR; 2 | 3 | GRANT UNLIMITED TABLESPACE TO DEMO_HR; 4 | 5 | 6 | 7 | -- DEMO_HR.JOBS definition 8 | CREATE TABLE "DEMO_HR"."JOBS" 9 | ("JOB_ID" VARCHAR2(10), 10 | "JOB_TITLE" VARCHAR2(35) CONSTRAINT "JOB_TITLE_NN" NOT NULL ENABLE, 11 | "MIN_SALARY" NUMBER(6,0), 12 | "MAX_SALARY" NUMBER(6,0), 13 | CONSTRAINT "JOB_ID_PK" PRIMARY KEY ("JOB_ID") ENABLE 14 | ); 15 | 16 | 17 | 18 | -- DEMO_HR.JOBS_SNAPSHOT definition 19 | CREATE TABLE "DEMO_HR"."JOBS_SNAPSHOT" 20 | ("JOB_ID" VARCHAR2(10), 21 | "JOB_TITLE" VARCHAR2(35) NOT NULL ENABLE, 22 | "MIN_SALARY" NUMBER(6,0), 23 | "MAX_SALARY" NUMBER(6,0) 24 | ); 25 | 26 | 27 | 28 | -- DEMO_HR.REGIONS definition 29 | CREATE TABLE "DEMO_HR"."REGIONS" 30 | ("REGION_ID" NUMBER CONSTRAINT "REGION_ID_NN" NOT NULL ENABLE, 31 | "REGION_NAME" VARCHAR2(25), 32 | CONSTRAINT "REG_ID_PK" PRIMARY KEY ("REGION_ID") ENABLE 33 | ); 34 | 35 | 36 | 37 | -- DEMO_HR.COUNTRIES definition 38 | CREATE TABLE "DEMO_HR"."COUNTRIES" 39 | ("COUNTRY_ID" CHAR(2) CONSTRAINT "COUNTRY_ID_NN" NOT NULL ENABLE, 40 | "COUNTRY_NAME" VARCHAR2(40), 41 | "REGION_ID" NUMBER, 42 | "DATE_LAST_UPDATED" DATE DEFAULT SYSDATE, 43 | CONSTRAINT "COUNTRY_C_ID_PK" PRIMARY KEY ("COUNTRY_ID") ENABLE, 44 | CONSTRAINT "COUNTR_REG_FK" FOREIGN KEY 
("REGION_ID") REFERENCES "DEMO_HR"."REGIONS" ("REGION_ID") ENABLE 45 | ); 46 | 47 | 48 | 49 | 50 | -- DEMO_HR.LOCATIONS definition 51 | CREATE TABLE "DEMO_HR"."LOCATIONS" 52 | ("LOCATION_ID" NUMBER(4,0), 53 | "STREET_ADDRESS" VARCHAR2(40), 54 | "POSTAL_CODE" VARCHAR2(12), 55 | "CITY" VARCHAR2(30) CONSTRAINT "LOC_CITY_NN" NOT NULL ENABLE, 56 | "STATE_PROVINCE" VARCHAR2(25), 57 | "COUNTRY_ID" CHAR(2), 58 | CONSTRAINT "LOC_ID_PK" PRIMARY KEY ("LOCATION_ID") ENABLE, 59 | CONSTRAINT "LOC_C_ID_FK" FOREIGN KEY ("COUNTRY_ID") REFERENCES "DEMO_HR"."COUNTRIES" ("COUNTRY_ID") ENABLE 60 | ); 61 | 62 | CREATE INDEX "DEMO_HR"."LOC_CITY_IX" ON "DEMO_HR"."LOCATIONS" ("CITY"); 63 | CREATE INDEX "DEMO_HR"."LOC_STATE_PROVINCE_IX" ON "DEMO_HR"."LOCATIONS" ("STATE_PROVINCE"); 64 | CREATE INDEX "DEMO_HR"."LOC_COUNTRY_IX" ON "DEMO_HR"."LOCATIONS" ("COUNTRY_ID"); 65 | 66 | 67 | 68 | -- DEMO_HR.DEPARTMENTS definition 69 | CREATE TABLE "DEMO_HR"."DEPARTMENTS" 70 | ("DEPARTMENT_ID" NUMBER(4,0), 71 | "DEPARTMENT_NAME" VARCHAR2(30) CONSTRAINT "DEPT_NAME_NN" NOT NULL ENABLE, 72 | "MANAGER_ID" NUMBER(6,0), 73 | "LOCATION_ID" NUMBER(4,0), 74 | "URL" VARCHAR2(255), 75 | CONSTRAINT "DEPT_ID_PK" PRIMARY KEY ("DEPARTMENT_ID") ENABLE, 76 | CONSTRAINT "DEPT_LOC_FK" FOREIGN KEY ("LOCATION_ID") REFERENCES "DEMO_HR"."LOCATIONS" ("LOCATION_ID") ENABLE 77 | ) ; 78 | 79 | CREATE INDEX "DEMO_HR"."DEPT_LOCATION_IX" ON "DEMO_HR"."DEPARTMENTS" ("LOCATION_ID"); 80 | 81 | 82 | 83 | -- DEMO_HR.JOB_HISTORY definition 84 | CREATE TABLE "DEMO_HR"."JOB_HISTORY" 85 | ("EMPLOYEE_ID" NUMBER(6,0) CONSTRAINT "JHIST_EMPLOYEE_NN" NOT NULL ENABLE, 86 | "START_DATE" DATE CONSTRAINT "JHIST_START_DATE_NN" NOT NULL ENABLE, 87 | "END_DATE" DATE CONSTRAINT "JHIST_END_DATE_NN" NOT NULL ENABLE, 88 | "JOB_ID" VARCHAR2(10) CONSTRAINT "JHIST_JOB_NN" NOT NULL ENABLE, 89 | "DEPARTMENT_ID" NUMBER(4,0), 90 | CONSTRAINT "JHIST_DATE_INTERVAL" CHECK (end_date > start_date) ENABLE, 91 | CONSTRAINT "JHIST_EMP_ID_ST_DATE_PK" PRIMARY KEY 
("EMPLOYEE_ID", "START_DATE") ENABLE, 92 | CONSTRAINT "JHIST_JOB_FK" FOREIGN KEY ("JOB_ID") REFERENCES "DEMO_HR"."JOBS" ("JOB_ID") ENABLE, 93 | CONSTRAINT "JHIST_DEPT_FK" FOREIGN KEY ("DEPARTMENT_ID") REFERENCES "DEMO_HR"."DEPARTMENTS" ("DEPARTMENT_ID") ENABLE 94 | ) ; 95 | 96 | CREATE INDEX "DEMO_HR"."JHIST_JOB_IX" ON "DEMO_HR"."JOB_HISTORY" ("JOB_ID"); 97 | CREATE INDEX "DEMO_HR"."JHIST_EMPLOYEE_IX" ON "DEMO_HR"."JOB_HISTORY" ("EMPLOYEE_ID"); 98 | CREATE INDEX "DEMO_HR"."JHIST_DEPARTMENT_IX" ON "DEMO_HR"."JOB_HISTORY" ("DEPARTMENT_ID"); 99 | 100 | 101 | 102 | 103 | -- DEMO_HR.EMPLOYEES definition 104 | CREATE TABLE "DEMO_HR"."EMPLOYEES" 105 | ("EMPLOYEE_ID" NUMBER(6,0), 106 | "FIRST_NAME" VARCHAR2(20), 107 | "LAST_NAME" VARCHAR2(25) CONSTRAINT "EMP_LAST_NAME_NN" NOT NULL ENABLE, 108 | "EMAIL" VARCHAR2(25) CONSTRAINT "EMP_EMAIL_NN" NOT NULL ENABLE, 109 | "PHONE_NUMBER" VARCHAR2(20), 110 | "HIRE_DATE" DATE CONSTRAINT "EMP_HIRE_DATE_NN" NOT NULL ENABLE, 111 | "JOB_ID" VARCHAR2(10) CONSTRAINT "EMP_JOB_NN" NOT NULL ENABLE, 112 | "SALARY" NUMBER(8,2), 113 | "COMMISSION_PCT" NUMBER(2,2), 114 | "MANAGER_ID" NUMBER(6,0), 115 | "DEPARTMENT_ID" NUMBER(4,0), 116 | "SOME_DATE_FMT1" VARCHAR2(50), 117 | "SOME_DATE_FMT2" VARCHAR2(50), 118 | "SOME_DATE_FMT3" VARCHAR2(50), 119 | "SOME_DATE_FMT4" VARCHAR2(50), 120 | "FAKE_SSN" VARCHAR2(11), 121 | "ZIP5" VARCHAR2(5), 122 | "ZIP5OR9" VARCHAR2(10), 123 | "ZIP9" VARCHAR2(10), 124 | "EMAIL_ADDRESS" VARCHAR2(100), 125 | CONSTRAINT "EMP_SALARY_MIN" CHECK (salary > 0) ENABLE, 126 | CONSTRAINT "EMP_EMAIL_UK" UNIQUE ("EMAIL") ENABLE, 127 | CONSTRAINT "EMP_EMP_ID_PK" PRIMARY KEY ("EMPLOYEE_ID") ENABLE, 128 | CONSTRAINT "EMP_DEPT_FK" FOREIGN KEY ("DEPARTMENT_ID") REFERENCES "DEMO_HR"."DEPARTMENTS" ("DEPARTMENT_ID") ENABLE, 129 | CONSTRAINT "EMP_JOB_FK" FOREIGN KEY ("JOB_ID") REFERENCES "DEMO_HR"."JOBS" ("JOB_ID") ENABLE, 130 | CONSTRAINT "EMP_MANAGER_FK" FOREIGN KEY ("MANAGER_ID") REFERENCES "DEMO_HR"."EMPLOYEES" ("EMPLOYEE_ID") ENABLE 131 | ) 
; 132 | 133 | CREATE INDEX "DEMO_HR"."EMP_DEPARTMENT_IX" ON "DEMO_HR"."EMPLOYEES" ("DEPARTMENT_ID"); 134 | CREATE INDEX "DEMO_HR"."EMP_JOB_IX" ON "DEMO_HR"."EMPLOYEES" ("JOB_ID"); 135 | CREATE INDEX "DEMO_HR"."EMP_MANAGER_IX" ON "DEMO_HR"."EMPLOYEES" ("MANAGER_ID"); 136 | CREATE INDEX "DEMO_HR"."EMP_NAME_IX" ON "DEMO_HR"."EMPLOYEES" ("LAST_NAME", "FIRST_NAME"); 137 | 138 | 139 | 140 | 141 | 142 | 143 | -------------------------------------------------------------------------------- /oracle/data_validation/img/01_data_val_oracle_example_test_case_sql_code.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/DataResearchLabs/sql_scripts/3eaa17caf248d3b58763534d62ef4d9ac5729a46/oracle/data_validation/img/01_data_val_oracle_example_test_case_sql_code.png -------------------------------------------------------------------------------- /oracle/data_validation/img/02_data_val_oracle_run_results1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/DataResearchLabs/sql_scripts/3eaa17caf248d3b58763534d62ef4d9ac5729a46/oracle/data_validation/img/02_data_val_oracle_run_results1.png -------------------------------------------------------------------------------- /oracle/data_validation/img/03_data_val_oracle_adv_test_case_ex.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/DataResearchLabs/sql_scripts/3eaa17caf248d3b58763534d62ef4d9ac5729a46/oracle/data_validation/img/03_data_val_oracle_adv_test_case_ex.png -------------------------------------------------------------------------------- /oracle/data_validation/img/04_data_val_oracle_run_results_adv.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/DataResearchLabs/sql_scripts/3eaa17caf248d3b58763534d62ef4d9ac5729a46/oracle/data_validation/img/04_data_val_oracle_run_results_adv.png -------------------------------------------------------------------------------- /oracle/data_validation/rule_set_01__row_counts.md: -------------------------------------------------------------------------------- 1 | ### Data Validation Examples - Oracle 2 | [![License: CC0](https://img.shields.io/badge/License-CC0-red)](LICENSE "Creative Commons Zero License by DataResearchLabs (effectively = Public Domain)") 3 | [![YouTube](https://img.shields.io/badge/YouTube-DataResearchLabs-brightgreen)](http://www.DataResearchLabs.com) 4 | #### [Return to Data Validation Home Page](https://github.com/DataResearchLabs/sql_scripts/blob/main/data_validation_scripts.md) 5 |
6 | 7 | # Rule Set #1 - Row Counts 8 | 9 | ## Table of Contents 10 | - T001 - "Full" Row Count 11 | - T002 - "Partial" Row Count 12 | - T003 - "Relative" Row Count 13 | - T004 - "Recent" Row Count 14 |
15 | 16 | 17 | 18 | ### T001 - "Full" Row Count 19 | Verify full row count for a table or view. For example, to verify that table countries has exactly 25 rows: 20 | ```sql 21 | SELECT CASE WHEN COUNT(*) <> 25 THEN 'FAIL' ELSE 'P' END AS status 22 | FROM demo_hr.countries; 23 | ``` 24 |
25 | 26 | 27 | 28 | ### T002 - "Partial" Row Count 29 | Verify partial row count for a subset of a table or view. For example, to verify that table countries has exactly 8 rows where region_id = 1 (Europe): 30 | ```sql 31 | SELECT CASE WHEN COUNT(*) <> 8 THEN 'FAIL' ELSE 'P' END AS status 32 | FROM demo_hr.countries 33 | WHERE region_id = 1; 34 | ``` 35 |
36 | 37 | 38 | 39 | ### T003 - "Relative" Row Count 40 | Verify relative row counts between tables or views. The example below verifies that table countries has at least 5 times the number of rows as table regions. The inner query returns one row with two columns: (a) count of countries, and (b) count of regions. The outer query can then apply the business logic of comparing the counts and returning a Pass or Fail. 41 | ```sql 42 | SELECT CASE WHEN countries_count < 5 * regions_count THEN 'FAIL' ELSE 'P' END AS status 43 | FROM ( 44 | SELECT (SELECT COUNT(*) AS row_count FROM demo_hr.countries) AS countries_count 45 | , (SELECT COUNT(*) AS row_count FROM demo_hr.regions) AS regions_count 46 | FROM dual 47 | ); 48 | ``` 49 |
50 | 51 | 52 | 53 | ### T004 - "Recent" Row Count 54 | Verify recent row counts. The example below verifies that table countries has had at least 5 rows updated in the past 10 days based on the date stamp in field date_last_updated. The inner query uses SYSDATE (Oracle's current date-time stamp), subtracts 10 days, and counts how many rows in table countries have a date_last_updated within the past 10 days. The outer query then confirms that at least 5 rows are recent for a Pass; if not, it returns a Fail. 55 | ```sql 56 | SELECT CASE WHEN row_count < 5 THEN 'FAIL' ELSE 'P' END AS status 57 | FROM ( 58 | SELECT COUNT(*) AS row_count 59 | FROM demo_hr.countries 60 | WHERE date_last_updated >= SYSDATE - 10 61 | ); 62 | ``` 63 | 64 | 65 | 66 | -------------------------------------------------------------------------------- /oracle/data_validation/rule_set_02__keys.md: -------------------------------------------------------------------------------- 1 | ### Data Validation Examples - Oracle 2 | [![License: CC0](https://img.shields.io/badge/License-CC0-red)](LICENSE "Creative Commons Zero License by DataResearchLabs (effectively = Public Domain)") 3 | [![YouTube](https://img.shields.io/badge/YouTube-DataResearchLabs-brightgreen)](http://www.DataResearchLabs.com) 4 | #### [Return to Data Validation Home Page](https://github.com/DataResearchLabs/sql_scripts/blob/main/data_validation_scripts.md) 5 |
6 | 7 | # Rule Set #2 - Keys (Foreign & Unique) 8 | 9 | ## Table of Contents 10 | - T005 - Unique (Native) Key Has No Duplicates 11 | - T006 - Foreign Key Children Have Orphans 12 | - T007 - Foreign Key Parent is Childless 13 |
14 | 15 | 16 | 17 | ### T005 - Unique Key Has No Duplicates 18 | Sure, good database design implies that unique keys be enforced by a constraint so that you do not need to test for it. However, there are times where a decision is made to **not** add a constraint to enforce the unique key (e.g. the table is replicated from a source that already has the constraint, so it was skipped for performance). This does happen! At work just last week, two of my unique key regression tests started failing -- without these checks in place as a wide net, the downstream defects would have been around a lot longer. 19 | In the example below, the inner query does a group by on the unique key fields, then using a HAVING clause filters down to those key-values with a count of more than 1 -- the dups. The outer query returns a fail if any rows come back with dups (match_count >= 2), or a pass if no dups found. 20 | ```sql 21 | SELECT CASE WHEN COUNT(*) > 0 THEN 'FAIL' ELSE 'P' END AS status 22 | FROM ( 23 | SELECT country_name -- UKey fields separated by comma 24 | , COUNT(*) AS match_count 25 | FROM demo_hr.countries 26 | GROUP BY country_name -- UKey fields separated by comma 27 | HAVING COUNT(*) > 1 28 | ); 29 | ``` 30 |
31 | 32 | 33 | 34 | ### T006 - Foreign Key Children Have Orphans 35 | Sure, as with T005 UKeys above, good database design implies that foreign keys be enforced by a constraint so that you do not need to test for it. However, there are times where for whatever reason the constraints do not exist. In those instances, you will want to periodically run a data validation test to ensure that this core assumption is not being violated (of course adding a foreign key constraint would be best, but if that is not an option then periodically check). 36 | In the example below, the inner query pulls from the child table countries as the anchor, then left joins out to the parent table regions on the key field region_id. If region_id does not exist in the parent table (p.region_id IS NULL), then the child region_id is an orphan. The outer query checks the count() of orphaned child rows: if it is >= 1 then the test fails, but if the count() = 0 then it passes. 37 | ```sql 38 | SELECT CASE WHEN COUNT(*) > 0 THEN 'FAIL' ELSE 'P' END AS status 39 | FROM ( 40 | SELECT DISTINCT c.region_id AS child_id, p.region_id AS parent_id 41 | FROM demo_hr.countries c 42 | LEFT JOIN demo_hr.regions p ON p.region_id = c.region_id 43 | WHERE p.region_id IS NULL 44 | ); 45 | ``` 46 |
47 | 48 | 49 | 50 | ### T007 - Foreign Key Parent Is Childless 51 | I'm not sure this particular test is all that useful because often it is okay for the parent-side of a foreign key relationship to not have children. But, if for some reason you need to be sure there is data present on both sides (parent **and** child), then this test is for you. You will notice in the example below that the query is very similar to T006 above, but the parent and child tables have switched positions in the FROM and LEFT JOIN lines. This is because we want to first pull all parent rows, then left join to find missing (FKey field IS NULL) child rows. 52 | ```sql 53 | SELECT CASE WHEN COUNT(*) > 0 THEN 'FAIL' ELSE 'P' END AS status 54 | FROM ( 55 | SELECT DISTINCT c.country_id AS child_id, p.country_id AS parent_id 56 | FROM demo_hr.countries p 57 | LEFT JOIN demo_hr.locations c ON p.country_id = c.country_id 58 | WHERE c.country_id IS NULL 59 | ); 60 | ``` 61 | 62 | -------------------------------------------------------------------------------- /oracle/data_validation/rule_set_03__heuristic_thresholds.md: -------------------------------------------------------------------------------- 1 | ### Data Validation Examples - Oracle 2 | [![License: CC0](https://img.shields.io/badge/License-CC0-red)](LICENSE "Creative Commons Zero License by DataResearchLabs (effectively = Public Domain)") 3 | [![YouTube](https://img.shields.io/badge/YouTube-DataResearchLabs-brightgreen)](http://www.DataResearchLabs.com) 4 | #### [Return to Data Validation Home Page](https://github.com/DataResearchLabs/sql_scripts/blob/main/data_validation_scripts.md) 5 |
6 | 7 | # Rule Set #3 - Heuristic Thresholds (Column Null Rates & Value Rates) 8 | 9 | ## Table of Contents 10 | - T008 - Null Rate Thresholds 11 | - T009 - Value Frequency Thresholds 12 |
13 | 14 | 15 | Note: Although the terms "rate" and "frequency" are mostly synonymous, nulls are usually referred to as "rates" and values as "frequencies". I don't know why, but I'm following the convention I've heard over and over. 16 |
17 | 18 | 19 | 20 | ### T008 - Null Rate Thresholds 21 | There is a lot going on in this "Null Rate Threshold" data validation query below. The basic goal is to validate that a given table's columns do not have too many or too few NULLs. 22 |
23 | In the example below...
24 | ...we are checking columns department_name, manager_id, and url in the table departments. We want to ensure that column department_name has no NULLs, that column manager_id is NULL less than 65% of the time, and that column url is NULL less than 80% of the time. 25 | 26 | This check is like the proverbial miner's canary in that it is a trip wire triggered when something goes awry in a data feed. I've used this test scenario to great effect when coupled with a create-date or last-updated-date to monitor the past week's data loads for any unexpected upticks in null rates. 27 | 28 | There is a downside to this test scenario too, however; and that is when it fires false alarms and you find yourself tinkering with the threshold values (the 0.0000, 0.6500, and 0.8000 cutoffs below), raising and lowering them over and over. If this happens, chances are the test failures are neither actionable nor important, and you should not waste your time applying this test scenario to that given table and field. Be careful to only pick fields that truly matter. 29 | 30 | Below, there is an upper CTE (common table expression) named "dtls" at the WITH clause, and a lower wrapper that applies the business logic (if any null rate rejections were found, fail the case). Inside the dtls CTE, there is an inner query at the bottom (at the FROM clause) doing a single table scan to calculate a null rate per column by counting nulls in each column and dividing by the total table row count. The SELECT CASE logic at the top applies the business logic; comparing the actual calculated null rates (nr_dept_nm, nr_mgr_id, and nr_url) against the expected threshold rates (hard-coded as 0.0000, 0.6500, and 0.8000). The returned value is a rejection code (REJ-01, REJ-02, etc.) clearly indicating which field failed the null rate check, what the actual null rate was, and what the expected null rate threshold was. If no rejections are triggered, then status returns a "P" for pass. 31 |
32 | 33 | ```sql 34 | WITH dtls AS ( 35 | SELECT CASE WHEN nr_dept_nm > 0.0000 THEN 'REJ-01: Null rate too high at department_name. Exp=0.0000 / Act=' || CAST(nr_dept_nm AS VARCHAR2(8)) 36 | WHEN nr_mgr_id > 0.6500 THEN 'REJ-02: Null rate too high at manager_id. Exp<=0.6500 / Act=' || CAST(nr_mgr_id AS VARCHAR2(8)) 37 | WHEN nr_url > 0.8000 THEN 'REJ-03: Null rate too high at url. Exp<=0.8000 / Act=' || CAST(nr_url AS VARCHAR2(8)) 38 | ELSE 'P' 39 | END AS status 40 | FROM ( 41 | SELECT CAST(SUM(CASE WHEN department_name IS NULL THEN 1 ELSE 0 END) AS FLOAT(126)) / CAST(COUNT(*) AS FLOAT(126)) AS nr_dept_nm 42 | , CAST(SUM(CASE WHEN manager_id IS NULL THEN 1 ELSE 0 END) AS FLOAT(126)) / CAST(COUNT(*) AS FLOAT(126)) AS nr_mgr_id 43 | , CAST(SUM(CASE WHEN url IS NULL THEN 1 ELSE 0 END) AS FLOAT(126)) / CAST(COUNT(*) AS FLOAT(126)) AS nr_url 44 | FROM demo_hr.departments 45 | ) 46 | ) 47 | 48 | SELECT CASE WHEN COUNT(*) > 0 THEN 'FAIL' ELSE 'P' END AS status 49 | FROM dtls 50 | WHERE status <> 'P'; 51 | ``` 52 |
53 | 54 | 55 | 56 | ### T009 - Value Frequency Thresholds 57 | "Value Frequency Threshold" tests are fairly similar to null rates above (T008). The difference is that we are checking the frequency (or rate) at which a column's values occur. 58 |
59 | In the example below...
60 | ...we are checking the frequencies with which the values 1, 2, 3, and 4 occur in field region_id of table countries. There is an upper CTE (common table expression) named "dtls" at the WITH clause, and a lower wrapper that applies the business logic (if any value frequency rejections were found, fail the case). Inside the dtls CTE, there is an inner query at the bottom (at the FROM clause) doing a single table scan to calculate a frequency for each value in the GROUP BY for the column. The GROUP BY value count (field "freq") is divided by the total table row count (field "den") to calculate field "freq_rt". The SELECT CASE logic at the top applies the business logic; comparing the actual value frequencies (freq_rt when region_id = 1, or =2, etc.) against the expected threshold frequencies (hard-coded as 0.28 to 0.36, 0.16 to 0.24, and so on). The returned value is a rejection code (REJ-01, REJ-02, etc.) clearly indicating which field failed the value frequency check, what the actual value frequency was, and what the expected value frequency threshold ranges were. If no rejections are triggered, then status returns a "P" for pass. 61 |
62 | 63 | ```sql 64 | WITH dtls AS ( 65 | SELECT CASE WHEN region_id = 1 AND freq_rt NOT BETWEEN 0.28 AND 0.36 THEN 'REJ-01: Frequency occurrence of region_id=1 is outside threshold|exp=0.28 thru 0.36|act=' || CAST(freq_rt AS VARCHAR2(8)) 66 | WHEN region_id = 2 AND freq_rt NOT BETWEEN 0.16 AND 0.24 THEN 'REJ-02: Frequency occurrence of region_id=2 is outside threshold|exp=0.16 thru 0.24|act=' || CAST(freq_rt AS VARCHAR2(8)) 67 | WHEN region_id = 3 AND freq_rt NOT BETWEEN 0.20 AND 0.28 THEN 'REJ-03: Frequency occurrence of region_id=3 is outside threshold|exp=0.20 thru 0.28|act=' || CAST(freq_rt AS VARCHAR2(8)) 68 | WHEN region_id = 4 AND freq_rt NOT BETWEEN 0.20 AND 0.28 THEN 'REJ-04: Frequency occurrence of region_id=4 is outside threshold|exp=0.20 thru 0.28|act=' || CAST(freq_rt AS VARCHAR2(8)) 69 | ELSE 'P' 70 | END AS status 71 | FROM ( 72 | SELECT region_id, CAST(freq AS FLOAT(126)) / CAST(den AS FLOAT(126)) AS freq_rt 73 | FROM ( 74 | SELECT region_id, COUNT(*) AS freq 75 | , (SELECT COUNT(*) FROM demo_hr.countries) AS den 76 | FROM demo_hr.countries 77 | GROUP BY region_id 78 | ) 79 | ) 80 | ) 81 | 82 | SELECT CASE WHEN COUNT(*) > 0 THEN 'FAIL' ELSE 'P' END AS status 83 | FROM dtls 84 | WHERE status <> 'P'; 85 | ``` 86 | -------------------------------------------------------------------------------- /oracle/data_validation/rule_set_04__numeric_values.md: -------------------------------------------------------------------------------- 1 | ### Data Validation Examples - Oracle 2 | [![License: CC0](https://img.shields.io/badge/License-CC0-red)](LICENSE "Creative Commons Zero License by DataResearchLabs (effectively = Public Domain)") 3 | [![YouTube](https://img.shields.io/badge/YouTube-DataResearchLabs-brightgreen)](http://www.DataResearchLabs.com) 4 | #### [Return to Data Validation Home Page](https://github.com/DataResearchLabs/sql_scripts/blob/main/data_validation_scripts.md) 5 |
6 | 7 | # Rule Set #4 - Numeric Values 8 | 9 | ## Table of Contents 10 | - T010 - Not Null 11 | - T011 - Not Negative 12 | - T012 - Numeric Range 13 | - T013 - In Value List 14 | - T014 - Not In Value List 15 | - T015 - Multi Field Compare 16 |
17 | 18 | 19 | 20 | ### T010 - Not Null 21 | Verify numeric field is not null. For example, to verify that table countries has no NULLs in field region_id: 22 | ```sql 23 | SELECT CASE WHEN COUNT(*) > 0 THEN 'FAIL' ELSE 'P' END AS status 24 | FROM demo_hr.countries 25 | WHERE region_id IS NULL; 26 | ``` 27 |
28 | 29 | 30 | 31 | ### T011 - Not Negative 32 | Verify numeric field is not negative. For example, to verify that table countries has no field region_id negative values: 33 | ```sql 34 | SELECT CASE WHEN COUNT(*) > 0 THEN 'FAIL' ELSE 'P' END AS status 35 | FROM demo_hr.countries 36 | WHERE region_id < 0; 37 | ``` 38 |
39 | 40 | 41 | 42 | ### T012 - Numeric Range 43 | Verify numeric field value is within a range. In the example below, we verify that field employee_id is between 100 and 999 in table employees. Note that you can run the inner query yourself to return the actual rejection code (is too low or too high) along with the actual value and the expected value...all nicely packaged for easy troubleshooting. 44 | ```sql 45 | SELECT employee_id 46 | , CASE WHEN employee_id < 100 THEN 'REJ-01: Verify employee_id > 99|exp>99|act=' || CAST(employee_id AS VARCHAR2(10)) 47 | WHEN employee_id > 999 THEN 'REJ-02: Verify employee_id < 1000|exp<1000|act=' || CAST(employee_id AS VARCHAR2(10)) 48 | ELSE 'P' 49 | END AS status 50 | FROM demo_hr.employees; 51 | ``` 52 |
53 | 54 | 55 | 56 | ### T013 - In Value List 57 | Verify numeric field is **in** the list of values. For example, to verify that table countries field region_id is always values 1, 2, 3, or 4 we use the IN() clause as shown below: 58 | ```sql 59 | SELECT region_id 60 | , CASE WHEN region_id NOT IN(1,2,3,4) THEN 'FAIL' ELSE 'P' END AS status 61 | FROM demo_hr.countries; 62 | ``` 63 |
64 | 65 | 66 | 67 | ### T014 - Not In Value List 68 | Verify numeric field is **not** in the list of values. For example, to verify that table countries field region_id is never 97, 98, or 99, we flag any row where region_id is IN(97,98,99) as shown below: 69 | ```sql 70 | SELECT region_id 71 | , CASE WHEN region_id IN(97,98,99) THEN 'FAIL' ELSE 'P' END AS status 72 | FROM demo_hr.countries; 73 | ``` 74 |
75 | 76 | 77 | 78 | ### T015 - Multi Field Compare 79 | Verify numeric field values in relation to one another. For example, to verify that salary times commission_pct is always less than $10,000 in table employees: 80 | ```sql 81 | SELECT salary, commission_pct 82 | , CASE WHEN salary * commission_pct > 10000 THEN 'FAIL' ELSE 'P' END AS status 83 | FROM demo_hr.employees; 84 | ``` 85 |
86 | -------------------------------------------------------------------------------- /oracle/data_validation/rule_set_05__date_values.md: -------------------------------------------------------------------------------- 1 | ### Data Validation Examples - Oracle 2 | [![License: CC0](https://img.shields.io/badge/License-CC0-red)](LICENSE "Creative Commons Zero License by DataResearchLabs (effectively = Public Domain)") 3 | [![YouTube](https://img.shields.io/badge/YouTube-DataResearchLabs-brightgreen)](http://www.DataResearchLabs.com) 4 | #### [Return to Data Validation Home Page](https://github.com/DataResearchLabs/sql_scripts/blob/main/data_validation_scripts.md) 5 |
6 | 7 | # Rule Set #5 - Date Values 8 | 9 | ## Table of Contents 10 | - T016 - Not Null 11 | - T017 - Date Range 12 | - T018 - No Time Part 13 | - T019 - Has Time Part 14 | - T020 - Multi Field Compare 15 | - Bonus Tip - Joining Tables with 2 Pairs of Start-End Date Overlaps 16 |
17 | 18 | 19 | 20 | ### T016 - Not Null 21 | Verify date field is not null. For example, to verify that table countries has no NULLs in field date_last_updated: 22 | ```sql 23 | SELECT date_last_updated 24 | , CASE WHEN date_last_updated IS NULL THEN 'FAIL' ELSE 'P' END AS status 25 | FROM demo_hr.countries; 26 | ``` 27 |
28 | 29 | 30 | 31 | ### T017 - Date Range 32 | Verify date field is within specified range. For example, you can run the SQL below to verify that table countries field date_last_updated is between 1/1/2021 and today. Note the use of SYSDATE to represent today's date dynamically in Oracle. Notice the inner query uses a CASE...WHEN...ELSE structure to identify two rejection codes: (1) date is too high, and (2) date is too low. Expected and actual values are displayed in the output if you run the inner query only. The outer query is a wrapper to determine whether the test passed or failed. 33 | ```sql 34 | SELECT date_last_updated 35 | , CASE WHEN date_last_updated > SYSDATE THEN 'REJ-01: Field date_last_updated cannot be in the future|exp<=' || CAST(SYSDATE AS VARCHAR2(20)) || '|act=' || CAST(date_last_updated AS VARCHAR2(20)) 36 | WHEN date_last_updated < TO_DATE('01/01/2021', 'mm/dd/yyyy') THEN 'REJ-02: Field date_last_updated cannot be too old|exp>=1/1/2021|act=' || CAST(date_last_updated AS VARCHAR2(20)) 37 | ELSE 'P' 38 | END AS status 39 | FROM demo_hr.countries; 40 | ``` 41 |
42 | 43 | 44 | 45 | ### T018 - No Time Part 46 | Verify date field is a date only, no time part present. For example, to verify that table employees has no time part in field hire_date (time part must be "12:00:00"): 47 | ```sql 48 | SELECT hire_date 49 | , CASE WHEN TO_CHAR(hire_date, 'hh:mi:ss') <> '12:00:00' THEN 'FAIL' ELSE 'P' END AS status 50 | FROM demo_hr.employees; 51 | ``` 52 |
53 | 54 | 55 | 56 | ### T019 - Has Time Part 57 | Verify date field is a date **and** time. For example, to verify that table employees has a time part in field hire_date (time part cannot be "12:00:00"): 58 | ```sql 59 | SELECT start_tm 60 | , CASE WHEN TO_CHAR(start_tm, 'hh:mi:ss') = '12:00:00' THEN 'FAIL' ELSE 'P' END AS status 61 | FROM demo_hr.test_case_results; 62 | ``` 63 |
64 | 65 | 66 | 67 | ### T020 - Multi Field Compare 68 | Verify multiple date fields relative to each other. For example, to verify that field start_date must be < field end_date in table job_history (thus if start_date is >= end_date the test case fails): 69 | ```sql 70 | SELECT start_date, end_date 71 | , CASE WHEN start_date >= end_date THEN 'FAIL' ELSE 'P' END AS status 72 | FROM demo_hr.job_history; 73 | ``` 74 |
75 | 76 | 77 | 78 | ### Bonus Tip - Joining Tables with 2 Pairs of Start-End Date Overlaps 79 | So this is a fantastic tip I learned from a co-worker in healthcare back in 2011 (was it Jennifer C.? or Matt G.? or Jonathon P.? I can't remember). 80 | 81 | 82 | The **problem** is that you are trying to join two tables with logic where the table1.start_dt/end_dt's overlap with the table2.start_dt/end_dt. 83 | 84 | 85 | The **solution** is to **join on `table1.start_dt <= table2.end_dt AND table1.end_dt >= table2.start_dt`**. 86 | 87 | Here is why: 88 | 89 | ``` 90 | Scenario #1 = "Discard - No Overlap" Table #1 date range is completely before Table #2's (Tbl#1.End < Tbl#2.Start) 91 | Table#1: |Start ----------- End| 92 | Table#2: |Start ----------- End| 93 | 94 | 95 | Scenario #2 = "Include - Table #1 End = Table#2 Start" Table #1 date range ends exactly where Table #2's starts (Tbl#1.End = Tbl#2.Start AND Tbl#1.Start < Tbl#2.End) 96 | Table#1: |Start ----------- End| 97 | Table#2: |Start ----------- End| 98 | 99 | 100 | Scenario #3 = "Include - Table #1 Ends Midway Table #2 Span" Table #1 date range nicely overlaps Table #2's (Tbl#1.End > Tbl#2.Start AND Tbl#1.Start < Tbl#2.End) 101 | Table#1: |Start ----------- End| 102 | Table#2: |Start ----------- End| 103 | 104 | 105 | Scenario #4 = "Include - Table #1 Starts Midway Table #2 Span" Table #1 date range nicely overlaps Table #2's (Tbl#1.End > Tbl#2.Start AND Tbl#1.Start < Tbl#2.End) 106 | Table#1: |Start ----------- End| 107 | Table#2: |Start ----------- End| 108 | 109 | 110 | Scenario #5 = "Include - Table #1 Starts at Table #2 End" Table #1 date range starts exactly at Table #2's End (Tbl#1.End > Tbl#2.Start AND Tbl#1.Start = Tbl#2.End) 111 | Table#1: |Start ----------- End| 112 | Table#2: |Start ----------- End| 113 | 114 | 115 | Scenario #6 = "Discard - No Overlap" Table #1's date range is entirely after Table #2's (Tbl#1.Start > Tbl#2.End) 116 | Table#1: |Start ----------- End| 117 | Table#2: |Start ----------- End| 118 | 119
| ``` 120 | 121 | 122 |
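The overlap join described above can be exercised against the demo schema. The sketch below is an illustrative example (not part of the repo's script set); it self-joins demo_hr.job_history to flag any employee having two job stints whose date ranges overlap:

```sql
-- Self-join job_history: for each employee, pair each stint (t1) with any
-- earlier-starting stint (t2), then apply the overlap join condition.
SELECT CASE WHEN COUNT(*) > 0 THEN 'FAIL' ELSE 'P' END AS status
FROM (
  SELECT t1.employee_id
  FROM demo_hr.job_history t1
  INNER JOIN demo_hr.job_history t2
          ON t2.employee_id = t1.employee_id
         AND t2.start_date  < t1.start_date  -- avoid pairing a row with itself
         AND t1.start_date <= t2.end_date    -- Tbl#1.Start <= Tbl#2.End
         AND t1.end_date   >= t2.start_date  -- Tbl#1.End >= Tbl#2.Start
);
```

Strictly speaking, the last predicate is already implied here by the t2-starts-earlier ordering plus the JHIST_DATE_INTERVAL check constraint, but the generic two-part overlap condition is shown in full so the pattern carries over to any table pair.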
123 | -------------------------------------------------------------------------------- /oracle/data_validation/rule_set_09__defect_regression.md: -------------------------------------------------------------------------------- 1 | ### Data Validation Examples - Oracle 2 | [![License: CC0](https://img.shields.io/badge/License-CC0-red)](LICENSE "Creative Commons Zero License by DataResearchLabs (effectively = Public Domain)") 3 | [![YouTube](https://img.shields.io/badge/YouTube-DataResearchLabs-brightgreen)](http://www.DataResearchLabs.com) 4 | #### [Return to Data Validation Home Page](https://github.com/DataResearchLabs/sql_scripts/blob/main/data_validation_scripts.md) 5 |
6 | 7 | # Rule Set #9 - Defect Regression 8 | 9 | ## Summary 10 | Where possible, it is a good idea to cover known defects with a test case. This enables automated regression testing for free. You can simply run this script to quickly re-check whether a defect has been re-introduced. 11 | 12 | There are a lot of caveats that will block you from covering every defect: 13 | * If anything but the simplest of setup and tear-down data is required 14 | * If processes or jobs or applications or ETL must be triggered 15 | * If there are interactions with other databases or files not on the current server 16 | 17 | That said, you can still frequently set up static trip wires to monitor the data for recurrence of old bugs. 18 | 19 | Some real-world examples from my past include: 20 | * A defect where email addresses stored in the database had TAB characters. I set up a simple CHR(9) check that ran daily and cross-referenced the defect number in the SQL return results. 21 | * A defect where customer-specific HTML snippets for email reports had bad tags manually entered by company analysts working with the client. Obviously fixing the app to avoid those data entries would be best, but that was a different department that was going to get to it later. So, I set up a daily alert to check for those HTML tags and some other similar ones that might be accidentally introduced. 22 | * A defect where internal data entry folks on rare occasions had to key in customer names. They were supposed to be all upper case for other reasons in the system, but the analysts sometimes entered lower case. Developers were not going to get around to fixing it for a couple of months (other priorities, this was a very low frequency issue). So I set up a daily check to trip an alarm and notify via email the appropriate person to correct their data. 23 | * Basically, you can take any of the test case examples in Rule Sets #1 thru #8 and apply them as regression tests for defects when appropriate.
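The first bullet above (TAB characters sneaking into stored email addresses) can be turned into a static trip wire against the demo schema like this. This is a hedged sketch, not part of the repo's script set, and the defect number in the comment is a made-up placeholder:

```sql
-- Regression trip wire for a past defect (DEF-#### is a placeholder number):
-- fail if any employees.email_address contains a TAB character (CHR(9)).
SELECT CASE WHEN COUNT(*) > 0 THEN 'FAIL' ELSE 'P' END AS status
FROM demo_hr.employees
WHERE INSTR(email_address, CHR(9)) > 0;
```

Scheduled daily, any non-'P' status signals that the old defect has resurfaced before downstream consumers ever see the bad data.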
24 | -------------------------------------------------------------------------------- /oracle/data_validation/sql_scripts/dv_advanced_01_setup.sql: -------------------------------------------------------------------------------- 1 | 2 | -- =============================================================================================== 3 | -- Filename: dv_advanced_01_setup.sql 4 | -- Description: Data Validation Scripts - Setup Two "Temp" Tables 5 | -- Platform: Oracle 6 | -- Author: DataResearchLabs 7 | -- GitHub: https://github.com/DataResearchLabs/sql_scripts 8 | -- YouTube Tutorials: https://www.youtube.com/channel/UCQciXv3xaBykeUFc04GxSXA 9 | -------------------------------------------------------------------------------------------------- 10 | -- This SQL script is run one time prior to the ongoing regular running of the 11 | -- "dv_advanced_02_test_cases.sql" script. 12 | -- Run this SQL *one time* to add the "TEST_CASE_RESULTS" and "TEST_CASE_CONFIG" tables to the 13 | -- database in which you are running tests. These tables contain the test results and any 14 | -- configuration values your test cases might need. 15 | -- Unfortunately Oracle differs from all the other platforms in that prior to version 19, there 16 | -- were no private/local/on-the-fly temporary tables that can just be easily spun up within a 17 | -- script and then discarded. Thus, we will just add two permanent tables one time and grant 18 | -- the user running the script read and write permissions to those tables. 19 | -- =============================================================================================== 20 | 21 | 22 | 23 | -- 1: Drop Tables if Exist 24 | -------------------------- 25 | /* -- Highlight and manually run the lines below **if** the tables already exist.
26 | DROP TABLE demo_hr.test_case_results; 27 | DROP TABLE demo_hr.test_case_config; 28 | */ 29 | 30 | 31 | -- 2: Add Table Test_Case_Results 32 | --------------------------------- 33 | CREATE TABLE demo_hr.test_case_results ( /* <<<<<<<<<<<<<<<<< Change Schema and Table name here */ 34 | tst_id VARCHAR2(5) 35 | , tst_descr VARCHAR2(255) 36 | , start_tm TIMESTAMP DEFAULT SYSTIMESTAMP 37 | , exec_tm VARCHAR2(15) 38 | , status VARCHAR2(5) 39 | , rej_dtls VARCHAR2(1024) 40 | , lookup_sql VARCHAR2(1024) 41 | ); 42 | COMMIT; 43 | 44 | 45 | 46 | -- 3. Add Table Test_Case_Config 47 | CREATE TABLE demo_hr.test_case_config ( /* <<<<<<<<<<<<<<<<< change schema and table name here */ 48 | prop_nm VARCHAR2(99) 49 | , prop_val VARCHAR2(255) 50 | ); 51 | COMMIT; 52 | 53 | 54 | 55 | -------------------------------------------------------------------------------- /oracle/schemadiff/simple_schema_dump.sql: -------------------------------------------------------------------------------- 1 | ------------------------------------------------------------------------------------ 2 | -- Simple Schema Dump: 3 | -- This SQL script will dump table, column, key, and index design related metadata 4 | -- so that you can copy-paste or export to a text file. 5 | -- Even better, you can make other snapshots over time (same database schema at earlier 6 | -- points in time), OR in different environments (DEV, PPMO, STAGE, PROD). Then, 7 | -- using your favorite free Text File Diff Tool (DiffMerge, ExamDiff, etc.) you 8 | -- can compare snapshots to quickly isolate and identify what changed over time 9 | -- or is different between environments.
10 | ------------------------------------------------------------------------------------ 11 | -- Platform: Oracle 11g or later 12 | -- Author: DataResearchLabs 13 | -- GitHub: https://github.com/DataResearchLabs/sql_scripts 14 | -- YouTube Tutorials: https://www.youtube.com/channel/UCQciXv3xaBykeUFc04GxSXA 15 | ---------------------------------------------------------------------------------- 16 | 17 | -- All variables are consolidated here in the first CTE (Common Table Expression) 18 | -- Each given row is a variable, with the value you change preceding the "AS" keyword 19 | WITH vars 20 | AS ( 21 | SELECT 'HR' AS v_SchemaName -- <<<<<<<<<<<< CHANGE THIS VALUE to Schema you want to dump 22 | FROM dual 23 | ) 24 | 25 | 26 | , baseTbl 27 | AS ( 28 | SELECT owner, table_type, table_name /*, comments */ 29 | FROM SYS.ALL_TAB_COMMENTS 30 | WHERE table_name NOT LIKE 'BIN%' -- Leave this as is to ignore the Oracle 10g and forward Recycle Bin tables 31 | AND owner = (SELECT v_SchemaName FROM vars) 32 | ) 33 | 34 | , metaForTbl 35 | AS ( 36 | SELECT t.owner AS SchemaName 37 | , t.table_name AS TableName 38 | , '(' || CASE WHEN t.table_type = 'TABLE' THEN 'Table' 39 | WHEN t.table_type = 'VIEW' THEN 'View' 40 | ELSE 'UK' 41 | END || ')' AS ObjectType 42 | , t.table_name AS ObjectName 43 | , '(Exists)' AS PropertyName 44 | , ' ' AS PropertyValue 45 | FROM baseTbl t 46 | ) 47 | 48 | , metaForCol_dataType 49 | AS ( 50 | SELECT tut.owner AS SchemaName, tut.table_name AS TableName, 'Column' AS ObjectType, tut.column_name AS ObjectName 51 | , '2' AS PropertyName 52 | , COALESCE(tut.data_type, 'UNKNOWN') || '(' 53 | || CASE WHEN tut.data_length IS NOT NULL THEN CAST(tut.data_length AS VARCHAR2(10)) ELSE '' END 54 | || CASE WHEN tut.data_precision IS NOT NULL THEN ',' || CAST(tut.data_precision AS VARCHAR2(10)) ELSE '' END 55 | || CASE WHEN tut.data_scale IS NOT NULL THEN ',' || CAST(tut.data_scale AS VARCHAR2(10)) ELSE '' END 56 | || ')' AS PropertyValue 57 | FROM
SYS.ALL_TAB_COLUMNS tut 58 | INNER JOIN baseTbl ft ON ft.owner = tut.owner AND ft.table_name = tut.table_name 59 | ) 60 | 61 | , metaForCol_nullable 62 | AS ( 63 | SELECT tut.owner AS SchemaName, tut.table_name AS TableName, 'Column' AS ObjectType, tut.column_name AS ObjectName 64 | , '3' AS PropertyName, CASE WHEN tut.nullable = 'Y' THEN 'NULL' ELSE 'NOT NULL' END AS PropertyValue 65 | FROM SYS.ALL_TAB_COLUMNS tut 66 | INNER JOIN baseTbl ft ON ft.owner = tut.owner AND ft.table_name = tut.table_name 67 | ) 68 | 69 | , metaForCol_ordpos 70 | AS ( 71 | SELECT tut.owner AS SchemaName, tut.table_name AS TableName, 'Column' AS ObjectType, tut.column_name AS ObjectName 72 | , '1' AS PropertyName, LPAD(CAST(tut.column_id AS VARCHAR2(3)), 3, '0') AS PropertyValue 73 | FROM SYS.ALL_TAB_COLUMNS tut 74 | INNER JOIN baseTbl ft ON ft.owner = tut.owner AND ft.table_name = tut.table_name 75 | ) 76 | 77 | , metaAllCols 78 | AS ( 79 | SELECT schemaname, tablename, objecttype, objectname, 'Properties' AS propertyname 80 | , LISTAGG(propertyvalue, ' | ') 81 | WITHIN GROUP (ORDER BY propertyname, propertyvalue) AS propertyvalue 82 | FROM ( 83 | SELECT * FROM metaForCol_dataType 84 | UNION SELECT * FROM metaForCol_nullable 85 | UNION SELECT * FROM metaForCol_ordpos 86 | ) t 87 | GROUP BY schemaname, tablename, objecttype, objectname 88 | ) 89 | 90 | , metaForKeys 91 | AS ( 92 | SELECT cons.owner AS SchemaName, cols.table_name AS TableName 93 | , CASE WHEN cons.constraint_type = 'P' THEN 'PKey' 94 | WHEN cons.constraint_type = 'U' THEN 'UKey' 95 | WHEN cons.constraint_type = 'R' THEN 'FKey' 96 | END AS ObjectType 97 | , cons.constraint_name AS ObjectName 98 | , 'FieldList' AS PropertyName 99 | , LISTAGG(cols.column_name, ',') WITHIN GROUP (ORDER BY cols.position) AS PropertyValue 100 | FROM all_constraints cons 101 | INNER JOIN all_cons_columns cols ON cons.constraint_name = cols.constraint_name AND cons.owner = cols.owner 102 | WHERE cons.owner = (SELECT v_SchemaName FROM vars) 103 | 
AND cons.table_name IN(SELECT DISTINCT table_name FROM baseTbl) 104 | AND cons.constraint_type IN('P','R','U') 105 | GROUP BY cons.owner, cols.table_name, cons.constraint_type, cons.constraint_name 106 | ) 107 | 108 | , metaForIdxs 109 | AS ( 110 | SELECT tut.table_owner AS SchemaName, tut.table_name AS TableName 111 | , 'Index' AS ObjectType, tut.index_name AS ObjectName 112 | , 'FieldList' AS PropertyName 113 | , LISTAGG(tut.column_name, ',') WITHIN GROUP (ORDER BY tut.column_position) AS PropertyValue 114 | FROM ALL_IND_COLUMNS tut 115 | INNER JOIN baseTbl ft ON ft.owner = tut.index_owner AND ft.table_name = tut.table_name 116 | GROUP BY tut.table_owner, tut.table_name, tut.index_name 117 | ) 118 | 119 | , allMetadata 120 | AS ( 121 | SELECT * FROM metaForTbl 122 | UNION SELECT * FROM metaAllCols 123 | UNION SELECT * FROM metaForKeys 124 | UNION SELECT * FROM metaForIdxs 125 | ) 126 | 127 | SELECT CASE WHEN objecttype IN('(Table)','(View)') THEN schemaname ELSE ' ' END AS schema_nm 128 | , CASE WHEN objecttype IN('(Table)','(View)') THEN tablename ELSE ' ' END AS tbl_nm 129 | , objecttype AS obj_typ, objectname AS obj_nm, /*propertyname,*/ propertyvalue AS properties 130 | FROM allMetadata 131 | ORDER BY schemaname, tablename, objecttype 132 | , CASE WHEN objecttype='Column' THEN propertyvalue ELSE ' ' END 133 | , objectname, propertyname 134 | 135 | 136 | -------------------------------------------------------------------------------- /postgresql/data_dictionary/data_dict_dump.sql: -------------------------------------------------------------------------------- 1 | ------------------------------------------------------------------------------------ 2 | -- Data Dictionary Dump: 3 | -- This SQL script will dump table, column, key, and description design related 4 | -- metadata so that you can copy-paste or export to Excel as a Data Dictionary. 
5 | ------------------------------------------------------------------------------------ 6 | -- Platform: PostgreSQL 7 | -- Author: DataResearchLabs 8 | -- GitHub: https://github.com/DataResearchLabs/sql_scripts 9 | -- YouTube Tutorials: https://www.youtube.com/channel/UCQciXv3xaBykeUFc04GxSXA 10 | ---------------------------------------------------------------------------------- 11 | WITH vars 12 | AS ( 13 | SELECT 14 | 'public' AS v_SchemaName -- Set to the schema whose tables you want in the Data Dictionary 15 | , 'NO' AS v_TablesOnly -- YES=Limit To Tables only; NO=Include views too 16 | ) 17 | 18 | , baseTbl 19 | AS ( 20 | SELECT table_schema AS SchemaName 21 | , table_catalog 22 | , table_type, table_name, table_schema 23 | FROM INFORMATION_SCHEMA.TABLES 24 | WHERE TABLE_SCHEMA = (SELECT v_SchemaName FROM vars) 25 | AND ( (TABLE_TYPE = 'BASE TABLE') 26 | OR ((SELECT v_TablesOnly FROM vars) = 'NO') 27 | ) 28 | ) 29 | 30 | , metadata 31 | AS ( 32 | SELECT 33 | bt.SchemaName AS schema_nm 34 | , bt.table_name AS table_nm 35 | , CASE WHEN bt.TABLE_TYPE = 'BASE TABLE' THEN 'TBL' 36 | WHEN bt.TABLE_TYPE = 'VIEW' THEN 'VW' 37 | ELSE 'UK' 38 | END AS obj_typ 39 | , tut.ordinal_position AS ord_pos 40 | , tut.column_name AS column_nm 41 | , CONCAT(COALESCE(tut.data_type, 'unknown'), 42 | CASE WHEN tut.data_type IN('varchar','char') THEN CONCAT('(', CAST(tut.CHARACTER_MAXIMUM_LENGTH AS varchar(10)), ')') 43 | WHEN tut.data_type IN('date','time') THEN CONCAT('(3)') 44 | WHEN tut.data_type = 'datetime' THEN CONCAT('(8)') 45 | WHEN tut.data_type = 'timestamp' THEN CONCAT('(4)') 46 | WHEN tut.data_type in('bigint','integer','smallint') THEN CONCAT('(', CAST(tut.NUMERIC_PRECISION AS varchar(10)), ')') 47 | WHEN tut.data_type = 'decimal' THEN CONCAT('(', CAST(tut.NUMERIC_PRECISION AS varchar(10)), ',', CAST(tut.NUMERIC_SCALE AS varchar(10)), ')') 48 | WHEN tut.CHARACTER_MAXIMUM_LENGTH IS NOT NULL THEN CONCAT('(', CAST(tut.CHARACTER_MAXIMUM_LENGTH AS varchar(10)), ')') 49 | 
WHEN tut.DATETIME_PRECISION IS NOT NULL THEN CONCAT('(', CAST(tut.DATETIME_PRECISION AS varchar(10)), ')') 50 | WHEN tut.NUMERIC_PRECISION IS NOT NULL 51 | AND tut.NUMERIC_SCALE IS NULL THEN CONCAT('(', CAST(tut.NUMERIC_PRECISION AS varchar(10)), ')') 52 | WHEN tut.NUMERIC_PRECISION IS NOT NULL 53 | AND tut.NUMERIC_SCALE IS NOT NULL THEN CONCAT('(', CAST(tut.NUMERIC_PRECISION AS varchar(10)), ',', CAST(tut.NUMERIC_SCALE AS varchar(10)), ')') 54 | ELSE '' 55 | END ) AS data_typ 56 | , CASE WHEN tut.IS_NULLABLE = 'YES' THEN 'NULL' ELSE 'NOT NULL' END AS nullable 57 | FROM INFORMATION_SCHEMA.COLUMNS tut 58 | INNER JOIN baseTbl bt ON bt.table_catalog = tut.TABLE_CATALOG AND bt.table_name = tut.table_name 59 | ) 60 | 61 | , meta_for_keys 62 | AS ( 63 | SELECT schema_nm, table_nm, column_nm 64 | , STRING_AGG(is_key, ',' ORDER BY is_key) AS is_key 65 | FROM ( 66 | SELECT cons.TABLE_SCHEMA AS schema_nm 67 | , cons.TABLE_NAME AS table_nm 68 | , kcu.COLUMN_NAME AS column_nm 69 | , CASE WHEN cons.constraint_type = 'PRIMARY KEY' THEN 'PK' 70 | WHEN cons.constraint_type = 'UNIQUE' THEN 'UK' 71 | WHEN cons.constraint_type = 'FOREIGN KEY' THEN 'FK' 72 | ELSE 'X' 73 | END AS is_key 74 | FROM INFORMATION_SCHEMA.TABLE_CONSTRAINTS cons 75 | INNER JOIN INFORMATION_SCHEMA.KEY_COLUMN_USAGE kcu 76 | ON cons.TABLE_SCHEMA = kcu.TABLE_SCHEMA 77 | AND cons.TABLE_NAME = kcu.TABLE_NAME 78 | AND cons.CONSTRAINT_NAME = kcu.CONSTRAINT_NAME 79 | WHERE cons.table_schema = (SELECT v_SchemaName FROM vars) 80 | AND cons.table_name IN(SELECT DISTINCT table_name FROM baseTbl) 81 | AND cons.constraint_type IN('PRIMARY KEY','FOREIGN KEY','UNIQUE') 82 | GROUP BY cons.TABLE_SCHEMA, cons.TABLE_NAME, kcu.COLUMN_NAME, cons.constraint_type 83 | ) t 84 | GROUP BY schema_nm, table_nm, column_nm 85 | ) 86 | 87 | , col_comm 88 | AS ( 89 | SELECT c.TABLE_SCHEMA AS schema_nm 90 | , c.TABLE_NAME AS table_nm 91 | , c.COLUMN_NAME AS column_nm 92 | , pgd.DESCRIPTION AS column_descr 93 | FROM 
pg_catalog.pg_statio_all_tables AS st 94 | INNER JOIN pg_catalog.PG_DESCRIPTION AS pgd ON pgd.objoid = st.relid 95 | INNER JOIN INFORMATION_SCHEMA.COLUMNS AS c ON pgd.objsubid = c.ordinal_position 96 | AND c.table_schema = st.schemaname 97 | AND c.table_name = st.relname 98 | WHERE c.table_schema = (SELECT v_SchemaName FROM vars) 99 | AND c.table_name IN(SELECT DISTINCT table_name FROM baseTbl) 100 | ) 101 | 102 | SELECT md.SCHEMA_NM, md.TABLE_NM, md.OBJ_TYP 103 | , md.ORD_POS AS ord 104 | , COALESCE(pk.is_key, ' ') AS is_key 105 | , md.COLUMN_NM, md.DATA_TYP, md.NULLABLE, c.column_descr 106 | FROM metadata md 107 | LEFT JOIN meta_for_keys pk ON pk.SCHEMA_NM = md.SCHEMA_NM AND pk.TABLE_NM = md.TABLE_NM AND pk.COLUMN_NM = md.COLUMN_NM 108 | LEFT JOIN col_comm c ON c.SCHEMA_NM = md.SCHEMA_NM AND c.TABLE_NM = md.TABLE_NM AND c.COLUMN_NM = md.COLUMN_NM 109 | ORDER BY md.SCHEMA_NM, md.TABLE_NM, md.ORD_POS 110 | -------------------------------------------------------------------------------- /postgresql/data_validation/demo_data/demo_hr_01_create_database.sql: -------------------------------------------------------------------------------- 1 | -- SCHEMA: demo_hr 2 | 3 | -- DROP SCHEMA demo_hr; 4 | 5 | CREATE SCHEMA demo_hr 6 | AUTHORIZATION postgres; 7 | 8 | COMMENT ON SCHEMA demo_hr 9 | IS 'demo_hr schema for testing out data validation scripts'; 10 | 11 | GRANT ALL ON SCHEMA demo_hr TO PUBLIC; 12 | 13 | GRANT ALL ON SCHEMA demo_hr TO postgres; 14 | -------------------------------------------------------------------------------- /postgresql/data_validation/demo_data/demo_hr_02_create_tables.sql: -------------------------------------------------------------------------------- 1 | 2 | -- demo_hr.jobs definition 3 | CREATE TABLE demo_hr.jobs 4 | (job_id VARCHAR(10), 5 | job_title VARCHAR(35) CONSTRAINT job_title_nn NOT NULL, 6 | min_salary INTEGER, 7 | max_salary INTEGER, 8 | CONSTRAINT job_id_pk PRIMARY KEY (job_id) 9 | ); 10 | 11 | 12 | 13 | -- demo_hr.jobs_snapshot
definition 14 | CREATE TABLE demo_hr.jobs_snapshot 15 | (job_id VARCHAR(10), 16 | job_title VARCHAR(35) NOT NULL, 17 | min_salary INTEGER, 18 | max_salary INTEGER 19 | ); 20 | 21 | 22 | 23 | -- demo_hr.regions definition 24 | CREATE TABLE demo_hr.regions 25 | (region_id INTEGER CONSTRAINT region_id_nn NOT NULL, 26 | region_name VARCHAR(25), 27 | CONSTRAINT reg_id_pk PRIMARY KEY (region_id) 28 | ); 29 | 30 | 31 | 32 | -- demo_hr.countries definition 33 | CREATE TABLE demo_hr.countries 34 | (country_id CHAR(2) CONSTRAINT country_id_nn NOT NULL, 35 | country_name VARCHAR(40), 36 | region_id INTEGER, 37 | date_last_updated DATE DEFAULT CURRENT_DATE, 38 | CONSTRAINT country_c_id_pk PRIMARY KEY (country_id), 39 | CONSTRAINT countr_reg_fk FOREIGN KEY (region_id) REFERENCES demo_hr.regions (region_id) 40 | ); 41 | 42 | 43 | 44 | 45 | -- demo_hr.locations definition 46 | CREATE TABLE demo_hr.locations 47 | (location_id INTEGER, 48 | street_address VARCHAR(40), 49 | postal_code VARCHAR(12), 50 | city VARCHAR(30) CONSTRAINT loc_city_nn NOT NULL, 51 | state_province VARCHAR(25), 52 | country_id CHAR(2), 53 | CONSTRAINT loc_id_pk PRIMARY KEY (location_id), 54 | CONSTRAINT loc_c_id_fk FOREIGN KEY (country_id) REFERENCES demo_hr.countries (country_id) 55 | ); 56 | 57 | CREATE INDEX loc_city_ix ON demo_hr.locations (city); 58 | CREATE INDEX loc_state_province_ix ON demo_hr.locations (state_province); 59 | CREATE INDEX loc_country_ix ON demo_hr.locations (country_id); 60 | 61 | 62 | 63 | -- demo_hr.departments definition 64 | CREATE TABLE demo_hr.departments 65 | (department_id INTEGER, 66 | department_name VARCHAR(30) CONSTRAINT dept_name_nn NOT NULL, 67 | manager_id INTEGER, 68 | location_id INTEGER, 69 | url VARCHAR(255), 70 | CONSTRAINT dept_id_pk PRIMARY KEY (department_id), 71 | CONSTRAINT dept_loc_fk FOREIGN KEY (location_id) REFERENCES demo_hr.locations (location_id) 72 | ) ; 73 | 74 | CREATE INDEX dept_location_ix ON demo_hr.departments (location_id); 75 | 76 | 77 | 78 |
-- demo_hr.job_history definition 79 | CREATE TABLE demo_hr.job_history 80 | (employee_id INTEGER CONSTRAINT jhist_employee_nn NOT NULL, 81 | start_date DATE CONSTRAINT jhist_start_date_nn NOT NULL, 82 | end_date DATE CONSTRAINT jhist_end_date_nn NOT NULL, 83 | job_id VARCHAR(10) CONSTRAINT jhist_job_nn NOT NULL, 84 | department_id INTEGER, 85 | CONSTRAINT jhist_date_interval CHECK (end_date > start_date), 86 | CONSTRAINT jhist_emp_id_st_date_pk PRIMARY KEY (employee_id, start_date), 87 | CONSTRAINT jhist_job_fk FOREIGN KEY (job_id) REFERENCES demo_hr.jobs (job_id), 88 | CONSTRAINT jhist_dept_fk FOREIGN KEY (department_id) REFERENCES demo_hr.departments (department_id) 89 | ) ; 90 | 91 | CREATE INDEX jhist_job_ix ON demo_hr.job_history (job_id); 92 | CREATE INDEX jhist_employee_ix ON demo_hr.job_history (employee_id); 93 | CREATE INDEX jhist_department_ix ON demo_hr.job_history (department_id); 94 | 95 | 96 | 97 | 98 | -- demo_hr.employees definition 99 | CREATE TABLE demo_hr.employees 100 | (employee_id INTEGER, 101 | first_name VARCHAR(20), 102 | last_name VARCHAR(25) CONSTRAINT emp_last_name_nn NOT NULL, 103 | email VARCHAR(25) CONSTRAINT emp_email_nn NOT NULL, 104 | phone_number VARCHAR(20), 105 | hire_date DATE CONSTRAINT emp_hire_date_nn NOT NULL, 106 | job_id VARCHAR(10) CONSTRAINT emp_job_nn NOT NULL, 107 | salary INTEGER, 108 | commission_pct NUMERIC(2,2), 109 | manager_id INTEGER, 110 | department_id INTEGER, 111 | some_date_fmt1 VARCHAR(50), 112 | some_date_fmt2 VARCHAR(50), 113 | some_date_fmt3 VARCHAR(50), 114 | some_date_fmt4 VARCHAR(50), 115 | fake_ssn VARCHAR(11), 116 | zip5 VARCHAR(5), 117 | zip5or9 VARCHAR(10), 118 | zip9 VARCHAR(10), 119 | email_address VARCHAR(100), 120 | CONSTRAINT emp_salary_min CHECK (salary > 0), 121 | CONSTRAINT emp_email_uk UNIQUE (email), 122 | CONSTRAINT emp_emp_id_pk PRIMARY KEY (employee_id), 123 | CONSTRAINT emp_dept_fk FOREIGN KEY (department_id) REFERENCES demo_hr.departments (department_id), 124 | CONSTRAINT 
emp_job_fk FOREIGN KEY (job_id) REFERENCES demo_hr.jobs (job_id), 125 | CONSTRAINT emp_manager_fk FOREIGN KEY (manager_id) REFERENCES demo_hr.employees (employee_id) 126 | ) ; 127 | 128 | CREATE INDEX emp_department_ix ON demo_hr.employees (department_id); 129 | CREATE INDEX emp_job_ix ON demo_hr.employees (job_id); 130 | CREATE INDEX emp_manager_ix ON demo_hr.employees (manager_id); 131 | CREATE INDEX emp_name_ix ON demo_hr.employees (last_name, first_name); 132 | 133 | 134 | 135 | 136 | 137 | 138 | -------------------------------------------------------------------------------- /postgresql/data_validation/img/01_data_val_postgresql_example_test_case_sql_code.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/DataResearchLabs/sql_scripts/3eaa17caf248d3b58763534d62ef4d9ac5729a46/postgresql/data_validation/img/01_data_val_postgresql_example_test_case_sql_code.png -------------------------------------------------------------------------------- /postgresql/data_validation/img/03_data_val_postgresql_adv_test_case_ex2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/DataResearchLabs/sql_scripts/3eaa17caf248d3b58763534d62ef4d9ac5729a46/postgresql/data_validation/img/03_data_val_postgresql_adv_test_case_ex2.png -------------------------------------------------------------------------------- /postgresql/data_validation/img/04_data_val_postgresql_run_results_adv.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/DataResearchLabs/sql_scripts/3eaa17caf248d3b58763534d62ef4d9ac5729a46/postgresql/data_validation/img/04_data_val_postgresql_run_results_adv.png -------------------------------------------------------------------------------- /postgresql/data_validation/rule_set_01__row_counts.md: -------------------------------------------------------------------------------- 1 
| ### Data Validation Examples - Postgresql 2 | [![License: CC0](https://img.shields.io/badge/License-CC0-red)](LICENSE "Creative Commons Zero License by DataResearchLabs (effectively = Public Domain)") 3 | [![YouTube](https://img.shields.io/badge/YouTube-DataResearchLabs-brightgreen)](http://www.DataResearchLabs.com) 4 | #### [Return to Data Validation Home Page](https://github.com/DataResearchLabs/sql_scripts/blob/main/data_validation_scripts.md) 5 | 6 | 7 | # Rule Set #1 - Row Counts 8 | 9 | ## Table of Contents 10 | - T001 - "Full" Row Count 11 | - T002 - "Partial" Row Count 12 | - T003 - "Relative" Row Count 13 | - T004 - "Recent" Row Count 14 |
15 | 16 | 17 | 18 | ### T001 - "Full" Row Count 19 | Verify full row count for a table or view. For example, to verify that table countries has exactly 25 rows: 20 | ```sql 21 | SELECT CASE WHEN COUNT(*) <> 25 THEN 'FAIL' ELSE 'P' END AS status 22 | FROM demo_hr.countries; 23 | ``` 24 |
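An exact count can be brittle for tables that legitimately grow or shrink between loads. A common variant of T001 (a sketch; the 20 and 30 bounds are illustrative, not taken from the demo data) is to pass whenever the count lands inside an expected range:

```sql
-- Variant: verify the row count falls within an expected range rather than an exact value
SELECT CASE WHEN COUNT(*) NOT BETWEEN 20 AND 30 THEN 'FAIL' ELSE 'P' END AS status
FROM demo_hr.countries;
```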
25 | 26 | 27 | 28 | ### T002 - "Partial" Row Count 29 | Verify partial row count for a subset of a table or view. For example, to verify that table countries has exactly 8 rows where region_id = 1 (Europe): 30 | ```sql 31 | SELECT CASE WHEN COUNT(*) <> 8 THEN 'FAIL' ELSE 'P' END AS status 32 | FROM demo_hr.countries 33 | WHERE region_id = 1; 34 | ``` 35 |
36 | 37 | 38 | 39 | ### T003 - "Relative" Row Count 40 | Verify relative row counts between tables or views. The example below verifies that table countries has at least 5 times the number of rows as table regions. The inner query returns one row with two columns: (a) count of countries, and (b) count of regions. The outer query can then apply the business logic of comparing the counts and returning a Pass or Fail. 41 | ```sql 42 | SELECT CASE WHEN countries_count < 5 * regions_count THEN 'FAIL' ELSE 'P' END AS status 43 | FROM ( 44 | SELECT (SELECT COUNT(*) AS row_count FROM demo_hr.countries) AS countries_count 45 | , (SELECT COUNT(*) AS row_count FROM demo_hr.regions) AS regions_count 46 | ) t; 47 | ``` 48 |
49 | 50 | 51 | 52 | ### T004 - "Recent" Row Count 53 | Verify recent row counts. The example below verifies that table countries has had at least 5 rows updated in the past 150 days based on the date stamp in field date_last_updated. The inner query uses NOW()::DATE (the PostgreSQL current date) and subtracts 150 days, counting how many rows in table countries have a date_last_updated within the past 150 days. The outer query can then confirm that at least 5 rows are recent for a Pass...if not then it returns a Fail. 54 | ```sql 55 | SELECT CASE WHEN row_count < 5 THEN 'FAIL' ELSE 'P' END AS status 56 | FROM ( 57 | SELECT COUNT(*) AS row_count 58 | FROM demo_hr.countries 59 | WHERE date_last_updated >= NOW()::DATE - 150 60 | ) t; 61 | ``` 62 | 63 | 64 | -------------------------------------------------------------------------------- /postgresql/data_validation/rule_set_02__keys.md: -------------------------------------------------------------------------------- 1 | ### Data Validation Examples - Postgresql 2 | [![License: CC0](https://img.shields.io/badge/License-CC0-red)](LICENSE "Creative Commons Zero License by DataResearchLabs (effectively = Public Domain)") 3 | [![YouTube](https://img.shields.io/badge/YouTube-DataResearchLabs-brightgreen)](http://www.DataResearchLabs.com) 4 | #### [Return to Data Validation Home Page](https://github.com/DataResearchLabs/sql_scripts/blob/main/data_validation_scripts.md) 5 | 6 | 7 | # Rule Set #2 - Keys (Foreign & Unique) 8 | 9 | ## Table of Contents 10 | - T005 - Unique (Native) Key Has No Duplicates 11 | - T006 - Foreign Key Children Have Orphans 12 | - T007 - Foreign Key Parent is Childless 13 |
14 | 15 | 16 | 17 | ### T005 - Unique Key Has No Duplicates 18 | Sure, good database design implies that unique keys be enforced by a constraint so that you do not need to test for it. However, there are times where a decision is made to **not** add a constraint to enforce the unique key (e.g., the table is replicated from a source that already has the constraint, so it was skipped for performance). This does happen! At work just last week, two of my unique key regression tests started failing -- without these checks in place as a wide net, the downstream defects would have been around a lot longer. 19 | In the example below, the inner query does a group by on the unique key fields, then uses a HAVING clause to filter down to those key-values with a count of more than 1 -- the dups. The outer query returns a fail if any rows come back with dups (match_count >= 2), or a pass if no dups are found. 20 | ```sql 21 | SELECT CASE WHEN COUNT(*) > 0 THEN 'FAIL' ELSE 'P' END AS status 22 | FROM ( 23 | SELECT country_name -- UKey fields separated by comma 24 | , COUNT(*) AS match_count 25 | FROM demo_hr.countries -- UKey fields separated by comma 26 | GROUP BY country_name 27 | HAVING COUNT(*) > 1 28 | ) t; 29 | ``` 30 |
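When the unique key spans multiple columns, list every key field in both the SELECT and the GROUP BY, as the inline comments above hint. A sketch assuming a hypothetical two-column key (country_name plus region_id; the demo data's real unique key is country_name alone):

```sql
-- Multi-column unique key check: group by all key fields together
SELECT CASE WHEN COUNT(*) > 0 THEN 'FAIL' ELSE 'P' END AS status
FROM (
  SELECT country_name, region_id    -- UKey fields separated by comma
       , COUNT(*) AS match_count
  FROM demo_hr.countries
  GROUP BY country_name, region_id  -- UKey fields separated by comma
  HAVING COUNT(*) > 1
) t;
```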
31 | 32 | 33 | 34 | ### T006 - Foreign Key Children Have Orphans 35 | Sure, as with T005 UKeys above, good database design implies that foreign keys be enforced by a constraint so that you do not need to test for it. However, there are times where for whatever reason the constraints do not exist. In those instances, you will want to periodically run a data validation test to ensure that this core assumption is not being violated (of course adding a foreign key constraint would be best, but if that is not an option then periodically check). 36 | In the example below, the inner query pulls from the child table countries as the anchor, then left joins out to the parent table regions on the key field region_id. If region_id does not exist in the parent table (p.region_id IS NULL), then the child region_id is an orphan. The outer query checks the count() of orphaned child rows: if it is >= 1 then the test fails, but if the count() = 0 then it passes. 37 | ```sql 38 | SELECT CASE WHEN COUNT(*) > 0 THEN 'FAIL' ELSE 'P' END AS status 39 | FROM ( 40 | SELECT DISTINCT c.region_id AS child_id, p.region_id AS parent_id 41 | FROM demo_hr.countries c 42 | LEFT JOIN demo_hr.regions p ON p.region_id = c.region_id 43 | WHERE p.region_id IS NULL 44 | ) t; 45 | ``` 46 |
47 | 48 | 49 | 50 | ### T007 - Foreign Key Parent Is Childless 51 | I'm not sure this particular test is all that useful because often it is okay for the parent-side of a foreign key relationship to not have children. But, if for some reason you need to be sure there is data present on both sides (parent **and** child), then this test is for you. You will notice in the example below that the query is very similar to T006 above, but the parent and child tables have switched positions in the FROM and LEFT JOIN lines. This is because we want to first pull all parent rows, then left join to find missing (FKey field IS NULL) child rows. 52 | ```sql 53 | SELECT CASE WHEN COUNT(*) > 0 THEN 'FAIL' ELSE 'P' END AS status 54 | FROM ( 55 | SELECT DISTINCT c.country_id AS child_id, p.country_id AS parent_id 56 | FROM demo_hr.countries p 57 | LEFT JOIN demo_hr.locations c ON p.country_id = c.country_id 58 | WHERE c.country_id IS NULL 59 | AND p.country_id IN('IT','JP','US','CA','CN','IN','AU','SG','UK','DE','CH','NL','MX') 60 | ) t; 61 | ``` 62 | -------------------------------------------------------------------------------- /postgresql/data_validation/rule_set_03__heuristic_thresholds.md: -------------------------------------------------------------------------------- 1 | ### Data Validation Examples - Postgresql 2 | [![License: CC0](https://img.shields.io/badge/License-CC0-red)](LICENSE "Creative Commons Zero License by DataResearchLabs (effectively = Public Domain)") 3 | [![YouTube](https://img.shields.io/badge/YouTube-DataResearchLabs-brightgreen)](http://www.DataResearchLabs.com) 4 | #### [Return to Data Validation Home Page](https://github.com/DataResearchLabs/sql_scripts/blob/main/data_validation_scripts.md) 5 | 6 | # Rule Set #3 - Heuristic Thresholds (Column Null Rates & Value Rates) 7 | 8 | 9 | ## Table of Contents 10 | - T008 - Null Rate Thresholds 11 | - T009 - Value Frequency Thresholds 12 | 13 | Note: Although the terms "rate" and "frequency" are mostly
synonymous, nulls are usually referred to as "rates" and values as "frequencies". I don't know why, but I'm following the convention I've heard over and over. 14 |
15 | 16 | 17 | 18 | ### T008 - Null Rate Thresholds 19 | There is a lot going on in this "Null Rate Threshold" data validation query below. The basic goal is to validate that a given table's columns do not have too many, or too few NULLs. 20 |
21 | In the example below...
22 | ...we are checking columns department_name, manager_id, and url in the table departments. We want to ensure that column department_name has no NULLs, that column manager_id is NULL less than 65% of the time, and that column url is NULL less than 80% of the time. 23 | 24 | This check is like the proverbial miner's canary in that it is a trip wire triggered when something goes awry in a data feed. I've used this test scenario to great effect when coupled with a create-date or last-updated-date to monitor the past week's data loads for any unexpected upticks in null rates. 25 | 26 | There is a downside to this test scenario too, however: it can fire false alarms, leaving you tinkering with the threshold values (the 0.0000, 0.6500, and 0.8000 cutoffs below), raising and lowering them over and over. If this happens, chances are the test failures are neither actionable nor important, and you should not waste your time applying this test scenario to that given table and field. Be careful to only pick fields that truly matter. 27 | 28 | Below, there is an upper CTE (common table expression) named "dtls" at the WITH clause, and a lower wrapper that applies the business logic (if any null rate rejections were found, fail the case). Inside the dtls CTE, there is an inner query at the bottom (at the FROM clause) doing a single table scan to calculate a null rate per column by counting nulls in each column and dividing by the total table row count. The SELECT CASE logic at the top applies the business logic; comparing the actual calculated null rates (nr_dept_nm, nr_mgr_id, and nr_url) against the expected threshold rates (hard-coded as 0.0000, 0.6500, and 0.8000). The returned value is a rejection code (REJ-01, REJ-02, etc.) clearly indicating which field failed the null rate check, what the actual null rate was, and what the expected null rate threshold was. If no rejections are triggered, then status returns a "P" for pass. 29 |
30 | 31 | ```sql 32 | WITH dtls AS ( 33 | SELECT CASE WHEN nr_dept_nm > 0.0000 THEN 'REJ-01: Null rate too high at department_name. Exp=0.0000 / Act=' || CAST(nr_dept_nm AS VARCHAR(8)) 34 | WHEN nr_mgr_id > 0.6500 THEN 'REJ-02: Null rate too high at manager_id. Exp<=0.6500 / Act=' || CAST(nr_mgr_id AS VARCHAR(8)) 35 | WHEN nr_url > 0.8000 THEN 'REJ-03: Null rate too high at url. Exp<=0.8000 / Act=' || CAST(nr_url AS VARCHAR(8)) 36 | ELSE 'P' 37 | END AS status 38 | FROM ( 39 | SELECT CAST(SUM(CASE WHEN department_name IS NULL THEN 1 ELSE 0 END) AS DOUBLE PRECISION) / CAST(COUNT(*) AS DOUBLE PRECISION) AS nr_dept_nm 40 | , CAST(SUM(CASE WHEN manager_id IS NULL THEN 1 ELSE 0 END) AS DOUBLE PRECISION) / CAST(COUNT(*) AS DOUBLE PRECISION) AS nr_mgr_id 41 | , CAST(SUM(CASE WHEN url IS NULL THEN 1 ELSE 0 END) AS DOUBLE PRECISION) / CAST(COUNT(*) AS DOUBLE PRECISION) AS nr_url 42 | FROM demo_hr.departments 43 | ) t 44 | ) 45 | 46 | SELECT CASE WHEN COUNT(*) > 0 THEN 'FAIL' ELSE 'P' END AS status 47 | FROM dtls 48 | WHERE status <> 'P'; 49 | ``` 50 |
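If you only need to watch a single column, the same pattern collapses to one query with no CTE. A minimal sketch (the 0.6500 threshold is illustrative):

```sql
-- Minimal single-column null rate check: null count / total count vs. a hard-coded threshold
SELECT CASE WHEN CAST(SUM(CASE WHEN manager_id IS NULL THEN 1 ELSE 0 END) AS DOUBLE PRECISION)
               / CAST(COUNT(*) AS DOUBLE PRECISION) > 0.6500
            THEN 'FAIL' ELSE 'P'
       END AS status
FROM demo_hr.departments;
```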
51 | 52 | 53 | 54 | ### T009 - Value Frequency Thresholds 55 | "Value Frequency Threshold" tests are fairly similar to null rates above (T008). The difference is that we are checking the frequency (or rate) at which a column's values occur. 56 |
57 | In the example below...
58 | ...we are checking the frequencies with which the values 1, 2, 3, and 4 occur in field region_id of table countries. There is an upper CTE (common table expression) named "dtls" at the WITH clause, and a lower wrapper that applies the business logic (if any value frequency rejections were found, fail the case). Inside the dtls CTE, there is an inner query at the bottom (at the FROM clause) doing a single table scan to calculate frequencies for each value in the GROUP BY for the column. The GROUP BY value count (field "freq") is divided by the total table row count (field "den") to calculate field "freq_rt". The SELECT CASE logic at the top applies the business logic; comparing the actual value frequencies (freq_rt when region_id = 1, or =2, etc.) against the expected threshold frequencies (hard-coded as 0.28 to 0.36, 0.16 to 0.24, and so on). The returned value is a rejection code (REJ-01, REJ-02, etc.) clearly indicating which field failed the value frequency check, what the actual value frequency was, and what the expected value frequency threshold ranges were. If no rejections are triggered, then status returns a "P" for pass. 59 |
60 | 61 | ```sql 62 | WITH dtls AS ( 63 | SELECT region_id, freq_rt 64 | , CASE WHEN region_id = 1 AND freq_rt NOT BETWEEN 0.28 AND 0.36 THEN 'REJ-01: Frequency occurrence of region_id=1 is outside threshold|exp=0.28 thru 0.36|act=' || CAST(freq_rt AS VARCHAR(8)) 65 | WHEN region_id = 2 AND freq_rt NOT BETWEEN 0.16 AND 0.24 THEN 'REJ-02: Frequency occurrence of region_id=2 is outside threshold|exp=0.16 thru 0.24|act=' || CAST(freq_rt AS VARCHAR(8)) 66 | WHEN region_id = 3 AND freq_rt NOT BETWEEN 0.20 AND 0.28 THEN 'REJ-03: Frequency occurrence of region_id=3 is outside threshold|exp=0.20 thru 0.28|act=' || CAST(freq_rt AS VARCHAR(8)) 67 | WHEN region_id = 4 AND freq_rt NOT BETWEEN 0.20 AND 0.28 THEN 'REJ-04: Frequency occurrence of region_id=4 is outside threshold|exp=0.20 thru 0.28|act=' || CAST(freq_rt AS VARCHAR(8)) 68 | ELSE 'P' 69 | END AS status 70 | FROM ( 71 | SELECT region_id, CAST(freq AS DOUBLE PRECISION) / CAST(den AS DOUBLE PRECISION) AS freq_rt 72 | FROM ( 73 | SELECT region_id, COUNT(*) AS freq 74 | , (SELECT COUNT(*) FROM demo_hr.countries) AS den 75 | FROM demo_hr.countries 76 | GROUP BY region_id 77 | ) t 78 | ) t2 79 | ) 80 | 81 | SELECT CASE WHEN COUNT(*) > 0 THEN 'FAIL' ELSE 'P' END AS status 82 | FROM dtls 83 | WHERE status <> 'P'; 84 | ``` 85 | -------------------------------------------------------------------------------- /postgresql/data_validation/rule_set_04__numeric_values.md: -------------------------------------------------------------------------------- 1 | ### Data Validation Examples - Postgresql 2 | [![License: CC0](https://img.shields.io/badge/License-CC0-red)](LICENSE "Creative Commons Zero License by DataResearchLabs (effectively = Public Domain)") 3 | [![YouTube](https://img.shields.io/badge/YouTube-DataResearchLabs-brightgreen)](http://www.DataResearchLabs.com) 4 | #### [Return to Data Validation Home Page](https://github.com/DataResearchLabs/sql_scripts/blob/main/data_validation_scripts.md) 5 | 6 | # Rule Set #4 - Numeric
Values 7 | 8 | ## Table of Contents 9 | - T010 - Not Null 10 | - T011 - Not Negative 11 | - T012 - Numeric Range 12 | - T013 - In Value List 13 | - T014 - Not In Value List 14 | - T015 - Multi Field Compare 15 |
16 | 17 | 18 | 19 | ### T010 - Not Null 20 | Verify numeric field is not null. For example, to verify that table countries has no NULLs in field region_id: 21 | ```sql 22 | SELECT CASE WHEN COUNT(*) > 0 THEN 'FAIL' ELSE 'P' END AS status 23 | FROM demo_hr.countries 24 | WHERE region_id IS NULL; 25 | ``` 26 |
27 | 28 | 29 | 30 | ### T011 - Not Negative 31 | Verify numeric field is not negative. For example, to verify that table countries has no negative values in field region_id: 32 | ```sql 33 | SELECT CASE WHEN COUNT(*) > 0 THEN 'FAIL' ELSE 'P' END AS status 34 | FROM demo_hr.countries 35 | WHERE region_id < 0; 36 | ``` 37 |
38 | 39 | 40 | 41 | ### T012 - Numeric Range 42 | Verify numeric field value is within a range. In the example below, we verify that field employee_id is between 100 and 999 in table employees. Note that you can run the inner query by itself to return the actual rejection code (too low or too high) along with the actual and expected values...all nicely packaged for easy troubleshooting. 43 | ```sql 44 | SELECT CASE WHEN COUNT(*) > 0 THEN 'FAIL' ELSE 'P' END AS status 45 | FROM ( 46 | SELECT employee_id 47 | , CASE WHEN employee_id < 100 THEN 'REJ-01: Verify employee_id > 99|exp>99|act=' || CAST(employee_id AS VARCHAR(10)) 48 | WHEN employee_id > 999 THEN 'REJ-02: Verify employee_id < 1000|exp<1000|act=' || CAST(employee_id AS VARCHAR(10)) 49 | ELSE 'P' 50 | END AS status 51 | FROM demo_hr.employees 52 | ) t 53 | WHERE status <> 'P'; 54 | ``` 55 |
56 | 57 | 58 | 59 | ### T013 - In Value List 60 | Verify numeric field is **in** the list of values. For example, to verify that table countries field region_id is always one of the values 1, 2, 3, or 4, we flag any row where region_id is NOT IN() the value list as shown below: 61 | ```sql 62 | SELECT CASE WHEN COUNT(*) > 0 THEN 'FAIL' ELSE 'P' END AS status 63 | FROM ( 64 | SELECT region_id 65 | , CASE WHEN region_id NOT IN(1,2,3,4) THEN 'FAIL' ELSE 'P' END AS status 66 | FROM demo_hr.countries 67 | ) t 68 | WHERE status <> 'P'; 69 | ``` 70 |
71 | 72 | 73 | 74 | ### T014 - Not In Value List 75 | Verify numeric field is **not** in the list of values. For example, to verify that table countries field region_id is never 97, 98, or 99, we flag any row where region_id is IN() the value list as shown below: 76 | ```sql 77 | SELECT CASE WHEN COUNT(*) > 0 THEN 'FAIL' ELSE 'P' END AS status 78 | FROM ( 79 | SELECT region_id 80 | , CASE WHEN region_id IN(97,98,99) THEN 'FAIL' ELSE 'P' END AS status 81 | FROM demo_hr.countries 82 | ) t 83 | WHERE status <> 'P'; 84 | ``` 85 |
86 | 87 | 88 | 89 | ### T015 - Multi Field Compare 90 | Verify numeric field values in relation to one another. For example, to verify that salary times commission_pct is always less than $10,000 in table employees: 91 | ```sql 92 | SELECT CASE WHEN COUNT(*) > 0 THEN 'FAIL' ELSE 'P' END AS status 93 | FROM ( 94 | SELECT salary, commission_pct 95 | , CASE WHEN salary * commission_pct > 10000 THEN 'FAIL' ELSE 'P' END AS status 96 | FROM demo_hr.employees 97 | ) t 98 | WHERE status <> 'P'; 99 | ``` 100 |
101 | -------------------------------------------------------------------------------- /postgresql/data_validation/rule_set_05__date_values.md: -------------------------------------------------------------------------------- 1 | ### Data Validation Examples - Postgresql 2 | [![License: CC0](https://img.shields.io/badge/License-CC0-red)](LICENSE "Creative Commons Zero License by DataResearchLabs (effectively = Public Domain") 3 | [![YouTube](https://img.shields.io/badge/YouTube-DataResearchLabs-brightgreen)](http://www.DataResearchLabs.com) 4 | #### [Return to Data Validation Home Page](https://github.com/DataResearchLabs/sql_scripts/blob/main/data_validation_scripts.md) 5 | 6 | 7 | # Rule Set #5 - Date Values 8 | 9 | ## Table of Contents 10 | - T016 - Not Null 11 | - T017 - Date Range 12 | - T018 - No Time Part 13 | - T019 - Has Time Part 14 | - T020 - Multi Field Compare 15 | - Bonus Tip - Joining Tables with 2 Pairs of Start-End Date Overlaps 16 |
17 | 18 | 19 | 20 | ### T016 - Not Null 21 | Verify date field is not null. For example, to verify that table countries has no NULLs in field date_last_updated: 22 | ```sql 23 | SELECT date_last_updated 24 | , CASE WHEN date_last_updated IS NULL THEN 'FAIL' ELSE 'P' END AS status 25 | FROM demo_hr.countries; 26 | ``` 27 |
28 | 29 | 30 | 31 | ### T017 - Date Range 32 | Verify date field is within specified range. For example, you can run the sql below to verify that table countries field date_last_updated is between 1/1/2021 and today. Note the use of CLOCK_TIMESTAMP() to represent the current date and time dynamically in PostgreSQL. Notice the CASE...WHEN...ELSE structure identifies two rejection codes: (1) date is too high, and (2) date is too low. Expected and actual values are displayed in the status output, all nicely packaged for easy troubleshooting. 33 | ```sql 34 | SELECT date_last_updated 35 | , CASE WHEN date_last_updated > CLOCK_TIMESTAMP() THEN 'REJ-01: Field date_last_updated cannot be in the future|exp<=' || CAST(CLOCK_TIMESTAMP() AS VARCHAR(20)) || '|act=' || CAST(date_last_updated AS VARCHAR(20)) 36 | WHEN date_last_updated < TO_DATE('01/01/2021', 'mm/dd/yyyy') THEN 'REJ-02: Field date_last_updated cannot be too old|exp>=1/1/2021|act=' || CAST(date_last_updated AS VARCHAR(20)) 37 | ELSE 'P' 38 | END AS status 39 | FROM demo_hr.countries; 40 | ``` 41 |
42 | 43 | 44 | 45 | ### T018 - No Time Part 46 | Verify date field is a date only, no time part present. For example, to verify that table employees has no time part in field hire_date (time part must be "12:00:00"): 47 | ```sql 48 | SELECT hire_date 49 | , CASE WHEN TO_CHAR(hire_date, 'hh:mi:ss') <> '12:00:00' THEN 'FAIL' ELSE 'P' END AS status 50 | FROM demo_hr.employees; 51 | ``` 52 |
53 | 54 | 55 | 56 | ### T019 - Has Time Part 57 | Verify date field is a date **and** time. For example, to verify that table countries has a time part in field date_last_updated (time part cannot be "12:00:00"): 58 | ```sql 59 | SELECT date_last_updated 60 | , CASE WHEN TO_CHAR(date_last_updated, 'hh:mi:ss') = '12:00:00' THEN 'FAIL' ELSE 'P' END AS status 61 | FROM demo_hr.countries; 62 | ``` 63 |
64 | 65 | 66 | 67 | ### T020 - Multi Field Compare 68 | Verify multiple date fields relative to each other. For example, to verify that field start_date must be < field end_date in table job_history (thus if start_date is >= end_date the test case fails): 69 | ```sql 70 | SELECT start_date, end_date 71 | , CASE WHEN start_date >= end_date THEN 'FAIL' ELSE 'P' END AS status 72 | FROM demo_hr.job_history; 73 | ``` 74 |
75 | 76 | 77 | 78 | ### Bonus Tip - Joining Tables with 2 Pairs of Start-End Date Overlaps 79 | So this is a fantastic tip I learned from a co-worker in healthcare back in 2011 (was it Jennifer C.? or Matt G.? or Jonathon P.? I can't remember). 80 | 81 | 82 | The **problem** is that you are trying to join two tables with logic where the table1.start_dt/end_dt's overlap with the table2.start_dt/end_dt. 83 | 84 | 85 | The **solution** is to **join on `table1.start_dt <= table2.end_dt AND table1.end_dt >= table2.start_dt`**. 86 | 87 | Here is why: 88 | 89 | ``` 90 | Scenario #1 = "Discard - No Overlap" Table #1 date range is completely before Table #2's (Tbl#1.End < Tbl#2.Start) 91 | Table#1: |Start ----------- End| 92 | Table#2: |Start ----------- End| 93 | 94 | 95 | Scenario #2 = "Include - Table #1 End = Table#2 Start" Table #1 date range ends exactly where Table #2's starts (Tbl#1.End = Tbl#2.Start AND Tbl#1.Start < Tbl#2.End) 96 | Table#1: |Start ----------- End| 97 | Table#2: |Start ----------- End| 98 | 99 | 100 | Scenario #3 = "Include - Table #1 Ends Midway Table #2 Span" Table #1 date range nicely overlaps Table #2's (Tbl#1.End > Tbl#2.Start AND Tbl#1.Start < Tbl#2.End) 101 | Table#1: |Start ----------- End| 102 | Table#2: |Start ----------- End| 103 | 104 | 105 | Scenario #4 = "Include - Table #1 Starts Midway Table #2 Span" Table #1 date range nicely overlaps Table #2's (Tbl#1.End > Tbl#2.Start AND Tbl#1.Start < Tbl#2.End) 106 | Table#1: |Start ----------- End| 107 | Table#2: |Start ----------- End| 108 | 109 | 110 | Scenario #5 = "Include - Table #1 Starts at Table #2 End" Table #1 date range starts exactly at Table #2's End (Tbl#1.End > Tbl#2.St AND Tbl#1.St = Tbl#2.End) 111 | Table#1: |Start ----------- End| 112 | Table#2: |Start ----------- End| 113 | 114 | 115 | Scenario #6 = "Discard - No Overlap" Table #1 date range is entirely after Table #2's (Tbl#1.Start > Tbl#2.End) 116 | Table#1: |Start ----------- End| 117 | Table#2: |Start ----------- End| 118 | 119
| ``` 120 | 121 | 122 |
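Putting the solution above into SQL, a minimal sketch might look like the following. The tables enrollment and coverage and the member_id join key are hypothetical (not part of the demo_hr schema); only the two date predicates are the point.

```sql
-- Return all pairs of rows whose date ranges overlap (Scenarios #2 thru #5 above)
SELECT t1.member_id, t1.start_dt, t1.end_dt, t2.start_dt, t2.end_dt
FROM enrollment t1
INNER JOIN coverage t2
   ON t1.member_id = t2.member_id
  AND t1.start_dt <= t2.end_dt    -- discards Scenario #6 (Table #1 entirely after Table #2)
  AND t1.end_dt   >= t2.start_dt; -- discards Scenario #1 (Table #1 entirely before Table #2)
```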
123 | -------------------------------------------------------------------------------- /postgresql/data_validation/rule_set_07__regular_expressions.md: -------------------------------------------------------------------------------- 1 | ### Data Validation Examples - Postgresql 2 | [![License: CC0](https://img.shields.io/badge/License-CC0-red)](LICENSE "Creative Commons Zero License by DataResearchLabs (effectively = Public Domain") 3 | [![YouTube](https://img.shields.io/badge/YouTube-DataResearchLabs-brightgreen)](http://www.DataResearchLabs.com) 4 | #### [Return to Data Validation Home Page](https://github.com/DataResearchLabs/sql_scripts/blob/main/data_validation_scripts.md) 5 | 6 | # Rule Set #7 - Regular Expressions 7 | 8 | ## Table of Contents 9 | - T045 - RegExp("IsPhoneNumber") 10 | - T046 - RegExp("IsSSN") 11 | - T047 - RegExp("IsZip5") 12 | - T048 - RegExp("IsZip5or9") 13 | - T049 - RegExp("IsZip9") 14 | - T050 - RegExp("OnlyText") 15 | - T051 - RegExp("OnlyNumeric") 16 | - T052 - RegExp("NoLeadTrailSpaces") 17 | - T053 - RegExp("NoWhitespaces") 18 | - T054 - RegExp("OnlyLowerCase") 19 | - T055 - RegExp("OnlyUpperCase") 20 | - T056 - RegExp("TitleCase") 21 | - T057 - RegExp("EmailAddress") 22 | - T058 - RegExp("IsUrl") 23 |
24 | 25 | 26 | 27 | ### T045 - RegExp("IsPhoneNumber") 28 | Verify text field is a phone number format. For example, to verify that field phone_number of table employees matches the US-style ###-###-#### pattern (separators can be a dash, dot, or space): 29 | ```sql 30 | SELECT phone_number 31 | , CASE WHEN NOT phone_number ~ '[0-9]{3}[-. ][0-9]{3}[-. ][0-9]{4}' THEN 'FAIL' ELSE 'P' END AS status 32 | FROM demo_hr.employees; 33 | ``` 34 |
35 | 36 | 37 | 38 | ### T046 - RegExp("IsSSN") 39 | Verify text field is a valid social security number (SSN) format. For example, to verify that field fake_ssn of table employees is a valid SSN format: 40 | ```sql 41 | SELECT fake_ssn 42 | , CASE WHEN NOT fake_ssn ~ '^[0-9]{3}-[0-9]{2}-[0-9]{4}$' THEN 'FAIL' ELSE 'P' END AS status 43 | FROM demo_hr.employees; 44 | ``` 45 |
46 | 47 | 48 | 49 | ### T047 - RegExp("IsZip5") 50 | Verify text field is a valid zipcode 5-digit format. For example, to verify that field zip5 of table employees is a valid format: 51 | ```sql 52 | SELECT zip5 53 | , CASE WHEN NOT zip5 ~ '^[0-9]{5}$' THEN 'FAIL' ELSE 'P' END AS status 54 | FROM demo_hr.employees; 55 | ``` 56 |
57 | 58 | 59 | 60 | ### T048 - RegExp("IsZip5or9") 61 | Verify text field is a valid zipcode 5- or 9-digit format. For example, to verify that field zip5or9 of table employees is a valid format: 62 | ```sql 63 | SELECT zip5or9 64 | , CASE WHEN NOT zip5or9 ~ '^[[:digit:]]{5}(-[[:digit:]]{4})?$' THEN 'FAIL' ELSE 'P' END AS status 65 | FROM demo_hr.employees; 66 | ``` 67 |
68 | 69 | 70 | 71 | ### T049 - RegExp("IsZip9") 72 | Verify text field is a valid zipcode 9-digit format. For example, to verify that field zip9 of table employees is a valid format: 73 | ```sql 74 | SELECT zip9 75 | , CASE WHEN NOT zip9 ~ '^[[:digit:]]{5}[-/.][[:digit:]]{4}$' THEN 'FAIL' ELSE 'P' END AS status 76 | FROM demo_hr.employees; 77 | ``` 78 |
79 | 80 | 81 | 82 | ### T050 - RegExp("OnlyText") 83 | Verify text field only contains alpha characters (and spaces). For example, to verify that field last_name of table employees is text only: 84 | ```sql 85 | SELECT last_name 86 | , CASE WHEN NOT last_name ~ '^[a-zA-Z ]+$' THEN 'FAIL' ELSE 'P' END AS status 87 | FROM demo_hr.employees; 88 | ``` 89 |
90 | 91 | 92 | 93 | ### T051 - RegExp("OnlyNumeric") 94 | Verify text field contains numeric characters only. For example, to verify that field zip5 of table employees is numeric digits only: 95 | ```sql 96 | SELECT zip5 97 | , CASE WHEN NOT zip5 ~ '^[0-9]+$' THEN 'FAIL' ELSE 'P' END AS status 98 | FROM demo_hr.employees; 99 | ``` 100 |
101 | 102 | 103 | 104 | ### T052 - RegExp("NoLeadTrailSpaces") 105 | Verify text field has no leading or trailing spaces. For example, to verify that field last_name of table employees is fully trimmed: 106 | ```sql 107 | SELECT last_name 108 | , CASE WHEN last_name ~ '(^\s)|(\s$)' THEN 'FAIL' ELSE 'P' END AS status 109 | FROM demo_hr.employees; 110 | ``` 111 |
112 | 113 | 114 | 115 | ### T053 - RegExp("NoWhitespaces") 116 | Verify text field has no whitespace (spaces, non breaking spaces, carriage return, line feed, etc.). For example, to verify that field job_id of table employees has no whitespace: 117 | ```sql 118 | SELECT job_id 119 | , CASE WHEN job_id ~ '(\s)+' THEN 'FAIL' ELSE 'P' END AS status 120 | FROM demo_hr.employees; 121 | ``` 122 |
123 | 124 | 125 | 126 | ### T054 - RegExp("OnlyLowerCase") 127 | Verify text field has only lower case characters. For example, (not really practical, but as a demo) to verify that the 3rd and 4th characters in the field first_name of table employees are lower case: 128 | ```sql 129 | SELECT first_name 130 | , CASE WHEN NOT SUBSTRING(first_name,3,2) ~ '^[a-z]+$' THEN 'FAIL' ELSE 'P' END AS status 131 | FROM demo_hr.employees; 132 | ``` 133 |
134 | 135 | 136 | 137 | ### T055 - RegExp("OnlyUpperCase") 138 | Verify text field has only upper case characters. For example, to verify that the 3rd and 4th characters in the field email of table employees are upper case: 139 | ```sql 140 | SELECT email 141 | , CASE WHEN NOT SUBSTRING(email,3,2) ~ '^[A-Z]+$' THEN 'FAIL' ELSE 'P' END AS status 142 | FROM demo_hr.employees; 143 | ``` 144 |
145 | 146 | 147 | 148 | ### T056 - RegExp("TitleCase") 149 | Verify text field is title case format (where the first letter of every word is upper case, and the rest are lower case). For example, to verify that the field first_name of table employees has proper casing: 150 | ```sql 151 | SELECT first_name, SUBSTRING(first_name,1,1) AS first_letter 152 | , CASE WHEN NOT SUBSTRING(first_name,1,1) ~ '([A-Z])' THEN 'REJ-01: Field first_name first character not upper case|exp=Like"[A-Z]"|act=' || first_name 153 | WHEN first_name NOT LIKE '% %' THEN 'P' -- Only one word, so no space + first character to check for uppercase 154 | WHEN NOT first_name ~ '(\s[A-Z]){1}' THEN 'REJ-02: Field first_name failed RegExpression check|exp=Like"(\s[A-Z]){1}"|act=' || first_name 155 | ELSE 'P' 156 | END AS status 157 | FROM demo_hr.employees; 158 | ``` 159 |
160 | 161 | 162 | 163 | ### T057 - RegExp("EmailAddress") 164 | Verify text field is a properly formatted email address. For example, to verify that the field email_address of table employees is properly formatted: 165 | ```sql 166 | SELECT email_address 167 | , CASE WHEN NOT email_address ~ '^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,4}$' THEN 'FAIL' ELSE 'P' END AS status 168 | FROM demo_hr.employees; 169 | ``` 170 |
171 | 172 | 173 | 174 | ### T058 - RegExp("IsUrl") 175 | Verify text field is a properly formatted URL. For example, to verify that the field url of table departments is properly formatted with "http://" or "https://": 176 | ```sql 177 | SELECT url 178 | , CASE WHEN NOT url ~ '(http)(s)?(:\/\/)' THEN 'FAIL' ELSE 'P' END AS status 179 | FROM demo_hr.departments; 180 | ``` 181 |
182 | -------------------------------------------------------------------------------- /postgresql/data_validation/rule_set_09__defect_regression.md: -------------------------------------------------------------------------------- 1 | ### Data Validation Examples - Postgresql 2 | [![License: CC0](https://img.shields.io/badge/License-CC0-red)](LICENSE "Creative Commons Zero License by DataResearchLabs (effectively = Public Domain") 3 | [![YouTube](https://img.shields.io/badge/YouTube-DataResearchLabs-brightgreen)](http://www.DataResearchLabs.com) 4 | #### [Return to Data Validation Home Page](https://github.com/DataResearchLabs/sql_scripts/blob/main/data_validation_scripts.md) 5 |
6 | 7 | # Rule Set #9 - Defect Regression 8 | 9 | ## Summary 10 | Where possible, it is a good idea to cover known defects with a test case. This enables automated regression testing for free. You can simply run this script to quickly re-check whether a defect has been re-introduced. 11 | 12 | There are a lot of caveats that will block you from covering every defect: 13 | * If anything but the simplest of setup and tear-down data is required 14 | * If processes or jobs or applications or ETL must be triggered 15 | * If there are interactions with other databases or files not on the current server 16 | 17 | That said, you can still frequently set up static trip wires to monitor the data for recurrence of old bugs. 18 | 19 | Some real-world examples from my past include: 20 | * A defect where email addresses stored in the database had TAB characters. I set up a simple CHAR(9) check that ran daily and cross-referenced the defect number in the SQL return results. 21 | * A defect where customer-specific HTML snippets for email reports had bad tags manually entered by company analysts working with the client. Obviously fixing the app to avoid those data entries would be best, but that was a different department that was going to get to it later. So, I set up a daily alert to check for those HTML tags and some other similar ones that might be accidentally introduced. 22 | * A defect where internal data entry folks on rare occasions had to key in customer names. They were supposed to be all upper case for other reasons in the system, but the analysts sometimes entered lower case. Developers were not going to get around to fixing it for a couple of months (other priorities, this was a very low frequency issue). So I set up a daily check to trip an alarm and notify via email the appropriate person to correct their data. 23 | * Basically, you can take any of the test case examples in Rule Sets #1 thru #8 and apply them as regression tests for defects when appropriate.
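For instance, the first example above (TAB characters in email addresses) could be covered with a trip wire like the following sketch. The defect number is hypothetical, and the check is adapted here to the email_address field of demo_hr.employees; CHR(9) is the PostgreSQL function for the TAB character.

```sql
-- Trip wire for hypothetical defect DFCT-1234:
-- fail if any email_address contains a TAB character (CHR(9))
SELECT CASE WHEN COUNT(*) > 0 THEN 'FAIL: DFCT-1234 regressed' ELSE 'P' END AS status
FROM demo_hr.employees
WHERE POSITION(CHR(9) IN email_address) > 0;
```

Embedding the defect number in the status text makes the alert self-documenting when the daily run reports a failure.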
24 | -------------------------------------------------------------------------------- /postgresql/schemadiff/simple_schema_dump.sql: -------------------------------------------------------------------------------- 1 | ------------------------------------------------------------------------------------ 2 | -- Simple Schema Dump: 3 | -- This SQL script will dump table, column, key, and index design-related metadata 4 | -- so that you can copy-paste or export to a text file. 5 | -- Even better, you can make other snapshots over time (same database schema at earlier 6 | -- points in time), OR in different environments (DEV, PPMO, STAGE, PROD). Then, 7 | -- using your favorite free Text File Diff Tool (DiffMerge, ExamDiff, etc.) you 8 | -- can compare snapshots to quickly isolate and identify what changed over time 9 | -- or is different between environments. 10 | ------------------------------------------------------------------------------------ 11 | -- Platform: PostgreSQL Server 12 | -- Author: DataResearchLabs 13 | -- GitHub: https://github.com/DataResearchLabs/sql_scripts 14 | -- YouTube Tutorials: https://www.youtube.com/playlist?list=PLVHoUDdbskURPrZpH0Zkzfa1OQjNfj2Gg 15 | ---------------------------------------------------------------------------------- 16 | WITH vars 17 | AS ( 18 | SELECT 'pg_catalog' AS v_SchemaName 19 | ) 20 | 21 | , baseTbl 22 | AS ( 23 | SELECT table_schema AS SchemaName, table_type, table_name 24 | FROM INFORMATION_SCHEMA.TABLES 25 | WHERE table_schema = (SELECT v_SchemaName FROM vars) 26 | ) 27 | 28 | , metaForTbl 29 | AS ( 30 | SELECT t.SchemaName 31 | , t.table_name AS TableName 32 | , CONCAT( '(' 33 | , CASE WHEN t.table_type = 'BASE TABLE' THEN 'Table' 34 | WHEN t.table_type = 'VIEW' THEN 'View' 35 | ELSE 'UK' 36 | END 37 | , ')' 38 | ) AS ObjectType 39 | , t.table_name AS ObjectName 40 | , '(Exists)' AS PropertyName 41 | , ' ' AS PropertyValue 42 | FROM baseTbl t 43 | ) 44 | 45 | , metaForCol_dataType 46 | AS ( 47 | SELECT ft.SchemaName,
ft.table_name AS TableName, 'Column' AS ObjectType, tut.column_name AS ObjectName 48 | , '2' AS PropertyName 49 | , CONCAT(COALESCE(tut.data_type, 'unknown') 50 | , CASE WHEN tut.CHARACTER_MAXIMUM_LENGTH IS NOT NULL 51 | OR tut.NUMERIC_PRECISION IS NOT NULL 52 | OR tut.NUMERIC_SCALE IS NOT NULL THEN '(' 53 | ELSE '' 54 | END 55 | , CASE WHEN tut.CHARACTER_MAXIMUM_LENGTH IS NOT NULL THEN CAST(tut.CHARACTER_MAXIMUM_LENGTH AS VARCHAR(10)) ELSE '' END 56 | , CASE WHEN tut.DATA_TYPE IN('date','datetime','timestamp') THEN CAST(tut.DATETIME_PRECISION AS VARCHAR(10)) 57 | WHEN tut.NUMERIC_PRECISION IS NULL THEN '' 58 | ELSE CONCAT(CAST(tut.NUMERIC_PRECISION AS VARCHAR(10))) 59 | END 60 | , CASE WHEN tut.NUMERIC_SCALE IS NOT NULL 61 | AND tut.NUMERIC_SCALE >0 THEN CONCAT(',', CAST(tut.NUMERIC_SCALE AS VARCHAR(10))) ELSE '' 62 | END 63 | , CASE WHEN tut.CHARACTER_MAXIMUM_LENGTH IS NOT NULL 64 | OR tut.NUMERIC_PRECISION IS NOT NULL 65 | OR tut.NUMERIC_SCALE IS NOT NULL THEN ')' 66 | END 67 | ) AS PropertyValue 68 | FROM INFORMATION_SCHEMA.COLUMNS tut 69 | INNER JOIN baseTbl ft ON ft.SchemaName = tut.TABLE_SCHEMA AND ft.TABLE_NAME = tut.TABLE_NAME 70 | ) 71 | 72 | , metaForCol_nullable 73 | AS ( 74 | SELECT ft.SchemaName, ft.table_name AS TableName, 'Column' AS ObjectType, tut.column_name AS ObjectName 75 | , '3' AS PropertyName, CASE WHEN tut.IS_NULLABLE = 'YES' THEN 'NULL' ELSE 'NOT NULL' END AS PropertyValue 76 | FROM INFORMATION_SCHEMA.COLUMNS tut 77 | INNER JOIN baseTbl ft ON ft.SchemaName = tut.TABLE_SCHEMA AND ft.table_name = tut.table_name 78 | ) 79 | 80 | , metaForCol_ordpos 81 | AS ( 82 | SELECT ft.SchemaName, ft.table_name AS TableName, 'Column' AS ObjectType, tut.column_name AS ObjectName 83 | , '1' AS PropertyName 84 | , CASE WHEN tut.ORDINAL_POSITION IS NULL THEN '' 85 | ELSE LPAD( CAST(tut.ORDINAL_POSITION AS VARCHAR(3)), 3, '0') 86 | END AS PropertyValue 87 | FROM INFORMATION_SCHEMA.COLUMNS tut 88 | INNER JOIN baseTbl ft ON ft.SchemaName = tut.TABLE_SCHEMA AND 
ft.table_name = tut.table_name 89 | ) 90 | 91 | , metaAllCols 92 | AS ( 93 | SELECT schemaname, tablename, objecttype, objectname, 'Properties' AS propertyname 94 | , STRING_AGG(propertyvalue, ' | ' ORDER BY propertyname, propertyvalue) AS propertyvalue 95 | FROM ( 96 | SELECT * FROM metaForCol_dataType 97 | UNION SELECT * FROM metaForCol_nullable 98 | UNION SELECT * FROM metaForCol_ordpos 99 | ) t 100 | GROUP BY schemaname, tablename, objecttype, objectname 101 | ) 102 | 103 | , metaForKeys 104 | AS ( 105 | SELECT cons.TABLE_SCHEMA AS SchemaName, cons.TABLE_NAME AS TableName 106 | , CASE WHEN cons.constraint_type = 'PRIMARY KEY' THEN 'PKey' 107 | WHEN cons.constraint_type = 'UNIQUE' THEN 'UKey' 108 | WHEN cons.constraint_type = 'FOREIGN KEY' THEN 'FKey' 109 | ELSE 'X' 110 | END AS ObjectType 111 | , cons.constraint_name AS ObjectName 112 | , 'FieldList' AS PropertyName 113 | , STRING_AGG(kcu.COLUMN_NAME, ',' ORDER BY kcu.ORDINAL_POSITION) AS PropertyValue 114 | FROM INFORMATION_SCHEMA.TABLE_CONSTRAINTS cons 115 | INNER JOIN INFORMATION_SCHEMA.KEY_COLUMN_USAGE kcu 116 | ON cons.TABLE_SCHEMA = kcu.TABLE_SCHEMA 117 | AND cons.TABLE_NAME = kcu.TABLE_NAME AND cons.CONSTRAINT_NAME = kcu.CONSTRAINT_NAME -- match on constraint name too, so multiple constraints on one table do not cross-join 118 | WHERE cons.TABLE_NAME IN(SELECT DISTINCT table_name FROM baseTbl) 119 | AND cons.TABLE_SCHEMA = (SELECT v_SchemaName FROM vars) 120 | AND cons.constraint_type IN('PRIMARY KEY','FOREIGN KEY','UNIQUE') 121 | GROUP BY cons.TABLE_SCHEMA, cons.TABLE_NAME, cons.CONSTRAINT_TYPE, cons.CONSTRAINT_NAME 122 | ) 123 | 124 | , metaForIdxs 125 | AS ( 126 | SELECT (SELECT v_SchemaName FROM vars) AS SchemaName 127 | , tablename AS TableName 128 | , 'Index' AS ObjectType 129 | , indexname AS ObjectName 130 | , 'FieldList' AS PropertyName 131 | , REPLACE(SUBSTR(indexdef, POSITION('(' IN indexdef)+ 1), ')', '') AS PropertyValue 132 | FROM pg_catalog.pg_indexes 133 | WHERE schemaname = (SELECT v_SchemaName FROM vars) 134 | ) 135 | 136 | , allMetadata 137 | AS ( 138 | SELECT * FROM metaForTbl 139 | UNION SELECT * FROM metaAllCols
140 | UNION SELECT * FROM metaForKeys 141 | UNION SELECT * FROM metaForIdxs 142 | ) 143 | 144 | SELECT CASE WHEN objecttype IN('(Table)','(View)') THEN schemaname ELSE ' ' END AS schema_nm 145 | , CASE WHEN objecttype IN('(Table)','(View)') THEN tablename ELSE ' ' END AS tbl_nm 146 | , objecttype AS obj_typ, objectname AS obj_nm, /*propertyname,*/ propertyvalue AS properties 147 | FROM allMetadata 148 | ORDER BY schemaname, tablename, objecttype 149 | , CASE WHEN objecttype='Column' THEN propertyvalue ELSE ' ' END 150 | , objectname, propertyname 151 | 152 | 153 | -------------------------------------------------------------------------------- /schemadiff_scripts.md: -------------------------------------------------------------------------------- 1 | # Multiple Platform SchemaDiff Scripts 2 | [![License: CC0](https://img.shields.io/badge/License-CC0-red)](LICENSE "Creative Commons Zero License by DataResearchLabs (effectively = Public Domain") 3 | [![YouTube](https://img.shields.io/badge/YouTube-DataResearchLabs-brightgreen)](http://www.DataResearchLabs.com) 4 | 5 | ## Overview 6 | There are many reasons you may need a SQL script that captures table, view, column, and key properties. 7 | It all boils down to comparing different states to quickly identify what changed, or what is different. 8 | 9 | 10 | * Maybe you need to quickly isolate what got missed in the migrations from DEV to TEST environments. 11 | * Maybe you need to quickly identify what changed between this release (AFTER) and the prior release (BEFORE). 12 | * Maybe you want to run the script daily and output state snapshots to text files so that in the event of an emergency you can quickly identify what changed between given dates. 13 | 14 | Using your favorite text diff-ing tool, here is what two sample schemas (PPMO-OldSchema vs. DEV-NewSchema) might look like side-by-side (red markup lines added): 15 | 16 | 17 | 18 | ***If you like these scripts, be sure to click the "Star" button above in GitHub.***
19 |
20 | ***Also, be sure to visit or subscribe to our YouTube channel*** www.DataResearchLabs.com!
21 |
22 |
23 | Links to the script source code as well as video tutorials are listed below, by platform (MSSQL, Oracle, MySQL, etc.): 24 |
25 |
26 | 27 | 28 | 29 | 30 | 34 | 41 | 48 | 49 | 50 | 51 | 52 | 53 | 57 | 65 | 72 | 73 | 74 | 75 | 76 | 77 | 81 | 88 | 95 | 96 | 97 | 98 | 99 | 100 | 104 | 111 | 118 | 119 | 120 | 121 | 122 | 126 | 133 | 140 | 141 | 142 | 143 |
31 |
32 | 33 |
35 | 36 | 37 | ## Greenplum 38 | * [Greenplum "Schema Dump" Script](https://raw.githubusercontent.com/DataResearchLabs/sql_scripts/main/greenplum/schemadiff/simple_schema_dump.sql)
39 | * [Greenplum "Schema Dump" Tutorial](http://www.youtube.com/watch?feature=player_embedded&v=U8p6YJZtPM8)
40 |
42 | 43 | 44 | Overview Video 45 | 46 | 47 |
54 |
55 | 56 |
58 | 59 | 60 | ## MS SQL Server 61 | * [MSSQL "Schema Dump" Script](https://raw.githubusercontent.com/DataResearchLabs/sql_scripts/main/mssql/schemadiff/simple_schema_dump.sql)
62 | * [MSSQL "Schema Dump" Tutorial](http://www.youtube.com/watch?feature=player_embedded&v=Vxf3SM38Ar0)
63 | * *Note: MSSQL 2017 forward only (STRING_AGG fx)* 64 |
66 | 67 | 68 | Overview Video 69 | 70 | 71 |
78 |
79 | 80 |
82 | 83 | 84 | ## MySQL 85 | * [MySQL "Schema Dump" Script](https://raw.githubusercontent.com/DataResearchLabs/sql_scripts/main/mysql/schemadiff/simple_schema_dump.sql)
86 | * [MySQL "Schema Dump" Tutorial](http://www.youtube.com/watch?feature=player_embedded&v=LwfagEfficU)
87 |
89 | 90 | 91 | Overview Video 92 | 93 | 94 |
101 |
102 | 103 |
105 | 106 | 107 | ## Oracle 108 | * [Oracle "Schema Dump" Script](https://raw.githubusercontent.com/DataResearchLabs/sql_scripts/main/oracle/schemadiff/simple_schema_dump.sql)
109 | * [Oracle "Schema Dump" Tutorial](http://www.youtube.com/watch?feature=player_embedded&v=kzemPW1156s)
110 |
112 | 113 | 114 | Overview Video 115 | 116 | 117 |
123 |
124 | 125 |
127 | 128 | 129 | ## PostgreSQL 130 | * [PostgreSQL "Schema Dump" Script](https://raw.githubusercontent.com/DataResearchLabs/sql_scripts/main/postgresql/schemadiff/simple_schema_dump.sql)
131 | * [PostgreSQL "Schema Dump" Tutorial](http://www.youtube.com/watch?feature=player_embedded&v=I3Gr2d7tob0)
132 |
134 | 135 | 136 | Overview Video 137 | 138 | 139 |
144 | 145 | 146 | --------------------------------------------------------------------------------