├── .DS_Store ├── 00-quickstarts ├── .DS_Store ├── README.md ├── databricks-concurrency │ ├── 01-concurrency-testing-notebook.py │ └── concurrency_framework_1.0.py ├── design-patterns │ ├── .DS_Store │ ├── Advanced Notebooks │ │ ├── .DS_Store │ │ ├── DBT Incremental Model Example │ │ │ ├── .DS_Store │ │ │ └── optimized_dbt │ │ │ │ ├── .DS_Store │ │ │ │ ├── .gitignore │ │ │ │ ├── README.md │ │ │ │ ├── analyses │ │ │ │ └── .gitkeep │ │ │ │ ├── dbt_project.yml │ │ │ │ ├── macros │ │ │ │ ├── .gitkeep │ │ │ │ ├── create_bronze_sensors_identity_table.sql │ │ │ │ └── create_bronze_users_identity_table.sql │ │ │ │ ├── models │ │ │ │ ├── example │ │ │ │ │ ├── gold_hourly_summary_stats_7_day_rolling.sql │ │ │ │ │ ├── gold_smoothed_sensors_3_day_rolling.sql │ │ │ │ │ ├── silver_sensors_scd_1.sql │ │ │ │ │ └── silver_users_scd_1.sql │ │ │ │ └── sources.yml │ │ │ │ ├── seeds │ │ │ │ └── .gitkeep │ │ │ │ ├── snapshots │ │ │ │ ├── .gitkeep │ │ │ │ ├── silver_sensors_scd_2.sql │ │ │ │ └── silver_users_scd_2.sql │ │ │ │ └── tests │ │ │ │ └── .gitkeep │ │ ├── End to End Procedural Migration Pattern │ │ │ └── Procedural Migration Pattern with SCD2 Example.py │ │ ├── Multi-plexing with Autoloader │ │ │ └── Option 1: Actually Multi-plexing tables on write │ │ │ │ ├── Child Job Template.py │ │ │ │ └── Controller Job.py │ │ ├── Parallel Custom Named File Exports │ │ │ ├── Parallel File Exports - Python Version.py │ │ │ └── Parallel File Exports.py │ │ ├── SCD Design Patterns │ │ │ └── Advanced CDC With SCD in Databricks.py │ │ └── airflow_sql_files │ │ │ ├── 0_ddls.sql │ │ │ ├── 1_sensors_table_copy_into.sql │ │ │ ├── 2_sensors_table_merge.sql │ │ │ ├── 3_sensors_table_optimize.sql │ │ │ ├── 4_sensors_table_gold_aggregate.sql │ │ │ └── 5_clean_up_batch.sql │ ├── Step 1 - SQL EDW Pipeline.sql │ ├── Step 10 - Lakehouse Federation.py │ ├── Step 11 - SQL Orchestration in Production.py │ ├── Step 12 - SCD2 - SQL EDW Pipeline.sql │ ├── Step 13 - Migrating Identity Columns.sql │ ├── Step 14 - Using the Query Profile.sql │ ├── Step 15 - Dynamic & Parameterized SQL with Variables.py │ ├── Step 16 - Using System Tables.py │ ├── Step 2 - Optimize your Delta Tables.py │ ├── Step 3 - DLT Version Simple SQL EDW Pipeline.sql │ ├── Step 4 - Create Gold Layer Analytics Tables.sql │ ├── Step 5 - Unified Batch and Streaming.py │ ├── Step 6 - Streaming Table Design Patterns.sql │ ├── Step 7 - COPY INTO Loading Patterns.py │ ├── Step 8 - Liquid Clustering Delta Tables.py │ └── Step 9 - Using SQL Functions.py ├── dlt-cdc │ ├── 01-Retail_DLT_CDC_SQL.sql │ ├── 02-Retail_DLT_CDC_Python.py │ ├── 03-Retail_DLT_CDC_Monitoring.py │ ├── 04-Retail_DLT_CDC_Full.py │ └── _resources │ │ ├── 00-Data_CDC_Generator.py │ │ ├── 01-load-data-quality-dashboard.py │ │ ├── LICENSE.py │ │ ├── NOTICE.py │ │ └── README.py ├── dlt-loans │ ├── 01-DLT-Loan-pipeline-SQL.sql │ ├── 02-DLT-Loan-pipeline-PYTHON.py │ ├── 03-Log-Analysis.sql │ └── _resources │ │ ├── 00-Loan-Data-Generator.py │ │ ├── 01-load-data-quality-dashboard.py │ │ ├── LICENSE.py │ │ ├── NOTICE.py │ │ └── README.py ├── lakehouse-retail-c360 │ ├── 00-churn-introduction-lakehouse.sql │ ├── 01-Data-ingestion │ │ ├── 01.1-DLT-churn-SQL.sql │ │ ├── 01.2-DLT-churn-Python-UDF.py │ │ ├── 01.3-DLT-churn-python.py │ │ └── plain-spark-delta-pipeline │ │ │ └── 01.5-Delta-pipeline-spark-churn.py │ ├── 02-Data-governance │ │ └── 02-UC-data-governance-security-churn.sql │ ├── 03-BI-data-warehousing │ │ └── 03-BI-Datawarehousing.sql │ ├── 04-Data-Science-ML │ │ ├── 04.1-automl-churn-prediction.py │ │ ├── 
04.2-automl-generated-notebook.py │ │ └── 04.3-running-inference.py │ ├── 05-Workflow-orchestration │ │ └── 05-Workflow-orchestration-churn.py │ └── _resources │ │ ├── 00-global-setup.py │ │ ├── 00-prep-data-db-sql.py │ │ ├── 00-setup-uc.py │ │ ├── 00-setup.py │ │ ├── 01-load-data.py │ │ ├── 02-create-churn-tables.py │ │ ├── LICENSE.py │ │ ├── NOTICE.py │ │ └── README.py └── llm-dolly-chatbot │ ├── 01-Dolly-Introduction.py │ ├── 02-Data-preparation.py │ ├── 03-Q&A-prompt-engineering-for-dolly.py │ ├── 04-chat-bot-prompt-engineering-dolly.py │ └── _resources │ ├── 00-global-setup.py │ ├── 00-init.py │ ├── LICENSE.py │ ├── NOTICE.py │ └── README.py ├── 10-migrations ├── .DS_Store ├── 05-uc-upgrade │ ├── 00-Upgrade-database-to-UC.sql │ └── _resources │ │ ├── 00-setup.py │ │ ├── LICENSE.py │ │ ├── NOTICE.py │ │ └── README.py ├── 10-hms-uc-migration.py ├── README.md ├── Using DBSQL Serverless Client Example.py ├── Using DBSQL Serverless Transaction Manager Example.py ├── Using Delta Helpers Notebook Example.py ├── Using Delta Logger Example.py ├── Using Delta Logger.py ├── Using Delta Merge Helpers Example.py ├── Using Streaming Tables and MV Orchestrator.py ├── Using Transaction Manager Example.py └── helperfunctions │ ├── .DS_Store │ ├── __init__.py │ ├── build │ └── lib │ │ ├── datavalidator.py │ │ ├── dbsqlclient.py │ │ ├── dbsqltransactions.py │ │ ├── deltahelpers.py │ │ ├── deltalogger.py │ │ ├── redshiftchecker.py │ │ ├── stmvorchestrator.py │ │ └── transactions.py │ ├── datavalidator.py │ ├── dbsqlclient.py │ ├── dbsqltransactions.py │ ├── deltahelpers.py │ ├── deltalogger.py │ ├── dist │ └── helperfunctions-1.0.0-py3-none-any.whl │ ├── helperfunctions.egg-info │ ├── PKG-INFO │ ├── SOURCES.txt │ ├── dependency_links.txt │ ├── requires.txt │ └── top_level.txt │ ├── redshiftchecker.py │ ├── requirements.txt │ ├── setup.py │ ├── stmvorchestrator.py │ └── transactions.py ├── 20-operational-excellence └── README.md ├── 30-performance ├── README.md ├── TPC-DS Runner │ ├── CONTRIBUTING.md │ ├── README.md │ ├── assets │ │ └── images │ │ │ ├── cluster.png │ │ │ ├── filters.png │ │ │ ├── main_notebook.png │ │ │ ├── run_all.png │ │ │ └── workflow.png │ ├── constants.py │ ├── main.py │ ├── notebooks │ │ ├── create_data_and_queries.scala │ │ └── run_tpcds_benchmarking.py │ └── utils │ │ ├── databricks_client.py │ │ ├── general.py │ │ └── run.py ├── dbsql-query-replay-tool │ ├── 00-Functions.py │ ├── 01-Query_Replay_Tool.py │ └── README.md └── delta-optimizer │ ├── __init__.py │ ├── customer-facing-delta-optimizer │ ├── Query Profile Builder Only.py │ ├── Step 1_ Optimization Strategy Builder.py │ ├── Step 2_ Strategy Runner.py │ ├── Step 3_ Query History and Profile Analyzer.py │ └── deltaoptimizer-1.5.5-py3-none-any.whl │ └── deltaoptimizer │ ├── .DS_Store │ ├── .gitignore │ ├── .vscode │ └── settings.json │ ├── __init__.py │ ├── build │ └── lib │ │ └── deltaoptimizer.py │ ├── deltaoptimizer.egg-info │ ├── PKG-INFO │ ├── SOURCES.txt │ ├── dependency_links.txt │ ├── requires.txt │ └── top_level.txt │ ├── deltaoptimizer.py │ ├── dist │ └── deltaoptimizer-1.5.5-py3-none-any.whl │ └── setup.py ├── 40-observability ├── README.md ├── data-profiling │ ├── 01-create-data-profile.py │ ├── 02-create-data-profile-multi-schema.py │ └── 03-dbfs-profiler.py ├── dbsql-logging │ ├── 00-Config.py │ ├── 01-Functions.py │ ├── 02-Initialization.py │ ├── 03-APIs_to_Delta.py │ ├── 04-Metrics.sql │ ├── 05-Alert_Syntax.sql │ ├── 99-Maintenance.py │ └── README.md ├── dbsql-query-history-sync │ ├── README.md │ ├── 
__init__.py │ ├── dist │ │ ├── dbsql_query_history_sync-0.0.1-py3-none-any.whl │ │ └── dbsql_query_history_sync-0.0.1.tar.gz │ ├── examples │ │ ├── dbsql_query_sync_example.py │ │ └── standalone_dbsql_get_query_history_example.py │ ├── pyproject.toml │ └── src │ │ ├── __init__.py │ │ └── dbsql_query_history_sync │ │ ├── __init__.py │ │ ├── delta_sync.py │ │ └── queries_api.py └── stream-monitoring │ └── 01-stream-monitoring.py ├── CONTRIBUTING.md ├── LICENSE ├── README.md └── concurrency_framework_1.0.py /.DS_Store: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/AbePabbathi/lakehouse-tacklebox/9bbbf0a7181193091d887ed57e7394a466c369cf/.DS_Store -------------------------------------------------------------------------------- /00-quickstarts/.DS_Store: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/AbePabbathi/lakehouse-tacklebox/9bbbf0a7181193091d887ed57e7394a466c369cf/00-quickstarts/.DS_Store -------------------------------------------------------------------------------- /00-quickstarts/README.md: -------------------------------------------------------------------------------- 1 | #### Quickstarts 2 | 3 | This section consists of tools that will help new Customers quickly setup a Lakehouse and get up and running with Databricks. This is not production grade code. This is purely for evaluation and trying out a Lakehouse quickly 4 | 5 | # 6 | 1. [dbdemos.ai](https://www.dbdemos.ai/). 7 | 2. [DBSQL Concurrency Test](https://github.com/AbePabbathi/lakehouse-tacklebox/tree/master/00-quickstarts/databricks-concurrency) 8 | 3. [EDW ETL Demo](https://github.com/databricks/edw-etl-demo) 9 | 4. [TPC-DI ETL Demo](https://github.com/shannon-barrow/databricks-tpc-di). 
Please read the [blog post](https://www.databricks.com/blog/2023/04/14/how-we-performed-etl-one-billion-records-under-1-delta-live-tables.html) for more info -------------------------------------------------------------------------------- /00-quickstarts/design-patterns/.DS_Store: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/AbePabbathi/lakehouse-tacklebox/9bbbf0a7181193091d887ed57e7394a466c369cf/00-quickstarts/design-patterns/.DS_Store -------------------------------------------------------------------------------- /00-quickstarts/design-patterns/Advanced Notebooks/.DS_Store: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/AbePabbathi/lakehouse-tacklebox/9bbbf0a7181193091d887ed57e7394a466c369cf/00-quickstarts/design-patterns/Advanced Notebooks/.DS_Store -------------------------------------------------------------------------------- /00-quickstarts/design-patterns/Advanced Notebooks/DBT Incremental Model Example/.DS_Store: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/AbePabbathi/lakehouse-tacklebox/9bbbf0a7181193091d887ed57e7394a466c369cf/00-quickstarts/design-patterns/Advanced Notebooks/DBT Incremental Model Example/.DS_Store -------------------------------------------------------------------------------- /00-quickstarts/design-patterns/Advanced Notebooks/DBT Incremental Model Example/optimized_dbt/.DS_Store: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/AbePabbathi/lakehouse-tacklebox/9bbbf0a7181193091d887ed57e7394a466c369cf/00-quickstarts/design-patterns/Advanced Notebooks/DBT Incremental Model Example/optimized_dbt/.DS_Store -------------------------------------------------------------------------------- /00-quickstarts/design-patterns/Advanced Notebooks/DBT Incremental Model Example/optimized_dbt/.gitignore: -------------------------------------------------------------------------------- 1 | 2 | target/ 3 | dbt_packages/ 4 | logs/ 5 | -------------------------------------------------------------------------------- /00-quickstarts/design-patterns/Advanced Notebooks/DBT Incremental Model Example/optimized_dbt/README.md: -------------------------------------------------------------------------------- 1 | Welcome to your new dbt project! 
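This project expects a `profiles.yml` profile named `optimized_dbt` (see `dbt_project.yml`). A minimal sketch for the dbt-databricks adapter is shown below; the host, HTTP path, and token are placeholders you replace with your own workspace values:

```yaml
optimized_dbt:
  target: dev
  outputs:
    dev:
      type: databricks
      catalog: main                 # matches models/sources.yml
      schema: dbt_optimized
      host: <your-workspace>.cloud.databricks.com
      http_path: /sql/1.0/warehouses/<warehouse-id>
      token: "{{ env_var('DATABRICKS_TOKEN') }}"
      threads: 4
```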
2 | 3 | ### Using the starter project 4 | 5 | Try running the following commands: 6 | - dbt run 7 | - dbt test 8 | 9 | 10 | ### Resources: 11 | - Learn more about dbt [in the docs](https://docs.getdbt.com/docs/introduction) 12 | - Check out [Discourse](https://discourse.getdbt.com/) for commonly asked questions and answers 13 | - Join the [chat](https://community.getdbt.com/) on Slack for live discussions and support 14 | - Find [dbt events](https://events.getdbt.com) near you 15 | - Check out [the blog](https://blog.getdbt.com/) for the latest news on dbt's development and best practices 16 | -------------------------------------------------------------------------------- /00-quickstarts/design-patterns/Advanced Notebooks/DBT Incremental Model Example/optimized_dbt/analyses/.gitkeep: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/AbePabbathi/lakehouse-tacklebox/9bbbf0a7181193091d887ed57e7394a466c369cf/00-quickstarts/design-patterns/Advanced Notebooks/DBT Incremental Model Example/optimized_dbt/analyses/.gitkeep -------------------------------------------------------------------------------- /00-quickstarts/design-patterns/Advanced Notebooks/DBT Incremental Model Example/optimized_dbt/dbt_project.yml: -------------------------------------------------------------------------------- 1 | # Name your project! Project names should contain only lowercase characters 2 | # and underscores. A good package name should reflect your organization's 3 | # name or the intended use of these models 4 | name: 'optimized_dbt' 5 | version: '1.0.0' 6 | config-version: 2 7 | 8 | # This setting configures which "profile" dbt uses for this project. 9 | profile: 'optimized_dbt' 10 | 11 | model-paths: ["models"] 12 | analysis-paths: ["analyses"] 13 | test-paths: ["tests"] 14 | seed-paths: ["seeds"] 15 | macro-paths: ["macros"] 16 | snapshot-paths: ["snapshots"] 17 | 18 | clean-targets: # directories to be removed by `dbt clean` 19 | - "target" 20 | - "dbt_packages" 21 | 22 | models: 23 | optimized_dbt: 24 | +materialized: table 25 | +tblproperties: {'delta.feature.allowColumnDefaults': 'supported', 'delta.columnMapping.mode' : 'name', 'delta.enableDeletionVectors': 'true'} 26 | 27 | # Optional for logging dbt run info to Delta tables 28 | # on-run-end: "{{ dbt_artifacts.upload_results(results) }}" 29 | 30 | 31 | -------------------------------------------------------------------------------- /00-quickstarts/design-patterns/Advanced Notebooks/DBT Incremental Model Example/optimized_dbt/macros/.gitkeep: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/AbePabbathi/lakehouse-tacklebox/9bbbf0a7181193091d887ed57e7394a466c369cf/00-quickstarts/design-patterns/Advanced Notebooks/DBT Incremental Model Example/optimized_dbt/macros/.gitkeep -------------------------------------------------------------------------------- /00-quickstarts/design-patterns/Advanced Notebooks/DBT Incremental Model Example/optimized_dbt/macros/create_bronze_sensors_identity_table.sql: -------------------------------------------------------------------------------- 1 | {% macro create_bronze_sensors_identity_table() %} 2 | -- Seprately DDL creation for doing things like custom / rigid schema DDL or Identity columns 3 | -- Use Sparingly, as DBT approaches DDL as being done inconjunction with the data 4 | 5 | CREATE TABLE IF NOT EXISTS {{target.catalog}}.{{target.schema}}.bronze_sensors 6 | ( 7 | Id BIGINT GENERATED BY 
DEFAULT AS IDENTITY, 8 | device_id INT, 9 | user_id INT, 10 | calories_burnt DECIMAL(10,2), 11 | miles_walked DECIMAL(10,2), 12 | num_steps DECIMAL(10,2), 13 | timestamp TIMESTAMP, 14 | value STRING, 15 | ingest_timestamp TIMESTAMP 16 | ) 17 | CLUSTER BY (ingest_timestamp) 18 | 19 | {% endmacro %} -------------------------------------------------------------------------------- /00-quickstarts/design-patterns/Advanced Notebooks/DBT Incremental Model Example/optimized_dbt/macros/create_bronze_users_identity_table.sql: -------------------------------------------------------------------------------- 1 | {% macro create_bronze_users_identity_table() %} 2 | -- Seprately DDL creation for doing things like custom / rigid schema DDL or Identity columns 3 | -- Use Sparingly, as DBT approaches DDL as being done inconjunction with the data 4 | 5 | CREATE TABLE IF NOT EXISTS {{target.catalog}}.{{target.schema}}.bronze_users 6 | ( 7 | userid BIGINT GENERATED BY DEFAULT AS IDENTITY (START WITH 1 INCREMENT BY 1), 8 | gender STRING, 9 | age INT, 10 | height DECIMAL(10,2), 11 | weight DECIMAL(10,2), 12 | smoker STRING, 13 | familyhistory STRING, 14 | cholestlevs STRING, 15 | bp STRING, 16 | risk DECIMAL(10,2), 17 | ingest_timestamp TIMESTAMP 18 | ) 19 | CLUSTER BY (ingest_timestamp) 20 | 21 | {% endmacro %} -------------------------------------------------------------------------------- /00-quickstarts/design-patterns/Advanced Notebooks/DBT Incremental Model Example/optimized_dbt/models/example/gold_hourly_summary_stats_7_day_rolling.sql: -------------------------------------------------------------------------------- 1 | {{ 2 | config( 3 | materialized='table', 4 | liquid_clustered_by='device_id, HourBucket' 5 | ) 6 | }} 7 | 8 | -- Get hourly aggregates for last 7 days 9 | SELECT device_id, 10 | date_trunc('hour', timestamp) AS HourBucket, 11 | AVG(num_steps)::float AS AvgNumStepsAcrossDevices, 12 | AVG(calories_burnt)::float AS AvgCaloriesBurnedAcrossDevices, 13 | AVG(miles_walked)::float AS AvgMilesWalkedAcrossDevices 14 | FROM {{ ref('silver_sensors_scd_1') }} 15 | WHERE timestamp >= ((SELECT MAX(timestamp) FROM {{ ref('silver_sensors_scd_1') }}) - INTERVAL '7 DAYS') 16 | GROUP BY device_id, date_trunc('hour', timestamp) 17 | ORDER BY HourBucket 18 | -------------------------------------------------------------------------------- /00-quickstarts/design-patterns/Advanced Notebooks/DBT Incremental Model Example/optimized_dbt/models/example/gold_smoothed_sensors_3_day_rolling.sql: -------------------------------------------------------------------------------- 1 | {{ 2 | config( 3 | materialized='table', 4 | liquid_clustered_by='device_id, HourBucket' 5 | ) 6 | }} 7 | 8 | SELECT 9 | device_id, HourBucket, 10 | -- Number of Steps 11 | (avg(`AvgNumStepsAcrossDevices`) OVER ( 12 | ORDER BY `HourBucket` 13 | ROWS BETWEEN 14 | 4 PRECEDING AND 15 | CURRENT ROW 16 | )) ::float AS SmoothedNumSteps4HourMA, -- 4 hour moving average 17 | 18 | (avg(`AvgNumStepsAcrossDevices`) OVER ( 19 | ORDER BY `HourBucket` 20 | ROWS BETWEEN 21 | 24 PRECEDING AND 22 | CURRENT ROW 23 | ))::float AS SmoothedNumSteps24HourMA --24 hour moving average 24 | , 25 | -- Calories Burned 26 | (avg(`AvgCaloriesBurnedAcrossDevices`) OVER ( 27 | ORDER BY `HourBucket` 28 | ROWS BETWEEN 29 | 4 PRECEDING AND 30 | CURRENT ROW 31 | ))::float AS SmoothedCalsBurned4HourMA, -- 4 hour moving average 32 | 33 | (avg(`AvgCaloriesBurnedAcrossDevices`) OVER ( 34 | ORDER BY `HourBucket` 35 | ROWS BETWEEN 36 | 24 PRECEDING AND 37 | CURRENT ROW 38 | ))::float AS 
SmoothedCalsBurned24HourMA --24 hour moving average, 39 | , 40 | -- Miles Walked 41 | (avg(`AvgMilesWalkedAcrossDevices`) OVER ( 42 | ORDER BY `HourBucket` 43 | ROWS BETWEEN 44 | 4 PRECEDING AND 45 | CURRENT ROW 46 | ))::float AS SmoothedMilesWalked4HourMA, -- 4 hour moving average 47 | 48 | (avg(`AvgMilesWalkedAcrossDevices`) OVER ( 49 | ORDER BY `HourBucket` 50 | ROWS BETWEEN 51 | 24 PRECEDING AND 52 | CURRENT ROW 53 | ))::float AS SmoothedMilesWalked24HourMA --24 hour moving average 54 | FROM {{ ref('gold_hourly_summary_stats_7_day_rolling') }} 55 | WHERE HourBucket >= ((SELECT MAX(HourBucket) FROM {{ ref('gold_hourly_summary_stats_7_day_rolling') }}) - INTERVAL '3 DAYS') -------------------------------------------------------------------------------- /00-quickstarts/design-patterns/Advanced Notebooks/DBT Incremental Model Example/optimized_dbt/models/example/silver_sensors_scd_1.sql: -------------------------------------------------------------------------------- 1 | {{ 2 | config( 3 | materialized='incremental', 4 | unique_key='Id', 5 | incremental_strategy='merge', 6 | tblproperties={'delta.tuneFileSizesForRewrites': 'true', 'delta.feature.allowColumnDefaults': 'supported', 'delta.columnMapping.mode' : 'name'}, 7 | liquid_clustered_by = 'timestamp, id, device_id', 8 | incremental_predicates= ["DBT_INTERNAL_DEST.timestamp > dateadd(day, -7, now())"], 9 | pre_hook=["{{ create_bronze_sensors_identity_table() }}", 10 | 11 | "{{ databricks_copy_into(target_table='bronze_sensors', 12 | source='/databricks-datasets/iot-stream/data-device/', 13 | file_format='json', 14 | expression_list = 'id::bigint AS Id, device_id::integer AS device_id, user_id::integer AS user_id, calories_burnt::decimal(10,2) AS calories_burnt, miles_walked::decimal(10,2) AS miles_walked, num_steps::decimal(10,2) AS num_steps, timestamp::timestamp AS timestamp, value AS value, now() AS ingest_timestamp', 15 | copy_options={'force': 'true'} 16 | ) }}", 17 | 18 | "OPTIMIZE {{target.catalog}}.{{target.schema}}.bronze_sensors", 19 | 20 | "ANALYZE TABLE {{target.catalog}}.{{target.schema}}.bronze_sensors COMPUTE STATISTICS FOR ALL COLUMNS" 21 | ], 22 | post_hook=[ 23 | "OPTIMIZE {{ this }}", 24 | "ANALYZE TABLE {{ this }} COMPUTE STATISTICS FOR ALL COLUMNS;" 25 | ] 26 | ) 27 | }} 28 | 29 | 30 | WITH de_dup ( 31 | SELECT Id::integer, 32 | device_id::integer, 33 | user_id::integer, 34 | calories_burnt::decimal, 35 | miles_walked::decimal, 36 | num_steps::decimal, 37 | timestamp::timestamp, 38 | value::string, 39 | ingest_timestamp, 40 | ROW_NUMBER() OVER(PARTITION BY device_id, user_id, timestamp ORDER BY ingest_timestamp DESC, timestamp DESC) AS DupRank 41 | FROM {{target.catalog}}.{{target.schema}}.bronze_sensors 42 | -- Add Incremental Processing Macro here 43 | {% if is_incremental() %} 44 | 45 | WHERE ingest_timestamp > (SELECT MAX(ingest_timestamp) FROM {{ this }}) 46 | 47 | {% endif %} 48 | ) 49 | 50 | SELECT Id, device_id, user_id, calories_burnt, miles_walked, num_steps, timestamp, value, ingest_timestamp 51 | -- optional 52 | /* 53 | sha2(CONCAT(COALESCE(Id, ''), COALESCE(device_id, ''))) AS composite_key -- use this as the key if you have composite key 54 | */ 55 | FROM de_dup 56 | WHERE DupRank = 1 -------------------------------------------------------------------------------- /00-quickstarts/design-patterns/Advanced Notebooks/DBT Incremental Model Example/optimized_dbt/models/example/silver_users_scd_1.sql: -------------------------------------------------------------------------------- 1 | {{ 2 | config( 3 | 
materialized='incremental', 4 | unique_key='userid', 5 | incremental_strategy='merge', 6 | liquid_clustered_by = 'userid', 7 | pre_hook=["{{ create_bronze_users_identity_table() }}", 8 | 9 | "{{ databricks_copy_into(target_table='bronze_users', 10 | source='/databricks-datasets/iot-stream/data-user/', 11 | file_format='csv', 12 | expression_list = 'userid::bigint AS userid, gender AS gender, age::integer AS age, height::decimal(10,2) AS height, weight::decimal(10,2) AS weight, smoker AS smoker, familyhistory AS familyhistory, cholestlevs AS cholestlevs, bp AS bp, risk::decimal(10,2) AS risk, now() AS ingest_timestamp', 13 | copy_options={'force': 'true'}, 14 | format_options={'header': 'true'} 15 | ) }}", 16 | 17 | "OPTIMIZE {{target.catalog}}.{{target.schema}}.bronze_users" 18 | ], 19 | post_hook=[ 20 | "OPTIMIZE {{ this }}", 21 | "ANALYZE TABLE {{ this }} COMPUTE STATISTICS FOR ALL COLUMNS;" 22 | ] 23 | ) 24 | }} 25 | 26 | 27 | WITH de_dup ( 28 | SELECT 29 | userid::bigint, 30 | gender::string, 31 | age::int, 32 | height::decimal, 33 | weight::decimal, 34 | smoker, 35 | familyhistory, 36 | cholestlevs, 37 | bp, 38 | risk, 39 | ingest_timestamp, 40 | ROW_NUMBER() OVER(PARTITION BY userid ORDER BY ingest_timestamp DESC) AS DupRank 41 | FROM {{target.catalog}}.{{target.schema}}.bronze_users 42 | -- Add Incremental Processing Macro here 43 | {% if is_incremental() %} 44 | 45 | WHERE ingest_timestamp > (SELECT MAX(ingest_timestamp) FROM {{ this }}) 46 | 47 | {% endif %} 48 | ) 49 | SELECT * 50 | FROM de_dup 51 | WHERE DupRank = 1 -------------------------------------------------------------------------------- /00-quickstarts/design-patterns/Advanced Notebooks/DBT Incremental Model Example/optimized_dbt/models/sources.yml: -------------------------------------------------------------------------------- 1 | version: 1 2 | 3 | sources: 4 | - name: dbt_optimized 5 | catalog: main 6 | schema: dbt_optimized 7 | tables: 8 | - name: silver_sensors_scd_1 9 | - name: silver_users_scd_1 -------------------------------------------------------------------------------- /00-quickstarts/design-patterns/Advanced Notebooks/DBT Incremental Model Example/optimized_dbt/seeds/.gitkeep: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/AbePabbathi/lakehouse-tacklebox/9bbbf0a7181193091d887ed57e7394a466c369cf/00-quickstarts/design-patterns/Advanced Notebooks/DBT Incremental Model Example/optimized_dbt/seeds/.gitkeep -------------------------------------------------------------------------------- /00-quickstarts/design-patterns/Advanced Notebooks/DBT Incremental Model Example/optimized_dbt/snapshots/.gitkeep: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/AbePabbathi/lakehouse-tacklebox/9bbbf0a7181193091d887ed57e7394a466c369cf/00-quickstarts/design-patterns/Advanced Notebooks/DBT Incremental Model Example/optimized_dbt/snapshots/.gitkeep -------------------------------------------------------------------------------- /00-quickstarts/design-patterns/Advanced Notebooks/DBT Incremental Model Example/optimized_dbt/snapshots/silver_sensors_scd_2.sql: -------------------------------------------------------------------------------- 1 | {% snapshot sensors_snapshot %} 2 | 3 | {{ 4 | config( 5 | target_schema= target.schema + '_snapshots', 6 | unique_key='Id', 7 | 8 | strategy='timestamp', 9 | updated_at='ingest_timestamp', 10 | ) 11 | }} 12 | 13 | select * from {{ source('dbt_optimized', 
'silver_sensors_scd_1') }} 14 | 15 | {% endsnapshot %} -------------------------------------------------------------------------------- /00-quickstarts/design-patterns/Advanced Notebooks/DBT Incremental Model Example/optimized_dbt/snapshots/silver_users_scd_2.sql: -------------------------------------------------------------------------------- 1 | {% snapshot users_snapshot %} 2 | 3 | {{ 4 | config( 5 | target_schema= target.schema + '_snapshots', 6 | unique_key='userid', 7 | 8 | strategy='check', 9 | check_cols=['age', 'height', 'weight', 'smoker', 'familyhistory', 'cholestlevs', 'bp', 'risk'], 10 | updated_at='ingest_timestamp', 11 | ) 12 | }} 13 | 14 | select * from {{ source('dbt_optimized', 'silver_users_scd_1') }} 15 | 16 | {% endsnapshot %} -------------------------------------------------------------------------------- /00-quickstarts/design-patterns/Advanced Notebooks/DBT Incremental Model Example/optimized_dbt/tests/.gitkeep: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/AbePabbathi/lakehouse-tacklebox/9bbbf0a7181193091d887ed57e7394a466c369cf/00-quickstarts/design-patterns/Advanced Notebooks/DBT Incremental Model Example/optimized_dbt/tests/.gitkeep -------------------------------------------------------------------------------- /00-quickstarts/design-patterns/Advanced Notebooks/Multi-plexing with Autoloader/Option 1: Actually Multi-plexing tables on write/Child Job Template.py: -------------------------------------------------------------------------------- 1 | # Databricks notebook source 2 | # MAGIC %md 3 | # MAGIC 4 | # MAGIC ## Controller notebook 5 | # MAGIC 6 | # MAGIC Identifies and Orcestrates the sub jobs 7 | 8 | # COMMAND ---------- 9 | 10 | from pyspark.sql.functions import * 11 | from pyspark.sql.types import * 12 | 13 | # COMMAND ---------- 14 | 15 | # DBTITLE 1,Step 1: Logic to get unique list of events/sub directories that separate the different streams 16 | # Design considerations 17 | # Ideally the writer of the raw data will separate out event types by folder so you can use globPathFilters to create separate streams 18 | # If ALL events are in one data source, all streams will stream from 1 table and then will be filtered for that event in the stream. 
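# A minimal sketch of the controller side of this pattern (an illustration, not the repo's Controller Job.py):
# fan out one run of this child notebook per event name with a thread pool. The event names and notebook
# path below are assumptions for this example; the widget names match the parameters defined further down.
from concurrent.futures import ThreadPoolExecutor

def run_child(event_name: str) -> str:
    # dbutils.notebook.run blocks until the child notebook finishes and returns its exit value
    return dbutils.notebook.run(
        "Child Job Template",
        3600,
        {
            "Input Root Path": "dbfs:/databricks-datasets/iot-stream/data-device/",
            "Parent Job Name": "iot_multiplexing_demo",
            "Child Task Name": event_name,
        },
    )

event_names = ["00000", "00001", "00002"]  # e.g. one child task per event/file group
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(run_child, event_names))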
To avoid many file listings of the same file, enable useNotifications = true in autoloader 19 | 20 | # COMMAND ---------- 21 | 22 | # DBTITLE 1,Define Params 23 | dbutils.widgets.text("Input Root Path", "") 24 | dbutils.widgets.text("Parent Job Name", "") 25 | dbutils.widgets.text("Child Task Name", "") 26 | 27 | # COMMAND ---------- 28 | 29 | # DBTITLE 1,Get Params 30 | root_input_path = dbutils.widgets.get("Input Root Path") 31 | parent_job_name = dbutils.widgets.get("Parent Job Name") 32 | child_task_name = dbutils.widgets.get("Child Task Name") 33 | 34 | print(f"Root input path: {root_input_path}") 35 | print(f"Parent Job Name: {parent_job_name}") 36 | print(f"Event Task Name: {child_task_name}") 37 | 38 | # COMMAND ---------- 39 | 40 | # DBTITLE 1,Define Dynamic Checkpoint Path 41 | ## Eeach stream needs its own checkpoint, we can dynamically define that for each event/table we want to create / teast out 42 | 43 | checkpoint_path = f"dbfs:/checkpoints//{parent_job_name}/{child_task_name}/" 44 | 45 | # COMMAND ---------- 46 | 47 | # DBTITLE 1,Target Location Definitions 48 | spark.sql("""CREATE DATABASE IF NOT EXISTS iot_multiplexing_demo""") 49 | 50 | # COMMAND ---------- 51 | 52 | # DBTITLE 1,Use Whatever custom event filtering logic is needed 53 | filter_regex_string = "part-" + child_task_name + "*.json*" 54 | 55 | print(filter_regex_string) 56 | 57 | # COMMAND ---------- 58 | 59 | # DBTITLE 1,Read Stream 60 | input_df = (spark 61 | .readStream 62 | .format("text") 63 | .option("multiLine", "true") 64 | .option("pathGlobFilter", filter_regex_string) 65 | .load(root_input_path) 66 | .withColumn("inputFileName", input_file_name()) ## you can filter using .option("globPathFilter") as well here 67 | ) 68 | 69 | # COMMAND ---------- 70 | 71 | # DBTITLE 1,Transformation Logic on any events (can be conditional on event) 72 | transformed_df = (input_df 73 | .withColumn("EventName", lit(child_task_name)) 74 | .selectExpr("value:id::integer AS Id", 75 | "EventName", 76 | "value:user_id::integer AS UserId", 77 | "value:device_id::integer AS DeviceId", 78 | "value:num_steps::decimal AS NumberOfSteps", 79 | "value:miles_walked::decimal AS MilesWalked", 80 | "value:calories_burnt::decimal AS Calories", 81 | "value:timestamp::timestamp AS EventTimestamp", 82 | "current_timestamp() AS IngestionTimestamp", 83 | "inputFileName") 84 | 85 | ) 86 | 87 | # COMMAND ---------- 88 | 89 | # DBTITLE 1,Truncate this child stream and reload from all data 90 | 91 | dbutils.fs.rm(checkpoint_path, recurse=True) 92 | 93 | # COMMAND ---------- 94 | 95 | # DBTITLE 1,Dynamic Write Stream 96 | (transformed_df 97 | .writeStream 98 | .trigger(once=True) 99 | .option("checkpointLocation", checkpoint_path) 100 | .toTable(f"iot_multiplexing_demo.iot_stream_event_{child_task_name}") 101 | ) 102 | -------------------------------------------------------------------------------- /00-quickstarts/design-patterns/Advanced Notebooks/Parallel Custom Named File Exports/Parallel File Exports.py: -------------------------------------------------------------------------------- 1 | # Databricks notebook source 2 | # DBTITLE 1,helper function to dynamically build target path for each file 3 | # MAGIC %scala 4 | # MAGIC 5 | # MAGIC 6 | # MAGIC def getNewFilePath(sourcePath: String): String = { 7 | # MAGIC val source_path = sourcePath; 8 | # MAGIC 9 | # MAGIC val slice_len = source_path.split("/").length - 1; 10 | # MAGIC val source_path_root = source_path.split("/").slice(0, slice_len); 11 | # MAGIC val source_path_file_name = 
source_path.split("/").last; 12 | # MAGIC 13 | # MAGIC // any arbitrary file rename logic 14 | # MAGIC val new_path_file_name = "renamed/"+source_path_file_name; 15 | # MAGIC val new_path = source_path_root.mkString("/") + "/" + new_path_file_name; 16 | # MAGIC 17 | # MAGIC return new_path 18 | # MAGIC } 19 | 20 | # COMMAND ---------- 21 | 22 | # DBTITLE 1,Test New Function to dynamically build target path for each row (file) 23 | # MAGIC %scala 24 | # MAGIC 25 | # MAGIC val test_new_path = getNewFilePath("dbfs:/databricks-datasets/iot-stream/data-device/part-00003.json.gz") 26 | # MAGIC 27 | # MAGIC println(test_new_path) 28 | 29 | # COMMAND ---------- 30 | 31 | # MAGIC %scala 32 | # MAGIC import org.apache.hadoop.fs 33 | # MAGIC 34 | # MAGIC // maybe we need to register access keys here? not sure yet. Still dealing with Auth issues 35 | # MAGIC val conf = new org.apache.spark.util.SerializableConfiguration(sc.hadoopConfiguration) 36 | # MAGIC 37 | # MAGIC val broadcastConf = sc.broadcast(conf) 38 | # MAGIC 39 | # MAGIC print(conf.value) 40 | 41 | # COMMAND ---------- 42 | 43 | # MAGIC %scala 44 | # MAGIC 45 | # MAGIC import org.apache.hadoop.fs._ 46 | # MAGIC 47 | # MAGIC // root bucket of where original files were dropped 48 | # MAGIC val filesToCopy = dbutils.fs.ls("dbfs:/databricks-datasets/iot-stream/data-device/").map(_.path) 49 | # MAGIC 50 | # MAGIC spark.sparkContext.parallelize(filesToCopy).foreachPartition(rows => rows.foreach { 51 | # MAGIC 52 | # MAGIC file => 53 | # MAGIC 54 | # MAGIC println(file) 55 | # MAGIC val fromPath = new Path(file) 56 | # MAGIC 57 | # MAGIC val tempNewPath = getNewFilePath(file) 58 | # MAGIC 59 | # MAGIC val toPath = new Path(tempNewPath) 60 | # MAGIC 61 | # MAGIC val fromFs = toPath.getFileSystem(conf.value) 62 | # MAGIC 63 | # MAGIC val toFs = toPath.getFileSystem(conf.value) 64 | # MAGIC 65 | # MAGIC FileUtil.copy(fromFs, fromPath, toFs, toPath, false, conf.value) 66 | # MAGIC 67 | # MAGIC }) 68 | 69 | # COMMAND ---------- 70 | 71 | # MAGIC %scala 72 | # MAGIC 73 | # MAGIC val filesToCopy = dbutils.fs.ls("dbfs:/databricks-datasets/iot-stream/data-device/").map(_.path) 74 | # MAGIC 75 | # MAGIC 76 | # MAGIC val filesDf = spark.sparkContext.parallelize(filesToCopy).toDF() 77 | # MAGIC 78 | # MAGIC display(filesDf) 79 | -------------------------------------------------------------------------------- /00-quickstarts/design-patterns/Advanced Notebooks/airflow_sql_files/0_ddls.sql: -------------------------------------------------------------------------------- 1 | CREATE DATABASE IF NOT EXISTS main.iot_dashboard_airflow; 2 | 3 | CREATE TABLE IF NOT EXISTS main.iot_dashboard_airflow.bronze_sensors 4 | ( 5 | Id BIGINT GENERATED BY DEFAULT AS IDENTITY, 6 | device_id INT, 7 | user_id INT, 8 | calories_burnt DECIMAL(10,2), 9 | miles_walked DECIMAL(10,2), 10 | num_steps DECIMAL(10,2), 11 | timestamp TIMESTAMP, 12 | value STRING 13 | ) 14 | USING DELTA 15 | TBLPROPERTIES("delta.targetFileSize"="128mb") 16 | ; 17 | 18 | CREATE TABLE IF NOT EXISTS main.iot_dashboard_airflow.bronze_users 19 | ( 20 | userid BIGINT GENERATED BY DEFAULT AS IDENTITY (START WITH 1 INCREMENT BY 1), 21 | gender STRING, 22 | age INT, 23 | height DECIMAL(10,2), 24 | weight DECIMAL(10,2), 25 | smoker STRING, 26 | familyhistory STRING, 27 | cholestlevs STRING, 28 | bp STRING, 29 | risk DECIMAL(10,2), 30 | update_timestamp TIMESTAMP 31 | ) 32 | USING DELTA 33 | TBLPROPERTIES("delta.targetFileSize"="128mb") 34 | ; 35 | 36 | CREATE TABLE IF NOT EXISTS 
main.iot_dashboard_airflow.silver_sensors 37 | ( 38 | Id BIGINT GENERATED BY DEFAULT AS IDENTITY, 39 | device_id INT, 40 | user_id INT, 41 | calories_burnt DECIMAL(10,2), 42 | miles_walked DECIMAL(10,2), 43 | num_steps DECIMAL(10,2), 44 | timestamp TIMESTAMP, 45 | value STRING 46 | ) 47 | USING DELTA 48 | PARTITIONED BY (user_id) 49 | TBLPROPERTIES("delta.targetFileSize"="128mb") 50 | ; 51 | 52 | CREATE TABLE IF NOT EXISTS main.iot_dashboard_airflow.silver_users 53 | ( 54 | userid BIGINT GENERATED BY DEFAULT AS IDENTITY, 55 | gender STRING, 56 | age INT, 57 | height DECIMAL(10,2), 58 | weight DECIMAL(10,2), 59 | smoker STRING, 60 | familyhistory STRING, 61 | cholestlevs STRING, 62 | bp STRING, 63 | risk DECIMAL(10,2), 64 | update_timestamp TIMESTAMP 65 | ) 66 | USING DELTA 67 | TBLPROPERTIES("delta.targetFileSize"="128mb") 68 | ; -------------------------------------------------------------------------------- /00-quickstarts/design-patterns/Advanced Notebooks/airflow_sql_files/1_sensors_table_copy_into.sql: -------------------------------------------------------------------------------- 1 | -- This is ONLY the SELECT expression in COPY INTO without the word "SELECT" 2 | id::bigint AS Id, 3 | device_id::integer AS device_id, 4 | user_id::integer AS user_id, 5 | calories_burnt::decimal(10,2) AS calories_burnt, 6 | miles_walked::decimal(10,2) AS miles_walked, 7 | num_steps::decimal(10,2) AS num_steps, 8 | timestamp::timestamp AS timestamp, 9 | value AS value -- This is a JSON object -------------------------------------------------------------------------------- /00-quickstarts/design-patterns/Advanced Notebooks/airflow_sql_files/2_sensors_table_merge.sql: -------------------------------------------------------------------------------- 1 | MERGE INTO main.iot_dashboard_airflow.silver_sensors AS target 2 | USING ( 3 | WITH de_dup ( 4 | SELECT Id::integer, 5 | device_id::integer, 6 | user_id::integer, 7 | calories_burnt::decimal, 8 | miles_walked::decimal, 9 | num_steps::decimal, 10 | timestamp::timestamp, 11 | value::string, 12 | ROW_NUMBER() OVER(PARTITION BY device_id, user_id, timestamp ORDER BY timestamp DESC) AS DupRank 13 | FROM main.iot_dashboard_airflow.bronze_sensors 14 | ) 15 | 16 | SELECT Id, device_id, user_id, calories_burnt, miles_walked, num_steps, timestamp, value 17 | FROM de_dup 18 | WHERE DupRank = 1 19 | ) AS source 20 | ON source.Id = target.Id 21 | AND source.user_id = target.user_id 22 | AND source.device_id = target.device_id 23 | WHEN MATCHED THEN UPDATE SET 24 | target.calories_burnt = source.calories_burnt, 25 | target.miles_walked = source.miles_walked, 26 | target.num_steps = source.num_steps, 27 | target.timestamp = source.timestamp 28 | WHEN NOT MATCHED THEN INSERT *; -------------------------------------------------------------------------------- /00-quickstarts/design-patterns/Advanced Notebooks/airflow_sql_files/3_sensors_table_optimize.sql: -------------------------------------------------------------------------------- 1 | OPTIMIZE main.iot_dashboard_airflow.silver_sensors ZORDER BY (timestamp); 2 | 3 | ANALYZE TABLE main.iot_dashboard_airflow.silver_sensors COMPUTE STATISTICS FOR ALL COLUMNS; 4 | -------------------------------------------------------------------------------- /00-quickstarts/design-patterns/Advanced Notebooks/airflow_sql_files/4_sensors_table_gold_aggregate.sql: -------------------------------------------------------------------------------- 1 | CREATE OR REPLACE TABLE main.iot_dashboard_airflow.hourly_summary_statistics 2 | AS 3 | SELECT 
user_id, 4 | date_trunc('hour', timestamp) AS HourBucket, 5 | AVG(num_steps)::float AS AvgNumStepsAcrossDevices, 6 | AVG(calories_burnt)::float AS AvgCaloriesBurnedAcrossDevices, 7 | AVG(miles_walked)::float AS AvgMilesWalkedAcrossDevices 8 | FROM main.iot_dashboard_airflow.silver_sensors 9 | GROUP BY user_id,date_trunc('hour', timestamp) 10 | ORDER BY HourBucket; 11 | 12 | CREATE OR REPLACE TABLE main.iot_dashboard_airflow.smoothed_hourly_statistics 13 | AS 14 | SELECT *, 15 | -- Number of Steps 16 | (avg(`AvgNumStepsAcrossDevices`) OVER ( 17 | ORDER BY `HourBucket` 18 | ROWS BETWEEN 19 | 4 PRECEDING AND 20 | CURRENT ROW 21 | )) ::float AS SmoothedNumSteps4HourMA, -- 4 hour moving average 22 | 23 | (avg(`AvgNumStepsAcrossDevices`) OVER ( 24 | ORDER BY `HourBucket` 25 | ROWS BETWEEN 26 | 24 PRECEDING AND 27 | CURRENT ROW 28 | ))::float AS SmoothedNumSteps12HourMA --24 hour moving average 29 | , 30 | -- Calories Burned 31 | (avg(`AvgCaloriesBurnedAcrossDevices`) OVER ( 32 | ORDER BY `HourBucket` 33 | ROWS BETWEEN 34 | 4 PRECEDING AND 35 | CURRENT ROW 36 | ))::float AS SmoothedCalsBurned4HourMA, -- 4 hour moving average 37 | 38 | (avg(`AvgCaloriesBurnedAcrossDevices`) OVER ( 39 | ORDER BY `HourBucket` 40 | ROWS BETWEEN 41 | 24 PRECEDING AND 42 | CURRENT ROW 43 | ))::float AS SmoothedCalsBurned12HourMA --24 hour moving average, 44 | , 45 | -- Miles Walked 46 | (avg(`AvgMilesWalkedAcrossDevices`) OVER ( 47 | ORDER BY `HourBucket` 48 | ROWS BETWEEN 49 | 4 PRECEDING AND 50 | CURRENT ROW 51 | ))::float AS SmoothedMilesWalked4HourMA, -- 4 hour moving average 52 | 53 | (avg(`AvgMilesWalkedAcrossDevices`) OVER ( 54 | ORDER BY `HourBucket` 55 | ROWS BETWEEN 56 | 24 PRECEDING AND 57 | CURRENT ROW 58 | ))::float AS SmoothedMilesWalked12HourMA --24 hour moving average 59 | FROM main.iot_dashboard_airflow.hourly_summary_statistics; -------------------------------------------------------------------------------- /00-quickstarts/design-patterns/Advanced Notebooks/airflow_sql_files/5_clean_up_batch.sql: -------------------------------------------------------------------------------- 1 | TRUNCATE TABLE main.iot_dashboard_airflow.bronze_sensors; -------------------------------------------------------------------------------- /00-quickstarts/design-patterns/Step 10 - Lakehouse Federation.py: -------------------------------------------------------------------------------- 1 | # Databricks notebook source 2 | # Using Lakehouse Federation for a single Pane of Glass 3 | 4 | ## Topics 5 | 6 | 1. How to use Lakehouse Federation 7 | 2. Setting up a new database 8 | 3. Performance management / considerations 9 | 4. Limitations 10 | -------------------------------------------------------------------------------- /00-quickstarts/design-patterns/Step 11 - SQL Orchestration in Production.py: -------------------------------------------------------------------------------- 1 | # Databricks notebook source 2 | ## Orchestrating SQL Pipelines in Production 3 | 4 | 1. SQL Task Types 5 | 2. Airflow Operator 6 | 3. DBSQL REST API / Pushdown Client 7 | 8 | # COMMAND ---------- 9 | 10 | # DBTITLE 1,Airflow 11 | ## See the Advanced Notebooks section to find the collection of SQL files for the Airflow Demo; a minimal sketch of submitting one of those files through the SQL Statement Execution API is shown below.
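# A minimal sketch (not the repo's DBSQL Serverless Client) of submitting one of the airflow_sql_files
# to a SQL warehouse with the Databricks SQL Statement Execution API. The host, token secret, warehouse id,
# and file path below are placeholders; the API executes one statement per call.
import requests

host = "https://<your-workspace>.cloud.databricks.com"
token = dbutils.secrets.get("my_scope", "dbsql_token")  # assumed secret scope/key
warehouse_id = "<warehouse-id>"

with open("airflow_sql_files/5_clean_up_batch.sql") as f:
    statement = f.read()

resp = requests.post(
    f"{host}/api/2.0/sql/statements/",
    headers={"Authorization": f"Bearer {token}"},
    json={"warehouse_id": warehouse_id, "statement": statement, "wait_timeout": "30s"},
)
resp.raise_for_status()
print(resp.json()["status"])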
12 | ## Then navigate to https://medium.com/dbsql-sme-engineering and find the Airflow Blog for the deep dive 13 | -------------------------------------------------------------------------------- /00-quickstarts/design-patterns/Step 13 - Migrating Identity Columns.sql: -------------------------------------------------------------------------------- 1 | -- Databricks notebook source 2 | -- MAGIC %md 3 | -- MAGIC 4 | -- MAGIC ## How to migrate IDENTITY columns from a Data Warehouse to DBSQL / Delta Lakehouse 5 | -- MAGIC 6 | -- MAGIC ## Summary 7 | -- MAGIC Quick notebook showing how to properly migrate tables from a data warehouse to a Delta table where you want to retain the values of existing IDENTITY key values and ensure that the IDENTITY generation picks up from the most recent IDENTITY column value 8 | 9 | -- COMMAND ---------- 10 | 11 | -- MAGIC %md 12 | -- MAGIC 13 | -- MAGIC 14 | -- MAGIC ### Steps to migrate key properly 15 | -- MAGIC 16 | -- MAGIC 1. Create a table with id columns such as: GENERATED BY DEFAULT AS IDENTITY (START WITH 1 INCREMENT BY 1) 17 | -- MAGIC 2. Backfill existing data warehouse tables with an INSERT INTO / MERGE from a snapshot of the datawarehouse table 18 | -- MAGIC 3. Run command: ALTER TABLE main.default.identity_test ALTER COLUMN id SYNC IDENTITY; to ensure that the newly inserted values pick up where the data warehouse left off on key generation 19 | -- MAGIC 4. Insert new identity values with new pipelines (or leave out column and let it auto-generate) 20 | 21 | -- COMMAND ---------- 22 | 23 | -- DBTITLE 1,Simple End to End Example 24 | 25 | CREATE OR REPLACE TABLE main.default.identity_test ( 26 | id BIGINT GENERATED BY DEFAULT AS IDENTITY (START WITH 1 INCREMENT BY 1), 27 | name STRING DEFAULT 'cody' 28 | ) 29 | TBLPROPERTIES('delta.feature.allowColumnDefaults' = 'supported', 'delta.columnMapping.mode' = 'name') 30 | ; 31 | 32 | -- Simulate EDW migration load with existing keys 33 | INSERT INTO main.default.identity_test (id,name) 34 | VALUES (5, 'cody'), (6, 'davis'); 35 | 36 | 37 | SELECT * FROM main.default.identity_test; 38 | 39 | 40 | -- Simulate new load incrmentally 41 | 42 | INSERT INTO main.default.identity_test (name) 43 | VALUES ('cody_new'), ('davis_new'); 44 | 45 | -- BAD! ID keys get messed up 46 | SELECT * FROM main.default.identity_test; 47 | 48 | -- FIX 49 | ALTER TABLE main.default.identity_test ALTER COLUMN id SYNC IDENTITY; 50 | 51 | -- try again 52 | INSERT INTO main.default.identity_test (name) 53 | VALUES ('cody_fix'), ('davis_fix'); 54 | 55 | SELECT * FROM main.default.identity_test; 56 | 57 | 58 | -------------------------------------------------------------------------------- /00-quickstarts/design-patterns/Step 3 - DLT Version Simple SQL EDW Pipeline.sql: -------------------------------------------------------------------------------- 1 | -- Databricks notebook source 2 | -- MAGIC %md 3 | -- MAGIC 4 | -- MAGIC # This notebook generates a full data pipeline from databricks dataset - iot-stream 5 | -- MAGIC 6 | -- MAGIC #### Define the SQL - Add as a library to a DLT pipeline, and run the pipeline! 
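-- MAGIC
-- MAGIC One way to wire this notebook in as a pipeline library is a Databricks Asset Bundle resource. A minimal sketch is below, assuming a bundle deployment; the pipeline name, target schema, and notebook path are placeholders:
-- MAGIC
-- MAGIC ```yaml
-- MAGIC resources:
-- MAGIC   pipelines:
-- MAGIC     iot_dashboard_dlt:
-- MAGIC       name: iot-dashboard-dlt-demo
-- MAGIC       target: iot_dashboard              # schema the tables are created in
-- MAGIC       continuous: false
-- MAGIC       libraries:
-- MAGIC         - notebook:
-- MAGIC             path: "./Step 3 - DLT Version Simple SQL EDW Pipeline.sql"
-- MAGIC ```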
7 | -- MAGIC 8 | -- MAGIC ## This creates 2 tables: 9 | -- MAGIC 10 | -- MAGIC Database: iot_dashboard 11 | -- MAGIC 12 | -- MAGIC Tables: silver_sensors, silver_users 13 | -- MAGIC 14 | -- MAGIC Params: StartOver (Yes/No) - allows user to truncate and reload pipeline 15 | 16 | -- COMMAND ---------- 17 | 18 | -- MAGIC %md 19 | -- MAGIC 20 | -- MAGIC ## This is built as a library for a Delta Live Tables pipeline 21 | 22 | -- COMMAND ---------- 23 | 24 | -- MAGIC %md 25 | -- MAGIC ## Exhaustive list of all cloud_files STREAMING LIVE TABLE options 26 | -- MAGIC https://docs.databricks.com/data-engineering/delta-live-tables/delta-live-tables-incremental-data.html#language-sql 27 | 28 | -- COMMAND ---------- 29 | 30 | -- DBTITLE 1,Incrementally Ingest Source Data from Raw Files 31 | --No longer need a separate copy into statement, you can use the Databricks Autoloader directly in SQL by using the cloud_files function 32 | -- OPTIONALLY defined DDL in the table definition 33 | CREATE OR REFRESH STREAMING LIVE TABLE bronze_sensors 34 | ( 35 | Id BIGINT GENERATED BY DEFAULT AS IDENTITY, 36 | device_id INT, 37 | user_id INT, 38 | calories_burnt DECIMAL(10,2), 39 | miles_walked DECIMAL(10,2), 40 | num_steps DECIMAL(10,2), 41 | timestamp TIMESTAMP, 42 | value STRING, 43 | CONSTRAINT has_device EXPECT (device_id IS NOT NULL) ON VIOLATION DROP ROW , 44 | CONSTRAINT has_user EXPECT(user_id IS NOT NULL) ON VIOLATION DROP ROW, 45 | CONSTRAINT has_data EXPECT(num_steps IS NOT NULL) -- with no violation rule, nothing happens, we just track quality in DLT 46 | ) 47 | TBLPROPERTIES("delta.targetFileSize"="128mb", 48 | "pipelines.autoOptimize.managed"="true", 49 | "pipelines.autoOptimize.zOrderCols"="create_timestamp,device_id,user_id", 50 | "pipelines.trigger.interval"="1 hour") 51 | AS 52 | SELECT 53 | id::bigint AS Id, 54 | device_id::integer AS device_id, 55 | user_id::integer AS user_id, 56 | calories_burnt::decimal(10,2) AS calories_burnt, 57 | miles_walked::decimal(10,2) AS miles_walked, 58 | num_steps::decimal(10,2) AS num_steps, 59 | timestamp::timestamp AS timestamp, 60 | value AS value 61 | FROM cloud_files("/databricks-datasets/iot-stream/data-device/", "json") 62 | -- First 2 params of cloud_files are always input file path and format, then rest are map object of optional params 63 | -- To make incremental - Add STREAMING keyword before LIVE TABLE 64 | ; 65 | 66 | 67 | 68 | -- COMMAND ---------- 69 | 70 | -- MAGIC %md 71 | -- MAGIC 72 | -- MAGIC ## Process Change data with updates or deletes 73 | -- MAGIC API Docs: https://docs.databricks.com/data-engineering/delta-live-tables/delta-live-tables-cdc.html 74 | -- MAGIC 75 | -- MAGIC 76 | -- MAGIC ### Automatically store change as SCD 1 or SCD 2 Type changes 77 | -- MAGIC 78 | -- MAGIC SCD 1/2 Docs: https://docs.databricks.com/data-engineering/delta-live-tables/delta-live-tables-cdc.html#language-sql 79 | 80 | -- COMMAND ---------- 81 | 82 | -- DBTITLE 1,Incremental upsert data into target silver layer 83 | -- Create and populate the target table. 
84 | CREATE OR REFRESH STREAMING LIVE TABLE silver_sensors 85 | ( 86 | Id BIGINT GENERATED BY DEFAULT AS IDENTITY, 87 | device_id INT, 88 | user_id INT, 89 | calories_burnt DECIMAL(10,2), 90 | miles_walked DECIMAL(10,2), 91 | num_steps DECIMAL(10,2), 92 | timestamp TIMESTAMP, 93 | value STRING) 94 | TBLPROPERTIES("delta.targetFileSize"="128mb", 95 | "quality"="silver", 96 | "pipelines.autoOptimize.managed"="true", 97 | "pipelines.autoOptimize.zOrderCols"="create_timestamp,device_id,user_id", 98 | "pipelines.trigger.interval"="1 hour" 99 | ); 100 | 101 | -- COMMAND ---------- 102 | 103 | -- DBTITLE 1,Actually run CDC Transformation Operation 104 | APPLY CHANGES INTO 105 | LIVE.silver_sensors 106 | FROM 107 | STREAM(LIVE.bronze_sensors) -- use STREAM to get change feed, use LIVE to get DAG source table 108 | KEYS 109 | (user_id, device_id) -- Identical to the ON statement in MERGE, can be 1 of many keys 110 | --APPLY AS DELETE WHEN 111 | -- operation = "DELETE" --Need if you have a operation columnd that specifies "APPEND"/"UPDATE"/"DELETE" like true CDC data 112 | SEQUENCE BY 113 | timestamp 114 | COLUMNS * EXCEPT 115 | (Id) --For auto increment keys, exclude the updates cause you dont want to replace Ids of auto_id columns 116 | -- Optionally exclude columns like metadata or operation types, by default, UPDATE * is the operation 117 | STORED AS 118 | SCD TYPE 1 -- [SCD TYPE 2] will expire updated originals 119 | 120 | -- COMMAND ---------- 121 | 122 | -- MAGIC %md 123 | -- MAGIC 124 | -- MAGIC ## FULL REFRESH EXAMPLE - Ingest Full User Data Set Each Load 125 | 126 | -- COMMAND ---------- 127 | 128 | -- DBTITLE 1,FulltIngest Raw User Data 129 | CREATE OR REPLACE STREAMING LIVE TABLE silver_users 130 | ( -- REPLACE truncates the checkpoint each time and loads from scratch every time 131 | userid BIGINT GENERATED BY DEFAULT AS IDENTITY, 132 | gender STRING, 133 | age INT, 134 | height DECIMAL(10,2), 135 | weight DECIMAL(10,2), 136 | smoker STRING, 137 | familyhistory STRING, 138 | cholestlevs STRING, 139 | bp STRING, 140 | risk DECIMAL(10,2), 141 | update_timestamp TIMESTAMP, 142 | CONSTRAINT has_user EXPECT (userid IS NOT NULL) ON VIOLATION DROP ROW 143 | ) 144 | TBLPROPERTIES("delta.targetFileSize"="128mb", 145 | "quality"="silver", 146 | "pipelines.autoOptimize.managed"="true", 147 | "pipelines.autoOptimize.zOrderCols"="userid", 148 | "pipelines.trigger.interval"="1 day" 149 | ) 150 | AS (SELECT 151 | userid::bigint AS userid, 152 | gender AS gender, 153 | age::integer AS age, 154 | height::decimal(10,2) AS height, 155 | weight::decimal(10,2) AS weight, 156 | smoker AS smoker, 157 | familyhistory AS familyhistory, 158 | cholestlevs AS cholestlevs, 159 | bp AS bp, 160 | risk::decimal(10,2) AS risk, 161 | current_timestamp() AS update_timestamp 162 | FROM cloud_files("/databricks-datasets/iot-stream/data-user/","csv", map( 'header', 'true')) 163 | ) 164 | ; 165 | -------------------------------------------------------------------------------- /00-quickstarts/design-patterns/Step 4 - Create Gold Layer Analytics Tables.sql: -------------------------------------------------------------------------------- 1 | -- Databricks notebook source 2 | -- MAGIC %md 3 | -- MAGIC 4 | -- MAGIC ## Create Gold Layer Tables that aggregate and clean up the data for BI / ML 5 | 6 | -- COMMAND ---------- 7 | 8 | CREATE OR REPLACE TABLE iot_dashboard.hourly_summary_statistics 9 | AS 10 | SELECT user_id, 11 | date_trunc('hour', timestamp) AS HourBucket, 12 | AVG(num_steps)::float AS AvgNumStepsAcrossDevices, 13 | 
AVG(calories_burnt)::float AS AvgCaloriesBurnedAcrossDevices, 14 | AVG(miles_walked)::float AS AvgMilesWalkedAcrossDevices 15 | FROM iot_dashboard.silver_sensors 16 | GROUP BY user_id,date_trunc('hour', timestamp) 17 | ORDER BY HourBucket; 18 | 19 | 20 | CREATE OR REPLACE TABLE iot_dashboard.smoothed_hourly_statistics 21 | AS 22 | SELECT *, 23 | -- Number of Steps 24 | (avg(`AvgNumStepsAcrossDevices`) OVER ( 25 | ORDER BY `HourBucket` 26 | ROWS BETWEEN 27 | 4 PRECEDING AND 28 | CURRENT ROW 29 | )) ::float AS SmoothedNumSteps4HourMA, -- 4 hour moving average 30 | 31 | (avg(`AvgNumStepsAcrossDevices`) OVER ( 32 | ORDER BY `HourBucket` 33 | ROWS BETWEEN 34 | 24 PRECEDING AND 35 | CURRENT ROW 36 | ))::float AS SmoothedNumSteps12HourMA --24 hour moving average 37 | , 38 | -- Calories Burned 39 | (avg(`AvgCaloriesBurnedAcrossDevices`) OVER ( 40 | ORDER BY `HourBucket` 41 | ROWS BETWEEN 42 | 4 PRECEDING AND 43 | CURRENT ROW 44 | ))::float AS SmoothedCalsBurned4HourMA, -- 4 hour moving average 45 | 46 | (avg(`AvgCaloriesBurnedAcrossDevices`) OVER ( 47 | ORDER BY `HourBucket` 48 | ROWS BETWEEN 49 | 24 PRECEDING AND 50 | CURRENT ROW 51 | ))::float AS SmoothedCalsBurned12HourMA --24 hour moving average, 52 | , 53 | -- Miles Walked 54 | (avg(`AvgMilesWalkedAcrossDevices`) OVER ( 55 | ORDER BY `HourBucket` 56 | ROWS BETWEEN 57 | 4 PRECEDING AND 58 | CURRENT ROW 59 | ))::float AS SmoothedMilesWalked4HourMA, -- 4 hour moving average 60 | 61 | (avg(`AvgMilesWalkedAcrossDevices`) OVER ( 62 | ORDER BY `HourBucket` 63 | ROWS BETWEEN 64 | 24 PRECEDING AND 65 | CURRENT ROW 66 | ))::float AS SmoothedMilesWalked12HourMA --24 hour moving average 67 | FROM iot_dashboard.hourly_summary_statistics 68 | 69 | -- COMMAND ---------- 70 | 71 | -- DBTITLE 1,Build Visuals in DBSQL, Directly in Notebook, or in any BI tool! 72 | SELECT * FROM iot_dashboard.smoothed_hourly_statistics WHERE user_id = 1 73 | -------------------------------------------------------------------------------- /00-quickstarts/design-patterns/Step 7 - COPY INTO Loading Patterns.py: -------------------------------------------------------------------------------- 1 | # Databricks notebook source 2 | # MAGIC %md 3 | # MAGIC 4 | # MAGIC ## Materlized Views 5 | # MAGIC 6 | # MAGIC Patterns and Best Practices 7 | # MAGIC 8 | # MAGIC 9 | # MAGIC 1. Create Materialized View 10 | # MAGIC 2. Optimize Materialized View 11 | # MAGIC 3. Check / Monitor Performance of MV 12 | # MAGIC 4. When to NOT use MVs 13 | -------------------------------------------------------------------------------- /00-quickstarts/design-patterns/Step 8 - Liquid Clustering Delta Tables.py: -------------------------------------------------------------------------------- 1 | # Databricks notebook source 2 | # MAGIC %md 3 | # MAGIC 4 | # MAGIC ## Deep Dive on Liquid Clustering Delta Tables 5 | # MAGIC 6 | # MAGIC ### Topics 7 | # MAGIC 8 | # MAGIC 1. How to create and optimize liquid tables 9 | # MAGIC 2. How to merge/update/delete data from liquid tables 10 | # MAGIC 3. VACUUM/PURGE/REORG on Liqiud tables 11 | # MAGIC 4. Performance Measurement 12 | # MAGIC 5. When to use ZORDER/Partitions vs Liquid 13 | # MAGIC 6. Liquid Limitations 14 | -------------------------------------------------------------------------------- /00-quickstarts/design-patterns/Step 9 - Using SQL Functions.py: -------------------------------------------------------------------------------- 1 | # Databricks notebook source 2 | # SQL Functions Topic Deep Dive 3 | 4 | ## Topics 5 | 6 | 1. How to use SQL functions 7 | 2. 
Different languages - Python/SQL 8 | 3. Variables, etc. 9 | 4. Using Models in SQL functions 10 | 5. AI Functions 11 | -------------------------------------------------------------------------------- /00-quickstarts/dlt-cdc/04-Retail_DLT_CDC_Full.py: -------------------------------------------------------------------------------- 1 | # Databricks notebook source 2 | # MAGIC %md 3 | # MAGIC 4 | # MAGIC # Implementing a CDC pipeline using DLT for N tables 5 | # MAGIC 6 | # MAGIC We saw previously how to setup a CDC pipeline for a single table. However, real-life database typically involve multiple tables, with 1 CDC folder per table. 7 | # MAGIC 8 | # MAGIC Operating and ingesting all these tables at scale is quite challenging. You need to start multiple table ingestion at the same time, working with threads, handling errors, restart where you stopped, deal with merge manually. 9 | # MAGIC 10 | # MAGIC Thankfully, DLT takes care of that for you. We can leverage python loops to naturally iterate over the folders (see the [documentation](https://docs.databricks.com/data-engineering/delta-live-tables/delta-live-tables-cookbook.html#programmatically-manage-and-create-multiple-live-tables) for more details) 11 | # MAGIC 12 | # MAGIC DLT engine will handle the parallelization whenever possible, and autoscale based on your data volume. 13 | # MAGIC 14 | # MAGIC 15 | # MAGIC 16 | # MAGIC 17 | # MAGIC 18 | # MAGIC 23 | 24 | # COMMAND ---------- 25 | 26 | # DBTITLE 1,2 tables in our cdc_raw: customers and transactions 27 | # MAGIC %fs ls /tmp/demo/cdc_raw 28 | 29 | # COMMAND ---------- 30 | 31 | #Let's loop over all the folders and dynamically generate our DLT pipeline. 32 | import dlt 33 | from pyspark.sql.functions import * 34 | 35 | 36 | def create_pipeline(table_name): 37 | print(f"Building DLT CDC pipeline for {table_name}") 38 | 39 | ##Raw CDC Table 40 | # .option("cloudFiles.maxFilesPerTrigger", "1") 41 | @dlt.create_table(name=table_name+"_cdc", 42 | comment = "New "+table_name+" data incrementally ingested from cloud object storage landing zone") 43 | def raw_cdc(): 44 | return ( 45 | spark.readStream.format("cloudFiles") 46 | .option("cloudFiles.format", "json") 47 | .option("cloudFiles.inferColumnTypes", "true") 48 | .load("/demos/dlt/cdc_raw/"+table_name)) 49 | 50 | ##Clean CDC input and track quality with expectations 51 | @dlt.create_view(name=table_name+"_cdc_clean", 52 | comment="Cleansed cdc data, tracking data quality with a view. 
We ensure valid JSON, id and operation type") 53 | @dlt.expect_or_drop("no_rescued_data", "_rescued_data IS NULL") 54 | @dlt.expect_or_drop("valid_id", "id IS NOT NULL") 55 | @dlt.expect_or_drop("valid_operation", "operation IN ('APPEND', 'DELETE', 'UPDATE')") 56 | def raw_cdc_clean(): 57 | return dlt.read_stream(table_name+"_cdc") 58 | 59 | 60 | ##Materialize the final table 61 | dlt.create_target_table(name=table_name, comment="Clean, materialized "+table_name) 62 | dlt.apply_changes(target = table_name, #The customer table being materialized 63 | source = table_name+"_cdc_clean", #the incoming CDC 64 | keys = ["id"], #what we'll be using to match the rows to upsert 65 | sequence_by = col("operation_date"), #we deduplicate by operation date, keeping the most recent value 66 | ignore_null_updates = False, 67 | apply_as_deletes = expr("operation = 'DELETE'"), #DELETE condition 68 | except_column_list = ["operation", "operation_date", "_rescued_data"]) #in addition we drop metadata columns 69 | 70 | 71 | for folder in dbutils.fs.ls("/demos/dlt/cdc_raw"): 72 | table_name = folder.name[:-1] 73 | create_pipeline(table_name) 74 | 75 | # COMMAND ---------- 76 | 77 | # DBTITLE 1,Add final layer joining 2 tables 78 | @dlt.create_table(name="transactions_per_customers", 79 | comment = "table join between users and transactions for further analysis") 80 | def raw_cdc(): 81 | return dlt.read("transactions").join(dlt.read("customers"), ["id"], "left") 82 | 83 | # COMMAND ---------- 84 | 85 | # MAGIC %md 86 | # MAGIC ### Conclusion 87 | # MAGIC We can now scale our CDC pipeline to N tables using Python factorization. This gives us infinite possibilities and abstraction levels in our DLT pipelines. 88 | # MAGIC 89 | # MAGIC DLT handles all the hard work for us so that we can focus on business transformation and drastically accelerate the DE team: 90 | # MAGIC - simplify file ingestion with the autoloader 91 | # MAGIC - track data quality using expectations 92 | # MAGIC - simplify all operations including upserts with APPLY CHANGES 93 | # MAGIC - process all our tables in parallel 94 | # MAGIC - autoscale based on the amount of data 95 | # MAGIC 96 | # MAGIC DLT gives more power to SQL-only users, letting them build advanced data pipelines without requiring strong data engineering skills. 97 | -------------------------------------------------------------------------------- /00-quickstarts/dlt-cdc/_resources/00-Data_CDC_Generator.py: -------------------------------------------------------------------------------- 1 | # Databricks notebook source 2 | # MAGIC %pip install Faker 3 | 4 | # COMMAND ---------- 5 | 6 | # MAGIC %md 7 | # MAGIC 8 | # MAGIC ### Retail CDC Data Generator 9 | # MAGIC 10 | # MAGIC Run this notebook to create new data. It's added in the pipeline to make sure data exists when we run it. 11 | # MAGIC 12 | # MAGIC You can also run it in the background to periodically add data.
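# MAGIC
# MAGIC For example, a minimal sketch of a background driver loop (the relative notebook path and the 10-minute interval below are assumptions, adjust them to your workspace):
# MAGIC
# MAGIC ```
# MAGIC import time
# MAGIC
# MAGIC # from another notebook, re-run the generator every 10 minutes to keep appending CDC files
# MAGIC while True:
# MAGIC     dbutils.notebook.run("./_resources/00-Data_CDC_Generator", 600)
# MAGIC     time.sleep(600)
# MAGIC ```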
13 | # MAGIC 14 | # MAGIC 15 | # MAGIC 16 | 17 | # COMMAND ---------- 18 | 19 | folder = "/demos/dlt/cdc_raw" 20 | #dbutils.fs.rm(folder, True) 21 | try: 22 | dbutils.fs.ls(folder) 23 | dbutils.fs.ls(folder+"/transactions") 24 | dbutils.fs.ls(folder+"/customers") 25 | except: 26 | print("folder doesn't exists, generating the data...") 27 | from pyspark.sql import functions as F 28 | from faker import Faker 29 | from collections import OrderedDict 30 | import uuid 31 | fake = Faker() 32 | import random 33 | 34 | 35 | fake_firstname = F.udf(fake.first_name) 36 | fake_lastname = F.udf(fake.last_name) 37 | fake_email = F.udf(fake.ascii_company_email) 38 | fake_date = F.udf(lambda:fake.date_time_this_month().strftime("%m-%d-%Y %H:%M:%S")) 39 | fake_address = F.udf(fake.address) 40 | operations = OrderedDict([("APPEND", 0.5),("DELETE", 0.1),("UPDATE", 0.3),(None, 0.01)]) 41 | fake_operation = F.udf(lambda:fake.random_elements(elements=operations, length=1)[0]) 42 | fake_id = F.udf(lambda: str(uuid.uuid4()) if random.uniform(0, 1) < 0.98 else None) 43 | 44 | df = spark.range(0, 100000).repartition(100) 45 | df = df.withColumn("id", fake_id()) 46 | df = df.withColumn("firstname", fake_firstname()) 47 | df = df.withColumn("lastname", fake_lastname()) 48 | df = df.withColumn("email", fake_email()) 49 | df = df.withColumn("address", fake_address()) 50 | df = df.withColumn("operation", fake_operation()) 51 | df_customers = df.withColumn("operation_date", fake_date()) 52 | df_customers.repartition(100).write.format("json").mode("overwrite").save(folder+"/customers") 53 | 54 | df = spark.range(0, 10000).repartition(20) 55 | df = df.withColumn("id", fake_id()) 56 | df = df.withColumn("transaction_date", fake_date()) 57 | df = df.withColumn("amount", F.round(F.rand()*1000)) 58 | df = df.withColumn("item_count", F.round(F.rand()*10)) 59 | df = df.withColumn("operation", fake_operation()) 60 | df = df.withColumn("operation_date", fake_date()) 61 | #Join with the customer to get the same IDs generated. 62 | df = df.withColumn("t_id", F.monotonically_increasing_id()).join(spark.read.json(folder+"/customers").select("id").withColumnRenamed("id", "customer_id").withColumn("t_id", F.monotonically_increasing_id()), "t_id").drop("t_id") 63 | df.repartition(10).write.format("json").mode("overwrite").save(folder+"/transactions") 64 | -------------------------------------------------------------------------------- /00-quickstarts/dlt-cdc/_resources/LICENSE.py: -------------------------------------------------------------------------------- 1 | # Databricks notebook source 2 | # MAGIC %md 3 | # MAGIC ## Licence 4 | 5 | # COMMAND ---------- 6 | 7 | # MAGIC %md 8 | # MAGIC 9 | # MAGIC Copyright (2022) Databricks, Inc. 10 | # MAGIC 11 | # MAGIC This library (the "Software") may not be used except in connection with the Licensee's use of the Databricks Platform Services pursuant 12 | # MAGIC to an Agreement (defined below) between Licensee (defined below) and Databricks, Inc. ("Databricks"). 
The Object Code version of the 13 | # MAGIC Software shall be deemed part of the Downloadable Services under the Agreement, or if the Agreement does not define Downloadable Services, 14 | # MAGIC Subscription Services, or if neither are defined then the term in such Agreement that refers to the applicable Databricks Platform 15 | # MAGIC Services (as defined below) shall be substituted herein for “Downloadable Services.” Licensee's use of the Software must comply at 16 | # MAGIC all times with any restrictions applicable to the Downlodable Services and Subscription Services, generally, and must be used in 17 | # MAGIC accordance with any applicable documentation. For the avoidance of doubt, the Software constitutes Databricks Confidential Information 18 | # MAGIC under the Agreement. 19 | # MAGIC 20 | # MAGIC Additionally, and notwithstanding anything in the Agreement to the contrary: 21 | # MAGIC * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES 22 | # MAGIC OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE 23 | # MAGIC LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR 24 | # MAGIC IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. 25 | # MAGIC * you may view, make limited copies of, and may compile the Source Code version of the Software into an Object Code version of the 26 | # MAGIC Software. For the avoidance of doubt, you may not make derivative works of Software (or make any any changes to the Source Code 27 | # MAGIC version of the unless you have agreed to separate terms with Databricks permitting such modifications (e.g., a contribution license 28 | # MAGIC agreement)). 29 | # MAGIC 30 | # MAGIC If you have not agreed to an Agreement or otherwise do not agree to these terms, you may not use the Software or view, copy or compile 31 | # MAGIC the Source Code of the Software. 32 | # MAGIC 33 | # MAGIC This license terminates automatically upon the termination of the Agreement or Licensee's breach of these terms. Additionally, 34 | # MAGIC Databricks may terminate this license at any time on notice. Upon termination, you must permanently delete the Software and all 35 | # MAGIC copies thereof (including the Source Code). 36 | # MAGIC 37 | # MAGIC Agreement: the agreement between Databricks and Licensee governing the use of the Databricks Platform Services, which shall be, with 38 | # MAGIC respect to Databricks, the Databricks Terms of Service located at www.databricks.com/termsofservice, and with respect to Databricks 39 | # MAGIC Community Edition, the Community Edition Terms of Service located at www.databricks.com/ce-termsofuse, in each case unless Licensee 40 | # MAGIC has entered into a separate written agreement with Databricks governing the use of the applicable Databricks Platform Services. 41 | # MAGIC 42 | # MAGIC Databricks Platform Services: the Databricks services or the Databricks Community Edition services, according to where the Software is used. 43 | # MAGIC 44 | # MAGIC Licensee: the user of the Software, or, if the Software is being used on behalf of a company, the company. 45 | # MAGIC 46 | # MAGIC Object Code: is version of the Software produced when an interpreter or a compiler translates the Source Code into recognizable and 47 | # MAGIC executable machine code. 
48 | # MAGIC 49 | # MAGIC Source Code: the human readable portion of the Software. 50 | -------------------------------------------------------------------------------- /00-quickstarts/dlt-cdc/_resources/NOTICE.py: -------------------------------------------------------------------------------- 1 | # Databricks notebook source 2 | # MAGIC %md 3 | # MAGIC ## Licence 4 | # MAGIC See LICENSE file. 5 | # MAGIC 6 | # MAGIC ## Data collection 7 | # MAGIC To improve users experience and dbdemos asset quality, dbdemos sends report usage and capture views in the installed notebook (usually in the first cell) and other assets like dashboards. This information is captured for product improvement only and not for marketing purpose, and doesn't contain PII information. By using `dbdemos` and the assets it provides, you consent to this data collection. If you wish to disable it, you can set `Tracker.enable_tracker` to False in the `tracker.py` file. 8 | # MAGIC 9 | # MAGIC ## Resource creation 10 | # MAGIC To simplify your experience, `dbdemos` will create and start for you resources. As example, a demo could start (not exhaustive): 11 | # MAGIC - A cluster to run your demo 12 | # MAGIC - A Delta Live Table Pipeline to ingest data 13 | # MAGIC - A DBSQL endpoint to run DBSQL dashboard 14 | # MAGIC - An ML model 15 | # MAGIC 16 | # MAGIC While `dbdemos` does its best to limit the consumption and enforce resource auto-termination, you remain responsible for the resources created and the potential consumption associated. 17 | # MAGIC 18 | # MAGIC ## Support 19 | # MAGIC Databricks does not offer official support for `dbdemos` and the associated assets. 20 | # MAGIC For any issue with `dbdemos` or the demos installed, please open an issue and the demo team will have a look on a best effort basis. 21 | # MAGIC 22 | # MAGIC 23 | -------------------------------------------------------------------------------- /00-quickstarts/dlt-cdc/_resources/README.py: -------------------------------------------------------------------------------- 1 | # Databricks notebook source 2 | # MAGIC %md 3 | # MAGIC ## DBDemos asset 4 | # MAGIC 5 | # MAGIC The notebooks available under `_/resources` are technical resources. 6 | # MAGIC 7 | # MAGIC Do not edit these notebooks or try to run them directly. These notebooks will load data / run some setup. They are indirectly called from the main notebook (`%run ./_resources/.....`) 8 | -------------------------------------------------------------------------------- /00-quickstarts/dlt-loans/03-Log-Analysis.sql: -------------------------------------------------------------------------------- 1 | -- Databricks notebook source 2 | -- MAGIC %md 3 | -- MAGIC ### A cluster has been created for this demo 4 | -- MAGIC To run this demo, just select the cluster `dbdemos-dlt-loans-abraham_pabbathi` from the dropdown menu ([open cluster configuration](https://e2-demo-field-eng.cloud.databricks.com/#setting/clusters/0728-224958-5yoad5lg/configuration)).
5 | -- MAGIC *Note: If the cluster was deleted after 30 days, you can re-create it with `dbdemos.create_cluster('dlt-loans')` or re-install the demo: `dbdemos.install('dlt-loans')`* 6 | 7 | -- COMMAND ---------- 8 | 9 | -- MAGIC %md-sandbox 10 | -- MAGIC 11 | -- MAGIC # DLT pipeline log analysis 12 | -- MAGIC 13 | -- MAGIC 14 | -- MAGIC 15 | -- MAGIC Each DLT Pipeline saves events and expectations metrics in the Storage Location defined on the pipeline. From this table we can see what is happening and the quality of the data passing through it. 16 | -- MAGIC 17 | -- MAGIC You can leverage the expecations directly as a SQL table with Databricks SQL to track your expectation metrics and send alerts as required. 18 | -- MAGIC 19 | -- MAGIC This notebook extracts and analyses expectation metrics to build such KPIS. 20 | -- MAGIC 21 | -- MAGIC You can find your metrics opening the Settings of your DLT pipeline, under `storage` : 22 | -- MAGIC 23 | -- MAGIC ``` 24 | -- MAGIC { 25 | -- MAGIC ... 26 | -- MAGIC "name": "test_dlt_cdc", 27 | -- MAGIC "storage": "/demos/dlt/loans", 28 | -- MAGIC "target": "quentin_dlt_cdc" 29 | -- MAGIC } 30 | -- MAGIC ``` 31 | -- MAGIC 32 | -- MAGIC 33 | -- MAGIC 34 | 35 | -- COMMAND ---------- 36 | 37 | -- DBTITLE 1,Load DLT system table 38 | -- MAGIC %python 39 | -- MAGIC import re 40 | -- MAGIC current_user = dbutils.notebook.entry_point.getDbutils().notebook().getContext().tags().apply('user') 41 | -- MAGIC storage_path = '/demos/dlt/loans/'+re.sub("[^A-Za-z0-9]", '_', current_user[:current_user.rfind('@')]) 42 | -- MAGIC dbutils.widgets.text('storage_path', storage_path) 43 | -- MAGIC print(f"using storage path: {storage_path}") 44 | 45 | -- COMMAND ---------- 46 | 47 | -- MAGIC %python display(dbutils.fs.ls(dbutils.widgets.get('storage_path'))) 48 | 49 | -- COMMAND ---------- 50 | 51 | -- MAGIC %sql 52 | -- MAGIC CREATE OR REPLACE TEMPORARY VIEW demo_dlt_loans_system_event_log_raw 53 | -- MAGIC as SELECT * FROM delta.`$storage_path/system/events`; 54 | -- MAGIC SELECT * FROM demo_dlt_loans_system_event_log_raw order by timestamp desc; 55 | 56 | -- COMMAND ---------- 57 | 58 | -- MAGIC %md 59 | -- MAGIC The `details` column contains metadata about each Event sent to the Event Log. There are different fields depending on what type of Event it is. 
Some examples include: 60 | -- MAGIC * `user_action` Events occur when taking actions like creating the pipeline 61 | -- MAGIC * `flow_definition` Events occur when a pipeline is deployed or updated and have lineage, schema, and execution plan information 62 | -- MAGIC * `output_dataset` and `input_datasets` - output table/view and its upstream table(s)/view(s) 63 | -- MAGIC * `flow_type` - whether this is a complete or append flow 64 | -- MAGIC * `explain_text` - the Spark explain plan 65 | -- MAGIC * `flow_progress` Events occur when a data flow starts running or finishes processing a batch of data 66 | -- MAGIC * `metrics` - currently contains `num_output_rows` 67 | -- MAGIC * `data_quality` - contains an array of the results of the data quality rules for this particular dataset 68 | -- MAGIC * `dropped_records` 69 | -- MAGIC * `expectations` 70 | -- MAGIC * `name`, `dataset`, `passed_records`, `failed_records` 71 | -- MAGIC 72 | 73 | -- COMMAND ---------- 74 | 75 | -- DBTITLE 1,Lineage Information 76 | SELECT 77 | details:flow_definition.output_dataset, 78 | details:flow_definition.input_datasets, 79 | details:flow_definition.flow_type, 80 | details:flow_definition.schema, 81 | details:flow_definition 82 | FROM demo_dlt_loans_system_event_log_raw 83 | WHERE details:flow_definition IS NOT NULL 84 | ORDER BY timestamp 85 | 86 | -- COMMAND ---------- 87 | 88 | -- DBTITLE 1,Data Quality Results 89 | SELECT 90 | id, 91 | expectations.dataset, 92 | expectations.name, 93 | expectations.failed_records, 94 | expectations.passed_records 95 | FROM( 96 | SELECT 97 | id, 98 | timestamp, 99 | details:flow_progress.metrics, 100 | details:flow_progress.data_quality.dropped_records, 101 | explode(from_json(details:flow_progress:data_quality:expectations 102 | ,schema_of_json("[{'name':'str', 'dataset':'str', 'passed_records':42, 'failed_records':42}]"))) expectations 103 | FROM demo_dlt_loans_system_event_log_raw 104 | WHERE details:flow_progress.metrics IS NOT NULL) data_quality 105 | 106 | -- COMMAND ---------- 107 | 108 | -- MAGIC %md 109 | -- MAGIC Your expectations are ready to be queried in SQL! Open the data Quality Dashboard example for more details. 110 | -------------------------------------------------------------------------------- /00-quickstarts/dlt-loans/_resources/LICENSE.py: -------------------------------------------------------------------------------- 1 | # Databricks notebook source 2 | # MAGIC %md 3 | # MAGIC ## Licence 4 | 5 | # COMMAND ---------- 6 | 7 | # MAGIC %md 8 | # MAGIC 9 | # MAGIC Copyright (2022) Databricks, Inc. 10 | # MAGIC 11 | # MAGIC This library (the "Software") may not be used except in connection with the Licensee's use of the Databricks Platform Services pursuant 12 | # MAGIC to an Agreement (defined below) between Licensee (defined below) and Databricks, Inc. ("Databricks"). The Object Code version of the 13 | # MAGIC Software shall be deemed part of the Downloadable Services under the Agreement, or if the Agreement does not define Downloadable Services, 14 | # MAGIC Subscription Services, or if neither are defined then the term in such Agreement that refers to the applicable Databricks Platform 15 | # MAGIC Services (as defined below) shall be substituted herein for “Downloadable Services.” Licensee's use of the Software must comply at 16 | # MAGIC all times with any restrictions applicable to the Downlodable Services and Subscription Services, generally, and must be used in 17 | # MAGIC accordance with any applicable documentation. 
For the avoidance of doubt, the Software constitutes Databricks Confidential Information 18 | # MAGIC under the Agreement. 19 | # MAGIC 20 | # MAGIC Additionally, and notwithstanding anything in the Agreement to the contrary: 21 | # MAGIC * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES 22 | # MAGIC OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE 23 | # MAGIC LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR 24 | # MAGIC IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. 25 | # MAGIC * you may view, make limited copies of, and may compile the Source Code version of the Software into an Object Code version of the 26 | # MAGIC Software. For the avoidance of doubt, you may not make derivative works of Software (or make any any changes to the Source Code 27 | # MAGIC version of the unless you have agreed to separate terms with Databricks permitting such modifications (e.g., a contribution license 28 | # MAGIC agreement)). 29 | # MAGIC 30 | # MAGIC If you have not agreed to an Agreement or otherwise do not agree to these terms, you may not use the Software or view, copy or compile 31 | # MAGIC the Source Code of the Software. 32 | # MAGIC 33 | # MAGIC This license terminates automatically upon the termination of the Agreement or Licensee's breach of these terms. Additionally, 34 | # MAGIC Databricks may terminate this license at any time on notice. Upon termination, you must permanently delete the Software and all 35 | # MAGIC copies thereof (including the Source Code). 36 | # MAGIC 37 | # MAGIC Agreement: the agreement between Databricks and Licensee governing the use of the Databricks Platform Services, which shall be, with 38 | # MAGIC respect to Databricks, the Databricks Terms of Service located at www.databricks.com/termsofservice, and with respect to Databricks 39 | # MAGIC Community Edition, the Community Edition Terms of Service located at www.databricks.com/ce-termsofuse, in each case unless Licensee 40 | # MAGIC has entered into a separate written agreement with Databricks governing the use of the applicable Databricks Platform Services. 41 | # MAGIC 42 | # MAGIC Databricks Platform Services: the Databricks services or the Databricks Community Edition services, according to where the Software is used. 43 | # MAGIC 44 | # MAGIC Licensee: the user of the Software, or, if the Software is being used on behalf of a company, the company. 45 | # MAGIC 46 | # MAGIC Object Code: is version of the Software produced when an interpreter or a compiler translates the Source Code into recognizable and 47 | # MAGIC executable machine code. 48 | # MAGIC 49 | # MAGIC Source Code: the human readable portion of the Software. 50 | -------------------------------------------------------------------------------- /00-quickstarts/dlt-loans/_resources/NOTICE.py: -------------------------------------------------------------------------------- 1 | # Databricks notebook source 2 | # MAGIC %md 3 | # MAGIC ## Licence 4 | # MAGIC See LICENSE file. 5 | # MAGIC 6 | # MAGIC ## Data collection 7 | # MAGIC To improve users experience and dbdemos asset quality, dbdemos sends report usage and capture views in the installed notebook (usually in the first cell) and other assets like dashboards. 
This information is captured for product improvement only and not for marketing purpose, and doesn't contain PII information. By using `dbdemos` and the assets it provides, you consent to this data collection. If you wish to disable it, you can set `Tracker.enable_tracker` to False in the `tracker.py` file. 8 | # MAGIC 9 | # MAGIC ## Resource creation 10 | # MAGIC To simplify your experience, `dbdemos` will create and start for you resources. As example, a demo could start (not exhaustive): 11 | # MAGIC - A cluster to run your demo 12 | # MAGIC - A Delta Live Table Pipeline to ingest data 13 | # MAGIC - A DBSQL endpoint to run DBSQL dashboard 14 | # MAGIC - An ML model 15 | # MAGIC 16 | # MAGIC While `dbdemos` does its best to limit the consumption and enforce resource auto-termination, you remain responsible for the resources created and the potential consumption associated. 17 | # MAGIC 18 | # MAGIC ## Support 19 | # MAGIC Databricks does not offer official support for `dbdemos` and the associated assets. 20 | # MAGIC For any issue with `dbdemos` or the demos installed, please open an issue and the demo team will have a look on a best effort basis. 21 | # MAGIC 22 | # MAGIC 23 | -------------------------------------------------------------------------------- /00-quickstarts/dlt-loans/_resources/README.py: -------------------------------------------------------------------------------- 1 | # Databricks notebook source 2 | # MAGIC %md 3 | # MAGIC ## DBDemos asset 4 | # MAGIC 5 | # MAGIC The notebooks available under `_/resources` are technical resources. 6 | # MAGIC 7 | # MAGIC Do not edit these notebooks or try to run them directly. These notebooks will load data / run some setup. They are indirectly called from the main notebook (`%run ./_resources/.....`) 8 | -------------------------------------------------------------------------------- /00-quickstarts/lakehouse-retail-c360/01-Data-ingestion/01.2-DLT-churn-Python-UDF.py: -------------------------------------------------------------------------------- 1 | # Databricks notebook source 2 | # DBTITLE 1,Let's install mlflow & the ML libs to be able to load our model (from requirement.txt file): 3 | # MAGIC %pip install mlflow>=2.1 category-encoders==2.5.1.post0 cffi==1.15.0 cloudpickle==2.0.0 databricks-automl-runtime==0.2.15 defusedxml==0.7.1 holidays==0.18 lightgbm==3.3.4 matplotlib==3.5.1 psutil==5.8.0 scikit-learn==1.0.2 typing-extensions==4.1.1 4 | 5 | # COMMAND ---------- 6 | 7 | # MAGIC %md #Registering a Python UDF as a SQL function 8 | # MAGIC This is a companion notebook to load the `predict_churn` model as a Spark UDF and save it as a SQL function. While this code was present in the SQL notebook, it won't be run by the DLT engine (since the notebook is SQL we only read SQL cells) 9 | # MAGIC 10 | # MAGIC For the UDF to be available, you must add this notebook to your DLT pipeline.
(Currently, mixing Python in a SQL DLT notebook won't run the Python code.) 11 | # MAGIC 12 | # MAGIC 13 | # MAGIC 14 | 15 | # COMMAND ---------- 16 | 17 | # MAGIC %python 18 | # MAGIC import mlflow 19 | # MAGIC # Stage/version 20 | # MAGIC # Model name | 21 | # MAGIC # | | 22 | # MAGIC predict_churn_udf = mlflow.pyfunc.spark_udf(spark, "models:/dbdemos_customer_churn/Production") 23 | # MAGIC spark.udf.register("predict_churn", predict_churn_udf) 24 | 25 | # COMMAND ---------- 26 | 27 | # MAGIC %md ### Setting up the DLT 28 | # MAGIC 29 | # MAGIC This notebook must be included in your DLT "libraries" parameter: 30 | # MAGIC 31 | # MAGIC ``` 32 | # MAGIC { 33 | # MAGIC "id": "95f28631-1884-425e-af69-05c3f397dd90", 34 | # MAGIC "name": "xxxx", 35 | # MAGIC "storage": "/demos/dlt/lakehouse_churn/xxxxx", 36 | # MAGIC "configuration": { 37 | # MAGIC "pipelines.useV2DetailsPage": "true" 38 | # MAGIC }, 39 | # MAGIC "clusters": [ 40 | # MAGIC { 41 | # MAGIC "label": "default", 42 | # MAGIC "autoscale": { 43 | # MAGIC "min_workers": 1, 44 | # MAGIC "max_workers": 5 45 | # MAGIC } 46 | # MAGIC } 47 | # MAGIC ], 48 | # MAGIC "libraries": [ 49 | # MAGIC { 50 | # MAGIC "notebook": { 51 | # MAGIC "path": "/Repos/xxxx/01.2-DLT-churn-Python-UDF" 52 | # MAGIC } 53 | # MAGIC }, 54 | # MAGIC { 55 | # MAGIC "notebook": { 56 | # MAGIC "path": "/Repos/xxxx/01.1-DLT-churn-SQL" 57 | # MAGIC } 58 | # MAGIC } 59 | # MAGIC ], 60 | # MAGIC "target": "retail_lakehouse_churn_xxxx", 61 | # MAGIC "continuous": false, 62 | # MAGIC "development": false 63 | # MAGIC } 64 | # MAGIC ``` 65 | -------------------------------------------------------------------------------- /00-quickstarts/lakehouse-retail-c360/05-Workflow-orchestration/05-Workflow-orchestration-churn.py: -------------------------------------------------------------------------------- 1 | # Databricks notebook source 2 | # MAGIC %md-sandbox 3 | # MAGIC # Deploying and orchestrating the full workflow 4 | # MAGIC 5 | # MAGIC 6 | # MAGIC 7 | # MAGIC All our assets are ready. We now need to define when we want our DLT pipeline to kick in and refresh the tables. 8 | # MAGIC 9 | # MAGIC One option is to switch the DLT pipeline to continuous mode to get a streaming pipeline, providing near-realtime insight. 10 | # MAGIC 11 | # MAGIC An alternative is to wake up the DLT pipeline every X hours, ingest the new data (incrementally) and shut down all your compute. 12 | # MAGIC 13 | # MAGIC This is a simple configuration offering a tradeoff between uptime and ingestion latencies. 14 | # MAGIC 15 | # MAGIC In our case, we decided that the best tradeoff is to ingest new data every hour: 16 | # MAGIC 17 | # MAGIC - Start the DLT pipeline to ingest new data and refresh our tables 18 | # MAGIC - Refresh the DBSQL dashboard (and potentially notify downstream applications) 19 | # MAGIC - Retrain our model to include the latest data and capture potential behavior changes 20 | # MAGIC 21 | # MAGIC 22 | # MAGIC 23 | 24 | # COMMAND ---------- 25 | 26 | # MAGIC %md-sandbox 27 | # MAGIC ## Orchestrating our Churn pipeline with Databricks Workflows 28 | # MAGIC 29 | # MAGIC 30 | # MAGIC 31 | # MAGIC With the Databricks Lakehouse, there is no need for an external orchestrator. We can use [Workflows](/#job/list) (available on the left menu) to orchestrate our Churn pipeline within a few clicks. 32 | # MAGIC 33 | # MAGIC 34 | # MAGIC 35 | # MAGIC ### Orchestrate anything anywhere 36 | # MAGIC With Workflows, you can run diverse workloads for the full data and AI lifecycle on any cloud.
Orchestrate Delta Live Tables and Jobs for SQL, Spark, notebooks, dbt, ML models and more. 37 | # MAGIC 38 | # MAGIC ### Simple - Fully managed 39 | # MAGIC Remove operational overhead with a fully managed orchestration service, so you can focus on your workflows, not on managing your infrastructure. 40 | # MAGIC 41 | # MAGIC ### Proven reliability 42 | # MAGIC Have full confidence in your workflows leveraging our proven experience running tens of millions of production workloads daily across AWS, Azure and GCP. 43 | 44 | # COMMAND ---------- 45 | 46 | # MAGIC %md-sandbox 47 | # MAGIC 48 | # MAGIC ## Creating your workflow 49 | # MAGIC 50 | # MAGIC 51 | # MAGIC 52 | # MAGIC A Databricks Workflow is composed of Tasks. 53 | # MAGIC 54 | # MAGIC Each task can trigger a specific job: 55 | # MAGIC 56 | # MAGIC * Delta Live Tables 57 | # MAGIC * SQL query / dashboard 58 | # MAGIC * Model retraining / inference 59 | # MAGIC * Notebooks 60 | # MAGIC * dbt 61 | # MAGIC * ... 62 | # MAGIC 63 | # MAGIC In this example, we can see our 3 tasks (a minimal sketch of such a workflow definition is included at the end of this notebook): 64 | # MAGIC 65 | # MAGIC * Start the DLT pipeline to ingest new data and refresh our tables 66 | # MAGIC * Refresh the DBSQL dashboard (and potentially notify downstream applications) 67 | # MAGIC * Retrain our Churn model 68 | 69 | # COMMAND ---------- 70 | 71 | # MAGIC %md-sandbox 72 | # MAGIC 73 | # MAGIC ## Monitoring your runs 74 | # MAGIC 75 | # MAGIC 76 | # MAGIC 77 | # MAGIC Once your workflow is created, you can access historical runs and receive alerts if something goes wrong! 78 | # MAGIC 79 | # MAGIC In the screenshot we can see that our workflow had multiple errors, with different runtimes, and ultimately got fixed. 80 | # MAGIC 81 | # MAGIC Workflow monitoring includes errors, abnormal job durations and more advanced controls! 82 | 83 | # COMMAND ---------- 84 | 85 | # MAGIC %md 86 | # MAGIC ## Conclusion 87 | # MAGIC 88 | # MAGIC Not only does the Databricks Lakehouse let you ingest, analyze and infer churn, it also provides a best-in-class orchestrator to offer your business fresh insight, making sure everything works as expected!
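# MAGIC
# MAGIC As referenced above, here is a minimal sketch of what a Jobs API 2.1-style definition for these 3 tasks could look like. The IDs, paths and schedule are placeholders, not the actual demo configuration:
# MAGIC
# MAGIC ```
# MAGIC {
# MAGIC   "name": "dbdemos_churn_workflow",
# MAGIC   "schedule": { "quartz_cron_expression": "0 0 * * * ?", "timezone_id": "UTC" },
# MAGIC   "tasks": [
# MAGIC     { "task_key": "ingest_dlt",
# MAGIC       "pipeline_task": { "pipeline_id": "<churn-dlt-pipeline-id>" } },
# MAGIC     { "task_key": "refresh_dashboard",
# MAGIC       "depends_on": [ { "task_key": "ingest_dlt" } ],
# MAGIC       "sql_task": { "dashboard": { "dashboard_id": "<churn-dashboard-id>" }, "warehouse_id": "<warehouse-id>" } },
# MAGIC     { "task_key": "retrain_churn_model",
# MAGIC       "depends_on": [ { "task_key": "ingest_dlt" } ],
# MAGIC       "notebook_task": { "notebook_path": "/Repos/xxxx/04.1-automl-churn-prediction" },
# MAGIC       "existing_cluster_id": "<cluster-id>" } ]
# MAGIC }
# MAGIC ```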
89 | # MAGIC 90 | # MAGIC [Go back to introduction]($../00-churn-introduction-lakehouse) 91 | -------------------------------------------------------------------------------- /00-quickstarts/lakehouse-retail-c360/_resources/00-setup-uc.py: -------------------------------------------------------------------------------- 1 | # Databricks notebook source 2 | dbutils.widgets.dropdown("reset_all_data", "false", ["true", "false"], "Reset all data") 3 | reset_all_data = dbutils.widgets.get("reset_all_data") == "true" 4 | 5 | # COMMAND ---------- 6 | 7 | # MAGIC %run ./00-global-setup $reset_all_data=$reset_all_data $db_prefix=retail $catalog=dbdemos $db=lakehouse_c360 8 | 9 | # COMMAND ---------- 10 | 11 | catalog = "dbdemos" 12 | database = 'lakehouse_c360' 13 | 14 | # COMMAND ---------- 15 | 16 | import json 17 | import time 18 | from pyspark.sql.window import Window 19 | from pyspark.sql.functions import row_number, sha1, col, initcap, to_timestamp 20 | 21 | folder = "/demos/retail/churn" 22 | 23 | if reset_all_data or is_folder_empty(folder+"/orders") or is_folder_empty(folder+"/users") or is_folder_empty(folder+"/events"): 24 | #data generation on another notebook to avoid installing libraries (takes a few seconds to setup pip env) 25 | print(f"Generating data under {folder} , please wait a few sec...") 26 | path = dbutils.notebook.entry_point.getDbutils().notebook().getContext().notebookPath().get() 27 | parent_count = path[path.rfind("lakehouse-retail-c360"):].count('/') - 1 28 | prefix = "./" if parent_count == 0 else parent_count*"../" 29 | prefix = f'{prefix}_resources/' 30 | dbutils.notebook.run(prefix+"02-create-churn-tables", 600, {"catalog": catalog, "cloud_storage_path": "/demos/", "reset_all_data": reset_all_data, "db": database}) 31 | else: 32 | print("data already existing. Run with reset_all_data=true to force a data cleanup for your local demo.") 33 | 34 | # COMMAND ---------- 35 | 36 | for table in spark.sql("SHOW TABLES").collect(): 37 | try: 38 | spark.sql(f"alter table {table['tableName']} owner to `account users`") 39 | except Exception as e: 40 | print(f"couldn't set table {table} ownership to account users") 41 | -------------------------------------------------------------------------------- /00-quickstarts/lakehouse-retail-c360/_resources/00-setup.py: -------------------------------------------------------------------------------- 1 | # Databricks notebook source 2 | dbutils.widgets.dropdown("reset_all_data", "false", ["true", "false"], "Reset all data") 3 | 4 | # COMMAND ---------- 5 | 6 | # MAGIC %run ./00-global-setup $reset_all_data=$reset_all_data $db_prefix=retail 7 | 8 | # COMMAND ---------- 9 | 10 | import mlflow 11 | if "evaluate" not in dir(mlflow): 12 | raise Exception("ERROR - YOU NEED MLFLOW 2.0 for this demo. 
Select DBRML 12+") 13 | 14 | import json 15 | import time 16 | from pyspark.sql.window import Window 17 | from pyspark.sql.functions import row_number 18 | 19 | reset_all_data = dbutils.widgets.get("reset_all_data") == "true" 20 | raw_data_location = cloud_storage_path+"/retail/churn" 21 | 22 | import json 23 | import time 24 | from pyspark.sql.window import Window 25 | from pyspark.sql.functions import row_number, sha1, col, initcap, to_timestamp 26 | 27 | folder = "/demos/retail/churn" 28 | 29 | if reset_all_data or is_folder_empty(folder+"/orders") or is_folder_empty(folder+"/users") or is_folder_empty(folder+"/events"): 30 | #data generation on another notebook to avoid installing libraries (takes a few seconds to setup pip env) 31 | print(f"Generating data under {folder} , please wait a few sec...") 32 | path = dbutils.notebook.entry_point.getDbutils().notebook().getContext().notebookPath().get() 33 | parent_count = path[path.rfind("lakehouse-retail-c360"):].count('/') - 1 34 | prefix = "./" if parent_count == 0 else parent_count*"../" 35 | prefix = f'{prefix}_resources/' 36 | dbutils.notebook.run(prefix+"01-load-data", 600, {"reset_all_data": dbutils.widgets.get("reset_all_data")}) 37 | else: 38 | print("data already existing. Run with reset_all_data=true to force a data cleanup for your local demo.") 39 | -------------------------------------------------------------------------------- /00-quickstarts/lakehouse-retail-c360/_resources/LICENSE.py: -------------------------------------------------------------------------------- 1 | # Databricks notebook source 2 | # MAGIC %md 3 | # MAGIC ## Licence 4 | 5 | # COMMAND ---------- 6 | 7 | # MAGIC %md 8 | # MAGIC 9 | # MAGIC Copyright (2022) Databricks, Inc. 10 | # MAGIC 11 | # MAGIC This library (the "Software") may not be used except in connection with the Licensee's use of the Databricks Platform Services pursuant 12 | # MAGIC to an Agreement (defined below) between Licensee (defined below) and Databricks, Inc. ("Databricks"). The Object Code version of the 13 | # MAGIC Software shall be deemed part of the Downloadable Services under the Agreement, or if the Agreement does not define Downloadable Services, 14 | # MAGIC Subscription Services, or if neither are defined then the term in such Agreement that refers to the applicable Databricks Platform 15 | # MAGIC Services (as defined below) shall be substituted herein for “Downloadable Services.” Licensee's use of the Software must comply at 16 | # MAGIC all times with any restrictions applicable to the Downlodable Services and Subscription Services, generally, and must be used in 17 | # MAGIC accordance with any applicable documentation. For the avoidance of doubt, the Software constitutes Databricks Confidential Information 18 | # MAGIC under the Agreement. 19 | # MAGIC 20 | # MAGIC Additionally, and notwithstanding anything in the Agreement to the contrary: 21 | # MAGIC * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES 22 | # MAGIC OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE 23 | # MAGIC LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR 24 | # MAGIC IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. 
25 | # MAGIC * you may view, make limited copies of, and may compile the Source Code version of the Software into an Object Code version of the 26 | # MAGIC Software. For the avoidance of doubt, you may not make derivative works of Software (or make any any changes to the Source Code 27 | # MAGIC version of the unless you have agreed to separate terms with Databricks permitting such modifications (e.g., a contribution license 28 | # MAGIC agreement)). 29 | # MAGIC 30 | # MAGIC If you have not agreed to an Agreement or otherwise do not agree to these terms, you may not use the Software or view, copy or compile 31 | # MAGIC the Source Code of the Software. 32 | # MAGIC 33 | # MAGIC This license terminates automatically upon the termination of the Agreement or Licensee's breach of these terms. Additionally, 34 | # MAGIC Databricks may terminate this license at any time on notice. Upon termination, you must permanently delete the Software and all 35 | # MAGIC copies thereof (including the Source Code). 36 | # MAGIC 37 | # MAGIC Agreement: the agreement between Databricks and Licensee governing the use of the Databricks Platform Services, which shall be, with 38 | # MAGIC respect to Databricks, the Databricks Terms of Service located at www.databricks.com/termsofservice, and with respect to Databricks 39 | # MAGIC Community Edition, the Community Edition Terms of Service located at www.databricks.com/ce-termsofuse, in each case unless Licensee 40 | # MAGIC has entered into a separate written agreement with Databricks governing the use of the applicable Databricks Platform Services. 41 | # MAGIC 42 | # MAGIC Databricks Platform Services: the Databricks services or the Databricks Community Edition services, according to where the Software is used. 43 | # MAGIC 44 | # MAGIC Licensee: the user of the Software, or, if the Software is being used on behalf of a company, the company. 45 | # MAGIC 46 | # MAGIC Object Code: is version of the Software produced when an interpreter or a compiler translates the Source Code into recognizable and 47 | # MAGIC executable machine code. 48 | # MAGIC 49 | # MAGIC Source Code: the human readable portion of the Software. 50 | -------------------------------------------------------------------------------- /00-quickstarts/lakehouse-retail-c360/_resources/NOTICE.py: -------------------------------------------------------------------------------- 1 | # Databricks notebook source 2 | # MAGIC %md 3 | # MAGIC ## Licence 4 | # MAGIC See LICENSE file. 5 | # MAGIC 6 | # MAGIC ## Data collection 7 | # MAGIC To improve users experience and dbdemos asset quality, dbdemos sends report usage and capture views in the installed notebook (usually in the first cell) and other assets like dashboards. This information is captured for product improvement only and not for marketing purpose, and doesn't contain PII information. By using `dbdemos` and the assets it provides, you consent to this data collection. If you wish to disable it, you can set `Tracker.enable_tracker` to False in the `tracker.py` file. 8 | # MAGIC 9 | # MAGIC ## Resource creation 10 | # MAGIC To simplify your experience, `dbdemos` will create and start for you resources. 
As example, a demo could start (not exhaustive): 11 | # MAGIC - A cluster to run your demo 12 | # MAGIC - A Delta Live Table Pipeline to ingest data 13 | # MAGIC - A DBSQL endpoint to run DBSQL dashboard 14 | # MAGIC - An ML model 15 | # MAGIC 16 | # MAGIC While `dbdemos` does its best to limit the consumption and enforce resource auto-termination, you remain responsible for the resources created and the potential consumption associated. 17 | # MAGIC 18 | # MAGIC ## Support 19 | # MAGIC Databricks does not offer official support for `dbdemos` and the associated assets. 20 | # MAGIC For any issue with `dbdemos` or the demos installed, please open an issue and the demo team will have a look on a best effort basis. 21 | # MAGIC 22 | # MAGIC 23 | -------------------------------------------------------------------------------- /00-quickstarts/lakehouse-retail-c360/_resources/README.py: -------------------------------------------------------------------------------- 1 | # Databricks notebook source 2 | # MAGIC %md 3 | # MAGIC ## DBDemos asset 4 | # MAGIC 5 | # MAGIC The notebooks available under `_/resources` are technical resources. 6 | # MAGIC 7 | # MAGIC Do not edit these notebooks or try to run them directly. These notebooks will load data / run some setup. They are indirectly called from the main notebook (`%run ./_resources/.....`) 8 | -------------------------------------------------------------------------------- /00-quickstarts/llm-dolly-chatbot/_resources/LICENSE.py: -------------------------------------------------------------------------------- 1 | # Databricks notebook source 2 | # MAGIC %md 3 | # MAGIC ## Licence 4 | 5 | # COMMAND ---------- 6 | 7 | # MAGIC %md 8 | # MAGIC 9 | # MAGIC Copyright (2022) Databricks, Inc. 10 | # MAGIC 11 | # MAGIC This library (the "Software") may not be used except in connection with the Licensee's use of the Databricks Platform Services pursuant 12 | # MAGIC to an Agreement (defined below) between Licensee (defined below) and Databricks, Inc. ("Databricks"). The Object Code version of the 13 | # MAGIC Software shall be deemed part of the Downloadable Services under the Agreement, or if the Agreement does not define Downloadable Services, 14 | # MAGIC Subscription Services, or if neither are defined then the term in such Agreement that refers to the applicable Databricks Platform 15 | # MAGIC Services (as defined below) shall be substituted herein for “Downloadable Services.” Licensee's use of the Software must comply at 16 | # MAGIC all times with any restrictions applicable to the Downlodable Services and Subscription Services, generally, and must be used in 17 | # MAGIC accordance with any applicable documentation. For the avoidance of doubt, the Software constitutes Databricks Confidential Information 18 | # MAGIC under the Agreement. 19 | # MAGIC 20 | # MAGIC Additionally, and notwithstanding anything in the Agreement to the contrary: 21 | # MAGIC * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES 22 | # MAGIC OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE 23 | # MAGIC LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR 24 | # MAGIC IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. 
25 | # MAGIC * you may view, make limited copies of, and may compile the Source Code version of the Software into an Object Code version of the 26 | # MAGIC Software. For the avoidance of doubt, you may not make derivative works of Software (or make any any changes to the Source Code 27 | # MAGIC version of the unless you have agreed to separate terms with Databricks permitting such modifications (e.g., a contribution license 28 | # MAGIC agreement)). 29 | # MAGIC 30 | # MAGIC If you have not agreed to an Agreement or otherwise do not agree to these terms, you may not use the Software or view, copy or compile 31 | # MAGIC the Source Code of the Software. 32 | # MAGIC 33 | # MAGIC This license terminates automatically upon the termination of the Agreement or Licensee's breach of these terms. Additionally, 34 | # MAGIC Databricks may terminate this license at any time on notice. Upon termination, you must permanently delete the Software and all 35 | # MAGIC copies thereof (including the Source Code). 36 | # MAGIC 37 | # MAGIC Agreement: the agreement between Databricks and Licensee governing the use of the Databricks Platform Services, which shall be, with 38 | # MAGIC respect to Databricks, the Databricks Terms of Service located at www.databricks.com/termsofservice, and with respect to Databricks 39 | # MAGIC Community Edition, the Community Edition Terms of Service located at www.databricks.com/ce-termsofuse, in each case unless Licensee 40 | # MAGIC has entered into a separate written agreement with Databricks governing the use of the applicable Databricks Platform Services. 41 | # MAGIC 42 | # MAGIC Databricks Platform Services: the Databricks services or the Databricks Community Edition services, according to where the Software is used. 43 | # MAGIC 44 | # MAGIC Licensee: the user of the Software, or, if the Software is being used on behalf of a company, the company. 45 | # MAGIC 46 | # MAGIC Object Code: is version of the Software produced when an interpreter or a compiler translates the Source Code into recognizable and 47 | # MAGIC executable machine code. 48 | # MAGIC 49 | # MAGIC Source Code: the human readable portion of the Software. 50 | -------------------------------------------------------------------------------- /00-quickstarts/llm-dolly-chatbot/_resources/NOTICE.py: -------------------------------------------------------------------------------- 1 | # Databricks notebook source 2 | # MAGIC %md 3 | # MAGIC ## Licence 4 | # MAGIC See LICENSE file. 5 | # MAGIC 6 | # MAGIC ## Data collection 7 | # MAGIC To improve users experience and dbdemos asset quality, dbdemos sends report usage and capture views in the installed notebook (usually in the first cell) and other assets like dashboards. This information is captured for product improvement only and not for marketing purpose, and doesn't contain PII information. By using `dbdemos` and the assets it provides, you consent to this data collection. If you wish to disable it, you can set `Tracker.enable_tracker` to False in the `tracker.py` file. 8 | # MAGIC 9 | # MAGIC ## Resource creation 10 | # MAGIC To simplify your experience, `dbdemos` will create and start for you resources. 
As example, a demo could start (not exhaustive): 11 | # MAGIC - A cluster to run your demo 12 | # MAGIC - A Delta Live Table Pipeline to ingest data 13 | # MAGIC - A DBSQL endpoint to run DBSQL dashboard 14 | # MAGIC - An ML model 15 | # MAGIC 16 | # MAGIC While `dbdemos` does its best to limit the consumption and enforce resource auto-termination, you remain responsible for the resources created and the potential consumption associated. 17 | # MAGIC 18 | # MAGIC ## Support 19 | # MAGIC Databricks does not offer official support for `dbdemos` and the associated assets. 20 | # MAGIC For any issue with `dbdemos` or the demos installed, please open an issue and the demo team will have a look on a best effort basis. 21 | # MAGIC 22 | # MAGIC 23 | -------------------------------------------------------------------------------- /00-quickstarts/llm-dolly-chatbot/_resources/README.py: -------------------------------------------------------------------------------- 1 | # Databricks notebook source 2 | # MAGIC %md 3 | # MAGIC ## DBDemos asset 4 | # MAGIC 5 | # MAGIC The notebooks available under `_/resources` are technical resources. 6 | # MAGIC 7 | # MAGIC Do not edit these notebooks or try to run them directly. These notebooks will load data / run some setup. They are indirectly called from the main notebook (`%run ./_resources/.....`) 8 | -------------------------------------------------------------------------------- /10-migrations/.DS_Store: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/AbePabbathi/lakehouse-tacklebox/9bbbf0a7181193091d887ed57e7394a466c369cf/10-migrations/.DS_Store -------------------------------------------------------------------------------- /10-migrations/05-uc-upgrade/_resources/00-setup.py: -------------------------------------------------------------------------------- 1 | # Databricks notebook source 2 | # MAGIC %pip install faker 3 | 4 | # COMMAND ---------- 5 | 6 | dbutils.widgets.text("catalog", "dbdemos", "UC Catalog") 7 | dbutils.widgets.text("external_location_path", "s3a://databricks-e2demofieldengwest/external_location_uc_upgrade", "External location path") 8 | external_location_path = dbutils.widgets.get("external_location_path") 9 | 10 | # COMMAND ---------- 11 | 12 | import pyspark.sql.functions as F 13 | import re 14 | catalog = dbutils.widgets.get("catalog") 15 | 16 | catalog_exists = False 17 | for r in spark.sql("SHOW CATALOGS").collect(): 18 | if r['catalog'] == catalog: 19 | catalog_exists = True 20 | 21 | #As non-admin users don't have permission by default, let's do that only if the catalog doesn't exist (an admin need to run it first) 22 | if not catalog_exists: 23 | spark.sql(f"CREATE CATALOG IF NOT EXISTS {catalog}") 24 | spark.sql(f"ALTER CATALOG {catalog} OWNER TO `account users`") 25 | spark.sql(f"GRANT CREATE, USAGE on CATALOG {catalog} TO `account users`") 26 | spark.sql(f"USE CATALOG {catalog}") 27 | 28 | database = 'database_upgraded_on_uc' 29 | print(f"creating {database} database") 30 | spark.sql(f"DROP DATABASE IF EXISTS {catalog}.{database} CASCADE") 31 | spark.sql(f"CREATE DATABASE IF NOT EXISTS {catalog}.{database}") 32 | spark.sql(f"GRANT CREATE, USAGE on DATABASE {catalog}.{database} TO `account users`") 33 | spark.sql(f"ALTER SCHEMA {catalog}.{database} OWNER TO `account users`") 34 | 35 | # COMMAND ---------- 36 | 37 | folder = "/dbdemos/uc/delta_dataset" 38 | spark.sql('drop database if exists hive_metastore.uc_database_to_upgrade cascade') 39 | #fix a bug from 
legacy version 40 | spark.sql(f'drop database if exists {catalog}.uc_database_to_upgrade cascade') 41 | dbutils.fs.rm("/transactions", True) 42 | 43 | print("generating the data...") 44 | from pyspark.sql import functions as F 45 | from faker import Faker 46 | from collections import OrderedDict 47 | import uuid 48 | import random 49 | fake = Faker() 50 | 51 | fake_firstname = F.udf(fake.first_name) 52 | fake_lastname = F.udf(fake.last_name) 53 | fake_email = F.udf(fake.ascii_company_email) 54 | fake_date = F.udf(lambda:fake.date_time_this_month().strftime("%m-%d-%Y %H:%M:%S")) 55 | fake_address = F.udf(fake.address) 56 | fake_credit_card_expire = F.udf(fake.credit_card_expire) 57 | 58 | fake_id = F.udf(lambda: str(uuid.uuid4())) 59 | countries = ['FR', 'USA', 'SPAIN'] 60 | fake_country = F.udf(lambda: countries[random.randint(0,2)]) 61 | 62 | df = spark.range(0, 10000) 63 | df = df.withColumn("id", F.monotonically_increasing_id()) 64 | df = df.withColumn("creation_date", fake_date()) 65 | df = df.withColumn("customer_firstname", fake_firstname()) 66 | df = df.withColumn("customer_lastname", fake_lastname()) 67 | df = df.withColumn("country", fake_country()) 68 | df = df.withColumn("customer_email", fake_email()) 69 | df = df.withColumn("address", fake_address()) 70 | df = df.withColumn("gender", F.round(F.rand()+0.2)) 71 | df = df.withColumn("age_group", F.round(F.rand()*10)) 72 | df.repartition(3).write.mode('overwrite').format("delta").save(folder+"/users") 73 | 74 | 75 | df = spark.range(0, 10000) 76 | df = df.withColumn("id", F.monotonically_increasing_id()) 77 | df = df.withColumn("customer_id", F.monotonically_increasing_id()) 78 | df = df.withColumn("transaction_date", fake_date()) 79 | df = df.withColumn("credit_card_expire", fake_credit_card_expire()) 80 | df = df.withColumn("amount", F.round(F.rand()*1000+200)) 81 | 82 | df = df.cache() 83 | spark.sql('create database if not exists hive_metastore.uc_database_to_upgrade') 84 | df.repartition(3).write.mode('overwrite').format("delta").saveAsTable("hive_metastore.uc_database_to_upgrade.users") 85 | 86 | #Note: this requires hard-coded external location. 87 | df.repartition(3).write.mode('overwrite').format("delta").save(external_location_path+"/transactions") 88 | 89 | # COMMAND ---------- 90 | 91 | df.repartition(3).write.mode('overwrite').format("delta").save(external_location_path+"/transactions") 92 | 93 | # COMMAND ---------- 94 | 95 | #Need to switch to hive metastore to avoid having a : org.apache.spark.SparkException: Your query is attempting to access overlapping paths through multiple authorization mechanisms, which is not currently supported. 96 | spark.sql("USE CATALOG hive_metastore") 97 | spark.sql(f"create table if not exists hive_metastore.uc_database_to_upgrade.transactions location '{external_location_path}/transactions'") 98 | spark.sql(f"create or replace view `hive_metastore`.`uc_database_to_upgrade`.users_view_to_upgrade as select * from hive_metastore.uc_database_to_upgrade.users where id is not null") 99 | 100 | spark.sql(f"USE CATALOG {catalog}") 101 | -------------------------------------------------------------------------------- /10-migrations/05-uc-upgrade/_resources/LICENSE.py: -------------------------------------------------------------------------------- 1 | # Databricks notebook source 2 | # MAGIC %md 3 | # MAGIC ## Licence 4 | 5 | # COMMAND ---------- 6 | 7 | # MAGIC %md 8 | # MAGIC 9 | # MAGIC Copyright (2022) Databricks, Inc. 
10 | # MAGIC 11 | # MAGIC This library (the "Software") may not be used except in connection with the Licensee's use of the Databricks Platform Services pursuant 12 | # MAGIC to an Agreement (defined below) between Licensee (defined below) and Databricks, Inc. ("Databricks"). The Object Code version of the 13 | # MAGIC Software shall be deemed part of the Downloadable Services under the Agreement, or if the Agreement does not define Downloadable Services, 14 | # MAGIC Subscription Services, or if neither are defined then the term in such Agreement that refers to the applicable Databricks Platform 15 | # MAGIC Services (as defined below) shall be substituted herein for “Downloadable Services.” Licensee's use of the Software must comply at 16 | # MAGIC all times with any restrictions applicable to the Downlodable Services and Subscription Services, generally, and must be used in 17 | # MAGIC accordance with any applicable documentation. For the avoidance of doubt, the Software constitutes Databricks Confidential Information 18 | # MAGIC under the Agreement. 19 | # MAGIC 20 | # MAGIC Additionally, and notwithstanding anything in the Agreement to the contrary: 21 | # MAGIC * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES 22 | # MAGIC OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE 23 | # MAGIC LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR 24 | # MAGIC IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. 25 | # MAGIC * you may view, make limited copies of, and may compile the Source Code version of the Software into an Object Code version of the 26 | # MAGIC Software. For the avoidance of doubt, you may not make derivative works of Software (or make any any changes to the Source Code 27 | # MAGIC version of the unless you have agreed to separate terms with Databricks permitting such modifications (e.g., a contribution license 28 | # MAGIC agreement)). 29 | # MAGIC 30 | # MAGIC If you have not agreed to an Agreement or otherwise do not agree to these terms, you may not use the Software or view, copy or compile 31 | # MAGIC the Source Code of the Software. 32 | # MAGIC 33 | # MAGIC This license terminates automatically upon the termination of the Agreement or Licensee's breach of these terms. Additionally, 34 | # MAGIC Databricks may terminate this license at any time on notice. Upon termination, you must permanently delete the Software and all 35 | # MAGIC copies thereof (including the Source Code). 36 | # MAGIC 37 | # MAGIC Agreement: the agreement between Databricks and Licensee governing the use of the Databricks Platform Services, which shall be, with 38 | # MAGIC respect to Databricks, the Databricks Terms of Service located at www.databricks.com/termsofservice, and with respect to Databricks 39 | # MAGIC Community Edition, the Community Edition Terms of Service located at www.databricks.com/ce-termsofuse, in each case unless Licensee 40 | # MAGIC has entered into a separate written agreement with Databricks governing the use of the applicable Databricks Platform Services. 41 | # MAGIC 42 | # MAGIC Databricks Platform Services: the Databricks services or the Databricks Community Edition services, according to where the Software is used. 
43 | # MAGIC 44 | # MAGIC Licensee: the user of the Software, or, if the Software is being used on behalf of a company, the company. 45 | # MAGIC 46 | # MAGIC Object Code: is version of the Software produced when an interpreter or a compiler translates the Source Code into recognizable and 47 | # MAGIC executable machine code. 48 | # MAGIC 49 | # MAGIC Source Code: the human readable portion of the Software. 50 | -------------------------------------------------------------------------------- /10-migrations/05-uc-upgrade/_resources/NOTICE.py: -------------------------------------------------------------------------------- 1 | # Databricks notebook source 2 | # MAGIC %md 3 | # MAGIC ## Licence 4 | # MAGIC See LICENSE file. 5 | # MAGIC 6 | # MAGIC ## Data collection 7 | # MAGIC To improve users experience and dbdemos asset quality, dbdemos sends report usage and capture views in the installed notebook (usually in the first cell) and other assets like dashboards. This information is captured for product improvement only and not for marketing purpose, and doesn't contain PII information. By using `dbdemos` and the assets it provides, you consent to this data collection. If you wish to disable it, you can set `Tracker.enable_tracker` to False in the `tracker.py` file. 8 | # MAGIC 9 | # MAGIC ## Resource creation 10 | # MAGIC To simplify your experience, `dbdemos` will create and start for you resources. As example, a demo could start (not exhaustive): 11 | # MAGIC - A cluster to run your demo 12 | # MAGIC - A Delta Live Table Pipeline to ingest data 13 | # MAGIC - A DBSQL endpoint to run DBSQL dashboard 14 | # MAGIC - An ML model 15 | # MAGIC 16 | # MAGIC While `dbdemos` does its best to limit the consumption and enforce resource auto-termination, you remain responsible for the resources created and the potential consumption associated. 17 | # MAGIC 18 | # MAGIC ## Support 19 | # MAGIC Databricks does not offer official support for `dbdemos` and the associated assets. 20 | # MAGIC For any issue with `dbdemos` or the demos installed, please open an issue and the demo team will have a look on a best effort basis. 21 | # MAGIC 22 | # MAGIC 23 | -------------------------------------------------------------------------------- /10-migrations/05-uc-upgrade/_resources/README.py: -------------------------------------------------------------------------------- 1 | # Databricks notebook source 2 | # MAGIC %md 3 | # MAGIC ## DBDemos asset 4 | # MAGIC 5 | # MAGIC The notebooks available under `_/resources` are technical resources. 6 | # MAGIC 7 | # MAGIC Do not edit these notebooks or try to run them directly. These notebooks will load data / run some setup. 
They are indirectly called from the main notebook (`%run ./_resources/.....`) 8 | -------------------------------------------------------------------------------- /10-migrations/README.md: -------------------------------------------------------------------------------- 1 | #### Migrations 2 | 3 | This section consists of tools that will help new Customers migrate their existing workloads to Lakehouse 4 | -------------------------------------------------------------------------------- /10-migrations/Using DBSQL Serverless Client Example.py: -------------------------------------------------------------------------------- 1 | # Databricks notebook source 2 | # MAGIC %pip install -r helperfunctions/requirements.txt 3 | 4 | # COMMAND ---------- 5 | 6 | from helperfunctions.dbsqlclient import ServerlessClient 7 | 8 | # COMMAND ---------- 9 | 10 | # DBTITLE 1,Example Inputs For Client 11 | 12 | 13 | token = None ## optional 14 | host_name = None ## optional 15 | warehouse_id = "" 16 | 17 | ## Single Query Example 18 | sql_statement = "SELECT concat_ws('-', M.id, N.id, random()) as ID FROM range(1000) AS M, range(1000) AS N LIMIT 10000000" 19 | 20 | ## Multi Query Example 21 | multi_statement = "SELECT 1; SELECT 2; SELECT concat_ws('-', M.id, N.id, random()) as ID FROM range(1000) AS M, range(1000) AS N LIMIT 10000000" 22 | 23 | # COMMAND ---------- 24 | 25 | serverless_client = ServerlessClient(warehouse_id = warehouse_id, token=token, host_name=host_name) ## token=, host_name=verbose=True for print statements and other debugging messages 26 | 27 | # COMMAND ---------- 28 | 29 | # DBTITLE 1,Basic sql drop-in command 30 | """ 31 | Optional Params: 32 | 1. full_results 33 | 2. use_catalog = - this is a command specific USE CATALOG statement for the single SQL command 34 | 3. use_schema = - this is a command specific USE SCHEMA 35 | 36 | """ 37 | 38 | result_df = serverless_client.sql(sql_statement = sql_statement) ## OPTIONAL: use_catalog="hive_metastore", use_schema="default" 39 | 40 | # COMMAND ---------- 41 | 42 | # DBTITLE 1,Multi Statement Command - No Results just Status - Recommended for production 43 | """ 44 | Optional Params: 45 | 1. full_results 46 | 2. use_catalog = - this is a command specific USE CATALOG statement for the single SQL command 47 | 3. use_schema = - this is a command specific USE SCHEMA 48 | 49 | """ 50 | 51 | result = serverless_client.submit_multiple_sql_commands(sql_statements = multi_statement, full_results=False) #session_catalog, session_schema are also optional parameters that will simulate a USE statement. 
True full_results just returns the whole API response for each query 52 | 53 | # COMMAND ---------- 54 | 55 | # DBTITLE 1,Multi Statement Command Returning Results of Last Command - Best for simple processes 56 | result_multi_df = serverless_client.submit_multiple_sql_commands_last_results(sql_statements = multi_statement) 57 | 58 | # COMMAND ---------- 59 | 60 | display(result_multi_df) 61 | 62 | # COMMAND ---------- 63 | 64 | # DBTITLE 1,If Multi Statement Fails, this is how to access the result chain 65 | ## The function save the state of each command in the chain, even if it fails to return results for troubleshooting 66 | 67 | last_saved_multi_statement_state = serverless_client.multi_statement_result_state 68 | print(last_saved_multi_statement_state) 69 | -------------------------------------------------------------------------------- /10-migrations/Using DBSQL Serverless Transaction Manager Example.py: -------------------------------------------------------------------------------- 1 | # Databricks notebook source 2 | # MAGIC %pip install -r helperfunctions/requirements.txt 3 | 4 | # COMMAND ---------- 5 | 6 | from helperfunctions.dbsqltransactions import DBSQLTransactionManager 7 | 8 | # COMMAND ---------- 9 | 10 | # DBTITLE 1,Example Inputs For Client 11 | token = None ## optional 12 | host_name = None ## optional 13 | warehouse_id = "" 14 | 15 | # COMMAND ---------- 16 | 17 | # DBTITLE 1,Example Multi Statement Transaction 18 | sqlString = """ 19 | USE CATALOG hive_metastore; 20 | 21 | CREATE SCHEMA IF NOT EXISTS iot_dashboard; 22 | 23 | USE SCHEMA iot_dashboard; 24 | 25 | -- Create Tables 26 | CREATE OR REPLACE TABLE iot_dashboard.bronze_sensors 27 | ( 28 | Id BIGINT GENERATED BY DEFAULT AS IDENTITY, 29 | device_id INT, 30 | user_id INT, 31 | calories_burnt DECIMAL(10,2), 32 | miles_walked DECIMAL(10,2), 33 | num_steps DECIMAL(10,2), 34 | timestamp TIMESTAMP, 35 | value STRING 36 | ) 37 | USING DELTA 38 | TBLPROPERTIES("delta.targetFileSize"="128mb"); 39 | 40 | CREATE OR REPLACE TABLE iot_dashboard.silver_sensors 41 | ( 42 | Id BIGINT GENERATED BY DEFAULT AS IDENTITY, 43 | device_id INT, 44 | user_id INT, 45 | calories_burnt DECIMAL(10,2), 46 | miles_walked DECIMAL(10,2), 47 | num_steps DECIMAL(10,2), 48 | timestamp TIMESTAMP, 49 | value STRING 50 | ) 51 | USING DELTA 52 | PARTITIONED BY (user_id) 53 | TBLPROPERTIES("delta.targetFileSize"="128mb"); 54 | 55 | -- Statement 1 -- the load 56 | COPY INTO iot_dashboard.bronze_sensors 57 | FROM (SELECT 58 | id::bigint AS Id, 59 | device_id::integer AS device_id, 60 | user_id::integer AS user_id, 61 | calories_burnt::decimal(10,2) AS calories_burnt, 62 | miles_walked::decimal(10,2) AS miles_walked, 63 | num_steps::decimal(10,2) AS num_steps, 64 | timestamp::timestamp AS timestamp, 65 | value AS value -- This is a JSON object 66 | FROM "/databricks-datasets/iot-stream/data-device/") 67 | FILEFORMAT = json 68 | COPY_OPTIONS('force'='true') -- 'false' -- process incrementally 69 | --option to be incremental or always load all files 70 | ; 71 | 72 | -- Statement 2 73 | MERGE INTO iot_dashboard.silver_sensors AS target 74 | USING (SELECT Id::integer, 75 | device_id::integer, 76 | user_id::integer, 77 | calories_burnt::decimal, 78 | miles_walked::decimal, 79 | num_steps::decimal, 80 | timestamp::timestamp, 81 | value::string 82 | FROM iot_dashboard.bronze_sensors) AS source 83 | ON source.Id = target.Id 84 | AND source.user_id = target.user_id 85 | AND source.device_id = target.device_id 86 | WHEN MATCHED THEN UPDATE SET 87 | 
target.calories_burnt = source.calories_burnt, 88 | target.miles_walked = source.miles_walked, 89 | target.num_steps = source.num_steps, 90 | target.timestamp = source.timestamp 91 | WHEN NOT MATCHED THEN INSERT *; 92 | 93 | OPTIMIZE iot_dashboard.silver_sensors ZORDER BY (timestamp); 94 | 95 | -- This calculate table stats for all columns to ensure the optimizer can build the best plan 96 | -- Statement 3 97 | 98 | ANALYZE TABLE iot_dashboard.silver_sensors COMPUTE STATISTICS FOR ALL COLUMNS; 99 | 100 | CREATE OR REPLACE TABLE hourly_summary_statistics 101 | AS 102 | SELECT user_id, 103 | date_trunc('hour', timestamp) AS HourBucket, 104 | AVG(num_steps)::float AS AvgNumStepsAcrossDevices, 105 | AVG(calories_burnt)::float AS AvgCaloriesBurnedAcrossDevices, 106 | AVG(miles_walked)::float AS AvgMilesWalkedAcrossDevices 107 | FROM silver_sensors 108 | GROUP BY user_id,date_trunc('hour', timestamp) 109 | ORDER BY HourBucket; 110 | 111 | -- Statement 4 112 | -- Truncate bronze batch once successfully loaded 113 | TRUNCATE TABLE bronze_sensors; 114 | """ 115 | 116 | # COMMAND ---------- 117 | 118 | serverless_client_t = DBSQLTransactionManager(warehouse_id = warehouse_id, mode="inferred_altered_tables") ## token=, host_name=verbose=True for print statements and other debugging messages 119 | 120 | # COMMAND ---------- 121 | 122 | # DBTITLE 1,Submitting the Multi Statement Transaction to Serverless SQL Warehouse 123 | """ 124 | PARAMS: 125 | warehouse_id --> Required, the SQL warehouse to submit statements 126 | mode -> selected_tables, inferred_altered_tables 127 | token --> optional, will try to get one for the user 128 | host_name --> optional, will try to infer same workspace url 129 | 130 | 131 | execute_sql_transaction params: 132 | return_type --> "message", "last_results". "message" will return status of query chain. "last_result" will run all statements and return the last results of the final query in the chain 133 | 134 | """ 135 | 136 | result_df = serverless_client_t.execute_dbsql_transaction(sql_string = sqlString) 137 | -------------------------------------------------------------------------------- /10-migrations/Using Delta Helpers Notebook Example.py: -------------------------------------------------------------------------------- 1 | # Databricks notebook source 2 | # MAGIC %md 3 | # MAGIC 4 | # MAGIC ## Using Delta Helpers Materialization Class. 5 | # MAGIC 6 | # MAGIC This class is for the purpose of materializing tables with delta onto cloud storage. This is often helpful for debugging and for simplifying longer, more complex query pipelines that would otherwise require highly nested CTE statements. Often times, the plan is simplified and performane is improved by removing the lazy evaluation and creating "checkpoint" steps with a materialized temp_db. Currently spark temp tables are NOT materialized, and thus not evaluated until called which is identical to a subquery. 7 | # MAGIC 8 | # MAGIC #### Initialization 9 | # MAGIC 10 | # MAGIC
   • deltaHelpers = DeltaHelpers(temp_root_path="dbfs:/delta_temp_db", db_name="delta_temp") - The parameters shown are the defaults and can be changed to a custom db name or s3 path 11 | # MAGIC 12 | # MAGIC #### There are 4 methods: 13 | # MAGIC 14 | # MAGIC <br>
   • createOrReplaceTempDeltaTable(df: DataFrame, table_name: String) - This creates or replaces a materialized Delta table in the default DBFS location or in your provided s3 path 15 | # MAGIC <br>
   • appendToTempDeltaTable(df: DataFrame, table_name: String) - This appends to an existing Delta table, or creates a new one if it does not exist, in DBFS or your provided s3 path 16 | # MAGIC <br>
   • removeTempDeltaTable(table_name) - This removes the Delta table from your delta_temp database session 17 | # MAGIC <br>
  • removeAllTempTablesForSession() - This truncates the initialized temp_db session. It does NOT run a DROP DATABASE command because the database can be global. It only removes the session path it creates. 18 | 19 | # COMMAND ---------- 20 | 21 | # MAGIC %pip install -r helperfunctions/requirements.txt 22 | 23 | # COMMAND ---------- 24 | 25 | # DBTITLE 1,Import 26 | from helperfunctions.deltahelpers import DeltaHelpers 27 | 28 | # COMMAND ---------- 29 | 30 | # DBTITLE 1,Initialize 31 | ## 2 Params [Optional - db_name, temp_root_path] 32 | deltaHelpers = DeltaHelpers() 33 | 34 | # COMMAND ---------- 35 | 36 | # DBTITLE 1,Create or Replace Temp Delta Table 37 | df = spark.read.format("json").load("/databricks-datasets/iot-stream/data-device/") 38 | 39 | ## Methods return the cached dataframe so you can continue on as needed without reloading source each time AND you can reference in SQL (better for foreachBatch) 40 | ## No longer lazy -- this calls an action 41 | df = deltaHelpers.createOrReplaceTempDeltaTable(df, "iot_data") 42 | 43 | ## Build ML Models 44 | 45 | display(df) 46 | 47 | # COMMAND ---------- 48 | 49 | # DBTITLE 1,Read cached table quickly in python or SQL 50 | # MAGIC %sql 51 | # MAGIC -- Read cahced table quickly in python or SQL 52 | # MAGIC SELECT * FROM delta_temp.iot_data 53 | 54 | # COMMAND ---------- 55 | 56 | df.count() 57 | 58 | # COMMAND ---------- 59 | 60 | # DBTITLE 1,Append to Temp Delta Table 61 | ## Data is 1,000,000 rows 62 | df_doubled = deltaHelpers.appendToTempDeltaTable(df, "iot_data") 63 | 64 | ## Be CAREFUL HERE! Since the function calls an action, it is NOT lazily evaluated. So running it multiple times can append the same data 65 | df_doubled.count() 66 | 67 | # COMMAND ---------- 68 | 69 | # MAGIC %sql 70 | # MAGIC 71 | # MAGIC DESCRIBE HISTORY delta_temp.iot_data 72 | 73 | # COMMAND ---------- 74 | 75 | # DBTITLE 1,Remove Temp Delta Table 76 | deltaHelpers.removeTempDeltaTable("iot_data") 77 | 78 | # COMMAND ---------- 79 | 80 | # MAGIC %sql 81 | # MAGIC 82 | # MAGIC SELECT * FROM delta_temp.iot_data 83 | 84 | # COMMAND ---------- 85 | 86 | # DBTITLE 1,Truncate Session 87 | ## Deletes all tables in session path but does not drop that delta_temp database 88 | deltaHelpers.removeAllTempTablesForSession() 89 | -------------------------------------------------------------------------------- /10-migrations/Using Delta Logger.py: -------------------------------------------------------------------------------- 1 | # Databricks notebook source 2 | # MAGIC %md 3 | # MAGIC 4 | # MAGIC ## Delta Logger - How to use 5 | # MAGIC 6 | # MAGIC Purpose: This notebook utilizes the delta logger library to automatically and easiy log general pipeline information all in one place for any data pipeline. 
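At a glance, the intended pattern is to open a run at the start of a pipeline, do the work, and then mark the run complete or failed. Below is a minimal sketch of that flow; it assumes the `DeltaLogger` methods used in the example cells later in this notebook (`create_run`, `complete_run`, `fail_run`, `get_most_recent_success_run_start_time`), and the table and process names are purely illustrative:

```python
from helperfunctions.deltalogger import DeltaLogger

# Initialize against a logging table (created automatically if it does not already exist).
# The table name and process name here are illustrative placeholders.
delta_logger = DeltaLogger(
    logger_table_name="main.my_schema.delta_logger",
    process_name="my_edw_pipeline",
)

# Optional: use the last successful run's start time as an incremental-load watermark.
watermark = delta_logger.get_most_recent_success_run_start_time()

# Open a new run, attaching any metadata you want stored with it.
delta_logger.create_run(metadata={"watermark": str(watermark)})

try:
    # ... run the actual pipeline steps here, e.g. incremental loads filtered on `watermark` ...
    delta_logger.complete_run()
except Exception:
    delta_logger.fail_run()
    raise
```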
7 | # MAGIC 8 | # MAGIC All logger tables have a standard default schema DDL: 9 | # MAGIC 10 | # MAGIC CREATE TABLE IF NOT EXISTS delta_logger ( 11 | # MAGIC run_id BIGINT GENERATED BY DEFAULT AS IDENTITY, 12 | # MAGIC process_name STRING NOT NULL, 13 | # MAGIC status STRING NOT NULL, -- RUNNING, FAIL, SUCCESS, STALE 14 | # MAGIC start_timestamp TIMESTAMP NOT NULL, 15 | # MAGIC end_timestamp TIMESTAMP, 16 | # MAGIC run_metadata STRING 17 | # MAGIC ) 18 | # MAGIC USING DELTA 19 | # MAGIC PARTITIONED BY (process_name); 20 | # MAGIC 21 | # MAGIC ## Initialize 22 | # MAGIC delta_logger = DeltaLogger(logger_table="main.iot_dashboard.pipeline_logs", 23 | # MAGIC process_name="iot_pipeline", 24 | # MAGIC logger_location=None) 25 | # MAGIC 26 | # MAGIC - logger_table is the logging table you want to store and reference. You can create and manage as many logger tables as you would like. If you initilize a DeltaLogger and that table does not exist, it will create it for you. 27 | # MAGIC - process_name OPTIONAL - Users can log events/runs and pass the process_name into each event, or they can simply define it at the session level this way. This will default to using the process_name passed in here for the whole session. It can be overridden anytime. 28 | # MAGIC - logger_location OPTIONAL - default = None. This is an override for specifying a specific object storage location for where the user wants the table to live. If not provided, it will be a managed table by default (recommended). 29 | # MAGIC 30 | # MAGIC ## Methods: 31 | # MAGIC 32 | # MAGIC For most methods: -- if process_name not provided, will use session. If cannot find process_name, will error. 33 | # MAGIC 34 | # MAGIC - create_logger() -- creates a logger table if not exists. This also optimizes the table since it is used in initlialization. 35 | # MAGIC - drop_logger() -- drops the logger table attached to the session 36 | # MAGIC - truncate_logger() -- clears an existing logger table 37 | # MAGIC - start_run(process_name: Optional, msg: Optional) 38 | # MAGIC - fail_run(process_name: Optional, msg: Optional) 39 | # MAGIC - complete_run(process_name: Optional, msg: Optional) 40 | # MAGIC - get_last_successful_run_id(proces_name: Optional) -- If no previous successful run, return -1 41 | # MAGIC - get_last_successful_run_timestamp(process_name: Optional) -- If no previous successful run for the process, defaults to "1900-01-01 00:00:00" 42 | # MAGIC - get_last_run_id(process_name: Optional) -- Get last run id regardless of status, if none return -1 43 | # MAGIC - get_last_run_timestamp(process_name: Optional) -- Get last run timestamp , If no previous run for the process, defaults to "1900-01-01 00:00:00" 44 | # MAGIC - get_last_failed_run_id(process_name: Optional) 45 | # MAGIC - get_last_failed_run_timestamp(prcoess_name: Optional) 46 | # MAGIC - clean_zombie_runs(process_name: Optional) -- Will mark any runs without and end timestamp in the running state to "STALE" and give them an end timestamp. This ONLY happens when a new run is created and the runs are < the max existing RUNNING run id 47 | # MAGIC - optimize_log(process_name:Optional, zorderCols=["end_timestamp", "start_timestamp", "run_id"]) -- Optimizes the underlying log table for a particular process name a ZORDERs by input col list 48 | # MAGIC - INTERNAL: _update_run_id(run_id, process_name:Optional, start_time=None, end_time=None, status=None, run_metadata=None) 49 | # MAGIC 50 | # MAGIC ### Limitations / Considerations 51 | # MAGIC 1. 
Currently supports 1 concurrent run per process_name for a given delta table. If you want to run concurrent pipelines, you need to create separate process names for them. This is meant to be a simple run and logging tracking solution for EDW pipelines. 52 | # MAGIC 53 | # MAGIC 2. User can pass in the fully qualified table name, use the spark session defaults, or pass in catalog and database overrides to the parameters. Pick one. 54 | # MAGIC 55 | 56 | # COMMAND ---------- 57 | 58 | # MAGIC %md 59 | # MAGIC 60 | # MAGIC ## Design Patterns 61 | # MAGIC 62 | # MAGIC 1. Use for Basic error handling, tracking of runs of various processes 63 | # MAGIC 2. Use for watermarking loading patterns. i.e. Creating a new run automatically pulls the most recent previous successful run and provide a "watermark" variable you can utilize for incremental loading. Use delta_logger.get_last_succes 64 | 65 | # COMMAND ---------- 66 | 67 | from helperfunctions.deltalogger import DeltaLogger 68 | 69 | # COMMAND ---------- 70 | 71 | # MAGIC %sql 72 | # MAGIC 73 | # MAGIC CREATE DATABASE IF NOT EXISTS main.iot_dashboard_logger; 74 | # MAGIC USE CATALOG main; 75 | # MAGIC USE DATABASE iot_dashboard_logger; 76 | 77 | # COMMAND ---------- 78 | 79 | delta_logger = DeltaLogger(logger_table_name="main.iot_dashboard_logger.delta_logger", process_name='iot_dashboard_pipeline') 80 | 81 | # COMMAND ---------- 82 | 83 | delta_logger.get_most_recent_success_run_start_time() 84 | 85 | # COMMAND ---------- 86 | 87 | delta_logger.create_run(metadata={"data_quality_stuff": "oh dear"}) 88 | 89 | # COMMAND ---------- 90 | 91 | print(delta_logger.active_run_id) 92 | print(delta_logger.active_run_end_ts) 93 | print(delta_logger.active_run_start_ts) 94 | print(delta_logger.active_run_status) 95 | print(delta_logger.active_run_metadata) 96 | 97 | # COMMAND ---------- 98 | 99 | # DBTITLE 1,Complete and Fail Active Runs 100 | delta_logger.complete_run() 101 | #delta_logger.fail_run() 102 | 103 | # COMMAND ---------- 104 | 105 | # MAGIC %sql 106 | # MAGIC 107 | # MAGIC SELECT * FROM main.iot_dashboard_logger.delta_logger 108 | -------------------------------------------------------------------------------- /10-migrations/Using Delta Merge Helpers Example.py: -------------------------------------------------------------------------------- 1 | # Databricks notebook source 2 | # MAGIC %md 3 | # MAGIC 4 | # MAGIC ## Delta Merge Helpers: 5 | # MAGIC 6 | # MAGIC This is class with a set of static methods that help the user easily perform retry statements on operataions that may be cause a lot of conflicting transactions (usually in MERGE / UPDATE statements). 7 | # MAGIC 8 | # MAGIC
  • 1 Method: retrySqlStatement(spark: SparkSession, operation_name: String, sqlStatement: String) - the spark param is your existing Spark session, the operation name is simply an operation to identify your transaction, the sqlStatement parameter is the SQL statement you want to retry. 9 | 10 | # COMMAND ---------- 11 | 12 | # MAGIC %pip install -r helperfunctions/requirements.txt 13 | 14 | # COMMAND ---------- 15 | 16 | from helperfunctions.deltahelpers import DeltaMergeHelpers 17 | 18 | # COMMAND ---------- 19 | 20 | 21 | sql_statement = """ 22 | MERGE INTO iot_dashboard.silver_sensors AS target 23 | USING (SELECT Id::integer, 24 | device_id::integer, 25 | user_id::integer, 26 | calories_burnt::decimal, 27 | miles_walked::decimal, 28 | num_steps::decimal, 29 | timestamp::timestamp, 30 | value::string 31 | FROM iot_dashboard.bronze_sensors) AS source 32 | ON source.Id = target.Id 33 | AND source.user_id = target.user_id 34 | AND source.device_id = target.device_id 35 | WHEN MATCHED THEN UPDATE SET 36 | target.calories_burnt = source.calories_burnt, 37 | target.miles_walked = source.miles_walked, 38 | target.num_steps = source.num_steps, 39 | target.timestamp = source.timestamp 40 | WHEN NOT MATCHED THEN INSERT *; 41 | """ 42 | 43 | DeltaMergeHelpers.retrySqlStatement(spark, "merge_sensors", sqlStatement=sql_statement) 44 | -------------------------------------------------------------------------------- /10-migrations/Using Streaming Tables and MV Orchestrator.py: -------------------------------------------------------------------------------- 1 | # Databricks notebook source 2 | # MAGIC %md 3 | # MAGIC 4 | # MAGIC ## This library helps orchestrate Streaming tables in conjunction with other tables that may depend on synchronous updated from the streaming table for classical EDW loading patterns 5 | # MAGIC 6 | # MAGIC ## Assumptions / Best Practices 7 | # MAGIC 8 | # MAGIC 1. Assumes ST is NOT SCHEDULED in the CREATE STATEMENT (externally orchestrated) (that is a different loading pattern that is not as common in classical EDW) 9 | # MAGIC 10 | # MAGIC 2. Assumes that one or many pipelines are dependent upon the successful CREATe OR REFRESH of the streaming table, so this library will simply block the tasks from moving the job onto the rest of the DAG to ensure the downstream tasks actually read from the table when it finishes updated 11 | # MAGIC 12 | # MAGIC 3. This works best with a single node "Driver" notebook loading sql files from Git similar to how airflow would orchestrate locally. The single job node would then call spark.sql() to run the CREATE OR REFRESH and then you arent needing a warehouse and a DLT pipeline in the job for streaming refreshes. 13 | 14 | # COMMAND ---------- 15 | 16 | # MAGIC %md 17 | # MAGIC 18 | # MAGIC ## Library Steps 19 | # MAGIC 20 | # MAGIC ### This library only takes in 1 sql statement at a time, this is because if there are multiple and only some pass and others fail, then it would not be correct failing or passing the whole statement. Each ST/MV must be done separately. This can be done by simply calling the static methods multiple times. 21 | # MAGIC 22 | # MAGIC 1. Parse Streaming Table / MV Create / Refresh commmand 23 | # MAGIC 2. Identify ST / MV table(s) for that command 24 | # MAGIC 3. Run SQL command - CREATE / REFRESH ST/MV 25 | # MAGIC 4. DESCRIBE DETAIL to get pipelines.pipelineId metadata 26 | # MAGIC 5. Perform REST API Call to check for in-progress Refreshes 27 | # MAGIC 6. 
Poll and block statement chain from "finishing" until all pipelines identified are in either "PASS/FAIL" 28 | # MAGIC 7. If statement PASSES - then complete and return 29 | # MAGIC 8. If statement FAILS - then throw REFRESH FAIL exception 30 | 31 | # COMMAND ---------- 32 | 33 | from helperfunctions.stmvorchestrator import orchestrate_stmv_statement 34 | 35 | # COMMAND ---------- 36 | 37 | sql_statement = """ 38 | CREATE OR REFRESH STREAMING TABLE main.iot_dashboard.streaming_tables_raw_data 39 | AS SELECT 40 | id::bigint AS Id, 41 | device_id::integer AS device_id, 42 | user_id::integer AS user_id, 43 | calories_burnt::decimal(10,2) AS calories_burnt, 44 | miles_walked::decimal(10,2) AS miles_walked, 45 | num_steps::decimal(10,2) AS num_steps, 46 | timestamp::timestamp AS timestamp, 47 | value AS value -- This is a JSON object 48 | FROM STREAM read_files('dbfs:/databricks-datasets/iot-stream/data-device/*.json*', 49 | format => 'json', 50 | maxFilesPerTrigger => 12 -- what does this do when you 51 | ) 52 | """ 53 | 54 | # COMMAND ---------- 55 | 56 | orchestrate_stmv_statement(spark, dbutils, sql_statement=sql_statement) 57 | -------------------------------------------------------------------------------- /10-migrations/helperfunctions/.DS_Store: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/AbePabbathi/lakehouse-tacklebox/9bbbf0a7181193091d887ed57e7394a466c369cf/10-migrations/helperfunctions/.DS_Store -------------------------------------------------------------------------------- /10-migrations/helperfunctions/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/AbePabbathi/lakehouse-tacklebox/9bbbf0a7181193091d887ed57e7394a466c369cf/10-migrations/helperfunctions/__init__.py -------------------------------------------------------------------------------- /10-migrations/helperfunctions/build/lib/dbsqltransactions.py: -------------------------------------------------------------------------------- 1 | from helperfunctions.dbsqlclient import ServerlessClient 2 | from helperfunctions.transactions import Transaction, TransactionException, AlteredTableParser 3 | 4 | 5 | class DBSQLTransactionManager(Transaction): 6 | 7 | def __init__(self, warehouse_id, mode="selected_tables", uc_default=False, host_name=None, token=None): 8 | 9 | super().__init__(mode=mode, uc_default=uc_default) 10 | self.host_name = host_name 11 | self.token = token 12 | self.warehouse_id = warehouse_id 13 | 14 | return 15 | 16 | 17 | ### Execute multi statment SQL, now we can implement this easier for Serverless or not Serverless 18 | def execute_dbsql_transaction(self, sql_string, tables_to_manage=[], force=False, return_type="message"): 19 | 20 | ## return_type = message (returns status messages), last_result (returns the result of the last command in the sql chain) 21 | ## If force= True, then if transaction manager fails to find tables, then it runs the SQL anyways 22 | ## You do not NEED to run SQL this way to rollback a transaction, 23 | ## but it automatically breaks up multiple statements in one SQL file into a series of spark.sql() commands 24 | 25 | serverless_client = ServerlessClient(warehouse_id = self.warehouse_id, token=self.token, host_name=self.host_name) ## token=, host_name=verbose=True for print statements and other debugging messages 26 | 27 | result_df = None 28 | stmts = [i for i in sql_string.split(";") if len(i) >0] 29 | 30 | ## Save to class state 31 | 
self.raw_sql_statement = sql_string 32 | self.sql_statement_list = stmts 33 | 34 | success_tables = False 35 | 36 | try: 37 | self.begin_dynamic_transaction(tables_to_manage=tables_to_manage) 38 | 39 | success_tables = True 40 | 41 | except Exception as e: 42 | print(f"FAILED: failed to acquire tables with errors: {str(e)}") 43 | 44 | ## If succeeded or force = True, then run the SQL 45 | if success_tables or force: 46 | if success_tables == False and force == True: 47 | warnings.warn("WARNING: Failed to acquire tables but force flag = True, so SQL statement will run anyways") 48 | 49 | ## Run the Transaction Logic with Serverless Client 50 | try: 51 | print(f"TRANSACTION IN PROGRESS ...Running multi statement SQL transaction now\n") 52 | 53 | ###!! Since the DBSQL execution API does not understand multiple statements, we need to submit the USE commands in the correct order manually. This is done with the AlteredTableParser() 54 | 55 | ### Get the USE session tree and submit SQL statements according to that tree 56 | parser = AlteredTableParser() 57 | parser.parse_sql_chain_for_altered_tables(self.sql_statement_list) 58 | use_sessions = parser.get_use_session_tree() 59 | 60 | for i in use_sessions: 61 | 62 | session_catalog = i.get("session_cat") 63 | session_db = i.get("session_db") 64 | use_session_statemnts = i.get("sql_statements") 65 | 66 | for s in use_session_statemnts: 67 | single_st = s.get("statement") 68 | 69 | if single_st is not None: 70 | 71 | ## Submit the single command with the session USE scoped commands from the Parser Tree 72 | ## OPTION 1: return status message 73 | if return_type == "message": 74 | 75 | result_df = serverless_client.submit_multiple_sql_commands(sql_statements=single_st, use_catalog=session_catalog, use_schema=session_db) 76 | 77 | elif return_type == "last_result": 78 | 79 | result_df = serverless_client.submit_multiple_sql_commands_last_results(sql_statements=single_st, use_catalog=session_catalog, use_schema=session_db) 80 | 81 | else: 82 | result_df = None 83 | print("No run mode selected, select 'message' or 'last_results'") 84 | 85 | 86 | print(f"\n TRANSACTION SUCCEEDED: Multi Statement SQL Transaction Successfull! Updating Snapshot\n ") 87 | self.commit_transaction() 88 | 89 | 90 | ## Return results after committing sucesss outside of the for loop 91 | return result_df 92 | 93 | 94 | except Exception as e: 95 | print(f"\n TRANSACTION FAILED to run all statements... 
ROLLING BACK \n") 96 | self.rollback_transaction() 97 | print(f"Rollback successful!") 98 | 99 | raise(e) 100 | 101 | else: 102 | 103 | raise(TransactionException(message="Failed to acquire tables and force=False, not running process.", errors="Failed to acquire tables and force=False, not running process.")) 104 | -------------------------------------------------------------------------------- /10-migrations/helperfunctions/build/lib/stmvorchestrator.py: -------------------------------------------------------------------------------- 1 | import re 2 | import requests 3 | import time 4 | 5 | 6 | ## Function to block Create or REFRESH of ST or MV statements to wait until it is finishing before moving to next task 7 | 8 | ## Similar to the awaitTermination() method in a streaming pipeline 9 | 10 | ## Only supports 1 sql statement at a time on purpose 11 | 12 | def orchestrate_stmv_statement(spark, dbutils, sql_statement, host_name=None, token=None): 13 | 14 | host_name = None 15 | token = None 16 | 17 | ## Infer hostname from same workspace 18 | if host_name is not None: 19 | host_name = host_name 20 | 21 | else: 22 | host_name = dbutils.notebook.entry_point.getDbutils().notebook().getContext().apiUrl().getOrElse(None).replace("https://", "") 23 | 24 | ## Automatically get user token if none provided 25 | if token is not None: 26 | token = token 27 | else: 28 | token = dbutils.notebook.entry_point.getDbutils().notebook().getContext().apiToken().getOrElse(None) 29 | 30 | 31 | ## Get current catalogs/schemas from outside USE commands 32 | current_schema = spark.sql("SELECT current_schema()").collect()[0][0] 33 | current_catalog = spark.sql("SELECT current_catalog()").collect()[0][0] 34 | 35 | if current_catalog == 'spark_catalog': 36 | current_catalog = 'hive_metastore' 37 | 38 | 39 | ## Check for multiple statements, if more than 1, than raise too many statement exception 40 | all_statements = re.split(";", sql_statement) 41 | 42 | if (len(all_statements) > 1): 43 | print("WARNING: There are more than one statements in this sql command, this function will just pick and try to run the first statement and ignore the rest.") 44 | 45 | 46 | sql_statement = all_statements[0] 47 | 48 | 49 | try: 50 | 51 | ## Get table/mv that is being refreshed 52 | table_match = re.split("CREATE OR REFRESH STREAMING TABLE\s|REFRESH STREAMING TABLE\s|CREATE OR REFRESH MATERIALIZED VIEW\s|REFRESH MATERIALIZED VIEW\s", sql_statement.upper())[1].split(" ")[0] 53 | 54 | except Exception as e: 55 | 56 | ## If it was not able to find a REFRESH statement, ignore and unblock the operation and move on (i.e. if its not an ST/MV or if its just a CREATE) 57 | 58 | print("WARNING: No ST / MV Refresh statements found. Moving on.") 59 | return 60 | 61 | ## If ST/MV refresh was found 62 | 63 | if (len(table_match.split(".")) == 3): 64 | ## fully qualified, dont change it 65 | pass 66 | elif (len(table_match.split(".")) == 2): 67 | table_match = current_catalog + "." + table_match 68 | 69 | elif(len(table_match.split(".")) == 1): 70 | table_match = current_catalog + "." + current_schema + "." 
+ table_match 71 | 72 | 73 | ## Step 2 - Execute SQL Statement 74 | spark.sql(sql_statement) 75 | 76 | 77 | ## Step 3 - Get pipeline Id for table 78 | active_pipeline_id = (spark.sql(f"DESCRIBE DETAIL {table_match}") 79 | .selectExpr("properties").take(1)[0][0] 80 | .get("pipelines.pipelineId") 81 | ) 82 | 83 | ## Poll for pipeline status 84 | 85 | 86 | current_state = "UNKNOWN" 87 | 88 | ## Pipeline is active 89 | while current_state not in ("FAILED", "IDLE"): 90 | 91 | url = "https://" + host_name + "/api/2.0/pipelines/" 92 | headers_auth = {"Authorization":f"Bearer {token}"} 93 | 94 | check_status_resp = requests.get(url + active_pipeline_id , headers=headers_auth).json() 95 | 96 | current_state = check_status_resp.get("state") 97 | 98 | if current_state == "IDLE": 99 | print(f"STMV Pipeline {active_pipeline_id} completed! \n Moving on") 100 | return 101 | 102 | elif current_state == "FAILED": 103 | raise(BaseException(f"PIPELINE {active_pipeline_id} FAILED!")) 104 | 105 | 106 | else: 107 | ## Wait before polling again 108 | ## TODO: Do exponential backoff 109 | time.sleep(5) 110 | 111 | -------------------------------------------------------------------------------- /10-migrations/helperfunctions/dbsqltransactions.py: -------------------------------------------------------------------------------- 1 | from helperfunctions.dbsqlclient import ServerlessClient 2 | from helperfunctions.transactions import Transaction, TransactionException, AlteredTableParser 3 | 4 | 5 | class DBSQLTransactionManager(Transaction): 6 | 7 | def __init__(self, warehouse_id, mode="selected_tables", uc_default=False, host_name=None, token=None): 8 | 9 | super().__init__(mode=mode, uc_default=uc_default) 10 | self.host_name = host_name 11 | self.token = token 12 | self.warehouse_id = warehouse_id 13 | 14 | return 15 | 16 | 17 | ### Execute multi statment SQL, now we can implement this easier for Serverless or not Serverless 18 | def execute_dbsql_transaction(self, sql_string, tables_to_manage=[], force=False, return_type="message"): 19 | 20 | ## return_type = message (returns status messages), last_result (returns the result of the last command in the sql chain) 21 | ## If force= True, then if transaction manager fails to find tables, then it runs the SQL anyways 22 | ## You do not NEED to run SQL this way to rollback a transaction, 23 | ## but it automatically breaks up multiple statements in one SQL file into a series of spark.sql() commands 24 | 25 | serverless_client = ServerlessClient(warehouse_id = self.warehouse_id, token=self.token, host_name=self.host_name) ## token=, host_name=verbose=True for print statements and other debugging messages 26 | 27 | result_df = None 28 | stmts = [i for i in sql_string.split(";") if len(i) >0] 29 | 30 | ## Save to class state 31 | self.raw_sql_statement = sql_string 32 | self.sql_statement_list = stmts 33 | 34 | success_tables = False 35 | 36 | try: 37 | self.begin_dynamic_transaction(tables_to_manage=tables_to_manage) 38 | 39 | success_tables = True 40 | 41 | except Exception as e: 42 | print(f"FAILED: failed to acquire tables with errors: {str(e)}") 43 | 44 | ## If succeeded or force = True, then run the SQL 45 | if success_tables or force: 46 | if success_tables == False and force == True: 47 | warnings.warn("WARNING: Failed to acquire tables but force flag = True, so SQL statement will run anyways") 48 | 49 | ## Run the Transaction Logic with Serverless Client 50 | try: 51 | print(f"TRANSACTION IN PROGRESS ...Running multi statement SQL transaction now\n") 52 | 53 | 
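                ## NOTE: inferred from how the tree is consumed in the loop below, get_use_session_tree()
                ## returns a list of dicts shaped roughly like:
                ##   {"session_cat": "<catalog or None>",
                ##    "session_db": "<schema or None>",
                ##    "sql_statements": [{"statement": "<single SQL command>"}, ...]}
                ## Each statement is then submitted with its session's USE CATALOG / USE SCHEMA scope.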
###!! Since the DBSQL execution API does not understand multiple statements, we need to submit the USE commands in the correct order manually. This is done with the AlteredTableParser() 54 | 55 | ### Get the USE session tree and submit SQL statements according to that tree 56 | parser = AlteredTableParser() 57 | parser.parse_sql_chain_for_altered_tables(self.sql_statement_list) 58 | use_sessions = parser.get_use_session_tree() 59 | 60 | for i in use_sessions: 61 | 62 | session_catalog = i.get("session_cat") 63 | session_db = i.get("session_db") 64 | use_session_statemnts = i.get("sql_statements") 65 | 66 | for s in use_session_statemnts: 67 | single_st = s.get("statement") 68 | 69 | if single_st is not None: 70 | 71 | ## Submit the single command with the session USE scoped commands from the Parser Tree 72 | ## OPTION 1: return status message 73 | if return_type == "message": 74 | 75 | result_df = serverless_client.submit_multiple_sql_commands(sql_statements=single_st, use_catalog=session_catalog, use_schema=session_db) 76 | 77 | elif return_type == "last_result": 78 | 79 | result_df = serverless_client.submit_multiple_sql_commands_last_results(sql_statements=single_st, use_catalog=session_catalog, use_schema=session_db) 80 | 81 | else: 82 | result_df = None 83 | print("No run mode selected, select 'message' or 'last_results'") 84 | 85 | 86 | print(f"\n TRANSACTION SUCCEEDED: Multi Statement SQL Transaction Successfull! Updating Snapshot\n ") 87 | self.commit_transaction() 88 | 89 | 90 | ## Return results after committing sucesss outside of the for loop 91 | return result_df 92 | 93 | 94 | except Exception as e: 95 | print(f"\n TRANSACTION FAILED to run all statements... ROLLING BACK \n") 96 | self.rollback_transaction() 97 | print(f"Rollback successful!") 98 | 99 | raise(e) 100 | 101 | else: 102 | 103 | raise(TransactionException(message="Failed to acquire tables and force=False, not running process.", errors="Failed to acquire tables and force=False, not running process.")) 104 | -------------------------------------------------------------------------------- /10-migrations/helperfunctions/deltahelpers.py: -------------------------------------------------------------------------------- 1 | import json 2 | import requests 3 | import re 4 | import os 5 | from datetime import datetime, timedelta 6 | import uuid 7 | from pyspark.sql import SparkSession 8 | from pyspark.sql.functions import col, count, lit, max 9 | from pyspark.sql.types import * 10 | 11 | 12 | ### Helps Materialize temp tables during ETL pipelines 13 | class DeltaHelpers(): 14 | 15 | 16 | def __init__(self, db_name="delta_temp", temp_root_path="dbfs:/delta_temp_db"): 17 | 18 | self.spark = SparkSession.getActiveSession() 19 | self.db_name = db_name 20 | self.temp_root_path = temp_root_path 21 | 22 | self.dbutils = None 23 | 24 | #if self.spark.conf.get("spark.databricks.service.client.enabled") == "true": 25 | try: 26 | from pyspark.dbutils import DBUtils 27 | self.dbutils = DBUtils(self.spark) 28 | 29 | except: 30 | 31 | import IPython 32 | self.dbutils = IPython.get_ipython().user_ns["dbutils"] 33 | 34 | self.session_id =self.dbutils.notebook.entry_point.getDbutils().notebook().getContext().notebookPath().get() 35 | self.temp_env = self.temp_root_path + self.session_id 36 | self.spark.sql(f"""DROP DATABASE IF EXISTS {self.db_name} CASCADE;""") 37 | self.spark.sql(f"""CREATE DATABASE IF NOT EXISTS {self.db_name} LOCATION '{self.temp_env}'; """) 38 | print(f"Initializing Root Temp Environment: {self.db_name} at 
{self.temp_env}") 39 | 40 | return 41 | 42 | 43 | def createOrReplaceTempDeltaTable(self, df, table_name): 44 | 45 | tblObj = {} 46 | new_table_id = table_name 47 | write_path = self.temp_env + new_table_id 48 | 49 | self.spark.sql(f"DROP TABLE IF EXISTS {self.db_name}.{new_table_id}") 50 | self.dbutils.fs.rm(write_path, recurse=True) 51 | 52 | df.write.format("delta").mode("overwrite").option("path", write_path).saveAsTable(f"{self.db_name}.{new_table_id}") 53 | 54 | persisted_df = self.spark.read.format("delta").load(write_path) 55 | return persisted_df 56 | 57 | def appendToTempDeltaTable(self, df, table_name): 58 | 59 | tblObj = {} 60 | new_table_id = table_name 61 | write_path = self.temp_env + new_table_id 62 | 63 | df.write.format("delta").mode("append").option("path", write_path).saveAsTable(f"{self.db_name}.{new_table_id}") 64 | 65 | persisted_df = self.spark.read.format("delta").load(write_path) 66 | return persisted_df 67 | 68 | def removeTempDeltaTable(self, table_name): 69 | 70 | table_path = self.temp_env + table_name 71 | self.dbutils.fs.rm(table_path, recurse=True) 72 | self.spark.sql(f"""DROP TABLE IF EXISTS {self.db_name}.{table_name}""") 73 | 74 | print(f"Temp Table: {table_name} has been deleted.") 75 | return 76 | 77 | def removeAllTempTablesForSession(self): 78 | 79 | self.dbutils.fs.rm(self.temp_env, recurse=True) 80 | ##spark.sql(f"""DROP DATABASE IF EXISTS {self.db_name} CASCADE""") This temp db name COULD be global, never delete without separate method 81 | print(f"All temp tables in the session have been removed: {self.temp_env}") 82 | return 83 | 84 | 85 | 86 | class SchemaHelpers(): 87 | 88 | def __init__(): 89 | import json 90 | return 91 | 92 | @staticmethod 93 | def getDDLString(structObj): 94 | import json 95 | ddl = [] 96 | for c in json.loads(structObj.json()).get("fields"): 97 | 98 | name = c.get("name") 99 | dType = c.get("type") 100 | ddl.append(f"{name}::{dType} AS {name}") 101 | 102 | final_ddl = ", ".join(ddl) 103 | return final_ddl 104 | 105 | @staticmethod 106 | def getDDLList(structObj): 107 | import json 108 | ddl = [] 109 | for c in json.loads(structObj.json()).get("fields"): 110 | 111 | name = c.get("name") 112 | dType = c.get("type") 113 | ddl.append(f"{name}::{dType} AS {name}") 114 | 115 | return ddl 116 | 117 | @staticmethod 118 | def getFlattenedSqlExprFromValueColumn(structObj): 119 | import json 120 | ddl = [] 121 | for c in json.loads(structObj.json()).get("fields"): 122 | 123 | name = c.get("name") 124 | dType = c.get("type") 125 | ddl.append(f"value:{name}::{dType} AS {name}") 126 | 127 | return ddl 128 | 129 | 130 | 131 | 132 | class DeltaMergeHelpers(): 133 | 134 | def __init__(self): 135 | return 136 | 137 | @staticmethod 138 | def retrySqlStatement(spark, operationName, sqlStatement, maxRetries = 10, maxSecondsBetweenAttempts=60): 139 | 140 | import time 141 | maxRetries = maxRetries 142 | numRetries = 0 143 | maxWaitTime = maxSecondsBetweenAttempts 144 | ### Does not check for existence, ensure that happens before merge 145 | 146 | while numRetries <= maxRetries: 147 | 148 | try: 149 | 150 | print(f"SQL Statement Attempt for {operationName} #{numRetries + 1}...") 151 | 152 | spark.sql(sqlStatement) 153 | 154 | print(f"SQL Statement Attempt for {operationName} #{numRetries + 1} Successful!") 155 | break 156 | 157 | except Exception as e: 158 | error_msg = str(e) 159 | 160 | print(f"Failed SQL Statment Attmpet for {operationName} #{numRetries} with error: {error_msg}") 161 | 162 | numRetries += 1 163 | if numRetries > maxRetries: 
164 | break 165 | 166 | waitTime = waitTime = 2**(numRetries-1) ## Wait longer up to max wait time for failed operations 167 | 168 | if waitTime > maxWaitTime: 169 | waitTime = maxWaitTime 170 | 171 | print(f"Waiting {waitTime} seconds before next attempt on {operationName}...") 172 | time.sleep(waitTime) -------------------------------------------------------------------------------- /10-migrations/helperfunctions/dist/helperfunctions-1.0.0-py3-none-any.whl: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/AbePabbathi/lakehouse-tacklebox/9bbbf0a7181193091d887ed57e7394a466c369cf/10-migrations/helperfunctions/dist/helperfunctions-1.0.0-py3-none-any.whl -------------------------------------------------------------------------------- /10-migrations/helperfunctions/helperfunctions.egg-info/PKG-INFO: -------------------------------------------------------------------------------- 1 | Metadata-Version: 2.1 2 | Name: helperfunctions 3 | Version: 1.0.0 4 | Summary: Lakehouse Warehousing and Delta Helper Functions 5 | Author: Cody Austin Davis @Databricks, Inc. 6 | Author-email: cody.davis@databricks.com 7 | Requires-Dist: sqlparse 8 | Requires-Dist: sql_metadata 9 | Requires-Dist: sqlglot 10 | Requires-Dist: pyarrow 11 | -------------------------------------------------------------------------------- /10-migrations/helperfunctions/helperfunctions.egg-info/SOURCES.txt: -------------------------------------------------------------------------------- 1 | datavalidator.py 2 | dbsqlclient.py 3 | dbsqltransactions.py 4 | deltahelpers.py 5 | deltalogger.py 6 | redshiftchecker.py 7 | setup.py 8 | stmvorchestrator.py 9 | transactions.py 10 | helperfunctions.egg-info/PKG-INFO 11 | helperfunctions.egg-info/SOURCES.txt 12 | helperfunctions.egg-info/dependency_links.txt 13 | helperfunctions.egg-info/requires.txt 14 | helperfunctions.egg-info/top_level.txt -------------------------------------------------------------------------------- /10-migrations/helperfunctions/helperfunctions.egg-info/dependency_links.txt: -------------------------------------------------------------------------------- 1 | 2 | -------------------------------------------------------------------------------- /10-migrations/helperfunctions/helperfunctions.egg-info/requires.txt: -------------------------------------------------------------------------------- 1 | sqlparse 2 | sql_metadata 3 | sqlglot 4 | pyarrow 5 | -------------------------------------------------------------------------------- /10-migrations/helperfunctions/helperfunctions.egg-info/top_level.txt: -------------------------------------------------------------------------------- 1 | datavalidator 2 | dbsqlclient 3 | dbsqltransactions 4 | deltahelpers 5 | deltalogger 6 | redshiftchecker 7 | stmvorchestrator 8 | transactions 9 | -------------------------------------------------------------------------------- /10-migrations/helperfunctions/requirements.txt: -------------------------------------------------------------------------------- 1 | sqlglot 2 | pyarrow -------------------------------------------------------------------------------- /10-migrations/helperfunctions/setup.py: -------------------------------------------------------------------------------- 1 | from setuptools import setup 2 | 3 | setup( 4 | name='helperfunctions', 5 | version='1.0.0', 6 | description='Lakehouse Warehousing and Delta Helper Functions', 7 | author='Cody Austin Davis @Databricks, Inc.', 8 | author_email='cody.davis@databricks.com', 
9 | py_modules=['datavalidator', 10 | 'dbsqltransactions', 11 | 'stmvorchestrator', 12 | 'redshiftchecker', 13 | 'dbsqlclient', 14 | 'transactions', 15 | 'deltalogger', 16 | 'deltahelpers'], 17 | install_requires=[ 18 | 'sqlparse', 19 | 'sql_metadata', 20 | 'sqlglot', 21 | 'pyarrow' 22 | ] 23 | ) -------------------------------------------------------------------------------- /10-migrations/helperfunctions/stmvorchestrator.py: -------------------------------------------------------------------------------- 1 | import re 2 | import requests 3 | import time 4 | 5 | 6 | ## Function to block Create or REFRESH of ST or MV statements to wait until it is finishing before moving to next task 7 | 8 | ## Similar to the awaitTermination() method in a streaming pipeline 9 | 10 | ## Only supports 1 sql statement at a time on purpose 11 | 12 | def orchestrate_stmv_statement(spark, dbutils, sql_statement, host_name=None, token=None): 13 | 14 | host_name = None 15 | token = None 16 | 17 | ## Infer hostname from same workspace 18 | if host_name is not None: 19 | host_name = host_name 20 | 21 | else: 22 | host_name = dbutils.notebook.entry_point.getDbutils().notebook().getContext().apiUrl().getOrElse(None).replace("https://", "") 23 | 24 | ## Automatically get user token if none provided 25 | if token is not None: 26 | token = token 27 | else: 28 | token = dbutils.notebook.entry_point.getDbutils().notebook().getContext().apiToken().getOrElse(None) 29 | 30 | 31 | ## Get current catalogs/schemas from outside USE commands 32 | current_schema = spark.sql("SELECT current_schema()").collect()[0][0] 33 | current_catalog = spark.sql("SELECT current_catalog()").collect()[0][0] 34 | 35 | if current_catalog == 'spark_catalog': 36 | current_catalog = 'hive_metastore' 37 | 38 | 39 | ## Check for multiple statements, if more than 1, than raise too many statement exception 40 | all_statements = re.split(";", sql_statement) 41 | 42 | if (len(all_statements) > 1): 43 | print("WARNING: There are more than one statements in this sql command, this function will just pick and try to run the first statement and ignore the rest.") 44 | 45 | 46 | sql_statement = all_statements[0] 47 | 48 | 49 | try: 50 | 51 | ## Get table/mv that is being refreshed 52 | table_match = re.split("CREATE OR REFRESH STREAMING TABLE\s|REFRESH STREAMING TABLE\s|CREATE OR REFRESH MATERIALIZED VIEW\s|REFRESH MATERIALIZED VIEW\s", sql_statement.upper())[1].split(" ")[0] 53 | 54 | except Exception as e: 55 | 56 | ## If it was not able to find a REFRESH statement, ignore and unblock the operation and move on (i.e. if its not an ST/MV or if its just a CREATE) 57 | 58 | print("WARNING: No ST / MV Refresh statements found. Moving on.") 59 | return 60 | 61 | ## If ST/MV refresh was found 62 | 63 | if (len(table_match.split(".")) == 3): 64 | ## fully qualified, dont change it 65 | pass 66 | elif (len(table_match.split(".")) == 2): 67 | table_match = current_catalog + "." + table_match 68 | 69 | elif(len(table_match.split(".")) == 1): 70 | table_match = current_catalog + "." + current_schema + "." 
+ table_match 71 | 72 | 73 | ## Step 2 - Execute SQL Statement 74 | spark.sql(sql_statement) 75 | 76 | 77 | ## Step 3 - Get pipeline Id for table 78 | active_pipeline_id = (spark.sql(f"DESCRIBE DETAIL {table_match}") 79 | .selectExpr("properties").take(1)[0][0] 80 | .get("pipelines.pipelineId") 81 | ) 82 | 83 | ## Poll for pipeline status 84 | 85 | 86 | current_state = "UNKNOWN" 87 | 88 | ## Pipeline is active 89 | while current_state not in ("FAILED", "IDLE"): 90 | 91 | url = "https://" + host_name + "/api/2.0/pipelines/" 92 | headers_auth = {"Authorization":f"Bearer {token}"} 93 | 94 | check_status_resp = requests.get(url + active_pipeline_id , headers=headers_auth).json() 95 | 96 | current_state = check_status_resp.get("state") 97 | 98 | if current_state == "IDLE": 99 | print(f"STMV Pipeline {active_pipeline_id} completed! \n Moving on") 100 | return 101 | 102 | elif current_state == "FAILED": 103 | raise(BaseException(f"PIPELINE {active_pipeline_id} FAILED!")) 104 | 105 | 106 | else: 107 | ## Wait before polling again 108 | ## TODO: Do exponential backoff 109 | time.sleep(5) 110 | 111 | -------------------------------------------------------------------------------- /20-operational-excellence/README.md: -------------------------------------------------------------------------------- 1 | #### Operational Excellence 2 | 3 | This section consists of tools that will help Infrastructure Administrators automate Lakehouse management and operations, eg. data pipelines, workflows, CI/CD processes, IaaS 4 | 5 | # 6 | 1. [Terraform Examples](https://github.com/databricks/terraform-databricks-examples) -------------------------------------------------------------------------------- /30-performance/README.md: -------------------------------------------------------------------------------- 1 | #### Performance Optimizations 2 | 3 | This section consists of tools that will help Developers and Administrators optimize the performance of Lakehouse processes. 4 | 5 | 6 | 1. [Delta Optimizer](https://github.com/AbePabbathi/lakehouse-tacklebox/tree/main/30-performance/delta-optimizer) 7 | 2. [TPC-DS Runner](https://github.com/AbePabbathi/lakehouse-tacklebox/tree/main/30-performance/TPC-DS%20Runner) 8 | 3. [Query Replay Tool](https://github.com/AbePabbathi/lakehouse-tacklebox/tree/main/30-performance/dbsql-query-replay-tool) -------------------------------------------------------------------------------- /30-performance/TPC-DS Runner/CONTRIBUTING.md: -------------------------------------------------------------------------------- 1 | # Contributing 2 | Contributions are welcome! Feel free to file an issue/PR or reach out to michael.berk@databricks.com. 3 | 4 | ### Potential Contributions (in order of importance) 5 | * Support UC 6 | * Support writing raw queries to UC volumes instead of DBFS 7 | * Modify existing data write to handle conversion to UC-managed tables 8 | * Scope warehouse concurrency limitations 9 | * Improve beaker performance calculations - [issue](https://github.com/goodwillpunning/beaker/issues/24) 10 | * Look to improve data write performance. Some options include: 11 | * Improve threading of writes. 12 | * Document baseline TPC-DS benchmarking runtimes - [template](https://github.com/databricks/spark-sql-perf/blob/master/src/main/notebooks/tpcds_datagen.scala) 13 | * Allow for running create data OR run benchmarking. 
Currently these methods are coupled and it prevent's rerunning different warehouse benchmarks against the same data source 14 | * Add dashboarding and further analysis using Nishant's tool(s) 15 | * Determine if spark-sql-perf supports latest LTS DBR version or if we need to hardcode 12.2 16 | * Make Beaker pip-installable within a Databricks notebook then remove the hard-coded .whl url - [issue](https://github.com/goodwillpunning/beaker/issues/19) 17 | -------------------------------------------------------------------------------- /30-performance/TPC-DS Runner/README.md: -------------------------------------------------------------------------------- 1 | # Databricks TPC-DS Benchmarking Tool 2 | 3 | This tool runs the TPC-DS benchmark on a Databricks SQL warehouse. The TPC-DS benchmark is a standardised method of evaluating the performance of decision support solutions, such as databases, data warehouses, and big data systems. 4 | 5 | **Disclaimer: this tool is simple. It will not duplicate warehouse peak performance out-of-the-box. Instead, it's meant to be a transparent and representative baseline.** 6 | 7 | ## Quick Start 8 | #### 0 - Clone this repo via [Databricks Repos](https://docs.databricks.com/en/repos/index.html) 9 | 10 | #### 1 - Open the main notebook 11 | 12 | 13 | #### 2 - Create or Attach to Cluster 14 | 15 | 16 | Note that if you're using a unity catalog (UC) table, UC must be enabled on this cluster. 17 | Note that we don't support serverless clusters at this time. 18 | 19 | #### 3 - Run your parameters 20 | * Note that you may have to run the first cell in the notebook to see the widgets. 21 | 22 | 23 | ## Parameters 24 | Data 25 | * **Catalog Name**: the name of the catalog to write to for non-UC configurations 26 | * **Schema Prefix**: a string that will be prepended to the dynamically-generated schema name 27 | * **Number of GB of Data**: the number of gigabytes of TPC-DS data to be written. `1` indicates that the sum of all table sizes will be ~1GB. 28 | 29 | Warehouse 30 | * **Maximum Number of Clusters**: the maximum number of workers to which a SQL warehouse can scale 31 | * **Warehouse Size**: T-shirt size of the SQL warehouse workers 32 | * **Channel**: the warehouse channel, which correspond to the underlying DBR version 33 | 34 | Load Testing 35 | * **Concurrency**: the simulated number of users executing the TPC-DS queries. On the backend, this corresponds to the number of Python threads. 36 | * **Query Repeatition Count**: the number of times the TPC-DS queries will be repeatedly run. `2` indicates that each TPC-DS query will be run twice. Note that caching is disabled, so repeated queries will not hit cache. 37 | 38 | #### 4 - Click "Run All" 39 | 40 | 41 | #### What will happen? 42 | After clicking run all, a Databricks workflow with two tasks will be created. The first task is responsible for writing TPC-DS data and the associated queries into Delta tables. The second task will execute a TPC-DS benchmark leveraging the tables and queries created in the prior task. The results of the bechmarking will be printed out in the job notebook for viewing, but also will be written to a delta table; the location of the delta table will be printed in the job notebook. 43 | 44 | 45 | 46 | ## Core Concepts 47 | - **Concurrency**: The simulated number of users executing concurrent queries. It provides an insight into how well the system can handle multiple users executing queries at the same time. 
48 | - **Throughput**: The number of queries that the system can handle per unit of time. It is usually measured in queries per minute (QPM) and provides insignt into the speed and efficiency of the system. 49 | 50 | # Product Details 51 | ## Relevant Features 52 | * The tool is cloud agnostic. 53 | * Authentication is automatically handled by the python SDK. 54 | * Benchmarking will be performed on the latest LTS DBR version. 55 | * Result cache is hard-coded to false, which means that all queries will not hit a warehouse's cache. 56 | * Each benchmark run will trigger a warehouse "warming," which is just a `SELECT *` on all TPC-DS tables. 57 | * Table format is hard-coded to delta. Data writes are currently hard-coded to DBR 12.2, so if there are updates in Delta with newer DBR versions, they will not be included. This decision was made because spark-sql-perf did not run on > 12.2 DBR as of 2023-08-10. 58 | * A new warehouse will be created based on user parameters. If a warehouse with the same name exists, the benchmarking tool will use that existing warehouse. 59 | * Given Python's Global Processing Lock (GIL), increasing the number of cores will have diminshing returns. To hide complexity from the user while also bounding cost, the concurrency parameter will scale cluster count linearly up to 100 cores, then stop. Concurrency > 100 however is still supported via multithreading - it will just run on a maximum of 100 cores. Based on our default node type, this will be 25 workers. 60 | * We are using [Databricks python-sql-connector](https://docs.databricks.com/en/dev-tools/python-sql-connector.html) to execute queries, but we are not fetching the results. The python-sql-connector has a built-in feature that retries with backoff when rate limit errors occur. Due to this retry mechanism, the actual performance of the system may be slightly faster than what the benchmarking results indicate. 61 | * If the data (with a given set of configs) already exists, it will not be overwritten. The matching logic simply uses the name of the schema, so if you change the `schema_prefix` (and that resulting schema is not found), new data will be written. 62 | 63 | ### Limitations 64 | * You must run this tool from a single-user cluster to allow default SDK authentication. 65 | * We currently don't support UC. That will be the next step for this tool. 66 | * We currently only support DBSQL serverless warehouses for simplicity. If there is desire to test non-serverless warehouses, please let us know. 67 | 68 | ### Data Generation Runtimes 69 | Both the data generation and benchmarking workflow tasks will increase in runtime as the data size increases. Here are some examples, however your benchmarking runtimes may differ signifigantly depending on your configurations. 
68 | ### Data Generation Runtimes 69 | Both the data generation and benchmarking workflow tasks will increase in runtime as the data size increases. Here are some examples; however, your benchmarking runtimes may differ significantly depending on your configurations. 70 | | Number of GB Written | create_data_and_queries Runtime | TPCDS_benchmarking Runtime | 71 | |---------|---------|---------| 72 | | 1 GB | 17 mins | 7 mins | 73 | | 100 GB | 70 mins | 24 mins | 74 | | 1 TB | 305 mins | 54 mins | 75 | -------------------------------------------------------------------------------- /30-performance/TPC-DS Runner/assets/images/cluster.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/AbePabbathi/lakehouse-tacklebox/9bbbf0a7181193091d887ed57e7394a466c369cf/30-performance/TPC-DS Runner/assets/images/cluster.png -------------------------------------------------------------------------------- /30-performance/TPC-DS Runner/assets/images/filters.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/AbePabbathi/lakehouse-tacklebox/9bbbf0a7181193091d887ed57e7394a466c369cf/30-performance/TPC-DS Runner/assets/images/filters.png -------------------------------------------------------------------------------- /30-performance/TPC-DS Runner/assets/images/main_notebook.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/AbePabbathi/lakehouse-tacklebox/9bbbf0a7181193091d887ed57e7394a466c369cf/30-performance/TPC-DS Runner/assets/images/main_notebook.png -------------------------------------------------------------------------------- /30-performance/TPC-DS Runner/assets/images/run_all.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/AbePabbathi/lakehouse-tacklebox/9bbbf0a7181193091d887ed57e7394a466c369cf/30-performance/TPC-DS Runner/assets/images/run_all.png -------------------------------------------------------------------------------- /30-performance/TPC-DS Runner/assets/images/workflow.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/AbePabbathi/lakehouse-tacklebox/9bbbf0a7181193091d887ed57e7394a466c369cf/30-performance/TPC-DS Runner/assets/images/workflow.png -------------------------------------------------------------------------------- /30-performance/TPC-DS Runner/constants.py: -------------------------------------------------------------------------------- 1 | # Databricks notebook source 2 | pip install --upgrade databricks-sdk -q 3 | 4 | # COMMAND ---------- 5 | 6 | dbutils.library.restartPython() 7 | 8 | # COMMAND ---------- 9 | 10 | import os 11 | import math 12 | from dataclasses import dataclass 13 | from utils.general import tables_already_exist, get_widget_values, create_widgets 14 | 15 | @dataclass 16 | class Constants: 17 | ############### Variables dependent upon user parameters ############## 18 | # Number of GBs of TPCDS data to write 19 | number_of_gb_of_data: int 20 | 21 | # Name of the catalog to write TPCDS data to 22 | catalog_name: str 23 | 24 | # Prefix of the schema to write TPCDS data to 25 | schema_prefix: str 26 | 27 | # Size of the warehouse cluster 28 | warehouse_size: str 29 | 30 | # Maximum number of clusters to scale to in the warehouse 31 | maximum_number_of_clusters: int 32 | 33 | # Warehouse channel name 34 | channel: str 35 | 36 | # Number of concurrent threads 37 | concurrency: int 38 | 39 | # Number of times to repeat each benchmarking query 40 | query_repetition_count: int 41 | 42 | ############### Variables independent of user parameters ############# 43 | # Name of the job 44 | job_name =
f"[AUTOMATED] Create and run TPC-DS" 45 | 46 | # Dynamic variables that are used to create downstream variables 47 | _current_user_email = ( 48 | dbutils.notebook.entry_point.getDbutils() 49 | .notebook() 50 | .getContext() 51 | .userName() 52 | .get() 53 | ) 54 | _cwd = os.getcwd().replace("/Workspace", "") 55 | 56 | # User-specific parameters, which are used to create directories and cluster single-access-mode 57 | current_user_email = _current_user_email 58 | current_user_name = ( 59 | _current_user_email.replace(".", "_").replace("-", "_").split("@")[0] 60 | ) 61 | 62 | # Base directory where all data and queries will be written 63 | root_directory = f"dbfs:/Benchmarking/TPCDS/{current_user_name}" 64 | 65 | # Additional subdirectories within the above root_directory 66 | script_path = os.path.join(root_directory, "scripts") 67 | data_path = os.path.join(root_directory, "data") 68 | query_path = os.path.join(root_directory, "queries") 69 | 70 | # Location of the spark-sql-perf jar, which is used to create TPC-DS data and queries 71 | jar_path = os.path.join(script_path, "jars/spark-sql-perf_2.12-0.5.1-SNAPSHOT.jar") 72 | 73 | # Location of the init script, which is responsible for installing the above jar and other prerequisites 74 | init_script_path = os.path.join(script_path, "tpcds-install.sh") 75 | 76 | # Location of the dist whl for beaker 77 | beaker_whl_path = os.path.join(script_path, "beaker-0.0.1-py3-none-any.whl") 78 | 79 | # Location of the notebook that creates data and queries 80 | create_data_and_queries_notebook_path = os.path.join( 81 | _cwd, "notebooks/create_data_and_queries" 82 | ) 83 | 84 | # Location of the notebook that runs TPC-DS queries against written data using the beaker library 85 | run_tpcds_benchmarking_notebook_path = os.path.join( 86 | _cwd, "notebooks/run_tpcds_benchmarking" 87 | ) 88 | 89 | # Name of the current databricks host 90 | host = f"https://{spark.conf.get('spark.databricks.workspaceUrl')}/" 91 | 92 | def _validate_concurrency_will_utilize_cluster(self): 93 | required_number_of_clusters = math.ceil(self.concurrency / 10) 94 | if self.maximum_number_of_clusters > required_number_of_clusters: 95 | 96 | print( 97 | "Warning:\n" 98 | "\tFor optimal performance, we recommend using 1 cluster per 10 levels of concurrency. Your currrent\n" 99 | "\tconfiguration will underutilize the warehouse and a cheaper configuration shuold exhibit the same performance.\n" 100 | f"\tPlease try using {required_number_of_clusters} clusters instead." 
101 | ) 102 | 103 | def __post_init__(self): 104 | # Create a schema prefix if '' to ensure unrelated schemas are not deleted 105 | if self.schema_prefix == "": 106 | self.schema_prefix = "tpcds_benchmark" 107 | 108 | # Name of the schema that tpcds data and benchmarking metrics will be written to 109 | self.schema_name: str = ( 110 | f"{self.schema_prefix.rstrip('_')}_{self.number_of_gb_of_data}_gb" 111 | ) 112 | 113 | # Add schema to data path 114 | self.data_path = os.path.join(self.data_path, self.schema_name) 115 | 116 | # Determine if TPC-DS tables already exist 117 | self.tables_already_exist = tables_already_exist(spark, self.catalog_name, self.schema_name) 118 | 119 | # Param validations/warnings 120 | self._validate_concurrency_will_utilize_cluster() 121 | 122 | -------------------------------------------------------------------------------- /30-performance/TPC-DS Runner/main.py: -------------------------------------------------------------------------------- 1 | # Databricks notebook source 2 | # DBTITLE 1,Import Constants 3 | # MAGIC %run ./constants 4 | 5 | # COMMAND ---------- 6 | 7 | # DBTITLE 1,Add Widgets to Notebook 8 | create_widgets(dbutils) 9 | 10 | # COMMAND ---------- 11 | 12 | # DBTITLE 1,Pull Variables from Notebook Widgets 13 | constants = Constants( 14 | **get_widget_values(dbutils) 15 | ) 16 | 17 | # COMMAND ---------- 18 | 19 | # DBTITLE 1,Create and Run TPC-DS Benchmark 20 | from utils.run import run 21 | 22 | run(spark, dbutils, constants) 23 | -------------------------------------------------------------------------------- /30-performance/TPC-DS Runner/notebooks/create_data_and_queries.scala: -------------------------------------------------------------------------------- 1 | // Databricks notebook source 2 | // DBTITLE 1,Get parameters from job 3 | // name of the user, formatted to be passable as a schema 4 | val userName = dbutils.widgets.get("current_user_name") 5 | 6 | // The scaleFactor defines the size of the dataset to generate (in GB) 7 | val scaleFactor = dbutils.widgets.get("scale_factor") 8 | 9 | // Location to store the queries 10 | val queryDir = dbutils.widgets.get("query_directory") 11 | 12 | // Location to store the data 13 | val dataDir = dbutils.widgets.get("data_directory") 14 | 15 | // Name of the catalog to write the tpcds data 16 | val catalogName = dbutils.widgets.get("catalog_name") 17 | 18 | // Name of the database to write the tpcds data 19 | val schemaName = dbutils.widgets.get("schema_name") 20 | 21 | // Determine if tables with the same parameters have already been written 22 | val tablesAlreadyExist = dbutils.widgets.get("tables_already_exist") 23 | 24 | // COMMAND ---------- 25 | 26 | // DBTITLE 1,Write Data 27 | if (tablesAlreadyExist == "false") { 28 | // source: https://github.com/deepaksekaranz/TPCDSDataGen/tree/master/TPCDS-Kit 29 | import com.databricks.spark.sql.perf.tpcds.TPCDSTables 30 | 31 | // The scaleFactor defines the size of the dataset to generate (in GB) 32 | val scaleFactorInt = scaleFactor.toInt 33 | 34 | // Set the file type 35 | val fileFormat = "delta" 36 | 37 | // Initialize TPCDS tables with given parameters 38 | val tables = new TPCDSTables( 39 | sqlContext = sqlContext, 40 | dsdgenDir = "/usr/local/bin/tpcds-kit/tools", 41 | scaleFactor = scaleFactor, 42 | useDoubleForDecimal = false, // If true, replaces DecimalType with DoubleType 43 | useStringForDate = false // If true, replaces DateType with StringType 44 | ) 45 | 46 | // Generate TPC-DS data 47 | tables.genData( 48 | location = dataDir,
format = "delta", 50 | overwrite = true, // overwrite the data that is already there 51 | partitionTables = false, // create the partitioned fact tables 52 | clusterByPartitionColumns = false, // shuffle to get partitions coalesced into single files. 53 | filterOutNullPartitionValues = false, // true to filter out the partition with NULL key value 54 | tableFilter = "", // "" means generate all tables 55 | numPartitions = 20 // how many dsdgen partitions to run - number of input tasks. 56 | ) 57 | 58 | // Create the specified database if it doesn't exist 59 | sql(s"create schema if not exists $schemaName") 60 | 61 | // Create metastore tables in a specified database for your data. The current database will be switched to the specified database. 62 | // Once tables are created, the current database will be switched to the specified database. 63 | tables.createExternalTables(dataDir, fileFormat, schemaName, overwrite = true, discoverPartitions = false) 64 | 65 | // Convert the tables to managed 66 | val tableInfo = dbutils.fs.ls(dataDir).map(x => (x.name.stripSuffix("/"), x.path)) 67 | 68 | for ((tableName, tablePath) <- tableInfo) { 69 | spark.sql(s"DROP TABLE IF EXISTS ${catalogName}.${schemaName}.${tableName}") 70 | spark.sql(s""" 71 | CREATE TABLE ${catalogName}.${schemaName}.${tableName} 72 | LOCATION '$tablePath' 73 | """) 74 | } 75 | } 76 | 77 | // COMMAND ---------- 78 | 79 | // DBTITLE 1,Write Queries 80 | import scala.util.Try 81 | import com.databricks.spark.sql.perf.tpcds.TPCDS 82 | import com.databricks.spark.sql.perf.Query 83 | 84 | def writeQueriesToDBFS(dbfsPath: String, queries: Map[String, Query]): Unit = { 85 | queries.foreach { case (fileName, query) => 86 | val dbfsFilePath = s"$dbfsPath/$fileName.sql" 87 | val putResult = Try(dbutils.fs.put(dbfsFilePath, query.sqlText.getOrElse(""), overwrite = true)) 88 | 89 | putResult match { 90 | case scala.util.Success(_) => println(s"Successfully written to $dbfsFilePath") 91 | case scala.util.Failure(exception) => println(s"Failed to write to $dbfsFilePath: ${exception.getMessage}") 92 | } 93 | } 94 | } 95 | 96 | val tpcds = new TPCDS (sqlContext = sqlContext) 97 | val sqlQueries = tpcds.tpcds2_4QueriesMap 98 | 99 | writeQueriesToDBFS(queryDir, sqlQueries) 100 | -------------------------------------------------------------------------------- /30-performance/TPC-DS Runner/notebooks/run_tpcds_benchmarking.py: -------------------------------------------------------------------------------- 1 | # Databricks notebook source 2 | # MAGIC %md ### Run TPC-DS Benchmarks 3 | 4 | # COMMAND ---------- 5 | 6 | pip install --upgrade databricks-sdk -q 7 | 8 | # COMMAND ---------- 9 | 10 | dbutils.library.restartPython() 11 | 12 | # COMMAND ---------- 13 | 14 | # DBTITLE 1,Configuration Variables 15 | import time 16 | from databricks.sdk import WorkspaceClient 17 | 18 | # Host and PAT for beaker authentication 19 | HOST = spark.conf.get('spark.databricks.workspaceUrl') 20 | PAT = WorkspaceClient().tokens.create(comment='temp use', lifetime_seconds=60*60*12).token_value 21 | 22 | # ID of the warehouse to run benchmarks with 23 | WAREHOUSE_ID = dbutils.widgets.get("warehouse_id") 24 | WAREHOUSE_HTTP_PATH = f"/sql/1.0/warehouses/{WAREHOUSE_ID}" 25 | 26 | # Name of the catalog to read/write to 27 | CATALOG_NAME = dbutils.widgets.get("catalog_name") 28 | 29 | # Name of the schema to read/write to 30 | SCHEMA_NAME = dbutils.widgets.get("schema_name") 31 | 32 | # Location of query files 33 | QUERY_PATH = 
dbutils.widgets.get("query_path").lstrip('/').replace('dbfs:','/dbfs') 34 | 35 | # Number of concurrent threads (procs) in beaker 36 | CONCURRENCY = int(dbutils.widgets.get("concurrency")) 37 | 38 | # Number of times each query will be repeated 39 | QUERY_REPETITION_COUNT = int(dbutils.widgets.get("query_repetition_count")) 40 | 41 | # Id of the job, which is used to create the schema 42 | try: 43 | job_id = dbutils.notebook.entry_point.getDbutils().notebook().getContext().tags().get("jobId").get() 44 | METRICS_TABLE_NAME = f"benchmark_metrics_for_job_{job_id}" 45 | except AttributeError as e: 46 | print("This notebook must be run within a Databricks workflow.") 47 | raise e 48 | 49 | # TPC-DS Tables to Warm 50 | TPCDS_TABLE_NAMES = { 51 | "call_center", 52 | "catalog_page", 53 | "catalog_returns", 54 | "catalog_sales", 55 | "customer", 56 | "customer_address", 57 | "customer_demographics", 58 | "date_dim", 59 | "household_demographics", 60 | "income_band", 61 | "inventory", 62 | "item", 63 | "promotion", 64 | "reason", 65 | "ship_mode", 66 | "store", 67 | "store_returns", 68 | "store_sales", 69 | "time_dim", 70 | "warehouse", 71 | "web_page", 72 | "web_returns", 73 | "web_sales", 74 | "web_site", 75 | } 76 | 77 | # COMMAND ---------- 78 | 79 | # DBTITLE 1,Start Warehouse 80 | warehouse_start_time = time.time() 81 | WorkspaceClient().warehouses.start_and_wait(WAREHOUSE_ID) 82 | print(f"{int(time.time() - warehouse_start_time)}s Warehouse Startup Time") 83 | 84 | # COMMAND ---------- 85 | 86 | # DBTITLE 1,Run Benchmark 87 | from beaker import benchmark 88 | from functools import reduce 89 | from pyspark.sql import DataFrame 90 | import pyspark.sql.functions as F 91 | 92 | # Create beaker benchmark object 93 | bm = benchmark.Benchmark(results_cache_enabled=False) 94 | 95 | # Set benchmarking parameters 96 | bm.setName(name=f"TPC-DS Benchmark {SCHEMA_NAME}") 97 | bm.setHostname(hostname=HOST) 98 | bm.setWarehouse(http_path=WAREHOUSE_HTTP_PATH) 99 | bm.setConcurrency(concurrency=CONCURRENCY) 100 | bm.setWarehouseToken(token=PAT) 101 | bm.setCatalog(catalog=CATALOG_NAME) 102 | bm.setSchema(schema=SCHEMA_NAME) 103 | bm.setQueryFileDir(QUERY_PATH) 104 | bm.setQueryRepeatCount(QUERY_REPETITION_COUNT) 105 | 106 | # Warm the warehouse.
This won't be perfect, but it's the best we can do with current serverless queueing 107 | tables_with_schema = [f"{SCHEMA_NAME}.{t}" for t in TPCDS_TABLE_NAMES] 108 | for _ in range(int(min(CONCURRENCY, 50))): 109 | bm.preWarmTables(tables_with_schema) 110 | 111 | # Execute run 112 | start_time = time.time() 113 | result = bm.execute() 114 | duration = time.time() - start_time 115 | 116 | # Store run metrics 117 | metrics_df = spark.createDataFrame(result) 118 | 119 | # COMMAND ---------- 120 | 121 | # DBTITLE 1,Write Metrics to a Delta Table 122 | # write output dataframe to delta for analysis/consumption 123 | metrics_full_path = f"{CATALOG_NAME}.{SCHEMA_NAME}.{METRICS_TABLE_NAME}" 124 | print(f"Writing to delta table: {metrics_full_path}") 125 | metrics_df.write.mode('overwrite').saveAsTable(metrics_full_path) 126 | 127 | # Display the table for reference 128 | metrics_df.display() 129 | 130 | # COMMAND ---------- 131 | 132 | # DBTITLE 1,Throughput 133 | sql_files = [1 for x in dbutils.fs.ls(QUERY_PATH.replace('/dbfs','dbfs:')) if x.name.endswith('.sql')] 134 | n_sql_queries = len(sql_files) * QUERY_REPETITION_COUNT 135 | print(f"TPC-DS queries per minute: {n_sql_queries / (duration / 60)}") 136 | 137 | # COMMAND ---------- 138 | 139 | 140 | -------------------------------------------------------------------------------- /30-performance/TPC-DS Runner/utils/run.py: -------------------------------------------------------------------------------- 1 | from utils.general import setup_files 2 | from utils.databricks_client import DatabricksClient 3 | 4 | 5 | def run(spark, dbutils, constants): 6 | # Step 0: create the write schema if it doesn't already exist 7 | spark.sql( 8 | f"create schema if not exists {constants.catalog_name}.{constants.schema_name}" 9 | ) 10 | 11 | # Step 1: write init script, jar, and beaker whl to DBFS 12 | setup_files( 13 | dbutils, 14 | constants.jar_path, 15 | constants.init_script_path, 16 | constants.beaker_whl_path, 17 | ) 18 | 19 | # Step 2: create the client 20 | client = DatabricksClient(constants) 21 | 22 | # Step 3: create a warehouse to benchmark against 23 | warehouse_id = client.create_warehouse().id 24 | constants.warehouse_id = warehouse_id 25 | 26 | # Step 4: create and run a job that writes TPCDS data and queries to a given location and runs the benchmarks 27 | job_id = client.create_job().job_id 28 | run_id = client.run_job(job_id).run_id 29 | 30 | # Step 5: print the job run url so the user can monitor it to completion 31 | url = f"{constants.host.replace('www.','')}#job/{job_id}/run/{run_id}" 32 | print(f"\nA TPC-DS benchmarking job was created at the following url:\n\t{url}\n") 33 | print(f"It will write TPC-DS data to {constants.data_path}.") 34 | print( 35 | "The job may take several hours depending upon data size, so please check back when it's complete.\n" 36 | ) -------------------------------------------------------------------------------- /30-performance/dbsql-query-replay-tool/01-Query_Replay_Tool.py: -------------------------------------------------------------------------------- 1 | # Databricks notebook source 2 | # MAGIC %run ./00-Functions 3 | 4 | # COMMAND ---------- 5 | 6 | # SETUP 7 | test_name = "" 8 | result_catalog = "" 9 | result_schema = "" 10 | token = "" 11 | 12 | # SETUP SOURCE WAREHOUSE ID AND START AND END TIME 13 | source_warehouse_id = "" 14 | source_start_time = "2023-12-01 00:00:00" 15 | source_end_time = "2023-12-01 00:05:00" 16 | 17 | replay_test = QueryReplayTest( 18 | test_name=test_name, 19 | result_catalog=result_catalog, 20 | result_schema=result_schema, 21 | token=token,
22 | source_warehouse_id=source_warehouse_id, 23 | source_start_time=source_start_time, 24 | source_end_time=source_end_time, 25 | ) 26 | 27 | test_id = replay_test.run() 28 | 29 | # COMMAND ---------- 30 | 31 | # MAGIC %md 32 | # MAGIC ### `query_df` is the list of source queries we are going to use 33 | 34 | # COMMAND ---------- 35 | 36 | replay_test.query_df.orderBy('start_time').display() 37 | 38 | # COMMAND ---------- 39 | 40 | # MAGIC %md 41 | # MAGIC ### `show_run` gives the test details 42 | 43 | # COMMAND ---------- 44 | 45 | replay_test.show_run.display() 46 | 47 | # COMMAND ---------- 48 | 49 | # MAGIC %md 50 | # MAGIC ### `show_run_details` gives the corresponding statement_id's for all the queries we ran 51 | 52 | # COMMAND ---------- 53 | 54 | replay_test.show_run_details.display() 55 | 56 | # COMMAND ---------- 57 | 58 | # MAGIC %md 59 | # MAGIC ### `query_results` gives the result comparing source details against the test output 60 | 61 | # COMMAND ---------- 62 | 63 | # Recreating the test object with the test_id allows us to retrieve the query results later (since system tables might not have all the results immediately) 64 | 65 | replay_test = QueryReplayTest( 66 | test_name=test_name, 67 | result_catalog=result_catalog, 68 | result_schema=result_schema, 69 | token=token, 70 | source_warehouse_id=source_warehouse_id, 71 | source_start_time=source_start_time, 72 | source_end_time=source_end_time, 73 | test_id=test_id 74 | ) 75 | 76 | replay_test.query_results.display() 77 | -------------------------------------------------------------------------------- /30-performance/dbsql-query-replay-tool/README.md: -------------------------------------------------------------------------------- 1 | # Databricks SQL Query Replay Tool 2 | 3 | This tool aims to help users evaluate the performance of different warehouses by replaying a set of queries from one warehouse's history against another. 4 | 5 | ## Notebooks 6 | 7 | * `00-Functions` is the notebook containing the python class 8 | * `01-Query_Replay_Tool` is the notebook that is used to execute the test 9 | 10 | ## Requirements 11 | 12 | Users need access to the query history system table `system.query.history` in order to extract the queries and start times for the test. 13 | 14 | ## Usage 15 | 16 | Users need to set the following parameters 17 | 18 | * `test_name`: Test Identifier 19 | * `result_catalog` and `result_schema`: The schema where the test results will be written to 20 | * `token`: A Databricks PAT that will be used to launch those queries 21 | * `source_warehouse_id`: The warehouse ID where the original queries were submitted 22 | * `source_start_time`: The start time to filter for queries 23 | * `source_end_time`: The end time to filter for queries 24 | 25 | And here are a number of optional configurations for the target warehouse where the queries will be replayed to (see [Create Warehouse API doc](https://docs.databricks.com/api/workspace/warehouses/create) for more details); a hedged example of combining them follows the list. 26 | 27 | * `target_warehouse_size` 28 | * `target_warehouse_max_num_clusters` 29 | * `target_warehouse_type` 30 | * `target_warehouse_serverless` 31 | * `target_warehouse_custom_tags` 32 | * `target_warehouse_channel` 33 |
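For example, the optional target-warehouse settings might be combined with the required parameters when constructing the test. The keyword-argument form below is an assumption for illustration only, and the values are placeholders; check `00-Functions` for the exact signature of `QueryReplayTest`.

```python
# Hypothetical sketch - assumes the optional target-warehouse settings are accepted
# as keyword arguments by QueryReplayTest; see 00-Functions for the actual signature.
replay_test = QueryReplayTest(
    test_name=test_name,
    result_catalog=result_catalog,
    result_schema=result_schema,
    token=token,
    source_warehouse_id=source_warehouse_id,
    source_start_time=source_start_time,
    source_end_time=source_end_time,
    target_warehouse_size="Large",           # placeholder value
    target_warehouse_max_num_clusters=4,     # placeholder value
    target_warehouse_serverless=True,        # placeholder value
)
```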
34 | The replay can be executed as follows. 35 | 36 | ```python 37 | replay_test = QueryReplayTest( 38 | test_name=test_name, 39 | result_catalog=result_catalog, 40 | result_schema=result_schema, 41 | token=token, 42 | source_warehouse_id=source_warehouse_id, 43 | source_start_time=source_start_time, 44 | source_end_time=source_end_time, 45 | ) 46 | 47 | test_id = replay_test.run() 48 | ``` 49 | 50 | Once the test is completed, it will return the `test_id` which can be used to retrieve the result. 51 | 52 | Here is other functionality within the `QueryReplayTest`: 53 | 54 | * `replay_test.query_df` returns the queries that were used for the test 55 | * `replay_test.show_run` returns the metadata of the test 56 | * `show_run_details` returns the corresponding statement_id's for all the queries we ran 57 | * `query_results` returns the result comparing source details against the test output 58 | 59 | All the output data are written to the nominated schema in tables `query_replay_test_run` and `query_replay_test_run_details` if you want to query them directly as well. 60 | -------------------------------------------------------------------------------- /30-performance/delta-optimizer/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/AbePabbathi/lakehouse-tacklebox/9bbbf0a7181193091d887ed57e7394a466c369cf/30-performance/delta-optimizer/__init__.py -------------------------------------------------------------------------------- /30-performance/delta-optimizer/customer-facing-delta-optimizer/deltaoptimizer-1.5.5-py3-none-any.whl: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/AbePabbathi/lakehouse-tacklebox/9bbbf0a7181193091d887ed57e7394a466c369cf/30-performance/delta-optimizer/customer-facing-delta-optimizer/deltaoptimizer-1.5.5-py3-none-any.whl -------------------------------------------------------------------------------- /30-performance/delta-optimizer/deltaoptimizer/.DS_Store: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/AbePabbathi/lakehouse-tacklebox/9bbbf0a7181193091d887ed57e7394a466c369cf/30-performance/delta-optimizer/deltaoptimizer/.DS_Store -------------------------------------------------------------------------------- /30-performance/delta-optimizer/deltaoptimizer/.gitignore: -------------------------------------------------------------------------------- 1 | 2 | .databricks 3 | -------------------------------------------------------------------------------- /30-performance/delta-optimizer/deltaoptimizer/.vscode/settings.json: -------------------------------------------------------------------------------- 1 | { 2 | "python.envFile": "${workspaceFolder}/.databricks/.databricks.env", 3 | "databricks.python.envFile": "${workspaceFolder}/.env", 4 | "jupyter.interactiveWindow.cellMarker.codeRegex": "^# COMMAND ----------|^# Databricks notebook source|^(#\\s*%%|#\\s*\\|#\\s*In\\[\\d*?\\]|#\\s*In\\[ \\])", 5 | "jupyter.interactiveWindow.cellMarker.default": "# COMMAND ----------" 6 | } -------------------------------------------------------------------------------- /30-performance/delta-optimizer/deltaoptimizer/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/AbePabbathi/lakehouse-tacklebox/9bbbf0a7181193091d887ed57e7394a466c369cf/30-performance/delta-optimizer/deltaoptimizer/__init__.py
-------------------------------------------------------------------------------- /30-performance/delta-optimizer/deltaoptimizer/deltaoptimizer.egg-info/PKG-INFO: -------------------------------------------------------------------------------- 1 | Metadata-Version: 2.1 2 | Name: deltaoptimizer 3 | Version: 1.5.5 4 | Summary: Delta Optimizer Beta - UC Enabled 5 | Author: Cody Austin Davis @Databricks, Inc. 6 | Author-email: cody.davis@databricks.com 7 | Requires-Dist: sqlparse 8 | Requires-Dist: sql_metadata 9 | -------------------------------------------------------------------------------- /30-performance/delta-optimizer/deltaoptimizer/deltaoptimizer.egg-info/SOURCES.txt: -------------------------------------------------------------------------------- 1 | deltaoptimizer.py 2 | setup.py 3 | deltaoptimizer.egg-info/PKG-INFO 4 | deltaoptimizer.egg-info/SOURCES.txt 5 | deltaoptimizer.egg-info/dependency_links.txt 6 | deltaoptimizer.egg-info/requires.txt 7 | deltaoptimizer.egg-info/top_level.txt -------------------------------------------------------------------------------- /30-performance/delta-optimizer/deltaoptimizer/deltaoptimizer.egg-info/dependency_links.txt: -------------------------------------------------------------------------------- 1 | 2 | -------------------------------------------------------------------------------- /30-performance/delta-optimizer/deltaoptimizer/deltaoptimizer.egg-info/requires.txt: -------------------------------------------------------------------------------- 1 | sqlparse 2 | sql_metadata 3 | -------------------------------------------------------------------------------- /30-performance/delta-optimizer/deltaoptimizer/deltaoptimizer.egg-info/top_level.txt: -------------------------------------------------------------------------------- 1 | deltaoptimizer 2 | -------------------------------------------------------------------------------- /30-performance/delta-optimizer/deltaoptimizer/dist/deltaoptimizer-1.5.5-py3-none-any.whl: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/AbePabbathi/lakehouse-tacklebox/9bbbf0a7181193091d887ed57e7394a466c369cf/30-performance/delta-optimizer/deltaoptimizer/dist/deltaoptimizer-1.5.5-py3-none-any.whl -------------------------------------------------------------------------------- /30-performance/delta-optimizer/deltaoptimizer/setup.py: -------------------------------------------------------------------------------- 1 | from setuptools import setup 2 | 3 | setup( 4 | name='deltaoptimizer', 5 | version='1.5.5', 6 | description='Delta Optimizer Beta - UC Enabled', 7 | author='Cody Austin Davis @Databricks, Inc.', 8 | author_email='cody.davis@databricks.com', 9 | install_requires=[ 10 | 'sqlparse', 11 | 'sql_metadata' 12 | ] 13 | ) -------------------------------------------------------------------------------- /40-observability/README.md: -------------------------------------------------------------------------------- 1 | #### Governance/Observability 2 | 3 | This section consists of tools that will help CDOs, Billing Administrators and Infrastructure Administrators to get a better understanding of the usage and cost drivers of the Lakehouse 4 | 5 | # 6 | 1. [Data Profiling](https://github.com/AbePabbathi/lakehouse-tacklebox/tree/master/40-observability/data-profiling) 7 | 2. [DBSQL Monitoring](https://github.com/AbePabbathi/lakehouse-tacklebox/tree/master/40-observability/dbsql-logging) 8 | 3. 
[Stream Monitoring](https://github.com/AbePabbathi/lakehouse-tacklebox/tree/main/40-observability/stream-monitoring) 9 | 4. PII Detector (Coming soon) -------------------------------------------------------------------------------- /40-observability/dbsql-logging/00-Config.py: -------------------------------------------------------------------------------- 1 | # Databricks notebook source 2 | # DBTITLE 1,API Config 3 | # Please ensure the url starts with https and DOES NOT have a slash at the end 4 | WORKSPACE_HOST = 'https://adb-2541733722036151.11.azuredatabricks.net' 5 | WAREHOUSE_URL = "{0}/api/2.0/sql/warehouses".format(WORKSPACE_HOST) ## SQL Warehouses APIs 2.0 6 | QUERIES_URL = "{0}/api/2.0/sql/history/queries".format(WORKSPACE_HOST) ## Query History API 2.0 7 | WORKFLOWS_URL = "{0}/api/2.1/jobs/list".format(WORKSPACE_HOST) ## Jobs & Workflows History API 2.1 8 | DASHBOARDS_URL = "{0}/api/2.0/preview/sql/queries".format(WORKSPACE_HOST,250) ## Queries and Dashboards API - ❗️in preview, deprecated soon❗️ 9 | 10 | MAX_RESULTS_PER_PAGE = 1000 11 | MAX_PAGES_PER_RUN = 500 12 | PAGE_SIZE = 250 # 250 is the max 13 | 14 | # We will fetch all queries that were started between this number of hours ago, and now() 15 | # Queries that are running for longer than this will not be updated. 16 | # Can be set to a much higher number when backfilling data, for example when this Job didn't run for a while. 17 | NUM_HOURS_TO_UPDATE = 168 18 | 19 | # COMMAND ---------- 20 | 21 | # DBTITLE 1,API Authentication 22 | # If you want to run this notebook yourself, you need to create a Databricks personal access token, 23 | # store it using our secrets API, and pass it in through the Spark config, such as this: 24 | # spark.pat_token {{secrets/query_history_etl/user}}, or Azure Keyvault. 
25 | 26 | #Databricks secrets API 27 | #AUTH_HEADER = {"Authorization" : "Bearer " + spark.conf.get("spark.pat_token")} 28 | #Azure KeyVault 29 | #AUTH_HEADER = {"Authorization" : "Bearer " + dbutils.secrets.get(scope = "", key = "")} 30 | #Naughty way 31 | AUTH_HEADER = {"Authorization" : "Bearer " + "dapixxxxxxxxxxxxxxxxxxxxxxxxx"} 32 | 33 | # COMMAND ---------- 34 | 35 | # DBTITLE 1,Database and Table Config 36 | DATABASE_NAME = "dbsql_logging" 37 | # DATABASE_LOCATION = "/s3-location/" 38 | QUERIES_TABLE_NAME = "queries" 39 | WAREHOUSES_TABLE_NAME = "warehouses" 40 | WORKFLOWS_TABLE_NAME = "workflows" 41 | DASHBOARDS_TABLE_NAME = "dashboards_preview" 42 | 43 | # COMMAND ---------- 44 | 45 | # DBTITLE 1,Delta Table Maintenance 46 | QUERIES_ZORDER = "endpoint_id" 47 | WAREHOUSES_ZORDER = "id" 48 | WORKFLOWS_ZORDER = "job_id" 49 | DASHBOARDS_ZORDER = "id" 50 | 51 | VACUUM_RETENTION = 168 52 | -------------------------------------------------------------------------------- /40-observability/dbsql-logging/01-Functions.py: -------------------------------------------------------------------------------- 1 | # Databricks notebook source 2 | # DBTITLE 1,Check if spark can read the table 3 | def check_table_exist(db_tbl_name): 4 | table_exist = False 5 | try: 6 | spark.read.table(db_tbl_name) # Check if spark can read the table 7 | table_exist = True 8 | except: 9 | pass 10 | return table_exist 11 | 12 | # COMMAND ---------- 13 | 14 | # DBTITLE 1,Current time in milliseconds 15 | def current_time_in_millis(): 16 | return round(time.time() * 1000) 17 | 18 | # COMMAND ---------- 19 | 20 | # DBTITLE 1,True False fix 21 | def get_boolean_keys(arrays): 22 | # A quirk in Python's and Spark's handling of JSON booleans requires us to convert True and False to true and false 23 | boolean_keys_to_convert = [] 24 | for array in arrays: 25 | for key in array.keys(): 26 | if type(array[key]) is bool: 27 | boolean_keys_to_convert.append(key) 28 | #print(boolean_keys_to_convert) 29 | return boolean_keys_to_convert 30 | 31 | # COMMAND ---------- 32 | 33 | # DBTITLE 1,Turn API results into json 34 | def result_to_json(result): 35 | return json.dumps(result.json()) 36 | 37 | # COMMAND ---------- 38 | 39 | # DBTITLE 1,Get specific page results (Dashboards API only) 40 | def get_page_result(base_url, page, auth): 41 | return requests.get(f'{base_url}&page={page}&order=executed_at', headers=auth) 42 | 43 | # COMMAND ---------- 44 | 45 | # DBTITLE 1,Get specific offset results (Workflows API only) 46 | def get_offset_result(base_url, offest, auth): 47 | return requests.get(f'{base_url}&offset={offest}', headers=auth) 48 | -------------------------------------------------------------------------------- /40-observability/dbsql-logging/02-Initialization.py: -------------------------------------------------------------------------------- 1 | # Databricks notebook source 2 | # MAGIC %run ./00-Config 3 | 4 | # COMMAND ---------- 5 | 6 | spark.sql(f'CREATE DATABASE IF NOT EXISTS {DATABASE_NAME}') 7 | # optional: add location 8 | # spark.sql(f'CREATE DATABASE IF NOT EXISTS {DATABASE_NAME} LOCATION {DATABASE_LOCATION}') 9 | -------------------------------------------------------------------------------- /40-observability/dbsql-logging/05-Alert_Syntax.sql: -------------------------------------------------------------------------------- 1 | -- Databricks notebook source 2 | -- MAGIC %md 3 | -- MAGIC ## Syntax for Alerts in DBSQL 4 | -- MAGIC 5 | -- MAGIC This notebook contains snippets that may be useful for alerting on
DBSQL usage. 6 | -- MAGIC 7 | -- MAGIC Schedule this using workflows, or use the schedule dropdown on the top right of the notebook. The job should take <20 mins to run and consume ~1 DBU; there's very little data being processed here, even on busy workspaces. 8 | -- MAGIC 9 | -- MAGIC Alerts should be actionable. "Nice to know" information just acts as noise. Examples of actionable alerts may be: 10 | -- MAGIC * Terminating long running warehouses 11 | -- MAGIC * Investigating long running queries 12 | -- MAGIC * Sizing up warehouses that have specific query failures 13 | -- MAGIC 14 | -- MAGIC **Remember**, if you want to be notified of a query taking 2 hours to run, this job must be scheduled at least every two hours 15 | -- MAGIC 16 | -- MAGIC ### How to set up alerts with DBSQL 17 | -- MAGIC 1. DBSQL > SQL Editor > create a query in DBSQL by copying the below or creating your own, name it, and save it 18 | -- MAGIC 2. DBSQL > Alerts > Create Alert > select the query you have just saved, set the threshold for values to be alerted for, save, then change the destination if needed 19 | -- MAGIC 20 | -- MAGIC [Official docs](https://docs.databricks.com/sql/user/alerts/index.html) 21 | 22 | -- COMMAND ---------- 23 | 24 | -- MAGIC %run ./00-Config 25 | 26 | -- COMMAND ---------- 27 | 28 | -- %run ./03-APIs_to_Delta 29 | -- Uncomment this if you would like to run as part of a job 30 | -- Remove these comments too! 31 | 32 | -- COMMAND ---------- 33 | 34 | -- MAGIC %python 35 | -- MAGIC spark.sql(f' USE {DATABASE_NAME}') 36 | 37 | -- COMMAND ---------- 38 | 39 | -- DBTITLE 1,Queries currently running that are over the 95th percentile 40 | SELECT round(duration/1000/60/60,2) as duration_h, * 41 | FROM queries 42 | WHERE duration > (SELECT percentile(duration, 0.95) AS duration_95 43 | FROM queries WHERE status = "FINISHED" 44 | AND statement_type IN ("SELECT", "MERGE")) 45 | AND status = "RUNNING" 46 | ORDER BY duration DESC 47 | 48 | -- COMMAND ---------- 49 | 50 | -- DBTITLE 1,Queries taking over 6 hours to run 51 | SELECT round(duration/1000/60/60,2) as duration_h, * 52 | FROM queries 53 | WHERE duration > 21600000 --6 hours in milliseconds 54 | AND status = "RUNNING" 55 | ORDER BY duration DESC 56 |
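-- COMMAND ----------

-- DBTITLE 1,Warehouses with queries currently queueing (added example)
-- A hedged extra snippet, not part of the original notebook: queueing usually means the warehouse
-- is undersized for the load, which ties to the "sizing up warehouses" idea above. Column names
-- assume the raw Query History API fields landed by 03-APIs_to_Delta (endpoint_id, status);
-- adjust them if your queries table differs.
SELECT endpoint_id, count(*) AS queued_queries
FROM queries
WHERE status = "QUEUED"
GROUP BY endpoint_id
ORDER BY queued_queries DESC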
-------------------------------------------------------------------------------- /40-observability/dbsql-logging/99-Maintenance.py: -------------------------------------------------------------------------------- 1 | # Databricks notebook source 2 | # MAGIC %run ./00-Config 3 | 4 | # COMMAND ---------- 5 | 6 | # DBTITLE 1,Optimize & zOrder 7 | spark.sql(f'OPTIMIZE {DATABASE_NAME}.{QUERIES_TABLE_NAME} ZORDER BY {QUERIES_ZORDER}') 8 | spark.sql(f'OPTIMIZE {DATABASE_NAME}.{WAREHOUSES_TABLE_NAME} ZORDER BY {WAREHOUSES_ZORDER}') 9 | spark.sql(f'OPTIMIZE {DATABASE_NAME}.{DASHBOARDS_TABLE_NAME} ZORDER BY {DASHBOARDS_ZORDER}') 10 | spark.sql(f'OPTIMIZE {DATABASE_NAME}.{WORKFLOWS_TABLE_NAME} ZORDER BY {WORKFLOWS_ZORDER}') 11 | 12 | # COMMAND ---------- 13 | 14 | # DBTITLE 1,Allow for parallel deletes in Vacuum 15 | spark.conf.set("spark.databricks.delta.vacuum.parallelDelete.enabled", True) 16 | 17 | # COMMAND ---------- 18 | 19 | # DBTITLE 1,Delete small files no longer in use 20 | spark.sql(f'VACUUM {DATABASE_NAME}.{QUERIES_TABLE_NAME} RETAIN {VACUUM_RETENTION} HOURS') 21 | spark.sql(f'VACUUM {DATABASE_NAME}.{WAREHOUSES_TABLE_NAME} RETAIN {VACUUM_RETENTION} HOURS') 22 | spark.sql(f'VACUUM {DATABASE_NAME}.{DASHBOARDS_TABLE_NAME} RETAIN {VACUUM_RETENTION} HOURS') 23 | spark.sql(f'VACUUM {DATABASE_NAME}.{WORKFLOWS_TABLE_NAME} RETAIN {VACUUM_RETENTION} HOURS') 24 | -------------------------------------------------------------------------------- /40-observability/dbsql-logging/README.md: -------------------------------------------------------------------------------- 1 | ### dbsql-logging 2 | This tool is a collection of notebooks that pulls together data from 4 different APIs to produce useful metrics to monitor DBSQL usage: 3 | * [SQL Warehouses APIs 2.0](https://docs.databricks.com/sql/api/sql-endpoints.html), referred to as the Warehouse API 4 | * [Query History API 2.0](https://docs.databricks.com/sql/api/query-history.html), referred to as the Queries API 5 | * [Jobs API 2.1](https://docs.databricks.com/dev-tools/api/latest/jobs.html), referred to as the Workflows API 6 | * [Queries and Dashboards API](https://docs.databricks.com/sql/api/queries-dashboards.html), referred to as the Dashboards API - ❗️in preview, known issues, deprecated soon❗️ 7 | 8 | Creator: holly.smith@databricks.com 9 | 10 | #### Setup 11 | This tool has been tested with the following 12 | Cluster config: 13 | * 11.3 LTS 14 | * Driver: i3.xlarge 15 | * Workers: 2 x i3.xlarge - the data here is fairly small 16 | 17 | Profile: 18 | Must be an **admin** in your workspace for the Dashboards API 19 | 20 | #### Notebooks 21 | 22 | ##### 00-Config 23 | This is the configuration of: 24 | * Workspace URL 25 | * Authentication options 26 | * Database and Table storage 27 | * `OPTIMIZE`, `ZORDER` and `VACUUM` settings 28 | 29 | ##### 01-Functions 30 | * Reusable functions created, all pulled out for code readability 31 | 32 | ##### 02-Initialization 33 | * Creates the database if it doesn't exist 34 | * Optional: specify a location for the Database 35 | Dependent on: `00-Config` 36 | 37 | ##### 03-APIs_to_Delta 38 | 39 | **Warehouses API:** Appends the results of each API call and uses a snapshot time to identify each call 40 | 41 | **Query History API:** Upserts / merges new queries to the original table 42 | 43 | **Workflows API:** Upserts / merges new workflows to the original table 44 | 45 | **Dashboards API:** I have tried my best to refer to it as a preview in every step of the code to reflect how this is a preview 46 | 47 | Dependent on: `00-Config`, `01-Functions`, `02-Initialization` 48 | 49 | ##### 04-Metrics 50 | * Dashboards & Queries with owner, useful for finding orphaned records 51 | * Queries to Optimise 52 | * Warehouse Metrics 53 | * Per User Metrics 54 | 55 | 56 | Dependent on: `00-Config` 57 | 58 | 59 | ##### 99-Maintenance 60 | Runs `OPTIMIZE`, `ZORDER` and `VACUUM` against tables 61 | 62 | Dependent on: `00-Config` 63 | 64 | 65 | ##### Troubleshooting 66 | 67 | ###### Cluster OOM 68 | The data used here was very small, even for a Databricks demo workspace with thousands of users. Parts of 03-APIs_to_Delta involve pulling JSON to the driver; in the highly unlikely event of driver OOM you have two choices: 69 | 1. The quick option: select a larger driver 70 | 2. The robust option: loop through the results one page at a time and write each page out with Spark as you go (see the sketch below) 71 |
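A minimal sketch of the robust option, reusing the constants from `00-Config` (`QUERIES_URL`, `AUTH_HEADER`, `PAGE_SIZE`, `DATABASE_NAME`, `QUERIES_TABLE_NAME`); the `_raw` table name is illustrative, and the pagination fields (`res`, `has_next_page`, `next_page_token`) are the ones the Query History API returns:

```python
import json
import requests

def query_history_pages(url, auth_header, first_payload):
    # Yield one page of query history at a time instead of accumulating
    # the whole JSON response on the driver.
    response = requests.get(url, data=json.dumps(first_payload), headers=auth_header).json()
    yield response.get("res") or []
    while response.get("has_next_page"):
        next_payload = {"max_results": PAGE_SIZE, "page_token": response["next_page_token"]}
        response = requests.get(url, data=json.dumps(next_payload), headers=auth_header).json()
        yield response.get("res") or []

for page in query_history_pages(QUERIES_URL, AUTH_HEADER, {"max_results": PAGE_SIZE}):
    if page:
        # Write each page out immediately so only one page is ever held in driver memory.
        spark.createDataFrame(page).write.mode("append").saveAsTable(
            f"{DATABASE_NAME}.{QUERIES_TABLE_NAME}_raw"
        )
```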
72 | ###### Dashboards API not sorting in new queries 73 | There are known issues with the API. Where possible, try to use the Query History API instead. 74 | 75 | ###### Dashboards API has stopped working 76 | This API will go through stages of deprecation, unfortunately with no hard timelines as of yet. Here is the rough process: 77 | 1. When DBSQL Pro comes out, the API will be officially deprecated 78 | 2. It should (*should*) be removed from the documentation at that point 79 | 3. Later it will become totally unavailable 80 | 81 | The Query History API captures data shown below 82 | 83 | ![Query History](https://i.imgur.com/fZaQYzT.png) 84 | -------------------------------------------------------------------------------- /40-observability/dbsql-query-history-sync/README.md: -------------------------------------------------------------------------------- 1 | Sync a Delta table with the query history from a dbsql warehouse. 2 | 3 | As easy as: 4 | 5 | pip install from dist/dbsql_query_history_sync-0.0.1-py3-none-any.whl or dist/dbsql_query_history_sync-0.0.1.tar.gz 6 | 7 | To download the query history without a Databricks environment or PySpark (need to change the dbsql host, warehouse_ids and access token): 8 | ``` 9 | > cd examples 10 | > ./standalone_dbsql_get_query_history_example.py 11 | ``` 12 | 13 | To create a Delta table and continuously sync queries from the dbsql warehouses to it: 14 | 15 | ``` 16 | import dbsql_query_history_sync.delta_sync as delta_sync 17 | 18 | # create the object 19 | udbq = delta_sync.UpdateDBQueries(spark_session=spark, dbx_token=DBX_TOKEN, workspace_url=workspace_url, 20 | warehouse_ids=warehouse_ids_list, earliest_query_ts_ms=dt_ts, table_name=sync_table) 21 | udbq.update_db_repeat(interval_secs=10) 22 | ``` 23 | 24 | See examples/dbsql_query_sync_example.py. 25 | 26 | For questions contact nishant.deshpande@databricks.com. 27 | 28 | -------------------------------------------------------------------------------- /40-observability/dbsql-query-history-sync/__init__.py: -------------------------------------------------------------------------------- 1 | # module 2 | -------------------------------------------------------------------------------- /40-observability/dbsql-query-history-sync/dist/dbsql_query_history_sync-0.0.1-py3-none-any.whl: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/AbePabbathi/lakehouse-tacklebox/9bbbf0a7181193091d887ed57e7394a466c369cf/40-observability/dbsql-query-history-sync/dist/dbsql_query_history_sync-0.0.1-py3-none-any.whl -------------------------------------------------------------------------------- /40-observability/dbsql-query-history-sync/dist/dbsql_query_history_sync-0.0.1.tar.gz: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/AbePabbathi/lakehouse-tacklebox/9bbbf0a7181193091d887ed57e7394a466c369cf/40-observability/dbsql-query-history-sync/dist/dbsql_query_history_sync-0.0.1.tar.gz -------------------------------------------------------------------------------- /40-observability/dbsql-query-history-sync/examples/dbsql_query_sync_example.py: -------------------------------------------------------------------------------- 1 | # Databricks notebook source 2 | import datetime, dateutil 3 | import sys, os 4 | import json 5 | #import dbutils 6 | import time 7 | 8 | 9 | # COMMAND ---------- 10 | 11 | sys.path.append(f"{os.getcwd()}/../src") 12 | #sys.path 13 | 14 | # COMMAND ---------- 15 | 16 | import dbsql_query_history_sync.queries_api as queries_api 17 | import dbsql_query_history_sync.delta_sync as delta_sync 18 | 19 | # COMMAND ---------- 20 | 21 | import importlib 22 | 23 | # COMMAND ---------- 24 | 25 | importlib.reload(queries_api) 26 | importlib.reload(delta_sync) 27 | 28 | # COMMAND ---------- 29 | 30 | # replace as required 31 | workspace_url = 'e2-demo-field-eng.cloud.databricks.com' 32 | warehouse_ids_list = ['475b94ddc7cd5211',] 33
| 34 | # COMMAND ---------- 35 | 36 | # Replace as required 37 | DBX_TOKEN = dbutils.secrets.get(scope='nishant-deshpande', key='dbsql-api-key') # subst your scope + key to query the API 38 | 39 | 40 | # COMMAND ---------- 41 | 42 | # Adjust the history period as required. 43 | dt = datetime.datetime.now() - datetime.timedelta(minutes=5) 44 | #dt = datetime.datetime.now() - datetime.timedelta(hours=1) 45 | print(dt) 46 | dt_ts = int(dt.timestamp() * 1000) 47 | print(dt_ts) 48 | 49 | # COMMAND ---------- 50 | 51 | # get queries as a list 52 | x = queries_api.get_query_history( 53 | dbx_token=DBX_TOKEN, 54 | workspace_url=workspace_url, warehouse_ids=warehouse_ids_list, start_ts_ms=dt_ts, end_ts_ms=None, user_ids=None, statuses=None, stop_fetch_limit=1000) 55 | 56 | # COMMAND ---------- 57 | 58 | t_ts = int(datetime.datetime.now().timestamp()) 59 | sync_table = f'default.query_history_test_{t_ts}' # change to your preferred table name. 60 | print(sync_table) 61 | 62 | # COMMAND ---------- 63 | 64 | # create the object 65 | udbq = delta_sync.UpdateDBQueries(spark_session=spark, dbx_token=DBX_TOKEN, workspace_url=workspace_url, 66 | warehouse_ids=warehouse_ids_list, earliest_query_ts_ms=dt_ts, table_name=sync_table) 67 | 68 | # COMMAND ---------- 69 | 70 | # This updates the table with the query history one time 71 | udbq.update_db() 72 | 73 | # COMMAND ---------- 74 | 75 | # Check the table 76 | display(spark.sql(f""" 77 | select count(1), timestamp(min(query_start_time_ms)/1000), timestamp(max(query_start_time_ms)/1000) 78 | from {sync_table} 79 | """)) 80 | 81 | # COMMAND ---------- 82 | 83 | 84 | 85 | # COMMAND ---------- 86 | 87 | # This will update the underlying table incrementally every 10 seconds. 88 | udbq.update_db_repeat(interval_secs=10) 89 | 90 | # COMMAND ---------- 91 | 92 | # MAGIC %sql 93 | # MAGIC select count(1), timestamp(min(query_start_time_ms)/1000), timestamp(max(query_start_time_ms)/1000) 94 | # MAGIC from default.query_history_test_1695014139 -- update the table name to new table created above 95 | 96 | # COMMAND ---------- 97 | 98 | 99 | -------------------------------------------------------------------------------- /40-observability/dbsql-query-history-sync/examples/standalone_dbsql_get_query_history_example.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | 3 | import sys, os 4 | import datetime, dateutil.parser 5 | import pickle 6 | 7 | sys.path.append('../src') 8 | 9 | import dbsql_query_history_sync.queries_api as queries_api 10 | 11 | def ts(): 12 | return datetime.datetime.now().strftime("%Y%m%d%H%M%S") 13 | 14 | def main(): 15 | # change as required. 
16 | workspace_url = os.getenv("DATABRICKS_HOST", "e2-demo-field-eng.cloud.databricks.com") 17 | warehouse_ids = ["771c2bee30209f22"] 18 | start_ts_ms = dateutil.parser.parse('2023-01-01').timestamp() * 1000 19 | dbx_token = os.getenv('DATABRICKS_ACCESS_TOKEN') 20 | 21 | qh = queries_api.get_query_history( 22 | dbx_token=dbx_token, 23 | workspace_url=workspace_url, 24 | warehouse_ids=warehouse_ids, 25 | start_ts_ms=start_ts_ms) 26 | print(len(qh)) 27 | fname = f'/tmp/queries_{ts()}.pkl' 28 | with open(fname, 'wb') as f: 29 | pickle.dump(qh, f) 30 | print(f"created pkl file {fname}") 31 | 32 | if __name__ == "__main__": 33 | main() 34 | -------------------------------------------------------------------------------- /40-observability/dbsql-query-history-sync/pyproject.toml: -------------------------------------------------------------------------------- 1 | [build-system] 2 | requires = ["hatchling"] 3 | build-backend = "hatchling.build" 4 | 5 | [project] 6 | name = "dbsql-query-history-sync" 7 | dynamic = ["version"] 8 | description = "1> Get dbsql query history. 2> Sync to Delta table." 9 | readme = "README.md" 10 | license = "MIT" 11 | authors = [ 12 | { name = "Nishant Deshpande", email = "nishant.deshpande@databricks.com" }, 13 | ] 14 | classifiers = [ 15 | "Programming Language :: Python :: 3", 16 | "License :: Other/Proprietary License", 17 | "Operating System :: OS Independent", 18 | ] 19 | requires-python = ">=3.7" 20 | dependencies = [ 21 | "requests", 22 | "dateutils" 23 | ] 24 | 25 | [project.urls] 26 | Homepage = "https://github.com/databricks/lakehouse-tacklebox/tree/master/40-observability/dbsql-query-history-sync" 27 | 28 | [tool.hatch.version] 29 | path = "src/__init__.py" 30 | 31 | [tool.hatch.build.targets.sdist] 32 | include = [ 33 | "/src", 34 | ] 35 | -------------------------------------------------------------------------------- /40-observability/dbsql-query-history-sync/src/__init__.py: -------------------------------------------------------------------------------- 1 | VERSION = '0.0.1' 2 | 3 | -------------------------------------------------------------------------------- /40-observability/dbsql-query-history-sync/src/dbsql_query_history_sync/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/AbePabbathi/lakehouse-tacklebox/9bbbf0a7181193091d887ed57e7394a466c369cf/40-observability/dbsql-query-history-sync/src/dbsql_query_history_sync/__init__.py -------------------------------------------------------------------------------- /40-observability/dbsql-query-history-sync/src/dbsql_query_history_sync/delta_sync.py: -------------------------------------------------------------------------------- 1 | import datetime, dateutil 2 | import sys, os 3 | import json 4 | import time 5 | 6 | from delta.tables import DeltaTable 7 | 8 | import pyspark.sql.types as T 9 | import pyspark.sql.functions as F 10 | 11 | from . 
import queries_api 12 | 13 | class UpdateDBQueries: 14 | def __init__(self, spark_session, dbx_token, workspace_url, 15 | warehouse_ids, earliest_query_ts_ms, table_name, 16 | user_ids=None, max_queries_batch=1000): 17 | self.spark = spark_session 18 | self.dbx_token = dbx_token 19 | self.workspace_url = workspace_url 20 | self.warehouse_ids = warehouse_ids 21 | self.earliest_query_ts_ms = earliest_query_ts_ms 22 | self.table_name = table_name 23 | self.user_ids = user_ids 24 | self.max_queries_batch = max_queries_batch 25 | 26 | # hidden state for optimization of update_db_repeat 27 | self._next_update_ts_ms_d = {} 28 | 29 | self._do_init() 30 | 31 | def _do_init(self): 32 | '''Check if table_name exists, and if it does not, create it. 33 | ''' 34 | # We will use schema evolution when we need to add data. 35 | # That way we don't have to assume the returned results keep the same schema. 36 | # Add the minimum columns required for things to work. 37 | self.spark.sql(f""" 38 | create table if not exists {self.table_name} 39 | (query_id string, status string, query_start_time_ms bigint)""") 40 | # This is somewhat 'invasive' but this is not a open source api used by the unsuspecting masses 41 | # so I think this is ok. 42 | self.spark.sql(f"alter table {self.table_name} SET TBLPROPERTIES ('delta.enableDeletionVectors' = true)") 43 | 44 | def _merge_db_queries(self, from_ts_ms): 45 | print(f'_merge_db_queries(from_ts_ms={from_ts_ms})') 46 | start_ts_ms = from_ts_ms if from_ts_ms else self.earliest_query_ts_ms 47 | c = queries_api.sync_query_history( 48 | dbx_token=self.dbx_token, workspace_url=self.workspace_url, 49 | warehouse_ids=self.warehouse_ids, start_ts_ms=start_ts_ms, 50 | query_sink_fn=self._merge_results, sink_batch_size=self.max_queries_batch, 51 | user_ids=self.user_ids) 52 | return c 53 | 54 | def _merge_results(self, query_history): 55 | qh_df = self.spark.createDataFrame(query_history) 56 | ame = self.spark.conf.get('spark.databricks.delta.schema.autoMerge.enabled') 57 | if ame != 'true': 58 | self.spark.conf.set('spark.databricks.delta.schema.autoMerge.enabled', True) 59 | _table = DeltaTable.forName(self.spark, self.table_name) 60 | (_table.alias('t1').merge(qh_df.alias('n1'), 't1.query_id = n1.query_id') 61 | .whenMatchedUpdateAll() 62 | .whenNotMatchedInsertAll() 63 | .execute()) 64 | if ame != 'true': 65 | self.spark.conf.set('spark.databricks.delta.schema.autoMerge.enabled', ame) 66 | qh_df.createOrReplaceTempView('qh_df') 67 | # Optimization. 68 | d = self._get_existing_ts(table_name='qh_df') 69 | if not self._next_update_ts_ms_d.get('pending'): 70 | print(f'_merge_results: updating _next_update_ts_ms_d with {d}') 71 | self._next_update_ts_ms_d.update(d) 72 | else: 73 | print(f'already have a pending {self._next_update_ts_ms_d}') 74 | 75 | def update_db(self): 76 | '''Update the db with queries. Check the table for existing data and query newer queries accordingly. 77 | ''' 78 | d = self._get_existing_ts() 79 | existing_ts_ms = d['pending'] if d['pending'] else d['all'] 80 | print(f'existing_ts_ms: {existing_ts_ms}') 81 | c = self._merge_db_queries(existing_ts_ms) 82 | print(f"got {c} queries") 83 | 84 | def _get_existing_ts(self, table_name=None): 85 | ''' 86 | Get the timestamp that should be used to get the next queries. 
87 | ''' 88 | if not table_name: 89 | table_name = self.table_name 90 | r = self.spark.sql( 91 | f""" 92 | select * from ( 93 | select 'pending' as status, min(query_start_time_ms) as ts_ms 94 | from {table_name} 95 | where lower(status) in ('queued', 'running')) 96 | union all 97 | (select 'all' as status, max(query_start_time_ms) as ts_ms 98 | from {table_name}) 99 | """).collect() 100 | assert(r[0][0] == 'pending' and r[1][0] == 'all') 101 | #ts_ms = r[0][1] if r[0][1] else r[1][1] 102 | #return ts_ms 103 | return dict(r) 104 | 105 | def update_db_repeat(self, interval_secs): 106 | '''Update the db with new queries every interval_secs. 107 | ''' 108 | d = self._get_existing_ts() 109 | print(f'got initial state: {d}') 110 | ts_ms = d['pending'] if d['pending'] else d['all'] 111 | while True: 112 | c = self._merge_db_queries(ts_ms) 113 | # self._next_update_ts_ms is kept updated inside self._merge_db_queries as an optimization 114 | print(f'self._next_update_ts_ms_d: {self._next_update_ts_ms_d}') 115 | ts_ms = self._next_update_ts_ms_d.get('pending') if self._next_update_ts_ms_d.get('pending') else self._next_update_ts_ms_d.get('all') 116 | self._next_update_ts_ms_d = {} 117 | print(f'merged {c} queries, updated ts_ms to {ts_ms}') 118 | print(f'sleeping {interval_secs}...') 119 | time.sleep(interval_secs) 120 | -------------------------------------------------------------------------------- /40-observability/dbsql-query-history-sync/src/dbsql_query_history_sync/queries_api.py: -------------------------------------------------------------------------------- 1 | import requests 2 | import datetime, dateutil 3 | import sys, os 4 | import json 5 | import time 6 | 7 | def sync_query_history(dbx_token, workspace_url, warehouse_ids, start_ts_ms, 8 | query_sink_fn, sink_batch_size, 9 | end_ts_ms=None, user_ids=None, statuses=None, 10 | stop_fetch_limit=2147483647): 11 | '''Pull the query history from the API and call query_sink_fn. 12 | query_sink_fn: for every query_batch_size queries, push them into this sink and 13 | assume that the sink does the right thing with them. 14 | Note that the batch size will not be exactly respected. I.e. as soon as the accumulated 15 | queries go over the query_batch_size, query_sink_fn will be called. 
16 | ''' 17 | print(f'sync_query_history({locals()})') 18 | #workspace_url = "e2-demo-field-eng.cloud.databricks.com" 19 | uri = f"https://{workspace_url}/api/2.0/sql/history/queries" 20 | print(uri) 21 | headers_auth = {"Authorization":f"Bearer {dbx_token}"} 22 | request_dict = {} 23 | request_dict.update({"filter_by":{"warehouse_ids": warehouse_ids}}) 24 | time_filter = {"start_time_ms": start_ts_ms} 25 | if end_ts_ms: 26 | time_filter.update({'end_time_ms': end_ts_ms}) 27 | request_dict['filter_by'].update({"query_start_time_range": time_filter}) 28 | if statuses: 29 | request_dict['filter_by'].update({"statuses": statuses}) 30 | if user_ids: 31 | request_dict['filter_by'].update({"user_ids": user_ids}) 32 | max_single_call_results = min(sink_batch_size, 1000, stop_fetch_limit) 33 | request_dict.update({'include_metrics': "true", "max_results": f"{max_single_call_results}"}) 34 | 35 | ## Convert dict to json 36 | print(f'REQUEST: {request_dict}') 37 | v = json.dumps(request_dict) 38 | 39 | uri = f"https://{workspace_url}/api/2.0/sql/history/queries" 40 | headers_auth = {"Authorization":f"Bearer {dbx_token}"} 41 | 42 | #### Get Query History Results from API 43 | endp_resp = requests.get(uri, data=v, headers=headers_auth).json() 44 | #print(endp_resp) 45 | resp = endp_resp.get("res") 46 | 47 | if resp is None: 48 | print('no results!') 49 | return [] 50 | 51 | next_page = endp_resp.get("next_page_token") 52 | has_next_page = endp_resp.get("has_next_page") 53 | 54 | total_fetch_count = len(resp) 55 | 56 | while has_next_page: 57 | #len_resp = len(resp) 58 | if len(resp) >= sink_batch_size: #or len(resp) + total_count >= stop_fetch_limit: 59 | query_sink_fn(resp) 60 | resp = [] 61 | 62 | if total_fetch_count >= stop_fetch_limit: 63 | break 64 | 65 | print(f"Getting results for next page... {next_page}") 66 | 67 | raw_page_request = { 68 | "include_metrics": "true", 69 | "max_results": max_single_call_results, 70 | "page_token": next_page 71 | } 72 | 73 | json_page_request = json.dumps(raw_page_request) 74 | 75 | current_page_resp = requests.get(uri,data=json_page_request, headers=headers_auth).json() 76 | current_page_queries = current_page_resp.get("res") 77 | 78 | resp.extend(current_page_queries) 79 | total_fetch_count += len(current_page_queries) 80 | 81 | ## Get next page 82 | next_page = current_page_resp.get("next_page_token") 83 | has_next_page = current_page_resp.get("has_next_page") 84 | 85 | if resp: 86 | query_sink_fn(resp) 87 | 88 | return total_fetch_count 89 | 90 | 91 | 92 | def get_query_history(dbx_token, workspace_url, warehouse_ids, start_ts_ms, 93 | end_ts_ms=None, user_ids=None, statuses=None, 94 | stop_fetch_limit=10000): 95 | query_sink = [] 96 | def _fn(qh): 97 | print(f"got {len(qh)} queries") 98 | query_sink.extend(qh) 99 | 100 | total_fetch_count = sync_query_history( 101 | dbx_token, workspace_url, warehouse_ids, start_ts_ms, 102 | _fn, 100, 103 | end_ts_ms=end_ts_ms, user_ids=user_ids, statuses=statuses, 104 | stop_fetch_limit=stop_fetch_limit) 105 | 106 | print(f"total_fetch_count: {total_fetch_count}") 107 | return query_sink 108 | 109 | 110 | 111 | 112 | -------------------------------------------------------------------------------- /CONTRIBUTING.md: -------------------------------------------------------------------------------- 1 | ## Welcome to lakehouse-tacklebox contributing guide 2 | 3 | Thank you for your interest in contributing to the tacklebox! 
--------------------------------------------------------------------------------
/CONTRIBUTING.md:
--------------------------------------------------------------------------------
## Welcome to lakehouse-tacklebox contributing guide

Thank you for your interest in contributing to the tacklebox!

In this guide you will get an overview of the contribution workflow, from creating a PR to having it reviewed and merged.

#### New contributor guide

To get an overview of the project, read the [README](README.md).

Here are the steps to follow to contribute:
- [Fill out this request form](https://forms.gle/qsCTdtBLKj9KuyvY8). This will help the admins know which sub-folder your tool belongs under.
- Create a branch and add your code under the appropriate sub-folder.
- Create a PR.
- Admins will review your code and provide feedback on any requested changes.
- An admin approves and merges the changes.
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
MIT License

Copyright (c) 2023 Abe

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
## LAKEHOUSE TACKLEBOX
Don't go fishing in the lakehouse without the lakehouse-tacklebox.


### Project Description
This repo is a collection of tools that Databricks users can use to deploy, manage, and operate a Databricks-based Lakehouse.


### Using this Project
The tools are organized into sections based on the [well-architected-framework](https://docs.databricks.com/lakehouse-architecture/index.html) pillars:
* [Quickstarts/Evaluation Tools](https://github.com/AbePabbathi/lakehouse-tacklebox/tree/master/00-quickstarts)
* [Migrations](https://github.com/AbePabbathi/lakehouse-tacklebox/tree/master/10-migrations)
* [Operational Excellence](https://github.com/AbePabbathi/lakehouse-tacklebox/tree/master/20-operational-excellence)
* [Performance](https://github.com/AbePabbathi/lakehouse-tacklebox/tree/master/30-performance)
* [Governance/Observability](https://github.com/AbePabbathi/lakehouse-tacklebox/tree/master/40-observability)
* Reliability
* Security

Each tool has its own README.md file with instructions on how to run the code.

A new customer will generally start with the tools in the Quickstarts/Evaluation Tools section and move down the chain to more advanced tools to implement a robust data platform built on the Lakehouse.



### Project Support
Please note that all projects in this repo are provided for your exploration only, and are not formally supported by Databricks with Service Level Agreements (SLAs). They are provided AS-IS and we do not make any guarantees of any kind. Please do not submit a support ticket relating to any issues arising from the use of these projects.
Any issues discovered through the use of this project should be filed as GitHub Issues on the repo. They will be reviewed as time permits, but there are no formal SLAs for support.
--------------------------------------------------------------------------------