├── .DS_Store ├── 00-quickstarts ├── .DS_Store ├── README.md ├── databricks-concurrency │ ├── 01-concurrency-testing-notebook.py │ └── concurrency_framework_1.0.py ├── design-patterns │ ├── .DS_Store │ ├── Advanced Notebooks │ │ ├── .DS_Store │ │ ├── DBT Incremental Model Example │ │ │ ├── .DS_Store │ │ │ └── optimized_dbt │ │ │ │ ├── .DS_Store │ │ │ │ ├── .gitignore │ │ │ │ ├── README.md │ │ │ │ ├── analyses │ │ │ │ └── .gitkeep │ │ │ │ ├── dbt_project.yml │ │ │ │ ├── macros │ │ │ │ ├── .gitkeep │ │ │ │ ├── create_bronze_sensors_identity_table.sql │ │ │ │ └── create_bronze_users_identity_table.sql │ │ │ │ ├── models │ │ │ │ ├── example │ │ │ │ │ ├── gold_hourly_summary_stats_7_day_rolling.sql │ │ │ │ │ ├── gold_smoothed_sensors_3_day_rolling.sql │ │ │ │ │ ├── silver_sensors_scd_1.sql │ │ │ │ │ └── silver_users_scd_1.sql │ │ │ │ └── sources.yml │ │ │ │ ├── seeds │ │ │ │ └── .gitkeep │ │ │ │ ├── snapshots │ │ │ │ ├── .gitkeep │ │ │ │ ├── silver_sensors_scd_2.sql │ │ │ │ └── silver_users_scd_2.sql │ │ │ │ └── tests │ │ │ │ └── .gitkeep │ │ ├── End to End Procedural Migration Pattern │ │ │ └── Procedural Migration Pattern with SCD2 Example.py │ │ ├── Multi-plexing with Autoloader │ │ │ └── Option 1: Actually Multi-plexing tables on write │ │ │ │ ├── Child Job Template.py │ │ │ │ └── Controller Job.py │ │ ├── Parallel Custom Named File Exports │ │ │ ├── Parallel File Exports - Python Version.py │ │ │ └── Parallel File Exports.py │ │ ├── SCD Design Patterns │ │ │ └── Advanced CDC With SCD in Databricks.py │ │ └── airflow_sql_files │ │ │ ├── 0_ddls.sql │ │ │ ├── 1_sensors_table_copy_into.sql │ │ │ ├── 2_sensors_table_merge.sql │ │ │ ├── 3_sensors_table_optimize.sql │ │ │ ├── 4_sensors_table_gold_aggregate.sql │ │ │ └── 5_clean_up_batch.sql │ ├── Step 1 - SQL EDW Pipeline.sql │ ├── Step 10 - Lakehouse Federation.py │ ├── Step 11 - SQL Orchestration in Production.py │ ├── Step 12 - SCD2 - SQL EDW Pipeline.sql │ ├── Step 13 - Migrating Identity Columns.sql │ ├── Step 14 - Using the Query Profile.sql │ ├── Step 15 - Dynamic & Parameterized SQL with Variables.py │ ├── Step 16 - Using System Tables.py │ ├── Step 2 - Optimize your Delta Tables.py │ ├── Step 3 - DLT Version Simple SQL EDW Pipeline.sql │ ├── Step 4 - Create Gold Layer Analytics Tables.sql │ ├── Step 5 - Unified Batch and Streaming.py │ ├── Step 6 - Streaming Table Design Patterns.sql │ ├── Step 7 - COPY INTO Loading Patterns.py │ ├── Step 8 - Liquid Clustering Delta Tables.py │ └── Step 9 - Using SQL Functions.py ├── dlt-cdc │ ├── 01-Retail_DLT_CDC_SQL.sql │ ├── 02-Retail_DLT_CDC_Python.py │ ├── 03-Retail_DLT_CDC_Monitoring.py │ ├── 04-Retail_DLT_CDC_Full.py │ └── _resources │ │ ├── 00-Data_CDC_Generator.py │ │ ├── 01-load-data-quality-dashboard.py │ │ ├── LICENSE.py │ │ ├── NOTICE.py │ │ └── README.py ├── dlt-loans │ ├── 01-DLT-Loan-pipeline-SQL.sql │ ├── 02-DLT-Loan-pipeline-PYTHON.py │ ├── 03-Log-Analysis.sql │ └── _resources │ │ ├── 00-Loan-Data-Generator.py │ │ ├── 01-load-data-quality-dashboard.py │ │ ├── LICENSE.py │ │ ├── NOTICE.py │ │ └── README.py ├── lakehouse-retail-c360 │ ├── 00-churn-introduction-lakehouse.sql │ ├── 01-Data-ingestion │ │ ├── 01.1-DLT-churn-SQL.sql │ │ ├── 01.2-DLT-churn-Python-UDF.py │ │ ├── 01.3-DLT-churn-python.py │ │ └── plain-spark-delta-pipeline │ │ │ └── 01.5-Delta-pipeline-spark-churn.py │ ├── 02-Data-governance │ │ └── 02-UC-data-governance-security-churn.sql │ ├── 03-BI-data-warehousing │ │ └── 03-BI-Datawarehousing.sql │ ├── 04-Data-Science-ML │ │ ├── 04.1-automl-churn-prediction.py │ │ ├── 
04.2-automl-generated-notebook.py │ │ └── 04.3-running-inference.py │ ├── 05-Workflow-orchestration │ │ └── 05-Workflow-orchestration-churn.py │ └── _resources │ │ ├── 00-global-setup.py │ │ ├── 00-prep-data-db-sql.py │ │ ├── 00-setup-uc.py │ │ ├── 00-setup.py │ │ ├── 01-load-data.py │ │ ├── 02-create-churn-tables.py │ │ ├── LICENSE.py │ │ ├── NOTICE.py │ │ └── README.py └── llm-dolly-chatbot │ ├── 01-Dolly-Introduction.py │ ├── 02-Data-preparation.py │ ├── 03-Q&A-prompt-engineering-for-dolly.py │ ├── 04-chat-bot-prompt-engineering-dolly.py │ └── _resources │ ├── 00-global-setup.py │ ├── 00-init.py │ ├── LICENSE.py │ ├── NOTICE.py │ └── README.py ├── 10-migrations ├── .DS_Store ├── 05-uc-upgrade │ ├── 00-Upgrade-database-to-UC.sql │ └── _resources │ │ ├── 00-setup.py │ │ ├── LICENSE.py │ │ ├── NOTICE.py │ │ └── README.py ├── 10-hms-uc-migration.py ├── README.md ├── Using DBSQL Serverless Client Example.py ├── Using DBSQL Serverless Transaction Manager Example.py ├── Using Delta Helpers Notebook Example.py ├── Using Delta Logger Example.py ├── Using Delta Logger.py ├── Using Delta Merge Helpers Example.py ├── Using Streaming Tables and MV Orchestrator.py ├── Using Transaction Manager Example.py └── helperfunctions │ ├── .DS_Store │ ├── __init__.py │ ├── build │ └── lib │ │ ├── datavalidator.py │ │ ├── dbsqlclient.py │ │ ├── dbsqltransactions.py │ │ ├── deltahelpers.py │ │ ├── deltalogger.py │ │ ├── redshiftchecker.py │ │ ├── stmvorchestrator.py │ │ └── transactions.py │ ├── datavalidator.py │ ├── dbsqlclient.py │ ├── dbsqltransactions.py │ ├── deltahelpers.py │ ├── deltalogger.py │ ├── dist │ └── helperfunctions-1.0.0-py3-none-any.whl │ ├── helperfunctions.egg-info │ ├── PKG-INFO │ ├── SOURCES.txt │ ├── dependency_links.txt │ ├── requires.txt │ └── top_level.txt │ ├── redshiftchecker.py │ ├── requirements.txt │ ├── setup.py │ ├── stmvorchestrator.py │ └── transactions.py ├── 20-operational-excellence └── README.md ├── 30-performance ├── README.md ├── TPC-DS Runner │ ├── CONTRIBUTING.md │ ├── README.md │ ├── assets │ │ └── images │ │ │ ├── cluster.png │ │ │ ├── filters.png │ │ │ ├── main_notebook.png │ │ │ ├── run_all.png │ │ │ └── workflow.png │ ├── constants.py │ ├── main.py │ ├── notebooks │ │ ├── create_data_and_queries.scala │ │ └── run_tpcds_benchmarking.py │ └── utils │ │ ├── databricks_client.py │ │ ├── general.py │ │ └── run.py ├── dbsql-query-replay-tool │ ├── 00-Functions.py │ ├── 01-Query_Replay_Tool.py │ └── README.md └── delta-optimizer │ ├── __init__.py │ ├── customer-facing-delta-optimizer │ ├── Query Profile Builder Only.py │ ├── Step 1_ Optimization Strategy Builder.py │ ├── Step 2_ Strategy Runner.py │ ├── Step 3_ Query History and Profile Analyzer.py │ └── deltaoptimizer-1.5.5-py3-none-any.whl │ └── deltaoptimizer │ ├── .DS_Store │ ├── .gitignore │ ├── .vscode │ └── settings.json │ ├── __init__.py │ ├── build │ └── lib │ │ └── deltaoptimizer.py │ ├── deltaoptimizer.egg-info │ ├── PKG-INFO │ ├── SOURCES.txt │ ├── dependency_links.txt │ ├── requires.txt │ └── top_level.txt │ ├── deltaoptimizer.py │ ├── dist │ └── deltaoptimizer-1.5.5-py3-none-any.whl │ └── setup.py ├── 40-observability ├── README.md ├── data-profiling │ ├── 01-create-data-profile.py │ ├── 02-create-data-profile-multi-schema.py │ └── 03-dbfs-profiler.py ├── dbsql-logging │ ├── 00-Config.py │ ├── 01-Functions.py │ ├── 02-Initialization.py │ ├── 03-APIs_to_Delta.py │ ├── 04-Metrics.sql │ ├── 05-Alert_Syntax.sql │ ├── 99-Maintenance.py │ └── README.md ├── dbsql-query-history-sync │ ├── README.md │ ├── 
__init__.py │ ├── dist │ │ ├── dbsql_query_history_sync-0.0.1-py3-none-any.whl │ │ └── dbsql_query_history_sync-0.0.1.tar.gz │ ├── examples │ │ ├── dbsql_query_sync_example.py │ │ └── standalone_dbsql_get_query_history_example.py │ ├── pyproject.toml │ └── src │ │ ├── __init__.py │ │ └── dbsql_query_history_sync │ │ ├── __init__.py │ │ ├── delta_sync.py │ │ └── queries_api.py └── stream-monitoring │ └── 01-stream-monitoring.py ├── CONTRIBUTING.md ├── LICENSE ├── README.md └── concurrency_framework_1.0.py /.DS_Store: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/AbePabbathi/lakehouse-tacklebox/9bbbf0a7181193091d887ed57e7394a466c369cf/.DS_Store -------------------------------------------------------------------------------- /00-quickstarts/.DS_Store: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/AbePabbathi/lakehouse-tacklebox/9bbbf0a7181193091d887ed57e7394a466c369cf/00-quickstarts/.DS_Store -------------------------------------------------------------------------------- /00-quickstarts/README.md: -------------------------------------------------------------------------------- 1 | #### Quickstarts 2 | 3 | This section consists of tools that will help new Customers quickly setup a Lakehouse and get up and running with Databricks. This is not production grade code. This is purely for evaluation and trying out a Lakehouse quickly 4 | 5 | # 6 | 1. [dbdemos.ai](https://www.dbdemos.ai/). 7 | 2. [DBSQL Concurrency Test](https://github.com/AbePabbathi/lakehouse-tacklebox/tree/master/00-quickstarts/databricks-concurrency) 8 | 3. [EDW ETL Demo](https://github.com/databricks/edw-etl-demo) 9 | 4. [TPC-DI ETL Demo](https://github.com/shannon-barrow/databricks-tpc-di). 
Please read the [blog post](https://www.databricks.com/blog/2023/04/14/how-we-performed-etl-one-billion-records-under-1-delta-live-tables.html) for more info -------------------------------------------------------------------------------- /00-quickstarts/design-patterns/.DS_Store: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/AbePabbathi/lakehouse-tacklebox/9bbbf0a7181193091d887ed57e7394a466c369cf/00-quickstarts/design-patterns/.DS_Store -------------------------------------------------------------------------------- /00-quickstarts/design-patterns/Advanced Notebooks/.DS_Store: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/AbePabbathi/lakehouse-tacklebox/9bbbf0a7181193091d887ed57e7394a466c369cf/00-quickstarts/design-patterns/Advanced Notebooks/.DS_Store -------------------------------------------------------------------------------- /00-quickstarts/design-patterns/Advanced Notebooks/DBT Incremental Model Example/.DS_Store: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/AbePabbathi/lakehouse-tacklebox/9bbbf0a7181193091d887ed57e7394a466c369cf/00-quickstarts/design-patterns/Advanced Notebooks/DBT Incremental Model Example/.DS_Store -------------------------------------------------------------------------------- /00-quickstarts/design-patterns/Advanced Notebooks/DBT Incremental Model Example/optimized_dbt/.DS_Store: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/AbePabbathi/lakehouse-tacklebox/9bbbf0a7181193091d887ed57e7394a466c369cf/00-quickstarts/design-patterns/Advanced Notebooks/DBT Incremental Model Example/optimized_dbt/.DS_Store -------------------------------------------------------------------------------- /00-quickstarts/design-patterns/Advanced Notebooks/DBT Incremental Model Example/optimized_dbt/.gitignore: -------------------------------------------------------------------------------- 1 | 2 | target/ 3 | dbt_packages/ 4 | logs/ 5 | -------------------------------------------------------------------------------- /00-quickstarts/design-patterns/Advanced Notebooks/DBT Incremental Model Example/optimized_dbt/README.md: -------------------------------------------------------------------------------- 1 | Welcome to your new dbt project! 
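This project expects a `profiles.yml` profile named `optimized_dbt` (see `dbt_project.yml`). A minimal sketch for the dbt-databricks adapter is shown below; the host, HTTP path, and token are placeholders you replace with your own workspace values:

```yaml
optimized_dbt:
  target: dev
  outputs:
    dev:
      type: databricks
      catalog: main                 # matches models/sources.yml
      schema: dbt_optimized
      host: <your-workspace>.cloud.databricks.com
      http_path: /sql/1.0/warehouses/<warehouse-id>
      token: "{{ env_var('DATABRICKS_TOKEN') }}"
      threads: 4
```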
2 | 3 | ### Using the starter project 4 | 5 | Try running the following commands: 6 | - dbt run 7 | - dbt test 8 | 9 | 10 | ### Resources: 11 | - Learn more about dbt [in the docs](https://docs.getdbt.com/docs/introduction) 12 | - Check out [Discourse](https://discourse.getdbt.com/) for commonly asked questions and answers 13 | - Join the [chat](https://community.getdbt.com/) on Slack for live discussions and support 14 | - Find [dbt events](https://events.getdbt.com) near you 15 | - Check out [the blog](https://blog.getdbt.com/) for the latest news on dbt's development and best practices 16 | -------------------------------------------------------------------------------- /00-quickstarts/design-patterns/Advanced Notebooks/DBT Incremental Model Example/optimized_dbt/analyses/.gitkeep: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/AbePabbathi/lakehouse-tacklebox/9bbbf0a7181193091d887ed57e7394a466c369cf/00-quickstarts/design-patterns/Advanced Notebooks/DBT Incremental Model Example/optimized_dbt/analyses/.gitkeep -------------------------------------------------------------------------------- /00-quickstarts/design-patterns/Advanced Notebooks/DBT Incremental Model Example/optimized_dbt/dbt_project.yml: -------------------------------------------------------------------------------- 1 | # Name your project! Project names should contain only lowercase characters 2 | # and underscores. A good package name should reflect your organization's 3 | # name or the intended use of these models 4 | name: 'optimized_dbt' 5 | version: '1.0.0' 6 | config-version: 2 7 | 8 | # This setting configures which "profile" dbt uses for this project. 9 | profile: 'optimized_dbt' 10 | 11 | model-paths: ["models"] 12 | analysis-paths: ["analyses"] 13 | test-paths: ["tests"] 14 | seed-paths: ["seeds"] 15 | macro-paths: ["macros"] 16 | snapshot-paths: ["snapshots"] 17 | 18 | clean-targets: # directories to be removed by `dbt clean` 19 | - "target" 20 | - "dbt_packages" 21 | 22 | models: 23 | optimized_dbt: 24 | +materialized: table 25 | +tblproperties: {'delta.feature.allowColumnDefaults': 'supported', 'delta.columnMapping.mode' : 'name', 'delta.enableDeletionVectors': 'true'} 26 | 27 | # Optional for logging dbt run info to Delta tables 28 | # on-run-end: "{{ dbt_artifacts.upload_results(results) }}" 29 | 30 | 31 | -------------------------------------------------------------------------------- /00-quickstarts/design-patterns/Advanced Notebooks/DBT Incremental Model Example/optimized_dbt/macros/.gitkeep: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/AbePabbathi/lakehouse-tacklebox/9bbbf0a7181193091d887ed57e7394a466c369cf/00-quickstarts/design-patterns/Advanced Notebooks/DBT Incremental Model Example/optimized_dbt/macros/.gitkeep -------------------------------------------------------------------------------- /00-quickstarts/design-patterns/Advanced Notebooks/DBT Incremental Model Example/optimized_dbt/macros/create_bronze_sensors_identity_table.sql: -------------------------------------------------------------------------------- 1 | {% macro create_bronze_sensors_identity_table() %} 2 | -- Seprately DDL creation for doing things like custom / rigid schema DDL or Identity columns 3 | -- Use Sparingly, as DBT approaches DDL as being done inconjunction with the data 4 | 5 | CREATE TABLE IF NOT EXISTS {{target.catalog}}.{{target.schema}}.bronze_sensors 6 | ( 7 | Id BIGINT GENERATED BY 
DEFAULT AS IDENTITY, 8 | device_id INT, 9 | user_id INT, 10 | calories_burnt DECIMAL(10,2), 11 | miles_walked DECIMAL(10,2), 12 | num_steps DECIMAL(10,2), 13 | timestamp TIMESTAMP, 14 | value STRING, 15 | ingest_timestamp TIMESTAMP 16 | ) 17 | CLUSTER BY (ingest_timestamp) 18 | 19 | {% endmacro %} -------------------------------------------------------------------------------- /00-quickstarts/design-patterns/Advanced Notebooks/DBT Incremental Model Example/optimized_dbt/macros/create_bronze_users_identity_table.sql: -------------------------------------------------------------------------------- 1 | {% macro create_bronze_users_identity_table() %} 2 | -- Seprately DDL creation for doing things like custom / rigid schema DDL or Identity columns 3 | -- Use Sparingly, as DBT approaches DDL as being done inconjunction with the data 4 | 5 | CREATE TABLE IF NOT EXISTS {{target.catalog}}.{{target.schema}}.bronze_users 6 | ( 7 | userid BIGINT GENERATED BY DEFAULT AS IDENTITY (START WITH 1 INCREMENT BY 1), 8 | gender STRING, 9 | age INT, 10 | height DECIMAL(10,2), 11 | weight DECIMAL(10,2), 12 | smoker STRING, 13 | familyhistory STRING, 14 | cholestlevs STRING, 15 | bp STRING, 16 | risk DECIMAL(10,2), 17 | ingest_timestamp TIMESTAMP 18 | ) 19 | CLUSTER BY (ingest_timestamp) 20 | 21 | {% endmacro %} -------------------------------------------------------------------------------- /00-quickstarts/design-patterns/Advanced Notebooks/DBT Incremental Model Example/optimized_dbt/models/example/gold_hourly_summary_stats_7_day_rolling.sql: -------------------------------------------------------------------------------- 1 | {{ 2 | config( 3 | materialized='table', 4 | liquid_clustered_by='device_id, HourBucket' 5 | ) 6 | }} 7 | 8 | -- Get hourly aggregates for last 7 days 9 | SELECT device_id, 10 | date_trunc('hour', timestamp) AS HourBucket, 11 | AVG(num_steps)::float AS AvgNumStepsAcrossDevices, 12 | AVG(calories_burnt)::float AS AvgCaloriesBurnedAcrossDevices, 13 | AVG(miles_walked)::float AS AvgMilesWalkedAcrossDevices 14 | FROM {{ ref('silver_sensors_scd_1') }} 15 | WHERE timestamp >= ((SELECT MAX(timestamp) FROM {{ ref('silver_sensors_scd_1') }}) - INTERVAL '7 DAYS') 16 | GROUP BY device_id, date_trunc('hour', timestamp) 17 | ORDER BY HourBucket 18 | -------------------------------------------------------------------------------- /00-quickstarts/design-patterns/Advanced Notebooks/DBT Incremental Model Example/optimized_dbt/models/example/gold_smoothed_sensors_3_day_rolling.sql: -------------------------------------------------------------------------------- 1 | {{ 2 | config( 3 | materialized='table', 4 | liquid_clustered_by='device_id, HourBucket' 5 | ) 6 | }} 7 | 8 | SELECT 9 | device_id, HourBucket, 10 | -- Number of Steps 11 | (avg(`AvgNumStepsAcrossDevices`) OVER ( 12 | ORDER BY `HourBucket` 13 | ROWS BETWEEN 14 | 4 PRECEDING AND 15 | CURRENT ROW 16 | )) ::float AS SmoothedNumSteps4HourMA, -- 4 hour moving average 17 | 18 | (avg(`AvgNumStepsAcrossDevices`) OVER ( 19 | ORDER BY `HourBucket` 20 | ROWS BETWEEN 21 | 24 PRECEDING AND 22 | CURRENT ROW 23 | ))::float AS SmoothedNumSteps24HourMA --24 hour moving average 24 | , 25 | -- Calories Burned 26 | (avg(`AvgCaloriesBurnedAcrossDevices`) OVER ( 27 | ORDER BY `HourBucket` 28 | ROWS BETWEEN 29 | 4 PRECEDING AND 30 | CURRENT ROW 31 | ))::float AS SmoothedCalsBurned4HourMA, -- 4 hour moving average 32 | 33 | (avg(`AvgCaloriesBurnedAcrossDevices`) OVER ( 34 | ORDER BY `HourBucket` 35 | ROWS BETWEEN 36 | 24 PRECEDING AND 37 | CURRENT ROW 38 | ))::float AS 
SmoothedCalsBurned24HourMA --24 hour moving average, 39 | , 40 | -- Miles Walked 41 | (avg(`AvgMilesWalkedAcrossDevices`) OVER ( 42 | ORDER BY `HourBucket` 43 | ROWS BETWEEN 44 | 4 PRECEDING AND 45 | CURRENT ROW 46 | ))::float AS SmoothedMilesWalked4HourMA, -- 4 hour moving average 47 | 48 | (avg(`AvgMilesWalkedAcrossDevices`) OVER ( 49 | ORDER BY `HourBucket` 50 | ROWS BETWEEN 51 | 24 PRECEDING AND 52 | CURRENT ROW 53 | ))::float AS SmoothedMilesWalked24HourMA --24 hour moving average 54 | FROM {{ ref('gold_hourly_summary_stats_7_day_rolling') }} 55 | WHERE HourBucket >= ((SELECT MAX(HourBucket) FROM {{ ref('gold_hourly_summary_stats_7_day_rolling') }}) - INTERVAL '3 DAYS') -------------------------------------------------------------------------------- /00-quickstarts/design-patterns/Advanced Notebooks/DBT Incremental Model Example/optimized_dbt/models/example/silver_sensors_scd_1.sql: -------------------------------------------------------------------------------- 1 | {{ 2 | config( 3 | materialized='incremental', 4 | unique_key='Id', 5 | incremental_strategy='merge', 6 | tblproperties={'delta.tuneFileSizesForRewrites': 'true', 'delta.feature.allowColumnDefaults': 'supported', 'delta.columnMapping.mode' : 'name'}, 7 | liquid_clustered_by = 'timestamp, id, device_id', 8 | incremental_predicates= ["DBT_INTERNAL_DEST.timestamp > dateadd(day, -7, now())"], 9 | pre_hook=["{{ create_bronze_sensors_identity_table() }}", 10 | 11 | "{{ databricks_copy_into(target_table='bronze_sensors', 12 | source='/databricks-datasets/iot-stream/data-device/', 13 | file_format='json', 14 | expression_list = 'id::bigint AS Id, device_id::integer AS device_id, user_id::integer AS user_id, calories_burnt::decimal(10,2) AS calories_burnt, miles_walked::decimal(10,2) AS miles_walked, num_steps::decimal(10,2) AS num_steps, timestamp::timestamp AS timestamp, value AS value, now() AS ingest_timestamp', 15 | copy_options={'force': 'true'} 16 | ) }}", 17 | 18 | "OPTIMIZE {{target.catalog}}.{{target.schema}}.bronze_sensors", 19 | 20 | "ANALYZE TABLE {{target.catalog}}.{{target.schema}}.bronze_sensors COMPUTE STATISTICS FOR ALL COLUMNS" 21 | ], 22 | post_hook=[ 23 | "OPTIMIZE {{ this }}", 24 | "ANALYZE TABLE {{ this }} COMPUTE STATISTICS FOR ALL COLUMNS;" 25 | ] 26 | ) 27 | }} 28 | 29 | 30 | WITH de_dup ( 31 | SELECT Id::integer, 32 | device_id::integer, 33 | user_id::integer, 34 | calories_burnt::decimal, 35 | miles_walked::decimal, 36 | num_steps::decimal, 37 | timestamp::timestamp, 38 | value::string, 39 | ingest_timestamp, 40 | ROW_NUMBER() OVER(PARTITION BY device_id, user_id, timestamp ORDER BY ingest_timestamp DESC, timestamp DESC) AS DupRank 41 | FROM {{target.catalog}}.{{target.schema}}.bronze_sensors 42 | -- Add Incremental Processing Macro here 43 | {% if is_incremental() %} 44 | 45 | WHERE ingest_timestamp > (SELECT MAX(ingest_timestamp) FROM {{ this }}) 46 | 47 | {% endif %} 48 | ) 49 | 50 | SELECT Id, device_id, user_id, calories_burnt, miles_walked, num_steps, timestamp, value, ingest_timestamp 51 | -- optional 52 | /* 53 | sha2(CONCAT(COALESCE(Id, ''), COALESCE(device_id, ''))) AS composite_key -- use this as the key if you have composite key 54 | */ 55 | FROM de_dup 56 | WHERE DupRank = 1 -------------------------------------------------------------------------------- /00-quickstarts/design-patterns/Advanced Notebooks/DBT Incremental Model Example/optimized_dbt/models/example/silver_users_scd_1.sql: -------------------------------------------------------------------------------- 1 | {{ 2 | config( 3 | 
materialized='incremental', 4 | unique_key='userid', 5 | incremental_strategy='merge', 6 | liquid_clustered_by = 'userid', 7 | pre_hook=["{{ create_bronze_users_identity_table() }}", 8 | 9 | "{{ databricks_copy_into(target_table='bronze_users', 10 | source='/databricks-datasets/iot-stream/data-user/', 11 | file_format='csv', 12 | expression_list = 'userid::bigint AS userid, gender AS gender, age::integer AS age, height::decimal(10,2) AS height, weight::decimal(10,2) AS weight, smoker AS smoker, familyhistory AS familyhistory, cholestlevs AS cholestlevs, bp AS bp, risk::decimal(10,2) AS risk, now() AS ingest_timestamp', 13 | copy_options={'force': 'true'}, 14 | format_options={'header': 'true'} 15 | ) }}", 16 | 17 | "OPTIMIZE {{target.catalog}}.{{target.schema}}.bronze_users" 18 | ], 19 | post_hook=[ 20 | "OPTIMIZE {{ this }}", 21 | "ANALYZE TABLE {{ this }} COMPUTE STATISTICS FOR ALL COLUMNS;" 22 | ] 23 | ) 24 | }} 25 | 26 | 27 | WITH de_dup ( 28 | SELECT 29 | userid::bigint, 30 | gender::string, 31 | age::int, 32 | height::decimal, 33 | weight::decimal, 34 | smoker, 35 | familyhistory, 36 | cholestlevs, 37 | bp, 38 | risk, 39 | ingest_timestamp, 40 | ROW_NUMBER() OVER(PARTITION BY userid ORDER BY ingest_timestamp DESC) AS DupRank 41 | FROM {{target.catalog}}.{{target.schema}}.bronze_users 42 | -- Add Incremental Processing Macro here 43 | {% if is_incremental() %} 44 | 45 | WHERE ingest_timestamp > (SELECT MAX(ingest_timestamp) FROM {{ this }}) 46 | 47 | {% endif %} 48 | ) 49 | SELECT * 50 | FROM de_dup 51 | WHERE DupRank = 1 -------------------------------------------------------------------------------- /00-quickstarts/design-patterns/Advanced Notebooks/DBT Incremental Model Example/optimized_dbt/models/sources.yml: -------------------------------------------------------------------------------- 1 | version: 1 2 | 3 | sources: 4 | - name: dbt_optimized 5 | catalog: main 6 | schema: dbt_optimized 7 | tables: 8 | - name: silver_sensors_scd_1 9 | - name: silver_users_scd_1 -------------------------------------------------------------------------------- /00-quickstarts/design-patterns/Advanced Notebooks/DBT Incremental Model Example/optimized_dbt/seeds/.gitkeep: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/AbePabbathi/lakehouse-tacklebox/9bbbf0a7181193091d887ed57e7394a466c369cf/00-quickstarts/design-patterns/Advanced Notebooks/DBT Incremental Model Example/optimized_dbt/seeds/.gitkeep -------------------------------------------------------------------------------- /00-quickstarts/design-patterns/Advanced Notebooks/DBT Incremental Model Example/optimized_dbt/snapshots/.gitkeep: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/AbePabbathi/lakehouse-tacklebox/9bbbf0a7181193091d887ed57e7394a466c369cf/00-quickstarts/design-patterns/Advanced Notebooks/DBT Incremental Model Example/optimized_dbt/snapshots/.gitkeep -------------------------------------------------------------------------------- /00-quickstarts/design-patterns/Advanced Notebooks/DBT Incremental Model Example/optimized_dbt/snapshots/silver_sensors_scd_2.sql: -------------------------------------------------------------------------------- 1 | {% snapshot sensors_snapshot %} 2 | 3 | {{ 4 | config( 5 | target_schema= target.schema + '_snapshots', 6 | unique_key='Id', 7 | 8 | strategy='timestamp', 9 | updated_at='ingest_timestamp', 10 | ) 11 | }} 12 | 13 | select * from {{ source('dbt_optimized', 
'silver_sensors_scd_1') }} 14 | 15 | {% endsnapshot %} -------------------------------------------------------------------------------- /00-quickstarts/design-patterns/Advanced Notebooks/DBT Incremental Model Example/optimized_dbt/snapshots/silver_users_scd_2.sql: -------------------------------------------------------------------------------- 1 | {% snapshot users_snapshot %} 2 | 3 | {{ 4 | config( 5 | target_schema= target.schema + '_snapshots', 6 | unique_key='userid', 7 | 8 | strategy='check', 9 | check_cols=['age', 'height', 'weight', 'smoker', 'familyhistory', 'cholestlevs', 'bp', 'risk'], 10 | updated_at='ingest_timestamp', 11 | ) 12 | }} 13 | 14 | select * from {{ source('dbt_optimized', 'silver_users_scd_1') }} 15 | 16 | {% endsnapshot %} -------------------------------------------------------------------------------- /00-quickstarts/design-patterns/Advanced Notebooks/DBT Incremental Model Example/optimized_dbt/tests/.gitkeep: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/AbePabbathi/lakehouse-tacklebox/9bbbf0a7181193091d887ed57e7394a466c369cf/00-quickstarts/design-patterns/Advanced Notebooks/DBT Incremental Model Example/optimized_dbt/tests/.gitkeep -------------------------------------------------------------------------------- /00-quickstarts/design-patterns/Advanced Notebooks/Multi-plexing with Autoloader/Option 1: Actually Multi-plexing tables on write/Child Job Template.py: -------------------------------------------------------------------------------- 1 | # Databricks notebook source 2 | # MAGIC %md 3 | # MAGIC 4 | # MAGIC ## Controller notebook 5 | # MAGIC 6 | # MAGIC Identifies and Orcestrates the sub jobs 7 | 8 | # COMMAND ---------- 9 | 10 | from pyspark.sql.functions import * 11 | from pyspark.sql.types import * 12 | 13 | # COMMAND ---------- 14 | 15 | # DBTITLE 1,Step 1: Logic to get unique list of events/sub directories that separate the different streams 16 | # Design considerations 17 | # Ideally the writer of the raw data will separate out event types by folder so you can use globPathFilters to create separate streams 18 | # If ALL events are in one data source, all streams will stream from 1 table and then will be filtered for that event in the stream. 
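# A minimal sketch of the controller side of this pattern (an illustration, not the repo's Controller Job.py):
# fan out one run of this child notebook per event name with a thread pool. The event names and notebook
# path below are assumptions for this example; the widget names match the parameters defined further down.
from concurrent.futures import ThreadPoolExecutor

def run_child(event_name: str) -> str:
    # dbutils.notebook.run blocks until the child notebook finishes and returns its exit value
    return dbutils.notebook.run(
        "Child Job Template",
        3600,
        {
            "Input Root Path": "dbfs:/databricks-datasets/iot-stream/data-device/",
            "Parent Job Name": "iot_multiplexing_demo",
            "Child Task Name": event_name,
        },
    )

event_names = ["00000", "00001", "00002"]  # e.g. one child task per event/file group
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(run_child, event_names))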
To avoid many file listings of the same file, enable useNotifications = true in autoloader 19 | 20 | # COMMAND ---------- 21 | 22 | # DBTITLE 1,Define Params 23 | dbutils.widgets.text("Input Root Path", "") 24 | dbutils.widgets.text("Parent Job Name", "") 25 | dbutils.widgets.text("Child Task Name", "") 26 | 27 | # COMMAND ---------- 28 | 29 | # DBTITLE 1,Get Params 30 | root_input_path = dbutils.widgets.get("Input Root Path") 31 | parent_job_name = dbutils.widgets.get("Parent Job Name") 32 | child_task_name = dbutils.widgets.get("Child Task Name") 33 | 34 | print(f"Root input path: {root_input_path}") 35 | print(f"Parent Job Name: {parent_job_name}") 36 | print(f"Event Task Name: {child_task_name}") 37 | 38 | # COMMAND ---------- 39 | 40 | # DBTITLE 1,Define Dynamic Checkpoint Path 41 | ## Eeach stream needs its own checkpoint, we can dynamically define that for each event/table we want to create / teast out 42 | 43 | checkpoint_path = f"dbfs:/checkpoints//{parent_job_name}/{child_task_name}/" 44 | 45 | # COMMAND ---------- 46 | 47 | # DBTITLE 1,Target Location Definitions 48 | spark.sql("""CREATE DATABASE IF NOT EXISTS iot_multiplexing_demo""") 49 | 50 | # COMMAND ---------- 51 | 52 | # DBTITLE 1,Use Whatever custom event filtering logic is needed 53 | filter_regex_string = "part-" + child_task_name + "*.json*" 54 | 55 | print(filter_regex_string) 56 | 57 | # COMMAND ---------- 58 | 59 | # DBTITLE 1,Read Stream 60 | input_df = (spark 61 | .readStream 62 | .format("text") 63 | .option("multiLine", "true") 64 | .option("pathGlobFilter", filter_regex_string) 65 | .load(root_input_path) 66 | .withColumn("inputFileName", input_file_name()) ## you can filter using .option("globPathFilter") as well here 67 | ) 68 | 69 | # COMMAND ---------- 70 | 71 | # DBTITLE 1,Transformation Logic on any events (can be conditional on event) 72 | transformed_df = (input_df 73 | .withColumn("EventName", lit(child_task_name)) 74 | .selectExpr("value:id::integer AS Id", 75 | "EventName", 76 | "value:user_id::integer AS UserId", 77 | "value:device_id::integer AS DeviceId", 78 | "value:num_steps::decimal AS NumberOfSteps", 79 | "value:miles_walked::decimal AS MilesWalked", 80 | "value:calories_burnt::decimal AS Calories", 81 | "value:timestamp::timestamp AS EventTimestamp", 82 | "current_timestamp() AS IngestionTimestamp", 83 | "inputFileName") 84 | 85 | ) 86 | 87 | # COMMAND ---------- 88 | 89 | # DBTITLE 1,Truncate this child stream and reload from all data 90 | 91 | dbutils.fs.rm(checkpoint_path, recurse=True) 92 | 93 | # COMMAND ---------- 94 | 95 | # DBTITLE 1,Dynamic Write Stream 96 | (transformed_df 97 | .writeStream 98 | .trigger(once=True) 99 | .option("checkpointLocation", checkpoint_path) 100 | .toTable(f"iot_multiplexing_demo.iot_stream_event_{child_task_name}") 101 | ) 102 | -------------------------------------------------------------------------------- /00-quickstarts/design-patterns/Advanced Notebooks/Parallel Custom Named File Exports/Parallel File Exports.py: -------------------------------------------------------------------------------- 1 | # Databricks notebook source 2 | # DBTITLE 1,helper function to dynamically build target path for each file 3 | # MAGIC %scala 4 | # MAGIC 5 | # MAGIC 6 | # MAGIC def getNewFilePath(sourcePath: String): String = { 7 | # MAGIC val source_path = sourcePath; 8 | # MAGIC 9 | # MAGIC val slice_len = source_path.split("/").length - 1; 10 | # MAGIC val source_path_root = source_path.split("/").slice(0, slice_len); 11 | # MAGIC val source_path_file_name = 
source_path.split("/").last; 12 | # MAGIC 13 | # MAGIC // any arbitrary file rename logic 14 | # MAGIC val new_path_file_name = "renamed/"+source_path_file_name; 15 | # MAGIC val new_path = source_path_root.mkString("/") + "/" + new_path_file_name; 16 | # MAGIC 17 | # MAGIC return new_path 18 | # MAGIC } 19 | 20 | # COMMAND ---------- 21 | 22 | # DBTITLE 1,Test New Function to dynamically build target path for each row (file) 23 | # MAGIC %scala 24 | # MAGIC 25 | # MAGIC val test_new_path = getNewFilePath("dbfs:/databricks-datasets/iot-stream/data-device/part-00003.json.gz") 26 | # MAGIC 27 | # MAGIC println(test_new_path) 28 | 29 | # COMMAND ---------- 30 | 31 | # MAGIC %scala 32 | # MAGIC import org.apache.hadoop.fs 33 | # MAGIC 34 | # MAGIC // maybe we need to register access keys here? not sure yet. Still dealing with Auth issues 35 | # MAGIC val conf = new org.apache.spark.util.SerializableConfiguration(sc.hadoopConfiguration) 36 | # MAGIC 37 | # MAGIC val broadcastConf = sc.broadcast(conf) 38 | # MAGIC 39 | # MAGIC print(conf.value) 40 | 41 | # COMMAND ---------- 42 | 43 | # MAGIC %scala 44 | # MAGIC 45 | # MAGIC import org.apache.hadoop.fs._ 46 | # MAGIC 47 | # MAGIC // root bucket of where original files were dropped 48 | # MAGIC val filesToCopy = dbutils.fs.ls("dbfs:/databricks-datasets/iot-stream/data-device/").map(_.path) 49 | # MAGIC 50 | # MAGIC spark.sparkContext.parallelize(filesToCopy).foreachPartition(rows => rows.foreach { 51 | # MAGIC 52 | # MAGIC file => 53 | # MAGIC 54 | # MAGIC println(file) 55 | # MAGIC val fromPath = new Path(file) 56 | # MAGIC 57 | # MAGIC val tempNewPath = getNewFilePath(file) 58 | # MAGIC 59 | # MAGIC val toPath = new Path(tempNewPath) 60 | # MAGIC 61 | # MAGIC val fromFs = toPath.getFileSystem(conf.value) 62 | # MAGIC 63 | # MAGIC val toFs = toPath.getFileSystem(conf.value) 64 | # MAGIC 65 | # MAGIC FileUtil.copy(fromFs, fromPath, toFs, toPath, false, conf.value) 66 | # MAGIC 67 | # MAGIC }) 68 | 69 | # COMMAND ---------- 70 | 71 | # MAGIC %scala 72 | # MAGIC 73 | # MAGIC val filesToCopy = dbutils.fs.ls("dbfs:/databricks-datasets/iot-stream/data-device/").map(_.path) 74 | # MAGIC 75 | # MAGIC 76 | # MAGIC val filesDf = spark.sparkContext.parallelize(filesToCopy).toDF() 77 | # MAGIC 78 | # MAGIC display(filesDf) 79 | -------------------------------------------------------------------------------- /00-quickstarts/design-patterns/Advanced Notebooks/airflow_sql_files/0_ddls.sql: -------------------------------------------------------------------------------- 1 | CREATE DATABASE IF NOT EXISTS main.iot_dashboard_airflow; 2 | 3 | CREATE TABLE IF NOT EXISTS main.iot_dashboard_airflow.bronze_sensors 4 | ( 5 | Id BIGINT GENERATED BY DEFAULT AS IDENTITY, 6 | device_id INT, 7 | user_id INT, 8 | calories_burnt DECIMAL(10,2), 9 | miles_walked DECIMAL(10,2), 10 | num_steps DECIMAL(10,2), 11 | timestamp TIMESTAMP, 12 | value STRING 13 | ) 14 | USING DELTA 15 | TBLPROPERTIES("delta.targetFileSize"="128mb") 16 | ; 17 | 18 | CREATE TABLE IF NOT EXISTS main.iot_dashboard_airflow.bronze_users 19 | ( 20 | userid BIGINT GENERATED BY DEFAULT AS IDENTITY (START WITH 1 INCREMENT BY 1), 21 | gender STRING, 22 | age INT, 23 | height DECIMAL(10,2), 24 | weight DECIMAL(10,2), 25 | smoker STRING, 26 | familyhistory STRING, 27 | cholestlevs STRING, 28 | bp STRING, 29 | risk DECIMAL(10,2), 30 | update_timestamp TIMESTAMP 31 | ) 32 | USING DELTA 33 | TBLPROPERTIES("delta.targetFileSize"="128mb") 34 | ; 35 | 36 | CREATE TABLE IF NOT EXISTS 
main.iot_dashboard_airflow.silver_sensors 37 | ( 38 | Id BIGINT GENERATED BY DEFAULT AS IDENTITY, 39 | device_id INT, 40 | user_id INT, 41 | calories_burnt DECIMAL(10,2), 42 | miles_walked DECIMAL(10,2), 43 | num_steps DECIMAL(10,2), 44 | timestamp TIMESTAMP, 45 | value STRING 46 | ) 47 | USING DELTA 48 | PARTITIONED BY (user_id) 49 | TBLPROPERTIES("delta.targetFileSize"="128mb") 50 | ; 51 | 52 | CREATE TABLE IF NOT EXISTS main.iot_dashboard_airflow.silver_users 53 | ( 54 | userid BIGINT GENERATED BY DEFAULT AS IDENTITY, 55 | gender STRING, 56 | age INT, 57 | height DECIMAL(10,2), 58 | weight DECIMAL(10,2), 59 | smoker STRING, 60 | familyhistory STRING, 61 | cholestlevs STRING, 62 | bp STRING, 63 | risk DECIMAL(10,2), 64 | update_timestamp TIMESTAMP 65 | ) 66 | USING DELTA 67 | TBLPROPERTIES("delta.targetFileSize"="128mb") 68 | ; -------------------------------------------------------------------------------- /00-quickstarts/design-patterns/Advanced Notebooks/airflow_sql_files/1_sensors_table_copy_into.sql: -------------------------------------------------------------------------------- 1 | -- This is ONLY the SELECT expression in COPY INTO without the word "SELECT" 2 | id::bigint AS Id, 3 | device_id::integer AS device_id, 4 | user_id::integer AS user_id, 5 | calories_burnt::decimal(10,2) AS calories_burnt, 6 | miles_walked::decimal(10,2) AS miles_walked, 7 | num_steps::decimal(10,2) AS num_steps, 8 | timestamp::timestamp AS timestamp, 9 | value AS value -- This is a JSON object -------------------------------------------------------------------------------- /00-quickstarts/design-patterns/Advanced Notebooks/airflow_sql_files/2_sensors_table_merge.sql: -------------------------------------------------------------------------------- 1 | MERGE INTO main.iot_dashboard_airflow.silver_sensors AS target 2 | USING ( 3 | WITH de_dup ( 4 | SELECT Id::integer, 5 | device_id::integer, 6 | user_id::integer, 7 | calories_burnt::decimal, 8 | miles_walked::decimal, 9 | num_steps::decimal, 10 | timestamp::timestamp, 11 | value::string, 12 | ROW_NUMBER() OVER(PARTITION BY device_id, user_id, timestamp ORDER BY timestamp DESC) AS DupRank 13 | FROM main.iot_dashboard_airflow.bronze_sensors 14 | ) 15 | 16 | SELECT Id, device_id, user_id, calories_burnt, miles_walked, num_steps, timestamp, value 17 | FROM de_dup 18 | WHERE DupRank = 1 19 | ) AS source 20 | ON source.Id = target.Id 21 | AND source.user_id = target.user_id 22 | AND source.device_id = target.device_id 23 | WHEN MATCHED THEN UPDATE SET 24 | target.calories_burnt = source.calories_burnt, 25 | target.miles_walked = source.miles_walked, 26 | target.num_steps = source.num_steps, 27 | target.timestamp = source.timestamp 28 | WHEN NOT MATCHED THEN INSERT *; -------------------------------------------------------------------------------- /00-quickstarts/design-patterns/Advanced Notebooks/airflow_sql_files/3_sensors_table_optimize.sql: -------------------------------------------------------------------------------- 1 | OPTIMIZE main.iot_dashboard_airflow.silver_sensors ZORDER BY (timestamp); 2 | 3 | ANALYZE TABLE main.iot_dashboard_airflow.silver_sensors COMPUTE STATISTICS FOR ALL COLUMNS; 4 | -------------------------------------------------------------------------------- /00-quickstarts/design-patterns/Advanced Notebooks/airflow_sql_files/4_sensors_table_gold_aggregate.sql: -------------------------------------------------------------------------------- 1 | CREATE OR REPLACE TABLE main.iot_dashboard_airflow.hourly_summary_statistics 2 | AS 3 | SELECT 
user_id, 4 | date_trunc('hour', timestamp) AS HourBucket, 5 | AVG(num_steps)::float AS AvgNumStepsAcrossDevices, 6 | AVG(calories_burnt)::float AS AvgCaloriesBurnedAcrossDevices, 7 | AVG(miles_walked)::float AS AvgMilesWalkedAcrossDevices 8 | FROM main.iot_dashboard_airflow.silver_sensors 9 | GROUP BY user_id,date_trunc('hour', timestamp) 10 | ORDER BY HourBucket; 11 | 12 | CREATE OR REPLACE TABLE main.iot_dashboard_airflow.smoothed_hourly_statistics 13 | AS 14 | SELECT *, 15 | -- Number of Steps 16 | (avg(`AvgNumStepsAcrossDevices`) OVER ( 17 | ORDER BY `HourBucket` 18 | ROWS BETWEEN 19 | 4 PRECEDING AND 20 | CURRENT ROW 21 | )) ::float AS SmoothedNumSteps4HourMA, -- 4 hour moving average 22 | 23 | (avg(`AvgNumStepsAcrossDevices`) OVER ( 24 | ORDER BY `HourBucket` 25 | ROWS BETWEEN 26 | 24 PRECEDING AND 27 | CURRENT ROW 28 | ))::float AS SmoothedNumSteps12HourMA --24 hour moving average 29 | , 30 | -- Calories Burned 31 | (avg(`AvgCaloriesBurnedAcrossDevices`) OVER ( 32 | ORDER BY `HourBucket` 33 | ROWS BETWEEN 34 | 4 PRECEDING AND 35 | CURRENT ROW 36 | ))::float AS SmoothedCalsBurned4HourMA, -- 4 hour moving average 37 | 38 | (avg(`AvgCaloriesBurnedAcrossDevices`) OVER ( 39 | ORDER BY `HourBucket` 40 | ROWS BETWEEN 41 | 24 PRECEDING AND 42 | CURRENT ROW 43 | ))::float AS SmoothedCalsBurned12HourMA --24 hour moving average, 44 | , 45 | -- Miles Walked 46 | (avg(`AvgMilesWalkedAcrossDevices`) OVER ( 47 | ORDER BY `HourBucket` 48 | ROWS BETWEEN 49 | 4 PRECEDING AND 50 | CURRENT ROW 51 | ))::float AS SmoothedMilesWalked4HourMA, -- 4 hour moving average 52 | 53 | (avg(`AvgMilesWalkedAcrossDevices`) OVER ( 54 | ORDER BY `HourBucket` 55 | ROWS BETWEEN 56 | 24 PRECEDING AND 57 | CURRENT ROW 58 | ))::float AS SmoothedMilesWalked12HourMA --24 hour moving average 59 | FROM main.iot_dashboard_airflow.hourly_summary_statistics; -------------------------------------------------------------------------------- /00-quickstarts/design-patterns/Advanced Notebooks/airflow_sql_files/5_clean_up_batch.sql: -------------------------------------------------------------------------------- 1 | TRUNCATE TABLE main.iot_dashboard_airflow.bronze_sensors; -------------------------------------------------------------------------------- /00-quickstarts/design-patterns/Step 10 - Lakehouse Federation.py: -------------------------------------------------------------------------------- 1 | # Databricks notebook source 2 | # Using Lakehouse Federation for a single Pane of Glass 3 | 4 | ## Topics 5 | 6 | 1. How to use Lakehouse Federation 7 | 2. Setting up a new database 8 | 3. Performance management / considerations 9 | 4. Limitations 10 | -------------------------------------------------------------------------------- /00-quickstarts/design-patterns/Step 11 - SQL Orchestration in Production.py: -------------------------------------------------------------------------------- 1 | # Databricks notebook source 2 | ## Orchestrating SQL Pipelines in Production 3 | 4 | 1. SQL Task Types 5 | 2. Airflow Operator 6 | 3. DBSQL REST API / Pushdown Client 7 | 8 | # COMMAND ---------- 9 | 10 | # DBTITLE 1,Airflow 11 | ## See the Advanced Notebooks section to find the collection of SQL files for the Airflow Demo; a minimal sketch of submitting one of those files through the SQL Statement Execution API is shown below.
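# A minimal sketch (not the repo's DBSQL Serverless Client) of submitting one of the airflow_sql_files
# to a SQL warehouse with the Databricks SQL Statement Execution API. The host, token secret, warehouse id,
# and file path below are placeholders; the API executes one statement per call.
import requests

host = "https://<your-workspace>.cloud.databricks.com"
token = dbutils.secrets.get("my_scope", "dbsql_token")  # assumed secret scope/key
warehouse_id = "<warehouse-id>"

with open("airflow_sql_files/5_clean_up_batch.sql") as f:
    statement = f.read()

resp = requests.post(
    f"{host}/api/2.0/sql/statements/",
    headers={"Authorization": f"Bearer {token}"},
    json={"warehouse_id": warehouse_id, "statement": statement, "wait_timeout": "30s"},
)
resp.raise_for_status()
print(resp.json()["status"])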
12 | ## Then navigate to https://medium.com/dbsql-sme-engineering and find the Airflow Blog for the deep dive 13 | -------------------------------------------------------------------------------- /00-quickstarts/design-patterns/Step 13 - Migrating Identity Columns.sql: -------------------------------------------------------------------------------- 1 | -- Databricks notebook source 2 | -- MAGIC %md 3 | -- MAGIC 4 | -- MAGIC ## How to migrate IDENTITY columns from a Data Warehouse to DBSQL / Delta Lakehouse 5 | -- MAGIC 6 | -- MAGIC ## Summary 7 | -- MAGIC Quick notebook showing how to properly migrate tables from a data warehouse to a Delta table where you want to retain the values of existing IDENTITY key values and ensure that the IDENTITY generation picks up from the most recent IDENTITY column value 8 | 9 | -- COMMAND ---------- 10 | 11 | -- MAGIC %md 12 | -- MAGIC 13 | -- MAGIC 14 | -- MAGIC ### Steps to migrate key properly 15 | -- MAGIC 16 | -- MAGIC 1. Create a table with id columns such as: GENERATED BY DEFAULT AS IDENTITY (START WITH 1 INCREMENT BY 1) 17 | -- MAGIC 2. Backfill existing data warehouse tables with an INSERT INTO / MERGE from a snapshot of the datawarehouse table 18 | -- MAGIC 3. Run command: ALTER TABLE main.default.identity_test ALTER COLUMN id SYNC IDENTITY; to ensure that the newly inserted values pick up where the data warehouse left off on key generation 19 | -- MAGIC 4. Insert new identity values with new pipelines (or leave out column and let it auto-generate) 20 | 21 | -- COMMAND ---------- 22 | 23 | -- DBTITLE 1,Simple End to End Example 24 | 25 | CREATE OR REPLACE TABLE main.default.identity_test ( 26 | id BIGINT GENERATED BY DEFAULT AS IDENTITY (START WITH 1 INCREMENT BY 1), 27 | name STRING DEFAULT 'cody' 28 | ) 29 | TBLPROPERTIES('delta.feature.allowColumnDefaults' = 'supported', 'delta.columnMapping.mode' = 'name') 30 | ; 31 | 32 | -- Simulate EDW migration load with existing keys 33 | INSERT INTO main.default.identity_test (id,name) 34 | VALUES (5, 'cody'), (6, 'davis'); 35 | 36 | 37 | SELECT * FROM main.default.identity_test; 38 | 39 | 40 | -- Simulate new load incrmentally 41 | 42 | INSERT INTO main.default.identity_test (name) 43 | VALUES ('cody_new'), ('davis_new'); 44 | 45 | -- BAD! ID keys get messed up 46 | SELECT * FROM main.default.identity_test; 47 | 48 | -- FIX 49 | ALTER TABLE main.default.identity_test ALTER COLUMN id SYNC IDENTITY; 50 | 51 | -- try again 52 | INSERT INTO main.default.identity_test (name) 53 | VALUES ('cody_fix'), ('davis_fix'); 54 | 55 | SELECT * FROM main.default.identity_test; 56 | 57 | 58 | -------------------------------------------------------------------------------- /00-quickstarts/design-patterns/Step 3 - DLT Version Simple SQL EDW Pipeline.sql: -------------------------------------------------------------------------------- 1 | -- Databricks notebook source 2 | -- MAGIC %md 3 | -- MAGIC 4 | -- MAGIC # This notebook generates a full data pipeline from databricks dataset - iot-stream 5 | -- MAGIC 6 | -- MAGIC #### Define the SQL - Add as a library to a DLT pipeline, and run the pipeline! 
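-- MAGIC
-- MAGIC One way to wire this notebook in as a pipeline library is a Databricks Asset Bundle resource. A minimal sketch is below, assuming a bundle deployment; the pipeline name, target schema, and notebook path are placeholders:
-- MAGIC
-- MAGIC ```yaml
-- MAGIC resources:
-- MAGIC   pipelines:
-- MAGIC     iot_dashboard_dlt:
-- MAGIC       name: iot-dashboard-dlt-demo
-- MAGIC       target: iot_dashboard              # schema the tables are created in
-- MAGIC       continuous: false
-- MAGIC       libraries:
-- MAGIC         - notebook:
-- MAGIC             path: "./Step 3 - DLT Version Simple SQL EDW Pipeline.sql"
-- MAGIC ```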
7 | -- MAGIC 8 | -- MAGIC ## This creates 2 tables: 9 | -- MAGIC 10 | -- MAGIC Database: iot_dashboard 11 | -- MAGIC 12 | -- MAGIC Tables: silver_sensors, silver_users 13 | -- MAGIC 14 | -- MAGIC Params: StartOver (Yes/No) - allows user to truncate and reload pipeline 15 | 16 | -- COMMAND ---------- 17 | 18 | -- MAGIC %md 19 | -- MAGIC 20 | -- MAGIC ## This is built as a library for a Delta Live Tables pipeline 21 | 22 | -- COMMAND ---------- 23 | 24 | -- MAGIC %md 25 | -- MAGIC ## Exhaustive list of all cloud_files STREAMING LIVE TABLE options 26 | -- MAGIC https://docs.databricks.com/data-engineering/delta-live-tables/delta-live-tables-incremental-data.html#language-sql 27 | 28 | -- COMMAND ---------- 29 | 30 | -- DBTITLE 1,Incrementally Ingest Source Data from Raw Files 31 | --No longer need a separate copy into statement, you can use the Databricks Autoloader directly in SQL by using the cloud_files function 32 | -- OPTIONALLY defined DDL in the table definition 33 | CREATE OR REFRESH STREAMING LIVE TABLE bronze_sensors 34 | ( 35 | Id BIGINT GENERATED BY DEFAULT AS IDENTITY, 36 | device_id INT, 37 | user_id INT, 38 | calories_burnt DECIMAL(10,2), 39 | miles_walked DECIMAL(10,2), 40 | num_steps DECIMAL(10,2), 41 | timestamp TIMESTAMP, 42 | value STRING, 43 | CONSTRAINT has_device EXPECT (device_id IS NOT NULL) ON VIOLATION DROP ROW , 44 | CONSTRAINT has_user EXPECT(user_id IS NOT NULL) ON VIOLATION DROP ROW, 45 | CONSTRAINT has_data EXPECT(num_steps IS NOT NULL) -- with no violation rule, nothing happens, we just track quality in DLT 46 | ) 47 | TBLPROPERTIES("delta.targetFileSize"="128mb", 48 | "pipelines.autoOptimize.managed"="true", 49 | "pipelines.autoOptimize.zOrderCols"="create_timestamp,device_id,user_id", 50 | "pipelines.trigger.interval"="1 hour") 51 | AS 52 | SELECT 53 | id::bigint AS Id, 54 | device_id::integer AS device_id, 55 | user_id::integer AS user_id, 56 | calories_burnt::decimal(10,2) AS calories_burnt, 57 | miles_walked::decimal(10,2) AS miles_walked, 58 | num_steps::decimal(10,2) AS num_steps, 59 | timestamp::timestamp AS timestamp, 60 | value AS value 61 | FROM cloud_files("/databricks-datasets/iot-stream/data-device/", "json") 62 | -- First 2 params of cloud_files are always input file path and format, then rest are map object of optional params 63 | -- To make incremental - Add STREAMING keyword before LIVE TABLE 64 | ; 65 | 66 | 67 | 68 | -- COMMAND ---------- 69 | 70 | -- MAGIC %md 71 | -- MAGIC 72 | -- MAGIC ## Process Change data with updates or deletes 73 | -- MAGIC API Docs: https://docs.databricks.com/data-engineering/delta-live-tables/delta-live-tables-cdc.html 74 | -- MAGIC 75 | -- MAGIC 76 | -- MAGIC ### Automatically store change as SCD 1 or SCD 2 Type changes 77 | -- MAGIC 78 | -- MAGIC SCD 1/2 Docs: https://docs.databricks.com/data-engineering/delta-live-tables/delta-live-tables-cdc.html#language-sql 79 | 80 | -- COMMAND ---------- 81 | 82 | -- DBTITLE 1,Incremental upsert data into target silver layer 83 | -- Create and populate the target table. 
84 | CREATE OR REFRESH STREAMING LIVE TABLE silver_sensors 85 | ( 86 | Id BIGINT GENERATED BY DEFAULT AS IDENTITY, 87 | device_id INT, 88 | user_id INT, 89 | calories_burnt DECIMAL(10,2), 90 | miles_walked DECIMAL(10,2), 91 | num_steps DECIMAL(10,2), 92 | timestamp TIMESTAMP, 93 | value STRING) 94 | TBLPROPERTIES("delta.targetFileSize"="128mb", 95 | "quality"="silver", 96 | "pipelines.autoOptimize.managed"="true", 97 | "pipelines.autoOptimize.zOrderCols"="create_timestamp,device_id,user_id", 98 | "pipelines.trigger.interval"="1 hour" 99 | ); 100 | 101 | -- COMMAND ---------- 102 | 103 | -- DBTITLE 1,Actually run CDC Transformation Operation 104 | APPLY CHANGES INTO 105 | LIVE.silver_sensors 106 | FROM 107 | STREAM(LIVE.bronze_sensors) -- use STREAM to get change feed, use LIVE to get DAG source table 108 | KEYS 109 | (user_id, device_id) -- Identical to the ON statement in MERGE, can be 1 of many keys 110 | --APPLY AS DELETE WHEN 111 | -- operation = "DELETE" --Need if you have a operation columnd that specifies "APPEND"/"UPDATE"/"DELETE" like true CDC data 112 | SEQUENCE BY 113 | timestamp 114 | COLUMNS * EXCEPT 115 | (Id) --For auto increment keys, exclude the updates cause you dont want to replace Ids of auto_id columns 116 | -- Optionally exclude columns like metadata or operation types, by default, UPDATE * is the operation 117 | STORED AS 118 | SCD TYPE 1 -- [SCD TYPE 2] will expire updated originals 119 | 120 | -- COMMAND ---------- 121 | 122 | -- MAGIC %md 123 | -- MAGIC 124 | -- MAGIC ## FULL REFRESH EXAMPLE - Ingest Full User Data Set Each Load 125 | 126 | -- COMMAND ---------- 127 | 128 | -- DBTITLE 1,FulltIngest Raw User Data 129 | CREATE OR REPLACE STREAMING LIVE TABLE silver_users 130 | ( -- REPLACE truncates the checkpoint each time and loads from scratch every time 131 | userid BIGINT GENERATED BY DEFAULT AS IDENTITY, 132 | gender STRING, 133 | age INT, 134 | height DECIMAL(10,2), 135 | weight DECIMAL(10,2), 136 | smoker STRING, 137 | familyhistory STRING, 138 | cholestlevs STRING, 139 | bp STRING, 140 | risk DECIMAL(10,2), 141 | update_timestamp TIMESTAMP, 142 | CONSTRAINT has_user EXPECT (userid IS NOT NULL) ON VIOLATION DROP ROW 143 | ) 144 | TBLPROPERTIES("delta.targetFileSize"="128mb", 145 | "quality"="silver", 146 | "pipelines.autoOptimize.managed"="true", 147 | "pipelines.autoOptimize.zOrderCols"="userid", 148 | "pipelines.trigger.interval"="1 day" 149 | ) 150 | AS (SELECT 151 | userid::bigint AS userid, 152 | gender AS gender, 153 | age::integer AS age, 154 | height::decimal(10,2) AS height, 155 | weight::decimal(10,2) AS weight, 156 | smoker AS smoker, 157 | familyhistory AS familyhistory, 158 | cholestlevs AS cholestlevs, 159 | bp AS bp, 160 | risk::decimal(10,2) AS risk, 161 | current_timestamp() AS update_timestamp 162 | FROM cloud_files("/databricks-datasets/iot-stream/data-user/","csv", map( 'header', 'true')) 163 | ) 164 | ; 165 | -------------------------------------------------------------------------------- /00-quickstarts/design-patterns/Step 4 - Create Gold Layer Analytics Tables.sql: -------------------------------------------------------------------------------- 1 | -- Databricks notebook source 2 | -- MAGIC %md 3 | -- MAGIC 4 | -- MAGIC ## Create Gold Layer Tables that aggregate and clean up the data for BI / ML 5 | 6 | -- COMMAND ---------- 7 | 8 | CREATE OR REPLACE TABLE iot_dashboard.hourly_summary_statistics 9 | AS 10 | SELECT user_id, 11 | date_trunc('hour', timestamp) AS HourBucket, 12 | AVG(num_steps)::float AS AvgNumStepsAcrossDevices, 13 | 
AVG(calories_burnt)::float AS AvgCaloriesBurnedAcrossDevices, 14 | AVG(miles_walked)::float AS AvgMilesWalkedAcrossDevices 15 | FROM iot_dashboard.silver_sensors 16 | GROUP BY user_id,date_trunc('hour', timestamp) 17 | ORDER BY HourBucket; 18 | 19 | 20 | CREATE OR REPLACE TABLE iot_dashboard.smoothed_hourly_statistics 21 | AS 22 | SELECT *, 23 | -- Number of Steps 24 | (avg(`AvgNumStepsAcrossDevices`) OVER ( 25 | ORDER BY `HourBucket` 26 | ROWS BETWEEN 27 | 4 PRECEDING AND 28 | CURRENT ROW 29 | )) ::float AS SmoothedNumSteps4HourMA, -- 4 hour moving average 30 | 31 | (avg(`AvgNumStepsAcrossDevices`) OVER ( 32 | ORDER BY `HourBucket` 33 | ROWS BETWEEN 34 | 24 PRECEDING AND 35 | CURRENT ROW 36 | ))::float AS SmoothedNumSteps12HourMA --24 hour moving average 37 | , 38 | -- Calories Burned 39 | (avg(`AvgCaloriesBurnedAcrossDevices`) OVER ( 40 | ORDER BY `HourBucket` 41 | ROWS BETWEEN 42 | 4 PRECEDING AND 43 | CURRENT ROW 44 | ))::float AS SmoothedCalsBurned4HourMA, -- 4 hour moving average 45 | 46 | (avg(`AvgCaloriesBurnedAcrossDevices`) OVER ( 47 | ORDER BY `HourBucket` 48 | ROWS BETWEEN 49 | 24 PRECEDING AND 50 | CURRENT ROW 51 | ))::float AS SmoothedCalsBurned12HourMA --24 hour moving average, 52 | , 53 | -- Miles Walked 54 | (avg(`AvgMilesWalkedAcrossDevices`) OVER ( 55 | ORDER BY `HourBucket` 56 | ROWS BETWEEN 57 | 4 PRECEDING AND 58 | CURRENT ROW 59 | ))::float AS SmoothedMilesWalked4HourMA, -- 4 hour moving average 60 | 61 | (avg(`AvgMilesWalkedAcrossDevices`) OVER ( 62 | ORDER BY `HourBucket` 63 | ROWS BETWEEN 64 | 24 PRECEDING AND 65 | CURRENT ROW 66 | ))::float AS SmoothedMilesWalked12HourMA --24 hour moving average 67 | FROM iot_dashboard.hourly_summary_statistics 68 | 69 | -- COMMAND ---------- 70 | 71 | -- DBTITLE 1,Build Visuals in DBSQL, Directly in Notebook, or in any BI tool! 72 | SELECT * FROM iot_dashboard.smoothed_hourly_statistics WHERE user_id = 1 73 | -------------------------------------------------------------------------------- /00-quickstarts/design-patterns/Step 7 - COPY INTO Loading Patterns.py: -------------------------------------------------------------------------------- 1 | # Databricks notebook source 2 | # MAGIC %md 3 | # MAGIC 4 | # MAGIC ## Materlized Views 5 | # MAGIC 6 | # MAGIC Patterns and Best Practices 7 | # MAGIC 8 | # MAGIC 9 | # MAGIC 1. Create Materialized View 10 | # MAGIC 2. Optimize Materialized View 11 | # MAGIC 3. Check / Monitor Performance of MV 12 | # MAGIC 4. When to NOT use MVs 13 | -------------------------------------------------------------------------------- /00-quickstarts/design-patterns/Step 8 - Liquid Clustering Delta Tables.py: -------------------------------------------------------------------------------- 1 | # Databricks notebook source 2 | # MAGIC %md 3 | # MAGIC 4 | # MAGIC ## Deep Dive on Liquid Clustering Delta Tables 5 | # MAGIC 6 | # MAGIC ### Topics 7 | # MAGIC 8 | # MAGIC 1. How to create and optimize liquid tables 9 | # MAGIC 2. How to merge/update/delete data from liquid tables 10 | # MAGIC 3. VACUUM/PURGE/REORG on Liqiud tables 11 | # MAGIC 4. Performance Measurement 12 | # MAGIC 5. When to use ZORDER/Partitions vs Liquid 13 | # MAGIC 6. Liquid Limitations 14 | -------------------------------------------------------------------------------- /00-quickstarts/design-patterns/Step 9 - Using SQL Functions.py: -------------------------------------------------------------------------------- 1 | # Databricks notebook source 2 | # SQL Functions Topic Deep Dive 3 | 4 | ## Topics 5 | 6 | 1. How to use SQL functions 7 | 2. 
Different languages - Python/SQL 8 | 3. Variables, etc. 9 | 4. Using Models in SQL functions 10 | 5. AI Functions 11 | -------------------------------------------------------------------------------- /00-quickstarts/dlt-cdc/04-Retail_DLT_CDC_Full.py: -------------------------------------------------------------------------------- 1 | # Databricks notebook source 2 | # MAGIC %md 3 | # MAGIC 4 | # MAGIC # Implementing a CDC pipeline using DLT for N tables 5 | # MAGIC 6 | # MAGIC We saw previously how to setup a CDC pipeline for a single table. However, real-life database typically involve multiple tables, with 1 CDC folder per table. 7 | # MAGIC 8 | # MAGIC Operating and ingesting all these tables at scale is quite challenging. You need to start multiple table ingestion at the same time, working with threads, handling errors, restart where you stopped, deal with merge manually. 9 | # MAGIC 10 | # MAGIC Thankfully, DLT takes care of that for you. We can leverage python loops to naturally iterate over the folders (see the [documentation](https://docs.databricks.com/data-engineering/delta-live-tables/delta-live-tables-cookbook.html#programmatically-manage-and-create-multiple-live-tables) for more details) 11 | # MAGIC 12 | # MAGIC DLT engine will handle the parallelization whenever possible, and autoscale based on your data volume. 13 | # MAGIC 14 | # MAGIC 15 | # MAGIC 16 | # MAGIC 17 | # MAGIC 18 | # MAGIC 23 | 24 | # COMMAND ---------- 25 | 26 | # DBTITLE 1,2 tables in our cdc_raw: customers and transactions 27 | # MAGIC %fs ls /tmp/demo/cdc_raw 28 | 29 | # COMMAND ---------- 30 | 31 | #Let's loop over all the folders and dynamically generate our DLT pipeline. 32 | import dlt 33 | from pyspark.sql.functions import * 34 | 35 | 36 | def create_pipeline(table_name): 37 | print(f"Building DLT CDC pipeline for {table_name}") 38 | 39 | ##Raw CDC Table 40 | # .option("cloudFiles.maxFilesPerTrigger", "1") 41 | @dlt.create_table(name=table_name+"_cdc", 42 | comment = "New "+table_name+" data incrementally ingested from cloud object storage landing zone") 43 | def raw_cdc(): 44 | return ( 45 | spark.readStream.format("cloudFiles") 46 | .option("cloudFiles.format", "json") 47 | .option("cloudFiles.inferColumnTypes", "true") 48 | .load("/demos/dlt/cdc_raw/"+table_name)) 49 | 50 | ##Clean CDC input and track quality with expectations 51 | @dlt.create_view(name=table_name+"_cdc_clean", 52 | comment="Cleansed cdc data, tracking data quality with a view. 
We ensure valid JSON, id and operation type") 53 | @dlt.expect_or_drop("no_rescued_data", "_rescued_data IS NULL") 54 | @dlt.expect_or_drop("valid_id", "id IS NOT NULL") 55 | @dlt.expect_or_drop("valid_operation", "operation IN ('APPEND', 'DELETE', 'UPDATE')") 56 | def raw_cdc_clean(): 57 | return dlt.read_stream(table_name+"_cdc") 58 | 59 | 60 | ##Materialize the final table 61 | dlt.create_target_table(name=table_name, comment="Clean, materialized "+table_name) 62 | dlt.apply_changes(target = table_name, #The customer table being materialized 63 | source = table_name+"_cdc_clean", #the incoming CDC 64 | keys = ["id"], #what we'll be using to match the rows to upsert 65 | sequence_by = col("operation_date"), #we deduplicate by operation date, keeping the most recent value 66 | ignore_null_updates = False, 67 | apply_as_deletes = expr("operation = 'DELETE'"), #DELETE condition 68 | except_column_list = ["operation", "operation_date", "_rescued_data"]) #in addition we drop metadata columns 69 | 70 | 71 | for folder in dbutils.fs.ls("/demos/dlt/cdc_raw"): 72 | table_name = folder.name[:-1] 73 | create_pipeline(table_name) 74 | 75 | # COMMAND ---------- 76 | 77 | # DBTITLE 1,Add final layer joining 2 tables 78 | @dlt.create_table(name="transactions_per_customers", 79 | comment = "table join between users and transactions for further analysis") 80 | def raw_cdc(): 81 | return dlt.read("transactions").join(dlt.read("customers"), ["id"], "left") 82 | 83 | # COMMAND ---------- 84 | 85 | # MAGIC %md 86 | # MAGIC ### Conclusion 87 | # MAGIC We can now scale our CDC pipeline to N tables using Python factorization. This gives us infinite possibilities and abstraction levels in our DLT pipelines. 88 | # MAGIC 89 | # MAGIC DLT handles all the hard work for us so that we can focus on business transformation and drastically accelerate the DE team: 90 | # MAGIC - simplify file ingestion with the autoloader 91 | # MAGIC - track data quality using expectations 92 | # MAGIC - simplify all operations including upserts with APPLY CHANGES 93 | # MAGIC - process all our tables in parallel 94 | # MAGIC - autoscale based on the amount of data 95 | # MAGIC 96 | # MAGIC DLT gives more power to SQL-only users, letting them build advanced data pipelines without requiring strong data engineering skills. 97 | -------------------------------------------------------------------------------- /00-quickstarts/dlt-cdc/_resources/00-Data_CDC_Generator.py: -------------------------------------------------------------------------------- 1 | # Databricks notebook source 2 | # MAGIC %pip install Faker 3 | 4 | # COMMAND ---------- 5 | 6 | # MAGIC %md 7 | # MAGIC 8 | # MAGIC ### Retail CDC Data Generator 9 | # MAGIC 10 | # MAGIC Run this notebook to create new data. It's added in the pipeline to make sure data exists when we run it. 11 | # MAGIC 12 | # MAGIC You can also run it in the background to periodically add data.
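# MAGIC
# MAGIC For example, a minimal sketch of a background driver loop (the relative notebook path and the 10-minute interval below are assumptions, adjust them to your workspace):
# MAGIC
# MAGIC ```
# MAGIC import time
# MAGIC
# MAGIC # from another notebook, re-run the generator every 10 minutes to keep appending CDC files
# MAGIC while True:
# MAGIC     dbutils.notebook.run("./_resources/00-Data_CDC_Generator", 600)
# MAGIC     time.sleep(600)
# MAGIC ```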
13 | # MAGIC 14 | # MAGIC 15 | # MAGIC 16 | 17 | # COMMAND ---------- 18 | 19 | folder = "/demos/dlt/cdc_raw" 20 | #dbutils.fs.rm(folder, True) 21 | try: 22 | dbutils.fs.ls(folder) 23 | dbutils.fs.ls(folder+"/transactions") 24 | dbutils.fs.ls(folder+"/customers") 25 | except: 26 | print("folder doesn't exists, generating the data...") 27 | from pyspark.sql import functions as F 28 | from faker import Faker 29 | from collections import OrderedDict 30 | import uuid 31 | fake = Faker() 32 | import random 33 | 34 | 35 | fake_firstname = F.udf(fake.first_name) 36 | fake_lastname = F.udf(fake.last_name) 37 | fake_email = F.udf(fake.ascii_company_email) 38 | fake_date = F.udf(lambda:fake.date_time_this_month().strftime("%m-%d-%Y %H:%M:%S")) 39 | fake_address = F.udf(fake.address) 40 | operations = OrderedDict([("APPEND", 0.5),("DELETE", 0.1),("UPDATE", 0.3),(None, 0.01)]) 41 | fake_operation = F.udf(lambda:fake.random_elements(elements=operations, length=1)[0]) 42 | fake_id = F.udf(lambda: str(uuid.uuid4()) if random.uniform(0, 1) < 0.98 else None) 43 | 44 | df = spark.range(0, 100000).repartition(100) 45 | df = df.withColumn("id", fake_id()) 46 | df = df.withColumn("firstname", fake_firstname()) 47 | df = df.withColumn("lastname", fake_lastname()) 48 | df = df.withColumn("email", fake_email()) 49 | df = df.withColumn("address", fake_address()) 50 | df = df.withColumn("operation", fake_operation()) 51 | df_customers = df.withColumn("operation_date", fake_date()) 52 | df_customers.repartition(100).write.format("json").mode("overwrite").save(folder+"/customers") 53 | 54 | df = spark.range(0, 10000).repartition(20) 55 | df = df.withColumn("id", fake_id()) 56 | df = df.withColumn("transaction_date", fake_date()) 57 | df = df.withColumn("amount", F.round(F.rand()*1000)) 58 | df = df.withColumn("item_count", F.round(F.rand()*10)) 59 | df = df.withColumn("operation", fake_operation()) 60 | df = df.withColumn("operation_date", fake_date()) 61 | #Join with the customer to get the same IDs generated. 62 | df = df.withColumn("t_id", F.monotonically_increasing_id()).join(spark.read.json(folder+"/customers").select("id").withColumnRenamed("id", "customer_id").withColumn("t_id", F.monotonically_increasing_id()), "t_id").drop("t_id") 63 | df.repartition(10).write.format("json").mode("overwrite").save(folder+"/transactions") 64 | -------------------------------------------------------------------------------- /00-quickstarts/dlt-cdc/_resources/LICENSE.py: -------------------------------------------------------------------------------- 1 | # Databricks notebook source 2 | # MAGIC %md 3 | # MAGIC ## Licence 4 | 5 | # COMMAND ---------- 6 | 7 | # MAGIC %md 8 | # MAGIC 9 | # MAGIC Copyright (2022) Databricks, Inc. 10 | # MAGIC 11 | # MAGIC This library (the "Software") may not be used except in connection with the Licensee's use of the Databricks Platform Services pursuant 12 | # MAGIC to an Agreement (defined below) between Licensee (defined below) and Databricks, Inc. ("Databricks"). 
The Object Code version of the 13 | # MAGIC Software shall be deemed part of the Downloadable Services under the Agreement, or if the Agreement does not define Downloadable Services, 14 | # MAGIC Subscription Services, or if neither are defined then the term in such Agreement that refers to the applicable Databricks Platform 15 | # MAGIC Services (as defined below) shall be substituted herein for “Downloadable Services.” Licensee's use of the Software must comply at 16 | # MAGIC all times with any restrictions applicable to the Downlodable Services and Subscription Services, generally, and must be used in 17 | # MAGIC accordance with any applicable documentation. For the avoidance of doubt, the Software constitutes Databricks Confidential Information 18 | # MAGIC under the Agreement. 19 | # MAGIC 20 | # MAGIC Additionally, and notwithstanding anything in the Agreement to the contrary: 21 | # MAGIC * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES 22 | # MAGIC OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE 23 | # MAGIC LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR 24 | # MAGIC IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. 25 | # MAGIC * you may view, make limited copies of, and may compile the Source Code version of the Software into an Object Code version of the 26 | # MAGIC Software. For the avoidance of doubt, you may not make derivative works of Software (or make any any changes to the Source Code 27 | # MAGIC version of the unless you have agreed to separate terms with Databricks permitting such modifications (e.g., a contribution license 28 | # MAGIC agreement)). 29 | # MAGIC 30 | # MAGIC If you have not agreed to an Agreement or otherwise do not agree to these terms, you may not use the Software or view, copy or compile 31 | # MAGIC the Source Code of the Software. 32 | # MAGIC 33 | # MAGIC This license terminates automatically upon the termination of the Agreement or Licensee's breach of these terms. Additionally, 34 | # MAGIC Databricks may terminate this license at any time on notice. Upon termination, you must permanently delete the Software and all 35 | # MAGIC copies thereof (including the Source Code). 36 | # MAGIC 37 | # MAGIC Agreement: the agreement between Databricks and Licensee governing the use of the Databricks Platform Services, which shall be, with 38 | # MAGIC respect to Databricks, the Databricks Terms of Service located at www.databricks.com/termsofservice, and with respect to Databricks 39 | # MAGIC Community Edition, the Community Edition Terms of Service located at www.databricks.com/ce-termsofuse, in each case unless Licensee 40 | # MAGIC has entered into a separate written agreement with Databricks governing the use of the applicable Databricks Platform Services. 41 | # MAGIC 42 | # MAGIC Databricks Platform Services: the Databricks services or the Databricks Community Edition services, according to where the Software is used. 43 | # MAGIC 44 | # MAGIC Licensee: the user of the Software, or, if the Software is being used on behalf of a company, the company. 45 | # MAGIC 46 | # MAGIC Object Code: is version of the Software produced when an interpreter or a compiler translates the Source Code into recognizable and 47 | # MAGIC executable machine code. 
48 | # MAGIC 49 | # MAGIC Source Code: the human readable portion of the Software. 50 | -------------------------------------------------------------------------------- /00-quickstarts/dlt-cdc/_resources/NOTICE.py: -------------------------------------------------------------------------------- 1 | # Databricks notebook source 2 | # MAGIC %md 3 | # MAGIC ## Licence 4 | # MAGIC See LICENSE file. 5 | # MAGIC 6 | # MAGIC ## Data collection 7 | # MAGIC To improve users experience and dbdemos asset quality, dbdemos sends report usage and capture views in the installed notebook (usually in the first cell) and other assets like dashboards. This information is captured for product improvement only and not for marketing purpose, and doesn't contain PII information. By using `dbdemos` and the assets it provides, you consent to this data collection. If you wish to disable it, you can set `Tracker.enable_tracker` to False in the `tracker.py` file. 8 | # MAGIC 9 | # MAGIC ## Resource creation 10 | # MAGIC To simplify your experience, `dbdemos` will create and start for you resources. As example, a demo could start (not exhaustive): 11 | # MAGIC - A cluster to run your demo 12 | # MAGIC - A Delta Live Table Pipeline to ingest data 13 | # MAGIC - A DBSQL endpoint to run DBSQL dashboard 14 | # MAGIC - An ML model 15 | # MAGIC 16 | # MAGIC While `dbdemos` does its best to limit the consumption and enforce resource auto-termination, you remain responsible for the resources created and the potential consumption associated. 17 | # MAGIC 18 | # MAGIC ## Support 19 | # MAGIC Databricks does not offer official support for `dbdemos` and the associated assets. 20 | # MAGIC For any issue with `dbdemos` or the demos installed, please open an issue and the demo team will have a look on a best effort basis. 21 | # MAGIC 22 | # MAGIC 23 | -------------------------------------------------------------------------------- /00-quickstarts/dlt-cdc/_resources/README.py: -------------------------------------------------------------------------------- 1 | # Databricks notebook source 2 | # MAGIC %md 3 | # MAGIC ## DBDemos asset 4 | # MAGIC 5 | # MAGIC The notebooks available under `_/resources` are technical resources. 6 | # MAGIC 7 | # MAGIC Do not edit these notebooks or try to run them directly. These notebooks will load data / run some setup. They are indirectly called from the main notebook (`%run ./_resources/.....`) 8 | -------------------------------------------------------------------------------- /00-quickstarts/dlt-loans/03-Log-Analysis.sql: -------------------------------------------------------------------------------- 1 | -- Databricks notebook source 2 | -- MAGIC %md 3 | -- MAGIC ### A cluster has been created for this demo 4 | -- MAGIC To run this demo, just select the cluster `dbdemos-dlt-loans-abraham_pabbathi` from the dropdown menu ([open cluster configuration](https://e2-demo-field-eng.cloud.databricks.com/#setting/clusters/0728-224958-5yoad5lg/configuration)).
5 | -- MAGIC *Note: If the cluster was deleted after 30 days, you can re-create it with `dbdemos.create_cluster('dlt-loans')` or re-install the demo: `dbdemos.install('dlt-loans')`* 6 | 7 | -- COMMAND ---------- 8 | 9 | -- MAGIC %md-sandbox 10 | -- MAGIC 11 | -- MAGIC # DLT pipeline log analysis 12 | -- MAGIC 13 | -- MAGIC 14 | -- MAGIC 15 | -- MAGIC Each DLT Pipeline saves events and expectations metrics in the Storage Location defined on the pipeline. From this table we can see what is happening and the quality of the data passing through it. 16 | -- MAGIC 17 | -- MAGIC You can leverage the expecations directly as a SQL table with Databricks SQL to track your expectation metrics and send alerts as required. 18 | -- MAGIC 19 | -- MAGIC This notebook extracts and analyses expectation metrics to build such KPIS. 20 | -- MAGIC 21 | -- MAGIC You can find your metrics opening the Settings of your DLT pipeline, under `storage` : 22 | -- MAGIC 23 | -- MAGIC ``` 24 | -- MAGIC { 25 | -- MAGIC ... 26 | -- MAGIC "name": "test_dlt_cdc", 27 | -- MAGIC "storage": "/demos/dlt/loans", 28 | -- MAGIC "target": "quentin_dlt_cdc" 29 | -- MAGIC } 30 | -- MAGIC ``` 31 | -- MAGIC 32 | -- MAGIC 33 | -- MAGIC 34 | 35 | -- COMMAND ---------- 36 | 37 | -- DBTITLE 1,Load DLT system table 38 | -- MAGIC %python 39 | -- MAGIC import re 40 | -- MAGIC current_user = dbutils.notebook.entry_point.getDbutils().notebook().getContext().tags().apply('user') 41 | -- MAGIC storage_path = '/demos/dlt/loans/'+re.sub("[^A-Za-z0-9]", '_', current_user[:current_user.rfind('@')]) 42 | -- MAGIC dbutils.widgets.text('storage_path', storage_path) 43 | -- MAGIC print(f"using storage path: {storage_path}") 44 | 45 | -- COMMAND ---------- 46 | 47 | -- MAGIC %python display(dbutils.fs.ls(dbutils.widgets.get('storage_path'))) 48 | 49 | -- COMMAND ---------- 50 | 51 | -- MAGIC %sql 52 | -- MAGIC CREATE OR REPLACE TEMPORARY VIEW demo_dlt_loans_system_event_log_raw 53 | -- MAGIC as SELECT * FROM delta.`$storage_path/system/events`; 54 | -- MAGIC SELECT * FROM demo_dlt_loans_system_event_log_raw order by timestamp desc; 55 | 56 | -- COMMAND ---------- 57 | 58 | -- MAGIC %md 59 | -- MAGIC The `details` column contains metadata about each Event sent to the Event Log. There are different fields depending on what type of Event it is. 
Some examples include: 60 | -- MAGIC * `user_action` Events occur when taking actions like creating the pipeline 61 | -- MAGIC * `flow_definition` Events occur when a pipeline is deployed or updated and have lineage, schema, and execution plan information 62 | -- MAGIC * `output_dataset` and `input_datasets` - output table/view and its upstream table(s)/view(s) 63 | -- MAGIC * `flow_type` - whether this is a complete or append flow 64 | -- MAGIC * `explain_text` - the Spark explain plan 65 | -- MAGIC * `flow_progress` Events occur when a data flow starts running or finishes processing a batch of data 66 | -- MAGIC * `metrics` - currently contains `num_output_rows` 67 | -- MAGIC * `data_quality` - contains an array of the results of the data quality rules for this particular dataset 68 | -- MAGIC * `dropped_records` 69 | -- MAGIC * `expectations` 70 | -- MAGIC * `name`, `dataset`, `passed_records`, `failed_records` 71 | -- MAGIC 72 | 73 | -- COMMAND ---------- 74 | 75 | -- DBTITLE 1,Lineage Information 76 | SELECT 77 | details:flow_definition.output_dataset, 78 | details:flow_definition.input_datasets, 79 | details:flow_definition.flow_type, 80 | details:flow_definition.schema, 81 | details:flow_definition 82 | FROM demo_dlt_loans_system_event_log_raw 83 | WHERE details:flow_definition IS NOT NULL 84 | ORDER BY timestamp 85 | 86 | -- COMMAND ---------- 87 | 88 | -- DBTITLE 1,Data Quality Results 89 | SELECT 90 | id, 91 | expectations.dataset, 92 | expectations.name, 93 | expectations.failed_records, 94 | expectations.passed_records 95 | FROM( 96 | SELECT 97 | id, 98 | timestamp, 99 | details:flow_progress.metrics, 100 | details:flow_progress.data_quality.dropped_records, 101 | explode(from_json(details:flow_progress:data_quality:expectations 102 | ,schema_of_json("[{'name':'str', 'dataset':'str', 'passed_records':42, 'failed_records':42}]"))) expectations 103 | FROM demo_dlt_loans_system_event_log_raw 104 | WHERE details:flow_progress.metrics IS NOT NULL) data_quality 105 | 106 | -- COMMAND ---------- 107 | 108 | -- MAGIC %md 109 | -- MAGIC Your expectations are ready to be queried in SQL! Open the data Quality Dashboard example for more details. 110 | -------------------------------------------------------------------------------- /00-quickstarts/dlt-loans/_resources/LICENSE.py: -------------------------------------------------------------------------------- 1 | # Databricks notebook source 2 | # MAGIC %md 3 | # MAGIC ## Licence 4 | 5 | # COMMAND ---------- 6 | 7 | # MAGIC %md 8 | # MAGIC 9 | # MAGIC Copyright (2022) Databricks, Inc. 10 | # MAGIC 11 | # MAGIC This library (the "Software") may not be used except in connection with the Licensee's use of the Databricks Platform Services pursuant 12 | # MAGIC to an Agreement (defined below) between Licensee (defined below) and Databricks, Inc. ("Databricks"). The Object Code version of the 13 | # MAGIC Software shall be deemed part of the Downloadable Services under the Agreement, or if the Agreement does not define Downloadable Services, 14 | # MAGIC Subscription Services, or if neither are defined then the term in such Agreement that refers to the applicable Databricks Platform 15 | # MAGIC Services (as defined below) shall be substituted herein for “Downloadable Services.” Licensee's use of the Software must comply at 16 | # MAGIC all times with any restrictions applicable to the Downlodable Services and Subscription Services, generally, and must be used in 17 | # MAGIC accordance with any applicable documentation. 
For the avoidance of doubt, the Software constitutes Databricks Confidential Information 18 | # MAGIC under the Agreement. 19 | # MAGIC 20 | # MAGIC Additionally, and notwithstanding anything in the Agreement to the contrary: 21 | # MAGIC * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES 22 | # MAGIC OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE 23 | # MAGIC LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR 24 | # MAGIC IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. 25 | # MAGIC * you may view, make limited copies of, and may compile the Source Code version of the Software into an Object Code version of the 26 | # MAGIC Software. For the avoidance of doubt, you may not make derivative works of Software (or make any any changes to the Source Code 27 | # MAGIC version of the unless you have agreed to separate terms with Databricks permitting such modifications (e.g., a contribution license 28 | # MAGIC agreement)). 29 | # MAGIC 30 | # MAGIC If you have not agreed to an Agreement or otherwise do not agree to these terms, you may not use the Software or view, copy or compile 31 | # MAGIC the Source Code of the Software. 32 | # MAGIC 33 | # MAGIC This license terminates automatically upon the termination of the Agreement or Licensee's breach of these terms. Additionally, 34 | # MAGIC Databricks may terminate this license at any time on notice. Upon termination, you must permanently delete the Software and all 35 | # MAGIC copies thereof (including the Source Code). 36 | # MAGIC 37 | # MAGIC Agreement: the agreement between Databricks and Licensee governing the use of the Databricks Platform Services, which shall be, with 38 | # MAGIC respect to Databricks, the Databricks Terms of Service located at www.databricks.com/termsofservice, and with respect to Databricks 39 | # MAGIC Community Edition, the Community Edition Terms of Service located at www.databricks.com/ce-termsofuse, in each case unless Licensee 40 | # MAGIC has entered into a separate written agreement with Databricks governing the use of the applicable Databricks Platform Services. 41 | # MAGIC 42 | # MAGIC Databricks Platform Services: the Databricks services or the Databricks Community Edition services, according to where the Software is used. 43 | # MAGIC 44 | # MAGIC Licensee: the user of the Software, or, if the Software is being used on behalf of a company, the company. 45 | # MAGIC 46 | # MAGIC Object Code: is version of the Software produced when an interpreter or a compiler translates the Source Code into recognizable and 47 | # MAGIC executable machine code. 48 | # MAGIC 49 | # MAGIC Source Code: the human readable portion of the Software. 50 | -------------------------------------------------------------------------------- /00-quickstarts/dlt-loans/_resources/NOTICE.py: -------------------------------------------------------------------------------- 1 | # Databricks notebook source 2 | # MAGIC %md 3 | # MAGIC ## Licence 4 | # MAGIC See LICENSE file. 5 | # MAGIC 6 | # MAGIC ## Data collection 7 | # MAGIC To improve users experience and dbdemos asset quality, dbdemos sends report usage and capture views in the installed notebook (usually in the first cell) and other assets like dashboards. 
This information is captured for product improvement only and not for marketing purpose, and doesn't contain PII information. By using `dbdemos` and the assets it provides, you consent to this data collection. If you wish to disable it, you can set `Tracker.enable_tracker` to False in the `tracker.py` file. 8 | # MAGIC 9 | # MAGIC ## Resource creation 10 | # MAGIC To simplify your experience, `dbdemos` will create and start for you resources. As example, a demo could start (not exhaustive): 11 | # MAGIC - A cluster to run your demo 12 | # MAGIC - A Delta Live Table Pipeline to ingest data 13 | # MAGIC - A DBSQL endpoint to run DBSQL dashboard 14 | # MAGIC - An ML model 15 | # MAGIC 16 | # MAGIC While `dbdemos` does its best to limit the consumption and enforce resource auto-termination, you remain responsible for the resources created and the potential consumption associated. 17 | # MAGIC 18 | # MAGIC ## Support 19 | # MAGIC Databricks does not offer official support for `dbdemos` and the associated assets. 20 | # MAGIC For any issue with `dbdemos` or the demos installed, please open an issue and the demo team will have a look on a best effort basis. 21 | # MAGIC 22 | # MAGIC 23 | -------------------------------------------------------------------------------- /00-quickstarts/dlt-loans/_resources/README.py: -------------------------------------------------------------------------------- 1 | # Databricks notebook source 2 | # MAGIC %md 3 | # MAGIC ## DBDemos asset 4 | # MAGIC 5 | # MAGIC The notebooks available under `_/resources` are technical resources. 6 | # MAGIC 7 | # MAGIC Do not edit these notebooks or try to run them directly. These notebooks will load data / run some setup. They are indirectly called from the main notebook (`%run ./_resources/.....`) 8 | -------------------------------------------------------------------------------- /00-quickstarts/lakehouse-retail-c360/01-Data-ingestion/01.2-DLT-churn-Python-UDF.py: -------------------------------------------------------------------------------- 1 | # Databricks notebook source 2 | # DBTITLE 1,Let's install mlflow & the ML libs to be able to load our model (from requirement.txt file): 3 | # MAGIC %pip install mlflow>=2.1 category-encoders==2.5.1.post0 cffi==1.15.0 cloudpickle==2.0.0 databricks-automl-runtime==0.2.15 defusedxml==0.7.1 holidays==0.18 lightgbm==3.3.4 matplotlib==3.5.1 psutil==5.8.0 scikit-learn==1.0.2 typing-extensions==4.1.1 4 | 5 | # COMMAND ---------- 6 | 7 | # MAGIC %md #Registering a Python UDF as a SQL function 8 | # MAGIC This is a companion notebook to load the `predict_churn` model as a Spark UDF and save it as a SQL function. While this code was present in the SQL notebook, it won't be run by the DLT engine (since the notebook is SQL we only read SQL cells) 9 | # MAGIC 10 | # MAGIC For the UDF to be available, you must add this notebook to your DLT pipeline.
(Currently, mixing Python in a SQL DLT notebook won't run the Python code.) 11 | # MAGIC 12 | # MAGIC 13 | # MAGIC 14 | 15 | # COMMAND ---------- 16 | 17 | # MAGIC %python 18 | # MAGIC import mlflow 19 | # MAGIC # Stage/version 20 | # MAGIC # Model name | 21 | # MAGIC # | | 22 | # MAGIC predict_churn_udf = mlflow.pyfunc.spark_udf(spark, "models:/dbdemos_customer_churn/Production") 23 | # MAGIC spark.udf.register("predict_churn", predict_churn_udf) 24 | 25 | # COMMAND ---------- 26 | 27 | # MAGIC %md ### Setting up the DLT 28 | # MAGIC 29 | # MAGIC This notebook must be included in your DLT "libraries" parameter: 30 | # MAGIC 31 | # MAGIC ``` 32 | # MAGIC { 33 | # MAGIC "id": "95f28631-1884-425e-af69-05c3f397dd90", 34 | # MAGIC "name": "xxxx", 35 | # MAGIC "storage": "/demos/dlt/lakehouse_churn/xxxxx", 36 | # MAGIC "configuration": { 37 | # MAGIC "pipelines.useV2DetailsPage": "true" 38 | # MAGIC }, 39 | # MAGIC "clusters": [ 40 | # MAGIC { 41 | # MAGIC "label": "default", 42 | # MAGIC "autoscale": { 43 | # MAGIC "min_workers": 1, 44 | # MAGIC "max_workers": 5 45 | # MAGIC } 46 | # MAGIC } 47 | # MAGIC ], 48 | # MAGIC "libraries": [ 49 | # MAGIC { 50 | # MAGIC "notebook": { 51 | # MAGIC "path": "/Repos/xxxx/01.2-DLT-churn-Python-UDF" 52 | # MAGIC } 53 | # MAGIC }, 54 | # MAGIC { 55 | # MAGIC "notebook": { 56 | # MAGIC "path": "/Repos/xxxx/01.1-DLT-churn-SQL" 57 | # MAGIC } 58 | # MAGIC } 59 | # MAGIC ], 60 | # MAGIC "target": "retail_lakehouse_churn_xxxx", 61 | # MAGIC "continuous": false, 62 | # MAGIC "development": false 63 | # MAGIC } 64 | # MAGIC ``` 65 | -------------------------------------------------------------------------------- /00-quickstarts/lakehouse-retail-c360/05-Workflow-orchestration/05-Workflow-orchestration-churn.py: -------------------------------------------------------------------------------- 1 | # Databricks notebook source 2 | # MAGIC %md-sandbox 3 | # MAGIC # Deploying and orchestrating the full workflow 4 | # MAGIC 5 | # MAGIC 6 | # MAGIC 7 | # MAGIC All our assets are ready. We now need to define when we want our DLT pipeline to kick in and refresh the tables. 8 | # MAGIC 9 | # MAGIC One option is to switch the DLT pipeline to continuous mode to get a streaming pipeline, providing near-realtime insight. 10 | # MAGIC 11 | # MAGIC An alternative is to wake up the DLT pipeline every X hours, ingest the new data (incrementally) and shut down all your compute. 12 | # MAGIC 13 | # MAGIC This is a simple configuration offering a tradeoff between uptime and ingestion latencies. 14 | # MAGIC 15 | # MAGIC In our case, we decided that the best tradeoff is to ingest new data every hour: 16 | # MAGIC 17 | # MAGIC - Start the DLT pipeline to ingest new data and refresh our tables 18 | # MAGIC - Refresh the DBSQL dashboard (and potentially notify downstream applications) 19 | # MAGIC - Retrain our model to include the latest data and capture potential behavior changes 20 | # MAGIC 21 | # MAGIC 22 | # MAGIC 23 | 24 | # COMMAND ---------- 25 | 26 | # MAGIC %md-sandbox 27 | # MAGIC ## Orchestrating our Churn pipeline with Databricks Workflows 28 | # MAGIC 29 | # MAGIC 30 | # MAGIC 31 | # MAGIC With the Databricks Lakehouse, there is no need for an external orchestrator. We can use [Workflows](/#job/list) (available on the left menu) to orchestrate our Churn pipeline within a few clicks. 32 | # MAGIC 33 | # MAGIC 34 | # MAGIC 35 | # MAGIC ### Orchestrate anything anywhere 36 | # MAGIC With Workflows, you can run diverse workloads for the full data and AI lifecycle on any cloud.
Orchestrate Delta Live Tables and Jobs for SQL, Spark, notebooks, dbt, ML models and more. 37 | # MAGIC 38 | # MAGIC ### Simple - Fully managed 39 | # MAGIC Remove operational overhead with a fully managed orchestration service, so you can focus on your workflows, not on managing your infrastructure. 40 | # MAGIC 41 | # MAGIC ### Proven reliability 42 | # MAGIC Have full confidence in your workflows leveraging our proven experience running tens of millions of production workloads daily across AWS, Azure and GCP. 43 | 44 | # COMMAND ---------- 45 | 46 | # MAGIC %md-sandbox 47 | # MAGIC 48 | # MAGIC ## Creating your workflow 49 | # MAGIC 50 | # MAGIC 51 | # MAGIC 52 | # MAGIC A Databricks Workflow is composed of Tasks. 53 | # MAGIC 54 | # MAGIC Each task can trigger a specific job: 55 | # MAGIC 56 | # MAGIC * Delta Live Tables 57 | # MAGIC * SQL query / dashboard 58 | # MAGIC * Model retraining / inference 59 | # MAGIC * Notebooks 60 | # MAGIC * dbt 61 | # MAGIC * ... 62 | # MAGIC 63 | # MAGIC In this example, we can see our 3 tasks (a minimal sketch of such a workflow definition is included at the end of this notebook): 64 | # MAGIC 65 | # MAGIC * Start the DLT pipeline to ingest new data and refresh our tables 66 | # MAGIC * Refresh the DBSQL dashboard (and potentially notify downstream applications) 67 | # MAGIC * Retrain our Churn model 68 | 69 | # COMMAND ---------- 70 | 71 | # MAGIC %md-sandbox 72 | # MAGIC 73 | # MAGIC ## Monitoring your runs 74 | # MAGIC 75 | # MAGIC 76 | # MAGIC 77 | # MAGIC Once your workflow is created, you can access historical runs and receive alerts if something goes wrong! 78 | # MAGIC 79 | # MAGIC In the screenshot we can see that our workflow had multiple errors, with different runtimes, and ultimately got fixed. 80 | # MAGIC 81 | # MAGIC Workflow monitoring includes errors, abnormal job durations and more advanced controls! 82 | 83 | # COMMAND ---------- 84 | 85 | # MAGIC %md 86 | # MAGIC ## Conclusion 87 | # MAGIC 88 | # MAGIC Not only does the Databricks Lakehouse let you ingest, analyze and infer churn, it also provides a best-in-class orchestrator to offer your business fresh insight, making sure everything works as expected!
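# MAGIC
# MAGIC As referenced above, here is a minimal sketch of what a Jobs API 2.1-style definition for these 3 tasks could look like. The IDs, paths and schedule are placeholders, not the actual demo configuration:
# MAGIC
# MAGIC ```
# MAGIC {
# MAGIC   "name": "dbdemos_churn_workflow",
# MAGIC   "schedule": { "quartz_cron_expression": "0 0 * * * ?", "timezone_id": "UTC" },
# MAGIC   "tasks": [
# MAGIC     { "task_key": "ingest_dlt",
# MAGIC       "pipeline_task": { "pipeline_id": "<churn-dlt-pipeline-id>" } },
# MAGIC     { "task_key": "refresh_dashboard",
# MAGIC       "depends_on": [ { "task_key": "ingest_dlt" } ],
# MAGIC       "sql_task": { "dashboard": { "dashboard_id": "<churn-dashboard-id>" }, "warehouse_id": "<warehouse-id>" } },
# MAGIC     { "task_key": "retrain_churn_model",
# MAGIC       "depends_on": [ { "task_key": "ingest_dlt" } ],
# MAGIC       "notebook_task": { "notebook_path": "/Repos/xxxx/04.1-automl-churn-prediction" },
# MAGIC       "existing_cluster_id": "<cluster-id>" } ]
# MAGIC }
# MAGIC ```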
89 | # MAGIC 90 | # MAGIC [Go back to introduction]($../00-churn-introduction-lakehouse) 91 | -------------------------------------------------------------------------------- /00-quickstarts/lakehouse-retail-c360/_resources/00-setup-uc.py: -------------------------------------------------------------------------------- 1 | # Databricks notebook source 2 | dbutils.widgets.dropdown("reset_all_data", "false", ["true", "false"], "Reset all data") 3 | reset_all_data = dbutils.widgets.get("reset_all_data") == "true" 4 | 5 | # COMMAND ---------- 6 | 7 | # MAGIC %run ./00-global-setup $reset_all_data=$reset_all_data $db_prefix=retail $catalog=dbdemos $db=lakehouse_c360 8 | 9 | # COMMAND ---------- 10 | 11 | catalog = "dbdemos" 12 | database = 'lakehouse_c360' 13 | 14 | # COMMAND ---------- 15 | 16 | import json 17 | import time 18 | from pyspark.sql.window import Window 19 | from pyspark.sql.functions import row_number, sha1, col, initcap, to_timestamp 20 | 21 | folder = "/demos/retail/churn" 22 | 23 | if reset_all_data or is_folder_empty(folder+"/orders") or is_folder_empty(folder+"/users") or is_folder_empty(folder+"/events"): 24 | #data generation on another notebook to avoid installing libraries (takes a few seconds to setup pip env) 25 | print(f"Generating data under {folder} , please wait a few sec...") 26 | path = dbutils.notebook.entry_point.getDbutils().notebook().getContext().notebookPath().get() 27 | parent_count = path[path.rfind("lakehouse-retail-c360"):].count('/') - 1 28 | prefix = "./" if parent_count == 0 else parent_count*"../" 29 | prefix = f'{prefix}_resources/' 30 | dbutils.notebook.run(prefix+"02-create-churn-tables", 600, {"catalog": catalog, "cloud_storage_path": "/demos/", "reset_all_data": reset_all_data, "db": database}) 31 | else: 32 | print("data already existing. Run with reset_all_data=true to force a data cleanup for your local demo.") 33 | 34 | # COMMAND ---------- 35 | 36 | for table in spark.sql("SHOW TABLES").collect(): 37 | try: 38 | spark.sql(f"alter table {table['tableName']} owner to `account users`") 39 | except Exception as e: 40 | print(f"couldn't set table {table} ownership to account users") 41 | -------------------------------------------------------------------------------- /00-quickstarts/lakehouse-retail-c360/_resources/00-setup.py: -------------------------------------------------------------------------------- 1 | # Databricks notebook source 2 | dbutils.widgets.dropdown("reset_all_data", "false", ["true", "false"], "Reset all data") 3 | 4 | # COMMAND ---------- 5 | 6 | # MAGIC %run ./00-global-setup $reset_all_data=$reset_all_data $db_prefix=retail 7 | 8 | # COMMAND ---------- 9 | 10 | import mlflow 11 | if "evaluate" not in dir(mlflow): 12 | raise Exception("ERROR - YOU NEED MLFLOW 2.0 for this demo. 
Select DBRML 12+") 13 | 14 | import json 15 | import time 16 | from pyspark.sql.window import Window 17 | from pyspark.sql.functions import row_number 18 | 19 | reset_all_data = dbutils.widgets.get("reset_all_data") == "true" 20 | raw_data_location = cloud_storage_path+"/retail/churn" 21 | 22 | import json 23 | import time 24 | from pyspark.sql.window import Window 25 | from pyspark.sql.functions import row_number, sha1, col, initcap, to_timestamp 26 | 27 | folder = "/demos/retail/churn" 28 | 29 | if reset_all_data or is_folder_empty(folder+"/orders") or is_folder_empty(folder+"/users") or is_folder_empty(folder+"/events"): 30 | #data generation on another notebook to avoid installing libraries (takes a few seconds to setup pip env) 31 | print(f"Generating data under {folder} , please wait a few sec...") 32 | path = dbutils.notebook.entry_point.getDbutils().notebook().getContext().notebookPath().get() 33 | parent_count = path[path.rfind("lakehouse-retail-c360"):].count('/') - 1 34 | prefix = "./" if parent_count == 0 else parent_count*"../" 35 | prefix = f'{prefix}_resources/' 36 | dbutils.notebook.run(prefix+"01-load-data", 600, {"reset_all_data": dbutils.widgets.get("reset_all_data")}) 37 | else: 38 | print("data already existing. Run with reset_all_data=true to force a data cleanup for your local demo.") 39 | -------------------------------------------------------------------------------- /00-quickstarts/lakehouse-retail-c360/_resources/LICENSE.py: -------------------------------------------------------------------------------- 1 | # Databricks notebook source 2 | # MAGIC %md 3 | # MAGIC ## Licence 4 | 5 | # COMMAND ---------- 6 | 7 | # MAGIC %md 8 | # MAGIC 9 | # MAGIC Copyright (2022) Databricks, Inc. 10 | # MAGIC 11 | # MAGIC This library (the "Software") may not be used except in connection with the Licensee's use of the Databricks Platform Services pursuant 12 | # MAGIC to an Agreement (defined below) between Licensee (defined below) and Databricks, Inc. ("Databricks"). The Object Code version of the 13 | # MAGIC Software shall be deemed part of the Downloadable Services under the Agreement, or if the Agreement does not define Downloadable Services, 14 | # MAGIC Subscription Services, or if neither are defined then the term in such Agreement that refers to the applicable Databricks Platform 15 | # MAGIC Services (as defined below) shall be substituted herein for “Downloadable Services.” Licensee's use of the Software must comply at 16 | # MAGIC all times with any restrictions applicable to the Downlodable Services and Subscription Services, generally, and must be used in 17 | # MAGIC accordance with any applicable documentation. For the avoidance of doubt, the Software constitutes Databricks Confidential Information 18 | # MAGIC under the Agreement. 19 | # MAGIC 20 | # MAGIC Additionally, and notwithstanding anything in the Agreement to the contrary: 21 | # MAGIC * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES 22 | # MAGIC OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE 23 | # MAGIC LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR 24 | # MAGIC IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. 
25 | # MAGIC * you may view, make limited copies of, and may compile the Source Code version of the Software into an Object Code version of the 26 | # MAGIC Software. For the avoidance of doubt, you may not make derivative works of Software (or make any any changes to the Source Code 27 | # MAGIC version of the unless you have agreed to separate terms with Databricks permitting such modifications (e.g., a contribution license 28 | # MAGIC agreement)). 29 | # MAGIC 30 | # MAGIC If you have not agreed to an Agreement or otherwise do not agree to these terms, you may not use the Software or view, copy or compile 31 | # MAGIC the Source Code of the Software. 32 | # MAGIC 33 | # MAGIC This license terminates automatically upon the termination of the Agreement or Licensee's breach of these terms. Additionally, 34 | # MAGIC Databricks may terminate this license at any time on notice. Upon termination, you must permanently delete the Software and all 35 | # MAGIC copies thereof (including the Source Code). 36 | # MAGIC 37 | # MAGIC Agreement: the agreement between Databricks and Licensee governing the use of the Databricks Platform Services, which shall be, with 38 | # MAGIC respect to Databricks, the Databricks Terms of Service located at www.databricks.com/termsofservice, and with respect to Databricks 39 | # MAGIC Community Edition, the Community Edition Terms of Service located at www.databricks.com/ce-termsofuse, in each case unless Licensee 40 | # MAGIC has entered into a separate written agreement with Databricks governing the use of the applicable Databricks Platform Services. 41 | # MAGIC 42 | # MAGIC Databricks Platform Services: the Databricks services or the Databricks Community Edition services, according to where the Software is used. 43 | # MAGIC 44 | # MAGIC Licensee: the user of the Software, or, if the Software is being used on behalf of a company, the company. 45 | # MAGIC 46 | # MAGIC Object Code: is version of the Software produced when an interpreter or a compiler translates the Source Code into recognizable and 47 | # MAGIC executable machine code. 48 | # MAGIC 49 | # MAGIC Source Code: the human readable portion of the Software. 50 | -------------------------------------------------------------------------------- /00-quickstarts/lakehouse-retail-c360/_resources/NOTICE.py: -------------------------------------------------------------------------------- 1 | # Databricks notebook source 2 | # MAGIC %md 3 | # MAGIC ## Licence 4 | # MAGIC See LICENSE file. 5 | # MAGIC 6 | # MAGIC ## Data collection 7 | # MAGIC To improve users experience and dbdemos asset quality, dbdemos sends report usage and capture views in the installed notebook (usually in the first cell) and other assets like dashboards. This information is captured for product improvement only and not for marketing purpose, and doesn't contain PII information. By using `dbdemos` and the assets it provides, you consent to this data collection. If you wish to disable it, you can set `Tracker.enable_tracker` to False in the `tracker.py` file. 8 | # MAGIC 9 | # MAGIC ## Resource creation 10 | # MAGIC To simplify your experience, `dbdemos` will create and start for you resources. 
As example, a demo could start (not exhaustive): 11 | # MAGIC - A cluster to run your demo 12 | # MAGIC - A Delta Live Table Pipeline to ingest data 13 | # MAGIC - A DBSQL endpoint to run DBSQL dashboard 14 | # MAGIC - An ML model 15 | # MAGIC 16 | # MAGIC While `dbdemos` does its best to limit the consumption and enforce resource auto-termination, you remain responsible for the resources created and the potential consumption associated. 17 | # MAGIC 18 | # MAGIC ## Support 19 | # MAGIC Databricks does not offer official support for `dbdemos` and the associated assets. 20 | # MAGIC For any issue with `dbdemos` or the demos installed, please open an issue and the demo team will have a look on a best effort basis. 21 | # MAGIC 22 | # MAGIC 23 | -------------------------------------------------------------------------------- /00-quickstarts/lakehouse-retail-c360/_resources/README.py: -------------------------------------------------------------------------------- 1 | # Databricks notebook source 2 | # MAGIC %md 3 | # MAGIC ## DBDemos asset 4 | # MAGIC 5 | # MAGIC The notebooks available under `_/resources` are technical resources. 6 | # MAGIC 7 | # MAGIC Do not edit these notebooks or try to run them directly. These notebooks will load data / run some setup. They are indirectly called from the main notebook (`%run ./_resources/.....`) 8 | -------------------------------------------------------------------------------- /00-quickstarts/llm-dolly-chatbot/_resources/LICENSE.py: -------------------------------------------------------------------------------- 1 | # Databricks notebook source 2 | # MAGIC %md 3 | # MAGIC ## Licence 4 | 5 | # COMMAND ---------- 6 | 7 | # MAGIC %md 8 | # MAGIC 9 | # MAGIC Copyright (2022) Databricks, Inc. 10 | # MAGIC 11 | # MAGIC This library (the "Software") may not be used except in connection with the Licensee's use of the Databricks Platform Services pursuant 12 | # MAGIC to an Agreement (defined below) between Licensee (defined below) and Databricks, Inc. ("Databricks"). The Object Code version of the 13 | # MAGIC Software shall be deemed part of the Downloadable Services under the Agreement, or if the Agreement does not define Downloadable Services, 14 | # MAGIC Subscription Services, or if neither are defined then the term in such Agreement that refers to the applicable Databricks Platform 15 | # MAGIC Services (as defined below) shall be substituted herein for “Downloadable Services.” Licensee's use of the Software must comply at 16 | # MAGIC all times with any restrictions applicable to the Downlodable Services and Subscription Services, generally, and must be used in 17 | # MAGIC accordance with any applicable documentation. For the avoidance of doubt, the Software constitutes Databricks Confidential Information 18 | # MAGIC under the Agreement. 19 | # MAGIC 20 | # MAGIC Additionally, and notwithstanding anything in the Agreement to the contrary: 21 | # MAGIC * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES 22 | # MAGIC OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE 23 | # MAGIC LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR 24 | # MAGIC IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. 
25 | # MAGIC * you may view, make limited copies of, and may compile the Source Code version of the Software into an Object Code version of the 26 | # MAGIC Software. For the avoidance of doubt, you may not make derivative works of Software (or make any any changes to the Source Code 27 | # MAGIC version of the unless you have agreed to separate terms with Databricks permitting such modifications (e.g., a contribution license 28 | # MAGIC agreement)). 29 | # MAGIC 30 | # MAGIC If you have not agreed to an Agreement or otherwise do not agree to these terms, you may not use the Software or view, copy or compile 31 | # MAGIC the Source Code of the Software. 32 | # MAGIC 33 | # MAGIC This license terminates automatically upon the termination of the Agreement or Licensee's breach of these terms. Additionally, 34 | # MAGIC Databricks may terminate this license at any time on notice. Upon termination, you must permanently delete the Software and all 35 | # MAGIC copies thereof (including the Source Code). 36 | # MAGIC 37 | # MAGIC Agreement: the agreement between Databricks and Licensee governing the use of the Databricks Platform Services, which shall be, with 38 | # MAGIC respect to Databricks, the Databricks Terms of Service located at www.databricks.com/termsofservice, and with respect to Databricks 39 | # MAGIC Community Edition, the Community Edition Terms of Service located at www.databricks.com/ce-termsofuse, in each case unless Licensee 40 | # MAGIC has entered into a separate written agreement with Databricks governing the use of the applicable Databricks Platform Services. 41 | # MAGIC 42 | # MAGIC Databricks Platform Services: the Databricks services or the Databricks Community Edition services, according to where the Software is used. 43 | # MAGIC 44 | # MAGIC Licensee: the user of the Software, or, if the Software is being used on behalf of a company, the company. 45 | # MAGIC 46 | # MAGIC Object Code: is version of the Software produced when an interpreter or a compiler translates the Source Code into recognizable and 47 | # MAGIC executable machine code. 48 | # MAGIC 49 | # MAGIC Source Code: the human readable portion of the Software. 50 | -------------------------------------------------------------------------------- /00-quickstarts/llm-dolly-chatbot/_resources/NOTICE.py: -------------------------------------------------------------------------------- 1 | # Databricks notebook source 2 | # MAGIC %md 3 | # MAGIC ## Licence 4 | # MAGIC See LICENSE file. 5 | # MAGIC 6 | # MAGIC ## Data collection 7 | # MAGIC To improve users experience and dbdemos asset quality, dbdemos sends report usage and capture views in the installed notebook (usually in the first cell) and other assets like dashboards. This information is captured for product improvement only and not for marketing purpose, and doesn't contain PII information. By using `dbdemos` and the assets it provides, you consent to this data collection. If you wish to disable it, you can set `Tracker.enable_tracker` to False in the `tracker.py` file. 8 | # MAGIC 9 | # MAGIC ## Resource creation 10 | # MAGIC To simplify your experience, `dbdemos` will create and start for you resources. 
As example, a demo could start (not exhaustive): 11 | # MAGIC - A cluster to run your demo 12 | # MAGIC - A Delta Live Table Pipeline to ingest data 13 | # MAGIC - A DBSQL endpoint to run DBSQL dashboard 14 | # MAGIC - An ML model 15 | # MAGIC 16 | # MAGIC While `dbdemos` does its best to limit the consumption and enforce resource auto-termination, you remain responsible for the resources created and the potential consumption associated. 17 | # MAGIC 18 | # MAGIC ## Support 19 | # MAGIC Databricks does not offer official support for `dbdemos` and the associated assets. 20 | # MAGIC For any issue with `dbdemos` or the demos installed, please open an issue and the demo team will have a look on a best effort basis. 21 | # MAGIC 22 | # MAGIC 23 | -------------------------------------------------------------------------------- /00-quickstarts/llm-dolly-chatbot/_resources/README.py: -------------------------------------------------------------------------------- 1 | # Databricks notebook source 2 | # MAGIC %md 3 | # MAGIC ## DBDemos asset 4 | # MAGIC 5 | # MAGIC The notebooks available under `_/resources` are technical resources. 6 | # MAGIC 7 | # MAGIC Do not edit these notebooks or try to run them directly. These notebooks will load data / run some setup. They are indirectly called from the main notebook (`%run ./_resources/.....`) 8 | -------------------------------------------------------------------------------- /10-migrations/.DS_Store: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/AbePabbathi/lakehouse-tacklebox/9bbbf0a7181193091d887ed57e7394a466c369cf/10-migrations/.DS_Store -------------------------------------------------------------------------------- /10-migrations/05-uc-upgrade/_resources/00-setup.py: -------------------------------------------------------------------------------- 1 | # Databricks notebook source 2 | # MAGIC %pip install faker 3 | 4 | # COMMAND ---------- 5 | 6 | dbutils.widgets.text("catalog", "dbdemos", "UC Catalog") 7 | dbutils.widgets.text("external_location_path", "s3a://databricks-e2demofieldengwest/external_location_uc_upgrade", "External location path") 8 | external_location_path = dbutils.widgets.get("external_location_path") 9 | 10 | # COMMAND ---------- 11 | 12 | import pyspark.sql.functions as F 13 | import re 14 | catalog = dbutils.widgets.get("catalog") 15 | 16 | catalog_exists = False 17 | for r in spark.sql("SHOW CATALOGS").collect(): 18 | if r['catalog'] == catalog: 19 | catalog_exists = True 20 | 21 | #As non-admin users don't have permission by default, let's do that only if the catalog doesn't exist (an admin need to run it first) 22 | if not catalog_exists: 23 | spark.sql(f"CREATE CATALOG IF NOT EXISTS {catalog}") 24 | spark.sql(f"ALTER CATALOG {catalog} OWNER TO `account users`") 25 | spark.sql(f"GRANT CREATE, USAGE on CATALOG {catalog} TO `account users`") 26 | spark.sql(f"USE CATALOG {catalog}") 27 | 28 | database = 'database_upgraded_on_uc' 29 | print(f"creating {database} database") 30 | spark.sql(f"DROP DATABASE IF EXISTS {catalog}.{database} CASCADE") 31 | spark.sql(f"CREATE DATABASE IF NOT EXISTS {catalog}.{database}") 32 | spark.sql(f"GRANT CREATE, USAGE on DATABASE {catalog}.{database} TO `account users`") 33 | spark.sql(f"ALTER SCHEMA {catalog}.{database} OWNER TO `account users`") 34 | 35 | # COMMAND ---------- 36 | 37 | folder = "/dbdemos/uc/delta_dataset" 38 | spark.sql('drop database if exists hive_metastore.uc_database_to_upgrade cascade') 39 | #fix a bug from 
legacy version 40 | spark.sql(f'drop database if exists {catalog}.uc_database_to_upgrade cascade') 41 | dbutils.fs.rm("/transactions", True) 42 | 43 | print("generating the data...") 44 | from pyspark.sql import functions as F 45 | from faker import Faker 46 | from collections import OrderedDict 47 | import uuid 48 | import random 49 | fake = Faker() 50 | 51 | fake_firstname = F.udf(fake.first_name) 52 | fake_lastname = F.udf(fake.last_name) 53 | fake_email = F.udf(fake.ascii_company_email) 54 | fake_date = F.udf(lambda:fake.date_time_this_month().strftime("%m-%d-%Y %H:%M:%S")) 55 | fake_address = F.udf(fake.address) 56 | fake_credit_card_expire = F.udf(fake.credit_card_expire) 57 | 58 | fake_id = F.udf(lambda: str(uuid.uuid4())) 59 | countries = ['FR', 'USA', 'SPAIN'] 60 | fake_country = F.udf(lambda: countries[random.randint(0,2)]) 61 | 62 | df = spark.range(0, 10000) 63 | df = df.withColumn("id", F.monotonically_increasing_id()) 64 | df = df.withColumn("creation_date", fake_date()) 65 | df = df.withColumn("customer_firstname", fake_firstname()) 66 | df = df.withColumn("customer_lastname", fake_lastname()) 67 | df = df.withColumn("country", fake_country()) 68 | df = df.withColumn("customer_email", fake_email()) 69 | df = df.withColumn("address", fake_address()) 70 | df = df.withColumn("gender", F.round(F.rand()+0.2)) 71 | df = df.withColumn("age_group", F.round(F.rand()*10)) 72 | df.repartition(3).write.mode('overwrite').format("delta").save(folder+"/users") 73 | 74 | 75 | df = spark.range(0, 10000) 76 | df = df.withColumn("id", F.monotonically_increasing_id()) 77 | df = df.withColumn("customer_id", F.monotonically_increasing_id()) 78 | df = df.withColumn("transaction_date", fake_date()) 79 | df = df.withColumn("credit_card_expire", fake_credit_card_expire()) 80 | df = df.withColumn("amount", F.round(F.rand()*1000+200)) 81 | 82 | df = df.cache() 83 | spark.sql('create database if not exists hive_metastore.uc_database_to_upgrade') 84 | df.repartition(3).write.mode('overwrite').format("delta").saveAsTable("hive_metastore.uc_database_to_upgrade.users") 85 | 86 | #Note: this requires hard-coded external location. 87 | df.repartition(3).write.mode('overwrite').format("delta").save(external_location_path+"/transactions") 88 | 89 | # COMMAND ---------- 90 | 91 | df.repartition(3).write.mode('overwrite').format("delta").save(external_location_path+"/transactions") 92 | 93 | # COMMAND ---------- 94 | 95 | #Need to switch to hive metastore to avoid having a : org.apache.spark.SparkException: Your query is attempting to access overlapping paths through multiple authorization mechanisms, which is not currently supported. 96 | spark.sql("USE CATALOG hive_metastore") 97 | spark.sql(f"create table if not exists hive_metastore.uc_database_to_upgrade.transactions location '{external_location_path}/transactions'") 98 | spark.sql(f"create or replace view `hive_metastore`.`uc_database_to_upgrade`.users_view_to_upgrade as select * from hive_metastore.uc_database_to_upgrade.users where id is not null") 99 | 100 | spark.sql(f"USE CATALOG {catalog}") 101 | -------------------------------------------------------------------------------- /10-migrations/05-uc-upgrade/_resources/LICENSE.py: -------------------------------------------------------------------------------- 1 | # Databricks notebook source 2 | # MAGIC %md 3 | # MAGIC ## Licence 4 | 5 | # COMMAND ---------- 6 | 7 | # MAGIC %md 8 | # MAGIC 9 | # MAGIC Copyright (2022) Databricks, Inc. 
10 | # MAGIC 11 | # MAGIC This library (the "Software") may not be used except in connection with the Licensee's use of the Databricks Platform Services pursuant 12 | # MAGIC to an Agreement (defined below) between Licensee (defined below) and Databricks, Inc. ("Databricks"). The Object Code version of the 13 | # MAGIC Software shall be deemed part of the Downloadable Services under the Agreement, or if the Agreement does not define Downloadable Services, 14 | # MAGIC Subscription Services, or if neither are defined then the term in such Agreement that refers to the applicable Databricks Platform 15 | # MAGIC Services (as defined below) shall be substituted herein for “Downloadable Services.” Licensee's use of the Software must comply at 16 | # MAGIC all times with any restrictions applicable to the Downlodable Services and Subscription Services, generally, and must be used in 17 | # MAGIC accordance with any applicable documentation. For the avoidance of doubt, the Software constitutes Databricks Confidential Information 18 | # MAGIC under the Agreement. 19 | # MAGIC 20 | # MAGIC Additionally, and notwithstanding anything in the Agreement to the contrary: 21 | # MAGIC * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES 22 | # MAGIC OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE 23 | # MAGIC LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR 24 | # MAGIC IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. 25 | # MAGIC * you may view, make limited copies of, and may compile the Source Code version of the Software into an Object Code version of the 26 | # MAGIC Software. For the avoidance of doubt, you may not make derivative works of Software (or make any any changes to the Source Code 27 | # MAGIC version of the unless you have agreed to separate terms with Databricks permitting such modifications (e.g., a contribution license 28 | # MAGIC agreement)). 29 | # MAGIC 30 | # MAGIC If you have not agreed to an Agreement or otherwise do not agree to these terms, you may not use the Software or view, copy or compile 31 | # MAGIC the Source Code of the Software. 32 | # MAGIC 33 | # MAGIC This license terminates automatically upon the termination of the Agreement or Licensee's breach of these terms. Additionally, 34 | # MAGIC Databricks may terminate this license at any time on notice. Upon termination, you must permanently delete the Software and all 35 | # MAGIC copies thereof (including the Source Code). 36 | # MAGIC 37 | # MAGIC Agreement: the agreement between Databricks and Licensee governing the use of the Databricks Platform Services, which shall be, with 38 | # MAGIC respect to Databricks, the Databricks Terms of Service located at www.databricks.com/termsofservice, and with respect to Databricks 39 | # MAGIC Community Edition, the Community Edition Terms of Service located at www.databricks.com/ce-termsofuse, in each case unless Licensee 40 | # MAGIC has entered into a separate written agreement with Databricks governing the use of the applicable Databricks Platform Services. 41 | # MAGIC 42 | # MAGIC Databricks Platform Services: the Databricks services or the Databricks Community Edition services, according to where the Software is used. 
43 | # MAGIC 44 | # MAGIC Licensee: the user of the Software, or, if the Software is being used on behalf of a company, the company. 45 | # MAGIC 46 | # MAGIC Object Code: is version of the Software produced when an interpreter or a compiler translates the Source Code into recognizable and 47 | # MAGIC executable machine code. 48 | # MAGIC 49 | # MAGIC Source Code: the human readable portion of the Software. 50 | -------------------------------------------------------------------------------- /10-migrations/05-uc-upgrade/_resources/NOTICE.py: -------------------------------------------------------------------------------- 1 | # Databricks notebook source 2 | # MAGIC %md 3 | # MAGIC ## Licence 4 | # MAGIC See LICENSE file. 5 | # MAGIC 6 | # MAGIC ## Data collection 7 | # MAGIC To improve users experience and dbdemos asset quality, dbdemos sends report usage and capture views in the installed notebook (usually in the first cell) and other assets like dashboards. This information is captured for product improvement only and not for marketing purpose, and doesn't contain PII information. By using `dbdemos` and the assets it provides, you consent to this data collection. If you wish to disable it, you can set `Tracker.enable_tracker` to False in the `tracker.py` file. 8 | # MAGIC 9 | # MAGIC ## Resource creation 10 | # MAGIC To simplify your experience, `dbdemos` will create and start for you resources. As example, a demo could start (not exhaustive): 11 | # MAGIC - A cluster to run your demo 12 | # MAGIC - A Delta Live Table Pipeline to ingest data 13 | # MAGIC - A DBSQL endpoint to run DBSQL dashboard 14 | # MAGIC - An ML model 15 | # MAGIC 16 | # MAGIC While `dbdemos` does its best to limit the consumption and enforce resource auto-termination, you remain responsible for the resources created and the potential consumption associated. 17 | # MAGIC 18 | # MAGIC ## Support 19 | # MAGIC Databricks does not offer official support for `dbdemos` and the associated assets. 20 | # MAGIC For any issue with `dbdemos` or the demos installed, please open an issue and the demo team will have a look on a best effort basis. 21 | # MAGIC 22 | # MAGIC 23 | -------------------------------------------------------------------------------- /10-migrations/05-uc-upgrade/_resources/README.py: -------------------------------------------------------------------------------- 1 | # Databricks notebook source 2 | # MAGIC %md 3 | # MAGIC ## DBDemos asset 4 | # MAGIC 5 | # MAGIC The notebooks available under `_/resources` are technical resources. 6 | # MAGIC 7 | # MAGIC Do not edit these notebooks or try to run them directly. These notebooks will load data / run some setup. 
They are indirectly called from the main notebook (`%run ./_resources/.....`) 8 | -------------------------------------------------------------------------------- /10-migrations/README.md: -------------------------------------------------------------------------------- 1 | #### Migrations 2 | 3 | This section consists of tools that will help new Customers migrate their existing workloads to Lakehouse 4 | -------------------------------------------------------------------------------- /10-migrations/Using DBSQL Serverless Client Example.py: -------------------------------------------------------------------------------- 1 | # Databricks notebook source 2 | # MAGIC %pip install -r helperfunctions/requirements.txt 3 | 4 | # COMMAND ---------- 5 | 6 | from helperfunctions.dbsqlclient import ServerlessClient 7 | 8 | # COMMAND ---------- 9 | 10 | # DBTITLE 1,Example Inputs For Client 11 | 12 | 13 | token = None ## optional 14 | host_name = None ## optional 15 | warehouse_id = "" 16 | 17 | ## Single Query Example 18 | sql_statement = "SELECT concat_ws('-', M.id, N.id, random()) as ID FROM range(1000) AS M, range(1000) AS N LIMIT 10000000" 19 | 20 | ## Multi Query Example 21 | multi_statement = "SELECT 1; SELECT 2; SELECT concat_ws('-', M.id, N.id, random()) as ID FROM range(1000) AS M, range(1000) AS N LIMIT 10000000" 22 | 23 | # COMMAND ---------- 24 | 25 | serverless_client = ServerlessClient(warehouse_id = warehouse_id, token=token, host_name=host_name) ## token=, host_name=verbose=True for print statements and other debugging messages 26 | 27 | # COMMAND ---------- 28 | 29 | # DBTITLE 1,Basic sql drop-in command 30 | """ 31 | Optional Params: 32 | 1. full_results 33 | 2. use_catalog = - this is a command specific USE CATALOG statement for the single SQL command 34 | 3. use_schema = - this is a command specific USE SCHEMA 35 | 36 | """ 37 | 38 | result_df = serverless_client.sql(sql_statement = sql_statement) ## OPTIONAL: use_catalog="hive_metastore", use_schema="default" 39 | 40 | # COMMAND ---------- 41 | 42 | # DBTITLE 1,Multi Statement Command - No Results just Status - Recommended for production 43 | """ 44 | Optional Params: 45 | 1. full_results 46 | 2. use_catalog = - this is a command specific USE CATALOG statement for the single SQL command 47 | 3. use_schema = - this is a command specific USE SCHEMA 48 | 49 | """ 50 | 51 | result = serverless_client.submit_multiple_sql_commands(sql_statements = multi_statement, full_results=False) #session_catalog, session_schema are also optional parameters that will simulate a USE statement. 
True full_results just returns the whole API response for each query 52 | 53 | # COMMAND ---------- 54 | 55 | # DBTITLE 1,Multi Statement Command Returning Results of Last Command - Best for simple processes 56 | result_multi_df = serverless_client.submit_multiple_sql_commands_last_results(sql_statements = multi_statement) 57 | 58 | # COMMAND ---------- 59 | 60 | display(result_multi_df) 61 | 62 | # COMMAND ---------- 63 | 64 | # DBTITLE 1,If Multi Statement Fails, this is how to access the result chain 65 | ## The function save the state of each command in the chain, even if it fails to return results for troubleshooting 66 | 67 | last_saved_multi_statement_state = serverless_client.multi_statement_result_state 68 | print(last_saved_multi_statement_state) 69 | -------------------------------------------------------------------------------- /10-migrations/Using DBSQL Serverless Transaction Manager Example.py: -------------------------------------------------------------------------------- 1 | # Databricks notebook source 2 | # MAGIC %pip install -r helperfunctions/requirements.txt 3 | 4 | # COMMAND ---------- 5 | 6 | from helperfunctions.dbsqltransactions import DBSQLTransactionManager 7 | 8 | # COMMAND ---------- 9 | 10 | # DBTITLE 1,Example Inputs For Client 11 | token = None ## optional 12 | host_name = None ## optional 13 | warehouse_id = "" 14 | 15 | # COMMAND ---------- 16 | 17 | # DBTITLE 1,Example Multi Statement Transaction 18 | sqlString = """ 19 | USE CATALOG hive_metastore; 20 | 21 | CREATE SCHEMA IF NOT EXISTS iot_dashboard; 22 | 23 | USE SCHEMA iot_dashboard; 24 | 25 | -- Create Tables 26 | CREATE OR REPLACE TABLE iot_dashboard.bronze_sensors 27 | ( 28 | Id BIGINT GENERATED BY DEFAULT AS IDENTITY, 29 | device_id INT, 30 | user_id INT, 31 | calories_burnt DECIMAL(10,2), 32 | miles_walked DECIMAL(10,2), 33 | num_steps DECIMAL(10,2), 34 | timestamp TIMESTAMP, 35 | value STRING 36 | ) 37 | USING DELTA 38 | TBLPROPERTIES("delta.targetFileSize"="128mb"); 39 | 40 | CREATE OR REPLACE TABLE iot_dashboard.silver_sensors 41 | ( 42 | Id BIGINT GENERATED BY DEFAULT AS IDENTITY, 43 | device_id INT, 44 | user_id INT, 45 | calories_burnt DECIMAL(10,2), 46 | miles_walked DECIMAL(10,2), 47 | num_steps DECIMAL(10,2), 48 | timestamp TIMESTAMP, 49 | value STRING 50 | ) 51 | USING DELTA 52 | PARTITIONED BY (user_id) 53 | TBLPROPERTIES("delta.targetFileSize"="128mb"); 54 | 55 | -- Statement 1 -- the load 56 | COPY INTO iot_dashboard.bronze_sensors 57 | FROM (SELECT 58 | id::bigint AS Id, 59 | device_id::integer AS device_id, 60 | user_id::integer AS user_id, 61 | calories_burnt::decimal(10,2) AS calories_burnt, 62 | miles_walked::decimal(10,2) AS miles_walked, 63 | num_steps::decimal(10,2) AS num_steps, 64 | timestamp::timestamp AS timestamp, 65 | value AS value -- This is a JSON object 66 | FROM "/databricks-datasets/iot-stream/data-device/") 67 | FILEFORMAT = json 68 | COPY_OPTIONS('force'='true') -- 'false' -- process incrementally 69 | --option to be incremental or always load all files 70 | ; 71 | 72 | -- Statement 2 73 | MERGE INTO iot_dashboard.silver_sensors AS target 74 | USING (SELECT Id::integer, 75 | device_id::integer, 76 | user_id::integer, 77 | calories_burnt::decimal, 78 | miles_walked::decimal, 79 | num_steps::decimal, 80 | timestamp::timestamp, 81 | value::string 82 | FROM iot_dashboard.bronze_sensors) AS source 83 | ON source.Id = target.Id 84 | AND source.user_id = target.user_id 85 | AND source.device_id = target.device_id 86 | WHEN MATCHED THEN UPDATE SET 87 | 
target.calories_burnt = source.calories_burnt, 88 | target.miles_walked = source.miles_walked, 89 | target.num_steps = source.num_steps, 90 | target.timestamp = source.timestamp 91 | WHEN NOT MATCHED THEN INSERT *; 92 | 93 | OPTIMIZE iot_dashboard.silver_sensors ZORDER BY (timestamp); 94 | 95 | -- This calculate table stats for all columns to ensure the optimizer can build the best plan 96 | -- Statement 3 97 | 98 | ANALYZE TABLE iot_dashboard.silver_sensors COMPUTE STATISTICS FOR ALL COLUMNS; 99 | 100 | CREATE OR REPLACE TABLE hourly_summary_statistics 101 | AS 102 | SELECT user_id, 103 | date_trunc('hour', timestamp) AS HourBucket, 104 | AVG(num_steps)::float AS AvgNumStepsAcrossDevices, 105 | AVG(calories_burnt)::float AS AvgCaloriesBurnedAcrossDevices, 106 | AVG(miles_walked)::float AS AvgMilesWalkedAcrossDevices 107 | FROM silver_sensors 108 | GROUP BY user_id,date_trunc('hour', timestamp) 109 | ORDER BY HourBucket; 110 | 111 | -- Statement 4 112 | -- Truncate bronze batch once successfully loaded 113 | TRUNCATE TABLE bronze_sensors; 114 | """ 115 | 116 | # COMMAND ---------- 117 | 118 | serverless_client_t = DBSQLTransactionManager(warehouse_id = warehouse_id, mode="inferred_altered_tables") ## token=, host_name=verbose=True for print statements and other debugging messages 119 | 120 | # COMMAND ---------- 121 | 122 | # DBTITLE 1,Submitting the Multi Statement Transaction to Serverless SQL Warehouse 123 | """ 124 | PARAMS: 125 | warehouse_id --> Required, the SQL warehouse to submit statements 126 | mode -> selected_tables, inferred_altered_tables 127 | token --> optional, will try to get one for the user 128 | host_name --> optional, will try to infer same workspace url 129 | 130 | 131 | execute_sql_transaction params: 132 | return_type --> "message", "last_results". "message" will return status of query chain. "last_result" will run all statements and return the last results of the final query in the chain 133 | 134 | """ 135 | 136 | result_df = serverless_client_t.execute_dbsql_transaction(sql_string = sqlString) 137 | -------------------------------------------------------------------------------- /10-migrations/Using Delta Helpers Notebook Example.py: -------------------------------------------------------------------------------- 1 | # Databricks notebook source 2 | # MAGIC %md 3 | # MAGIC 4 | # MAGIC ## Using Delta Helpers Materialization Class. 5 | # MAGIC 6 | # MAGIC This class is for the purpose of materializing tables with delta onto cloud storage. This is often helpful for debugging and for simplifying longer, more complex query pipelines that would otherwise require highly nested CTE statements. Often times, the plan is simplified and performane is improved by removing the lazy evaluation and creating "checkpoint" steps with a materialized temp_db. Currently spark temp tables are NOT materialized, and thus not evaluated until called which is identical to a subquery. 7 | # MAGIC 8 | # MAGIC #### Initialization 9 | # MAGIC 10 | # MAGIC
   • deltaHelpers = DeltaHelpers(temp_root_path="dbfs:/delta_temp_db", db_name="delta_temp") - The parameters shown are the defaults and can be changed to a custom db name or s3 path 11 | # MAGIC 12 | # MAGIC #### There are 4 methods: 13 | # MAGIC 14 | # MAGIC <br>
   • createOrReplaceTempDeltaTable(df: DataFrame, table_name: String) - This creates or replaces a materialized Delta table in the default DBFS location or in your provided s3 path 15 | # MAGIC <br>
   • appendToTempDeltaTable(df: DataFrame, table_name: String) - This appends to an existing Delta table, or creates a new one if it does not exist, in DBFS or your provided s3 path 16 | # MAGIC <br>
   • removeTempDeltaTable(table_name) - This removes the Delta table from your delta_temp database session 17 | # MAGIC <br>
  • removeAllTempTablesForSession() - This truncates the initialized temp_db session. It does NOT run a DROP DATABASE command because the database can be global. It only removes the session path it creates. 18 | 19 | # COMMAND ---------- 20 | 21 | # MAGIC %pip install -r helperfunctions/requirements.txt 22 | 23 | # COMMAND ---------- 24 | 25 | # DBTITLE 1,Import 26 | from helperfunctions.deltahelpers import DeltaHelpers 27 | 28 | # COMMAND ---------- 29 | 30 | # DBTITLE 1,Initialize 31 | ## 2 Params [Optional - db_name, temp_root_path] 32 | deltaHelpers = DeltaHelpers() 33 | 34 | # COMMAND ---------- 35 | 36 | # DBTITLE 1,Create or Replace Temp Delta Table 37 | df = spark.read.format("json").load("/databricks-datasets/iot-stream/data-device/") 38 | 39 | ## Methods return the cached dataframe so you can continue on as needed without reloading source each time AND you can reference in SQL (better for foreachBatch) 40 | ## No longer lazy -- this calls an action 41 | df = deltaHelpers.createOrReplaceTempDeltaTable(df, "iot_data") 42 | 43 | ## Build ML Models 44 | 45 | display(df) 46 | 47 | # COMMAND ---------- 48 | 49 | # DBTITLE 1,Read cached table quickly in python or SQL 50 | # MAGIC %sql 51 | # MAGIC -- Read cahced table quickly in python or SQL 52 | # MAGIC SELECT * FROM delta_temp.iot_data 53 | 54 | # COMMAND ---------- 55 | 56 | df.count() 57 | 58 | # COMMAND ---------- 59 | 60 | # DBTITLE 1,Append to Temp Delta Table 61 | ## Data is 1,000,000 rows 62 | df_doubled = deltaHelpers.appendToTempDeltaTable(df, "iot_data") 63 | 64 | ## Be CAREFUL HERE! Since the function calls an action, it is NOT lazily evaluated. So running it multiple times can append the same data 65 | df_doubled.count() 66 | 67 | # COMMAND ---------- 68 | 69 | # MAGIC %sql 70 | # MAGIC 71 | # MAGIC DESCRIBE HISTORY delta_temp.iot_data 72 | 73 | # COMMAND ---------- 74 | 75 | # DBTITLE 1,Remove Temp Delta Table 76 | deltaHelpers.removeTempDeltaTable("iot_data") 77 | 78 | # COMMAND ---------- 79 | 80 | # MAGIC %sql 81 | # MAGIC 82 | # MAGIC SELECT * FROM delta_temp.iot_data 83 | 84 | # COMMAND ---------- 85 | 86 | # DBTITLE 1,Truncate Session 87 | ## Deletes all tables in session path but does not drop that delta_temp database 88 | deltaHelpers.removeAllTempTablesForSession() 89 | -------------------------------------------------------------------------------- /10-migrations/Using Delta Logger.py: -------------------------------------------------------------------------------- 1 | # Databricks notebook source 2 | # MAGIC %md 3 | # MAGIC 4 | # MAGIC ## Delta Logger - How to use 5 | # MAGIC 6 | # MAGIC Purpose: This notebook utilizes the delta logger library to automatically and easiy log general pipeline information all in one place for any data pipeline. 
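At a glance, the intended pattern is to open a run at the start of a pipeline, do the work, and then mark the run complete or failed. Below is a minimal sketch of that flow; it assumes the `DeltaLogger` methods used in the example cells later in this notebook (`create_run`, `complete_run`, `fail_run`, `get_most_recent_success_run_start_time`), and the table and process names are purely illustrative:

```python
from helperfunctions.deltalogger import DeltaLogger

# Initialize against a logging table (created automatically if it does not already exist).
# The table name and process name here are illustrative placeholders.
delta_logger = DeltaLogger(
    logger_table_name="main.my_schema.delta_logger",
    process_name="my_edw_pipeline",
)

# Optional: use the last successful run's start time as an incremental-load watermark.
watermark = delta_logger.get_most_recent_success_run_start_time()

# Open a new run, attaching any metadata you want stored with it.
delta_logger.create_run(metadata={"watermark": str(watermark)})

try:
    # ... run the actual pipeline steps here, e.g. incremental loads filtered on `watermark` ...
    delta_logger.complete_run()
except Exception:
    delta_logger.fail_run()
    raise
```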
7 | # MAGIC 8 | # MAGIC All logger tables have a standard default schema DDL: 9 | # MAGIC 10 | # MAGIC CREATE TABLE IF NOT EXISTS delta_logger ( 11 | # MAGIC run_id BIGINT GENERATED BY DEFAULT AS IDENTITY, 12 | # MAGIC process_name STRING NOT NULL, 13 | # MAGIC status STRING NOT NULL, -- RUNNING, FAIL, SUCCESS, STALE 14 | # MAGIC start_timestamp TIMESTAMP NOT NULL, 15 | # MAGIC end_timestamp TIMESTAMP, 16 | # MAGIC run_metadata STRING 17 | # MAGIC ) 18 | # MAGIC USING DELTA 19 | # MAGIC PARTITIONED BY (process_name); 20 | # MAGIC 21 | # MAGIC ## Initialize 22 | # MAGIC delta_logger = DeltaLogger(logger_table="main.iot_dashboard.pipeline_logs", 23 | # MAGIC process_name="iot_pipeline", 24 | # MAGIC logger_location=None) 25 | # MAGIC 26 | # MAGIC - logger_table is the logging table you want to store and reference. You can create and manage as many logger tables as you would like. If you initilize a DeltaLogger and that table does not exist, it will create it for you. 27 | # MAGIC - process_name OPTIONAL - Users can log events/runs and pass the process_name into each event, or they can simply define it at the session level this way. This will default to using the process_name passed in here for the whole session. It can be overridden anytime. 28 | # MAGIC - logger_location OPTIONAL - default = None. This is an override for specifying a specific object storage location for where the user wants the table to live. If not provided, it will be a managed table by default (recommended). 29 | # MAGIC 30 | # MAGIC ## Methods: 31 | # MAGIC 32 | # MAGIC For most methods: -- if process_name not provided, will use session. If cannot find process_name, will error. 33 | # MAGIC 34 | # MAGIC - create_logger() -- creates a logger table if not exists. This also optimizes the table since it is used in initlialization. 35 | # MAGIC - drop_logger() -- drops the logger table attached to the session 36 | # MAGIC - truncate_logger() -- clears an existing logger table 37 | # MAGIC - start_run(process_name: Optional, msg: Optional) 38 | # MAGIC - fail_run(process_name: Optional, msg: Optional) 39 | # MAGIC - complete_run(process_name: Optional, msg: Optional) 40 | # MAGIC - get_last_successful_run_id(proces_name: Optional) -- If no previous successful run, return -1 41 | # MAGIC - get_last_successful_run_timestamp(process_name: Optional) -- If no previous successful run for the process, defaults to "1900-01-01 00:00:00" 42 | # MAGIC - get_last_run_id(process_name: Optional) -- Get last run id regardless of status, if none return -1 43 | # MAGIC - get_last_run_timestamp(process_name: Optional) -- Get last run timestamp , If no previous run for the process, defaults to "1900-01-01 00:00:00" 44 | # MAGIC - get_last_failed_run_id(process_name: Optional) 45 | # MAGIC - get_last_failed_run_timestamp(prcoess_name: Optional) 46 | # MAGIC - clean_zombie_runs(process_name: Optional) -- Will mark any runs without and end timestamp in the running state to "STALE" and give them an end timestamp. This ONLY happens when a new run is created and the runs are < the max existing RUNNING run id 47 | # MAGIC - optimize_log(process_name:Optional, zorderCols=["end_timestamp", "start_timestamp", "run_id"]) -- Optimizes the underlying log table for a particular process name a ZORDERs by input col list 48 | # MAGIC - INTERNAL: _update_run_id(run_id, process_name:Optional, start_time=None, end_time=None, status=None, run_metadata=None) 49 | # MAGIC 50 | # MAGIC ### Limitations / Considerations 51 | # MAGIC 1. 
Currently supports 1 concurrent run per process_name for a given delta table. If you want to run concurrent pipelines, you need to create separate process names for them. This is meant to be a simple run and logging tracking solution for EDW pipelines. 52 | # MAGIC 53 | # MAGIC 2. User can pass in the fully qualified table name, use the spark session defaults, or pass in catalog and database overrides to the parameters. Pick one. 54 | # MAGIC 55 | 56 | # COMMAND ---------- 57 | 58 | # MAGIC %md 59 | # MAGIC 60 | # MAGIC ## Design Patterns 61 | # MAGIC 62 | # MAGIC 1. Use for Basic error handling, tracking of runs of various processes 63 | # MAGIC 2. Use for watermarking loading patterns. i.e. Creating a new run automatically pulls the most recent previous successful run and provide a "watermark" variable you can utilize for incremental loading. Use delta_logger.get_last_succes 64 | 65 | # COMMAND ---------- 66 | 67 | from helperfunctions.deltalogger import DeltaLogger 68 | 69 | # COMMAND ---------- 70 | 71 | # MAGIC %sql 72 | # MAGIC 73 | # MAGIC CREATE DATABASE IF NOT EXISTS main.iot_dashboard_logger; 74 | # MAGIC USE CATALOG main; 75 | # MAGIC USE DATABASE iot_dashboard_logger; 76 | 77 | # COMMAND ---------- 78 | 79 | delta_logger = DeltaLogger(logger_table_name="main.iot_dashboard_logger.delta_logger", process_name='iot_dashboard_pipeline') 80 | 81 | # COMMAND ---------- 82 | 83 | delta_logger.get_most_recent_success_run_start_time() 84 | 85 | # COMMAND ---------- 86 | 87 | delta_logger.create_run(metadata={"data_quality_stuff": "oh dear"}) 88 | 89 | # COMMAND ---------- 90 | 91 | print(delta_logger.active_run_id) 92 | print(delta_logger.active_run_end_ts) 93 | print(delta_logger.active_run_start_ts) 94 | print(delta_logger.active_run_status) 95 | print(delta_logger.active_run_metadata) 96 | 97 | # COMMAND ---------- 98 | 99 | # DBTITLE 1,Complete and Fail Active Runs 100 | delta_logger.complete_run() 101 | #delta_logger.fail_run() 102 | 103 | # COMMAND ---------- 104 | 105 | # MAGIC %sql 106 | # MAGIC 107 | # MAGIC SELECT * FROM main.iot_dashboard_logger.delta_logger 108 | -------------------------------------------------------------------------------- /10-migrations/Using Delta Merge Helpers Example.py: -------------------------------------------------------------------------------- 1 | # Databricks notebook source 2 | # MAGIC %md 3 | # MAGIC 4 | # MAGIC ## Delta Merge Helpers: 5 | # MAGIC 6 | # MAGIC This is class with a set of static methods that help the user easily perform retry statements on operataions that may be cause a lot of conflicting transactions (usually in MERGE / UPDATE statements). 7 | # MAGIC 8 | # MAGIC
  • 1 Method: retrySqlStatement(spark: SparkSession, operation_name: String, sqlStatement: String) - the spark param is your existing Spark session, the operation name is simply an operation to identify your transaction, the sqlStatement parameter is the SQL statement you want to retry. 9 | 10 | # COMMAND ---------- 11 | 12 | # MAGIC %pip install -r helperfunctions/requirements.txt 13 | 14 | # COMMAND ---------- 15 | 16 | from helperfunctions.deltahelpers import DeltaMergeHelpers 17 | 18 | # COMMAND ---------- 19 | 20 | 21 | sql_statement = """ 22 | MERGE INTO iot_dashboard.silver_sensors AS target 23 | USING (SELECT Id::integer, 24 | device_id::integer, 25 | user_id::integer, 26 | calories_burnt::decimal, 27 | miles_walked::decimal, 28 | num_steps::decimal, 29 | timestamp::timestamp, 30 | value::string 31 | FROM iot_dashboard.bronze_sensors) AS source 32 | ON source.Id = target.Id 33 | AND source.user_id = target.user_id 34 | AND source.device_id = target.device_id 35 | WHEN MATCHED THEN UPDATE SET 36 | target.calories_burnt = source.calories_burnt, 37 | target.miles_walked = source.miles_walked, 38 | target.num_steps = source.num_steps, 39 | target.timestamp = source.timestamp 40 | WHEN NOT MATCHED THEN INSERT *; 41 | """ 42 | 43 | DeltaMergeHelpers.retrySqlStatement(spark, "merge_sensors", sqlStatement=sql_statement) 44 | -------------------------------------------------------------------------------- /10-migrations/Using Streaming Tables and MV Orchestrator.py: -------------------------------------------------------------------------------- 1 | # Databricks notebook source 2 | # MAGIC %md 3 | # MAGIC 4 | # MAGIC ## This library helps orchestrate Streaming tables in conjunction with other tables that may depend on synchronous updated from the streaming table for classical EDW loading patterns 5 | # MAGIC 6 | # MAGIC ## Assumptions / Best Practices 7 | # MAGIC 8 | # MAGIC 1. Assumes ST is NOT SCHEDULED in the CREATE STATEMENT (externally orchestrated) (that is a different loading pattern that is not as common in classical EDW) 9 | # MAGIC 10 | # MAGIC 2. Assumes that one or many pipelines are dependent upon the successful CREATe OR REFRESH of the streaming table, so this library will simply block the tasks from moving the job onto the rest of the DAG to ensure the downstream tasks actually read from the table when it finishes updated 11 | # MAGIC 12 | # MAGIC 3. This works best with a single node "Driver" notebook loading sql files from Git similar to how airflow would orchestrate locally. The single job node would then call spark.sql() to run the CREATE OR REFRESH and then you arent needing a warehouse and a DLT pipeline in the job for streaming refreshes. 13 | 14 | # COMMAND ---------- 15 | 16 | # MAGIC %md 17 | # MAGIC 18 | # MAGIC ## Library Steps 19 | # MAGIC 20 | # MAGIC ### This library only takes in 1 sql statement at a time, this is because if there are multiple and only some pass and others fail, then it would not be correct failing or passing the whole statement. Each ST/MV must be done separately. This can be done by simply calling the static methods multiple times. 21 | # MAGIC 22 | # MAGIC 1. Parse Streaming Table / MV Create / Refresh commmand 23 | # MAGIC 2. Identify ST / MV table(s) for that command 24 | # MAGIC 3. Run SQL command - CREATE / REFRESH ST/MV 25 | # MAGIC 4. DESCRIBE DETAIL to get pipelines.pipelineId metadata 26 | # MAGIC 5. Perform REST API Call to check for in-progress Refreshes 27 | # MAGIC 6. 
Poll and block statement chain from "finishing" until all pipelines identified are in either "PASS/FAIL" 28 | # MAGIC 7. If statement PASSES - then complete and return 29 | # MAGIC 8. If statement FAILS - then throw REFRESH FAIL exception 30 | 31 | # COMMAND ---------- 32 | 33 | from helperfunctions.stmvorchestrator import orchestrate_stmv_statement 34 | 35 | # COMMAND ---------- 36 | 37 | sql_statement = """ 38 | CREATE OR REFRESH STREAMING TABLE main.iot_dashboard.streaming_tables_raw_data 39 | AS SELECT 40 | id::bigint AS Id, 41 | device_id::integer AS device_id, 42 | user_id::integer AS user_id, 43 | calories_burnt::decimal(10,2) AS calories_burnt, 44 | miles_walked::decimal(10,2) AS miles_walked, 45 | num_steps::decimal(10,2) AS num_steps, 46 | timestamp::timestamp AS timestamp, 47 | value AS value -- This is a JSON object 48 | FROM STREAM read_files('dbfs:/databricks-datasets/iot-stream/data-device/*.json*', 49 | format => 'json', 50 | maxFilesPerTrigger => 12 -- what does this do when you 51 | ) 52 | """ 53 | 54 | # COMMAND ---------- 55 | 56 | orchestrate_stmv_statement(spark, dbutils, sql_statement=sql_statement) 57 | -------------------------------------------------------------------------------- /10-migrations/helperfunctions/.DS_Store: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/AbePabbathi/lakehouse-tacklebox/9bbbf0a7181193091d887ed57e7394a466c369cf/10-migrations/helperfunctions/.DS_Store -------------------------------------------------------------------------------- /10-migrations/helperfunctions/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/AbePabbathi/lakehouse-tacklebox/9bbbf0a7181193091d887ed57e7394a466c369cf/10-migrations/helperfunctions/__init__.py -------------------------------------------------------------------------------- /10-migrations/helperfunctions/build/lib/dbsqltransactions.py: -------------------------------------------------------------------------------- 1 | from helperfunctions.dbsqlclient import ServerlessClient 2 | from helperfunctions.transactions import Transaction, TransactionException, AlteredTableParser 3 | 4 | 5 | class DBSQLTransactionManager(Transaction): 6 | 7 | def __init__(self, warehouse_id, mode="selected_tables", uc_default=False, host_name=None, token=None): 8 | 9 | super().__init__(mode=mode, uc_default=uc_default) 10 | self.host_name = host_name 11 | self.token = token 12 | self.warehouse_id = warehouse_id 13 | 14 | return 15 | 16 | 17 | ### Execute multi statment SQL, now we can implement this easier for Serverless or not Serverless 18 | def execute_dbsql_transaction(self, sql_string, tables_to_manage=[], force=False, return_type="message"): 19 | 20 | ## return_type = message (returns status messages), last_result (returns the result of the last command in the sql chain) 21 | ## If force= True, then if transaction manager fails to find tables, then it runs the SQL anyways 22 | ## You do not NEED to run SQL this way to rollback a transaction, 23 | ## but it automatically breaks up multiple statements in one SQL file into a series of spark.sql() commands 24 | 25 | serverless_client = ServerlessClient(warehouse_id = self.warehouse_id, token=self.token, host_name=self.host_name) ## token=, host_name=verbose=True for print statements and other debugging messages 26 | 27 | result_df = None 28 | stmts = [i for i in sql_string.split(";") if len(i) >0] 29 | 30 | ## Save to class state 31 | 
self.raw_sql_statement = sql_string 32 | self.sql_statement_list = stmts 33 | 34 | success_tables = False 35 | 36 | try: 37 | self.begin_dynamic_transaction(tables_to_manage=tables_to_manage) 38 | 39 | success_tables = True 40 | 41 | except Exception as e: 42 | print(f"FAILED: failed to acquire tables with errors: {str(e)}") 43 | 44 | ## If succeeded or force = True, then run the SQL 45 | if success_tables or force: 46 | if success_tables == False and force == True: 47 | warnings.warn("WARNING: Failed to acquire tables but force flag = True, so SQL statement will run anyways") 48 | 49 | ## Run the Transaction Logic with Serverless Client 50 | try: 51 | print(f"TRANSACTION IN PROGRESS ...Running multi statement SQL transaction now\n") 52 | 53 | ###!! Since the DBSQL execution API does not understand multiple statements, we need to submit the USE commands in the correct order manually. This is done with the AlteredTableParser() 54 | 55 | ### Get the USE session tree and submit SQL statements according to that tree 56 | parser = AlteredTableParser() 57 | parser.parse_sql_chain_for_altered_tables(self.sql_statement_list) 58 | use_sessions = parser.get_use_session_tree() 59 | 60 | for i in use_sessions: 61 | 62 | session_catalog = i.get("session_cat") 63 | session_db = i.get("session_db") 64 | use_session_statemnts = i.get("sql_statements") 65 | 66 | for s in use_session_statemnts: 67 | single_st = s.get("statement") 68 | 69 | if single_st is not None: 70 | 71 | ## Submit the single command with the session USE scoped commands from the Parser Tree 72 | ## OPTION 1: return status message 73 | if return_type == "message": 74 | 75 | result_df = serverless_client.submit_multiple_sql_commands(sql_statements=single_st, use_catalog=session_catalog, use_schema=session_db) 76 | 77 | elif return_type == "last_result": 78 | 79 | result_df = serverless_client.submit_multiple_sql_commands_last_results(sql_statements=single_st, use_catalog=session_catalog, use_schema=session_db) 80 | 81 | else: 82 | result_df = None 83 | print("No run mode selected, select 'message' or 'last_results'") 84 | 85 | 86 | print(f"\n TRANSACTION SUCCEEDED: Multi Statement SQL Transaction Successfull! Updating Snapshot\n ") 87 | self.commit_transaction() 88 | 89 | 90 | ## Return results after committing sucesss outside of the for loop 91 | return result_df 92 | 93 | 94 | except Exception as e: 95 | print(f"\n TRANSACTION FAILED to run all statements... 
ROLLING BACK \n") 96 | self.rollback_transaction() 97 | print(f"Rollback successful!") 98 | 99 | raise(e) 100 | 101 | else: 102 | 103 | raise(TransactionException(message="Failed to acquire tables and force=False, not running process.", errors="Failed to acquire tables and force=False, not running process.")) 104 | -------------------------------------------------------------------------------- /10-migrations/helperfunctions/build/lib/stmvorchestrator.py: -------------------------------------------------------------------------------- 1 | import re 2 | import requests 3 | import time 4 | 5 | 6 | ## Function to block Create or REFRESH of ST or MV statements to wait until it is finishing before moving to next task 7 | 8 | ## Similar to the awaitTermination() method in a streaming pipeline 9 | 10 | ## Only supports 1 sql statement at a time on purpose 11 | 12 | def orchestrate_stmv_statement(spark, dbutils, sql_statement, host_name=None, token=None): 13 | 14 | host_name = None 15 | token = None 16 | 17 | ## Infer hostname from same workspace 18 | if host_name is not None: 19 | host_name = host_name 20 | 21 | else: 22 | host_name = dbutils.notebook.entry_point.getDbutils().notebook().getContext().apiUrl().getOrElse(None).replace("https://", "") 23 | 24 | ## Automatically get user token if none provided 25 | if token is not None: 26 | token = token 27 | else: 28 | token = dbutils.notebook.entry_point.getDbutils().notebook().getContext().apiToken().getOrElse(None) 29 | 30 | 31 | ## Get current catalogs/schemas from outside USE commands 32 | current_schema = spark.sql("SELECT current_schema()").collect()[0][0] 33 | current_catalog = spark.sql("SELECT current_catalog()").collect()[0][0] 34 | 35 | if current_catalog == 'spark_catalog': 36 | current_catalog = 'hive_metastore' 37 | 38 | 39 | ## Check for multiple statements, if more than 1, than raise too many statement exception 40 | all_statements = re.split(";", sql_statement) 41 | 42 | if (len(all_statements) > 1): 43 | print("WARNING: There are more than one statements in this sql command, this function will just pick and try to run the first statement and ignore the rest.") 44 | 45 | 46 | sql_statement = all_statements[0] 47 | 48 | 49 | try: 50 | 51 | ## Get table/mv that is being refreshed 52 | table_match = re.split("CREATE OR REFRESH STREAMING TABLE\s|REFRESH STREAMING TABLE\s|CREATE OR REFRESH MATERIALIZED VIEW\s|REFRESH MATERIALIZED VIEW\s", sql_statement.upper())[1].split(" ")[0] 53 | 54 | except Exception as e: 55 | 56 | ## If it was not able to find a REFRESH statement, ignore and unblock the operation and move on (i.e. if its not an ST/MV or if its just a CREATE) 57 | 58 | print("WARNING: No ST / MV Refresh statements found. Moving on.") 59 | return 60 | 61 | ## If ST/MV refresh was found 62 | 63 | if (len(table_match.split(".")) == 3): 64 | ## fully qualified, dont change it 65 | pass 66 | elif (len(table_match.split(".")) == 2): 67 | table_match = current_catalog + "." + table_match 68 | 69 | elif(len(table_match.split(".")) == 1): 70 | table_match = current_catalog + "." + current_schema + "." 
+ table_match 71 | 72 | 73 | ## Step 2 - Execute SQL Statement 74 | spark.sql(sql_statement) 75 | 76 | 77 | ## Step 3 - Get pipeline Id for table 78 | active_pipeline_id = (spark.sql(f"DESCRIBE DETAIL {table_match}") 79 | .selectExpr("properties").take(1)[0][0] 80 | .get("pipelines.pipelineId") 81 | ) 82 | 83 | ## Poll for pipeline status 84 | 85 | 86 | current_state = "UNKNOWN" 87 | 88 | ## Pipeline is active 89 | while current_state not in ("FAILED", "IDLE"): 90 | 91 | url = "https://" + host_name + "/api/2.0/pipelines/" 92 | headers_auth = {"Authorization":f"Bearer {token}"} 93 | 94 | check_status_resp = requests.get(url + active_pipeline_id , headers=headers_auth).json() 95 | 96 | current_state = check_status_resp.get("state") 97 | 98 | if current_state == "IDLE": 99 | print(f"STMV Pipeline {active_pipeline_id} completed! \n Moving on") 100 | return 101 | 102 | elif current_state == "FAILED": 103 | raise(BaseException(f"PIPELINE {active_pipeline_id} FAILED!")) 104 | 105 | 106 | else: 107 | ## Wait before polling again 108 | ## TODO: Do exponential backoff 109 | time.sleep(5) 110 | 111 | -------------------------------------------------------------------------------- /10-migrations/helperfunctions/dbsqltransactions.py: -------------------------------------------------------------------------------- 1 | from helperfunctions.dbsqlclient import ServerlessClient 2 | from helperfunctions.transactions import Transaction, TransactionException, AlteredTableParser 3 | 4 | 5 | class DBSQLTransactionManager(Transaction): 6 | 7 | def __init__(self, warehouse_id, mode="selected_tables", uc_default=False, host_name=None, token=None): 8 | 9 | super().__init__(mode=mode, uc_default=uc_default) 10 | self.host_name = host_name 11 | self.token = token 12 | self.warehouse_id = warehouse_id 13 | 14 | return 15 | 16 | 17 | ### Execute multi statment SQL, now we can implement this easier for Serverless or not Serverless 18 | def execute_dbsql_transaction(self, sql_string, tables_to_manage=[], force=False, return_type="message"): 19 | 20 | ## return_type = message (returns status messages), last_result (returns the result of the last command in the sql chain) 21 | ## If force= True, then if transaction manager fails to find tables, then it runs the SQL anyways 22 | ## You do not NEED to run SQL this way to rollback a transaction, 23 | ## but it automatically breaks up multiple statements in one SQL file into a series of spark.sql() commands 24 | 25 | serverless_client = ServerlessClient(warehouse_id = self.warehouse_id, token=self.token, host_name=self.host_name) ## token=, host_name=verbose=True for print statements and other debugging messages 26 | 27 | result_df = None 28 | stmts = [i for i in sql_string.split(";") if len(i) >0] 29 | 30 | ## Save to class state 31 | self.raw_sql_statement = sql_string 32 | self.sql_statement_list = stmts 33 | 34 | success_tables = False 35 | 36 | try: 37 | self.begin_dynamic_transaction(tables_to_manage=tables_to_manage) 38 | 39 | success_tables = True 40 | 41 | except Exception as e: 42 | print(f"FAILED: failed to acquire tables with errors: {str(e)}") 43 | 44 | ## If succeeded or force = True, then run the SQL 45 | if success_tables or force: 46 | if success_tables == False and force == True: 47 | warnings.warn("WARNING: Failed to acquire tables but force flag = True, so SQL statement will run anyways") 48 | 49 | ## Run the Transaction Logic with Serverless Client 50 | try: 51 | print(f"TRANSACTION IN PROGRESS ...Running multi statement SQL transaction now\n") 52 | 53 | 
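                ## NOTE: inferred from how the tree is consumed in the loop below, get_use_session_tree()
                ## returns a list of dicts shaped roughly like:
                ##   {"session_cat": "<catalog or None>",
                ##    "session_db": "<schema or None>",
                ##    "sql_statements": [{"statement": "<single SQL command>"}, ...]}
                ## Each statement is then submitted with its session's USE CATALOG / USE SCHEMA scope.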
###!! Since the DBSQL execution API does not understand multiple statements, we need to submit the USE commands in the correct order manually. This is done with the AlteredTableParser() 54 | 55 | ### Get the USE session tree and submit SQL statements according to that tree 56 | parser = AlteredTableParser() 57 | parser.parse_sql_chain_for_altered_tables(self.sql_statement_list) 58 | use_sessions = parser.get_use_session_tree() 59 | 60 | for i in use_sessions: 61 | 62 | session_catalog = i.get("session_cat") 63 | session_db = i.get("session_db") 64 | use_session_statemnts = i.get("sql_statements") 65 | 66 | for s in use_session_statemnts: 67 | single_st = s.get("statement") 68 | 69 | if single_st is not None: 70 | 71 | ## Submit the single command with the session USE scoped commands from the Parser Tree 72 | ## OPTION 1: return status message 73 | if return_type == "message": 74 | 75 | result_df = serverless_client.submit_multiple_sql_commands(sql_statements=single_st, use_catalog=session_catalog, use_schema=session_db) 76 | 77 | elif return_type == "last_result": 78 | 79 | result_df = serverless_client.submit_multiple_sql_commands_last_results(sql_statements=single_st, use_catalog=session_catalog, use_schema=session_db) 80 | 81 | else: 82 | result_df = None 83 | print("No run mode selected, select 'message' or 'last_results'") 84 | 85 | 86 | print(f"\n TRANSACTION SUCCEEDED: Multi Statement SQL Transaction Successfull! Updating Snapshot\n ") 87 | self.commit_transaction() 88 | 89 | 90 | ## Return results after committing sucesss outside of the for loop 91 | return result_df 92 | 93 | 94 | except Exception as e: 95 | print(f"\n TRANSACTION FAILED to run all statements... ROLLING BACK \n") 96 | self.rollback_transaction() 97 | print(f"Rollback successful!") 98 | 99 | raise(e) 100 | 101 | else: 102 | 103 | raise(TransactionException(message="Failed to acquire tables and force=False, not running process.", errors="Failed to acquire tables and force=False, not running process.")) 104 | -------------------------------------------------------------------------------- /10-migrations/helperfunctions/deltahelpers.py: -------------------------------------------------------------------------------- 1 | import json 2 | import requests 3 | import re 4 | import os 5 | from datetime import datetime, timedelta 6 | import uuid 7 | from pyspark.sql import SparkSession 8 | from pyspark.sql.functions import col, count, lit, max 9 | from pyspark.sql.types import * 10 | 11 | 12 | ### Helps Materialize temp tables during ETL pipelines 13 | class DeltaHelpers(): 14 | 15 | 16 | def __init__(self, db_name="delta_temp", temp_root_path="dbfs:/delta_temp_db"): 17 | 18 | self.spark = SparkSession.getActiveSession() 19 | self.db_name = db_name 20 | self.temp_root_path = temp_root_path 21 | 22 | self.dbutils = None 23 | 24 | #if self.spark.conf.get("spark.databricks.service.client.enabled") == "true": 25 | try: 26 | from pyspark.dbutils import DBUtils 27 | self.dbutils = DBUtils(self.spark) 28 | 29 | except: 30 | 31 | import IPython 32 | self.dbutils = IPython.get_ipython().user_ns["dbutils"] 33 | 34 | self.session_id =self.dbutils.notebook.entry_point.getDbutils().notebook().getContext().notebookPath().get() 35 | self.temp_env = self.temp_root_path + self.session_id 36 | self.spark.sql(f"""DROP DATABASE IF EXISTS {self.db_name} CASCADE;""") 37 | self.spark.sql(f"""CREATE DATABASE IF NOT EXISTS {self.db_name} LOCATION '{self.temp_env}'; """) 38 | print(f"Initializing Root Temp Environment: {self.db_name} at 
{self.temp_env}") 39 | 40 | return 41 | 42 | 43 | def createOrReplaceTempDeltaTable(self, df, table_name): 44 | 45 | tblObj = {} 46 | new_table_id = table_name 47 | write_path = self.temp_env + new_table_id 48 | 49 | self.spark.sql(f"DROP TABLE IF EXISTS {self.db_name}.{new_table_id}") 50 | self.dbutils.fs.rm(write_path, recurse=True) 51 | 52 | df.write.format("delta").mode("overwrite").option("path", write_path).saveAsTable(f"{self.db_name}.{new_table_id}") 53 | 54 | persisted_df = self.spark.read.format("delta").load(write_path) 55 | return persisted_df 56 | 57 | def appendToTempDeltaTable(self, df, table_name): 58 | 59 | tblObj = {} 60 | new_table_id = table_name 61 | write_path = self.temp_env + new_table_id 62 | 63 | df.write.format("delta").mode("append").option("path", write_path).saveAsTable(f"{self.db_name}.{new_table_id}") 64 | 65 | persisted_df = self.spark.read.format("delta").load(write_path) 66 | return persisted_df 67 | 68 | def removeTempDeltaTable(self, table_name): 69 | 70 | table_path = self.temp_env + table_name 71 | self.dbutils.fs.rm(table_path, recurse=True) 72 | self.spark.sql(f"""DROP TABLE IF EXISTS {self.db_name}.{table_name}""") 73 | 74 | print(f"Temp Table: {table_name} has been deleted.") 75 | return 76 | 77 | def removeAllTempTablesForSession(self): 78 | 79 | self.dbutils.fs.rm(self.temp_env, recurse=True) 80 | ##spark.sql(f"""DROP DATABASE IF EXISTS {self.db_name} CASCADE""") This temp db name COULD be global, never delete without separate method 81 | print(f"All temp tables in the session have been removed: {self.temp_env}") 82 | return 83 | 84 | 85 | 86 | class SchemaHelpers(): 87 | 88 | def __init__(): 89 | import json 90 | return 91 | 92 | @staticmethod 93 | def getDDLString(structObj): 94 | import json 95 | ddl = [] 96 | for c in json.loads(structObj.json()).get("fields"): 97 | 98 | name = c.get("name") 99 | dType = c.get("type") 100 | ddl.append(f"{name}::{dType} AS {name}") 101 | 102 | final_ddl = ", ".join(ddl) 103 | return final_ddl 104 | 105 | @staticmethod 106 | def getDDLList(structObj): 107 | import json 108 | ddl = [] 109 | for c in json.loads(structObj.json()).get("fields"): 110 | 111 | name = c.get("name") 112 | dType = c.get("type") 113 | ddl.append(f"{name}::{dType} AS {name}") 114 | 115 | return ddl 116 | 117 | @staticmethod 118 | def getFlattenedSqlExprFromValueColumn(structObj): 119 | import json 120 | ddl = [] 121 | for c in json.loads(structObj.json()).get("fields"): 122 | 123 | name = c.get("name") 124 | dType = c.get("type") 125 | ddl.append(f"value:{name}::{dType} AS {name}") 126 | 127 | return ddl 128 | 129 | 130 | 131 | 132 | class DeltaMergeHelpers(): 133 | 134 | def __init__(self): 135 | return 136 | 137 | @staticmethod 138 | def retrySqlStatement(spark, operationName, sqlStatement, maxRetries = 10, maxSecondsBetweenAttempts=60): 139 | 140 | import time 141 | maxRetries = maxRetries 142 | numRetries = 0 143 | maxWaitTime = maxSecondsBetweenAttempts 144 | ### Does not check for existence, ensure that happens before merge 145 | 146 | while numRetries <= maxRetries: 147 | 148 | try: 149 | 150 | print(f"SQL Statement Attempt for {operationName} #{numRetries + 1}...") 151 | 152 | spark.sql(sqlStatement) 153 | 154 | print(f"SQL Statement Attempt for {operationName} #{numRetries + 1} Successful!") 155 | break 156 | 157 | except Exception as e: 158 | error_msg = str(e) 159 | 160 | print(f"Failed SQL Statment Attmpet for {operationName} #{numRetries} with error: {error_msg}") 161 | 162 | numRetries += 1 163 | if numRetries > maxRetries: 
164 | break 165 | 166 | waitTime = waitTime = 2**(numRetries-1) ## Wait longer up to max wait time for failed operations 167 | 168 | if waitTime > maxWaitTime: 169 | waitTime = maxWaitTime 170 | 171 | print(f"Waiting {waitTime} seconds before next attempt on {operationName}...") 172 | time.sleep(waitTime) -------------------------------------------------------------------------------- /10-migrations/helperfunctions/dist/helperfunctions-1.0.0-py3-none-any.whl: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/AbePabbathi/lakehouse-tacklebox/9bbbf0a7181193091d887ed57e7394a466c369cf/10-migrations/helperfunctions/dist/helperfunctions-1.0.0-py3-none-any.whl -------------------------------------------------------------------------------- /10-migrations/helperfunctions/helperfunctions.egg-info/PKG-INFO: -------------------------------------------------------------------------------- 1 | Metadata-Version: 2.1 2 | Name: helperfunctions 3 | Version: 1.0.0 4 | Summary: Lakehouse Warehousing and Delta Helper Functions 5 | Author: Cody Austin Davis @Databricks, Inc. 6 | Author-email: cody.davis@databricks.com 7 | Requires-Dist: sqlparse 8 | Requires-Dist: sql_metadata 9 | Requires-Dist: sqlglot 10 | Requires-Dist: pyarrow 11 | -------------------------------------------------------------------------------- /10-migrations/helperfunctions/helperfunctions.egg-info/SOURCES.txt: -------------------------------------------------------------------------------- 1 | datavalidator.py 2 | dbsqlclient.py 3 | dbsqltransactions.py 4 | deltahelpers.py 5 | deltalogger.py 6 | redshiftchecker.py 7 | setup.py 8 | stmvorchestrator.py 9 | transactions.py 10 | helperfunctions.egg-info/PKG-INFO 11 | helperfunctions.egg-info/SOURCES.txt 12 | helperfunctions.egg-info/dependency_links.txt 13 | helperfunctions.egg-info/requires.txt 14 | helperfunctions.egg-info/top_level.txt -------------------------------------------------------------------------------- /10-migrations/helperfunctions/helperfunctions.egg-info/dependency_links.txt: -------------------------------------------------------------------------------- 1 | 2 | -------------------------------------------------------------------------------- /10-migrations/helperfunctions/helperfunctions.egg-info/requires.txt: -------------------------------------------------------------------------------- 1 | sqlparse 2 | sql_metadata 3 | sqlglot 4 | pyarrow 5 | -------------------------------------------------------------------------------- /10-migrations/helperfunctions/helperfunctions.egg-info/top_level.txt: -------------------------------------------------------------------------------- 1 | datavalidator 2 | dbsqlclient 3 | dbsqltransactions 4 | deltahelpers 5 | deltalogger 6 | redshiftchecker 7 | stmvorchestrator 8 | transactions 9 | -------------------------------------------------------------------------------- /10-migrations/helperfunctions/requirements.txt: -------------------------------------------------------------------------------- 1 | sqlglot 2 | pyarrow -------------------------------------------------------------------------------- /10-migrations/helperfunctions/setup.py: -------------------------------------------------------------------------------- 1 | from setuptools import setup 2 | 3 | setup( 4 | name='helperfunctions', 5 | version='1.0.0', 6 | description='Lakehouse Warehousing and Delta Helper Functions', 7 | author='Cody Austin Davis @Databricks, Inc.', 8 | author_email='cody.davis@databricks.com', 
9 | py_modules=['datavalidator', 10 | 'dbsqltransactions', 11 | 'stmvorchestrator', 12 | 'redshiftchecker', 13 | 'dbsqlclient', 14 | 'transactions', 15 | 'deltalogger', 16 | 'deltahelpers'], 17 | install_requires=[ 18 | 'sqlparse', 19 | 'sql_metadata', 20 | 'sqlglot', 21 | 'pyarrow' 22 | ] 23 | ) -------------------------------------------------------------------------------- /10-migrations/helperfunctions/stmvorchestrator.py: -------------------------------------------------------------------------------- 1 | import re 2 | import requests 3 | import time 4 | 5 | 6 | ## Function to block Create or REFRESH of ST or MV statements to wait until it is finishing before moving to next task 7 | 8 | ## Similar to the awaitTermination() method in a streaming pipeline 9 | 10 | ## Only supports 1 sql statement at a time on purpose 11 | 12 | def orchestrate_stmv_statement(spark, dbutils, sql_statement, host_name=None, token=None): 13 | 14 | host_name = None 15 | token = None 16 | 17 | ## Infer hostname from same workspace 18 | if host_name is not None: 19 | host_name = host_name 20 | 21 | else: 22 | host_name = dbutils.notebook.entry_point.getDbutils().notebook().getContext().apiUrl().getOrElse(None).replace("https://", "") 23 | 24 | ## Automatically get user token if none provided 25 | if token is not None: 26 | token = token 27 | else: 28 | token = dbutils.notebook.entry_point.getDbutils().notebook().getContext().apiToken().getOrElse(None) 29 | 30 | 31 | ## Get current catalogs/schemas from outside USE commands 32 | current_schema = spark.sql("SELECT current_schema()").collect()[0][0] 33 | current_catalog = spark.sql("SELECT current_catalog()").collect()[0][0] 34 | 35 | if current_catalog == 'spark_catalog': 36 | current_catalog = 'hive_metastore' 37 | 38 | 39 | ## Check for multiple statements, if more than 1, than raise too many statement exception 40 | all_statements = re.split(";", sql_statement) 41 | 42 | if (len(all_statements) > 1): 43 | print("WARNING: There are more than one statements in this sql command, this function will just pick and try to run the first statement and ignore the rest.") 44 | 45 | 46 | sql_statement = all_statements[0] 47 | 48 | 49 | try: 50 | 51 | ## Get table/mv that is being refreshed 52 | table_match = re.split("CREATE OR REFRESH STREAMING TABLE\s|REFRESH STREAMING TABLE\s|CREATE OR REFRESH MATERIALIZED VIEW\s|REFRESH MATERIALIZED VIEW\s", sql_statement.upper())[1].split(" ")[0] 53 | 54 | except Exception as e: 55 | 56 | ## If it was not able to find a REFRESH statement, ignore and unblock the operation and move on (i.e. if its not an ST/MV or if its just a CREATE) 57 | 58 | print("WARNING: No ST / MV Refresh statements found. Moving on.") 59 | return 60 | 61 | ## If ST/MV refresh was found 62 | 63 | if (len(table_match.split(".")) == 3): 64 | ## fully qualified, dont change it 65 | pass 66 | elif (len(table_match.split(".")) == 2): 67 | table_match = current_catalog + "." + table_match 68 | 69 | elif(len(table_match.split(".")) == 1): 70 | table_match = current_catalog + "." + current_schema + "." 
+ table_match 71 | 72 | 73 | ## Step 2 - Execute SQL Statement 74 | spark.sql(sql_statement) 75 | 76 | 77 | ## Step 3 - Get pipeline Id for table 78 | active_pipeline_id = (spark.sql(f"DESCRIBE DETAIL {table_match}") 79 | .selectExpr("properties").take(1)[0][0] 80 | .get("pipelines.pipelineId") 81 | ) 82 | 83 | ## Poll for pipeline status 84 | 85 | 86 | current_state = "UNKNOWN" 87 | 88 | ## Pipeline is active 89 | while current_state not in ("FAILED", "IDLE"): 90 | 91 | url = "https://" + host_name + "/api/2.0/pipelines/" 92 | headers_auth = {"Authorization":f"Bearer {token}"} 93 | 94 | check_status_resp = requests.get(url + active_pipeline_id , headers=headers_auth).json() 95 | 96 | current_state = check_status_resp.get("state") 97 | 98 | if current_state == "IDLE": 99 | print(f"STMV Pipeline {active_pipeline_id} completed! \n Moving on") 100 | return 101 | 102 | elif current_state == "FAILED": 103 | raise(BaseException(f"PIPELINE {active_pipeline_id} FAILED!")) 104 | 105 | 106 | else: 107 | ## Wait before polling again 108 | ## TODO: Do exponential backoff 109 | time.sleep(5) 110 | 111 | -------------------------------------------------------------------------------- /20-operational-excellence/README.md: -------------------------------------------------------------------------------- 1 | #### Operational Excellence 2 | 3 | This section consists of tools that will help Infrastructure Administrators automate Lakehouse management and operations, eg. data pipelines, workflows, CI/CD processes, IaaS 4 | 5 | # 6 | 1. [Terraform Examples](https://github.com/databricks/terraform-databricks-examples) -------------------------------------------------------------------------------- /30-performance/README.md: -------------------------------------------------------------------------------- 1 | #### Performance Optimizations 2 | 3 | This section consists of tools that will help Developers and Administrators optimize the performance of Lakehouse processes. 4 | 5 | 6 | 1. [Delta Optimizer](https://github.com/AbePabbathi/lakehouse-tacklebox/tree/main/30-performance/delta-optimizer) 7 | 2. [TPC-DS Runner](https://github.com/AbePabbathi/lakehouse-tacklebox/tree/main/30-performance/TPC-DS%20Runner) 8 | 3. [Query Replay Tool](https://github.com/AbePabbathi/lakehouse-tacklebox/tree/main/30-performance/dbsql-query-replay-tool) -------------------------------------------------------------------------------- /30-performance/TPC-DS Runner/CONTRIBUTING.md: -------------------------------------------------------------------------------- 1 | # Contributing 2 | Contributions are welcome! Feel free to file an issue/PR or reach out to michael.berk@databricks.com. 3 | 4 | ### Potential Contributions (in order of importance) 5 | * Support UC 6 | * Support writing raw queries to UC volumes instead of DBFS 7 | * Modify existing data write to handle conversion to UC-managed tables 8 | * Scope warehouse concurrency limitations 9 | * Improve beaker performance calculations - [issue](https://github.com/goodwillpunning/beaker/issues/24) 10 | * Look to improve data write performance. Some options include: 11 | * Improve threading of writes. 12 | * Document baseline TPC-DS benchmarking runtimes - [template](https://github.com/databricks/spark-sql-perf/blob/master/src/main/notebooks/tpcds_datagen.scala) 13 | * Allow for running create data OR run benchmarking. 
Currently these methods are coupled and it prevent's rerunning different warehouse benchmarks against the same data source 14 | * Add dashboarding and further analysis using Nishant's tool(s) 15 | * Determine if spark-sql-perf supports latest LTS DBR version or if we need to hardcode 12.2 16 | * Make Beaker pip-installable within a Databricks notebook then remove the hard-coded .whl url - [issue](https://github.com/goodwillpunning/beaker/issues/19) 17 | -------------------------------------------------------------------------------- /30-performance/TPC-DS Runner/README.md: -------------------------------------------------------------------------------- 1 | # Databricks TPC-DS Benchmarking Tool 2 | 3 | This tool runs the TPC-DS benchmark on a Databricks SQL warehouse. The TPC-DS benchmark is a standardised method of evaluating the performance of decision support solutions, such as databases, data warehouses, and big data systems. 4 | 5 | **Disclaimer: this tool is simple. It will not duplicate warehouse peak performance out-of-the-box. Instead, it's meant to be a transparent and representative baseline.** 6 | 7 | ## Quick Start 8 | #### 0 - Clone this repo via [Databricks Repos](https://docs.databricks.com/en/repos/index.html) 9 | 10 | #### 1 - Open the main notebook 11 | 12 | 13 | #### 2 - Create or Attach to Cluster 14 | 15 | 16 | Note that if you're using a unity catalog (UC) table, UC must be enabled on this cluster. 17 | Note that we don't support serverless clusters at this time. 18 | 19 | #### 3 - Run your parameters 20 | * Note that you may have to run the first cell in the notebook to see the widgets. 21 | 22 | 23 | ## Parameters 24 | Data 25 | * **Catalog Name**: the name of the catalog to write to for non-UC configurations 26 | * **Schema Prefix**: a string that will be prepended to the dynamically-generated schema name 27 | * **Number of GB of Data**: the number of gigabytes of TPC-DS data to be written. `1` indicates that the sum of all table sizes will be ~1GB. 28 | 29 | Warehouse 30 | * **Maximum Number of Clusters**: the maximum number of workers to which a SQL warehouse can scale 31 | * **Warehouse Size**: T-shirt size of the SQL warehouse workers 32 | * **Channel**: the warehouse channel, which correspond to the underlying DBR version 33 | 34 | Load Testing 35 | * **Concurrency**: the simulated number of users executing the TPC-DS queries. On the backend, this corresponds to the number of Python threads. 36 | * **Query Repeatition Count**: the number of times the TPC-DS queries will be repeatedly run. `2` indicates that each TPC-DS query will be run twice. Note that caching is disabled, so repeated queries will not hit cache. 37 | 38 | #### 4 - Click "Run All" 39 | 40 | 41 | #### What will happen? 42 | After clicking run all, a Databricks workflow with two tasks will be created. The first task is responsible for writing TPC-DS data and the associated queries into Delta tables. The second task will execute a TPC-DS benchmark leveraging the tables and queries created in the prior task. The results of the bechmarking will be printed out in the job notebook for viewing, but also will be written to a delta table; the location of the delta table will be printed in the job notebook. 43 | 44 | 45 | 46 | ## Core Concepts 47 | - **Concurrency**: The simulated number of users executing concurrent queries. It provides an insight into how well the system can handle multiple users executing queries at the same time. 
48 | - **Throughput**: The number of queries that the system can handle per unit of time. It is usually measured in queries per minute (QPM) and provides insignt into the speed and efficiency of the system. 49 | 50 | # Product Details 51 | ## Relevant Features 52 | * The tool is cloud agnostic. 53 | * Authentication is automatically handled by the python SDK. 54 | * Benchmarking will be performed on the latest LTS DBR version. 55 | * Result cache is hard-coded to false, which means that all queries will not hit a warehouse's cache. 56 | * Each benchmark run will trigger a warehouse "warming," which is just a `SELECT *` on all TPC-DS tables. 57 | * Table format is hard-coded to delta. Data writes are currently hard-coded to DBR 12.2, so if there are updates in Delta with newer DBR versions, they will not be included. This decision was made because spark-sql-perf did not run on > 12.2 DBR as of 2023-08-10. 58 | * A new warehouse will be created based on user parameters. If a warehouse with the same name exists, the benchmarking tool will use that existing warehouse. 59 | * Given Python's Global Processing Lock (GIL), increasing the number of cores will have diminshing returns. To hide complexity from the user while also bounding cost, the concurrency parameter will scale cluster count linearly up to 100 cores, then stop. Concurrency > 100 however is still supported via multithreading - it will just run on a maximum of 100 cores. Based on our default node type, this will be 25 workers. 60 | * We are using [Databricks python-sql-connector](https://docs.databricks.com/en/dev-tools/python-sql-connector.html) to execute queries, but we are not fetching the results. The python-sql-connector has a built-in feature that retries with backoff when rate limit errors occur. Due to this retry mechanism, the actual performance of the system may be slightly faster than what the benchmarking results indicate. 61 | * If the data (with a given set of configs) already exists, it will not be overwritten. The matching logic simply uses the name of the schema, so if you change the `schema_prefix` (and that resulting schema is not found), new data will be written. 62 | 63 | ### Limitations 64 | * You must run this tool from a single-user cluster to allow default SDK authentication. 65 | * We currently don't support UC. That will be the next step for this tool. 66 | * We currently only support DBSQL serverless warehouses for simplicity. If there is desire to test non-serverless warehouses, please let us know. 67 | 68 | ### Data Generation Runtimes 69 | Both the data generation and benchmarking workflow tasks will increase in runtime as the data size increases. Here are some examples, however your benchmarking runtimes may differ signifigantly depending on your configurations. 
68 | ### Data Generation Runtimes 69 | Both the data generation and benchmarking workflow tasks will increase in runtime as the data size increases. Here are some examples; however, your benchmarking runtimes may differ significantly depending on your configurations. 70 | | Number of GB Written | create_data_and_queries Runtime | TPCDS_benchmarking Runtime | 71 | |---------|---------|---------| 72 | | 1 GB | 17 mins | 7 mins | 73 | | 100 GB | 70 mins | 24 mins | 74 | | 1 TB | 305 mins | 54 mins | 75 | -------------------------------------------------------------------------------- /30-performance/TPC-DS Runner/assets/images/cluster.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/AbePabbathi/lakehouse-tacklebox/9bbbf0a7181193091d887ed57e7394a466c369cf/30-performance/TPC-DS Runner/assets/images/cluster.png -------------------------------------------------------------------------------- /30-performance/TPC-DS Runner/assets/images/filters.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/AbePabbathi/lakehouse-tacklebox/9bbbf0a7181193091d887ed57e7394a466c369cf/30-performance/TPC-DS Runner/assets/images/filters.png -------------------------------------------------------------------------------- /30-performance/TPC-DS Runner/assets/images/main_notebook.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/AbePabbathi/lakehouse-tacklebox/9bbbf0a7181193091d887ed57e7394a466c369cf/30-performance/TPC-DS Runner/assets/images/main_notebook.png -------------------------------------------------------------------------------- /30-performance/TPC-DS Runner/assets/images/run_all.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/AbePabbathi/lakehouse-tacklebox/9bbbf0a7181193091d887ed57e7394a466c369cf/30-performance/TPC-DS Runner/assets/images/run_all.png -------------------------------------------------------------------------------- /30-performance/TPC-DS Runner/assets/images/workflow.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/AbePabbathi/lakehouse-tacklebox/9bbbf0a7181193091d887ed57e7394a466c369cf/30-performance/TPC-DS Runner/assets/images/workflow.png -------------------------------------------------------------------------------- /30-performance/TPC-DS Runner/constants.py: -------------------------------------------------------------------------------- 1 | # Databricks notebook source 2 | pip install --upgrade databricks-sdk -q 3 | 4 | # COMMAND ---------- 5 | 6 | dbutils.library.restartPython() 7 | 8 | # COMMAND ---------- 9 | 10 | import os 11 | import math 12 | from dataclasses import dataclass 13 | from utils.general import tables_already_exist, get_widget_values, create_widgets 14 | 15 | @dataclass 16 | class Constants: 17 | ############### Variables dependent upon user parameters ############## 18 | # Number of GBs of TPCDS data to write 19 | number_of_gb_of_data: int 20 | 21 | # Name of the catalog to write TPCDS data to 22 | catalog_name: str 23 | 24 | # Prefix of the schema to write TPCDS data to 25 | schema_prefix: str 26 | 27 | # Size of the warehouse cluster 28 | warehouse_size: str 29 | 30 | # Maximum number of clusters to scale to in the warehouse 31 | maximum_number_of_clusters: int 32 | 33 | # Warehouse channel name 34 | channel: str 35 | 36 | # Number of concurrent threads 37 | concurrency: int 38 | 39 | # Number of times to repeat each benchmarking query 40 | query_repetition_count: int 41 | 42 | ############### Variables independent of user parameters ############# 43 | # Name of the job 44 | job_name =
f"[AUTOMATED] Create and run TPC-DS" 45 | 46 | # Dynamic variables that are used to create downstream variables 47 | _current_user_email = ( 48 | dbutils.notebook.entry_point.getDbutils() 49 | .notebook() 50 | .getContext() 51 | .userName() 52 | .get() 53 | ) 54 | _cwd = os.getcwd().replace("/Workspace", "") 55 | 56 | # User-specific parameters, which are used to create directories and cluster single-access-mode 57 | current_user_email = _current_user_email 58 | current_user_name = ( 59 | _current_user_email.replace(".", "_").replace("-", "_").split("@")[0] 60 | ) 61 | 62 | # Base directory where all data and queries will be written 63 | root_directory = f"dbfs:/Benchmarking/TPCDS/{current_user_name}" 64 | 65 | # Additional subdirectories within the above root_directory 66 | script_path = os.path.join(root_directory, "scripts") 67 | data_path = os.path.join(root_directory, "data") 68 | query_path = os.path.join(root_directory, "queries") 69 | 70 | # Location of the spark-sql-perf jar, which is used to create TPC-DS data and queries 71 | jar_path = os.path.join(script_path, "jars/spark-sql-perf_2.12-0.5.1-SNAPSHOT.jar") 72 | 73 | # Location of the init script, which is responsible for installing the above jar and other prerequisites 74 | init_script_path = os.path.join(script_path, "tpcds-install.sh") 75 | 76 | # Location of the dist whl for beaker 77 | beaker_whl_path = os.path.join(script_path, "beaker-0.0.1-py3-none-any.whl") 78 | 79 | # Location of the notebook that creates data and queries 80 | create_data_and_queries_notebook_path = os.path.join( 81 | _cwd, "notebooks/create_data_and_queries" 82 | ) 83 | 84 | # Location of the notebook that runs TPC-DS queries against written data using the beaker library 85 | run_tpcds_benchmarking_notebook_path = os.path.join( 86 | _cwd, "notebooks/run_tpcds_benchmarking" 87 | ) 88 | 89 | # Name of the current databricks host 90 | host = f"https://{spark.conf.get('spark.databricks.workspaceUrl')}/" 91 | 92 | def _validate_concurrency_will_utilize_cluster(self): 93 | required_number_of_clusters = math.ceil(self.concurrency / 10) 94 | if self.maximum_number_of_clusters > required_number_of_clusters: 95 | 96 | print( 97 | "Warning:\n" 98 | "\tFor optimal performance, we recommend using 1 cluster per 10 levels of concurrency. Your currrent\n" 99 | "\tconfiguration will underutilize the warehouse and a cheaper configuration shuold exhibit the same performance.\n" 100 | f"\tPlease try using {required_number_of_clusters} clusters instead." 
101 | ) 102 | 103 | def __post_init__(self): 104 | # Create a schema prefix if '' to ensure unrelated schemas are not deleted 105 | if self.schema_prefix == "": 106 | self.schema_prefix = "tpcds_benchmark" 107 | 108 | # Name of the schema that tpcds data and benchmarking metrics will be written to 109 | self.schema_name: str = ( 110 | f"{self.schema_prefix.rstrip('_')}_{self.number_of_gb_of_data}_gb" 111 | ) 112 | 113 | # Add schema to data path 114 | self.data_path = os.path.join(self.data_path, self.schema_name) 115 | 116 | # Determine if TPC-DS tables already exist 117 | self.tables_already_exist = tables_already_exist(spark, self.catalog_name, self.schema_name) 118 | 119 | # Param validations/warnings 120 | self._validate_concurrency_will_utilize_cluster() 121 | 122 | -------------------------------------------------------------------------------- /30-performance/TPC-DS Runner/main.py: -------------------------------------------------------------------------------- 1 | # Databricks notebook source 2 | # DBTITLE 1,Import Constants 3 | # MAGIC %run ./constants 4 | 5 | # COMMAND ---------- 6 | 7 | # DBTITLE 1,Add Widgets to Notebook 8 | create_widgets(dbutils) 9 | 10 | # COMMAND ---------- 11 | 12 | # DBTITLE 1,Pull Variables from Notebook Widgets 13 | constants = Constants( 14 | **get_widget_values(dbutils) 15 | ) 16 | 17 | # COMMAND ---------- 18 | 19 | # DBTITLE 1,Create and Run TPC-DS Benchmark 20 | from utils.run import run 21 | 22 | run(spark, dbutils, constants) 23 | -------------------------------------------------------------------------------- /30-performance/TPC-DS Runner/notebooks/create_data_and_queries.scala: -------------------------------------------------------------------------------- 1 | // Databricks notebook source 2 | // DBTITLE 1,Get parameters from job 3 | // name of the user, formatted to be passable as a schema 4 | val userName = dbutils.widgets.get("current_user_name") 5 | 6 | // The scaleFactor defines the size of the dataset to generate (in GB) 7 | val scaleFactor = dbutils.widgets.get("scale_factor") 8 | 9 | // Location to store the queries 10 | val queryDir = dbutils.widgets.get("query_directory") 11 | 12 | // Location to store the data 13 | val dataDir = dbutils.widgets.get("data_directory") 14 | 15 | // Name of the catalog to write the tpcds data 16 | val catalogName = dbutils.widgets.get("catalog_name") 17 | 18 | // Name of the database to write the tpcds data 19 | val schemaName = dbutils.widgets.get("schema_name") 20 | 21 | // Determine if tables with the same parameters have already been written 22 | val tablesAlreadyExist = dbutils.widgets.get("tables_already_exist") 23 | 24 | // COMMAND ---------- 25 | 26 | // DBTITLE 1,Write Data 27 | if (tablesAlreadyExist == "false") { 28 | // source: https://github.com/deepaksekaranz/TPCDSDataGen/tree/master/TPCDS-Kit 29 | import com.databricks.spark.sql.perf.tpcds.TPCDSTables 30 | 31 | // The scaleFactor defines the size of the dataset to generate (in GB) 32 | val scaleFactorInt = scaleFactor.toInt 33 | 34 | // Set the file type 35 | val fileFormat = "delta" 36 | 37 | // Initialize TPCDS tables with given parameters 38 | val tables = new TPCDSTables( 39 | sqlContext = sqlContext, 40 | dsdgenDir = "/usr/local/bin/tpcds-kit/tools", 41 | scaleFactor = scaleFactor, 42 | useDoubleForDecimal = false, // If true, replaces DecimalType with DoubleType 43 | useStringForDate = false // If true, replaces DateType with StringType 44 | ) 45 | 46 | // Generate TPC-DS data 47 | tables.genData( 48 | location = dataDir,
format = "delta", 50 | overwrite = true, // overwrite the data that is already there 51 | partitionTables = false, // create the partitioned fact tables 52 | clusterByPartitionColumns = false, // shuffle to get partitions coalesced into single files. 53 | filterOutNullPartitionValues = false, // true to filter out the partition with NULL key value 54 | tableFilter = "", // "" means generate all tables 55 | numPartitions = 20 // how many dsdgen partitions to run - number of input tasks. 56 | ) 57 | 58 | // Create the specified database if it doesn't exist 59 | sql(s"create schema if not exists $schemaName") 60 | 61 | // Create metastore tables in a specified database for your data. The current database will be switched to the specified database. 62 | // Once tables are created, the current database will be switched to the specified database. 63 | tables.createExternalTables(dataDir, fileFormat, schemaName, overwrite = true, discoverPartitions = false) 64 | 65 | // Convert the tables to managed 66 | val tableInfo = dbutils.fs.ls(dataDir).map(x => (x.name.stripSuffix("/"), x.path)) 67 | 68 | for ((tableName, tablePath) <- tableInfo) { 69 | spark.sql(s"DROP TABLE IF EXISTS ${catalogName}.${schemaName}.${tableName}") 70 | spark.sql(s""" 71 | CREATE TABLE ${catalogName}.${schemaName}.${tableName} 72 | LOCATION '$tablePath' 73 | """) 74 | } 75 | } 76 | 77 | // COMMAND ---------- 78 | 79 | // DBTITLE 1,Write Queries 80 | import scala.util.Try 81 | import com.databricks.spark.sql.perf.tpcds.TPCDS 82 | import com.databricks.spark.sql.perf.Query 83 | 84 | def writeQueriesToDBFS(dbfsPath: String, queries: Map[String, Query]): Unit = { 85 | queries.foreach { case (fileName, query) => 86 | val dbfsFilePath = s"$dbfsPath/$fileName.sql" 87 | val putResult = Try(dbutils.fs.put(dbfsFilePath, query.sqlText.getOrElse(""), overwrite = true)) 88 | 89 | putResult match { 90 | case scala.util.Success(_) => println(s"Successfully written to $dbfsFilePath") 91 | case scala.util.Failure(exception) => println(s"Failed to write to $dbfsFilePath: ${exception.getMessage}") 92 | } 93 | } 94 | } 95 | 96 | val tpcds = new TPCDS (sqlContext = sqlContext) 97 | val sqlQueries = tpcds.tpcds2_4QueriesMap 98 | 99 | writeQueriesToDBFS(queryDir, sqlQueries) 100 | -------------------------------------------------------------------------------- /30-performance/TPC-DS Runner/notebooks/run_tpcds_benchmarking.py: -------------------------------------------------------------------------------- 1 | # Databricks notebook source 2 | # MAGIC %md ### Run TPC-DS Benchmarks 3 | 4 | # COMMAND ---------- 5 | 6 | pip install --upgrade databricks-sdk -q 7 | 8 | # COMMAND ---------- 9 | 10 | dbutils.library.restartPython() 11 | 12 | # COMMAND ---------- 13 | 14 | # DBTITLE 1,Configuration Variables 15 | import time 16 | from databricks.sdk import WorkspaceClient 17 | 18 | # Host and PAT for beaker authentication 19 | HOST = spark.conf.get('spark.databricks.workspaceUrl') 20 | PAT = WorkspaceClient().tokens.create(comment='temp use', lifetime_seconds=60*60*12).token_value 21 | 22 | # ID of the warehouse to run benchmarks with 23 | WAREHOUSE_ID = dbutils.widgets.get("warehouse_id") 24 | WAREHOUSE_HTTP_PATH = f"/sql/1.0/warehouses/{WAREHOUSE_ID}" 25 | 26 | # Name of the catalog to read/write to 27 | CATALOG_NAME = dbutils.widgets.get("catalog_name") 28 | 29 | # Name of the schema to read/write to 30 | SCHEMA_NAME = dbutils.widgets.get("schema_name") 31 | 32 | # Location of query files 33 | QUERY_PATH = 
dbutils.widgets.get("query_path").lstrip('/').replace('dbfs:','/dbfs') 34 | 35 | # Number of concurrent threads (procs) in beaker 36 | CONCURRENCY = int(dbutils.widgets.get("concurrency")) 37 | 38 | # Number of times each query will be repeated 39 | QUERY_REPETITION_COUNT = int(dbutils.widgets.get("query_repetition_count")) 40 | 41 | # Id of the job, which is used to create the schema 42 | try: 43 | job_id = dbutils.notebook.entry_point.getDbutils().notebook().getContext().tags().get("jobId").get() 44 | METRICS_TABLE_NAME = f"benchmark_metrics_for_job_{job_id}" 45 | except AttributeError as e: 46 | print("This notebook must be run within a Databricks workflow.") 47 | raise e 48 | 49 | # TPC-DS Tables to Warm 50 | TPCDS_TABLE_NAMES = { 51 | "call_center", 52 | "catalog_page", 53 | "catalog_returns", 54 | "catalog_sales", 55 | "customer", 56 | "customer_address", 57 | "customer_demographics", 58 | "date_dim", 59 | "household_demographics", 60 | "income_band", 61 | "inventory", 62 | "item", 63 | "promotion", 64 | "reason", 65 | "ship_mode", 66 | "store", 67 | "store_returns", 68 | "store_sales", 69 | "time_dim", 70 | "warehouse", 71 | "web_page", 72 | "web_returns", 73 | "web_sales", 74 | "web_site", 75 | } 76 | 77 | # COMMAND ---------- 78 | 79 | # DBTITLE 1,Start Warehouse 80 | warehouse_start_time = time.time() 81 | WorkspaceClient().warehouses.start_and_wait(WAREHOUSE_ID) 82 | print(f"{int(time.time() - warehouse_start_time)}s Warehouse Startup Time") 83 | 84 | # COMMAND ---------- 85 | 86 | # DBTITLE 1,Run Benchmark 87 | from beaker import benchmark 88 | from functools import reduce 89 | from pyspark.sql import DataFrame 90 | import pyspark.sql.functions as F 91 | 92 | # Create beaker benchmark object 93 | bm = benchmark.Benchmark(results_cache_enabled=False) 94 | 95 | # Set benchmarking parameters 96 | bm.setName(name=f"TPC-DS Benchmark {SCHEMA_NAME}") 97 | bm.setHostname(hostname=HOST) 98 | bm.setWarehouse(http_path=WAREHOUSE_HTTP_PATH) 99 | bm.setConcurrency(concurrency=CONCURRENCY) 100 | bm.setWarehouseToken(token=PAT) 101 | bm.setCatalog(catalog=CATALOG_NAME) 102 | bm.setSchema(schema=SCHEMA_NAME) 103 | bm.setQueryFileDir(QUERY_PATH) 104 | bm.setQueryRepeatCount(QUERY_REPETITION_COUNT) 105 | 106 | # Warm the warehouse.
This won't be perfect, but it's the best we can do with current serverless queueing 107 | tables_with_schema = [f"{SCHEMA_NAME}.{t}" for t in TPCDS_TABLE_NAMES] 108 | for _ in range(int(min(CONCURRENCY, 50))): 109 | bm.preWarmTables(tables_with_schema) 110 | 111 | # Execute run 112 | start_time = time.time() 113 | result = bm.execute() 114 | duration = time.time() - start_time 115 | 116 | # Store run metrics 117 | metrics_df = spark.createDataFrame(result) 118 | 119 | # COMMAND ---------- 120 | 121 | # DBTITLE 1,Write Metrics to a Delta Table 122 | # write output dataframe to delta for analysis/consumption 123 | metrics_full_path = f"{CATALOG_NAME}.{SCHEMA_NAME}.{METRICS_TABLE_NAME}" 124 | print(f"Writing to delta table: {metrics_full_path}") 125 | metrics_df.write.mode('overwrite').saveAsTable(metrics_full_path) 126 | 127 | # Display the table for reference 128 | metrics_df.display() 129 | 130 | # COMMAND ---------- 131 | 132 | # DBTITLE 1,Throughput 133 | sql_files = [1 for x in dbutils.fs.ls(QUERY_PATH.replace('/dbfs','dbfs:')) if x.name.endswith('.sql')] 134 | n_sql_queries = len(sql_files) * QUERY_REPETITION_COUNT 135 | print(f"TPC-DS queries per minute: {n_sql_queries / (duration / 60)}") 136 | 137 | # COMMAND ---------- 138 | 139 | 140 | -------------------------------------------------------------------------------- /30-performance/TPC-DS Runner/utils/run.py: -------------------------------------------------------------------------------- 1 | from utils.general import setup_files 2 | from utils.databricks_client import DatabricksClient 3 | 4 | 5 | def run(spark, dbutils, constants): 6 | # Step 0: create the write schema if it doesn't already exist 7 | spark.sql( 8 | f"create schema if not exists {constants.catalog_name}.{constants.schema_name}" 9 | ) 10 | 11 | # Step 1: write init script, jar, and beaker whl to DBFS 12 | setup_files( 13 | dbutils, 14 | constants.jar_path, 15 | constants.init_script_path, 16 | constants.beaker_whl_path, 17 | ) 18 | 19 | # Step 2: create the client 20 | client = DatabricksClient(constants) 21 | 22 | # Step 3: create a warehouse to benchmark against 23 | warehouse_id = client.create_warehouse().id 24 | constants.warehouse_id = warehouse_id 25 | 26 | # Step 4: create and run a job that writes TPCDS data and queries to a given location and runs the benchmarks 27 | job_id = client.create_job().job_id 28 | run_id = client.run_job(job_id).run_id 29 | 30 | # Step 5: print the job run url so the user can monitor it to completion 31 | url = f"{constants.host.replace('www.','')}#job/{job_id}/run/{run_id}" 32 | print(f"\nA TPC-DS benchmarking job was created at the following url:\n\t{url}\n") 33 | print(f"It will write TPC-DS data to {constants.data_path}.") 34 | print( 35 | "The job may take several hours depending upon data size, so please check back when it's complete.\n" 36 | ) -------------------------------------------------------------------------------- /30-performance/dbsql-query-replay-tool/01-Query_Replay_Tool.py: -------------------------------------------------------------------------------- 1 | # Databricks notebook source 2 | # MAGIC %run ./00-Functions 3 | 4 | # COMMAND ---------- 5 | 6 | # SETUP 7 | test_name = "" 8 | result_catalog = "" 9 | result_schema = "" 10 | token = "" 11 | 12 | # SETUP SOURCE WAREHOUSE ID AND START AND END TIME 13 | source_warehouse_id = "" 14 | source_start_time = "2023-12-01 00:00:00" 15 | source_end_time = "2023-12-01 00:05:00" 16 | 17 | replay_test = QueryReplayTest( 18 | test_name=test_name, 19 | result_catalog=result_catalog, 20 | result_schema=result_schema, 21 | token=token,
22 | source_warehouse_id=source_warehouse_id, 23 | source_start_time=source_start_time, 24 | source_end_time=source_end_time, 25 | ) 26 | 27 | test_id = replay_test.run() 28 | 29 | # COMMAND ---------- 30 | 31 | # MAGIC %md 32 | # MAGIC ### `query_df` is the list of source queries we are going to use 33 | 34 | # COMMAND ---------- 35 | 36 | replay_test.query_df.orderBy('start_time').display() 37 | 38 | # COMMAND ---------- 39 | 40 | # MAGIC %md 41 | # MAGIC ### `show_run` gives the test details 42 | 43 | # COMMAND ---------- 44 | 45 | replay_test.show_run.display() 46 | 47 | # COMMAND ---------- 48 | 49 | # MAGIC %md 50 | # MAGIC ### `show_run_details` gives the corresponding statement_id's for all the queries we ran 51 | 52 | # COMMAND ---------- 53 | 54 | replay_test.show_run_details.display() 55 | 56 | # COMMAND ---------- 57 | 58 | # MAGIC %md 59 | # MAGIC ### `query_results` gives the result comparing source details against the test output 60 | 61 | # COMMAND ---------- 62 | 63 | # Recreating the test object with the test_id allows us to retrieve the query results later (since system tables might not have all the results immediately) 64 | 65 | replay_test = QueryReplayTest( 66 | test_name=test_name, 67 | result_catalog=result_catalog, 68 | result_schema=result_schema, 69 | token=token, 70 | source_warehouse_id=source_warehouse_id, 71 | source_start_time=source_start_time, 72 | source_end_time=source_end_time, 73 | test_id=test_id 74 | ) 75 | 76 | replay_test.query_results.display() 77 | -------------------------------------------------------------------------------- /30-performance/dbsql-query-replay-tool/README.md: -------------------------------------------------------------------------------- 1 | # Databricks SQL Query Replay Tool 2 | 3 | This tool aims to help users evaluate the performance of different warehouses by replaying a set of queries from one warehouse's history against another. 4 | 5 | ## Notebooks 6 | 7 | * `00-Functions` is the notebook containing the python class 8 | * `01-Query_Replay_Tool` is the notebook that is used to execute the test 9 | 10 | ## Requirements 11 | 12 | Users need access to the query history system table `system.query.history` in order to extract the queries and start times for the test. 13 | 14 | ## Usage 15 | 16 | Users need to set the following parameters 17 | 18 | * `test_name`: Test Identifier 19 | * `result_catalog` and `result_schema`: The schema where the test results will be written to 20 | * `token`: A Databricks PAT that will be used to launch those queries 21 | * `source_warehouse_id`: The warehouse ID where the original queries were submitted 22 | * `source_start_time`: The start time to filter for queries 23 | * `source_end_time`: The end time to filter for queries 24 | 25 | And here are a number of optional configurations for the target warehouse where the queries will be replayed to (see [Create Warehouse API doc](https://docs.databricks.com/api/workspace/warehouses/create) for more details); a hedged example of combining them follows the list. 26 | 27 | * `target_warehouse_size` 28 | * `target_warehouse_max_num_clusters` 29 | * `target_warehouse_type` 30 | * `target_warehouse_serverless` 31 | * `target_warehouse_custom_tags` 32 | * `target_warehouse_channel` 33 |
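For example, the optional target-warehouse settings might be combined with the required parameters when constructing the test. The keyword-argument form below is an assumption for illustration only, and the values are placeholders; check `00-Functions` for the exact signature of `QueryReplayTest`.

```python
# Hypothetical sketch - assumes the optional target-warehouse settings are accepted
# as keyword arguments by QueryReplayTest; see 00-Functions for the actual signature.
replay_test = QueryReplayTest(
    test_name=test_name,
    result_catalog=result_catalog,
    result_schema=result_schema,
    token=token,
    source_warehouse_id=source_warehouse_id,
    source_start_time=source_start_time,
    source_end_time=source_end_time,
    target_warehouse_size="Large",           # placeholder value
    target_warehouse_max_num_clusters=4,     # placeholder value
    target_warehouse_serverless=True,        # placeholder value
)
```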
34 | The replay can be executed as follows. 35 | 36 | ```python 37 | replay_test = QueryReplayTest( 38 | test_name=test_name, 39 | result_catalog=result_catalog, 40 | result_schema=result_schema, 41 | token=token, 42 | source_warehouse_id=source_warehouse_id, 43 | source_start_time=source_start_time, 44 | source_end_time=source_end_time, 45 | ) 46 | 47 | test_id = replay_test.run() 48 | ``` 49 | 50 | Once the test is completed, it will return the `test_id` which can be used to retrieve the result. 51 | 52 | Here is other functionality within the `QueryReplayTest`: 53 | 54 | * `replay_test.query_df` returns the queries that were used for the test 55 | * `replay_test.show_run` returns the metadata of the test 56 | * `show_run_details` returns the corresponding statement_id's for all the queries we ran 57 | * `query_results` returns the result comparing source details against the test output 58 | 59 | All the output data are written to the nominated schema in tables `query_replay_test_run` and `query_replay_test_run_details` if you want to query them directly as well. 60 | -------------------------------------------------------------------------------- /30-performance/delta-optimizer/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/AbePabbathi/lakehouse-tacklebox/9bbbf0a7181193091d887ed57e7394a466c369cf/30-performance/delta-optimizer/__init__.py -------------------------------------------------------------------------------- /30-performance/delta-optimizer/customer-facing-delta-optimizer/deltaoptimizer-1.5.5-py3-none-any.whl: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/AbePabbathi/lakehouse-tacklebox/9bbbf0a7181193091d887ed57e7394a466c369cf/30-performance/delta-optimizer/customer-facing-delta-optimizer/deltaoptimizer-1.5.5-py3-none-any.whl -------------------------------------------------------------------------------- /30-performance/delta-optimizer/deltaoptimizer/.DS_Store: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/AbePabbathi/lakehouse-tacklebox/9bbbf0a7181193091d887ed57e7394a466c369cf/30-performance/delta-optimizer/deltaoptimizer/.DS_Store -------------------------------------------------------------------------------- /30-performance/delta-optimizer/deltaoptimizer/.gitignore: -------------------------------------------------------------------------------- 1 | 2 | .databricks 3 | -------------------------------------------------------------------------------- /30-performance/delta-optimizer/deltaoptimizer/.vscode/settings.json: -------------------------------------------------------------------------------- 1 | { 2 | "python.envFile": "${workspaceFolder}/.databricks/.databricks.env", 3 | "databricks.python.envFile": "${workspaceFolder}/.env", 4 | "jupyter.interactiveWindow.cellMarker.codeRegex": "^# COMMAND ----------|^# Databricks notebook source|^(#\\s*%%|#\\s*\\|#\\s*In\\[\\d*?\\]|#\\s*In\\[ \\])", 5 | "jupyter.interactiveWindow.cellMarker.default": "# COMMAND ----------" 6 | } -------------------------------------------------------------------------------- /30-performance/delta-optimizer/deltaoptimizer/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/AbePabbathi/lakehouse-tacklebox/9bbbf0a7181193091d887ed57e7394a466c369cf/30-performance/delta-optimizer/deltaoptimizer/__init__.py
-------------------------------------------------------------------------------- /30-performance/delta-optimizer/deltaoptimizer/deltaoptimizer.egg-info/PKG-INFO: -------------------------------------------------------------------------------- 1 | Metadata-Version: 2.1 2 | Name: deltaoptimizer 3 | Version: 1.5.5 4 | Summary: Delta Optimizer Beta - UC Enabled 5 | Author: Cody Austin Davis @Databricks, Inc. 6 | Author-email: cody.davis@databricks.com 7 | Requires-Dist: sqlparse 8 | Requires-Dist: sql_metadata 9 | -------------------------------------------------------------------------------- /30-performance/delta-optimizer/deltaoptimizer/deltaoptimizer.egg-info/SOURCES.txt: -------------------------------------------------------------------------------- 1 | deltaoptimizer.py 2 | setup.py 3 | deltaoptimizer.egg-info/PKG-INFO 4 | deltaoptimizer.egg-info/SOURCES.txt 5 | deltaoptimizer.egg-info/dependency_links.txt 6 | deltaoptimizer.egg-info/requires.txt 7 | deltaoptimizer.egg-info/top_level.txt -------------------------------------------------------------------------------- /30-performance/delta-optimizer/deltaoptimizer/deltaoptimizer.egg-info/dependency_links.txt: -------------------------------------------------------------------------------- 1 | 2 | -------------------------------------------------------------------------------- /30-performance/delta-optimizer/deltaoptimizer/deltaoptimizer.egg-info/requires.txt: -------------------------------------------------------------------------------- 1 | sqlparse 2 | sql_metadata 3 | -------------------------------------------------------------------------------- /30-performance/delta-optimizer/deltaoptimizer/deltaoptimizer.egg-info/top_level.txt: -------------------------------------------------------------------------------- 1 | deltaoptimizer 2 | -------------------------------------------------------------------------------- /30-performance/delta-optimizer/deltaoptimizer/dist/deltaoptimizer-1.5.5-py3-none-any.whl: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/AbePabbathi/lakehouse-tacklebox/9bbbf0a7181193091d887ed57e7394a466c369cf/30-performance/delta-optimizer/deltaoptimizer/dist/deltaoptimizer-1.5.5-py3-none-any.whl -------------------------------------------------------------------------------- /30-performance/delta-optimizer/deltaoptimizer/setup.py: -------------------------------------------------------------------------------- 1 | from setuptools import setup 2 | 3 | setup( 4 | name='deltaoptimizer', 5 | version='1.5.5', 6 | description='Delta Optimizer Beta - UC Enabled', 7 | author='Cody Austin Davis @Databricks, Inc.', 8 | author_email='cody.davis@databricks.com', 9 | install_requires=[ 10 | 'sqlparse', 11 | 'sql_metadata' 12 | ] 13 | ) -------------------------------------------------------------------------------- /40-observability/README.md: -------------------------------------------------------------------------------- 1 | #### Governance/Observability 2 | 3 | This section consists of tools that will help CDOs, Billing Administrators and Infrastructure Administrators to get a better understanding of the usage and cost drivers of the Lakehouse 4 | 5 | # 6 | 1. [Data Profiling](https://github.com/AbePabbathi/lakehouse-tacklebox/tree/master/40-observability/data-profiling) 7 | 2. [DBSQL Monitoring](https://github.com/AbePabbathi/lakehouse-tacklebox/tree/master/40-observability/dbsql-logging) 8 | 3. 
[Stream Monitoring](https://github.com/AbePabbathi/lakehouse-tacklebox/tree/main/40-observability/stream-monitoring) 9 | 4. PII Detector (Coming soon) -------------------------------------------------------------------------------- /40-observability/dbsql-logging/00-Config.py: -------------------------------------------------------------------------------- 1 | # Databricks notebook source 2 | # DBTITLE 1,API Config 3 | # Please ensure the url starts with https and DOES NOT have a slash at the end 4 | WORKSPACE_HOST = 'https://adb-2541733722036151.11.azuredatabricks.net' 5 | WAREHOUSE_URL = "{0}/api/2.0/sql/warehouses".format(WORKSPACE_HOST) ## SQL Warehouses APIs 2.0 6 | QUERIES_URL = "{0}/api/2.0/sql/history/queries".format(WORKSPACE_HOST) ## Query History API 2.0 7 | WORKFLOWS_URL = "{0}/api/2.1/jobs/list".format(WORKSPACE_HOST) ## Jobs & Workflows History API 2.1 8 | DASHBOARDS_URL = "{0}/api/2.0/preview/sql/queries".format(WORKSPACE_HOST,250) ## Queries and Dashboards API - ❗️in preview, deprecated soon❗️ 9 | 10 | MAX_RESULTS_PER_PAGE = 1000 11 | MAX_PAGES_PER_RUN = 500 12 | PAGE_SIZE = 250 # 250 is the max 13 | 14 | # We will fetch all queries that were started between this number of hours ago, and now() 15 | # Queries that are running for longer than this will not be updated. 16 | # Can be set to a much higher number when backfilling data, for example when this Job didn't run for a while. 17 | NUM_HOURS_TO_UPDATE = 168 18 | 19 | # COMMAND ---------- 20 | 21 | # DBTITLE 1,API Authentication 22 | # If you want to run this notebook yourself, you need to create a Databricks personal access token, 23 | # store it using our secrets API, and pass it in through the Spark config, such as this: 24 | # spark.pat_token {{secrets/query_history_etl/user}}, or Azure Keyvault. 
25 | 26 | #Databricks secrets API 27 | #AUTH_HEADER = {"Authorization" : "Bearer " + spark.conf.get("spark.pat_token")} 28 | #Azure KeyVault 29 | #AUTH_HEADER = {"Authorization" : "Bearer " + dbutils.secrets.get(scope = "", key = "")} 30 | #Naughty way 31 | AUTH_HEADER = {"Authorization" : "Bearer " + "dapixxxxxxxxxxxxxxxxxxxxxxxxx"} 32 | 33 | # COMMAND ---------- 34 | 35 | # DBTITLE 1,Database and Table Config 36 | DATABASE_NAME = "dbsql_logging" 37 | # DATABASE_LOCATION = "/s3-location/" 38 | QUERIES_TABLE_NAME = "queries" 39 | WAREHOUSES_TABLE_NAME = "warehouses" 40 | WORKFLOWS_TABLE_NAME = "workflows" 41 | DASHBOARDS_TABLE_NAME = "dashboards_preview" 42 | 43 | # COMMAND ---------- 44 | 45 | # DBTITLE 1,Delta Table Maintenance 46 | QUERIES_ZORDER = "endpoint_id" 47 | WAREHOUSES_ZORDER = "id" 48 | WORKFLOWS_ZORDER = "job_id" 49 | DASHBOARDS_ZORDER = "id" 50 | 51 | VACUUM_RETENTION = 168 52 | -------------------------------------------------------------------------------- /40-observability/dbsql-logging/01-Functions.py: -------------------------------------------------------------------------------- 1 | # Databricks notebook source 2 | # DBTITLE 1,Check if spark can read the table 3 | def check_table_exist(db_tbl_name): 4 | table_exist = False 5 | try: 6 | spark.read.table(db_tbl_name) # Check if spark can read the table 7 | table_exist = True 8 | except: 9 | pass 10 | return table_exist 11 | 12 | # COMMAND ---------- 13 | 14 | # DBTITLE 1,Current time in milliseconds 15 | def current_time_in_millis(): 16 | return round(time.time() * 1000) 17 | 18 | # COMMAND ---------- 19 | 20 | # DBTITLE 1,True False fix 21 | def get_boolean_keys(arrays): 22 | # A quirk in Python's and Spark's handling of JSON booleans requires us to convert True and False to true and false 23 | boolean_keys_to_convert = [] 24 | for array in arrays: 25 | for key in array.keys(): 26 | if type(array[key]) is bool: 27 | boolean_keys_to_convert.append(key) 28 | #print(boolean_keys_to_convert) 29 | return boolean_keys_to_convert 30 | 31 | # COMMAND ---------- 32 | 33 | # DBTITLE 1,Turn API results into json 34 | def result_to_json(result): 35 | return json.dumps(result.json()) 36 | 37 | # COMMAND ---------- 38 | 39 | # DBTITLE 1,Get specific page results (Dashboards API only) 40 | def get_page_result(base_url, page, auth): 41 | return requests.get(f'{base_url}&page={page}&order=executed_at', headers=auth) 42 | 43 | # COMMAND ---------- 44 | 45 | # DBTITLE 1,Get specific offset results (Workflows API only) 46 | def get_offset_result(base_url, offest, auth): 47 | return requests.get(f'{base_url}&offset={offest}', headers=auth) 48 | -------------------------------------------------------------------------------- /40-observability/dbsql-logging/02-Initialization.py: -------------------------------------------------------------------------------- 1 | # Databricks notebook source 2 | # MAGIC %run ./00-Config 3 | 4 | # COMMAND ---------- 5 | 6 | spark.sql(f'CREATE DATABASE IF NOT EXISTS {DATABASE_NAME}') 7 | # optional: add location 8 | # spark.sql(f'CREATE DATABASE IF NOT EXISTS {DATABASE_NAME} LOCATION {DATABASE_LOCATION}') 9 | -------------------------------------------------------------------------------- /40-observability/dbsql-logging/05-Alert_Syntax.sql: -------------------------------------------------------------------------------- 1 | -- Databricks notebook source 2 | -- MAGIC %md 3 | -- MAGIC ## Syntax for Alerts in DBSQL 4 | -- MAGIC 5 | -- MAGIC This notebook contains snippets that may be useful for alerting on
DBSQL usage. 6 | -- MAGIC 7 | -- MAGIC Schedule this using workflows, or use the schedule dropdown on the top right of the notebook. The job should take <20 mins to run and consume ~1 DBU; there's very little data being processed here, even on busy workspaces. 8 | -- MAGIC 9 | -- MAGIC Alerts should be actionable. "Nice to know" information just acts as noise. Examples of actionable alerts may be: 10 | -- MAGIC * Terminating long running warehouses 11 | -- MAGIC * Investigating long running queries 12 | -- MAGIC * Sizing up warehouses that have specific query failures 13 | -- MAGIC 14 | -- MAGIC **Remember**, if you want to be notified of a query taking 2 hours to run, this job must be scheduled at least every two hours 15 | -- MAGIC 16 | -- MAGIC ### How to set up alerts with DBSQL 17 | -- MAGIC 1. DBSQL > SQL Editor > create a query in DBSQL by copying the below or creating your own, name it, and save it 18 | -- MAGIC 2. DBSQL > Alerts > Create Alert > select the query you have just saved, set the threshold for values to be alerted for, save, then change the destination if needed 19 | -- MAGIC 20 | -- MAGIC [Official docs](https://docs.databricks.com/sql/user/alerts/index.html) 21 | 22 | -- COMMAND ---------- 23 | 24 | -- MAGIC %run ./00-Config 25 | 26 | -- COMMAND ---------- 27 | 28 | -- %run ./03-APIs_to_Delta 29 | -- Uncomment this if you would like to run as part of a job 30 | -- Remove these comments too! 31 | 32 | -- COMMAND ---------- 33 | 34 | -- MAGIC %python 35 | -- MAGIC spark.sql(f' USE {DATABASE_NAME}') 36 | 37 | -- COMMAND ---------- 38 | 39 | -- DBTITLE 1,Queries currently running that are over the 95th percentile 40 | SELECT round(duration/1000/60/60,2) as duration_h, * 41 | FROM queries 42 | WHERE duration > (SELECT percentile(duration, 0.95) AS duration_95 43 | FROM queries WHERE status = "FINISHED" 44 | AND statement_type IN ("SELECT", "MERGE")) 45 | AND status = "RUNNING" 46 | ORDER BY duration DESC 47 | 48 | -- COMMAND ---------- 49 | 50 | -- DBTITLE 1,Queries taking over 6 hours to run 51 | SELECT round(duration/1000/60/60,2) as duration_h, * 52 | FROM queries 53 | WHERE duration > 21600000 --6 hours in milliseconds 54 | AND status = "RUNNING" 55 | ORDER BY duration DESC 56 |
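-- COMMAND ----------

-- DBTITLE 1,Warehouses with queries currently queueing (added example)
-- A hedged extra snippet, not part of the original notebook: queueing usually means the warehouse
-- is undersized for the load, which ties to the "sizing up warehouses" idea above. Column names
-- assume the raw Query History API fields landed by 03-APIs_to_Delta (endpoint_id, status);
-- adjust them if your queries table differs.
SELECT endpoint_id, count(*) AS queued_queries
FROM queries
WHERE status = "QUEUED"
GROUP BY endpoint_id
ORDER BY queued_queries DESC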
-------------------------------------------------------------------------------- /40-observability/dbsql-logging/99-Maintenance.py: -------------------------------------------------------------------------------- 1 | # Databricks notebook source 2 | # MAGIC %run ./00-Config 3 | 4 | # COMMAND ---------- 5 | 6 | # DBTITLE 1,Optimize & zOrder 7 | spark.sql(f'OPTIMIZE {DATABASE_NAME}.{QUERIES_TABLE_NAME} ZORDER BY {QUERIES_ZORDER}') 8 | spark.sql(f'OPTIMIZE {DATABASE_NAME}.{WAREHOUSES_TABLE_NAME} ZORDER BY {WAREHOUSES_ZORDER}') 9 | spark.sql(f'OPTIMIZE {DATABASE_NAME}.{DASHBOARDS_TABLE_NAME} ZORDER BY {DASHBOARDS_ZORDER}') 10 | spark.sql(f'OPTIMIZE {DATABASE_NAME}.{WORKFLOWS_TABLE_NAME} ZORDER BY {WORKFLOWS_ZORDER}') 11 | 12 | # COMMAND ---------- 13 | 14 | # DBTITLE 1,Allow for parallel deletes in Vacuum 15 | spark.conf.set("spark.databricks.delta.vacuum.parallelDelete.enabled", True) 16 | 17 | # COMMAND ---------- 18 | 19 | # DBTITLE 1,Delete small files no longer in use 20 | spark.sql(f'VACUUM {DATABASE_NAME}.{QUERIES_TABLE_NAME} RETAIN {VACUUM_RETENTION} HOURS') 21 | spark.sql(f'VACUUM {DATABASE_NAME}.{WAREHOUSES_TABLE_NAME} RETAIN {VACUUM_RETENTION} HOURS') 22 | spark.sql(f'VACUUM {DATABASE_NAME}.{DASHBOARDS_TABLE_NAME} RETAIN {VACUUM_RETENTION} HOURS') 23 | spark.sql(f'VACUUM {DATABASE_NAME}.{WORKFLOWS_TABLE_NAME} RETAIN {VACUUM_RETENTION} HOURS') 24 | -------------------------------------------------------------------------------- /40-observability/dbsql-logging/README.md: -------------------------------------------------------------------------------- 1 | ### dbsql-logging 2 | This tool is a collection of notebooks that pulls together data from 4 different APIs to produce useful metrics to monitor DBSQL usage: 3 | * [SQL Warehouses APIs 2.0](https://docs.databricks.com/sql/api/sql-endpoints.html), referred to as the Warehouse API 4 | * [Query History API 2.0](https://docs.databricks.com/sql/api/query-history.html), referred to as the Queries API 5 | * [Jobs API 2.1](https://docs.databricks.com/dev-tools/api/latest/jobs.html), referred to as the Workflows API 6 | * [Queries and Dashboards API](https://docs.databricks.com/sql/api/queries-dashboards.html), referred to as the Dashboards API - ❗️in preview, known issues, deprecated soon❗️ 7 | 8 | Creator: holly.smith@databricks.com 9 | 10 | #### Setup 11 | This tool has been tested with the following 12 | Cluster config: 13 | * 11.3 LTS 14 | * Driver: i3.xlarge 15 | * Workers: 2 x i3.xlarge - the data here is fairly small 16 | 17 | Profile: 18 | Must be an **admin** in your workspace for the Dashboards API 19 | 20 | #### Notebooks 21 | 22 | ##### 00-Config 23 | This is the configuration of: 24 | * Workspace URL 25 | * Authentication options 26 | * Database and Table storage 27 | * `OPTIMIZE`, `ZORDER` and `VACUUM` settings 28 | 29 | ##### 01-Functions 30 | * Reusable functions created, all pulled out for code readability 31 | 32 | ##### 02-Initialization 33 | * Creates the database if it doesn't exist 34 | * Optional: specify a location for the Database 35 | Dependent on: `00-Config` 36 | 37 | ##### 03-APIs_to_Delta 38 | 39 | **Warehouses API:** Appends the results of each API call and uses a snapshot time to identify each call 40 | 41 | **Query History API:** Upserts / merges new queries to the original table 42 | 43 | **Workflows API:** Upserts / merges new workflows to the original table 44 | 45 | **Dashboards API:** I have tried my best to refer to it as a preview in every step of the code to reflect how this is a preview 46 | 47 | Dependent on: `00-Config`, `01-Functions`, `02-Initialization` 48 | 49 | ##### 04-Metrics 50 | * Dashboards & Queries with owner, useful for finding orphaned records 51 | * Queries to Optimise 52 | * Warehouse Metrics 53 | * Per User Metrics 54 | 55 | 56 | Dependent on: `00-Config` 57 | 58 | 59 | ##### 99-Maintenance 60 | Runs `OPTIMIZE`, `ZORDER` and `VACUUM` against tables 61 | 62 | Dependent on: `00-Config` 63 | 64 | 65 | ##### Troubleshooting 66 | 67 | ###### Cluster OOM 68 | The data used here was very small, even for a Databricks demo workspace with thousands of users. Parts of 03-APIs_to_Delta involve pulling JSON to the driver; in the highly unlikely event of driver OOM you have two choices: 69 | 1. The quick option: select a larger driver 70 | 2. The robust option: loop through the results one page at a time and write each page out with Spark as you go (see the sketch below) 71 |
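A minimal sketch of the robust option, reusing the constants from `00-Config` (`QUERIES_URL`, `AUTH_HEADER`, `PAGE_SIZE`, `DATABASE_NAME`, `QUERIES_TABLE_NAME`); the `_raw` table name is illustrative, and the pagination fields (`res`, `has_next_page`, `next_page_token`) are the ones the Query History API returns:

```python
import json
import requests

def query_history_pages(url, auth_header, first_payload):
    # Yield one page of query history at a time instead of accumulating
    # the whole JSON response on the driver.
    response = requests.get(url, data=json.dumps(first_payload), headers=auth_header).json()
    yield response.get("res") or []
    while response.get("has_next_page"):
        next_payload = {"max_results": PAGE_SIZE, "page_token": response["next_page_token"]}
        response = requests.get(url, data=json.dumps(next_payload), headers=auth_header).json()
        yield response.get("res") or []

for page in query_history_pages(QUERIES_URL, AUTH_HEADER, {"max_results": PAGE_SIZE}):
    if page:
        # Write each page out immediately so only one page is ever held in driver memory.
        spark.createDataFrame(page).write.mode("append").saveAsTable(
            f"{DATABASE_NAME}.{QUERIES_TABLE_NAME}_raw"
        )
```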
72 | ###### Dashboards API not sorting in new queries 73 | There are known issues with the API. Where possible, try to use the Query History API instead. 74 | 75 | ###### Dashboards API has stopped working 76 | This API will go through stages of deprecation, unfortunately with no hard timelines as of yet. Here is the rough process: 77 | 1. When DBSQL Pro comes out, the API will be officially deprecated 78 | 2. It should (*should*) be removed from the documentation at that point 79 | 3. Later it will become totally unavailable 80 | 81 | The Query History API captures data shown below 82 | 83 | ![Query History](https://i.imgur.com/fZaQYzT.png) 84 | -------------------------------------------------------------------------------- /40-observability/dbsql-query-history-sync/README.md: -------------------------------------------------------------------------------- 1 | Sync a Delta table with the query history from a dbsql warehouse. 2 | 3 | As easy as: 4 | 5 | pip install from dist/dbsql_query_history_sync-0.0.1-py3-none-any.whl or dist/dbsql_query_history_sync-0.0.1.tar.gz 6 | 7 | To download the query history without a Databricks environment or PySpark (need to change the dbsql host, warehouse_ids and access token): 8 | ``` 9 | > cd examples 10 | > ./standalone_dbsql_get_query_history_example.py 11 | ``` 12 | 13 | To create a Delta table and continuously sync queries from the dbsql warehouses to it: 14 | 15 | ``` 16 | import dbsql_query_history_sync.delta_sync as delta_sync 17 | 18 | # create the object 19 | udbq = delta_sync.UpdateDBQueries(spark_session=spark, dbx_token=DBX_TOKEN, workspace_url=workspace_url, 20 | warehouse_ids=warehouse_ids_list, earliest_query_ts_ms=dt_ts, table_name=sync_table) 21 | udbq.update_db_repeat(interval_secs=10) 22 | ``` 23 | 24 | See examples/dbsql_query_sync_example.py. 25 | 26 | For questions contact nishant.deshpande@databricks.com. 27 | 28 | -------------------------------------------------------------------------------- /40-observability/dbsql-query-history-sync/__init__.py: -------------------------------------------------------------------------------- 1 | # module 2 | -------------------------------------------------------------------------------- /40-observability/dbsql-query-history-sync/dist/dbsql_query_history_sync-0.0.1-py3-none-any.whl: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/AbePabbathi/lakehouse-tacklebox/9bbbf0a7181193091d887ed57e7394a466c369cf/40-observability/dbsql-query-history-sync/dist/dbsql_query_history_sync-0.0.1-py3-none-any.whl -------------------------------------------------------------------------------- /40-observability/dbsql-query-history-sync/dist/dbsql_query_history_sync-0.0.1.tar.gz: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/AbePabbathi/lakehouse-tacklebox/9bbbf0a7181193091d887ed57e7394a466c369cf/40-observability/dbsql-query-history-sync/dist/dbsql_query_history_sync-0.0.1.tar.gz -------------------------------------------------------------------------------- /40-observability/dbsql-query-history-sync/examples/dbsql_query_sync_example.py: -------------------------------------------------------------------------------- 1 | # Databricks notebook source 2 | import datetime, dateutil 3 | import sys, os 4 | import json 5 | #import dbutils 6 | import time 7 | 8 | 9 | # COMMAND ---------- 10 | 11 | sys.path.append(f"{os.getcwd()}/../src") 12 | #sys.path 13 | 14 | # COMMAND ---------- 15 | 16 | import dbsql_query_history_sync.queries_api as queries_api 17 | import dbsql_query_history_sync.delta_sync as delta_sync 18 | 19 | # COMMAND ---------- 20 | 21 | import importlib 22 | 23 | # COMMAND ---------- 24 | 25 | importlib.reload(queries_api) 26 | importlib.reload(delta_sync) 27 | 28 | # COMMAND ---------- 29 | 30 | # replace as required 31 | workspace_url = 'e2-demo-field-eng.cloud.databricks.com' 32 | warehouse_ids_list = ['475b94ddc7cd5211',] 33
| 34 | # COMMAND ---------- 35 | 36 | # Replace as required 37 | DBX_TOKEN = dbutils.secrets.get(scope='nishant-deshpande', key='dbsql-api-key') # subst your scope + key to query the API 38 | 39 | 40 | # COMMAND ---------- 41 | 42 | # Adjust the history period as required. 43 | dt = datetime.datetime.now() - datetime.timedelta(minutes=5) 44 | #dt = datetime.datetime.now() - datetime.timedelta(hours=1) 45 | print(dt) 46 | dt_ts = int(dt.timestamp() * 1000) 47 | print(dt_ts) 48 | 49 | # COMMAND ---------- 50 | 51 | # get queries as a list 52 | x = queries_api.get_query_history( 53 | dbx_token=DBX_TOKEN, 54 | workspace_url=workspace_url, warehouse_ids=warehouse_ids_list, start_ts_ms=dt_ts, end_ts_ms=None, user_ids=None, statuses=None, stop_fetch_limit=1000) 55 | 56 | # COMMAND ---------- 57 | 58 | t_ts = int(datetime.datetime.now().timestamp()) 59 | sync_table = f'default.query_history_test_{t_ts}' # change to your preferred table name. 60 | print(sync_table) 61 | 62 | # COMMAND ---------- 63 | 64 | # create the object 65 | udbq = delta_sync.UpdateDBQueries(spark_session=spark, dbx_token=DBX_TOKEN, workspace_url=workspace_url, 66 | warehouse_ids=warehouse_ids_list, earliest_query_ts_ms=dt_ts, table_name=sync_table) 67 | 68 | # COMMAND ---------- 69 | 70 | # This updates the table with the query history one time 71 | udbq.update_db() 72 | 73 | # COMMAND ---------- 74 | 75 | # Check the table 76 | display(spark.sql(f""" 77 | select count(1), timestamp(min(query_start_time_ms)/1000), timestamp(max(query_start_time_ms)/1000) 78 | from {sync_table} 79 | """)) 80 | 81 | # COMMAND ---------- 82 | 83 | 84 | 85 | # COMMAND ---------- 86 | 87 | # This will update the underlying table incrementally every 10 seconds. 88 | udbq.update_db_repeat(interval_secs=10) 89 | 90 | # COMMAND ---------- 91 | 92 | # MAGIC %sql 93 | # MAGIC select count(1), timestamp(min(query_start_time_ms)/1000), timestamp(max(query_start_time_ms)/1000) 94 | # MAGIC from default.query_history_test_1695014139 -- update the table name to new table created above 95 | 96 | # COMMAND ---------- 97 | 98 | 99 | -------------------------------------------------------------------------------- /40-observability/dbsql-query-history-sync/examples/standalone_dbsql_get_query_history_example.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | 3 | import sys, os 4 | import datetime, dateutil.parser 5 | import pickle 6 | 7 | sys.path.append('../src') 8 | 9 | import dbsql_query_history_sync.queries_api as queries_api 10 | 11 | def ts(): 12 | return datetime.datetime.now().strftime("%Y%m%d%H%M%S") 13 | 14 | def main(): 15 | # change as required. 
16 | workspace_url = os.getenv("DATABRICKS_HOST", "e2-demo-field-eng.cloud.databricks.com") 17 | warehouse_ids = ["771c2bee30209f22"] 18 | start_ts_ms = dateutil.parser.parse('2023-01-01').timestamp() * 1000 19 | dbx_token = os.getenv('DATABRICKS_ACCESS_TOKEN') 20 | 21 | qh = queries_api.get_query_history( 22 | dbx_token=dbx_token, 23 | workspace_url=workspace_url, 24 | warehouse_ids=warehouse_ids, 25 | start_ts_ms=start_ts_ms) 26 | print(len(qh)) 27 | fname = f'/tmp/queries_{ts()}.pkl' 28 | with open(fname, 'wb') as f: 29 | pickle.dump(qh, f) 30 | print(f"created pkl file {fname}") 31 | 32 | if __name__ == "__main__": 33 | main() 34 | -------------------------------------------------------------------------------- /40-observability/dbsql-query-history-sync/pyproject.toml: -------------------------------------------------------------------------------- 1 | [build-system] 2 | requires = ["hatchling"] 3 | build-backend = "hatchling.build" 4 | 5 | [project] 6 | name = "dbsql-query-history-sync" 7 | dynamic = ["version"] 8 | description = "1> Get dbsql query history. 2> Sync to Delta table." 9 | readme = "README.md" 10 | license = "MIT" 11 | authors = [ 12 | { name = "Nishant Deshpande", email = "nishant.deshpande@databricks.com" }, 13 | ] 14 | classifiers = [ 15 | "Programming Language :: Python :: 3", 16 | "License :: Other/Proprietary License", 17 | "Operating System :: OS Independent", 18 | ] 19 | requires-python = ">=3.7" 20 | dependencies = [ 21 | "requests", 22 | "dateutils" 23 | ] 24 | 25 | [project.urls] 26 | Homepage = "https://github.com/databricks/lakehouse-tacklebox/tree/master/40-observability/dbsql-query-history-sync" 27 | 28 | [tool.hatch.version] 29 | path = "src/__init__.py" 30 | 31 | [tool.hatch.build.targets.sdist] 32 | include = [ 33 | "/src", 34 | ] 35 | -------------------------------------------------------------------------------- /40-observability/dbsql-query-history-sync/src/__init__.py: -------------------------------------------------------------------------------- 1 | VERSION = '0.0.1' 2 | 3 | -------------------------------------------------------------------------------- /40-observability/dbsql-query-history-sync/src/dbsql_query_history_sync/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/AbePabbathi/lakehouse-tacklebox/9bbbf0a7181193091d887ed57e7394a466c369cf/40-observability/dbsql-query-history-sync/src/dbsql_query_history_sync/__init__.py -------------------------------------------------------------------------------- /40-observability/dbsql-query-history-sync/src/dbsql_query_history_sync/delta_sync.py: -------------------------------------------------------------------------------- 1 | import datetime, dateutil 2 | import sys, os 3 | import json 4 | import time 5 | 6 | from delta.tables import DeltaTable 7 | 8 | import pyspark.sql.types as T 9 | import pyspark.sql.functions as F 10 | 11 | from . 
import queries_api 12 | 13 | class UpdateDBQueries: 14 | def __init__(self, spark_session, dbx_token, workspace_url, 15 | warehouse_ids, earliest_query_ts_ms, table_name, 16 | user_ids=None, max_queries_batch=1000): 17 | self.spark = spark_session 18 | self.dbx_token = dbx_token 19 | self.workspace_url = workspace_url 20 | self.warehouse_ids = warehouse_ids 21 | self.earliest_query_ts_ms = earliest_query_ts_ms 22 | self.table_name = table_name 23 | self.user_ids = user_ids 24 | self.max_queries_batch = max_queries_batch 25 | 26 | # hidden state for optimization of update_db_repeat 27 | self._next_update_ts_ms_d = {} 28 | 29 | self._do_init() 30 | 31 | def _do_init(self): 32 | '''Check if table_name exists, and if it does not, create it. 33 | ''' 34 | # We will use schema evolution when we need to add data. 35 | # That way we don't have to assume the returned results keep the same schema. 36 | # Add the minimum columns required for things to work. 37 | self.spark.sql(f""" 38 | create table if not exists {self.table_name} 39 | (query_id string, status string, query_start_time_ms bigint)""") 40 | # This is somewhat 'invasive' but this is not a open source api used by the unsuspecting masses 41 | # so I think this is ok. 42 | self.spark.sql(f"alter table {self.table_name} SET TBLPROPERTIES ('delta.enableDeletionVectors' = true)") 43 | 44 | def _merge_db_queries(self, from_ts_ms): 45 | print(f'_merge_db_queries(from_ts_ms={from_ts_ms})') 46 | start_ts_ms = from_ts_ms if from_ts_ms else self.earliest_query_ts_ms 47 | c = queries_api.sync_query_history( 48 | dbx_token=self.dbx_token, workspace_url=self.workspace_url, 49 | warehouse_ids=self.warehouse_ids, start_ts_ms=start_ts_ms, 50 | query_sink_fn=self._merge_results, sink_batch_size=self.max_queries_batch, 51 | user_ids=self.user_ids) 52 | return c 53 | 54 | def _merge_results(self, query_history): 55 | qh_df = self.spark.createDataFrame(query_history) 56 | ame = self.spark.conf.get('spark.databricks.delta.schema.autoMerge.enabled') 57 | if ame != 'true': 58 | self.spark.conf.set('spark.databricks.delta.schema.autoMerge.enabled', True) 59 | _table = DeltaTable.forName(self.spark, self.table_name) 60 | (_table.alias('t1').merge(qh_df.alias('n1'), 't1.query_id = n1.query_id') 61 | .whenMatchedUpdateAll() 62 | .whenNotMatchedInsertAll() 63 | .execute()) 64 | if ame != 'true': 65 | self.spark.conf.set('spark.databricks.delta.schema.autoMerge.enabled', ame) 66 | qh_df.createOrReplaceTempView('qh_df') 67 | # Optimization. 68 | d = self._get_existing_ts(table_name='qh_df') 69 | if not self._next_update_ts_ms_d.get('pending'): 70 | print(f'_merge_results: updating _next_update_ts_ms_d with {d}') 71 | self._next_update_ts_ms_d.update(d) 72 | else: 73 | print(f'already have a pending {self._next_update_ts_ms_d}') 74 | 75 | def update_db(self): 76 | '''Update the db with queries. Check the table for existing data and query newer queries accordingly. 77 | ''' 78 | d = self._get_existing_ts() 79 | existing_ts_ms = d['pending'] if d['pending'] else d['all'] 80 | print(f'existing_ts_ms: {existing_ts_ms}') 81 | c = self._merge_db_queries(existing_ts_ms) 82 | print(f"got {c} queries") 83 | 84 | def _get_existing_ts(self, table_name=None): 85 | ''' 86 | Get the timestamp that should be used to get the next queries. 
87 | ''' 88 | if not table_name: 89 | table_name = self.table_name 90 | r = self.spark.sql( 91 | f""" 92 | select * from ( 93 | select 'pending' as status, min(query_start_time_ms) as ts_ms 94 | from {table_name} 95 | where lower(status) in ('queued', 'running')) 96 | union all 97 | (select 'all' as status, max(query_start_time_ms) as ts_ms 98 | from {table_name}) 99 | """).collect() 100 | assert(r[0][0] == 'pending' and r[1][0] == 'all') 101 | #ts_ms = r[0][1] if r[0][1] else r[1][1] 102 | #return ts_ms 103 | return dict(r) 104 | 105 | def update_db_repeat(self, interval_secs): 106 | '''Update the db with new queries every interval_secs. 107 | ''' 108 | d = self._get_existing_ts() 109 | print(f'got initial state: {d}') 110 | ts_ms = d['pending'] if d['pending'] else d['all'] 111 | while True: 112 | c = self._merge_db_queries(ts_ms) 113 | # self._next_update_ts_ms is kept updated inside self._merge_db_queries as an optimization 114 | print(f'self._next_update_ts_ms_d: {self._next_update_ts_ms_d}') 115 | ts_ms = self._next_update_ts_ms_d.get('pending') if self._next_update_ts_ms_d.get('pending') else self._next_update_ts_ms_d.get('all') 116 | self._next_update_ts_ms_d = {} 117 | print(f'merged {c} queries, updated ts_ms to {ts_ms}') 118 | print(f'sleeping {interval_secs}...') 119 | time.sleep(interval_secs) 120 | -------------------------------------------------------------------------------- /40-observability/dbsql-query-history-sync/src/dbsql_query_history_sync/queries_api.py: -------------------------------------------------------------------------------- 1 | import requests 2 | import datetime, dateutil 3 | import sys, os 4 | import json 5 | import time 6 | 7 | def sync_query_history(dbx_token, workspace_url, warehouse_ids, start_ts_ms, 8 | query_sink_fn, sink_batch_size, 9 | end_ts_ms=None, user_ids=None, statuses=None, 10 | stop_fetch_limit=2147483647): 11 | '''Pull the query history from the API and call query_sink_fn. 12 | query_sink_fn: for every query_batch_size queries, push them into this sink and 13 | assume that the sink does the right thing with them. 14 | Note that the batch size will not be exactly respected. I.e. as soon as the accumulated 15 | queries go over the query_batch_size, query_sink_fn will be called. 
16 | ''' 17 | print(f'sync_query_history({locals()})') 18 | #workspace_url = "e2-demo-field-eng.cloud.databricks.com" 19 | uri = f"https://{workspace_url}/api/2.0/sql/history/queries" 20 | print(uri) 21 | headers_auth = {"Authorization":f"Bearer {dbx_token}"} 22 | request_dict = {} 23 | request_dict.update({"filter_by":{"warehouse_ids": warehouse_ids}}) 24 | time_filter = {"start_time_ms": start_ts_ms} 25 | if end_ts_ms: 26 | time_filter.update({'end_time_ms': end_ts_ms}) 27 | request_dict['filter_by'].update({"query_start_time_range": time_filter}) 28 | if statuses: 29 | request_dict['filter_by'].update({"statuses": statuses}) 30 | if user_ids: 31 | request_dict['filter_by'].update({"user_ids": user_ids}) 32 | max_single_call_results = min(sink_batch_size, 1000, stop_fetch_limit) 33 | request_dict.update({'include_metrics': "true", "max_results": f"{max_single_call_results}"}) 34 | 35 | ## Convert dict to json 36 | print(f'REQUEST: {request_dict}') 37 | v = json.dumps(request_dict) 38 | 39 | uri = f"https://{workspace_url}/api/2.0/sql/history/queries" 40 | headers_auth = {"Authorization":f"Bearer {dbx_token}"} 41 | 42 | #### Get Query History Results from API 43 | endp_resp = requests.get(uri, data=v, headers=headers_auth).json() 44 | #print(endp_resp) 45 | resp = endp_resp.get("res") 46 | 47 | if resp is None: 48 | print('no results!') 49 | return [] 50 | 51 | next_page = endp_resp.get("next_page_token") 52 | has_next_page = endp_resp.get("has_next_page") 53 | 54 | total_fetch_count = len(resp) 55 | 56 | while has_next_page: 57 | #len_resp = len(resp) 58 | if len(resp) >= sink_batch_size: #or len(resp) + total_count >= stop_fetch_limit: 59 | query_sink_fn(resp) 60 | resp = [] 61 | 62 | if total_fetch_count >= stop_fetch_limit: 63 | break 64 | 65 | print(f"Getting results for next page... {next_page}") 66 | 67 | raw_page_request = { 68 | "include_metrics": "true", 69 | "max_results": max_single_call_results, 70 | "page_token": next_page 71 | } 72 | 73 | json_page_request = json.dumps(raw_page_request) 74 | 75 | current_page_resp = requests.get(uri,data=json_page_request, headers=headers_auth).json() 76 | current_page_queries = current_page_resp.get("res") 77 | 78 | resp.extend(current_page_queries) 79 | total_fetch_count += len(current_page_queries) 80 | 81 | ## Get next page 82 | next_page = current_page_resp.get("next_page_token") 83 | has_next_page = current_page_resp.get("has_next_page") 84 | 85 | if resp: 86 | query_sink_fn(resp) 87 | 88 | return total_fetch_count 89 | 90 | 91 | 92 | def get_query_history(dbx_token, workspace_url, warehouse_ids, start_ts_ms, 93 | end_ts_ms=None, user_ids=None, statuses=None, 94 | stop_fetch_limit=10000): 95 | query_sink = [] 96 | def _fn(qh): 97 | print(f"got {len(qh)} queries") 98 | query_sink.extend(qh) 99 | 100 | total_fetch_count = sync_query_history( 101 | dbx_token, workspace_url, warehouse_ids, start_ts_ms, 102 | _fn, 100, 103 | end_ts_ms=end_ts_ms, user_ids=user_ids, statuses=statuses, 104 | stop_fetch_limit=stop_fetch_limit) 105 | 106 | print(f"total_fetch_count: {total_fetch_count}") 107 | return query_sink 108 | 109 | 110 | 111 | 112 | -------------------------------------------------------------------------------- /CONTRIBUTING.md: -------------------------------------------------------------------------------- 1 | ## Welcome to lakehouse-tacklebox contributing guide 2 | 3 | Thank you for your interest in contributing to the tacklebox! 
--------------------------------------------------------------------------------
/CONTRIBUTING.md:
--------------------------------------------------------------------------------
## Welcome to lakehouse-tacklebox contributing guide

Thank you for your interest in contributing to the tacklebox!

In this guide you will get an overview of the contribution workflow, from creating a PR to having it reviewed and merged.

#### New contributor guide

To get an overview of the project, read the [README](README.md).

Here are the steps to follow to contribute:
- [Fill out this request form](https://forms.gle/qsCTdtBLKj9KuyvY8). This will help the admins know which sub-folder your tool belongs under.
- Create a branch and add your code under the appropriate sub-folder.
- Create a PR.
- Admins will review your code and provide feedback on any requested changes.
- An admin approves and merges the changes.
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
MIT License

Copyright (c) 2023 Abe

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
## LAKEHOUSE TACKLEBOX
Don't go fishing in the lakehouse without the lakehouse-tacklebox.


### Project Description
This repo is a collection of tools that Databricks users can use to deploy, manage, and operate a Databricks-based Lakehouse.


### Using this Project
The tools are organized into sections based on the [well-architected-framework](https://docs.databricks.com/lakehouse-architecture/index.html) pillars:
* [Quickstarts/Evaluation Tools](https://github.com/AbePabbathi/lakehouse-tacklebox/tree/master/00-quickstarts)
* [Migrations](https://github.com/AbePabbathi/lakehouse-tacklebox/tree/master/10-migrations)
* [Operational Excellence](https://github.com/AbePabbathi/lakehouse-tacklebox/tree/master/20-operational-excellence)
* [Performance](https://github.com/AbePabbathi/lakehouse-tacklebox/tree/master/30-performance)
* [Governance/Observability](https://github.com/AbePabbathi/lakehouse-tacklebox/tree/master/40-observability)
* Reliability
* Security

Each tool has its own README.md file with instructions on how to run the code.

A new customer will generally start with the tools in the Quickstarts/Evaluation Tools section and move down the chain to more advanced tools to implement a robust data platform built on the Lakehouse.



### Project Support
Please note that all projects in this repo are provided for your exploration only, and are not formally supported by Databricks with Service Level Agreements (SLAs). They are provided AS-IS and we do not make any guarantees of any kind. Please do not submit a support ticket relating to any issues arising from the use of these projects.
Any issues discovered through the use of this project should be filed as GitHub Issues on the repo. They will be reviewed as time permits, but there are no formal SLAs for support.
--------------------------------------------------------------------------------