├── mlops-statcks-multiphase ├── __init__.py ├── tests │ ├── __init__.py │ ├── training │ │ ├── __init__.py │ │ └── test_notebooks.py │ └── feature_engineering │ │ ├── __init__.py │ │ ├── dropoff_features_test.py │ │ └── pickup_features_test.py ├── .gitignore ├── feature_engineering │ ├── __init__.py │ ├── features │ │ ├── __init__.py │ │ ├── pickup_features.py │ │ └── dropoff_features.py │ ├── README.md │ └── notebooks │ │ └── GenerateAndWriteFeatures.py ├── project_params.json ├── validation │ ├── README.md │ ├── validation.py │ └── notebooks │ │ └── ModelValidation.py ├── pytest.ini ├── requirements.txt ├── monitoring │ ├── README.md │ ├── notebooks │ │ └── MonitoredMetricViolationCheck.py │ └── metric_violation_check_query.py ├── deployment │ ├── batch_inference │ │ ├── predict.py │ │ ├── README.md │ │ └── notebooks │ │ │ └── BatchInference.py │ └── model_deployment │ │ ├── deploy.py │ │ └── notebooks │ │ └── ModelDeployment.py ├── resources │ ├── ml-artifacts-resource.yml │ ├── batch-inference-workflow-resource.yml │ ├── feature-engineering-workflow-resource.yml │ ├── monitoring-resource.yml │ ├── model-workflow-resource.yml │ └── README.md ├── databricks.yml ├── tmp │ ├── phase2.yml │ ├── phase1.yml │ └── README.md ├── training │ └── notebooks │ │ └── TrainWithFeatureStore.py └── README.md ├── jdemo ├── src │ ├── jdemo │ │ ├── __init__.py │ │ └── main.py │ └── notebook.ipynb ├── pytest.ini ├── tests │ ├── main_test.py │ └── main_test.old ├── scratch │ ├── README.md │ └── exploration.ipynb ├── resources │ ├── variables.yml │ └── jdemo.job.yml ├── java-code │ ├── src │ │ └── main │ │ │ └── java │ │ │ └── net │ │ │ └── alexott │ │ │ └── demos │ │ │ └── SparkDemo.java │ └── pom.xml ├── fixtures │ └── .gitkeep ├── requirements-dev.txt ├── setup.py ├── README.md ├── databricks.yml └── azure-pipelines.yml ├── vars_demo ├── .gitignore ├── resources │ ├── variables.yml │ └── vars_demo.job.yml ├── databricks.yml ├── src │ └── notebook.ipynb └── README.md ├── integration-tests ├── .gitignore ├── images │ └── integration_test.png ├── resources │ ├── dabs1.job.yml │ └── integration_test.yml ├── databricks.yml ├── README.md └── src │ ├── setup_test.ipynb │ ├── cleanup_test.ipynb │ ├── main_nb.ipynb │ └── validate_test.ipynb ├── README.md └── .gitignore /mlops-statcks-multiphase/__init__.py: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /mlops-statcks-multiphase/tests/__init__.py: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /jdemo/src/jdemo/__init__.py: -------------------------------------------------------------------------------- 1 | __version__ = "0.0.1" 2 | -------------------------------------------------------------------------------- /mlops-statcks-multiphase/tests/training/__init__.py: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /mlops-statcks-multiphase/.gitignore: -------------------------------------------------------------------------------- 1 | 2 | .databricks 3 | -------------------------------------------------------------------------------- /mlops-statcks-multiphase/feature_engineering/__init__.py: -------------------------------------------------------------------------------- 1 | 
-------------------------------------------------------------------------------- /mlops-statcks-multiphase/tests/feature_engineering/__init__.py: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /mlops-statcks-multiphase/feature_engineering/features/__init__.py: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /jdemo/pytest.ini: -------------------------------------------------------------------------------- 1 | [pytest] 2 | testpaths = tests 3 | pythonpath = src 4 | -------------------------------------------------------------------------------- /jdemo/tests/main_test.py: -------------------------------------------------------------------------------- 1 | 2 | 3 | def test_main(): 4 | assert True -------------------------------------------------------------------------------- /mlops-statcks-multiphase/project_params.json: -------------------------------------------------------------------------------- 1 | { 2 | "input_cloud": "azure", 3 | "input_include_feature_store": "yes" 4 | } 5 | -------------------------------------------------------------------------------- /vars_demo/.gitignore: -------------------------------------------------------------------------------- 1 | .databricks/ 2 | build/ 3 | dist/ 4 | __pycache__/ 5 | *.egg-info 6 | .venv/ 7 | scratch/** 8 | !scratch/README.md 9 | -------------------------------------------------------------------------------- /integration-tests/.gitignore: -------------------------------------------------------------------------------- 1 | .databricks/ 2 | build/ 3 | dist/ 4 | __pycache__/ 5 | *.egg-info 6 | .venv/ 7 | scratch/** 8 | !scratch/README.md 9 | -------------------------------------------------------------------------------- /integration-tests/images/integration_test.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/alexott/dabs-playground/main/integration-tests/images/integration_test.png -------------------------------------------------------------------------------- /jdemo/tests/main_test.old: -------------------------------------------------------------------------------- 1 | from jdemo.main import get_taxis, get_spark 2 | 3 | 4 | def test_main(): 5 | taxis = get_taxis(get_spark()) 6 | assert taxis.count() > 5 7 | -------------------------------------------------------------------------------- /jdemo/scratch/README.md: -------------------------------------------------------------------------------- 1 | # scratch 2 | 3 | This folder is reserved for personal, exploratory notebooks. 4 | By default these are not committed to Git, as 'scratch' is listed in .gitignore. 
5 | 
--------------------------------------------------------------------------------
/mlops-statcks-multiphase/validation/README.md:
--------------------------------------------------------------------------------
1 | # Model Validation
2 | To enable model validation as part of a scheduled Databricks workflow, please refer to [mlops/resources/README.md](../resources/README.md)
3 | 
--------------------------------------------------------------------------------
/mlops-statcks-multiphase/pytest.ini:
--------------------------------------------------------------------------------
1 | # Configure pytest to detect local modules in the current directory
2 | # See https://docs.pytest.org/en/7.1.x/reference/reference.html#confval-pythonpath for details
3 | [pytest]
4 | pythonpath = .
5 | 
--------------------------------------------------------------------------------
/mlops-statcks-multiphase/requirements.txt:
--------------------------------------------------------------------------------
1 | mlflow==2.11.3
2 | numpy>=1.23.0
3 | pandas==1.5.3
4 | scikit-learn>=1.1.1
5 | matplotlib>=3.5.2
6 | pillow>=10.0.1
7 | Jinja2==3.0.3
8 | pyspark~=3.3.0
9 | pytz~=2022.2.1
10 | pytest>=7.1.2
11 | 
--------------------------------------------------------------------------------
/jdemo/resources/variables.yml:
--------------------------------------------------------------------------------
1 | variables:
2 |   instance_pool_name:
3 |     description: Name of the instance pool to use
4 |     default: TFTests
5 |   instance_pool_id:
6 |     description: ID of instance pool
7 |     lookup:
8 |       instance_pool: ${var.instance_pool_name}
9 | 
--------------------------------------------------------------------------------
/mlops-statcks-multiphase/feature_engineering/README.md:
--------------------------------------------------------------------------------
1 | # Feature Engineering
2 | To set up the feature engineering job via a scheduled Databricks workflow, please refer to [mlops/resources/README.md](../resources/README.md)
3 | 
4 | For additional details on using the feature store, please refer to [the project-level README](../README.md).
5 | 
--------------------------------------------------------------------------------
/mlops-statcks-multiphase/tests/training/test_notebooks.py:
--------------------------------------------------------------------------------
1 | import pathlib
2 | 
3 | 
4 | def test_notebook_format():
5 |     # Verify that all Databricks notebooks have the required header
6 |     paths = list(pathlib.Path("./notebooks").glob("**/*.py"))
7 |     for f in paths:
8 |         notebook_str = open(str(f)).read()
9 |         assert notebook_str.startswith("# Databricks notebook source")
10 | 
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # dabs-playground
2 | 
3 | A collection of examples around Databricks Asset Bundles (DABs).
4 | 
5 | 
6 | - [integration-tests](integration-tests) - a DAB example showing how to redefine a resource on a per-target basis, emulating wrapping the code in an integration test that has additional tasks.
7 | - [jdemo](jdemo) - a DAB example that deploys both Python wheel and JAR artefacts as job tasks.
8 | - [mlops-statcks-multiphase](mlops-statcks-multiphase) - a DAB based on the `mlops-stacks` template, customized to deploy quality monitoring in a separate "stage".
9 | - [vars_demo](vars_demo) - demonstrates how to use complex and lookup variables in DABs.
10 | -------------------------------------------------------------------------------- /jdemo/java-code/src/main/java/net/alexott/demos/SparkDemo.java: -------------------------------------------------------------------------------- 1 | package net.alexott.demos; 2 | 3 | import org.apache.spark.sql.SparkSession; 4 | import org.apache.spark.sql.Dataset; 5 | import org.apache.spark.sql.Row; 6 | 7 | public class SparkDemo { 8 | public static void main(String[] args) { 9 | System.out.println("Creating Spark Session!"); 10 | SparkSession spark = SparkSession.builder() 11 | .appName("SparkDemo") 12 | .getOrCreate(); 13 | System.out.println("Going to read data!"); 14 | Dataset df = spark.read().table("samples.nyctaxi.trips"); 15 | df.show(10, false); 16 | } 17 | } 18 | -------------------------------------------------------------------------------- /vars_demo/resources/variables.yml: -------------------------------------------------------------------------------- 1 | variables: 2 | notification_settings: 3 | description: "Webhook notification config" 4 | type: complex 5 | default: {} 6 | notification_name: 7 | description: "Name of the notification destination" 8 | default: "" 9 | notification_id: 10 | description: "ID of the notification destination (placeholder)" 11 | default: "" 12 | 13 | targets: 14 | prod: 15 | variables: 16 | notification_name: "Slack native" 17 | notification_id: 18 | lookup: 19 | notification_destination: ${var.notification_name} 20 | notification_settings: 21 | on_failure: 22 | - id: ${var.notification_id} 23 | -------------------------------------------------------------------------------- /mlops-statcks-multiphase/monitoring/README.md: -------------------------------------------------------------------------------- 1 | # Monitoring 2 | 3 | To enable monitoring as part of a scheduled Databricks workflow, please: 4 | - Create the inference table that you want to monitor and was passed in as an initialization parameter. 5 | - Update all the TODOs in the [monitoring resource file](../resources/monitoring-resource.yml). 6 | - Uncomment the monitoring workflow from the main Databricks Asset Bundles file [databricks.yml](../databricks.yml). 7 | 8 | For more details, refer to [mlops/resources/README.md](../resources/README.md). 9 | The implementation supports monitoring of batch inference tables directly. 10 | For real time inference tables, unpacking is required before monitoring can be attached. 11 | -------------------------------------------------------------------------------- /jdemo/src/jdemo/main.py: -------------------------------------------------------------------------------- 1 | from pyspark.sql import SparkSession, DataFrame 2 | 3 | def get_taxis(spark: SparkSession) -> DataFrame: 4 | return spark.read.table("samples.nyctaxi.trips") 5 | 6 | 7 | # Create a new Databricks Connect session. If this fails, 8 | # check that you have configured Databricks Connect correctly. 9 | # See https://docs.databricks.com/dev-tools/databricks-connect.html. 
10 | def get_spark() -> SparkSession: 11 | try: 12 | from databricks.connect import DatabricksSession 13 | return DatabricksSession.builder.getOrCreate() 14 | except ImportError: 15 | return SparkSession.builder.getOrCreate() 16 | 17 | def main(): 18 | get_taxis(get_spark()).show(10, truncate=False) 19 | 20 | if __name__ == '__main__': 21 | main() 22 | -------------------------------------------------------------------------------- /jdemo/fixtures/.gitkeep: -------------------------------------------------------------------------------- 1 | # Fixtures 2 | 3 | This folder is reserved for fixtures, such as CSV files. 4 | 5 | Below is an example of how to load fixtures as a data frame: 6 | 7 | ``` 8 | import pandas as pd 9 | import os 10 | 11 | def get_absolute_path(*relative_parts): 12 | if 'dbutils' in globals(): 13 | base_dir = os.path.dirname(dbutils.notebook.entry_point.getDbutils().notebook().getContext().notebookPath().get()) # type: ignore 14 | path = os.path.normpath(os.path.join(base_dir, *relative_parts)) 15 | return path if path.startswith("/Workspace") else "/Workspace" + path 16 | else: 17 | return os.path.join(*relative_parts) 18 | 19 | csv_file = get_absolute_path("..", "fixtures", "mycsv.csv") 20 | df = pd.read_csv(csv_file) 21 | display(df) 22 | ``` 23 | -------------------------------------------------------------------------------- /vars_demo/databricks.yml: -------------------------------------------------------------------------------- 1 | # This is a Databricks asset bundle definition for vars_demo. 2 | # See https://docs.databricks.com/dev-tools/bundles/index.html for documentation. 3 | bundle: 4 | name: vars_demo 5 | uuid: aae7faf4-420e-48be-a691-ca6981943be4 6 | 7 | include: 8 | - resources/*.yml 9 | 10 | targets: 11 | dev: 12 | # The default target uses 'mode: development' to create a development copy. 13 | # - Deployed resources get prefixed with '[dev my_user_name]' 14 | # - Any job schedules and triggers are paused by default. 15 | # See also https://docs.databricks.com/dev-tools/bundles/deployment-modes.html. 16 | mode: development 17 | default: true 18 | 19 | prod: 20 | mode: production 21 | workspace: 22 | root_path: /Workspace/Project/${bundle.name}/${bundle.target} 23 | -------------------------------------------------------------------------------- /integration-tests/resources/dabs1.job.yml: -------------------------------------------------------------------------------- 1 | # The main job for dabs1. 2 | resources: 3 | jobs: 4 | dabs1_job: 5 | name: dabs1_job 6 | 7 | trigger: 8 | # Run this job every day, exactly one day from the last run; see https://docs.databricks.com/api/workspace/jobs/create#trigger 9 | periodic: 10 | interval: 1 11 | unit: DAYS 12 | 13 | #email_notifications: 14 | # on_failure: 15 | # - your_email@example.com 16 | 17 | tasks: 18 | - task_key: notebook_task 19 | notebook_task: 20 | notebook_path: ../src/main_nb.ipynb 21 | base_parameters: 22 | catalog: main 23 | schema: default 24 | table: nsg_logs 25 | target_table: nsg_logs_copy 26 | 27 | -------------------------------------------------------------------------------- /vars_demo/resources/vars_demo.job.yml: -------------------------------------------------------------------------------- 1 | # The main job for vars_demo. 
2 | resources: 3 | jobs: 4 | vars_demo_job: 5 | name: vars_demo_job 6 | 7 | trigger: 8 | # Run this job every day, exactly one day from the last run; see https://docs.databricks.com/api/workspace/jobs/create#trigger 9 | periodic: 10 | interval: 1 11 | unit: DAYS 12 | 13 | webhook_notifications: ${var.notification_settings} 14 | 15 | tasks: 16 | - task_key: notebook_task 17 | job_cluster_key: job_cluster 18 | notebook_task: 19 | notebook_path: ../src/notebook.ipynb 20 | 21 | job_clusters: 22 | - job_cluster_key: job_cluster 23 | new_cluster: 24 | spark_version: 15.4.x-scala2.12 25 | node_type_id: Standard_D3_v2 26 | data_security_mode: SINGLE_USER 27 | autoscale: 28 | min_workers: 1 29 | max_workers: 4 30 | -------------------------------------------------------------------------------- /jdemo/requirements-dev.txt: -------------------------------------------------------------------------------- 1 | ## requirements-dev.txt: dependencies for local development. 2 | ## 3 | ## For defining dependencies used by jobs in Databricks Workflows, see 4 | ## https://docs.databricks.com/dev-tools/bundles/library-dependencies.html 5 | 6 | ## Add code completion support for DLT 7 | #databricks-dlt 8 | 9 | ## pytest is the default package used for testing 10 | pytest 11 | 12 | ## Dependencies for building wheel files 13 | setuptools 14 | wheel 15 | 16 | ## databricks-connect can be used to run parts of this project locally. 17 | ## See https://docs.databricks.com/dev-tools/databricks-connect.html. 18 | ## 19 | ## databricks-connect is automatically installed if you're using Databricks 20 | ## extension for Visual Studio Code 21 | ## (https://docs.databricks.com/dev-tools/vscode-ext/dev-tasks/databricks-connect.html). 22 | ## 23 | ## To manually install databricks-connect, either follow the instructions 24 | ## at https://docs.databricks.com/dev-tools/databricks-connect.html 25 | ## to install the package system-wide. Or uncomment the line below to install a 26 | ## version of db-connect that corresponds to the Databricks Runtime version used 27 | ## for this project. 28 | # 29 | # databricks-connect>=15.4,<15.5 30 | -------------------------------------------------------------------------------- /integration-tests/databricks.yml: -------------------------------------------------------------------------------- 1 | # This is a Databricks asset bundle definition for dabs1. 2 | # See https://docs.databricks.com/dev-tools/bundles/index.html for documentation. 3 | bundle: 4 | name: dabs1 5 | uuid: 734206ae-7fb4-4d2b-aa5f-7650f5954c17 6 | 7 | include: 8 | - resources/*.yml 9 | 10 | targets: 11 | dev: 12 | # The default target uses 'mode: development' to create a development copy. 13 | # - Deployed resources get prefixed with '[dev my_user_name]' 14 | # - Any job schedules and triggers are paused by default. 15 | # See also https://docs.databricks.com/dev-tools/bundles/deployment-modes.html. 16 | mode: development 17 | default: true 18 | 19 | test: 20 | mode: development 21 | presets: 22 | name_prefix: "[Integration test ${workspace.current_user.short_name}] " 23 | workspace: 24 | root_path: /Workspace/Users/${workspace.current_user.userName}/.bundle/${bundle.name}/${bundle.target} 25 | 26 | prod: 27 | mode: production 28 | workspace: 29 | # We explicitly deploy to current user folder to make sure we only have a single copy. 
30 | root_path: /Workspace/Users/${workspace.current_user.userName}/.bundle/${bundle.name}/${bundle.target} 31 | permissions: 32 | - user_name: ${workspace.current_user.userName} 33 | level: CAN_MANAGE 34 | -------------------------------------------------------------------------------- /mlops-statcks-multiphase/deployment/batch_inference/predict.py: -------------------------------------------------------------------------------- 1 | import mlflow 2 | from pyspark.sql.functions import struct, lit, to_timestamp 3 | 4 | 5 | def predict_batch( 6 | spark_session, model_uri, input_table_name, output_table_name, model_version, ts 7 | ): 8 | """ 9 | Apply the model at the specified URI for batch inference on the table with name input_table_name, 10 | writing results to the table with name output_table_name 11 | """ 12 | 13 | mlflow.set_registry_uri("databricks-uc") 14 | 15 | table = spark_session.table(input_table_name) 16 | 17 | 18 | from databricks.feature_engineering import FeatureEngineeringClient 19 | 20 | fe_client = FeatureEngineeringClient() 21 | 22 | prediction_df = fe_client.score_batch(model_uri = model_uri, df = table) 23 | 24 | output_df = ( 25 | prediction_df.withColumn("prediction", prediction_df["prediction"]) 26 | .withColumn("model_version", lit(model_version)) 27 | .withColumn("inference_timestamp", to_timestamp(lit(ts))) 28 | ) 29 | output_df.display() 30 | 31 | # Model predictions are written to the Delta table provided as input. 32 | # Delta is the default format in Databricks Runtime 8.0 and above. 33 | output_df.write.format("delta").mode("overwrite").saveAsTable(output_table_name) -------------------------------------------------------------------------------- /jdemo/setup.py: -------------------------------------------------------------------------------- 1 | """ 2 | setup.py configuration script describing how to build and package this project. 3 | 4 | This file is primarily used by the setuptools library and typically should not 5 | be executed directly. See README.md for how to deploy, test, and run 6 | the jdemo project. 7 | """ 8 | from setuptools import setup, find_packages 9 | 10 | import sys 11 | sys.path.append('./src') 12 | 13 | import datetime 14 | import jdemo 15 | 16 | setup( 17 | name="jdemo", 18 | # We use timestamp as Local version identifier (https://peps.python.org/pep-0440/#local-version-identifiers.) 19 | # to ensure that changes to wheel package are picked up when used on all-purpose clusters 20 | version=jdemo.__version__, 21 | url="https://test.com", 22 | author="user@domain.com", 23 | description="wheel file based on jdemo/src", 24 | packages=find_packages(where='./src'), 25 | package_dir={'': 'src'}, 26 | entry_points={ 27 | "packages": [ 28 | "main=jdemo.main:main" 29 | ] 30 | }, 31 | install_requires=[ 32 | # Dependencies in case the output wheel file is used as a library dependency. 
33 | # For defining dependencies, when this package is used in Databricks, see: 34 | # https://docs.databricks.com/dev-tools/bundles/library-dependencies.html 35 | "setuptools" 36 | ], 37 | ) 38 | -------------------------------------------------------------------------------- /mlops-statcks-multiphase/resources/ml-artifacts-resource.yml: -------------------------------------------------------------------------------- 1 | # Allow users to read the experiment 2 | common_permissions: &permissions 3 | permissions: 4 | - level: CAN_READ 5 | group_name: users 6 | 7 | # Allow users to execute models in Unity Catalog 8 | grants: &grants 9 | grants: 10 | - privileges: 11 | - EXECUTE 12 | principal: account users 13 | 14 | # Defines model and experiments 15 | model: &model 16 | model: 17 | name: ${var.model_name} 18 | catalog_name: ${var.current_target} 19 | schema_name: mlops 20 | comment: Registered model in Unity Catalog for the "mlops" ML Project for ${var.current_target} deployment target. 21 | <<: *grants 22 | 23 | experiment: &experiment 24 | experiment: 25 | name: ${var.experiment_name} 26 | <<: *permissions 27 | description: MLflow Experiment used to track runs for mlops project. 28 | 29 | 30 | targets: 31 | dev-phase1: 32 | resources: 33 | experiments: 34 | <<: *experiment 35 | registered_models: 36 | <<: *model 37 | 38 | test-phase1: 39 | resources: 40 | experiments: 41 | <<: *experiment 42 | registered_models: 43 | <<: *model 44 | 45 | prod-phase1: 46 | resources: 47 | experiments: 48 | <<: *experiment 49 | registered_models: 50 | <<: *model 51 | -------------------------------------------------------------------------------- /jdemo/scratch/exploration.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 2, 6 | "metadata": {}, 7 | "outputs": [], 8 | "source": [ 9 | "%load_ext autoreload\n", 10 | "%autoreload 2" 11 | ] 12 | }, 13 | { 14 | "cell_type": "code", 15 | "execution_count": null, 16 | "metadata": { 17 | "application/vnd.databricks.v1+cell": { 18 | "cellMetadata": { 19 | "byteLimit": 2048000, 20 | "rowLimit": 10000 21 | }, 22 | "inputWidgets": {}, 23 | "nuid": "6bca260b-13d1-448f-8082-30b60a85c9ae", 24 | "showTitle": false, 25 | "title": "" 26 | } 27 | }, 28 | "outputs": [], 29 | "source": [ 30 | "import sys\n", 31 | "sys.path.append('../src')\n", 32 | "from jdemo import main\n", 33 | "\n", 34 | "main.get_taxis(spark).show(10)" 35 | ] 36 | } 37 | ], 38 | "metadata": { 39 | "application/vnd.databricks.v1+notebook": { 40 | "dashboards": [], 41 | "language": "python", 42 | "notebookMetadata": { 43 | "pythonIndentUnit": 2 44 | }, 45 | "notebookName": "ipynb-notebook", 46 | "widgets": {} 47 | }, 48 | "kernelspec": { 49 | "display_name": "Python 3", 50 | "language": "python", 51 | "name": "python3" 52 | }, 53 | "language_info": { 54 | "name": "python", 55 | "version": "3.11.4" 56 | } 57 | }, 58 | "nbformat": 4, 59 | "nbformat_minor": 0 60 | } 61 | -------------------------------------------------------------------------------- /mlops-statcks-multiphase/tests/feature_engineering/dropoff_features_test.py: -------------------------------------------------------------------------------- 1 | import pyspark.sql 2 | import pytest 3 | import pandas as pd 4 | from datetime import datetime 5 | from pyspark.sql import SparkSession 6 | 7 | from mlops.feature_engineering.features.dropoff_features import ( 8 | compute_features_fn, 9 | ) 10 | 11 | 12 | @pytest.fixture(scope="session") 13 | 
def spark(request): 14 | """fixture for creating a spark session 15 | Args: 16 | request: pytest.FixtureRequest object 17 | """ 18 | spark = ( 19 | SparkSession.builder.master("local[1]") 20 | .appName("pytest-pyspark-local-testing") 21 | .getOrCreate() 22 | ) 23 | request.addfinalizer(lambda: spark.stop()) 24 | 25 | return spark 26 | 27 | 28 | @pytest.mark.usefixtures("spark") 29 | def test_dropoff_features_fn(spark): 30 | input_df = pd.DataFrame( 31 | { 32 | "tpep_pickup_datetime": [datetime(2022, 1, 10)], 33 | "tpep_dropoff_datetime": [datetime(2022, 1, 10)], 34 | "dropoff_zip": [94400], 35 | "trip_distance": [2], 36 | "fare_amount": [100], 37 | } 38 | ) 39 | spark_df = spark.createDataFrame(input_df) 40 | output_df = compute_features_fn( 41 | spark_df, "tpep_pickup_datetime", datetime(2022, 1, 1), datetime(2022, 1, 15) 42 | ) 43 | assert isinstance(output_df, pyspark.sql.DataFrame) 44 | assert output_df.count() == 1 45 | -------------------------------------------------------------------------------- /mlops-statcks-multiphase/tests/feature_engineering/pickup_features_test.py: -------------------------------------------------------------------------------- 1 | import pyspark.sql 2 | import pytest 3 | import pandas as pd 4 | from datetime import datetime 5 | from pyspark.sql import SparkSession 6 | 7 | from mlops.feature_engineering.features.pickup_features import compute_features_fn 8 | 9 | 10 | @pytest.fixture(scope="session") 11 | def spark(request): 12 | """fixture for creating a spark session 13 | Args: 14 | request: pytest.FixtureRequest object 15 | """ 16 | spark = ( 17 | SparkSession.builder.master("local[1]") 18 | .appName("pytest-pyspark-local-testing") 19 | .getOrCreate() 20 | ) 21 | request.addfinalizer(lambda: spark.stop()) 22 | 23 | return spark 24 | 25 | 26 | @pytest.mark.usefixtures("spark") 27 | def test_pickup_features_fn(spark): 28 | input_df = pd.DataFrame( 29 | { 30 | "tpep_pickup_datetime": [datetime(2022, 1, 12)], 31 | "tpep_dropoff_datetime": [datetime(2022, 1, 12)], 32 | "pickup_zip": [94400], 33 | "trip_distance": [2], 34 | "fare_amount": [100], 35 | } 36 | ) 37 | spark_df = spark.createDataFrame(input_df) 38 | output_df = compute_features_fn( 39 | spark_df, "tpep_pickup_datetime", datetime(2022, 1, 1), datetime(2022, 1, 15) 40 | ) 41 | assert isinstance(output_df, pyspark.sql.DataFrame) 42 | assert output_df.count() == 4 # 4 15-min intervals over 1 hr window. 
43 | -------------------------------------------------------------------------------- /integration-tests/resources/integration_test.yml: -------------------------------------------------------------------------------- 1 | targets: 2 | test: 3 | resources: 4 | jobs: 5 | dabs1_job: 6 | tasks: 7 | - task_key: setup 8 | notebook_task: 9 | notebook_path: ../src/setup_test.ipynb 10 | base_parameters: 11 | catalog: main 12 | schema: tmp 13 | table: itest 14 | 15 | - task_key: notebook_task 16 | depends_on: 17 | - task_key: setup 18 | 19 | notebook_task: 20 | notebook_path: ../src/main_nb.ipynb 21 | base_parameters: 22 | catalog: main 23 | schema: tmp 24 | table: itest 25 | target_table: itest_copy 26 | 27 | - task_key: validate 28 | depends_on: 29 | - task_key: notebook_task 30 | 31 | run_if: ALL_DONE 32 | 33 | notebook_task: 34 | notebook_path: ../src/validate_test.ipynb 35 | base_parameters: 36 | catalog: main 37 | schema: tmp 38 | target_table: itest_copy 39 | 40 | - task_key: cleanup 41 | depends_on: 42 | - task_key: validate 43 | 44 | 45 | notebook_task: 46 | notebook_path: ../src/cleanup_test.ipynb 47 | base_parameters: 48 | catalog: main 49 | schema: tmp 50 | table: itest 51 | target_table: itest_copy 52 | 53 | 54 | -------------------------------------------------------------------------------- /integration-tests/README.md: -------------------------------------------------------------------------------- 1 | # Integration test example 2 | 3 | This DAB shows how to redefine resource on per-target base, emulating wrapping of the code into integration test that has additional tasks. This is done by overriding the job resource (defined in [resources/dabs1.job.yml](resources/dabs1.job.yml) only in the specific target (`test`, defined in [resources/integration_test.yml](resources/integration_test.yml)) 4 | 5 | ## Getting started 6 | 7 | 1. Install the Databricks CLI from https://docs.databricks.com/dev-tools/cli/databricks-cli.html 8 | 9 | 2. Authenticate to your Databricks workspace, if you have not done so already: 10 | ``` 11 | $ databricks configure 12 | ``` 13 | 14 | 3. To deploy a development copy of this project, type: 15 | ``` 16 | $ databricks bundle deploy -t dev 17 | ``` 18 | (Note that "dev" is the default target, so the `--target` parameter 19 | is optional here.) 20 | 21 | This deploys everything that's defined for this project. For example, the default 22 | template would deploy a job called `[dev yourname] dabs1_job` to your workspace. You 23 | can find that job by opening your workpace and clicking on **Workflows**. 24 | 25 | 4. Similarly, to deploy the code into the test environment, type: 26 | ``` 27 | $ databricks bundle deploy -t test 28 | ``` 29 | 30 | This will deploy a job with name `[Integration test yourname] dabs1_job`, but it will 31 | have different number of tasks (setup the test, validate results, cleanup): 32 | 33 | ![Integration test job](images/integration_test.png) 34 | 35 | 5. 
To run a job or pipeline, use the "run" command: 36 | ``` 37 | $ databricks bundle run 38 | ``` 39 | -------------------------------------------------------------------------------- /mlops-statcks-multiphase/deployment/batch_inference/README.md: -------------------------------------------------------------------------------- 1 | # Batch Inference 2 | To set up batch inference job via scheduled Databricks workflow, please refer to [mlops/resources/README.md](../../resources/README.md) 3 | 4 | ## Prepare the batch inference input table for the example Project 5 | Please run the following code in a notebook to generate the example batch inference input table. 6 | 7 | ``` 8 | from pyspark.sql.functions import to_timestamp, lit 9 | from pyspark.sql.types import IntegerType 10 | import math 11 | from datetime import timedelta, timezone 12 | 13 | def rounded_unix_timestamp(dt, num_minutes=15): 14 | """ 15 | Ceilings datetime dt to interval num_minutes, then returns the unix timestamp. 16 | """ 17 | nsecs = dt.minute * 60 + dt.second + dt.microsecond * 1e-6 18 | delta = math.ceil(nsecs / (60 * num_minutes)) * (60 * num_minutes) - nsecs 19 | return int((dt + timedelta(seconds=delta)).replace(tzinfo=timezone.utc).timestamp()) 20 | 21 | 22 | rounded_unix_timestamp_udf = udf(rounded_unix_timestamp, IntegerType()) 23 | 24 | df = spark.table("delta.`dbfs:/databricks-datasets/nyctaxi-with-zipcodes/subsampled`") 25 | df.withColumn( 26 | "rounded_pickup_datetime", 27 | to_timestamp(rounded_unix_timestamp_udf(df["tpep_pickup_datetime"], lit(15))), 28 | ).withColumn( 29 | "rounded_dropoff_datetime", 30 | to_timestamp(rounded_unix_timestamp_udf(df["tpep_dropoff_datetime"], lit(30))), 31 | ).drop( 32 | "tpep_pickup_datetime" 33 | ).drop( 34 | "tpep_dropoff_datetime" 35 | ).drop( 36 | "fare_amount" 37 | ).write.mode( 38 | "overwrite" 39 | ).saveAsTable( 40 | name="hive_metastore.default.taxi_scoring_sample_feature_store_inference_input" 41 | ) 42 | ``` 43 | 44 | -------------------------------------------------------------------------------- /mlops-statcks-multiphase/databricks.yml: -------------------------------------------------------------------------------- 1 | # The name of the bundle. run `databricks bundle schema` to see the full bundle settings schema. 2 | bundle: 3 | name: mlops 4 | 5 | variables: 6 | current_target: 7 | description: "Name of the current target environment (we can't use `bundle.target`)" 8 | experiment_name: 9 | description: Experiment name for the model training. 10 | default: /Users/${workspace.current_user.userName}/${var.current_target}-mlops-experiment 11 | model_name: 12 | description: Model name for the model training. 
13 | default: mlops-model 14 | # this is a placeholder for all phases, although it's used only in *-phase2 15 | model_training_job_id: 16 | description: "ID of model training job" 17 | default: 0 18 | 19 | include: 20 | # Resources folder contains ML artifact resources for the ML project that defines model and experiment 21 | # And workflows resources for the ML project including model training -> validation -> deployment, 22 | # feature engineering, batch inference, quality monitoring, metric refresh, alerts and triggering retraining 23 | - ./resources/*.yml 24 | 25 | workspace: 26 | host: https://adb-xxxx.17.azuredatabricks.net 27 | 28 | # Deployment Target specific values for workspace 29 | targets: 30 | dev-phase1: 31 | default: true 32 | variables: 33 | current_target: dev 34 | 35 | dev-phase2: 36 | variables: 37 | current_target: dev 38 | 39 | test-phase1: 40 | variables: 41 | current_target: test 42 | 43 | test-phase2: 44 | variables: 45 | current_target: test 46 | 47 | prod: 48 | variables: 49 | current_target: prod 50 | 51 | prod-phase2: 52 | variables: 53 | current_target: prod 54 | 55 | -------------------------------------------------------------------------------- /vars_demo/src/notebook.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": { 6 | "application/vnd.databricks.v1+cell": { 7 | "cellMetadata": {}, 8 | "inputWidgets": {}, 9 | "nuid": "ee353e42-ff58-4955-9608-12865bd0950e", 10 | "showTitle": false, 11 | "title": "" 12 | } 13 | }, 14 | "source": [ 15 | "# Default notebook\n", 16 | "\n", 17 | "This default notebook is executed using Databricks Workflows as defined in resources/vars_demo.job.yml." 18 | ] 19 | }, 20 | { 21 | "cell_type": "code", 22 | "execution_count": 2, 23 | "metadata": {}, 24 | "outputs": [], 25 | "source": [ 26 | "%load_ext autoreload\n", 27 | "%autoreload 2" 28 | ] 29 | }, 30 | { 31 | "cell_type": "code", 32 | "execution_count": 0, 33 | "metadata": { 34 | "application/vnd.databricks.v1+cell": { 35 | "cellMetadata": { 36 | "byteLimit": 2048000, 37 | "rowLimit": 10000 38 | }, 39 | "inputWidgets": {}, 40 | "nuid": "6bca260b-13d1-448f-8082-30b60a85c9ae", 41 | "showTitle": false, 42 | "title": "" 43 | } 44 | }, 45 | "outputs": [], 46 | "source": [ 47 | "spark.range(10)" 48 | ] 49 | } 50 | ], 51 | "metadata": { 52 | "application/vnd.databricks.v1+notebook": { 53 | "dashboards": [], 54 | "language": "python", 55 | "notebookMetadata": { 56 | "pythonIndentUnit": 2 57 | }, 58 | "notebookName": "notebook", 59 | "widgets": {} 60 | }, 61 | "kernelspec": { 62 | "display_name": "Python 3", 63 | "language": "python", 64 | "name": "python3" 65 | }, 66 | "language_info": { 67 | "name": "python", 68 | "version": "3.11.4" 69 | } 70 | }, 71 | "nbformat": 4, 72 | "nbformat_minor": 0 73 | } 74 | -------------------------------------------------------------------------------- /jdemo/src/notebook.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": { 6 | "application/vnd.databricks.v1+cell": { 7 | "cellMetadata": {}, 8 | "inputWidgets": {}, 9 | "nuid": "ee353e42-ff58-4955-9608-12865bd0950e", 10 | "showTitle": false, 11 | "title": "" 12 | } 13 | }, 14 | "source": [ 15 | "# Default notebook\n", 16 | "\n", 17 | "This default notebook is executed using Databricks Workflows as defined in resources/jdemo.job.yml." 
18 | ] 19 | }, 20 | { 21 | "cell_type": "code", 22 | "execution_count": 2, 23 | "metadata": {}, 24 | "outputs": [], 25 | "source": [ 26 | "%load_ext autoreload\n", 27 | "%autoreload 2" 28 | ] 29 | }, 30 | { 31 | "cell_type": "code", 32 | "execution_count": 0, 33 | "metadata": { 34 | "application/vnd.databricks.v1+cell": { 35 | "cellMetadata": { 36 | "byteLimit": 2048000, 37 | "rowLimit": 10000 38 | }, 39 | "inputWidgets": {}, 40 | "nuid": "6bca260b-13d1-448f-8082-30b60a85c9ae", 41 | "showTitle": false, 42 | "title": "" 43 | } 44 | }, 45 | "outputs": [], 46 | "source": [ 47 | "from jdemo import main\n", 48 | "\n", 49 | "main.get_taxis(spark).show(10)" 50 | ] 51 | } 52 | ], 53 | "metadata": { 54 | "application/vnd.databricks.v1+notebook": { 55 | "dashboards": [], 56 | "language": "python", 57 | "notebookMetadata": { 58 | "pythonIndentUnit": 2 59 | }, 60 | "notebookName": "notebook", 61 | "widgets": {} 62 | }, 63 | "kernelspec": { 64 | "display_name": "Python 3", 65 | "language": "python", 66 | "name": "python3" 67 | }, 68 | "language_info": { 69 | "name": "python", 70 | "version": "3.11.4" 71 | } 72 | }, 73 | "nbformat": 4, 74 | "nbformat_minor": 0 75 | } 76 | -------------------------------------------------------------------------------- /mlops-statcks-multiphase/deployment/model_deployment/deploy.py: -------------------------------------------------------------------------------- 1 | import sys 2 | import pathlib 3 | 4 | sys.path.append(str(pathlib.Path(__file__).parent.parent.parent.resolve())) 5 | 6 | from mlflow.tracking import MlflowClient 7 | 8 | 9 | def deploy(model_uri, env): 10 | """Deploys an already-registered model in Unity catalog by assigning it the appropriate alias for model deployment. 11 | 12 | :param model_uri: URI of the model to deploy. Must be in the format "models://", as described in 13 | https://www.mlflow.org/docs/latest/model-registry.html#fetching-an-mlflow-model-from-the-model-registry 14 | :param env: name of the environment in which we're performing deployment, i.e one of "dev", "staging", "prod". 
15 | Defaults to "dev" 16 | :return: 17 | """ 18 | print(f"Deployment running in env: {env}") 19 | _, model_name, version = model_uri.split("/") 20 | client = MlflowClient(registry_uri="databricks-uc") 21 | mv = client.get_model_version(model_name, version) 22 | target_alias = "champion" 23 | if target_alias not in mv.aliases: 24 | client.set_registered_model_alias( 25 | name=model_name, 26 | alias=target_alias, 27 | version=version) 28 | print(f"Assigned alias '{target_alias}' to model version {model_uri}.") 29 | 30 | # remove "challenger" alias if assigning "champion" alias 31 | if target_alias == "champion" and "challenger" in mv.aliases: 32 | print(f"Removing 'challenger' alias from model version {model_uri}.") 33 | client.delete_registered_model_alias( 34 | name=model_name, 35 | alias="challenger") 36 | 37 | 38 | 39 | if __name__ == "__main__": 40 | deploy(model_uri=sys.argv[1], env=sys.argv[2]) 41 | -------------------------------------------------------------------------------- /integration-tests/src/setup_test.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": { 6 | "application/vnd.databricks.v1+cell": { 7 | "cellMetadata": {}, 8 | "inputWidgets": {}, 9 | "nuid": "ee353e42-ff58-4955-9608-12865bd0950e", 10 | "showTitle": false, 11 | "title": "" 12 | } 13 | }, 14 | "source": [ 15 | "# Setup test data notebook" 16 | ] 17 | }, 18 | { 19 | "cell_type": "code", 20 | "execution_count": 2, 21 | "metadata": {}, 22 | "outputs": [], 23 | "source": [ 24 | "%load_ext autoreload\n", 25 | "%autoreload 2" 26 | ] 27 | }, 28 | { 29 | "cell_type": "code", 30 | "execution_count": 0, 31 | "metadata": { 32 | "application/vnd.databricks.v1+cell": { 33 | "cellMetadata": { 34 | "byteLimit": 2048000, 35 | "rowLimit": 10000 36 | }, 37 | "inputWidgets": {}, 38 | "nuid": "6bca260b-13d1-448f-8082-30b60a85c9ae", 39 | "showTitle": false, 40 | "title": "" 41 | } 42 | }, 43 | "outputs": [], 44 | "source": [ 45 | "catalog = dbutils.widgets.get(\"catalog\")\n", 46 | "schema = dbutils.widgets.get(\"schema\")\n", 47 | "table = dbutils.widgets.get(\"table\")\n" 48 | ] 49 | }, 50 | { 51 | "cell_type": "code", 52 | "execution_count": null, 53 | "metadata": {}, 54 | "outputs": [], 55 | "source": [ 56 | "df = spark.range(10).write.mode(\"overwrite\").saveAsTable(f\"{catalog}.{schema}.{table}\")" 57 | ] 58 | } 59 | ], 60 | "metadata": { 61 | "application/vnd.databricks.v1+notebook": { 62 | "dashboards": [], 63 | "language": "python", 64 | "notebookMetadata": { 65 | "pythonIndentUnit": 2 66 | }, 67 | "notebookName": "notebook", 68 | "widgets": {} 69 | }, 70 | "kernelspec": { 71 | "display_name": "Python 3", 72 | "language": "python", 73 | "name": "python3" 74 | }, 75 | "language_info": { 76 | "name": "python", 77 | "version": "3.11.4" 78 | } 79 | }, 80 | "nbformat": 4, 81 | "nbformat_minor": 0 82 | } 83 | -------------------------------------------------------------------------------- /jdemo/java-code/pom.xml: -------------------------------------------------------------------------------- 1 | 3 | 4.0.0 4 | 5 | net.alexott.demos 6 | dabs-demo 7 | 0.0.1 8 | jar 9 | 10 | 11 | UTF-8 12 | 1.8 13 | 2.12.12 14 | 3.4.3 15 | 2.12 16 | 17 | 18 | 19 | 20 | 21 | org.apache.spark 22 | spark-sql_${spark.scala.version} 23 | ${spark.version} 24 | provided 25 | 26 | 27 | 28 | 29 | 30 | 31 | maven-compiler-plugin 32 | 3.8.1 33 | 34 | ${java.version} 35 | ${java.version} 36 | true 37 | 38 | 39 | 40 | org.apache.maven.plugins 41 | 
maven-assembly-plugin
42 | 3.2.0
43 | 
44 | 
45 | jar-with-dependencies
46 | 
47 | 
48 | 
49 | 
50 | package
51 | 
52 | single
53 | 
54 | 
55 | 
56 | 
57 | 
58 | 
59 | 
60 | 
61 | 
--------------------------------------------------------------------------------
/jdemo/README.md:
--------------------------------------------------------------------------------
1 | # jdemo
2 | 
3 | The 'jdemo' project was generated using the `default-python` template. It shows how to
4 | have multiple artefacts in the same bundle: a `jar` built from Java code using Maven, and
5 | a `whl` built from the Python code.
6 | 
7 | ## Getting started
8 | 
9 | 1. Install the Databricks CLI from https://docs.databricks.com/dev-tools/cli/databricks-cli.html
10 | 
11 | 2. Authenticate to your Databricks workspace, if you have not done so already:
12 |    ```
13 |    $ databricks configure
14 |    ```
15 | 
16 | 3. To deploy a development copy of this project, type:
17 |    ```
18 |    $ databricks bundle deploy --target dev
19 |    ```
20 |    (Note that "dev" is the default target, so the `--target` parameter
21 |    is optional here.)
22 | 
23 |    This deploys everything that's defined for this project.
24 |    For example, the default template would deploy a job called
25 |    `[dev yourname] jdemo_job` to your workspace.
26 |    You can find that job by opening your workspace and clicking on **Workflows**.
27 | 
28 | 4. Similarly, to deploy a production copy, type:
29 |    ```
30 |    $ databricks bundle deploy --target prod
31 |    ```
32 | 
33 |    Note that the default job from the template has a schedule that runs every day
34 |    (defined in resources/jdemo.job.yml). The schedule
35 |    is paused when deploying in development mode (see
36 |    https://docs.databricks.com/dev-tools/bundles/deployment-modes.html).
37 | 
38 | 5. To run a job or pipeline, use the "run" command:
39 |    ```
40 |    $ databricks bundle run
41 |    ```
42 | 
43 | 6. Optionally, install developer tools such as the Databricks extension for Visual Studio Code from
44 |    https://docs.databricks.com/dev-tools/vscode-ext.html. Or read the "getting started" documentation for
45 |    **Databricks Connect** for instructions on running the included Python code from a different IDE.
46 | 
47 | 7. For documentation on the Databricks asset bundles format used
48 |    for this project, and for CI/CD configuration, see
49 |    https://docs.databricks.com/dev-tools/bundles/index.html.
50 | 
--------------------------------------------------------------------------------
/vars_demo/README.md:
--------------------------------------------------------------------------------
1 | # vars_demo
2 | 
3 | The 'vars_demo' project demonstrates how to use [complex](https://docs.databricks.com/aws/en/dev-tools/bundles/variables#define-a-complex-variable) and [lookup](https://docs.databricks.com/aws/en/dev-tools/bundles/variables#retrieve-an-objects-id-value) variables in Databricks Asset Bundles (DABs).
4 | 
5 | Variables allow you to parametrize a bundle. Variables are referenced using the `${var.<variable_name>}` syntax. There are different variable types:
6 | 
7 | * "Normal variables" - a static value, which can be defined on the command line, via an environment variable, …
8 | * "Lookup variables" - fetch information about an existing object (cluster or policy ID by name, etc.). This is very handy when you have an object with the same name deployed in different environments, e.g., cluster policies, notification destinations, etc.
9 | * "Complex variables" - consist of multiple values. For example, they can be used to define cluster configurations, notifications, etc. (see the sketch below).
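For illustration only, here is a minimal sketch (not part of this bundle) of how the three kinds could be declared; the variable names are placeholders, and the lookup and complex examples mirror the project's own [resources/variables.yml](resources/variables.yml) and the jdemo bundle:

```yaml
variables:
  environment:                      # "normal" variable: a plain static value
    description: "Deployment environment name"
    default: "dev"
  instance_pool_id:                 # lookup variable: resolved at deploy time to the ID of the pool with this name
    description: "ID of an existing instance pool"
    lookup:
      instance_pool: "TFTests"
  notification_settings:            # complex variable: holds a whole configuration block
    description: "Webhook notification config"
    type: complex
    default: {}
```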
10 | 11 | Variables could have a different value in each target, and in combination with `default` value it's possible to implement "conditional" overwrite of some values in defined resources. 12 | 13 | This demo shows how to define `webhook_notifications` in jobs such way that Slack notifications are defined only in the `prod` environment. This is done by defining a complex variable `notification_settings` that has an empty value by default, but we're overwriting it in the `prod` environment by looking up the notification destination with a specific name (defined by the `notification_name` variable). (All code is in the [resources/variables.yml](resources/variables.yml)). 14 | 15 | And then we can just use the complex variable in the `webhook_notifications` argument (line 13 in [resources/vars_demo.job.yml](resources/vars_demo.job.yml)): 16 | 17 | ```yaml 18 | webhook_notifications: ${var.notification_settings} 19 | ``` 20 | 21 | You can check with `databricks bundle validate -t dev --output json` that corresponding argument is empty in the `dev`, but if you run `databricks bundle validate -t prod --output json`, then it will be filled with actual ID of the notification destination. -------------------------------------------------------------------------------- /mlops-statcks-multiphase/resources/batch-inference-workflow-resource.yml: -------------------------------------------------------------------------------- 1 | new_cluster: &new_cluster 2 | new_cluster: 3 | num_workers: 3 4 | spark_version: 15.3.x-cpu-ml-scala2.12 5 | node_type_id: Standard_D3_v2 6 | custom_tags: 7 | clusterSource: mlops-stacks_0.4 8 | 9 | common_permissions: &permissions 10 | permissions: 11 | - level: CAN_VIEW 12 | group_name: users 13 | 14 | batch_inference_job: &batch_inference_job 15 | batch_inference_job: 16 | name: ${var.current_target}-mlops-batch-inference-job 17 | tasks: 18 | - task_key: batch_inference_job 19 | <<: *new_cluster 20 | notebook_task: 21 | notebook_path: ../deployment/batch_inference/notebooks/BatchInference.py 22 | base_parameters: 23 | env: ${var.current_target} 24 | input_table_name: ${var.current_target}.mlops.feature_store_inference_input # TODO: create input table for inference 25 | output_table_name: ${var.current_target}.mlops.predictions 26 | model_name: ${var.current_target}.mlops.${var.model_name} 27 | # git source information of current ML resource deployment. It will be persisted as part of the workflow run 28 | git_source_info: url:${bundle.git.origin_url}; branch:${bundle.git.branch}; commit:${bundle.git.commit} 29 | 30 | schedule: 31 | quartz_cron_expression: "0 0 11 * * ?" # daily at 11am 32 | timezone_id: UTC 33 | <<: *permissions 34 | # If you want to turn on notifications for this job, please uncomment the below code, 35 | # and provide a list of emails to the on_failure argument. 36 | # 37 | # email_notifications: 38 | # on_failure: 39 | # - first@company.com 40 | # - second@company.com 41 | 42 | targets: 43 | dev-phase1: 44 | resources: 45 | jobs: 46 | <<: *batch_inference_job 47 | 48 | test-phase1: 49 | resources: 50 | jobs: 51 | <<: *batch_inference_job 52 | 53 | prod-phase1: 54 | resources: 55 | jobs: 56 | <<: *batch_inference_job 57 | -------------------------------------------------------------------------------- /jdemo/resources/jdemo.job.yml: -------------------------------------------------------------------------------- 1 | # The main job for jdemo. 
2 | 3 | new_cluster: &new_cluster 4 | new_cluster: 5 | spark_version: 15.4.x-scala2.12 6 | instance_pool_id: ${var.instance_pool_id} 7 | autoscale: 8 | min_workers: 1 9 | max_workers: 4 10 | custom_tags: 11 | project: jdemo 12 | 13 | # TODO: 14 | # - Add parameters to the job, like, getting the table name to read 15 | # - override parameters per stage, or when running integration test 16 | 17 | resources: 18 | jobs: 19 | jdemo_job: 20 | name: jdemo_job 21 | 22 | trigger: 23 | # Run this job every day, exactly one day from the last run; see https://docs.databricks.com/api/workspace/jobs/create#trigger 24 | periodic: 25 | interval: 1 26 | unit: DAYS 27 | 28 | email_notifications: 29 | on_failure: 30 | - user@domain.com 31 | 32 | tasks: 33 | - task_key: notebook_task 34 | job_cluster_key: job_cluster 35 | notebook_task: 36 | notebook_path: ../src/notebook.ipynb 37 | 38 | - task_key: wheel_task 39 | depends_on: 40 | - task_key: notebook_task 41 | 42 | job_cluster_key: job_cluster 43 | python_wheel_task: 44 | package_name: jdemo 45 | entry_point: main 46 | libraries: 47 | # By default we just include the .whl file generated for the jdemo package. 48 | # See https://docs.databricks.com/dev-tools/bundles/library-dependencies.html 49 | # for more information on how to add other libraries. 50 | - whl: ../dist/*.whl 51 | 52 | - task_key: jar_task 53 | depends_on: 54 | - task_key: notebook_task 55 | #<<: *new_cluster 56 | job_cluster_key: job_cluster 57 | spark_jar_task: 58 | main_class_name: net.alexott.demos.SparkDemo 59 | libraries: 60 | - jar: ../java-code/target/dabs-demo-0.0.1-jar-with-dependencies.jar 61 | 62 | job_clusters: 63 | - job_cluster_key: job_cluster 64 | <<: *new_cluster 65 | -------------------------------------------------------------------------------- /mlops-statcks-multiphase/validation/validation.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | from mlflow.models import make_metric, MetricThreshold 3 | 4 | # Custom metrics to be included. Return empty list if custom metrics are not needed. 5 | # Please refer to custom_metrics parameter in mlflow.evaluate documentation https://mlflow.org/docs/latest/python_api/mlflow.html#mlflow.evaluate 6 | # TODO(optional) : custom_metrics 7 | def custom_metrics(): 8 | 9 | # TODO(optional) : define custom metric function to be included in custom_metrics. 10 | def squared_diff_plus_one(eval_df, _builtin_metrics): 11 | """ 12 | This example custom metric function creates a metric based on the ``prediction`` and 13 | ``target`` columns in ``eval_df`. 14 | """ 15 | return np.sum(np.abs(eval_df["prediction"] - eval_df["target"] + 1) ** 2) 16 | 17 | return [make_metric(eval_fn=squared_diff_plus_one, greater_is_better=False)] 18 | 19 | 20 | # Define model validation rules. Return empty dict if validation rules are not needed. 
21 | # Please refer to validation_thresholds parameter in mlflow.evaluate documentation https://mlflow.org/docs/latest/python_api/mlflow.html#mlflow.evaluate 22 | # TODO(optional) : validation_thresholds 23 | def validation_thresholds(): 24 | return { 25 | "max_error": MetricThreshold( 26 | threshold=500, higher_is_better=False # max_error should be <= 500 27 | ), 28 | "mean_squared_error": MetricThreshold( 29 | threshold=500, # mean_squared_error should be <= 500 30 | # min_absolute_change=0.01, # mean_squared_error should be at least 0.01 greater than baseline model accuracy 31 | # min_relative_change=0.01, # mean_squared_error should be at least 1 percent greater than baseline model accuracy 32 | higher_is_better=False, 33 | ), 34 | } 35 | 36 | 37 | # Define evaluator config. Return empty dict if validation rules are not needed. 38 | # Please refer to evaluator_config parameter in mlflow.evaluate documentation https://mlflow.org/docs/latest/python_api/mlflow.html#mlflow.evaluate 39 | # TODO(optional) : evaluator_config 40 | def evaluator_config(): 41 | return {} 42 | -------------------------------------------------------------------------------- /integration-tests/src/cleanup_test.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": { 6 | "application/vnd.databricks.v1+cell": { 7 | "cellMetadata": {}, 8 | "inputWidgets": {}, 9 | "nuid": "ee353e42-ff58-4955-9608-12865bd0950e", 10 | "showTitle": false, 11 | "title": "" 12 | } 13 | }, 14 | "source": [ 15 | "# Cleanup test data\n", 16 | "\n", 17 | "Removes test data" 18 | ] 19 | }, 20 | { 21 | "cell_type": "code", 22 | "execution_count": 2, 23 | "metadata": {}, 24 | "outputs": [], 25 | "source": [ 26 | "%load_ext autoreload\n", 27 | "%autoreload 2" 28 | ] 29 | }, 30 | { 31 | "cell_type": "code", 32 | "execution_count": 0, 33 | "metadata": { 34 | "application/vnd.databricks.v1+cell": { 35 | "cellMetadata": { 36 | "byteLimit": 2048000, 37 | "rowLimit": 10000 38 | }, 39 | "inputWidgets": {}, 40 | "nuid": "6bca260b-13d1-448f-8082-30b60a85c9ae", 41 | "showTitle": false, 42 | "title": "" 43 | } 44 | }, 45 | "outputs": [], 46 | "source": [ 47 | "catalog = dbutils.widgets.get(\"catalog\")\n", 48 | "schema = dbutils.widgets.get(\"schema\")\n", 49 | "table = dbutils.widgets.get(\"table\")\n", 50 | "target_table = dbutils.widgets.get(\"target_table\")" 51 | ] 52 | }, 53 | { 54 | "cell_type": "code", 55 | "execution_count": null, 56 | "metadata": {}, 57 | "outputs": [], 58 | "source": [ 59 | "spark.sql(f\"drop table if exists {catalog}.{schema}.{table}\")\n", 60 | "spark.sql(f\"drop table if exists {catalog}.{schema}.{target_table}\")" 61 | ] 62 | } 63 | ], 64 | "metadata": { 65 | "application/vnd.databricks.v1+notebook": { 66 | "dashboards": [], 67 | "language": "python", 68 | "notebookMetadata": { 69 | "pythonIndentUnit": 2 70 | }, 71 | "notebookName": "notebook", 72 | "widgets": {} 73 | }, 74 | "kernelspec": { 75 | "display_name": "Python 3", 76 | "language": "python", 77 | "name": "python3" 78 | }, 79 | "language_info": { 80 | "name": "python", 81 | "version": "3.11.4" 82 | } 83 | }, 84 | "nbformat": 4, 85 | "nbformat_minor": 0 86 | } 87 | -------------------------------------------------------------------------------- /integration-tests/src/main_nb.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": { 6 | 
"application/vnd.databricks.v1+cell": { 7 | "cellMetadata": {}, 8 | "inputWidgets": {}, 9 | "nuid": "ee353e42-ff58-4955-9608-12865bd0950e", 10 | "showTitle": false, 11 | "title": "" 12 | } 13 | }, 14 | "source": [ 15 | "# Default notebook\n", 16 | "\n", 17 | "This default notebook is executed using Databricks Workflows as defined in resources/dabs1.job.yml." 18 | ] 19 | }, 20 | { 21 | "cell_type": "code", 22 | "execution_count": 2, 23 | "metadata": {}, 24 | "outputs": [], 25 | "source": [ 26 | "%load_ext autoreload\n", 27 | "%autoreload 2" 28 | ] 29 | }, 30 | { 31 | "cell_type": "code", 32 | "execution_count": 0, 33 | "metadata": { 34 | "application/vnd.databricks.v1+cell": { 35 | "cellMetadata": { 36 | "byteLimit": 2048000, 37 | "rowLimit": 10000 38 | }, 39 | "inputWidgets": {}, 40 | "nuid": "6bca260b-13d1-448f-8082-30b60a85c9ae", 41 | "showTitle": false, 42 | "title": "" 43 | } 44 | }, 45 | "outputs": [], 46 | "source": [ 47 | "catalog = dbutils.widgets.get(\"catalog\")\n", 48 | "schema = dbutils.widgets.get(\"schema\")\n", 49 | "table = dbutils.widgets.get(\"table\")\n", 50 | "target_table = dbutils.widgets.get(\"target_table\")" 51 | ] 52 | }, 53 | { 54 | "cell_type": "code", 55 | "execution_count": null, 56 | "metadata": {}, 57 | "outputs": [], 58 | "source": [ 59 | "df = spark.read.table(f\"{catalog}.{schema}.{table}\")\n", 60 | "df.write.mode(\"overwrite\").saveAsTable(f\"{catalog}.{schema}.{target_table}\")" 61 | ] 62 | } 63 | ], 64 | "metadata": { 65 | "application/vnd.databricks.v1+notebook": { 66 | "dashboards": [], 67 | "language": "python", 68 | "notebookMetadata": { 69 | "pythonIndentUnit": 2 70 | }, 71 | "notebookName": "notebook", 72 | "widgets": {} 73 | }, 74 | "kernelspec": { 75 | "display_name": "Python 3", 76 | "language": "python", 77 | "name": "python3" 78 | }, 79 | "language_info": { 80 | "name": "python", 81 | "version": "3.11.4" 82 | } 83 | }, 84 | "nbformat": 4, 85 | "nbformat_minor": 0 86 | } 87 | -------------------------------------------------------------------------------- /jdemo/databricks.yml: -------------------------------------------------------------------------------- 1 | # This is a Databricks asset bundle definition for jdemo. 2 | # See https://docs.databricks.com/dev-tools/bundles/index.html for documentation. 3 | bundle: 4 | name: jdemo 5 | 6 | include: 7 | - resources/*.yml 8 | 9 | variables: 10 | uniq_id: 11 | description: "Some ID that will guarantee uniqueness of the object, i.e., PR number" 12 | default: ${workspace.current_user.short_name} 13 | 14 | artifacts: 15 | java-code: 16 | path: ./java-code 17 | build: mvn package 18 | type: jar 19 | files: 20 | - source: ./java-code/target/dabs-demo-0.0.1-jar-with-dependencies.jar 21 | wheel: 22 | path: . 23 | type: whl 24 | 25 | workspace: 26 | host: https://adb-xxxx.17.azuredatabricks.net 27 | 28 | targets: 29 | dev: 30 | # The default target uses 'mode: development' to create a development copy. 31 | # - Deployed resources get prefixed with '[dev my_user_name]' 32 | # - Any job schedules and triggers are paused by default. 33 | # See also https://docs.databricks.com/dev-tools/bundles/deployment-modes.html. 
34 | mode: development 35 | default: true 36 | workspace: 37 | artifact_path: /Volumes/main/default/jars/${workspace.current_user.short_name}-${bundle.target} 38 | 39 | staging: 40 | presets: 41 | name_prefix: "[Staging ${var.uniq_id}] " 42 | workspace: 43 | artifact_path: /Volumes/main/default/jars/${bundle.target}-${var.uniq_id} 44 | root_path: /Workspace/Projects/${bundle.target}/${bundle.name}/${var.uniq_id} 45 | resources: 46 | jobs: 47 | jdemo_job: 48 | trigger: 49 | pause_status: PAUSED 50 | 51 | prod: 52 | mode: production 53 | presets: 54 | name_prefix: "[Prod] " 55 | workspace: 56 | # We explicitly specify /Workspace/Users/user@domain.com to make sure we only have a single copy. 57 | root_path: /Workspace/Projects/${bundle.target}/${bundle.name} 58 | artifact_path: /Volumes/main/default/prod 59 | resources: 60 | jobs: 61 | jdemo_job: 62 | trigger: 63 | pause_status: PAUSED # This is just for demo purposes, to avoid running the job in the demo 64 | permissions: 65 | - user_name: user@domain.com 66 | level: CAN_MANAGE 67 | run_as: 68 | user_name: user@domain.com 69 | -------------------------------------------------------------------------------- /integration-tests/src/validate_test.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": { 6 | "application/vnd.databricks.v1+cell": { 7 | "cellMetadata": {}, 8 | "inputWidgets": {}, 9 | "nuid": "ee353e42-ff58-4955-9608-12865bd0950e", 10 | "showTitle": false, 11 | "title": "" 12 | } 13 | }, 14 | "source": [ 15 | "# Default notebook\n", 16 | "\n", 17 | "This default notebook is executed using Databricks Workflows as defined in resources/dabs1.job.yml." 18 | ] 19 | }, 20 | { 21 | "cell_type": "code", 22 | "execution_count": 2, 23 | "metadata": {}, 24 | "outputs": [], 25 | "source": [ 26 | "%load_ext autoreload\n", 27 | "%autoreload 2" 28 | ] 29 | }, 30 | { 31 | "cell_type": "code", 32 | "execution_count": 0, 33 | "metadata": { 34 | "application/vnd.databricks.v1+cell": { 35 | "cellMetadata": { 36 | "byteLimit": 2048000, 37 | "rowLimit": 10000 38 | }, 39 | "inputWidgets": {}, 40 | "nuid": "6bca260b-13d1-448f-8082-30b60a85c9ae", 41 | "showTitle": false, 42 | "title": "" 43 | } 44 | }, 45 | "outputs": [], 46 | "source": [ 47 | "catalog = dbutils.widgets.get(\"catalog\")\n", 48 | "schema = dbutils.widgets.get(\"schema\")\n", 49 | "target_table = dbutils.widgets.get(\"target_table\")" 50 | ] 51 | }, 52 | { 53 | "cell_type": "code", 54 | "execution_count": null, 55 | "metadata": {}, 56 | "outputs": [], 57 | "source": [ 58 | "df = spark.read.table(f\"{catalog}.{schema}.{target_table}\")\n", 59 | "assert df is not None, f\"Failed to read table {catalog}.{schema}.{target_table}\"\n", 60 | "assert df.count() == 10, f\"Incorrect number of rows in {catalog}.{schema}.{target_table}\"" 61 | ] 62 | } 63 | ], 64 | "metadata": { 65 | "application/vnd.databricks.v1+notebook": { 66 | "dashboards": [], 67 | "language": "python", 68 | "notebookMetadata": { 69 | "pythonIndentUnit": 2 70 | }, 71 | "notebookName": "notebook", 72 | "widgets": {} 73 | }, 74 | "kernelspec": { 75 | "display_name": "Python 3", 76 | "language": "python", 77 | "name": "python3" 78 | }, 79 | "language_info": { 80 | "name": "python", 81 | "version": "3.11.4" 82 | } 83 | }, 84 | "nbformat": 4, 85 | "nbformat_minor": 0 86 | } 87 | -------------------------------------------------------------------------------- 
/mlops-statcks-multiphase/deployment/model_deployment/notebooks/ModelDeployment.py: -------------------------------------------------------------------------------- 1 | # Databricks notebook source 2 | ################################################################################## 3 | # Helper notebook to transition the model stage. This notebook is run 4 | # after the Train.py notebook as part of a multi-task job, in order to transition model 5 | # to target stage after training completes. 6 | # 7 | # Note that we deploy the model to the stage in MLflow Model Registry equivalent to the 8 | # environment in which the multi-task job is executed (e.g deploy the trained model to 9 | # stage=Production if triggered in the prod environment). In a practical setting, we would 10 | # recommend enabling the model validation step between model training and automatically 11 | # registering the model to the Production stage in prod. 12 | # 13 | # This notebook has the following parameters: 14 | # 15 | # * env (required) - String name of the current environment for model deployment, which decides the target stage. 16 | # * model_uri (required) - URI of the model to deploy. Must be in the format "models://", as described in 17 | # https://www.mlflow.org/docs/latest/model-registry.html#fetching-an-mlflow-model-from-the-model-registry 18 | # This parameter is read as a task value 19 | # (https://learn.microsoft.com/azure/databricks/dev-tools/databricks-utils), 20 | # rather than as a notebook widget. That is, we assume a preceding task (the Train.py 21 | # notebook) has set a task value with key "model_uri". 22 | ################################################################################## 23 | 24 | # List of input args needed to run the notebook as a job. 25 | # Provide them via DB widgets or notebook arguments. 26 | # 27 | # Name of the current environment 28 | dbutils.widgets.dropdown("env", "None", ["None", "staging", "prod"], "Environment Name") 29 | 30 | # COMMAND ---------- 31 | 32 | import os 33 | import sys 34 | notebook_path = '/Workspace/' + os.path.dirname(dbutils.notebook.entry_point.getDbutils().notebook().getContext().notebookPath().get()) 35 | %cd $notebook_path 36 | %cd .. 37 | sys.path.append("../..") 38 | 39 | # COMMAND ---------- 40 | 41 | from deploy import deploy 42 | 43 | model_uri = dbutils.jobs.taskValues.get("Train", "model_uri", debugValue="") 44 | env = dbutils.widgets.get("env") 45 | assert env != "None", "env notebook parameter must be specified" 46 | assert model_uri != "", "model_uri notebook parameter must be specified" 47 | deploy(model_uri, env) 48 | 49 | # COMMAND ---------- 50 | print( 51 | f"Successfully completed model deployment for {model_uri}" 52 | ) 53 | -------------------------------------------------------------------------------- /mlops-statcks-multiphase/feature_engineering/features/pickup_features.py: -------------------------------------------------------------------------------- 1 | """ 2 | This sample module contains features logic that can be used to generate and populate tables in Feature Store. 3 | You should plug in your own features computation logic in the compute_features_fn method below. 
4 | """ 5 | import pyspark.sql.functions as F 6 | from pyspark.sql.types import FloatType, IntegerType, StringType, TimestampType 7 | from pytz import timezone 8 | 9 | 10 | @F.udf(returnType=StringType()) 11 | def _partition_id(dt): 12 | # datetime -> "YYYY-MM" 13 | return f"{dt.year:04d}-{dt.month:02d}" 14 | 15 | 16 | def _filter_df_by_ts(df, ts_column, start_date, end_date): 17 | if ts_column and start_date: 18 | df = df.filter(F.col(ts_column) >= start_date) 19 | if ts_column and end_date: 20 | df = df.filter(F.col(ts_column) < end_date) 21 | return df 22 | 23 | 24 | def compute_features_fn(input_df, timestamp_column, start_date, end_date): 25 | """Contains logic to compute features. 26 | 27 | Given an input dataframe and time ranges, this function should compute features, populate an output dataframe and 28 | return it. This method will be called from a Feature Store pipeline job and the output dataframe will be written 29 | to a Feature Store table. You should update this method with your own feature computation logic. 30 | 31 | The timestamp_column, start_date, end_date args are optional but strongly recommended for time-series based 32 | features. 33 | 34 | TODO: Update and adapt the sample code for your use case 35 | 36 | :param input_df: Input dataframe. 37 | :param timestamp_column: Column containing a timestamp. This column is used to limit the range of feature 38 | computation. It is also used as the timestamp key column when populating the feature table, so it needs to be 39 | returned in the output. 40 | :param start_date: Start date of the feature computation interval. 41 | :param end_date: End date of the feature computation interval. 42 | :return: Output dataframe containing computed features given the input arguments. 43 | """ 44 | df = _filter_df_by_ts(input_df, timestamp_column, start_date, end_date) 45 | pickupzip_features = ( 46 | df.groupBy( 47 | "pickup_zip", F.window(timestamp_column, "1 hour", "15 minutes") 48 | ) # 1 hour window, sliding every 15 minutes 49 | .agg( 50 | F.mean("fare_amount").alias("mean_fare_window_1h_pickup_zip"), 51 | F.count("*").alias("count_trips_window_1h_pickup_zip"), 52 | ) 53 | .select( 54 | F.col("pickup_zip").alias("zip"), 55 | F.unix_timestamp(F.col("window.end")) 56 | .alias(timestamp_column) 57 | .cast(TimestampType()), 58 | _partition_id(F.to_timestamp(F.col("window.end"))).alias("yyyy_mm"), 59 | F.col("mean_fare_window_1h_pickup_zip").cast(FloatType()), 60 | F.col("count_trips_window_1h_pickup_zip").cast(IntegerType()), 61 | ) 62 | ) 63 | return pickupzip_features 64 | -------------------------------------------------------------------------------- /mlops-statcks-multiphase/feature_engineering/features/dropoff_features.py: -------------------------------------------------------------------------------- 1 | """ 2 | This sample module contains features logic that can be used to generate and populate tables in Feature Store. 3 | You should plug in your own features computation logic in the compute_features_fn method below. 
4 | """ 5 | import pyspark.sql.functions as F 6 | from pyspark.sql.types import IntegerType, StringType, TimestampType 7 | from pytz import timezone 8 | 9 | 10 | @F.udf(returnType=IntegerType()) 11 | def _is_weekend(dt): 12 | tz = "America/New_York" 13 | return int(dt.astimezone(timezone(tz)).weekday() >= 5) # 5 = Saturday, 6 = Sunday 14 | 15 | 16 | @F.udf(returnType=StringType()) 17 | def _partition_id(dt): 18 | # datetime -> "YYYY-MM" 19 | return f"{dt.year:04d}-{dt.month:02d}" 20 | 21 | 22 | def _filter_df_by_ts(df, ts_column, start_date, end_date): 23 | if ts_column and start_date: 24 | df = df.filter(F.col(ts_column) >= start_date) 25 | if ts_column and end_date: 26 | df = df.filter(F.col(ts_column) < end_date) 27 | return df 28 | 29 | 30 | def compute_features_fn(input_df, timestamp_column, start_date, end_date): 31 | """Contains logic to compute features. 32 | 33 | Given an input dataframe and time ranges, this function should compute features, populate an output dataframe and 34 | return it. This method will be called from a Feature Store pipeline job and the output dataframe will be written 35 | to a Feature Store table. You should update this method with your own feature computation logic. 36 | 37 | The timestamp_column, start_date, end_date args are optional but strongly recommended for time-series based 38 | features. 39 | 40 | TODO: Update and adapt the sample code for your use case 41 | 42 | :param input_df: Input dataframe. 43 | :param timestamp_column: Column containing the timestamp. This column is used to limit the range of feature 44 | computation. It is also used as the timestamp key column when populating the feature table, so it needs to be 45 | returned in the output. 46 | :param start_date: Start date of the feature computation interval. 47 | :param end_date: End date of the feature computation interval. 48 | :return: Output dataframe containing computed features given the input arguments. 
49 | """ 50 | df = _filter_df_by_ts(input_df, timestamp_column, start_date, end_date) 51 | dropoffzip_features = ( 52 | df.groupBy("dropoff_zip", F.window(timestamp_column, "30 minute")) 53 | .agg(F.count("*").alias("count_trips_window_30m_dropoff_zip")) 54 | .select( 55 | F.col("dropoff_zip").alias("zip"), 56 | F.unix_timestamp(F.col("window.end")) 57 | .alias(timestamp_column) 58 | .cast(TimestampType()), 59 | _partition_id(F.to_timestamp(F.col("window.end"))).alias("yyyy_mm"), 60 | F.col("count_trips_window_30m_dropoff_zip").cast(IntegerType()), 61 | _is_weekend(F.col("window.end")).alias("dropoff_is_weekend"), 62 | ) 63 | ) 64 | return dropoffzip_features 65 | -------------------------------------------------------------------------------- /mlops-statcks-multiphase/monitoring/notebooks/MonitoredMetricViolationCheck.py: -------------------------------------------------------------------------------- 1 | # Databricks notebook source 2 | ################################################################################## 3 | # This notebook runs a sql query and set the result as job task value 4 | # 5 | # This notebook has the following parameters: 6 | # 7 | # * table_name_under_monitor (required) - The name of a table that is currently being monitored 8 | # * metric_to_monitor (required) - Metric to be monitored for threshold violation 9 | # * metric_violation_threshold (required) - Threshold value for metric violation 10 | # * num_evaluation_windows (required) - Number of windows to check for violation 11 | # * num_violation_windows (required) - Number of windows that need to violate the threshold 12 | ################################################################################## 13 | 14 | # List of input args needed to run the notebook as a job. 15 | # Provide them via DB widgets or notebook arguments. 16 | # 17 | # Name of the table that is currently being monitored 18 | dbutils.widgets.text( 19 | "table_name_under_monitor", "dev.mlops.predictions", label="Full (three-Level) table name" 20 | ) 21 | # Metric to be used for threshold violation check 22 | dbutils.widgets.text( 23 | "metric_to_monitor", "root_mean_squared_error", label="Metric to be monitored for threshold violation" 24 | ) 25 | 26 | # Threshold value to be checked 27 | dbutils.widgets.text( 28 | "metric_violation_threshold", "100", label="Threshold value for metric violation" 29 | ) 30 | 31 | # Threshold value to be checked 32 | dbutils.widgets.text( 33 | "num_evaluation_windows", "5", label="Number of windows to check for violation" 34 | ) 35 | 36 | # Threshold value to be checked 37 | dbutils.widgets.text( 38 | "num_violation_windows", "2", label="Number of windows that need to violate the threshold" 39 | ) 40 | 41 | # COMMAND ---------- 42 | 43 | import os 44 | import sys 45 | notebook_path = '/Workspace/' + os.path.dirname(dbutils.notebook.entry_point.getDbutils().notebook().getContext().notebookPath().get()) 46 | %cd $notebook_path 47 | %cd .. 
48 | sys.path.append("../..") 49 | 50 | # COMMAND ---------- 51 | 52 | from metric_violation_check_query import sql_query 53 | 54 | table_name_under_monitor = dbutils.widgets.get("table_name_under_monitor") 55 | metric_to_monitor = dbutils.widgets.get("metric_to_monitor") 56 | metric_violation_threshold = dbutils.widgets.get("metric_violation_threshold") 57 | num_evaluation_windows = dbutils.widgets.get("num_evaluation_windows") 58 | num_violation_windows = dbutils.widgets.get("num_violation_windows") 59 | 60 | formatted_sql_query = sql_query.format( 61 | table_name_under_monitor=table_name_under_monitor, 62 | metric_to_monitor=metric_to_monitor, 63 | metric_violation_threshold=metric_violation_threshold, 64 | num_evaluation_windows=num_evaluation_windows, 65 | num_violation_windows=num_violation_windows) 66 | is_metric_violated = bool(spark.sql(formatted_sql_query).toPandas()["query_result"][0]) 67 | 68 | dbutils.jobs.taskValues.set("is_metric_violated", is_metric_violated) 69 | 70 | 71 | -------------------------------------------------------------------------------- /mlops-statcks-multiphase/monitoring/metric_violation_check_query.py: -------------------------------------------------------------------------------- 1 | # This file is used for the main SQL query that checks the last {num_evaluation_windows} metric violations and whether at least {num_violation_windows} of those runs violate the condition. 2 | 3 | import sys 4 | import pathlib 5 | 6 | sys.path.append(str(pathlib.Path(__file__).parent.parent.parent.resolve())) 7 | 8 | """The SQL query is divided into three main parts. The first part selects the top {num_evaluation_windows} 9 | values of the metric to be monitored, ordered by the time window, and saves as recent_metrics. 10 | ```sql 11 | WITH recent_metrics AS ( 12 | SELECT 13 | {metric_to_monitor}, 14 | window 15 | FROM 16 | {table_name_under_monitor}_profile_metrics 17 | WHERE 18 | column_name = ":table" 19 | AND slice_key IS NULL 20 | AND model_id != "*" 21 | AND log_type = "INPUT" 22 | ORDER BY 23 | window DESC 24 | LIMIT 25 | {num_evaluation_windows} 26 | ) 27 | ``` 28 | The `column_name = ":table"` and `slice_key IS NULL` conditions ensure that the metric 29 | is selected for the entire table within the given granularity. The `log_type = "INPUT"` 30 | condition ensures that the primary table metrics are considered, but not the baseline 31 | table metrics. The `model_id!= "*"` condition ensures that the metric aggregated across 32 | all model IDs is not selected. 33 | 34 | The second part of the query determines if the metric values have been violated with two cases. 35 | The first case checks if the metric value is greater than the threshold for at least {num_violation_windows} windows: 36 | ```sql 37 | (SELECT COUNT(*) FROM recent_metrics WHERE {metric_to_monitor} > {metric_violation_threshold}) >= {num_violation_windows} 38 | ``` 39 | The second case checks if the most recent metric value is greater than the threshold. 
This is to make sure we only trigger retraining 40 | if the most recent window was violated, avoiding unnecessary retraining if the violation was in the past and the metric is now within the threshold: 41 | ```sql 42 | (SELECT {metric_to_monitor} FROM recent_metrics ORDER BY window DESC LIMIT 1) > {metric_violation_threshold} 43 | ``` 44 | 45 | The final part of the query sets the `query_result` to 1 if both of the above conditions are met, and 0 otherwise: 46 | ```sql 47 | SELECT 48 | CASE 49 | WHEN 50 | # Check if the metric value is greater than the threshold for at least {num_violation_windows} windows 51 | AND 52 | # Check if the most recent metric value is greater than the threshold 53 | THEN 1 54 | ELSE 0 55 | END AS query_result 56 | ``` 57 | """ 58 | 59 | sql_query = """WITH recent_metrics AS ( 60 | SELECT 61 | {metric_to_monitor}, 62 | window 63 | FROM 64 | {table_name_under_monitor}_profile_metrics 65 | WHERE 66 | column_name = ":table" 67 | AND slice_key IS NULL 68 | AND model_id != "*" 69 | AND log_type = "INPUT" 70 | ORDER BY 71 | window DESC 72 | LIMIT 73 | {num_evaluation_windows} 74 | ) 75 | SELECT 76 | CASE 77 | WHEN 78 | (SELECT COUNT(*) FROM recent_metrics WHERE {metric_to_monitor} > {metric_violation_threshold}) >= {num_violation_windows} 79 | AND 80 | (SELECT {metric_to_monitor} FROM recent_metrics ORDER BY window DESC LIMIT 1) > {metric_violation_threshold} 81 | THEN 1 82 | ELSE 0 83 | END AS query_result 84 | """ 85 | -------------------------------------------------------------------------------- /mlops-statcks-multiphase/resources/feature-engineering-workflow-resource.yml: -------------------------------------------------------------------------------- 1 | new_cluster: &new_cluster 2 | new_cluster: 3 | num_workers: 3 4 | spark_version: 15.3.x-cpu-ml-scala2.12 5 | node_type_id: Standard_D3_v2 6 | custom_tags: 7 | clusterSource: mlops-stacks_0.4 8 | 9 | common_permissions: &permissions 10 | permissions: 11 | - level: CAN_VIEW 12 | group_name: users 13 | 14 | write_feature_table_job: &write_feature_table_job 15 | write_feature_table_job: 16 | name: ${var.current_target}-mlops-write-feature-table-job 17 | job_clusters: 18 | - job_cluster_key: write_feature_table_job_cluster 19 | <<: *new_cluster 20 | tasks: 21 | - task_key: PickupFeatures 22 | job_cluster_key: write_feature_table_job_cluster 23 | notebook_task: 24 | notebook_path: ../feature_engineering/notebooks/GenerateAndWriteFeatures.py 25 | base_parameters: 26 | # TODO modify these arguments to reflect your setup. 27 | input_table_path: /databricks-datasets/nyctaxi-with-zipcodes/subsampled 28 | # TODO: Empty start/end dates will process the whole range. Update this as needed to process recent data. 29 | input_start_date: "" 30 | input_end_date: "" 31 | timestamp_column: tpep_pickup_datetime 32 | output_table_name: ${var.current_target}.mlops.trip_pickup_features 33 | features_transform_module: pickup_features 34 | primary_keys: zip 35 | # git source information of current ML resource deployment. It will be persisted as part of the workflow run 36 | git_source_info: url:${bundle.git.origin_url}; branch:${bundle.git.branch}; commit:${bundle.git.commit} 37 | - task_key: DropoffFeatures 38 | job_cluster_key: write_feature_table_job_cluster 39 | notebook_task: 40 | notebook_path: ../feature_engineering/notebooks/GenerateAndWriteFeatures.py 41 | base_parameters: 42 | # TODO: modify these arguments to reflect your setup. 
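            # (This DropoffFeatures task mirrors the PickupFeatures task above: same input
            # dataset and job cluster, but it uses the dropoff timestamp column, the
            # dropoff_features transform module, and writes to the trip_dropoff_features table.)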
43 | input_table_path: /databricks-datasets/nyctaxi-with-zipcodes/subsampled 44 | # TODO: Empty start/end dates will process the whole range. Update this as needed to process recent data. 45 | input_start_date: "" 46 | input_end_date: "" 47 | timestamp_column: tpep_dropoff_datetime 48 | output_table_name: ${var.current_target}.mlops.trip_dropoff_features 49 | features_transform_module: dropoff_features 50 | primary_keys: zip 51 | # git source information of current ML resource deployment. It will be persisted as part of the workflow run 52 | git_source_info: url:${bundle.git.origin_url}; branch:${bundle.git.branch}; commit:${bundle.git.commit} 53 | schedule: 54 | quartz_cron_expression: "0 0 7 * * ?" # daily at 7am 55 | timezone_id: UTC 56 | <<: *permissions 57 | # If you want to turn on notifications for this job, please uncomment the below code, 58 | # and provide a list of emails to the on_failure argument. 59 | # 60 | # email_notifications: 61 | # on_failure: 62 | # - first@company.com 63 | # - second@company.com 64 | 65 | targets: 66 | dev-phase1: 67 | resources: 68 | jobs: 69 | <<: *write_feature_table_job 70 | 71 | test-phase1: 72 | resources: 73 | jobs: 74 | <<: *write_feature_table_job 75 | 76 | prod-phase1: 77 | resources: 78 | jobs: 79 | <<: *write_feature_table_job 80 | -------------------------------------------------------------------------------- /mlops-statcks-multiphase/deployment/batch_inference/notebooks/BatchInference.py: -------------------------------------------------------------------------------- 1 | # Databricks notebook source 2 | ################################################################################## 3 | # Batch Inference Notebook 4 | # 5 | # This notebook is an example of applying a model for batch inference against an input delta table, 6 | # It is configured and can be executed as the batch_inference_job in the batch_inference_job workflow defined under 7 | # ``mlops/resources/batch-inference-workflow-resource.yml`` 8 | # 9 | # Parameters: 10 | # 11 | # * env (optional) - String name of the current environment (dev, staging, or prod). Defaults to "dev" 12 | # * input_table_name (required) - Delta table name containing your input data. 13 | # * output_table_name (required) - Delta table name where the predictions will be written to. 14 | # Note that this will create a new version of the Delta table if 15 | # the table already exists 16 | # * model_name (required) - The name of the model to be used in batch inference. 17 | ################################################################################## 18 | 19 | 20 | # List of input args needed to run the notebook as a job. 21 | # Provide them via DB widgets or notebook arguments. 22 | # 23 | # Name of the current environment 24 | dbutils.widgets.dropdown("env", "dev", ["dev", "staging", "prod"], "Environment Name") 25 | # A Hive-registered Delta table containing the input features. 26 | dbutils.widgets.text("input_table_name", "", label="Input Table Name") 27 | # Delta table to store the output predictions. 28 | dbutils.widgets.text("output_table_name", "", label="Output Table Name") 29 | # Unity Catalog registered model name to use for the trained mode. 
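# (The model name is resolved further below via the "champion" alias, i.e. a model URI of
# the form models:/<catalog>.<schema>.<model>@champion, so each batch inference run picks
# up whichever model version currently carries that alias.)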
30 | dbutils.widgets.text( 31 | "model_name", "dev.mlops.mlops-model", label="Full (Three-Level) Model Name" 32 | ) 33 | 34 | # COMMAND ---------- 35 | 36 | import os 37 | 38 | notebook_path = '/Workspace/' + os.path.dirname(dbutils.notebook.entry_point.getDbutils().notebook().getContext().notebookPath().get()) 39 | %cd $notebook_path 40 | 41 | # COMMAND ---------- 42 | 43 | # MAGIC %pip install -r ../../../requirements.txt 44 | 45 | # COMMAND ---------- 46 | 47 | dbutils.library.restartPython() 48 | 49 | # COMMAND ---------- 50 | 51 | import sys 52 | import os 53 | notebook_path = '/Workspace/' + os.path.dirname(dbutils.notebook.entry_point.getDbutils().notebook().getContext().notebookPath().get()) 54 | %cd $notebook_path 55 | %cd .. 56 | sys.path.append("../..") 57 | 58 | # COMMAND ---------- 59 | 60 | # DBTITLE 1,Define input and output variables 61 | 62 | env = dbutils.widgets.get("env") 63 | input_table_name = dbutils.widgets.get("input_table_name") 64 | output_table_name = dbutils.widgets.get("output_table_name") 65 | model_name = dbutils.widgets.get("model_name") 66 | assert input_table_name != "", "input_table_name notebook parameter must be specified" 67 | assert output_table_name != "", "output_table_name notebook parameter must be specified" 68 | assert model_name != "", "model_name notebook parameter must be specified" 69 | alias = "champion" 70 | model_uri = f"models:/{model_name}@{alias}" 71 | 72 | # COMMAND ---------- 73 | 74 | from mlflow import MlflowClient 75 | 76 | # Get model version from alias 77 | client = MlflowClient(registry_uri="databricks-uc") 78 | model_version = client.get_model_version_by_alias(model_name, alias).version 79 | 80 | # COMMAND ---------- 81 | 82 | # Get datetime 83 | from datetime import datetime 84 | 85 | ts = datetime.now().strftime("%Y-%m-%d %H:%M:%S") 86 | 87 | # COMMAND ---------- 88 | # DBTITLE 1,Load model and run inference 89 | 90 | from predict import predict_batch 91 | 92 | predict_batch(spark, model_uri, input_table_name, output_table_name, model_version, ts) 93 | dbutils.notebook.exit(output_table_name) 94 | -------------------------------------------------------------------------------- /mlops-statcks-multiphase/tmp/phase2.yml: -------------------------------------------------------------------------------- 1 | # Please complete all the TODOs in this file. 2 | # The regression monitor defined here works OOB with this example regression notebook: https://learn.microsoft.com/azure/databricks/_extras/notebooks/source/monitoring/regression-monitor 3 | # NOTE: Monitoring only works on Unity Catalog tables. 
4 | 5 | new_cluster: &new_cluster 6 | new_cluster: 7 | num_workers: 3 8 | spark_version: 15.3.x-cpu-ml-scala2.12 9 | node_type_id: Standard_D3_v2 10 | custom_tags: 11 | clusterSource: mlops-stacks_0.4 12 | 13 | common_permissions: &permissions 14 | permissions: 15 | - level: CAN_VIEW 16 | group_name: users 17 | 18 | quality_monitor: &quality_monitor 19 | mlops_quality_monitor: 20 | table_name: dev.mlops.predictions 21 | # TODO: Update the output schema name as per your requirements 22 | output_schema_name: ${var.current_target}.mlops 23 | # TODO: Update the below parameters as per your requirements 24 | assets_dir: /Users/${workspace.current_user.userName}/databricks_lakehouse_monitoring 25 | inference_log: 26 | granularities: [1 day] 27 | model_id_col: model_id 28 | prediction_col: prediction 29 | label_col: price 30 | problem_type: PROBLEM_TYPE_REGRESSION 31 | timestamp_col: timestamp 32 | schedule: 33 | quartz_cron_expression: 0 0 8 * * ? # Run Every day at 8am 34 | timezone_id: UTC 35 | 36 | retraining_job: &retraining_job 37 | retraining_job: 38 | name: ${var.current_target}-mlops-monitoring-retraining-job 39 | tasks: 40 | - task_key: monitored_metric_violation_check 41 | <<: *new_cluster 42 | notebook_task: 43 | notebook_path: ../monitoring/notebooks/MonitoredMetricViolationCheck.py 44 | base_parameters: 45 | env: ${var.current_target} 46 | table_name_under_monitor: dev.mlops.predictions 47 | # TODO: Update the metric to be monitored and violation threshold 48 | metric_to_monitor: root_mean_squared_error 49 | metric_violation_threshold: 100 50 | num_evaluation_windows: 5 51 | num_violation_windows: 2 52 | 53 | - task_key: is_metric_violated 54 | depends_on: 55 | - task_key: monitored_metric_violation_check 56 | condition_task: 57 | op: EQUAL_TO 58 | left: "{{tasks.monitored_metric_violation_check.values.is_metric_violated}}" 59 | right: "true" 60 | 61 | - task_key: trigger_retraining 62 | depends_on: 63 | - task_key: is_metric_violated 64 | outcome: "true" 65 | run_job_task: 66 | job_id: ${var.model_training_job_id} 67 | 68 | schedule: 69 | quartz_cron_expression: "0 0 18 * * ?" 
# daily at 6pm 70 | timezone_id: UTC 71 | <<: *permissions 72 | 73 | targets: 74 | dev-phase2: 75 | variables: 76 | model_training_job_id: 77 | lookup: 78 | job: "${var.current_target}-mlops-model-training-job" 79 | resources: 80 | quality_monitors: 81 | <<: *quality_monitor 82 | jobs: 83 | <<: *retraining_job 84 | 85 | test-phase2: 86 | variables: 87 | model_training_job_id: 88 | lookup: 89 | job: "${var.current_target}-mlops-model-training-job" 90 | resources: 91 | quality_monitors: 92 | <<: *quality_monitor 93 | jobs: 94 | <<: *retraining_job 95 | 96 | prod-phase2: 97 | variables: 98 | model_training_job_id: 99 | lookup: 100 | job: "${var.current_target}-mlops-model-training-job" 101 | resources: 102 | quality_monitors: 103 | <<: *quality_monitor 104 | jobs: 105 | <<: *retraining_job 106 | -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | # Byte-compiled / optimized / DLL files 2 | __pycache__/ 3 | *.py[cod] 4 | *$py.class 5 | 6 | # C extensions 7 | *.so 8 | 9 | # Distribution / packaging 10 | .Python 11 | build/ 12 | develop-eggs/ 13 | dist/ 14 | downloads/ 15 | eggs/ 16 | .eggs/ 17 | lib/ 18 | lib64/ 19 | parts/ 20 | sdist/ 21 | var/ 22 | wheels/ 23 | share/python-wheels/ 24 | *.egg-info/ 25 | .installed.cfg 26 | *.egg 27 | MANIFEST 28 | 29 | # PyInstaller 30 | # Usually these files are written by a python script from a template 31 | # before PyInstaller builds the exe, so as to inject date/other infos into it. 32 | *.manifest 33 | *.spec 34 | 35 | # Installer logs 36 | pip-log.txt 37 | pip-delete-this-directory.txt 38 | 39 | # Unit test / coverage reports 40 | htmlcov/ 41 | .tox/ 42 | .nox/ 43 | .coverage 44 | .coverage.* 45 | .cache 46 | nosetests.xml 47 | coverage.xml 48 | *.cover 49 | *.py,cover 50 | .hypothesis/ 51 | .pytest_cache/ 52 | cover/ 53 | 54 | # Translations 55 | *.mo 56 | *.pot 57 | 58 | # Django stuff: 59 | *.log 60 | local_settings.py 61 | db.sqlite3 62 | db.sqlite3-journal 63 | 64 | # Flask stuff: 65 | instance/ 66 | .webassets-cache 67 | 68 | # Scrapy stuff: 69 | .scrapy 70 | 71 | # Sphinx documentation 72 | docs/_build/ 73 | 74 | # PyBuilder 75 | .pybuilder/ 76 | target/ 77 | 78 | # Jupyter Notebook 79 | .ipynb_checkpoints 80 | 81 | # IPython 82 | profile_default/ 83 | ipython_config.py 84 | 85 | # pyenv 86 | # For a library or package, you might want to ignore these files since the code is 87 | # intended to run in multiple environments; otherwise, check them in: 88 | # .python-version 89 | 90 | # pipenv 91 | # According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control. 92 | # However, in case of collaboration, if having platform-specific dependencies or dependencies 93 | # having no cross-platform support, pipenv may install dependencies that don't work, or not 94 | # install all needed dependencies. 95 | #Pipfile.lock 96 | 97 | # poetry 98 | # Similar to Pipfile.lock, it is generally recommended to include poetry.lock in version control. 99 | # This is especially recommended for binary packages to ensure reproducibility, and is more 100 | # commonly ignored for libraries. 101 | # https://python-poetry.org/docs/basic-usage/#commit-your-poetrylock-file-to-version-control 102 | #poetry.lock 103 | 104 | # pdm 105 | # Similar to Pipfile.lock, it is generally recommended to include pdm.lock in version control. 
106 | #pdm.lock 107 | # pdm stores project-wide configurations in .pdm.toml, but it is recommended to not include it 108 | # in version control. 109 | # https://pdm.fming.dev/latest/usage/project/#working-with-version-control 110 | .pdm.toml 111 | .pdm-python 112 | .pdm-build/ 113 | 114 | # PEP 582; used by e.g. github.com/David-OConnor/pyflow and github.com/pdm-project/pdm 115 | __pypackages__/ 116 | 117 | # Celery stuff 118 | celerybeat-schedule 119 | celerybeat.pid 120 | 121 | # SageMath parsed files 122 | *.sage.py 123 | 124 | # Environments 125 | .env 126 | .venv 127 | env/ 128 | venv/ 129 | ENV/ 130 | env.bak/ 131 | venv.bak/ 132 | 133 | # Spyder project settings 134 | .spyderproject 135 | .spyproject 136 | 137 | # Rope project settings 138 | .ropeproject 139 | 140 | # mkdocs documentation 141 | /site 142 | 143 | # mypy 144 | .mypy_cache/ 145 | .dmypy.json 146 | dmypy.json 147 | 148 | # Pyre type checker 149 | .pyre/ 150 | 151 | # pytype static type analyzer 152 | .pytype/ 153 | 154 | # Cython debug symbols 155 | cython_debug/ 156 | 157 | # PyCharm 158 | # JetBrains specific template is maintained in a separate JetBrains.gitignore that can 159 | # be found at https://github.com/github/gitignore/blob/main/Global/JetBrains.gitignore 160 | # and can be added to the global gitignore or merged into this file. For a more nuclear 161 | # option (not recommended) you can uncomment the following to ignore the entire idea folder. 162 | #.idea/ 163 | -------------------------------------------------------------------------------- /mlops-statcks-multiphase/resources/monitoring-resource.yml: -------------------------------------------------------------------------------- 1 | # Please complete all the TODOs in this file. 2 | # The regression monitor defined here works OOB with this example regression notebook: https://learn.microsoft.com/azure/databricks/_extras/notebooks/source/monitoring/regression-monitor 3 | # NOTE: Monitoring only works on Unity Catalog tables. 4 | 5 | variables: 6 | model_training_job_id: 7 | description: "ID of model training job" 8 | 9 | new_cluster: &new_cluster 10 | new_cluster: 11 | num_workers: 3 12 | spark_version: 15.3.x-cpu-ml-scala2.12 13 | node_type_id: Standard_D3_v2 14 | custom_tags: 15 | clusterSource: mlops-stacks_0.4 16 | 17 | common_permissions: &permissions 18 | permissions: 19 | - level: CAN_VIEW 20 | group_name: users 21 | 22 | quality_monitor: &quality_monitor 23 | mlops_quality_monitor: 24 | table_name: dev.mlops.predictions 25 | # TODO: Update the output schema name as per your requirements 26 | output_schema_name: ${var.current_target}.mlops 27 | # TODO: Update the below parameters as per your requirements 28 | assets_dir: /Users/${workspace.current_user.userName}/databricks_lakehouse_monitoring 29 | inference_log: 30 | granularities: [1 day] 31 | model_id_col: model_id 32 | prediction_col: prediction 33 | label_col: price 34 | problem_type: PROBLEM_TYPE_REGRESSION 35 | timestamp_col: timestamp 36 | schedule: 37 | quartz_cron_expression: 0 0 8 * * ? 
# Run Every day at 8am 38 | timezone_id: UTC 39 | 40 | retraining_job: &retraining_job 41 | retraining_job: 42 | name: ${var.current_target}-mlops-monitoring-retraining-job 43 | tasks: 44 | - task_key: monitored_metric_violation_check 45 | <<: *new_cluster 46 | notebook_task: 47 | notebook_path: ../monitoring/notebooks/MonitoredMetricViolationCheck.py 48 | base_parameters: 49 | env: ${var.current_target} 50 | table_name_under_monitor: ${var.current_target}.mlops.predictions 51 | # TODO: Update the metric to be monitored and violation threshold 52 | metric_to_monitor: root_mean_squared_error 53 | metric_violation_threshold: 100 54 | num_evaluation_windows: 5 55 | num_violation_windows: 2 56 | 57 | - task_key: is_metric_violated 58 | depends_on: 59 | - task_key: monitored_metric_violation_check 60 | condition_task: 61 | op: EQUAL_TO 62 | left: "{{tasks.monitored_metric_violation_check.values.is_metric_violated}}" 63 | right: "true" 64 | 65 | - task_key: trigger_retraining 66 | depends_on: 67 | - task_key: is_metric_violated 68 | outcome: "true" 69 | run_job_task: 70 | job_id: ${var.model_training_job_id} 71 | 72 | schedule: 73 | quartz_cron_expression: "0 0 18 * * ?" # daily at 6pm 74 | timezone_id: UTC 75 | <<: *permissions 76 | 77 | targets: 78 | dev-phase2: 79 | variables: 80 | model_training_job_id: 81 | lookup: 82 | job: "${var.current_target}-mlops-model-training-job" 83 | resources: 84 | quality_monitors: 85 | <<: *quality_monitor 86 | jobs: 87 | <<: *retraining_job 88 | 89 | test-phase2: 90 | variables: 91 | model_training_job_id: 92 | lookup: 93 | job: "${var.current_target}-mlops-model-training-job" 94 | resources: 95 | quality_monitors: 96 | <<: *quality_monitor 97 | jobs: 98 | <<: *retraining_job 99 | 100 | prod-phase2: 101 | variables: 102 | model_training_job_id: 103 | lookup: 104 | job: "${var.current_target}-mlops-model-training-job" 105 | resources: 106 | quality_monitors: 107 | <<: *quality_monitor 108 | jobs: 109 | <<: *retraining_job 110 | -------------------------------------------------------------------------------- /jdemo/azure-pipelines.yml: -------------------------------------------------------------------------------- 1 | # Grab variables from the specific variable group and 2 | # determine sourceBranchName (avoids SourchBranchName=merge for PR) 3 | variables: 4 | - group: 'DABs Testing' 5 | - name: 'branchName' 6 | ${{ if startsWith(variables['Build.SourceBranch'], 'refs/heads/') }}: 7 | value: $[ replace(variables['Build.SourceBranch'], 'refs/heads/', '') ] 8 | ${{ if startsWith(variables['Build.SourceBranch'], 'refs/pull/') }}: 9 | value: $[ replace(variables['System.PullRequest.SourceBranch'], 'refs/heads/', '') ] 10 | 11 | trigger: 12 | batch: true 13 | branches: 14 | include: 15 | - '*' 16 | paths: 17 | exclude: 18 | - README.md 19 | - LICENSE 20 | - images 21 | - terraform 22 | - .github 23 | - .vscode 24 | - TODOs.org 25 | 26 | stages: 27 | - stage: onPush 28 | condition: | 29 | and( 30 | ne(variables['Build.SourceBranch'], 'refs/heads/releases'), 31 | not(startsWith(variables['Build.SourceBranch'], 'refs/tags/v')) 32 | ) 33 | jobs: 34 | - job: onPushJob 35 | pool: 36 | vmImage: 'ubuntu-latest' 37 | 38 | steps: 39 | - task: UsePythonVersion@0 40 | displayName: 'Use Python 3.11' 41 | inputs: 42 | versionSpec: 3.11 43 | 44 | - checkout: self 45 | displayName: 'Checkout & Build.Reason: $(Build.Reason) & Build.SourceBranchName: $(Build.SourceBranchName)' 46 | 47 | - script: | 48 | eval "$(/home/linuxbrew/.linuxbrew/bin/brew shellenv)" 49 | brew tap 
databricks/tap 50 | brew install databricks 51 | databricks -v 52 | displayName: 'Install Databricks CLI' 53 | env: 54 | HOMEBREW_NO_ENV_HINTS: 1 55 | HOMEBREW_NO_INSTALL_CLEANUP: 1 56 | 57 | - script: | 58 | pip install -U -r requirements-dev.txt 59 | displayName: 'Install dependencies' 60 | 61 | - script: | 62 | pytest tests --junit-xml=test-local.xml 63 | displayName: 'Execute local tests' 64 | 65 | - script: | 66 | eval "$(/home/linuxbrew/.linuxbrew/bin/brew shellenv)" 67 | databricks bundle deploy -t staging --var="uniq_id=$(branchName)" 68 | env: 69 | DATABRICKS_HOST: $(DATABRICKS_HOST) 70 | DATABRICKS_TOKEN: $(DATABRICKS_TOKEN) 71 | displayName: 'Deploy to staging' 72 | 73 | - script: | 74 | eval "$(/home/linuxbrew/.linuxbrew/bin/brew shellenv)" 75 | # We can pass parameters (--jar-params, --python-params, --notebook-params) here to point to another data location, etc. 76 | databricks bundle run jdemo_job -t staging --var="uniq_id=$(branchName)" 77 | env: 78 | DATABRICKS_HOST: $(DATABRICKS_HOST) 79 | DATABRICKS_TOKEN: $(DATABRICKS_TOKEN) 80 | displayName: 'Run in staging' 81 | 82 | - script: | 83 | eval "$(/home/linuxbrew/.linuxbrew/bin/brew shellenv)" 84 | echo "Optionally destroy the bundle" 85 | # databricks bundle destroy --auto-approve -t staging --var="uniq_id=$(branchName)" 86 | env: 87 | DATABRICKS_HOST: $(DATABRICKS_HOST) 88 | DATABRICKS_TOKEN: $(DATABRICKS_TOKEN) 89 | displayName: 'Destroy in staging on succcess' 90 | 91 | - task: PublishTestResults@2 92 | condition: succeededOrFailed() 93 | inputs: 94 | testResultsFormat: 'JUnit' 95 | testResultsFiles: '**/test-*.xml' 96 | failTaskOnFailedTests: true 97 | 98 | # Separate pipeline for releases branch 99 | # Right now it's similar to the onPush stage, but runs only local tests and then deploy to the prod. 
100 | - stage: onRelease 101 | condition: | 102 | eq(variables['Build.SourceBranch'], 'refs/heads/releases') 103 | jobs: 104 | - job: onReleaseJob 105 | pool: 106 | vmImage: 'ubuntu-latest' 107 | 108 | steps: 109 | - task: UsePythonVersion@0 110 | displayName: 'Use Python 3.11' 111 | inputs: 112 | versionSpec: 3.11 113 | 114 | - checkout: self 115 | displayName: 'Checkout & Build.Reason: $(Build.Reason) & Build.SourceBranchName: $(Build.SourceBranchName)' 116 | 117 | - script: | 118 | eval "$(/home/linuxbrew/.linuxbrew/bin/brew shellenv)" 119 | brew tap databricks/tap 120 | brew install databricks 121 | databricks -v 122 | displayName: 'Install Databricks CLI' 123 | env: 124 | HOMEBREW_NO_ENV_HINTS: 1 125 | HOMEBREW_NO_INSTALL_CLEANUP: 1 126 | 127 | - script: | 128 | pip install -U -r requirements-dev.txt 129 | displayName: 'Install dependencies' 130 | 131 | - script: | 132 | pytest tests --junit-xml=test-local.xml 133 | displayName: 'Execute local tests' 134 | 135 | - script: | 136 | eval "$(/home/linuxbrew/.linuxbrew/bin/brew shellenv)" 137 | databricks bundle deploy -t prod 138 | env: 139 | DATABRICKS_HOST: $(DATABRICKS_HOST) 140 | DATABRICKS_TOKEN: $(DATABRICKS_TOKEN) 141 | displayName: 'Deploy to production' 142 | 143 | - task: PublishTestResults@2 144 | condition: succeededOrFailed() 145 | inputs: 146 | testResultsFormat: 'JUnit' 147 | testResultsFiles: '**/test-*.xml' 148 | failTaskOnFailedTests: true 149 | -------------------------------------------------------------------------------- /mlops-statcks-multiphase/feature_engineering/notebooks/GenerateAndWriteFeatures.py: -------------------------------------------------------------------------------- 1 | # Databricks notebook source 2 | ################################################################################## 3 | # Generate and Write Features Notebook 4 | # 5 | # This notebook can be used to generate and write features to a Databricks Feature Store table. 6 | # It is configured and can be executed as the tasks in the write_feature_table_job workflow defined under 7 | # ``mlops/resources/feature-engineering-workflow-resource.yml`` 8 | # 9 | # Parameters: 10 | # 11 | # * input_table_path (required) - Path to input data. 12 | # * output_table_name (required) - Fully qualified schema + Delta table name for the feature table where the features 13 | # * will be written to. Note that this will create the Feature table if it does not 14 | # * exist. 15 | # * primary_keys (required) - A comma separated string of primary key columns of the output feature table. 16 | # * 17 | # * timestamp_column (optional) - Timestamp column of the input data. Used to limit processing based on 18 | # * date ranges. This column is used as the timestamp_key column in the feature table. 19 | # * input_start_date (optional) - Used to limit feature computations based on timestamp_column values. 20 | # * input_end_date (optional) - Used to limit feature computations based on timestamp_column values. 21 | # * 22 | # * features_transform_module (required) - Python module containing the feature transform logic. 23 | ################################################################################## 24 | 25 | 26 | # List of input args needed to run this notebook as a job. 27 | # Provide them via DB widgets or notebook arguments. 28 | # 29 | # A Hive-registered Delta table containing the input data. 
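# (Note: despite the "Input Table Name" label, this value is consumed as a Delta path:
# the notebook loads it below via spark.read.format("delta").load(input_table_path)
# rather than by registered table name.)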
30 | dbutils.widgets.text( 31 | "input_table_path", 32 | "/databricks-datasets/nyctaxi-with-zipcodes/subsampled", 33 | label="Input Table Name", 34 | ) 35 | # Input start date. 36 | dbutils.widgets.text("input_start_date", "", label="Input Start Date") 37 | # Input end date. 38 | dbutils.widgets.text("input_end_date", "", label="Input End Date") 39 | # Timestamp column. Will be used to filter input start/end dates. 40 | # This column is also used as a timestamp key of the feature table. 41 | dbutils.widgets.text( 42 | "timestamp_column", "tpep_pickup_datetime", label="Timestamp column" 43 | ) 44 | 45 | # Feature table to store the computed features. 46 | dbutils.widgets.text( 47 | "output_table_name", 48 | "dev.mlops.trip_pickup_features", 49 | label="Output Feature Table Name", 50 | ) 51 | 52 | # Feature transform module name. 53 | dbutils.widgets.text( 54 | "features_transform_module", "pickup_features", label="Features transform file." 55 | ) 56 | # Primary Keys columns for the feature table; 57 | dbutils.widgets.text( 58 | "primary_keys", 59 | "zip", 60 | label="Primary keys columns for the feature table, comma separated.", 61 | ) 62 | 63 | # COMMAND ---------- 64 | 65 | import os 66 | notebook_path = '/Workspace/' + os.path.dirname(dbutils.notebook.entry_point.getDbutils().notebook().getContext().notebookPath().get()) 67 | %cd $notebook_path 68 | %cd ../features 69 | 70 | # COMMAND ---------- 71 | 72 | # DBTITLE 1,Define input and output variables 73 | input_table_path = dbutils.widgets.get("input_table_path") 74 | output_table_name = dbutils.widgets.get("output_table_name") 75 | input_start_date = dbutils.widgets.get("input_start_date") 76 | input_end_date = dbutils.widgets.get("input_end_date") 77 | ts_column = dbutils.widgets.get("timestamp_column") 78 | features_module = dbutils.widgets.get("features_transform_module") 79 | pk_columns = dbutils.widgets.get("primary_keys") 80 | 81 | assert input_table_path != "", "input_table_path notebook parameter must be specified" 82 | assert output_table_name != "", "output_table_name notebook parameter must be specified" 83 | 84 | # Extract database name. Needs to be updated for Unity Catalog to the Schema name. 85 | output_database = output_table_name.split(".")[1] 86 | 87 | # COMMAND ---------- 88 | 89 | # DBTITLE 1,Create database. 90 | spark.sql("CREATE DATABASE IF NOT EXISTS " + output_database) 91 | 92 | # COMMAND ---------- 93 | 94 | # DBTITLE 1, Read input data. 95 | raw_data = spark.read.format("delta").load(input_table_path) 96 | 97 | # COMMAND ---------- 98 | 99 | # DBTITLE 1,Compute features. 100 | # Compute the features. This is done by dynamically loading the features module. 101 | from importlib import import_module 102 | 103 | mod = import_module(features_module) 104 | compute_features_fn = getattr(mod, "compute_features_fn") 105 | 106 | features_df = compute_features_fn( 107 | input_df=raw_data, 108 | timestamp_column=ts_column, 109 | start_date=input_start_date, 110 | end_date=input_end_date, 111 | ) 112 | 113 | # COMMAND ---------- 114 | 115 | # DBTITLE 1, Write computed features. 116 | from databricks.feature_engineering import FeatureEngineeringClient 117 | 118 | fe = FeatureEngineeringClient() 119 | 120 | # Create the feature table if it does not exist first. 121 | # Note that this is a no-op if a table with the same name and schema already exists. 
122 | fe.create_table( 123 | name=output_table_name, 124 | primary_keys=[x.strip() for x in pk_columns.split(",")] + [ts_column], # Include timeseries column in primary_keys 125 | timestamp_keys=[ts_column], 126 | df=features_df, 127 | ) 128 | 129 | # Write the computed features dataframe. 130 | fe.write_table( 131 | name=output_table_name, 132 | df=features_df, 133 | mode="merge", 134 | ) 135 | 136 | # COMMAND ---------- 137 | 138 | dbutils.notebook.exit(0) -------------------------------------------------------------------------------- /mlops-statcks-multiphase/resources/model-workflow-resource.yml: -------------------------------------------------------------------------------- 1 | new_cluster: &new_cluster 2 | new_cluster: 3 | num_workers: 3 4 | spark_version: 15.3.x-cpu-ml-scala2.12 5 | node_type_id: Standard_D3_v2 6 | custom_tags: 7 | clusterSource: mlops-stacks_0.4 8 | 9 | common_permissions: &permissions 10 | permissions: 11 | - level: CAN_VIEW 12 | group_name: users 13 | 14 | model_training_job: &model_training_job 15 | model_training_job: 16 | name: ${var.current_target}-mlops-model-training-job 17 | job_clusters: 18 | - job_cluster_key: model_training_job_cluster 19 | <<: *new_cluster 20 | tasks: 21 | - task_key: Train 22 | job_cluster_key: model_training_job_cluster 23 | notebook_task: 24 | notebook_path: ../training/notebooks/TrainWithFeatureStore.py 25 | base_parameters: 26 | env: ${bundle.target} 27 | # TODO: Update training_data_path 28 | training_data_path: /databricks-datasets/nyctaxi-with-zipcodes/subsampled 29 | experiment_name: ${var.experiment_name} 30 | model_name: ${bundle.target}.mlops.${var.model_name} 31 | pickup_features_table: ${bundle.target}.mlops.trip_pickup_features 32 | dropoff_features_table: ${bundle.target}.mlops.trip_dropoff_features 33 | # git source information of current ML resource deployment. It will be persisted as part of the workflow run 34 | git_source_info: url:${bundle.git.origin_url}; branch:${bundle.git.branch}; commit:${bundle.git.commit} 35 | - task_key: ModelValidation 36 | job_cluster_key: model_training_job_cluster 37 | depends_on: 38 | - task_key: Train 39 | notebook_task: 40 | notebook_path: ../validation/notebooks/ModelValidation.py 41 | base_parameters: 42 | experiment_name: ${var.experiment_name} 43 | # The `run_mode` defines whether model validation is enabled or not. 44 | # It can be one of the three values: 45 | # `disabled` : Do not run the model validation notebook. 46 | # `dry_run` : Run the model validation notebook. Ignore failed model validation rules and proceed to move 47 | # model to Production stage. 48 | # `enabled` : Run the model validation notebook. Move model to Production stage only if all model validation 49 | # rules are passing. 50 | # TODO: update run_mode 51 | run_mode: dry_run 52 | # Whether to load the current registered "Production" stage model as baseline. 53 | # Baseline model is a requirement for relative change and absolute change validation thresholds. 54 | # TODO: update enable_baseline_comparison 55 | enable_baseline_comparison: "false" 56 | # Please refer to data parameter in mlflow.evaluate documentation https://mlflow.org/docs/latest/python_api/mlflow.html#mlflow.evaluate 57 | # TODO: update validation_input 58 | validation_input: SELECT * FROM delta.`dbfs:/databricks-datasets/nyctaxi-with-zipcodes/subsampled` 59 | # A string describing the model type. The model type can be either "regressor" and "classifier". 
60 | # Please refer to model_type parameter in mlflow.evaluate documentation https://mlflow.org/docs/latest/python_api/mlflow.html#mlflow.evaluate 61 | # TODO: update model_type 62 | model_type: regressor 63 | # The string name of a column from data that contains evaluation labels. 64 | # Please refer to targets parameter in mlflow.evaluate documentation https://mlflow.org/docs/latest/python_api/mlflow.html#mlflow.evaluate 65 | # TODO: targets 66 | targets: fare_amount 67 | # Specifies the name of the function in mlops/training_validation_deployment/validation/validation.py that returns custom metrics. 68 | # TODO(optional): custom_metrics_loader_function 69 | custom_metrics_loader_function: custom_metrics 70 | # Specifies the name of the function in mlops/training_validation_deployment/validation/validation.py that returns model validation thresholds. 71 | # TODO(optional): validation_thresholds_loader_function 72 | validation_thresholds_loader_function: validation_thresholds 73 | # Specifies the name of the function in mlops/training_validation_deployment/validation/validation.py that returns evaluator_config. 74 | # TODO(optional): evaluator_config_loader_function 75 | evaluator_config_loader_function: evaluator_config 76 | # git source information of current ML resource deployment. It will be persisted as part of the workflow run 77 | git_source_info: url:${bundle.git.origin_url}; branch:${bundle.git.branch}; commit:${bundle.git.commit} 78 | - task_key: ModelDeployment 79 | job_cluster_key: model_training_job_cluster 80 | depends_on: 81 | - task_key: ModelValidation 82 | notebook_task: 83 | notebook_path: ../deployment/model_deployment/notebooks/ModelDeployment.py 84 | base_parameters: 85 | env: ${bundle.target} 86 | # git source information of current ML resource deployment. It will be persisted as part of the workflow run 87 | git_source_info: url:${bundle.git.origin_url}; branch:${bundle.git.branch}; commit:${bundle.git.commit} 88 | schedule: 89 | quartz_cron_expression: "0 0 9 * * ?" # daily at 9am 90 | timezone_id: UTC 91 | <<: *permissions 92 | # If you want to turn on notifications for this job, please uncomment the below code, 93 | # and provide a list of emails to the on_failure argument. 94 | # 95 | # email_notifications: 96 | # on_failure: 97 | # - first@company.com 98 | # - second@company.com 99 | 100 | targets: 101 | dev-phase1: 102 | resources: 103 | jobs: 104 | <<: *model_training_job 105 | 106 | test-phase1: 107 | resources: 108 | jobs: 109 | <<: *model_training_job 110 | 111 | prod-phase1: 112 | resources: 113 | jobs: 114 | <<: *model_training_job 115 | -------------------------------------------------------------------------------- /mlops-statcks-multiphase/training/notebooks/TrainWithFeatureStore.py: -------------------------------------------------------------------------------- 1 | # Databricks notebook source 2 | ################################################################################## 3 | # Model Training Notebook using Databricks Feature Store 4 | # 5 | # This notebook shows an example of a Model Training pipeline using Databricks Feature Store tables. 6 | # It is configured and can be executed as the "Train" task in the model_training_job workflow defined under 7 | # ``mlops/resources/model-workflow-resource.yml`` 8 | # 9 | # Parameters: 10 | # * env (required): - Environment the notebook is run in (staging, or prod). Defaults to "staging". 11 | # * training_data_path (required) - Path to the training data. 
12 | # * experiment_name (required) - MLflow experiment name for the training runs. Will be created if it doesn't exist. 13 | # * model_name (required) - Three-level name (..) to register the trained model in Unity Catalog. 14 | # 15 | ################################################################################## 16 | 17 | # COMMAND ---------- 18 | 19 | # MAGIC %load_ext autoreload 20 | # MAGIC %autoreload 2 21 | 22 | # COMMAND ---------- 23 | 24 | import os 25 | notebook_path = '/Workspace/' + os.path.dirname(dbutils.notebook.entry_point.getDbutils().notebook().getContext().notebookPath().get()) 26 | %cd $notebook_path 27 | 28 | # COMMAND ---------- 29 | 30 | # MAGIC %pip install -r ../../requirements.txt 31 | 32 | # COMMAND ---------- 33 | 34 | dbutils.library.restartPython() 35 | 36 | # COMMAND ---------- 37 | 38 | # DBTITLE 1, Notebook arguments 39 | # List of input args needed to run this notebook as a job. 40 | # Provide them via DB widgets or notebook arguments. 41 | 42 | # Notebook Environment 43 | dbutils.widgets.dropdown("env", "staging", ["staging", "prod"], "Environment Name") 44 | env = dbutils.widgets.get("env") 45 | 46 | # Path to the Hive-registered Delta table containing the training data. 47 | dbutils.widgets.text( 48 | "training_data_path", 49 | "/databricks-datasets/nyctaxi-with-zipcodes/subsampled", 50 | label="Path to the training data", 51 | ) 52 | 53 | # MLflow experiment name. 54 | dbutils.widgets.text( 55 | "experiment_name", 56 | f"/dev-mlops-experiment", 57 | label="MLflow experiment name", 58 | ) 59 | 60 | 61 | # Unity Catalog registered model name to use for the trained mode. 62 | dbutils.widgets.text( 63 | "model_name", "dev.mlops.mlops-model", label="Full (Three-Level) Model Name" 64 | ) 65 | 66 | # Pickup features table name 67 | dbutils.widgets.text( 68 | "pickup_features_table", 69 | "dev.mlops.trip_pickup_features", 70 | label="Pickup Features Table", 71 | ) 72 | 73 | # Dropoff features table name 74 | dbutils.widgets.text( 75 | "dropoff_features_table", 76 | "dev.mlops.trip_dropoff_features", 77 | label="Dropoff Features Table", 78 | ) 79 | 80 | # COMMAND ---------- 81 | 82 | # DBTITLE 1,Define input and output variables 83 | input_table_path = dbutils.widgets.get("training_data_path") 84 | experiment_name = dbutils.widgets.get("experiment_name") 85 | model_name = dbutils.widgets.get("model_name") 86 | 87 | # COMMAND ---------- 88 | 89 | # DBTITLE 1, Set experiment 90 | import mlflow 91 | 92 | mlflow.set_experiment(experiment_name) 93 | mlflow.set_registry_uri('databricks-uc') 94 | 95 | # COMMAND ---------- 96 | 97 | # DBTITLE 1, Load raw data 98 | raw_data = spark.read.format("delta").load(input_table_path) 99 | raw_data.display() 100 | 101 | # COMMAND ---------- 102 | 103 | # DBTITLE 1, Helper functions 104 | from datetime import timedelta, timezone 105 | import math 106 | import mlflow.pyfunc 107 | import pyspark.sql.functions as F 108 | from pyspark.sql.types import IntegerType 109 | 110 | 111 | def rounded_unix_timestamp(dt, num_minutes=15): 112 | """ 113 | Ceilings datetime dt to interval num_minutes, then returns the unix timestamp. 
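
    Example (illustrative): with num_minutes=15, a timestamp of 2016-01-01 10:07:00
    is ceiled to 10:15:00, so the function returns the unix timestamp of
    2016-01-01 10:15:00 UTC.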
114 | """ 115 | nsecs = dt.minute * 60 + dt.second + dt.microsecond * 1e-6 116 | delta = math.ceil(nsecs / (60 * num_minutes)) * (60 * num_minutes) - nsecs 117 | return int((dt + timedelta(seconds=delta)).replace(tzinfo=timezone.utc).timestamp()) 118 | 119 | 120 | rounded_unix_timestamp_udf = F.udf(rounded_unix_timestamp, IntegerType()) 121 | 122 | 123 | def rounded_taxi_data(taxi_data_df): 124 | # Round the taxi data timestamp to 15 and 30 minute intervals so we can join with the pickup and dropoff features 125 | # respectively. 126 | taxi_data_df = ( 127 | taxi_data_df.withColumn( 128 | "rounded_pickup_datetime", 129 | F.to_timestamp( 130 | rounded_unix_timestamp_udf( 131 | taxi_data_df["tpep_pickup_datetime"], F.lit(15) 132 | ) 133 | ), 134 | ) 135 | .withColumn( 136 | "rounded_dropoff_datetime", 137 | F.to_timestamp( 138 | rounded_unix_timestamp_udf( 139 | taxi_data_df["tpep_dropoff_datetime"], F.lit(30) 140 | ) 141 | ), 142 | ) 143 | .drop("tpep_pickup_datetime") 144 | .drop("tpep_dropoff_datetime") 145 | ) 146 | taxi_data_df.createOrReplaceTempView("taxi_data") 147 | return taxi_data_df 148 | 149 | 150 | def get_latest_model_version(model_name): 151 | latest_version = 1 152 | mlflow_client = MlflowClient() 153 | for mv in mlflow_client.search_model_versions(f"name='{model_name}'"): 154 | version_int = int(mv.version) 155 | if version_int > latest_version: 156 | latest_version = version_int 157 | return latest_version 158 | 159 | 160 | # COMMAND ---------- 161 | 162 | # DBTITLE 1, Read taxi data for training 163 | taxi_data = rounded_taxi_data(raw_data) 164 | taxi_data.display() 165 | 166 | # COMMAND ---------- 167 | 168 | # DBTITLE 1, Create FeatureLookups 169 | from databricks.feature_engineering import FeatureLookup 170 | import mlflow 171 | 172 | pickup_features_table = dbutils.widgets.get("pickup_features_table") 173 | dropoff_features_table = dbutils.widgets.get("dropoff_features_table") 174 | 175 | pickup_feature_lookups = [ 176 | FeatureLookup( 177 | table_name=pickup_features_table, 178 | feature_names=[ 179 | "mean_fare_window_1h_pickup_zip", 180 | "count_trips_window_1h_pickup_zip", 181 | ], 182 | lookup_key=["pickup_zip"], 183 | timestamp_lookup_key=["rounded_pickup_datetime"], 184 | ), 185 | ] 186 | 187 | dropoff_feature_lookups = [ 188 | FeatureLookup( 189 | table_name=dropoff_features_table, 190 | feature_names=["count_trips_window_30m_dropoff_zip", "dropoff_is_weekend"], 191 | lookup_key=["dropoff_zip"], 192 | timestamp_lookup_key=["rounded_dropoff_datetime"], 193 | ), 194 | ] 195 | 196 | # COMMAND ---------- 197 | 198 | # DBTITLE 1, Create Training Dataset 199 | 200 | from databricks.feature_engineering import FeatureEngineeringClient 201 | 202 | # End any existing runs (in the case this notebook is being run for a second time) 203 | mlflow.end_run() 204 | 205 | # Start an mlflow run, which is needed for the feature store to log the model 206 | mlflow.start_run() 207 | 208 | # Since the rounded timestamp columns would likely cause the model to overfit the data 209 | # unless additional feature engineering was performed, exclude them to avoid training on them. 
210 | exclude_columns = ["rounded_pickup_datetime", "rounded_dropoff_datetime"] 211 | 212 | fe = FeatureEngineeringClient() 213 | 214 | # Create the training set that includes the raw input data merged with corresponding features from both feature tables 215 | training_set = fe.create_training_set( 216 | df=taxi_data, # specify the df 217 | feature_lookups=pickup_feature_lookups + dropoff_feature_lookups, 218 | # both features need to be available; defined in GenerateAndWriteFeatures &/or feature-engineering-workflow-resource.yml 219 | label="fare_amount", 220 | exclude_columns=exclude_columns, 221 | ) 222 | 223 | 224 | # Load the TrainingSet into a dataframe which can be passed into sklearn for training a model 225 | training_df = training_set.load_df() 226 | 227 | # COMMAND ---------- 228 | 229 | # Display the training dataframe, and note that it contains both the raw input data and the features from the Feature Store, like `dropoff_is_weekend` 230 | training_df.display() 231 | 232 | # COMMAND ---------- 233 | 234 | # MAGIC %md 235 | # MAGIC Train a LightGBM model on the data returned by `TrainingSet.to_df`, then log the model with `FeatureStoreClient.log_model`. The model will be packaged with feature metadata. 236 | 237 | # COMMAND ---------- 238 | 239 | # DBTITLE 1, Train model 240 | import lightgbm as lgb 241 | from sklearn.model_selection import train_test_split 242 | import mlflow.lightgbm 243 | from mlflow.tracking import MlflowClient 244 | 245 | 246 | features_and_label = training_df.columns 247 | 248 | # Collect data into a Pandas array for training 249 | data = training_df.toPandas()[features_and_label] 250 | 251 | train, test = train_test_split(data, random_state=123) 252 | X_train = train.drop(["fare_amount"], axis=1) 253 | X_test = test.drop(["fare_amount"], axis=1) 254 | y_train = train.fare_amount 255 | y_test = test.fare_amount 256 | 257 | mlflow.lightgbm.autolog() 258 | train_lgb_dataset = lgb.Dataset(X_train, label=y_train.values) 259 | test_lgb_dataset = lgb.Dataset(X_test, label=y_test.values) 260 | 261 | param = {"num_leaves": 32, "objective": "regression", "metric": "rmse"} 262 | num_rounds = 100 263 | 264 | # Train a lightGBM model 265 | model = lgb.train(param, train_lgb_dataset, num_rounds) 266 | 267 | # COMMAND ---------- 268 | 269 | # DBTITLE 1, Log model and return output. 270 | # Log the trained model with MLflow and package it with feature lookup information. 271 | fe.log_model( 272 | model=model, #specify model 273 | artifact_path="model_packaged", 274 | flavor=mlflow.lightgbm, 275 | training_set=training_set, 276 | registered_model_name=model_name, 277 | ) 278 | 279 | 280 | # The returned model URI is needed by the model deployment notebook. 
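# Note: downstream tasks in the same workflow consume these task values -- e.g. the
# ModelValidation notebook reads them via dbutils.jobs.taskValues.get("Train", "model_uri", ...),
# so the keys set below must stay in sync with what those notebooks expect.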
281 | model_version = get_latest_model_version(model_name) 282 | model_uri = f"models:/{model_name}/{model_version}" 283 | dbutils.jobs.taskValues.set("model_uri", model_uri) 284 | dbutils.jobs.taskValues.set("model_name", model_name) 285 | dbutils.jobs.taskValues.set("model_version", model_version) 286 | dbutils.notebook.exit(model_uri) 287 | -------------------------------------------------------------------------------- /mlops-statcks-multiphase/tmp/phase1.yml: -------------------------------------------------------------------------------- 1 | new_cluster: &new_cluster 2 | new_cluster: 3 | num_workers: 3 4 | spark_version: 15.3.x-cpu-ml-scala2.12 5 | node_type_id: Standard_D3_v2 6 | custom_tags: 7 | clusterSource: mlops-stacks_0.4 8 | 9 | jobs_permissions: &jobs_permissions 10 | permissions: 11 | - level: CAN_VIEW 12 | group_name: users 13 | 14 | jobs: &jobs 15 | batch_inference_job: 16 | name: ${var.current_target}-mlops-batch-inference-job 17 | tasks: 18 | - task_key: batch_inference_job 19 | <<: *new_cluster 20 | notebook_task: 21 | notebook_path: ../deployment/batch_inference/notebooks/BatchInference.py 22 | base_parameters: 23 | env: ${var.current_target} 24 | input_table_name: ${var.current_target}.mlops.feature_store_inference_input # TODO: create input table for inference 25 | output_table_name: ${var.current_target}.mlops.predictions 26 | model_name: ${var.current_target}.mlops.${var.model_name} 27 | # git source information of current ML resource deployment. It will be persisted as part of the workflow run 28 | git_source_info: url:${bundle.git.origin_url}; branch:${bundle.git.branch}; commit:${bundle.git.commit} 29 | schedule: 30 | quartz_cron_expression: "0 0 11 * * ?" # daily at 11am 31 | timezone_id: UTC 32 | <<: *jobs_permissions 33 | # If you want to turn on notifications for this job, please uncomment the below code, 34 | # and provide a list of emails to the on_failure argument. 35 | # 36 | # email_notifications: 37 | # on_failure: 38 | # - first@company.com 39 | # - second@company.com 40 | 41 | write_feature_table_job: 42 | name: ${var.current_target}-mlops-write-feature-table-job 43 | job_clusters: 44 | - job_cluster_key: write_feature_table_job_cluster 45 | <<: *new_cluster 46 | tasks: 47 | - task_key: PickupFeatures 48 | job_cluster_key: write_feature_table_job_cluster 49 | notebook_task: 50 | notebook_path: ../feature_engineering/notebooks/GenerateAndWriteFeatures.py 51 | base_parameters: 52 | # TODO modify these arguments to reflect your setup. 53 | input_table_path: /databricks-datasets/nyctaxi-with-zipcodes/subsampled 54 | # TODO: Empty start/end dates will process the whole range. Update this as needed to process recent data. 55 | input_start_date: "" 56 | input_end_date: "" 57 | timestamp_column: tpep_pickup_datetime 58 | output_table_name: ${var.current_target}.mlops.trip_pickup_features 59 | features_transform_module: pickup_features 60 | primary_keys: zip 61 | # git source information of current ML resource deployment. It will be persisted as part of the workflow run 62 | git_source_info: url:${bundle.git.origin_url}; branch:${bundle.git.branch}; commit:${bundle.git.commit} 63 | - task_key: DropoffFeatures 64 | job_cluster_key: write_feature_table_job_cluster 65 | notebook_task: 66 | notebook_path: ../feature_engineering/notebooks/GenerateAndWriteFeatures.py 67 | base_parameters: 68 | # TODO: modify these arguments to reflect your setup. 
69 | input_table_path: /databricks-datasets/nyctaxi-with-zipcodes/subsampled 70 | # TODO: Empty start/end dates will process the whole range. Update this as needed to process recent data. 71 | input_start_date: "" 72 | input_end_date: "" 73 | timestamp_column: tpep_dropoff_datetime 74 | output_table_name: ${var.current_target}.mlops.trip_dropoff_features 75 | features_transform_module: dropoff_features 76 | primary_keys: zip 77 | # git source information of current ML resource deployment. It will be persisted as part of the workflow run 78 | git_source_info: url:${bundle.git.origin_url}; branch:${bundle.git.branch}; commit:${bundle.git.commit} 79 | schedule: 80 | quartz_cron_expression: "0 0 7 * * ?" # daily at 7am 81 | timezone_id: UTC 82 | <<: *jobs_permissions 83 | # If you want to turn on notifications for this job, please uncomment the below code, 84 | # and provide a list of emails to the on_failure argument. 85 | # 86 | # email_notifications: 87 | # on_failure: 88 | # - first@company.com 89 | # - second@company.com 90 | 91 | model_training_job: 92 | name: ${var.current_target}-mlops-model-training-job 93 | job_clusters: 94 | - job_cluster_key: model_training_job_cluster 95 | <<: *new_cluster 96 | tasks: 97 | - task_key: Train 98 | job_cluster_key: model_training_job_cluster 99 | notebook_task: 100 | notebook_path: ../training/notebooks/TrainWithFeatureStore.py 101 | base_parameters: 102 | env: ${var.current_target} 103 | # TODO: Update training_data_path 104 | training_data_path: /databricks-datasets/nyctaxi-with-zipcodes/subsampled 105 | experiment_name: ${var.experiment_name} 106 | model_name: ${var.current_target}.mlops.${var.model_name} 107 | pickup_features_table: ${var.current_target}.mlops.trip_pickup_features 108 | dropoff_features_table: ${var.current_target}.mlops.trip_dropoff_features 109 | # git source information of current ML resource deployment. It will be persisted as part of the workflow run 110 | git_source_info: url:${bundle.git.origin_url}; branch:${bundle.git.branch}; commit:${bundle.git.commit} 111 | - task_key: ModelValidation 112 | job_cluster_key: model_training_job_cluster 113 | depends_on: 114 | - task_key: Train 115 | notebook_task: 116 | notebook_path: ../validation/notebooks/ModelValidation.py 117 | base_parameters: 118 | experiment_name: ${var.experiment_name} 119 | # The `run_mode` defines whether model validation is enabled or not. 120 | # It can be one of the three values: 121 | # `disabled` : Do not run the model validation notebook. 122 | # `dry_run` : Run the model validation notebook. Ignore failed model validation rules and proceed to move 123 | # model to Production stage. 124 | # `enabled` : Run the model validation notebook. Move model to Production stage only if all model validation 125 | # rules are passing. 126 | # TODO: update run_mode 127 | run_mode: dry_run 128 | # Whether to load the current registered "Production" stage model as baseline. 129 | # Baseline model is a requirement for relative change and absolute change validation thresholds. 130 | # TODO: update enable_baseline_comparison 131 | enable_baseline_comparison: "false" 132 | # Please refer to data parameter in mlflow.evaluate documentation https://mlflow.org/docs/latest/python_api/mlflow.html#mlflow.evaluate 133 | # TODO: update validation_input 134 | validation_input: SELECT * FROM delta.`dbfs:/databricks-datasets/nyctaxi-with-zipcodes/subsampled` 135 | # A string describing the model type. The model type can be either "regressor" and "classifier". 
136 | # Please refer to model_type parameter in mlflow.evaluate documentation https://mlflow.org/docs/latest/python_api/mlflow.html#mlflow.evaluate 137 | # TODO: update model_type 138 | model_type: regressor 139 | # The string name of a column from data that contains evaluation labels. 140 | # Please refer to targets parameter in mlflow.evaluate documentation https://mlflow.org/docs/latest/python_api/mlflow.html#mlflow.evaluate 141 | # TODO: targets 142 | targets: fare_amount 143 | # Specifies the name of the function in mlops/training_validation_deployment/validation/validation.py that returns custom metrics. 144 | # TODO(optional): custom_metrics_loader_function 145 | custom_metrics_loader_function: custom_metrics 146 | # Specifies the name of the function in mlops/training_validation_deployment/validation/validation.py that returns model validation thresholds. 147 | # TODO(optional): validation_thresholds_loader_function 148 | validation_thresholds_loader_function: validation_thresholds 149 | # Specifies the name of the function in mlops/training_validation_deployment/validation/validation.py that returns evaluator_config. 150 | # TODO(optional): evaluator_config_loader_function 151 | evaluator_config_loader_function: evaluator_config 152 | # git source information of current ML resource deployment. It will be persisted as part of the workflow run 153 | git_source_info: url:${bundle.git.origin_url}; branch:${bundle.git.branch}; commit:${bundle.git.commit} 154 | - task_key: ModelDeployment 155 | job_cluster_key: model_training_job_cluster 156 | depends_on: 157 | - task_key: ModelValidation 158 | notebook_task: 159 | notebook_path: ../deployment/model_deployment/notebooks/ModelDeployment.py 160 | base_parameters: 161 | env: ${var.current_target} 162 | # git source information of current ML resource deployment. It will be persisted as part of the workflow run 163 | git_source_info: url:${bundle.git.origin_url}; branch:${bundle.git.branch}; commit:${bundle.git.commit} 164 | schedule: 165 | quartz_cron_expression: "0 0 9 * * ?" # daily at 9am 166 | timezone_id: UTC 167 | <<: *jobs_permissions 168 | # If you want to turn on notifications for this job, please uncomment the below code, 169 | # and provide a list of emails to the on_failure argument. 170 | # 171 | # email_notifications: 172 | # on_failure: 173 | # - first@company.com 174 | # - second@company.com 175 | 176 | 177 | experiment_permissions: &experiment_permissions 178 | permissions: 179 | - level: CAN_READ 180 | group_name: users 181 | 182 | # Allow users to execute models in Unity Catalog 183 | model_grants: &model_grants 184 | grants: 185 | - privileges: 186 | - EXECUTE 187 | principal: account users 188 | 189 | # Defines model and experiments 190 | model: &model 191 | model: 192 | name: ${var.model_name} 193 | catalog_name: ${var.current_target} 194 | schema_name: mlops 195 | comment: Registered model in Unity Catalog for the "mlops" ML Project for ${var.current_target} deployment target. 196 | <<: *model_grants 197 | 198 | experiment: &experiment 199 | experiment: 200 | name: ${var.experiment_name} 201 | <<: *experiment_permissions 202 | description: MLflow Experiment used to track runs for mlops project. 
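# The anchors defined above (*jobs, *model, *experiment) are merged into each
# phase-1 target below via YAML merge keys (<<:), so the dev/test/prod targets
# reuse the same job, registered-model and experiment definitions instead of
# duplicating them as top-level resources.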
203 | 
204 | 
205 | targets:
206 |   dev-phase1:
207 |     resources:
208 |       jobs:
209 |         <<: *jobs
210 |       registered_models:
211 |         <<: *model
212 |       experiments:
213 |         <<: *experiment
214 | 
215 |   test-phase1:
216 |     resources:
217 |       jobs:
218 |         <<: *jobs
219 |       registered_models:
220 |         <<: *model
221 |       experiments:
222 |         <<: *experiment
223 | 
224 |   prod-phase1:
225 |     resources:
226 |       jobs:
227 |         <<: *jobs
228 |       registered_models:
229 |         <<: *model
230 |       experiments:
231 |         <<: *experiment
232 | 
-------------------------------------------------------------------------------- /mlops-statcks-multiphase/README.md: --------------------------------------------------------------------------------
1 | # mlops
2 | 
3 | This DAB is based on the `mlops-stacks` template, customized to deploy quality
4 | monitoring in a separate "stage" (or "phase") so that the quality monitoring and
5 | retraining job definitions don't have to be uncommented manually. This is done by splitting
6 | model training and the quality monitoring/retraining job into different stages that can be deployed independently.
7 | To implement this I've used YAML anchors and bundle variables to define the actual resources, and then declared
8 | resources in each stage independently instead of specifying them as top-level objects.
9 | 
10 | Use `databricks bundle deploy -t <env>-phaseN` to deploy into a specific "phase". For example,
11 | use `databricks bundle deploy -t dev-phase1` to deploy the model training code, registered
12 | model, etc., and then use `databricks bundle deploy -t dev-phase2` to deploy the quality
13 | monitor and retraining job. A minimal sketch of this layout is shown at the end of this introduction.
14 | 
15 | The rest of the README is standard text from `mlops-stacks`...
16 | 
17 | This project comes with example ML code to train, validate and deploy a regression model
18 | to predict NYC taxi fares. If you're a data scientist just getting started with this repo
19 | for a brand new ML project, we recommend adapting the provided example code to your ML
20 | problem, then making and testing ML code changes on Databricks or your local machine.
21 | 
22 | The "Getting Started" docs can be found at https://learn.microsoft.com/azure/databricks/dev-tools/bundles/mlops-stacks.
23 | 
24 | ## Table of contents
25 | 
26 | * [Code structure](#code-structure): structure of this project.
27 | 
28 | * [Configure your ML pipeline](#configure-your-ml-pipeline): adapting the sample code to your ML problem.
29 | 
30 | * [Iterating on ML code](#iterating-on-ml-code): making and testing ML code changes on Databricks or your local machine.
31 | * [Next steps](#next-steps)
32 | 
33 | This directory contains an ML project based on the default
34 | [Databricks MLOps Stacks](https://github.com/databricks/mlops-stacks),
35 | defining a production-grade ML pipeline for automated retraining and batch inference of an ML model on tabular data.
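As a quick orientation (not part of the standard template text), the phase split boils down to defining the resources once as YAML anchors and then merging them into per-phase targets. The sketch below is abbreviated and illustrative only; see `tmp/phase1.yml` and `tmp/phase2.yml` for the real definitions:

```
# phase-1 resources, defined once as anchors
jobs: &jobs
  model_training_job:
    name: ${var.current_target}-mlops-model-training-job
    # ... job clusters, tasks, schedule ...

targets:
  dev-phase1:          # deploys the training job, registered model and experiment
    resources:
      jobs:
        <<: *jobs
  # dev-phase2 (see tmp/phase2.yml) deploys the quality monitor and retraining
  # job in the same way.
```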
36 | 37 | ## Code structure 38 | This project contains the following components: 39 | 40 | | Component | Description | 41 | |----------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| 42 | | ML Code | Example ML project code, with unit tested Python modules and notebooks | 43 | | ML Resources as Code | ML pipeline resources (training and batch inference jobs with schedules, etc) configured and deployed through [databricks CLI bundles](https://learn.microsoft.com/azure/databricks/dev-tools/cli/bundle-cli) | 44 | 45 | contained in the following files: 46 | 47 | ``` 48 | mlops <- Root directory. Both monorepo and polyrepo are supported. 49 | │ 50 | ├── mlops <- Contains python code, notebooks and ML resources related to one ML project. 51 | │ │ 52 | │ ├── requirements.txt <- Specifies Python dependencies for ML code (for example: model training, batch inference). 53 | │ │ 54 | │ ├── databricks.yml <- databricks.yml is the root bundle file for the ML project that can be loaded by databricks CLI bundles. It defines the bundle name, workspace URL and resource config component to be included. 55 | │ │ 56 | │ ├── training <- Training folder contains Notebook that trains and registers the model with feature store support. 57 | │ │ 58 | │ ├── feature_engineering <- Feature computation code (Python modules) that implements the feature transforms. 59 | │ │ The output of these transforms get persisted as Feature Store tables. Most development 60 | │ │ work happens here. 61 | │ │ 62 | │ ├── validation <- Optional model validation step before deploying a model. 63 | │ │ 64 | │ ├── monitoring <- Model monitoring, feature monitoring, etc. 65 | │ │ 66 | │ ├── deployment <- Deployment and Batch inference workflows 67 | │ │ │ 68 | │ │ ├── batch_inference <- Batch inference code that will run as part of scheduled workflow. 69 | │ │ │ 70 | │ │ ├── model_deployment <- As part of CD workflow, deploy the registered model by assigning it the appropriate alias. 71 | │ │ 72 | │ │ 73 | │ ├── tests <- Unit tests for the ML project, including the modules under `features`. 74 | │ │ 75 | │ ├── resources <- ML resource (ML jobs, MLflow models) config definitions expressed as code, across dev/staging/prod/test. 76 | │ │ 77 | │ ├── model-workflow-resource.yml <- ML resource config definition for model training, validation, deployment workflow 78 | │ │ 79 | │ ├── batch-inference-workflow-resource.yml <- ML resource config definition for batch inference workflow 80 | │ │ 81 | │ ├── feature-engineering-workflow-resource.yml <- ML resource config definition for feature engineering workflow 82 | │ │ 83 | │ ├── ml-artifacts-resource.yml <- ML resource config definition for model and experiment 84 | │ │ 85 | │ ├── monitoring-resource.yml <- ML resource config definition for quality monitoring workflow 86 | ``` 87 | 88 | 89 | ## Configure your ML pipeline 90 | 91 | The sample ML code consists of the following: 92 | 93 | * Feature computation modules under `feature_engineering` folder. 94 | These sample module contains features logic that can be used to generate and populate tables in Feature Store. 95 | In each module, there is `compute_features_fn` method that you need to implement. 
This should compute a features dataframe 96 | (each column being a separate feature), given the input dataframe, timestamp column and time-ranges. 97 | The output dataframe will be persisted in a [time-series Feature Store table](https://learn.microsoft.com/azure/databricks/machine-learning/feature-store/time-series). 98 | See the example modules' documentation for more information. 99 | * Python unit tests for feature computation modules in `tests/feature_engineering` folder. 100 | * Feature engineering notebook, `feature_engineering/notebooks/GenerateAndWriteFeatures.py`, that reads input dataframes, dynamically loads feature computation modules, executes their `compute_features_fn` method and writes the outputs to a Feature Store table (creating it if missing). 101 | * Training notebook that [trains](https://learn.microsoft.com/azure/databricks/machine-learning/feature-store/train-models-with-feature-store ) a regression model by creating a training dataset using the Feature Store client. 102 | * Model deployment and batch inference notebooks that deploy and use the trained model. 103 | * An automated integration test is provided (in `.github/workflows/mlops-run-tests.yml`) that executes a multi task run on Databricks involving the feature engineering and model training notebooks. 104 | 105 | To adapt this sample code for your use case, implement your own feature module, specifying configs such as input Delta tables/dataset path(s) to use when developing 106 | the feature engineering pipelines. 107 | 1. Implement your feature module, address TODOs in `feature_engineering/features` and create unit test in `tests/feature_engineering` 108 | 2. Update `resources/feature-engineering-workflow-resource.yml`. Fill in notebook parameters for `write_feature_table_job`. 109 | 3. Update training data path in `resources/model-workflow-resource.yml`. 110 | 111 | We expect most of the development to take place in the `feature_engineering` folder. 112 | 113 | 114 | ## Iterating on ML code 115 | 116 | ### Deploy ML code and resources to dev workspace using Bundles 117 | 118 | Refer to [Local development and dev workspace](./resources/README.md#local-development-and-dev-workspace) 119 | to use databricks CLI bundles to deploy ML code together with ML resource configs to dev workspace. 120 | 121 | This will allow you to develop locally and use databricks CLI bundles to deploy to your dev workspace to test out code and config changes. 122 | 123 | ### Develop on Databricks using Databricks Repos 124 | 125 | #### Prerequisites 126 | You'll need: 127 | * Access to run commands on a cluster running Databricks Runtime ML version 11.0 or above in your dev Databricks workspace 128 | * To set up [Databricks Repos](https://learn.microsoft.com/azure/databricks/repos/index): see instructions below 129 | 130 | #### Configuring Databricks Repos 131 | To use Repos, [set up git integration](https://learn.microsoft.com/azure/databricks/repos/repos-setup) in your dev workspace. 132 | 133 | If the current project has already been pushed to a hosted Git repo, follow the 134 | [UI workflow](https://learn.microsoft.com/azure/databricks/repos/git-operations-with-repos#add-a-repo-and-connect-remotely-later) 135 | to clone it into your dev workspace and iterate. 136 | 137 | Otherwise, e.g. 
if iterating on ML code for a new project, follow the steps below: 138 | * Follow the [UI workflow](https://learn.microsoft.com/azure/databricks/repos/git-operations-with-repos#add-a-repo-and-connect-remotely-later) 139 | for creating a repo, but uncheck the "Create repo by cloning a Git repository" checkbox. 140 | * Install the `dbx` CLI via `pip install --upgrade dbx` 141 | * Run `databricks configure --profile mlops-dev --token --host `, passing the URL of your dev workspace. 142 | This should prompt you to enter an API token 143 | * [Create a personal access token](https://learn.microsoft.com/azure/databricks/dev-tools/auth/pat) 144 | in your dev workspace and paste it into the prompt from the previous step 145 | * From within the root directory of the current project, use the [dbx sync](https://dbx.readthedocs.io/en/latest/guides/python/devloop/mixed/#using-dbx-sync-repo-for-local-to-repo-synchronization) tool to copy code files from your local machine into the Repo by running 146 | `dbx sync repo --profile mlops-dev --source . --dest-repo your-repo-name`, where `your-repo-name` should be the last segment of the full repo name (`/Repos/username/your-repo-name`) 147 | 148 | 149 | ### Develop locally 150 | 151 | You can iterate on the feature transform modules locally in your favorite IDE before running them on Databricks. 152 | 153 | #### Running code on Databricks 154 | You can iterate on ML code by running the provided `feature_engineering/notebooks/GenerateAndWriteFeatures.py` notebook on Databricks using 155 | [Repos](https://learn.microsoft.com/azure/databricks/repos/index). This notebook drives execution of 156 | the feature transforms code defined under ``features``. You can use multiple browser tabs to edit 157 | logic in `features` and run the feature engineering pipeline in the `GenerateAndWriteFeatures.py` notebook. 158 | 159 | #### Prerequisites 160 | * Python 3.8+ 161 | * Install feature engineering code and test dependencies via `pip install -I -r requirements.txt` from project root directory. 162 | * The features transform code uses PySpark and brings up a local Spark instance for testing, so [Java (version 8 and later) is required](https://spark.apache.org/docs/latest/#downloading). 163 | * Access to UC catalog and schema 164 | We expect a catalog to exist with the name of the deployment target by default. 165 | For example, if the deployment target is dev, we expect a catalog named dev to exist in the workspace. 166 | If you want to use different catalog names, please update the target names declared in the [databricks.yml](./databricks.yml) file. 167 | If changing the staging, prod, or test deployment targets, you'll also need to update the workflows located in the .github/workflows directory. 168 | 169 | For the ML training job, you must have permissions to read the input Delta table and create experiment and models. 170 | i.e. for each environment: 171 | - USE_CATALOG 172 | - USE_SCHEMA 173 | - MODIFY 174 | - CREATE_MODEL 175 | - CREATE_TABLE 176 | 177 | For the batch inference job, you must have permissions to read input Delta table and modify the output Delta table. 178 | i.e. for each environment 179 | - USAGE permissions for the catalog and schema of the input and output table. 180 | - SELECT permission for the input table. 181 | - MODIFY permission for the output table if it pre-dates your job. 182 | 183 | #### Run unit tests 184 | You can run unit tests for your ML code via `pytest tests`. 
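As a rough illustration, a unit test for a feature module might look like the sketch below. It assumes a `compute_features_fn(input_df, timestamp_column, start_date, end_date)` signature (the "input dataframe, timestamp column and time-ranges" described above) and a small local SparkSession; the import path and the exact output columns depend on your module and `pytest.ini` configuration.

```python
from datetime import datetime

import pytest
from pyspark.sql import SparkSession

from features import pickup_features  # adjust the import to match your pytest.ini / package layout


@pytest.fixture(scope="session")
def spark():
    # Local Spark session for tests; requires Java, as noted in the prerequisites.
    return SparkSession.builder.master("local[1]").getOrCreate()


def test_pickup_features_returns_rows(spark):
    # A single taxi trip with the columns the pickup feature transform needs.
    input_df = spark.createDataFrame(
        [(datetime(2023, 1, 1, 9, 5), 12.5, "10001")],
        ["tpep_pickup_datetime", "fare_amount", "pickup_zip"],
    )
    # Passing None for the start/end dates (process the whole range) is an assumption
    # of this sketch; use whatever your compute_features_fn expects.
    output_df = pickup_features.compute_features_fn(
        input_df, "tpep_pickup_datetime", None, None
    )
    assert output_df.count() > 0
```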
185 | 
186 | 
187 | 
188 | ## Next Steps
189 | 
190 | When you're satisfied with initial ML experimentation (e.g. validated that a model with reasonable performance can be trained on your dataset) and ready to deploy production training/inference pipelines, ask your ops team to set up CI/CD for the current ML project if they haven't already. CI/CD can be set up as part of the
191 | MLOps Stacks initialization even if it was skipped in this case, or this project can be added to a repo set up with CI/CD already, following the directions under "Setting up CI/CD" in the repo root directory README.
192 | 
193 | 
194 | To add CI/CD to this repo:
195 | 1. Run `databricks bundle init mlops-stacks` via the Databricks CLI
196 | 2. Select the option to only initialize `CICD_Only`
197 | 3. Provide the root directory of this project and answer the subsequent prompts
198 | 
199 | More details can be found in the [MLOps Stacks README](https://github.com/databricks/mlops-stacks/blob/main/README.md).
200 | 
-------------------------------------------------------------------------------- /mlops-statcks-multiphase/validation/notebooks/ModelValidation.py: --------------------------------------------------------------------------------
1 | # Databricks notebook source
2 | ##################################################################################
3 | # Model Validation Notebook
4 | ##
5 | # This notebook uses mlflow model validation API to run model validation after training and registering a model
6 | # in model registry, before deploying it to the "champion" alias.
7 | #
8 | # It runs as part of CD and by an automated model training job -> validation -> deployment job defined under ``mlops/resources/model-workflow-resource.yml``
9 | #
10 | #
11 | # Parameters:
12 | #
13 | # * env - Name of the environment the notebook is run in (staging, or prod). Defaults to "prod".
14 | # * `run_mode` - The `run_mode` defines whether model validation is enabled or not. It can be one of the three values:
15 | # * `disabled` : Do not run the model validation notebook.
16 | # * `dry_run` : Run the model validation notebook. Ignore failed model validation rules and proceed to move
17 | # model to the "champion" alias.
18 | # * `enabled` : Run the model validation notebook. Move model to the "champion" alias only if all model validation
19 | # rules are passing.
20 | # * enable_baseline_comparison - Whether to load the current registered "champion" model as baseline.
21 | # Baseline model is a requirement for relative change and absolute change validation thresholds.
22 | # * validation_input - Validation input. Please refer to data parameter in mlflow.evaluate documentation https://mlflow.org/docs/latest/python_api/mlflow.html#mlflow.evaluate
23 | # * model_type - A string describing the model type. The model type can be either "regressor" or "classifier".
24 | # Please refer to model_type parameter in mlflow.evaluate documentation https://mlflow.org/docs/latest/python_api/mlflow.html#mlflow.evaluate
25 | # * targets - The string name of a column from data that contains evaluation labels.
26 | # Please refer to targets parameter in mlflow.evaluate documentation https://mlflow.org/docs/latest/python_api/mlflow.html#mlflow.evaluate
27 | # * custom_metrics_loader_function - Specifies the name of the function in mlops/validation/validation.py that returns custom metrics.
28 | # * validation_thresholds_loader_function - Specifies the name of the function in mlops/validation/validation.py that returns model validation thresholds. 29 | # 30 | # For details on mlflow evaluate API, see doc https://mlflow.org/docs/latest/python_api/mlflow.html#mlflow.evaluate 31 | # For details and examples about performing model validation, see the Model Validation documentation https://mlflow.org/docs/latest/models.html#model-validation 32 | # 33 | ################################################################################## 34 | 35 | # COMMAND ---------- 36 | 37 | # MAGIC %load_ext autoreload 38 | # MAGIC %autoreload 2 39 | 40 | # COMMAND ---------- 41 | 42 | import os 43 | notebook_path = '/Workspace/' + os.path.dirname(dbutils.notebook.entry_point.getDbutils().notebook().getContext().notebookPath().get()) 44 | %cd $notebook_path 45 | 46 | # COMMAND ---------- 47 | 48 | # MAGIC %pip install -r ../../requirements.txt 49 | 50 | # COMMAND ---------- 51 | 52 | dbutils.library.restartPython() 53 | 54 | # COMMAND ---------- 55 | 56 | import os 57 | notebook_path = '/Workspace/' + os.path.dirname(dbutils.notebook.entry_point.getDbutils().notebook().getContext().notebookPath().get()) 58 | %cd $notebook_path 59 | %cd ../ 60 | 61 | # COMMAND ---------- 62 | 63 | dbutils.widgets.text( 64 | "experiment_name", 65 | "/dev-mlops-experiment", 66 | "Experiment Name", 67 | ) 68 | dbutils.widgets.dropdown("run_mode", "disabled", ["disabled", "dry_run", "enabled"], "Run Mode") 69 | dbutils.widgets.dropdown("enable_baseline_comparison", "false", ["true", "false"], "Enable Baseline Comparison") 70 | dbutils.widgets.text("validation_input", "SELECT * FROM delta.`dbfs:/databricks-datasets/nyctaxi-with-zipcodes/subsampled`", "Validation Input") 71 | 72 | dbutils.widgets.text("model_type", "regressor", "Model Type") 73 | dbutils.widgets.text("targets", "fare_amount", "Targets") 74 | dbutils.widgets.text("custom_metrics_loader_function", "custom_metrics", "Custom Metrics Loader Function") 75 | dbutils.widgets.text("validation_thresholds_loader_function", "validation_thresholds", "Validation Thresholds Loader Function") 76 | dbutils.widgets.text("evaluator_config_loader_function", "evaluator_config", "Evaluator Config Loader Function") 77 | dbutils.widgets.text("model_name", "dev.mlops.mlops-model", "Full (Three-Level) Model Name") 78 | dbutils.widgets.text("model_version", "", "Candidate Model Version") 79 | 80 | # COMMAND ---------- 81 | run_mode = dbutils.widgets.get("run_mode").lower() 82 | assert run_mode == "disabled" or run_mode == "dry_run" or run_mode == "enabled" 83 | 84 | if run_mode == "disabled": 85 | print( 86 | "Model validation is in DISABLED mode. Exit model validation without blocking model deployment." 87 | ) 88 | dbutils.notebook.exit(0) 89 | dry_run = run_mode == "dry_run" 90 | 91 | if dry_run: 92 | print( 93 | "Model validation is in DRY_RUN mode. Validation threshold validation failures will not block model deployment." 94 | ) 95 | else: 96 | print( 97 | "Model validation is in ENABLED mode. Validation threshold validation failures will block model deployment." 
98 | ) 99 | 100 | # COMMAND ---------- 101 | 102 | import importlib 103 | import mlflow 104 | import os 105 | import tempfile 106 | import traceback 107 | 108 | from mlflow.tracking.client import MlflowClient 109 | 110 | client = MlflowClient(registry_uri="databricks-uc") 111 | mlflow.set_registry_uri('databricks-uc') 112 | 113 | # set experiment 114 | experiment_name = dbutils.widgets.get("experiment_name") 115 | mlflow.set_experiment(experiment_name) 116 | 117 | # set model evaluation parameters that can be inferred from the job 118 | model_uri = dbutils.jobs.taskValues.get("Train", "model_uri", debugValue="") 119 | model_name = dbutils.jobs.taskValues.get("Train", "model_name", debugValue="") 120 | model_version = dbutils.jobs.taskValues.get("Train", "model_version", debugValue="") 121 | 122 | if model_uri == "": 123 | model_name = dbutils.widgets.get("model_name") 124 | model_version = dbutils.widgets.get("model_version") 125 | model_uri = "models:/" + model_name + "/" + model_version 126 | 127 | baseline_model_uri = "models:/" + model_name + "@champion" 128 | 129 | evaluators = "default" 130 | assert model_uri != "", "model_uri notebook parameter must be specified" 131 | assert model_name != "", "model_name notebook parameter must be specified" 132 | assert model_version != "", "model_version notebook parameter must be specified" 133 | 134 | # COMMAND ---------- 135 | 136 | # take input 137 | enable_baseline_comparison = dbutils.widgets.get("enable_baseline_comparison") 138 | 139 | 140 | enable_baseline_comparison = "false" 141 | print( 142 | "Currently baseline model comparison is not supported for models registered with feature store. Please refer to " 143 | "issue https://github.com/databricks/mlops-stacks/issues/70 for more details." 144 | ) 145 | 146 | assert enable_baseline_comparison == "true" or enable_baseline_comparison == "false" 147 | enable_baseline_comparison = enable_baseline_comparison == "true" 148 | 149 | validation_input = dbutils.widgets.get("validation_input") 150 | assert validation_input 151 | data = spark.sql(validation_input) 152 | 153 | model_type = dbutils.widgets.get("model_type") 154 | targets = dbutils.widgets.get("targets") 155 | 156 | assert model_type 157 | assert targets 158 | 159 | custom_metrics_loader_function_name = dbutils.widgets.get("custom_metrics_loader_function") 160 | validation_thresholds_loader_function_name = dbutils.widgets.get("validation_thresholds_loader_function") 161 | evaluator_config_loader_function_name = dbutils.widgets.get("evaluator_config_loader_function") 162 | assert custom_metrics_loader_function_name 163 | assert validation_thresholds_loader_function_name 164 | assert evaluator_config_loader_function_name 165 | custom_metrics_loader_function = getattr( 166 | importlib.import_module("validation"), custom_metrics_loader_function_name 167 | ) 168 | validation_thresholds_loader_function = getattr( 169 | importlib.import_module("validation"), validation_thresholds_loader_function_name 170 | ) 171 | evaluator_config_loader_function = getattr( 172 | importlib.import_module("validation"), evaluator_config_loader_function_name 173 | ) 174 | custom_metrics = custom_metrics_loader_function() 175 | validation_thresholds = validation_thresholds_loader_function() 176 | evaluator_config = evaluator_config_loader_function() 177 | 178 | # COMMAND ---------- 179 | 180 | # helper methods 181 | def get_run_link(run_info): 182 | return "[Run](#mlflow/experiments/{0}/runs/{1})".format( 183 | run_info.experiment_id, run_info.run_id 184 | ) 185 
| 186 | 187 | def get_training_run(model_name, model_version): 188 | version = client.get_model_version(model_name, model_version) 189 | return mlflow.get_run(run_id=version.run_id) 190 | 191 | 192 | def generate_run_name(training_run): 193 | return None if not training_run else training_run.info.run_name + "-validation" 194 | 195 | 196 | def generate_description(training_run): 197 | return ( 198 | None 199 | if not training_run 200 | else "Model Training Details: {0}\n".format(get_run_link(training_run.info)) 201 | ) 202 | 203 | 204 | def log_to_model_description(run, success): 205 | run_link = get_run_link(run.info) 206 | description = client.get_model_version(model_name, model_version).description 207 | status = "SUCCESS" if success else "FAILURE" 208 | if description != "": 209 | description += "\n\n---\n\n" 210 | description += "Model Validation Status: {0}\nValidation Details: {1}".format( 211 | status, run_link 212 | ) 213 | client.update_model_version( 214 | name=model_name, version=model_version, description=description 215 | ) 216 | 217 | 218 | 219 | from datetime import timedelta, timezone 220 | import math 221 | import pyspark.sql.functions as F 222 | from pyspark.sql.types import IntegerType 223 | 224 | 225 | def rounded_unix_timestamp(dt, num_minutes=15): 226 | """ 227 | Ceilings datetime dt to interval num_minutes, then returns the unix timestamp. 228 | """ 229 | nsecs = dt.minute * 60 + dt.second + dt.microsecond * 1e-6 230 | delta = math.ceil(nsecs / (60 * num_minutes)) * (60 * num_minutes) - nsecs 231 | return int((dt + timedelta(seconds=delta)).replace(tzinfo=timezone.utc).timestamp()) 232 | 233 | 234 | rounded_unix_timestamp_udf = F.udf(rounded_unix_timestamp, IntegerType()) 235 | 236 | 237 | def rounded_taxi_data(taxi_data_df): 238 | # Round the taxi data timestamp to 15 and 30 minute intervals so we can join with the pickup and dropoff features 239 | # respectively. 
240 | taxi_data_df = ( 241 | taxi_data_df.withColumn( 242 | "rounded_pickup_datetime", 243 | F.to_timestamp( 244 | rounded_unix_timestamp_udf( 245 | taxi_data_df["tpep_pickup_datetime"], F.lit(15) 246 | ) 247 | ), 248 | ) 249 | .withColumn( 250 | "rounded_dropoff_datetime", 251 | F.to_timestamp( 252 | rounded_unix_timestamp_udf( 253 | taxi_data_df["tpep_dropoff_datetime"], F.lit(30) 254 | ) 255 | ), 256 | ) 257 | .drop("tpep_pickup_datetime") 258 | .drop("tpep_dropoff_datetime") 259 | ) 260 | taxi_data_df.createOrReplaceTempView("taxi_data") 261 | return taxi_data_df 262 | 263 | 264 | data = rounded_taxi_data(data) 265 | 266 | 267 | 268 | 269 | # COMMAND ---------- 270 | 271 | 272 | # Temporary fix as FS model can't predict as a pyfunc model 273 | # MLflow evaluate can take a lambda function instead of a model uri for a model 274 | # but id does not work for the baseline model as it requires a model_uri (baseline comparison is set to false) 275 | 276 | from databricks.feature_store import FeatureStoreClient 277 | 278 | def get_fs_model(df): 279 | fs_client = FeatureStoreClient() 280 | return ( 281 | fs_client.score_batch(model_uri, spark.createDataFrame(df)) 282 | .select("prediction") 283 | .toPandas() 284 | ) 285 | 286 | 287 | training_run = get_training_run(model_name, model_version) 288 | 289 | # run evaluate 290 | with mlflow.start_run( 291 | run_name=generate_run_name(training_run), 292 | description=generate_description(training_run), 293 | ) as run, tempfile.TemporaryDirectory() as tmp_dir: 294 | validation_thresholds_file = os.path.join(tmp_dir, "validation_thresholds.txt") 295 | with open(validation_thresholds_file, "w") as f: 296 | if validation_thresholds: 297 | for metric_name in validation_thresholds: 298 | f.write( 299 | "{0:30} {1}\n".format( 300 | metric_name, str(validation_thresholds[metric_name]) 301 | ) 302 | ) 303 | mlflow.log_artifact(validation_thresholds_file) 304 | 305 | try: 306 | eval_result = mlflow.evaluate( 307 | 308 | model=get_fs_model, 309 | 310 | data=data, 311 | targets=targets, 312 | model_type=model_type, 313 | evaluators=evaluators, 314 | validation_thresholds=validation_thresholds, 315 | custom_metrics=custom_metrics, 316 | baseline_model=None 317 | if not enable_baseline_comparison 318 | else baseline_model_uri, 319 | evaluator_config=evaluator_config, 320 | ) 321 | metrics_file = os.path.join(tmp_dir, "metrics.txt") 322 | with open(metrics_file, "w") as f: 323 | f.write( 324 | "{0:30} {1:30} {2}\n".format("metric_name", "candidate", "baseline") 325 | ) 326 | for metric in eval_result.metrics: 327 | candidate_metric_value = str(eval_result.metrics[metric]) 328 | baseline_metric_value = "N/A" 329 | if metric in eval_result.baseline_model_metrics: 330 | mlflow.log_metric( 331 | "baseline_" + metric, eval_result.baseline_model_metrics[metric] 332 | ) 333 | baseline_metric_value = str( 334 | eval_result.baseline_model_metrics[metric] 335 | ) 336 | f.write( 337 | "{0:30} {1:30} {2}\n".format( 338 | metric, candidate_metric_value, baseline_metric_value 339 | ) 340 | ) 341 | mlflow.log_artifact(metrics_file) 342 | log_to_model_description(run, True) 343 | 344 | # Assign "challenger" alias to indicate model version has passed validation checks 345 | print("Validation checks passed. 
Assigning 'challenger' alias to model version.") 346 | client.set_registered_model_alias(model_name, "challenger", model_version) 347 | 348 | except Exception as err: 349 | log_to_model_description(run, False) 350 | error_file = os.path.join(tmp_dir, "error.txt") 351 | with open(error_file, "w") as f: 352 | f.write("Validation failed : " + str(err) + "\n") 353 | f.write(traceback.format_exc()) 354 | mlflow.log_artifact(error_file) 355 | if not dry_run: 356 | raise err 357 | else: 358 | print( 359 | "Model validation failed in DRY_RUN. It will not block model deployment." 360 | ) 361 | -------------------------------------------------------------------------------- /mlops-statcks-multiphase/tmp/README.md: -------------------------------------------------------------------------------- 1 | # Databricks ML Resource Configurations 2 | [(back to project README)](../README.md) 3 | 4 | ## Table of contents 5 | * [Intro](#intro) 6 | * [Local development and dev workspace](#local-development-and-dev-workspace) 7 | * [Develop and test config changes](#develop-and-test-config-changes) 8 | * [CI/CD](#set-up-cicd) 9 | * [Deploy initial ML resources](#deploy-initial-ml-resources) 10 | * [Deploy config changes](#deploy-config-changes) 11 | 12 | ## Intro 13 | 14 | ### databricks CLI bundles 15 | MLOps Stacks ML resources are configured and deployed through [databricks CLI bundles](https://learn.microsoft.com/azure/databricks/dev-tools/cli/bundle-cli). 16 | The bundle setting file must be expressed in YAML format and must contain at minimum the top-level bundle mapping. 17 | 18 | The databricks CLI bundles top level is defined by file `mlops/databricks.yml`. 19 | During databricks CLI bundles deployment, the root config file will be loaded, validated and deployed to workspace provided by the environment together with all the included resources. 20 | 21 | ML Resource Configurations in this directory: 22 | - model workflow (`mlops/resources/model-workflow-resource.yml`) 23 | - batch inference workflow (`mlops/resources/batch-inference-workflow-resource.yml`) 24 | - monitoring resource and workflow (`mlops/resources/monitoring-resource.yml`) 25 | - feature engineering workflow (`mlops/resources/feature-engineering-workflow-resource.yml`) 26 | - model definition and experiment definition (`mlops/resources/ml-artifacts-resource.yml`) 27 | 28 | 29 | ### Deployment Config & CI/CD integration 30 | The ML resources can be deployed to databricks workspace based on the databricks CLI bundles deployment config. 31 | Deployment configs of different deployment targets share the general ML resource configurations with added ability to specify deployment target specific values (workspace URI, model name, jobs notebook parameters, etc). 32 | This project ships with CI/CD workflows for developing and deploying ML resource configurations based on deployment config. 33 | 34 | For Model Registry in Unity Catalog, we expect a catalog to exist with the name of the deployment target by default. For example, if the deployment target is `dev`, we expect a catalog named `dev` to exist in the workspace. 35 | If you want to use different catalog names, please update the `targets` declared in the `mlops/databricks.yml` and `mlops/resources/ml-artifacts-resource.yml` files. 36 | If changing the `staging`, `prod`, or `test` deployment targets, you'll need to update the pipelines located in the `azure-pipelines` directory. 
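For instance, one way to decouple catalog names from target names is a bundle variable that each target overrides. The sketch below is illustrative only (the catalog names are hypothetical); the resource definitions would then reference `${var.catalog_name}` instead of the target name:

```
variables:
  catalog_name:
    description: UC catalog holding the registered model, experiment and feature tables.
    default: dev

targets:
  staging:
    variables:
      catalog_name: my_staging_catalog
  prod:
    variables:
      catalog_name: my_prod_catalog
```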
37 | 38 | 39 | | Deployment Target | Description | Databricks Workspace | Model Name | Experiment Name | 40 | |-------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|----------------------|-------------------------------------|------------------------------------------------| 41 | | dev | The `dev` deployment target is used by ML engineers to deploy ML resources to development workspace with `dev` configs. The config is for ML project development purposes. | dev workspace | dev-mlops-model | /dev-mlops-experiment | 42 | | staging | The `staging` deployment target is part of the CD pipeline. Latest main content will be deployed to staging workspace with `staging` config. | staging workspace | staging-mlops-model | /staging-mlops-experiment | 43 | | prod | The `prod` deployment target is part of the CD pipeline. Latest release content will be deployed to prod workspace with `prod` config. | prod workspace | prod-mlops-model | /prod-mlops-experiment | 44 | | test | The `test` deployment target is part of the CI pipeline. For changes targeting the main branch, upon making a PR, an integration test will be triggered and ML resources deployed to the staging workspace defined under `test` deployment target. | staging workspace | test-mlops-model | /test-mlops-experiment | 45 | 46 | During ML code development, you can deploy local ML resource configurations together with ML code to the a Databricks workspace to run the training, model validation or batch inference pipelines. The deployment will use `dev` config by default. 47 | 48 | You can open a PR (pull request) to modify ML code or the resource config against main branch. 49 | The PR will trigger Python unit tests, followed by an integration test executed on the staging workspace, as defined under the `test` environment resource. 50 | 51 | Upon merging a PR to the main branch, the main branch content will be deployed to the staging workspace with `staging` environment resource configurations. 52 | 53 | Upon merging code into the release branch, the release branch content will be deployed to prod workspace with `prod` environment resource configurations. 54 | ![ML resource config diagram](../../docs/images/mlops-stack-deploy.png) 55 | 56 | ## Local development and dev workspace 57 | 58 | ### Set up authentication 59 | 60 | To set up the databricks CLI using a Databricks personal access token, take the following steps: 61 | 62 | 1. Follow [databricks CLI](https://learn.microsoft.com/azure/databricks/dev-tools/cli/databricks-cli) to download and set up the databricks CLI locally. 63 | 2. Complete the `TODO` in `mlops/databricks.yml` to add the dev workspace URI under `targets.dev.workspace.host`. 64 | 3. [Create a personal access token](https://learn.microsoft.com/azure/databricks/dev-tools/auth/pat) 65 | in your dev workspace and copy it. 66 | 4. Set an env variable `DATABRICKS_TOKEN` with your Databricks personal access token in your terminal. For example, run `export DATABRICKS_TOKEN=dapi12345` if the access token is dapi12345. 67 | 5. You can now use the databricks CLI to validate and deploy ML resource configurations to the dev workspace. 68 | 69 | Alternatively, you can use the other approaches described in the [databricks CLI](https://learn.microsoft.com/azure/databricks/dev-tools/cli/databricks-cli) documentation to set up authentication. 
For example, using your Databricks username/password, or setting up a local profile.
70 | 
71 | ### Validate and provision ML resource configurations
72 | 1. After installing the databricks CLI and creating the `DATABRICKS_TOKEN` env variable, change to the `mlops` directory.
73 | 2. Run `databricks bundle validate` to validate the Databricks resource configurations.
74 | 3. Run `databricks bundle deploy` to provision the Databricks resource configurations to the dev workspace. The resource configurations and your ML code will be copied together to the dev workspace. The defined resources such as Databricks Workflows, MLflow Model and MLflow Experiment will be provisioned according to the config files under `mlops/resources`.
75 | 4. Go to the Databricks dev workspace, check the status of the defined model, experiment and workflows, and interact with the created workflows.
76 | 
77 | ### Destroy ML resource configurations
78 | After development is done, you can run `databricks bundle destroy` to destroy (remove) the defined Databricks resources in the dev workspace. Any model version in the `Production` or `Staging` stage will prevent the model from being deleted. Please update the version stage to `None` or `Archived` before destroying the ML resources.
79 | ## Set up CI/CD
80 | Please refer to [mlops-setup](../../docs/mlops-setup.md#configure-cicd) for instructions to set up CI/CD.
81 | 
82 | ## Deploy initial ML resources
83 | After completing the prerequisites, create and push a PR branch adding all files to the Git repo:
84 | ```
85 | git checkout -b add-ml-resource-config-and-code
86 | git add .
87 | git commit -m "Add ML resource config and ML code"
88 | git push upstream add-ml-resource-config-and-code
89 | ```
90 | Open a pull request to merge the pushed branch into the `main` branch.
91 | Upon creating this PR, the CI workflows will be triggered.
92 | These CI workflows will run unit and integration tests of the ML code,
93 | in addition to validating the Databricks resources to be deployed to both staging and prod workspaces.
94 | Once CI passes, merge the PR into the `main` branch. This will deploy an initial set of Databricks resources to the staging workspace.
95 | Resources will be deployed to the prod workspace on pushing code to the `release` branch.
96 | 
97 | Follow the next section to configure the input and output data tables for the batch inference job.
98 | 
99 | ### Setting up the batch inference job
100 | The batch inference job expects an input Delta table with a schema that your registered model accepts. To use the batch
101 | inference job, set up such a Delta table in both your staging and prod workspaces.
102 | Following this, update the batch_inference_job base parameters in `mlops/resources/batch-inference-workflow-resource.yml` to pass
103 | the name of the input Delta table and the name of the output Delta table to which to write batch predictions.
104 | 
105 | As the batch job will be run with the credentials of the service principal that provisioned it, make sure that the service
106 | principal corresponding to a particular environment has permissions to read the input Delta table and modify the output Delta table in that environment's workspace. If the Delta table is in the [Unity Catalog](https://www.databricks.com/product/unity-catalog), these permissions are
107 | 
108 | * `USAGE` permissions for the catalog and schema of the input and output table.
109 | * `SELECT` permission for the input table.
110 | * `MODIFY` permission for the output table if it pre-dates your job. 111 | 112 | ### Setting up model validation 113 | The model validation workflow focuses on building a plug-and-play stack component for continuous deployment (CD) of models 114 | in staging and prod. 115 | Its central purpose is to evaluate a registered model and validate its quality before deploying the model to Production/Staging. 116 | 117 | Model validation contains three components: 118 | * [model-workflow-resource.yml](./model-workflow-resource.yml) contains the resource config and input parameters for model validation. 119 | * [validation.py](../validation/validation.py) defines custom metrics and validation thresholds that are referenced by the above resource config files. 120 | * [notebooks/ModelValidation](../validation/notebooks/ModelValidation.py) contains the validation job implementation. In most cases you don't need to modify this file. 121 | 122 | To set up and enable model validation, update [validation.py](../validation/validation.py) to return desired custom metrics and validation thresholds, then 123 | resolve the `TODOs` in the ModelValidation task of [model-workflow-resource.yml](./model-workflow-resource.yml). 124 | 125 | 126 | ### Setting up monitoring 127 | The monitoring workflow focuses on building a plug-and-play stack component for monitoring the feature drifts and model drifts and retrain based on the 128 | violation threshold defined given the ground truth labels. 129 | 130 | Its central purpose is to track production model performances, feature distributions and comparing different versions. 131 | 132 | Monitoring contains four components: 133 | * [metric_violation_check_query.py](../monitoring/metric_violation_check_query.py) defines a query that checks for violation of the monitored metric. 134 | * [notebooks/MonitoredMetricViolationCheck](../monitoring/notebooks/MonitoredMetricViolationCheck.py) acts as an entry point, executing the violation check query against the monitored inference table. 135 | It emits a boolean value based on the query result. 136 | * [monitoring-resource.yml](./monitoring-resource.yml) contains the resource config, inputs parameters for monitoring, and orchestrates model retraining based on monitoring. It first runs the [notebooks/MonitoredMetricViolationCheck](../monitoring/notebooks/MonitoredMetricViolationCheck.py) 137 | entry point then decides whether to execute the model retraining workflow. 138 | 139 | To set up and enable monitoring: 140 | * If it is not done already, generate inference table, join it with ground truth labels, and update the table name in [monitoring-resource.yml](./monitoring-resource.yml). 141 | * Resolve the `TODOs` in [monitoring-resource.yml](./monitoring-resource.yml) 142 | * Uncomment the monitoring workflow in [databricks.yml](../databricks.yml) 143 | * OPTIONAL: Update the query in [metric_violation_check_query.py](../monitoring/metric_violation_check_query.py) to customize when the metric is considered to be in violation. 144 | 145 | NOTE: If ground truth labels are not available, you can still set up monitoring but should disable the retraining workflow. 146 | 147 | Retraining Constraints: 148 | The retraining job has constraints for optimal functioning: 149 | * Labels must be provided by the user, joined correctly for retraining history, and available on time with the retraining frequency. 150 | * Retraining Frequency is tightly coupled with the granularity of the monitor. 
150 | * Retraining frequency is tightly coupled with the granularity of the monitor. Ensure that the retraining frequency is equal to, or close to, the granularity of the monitor.
151 | * If the granularity of the monitor is 1 day and the retraining frequency is 1 hour, the job will preemptively stop as there is no new data to evaluate the retraining criteria.
152 | * If the granularity of the monitor is 1 day and the retraining frequency is 1 week, retraining would be stale and inefficient.
153 |
154 | Permissions:
155 | Permissions for monitoring are inherited from the original table's permissions.
156 | * Users who own the monitored table or its parent catalog/schema can create, update, and view monitors.
157 | * Users with read permissions on the monitored table can view its monitor.
158 |
159 | Therefore, ensure that service principals are the owners of, or have the necessary permissions to manage, the monitored table.
160 |
161 | ## Develop and test config changes
162 |
163 | ### databricks CLI bundles schema overview
164 | To get started, open `mlops/resources/batch-inference-workflow-resource.yml`. The file contains the ML resource definition of a batch inference job, like:
165 |
166 | ```yaml
167 | new_cluster: &new_cluster
168 |   new_cluster:
169 |     num_workers: 3
170 |     spark_version: 15.3.x-cpu-ml-scala2.12
171 |     node_type_id: Standard_D3_v2
172 |     custom_tags:
173 |       clusterSource: mlops-stacks_0.4
174 |
175 | resources:
176 |   jobs:
177 |     batch_inference_job:
178 |       name: ${bundle.target}-mlops-batch-inference-job
179 |       tasks:
180 |         - task_key: batch_inference_job
181 |           <<: *new_cluster
182 |           notebook_task:
183 |             notebook_path: ../deployment/batch_inference/notebooks/BatchInference.py
184 |             base_parameters:
185 |               env: ${bundle.target}
186 |               input_table_name: batch_inference_input_table_name
187 |       ...
188 | ```
189 |
190 | The example above defines a Databricks job with the name `${bundle.target}-mlops-batch-inference-job`
191 | that runs the notebook under `mlops/deployment/batch_inference/notebooks/BatchInference.py` to regularly apply your ML model for batch inference.
192 |
193 | At the start of the resource definition, we declare an anchor `new_cluster` that is referenced and reused later. For more information about anchors in YAML, please refer to the [YAML documentation](https://yaml.org/spec/1.2.2/#3222-anchors-and-aliases).
194 |
195 | We specify a `batch_inference_job` under `resources/jobs` to define a Databricks workflow with internal key `batch_inference_job` and job name `${bundle.target}-mlops-batch-inference-job`.
196 | The workflow contains a single task with task key `batch_inference_job`. The task runs the notebook `../deployment/batch_inference/notebooks/BatchInference.py`, passing the parameters `env` and `input_table_name` to the notebook.
197 | After setting up the databricks CLI, you can run the command `databricks bundle schema` to learn more about the databricks CLI bundles schema.
198 |
199 | The `notebook_path` is resolved relative to the resource YAML file.
200 |
201 | ### Environment config based variables
202 | `${bundle.target}` will be replaced by the environment config name during bundle deployment. For example, during the deployment of a `test` environment config, the job name will be
203 | `test-mlops-batch-inference-job`. During the deployment of the `staging` environment config, the job name will be
204 | `staging-mlops-batch-inference-job`.
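To see this substitution in action, you can select the environment config explicitly when deploying. A minimal sketch, assuming authentication is already set up and a recent databricks CLI version whose `-t`/`--target` flag selects the environment config:

```
# Deploy with the default target (dev); the job is created as dev-mlops-batch-inference-job.
databricks bundle deploy

# Explicitly select the test environment config; the job is created as test-mlops-batch-inference-job.
databricks bundle validate -t test
databricks bundle deploy -t test
```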
205 |
206 |
207 | To use different values for different environments, you can define bundle variables and override them per target, for example:
208 | ```yaml
209 | variables:
210 |   batch_inference_input_table:
211 |     description: The table name to be used for input to the batch inference workflow.
212 |     default: input_table
213 |
214 | targets:
215 |   dev:
216 |     variables:
217 |       batch_inference_input_table: dev_table
218 |   test:
219 |     variables:
220 |       batch_inference_input_table: test_table
221 |
222 | new_cluster: &new_cluster
223 |   new_cluster:
224 |     num_workers: 3
225 |     spark_version: 15.3.x-cpu-ml-scala2.12
226 |     node_type_id: Standard_D3_v2
227 |     custom_tags:
228 |       clusterSource: mlops-stacks_0.4
229 |
230 | resources:
231 |   jobs:
232 |     batch_inference_job:
233 |       name: ${bundle.target}-mlops-batch-inference-job
234 |       tasks:
235 |         - task_key: batch_inference_job
236 |           <<: *new_cluster
237 |           notebook_task:
238 |             notebook_path: ../deployment/batch_inference/notebooks/BatchInference.py
239 |             base_parameters:
240 |               env: ${bundle.target}
241 |               input_table_name: ${var.batch_inference_input_table}
242 |       ...
243 | ```
244 | The `batch_inference_job` notebook parameter `input_table_name` uses the bundle variable `batch_inference_input_table`, which has the default value "input_table".
245 | The variable value is overridden with "dev_table" for the `dev` environment config and with "test_table" for the `test` environment config:
246 | - during deployment with the `dev` environment config, the `input_table_name` parameter will get the value "dev_table"
247 | - during deployment with the `staging` environment config, the `input_table_name` parameter will get the value "input_table"
248 | - during deployment with the `prod` environment config, the `input_table_name` parameter will get the value "input_table"
249 | - during deployment with the `test` environment config, the `input_table_name` parameter will get the value "test_table"
250 |
251 | ### Test config changes
252 | To test out a config change, simply edit one of the fields above. For example, increase the cluster size by updating `num_workers` from 3 to 4.
253 |
254 | Then follow [Local development and dev workspace](#local-development-and-dev-workspace) to deploy the change to the dev workspace.
255 | Alternatively, you can open a PR. Continuous integration will then validate the updated config and deploy test resources to the staging workspace.
256 | ## Deploy config changes
257 |
258 | ### Dev workspace deployment
259 | Please refer to [Local development and dev workspace](#local-development-and-dev-workspace).
260 |
261 | ### Test workspace deployment (CI)
262 | After setting up CI/CD, PRs against the main branch will trigger CI workflows to run unit tests, integration tests and resource validation.
263 | The integration test will deploy the MLflow model, MLflow experiment and Databricks workflow resources defined under the `test` environment resource config to the staging workspace. The integration test then triggers a run of the model workflow to verify the ML code.
264 |
265 | ### Staging and Prod workspace deployment (CD)
266 | After merging a PR to the main branch, continuous deployment automation will deploy the `staging` resources to the staging workspace.
267 |
268 | When you are about to cut a release, create and merge a PR to merge changes from main into release. Continuous deployment automation will then deploy the `prod` resources to the prod workspace.
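For example, a minimal sketch of cutting a release, assuming the shared remote is named `upstream` (as in the earlier push example), that a long-lived `release` branch already exists, and with `cut-release` as a purely illustrative branch name:

```
# Branch off the latest main and push it to the shared remote.
git fetch upstream
git checkout -b cut-release upstream/main
git push upstream cut-release
# Then open a PR from cut-release into the release branch and merge it;
# continuous deployment will deploy the prod resources to the prod workspace.
```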
269 |
270 | [Back to project README](../README.md)
271 |
--------------------------------------------------------------------------------
/mlops-statcks-multiphase/resources/README.md:
--------------------------------------------------------------------------------
1 | # Databricks ML Resource Configurations
2 | [(back to project README)](../README.md)
3 |
4 | ## Table of contents
5 | * [Intro](#intro)
6 | * [Local development and dev workspace](#local-development-and-dev-workspace)
7 | * [Develop and test config changes](#develop-and-test-config-changes)
8 | * [CI/CD](#set-up-cicd)
9 | * [Deploy initial ML resources](#deploy-initial-ml-resources)
10 | * [Deploy config changes](#deploy-config-changes)
11 |
12 | ## Intro
13 |
14 | ### databricks CLI bundles
15 | MLOps Stacks ML resources are configured and deployed through [databricks CLI bundles](https://learn.microsoft.com/azure/databricks/dev-tools/cli/bundle-cli).
16 | The bundle settings file must be expressed in YAML format and must contain at minimum the top-level bundle mapping.
17 |
18 | The databricks CLI bundles top level is defined by the file `mlops/databricks.yml`.
19 | During databricks CLI bundles deployment, the root config file will be loaded, validated and deployed to the workspace specified by the deployment target, together with all the included resources.
20 |
21 | ML Resource Configurations in this directory:
22 | - model workflow (`mlops/resources/model-workflow-resource.yml`)
23 | - batch inference workflow (`mlops/resources/batch-inference-workflow-resource.yml`)
24 | - monitoring resource and workflow (`mlops/resources/monitoring-resource.yml`)
25 | - feature engineering workflow (`mlops/resources/feature-engineering-workflow-resource.yml`)
26 | - model definition and experiment definition (`mlops/resources/ml-artifacts-resource.yml`)
27 |
28 |
29 | ### Deployment Config & CI/CD integration
30 | The ML resources can be deployed to a Databricks workspace based on the databricks CLI bundles deployment config.
31 | Deployment configs of different deployment targets share the general ML resource configurations, with the added ability to specify deployment-target-specific values (workspace URI, model name, job notebook parameters, etc.).
32 | This project ships with CI/CD workflows for developing and deploying ML resource configurations based on the deployment config.
33 |
34 | For Model Registry in Unity Catalog, we expect a catalog to exist with the name of the deployment target by default. For example, if the deployment target is `dev`, we expect a catalog named `dev` to exist in the workspace.
35 | If you want to use different catalog names, please update the `targets` declared in the `mlops/databricks.yml` and `mlops/resources/ml-artifacts-resource.yml` files.
36 | If changing the `staging`, `prod`, or `test` deployment targets, you'll need to update the pipelines located in the `azure-pipelines` directory.
37 |
38 |
39 | | Deployment Target | Description | Databricks Workspace | Model Name | Experiment Name |
40 | |-------------------|-------------|----------------------|------------|-----------------|
41 | | dev | The `dev` deployment target is used by ML engineers to deploy ML resources to the development workspace with `dev` configs. The config is for ML project development purposes. | dev workspace | dev-mlops-model | /dev-mlops-experiment |
42 | | staging | The `staging` deployment target is part of the CD pipeline. The latest main branch content will be deployed to the staging workspace with the `staging` config. | staging workspace | staging-mlops-model | /staging-mlops-experiment |
43 | | prod | The `prod` deployment target is part of the CD pipeline. The latest release branch content will be deployed to the prod workspace with the `prod` config. | prod workspace | prod-mlops-model | /prod-mlops-experiment |
44 | | test | The `test` deployment target is part of the CI pipeline. For changes targeting the main branch, upon opening a PR, an integration test will be triggered and ML resources will be deployed to the staging workspace as defined under the `test` deployment target. | staging workspace | test-mlops-model | /test-mlops-experiment |
45 |
46 | During ML code development, you can deploy local ML resource configurations together with ML code to a Databricks workspace to run the training, model validation or batch inference pipelines. The deployment will use the `dev` config by default.
47 |
48 | You can open a PR (pull request) to modify ML code or the resource config against the main branch.
49 | The PR will trigger Python unit tests, followed by an integration test executed on the staging workspace, as defined under the `test` environment resource.
50 |
51 | Upon merging a PR to the main branch, the main branch content will be deployed to the staging workspace with the `staging` environment resource configurations.
52 |
53 | Upon merging code into the release branch, the release branch content will be deployed to the prod workspace with the `prod` environment resource configurations.
54 | ![ML resource config diagram](../../docs/images/mlops-stack-deploy.png)
55 |
56 | ## Local development and dev workspace
57 |
58 | ### Set up authentication
59 |
60 | To set up the databricks CLI using a Databricks personal access token, take the following steps:
61 |
62 | 1. Follow [databricks CLI](https://learn.microsoft.com/azure/databricks/dev-tools/cli/databricks-cli) to download and set up the databricks CLI locally.
63 | 2. Complete the `TODO` in `mlops/databricks.yml` to add the dev workspace URI under `targets.dev.workspace.host`.
64 | 3. [Create a personal access token](https://learn.microsoft.com/azure/databricks/dev-tools/auth/pat)
65 | in your dev workspace and copy it.
66 | 4. Set an env variable `DATABRICKS_TOKEN` with your Databricks personal access token in your terminal. For example, run `export DATABRICKS_TOKEN=dapi12345` if the access token is dapi12345.
67 | 5. You can now use the databricks CLI to validate and deploy ML resource configurations to the dev workspace.
68 |
69 | Alternatively, you can use the other approaches described in the [databricks CLI](https://learn.microsoft.com/azure/databricks/dev-tools/cli/databricks-cli) documentation to set up authentication. For example, using your Databricks username/password, or setting up a local profile.
70 |
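As a quick end-to-end check of the authentication setup, a minimal sketch that reuses the placeholder token from step 4 (replace it with your own) and the `mlops` bundle directory used in the next section:

```
# Use your own personal access token here; dapi12345 is the placeholder from step 4.
export DATABRICKS_TOKEN=dapi12345
cd mlops
# Should complete without authentication errors once the token and workspace host are set up.
databricks bundle validate
```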
71 | ### Validate and provision ML resource configurations
72 | 1. After installing the databricks CLI and creating the `DATABRICKS_TOKEN` env variable, change to the `mlops` directory.
73 | 2. Run `databricks bundle validate` to validate the Databricks resource configurations.
74 | 3. Run `databricks bundle deploy` to provision the Databricks resource configurations to the dev workspace. The resource configurations and your ML code will be copied together to the dev workspace. The defined resources such as Databricks Workflows, MLflow Model and MLflow Experiment will be provisioned according to the config files under `mlops/resources`.
75 | 4. Go to the Databricks dev workspace, check the status of the defined model, experiment and workflows, and interact with the created workflows.
76 |
77 | ### Destroy ML resource configurations
78 | After development is done, you can run `databricks bundle destroy` to destroy (remove) the defined Databricks resources in the dev workspace. Any model version in the `Production` or `Staging` stage will prevent the model from being deleted. Please update the version stage to `None` or `Archived` before destroying the ML resources.
79 | ## Set up CI/CD
80 | Please refer to [mlops-setup](../../docs/mlops-setup.md#configure-cicd) for instructions to set up CI/CD.
81 |
82 | ## Deploy initial ML resources
83 | After completing the prerequisites, create and push a PR branch adding all files to the Git repo:
84 | ```
85 | git checkout -b add-ml-resource-config-and-code
86 | git add .
87 | git commit -m "Add ML resource config and ML code"
88 | git push upstream add-ml-resource-config-and-code
89 | ```
90 | Open a pull request to merge the pushed branch into the `main` branch.
91 | Upon creating this PR, the CI workflows will be triggered.
92 | These CI workflows will run unit and integration tests of the ML code,
93 | in addition to validating the Databricks resources to be deployed to both the staging and prod workspaces.
94 | Once CI passes, merge the PR into the `main` branch. This will deploy an initial set of Databricks resources to the staging workspace.
95 | Resources will be deployed to the prod workspace when code is pushed to the `release` branch.
96 |
97 | Follow the next section to configure the input and output data tables for the batch inference job.
98 |
99 | ### Setting up the batch inference job
100 | The batch inference job expects an input Delta table with a schema that your registered model accepts. To use the batch
101 | inference job, set up such a Delta table in both your staging and prod workspaces.
102 | Following this, update the `batch_inference_job` base parameters in `mlops/resources/batch-inference-workflow-resource.yml` to pass
103 | the name of the input Delta table and the name of the output Delta table to which to write batch predictions.
104 |
105 | As the batch job will be run with the credentials of the service principal that provisioned it, make sure that the service
106 | principal corresponding to a particular environment has permissions to read the input Delta table and modify the output Delta table in that environment's workspace. If the Delta table is in [Unity Catalog](https://www.databricks.com/product/unity-catalog), these permissions are:
107 |
108 | * `USAGE` permissions for the catalog and schema of the input and output table.
109 | * `SELECT` permission for the input table.
110 | * `MODIFY` permission for the output table if it pre-dates your job.
111 |
112 | ### Setting up model validation
113 | The model validation workflow focuses on building a plug-and-play stack component for continuous deployment (CD) of models
114 | in staging and prod.
115 | Its central purpose is to evaluate a registered model and validate its quality before deploying the model to Production/Staging.
116 |
117 | Model validation contains three components:
118 | * [model-workflow-resource.yml](./model-workflow-resource.yml) contains the resource config and input parameters for model validation.
119 | * [validation.py](../validation/validation.py) defines custom metrics and validation thresholds that are referenced by the above resource config files.
120 | * [notebooks/ModelValidation](../validation/notebooks/ModelValidation.py) contains the validation job implementation. In most cases you don't need to modify this file.
121 |
122 | To set up and enable model validation, update [validation.py](../validation/validation.py) to return the desired custom metrics and validation thresholds, then
123 | resolve the `TODOs` in the ModelValidation task of [model-workflow-resource.yml](./model-workflow-resource.yml).
124 |
125 |
126 | ### Setting up monitoring
127 | The monitoring workflow focuses on building a plug-and-play stack component for monitoring feature and model drift and retraining based on a
128 | violation threshold defined against the ground truth labels.
129 |
130 | Its central purpose is to track production model performance and feature distributions, and to compare different model versions.
131 |
132 | Monitoring contains three components:
133 | * [metric_violation_check_query.py](../monitoring/metric_violation_check_query.py) defines a query that checks for violation of the monitored metric.
134 | * [notebooks/MonitoredMetricViolationCheck](../monitoring/notebooks/MonitoredMetricViolationCheck.py) acts as an entry point, executing the violation check query against the monitored inference table.
135 | It emits a boolean value based on the query result.
136 | * [monitoring-resource.yml](./monitoring-resource.yml) contains the resource config and input parameters for monitoring, and orchestrates model retraining based on monitoring. It first runs the [notebooks/MonitoredMetricViolationCheck](../monitoring/notebooks/MonitoredMetricViolationCheck.py)
137 | entry point and then decides whether to execute the model retraining workflow.
138 |
139 | To set up and enable monitoring:
140 | * If not done already, generate the inference table, join it with the ground truth labels, and update the table name in [monitoring-resource.yml](./monitoring-resource.yml).
141 | * Resolve the `TODOs` in [monitoring-resource.yml](./monitoring-resource.yml).
142 | * Uncomment the monitoring workflow in [databricks.yml](../databricks.yml).
143 | * OPTIONAL: Update the query in [metric_violation_check_query.py](../monitoring/metric_violation_check_query.py) to customize when the metric is considered to be in violation.
144 |
145 | NOTE: If ground truth labels are not available, you can still set up monitoring but should disable the retraining workflow.
146 |
147 | Retraining Constraints:
148 | The retraining job has constraints for optimal functioning:
149 | * Labels must be provided by the user, joined correctly for the retraining history, and available in time for the retraining frequency.
150 | * Retraining frequency is tightly coupled with the granularity of the monitor. Ensure that the retraining frequency is equal to, or close to, the granularity of the monitor.
151 | * If the granularity of the monitor is 1 day and the retraining frequency is 1 hour, the job will preemptively stop as there is no new data to evaluate the retraining criteria.
152 | * If the granularity of the monitor is 1 day and the retraining frequency is 1 week, retraining would be stale and inefficient.
153 |
154 | Permissions:
155 | Permissions for monitoring are inherited from the original table's permissions.
156 | * Users who own the monitored table or its parent catalog/schema can create, update, and view monitors.
157 | * Users with read permissions on the monitored table can view its monitor.
158 |
159 | Therefore, ensure that service principals are the owners of, or have the necessary permissions to manage, the monitored table.
160 |
161 | ## Develop and test config changes
162 |
163 | ### databricks CLI bundles schema overview
164 | To get started, open `mlops/resources/batch-inference-workflow-resource.yml`. The file contains the ML resource definition of a batch inference job, like:
165 |
166 | ```yaml
167 | new_cluster: &new_cluster
168 |   new_cluster:
169 |     num_workers: 3
170 |     spark_version: 15.3.x-cpu-ml-scala2.12
171 |     node_type_id: Standard_D3_v2
172 |     custom_tags:
173 |       clusterSource: mlops-stacks_0.4
174 |
175 | resources:
176 |   jobs:
177 |     batch_inference_job:
178 |       name: ${bundle.target}-mlops-batch-inference-job
179 |       tasks:
180 |         - task_key: batch_inference_job
181 |           <<: *new_cluster
182 |           notebook_task:
183 |             notebook_path: ../deployment/batch_inference/notebooks/BatchInference.py
184 |             base_parameters:
185 |               env: ${bundle.target}
186 |               input_table_name: batch_inference_input_table_name
187 |       ...
188 | ```
189 |
190 | The example above defines a Databricks job with the name `${bundle.target}-mlops-batch-inference-job`
191 | that runs the notebook under `mlops/deployment/batch_inference/notebooks/BatchInference.py` to regularly apply your ML model for batch inference.
192 |
193 | At the start of the resource definition, we declare an anchor `new_cluster` that is referenced and reused later. For more information about anchors in YAML, please refer to the [YAML documentation](https://yaml.org/spec/1.2.2/#3222-anchors-and-aliases).
194 |
195 | We specify a `batch_inference_job` under `resources/jobs` to define a Databricks workflow with internal key `batch_inference_job` and job name `${bundle.target}-mlops-batch-inference-job`.
196 | The workflow contains a single task with task key `batch_inference_job`. The task runs the notebook `../deployment/batch_inference/notebooks/BatchInference.py`, passing the parameters `env` and `input_table_name` to the notebook.
197 | After setting up the databricks CLI, you can run the command `databricks bundle schema` to learn more about the databricks CLI bundles schema.
198 |
199 | The `notebook_path` is resolved relative to the resource YAML file.
200 |
201 | ### Environment config based variables
202 | `${bundle.target}` will be replaced by the environment config name during bundle deployment. For example, during the deployment of a `test` environment config, the job name will be
203 | `test-mlops-batch-inference-job`. During the deployment of the `staging` environment config, the job name will be
204 | `staging-mlops-batch-inference-job`.
205 |
206 |
207 | To use different values for different environments, you can define bundle variables and override them per target, for example:
208 | ```yaml
209 | variables:
210 |   batch_inference_input_table:
211 |     description: The table name to be used for input to the batch inference workflow.
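    # Targets that do not override this variable (e.g. the staging and prod
    # environment configs in this example) fall back to the default value below.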
212 |     default: input_table
213 |
214 | targets:
215 |   dev:
216 |     variables:
217 |       batch_inference_input_table: dev_table
218 |   test:
219 |     variables:
220 |       batch_inference_input_table: test_table
221 |
222 | new_cluster: &new_cluster
223 |   new_cluster:
224 |     num_workers: 3
225 |     spark_version: 15.3.x-cpu-ml-scala2.12
226 |     node_type_id: Standard_D3_v2
227 |     custom_tags:
228 |       clusterSource: mlops-stacks_0.4
229 |
230 | resources:
231 |   jobs:
232 |     batch_inference_job:
233 |       name: ${bundle.target}-mlops-batch-inference-job
234 |       tasks:
235 |         - task_key: batch_inference_job
236 |           <<: *new_cluster
237 |           notebook_task:
238 |             notebook_path: ../deployment/batch_inference/notebooks/BatchInference.py
239 |             base_parameters:
240 |               env: ${bundle.target}
241 |               input_table_name: ${var.batch_inference_input_table}
242 |       ...
243 | ```
244 | The `batch_inference_job` notebook parameter `input_table_name` uses the bundle variable `batch_inference_input_table`, which has the default value "input_table".
245 | The variable value is overridden with "dev_table" for the `dev` environment config and with "test_table" for the `test` environment config:
246 | - during deployment with the `dev` environment config, the `input_table_name` parameter will get the value "dev_table"
247 | - during deployment with the `staging` environment config, the `input_table_name` parameter will get the value "input_table"
248 | - during deployment with the `prod` environment config, the `input_table_name` parameter will get the value "input_table"
249 | - during deployment with the `test` environment config, the `input_table_name` parameter will get the value "test_table"
250 |
251 | ### Test config changes
252 | To test out a config change, simply edit one of the fields above. For example, increase the cluster size by updating `num_workers` from 3 to 4.
253 |
254 | Then follow [Local development and dev workspace](#local-development-and-dev-workspace) to deploy the change to the dev workspace.
255 | Alternatively, you can open a PR. Continuous integration will then validate the updated config and deploy test resources to the staging workspace.
256 | ## Deploy config changes
257 |
258 | ### Dev workspace deployment
259 | Please refer to [Local development and dev workspace](#local-development-and-dev-workspace).
260 |
261 | ### Test workspace deployment (CI)
262 | After setting up CI/CD, PRs against the main branch will trigger CI workflows to run unit tests, integration tests and resource validation.
263 | The integration test will deploy the MLflow model, MLflow experiment and Databricks workflow resources defined under the `test` environment resource config to the staging workspace. The integration test then triggers a run of the model workflow to verify the ML code.
264 |
265 | ### Staging and Prod workspace deployment (CD)
266 | After merging a PR to the main branch, continuous deployment automation will deploy the `staging` resources to the staging workspace.
267 |
268 | When you are about to cut a release, create and merge a PR to merge changes from main into release. Continuous deployment automation will then deploy the `prod` resources to the prod workspace.
269 |
270 | [Back to project README](../README.md)
271 |
--------------------------------------------------------------------------------