├── mlops-statcks-multiphase ├── __init__.py ├── tests │ ├── __init__.py │ ├── training │ │ ├── __init__.py │ │ └── test_notebooks.py │ └── feature_engineering │ │ ├── __init__.py │ │ ├── dropoff_features_test.py │ │ └── pickup_features_test.py ├── .gitignore ├── feature_engineering │ ├── __init__.py │ ├── features │ │ ├── __init__.py │ │ ├── pickup_features.py │ │ └── dropoff_features.py │ ├── README.md │ └── notebooks │ │ └── GenerateAndWriteFeatures.py ├── project_params.json ├── validation │ ├── README.md │ ├── validation.py │ └── notebooks │ │ └── ModelValidation.py ├── pytest.ini ├── requirements.txt ├── monitoring │ ├── README.md │ ├── notebooks │ │ └── MonitoredMetricViolationCheck.py │ └── metric_violation_check_query.py ├── deployment │ ├── batch_inference │ │ ├── predict.py │ │ ├── README.md │ │ └── notebooks │ │ │ └── BatchInference.py │ └── model_deployment │ │ ├── deploy.py │ │ └── notebooks │ │ └── ModelDeployment.py ├── resources │ ├── ml-artifacts-resource.yml │ ├── batch-inference-workflow-resource.yml │ ├── feature-engineering-workflow-resource.yml │ ├── monitoring-resource.yml │ ├── model-workflow-resource.yml │ └── README.md ├── databricks.yml ├── tmp │ ├── phase2.yml │ ├── phase1.yml │ └── README.md ├── training │ └── notebooks │ │ └── TrainWithFeatureStore.py └── README.md ├── jdemo ├── src │ ├── jdemo │ │ ├── __init__.py │ │ └── main.py │ └── notebook.ipynb ├── pytest.ini ├── tests │ ├── main_test.py │ └── main_test.old ├── scratch │ ├── README.md │ └── exploration.ipynb ├── resources │ ├── variables.yml │ └── jdemo.job.yml ├── java-code │ ├── src │ │ └── main │ │ │ └── java │ │ │ └── net │ │ │ └── alexott │ │ │ └── demos │ │ │ └── SparkDemo.java │ └── pom.xml ├── fixtures │ └── .gitkeep ├── requirements-dev.txt ├── setup.py ├── README.md ├── databricks.yml └── azure-pipelines.yml ├── vars_demo ├── .gitignore ├── resources │ ├── variables.yml │ └── vars_demo.job.yml ├── databricks.yml ├── src │ └── notebook.ipynb └── README.md ├── integration-tests ├── .gitignore ├── images │ └── integration_test.png ├── resources │ ├── dabs1.job.yml │ └── integration_test.yml ├── databricks.yml ├── README.md └── src │ ├── setup_test.ipynb │ ├── cleanup_test.ipynb │ ├── main_nb.ipynb │ └── validate_test.ipynb ├── README.md └── .gitignore /mlops-statcks-multiphase/__init__.py: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /mlops-statcks-multiphase/tests/__init__.py: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /jdemo/src/jdemo/__init__.py: -------------------------------------------------------------------------------- 1 | __version__ = "0.0.1" 2 | -------------------------------------------------------------------------------- /mlops-statcks-multiphase/tests/training/__init__.py: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /mlops-statcks-multiphase/.gitignore: -------------------------------------------------------------------------------- 1 | 2 | .databricks 3 | -------------------------------------------------------------------------------- /mlops-statcks-multiphase/feature_engineering/__init__.py: -------------------------------------------------------------------------------- 1 | 
-------------------------------------------------------------------------------- /mlops-statcks-multiphase/tests/feature_engineering/__init__.py: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /mlops-statcks-multiphase/feature_engineering/features/__init__.py: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /jdemo/pytest.ini: -------------------------------------------------------------------------------- 1 | [pytest] 2 | testpaths = tests 3 | pythonpath = src 4 | -------------------------------------------------------------------------------- /jdemo/tests/main_test.py: -------------------------------------------------------------------------------- 1 | 2 | 3 | def test_main(): 4 | assert True -------------------------------------------------------------------------------- /mlops-statcks-multiphase/project_params.json: -------------------------------------------------------------------------------- 1 | { 2 | "input_cloud": "azure", 3 | "input_include_feature_store": "yes" 4 | } 5 | -------------------------------------------------------------------------------- /vars_demo/.gitignore: -------------------------------------------------------------------------------- 1 | .databricks/ 2 | build/ 3 | dist/ 4 | __pycache__/ 5 | *.egg-info 6 | .venv/ 7 | scratch/** 8 | !scratch/README.md 9 | -------------------------------------------------------------------------------- /integration-tests/.gitignore: -------------------------------------------------------------------------------- 1 | .databricks/ 2 | build/ 3 | dist/ 4 | __pycache__/ 5 | *.egg-info 6 | .venv/ 7 | scratch/** 8 | !scratch/README.md 9 | -------------------------------------------------------------------------------- /integration-tests/images/integration_test.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/alexott/dabs-playground/main/integration-tests/images/integration_test.png -------------------------------------------------------------------------------- /jdemo/tests/main_test.old: -------------------------------------------------------------------------------- 1 | from jdemo.main import get_taxis, get_spark 2 | 3 | 4 | def test_main(): 5 | taxis = get_taxis(get_spark()) 6 | assert taxis.count() > 5 7 | -------------------------------------------------------------------------------- /jdemo/scratch/README.md: -------------------------------------------------------------------------------- 1 | # scratch 2 | 3 | This folder is reserved for personal, exploratory notebooks. 4 | By default these are not committed to Git, as 'scratch' is listed in .gitignore. 
5 | 
--------------------------------------------------------------------------------
/mlops-statcks-multiphase/validation/README.md:
--------------------------------------------------------------------------------
1 | # Model Validation
2 | To enable model validation as part of a scheduled Databricks workflow, please refer to [mlops/resources/README.md](../resources/README.md)
3 | 
--------------------------------------------------------------------------------
/mlops-statcks-multiphase/pytest.ini:
--------------------------------------------------------------------------------
1 | # Configure pytest to detect local modules in the current directory
2 | # See https://docs.pytest.org/en/7.1.x/reference/reference.html#confval-pythonpath for details
3 | [pytest]
4 | pythonpath = .
5 | 
--------------------------------------------------------------------------------
/mlops-statcks-multiphase/requirements.txt:
--------------------------------------------------------------------------------
1 | mlflow==2.11.3
2 | numpy>=1.23.0
3 | pandas==1.5.3
4 | scikit-learn>=1.1.1
5 | matplotlib>=3.5.2
6 | pillow>=10.0.1
7 | Jinja2==3.0.3
8 | pyspark~=3.3.0
9 | pytz~=2022.2.1
10 | pytest>=7.1.2
11 | 
--------------------------------------------------------------------------------
/jdemo/resources/variables.yml:
--------------------------------------------------------------------------------
1 | variables:
2 |   instance_pool_name:
3 |     description: Name of the instance pool to use
4 |     default: TFTests
5 |   instance_pool_id:
6 |     description: ID of instance pool
7 |     lookup:
8 |       instance_pool: ${var.instance_pool_name}
9 | 
--------------------------------------------------------------------------------
/mlops-statcks-multiphase/feature_engineering/README.md:
--------------------------------------------------------------------------------
1 | # Feature Engineering
2 | To set up the feature engineering job via a scheduled Databricks workflow, please refer to [mlops/resources/README.md](../resources/README.md)
3 | 
4 | For additional details on using the feature store, please refer to [the project-level README](../README.md).
5 | 
--------------------------------------------------------------------------------
/mlops-statcks-multiphase/tests/training/test_notebooks.py:
--------------------------------------------------------------------------------
1 | import pathlib
2 | 
3 | 
4 | def test_notebook_format():
5 |     # Verify that all Databricks notebooks have the required header
6 |     paths = list(pathlib.Path("./notebooks").glob("**/*.py"))
7 |     for f in paths:
8 |         notebook_str = open(str(f)).read()
9 |         assert notebook_str.startswith("# Databricks notebook source")
10 | 
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # dabs-playground
2 | 
3 | A collection of examples around Databricks Asset Bundles (DABs).
4 | 
5 | 
6 | - [integration-tests](integration-tests) - a DAB example showing how to redefine a resource on a per-target basis, emulating wrapping the code in an integration test that has additional tasks.
7 | - [jdemo](jdemo) - a DAB example that deploys both Python wheel and JAR artefacts as job tasks.
8 | - [mlops-statcks-multiphase](mlops-statcks-multiphase) - a DAB based on the `mlops-stacks` template, customized to deploy quality monitoring in a separate "stage".
9 | - [vars_demo](vars_demo) - demonstrates how to use complex and lookup variables in DABs.
10 | -------------------------------------------------------------------------------- /jdemo/java-code/src/main/java/net/alexott/demos/SparkDemo.java: -------------------------------------------------------------------------------- 1 | package net.alexott.demos; 2 | 3 | import org.apache.spark.sql.SparkSession; 4 | import org.apache.spark.sql.Dataset; 5 | import org.apache.spark.sql.Row; 6 | 7 | public class SparkDemo { 8 | public static void main(String[] args) { 9 | System.out.println("Creating Spark Session!"); 10 | SparkSession spark = SparkSession.builder() 11 | .appName("SparkDemo") 12 | .getOrCreate(); 13 | System.out.println("Going to read data!"); 14 | Dataset df = spark.read().table("samples.nyctaxi.trips"); 15 | df.show(10, false); 16 | } 17 | } 18 | -------------------------------------------------------------------------------- /vars_demo/resources/variables.yml: -------------------------------------------------------------------------------- 1 | variables: 2 | notification_settings: 3 | description: "Webhook notification config" 4 | type: complex 5 | default: {} 6 | notification_name: 7 | description: "Name of the notification destination" 8 | default: "" 9 | notification_id: 10 | description: "ID of the notification destination (placeholder)" 11 | default: "" 12 | 13 | targets: 14 | prod: 15 | variables: 16 | notification_name: "Slack native" 17 | notification_id: 18 | lookup: 19 | notification_destination: ${var.notification_name} 20 | notification_settings: 21 | on_failure: 22 | - id: ${var.notification_id} 23 | -------------------------------------------------------------------------------- /mlops-statcks-multiphase/monitoring/README.md: -------------------------------------------------------------------------------- 1 | # Monitoring 2 | 3 | To enable monitoring as part of a scheduled Databricks workflow, please: 4 | - Create the inference table that you want to monitor and was passed in as an initialization parameter. 5 | - Update all the TODOs in the [monitoring resource file](../resources/monitoring-resource.yml). 6 | - Uncomment the monitoring workflow from the main Databricks Asset Bundles file [databricks.yml](../databricks.yml). 7 | 8 | For more details, refer to [mlops/resources/README.md](../resources/README.md). 9 | The implementation supports monitoring of batch inference tables directly. 10 | For real time inference tables, unpacking is required before monitoring can be attached. 11 | -------------------------------------------------------------------------------- /jdemo/src/jdemo/main.py: -------------------------------------------------------------------------------- 1 | from pyspark.sql import SparkSession, DataFrame 2 | 3 | def get_taxis(spark: SparkSession) -> DataFrame: 4 | return spark.read.table("samples.nyctaxi.trips") 5 | 6 | 7 | # Create a new Databricks Connect session. If this fails, 8 | # check that you have configured Databricks Connect correctly. 9 | # See https://docs.databricks.com/dev-tools/databricks-connect.html. 
10 | def get_spark() -> SparkSession: 11 | try: 12 | from databricks.connect import DatabricksSession 13 | return DatabricksSession.builder.getOrCreate() 14 | except ImportError: 15 | return SparkSession.builder.getOrCreate() 16 | 17 | def main(): 18 | get_taxis(get_spark()).show(10, truncate=False) 19 | 20 | if __name__ == '__main__': 21 | main() 22 | -------------------------------------------------------------------------------- /jdemo/fixtures/.gitkeep: -------------------------------------------------------------------------------- 1 | # Fixtures 2 | 3 | This folder is reserved for fixtures, such as CSV files. 4 | 5 | Below is an example of how to load fixtures as a data frame: 6 | 7 | ``` 8 | import pandas as pd 9 | import os 10 | 11 | def get_absolute_path(*relative_parts): 12 | if 'dbutils' in globals(): 13 | base_dir = os.path.dirname(dbutils.notebook.entry_point.getDbutils().notebook().getContext().notebookPath().get()) # type: ignore 14 | path = os.path.normpath(os.path.join(base_dir, *relative_parts)) 15 | return path if path.startswith("/Workspace") else "/Workspace" + path 16 | else: 17 | return os.path.join(*relative_parts) 18 | 19 | csv_file = get_absolute_path("..", "fixtures", "mycsv.csv") 20 | df = pd.read_csv(csv_file) 21 | display(df) 22 | ``` 23 | -------------------------------------------------------------------------------- /vars_demo/databricks.yml: -------------------------------------------------------------------------------- 1 | # This is a Databricks asset bundle definition for vars_demo. 2 | # See https://docs.databricks.com/dev-tools/bundles/index.html for documentation. 3 | bundle: 4 | name: vars_demo 5 | uuid: aae7faf4-420e-48be-a691-ca6981943be4 6 | 7 | include: 8 | - resources/*.yml 9 | 10 | targets: 11 | dev: 12 | # The default target uses 'mode: development' to create a development copy. 13 | # - Deployed resources get prefixed with '[dev my_user_name]' 14 | # - Any job schedules and triggers are paused by default. 15 | # See also https://docs.databricks.com/dev-tools/bundles/deployment-modes.html. 16 | mode: development 17 | default: true 18 | 19 | prod: 20 | mode: production 21 | workspace: 22 | root_path: /Workspace/Project/${bundle.name}/${bundle.target} 23 | -------------------------------------------------------------------------------- /integration-tests/resources/dabs1.job.yml: -------------------------------------------------------------------------------- 1 | # The main job for dabs1. 2 | resources: 3 | jobs: 4 | dabs1_job: 5 | name: dabs1_job 6 | 7 | trigger: 8 | # Run this job every day, exactly one day from the last run; see https://docs.databricks.com/api/workspace/jobs/create#trigger 9 | periodic: 10 | interval: 1 11 | unit: DAYS 12 | 13 | #email_notifications: 14 | # on_failure: 15 | # - your_email@example.com 16 | 17 | tasks: 18 | - task_key: notebook_task 19 | notebook_task: 20 | notebook_path: ../src/main_nb.ipynb 21 | base_parameters: 22 | catalog: main 23 | schema: default 24 | table: nsg_logs 25 | target_table: nsg_logs_copy 26 | 27 | -------------------------------------------------------------------------------- /vars_demo/resources/vars_demo.job.yml: -------------------------------------------------------------------------------- 1 | # The main job for vars_demo. 
2 | resources: 3 | jobs: 4 | vars_demo_job: 5 | name: vars_demo_job 6 | 7 | trigger: 8 | # Run this job every day, exactly one day from the last run; see https://docs.databricks.com/api/workspace/jobs/create#trigger 9 | periodic: 10 | interval: 1 11 | unit: DAYS 12 | 13 | webhook_notifications: ${var.notification_settings} 14 | 15 | tasks: 16 | - task_key: notebook_task 17 | job_cluster_key: job_cluster 18 | notebook_task: 19 | notebook_path: ../src/notebook.ipynb 20 | 21 | job_clusters: 22 | - job_cluster_key: job_cluster 23 | new_cluster: 24 | spark_version: 15.4.x-scala2.12 25 | node_type_id: Standard_D3_v2 26 | data_security_mode: SINGLE_USER 27 | autoscale: 28 | min_workers: 1 29 | max_workers: 4 30 | -------------------------------------------------------------------------------- /jdemo/requirements-dev.txt: -------------------------------------------------------------------------------- 1 | ## requirements-dev.txt: dependencies for local development. 2 | ## 3 | ## For defining dependencies used by jobs in Databricks Workflows, see 4 | ## https://docs.databricks.com/dev-tools/bundles/library-dependencies.html 5 | 6 | ## Add code completion support for DLT 7 | #databricks-dlt 8 | 9 | ## pytest is the default package used for testing 10 | pytest 11 | 12 | ## Dependencies for building wheel files 13 | setuptools 14 | wheel 15 | 16 | ## databricks-connect can be used to run parts of this project locally. 17 | ## See https://docs.databricks.com/dev-tools/databricks-connect.html. 18 | ## 19 | ## databricks-connect is automatically installed if you're using Databricks 20 | ## extension for Visual Studio Code 21 | ## (https://docs.databricks.com/dev-tools/vscode-ext/dev-tasks/databricks-connect.html). 22 | ## 23 | ## To manually install databricks-connect, either follow the instructions 24 | ## at https://docs.databricks.com/dev-tools/databricks-connect.html 25 | ## to install the package system-wide. Or uncomment the line below to install a 26 | ## version of db-connect that corresponds to the Databricks Runtime version used 27 | ## for this project. 28 | # 29 | # databricks-connect>=15.4,<15.5 30 | -------------------------------------------------------------------------------- /integration-tests/databricks.yml: -------------------------------------------------------------------------------- 1 | # This is a Databricks asset bundle definition for dabs1. 2 | # See https://docs.databricks.com/dev-tools/bundles/index.html for documentation. 3 | bundle: 4 | name: dabs1 5 | uuid: 734206ae-7fb4-4d2b-aa5f-7650f5954c17 6 | 7 | include: 8 | - resources/*.yml 9 | 10 | targets: 11 | dev: 12 | # The default target uses 'mode: development' to create a development copy. 13 | # - Deployed resources get prefixed with '[dev my_user_name]' 14 | # - Any job schedules and triggers are paused by default. 15 | # See also https://docs.databricks.com/dev-tools/bundles/deployment-modes.html. 16 | mode: development 17 | default: true 18 | 19 | test: 20 | mode: development 21 | presets: 22 | name_prefix: "[Integration test ${workspace.current_user.short_name}] " 23 | workspace: 24 | root_path: /Workspace/Users/${workspace.current_user.userName}/.bundle/${bundle.name}/${bundle.target} 25 | 26 | prod: 27 | mode: production 28 | workspace: 29 | # We explicitly deploy to current user folder to make sure we only have a single copy. 
30 | root_path: /Workspace/Users/${workspace.current_user.userName}/.bundle/${bundle.name}/${bundle.target} 31 | permissions: 32 | - user_name: ${workspace.current_user.userName} 33 | level: CAN_MANAGE 34 | -------------------------------------------------------------------------------- /mlops-statcks-multiphase/deployment/batch_inference/predict.py: -------------------------------------------------------------------------------- 1 | import mlflow 2 | from pyspark.sql.functions import struct, lit, to_timestamp 3 | 4 | 5 | def predict_batch( 6 | spark_session, model_uri, input_table_name, output_table_name, model_version, ts 7 | ): 8 | """ 9 | Apply the model at the specified URI for batch inference on the table with name input_table_name, 10 | writing results to the table with name output_table_name 11 | """ 12 | 13 | mlflow.set_registry_uri("databricks-uc") 14 | 15 | table = spark_session.table(input_table_name) 16 | 17 | 18 | from databricks.feature_engineering import FeatureEngineeringClient 19 | 20 | fe_client = FeatureEngineeringClient() 21 | 22 | prediction_df = fe_client.score_batch(model_uri = model_uri, df = table) 23 | 24 | output_df = ( 25 | prediction_df.withColumn("prediction", prediction_df["prediction"]) 26 | .withColumn("model_version", lit(model_version)) 27 | .withColumn("inference_timestamp", to_timestamp(lit(ts))) 28 | ) 29 | output_df.display() 30 | 31 | # Model predictions are written to the Delta table provided as input. 32 | # Delta is the default format in Databricks Runtime 8.0 and above. 33 | output_df.write.format("delta").mode("overwrite").saveAsTable(output_table_name) -------------------------------------------------------------------------------- /jdemo/setup.py: -------------------------------------------------------------------------------- 1 | """ 2 | setup.py configuration script describing how to build and package this project. 3 | 4 | This file is primarily used by the setuptools library and typically should not 5 | be executed directly. See README.md for how to deploy, test, and run 6 | the jdemo project. 7 | """ 8 | from setuptools import setup, find_packages 9 | 10 | import sys 11 | sys.path.append('./src') 12 | 13 | import datetime 14 | import jdemo 15 | 16 | setup( 17 | name="jdemo", 18 | # We use timestamp as Local version identifier (https://peps.python.org/pep-0440/#local-version-identifiers.) 19 | # to ensure that changes to wheel package are picked up when used on all-purpose clusters 20 | version=jdemo.__version__, 21 | url="https://test.com", 22 | author="user@domain.com", 23 | description="wheel file based on jdemo/src", 24 | packages=find_packages(where='./src'), 25 | package_dir={'': 'src'}, 26 | entry_points={ 27 | "packages": [ 28 | "main=jdemo.main:main" 29 | ] 30 | }, 31 | install_requires=[ 32 | # Dependencies in case the output wheel file is used as a library dependency. 
33 | # For defining dependencies, when this package is used in Databricks, see: 34 | # https://docs.databricks.com/dev-tools/bundles/library-dependencies.html 35 | "setuptools" 36 | ], 37 | ) 38 | -------------------------------------------------------------------------------- /mlops-statcks-multiphase/resources/ml-artifacts-resource.yml: -------------------------------------------------------------------------------- 1 | # Allow users to read the experiment 2 | common_permissions: &permissions 3 | permissions: 4 | - level: CAN_READ 5 | group_name: users 6 | 7 | # Allow users to execute models in Unity Catalog 8 | grants: &grants 9 | grants: 10 | - privileges: 11 | - EXECUTE 12 | principal: account users 13 | 14 | # Defines model and experiments 15 | model: &model 16 | model: 17 | name: ${var.model_name} 18 | catalog_name: ${var.current_target} 19 | schema_name: mlops 20 | comment: Registered model in Unity Catalog for the "mlops" ML Project for ${var.current_target} deployment target. 21 | <<: *grants 22 | 23 | experiment: &experiment 24 | experiment: 25 | name: ${var.experiment_name} 26 | <<: *permissions 27 | description: MLflow Experiment used to track runs for mlops project. 28 | 29 | 30 | targets: 31 | dev-phase1: 32 | resources: 33 | experiments: 34 | <<: *experiment 35 | registered_models: 36 | <<: *model 37 | 38 | test-phase1: 39 | resources: 40 | experiments: 41 | <<: *experiment 42 | registered_models: 43 | <<: *model 44 | 45 | prod-phase1: 46 | resources: 47 | experiments: 48 | <<: *experiment 49 | registered_models: 50 | <<: *model 51 | -------------------------------------------------------------------------------- /jdemo/scratch/exploration.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 2, 6 | "metadata": {}, 7 | "outputs": [], 8 | "source": [ 9 | "%load_ext autoreload\n", 10 | "%autoreload 2" 11 | ] 12 | }, 13 | { 14 | "cell_type": "code", 15 | "execution_count": null, 16 | "metadata": { 17 | "application/vnd.databricks.v1+cell": { 18 | "cellMetadata": { 19 | "byteLimit": 2048000, 20 | "rowLimit": 10000 21 | }, 22 | "inputWidgets": {}, 23 | "nuid": "6bca260b-13d1-448f-8082-30b60a85c9ae", 24 | "showTitle": false, 25 | "title": "" 26 | } 27 | }, 28 | "outputs": [], 29 | "source": [ 30 | "import sys\n", 31 | "sys.path.append('../src')\n", 32 | "from jdemo import main\n", 33 | "\n", 34 | "main.get_taxis(spark).show(10)" 35 | ] 36 | } 37 | ], 38 | "metadata": { 39 | "application/vnd.databricks.v1+notebook": { 40 | "dashboards": [], 41 | "language": "python", 42 | "notebookMetadata": { 43 | "pythonIndentUnit": 2 44 | }, 45 | "notebookName": "ipynb-notebook", 46 | "widgets": {} 47 | }, 48 | "kernelspec": { 49 | "display_name": "Python 3", 50 | "language": "python", 51 | "name": "python3" 52 | }, 53 | "language_info": { 54 | "name": "python", 55 | "version": "3.11.4" 56 | } 57 | }, 58 | "nbformat": 4, 59 | "nbformat_minor": 0 60 | } 61 | -------------------------------------------------------------------------------- /mlops-statcks-multiphase/tests/feature_engineering/dropoff_features_test.py: -------------------------------------------------------------------------------- 1 | import pyspark.sql 2 | import pytest 3 | import pandas as pd 4 | from datetime import datetime 5 | from pyspark.sql import SparkSession 6 | 7 | from mlops.feature_engineering.features.dropoff_features import ( 8 | compute_features_fn, 9 | ) 10 | 11 | 12 | @pytest.fixture(scope="session") 13 | 
def spark(request): 14 | """fixture for creating a spark session 15 | Args: 16 | request: pytest.FixtureRequest object 17 | """ 18 | spark = ( 19 | SparkSession.builder.master("local[1]") 20 | .appName("pytest-pyspark-local-testing") 21 | .getOrCreate() 22 | ) 23 | request.addfinalizer(lambda: spark.stop()) 24 | 25 | return spark 26 | 27 | 28 | @pytest.mark.usefixtures("spark") 29 | def test_dropoff_features_fn(spark): 30 | input_df = pd.DataFrame( 31 | { 32 | "tpep_pickup_datetime": [datetime(2022, 1, 10)], 33 | "tpep_dropoff_datetime": [datetime(2022, 1, 10)], 34 | "dropoff_zip": [94400], 35 | "trip_distance": [2], 36 | "fare_amount": [100], 37 | } 38 | ) 39 | spark_df = spark.createDataFrame(input_df) 40 | output_df = compute_features_fn( 41 | spark_df, "tpep_pickup_datetime", datetime(2022, 1, 1), datetime(2022, 1, 15) 42 | ) 43 | assert isinstance(output_df, pyspark.sql.DataFrame) 44 | assert output_df.count() == 1 45 | -------------------------------------------------------------------------------- /mlops-statcks-multiphase/tests/feature_engineering/pickup_features_test.py: -------------------------------------------------------------------------------- 1 | import pyspark.sql 2 | import pytest 3 | import pandas as pd 4 | from datetime import datetime 5 | from pyspark.sql import SparkSession 6 | 7 | from mlops.feature_engineering.features.pickup_features import compute_features_fn 8 | 9 | 10 | @pytest.fixture(scope="session") 11 | def spark(request): 12 | """fixture for creating a spark session 13 | Args: 14 | request: pytest.FixtureRequest object 15 | """ 16 | spark = ( 17 | SparkSession.builder.master("local[1]") 18 | .appName("pytest-pyspark-local-testing") 19 | .getOrCreate() 20 | ) 21 | request.addfinalizer(lambda: spark.stop()) 22 | 23 | return spark 24 | 25 | 26 | @pytest.mark.usefixtures("spark") 27 | def test_pickup_features_fn(spark): 28 | input_df = pd.DataFrame( 29 | { 30 | "tpep_pickup_datetime": [datetime(2022, 1, 12)], 31 | "tpep_dropoff_datetime": [datetime(2022, 1, 12)], 32 | "pickup_zip": [94400], 33 | "trip_distance": [2], 34 | "fare_amount": [100], 35 | } 36 | ) 37 | spark_df = spark.createDataFrame(input_df) 38 | output_df = compute_features_fn( 39 | spark_df, "tpep_pickup_datetime", datetime(2022, 1, 1), datetime(2022, 1, 15) 40 | ) 41 | assert isinstance(output_df, pyspark.sql.DataFrame) 42 | assert output_df.count() == 4 # 4 15-min intervals over 1 hr window. 
43 | -------------------------------------------------------------------------------- /integration-tests/resources/integration_test.yml: -------------------------------------------------------------------------------- 1 | targets: 2 | test: 3 | resources: 4 | jobs: 5 | dabs1_job: 6 | tasks: 7 | - task_key: setup 8 | notebook_task: 9 | notebook_path: ../src/setup_test.ipynb 10 | base_parameters: 11 | catalog: main 12 | schema: tmp 13 | table: itest 14 | 15 | - task_key: notebook_task 16 | depends_on: 17 | - task_key: setup 18 | 19 | notebook_task: 20 | notebook_path: ../src/main_nb.ipynb 21 | base_parameters: 22 | catalog: main 23 | schema: tmp 24 | table: itest 25 | target_table: itest_copy 26 | 27 | - task_key: validate 28 | depends_on: 29 | - task_key: notebook_task 30 | 31 | run_if: ALL_DONE 32 | 33 | notebook_task: 34 | notebook_path: ../src/validate_test.ipynb 35 | base_parameters: 36 | catalog: main 37 | schema: tmp 38 | target_table: itest_copy 39 | 40 | - task_key: cleanup 41 | depends_on: 42 | - task_key: validate 43 | 44 | 45 | notebook_task: 46 | notebook_path: ../src/cleanup_test.ipynb 47 | base_parameters: 48 | catalog: main 49 | schema: tmp 50 | table: itest 51 | target_table: itest_copy 52 | 53 | 54 | -------------------------------------------------------------------------------- /integration-tests/README.md: -------------------------------------------------------------------------------- 1 | # Integration test example 2 | 3 | This DAB shows how to redefine resource on per-target base, emulating wrapping of the code into integration test that has additional tasks. This is done by overriding the job resource (defined in [resources/dabs1.job.yml](resources/dabs1.job.yml) only in the specific target (`test`, defined in [resources/integration_test.yml](resources/integration_test.yml)) 4 | 5 | ## Getting started 6 | 7 | 1. Install the Databricks CLI from https://docs.databricks.com/dev-tools/cli/databricks-cli.html 8 | 9 | 2. Authenticate to your Databricks workspace, if you have not done so already: 10 | ``` 11 | $ databricks configure 12 | ``` 13 | 14 | 3. To deploy a development copy of this project, type: 15 | ``` 16 | $ databricks bundle deploy -t dev 17 | ``` 18 | (Note that "dev" is the default target, so the `--target` parameter 19 | is optional here.) 20 | 21 | This deploys everything that's defined for this project. For example, the default 22 | template would deploy a job called `[dev yourname] dabs1_job` to your workspace. You 23 | can find that job by opening your workpace and clicking on **Workflows**. 24 | 25 | 4. Similarly, to deploy the code into the test environment, type: 26 | ``` 27 | $ databricks bundle deploy -t test 28 | ``` 29 | 30 | This will deploy a job with name `[Integration test yourname] dabs1_job`, but it will 31 | have different number of tasks (setup the test, validate results, cleanup): 32 | 33 | ![Integration test job](images/integration_test.png) 34 | 35 | 5. 
To run a job or pipeline, use the "run" command: 36 | ``` 37 | $ databricks bundle run 38 | ``` 39 | -------------------------------------------------------------------------------- /mlops-statcks-multiphase/deployment/batch_inference/README.md: -------------------------------------------------------------------------------- 1 | # Batch Inference 2 | To set up batch inference job via scheduled Databricks workflow, please refer to [mlops/resources/README.md](../../resources/README.md) 3 | 4 | ## Prepare the batch inference input table for the example Project 5 | Please run the following code in a notebook to generate the example batch inference input table. 6 | 7 | ``` 8 | from pyspark.sql.functions import to_timestamp, lit 9 | from pyspark.sql.types import IntegerType 10 | import math 11 | from datetime import timedelta, timezone 12 | 13 | def rounded_unix_timestamp(dt, num_minutes=15): 14 | """ 15 | Ceilings datetime dt to interval num_minutes, then returns the unix timestamp. 16 | """ 17 | nsecs = dt.minute * 60 + dt.second + dt.microsecond * 1e-6 18 | delta = math.ceil(nsecs / (60 * num_minutes)) * (60 * num_minutes) - nsecs 19 | return int((dt + timedelta(seconds=delta)).replace(tzinfo=timezone.utc).timestamp()) 20 | 21 | 22 | rounded_unix_timestamp_udf = udf(rounded_unix_timestamp, IntegerType()) 23 | 24 | df = spark.table("delta.`dbfs:/databricks-datasets/nyctaxi-with-zipcodes/subsampled`") 25 | df.withColumn( 26 | "rounded_pickup_datetime", 27 | to_timestamp(rounded_unix_timestamp_udf(df["tpep_pickup_datetime"], lit(15))), 28 | ).withColumn( 29 | "rounded_dropoff_datetime", 30 | to_timestamp(rounded_unix_timestamp_udf(df["tpep_dropoff_datetime"], lit(30))), 31 | ).drop( 32 | "tpep_pickup_datetime" 33 | ).drop( 34 | "tpep_dropoff_datetime" 35 | ).drop( 36 | "fare_amount" 37 | ).write.mode( 38 | "overwrite" 39 | ).saveAsTable( 40 | name="hive_metastore.default.taxi_scoring_sample_feature_store_inference_input" 41 | ) 42 | ``` 43 | 44 | -------------------------------------------------------------------------------- /mlops-statcks-multiphase/databricks.yml: -------------------------------------------------------------------------------- 1 | # The name of the bundle. run `databricks bundle schema` to see the full bundle settings schema. 2 | bundle: 3 | name: mlops 4 | 5 | variables: 6 | current_target: 7 | description: "Name of the current target environment (we can't use `bundle.target`)" 8 | experiment_name: 9 | description: Experiment name for the model training. 10 | default: /Users/${workspace.current_user.userName}/${var.current_target}-mlops-experiment 11 | model_name: 12 | description: Model name for the model training. 
13 | default: mlops-model 14 | # this is a placeholder for all phases, although it's used only in *-phase2 15 | model_training_job_id: 16 | description: "ID of model training job" 17 | default: 0 18 | 19 | include: 20 | # Resources folder contains ML artifact resources for the ML project that defines model and experiment 21 | # And workflows resources for the ML project including model training -> validation -> deployment, 22 | # feature engineering, batch inference, quality monitoring, metric refresh, alerts and triggering retraining 23 | - ./resources/*.yml 24 | 25 | workspace: 26 | host: https://adb-xxxx.17.azuredatabricks.net 27 | 28 | # Deployment Target specific values for workspace 29 | targets: 30 | dev-phase1: 31 | default: true 32 | variables: 33 | current_target: dev 34 | 35 | dev-phase2: 36 | variables: 37 | current_target: dev 38 | 39 | test-phase1: 40 | variables: 41 | current_target: test 42 | 43 | test-phase2: 44 | variables: 45 | current_target: test 46 | 47 | prod: 48 | variables: 49 | current_target: prod 50 | 51 | prod-phase2: 52 | variables: 53 | current_target: prod 54 | 55 | -------------------------------------------------------------------------------- /vars_demo/src/notebook.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": { 6 | "application/vnd.databricks.v1+cell": { 7 | "cellMetadata": {}, 8 | "inputWidgets": {}, 9 | "nuid": "ee353e42-ff58-4955-9608-12865bd0950e", 10 | "showTitle": false, 11 | "title": "" 12 | } 13 | }, 14 | "source": [ 15 | "# Default notebook\n", 16 | "\n", 17 | "This default notebook is executed using Databricks Workflows as defined in resources/vars_demo.job.yml." 18 | ] 19 | }, 20 | { 21 | "cell_type": "code", 22 | "execution_count": 2, 23 | "metadata": {}, 24 | "outputs": [], 25 | "source": [ 26 | "%load_ext autoreload\n", 27 | "%autoreload 2" 28 | ] 29 | }, 30 | { 31 | "cell_type": "code", 32 | "execution_count": 0, 33 | "metadata": { 34 | "application/vnd.databricks.v1+cell": { 35 | "cellMetadata": { 36 | "byteLimit": 2048000, 37 | "rowLimit": 10000 38 | }, 39 | "inputWidgets": {}, 40 | "nuid": "6bca260b-13d1-448f-8082-30b60a85c9ae", 41 | "showTitle": false, 42 | "title": "" 43 | } 44 | }, 45 | "outputs": [], 46 | "source": [ 47 | "spark.range(10)" 48 | ] 49 | } 50 | ], 51 | "metadata": { 52 | "application/vnd.databricks.v1+notebook": { 53 | "dashboards": [], 54 | "language": "python", 55 | "notebookMetadata": { 56 | "pythonIndentUnit": 2 57 | }, 58 | "notebookName": "notebook", 59 | "widgets": {} 60 | }, 61 | "kernelspec": { 62 | "display_name": "Python 3", 63 | "language": "python", 64 | "name": "python3" 65 | }, 66 | "language_info": { 67 | "name": "python", 68 | "version": "3.11.4" 69 | } 70 | }, 71 | "nbformat": 4, 72 | "nbformat_minor": 0 73 | } 74 | -------------------------------------------------------------------------------- /jdemo/src/notebook.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": { 6 | "application/vnd.databricks.v1+cell": { 7 | "cellMetadata": {}, 8 | "inputWidgets": {}, 9 | "nuid": "ee353e42-ff58-4955-9608-12865bd0950e", 10 | "showTitle": false, 11 | "title": "" 12 | } 13 | }, 14 | "source": [ 15 | "# Default notebook\n", 16 | "\n", 17 | "This default notebook is executed using Databricks Workflows as defined in resources/jdemo.job.yml." 
18 | ] 19 | }, 20 | { 21 | "cell_type": "code", 22 | "execution_count": 2, 23 | "metadata": {}, 24 | "outputs": [], 25 | "source": [ 26 | "%load_ext autoreload\n", 27 | "%autoreload 2" 28 | ] 29 | }, 30 | { 31 | "cell_type": "code", 32 | "execution_count": 0, 33 | "metadata": { 34 | "application/vnd.databricks.v1+cell": { 35 | "cellMetadata": { 36 | "byteLimit": 2048000, 37 | "rowLimit": 10000 38 | }, 39 | "inputWidgets": {}, 40 | "nuid": "6bca260b-13d1-448f-8082-30b60a85c9ae", 41 | "showTitle": false, 42 | "title": "" 43 | } 44 | }, 45 | "outputs": [], 46 | "source": [ 47 | "from jdemo import main\n", 48 | "\n", 49 | "main.get_taxis(spark).show(10)" 50 | ] 51 | } 52 | ], 53 | "metadata": { 54 | "application/vnd.databricks.v1+notebook": { 55 | "dashboards": [], 56 | "language": "python", 57 | "notebookMetadata": { 58 | "pythonIndentUnit": 2 59 | }, 60 | "notebookName": "notebook", 61 | "widgets": {} 62 | }, 63 | "kernelspec": { 64 | "display_name": "Python 3", 65 | "language": "python", 66 | "name": "python3" 67 | }, 68 | "language_info": { 69 | "name": "python", 70 | "version": "3.11.4" 71 | } 72 | }, 73 | "nbformat": 4, 74 | "nbformat_minor": 0 75 | } 76 | -------------------------------------------------------------------------------- /mlops-statcks-multiphase/deployment/model_deployment/deploy.py: -------------------------------------------------------------------------------- 1 | import sys 2 | import pathlib 3 | 4 | sys.path.append(str(pathlib.Path(__file__).parent.parent.parent.resolve())) 5 | 6 | from mlflow.tracking import MlflowClient 7 | 8 | 9 | def deploy(model_uri, env): 10 | """Deploys an already-registered model in Unity catalog by assigning it the appropriate alias for model deployment. 11 | 12 | :param model_uri: URI of the model to deploy. Must be in the format "models://", as described in 13 | https://www.mlflow.org/docs/latest/model-registry.html#fetching-an-mlflow-model-from-the-model-registry 14 | :param env: name of the environment in which we're performing deployment, i.e one of "dev", "staging", "prod". 
15 | Defaults to "dev" 16 | :return: 17 | """ 18 | print(f"Deployment running in env: {env}") 19 | _, model_name, version = model_uri.split("/") 20 | client = MlflowClient(registry_uri="databricks-uc") 21 | mv = client.get_model_version(model_name, version) 22 | target_alias = "champion" 23 | if target_alias not in mv.aliases: 24 | client.set_registered_model_alias( 25 | name=model_name, 26 | alias=target_alias, 27 | version=version) 28 | print(f"Assigned alias '{target_alias}' to model version {model_uri}.") 29 | 30 | # remove "challenger" alias if assigning "champion" alias 31 | if target_alias == "champion" and "challenger" in mv.aliases: 32 | print(f"Removing 'challenger' alias from model version {model_uri}.") 33 | client.delete_registered_model_alias( 34 | name=model_name, 35 | alias="challenger") 36 | 37 | 38 | 39 | if __name__ == "__main__": 40 | deploy(model_uri=sys.argv[1], env=sys.argv[2]) 41 | -------------------------------------------------------------------------------- /integration-tests/src/setup_test.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": { 6 | "application/vnd.databricks.v1+cell": { 7 | "cellMetadata": {}, 8 | "inputWidgets": {}, 9 | "nuid": "ee353e42-ff58-4955-9608-12865bd0950e", 10 | "showTitle": false, 11 | "title": "" 12 | } 13 | }, 14 | "source": [ 15 | "# Setup test data notebook" 16 | ] 17 | }, 18 | { 19 | "cell_type": "code", 20 | "execution_count": 2, 21 | "metadata": {}, 22 | "outputs": [], 23 | "source": [ 24 | "%load_ext autoreload\n", 25 | "%autoreload 2" 26 | ] 27 | }, 28 | { 29 | "cell_type": "code", 30 | "execution_count": 0, 31 | "metadata": { 32 | "application/vnd.databricks.v1+cell": { 33 | "cellMetadata": { 34 | "byteLimit": 2048000, 35 | "rowLimit": 10000 36 | }, 37 | "inputWidgets": {}, 38 | "nuid": "6bca260b-13d1-448f-8082-30b60a85c9ae", 39 | "showTitle": false, 40 | "title": "" 41 | } 42 | }, 43 | "outputs": [], 44 | "source": [ 45 | "catalog = dbutils.widgets.get(\"catalog\")\n", 46 | "schema = dbutils.widgets.get(\"schema\")\n", 47 | "table = dbutils.widgets.get(\"table\")\n" 48 | ] 49 | }, 50 | { 51 | "cell_type": "code", 52 | "execution_count": null, 53 | "metadata": {}, 54 | "outputs": [], 55 | "source": [ 56 | "df = spark.range(10).write.mode(\"overwrite\").saveAsTable(f\"{catalog}.{schema}.{table}\")" 57 | ] 58 | } 59 | ], 60 | "metadata": { 61 | "application/vnd.databricks.v1+notebook": { 62 | "dashboards": [], 63 | "language": "python", 64 | "notebookMetadata": { 65 | "pythonIndentUnit": 2 66 | }, 67 | "notebookName": "notebook", 68 | "widgets": {} 69 | }, 70 | "kernelspec": { 71 | "display_name": "Python 3", 72 | "language": "python", 73 | "name": "python3" 74 | }, 75 | "language_info": { 76 | "name": "python", 77 | "version": "3.11.4" 78 | } 79 | }, 80 | "nbformat": 4, 81 | "nbformat_minor": 0 82 | } 83 | -------------------------------------------------------------------------------- /jdemo/java-code/pom.xml: -------------------------------------------------------------------------------- 1 | 3 | 4.0.0 4 | 5 | net.alexott.demos 6 | dabs-demo 7 | 0.0.1 8 | jar 9 | 10 | 11 | UTF-8 12 | 1.8 13 | 2.12.12 14 | 3.4.3 15 | 2.12 16 | 17 | 18 | 19 | 20 | 21 | org.apache.spark 22 | spark-sql_${spark.scala.version} 23 | ${spark.version} 24 | provided 25 | 26 | 27 | 28 | 29 | 30 | 31 | maven-compiler-plugin 32 | 3.8.1 33 | 34 | ${java.version} 35 | ${java.version} 36 | true 37 | 38 | 39 | 40 | org.apache.maven.plugins 41 | 
maven-assembly-plugin
42 | 3.2.0
43 | 
44 | 
45 | jar-with-dependencies
46 | 
47 | 
48 | 
49 | 
50 | package
51 | 
52 | single
53 | 
54 | 
55 | 
56 | 
57 | 
58 | 
59 | 
60 | 
61 | 
--------------------------------------------------------------------------------
/jdemo/README.md:
--------------------------------------------------------------------------------
1 | # jdemo
2 | 
3 | The 'jdemo' project was generated using the `default-python` template. It shows how to
4 | have multiple artefacts in the same bundle: a `jar` built from Java code using Maven, and
5 | a `whl` built from the Python code.
6 | 
7 | ## Getting started
8 | 
9 | 1. Install the Databricks CLI from https://docs.databricks.com/dev-tools/cli/databricks-cli.html
10 | 
11 | 2. Authenticate to your Databricks workspace, if you have not done so already:
12 |    ```
13 |    $ databricks configure
14 |    ```
15 | 
16 | 3. To deploy a development copy of this project, type:
17 |    ```
18 |    $ databricks bundle deploy --target dev
19 |    ```
20 |    (Note that "dev" is the default target, so the `--target` parameter
21 |    is optional here.)
22 | 
23 |    This deploys everything that's defined for this project.
24 |    For example, the default template would deploy a job called
25 |    `[dev yourname] jdemo_job` to your workspace.
26 |    You can find that job by opening your workspace and clicking on **Workflows**.
27 | 
28 | 4. Similarly, to deploy a production copy, type:
29 |    ```
30 |    $ databricks bundle deploy --target prod
31 |    ```
32 | 
33 |    Note that the default job from the template has a schedule that runs every day
34 |    (defined in resources/jdemo.job.yml). The schedule
35 |    is paused when deploying in development mode (see
36 |    https://docs.databricks.com/dev-tools/bundles/deployment-modes.html).
37 | 
38 | 5. To run a job or pipeline, use the "run" command:
39 |    ```
40 |    $ databricks bundle run
41 |    ```
42 | 
43 | 6. Optionally, install developer tools such as the Databricks extension for Visual Studio Code from
44 |    https://docs.databricks.com/dev-tools/vscode-ext.html. Or read the "getting started" documentation for
45 |    **Databricks Connect** for instructions on running the included Python code from a different IDE.
46 | 
47 | 7. For documentation on the Databricks asset bundles format used
48 |    for this project, and for CI/CD configuration, see
49 |    https://docs.databricks.com/dev-tools/bundles/index.html.
50 | 
--------------------------------------------------------------------------------
/vars_demo/README.md:
--------------------------------------------------------------------------------
1 | # vars_demo
2 | 
3 | The 'vars_demo' project demonstrates how to use [complex](https://docs.databricks.com/aws/en/dev-tools/bundles/variables#define-a-complex-variable) and [lookup](https://docs.databricks.com/aws/en/dev-tools/bundles/variables#retrieve-an-objects-id-value) variables in Databricks Asset Bundles (DABs).
4 | 
5 | Variables allow you to parametrize a bundle. Variables are referenced using the `${var.<variable_name>}` syntax. There are different variable types:
6 | 
7 | * "Normal variables" - a static value, which can be defined on the command line, via an environment variable, …
8 | * "Lookup variables" - fetch information about an existing object (cluster or policy ID by name, etc.). This is very handy when you have an object with the same name deployed in different environments, e.g., cluster policies, notification destinations, etc.
9 | * "Complex variables" - consist of multiple values. For example, they can be used to define cluster configurations, notifications, etc. (see the sketch below).
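For illustration only, here is a minimal sketch (not part of this bundle) of how the three kinds could be declared; the variable names are placeholders, and the lookup and complex examples mirror the project's own [resources/variables.yml](resources/variables.yml) and the jdemo bundle:

```yaml
variables:
  environment:                      # "normal" variable: a plain static value
    description: "Deployment environment name"
    default: "dev"
  instance_pool_id:                 # lookup variable: resolved at deploy time to the ID of the pool with this name
    description: "ID of an existing instance pool"
    lookup:
      instance_pool: "TFTests"
  notification_settings:            # complex variable: holds a whole configuration block
    description: "Webhook notification config"
    type: complex
    default: {}
```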
10 | 11 | Variables could have a different value in each target, and in combination with `default` value it's possible to implement "conditional" overwrite of some values in defined resources. 12 | 13 | This demo shows how to define `webhook_notifications` in jobs such way that Slack notifications are defined only in the `prod` environment. This is done by defining a complex variable `notification_settings` that has an empty value by default, but we're overwriting it in the `prod` environment by looking up the notification destination with a specific name (defined by the `notification_name` variable). (All code is in the [resources/variables.yml](resources/variables.yml)). 14 | 15 | And then we can just use the complex variable in the `webhook_notifications` argument (line 13 in [resources/vars_demo.job.yml](resources/vars_demo.job.yml)): 16 | 17 | ```yaml 18 | webhook_notifications: ${var.notification_settings} 19 | ``` 20 | 21 | You can check with `databricks bundle validate -t dev --output json` that corresponding argument is empty in the `dev`, but if you run `databricks bundle validate -t prod --output json`, then it will be filled with actual ID of the notification destination. -------------------------------------------------------------------------------- /mlops-statcks-multiphase/resources/batch-inference-workflow-resource.yml: -------------------------------------------------------------------------------- 1 | new_cluster: &new_cluster 2 | new_cluster: 3 | num_workers: 3 4 | spark_version: 15.3.x-cpu-ml-scala2.12 5 | node_type_id: Standard_D3_v2 6 | custom_tags: 7 | clusterSource: mlops-stacks_0.4 8 | 9 | common_permissions: &permissions 10 | permissions: 11 | - level: CAN_VIEW 12 | group_name: users 13 | 14 | batch_inference_job: &batch_inference_job 15 | batch_inference_job: 16 | name: ${var.current_target}-mlops-batch-inference-job 17 | tasks: 18 | - task_key: batch_inference_job 19 | <<: *new_cluster 20 | notebook_task: 21 | notebook_path: ../deployment/batch_inference/notebooks/BatchInference.py 22 | base_parameters: 23 | env: ${var.current_target} 24 | input_table_name: ${var.current_target}.mlops.feature_store_inference_input # TODO: create input table for inference 25 | output_table_name: ${var.current_target}.mlops.predictions 26 | model_name: ${var.current_target}.mlops.${var.model_name} 27 | # git source information of current ML resource deployment. It will be persisted as part of the workflow run 28 | git_source_info: url:${bundle.git.origin_url}; branch:${bundle.git.branch}; commit:${bundle.git.commit} 29 | 30 | schedule: 31 | quartz_cron_expression: "0 0 11 * * ?" # daily at 11am 32 | timezone_id: UTC 33 | <<: *permissions 34 | # If you want to turn on notifications for this job, please uncomment the below code, 35 | # and provide a list of emails to the on_failure argument. 36 | # 37 | # email_notifications: 38 | # on_failure: 39 | # - first@company.com 40 | # - second@company.com 41 | 42 | targets: 43 | dev-phase1: 44 | resources: 45 | jobs: 46 | <<: *batch_inference_job 47 | 48 | test-phase1: 49 | resources: 50 | jobs: 51 | <<: *batch_inference_job 52 | 53 | prod-phase1: 54 | resources: 55 | jobs: 56 | <<: *batch_inference_job 57 | -------------------------------------------------------------------------------- /jdemo/resources/jdemo.job.yml: -------------------------------------------------------------------------------- 1 | # The main job for jdemo. 
2 | 3 | new_cluster: &new_cluster 4 | new_cluster: 5 | spark_version: 15.4.x-scala2.12 6 | instance_pool_id: ${var.instance_pool_id} 7 | autoscale: 8 | min_workers: 1 9 | max_workers: 4 10 | custom_tags: 11 | project: jdemo 12 | 13 | # TODO: 14 | # - Add parameters to the job, like, getting the table name to read 15 | # - override parameters per stage, or when running integration test 16 | 17 | resources: 18 | jobs: 19 | jdemo_job: 20 | name: jdemo_job 21 | 22 | trigger: 23 | # Run this job every day, exactly one day from the last run; see https://docs.databricks.com/api/workspace/jobs/create#trigger 24 | periodic: 25 | interval: 1 26 | unit: DAYS 27 | 28 | email_notifications: 29 | on_failure: 30 | - user@domain.com 31 | 32 | tasks: 33 | - task_key: notebook_task 34 | job_cluster_key: job_cluster 35 | notebook_task: 36 | notebook_path: ../src/notebook.ipynb 37 | 38 | - task_key: wheel_task 39 | depends_on: 40 | - task_key: notebook_task 41 | 42 | job_cluster_key: job_cluster 43 | python_wheel_task: 44 | package_name: jdemo 45 | entry_point: main 46 | libraries: 47 | # By default we just include the .whl file generated for the jdemo package. 48 | # See https://docs.databricks.com/dev-tools/bundles/library-dependencies.html 49 | # for more information on how to add other libraries. 50 | - whl: ../dist/*.whl 51 | 52 | - task_key: jar_task 53 | depends_on: 54 | - task_key: notebook_task 55 | #<<: *new_cluster 56 | job_cluster_key: job_cluster 57 | spark_jar_task: 58 | main_class_name: net.alexott.demos.SparkDemo 59 | libraries: 60 | - jar: ../java-code/target/dabs-demo-0.0.1-jar-with-dependencies.jar 61 | 62 | job_clusters: 63 | - job_cluster_key: job_cluster 64 | <<: *new_cluster 65 | -------------------------------------------------------------------------------- /mlops-statcks-multiphase/validation/validation.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | from mlflow.models import make_metric, MetricThreshold 3 | 4 | # Custom metrics to be included. Return empty list if custom metrics are not needed. 5 | # Please refer to custom_metrics parameter in mlflow.evaluate documentation https://mlflow.org/docs/latest/python_api/mlflow.html#mlflow.evaluate 6 | # TODO(optional) : custom_metrics 7 | def custom_metrics(): 8 | 9 | # TODO(optional) : define custom metric function to be included in custom_metrics. 10 | def squared_diff_plus_one(eval_df, _builtin_metrics): 11 | """ 12 | This example custom metric function creates a metric based on the ``prediction`` and 13 | ``target`` columns in ``eval_df`. 14 | """ 15 | return np.sum(np.abs(eval_df["prediction"] - eval_df["target"] + 1) ** 2) 16 | 17 | return [make_metric(eval_fn=squared_diff_plus_one, greater_is_better=False)] 18 | 19 | 20 | # Define model validation rules. Return empty dict if validation rules are not needed. 
21 | # Please refer to validation_thresholds parameter in mlflow.evaluate documentation https://mlflow.org/docs/latest/python_api/mlflow.html#mlflow.evaluate 22 | # TODO(optional) : validation_thresholds 23 | def validation_thresholds(): 24 | return { 25 | "max_error": MetricThreshold( 26 | threshold=500, higher_is_better=False # max_error should be <= 500 27 | ), 28 | "mean_squared_error": MetricThreshold( 29 | threshold=500, # mean_squared_error should be <= 500 30 | # min_absolute_change=0.01, # mean_squared_error should be at least 0.01 greater than baseline model accuracy 31 | # min_relative_change=0.01, # mean_squared_error should be at least 1 percent greater than baseline model accuracy 32 | higher_is_better=False, 33 | ), 34 | } 35 | 36 | 37 | # Define evaluator config. Return empty dict if validation rules are not needed. 38 | # Please refer to evaluator_config parameter in mlflow.evaluate documentation https://mlflow.org/docs/latest/python_api/mlflow.html#mlflow.evaluate 39 | # TODO(optional) : evaluator_config 40 | def evaluator_config(): 41 | return {} 42 | -------------------------------------------------------------------------------- /integration-tests/src/cleanup_test.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": { 6 | "application/vnd.databricks.v1+cell": { 7 | "cellMetadata": {}, 8 | "inputWidgets": {}, 9 | "nuid": "ee353e42-ff58-4955-9608-12865bd0950e", 10 | "showTitle": false, 11 | "title": "" 12 | } 13 | }, 14 | "source": [ 15 | "# Cleanup test data\n", 16 | "\n", 17 | "Removes test data" 18 | ] 19 | }, 20 | { 21 | "cell_type": "code", 22 | "execution_count": 2, 23 | "metadata": {}, 24 | "outputs": [], 25 | "source": [ 26 | "%load_ext autoreload\n", 27 | "%autoreload 2" 28 | ] 29 | }, 30 | { 31 | "cell_type": "code", 32 | "execution_count": 0, 33 | "metadata": { 34 | "application/vnd.databricks.v1+cell": { 35 | "cellMetadata": { 36 | "byteLimit": 2048000, 37 | "rowLimit": 10000 38 | }, 39 | "inputWidgets": {}, 40 | "nuid": "6bca260b-13d1-448f-8082-30b60a85c9ae", 41 | "showTitle": false, 42 | "title": "" 43 | } 44 | }, 45 | "outputs": [], 46 | "source": [ 47 | "catalog = dbutils.widgets.get(\"catalog\")\n", 48 | "schema = dbutils.widgets.get(\"schema\")\n", 49 | "table = dbutils.widgets.get(\"table\")\n", 50 | "target_table = dbutils.widgets.get(\"target_table\")" 51 | ] 52 | }, 53 | { 54 | "cell_type": "code", 55 | "execution_count": null, 56 | "metadata": {}, 57 | "outputs": [], 58 | "source": [ 59 | "spark.sql(f\"drop table if exists {catalog}.{schema}.{table}\")\n", 60 | "spark.sql(f\"drop table if exists {catalog}.{schema}.{target_table}\")" 61 | ] 62 | } 63 | ], 64 | "metadata": { 65 | "application/vnd.databricks.v1+notebook": { 66 | "dashboards": [], 67 | "language": "python", 68 | "notebookMetadata": { 69 | "pythonIndentUnit": 2 70 | }, 71 | "notebookName": "notebook", 72 | "widgets": {} 73 | }, 74 | "kernelspec": { 75 | "display_name": "Python 3", 76 | "language": "python", 77 | "name": "python3" 78 | }, 79 | "language_info": { 80 | "name": "python", 81 | "version": "3.11.4" 82 | } 83 | }, 84 | "nbformat": 4, 85 | "nbformat_minor": 0 86 | } 87 | -------------------------------------------------------------------------------- /integration-tests/src/main_nb.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": { 6 | 
"application/vnd.databricks.v1+cell": { 7 | "cellMetadata": {}, 8 | "inputWidgets": {}, 9 | "nuid": "ee353e42-ff58-4955-9608-12865bd0950e", 10 | "showTitle": false, 11 | "title": "" 12 | } 13 | }, 14 | "source": [ 15 | "# Default notebook\n", 16 | "\n", 17 | "This default notebook is executed using Databricks Workflows as defined in resources/dabs1.job.yml." 18 | ] 19 | }, 20 | { 21 | "cell_type": "code", 22 | "execution_count": 2, 23 | "metadata": {}, 24 | "outputs": [], 25 | "source": [ 26 | "%load_ext autoreload\n", 27 | "%autoreload 2" 28 | ] 29 | }, 30 | { 31 | "cell_type": "code", 32 | "execution_count": 0, 33 | "metadata": { 34 | "application/vnd.databricks.v1+cell": { 35 | "cellMetadata": { 36 | "byteLimit": 2048000, 37 | "rowLimit": 10000 38 | }, 39 | "inputWidgets": {}, 40 | "nuid": "6bca260b-13d1-448f-8082-30b60a85c9ae", 41 | "showTitle": false, 42 | "title": "" 43 | } 44 | }, 45 | "outputs": [], 46 | "source": [ 47 | "catalog = dbutils.widgets.get(\"catalog\")\n", 48 | "schema = dbutils.widgets.get(\"schema\")\n", 49 | "table = dbutils.widgets.get(\"table\")\n", 50 | "target_table = dbutils.widgets.get(\"target_table\")" 51 | ] 52 | }, 53 | { 54 | "cell_type": "code", 55 | "execution_count": null, 56 | "metadata": {}, 57 | "outputs": [], 58 | "source": [ 59 | "df = spark.read.table(f\"{catalog}.{schema}.{table}\")\n", 60 | "df.write.mode(\"overwrite\").saveAsTable(f\"{catalog}.{schema}.{target_table}\")" 61 | ] 62 | } 63 | ], 64 | "metadata": { 65 | "application/vnd.databricks.v1+notebook": { 66 | "dashboards": [], 67 | "language": "python", 68 | "notebookMetadata": { 69 | "pythonIndentUnit": 2 70 | }, 71 | "notebookName": "notebook", 72 | "widgets": {} 73 | }, 74 | "kernelspec": { 75 | "display_name": "Python 3", 76 | "language": "python", 77 | "name": "python3" 78 | }, 79 | "language_info": { 80 | "name": "python", 81 | "version": "3.11.4" 82 | } 83 | }, 84 | "nbformat": 4, 85 | "nbformat_minor": 0 86 | } 87 | -------------------------------------------------------------------------------- /jdemo/databricks.yml: -------------------------------------------------------------------------------- 1 | # This is a Databricks asset bundle definition for jdemo. 2 | # See https://docs.databricks.com/dev-tools/bundles/index.html for documentation. 3 | bundle: 4 | name: jdemo 5 | 6 | include: 7 | - resources/*.yml 8 | 9 | variables: 10 | uniq_id: 11 | description: "Some ID that will guarantee uniqueness of the object, i.e., PR number" 12 | default: ${workspace.current_user.short_name} 13 | 14 | artifacts: 15 | java-code: 16 | path: ./java-code 17 | build: mvn package 18 | type: jar 19 | files: 20 | - source: ./java-code/target/dabs-demo-0.0.1-jar-with-dependencies.jar 21 | wheel: 22 | path: . 23 | type: whl 24 | 25 | workspace: 26 | host: https://adb-xxxx.17.azuredatabricks.net 27 | 28 | targets: 29 | dev: 30 | # The default target uses 'mode: development' to create a development copy. 31 | # - Deployed resources get prefixed with '[dev my_user_name]' 32 | # - Any job schedules and triggers are paused by default. 33 | # See also https://docs.databricks.com/dev-tools/bundles/deployment-modes.html. 
34 | mode: development 35 | default: true 36 | workspace: 37 | artifact_path: /Volumes/main/default/jars/${workspace.current_user.short_name}-${bundle.target} 38 | 39 | staging: 40 | presets: 41 | name_prefix: "[Staging ${var.uniq_id}] " 42 | workspace: 43 | artifact_path: /Volumes/main/default/jars/${bundle.target}-${var.uniq_id} 44 | root_path: /Workspace/Projects/${bundle.target}/${bundle.name}/${var.uniq_id} 45 | resources: 46 | jobs: 47 | jdemo_job: 48 | trigger: 49 | pause_status: PAUSED 50 | 51 | prod: 52 | mode: production 53 | presets: 54 | name_prefix: "[Prod] " 55 | workspace: 56 | # We explicitly specify /Workspace/Users/user@domain.com to make sure we only have a single copy. 57 | root_path: /Workspace/Projects/${bundle.target}/${bundle.name} 58 | artifact_path: /Volumes/main/default/prod 59 | resources: 60 | jobs: 61 | jdemo_job: 62 | trigger: 63 | pause_status: PAUSED # This is just for demo purposes, to avoid running the job in the demo 64 | permissions: 65 | - user_name: user@domain.com 66 | level: CAN_MANAGE 67 | run_as: 68 | user_name: user@domain.com 69 | -------------------------------------------------------------------------------- /integration-tests/src/validate_test.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": { 6 | "application/vnd.databricks.v1+cell": { 7 | "cellMetadata": {}, 8 | "inputWidgets": {}, 9 | "nuid": "ee353e42-ff58-4955-9608-12865bd0950e", 10 | "showTitle": false, 11 | "title": "" 12 | } 13 | }, 14 | "source": [ 15 | "# Default notebook\n", 16 | "\n", 17 | "This default notebook is executed using Databricks Workflows as defined in resources/dabs1.job.yml." 18 | ] 19 | }, 20 | { 21 | "cell_type": "code", 22 | "execution_count": 2, 23 | "metadata": {}, 24 | "outputs": [], 25 | "source": [ 26 | "%load_ext autoreload\n", 27 | "%autoreload 2" 28 | ] 29 | }, 30 | { 31 | "cell_type": "code", 32 | "execution_count": 0, 33 | "metadata": { 34 | "application/vnd.databricks.v1+cell": { 35 | "cellMetadata": { 36 | "byteLimit": 2048000, 37 | "rowLimit": 10000 38 | }, 39 | "inputWidgets": {}, 40 | "nuid": "6bca260b-13d1-448f-8082-30b60a85c9ae", 41 | "showTitle": false, 42 | "title": "" 43 | } 44 | }, 45 | "outputs": [], 46 | "source": [ 47 | "catalog = dbutils.widgets.get(\"catalog\")\n", 48 | "schema = dbutils.widgets.get(\"schema\")\n", 49 | "target_table = dbutils.widgets.get(\"target_table\")" 50 | ] 51 | }, 52 | { 53 | "cell_type": "code", 54 | "execution_count": null, 55 | "metadata": {}, 56 | "outputs": [], 57 | "source": [ 58 | "df = spark.read.table(f\"{catalog}.{schema}.{target_table}\")\n", 59 | "assert df is not None, f\"Failed to read table {catalog}.{schema}.{target_table}\"\n", 60 | "assert df.count() == 10, f\"Incorrect number of rows in {catalog}.{schema}.{target_table}\"" 61 | ] 62 | } 63 | ], 64 | "metadata": { 65 | "application/vnd.databricks.v1+notebook": { 66 | "dashboards": [], 67 | "language": "python", 68 | "notebookMetadata": { 69 | "pythonIndentUnit": 2 70 | }, 71 | "notebookName": "notebook", 72 | "widgets": {} 73 | }, 74 | "kernelspec": { 75 | "display_name": "Python 3", 76 | "language": "python", 77 | "name": "python3" 78 | }, 79 | "language_info": { 80 | "name": "python", 81 | "version": "3.11.4" 82 | } 83 | }, 84 | "nbformat": 4, 85 | "nbformat_minor": 0 86 | } 87 | -------------------------------------------------------------------------------- 
/mlops-statcks-multiphase/deployment/model_deployment/notebooks/ModelDeployment.py: -------------------------------------------------------------------------------- 1 | # Databricks notebook source 2 | ################################################################################## 3 | # Helper notebook to transition the model stage. This notebook is run 4 | # after the Train.py notebook as part of a multi-task job, in order to transition model 5 | # to target stage after training completes. 6 | # 7 | # Note that we deploy the model to the stage in MLflow Model Registry equivalent to the 8 | # environment in which the multi-task job is executed (e.g deploy the trained model to 9 | # stage=Production if triggered in the prod environment). In a practical setting, we would 10 | # recommend enabling the model validation step between model training and automatically 11 | # registering the model to the Production stage in prod. 12 | # 13 | # This notebook has the following parameters: 14 | # 15 | # * env (required) - String name of the current environment for model deployment, which decides the target stage. 16 | # * model_uri (required) - URI of the model to deploy. Must be in the format "models://", as described in 17 | # https://www.mlflow.org/docs/latest/model-registry.html#fetching-an-mlflow-model-from-the-model-registry 18 | # This parameter is read as a task value 19 | # (https://learn.microsoft.com/azure/databricks/dev-tools/databricks-utils), 20 | # rather than as a notebook widget. That is, we assume a preceding task (the Train.py 21 | # notebook) has set a task value with key "model_uri". 22 | ################################################################################## 23 | 24 | # List of input args needed to run the notebook as a job. 25 | # Provide them via DB widgets or notebook arguments. 26 | # 27 | # Name of the current environment 28 | dbutils.widgets.dropdown("env", "None", ["None", "staging", "prod"], "Environment Name") 29 | 30 | # COMMAND ---------- 31 | 32 | import os 33 | import sys 34 | notebook_path = '/Workspace/' + os.path.dirname(dbutils.notebook.entry_point.getDbutils().notebook().getContext().notebookPath().get()) 35 | %cd $notebook_path 36 | %cd .. 37 | sys.path.append("../..") 38 | 39 | # COMMAND ---------- 40 | 41 | from deploy import deploy 42 | 43 | model_uri = dbutils.jobs.taskValues.get("Train", "model_uri", debugValue="") 44 | env = dbutils.widgets.get("env") 45 | assert env != "None", "env notebook parameter must be specified" 46 | assert model_uri != "", "model_uri notebook parameter must be specified" 47 | deploy(model_uri, env) 48 | 49 | # COMMAND ---------- 50 | print( 51 | f"Successfully completed model deployment for {model_uri}" 52 | ) 53 | -------------------------------------------------------------------------------- /mlops-statcks-multiphase/feature_engineering/features/pickup_features.py: -------------------------------------------------------------------------------- 1 | """ 2 | This sample module contains features logic that can be used to generate and populate tables in Feature Store. 3 | You should plug in your own features computation logic in the compute_features_fn method below. 
4 | """ 5 | import pyspark.sql.functions as F 6 | from pyspark.sql.types import FloatType, IntegerType, StringType, TimestampType 7 | from pytz import timezone 8 | 9 | 10 | @F.udf(returnType=StringType()) 11 | def _partition_id(dt): 12 | # datetime -> "YYYY-MM" 13 | return f"{dt.year:04d}-{dt.month:02d}" 14 | 15 | 16 | def _filter_df_by_ts(df, ts_column, start_date, end_date): 17 | if ts_column and start_date: 18 | df = df.filter(F.col(ts_column) >= start_date) 19 | if ts_column and end_date: 20 | df = df.filter(F.col(ts_column) < end_date) 21 | return df 22 | 23 | 24 | def compute_features_fn(input_df, timestamp_column, start_date, end_date): 25 | """Contains logic to compute features. 26 | 27 | Given an input dataframe and time ranges, this function should compute features, populate an output dataframe and 28 | return it. This method will be called from a Feature Store pipeline job and the output dataframe will be written 29 | to a Feature Store table. You should update this method with your own feature computation logic. 30 | 31 | The timestamp_column, start_date, end_date args are optional but strongly recommended for time-series based 32 | features. 33 | 34 | TODO: Update and adapt the sample code for your use case 35 | 36 | :param input_df: Input dataframe. 37 | :param timestamp_column: Column containing a timestamp. This column is used to limit the range of feature 38 | computation. It is also used as the timestamp key column when populating the feature table, so it needs to be 39 | returned in the output. 40 | :param start_date: Start date of the feature computation interval. 41 | :param end_date: End date of the feature computation interval. 42 | :return: Output dataframe containing computed features given the input arguments. 43 | """ 44 | df = _filter_df_by_ts(input_df, timestamp_column, start_date, end_date) 45 | pickupzip_features = ( 46 | df.groupBy( 47 | "pickup_zip", F.window(timestamp_column, "1 hour", "15 minutes") 48 | ) # 1 hour window, sliding every 15 minutes 49 | .agg( 50 | F.mean("fare_amount").alias("mean_fare_window_1h_pickup_zip"), 51 | F.count("*").alias("count_trips_window_1h_pickup_zip"), 52 | ) 53 | .select( 54 | F.col("pickup_zip").alias("zip"), 55 | F.unix_timestamp(F.col("window.end")) 56 | .alias(timestamp_column) 57 | .cast(TimestampType()), 58 | _partition_id(F.to_timestamp(F.col("window.end"))).alias("yyyy_mm"), 59 | F.col("mean_fare_window_1h_pickup_zip").cast(FloatType()), 60 | F.col("count_trips_window_1h_pickup_zip").cast(IntegerType()), 61 | ) 62 | ) 63 | return pickupzip_features 64 | -------------------------------------------------------------------------------- /mlops-statcks-multiphase/feature_engineering/features/dropoff_features.py: -------------------------------------------------------------------------------- 1 | """ 2 | This sample module contains features logic that can be used to generate and populate tables in Feature Store. 3 | You should plug in your own features computation logic in the compute_features_fn method below. 
4 | """ 5 | import pyspark.sql.functions as F 6 | from pyspark.sql.types import IntegerType, StringType, TimestampType 7 | from pytz import timezone 8 | 9 | 10 | @F.udf(returnType=IntegerType()) 11 | def _is_weekend(dt): 12 | tz = "America/New_York" 13 | return int(dt.astimezone(timezone(tz)).weekday() >= 5) # 5 = Saturday, 6 = Sunday 14 | 15 | 16 | @F.udf(returnType=StringType()) 17 | def _partition_id(dt): 18 | # datetime -> "YYYY-MM" 19 | return f"{dt.year:04d}-{dt.month:02d}" 20 | 21 | 22 | def _filter_df_by_ts(df, ts_column, start_date, end_date): 23 | if ts_column and start_date: 24 | df = df.filter(F.col(ts_column) >= start_date) 25 | if ts_column and end_date: 26 | df = df.filter(F.col(ts_column) < end_date) 27 | return df 28 | 29 | 30 | def compute_features_fn(input_df, timestamp_column, start_date, end_date): 31 | """Contains logic to compute features. 32 | 33 | Given an input dataframe and time ranges, this function should compute features, populate an output dataframe and 34 | return it. This method will be called from a Feature Store pipeline job and the output dataframe will be written 35 | to a Feature Store table. You should update this method with your own feature computation logic. 36 | 37 | The timestamp_column, start_date, end_date args are optional but strongly recommended for time-series based 38 | features. 39 | 40 | TODO: Update and adapt the sample code for your use case 41 | 42 | :param input_df: Input dataframe. 43 | :param timestamp_column: Column containing the timestamp. This column is used to limit the range of feature 44 | computation. It is also used as the timestamp key column when populating the feature table, so it needs to be 45 | returned in the output. 46 | :param start_date: Start date of the feature computation interval. 47 | :param end_date: End date of the feature computation interval. 48 | :return: Output dataframe containing computed features given the input arguments. 
49 | """ 50 | df = _filter_df_by_ts(input_df, timestamp_column, start_date, end_date) 51 | dropoffzip_features = ( 52 | df.groupBy("dropoff_zip", F.window(timestamp_column, "30 minute")) 53 | .agg(F.count("*").alias("count_trips_window_30m_dropoff_zip")) 54 | .select( 55 | F.col("dropoff_zip").alias("zip"), 56 | F.unix_timestamp(F.col("window.end")) 57 | .alias(timestamp_column) 58 | .cast(TimestampType()), 59 | _partition_id(F.to_timestamp(F.col("window.end"))).alias("yyyy_mm"), 60 | F.col("count_trips_window_30m_dropoff_zip").cast(IntegerType()), 61 | _is_weekend(F.col("window.end")).alias("dropoff_is_weekend"), 62 | ) 63 | ) 64 | return dropoffzip_features 65 | -------------------------------------------------------------------------------- /mlops-statcks-multiphase/monitoring/notebooks/MonitoredMetricViolationCheck.py: -------------------------------------------------------------------------------- 1 | # Databricks notebook source 2 | ################################################################################## 3 | # This notebook runs a sql query and set the result as job task value 4 | # 5 | # This notebook has the following parameters: 6 | # 7 | # * table_name_under_monitor (required) - The name of a table that is currently being monitored 8 | # * metric_to_monitor (required) - Metric to be monitored for threshold violation 9 | # * metric_violation_threshold (required) - Threshold value for metric violation 10 | # * num_evaluation_windows (required) - Number of windows to check for violation 11 | # * num_violation_windows (required) - Number of windows that need to violate the threshold 12 | ################################################################################## 13 | 14 | # List of input args needed to run the notebook as a job. 15 | # Provide them via DB widgets or notebook arguments. 16 | # 17 | # Name of the table that is currently being monitored 18 | dbutils.widgets.text( 19 | "table_name_under_monitor", "dev.mlops.predictions", label="Full (three-Level) table name" 20 | ) 21 | # Metric to be used for threshold violation check 22 | dbutils.widgets.text( 23 | "metric_to_monitor", "root_mean_squared_error", label="Metric to be monitored for threshold violation" 24 | ) 25 | 26 | # Threshold value to be checked 27 | dbutils.widgets.text( 28 | "metric_violation_threshold", "100", label="Threshold value for metric violation" 29 | ) 30 | 31 | # Threshold value to be checked 32 | dbutils.widgets.text( 33 | "num_evaluation_windows", "5", label="Number of windows to check for violation" 34 | ) 35 | 36 | # Threshold value to be checked 37 | dbutils.widgets.text( 38 | "num_violation_windows", "2", label="Number of windows that need to violate the threshold" 39 | ) 40 | 41 | # COMMAND ---------- 42 | 43 | import os 44 | import sys 45 | notebook_path = '/Workspace/' + os.path.dirname(dbutils.notebook.entry_point.getDbutils().notebook().getContext().notebookPath().get()) 46 | %cd $notebook_path 47 | %cd .. 
48 | sys.path.append("../..") 49 | 50 | # COMMAND ---------- 51 | 52 | from metric_violation_check_query import sql_query 53 | 54 | table_name_under_monitor = dbutils.widgets.get("table_name_under_monitor") 55 | metric_to_monitor = dbutils.widgets.get("metric_to_monitor") 56 | metric_violation_threshold = dbutils.widgets.get("metric_violation_threshold") 57 | num_evaluation_windows = dbutils.widgets.get("num_evaluation_windows") 58 | num_violation_windows = dbutils.widgets.get("num_violation_windows") 59 | 60 | formatted_sql_query = sql_query.format( 61 | table_name_under_monitor=table_name_under_monitor, 62 | metric_to_monitor=metric_to_monitor, 63 | metric_violation_threshold=metric_violation_threshold, 64 | num_evaluation_windows=num_evaluation_windows, 65 | num_violation_windows=num_violation_windows) 66 | is_metric_violated = bool(spark.sql(formatted_sql_query).toPandas()["query_result"][0]) 67 | 68 | dbutils.jobs.taskValues.set("is_metric_violated", is_metric_violated) 69 | 70 | 71 | -------------------------------------------------------------------------------- /mlops-statcks-multiphase/monitoring/metric_violation_check_query.py: -------------------------------------------------------------------------------- 1 | # This file is used for the main SQL query that checks the last {num_evaluation_windows} metric violations and whether at least {num_violation_windows} of those runs violate the condition. 2 | 3 | import sys 4 | import pathlib 5 | 6 | sys.path.append(str(pathlib.Path(__file__).parent.parent.parent.resolve())) 7 | 8 | """The SQL query is divided into three main parts. The first part selects the top {num_evaluation_windows} 9 | values of the metric to be monitored, ordered by the time window, and saves as recent_metrics. 10 | ```sql 11 | WITH recent_metrics AS ( 12 | SELECT 13 | {metric_to_monitor}, 14 | window 15 | FROM 16 | {table_name_under_monitor}_profile_metrics 17 | WHERE 18 | column_name = ":table" 19 | AND slice_key IS NULL 20 | AND model_id != "*" 21 | AND log_type = "INPUT" 22 | ORDER BY 23 | window DESC 24 | LIMIT 25 | {num_evaluation_windows} 26 | ) 27 | ``` 28 | The `column_name = ":table"` and `slice_key IS NULL` conditions ensure that the metric 29 | is selected for the entire table within the given granularity. The `log_type = "INPUT"` 30 | condition ensures that the primary table metrics are considered, but not the baseline 31 | table metrics. The `model_id!= "*"` condition ensures that the metric aggregated across 32 | all model IDs is not selected. 33 | 34 | The second part of the query determines if the metric values have been violated with two cases. 35 | The first case checks if the metric value is greater than the threshold for at least {num_violation_windows} windows: 36 | ```sql 37 | (SELECT COUNT(*) FROM recent_metrics WHERE {metric_to_monitor} > {metric_violation_threshold}) >= {num_violation_windows} 38 | ``` 39 | The second case checks if the most recent metric value is greater than the threshold. 
This is to make sure we only trigger retraining 40 | if the most recent window was violated, avoiding unnecessary retraining if the violation was in the past and the metric is now within the threshold: 41 | ```sql 42 | (SELECT {metric_to_monitor} FROM recent_metrics ORDER BY window DESC LIMIT 1) > {metric_violation_threshold} 43 | ``` 44 | 45 | The final part of the query sets the `query_result` to 1 if both of the above conditions are met, and 0 otherwise: 46 | ```sql 47 | SELECT 48 | CASE 49 | WHEN 50 | # Check if the metric value is greater than the threshold for at least {num_violation_windows} windows 51 | AND 52 | # Check if the most recent metric value is greater than the threshold 53 | THEN 1 54 | ELSE 0 55 | END AS query_result 56 | ``` 57 | """ 58 | 59 | sql_query = """WITH recent_metrics AS ( 60 | SELECT 61 | {metric_to_monitor}, 62 | window 63 | FROM 64 | {table_name_under_monitor}_profile_metrics 65 | WHERE 66 | column_name = ":table" 67 | AND slice_key IS NULL 68 | AND model_id != "*" 69 | AND log_type = "INPUT" 70 | ORDER BY 71 | window DESC 72 | LIMIT 73 | {num_evaluation_windows} 74 | ) 75 | SELECT 76 | CASE 77 | WHEN 78 | (SELECT COUNT(*) FROM recent_metrics WHERE {metric_to_monitor} > {metric_violation_threshold}) >= {num_violation_windows} 79 | AND 80 | (SELECT {metric_to_monitor} FROM recent_metrics ORDER BY window DESC LIMIT 1) > {metric_violation_threshold} 81 | THEN 1 82 | ELSE 0 83 | END AS query_result 84 | """ 85 | -------------------------------------------------------------------------------- /mlops-statcks-multiphase/resources/feature-engineering-workflow-resource.yml: -------------------------------------------------------------------------------- 1 | new_cluster: &new_cluster 2 | new_cluster: 3 | num_workers: 3 4 | spark_version: 15.3.x-cpu-ml-scala2.12 5 | node_type_id: Standard_D3_v2 6 | custom_tags: 7 | clusterSource: mlops-stacks_0.4 8 | 9 | common_permissions: &permissions 10 | permissions: 11 | - level: CAN_VIEW 12 | group_name: users 13 | 14 | write_feature_table_job: &write_feature_table_job 15 | write_feature_table_job: 16 | name: ${var.current_target}-mlops-write-feature-table-job 17 | job_clusters: 18 | - job_cluster_key: write_feature_table_job_cluster 19 | <<: *new_cluster 20 | tasks: 21 | - task_key: PickupFeatures 22 | job_cluster_key: write_feature_table_job_cluster 23 | notebook_task: 24 | notebook_path: ../feature_engineering/notebooks/GenerateAndWriteFeatures.py 25 | base_parameters: 26 | # TODO modify these arguments to reflect your setup. 27 | input_table_path: /databricks-datasets/nyctaxi-with-zipcodes/subsampled 28 | # TODO: Empty start/end dates will process the whole range. Update this as needed to process recent data. 29 | input_start_date: "" 30 | input_end_date: "" 31 | timestamp_column: tpep_pickup_datetime 32 | output_table_name: ${var.current_target}.mlops.trip_pickup_features 33 | features_transform_module: pickup_features 34 | primary_keys: zip 35 | # git source information of current ML resource deployment. It will be persisted as part of the workflow run 36 | git_source_info: url:${bundle.git.origin_url}; branch:${bundle.git.branch}; commit:${bundle.git.commit} 37 | - task_key: DropoffFeatures 38 | job_cluster_key: write_feature_table_job_cluster 39 | notebook_task: 40 | notebook_path: ../feature_engineering/notebooks/GenerateAndWriteFeatures.py 41 | base_parameters: 42 | # TODO: modify these arguments to reflect your setup. 
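            # (This DropoffFeatures task mirrors the PickupFeatures task above: same input
            # dataset and job cluster, but it uses the dropoff timestamp column, the
            # dropoff_features transform module, and writes to the trip_dropoff_features table.)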
43 | input_table_path: /databricks-datasets/nyctaxi-with-zipcodes/subsampled 44 | # TODO: Empty start/end dates will process the whole range. Update this as needed to process recent data. 45 | input_start_date: "" 46 | input_end_date: "" 47 | timestamp_column: tpep_dropoff_datetime 48 | output_table_name: ${var.current_target}.mlops.trip_dropoff_features 49 | features_transform_module: dropoff_features 50 | primary_keys: zip 51 | # git source information of current ML resource deployment. It will be persisted as part of the workflow run 52 | git_source_info: url:${bundle.git.origin_url}; branch:${bundle.git.branch}; commit:${bundle.git.commit} 53 | schedule: 54 | quartz_cron_expression: "0 0 7 * * ?" # daily at 7am 55 | timezone_id: UTC 56 | <<: *permissions 57 | # If you want to turn on notifications for this job, please uncomment the below code, 58 | # and provide a list of emails to the on_failure argument. 59 | # 60 | # email_notifications: 61 | # on_failure: 62 | # - first@company.com 63 | # - second@company.com 64 | 65 | targets: 66 | dev-phase1: 67 | resources: 68 | jobs: 69 | <<: *write_feature_table_job 70 | 71 | test-phase1: 72 | resources: 73 | jobs: 74 | <<: *write_feature_table_job 75 | 76 | prod-phase1: 77 | resources: 78 | jobs: 79 | <<: *write_feature_table_job 80 | -------------------------------------------------------------------------------- /mlops-statcks-multiphase/deployment/batch_inference/notebooks/BatchInference.py: -------------------------------------------------------------------------------- 1 | # Databricks notebook source 2 | ################################################################################## 3 | # Batch Inference Notebook 4 | # 5 | # This notebook is an example of applying a model for batch inference against an input delta table, 6 | # It is configured and can be executed as the batch_inference_job in the batch_inference_job workflow defined under 7 | # ``mlops/resources/batch-inference-workflow-resource.yml`` 8 | # 9 | # Parameters: 10 | # 11 | # * env (optional) - String name of the current environment (dev, staging, or prod). Defaults to "dev" 12 | # * input_table_name (required) - Delta table name containing your input data. 13 | # * output_table_name (required) - Delta table name where the predictions will be written to. 14 | # Note that this will create a new version of the Delta table if 15 | # the table already exists 16 | # * model_name (required) - The name of the model to be used in batch inference. 17 | ################################################################################## 18 | 19 | 20 | # List of input args needed to run the notebook as a job. 21 | # Provide them via DB widgets or notebook arguments. 22 | # 23 | # Name of the current environment 24 | dbutils.widgets.dropdown("env", "dev", ["dev", "staging", "prod"], "Environment Name") 25 | # A Hive-registered Delta table containing the input features. 26 | dbutils.widgets.text("input_table_name", "", label="Input Table Name") 27 | # Delta table to store the output predictions. 28 | dbutils.widgets.text("output_table_name", "", label="Output Table Name") 29 | # Unity Catalog registered model name to use for the trained mode. 
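# (The model name is resolved further below via the "champion" alias, i.e. a model URI of
# the form models:/<catalog>.<schema>.<model>@champion, so each batch inference run picks
# up whichever model version currently carries that alias.)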
30 | dbutils.widgets.text( 31 | "model_name", "dev.mlops.mlops-model", label="Full (Three-Level) Model Name" 32 | ) 33 | 34 | # COMMAND ---------- 35 | 36 | import os 37 | 38 | notebook_path = '/Workspace/' + os.path.dirname(dbutils.notebook.entry_point.getDbutils().notebook().getContext().notebookPath().get()) 39 | %cd $notebook_path 40 | 41 | # COMMAND ---------- 42 | 43 | # MAGIC %pip install -r ../../../requirements.txt 44 | 45 | # COMMAND ---------- 46 | 47 | dbutils.library.restartPython() 48 | 49 | # COMMAND ---------- 50 | 51 | import sys 52 | import os 53 | notebook_path = '/Workspace/' + os.path.dirname(dbutils.notebook.entry_point.getDbutils().notebook().getContext().notebookPath().get()) 54 | %cd $notebook_path 55 | %cd .. 56 | sys.path.append("../..") 57 | 58 | # COMMAND ---------- 59 | 60 | # DBTITLE 1,Define input and output variables 61 | 62 | env = dbutils.widgets.get("env") 63 | input_table_name = dbutils.widgets.get("input_table_name") 64 | output_table_name = dbutils.widgets.get("output_table_name") 65 | model_name = dbutils.widgets.get("model_name") 66 | assert input_table_name != "", "input_table_name notebook parameter must be specified" 67 | assert output_table_name != "", "output_table_name notebook parameter must be specified" 68 | assert model_name != "", "model_name notebook parameter must be specified" 69 | alias = "champion" 70 | model_uri = f"models:/{model_name}@{alias}" 71 | 72 | # COMMAND ---------- 73 | 74 | from mlflow import MlflowClient 75 | 76 | # Get model version from alias 77 | client = MlflowClient(registry_uri="databricks-uc") 78 | model_version = client.get_model_version_by_alias(model_name, alias).version 79 | 80 | # COMMAND ---------- 81 | 82 | # Get datetime 83 | from datetime import datetime 84 | 85 | ts = datetime.now().strftime("%Y-%m-%d %H:%M:%S") 86 | 87 | # COMMAND ---------- 88 | # DBTITLE 1,Load model and run inference 89 | 90 | from predict import predict_batch 91 | 92 | predict_batch(spark, model_uri, input_table_name, output_table_name, model_version, ts) 93 | dbutils.notebook.exit(output_table_name) 94 | -------------------------------------------------------------------------------- /mlops-statcks-multiphase/tmp/phase2.yml: -------------------------------------------------------------------------------- 1 | # Please complete all the TODOs in this file. 2 | # The regression monitor defined here works OOB with this example regression notebook: https://learn.microsoft.com/azure/databricks/_extras/notebooks/source/monitoring/regression-monitor 3 | # NOTE: Monitoring only works on Unity Catalog tables. 
4 | 5 | new_cluster: &new_cluster 6 | new_cluster: 7 | num_workers: 3 8 | spark_version: 15.3.x-cpu-ml-scala2.12 9 | node_type_id: Standard_D3_v2 10 | custom_tags: 11 | clusterSource: mlops-stacks_0.4 12 | 13 | common_permissions: &permissions 14 | permissions: 15 | - level: CAN_VIEW 16 | group_name: users 17 | 18 | quality_monitor: &quality_monitor 19 | mlops_quality_monitor: 20 | table_name: dev.mlops.predictions 21 | # TODO: Update the output schema name as per your requirements 22 | output_schema_name: ${var.current_target}.mlops 23 | # TODO: Update the below parameters as per your requirements 24 | assets_dir: /Users/${workspace.current_user.userName}/databricks_lakehouse_monitoring 25 | inference_log: 26 | granularities: [1 day] 27 | model_id_col: model_id 28 | prediction_col: prediction 29 | label_col: price 30 | problem_type: PROBLEM_TYPE_REGRESSION 31 | timestamp_col: timestamp 32 | schedule: 33 | quartz_cron_expression: 0 0 8 * * ? # Run Every day at 8am 34 | timezone_id: UTC 35 | 36 | retraining_job: &retraining_job 37 | retraining_job: 38 | name: ${var.current_target}-mlops-monitoring-retraining-job 39 | tasks: 40 | - task_key: monitored_metric_violation_check 41 | <<: *new_cluster 42 | notebook_task: 43 | notebook_path: ../monitoring/notebooks/MonitoredMetricViolationCheck.py 44 | base_parameters: 45 | env: ${var.current_target} 46 | table_name_under_monitor: dev.mlops.predictions 47 | # TODO: Update the metric to be monitored and violation threshold 48 | metric_to_monitor: root_mean_squared_error 49 | metric_violation_threshold: 100 50 | num_evaluation_windows: 5 51 | num_violation_windows: 2 52 | 53 | - task_key: is_metric_violated 54 | depends_on: 55 | - task_key: monitored_metric_violation_check 56 | condition_task: 57 | op: EQUAL_TO 58 | left: "{{tasks.monitored_metric_violation_check.values.is_metric_violated}}" 59 | right: "true" 60 | 61 | - task_key: trigger_retraining 62 | depends_on: 63 | - task_key: is_metric_violated 64 | outcome: "true" 65 | run_job_task: 66 | job_id: ${var.model_training_job_id} 67 | 68 | schedule: 69 | quartz_cron_expression: "0 0 18 * * ?" 
# daily at 6pm 70 | timezone_id: UTC 71 | <<: *permissions 72 | 73 | targets: 74 | dev-phase2: 75 | variables: 76 | model_training_job_id: 77 | lookup: 78 | job: "${var.current_target}-mlops-model-training-job" 79 | resources: 80 | quality_monitors: 81 | <<: *quality_monitor 82 | jobs: 83 | <<: *retraining_job 84 | 85 | test-phase2: 86 | variables: 87 | model_training_job_id: 88 | lookup: 89 | job: "${var.current_target}-mlops-model-training-job" 90 | resources: 91 | quality_monitors: 92 | <<: *quality_monitor 93 | jobs: 94 | <<: *retraining_job 95 | 96 | prod-phase2: 97 | variables: 98 | model_training_job_id: 99 | lookup: 100 | job: "${var.current_target}-mlops-model-training-job" 101 | resources: 102 | quality_monitors: 103 | <<: *quality_monitor 104 | jobs: 105 | <<: *retraining_job 106 | -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | # Byte-compiled / optimized / DLL files 2 | __pycache__/ 3 | *.py[cod] 4 | *$py.class 5 | 6 | # C extensions 7 | *.so 8 | 9 | # Distribution / packaging 10 | .Python 11 | build/ 12 | develop-eggs/ 13 | dist/ 14 | downloads/ 15 | eggs/ 16 | .eggs/ 17 | lib/ 18 | lib64/ 19 | parts/ 20 | sdist/ 21 | var/ 22 | wheels/ 23 | share/python-wheels/ 24 | *.egg-info/ 25 | .installed.cfg 26 | *.egg 27 | MANIFEST 28 | 29 | # PyInstaller 30 | # Usually these files are written by a python script from a template 31 | # before PyInstaller builds the exe, so as to inject date/other infos into it. 32 | *.manifest 33 | *.spec 34 | 35 | # Installer logs 36 | pip-log.txt 37 | pip-delete-this-directory.txt 38 | 39 | # Unit test / coverage reports 40 | htmlcov/ 41 | .tox/ 42 | .nox/ 43 | .coverage 44 | .coverage.* 45 | .cache 46 | nosetests.xml 47 | coverage.xml 48 | *.cover 49 | *.py,cover 50 | .hypothesis/ 51 | .pytest_cache/ 52 | cover/ 53 | 54 | # Translations 55 | *.mo 56 | *.pot 57 | 58 | # Django stuff: 59 | *.log 60 | local_settings.py 61 | db.sqlite3 62 | db.sqlite3-journal 63 | 64 | # Flask stuff: 65 | instance/ 66 | .webassets-cache 67 | 68 | # Scrapy stuff: 69 | .scrapy 70 | 71 | # Sphinx documentation 72 | docs/_build/ 73 | 74 | # PyBuilder 75 | .pybuilder/ 76 | target/ 77 | 78 | # Jupyter Notebook 79 | .ipynb_checkpoints 80 | 81 | # IPython 82 | profile_default/ 83 | ipython_config.py 84 | 85 | # pyenv 86 | # For a library or package, you might want to ignore these files since the code is 87 | # intended to run in multiple environments; otherwise, check them in: 88 | # .python-version 89 | 90 | # pipenv 91 | # According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control. 92 | # However, in case of collaboration, if having platform-specific dependencies or dependencies 93 | # having no cross-platform support, pipenv may install dependencies that don't work, or not 94 | # install all needed dependencies. 95 | #Pipfile.lock 96 | 97 | # poetry 98 | # Similar to Pipfile.lock, it is generally recommended to include poetry.lock in version control. 99 | # This is especially recommended for binary packages to ensure reproducibility, and is more 100 | # commonly ignored for libraries. 101 | # https://python-poetry.org/docs/basic-usage/#commit-your-poetrylock-file-to-version-control 102 | #poetry.lock 103 | 104 | # pdm 105 | # Similar to Pipfile.lock, it is generally recommended to include pdm.lock in version control. 
106 | #pdm.lock 107 | # pdm stores project-wide configurations in .pdm.toml, but it is recommended to not include it 108 | # in version control. 109 | # https://pdm.fming.dev/latest/usage/project/#working-with-version-control 110 | .pdm.toml 111 | .pdm-python 112 | .pdm-build/ 113 | 114 | # PEP 582; used by e.g. github.com/David-OConnor/pyflow and github.com/pdm-project/pdm 115 | __pypackages__/ 116 | 117 | # Celery stuff 118 | celerybeat-schedule 119 | celerybeat.pid 120 | 121 | # SageMath parsed files 122 | *.sage.py 123 | 124 | # Environments 125 | .env 126 | .venv 127 | env/ 128 | venv/ 129 | ENV/ 130 | env.bak/ 131 | venv.bak/ 132 | 133 | # Spyder project settings 134 | .spyderproject 135 | .spyproject 136 | 137 | # Rope project settings 138 | .ropeproject 139 | 140 | # mkdocs documentation 141 | /site 142 | 143 | # mypy 144 | .mypy_cache/ 145 | .dmypy.json 146 | dmypy.json 147 | 148 | # Pyre type checker 149 | .pyre/ 150 | 151 | # pytype static type analyzer 152 | .pytype/ 153 | 154 | # Cython debug symbols 155 | cython_debug/ 156 | 157 | # PyCharm 158 | # JetBrains specific template is maintained in a separate JetBrains.gitignore that can 159 | # be found at https://github.com/github/gitignore/blob/main/Global/JetBrains.gitignore 160 | # and can be added to the global gitignore or merged into this file. For a more nuclear 161 | # option (not recommended) you can uncomment the following to ignore the entire idea folder. 162 | #.idea/ 163 | -------------------------------------------------------------------------------- /mlops-statcks-multiphase/resources/monitoring-resource.yml: -------------------------------------------------------------------------------- 1 | # Please complete all the TODOs in this file. 2 | # The regression monitor defined here works OOB with this example regression notebook: https://learn.microsoft.com/azure/databricks/_extras/notebooks/source/monitoring/regression-monitor 3 | # NOTE: Monitoring only works on Unity Catalog tables. 4 | 5 | variables: 6 | model_training_job_id: 7 | description: "ID of model training job" 8 | 9 | new_cluster: &new_cluster 10 | new_cluster: 11 | num_workers: 3 12 | spark_version: 15.3.x-cpu-ml-scala2.12 13 | node_type_id: Standard_D3_v2 14 | custom_tags: 15 | clusterSource: mlops-stacks_0.4 16 | 17 | common_permissions: &permissions 18 | permissions: 19 | - level: CAN_VIEW 20 | group_name: users 21 | 22 | quality_monitor: &quality_monitor 23 | mlops_quality_monitor: 24 | table_name: dev.mlops.predictions 25 | # TODO: Update the output schema name as per your requirements 26 | output_schema_name: ${var.current_target}.mlops 27 | # TODO: Update the below parameters as per your requirements 28 | assets_dir: /Users/${workspace.current_user.userName}/databricks_lakehouse_monitoring 29 | inference_log: 30 | granularities: [1 day] 31 | model_id_col: model_id 32 | prediction_col: prediction 33 | label_col: price 34 | problem_type: PROBLEM_TYPE_REGRESSION 35 | timestamp_col: timestamp 36 | schedule: 37 | quartz_cron_expression: 0 0 8 * * ? 
# Run Every day at 8am 38 | timezone_id: UTC 39 | 40 | retraining_job: &retraining_job 41 | retraining_job: 42 | name: ${var.current_target}-mlops-monitoring-retraining-job 43 | tasks: 44 | - task_key: monitored_metric_violation_check 45 | <<: *new_cluster 46 | notebook_task: 47 | notebook_path: ../monitoring/notebooks/MonitoredMetricViolationCheck.py 48 | base_parameters: 49 | env: ${var.current_target} 50 | table_name_under_monitor: ${var.current_target}.mlops.predictions 51 | # TODO: Update the metric to be monitored and violation threshold 52 | metric_to_monitor: root_mean_squared_error 53 | metric_violation_threshold: 100 54 | num_evaluation_windows: 5 55 | num_violation_windows: 2 56 | 57 | - task_key: is_metric_violated 58 | depends_on: 59 | - task_key: monitored_metric_violation_check 60 | condition_task: 61 | op: EQUAL_TO 62 | left: "{{tasks.monitored_metric_violation_check.values.is_metric_violated}}" 63 | right: "true" 64 | 65 | - task_key: trigger_retraining 66 | depends_on: 67 | - task_key: is_metric_violated 68 | outcome: "true" 69 | run_job_task: 70 | job_id: ${var.model_training_job_id} 71 | 72 | schedule: 73 | quartz_cron_expression: "0 0 18 * * ?" # daily at 6pm 74 | timezone_id: UTC 75 | <<: *permissions 76 | 77 | targets: 78 | dev-phase2: 79 | variables: 80 | model_training_job_id: 81 | lookup: 82 | job: "${var.current_target}-mlops-model-training-job" 83 | resources: 84 | quality_monitors: 85 | <<: *quality_monitor 86 | jobs: 87 | <<: *retraining_job 88 | 89 | test-phase2: 90 | variables: 91 | model_training_job_id: 92 | lookup: 93 | job: "${var.current_target}-mlops-model-training-job" 94 | resources: 95 | quality_monitors: 96 | <<: *quality_monitor 97 | jobs: 98 | <<: *retraining_job 99 | 100 | prod-phase2: 101 | variables: 102 | model_training_job_id: 103 | lookup: 104 | job: "${var.current_target}-mlops-model-training-job" 105 | resources: 106 | quality_monitors: 107 | <<: *quality_monitor 108 | jobs: 109 | <<: *retraining_job 110 | -------------------------------------------------------------------------------- /jdemo/azure-pipelines.yml: -------------------------------------------------------------------------------- 1 | # Grab variables from the specific variable group and 2 | # determine sourceBranchName (avoids SourchBranchName=merge for PR) 3 | variables: 4 | - group: 'DABs Testing' 5 | - name: 'branchName' 6 | ${{ if startsWith(variables['Build.SourceBranch'], 'refs/heads/') }}: 7 | value: $[ replace(variables['Build.SourceBranch'], 'refs/heads/', '') ] 8 | ${{ if startsWith(variables['Build.SourceBranch'], 'refs/pull/') }}: 9 | value: $[ replace(variables['System.PullRequest.SourceBranch'], 'refs/heads/', '') ] 10 | 11 | trigger: 12 | batch: true 13 | branches: 14 | include: 15 | - '*' 16 | paths: 17 | exclude: 18 | - README.md 19 | - LICENSE 20 | - images 21 | - terraform 22 | - .github 23 | - .vscode 24 | - TODOs.org 25 | 26 | stages: 27 | - stage: onPush 28 | condition: | 29 | and( 30 | ne(variables['Build.SourceBranch'], 'refs/heads/releases'), 31 | not(startsWith(variables['Build.SourceBranch'], 'refs/tags/v')) 32 | ) 33 | jobs: 34 | - job: onPushJob 35 | pool: 36 | vmImage: 'ubuntu-latest' 37 | 38 | steps: 39 | - task: UsePythonVersion@0 40 | displayName: 'Use Python 3.11' 41 | inputs: 42 | versionSpec: 3.11 43 | 44 | - checkout: self 45 | displayName: 'Checkout & Build.Reason: $(Build.Reason) & Build.SourceBranchName: $(Build.SourceBranchName)' 46 | 47 | - script: | 48 | eval "$(/home/linuxbrew/.linuxbrew/bin/brew shellenv)" 49 | brew tap 
databricks/tap 50 | brew install databricks 51 | databricks -v 52 | displayName: 'Install Databricks CLI' 53 | env: 54 | HOMEBREW_NO_ENV_HINTS: 1 55 | HOMEBREW_NO_INSTALL_CLEANUP: 1 56 | 57 | - script: | 58 | pip install -U -r requirements-dev.txt 59 | displayName: 'Install dependencies' 60 | 61 | - script: | 62 | pytest tests --junit-xml=test-local.xml 63 | displayName: 'Execute local tests' 64 | 65 | - script: | 66 | eval "$(/home/linuxbrew/.linuxbrew/bin/brew shellenv)" 67 | databricks bundle deploy -t staging --var="uniq_id=$(branchName)" 68 | env: 69 | DATABRICKS_HOST: $(DATABRICKS_HOST) 70 | DATABRICKS_TOKEN: $(DATABRICKS_TOKEN) 71 | displayName: 'Deploy to staging' 72 | 73 | - script: | 74 | eval "$(/home/linuxbrew/.linuxbrew/bin/brew shellenv)" 75 | # We can pass parameters (--jar-params, --python-params, --notebook-params) here to point to another data location, etc. 76 | databricks bundle run jdemo_job -t staging --var="uniq_id=$(branchName)" 77 | env: 78 | DATABRICKS_HOST: $(DATABRICKS_HOST) 79 | DATABRICKS_TOKEN: $(DATABRICKS_TOKEN) 80 | displayName: 'Run in staging' 81 | 82 | - script: | 83 | eval "$(/home/linuxbrew/.linuxbrew/bin/brew shellenv)" 84 | echo "Optionally destroy the bundle" 85 | # databricks bundle destroy --auto-approve -t staging --var="uniq_id=$(branchName)" 86 | env: 87 | DATABRICKS_HOST: $(DATABRICKS_HOST) 88 | DATABRICKS_TOKEN: $(DATABRICKS_TOKEN) 89 | displayName: 'Destroy in staging on succcess' 90 | 91 | - task: PublishTestResults@2 92 | condition: succeededOrFailed() 93 | inputs: 94 | testResultsFormat: 'JUnit' 95 | testResultsFiles: '**/test-*.xml' 96 | failTaskOnFailedTests: true 97 | 98 | # Separate pipeline for releases branch 99 | # Right now it's similar to the onPush stage, but runs only local tests and then deploy to the prod. 
100 | - stage: onRelease 101 | condition: | 102 | eq(variables['Build.SourceBranch'], 'refs/heads/releases') 103 | jobs: 104 | - job: onReleaseJob 105 | pool: 106 | vmImage: 'ubuntu-latest' 107 | 108 | steps: 109 | - task: UsePythonVersion@0 110 | displayName: 'Use Python 3.11' 111 | inputs: 112 | versionSpec: 3.11 113 | 114 | - checkout: self 115 | displayName: 'Checkout & Build.Reason: $(Build.Reason) & Build.SourceBranchName: $(Build.SourceBranchName)' 116 | 117 | - script: | 118 | eval "$(/home/linuxbrew/.linuxbrew/bin/brew shellenv)" 119 | brew tap databricks/tap 120 | brew install databricks 121 | databricks -v 122 | displayName: 'Install Databricks CLI' 123 | env: 124 | HOMEBREW_NO_ENV_HINTS: 1 125 | HOMEBREW_NO_INSTALL_CLEANUP: 1 126 | 127 | - script: | 128 | pip install -U -r requirements-dev.txt 129 | displayName: 'Install dependencies' 130 | 131 | - script: | 132 | pytest tests --junit-xml=test-local.xml 133 | displayName: 'Execute local tests' 134 | 135 | - script: | 136 | eval "$(/home/linuxbrew/.linuxbrew/bin/brew shellenv)" 137 | databricks bundle deploy -t prod 138 | env: 139 | DATABRICKS_HOST: $(DATABRICKS_HOST) 140 | DATABRICKS_TOKEN: $(DATABRICKS_TOKEN) 141 | displayName: 'Deploy to production' 142 | 143 | - task: PublishTestResults@2 144 | condition: succeededOrFailed() 145 | inputs: 146 | testResultsFormat: 'JUnit' 147 | testResultsFiles: '**/test-*.xml' 148 | failTaskOnFailedTests: true 149 | -------------------------------------------------------------------------------- /mlops-statcks-multiphase/feature_engineering/notebooks/GenerateAndWriteFeatures.py: -------------------------------------------------------------------------------- 1 | # Databricks notebook source 2 | ################################################################################## 3 | # Generate and Write Features Notebook 4 | # 5 | # This notebook can be used to generate and write features to a Databricks Feature Store table. 6 | # It is configured and can be executed as the tasks in the write_feature_table_job workflow defined under 7 | # ``mlops/resources/feature-engineering-workflow-resource.yml`` 8 | # 9 | # Parameters: 10 | # 11 | # * input_table_path (required) - Path to input data. 12 | # * output_table_name (required) - Fully qualified schema + Delta table name for the feature table where the features 13 | # * will be written to. Note that this will create the Feature table if it does not 14 | # * exist. 15 | # * primary_keys (required) - A comma separated string of primary key columns of the output feature table. 16 | # * 17 | # * timestamp_column (optional) - Timestamp column of the input data. Used to limit processing based on 18 | # * date ranges. This column is used as the timestamp_key column in the feature table. 19 | # * input_start_date (optional) - Used to limit feature computations based on timestamp_column values. 20 | # * input_end_date (optional) - Used to limit feature computations based on timestamp_column values. 21 | # * 22 | # * features_transform_module (required) - Python module containing the feature transform logic. 23 | ################################################################################## 24 | 25 | 26 | # List of input args needed to run this notebook as a job. 27 | # Provide them via DB widgets or notebook arguments. 28 | # 29 | # A Hive-registered Delta table containing the input data. 
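# (Note: despite the "Input Table Name" label, this value is consumed as a Delta path:
# the notebook loads it below via spark.read.format("delta").load(input_table_path)
# rather than by registered table name.)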
30 | dbutils.widgets.text( 31 | "input_table_path", 32 | "/databricks-datasets/nyctaxi-with-zipcodes/subsampled", 33 | label="Input Table Name", 34 | ) 35 | # Input start date. 36 | dbutils.widgets.text("input_start_date", "", label="Input Start Date") 37 | # Input end date. 38 | dbutils.widgets.text("input_end_date", "", label="Input End Date") 39 | # Timestamp column. Will be used to filter input start/end dates. 40 | # This column is also used as a timestamp key of the feature table. 41 | dbutils.widgets.text( 42 | "timestamp_column", "tpep_pickup_datetime", label="Timestamp column" 43 | ) 44 | 45 | # Feature table to store the computed features. 46 | dbutils.widgets.text( 47 | "output_table_name", 48 | "dev.mlops.trip_pickup_features", 49 | label="Output Feature Table Name", 50 | ) 51 | 52 | # Feature transform module name. 53 | dbutils.widgets.text( 54 | "features_transform_module", "pickup_features", label="Features transform file." 55 | ) 56 | # Primary Keys columns for the feature table; 57 | dbutils.widgets.text( 58 | "primary_keys", 59 | "zip", 60 | label="Primary keys columns for the feature table, comma separated.", 61 | ) 62 | 63 | # COMMAND ---------- 64 | 65 | import os 66 | notebook_path = '/Workspace/' + os.path.dirname(dbutils.notebook.entry_point.getDbutils().notebook().getContext().notebookPath().get()) 67 | %cd $notebook_path 68 | %cd ../features 69 | 70 | # COMMAND ---------- 71 | 72 | # DBTITLE 1,Define input and output variables 73 | input_table_path = dbutils.widgets.get("input_table_path") 74 | output_table_name = dbutils.widgets.get("output_table_name") 75 | input_start_date = dbutils.widgets.get("input_start_date") 76 | input_end_date = dbutils.widgets.get("input_end_date") 77 | ts_column = dbutils.widgets.get("timestamp_column") 78 | features_module = dbutils.widgets.get("features_transform_module") 79 | pk_columns = dbutils.widgets.get("primary_keys") 80 | 81 | assert input_table_path != "", "input_table_path notebook parameter must be specified" 82 | assert output_table_name != "", "output_table_name notebook parameter must be specified" 83 | 84 | # Extract database name. Needs to be updated for Unity Catalog to the Schema name. 85 | output_database = output_table_name.split(".")[1] 86 | 87 | # COMMAND ---------- 88 | 89 | # DBTITLE 1,Create database. 90 | spark.sql("CREATE DATABASE IF NOT EXISTS " + output_database) 91 | 92 | # COMMAND ---------- 93 | 94 | # DBTITLE 1, Read input data. 95 | raw_data = spark.read.format("delta").load(input_table_path) 96 | 97 | # COMMAND ---------- 98 | 99 | # DBTITLE 1,Compute features. 100 | # Compute the features. This is done by dynamically loading the features module. 101 | from importlib import import_module 102 | 103 | mod = import_module(features_module) 104 | compute_features_fn = getattr(mod, "compute_features_fn") 105 | 106 | features_df = compute_features_fn( 107 | input_df=raw_data, 108 | timestamp_column=ts_column, 109 | start_date=input_start_date, 110 | end_date=input_end_date, 111 | ) 112 | 113 | # COMMAND ---------- 114 | 115 | # DBTITLE 1, Write computed features. 116 | from databricks.feature_engineering import FeatureEngineeringClient 117 | 118 | fe = FeatureEngineeringClient() 119 | 120 | # Create the feature table if it does not exist first. 121 | # Note that this is a no-op if a table with the same name and schema already exists. 
122 | fe.create_table( 123 | name=output_table_name, 124 | primary_keys=[x.strip() for x in pk_columns.split(",")] + [ts_column], # Include timeseries column in primary_keys 125 | timestamp_keys=[ts_column], 126 | df=features_df, 127 | ) 128 | 129 | # Write the computed features dataframe. 130 | fe.write_table( 131 | name=output_table_name, 132 | df=features_df, 133 | mode="merge", 134 | ) 135 | 136 | # COMMAND ---------- 137 | 138 | dbutils.notebook.exit(0) -------------------------------------------------------------------------------- /mlops-statcks-multiphase/resources/model-workflow-resource.yml: -------------------------------------------------------------------------------- 1 | new_cluster: &new_cluster 2 | new_cluster: 3 | num_workers: 3 4 | spark_version: 15.3.x-cpu-ml-scala2.12 5 | node_type_id: Standard_D3_v2 6 | custom_tags: 7 | clusterSource: mlops-stacks_0.4 8 | 9 | common_permissions: &permissions 10 | permissions: 11 | - level: CAN_VIEW 12 | group_name: users 13 | 14 | model_training_job: &model_training_job 15 | model_training_job: 16 | name: ${var.current_target}-mlops-model-training-job 17 | job_clusters: 18 | - job_cluster_key: model_training_job_cluster 19 | <<: *new_cluster 20 | tasks: 21 | - task_key: Train 22 | job_cluster_key: model_training_job_cluster 23 | notebook_task: 24 | notebook_path: ../training/notebooks/TrainWithFeatureStore.py 25 | base_parameters: 26 | env: ${bundle.target} 27 | # TODO: Update training_data_path 28 | training_data_path: /databricks-datasets/nyctaxi-with-zipcodes/subsampled 29 | experiment_name: ${var.experiment_name} 30 | model_name: ${bundle.target}.mlops.${var.model_name} 31 | pickup_features_table: ${bundle.target}.mlops.trip_pickup_features 32 | dropoff_features_table: ${bundle.target}.mlops.trip_dropoff_features 33 | # git source information of current ML resource deployment. It will be persisted as part of the workflow run 34 | git_source_info: url:${bundle.git.origin_url}; branch:${bundle.git.branch}; commit:${bundle.git.commit} 35 | - task_key: ModelValidation 36 | job_cluster_key: model_training_job_cluster 37 | depends_on: 38 | - task_key: Train 39 | notebook_task: 40 | notebook_path: ../validation/notebooks/ModelValidation.py 41 | base_parameters: 42 | experiment_name: ${var.experiment_name} 43 | # The `run_mode` defines whether model validation is enabled or not. 44 | # It can be one of the three values: 45 | # `disabled` : Do not run the model validation notebook. 46 | # `dry_run` : Run the model validation notebook. Ignore failed model validation rules and proceed to move 47 | # model to Production stage. 48 | # `enabled` : Run the model validation notebook. Move model to Production stage only if all model validation 49 | # rules are passing. 50 | # TODO: update run_mode 51 | run_mode: dry_run 52 | # Whether to load the current registered "Production" stage model as baseline. 53 | # Baseline model is a requirement for relative change and absolute change validation thresholds. 54 | # TODO: update enable_baseline_comparison 55 | enable_baseline_comparison: "false" 56 | # Please refer to data parameter in mlflow.evaluate documentation https://mlflow.org/docs/latest/python_api/mlflow.html#mlflow.evaluate 57 | # TODO: update validation_input 58 | validation_input: SELECT * FROM delta.`dbfs:/databricks-datasets/nyctaxi-with-zipcodes/subsampled` 59 | # A string describing the model type. The model type can be either "regressor" and "classifier". 
60 | # Please refer to model_type parameter in mlflow.evaluate documentation https://mlflow.org/docs/latest/python_api/mlflow.html#mlflow.evaluate 61 | # TODO: update model_type 62 | model_type: regressor 63 | # The string name of a column from data that contains evaluation labels. 64 | # Please refer to targets parameter in mlflow.evaluate documentation https://mlflow.org/docs/latest/python_api/mlflow.html#mlflow.evaluate 65 | # TODO: targets 66 | targets: fare_amount 67 | # Specifies the name of the function in mlops/training_validation_deployment/validation/validation.py that returns custom metrics. 68 | # TODO(optional): custom_metrics_loader_function 69 | custom_metrics_loader_function: custom_metrics 70 | # Specifies the name of the function in mlops/training_validation_deployment/validation/validation.py that returns model validation thresholds. 71 | # TODO(optional): validation_thresholds_loader_function 72 | validation_thresholds_loader_function: validation_thresholds 73 | # Specifies the name of the function in mlops/training_validation_deployment/validation/validation.py that returns evaluator_config. 74 | # TODO(optional): evaluator_config_loader_function 75 | evaluator_config_loader_function: evaluator_config 76 | # git source information of current ML resource deployment. It will be persisted as part of the workflow run 77 | git_source_info: url:${bundle.git.origin_url}; branch:${bundle.git.branch}; commit:${bundle.git.commit} 78 | - task_key: ModelDeployment 79 | job_cluster_key: model_training_job_cluster 80 | depends_on: 81 | - task_key: ModelValidation 82 | notebook_task: 83 | notebook_path: ../deployment/model_deployment/notebooks/ModelDeployment.py 84 | base_parameters: 85 | env: ${bundle.target} 86 | # git source information of current ML resource deployment. It will be persisted as part of the workflow run 87 | git_source_info: url:${bundle.git.origin_url}; branch:${bundle.git.branch}; commit:${bundle.git.commit} 88 | schedule: 89 | quartz_cron_expression: "0 0 9 * * ?" # daily at 9am 90 | timezone_id: UTC 91 | <<: *permissions 92 | # If you want to turn on notifications for this job, please uncomment the below code, 93 | # and provide a list of emails to the on_failure argument. 94 | # 95 | # email_notifications: 96 | # on_failure: 97 | # - first@company.com 98 | # - second@company.com 99 | 100 | targets: 101 | dev-phase1: 102 | resources: 103 | jobs: 104 | <<: *model_training_job 105 | 106 | test-phase1: 107 | resources: 108 | jobs: 109 | <<: *model_training_job 110 | 111 | prod-phase1: 112 | resources: 113 | jobs: 114 | <<: *model_training_job 115 | -------------------------------------------------------------------------------- /mlops-statcks-multiphase/training/notebooks/TrainWithFeatureStore.py: -------------------------------------------------------------------------------- 1 | # Databricks notebook source 2 | ################################################################################## 3 | # Model Training Notebook using Databricks Feature Store 4 | # 5 | # This notebook shows an example of a Model Training pipeline using Databricks Feature Store tables. 6 | # It is configured and can be executed as the "Train" task in the model_training_job workflow defined under 7 | # ``mlops/resources/model-workflow-resource.yml`` 8 | # 9 | # Parameters: 10 | # * env (required): - Environment the notebook is run in (staging, or prod). Defaults to "staging". 11 | # * training_data_path (required) - Path to the training data. 
12 | # * experiment_name (required) - MLflow experiment name for the training runs. Will be created if it doesn't exist. 13 | # * model_name (required) - Three-level name (..) to register the trained model in Unity Catalog. 14 | # 15 | ################################################################################## 16 | 17 | # COMMAND ---------- 18 | 19 | # MAGIC %load_ext autoreload 20 | # MAGIC %autoreload 2 21 | 22 | # COMMAND ---------- 23 | 24 | import os 25 | notebook_path = '/Workspace/' + os.path.dirname(dbutils.notebook.entry_point.getDbutils().notebook().getContext().notebookPath().get()) 26 | %cd $notebook_path 27 | 28 | # COMMAND ---------- 29 | 30 | # MAGIC %pip install -r ../../requirements.txt 31 | 32 | # COMMAND ---------- 33 | 34 | dbutils.library.restartPython() 35 | 36 | # COMMAND ---------- 37 | 38 | # DBTITLE 1, Notebook arguments 39 | # List of input args needed to run this notebook as a job. 40 | # Provide them via DB widgets or notebook arguments. 41 | 42 | # Notebook Environment 43 | dbutils.widgets.dropdown("env", "staging", ["staging", "prod"], "Environment Name") 44 | env = dbutils.widgets.get("env") 45 | 46 | # Path to the Hive-registered Delta table containing the training data. 47 | dbutils.widgets.text( 48 | "training_data_path", 49 | "/databricks-datasets/nyctaxi-with-zipcodes/subsampled", 50 | label="Path to the training data", 51 | ) 52 | 53 | # MLflow experiment name. 54 | dbutils.widgets.text( 55 | "experiment_name", 56 | f"/dev-mlops-experiment", 57 | label="MLflow experiment name", 58 | ) 59 | 60 | 61 | # Unity Catalog registered model name to use for the trained mode. 62 | dbutils.widgets.text( 63 | "model_name", "dev.mlops.mlops-model", label="Full (Three-Level) Model Name" 64 | ) 65 | 66 | # Pickup features table name 67 | dbutils.widgets.text( 68 | "pickup_features_table", 69 | "dev.mlops.trip_pickup_features", 70 | label="Pickup Features Table", 71 | ) 72 | 73 | # Dropoff features table name 74 | dbutils.widgets.text( 75 | "dropoff_features_table", 76 | "dev.mlops.trip_dropoff_features", 77 | label="Dropoff Features Table", 78 | ) 79 | 80 | # COMMAND ---------- 81 | 82 | # DBTITLE 1,Define input and output variables 83 | input_table_path = dbutils.widgets.get("training_data_path") 84 | experiment_name = dbutils.widgets.get("experiment_name") 85 | model_name = dbutils.widgets.get("model_name") 86 | 87 | # COMMAND ---------- 88 | 89 | # DBTITLE 1, Set experiment 90 | import mlflow 91 | 92 | mlflow.set_experiment(experiment_name) 93 | mlflow.set_registry_uri('databricks-uc') 94 | 95 | # COMMAND ---------- 96 | 97 | # DBTITLE 1, Load raw data 98 | raw_data = spark.read.format("delta").load(input_table_path) 99 | raw_data.display() 100 | 101 | # COMMAND ---------- 102 | 103 | # DBTITLE 1, Helper functions 104 | from datetime import timedelta, timezone 105 | import math 106 | import mlflow.pyfunc 107 | import pyspark.sql.functions as F 108 | from pyspark.sql.types import IntegerType 109 | 110 | 111 | def rounded_unix_timestamp(dt, num_minutes=15): 112 | """ 113 | Ceilings datetime dt to interval num_minutes, then returns the unix timestamp. 
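
    Example (illustrative): with num_minutes=15, a timestamp of 2016-01-01 10:07:00
    is ceiled to 10:15:00, so the function returns the unix timestamp of
    2016-01-01 10:15:00 UTC.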
114 | """ 115 | nsecs = dt.minute * 60 + dt.second + dt.microsecond * 1e-6 116 | delta = math.ceil(nsecs / (60 * num_minutes)) * (60 * num_minutes) - nsecs 117 | return int((dt + timedelta(seconds=delta)).replace(tzinfo=timezone.utc).timestamp()) 118 | 119 | 120 | rounded_unix_timestamp_udf = F.udf(rounded_unix_timestamp, IntegerType()) 121 | 122 | 123 | def rounded_taxi_data(taxi_data_df): 124 | # Round the taxi data timestamp to 15 and 30 minute intervals so we can join with the pickup and dropoff features 125 | # respectively. 126 | taxi_data_df = ( 127 | taxi_data_df.withColumn( 128 | "rounded_pickup_datetime", 129 | F.to_timestamp( 130 | rounded_unix_timestamp_udf( 131 | taxi_data_df["tpep_pickup_datetime"], F.lit(15) 132 | ) 133 | ), 134 | ) 135 | .withColumn( 136 | "rounded_dropoff_datetime", 137 | F.to_timestamp( 138 | rounded_unix_timestamp_udf( 139 | taxi_data_df["tpep_dropoff_datetime"], F.lit(30) 140 | ) 141 | ), 142 | ) 143 | .drop("tpep_pickup_datetime") 144 | .drop("tpep_dropoff_datetime") 145 | ) 146 | taxi_data_df.createOrReplaceTempView("taxi_data") 147 | return taxi_data_df 148 | 149 | 150 | def get_latest_model_version(model_name): 151 | latest_version = 1 152 | mlflow_client = MlflowClient() 153 | for mv in mlflow_client.search_model_versions(f"name='{model_name}'"): 154 | version_int = int(mv.version) 155 | if version_int > latest_version: 156 | latest_version = version_int 157 | return latest_version 158 | 159 | 160 | # COMMAND ---------- 161 | 162 | # DBTITLE 1, Read taxi data for training 163 | taxi_data = rounded_taxi_data(raw_data) 164 | taxi_data.display() 165 | 166 | # COMMAND ---------- 167 | 168 | # DBTITLE 1, Create FeatureLookups 169 | from databricks.feature_engineering import FeatureLookup 170 | import mlflow 171 | 172 | pickup_features_table = dbutils.widgets.get("pickup_features_table") 173 | dropoff_features_table = dbutils.widgets.get("dropoff_features_table") 174 | 175 | pickup_feature_lookups = [ 176 | FeatureLookup( 177 | table_name=pickup_features_table, 178 | feature_names=[ 179 | "mean_fare_window_1h_pickup_zip", 180 | "count_trips_window_1h_pickup_zip", 181 | ], 182 | lookup_key=["pickup_zip"], 183 | timestamp_lookup_key=["rounded_pickup_datetime"], 184 | ), 185 | ] 186 | 187 | dropoff_feature_lookups = [ 188 | FeatureLookup( 189 | table_name=dropoff_features_table, 190 | feature_names=["count_trips_window_30m_dropoff_zip", "dropoff_is_weekend"], 191 | lookup_key=["dropoff_zip"], 192 | timestamp_lookup_key=["rounded_dropoff_datetime"], 193 | ), 194 | ] 195 | 196 | # COMMAND ---------- 197 | 198 | # DBTITLE 1, Create Training Dataset 199 | 200 | from databricks.feature_engineering import FeatureEngineeringClient 201 | 202 | # End any existing runs (in the case this notebook is being run for a second time) 203 | mlflow.end_run() 204 | 205 | # Start an mlflow run, which is needed for the feature store to log the model 206 | mlflow.start_run() 207 | 208 | # Since the rounded timestamp columns would likely cause the model to overfit the data 209 | # unless additional feature engineering was performed, exclude them to avoid training on them. 
210 | exclude_columns = ["rounded_pickup_datetime", "rounded_dropoff_datetime"] 211 | 212 | fe = FeatureEngineeringClient() 213 | 214 | # Create the training set that includes the raw input data merged with corresponding features from both feature tables 215 | training_set = fe.create_training_set( 216 | df=taxi_data, # specify the df 217 | feature_lookups=pickup_feature_lookups + dropoff_feature_lookups, 218 | # both features need to be available; defined in GenerateAndWriteFeatures &/or feature-engineering-workflow-resource.yml 219 | label="fare_amount", 220 | exclude_columns=exclude_columns, 221 | ) 222 | 223 | 224 | # Load the TrainingSet into a dataframe which can be passed into sklearn for training a model 225 | training_df = training_set.load_df() 226 | 227 | # COMMAND ---------- 228 | 229 | # Display the training dataframe, and note that it contains both the raw input data and the features from the Feature Store, like `dropoff_is_weekend` 230 | training_df.display() 231 | 232 | # COMMAND ---------- 233 | 234 | # MAGIC %md 235 | # MAGIC Train a LightGBM model on the data returned by `TrainingSet.to_df`, then log the model with `FeatureStoreClient.log_model`. The model will be packaged with feature metadata. 236 | 237 | # COMMAND ---------- 238 | 239 | # DBTITLE 1, Train model 240 | import lightgbm as lgb 241 | from sklearn.model_selection import train_test_split 242 | import mlflow.lightgbm 243 | from mlflow.tracking import MlflowClient 244 | 245 | 246 | features_and_label = training_df.columns 247 | 248 | # Collect data into a Pandas array for training 249 | data = training_df.toPandas()[features_and_label] 250 | 251 | train, test = train_test_split(data, random_state=123) 252 | X_train = train.drop(["fare_amount"], axis=1) 253 | X_test = test.drop(["fare_amount"], axis=1) 254 | y_train = train.fare_amount 255 | y_test = test.fare_amount 256 | 257 | mlflow.lightgbm.autolog() 258 | train_lgb_dataset = lgb.Dataset(X_train, label=y_train.values) 259 | test_lgb_dataset = lgb.Dataset(X_test, label=y_test.values) 260 | 261 | param = {"num_leaves": 32, "objective": "regression", "metric": "rmse"} 262 | num_rounds = 100 263 | 264 | # Train a lightGBM model 265 | model = lgb.train(param, train_lgb_dataset, num_rounds) 266 | 267 | # COMMAND ---------- 268 | 269 | # DBTITLE 1, Log model and return output. 270 | # Log the trained model with MLflow and package it with feature lookup information. 271 | fe.log_model( 272 | model=model, #specify model 273 | artifact_path="model_packaged", 274 | flavor=mlflow.lightgbm, 275 | training_set=training_set, 276 | registered_model_name=model_name, 277 | ) 278 | 279 | 280 | # The returned model URI is needed by the model deployment notebook. 
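# Note: downstream tasks in the same workflow consume these task values -- e.g. the
# ModelValidation notebook reads them via dbutils.jobs.taskValues.get("Train", "model_uri", ...),
# so the keys set below must stay in sync with what those notebooks expect.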
281 | model_version = get_latest_model_version(model_name) 282 | model_uri = f"models:/{model_name}/{model_version}" 283 | dbutils.jobs.taskValues.set("model_uri", model_uri) 284 | dbutils.jobs.taskValues.set("model_name", model_name) 285 | dbutils.jobs.taskValues.set("model_version", model_version) 286 | dbutils.notebook.exit(model_uri) 287 | -------------------------------------------------------------------------------- /mlops-statcks-multiphase/tmp/phase1.yml: -------------------------------------------------------------------------------- 1 | new_cluster: &new_cluster 2 | new_cluster: 3 | num_workers: 3 4 | spark_version: 15.3.x-cpu-ml-scala2.12 5 | node_type_id: Standard_D3_v2 6 | custom_tags: 7 | clusterSource: mlops-stacks_0.4 8 | 9 | jobs_permissions: &jobs_permissions 10 | permissions: 11 | - level: CAN_VIEW 12 | group_name: users 13 | 14 | jobs: &jobs 15 | batch_inference_job: 16 | name: ${var.current_target}-mlops-batch-inference-job 17 | tasks: 18 | - task_key: batch_inference_job 19 | <<: *new_cluster 20 | notebook_task: 21 | notebook_path: ../deployment/batch_inference/notebooks/BatchInference.py 22 | base_parameters: 23 | env: ${var.current_target} 24 | input_table_name: ${var.current_target}.mlops.feature_store_inference_input # TODO: create input table for inference 25 | output_table_name: ${var.current_target}.mlops.predictions 26 | model_name: ${var.current_target}.mlops.${var.model_name} 27 | # git source information of current ML resource deployment. It will be persisted as part of the workflow run 28 | git_source_info: url:${bundle.git.origin_url}; branch:${bundle.git.branch}; commit:${bundle.git.commit} 29 | schedule: 30 | quartz_cron_expression: "0 0 11 * * ?" # daily at 11am 31 | timezone_id: UTC 32 | <<: *jobs_permissions 33 | # If you want to turn on notifications for this job, please uncomment the below code, 34 | # and provide a list of emails to the on_failure argument. 35 | # 36 | # email_notifications: 37 | # on_failure: 38 | # - first@company.com 39 | # - second@company.com 40 | 41 | write_feature_table_job: 42 | name: ${var.current_target}-mlops-write-feature-table-job 43 | job_clusters: 44 | - job_cluster_key: write_feature_table_job_cluster 45 | <<: *new_cluster 46 | tasks: 47 | - task_key: PickupFeatures 48 | job_cluster_key: write_feature_table_job_cluster 49 | notebook_task: 50 | notebook_path: ../feature_engineering/notebooks/GenerateAndWriteFeatures.py 51 | base_parameters: 52 | # TODO modify these arguments to reflect your setup. 53 | input_table_path: /databricks-datasets/nyctaxi-with-zipcodes/subsampled 54 | # TODO: Empty start/end dates will process the whole range. Update this as needed to process recent data. 55 | input_start_date: "" 56 | input_end_date: "" 57 | timestamp_column: tpep_pickup_datetime 58 | output_table_name: ${var.current_target}.mlops.trip_pickup_features 59 | features_transform_module: pickup_features 60 | primary_keys: zip 61 | # git source information of current ML resource deployment. It will be persisted as part of the workflow run 62 | git_source_info: url:${bundle.git.origin_url}; branch:${bundle.git.branch}; commit:${bundle.git.commit} 63 | - task_key: DropoffFeatures 64 | job_cluster_key: write_feature_table_job_cluster 65 | notebook_task: 66 | notebook_path: ../feature_engineering/notebooks/GenerateAndWriteFeatures.py 67 | base_parameters: 68 | # TODO: modify these arguments to reflect your setup. 
69 | input_table_path: /databricks-datasets/nyctaxi-with-zipcodes/subsampled 70 | # TODO: Empty start/end dates will process the whole range. Update this as needed to process recent data. 71 | input_start_date: "" 72 | input_end_date: "" 73 | timestamp_column: tpep_dropoff_datetime 74 | output_table_name: ${var.current_target}.mlops.trip_dropoff_features 75 | features_transform_module: dropoff_features 76 | primary_keys: zip 77 | # git source information of current ML resource deployment. It will be persisted as part of the workflow run 78 | git_source_info: url:${bundle.git.origin_url}; branch:${bundle.git.branch}; commit:${bundle.git.commit} 79 | schedule: 80 | quartz_cron_expression: "0 0 7 * * ?" # daily at 7am 81 | timezone_id: UTC 82 | <<: *jobs_permissions 83 | # If you want to turn on notifications for this job, please uncomment the below code, 84 | # and provide a list of emails to the on_failure argument. 85 | # 86 | # email_notifications: 87 | # on_failure: 88 | # - first@company.com 89 | # - second@company.com 90 | 91 | model_training_job: 92 | name: ${var.current_target}-mlops-model-training-job 93 | job_clusters: 94 | - job_cluster_key: model_training_job_cluster 95 | <<: *new_cluster 96 | tasks: 97 | - task_key: Train 98 | job_cluster_key: model_training_job_cluster 99 | notebook_task: 100 | notebook_path: ../training/notebooks/TrainWithFeatureStore.py 101 | base_parameters: 102 | env: ${var.current_target} 103 | # TODO: Update training_data_path 104 | training_data_path: /databricks-datasets/nyctaxi-with-zipcodes/subsampled 105 | experiment_name: ${var.experiment_name} 106 | model_name: ${var.current_target}.mlops.${var.model_name} 107 | pickup_features_table: ${var.current_target}.mlops.trip_pickup_features 108 | dropoff_features_table: ${var.current_target}.mlops.trip_dropoff_features 109 | # git source information of current ML resource deployment. It will be persisted as part of the workflow run 110 | git_source_info: url:${bundle.git.origin_url}; branch:${bundle.git.branch}; commit:${bundle.git.commit} 111 | - task_key: ModelValidation 112 | job_cluster_key: model_training_job_cluster 113 | depends_on: 114 | - task_key: Train 115 | notebook_task: 116 | notebook_path: ../validation/notebooks/ModelValidation.py 117 | base_parameters: 118 | experiment_name: ${var.experiment_name} 119 | # The `run_mode` defines whether model validation is enabled or not. 120 | # It can be one of the three values: 121 | # `disabled` : Do not run the model validation notebook. 122 | # `dry_run` : Run the model validation notebook. Ignore failed model validation rules and proceed to move 123 | # model to Production stage. 124 | # `enabled` : Run the model validation notebook. Move model to Production stage only if all model validation 125 | # rules are passing. 126 | # TODO: update run_mode 127 | run_mode: dry_run 128 | # Whether to load the current registered "Production" stage model as baseline. 129 | # Baseline model is a requirement for relative change and absolute change validation thresholds. 130 | # TODO: update enable_baseline_comparison 131 | enable_baseline_comparison: "false" 132 | # Please refer to data parameter in mlflow.evaluate documentation https://mlflow.org/docs/latest/python_api/mlflow.html#mlflow.evaluate 133 | # TODO: update validation_input 134 | validation_input: SELECT * FROM delta.`dbfs:/databricks-datasets/nyctaxi-with-zipcodes/subsampled` 135 | # A string describing the model type. The model type can be either "regressor" and "classifier". 
136 | # Please refer to model_type parameter in mlflow.evaluate documentation https://mlflow.org/docs/latest/python_api/mlflow.html#mlflow.evaluate 137 | # TODO: update model_type 138 | model_type: regressor 139 | # The string name of a column from data that contains evaluation labels. 140 | # Please refer to targets parameter in mlflow.evaluate documentation https://mlflow.org/docs/latest/python_api/mlflow.html#mlflow.evaluate 141 | # TODO: targets 142 | targets: fare_amount 143 | # Specifies the name of the function in mlops/training_validation_deployment/validation/validation.py that returns custom metrics. 144 | # TODO(optional): custom_metrics_loader_function 145 | custom_metrics_loader_function: custom_metrics 146 | # Specifies the name of the function in mlops/training_validation_deployment/validation/validation.py that returns model validation thresholds. 147 | # TODO(optional): validation_thresholds_loader_function 148 | validation_thresholds_loader_function: validation_thresholds 149 | # Specifies the name of the function in mlops/training_validation_deployment/validation/validation.py that returns evaluator_config. 150 | # TODO(optional): evaluator_config_loader_function 151 | evaluator_config_loader_function: evaluator_config 152 | # git source information of current ML resource deployment. It will be persisted as part of the workflow run 153 | git_source_info: url:${bundle.git.origin_url}; branch:${bundle.git.branch}; commit:${bundle.git.commit} 154 | - task_key: ModelDeployment 155 | job_cluster_key: model_training_job_cluster 156 | depends_on: 157 | - task_key: ModelValidation 158 | notebook_task: 159 | notebook_path: ../deployment/model_deployment/notebooks/ModelDeployment.py 160 | base_parameters: 161 | env: ${var.current_target} 162 | # git source information of current ML resource deployment. It will be persisted as part of the workflow run 163 | git_source_info: url:${bundle.git.origin_url}; branch:${bundle.git.branch}; commit:${bundle.git.commit} 164 | schedule: 165 | quartz_cron_expression: "0 0 9 * * ?" # daily at 9am 166 | timezone_id: UTC 167 | <<: *jobs_permissions 168 | # If you want to turn on notifications for this job, please uncomment the below code, 169 | # and provide a list of emails to the on_failure argument. 170 | # 171 | # email_notifications: 172 | # on_failure: 173 | # - first@company.com 174 | # - second@company.com 175 | 176 | 177 | experiment_permissions: &experiment_permissions 178 | permissions: 179 | - level: CAN_READ 180 | group_name: users 181 | 182 | # Allow users to execute models in Unity Catalog 183 | model_grants: &model_grants 184 | grants: 185 | - privileges: 186 | - EXECUTE 187 | principal: account users 188 | 189 | # Defines model and experiments 190 | model: &model 191 | model: 192 | name: ${var.model_name} 193 | catalog_name: ${var.current_target} 194 | schema_name: mlops 195 | comment: Registered model in Unity Catalog for the "mlops" ML Project for ${var.current_target} deployment target. 196 | <<: *model_grants 197 | 198 | experiment: &experiment 199 | experiment: 200 | name: ${var.experiment_name} 201 | <<: *experiment_permissions 202 | description: MLflow Experiment used to track runs for mlops project. 
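# The anchors defined above (*jobs, *model, *experiment) are merged into each
# phase-1 target below via YAML merge keys (<<:), so the dev/test/prod targets
# reuse the same job, registered-model and experiment definitions instead of
# duplicating them as top-level resources.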
203 | 
204 | 
205 | targets:
206 |   dev-phase1:
207 |     resources:
208 |       jobs:
209 |         <<: *jobs
210 |       registered_models:
211 |         <<: *model
212 |       experiments:
213 |         <<: *experiment
214 | 
215 |   test-phase1:
216 |     resources:
217 |       jobs:
218 |         <<: *jobs
219 |       registered_models:
220 |         <<: *model
221 |       experiments:
222 |         <<: *experiment
223 | 
224 |   prod-phase1:
225 |     resources:
226 |       jobs:
227 |         <<: *jobs
228 |       registered_models:
229 |         <<: *model
230 |       experiments:
231 |         <<: *experiment
232 | 
-------------------------------------------------------------------------------- /mlops-statcks-multiphase/README.md: --------------------------------------------------------------------------------
1 | # mlops
2 | 
3 | This DAB is based on the `mlops-stacks` template, customized to deploy quality
4 | monitoring in a separate "stage" (or "phase") so that the quality monitoring and
5 | retraining job definitions don't have to be uncommented manually. This is done by splitting
6 | model training and the quality monitoring/retraining job into different stages that can be deployed independently.
7 | To implement this I've used YAML anchors and bundle variables to define the actual resources, and then declared
8 | resources in each stage independently instead of specifying them as top-level objects.
9 | 
10 | Use `databricks bundle deploy -t <env>-phaseN` to deploy into a specific "phase". For example,
11 | use `databricks bundle deploy -t dev-phase1` to deploy the model training code, registered
12 | model, etc., and then use `databricks bundle deploy -t dev-phase2` to deploy the quality
13 | monitor and retraining job. A minimal sketch of this layout is shown at the end of this introduction.
14 | 
15 | The rest of the README is standard text from `mlops-stacks`...
16 | 
17 | This project comes with example ML code to train, validate and deploy a regression model
18 | to predict NYC taxi fares. If you're a data scientist just getting started with this repo
19 | for a brand new ML project, we recommend adapting the provided example code to your ML
20 | problem, then making and testing ML code changes on Databricks or your local machine.
21 | 
22 | The "Getting Started" docs can be found at https://learn.microsoft.com/azure/databricks/dev-tools/bundles/mlops-stacks.
23 | 
24 | ## Table of contents
25 | 
26 | * [Code structure](#code-structure): structure of this project.
27 | 
28 | * [Configure your ML pipeline](#configure-your-ml-pipeline): adapting the sample code to your ML problem.
29 | 
30 | * [Iterating on ML code](#iterating-on-ml-code): making and testing ML code changes on Databricks or your local machine.
31 | * [Next steps](#next-steps)
32 | 
33 | This directory contains an ML project based on the default
34 | [Databricks MLOps Stacks](https://github.com/databricks/mlops-stacks),
35 | defining a production-grade ML pipeline for automated retraining and batch inference of an ML model on tabular data.
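As a quick orientation (not part of the standard template text), the phase split boils down to defining the resources once as YAML anchors and then merging them into per-phase targets. The sketch below is abbreviated and illustrative only; see `tmp/phase1.yml` and `tmp/phase2.yml` for the real definitions:

```
# phase-1 resources, defined once as anchors
jobs: &jobs
  model_training_job:
    name: ${var.current_target}-mlops-model-training-job
    # ... job clusters, tasks, schedule ...

targets:
  dev-phase1:          # deploys the training job, registered model and experiment
    resources:
      jobs:
        <<: *jobs
  # dev-phase2 (see tmp/phase2.yml) deploys the quality monitor and retraining
  # job in the same way.
```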
36 | 37 | ## Code structure 38 | This project contains the following components: 39 | 40 | | Component | Description | 41 | |----------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| 42 | | ML Code | Example ML project code, with unit tested Python modules and notebooks | 43 | | ML Resources as Code | ML pipeline resources (training and batch inference jobs with schedules, etc) configured and deployed through [databricks CLI bundles](https://learn.microsoft.com/azure/databricks/dev-tools/cli/bundle-cli) | 44 | 45 | contained in the following files: 46 | 47 | ``` 48 | mlops <- Root directory. Both monorepo and polyrepo are supported. 49 | │ 50 | ├── mlops <- Contains python code, notebooks and ML resources related to one ML project. 51 | │ │ 52 | │ ├── requirements.txt <- Specifies Python dependencies for ML code (for example: model training, batch inference). 53 | │ │ 54 | │ ├── databricks.yml <- databricks.yml is the root bundle file for the ML project that can be loaded by databricks CLI bundles. It defines the bundle name, workspace URL and resource config component to be included. 55 | │ │ 56 | │ ├── training <- Training folder contains Notebook that trains and registers the model with feature store support. 57 | │ │ 58 | │ ├── feature_engineering <- Feature computation code (Python modules) that implements the feature transforms. 59 | │ │ The output of these transforms get persisted as Feature Store tables. Most development 60 | │ │ work happens here. 61 | │ │ 62 | │ ├── validation <- Optional model validation step before deploying a model. 63 | │ │ 64 | │ ├── monitoring <- Model monitoring, feature monitoring, etc. 65 | │ │ 66 | │ ├── deployment <- Deployment and Batch inference workflows 67 | │ │ │ 68 | │ │ ├── batch_inference <- Batch inference code that will run as part of scheduled workflow. 69 | │ │ │ 70 | │ │ ├── model_deployment <- As part of CD workflow, deploy the registered model by assigning it the appropriate alias. 71 | │ │ 72 | │ │ 73 | │ ├── tests <- Unit tests for the ML project, including the modules under `features`. 74 | │ │ 75 | │ ├── resources <- ML resource (ML jobs, MLflow models) config definitions expressed as code, across dev/staging/prod/test. 76 | │ │ 77 | │ ├── model-workflow-resource.yml <- ML resource config definition for model training, validation, deployment workflow 78 | │ │ 79 | │ ├── batch-inference-workflow-resource.yml <- ML resource config definition for batch inference workflow 80 | │ │ 81 | │ ├── feature-engineering-workflow-resource.yml <- ML resource config definition for feature engineering workflow 82 | │ │ 83 | │ ├── ml-artifacts-resource.yml <- ML resource config definition for model and experiment 84 | │ │ 85 | │ ├── monitoring-resource.yml <- ML resource config definition for quality monitoring workflow 86 | ``` 87 | 88 | 89 | ## Configure your ML pipeline 90 | 91 | The sample ML code consists of the following: 92 | 93 | * Feature computation modules under `feature_engineering` folder. 94 | These sample module contains features logic that can be used to generate and populate tables in Feature Store. 95 | In each module, there is `compute_features_fn` method that you need to implement. 
This should compute a features dataframe 96 | (each column being a separate feature), given the input dataframe, timestamp column and time-ranges. 97 | The output dataframe will be persisted in a [time-series Feature Store table](https://learn.microsoft.com/azure/databricks/machine-learning/feature-store/time-series). 98 | See the example modules' documentation for more information. 99 | * Python unit tests for feature computation modules in `tests/feature_engineering` folder. 100 | * Feature engineering notebook, `feature_engineering/notebooks/GenerateAndWriteFeatures.py`, that reads input dataframes, dynamically loads feature computation modules, executes their `compute_features_fn` method and writes the outputs to a Feature Store table (creating it if missing). 101 | * Training notebook that [trains](https://learn.microsoft.com/azure/databricks/machine-learning/feature-store/train-models-with-feature-store ) a regression model by creating a training dataset using the Feature Store client. 102 | * Model deployment and batch inference notebooks that deploy and use the trained model. 103 | * An automated integration test is provided (in `.github/workflows/mlops-run-tests.yml`) that executes a multi task run on Databricks involving the feature engineering and model training notebooks. 104 | 105 | To adapt this sample code for your use case, implement your own feature module, specifying configs such as input Delta tables/dataset path(s) to use when developing 106 | the feature engineering pipelines. 107 | 1. Implement your feature module, address TODOs in `feature_engineering/features` and create unit test in `tests/feature_engineering` 108 | 2. Update `resources/feature-engineering-workflow-resource.yml`. Fill in notebook parameters for `write_feature_table_job`. 109 | 3. Update training data path in `resources/model-workflow-resource.yml`. 110 | 111 | We expect most of the development to take place in the `feature_engineering` folder. 112 | 113 | 114 | ## Iterating on ML code 115 | 116 | ### Deploy ML code and resources to dev workspace using Bundles 117 | 118 | Refer to [Local development and dev workspace](./resources/README.md#local-development-and-dev-workspace) 119 | to use databricks CLI bundles to deploy ML code together with ML resource configs to dev workspace. 120 | 121 | This will allow you to develop locally and use databricks CLI bundles to deploy to your dev workspace to test out code and config changes. 122 | 123 | ### Develop on Databricks using Databricks Repos 124 | 125 | #### Prerequisites 126 | You'll need: 127 | * Access to run commands on a cluster running Databricks Runtime ML version 11.0 or above in your dev Databricks workspace 128 | * To set up [Databricks Repos](https://learn.microsoft.com/azure/databricks/repos/index): see instructions below 129 | 130 | #### Configuring Databricks Repos 131 | To use Repos, [set up git integration](https://learn.microsoft.com/azure/databricks/repos/repos-setup) in your dev workspace. 132 | 133 | If the current project has already been pushed to a hosted Git repo, follow the 134 | [UI workflow](https://learn.microsoft.com/azure/databricks/repos/git-operations-with-repos#add-a-repo-and-connect-remotely-later) 135 | to clone it into your dev workspace and iterate. 136 | 137 | Otherwise, e.g. 
if iterating on ML code for a new project, follow the steps below: 138 | * Follow the [UI workflow](https://learn.microsoft.com/azure/databricks/repos/git-operations-with-repos#add-a-repo-and-connect-remotely-later) 139 | for creating a repo, but uncheck the "Create repo by cloning a Git repository" checkbox. 140 | * Install the `dbx` CLI via `pip install --upgrade dbx` 141 | * Run `databricks configure --profile mlops-dev --token --host `, passing the URL of your dev workspace. 142 | This should prompt you to enter an API token 143 | * [Create a personal access token](https://learn.microsoft.com/azure/databricks/dev-tools/auth/pat) 144 | in your dev workspace and paste it into the prompt from the previous step 145 | * From within the root directory of the current project, use the [dbx sync](https://dbx.readthedocs.io/en/latest/guides/python/devloop/mixed/#using-dbx-sync-repo-for-local-to-repo-synchronization) tool to copy code files from your local machine into the Repo by running 146 | `dbx sync repo --profile mlops-dev --source . --dest-repo your-repo-name`, where `your-repo-name` should be the last segment of the full repo name (`/Repos/username/your-repo-name`) 147 | 148 | 149 | ### Develop locally 150 | 151 | You can iterate on the feature transform modules locally in your favorite IDE before running them on Databricks. 152 | 153 | #### Running code on Databricks 154 | You can iterate on ML code by running the provided `feature_engineering/notebooks/GenerateAndWriteFeatures.py` notebook on Databricks using 155 | [Repos](https://learn.microsoft.com/azure/databricks/repos/index). This notebook drives execution of 156 | the feature transforms code defined under ``features``. You can use multiple browser tabs to edit 157 | logic in `features` and run the feature engineering pipeline in the `GenerateAndWriteFeatures.py` notebook. 158 | 159 | #### Prerequisites 160 | * Python 3.8+ 161 | * Install feature engineering code and test dependencies via `pip install -I -r requirements.txt` from project root directory. 162 | * The features transform code uses PySpark and brings up a local Spark instance for testing, so [Java (version 8 and later) is required](https://spark.apache.org/docs/latest/#downloading). 163 | * Access to UC catalog and schema 164 | We expect a catalog to exist with the name of the deployment target by default. 165 | For example, if the deployment target is dev, we expect a catalog named dev to exist in the workspace. 166 | If you want to use different catalog names, please update the target names declared in the [databricks.yml](./databricks.yml) file. 167 | If changing the staging, prod, or test deployment targets, you'll also need to update the workflows located in the .github/workflows directory. 168 | 169 | For the ML training job, you must have permissions to read the input Delta table and create experiment and models. 170 | i.e. for each environment: 171 | - USE_CATALOG 172 | - USE_SCHEMA 173 | - MODIFY 174 | - CREATE_MODEL 175 | - CREATE_TABLE 176 | 177 | For the batch inference job, you must have permissions to read input Delta table and modify the output Delta table. 178 | i.e. for each environment 179 | - USAGE permissions for the catalog and schema of the input and output table. 180 | - SELECT permission for the input table. 181 | - MODIFY permission for the output table if it pre-dates your job. 182 | 183 | #### Run unit tests 184 | You can run unit tests for your ML code via `pytest tests`. 
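As a rough illustration, a unit test for a feature module might look like the sketch below. It assumes a `compute_features_fn(input_df, timestamp_column, start_date, end_date)` signature (the "input dataframe, timestamp column and time-ranges" described above) and a small local SparkSession; the import path and the exact output columns depend on your module and `pytest.ini` configuration.

```python
from datetime import datetime

import pytest
from pyspark.sql import SparkSession

from features import pickup_features  # adjust the import to match your pytest.ini / package layout


@pytest.fixture(scope="session")
def spark():
    # Local Spark session for tests; requires Java, as noted in the prerequisites.
    return SparkSession.builder.master("local[1]").getOrCreate()


def test_pickup_features_returns_rows(spark):
    # A single taxi trip with the columns the pickup feature transform needs.
    input_df = spark.createDataFrame(
        [(datetime(2023, 1, 1, 9, 5), 12.5, "10001")],
        ["tpep_pickup_datetime", "fare_amount", "pickup_zip"],
    )
    # Passing None for the start/end dates (process the whole range) is an assumption
    # of this sketch; use whatever your compute_features_fn expects.
    output_df = pickup_features.compute_features_fn(
        input_df, "tpep_pickup_datetime", None, None
    )
    assert output_df.count() > 0
```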
185 | 
186 | 
187 | 
188 | ## Next Steps
189 | 
190 | When you're satisfied with initial ML experimentation (e.g. validated that a model with reasonable performance can be trained on your dataset) and ready to deploy production training/inference pipelines, ask your ops team to set up CI/CD for the current ML project if they haven't already. CI/CD can be set up as part of the
191 | MLOps Stacks initialization even if it was skipped in this case, or this project can be added to a repo set up with CI/CD already, following the directions under "Setting up CI/CD" in the repo root directory README.
192 | 
193 | 
194 | To add CI/CD to this repo:
195 | 1. Run `databricks bundle init mlops-stacks` via the Databricks CLI
196 | 2. Select the option to only initialize `CICD_Only`
197 | 3. Provide the root directory of this project and answer the subsequent prompts
198 | 
199 | More details can be found in the [MLOps Stacks README](https://github.com/databricks/mlops-stacks/blob/main/README.md).
200 | 
-------------------------------------------------------------------------------- /mlops-statcks-multiphase/validation/notebooks/ModelValidation.py: --------------------------------------------------------------------------------
1 | # Databricks notebook source
2 | ##################################################################################
3 | # Model Validation Notebook
4 | ##
5 | # This notebook uses mlflow model validation API to run model validation after training and registering a model
6 | # in model registry, before deploying it to the "champion" alias.
7 | #
8 | # It runs as part of CD and by an automated model training job -> validation -> deployment job defined under ``mlops/resources/model-workflow-resource.yml``
9 | #
10 | #
11 | # Parameters:
12 | #
13 | # * env - Name of the environment the notebook is run in (staging, or prod). Defaults to "prod".
14 | # * `run_mode` - The `run_mode` defines whether model validation is enabled or not. It can be one of the three values:
15 | # * `disabled` : Do not run the model validation notebook.
16 | # * `dry_run` : Run the model validation notebook. Ignore failed model validation rules and proceed to move
17 | # model to the "champion" alias.
18 | # * `enabled` : Run the model validation notebook. Move model to the "champion" alias only if all model validation
19 | # rules are passing.
20 | # * enable_baseline_comparison - Whether to load the current registered "champion" model as baseline.
21 | # Baseline model is a requirement for relative change and absolute change validation thresholds.
22 | # * validation_input - Validation input. Please refer to data parameter in mlflow.evaluate documentation https://mlflow.org/docs/latest/python_api/mlflow.html#mlflow.evaluate
23 | # * model_type - A string describing the model type. The model type can be either "regressor" or "classifier".
24 | # Please refer to model_type parameter in mlflow.evaluate documentation https://mlflow.org/docs/latest/python_api/mlflow.html#mlflow.evaluate
25 | # * targets - The string name of a column from data that contains evaluation labels.
26 | # Please refer to targets parameter in mlflow.evaluate documentation https://mlflow.org/docs/latest/python_api/mlflow.html#mlflow.evaluate
27 | # * custom_metrics_loader_function - Specifies the name of the function in mlops/validation/validation.py that returns custom metrics.
28 | # * validation_thresholds_loader_function - Specifies the name of the function in mlops/validation/validation.py that returns model validation thresholds. 29 | # 30 | # For details on mlflow evaluate API, see doc https://mlflow.org/docs/latest/python_api/mlflow.html#mlflow.evaluate 31 | # For details and examples about performing model validation, see the Model Validation documentation https://mlflow.org/docs/latest/models.html#model-validation 32 | # 33 | ################################################################################## 34 | 35 | # COMMAND ---------- 36 | 37 | # MAGIC %load_ext autoreload 38 | # MAGIC %autoreload 2 39 | 40 | # COMMAND ---------- 41 | 42 | import os 43 | notebook_path = '/Workspace/' + os.path.dirname(dbutils.notebook.entry_point.getDbutils().notebook().getContext().notebookPath().get()) 44 | %cd $notebook_path 45 | 46 | # COMMAND ---------- 47 | 48 | # MAGIC %pip install -r ../../requirements.txt 49 | 50 | # COMMAND ---------- 51 | 52 | dbutils.library.restartPython() 53 | 54 | # COMMAND ---------- 55 | 56 | import os 57 | notebook_path = '/Workspace/' + os.path.dirname(dbutils.notebook.entry_point.getDbutils().notebook().getContext().notebookPath().get()) 58 | %cd $notebook_path 59 | %cd ../ 60 | 61 | # COMMAND ---------- 62 | 63 | dbutils.widgets.text( 64 | "experiment_name", 65 | "/dev-mlops-experiment", 66 | "Experiment Name", 67 | ) 68 | dbutils.widgets.dropdown("run_mode", "disabled", ["disabled", "dry_run", "enabled"], "Run Mode") 69 | dbutils.widgets.dropdown("enable_baseline_comparison", "false", ["true", "false"], "Enable Baseline Comparison") 70 | dbutils.widgets.text("validation_input", "SELECT * FROM delta.`dbfs:/databricks-datasets/nyctaxi-with-zipcodes/subsampled`", "Validation Input") 71 | 72 | dbutils.widgets.text("model_type", "regressor", "Model Type") 73 | dbutils.widgets.text("targets", "fare_amount", "Targets") 74 | dbutils.widgets.text("custom_metrics_loader_function", "custom_metrics", "Custom Metrics Loader Function") 75 | dbutils.widgets.text("validation_thresholds_loader_function", "validation_thresholds", "Validation Thresholds Loader Function") 76 | dbutils.widgets.text("evaluator_config_loader_function", "evaluator_config", "Evaluator Config Loader Function") 77 | dbutils.widgets.text("model_name", "dev.mlops.mlops-model", "Full (Three-Level) Model Name") 78 | dbutils.widgets.text("model_version", "", "Candidate Model Version") 79 | 80 | # COMMAND ---------- 81 | run_mode = dbutils.widgets.get("run_mode").lower() 82 | assert run_mode == "disabled" or run_mode == "dry_run" or run_mode == "enabled" 83 | 84 | if run_mode == "disabled": 85 | print( 86 | "Model validation is in DISABLED mode. Exit model validation without blocking model deployment." 87 | ) 88 | dbutils.notebook.exit(0) 89 | dry_run = run_mode == "dry_run" 90 | 91 | if dry_run: 92 | print( 93 | "Model validation is in DRY_RUN mode. Validation threshold validation failures will not block model deployment." 94 | ) 95 | else: 96 | print( 97 | "Model validation is in ENABLED mode. Validation threshold validation failures will block model deployment." 
98 | ) 99 | 100 | # COMMAND ---------- 101 | 102 | import importlib 103 | import mlflow 104 | import os 105 | import tempfile 106 | import traceback 107 | 108 | from mlflow.tracking.client import MlflowClient 109 | 110 | client = MlflowClient(registry_uri="databricks-uc") 111 | mlflow.set_registry_uri('databricks-uc') 112 | 113 | # set experiment 114 | experiment_name = dbutils.widgets.get("experiment_name") 115 | mlflow.set_experiment(experiment_name) 116 | 117 | # set model evaluation parameters that can be inferred from the job 118 | model_uri = dbutils.jobs.taskValues.get("Train", "model_uri", debugValue="") 119 | model_name = dbutils.jobs.taskValues.get("Train", "model_name", debugValue="") 120 | model_version = dbutils.jobs.taskValues.get("Train", "model_version", debugValue="") 121 | 122 | if model_uri == "": 123 | model_name = dbutils.widgets.get("model_name") 124 | model_version = dbutils.widgets.get("model_version") 125 | model_uri = "models:/" + model_name + "/" + model_version 126 | 127 | baseline_model_uri = "models:/" + model_name + "@champion" 128 | 129 | evaluators = "default" 130 | assert model_uri != "", "model_uri notebook parameter must be specified" 131 | assert model_name != "", "model_name notebook parameter must be specified" 132 | assert model_version != "", "model_version notebook parameter must be specified" 133 | 134 | # COMMAND ---------- 135 | 136 | # take input 137 | enable_baseline_comparison = dbutils.widgets.get("enable_baseline_comparison") 138 | 139 | 140 | enable_baseline_comparison = "false" 141 | print( 142 | "Currently baseline model comparison is not supported for models registered with feature store. Please refer to " 143 | "issue https://github.com/databricks/mlops-stacks/issues/70 for more details." 144 | ) 145 | 146 | assert enable_baseline_comparison == "true" or enable_baseline_comparison == "false" 147 | enable_baseline_comparison = enable_baseline_comparison == "true" 148 | 149 | validation_input = dbutils.widgets.get("validation_input") 150 | assert validation_input 151 | data = spark.sql(validation_input) 152 | 153 | model_type = dbutils.widgets.get("model_type") 154 | targets = dbutils.widgets.get("targets") 155 | 156 | assert model_type 157 | assert targets 158 | 159 | custom_metrics_loader_function_name = dbutils.widgets.get("custom_metrics_loader_function") 160 | validation_thresholds_loader_function_name = dbutils.widgets.get("validation_thresholds_loader_function") 161 | evaluator_config_loader_function_name = dbutils.widgets.get("evaluator_config_loader_function") 162 | assert custom_metrics_loader_function_name 163 | assert validation_thresholds_loader_function_name 164 | assert evaluator_config_loader_function_name 165 | custom_metrics_loader_function = getattr( 166 | importlib.import_module("validation"), custom_metrics_loader_function_name 167 | ) 168 | validation_thresholds_loader_function = getattr( 169 | importlib.import_module("validation"), validation_thresholds_loader_function_name 170 | ) 171 | evaluator_config_loader_function = getattr( 172 | importlib.import_module("validation"), evaluator_config_loader_function_name 173 | ) 174 | custom_metrics = custom_metrics_loader_function() 175 | validation_thresholds = validation_thresholds_loader_function() 176 | evaluator_config = evaluator_config_loader_function() 177 | 178 | # COMMAND ---------- 179 | 180 | # helper methods 181 | def get_run_link(run_info): 182 | return "[Run](#mlflow/experiments/{0}/runs/{1})".format( 183 | run_info.experiment_id, run_info.run_id 184 | ) 185 
| 186 | 187 | def get_training_run(model_name, model_version): 188 | version = client.get_model_version(model_name, model_version) 189 | return mlflow.get_run(run_id=version.run_id) 190 | 191 | 192 | def generate_run_name(training_run): 193 | return None if not training_run else training_run.info.run_name + "-validation" 194 | 195 | 196 | def generate_description(training_run): 197 | return ( 198 | None 199 | if not training_run 200 | else "Model Training Details: {0}\n".format(get_run_link(training_run.info)) 201 | ) 202 | 203 | 204 | def log_to_model_description(run, success): 205 | run_link = get_run_link(run.info) 206 | description = client.get_model_version(model_name, model_version).description 207 | status = "SUCCESS" if success else "FAILURE" 208 | if description != "": 209 | description += "\n\n---\n\n" 210 | description += "Model Validation Status: {0}\nValidation Details: {1}".format( 211 | status, run_link 212 | ) 213 | client.update_model_version( 214 | name=model_name, version=model_version, description=description 215 | ) 216 | 217 | 218 | 219 | from datetime import timedelta, timezone 220 | import math 221 | import pyspark.sql.functions as F 222 | from pyspark.sql.types import IntegerType 223 | 224 | 225 | def rounded_unix_timestamp(dt, num_minutes=15): 226 | """ 227 | Ceilings datetime dt to interval num_minutes, then returns the unix timestamp. 228 | """ 229 | nsecs = dt.minute * 60 + dt.second + dt.microsecond * 1e-6 230 | delta = math.ceil(nsecs / (60 * num_minutes)) * (60 * num_minutes) - nsecs 231 | return int((dt + timedelta(seconds=delta)).replace(tzinfo=timezone.utc).timestamp()) 232 | 233 | 234 | rounded_unix_timestamp_udf = F.udf(rounded_unix_timestamp, IntegerType()) 235 | 236 | 237 | def rounded_taxi_data(taxi_data_df): 238 | # Round the taxi data timestamp to 15 and 30 minute intervals so we can join with the pickup and dropoff features 239 | # respectively. 
240 | taxi_data_df = ( 241 | taxi_data_df.withColumn( 242 | "rounded_pickup_datetime", 243 | F.to_timestamp( 244 | rounded_unix_timestamp_udf( 245 | taxi_data_df["tpep_pickup_datetime"], F.lit(15) 246 | ) 247 | ), 248 | ) 249 | .withColumn( 250 | "rounded_dropoff_datetime", 251 | F.to_timestamp( 252 | rounded_unix_timestamp_udf( 253 | taxi_data_df["tpep_dropoff_datetime"], F.lit(30) 254 | ) 255 | ), 256 | ) 257 | .drop("tpep_pickup_datetime") 258 | .drop("tpep_dropoff_datetime") 259 | ) 260 | taxi_data_df.createOrReplaceTempView("taxi_data") 261 | return taxi_data_df 262 | 263 | 264 | data = rounded_taxi_data(data) 265 | 266 | 267 | 268 | 269 | # COMMAND ---------- 270 | 271 | 272 | # Temporary fix as FS model can't predict as a pyfunc model 273 | # MLflow evaluate can take a lambda function instead of a model uri for a model 274 | # but id does not work for the baseline model as it requires a model_uri (baseline comparison is set to false) 275 | 276 | from databricks.feature_store import FeatureStoreClient 277 | 278 | def get_fs_model(df): 279 | fs_client = FeatureStoreClient() 280 | return ( 281 | fs_client.score_batch(model_uri, spark.createDataFrame(df)) 282 | .select("prediction") 283 | .toPandas() 284 | ) 285 | 286 | 287 | training_run = get_training_run(model_name, model_version) 288 | 289 | # run evaluate 290 | with mlflow.start_run( 291 | run_name=generate_run_name(training_run), 292 | description=generate_description(training_run), 293 | ) as run, tempfile.TemporaryDirectory() as tmp_dir: 294 | validation_thresholds_file = os.path.join(tmp_dir, "validation_thresholds.txt") 295 | with open(validation_thresholds_file, "w") as f: 296 | if validation_thresholds: 297 | for metric_name in validation_thresholds: 298 | f.write( 299 | "{0:30} {1}\n".format( 300 | metric_name, str(validation_thresholds[metric_name]) 301 | ) 302 | ) 303 | mlflow.log_artifact(validation_thresholds_file) 304 | 305 | try: 306 | eval_result = mlflow.evaluate( 307 | 308 | model=get_fs_model, 309 | 310 | data=data, 311 | targets=targets, 312 | model_type=model_type, 313 | evaluators=evaluators, 314 | validation_thresholds=validation_thresholds, 315 | custom_metrics=custom_metrics, 316 | baseline_model=None 317 | if not enable_baseline_comparison 318 | else baseline_model_uri, 319 | evaluator_config=evaluator_config, 320 | ) 321 | metrics_file = os.path.join(tmp_dir, "metrics.txt") 322 | with open(metrics_file, "w") as f: 323 | f.write( 324 | "{0:30} {1:30} {2}\n".format("metric_name", "candidate", "baseline") 325 | ) 326 | for metric in eval_result.metrics: 327 | candidate_metric_value = str(eval_result.metrics[metric]) 328 | baseline_metric_value = "N/A" 329 | if metric in eval_result.baseline_model_metrics: 330 | mlflow.log_metric( 331 | "baseline_" + metric, eval_result.baseline_model_metrics[metric] 332 | ) 333 | baseline_metric_value = str( 334 | eval_result.baseline_model_metrics[metric] 335 | ) 336 | f.write( 337 | "{0:30} {1:30} {2}\n".format( 338 | metric, candidate_metric_value, baseline_metric_value 339 | ) 340 | ) 341 | mlflow.log_artifact(metrics_file) 342 | log_to_model_description(run, True) 343 | 344 | # Assign "challenger" alias to indicate model version has passed validation checks 345 | print("Validation checks passed. 
Assigning 'challenger' alias to model version.") 346 | client.set_registered_model_alias(model_name, "challenger", model_version) 347 | 348 | except Exception as err: 349 | log_to_model_description(run, False) 350 | error_file = os.path.join(tmp_dir, "error.txt") 351 | with open(error_file, "w") as f: 352 | f.write("Validation failed : " + str(err) + "\n") 353 | f.write(traceback.format_exc()) 354 | mlflow.log_artifact(error_file) 355 | if not dry_run: 356 | raise err 357 | else: 358 | print( 359 | "Model validation failed in DRY_RUN. It will not block model deployment." 360 | ) 361 | -------------------------------------------------------------------------------- /mlops-statcks-multiphase/tmp/README.md: -------------------------------------------------------------------------------- 1 | # Databricks ML Resource Configurations 2 | [(back to project README)](../README.md) 3 | 4 | ## Table of contents 5 | * [Intro](#intro) 6 | * [Local development and dev workspace](#local-development-and-dev-workspace) 7 | * [Develop and test config changes](#develop-and-test-config-changes) 8 | * [CI/CD](#set-up-cicd) 9 | * [Deploy initial ML resources](#deploy-initial-ml-resources) 10 | * [Deploy config changes](#deploy-config-changes) 11 | 12 | ## Intro 13 | 14 | ### databricks CLI bundles 15 | MLOps Stacks ML resources are configured and deployed through [databricks CLI bundles](https://learn.microsoft.com/azure/databricks/dev-tools/cli/bundle-cli). 16 | The bundle setting file must be expressed in YAML format and must contain at minimum the top-level bundle mapping. 17 | 18 | The databricks CLI bundles top level is defined by file `mlops/databricks.yml`. 19 | During databricks CLI bundles deployment, the root config file will be loaded, validated and deployed to workspace provided by the environment together with all the included resources. 20 | 21 | ML Resource Configurations in this directory: 22 | - model workflow (`mlops/resources/model-workflow-resource.yml`) 23 | - batch inference workflow (`mlops/resources/batch-inference-workflow-resource.yml`) 24 | - monitoring resource and workflow (`mlops/resources/monitoring-resource.yml`) 25 | - feature engineering workflow (`mlops/resources/feature-engineering-workflow-resource.yml`) 26 | - model definition and experiment definition (`mlops/resources/ml-artifacts-resource.yml`) 27 | 28 | 29 | ### Deployment Config & CI/CD integration 30 | The ML resources can be deployed to databricks workspace based on the databricks CLI bundles deployment config. 31 | Deployment configs of different deployment targets share the general ML resource configurations with added ability to specify deployment target specific values (workspace URI, model name, jobs notebook parameters, etc). 32 | This project ships with CI/CD workflows for developing and deploying ML resource configurations based on deployment config. 33 | 34 | For Model Registry in Unity Catalog, we expect a catalog to exist with the name of the deployment target by default. For example, if the deployment target is `dev`, we expect a catalog named `dev` to exist in the workspace. 35 | If you want to use different catalog names, please update the `targets` declared in the `mlops/databricks.yml` and `mlops/resources/ml-artifacts-resource.yml` files. 36 | If changing the `staging`, `prod`, or `test` deployment targets, you'll need to update the pipelines located in the `azure-pipelines` directory. 
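For instance, one way to decouple catalog names from target names is a bundle variable that each target overrides. The sketch below is illustrative only (the catalog names are hypothetical); the resource definitions would then reference `${var.catalog_name}` instead of the target name:

```
variables:
  catalog_name:
    description: UC catalog holding the registered model, experiment and feature tables.
    default: dev

targets:
  staging:
    variables:
      catalog_name: my_staging_catalog
  prod:
    variables:
      catalog_name: my_prod_catalog
```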
37 | 38 | 39 | | Deployment Target | Description | Databricks Workspace | Model Name | Experiment Name | 40 | |-------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|----------------------|-------------------------------------|------------------------------------------------| 41 | | dev | The `dev` deployment target is used by ML engineers to deploy ML resources to development workspace with `dev` configs. The config is for ML project development purposes. | dev workspace | dev-mlops-model | /dev-mlops-experiment | 42 | | staging | The `staging` deployment target is part of the CD pipeline. Latest main content will be deployed to staging workspace with `staging` config. | staging workspace | staging-mlops-model | /staging-mlops-experiment | 43 | | prod | The `prod` deployment target is part of the CD pipeline. Latest release content will be deployed to prod workspace with `prod` config. | prod workspace | prod-mlops-model | /prod-mlops-experiment | 44 | | test | The `test` deployment target is part of the CI pipeline. For changes targeting the main branch, upon making a PR, an integration test will be triggered and ML resources deployed to the staging workspace defined under `test` deployment target. | staging workspace | test-mlops-model | /test-mlops-experiment | 45 | 46 | During ML code development, you can deploy local ML resource configurations together with ML code to the a Databricks workspace to run the training, model validation or batch inference pipelines. The deployment will use `dev` config by default. 47 | 48 | You can open a PR (pull request) to modify ML code or the resource config against main branch. 49 | The PR will trigger Python unit tests, followed by an integration test executed on the staging workspace, as defined under the `test` environment resource. 50 | 51 | Upon merging a PR to the main branch, the main branch content will be deployed to the staging workspace with `staging` environment resource configurations. 52 | 53 | Upon merging code into the release branch, the release branch content will be deployed to prod workspace with `prod` environment resource configurations. 54 | ![ML resource config diagram](../../docs/images/mlops-stack-deploy.png) 55 | 56 | ## Local development and dev workspace 57 | 58 | ### Set up authentication 59 | 60 | To set up the databricks CLI using a Databricks personal access token, take the following steps: 61 | 62 | 1. Follow [databricks CLI](https://learn.microsoft.com/azure/databricks/dev-tools/cli/databricks-cli) to download and set up the databricks CLI locally. 63 | 2. Complete the `TODO` in `mlops/databricks.yml` to add the dev workspace URI under `targets.dev.workspace.host`. 64 | 3. [Create a personal access token](https://learn.microsoft.com/azure/databricks/dev-tools/auth/pat) 65 | in your dev workspace and copy it. 66 | 4. Set an env variable `DATABRICKS_TOKEN` with your Databricks personal access token in your terminal. For example, run `export DATABRICKS_TOKEN=dapi12345` if the access token is dapi12345. 67 | 5. You can now use the databricks CLI to validate and deploy ML resource configurations to the dev workspace. 68 | 69 | Alternatively, you can use the other approaches described in the [databricks CLI](https://learn.microsoft.com/azure/databricks/dev-tools/cli/databricks-cli) documentation to set up authentication. 
For example, using your Databricks username/password, or setting up a local profile.
70 | 
71 | ### Validate and provision ML resource configurations
72 | 1. After installing the databricks CLI and creating the `DATABRICKS_TOKEN` env variable, change to the `mlops` directory.
73 | 2. Run `databricks bundle validate` to validate the Databricks resource configurations.
74 | 3. Run `databricks bundle deploy` to provision the Databricks resource configurations to the dev workspace. The resource configurations and your ML code will be copied together to the dev workspace. The defined resources such as Databricks Workflows, MLflow Model and MLflow Experiment will be provisioned according to the config files under `mlops/resources`.
75 | 4. Go to the Databricks dev workspace, check the status of the defined model, experiment and workflows, and interact with the created workflows.
76 | 
77 | ### Destroy ML resource configurations
78 | After development is done, you can run `databricks bundle destroy` to destroy (remove) the defined Databricks resources in the dev workspace. Any model version in the `Production` or `Staging` stage will prevent the model from being deleted. Please update the version stage to `None` or `Archived` before destroying the ML resources.
79 | ## Set up CI/CD
80 | Please refer to [mlops-setup](../../docs/mlops-setup.md#configure-cicd) for instructions to set up CI/CD.
81 | 
82 | ## Deploy initial ML resources
83 | After completing the prerequisites, create and push a PR branch adding all files to the Git repo:
84 | ```
85 | git checkout -b add-ml-resource-config-and-code
86 | git add .
87 | git commit -m "Add ML resource config and ML code"
88 | git push upstream add-ml-resource-config-and-code
89 | ```
90 | Open a pull request to merge the pushed branch into the `main` branch.
91 | Upon creating this PR, the CI workflows will be triggered.
92 | These CI workflows will run unit and integration tests of the ML code,
93 | in addition to validating the Databricks resources to be deployed to both staging and prod workspaces.
94 | Once CI passes, merge the PR into the `main` branch. This will deploy an initial set of Databricks resources to the staging workspace.
95 | Resources will be deployed to the prod workspace on pushing code to the `release` branch.
96 | 
97 | Follow the next section to configure the input and output data tables for the batch inference job.
98 | 
99 | ### Setting up the batch inference job
100 | The batch inference job expects an input Delta table with a schema that your registered model accepts. To use the batch
101 | inference job, set up such a Delta table in both your staging and prod workspaces.
102 | Following this, update the batch_inference_job base parameters in `mlops/resources/batch-inference-workflow-resource.yml` to pass
103 | the name of the input Delta table and the name of the output Delta table to which to write batch predictions.
104 | 
105 | As the batch job will be run with the credentials of the service principal that provisioned it, make sure that the service
106 | principal corresponding to a particular environment has permissions to read the input Delta table and modify the output Delta table in that environment's workspace. If the Delta table is in the [Unity Catalog](https://www.databricks.com/product/unity-catalog), these permissions are
107 | 
108 | * `USAGE` permissions for the catalog and schema of the input and output table.
109 | * `SELECT` permission for the input table.
110 | * `MODIFY` permission for the output table if it pre-dates your job. 111 | 112 | ### Setting up model validation 113 | The model validation workflow focuses on building a plug-and-play stack component for continuous deployment (CD) of models 114 | in staging and prod. 115 | Its central purpose is to evaluate a registered model and validate its quality before deploying the model to Production/Staging. 116 | 117 | Model validation contains three components: 118 | * [model-workflow-resource.yml](./model-workflow-resource.yml) contains the resource config and input parameters for model validation. 119 | * [validation.py](../validation/validation.py) defines custom metrics and validation thresholds that are referenced by the above resource config files. 120 | * [notebooks/ModelValidation](../validation/notebooks/ModelValidation.py) contains the validation job implementation. In most cases you don't need to modify this file. 121 | 122 | To set up and enable model validation, update [validation.py](../validation/validation.py) to return desired custom metrics and validation thresholds, then 123 | resolve the `TODOs` in the ModelValidation task of [model-workflow-resource.yml](./model-workflow-resource.yml). 124 | 125 | 126 | ### Setting up monitoring 127 | The monitoring workflow focuses on building a plug-and-play stack component for monitoring the feature drifts and model drifts and retrain based on the 128 | violation threshold defined given the ground truth labels. 129 | 130 | Its central purpose is to track production model performances, feature distributions and comparing different versions. 131 | 132 | Monitoring contains four components: 133 | * [metric_violation_check_query.py](../monitoring/metric_violation_check_query.py) defines a query that checks for violation of the monitored metric. 134 | * [notebooks/MonitoredMetricViolationCheck](../monitoring/notebooks/MonitoredMetricViolationCheck.py) acts as an entry point, executing the violation check query against the monitored inference table. 135 | It emits a boolean value based on the query result. 136 | * [monitoring-resource.yml](./monitoring-resource.yml) contains the resource config, inputs parameters for monitoring, and orchestrates model retraining based on monitoring. It first runs the [notebooks/MonitoredMetricViolationCheck](../monitoring/notebooks/MonitoredMetricViolationCheck.py) 137 | entry point then decides whether to execute the model retraining workflow. 138 | 139 | To set up and enable monitoring: 140 | * If it is not done already, generate inference table, join it with ground truth labels, and update the table name in [monitoring-resource.yml](./monitoring-resource.yml). 141 | * Resolve the `TODOs` in [monitoring-resource.yml](./monitoring-resource.yml) 142 | * Uncomment the monitoring workflow in [databricks.yml](../databricks.yml) 143 | * OPTIONAL: Update the query in [metric_violation_check_query.py](../monitoring/metric_violation_check_query.py) to customize when the metric is considered to be in violation. 144 | 145 | NOTE: If ground truth labels are not available, you can still set up monitoring but should disable the retraining workflow. 146 | 147 | Retraining Constraints: 148 | The retraining job has constraints for optimal functioning: 149 | * Labels must be provided by the user, joined correctly for retraining history, and available on time with the retraining frequency. 150 | * Retraining Frequency is tightly coupled with the granularity of the monitor. 
150 | * Retraining frequency is tightly coupled with the granularity of the monitor. Ensure that the retraining frequency is equal to, or close to, the granularity of the monitor.
151 | * If the granularity of the monitor is 1 day and the retraining frequency is 1 hour, the job will preemptively stop as there is no new data to evaluate the retraining criteria.
152 | * If the granularity of the monitor is 1 day and the retraining frequency is 1 week, retraining would be stale and inefficient.
153 |
154 | Permissions:
155 | Permissions for monitoring are inherited from the original table's permissions.
156 | * Users who own the monitored table or its parent catalog/schema can create, update, and view monitors.
157 | * Users with read permissions on the monitored table can view its monitor.
158 |
159 | Therefore, ensure that service principals are the owners of, or have the necessary permissions to manage, the monitored table.
160 |
161 | ## Develop and test config changes
162 |
163 | ### databricks CLI bundles schema overview
164 | To get started, open `mlops/resources/batch-inference-workflow-resource.yml`. The file contains the ML resource definition of a batch inference job, like:
165 |
166 | ```yaml
167 | new_cluster: &new_cluster
168 |   new_cluster:
169 |     num_workers: 3
170 |     spark_version: 15.3.x-cpu-ml-scala2.12
171 |     node_type_id: Standard_D3_v2
172 |     custom_tags:
173 |       clusterSource: mlops-stacks_0.4
174 |
175 | resources:
176 |   jobs:
177 |     batch_inference_job:
178 |       name: ${bundle.target}-mlops-batch-inference-job
179 |       tasks:
180 |         - task_key: batch_inference_job
181 |           <<: *new_cluster
182 |           notebook_task:
183 |             notebook_path: ../deployment/batch_inference/notebooks/BatchInference.py
184 |             base_parameters:
185 |               env: ${bundle.target}
186 |               input_table_name: batch_inference_input_table_name
187 |       ...
188 | ```
189 |
190 | The example above defines a Databricks job with the name `${bundle.target}-mlops-batch-inference-job`
191 | that runs the notebook under `mlops/deployment/batch_inference/notebooks/BatchInference.py` to regularly apply your ML model for batch inference.
192 |
193 | At the start of the resource definition, we declare an anchor `new_cluster` that is referenced and reused later. For more information about anchors in YAML, please refer to the [YAML documentation](https://yaml.org/spec/1.2.2/#3222-anchors-and-aliases).
194 |
195 | We specify a `batch_inference_job` under `resources/jobs` to define a Databricks workflow with internal key `batch_inference_job` and job name `${bundle.target}-mlops-batch-inference-job`.
196 | The workflow contains a single task with task key `batch_inference_job`. The task runs the notebook `../deployment/batch_inference/notebooks/BatchInference.py`, passing the parameters `env` and `input_table_name` to the notebook.
197 | After setting up the databricks CLI, you can run the command `databricks bundle schema` to learn more about the databricks CLI bundles schema.
198 |
199 | The `notebook_path` is resolved relative to the resource YAML file.
200 |
201 | ### Environment config based variables
202 | `${bundle.target}` will be replaced by the environment config name during bundle deployment. For example, during the deployment of a `test` environment config, the job name will be
203 | `test-mlops-batch-inference-job`. During the deployment of the `staging` environment config, the job name will be
204 | `staging-mlops-batch-inference-job`.
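To see this substitution in action, you can select the environment config explicitly when deploying. A minimal sketch, assuming authentication is already set up and a recent databricks CLI version whose `-t`/`--target` flag selects the environment config:

```
# Deploy with the default target (dev); the job is created as dev-mlops-batch-inference-job.
databricks bundle deploy

# Explicitly select the test environment config; the job is created as test-mlops-batch-inference-job.
databricks bundle validate -t test
databricks bundle deploy -t test
```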
205 |
206 |
207 | To use different values for different environments, you can define bundle variables and override them per target, for example:
208 | ```yaml
209 | variables:
210 |   batch_inference_input_table:
211 |     description: The table name to be used for input to the batch inference workflow.
212 |     default: input_table
213 |
214 | targets:
215 |   dev:
216 |     variables:
217 |       batch_inference_input_table: dev_table
218 |   test:
219 |     variables:
220 |       batch_inference_input_table: test_table
221 |
222 | new_cluster: &new_cluster
223 |   new_cluster:
224 |     num_workers: 3
225 |     spark_version: 15.3.x-cpu-ml-scala2.12
226 |     node_type_id: Standard_D3_v2
227 |     custom_tags:
228 |       clusterSource: mlops-stacks_0.4
229 |
230 | resources:
231 |   jobs:
232 |     batch_inference_job:
233 |       name: ${bundle.target}-mlops-batch-inference-job
234 |       tasks:
235 |         - task_key: batch_inference_job
236 |           <<: *new_cluster
237 |           notebook_task:
238 |             notebook_path: ../deployment/batch_inference/notebooks/BatchInference.py
239 |             base_parameters:
240 |               env: ${bundle.target}
241 |               input_table_name: ${var.batch_inference_input_table}
242 |       ...
243 | ```
244 | The `batch_inference_job` notebook parameter `input_table_name` uses the bundle variable `batch_inference_input_table`, which has the default value "input_table".
245 | The variable value is overridden with "dev_table" for the `dev` environment config and with "test_table" for the `test` environment config:
246 | - during deployment with the `dev` environment config, the `input_table_name` parameter will get the value "dev_table"
247 | - during deployment with the `staging` environment config, the `input_table_name` parameter will get the value "input_table"
248 | - during deployment with the `prod` environment config, the `input_table_name` parameter will get the value "input_table"
249 | - during deployment with the `test` environment config, the `input_table_name` parameter will get the value "test_table"
250 |
251 | ### Test config changes
252 | To test out a config change, simply edit one of the fields above. For example, increase the cluster size by updating `num_workers` from 3 to 4.
253 |
254 | Then follow [Local development and dev workspace](#local-development-and-dev-workspace) to deploy the change to the dev workspace.
255 | Alternatively, you can open a PR. Continuous integration will then validate the updated config and deploy test resources to the staging workspace.
256 | ## Deploy config changes
257 |
258 | ### Dev workspace deployment
259 | Please refer to [Local development and dev workspace](#local-development-and-dev-workspace).
260 |
261 | ### Test workspace deployment (CI)
262 | After setting up CI/CD, PRs against the main branch will trigger CI workflows to run unit tests, integration tests and resource validation.
263 | The integration test will deploy the MLflow model, MLflow experiment and Databricks workflow resources defined under the `test` environment resource config to the staging workspace. The integration test then triggers a run of the model workflow to verify the ML code.
264 |
265 | ### Staging and Prod workspace deployment (CD)
266 | After merging a PR to the main branch, continuous deployment automation will deploy the `staging` resources to the staging workspace.
267 |
268 | When you are about to cut a release, create and merge a PR to merge changes from main into release. Continuous deployment automation will then deploy the `prod` resources to the prod workspace.
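For example, a minimal sketch of cutting a release, assuming the shared remote is named `upstream` (as in the earlier push example), that a long-lived `release` branch already exists, and with `cut-release` as a purely illustrative branch name:

```
# Branch off the latest main and push it to the shared remote.
git fetch upstream
git checkout -b cut-release upstream/main
git push upstream cut-release
# Then open a PR from cut-release into the release branch and merge it;
# continuous deployment will deploy the prod resources to the prod workspace.
```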
269 |
270 | [Back to project README](../README.md)
271 |
--------------------------------------------------------------------------------
/mlops-statcks-multiphase/resources/README.md:
--------------------------------------------------------------------------------
1 | # Databricks ML Resource Configurations
2 | [(back to project README)](../README.md)
3 |
4 | ## Table of contents
5 | * [Intro](#intro)
6 | * [Local development and dev workspace](#local-development-and-dev-workspace)
7 | * [Develop and test config changes](#develop-and-test-config-changes)
8 | * [CI/CD](#set-up-cicd)
9 | * [Deploy initial ML resources](#deploy-initial-ml-resources)
10 | * [Deploy config changes](#deploy-config-changes)
11 |
12 | ## Intro
13 |
14 | ### databricks CLI bundles
15 | MLOps Stacks ML resources are configured and deployed through [databricks CLI bundles](https://learn.microsoft.com/azure/databricks/dev-tools/cli/bundle-cli).
16 | The bundle settings file must be expressed in YAML format and must contain at minimum the top-level bundle mapping.
17 |
18 | The databricks CLI bundles top level is defined by the file `mlops/databricks.yml`.
19 | During databricks CLI bundles deployment, the root config file will be loaded, validated and deployed to the workspace specified by the deployment target, together with all the included resources.
20 |
21 | ML Resource Configurations in this directory:
22 | - model workflow (`mlops/resources/model-workflow-resource.yml`)
23 | - batch inference workflow (`mlops/resources/batch-inference-workflow-resource.yml`)
24 | - monitoring resource and workflow (`mlops/resources/monitoring-resource.yml`)
25 | - feature engineering workflow (`mlops/resources/feature-engineering-workflow-resource.yml`)
26 | - model definition and experiment definition (`mlops/resources/ml-artifacts-resource.yml`)
27 |
28 |
29 | ### Deployment Config & CI/CD integration
30 | The ML resources can be deployed to a Databricks workspace based on the databricks CLI bundles deployment config.
31 | Deployment configs of different deployment targets share the general ML resource configurations, with the added ability to specify deployment-target-specific values (workspace URI, model name, job notebook parameters, etc.).
32 | This project ships with CI/CD workflows for developing and deploying ML resource configurations based on the deployment config.
33 |
34 | For Model Registry in Unity Catalog, we expect a catalog to exist with the name of the deployment target by default. For example, if the deployment target is `dev`, we expect a catalog named `dev` to exist in the workspace.
35 | If you want to use different catalog names, please update the `targets` declared in the `mlops/databricks.yml` and `mlops/resources/ml-artifacts-resource.yml` files.
36 | If changing the `staging`, `prod`, or `test` deployment targets, you'll need to update the pipelines located in the `azure-pipelines` directory.
37 |
38 |
39 | | Deployment Target | Description | Databricks Workspace | Model Name | Experiment Name |
40 | |-------------------|-------------|----------------------|------------|-----------------|
41 | | dev | The `dev` deployment target is used by ML engineers to deploy ML resources to the development workspace with `dev` configs. The config is for ML project development purposes. | dev workspace | dev-mlops-model | /dev-mlops-experiment |
42 | | staging | The `staging` deployment target is part of the CD pipeline. The latest main branch content will be deployed to the staging workspace with the `staging` config. | staging workspace | staging-mlops-model | /staging-mlops-experiment |
43 | | prod | The `prod` deployment target is part of the CD pipeline. The latest release branch content will be deployed to the prod workspace with the `prod` config. | prod workspace | prod-mlops-model | /prod-mlops-experiment |
44 | | test | The `test` deployment target is part of the CI pipeline. For changes targeting the main branch, upon opening a PR, an integration test will be triggered and ML resources will be deployed to the staging workspace as defined under the `test` deployment target. | staging workspace | test-mlops-model | /test-mlops-experiment |
45 |
46 | During ML code development, you can deploy local ML resource configurations together with ML code to a Databricks workspace to run the training, model validation or batch inference pipelines. The deployment will use the `dev` config by default.
47 |
48 | You can open a PR (pull request) to modify ML code or the resource config against the main branch.
49 | The PR will trigger Python unit tests, followed by an integration test executed on the staging workspace, as defined under the `test` environment resource.
50 |
51 | Upon merging a PR to the main branch, the main branch content will be deployed to the staging workspace with the `staging` environment resource configurations.
52 |
53 | Upon merging code into the release branch, the release branch content will be deployed to the prod workspace with the `prod` environment resource configurations.
54 | ![ML resource config diagram](../../docs/images/mlops-stack-deploy.png)
55 |
56 | ## Local development and dev workspace
57 |
58 | ### Set up authentication
59 |
60 | To set up the databricks CLI using a Databricks personal access token, take the following steps:
61 |
62 | 1. Follow [databricks CLI](https://learn.microsoft.com/azure/databricks/dev-tools/cli/databricks-cli) to download and set up the databricks CLI locally.
63 | 2. Complete the `TODO` in `mlops/databricks.yml` to add the dev workspace URI under `targets.dev.workspace.host`.
64 | 3. [Create a personal access token](https://learn.microsoft.com/azure/databricks/dev-tools/auth/pat)
65 | in your dev workspace and copy it.
66 | 4. Set an env variable `DATABRICKS_TOKEN` with your Databricks personal access token in your terminal. For example, run `export DATABRICKS_TOKEN=dapi12345` if the access token is dapi12345.
67 | 5. You can now use the databricks CLI to validate and deploy ML resource configurations to the dev workspace.
68 |
69 | Alternatively, you can use the other approaches described in the [databricks CLI](https://learn.microsoft.com/azure/databricks/dev-tools/cli/databricks-cli) documentation to set up authentication. For example, using your Databricks username/password, or setting up a local profile.
70 |
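As a quick end-to-end check of the authentication setup, a minimal sketch that reuses the placeholder token from step 4 (replace it with your own) and the `mlops` bundle directory used in the next section:

```
# Use your own personal access token here; dapi12345 is the placeholder from step 4.
export DATABRICKS_TOKEN=dapi12345
cd mlops
# Should complete without authentication errors once the token and workspace host are set up.
databricks bundle validate
```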
71 | ### Validate and provision ML resource configurations
72 | 1. After installing the databricks CLI and creating the `DATABRICKS_TOKEN` env variable, change to the `mlops` directory.
73 | 2. Run `databricks bundle validate` to validate the Databricks resource configurations.
74 | 3. Run `databricks bundle deploy` to provision the Databricks resource configurations to the dev workspace. The resource configurations and your ML code will be copied together to the dev workspace. The defined resources such as Databricks Workflows, MLflow Model and MLflow Experiment will be provisioned according to the config files under `mlops/resources`.
75 | 4. Go to the Databricks dev workspace, check the status of the defined model, experiment and workflows, and interact with the created workflows.
76 |
77 | ### Destroy ML resource configurations
78 | After development is done, you can run `databricks bundle destroy` to destroy (remove) the defined Databricks resources in the dev workspace. Any model version in the `Production` or `Staging` stage will prevent the model from being deleted. Please update the version stage to `None` or `Archived` before destroying the ML resources.
79 | ## Set up CI/CD
80 | Please refer to [mlops-setup](../../docs/mlops-setup.md#configure-cicd) for instructions to set up CI/CD.
81 |
82 | ## Deploy initial ML resources
83 | After completing the prerequisites, create and push a PR branch adding all files to the Git repo:
84 | ```
85 | git checkout -b add-ml-resource-config-and-code
86 | git add .
87 | git commit -m "Add ML resource config and ML code"
88 | git push upstream add-ml-resource-config-and-code
89 | ```
90 | Open a pull request to merge the pushed branch into the `main` branch.
91 | Upon creating this PR, the CI workflows will be triggered.
92 | These CI workflows will run unit and integration tests of the ML code,
93 | in addition to validating the Databricks resources to be deployed to both the staging and prod workspaces.
94 | Once CI passes, merge the PR into the `main` branch. This will deploy an initial set of Databricks resources to the staging workspace.
95 | Resources will be deployed to the prod workspace when code is pushed to the `release` branch.
96 |
97 | Follow the next section to configure the input and output data tables for the batch inference job.
98 |
99 | ### Setting up the batch inference job
100 | The batch inference job expects an input Delta table with a schema that your registered model accepts. To use the batch
101 | inference job, set up such a Delta table in both your staging and prod workspaces.
102 | Following this, update the `batch_inference_job` base parameters in `mlops/resources/batch-inference-workflow-resource.yml` to pass
103 | the name of the input Delta table and the name of the output Delta table to which to write batch predictions.
104 |
105 | As the batch job will be run with the credentials of the service principal that provisioned it, make sure that the service
106 | principal corresponding to a particular environment has permissions to read the input Delta table and modify the output Delta table in that environment's workspace. If the Delta table is in [Unity Catalog](https://www.databricks.com/product/unity-catalog), these permissions are:
107 |
108 | * `USAGE` permissions for the catalog and schema of the input and output table.
109 | * `SELECT` permission for the input table.
110 | * `MODIFY` permission for the output table if it pre-dates your job.
111 |
112 | ### Setting up model validation
113 | The model validation workflow focuses on building a plug-and-play stack component for continuous deployment (CD) of models
114 | in staging and prod.
115 | Its central purpose is to evaluate a registered model and validate its quality before deploying the model to Production/Staging.
116 |
117 | Model validation contains three components:
118 | * [model-workflow-resource.yml](./model-workflow-resource.yml) contains the resource config and input parameters for model validation.
119 | * [validation.py](../validation/validation.py) defines custom metrics and validation thresholds that are referenced by the above resource config files.
120 | * [notebooks/ModelValidation](../validation/notebooks/ModelValidation.py) contains the validation job implementation. In most cases you don't need to modify this file.
121 |
122 | To set up and enable model validation, update [validation.py](../validation/validation.py) to return the desired custom metrics and validation thresholds, then
123 | resolve the `TODOs` in the ModelValidation task of [model-workflow-resource.yml](./model-workflow-resource.yml).
124 |
125 |
126 | ### Setting up monitoring
127 | The monitoring workflow focuses on building a plug-and-play stack component for monitoring feature and model drift and retraining based on a
128 | violation threshold defined against the ground truth labels.
129 |
130 | Its central purpose is to track production model performance and feature distributions, and to compare different model versions.
131 |
132 | Monitoring contains three components:
133 | * [metric_violation_check_query.py](../monitoring/metric_violation_check_query.py) defines a query that checks for violation of the monitored metric.
134 | * [notebooks/MonitoredMetricViolationCheck](../monitoring/notebooks/MonitoredMetricViolationCheck.py) acts as an entry point, executing the violation check query against the monitored inference table.
135 | It emits a boolean value based on the query result.
136 | * [monitoring-resource.yml](./monitoring-resource.yml) contains the resource config and input parameters for monitoring, and orchestrates model retraining based on monitoring. It first runs the [notebooks/MonitoredMetricViolationCheck](../monitoring/notebooks/MonitoredMetricViolationCheck.py)
137 | entry point and then decides whether to execute the model retraining workflow.
138 |
139 | To set up and enable monitoring:
140 | * If not done already, generate the inference table, join it with the ground truth labels, and update the table name in [monitoring-resource.yml](./monitoring-resource.yml).
141 | * Resolve the `TODOs` in [monitoring-resource.yml](./monitoring-resource.yml).
142 | * Uncomment the monitoring workflow in [databricks.yml](../databricks.yml).
143 | * OPTIONAL: Update the query in [metric_violation_check_query.py](../monitoring/metric_violation_check_query.py) to customize when the metric is considered to be in violation.
144 |
145 | NOTE: If ground truth labels are not available, you can still set up monitoring but should disable the retraining workflow.
146 |
147 | Retraining Constraints:
148 | The retraining job has constraints for optimal functioning:
149 | * Labels must be provided by the user, joined correctly for the retraining history, and available in time for the retraining frequency.
150 | * Retraining frequency is tightly coupled with the granularity of the monitor. Ensure that the retraining frequency is equal to, or close to, the granularity of the monitor.
151 | * If the granularity of the monitor is 1 day and the retraining frequency is 1 hour, the job will preemptively stop as there is no new data to evaluate the retraining criteria.
152 | * If the granularity of the monitor is 1 day and the retraining frequency is 1 week, retraining would be stale and inefficient.
153 |
154 | Permissions:
155 | Permissions for monitoring are inherited from the original table's permissions.
156 | * Users who own the monitored table or its parent catalog/schema can create, update, and view monitors.
157 | * Users with read permissions on the monitored table can view its monitor.
158 |
159 | Therefore, ensure that service principals are the owners of, or have the necessary permissions to manage, the monitored table.
160 |
161 | ## Develop and test config changes
162 |
163 | ### databricks CLI bundles schema overview
164 | To get started, open `mlops/resources/batch-inference-workflow-resource.yml`. The file contains the ML resource definition of a batch inference job, like:
165 |
166 | ```yaml
167 | new_cluster: &new_cluster
168 |   new_cluster:
169 |     num_workers: 3
170 |     spark_version: 15.3.x-cpu-ml-scala2.12
171 |     node_type_id: Standard_D3_v2
172 |     custom_tags:
173 |       clusterSource: mlops-stacks_0.4
174 |
175 | resources:
176 |   jobs:
177 |     batch_inference_job:
178 |       name: ${bundle.target}-mlops-batch-inference-job
179 |       tasks:
180 |         - task_key: batch_inference_job
181 |           <<: *new_cluster
182 |           notebook_task:
183 |             notebook_path: ../deployment/batch_inference/notebooks/BatchInference.py
184 |             base_parameters:
185 |               env: ${bundle.target}
186 |               input_table_name: batch_inference_input_table_name
187 |       ...
188 | ```
189 |
190 | The example above defines a Databricks job with the name `${bundle.target}-mlops-batch-inference-job`
191 | that runs the notebook under `mlops/deployment/batch_inference/notebooks/BatchInference.py` to regularly apply your ML model for batch inference.
192 |
193 | At the start of the resource definition, we declare an anchor `new_cluster` that is referenced and reused later. For more information about anchors in YAML, please refer to the [YAML documentation](https://yaml.org/spec/1.2.2/#3222-anchors-and-aliases).
194 |
195 | We specify a `batch_inference_job` under `resources/jobs` to define a Databricks workflow with internal key `batch_inference_job` and job name `${bundle.target}-mlops-batch-inference-job`.
196 | The workflow contains a single task with task key `batch_inference_job`. The task runs the notebook `../deployment/batch_inference/notebooks/BatchInference.py`, passing the parameters `env` and `input_table_name` to the notebook.
197 | After setting up the databricks CLI, you can run the command `databricks bundle schema` to learn more about the databricks CLI bundles schema.
198 |
199 | The `notebook_path` is resolved relative to the resource YAML file.
200 |
201 | ### Environment config based variables
202 | `${bundle.target}` will be replaced by the environment config name during bundle deployment. For example, during the deployment of a `test` environment config, the job name will be
203 | `test-mlops-batch-inference-job`. During the deployment of the `staging` environment config, the job name will be
204 | `staging-mlops-batch-inference-job`.
205 |
206 |
207 | To use different values for different environments, you can define bundle variables and override them per target, for example:
208 | ```yaml
209 | variables:
210 |   batch_inference_input_table:
211 |     description: The table name to be used for input to the batch inference workflow.
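    # Targets that do not override this variable (e.g. the staging and prod
    # environment configs in this example) fall back to the default value below.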
212 |     default: input_table
213 |
214 | targets:
215 |   dev:
216 |     variables:
217 |       batch_inference_input_table: dev_table
218 |   test:
219 |     variables:
220 |       batch_inference_input_table: test_table
221 |
222 | new_cluster: &new_cluster
223 |   new_cluster:
224 |     num_workers: 3
225 |     spark_version: 15.3.x-cpu-ml-scala2.12
226 |     node_type_id: Standard_D3_v2
227 |     custom_tags:
228 |       clusterSource: mlops-stacks_0.4
229 |
230 | resources:
231 |   jobs:
232 |     batch_inference_job:
233 |       name: ${bundle.target}-mlops-batch-inference-job
234 |       tasks:
235 |         - task_key: batch_inference_job
236 |           <<: *new_cluster
237 |           notebook_task:
238 |             notebook_path: ../deployment/batch_inference/notebooks/BatchInference.py
239 |             base_parameters:
240 |               env: ${bundle.target}
241 |               input_table_name: ${var.batch_inference_input_table}
242 |       ...
243 | ```
244 | The `batch_inference_job` notebook parameter `input_table_name` uses the bundle variable `batch_inference_input_table`, which has the default value "input_table".
245 | The variable value is overridden with "dev_table" for the `dev` environment config and with "test_table" for the `test` environment config:
246 | - during deployment with the `dev` environment config, the `input_table_name` parameter will get the value "dev_table"
247 | - during deployment with the `staging` environment config, the `input_table_name` parameter will get the value "input_table"
248 | - during deployment with the `prod` environment config, the `input_table_name` parameter will get the value "input_table"
249 | - during deployment with the `test` environment config, the `input_table_name` parameter will get the value "test_table"
250 |
251 | ### Test config changes
252 | To test out a config change, simply edit one of the fields above. For example, increase the cluster size by updating `num_workers` from 3 to 4.
253 |
254 | Then follow [Local development and dev workspace](#local-development-and-dev-workspace) to deploy the change to the dev workspace.
255 | Alternatively, you can open a PR. Continuous integration will then validate the updated config and deploy test resources to the staging workspace.
256 | ## Deploy config changes
257 |
258 | ### Dev workspace deployment
259 | Please refer to [Local development and dev workspace](#local-development-and-dev-workspace).
260 |
261 | ### Test workspace deployment (CI)
262 | After setting up CI/CD, PRs against the main branch will trigger CI workflows to run unit tests, integration tests and resource validation.
263 | The integration test will deploy the MLflow model, MLflow experiment and Databricks workflow resources defined under the `test` environment resource config to the staging workspace. The integration test then triggers a run of the model workflow to verify the ML code.
264 |
265 | ### Staging and Prod workspace deployment (CD)
266 | After merging a PR to the main branch, continuous deployment automation will deploy the `staging` resources to the staging workspace.
267 |
268 | When you are about to cut a release, create and merge a PR to merge changes from main into release. Continuous deployment automation will then deploy the `prod` resources to the prod workspace.
269 |
270 | [Back to project README](../README.md)
271 |
--------------------------------------------------------------------------------