├── Part 1 - Distributed Training
│   ├── 00_setup.py
│   ├── 01_data_prep.py
│   ├── 02_model_training_single_node.py
│   ├── 03_model_training_distributed.py
│   └── 04_monitoring_and_optimization.py
├── Part 2 - Distributed Tuning & Inference
│   ├── 00_setup.py
│   ├── 01_hyperopt_single_machine_model.py
│   ├── 02_hyperopt_distributed_model.py
│   └── 03_pyfunc_distributed_inference.py
└── README.md
/Part 1 - Distributed Training/00_setup.py:
--------------------------------------------------------------------------------
1 | # Databricks notebook source
2 | # Get the current user from the notebook context
3 | user = dbutils.notebook.entry_point.getDbutils().notebook().getContext().tags().apply('user')
4 |
5 | # Parse name from user
6 | my_name = user.split('@')[0].replace('.', '_')
7 |
8 | # Set config for database name, file paths, and table names
9 | database_name = f'distributed_dl_workshop_{my_name}'
10 |
11 | print(f'database_name: {database_name}')
12 |
13 | # COMMAND ----------
14 |
15 | # To log from Horovod worker nodes to the MLflow Tracking server, we need to supply the Databricks host and token
16 | DATABRICKS_HOST = 'https://' + dbutils.notebook.entry_point.getDbutils().notebook().getContext().tags().get('browserHostName').get()
17 | DATABRICKS_TOKEN = dbutils.notebook.entry_point.getDbutils().notebook().getContext().apiToken().get()
18 |
--------------------------------------------------------------------------------
/Part 1 - Distributed Training/01_data_prep.py:
--------------------------------------------------------------------------------
1 | # Databricks notebook source
2 | # MAGIC %md
3 | # MAGIC # 01. Data Preparation
4 | # MAGIC
5 | # MAGIC In this notebook we will be working with the [flowers dataset](https://www.tensorflow.org/datasets/catalog/tf_flowers) from TensorFlow. It contains five classes of flower photos stored in JPEG format under five class-specific sub-directories. The images are hosted under [databricks-datasets](https://docs.databricks.com/data/databricks-datasets.html) at `dbfs:/databricks-datasets/flower_photos` for easy access.
6 | # MAGIC
7 | # MAGIC We will:
8 | # MAGIC - Load the JPEG files using Spark's binary file data source reader.
9 | # MAGIC - Create a Delta table of the processed images in binary format.
10 | # MAGIC - Extract labels from the image filepaths.
11 | # MAGIC - Split our dataset into train and validation datasets for model training.
12 | # MAGIC - Extract label indexes for each class and save the train and validation datasets as Delta tables.
13 |
14 | # COMMAND ----------
15 |
16 | # MAGIC %run ./00_setup
17 |
18 | # COMMAND ----------
19 |
20 | import io
21 | import os
22 | import shutil
23 | import numpy as np
24 | import pandas as pd
25 | import mlflow
26 |
27 | import pyspark.sql.functions as f
28 |
29 | # COMMAND ----------
30 |
31 | # MAGIC %md
32 | # MAGIC ## i. Convert raw JPEG files to Delta table
33 |
34 | # COMMAND ----------
35 |
36 | # MAGIC %md
37 | # MAGIC Let's take a quick look at the files in DBFS
38 |
39 | # COMMAND ----------
40 |
41 | # MAGIC %fs ls /databricks-datasets/flower_photos/daisy
42 |
43 | # COMMAND ----------
44 |
45 | # MAGIC %md
46 | # MAGIC We will use Spark's [binary file](https://docs.microsoft.com/en-us/azure/databricks/data/data-sources/binary-file) data source reader to load the JPEG files from [DBFS](https://docs.databricks.com/data/databricks-file-system.html) into a Spark DataFrame.
47 | # MAGIC
48 | # MAGIC This reads the binary files and converts each file into a single record that contains the raw content and metadata of the file. The binary file data source produces a DataFrame with the following columns (and possibly partition columns):
49 | # MAGIC
50 | # MAGIC * `path` (StringType): The path of the file.
51 | # MAGIC * `modificationTime` (TimestampType): The modification time of the file. In some Hadoop FileSystem implementations, this parameter might be unavailable and the value would be set to a default value.
52 | # MAGIC * `length` (LongType): The length of the file in bytes.
53 | # MAGIC * `content` (BinaryType): The contents of the file.
54 | # MAGIC
55 | # MAGIC Using the `pathGlobFilter` option, we load only the files whose paths match a given glob pattern (here, `*.jpg`).
56 | # MAGIC
57 | # MAGIC We also use the `recursiveFileLookup` option to recursively search for files under nested input directories, even if their names do not follow a partition naming scheme like `date=2019-07-01`; note that this option disables partition discovery. The following cell reads all JPEG files recursively from the input directory.
58 |
59 | # COMMAND ----------
60 |
61 | bronze_df = (spark.read.format('binaryFile')
62 | .option('pathGlobFilter', '*.jpg')
63 | .option('recursiveFileLookup', 'true')
64 | .load('/databricks-datasets/flower_photos/')
65 | .sample(fraction=0.5) # Sample to speed up training
66 | )
67 |
68 | bronze_df.display()
69 |
70 | # COMMAND ----------
71 |
72 | bronze_df.count()
73 |
74 | # COMMAND ----------
75 |
76 | # MAGIC %md
77 | # MAGIC ## ii. Creating Bronze Delta Table from Spark DataFrame
78 |
79 | # COMMAND ----------
80 |
81 | bronze_tbl_name = 'bronze'
82 |
83 | # Delete the old database and tables if needed (for demo purposes)
84 | spark.sql(f'DROP DATABASE IF EXISTS {database_name} CASCADE')
85 |
86 | # Create database to house tables
87 | spark.sql(f'CREATE DATABASE {database_name}')
88 |
89 | # COMMAND ----------
90 |
91 | # To improve read performance when you load data back, Databricks recommends turning off compression when you save data loaded from binary files
92 | spark.conf.set('spark.sql.parquet.compression.codec', 'uncompressed')
93 |
94 | # Create a Delta Lake table from the loaded DataFrame
95 | bronze_df.write.format('delta').mode('overwrite').saveAsTable(f'{database_name}.{bronze_tbl_name}')
96 |
97 | # COMMAND ----------
98 |
99 | spark.sql(f'SELECT * FROM {database_name}.{bronze_tbl_name}').display()
100 |
101 | # COMMAND ----------
102 |
103 | # Load bronze table for faster transformations
104 | bronze_df = spark.table(f'{database_name}.{bronze_tbl_name}')
105 |
106 | # COMMAND ----------
107 |
108 | # MAGIC %md
109 | # MAGIC Reading from a Delta table is much faster than trying to load all the images from a directory. By saving the data in the Delta format, we can continue to process the images into a silver table, transforming it into a proper training set for our use case. This entails parsing out labels from the image paths and converting string labels into numerical values.
110 |
111 | # COMMAND ----------
112 |
113 | # MAGIC %md
114 | # MAGIC ## iii. Create Silver Table from Bronze Table
115 |
116 | # COMMAND ----------
117 |
118 | # MAGIC %md
119 | # MAGIC
120 | # MAGIC We will use a [pandas UDF](https://docs.databricks.com/spark/latest/spark-sql/udf-python-pandas.html) to extract the labels for each image from the filepath.
121 |
122 | # COMMAND ----------
123 |
124 | # Define pandas UDF to extract the image label from the file path
125 | @f.pandas_udf('string')
126 | def get_label_udf(path_col: pd.Series) -> pd.Series:
127 | return path_col.map(lambda path: path.split('/')[-2])
128 |
129 | # Apply pandas UDF to path column
130 | silver_df = bronze_df.withColumn('label', get_label_udf('path'))
131 |
132 | # COMMAND ----------
133 |
134 | # Save table with label as silver table
135 | silver_tbl_name = 'silver'
136 | silver_df.write.format('delta').mode('overwrite').saveAsTable(f'{database_name}.{silver_tbl_name}')
137 |
138 | # COMMAND ----------
139 |
140 | # MAGIC %md
141 | # MAGIC
142 | # MAGIC ## iv. Split dataset into train and validation sets
143 |
144 | # COMMAND ----------
145 |
146 | # MAGIC %md
147 | # MAGIC Prior to model training, we want to split our dataset into `train` and `val` datasets.
148 |
149 | # COMMAND ----------
150 |
151 | # Load dataset from the Silver table
152 | dataset_df = spark.table(f'{database_name}.{silver_tbl_name}')
153 | display(dataset_df)
154 |
155 | # COMMAND ----------
156 |
157 | # MAGIC %md
158 | # MAGIC Recall that we sampled the data earlier to speed things up; we now split it into `train` and `val` datasets with a 90/10 random split.
159 |
160 | # COMMAND ----------
161 |
162 | train_df, val_df = dataset_df.randomSplit([0.9, 0.1], seed=42)
163 |
164 | # COMMAND ----------
165 |
166 | # MAGIC %md
167 | # MAGIC
168 | # MAGIC ## v. Add label index column to train and validation datasets
169 |
170 | # COMMAND ----------
171 |
172 | # MAGIC %md
173 | # MAGIC
174 | # MAGIC We will need to pass our label as a numeric value during training. As such, we create a label index prior to writing out our train and validation datasets.
175 |
176 | # COMMAND ----------
177 |
178 | # Create index lookups for labels
179 | labels = train_df.select(f.col('label')).distinct().collect()
180 | label_to_idx = {label: index for index, (label, ) in enumerate(sorted(labels))}
181 |
182 | print(label_to_idx)
183 |
184 | # COMMAND ----------
185 |
186 | # Define pandas UDF to map each label to its index
187 | @f.pandas_udf('int')
188 | def get_label_idx_udf(labels_col: pd.Series) -> pd.Series:
189 | return labels_col.map(lambda label: label_to_idx[label])
190 |
191 | # Create train DataFrame with label index column
192 | train_df = (train_df
193 | .withColumn('label_idx', get_label_idx_udf('label')))
194 |
195 | # Create val DataFrame with label index column
196 | val_df = (val_df
197 | .withColumn('label_idx', get_label_idx_udf('label')))
198 |
199 | # COMMAND ----------
200 |
201 | # MAGIC %md
202 | # MAGIC
203 | # MAGIC ## vi. Save train and val datasets
204 |
205 | # COMMAND ----------
206 |
207 | # MAGIC %md
208 | # MAGIC
209 | # MAGIC Let's create tables from our train and validation datasets.
210 |
211 | # COMMAND ----------
212 |
213 | silver_train_tbl_name = 'silver_train'
214 | silver_val_tbl_name = 'silver_val'
215 |
216 | (train_df.write.format('delta')
217 | .mode('overwrite')
218 | .saveAsTable(f'{database_name}.{silver_train_tbl_name}'))
219 |
220 | (val_df.write.format('delta')
221 | .mode('overwrite')
222 | .saveAsTable(f'{database_name}.{silver_val_tbl_name}'))
223 |
224 | # COMMAND ----------
225 |
226 | # MAGIC %md
227 | # MAGIC Nice! We now have our data ready for model training. In the [next notebook]($./02_model_training_single_node) we will look at training a model in a single node setting, using the Delta tables we just created.
228 |
--------------------------------------------------------------------------------
/Part 1 - Distributed Training/02_model_training_single_node.py:
--------------------------------------------------------------------------------
1 | # Databricks notebook source
2 | # MAGIC %md
3 | # MAGIC # 02. Model Training - Single node
4 |
5 | # COMMAND ----------
6 |
7 | # MAGIC %md
8 | # MAGIC In this notebook we will be performing some basic transfer learning on the flowers dataset, using the [`MobileNetV2`](https://www.tensorflow.org/api_docs/python/tf/keras/applications/mobilenet_v2/MobileNetV2) model as our base model.
9 | # MAGIC
10 | # MAGIC We will:
11 | # MAGIC - Load the train and validation datasets created in [01_data_prep]($./01_data_prep), converting them to `tf.data.Dataset` objects.
12 | # MAGIC - Train our model on a single (driver) node.
13 |
14 | # COMMAND ----------
15 |
16 | # MAGIC %run ./00_setup
17 |
18 | # COMMAND ----------
19 |
20 | import tensorflow as tf
21 | from tensorflow.keras.applications.mobilenet_v2 import preprocess_input
22 | from tensorflow.keras.models import Sequential
23 | from tensorflow.keras.optimizers import Adam
24 |
25 | import mlflow
26 |
27 | # COMMAND ----------
28 |
29 | # MAGIC %md
30 | # MAGIC
31 | # MAGIC ## i. Configs
32 |
33 | # COMMAND ----------
34 |
35 | # MAGIC %md
36 | # MAGIC
37 | # MAGIC Set global variables
38 |
39 | # COMMAND ----------
40 |
41 | IMG_HEIGHT = 224
42 | IMG_WIDTH = 224
43 | IMG_CHANNELS = 3
44 |
45 | BATCH_SIZE = 32
46 | EPOCHS = 3
47 |
48 | # COMMAND ----------
49 |
50 | # MAGIC %md
51 | # MAGIC
52 | # MAGIC ## ii. Load prepared data from Silver layer
53 |
54 | # COMMAND ----------
55 |
56 | # MAGIC %md
57 | # MAGIC
58 | # MAGIC Load the train and validation datasets we created in [01_data_prep]($./01_data_prep) as Spark DataFrames.
59 |
60 | # COMMAND ----------
61 |
62 | cols_to_keep = ['content', 'label_idx']
63 |
64 | train_tbl_name = 'silver_train'
65 | val_tbl_name = 'silver_val'
66 |
67 | train_df = (spark.table(f'{database_name}.{train_tbl_name}')
68 | .select(cols_to_keep))
69 |
70 | val_df = (spark.table(f'{database_name}.{val_tbl_name}')
71 | .select(cols_to_keep))
72 |
73 | print('train_df count:', train_df.count())
74 | print('val_df count:', val_df.count())
75 |
76 | # COMMAND ----------
77 |
78 | num_classes = train_df.select('label_idx').distinct().count()
79 |
80 | print('Number of classes:', num_classes)
81 |
82 | # COMMAND ----------
83 |
84 | # MAGIC %md
85 | # MAGIC
86 | # MAGIC ## iii. Create `tf.data.Dataset`
87 |
88 | # COMMAND ----------
89 |
90 | # MAGIC %md
91 | # MAGIC To train a model with TensorFlow, we'll need to transform our Spark DataFrames into `tf.data.Dataset` objects. TensorFlow can build a [dataset](https://www.tensorflow.org/datasets) directly from in-memory arrays via `tf.data.Dataset.from_tensor_slices`.
92 | # MAGIC
93 | # MAGIC For this example, we'll convert our Spark DataFrame to a pandas DataFrame using `.toPandas()` and create a properly formatted `tf.data.Dataset` from there.
94 |
95 | # COMMAND ----------
96 |
97 | train_pdf = train_df.toPandas()
98 | val_pdf = val_df.toPandas()
99 |
100 | # Create train tf.data.Dataset
101 | train_ds = (
102 | tf.data.Dataset.from_tensor_slices(
103 | (train_pdf['content'].values, train_pdf['label_idx'].values))
104 | )
105 |
106 | # Create val tf.data.Dataset
107 | val_ds = (
108 | tf.data.Dataset.from_tensor_slices(
109 | (val_pdf['content'].values, val_pdf['label_idx'].values))
110 | )
111 |
112 | # COMMAND ----------
113 |
114 | # MAGIC %md
115 | # MAGIC Now we can transform our images into the correct format by mapping a preprocessing function over our `tf.data.Dataset`s.
116 |
117 | # COMMAND ----------
118 |
119 | def preprocess(content: str, label_idx: int):
120 | """
121 | Preprocess raw image bytes for MobileNetV2 (ImageNet).
122 | """
123 | image = tf.image.decode_jpeg(content, channels=IMG_CHANNELS)
124 | image = tf.image.resize(image, [IMG_HEIGHT, IMG_WIDTH])
125 |
126 | return preprocess_input(image), label_idx
127 |
128 | # COMMAND ----------
129 |
130 | # Apply preprocess function to both train and validation datasets
131 | train_ds = (train_ds
132 | .map(lambda content, label_idx: preprocess(content, label_idx))
133 | .batch(BATCH_SIZE)
134 | )
135 |
136 | val_ds = (val_ds
137 | .map(lambda content, label_idx: preprocess(content, label_idx))
138 | .batch(BATCH_SIZE)
139 | )
140 |
141 | # COMMAND ----------
142 |
143 | # MAGIC %md
144 | # MAGIC
145 | # MAGIC ## iv. Define model
146 |
147 | # COMMAND ----------
148 |
149 | # MAGIC %md
150 | # MAGIC
151 | # MAGIC Single node training is the most common way machine learning practitioners set up their training execution. For some modeling use cases, this is a great way to go as all the modeling can stay on one machine with no additional libraries.
152 | # MAGIC
153 | # MAGIC Training with Databricks is just as easy. Simply use a [Single Node Cluster](https://docs.databricks.com/clusters/single-node.html) which is just a Spark Driver with no worker nodes.
154 | # MAGIC
155 | # MAGIC We will be using the [`MobileNetV2`](https://www.tensorflow.org/api_docs/python/tf/keras/applications/mobilenet_v2/MobileNetV2) architecture from TensorFlow as our base model, freezing the base layers and adding a head of Global Pooling, Dropout and Dense classification layer.
156 |
157 | # COMMAND ----------
158 |
159 | def build_model(img_height: int,
160 | img_width: int,
161 | img_channels: int,
162 | num_classes: int) -> tf.keras.models.Sequential:
163 |
164 | base_model = tf.keras.applications.MobileNetV2(include_top=False,
165 | input_shape=(img_height, img_width, img_channels))
166 |
167 | # Freeze base model layers
168 | for layer in base_model.layers:
169 | layer.trainable = False
170 |
171 | model = Sequential([
172 | base_model,
173 | tf.keras.layers.GlobalAveragePooling2D(),
174 | tf.keras.layers.Dropout(0.5),
175 | tf.keras.layers.Dense(num_classes)
176 | ])
177 |
178 | return model
179 |
180 | # COMMAND ----------
181 |
182 | # MAGIC %md
183 | # MAGIC
184 | # MAGIC ## v. Train Model - Single Node
185 |
186 | # COMMAND ----------
187 |
188 | # MAGIC %md
189 | # MAGIC
190 | # MAGIC Let's train our model: we will instantiate, compile, and fit it, using [MLflow autologging](https://docs.microsoft.com/en-us/azure/databricks/applications/mlflow/databricks-autologging) to automatically track model params, metrics and artifacts.
191 |
192 | # COMMAND ----------
193 |
194 | # mlflow autologging
195 | mlflow.tensorflow.autolog()
196 |
197 | # Instantiate model
198 | model = build_model(IMG_HEIGHT, IMG_WIDTH, IMG_CHANNELS, num_classes)
199 |
200 | # Compile model
201 | model.compile(optimizer=Adam(learning_rate=0.001),
202 | loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
203 | metrics=['accuracy'])
204 |
205 | # Specify training params. train_ds is already batched, so derive step counts from the sample counts
206 | steps_per_epoch = len(train_pdf) // BATCH_SIZE
207 | validation_steps = len(val_pdf) // BATCH_SIZE
208 |
209 | # Train the model
210 | model.fit(train_ds,
211 | steps_per_epoch=steps_per_epoch,
212 | epochs=EPOCHS,
213 | verbose=1,
214 | validation_data=val_ds,
215 | validation_steps=validation_steps)
216 |
217 | # COMMAND ----------
218 |
219 | # MAGIC %md
220 | # MAGIC Woohoo! We trained a model on a single machine, much as you might on your laptop. It was easy to convert this small dataset from a Delta table into a pandas DataFrame and subsequently into a `tf.data.Dataset` for training.
221 | # MAGIC
222 | # MAGIC However, most production workloads require training with **orders of magnitude** more data, so much so that it could overwhelm a single machine during training. Other factors, such as large model architectures, can also exhaust a single node.
223 | # MAGIC
224 | # MAGIC In the next notebook, [03_model_training_distributed]($./03_model_training_distributed), we will see how to scale our model training across multiple nodes.
225 |
--------------------------------------------------------------------------------
/Part 1 - Distributed Training/03_model_training_distributed.py:
--------------------------------------------------------------------------------
1 | # Databricks notebook source
2 | # MAGIC %md
3 | # MAGIC # 03. Model Training - Distributed
4 |
5 | # COMMAND ----------
6 |
7 | # MAGIC %md
8 | # MAGIC
9 | # MAGIC In this notebook we will see how we can take our single node training code and scale training across our cluster.
10 | # MAGIC
11 | # MAGIC We will:
12 | # MAGIC - Use [Petastorm](https://docs.databricks.com/applications/machine-learning/load-data/ddl-data.html) to load our datasets from Delta and convert them to `tf.data.Dataset` objects.
13 | # MAGIC - Conduct single node training using Petastorm.
14 | # MAGIC - Scale training across multiple nodes using [Horovod](https://docs.databricks.com/applications/machine-learning/train-model/distributed-training/horovod-runner.html) to orchestrate training, with Petastorm as a means to feed data to the training process.
15 |
16 | # COMMAND ----------
17 |
18 | # MAGIC %md
19 | # MAGIC
20 | # MAGIC ### Motivation
21 |
22 | # COMMAND ----------
23 |
24 | # MAGIC %md
25 | # MAGIC A very large training problem may benefit from scaling vertically (training on a larger instance) or horizontally (using multiple machines). Scaling vertically may be an approach to adopt, but running a large instance for an extended period can be costly, and a single machine can only scale so far (i.e., there are only so many cores and so much RAM one machine can have).
26 | # MAGIC
27 | # MAGIC Scaling horizontally can be an affordable way to marshal the compute required for very large training jobs. Multiple smaller instances are often more readily available in the cloud, at cheaper rates, than a single very large instance, and in theory you can add as many nodes as you wish (subject to your cloud limits). In this section, we'll incorporate [Petastorm](https://docs.databricks.com/applications/machine-learning/load-data/ddl-data.html) and [Horovod](https://docs.databricks.com/applications/machine-learning/train-model/distributed-training/horovod-runner.html) into our single node training code to distribute training across multiple machines.
28 |
29 | # COMMAND ----------
30 |
31 | # MAGIC %md
32 | # MAGIC Our first single node training example only used a fraction of the data and required it to fit in memory. Typical training datasets may not fit in the memory of a single machine. Petastorm can load data stored in Parquet format directly, meaning we can go from our silver Delta table to a distributed `tf.data.Dataset` without copying our table into a pandas DataFrame and wasting additional memory.
33 | # MAGIC
34 | # MAGIC Petastorm enables single machine or distributed training and evaluation of deep learning models from datasets in Apache Parquet format and from datasets that are already loaded as Spark DataFrames. It supports ML frameworks such as TensorFlow, PyTorch, and PySpark and can be used from pure Python code.
35 |
36 | # COMMAND ----------
37 |
38 | # MAGIC %md
39 | # MAGIC
40 | # MAGIC Here we use `petastorm` to load and cache data that was read directly from our silver train and val Delta tables:
41 |
42 | # COMMAND ----------
43 |
44 | # MAGIC %run ./00_setup
45 |
46 | # COMMAND ----------
47 |
48 | import os
49 |
50 | import tensorflow as tf
51 | from tensorflow.keras.applications.mobilenet_v2 import preprocess_input
52 | from tensorflow.keras.models import Sequential
53 | from tensorflow.keras.optimizers import Adam
54 | from tensorflow.keras.callbacks import ReduceLROnPlateau, ModelCheckpoint
55 |
56 | import mlflow
57 |
58 | from petastorm.spark import SparkDatasetConverter, make_spark_converter
59 |
60 | import horovod.tensorflow.keras as hvd
61 | from sparkdl import HorovodRunner
62 |
63 | # COMMAND ----------
64 |
65 | # MAGIC %md
66 | # MAGIC
67 | # MAGIC ## i. Configs
68 |
69 | # COMMAND ----------
70 |
71 | # MAGIC %md
72 | # MAGIC
73 | # MAGIC Set global variables
74 |
75 | # COMMAND ----------
76 |
77 | IMG_HEIGHT = 224
78 | IMG_WIDTH = 224
79 | IMG_CHANNELS = 3
80 |
81 | BATCH_SIZE = 256
82 | EPOCHS = 3
83 |
84 | # COMMAND ----------
85 |
86 | # MAGIC %md
87 | # MAGIC
88 | # MAGIC ## ii. Load prepared data from Silver layer
89 |
90 | # COMMAND ----------
91 |
92 | # MAGIC %md
93 | # MAGIC
94 | # MAGIC Load the train and validation datasets from the Silver layer we created in [01_data_prep]($./01_data_prep) as Spark DataFrames.
95 |
96 | # COMMAND ----------
97 |
98 | cols_to_keep = ['content', 'label_idx']
99 |
100 | train_tbl_name = 'silver_train'
101 | val_tbl_name = 'silver_val'
102 |
103 | train_df = (spark.table(f'{database_name}.{train_tbl_name}')
104 | .select(cols_to_keep))
105 |
106 | val_df = (spark.table(f'{database_name}.{val_tbl_name}')
107 | .select(cols_to_keep))
108 |
109 | # Ensure the number of partitions is at least the number of workers, which is required for distributed training.
110 | train_df = train_df.repartition(2)
111 | val_df = val_df.repartition(2)
112 |
113 | print('train_df count:', train_df.count())
114 | print('val_df count:', val_df.count())
115 |
116 | # COMMAND ----------
117 |
118 | num_classes = train_df.select('label_idx').distinct().count()
119 |
120 | print('Number of classes:', num_classes)
121 |
122 | # COMMAND ----------
123 |
124 | # MAGIC %md
125 | # MAGIC ## iii. Convert the Spark DataFrame to a TensorFlow Dataset
126 | # MAGIC Converting a Spark DataFrame to a TensorFlow dataset takes two steps:
127 | # MAGIC
128 | # MAGIC 1. Define where you want to cache the data by setting the Spark config
129 | # MAGIC 2. Call the `make_spark_converter()` method to make the conversion
130 | # MAGIC
131 | # MAGIC This will copy the data to the specified path.
134 |
135 | # COMMAND ----------
136 |
137 | dbutils.fs.rm(f'/tmp/distributed_dl_workshop_{user}/petastorm', recurse=True)
138 | spark.conf.set(SparkDatasetConverter.PARENT_CACHE_DIR_URL_CONF, f'file:///dbfs/tmp/distributed_dl_workshop_{user}/petastorm')
139 |
140 | train_petastorm_converter = make_spark_converter(train_df)
141 | val_petastorm_converter = make_spark_converter(val_df)
142 |
143 | train_size = len(train_petastorm_converter)
144 | val_size = len(val_petastorm_converter)
145 |
146 | # COMMAND ----------
147 |
148 | # MAGIC %md
149 | # MAGIC ## iv. Define model
150 |
151 | # COMMAND ----------
152 |
153 | # MAGIC %md
154 | # MAGIC
155 | # MAGIC We will define the same model as in [`02_model_training_single_node`]($./02_model_training_single_node), along with the same preprocessing functionality.
156 |
157 | # COMMAND ----------
158 |
159 | def build_model(img_height: int,
160 | img_width: int,
161 | img_channels: int,
162 | num_classes: int) -> tf.keras.models.Sequential:
163 |
164 | base_model = tf.keras.applications.MobileNetV2(include_top=False,
165 | input_shape=(img_height, img_width, img_channels))
166 |
167 | #freeze base model layers
168 | for layer in base_model.layers:
169 | layer.trainable = False
170 |
171 | model = Sequential([
172 | base_model,
173 | tf.keras.layers.GlobalAveragePooling2D(),
174 | tf.keras.layers.Dropout(0.5),
175 | tf.keras.layers.Dense(num_classes)
176 | ])
177 |
178 | return model
179 |
180 | # COMMAND ----------
181 |
182 | def preprocess(content: str, label_idx: int):
183 | """
184 | Preprocess raw image bytes for MobileNetV2 (ImageNet).
185 | """
186 | image = tf.image.decode_jpeg(content, channels=IMG_CHANNELS)
187 | image = tf.image.resize(image, [IMG_HEIGHT, IMG_WIDTH])
188 |
189 | return preprocess_input(image), label_idx
190 |
191 | # COMMAND ----------
192 |
193 | # MAGIC %md
194 | # MAGIC
195 | # MAGIC ## v. Single Node - Feed data to single node using Petastorm
196 | # MAGIC
197 | # MAGIC We will first do single node training, using Petastorm to feed the data from Delta to the training process. We use Petastorm's `make_tf_dataset()` to read batches of data.
198 | # MAGIC
199 | # MAGIC * Note that we use **`num_epochs=None`** to generate infinite batches of data to avoid handling the last incomplete batch. This is particularly useful in the distributed training scenario, where we need to guarantee that the numbers of data records seen on all workers are identical. Given that the length of each data shard may not be identical, setting **`num_epochs`** to any specific number would fail to meet the guarantee.
200 | # MAGIC * The **`workers_count`** param specifies the number of threads or processes to be spawned in the reader pool; these are not Spark workers.
201 |
202 | # COMMAND ----------
203 |
204 | with train_petastorm_converter.make_tf_dataset(batch_size=BATCH_SIZE, num_epochs=None) as train_ds,\
205 | val_petastorm_converter.make_tf_dataset(batch_size=BATCH_SIZE, num_epochs=None) as val_ds:
206 |
207 | # mlflow autologging
208 | mlflow.tensorflow.autolog()
209 |
210 | # Transforming datasets to map our preprocess function() and then batch()
211 | train_ds = train_ds.unbatch().map(lambda x: (x.content, x.label_idx))
212 | val_ds = val_ds.unbatch().map(lambda x: (x.content, x.label_idx))
213 |
214 | train_ds = (train_ds
215 | .map(lambda content, label_idx: preprocess(content, label_idx))
216 | .batch(BATCH_SIZE))
217 | val_ds = (val_ds
218 | .map(lambda content, label_idx: preprocess(content, label_idx))
219 | .batch(BATCH_SIZE))
220 |
221 | model = build_model(IMG_HEIGHT, IMG_WIDTH, IMG_CHANNELS, num_classes)
222 | model.compile(optimizer=Adam(learning_rate=0.001),
223 | loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
224 | metrics=['accuracy'])
225 |
226 | steps_per_epoch = train_size // BATCH_SIZE
227 | validation_steps = val_size // BATCH_SIZE
228 |
229 | model.fit(train_ds,
230 | steps_per_epoch=steps_per_epoch,
231 | epochs=EPOCHS,
232 | verbose=1,
233 | validation_data=val_ds,
234 | validation_steps=validation_steps)
235 |
236 | # COMMAND ----------
237 |
238 | # MAGIC %md
239 | # MAGIC You can see the code to train our model is exactly the same as in our single node training. Loading our data with Petastorm directly from the underlying Parquet files of our Delta tables simply required calling `make_spark_converter()` and initializing our datasets via the converter's `make_tf_dataset()` method.
240 | # MAGIC
241 | # MAGIC Next, we need to distribute the training across our cluster.
242 |
243 | # COMMAND ----------
244 |
245 | # MAGIC %md
246 | # MAGIC ## vi. Distribute training with Horovod
247 | # MAGIC
248 | # MAGIC [Horovod](https://github.com/horovod/horovod) is a distributed training framework for TensorFlow, Keras, and PyTorch. Databricks supports distributed deep learning training using HorovodRunner and the `horovod.spark` package; we will use HorovodRunner next.
249 | # MAGIC
250 | # MAGIC Horovod lets us train across multiple machines, meaning we can distribute training across CPU or GPU clusters.
251 |
252 | # COMMAND ----------
253 |
254 | # MAGIC %md
255 | # MAGIC
256 | # MAGIC [HorovodRunner](https://databricks.github.io/spark-deep-learning/#sparkdl.HorovodRunner) is a general API to run distributed DL workloads on Databricks using Uber’s Horovod framework. By integrating Horovod with Spark’s barrier mode, Databricks is able to provide higher stability for long-running deep learning training jobs on Spark.
257 | # MAGIC
258 | # MAGIC ### How it works
259 | # MAGIC * HorovodRunner takes a Python method that contains DL training code with Horovod hooks.
260 | # MAGIC * This method gets pickled on the driver and sent to Spark workers.
261 | # MAGIC * A Horovod MPI job is embedded as a Spark job using barrier execution mode.
262 | # MAGIC * The first executor collects the IP addresses of all task executors using BarrierTaskContext and triggers a Horovod job using mpirun.
263 | # MAGIC * Each Python MPI process loads the pickled program back, deserializes it, and runs it.
264 | # MAGIC
265 | # MAGIC
266 | # MAGIC
267 | # MAGIC 
268 | # MAGIC
269 | # MAGIC For additional resources, see:
270 | # MAGIC * [HorovodRunner docs](https://docs.databricks.com/applications/machine-learning/train-model/distributed-training/horovod-runner.html)
271 | # MAGIC * Horovod Runner webinar
272 |
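# COMMAND ----------

# MAGIC %md
# MAGIC In skeletal form (a sketch only; the full training function we actually run follows below):

# COMMAND ----------

# Skeleton of the HorovodRunner API, kept commented out here:
#
# def train():
#     import horovod.tensorflow.keras as hvd
#     hvd.init()               # initialize Horovod on each process
#     ...                      # build, compile, and fit the model with Horovod hooks
#
# hr = HorovodRunner(np=2)     # np = number of parallel training processes
# hr.run(train)                # pickles `train`, ships it to the workers, and runs it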
273 | # COMMAND ----------
274 |
275 | # MAGIC %md
276 | # MAGIC Inside our training function, we'll incorporate the same logic we used in the single node and Petastorm training examples, plus additional logic to account for GPUs.
277 | # MAGIC
278 | # MAGIC If you are running on a CPU cluster, the GPU-specific lines are ignored and training occurs on CPU.
279 |
280 | # COMMAND ----------
281 |
282 | def train_and_evaluate_hvd():
283 | hvd.init() # Initialize Horovod.
284 |
285 | # To enable tracking from the Spark workers to Databricks
286 | mlflow.set_tracking_uri('databricks')
287 | os.environ['DATABRICKS_HOST'] = DATABRICKS_HOST
288 | os.environ['DATABRICKS_TOKEN'] = DATABRICKS_TOKEN
289 |
290 | # Horovod: pin GPU to be used to process local rank (one GPU per process)
291 | gpus = tf.config.experimental.list_physical_devices('GPU')
292 | for gpu in gpus:
293 | tf.config.experimental.set_memory_growth(gpu, True)
294 | if gpus:
295 | tf.config.experimental.set_visible_devices(gpus[hvd.local_rank()], 'GPU')
296 |
297 | # Creating model
298 | model = build_model(IMG_HEIGHT, IMG_WIDTH, IMG_CHANNELS, num_classes)
299 |
300 | # Horovod: adjust learning rate based on number of workers.
301 | optimizer = tf.keras.optimizers.Adam(learning_rate=0.001 * hvd.size())
302 | dist_optimizer = hvd.DistributedOptimizer(optimizer)
303 |
304 | callbacks = [
305 | # Horovod: broadcast initial variable states from rank 0 to all other processes.
306 | # This is necessary to ensure consistent initialization of all workers when
307 | # training is started with random weights or restored from a checkpoint.
308 | hvd.callbacks.BroadcastGlobalVariablesCallback(0),
309 |
310 | # Horovod: average metrics among workers at the end of every epoch.
311 | # Note: This callback must be in the list before the ReduceLROnPlateau,
312 | # TensorBoard or other metrics-based callbacks.
313 | hvd.callbacks.MetricAverageCallback(),
314 |
315 | # Horovod: using `lr = 1.0 * hvd.size()` from the very beginning leads to worse final
316 | # accuracy. Scale the learning rate `lr = 1.0` ---> `lr = 1.0 * hvd.size()` during
317 | # the first five epochs. See https://arxiv.org/abs/1706.02677 for details.
318 | hvd.callbacks.LearningRateWarmupCallback(initial_lr=0.001*hvd.size(), warmup_epochs=5, verbose=1),
319 |
320 | # Reduce the learning rate if training plateaus.
321 | ReduceLROnPlateau(patience=10, verbose=1)
322 | ]
323 |
324 | # Set experimental_run_tf_function=False in TF 2.x
325 | model.compile(optimizer=dist_optimizer,
326 | loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
327 | metrics=['accuracy'],
328 | experimental_run_tf_function=False)
329 |
330 |
331 | # same logic as Petastorm training example here with slight adjustments
332 | with train_petastorm_converter.make_tf_dataset(batch_size=BATCH_SIZE,
333 | cur_shard=hvd.rank(),
334 | shard_count=hvd.size()) as train_ds,\
335 | val_petastorm_converter.make_tf_dataset(batch_size=BATCH_SIZE,
336 | cur_shard=hvd.rank(),
337 | shard_count=hvd.size()) as val_ds:
338 |
339 | # Transforming datasets to map our preprocess function() and then batch()
340 | train_ds = train_ds.unbatch().map(lambda x: (x.content, x.label_idx))
341 | val_ds = val_ds.unbatch().map(lambda x: (x.content, x.label_idx))
342 |
343 | train_ds = (train_ds
344 | .map(lambda content, label_idx: preprocess(content, label_idx))
345 | .batch(BATCH_SIZE))
346 | val_ds = (val_ds
347 | .map(lambda content, label_idx: preprocess(content, label_idx))
348 | .batch(BATCH_SIZE))
349 |
350 | steps_per_epoch = train_size // (BATCH_SIZE * hvd.size())
351 | validation_steps = max(1, val_size // (BATCH_SIZE * hvd.size()))
352 |
353 | hist = model.fit(train_ds,
354 | steps_per_epoch = steps_per_epoch,
355 | epochs = EPOCHS,
356 | verbose = 1,
357 | validation_data = val_ds,
358 | validation_steps = validation_steps)
359 |
360 | # MLflow Tracking (Log only from Worker 0)
361 | if hvd.rank() == 0:
362 | # Log events to MLflow
363 | with mlflow.start_run(run_id = active_run_uuid):
364 | # Log MLflow Parameters
365 | mlflow.log_param('epochs', EPOCHS)
366 | mlflow.log_param('batch_size', BATCH_SIZE)
367 |
368 | # Log MLflow Metrics
369 | mlflow.log_metric('val_loss', hist.history['val_loss'][-1])
370 | mlflow.log_metric('val_accuracy', hist.history['val_accuracy'][-1])
371 |
372 | # Log Model
373 | mlflow.keras.log_model(model, 'model')
374 |
375 | return hist.history['val_loss'][-1], hist.history['val_accuracy'][-1]
376 |
377 | # COMMAND ----------
378 |
379 | # MAGIC %md
380 | # MAGIC
381 | # MAGIC ### Training on the driver with Horovod
382 |
383 | # COMMAND ----------
384 |
385 | # MAGIC %md Test it out on just the driver.
386 | # MAGIC
387 | # MAGIC `np=-1` forces Horovod to run a single training process on the driver node.
388 |
389 | # COMMAND ----------
390 |
391 | with mlflow.start_run(run_name='horovod_driver') as run:
392 |
393 | active_run_uuid = mlflow.active_run().info.run_uuid
394 | hr = HorovodRunner(np=-1, driver_log_verbosity="all")
395 | hr.run(train_and_evaluate_hvd)
396 |
397 | mlflow.end_run()
398 |
399 | # COMMAND ----------
400 |
401 | # MAGIC %md
402 | # MAGIC
403 | # MAGIC ### Distributed Training with Horovod
404 |
405 | # COMMAND ----------
406 |
407 | ## OPTIONAL: You can enable the Horovod Timeline as follows. Note that it can slow training due to frequent writes, and the resulting file must be exported from Databricks and loaded into chrome://tracing
408 | # import os
409 | # os.environ["HOROVOD_TIMELINE"] = f"{working_dir}/_timeline.json"
410 |
411 | with mlflow.start_run(run_name='horovod_distributed') as run:
412 |
413 | active_run_uuid = mlflow.active_run().info.run_uuid
414 | hr = HorovodRunner(np=2, driver_log_verbosity='all')
415 | hr.run(train_and_evaluate_hvd)
416 |
417 | mlflow.end_run()
418 |
419 | # COMMAND ----------
420 |
421 | # MAGIC %md Finally, we delete the cached Petastorm files.
422 |
423 | # COMMAND ----------
424 |
425 | train_petastorm_converter.delete()
426 | val_petastorm_converter.delete()
427 |
428 | # COMMAND ----------
429 |
430 | # MAGIC %md
431 | # MAGIC
432 | # MAGIC ## vii. Load model
433 | # MAGIC
434 | # MAGIC Load model from MLflow
435 |
436 | # COMMAND ----------
437 |
438 | trained_model = mlflow.keras.load_model(f'runs:/{run.info.run_id}/model')
439 | trained_model.summary()
440 |
--------------------------------------------------------------------------------
/Part 1 - Distributed Training/04_monitoring_and_optimization.py:
--------------------------------------------------------------------------------
1 | # Databricks notebook source
2 | # MAGIC %md
3 | # MAGIC
4 | # MAGIC # 04. Monitoring and Optimization
5 |
6 | # COMMAND ----------
7 |
8 | # MAGIC %md
9 | # MAGIC ### Single Node vs Distributed Training Considerations
10 | # MAGIC
11 | # MAGIC Whether you train on a single VM or in a distributed manner depends heavily on the scale of your training data and on the size and complexity of your model architecture. There may be instances where Petastorm helps get past memory issues around dataset loading, but distributing training across a cluster would add too much training time and/or leave resources underutilized.
12 | # MAGIC
13 | # MAGIC Petastorm is very useful in situations where training data is the bottleneck. Distributing the dataset across the cluster is a great way to get past the memory limitations of a single node.
14 | # MAGIC
15 | # MAGIC Horovod allows us to leverage the compute of multiple machines and/or multiple devices (like GPUs) during training. This is especially useful when training large architectures and/or on massive datasets. At times, leveraging multiple machines with a single GPU each can be more cost effective than a single, very large machine with multiple GPUs.
16 |
17 | # COMMAND ----------
18 |
19 | # MAGIC %md
20 | # MAGIC Given all of these options, how do you best tune the resources for your training workload?
21 |
22 | # COMMAND ----------
23 |
24 | # MAGIC %md
25 | # MAGIC ### Ganglia Metrics
26 | # MAGIC
27 | # MAGIC Found under the **Metrics** tab, [Ganglia](https://docs.databricks.com/clusters/clusters-manage.html#monitor-performance) live metrics and historical snapshots provide a view into how cluster resources are utilized at any point in time.
28 | # MAGIC
29 | # MAGIC 
30 |
31 | # COMMAND ----------
32 |
33 | # MAGIC %md
34 | # MAGIC ### Optimizations
35 | # MAGIC
36 | # MAGIC Once you understand how to interpret the Ganglia metrics, you can watch them as you experiment with techniques that further optimize your training.
37 |
38 | # COMMAND ----------
39 |
40 | # MAGIC %md
41 | # MAGIC #### Early Stopping
42 | # MAGIC
43 | # MAGIC Early stopping monitors a metric of interest so you can avoid overfitting and wasting compute. In this example we monitor `val_accuracy` and end training early when the metric no longer shows improvement. You can dial the sensitivity of this behavior by adjusting the `patience` parameter.
44 | # MAGIC
45 | # MAGIC Add this as part of your `callbacks` parameter in the `fit()` method, as sketched below.
46 |
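# COMMAND ----------

# MAGIC %md
# MAGIC A minimal sketch, assuming the `model.fit()` setup from the training notebooks (the parameter values here are illustrative):

# COMMAND ----------

from tensorflow.keras.callbacks import EarlyStopping

early_stopping = EarlyStopping(
    monitor='val_accuracy',     # metric of interest to watch
    patience=3,                 # epochs with no improvement before stopping
    restore_best_weights=True)  # roll back to the weights from the best epoch

# model.fit(train_ds, validation_data=val_ds, epochs=EPOCHS, callbacks=[early_stopping])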
47 | # COMMAND ----------
48 |
49 | # MAGIC %md
50 | # MAGIC #### Larger Batch sizes
51 | # MAGIC
52 | # MAGIC You can improve training speed and potentially [boost model performance](https://arxiv.org/abs/2012.08795) by increasing the batch size during training. Larger batch sizes can help stabilize training but take more resources (memory + compute) to load and process. If you find you are underutilizing your cluster resources, this is a good knob to turn to increase utilization and speed up training, as sketched below.
53 |
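# COMMAND ----------

# MAGIC %md
# MAGIC A minimal sketch of this knob (the values are illustrative; `train_ds` refers to the datasets built in the training notebooks):

# COMMAND ----------

# A common heuristic when increasing the batch size is to scale the learning rate linearly with it
# (see https://arxiv.org/abs/1706.02677, the same reference behind the Horovod warmup callback earlier)
BASE_BATCH_SIZE = 32
BASE_LEARNING_RATE = 0.001

BATCH_SIZE = 256  # increased, e.g. to better utilize cluster memory/compute
LEARNING_RATE = BASE_LEARNING_RATE * (BATCH_SIZE / BASE_BATCH_SIZE)  # linear scaling heuristic

# train_ds = train_ds.unbatch().batch(BATCH_SIZE)  # re-batch the dataset to match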
54 | # COMMAND ----------
55 |
56 | # MAGIC %md
57 | # MAGIC #### Adapt Learning Rate
58 | # MAGIC
59 | # MAGIC A larger learning rate may help the model converge to a local minimum faster by allowing the model weights to change more dramatically between epochs. Although it *may* find a local minimum faster, this can lead to a suboptimal solution, or even derail training altogether if the model cannot converge.
60 | # MAGIC
61 | # MAGIC This parameter is a powerful one to experiment with, and machine learning practitioners are encouraged to do so! You can tune it by trial and error, or by implementing a scheduler that varies the learning rate with the duration of training or in response to observed training progress, as sketched below.
62 | # MAGIC
63 | # MAGIC More on the learning rate parameter can be found in this [reference](https://amzn.to/2NJW3gE).
64 |
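# COMMAND ----------

# MAGIC %md
# MAGIC A minimal sketch of a schedule-based approach, using Keras' `LearningRateScheduler` callback (the schedule itself is illustrative):

# COMMAND ----------

import tensorflow as tf

def lr_schedule(epoch: int, lr: float) -> float:
    """Hold the initial learning rate for the first 5 epochs, then decay it exponentially."""
    return lr if epoch < 5 else float(lr * tf.math.exp(-0.1))

lr_scheduler = tf.keras.callbacks.LearningRateScheduler(lr_schedule, verbose=1)

# model.fit(train_ds, validation_data=val_ds, epochs=EPOCHS, callbacks=[lr_scheduler])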
--------------------------------------------------------------------------------
/Part 2 - Distributed Tuning & Inference/00_setup.py:
--------------------------------------------------------------------------------
1 | # Databricks notebook source
2 | # Get the current user from the notebook context
3 | user = dbutils.notebook.entry_point.getDbutils().notebook().getContext().tags().apply('user')
4 |
5 | # Parse name from user
6 | my_name = user.split('@')[0].replace('.', '_')
7 |
8 | # Set config for database name, file paths, and table names
9 | database_name = f'distributed_dl_workshop_{my_name}'
10 |
11 | print(f'database_name: {database_name}')
12 |
13 | # COMMAND ----------
14 |
15 | # To log from Horovod worker nodes to the MLflow Tracking server, we need to supply the Databricks host and token
16 | DATABRICKS_HOST = 'https://' + dbutils.notebook.entry_point.getDbutils().notebook().getContext().tags().get('browserHostName').get()
17 | DATABRICKS_TOKEN = dbutils.notebook.entry_point.getDbutils().notebook().getContext().apiToken().get()
18 |
--------------------------------------------------------------------------------
/Part 2 - Distributed Tuning & Inference/01_hyperopt_single_machine_model.py:
--------------------------------------------------------------------------------
1 | # Databricks notebook source
2 | # MAGIC %md
3 | # MAGIC # 01. Tune Single-node Models with Hyperopt and Apache Spark
4 |
5 | # COMMAND ----------
6 |
7 | # MAGIC %md
8 | # MAGIC
9 | # MAGIC In this notebook:
10 | # MAGIC * We perform hyperparameter optimization, training multiple single node models in parallel using [Hyperopt](https://github.com/hyperopt/hyperopt)
11 | # MAGIC * We examine how to get the best performing model via the [MLflow Tracking](https://www.mlflow.org/docs/latest/tracking.html) API
12 | # MAGIC * We subsequently register the best performing model to [MLflow Model Registry](https://www.mlflow.org/docs/latest/model-registry.html)
13 |
14 | # COMMAND ----------
15 |
16 | # MAGIC %run ./00_setup
17 |
18 | # COMMAND ----------
19 |
20 | import numpy as np
21 | import tensorflow as tf
22 | from tensorflow.keras.applications.mobilenet_v2 import preprocess_input
23 | from tensorflow.keras.models import Sequential
24 | from tensorflow.keras.optimizers import Adam
25 |
26 | from hyperopt import hp
27 | from hyperopt import STATUS_OK
28 | from hyperopt import fmin, tpe, space_eval, SparkTrials
29 |
30 | import mlflow
31 | from mlflow.tracking import MlflowClient
32 |
33 | # COMMAND ----------
34 |
35 | # MAGIC %md
36 | # MAGIC
37 | # MAGIC ## i. Configs
38 |
39 | # COMMAND ----------
40 |
41 | # MAGIC %md
42 | # MAGIC
43 | # MAGIC Set global variables
44 |
45 | # COMMAND ----------
46 |
47 | IMG_HEIGHT = 224
48 | IMG_WIDTH = 224
49 | IMG_CHANNELS = 3
50 | NUM_CLASSES = 5
51 |
52 | BATCH_SIZE = 32
53 | EPOCHS = 3
54 |
55 | # COMMAND ----------
56 |
57 | # MAGIC %md
58 | # MAGIC
59 | # MAGIC ## ii. Load prepared data from Silver layer
60 |
61 | # COMMAND ----------
62 |
63 | cols_to_keep = ['content', 'label_idx']
64 |
65 | train_tbl_name = 'silver_train'
66 | val_tbl_name = 'silver_val'
67 |
68 | train_df = (spark.table(f'{database_name}.{train_tbl_name}')
69 | .select(cols_to_keep))
70 |
71 | val_df = (spark.table(f'{database_name}.{val_tbl_name}')
72 | .select(cols_to_keep))
73 |
74 |
75 | # Convert Spark DataFrame to pandas DataFrame
76 | train_pdf = train_df.toPandas()
77 | val_pdf = val_df.toPandas()
78 |
79 | # COMMAND ----------
80 |
81 | def preprocess(content: str, label_idx: int):
82 | """
83 | Preprocess raw image bytes for MobileNetV2 (ImageNet).
84 | """
85 | image = tf.image.decode_jpeg(content, channels=IMG_CHANNELS)
86 | image = tf.image.resize(image, [IMG_HEIGHT, IMG_WIDTH])
87 |
88 | return preprocess_input(image), label_idx
89 |
90 | # COMMAND ----------
91 |
92 | def build_model(dropout: float = 0.5) -> tf.keras.models.Sequential:
93 |
94 | base_model = tf.keras.applications.MobileNetV2(include_top=False,
95 | input_shape=(IMG_HEIGHT, IMG_WIDTH, IMG_CHANNELS))
96 |
97 | # Freeze base model layers
98 | for layer in base_model.layers:
99 | layer.trainable = False
100 |
101 | model = Sequential([
102 | base_model,
103 | tf.keras.layers.GlobalAveragePooling2D(),
104 | tf.keras.layers.Dropout(dropout),
105 | tf.keras.layers.Dense(NUM_CLASSES)
106 | ])
107 |
108 | return model
109 |
110 | # COMMAND ----------
111 |
112 | # MAGIC %md
113 | # MAGIC ## iii. Hyperopt Workflow
114 | # MAGIC
115 | # MAGIC Next, we will create the different pieces needed for parallelizing hyperparameter tuning with [Hyperopt](http://hyperopt.github.io/hyperopt/) and Apache Spark.
116 |
117 | # COMMAND ----------
118 |
119 | # MAGIC %md
120 | # MAGIC ### a. Create objective function
121 | # MAGIC
122 | # MAGIC First, we need to [create an **objective function**](http://hyperopt.github.io/hyperopt/getting-started/minimizing_functions/). This is the function that Hyperopt will call for each set of inputs.
123 | # MAGIC
124 | # MAGIC The basic requirements are:
125 | # MAGIC
126 | # MAGIC 1. An **input** `params` including hyperparameter values to use when training the model
127 | # MAGIC 2. An **output** containing a loss metric on which to optimize
128 | # MAGIC
129 | # MAGIC In this case, we specify values of `optimizer`, `learning_rate` and `dropout`, and return the validation accuracy (negated, since Hyperopt minimizes the loss) as our loss metric.
130 |
131 | # COMMAND ----------
132 |
133 | def objective_function(params):
134 |
135 | mlflow.autolog()
136 |
137 | # Create train tf.data.Dataset
138 | train_ds = (
139 | tf.data.Dataset.from_tensor_slices(
140 | (train_pdf['content'].values, train_pdf['label_idx'].values))
141 | .map(lambda content, label_idx: preprocess(content, label_idx))
142 | .batch(BATCH_SIZE)
143 | )
144 |
145 | # Create val tf.data.Dataset
146 | val_ds = (
147 | tf.data.Dataset.from_tensor_slices(
148 | (val_pdf['content'].values, val_pdf['label_idx'].values))
149 | .map(lambda content, label_idx: preprocess(content, label_idx))
150 | .batch(BATCH_SIZE)
151 | )
152 |
153 | # Select Optimizer
154 | optimizer_call = getattr(tf.keras.optimizers, params['optimizer'])
155 | optimizer = optimizer_call(learning_rate=params['learning_rate'])
156 |
157 | # Instantiate model
158 | model = build_model(dropout=params['dropout'])
159 |
160 | # Compile model
161 | model.compile(optimizer=optimizer,
162 | loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
163 | metrics=['accuracy'])
164 |
165 | # Specify training params. train_ds is already batched, so derive step counts from the sample counts
166 | steps_per_epoch = len(train_pdf) // BATCH_SIZE
167 | validation_steps = len(val_pdf) // BATCH_SIZE
168 |
169 | # Train the model
170 | model.fit(train_ds,
171 | steps_per_epoch = steps_per_epoch,
172 | epochs = EPOCHS,
173 | verbose = 1,
174 | validation_data = val_ds,
175 | validation_steps = validation_steps)
176 |
177 | _, accuracy = model.evaluate(val_ds)
178 |
179 | # Hyperopt will optimize to **minimize** the returned 'loss'
180 | # Given that we want to **maximize** accuracy, we return -accuracy
181 | return {'loss': -accuracy, 'status': STATUS_OK}
182 |
183 | # COMMAND ----------
184 |
185 | # MAGIC %md
186 | # MAGIC ### b. Define the search space
187 | # MAGIC
188 | # MAGIC Next, we need to [define the **search space**](http://hyperopt.github.io/hyperopt/getting-started/search_spaces/).
189 | # MAGIC
190 | # MAGIC Note that unlike grid search, we aren't enumerating the exact values to try. Hyperopt's TPE algorithm will intelligently suggest hyperparameter values from within these ranges.
191 |
192 | # COMMAND ----------
193 |
194 | search_space = {
195 | 'optimizer': hp.choice('optimizer', ['Adadelta', 'Adam']),
196 | 'learning_rate': hp.loguniform('learning_rate', -5, 0),
197 | 'dropout': hp.uniform('dropout', 0.1, 0.9)
198 | }
199 |
200 | # COMMAND ----------
201 |
202 | # MAGIC %md
203 | # MAGIC ### c. Call the `fmin` operation
204 | # MAGIC
205 | # MAGIC The `fmin` function is where we put Hyperopt to work.
206 | # MAGIC
207 | # MAGIC To make this work, we need:
208 | # MAGIC
209 | # MAGIC 1. The `objective_function`
210 | # MAGIC 2. The `search_space`
211 | # MAGIC 3. The `tpe.suggest` optimization algorithm
212 | # MAGIC 4. A `SparkTrials` object to distribute the trials across a cluster using Spark
213 | # MAGIC 5. The maximum number of evaluations or trials denoted by `max_evals`
214 | # MAGIC
215 | # MAGIC In this case, we'll be computing up to 20 trials with 4 trials being run concurrently.
216 | # MAGIC
217 | # MAGIC On Databricks, Hyperopt automatically logs its trials to MLflow under a single parent run, with each trial logged to a child run. Within the objective function we use MLflow autologging, so each trial's parameters, metrics and model artifacts are logged to its child run.
218 |
219 | # COMMAND ----------
220 |
221 | mlflow.set_experiment('/Users/' + user + '/distributed_dl_workshop')
222 |
223 | with mlflow.start_run(run_name='hyperopt_tuning') as hyperopt_mlflow_run:
224 |
225 | # The number of models we want to evaluate
226 | num_evals = 20
227 |
228 | # Set the number of models to be trained concurrently
229 | spark_trials = SparkTrials(parallelism=4)
230 |
231 | # Run the optimization process
232 | best_hyperparam = fmin(
233 | fn=objective_function,
234 | space=search_space,
235 | algo=tpe.suggest,
236 | trials=spark_trials,
237 | max_evals=num_evals
238 | )
239 |
240 | # Log optimal hyperparameter values (space_eval maps hp.choice indices back to actual values)
241 | best_hyperparam = space_eval(search_space, best_hyperparam)
242 | mlflow.log_param('optimizer', best_hyperparam['optimizer'])
243 | mlflow.log_param('learning_rate', best_hyperparam['learning_rate'])
244 | mlflow.log_param('dropout', best_hyperparam['dropout'])
244 |
245 | # COMMAND ----------
246 |
247 | # MAGIC %md
248 | # MAGIC
249 | # MAGIC ## iv. How to get the best run via MLflow Tracking
250 |
251 | # COMMAND ----------
252 |
253 | # Get the run_id of the parent run under which the Hyperopt trials were tracked
254 | parent_run_id = hyperopt_mlflow_run.info.run_id
255 |
256 | # Return all trials (tracked as child runs) as a pandas DataFrame, and order by accuracy descending
257 | hyperopt_trials_pdf = (mlflow.search_runs(filter_string=f'tags.mlflow.parentRunId="{parent_run_id}"',
258 | order_by=['metrics.accuracy DESC']))
259 |
260 | # The best trial will be the first row of this pandas DataFrame
261 | best_run = hyperopt_trials_pdf.iloc[0]
262 | best_run_id = best_run['run_id']
263 |
264 | # COMMAND ----------
265 |
266 | # MAGIC %md
267 | # MAGIC
268 | # MAGIC ## v. Register best run to MLflow Model Registry
269 |
270 | # COMMAND ----------
271 |
272 | # MAGIC %md
273 | # MAGIC
274 | # MAGIC We register our model to the MLflow Model Registry to use in our subsequent inference notebook.
275 |
276 | # COMMAND ----------
277 |
278 | # Define unique name for model in the Model Registry
279 | registry_model_name = my_name + '_flower_classifier'
280 |
281 | # Register the best run. Note that when we initially register it, the model will be in stage=None
282 | model_version = mlflow.register_model(f'runs:/{best_run_id}/model',
283 | name=registry_model_name)
284 |
285 | # COMMAND ----------
286 |
287 | # Transition model to stage='Production'
288 | client = MlflowClient()
289 | client.transition_model_version_stage(
290 | name=registry_model_name,
291 | version=model_version.version,
292 | stage='Production'
293 | )
294 |
295 | # COMMAND ----------
296 |
297 | # Load model from the production stage in MLflow Model Registry
298 | prod_model = mlflow.keras.load_model(f'models:/{registry_model_name}/production')
299 | prod_model.summary()
300 |
--------------------------------------------------------------------------------
/Part 2 - Distributed Tuning & Inference/02_hyperopt_distributed_model.py:
--------------------------------------------------------------------------------
1 | # Databricks notebook source
2 | # MAGIC %md
3 | # MAGIC # 02. Tune distributed models with Hyperopt and Apache Spark
4 |
5 | # COMMAND ----------
6 |
7 | # MAGIC %md
8 | # MAGIC
9 | # MAGIC In this notebook:
10 | # MAGIC * We perform hyperparameter optimization of a distributed DL training process using [Hyperopt](https://github.com/hyperopt/hyperopt)
11 | # MAGIC * We examine how to get the best performing model via the [MLflow Tracking](https://www.mlflow.org/docs/latest/tracking.html) API
12 | # MAGIC * We subsequently register the best performing model to [MLflow Model Registry](https://www.mlflow.org/docs/latest/model-registry.html)
13 |
14 | # COMMAND ----------
15 |
16 | # MAGIC %run ./00_setup
17 |
18 | # COMMAND ----------
19 |
20 | import os
21 | import time
22 | from typing import Tuple
23 | import numpy as np
24 |
25 | import tensorflow as tf
26 | from tensorflow.keras.applications.mobilenet_v2 import preprocess_input
27 | from tensorflow.keras.models import Sequential
28 | from tensorflow.keras.optimizers import Adam
29 | from tensorflow.keras.callbacks import ReduceLROnPlateau, ModelCheckpoint
30 |
31 | import horovod.tensorflow.keras as hvd
32 | from sparkdl import HorovodRunner
33 |
34 | from petastorm.spark import SparkDatasetConverter, make_spark_converter
35 |
36 | from hyperopt import hp
37 | from hyperopt import STATUS_OK
38 | from hyperopt import fmin, tpe, SparkTrials
39 |
40 | import mlflow
41 | from mlflow.tracking import MlflowClient
42 |
43 | # COMMAND ----------
44 |
45 | # MAGIC %md
46 | # MAGIC
47 | # MAGIC ## i. Configs
48 |
49 | # COMMAND ----------
50 |
51 | # MAGIC %md
52 | # MAGIC
53 | # MAGIC Set global variables
54 |
55 | # COMMAND ----------
56 |
57 | IMG_HEIGHT = 224
58 | IMG_WIDTH = 224
59 | IMG_CHANNELS = 3
60 | NUM_CLASSES = 5
61 |
62 | BATCH_SIZE = 32
63 | EPOCHS = 3
64 |
65 | # Define directory to save model checkpoints to
66 | checkpoint_dir = f'/dbfs/distributed_dl_workshop_{user}/train_ckpts/{time.time()}'
67 | dbutils.fs.mkdirs(checkpoint_dir.replace('/dbfs', 'dbfs:'))
68 |
69 | # Number of Horovod processes
70 | HVD_NUM_PROCESSES = 2
71 |
72 | # COMMAND ----------
73 |
74 | # MAGIC %md
75 | # MAGIC
76 | # MAGIC ## ii. Load prepared data from Silver layer
77 |
78 | # COMMAND ----------
79 |
80 | cols_to_keep = ['content', 'label_idx']
81 |
82 | train_tbl_name = 'silver_train'
83 | val_tbl_name = 'silver_val'
84 |
85 | train_df = (spark.table(f'{database_name}.{train_tbl_name}')
86 | .select(cols_to_keep))
87 |
88 | val_df = (spark.table(f'{database_name}.{val_tbl_name}')
89 | .select(cols_to_keep))
90 |
91 | # Ensure the number of partitions is at least the number of workers, which is required for distributed training.
92 | train_df = train_df.repartition(2)
93 | val_df = val_df.repartition(2)
94 |
95 | print('train_df count:', train_df.count())
96 | print('val_df count:', val_df.count())
97 |
98 | # COMMAND ----------
99 |
100 | # MAGIC %md
101 | # MAGIC ## iii. Convert the Spark DataFrame to a TensorFlow Dataset
102 |
103 | # COMMAND ----------
104 |
105 | dbutils.fs.rm(f'/tmp/distributed_dl_workshop_{user}/petastorm', recurse=True)
106 | spark.conf.set(SparkDatasetConverter.PARENT_CACHE_DIR_URL_CONF, f'file:///dbfs/tmp/distributed_dl_workshop_{user}/petastorm')
107 |
108 | train_petastorm_converter = make_spark_converter(train_df)
109 | val_petastorm_converter = make_spark_converter(val_df)
110 |
111 | train_size = len(train_petastorm_converter)
112 | val_size = len(val_petastorm_converter)
113 |
114 | # COMMAND ----------
115 |
116 | def build_model(dropout: float = 0.5) -> tf.keras.models.Sequential:
117 |
118 | base_model = tf.keras.applications.MobileNetV2(include_top=False,
119 | input_shape=(IMG_HEIGHT, IMG_WIDTH, IMG_CHANNELS))
120 |
121 | # Freeze base model layers
122 | for layer in base_model.layers:
123 | layer.trainable = False
124 |
125 | model = Sequential([
126 | base_model,
127 | tf.keras.layers.GlobalAveragePooling2D(),
128 | tf.keras.layers.Dropout(dropout),
129 | tf.keras.layers.Dense(NUM_CLASSES)
130 | ])
131 |
132 | return model
133 |
134 | # COMMAND ----------
135 |
136 | def preprocess(content: bytes, label_idx: int):
137 |     """
138 |     Decode and resize raw image bytes, then apply MobileNetV2 (ImageNet) preprocessing.
139 |     """
140 | image = tf.image.decode_jpeg(content, channels=IMG_CHANNELS)
141 | image = tf.image.resize(image, [IMG_HEIGHT, IMG_WIDTH])
142 |
143 | return preprocess_input(image), label_idx
144 |
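145 | # COMMAND ----------
146 | 
147 | # MAGIC %md
148 | # MAGIC As an optional check (a hedged sketch, assuming `train_df` holds at least one row), we can run `preprocess` eagerly on a single image and confirm the output shape matches the model input.
149 | 
150 | # COMMAND ----------
151 | 
152 | # Eagerly preprocess one sample image to verify the output shape
153 | sample_row = train_df.limit(1).collect()[0]
154 | sample_img, sample_lbl = preprocess(bytes(sample_row['content']), sample_row['label_idx'])
155 | print(sample_img.shape, sample_lbl)  # expect (224, 224, 3)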
145 | # COMMAND ----------
146 |
147 | # MAGIC %md
148 | # MAGIC ## iv. Define training function for HorovodRunner
149 | # MAGIC
150 | # MAGIC The following creates the training function. This function:
151 | # MAGIC - Takes as its arguments the hyperparameters for tuning
152 | # MAGIC - Configures Horovod
153 | # MAGIC - Loads the training and validation datasets using Petastorm
154 | # MAGIC - Defines the model and distributed optimizer
155 | # MAGIC - Compiles the model
156 | # MAGIC - Fits the model
157 | # MAGIC - Returns the loss and accuracy on the validation set
158 |
159 | # COMMAND ----------
160 |
161 | def train_and_evaluate_hvd(learning_rate: float,
162 | dropout: float,
163 | batch_size: int,
164 | checkpoint_dir: str) -> Tuple[float, float]:
165 | """
166 | Function to pass to Horovod to be executed on each worker.
167 | Args are those hyperparameters we want to tune with Hyperopt.
168 | """
169 | # Initialize Horovod.
170 | hvd.init()
171 |
172 |     # Enable tracking from the Spark workers to the Databricks-hosted MLflow Tracking server
173 |     mlflow.set_tracking_uri('databricks')
174 | os.environ['DATABRICKS_HOST'] = DATABRICKS_HOST
175 | os.environ['DATABRICKS_TOKEN'] = DATABRICKS_TOKEN
176 |
177 | # Horovod: pin GPU to be used to process local rank (one GPU per process)
178 | gpus = tf.config.experimental.list_physical_devices('GPU')
179 | for gpu in gpus:
180 | tf.config.experimental.set_memory_growth(gpu, True)
181 | if gpus:
182 | tf.config.experimental.set_visible_devices(gpus[hvd.local_rank()], 'GPU')
183 |
184 | # Creating model
185 | model = build_model(dropout=dropout)
186 |
187 | # Horovod: adjust learning rate based on number of workers.
188 | optimizer = tf.keras.optimizers.Adam(learning_rate=learning_rate * hvd.size())
189 | dist_optimizer = hvd.DistributedOptimizer(optimizer)
190 |
191 | callbacks = [
192 | hvd.callbacks.BroadcastGlobalVariablesCallback(0),
193 | hvd.callbacks.MetricAverageCallback(),
194 | hvd.callbacks.LearningRateWarmupCallback(initial_lr=learning_rate*hvd.size(),
195 | warmup_epochs=5,
196 | verbose=1),
197 | # Reduce the learning rate if training plateaus.
198 | ReduceLROnPlateau(patience=10, verbose=1)
199 | ]
200 |
201 | model.compile(optimizer=dist_optimizer,
202 | loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
203 | metrics=['accuracy'],
204 | experimental_run_tf_function=False)
205 |
206 | # Save checkpoints only on worker 0 to prevent conflicts between workers
207 | param_str = f'learning_rate_{learning_rate}_dropout_{dropout}_batch_size_{batch_size}'
208 | if hvd.rank() == 0:
209 | checkpoint_dir_for_this_trial = os.path.join(checkpoint_dir, param_str)
210 | local_ckpt_path = os.path.join(checkpoint_dir_for_this_trial, 'checkpoint-{epoch}.ckpt')
211 | callbacks.append(ModelCheckpoint(local_ckpt_path, save_weights_only=True))
212 |
213 | with train_petastorm_converter.make_tf_dataset(batch_size=batch_size,
214 | cur_shard=hvd.rank(),
215 | shard_count=hvd.size()) as train_ds,\
216 | val_petastorm_converter.make_tf_dataset(batch_size=batch_size,
217 | cur_shard=hvd.rank(),
218 | shard_count=hvd.size()) as val_ds:
219 |
220 | train_ds = train_ds.unbatch().map(lambda x: (x.content, x.label_idx))
221 | val_ds = val_ds.unbatch().map(lambda x: (x.content, x.label_idx))
222 |
223 | train_ds = (train_ds
224 | .map(lambda content, label_idx: preprocess(content, label_idx))
225 | .batch(batch_size))
226 | val_ds = (val_ds
227 | .map(lambda content, label_idx: preprocess(content, label_idx))
228 | .batch(batch_size))
229 |
230 | steps_per_epoch = train_size // (batch_size * hvd.size())
231 | validation_steps = max(1, val_size // (batch_size * hvd.size()))
232 |
233 | hist = model.fit(train_ds,
234 | steps_per_epoch=steps_per_epoch,
235 | epochs=EPOCHS,
236 | verbose=1,
237 | validation_data=val_ds,
238 | validation_steps=validation_steps)
239 |
240 | # MLflow Tracking (Log only from Worker 0)
241 | if hvd.rank() == 0:
242 |         # Log to a child run nested under the active parent run
243 |         # MLFLOW_PARENT_RUN_ID is defined below, prior to kicking off the HPO run
244 | with mlflow.start_run(run_id=MLFLOW_PARENT_RUN_ID) as run:
245 | with mlflow.start_run(experiment_id=run.info.experiment_id,
246 | run_name=param_str,
247 | nested=True):
248 | # Log MLflow Parameters
249 | mlflow.log_param('epochs', EPOCHS)
250 | mlflow.log_param('batch_size', batch_size)
251 | mlflow.log_param('learning_rate', learning_rate)
252 | mlflow.log_param('dropout', dropout)
253 | mlflow.log_param('checkpoint_dir', checkpoint_dir_for_this_trial)
254 |
255 | # Log MLflow Metrics
256 | mlflow.log_metric('val_loss', hist.history['val_loss'][-1])
257 | mlflow.log_metric('val_accuracy', hist.history['val_accuracy'][-1])
258 |
259 | # Log Model
260 | mlflow.keras.log_model(model, 'model')
261 |
262 | return hist.history['val_loss'][-1], hist.history['val_accuracy'][-1]
263 |
264 | # COMMAND ----------
265 |
266 | # MAGIC %md
267 | # MAGIC ## v. Hyperopt Workflow
268 | # MAGIC
269 | # MAGIC In the following section we create the Hyperopt workflow to tune our distributed DL training process. We will:
270 | # MAGIC * Define an objective function to minimize
271 | # MAGIC * Define a search space over hyperparameters
272 | # MAGIC * Specify the search algorithm and use `fmin()` to tune the model
273 | # MAGIC
274 | # MAGIC For more information about the Hyperopt APIs, see the [Hyperopt docs](https://github.com/hyperopt/hyperopt/wiki/FMin).
275 |
276 | # COMMAND ----------
277 |
278 | # MAGIC %md
279 | # MAGIC ### Define objective function
280 | # MAGIC
281 | # MAGIC First, we need to [create an **objective function**](http://hyperopt.github.io/hyperopt/getting-started/minimizing_functions/). This is the function that Hyperopt will call for each set of inputs.
282 | # MAGIC
283 | # MAGIC The basic requirements are:
284 | # MAGIC
285 | # MAGIC 1. An **input** `params` including hyperparameter values to use when training the model
286 | # MAGIC 2. An **output** containing a loss metric on which to optimize
287 | # MAGIC
288 | # MAGIC In this case, we are specifying values of `learning_rate`, `dropout` and `batch_size`, and returning the validation loss as the loss metric to minimize.
289 | # MAGIC
290 | # MAGIC Our objective function will use the training function defined for `HorovodRunner` and run distributed training to fit a model and compute its loss. To run this example on a cluster with 2 workers, each with a single GPU, initialize `HorovodRunner` with `np=2`.
291 |
292 | # COMMAND ----------
293 |
294 | def objective_function(params):
295 | """
296 | An example train method that calls into HorovodRunner.
297 | This method is passed to hyperopt.fmin().
298 |
299 |     :param params: hyperparameters. Its structure is consistent with how the search space is defined. See below.
300 | :return: dict with fields 'loss' (scalar loss) and 'status' (success/failure status of run)
301 | """
302 | hr = HorovodRunner(np=HVD_NUM_PROCESSES)
303 | loss, acc = hr.run(train_and_evaluate_hvd,
304 | learning_rate=params['learning_rate'],
305 | dropout=params['dropout'],
306 | batch_size=params['batch_size'],
307 | checkpoint_dir=checkpoint_dir)
308 |
309 | return {'loss': loss, 'status': STATUS_OK}
310 |
311 | # COMMAND ----------
312 |
313 | # MAGIC %md
314 | # MAGIC ### Define the search space
315 | # MAGIC
316 | # MAGIC Next, we need to [define the **search space**](http://hyperopt.github.io/hyperopt/getting-started/search_spaces/).
317 | # MAGIC
318 | # MAGIC This example tunes 3 hyperparameters: learning rate, dropout and batch size. See the [Hyperopt docs](https://github.com/hyperopt/hyperopt/wiki/FMin#21-parameter-expressions) for details on defining a search space and parameter expressions.
319 |
320 | # COMMAND ----------
321 |
322 | search_space = {
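323 |     # hp.loguniform('learning_rate', -5, 0) samples exp(uniform(-5, 0)), i.e. roughly 0.0067 to 1.0
324 |     # hp.choice returns the *index* of the selected option; we resolve it via space_eval() below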
323 | 'learning_rate': hp.loguniform('learning_rate', -5, 0),
324 | 'dropout': hp.uniform('dropout', 0.1, 0.9),
325 | 'batch_size': hp.choice('batch_size', [32, 64, 128])
326 | }
327 |
328 | # COMMAND ----------
329 |
330 | # MAGIC %md
331 | # MAGIC
332 | # MAGIC ### Tune the model using Hyperopt `fmin()`
333 | # MAGIC
334 | # MAGIC - Set `max_evals` to the maximum number of points in hyperparameter space to test, that is, the maximum number of models to fit and evaluate. Because this command evaluates many models, it will take several minutes to execute.
335 | # MAGIC - You must also specify which search algorithm to use. The two main choices are:
336 | # MAGIC - `hyperopt.tpe.suggest`: Tree of Parzen Estimators, a Bayesian approach which iteratively and adaptively selects new hyperparameter settings to explore based on previous results
337 | # MAGIC - `hyperopt.rand.suggest`: Random search, a non-adaptive approach that randomly samples the search space
338 |
339 | # COMMAND ----------
340 |
341 | # MAGIC %md
342 | # MAGIC **Important:** Trials (each a distributed training process) are executed in a *sequential* manner. As such, when using Hyperopt with HorovodRunner, do not pass a `trials` argument to `fmin()`.
343 | # MAGIC
344 | # MAGIC When you do not include the `trials` argument, Hyperopt uses the default `Trials` class, which runs on the cluster driver. Hyperopt needs to evaluate each trial on the driver node so that each trial can initiate distributed training jobs. The `SparkTrials` class is incompatible with distributed training jobs, as it evaluates each trial on one worker node.
345 |
346 | # COMMAND ----------
347 |
348 | # Start a parent MLflow run
349 | mlflow_exp_path = '/Users/' + user + '/distributed_dl_workshop'
350 | mlflow.set_experiment(mlflow_exp_path)
351 |
352 | with mlflow.start_run(run_name='hyperopt_horovod_tuning') as hyperopt_mlflow_run:
353 |
354 | MLFLOW_PARENT_RUN_ID = hyperopt_mlflow_run.info.run_id
355 |
356 | # The number of models we want to evaluate
357 | max_evals = 4
358 |
359 | # Run the optimization process
360 | best_hyperparam = fmin(
361 | fn=objective_function,
362 | space=search_space,
363 | algo=tpe.suggest,
364 | max_evals=max_evals,
365 | )
366 |
367 |     # fmin() returns the *index* for hp.choice parameters (e.g. batch_size),
368 |     # so resolve indices back to the actual values with space_eval()
369 |     best_hyperparam = space_eval(search_space, best_hyperparam)
370 | 
371 |     # Log optimal hyperparameter values
372 |     mlflow.log_param('learning_rate', best_hyperparam['learning_rate'])
373 |     mlflow.log_param('dropout', best_hyperparam['dropout'])
374 |     mlflow.log_param('batch_size', best_hyperparam['batch_size'])
371 |
372 | # COMMAND ----------
373 |
374 | # Print out the parameters that produced the best model
375 | print(best_hyperparam)
376 |
377 | # COMMAND ----------
378 |
379 | # Display the contents of the checkpoint directory
380 | dbutils.fs.ls(checkpoint_dir.replace('/dbfs', 'dbfs:'))
381 |
382 | # COMMAND ----------
383 |
384 | # MAGIC %md
385 | # MAGIC
386 | # MAGIC ## vi. Get the best run via MLflow Tracking
387 |
388 | # COMMAND ----------
389 |
390 | # Get the run_id of the parent run under which the Hyperopt trials were tracked
391 | parent_run_id = hyperopt_mlflow_run.info.run_id
392 |
393 | # Return all trials (tracked as child runs) as a pandas DataFrame, ordered by validation accuracy descending
394 | hyperopt_trials_pdf = (mlflow.search_runs(filter_string=f'tags.mlflow.parentRunId="{parent_run_id}"',
395 |                                           order_by=['metrics.val_accuracy DESC']))
396 |
397 | # The best trial will be the first row of this pandas DataFrame
398 | best_run = hyperopt_trials_pdf.iloc[0]
399 | best_run_id = best_run['run_id']
400 |
401 | print(f'MLflow run_id of best trial: {best_run_id}')
402 |
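403 | # COMMAND ----------
404 | 
405 | # MAGIC %md
406 | # MAGIC Before registering, we can optionally load the best trial's model back from MLflow as a quick smoke test (a minimal sketch; the `model` artifact path matches what was logged from worker 0 above).
407 | 
408 | # COMMAND ----------
409 | 
410 | # Load the best trial's Keras model from the MLflow Tracking server
411 | best_model = mlflow.keras.load_model(f'runs:/{best_run_id}/model')
412 | best_model.summary()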
403 | # COMMAND ----------
404 |
405 | # MAGIC %md
406 | # MAGIC
407 | # MAGIC ## vii. Register best run to the MLflow Model Registry
408 |
409 | # COMMAND ----------
410 |
411 | # MAGIC %md
412 | # MAGIC
413 | # MAGIC We register our model to the MLflow Model Registry for use in the subsequent inference notebook.
414 |
415 | # COMMAND ----------
416 |
417 | # Define unique name for model in the Model Registry
418 | registry_model_name = my_name + '_flower_classifier'
419 |
420 | # Register the best run. Note that when we initially register it, the model will be in stage=None
421 | model_version = mlflow.register_model(f'runs:/{best_run_id}/model',
422 | name=registry_model_name)
423 |
424 | # COMMAND ----------
425 |
426 | # Transition model to stage='Production'
427 | client = MlflowClient()
428 | client.transition_model_version_stage(
429 | name=registry_model_name,
430 | version=model_version.version,
431 | stage='Production'
432 | )
433 |
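434 | # COMMAND ----------
435 | 
436 | # MAGIC %md
437 | # MAGIC Once the transition completes, downstream consumers can load the model by name and stage via a `models:/` URI. A minimal sketch:
438 | 
439 | # COMMAND ----------
440 | 
441 | # Load the Production model from the Model Registry by name and stage
442 | production_model = mlflow.keras.load_model(f'models:/{registry_model_name}/Production')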
--------------------------------------------------------------------------------
/Part 2 - Distributed Tuning & Inference/03_pyfunc_distributed_inference.py:
--------------------------------------------------------------------------------
1 | # Databricks notebook source
2 | # MAGIC %md
3 | # MAGIC # 03. Packaging an MLflow PyFunc and distributed inference
4 |
5 | # COMMAND ----------
6 |
7 | # MAGIC %md
8 | # MAGIC
9 | # MAGIC In this notebook we examine how to use our trained model to perform inference in a distributed manner. We will do so by wrapping our `model.predict()` method in a PySpark UDF.
10 | # MAGIC
11 | # MAGIC In order to accomplish this we will package our image preprocessing and model prediction functionality as an [MLflow pyfunc model](https://www.mlflow.org/docs/latest/python_api/mlflow.pyfunc.html). Packaging the model in this format must be done at model training time.
12 | # MAGIC
13 | # MAGIC Thus, in this notebook we will:
14 | # MAGIC
15 | # MAGIC - Define functions to enable model training, logging the model as a [pyfunc model flavour](https://www.mlflow.org/docs/latest/python_api/mlflow.pyfunc.html).
16 | # MAGIC - Training will be executed on a single (driver) node, using Petastorm to feed data to the training process.
17 | # MAGIC - Model parameters, metrics and artifacts will be tracked to MLflow.
18 | # MAGIC - Inference logic will be packaged into the logged MLflow pyfunc model.
19 | # MAGIC - Execute model training on a single node.
20 | # MAGIC - Load our model for inference.
21 | # MAGIC - First testing against a pandas DataFrame in a single node setting.
22 | # MAGIC - Then applying our model against a Spark DataFrame in a distributed manner.
23 |
24 | # COMMAND ----------
25 |
26 | # MAGIC %run ./00_setup
27 |
28 | # COMMAND ----------
29 |
30 | import io
31 | import os
32 | import json
33 | import ast
34 | from typing import Tuple
35 | from PIL import Image
36 |
37 | import numpy as np
38 | import pandas as pd
39 |
40 | import tensorflow as tf
41 | from tensorflow.keras.applications.mobilenet_v2 import preprocess_input
42 | from tensorflow.keras.models import Sequential
43 | from tensorflow.keras.optimizers import Adam
44 | from tensorflow.keras.callbacks import ReduceLROnPlateau, ModelCheckpoint
45 |
46 | import mlflow
47 |
48 | from petastorm.spark import SparkDatasetConverter, make_spark_converter
49 |
50 | # COMMAND ----------
51 |
52 | # MAGIC %md
53 | # MAGIC
54 | # MAGIC ### Set global variables
55 |
56 | # COMMAND ----------
57 |
58 | IMG_HEIGHT = 224
59 | IMG_WIDTH = 224
60 | IMG_CHANNELS = 3
61 |
62 | CLASSES = ['daisy', 'dandelion', 'roses', 'sunflowers', 'tulips']
63 |
64 | BATCH_SIZE = 128
65 | EPOCHS = 3
66 |
67 | # COMMAND ----------
68 |
69 | # MAGIC %md
70 | # MAGIC
71 | # MAGIC ## Training functions
72 | # MAGIC
73 | # MAGIC Before executing the training process, we define the various classes and functions to enable this.
74 |
75 | # COMMAND ----------
76 |
77 | # MAGIC %md
78 | # MAGIC
79 | # MAGIC ### Define dataclass for data config at training time
80 | # MAGIC
81 | # MAGIC We use the following [`dataclass`](https://docs.python.org/3/library/dataclasses.html) to define the data inputs and cache directory for our training process. An instance of this class will be defined prior to being passed into the main train function.
82 |
83 | # COMMAND ----------
84 |
85 | from dataclasses import dataclass
86 |
87 | @dataclass
88 | class DataCfg:
89 | """
90 | Class representing data inputs to the training process
91 | """
92 | train_tbl_name: str
93 | validation_tbl_name: str
94 | petastorm_cache_dir: str = f'/tmp/distributed_dl_workshop_{user}/petastorm'
95 |
96 | # COMMAND ----------
97 |
98 | # MAGIC %md
99 | # MAGIC
100 | # MAGIC ### Define preprocessing to decode and resize images
101 | # MAGIC
102 | # MAGIC The following `preprocess` function is used to decode and resize the binary image inputs in the training pipeline. Note that these same preprocessing steps must be applied to raw binary image inputs at inference time. We will see below how we can package this up as a [custom MLflow pyfunc model](https://www.mlflow.org/docs/latest/python_api/mlflow.pyfunc.html#creating-custom-pyfunc-models).
103 |
104 | # COMMAND ----------
105 |
106 | def preprocess(content: bytes, label_idx: int):
107 |     """
108 |     Decode and resize raw image bytes, then apply MobileNetV2 (ImageNet) preprocessing.
109 |     """
110 | image = tf.image.decode_jpeg(content, channels=IMG_CHANNELS)
111 | image = tf.image.resize(image, [IMG_HEIGHT, IMG_WIDTH])
112 |
113 | return preprocess_input(image), label_idx
114 |
115 | # COMMAND ----------
116 |
117 | # MAGIC %md
118 | # MAGIC
119 | # MAGIC ### Define model architecture
120 | # MAGIC
121 | # MAGIC We define the same model architecture as that used throughout the rest of the examples.
122 |
123 | # COMMAND ----------
124 |
125 | def build_model() -> tf.keras.models.Sequential:
126 | """
127 | Function to define model architecture
128 | """
129 | base_model = tf.keras.applications.MobileNetV2(include_top=False,
130 | input_shape=(IMG_HEIGHT, IMG_WIDTH, IMG_CHANNELS))
131 |
132 | # Freeze base model layers
133 | for layer in base_model.layers:
134 | layer.trainable = False
135 |
136 | model = Sequential([
137 | base_model,
138 | tf.keras.layers.GlobalAveragePooling2D(),
139 | tf.keras.layers.Dropout(0.5),
140 | tf.keras.layers.Dense(len(CLASSES))
141 | ])
142 |
143 | return model
144 |
145 | # COMMAND ----------
146 |
147 | # MAGIC %md
148 | # MAGIC
149 | # MAGIC ### Define custom MLflow PyFunc
150 | # MAGIC
151 | # MAGIC Previously we saw how to train our `tf.keras` model and track it to MLflow. That logged artifact was simply the trained model. To use it at inference time we must apply the same preprocessing steps, and additionally translate the model output into predicted class labels.
152 | # MAGIC
153 | # MAGIC We can leverage custom Pyfunc models in MLflow to package our custom preprocessing and postprocessing routines. When we load the model at inference time the model artifacts declared in the `load_context` method are loaded, and the custom inference logic within the `predict` method is applied to the inference dataset.
154 |
155 | # COMMAND ----------
156 |
157 | class FlowerPyFunc(mlflow.pyfunc.PythonModel):
158 | """
159 |     Custom PyFunc class to enable prediction with a tf.keras model against binary encoded images
160 | """
161 | def load_context(self, context):
162 | """
163 | Loads artifacts to be used by `PythonModel.predict` when evaluating inputs.
164 | When loading an MLflow model with `.load_model`, this method is called as soon as the
165 | `PythonModel` is constructed.
166 |
167 | Args:
168 | context (`mlflow.pyfunc.PythonModelContext`):
169 | A `PythonModelContext` instance containing artifacts that the model can use to
170 | perform inference.
171 | """
172 | # Path to logged image params dict
173 | img_params_dict_path = context.artifacts['img_params_dict_path']
174 | with open(img_params_dict_path) as f:
175 | img_params_dict = json.load(f)
176 |
177 | # Set image params
178 | self.img_height = img_params_dict['img_height']
179 | self.img_width = img_params_dict['img_width']
180 |
181 | # Path to logged model
182 | keras_model_path = context.artifacts['keras_model_path']
183 | # Load Keras model
184 | self.model = mlflow.keras.load_model(keras_model_path)
185 |
186 | def predict(self, context, model_input: pd.Series) -> np.array:
187 | """
188 |         Take a pd.Series of binary encoded images (the BinaryType 'content' column from the binary file data source),
189 | preprocess images, and apply model
190 |
191 | See https://docs.databricks.com/data/data-sources/binary-file.html#binary-file for info on data source
192 |
193 | Args:
194 | context:
195 | instance containing artifacts that the model can use to perform inference
196 | model_input (pd.Series):
197 | pd.Series of binary encoded images
198 |
199 | Returns:
200 | np.array of predicted class labels
201 | """
202 | model_input_np = np.array(model_input)
203 | # Preprocess binary images
204 | preprocessed_arr = np.array(list(map(self.preprocess, model_input_np)))
205 | # Predictions
206 | pred_arr = self.model.predict(preprocessed_arr, batch_size=BATCH_SIZE)
207 | # Predicted indices
208 | pred_idx = np.argmax(pred_arr, axis=1)
209 | # Predicted class labels
210 | pred_classes = np.take(CLASSES, pred_idx, axis=0)
211 |
212 | return pred_classes
213 |
214 | def preprocess(self, img_bytes: bytes) -> np.array:
215 | """
216 | Take a single binary encoded image, decode image to numpy array
217 | and resize according to image height and size set in config.
218 |
219 | Args:
220 | img_bytes (bytes):
221 | A single binary encoded image.
222 |
223 | Returns:
224 |             np.array of shape [img_height, img_width, 3].
225 | """
226 |         # When the pyfunc is applied as a Spark UDF, the bytes may arrive as a str representation;
227 |         # we require the raw bytes format, so convert back if necessary
228 |         if not isinstance(img_bytes, bytes):
229 |             img_bytes = ast.literal_eval(str(img_bytes[0]))
230 |
231 | img = Image.open(io.BytesIO(img_bytes)).convert('RGB')
232 | img = img.resize([self.img_height, self.img_width])
233 |
234 | return np.asarray(img, dtype='float32')
235 |
236 | # COMMAND ----------
237 |
238 | # MAGIC %md
239 | # MAGIC
240 | # MAGIC ### Define training function
241 | # MAGIC
242 | # MAGIC The following function encapsulates the full training pipeline:
243 | # MAGIC
244 | # MAGIC - Creating the MLflow run to log parameters, metrics and artifacts to
245 | # MAGIC - Setting up the Petastorm converter and data access process
246 | # MAGIC - Defining, compiling and fitting the model (single node training)
247 | # MAGIC - Logging the custom Pyfunc model with packaged model inference logic
248 | # MAGIC - Evaluating model performance against the validation dataset
249 | # MAGIC - Returning the MLflow run object, along with model and model history object
250 |
251 | # COMMAND ----------
252 |
253 | def train_model_petastorm_data_ingest(data_cfg: DataCfg,
254 | compile_kwargs: dict,
255 | fit_kwargs: dict) -> Tuple[mlflow.tracking.fluent.Run,
256 | tf.keras.Model,
257 | tf.keras.callbacks.History]:
258 | """
259 | Function to train a tf.keras model as defined by `build_model()`, on a single node.
260 | Data is fed to the training process using Petastorm, with all params, metrics and model artifacts
261 | logged to MLflow. The Petastorm converter objects are deleted upon completion of model evaluation.
262 |
263 | Args:
264 | data_cfg (DataCfg):
265 | Data class representing data inputs to the training process
266 | compile_kwargs (dict):
267 | Compile kwargs passed to model.compile()
268 | fit_kwargs (dict):
269 |             Model fit kwargs passed to model.fit(). Note that the batch_size defined in this dict is used
270 |             to define the Petastorm data access process rather than being passed explicitly into model.fit().
271 |
272 | Returns:
273 | Tuple of MLflow Run object (mlflow.tracking.fluent.Run), trained tf.keras model (tf.keras.Model),
274 | and fitted model history object (tf.keras.callbacks.History)
275 | """
276 | # Enable autologging
277 | mlflow.autolog()
278 |
279 | with mlflow.start_run(run_name='pyfunc_model_petastorm') as mlflow_run:
280 |
281 | #####################
282 | # LOG RUN PARAMS
283 | #####################
284 | # Log img_height and img_width for use in PyFunc (required for preprocessing)
285 | img_params_dict = {'img_height': IMG_HEIGHT,
286 | 'img_width': IMG_WIDTH}
287 | mlflow.log_dict(img_params_dict, 'img_params_dict.json')
288 | mlflow.log_param('BATCH_SIZE', fit_kwargs['batch_size'])
289 |
290 | #####################
291 | # DATA LOADING
292 | #####################
293 | print('DATA LOADING')
294 | # Load and repartition train and validation datasets
295 |         # Hard-code the number of partitions; for distributed training this should be at least the number of workers.
296 | train_df = spark.table(data_cfg.train_tbl_name).select('content', 'label_idx').repartition(2)
297 | val_df = spark.table(data_cfg.validation_tbl_name).select('content', 'label_idx').repartition(2)
298 |
299 | # Set directory to use for the Petastorm cache
300 | dbutils.fs.rm(data_cfg.petastorm_cache_dir, recurse=True)
301 | spark.conf.set(SparkDatasetConverter.PARENT_CACHE_DIR_URL_CONF, f'file:///dbfs{data_cfg.petastorm_cache_dir}')
302 |
303 | # Create Petastorm Spark Converters for train and validation datasets
304 | print('Creating Petastorm Spark Converters')
305 | train_petastorm_converter = make_spark_converter(train_df)
306 | val_petastorm_converter = make_spark_converter(val_df)
307 | train_size = len(train_petastorm_converter)
308 | val_size = len(val_petastorm_converter)
309 |
310 | with train_petastorm_converter.make_tf_dataset(batch_size=fit_kwargs['batch_size']) as train_ds,\
311 | val_petastorm_converter.make_tf_dataset(batch_size=fit_kwargs['batch_size']) as val_ds:
312 |
313 | #####################
314 | # DATA PREPROCESSING
315 | #####################
316 | print('DATA PREPROCESSING')
317 |             # Transform the datasets: map our preprocess() function over records and then batch()
318 | train_ds = train_ds.unbatch().map(lambda x: (x.content, x.label_idx))
319 | val_ds = val_ds.unbatch().map(lambda x: (x.content, x.label_idx))
320 |
321 | train_ds = (train_ds
322 | .map(lambda content, label_idx: preprocess(content, label_idx))
323 | .batch(fit_kwargs['batch_size']))
324 | val_ds = (val_ds
325 | .map(lambda content, label_idx: preprocess(content, label_idx))
326 | .batch(fit_kwargs['batch_size']))
327 |
328 | #####################
329 | # MODEL TRAINING
330 | #####################
331 | print('MODEL TRAINING')
332 | model = build_model()
333 | model.compile(**compile_kwargs)
334 |
335 | steps_per_epoch = train_size // fit_kwargs['batch_size']
336 | validation_steps = val_size // fit_kwargs['batch_size']
337 |
338 |             history = model.fit(train_ds,
339 |                                 steps_per_epoch=steps_per_epoch,
340 |                                 epochs=fit_kwargs['epochs'],
341 |                                 verbose=fit_kwargs['verbose'],
342 |                                 callbacks=fit_kwargs.get('callbacks', []),
343 |                                 validation_data=val_ds,
344 |                                 validation_steps=validation_steps)
344 | print('MODEL FIT COMPLETE')
345 |
346 | #####################
347 | # PyFunc
348 | #####################
349 | # Image params dict
350 | img_params_dict_path = f'runs:/{mlflow_run.info.run_id}/img_params_dict.json'
351 | # Keras model
352 | keras_model_path = f'runs:/{mlflow_run.info.run_id}/model'
353 |
354 | pyfunc_artifacts = {
355 | 'img_params_dict_path': img_params_dict_path,
356 | 'keras_model_path': keras_model_path,
357 | }
358 | # Log MLflow pyfunc Model
359 | print('LOGGING PYFUNC')
360 | mlflow.pyfunc.log_model('pyfunc_model',
361 | python_model=FlowerPyFunc(),
362 | artifacts=pyfunc_artifacts
363 | )
364 |
365 | #####################
366 | # MODEL EVALUATION
367 | #####################
368 | print('MODEL EVALUATION')
369 | val_results = model.evaluate(val_ds, steps=validation_steps)
370 | val_results_dict = dict(zip(model.metrics_names, val_results))
371 |             mlflow.log_metrics({f'val_{metric}': value for metric, value in val_results_dict.items()})
372 |
373 | print('DELETING PETASTORM CONVERTER')
374 | train_petastorm_converter.delete()
375 | val_petastorm_converter.delete()
376 |
377 | return mlflow_run, model, history
378 |
379 | # COMMAND ----------
380 |
381 | # MAGIC %md
382 | # MAGIC
383 | # MAGIC ## Model training
384 | # MAGIC
385 | # MAGIC Let's now define our training data config class, along with our compile and fit arguments and trigger model training.
386 |
387 | # COMMAND ----------
388 |
389 | data_cfg = DataCfg(train_tbl_name=f'{database_name}.silver_train',
390 | validation_tbl_name=f'{database_name}.silver_val')
391 |
392 | compile_kwargs = {'optimizer': tf.keras.optimizers.Adam(learning_rate=0.001),
393 | 'loss': tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
394 | 'metrics': ['accuracy']}
395 |
396 | # Define fit args
397 | early_stopping = tf.keras.callbacks.EarlyStopping(monitor='val_loss',
398 | min_delta=1e-2,
399 | patience=3,
400 | verbose=1)
401 | callbacks = [early_stopping]
402 | fit_kwargs = {'batch_size': BATCH_SIZE,
403 |               'epochs': 2,  # Run for 2 epochs for demonstration purposes
404 |               'callbacks': callbacks,
405 |               'verbose': 1}
405 |
406 |
407 | mlflow_run, _, _ = train_model_petastorm_data_ingest(data_cfg=data_cfg,
408 | compile_kwargs=compile_kwargs,
409 | fit_kwargs=fit_kwargs)
410 |
411 | # COMMAND ----------
412 |
413 | # MAGIC %md
414 | # MAGIC
415 | # MAGIC ## Inference
416 | # MAGIC
417 | # MAGIC We now look at applying our model to some inference data. For demo purposes we are using the `silver` table created earlier.
418 |
419 | # COMMAND ----------
420 |
421 | # Use the silver table as our inference table for demo purposes
422 | inference_df = spark.table(f'{database_name}.silver')
423 |
424 | print('inference_df count:', inference_df.count())
425 |
426 | # COMMAND ----------
427 |
428 | # MAGIC %md
429 | # MAGIC
430 | # MAGIC ### Single node inference
431 | # MAGIC
432 | # MAGIC We can first look at applying our model in a single node setting for inference.
433 | # MAGIC
434 | # MAGIC - Define a model_uri from which to load the Pyfunc model
435 | # MAGIC - Apply loaded model against a pandas DataFrame
436 |
437 | # COMMAND ----------
438 |
439 | # Get run_id from the run object executed above
440 | run_id = mlflow_run.info.run_id
441 | model_uri = f'runs:/{run_id}/pyfunc_model'
442 | print(model_uri)
443 |
444 | # COMMAND ----------
445 |
446 | loaded_model = mlflow.pyfunc.load_model(model_uri)
447 | inference_pdf = inference_df.limit(10).toPandas()
448 | inference_pred_arr = loaded_model.predict(inference_pdf['content'])
449 |
450 | inference_pred_arr
451 |
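452 | # COMMAND ----------
453 | 
454 | # MAGIC %md
455 | # MAGIC As a quick, informal check (a minimal sketch using columns already present in the table), we can place the predictions alongside the ground-truth labels in the pandas DataFrame.
456 | 
457 | # COMMAND ----------
458 | 
459 | # Compare predictions against ground-truth labels for the sampled rows
460 | inference_pdf['prediction'] = inference_pred_arr
461 | inference_pdf[['label', 'prediction']]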
452 | # COMMAND ----------
453 |
454 | # MAGIC %md
455 | # MAGIC
456 | # MAGIC ### Distributed inference
457 | # MAGIC
458 | # MAGIC We can use the same Pyfunc model in a distributed setting. To do so we:
459 | # MAGIC - Define a model_uri from which to load the Pyfunc model
460 | # MAGIC - Create a PySpark UDF that can be used to invoke our custom Pyfunc model
461 | # MAGIC - Note that our `result_type` will be a string in this example
462 | # MAGIC - Apply the UDF to the `content` column of the PySpark Dataframe. This column contains the raw binary representation of our images
463 |
464 | # COMMAND ----------
465 |
466 | loaded_model_udf = mlflow.pyfunc.spark_udf(spark, model_uri, result_type='string')
467 |
468 | inference_pred_df = (inference_df
469 | .limit(1000)
470 | .withColumn('prediction', loaded_model_udf('content'))
471 | .select('path', 'content', 'label', 'prediction')
472 | )
473 |
474 | # COMMAND ----------
475 |
476 | inference_pred_df.display()
477 |
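478 | # COMMAND ----------
479 | 
480 | # MAGIC %md
481 | # MAGIC Finally, as a hedged sketch (not part of the core workflow), we can compute a simple accuracy over the sampled rows by comparing the `label` and `prediction` columns in Spark. Note that this re-triggers the UDF over the sample.
482 | 
483 | # COMMAND ----------
484 | 
485 | import pyspark.sql.functions as f
486 | 
487 | # Fraction of rows where the predicted class matches the ground-truth label
488 | (inference_pred_df
489 |  .select(f.avg((f.col('label') == f.col('prediction')).cast('int')).alias('accuracy'))
490 |  .display())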
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Distributed Deep Learning Workshop
2 |
3 | In this workshop, we will train a deep learning model in a distributed manner using Databricks. We will discuss how we can leverage Delta Lake to prepare structured, semi-structured, or unstructured datasets and Petastorm for distributing datasets efficiently on a cluster. We will also cover how to use Horovod for distributed training on both CPU and GPU based hardware. This example aims to serve as a reusable template that can be tailored to your specific modeling needs.
4 |
5 | ## Workshop structure
6 |
7 | The workshop involves a series of Databricks notebooks split into two parts.
8 |
9 | In part 1 we look at how we can optimally leverage the parallelism of Spark for training deep learning models in a distributed manner. The notebooks outline the following:
10 |
11 | - Data Prep
12 | - How to create a Delta table with the Binary file data source reader using JPEG image sources.
13 | - Single node training
14 | - Distributed training
15 |
16 | In part 2 we look at how we can parallelize both hyperparameter tuning and model inference. We illustrate:
17 |
18 | - Model tuning with Hyperopt
19 | - Tuning a single node DL model with Hyperopt
20 | - Tuning a distributed Horovod process with Hyperopt
21 | - Distributed model inference
22 | - How to package up a custom Pyfunc with preprocessing/post-processing steps
23 | - Applying that logged custom Pyfunc in a single node inference setting
24 | - Applying that logged custom Pyfunc in a distributed inference setting
25 |
26 | ## Requirements
27 |
28 | Databricks ML Runtime >= 7.3 LTS is recommended. Please use the [Repos](https://docs.databricks.com/repos/index.html) feature to clone this repository and access the notebooks.
29 |
30 |
31 |
--------------------------------------------------------------------------------