├── README.md ├── blog_code_examples.py ├── data.py ├── img ├── job_parameters.png ├── mlflow_model_comparisons.png ├── multiple_job_runs.png ├── predictions.png └── training_experiments.png ├── inference.py ├── mlflow_model.py ├── requirements.txt ├── trainer.py └── utils.py /README.md: -------------------------------------------------------------------------------- 1 | # Rapid NLP development with Databricks, Delta, and Transformers 2 | This Databricks Repo provides example implementations of [huggingface transformer](https://huggingface.co/docs/transformers/index) models for text classification tasks. The project is self contained and can be easily run in your own Workspace. The project downloads several example datasets from huggingface and writes them to [Delta tables](https://docs.databricks.com/delta/index.html). The user can then choose from multiple transformer models to perform text classification. All model metrics and parameters are logged to [MLflow](https://docs.databricks.com/applications/mlflow/index.html). A separate notebook loads trained models promoted to the MLflow [Model Registry](https://docs.databricks.com/applications/mlflow/model-registry.html), performs inference, and writes results back to Delta. 3 | 4 | The Notebooks are designed to be run on a single-node, GPU-backed Cluster type. For AWS customers, consider the g5.4xlarge instance type. For Azure customers, consider the Standard_NC4as_T4_v3 instance type. The project was most recently tested using Databricks ML runtime 11.0. The transformers library will distribute model training across multiple GPUs if you choose a virtual machine type that has more than one. 5 | 6 | 7 | ##### Datasets: 8 | - **[IMDB](https://huggingface.co/datasets/imdb)**: *binary classification* 9 | - **[Banking77](https://huggingface.co/datasets/banking77)**: *mutli-class classification* 10 | - **[Tweet Emotions](https://huggingface.co/datasets/sem_eval_2018_task_1)**: *multi-label classification* 11 | 12 | Search for additional datasets in the [huggingface data hub](https://huggingface.co/datasets) 13 | 14 | ##### Models: 15 | - **[Distilbert](https://huggingface.co/docs/transformers/model_doc/distilbert)** 16 | - **[Bert](https://huggingface.co/docs/transformers/model_doc/bert)** 17 | - **[DistilRoberta](https://huggingface.co/distilroberta-base)** 18 | - **[Roberta](https://huggingface.co/roberta-base)** 19 | - **[xtremedistil-l6-h256-uncased](https://huggingface.co/microsoft/xtremedistil-l6-h256-uncased)** 20 | - **[xtremedistil-l6-h384-uncased](https://huggingface.co/microsoft/xtremedistil-l6-h384-uncased)** 21 | - **[xtremedistil-l12-h384-uncased](https://huggingface.co/microsoft/xtremedistil-l12-h384-uncased)** 22 | 23 | Search for models suitable for a wide variety of tasks in the [huggingface model hub](https://huggingface.co/models) 24 | 25 | #### Getting started: 26 | To get started training these models in your own Workspace, simply follow the below steps. 27 | 1. Clone this github repository into a Databricks Repo 28 | 29 | 2. Open the **data** Notebook and attached the Notebook to a Cluster. Select "Run all" at the top of the notebook to download and store the example datasets as Delta tables. Review the cell outputs to see data samples and charts. 30 | 31 | 3. Open the **trainer** Notebook. Select "Run all" to train an initial model on the banking77 dataset. As part of the run, Widgets will appear at the top of the notebook enabling the user to choose different input datasets, models, and training parameters. 
To test different models and configurations, consider running the training notebook as a [Job](https://docs.databricks.com/data-engineering/jobs/index.html). When executing via a Job, you can pass parameters to override the default widget values. Additionally, by increasing the Job's maximum concurrent runs, you can train multiple transformer models concurrently by launching several runs with different parameters; a minimal sketch follows the screenshots below. 32 | 33 | Setting job parameters 34 |

35 | Adjusting a Job's default parameter values to run different models against the same training dataset. 36 |

37 | 38 | Concurrent job runs 39 |

40 | Training multiple transformer models in parallel using Databricks Jobs. 41 |
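
As a minimal sketch, the same parameter overrides can be passed programmatically with `dbutils.notebook.run`, which populates the trainer notebook's widgets. The notebook path and parameter values below are illustrative; the parameter names match the widgets defined in the **trainer** notebook.

```python
# Illustrative only: launch the trainer notebook with overridden widget values.
# The path is a placeholder for the trainer notebook's location in your Repo.
dbutils.notebook.run(
    "/Repos/<user>/rapid_nlp_blog/trainer",  # placeholder path
    0,                                       # timeout in seconds; 0 means no timeout
    {
        "dataset_name": "imdb",
        "model_type": "distilroberta-base",
        "train_batch_size": "32",
        "eval_batch_size": "32",
        "gradient_accumulation_steps": "2",
        "max_epochs": "5",
        "fp16": "True",
    },
)
```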

42 | 43 | 4. The trainer notebook will create a new MLflow Experiment. You can navigate to this Experiment by clicking the hyperlink that appears under the cell containing the MLflow logging logic, or by opening the Experiments pane and selecting the Experiment named **transformer_experiments**. Each row in the Experiment corresponds to a different trained transformer model. Click on an entry to review its parameters and metrics, then train multiple models against the same dataset and compare their performance (a programmatic sketch follows the screenshot below). 44 | 45 | Comparing MLflow models 46 |

47 | Comparing transformer model runs in MLflow; notice the wide variation in model size. 48 |
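
Runs can also be compared programmatically. The sketch below assumes the default experiment location, `/Shared/transformer_experiments`, created by the trainer notebook; the `eval_f1` metric key is an assumption about how the F1 score is logged.

```python
import mlflow

# Look up the Experiment created by the trainer notebook (default location)
experiment = mlflow.get_experiment_by_name("/Shared/transformer_experiments")

# Return all runs as a pandas DataFrame, ordered by the (assumed) eval_f1 metric
runs = mlflow.search_runs(experiment_ids=[experiment.experiment_id],
                          order_by=["metrics.eval_f1 DESC"])

display(runs)
```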

49 | 50 | 5. To use a trained model for inference, copy the **Run ID** of the model from its Experiment run. Run the first several cells of the **inference** notebook to generate the Widget text box, then paste the Run ID into the text box. The notebook will generate predictions for both the training and testing sets used to fit the model and write the results to a new Delta table; a sketch of the underlying MLflow calls follows the screenshot below. 51 | 52 | 53 | Model predictions 54 |

55 | Model predictions example for the banking77 dataset. 56 |
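
The inference notebook automates these steps; as a rough sketch of what happens under the hood, a run's model can be loaded and applied directly (the run id below is a placeholder):

```python
import mlflow
import pandas as pd

run_id = "<experiment-run-id>"  # placeholder: copied from the Experiment run page

# The trainer notebook logs the custom pyfunc model under the "mlflow" artifact path
loaded_model = mlflow.pyfunc.load_model(f"runs:/{run_id}/mlflow")

# The pyfunc model expects a single-column pandas DataFrame of raw text
sample = pd.DataFrame({"text": ["what is the exchange rate for my transfer?"]})
probabilities = loaded_model.predict(sample)
```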

57 | 58 | 6. Experiment with different training configurations for a model as outlined in the [transformers documentation](https://huggingface.co/docs/transformers/performance); a sketch of one such configuration follows the screenshot below. Training configuration and GPU type can lead to large differences in training time. 59 | Training experiments 60 |

61 | Training a single epoch using dynamic padding. 62 |
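
As one example of such a configuration, the sketch below enables dynamic padding, mixed-precision training, and length grouping using the same building blocks the trainer notebook uses; the model and batch size are illustrative.

```python
from transformers import AutoTokenizer, DataCollatorWithPadding, TrainingArguments

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

# Pad each batch to its longest sequence instead of a fixed max_length
data_collator = DataCollatorWithPadding(tokenizer, padding=True)

training_args = TrainingArguments(
    output_dir="/checkpoints",
    per_device_train_batch_size=64,   # experiment with 16, 32, 64, 128
    fp16=True,                        # mixed precision on supported GPUs
    group_by_length=True,             # batch sequences of similar length together
    gradient_accumulation_steps=1,
)
```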

63 | 64 | -------------------------------------------------------------------------------- /blog_code_examples.py: -------------------------------------------------------------------------------- 1 | # Databricks notebook source 2 | # MAGIC %md #### The below cells contain the code snippets used in the Databricks blog, Rapid NLP Development with Databricks, Delta, and Transformers. 3 | # MAGIC Run this notebook on a GPU-backed cluster to recreate the results 4 | 5 | # COMMAND ---------- 6 | 7 | # MAGIC %pip install datasets transformers==4.21.* 8 | 9 | # COMMAND ---------- 10 | 11 | from transformers import AutoTokenizer, AutoModel 12 | from transformers import logging 13 | import torch 14 | 15 | logging.set_verbosity_error() 16 | 17 | # COMMAND ---------- 18 | 19 | # MAGIC %md Loading a transformer model and corresponding tokenizer 20 | 21 | # COMMAND ---------- 22 | 23 | from transformers import AutoTokenizer, AutoModel 24 | 25 | model_type = 'bert-base-uncased' 26 | tokenizer = AutoTokenizer.from_pretrained(model_type) 27 | model = AutoModel.from_pretrained(model_type) 28 | 29 | # COMMAND ---------- 30 | 31 | # MAGIC %md View the BERT vocabulary and special tokens 32 | 33 | # COMMAND ---------- 34 | 35 | tokenizer.pretrained_vocab_files_map['vocab_file']['bert-base-uncased'] 36 | 37 | # COMMAND ---------- 38 | 39 | from itertools import islice 40 | 41 | # Display the first five entries in BERT's vocabulary 42 | for token, token_id in islice(tokenizer.vocab.items(), 5): 43 | print(token_id, token) 44 | 45 | # COMMAND ---------- 46 | 47 | # Display BERT's special tokens 48 | for token_name, token_symbol in tokenizer.special_tokens_map.items(): 49 | print(token_name, token_symbol) 50 | 51 | # COMMAND ---------- 52 | 53 | # MAGIC %md Tokenize an input sequence 54 | 55 | # COMMAND ---------- 56 | 57 | token_ids = tokenizer.encode("transformers on Databricks are awesome") 58 | token_ids 59 | 60 | # COMMAND ---------- 61 | 62 | # Map token ids to BERT's tokens 63 | id_to_token = {token_id: token for token, token_id in tokenizer.vocab.items()} 64 | 65 | [id_to_token[id] for id in token_ids] 66 | 67 | # COMMAND ---------- 68 | 69 | # MAGIC %md Tokenize the sequences; apply truncation and padding. 
70 | 71 | # COMMAND ---------- 72 | 73 | records = ["transformers are easy to run on Databricks", 74 | "transformers can read from Delta", 75 | "transformers are powerful"] 76 | 77 | def tokenize(batch): 78 | """ 79 | Truncate to the max_length; pad any resulting sequences with 80 | length less than max_length 81 | """ 82 | 83 | return tokenizer(batch, padding='max_length', truncation=True, max_length=10, return_tensors="pt") 84 | 85 | tokenized = tokenize(records) 86 | 87 | tokenized_lengths = [len(sequence) for sequence in tokenized['input_ids']] 88 | 89 | print("Tokenized and padded sequences returned as pytorch tensors") 90 | for sequence in tokenized['input_ids']: 91 | print(sequence) 92 | 93 | print(f"\nTokenized sequence lengths\n{tokenized_lengths}") 94 | 95 | # COMMAND ---------- 96 | 97 | # MAGIC %md Generate word embedddings from BERT's final layer (last hidden layer) 98 | 99 | # COMMAND ---------- 100 | 101 | import torch 102 | 103 | with torch.no_grad(): 104 | token_embeddings = model(input_ids = tokenized['input_ids'], 105 | attention_mask = tokenized['attention_mask']).last_hidden_state 106 | 107 | sequence_length = [len(embedding_sequence) for embedding_sequence in token_embeddings] 108 | 109 | cls_embedding = token_embeddings[0][0] 110 | 111 | embedding_dim = cls_embedding.shape[0] 112 | 113 | print(f"\nEmebdding sequence lengths\n{sequence_length}") 114 | 115 | print(f"\nDimension of a single token embedding\n{int(embedding_dim)}") 116 | 117 | # COMMAND ---------- 118 | 119 | # MAGIC %md Download a dataset from the huggingface dataset hub and save to Delta 120 | 121 | # COMMAND ---------- 122 | 123 | from pyspark.sql.types import StructType, StructField, StringType, LongType 124 | from datasets import load_dataset 125 | 126 | # Load the sample data from the huggingface data hub 127 | dataset = load_dataset("banking77") 128 | 129 | # Convert the DataSets to Pandas DataFrames 130 | train_pd = dataset['train'].to_pandas() 131 | test_pd = dataset['test'].to_pandas() 132 | 133 | idx_and_labels = dataset['train'].features['label'].names 134 | id2label = {idx: label for idx, label in enumerate(idx_and_labels)} 135 | 136 | # Shuffle the records 137 | train_pd = train_pd.sample(frac=1).reset_index(drop=True) 138 | test_pd = test_pd.sample(frac=1).reset_index(drop=True) 139 | 140 | train_pd['label_name'] = train_pd.label.apply(lambda x: id2label[x]) 141 | test_pd['label_name'] = test_pd.label.apply(lambda x: id2label[x]) 142 | 143 | # Create Spark DataFrames 144 | single_label_schema = StructType([StructField("text", StringType(), False), 145 | StructField("label", LongType(), False), 146 | StructField("label_name", StringType(), False) 147 | 148 | ]) 149 | 150 | train = spark.createDataFrame(train_pd, schema=single_label_schema) 151 | test = spark.createDataFrame(test_pd, schema=single_label_schema) 152 | 153 | train.write.mode('overwrite').format('delta').saveAsTable('default.banking77_train_blog') 154 | test.write.mode('overwrite').format('delta').saveAsTable('default.banking77_test_blog') 155 | 156 | display(spark.table('default.banking77_train_blog').limit(5)) 157 | 158 | # COMMAND ---------- 159 | 160 | # MAGIC %md Create transformer DataSets from Delta tables by sourcing the underlying parquet files 161 | 162 | # COMMAND ---------- 163 | 164 | from datasets import load_dataset, Dataset, DatasetDict 165 | 166 | train_delta_file = spark.table('default.banking77_train_blog').inputFiles() 167 | test_delta_file = spark.table('default.banking77_test_blog').inputFiles() 168 | 169 | 
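# The Delta tables' underlying parquet files are returned with a 'dbfs:' URI scheme.
# Rewriting the prefix to '/dbfs/' points at the DBFS FUSE mount, so the datasets
# library can read the files as if they were on the local filesystem.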
train_delta_file = [file.replace('dbfs:', '/dbfs/') for file in train_delta_file] 170 | test_delta_file = [file.replace('dbfs:', '/dbfs/') for file in test_delta_file] 171 | 172 | train_test = DatasetDict({'train': load_dataset("parquet", 173 | data_files=train_delta_file, 174 | split='train'), 175 | 176 | 'test': load_dataset("parquet", 177 | data_files=test_delta_file, 178 | split='train')}) 179 | 180 | # COMMAND ---------- 181 | 182 | # MAGIC %md View the distributed of tokenized sequence lengths 183 | 184 | # COMMAND ---------- 185 | 186 | from collections import Counter 187 | import numpy as np 188 | 189 | def tokenize(batch): 190 | 191 | return tokenizer(batch['text'], 192 | truncation = True, 193 | # Without padding, tokenized sequence lengths will vary 194 | # across observations 195 | padding = False, 196 | # The maximum accpected sequence length of the model 197 | max_length = 512) 198 | 199 | # The default batch size is also 1000 but this can be changed. 200 | train_test_tokenized = train_test.map(tokenize, batched=True, batch_size=1000) 201 | 202 | train_test_tokenized.set_format("torch", columns=['input_ids', 'attention_mask', 'label']) 203 | 204 | tokenized_lengths = [len(sequence) for sequence in train_test_tokenized['train']['input_ids']] 205 | 206 | groupby_count = [(tokenized_length, count) for tokenized_length, count in Counter(tokenized_lengths).items()] 207 | 208 | groupby_count = spark.createDataFrame(groupby_count, ['tokenized_length', 'count']) 209 | 210 | display(groupby_count.orderBy('tokenized_length')) 211 | print("Deciles...") 212 | groupby_count.approxQuantile("tokenized_length", list(np.arange(0.1, 1, 0.1)), 0) 213 | 214 | # COMMAND ---------- 215 | 216 | # MAGIC %md Tokenize the training and testing DataSets 217 | 218 | # COMMAND ---------- 219 | 220 | max_length = 90 221 | 222 | tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased') 223 | 224 | def tokenize(batch): 225 | 226 | return tokenizer(batch['text'], 227 | truncation = True, 228 | padding = 'max_length', 229 | max_length = max_length) 230 | 231 | # The default batch size is also 1000 but this can be changed. 232 | train_test_tokenized = train_test.map(tokenize, batched=True, batch_size=1000) 233 | 234 | train_test_tokenized.set_format("torch", columns=['input_ids', 'attention_mask', 'label']) 235 | 236 | # COMMAND ---------- 237 | 238 | # MAGIC %md Create a function that returns validation metrics 239 | 240 | # COMMAND ---------- 241 | 242 | from sklearn.metrics import precision_recall_fscore_support 243 | 244 | def compute_single_label_metrics(pred): 245 | """Calculate validation statistics for single-label classification 246 | """ 247 | 248 | labels = pred.label_ids 249 | preds = pred.predictions.argmax(-1) 250 | precision, recall, f1, _ = precision_recall_fscore_support(labels, 251 | preds, 252 | average='micro') 253 | return { 254 | 'f1': f1, 255 | 'precision': precision, 256 | 'recall': recall 257 | } 258 | 259 | # COMMAND ---------- 260 | 261 | # MAGIC %md Specify a model initialization function, configure a transformers Trainer, and execute the training loop. 262 | 263 | # COMMAND ---------- 264 | 265 | from transformers import Trainer, TrainingArguments, AutoModelForSequenceClassification 266 | 267 | # Transformer models should be fine tuned using a GPU-backed instance, 268 | # such as a single-node cluster with a GPU-backed virtual machine type. 
269 | device = torch.device("cuda" if torch.cuda.is_available() else "cpu") 270 | 271 | def model_init(): 272 | """Return a freshly instantiated model. This ensure that the model 273 | is trained from scratch, rather than training a previously 274 | instantiated model for additional epochs. 275 | """ 276 | 277 | return AutoModelForSequenceClassification.from_pretrained('distilbert-base-uncased', 278 | num_labels=77).to(device) 279 | 280 | training_args = {"output_dir": f'/_blog_results', 281 | "overwrite_output_dir": True, 282 | "per_device_train_batch_size": 64, 283 | "per_device_eval_batch_size": 64, 284 | "weight_decay": 0.01, 285 | "num_train_epochs": 5, 286 | "save_strategy": "epoch", 287 | "evaluation_strategy": "epoch", 288 | "logging_strategy": "epoch", 289 | "load_best_model_at_end": True, 290 | "save_total_limit": 2, 291 | "metric_for_best_model": "f1", 292 | "greater_is_better": True, 293 | "seed": 123, 294 | "report_to": 'none'} 295 | 296 | 297 | trainer = Trainer(model_init = model_init, 298 | args = TrainingArguments(**training_args), 299 | train_dataset = train_test_tokenized['train'], 300 | eval_dataset = train_test_tokenized['test'], 301 | compute_metrics = compute_single_label_metrics) 302 | 303 | # Execute training 304 | trainer.train() 305 | 306 | # Get evaluation metrics on test dataset 307 | evaluation_metrics = trainer.evaluate() 308 | 309 | # COMMAND ---------- 310 | 311 | for metric_name, metric_value in evaluation_metrics.items(): 312 | print(f"{metric_name}: {round(metric_value, 3)}") 313 | 314 | # COMMAND ---------- 315 | 316 | # MAGIC %md ##### GPU vs. CPU vs. Quantized CPU 317 | 318 | # COMMAND ---------- 319 | 320 | from transformers import pipeline 321 | from transformers.pipelines.pt_utils import KeyDataset 322 | import itertools 323 | from mlflow_model import get_predictions 324 | 325 | inference_dataset = KeyDataset(train_test['test'], 'text') 326 | 327 | inference_batch_size = 256 328 | truncation = True 329 | padding = 'max_length' 330 | 331 | # COMMAND ---------- 332 | 333 | # MAGIC %md GPU inference 334 | 335 | # COMMAND ---------- 336 | 337 | gpu_predictions = get_predictions(data=inference_dataset, 338 | model = trainer.model, 339 | tokenizer = tokenizer, 340 | batch_size = inference_batch_size, 341 | device = 0, 342 | truncation = truncation, 343 | padding = padding, 344 | max_length = max_length) 345 | 346 | gpu_predictions[0] 347 | 348 | # COMMAND ---------- 349 | 350 | # MAGIC %md CPU 351 | 352 | # COMMAND ---------- 353 | 354 | cpu_predictions = get_predictions(data=inference_dataset, 355 | model = trainer.model.to('cpu'), 356 | tokenizer = tokenizer, 357 | batch_size = inference_batch_size, 358 | device = -1, 359 | truncation = truncation, 360 | padding = padding, 361 | max_length = max_length) 362 | 363 | cpu_predictions[0] 364 | 365 | # COMMAND ---------- 366 | 367 | # MAGIC %md Quantized and CPU 368 | 369 | # COMMAND ---------- 370 | 371 | import torch.nn as nn 372 | from torch.quantization import quantize_dynamic 373 | 374 | quantized_model = quantize_dynamic(trainer.model.to("cpu"), 375 | {nn.Linear}, 376 | dtype=torch.qint8) 377 | 378 | # COMMAND ---------- 379 | 380 | quantized_predictions = get_predictions(data=inference_dataset, 381 | model = quantized_model, 382 | tokenizer = tokenizer, 383 | batch_size = inference_batch_size, 384 | device = -1, 385 | truncation = truncation, 386 | padding = padding, 387 | max_length = max_length) 388 | 389 | quantized_predictions[0] 390 | 391 | # COMMAND ---------- 392 | 393 | # MAGIC %md Size 
comparison 394 | 395 | # COMMAND ---------- 396 | 397 | from pathlib import Path 398 | 399 | non_quantized_state_dict = trainer.model.state_dict() 400 | quantized_state_dict = quantized_model.state_dict() 401 | 402 | tmp_path_non_quantized = Path("/non_quantized.pt") 403 | tmp_path_quantized = Path("/quantized.pt") 404 | 405 | torch.save(non_quantized_state_dict, tmp_path_non_quantized) 406 | torch.save(quantized_state_dict, tmp_path_quantized) 407 | 408 | def get_size_in_mb(tmp_path): 409 | return round(Path(tmp_path).stat().st_size / (1024 * 1024), 1) 410 | 411 | non_quantized_size_mb = get_size_in_mb(tmp_path_non_quantized) 412 | quantized_size_mb = get_size_in_mb(tmp_path_quantized) 413 | 414 | print(f"Non-quantized model size (mb): {non_quantized_size_mb}\nquantized model size (mb): {quantized_size_mb}") 415 | 416 | # COMMAND ---------- 417 | 418 | # MAGIC %md Validation metrics comparison 419 | 420 | # COMMAND ---------- 421 | 422 | from transformers import EvalPrediction 423 | 424 | non_quantized_eval_dataset = EvalPrediction(predictions = cpu_predictions, 425 | label_ids = train_test['test']['label']) 426 | 427 | quantized_eval_dataset = EvalPrediction(predictions = quantized_predictions, 428 | label_ids = train_test['test']['label']) 429 | 430 | non_quantized_eval_metrics = compute_single_label_metrics(non_quantized_eval_dataset) 431 | quantized_eval_metrics = compute_single_label_metrics(quantized_eval_dataset) 432 | 433 | print("non-quantized validation statistics:") 434 | for metric_name, metric_value in non_quantized_eval_metrics.items(): 435 | print(metric_name, round(metric_value, 3)) 436 | 437 | print("\nnon-quantized validation statistics:") 438 | for metric_name, metric_value in quantized_eval_metrics.items(): 439 | print(metric_name, round(metric_value, 3)) 440 | 441 | # COMMAND ---------- 442 | 443 | # MAGIC %md Applying a pretrain and fine tuned model for sentiment analysis 444 | 445 | # COMMAND ---------- 446 | 447 | sentiment_pipeline = pipeline('sentiment-analysis') 448 | 449 | # COMMAND ---------- 450 | 451 | records = ["Transformers on Databricks are the best!", 452 | "Without Delta, our data lake has devolved into a data swamp!"] 453 | 454 | for prediction in sentiment_pipeline(records): 455 | print(prediction) 456 | 457 | # COMMAND ---------- 458 | 459 | spark.sql("DROP TABLE IF EXISTS default.banking77_train_blog") 460 | -------------------------------------------------------------------------------- /data.py: -------------------------------------------------------------------------------- 1 | # Databricks notebook source 2 | # MAGIC %md ## Dataset creation for transformer model training and inference 3 | 4 | # COMMAND ---------- 5 | 6 | # MAGIC %pip install datasets 7 | 8 | # COMMAND ---------- 9 | 10 | from collections import namedtuple 11 | import numpy as np 12 | import pandas as pd 13 | from datasets import load_dataset 14 | from pyspark.sql.types import StructType, StructField, StringType, IntegerType, ArrayType, FloatType, LongType 15 | import pyspark.sql.functions as func 16 | from pyspark.sql.functions import col 17 | 18 | # COMMAND ---------- 19 | 20 | def dataset_to_dataframes(dataset_name:str): 21 | 22 | """ 23 | Given a transformers datasets name, download the dataset and 24 | return Spark DataFrame versions. Result include train and test 25 | dataframes as well as a dataframe of label index to string 26 | representation. 
27 | """ 28 | 29 | spark_datasets = namedtuple("spark_datasets", "train test labels") 30 | 31 | # Define Spark schemas 32 | single_label_schema = StructType([StructField("text", StringType(), False), 33 | StructField("label", LongType(), False) 34 | ]) 35 | 36 | multi_label_schema = StructType([StructField("text", StringType(), False), 37 | StructField("labels", ArrayType(FloatType()), False) 38 | ]) 39 | 40 | labels_schema = StructType([StructField("idx", IntegerType(), False), 41 | StructField("label", StringType(), False)]) 42 | 43 | if dataset_name == "sem_eval_2018_task_1": 44 | dataset = load_dataset(dataset_name, "subtask5.english") 45 | 46 | text_col = 'Tweet' 47 | non_label_cols = ['ID'] + [text_col] 48 | idx_and_labels = [col for col in dataset['train'].features.keys() if col not in non_label_cols] 49 | 50 | train_pd = pd.concat([dataset['train'].to_pandas(), 51 | dataset['validation'].to_pandas()]) 52 | train_pd['is_train'] = 1 53 | 54 | test_pd = dataset['test'].to_pandas() 55 | 56 | test_pd['is_train'] = 0 57 | 58 | train_test_pd = pd.concat([train_pd, test_pd]) 59 | 60 | train_test_pd['labels'] = train_test_pd[idx_and_labels].values.tolist() 61 | train_test_pd['labels'] = train_test_pd.labels.apply(lambda x: [1. if i is True else 0. for i in x]) 62 | train_test_pd.rename(columns = {text_col: "text"}, inplace=True) 63 | train_test_pd = train_test_pd[['text', 'labels', 'is_train']] 64 | 65 | train = spark.createDataFrame(train_test_pd[train_test_pd.is_train == 1][['text', 'labels']], schema=multi_label_schema) 66 | test = spark.createDataFrame(train_test_pd[train_test_pd.is_train == 0][['text', 'labels']], schema=multi_label_schema) 67 | 68 | else: 69 | dataset = load_dataset(dataset_name) 70 | 71 | train_pd = dataset['train'].to_pandas() 72 | test_pd = dataset['test'].to_pandas() 73 | 74 | train_pd = train_pd.sample(frac=1).reset_index(drop=True) 75 | test_pd = test_pd.sample(frac=1).reset_index(drop=True) 76 | 77 | train = spark.createDataFrame(train_pd, schema=single_label_schema) 78 | test = spark.createDataFrame(test_pd, schema=single_label_schema) 79 | 80 | idx_and_labels = dataset['train'].features['label'].names 81 | 82 | id2label = [(idx, label) for idx, label in enumerate(idx_and_labels)] 83 | labels = spark.createDataFrame(id2label, schema=labels_schema) 84 | 85 | return spark_datasets(train, test, labels) 86 | 87 | 88 | def get_token_length_counts(delta_table, group_count=True): 89 | 90 | token_lengths = (spark.table(delta_table).select("text") 91 | .withColumn("token_length", func.size(func.split(col("text"), " ")))) 92 | 93 | if group_count: 94 | 95 | return (token_lengths.groupBy("token_length").agg(func.count("*").alias("count")) 96 | .orderBy("token_length")) 97 | 98 | else: 99 | return token_lengths 100 | 101 | # COMMAND ---------- 102 | 103 | # MAGIC %md #### [Banking77 dataset](https://huggingface.co/datasets/banking77) 104 | # MAGIC Multi-class classification 105 | 106 | # COMMAND ---------- 107 | 108 | banking77_train = "default.banking77_train" 109 | banking77_test = "default.banking77_test" 110 | banking77_labels = "default.banking77_labels" 111 | 112 | # COMMAND ---------- 113 | 114 | banking77_dfs = dataset_to_dataframes("banking77") 115 | 116 | banking77_dfs.train.write.mode('overwrite').format('delta').saveAsTable(banking77_train) 117 | banking77_dfs.test.write.mode('overwrite').format('delta').saveAsTable(banking77_test) 118 | banking77_dfs.labels.write.mode('overwrite').format('delta').saveAsTable(banking77_labels) 119 | 120 | # COMMAND ---------- 
121 | 122 | banking77_train_df = spark.table(banking77_train) 123 | banking77_test_df = spark.table(banking77_test) 124 | banking77_labels_df = spark.table(banking77_labels) 125 | 126 | # COMMAND ---------- 127 | 128 | # MAGIC %md ##### Raw data 129 | 130 | # COMMAND ---------- 131 | 132 | display(banking77_train_df) 133 | 134 | # COMMAND ---------- 135 | 136 | # MAGIC %md ##### Labels 137 | 138 | # COMMAND ---------- 139 | 140 | display(banking77_labels_df) 141 | 142 | # COMMAND ---------- 143 | 144 | # MAGIC %md ##### Record counts 145 | 146 | # COMMAND ---------- 147 | 148 | print(f"train_cnt: {banking77_train_df.count()}, test_cnt: {banking77_test_df.count()}, labels_cnt: {banking77_labels_df.count()}") 149 | 150 | # COMMAND ---------- 151 | 152 | # MAGIC %md ##### Distribution of token lengths 153 | # MAGIC Note that this simple chart that counts the individual tokens when a text observation is split on whitespace is not sufficient for making decisions about the maximum sequence length when tokenizing the dataset. This is because the transformer tokenizer will split the data differently and will likely split individual words into multiple tokens. This will result in longer token lengths compared to what the chart indicates below. 154 | 155 | # COMMAND ---------- 156 | 157 | display(get_token_length_counts(banking77_train)) 158 | 159 | # COMMAND ---------- 160 | 161 | token_lengths = get_token_length_counts(banking77_train, group_count=False) 162 | 163 | print(f""" 164 | quantiles: {token_lengths.approxQuantile("token_length", [0.25, 0.5, 0.75], 0)} 165 | deciles: {token_lengths.approxQuantile("token_length", list(np.arange(0.1, 1, 0.1)), 0)} 166 | """) 167 | 168 | # COMMAND ---------- 169 | 170 | # MAGIC %md #### [IMDB dataset](https://huggingface.co/datasets/imdb) 171 | # MAGIC Binary classification 172 | 173 | # COMMAND ---------- 174 | 175 | imdb_train = "default.imdb_train" 176 | imdb_test = "default.imdb_test" 177 | imdb_labels = "default.imdb_labels" 178 | 179 | # COMMAND ---------- 180 | 181 | imdb_dfs = dataset_to_dataframes("imdb") 182 | 183 | imdb_dfs.train.write.mode('overwrite').format('delta').saveAsTable(imdb_train) 184 | imdb_dfs.test.write.mode('overwrite').format('delta').saveAsTable(imdb_test) 185 | imdb_dfs.labels.write.mode('overwrite').format('delta').saveAsTable(imdb_labels) 186 | 187 | # COMMAND ---------- 188 | 189 | imdb_train_df = spark.table(imdb_train) 190 | imdb_test_df = spark.table(imdb_test) 191 | imdb_labels_df = spark.table(imdb_labels) 192 | 193 | # COMMAND ---------- 194 | 195 | # MAGIC %md ##### Raw data 196 | 197 | # COMMAND ---------- 198 | 199 | display(imdb_train_df) 200 | 201 | # COMMAND ---------- 202 | 203 | # MAGIC %md ##### Labels 204 | 205 | # COMMAND ---------- 206 | 207 | display(imdb_labels_df) 208 | 209 | # COMMAND ---------- 210 | 211 | # MAGIC %md ##### Record counts 212 | 213 | # COMMAND ---------- 214 | 215 | print(f"train_cnt: {imdb_train_df.count()}, test_cnt: {imdb_test_df.count()}, labels_cnt: {imdb_dfs.labels.count()}") 216 | 217 | # COMMAND ---------- 218 | 219 | # MAGIC %md ##### Distribution of token lengths 220 | 221 | # COMMAND ---------- 222 | 223 | display(get_token_length_counts(imdb_train)) 224 | 225 | # COMMAND ---------- 226 | 227 | token_lengths = get_token_length_counts(imdb_train, group_count=False) 228 | 229 | print(f""" 230 | quantiles: {token_lengths.approxQuantile("token_length", [0.25, 0.5, 0.75], 0)} 231 | deciles: {token_lengths.approxQuantile("token_length", list(np.arange(0.1, 1, 0.1)), 0)} 232 | """) 
233 | 234 | # COMMAND ---------- 235 | 236 | # MAGIC %md #### [Tweet Emotions](https://huggingface.co/datasets/sem_eval_2018_task_1) 237 | # MAGIC Multi-label classification 238 | 239 | # COMMAND ---------- 240 | 241 | dataset = load_dataset("sem_eval_2018_task_1", "subtask5.english") 242 | 243 | # COMMAND ---------- 244 | 245 | tweet_emotions_train = "default.tweet_emotions_train" 246 | tweet_emotions_test = "default.tweet_emotions_test" 247 | tweet_emotions_labels = "default.tweet_emotions_labels" 248 | 249 | # COMMAND ---------- 250 | 251 | tweet_emotions_dfs = dataset_to_dataframes("sem_eval_2018_task_1") 252 | 253 | tweet_emotions_dfs.train.write.mode('overwrite').format('delta').saveAsTable(tweet_emotions_train) 254 | tweet_emotions_dfs.test.write.mode('overwrite').format('delta').saveAsTable(tweet_emotions_test) 255 | tweet_emotions_dfs.labels.write.mode('overwrite').format('delta').saveAsTable(tweet_emotions_labels) 256 | 257 | # COMMAND ---------- 258 | 259 | tweet_emotions_train_df = spark.table(tweet_emotions_train) 260 | tweet_emotions_test_df = spark.table(tweet_emotions_test) 261 | tweet_emotions_labels_df = spark.table(tweet_emotions_labels) 262 | 263 | # COMMAND ---------- 264 | 265 | # MAGIC %md ##### Raw data 266 | 267 | # COMMAND ---------- 268 | 269 | display(tweet_emotions_train_df) 270 | 271 | # COMMAND ---------- 272 | 273 | # MAGIC %md ##### Labels 274 | 275 | # COMMAND ---------- 276 | 277 | display(tweet_emotions_train_df) 278 | 279 | # COMMAND ---------- 280 | 281 | # MAGIC %md ##### Record counts 282 | 283 | # COMMAND ---------- 284 | 285 | print(f"train_cnt: {tweet_emotions_train_df.count()}, test_cnt: {tweet_emotions_test_df.count()}, labels_cnt: {tweet_emotions_labels_df.count()}") 286 | 287 | # COMMAND ---------- 288 | 289 | # MAGIC %md ##### Distribution of token lengths 290 | 291 | # COMMAND ---------- 292 | 293 | display(get_token_length_counts(tweet_emotions_train)) 294 | 295 | # COMMAND ---------- 296 | 297 | token_lengths = get_token_length_counts(tweet_emotions_train, group_count=False) 298 | 299 | print(f""" 300 | quantiles: {token_lengths.approxQuantile("token_length", [0.25, 0.5, 0.75], 0)} 301 | deciles: {token_lengths.approxQuantile("token_length", list(np.arange(0.1, 1, 0.1)), 0)} 302 | """) 303 | 304 | -------------------------------------------------------------------------------- /img/job_parameters.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/marshackVB/rapid_nlp_blog/b404d84da280ce295e4bb88796c4ef670290da98/img/job_parameters.png -------------------------------------------------------------------------------- /img/mlflow_model_comparisons.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/marshackVB/rapid_nlp_blog/b404d84da280ce295e4bb88796c4ef670290da98/img/mlflow_model_comparisons.png -------------------------------------------------------------------------------- /img/multiple_job_runs.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/marshackVB/rapid_nlp_blog/b404d84da280ce295e4bb88796c4ef670290da98/img/multiple_job_runs.png -------------------------------------------------------------------------------- /img/predictions.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/marshackVB/rapid_nlp_blog/b404d84da280ce295e4bb88796c4ef670290da98/img/predictions.png 
-------------------------------------------------------------------------------- /img/training_experiments.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/marshackVB/rapid_nlp_blog/b404d84da280ce295e4bb88796c4ef670290da98/img/training_experiments.png -------------------------------------------------------------------------------- /inference.py: -------------------------------------------------------------------------------- 1 | # Databricks notebook source 2 | # MAGIC %md ## Model inference workflow 3 | # MAGIC Paste an MLflow Experiment run id in the above text box and select "Run All" above. This notebook will register the associated model with the Model Registry and transition the model's stage to 'Production'. Then, the model will be loaded and applied for inference, writing predictions to a Delta table. 4 | 5 | # COMMAND ---------- 6 | 7 | # MAGIC %pip install -r requirements.txt 8 | 9 | # COMMAND ---------- 10 | 11 | import pickle 12 | from time import perf_counter 13 | 14 | import mlflow 15 | from mlflow.tracking import MlflowClient 16 | import numpy as np 17 | import pandas as pd 18 | from pyspark.sql.types import (StructType, 19 | StructField, 20 | ArrayType, 21 | StringType, 22 | FloatType, 23 | IntegerType) 24 | from pyspark.sql import DataFrame 25 | import pyspark.sql.functions as func 26 | 27 | from utils import (get_run_id, 28 | get_gpu_utilization) 29 | 30 | pd.set_option('display.max_colwidth', None) 31 | 32 | client = MlflowClient() 33 | 34 | # COMMAND ---------- 35 | 36 | # MAGIC %md Confirm the Experiment run id is valid 37 | 38 | # COMMAND ---------- 39 | 40 | dbutils.widgets.text("experiment_run_id", "") 41 | run_id = dbutils.widgets.get("experiment_run_id").strip() 42 | 43 | # COMMAND ---------- 44 | 45 | try: 46 | model_info = client.get_run(run_id).to_dictionary() 47 | except: 48 | raise Exception(f"Run id: {run_id} does not exist") 49 | 50 | model_info 51 | 52 | # COMMAND ---------- 53 | 54 | # MAGIC %md ##### View GPU memory availability and current consumption 55 | # MAGIC If GPU memory utilization is high, you may need to Detach & Re-attach the training notebook to clear the GPU's memory. This could occur if you just finished training a model with the current cluster. 
56 | 57 | # COMMAND ---------- 58 | 59 | get_gpu_utilization(memory_type='total') 60 | get_gpu_utilization(memory_type='used') 61 | get_gpu_utilization(memory_type='free') 62 | 63 | # COMMAND ---------- 64 | 65 | # MAGIC %md ##### Create a Model Registry entry if one does not exist 66 | 67 | # COMMAND ---------- 68 | 69 | client = MlflowClient() 70 | model_registry_name = "transformer_models" 71 | 72 | # Create a Model Registry entry if one does not exist 73 | try: 74 | client.get_registered_model(model_registry_name) 75 | print(" Registered model already exists") 76 | except: 77 | client.create_registered_model(model_registry_name) 78 | 79 | # COMMAND ---------- 80 | 81 | # MAGIC %md ##### Register the model and transition its stage to 'Production' 82 | 83 | # COMMAND ---------- 84 | 85 | # Get model experiment info 86 | model_info = client.get_run(run_id).to_dictionary() 87 | artifact_uri = model_info['info']['artifact_uri'] 88 | 89 | # Register the model 90 | registered_model = client.create_model_version(name=model_registry_name, 91 | source=artifact_uri + "/mlflow", 92 | run_id=run_id) 93 | 94 | # Promote the model to the "Production" stage 95 | promote_to_prod = client.transition_model_version_stage(name=model_registry_name, 96 | version = int(registered_model.version), 97 | stage="Production", 98 | archive_existing_versions=True) 99 | 100 | # COMMAND ---------- 101 | 102 | # MAGIC %md ##### Create a Pandas DataFrame of records to score by combining the training and test datasets used to fine tune the model 103 | 104 | # COMMAND ---------- 105 | 106 | def union_train_test(train_df:DataFrame, test_df:DataFrame) -> DataFrame: 107 | """Combine the training and testing datasets 108 | """ 109 | 110 | return (spark.table(train_df).withColumn("is_train", func.lit(1)) 111 | .unionAll( 112 | spark.table(test_df).withColumn("is_train", func.lit(0)) 113 | ) 114 | ) 115 | 116 | training_dataset = model_info['data']['params']['dataset'] 117 | 118 | if training_dataset == 'banking77': 119 | inference_df = union_train_test("default.banking77_train", "default.banking77_test") 120 | output_table_name = "default.banking77_predictions" 121 | 122 | elif training_dataset == 'imdb': 123 | inference_df = union_train_test("default.imdb_train", "default.imdb_test") 124 | output_table_name = "default.imdb_predictions" 125 | 126 | elif training_dataset == 'tweet_emotions': 127 | inference_df = union_train_test("default.tweet_emotions_train", "default.tweet_emotions_test") 128 | output_table_name = "default.tweet_emotions_predictions" 129 | 130 | else: 131 | raise Exception(f"Training and testing datasets are not known") 132 | 133 | 134 | inference_pd = inference_df.toPandas() 135 | 136 | print(f"Total records for inference: {inference_pd.iloc[:, 0].count():,}") 137 | 138 | # COMMAND ---------- 139 | 140 | display(inference_df) 141 | 142 | # COMMAND ---------- 143 | 144 | # MAGIC %md ##### Generate predictions and write results to Delta 145 | 146 | # COMMAND ---------- 147 | 148 | production_run_id = get_run_id(model_name = "transformer_models") 149 | 150 | # Download id to label mapping 151 | client.download_artifacts(production_run_id, 152 | "mlflow/artifacts/id2label.pickle", 153 | "/") 154 | 155 | id2label = pickle.load(open("/mlflow/artifacts/id2label.pickle", "rb")) 156 | 157 | # Load model 158 | loaded_model = mlflow.pyfunc.load_model(f"runs:/{production_run_id}/mlflow") 159 | 160 | # COMMAND ---------- 161 | 162 | # Combine input texts and predictions 163 | start_time = perf_counter() 164 | predictions 
= pd.concat([inference_pd, 165 | pd.DataFrame({"probabilities": loaded_model.predict(inference_pd[["text"]]).tolist()})], 166 | axis=1) 167 | inference_time = perf_counter() - start_time 168 | 169 | # Transform predictions and specify Spark DataFrame schema 170 | schema = StructType() 171 | 172 | if training_dataset == 'tweet_emotions': 173 | 174 | schema.add("text", StringType()) 175 | schema.add("all_label_indxs", ArrayType(FloatType())) 176 | schema.add("is_train", IntegerType()) 177 | schema.add("pred_proba_label_indxs", ArrayType(FloatType())) 178 | schema.add("predicted_label_indxs", ArrayType(IntegerType())) 179 | schema.add("predicted_labels", ArrayType(StringType())) 180 | schema.add("label_indxs", ArrayType(IntegerType())) 181 | schema.add("labels", ArrayType(StringType())) 182 | 183 | 184 | predictions.rename(columns={"labels": "all_label_indxs", 185 | "probabilities": "pred_proba_label_indxs"}, inplace = True) 186 | 187 | predictions['predicted_label_indxs'] = predictions.pred_proba_label_indxs.apply(lambda x: np.where(np.array(x) > 0.5)[0].tolist()) 188 | 189 | predictions['predicted_labels'] = predictions.predicted_label_indxs.apply(lambda x: [id2label[idx] for idx in x]) 190 | 191 | predictions['label_indxs'] = predictions.all_label_indxs.apply(lambda x: np.where(np.array(x) == 1.0)[0].tolist()) 192 | 193 | predictions['labels'] = predictions.label_indxs.apply(lambda x: [id2label[idx] for idx in x]) 194 | 195 | 196 | else: 197 | schema.add("text", StringType()) 198 | schema.add("label_indx", IntegerType()) 199 | schema.add("is_train", IntegerType()) 200 | schema.add("probabilities", ArrayType(FloatType())) 201 | schema.add("predicted_probability", FloatType()) 202 | schema.add("predicted_label_indx", IntegerType()) 203 | schema.add("predicted_label", StringType()) 204 | schema.add("label", StringType()) 205 | 206 | 207 | predictions.rename(columns={"label": "label_indx"}, inplace = True) 208 | 209 | predictions['predicted_probability'] = predictions.probabilities.apply(lambda x: max(x)) 210 | 211 | predictions['predicted_label_idx'] = predictions.apply(lambda x: x['probabilities'].index(x['predicted_probability']), axis=1) 212 | 213 | predictions['predicted_label'] = predictions.predicted_label_idx.apply(lambda x: id2label[x]) 214 | 215 | predictions['label'] = predictions.label_indx.apply(lambda x: id2label[x]) 216 | 217 | 218 | # Convert predictions to a Spark Dataframe and write to Delta 219 | predictions_spark = spark.createDataFrame(predictions, schema=schema) 220 | 221 | spark.sql(f"DROP TABLE IF EXISTS {output_table_name}") 222 | predictions_spark.write.format("delta").mode("overwrite").saveAsTable(output_table_name) 223 | 224 | display(spark.table(output_table_name)) 225 | 226 | # COMMAND ---------- 227 | 228 | print(f'Inference seconds: {round(inference_time, 2)}') 229 | 230 | # COMMAND ---------- 231 | 232 | get_gpu_utilization(memory_type='total') 233 | get_gpu_utilization(memory_type='used') 234 | get_gpu_utilization(memory_type='free') 235 | -------------------------------------------------------------------------------- /mlflow_model.py: -------------------------------------------------------------------------------- 1 | import mlflow 2 | from mlflow.pyfunc.model import PythonModelContext 3 | import numpy as np 4 | import pandas as pd 5 | import torch 6 | from transformers import pipeline, AutoModelForSequenceClassification, AutoTokenizer 7 | from transformers.pipelines.pt_utils import KeyDataset 8 | from typing import Optional, Union, List 9 | 10 | 11 | 12 | 
def get_predictions(data:Union[List, KeyDataset], model:AutoModelForSequenceClassification, tokenizer:AutoTokenizer, batch_size:str, 13 | device:int=0, padding:Union[bool, str]='longest', truncation:bool=True, max_length:int=512, 14 | function_to_apply:Optional[str]=None) -> np.array([[float]]): 15 | """ 16 | Create a transformers pipeline and perform inference on an input sequence of records. The pipeline 17 | is comprised of a tokenizer and a model as well as additional parameters that govern the tokenizers behavior and 18 | batching of input records. Given a list of text observations, the function will perform inference 19 | in batches and return an array of probabilities, one for each label. 20 | 21 | This function can be imported into a Notebook and used directly for testing/experimentation purposes. 22 | 23 | Although this project's examples operate on a list of sequences, this function can also be applied to a 24 | KeyDataset, which is created from a transformers Dataset... 25 | 26 | dataset_to_score = KeyDataset(transformers.Dataset, 'name_of_text_column') 27 | 28 | This method has the advantage of not requiring the full inference dataset to be persisted in memory. For 29 | more information see the link, https://huggingface.co/docs/transformers/v4.21.1/en/main_classes/pipelines#pipeline-batching. 30 | 31 | For information about the sequence classification pipeline, see the link, 32 | https://huggingface.co/docs/transformers/v4.21.1/en/main_classes/pipelines#transformers.TextClassificationPipeline 33 | 34 | Args: 35 | data: A list of text sequences with each sequence representing a single observation. 36 | model: A fine-tuned transformer model for sequence classification. 37 | tokenizer: The transformers tokenizer associated with the model. 38 | batch_size: The number of records to score at a time. If you run into GPU out of memory 39 | errors, you may need to decrease the batch size. 40 | device: Governs the device used for inference: -1 for CPU and 0 for GPU. At the time of this 41 | function's development, transformers pipelines cannot utilize multiple GPUs for inference. 42 | padding: Sets the padding strategy; defaults to the longest sequence in a batch. 43 | truncation: Indicates if sequences should be truncated if beyond a certain length. 44 | max_length: The maximum length of a sequence before it is truncated; defaults to 512, 45 | which is a common maximum length for many transformer models. Truncating longer 46 | sequences to shorter lengths speeds training and allows for larger batch sizes, 47 | potentially with degradation in predictive performance. 48 | function_to_apply: The type of transformation to apply to the logits output by the model, such 49 | as softmax or sigmoid. If this is not specified, the library will infer the 50 | correct transformation based on the label's shape determined when the model 51 | was trained. 52 | 53 | Returns: 54 | A numpy array of probabilities, one for each label value. The index position of a probability 55 | corresponds to its label. So, the element, 0, in the array corresponds to the label = 0. 
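    Example (illustrative; assumes a fine-tuned model and its tokenizer are already in scope):
        probabilities = get_predictions(["transformers are easy to run on Databricks"],
                                        model=model, tokenizer=tokenizer, batch_size=32)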
56 | """ 57 | 58 | inference_pipeline = pipeline(task = "text-classification", 59 | model = model, 60 | tokenizer = tokenizer, 61 | batch_size = batch_size, 62 | device = device, 63 | return_all_scores = True, 64 | function_to_apply = function_to_apply, 65 | framework = "pt") 66 | 67 | predictions = inference_pipeline(data, 68 | padding = padding, 69 | truncation = truncation, 70 | max_length = max_length) 71 | 72 | # Spark return type is ArrayType(FloatType()) 73 | predictions_to_array = [[round(dct['score'], 4) for dct in prediction] for prediction in predictions] 74 | 75 | 76 | return np.array(predictions_to_array) 77 | 78 | 79 | 80 | class MLflowModel(mlflow.pyfunc.PythonModel): 81 | """ 82 | Custom MLflow pyfunc model that performs transformer model inference. The model loads a tokenizer 83 | and fine-tuned model stored as MLflow model artifacts. These loaded artifacts are used to create 84 | a transformer pipeline. 85 | 86 | For a description of the mode's output, see the docstring associated with the get_predictions 87 | function. 88 | 89 | Args: 90 | inference_batch_size: The number of records to pass at a time to the model for inferece. 91 | truncation: Indicates if sequences should be truncated if beyond a certain length. 92 | padding: Sets the padding strategy; defaults to the longest sequence in a batch. 93 | max_length: The maximum length of a sequence before it is truncated; defaults to 512, 94 | which is a common maximum length for many transformer models. Truncating longer 95 | sequences to shorter lengths speeds training and allows for larger batch sizes, 96 | potentially with degradation in predictive performance. 97 | function_to_apply: The type of transformation to apply to the logits output by the model, such 98 | as softmax or sigmoid. If this is not specified, the library will infer the 99 | correct transformation based on the label's shape determined when the model 100 | was trained. 101 | """ 102 | 103 | def __init__(self, inference_batch_size:str, truncation:bool=True, padding:bool=True, max_length:int=512, 104 | function_to_apply:Optional[str]=None): 105 | 106 | self.inference_batch_size = inference_batch_size 107 | self.truncation = truncation 108 | self.padding = padding 109 | self.max_length = max_length 110 | self.function_to_apply = function_to_apply 111 | self.tokenizer = None 112 | self.model = None 113 | 114 | 115 | def load_context(self, context:PythonModelContext): 116 | """ 117 | This method is called once by MLflow when a model is loaded for inference using the 118 | mlflow.pyfunc.load_model() function. PythonModelContext is a class with a single 119 | attribute, artifacts:dict[str,str], that is referenceable through a class 120 | property, context.artifacts. A PythonModelContext instance is passed automatically 121 | by MLflow. 122 | """ 123 | 124 | # Both CPU and single-GPU inference are options using this custome MLFlow model, 125 | # though CPU-based inference will be drastically slower and you may need to decrease 126 | # the inference batch size when logging this model. 127 | device = torch.device("cuda" if torch.cuda.is_available() else "cpu") 128 | if device.type == "cpu": 129 | raise Exception("No GPU detected. 
Provision a GPU-backed instance to run model inference") 130 | 131 | # Load the tokenizer and model from MLflow 132 | self.tokenizer = AutoTokenizer.from_pretrained(context.artifacts['tokenizer']) 133 | self.model = AutoModelForSequenceClassification.from_pretrained(context.artifacts['model']) 134 | 135 | 136 | def predict(self, context:PythonModelContext, model_input:Union[pd.DataFrame, KeyDataset]) -> np.array([[float]]): 137 | """ 138 | Generate predictions given an input Pandas DataFrame containing a single feature column 139 | or a tranformers.KeyDataset. See the get_predictions function for more information. 140 | 141 | Args: 142 | context: A PythonModelContext instance passed automatically by MLflow. See the load_context() 143 | method for more information. 144 | model_input: Either a Pandas Dataframe or a transformers.KeyDataset. If passing a DataFrame, 145 | the expectation is that the DataFrame has only one column and that column contains 146 | the raw text to score. 147 | 148 | Returns: 149 | A numpy array of probabilities, one for each label value. The index position of a probability 150 | corresponds to its label. So, the element, 0, in the array corresponds to the label = 0. 151 | """ 152 | 153 | if isinstance(model_input, KeyDataset): 154 | # The KeyDataset can be passed directly to the transformers pipeline 155 | is_pandas = False 156 | 157 | elif isinstance(model_input, pd.DataFrame): 158 | # The Pandas Dataframe column will be converted to a list of string 159 | # before passed to the transformers pipeline 160 | is_pandas = True 161 | 162 | else: 163 | raise TypeError("Model input is neither a Pandas DataFrame nor a transformers KeyDataset") 164 | 165 | predictions = get_predictions(data=model_input[model_input.columns[0]].tolist() if is_pandas else model_input, 166 | model=self.model, 167 | tokenizer=self.tokenizer, 168 | batch_size=self.inference_batch_size, 169 | padding=self.padding, 170 | truncation=self.truncation, 171 | max_length=self.max_length, 172 | function_to_apply=self.function_to_apply) 173 | 174 | return predictions -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | numpy==1.20.* 2 | pandas==1.3.* 3 | scikit-learn==0.24.* 4 | torch==1.11.0+cu113 5 | transformers==4.21.* 6 | datasets==2.4.* 7 | mlflow == 1.27.* 8 | nvidia-ml-py3 -------------------------------------------------------------------------------- /trainer.py: -------------------------------------------------------------------------------- 1 | # Databricks notebook source 2 | # MAGIC %md ## Model training workflow 3 | # MAGIC This notebook trains transformer models on various datasets. Select a model and training dataset from the above drop-down menus and experiment with different training parameters. See the below cells for guidance on model tuning. This notebook can be run either interactively or as a job. The cluster type should be a single-node cluster using the ML GPU Runtime and a GPU-backed instance type. 4 | # MAGIC 5 | # MAGIC Note that the IMDB dataset has much longer sequence lengths than the other example datasets. It takes longer to train and is more susceptible to GPU out of memory errors. Consider decreasing the train_batch_size and eval_batch_size to 16 and increasing gradient_accumulation_steps to 4 as a starting point for this dataset. You can also experiment with truncating the sequences to a length below the default, 512. 
This will speed training and allow for larger batch sizes, potentially at some degradation in predictive performance. 6 | 7 | # COMMAND ---------- 8 | 9 | # MAGIC %pip install -q -r requirements.txt 10 | 11 | # COMMAND ---------- 12 | 13 | import pickle 14 | from pathlib import Path 15 | from sys import version_info 16 | 17 | from datasets import load_dataset, DatasetDict 18 | import numpy as np 19 | import mlflow 20 | from mlflow_model import MLflowModel 21 | from sklearn.metrics import precision_recall_fscore_support 22 | from scipy.stats import logistic 23 | import torch 24 | from transformers import (AutoConfig, 25 | AutoTokenizer, 26 | AutoModelForSequenceClassification, 27 | EarlyStoppingCallback, 28 | EvalPrediction, 29 | DataCollatorWithPadding, 30 | pipeline, 31 | TrainingArguments, 32 | Trainer) 33 | 34 | from utils import get_parquet_files, get_or_create_experiment, get_best_metrics, get_gpu_utilization 35 | 36 | mlflow.autolog(disable=True) 37 | 38 | device = torch.device("cuda" if torch.cuda.is_available() else "cpu") 39 | 40 | # COMMAND ---------- 41 | 42 | # MAGIC %md ##### View GPU memory availability and current consumption 43 | # MAGIC Clear the GPU memory between model runs by re-running the cell that pip installs dependencies from the requirements.txt file. Selecting Detach & Re-attach from the cluster icon will also clear the GPU memory. 44 | 45 | # COMMAND ---------- 46 | 47 | get_gpu_utilization(memory_type="total") 48 | get_gpu_utilization(memory_type="used") 49 | get_gpu_utilization(memory_type="free") 50 | 51 | # COMMAND ---------- 52 | 53 | # MAGIC %md ##### Specify widget values 54 | 55 | # COMMAND ---------- 56 | 57 | datasets = ["banking77", "imdb", "tweet_emotions"] 58 | 59 | supported_models = ["distilbert-base-uncased", 60 | "bert-base-uncased", 61 | "bert-base-cased", 62 | "distilroberta-base", 63 | "roberta-base", 64 | "microsoft/xtremedistil-l6-h256-uncased", 65 | "microsoft/xtremedistil-l6-h384-uncased", 66 | "microsoft/xtremedistil-l12-h384-uncased" 67 | ] 68 | 69 | dbutils.widgets.dropdown("dataset_name", datasets[0], datasets) 70 | dbutils.widgets.dropdown("model_type", supported_models[0], supported_models) 71 | 72 | dbutils.widgets.text("train_batch_size", "64") 73 | dbutils.widgets.text("eval_batch_size", "64") 74 | dbutils.widgets.text("inference_batch_size", "256") 75 | 76 | dbutils.widgets.text("gradient_accumulation_steps", "1") 77 | dbutils.widgets.text("max_epochs", "10") 78 | dbutils.widgets.dropdown("fp16", "True", ["True", "False"]) 79 | dbutils.widgets.dropdown("group_by_length", "False", ["True", "False"]) 80 | 81 | dbutils.widgets.text("experiment_location", "transformer_experiments") 82 | 83 | 84 | dataset = dbutils.widgets.get("dataset_name") 85 | model_type = dbutils.widgets.get("model_type") 86 | 87 | train_batch_size = int(dbutils.widgets.get("train_batch_size")) 88 | eval_batch_size = int(dbutils.widgets.get("eval_batch_size")) 89 | inference_batch_size = int(dbutils.widgets.get("inference_batch_size")) 90 | 91 | gradient_accumulation_steps = int(dbutils.widgets.get("gradient_accumulation_steps")) 92 | max_epochs = int(dbutils.widgets.get("max_epochs")) 93 | 94 | fp16 = True if dbutils.widgets.get("fp16") == "True" else False 95 | group_by_length = True if dbutils.widgets.get("group_by_length") == "True" else False 96 | experiment_location = dbutils.widgets.get("experiment_location") 97 | 98 | # COMMAND ---------- 99 | 100 | print(f""" 101 | Widget parameter values: 102 | 103 | dataset: {dataset} 104 | model: {model_type} 105 | 
train_batch_size: {train_batch_size} 106 | eval_batch_size: {eval_batch_size} 107 | inference_batch_size: {inference_batch_size} 108 | gradient_accumulation_steps: {gradient_accumulation_steps} 109 | fp16: {fp16} 110 | group_by_length: {group_by_length} 111 | max_epochs: {max_epochs} 112 | experiment_location: {experiment_location}""") 113 | 114 | # COMMAND ---------- 115 | 116 | # MAGIC %md ##### Specify model, tokenizer, and training parameters 117 | # MAGIC 118 | # MAGIC See the [documentation](https://huggingface.co/docs/transformers/performance) and specifically the section on [single GPU training](https://huggingface.co/docs/transformers/perf_train_gpu_one) for performance tuning tips. Additionally, see the various [tokenization strategies](https://huggingface.co/docs/transformers/pad_truncation) available. 119 | # MAGIC 120 | # MAGIC Adjusting the below training arguments can have a large effect on training times and GPU memory consumption. 121 | # MAGIC 122 | # MAGIC - [per_device_train_batch_size](https://huggingface.co/docs/transformers/perf_train_gpu_one#vanilla-training): Experiment with 16, 32, 64 and 128. 123 | # MAGIC - [fp16](https://huggingface.co/docs/transformers/perf_train_gpu_one#fp16-training) 124 | # MAGIC - [gradient_accumulation_steps](https://huggingface.co/docs/transformers/perf_train_gpu_one#gradient-accumulation) 125 | # MAGIC - [group_by_length](#https://huggingface.co/docs/transformers/v4.21.2/en/main_classes/trainer#transformers.TrainingArguments.group_by_length) 126 | # MAGIC 127 | # MAGIC In addition, truncating longer sequences to shorter length will speed training time and reduce GPU memory consumption. This can be accomplished by adjusting the tokenizer such that "max_length" is less than 512 and "truncation = 'max_length'". 
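# MAGIC
# MAGIC For example (illustrative value), setting the tokenizer arguments defined below to `{"truncation": True, "padding": False, "max_length": 256}` truncates every sequence to at most 256 tokens.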
128 | 
129 | # COMMAND ----------
130 | 
131 | datasets_mapping = {"banking77": {"train": "default.banking77_train",
132 |                                   "test": "default.banking77_test",
133 |                                   "labels": "default.banking77_labels",
134 |                                   "num_labels": 77,
135 |                                   # Batch size for general model inference, outside of the training loop; this is
136 |                                   # the batch size used by the MLflow model
137 |                                   "inference_batch_size": inference_batch_size,
138 |                                   # Batch sizes for the training and evaluation steps of the training loop
139 |                                   "per_device_train_batch_size": train_batch_size,
140 |                                   "per_device_eval_batch_size": eval_batch_size,
141 |                                   "problem_type": "single_label_classification"
142 |                                   },
143 | 
144 |                     "imdb": {"train": "default.imdb_train",
145 |                              "test": "default.imdb_test",
146 |                              "labels": "default.imdb_labels",
147 |                              "num_labels": 2,
148 |                              "inference_batch_size": inference_batch_size,
149 |                              "per_device_train_batch_size": train_batch_size,
150 |                              "per_device_eval_batch_size": eval_batch_size,
151 |                              "problem_type": "single_label_classification"
152 |                              },
153 | 
154 |                     "tweet_emotions": {"train": "default.tweet_emotions_train",
155 |                                        "test": "default.tweet_emotions_test",
156 |                                        "labels": "default.tweet_emotions_labels",
157 |                                        "num_labels": 11,
158 |                                        "inference_batch_size": inference_batch_size,
159 |                                        "per_device_train_batch_size": train_batch_size,
160 |                                        "per_device_eval_batch_size": eval_batch_size,
161 |                                        "problem_type": "multi_label_classification"
162 |                                        }
163 |                     }
164 | 
165 | data_args = datasets_mapping[dataset]
166 | 
167 | model_args = {"feature_col": "text",
168 |               "num_labels": data_args["num_labels"],
169 |               "inference_batch_size": data_args["inference_batch_size"],
170 |               "problem_type": data_args["problem_type"]}
171 | 
172 | tokenizer_args = {"truncation": True,
173 |                   # Padding will be done at the batch level during training
174 |                   "padding": False,
175 |                   # 512 is the maximum sequence length accepted by these models
176 |                   "max_length": 512}
177 | 
178 | current_user = dbutils.notebook.entry_point.getDbutils().notebook().getContext().userName().get().split("@")[0]
179 | 
180 | training_args = {"output_dir": "/checkpoints",
181 |                  "overwrite_output_dir": True,
182 |                  "per_device_train_batch_size": data_args["per_device_train_batch_size"],
183 |                  "per_device_eval_batch_size": data_args["per_device_eval_batch_size"],
184 |                  "weight_decay": 0.01,
185 |                  "num_train_epochs": max_epochs,
186 |                  "save_strategy": "epoch",
187 |                  "evaluation_strategy": "epoch",
188 |                  "logging_strategy": "epoch",
189 |                  "load_best_model_at_end": True,
190 |                  "save_total_limit": 2,
191 |                  "metric_for_best_model": "f1",
192 |                  "greater_is_better": True,
193 |                  "seed": 123,
194 |                  "report_to": "none",
195 |                  "gradient_accumulation_steps": gradient_accumulation_steps,
196 |                  "fp16": fp16,
197 |                  "group_by_length": group_by_length}
198 | 
199 | # COMMAND ----------
200 | 
201 | # MAGIC %md ##### Create a [huggingface dataset](https://huggingface.co/course/chapter5/4?fw=pt) directly from the Delta tables' underlying parquet files.
202 | # MAGIC The huggingface datasets library will copy the training and test datasets to the driver node's disk and leverage [memory mapping](https://huggingface.co/course/chapter5/4?fw=pt) to efficiently read data from disk during training and inference. This prevents larger datasets from overwhelming the memory of your virtual machine.
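Once the next cell has built `train_test`, a quick probe along the following lines can confirm that the splits are memory-mapped Arrow files on local disk rather than objects held in RAM. This is an illustrative sketch; it assumes the `psutil` package is available on the driver (it typically is on Databricks ML runtimes).

```python
import psutil

# Each split is backed by Arrow cache files on the driver's local disk
print(train_test["train"].cache_files)
print(f'train rows: {train_test["train"].num_rows:,}')

# The driver process's resident memory stays well below the on-disk size of larger datasets
rss_mb = psutil.Process().memory_info().rss / (1024 * 1024)
print(f"Driver process RSS: {rss_mb:.0f} MB")
```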
203 | 
204 | # COMMAND ----------
205 | 
206 | train_table_name = data_args["train"]
207 | test_table_name = data_args["test"]
208 | labels_table_name = data_args["labels"]
209 | 
210 | train_files = get_parquet_files(train_table_name)
211 | test_files = get_parquet_files(test_table_name)
212 | 
213 | train_test = DatasetDict({"train": load_dataset("parquet",
214 |                                                  data_files=train_files,
215 |                                                  split="train"),
216 | 
217 |                           "test": load_dataset("parquet",
218 |                                                data_files=test_files,
219 |                                                split="train")})
220 | 
221 | labels = spark.table(labels_table_name)
222 | collected_labels = labels.collect()
223 | 
224 | id2label = {row.idx: row.label for row in collected_labels}
225 | label2id = {row.label: row.idx for row in collected_labels}
226 | 
227 | # COMMAND ----------
228 | 
229 | # MAGIC %md ##### Create an MLflow Experiment or use an existing Experiment
230 | 
231 | # COMMAND ----------
232 | 
233 | experiment_location = f"/Shared/{experiment_location}"
234 | get_or_create_experiment(experiment_location)
235 | 
236 | # COMMAND ----------
237 | 
238 | # MAGIC %md ##### Train models and log to MLflow
239 | # MAGIC The training cell below will generate a hyperlink that navigates to the Experiment run in MLflow.
240 | 
241 | # COMMAND ----------
242 | 
243 | tokenizer = AutoTokenizer.from_pretrained(model_type, use_fast=True)
244 | 
245 | model_config = AutoConfig.from_pretrained(model_type,
246 |                                           num_labels=model_args["num_labels"],
247 |                                           id2label=id2label,
248 |                                           label2id=label2id,
249 |                                           problem_type=model_args["problem_type"])
250 | 
251 | 
252 | def tokenize(batch):
253 |     """Tokenize input text in batches"""
254 | 
255 |     return tokenizer(batch[model_args["feature_col"]],
256 |                      truncation=tokenizer_args["truncation"],
257 |                      padding=tokenizer_args["padding"],
258 |                      max_length=tokenizer_args["max_length"])
259 | 
260 | 
261 | # The DataCollator will handle dynamic padding of batches during training. For a walkthrough, see
262 | # https://www.youtube.com/watch?v=-RPeakdlHYo. If not leveraging dynamic padding, this can be removed
263 | data_collator = DataCollatorWithPadding(tokenizer, padding=True)
264 | 
265 | # The default map batch size is 1,000; this can be changed by setting the "batch_size" parameter
266 | # https://huggingface.co/docs/datasets/process#batch-processing
267 | train_test_tokenized = train_test.map(tokenize, batched=True)
268 | train_test_tokenized.set_format("torch", columns=["input_ids", "attention_mask", "labels" if dataset == "tweet_emotions" else "label"])
269 | 
270 | 
271 | def model_init():
272 |     """Return a freshly instantiated model. This ensures that each
273 |     training run starts from the pretrained checkpoint, rather than
274 |     continuing to train a previously instantiated model for additional epochs.
275 |     """
276 | 
277 |     return AutoModelForSequenceClassification.from_pretrained(model_type,
278 |                                                                config=model_config).to(device)
279 | 
280 | 
281 | 
282 | def compute_single_label_metrics(pred: EvalPrediction) -> dict[str, float]:
283 |     """Calculate validation statistics for single-label classification
284 |     problems. The function accepts a transformers EvalPrediction object.
285 | 
286 |     https://huggingface.co/docs/transformers/internal/trainer_utils#transformers.EvalPrediction
287 |     """
288 | 
289 |     labels = pred.label_ids
290 |     preds = pred.predictions.argmax(-1)
291 |     precision, recall, f1, _ = precision_recall_fscore_support(labels,
292 |                                                                preds,
293 |                                                                average="micro")
294 |     return {
295 |         "f1": f1,
296 |         "precision": precision,
297 |         "recall": recall
298 |     }
299 | 
300 | 
301 | def compute_multi_label_metrics(pred: EvalPrediction) -> dict[str, float]:
302 | 
303 |     """Calculate validation statistics for multi-label classification
304 |     problems. The function accepts a transformers EvalPrediction object.
305 |     """
306 | 
307 |     labels = pred.label_ids
308 |     preds = logistic.cdf(pred.predictions)  # sigmoid over the raw logits
309 |     preds = np.where(preds >= 0.5, 1., 0.)
310 | 
311 |     precision, recall, f1, _ = precision_recall_fscore_support(labels,
312 |                                                                preds,
313 |                                                                average="micro")
314 | 
315 |     return {
316 |         "f1": f1,
317 |         "precision": precision,
318 |         "recall": recall
319 |     }
320 | 
321 | # The early stopping threshold is an absolute change in the monitored metric, not a percentage
322 | early_stopping_callback = EarlyStoppingCallback(early_stopping_patience=1, early_stopping_threshold=0.005)
323 | 
324 | trainer = Trainer(model_init=model_init,
325 |                   args=TrainingArguments(**training_args),
326 |                   train_dataset=train_test_tokenized["train"],
327 |                   eval_dataset=train_test_tokenized["test"],
328 |                   compute_metrics=compute_multi_label_metrics if model_args["problem_type"] == "multi_label_classification"
329 |                                   else compute_single_label_metrics,
330 |                   data_collator=data_collator,
331 |                   callbacks=[early_stopping_callback])
332 | 
333 | 
334 | with mlflow.start_run(run_name=model_type) as run:
335 | 
336 |     run_id = run.info.run_id
337 | 
338 |     result = trainer.train()
339 | 
340 |     # Save the trained model and tokenizer to the driver node; these will then be logged as
341 |     # MLflow artifacts
342 |     trainer.save_model("/model")
343 |     tokenizer.save_pretrained("/tokenizer")
344 | 
345 |     eval_result = trainer.evaluate()
346 | 
347 |     best_metrics = get_best_metrics(trainer)
348 | 
349 |     training_eval_metrics = {"model_size_mb": round(Path("/model/pytorch_model.bin").stat().st_size / (1024 * 1024), 1),
350 |                              "train_minutes": round(result.metrics["train_runtime"] / 60, 2),
351 |                              "train_samples_per_second": round(result.metrics["train_samples_per_second"], 1),
352 |                              "train_steps_per_second": round(result.metrics["train_steps_per_second"], 2),
353 |                              "train_rows": train_test["train"].num_rows,
354 |                              "gpu_memory_total_mb": get_gpu_utilization(memory_type="total", print_only=False),
355 |                              "gpu_memory_used_mb": get_gpu_utilization(memory_type="used", print_only=False),
356 | 
357 |                              "eval_seconds": round(eval_result["eval_runtime"], 2),
358 |                              "eval_samples_per_second": round(eval_result["eval_samples_per_second"], 1),
359 |                              "eval_steps_per_second": round(eval_result["eval_steps_per_second"], 1),
360 |                              "eval_rows": train_test["test"].num_rows}
361 | 
362 |     all_metrics = dict(**best_metrics, **training_eval_metrics)
363 |     mlflow.log_metrics(all_metrics)
364 | 
365 |     python_version = f"{version_info.major}.{version_info.minor}.{version_info.micro}"
366 | 
367 |     other_params = {"dataset": dataset,
368 |                     "gpus": trainer.args._n_gpu,
369 |                     "best_checkpoint": trainer.state.best_model_checkpoint.split("/")[-1],
370 |                     "runtime_version": spark.conf.get("spark.databricks.clusterUsageTags.sparkVersion"),
371 |                     "python_version": python_version}
372 | 
373 |     all_params = dict(**model_args, **tokenizer_args, **training_args, **other_params)
374 | 
375 |     mlflow.log_params(all_params)
376 | 
377 |     # Construct the model's conda environment from the requirements.txt file
378 |     with open("requirements.txt", "r") as additional_requirements:
379 |         libraries = additional_requirements.readlines()
380 |         libraries = [library.rstrip() for library in libraries]
381 | 
382 |     model_env = mlflow.pyfunc.get_default_conda_env()
383 |     # Replace the default mlflow dependency with the pinned version from requirements.txt
384 |     model_env["dependencies"][-1]["pip"].remove("mlflow")
385 |     model_env["dependencies"][-1]["pip"] += libraries
386 | 
387 |     with open("/id2label.pickle", "wb") as handle:
388 |         pickle.dump(id2label, handle)
389 | 
390 |     artifacts = {"tokenizer": "/tokenizer",
391 |                  "model": "/model",
392 |                  "id2label": "/id2label.pickle"}
393 | 
394 |     # Create an instance of the custom MLflow model for inference
395 |     pipeline_model = MLflowModel(inference_batch_size=model_args["inference_batch_size"],
396 |                                  truncation=tokenizer_args["truncation"],
397 |                                  # Pad to the longest sequence in the batch during inference
398 |                                  padding="longest",
399 |                                  max_length=tokenizer_args["max_length"])
400 | 
401 |     mlflow.pyfunc.log_model(artifact_path="mlflow",
402 |                             python_model=pipeline_model,
403 |                             conda_env=model_env,
404 |                             artifacts=artifacts)
405 | 
406 |     print(f"""
407 |     MLflow Experiment run id: {run_id}
408 |     """)
409 | 
410 | # COMMAND ----------
411 | 
412 | # MAGIC %md ##### View GPU memory availability and current consumption
413 | 
414 | # COMMAND ----------
415 | 
416 | get_gpu_utilization(memory_type="total")
417 | get_gpu_utilization(memory_type="used")
418 | get_gpu_utilization(memory_type="free")
419 | 
--------------------------------------------------------------------------------
/utils.py:
--------------------------------------------------------------------------------
1 | import json
2 | from time import perf_counter
3 | from argparse import Namespace
4 | from typing import List, Tuple, Dict
5 | from pyspark.sql import SparkSession
6 | from pyspark.dbutils import DBUtils
7 | from pynvml import *
8 | import mlflow
9 | from mlflow.tracking import MlflowClient
10 | 
11 | spark = SparkSession.builder.getOrCreate()
12 | dbutils = DBUtils(spark)
13 | 
14 | client = MlflowClient()
15 | 
16 | 
17 | def get_parquet_files(table_name:str) -> List[str]:
18 |     """
19 |     Given the name of a table backed by parquet files, return a list of
20 |     the underlying parquet file paths that can be read by the huggingface
21 |     datasets library.
22 | 
23 |     Args:
24 |         table_name: The name of the table, optionally qualified by its
25 |             database, for example 'default.banking77_train'.
26 | 
27 |     Returns:
28 |         A list of parquet file paths, rewritten so they are readable
29 |         from the local /dbfs mount.
30 |     """
31 | 
32 |     files = spark.table(f'{table_name}').inputFiles()
33 | 
34 |     if files[0][:4] == 'dbfs':
35 |         files = [file.replace('dbfs:', '/dbfs/') for file in files]
36 | 
37 |     return files
38 | 
39 | 
40 | def get_best_metrics(trainer) -> Dict[str, float]:
41 |     """
42 |     Extract metrics from a fitted Trainer instance.
43 | 
44 |     Args:
45 |         trainer: A Trainer instance that has been trained on data.
46 | 
47 |     Returns:
48 |         A dictionary of metrics and their values for the best training epoch.
49 |     """
50 | 
51 |     # Best model metrics
52 |     best_checkpoint = f'{trainer.state.best_model_checkpoint}/trainer_state.json'
53 | 
54 |     with open(best_checkpoint) as f:
55 |         metrics = json.load(f)
56 | 
57 |     best_epoch = round(metrics['epoch'], 1)
58 | 
59 |     # These are dropped because they are instead sourced from calling trainer.evaluate()
60 |     metrics_to_drop = ['eval_runtime', 'eval_samples_per_second', 'eval_steps_per_second', 'step']
61 | 
62 |     best_loss_metrics, best_eval_metrics = [eval_metrics for eval_metrics in metrics['log_history'] if round(eval_metrics['epoch'], 1) == best_epoch]
63 |     best_eval_metrics = {metric_name: round(metric_value, 4) for metric_name, metric_value in best_eval_metrics.items() if metric_name not in metrics_to_drop}
64 | 
65 |     best_metrics = {**best_loss_metrics, **best_eval_metrics}
66 | 
67 |     best_metrics['best_model_epoch'] = round(best_metrics.pop('epoch'), 1)
68 | 
69 | 
70 |     return best_metrics
71 | 
72 | 
73 | def get_or_create_experiment(experiment_location: str) -> None:
74 |     """
75 |     Given an experiment path, check to see if an experiment exists at that location.
76 |     If not, create a new experiment. Then set the notebook to log all runs to the
77 |     specified experiment location.
78 | 
79 |     Args:
80 |         experiment_location: The path to the MLflow Experiment,
81 |             viewable in the upper left hand corner of the Experiment's UI.
82 |     """
83 | 
84 |     if not mlflow.get_experiment_by_name(experiment_location):
85 |         print("Experiment does not exist. Creating experiment")
86 | 
87 |         mlflow.create_experiment(experiment_location)
88 | 
89 |     mlflow.set_experiment(experiment_location)
90 | 
91 | 
92 | def get_model_info(model_name:str, stage:str):
93 |     """Return the Model Registry entry for the given model name and stage."""
94 |     run_info = [run for run in client.search_model_versions(f"name='{model_name}'")
95 |                 if run.current_stage == stage][0]
96 | 
97 |     return run_info
98 | 
99 | 
100 | def get_run_id(model_name:str, stage:str='Production') -> str:
101 |     """Given a model's name, return its run id from the Model Registry; this assumes the model
102 |     has been registered
103 | 
104 |     Args:
105 |         model_name: The name of the model; this is the name used to register the model.
106 |         stage: The stage of the model version in the registry you want returned, e.g. 'Production'.
107 | 
108 |     Returns:
109 |         The run id of the model; this can be used to load the model for inference
110 | 
111 |     """
112 | 
113 |     return get_model_info(model_name, stage).run_id
114 | 
115 | 
116 | def get_artifact_path(model_name:str, stage:str='Production') -> str:
117 |     """Given a model's name, return its artifact directory path from the Model Registry;
118 |     this assumes the model has been registered
119 | 
120 |     Args:
121 |         model_name: The name of the model; this is the name used to register the model.
122 |         stage: The stage of the model version in the registry you want returned, e.g. 'Production'.
123 | 
124 |     Returns:
125 |         The artifact directory path
126 | 
127 |     """
128 | 
129 |     run_info = get_model_info(model_name, stage)
130 | 
131 |     artifact_path = run_info.source
132 |     drop_last_dir = artifact_path.split('/')[:-1]
133 |     artifact_path = ('/').join(drop_last_dir)
134 | 
135 |     return artifact_path
136 | 
137 | 
138 | def get_gpu_utilization(memory_type='used', print_only=True):
139 |     """Print, or return when print_only=False, GPU memory utilization in MB
140 |     https://huggingface.co/docs/transformers/perf_train_gpu_one#efficient-training-on-a-single-gpu
141 |     """
142 | 
143 |     nvmlInit()
144 |     handle = nvmlDeviceGetHandleByIndex(0)
145 |     info = nvmlDeviceGetMemoryInfo(handle)
146 | 
147 |     if memory_type == 'total':
148 |         return_value = info.total//1024**2
149 |         return_string = f'GPU memory total: {return_value} MB.'
150 |     elif memory_type == 'free':
151 |         return_value = info.free//1024**2
152 |         return_string = f'GPU memory free: {return_value} MB.'
153 |     elif memory_type == 'used':
154 |         return_value = info.used//1024**2
155 |         return_string = f'GPU memory used: {return_value} MB.'
156 | 
157 |     if print_only:
158 |         print(return_string)
159 |     else:
160 |         return return_value
161 | 
--------------------------------------------------------------------------------
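For reference, a typical way to use these helpers from the inference side might look like the sketch below. The registered model name `transformer_classifier` is a placeholder for whatever name you register the trained model under; the `mlflow` artifact path matches the path used by `mlflow.pyfunc.log_model` in the trainer notebook.

```python
import mlflow
from utils import get_run_id, get_gpu_utilization

# Look up the run backing the Production version of a registered model
run_id = get_run_id("transformer_classifier", stage="Production")

# Load the custom pyfunc model logged under the "mlflow" artifact path
model = mlflow.pyfunc.load_model(f"runs:/{run_id}/mlflow")

# Check free GPU memory before scoring a large batch
free_mb = get_gpu_utilization(memory_type="free", print_only=False)
print(f"Free GPU memory: {free_mb} MB")
```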