├── README.md ├── blog_code_examples.py ├── data.py ├── img ├── job_parameters.png ├── mlflow_model_comparisons.png ├── multiple_job_runs.png ├── predictions.png └── training_experiments.png ├── inference.py ├── mlflow_model.py ├── requirements.txt ├── trainer.py └── utils.py /README.md: -------------------------------------------------------------------------------- 1 | # Rapid NLP development with Databricks, Delta, and Transformers 2 | This Databricks Repo provides example implementations of [huggingface transformer](https://huggingface.co/docs/transformers/index) models for text classification tasks. The project is self contained and can be easily run in your own Workspace. The project downloads several example datasets from huggingface and writes them to [Delta tables](https://docs.databricks.com/delta/index.html). The user can then choose from multiple transformer models to perform text classification. All model metrics and parameters are logged to [MLflow](https://docs.databricks.com/applications/mlflow/index.html). A separate notebook loads trained models promoted to the MLflow [Model Registry](https://docs.databricks.com/applications/mlflow/model-registry.html), performs inference, and writes results back to Delta. 3 | 4 | The Notebooks are designed to be run on a single-node, GPU-backed Cluster type. For AWS customers, consider the g5.4xlarge instance type. For Azure customers, consider the Standard_NC4as_T4_v3 instance type. The project was most recently tested using Databricks ML runtime 11.0. The transformers library will distribute model training across multiple GPUs if you choose a virtual machine type that has more than one. 5 | 6 | 7 | ##### Datasets: 8 | - **[IMDB](https://huggingface.co/datasets/imdb)**: *binary classification* 9 | - **[Banking77](https://huggingface.co/datasets/banking77)**: *mutli-class classification* 10 | - **[Tweet Emotions](https://huggingface.co/datasets/sem_eval_2018_task_1)**: *multi-label classification* 11 | 12 | Search for additional datasets in the [huggingface data hub](https://huggingface.co/datasets) 13 | 14 | ##### Models: 15 | - **[Distilbert](https://huggingface.co/docs/transformers/model_doc/distilbert)** 16 | - **[Bert](https://huggingface.co/docs/transformers/model_doc/bert)** 17 | - **[DistilRoberta](https://huggingface.co/distilroberta-base)** 18 | - **[Roberta](https://huggingface.co/roberta-base)** 19 | - **[xtremedistil-l6-h256-uncased](https://huggingface.co/microsoft/xtremedistil-l6-h256-uncased)** 20 | - **[xtremedistil-l6-h384-uncased](https://huggingface.co/microsoft/xtremedistil-l6-h384-uncased)** 21 | - **[xtremedistil-l12-h384-uncased](https://huggingface.co/microsoft/xtremedistil-l12-h384-uncased)** 22 | 23 | Search for models suitable for a wide variety of tasks in the [huggingface model hub](https://huggingface.co/models) 24 | 25 | #### Getting started: 26 | To get started training these models in your own Workspace, simply follow the below steps. 27 | 1. Clone this github repository into a Databricks Repo 28 | 29 | 2. Open the **data** Notebook and attached the Notebook to a Cluster. Select "Run all" at the top of the notebook to download and store the example datasets as Delta tables. Review the cell outputs to see data samples and charts. 30 | 31 | 3. Open the **trainer** Notebook. Select "Run all" to train an initial model on the banking77 dataset. As part of the run, Widgets will appear at the top of the notebook enabling the user to choose different input datasets, models, and training parameters. 
To test different models and configurations, consider running the training notebook as a [Job](https://docs.databricks.com/data-engineering/jobs/index.html). When executing via a Job, you can pass parameters to override the default widget values. Additionally, by increasing the Job's maximum concurrent runs, you can train multiple transformer models concurrently by launching several runs with different parameters; a minimal sketch follows the screenshots below. 32 | 33 | Setting job parameters 34 |

35 | Adjusting a Job's default parameter values to run different models against the same training dataset. 36 |

37 | 38 | Concurrent job runs 39 |

40 | Training multiple transformer models in parallel using Databricks Jobs. 41 |
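
As a minimal sketch, the same parameter overrides can be passed programmatically with `dbutils.notebook.run`, which populates the trainer notebook's widgets. The notebook path and parameter values below are illustrative; the parameter names match the widgets defined in the **trainer** notebook.

```python
# Illustrative only: launch the trainer notebook with overridden widget values.
# The path is a placeholder for the trainer notebook's location in your Repo.
dbutils.notebook.run(
    "/Repos/<user>/rapid_nlp_blog/trainer",  # placeholder path
    0,                                       # timeout in seconds; 0 means no timeout
    {
        "dataset_name": "imdb",
        "model_type": "distilroberta-base",
        "train_batch_size": "32",
        "eval_batch_size": "32",
        "gradient_accumulation_steps": "2",
        "max_epochs": "5",
        "fp16": "True",
    },
)
```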

42 | 43 | 4. The trainer notebook will create a new MLflow Experiment. You can navigate to this Experiment by clicking the hyperlink that appears under the cell containing the MLflow logging logic, or by opening the Experiments pane and selecting the Experiment named **transformer_experiments**. Each row in the Experiment corresponds to a different trained transformer model. Click on an entry to review its parameters and metrics, then train multiple models against the same dataset and compare their performance (a programmatic sketch follows the screenshot below). 44 | 45 | Comparing MLflow models 46 |

47 | Comparing transformer model runs in MLflow; notice the wide variation in model size. 48 |
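
Runs can also be compared programmatically. The sketch below assumes the default experiment location, `/Shared/transformer_experiments`, created by the trainer notebook; the `eval_f1` metric key is an assumption about how the F1 score is logged.

```python
import mlflow

# Look up the Experiment created by the trainer notebook (default location)
experiment = mlflow.get_experiment_by_name("/Shared/transformer_experiments")

# Return all runs as a pandas DataFrame, ordered by the (assumed) eval_f1 metric
runs = mlflow.search_runs(experiment_ids=[experiment.experiment_id],
                          order_by=["metrics.eval_f1 DESC"])

display(runs)
```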

49 | 50 | 5. To use a trained model for inference, copy the **Run ID** of the model from its Experiment run. Run the first several cells of the **inference** notebook to generate the Widget text box, then paste the Run ID into the text box. The notebook will generate predictions for both the training and testing sets used to fit the model and write the results to a new Delta table; a sketch of the underlying MLflow calls follows the screenshot below. 51 | 52 | 53 | Model predictions 54 |

55 | Model predictions example for the banking77 dataset. 56 |
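
The inference notebook automates these steps; as a rough sketch of what happens under the hood, a run's model can be loaded and applied directly (the run id below is a placeholder):

```python
import mlflow
import pandas as pd

run_id = "<experiment-run-id>"  # placeholder: copied from the Experiment run page

# The trainer notebook logs the custom pyfunc model under the "mlflow" artifact path
loaded_model = mlflow.pyfunc.load_model(f"runs:/{run_id}/mlflow")

# The pyfunc model expects a single-column pandas DataFrame of raw text
sample = pd.DataFrame({"text": ["what is the exchange rate for my transfer?"]})
probabilities = loaded_model.predict(sample)
```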

57 | 58 | 6. Experiment with different training configurations for a model as outlined in the [transformers documentation](https://huggingface.co/docs/transformers/performance); a sketch of one such configuration follows the screenshot below. Training configuration and GPU type can lead to large differences in training time. 59 | Training experiments 60 |

61 | Training a single epoch using dynamic padding. 62 |
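
As one example of such a configuration, the sketch below enables dynamic padding, mixed-precision training, and length grouping using the same building blocks the trainer notebook uses; the model and batch size are illustrative.

```python
from transformers import AutoTokenizer, DataCollatorWithPadding, TrainingArguments

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

# Pad each batch to its longest sequence instead of a fixed max_length
data_collator = DataCollatorWithPadding(tokenizer, padding=True)

training_args = TrainingArguments(
    output_dir="/checkpoints",
    per_device_train_batch_size=64,   # experiment with 16, 32, 64, 128
    fp16=True,                        # mixed precision on supported GPUs
    group_by_length=True,             # batch sequences of similar length together
    gradient_accumulation_steps=1,
)
```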

63 | 64 | -------------------------------------------------------------------------------- /blog_code_examples.py: -------------------------------------------------------------------------------- 1 | # Databricks notebook source 2 | # MAGIC %md #### The below cells contain the code snippets used in the Databricks blog, Rapid NLP Development with Databricks, Delta, and Transformers. 3 | # MAGIC Run this notebook on a GPU-backed cluster to recreate the results 4 | 5 | # COMMAND ---------- 6 | 7 | # MAGIC %pip install datasets transformers==4.21.* 8 | 9 | # COMMAND ---------- 10 | 11 | from transformers import AutoTokenizer, AutoModel 12 | from transformers import logging 13 | import torch 14 | 15 | logging.set_verbosity_error() 16 | 17 | # COMMAND ---------- 18 | 19 | # MAGIC %md Loading a transformer model and corresponding tokenizer 20 | 21 | # COMMAND ---------- 22 | 23 | from transformers import AutoTokenizer, AutoModel 24 | 25 | model_type = 'bert-base-uncased' 26 | tokenizer = AutoTokenizer.from_pretrained(model_type) 27 | model = AutoModel.from_pretrained(model_type) 28 | 29 | # COMMAND ---------- 30 | 31 | # MAGIC %md View the BERT vocabulary and special tokens 32 | 33 | # COMMAND ---------- 34 | 35 | tokenizer.pretrained_vocab_files_map['vocab_file']['bert-base-uncased'] 36 | 37 | # COMMAND ---------- 38 | 39 | from itertools import islice 40 | 41 | # Display the first five entries in BERT's vocabulary 42 | for token, token_id in islice(tokenizer.vocab.items(), 5): 43 | print(token_id, token) 44 | 45 | # COMMAND ---------- 46 | 47 | # Display BERT's special tokens 48 | for token_name, token_symbol in tokenizer.special_tokens_map.items(): 49 | print(token_name, token_symbol) 50 | 51 | # COMMAND ---------- 52 | 53 | # MAGIC %md Tokenize an input sequence 54 | 55 | # COMMAND ---------- 56 | 57 | token_ids = tokenizer.encode("transformers on Databricks are awesome") 58 | token_ids 59 | 60 | # COMMAND ---------- 61 | 62 | # Map token ids to BERT's tokens 63 | id_to_token = {token_id: token for token, token_id in tokenizer.vocab.items()} 64 | 65 | [id_to_token[id] for id in token_ids] 66 | 67 | # COMMAND ---------- 68 | 69 | # MAGIC %md Tokenize the sequences; apply truncation and padding. 
70 | 71 | # COMMAND ---------- 72 | 73 | records = ["transformers are easy to run on Databricks", 74 | "transformers can read from Delta", 75 | "transformers are powerful"] 76 | 77 | def tokenize(batch): 78 | """ 79 | Truncate to the max_length; pad any resulting sequences with 80 | length less than max_length 81 | """ 82 | 83 | return tokenizer(batch, padding='max_length', truncation=True, max_length=10, return_tensors="pt") 84 | 85 | tokenized = tokenize(records) 86 | 87 | tokenized_lengths = [len(sequence) for sequence in tokenized['input_ids']] 88 | 89 | print("Tokenized and padded sequences returned as pytorch tensors") 90 | for sequence in tokenized['input_ids']: 91 | print(sequence) 92 | 93 | print(f"\nTokenized sequence lengths\n{tokenized_lengths}") 94 | 95 | # COMMAND ---------- 96 | 97 | # MAGIC %md Generate word embedddings from BERT's final layer (last hidden layer) 98 | 99 | # COMMAND ---------- 100 | 101 | import torch 102 | 103 | with torch.no_grad(): 104 | token_embeddings = model(input_ids = tokenized['input_ids'], 105 | attention_mask = tokenized['attention_mask']).last_hidden_state 106 | 107 | sequence_length = [len(embedding_sequence) for embedding_sequence in token_embeddings] 108 | 109 | cls_embedding = token_embeddings[0][0] 110 | 111 | embedding_dim = cls_embedding.shape[0] 112 | 113 | print(f"\nEmebdding sequence lengths\n{sequence_length}") 114 | 115 | print(f"\nDimension of a single token embedding\n{int(embedding_dim)}") 116 | 117 | # COMMAND ---------- 118 | 119 | # MAGIC %md Download a dataset from the huggingface dataset hub and save to Delta 120 | 121 | # COMMAND ---------- 122 | 123 | from pyspark.sql.types import StructType, StructField, StringType, LongType 124 | from datasets import load_dataset 125 | 126 | # Load the sample data from the huggingface data hub 127 | dataset = load_dataset("banking77") 128 | 129 | # Convert the DataSets to Pandas DataFrames 130 | train_pd = dataset['train'].to_pandas() 131 | test_pd = dataset['test'].to_pandas() 132 | 133 | idx_and_labels = dataset['train'].features['label'].names 134 | id2label = {idx: label for idx, label in enumerate(idx_and_labels)} 135 | 136 | # Shuffle the records 137 | train_pd = train_pd.sample(frac=1).reset_index(drop=True) 138 | test_pd = test_pd.sample(frac=1).reset_index(drop=True) 139 | 140 | train_pd['label_name'] = train_pd.label.apply(lambda x: id2label[x]) 141 | test_pd['label_name'] = test_pd.label.apply(lambda x: id2label[x]) 142 | 143 | # Create Spark DataFrames 144 | single_label_schema = StructType([StructField("text", StringType(), False), 145 | StructField("label", LongType(), False), 146 | StructField("label_name", StringType(), False) 147 | 148 | ]) 149 | 150 | train = spark.createDataFrame(train_pd, schema=single_label_schema) 151 | test = spark.createDataFrame(test_pd, schema=single_label_schema) 152 | 153 | train.write.mode('overwrite').format('delta').saveAsTable('default.banking77_train_blog') 154 | test.write.mode('overwrite').format('delta').saveAsTable('default.banking77_test_blog') 155 | 156 | display(spark.table('default.banking77_train_blog').limit(5)) 157 | 158 | # COMMAND ---------- 159 | 160 | # MAGIC %md Create transformer DataSets from Delta tables by sourcing the underlying parquet files 161 | 162 | # COMMAND ---------- 163 | 164 | from datasets import load_dataset, Dataset, DatasetDict 165 | 166 | train_delta_file = spark.table('default.banking77_train_blog').inputFiles() 167 | test_delta_file = spark.table('default.banking77_test_blog').inputFiles() 168 | 169 | 
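# The Delta tables' underlying parquet files are returned with a 'dbfs:' URI scheme.
# Rewriting the prefix to '/dbfs/' points at the DBFS FUSE mount, so the datasets
# library can read the files as if they were on the local filesystem.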
train_delta_file = [file.replace('dbfs:', '/dbfs/') for file in train_delta_file] 170 | test_delta_file = [file.replace('dbfs:', '/dbfs/') for file in test_delta_file] 171 | 172 | train_test = DatasetDict({'train': load_dataset("parquet", 173 | data_files=train_delta_file, 174 | split='train'), 175 | 176 | 'test': load_dataset("parquet", 177 | data_files=test_delta_file, 178 | split='train')}) 179 | 180 | # COMMAND ---------- 181 | 182 | # MAGIC %md View the distributed of tokenized sequence lengths 183 | 184 | # COMMAND ---------- 185 | 186 | from collections import Counter 187 | import numpy as np 188 | 189 | def tokenize(batch): 190 | 191 | return tokenizer(batch['text'], 192 | truncation = True, 193 | # Without padding, tokenized sequence lengths will vary 194 | # across observations 195 | padding = False, 196 | # The maximum accpected sequence length of the model 197 | max_length = 512) 198 | 199 | # The default batch size is also 1000 but this can be changed. 200 | train_test_tokenized = train_test.map(tokenize, batched=True, batch_size=1000) 201 | 202 | train_test_tokenized.set_format("torch", columns=['input_ids', 'attention_mask', 'label']) 203 | 204 | tokenized_lengths = [len(sequence) for sequence in train_test_tokenized['train']['input_ids']] 205 | 206 | groupby_count = [(tokenized_length, count) for tokenized_length, count in Counter(tokenized_lengths).items()] 207 | 208 | groupby_count = spark.createDataFrame(groupby_count, ['tokenized_length', 'count']) 209 | 210 | display(groupby_count.orderBy('tokenized_length')) 211 | print("Deciles...") 212 | groupby_count.approxQuantile("tokenized_length", list(np.arange(0.1, 1, 0.1)), 0) 213 | 214 | # COMMAND ---------- 215 | 216 | # MAGIC %md Tokenize the training and testing DataSets 217 | 218 | # COMMAND ---------- 219 | 220 | max_length = 90 221 | 222 | tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased') 223 | 224 | def tokenize(batch): 225 | 226 | return tokenizer(batch['text'], 227 | truncation = True, 228 | padding = 'max_length', 229 | max_length = max_length) 230 | 231 | # The default batch size is also 1000 but this can be changed. 232 | train_test_tokenized = train_test.map(tokenize, batched=True, batch_size=1000) 233 | 234 | train_test_tokenized.set_format("torch", columns=['input_ids', 'attention_mask', 'label']) 235 | 236 | # COMMAND ---------- 237 | 238 | # MAGIC %md Create a function that returns validation metrics 239 | 240 | # COMMAND ---------- 241 | 242 | from sklearn.metrics import precision_recall_fscore_support 243 | 244 | def compute_single_label_metrics(pred): 245 | """Calculate validation statistics for single-label classification 246 | """ 247 | 248 | labels = pred.label_ids 249 | preds = pred.predictions.argmax(-1) 250 | precision, recall, f1, _ = precision_recall_fscore_support(labels, 251 | preds, 252 | average='micro') 253 | return { 254 | 'f1': f1, 255 | 'precision': precision, 256 | 'recall': recall 257 | } 258 | 259 | # COMMAND ---------- 260 | 261 | # MAGIC %md Specify a model initialization function, configure a transformers Trainer, and execute the training loop. 262 | 263 | # COMMAND ---------- 264 | 265 | from transformers import Trainer, TrainingArguments, AutoModelForSequenceClassification 266 | 267 | # Transformer models should be fine tuned using a GPU-backed instance, 268 | # such as a single-node cluster with a GPU-backed virtual machine type. 
269 | device = torch.device("cuda" if torch.cuda.is_available() else "cpu") 270 | 271 | def model_init(): 272 | """Return a freshly instantiated model. This ensure that the model 273 | is trained from scratch, rather than training a previously 274 | instantiated model for additional epochs. 275 | """ 276 | 277 | return AutoModelForSequenceClassification.from_pretrained('distilbert-base-uncased', 278 | num_labels=77).to(device) 279 | 280 | training_args = {"output_dir": f'/_blog_results', 281 | "overwrite_output_dir": True, 282 | "per_device_train_batch_size": 64, 283 | "per_device_eval_batch_size": 64, 284 | "weight_decay": 0.01, 285 | "num_train_epochs": 5, 286 | "save_strategy": "epoch", 287 | "evaluation_strategy": "epoch", 288 | "logging_strategy": "epoch", 289 | "load_best_model_at_end": True, 290 | "save_total_limit": 2, 291 | "metric_for_best_model": "f1", 292 | "greater_is_better": True, 293 | "seed": 123, 294 | "report_to": 'none'} 295 | 296 | 297 | trainer = Trainer(model_init = model_init, 298 | args = TrainingArguments(**training_args), 299 | train_dataset = train_test_tokenized['train'], 300 | eval_dataset = train_test_tokenized['test'], 301 | compute_metrics = compute_single_label_metrics) 302 | 303 | # Execute training 304 | trainer.train() 305 | 306 | # Get evaluation metrics on test dataset 307 | evaluation_metrics = trainer.evaluate() 308 | 309 | # COMMAND ---------- 310 | 311 | for metric_name, metric_value in evaluation_metrics.items(): 312 | print(f"{metric_name}: {round(metric_value, 3)}") 313 | 314 | # COMMAND ---------- 315 | 316 | # MAGIC %md ##### GPU vs. CPU vs. Quantized CPU 317 | 318 | # COMMAND ---------- 319 | 320 | from transformers import pipeline 321 | from transformers.pipelines.pt_utils import KeyDataset 322 | import itertools 323 | from mlflow_model import get_predictions 324 | 325 | inference_dataset = KeyDataset(train_test['test'], 'text') 326 | 327 | inference_batch_size = 256 328 | truncation = True 329 | padding = 'max_length' 330 | 331 | # COMMAND ---------- 332 | 333 | # MAGIC %md GPU inference 334 | 335 | # COMMAND ---------- 336 | 337 | gpu_predictions = get_predictions(data=inference_dataset, 338 | model = trainer.model, 339 | tokenizer = tokenizer, 340 | batch_size = inference_batch_size, 341 | device = 0, 342 | truncation = truncation, 343 | padding = padding, 344 | max_length = max_length) 345 | 346 | gpu_predictions[0] 347 | 348 | # COMMAND ---------- 349 | 350 | # MAGIC %md CPU 351 | 352 | # COMMAND ---------- 353 | 354 | cpu_predictions = get_predictions(data=inference_dataset, 355 | model = trainer.model.to('cpu'), 356 | tokenizer = tokenizer, 357 | batch_size = inference_batch_size, 358 | device = -1, 359 | truncation = truncation, 360 | padding = padding, 361 | max_length = max_length) 362 | 363 | cpu_predictions[0] 364 | 365 | # COMMAND ---------- 366 | 367 | # MAGIC %md Quantized and CPU 368 | 369 | # COMMAND ---------- 370 | 371 | import torch.nn as nn 372 | from torch.quantization import quantize_dynamic 373 | 374 | quantized_model = quantize_dynamic(trainer.model.to("cpu"), 375 | {nn.Linear}, 376 | dtype=torch.qint8) 377 | 378 | # COMMAND ---------- 379 | 380 | quantized_predictions = get_predictions(data=inference_dataset, 381 | model = quantized_model, 382 | tokenizer = tokenizer, 383 | batch_size = inference_batch_size, 384 | device = -1, 385 | truncation = truncation, 386 | padding = padding, 387 | max_length = max_length) 388 | 389 | quantized_predictions[0] 390 | 391 | # COMMAND ---------- 392 | 393 | # MAGIC %md Size 
comparison 394 | 395 | # COMMAND ---------- 396 | 397 | from pathlib import Path 398 | 399 | non_quantized_state_dict = trainer.model.state_dict() 400 | quantized_state_dict = quantized_model.state_dict() 401 | 402 | tmp_path_non_quantized = Path("/non_quantized.pt") 403 | tmp_path_quantized = Path("/quantized.pt") 404 | 405 | torch.save(non_quantized_state_dict, tmp_path_non_quantized) 406 | torch.save(quantized_state_dict, tmp_path_quantized) 407 | 408 | def get_size_in_mb(tmp_path): 409 | return round(Path(tmp_path).stat().st_size / (1024 * 1024), 1) 410 | 411 | non_quantized_size_mb = get_size_in_mb(tmp_path_non_quantized) 412 | quantized_size_mb = get_size_in_mb(tmp_path_quantized) 413 | 414 | print(f"Non-quantized model size (mb): {non_quantized_size_mb}\nquantized model size (mb): {quantized_size_mb}") 415 | 416 | # COMMAND ---------- 417 | 418 | # MAGIC %md Validation metrics comparison 419 | 420 | # COMMAND ---------- 421 | 422 | from transformers import EvalPrediction 423 | 424 | non_quantized_eval_dataset = EvalPrediction(predictions = cpu_predictions, 425 | label_ids = train_test['test']['label']) 426 | 427 | quantized_eval_dataset = EvalPrediction(predictions = quantized_predictions, 428 | label_ids = train_test['test']['label']) 429 | 430 | non_quantized_eval_metrics = compute_single_label_metrics(non_quantized_eval_dataset) 431 | quantized_eval_metrics = compute_single_label_metrics(quantized_eval_dataset) 432 | 433 | print("non-quantized validation statistics:") 434 | for metric_name, metric_value in non_quantized_eval_metrics.items(): 435 | print(metric_name, round(metric_value, 3)) 436 | 437 | print("\nnon-quantized validation statistics:") 438 | for metric_name, metric_value in quantized_eval_metrics.items(): 439 | print(metric_name, round(metric_value, 3)) 440 | 441 | # COMMAND ---------- 442 | 443 | # MAGIC %md Applying a pretrain and fine tuned model for sentiment analysis 444 | 445 | # COMMAND ---------- 446 | 447 | sentiment_pipeline = pipeline('sentiment-analysis') 448 | 449 | # COMMAND ---------- 450 | 451 | records = ["Transformers on Databricks are the best!", 452 | "Without Delta, our data lake has devolved into a data swamp!"] 453 | 454 | for prediction in sentiment_pipeline(records): 455 | print(prediction) 456 | 457 | # COMMAND ---------- 458 | 459 | spark.sql("DROP TABLE IF EXISTS default.banking77_train_blog") 460 | -------------------------------------------------------------------------------- /data.py: -------------------------------------------------------------------------------- 1 | # Databricks notebook source 2 | # MAGIC %md ## Dataset creation for transformer model training and inference 3 | 4 | # COMMAND ---------- 5 | 6 | # MAGIC %pip install datasets 7 | 8 | # COMMAND ---------- 9 | 10 | from collections import namedtuple 11 | import numpy as np 12 | import pandas as pd 13 | from datasets import load_dataset 14 | from pyspark.sql.types import StructType, StructField, StringType, IntegerType, ArrayType, FloatType, LongType 15 | import pyspark.sql.functions as func 16 | from pyspark.sql.functions import col 17 | 18 | # COMMAND ---------- 19 | 20 | def dataset_to_dataframes(dataset_name:str): 21 | 22 | """ 23 | Given a transformers datasets name, download the dataset and 24 | return Spark DataFrame versions. Result include train and test 25 | dataframes as well as a dataframe of label index to string 26 | representation. 
27 | """ 28 | 29 | spark_datasets = namedtuple("spark_datasets", "train test labels") 30 | 31 | # Define Spark schemas 32 | single_label_schema = StructType([StructField("text", StringType(), False), 33 | StructField("label", LongType(), False) 34 | ]) 35 | 36 | multi_label_schema = StructType([StructField("text", StringType(), False), 37 | StructField("labels", ArrayType(FloatType()), False) 38 | ]) 39 | 40 | labels_schema = StructType([StructField("idx", IntegerType(), False), 41 | StructField("label", StringType(), False)]) 42 | 43 | if dataset_name == "sem_eval_2018_task_1": 44 | dataset = load_dataset(dataset_name, "subtask5.english") 45 | 46 | text_col = 'Tweet' 47 | non_label_cols = ['ID'] + [text_col] 48 | idx_and_labels = [col for col in dataset['train'].features.keys() if col not in non_label_cols] 49 | 50 | train_pd = pd.concat([dataset['train'].to_pandas(), 51 | dataset['validation'].to_pandas()]) 52 | train_pd['is_train'] = 1 53 | 54 | test_pd = dataset['test'].to_pandas() 55 | 56 | test_pd['is_train'] = 0 57 | 58 | train_test_pd = pd.concat([train_pd, test_pd]) 59 | 60 | train_test_pd['labels'] = train_test_pd[idx_and_labels].values.tolist() 61 | train_test_pd['labels'] = train_test_pd.labels.apply(lambda x: [1. if i is True else 0. for i in x]) 62 | train_test_pd.rename(columns = {text_col: "text"}, inplace=True) 63 | train_test_pd = train_test_pd[['text', 'labels', 'is_train']] 64 | 65 | train = spark.createDataFrame(train_test_pd[train_test_pd.is_train == 1][['text', 'labels']], schema=multi_label_schema) 66 | test = spark.createDataFrame(train_test_pd[train_test_pd.is_train == 0][['text', 'labels']], schema=multi_label_schema) 67 | 68 | else: 69 | dataset = load_dataset(dataset_name) 70 | 71 | train_pd = dataset['train'].to_pandas() 72 | test_pd = dataset['test'].to_pandas() 73 | 74 | train_pd = train_pd.sample(frac=1).reset_index(drop=True) 75 | test_pd = test_pd.sample(frac=1).reset_index(drop=True) 76 | 77 | train = spark.createDataFrame(train_pd, schema=single_label_schema) 78 | test = spark.createDataFrame(test_pd, schema=single_label_schema) 79 | 80 | idx_and_labels = dataset['train'].features['label'].names 81 | 82 | id2label = [(idx, label) for idx, label in enumerate(idx_and_labels)] 83 | labels = spark.createDataFrame(id2label, schema=labels_schema) 84 | 85 | return spark_datasets(train, test, labels) 86 | 87 | 88 | def get_token_length_counts(delta_table, group_count=True): 89 | 90 | token_lengths = (spark.table(delta_table).select("text") 91 | .withColumn("token_length", func.size(func.split(col("text"), " ")))) 92 | 93 | if group_count: 94 | 95 | return (token_lengths.groupBy("token_length").agg(func.count("*").alias("count")) 96 | .orderBy("token_length")) 97 | 98 | else: 99 | return token_lengths 100 | 101 | # COMMAND ---------- 102 | 103 | # MAGIC %md #### [Banking77 dataset](https://huggingface.co/datasets/banking77) 104 | # MAGIC Multi-class classification 105 | 106 | # COMMAND ---------- 107 | 108 | banking77_train = "default.banking77_train" 109 | banking77_test = "default.banking77_test" 110 | banking77_labels = "default.banking77_labels" 111 | 112 | # COMMAND ---------- 113 | 114 | banking77_dfs = dataset_to_dataframes("banking77") 115 | 116 | banking77_dfs.train.write.mode('overwrite').format('delta').saveAsTable(banking77_train) 117 | banking77_dfs.test.write.mode('overwrite').format('delta').saveAsTable(banking77_test) 118 | banking77_dfs.labels.write.mode('overwrite').format('delta').saveAsTable(banking77_labels) 119 | 120 | # COMMAND ---------- 
121 | 122 | banking77_train_df = spark.table(banking77_train) 123 | banking77_test_df = spark.table(banking77_test) 124 | banking77_labels_df = spark.table(banking77_labels) 125 | 126 | # COMMAND ---------- 127 | 128 | # MAGIC %md ##### Raw data 129 | 130 | # COMMAND ---------- 131 | 132 | display(banking77_train_df) 133 | 134 | # COMMAND ---------- 135 | 136 | # MAGIC %md ##### Labels 137 | 138 | # COMMAND ---------- 139 | 140 | display(banking77_labels_df) 141 | 142 | # COMMAND ---------- 143 | 144 | # MAGIC %md ##### Record counts 145 | 146 | # COMMAND ---------- 147 | 148 | print(f"train_cnt: {banking77_train_df.count()}, test_cnt: {banking77_test_df.count()}, labels_cnt: {banking77_labels_df.count()}") 149 | 150 | # COMMAND ---------- 151 | 152 | # MAGIC %md ##### Distribution of token lengths 153 | # MAGIC Note that this simple chart that counts the individual tokens when a text observation is split on whitespace is not sufficient for making decisions about the maximum sequence length when tokenizing the dataset. This is because the transformer tokenizer will split the data differently and will likely split individual words into multiple tokens. This will result in longer token lengths compared to what the chart indicates below. 154 | 155 | # COMMAND ---------- 156 | 157 | display(get_token_length_counts(banking77_train)) 158 | 159 | # COMMAND ---------- 160 | 161 | token_lengths = get_token_length_counts(banking77_train, group_count=False) 162 | 163 | print(f""" 164 | quantiles: {token_lengths.approxQuantile("token_length", [0.25, 0.5, 0.75], 0)} 165 | deciles: {token_lengths.approxQuantile("token_length", list(np.arange(0.1, 1, 0.1)), 0)} 166 | """) 167 | 168 | # COMMAND ---------- 169 | 170 | # MAGIC %md #### [IMDB dataset](https://huggingface.co/datasets/imdb) 171 | # MAGIC Binary classification 172 | 173 | # COMMAND ---------- 174 | 175 | imdb_train = "default.imdb_train" 176 | imdb_test = "default.imdb_test" 177 | imdb_labels = "default.imdb_labels" 178 | 179 | # COMMAND ---------- 180 | 181 | imdb_dfs = dataset_to_dataframes("imdb") 182 | 183 | imdb_dfs.train.write.mode('overwrite').format('delta').saveAsTable(imdb_train) 184 | imdb_dfs.test.write.mode('overwrite').format('delta').saveAsTable(imdb_test) 185 | imdb_dfs.labels.write.mode('overwrite').format('delta').saveAsTable(imdb_labels) 186 | 187 | # COMMAND ---------- 188 | 189 | imdb_train_df = spark.table(imdb_train) 190 | imdb_test_df = spark.table(imdb_test) 191 | imdb_labels_df = spark.table(imdb_labels) 192 | 193 | # COMMAND ---------- 194 | 195 | # MAGIC %md ##### Raw data 196 | 197 | # COMMAND ---------- 198 | 199 | display(imdb_train_df) 200 | 201 | # COMMAND ---------- 202 | 203 | # MAGIC %md ##### Labels 204 | 205 | # COMMAND ---------- 206 | 207 | display(imdb_labels_df) 208 | 209 | # COMMAND ---------- 210 | 211 | # MAGIC %md ##### Record counts 212 | 213 | # COMMAND ---------- 214 | 215 | print(f"train_cnt: {imdb_train_df.count()}, test_cnt: {imdb_test_df.count()}, labels_cnt: {imdb_dfs.labels.count()}") 216 | 217 | # COMMAND ---------- 218 | 219 | # MAGIC %md ##### Distribution of token lengths 220 | 221 | # COMMAND ---------- 222 | 223 | display(get_token_length_counts(imdb_train)) 224 | 225 | # COMMAND ---------- 226 | 227 | token_lengths = get_token_length_counts(imdb_train, group_count=False) 228 | 229 | print(f""" 230 | quantiles: {token_lengths.approxQuantile("token_length", [0.25, 0.5, 0.75], 0)} 231 | deciles: {token_lengths.approxQuantile("token_length", list(np.arange(0.1, 1, 0.1)), 0)} 232 | """) 
233 | 234 | # COMMAND ---------- 235 | 236 | # MAGIC %md #### [Tweet Emotions](https://huggingface.co/datasets/sem_eval_2018_task_1) 237 | # MAGIC Multi-label classification 238 | 239 | # COMMAND ---------- 240 | 241 | dataset = load_dataset("sem_eval_2018_task_1", "subtask5.english") 242 | 243 | # COMMAND ---------- 244 | 245 | tweet_emotions_train = "default.tweet_emotions_train" 246 | tweet_emotions_test = "default.tweet_emotions_test" 247 | tweet_emotions_labels = "default.tweet_emotions_labels" 248 | 249 | # COMMAND ---------- 250 | 251 | tweet_emotions_dfs = dataset_to_dataframes("sem_eval_2018_task_1") 252 | 253 | tweet_emotions_dfs.train.write.mode('overwrite').format('delta').saveAsTable(tweet_emotions_train) 254 | tweet_emotions_dfs.test.write.mode('overwrite').format('delta').saveAsTable(tweet_emotions_test) 255 | tweet_emotions_dfs.labels.write.mode('overwrite').format('delta').saveAsTable(tweet_emotions_labels) 256 | 257 | # COMMAND ---------- 258 | 259 | tweet_emotions_train_df = spark.table(tweet_emotions_train) 260 | tweet_emotions_test_df = spark.table(tweet_emotions_test) 261 | tweet_emotions_labels_df = spark.table(tweet_emotions_labels) 262 | 263 | # COMMAND ---------- 264 | 265 | # MAGIC %md ##### Raw data 266 | 267 | # COMMAND ---------- 268 | 269 | display(tweet_emotions_train_df) 270 | 271 | # COMMAND ---------- 272 | 273 | # MAGIC %md ##### Labels 274 | 275 | # COMMAND ---------- 276 | 277 | display(tweet_emotions_train_df) 278 | 279 | # COMMAND ---------- 280 | 281 | # MAGIC %md ##### Record counts 282 | 283 | # COMMAND ---------- 284 | 285 | print(f"train_cnt: {tweet_emotions_train_df.count()}, test_cnt: {tweet_emotions_test_df.count()}, labels_cnt: {tweet_emotions_labels_df.count()}") 286 | 287 | # COMMAND ---------- 288 | 289 | # MAGIC %md ##### Distribution of token lengths 290 | 291 | # COMMAND ---------- 292 | 293 | display(get_token_length_counts(tweet_emotions_train)) 294 | 295 | # COMMAND ---------- 296 | 297 | token_lengths = get_token_length_counts(tweet_emotions_train, group_count=False) 298 | 299 | print(f""" 300 | quantiles: {token_lengths.approxQuantile("token_length", [0.25, 0.5, 0.75], 0)} 301 | deciles: {token_lengths.approxQuantile("token_length", list(np.arange(0.1, 1, 0.1)), 0)} 302 | """) 303 | 304 | -------------------------------------------------------------------------------- /img/job_parameters.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/marshackVB/rapid_nlp_blog/b404d84da280ce295e4bb88796c4ef670290da98/img/job_parameters.png -------------------------------------------------------------------------------- /img/mlflow_model_comparisons.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/marshackVB/rapid_nlp_blog/b404d84da280ce295e4bb88796c4ef670290da98/img/mlflow_model_comparisons.png -------------------------------------------------------------------------------- /img/multiple_job_runs.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/marshackVB/rapid_nlp_blog/b404d84da280ce295e4bb88796c4ef670290da98/img/multiple_job_runs.png -------------------------------------------------------------------------------- /img/predictions.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/marshackVB/rapid_nlp_blog/b404d84da280ce295e4bb88796c4ef670290da98/img/predictions.png 
-------------------------------------------------------------------------------- /img/training_experiments.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/marshackVB/rapid_nlp_blog/b404d84da280ce295e4bb88796c4ef670290da98/img/training_experiments.png -------------------------------------------------------------------------------- /inference.py: -------------------------------------------------------------------------------- 1 | # Databricks notebook source 2 | # MAGIC %md ## Model inference workflow 3 | # MAGIC Paste an MLflow Experiment run id in the above text box and select "Run All" above. This notebook will register the associated model with the Model Registry and transition the model's stage to 'Production'. Then, the model will be loaded and applied for inference, writing predictions to a Delta table. 4 | 5 | # COMMAND ---------- 6 | 7 | # MAGIC %pip install -r requirements.txt 8 | 9 | # COMMAND ---------- 10 | 11 | import pickle 12 | from time import perf_counter 13 | 14 | import mlflow 15 | from mlflow.tracking import MlflowClient 16 | import numpy as np 17 | import pandas as pd 18 | from pyspark.sql.types import (StructType, 19 | StructField, 20 | ArrayType, 21 | StringType, 22 | FloatType, 23 | IntegerType) 24 | from pyspark.sql import DataFrame 25 | import pyspark.sql.functions as func 26 | 27 | from utils import (get_run_id, 28 | get_gpu_utilization) 29 | 30 | pd.set_option('display.max_colwidth', None) 31 | 32 | client = MlflowClient() 33 | 34 | # COMMAND ---------- 35 | 36 | # MAGIC %md Confirm the Experiment run id is valid 37 | 38 | # COMMAND ---------- 39 | 40 | dbutils.widgets.text("experiment_run_id", "") 41 | run_id = dbutils.widgets.get("experiment_run_id").strip() 42 | 43 | # COMMAND ---------- 44 | 45 | try: 46 | model_info = client.get_run(run_id).to_dictionary() 47 | except: 48 | raise Exception(f"Run id: {run_id} does not exist") 49 | 50 | model_info 51 | 52 | # COMMAND ---------- 53 | 54 | # MAGIC %md ##### View GPU memory availability and current consumption 55 | # MAGIC If GPU memory utilization is high, you may need to Detach & Re-attach the training notebook to clear the GPU's memory. This could occur if you just finished training a model with the current cluster. 
56 | 57 | # COMMAND ---------- 58 | 59 | get_gpu_utilization(memory_type='total') 60 | get_gpu_utilization(memory_type='used') 61 | get_gpu_utilization(memory_type='free') 62 | 63 | # COMMAND ---------- 64 | 65 | # MAGIC %md ##### Create a Model Registry entry if one does not exist 66 | 67 | # COMMAND ---------- 68 | 69 | client = MlflowClient() 70 | model_registry_name = "transformer_models" 71 | 72 | # Create a Model Registry entry if one does not exist 73 | try: 74 | client.get_registered_model(model_registry_name) 75 | print(" Registered model already exists") 76 | except: 77 | client.create_registered_model(model_registry_name) 78 | 79 | # COMMAND ---------- 80 | 81 | # MAGIC %md ##### Register the model and transition its stage to 'Production' 82 | 83 | # COMMAND ---------- 84 | 85 | # Get model experiment info 86 | model_info = client.get_run(run_id).to_dictionary() 87 | artifact_uri = model_info['info']['artifact_uri'] 88 | 89 | # Register the model 90 | registered_model = client.create_model_version(name=model_registry_name, 91 | source=artifact_uri + "/mlflow", 92 | run_id=run_id) 93 | 94 | # Promote the model to the "Production" stage 95 | promote_to_prod = client.transition_model_version_stage(name=model_registry_name, 96 | version = int(registered_model.version), 97 | stage="Production", 98 | archive_existing_versions=True) 99 | 100 | # COMMAND ---------- 101 | 102 | # MAGIC %md ##### Create a Pandas DataFrame of records to score by combining the training and test datasets used to fine tune the model 103 | 104 | # COMMAND ---------- 105 | 106 | def union_train_test(train_df:DataFrame, test_df:DataFrame) -> DataFrame: 107 | """Combine the training and testing datasets 108 | """ 109 | 110 | return (spark.table(train_df).withColumn("is_train", func.lit(1)) 111 | .unionAll( 112 | spark.table(test_df).withColumn("is_train", func.lit(0)) 113 | ) 114 | ) 115 | 116 | training_dataset = model_info['data']['params']['dataset'] 117 | 118 | if training_dataset == 'banking77': 119 | inference_df = union_train_test("default.banking77_train", "default.banking77_test") 120 | output_table_name = "default.banking77_predictions" 121 | 122 | elif training_dataset == 'imdb': 123 | inference_df = union_train_test("default.imdb_train", "default.imdb_test") 124 | output_table_name = "default.imdb_predictions" 125 | 126 | elif training_dataset == 'tweet_emotions': 127 | inference_df = union_train_test("default.tweet_emotions_train", "default.tweet_emotions_test") 128 | output_table_name = "default.tweet_emotions_predictions" 129 | 130 | else: 131 | raise Exception(f"Training and testing datasets are not known") 132 | 133 | 134 | inference_pd = inference_df.toPandas() 135 | 136 | print(f"Total records for inference: {inference_pd.iloc[:, 0].count():,}") 137 | 138 | # COMMAND ---------- 139 | 140 | display(inference_df) 141 | 142 | # COMMAND ---------- 143 | 144 | # MAGIC %md ##### Generate predictions and write results to Delta 145 | 146 | # COMMAND ---------- 147 | 148 | production_run_id = get_run_id(model_name = "transformer_models") 149 | 150 | # Download id to label mapping 151 | client.download_artifacts(production_run_id, 152 | "mlflow/artifacts/id2label.pickle", 153 | "/") 154 | 155 | id2label = pickle.load(open("/mlflow/artifacts/id2label.pickle", "rb")) 156 | 157 | # Load model 158 | loaded_model = mlflow.pyfunc.load_model(f"runs:/{production_run_id}/mlflow") 159 | 160 | # COMMAND ---------- 161 | 162 | # Combine input texts and predictions 163 | start_time = perf_counter() 164 | predictions 
= pd.concat([inference_pd, 165 | pd.DataFrame({"probabilities": loaded_model.predict(inference_pd[["text"]]).tolist()})], 166 | axis=1) 167 | inference_time = perf_counter() - start_time 168 | 169 | # Transform predictions and specify Spark DataFrame schema 170 | schema = StructType() 171 | 172 | if training_dataset == 'tweet_emotions': 173 | 174 | schema.add("text", StringType()) 175 | schema.add("all_label_indxs", ArrayType(FloatType())) 176 | schema.add("is_train", IntegerType()) 177 | schema.add("pred_proba_label_indxs", ArrayType(FloatType())) 178 | schema.add("predicted_label_indxs", ArrayType(IntegerType())) 179 | schema.add("predicted_labels", ArrayType(StringType())) 180 | schema.add("label_indxs", ArrayType(IntegerType())) 181 | schema.add("labels", ArrayType(StringType())) 182 | 183 | 184 | predictions.rename(columns={"labels": "all_label_indxs", 185 | "probabilities": "pred_proba_label_indxs"}, inplace = True) 186 | 187 | predictions['predicted_label_indxs'] = predictions.pred_proba_label_indxs.apply(lambda x: np.where(np.array(x) > 0.5)[0].tolist()) 188 | 189 | predictions['predicted_labels'] = predictions.predicted_label_indxs.apply(lambda x: [id2label[idx] for idx in x]) 190 | 191 | predictions['label_indxs'] = predictions.all_label_indxs.apply(lambda x: np.where(np.array(x) == 1.0)[0].tolist()) 192 | 193 | predictions['labels'] = predictions.label_indxs.apply(lambda x: [id2label[idx] for idx in x]) 194 | 195 | 196 | else: 197 | schema.add("text", StringType()) 198 | schema.add("label_indx", IntegerType()) 199 | schema.add("is_train", IntegerType()) 200 | schema.add("probabilities", ArrayType(FloatType())) 201 | schema.add("predicted_probability", FloatType()) 202 | schema.add("predicted_label_indx", IntegerType()) 203 | schema.add("predicted_label", StringType()) 204 | schema.add("label", StringType()) 205 | 206 | 207 | predictions.rename(columns={"label": "label_indx"}, inplace = True) 208 | 209 | predictions['predicted_probability'] = predictions.probabilities.apply(lambda x: max(x)) 210 | 211 | predictions['predicted_label_idx'] = predictions.apply(lambda x: x['probabilities'].index(x['predicted_probability']), axis=1) 212 | 213 | predictions['predicted_label'] = predictions.predicted_label_idx.apply(lambda x: id2label[x]) 214 | 215 | predictions['label'] = predictions.label_indx.apply(lambda x: id2label[x]) 216 | 217 | 218 | # Convert predictions to a Spark Dataframe and write to Delta 219 | predictions_spark = spark.createDataFrame(predictions, schema=schema) 220 | 221 | spark.sql(f"DROP TABLE IF EXISTS {output_table_name}") 222 | predictions_spark.write.format("delta").mode("overwrite").saveAsTable(output_table_name) 223 | 224 | display(spark.table(output_table_name)) 225 | 226 | # COMMAND ---------- 227 | 228 | print(f'Inference seconds: {round(inference_time, 2)}') 229 | 230 | # COMMAND ---------- 231 | 232 | get_gpu_utilization(memory_type='total') 233 | get_gpu_utilization(memory_type='used') 234 | get_gpu_utilization(memory_type='free') 235 | -------------------------------------------------------------------------------- /mlflow_model.py: -------------------------------------------------------------------------------- 1 | import mlflow 2 | from mlflow.pyfunc.model import PythonModelContext 3 | import numpy as np 4 | import pandas as pd 5 | import torch 6 | from transformers import pipeline, AutoModelForSequenceClassification, AutoTokenizer 7 | from transformers.pipelines.pt_utils import KeyDataset 8 | from typing import Optional, Union, List 9 | 10 | 11 | 12 | 
def get_predictions(data:Union[List, KeyDataset], model:AutoModelForSequenceClassification, tokenizer:AutoTokenizer, batch_size:str, 13 | device:int=0, padding:Union[bool, str]='longest', truncation:bool=True, max_length:int=512, 14 | function_to_apply:Optional[str]=None) -> np.array([[float]]): 15 | """ 16 | Create a transformers pipeline and perform inference on an input sequence of records. The pipeline 17 | is comprised of a tokenizer and a model as well as additional parameters that govern the tokenizers behavior and 18 | batching of input records. Given a list of text observations, the function will perform inference 19 | in batches and return an array of probabilities, one for each label. 20 | 21 | This function can be imported into a Notebook and used directly for testing/experimentation purposes. 22 | 23 | Although this project's examples operate on a list of sequences, this function can also be applied to a 24 | KeyDataset, which is created from a transformers Dataset... 25 | 26 | dataset_to_score = KeyDataset(transformers.Dataset, 'name_of_text_column') 27 | 28 | This method has the advantage of not requiring the full inference dataset to be persisted in memory. For 29 | more information see the link, https://huggingface.co/docs/transformers/v4.21.1/en/main_classes/pipelines#pipeline-batching. 30 | 31 | For information about the sequence classification pipeline, see the link, 32 | https://huggingface.co/docs/transformers/v4.21.1/en/main_classes/pipelines#transformers.TextClassificationPipeline 33 | 34 | Args: 35 | data: A list of text sequences with each sequence representing a single observation. 36 | model: A fine-tuned transformer model for sequence classification. 37 | tokenizer: The transformers tokenizer associated with the model. 38 | batch_size: The number of records to score at a time. If you run into GPU out of memory 39 | errors, you may need to decrease the batch size. 40 | device: Governs the device used for inference: -1 for CPU and 0 for GPU. At the time of this 41 | function's development, transformers pipelines cannot utilize multiple GPUs for inference. 42 | padding: Sets the padding strategy; defaults to the longest sequence in a batch. 43 | truncation: Indicates if sequences should be truncated if beyond a certain length. 44 | max_length: The maximum length of a sequence before it is truncated; defaults to 512, 45 | which is a common maximum length for many transformer models. Truncating longer 46 | sequences to shorter lengths speeds training and allows for larger batch sizes, 47 | potentially with degradation in predictive performance. 48 | function_to_apply: The type of transformation to apply to the logits output by the model, such 49 | as softmax or sigmoid. If this is not specified, the library will infer the 50 | correct transformation based on the label's shape determined when the model 51 | was trained. 52 | 53 | Returns: 54 | A numpy array of probabilities, one for each label value. The index position of a probability 55 | corresponds to its label. So, the element, 0, in the array corresponds to the label = 0. 
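    Example (illustrative; assumes a fine-tuned model and its tokenizer are already in scope):
        probabilities = get_predictions(["transformers are easy to run on Databricks"],
                                        model=model, tokenizer=tokenizer, batch_size=32)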
56 | """ 57 | 58 | inference_pipeline = pipeline(task = "text-classification", 59 | model = model, 60 | tokenizer = tokenizer, 61 | batch_size = batch_size, 62 | device = device, 63 | return_all_scores = True, 64 | function_to_apply = function_to_apply, 65 | framework = "pt") 66 | 67 | predictions = inference_pipeline(data, 68 | padding = padding, 69 | truncation = truncation, 70 | max_length = max_length) 71 | 72 | # Spark return type is ArrayType(FloatType()) 73 | predictions_to_array = [[round(dct['score'], 4) for dct in prediction] for prediction in predictions] 74 | 75 | 76 | return np.array(predictions_to_array) 77 | 78 | 79 | 80 | class MLflowModel(mlflow.pyfunc.PythonModel): 81 | """ 82 | Custom MLflow pyfunc model that performs transformer model inference. The model loads a tokenizer 83 | and fine-tuned model stored as MLflow model artifacts. These loaded artifacts are used to create 84 | a transformer pipeline. 85 | 86 | For a description of the mode's output, see the docstring associated with the get_predictions 87 | function. 88 | 89 | Args: 90 | inference_batch_size: The number of records to pass at a time to the model for inferece. 91 | truncation: Indicates if sequences should be truncated if beyond a certain length. 92 | padding: Sets the padding strategy; defaults to the longest sequence in a batch. 93 | max_length: The maximum length of a sequence before it is truncated; defaults to 512, 94 | which is a common maximum length for many transformer models. Truncating longer 95 | sequences to shorter lengths speeds training and allows for larger batch sizes, 96 | potentially with degradation in predictive performance. 97 | function_to_apply: The type of transformation to apply to the logits output by the model, such 98 | as softmax or sigmoid. If this is not specified, the library will infer the 99 | correct transformation based on the label's shape determined when the model 100 | was trained. 101 | """ 102 | 103 | def __init__(self, inference_batch_size:str, truncation:bool=True, padding:bool=True, max_length:int=512, 104 | function_to_apply:Optional[str]=None): 105 | 106 | self.inference_batch_size = inference_batch_size 107 | self.truncation = truncation 108 | self.padding = padding 109 | self.max_length = max_length 110 | self.function_to_apply = function_to_apply 111 | self.tokenizer = None 112 | self.model = None 113 | 114 | 115 | def load_context(self, context:PythonModelContext): 116 | """ 117 | This method is called once by MLflow when a model is loaded for inference using the 118 | mlflow.pyfunc.load_model() function. PythonModelContext is a class with a single 119 | attribute, artifacts:dict[str,str], that is referenceable through a class 120 | property, context.artifacts. A PythonModelContext instance is passed automatically 121 | by MLflow. 122 | """ 123 | 124 | # Both CPU and single-GPU inference are options using this custome MLFlow model, 125 | # though CPU-based inference will be drastically slower and you may need to decrease 126 | # the inference batch size when logging this model. 127 | device = torch.device("cuda" if torch.cuda.is_available() else "cpu") 128 | if device.type == "cpu": 129 | raise Exception("No GPU detected. 
Provision a GPU-backed instance to run model inference") 130 | 131 | # Load the tokenizer and model from MLflow 132 | self.tokenizer = AutoTokenizer.from_pretrained(context.artifacts['tokenizer']) 133 | self.model = AutoModelForSequenceClassification.from_pretrained(context.artifacts['model']) 134 | 135 | 136 | def predict(self, context:PythonModelContext, model_input:Union[pd.DataFrame, KeyDataset]) -> np.array([[float]]): 137 | """ 138 | Generate predictions given an input Pandas DataFrame containing a single feature column 139 | or a tranformers.KeyDataset. See the get_predictions function for more information. 140 | 141 | Args: 142 | context: A PythonModelContext instance passed automatically by MLflow. See the load_context() 143 | method for more information. 144 | model_input: Either a Pandas Dataframe or a transformers.KeyDataset. If passing a DataFrame, 145 | the expectation is that the DataFrame has only one column and that column contains 146 | the raw text to score. 147 | 148 | Returns: 149 | A numpy array of probabilities, one for each label value. The index position of a probability 150 | corresponds to its label. So, the element, 0, in the array corresponds to the label = 0. 151 | """ 152 | 153 | if isinstance(model_input, KeyDataset): 154 | # The KeyDataset can be passed directly to the transformers pipeline 155 | is_pandas = False 156 | 157 | elif isinstance(model_input, pd.DataFrame): 158 | # The Pandas Dataframe column will be converted to a list of string 159 | # before passed to the transformers pipeline 160 | is_pandas = True 161 | 162 | else: 163 | raise TypeError("Model input is neither a Pandas DataFrame nor a transformers KeyDataset") 164 | 165 | predictions = get_predictions(data=model_input[model_input.columns[0]].tolist() if is_pandas else model_input, 166 | model=self.model, 167 | tokenizer=self.tokenizer, 168 | batch_size=self.inference_batch_size, 169 | padding=self.padding, 170 | truncation=self.truncation, 171 | max_length=self.max_length, 172 | function_to_apply=self.function_to_apply) 173 | 174 | return predictions -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | numpy==1.20.* 2 | pandas==1.3.* 3 | scikit-learn==0.24.* 4 | torch==1.11.0+cu113 5 | transformers==4.21.* 6 | datasets==2.4.* 7 | mlflow == 1.27.* 8 | nvidia-ml-py3 -------------------------------------------------------------------------------- /trainer.py: -------------------------------------------------------------------------------- 1 | # Databricks notebook source 2 | # MAGIC %md ## Model training workflow 3 | # MAGIC This notebook trains transformer models on various datasets. Select a model and training dataset from the above drop-down menus and experiment with different training parameters. See the below cells for guidance on model tuning. This notebook can be run either interactively or as a job. The cluster type should be a single-node cluster using the ML GPU Runtime and a GPU-backed instance type. 4 | # MAGIC 5 | # MAGIC Note that the IMDB dataset has much longer sequence lengths than the other example datasets. It takes longer to train and is more susceptible to GPU out of memory errors. Consider decreasing the train_batch_size and eval_batch_size to 16 and increasing gradient_accumulation_steps to 4 as a starting point for this dataset. You can also experiment with truncating the sequences to a length below the default, 512. 
This will speed training and allow for larger batch sizes, potentially at some degradation in predictive performance. 6 | 7 | # COMMAND ---------- 8 | 9 | # MAGIC %pip install -q -r requirements.txt 10 | 11 | # COMMAND ---------- 12 | 13 | import pickle 14 | from pathlib import Path 15 | from sys import version_info 16 | 17 | from datasets import load_dataset, DatasetDict 18 | import numpy as np 19 | import mlflow 20 | from mlflow_model import MLflowModel 21 | from sklearn.metrics import precision_recall_fscore_support 22 | from scipy.stats import logistic 23 | import torch 24 | from transformers import (AutoConfig, 25 | AutoTokenizer, 26 | AutoModelForSequenceClassification, 27 | EarlyStoppingCallback, 28 | EvalPrediction, 29 | DataCollatorWithPadding, 30 | pipeline, 31 | TrainingArguments, 32 | Trainer) 33 | 34 | from utils import get_parquet_files, get_or_create_experiment, get_best_metrics, get_gpu_utilization 35 | 36 | mlflow.autolog(disable=True) 37 | 38 | device = torch.device("cuda" if torch.cuda.is_available() else "cpu") 39 | 40 | # COMMAND ---------- 41 | 42 | # MAGIC %md ##### View GPU memory availability and current consumption 43 | # MAGIC Clear the GPU memory between model runs by re-running the cell that pip installs dependencies from the requirements.txt file. Selecting Detach & Re-attach from the cluster icon will also clear the GPU memory. 44 | 45 | # COMMAND ---------- 46 | 47 | get_gpu_utilization(memory_type="total") 48 | get_gpu_utilization(memory_type="used") 49 | get_gpu_utilization(memory_type="free") 50 | 51 | # COMMAND ---------- 52 | 53 | # MAGIC %md ##### Specify widget values 54 | 55 | # COMMAND ---------- 56 | 57 | datasets = ["banking77", "imdb", "tweet_emotions"] 58 | 59 | supported_models = ["distilbert-base-uncased", 60 | "bert-base-uncased", 61 | "bert-base-cased", 62 | "distilroberta-base", 63 | "roberta-base", 64 | "microsoft/xtremedistil-l6-h256-uncased", 65 | "microsoft/xtremedistil-l6-h384-uncased", 66 | "microsoft/xtremedistil-l12-h384-uncased" 67 | ] 68 | 69 | dbutils.widgets.dropdown("dataset_name", datasets[0], datasets) 70 | dbutils.widgets.dropdown("model_type", supported_models[0], supported_models) 71 | 72 | dbutils.widgets.text("train_batch_size", "64") 73 | dbutils.widgets.text("eval_batch_size", "64") 74 | dbutils.widgets.text("inference_batch_size", "256") 75 | 76 | dbutils.widgets.text("gradient_accumulation_steps", "1") 77 | dbutils.widgets.text("max_epochs", "10") 78 | dbutils.widgets.dropdown("fp16", "True", ["True", "False"]) 79 | dbutils.widgets.dropdown("group_by_length", "False", ["True", "False"]) 80 | 81 | dbutils.widgets.text("experiment_location", "transformer_experiments") 82 | 83 | 84 | dataset = dbutils.widgets.get("dataset_name") 85 | model_type = dbutils.widgets.get("model_type") 86 | 87 | train_batch_size = int(dbutils.widgets.get("train_batch_size")) 88 | eval_batch_size = int(dbutils.widgets.get("eval_batch_size")) 89 | inference_batch_size = int(dbutils.widgets.get("inference_batch_size")) 90 | 91 | gradient_accumulation_steps = int(dbutils.widgets.get("gradient_accumulation_steps")) 92 | max_epochs = int(dbutils.widgets.get("max_epochs")) 93 | 94 | fp16 = True if dbutils.widgets.get("fp16") == "True" else False 95 | group_by_length = True if dbutils.widgets.get("group_by_length") == "True" else False 96 | experiment_location = dbutils.widgets.get("experiment_location") 97 | 98 | # COMMAND ---------- 99 | 100 | print(f""" 101 | Widget parameter values: 102 | 103 | dataset: {dataset} 104 | model: {model_type} 105 | 
train_batch_size: {train_batch_size} 106 | eval_batch_size: {eval_batch_size} 107 | inference_batch_size: {inference_batch_size} 108 | gradient_accumulation_steps: {gradient_accumulation_steps} 109 | fp16: {fp16} 110 | group_by_length: {group_by_length} 111 | max_epochs: {max_epochs} 112 | experiment_location: {experiment_location}""") 113 | 114 | # COMMAND ---------- 115 | 116 | # MAGIC %md ##### Specify model, tokenizer, and training parameters 117 | # MAGIC 118 | # MAGIC See the [documentation](https://huggingface.co/docs/transformers/performance) and specifically the section on [single GPU training](https://huggingface.co/docs/transformers/perf_train_gpu_one) for performance tuning tips. Additionally, see the various [tokenization strategies](https://huggingface.co/docs/transformers/pad_truncation) available. 119 | # MAGIC 120 | # MAGIC Adjusting the below training arguments can have a large effect on training times and GPU memory consumption. 121 | # MAGIC 122 | # MAGIC - [per_device_train_batch_size](https://huggingface.co/docs/transformers/perf_train_gpu_one#vanilla-training): Experiment with 16, 32, 64 and 128. 123 | # MAGIC - [fp16](https://huggingface.co/docs/transformers/perf_train_gpu_one#fp16-training) 124 | # MAGIC - [gradient_accumulation_steps](https://huggingface.co/docs/transformers/perf_train_gpu_one#gradient-accumulation) 125 | # MAGIC - [group_by_length](#https://huggingface.co/docs/transformers/v4.21.2/en/main_classes/trainer#transformers.TrainingArguments.group_by_length) 126 | # MAGIC 127 | # MAGIC In addition, truncating longer sequences to shorter length will speed training time and reduce GPU memory consumption. This can be accomplished by adjusting the tokenizer such that "max_length" is less than 512 and "truncation = 'max_length'". 
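# MAGIC
# MAGIC For example (illustrative value), setting the tokenizer arguments defined below to `{"truncation": True, "padding": False, "max_length": 256}` truncates every sequence to at most 256 tokens.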
128 | 
129 | # COMMAND ----------
130 | 
131 | datasets_mapping = {"banking77": {"train": "default.banking77_train",
132 |                                   "test": "default.banking77_test",
133 |                                   "labels": "default.banking77_labels",
134 |                                   "num_labels": 77,
135 |                                   # Batch size for general model inference, outside of the training loop; this is
136 |                                   # the batch size used by the MLflow model
137 |                                   "inference_batch_size": inference_batch_size,
138 |                                   # Batch sizes for the training and evaluation steps of the training loop
139 |                                   "per_device_train_batch_size": train_batch_size,
140 |                                   "per_device_eval_batch_size": eval_batch_size,
141 |                                   "problem_type": "single_label_classification"
142 |                                   },
143 | 
144 |                     "imdb": {"train": "default.imdb_train",
145 |                              "test": "default.imdb_test",
146 |                              "labels": "default.imdb_labels",
147 |                              "num_labels": 2,
148 |                              "inference_batch_size": inference_batch_size,
149 |                              "per_device_train_batch_size": train_batch_size,
150 |                              "per_device_eval_batch_size": eval_batch_size,
151 |                              "problem_type": "single_label_classification"
152 |                              },
153 | 
154 |                     "tweet_emotions": {"train": "default.tweet_emotions_train",
155 |                                        "test": "default.tweet_emotions_test",
156 |                                        "labels": "default.tweet_emotions_labels",
157 |                                        "num_labels": 11,
158 |                                        "inference_batch_size": inference_batch_size,
159 |                                        "per_device_train_batch_size": train_batch_size,
160 |                                        "per_device_eval_batch_size": eval_batch_size,
161 |                                        "problem_type": "multi_label_classification"
162 |                                        }
163 |                     }
164 | 
165 | data_args = datasets_mapping[dataset]
166 | 
167 | model_args = {"feature_col": "text",
168 |               "num_labels": data_args["num_labels"],
169 |               "inference_batch_size": data_args["inference_batch_size"],
170 |               "problem_type": data_args["problem_type"]}
171 | 
172 | tokenizer_args = {"truncation": True,
173 |                   # Padding will be done at the batch level during training
174 |                   "padding": False,
175 |                   # 512 is the maximum sequence length accepted by these models
176 |                   "max_length": 512}
177 | 
178 | current_user = dbutils.notebook.entry_point.getDbutils().notebook().getContext().userName().get().split("@")[0]
179 | 
180 | training_args = {"output_dir": "/checkpoints",
181 |                  "overwrite_output_dir": True,
182 |                  "per_device_train_batch_size": data_args["per_device_train_batch_size"],
183 |                  "per_device_eval_batch_size": data_args["per_device_eval_batch_size"],
184 |                  "weight_decay": 0.01,
185 |                  "num_train_epochs": max_epochs,
186 |                  "save_strategy": "epoch",
187 |                  "evaluation_strategy": "epoch",
188 |                  "logging_strategy": "epoch",
189 |                  "load_best_model_at_end": True,
190 |                  "save_total_limit": 2,
191 |                  "metric_for_best_model": "f1",
192 |                  "greater_is_better": True,
193 |                  "seed": 123,
194 |                  "report_to": "none",
195 |                  "gradient_accumulation_steps": gradient_accumulation_steps,
196 |                  "fp16": fp16,
197 |                  "group_by_length": group_by_length}
198 | 
199 | # COMMAND ----------
200 | 
201 | # MAGIC %md ##### Create a [huggingface dataset](https://huggingface.co/course/chapter5/4?fw=pt) directly from the Delta tables' underlying parquet files.
202 | # MAGIC The huggingface datasets library will copy the training and test datasets to the driver node's disk and leverage [memory mapping](https://huggingface.co/course/chapter5/4?fw=pt) to efficiently read data from disk during training and inference. This prevents larger datasets from overwhelming the memory of your virtual machine.
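Once the next cell has built `train_test`, a quick probe along the following lines can confirm that the splits are memory-mapped Arrow files on local disk rather than objects held in RAM. This is an illustrative sketch; it assumes the `psutil` package is available on the driver (it typically is on Databricks ML runtimes).

```python
import psutil

# Each split is backed by Arrow cache files on the driver's local disk
print(train_test["train"].cache_files)
print(f'train rows: {train_test["train"].num_rows:,}')

# The driver process's resident memory stays well below the on-disk size of larger datasets
rss_mb = psutil.Process().memory_info().rss / (1024 * 1024)
print(f"Driver process RSS: {rss_mb:.0f} MB")
```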
203 | 
204 | # COMMAND ----------
205 | 
206 | train_table_name = data_args["train"]
207 | test_table_name = data_args["test"]
208 | labels_table_name = data_args["labels"]
209 | 
210 | train_files = get_parquet_files(train_table_name)
211 | test_files = get_parquet_files(test_table_name)
212 | 
213 | train_test = DatasetDict({"train": load_dataset("parquet",
214 |                                                  data_files=train_files,
215 |                                                  split="train"),
216 | 
217 |                           "test": load_dataset("parquet",
218 |                                                data_files=test_files,
219 |                                                split="train")})
220 | 
221 | labels = spark.table(labels_table_name)
222 | collected_labels = labels.collect()
223 | 
224 | id2label = {row.idx: row.label for row in collected_labels}
225 | label2id = {row.label: row.idx for row in collected_labels}
226 | 
227 | # COMMAND ----------
228 | 
229 | # MAGIC %md ##### Create an MLflow Experiment or use an existing Experiment
230 | 
231 | # COMMAND ----------
232 | 
233 | experiment_location = f"/Shared/{experiment_location}"
234 | get_or_create_experiment(experiment_location)
235 | 
236 | # COMMAND ----------
237 | 
238 | # MAGIC %md ##### Train models and log to MLflow
239 | # MAGIC The training cell below will generate a hyperlink that navigates to the Experiment run in MLflow.
240 | 
241 | # COMMAND ----------
242 | 
243 | tokenizer = AutoTokenizer.from_pretrained(model_type, use_fast=True)
244 | 
245 | model_config = AutoConfig.from_pretrained(model_type,
246 |                                           num_labels=model_args["num_labels"],
247 |                                           id2label=id2label,
248 |                                           label2id=label2id,
249 |                                           problem_type=model_args["problem_type"])
250 | 
251 | 
252 | def tokenize(batch):
253 |     """Tokenize input text in batches"""
254 | 
255 |     return tokenizer(batch[model_args["feature_col"]],
256 |                      truncation=tokenizer_args["truncation"],
257 |                      padding=tokenizer_args["padding"],
258 |                      max_length=tokenizer_args["max_length"])
259 | 
260 | 
261 | # The DataCollator will handle dynamic padding of batches during training. For a walkthrough, see
262 | # https://www.youtube.com/watch?v=-RPeakdlHYo. If not leveraging dynamic padding, this can be removed
263 | data_collator = DataCollatorWithPadding(tokenizer, padding=True)
264 | 
265 | # The default map batch size is 1,000; this can be changed by setting the "batch_size" parameter
266 | # https://huggingface.co/docs/datasets/process#batch-processing
267 | train_test_tokenized = train_test.map(tokenize, batched=True)
268 | train_test_tokenized.set_format("torch", columns=["input_ids", "attention_mask", "labels" if dataset == "tweet_emotions" else "label"])
269 | 
270 | 
271 | def model_init():
272 |     """Return a freshly instantiated model. This ensures that each
273 |     training run starts from the pretrained checkpoint, rather than
274 |     continuing to train a previously instantiated model for additional epochs.
275 |     """
276 | 
277 |     return AutoModelForSequenceClassification.from_pretrained(model_type,
278 |                                                                config=model_config).to(device)
279 | 
280 | 
281 | 
282 | def compute_single_label_metrics(pred: EvalPrediction) -> dict[str, float]:
283 |     """Calculate validation statistics for single-label classification
284 |     problems. The function accepts a transformers EvalPrediction object.
285 | 
286 |     https://huggingface.co/docs/transformers/internal/trainer_utils#transformers.EvalPrediction
287 |     """
288 | 
289 |     labels = pred.label_ids
290 |     preds = pred.predictions.argmax(-1)
291 |     precision, recall, f1, _ = precision_recall_fscore_support(labels,
292 |                                                                preds,
293 |                                                                average="micro")
294 |     return {
295 |         "f1": f1,
296 |         "precision": precision,
297 |         "recall": recall
298 |     }
299 | 
300 | 
301 | def compute_multi_label_metrics(pred: EvalPrediction) -> dict[str, float]:
302 | 
303 |     """Calculate validation statistics for multi-label classification
304 |     problems. The function accepts a transformers EvalPrediction object.
305 |     """
306 | 
307 |     labels = pred.label_ids
308 |     preds = logistic.cdf(pred.predictions)  # sigmoid over the raw logits
309 |     preds = np.where(preds >= 0.5, 1., 0.)
310 | 
311 |     precision, recall, f1, _ = precision_recall_fscore_support(labels,
312 |                                                                preds,
313 |                                                                average="micro")
314 | 
315 |     return {
316 |         "f1": f1,
317 |         "precision": precision,
318 |         "recall": recall
319 |     }
320 | 
321 | # The early stopping threshold is an absolute change in the monitored metric, not a percentage
322 | early_stopping_callback = EarlyStoppingCallback(early_stopping_patience=1, early_stopping_threshold=0.005)
323 | 
324 | trainer = Trainer(model_init=model_init,
325 |                   args=TrainingArguments(**training_args),
326 |                   train_dataset=train_test_tokenized["train"],
327 |                   eval_dataset=train_test_tokenized["test"],
328 |                   compute_metrics=compute_multi_label_metrics if model_args["problem_type"] == "multi_label_classification"
329 |                                   else compute_single_label_metrics,
330 |                   data_collator=data_collator,
331 |                   callbacks=[early_stopping_callback])
332 | 
333 | 
334 | with mlflow.start_run(run_name=model_type) as run:
335 | 
336 |     run_id = run.info.run_id
337 | 
338 |     result = trainer.train()
339 | 
340 |     # Save the trained model and tokenizer to the driver node; these will then be logged as
341 |     # MLflow artifacts
342 |     trainer.save_model("/model")
343 |     tokenizer.save_pretrained("/tokenizer")
344 | 
345 |     eval_result = trainer.evaluate()
346 | 
347 |     best_metrics = get_best_metrics(trainer)
348 | 
349 |     training_eval_metrics = {"model_size_mb": round(Path("/model/pytorch_model.bin").stat().st_size / (1024 * 1024), 1),
350 |                              "train_minutes": round(result.metrics["train_runtime"] / 60, 2),
351 |                              "train_samples_per_second": round(result.metrics["train_samples_per_second"], 1),
352 |                              "train_steps_per_second": round(result.metrics["train_steps_per_second"], 2),
353 |                              "train_rows": train_test["train"].num_rows,
354 |                              "gpu_memory_total_mb": get_gpu_utilization(memory_type="total", print_only=False),
355 |                              "gpu_memory_used_mb": get_gpu_utilization(memory_type="used", print_only=False),
356 | 
357 |                              "eval_seconds": round(eval_result["eval_runtime"], 2),
358 |                              "eval_samples_per_second": round(eval_result["eval_samples_per_second"], 1),
359 |                              "eval_steps_per_second": round(eval_result["eval_steps_per_second"], 1),
360 |                              "eval_rows": train_test["test"].num_rows}
361 | 
362 |     all_metrics = dict(**best_metrics, **training_eval_metrics)
363 |     mlflow.log_metrics(all_metrics)
364 | 
365 |     python_version = f"{version_info.major}.{version_info.minor}.{version_info.micro}"
366 | 
367 |     other_params = {"dataset": dataset,
368 |                     "gpus": trainer.args._n_gpu,
369 |                     "best_checkpoint": trainer.state.best_model_checkpoint.split("/")[-1],
370 |                     "runtime_version": spark.conf.get("spark.databricks.clusterUsageTags.sparkVersion"),
371 |                     "python_version": python_version}
372 | 
373 |     all_params = dict(**model_args, **tokenizer_args, **training_args, **other_params)
374 | 
375 |     mlflow.log_params(all_params)
376 | 
377 |     # Construct the model's conda environment from the requirements.txt file
378 |     with open("requirements.txt", "r") as additional_requirements:
379 |         libraries = additional_requirements.readlines()
380 |         libraries = [library.rstrip() for library in libraries]
381 | 
382 |     model_env = mlflow.pyfunc.get_default_conda_env()
383 |     # Replace the default mlflow dependency with the pinned version from requirements.txt
384 |     model_env["dependencies"][-1]["pip"].remove("mlflow")
385 |     model_env["dependencies"][-1]["pip"] += libraries
386 | 
387 |     with open("/id2label.pickle", "wb") as handle:
388 |         pickle.dump(id2label, handle)
389 | 
390 |     artifacts = {"tokenizer": "/tokenizer",
391 |                  "model": "/model",
392 |                  "id2label": "/id2label.pickle"}
393 | 
394 |     # Create an instance of the custom MLflow model for inference
395 |     pipeline_model = MLflowModel(inference_batch_size=model_args["inference_batch_size"],
396 |                                  truncation=tokenizer_args["truncation"],
397 |                                  # Pad to the longest sequence in the batch during inference
398 |                                  padding="longest",
399 |                                  max_length=tokenizer_args["max_length"])
400 | 
401 |     mlflow.pyfunc.log_model(artifact_path="mlflow",
402 |                             python_model=pipeline_model,
403 |                             conda_env=model_env,
404 |                             artifacts=artifacts)
405 | 
406 |     print(f"""
407 |     MLflow Experiment run id: {run_id}
408 |     """)
409 | 
410 | # COMMAND ----------
411 | 
412 | # MAGIC %md ##### View GPU memory availability and current consumption
413 | 
414 | # COMMAND ----------
415 | 
416 | get_gpu_utilization(memory_type="total")
417 | get_gpu_utilization(memory_type="used")
418 | get_gpu_utilization(memory_type="free")
419 | 
--------------------------------------------------------------------------------
/utils.py:
--------------------------------------------------------------------------------
1 | import json
2 | from time import perf_counter
3 | from argparse import Namespace
4 | from typing import List, Tuple, Dict
5 | from pyspark.sql import SparkSession
6 | from pyspark.dbutils import DBUtils
7 | from pynvml import *
8 | import mlflow
9 | from mlflow.tracking import MlflowClient
10 | 
11 | spark = SparkSession.builder.getOrCreate()
12 | dbutils = DBUtils(spark)
13 | 
14 | client = MlflowClient()
15 | 
16 | 
17 | def get_parquet_files(table_name:str) -> List[str]:
18 |     """
19 |     Given the name of a table backed by parquet files, return a list of
20 |     the underlying parquet file paths that can be read by the huggingface
21 |     datasets library.
22 | 
23 |     Args:
24 |         table_name: The name of the table, optionally qualified by its
25 |             database, for example 'default.banking77_train'.
26 | 
27 |     Returns:
28 |         A list of parquet file paths, rewritten so they are readable
29 |         from the local /dbfs mount.
30 |     """
31 | 
32 |     files = spark.table(f'{table_name}').inputFiles()
33 | 
34 |     if files[0][:4] == 'dbfs':
35 |         files = [file.replace('dbfs:', '/dbfs/') for file in files]
36 | 
37 |     return files
38 | 
39 | 
40 | def get_best_metrics(trainer) -> Dict[str, float]:
41 |     """
42 |     Extract metrics from a fitted Trainer instance.
43 | 
44 |     Args:
45 |         trainer: A Trainer instance that has been trained on data.
46 | 
47 |     Returns:
48 |         A dictionary of metrics and their values for the best training epoch.
49 |     """
50 | 
51 |     # Best model metrics
52 |     best_checkpoint = f'{trainer.state.best_model_checkpoint}/trainer_state.json'
53 | 
54 |     with open(best_checkpoint) as f:
55 |         metrics = json.load(f)
56 | 
57 |     best_epoch = round(metrics['epoch'], 1)
58 | 
59 |     # These are dropped because they are instead sourced from calling trainer.evaluate()
60 |     metrics_to_drop = ['eval_runtime', 'eval_samples_per_second', 'eval_steps_per_second', 'step']
61 | 
62 |     best_loss_metrics, best_eval_metrics = [eval_metrics for eval_metrics in metrics['log_history'] if round(eval_metrics['epoch'], 1) == best_epoch]
63 |     best_eval_metrics = {metric_name: round(metric_value, 4) for metric_name, metric_value in best_eval_metrics.items() if metric_name not in metrics_to_drop}
64 | 
65 |     best_metrics = {**best_loss_metrics, **best_eval_metrics}
66 | 
67 |     best_metrics['best_model_epoch'] = round(best_metrics.pop('epoch'), 1)
68 | 
69 | 
70 |     return best_metrics
71 | 
72 | 
73 | def get_or_create_experiment(experiment_location: str) -> None:
74 |     """
75 |     Given an experiment path, check to see if an experiment exists at that location.
76 |     If not, create a new experiment. Then set the notebook to log all runs to the
77 |     specified experiment location.
78 | 
79 |     Args:
80 |         experiment_location: The path to the MLflow Experiment,
81 |             viewable in the upper left hand corner of the Experiment's UI.
82 |     """
83 | 
84 |     if not mlflow.get_experiment_by_name(experiment_location):
85 |         print("Experiment does not exist. Creating experiment")
86 | 
87 |         mlflow.create_experiment(experiment_location)
88 | 
89 |     mlflow.set_experiment(experiment_location)
90 | 
91 | 
92 | def get_model_info(model_name:str, stage:str):
93 |     """Return the Model Registry entry for the given model name and stage."""
94 |     run_info = [run for run in client.search_model_versions(f"name='{model_name}'")
95 |                 if run.current_stage == stage][0]
96 | 
97 |     return run_info
98 | 
99 | 
100 | def get_run_id(model_name:str, stage:str='Production') -> str:
101 |     """Given a model's name, return its run id from the Model Registry; this assumes the model
102 |     has been registered
103 | 
104 |     Args:
105 |         model_name: The name of the model; this is the name used to register the model.
106 |         stage: The stage of the model version in the registry you want returned, e.g. 'Production'.
107 | 
108 |     Returns:
109 |         The run id of the model; this can be used to load the model for inference
110 | 
111 |     """
112 | 
113 |     return get_model_info(model_name, stage).run_id
114 | 
115 | 
116 | def get_artifact_path(model_name:str, stage:str='Production') -> str:
117 |     """Given a model's name, return its artifact directory path from the Model Registry;
118 |     this assumes the model has been registered
119 | 
120 |     Args:
121 |         model_name: The name of the model; this is the name used to register the model.
122 |         stage: The stage of the model version in the registry you want returned, e.g. 'Production'.
123 | 
124 |     Returns:
125 |         The artifact directory path
126 | 
127 |     """
128 | 
129 |     run_info = get_model_info(model_name, stage)
130 | 
131 |     artifact_path = run_info.source
132 |     drop_last_dir = artifact_path.split('/')[:-1]
133 |     artifact_path = ('/').join(drop_last_dir)
134 | 
135 |     return artifact_path
136 | 
137 | 
138 | def get_gpu_utilization(memory_type='used', print_only=True):
139 |     """Print, or return when print_only=False, GPU memory utilization in MB
140 |     https://huggingface.co/docs/transformers/perf_train_gpu_one#efficient-training-on-a-single-gpu
141 |     """
142 | 
143 |     nvmlInit()
144 |     handle = nvmlDeviceGetHandleByIndex(0)
145 |     info = nvmlDeviceGetMemoryInfo(handle)
146 | 
147 |     if memory_type == 'total':
148 |         return_value = info.total//1024**2
149 |         return_string = f'GPU memory total: {return_value} MB.'
150 |     elif memory_type == 'free':
151 |         return_value = info.free//1024**2
152 |         return_string = f'GPU memory free: {return_value} MB.'
153 |     elif memory_type == 'used':
154 |         return_value = info.used//1024**2
155 |         return_string = f'GPU memory used: {return_value} MB.'
156 | 
157 |     if print_only:
158 |         print(return_string)
159 |     else:
160 |         return return_value
161 | 
--------------------------------------------------------------------------------
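For reference, a typical way to use these helpers from the inference side might look like the sketch below. The registered model name `transformer_classifier` is a placeholder for whatever name you register the trained model under; the `mlflow` artifact path matches the path used by `mlflow.pyfunc.log_model` in the trainer notebook.

```python
import mlflow
from utils import get_run_id, get_gpu_utilization

# Look up the run backing the Production version of a registered model
run_id = get_run_id("transformer_classifier", stage="Production")

# Load the custom pyfunc model logged under the "mlflow" artifact path
model = mlflow.pyfunc.load_model(f"runs:/{run_id}/mlflow")

# Check free GPU memory before scoring a large batch
free_mb = get_gpu_utilization(memory_type="free", print_only=False)
print(f"Free GPU memory: {free_mb} MB")
```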