├── .gitignore ├── README.md └── monitoring.ipynb /.gitignore: -------------------------------------------------------------------------------- 1 | # Data 2 | logs/ 3 | stores/ 4 | 5 | # VSCode 6 | .vscode/ 7 | .idea 8 | 9 | # Byte-compiled / optimized / DLL files 10 | __pycache__/ 11 | *.py[cod] 12 | *$py.class 13 | 14 | # C extensions 15 | *.so 16 | 17 | # Distribution / packaging 18 | .Python 19 | build/ 20 | develop-eggs/ 21 | dist/ 22 | downloads/ 23 | eggs/ 24 | .eggs/ 25 | lib/ 26 | lib64/ 27 | parts/ 28 | sdist/ 29 | var/ 30 | wheels/ 31 | pip-wheel-metadata/ 32 | share/python-wheels/ 33 | *.egg-info/ 34 | .installed.cfg 35 | *.egg 36 | MANIFEST 37 | 38 | # PyInstaller 39 | *.manifest 40 | *.spec 41 | 42 | # Installer logs 43 | pip-log.txt 44 | pip-delete-this-directory.txt 45 | 46 | # Unit test / coverage reports 47 | htmlcov/ 48 | .tox/ 49 | .nox/ 50 | .coverage 51 | .coverage.* 52 | .cache 53 | nosetests.xml 54 | coverage.xml 55 | *.cover 56 | *.py,cover 57 | .hypothesis/ 58 | .pytest_cache/ 59 | 60 | # Flask: 61 | instance/ 62 | .webassets-cache 63 | 64 | # Scrapy: 65 | .scrapy 66 | 67 | # Sphinx 68 | docs/_build/ 69 | 70 | # PyBuilder 71 | target/ 72 | 73 | # IPython 74 | .ipynb_checkpoints 75 | profile_default/ 76 | ipython_config.py 77 | 78 | # pyenv 79 | .python-version 80 | 81 | # PEP 582 82 | __pypackages__/ 83 | 84 | # Celery 85 | celerybeat-schedule 86 | celerybeat.pid 87 | 88 | # Environment 89 | .env 90 | .venv 91 | env/ 92 | venv/ 93 | ENV/ 94 | env.bak/ 95 | venv.bak/ 96 | 97 | # mkdocs 98 | site/ 99 | 100 | # Airflow 101 | airflow/airflow.db 102 | 103 | # MacOS 104 | .DS_Store 105 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Monitoring ML 2 | 3 | Learn how to monitor ML systems to identify and address sources of drift before model performance decay. 4 | 5 |
11 |
12 | 13 |
14 | 15 | 👉  This repository contains the [interactive notebook](https://colab.research.google.com/github/GokuMohandas/monitoring-ml/blob/main/monitoring.ipynb) that complements the [monitoring lesson](https://madewithml.com/courses/mlops/monitoring/), which is a part of the [MLOps course](https://github.com/GokuMohandas/mlops-course). If you haven't already, be sure to check out the [lesson](https://madewithml.com/courses/mlops/monitoring/) because all the concepts are covered extensively and tied to software engineering best practices for building ML systems. 16 | 17 |
18 |   19 |   20 | Open In Colab 21 |
22 | 23 |
- [Performance](#performance)
- [Drift](#drift)
    - [Data drift](#data-drift)
    - [Target drift](#target-drift)
    - [Concept drift](#concept-drift)
    - [Locating drift](#locating-drift)
    - [Measuring drift](#measuring-drift)
        - [Expectations](#expectations)
        - [Univariate](#univariate)
        - [Multivariate](#multivariate)
- [Online](#online)

## Performance

A key aspect of monitoring ML systems involves monitoring the actual performance of our deployed models. These could be quantitative evaluation metrics that we used during model evaluation (accuracy, precision, f1, etc.) but also key business metrics that the model influences (ROI, click rate, etc.). It's rarely enough to just analyze the cumulative performance metrics across the entire span of time since the model has been deployed. Instead, we should also inspect performance across a period of time that's significant for our application (ex. daily). These sliding metrics might be more indicative of our system's health, and we might be able to identify issues faster by not obscuring them with historical data.

```python
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
sns.set_theme()
```
```python
# Generate data
hourly_f1 = list(np.random.randint(low=94, high=98, size=24*20)) + \
            list(np.random.randint(low=92, high=96, size=24*5)) + \
            list(np.random.randint(low=88, high=96, size=24*5)) + \
            list(np.random.randint(low=86, high=92, size=24*5))
```
```python
# Cumulative f1
cumulative_f1 = [np.mean(hourly_f1[:n]) for n in range(1, len(hourly_f1)+1)]
print (f"Average cumulative f1 on the last day: {np.mean(cumulative_f1[-24:]):.1f}")
```
 60 | Average cumulative f1 on the last day: 93.7
 61 | 
62 | ```python 63 | # Sliding f1 64 | window_size = 24 65 | sliding_f1 = np.convolve(hourly_f1, np.ones(window_size)/window_size, mode="valid") 66 | print (f"Average sliding f1 on the last day: {np.mean(sliding_f1[-24:]):.1f}") 67 | ``` 68 |
 69 | Average sliding f1 on the last day: 88.6
 70 | 
71 | ```python 72 | plt.ylim([80, 100]) 73 | plt.hlines(y=90, xmin=0, xmax=len(hourly_f1), colors="blue", linestyles="dashed", label="threshold") 74 | plt.plot(cumulative_f1, label="cumulative") 75 | plt.plot(sliding_f1, label="sliding") 76 | plt.legend() 77 | ``` 78 | 79 |
80 | performance drift 81 |
82 | 83 | ## Drift 84 | 85 | We need to first understand the different types of issues that can cause our model's performance to decay (model drift). The best way to do this is to look at all the moving pieces of what we're trying to model and how each one can experience drift. 86 | 87 |
88 | 89 | | Entity | Description | Drift | 90 | | :------------------- | :--------------------------------------- | :------------------------------------------------------------------ | 91 | | $X$ | inputs (features) | data drift $\rightarrow P(X) \neq P_{ref}(X)$ | 92 | | $y$ | outputs (ground-truth) | target drift $\rightarrow P(y) \neq P_{ref}(y)$ | 93 | | $P(y \vert X)$ | actual relationship between $X$ and $y$ | concept drift $\rightarrow P(y \vert X) \neq P_{ref}(y \vert X)$ | 94 | 95 |
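To make the table concrete, here's a small toy simulation (not from the lesson) of what each type of drift looks like on a single feature `x` with label `y`:

```python
import numpy as np

# Toy illustration: reference data where y depends on the sign of x
rng = np.random.default_rng(0)
x_ref = rng.normal(0, 1, 1000)
y_ref = (x_ref > 0).astype(int)

x_prod = rng.normal(2, 1, 1000)                       # data drift: P(X) shifts
y_prod = rng.choice([0, 1], size=1000, p=[0.9, 0.1])  # target drift: P(y) shifts
y_flipped = (x_ref < 0).astype(int)                   # concept drift: P(y|X) changes
```

The detectors in the sections below quantify exactly these kinds of differences between a reference window and a test window.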
### Data drift

Data drift, also known as feature drift or covariate shift, occurs when the distribution of the *production* data is different from the *training* data. The model is not equipped to deal with this drift in the feature space, so its predictions may not be reliable. The actual cause of drift can be attributed to natural changes in the real world but also to systemic issues such as missing data, pipeline errors, schema changes, etc. It's important to inspect the drifted data and trace it back along its pipeline to identify when and where the drift was introduced.
102 | data drift 103 |
104 |
105 | Data drift can occur in either continuous or categorical features. 106 |
### Target drift

Besides just the input data changing, as with data drift, we can also experience drift in our outcomes. This can be a shift in the distributions but also the removal of existing classes or the addition of new ones in categorical tasks. Though retraining can mitigate the performance decay caused by target drift, it can often be avoided with proper inter-pipeline communication about new classes, schema changes, etc.

### Concept drift

Besides the input and output data drifting, we can have the actual relationship between them drift as well. This concept drift renders our model ineffective because the patterns it learned to map between the original inputs and outputs are no longer relevant. Concept drift can occur in [various patterns](https://link.springer.com/article/10.1007/s11227-018-2674-1):
118 | concept drift 119 |
120 | 121 |
- gradually over a period of time
- abruptly as a result of an external event
- periodically as a result of recurring events

> All of the different types of drift we discussed can occur simultaneously, which can complicate identifying the sources of drift.

### Locating drift

Now that we've identified the different types of drift, we need to learn how to locate drift and how often to measure it. Here are the constraints we need to consider:

- **reference window**: the set of points to compare production data distributions with to identify drift.
- **test window**: the set of points to compare with the reference window to determine if drift has occurred.

Since we're dealing with online drift detection (i.e. detecting drift in live production data as opposed to past batch data), we can employ either a [fixed or sliding window approach](https://onlinelibrary.wiley.com/doi/full/10.1002/widm.1381) to identify our set of points for comparison. Typically, the reference window is a fixed, recent subset of the training data while the test window slides over time.

### Measuring drift

Once we have the window of points we wish to compare, we need to know how to compare them.

```python
import great_expectations as ge
import json
import pandas as pd
from urllib.request import urlopen
```
```python
# Load labeled projects
projects = pd.read_csv("https://raw.githubusercontent.com/GokuMohandas/Made-With-ML/main/datasets/projects.csv")
tags = pd.read_csv("https://raw.githubusercontent.com/GokuMohandas/Made-With-ML/main/datasets/tags.csv")
df = ge.dataset.PandasDataset(pd.merge(projects, tags, on="id"))
df["text"] = df.title + " " + df.description
df.drop(["title", "description"], axis=1, inplace=True)
df.head(5)
```
159 | 160 | 161 | 162 | 163 | 164 | 165 | 166 | 167 | 168 | 169 | 170 | 171 | 172 | 173 | 174 | 175 | 176 | 177 | 178 | 179 | 180 | 181 | 182 | 183 | 184 | 185 | 186 | 187 | 188 | 189 | 190 | 191 | 192 | 193 | 194 | 195 | 196 | 197 | 198 | 199 | 200 | 201 | 202 | 203 | 204 | 205 | 206 |
idcreated_ontagtext
062020-02-20 06:43:18computer-visionComparison between YOLO and RCNN on real world...
172020-02-20 06:47:21computer-visionShow, Infer & Tell: Contextual Inference for C...
292020-02-24 16:24:45graph-learningAwesome Graph Classification A collection of i...
3152020-02-28 23:55:26reinforcement-learningAwesome Monte Carlo Tree Search A curated list...
4192020-03-03 13:54:31graph-learningDiffusion to Vector Reference implementation o...
207 |
### Expectations

The first line of measurement can be rule-based, such as validating [expectations](https://docs.greatexpectations.io/en/latest/reference/glossary_of_expectations.html) around missing values, data types, value ranges, etc., as we did in our [data testing lesson](https://madewithml.com/courses/mlops/testing#expectations). These can be done with or without a reference window and using the [mostly argument](https://docs.greatexpectations.io/en/latest/reference/core_concepts/expectations/standard_arguments.html#mostly) for some level of tolerance.

```python
# Simulated production data
prod_df = ge.dataset.PandasDataset([{"text": "hello"}, {"text": 0}, {"text": "world"}])
```
```python
# Expectation suite
df.expect_column_values_to_not_be_null(column="text")
df.expect_column_values_to_be_of_type(column="text", type_="str")
expectation_suite = df.get_expectation_suite()
```
```python
# Validate reference data
df.validate(expectation_suite=expectation_suite, only_return_failures=True)["statistics"]
```

```json
{"evaluated_expectations": 2,
 "success_percent": 100.0,
 "successful_expectations": 2,
 "unsuccessful_expectations": 0}
```

```python
# Validate production data
prod_df.validate(expectation_suite=expectation_suite, only_return_failures=True)["statistics"]
```

```json
{"evaluated_expectations": 2,
 "success_percent": 50.0,
 "successful_expectations": 1,
 "unsuccessful_expectations": 1}
```

Once we've validated our rule-based expectations, we need to quantitatively measure drift across the different features in our data.

### Univariate

Our task may involve univariate (1D) features that we will want to monitor. While there are many types of hypothesis tests we can use, a popular option is the [Kolmogorov-Smirnov (KS) test](https://en.wikipedia.org/wiki/Kolmogorov%E2%80%93Smirnov_test).

#### Kolmogorov-Smirnov (KS) test

The KS test determines the maximum distance between two distributions' cumulative distribution functions. Here, we'll measure whether there is any drift in the size of our input text feature between two different data subsets.

```python
from alibi_detect.cd import KSDrift
```

```python
# Reference
df["num_tokens"] = df.text.apply(lambda x: len(x.split(" ")))
ref = df["num_tokens"][0:200].to_numpy()
plt.hist(ref, alpha=0.75, label="reference")
plt.legend()
plt.show()
```

```python
# Initialize drift detector
length_drift_detector = KSDrift(ref, p_val=0.01)
```

```python
# No drift
no_drift = df["num_tokens"][200:400].to_numpy()
plt.hist(ref, alpha=0.75, label="reference")
plt.hist(no_drift, alpha=0.5, label="test")
plt.legend()
plt.show()
```
285 | no drift with KS test 286 |
287 |
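For intuition on what the detector computes, we can run the same comparison directly with scipy's two-sample KS test (an illustrative aside, not part of the original lesson; `KSDrift` additionally applies multiple-testing corrections when there are many features):

```python
from scipy import stats

# Two-sample KS test between the reference and test windows defined above
statistic, p_value = stats.ks_2samp(ref, no_drift)
print(f"KS statistic: {statistic:.2f}, p-value: {p_value:.3f}")
```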
288 | 289 | ```python 290 | length_drift_detector.predict(no_drift, return_p_val=True, return_distance=True) 291 | ``` 292 | 293 | ```json 294 | {"data": {"distance": array([0.09], dtype=float32), 295 | "is_drift": 0, 296 | "p_val": array([0.3927307], dtype=float32), 297 | "threshold": 0.01}, 298 | "meta": {"data_type": None, 299 | "detector_type": "offline", 300 | "name": "KSDrift", 301 | "version": "0.9.1"}} 302 | ``` 303 | 304 | > ↓ p-value = ↑ confident that the distributions are different. 305 | 306 | ```python 307 | # Drift 308 | drift = np.random.normal(30, 5, len(ref)) 309 | plt.hist(ref, alpha=0.75, label="reference") 310 | plt.hist(drift, alpha=0.5, label="test") 311 | plt.legend() 312 | plt.show() 313 | ``` 314 | 315 |
316 | drift detection with KS 317 |
318 |
```python
length_drift_detector.predict(drift, return_p_val=True, return_distance=True)
```

```json
{"data": {"distance": array([0.63], dtype=float32),
  "is_drift": 1,
  "p_val": array([6.7101775e-35], dtype=float32),
  "threshold": 0.01},
 "meta": {"data_type": None,
  "detector_type": "offline",
  "name": "KSDrift",
  "version": "0.9.1"}}
```

#### Chi-squared test

Similarly, for categorical data (input features, targets, etc.), we can apply [Pearson's chi-squared test](https://en.wikipedia.org/wiki/Pearson%27s_chi-squared_test) to determine if the frequency of events in production is consistent with a reference distribution.

> We're creating a categorical variable for the # of tokens in our text feature, but we could very well apply it to the tag distribution itself, individual tags (binary), slices of tags, etc.

```python
from alibi_detect.cd import ChiSquareDrift
```

```python
# Reference
df["token_count"] = df.num_tokens.apply(lambda x: "small" if x <= 10 else ("medium" if x <= 25 else "large"))
ref = df.token_count[0:200].to_numpy()
plt.hist(ref, alpha=0.75, label="reference")
plt.legend()
```

```python
# Initialize drift detector
target_drift_detector = ChiSquareDrift(ref, p_val=0.01)
```

```python
# No drift
no_drift = df.token_count[200:400].to_numpy()
plt.hist(ref, alpha=0.75, label="reference")
plt.hist(no_drift, alpha=0.5, label="test")
plt.legend()
plt.show()
```
368 | no drift with chi squared test 369 |
370 |
371 | 372 | ```python 373 | target_drift_detector.predict(no_drift, return_p_val=True, return_distance=True) 374 | ``` 375 | 376 | ```json 377 | {"data": {"distance": array([4.135522], dtype=float32), 378 | "is_drift": 0, 379 | "p_val": array([0.12646863], dtype=float32), 380 | "threshold": 0.01}, 381 | "meta": {"data_type": None, 382 | "detector_type": "offline", 383 | "name": "ChiSquareDrift", 384 | "version": "0.9.1"}} 385 | ``` 386 | 387 | ```python 388 | # Drift 389 | drift = np.array(["small"]*80 + ["medium"]*40 + ["large"]*80) 390 | plt.hist(ref, alpha=0.75, label="reference") 391 | plt.hist(drift, alpha=0.5, label="test") 392 | plt.legend() 393 | plt.show() 394 | ``` 395 | 396 |
397 | drift detection with chi squared tests 398 |
399 |
400 | 401 | ```python 402 | target_drift_detector.predict(drift, return_p_val=True, return_distance=True) 403 | ``` 404 | 405 | ```json 406 | {"data": {"is_drift": 1, 407 | "distance": array([118.03355], dtype=float32), 408 | "p_val": array([2.3406739e-26], dtype=float32), 409 | "threshold": 0.01}, 410 | "meta": {"name": "ChiSquareDrift", 411 | "detector_type": "offline", 412 | "data_type": None}} 413 | ``` 414 | 415 | ### Multivariate 416 | 417 | As we can see, measuring drift is fairly straightforward for univariate data but difficult for multivariate data. We'll summarize the reduce and measure approach outlined in the following paper: [Failing Loudly: An Empirical Study of Methods for Detecting Dataset Shift](https://arxiv.org/abs/1810.11953). 418 | 419 |
420 | multivariate drift detection 421 |
422 | We vectorized our text using tf-idf (to keep modeling simple), which has high dimensionality and is not semantically rich in context. However, typically with text, word/char embeddings are used. So to illustrate what drift detection on multivariate data would look like, let's represent our text using pretrained embeddings. 423 | 424 | > Be sure to refer to our [embeddings](https://madewithml.com/courses/foundations/embeddings/) and [transformers](https://madewithml.com/courses/foundations/transformers/) lessons to learn more about these topics. But note that detecting drift on multivariate text embeddings is still quite difficult so it's typically more common to use these methods applied to tabular features or images. 425 | 426 | We'll start by loading the tokenizer from a pretrained model. 427 | 428 | ```python 429 | from transformers import AutoTokenizer 430 | ``` 431 | 432 | ```python 433 | model_name = "allenai/scibert_scivocab_uncased" 434 | tokenizer = AutoTokenizer.from_pretrained(model_name) 435 | vocab_size = len(tokenizer) 436 | print (vocab_size) 437 | ``` 438 | 439 |
440 | 31090
441 | 
442 | 443 | ```python 444 | # Tokenize inputs 445 | encoded_input = tokenizer(df.text.tolist(), return_tensors="pt", padding=True) 446 | ids = encoded_input["input_ids"] 447 | masks = encoded_input["attention_mask"] 448 | ``` 449 | 450 | ```python 451 | # Decode 452 | print (f"{ids[0]}\n{tokenizer.decode(ids[0])}") 453 | ``` 454 | 455 |
456 | tensor([  102,  2029,   467,  1778,   609,   137,  6446,  4857,   191,  1332,
457 |          2399, 13572, 19125,  1983,   147,  1954,   165,  6240,   205,   185,
458 |           300,  3717,  7434,  1262,   121,   537,   201,   137,  1040,   111,
459 |           545,   121,  4714,   205,   103,     0,     0,     0,     0,     0,
460 |             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
461 |             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
462 |             0])
463 | [CLS] comparison between yolo and rcnn on real world videos bringing theory to experiment is cool. we can easily train models in colab and find the results in minutes. [SEP] [PAD] [PAD] ...
464 | 
465 | 466 | ```python 467 | # Sub-word tokens 468 | print (tokenizer.convert_ids_to_tokens(ids=ids[0])) 469 | ``` 470 | 471 |
472 | ['[CLS]', 'comparison', 'between', 'yo', '##lo', 'and', 'rc', '##nn', 'on', 'real', 'world', 'videos', 'bringing', 'theory', 'to', 'experiment', 'is', 'cool', '.', 'we', 'can', 'easily', 'train', 'models', 'in', 'col', '##ab', 'and', 'find', 'the', 'results', 'in', 'minutes', '.', '[SEP]', '[PAD]', '[PAD]', ...]
473 | 
474 | 475 | Next, we'll load the pretrained model's weights and use the `TransformerEmbedding` object to extract the embeddings from the hidden state (averaged across tokens). 476 | 477 | ```python 478 | from alibi_detect.models.pytorch import TransformerEmbedding 479 | ``` 480 | 481 | ```python 482 | # Embedding layer 483 | emb_type = "hidden_state" 484 | layers = [-x for x in range(1, 9)] # last 8 layers 485 | embedding_layer = TransformerEmbedding(model_name, emb_type, layers) 486 | ``` 487 | 488 | ```python 489 | # Embedding dimension 490 | embedding_dim = embedding_layer.model.embeddings.word_embeddings.embedding_dim 491 | embedding_dim 492 | ``` 493 | 494 |
495 | 768
496 | 
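Before reducing dimensionality, it can help to sanity check that the embedding layer behaves as expected. A quick illustrative check (not in the original notebook), assuming a small batch of eight texts and a max length of 100 tokens:

```python
import torch

# Tokenize a small batch and confirm we get one 768-dim vector per input text
sample = tokenizer(df.text[:8].tolist(), padding=True, truncation=True,
                   max_length=100, return_tensors="pt")
with torch.no_grad():
    sample_embeddings = embedding_layer(sample)
print(sample_embeddings.shape)  # expected: torch.Size([8, 768])
```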
#### Dimensionality reduction

Now we need to use a dimensionality reduction method to reduce our representations' dimensionality to something more manageable (ex. 32 dimensions) that we can run our two-sample tests on to detect drift. Popular options include:

- [Principal component analysis (PCA)](https://en.wikipedia.org/wiki/Principal_component_analysis): orthogonal transformations that preserve the variability of the dataset.
- [Autoencoders (AE)](https://en.wikipedia.org/wiki/Autoencoder): networks that consume the inputs and attempt to reconstruct them from a lower-dimensional space while minimizing the reconstruction error. These can either be trained or untrained (the Failing Loudly paper recommends untrained).
- [Black box shift detectors (BBSD)](https://arxiv.org/abs/1802.03916): the actual model trained on the training data can be used as a dimensionality reducer. We can either use the softmax outputs (multivariate) or the actual predictions (univariate).

```python
import torch
import torch.nn as nn
```

```python
# Device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(device)
```
518 | cuda
519 | 
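As a point of comparison (not from the lesson), PCA from the options listed above could serve as the reducer just as well. A minimal sketch, with random data standing in for the real `(N, 768)` matrix of pooled embeddings:

```python
import numpy as np
from sklearn.decomposition import PCA

# Fit PCA on (stand-in) pooled embeddings and project down to 32 dimensions
pooled_embeddings = np.random.randn(200, 768).astype(np.float32)
pca = PCA(n_components=32)
reduced = pca.fit_transform(pooled_embeddings)  # (200, 32) inputs for a two-sample test
print(reduced.shape)
```

The lesson itself uses an untrained autoencoder as the reducer, which we build next.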
```python
# Untrained autoencoder (UAE) reducer
encoder_dim = 32
reducer = nn.Sequential(
    embedding_layer,
    nn.Linear(embedding_dim, 256),
    nn.ReLU(),
    nn.Linear(256, encoder_dim)
).to(device).eval()
```

We can wrap all of the operations above into one preprocessing function that will consume input text and produce the reduced representation.

```python
from alibi_detect.cd.pytorch import preprocess_drift
from functools import partial
```

```python
# Preprocessing with the reducer
max_len = 100
batch_size = 32
preprocess_fn = partial(preprocess_drift, model=reducer, tokenizer=tokenizer,
                        max_len=max_len, batch_size=batch_size, device=device)
```

#### Maximum Mean Discrepancy (MMD)

After applying dimensionality reduction techniques on our multivariate data, we can use different statistical tests to calculate drift. A popular option is [Maximum Mean Discrepancy (MMD)](https://jmlr.csail.mit.edu/papers/v13/gretton12a.html), a kernel-based approach that determines the distance between two distributions by computing the distance between the mean embeddings of the features from both distributions.

```python
from alibi_detect.cd import MMDDrift
```

```python
# Initialize drift detector
ref = df.text[0:200].to_list()  # reference window of raw text (ref was last set to the categorical token counts above)
mmd_drift_detector = MMDDrift(ref, backend="pytorch", p_val=.01, preprocess_fn=preprocess_fn)
```

```python
# No drift
no_drift = df.text[200:400].to_list()
mmd_drift_detector.predict(no_drift)
```

```json
{"data": {"distance": 0.0021169185638427734,
  "distance_threshold": 0.0032651424,
  "is_drift": 0,
  "p_val": 0.05999999865889549,
  "threshold": 0.01},
 "meta": {"backend": "pytorch",
  "data_type": None,
  "detector_type": "offline",
  "name": "MMDDriftTorch",
  "version": "0.9.1"}}
```

```python
# Drift
drift = ["UNK " + text for text in no_drift]
mmd_drift_detector.predict(drift)
```

```json
{"data": {"distance": 0.014705955982208252,
  "distance_threshold": 0.003908038,
  "is_drift": 1,
  "p_val": 0.0,
  "threshold": 0.01},
 "meta": {"backend": "pytorch",
  "data_type": None,
  "detector_type": "offline",
  "name": "MMDDriftTorch",
  "version": "0.9.1"}}
```

## Online

So far we've applied our drift detection methods on offline data to try to understand what reference window sizes should be, what p-values are appropriate, etc. However, we'll need to apply these methods in the online production setting so that we can catch drift as early as possible.

> Many monitoring libraries and platforms come with [online equivalents](https://docs.seldon.io/projects/alibi-detect/en/latest/cd/methods.html#online) for their detection methods.

Typically, reference windows are large so that we have a proper benchmark to compare our production data points to. As for the test window, the smaller it is, the more quickly we can catch sudden drift, whereas a larger test window allows us to identify more subtle/gradual drift. So it's best to compose windows of different sizes to regularly monitor.
605 | 606 | ```python 607 | from alibi_detect.cd import MMDDriftOnline 608 | ``` 609 | 610 | ```python 611 | # Online MMD drift detector 612 | ref = df.text[0:800].to_list() 613 | online_mmd_drift_detector = MMDDriftOnline( 614 | ref, ert=400, window_size=200, backend="pytorch", preprocess_fn=preprocess_fn) 615 | ``` 616 | 617 |
618 | Generating permutations of kernel matrix..
619 | 100%|██████████| 1000/1000 [00:00<00:00, 13784.22it/s]
620 | Computing thresholds: 100%|██████████| 200/200 [00:32<00:00,  6.11it/s]
621 | 
As data starts to flow in, we can use the detector to predict drift at every point. Our detector should detect drift sooner in our drifted dataset than in our normal data.

```python
def simulate_production(test_window):
    i = 0
    online_mmd_drift_detector.reset()
    for text in test_window:
        result = online_mmd_drift_detector.predict(text)
        is_drift = result["data"]["is_drift"]
        if is_drift:
            break
        else:
            i += 1
    print (f"{i} steps")
```

```python
# Normal
test_window = df.text[800:]
simulate_production(test_window)
```
646 | 27 steps
647 | 
```python
# Drift
test_window = ["UNK"] * len(df.text[800:])  # stream of out-of-vocabulary inputs
simulate_production(test_window)
```
656 | 11 steps
657 | 
There are also several considerations around how often to refresh both the reference and test windows. We could base it on the number of new observations, the time elapsed without drift, etc. We can also adjust the various thresholds (ERT, window size, etc.) based on what we learn about our system through monitoring.

## Learn more

While these are the foundational concepts for monitoring ML systems, there are a lot of software best practices for monitoring that we cannot show in an isolated repository. Learn more in our [monitoring lesson](https://madewithml.com/courses/mlops/monitoring/).
--------------------------------------------------------------------------------