├── Databricks_Talk_shreyashankar.pdf
├── README.md
├── current.pdf
├── mltrace_talk_short.pdf
├── monitoringchallenges.pdf
├── nyc_taxi_2020.ipynb
└── slides.pdf
/Databricks_Talk_shreyashankar.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/shreyashankar/debugging-ml-talk/dbcf7b652467341a729a906a00d0a144d6fe1112/Databricks_Talk_shreyashankar.pdf
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # debugging-ml-talk
2 |
3 | This repo contains (or will contain) the code and slides corresponding to my "Debugging ML in Production" talks. The materials change a lot over time. **The most up-to-date slides are in current.pdf**.
4 |
5 | I am giving / will give different versions of this talk at:
6 | * [NLP Zurich Meetup](https://www.meetup.com/NLP-Zurich/events/275819552/)
7 | * [UCSD's DSC 102: Systems for Scalable Analytics course taught by Arun Kumar](http://cseweb.ucsd.edu/~arunkk/dsc102_winter21/schedule.html)
8 | * [Stanford's MLSys seminar](https://www.youtube.com/watch?v=aGzu7nI8IRE)
9 | * [Verta MLOps Monitoring Salon](https://info.verta.ai/mlops-salon-model-monitoring?utm_content=160052147&utm_medium=social&utm_source=twitter&hss_channel=tw-1081294493213585408)
10 | * [Databricks Data + AI Summit](https://databricks.com/session_na21/catch-me-if-you-can-keeping-up-with-ml-models-in-production)
11 | * [Toronto MLOps World Conference](https://mlopsworld.com/)
12 | * [UC Berkeley RISECamp](https://risecamp.berkeley.edu/)
13 | * [Facebook Data Observability Summit](https://www.linkedin.com/posts/sravankumar-nandamuri-89337032_data-observability-learning-summit-2021-activity-6866778956964741120-iUnI/)
14 | * [Toronto Machine Learning Society Annual Conference](https://bit.ly/TMLS_2021)
15 | * [Google DevFest 2021](https://www.aicamp.ai/event/eventdetails/W2021120809)
16 |
17 | TODO:
18 | - [x] Document notebook
19 | - [x] Re-upload notebook with cell outputs
20 | - [ ] Post slides on the internet in a better place
21 |
--------------------------------------------------------------------------------
/current.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/shreyashankar/debugging-ml-talk/dbcf7b652467341a729a906a00d0a144d6fe1112/current.pdf
--------------------------------------------------------------------------------
/mltrace_talk_short.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/shreyashankar/debugging-ml-talk/dbcf7b652467341a729a906a00d0a144d6fe1112/mltrace_talk_short.pdf
--------------------------------------------------------------------------------
/monitoringchallenges.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/shreyashankar/debugging-ml-talk/dbcf7b652467341a729a906a00d0a144d6fe1112/monitoringchallenges.pdf
--------------------------------------------------------------------------------
/nyc_taxi_2020.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# NYC Taxicab \"Drift\" Example\n",
8 | "\n",
9 | "Author: shreyashankar\n",
10 | "\n",
11 | "This notebook shows a toy example of a machine learning model that achieves similar performance on the train and evaluation sets but experiences performance \"degradation\" when simulating a \"live\" deployment."
12 | ]
13 | },
14 | {
15 | "cell_type": "code",
16 | "execution_count": 1,
17 | "metadata": {},
18 | "outputs": [],
19 | "source": [
20 | "from cuml.dask.ensemble import RandomForestClassifier\n",
21 | "from cuml.metrics import roc_auc_score\n",
22 | "from dask.array import from_array\n",
23 | "from dask.distributed import Client, wait\n",
24 | "from dask_saturn import SaturnCluster\n",
25 | "from progress import progress\n",
26 | "from scipy import stats\n",
27 | "from sklearn.metrics import f1_score\n",
28 | "\n",
29 | "import dask_cudf\n",
30 | "import dask.dataframe as dd\n",
31 | "import pandas as pd"
32 | ]
33 | },
34 | {
35 | "cell_type": "code",
36 | "execution_count": 2,
37 | "metadata": {},
38 | "outputs": [],
39 | "source": [
40 | "# Parameters\n",
41 | "\n",
42 | "n_workers = 3\n",
43 | "\n",
44 | "numeric_feat = [\n",
45 | " \"pickup_weekday\",\n",
46 | " \"pickup_hour\",\n",
47 | " 'work_hours',\n",
48 | " \"pickup_minute\",\n",
49 | " \"passenger_count\",\n",
50 | " 'trip_distance',\n",
51 | " 'trip_time',\n",
52 | " 'trip_speed'\n",
53 | "]\n",
54 | "categorical_feat = [\n",
55 | " \"PULocationID\",\n",
56 | " \"DOLocationID\",\n",
57 | " \"RatecodeID\",\n",
58 | "]\n",
59 | "features = numeric_feat + categorical_feat\n",
60 | "\n",
61 | "EPS = 1e-7"
62 | ]
63 | },
64 | {
65 | "cell_type": "markdown",
66 | "metadata": {},
67 | "source": [
68 | "## Initialize cluster\n",
69 | "\n",
70 | "Using Saturn's predefined cluster setup."
71 | ]
72 | },
73 | {
74 | "cell_type": "code",
75 | "execution_count": 3,
76 | "metadata": {},
77 | "outputs": [
78 | {
79 | "name": "stdout",
80 | "output_type": "stream",
81 | "text": [
82 | "[2021-02-11 02:14:09] INFO - dask-saturn | Cluster is ready\n",
83 | "[2021-02-11 02:14:09] INFO - dask-saturn | Registering default plugins\n",
84 | "[2021-02-11 02:14:09] INFO - dask-saturn | {'tcp://10.0.25.24:37137': {'status': 'repeat'}, 'tcp://10.0.4.201:38121': {'status': 'repeat'}, 'tcp://10.0.9.1:39615': {'status': 'repeat'}}\n"
85 | ]
86 | },
87 | {
88 | "data": {
89 | "text/html": [
90 | "
\n",
91 | "\n",
92 | "\n",
93 | "Client\n",
94 | "\n",
98 | " | \n",
99 | "\n",
100 | "Cluster\n",
101 | "\n",
102 | " - Workers: 3
\n",
103 | " - Cores: 12
\n",
104 | " - Memory: 46.50 GB
\n",
105 | " \n",
106 | " | \n",
107 | "
\n",
108 | "
"
109 | ],
110 | "text/plain": [
111 | ""
112 | ]
113 | },
114 | "execution_count": 3,
115 | "metadata": {},
116 | "output_type": "execute_result"
117 | }
118 | ],
119 | "source": [
120 | "progress('rf-rapids-dask-cluster-setup')\n",
121 | "cluster = SaturnCluster(\n",
122 | " n_workers=n_workers, scheduler_size=\"medium\", worker_size=\"g4dnxlarge\"\n",
123 | ")\n",
124 | "client = Client(cluster)\n",
125 | "client"
126 | ]
127 | },
128 | {
129 | "cell_type": "markdown",
130 | "metadata": {},
131 | "source": [
132 | "## Create helper functions"
133 | ]
134 | },
135 | {
136 | "cell_type": "code",
137 | "execution_count": 4,
138 | "metadata": {},
139 | "outputs": [],
140 | "source": [
141 | "def preprocess(df: dask_cudf.DataFrame, target_col: str, start_date: str = None, end_date: str = None) -> dask_cudf.DataFrame:\n",
142 | " \"\"\"\n",
143 | " This function computes the target ('high_tip'), adds features, and removes unused features.\n",
144 | " Note that zero EDA or cleaning is performed here, whereas in the \"real world\" you should definitely\n",
145 | " inspect and clean the data. If a start or end date is specified, any entries outside of these bounds\n",
146 | " will be dropped from the dataframe.\n",
147 | " \n",
148 | " Args:\n",
149 | " df: dask dataframe representing data\n",
150 | " target_col: column name of the target (must be in df)\n",
151 | " start_date (optional): minimum date in the resulting dataframe\n",
152 | " end_date (optional): maximum date in the resulting dataframe\n",
153 | " \n",
154 | " Returns:\n",
155 | " dask_cudf: DataFrame representing the preprocessed dataframe\n",
156 | " \"\"\"\n",
157 | " # Basic cleaning\n",
158 | " df = df[df.fare_amount > 0] # avoid divide-by-zero\n",
159 | " if start_date:\n",
160 | " df = df[df.tpep_dropoff_datetime.astype('str') >= start_date]\n",
161 | " if end_date:\n",
162 | " df = df[df.tpep_dropoff_datetime.astype('str') <= end_date]\n",
163 | "\n",
164 | " # add target\n",
165 | " df[\"tip_fraction\"] = df.tip_amount / df.fare_amount\n",
166 | " df[target_col] = df[\"tip_fraction\"] > 0.2\n",
167 | "\n",
168 | " # add features\n",
169 | " df[\"pickup_weekday\"] = df.tpep_pickup_datetime.dt.weekday\n",
170 | " df[\"pickup_hour\"] = df.tpep_pickup_datetime.dt.hour\n",
171 | " df[\"pickup_minute\"] = df.tpep_pickup_datetime.dt.minute\n",
172 | " df[\"work_hours\"] = (df.pickup_weekday >= 0) & (df.pickup_weekday <= 4) & (df.pickup_hour >= 8) & (df.pickup_hour <= 18)\n",
173 | " df['trip_time'] = (df.tpep_dropoff_datetime - df.tpep_pickup_datetime).dt.seconds\n",
174 | " df['trip_speed'] = df.trip_distance / (df.trip_time + EPS)\n",
175 | "\n",
176 | " # drop unused columns\n",
177 | " df = df[['tpep_dropoff_datetime'] + features + [target_col]]\n",
178 | " df[features + [target_col]] = df[features + [target_col]].astype(\"float32\").fillna(-1.0)\n",
179 | "\n",
180 | " # convert target to int32 for efficiency (it's just 0s and 1s)\n",
181 | " df[target_col] = df[target_col].astype(\"int32\")\n",
182 | "\n",
183 | " return df.reset_index(drop=True)\n",
184 | "\n",
185 | "def f1_streaming(df: dask_cudf.DataFrame, target_col: str, pred_col: str) -> dask_cudf.Series:\n",
186 | " \"\"\"\n",
187 | " Computes rolling precision and recall columns\n",
188 | " F1 = 2 * (precision * recall) / (precision + recall)\n",
189 | "\n",
190 | " Precision: of the rows we predicted true, how many were true?\n",
191 | " Recall: of all the trues, how many did we predict to be true?\n",
192 | " \n",
193 | " Args:\n",
194 | " df: dask dataframe\n",
195 | " target_col: column name of the target (must be in df)\n",
196 | " pred_col: column name of the prediction (must be in df)\n",
197 | " \n",
198 | " Returns:\n",
199 | " dask_cudf: Series representing the cumulative F1 score\n",
200 | " \"\"\"\n",
201 | " df = df.sort_values(by=['tpep_dropoff_datetime'], ascending=True)\n",
202 | " numerator = (df['prediction'] & df[target_col]).cumsum()\n",
203 | " precision_denominator = df['prediction'].cumsum()\n",
204 | " recall_denominator = df[target_col].cumsum()\n",
205 | " precision = numerator / precision_denominator\n",
206 | " recall = numerator / recall_denominator\n",
207 | " return 2 * (precision * recall) / (precision + recall)\n",
208 | "\n",
209 | "def get_daily_f1_score(partition):\n",
210 | " \"\"\"\n",
211 | " \"\"\"\n",
212 | " numerator = (partition[target_col] & partition['prediction']).sum()\n",
213 | " recall_denominator = partition[target_col].sum()\n",
214 | " precision_denominator = partition['prediction'].sum()\n",
215 | " precision = numerator / precision_denominator\n",
216 | " recall = numerator / recall_denominator\n",
217 | " f1_score = 2 * (precision * recall) / (precision + recall)\n",
218 | " partition['daily_f1'] = f1_score\n",
219 | " return partition.sort_values(by='tpep_dropoff_datetime', ascending=False).head(1)[['day', 'rolling_f1', 'daily_f1']]"
220 | ]
221 | },
222 | {
223 | "cell_type": "markdown",
224 | "metadata": {},
225 | "source": [
226 | "## Load train data\n",
227 | "\n",
228 | "The training window is all of January 2020 and accessible via a public s3 bucket."
229 | ]
230 | },
231 | {
232 | "cell_type": "code",
233 | "execution_count": 5,
234 | "metadata": {},
235 | "outputs": [
236 | {
237 | "name": "stdout",
238 | "output_type": "stream",
239 | "text": [
240 | "Num rows: 6405008, Size: 0.903424059 GB\n"
241 | ]
242 | },
243 | {
244 | "data": {
245 | "text/html": [
246 | "\n",
247 | "\n",
260 | "
\n",
261 | " \n",
262 | " \n",
263 | " | \n",
264 | " VendorID | \n",
265 | " tpep_pickup_datetime | \n",
266 | " tpep_dropoff_datetime | \n",
267 | " passenger_count | \n",
268 | " trip_distance | \n",
269 | " RatecodeID | \n",
270 | " store_and_fwd_flag | \n",
271 | " PULocationID | \n",
272 | " DOLocationID | \n",
273 | " payment_type | \n",
274 | " fare_amount | \n",
275 | " extra | \n",
276 | " mta_tax | \n",
277 | " tip_amount | \n",
278 | " tolls_amount | \n",
279 | " improvement_surcharge | \n",
280 | " total_amount | \n",
281 | " congestion_surcharge | \n",
282 | "
\n",
283 | " \n",
284 | " \n",
285 | " \n",
286 | " 0 | \n",
287 | " 1.0 | \n",
288 | " 2020-01-01 00:28:15 | \n",
289 | " 2020-01-01 00:33:03 | \n",
290 | " 1.0 | \n",
291 | " 1.2 | \n",
292 | " 1.0 | \n",
293 | " N | \n",
294 | " 238.0 | \n",
295 | " 239.0 | \n",
296 | " 1.0 | \n",
297 | " 6.0 | \n",
298 | " 3.0 | \n",
299 | " 0.5 | \n",
300 | " 1.47 | \n",
301 | " 0.0 | \n",
302 | " 0.3 | \n",
303 | " 11.27 | \n",
304 | " 2.5 | \n",
305 | "
\n",
306 | " \n",
307 | " 1 | \n",
308 | " 1.0 | \n",
309 | " 2020-01-01 00:35:39 | \n",
310 | " 2020-01-01 00:43:04 | \n",
311 | " 1.0 | \n",
312 | " 1.2 | \n",
313 | " 1.0 | \n",
314 | " N | \n",
315 | " 239.0 | \n",
316 | " 238.0 | \n",
317 | " 1.0 | \n",
318 | " 7.0 | \n",
319 | " 3.0 | \n",
320 | " 0.5 | \n",
321 | " 1.50 | \n",
322 | " 0.0 | \n",
323 | " 0.3 | \n",
324 | " 12.30 | \n",
325 | " 2.5 | \n",
326 | "
\n",
327 | " \n",
328 | " 2 | \n",
329 | " 1.0 | \n",
330 | " 2020-01-01 00:47:41 | \n",
331 | " 2020-01-01 00:53:52 | \n",
332 | " 1.0 | \n",
333 | " 0.6 | \n",
334 | " 1.0 | \n",
335 | " N | \n",
336 | " 238.0 | \n",
337 | " 238.0 | \n",
338 | " 1.0 | \n",
339 | " 6.0 | \n",
340 | " 3.0 | \n",
341 | " 0.5 | \n",
342 | " 1.00 | \n",
343 | " 0.0 | \n",
344 | " 0.3 | \n",
345 | " 10.80 | \n",
346 | " 2.5 | \n",
347 | "
\n",
348 | " \n",
349 | " 3 | \n",
350 | " 1.0 | \n",
351 | " 2020-01-01 00:55:23 | \n",
352 | " 2020-01-01 01:00:14 | \n",
353 | " 1.0 | \n",
354 | " 0.8 | \n",
355 | " 1.0 | \n",
356 | " N | \n",
357 | " 238.0 | \n",
358 | " 151.0 | \n",
359 | " 1.0 | \n",
360 | " 5.5 | \n",
361 | " 0.5 | \n",
362 | " 0.5 | \n",
363 | " 1.36 | \n",
364 | " 0.0 | \n",
365 | " 0.3 | \n",
366 | " 8.16 | \n",
367 | " 0.0 | \n",
368 | "
\n",
369 | " \n",
370 | " 4 | \n",
371 | " 2.0 | \n",
372 | " 2020-01-01 00:01:58 | \n",
373 | " 2020-01-01 00:04:16 | \n",
374 | " 1.0 | \n",
375 | " 0.0 | \n",
376 | " 1.0 | \n",
377 | " N | \n",
378 | " 193.0 | \n",
379 | " 193.0 | \n",
380 | " 2.0 | \n",
381 | " 3.5 | \n",
382 | " 0.5 | \n",
383 | " 0.5 | \n",
384 | " 0.00 | \n",
385 | " 0.0 | \n",
386 | " 0.3 | \n",
387 | " 4.80 | \n",
388 | " 0.0 | \n",
389 | "
\n",
390 | " \n",
391 | "
\n",
392 | "
"
393 | ],
394 | "text/plain": [
395 | " VendorID tpep_pickup_datetime tpep_dropoff_datetime passenger_count \\\n",
396 | "0 1.0 2020-01-01 00:28:15 2020-01-01 00:33:03 1.0 \n",
397 | "1 1.0 2020-01-01 00:35:39 2020-01-01 00:43:04 1.0 \n",
398 | "2 1.0 2020-01-01 00:47:41 2020-01-01 00:53:52 1.0 \n",
399 | "3 1.0 2020-01-01 00:55:23 2020-01-01 01:00:14 1.0 \n",
400 | "4 2.0 2020-01-01 00:01:58 2020-01-01 00:04:16 1.0 \n",
401 | "\n",
402 | " trip_distance RatecodeID store_and_fwd_flag PULocationID DOLocationID \\\n",
403 | "0 1.2 1.0 N 238.0 239.0 \n",
404 | "1 1.2 1.0 N 239.0 238.0 \n",
405 | "2 0.6 1.0 N 238.0 238.0 \n",
406 | "3 0.8 1.0 N 238.0 151.0 \n",
407 | "4 0.0 1.0 N 193.0 193.0 \n",
408 | "\n",
409 | " payment_type fare_amount extra mta_tax tip_amount tolls_amount \\\n",
410 | "0 1.0 6.0 3.0 0.5 1.47 0.0 \n",
411 | "1 1.0 7.0 3.0 0.5 1.50 0.0 \n",
412 | "2 1.0 6.0 3.0 0.5 1.00 0.0 \n",
413 | "3 1.0 5.5 0.5 0.5 1.36 0.0 \n",
414 | "4 2.0 3.5 0.5 0.5 0.00 0.0 \n",
415 | "\n",
416 | " improvement_surcharge total_amount congestion_surcharge \n",
417 | "0 0.3 11.27 2.5 \n",
418 | "1 0.3 12.30 2.5 \n",
419 | "2 0.3 10.80 2.5 \n",
420 | "3 0.3 8.16 0.0 \n",
421 | "4 0.3 4.80 0.0 "
422 | ]
423 | },
424 | "execution_count": 5,
425 | "metadata": {},
426 | "output_type": "execute_result"
427 | }
428 | ],
429 | "source": [
430 | "taxi = dask_cudf.read_csv(\n",
431 | " \"s3://nyc-tlc/trip data/yellow_tripdata_2020-01.csv\",\n",
432 | " parse_dates=[\"tpep_pickup_datetime\", \"tpep_dropoff_datetime\"],\n",
433 | " storage_options={\"anon\": True},\n",
434 | " assume_missing=True,\n",
435 | ")\n",
436 | "\n",
437 | "print(f\"Num rows: {len(taxi)}, Size: {taxi.memory_usage(deep=True).sum().compute() / 1e9} GB\")\n",
438 | "taxi.head()"
439 | ]
440 | },
441 | {
442 | "cell_type": "code",
443 | "execution_count": 6,
444 | "metadata": {},
445 | "outputs": [
446 | {
447 | "name": "stdout",
448 | "output_type": "stream",
449 | "text": [
450 | "Num rows: 6382762, Size: 0.357434672 GB\n"
451 | ]
452 | }
453 | ],
454 | "source": [
455 | "target_col = \"high_tip\"\n",
456 | "\n",
457 | "taxi_train = preprocess(df=taxi, target_col=target_col)\n",
458 | "print(f\"Num rows: {len(taxi_train)}, Size: {taxi_train.memory_usage(deep=True).sum().compute() / 1e9} GB\")"
459 | ]
460 | },
461 | {
462 | "cell_type": "markdown",
463 | "metadata": {},
464 | "source": [
465 | "## Train model\n",
466 | "\n",
467 | "We will fit a random forest with 100 estimators and `max_depth` of 10 to the training set. Zero hyperparameter tuning is done here. If we were to do any hyperparameter tuning, we should use a hold-out validation set.\n",
468 | "\n",
469 | "We train the model on GPU and evaluate on CPU. We evaluate the model using the [F1 score](https://en.wikipedia.org/wiki/F-score)."
470 | ]
471 | },
472 | {
473 | "cell_type": "code",
474 | "execution_count": 7,
475 | "metadata": {},
476 | "outputs": [
477 | {
478 | "name": "stdout",
479 | "output_type": "stream",
480 | "text": [
481 | "CPU times: user 257 ms, sys: 3.12 ms, total: 261 ms\n",
482 | "Wall time: 21.9 s\n"
483 | ]
484 | }
485 | ],
486 | "source": [
487 | "%%time\n",
488 | "progress('start-rf-rapids-dask-fit')\n",
489 | "\n",
490 | "rfc = RandomForestClassifier(n_estimators=100, max_depth=10, ignore_empty_partitions=True)\n",
491 | "\n",
492 | "rfc.fit(taxi_train[features], taxi_train[target_col])\n",
493 | "progress('finished-rf-rapids-dask-fit')"
494 | ]
495 | },
496 | {
497 | "cell_type": "code",
498 | "execution_count": 8,
499 | "metadata": {},
500 | "outputs": [
501 | {
502 | "name": "stdout",
503 | "output_type": "stream",
504 | "text": [
505 | "F1: 0.6681650475249482\n",
506 | "CPU times: user 3.87 s, sys: 307 ms, total: 4.17 s\n",
507 | "Wall time: 18.3 s\n"
508 | ]
509 | }
510 | ],
511 | "source": [
512 | "%%time\n",
513 | "# Compute F1 \n",
514 | "# This is (relatively) slow since we are copying data to the CPU to compute the metric.\n",
515 | "\n",
516 | "preds = rfc.predict_proba(taxi_train[features])[1]\n",
517 | "print(f'F1: {f1_score(taxi_train[target_col].compute().to_array(), preds.round().compute().to_array())}')"
518 | ]
519 | },
520 | {
521 | "cell_type": "markdown",
522 | "metadata": {},
523 | "source": [
524 | "## Evaluate on test set\n",
525 | "\n",
526 | "The test window is all of February 2020 and also accessible via public s3 bucket. The F1 scores are similar between train and test sets."
527 | ]
528 | },
529 | {
530 | "cell_type": "code",
531 | "execution_count": 9,
532 | "metadata": {},
533 | "outputs": [],
534 | "source": [
535 | "taxi_feb = dask_cudf.read_csv(\n",
536 | " \"s3://nyc-tlc/trip data/yellow_tripdata_2020-02.csv\",\n",
537 | " parse_dates=[\"tpep_pickup_datetime\", \"tpep_dropoff_datetime\"],\n",
538 | " storage_options={\"anon\": True},\n",
539 | " assume_missing=True,\n",
540 | ")\n",
541 | "\n",
542 | "taxi_test = preprocess(taxi_feb, target_col=target_col)"
543 | ]
544 | },
545 | {
546 | "cell_type": "code",
547 | "execution_count": 10,
548 | "metadata": {},
549 | "outputs": [
550 | {
551 | "name": "stdout",
552 | "output_type": "stream",
553 | "text": [
554 | "F1: 0.6658098920024954\n"
555 | ]
556 | }
557 | ],
558 | "source": [
559 | "# Compute F1 on test set\n",
560 | "# This is slow since we are copying data to the CPU to compute the metric.\n",
561 | "\n",
562 | "preds = rfc.predict_proba(taxi_test[features])[1]\n",
563 | "print(f'F1: {f1_score(taxi_test[target_col].compute().to_array(), preds.round().compute().to_array())}')"
564 | ]
565 | },
566 | {
567 | "cell_type": "markdown",
568 | "metadata": {},
569 | "source": [
570 | "## Simulate \"live\" inference on March\n",
571 | "\n",
572 | "As every new batch of points comes in, we make a prediction. We compute the rolling (F1 score since March 1) and daily F1 scores. Note that the daily F1 score drops significantly, but this performance degradation is not so pronounced if we just monitor the rolling F1 score."
573 | ]
574 | },
575 | {
576 | "cell_type": "code",
577 | "execution_count": 11,
578 | "metadata": {},
579 | "outputs": [],
580 | "source": [
581 | "# First, load and sort the march dataframe\n",
582 | "\n",
583 | "taxi_march = dask_cudf.read_csv(\n",
584 | " \"s3://nyc-tlc/trip data/yellow_tripdata_2020-03.csv\",\n",
585 | " parse_dates=[\"tpep_pickup_datetime\", \"tpep_dropoff_datetime\"],\n",
586 | " storage_options={\"anon\": True},\n",
587 | " assume_missing=True,\n",
588 | ")\n",
589 | "\n",
590 | "taxi_inference = preprocess(taxi_march, target_col=target_col, start_date='2020-03-01', end_date='2020-03-31').sort_values(by=['tpep_dropoff_datetime'], ascending=True).reset_index(drop=True)\n",
591 | "taxi_inference['day'] = taxi_inference.tpep_dropoff_datetime.dt.day.to_dask_array()"
592 | ]
593 | },
594 | {
595 | "cell_type": "code",
596 | "execution_count": 12,
597 | "metadata": {},
598 | "outputs": [],
599 | "source": [
600 | "# Save predictions as a new column, compute rolling F1 score\n",
601 | "\n",
602 | "taxi_inference['predicted_prob'] = rfc.predict_proba(taxi_inference[features])[1]\n",
603 | "taxi_inference['prediction'] = taxi_inference['predicted_prob'].round().astype('int32')\n",
604 | "taxi_inference['rolling_f1'] = f1_streaming(taxi_inference, target_col, 'prediction')\n",
605 | "daily_f1 = taxi_inference.groupby('day').apply(get_daily_f1_score, meta={'day': int, 'rolling_f1': float, 'daily_f1': float})"
606 | ]
607 | },
608 | {
609 | "cell_type": "code",
610 | "execution_count": 13,
611 | "metadata": {},
612 | "outputs": [
613 | {
614 | "data": {
615 | "text/html": [
616 | "\n",
617 | "\n",
630 | "
\n",
631 | " \n",
632 | " \n",
633 | " | \n",
634 | " day | \n",
635 | " rolling_f1 | \n",
636 | " daily_f1 | \n",
637 | "
\n",
638 | " \n",
639 | " \n",
640 | " \n",
641 | " 178123 | \n",
642 | " 1 | \n",
643 | " 0.576629 | \n",
644 | " 0.576629 | \n",
645 | "
\n",
646 | " \n",
647 | " 370840 | \n",
648 | " 2 | \n",
649 | " 0.633320 | \n",
650 | " 0.677398 | \n",
651 | "
\n",
652 | " \n",
653 | " 592741 | \n",
654 | " 3 | \n",
655 | " 0.649983 | \n",
656 | " 0.675877 | \n",
657 | "
\n",
658 | " \n",
659 | " 821398 | \n",
660 | " 4 | \n",
661 | " 0.659940 | \n",
662 | " 0.684125 | \n",
663 | "
\n",
664 | " \n",
665 | " 1064741 | \n",
666 | " 5 | \n",
667 | " 0.675841 | \n",
668 | " 0.722298 | \n",
669 | "
\n",
670 | " \n",
671 | " 1307013 | \n",
672 | " 6 | \n",
673 | " 0.682284 | \n",
674 | " 0.708181 | \n",
675 | "
\n",
676 | " \n",
677 | " 58517 | \n",
678 | " 7 | \n",
679 | " 0.668002 | \n",
680 | " 0.555498 | \n",
681 | "
\n",
682 | " \n",
683 | " 225439 | \n",
684 | " 8 | \n",
685 | " 0.659918 | \n",
686 | " 0.572543 | \n",
687 | "
\n",
688 | " \n",
689 | " 400352 | \n",
690 | " 9 | \n",
691 | " 0.660947 | \n",
692 | " 0.670717 | \n",
693 | "
\n",
694 | " \n",
695 | " 583448 | \n",
696 | " 10 | \n",
697 | " 0.661801 | \n",
698 | " 0.670428 | \n",
699 | "
\n",
700 | " \n",
701 | " 765578 | \n",
702 | " 11 | \n",
703 | " 0.663678 | \n",
704 | " 0.684011 | \n",
705 | "
\n",
706 | " \n",
707 | " 936075 | \n",
708 | " 12 | \n",
709 | " 0.667420 | \n",
710 | " 0.711109 | \n",
711 | "
\n",
712 | " \n",
713 | " 1070221 | \n",
714 | " 13 | \n",
715 | " 0.668812 | \n",
716 | " 0.691889 | \n",
717 | "
\n",
718 | " \n",
719 | " 1159620 | \n",
720 | " 14 | \n",
721 | " 0.666032 | \n",
722 | " 0.571661 | \n",
723 | "
\n",
724 | " \n",
725 | " 1219523 | \n",
726 | " 15 | \n",
727 | " 0.664177 | \n",
728 | " 0.564885 | \n",
729 | "
\n",
730 | " \n",
731 | " 1283501 | \n",
732 | " 16 | \n",
733 | " 0.663604 | \n",
734 | " 0.638491 | \n",
735 | "
\n",
736 | " \n",
737 | " 1328995 | \n",
738 | " 17 | \n",
739 | " 0.663178 | \n",
740 | " 0.635958 | \n",
741 | "
\n",
742 | " \n",
743 | " 1365063 | \n",
744 | " 18 | \n",
745 | " 0.662761 | \n",
746 | " 0.628822 | \n",
747 | "
\n",
748 | " \n",
749 | " 1394730 | \n",
750 | " 19 | \n",
751 | " 0.662613 | \n",
752 | " 0.648809 | \n",
753 | "
\n",
754 | " \n",
755 | " 1422146 | \n",
756 | " 20 | \n",
757 | " 0.662300 | \n",
758 | " 0.629325 | \n",
759 | "
\n",
760 | " \n",
761 | " 1438271 | \n",
762 | " 21 | \n",
763 | " 0.661760 | \n",
764 | " 0.534262 | \n",
765 | "
\n",
766 | " \n",
767 | " 1448533 | \n",
768 | " 22 | \n",
769 | " 0.661437 | \n",
770 | " 0.541612 | \n",
771 | "
\n",
772 | " \n",
773 | " 1462011 | \n",
774 | " 23 | \n",
775 | " 0.661225 | \n",
776 | " 0.611136 | \n",
777 | "
\n",
778 | " \n",
779 | " 1473783 | \n",
780 | " 24 | \n",
781 | " 0.660991 | \n",
782 | " 0.594909 | \n",
783 | "
\n",
784 | " \n",
785 | " 1484934 | \n",
786 | " 25 | \n",
787 | " 0.660754 | \n",
788 | " 0.590316 | \n",
789 | "
\n",
790 | " \n",
791 | " 1495523 | \n",
792 | " 26 | \n",
793 | " 0.660534 | \n",
794 | " 0.596606 | \n",
795 | "
\n",
796 | " \n",
797 | " 1507234 | \n",
798 | " 27 | \n",
799 | " 0.660228 | \n",
800 | " 0.576993 | \n",
801 | "
\n",
802 | " \n",
803 | " 1514827 | \n",
804 | " 28 | \n",
805 | " 0.659934 | \n",
806 | " 0.501860 | \n",
807 | "
\n",
808 | " \n",
809 | " 1520358 | \n",
810 | " 29 | \n",
811 | " 0.659764 | \n",
812 | " 0.537860 | \n",
813 | "
\n",
814 | " \n",
815 | " 1529847 | \n",
816 | " 30 | \n",
817 | " 0.659530 | \n",
818 | " 0.576178 | \n",
819 | "
\n",
820 | " \n",
821 | "
\n",
822 | "
"
823 | ],
824 | "text/plain": [
825 | " day rolling_f1 daily_f1\n",
826 | "178123 1 0.576629 0.576629\n",
827 | "370840 2 0.633320 0.677398\n",
828 | "592741 3 0.649983 0.675877\n",
829 | "821398 4 0.659940 0.684125\n",
830 | "1064741 5 0.675841 0.722298\n",
831 | "1307013 6 0.682284 0.708181\n",
832 | "58517 7 0.668002 0.555498\n",
833 | "225439 8 0.659918 0.572543\n",
834 | "400352 9 0.660947 0.670717\n",
835 | "583448 10 0.661801 0.670428\n",
836 | "765578 11 0.663678 0.684011\n",
837 | "936075 12 0.667420 0.711109\n",
838 | "1070221 13 0.668812 0.691889\n",
839 | "1159620 14 0.666032 0.571661\n",
840 | "1219523 15 0.664177 0.564885\n",
841 | "1283501 16 0.663604 0.638491\n",
842 | "1328995 17 0.663178 0.635958\n",
843 | "1365063 18 0.662761 0.628822\n",
844 | "1394730 19 0.662613 0.648809\n",
845 | "1422146 20 0.662300 0.629325\n",
846 | "1438271 21 0.661760 0.534262\n",
847 | "1448533 22 0.661437 0.541612\n",
848 | "1462011 23 0.661225 0.611136\n",
849 | "1473783 24 0.660991 0.594909\n",
850 | "1484934 25 0.660754 0.590316\n",
851 | "1495523 26 0.660534 0.596606\n",
852 | "1507234 27 0.660228 0.576993\n",
853 | "1514827 28 0.659934 0.501860\n",
854 | "1520358 29 0.659764 0.537860\n",
855 | "1529847 30 0.659530 0.576178"
856 | ]
857 | },
858 | "execution_count": 13,
859 | "metadata": {},
860 | "output_type": "execute_result"
861 | }
862 | ],
863 | "source": [
864 | "daily_f1.sort_values(by='day', ascending=True).compute()"
865 | ]
866 | },
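  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A minimal sketch of why the rolling metric masks the drop: a cumulative average weights all of history equally, so a recent dip barely moves it. This is illustrative only, using a cumulative mean of hypothetical daily F1 values as a stand-in for the exact cumulative F1 computed above."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "# Hypothetical daily F1 scores: stable for 20 days, then a drop over the last 10 days\n",
    "daily = np.array([0.68] * 20 + [0.55] * 10)\n",
    "rolling = np.cumsum(daily) / np.arange(1, len(daily) + 1)  # cumulative ('rolling since day 1') mean\n",
    "\n",
    "print(f'Daily F1 on the last day: {daily[-1]:.2f}')      # 0.55 -- the drop is obvious\n",
    "print(f'Rolling F1 on the last day: {rolling[-1]:.3f}')  # ~0.637 -- the drop is damped"
   ]
  },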
867 | {
868 | "cell_type": "markdown",
869 | "metadata": {},
870 | "source": [
871 | "## Evaluate model on later months\n",
872 | "\n",
873 | "We see the performance drop in March 2020, but what happens for future months?"
874 | ]
875 | },
876 | {
877 | "cell_type": "code",
878 | "execution_count": 14,
879 | "metadata": {},
880 | "outputs": [
881 | {
882 | "name": "stdout",
883 | "output_type": "stream",
884 | "text": [
885 | "Loading month 2020-03 for the first time.\n",
886 | "2020-03\n",
887 | "\tF1: 0.6592796100378214\n",
888 | "Loading month 2020-04 for the first time.\n",
889 | "2020-04\n",
890 | "\tF1: 0.5714705472990737\n",
891 | "Loading month 2020-05 for the first time.\n",
892 | "2020-05\n",
893 | "\tF1: 0.5530868473460906\n",
894 | "Loading month 2020-06 for the first time.\n",
895 | "2020-06\n",
896 | "\tF1: 0.5967621469282887\n"
897 | ]
898 | }
899 | ],
900 | "source": [
901 | "# Cycle through many test sets\n",
902 | "\n",
903 | "months = ['2020-03', '2020-04', '2020-05', '2020-06']\n",
904 | "month_dfs = {}\n",
905 | "\n",
906 | "for month in months:\n",
907 | " \n",
908 | " if month not in month_dfs:\n",
909 | " print(f'Loading month {month} for the first time.')\n",
910 | " df = dask_cudf.read_csv(\n",
911 | " f\"s3://nyc-tlc/trip data/yellow_tripdata_{month}.csv\",\n",
912 | " parse_dates=[\"tpep_pickup_datetime\", \"tpep_dropoff_datetime\"],\n",
913 | " storage_options={\"anon\": True},\n",
914 | " assume_missing=True,\n",
915 | " )\n",
916 | "\n",
917 | " df = preprocess(df, target_col=target_col)\n",
918 | " month_dfs[month] = df.copy()\n",
919 | " \n",
920 | " curr_taxi_test = month_dfs[month]\n",
921 | " \n",
922 | " preds = rfc.predict_proba(curr_taxi_test[features])[1]\n",
923 | " print(month)\n",
924 | " print(f'\\tF1: {f1_score(curr_taxi_test[target_col].compute().to_array(), preds.round().compute().to_array())}')"
925 | ]
926 | },
927 | {
928 | "cell_type": "markdown",
929 | "metadata": {},
930 | "source": [
931 | "## Inspect differences between feature values\n",
932 | "\n",
933 | "Maybe the distribution of data shifted. We could try to quantify this using a 2-sided statistical test (Kolmogorov Smirnov in this example)."
934 | ]
935 | },
936 | {
937 | "cell_type": "markdown",
938 | "metadata": {},
939 | "source": [
940 | "### Compare January 2020 vs February 2020\n",
941 | "\n",
942 | "This snippet shows that the p values being small doesn't really tell us much, as we get very small p values when comparing January 2020 vs February 2020 even though we know the F1 score was similar. Curse \"big data.\""
943 | ]
944 | },
945 | {
946 | "cell_type": "code",
947 | "execution_count": 15,
948 | "metadata": {},
949 | "outputs": [],
950 | "source": [
951 | "statistics = []\n",
952 | "p_values = []\n",
953 | "\n",
954 | "for feature in features:\n",
955 | " statistic, p_value = stats.ks_2samp(taxi_train[feature].compute().to_pandas(), taxi_test[feature].compute().to_pandas())\n",
956 | " statistics.append(statistic)\n",
957 | " p_values.append(p_value)"
958 | ]
959 | },
960 | {
961 | "cell_type": "code",
962 | "execution_count": 16,
963 | "metadata": {},
964 | "outputs": [
965 | {
966 | "data": {
967 | "text/html": [
968 | "\n",
969 | "\n",
982 | "
\n",
983 | " \n",
984 | " \n",
985 | " | \n",
986 | " feature | \n",
987 | " statistic | \n",
988 | " p_value | \n",
989 | "
\n",
990 | " \n",
991 | " \n",
992 | " \n",
993 | " 0 | \n",
994 | " pickup_weekday | \n",
995 | " 0.046196 | \n",
996 | " 0.000000e+00 | \n",
997 | "
\n",
998 | " \n",
999 | " 2 | \n",
1000 | " work_hours | \n",
1001 | " 0.028587 | \n",
1002 | " 0.000000e+00 | \n",
1003 | "
\n",
1004 | " \n",
1005 | " 6 | \n",
1006 | " trip_time | \n",
1007 | " 0.017205 | \n",
1008 | " 0.000000e+00 | \n",
1009 | "
\n",
1010 | " \n",
1011 | " 7 | \n",
1012 | " trip_speed | \n",
1013 | " 0.035415 | \n",
1014 | " 0.000000e+00 | \n",
1015 | "
\n",
1016 | " \n",
1017 | " 1 | \n",
1018 | " pickup_hour | \n",
1019 | " 0.009676 | \n",
1020 | " 8.610133e-258 | \n",
1021 | "
\n",
1022 | " \n",
1023 | " 5 | \n",
1024 | " trip_distance | \n",
1025 | " 0.005312 | \n",
1026 | " 5.266602e-78 | \n",
1027 | "
\n",
1028 | " \n",
1029 | " 8 | \n",
1030 | " PULocationID | \n",
1031 | " 0.004083 | \n",
1032 | " 2.994877e-46 | \n",
1033 | "
\n",
1034 | " \n",
1035 | " 9 | \n",
1036 | " DOLocationID | \n",
1037 | " 0.003132 | \n",
1038 | " 2.157559e-27 | \n",
1039 | "
\n",
1040 | " \n",
1041 | " 4 | \n",
1042 | " passenger_count | \n",
1043 | " 0.002947 | \n",
1044 | " 2.634493e-24 | \n",
1045 | "
\n",
1046 | " \n",
1047 | " 10 | \n",
1048 | " RatecodeID | \n",
1049 | " 0.002616 | \n",
1050 | " 3.047481e-19 | \n",
1051 | "
\n",
1052 | " \n",
1053 | " 3 | \n",
1054 | " pickup_minute | \n",
1055 | " 0.000702 | \n",
1056 | " 8.861498e-02 | \n",
1057 | "
\n",
1058 | " \n",
1059 | "
\n",
1060 | "
"
1061 | ],
1062 | "text/plain": [
1063 | " feature statistic p_value\n",
1064 | "0 pickup_weekday 0.046196 0.000000e+00\n",
1065 | "2 work_hours 0.028587 0.000000e+00\n",
1066 | "6 trip_time 0.017205 0.000000e+00\n",
1067 | "7 trip_speed 0.035415 0.000000e+00\n",
1068 | "1 pickup_hour 0.009676 8.610133e-258\n",
1069 | "5 trip_distance 0.005312 5.266602e-78\n",
1070 | "8 PULocationID 0.004083 2.994877e-46\n",
1071 | "9 DOLocationID 0.003132 2.157559e-27\n",
1072 | "4 passenger_count 0.002947 2.634493e-24\n",
1073 | "10 RatecodeID 0.002616 3.047481e-19\n",
1074 | "3 pickup_minute 0.000702 8.861498e-02"
1075 | ]
1076 | },
1077 | "execution_count": 16,
1078 | "metadata": {},
1079 | "output_type": "execute_result"
1080 | }
1081 | ],
1082 | "source": [
1083 | "comparison_df = pd.DataFrame(data={'feature': features, 'statistic': statistics, 'p_value': p_values})\n",
1084 | "comparison_df.sort_values(by='p_value', ascending=True).head(11)"
1085 | ]
1086 | },
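  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A minimal synthetic sketch of the \"big data\" issue above: with millions of samples, `ks_2samp` reports a vanishingly small p value even for a practically negligible shift, so a small p value alone is a poor drift alarm at this scale. The sample size and shift below are made up for illustration."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "rng = np.random.default_rng(0)\n",
    "a = rng.normal(loc=0.00, scale=1.0, size=2_000_000)\n",
    "b = rng.normal(loc=0.01, scale=1.0, size=2_000_000)  # tiny, practically negligible shift\n",
    "\n",
    "# stats was already imported from scipy at the top of the notebook\n",
    "statistic, p_value = stats.ks_2samp(a, b)\n",
    "print(f'KS statistic: {statistic:.4f}, p value: {p_value:.2e}')"
   ]
  },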
1087 | {
1088 | "cell_type": "markdown",
1089 | "metadata": {},
1090 | "source": [
1091 | "### Compare January 2020 vs March 2020\n",
1092 | "\n",
1093 | "These p values are also small, which is good? But if this method in general sends warning alerts all the time, an end user might not trust it."
1094 | ]
1095 | },
1096 | {
1097 | "cell_type": "code",
1098 | "execution_count": 17,
1099 | "metadata": {},
1100 | "outputs": [],
1101 | "source": [
1102 | "statistics = []\n",
1103 | "p_values = []\n",
1104 | "\n",
1105 | "for feature in features:\n",
1106 | " statistic, p_value = stats.ks_2samp(taxi_train[feature].compute().to_pandas(), taxi_inference[feature].compute().to_pandas())\n",
1107 | " statistics.append(statistic)\n",
1108 | " p_values.append(p_value)"
1109 | ]
1110 | },
1111 | {
1112 | "cell_type": "code",
1113 | "execution_count": 18,
1114 | "metadata": {},
1115 | "outputs": [
1116 | {
1117 | "data": {
1118 | "text/html": [
1119 | "\n",
1120 | "\n",
1133 | "
\n",
1134 | " \n",
1135 | " \n",
1136 | " | \n",
1137 | " feature | \n",
1138 | " statistic | \n",
1139 | " p_value | \n",
1140 | "
\n",
1141 | " \n",
1142 | " \n",
1143 | " \n",
1144 | " 0 | \n",
1145 | " pickup_weekday | \n",
1146 | " 0.059051 | \n",
1147 | " 0.000000e+00 | \n",
1148 | "
\n",
1149 | " \n",
1150 | " 1 | \n",
1151 | " pickup_hour | \n",
1152 | " 0.017536 | \n",
1153 | " 0.000000e+00 | \n",
1154 | "
\n",
1155 | " \n",
1156 | " 4 | \n",
1157 | " passenger_count | \n",
1158 | " 0.022485 | \n",
1159 | " 0.000000e+00 | \n",
1160 | "
\n",
1161 | " \n",
1162 | " 5 | \n",
1163 | " trip_distance | \n",
1164 | " 0.017913 | \n",
1165 | " 0.000000e+00 | \n",
1166 | "
\n",
1167 | " \n",
1168 | " 7 | \n",
1169 | " trip_speed | \n",
1170 | " 0.030289 | \n",
1171 | " 0.000000e+00 | \n",
1172 | "
\n",
1173 | " \n",
1174 | " 9 | \n",
1175 | " DOLocationID | \n",
1176 | " 0.013995 | \n",
1177 | " 0.000000e+00 | \n",
1178 | "
\n",
1179 | " \n",
1180 | " 8 | \n",
1181 | " PULocationID | \n",
1182 | " 0.013068 | \n",
1183 | " 3.746619e-302 | \n",
1184 | "
\n",
1185 | " \n",
1186 | " 2 | \n",
1187 | " work_hours | \n",
1188 | " 0.010840 | \n",
1189 | " 5.006014e-208 | \n",
1190 | "
\n",
1191 | " \n",
1192 | " 6 | \n",
1193 | " trip_time | \n",
1194 | " 0.007507 | \n",
1195 | " 5.385560e-100 | \n",
1196 | "
\n",
1197 | " \n",
1198 | " 10 | \n",
1199 | " RatecodeID | \n",
1200 | " 0.005615 | \n",
1201 | " 3.933726e-56 | \n",
1202 | "
\n",
1203 | " \n",
1204 | " 3 | \n",
1205 | " pickup_minute | \n",
1206 | " 0.000642 | \n",
1207 | " 3.722759e-01 | \n",
1208 | "
\n",
1209 | " \n",
1210 | "
\n",
1211 | "
"
1212 | ],
1213 | "text/plain": [
1214 | " feature statistic p_value\n",
1215 | "0 pickup_weekday 0.059051 0.000000e+00\n",
1216 | "1 pickup_hour 0.017536 0.000000e+00\n",
1217 | "4 passenger_count 0.022485 0.000000e+00\n",
1218 | "5 trip_distance 0.017913 0.000000e+00\n",
1219 | "7 trip_speed 0.030289 0.000000e+00\n",
1220 | "9 DOLocationID 0.013995 0.000000e+00\n",
1221 | "8 PULocationID 0.013068 3.746619e-302\n",
1222 | "2 work_hours 0.010840 5.006014e-208\n",
1223 | "6 trip_time 0.007507 5.385560e-100\n",
1224 | "10 RatecodeID 0.005615 3.933726e-56\n",
1225 | "3 pickup_minute 0.000642 3.722759e-01"
1226 | ]
1227 | },
1228 | "execution_count": 18,
1229 | "metadata": {},
1230 | "output_type": "execute_result"
1231 | }
1232 | ],
1233 | "source": [
1234 | "comparison_df = pd.DataFrame(data={'feature': features, 'statistic': statistics, 'p_value': p_values})\n",
1235 | "comparison_df.sort_values(by='p_value', ascending=True).head(11)"
1236 | ]
1237 | },
1238 | {
1239 | "cell_type": "code",
1240 | "execution_count": null,
1241 | "metadata": {},
1242 | "outputs": [],
1243 | "source": []
1244 | }
1245 | ],
1246 | "metadata": {
1247 | "kernelspec": {
1248 | "display_name": "Python 3",
1249 | "language": "python",
1250 | "name": "python3"
1251 | },
1252 | "language_info": {
1253 | "codemirror_mode": {
1254 | "name": "ipython",
1255 | "version": 3
1256 | },
1257 | "file_extension": ".py",
1258 | "mimetype": "text/x-python",
1259 | "name": "python",
1260 | "nbconvert_exporter": "python",
1261 | "pygments_lexer": "ipython3",
1262 | "version": "3.7.7"
1263 | }
1264 | },
1265 | "nbformat": 4,
1266 | "nbformat_minor": 4
1267 | }
1268 |
--------------------------------------------------------------------------------
/slides.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/shreyashankar/debugging-ml-talk/dbcf7b652467341a729a906a00d0a144d6fe1112/slides.pdf
--------------------------------------------------------------------------------