├── Databricks_Talk_shreyashankar.pdf
├── README.md
├── current.pdf
├── mltrace_talk_short.pdf
├── monitoringchallenges.pdf
├── nyc_taxi_2020.ipynb
└── slides.pdf
/Databricks_Talk_shreyashankar.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/shreyashankar/debugging-ml-talk/dbcf7b652467341a729a906a00d0a144d6fe1112/Databricks_Talk_shreyashankar.pdf
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # debugging-ml-talk
2 |
3 | This repo contains (or will contain) the code and slides corresponding to my "Debugging ML in Production" talks. The materials change a lot over time. **The most up-to-date slides are in current.pdf**.
4 |
5 | I am giving / will give different versions of this talk at:
6 | * [NLP Zurich Meetup](https://www.meetup.com/NLP-Zurich/events/275819552/)
7 | * [UCSD's DSC 102: Systems for Scalable Analytics course taught by Arun Kumar](http://cseweb.ucsd.edu/~arunkk/dsc102_winter21/schedule.html)
8 | * [Stanford's MLSys seminar](https://www.youtube.com/watch?v=aGzu7nI8IRE)
9 | * [Verta MLOps Monitoring Salon](https://info.verta.ai/mlops-salon-model-monitoring?utm_content=160052147&utm_medium=social&utm_source=twitter&hss_channel=tw-1081294493213585408)
10 | * [Databricks Data + AI Summit](https://databricks.com/session_na21/catch-me-if-you-can-keeping-up-with-ml-models-in-production)
11 | * [Toronto MLOps World Conference](https://mlopsworld.com/)
12 | * [UC Berkeley RISECamp](https://risecamp.berkeley.edu/)
13 | * [Facebook Data Observability Summit](https://www.linkedin.com/posts/sravankumar-nandamuri-89337032_data-observability-learning-summit-2021-activity-6866778956964741120-iUnI/)
14 | * [Toronto Machine Learning Society Annual Conference](https://bit.ly/TMLS_2021)
15 | * [Google DevFest 2021](https://www.aicamp.ai/event/eventdetails/W2021120809)
16 |
17 | TODO:
18 | - [x] Document notebook
19 | - [x] Re-upload notebook with cell outputs
20 | - [ ] Post slides on the internet in a better place
21 |
--------------------------------------------------------------------------------
/current.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/shreyashankar/debugging-ml-talk/dbcf7b652467341a729a906a00d0a144d6fe1112/current.pdf
--------------------------------------------------------------------------------
/mltrace_talk_short.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/shreyashankar/debugging-ml-talk/dbcf7b652467341a729a906a00d0a144d6fe1112/mltrace_talk_short.pdf
--------------------------------------------------------------------------------
/monitoringchallenges.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/shreyashankar/debugging-ml-talk/dbcf7b652467341a729a906a00d0a144d6fe1112/monitoringchallenges.pdf
--------------------------------------------------------------------------------
/nyc_taxi_2020.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# NYC Taxicab \"Drift\" Example\n",
8 | "\n",
9 | "Author: shreyashankar\n",
10 | "\n",
11 | "This notebook shows a toy example of a machine learning model that achieves similar performance on the train and evaluation sets but experiences performance \"degradation\" when simulating a \"live\" deployment."
12 | ]
13 | },
14 | {
15 | "cell_type": "code",
16 | "execution_count": 1,
17 | "metadata": {},
18 | "outputs": [],
19 | "source": [
20 | "from cuml.dask.ensemble import RandomForestClassifier\n",
21 | "from cuml.metrics import roc_auc_score\n",
22 | "from dask.array import from_array\n",
23 | "from dask.distributed import Client, wait\n",
24 | "from dask_saturn import SaturnCluster\n",
25 | "from progress import progress\n",
26 | "from scipy import stats\n",
27 | "from sklearn.metrics import f1_score\n",
28 | "\n",
29 | "import dask_cudf\n",
30 | "import dask.dataframe as dd\n",
31 | "import pandas as pd"
32 | ]
33 | },
34 | {
35 | "cell_type": "code",
36 | "execution_count": 2,
37 | "metadata": {},
38 | "outputs": [],
39 | "source": [
40 | "# Parameters\n",
41 | "\n",
42 | "n_workers = 3\n",
43 | "\n",
44 | "numeric_feat = [\n",
45 | " \"pickup_weekday\",\n",
46 | " \"pickup_hour\",\n",
47 | " 'work_hours',\n",
48 | " \"pickup_minute\",\n",
49 | " \"passenger_count\",\n",
50 | " 'trip_distance',\n",
51 | " 'trip_time',\n",
52 | " 'trip_speed'\n",
53 | "]\n",
54 | "categorical_feat = [\n",
55 | " \"PULocationID\",\n",
56 | " \"DOLocationID\",\n",
57 | " \"RatecodeID\",\n",
58 | "]\n",
59 | "features = numeric_feat + categorical_feat\n",
60 | "\n",
61 | "EPS = 1e-7"
62 | ]
63 | },
64 | {
65 | "cell_type": "markdown",
66 | "metadata": {},
67 | "source": [
68 | "## Initialize cluster\n",
69 | "\n",
70 | "Using Saturn's predefined cluster setup."
71 | ]
72 | },
73 | {
74 | "cell_type": "code",
75 | "execution_count": 3,
76 | "metadata": {},
77 | "outputs": [
78 | {
79 | "name": "stdout",
80 | "output_type": "stream",
81 | "text": [
82 | "[2021-02-11 02:14:09] INFO - dask-saturn | Cluster is ready\n",
83 | "[2021-02-11 02:14:09] INFO - dask-saturn | Registering default plugins\n",
84 | "[2021-02-11 02:14:09] INFO - dask-saturn | {'tcp://10.0.25.24:37137': {'status': 'repeat'}, 'tcp://10.0.4.201:38121': {'status': 'repeat'}, 'tcp://10.0.9.1:39615': {'status': 'repeat'}}\n"
85 | ]
86 | },
87 | {
88 | "data": {
89 | "text/html": [
90 | "
\n",
91 | "\n",
92 | "\n",
93 | "Client\n",
94 | "\n",
98 | " | \n",
99 | "\n",
100 | "Cluster\n",
101 | "\n",
102 | " - Workers: 3
\n",
103 | " - Cores: 12
\n",
104 | " - Memory: 46.50 GB
\n",
105 | " \n",
106 | " | \n",
107 | "
\n",
108 | "
"
109 | ],
110 | "text/plain": [
111 | ""
112 | ]
113 | },
114 | "execution_count": 3,
115 | "metadata": {},
116 | "output_type": "execute_result"
117 | }
118 | ],
119 | "source": [
120 | "progress('rf-rapids-dask-cluster-setup')\n",
121 | "cluster = SaturnCluster(\n",
122 | " n_workers=n_workers, scheduler_size=\"medium\", worker_size=\"g4dnxlarge\"\n",
123 | ")\n",
124 | "client = Client(cluster)\n",
125 | "client"
126 | ]
127 | },
128 | {
129 | "cell_type": "markdown",
130 | "metadata": {},
131 | "source": [
132 | "## Create helper functions"
133 | ]
134 | },
135 | {
136 | "cell_type": "code",
137 | "execution_count": 4,
138 | "metadata": {},
139 | "outputs": [],
140 | "source": [
141 | "def preprocess(df: dask_cudf.DataFrame, target_col: str, start_date: str = None, end_date: str = None) -> dask_cudf.DataFrame:\n",
142 | " \"\"\"\n",
143 | " This function computes the target ('high_tip'), adds features, and removes unused features.\n",
144 | " Note that zero EDA or cleaning is performed here, whereas in the \"real world\" you should definitely\n",
145 | " inspect and clean the data. If a start or end date is specified, any entries outside of these bounds\n",
146 | " will be dropped from the dataframe.\n",
147 | " \n",
148 | " Args:\n",
149 | " df: dask dataframe representing data\n",
150 | " target_col: column name of the target (must be in df)\n",
151 | " start_date (optional): minimum date in the resulting dataframe\n",
152 | " end_date (optional): maximum date in the resulting dataframe\n",
153 | " \n",
154 | " Returns:\n",
155 | " dask_cudf: DataFrame representing the preprocessed dataframe\n",
156 | " \"\"\"\n",
157 | " # Basic cleaning\n",
158 | " df = df[df.fare_amount > 0] # avoid divide-by-zero\n",
159 | " if start_date:\n",
160 | " df = df[df.tpep_dropoff_datetime.astype('str') >= start_date]\n",
161 | " if end_date:\n",
162 | " df = df[df.tpep_dropoff_datetime.astype('str') <= end_date]\n",
163 | "\n",
164 | " # add target\n",
165 | " df[\"tip_fraction\"] = df.tip_amount / df.fare_amount\n",
166 | " df[target_col] = df[\"tip_fraction\"] > 0.2\n",
167 | "\n",
168 | " # add features\n",
169 | " df[\"pickup_weekday\"] = df.tpep_pickup_datetime.dt.weekday\n",
170 | " df[\"pickup_hour\"] = df.tpep_pickup_datetime.dt.hour\n",
171 | " df[\"pickup_minute\"] = df.tpep_pickup_datetime.dt.minute\n",
172 | " df[\"work_hours\"] = (df.pickup_weekday >= 0) & (df.pickup_weekday <= 4) & (df.pickup_hour >= 8) & (df.pickup_hour <= 18)\n",
173 | " df['trip_time'] = (df.tpep_dropoff_datetime - df.tpep_pickup_datetime).dt.seconds\n",
174 | " df['trip_speed'] = df.trip_distance / (df.trip_time + EPS)\n",
175 | "\n",
176 | " # drop unused columns\n",
177 | " df = df[['tpep_dropoff_datetime'] + features + [target_col]]\n",
178 | " df[features + [target_col]] = df[features + [target_col]].astype(\"float32\").fillna(-1.0)\n",
179 | "\n",
180 | " # convert target to int32 for efficiency (it's just 0s and 1s)\n",
181 | " df[target_col] = df[target_col].astype(\"int32\")\n",
182 | "\n",
183 | " return df.reset_index(drop=True)\n",
184 | "\n",
185 | "def f1_streaming(df: dask_cudf.DataFrame, target_col: str, pred_col: str) -> dask_cudf.Series:\n",
186 | " \"\"\"\n",
187 | " Computes rolling precision and recall columns\n",
188 | " F1 = 2 * (precision * recall) / (precision + recall)\n",
189 | "\n",
190 | " Precision: of the rows we predicted true, how many were true?\n",
191 | " Recall: of all the trues, how many did we predict to be true?\n",
192 | " \n",
193 | " Args:\n",
194 | " df: dask dataframe\n",
195 | " target_col: column name of the target (must be in df)\n",
196 | " pred_col: column name of the prediction (must be in df)\n",
197 | " \n",
198 | " Returns:\n",
199 | " dask_cudf: Series representing the cumulative F1 score\n",
200 | " \"\"\"\n",
201 | " df = df.sort_values(by=['tpep_dropoff_datetime'], ascending=True)\n",
202 | " numerator = (df['prediction'] & df[target_col]).cumsum()\n",
203 | " precision_denominator = df['prediction'].cumsum()\n",
204 | " recall_denominator = df[target_col].cumsum()\n",
205 | " precision = numerator / precision_denominator\n",
206 | " recall = numerator / recall_denominator\n",
207 | " return 2 * (precision * recall) / (precision + recall)\n",
208 | "\n",
209 | "def get_daily_f1_score(partition):\n",
210 | " \"\"\"\n",
211 | " \"\"\"\n",
212 | " numerator = (partition[target_col] & partition['prediction']).sum()\n",
213 | " recall_denominator = partition[target_col].sum()\n",
214 | " precision_denominator = partition['prediction'].sum()\n",
215 | " precision = numerator / precision_denominator\n",
216 | " recall = numerator / recall_denominator\n",
217 | " f1_score = 2 * (precision * recall) / (precision + recall)\n",
218 | " partition['daily_f1'] = f1_score\n",
219 | " return partition.sort_values(by='tpep_dropoff_datetime', ascending=False).head(1)[['day', 'rolling_f1', 'daily_f1']]"
220 | ]
221 | },
222 | {
223 | "cell_type": "markdown",
224 | "metadata": {},
225 | "source": [
226 | "## Load train data\n",
227 | "\n",
228 | "The training window is all of January 2020 and accessible via a public s3 bucket."
229 | ]
230 | },
231 | {
232 | "cell_type": "code",
233 | "execution_count": 5,
234 | "metadata": {},
235 | "outputs": [
236 | {
237 | "name": "stdout",
238 | "output_type": "stream",
239 | "text": [
240 | "Num rows: 6405008, Size: 0.903424059 GB\n"
241 | ]
242 | },
243 | {
244 | "data": {
245 | "text/html": [
246 | "\n",
247 | "\n",
260 | "
\n",
261 | " \n",
262 | " \n",
263 | " | \n",
264 | " VendorID | \n",
265 | " tpep_pickup_datetime | \n",
266 | " tpep_dropoff_datetime | \n",
267 | " passenger_count | \n",
268 | " trip_distance | \n",
269 | " RatecodeID | \n",
270 | " store_and_fwd_flag | \n",
271 | " PULocationID | \n",
272 | " DOLocationID | \n",
273 | " payment_type | \n",
274 | " fare_amount | \n",
275 | " extra | \n",
276 | " mta_tax | \n",
277 | " tip_amount | \n",
278 | " tolls_amount | \n",
279 | " improvement_surcharge | \n",
280 | " total_amount | \n",
281 | " congestion_surcharge | \n",
282 | "
\n",
283 | " \n",
284 | " \n",
285 | " \n",
286 | " 0 | \n",
287 | " 1.0 | \n",
288 | " 2020-01-01 00:28:15 | \n",
289 | " 2020-01-01 00:33:03 | \n",
290 | " 1.0 | \n",
291 | " 1.2 | \n",
292 | " 1.0 | \n",
293 | " N | \n",
294 | " 238.0 | \n",
295 | " 239.0 | \n",
296 | " 1.0 | \n",
297 | " 6.0 | \n",
298 | " 3.0 | \n",
299 | " 0.5 | \n",
300 | " 1.47 | \n",
301 | " 0.0 | \n",
302 | " 0.3 | \n",
303 | " 11.27 | \n",
304 | " 2.5 | \n",
305 | "
\n",
306 | " \n",
307 | " 1 | \n",
308 | " 1.0 | \n",
309 | " 2020-01-01 00:35:39 | \n",
310 | " 2020-01-01 00:43:04 | \n",
311 | " 1.0 | \n",
312 | " 1.2 | \n",
313 | " 1.0 | \n",
314 | " N | \n",
315 | " 239.0 | \n",
316 | " 238.0 | \n",
317 | " 1.0 | \n",
318 | " 7.0 | \n",
319 | " 3.0 | \n",
320 | " 0.5 | \n",
321 | " 1.50 | \n",
322 | " 0.0 | \n",
323 | " 0.3 | \n",
324 | " 12.30 | \n",
325 | " 2.5 | \n",
326 | "
\n",
327 | " \n",
328 | " 2 | \n",
329 | " 1.0 | \n",
330 | " 2020-01-01 00:47:41 | \n",
331 | " 2020-01-01 00:53:52 | \n",
332 | " 1.0 | \n",
333 | " 0.6 | \n",
334 | " 1.0 | \n",
335 | " N | \n",
336 | " 238.0 | \n",
337 | " 238.0 | \n",
338 | " 1.0 | \n",
339 | " 6.0 | \n",
340 | " 3.0 | \n",
341 | " 0.5 | \n",
342 | " 1.00 | \n",
343 | " 0.0 | \n",
344 | " 0.3 | \n",
345 | " 10.80 | \n",
346 | " 2.5 | \n",
347 | "
\n",
348 | " \n",
349 | " 3 | \n",
350 | " 1.0 | \n",
351 | " 2020-01-01 00:55:23 | \n",
352 | " 2020-01-01 01:00:14 | \n",
353 | " 1.0 | \n",
354 | " 0.8 | \n",
355 | " 1.0 | \n",
356 | " N | \n",
357 | " 238.0 | \n",
358 | " 151.0 | \n",
359 | " 1.0 | \n",
360 | " 5.5 | \n",
361 | " 0.5 | \n",
362 | " 0.5 | \n",
363 | " 1.36 | \n",
364 | " 0.0 | \n",
365 | " 0.3 | \n",
366 | " 8.16 | \n",
367 | " 0.0 | \n",
368 | "
\n",
369 | " \n",
370 | " 4 | \n",
371 | " 2.0 | \n",
372 | " 2020-01-01 00:01:58 | \n",
373 | " 2020-01-01 00:04:16 | \n",
374 | " 1.0 | \n",
375 | " 0.0 | \n",
376 | " 1.0 | \n",
377 | " N | \n",
378 | " 193.0 | \n",
379 | " 193.0 | \n",
380 | " 2.0 | \n",
381 | " 3.5 | \n",
382 | " 0.5 | \n",
383 | " 0.5 | \n",
384 | " 0.00 | \n",
385 | " 0.0 | \n",
386 | " 0.3 | \n",
387 | " 4.80 | \n",
388 | " 0.0 | \n",
389 | "
\n",
390 | " \n",
391 | "
\n",
392 | "
"
393 | ],
394 | "text/plain": [
395 | " VendorID tpep_pickup_datetime tpep_dropoff_datetime passenger_count \\\n",
396 | "0 1.0 2020-01-01 00:28:15 2020-01-01 00:33:03 1.0 \n",
397 | "1 1.0 2020-01-01 00:35:39 2020-01-01 00:43:04 1.0 \n",
398 | "2 1.0 2020-01-01 00:47:41 2020-01-01 00:53:52 1.0 \n",
399 | "3 1.0 2020-01-01 00:55:23 2020-01-01 01:00:14 1.0 \n",
400 | "4 2.0 2020-01-01 00:01:58 2020-01-01 00:04:16 1.0 \n",
401 | "\n",
402 | " trip_distance RatecodeID store_and_fwd_flag PULocationID DOLocationID \\\n",
403 | "0 1.2 1.0 N 238.0 239.0 \n",
404 | "1 1.2 1.0 N 239.0 238.0 \n",
405 | "2 0.6 1.0 N 238.0 238.0 \n",
406 | "3 0.8 1.0 N 238.0 151.0 \n",
407 | "4 0.0 1.0 N 193.0 193.0 \n",
408 | "\n",
409 | " payment_type fare_amount extra mta_tax tip_amount tolls_amount \\\n",
410 | "0 1.0 6.0 3.0 0.5 1.47 0.0 \n",
411 | "1 1.0 7.0 3.0 0.5 1.50 0.0 \n",
412 | "2 1.0 6.0 3.0 0.5 1.00 0.0 \n",
413 | "3 1.0 5.5 0.5 0.5 1.36 0.0 \n",
414 | "4 2.0 3.5 0.5 0.5 0.00 0.0 \n",
415 | "\n",
416 | " improvement_surcharge total_amount congestion_surcharge \n",
417 | "0 0.3 11.27 2.5 \n",
418 | "1 0.3 12.30 2.5 \n",
419 | "2 0.3 10.80 2.5 \n",
420 | "3 0.3 8.16 0.0 \n",
421 | "4 0.3 4.80 0.0 "
422 | ]
423 | },
424 | "execution_count": 5,
425 | "metadata": {},
426 | "output_type": "execute_result"
427 | }
428 | ],
429 | "source": [
430 | "taxi = dask_cudf.read_csv(\n",
431 | " \"s3://nyc-tlc/trip data/yellow_tripdata_2020-01.csv\",\n",
432 | " parse_dates=[\"tpep_pickup_datetime\", \"tpep_dropoff_datetime\"],\n",
433 | " storage_options={\"anon\": True},\n",
434 | " assume_missing=True,\n",
435 | ")\n",
436 | "\n",
437 | "print(f\"Num rows: {len(taxi)}, Size: {taxi.memory_usage(deep=True).sum().compute() / 1e9} GB\")\n",
438 | "taxi.head()"
439 | ]
440 | },
441 | {
442 | "cell_type": "code",
443 | "execution_count": 6,
444 | "metadata": {},
445 | "outputs": [
446 | {
447 | "name": "stdout",
448 | "output_type": "stream",
449 | "text": [
450 | "Num rows: 6382762, Size: 0.357434672 GB\n"
451 | ]
452 | }
453 | ],
454 | "source": [
455 | "target_col = \"high_tip\"\n",
456 | "\n",
457 | "taxi_train = preprocess(df=taxi, target_col=target_col)\n",
458 | "print(f\"Num rows: {len(taxi_train)}, Size: {taxi_train.memory_usage(deep=True).sum().compute() / 1e9} GB\")"
459 | ]
460 | },
461 | {
462 | "cell_type": "markdown",
463 | "metadata": {},
464 | "source": [
465 | "## Train model\n",
466 | "\n",
467 | "We will fit a random forest with 100 estimators and `max_depth` of 10 to the training set. Zero hyperparameter tuning is done here. If we were to do any hyperparameter tuning, we should use a hold-out validation set.\n",
468 | "\n",
469 | "We train the model on GPU and evaluate on CPU. We evaluate the model using the [F1 score](https://en.wikipedia.org/wiki/F-score)."
470 | ]
471 | },
472 | {
473 | "cell_type": "code",
474 | "execution_count": 7,
475 | "metadata": {},
476 | "outputs": [
477 | {
478 | "name": "stdout",
479 | "output_type": "stream",
480 | "text": [
481 | "CPU times: user 257 ms, sys: 3.12 ms, total: 261 ms\n",
482 | "Wall time: 21.9 s\n"
483 | ]
484 | }
485 | ],
486 | "source": [
487 | "%%time\n",
488 | "progress('start-rf-rapids-dask-fit')\n",
489 | "\n",
490 | "rfc = RandomForestClassifier(n_estimators=100, max_depth=10, ignore_empty_partitions=True)\n",
491 | "\n",
492 | "rfc.fit(taxi_train[features], taxi_train[target_col])\n",
493 | "progress('finished-rf-rapids-dask-fit')"
494 | ]
495 | },
496 | {
497 | "cell_type": "code",
498 | "execution_count": 8,
499 | "metadata": {},
500 | "outputs": [
501 | {
502 | "name": "stdout",
503 | "output_type": "stream",
504 | "text": [
505 | "F1: 0.6681650475249482\n",
506 | "CPU times: user 3.87 s, sys: 307 ms, total: 4.17 s\n",
507 | "Wall time: 18.3 s\n"
508 | ]
509 | }
510 | ],
511 | "source": [
512 | "%%time\n",
513 | "# Compute F1 \n",
514 | "# This is (relatively) slow since we are copying data to the CPU to compute the metric.\n",
515 | "\n",
516 | "preds = rfc.predict_proba(taxi_train[features])[1]\n",
517 | "print(f'F1: {f1_score(taxi_train[target_col].compute().to_array(), preds.round().compute().to_array())}')"
518 | ]
519 | },
520 | {
521 | "cell_type": "markdown",
522 | "metadata": {},
523 | "source": [
524 | "## Evaluate on test set\n",
525 | "\n",
526 | "The test window is all of February 2020 and also accessible via public s3 bucket. The F1 scores are similar between train and test sets."
527 | ]
528 | },
529 | {
530 | "cell_type": "code",
531 | "execution_count": 9,
532 | "metadata": {},
533 | "outputs": [],
534 | "source": [
535 | "taxi_feb = dask_cudf.read_csv(\n",
536 | " \"s3://nyc-tlc/trip data/yellow_tripdata_2020-02.csv\",\n",
537 | " parse_dates=[\"tpep_pickup_datetime\", \"tpep_dropoff_datetime\"],\n",
538 | " storage_options={\"anon\": True},\n",
539 | " assume_missing=True,\n",
540 | ")\n",
541 | "\n",
542 | "taxi_test = preprocess(taxi_feb, target_col=target_col)"
543 | ]
544 | },
545 | {
546 | "cell_type": "code",
547 | "execution_count": 10,
548 | "metadata": {},
549 | "outputs": [
550 | {
551 | "name": "stdout",
552 | "output_type": "stream",
553 | "text": [
554 | "F1: 0.6658098920024954\n"
555 | ]
556 | }
557 | ],
558 | "source": [
559 | "# Compute F1 on test set\n",
560 | "# This is slow since we are copying data to the CPU to compute the metric.\n",
561 | "\n",
562 | "preds = rfc.predict_proba(taxi_test[features])[1]\n",
563 | "print(f'F1: {f1_score(taxi_test[target_col].compute().to_array(), preds.round().compute().to_array())}')"
564 | ]
565 | },
566 | {
567 | "cell_type": "markdown",
568 | "metadata": {},
569 | "source": [
570 | "## Simulate \"live\" inference on March\n",
571 | "\n",
572 | "As every new batch of points comes in, we make a prediction. We compute the rolling (F1 score since March 1) and daily F1 scores. Note that the daily F1 score drops significantly, but this performance degradation is not so pronounced if we just monitor the rolling F1 score."
573 | ]
574 | },
575 | {
576 | "cell_type": "code",
577 | "execution_count": 11,
578 | "metadata": {},
579 | "outputs": [],
580 | "source": [
581 | "# First, load and sort the march dataframe\n",
582 | "\n",
583 | "taxi_march = dask_cudf.read_csv(\n",
584 | " \"s3://nyc-tlc/trip data/yellow_tripdata_2020-03.csv\",\n",
585 | " parse_dates=[\"tpep_pickup_datetime\", \"tpep_dropoff_datetime\"],\n",
586 | " storage_options={\"anon\": True},\n",
587 | " assume_missing=True,\n",
588 | ")\n",
589 | "\n",
590 | "taxi_inference = preprocess(taxi_march, target_col=target_col, start_date='2020-03-01', end_date='2020-03-31').sort_values(by=['tpep_dropoff_datetime'], ascending=True).reset_index(drop=True)\n",
591 | "taxi_inference['day'] = taxi_inference.tpep_dropoff_datetime.dt.day.to_dask_array()"
592 | ]
593 | },
594 | {
595 | "cell_type": "code",
596 | "execution_count": 12,
597 | "metadata": {},
598 | "outputs": [],
599 | "source": [
600 | "# Save predictions as a new column, compute rolling F1 score\n",
601 | "\n",
602 | "taxi_inference['predicted_prob'] = rfc.predict_proba(taxi_inference[features])[1]\n",
603 | "taxi_inference['prediction'] = taxi_inference['predicted_prob'].round().astype('int32')\n",
604 | "taxi_inference['rolling_f1'] = f1_streaming(taxi_inference, target_col, 'prediction')\n",
605 | "daily_f1 = taxi_inference.groupby('day').apply(get_daily_f1_score, meta={'day': int, 'rolling_f1': float, 'daily_f1': float})"
606 | ]
607 | },
608 | {
609 | "cell_type": "code",
610 | "execution_count": 13,
611 | "metadata": {},
612 | "outputs": [
613 | {
614 | "data": {
615 | "text/html": [
616 | "\n",
617 | "\n",
630 | "
\n",
631 | " \n",
632 | " \n",
633 | " | \n",
634 | " day | \n",
635 | " rolling_f1 | \n",
636 | " daily_f1 | \n",
637 | "
\n",
638 | " \n",
639 | " \n",
640 | " \n",
641 | " 178123 | \n",
642 | " 1 | \n",
643 | " 0.576629 | \n",
644 | " 0.576629 | \n",
645 | "
\n",
646 | " \n",
647 | " 370840 | \n",
648 | " 2 | \n",
649 | " 0.633320 | \n",
650 | " 0.677398 | \n",
651 | "
\n",
652 | " \n",
653 | " 592741 | \n",
654 | " 3 | \n",
655 | " 0.649983 | \n",
656 | " 0.675877 | \n",
657 | "
\n",
658 | " \n",
659 | " 821398 | \n",
660 | " 4 | \n",
661 | " 0.659940 | \n",
662 | " 0.684125 | \n",
663 | "
\n",
664 | " \n",
665 | " 1064741 | \n",
666 | " 5 | \n",
667 | " 0.675841 | \n",
668 | " 0.722298 | \n",
669 | "
\n",
670 | " \n",
671 | " 1307013 | \n",
672 | " 6 | \n",
673 | " 0.682284 | \n",
674 | " 0.708181 | \n",
675 | "
\n",
676 | " \n",
677 | " 58517 | \n",
678 | " 7 | \n",
679 | " 0.668002 | \n",
680 | " 0.555498 | \n",
681 | "
\n",
682 | " \n",
683 | " 225439 | \n",
684 | " 8 | \n",
685 | " 0.659918 | \n",
686 | " 0.572543 | \n",
687 | "
\n",
688 | " \n",
689 | " 400352 | \n",
690 | " 9 | \n",
691 | " 0.660947 | \n",
692 | " 0.670717 | \n",
693 | "
\n",
694 | " \n",
695 | " 583448 | \n",
696 | " 10 | \n",
697 | " 0.661801 | \n",
698 | " 0.670428 | \n",
699 | "
\n",
700 | " \n",
701 | " 765578 | \n",
702 | " 11 | \n",
703 | " 0.663678 | \n",
704 | " 0.684011 | \n",
705 | "
\n",
706 | " \n",
707 | " 936075 | \n",
708 | " 12 | \n",
709 | " 0.667420 | \n",
710 | " 0.711109 | \n",
711 | "
\n",
712 | " \n",
713 | " 1070221 | \n",
714 | " 13 | \n",
715 | " 0.668812 | \n",
716 | " 0.691889 | \n",
717 | "
\n",
718 | " \n",
719 | " 1159620 | \n",
720 | " 14 | \n",
721 | " 0.666032 | \n",
722 | " 0.571661 | \n",
723 | "
\n",
724 | " \n",
725 | " 1219523 | \n",
726 | " 15 | \n",
727 | " 0.664177 | \n",
728 | " 0.564885 | \n",
729 | "
\n",
730 | " \n",
731 | " 1283501 | \n",
732 | " 16 | \n",
733 | " 0.663604 | \n",
734 | " 0.638491 | \n",
735 | "
\n",
736 | " \n",
737 | " 1328995 | \n",
738 | " 17 | \n",
739 | " 0.663178 | \n",
740 | " 0.635958 | \n",
741 | "
\n",
742 | " \n",
743 | " 1365063 | \n",
744 | " 18 | \n",
745 | " 0.662761 | \n",
746 | " 0.628822 | \n",
747 | "
\n",
748 | " \n",
749 | " 1394730 | \n",
750 | " 19 | \n",
751 | " 0.662613 | \n",
752 | " 0.648809 | \n",
753 | "
\n",
754 | " \n",
755 | " 1422146 | \n",
756 | " 20 | \n",
757 | " 0.662300 | \n",
758 | " 0.629325 | \n",
759 | "
\n",
760 | " \n",
761 | " 1438271 | \n",
762 | " 21 | \n",
763 | " 0.661760 | \n",
764 | " 0.534262 | \n",
765 | "
\n",
766 | " \n",
767 | " 1448533 | \n",
768 | " 22 | \n",
769 | " 0.661437 | \n",
770 | " 0.541612 | \n",
771 | "
\n",
772 | " \n",
773 | " 1462011 | \n",
774 | " 23 | \n",
775 | " 0.661225 | \n",
776 | " 0.611136 | \n",
777 | "
\n",
778 | " \n",
779 | " 1473783 | \n",
780 | " 24 | \n",
781 | " 0.660991 | \n",
782 | " 0.594909 | \n",
783 | "
\n",
784 | " \n",
785 | " 1484934 | \n",
786 | " 25 | \n",
787 | " 0.660754 | \n",
788 | " 0.590316 | \n",
789 | "
\n",
790 | " \n",
791 | " 1495523 | \n",
792 | " 26 | \n",
793 | " 0.660534 | \n",
794 | " 0.596606 | \n",
795 | "
\n",
796 | " \n",
797 | " 1507234 | \n",
798 | " 27 | \n",
799 | " 0.660228 | \n",
800 | " 0.576993 | \n",
801 | "
\n",
802 | " \n",
803 | " 1514827 | \n",
804 | " 28 | \n",
805 | " 0.659934 | \n",
806 | " 0.501860 | \n",
807 | "
\n",
808 | " \n",
809 | " 1520358 | \n",
810 | " 29 | \n",
811 | " 0.659764 | \n",
812 | " 0.537860 | \n",
813 | "
\n",
814 | " \n",
815 | " 1529847 | \n",
816 | " 30 | \n",
817 | " 0.659530 | \n",
818 | " 0.576178 | \n",
819 | "
\n",
820 | " \n",
821 | "
\n",
822 | "
"
823 | ],
824 | "text/plain": [
825 | " day rolling_f1 daily_f1\n",
826 | "178123 1 0.576629 0.576629\n",
827 | "370840 2 0.633320 0.677398\n",
828 | "592741 3 0.649983 0.675877\n",
829 | "821398 4 0.659940 0.684125\n",
830 | "1064741 5 0.675841 0.722298\n",
831 | "1307013 6 0.682284 0.708181\n",
832 | "58517 7 0.668002 0.555498\n",
833 | "225439 8 0.659918 0.572543\n",
834 | "400352 9 0.660947 0.670717\n",
835 | "583448 10 0.661801 0.670428\n",
836 | "765578 11 0.663678 0.684011\n",
837 | "936075 12 0.667420 0.711109\n",
838 | "1070221 13 0.668812 0.691889\n",
839 | "1159620 14 0.666032 0.571661\n",
840 | "1219523 15 0.664177 0.564885\n",
841 | "1283501 16 0.663604 0.638491\n",
842 | "1328995 17 0.663178 0.635958\n",
843 | "1365063 18 0.662761 0.628822\n",
844 | "1394730 19 0.662613 0.648809\n",
845 | "1422146 20 0.662300 0.629325\n",
846 | "1438271 21 0.661760 0.534262\n",
847 | "1448533 22 0.661437 0.541612\n",
848 | "1462011 23 0.661225 0.611136\n",
849 | "1473783 24 0.660991 0.594909\n",
850 | "1484934 25 0.660754 0.590316\n",
851 | "1495523 26 0.660534 0.596606\n",
852 | "1507234 27 0.660228 0.576993\n",
853 | "1514827 28 0.659934 0.501860\n",
854 | "1520358 29 0.659764 0.537860\n",
855 | "1529847 30 0.659530 0.576178"
856 | ]
857 | },
858 | "execution_count": 13,
859 | "metadata": {},
860 | "output_type": "execute_result"
861 | }
862 | ],
863 | "source": [
864 | "daily_f1.sort_values(by='day', ascending=True).compute()"
865 | ]
866 | },
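  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A minimal sketch of why the rolling metric masks the drop: a cumulative average weights all of history equally, so a recent dip barely moves it. This is illustrative only, using a cumulative mean of hypothetical daily F1 values as a stand-in for the exact cumulative F1 computed above."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "# Hypothetical daily F1 scores: stable for 20 days, then a drop over the last 10 days\n",
    "daily = np.array([0.68] * 20 + [0.55] * 10)\n",
    "rolling = np.cumsum(daily) / np.arange(1, len(daily) + 1)  # cumulative ('rolling since day 1') mean\n",
    "\n",
    "print(f'Daily F1 on the last day: {daily[-1]:.2f}')      # 0.55 -- the drop is obvious\n",
    "print(f'Rolling F1 on the last day: {rolling[-1]:.3f}')  # ~0.637 -- the drop is damped"
   ]
  },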
867 | {
868 | "cell_type": "markdown",
869 | "metadata": {},
870 | "source": [
871 | "## Evaluate model on later months\n",
872 | "\n",
873 | "We see the performance drop in March 2020, but what happens for future months?"
874 | ]
875 | },
876 | {
877 | "cell_type": "code",
878 | "execution_count": 14,
879 | "metadata": {},
880 | "outputs": [
881 | {
882 | "name": "stdout",
883 | "output_type": "stream",
884 | "text": [
885 | "Loading month 2020-03 for the first time.\n",
886 | "2020-03\n",
887 | "\tF1: 0.6592796100378214\n",
888 | "Loading month 2020-04 for the first time.\n",
889 | "2020-04\n",
890 | "\tF1: 0.5714705472990737\n",
891 | "Loading month 2020-05 for the first time.\n",
892 | "2020-05\n",
893 | "\tF1: 0.5530868473460906\n",
894 | "Loading month 2020-06 for the first time.\n",
895 | "2020-06\n",
896 | "\tF1: 0.5967621469282887\n"
897 | ]
898 | }
899 | ],
900 | "source": [
901 | "# Cycle through many test sets\n",
902 | "\n",
903 | "months = ['2020-03', '2020-04', '2020-05', '2020-06']\n",
904 | "month_dfs = {}\n",
905 | "\n",
906 | "for month in months:\n",
907 | " \n",
908 | " if month not in month_dfs:\n",
909 | " print(f'Loading month {month} for the first time.')\n",
910 | " df = dask_cudf.read_csv(\n",
911 | " f\"s3://nyc-tlc/trip data/yellow_tripdata_{month}.csv\",\n",
912 | " parse_dates=[\"tpep_pickup_datetime\", \"tpep_dropoff_datetime\"],\n",
913 | " storage_options={\"anon\": True},\n",
914 | " assume_missing=True,\n",
915 | " )\n",
916 | "\n",
917 | " df = preprocess(df, target_col=target_col)\n",
918 | " month_dfs[month] = df.copy()\n",
919 | " \n",
920 | " curr_taxi_test = month_dfs[month]\n",
921 | " \n",
922 | " preds = rfc.predict_proba(curr_taxi_test[features])[1]\n",
923 | " print(month)\n",
924 | " print(f'\\tF1: {f1_score(curr_taxi_test[target_col].compute().to_array(), preds.round().compute().to_array())}')"
925 | ]
926 | },
927 | {
928 | "cell_type": "markdown",
929 | "metadata": {},
930 | "source": [
931 | "## Inspect differences between feature values\n",
932 | "\n",
933 | "Maybe the distribution of data shifted. We could try to quantify this using a 2-sided statistical test (Kolmogorov Smirnov in this example)."
934 | ]
935 | },
936 | {
937 | "cell_type": "markdown",
938 | "metadata": {},
939 | "source": [
940 | "### Compare January 2020 vs February 2020\n",
941 | "\n",
942 | "This snippet shows that the p values being small doesn't really tell us much, as we get very small p values when comparing January 2020 vs February 2020 even though we know the F1 score was similar. Curse \"big data.\""
943 | ]
944 | },
945 | {
946 | "cell_type": "code",
947 | "execution_count": 15,
948 | "metadata": {},
949 | "outputs": [],
950 | "source": [
951 | "statistics = []\n",
952 | "p_values = []\n",
953 | "\n",
954 | "for feature in features:\n",
955 | " statistic, p_value = stats.ks_2samp(taxi_train[feature].compute().to_pandas(), taxi_test[feature].compute().to_pandas())\n",
956 | " statistics.append(statistic)\n",
957 | " p_values.append(p_value)"
958 | ]
959 | },
960 | {
961 | "cell_type": "code",
962 | "execution_count": 16,
963 | "metadata": {},
964 | "outputs": [
965 | {
966 | "data": {
967 | "text/html": [
968 | "\n",
969 | "\n",
982 | "
\n",
983 | " \n",
984 | " \n",
985 | " | \n",
986 | " feature | \n",
987 | " statistic | \n",
988 | " p_value | \n",
989 | "
\n",
990 | " \n",
991 | " \n",
992 | " \n",
993 | " 0 | \n",
994 | " pickup_weekday | \n",
995 | " 0.046196 | \n",
996 | " 0.000000e+00 | \n",
997 | "
\n",
998 | " \n",
999 | " 2 | \n",
1000 | " work_hours | \n",
1001 | " 0.028587 | \n",
1002 | " 0.000000e+00 | \n",
1003 | "
\n",
1004 | " \n",
1005 | " 6 | \n",
1006 | " trip_time | \n",
1007 | " 0.017205 | \n",
1008 | " 0.000000e+00 | \n",
1009 | "
\n",
1010 | " \n",
1011 | " 7 | \n",
1012 | " trip_speed | \n",
1013 | " 0.035415 | \n",
1014 | " 0.000000e+00 | \n",
1015 | "
\n",
1016 | " \n",
1017 | " 1 | \n",
1018 | " pickup_hour | \n",
1019 | " 0.009676 | \n",
1020 | " 8.610133e-258 | \n",
1021 | "
\n",
1022 | " \n",
1023 | " 5 | \n",
1024 | " trip_distance | \n",
1025 | " 0.005312 | \n",
1026 | " 5.266602e-78 | \n",
1027 | "
\n",
1028 | " \n",
1029 | " 8 | \n",
1030 | " PULocationID | \n",
1031 | " 0.004083 | \n",
1032 | " 2.994877e-46 | \n",
1033 | "
\n",
1034 | " \n",
1035 | " 9 | \n",
1036 | " DOLocationID | \n",
1037 | " 0.003132 | \n",
1038 | " 2.157559e-27 | \n",
1039 | "
\n",
1040 | " \n",
1041 | " 4 | \n",
1042 | " passenger_count | \n",
1043 | " 0.002947 | \n",
1044 | " 2.634493e-24 | \n",
1045 | "
\n",
1046 | " \n",
1047 | " 10 | \n",
1048 | " RatecodeID | \n",
1049 | " 0.002616 | \n",
1050 | " 3.047481e-19 | \n",
1051 | "
\n",
1052 | " \n",
1053 | " 3 | \n",
1054 | " pickup_minute | \n",
1055 | " 0.000702 | \n",
1056 | " 8.861498e-02 | \n",
1057 | "
\n",
1058 | " \n",
1059 | "
\n",
1060 | "
"
1061 | ],
1062 | "text/plain": [
1063 | " feature statistic p_value\n",
1064 | "0 pickup_weekday 0.046196 0.000000e+00\n",
1065 | "2 work_hours 0.028587 0.000000e+00\n",
1066 | "6 trip_time 0.017205 0.000000e+00\n",
1067 | "7 trip_speed 0.035415 0.000000e+00\n",
1068 | "1 pickup_hour 0.009676 8.610133e-258\n",
1069 | "5 trip_distance 0.005312 5.266602e-78\n",
1070 | "8 PULocationID 0.004083 2.994877e-46\n",
1071 | "9 DOLocationID 0.003132 2.157559e-27\n",
1072 | "4 passenger_count 0.002947 2.634493e-24\n",
1073 | "10 RatecodeID 0.002616 3.047481e-19\n",
1074 | "3 pickup_minute 0.000702 8.861498e-02"
1075 | ]
1076 | },
1077 | "execution_count": 16,
1078 | "metadata": {},
1079 | "output_type": "execute_result"
1080 | }
1081 | ],
1082 | "source": [
1083 | "comparison_df = pd.DataFrame(data={'feature': features, 'statistic': statistics, 'p_value': p_values})\n",
1084 | "comparison_df.sort_values(by='p_value', ascending=True).head(11)"
1085 | ]
1086 | },
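  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A minimal synthetic sketch of the \"big data\" issue above: with millions of samples, `ks_2samp` reports a vanishingly small p value even for a practically negligible shift, so a small p value alone is a poor drift alarm at this scale. The sample size and shift below are made up for illustration."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "rng = np.random.default_rng(0)\n",
    "a = rng.normal(loc=0.00, scale=1.0, size=2_000_000)\n",
    "b = rng.normal(loc=0.01, scale=1.0, size=2_000_000)  # tiny, practically negligible shift\n",
    "\n",
    "# stats was already imported from scipy at the top of the notebook\n",
    "statistic, p_value = stats.ks_2samp(a, b)\n",
    "print(f'KS statistic: {statistic:.4f}, p value: {p_value:.2e}')"
   ]
  },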
1087 | {
1088 | "cell_type": "markdown",
1089 | "metadata": {},
1090 | "source": [
1091 | "### Compare January 2020 vs March 2020\n",
1092 | "\n",
1093 | "These p values are also small, which is good? But if this method in general sends warning alerts all the time, an end user might not trust it."
1094 | ]
1095 | },
1096 | {
1097 | "cell_type": "code",
1098 | "execution_count": 17,
1099 | "metadata": {},
1100 | "outputs": [],
1101 | "source": [
1102 | "statistics = []\n",
1103 | "p_values = []\n",
1104 | "\n",
1105 | "for feature in features:\n",
1106 | " statistic, p_value = stats.ks_2samp(taxi_train[feature].compute().to_pandas(), taxi_inference[feature].compute().to_pandas())\n",
1107 | " statistics.append(statistic)\n",
1108 | " p_values.append(p_value)"
1109 | ]
1110 | },
1111 | {
1112 | "cell_type": "code",
1113 | "execution_count": 18,
1114 | "metadata": {},
1115 | "outputs": [
1116 | {
1117 | "data": {
1118 | "text/html": [
1119 | "\n",
1120 | "\n",
1133 | "
\n",
1134 | " \n",
1135 | " \n",
1136 | " | \n",
1137 | " feature | \n",
1138 | " statistic | \n",
1139 | " p_value | \n",
1140 | "
\n",
1141 | " \n",
1142 | " \n",
1143 | " \n",
1144 | " 0 | \n",
1145 | " pickup_weekday | \n",
1146 | " 0.059051 | \n",
1147 | " 0.000000e+00 | \n",
1148 | "
\n",
1149 | " \n",
1150 | " 1 | \n",
1151 | " pickup_hour | \n",
1152 | " 0.017536 | \n",
1153 | " 0.000000e+00 | \n",
1154 | "
\n",
1155 | " \n",
1156 | " 4 | \n",
1157 | " passenger_count | \n",
1158 | " 0.022485 | \n",
1159 | " 0.000000e+00 | \n",
1160 | "
\n",
1161 | " \n",
1162 | " 5 | \n",
1163 | " trip_distance | \n",
1164 | " 0.017913 | \n",
1165 | " 0.000000e+00 | \n",
1166 | "
\n",
1167 | " \n",
1168 | " 7 | \n",
1169 | " trip_speed | \n",
1170 | " 0.030289 | \n",
1171 | " 0.000000e+00 | \n",
1172 | "
\n",
1173 | " \n",
1174 | " 9 | \n",
1175 | " DOLocationID | \n",
1176 | " 0.013995 | \n",
1177 | " 0.000000e+00 | \n",
1178 | "
\n",
1179 | " \n",
1180 | " 8 | \n",
1181 | " PULocationID | \n",
1182 | " 0.013068 | \n",
1183 | " 3.746619e-302 | \n",
1184 | "
\n",
1185 | " \n",
1186 | " 2 | \n",
1187 | " work_hours | \n",
1188 | " 0.010840 | \n",
1189 | " 5.006014e-208 | \n",
1190 | "
\n",
1191 | " \n",
1192 | " 6 | \n",
1193 | " trip_time | \n",
1194 | " 0.007507 | \n",
1195 | " 5.385560e-100 | \n",
1196 | "
\n",
1197 | " \n",
1198 | " 10 | \n",
1199 | " RatecodeID | \n",
1200 | " 0.005615 | \n",
1201 | " 3.933726e-56 | \n",
1202 | "
\n",
1203 | " \n",
1204 | " 3 | \n",
1205 | " pickup_minute | \n",
1206 | " 0.000642 | \n",
1207 | " 3.722759e-01 | \n",
1208 | "
\n",
1209 | " \n",
1210 | "
\n",
1211 | "
"
1212 | ],
1213 | "text/plain": [
1214 | " feature statistic p_value\n",
1215 | "0 pickup_weekday 0.059051 0.000000e+00\n",
1216 | "1 pickup_hour 0.017536 0.000000e+00\n",
1217 | "4 passenger_count 0.022485 0.000000e+00\n",
1218 | "5 trip_distance 0.017913 0.000000e+00\n",
1219 | "7 trip_speed 0.030289 0.000000e+00\n",
1220 | "9 DOLocationID 0.013995 0.000000e+00\n",
1221 | "8 PULocationID 0.013068 3.746619e-302\n",
1222 | "2 work_hours 0.010840 5.006014e-208\n",
1223 | "6 trip_time 0.007507 5.385560e-100\n",
1224 | "10 RatecodeID 0.005615 3.933726e-56\n",
1225 | "3 pickup_minute 0.000642 3.722759e-01"
1226 | ]
1227 | },
1228 | "execution_count": 18,
1229 | "metadata": {},
1230 | "output_type": "execute_result"
1231 | }
1232 | ],
1233 | "source": [
1234 | "comparison_df = pd.DataFrame(data={'feature': features, 'statistic': statistics, 'p_value': p_values})\n",
1235 | "comparison_df.sort_values(by='p_value', ascending=True).head(11)"
1236 | ]
1237 | },
1238 | {
1239 | "cell_type": "code",
1240 | "execution_count": null,
1241 | "metadata": {},
1242 | "outputs": [],
1243 | "source": []
1244 | }
1245 | ],
1246 | "metadata": {
1247 | "kernelspec": {
1248 | "display_name": "Python 3",
1249 | "language": "python",
1250 | "name": "python3"
1251 | },
1252 | "language_info": {
1253 | "codemirror_mode": {
1254 | "name": "ipython",
1255 | "version": 3
1256 | },
1257 | "file_extension": ".py",
1258 | "mimetype": "text/x-python",
1259 | "name": "python",
1260 | "nbconvert_exporter": "python",
1261 | "pygments_lexer": "ipython3",
1262 | "version": "3.7.7"
1263 | }
1264 | },
1265 | "nbformat": 4,
1266 | "nbformat_minor": 4
1267 | }
1268 |
--------------------------------------------------------------------------------
/slides.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/shreyashankar/debugging-ml-talk/dbcf7b652467341a729a906a00d0a144d6fe1112/slides.pdf
--------------------------------------------------------------------------------