├── requirements.txt
├── res
    ├── pipeline.png
    └── traceview.png
├── .gitignore
├── run_weekly_inference.py
├── download.sh
├── components.py
├── solutions
    ├── components.py
    └── main.py
├── tests.py
├── main.py
└── README.md


/requirements.txt:
--------------------------------------------------------------------------------
1 | fastparquet
2 | mltrace
3 | pandas
4 | pyarrow
5 | sklearn


--------------------------------------------------------------------------------
/res/pipeline.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/loglabs/mltrace-demo/HEAD/res/pipeline.png


--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
1 | data*
2 | *.DS_Store*
3 | .ipynb_checkpoints*
4 | *__pycache__*
5 | model.joblib


--------------------------------------------------------------------------------
/res/traceview.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/loglabs/mltrace-demo/HEAD/res/traceview.png


--------------------------------------------------------------------------------
/run_weekly_inference.py:
--------------------------------------------------------------------------------
 1 | from datetime import timedelta, date
 2 | 
 3 | import subprocess
 4 | 
 5 | if __name__ == "__main__":
 6 |     start_date = date(2020, 2, 1)
 7 |     end_date = date(2020, 5, 31)
 8 |     prev_dt = start_date
 9 |     for n in range(7, int((end_date - start_date).days) + 1, 7):
10 |         curr_dt = start_date + timedelta(n)
11 |         curr_dt.strftime("%m/%d/%Y")
12 |         command = f'python main.py --mode=inference --start={prev_dt.strftime("%m/%d/%Y")} --end={curr_dt.strftime("%m/%d/%Y")}'
13 |         print(command)
14 |         subprocess.run(command, shell=True)
15 |         prev_dt = curr_dt
16 | 


--------------------------------------------------------------------------------
/download.sh:
--------------------------------------------------------------------------------
 1 | #!/bin/sh 
 2 | 
 3 | mkdir data
 4 | cd data
 5 | 
 6 | curl -L -o jan.pq 'https://drive.google.com/uc?export=download&id=1dKT7TVjWH3zU552ks6p68vmwf5aAZTia'
 7 | 
 8 | curl -L -o feb.pq 'https://drive.google.com/uc?export=download&id=1qdLJTTGagS9XhH7ZMbm9EHBxK-eQYpCV'
 9 | 
10 | curl -L -o march.pq 'https://drive.google.com/uc?export=download&id=1VvQcHjsMukP6-o7h2m9346XlbGnbYI6U'
11 | 
12 | curl -L -o april.pq 'https://drive.google.com/uc?export=download&id=1noRrPXy4HlpFIs2PBkFQntXNljwiOCtk'
13 | 
14 | curl -L -o may.pq 'https://drive.google.com/uc?export=download&id=1vx3p1NOjckbl4ezvprEFC093QXhvDZfd'
15 | 
16 | cd ..
17 | 


--------------------------------------------------------------------------------
/components.py:
--------------------------------------------------------------------------------
 1 | """
 2 | components.py
 3 | 
 4 | This file defines components of our ML pipeline and their respective
 5 | metadata. Your exercise is to add tests to this file in the component
 6 | specs.
 7 | """
 8 | 
 9 | from mltrace import Component
10 | from tests import *
11 | 
12 | 
13 | class Cleaning(Component):
14 |     def __init__(self, beforeTests=[], afterTests=[]):
15 | 
16 |         super().__init__(
17 |             name="cleaning",
18 |             owner="plumber",
19 |             description="Cleans raw NYC taxicab data",
20 |             tags=["nyc-taxicab"],
21 |             beforeTests=beforeTests,
22 |             afterTests=afterTests,
23 |         )
24 | 
25 | 
26 | class Featuregen(Component):
27 |     def __init__(self, beforeTests=[], afterTests=[]):
28 | 
29 |         super().__init__(
30 |             name="featuregen",
31 |             owner="spark-gymnast",
32 |             description="Generates features for high tip prediction problem",
33 |             tags=["nyc-taxicab"],
34 |             beforeTests=beforeTests,
35 |             afterTests=afterTests,
36 |         )
37 | 
38 | 
39 | class TrainTestSplit(Component):
40 |     def __init__(self, beforeTests=[], afterTests=[]):
41 | 
42 |         super().__init__(
43 |             name="splitting",
44 |             owner="fission",
45 |             description="Splits data into training and test sets",
46 |             tags=["nyc-taxicab"],
47 |             beforeTests=beforeTests,
48 |             afterTests=afterTests,
49 |         )
50 | 
51 | 
52 | class Training(Component):
53 |     def __init__(self, beforeTests=[], afterTests=[]):
54 | 
55 |         super().__init__(
56 |             name="training",
57 |             owner="personal-trainer",
58 |             description="Trains model for high tip prediction problem",
59 |             tags=["nyc-taxicab"],
60 |             beforeTests=beforeTests,
61 |             afterTests=afterTests,
62 |         )
63 | 
64 | 
65 | class Inference(Component):
66 |     def __init__(self, beforeTests=[], afterTests=[]):
67 | 
68 |         super().__init__(
69 |             name="inference",
70 |             owner="sherlock-holmes",
71 |             description="Predicts high tip probability",
72 |             tags=["nyc-taxicab"],
73 |             beforeTests=beforeTests,
74 |             afterTests=afterTests,
75 |         )
76 | 


--------------------------------------------------------------------------------
/solutions/components.py:
--------------------------------------------------------------------------------
 1 | """
 2 | SOLUTION: components.py
 3 | 
 4 | This file defines components of our ML pipeline and their respective
 5 | metadata. Some of the components contain tests to execute before and
 6 | after the components are run.
 7 | """
 8 | 
 9 | from mltrace import Component
10 | from tests import *
11 | 
12 | 
13 | class Cleaning(Component):
14 |     def __init__(self, beforeTests=[], afterTests=[]):
15 | 
16 |         super().__init__(
17 |             name="cleaning",
18 |             owner="plumber",
19 |             description="Cleans raw NYC taxicab data",
20 |             tags=["nyc-taxicab"],
21 |             beforeTests=beforeTests,
22 |             afterTests=afterTests,
23 |         )
24 | 
25 | 
26 | class Featuregen(Component):
27 |     def __init__(self, beforeTests=[], afterTests=[OutliersTest]):
28 | 
29 |         super().__init__(
30 |             name="featuregen",
31 |             owner="spark-gymnast",
32 |             description="Generates features for high tip prediction problem",
33 |             tags=["nyc-taxicab"],
34 |             beforeTests=beforeTests,
35 |             afterTests=afterTests,
36 |         )
37 | 
38 | 
39 | class TrainTestSplit(Component):
40 |     def __init__(self, beforeTests=[], afterTests=[TrainingAssumptionsTest]):
41 | 
42 |         super().__init__(
43 |             name="splitting",
44 |             owner="fission",
45 |             description="Splits data into training and test sets",
46 |             tags=["nyc-taxicab"],
47 |             beforeTests=beforeTests,
48 |             afterTests=afterTests,
49 |         )
50 | 
51 | 
52 | class Training(Component):
53 |     def __init__(self, beforeTests=[], afterTests=[ModelIntegrityTest]):
54 | 
55 |         super().__init__(
56 |             name="training",
57 |             owner="personal-trainer",
58 |             description="Trains model for high tip prediction problem",
59 |             tags=["nyc-taxicab"],
60 |             beforeTests=beforeTests,
61 |             afterTests=afterTests,
62 |         )
63 | 
64 | 
65 | class Inference(Component):
66 |     def __init__(self, beforeTests=[], afterTests=[]):
67 | 
68 |         super().__init__(
69 |             name="inference",
70 |             owner="sherlock-holmes",
71 |             description="Predicts high tip probability",
72 |             tags=["nyc-taxicab"],
73 |             beforeTests=beforeTests,
74 |             afterTests=afterTests,
75 |         )
76 | 


--------------------------------------------------------------------------------
/tests.py:
--------------------------------------------------------------------------------
  1 | """
  2 | tests.py
  3 | 
  4 | This file defines tests to run on inputs and outputs of components.
  5 | """
  6 | 
  7 | from mltrace import Test
  8 | 
  9 | import pandas as pd
 10 | import typing
 11 | 
 12 | 
 13 | class OutliersTest(Test):
 14 |     def __init__(self):
 15 |         super().__init__("Outliers")
 16 | 
 17 |     def testComputeStats(self, df: pd.DataFrame):
 18 |         """
 19 |         Computes stats for all numeric columns in the dataframe.
 20 |         """
 21 |         # Get numerical columns
 22 |         num_df = df.select_dtypes(include=["number"])
 23 | 
 24 |         # Compute stats
 25 |         stats = num_df.describe()
 26 |         print("Dataframe statistics:")
 27 |         print(stats)
 28 | 
 29 |     def testZScore(
 30 |         self,
 31 |         df: pd.DataFrame,
 32 |         stdev_cutoff: float = 5.0,
 33 |         threshold: float = 0.05,
 34 |     ):
 35 |         """
 36 |         Checks to make sure there are no outliers using z score cutoff.
 37 |         """
 38 |         # Get numerical columns
 39 |         num_df = df.select_dtypes(include=["number"])
 40 | 
 41 |         z_scores = (
 42 |             (num_df - num_df.mean(axis=0, skipna=True))
 43 |             / num_df.std(axis=0, skipna=True)
 44 |         ).abs()
 45 | 
 46 |         if (z_scores > stdev_cutoff).to_numpy().sum() > threshold * len(df):
 47 |             print(
 48 |                 f"Number of outliers: {(z_scores > stdev_cutoff).to_numpy().sum()}"
 49 |             )
 50 |             print(f"Outlier threshold: {threshold * len(df)}")
 51 |             raise Exception("There are outlier values!")
 52 | 
 53 | 
 54 | class TrainingAssumptionsTest(Test):
 55 |     def __init__(self):
 56 |         super().__init__("Training Assumptions")
 57 | 
 58 |     # Train-test leakage
 59 |     def testLeakage(
 60 |         self, train_df: pd.DataFrame, test_df: pd.DataFrame, date_column: str
 61 |     ):
 62 |         """
 63 |         Checks to make sure there is no leakage in the training data.
 64 |         """
 65 |         if train_df[date_column].max() > test_df[date_column].min():
 66 |             raise Exception(f"Train and test data are overlapping in dates!")
 67 | 
 68 |     # Assess class imbalance
 69 |     def testClassImbalance(
 70 |         self, train_df: pd.DataFrame, label_column: str, threshold: float = 0.1
 71 |     ):
 72 |         """
 73 |         Checks to make sure there is no class imbalance in the training data.
 74 |         """
 75 |         frequencies = train_df[label_column].value_counts(normalize=True)
 76 |         if frequencies.min() < threshold:
 77 |             raise Exception(f"Class imbalance is too high!")
 78 | 
 79 | 
 80 | class ModelIntegrityTest(Test):
 81 |     def __init__(self):
 82 |         super().__init__("Model Integrity")
 83 | 
 84 |     def testOverfitting(
 85 |         self,
 86 |         train_scores: typing.Dict,
 87 |         test_scores: typing.Dict,
 88 |         threshold: float = 0.5,
 89 |     ):
 90 |         """
 91 |         Test that train and test metrics differences are within a
 92 |         threshold of 5 percent.
 93 |         """
 94 |         for name, val in train_scores.items():
 95 |             if abs(val - test_scores[name]) > threshold:
 96 |                 raise Exception(
 97 |                     f"Model overfitted with {name} diff of "
 98 |                     + f"{abs(val - test_scores[name])}"
 99 |                 )
100 | 
101 |     def testFeatureImportances(
102 |         self,
103 |         feature_importances: pd.DataFrame,
104 |         importance_threshold: float = 0.01,
105 |         num_important_features_threshold: float = 0.5,
106 |     ):
107 |         """Test that feature importances are not heavily skewed."""
108 |         num_unimportant_features = (
109 |             feature_importances["importance"] < importance_threshold
110 |         ).sum()
111 |         if (
112 |             float(num_unimportant_features / len(feature_importances))
113 |             > num_important_features_threshold
114 |         ):
115 |             raise Exception(
116 |                 f"{float(num_unimportant_features / len(feature_importances))} "
117 |                 + "features are unimportant!"
118 |             )
119 | 


--------------------------------------------------------------------------------
/main.py:
--------------------------------------------------------------------------------
  1 | import argparse
  2 | import os
  3 | import pandas as pd
  4 | import typing
  5 | 
  6 | from components import *
  7 | from datetime import datetime
  8 | from joblib import dump, load
  9 | from sklearn.ensemble import RandomForestClassifier
 10 | from sklearn.metrics import (
 11 |     f1_score,
 12 |     accuracy_score,
 13 |     precision_score,
 14 |     recall_score,
 15 | )
 16 | 
 17 | ### PARSING ARGS ###
 18 | parser = argparse.ArgumentParser(description="Run inference.")
 19 | parser.add_argument("--start", type=str, help="Start date", nargs="?")
 20 | parser.add_argument("--end", type=str, help="End date", nargs="?")
 21 | parser.add_argument(
 22 |     "--mode",
 23 |     type=str,
 24 |     help="training or inference",
 25 |     const="inference",
 26 |     nargs="?",
 27 | )
 28 | args = parser.parse_args()
 29 | 
 30 | 
 31 | def load_data(start_date: str, end_date: str) -> pd.DataFrame:
 32 |     """
 33 |     Format: MM/DD/YYYY
 34 | 
 35 |     This function loads the trip data corresponding to the specified
 36 |     dates. The data must be stored in the "data" folder and can
 37 |     be populated using the download.sh script.
 38 |     """
 39 |     # Iterate through months and years between start and end dates
 40 |     start_date = datetime.strptime(start_date, "%m/%d/%Y")
 41 |     end_date = datetime.strptime(end_date, "%m/%d/%Y")
 42 | 
 43 |     assert end_date >= start_date
 44 |     assert end_date.year == 2020
 45 |     assert start_date.month >= 1
 46 |     assert start_date.month <= 5
 47 | 
 48 |     dfs = []
 49 | 
 50 |     for month in range(start_date.month, start_date.month + 1):
 51 |         df = pd.read_parquet("data/jan.pq")
 52 |         if month == 2:
 53 |             df = pd.read_parquet("data/feb.pq")
 54 |         elif month == 3:
 55 |             df = pd.read_parquet("data/march.pq")
 56 |         elif month == 4:
 57 |             df = pd.read_parquet("data/april.pq")
 58 |         elif month == 5:
 59 |             df = pd.read_parquet("data/may.pq")
 60 |         dfs.append(df)
 61 | 
 62 |     df = pd.concat(dfs)
 63 |     return df
 64 | 
 65 | 
 66 | def clean_data(
 67 |     df: pd.DataFrame, start_date: str = None, end_date: str = None
 68 | ) -> pd.DataFrame:
 69 |     """
 70 |     This function removes rows with negligible fare amounts and out of bounds of the start and end dates.
 71 | 
 72 |     Args:
 73 |         df: pd dataframe representing data
 74 |         start_date (optional): minimum date in the resulting dataframe
 75 |         end_date (optional): maximum date in the resulting dataframe (not inclusive)
 76 | 
 77 |     Returns:
 78 |         pd: DataFrame representing the cleaned dataframe
 79 |     """
 80 |     df = df[df.fare_amount > 5]  # throw out neglibible fare amounts
 81 |     if start_date:
 82 |         df = df[df.tpep_dropoff_datetime.dt.strftime("%m/%d/%Y") >= start_date]
 83 |     if end_date:
 84 |         df = df[df.tpep_dropoff_datetime.dt.strftime("%m/%d/%Y") < end_date]
 85 | 
 86 |     clean_df = df.reset_index(drop=True)
 87 |     return clean_df
 88 | 
 89 | 
 90 | def featurize_data(
 91 |     df: pd.DataFrame, tip_fraction: float = 0.1, imputation_value: float = -1.0
 92 | ) -> pd.DataFrame:
 93 |     """
 94 |     This function constructs features from the dataframe.
 95 |     """
 96 |     # Compute pickup features
 97 |     pickup_weekday = df.tpep_pickup_datetime.dt.weekday
 98 |     pickup_hour = df.tpep_pickup_datetime.dt.hour
 99 |     pickup_minute = df.tpep_pickup_datetime.dt.minute
100 |     work_hours = (
101 |         (pickup_weekday >= 0)
102 |         & (pickup_weekday <= 4)
103 |         & (pickup_hour >= 8)
104 |         & (pickup_hour <= 18)
105 |     )
106 | 
107 |     # Compute time and speed features
108 |     trip_time = (df.tpep_dropoff_datetime - df.tpep_pickup_datetime).dt.seconds
109 |     trip_speed = df.trip_distance / (trip_time + 1e7)
110 | 
111 |     # Compute label
112 |     tip_fraction_col = df.tip_amount / df.fare_amount
113 | 
114 |     # Join all features, identifier, and label
115 |     features_df = pd.DataFrame(
116 |         {
117 |             "tpep_pickup_datetime": df.tpep_pickup_datetime,
118 |             "pickup_weekday": pickup_weekday,
119 |             "pickup_hour": pickup_hour,
120 |             "pickup_minute": pickup_minute,
121 |             "work_hours": work_hours,
122 |             "trip_time": trip_time,
123 |             "trip_speed": trip_speed,
124 |             "trip_distance": df.trip_distance,
125 |             "passenger_count": df.passenger_count,
126 |             "congestion_surcharge": df.congestion_surcharge,
127 |             "loc_code_diffs": (df.DOLocationID - df.PULocationID).abs(),
128 |             "PULocationID": df.PULocationID,
129 |             "DOLocationID": df.DOLocationID,
130 |             "RatecodeID": df.RatecodeID,
131 |             "VendorID": df.VendorID,
132 |             "tip_amount": df.tip_amount,
133 |             "fare_amount": df.fare_amount,
134 |             "tip_fraction": tip_fraction_col,
135 |             "high_tip_indicator": tip_fraction_col > tip_fraction,
136 |         }
137 |     ).fillna(imputation_value)
138 | 
139 |     return features_df
140 | 
141 | 
142 | def train_test_split(
143 |     df: pd.DataFrame,
144 | ) -> typing.Tuple[pd.DataFrame, pd.DataFrame]:
145 |     """
146 |     This function splits the dataframe into train and test.
147 |     """
148 |     # Split into train and test
149 |     date_column = "tpep_pickup_datetime"
150 |     label_column = "high_tip_indicator"
151 |     df = df.sort_values(by=date_column, ascending=True)
152 |     train_df, test_df = (
153 |         df.iloc[: int(len(df) * 0.8)],
154 |         df.iloc[int(len(df) * 0.8) :],
155 |     )
156 | 
157 |     return train_df, test_df
158 | 
159 | 
160 | # Score model
161 | def score(df, model, feature_columns, label_column) -> pd.DataFrame:
162 |     rounded_preds = model.predict_proba(df[feature_columns].values)[
163 |         :, 1
164 |     ].round()
165 |     return {
166 |         "accuracy_score": accuracy_score(
167 |             df[label_column].values, rounded_preds
168 |         ),
169 |         "f1_score": f1_score(df[label_column].values, rounded_preds),
170 |         "precision_score": precision_score(
171 |             df[label_column].values, rounded_preds
172 |         ),
173 |         "recall_score": recall_score(df[label_column].values, rounded_preds),
174 |     }
175 | 
176 | 
177 | def train_model(
178 |     train_df: pd.DataFrame,
179 |     test_df: pd.DataFrame,
180 |     feature_columns: typing.List[str],
181 |     label_column: str,
182 | ) -> None:
183 |     """
184 |     This function runs training on the dataframe with the given
185 |     feature and label columns. The model is saved locally
186 |     to "model.joblib".
187 |     """
188 | 
189 |     params = {"max_depth": 4, "n_estimators": 10, "random_state": 42}
190 | 
191 |     # Create and train model
192 |     model = RandomForestClassifier(**params)
193 |     model.fit(train_df[feature_columns].values, train_df[label_column].values)
194 | 
195 |     # Print scores
196 |     train_scores = score(train_df, model, feature_columns, label_column)
197 |     test_scores = score(test_df, model, feature_columns, label_column)
198 |     print("Train scores:")
199 |     print(train_scores)
200 |     print("Test scores:")
201 |     print(test_scores)
202 | 
203 |     # Print feature importances
204 |     feature_importances = (
205 |         pd.DataFrame(
206 |             {
207 |                 "feature": feature_columns,
208 |                 "importance": model.feature_importances_,
209 |             }
210 |         )
211 |         .sort_values(by="importance", ascending=False)
212 |         .reset_index(drop=True)
213 |     )
214 |     print(feature_importances)
215 | 
216 |     # Save model
217 |     dump(model, "model.joblib")
218 | 
219 | 
220 | def inference(
221 |     features_df: pd.DataFrame,
222 |     feature_columns: typing.List[str],
223 |     label_column: str,
224 |     model=load("model.joblib") if os.path.exists("model.joblib") else None,
225 | ):
226 |     """
227 |     This function runs inference on the dataframe.
228 |     """
229 |     if not model:
230 |         raise ValueError("Please run this pipeline in training mode first!")
231 | 
232 |     # Predict
233 |     predictions = model.predict_proba(features_df[feature_columns].values)[
234 |         :, 1
235 |     ]
236 |     scores = score(features_df, model, feature_columns, label_column)
237 |     predictions_df = features_df
238 |     predictions_df["prediction"] = predictions
239 | 
240 |     return predictions_df, scores
241 | 
242 | 
243 | ##################### PIPELINE CODE #############################
244 | 
245 | if __name__ == "__main__":
246 |     mode = args.mode if args.mode else "inference"
247 |     start_date = args.start if args.start else "01/01/2020"
248 |     end_date = args.end if args.end else "01/31/2020"
249 |     print(f"Running the {mode} pipeline from {start_date} to {end_date}...")
250 | 
251 |     # Clean and featurize data
252 |     df = load_data(start_date, end_date)
253 |     clean_df = clean_data(df, start_date, end_date)
254 |     features_df = featurize_data(clean_df)
255 | 
256 |     feature_columns = [
257 |         "pickup_weekday",
258 |         "pickup_hour",
259 |         "pickup_minute",
260 |         "work_hours",
261 |         "passenger_count",
262 |         "trip_distance",
263 |         "RatecodeID",
264 |         "congestion_surcharge",
265 |         "loc_code_diffs",
266 |     ]
267 |     label_column = "high_tip_indicator"
268 | 
269 |     # If training, train a model and save it
270 |     if mode == "training":
271 |         train_df, test_df = train_test_split(features_df)
272 |         train_model(train_df, test_df, feature_columns, label_column)
273 | 
274 |     # If inference, load the model and make predictions
275 |     elif mode == "inference":
276 |         predictions, scores = inference(
277 |             features_df, feature_columns, label_column
278 |         )
279 |         print(scores)
280 | 
281 |     else:
282 |         print(f"Mode {mode} not supported.")
283 | 


--------------------------------------------------------------------------------
/solutions/main.py:
--------------------------------------------------------------------------------
  1 | # SOLUTION: main.py
  2 | 
  3 | import argparse
  4 | import os
  5 | import pandas as pd
  6 | import typing
  7 | 
  8 | from components import *
  9 | from datetime import datetime
 10 | from joblib import dump, load
 11 | from sklearn.ensemble import RandomForestClassifier
 12 | from sklearn.metrics import (
 13 |     f1_score,
 14 |     accuracy_score,
 15 |     precision_score,
 16 |     recall_score,
 17 | )
 18 | 
 19 | ### PARSING ARGS ###
 20 | parser = argparse.ArgumentParser(description="Run inference.")
 21 | parser.add_argument("--start", type=str, help="Start date", nargs="?")
 22 | parser.add_argument("--end", type=str, help="End date", nargs="?")
 23 | parser.add_argument(
 24 |     "--mode",
 25 |     type=str,
 26 |     help="training or inference",
 27 |     const="inference",
 28 |     nargs="?",
 29 | )
 30 | args = parser.parse_args()
 31 | 
 32 | 
 33 | def load_data(start_date: str, end_date: str) -> pd.DataFrame:
 34 |     """
 35 |     Format: MM/DD/YYYY
 36 | 
 37 |     This function loads the trip data corresponding to the specified
 38 |     dates. The data must be stored in the "data" folder and can
 39 |     be populated using the download.sh script.
 40 |     """
 41 |     # Iterate through months and years between start and end dates
 42 |     start_date = datetime.strptime(start_date, "%m/%d/%Y")
 43 |     end_date = datetime.strptime(end_date, "%m/%d/%Y")
 44 | 
 45 |     assert end_date >= start_date
 46 |     assert end_date.year == 2020
 47 |     assert start_date.month >= 1
 48 |     assert start_date.month <= 5
 49 | 
 50 |     dfs = []
 51 | 
 52 |     for month in range(start_date.month, start_date.month + 1):
 53 |         df = pd.read_parquet("data/jan.pq")
 54 |         if month == 2:
 55 |             df = pd.read_parquet("data/feb.pq")
 56 |         elif month == 3:
 57 |             df = pd.read_parquet("data/march.pq")
 58 |         elif month == 4:
 59 |             df = pd.read_parquet("data/april.pq")
 60 |         elif month == 5:
 61 |             df = pd.read_parquet("data/may.pq")
 62 |         dfs.append(df)
 63 | 
 64 |     df = pd.concat(dfs)
 65 |     return df
 66 | 
 67 | 
 68 | @Cleaning().run(auto_log=True)
 69 | def clean_data(
 70 |     df: pd.DataFrame, start_date: str = None, end_date: str = None
 71 | ) -> pd.DataFrame:
 72 |     """
 73 |     This function removes rows with negligible fare amounts and out of bounds of the start and end dates.
 74 | 
 75 |     Args:
 76 |         df: pd dataframe representing data
 77 |         start_date (optional): minimum date in the resulting dataframe
 78 |         end_date (optional): maximum date in the resulting dataframe (not inclusive)
 79 | 
 80 |     Returns:
 81 |         pd: DataFrame representing the cleaned dataframe
 82 |     """
 83 |     df = df[df.fare_amount > 5]  # throw out neglibible fare amounts
 84 |     if start_date:
 85 |         df = df[df.tpep_dropoff_datetime.dt.strftime("%m/%d/%Y") >= start_date]
 86 |     if end_date:
 87 |         df = df[df.tpep_dropoff_datetime.dt.strftime("%m/%d/%Y") < end_date]
 88 | 
 89 |     clean_df = df.reset_index(drop=True)
 90 |     return clean_df
 91 | 
 92 | 
 93 | @Featuregen().run(auto_log=True)
 94 | def featurize_data(
 95 |     df: pd.DataFrame, tip_fraction: float = 0.1, imputation_value: float = -1.0
 96 | ) -> pd.DataFrame:
 97 |     """
 98 |     This function constructs features from the dataframe.
 99 |     """
100 |     # Compute pickup features
101 |     pickup_weekday = df.tpep_pickup_datetime.dt.weekday
102 |     pickup_hour = df.tpep_pickup_datetime.dt.hour
103 |     pickup_minute = df.tpep_pickup_datetime.dt.minute
104 |     work_hours = (
105 |         (pickup_weekday >= 0)
106 |         & (pickup_weekday <= 4)
107 |         & (pickup_hour >= 8)
108 |         & (pickup_hour <= 18)
109 |     )
110 | 
111 |     # Compute time and speed features
112 |     trip_time = (df.tpep_dropoff_datetime - df.tpep_pickup_datetime).dt.seconds
113 |     trip_speed = df.trip_distance / (trip_time + 1e7)
114 | 
115 |     # Compute label
116 |     tip_fraction_col = df.tip_amount / df.fare_amount
117 | 
118 |     # Join all features, identifier, and label
119 |     features_df = pd.DataFrame(
120 |         {
121 |             "tpep_pickup_datetime": df.tpep_pickup_datetime,
122 |             "pickup_weekday": pickup_weekday,
123 |             "pickup_hour": pickup_hour,
124 |             "pickup_minute": pickup_minute,
125 |             "work_hours": work_hours,
126 |             "trip_time": trip_time,
127 |             "trip_speed": trip_speed,
128 |             "trip_distance": df.trip_distance,
129 |             "passenger_count": df.passenger_count,
130 |             "congestion_surcharge": df.congestion_surcharge,
131 |             "loc_code_diffs": (df.DOLocationID - df.PULocationID).abs(),
132 |             "PULocationID": df.PULocationID,
133 |             "DOLocationID": df.DOLocationID,
134 |             "RatecodeID": df.RatecodeID,
135 |             "VendorID": df.VendorID,
136 |             "tip_amount": df.tip_amount,
137 |             "fare_amount": df.fare_amount,
138 |             "tip_fraction": tip_fraction_col,
139 |             "high_tip_indicator": tip_fraction_col > tip_fraction,
140 |         }
141 |     ).fillna(imputation_value)
142 | 
143 |     return features_df
144 | 
145 | 
146 | @TrainTestSplit().run(auto_log=True)
147 | def train_test_split(
148 |     df: pd.DataFrame,
149 | ) -> typing.Tuple[pd.DataFrame, pd.DataFrame]:
150 |     """
151 |     This function splits the dataframe into train and test.
152 |     """
153 |     # Split into train and test
154 |     date_column = "tpep_pickup_datetime"
155 |     label_column = "high_tip_indicator"
156 |     df = df.sort_values(by=date_column, ascending=True)
157 |     train_df, test_df = (
158 |         df.iloc[: int(len(df) * 0.8)],
159 |         df.iloc[int(len(df) * 0.8) :],
160 |     )
161 | 
162 |     return train_df, test_df
163 | 
164 | 
165 | # Score model
166 | def score(df, model, feature_columns, label_column) -> pd.DataFrame:
167 |     rounded_preds = model.predict_proba(df[feature_columns].values)[
168 |         :, 1
169 |     ].round()
170 |     return {
171 |         "accuracy_score": accuracy_score(
172 |             df[label_column].values, rounded_preds
173 |         ),
174 |         "f1_score": f1_score(df[label_column].values, rounded_preds),
175 |         "precision_score": precision_score(
176 |             df[label_column].values, rounded_preds
177 |         ),
178 |         "recall_score": recall_score(df[label_column].values, rounded_preds),
179 |     }
180 | 
181 | 
182 | @Training().run(auto_log=True)
183 | def train_model(
184 |     train_df: pd.DataFrame,
185 |     test_df: pd.DataFrame,
186 |     feature_columns: typing.List[str],
187 |     label_column: str,
188 | ) -> None:
189 |     """
190 |     This function runs training on the dataframe with the given
191 |     feature and label columns. The model is saved locally
192 |     to "model.joblib".
193 |     """
194 | 
195 |     params = {"max_depth": 4, "n_estimators": 10, "random_state": 42}
196 | 
197 |     # Create and train model
198 |     model = RandomForestClassifier(**params)
199 |     model.fit(train_df[feature_columns].values, train_df[label_column].values)
200 | 
201 |     # Print scores
202 |     train_scores = score(train_df, model, feature_columns, label_column)
203 |     test_scores = score(test_df, model, feature_columns, label_column)
204 |     print("Train scores:")
205 |     print(train_scores)
206 |     print("Test scores:")
207 |     print(test_scores)
208 | 
209 |     # Print feature importances
210 |     feature_importances = (
211 |         pd.DataFrame(
212 |             {
213 |                 "feature": feature_columns,
214 |                 "importance": model.feature_importances_,
215 |             }
216 |         )
217 |         .sort_values(by="importance", ascending=False)
218 |         .reset_index(drop=True)
219 |     )
220 |     print(feature_importances)
221 | 
222 |     # Save model
223 |     dump(model, "model.joblib")
224 | 
225 | 
226 | @Inference().run(auto_log=True)
227 | def inference(
228 |     features_df: pd.DataFrame,
229 |     feature_columns: typing.List[str],
230 |     label_column: str,
231 |     model=load("model.joblib") if os.path.exists("model.joblib") else None,
232 | ):
233 |     """
234 |     This function runs inference on the dataframe.
235 |     """
236 |     if not model:
237 |         raise ValueError("Please run this pipeline in training mode first!")
238 | 
239 |     # Predict
240 |     predictions = model.predict_proba(features_df[feature_columns].values)[
241 |         :, 1
242 |     ]
243 |     scores = score(features_df, model, feature_columns, label_column)
244 |     predictions_df = features_df
245 |     predictions_df["prediction"] = predictions
246 | 
247 |     return predictions_df, scores
248 | 
249 | 
250 | ##################### PIPELINE CODE #############################
251 | 
252 | if __name__ == "__main__":
253 |     mode = args.mode if args.mode else "inference"
254 |     start_date = args.start if args.start else "01/01/2020"
255 |     end_date = args.end if args.end else "01/31/2020"
256 |     print(f"Running the {mode} pipeline from {start_date} to {end_date}...")
257 | 
258 |     # Clean and featurize data
259 |     df = load_data(start_date, end_date)
260 |     clean_df = clean_data(df, start_date, end_date)
261 |     features_df = featurize_data(clean_df)
262 | 
263 |     feature_columns = [
264 |         "pickup_weekday",
265 |         "pickup_hour",
266 |         "pickup_minute",
267 |         "work_hours",
268 |         "passenger_count",
269 |         "trip_distance",
270 |         "RatecodeID",
271 |         "congestion_surcharge",
272 |         "loc_code_diffs",
273 |     ]
274 |     label_column = "high_tip_indicator"
275 | 
276 |     # If training, train a model and save it
277 |     if mode == "training":
278 |         train_df, test_df = train_test_split(features_df)
279 |         train_model(train_df, test_df, feature_columns, label_column)
280 | 
281 |     # If inference, load the model and make predictions
282 |     elif mode == "inference":
283 |         predictions, scores = inference(
284 |             features_df, feature_columns, label_column
285 |         )
286 |         print(scores)
287 | 
288 |     else:
289 |         print(f"Mode {mode} not supported.")
290 | 


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
  1 | # mltrace tutorial
  2 | 
  3 | Date: October 2021
  4 | 
  5 | This tutorial builds a training and testing pipeline for a toy ML prediction problem: to predict whether a passenger in a NYC taxicab ride will give the driver a nontrivial tip. This is a **binary classification task.** A nontrivial tip is arbitrarily defined as greater than 10% of the total fare (before tip). To evaluate the model or measure the efficacy of the model, we measure the [**F1 score**](https://en.wikipedia.org/wiki/F-score). This task is modeled after the task described in [toy-ml-pipeline](https://github.com/shreyashankar/toy-ml-pipeline).
  6 | 
  7 | The purpose of this tutorial is to demonstrate how mltrace can be used in achieving pipeline *observability*, or end-to-end visibility. In this tutorial, we:
  8 | 
  9 | 1. Train a model on data from January 2020
 10 | 2. Simulate deployment by running inference on a weekly basis from February 1, 2020 to May 31, 2020
 11 | 3. Experience a significant performance decrease in our pipeline (from 83% F1 score to below 70%)
 12 | 4. Instrument our pipeline with mltrace component specifications to trace our predictions and debug the pipeline
 13 | 5. Encode tests into the component specifications to catch failures before they happen
 14 | 
 15 | I am giving this tutorial at [RISECamp 2021](https://risecamp.berkeley.edu/) and the [Toronto Machine Learning Summit](https://www.torontomachinelearning.com/). You can also follow along with this README.
 16 | 
 17 | ## Requirements
 18 | 
 19 | **You can do this entire tutorial locally.** You will need the following:
 20 | 
 21 | * Internet connection to download the data
 22 | * Docker (you can install [here](https://www.docker.com/products/docker-desktop))
 23 | * Python 3.7+
 24 | * Unix-based shell (use WSL if on Windows)
 25 | 
 26 | We recommend you create a conda or virtual environment for this demo.
 27 | 
 28 | ## Step 1: Setup
 29 | 
 30 | Clone two repositories: [mltrace](https://github.com/loglabs/mltrace) and [this mltrace-demo tutorial](https://github.com/loglabs/mltrace-demo). Set up mltrace as described in the mltrace [README](https://github.com/loglabs/mltrace#readme). Verify that you can access the mltrace UI at [localhost:8080](http://localhost:8080). Make sure your containers are running for the entirety of this tutorial.
 31 | 
 32 | Once you have cloned [this mltrace-demo tutorial](https://github.com/loglabs/mltrace-demo), navigate to the root and download the requirements by running `pip install -r requirements.txt`. The data science-specific libraries used are `pandas` and `scikit-learn`.
 33 | 
 34 | ## Step 2: Understand the ML task and pipelines
 35 | 
 36 | For the rest of this tutorial, we will only be working in the `mltrace-demo` directory.
 37 | 
 38 | ### Dataset description
 39 | 
 40 | We use the yellow taxicab trip records from the NYC Taxi & Limousine Comission [public dataset](https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page), which is stored in a public aws S3 bucket. The data dictionary can be found [here](https://www1.nyc.gov/assets/tlc/downloads/pdf/data_dictionary_trip_records_yellow.pdf) and is also shown below:
 41 | 
 42 | | Field Name      | Description |
 43 | | ----------- | ----------- |
 44 | | VendorID      | A code indicating the TPEP provider that provided the record. 1= Creative Mobile Technologies, LLC; 2= VeriFone Inc.       |
 45 | | tpep_pickup_datetime   | The date and time when the meter was engaged.        |
 46 | | tpep_dropoff_datetime   | The date and time when the meter was disengaged.        |
 47 | | Passenger_count   | The number of passengers in the vehicle. This is a driver-entered value.      |
 48 | | Trip_distance   | The elapsed trip distance in miles reported by the taximeter.      |
 49 | | PULocationID   | TLC Taxi Zone in which the taximeter was engaged.      |
 50 | | DOLocationID   | TLC Taxi Zone in which the taximeter was disengaged      |
 51 | | RateCodeID   | The final rate code in effect at the end of the trip. 1= Standard rate, 2=JFK, 3=Newark, 4=Nassau or Westchester, 5=Negotiated fare, 6=Group ride     |
 52 | | Store_and_fwd_flag | This flag indicates whether the trip record was held in vehicle memory before sending to the vendor, aka “store and forward,” because the vehicle did not have a connection to the server. Y= store and forward trip, N= not a store and forward trip |
 53 | | Payment_type | A numeric code signifying how the passenger paid for the trip. 1= Credit card, 2= Cash, 3= No charge, 4= Dispute, 5= Unknown, 6= Voided trip |
 54 | | Fare_amount | The time-and-distance fare calculated by the meter. | 
 55 | | Extra | Miscellaneous extras and surcharges. Currently, this only includes the $0.50 and $1 rush hour and overnight charges. |
 56 | | MTA_tax | $0.50 MTA tax that is automatically triggered based on the metered rate in use. | 
 57 | | Improvement_surcharge | $0.30 improvement surcharge assessed trips at the flag drop. The improvement surcharge began being levied in 2015. | 
 58 | | Tip_amount | Tip amount – This field is automatically populated for credit card tips. Cash tips are not included. | 
 59 | | Tolls_amount | Total amount of all tolls paid in trip. | 
 60 | | Total_amount | The total amount charged to passengers. Does not include cash tips. |
 61 | 
 62 | We have subsampled the data from January to May 2020 to simplify the tutorial. To download the data, in the root directory of this repo, run the download script `download.sh`, and you should see something like the following:
 63 | 
 64 | ```
 65 | > source download.sh
 66 | 
 67 |   % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
 68 |                                  Dload  Upload   Total   Spent    Left  Speed
 69 | 100   388    0   388    0     0    129      0 --:--:--  0:00:03 --:--:--   129
 70 | 100 15.5M  100 15.5M    0     0  3331k      0  0:00:04  0:00:04 --:--:-- 11.0M
 71 |   % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
 72 |                                  Dload  Upload   Total   Spent    Left  Speed
 73 | 100   388    0   388    0     0    131      0 --:--:--  0:00:02 --:--:--   130
 74 | 100 15.2M  100 15.2M    0     0  3046k      0  0:00:05  0:00:05 --:--:-- 11.2M
 75 |   % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
 76 |                                  Dload  Upload   Total   Spent    Left  Speed
 77 | 100   388    0   388    0     0    281      0 --:--:--  0:00:01 --:--:--   281
 78 | 100 7678k  100 7678k    0     0  3103k      0  0:00:02  0:00:02 --:--:-- 8785k
 79 |   % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
 80 |                                  Dload  Upload   Total   Spent    Left  Speed
 81 | 100   388    0   388    0     0    928      0 --:--:-- --:--:-- --:--:--   926
 82 | 100  684k  100  684k    0     0   868k      0 --:--:-- --:--:-- --:--:--  868k
 83 |   % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
 84 |                                  Dload  Upload   Total   Spent    Left  Speed
 85 | 100   388    0   388    0     0    603      0 --:--:-- --:--:-- --:--:--   602
 86 | 100 1024k  100 1024k    0     0   982k      0  0:00:01  0:00:01 --:--:--  982k
 87 | ```
 88 | 
 89 | ### Pipeline description
 90 | 
 91 | Any applied ML pipeline is essentially a series of functions applied one after the other, such as data transformations, models, and output transformations. For simplicity, the training and inference pipelines are both included in one Python file: `main.py`. The only external tools  These pipelines have the following components:
 92 | 
 93 | ![Pipelines](./res/pipeline.png)
 94 | 
 95 | In the diagram above, both pipelines share some components, such as cleaning and feature generation. In the pipeline code (`main.py`), each component corresponds to a different Python function. 
 96 | 
 97 | ## Step 2: Run pipelines
 98 | 
 99 | Since the inference pipeline depends on a trained model, you must run the training pipeline first to train and save a model. The training pipeline takes in a date range, trains a random forest classifier on the first 80% of data, and evaluates the model on the last 20%. For more details on model parameters and features, read the code in the `train_model` function. To run the training pipeline, execute `python main.py --mode=training`, and you will see something like the following:
100 | 
101 | ```
102 | > python main.py --mode=training
103 | 
104 | Running the training pipeline from 01/01/2020 to 01/31/2020...
105 | Train scores:
106 | {'accuracy_score': 0.7111377217389263, 'f1_score': 0.820004945200449, 'precision_score': 0.7167432688544968, 'recall_score': 0.9580287853406722}
107 | Test scores:
108 | {'accuracy_score': 0.7304694103724853, 'f1_score': 0.8354623429043206, 'precision_score': 0.7372079610648481, 'recall_score': 0.9639346431170206}
109 |                 feature  importance
110 | 0  congestion_surcharge    0.692435
111 | 1            RatecodeID    0.122799
112 | 2       passenger_count    0.084634
113 | 3         trip_distance    0.056488
114 | 4           pickup_hour    0.030117
115 | 5        pickup_weekday    0.006432
116 | 6        loc_code_diffs    0.005208
117 | 7            work_hours    0.001844
118 | 8         pickup_minute    0.000041
119 | ```
120 | 
121 | One can probably come up with a better-performing model, but that is not the goal of this tutorial. *The goal here is to demonstrate that performance can decrease post-deployment.* To simulate a week of deployment, run the script in inference mode and see the result:
122 | 
123 | ```
124 | > python main.py --mode=inference --start=02/01/2020 --end=02/08/2020
125 | 
126 | Running the inference pipeline from 02/01/2020 to 02/08/2020...
127 | {'accuracy_score': 0.7331414566141254, 'f1_score': 0.8376663049524453, 'precision_score': 0.7420173022399211, 'recall_score': 0.9616234153694767}
128 | ```
129 | 
130 | We see similar metrics to what we observed at training time, which is all good (for now). To run inference on every week starting February 1, 2020, we can run the `run_weekly_inference.py` script and see its results:
131 | 
132 | ```
133 | > python run_weekly_inference.py
134 | 
135 | python main.py --mode=inference --start=02/01/2020 --end=02/08/2020
136 | Running the inference pipeline from 02/01/2020 to 02/08/2020...
137 | {'accuracy_score': 0.7331414566141254, 'f1_score': 0.8376663049524453, 'precision_score': 0.7420173022399211, 'recall_score': 0.9616234153694767}
138 | python main.py --mode=inference --start=02/08/2020 --end=02/15/2020
139 | Running the inference pipeline from 02/08/2020 to 02/15/2020...
140 | {'accuracy_score': 0.7278759275705908, 'f1_score': 0.8340398483417413, 'precision_score': 0.7359328219671536, 'recall_score': 0.9623274935955706}
141 | python main.py --mode=inference --start=02/15/2020 --end=02/22/2020
142 | Running the inference pipeline from 02/15/2020 to 02/22/2020...
143 | {'accuracy_score': 0.7045651653189503, 'f1_score': 0.8166364204935767, 'precision_score': 0.7111136903380176, 'recall_score': 0.9589333012280279}
144 | python main.py --mode=inference --start=02/22/2020 --end=02/29/2020
145 | Running the inference pipeline from 02/22/2020 to 02/29/2020...
146 | {'accuracy_score': 0.7290757048767853, 'f1_score': 0.8342193683943596, 'precision_score': 0.7373325008404976, 'recall_score': 0.9604204028860992}
147 | python main.py --mode=inference --start=02/29/2020 --end=03/07/2020
148 | Running the inference pipeline from 02/29/2020 to 03/07/2020...
149 | {'accuracy_score': 0.7036537211975809, 'f1_score': 0.8167176728801508, 'precision_score': 0.7090333315442006, 'recall_score': 0.9629683627350926}
150 | python main.py --mode=inference --start=03/07/2020 --end=03/14/2020
151 | Running the inference pipeline from 03/07/2020 to 03/14/2020...
152 | {'accuracy_score': 0.7281746780953819, 'f1_score': 0.8319688154662216, 'precision_score': 0.7334478820491188, 'recall_score': 0.9610645239571818}
153 | python main.py --mode=inference --start=03/14/2020 --end=03/21/2020
154 | Running the inference pipeline from 03/14/2020 to 03/21/2020...
155 | {'accuracy_score': 0.6889874250874701, 'f1_score': 0.7913742622112748, 'precision_score': 0.6840752048851036, 'recall_score': 0.9385955241979936}
156 | python main.py --mode=inference --start=03/21/2020 --end=03/28/2020
157 | Running the inference pipeline from 03/21/2020 to 03/28/2020...
158 | {'accuracy_score': 0.6451420029895366, 'f1_score': 0.7327178563386625, 'precision_score': 0.6134992458521871, 'recall_score': 0.9094466182224706}
159 | python main.py --mode=inference --start=03/28/2020 --end=04/04/2020
160 | Running the inference pipeline from 03/28/2020 to 04/04/2020...
161 | {'accuracy_score': 0.6284492809949476, 'f1_score': 0.7137724550898203, 'precision_score': 0.5840274375306223, 'recall_score': 0.9176289453425712}
162 | python main.py --mode=inference --start=04/04/2020 --end=04/11/2020
163 | Running the inference pipeline from 04/04/2020 to 04/11/2020...
164 | {'accuracy_score': 0.6171894294887627, 'f1_score': 0.7053231939163498, 'precision_score': 0.5848045397225725, 'recall_score': 0.8884099616858238}
165 | python main.py --mode=inference --start=04/11/2020 --end=04/18/2020
166 | Running the inference pipeline from 04/11/2020 to 04/18/2020...
167 | {'accuracy_score': 0.5968436154949784, 'f1_score': 0.6916605705925385, 'precision_score': 0.5858116480793061, 'recall_score': 0.8441964285714286}
168 | python main.py --mode=inference --start=04/18/2020 --end=04/25/2020
169 | Running the inference pipeline from 04/18/2020 to 04/25/2020...
170 | {'accuracy_score': 0.6017305893358279, 'f1_score': 0.697567039602202, 'precision_score': 0.5843498958643261, 'recall_score': 0.8651982378854626}
171 | python main.py --mode=inference --start=04/25/2020 --end=05/02/2020
172 | Running the inference pipeline from 04/25/2020 to 05/02/2020...
173 | {'accuracy_score': 0.5893766674751395, 'f1_score': 0.6827805883455125, 'precision_score': 0.5769474350854972, 'recall_score': 0.8361633776961909}
174 | python main.py --mode=inference --start=05/02/2020 --end=05/09/2020
175 | Running the inference pipeline from 05/02/2020 to 05/09/2020...
176 | {'accuracy_score': 0.5838457703174339, 'f1_score': 0.6434064369125606, 'precision_score': 0.5146958304853042, 'recall_score': 0.8579567033801747}
177 | python main.py --mode=inference --start=05/09/2020 --end=05/16/2020
178 | Running the inference pipeline from 05/09/2020 to 05/16/2020...
179 | {'accuracy_score': 0.5933857808857809, 'f1_score': 0.6362570050827577, 'precision_score': 0.5070627336933943, 'recall_score': 0.8537950332284016}
180 | python main.py --mode=inference --start=05/16/2020 --end=05/23/2020
181 | Running the inference pipeline from 05/16/2020 to 05/23/2020...
182 | {'accuracy_score': 0.6166423357664234, 'f1_score': 0.6921453692848769, 'precision_score': 0.577351848230002, 'recall_score': 0.8639157155399473}
183 | python main.py --mode=inference --start=05/23/2020 --end=05/30/2020
184 | Running the inference pipeline from 05/23/2020 to 05/30/2020...
185 | {'accuracy_score': 0.6198235909702496, 'f1_score': 0.7046800603878759, 'precision_score': 0.5951353471949784, 'recall_score': 0.8636493025903786}
186 | ```
187 | 
188 | Wow! Towards the end, we see significantly lower F1 scores! How do we even *begin* to go about debugging this performance drop? In the remainder of this tutorial, we will discuss how to use mltrace to observe data flow and debug our pipelines.
189 | 
190 | ## Step 3: Instrument our pipelines with mltrace
191 | 
192 | A natural first step in debugging is to trace our outputs, or determine the end-to-end data flow for the outputs. Fortunately, we can do this with mltrace *without completely redesigning our pipelines and rewriting our code!* We will only need to add code.
193 | 
194 | mltrace provides an interface to define component specifications which can run tests and log data flow throughout our pipelines. For this tutorial, we have already defined component specifications in `components.py`, and we just need to integrate these into our pipelines. An example component specification is the following:
195 | 
196 | ```python
197 | 
198 | from mltrace import Component
199 | 
200 | class Cleaning(Component):
201 |     def __init__(self, beforeTests=[], afterTests=[]):
202 | 
203 |         super().__init__(
204 |             name="cleaning",
205 |             owner="plumber",
206 |             description="Cleans raw NYC taxicab data",
207 |             tags=["nyc-taxicab"],
208 |             beforeTests=beforeTests,
209 |             afterTests=afterTests,
210 |         )
211 | 
212 | ```
213 | 
214 | To integrate this component into our pipeline, we declare a `Cleaning` object in `main.py` and decorate our existing `cleaning` function with the Component object's `run` method:
215 | 
216 | ```python
217 | from components import *
218 | 
219 | @Cleaning().run(auto_log=True) # This is the only line of mltrace code to add
220 | def clean_data(
221 |     df: pd.DataFrame, start_date: str = None, end_date: str = None
222 | ) -> pd.DataFrame:
223 |     """
224 |     This function removes rows with negligible fare amounts and out of bounds of the start and end dates.
225 | 
226 |     Args:
227 |         df: pd dataframe representing data
228 |         start_date (optional): minimum date in the resulting dataframe
229 |         end_date (optional): maximum date in the resulting dataframe (not inclusive)
230 | 
231 |     Returns:
232 |         pd: DataFrame representing the cleaned dataframe
233 |     """
234 |     df = df[df.fare_amount > 5]  # throw out neglibible fare amounts
235 |     if start_date:
236 |         df = df[df.tpep_dropoff_datetime.dt.strftime("%m/%d/%Y") >= start_date]
237 |     if end_date:
238 |         df = df[df.tpep_dropoff_datetime.dt.strftime("%m/%d/%Y") < end_date]
239 | 
240 |     clean_df = df.reset_index(drop=True)
241 |     return clean_df
242 | ```
243 | 
244 | The `auto_log` parameter tells mltrace to automatically find and log input and output data and artifacts (e.g., models), even if we didn't explicitly save them (like in our example). Here, mltrace would log `df` as input and `clean_df` as output.
245 | 
246 | ### Exercise 1: Instrument other functions
247 | 
248 | Like we did for `cleaning`, instrument the following functions with their respective component specifications. You can see all the component specifications in `components.py`. Solution code exists in `solutions/main.py`. *Hint: you will only have to instrument 4 other functions!*
249 | 
250 | ## Step 4: Tracing and debugging
251 | 
252 | Rerun our pipelines as we did above. This time, our pipelines will be instrumented with mltrace, so we can inspect traces for our outputs. To do so, run the following commands:
253 | 
254 | ```
255 | python main.py --mode=training
256 | python run_weekly_inference.py
257 | ```
258 | 
259 | Once inference has finished running, navigate to the UI at [localhost:8080](http://localhost:8080) to check out the mltrace component runs. Type in `history inference` into the command bar and press enter to see the most recent runs of the inference component. Click on one of the first few / most recent rows in the table, then click on the output filename in the card to trace it. The resulting view will look something like:
260 | 
261 | ![Diagram](./res/traceview.png)
262 | 
263 | The trace is a bit complicated, but we can look at some of the intermediate outputs to assess what might have gone wrong with the pipeline.
264 | 
265 | ### Exercise 2: Load and analyze intermediates
266 | 
267 | To begin, let's look at the features fed into this particular run of inference and compare these features to training features. In this exercise, you will identify the two filenames that correspond to the training and inference features, load these into dataframes in a seperate notebook, and determine a few differences between these feature dataframes.
268 | 
269 | 1. Look at the trace view in the mltrace UI. Identify the two filenames that correspond to the training and inference features.
270 | 2. Copy these filenames into a separate doc. You will want to load them to inspect the data.
271 | 3. [OPTIONAL] Open a Jupyter notebook to load these files into dataframes and compare the dataframes. You can load a file by calling `mltrace.load(filename)`. What differences do you find? Did the data "drift" over time?
272 | 
273 | *Note: The UI is still a major work in progress! There's a lot of functionality we can add here to make it significantly easier to debug. Please email shreyashankar@berkeley.edu if you are interested in contributing!*
274 | 
275 | ## Step 5: Encode some tests into components
276 | 
277 | mltrace components have a simple lifecycle: they can run `beforeTests` that execute before a component runs and `afterTests` that execute after a component runs. We will leverage this functionality to encode tests to execute at runtime. For this tutorial, we have some predefined tests, defined in `tests.py`:
278 | 
279 | 
280 | | Test Class Name  | Description |
281 | | ------------- | ------------- |
282 | |  `OutliersTest`  | Prints summary statistics of the data and tests for z-score outliers.  |
283 | | `TrainingAssumptionsTest`  |  Tests for train-test leakage and makes sure the data does not have a large class imbalance. |
284 | | `ModelIntegrityTest`  |  Checks that the model did not overfit and that feature importances aren't heavily skewed towards a small fraction of features. |
285 | 
286 | Each of these test classes contain several test functions to run. The arguments to these test functions *must be* defined in the body of the component run function (i.e., the function that `Component().run` is decorating). Under the hood, mltrace traces the Python code and passes arguments to the tests before and after the component run function is executed.
287 | 
288 | *Note that these tests are not applicable to all ML tasks! For instance, sometimes we will want to be solving problems with class imbalance.*
289 | 
290 | ### Exercise 3: Add tests to mltrace components
291 | 
292 | Each mltrace component accepts a list of `beforeTests` and `afterTests`. We can add tests to either the component specifications in `components.py` or the decorators in `main.py`. The main benefit to having this test abstraction is that now, tests can be reusable between components and even pipelines. For an example, we can add the `OutliersTest` to the `featuregen` component:
293 | 
294 | ```python
295 | from tests import *
296 | 
297 | @Featuregen(afterTests=[OutliersTest]).run(auto_log=True)
298 | def featuregen(...):
299 | ```
300 | 
301 | In this example, the `OutliersTest` will be executed on the features dataframe that gets returned from the function. In this exercise, we will add other tests to be executed. Solution code exists in `solutions/main.py`.
302 | 
303 | 1. Add the `TrainingAssumptionsTest` and `ModelIntegrityTest` to components in the training pipeline. *Hint: training assumptions should be satisfied before training, and model integrity should be satisfied after training!*
304 | 2. Run the pipelines (`python main.py --mode=training; python run_weekly_inference.py`) as we have done before. Some runs of inference should fail the outliers test.
305 | 3. [OPTIONAL] Encode your own tests, based on the analysis you did in exercise 2.
306 | 
307 | ## Step 6: Takeaways
308 | 
309 | In this tutorial, we did the following:
310 | 
311 | 1. Train a model
312 | 2. Simulate deployment by running inference on a weekly basis
313 | 3. Use mltrace to investigate the performance drop and add tests to our pipeline
314 | 
315 | Questions? Feedback? Please email shreyashankar@berkeley.edu!
316 | 
317 | ## Future work
318 | 
319 | mltrace doesn't fix pipelines; our goal is to aid practitioners in debugging and creating sustaining pipelines. We want mltrace to be as flexible as possible, to serve as an "add-on" to existing pipelines to achieve observability. We are most immediately working on the following:
320 | 
321 | * Materializing historical component run inputs and outputs to use while writing running tests (e.g., to compare successive batches of data fed into a component)
322 | * Logging component run parameters and showing visualizations in the UI
323 | * Predefined components with tests that practitioners can use to construct pipelines "off-the-shelf"
324 | 


--------------------------------------------------------------------------------