├── cloud_function ├── create_cloud_function.sh ├── main.py └── requirements.txt ├── notebooks ├── Fraud_Detection_Tutorial.ipynb ├── Validating_Online_Features_While_Detecting_Fraud.ipynb ├── txn_and_features_gen.ipynb └── update_timestamps.ipynb └── readme.md /cloud_function/create_cloud_function.sh: -------------------------------------------------------------------------------- 1 | gcloud functions deploy feast-update-timestamps \ 2 | --entry-point main \ 3 | --runtime python37 \ 4 | --trigger-resource feature-timestamp-schedule \ 5 | --trigger-event google.pubsub.topic.publish \ 6 | --timeout 540s 7 | 8 | gcloud scheduler jobs create pubsub feast-update-timestamp-job \ 9 | --schedule "0 22 * * *" \ 10 | --topic feature-timestamp-schedule \ 11 | --message-body "." -------------------------------------------------------------------------------- /cloud_function/main.py: -------------------------------------------------------------------------------- 1 | import pandas as pd 2 | from datetime import datetime, timedelta 3 | 4 | def update_transactions(): 5 | sql = """ 6 | SELECT * 7 | FROM `feast-oss.fraud_tutorial.transactions` 8 | """ 9 | transactions = pd.read_gbq(sql, dialect='standard') 10 | latest_time = transactions['timestamp'].max() 11 | datediff = datetime.now() - latest_time.replace(tzinfo=None) 12 | transactions['timestamp'] = transactions['timestamp'] + datediff 13 | transactions.to_gbq(destination_table="fraud_tutorial.transactions", project_id="feast-oss", if_exists='replace') 14 | 15 | def update_user_features(): 16 | sql = """ 17 | SELECT * 18 | FROM `feast-oss.fraud_tutorial.user_account_features` 19 | """ 20 | user_features = pd.read_gbq(sql, dialect='standard') 21 | user_features['feature_timestamp'] = datetime.now() - timedelta(days=7) 22 | user_features.to_gbq(destination_table="fraud_tutorial.user_account_features", project_id="feast-oss", if_exists='replace') 23 | 24 | def update_user_fraud_features(): 25 | sql = """ 26 | SELECT * 27 | FROM 
`feast-oss.fraud_tutorial.user_has_fraudulent_transactions` 28 | """ 29 | user_has_fraud = pd.read_gbq(sql, dialect='standard') 30 | latest_time = user_has_fraud['feature_timestamp'].max() 31 | datediff = datetime.now() - latest_time.replace(tzinfo=None) 32 | user_has_fraud['feature_timestamp'] = user_has_fraud['feature_timestamp'] + datediff 33 | user_has_fraud.to_gbq(destination_table="fraud_tutorial.user_has_fraudulent_transactions", project_id="feast-oss", if_exists='replace') 34 | 35 | def main(data, context): 36 | update_transactions() 37 | update_user_features() 38 | update_user_fraud_features() 39 | 40 | if __name__ == "__main__": main(1, 1)  # manual test only; the Cloud Functions runtime invokes main() itself -------------------------------------------------------------------------------- /cloud_function/requirements.txt: -------------------------------------------------------------------------------- 1 | pandas 2 | pandas-gbq -------------------------------------------------------------------------------- /notebooks/Validating_Online_Features_While_Detecting_Fraud.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": { 6 | "id": "jirdTjhETQW0" 7 | }, 8 | "source": [ 9 | "# Introduction\n", 10 | "\n", 11 | "In this tutorial we will extend the previously developed fraud prediction system by adding Data Quality Monitoring for online features.\n", 12 | "\n", 13 | "If you haven't already, please check out these two tutorials first:\n", 14 | "1. [Fraud Detection (with BigQuery and Datastore)](https://github.com/feast-dev/feast-gcp-fraud-tutorial/blob/main/notebooks/Fraud_Detection_Tutorial.ipynb)\n", 15 | "2. 
[Validation of historical features with Great Expectations](https://docs.feast.dev/tutorials/validating-historical-features)\n", 16 | "\n", 17 | "Throughout this tutorial, we'll briefly revisit the setup of a feature store for the fraud detection system, and then walk through creating validation expectations, configuring online feature logging, and applying validation in production.\n", 18 | "\n", 19 | "*We need to revisit the system described in the [previous tutorial](https://github.com/feast-dev/feast-gcp-fraud-tutorial/blob/main/notebooks/Fraud_Detection_Tutorial.ipynb) because the Go feature server, which produces the feature logs used in validation, currently supports only the Redis online store, whereas the previous tutorial used Datastore.*\n", 20 | "\n", 21 | "Here's a high-level diagram describing the data flow in the DQM pipeline:\n", 22 | "\n", 23 | "\n" 24 | ] 25 | }, 26 | { 27 | "cell_type": "markdown", 28 | "metadata": { 29 | "id": "2qipWwfSnrjK" 30 | }, 31 | "source": [ 32 | "" 33 | ] 34 | }, 35 | { 36 | "cell_type": "markdown", 37 | "metadata": {}, 38 | "source": [ 39 | "It all starts with generating a training dataset (by pulling historical features from an offline store). This training dataset serves as the validation reference. While developing a model, a data scientist forms a set of implicit expectations about the training dataset. Those expectations must first be met by the training dataset itself; only then can we apply them to the online features in production.\n", 40 | "\n", 41 | "In the second step, the data scientist explores the dataset and formalizes those expectations with the help of the [Great Expectations library](https://docs.greatexpectations.io/docs/). The expectations can be checked right away against the training dataset, and only those that pass will be added to a reference profile. 
A reference profile is a set of expectations that can be serialized and later checked against a tested dataset without loading the training dataset again.\n", 42 | "\n", 43 | "At the evaluation stage, a tested dataset is loaded from storage and validated against the reference profile." 44 | ] 45 | }, 46 | { 47 | "cell_type": "markdown", 48 | "metadata": {}, 49 | "source": [ 50 | "What you'll need for this tutorial:\n", 51 | "1. A GCP account with access to BigQuery\n", 52 | "2. A Redis server (accessible locally)\n", 53 | "3. (for Windows / Mac M1 users) an installed Go compiler (>= 1.17) to build the parts of Feast written in Go " 54 | ] 55 | }, 56 | { 57 | "cell_type": "markdown", 58 | "metadata": {}, 59 | "source": [ 60 | "# Part I (feature store for fraud detection system on Redis)" 61 | ] 62 | }, 63 | { 64 | "cell_type": "markdown", 65 | "metadata": {}, 66 | "source": [ 67 | "Let's first recall the basics of creating a feature store." 68 | ] 69 | }, 70 | { 71 | "cell_type": "markdown", 72 | "metadata": { 73 | "id": "WX6daujBVgZ5" 74 | }, 75 | "source": [ 76 | "## Installation and setup\n", 77 | "\n", 78 | "### Install Feast\n", 79 | "\n", 80 | "Feast can be installed using pip. This installation includes a Python package as well as a CLI.\n", 81 | "\n", 82 | "Feast contains some packages which conflict with the default versions installed in Colab. **After running this cell, restart the runtime to continue** (Runtime > Restart runtime).\n" 83 | ] 84 | }, 85 | { 86 | "cell_type": "code", 87 | "execution_count": null, 88 | "metadata": { 89 | "colab": { 90 | "base_uri": "https://localhost:8080/" 91 | }, 92 | "id": "S51NR-oPVsjg", 93 | "outputId": "81fa5d76-1641-4c2f-a7f7-a27988b686f8" 94 | }, 95 | "outputs": [], 96 | "source": [ 97 | "%env COMPILE_GO=True\n", 98 | "%env FEAST_USAGE=False\n", 99 | "\n", 100 | "! pip install 'feast[gcp,redis,ge,go]'\n", 101 | "! 
feast version" 102 | ] 103 | }, 104 | { 105 | "cell_type": "markdown", 106 | "metadata": { 107 | "id": "6eKs1547MiFA" 108 | }, 109 | "source": [ 110 | "## Set configurations\n", 111 | "\n", 112 | "Set the following configuration, which we'll be using throughout the tutorial:\n", 113 | "\n", 114 | "- PROJECT_ID: Your GCP project ID.\n", 115 | "- BIGQUERY_DATASET_NAME: The name of the dataset that will hold the generated feature tables and the feature server's logs." 116 | ] 117 | }, 118 | { 119 | "cell_type": "code", 120 | "execution_count": null, 121 | "metadata": { 122 | "colab": { 123 | "base_uri": "https://localhost:8080/" 124 | }, 125 | "id": "NKPT2GJ_Jb2h", 126 | "outputId": "06a9514b-b0fc-4aff-dd1c-6a9a62c4adae" 127 | }, 128 | "outputs": [], 129 | "source": [ 130 | "PROJECT_ID = \"\"\n", 131 | "BIGQUERY_DATASET_NAME = \"\"" 132 | ] 133 | }, 134 | { 135 | "cell_type": "markdown", 136 | "metadata": { 137 | "id": "hrjZkcFmZlSM" 138 | }, 139 | "source": [ 140 | "## Create a BigQuery dataset\n", 141 | "**Only if your dataset doesn't already exist**: Run the following cell to create your BigQuery dataset." 142 | ] 143 | }, 144 | { 145 | "cell_type": "code", 146 | "execution_count": null, 147 | "metadata": { 148 | "colab": { 149 | "base_uri": "https://localhost:8080/" 150 | }, 151 | "id": "_73sXuvjZzoz", 152 | "outputId": "31f70b1e-8eae-4099-efc1-b067eaaadf07" 153 | }, 154 | "outputs": [], 155 | "source": [ 156 | "! bq mk $BIGQUERY_DATASET_NAME" 157 | ] 158 | }, 159 | { 160 | "cell_type": "markdown", 161 | "metadata": { 162 | "id": "3W_OsJMWkipk" 163 | }, 164 | "source": [ 165 | "## Initialize the feature repository" 166 | ] 167 | }, 168 | { 169 | "cell_type": "markdown", 170 | "metadata": { 171 | "id": "Sk0fdKESD3j-" 172 | }, 173 | "source": [ 174 | "In Feast, you define your features using configuration stored in a repository. To start, initialize a feature repository." 
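Conceptually, a feature repository is "features as configuration": plain Python definitions that get collected into a registry when applied. A minimal library-free sketch of that idea (the class and function names here are illustrative, not Feast's API):

```python
from dataclasses import dataclass
from datetime import timedelta

@dataclass
class FeatureView:
    """A minimal stand-in for a Feast FeatureView: a named, time-bounded view over a table."""
    name: str
    entities: list
    ttl: timedelta
    table: str

registry = {}

def apply(view):
    """Register the definition, much like `feast apply` writes to the real registry."""
    registry[view.name] = view

apply(FeatureView(
    name="user_transaction_count_7d",
    entities=["user_id"],
    ttl=timedelta(weeks=1),
    table="fraud_tutorial.user_count_transactions_7d",
))
print(sorted(registry))  # ['user_transaction_count_7d']
```

In real Feast, these definitions live in `.py` files inside the repository, and `feast apply` scans them and syncs the registry; the sketch only shows the shape of that workflow.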
175 | ] 176 | }, 177 | { 178 | "cell_type": "code", 179 | "execution_count": null, 180 | "metadata": { 181 | "colab": { 182 | "base_uri": "https://localhost:8080/" 183 | }, 184 | "id": "ASAv4kB3kkz_", 185 | "outputId": "d64888e0-f9ce-4b5a-a00d-a6b5c300b412" 186 | }, 187 | "outputs": [], 188 | "source": [ 189 | "! feast init fraud_tutorial\n", 190 | "%cd fraud_tutorial/\n", 191 | "! ls" 192 | ] 193 | }, 194 | { 195 | "cell_type": "markdown", 196 | "metadata": { 197 | "id": "XDfgOvshswm5" 198 | }, 199 | "source": [ 200 | "Next, we'll edit the `feature_store.yaml` file to specify offline and online stores. Note that the `project` field in this file refers to the Feast concept of a project, not a GCP project." 201 | ] 202 | }, 203 | { 204 | "cell_type": "code", 205 | "execution_count": null, 206 | "metadata": { 207 | "colab": { 208 | "base_uri": "https://localhost:8080/" 209 | }, 210 | "id": "mK727g_Ik6zs", 211 | "outputId": "2bb6628d-4dd8-4a3c-89a3-a64da05c43d8" 212 | }, 213 | "outputs": [], 214 | "source": [ 215 | "feature_store = \\\n", 216 | "f\"\"\"project: fraud_tutorial\n", 217 | "registry: data/registry.db\n", 218 | "provider: local\n", 219 | "offline_store:\n", 220 | " type: bigquery\n", 221 | " dataset: {BIGQUERY_DATASET_NAME}\n", 222 | "online_store:\n", 223 | " type: redis\n", 224 | " connection_string: \"localhost:6379\"\n", 225 | "go_feature_retrieval: True\n", 226 | "\"\"\"\n", 227 | "\n", 228 | "with open('feature_store.yaml', \"w\") as feature_store_file:\n", 229 | " feature_store_file.write(feature_store)\n", 230 | "\n", 231 | "# Print our feature_store.yaml\n", 232 | "! 
cat feature_store.yaml" 233 | ] 234 | }, 235 | { 236 | "cell_type": "markdown", 237 | "metadata": { 238 | "id": "Q0dMrw4rESL7" 239 | }, 240 | "source": [ 241 | "Then, we can apply our feature repository:" 242 | ] 243 | }, 244 | { 245 | "cell_type": "code", 246 | "execution_count": null, 247 | "metadata": { 248 | "colab": { 249 | "base_uri": "https://localhost:8080/" 250 | }, 251 | "id": "f6QD4-lVrbdt", 252 | "outputId": "460f5f04-ea1c-4580-a1fe-0f755b71243e" 253 | }, 254 | "outputs": [], 255 | "source": [ 256 | "! feast apply" 257 | ] 258 | }, 259 | { 260 | "cell_type": "markdown", 261 | "metadata": { 262 | "id": "ewiAq45-Efp0" 263 | }, 264 | "source": [ 265 | "## Creating features\n", 266 | "\n", 267 | "Next, let's make a new feature and register it to the store.\n", 268 | "\n", 269 | "This involves two steps.\n", 270 | "\n", 271 | "- **Using BigQuery**, we generate new feature values with SQL. Feast itself is not used to generate features; that is done in Python/SQL.\n", 272 | "- **Using Feast**, we register our new features in Feast by creating a FeatureView:\n", 273 | "\n" 274 | ] 275 | }, 276 | { 277 | "cell_type": "markdown", 278 | "metadata": { 279 | "id": "0V0qK1knwjg-" 280 | }, 281 | "source": [ 282 | "## Preview the raw data" 283 | ] 284 | }, 285 | { 286 | "cell_type": "code", 287 | "execution_count": null, 288 | "metadata": {}, 289 | "outputs": [], 290 | "source": [ 291 | "from google.cloud import bigquery\n", 292 | "bq_client = bigquery.Client(project=PROJECT_ID)" 293 | ] 294 | }, 295 | { 296 | "cell_type": "code", 297 | "execution_count": null, 298 | "metadata": { 299 | "colab": { 300 | "base_uri": "https://localhost:8080/", 301 | "height": 459 302 | }, 303 | "id": "p90F3kxgv9UO", 304 | "outputId": "5f540f22-3d93-4999-928f-b27292e8b62c" 305 | }, 306 | "outputs": [], 307 | "source": [ 308 | "j = bq_client.query(\"select * from `feast-oss.fraud_tutorial.transactions` limit 1000\")\n", 309 | "j.to_dataframe()" 310 | ] 311 | }, 312 | { 313 | "cell_type": 
"markdown", 314 | "metadata": { 315 | "id": "3i0KOCg8wn5n" 316 | }, 317 | "source": [ 318 | "## Create a feature table using SQL" 319 | ] 320 | }, 321 | { 322 | "cell_type": "markdown", 323 | "metadata": { 324 | "id": "1EGCETWpFV95" 325 | }, 326 | "source": [ 327 | "Then, run the following cell to generate features. This cell contains two functions:\n", 328 | "\n", 329 | "- `generate_user_count_features` runs a SQL query that counts the number of transactions each user has made as of a given point in time.\n", 330 | "\n", 331 | "- `backfill_features` runs this query multiple times over an interval to backfill features.\n", 332 | "\n" 333 | ] 334 | }, 335 | { 336 | "cell_type": "code", 337 | "execution_count": null, 338 | "metadata": { 339 | "colab": { 340 | "base_uri": "https://localhost:8080/" 341 | }, 342 | "id": "ZZ_z84Y1xMia", 343 | "outputId": "b52dcd50-3372-4d31-e52f-073854d3b21c" 344 | }, 345 | "outputs": [], 346 | "source": [ 347 | "from datetime import datetime, timedelta\n", 348 | "import time\n", 349 | "\n", 350 | "def generate_user_count_features(aggregation_end_date):\n", 351 | " table_id = f\"{PROJECT_ID}.{BIGQUERY_DATASET_NAME}.user_count_transactions_7d\"\n", 352 | "\n", 353 | " client = bigquery.Client(project=PROJECT_ID)\n", 354 | " job_config = bigquery.QueryJobConfig(destination=table_id, write_disposition='WRITE_APPEND')\n", 355 | "\n", 356 | " aggregation_start_date = aggregation_end_date - timedelta(days=7)\n", 357 | "\n", 358 | " sql = f\"\"\"\n", 359 | " SELECT\n", 360 | " src_account AS user_id,\n", 361 | " COUNT(*) AS transaction_count_7d,\n", 362 | " timestamp'{aggregation_end_date.isoformat()}' AS feature_timestamp\n", 363 | " FROM\n", 364 | " `feast-oss.fraud_tutorial.transactions`\n", 365 | " WHERE\n", 366 | " timestamp BETWEEN TIMESTAMP('{aggregation_start_date.isoformat()}')\n", 367 | " AND TIMESTAMP('{aggregation_end_date.isoformat()}')\n", 368 | " GROUP BY\n", 369 | " user_id\n", 370 | " \"\"\"\n", 371 | "\n", 372 | "query_job = 
client.query(sql, job_config=job_config)\n", 373 | " query_job.result()\n", 374 | " print(f\"Generated features as of {aggregation_end_date.isoformat()}\")\n", 375 | "\n", 376 | "\n", 377 | "def backfill_features(earliest_aggregation_end_date, interval, num_iterations):\n", 378 | " aggregation_end_date = earliest_aggregation_end_date\n", 379 | " for _ in range(num_iterations):\n", 380 | " generate_user_count_features(aggregation_end_date=aggregation_end_date)\n", 381 | " time.sleep(1)\n", 382 | " aggregation_end_date += interval\n", 383 | "\n", 384 | "if __name__ == '__main__':\n", 385 | " backfill_features(\n", 386 | " earliest_aggregation_end_date=datetime.now() - timedelta(days=7),\n", 387 | " interval=timedelta(days=1),\n", 388 | " num_iterations=8\n", 389 | " )\n" 390 | ] 391 | }, 392 | { 393 | "cell_type": "markdown", 394 | "metadata": { 395 | "id": "JYbQTSWiGCWu" 396 | }, 397 | "source": [ 398 | "Then, we can preview our new feature:" 399 | ] 400 | }, 401 | { 402 | "cell_type": "code", 403 | "execution_count": null, 404 | "metadata": { 405 | "colab": { 406 | "base_uri": "https://localhost:8080/" 407 | }, 408 | "id": "c3WpOXbUxs6d", 409 | "outputId": "696fa460-d89f-4286-e8c7-a53272d930e6" 410 | }, 411 | "outputs": [], 412 | "source": [ 413 | "j = bq_client.query(f\"select * from {BIGQUERY_DATASET_NAME}.user_count_transactions_7d limit 1000\")\n", 414 | "j.to_dataframe()" 415 | ] 416 | }, 417 | { 418 | "cell_type": "markdown", 419 | "metadata": { 420 | "id": "RXLgVUuNGPTj" 421 | }, 422 | "source": [ 423 | "## Create a new FeatureView\n", 424 | "\n", 425 | "Create two new files: `fraud_features.py`, which contains our new feature definitions, and `fraud_services.py`, which contains the feature service definition." 
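Aside: the 7-day windowed count that `generate_user_count_features` computes in SQL can be sketched in plain Python. The data below is toy data, not the tutorial's table; the point is that the window is anchored to the aggregation end date, so backfilling with earlier end dates still produces correct historical windows.

```python
from datetime import datetime, timedelta

# Toy transactions: (user, timestamp) pairs standing in for the BigQuery table.
transactions = [
    ("u1", datetime(2022, 4, 25)),
    ("u1", datetime(2022, 5, 6)),
    ("u2", datetime(2022, 5, 7)),
]

def count_7d(end):
    """Count transactions per user in the 7 days ending at `end`."""
    start = end - timedelta(days=7)  # anchored to the end date, not to now()
    counts = {}
    for user, ts in transactions:
        if start <= ts <= end:
            counts[user] = counts.get(user, 0) + 1
    return counts

print(count_7d(datetime(2022, 5, 7)))  # {'u1': 1, 'u2': 1}: the Apr 25 txn falls outside the window
```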
426 | ] 427 | }, 428 | { 429 | "cell_type": "code", 430 | "execution_count": null, 431 | "metadata": { 432 | "id": "F6B7Wo67yDV7" 433 | }, 434 | "outputs": [], 435 | "source": [ 436 | "fraud_features = \\\n", 437 | "f\"\"\"\n", 438 | "from datetime import timedelta\n", 439 | "from feast import BigQuerySource, FeatureView, Entity, ValueType\n", 440 | "\n", 441 | "# Add an entity for users\n", 442 | "user_entity = Entity(\n", 443 | " name=\"user_id\",\n", 444 | " description=\"A user that has executed a transaction or received a transaction\",\n", 445 | ")\n", 446 | "\n", 447 | "# Add a FeatureView based on our new table\n", 448 | "driver_stats_fv = FeatureView(\n", 449 | " name=\"user_transaction_count_7d\",\n", 450 | " entities=[user_entity],\n", 451 | " ttl=timedelta(weeks=1),\n", 452 | " batch_source=BigQuerySource(\n", 453 | " table=f\"{PROJECT_ID}.{BIGQUERY_DATASET_NAME}.user_count_transactions_7d\",\n", 454 | " timestamp_field=\"feature_timestamp\"))\n", 455 | "\n", 456 | "# Add two FeatureViews based on existing tables in BigQuery\n", 457 | "user_account_fv = FeatureView(\n", 458 | " name=\"user_account_features\",\n", 459 | " entities=[user_entity],\n", 460 | " ttl=timedelta(weeks=52),\n", 461 | " batch_source=BigQuerySource(\n", 462 | " table=f\"feast-oss.fraud_tutorial.user_account_features\",\n", 463 | " timestamp_field=\"feature_timestamp\"))\n", 464 | "\n", 465 | "user_has_fraudulent_transactions_fv = FeatureView(\n", 466 | " name=\"user_has_fraudulent_transactions\",\n", 467 | " entities=[user_entity],\n", 468 | " ttl=timedelta(weeks=52),\n", 469 | " batch_source=BigQuerySource(\n", 470 | " table=f\"feast-oss.fraud_tutorial.user_has_fraudulent_transactions\",\n", 471 | " timestamp_field=\"feature_timestamp\"))\n", 472 | "\"\"\"\n", 473 | "\n", 474 | "fraud_services = f\"\"\"\n", 475 | "from feast import FeatureService\n", 476 | "\n", 477 | "from fraud_features import driver_stats_fv, user_account_fv, user_has_fraudulent_transactions_fv\n", 478 | "\n", 
479 | "fs = FeatureService(\n", 480 | " name=\"user_features\",\n", 481 | " features=[\n", 482 | " driver_stats_fv[[\"transaction_count_7d\"]],\n", 483 | " user_account_fv[[\"credit_score\", \"account_age_days\", \"user_has_2fa_installed\"]],\n", 484 | " user_has_fraudulent_transactions_fv[[\"user_has_fraudulent_transactions_7d\"]],\n", 485 | " ],\n", 486 | ")\"\"\"\n", 487 | "\n", 488 | "with open('fraud_features.py', \"w\") as fraud_features_file:\n", 489 | " fraud_features_file.write(fraud_features)\n", 490 | " \n", 491 | "with open('fraud_services.py', \"w\") as fraud_services_file:\n", 492 | " fraud_services_file.write(fraud_services)" 493 | ] 494 | }, 495 | { 496 | "cell_type": "code", 497 | "execution_count": null, 498 | "metadata": { 499 | "colab": { 500 | "base_uri": "https://localhost:8080/" 501 | }, 502 | "id": "6rSndNQ30ASP", 503 | "outputId": "84e0d430-d7a0-4b87-aa15-86d0cdccaf9e" 504 | }, 505 | "outputs": [], 506 | "source": [ 507 | "# Remove example features\n", 508 | "!rm example.py\n", 509 | "# Apply our changes\n", 510 | "!feast apply" 511 | ] 512 | }, 513 | { 514 | "cell_type": "markdown", 515 | "metadata": { 516 | "id": "6cJFAJiuGxM3" 517 | }, 518 | "source": [ 519 | "## Fetching training data\n", 520 | "\n", 521 | "Now that our features are registered in Feast, we can use Feast to generate a training dataset. 
To do this, we need an entity dataframe, alongside the list of features we want:" 522 | ] 523 | }, 524 | { 525 | "cell_type": "code", 526 | "execution_count": null, 527 | "metadata": { 528 | "colab": { 529 | "base_uri": "https://localhost:8080/", 530 | "height": 330 531 | }, 532 | "id": "BqgiEP2Oz42q", 533 | "outputId": "40552317-644b-4ee6-d735-0ee5e48e79dd" 534 | }, 535 | "outputs": [], 536 | "source": [ 537 | "from datetime import datetime, timedelta\n", 538 | "from feast import FeatureStore\n", 539 | "\n", 540 | "# Initialize a FeatureStore with our current repository's configurations\n", 541 | "store = FeatureStore(repo_path=\".\")\n", 542 | "\n", 543 | "# Get training data\n", 544 | "now = datetime.now()\n", 545 | "two_days_ago = datetime.now() - timedelta(days=2)\n", 546 | "\n", 547 | "feature_service = store.get_feature_service(\"user_features\")\n", 548 | "\n", 549 | "training_data_job = store.get_historical_features(\n", 550 | " entity_df=f\"\"\"\n", 551 | " select \n", 552 | " src_account as user_id,\n", 553 | " timestamp as event_timestamp,\n", 554 | " is_fraud\n", 555 | " from\n", 556 | " `feast-oss.fraud_tutorial.transactions`\n", 557 | " where\n", 558 | " timestamp between timestamp('{two_days_ago.isoformat()}') \n", 559 | " and timestamp('{now.isoformat()}')\"\"\",\n", 560 | " features=feature_service,\n", 561 | " full_feature_names=True\n", 562 | ")\n", 563 | "\n", 564 | "training_data = training_data_job.to_df()\n", 565 | "training_data.head()\n" 566 | ] 567 | }, 568 | { 569 | "cell_type": "markdown", 570 | "metadata": { 571 | "id": "3izkr_3sG1hX" 572 | }, 573 | "source": [ 574 | "## Training a model\n", 575 | "\n", 576 | "Now, we can use our features to train a model:" 577 | ] 578 | }, 579 | { 580 | "cell_type": "code", 581 | "execution_count": null, 582 | "metadata": {}, 583 | "outputs": [], 584 | "source": [ 585 | "!pip install scikit-learn" 586 | ] 587 | }, 588 | { 589 | "cell_type": "code", 590 | "execution_count": null, 591 | "metadata": { 592 | 
"colab": { 593 | "base_uri": "https://localhost:8080/" 594 | }, 595 | "id": "YMgFuFHR4pIu", 596 | "outputId": "7f248543-3016-482c-9d8e-b0819cf3609d" 597 | }, 598 | "outputs": [], 599 | "source": [ 600 | "from sklearn.linear_model import LinearRegression\n", 601 | "\n", 602 | "# Drop stray nulls\n", 603 | "training_data.dropna(inplace=True)\n", 604 | "\n", 605 | "# Select training matrices\n", 606 | "X = training_data[[\n", 607 | " \"user_transaction_count_7d__transaction_count_7d\", \n", 608 | " \"user_account_features__credit_score\",\n", 609 | " \"user_account_features__account_age_days\",\n", 610 | " \"user_account_features__user_has_2fa_installed\",\n", 611 | " \"user_has_fraudulent_transactions__user_has_fraudulent_transactions_7d\"\n", 612 | "]]\n", 613 | "y = training_data[\"is_fraud\"]\n", 614 | "\n", 615 | "# Train a simple linear regression model\n", 616 | "model = LinearRegression()\n", 617 | "model.fit(X, y)" 618 | ] 619 | }, 620 | { 621 | "cell_type": "code", 622 | "execution_count": null, 623 | "metadata": { 624 | "colab": { 625 | "base_uri": "https://localhost:8080/" 626 | }, 627 | "id": "PobCpVWu4pdz", 628 | "outputId": "60a39792-d332-4edc-c59c-e4e20a87ea5e" 629 | }, 630 | "outputs": [], 631 | "source": [ 632 | "# Get first two rows of training data\n", 633 | "samples = X.iloc[:2]\n", 634 | "\n", 635 | "# Make a test prediction\n", 636 | "model.predict(samples)" 637 | ] 638 | }, 639 | { 640 | "cell_type": "code", 641 | "execution_count": null, 642 | "metadata": {}, 643 | "outputs": [], 644 | "source": [ 645 | "import joblib\n", 646 | "joblib.dump(model, \"model.bin\")" 647 | ] 648 | }, 649 | { 650 | "cell_type": "markdown", 651 | "metadata": { 652 | "id": "HWQLT0gTHi1h" 653 | }, 654 | "source": [ 655 | "## Materializing features\n", 656 | "\n", 657 | "To enable real-time feature inference, Feast loads your features into a key-value store so they're available at low latency. We use Redis as this key-value store." 
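Materialization can be pictured as "latest row per entity wins": the newest feature row for each entity is copied from the offline table into the key-value store. A library-free sketch with a dict standing in for Redis (field names are illustrative):

```python
from datetime import datetime

# Offline feature rows: (user_id, feature_timestamp, transaction_count_7d).
rows = [
    ("u1", datetime(2022, 5, 1), 3),
    ("u1", datetime(2022, 5, 2), 7),   # newer row for u1
    ("u2", datetime(2022, 5, 1), 1),
]

# Simulated online store: entity key -> feature values (Redis plays this role in the tutorial).
online_store = {}
for user_id, ts, count in sorted(rows, key=lambda r: r[1]):
    # Iterating in timestamp order means the most recent row overwrites older ones.
    online_store[user_id] = {"transaction_count_7d": count, "event_ts": ts}

print(online_store["u1"]["transaction_count_7d"])  # 7: the most recent value wins
```

`feast materialize-incremental` does this at scale (and tracks the last materialized timestamp so repeated runs only load new rows); the sketch only shows the "latest value per key" semantics.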
658 | ] 659 | }, 660 | { 661 | "cell_type": "code", 662 | "execution_count": null, 663 | "metadata": { 664 | "id": "REBqJPcZ99Lj" 665 | }, 666 | "outputs": [], 667 | "source": [ 668 | "!feast materialize-incremental $(date -u +\"%Y-%m-%dT%H:%M:%S\")" 669 | ] 670 | }, 671 | { 672 | "cell_type": "markdown", 673 | "metadata": { 674 | "id": "FNVxJpw1HohD" 675 | }, 676 | "source": [ 677 | "## Low latency inference\n", 678 | "\n", 679 | "To make a prediction in real time, we need to do the following:\n", 680 | "\n", 681 | "1. Start a feature server (as a subprocess) that exposes a gRPC API\n", 682 | "2. Create a gRPC client from the precompiled Feast Serving proto interface\n", 683 | "3. Call GetOnlineFeatures on this gRPC client\n", 684 | "4. Pass these features to the model\n", 685 | "5. Return these predictions to the user" 686 | ] 687 | }, 688 | { 689 | "cell_type": "code", 690 | "execution_count": null, 691 | "metadata": {}, 692 | "outputs": [], 693 | "source": [ 694 | "import subprocess\n", 695 | "\n", 696 | "server = subprocess.Popen([\"feast\", \"serve\", \"-t\", \"grpc\"])" 697 | ] 698 | }, 699 | { 700 | "cell_type": "code", 701 | "execution_count": null, 702 | "metadata": {}, 703 | "outputs": [], 704 | "source": [ 705 | "import grpc\n", 706 | "\n", 707 | "from feast.protos.feast.serving.ServingService_pb2 import GetOnlineFeaturesRequest\n", 708 | "from feast.protos.feast.serving.ServingService_pb2_grpc import ServingServiceStub\n", 709 | "\n", 710 | "from feast.protos.feast.types.Value_pb2 import RepeatedValue\n", 711 | "\n", 712 | "from feast.type_map import python_values_to_proto_values\n", 713 | "from feast.online_response import OnlineResponse\n", 714 | "\n", 715 | "chan = grpc.insecure_channel(\"localhost:6566\")\n", 716 | "grpc_client = ServingServiceStub(chan)\n", 717 | "\n", 718 | "def get_online_features_remote(user_ids):\n", 719 | " resp = grpc_client.GetOnlineFeatures(\n", 720 | " GetOnlineFeaturesRequest(\n", 721 | " 
feature_service=\"user_features\",\n", 722 | " entities={\n", 723 | " \"user_id\": RepeatedValue(\n", 724 | " val=python_values_to_proto_values(user_ids)\n", 725 | " )\n", 726 | " }\n", 727 | " )\n", 728 | " )\n", 729 | " return OnlineResponse(resp).to_dict()" 730 | ] 731 | }, 732 | { 733 | "cell_type": "code", 734 | "execution_count": null, 735 | "metadata": { 736 | "colab": { 737 | "base_uri": "https://localhost:8080/" 738 | }, 739 | "id": "JyME-zoy-4d_", 740 | "outputId": "b2e9ba19-4d0b-4cac-9c69-90a06c28fe25" 741 | }, 742 | "outputs": [], 743 | "source": [ 744 | "import joblib\n", 745 | "model = joblib.load('model.bin')\n", 746 | "\n", 747 | "def predict(user_ids):\n", 748 | " feature_vector = get_online_features_remote(user_ids)\n", 749 | " \n", 750 | " # Delete entity keys\n", 751 | " del feature_vector[\"user_id\"]\n", 752 | "\n", 753 | " # Flatten response from Feast\n", 754 | " instances = [\n", 755 | " [feature_values[i] or 0 for feature_values in feature_vector.values()]\n", 756 | " for i in range(len(user_ids))\n", 757 | " ]\n", 758 | "\n", 759 | " response = model.predict(instances)\n", 760 | " return response\n", 761 | "\n", 762 | "predict([\"v5zlw0\"])" 763 | ] 764 | }, 765 | { 766 | "cell_type": "markdown", 767 | "metadata": {}, 768 | "source": [ 769 | " " 770 | ] 771 | }, 772 | { 773 | "cell_type": "markdown", 774 | "metadata": {}, 775 | "source": [ 776 | " " 777 | ] 778 | }, 779 | { 780 | "cell_type": "markdown", 781 | "metadata": {}, 782 | "source": [ 783 | "# Part II (online features logging and validation)" 784 | ] 785 | }, 786 | { 787 | "cell_type": "markdown", 788 | "metadata": {}, 789 | "source": [ 790 | "In this part we will extend our feature store project with Data Quality Monitoring. Specifically, we are going to validate online features (features served by the feature server) against the reference dataset created from training features, by applying expectations that we will develop ourselves. 
We will do this in 3 steps:\n", 791 | "1. Configuring feature logging in the feature server and setting a logging destination on the specific feature service object.\n", 792 | "2. Defining expectations using [Great Expectations](https://greatexpectations.io/).\n", 793 | "3. Triggering validation using the SDK or CLI." 794 | ] 795 | }, 796 | { 797 | "cell_type": "markdown", 798 | "metadata": {}, 799 | "source": [ 800 | "### Updating configuration to enable logging" 801 | ] 802 | }, 803 | { 804 | "cell_type": "markdown", 805 | "metadata": {}, 806 | "source": [ 807 | "First, let's edit our `feature_store.yaml` and add a `feature_logging` parameter inside `feature_server`." 808 | ] 809 | }, 810 | { 811 | "cell_type": "code", 812 | "execution_count": null, 813 | "metadata": {}, 814 | "outputs": [], 815 | "source": [ 816 | "feature_store = \\\n", 817 | "f\"\"\"project: fraud_tutorial\n", 818 | "registry: data/registry.db\n", 819 | "provider: local\n", 820 | "offline_store:\n", 821 | " type: bigquery\n", 822 | " dataset: {BIGQUERY_DATASET_NAME}\n", 823 | "online_store:\n", 824 | " type: redis\n", 825 | " connection_string: \"localhost:6379\"\n", 826 | "feature_server:\n", 827 | " enabled: True\n", 828 | " feature_logging:\n", 829 | " enabled: True\n", 830 | " flush_interval_secs: 60\n", 831 | " write_to_disk_interval_secs: 10\n", 832 | " \n", 833 | "go_feature_retrieval: True\n", 834 | "\"\"\"\n", 835 | "\n", 836 | "with open('feature_store.yaml', \"w\") as feature_store_file:\n", 837 | " feature_store_file.write(feature_store)" 838 | ] 839 | }, 840 | { 841 | "cell_type": "markdown", 842 | "metadata": {}, 843 | "source": [ 844 | "Next, we need to update our feature service definition in `fraud_services.py` with a logging config. The logging config defines the sample rate and the logging destination. The sample rate sets the fraction of served feature rows that will be logged, and can take any value from 0 to 1 (both ends inclusive). 
The destination must be a table or a path in the offline store declared globally in the `feature_store.yaml`." 845 | ] 846 | }, 847 | { 848 | "cell_type": "code", 849 | "execution_count": null, 850 | "metadata": {}, 851 | "outputs": [], 852 | "source": [ 853 | "fraud_services = f\"\"\"\n", 854 | "from feast import FeatureService\n", 855 | "from feast.feature_logging import LoggingConfig\n", 856 | "from feast.infra.offline_stores.bigquery_source import BigQueryLoggingDestination\n", 857 | "\n", 858 | "from fraud_features import driver_stats_fv, user_account_fv, user_has_fraudulent_transactions_fv\n", 859 | "\n", 860 | "fs = FeatureService(\n", 861 | " name=\"user_features\",\n", 862 | " features=[\n", 863 | " driver_stats_fv[[\"transaction_count_7d\"]],\n", 864 | " user_account_fv[[\"credit_score\", \"account_age_days\", \"user_has_2fa_installed\"]],\n", 865 | " user_has_fraudulent_transactions_fv[[\"user_has_fraudulent_transactions_7d\"]],\n", 866 | " ],\n", 867 | " logging_config=LoggingConfig(\n", 868 | " sample_rate=1.0,\n", 869 | " destination=BigQueryLoggingDestination(\n", 870 | " table_ref=\"{PROJECT_ID}.{BIGQUERY_DATASET_NAME}.user_features_online_logs\"\n", 871 | " )\n", 872 | " )\n", 873 | ")\"\"\"\n", 874 | " \n", 875 | "with open('fraud_services.py', \"w\") as fraud_services_file:\n", 876 | " fraud_services_file.write(fraud_services)" 877 | ] 878 | }, 879 | { 880 | "cell_type": "markdown", 881 | "metadata": {}, 882 | "source": [ 883 | "Now, let's apply these changes:" 884 | ] 885 | }, 886 | { 887 | "cell_type": "code", 888 | "execution_count": null, 889 | "metadata": {}, 890 | "outputs": [], 891 | "source": [ 892 | "! 
feast apply" 893 | ] 894 | }, 895 | { 896 | "cell_type": "markdown", 897 | "metadata": {}, 898 | "source": [ 899 | "and restart the feature server:" 900 | ] 901 | }, 902 | { 903 | "cell_type": "code", 904 | "execution_count": null, 905 | "metadata": {}, 906 | "outputs": [], 907 | "source": [ 908 | "server.terminate()\n", 909 | "server = subprocess.Popen([\"feast\", \"serve\", \"-t\", \"grpc\"])" 910 | ] 911 | }, 912 | { 913 | "cell_type": "markdown", 914 | "metadata": {}, 915 | "source": [ 916 | "### Creating reference dataset from training features" 917 | ] 918 | }, 919 | { 920 | "cell_type": "code", 921 | "execution_count": null, 922 | "metadata": {}, 923 | "outputs": [], 924 | "source": [ 925 | "from feast.infra.offline_stores.bigquery_source import SavedDatasetBigQueryStorage\n", 926 | "\n", 927 | "reference_dataset = store.create_saved_dataset(\n", 928 | " from_=training_data_job,\n", 929 | " name=\"reference_dataset\",\n", 930 | " storage=SavedDatasetBigQueryStorage(table=f\"{PROJECT_ID}.{BIGQUERY_DATASET_NAME}.reference_dataset\"))" 931 | ] 932 | }, 933 | { 934 | "cell_type": "markdown", 935 | "metadata": {}, 936 | "source": [ 937 | "### Creating & testing validation profiler" 938 | ] 939 | }, 940 | { 941 | "cell_type": "code", 942 | "execution_count": null, 943 | "metadata": {}, 944 | "outputs": [], 945 | "source": [ 946 | "from feast.dqm.profilers.ge_profiler import ge_profiler\n", 947 | "from great_expectations.dataset import PandasDataset\n", 948 | "from great_expectations.core.expectation_suite import ExpectationSuite" 949 | ] 950 | }, 951 | { 952 | "cell_type": "markdown", 953 | "metadata": {}, 954 | "source": [ 955 | "A profiler is defined as a function that takes a dataset (a Pandas DataFrame wrapped in GE's `PandasDataset` class) and returns an `ExpectationSuite`, a set of expectations:" 956 | ] 957 | }, 958 | { 959 | "cell_type": "code", 960 | "execution_count": null, 961 | "metadata": {}, 962 | "outputs": [], 963 | "source": [ 964 | "@ge_profiler\n", 
965 | "def user_features_profiler(ds: PandasDataset) -> ExpectationSuite:\n", 966 | "    ds.expect_column_values_to_be_between(\"user_account_features__credit_score\", 300, 850)\n", 967 | "    ds.expect_column_values_to_be_between(\"user_transaction_count_7d__transaction_count_7d\", min_value=0)\n", 968 | "    return ds.get_expectation_suite()" 969 | ] 970 | }, 971 | { 972 | "cell_type": "markdown", 973 | "metadata": {}, 974 | "source": [ 975 | "To learn more about the expectation functions that can be used in a profiler definition, please refer to the [Great Expectations docs](https://docs.greatexpectations.io/docs/)." 976 | ] 977 | }, 978 | { 979 | "cell_type": "markdown", 980 | "metadata": {}, 981 | "source": [ 982 | "The profiler can be tested using the saved dataset object created above:" 983 | ] 984 | }, 985 | { 986 | "cell_type": "code", 987 | "execution_count": null, 988 | "metadata": {}, 989 | "outputs": [], 990 | "source": [ 991 | "reference_dataset.get_profile(profiler=user_features_profiler)" 992 | ] 993 | }, 994 | { 995 | "cell_type": "markdown", 996 | "metadata": {}, 997 | "source": [ 998 | "The profiler function, along with the reference dataset, must be stored in the Feast registry before calling the validation API:" 999 | ] 1000 | }, 1001 | { 1002 | "cell_type": "code", 1003 | "execution_count": null, 1004 | "metadata": {}, 1005 | "outputs": [], 1006 | "source": [ 1007 | "from feast.saved_dataset import ValidationReference\n", 1008 | "\n", 1009 | "ref = ValidationReference(\n", 1010 | "    name='user_features_training_ref',\n", 1011 | "    dataset_name=\"reference_dataset\",\n", 1012 | "    profiler=user_features_profiler,\n", 1013 | ")\n", 1014 | "store.apply(ref)" 1015 | ] 1016 | }, 1017 | { 1018 | "cell_type": "markdown", 1019 | "metadata": {}, 1020 | "source": [ 1021 | "## Validation" 1022 | ] 1023 | }, 1024 | { 1025 | "cell_type": "markdown", 1026 | "metadata": {}, 1027 | "source": [ 1028 | "Let's now run a few predictions to log some data points:" 1029 | ] 1030 | }, 1031 | { 1032 |
"cell_type": "code", 1033 | "execution_count": null, 1034 | "metadata": {}, 1035 | "outputs": [], 1036 | "source": [ 1037 | "user_ids = list(training_data.user_id.sample(10))\n", 1038 | "predict(user_ids)" 1039 | ] 1040 | }, 1041 | { 1042 | "cell_type": "markdown", 1043 | "metadata": {}, 1044 | "source": [ 1045 | "After some time has passed (depending on the value of `flush_interval_secs` defined in `feature_store.yaml`), we can trigger a validation:" 1046 | ] 1047 | }, 1048 | { 1049 | "cell_type": "code", 1050 | "execution_count": null, 1051 | "metadata": {}, 1052 | "outputs": [], 1053 | "source": [ 1054 | "end_ts = datetime.now()\n", 1055 | "start_ts = end_ts - timedelta(minutes=10)\n", 1056 | "\n", 1057 | "! feast validate --feature-service user_features \\\n", 1058 | "    --reference user_features_training_ref {start_ts.isoformat()} {end_ts.isoformat()}" 1059 | ] 1060 | }, 1061 | { 1062 | "cell_type": "markdown", 1063 | "metadata": {}, 1064 | "source": [ 1065 | "### Making validation fail" 1066 | ] 1067 | }, 1068 | { 1069 | "cell_type": "markdown", 1070 | "metadata": {}, 1071 | "source": [ 1072 | "Now, if some invalid data that doesn't meet our expectations is ingested into the online store and then retrieved via the feature server, we should see the validation fail."
1073 | ] 1074 | }, 1075 | { 1076 | "cell_type": "code", 1077 | "execution_count": null, 1078 | "metadata": {}, 1079 | "outputs": [], 1080 | "source": [ 1081 | "import pandas as pd\n", 1082 | "insert_df = pd.DataFrame({\n", 1083 | " \"user_id\": [\"pwvabf\"],\n", 1084 | " \"transaction_count_7d\": [-1],\n", 1085 | " \"feature_timestamp\": [datetime.now()],\n", 1086 | "})\n", 1087 | "store.write_to_online_store(\"user_transaction_count_7d\", insert_df)" 1088 | ] 1089 | }, 1090 | { 1091 | "cell_type": "code", 1092 | "execution_count": null, 1093 | "metadata": {}, 1094 | "outputs": [], 1095 | "source": [ 1096 | "predict([\"pwvabf\"])" 1097 | ] 1098 | }, 1099 | { 1100 | "cell_type": "markdown", 1101 | "metadata": {}, 1102 | "source": [ 1103 | "*Remember that it takes some time to write logs to BigQuery*" 1104 | ] 1105 | }, 1106 | { 1107 | "cell_type": "code", 1108 | "execution_count": null, 1109 | "metadata": {}, 1110 | "outputs": [], 1111 | "source": [ 1112 | "end_ts = datetime.now()\n", 1113 | "start_ts = end_ts - timedelta(minutes=10)\n", 1114 | "\n", 1115 | "! 
feast validate --feature-service user_features \\\n", 1116 | " --reference user_features_training_ref {start_ts.isoformat()} {end_ts.isoformat()}" 1117 | ] 1118 | }, 1119 | { 1120 | "cell_type": "markdown", 1121 | "metadata": {}, 1122 | "source": [ 1123 | "### Alternative example with validating feature presence" 1124 | ] 1125 | }, 1126 | { 1127 | "cell_type": "markdown", 1128 | "metadata": {}, 1129 | "source": [ 1130 | "In this example we create an expectation that the feature will have a not-null value in 99% of the cases:" 1131 | ] 1132 | }, 1133 | { 1134 | "cell_type": "code", 1135 | "execution_count": null, 1136 | "metadata": {}, 1137 | "outputs": [], 1138 | "source": [ 1139 | "@ge_profiler\n", 1140 | "def user_features_profiler_v2(ds: PandasDataset) -> ExpectationSuite:\n", 1141 | " ds.expect_column_values_to_not_be_null(\"user_account_features__account_age_days\", mostly=0.99)\n", 1142 | " return ds.get_expectation_suite()" 1143 | ] 1144 | }, 1145 | { 1146 | "cell_type": "markdown", 1147 | "metadata": {}, 1148 | "source": [ 1149 | "testing on the reference dataset:" 1150 | ] 1151 | }, 1152 | { 1153 | "cell_type": "code", 1154 | "execution_count": null, 1155 | "metadata": {}, 1156 | "outputs": [], 1157 | "source": [ 1158 | "reference_dataset.get_profile(profiler=user_features_profiler_v2)" 1159 | ] 1160 | }, 1161 | { 1162 | "cell_type": "markdown", 1163 | "metadata": {}, 1164 | "source": [ 1165 | ".. 
and storing new validation reference in the registry:" 1166 | ] 1167 | }, 1168 | { 1169 | "cell_type": "code", 1170 | "execution_count": null, 1171 | "metadata": {}, 1172 | "outputs": [], 1173 | "source": [ 1174 | "store.apply(\n", 1175 | " ValidationReference(\n", 1176 | " name='user_features_training_ref_v2',\n", 1177 | " dataset_name=\"reference_dataset\",\n", 1178 | " profiler=user_features_profiler_v2,\n", 1179 | " )\n", 1180 | ")" 1181 | ] 1182 | }, 1183 | { 1184 | "cell_type": "markdown", 1185 | "metadata": {}, 1186 | "source": [ 1187 | "Retrieving some entity rows that do not exist in the online store (and thus, returned feature statuses will be NOT FOUND):" 1188 | ] 1189 | }, 1190 | { 1191 | "cell_type": "code", 1192 | "execution_count": null, 1193 | "metadata": {}, 1194 | "outputs": [], 1195 | "source": [ 1196 | "predict([\"invalid\"] * 5)" 1197 | ] 1198 | }, 1199 | { 1200 | "cell_type": "markdown", 1201 | "metadata": {}, 1202 | "source": [ 1203 | "Now validation should fail:" 1204 | ] 1205 | }, 1206 | { 1207 | "cell_type": "code", 1208 | "execution_count": null, 1209 | "metadata": {}, 1210 | "outputs": [], 1211 | "source": [ 1212 | "end_ts = datetime.now()\n", 1213 | "start_ts = end_ts - timedelta(hours=1)\n", 1214 | "\n", 1215 | "! 
feast validate --feature-service user_features \\\n", 1216 | " --reference user_features_training_ref_v2 {start_ts.isoformat()} {end_ts.isoformat()}" 1217 | ] 1218 | }, 1219 | { 1220 | "cell_type": "markdown", 1221 | "metadata": { 1222 | "id": "m4Pu2m4KUrbp" 1223 | }, 1224 | "source": [ 1225 | "# Cleanup\n", 1226 | "\n", 1227 | "If you want to clean up the resources created during this tutorial, run the following cells:\n" 1228 | ] 1229 | }, 1230 | { 1231 | "cell_type": "code", 1232 | "execution_count": null, 1233 | "metadata": { 1234 | "colab": { 1235 | "base_uri": "https://localhost:8080/" 1236 | }, 1237 | "id": "9RK_Kxj2VFQu", 1238 | "outputId": "c6947143-d41b-4ce8-9f14-36c530234eb4" 1239 | }, 1240 | "outputs": [], 1241 | "source": [ 1242 | "!bq rm -t -f ${BIGQUERY_DATASET_NAME}.user_count_transactions_7d\n", 1243 | "!bq rm -t -f ${BIGQUERY_DATASET_NAME}.user_features_online_logs\n", 1244 | "!bq rm -r -f -d ${BIGQUERY_DATASET_NAME}" 1245 | ] 1246 | }, 1247 | { 1248 | "cell_type": "code", 1249 | "execution_count": null, 1250 | "metadata": { 1251 | "id": "5EGuKSupu5jN" 1252 | }, 1253 | "outputs": [], 1254 | "source": [ 1255 | "server.terminate()" 1256 | ] 1257 | }, 1258 | { 1259 | "cell_type": "code", 1260 | "execution_count": null, 1261 | "metadata": {}, 1262 | "outputs": [], 1263 | "source": [] 1264 | } 1265 | ], 1266 | "metadata": { 1267 | "colab": { 1268 | "collapsed_sections": [], 1269 | "name": "Fraud_Detection_Tutorial.ipynb", 1270 | "provenance": [], 1271 | "toc_visible": true 1272 | }, 1273 | "kernelspec": { 1274 | "display_name": "Python 3 (ipykernel)", 1275 | "language": "python", 1276 | "name": "python3" 1277 | }, 1278 | "language_info": { 1279 | "codemirror_mode": { 1280 | "name": "ipython", 1281 | "version": 3 1282 | }, 1283 | "file_extension": ".py", 1284 | "mimetype": "text/x-python", 1285 | "name": "python", 1286 | "nbconvert_exporter": "python", 1287 | "pygments_lexer": "ipython3", 1288 | "version": "3.9.12" 1289 | } 1290 | }, 1291 | "nbformat": 
4, 1292 | "nbformat_minor": 1 1293 | } 1294 | -------------------------------------------------------------------------------- /notebooks/txn_and_features_gen.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 1, 6 | "metadata": {}, 7 | "outputs": [], 8 | "source": [ 9 | "import string\n", 10 | "import random" 11 | ] 12 | }, 13 | { 14 | "cell_type": "code", 15 | "execution_count": 61, 16 | "metadata": {}, 17 | "outputs": [], 18 | "source": [ 19 | "from datetime import datetime, timedelta\n", 20 | "import numpy as np\n", 21 | "\n", 22 | "src_accounts = [user_ids[abs(int(np.random.normal(5000, 2500)))% 10000] for _ in range(100000)]\n", 23 | "amounts = [ round(np.random.uniform (100, 10000), 2) for _ in range(100000)]\n", 24 | "dest_accounts = [''.join(random.choices(string.digits + string.ascii_lowercase, k=6)) for _ in range(100000)]\n", 25 | "is_frauds = [np.random.binomial(1, 0.05) for _ in range(100000)]\n", 26 | "\n", 27 | "timestamps = []\n", 28 | "ts = datetime.now() - timedelta(days=14)\n", 29 | "for i in range(100000):\n", 30 | " timestamps.append(ts)\n", 31 | " ts += timedelta(seconds=np.random.uniform(0, 24))" 32 | ] 33 | }, 34 | { 35 | "cell_type": "code", 36 | "execution_count": 67, 37 | "metadata": {}, 38 | "outputs": [ 39 | { 40 | "data": { 41 | "text/html": [ 42 | "
|   | src_account | amount | dest_account | is_fraud | timestamp |
|---|---|---|---|---|---|
| 0 | 7859ge | 7078.83 | waqmx5 | 0 | 2021-05-21 08:10:12.039737 |
| 1 | yzziue | 7851.83 | tgx086 | 0 | 2021-05-21 08:10:24.464622 |
| 2 | bgf8nl | 6016.44 | q0ltxc | 0 | 2021-05-21 08:10:36.208894 |
| 3 | jiaxoq | 2573.43 | ct01il | 0 | 2021-05-21 08:10:54.177049 |
| 4 | u49qmt | 6743.81 | 397mqf | 0 | 2021-05-21 08:11:16.870868 |
| ... | ... | ... | ... | ... | ... |
| 99995 | xqjsd3 | 6856.42 | 2z3w39 | 0 | 2021-06-04 05:21:23.834089 |
| 99996 | 1tyh8p | 8527.89 | h5jgwy | 0 | 2021-06-04 05:21:37.946295 |
| 99997 | mfj3xt | 4651.57 | xvjv67 | 0 | 2021-06-04 05:21:39.853131 |
| 99998 | l0e31n | 9771.14 | savnzy | 0 | 2021-06-04 05:22:03.112553 |
| 99999 | 782k16 | 9749.99 | 24bhqv | 0 | 2021-06-04 05:22:08.142090 |

100000 rows × 5 columns
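The generation cell earlier in this notebook relies on `user_ids` defined in an elided setup cell. A minimal stdlib-only sketch of the same scheme — random six-character accounts, uniform amounts, a ~5% Bernoulli fraud flag, and monotonically increasing timestamps — is shown below; the helper name `gen_transactions` is hypothetical, not from the original notebook:

```python
import random
import string
from datetime import datetime, timedelta

def gen_transactions(n=1000, n_users=100, seed=42):
    """Sketch of the synthetic transaction generator (stdlib only)."""
    rnd = random.Random(seed)
    alphabet = string.digits + string.ascii_lowercase
    # stand-in for the user_ids list built in the elided setup cell
    user_ids = [''.join(rnd.choices(alphabet, k=6)) for _ in range(n_users)]

    ts = datetime.now() - timedelta(days=14)
    rows = []
    for _ in range(n):
        rows.append({
            "src_account": rnd.choice(user_ids),
            "amount": round(rnd.uniform(100, 10000), 2),
            "dest_account": ''.join(rnd.choices(alphabet, k=6)),
            "is_fraud": 1 if rnd.random() < 0.05 else 0,  # ~5% fraud rate
            "timestamp": ts,
        })
        ts += timedelta(seconds=rnd.uniform(0, 24))  # non-decreasing timestamps
    return rows

txns = gen_transactions()
fraud_rate = sum(r["is_fraud"] for r in txns) / len(txns)
```

The original uses `np.random` instead of the stdlib `random` module, but the distributions are the same.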
|   | src_account | amount | dest_account | is_fraud | timestamp |
|---|---|---|---|---|---|
| 0 | 782k16 | 9749.99 | 24bhqv | 0 | 2021-06-04 05:22:08.142090+00:00 |
| 1 | l0e31n | 9771.14 | savnzy | 0 | 2021-06-04 05:22:03.112553+00:00 |
| 2 | mfj3xt | 4651.57 | xvjv67 | 0 | 2021-06-04 05:21:39.853131+00:00 |
| 3 | 1tyh8p | 8527.89 | h5jgwy | 0 | 2021-06-04 05:21:37.946295+00:00 |
| 4 | xqjsd3 | 6856.42 | 2z3w39 | 0 | 2021-06-04 05:21:23.834089+00:00 |
| ... | ... | ... | ... | ... | ... |
| 99995 | u49qmt | 6743.81 | 397mqf | 0 | 2021-05-21 08:11:16.870868+00:00 |
| 99996 | jiaxoq | 2573.43 | ct01il | 0 | 2021-05-21 08:10:54.177049+00:00 |
| 99997 | bgf8nl | 6016.44 | q0ltxc | 0 | 2021-05-21 08:10:36.208894+00:00 |
| 99998 | yzziue | 7851.83 | tgx086 | 0 | 2021-05-21 08:10:24.464622+00:00 |
| 99999 | 7859ge | 7078.83 | waqmx5 | 0 | 2021-05-21 08:10:12.039737+00:00 |

100000 rows × 5 columns
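The preview above shows the same transactions after a round trip through BigQuery: timestamps come back timezone-aware (UTC). Later previews in this notebook show the same data shifted toward the present; the cloud function in this repo keeps demo data fresh with the same pattern — shift every timestamp by the gap between now and the latest one. A stdlib sketch (the helper name `shift_timestamps_to_now` is hypothetical):

```python
from datetime import datetime

def shift_timestamps_to_now(timestamps, now=None):
    """Shift a list of datetimes forward so the latest equals `now`,
    preserving all relative spacing between events."""
    now = now or datetime.now()
    latest = max(timestamps)
    # drop tzinfo before subtracting, as the repo's cloud function does,
    # so aware values from BigQuery can be compared with a naive now()
    diff = now - latest.replace(tzinfo=None)
    return [ts + diff for ts in timestamps]
```

Applied to a whole column with pandas, this is `df['timestamp'] + diff`, as in `cloud_function/main.py`.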
|   | src_account | user_has_fraudulent_transactions_7d | feature_timestamp |
|---|---|---|---|
| 27 | zqvbs4 | 1 | 2021-06-07 12:45:10.214026 |
| 37 | a9l0te | 1 | 2021-06-07 12:45:10.214026 |
| 62 | z2lnqe | 1 | 2021-06-07 12:45:10.214026 |
| 82 | xv1ul5 | 1 | 2021-06-07 12:45:10.214026 |
| 112 | 6ua5v6 | 1 | 2021-06-07 12:45:10.214026 |
| ... | ... | ... | ... |
| 25188 | kr123d | 1 | 2021-06-07 12:45:10.214026 |
| 25243 | y7dobz | 1 | 2021-06-07 12:45:10.214026 |
| 25303 | wija9d | 1 | 2021-06-07 12:45:10.214026 |
| 25375 | u269is | 1 | 2021-06-07 12:45:10.214026 |
| 25378 | 8058vz | 1 | 2021-06-07 12:45:10.214026 |

1175 rows × 3 columns
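The flag previewed above marks accounts with at least one fraudulent transaction in a trailing 7-day window, presumably derived from the transactions table by a time filter plus group-by. A stdlib sketch of that aggregation (the helper name `fraud_flags_7d` is hypothetical):

```python
from datetime import datetime, timedelta

def fraud_flags_7d(transactions, now=None):
    """Return {src_account: 0/1} where 1 means the account had at least one
    fraudulent transaction within the last 7 days. `transactions` is an
    iterable of dicts with 'src_account', 'is_fraud' and 'timestamp' keys."""
    now = now or datetime.now()
    cutoff = now - timedelta(days=7)
    flags = {}
    for t in transactions:
        acct = t["src_account"]
        flags.setdefault(acct, 0)  # every seen account gets a default of 0
        if t["timestamp"] >= cutoff and t["is_fraud"]:
            flags[acct] = 1
    return flags
```

In the actual notebooks this kind of aggregation would more likely be a BigQuery `GROUP BY` or a pandas `groupby`; the logic is the same.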
|   | user_id | user_has_fraudulent_transactions_7d | feature_timestamp |
|---|---|---|---|
| 0 | 782k16 | 0.0 | 2021-06-07 12:58:28.318652 |
| 1 | l0e31n | 0.0 | 2021-06-07 12:58:28.318652 |
| 2 | mfj3xt | 1.0 | 2021-06-07 12:58:28.318652 |
| 3 | 1tyh8p | 0.0 | 2021-06-07 12:58:28.318652 |
| 4 | xqjsd3 | 0.0 | 2021-06-07 12:58:28.318652 |
| ... | ... | ... | ... |
| 99995 | u49qmt | 0.0 | 2021-06-07 12:58:28.318652 |
| 99996 | jiaxoq | 0.0 | 2021-06-07 12:58:28.318652 |
| 99997 | bgf8nl | 0.0 | 2021-06-07 12:58:28.318652 |
| 99998 | yzziue | 0.0 | 2021-06-07 12:58:28.318652 |
| 99999 | 7859ge | 0.0 | 2021-06-07 12:58:28.318652 |

100000 rows × 3 columns
|   | user_id | credit_score | account_age_days | user_has_2fa_installed | feature_timestamp |
|---|---|---|---|---|---|
| 0 | 782k16 | 626 | 799 | 1 | 2021-06-07 12:59:14.813413 |
| 1 | l0e31n | 648 | 889 | 1 | 2021-06-07 12:59:14.813418 |
| 2 | mfj3xt | 603 | 383 | 1 | 2021-06-07 12:59:14.813419 |
| 3 | 1tyh8p | 808 | 701 | 0 | 2021-06-07 12:59:14.813419 |
| 4 | xqjsd3 | 351 | 428 | 0 | 2021-06-07 12:59:14.813420 |
| ... | ... | ... | ... | ... | ... |
| 97279 | h1p7lk | 518 | 407 | 1 | 2021-06-07 12:59:14.818469 |
| 97325 | n120dt | 595 | 927 | 1 | 2021-06-07 12:59:14.818470 |
| 97818 | txk4ui | 583 | 872 | 1 | 2021-06-07 12:59:14.818470 |
| 98870 | j72zdi | 685 | 114 | 0 | 2021-06-07 12:59:14.818471 |
| 99563 | wi10zj | 404 | 627 | 1 | 2021-06-07 12:59:14.818471 |

9944 rows × 5 columns
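The account-feature rows previewed above stay inside the ranges the validation profiler later asserts (credit scores in [300, 850]). A stdlib sketch of generating such per-user features; the helper name `gen_account_features` is hypothetical, not from the original notebook:

```python
import random
from datetime import datetime

def gen_account_features(user_ids, seed=7):
    """Sketch: one synthetic feature row per user, with values inside the
    ranges the validation profiler expects (credit_score in [300, 850])."""
    rnd = random.Random(seed)
    now = datetime.now()
    return [
        {
            "user_id": uid,
            "credit_score": rnd.randint(300, 850),
            "account_age_days": rnd.randint(0, 1000),
            "user_has_2fa_installed": rnd.randint(0, 1),
            "feature_timestamp": now,
        }
        for uid in user_ids
    ]
```

Keeping the generator's ranges aligned with the profiler's expectations is what makes the later "happy path" validation pass.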
|   | src_account | amount | dest_account | is_fraud | timestamp |
|---|---|---|---|---|---|
| 0 | 782k16 | 9749.99 | 24bhqv | 0 | 2021-06-10 11:53:31.513514+00:00 |
| 1 | l0e31n | 9771.14 | savnzy | 0 | 2021-06-10 11:53:26.483977+00:00 |
| 2 | mfj3xt | 4651.57 | xvjv67 | 0 | 2021-06-10 11:53:03.224555+00:00 |
| 3 | 1tyh8p | 8527.89 | h5jgwy | 0 | 2021-06-10 11:53:01.317719+00:00 |
| 4 | xqjsd3 | 6856.42 | 2z3w39 | 0 | 2021-06-10 11:52:47.205513+00:00 |
| ... | ... | ... | ... | ... | ... |
| 99995 | u49qmt | 6743.81 | 397mqf | 0 | 2021-05-27 14:42:40.242292+00:00 |
| 99996 | jiaxoq | 2573.43 | ct01il | 0 | 2021-05-27 14:42:17.548473+00:00 |
| 99997 | bgf8nl | 6016.44 | q0ltxc | 0 | 2021-05-27 14:41:59.580318+00:00 |
| 99998 | yzziue | 7851.83 | tgx086 | 0 | 2021-05-27 14:41:47.836046+00:00 |
| 99999 | 7859ge | 7078.83 | waqmx5 | 0 | 2021-05-27 14:41:35.411161+00:00 |

100000 rows × 5 columns
|   | user_id | credit_score | account_age_days | user_has_2fa_installed | feature_timestamp |
|---|---|---|---|---|---|
| 0 | 41sozr | 512 | 700 | 0 | 2021-06-03 12:11:13.032174 |
| 1 | h8nr8u | 512 | 157 | 0 | 2021-06-03 12:11:13.032174 |
| 2 | shid6v | 512 | 509 | 0 | 2021-06-03 12:11:13.032174 |
| 3 | rbcoqw | 512 | 742 | 0 | 2021-06-03 12:11:13.032174 |
| 4 | hew545 | 512 | 327 | 0 | 2021-06-03 12:11:13.032174 |
| ... | ... | ... | ... | ... | ... |
| 9939 | nsgtkp | 767 | 891 | 1 | 2021-06-03 12:11:13.032174 |
| 9940 | 4dlidj | 767 | 855 | 1 | 2021-06-03 12:11:13.032174 |
| 9941 | 1z87hk | 767 | 271 | 1 | 2021-06-03 12:11:13.032174 |
| 9942 | ffqerm | 767 | 829 | 1 | 2021-06-03 12:11:13.032174 |
| 9943 | elz674 | 767 | 783 | 1 | 2021-06-03 12:11:13.032174 |

9944 rows × 5 columns
|   | src_account | amount | dest_account | is_fraud | timestamp |
|---|---|---|---|---|---|
| 0 | 0001mg | 3012.44 | ydnwlr | 0 | 2021-06-16 12:52:25.074517+00:00 |
| 1 | 0001mg | 4431.82 | oijv7z | 0 | 2021-06-13 11:47:21.535700+00:00 |
| 2 | 0001mg | 3037.60 | a6mrvu | 0 | 2021-06-11 20:51:21.873945+00:00 |
| 3 | 0001mg | 6322.63 | bmihen | 0 | 2021-06-11 13:46:35.364700+00:00 |
| 4 | 0001mg | 9981.82 | tk53lu | 0 | 2021-06-08 23:31:54.140277+00:00 |
| ... | ... | ... | ... | ... | ... |
| 99995 | zyvtf8 | 3609.00 | u5s54p | 1 | 2021-06-12 21:09:53.775954+00:00 |
| 99996 | zz0sgh | 6060.71 | c97pdy | 1 | 2021-06-04 11:50:31.591834+00:00 |
| 99997 | zz0sgh | 5543.38 | dt60g4 | 1 | 2021-06-03 21:48:26.560339+00:00 |
| 99998 | zzrx9o | 5031.12 | 9vo8j7 | 1 | 2021-06-14 12:00:42.439961+00:00 |
| 99999 | zzx65l | 9031.58 | p6w6un | 1 | 2021-06-11 13:13:24.071963+00:00 |

100000 rows × 5 columns
|   | user_id | credit_score | account_age_days | user_has_2fa_installed | feature_timestamp |
|---|---|---|---|---|---|
| 0 | 41sozr | 512 | 700 | 0 | 2021-06-09 19:13:46.199693 |
| 1 | h8nr8u | 512 | 157 | 0 | 2021-06-09 19:13:46.199693 |
| 2 | shid6v | 512 | 509 | 0 | 2021-06-09 19:13:46.199693 |
| 3 | rbcoqw | 512 | 742 | 0 | 2021-06-09 19:13:46.199693 |
| 4 | hew545 | 512 | 327 | 0 | 2021-06-09 19:13:46.199693 |
| ... | ... | ... | ... | ... | ... |
| 9939 | nsgtkp | 767 | 891 | 1 | 2021-06-09 19:13:46.199693 |
| 9940 | 4dlidj | 767 | 855 | 1 | 2021-06-09 19:13:46.199693 |
| 9941 | 1z87hk | 767 | 271 | 1 | 2021-06-09 19:13:46.199693 |
| 9942 | ffqerm | 767 | 829 | 1 | 2021-06-09 19:13:46.199693 |
| 9943 | elz674 | 767 | 783 | 1 | 2021-06-09 19:13:46.199693 |

9944 rows × 5 columns
|   | user_id | user_has_fraudulent_transactions_7d | feature_timestamp |
|---|---|---|---|
| 0 | 0001mg | 0.0 | 2021-06-11 12:56:59.739937+00:00 |
| 1 | 00c8mc | 0.0 | 2021-06-11 12:56:59.739937+00:00 |
| 2 | 00gmwi | 0.0 | 2021-06-11 12:56:59.739937+00:00 |
| 3 | 00mbm9 | 0.0 | 2021-06-11 12:56:59.739937+00:00 |
| 4 | 00wjqi | 0.0 | 2021-06-11 12:56:59.739937+00:00 |
| ... | ... | ... | ... |
| 69603 | 54r2jp | 1.0 | 2021-06-17 12:56:59.739937+00:00 |
| 69604 | phvjnv | 1.0 | 2021-06-17 12:56:59.739937+00:00 |
| 69605 | vr9qpk | 1.0 | 2021-06-17 12:56:59.739937+00:00 |
| 69606 | wija9d | 1.0 | 2021-06-17 12:56:59.739937+00:00 |
| 69607 | yvkh8e | 1.0 | 2021-06-17 12:56:59.739937+00:00 |

69608 rows × 3 columns