├── .gitignore ├── README.md ├── ai-platform ├── batch_prediction.sh ├── brand_vocab.csv ├── config.yaml ├── deploy.sh ├── evaluate.sh ├── hptuning_config.yaml ├── local.sh ├── online_prediction.sh ├── package │ ├── __init__.py │ ├── setup.py │ └── trainer │ │ ├── __init__.py │ │ ├── task.py │ │ └── utils.py ├── requirements.txt ├── schema.json ├── test-predictions │ ├── batch_test.json │ └── online_test.json └── train.sh ├── exploration └── visualisation-queries │ ├── custom_query_1.sql │ ├── custom_query_2.sql │ ├── custom_query_3.sql │ └── custom_query_4.sql └── processing-pipeline ├── README.md ├── a010_impute_missing_values.sql ├── a020_remove_zero_quantity_rows.sql ├── a030_create_returned_flag.sql ├── a040_absolute_values.sql ├── a050_create_transaction_id.sql ├── a060_create_product_id.sql ├── a070_create_product_price.sql ├── b010_create_baseline.sql ├── b020_join_baseline.sql ├── b030_baseline_metrics.sql ├── bq-processing.sh ├── f010_top_brands.sql ├── f020_label_creation.sql ├── f030_promo_sensitive_feature.sql ├── f040_brand_seasonality.sql ├── f050_create_brand_features.sql ├── f060_create_overall_features.sql ├── f070_compute_aov.sql ├── f080_impute_nulls.sql ├── f090_type_cast.sql ├── s010_downsampling.sql ├── s020_downsampled_features.sql ├── supporting-queries.sql ├── t010_train_test_field.sql ├── t020_train_data.sql ├── t030_test_data.sql ├── t040_test_dev_split.sql ├── t050_dev_data.sql ├── t060_test_data_final.sql └── x010_cross_join.sql /.gitignore: -------------------------------------------------------------------------------- 1 | __pycache__ 2 | .* 3 | !.gitignore 4 | data 5 | model 6 | *.egg-info 7 | .idea 8 | .DS_Store 9 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | ## Brand Propensity Model for Major Retailer 2 | 3 | Author: Laxmi Prajapat (laxmi.prajapat@datatonic.com) 4 | 5 | Python version: 3.5 6 | 7 | 
TensorFlow version: 1.13.1 8 | 9 | Data: [Acquire Valued Shoppers](https://www.kaggle.com/c/acquire-valued-shoppers-challenge) 10 | 11 | ### Folders required in package: 12 | 13 | --- 14 | 15 | - `exploration/` - SQL queries for exploratory analysis 16 | - `processing-pipeline/` - SQL queries for BigQuery data preprocessing pipeline 17 | - `ai-platform/` - Python package to train estimators locally and/or using AI Platform with hyperparameter tuning 18 | 19 | 20 | ### Files and execution: 21 | 22 | --- 23 | 24 | `exploration/`: 25 | 26 | - `visualisation-queries/`: 27 | - `custom_query_1.sql` 28 | - `custom_query_2.sql` 29 | - `custom_query_3.sql` 30 | - `custom_query_4.sql` 31 | 32 | `processing-pipeline/`: 33 | 34 | - `README.md` 35 | - `bq-processing.sh` 36 | 37 | 1) Download the "transactions" and "history" data from [here](https://www.kaggle.com/c/acquire-valued-shoppers-challenge) 38 | 2) Create a Cloud Storage bucket and upload the CSV files 39 | 3) Create a dataset in BigQuery 40 | 4) Load the "transactions" and "history" tables from the Cloud Storage bucket into this BigQuery dataset 41 | 5) Execute the bash script with the GCP project and BigQuery dataset as command-line arguments: 42 | 43 | ``` 44 | /usr/bin/time bash bq-processing.sh <PROJECT> <DATASET> 45 | ``` 46 | 47 | - `supporting-queries.sql` 48 | - `a010_impute_missing_values.sql` 49 | - `a020_remove_zero_quantity_rows.sql` 50 | - `a030_create_returned_flag.sql` 51 | - `a040_absolute_values.sql` 52 | - `a050_create_transaction_id.sql` 53 | - `a060_create_product_id.sql` 54 | - `a070_create_product_price.sql` 55 | - `b010_create_baseline.sql` 56 | - `b020_join_baseline.sql` 57 | - `b030_baseline_metrics.sql` 58 | - `f010_top_brands.sql` 59 | - `f020_label_creation.sql` 60 | - `f030_promo_sensitive_feature.sql` 61 | - `f040_brand_seasonality.sql` 62 | - `f050_create_brand_features.sql` 63 | - `f060_create_overall_features.sql` 64 | - `f070_compute_aov.sql` 65 | - `f080_impute_nulls.sql` 66 | - `f090_type_cast.sql` 67 | - 
`s010_downsampling.sql` 68 | - `s020_downsampled_features.sql` 69 | - `t010_train_test_field.sql` 70 | - `t020_train_data.sql` 71 | - `t030_test_data.sql` 72 | - `t040_test_dev_split.sql` 73 | - `t050_dev_data.sql` 74 | - `t060_test_data_final.sql` 75 | - `x010_cross_join.sql` 76 | 77 | 78 | `ai-platform/`: 79 | 80 | - `requirements.txt` - Python dependencies 81 | - `brand_vocab.csv` - brand vocabulary list 82 | - `test-predictions/` 83 | - `batch_test.json` - sample JSON for running batch predictions (15 lines) 84 | - `online_test.json` - sample JSON for running online prediction (1 line) 85 | - `package/` 86 | - `__init__.py` 87 | - `setup.py` - package dependencies 88 | - `trainer/` 89 | - `__init__.py` 90 | - `task.py` - model train / predict / evaluate using the TensorFlow Estimator API 91 | - `utils.py` - helper functions for `task.py` 92 | - `config.yaml` - configuration file for AI Platform training job 93 | - `hptuning_config.yaml` - configuration file for AI Platform training job with hyperparameter tuning (DNN and WD) 94 | - `local.sh` - bash script to run train / predict / evaluate locally in a virtual environment 95 | 96 | ``` 97 | conda create --name propensity_modelling python=3.5 98 | source activate propensity_modelling 99 | pip install -r requirements.txt 100 | ``` 101 | 102 | Ensure the data and any supporting files are downloaded locally (or on a virtual machine) using the `gsutil cp` tool. 
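The download step can be sketched as a dry run. The `gs://<bucket>/data/...` layout below mirrors the paths used in `evaluate.sh`; the bucket name is a placeholder, and each `gsutil` command is `echo`ed for review rather than executed (drop the `echo` to actually copy):

```
# Dry-run sketch of fetching the data and supporting files.
# BUCKET is a placeholder; the data/ layout mirrors the paths in evaluate.sh.
BUCKET=example-bucket

mkdir -p data

# training, validation and test CSV directories
for d in trainData devData testData; do
  echo gsutil -m cp -r "gs://${BUCKET}/data/${d}" ./data/
done

# schema and brand vocabulary expected by task.py
echo gsutil cp "gs://${BUCKET}/schema.json" "gs://${BUCKET}/brand_vocab.csv" .
```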
103 | 104 | Local training: 105 | ``` 106 | bash local.sh <BUCKET> train <MODEL_DIR> <TRAIN_DATA> <DEV_DATA> <TEST_DATA> <SCHEMA> <VOCAB> 107 | ``` 108 | 109 | Local predicting: 110 | ``` 111 | bash local.sh <BUCKET> predict <MODEL_DIR> <TRAIN_DATA> <DEV_DATA> <TEST_DATA> <SCHEMA> <VOCAB> 112 | ``` 113 | 114 | Local evaluating: 115 | ``` 116 | bash local.sh <BUCKET> evaluate <MODEL_DIR> <TRAIN_DATA> <DEV_DATA> <TEST_DATA> <SCHEMA> <VOCAB> 117 | ``` 118 | 119 | - `train.sh` - bash script to run training on AI Platform 120 | 121 | ``` 122 | bash train.sh 123 | ``` 124 | 125 | - `evaluate.sh` - bash script to run evaluation on AI Platform 126 | 127 | ``` 128 | bash evaluate.sh <JOBNAME> <PROJECT> <BUCKET> 129 | ``` 130 | 131 | - `deploy.sh` - bash script to deploy a selected model on AI Platform 132 | 133 | ``` 134 | bash deploy.sh <VERSION_NAME> <BUCKET> <JOBNAME> <MODELID> 135 | ``` 136 | 137 | - `batch_prediction.sh` - bash script to run batch predictions on AI Platform using a deployed model 138 | 139 | ``` 140 | bash batch_prediction.sh <MODEL_NAME> <VERSION_NAME> <BUCKET> 141 | ``` 142 | 143 | - `online_prediction.sh` - bash script to run online predictions on AI Platform using a deployed model 144 | 145 | ``` 146 | bash online_prediction.sh <MODEL_TYPE> <VERSION_NAME> 147 | ``` 148 | 149 | 150 | To run **TensorBoard**: 151 | 152 | ``` 153 | tensorboard --logdir=gs://<BUCKET>/models/<MODEL_TYPE>/<JOBNAME>/model 154 | ``` 155 | 156 | The **signature** (inputs/outputs) of the saved model can be inspected with the following command: 157 | 158 | ``` 159 | saved_model_cli show --dir gs://<BUCKET>/models/<MODEL_TYPE>/<JOBNAME>/serving/<MODEL_ID> --tag serve --signature_def predict 160 | ``` 161 | -------------------------------------------------------------------------------- /ai-platform/batch_prediction.sh: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env bash 2 | # use deployed batch serving model on AI Platform to run batch predictions 3 | 4 | MODEL_NAME=$1 5 | VERSION_NAME=$2 6 | BUCKET=$3 7 | 8 | TIMESTAMP=$(date +"%Y%m%d_%H%M%S") 9 | JOBNAME=${MODEL_NAME}_predictions_${TIMESTAMP} 10 | MAX_WORKER_COUNT=1 11 | 12 | gcloud ai-platform jobs submit prediction ${JOBNAME} \ 13 | --data-format=text \ 14 | --input-paths=gs://${BUCKET}/testPredictions/batch_test.json \ 15 | 
--output-path=gs://${BUCKET}/testPredictions/${MODEL_NAME}/${VERSION_NAME}/${JOBNAME} \ 16 | --region=europe-west1 \ 17 | --model=${MODEL_NAME} \ 18 | --version=${VERSION_NAME} \ 19 | --max-worker-count=${MAX_WORKER_COUNT} -------------------------------------------------------------------------------- /ai-platform/brand_vocab.csv: -------------------------------------------------------------------------------- 1 | brand 2 | 9907 3 | 0 4 | 16347 5 | 88199 6 | 33170 7 | 81691 8 | 15704 9 | 9886 10 | 11473 11 | 7175 12 | 5072 13 | 40679 14 | 2820 15 | 14029 16 | 14456 17 | 20230 18 | 13310 19 | 4568 20 | 23511 21 | 10241 22 | 10786 23 | 13791 24 | 11382 25 | 14760 26 | 4704 27 | 23359 28 | 7953 29 | 5278 30 | 1756 31 | 610 32 | 20514 33 | 3809 34 | 18584 35 | 6564 36 | 19950 37 | 10271 38 | 20864 39 | 18322 40 | 10165 41 | 7173 42 | 7655 43 | 1393 44 | 27474 45 | 19710 46 | 32115 47 | 15889 48 | 7054 49 | 8247 50 | 6732 51 | 10577 52 | 17050 53 | 10522 54 | 12957 55 | 26232 56 | 14582 57 | 13794 58 | 3331 59 | 11176 60 | 15530 61 | 8501 62 | 7848 63 | 11749 64 | 6678 65 | 13967 66 | 18799 67 | 8164 68 | 13243 69 | 6561 70 | 5812 71 | 20588 72 | 20405 73 | 9787 74 | 8612 75 | 5618 76 | 71386 77 | 21065 78 | 23361 79 | 2248 80 | 38922 81 | 1358 82 | 9814 83 | 10837 84 | 5141 85 | 875 86 | 45039 87 | 18893 88 | 26885 89 | 18552 90 | 62065 91 | 5603 92 | 3830 93 | 8318 94 | 8255 95 | 15878 96 | 15335 97 | 26517 98 | 16668 99 | 44209 100 | 16965 101 | 867 102 | 16048 103 | 13023 104 | 8089 105 | 25530 106 | 12465 107 | 20616 108 | 5168 109 | 15113 110 | 8154 111 | 11151 112 | 26189 113 | 27873 114 | 6775 115 | 18371 116 | 13804 117 | 16728 118 | 7531 119 | 14907 120 | 18969 121 | 11786 122 | 9200 123 | 3718 124 | 23664 125 | 12029 126 | 14868 127 | 12021 128 | 6550 129 | 9496 130 | 3342 131 | 8178 132 | 3293 133 | 16410 134 | 4529 135 | 19279 136 | 30272 137 | 9260 138 | 25673 139 | 3453 140 | 12426 141 | 14196 142 | 14112 143 | 14494 144 | 10218 145 | 19154 146 | 3336 
147 | 15232 148 | 19197 149 | 5208 150 | 20361 151 | 17286 152 | 2347 153 | 17521 154 | 10882 155 | 8481 156 | 19713 157 | 27417 158 | 2903 159 | 19783 160 | 13291 161 | 5016 162 | 16489 163 | 12475 164 | 18533 165 | 8511 166 | 17186 167 | 5345 168 | 2559 169 | 6132 170 | 1058 171 | 27736 172 | 2321 173 | 2829 174 | 9189 175 | 7237 176 | 2017 177 | 6283 178 | 18599 179 | 18775 180 | 6777 181 | 41981 182 | 4294 183 | 12908 184 | 6509 185 | 17292 186 | 16650 187 | 15598 188 | 12598 189 | 2510 190 | 2073 191 | 15474 192 | 2437 193 | 14357 194 | 1322 195 | 4139 196 | 85048 197 | 25075 198 | 11339 199 | 6714 200 | 3679 201 | 8499 202 | 5098 203 | 17812 204 | 11224 205 | 19704 206 | 16756 207 | 9798 208 | 30626 209 | 683 210 | 4628 211 | 11221 212 | 18738 213 | 19039 214 | 16044 215 | 7142 216 | 16765 217 | 15035 218 | 12279 219 | 43106 220 | 26155 221 | 1814 222 | 14403 223 | 8280 224 | 7245 225 | 13296 226 | 17040 227 | 7196 228 | 28083 229 | 13343 230 | 29555 231 | 9530 232 | 23209 233 | 19975 234 | 8583 235 | 4371 236 | 888 237 | 11186 238 | 23561 239 | 23332 240 | 4912 241 | 2963 242 | 7249 243 | 4030 244 | 3068 245 | 11732 246 | 10383 247 | 6763 248 | 1698 249 | 11244 250 | 18051 251 | 38150 252 | 5489 253 | 13705 254 | 3695 255 | 17593 256 | 11614 257 | 57417 258 | 885 259 | 15667 260 | 7740 261 | 14813 262 | 19509 263 | 58299 264 | 16291 265 | 13228 266 | 26420 267 | 23360 268 | 26234 269 | 8522 270 | 2142 271 | 21913 272 | 6848 273 | 14005 274 | 16284 275 | 4664 276 | 10160 277 | 12157 278 | 12673 279 | 6796 280 | 16391 281 | 20016 282 | 25887 283 | 7168 284 | 14647 285 | 17480 286 | 6236 287 | 16564 288 | 5174 289 | 23033 290 | 26051 291 | 3396 292 | 9739 293 | 7860 294 | 28496 295 | 85114 296 | 21452 297 | 19971 298 | 2014 299 | 6021 300 | 7504 301 | 16397 302 | 14900 303 | 15181 304 | 14716 305 | 17635 306 | 7907 307 | 8463 308 | 13470 309 | 32094 310 | 15085 311 | 8711 312 | 1366 313 | 648 314 | 16853 315 | 4687 316 | 9698 317 | 10091 318 | 26666 319 | 330 
320 | 8459 321 | 13481 322 | 8245 323 | 56490 324 | 15251 325 | 401 326 | 3635 327 | 514 328 | 10554 329 | 4599 330 | 2675 331 | 20216 332 | 1783 333 | 13505 334 | 4218 335 | 4913 336 | 3397 337 | 61745 338 | 16558 339 | 6086 340 | 4368 341 | 30972 342 | 833 343 | 14286 344 | 10797 345 | 18139 346 | 11704 347 | 21664 348 | 15441 349 | 14717 350 | 25412 351 | 18390 352 | 3174 353 | 4307 354 | 9547 355 | 3001 356 | 3609 357 | 5740 358 | 30096 359 | 1061 360 | 10216 361 | 7034 362 | 11990 363 | 16807 364 | 2199 365 | 7661 366 | 803 367 | 365 368 | 43069 369 | 18947 370 | 6947 371 | 5379 372 | 491 373 | 2542 374 | 17649 375 | 58434 376 | 16023 377 | 6335 378 | 3843 379 | 17653 380 | 6176 381 | 26880 382 | 47346 383 | 11761 384 | 158 385 | 17447 386 | 1436 387 | 17614 388 | 17864 389 | 7025 390 | 16223 391 | 17034 392 | 2626 393 | 3744 394 | 9659 395 | 30855 396 | 8185 397 | 15041 398 | 14268 399 | 20299 400 | 8019 401 | 55864 402 | 12339 403 | 16487 404 | 17823 405 | 28204 406 | 6015 407 | 12554 408 | 12545 409 | 8664 410 | 7368 411 | 6852 412 | 52045 413 | 9878 414 | 3454 415 | 9883 416 | 5782 417 | 9705 418 | 11516 419 | 4043 420 | 26456 421 | 16879 422 | 16704 423 | 2012 424 | 2071 425 | 18213 426 | 18373 427 | 319 428 | 16021 429 | 45489 430 | 6449 431 | 16095 432 | 13610 433 | 8435 434 | 52314 435 | 10274 436 | 9916 437 | 1008 438 | 11910 439 | 250 440 | 17674 441 | 2308 442 | 3640 443 | 6625 444 | 45635 445 | 52506 446 | 10238 447 | 69260 448 | 7853 449 | 16492 450 | 1910 451 | 18492 452 | 7366 453 | 6718 454 | 8144 455 | 12587 456 | 11515 457 | 17248 458 | 14997 459 | 6790 460 | 5981 461 | 20100 462 | 12600 463 | 13056 464 | 13412 465 | 11748 466 | 18715 467 | 7796 468 | 13891 469 | 14597 470 | 17882 471 | 9190 472 | 76668 473 | 1841 474 | 29344 475 | 16922 476 | 182 477 | 11013 478 | 11698 479 | 4354 480 | 27348 481 | 12502 482 | 21825 483 | 713 484 | 15870 485 | 18909 486 | 27521 487 | 10421 488 | 6137 489 | 95301 490 | 12440 491 | 13012 492 | 15624 493 | 
13915 494 | 25202 495 | 20614 496 | 4197 497 | 3988 498 | 8604 499 | 6734 500 | 3073 501 | 16139 502 | 3198 503 | 16794 504 | 35506 505 | 15802 506 | 14200 507 | 9181 508 | 5557 509 | 9596 510 | 16050 511 | 98643 512 | 19737 513 | 9543 514 | 1141 515 | 11347 516 | 20014 517 | 40317 518 | 3457 519 | 25373 520 | 72482 521 | 78850 522 | 5537 523 | 10768 524 | 23362 525 | 2010 526 | 8605 527 | 70939 528 | 16196 529 | 16730 530 | 17117 531 | 2631 532 | 27327 533 | 13386 534 | 64256 535 | 20231 536 | 11482 537 | 14142 538 | 13474 539 | 12168 540 | 22433 541 | 9259 542 | 25919 543 | 3860 544 | 18946 545 | 385 546 | 13689 547 | 19231 548 | 18366 549 | 915 550 | 23363 551 | 17665 552 | 13264 553 | 60 554 | 20449 555 | 7487 556 | 10488 557 | 11116 558 | 3108 559 | 4696 560 | 10074 561 | 804 562 | 20031 563 | 85 564 | 6769 565 | 19934 566 | 12423 567 | 25033 568 | 11375 569 | 11091 570 | 2246 571 | 2892 572 | 16706 573 | 17067 574 | 2689 575 | 11575 576 | 5656 577 | 12973 578 | 9977 579 | 25424 580 | 2272 581 | 3223 582 | 25162 583 | 17876 584 | 16257 585 | 8760 586 | 13414 587 | 25747 588 | 15998 589 | 15983 590 | 6667 591 | 12546 592 | 755 593 | 12851 594 | 13776 595 | 10177 596 | 5613 597 | 17990 598 | 12028 599 | 1707 600 | 15219 601 | 1387 602 | 11930 603 | 12297 604 | 29501 605 | 15487 606 | 17287 607 | 758 608 | 6455 609 | 7755 610 | 11211 611 | 23210 612 | 16680 613 | 3957 614 | 11154 615 | 16738 616 | 27731 617 | 15327 618 | 9286 619 | 26028 620 | 4962 621 | 95440 622 | 6765 623 | 20575 624 | 56360 625 | 6866 626 | 6926 627 | 7966 628 | 5357 629 | 17710 630 | 19411 631 | 9903 632 | 11085 633 | 15064 634 | 17473 635 | 18928 636 | 7981 637 | 3447 638 | 2649 639 | 1121 640 | 15583 641 | 9243 642 | 17037 643 | 12170 644 | 26033 645 | 16355 646 | 5244 647 | 14191 648 | 8919 649 | 25246 650 | 18529 651 | 7627 652 | 5845 653 | 27166 654 | 9891 655 | 6669 656 | 16565 657 | 4239 658 | 2732 659 | 8688 660 | 9709 661 | 9966 662 | 13044 663 | 3789 664 | 78492 665 | 8682 666 | 
12547 667 | 4098 668 | 17090 669 | 17898 670 | 19214 671 | 1500 672 | 86101 673 | 4346 674 | 24525 675 | 8047 676 | 1569 677 | 11352 678 | 9275 679 | 3912 680 | 29609 681 | 12661 682 | 53977 683 | 86364 684 | 14418 685 | 5558 686 | 17027 687 | 15284 688 | 18520 689 | 12466 690 | 9039 691 | 11616 692 | 7459 693 | 19212 694 | 15197 695 | 9626 696 | 287 697 | 10333 698 | 7400 699 | 22339 700 | 18877 701 | 26108 702 | 2988 703 | 16665 704 | 7452 705 | 14149 706 | 3931 707 | 1011 708 | 2964 709 | 15194 710 | 12106 711 | 1866 712 | 18158 713 | 4806 714 | 11376 715 | 13656 716 | 8993 717 | 85727 718 | 15414 719 | 23016 720 | 6743 721 | 9303 722 | 5810 723 | 11005 724 | 85927 725 | 15114 726 | 10562 727 | 10653 728 | 12473 729 | 62488 730 | 14162 731 | 4089 732 | 2500 733 | 101199 734 | 17627 735 | 1414 736 | 14245 737 | 52 738 | 29945 739 | 11528 740 | 452 741 | 12885 742 | 2082 743 | 4584 744 | 6716 745 | 13400 746 | 12172 747 | 1292 748 | 12679 749 | 19735 750 | 61474 751 | 12209 752 | 9818 753 | 469 754 | 15828 755 | 5602 756 | 384 757 | 12358 758 | 27139 759 | 4086 760 | 19792 761 | 85180 762 | 26467 763 | 27294 764 | 27108 765 | 20113 766 | 60780 767 | 63690 768 | 14428 769 | 16521 770 | 16696 771 | 33128 772 | 17274 773 | 17938 774 | 23587 775 | 11145 776 | 17797 777 | 27492 778 | 9033 779 | 26252 780 | 26129 781 | 19868 782 | 99506 783 | 16249 784 | 17576 785 | 16611 786 | 15377 787 | 72948 788 | 6305 789 | 4255 790 | 15274 791 | 19527 792 | 12197 793 | 1257 794 | 7161 795 | 17346 796 | 6854 797 | 19921 798 | 1142 799 | 11200 800 | 16559 801 | 13877 802 | 15288 803 | 5494 804 | 19764 805 | 8097 806 | 4369 807 | 423 808 | 17587 809 | 11079 810 | 763 811 | 6911 812 | 12071 813 | 22310 814 | 19243 815 | 17323 816 | 22148 817 | 2595 818 | 506 819 | 18781 820 | 20153 821 | 12137 822 | 23292 823 | 36320 824 | 8421 825 | 47096 826 | 4186 827 | 19976 828 | 11668 829 | 27268 830 | 18730 831 | 1235 832 | 9031 833 | 15449 834 | 15743 835 | 11011 836 | 4720 837 | 3008 838 | 
375 839 | 1079 840 | 16798 841 | 11556 842 | 7308 843 | 12186 844 | 4170 845 | 225 846 | 10654 847 | 22943 848 | 6560 849 | 12670 850 | 18961 851 | 12696 852 | 17539 853 | 14199 854 | 17950 855 | 17223 856 | 21174 857 | 17690 858 | 3387 859 | 3167 860 | 19572 861 | 3271 862 | 5534 863 | 12592 864 | 23643 865 | 2225 866 | 32472 867 | 7467 868 | 14869 869 | 17203 870 | 5338 871 | 5593 872 | 16836 873 | 13130 874 | 1537 875 | 12163 876 | 54992 877 | 63479 878 | 8131 879 | 52004 880 | 22328 881 | 20526 882 | 20375 883 | 7217 884 | 3947 885 | 16934 886 | 14391 887 | 16106 888 | 51701 889 | 18845 890 | 14157 891 | 15266 892 | 13564 893 | 837 894 | 30370 895 | 15051 896 | 17344 897 | 15112 898 | 16363 899 | 14982 900 | 43184 901 | 7770 902 | 7468 903 | 14062 904 | 36616 905 | 17616 906 | 6167 907 | 1376 908 | 17930 909 | 8530 910 | 8990 911 | 4253 912 | 75834 913 | 4615 914 | 14048 915 | 10815 916 | 16452 917 | 18550 918 | 57184 919 | 20202 920 | 17320 921 | 11218 922 | 74550 923 | 8159 924 | 10365 925 | 10187 926 | 7354 927 | 12706 928 | 10440 929 | 15787 930 | 18536 931 | 92654 932 | 1230 933 | 10039 934 | 13717 935 | 3012 936 | 6164 937 | 1488 938 | 3738 939 | 26870 940 | 9680 941 | 3829 942 | 25891 943 | 7686 944 | 1677 945 | 2636 946 | 18063 947 | 2593 948 | 27393 949 | 3787 950 | 9241 951 | 25043 952 | 3201 953 | 17549 954 | 13856 955 | 22356 956 | 5337 957 | 15242 958 | 4352 959 | 9316 960 | 66738 961 | 4715 962 | 4181 963 | 2083 964 | 14265 965 | 2359 966 | 18699 967 | 6170 968 | 4009 969 | 56499 970 | 11130 971 | 4979 972 | 52605 973 | 1574 974 | 2771 975 | 10635 976 | 17311 977 | 33381 978 | 2079 979 | 23503 980 | 9542 981 | 24978 982 | 57369 983 | 12733 984 | 19799 985 | 6994 986 | 52135 987 | 56124 988 | 1993 989 | 639 990 | 11289 991 | 107416 992 | 68047 993 | 19793 994 | 10267 995 | 18981 996 | 12131 997 | 193 998 | 10265 999 | 11414 1000 | 19234 1001 | 5453 1002 | -------------------------------------------------------------------------------- 
/ai-platform/config.yaml: -------------------------------------------------------------------------------- 1 | # configuration file for CUSTOM scale tier AI Platform jobs 2 | trainingInput: 3 | scaleTier: CUSTOM 4 | masterType: complex_model_s # n1-highcpu-8 -------------------------------------------------------------------------------- /ai-platform/deploy.sh: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env bash 2 | # deploy model on AI Platform for serving 3 | # model location in Cloud Storage 4 | 5 | MODEL_TYPE=WD 6 | VERSION_NAME=$1 7 | BUCKET=$2 8 | JOBNAME=$3 9 | MODELID=$4 10 | 11 | MODEL_PATH=gs://${BUCKET}/models/${MODEL_TYPE}/${JOBNAME}/serving/1/${MODELID} 12 | 13 | gcloud ai-platform models create ${MODEL_TYPE} \ 14 | --regions=europe-west1 15 | 16 | gcloud ai-platform versions create ${VERSION_NAME} \ 17 | --model=${MODEL_TYPE} \ 18 | --origin=${MODEL_PATH} \ 19 | --python-version=3.5 \ 20 | --runtime-version=1.13 \ 21 | --framework=tensorflow -------------------------------------------------------------------------------- /ai-platform/evaluate.sh: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env bash 2 | # model evaluation on AI Platform 3 | # --scale-tier=CUSTOM and --config=config.yaml for custom machine types 4 | # use same training parameters for evaluation 5 | # --job-dir is the path to model checkpoints 6 | 7 | JOBNAME=$1 8 | PROJECT=$2 9 | BUCKET=$3 10 | MODEL_TYPE=WD 11 | 12 | JOBID=${JOBNAME}_eval_$(date +"%Y%m%d_%H%M%S") 13 | 14 | gcloud ai-platform jobs submit training ${JOBID} \ 15 | --scale-tier=BASIC \ 16 | --job-dir=gs://${BUCKET}/models/${MODEL_TYPE}/${JOBNAME}/model/1 \ 17 | --region=europe-west1 \ 18 | --package-path=package/trainer \ 19 | --module-name=trainer.task \ 20 | --runtime-version=1.13 \ 21 | --python-version=3.5 \ 22 | -- \ 23 | --mode=evaluate \ 24 | --project=${PROJECT} \ 25 | --bucket=${BUCKET} \ 26 | 
--test_data=gs://${BUCKET}/data/testData/*.csv \ 27 | --schema_path=gs://${BUCKET}/schema.json \ 28 | --brand_vocab=gs://${BUCKET}/brand_vocab.csv \ 29 | --cloud \ 30 | --model_type=${MODEL_TYPE} \ 31 | --train_epochs=5 \ 32 | --batch_size=256 \ 33 | --learning_rate=0.0004843447060766648 \ 34 | --optimizer=ProximalAdagrad \ 35 | --hidden_units='128,64,32,16' \ 36 | --dropout=0.488892936706543 \ 37 | --feature_selec='[1,2,3,4,5,6]' -------------------------------------------------------------------------------- /ai-platform/hptuning_config.yaml: -------------------------------------------------------------------------------- 1 | # configuration file for AI Platform hyperparameter tuning jobs (DNN & WD models) 2 | trainingInput: 3 | hyperparameters: 4 | goal: MAXIMIZE 5 | hyperparameterMetricTag: f1 6 | maxTrials: 10 7 | maxParallelTrials: 2 8 | enableTrialEarlyStopping: True 9 | params: 10 | - parameterName: batch_size 11 | type: DISCRETE 12 | discreteValues: 13 | - 128 14 | - 256 15 | - 512 16 | - 1024 17 | - parameterName: learning_rate 18 | type: DOUBLE 19 | minValue: 0.0001 20 | maxValue: 0.1 21 | scaleType: UNIT_LOG_SCALE 22 | - parameterName: dropout 23 | type: DOUBLE 24 | minValue: 0.1 25 | maxValue: 0.5 26 | scaleType: UNIT_LINEAR_SCALE 27 | - parameterName: optimizer 28 | type: CATEGORICAL 29 | categoricalValues: 30 | - Adagrad 31 | - ProximalAdagrad 32 | - Adam 33 | - parameterName: hidden_units 34 | type: CATEGORICAL 35 | categoricalValues: 36 | - '64,32' 37 | - '128,64' 38 | - '128,64,32' 39 | - '128,64,32,16' 40 | - parameterName: feature_selec 41 | type: CATEGORICAL 42 | categoricalValues: 43 | - '[1]' 44 | - '[1, 2]' 45 | - '[1, 2, 3]' 46 | - '[1, 2, 3, 4]' 47 | - '[1, 2, 3, 4, 5]' 48 | - '[1, 2, 3, 4, 5, 6]' 49 | - '[1, 2, 3, 4, 5, 6, 7]' -------------------------------------------------------------------------------- /ai-platform/local.sh: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env bash 2 | # 
run training / evaluation / predictions locally 3 | # evaluate / predict modes: job-dir should be the path to where the model checkpoints are stored 4 | 5 | BUCKET=$1 6 | MODE=$2 # train / evaluate / predict 7 | MODEL_DIR=$3 8 | TRAIN_DATA=$4 9 | DEV_DATA=$5 10 | TEST_DATA=$6 11 | SCHEMA=$7 12 | VOCAB=$8 13 | MODEL_TYPE=WD 14 | 15 | gcloud ai-platform local train \ 16 | --package-path=package/trainer \ 17 | --module-name=trainer.task \ 18 | --job-dir=${MODEL_DIR} \ 19 | -- \ 20 | --bucket=${BUCKET} \ 21 | --mode=${MODE} \ 22 | --train_data=${TRAIN_DATA} \ 23 | --dev_data=${DEV_DATA} \ 24 | --test_data=${TEST_DATA} \ 25 | --model_dir=${MODEL_DIR} \ 26 | --schema_path=${SCHEMA} \ 27 | --brand_vocab=${VOCAB} \ 28 | --early_stopping \ 29 | --model_type=${MODEL_TYPE} \ 30 | --train_epochs=1 \ 31 | --batch_size=256 \ 32 | --learning_rate=0.0004843447060766648 \ 33 | --optimizer=ProximalAdagrad \ 34 | --hidden_units='128,64,32,16' \ 35 | --dropout=0.488892936706543 \ 36 | --feature_selec='[1,2,3,4,5,6]' -------------------------------------------------------------------------------- /ai-platform/online_prediction.sh: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env bash 2 | # online prediction using gcloud tool 3 | # example: bash online_prediction.sh WD v1 4 | 5 | MODEL_TYPE=$1 6 | VERSION_NAME=$2 7 | 8 | gcloud ai-platform predict \ 9 | --model=${MODEL_TYPE} \ 10 | --version=${VERSION_NAME} \ 11 | --json-instances=test-predictions/online_test.json -------------------------------------------------------------------------------- /ai-platform/package/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/teamdatatonic/propensity-modelling-demo/6320c8866fe3ce8968cdca871752235899d37ef4/ai-platform/package/__init__.py -------------------------------------------------------------------------------- /ai-platform/package/setup.py: 
-------------------------------------------------------------------------------- 1 | from setuptools import find_packages 2 | from setuptools import setup 3 | 4 | # package meta-data 5 | NAME = 'ai-platform' 6 | DESCRIPTION = 'Propensity modelling in TensorFlow for a large retailer.' 7 | REQUIRES_PYTHON = '>=3.5.0' 8 | VERSION = '0.1'  # setuptools expects the version as a string 9 | 10 | # dependencies 11 | REQUIRED = [ 12 | 'argparse>=1.4.0', 'tensorflow>=1.13.1', 13 | 'protobuf>=3.6.1', 'gcsfs>=0.2.0' 14 | ] 15 | 16 | setup( 17 | name=NAME, 18 | version=VERSION, 19 | description=DESCRIPTION, 20 | python_requires=REQUIRES_PYTHON, 21 | packages=find_packages(), 22 | install_requires=REQUIRED, 23 | include_package_data=True) 24 | -------------------------------------------------------------------------------- /ai-platform/package/trainer/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/teamdatatonic/propensity-modelling-demo/6320c8866fe3ce8968cdca871752235899d37ef4/ai-platform/package/trainer/__init__.py -------------------------------------------------------------------------------- /ai-platform/package/trainer/task.py: -------------------------------------------------------------------------------- 1 | """ 2 | Usage: 3 | task.py [options] 4 | 5 | Options: 6 | --mode= train/evaluate/predict [default: train] 7 | --project= name of the GCP project 8 | --bucket= GCS bucket 9 | 10 | --schema_path= path to schema json in GCS [default: None] 11 | --brand_vocab= path to brand vocabulary file [default: 'brand_vocab.csv'] 12 | --train_data= path to train data 13 | --dev_data= path to dev (validation) data 14 | --test_data= path to test data (for evaluation, or prediction) 15 | 16 | --model_type= model to train [default: WD] 17 | --feature_selec= features to exclude (from 1 to 7) [default: None] 18 | --train_epochs= number of epochs to train for [default: 1] 19 | --batch_size= batch size while training [default: 1024] 20 | --learning_rate=
learning rate while training [default: 0.01] 21 | --optimizer= GD optimizer [default: Adam] 22 | --dropout= dropout fraction [default: 0.2] 23 | --hidden_units= hidden units of DNN [default: '128,64'] 24 | 25 | --n_trees= number of trees [default: 100] 26 | --max_depth= maximum depth of BT model [default: 6] 27 | 28 | --early_stopping stop training early if no decrease in loss over a specified number of steps 29 | --cloud running jobs on AI Platform 30 | --job-dir= working directory for models and checkpoints [default: 'model'] 31 | 32 | """ 33 | import argparse 34 | import ast 35 | import datetime 36 | import json 37 | import os 38 | import subprocess 39 | import sys 40 | import time 41 | 42 | import tensorflow as tf 43 | 44 | # local or AI Platform training 45 | try: 46 | from utils import ( 47 | export_train_results, export_eval_results, get_schema, make_csv_cols) 48 | except ImportError: 49 | from trainer.utils import ( 50 | export_train_results, export_eval_results, get_schema, make_csv_cols) 51 | 52 | RANDOM_SEED = 42 53 | 54 | 55 | def get_args(): 56 | """ 57 | Parse command-line arguments 58 | """ 59 | parser = argparse.ArgumentParser() 60 | 61 | parser.add_argument('--mode', type=str, default='train', help='{train, predict, evaluate}') 62 | parser.add_argument('--project', type=str, default='example-project') 63 | parser.add_argument('--bucket', type=str, default='example-bucket') 64 | parser.add_argument('--schema_path', type=str, default='schema.json') 65 | parser.add_argument('--brand_vocab', type=str, default='brand_vocab.csv') 66 | 67 | parser.add_argument('--train_data', type=str, default='trainData/*.csv') 68 | parser.add_argument('--dev_data', type=str, default='devData/*.csv') 69 | parser.add_argument('--test_data', type=str, default='testData/*.csv') 70 | 71 | parser.add_argument('--model_type', type=str, default='WD', help='{DNN, WD, BT}') 72 | parser.add_argument('--feature_selec', type=str, default=None) 73 | parser.add_argument('--train_epochs', type=int, 
default=1) 74 | parser.add_argument('--batch_size', type=int, default=1024) 75 | parser.add_argument('--learning_rate', type=float, default=0.01) 76 | parser.add_argument('--optimizer', type=str, default='Adam', help='{ProximalAdagrad, Adagrad, Adam}') 77 | parser.add_argument('--dropout', type=float, default=0.2) 78 | parser.add_argument('--hidden_units', type=str, default='128,64') 79 | 80 | parser.add_argument('--n_trees', type=int, default=100) 81 | parser.add_argument('--max_depth', type=int, default=6) 82 | 83 | parser.add_argument('--cloud', dest='cloud', action='store_true') 84 | parser.add_argument('--early_stopping', dest='early_stopping', action='store_true') 85 | parser.add_argument('--job-dir', type=str, default='model') 86 | 87 | return parser.parse_known_args() 88 | 89 | 90 | COLUMNS_TYPE_DICT = { 91 | 'STRING': tf.string, 92 | 'INTEGER': tf.int64, 93 | 'FLOAT': tf.float32, 94 | 'NUMERIC': tf.float32, 95 | 'BOOLEAN': tf.bool, 96 | 'TIMESTAMP': None, 97 | 'RECORD': None 98 | } 99 | 100 | FLAGS, unparsed = get_args() 101 | SCHEMA = get_schema(FLAGS) 102 | LABEL = 'label' 103 | CSV_COLUMNS, CSV_COLUMN_DEFAULTS = make_csv_cols(SCHEMA) 104 | 105 | 106 | def input_fn(path_dir, epochs, batch_size=1024, shuffle=True, skip_header_lines=1): 107 | """ 108 | Generation of features and labels for Estimator 109 | :param path_dir: path to directory containing data 110 | :param epochs: number of times to repeat 111 | :param batch_size: stacks n consecutive elements of dataset into single element 112 | :param skip_header_lines: lines to skip if header present 113 | :return: features, labels 114 | """ 115 | 116 | def parse_csv(records): 117 | """ 118 | :param records: A Tensor of type string - each string is a record/row in the CSV 119 | :return: features, labels 120 | """ 121 | columns = tf.decode_csv( 122 | records=records, record_defaults=CSV_COLUMN_DEFAULTS) 123 | features = dict(zip(CSV_COLUMNS, columns)) 124 | 125 | # forwarding features for customer ID and 
brand 126 | features['customer_identity'] = tf.identity(features['customer_id']) 127 | features['brand_identity'] = tf.identity(features['brand']) 128 | 129 | try: 130 | labels = features.pop(LABEL) 131 | return features, labels 132 | except KeyError: 133 | return features, [] 134 | 135 | file_list = tf.gfile.Glob(path_dir) 136 | dataset = tf.data.Dataset.from_tensor_slices(file_list) 137 | # shuffle file list 138 | if shuffle: 139 | dataset = dataset.shuffle(50, seed=RANDOM_SEED) 140 | 141 | # read lines of files as row strings, then shuffle and batch 142 | f = lambda filepath: tf.data.TextLineDataset(filepath).skip(skip_header_lines) 143 | dataset = dataset.interleave(f, cycle_length=8, block_length=8) 144 | 145 | if shuffle: 146 | dataset = dataset.shuffle(buffer_size=100000, seed=RANDOM_SEED) 147 | 148 | dataset = dataset.batch(batch_size) \ 149 | .map(parse_csv, num_parallel_calls=8) \ 150 | .repeat(epochs) 151 | 152 | iterator = dataset.make_one_shot_iterator() 153 | features, labels = iterator.get_next() 154 | 155 | return features, labels 156 | 157 | 158 | def serving_input_receiver_fn(): 159 | """ 160 | Build the serving inputs during online prediction 161 | :return: ServingInputReceiver 162 | """ 163 | raw_features = dict() 164 | INPUT = [field for field in SCHEMA if field['name'] not in [LABEL]] 165 | 166 | for field in INPUT: 167 | dtype = COLUMNS_TYPE_DICT[field['type']] 168 | raw_features[field['name']] = tf.placeholder( 169 | shape=[None], dtype=dtype) 170 | 171 | features = raw_features.copy() 172 | features['customer_identity'] = tf.identity(features['customer_id']) 173 | features['brand_identity'] = tf.identity(features['brand']) 174 | 175 | return tf.estimator.export.ServingInputReceiver(features, 176 | raw_features) 177 | 178 | 179 | def metrics(labels, predictions): 180 | """ 181 | Define evaluation metrics 182 | :return: dict of metrics 183 | """ 184 | return { 185 | 'accuracy': tf.metrics.accuracy(labels, predictions['class_ids']), 186 | 
'precision': tf.metrics.precision(labels, predictions['class_ids']), 187 | 'recall': tf.metrics.recall(labels, predictions['class_ids']), 188 | 'f1': tf.contrib.metrics.f1_score(labels, predictions['class_ids']), 189 | 'auc': tf.metrics.auc(labels, predictions['logistic']) 190 | } 191 | 192 | 193 | def build_feature_columns(): 194 | """ 195 | Build feature columns as input to the model 196 | :return: feature column tensors 197 | """ 198 | 199 | # most of the columns are numeric columns 200 | exclude = ['customer_id', 'brand', 'promo_sensitive', 'label'] 201 | 202 | if FLAGS.feature_selec: 203 | feature_dict = { 204 | 1: 'returned', 205 | 2: 'chains', 206 | 3: 'max_sale_quantity', 207 | 4: 'overall', 208 | 5: '12m', 209 | 6: '6m', 210 | 7: '3m' 211 | } 212 | terms = [feature_dict[key] for key in ast.literal_eval(FLAGS.feature_selec)] 213 | feature_list = [col for col in CSV_COLUMNS if any(word in col for word in terms)] 214 | exclude += feature_list 215 | 216 | numeric_column_names = [col for col in CSV_COLUMNS if col not in exclude] 217 | numeric_columns = [tf.feature_column.numeric_column(col) for col in numeric_column_names] 218 | 219 | # promo sensitive 220 | promo_sensitive = tf.feature_column.categorical_column_with_identity( 221 | key='promo_sensitive', num_buckets=2) 222 | 223 | # customer id and brand hash buckets 224 | customer_id = tf.feature_column.categorical_column_with_hash_bucket( 225 | key='customer_id', hash_bucket_size=100000) 226 | brand = tf.feature_column.categorical_column_with_vocabulary_file( 227 | key='brand', vocabulary_file=FLAGS.brand_vocab) 228 | 229 | # bucketizing columns 230 | seasonality_names = [col for col in numeric_columns if 'seasonality' in col.key] 231 | brand_seasonality = [tf.feature_column.bucketized_column( 232 | col, boundaries=[3, 6, 9, 12]) for col in seasonality_names] 233 | 234 | aov_column_names = [col for col in numeric_columns if 'aov' in col.key] 235 | aov_columns = [tf.feature_column.bucketized_column( 236 | 
col, boundaries=[0, 3, 6, 9, 12, 15, 30, 50, 100]) for col in aov_column_names] 237 | 238 | days_1m_names = [col for col in numeric_columns if 'days_shopped_1m' in col.key] 239 | days_3m_names = [col for col in numeric_columns if 'days_shopped_3m' in col.key] 240 | days_6m_names = [col for col in numeric_columns if 'days_shopped_6m' in col.key] 241 | days_12m_names = [col for col in numeric_columns if 'days_shopped_12m' in col.key] 242 | 243 | days_1m = [tf.feature_column.bucketized_column( 244 | col, boundaries=[0, 2, 5, 10, 20, 30]) for col in days_1m_names] 245 | days_3m = [tf.feature_column.bucketized_column( 246 | col, boundaries=[0, 30, 60, 90]) for col in days_3m_names] 247 | days_6m = [tf.feature_column.bucketized_column( 248 | col, boundaries=[0, 60, 120, 180]) for col in days_6m_names] 249 | days_12m = [tf.feature_column.bucketized_column( 250 | col, boundaries=[0, 90, 180, 270, 360]) for col in days_12m_names] 251 | 252 | quantity_column_names = [col for col in numeric_columns if any( 253 | word in col.key for word in ['quantity', 'distinct_brands'])] 254 | quantity_columns = [tf.feature_column.bucketized_column( 255 | col, boundaries=[0, 5, 10, 20, 50, 100, 500, 1000]) for col in quantity_column_names] 256 | 257 | product_column_names = [col for col in numeric_columns if 'products' in col.key] 258 | product_columns = [tf.feature_column.bucketized_column( 259 | col, boundaries=[0, 5, 10, 20, 50, 100]) for col in product_column_names] 260 | 261 | cat_column_names = [col for col in numeric_columns if 'category' in col.key] 262 | cat_columns = [tf.feature_column.bucketized_column( 263 | col, boundaries=[0, 2, 5, 10, 20, 30, 50, 100]) for col in cat_column_names] 264 | 265 | customer_embeddings = tf.feature_column.embedding_column(customer_id, dimension=18) 266 | brand_embeddings = tf.feature_column.embedding_column(brand, dimension=6) 267 | 268 | deep_columns = numeric_columns + [customer_embeddings, brand_embeddings] 269 | tree_columns = ( 270 | 
brand_seasonality + aov_columns + days_1m + days_3m + days_6m + 271 | days_12m + quantity_columns + product_columns + cat_columns) 272 | wide_columns = [promo_sensitive] + tree_columns 273 | 274 | return wide_columns, deep_columns, tree_columns 275 | 276 | 277 | def initialize_optimizer(): 278 | """ 279 | Define gradient descent optimizer 280 | :return: optimizer 281 | """ 282 | optimizers = { 283 | 'Adagrad': tf.train.AdagradOptimizer(FLAGS.learning_rate), 284 | 'ProximalAdagrad': tf.train.ProximalAdagradOptimizer(FLAGS.learning_rate), 285 | 'Adam': tf.train.AdamOptimizer(FLAGS.learning_rate) 286 | } 287 | 288 | if FLAGS.optimizer in optimizers: 289 | return optimizers[FLAGS.optimizer] 290 | 291 | raise ValueError('Optimizer {} not recognised'.format(FLAGS.optimizer)) 292 | 293 | 294 | def initialize_estimator(model_checkpoints): 295 | """ 296 | Define estimator 297 | :return: estimator 298 | """ 299 | optimizer = initialize_optimizer() 300 | run_config = tf.estimator.RunConfig( 301 | tf_random_seed=RANDOM_SEED, 302 | save_checkpoints_steps=100, 303 | save_summary_steps=100) 304 | 305 | wide_columns, deep_columns, tree_columns = build_feature_columns() 306 | 307 | if FLAGS.model_type == 'DNN': 308 | return tf.estimator.DNNClassifier( 309 | n_classes=2, 310 | feature_columns=deep_columns, 311 | activation_fn=tf.nn.relu, 312 | optimizer=optimizer, 313 | hidden_units=[int(units) for units in FLAGS.hidden_units.split(',')], 314 | dropout=FLAGS.dropout, 315 | batch_norm=True, 316 | model_dir=model_checkpoints, 317 | config=run_config) 318 | elif FLAGS.model_type == 'WD': 319 | return tf.estimator.DNNLinearCombinedClassifier( 320 | n_classes=2, 321 | linear_feature_columns=wide_columns, 322 | linear_optimizer='Ftrl', 323 | dnn_feature_columns=deep_columns, 324 | dnn_optimizer=optimizer, 325 | dnn_hidden_units=[int(units) for units in FLAGS.hidden_units.split(',')], 326 | dnn_activation_fn=tf.nn.relu, 327 | dnn_dropout=FLAGS.dropout, 328 | batch_norm=True, 329 | model_dir=model_checkpoints, 330 | config=run_config) 331 | elif
FLAGS.model_type == 'BT': 332 | n_batches = 500 333 | return tf.estimator.BoostedTreesClassifier( 334 | n_classes=2, 335 | n_batches_per_layer=n_batches, 336 | feature_columns=tree_columns, 337 | learning_rate=FLAGS.learning_rate, 338 | n_trees=FLAGS.n_trees, 339 | max_depth=FLAGS.max_depth, 340 | model_dir=model_checkpoints, 341 | config=run_config) 342 | 343 | raise Exception( 344 | 'Model type {} not recognised - choose from DNN, WD or BT.'.format( 345 | FLAGS.model_type)) 346 | 347 | 348 | def main(unused_argv): 349 | tf.logging.set_verbosity(tf.logging.INFO) 350 | 351 | model_checkpoints = os.path.join(FLAGS.job_dir, 'model') 352 | model_serving = os.path.join(FLAGS.job_dir, 'serving') 353 | 354 | if not FLAGS.cloud: 355 | timestamp = datetime.datetime.utcnow().strftime('%Y-%m-%d_%H-%M-%S') 356 | model_checkpoints = model_checkpoints + '_{}'.format(timestamp) 357 | model_serving = model_serving + '_{}'.format(timestamp) 358 | 359 | trial = json.loads(os.environ.get('TF_CONFIG', '{}')).get('task', {}).get( 360 | 'trial', '') 361 | 362 | if not trial: 363 | trial = 1 364 | 365 | model_checkpoints = os.path.join(model_checkpoints, str(trial)) 366 | model_serving = os.path.join(model_serving, str(trial)) 367 | 368 | estimator = initialize_estimator(model_checkpoints=model_checkpoints) 369 | estimator = tf.contrib.estimator.add_metrics(estimator, metrics) 370 | estimator = tf.contrib.estimator.forward_features( 371 | estimator, keys=['customer_identity', 'brand_identity']) 372 | 373 | # train / evaluate / predict 374 | 375 | if FLAGS.mode == "train": 376 | start_time = time.time() 377 | 378 | # stop if loss does not decrease within given max steps 379 | if FLAGS.early_stopping: 380 | early_stopping = tf.contrib.estimator.stop_if_no_decrease_hook( 381 | estimator, metric_name='loss', max_steps_without_decrease=1000) 382 | hooks = [early_stopping] 383 | else: 384 | hooks = None 385 | 386 | results = tf.estimator.train_and_evaluate( 387 | estimator, 388 | 
tf.estimator.TrainSpec( 389 | input_fn=lambda: input_fn( 390 | path_dir=FLAGS.train_data, 391 | epochs=FLAGS.train_epochs, 392 | shuffle=True, 393 | batch_size=FLAGS.batch_size), 394 | hooks=hooks 395 | ), 396 | tf.estimator.EvalSpec( 397 | input_fn=lambda: input_fn( 398 | path_dir=FLAGS.dev_data, 399 | epochs=1, 400 | shuffle=False, 401 | batch_size=FLAGS.batch_size 402 | ), 403 | steps=1000, 404 | throttle_secs=60 405 | ) 406 | ) 407 | 408 | duration = time.time() - start_time 409 | 410 | print("Training time: {} seconds / {} minutes".format( 411 | round(duration, 2), round((duration/60.0), 2))) 412 | 413 | # export model for serving 414 | estimator.export_savedmodel(export_dir_base=model_serving, 415 | serving_input_receiver_fn=serving_input_receiver_fn) 416 | 417 | # export model settings (add results from train_and_evaluate) 418 | if results: 419 | results = results[0] 420 | else: 421 | results = {} 422 | results['duration'] = duration 423 | results['checkpoints_dir'] = model_checkpoints 424 | 425 | export_train_results(FLAGS, trial, results) 426 | 427 | # use for local predictions on a test set - for batch scoring use AI Platform predict 428 | elif FLAGS.mode == 'predict': 429 | predictions = estimator.predict( 430 | input_fn=lambda: input_fn( 431 | path_dir=FLAGS.test_data, 432 | shuffle=False, 433 | epochs=1, 434 | batch_size=FLAGS.batch_size 435 | ), 436 | checkpoint_path=tf.train.latest_checkpoint(FLAGS.job_dir)) 437 | 438 | timestamp = datetime.datetime.utcnow().strftime('%Y_%m_%d_%H_%M_%S') 439 | file_name = 'predictions_{}.json'.format(timestamp) 440 | output_path = os.path.join(FLAGS.job_dir, file_name) 441 | 442 | with open(output_path, 'w') as json_output: 443 | for p in predictions: 444 | results = { 445 | 'customer_id': p['customer_identity'].decode('utf-8'), 446 | 'brand': p['brand_identity'].decode('utf-8'), 447 | 'predicted_label': int(p['class_ids'][0]), 448 | 'logistic': float(p['logistic'][0]) 449 | } 450 | 451 | 
json_output.write(json.dumps(results, ensure_ascii=False) + '\n') 452 | 453 | gcs_command = 'gsutil -m cp -r ' + output_path + ' gs://{}/evaluation/{}'.format( 454 | FLAGS.bucket, file_name) 455 | subprocess.check_output(gcs_command.split()) 456 | 457 | bq_schema = 'logistic:FLOAT,predicted_label:INTEGER,customer_id:STRING,brand:STRING' 458 | bq_command = ('bq --location=EU load --source_format=NEWLINE_DELIMITED_JSON propensity_dataset.{} ' 459 | 'gs://{}/evaluation/{} {}').format( 460 | file_name.replace('.json', ''), FLAGS.bucket, file_name, bq_schema 461 | ) 462 | subprocess.check_output(bq_command.split()) 463 | 464 | # use for evaluation on a test set 465 | elif FLAGS.mode == 'evaluate': 466 | 467 | results = estimator.evaluate( 468 | input_fn=lambda: input_fn( 469 | path_dir=FLAGS.test_data, 470 | epochs=1, 471 | shuffle=False, 472 | batch_size=FLAGS.batch_size 473 | ), 474 | checkpoint_path=tf.train.latest_checkpoint(FLAGS.job_dir) 475 | ) 476 | 477 | export_eval_results(FLAGS, trial, results) 478 | 479 | else: 480 | print('Unrecognised mode {}'.format(FLAGS.mode)) 481 | 482 | 483 | if __name__ == '__main__': 484 | tf.app.run(main=main, argv=[sys.argv[0]] + unparsed) 485 | -------------------------------------------------------------------------------- /ai-platform/package/trainer/utils.py: -------------------------------------------------------------------------------- 1 | from google.cloud import storage 2 | import datetime 3 | import yaml 4 | import json 5 | import os 6 | 7 | 8 | def make_csv_cols(schema): 9 | """ 10 | Create list containing column defaults 11 | :param schema: schema dict containing column names, default types 12 | :return: list of column names, list of column defaults 13 | """ 14 | columns_default_dict = { 15 | 'STRING': ' ', 16 | 'INTEGER': 0, 17 | 'FLOAT': 0.0, 18 | 'NUMERIC': 0.0, 19 | 'BOOLEAN': False, 20 | 'TIMESTAMP': None, 21 | 'RECORD': None 22 | } 23 | 24 | csv_columns = [item['name'] for item in schema] 25 | csv_column_defaults 
= [columns_default_dict[item['type']] for item in schema] 26 | 27 | return csv_columns, csv_column_defaults 28 | 29 | 30 | def get_file_from_gcs(project, bucket, path): 31 | """ 32 | Retrieves file from Google Cloud Storage 33 | :param project: GCP project name 34 | :param bucket: Cloud Storage bucket name 35 | :param path: path to file 36 | :return: dict for JSON or YAML object 37 | """ 38 | client = storage.Client(project) 39 | bucket_obj = client.get_bucket(bucket) 40 | 41 | blob = bucket_obj.get_blob(path.split(bucket + '/')[1]) 42 | file_string = blob.download_as_string() 43 | 44 | ext = path.split('.')[-1] 45 | if ext == 'json': 46 | return json.loads(file_string.decode('utf-8')) 47 | elif ext == 'yaml': 48 | return yaml.safe_load(file_string.decode('utf-8')) 49 | 50 | 51 | def get_schema(flags): 52 | """ 53 | Return schema based on local or AI Platform mode 54 | :param flags: parsed args 55 | :return: list of field dicts describing the schema 56 | """ 57 | if flags.cloud: 58 | return get_file_from_gcs( 59 | project=flags.project, 60 | bucket=flags.bucket, 61 | path=flags.schema_path) 62 | else: 63 | with open(flags.schema_path, "r") as read_file: 64 | return json.load(read_file) 65 | 66 | 67 | def upload_to_gcs(flags, filename, data): 68 | """ 69 | Upload files to Cloud Storage 70 | :param flags: parsed args 71 | :param filename: name of file to upload 72 | :param data: dict to dump to file 73 | """ 74 | client = storage.Client(flags.project) 75 | bucket = client.get_bucket(flags.bucket) 76 | path = flags.job_dir.replace('gs://{}/'.format(flags.bucket), '') 77 | path = os.path.join(path, filename) 78 | json_data = json.dumps(data) 79 | blob = bucket.blob(path) 80 | blob.upload_from_string(json_data) 81 | 82 | 83 | def export_train_results(flags, trial_no, results=None): 84 | """ 85 | Generate JSON with model settings and evaluation metrics 86 | :param flags: parsed args 87 | :param trial_no: model trial number 88 | :param results: dict containing training details 89 | """ 90 | 
timestamp = datetime.datetime.utcnow().strftime('%Y-%m-%d_%H-%M-%S') 91 | filename = 'train_settings_trial{}_{}.json'.format(str(trial_no), timestamp) 92 | 93 | results_dict = { 94 | # job details 95 | 'job_name': flags.job_dir.split('/')[-1], 96 | 'train_data': flags.train_data, 97 | 'test_data': flags.test_data, 98 | 'dev_data': flags.dev_data, 99 | 'schema_path': flags.schema_path, 100 | 'brand_vocab': flags.brand_vocab, 101 | 'mode': flags.mode, 102 | 'trial': trial_no, 103 | 'ai_platform': flags.cloud, 104 | 'model_type': flags.model_type, 105 | 'learning_rate': str(flags.learning_rate), 106 | 'train_epochs': str(flags.train_epochs), 107 | 'batch_size': str(flags.batch_size), 108 | 'feature_selec': flags.feature_selec 109 | } 110 | 111 | if flags.model_type in ['DNN', 'WD']: 112 | results_dict['optimizer'] = flags.optimizer 113 | results_dict['hidden_units'] = flags.hidden_units 114 | results_dict['dropout'] = str(flags.dropout) 115 | 116 | elif flags.model_type == 'BT': 117 | results_dict['n_trees'] = flags.n_trees 118 | results_dict['max_depth'] = flags.max_depth 119 | 120 | # add training duration and training metrics 121 | if results: 122 | for k in results.keys(): 123 | results[k] = str(results[k]) 124 | results_dict.update(results) 125 | 126 | if flags.cloud: 127 | upload_to_gcs(flags, filename, results_dict) 128 | else: 129 | with open(os.path.join(flags.job_dir, filename), 'w') as write_file: 130 | write_file.write( 131 | json.dumps(results_dict, sort_keys=True, indent=2)) 132 | 133 | 134 | def export_eval_results(flags, trial_no, results): 135 | """ 136 | Generate JSON with evaluation metrics 137 | :param flags: parsed args 138 | :param trial_no: model trial number 139 | :param results: dict containing eval details 140 | """ 141 | timestamp = datetime.datetime.utcnow().strftime('%Y-%m-%d_%H-%M-%S') 142 | filename = 'eval_results_trial{}_{}.json'.format(trial_no, timestamp) 143 | 144 | # eval metrics 145 | results_dict = { 146 | 'job_dir': 
str(flags.job_dir), 147 | 'trial': str(trial_no), 148 | 'accuracy': str(results['accuracy']), 149 | 'precision': str(results['precision']), 150 | 'recall': str(results['recall']), 151 | 'f1': str(results['f1']), 152 | 'auc': str(results['auc']) 153 | } 154 | 155 | if flags.cloud: 156 | upload_to_gcs(flags, filename, results_dict) 157 | else: 158 | with open(os.path.join(flags.job_dir, filename), 'w') as write_file: 159 | write_file.write( 160 | json.dumps(results_dict, sort_keys=True, indent=2)) 161 | -------------------------------------------------------------------------------- /ai-platform/requirements.txt: -------------------------------------------------------------------------------- 1 | google-cloud-storage==1.13.2 2 | tensorflow==1.13.1 3 | pyyaml -------------------------------------------------------------------------------- /ai-platform/schema.json: -------------------------------------------------------------------------------- 1 | [ 2 | { "mode": "NULLABLE", "name": "customer_id", "type": "STRING" }, 3 | { "mode": "NULLABLE", "name": "brand", "type": "STRING" }, 4 | { "mode": "NULLABLE", "name": "promo_sensitive", "type": "INTEGER" }, 5 | { "mode": "NULLABLE", "name": "brand_seasonality", "type": "INTEGER" }, 6 | { "mode": "NULLABLE", "name": "total_sale_quantity_1m", "type": "INTEGER" }, 7 | { "mode": "NULLABLE", "name": "max_sale_amount_1m", "type": "INTEGER" }, 8 | { "mode": "NULLABLE", "name": "max_sale_quantity_1m", "type": "INTEGER" }, 9 | { "mode": "NULLABLE", "name": "total_returned_items_1m", "type": "INTEGER" }, 10 | { "mode": "NULLABLE", "name": "distinct_products_1m", "type": "INTEGER" }, 11 | { "mode": "NULLABLE", "name": "distinct_chains_1m", "type": "INTEGER" }, 12 | { "mode": "NULLABLE", "name": "distinct_category_1m", "type": "INTEGER" }, 13 | { "mode": "NULLABLE", "name": "distinct_days_shopped_1m", "type": "INTEGER" }, 14 | { "mode": "NULLABLE", "name": "aov_1m", "type": "INTEGER" }, 15 | { "mode": "NULLABLE", "name": 
"total_sale_quantity_3m", "type": "INTEGER" }, 16 | { "mode": "NULLABLE", "name": "max_sale_amount_3m", "type": "INTEGER" }, 17 | { "mode": "NULLABLE", "name": "max_sale_quantity_3m", "type": "INTEGER" }, 18 | { "mode": "NULLABLE", "name": "total_returned_items_3m", "type": "INTEGER" }, 19 | { "mode": "NULLABLE", "name": "distinct_products_3m", "type": "INTEGER" }, 20 | { "mode": "NULLABLE", "name": "distinct_chains_3m", "type": "INTEGER" }, 21 | { "mode": "NULLABLE", "name": "distinct_category_3m", "type": "INTEGER" }, 22 | { "mode": "NULLABLE", "name": "distinct_days_shopped_3m", "type": "INTEGER" }, 23 | { "mode": "NULLABLE", "name": "aov_3m", "type": "INTEGER" }, 24 | { "mode": "NULLABLE", "name": "total_sale_quantity_6m", "type": "INTEGER" }, 25 | { "mode": "NULLABLE", "name": "max_sale_amount_6m", "type": "INTEGER" }, 26 | { "mode": "NULLABLE", "name": "max_sale_quantity_6m", "type": "INTEGER" }, 27 | { "mode": "NULLABLE", "name": "total_returned_items_6m", "type": "INTEGER" }, 28 | { "mode": "NULLABLE", "name": "distinct_products_6m", "type": "INTEGER" }, 29 | { "mode": "NULLABLE", "name": "distinct_chains_6m", "type": "INTEGER" }, 30 | { "mode": "NULLABLE", "name": "distinct_category_6m", "type": "INTEGER" }, 31 | { "mode": "NULLABLE", "name": "distinct_days_shopped_6m", "type": "INTEGER" }, 32 | { "mode": "NULLABLE", "name": "aov_6m", "type": "INTEGER" }, 33 | { "mode": "NULLABLE", "name": "total_sale_quantity_12m", "type": "INTEGER" }, 34 | { "mode": "NULLABLE", "name": "max_sale_amount_12m", "type": "INTEGER" }, 35 | { "mode": "NULLABLE", "name": "max_sale_quantity_12m", "type": "INTEGER" }, 36 | { "mode": "NULLABLE", "name": "total_returned_items_12m", "type": "INTEGER" }, 37 | { "mode": "NULLABLE", "name": "distinct_products_12m", "type": "INTEGER" }, 38 | { "mode": "NULLABLE", "name": "distinct_chains_12m", "type": "INTEGER" }, 39 | { "mode": "NULLABLE", "name": "distinct_category_12m", "type": "INTEGER" }, 40 | {"mode": "NULLABLE", "name": 
"distinct_days_shopped_12m", "type": "INTEGER"}, 41 | { "mode": "NULLABLE", "name": "aov_12m", "type": "INTEGER" }, 42 | { "mode": "NULLABLE", "name": "overall_sale_quantity_1m", "type": "INTEGER" }, 43 | {"mode": "NULLABLE", "name": "overall_returned_items_1m", "type": "INTEGER"}, 44 | {"mode": "NULLABLE", "name": "overall_distinct_products_1m", "type": "INTEGER"}, 45 | {"mode": "NULLABLE", "name": "overall_distinct_brands_1m", "type": "INTEGER"}, 46 | {"mode": "NULLABLE", "name": "overall_distinct_chains_1m", "type": "INTEGER"}, 47 | {"mode": "NULLABLE", "name": "overall_distinct_category_1m", "type": "INTEGER"}, 48 | {"mode": "NULLABLE", "name": "overall_distinct_days_shopped_1m", "type": "INTEGER"}, 49 | { "mode": "NULLABLE", "name": "overall_aov_1m", "type": "INTEGER" }, 50 | { "mode": "NULLABLE", "name": "overall_sale_quantity_3m", "type": "INTEGER" }, 51 | {"mode": "NULLABLE", "name": "overall_returned_items_3m", "type": "INTEGER"}, 52 | {"mode": "NULLABLE", "name": "overall_distinct_products_3m", "type": "INTEGER"}, 53 | {"mode": "NULLABLE", "name": "overall_distinct_brands_3m", "type": "INTEGER"}, 54 | {"mode": "NULLABLE", "name": "overall_distinct_chains_3m", "type": "INTEGER"}, 55 | {"mode": "NULLABLE", "name": "overall_distinct_category_3m", "type": "INTEGER"}, 56 | {"mode": "NULLABLE", "name": "overall_distinct_days_shopped_3m", "type": "INTEGER"}, 57 | { "mode": "NULLABLE", "name": "overall_aov_3m", "type": "INTEGER" }, 58 | { "mode": "NULLABLE", "name": "overall_sale_quantity_6m", "type": "INTEGER" }, 59 | {"mode": "NULLABLE", "name": "overall_returned_items_6m", "type": "INTEGER"}, 60 | {"mode": "NULLABLE", "name": "overall_distinct_products_6m", "type": "INTEGER"}, 61 | {"mode": "NULLABLE", "name": "overall_distinct_brands_6m", "type": "INTEGER"}, 62 | {"mode": "NULLABLE", "name": "overall_distinct_chains_6m", "type": "INTEGER"}, 63 | {"mode": "NULLABLE", "name": "overall_distinct_category_6m", "type": "INTEGER"}, 64 | {"mode": "NULLABLE", "name": 
"overall_distinct_days_shopped_6m", "type": "INTEGER"}, 65 | { "mode": "NULLABLE", "name": "overall_aov_6m", "type": "INTEGER" }, 66 | {"mode": "NULLABLE", "name": "overall_sale_quantity_12m", "type": "INTEGER"}, 67 | {"mode": "NULLABLE", "name": "overall_returned_items_12m", "type": "INTEGER"}, 68 | {"mode": "NULLABLE", "name": "overall_distinct_products_12m", "type": "INTEGER"}, 69 | {"mode": "NULLABLE", "name": "overall_distinct_brands_12m", "type": "INTEGER"}, 70 | {"mode": "NULLABLE", "name": "overall_distinct_chains_12m", "type": "INTEGER"}, 71 | {"mode": "NULLABLE", "name": "overall_distinct_category_12m", "type": "INTEGER"}, 72 | {"mode": "NULLABLE", "name": "overall_distinct_days_shopped_12m", "type": "INTEGER"}, 73 | { "mode": "NULLABLE", "name": "overall_aov_12m", "type": "INTEGER" }, 74 | { "mode": "NULLABLE", "name": "label", "type": "INTEGER" } 75 | ] 76 | -------------------------------------------------------------------------------- /ai-platform/test-predictions/batch_test.json: -------------------------------------------------------------------------------- 1 | 
{"customer_id":"4112381428","brand":"22310","promo_sensitive":0,"brand_seasonality":0,"total_sale_quantity_1m":0,"max_sale_amount_1m":0,"max_sale_quantity_1m":0,"total_returned_items_1m":0,"distinct_products_1m":0,"distinct_chains_1m":0,"distinct_category_1m":0,"distinct_days_shopped_1m":0,"aov_1m":0,"total_sale_quantity_3m":0,"max_sale_amount_3m":0,"max_sale_quantity_3m":0,"total_returned_items_3m":0,"distinct_products_3m":0,"distinct_chains_3m":0,"distinct_category_3m":0,"distinct_days_shopped_3m":0,"aov_3m":0,"total_sale_quantity_6m":0,"max_sale_amount_6m":0,"max_sale_quantity_6m":0,"total_returned_items_6m":0,"distinct_products_6m":0,"distinct_chains_6m":0,"distinct_category_6m":0,"distinct_days_shopped_6m":0,"aov_6m":0,"total_sale_quantity_12m":0,"max_sale_amount_12m":0,"max_sale_quantity_12m":0,"total_returned_items_12m":0,"distinct_products_12m":0,"distinct_chains_12m":0,"distinct_category_12m":0,"distinct_days_shopped_12m":0,"aov_12m":0,"overall_sale_quantity_1m":1,"overall_returned_items_1m":0,"overall_distinct_products_1m":1,"overall_distinct_brands_1m":1,"overall_distinct_chains_1m":1,"overall_distinct_category_1m":1,"overall_distinct_days_shopped_1m":1,"overall_aov_1m":1,"overall_sale_quantity_3m":3,"overall_returned_items_3m":0,"overall_distinct_products_3m":2,"overall_distinct_brands_3m":2,"overall_distinct_chains_3m":1,"overall_distinct_category_3m":2,"overall_distinct_days_shopped_3m":2,"overall_aov_3m":2,"overall_sale_quantity_6m":28,"overall_returned_items_6m":0,"overall_distinct_products_6m":15,"overall_distinct_brands_6m":13,"overall_distinct_chains_6m":1,"overall_distinct_category_6m":14,"overall_distinct_days_shopped_6m":7,"overall_aov_6m":11,"overall_sale_quantity_12m":35,"overall_returned_items_12m":0,"overall_distinct_products_12m":17,"overall_distinct_brands_12m":14,"overall_distinct_chains_12m":1,"overall_distinct_category_12m":15,"overall_distinct_days_shopped_12m":10,"overall_aov_12m":8} 2 | 
{"customer_id":"4626711012","brand":"2246","promo_sensitive":1,"brand_seasonality":0,"total_sale_quantity_1m":0,"max_sale_amount_1m":0,"max_sale_quantity_1m":0,"total_returned_items_1m":0,"distinct_products_1m":0,"distinct_chains_1m":0,"distinct_category_1m":0,"distinct_days_shopped_1m":0,"aov_1m":0,"total_sale_quantity_3m":0,"max_sale_amount_3m":0,"max_sale_quantity_3m":0,"total_returned_items_3m":0,"distinct_products_3m":0,"distinct_chains_3m":0,"distinct_category_3m":0,"distinct_days_shopped_3m":0,"aov_3m":0,"total_sale_quantity_6m":0,"max_sale_amount_6m":0,"max_sale_quantity_6m":0,"total_returned_items_6m":0,"distinct_products_6m":0,"distinct_chains_6m":0,"distinct_category_6m":0,"distinct_days_shopped_6m":0,"aov_6m":0,"total_sale_quantity_12m":0,"max_sale_amount_12m":0,"max_sale_quantity_12m":0,"total_returned_items_12m":0,"distinct_products_12m":0,"distinct_chains_12m":0,"distinct_category_12m":0,"distinct_days_shopped_12m":0,"aov_12m":0,"overall_sale_quantity_1m":268,"overall_returned_items_1m":0,"overall_distinct_products_1m":148,"overall_distinct_brands_1m":118,"overall_distinct_chains_1m":1,"overall_distinct_category_1m":100,"overall_distinct_days_shopped_1m":11,"overall_aov_1m":79,"overall_sale_quantity_3m":770,"overall_returned_items_3m":0,"overall_distinct_products_3m":331,"overall_distinct_brands_3m":233,"overall_distinct_chains_3m":1,"overall_distinct_category_3m":188,"overall_distinct_days_shopped_3m":40,"overall_aov_3m":62,"overall_sale_quantity_6m":1539,"overall_returned_items_6m":0,"overall_distinct_products_6m":546,"overall_distinct_brands_6m":329,"overall_distinct_chains_6m":1,"overall_distinct_category_6m":253,"overall_distinct_days_shopped_6m":91,"overall_aov_6m":57,"overall_sale_quantity_12m":2407,"overall_returned_items_12m":0,"overall_distinct_products_12m":730,"overall_distinct_brands_12m":416,"overall_distinct_chains_12m":1,"overall_distinct_category_12m":289,"overall_distinct_days_shopped_12m":148,"overall_aov_12m":54} 3 | 
{"customer_id":"2269082691","brand":"17864","promo_sensitive":0,"brand_seasonality":0,"total_sale_quantity_1m":0,"max_sale_amount_1m":0,"max_sale_quantity_1m":0,"total_returned_items_1m":0,"distinct_products_1m":0,"distinct_chains_1m":0,"distinct_category_1m":0,"distinct_days_shopped_1m":0,"aov_1m":0,"total_sale_quantity_3m":0,"max_sale_amount_3m":0,"max_sale_quantity_3m":0,"total_returned_items_3m":0,"distinct_products_3m":0,"distinct_chains_3m":0,"distinct_category_3m":0,"distinct_days_shopped_3m":0,"aov_3m":0,"total_sale_quantity_6m":0,"max_sale_amount_6m":0,"max_sale_quantity_6m":0,"total_returned_items_6m":0,"distinct_products_6m":0,"distinct_chains_6m":0,"distinct_category_6m":0,"distinct_days_shopped_6m":0,"aov_6m":0,"total_sale_quantity_12m":0,"max_sale_amount_12m":0,"max_sale_quantity_12m":0,"total_returned_items_12m":0,"distinct_products_12m":0,"distinct_chains_12m":0,"distinct_category_12m":0,"distinct_days_shopped_12m":0,"aov_12m":0,"overall_sale_quantity_1m":179,"overall_returned_items_1m":0,"overall_distinct_products_1m":121,"overall_distinct_brands_1m":92,"overall_distinct_chains_1m":1,"overall_distinct_category_1m":91,"overall_distinct_days_shopped_1m":4,"overall_aov_1m":132,"overall_sale_quantity_3m":730,"overall_returned_items_3m":0,"overall_distinct_products_3m":334,"overall_distinct_brands_3m":206,"overall_distinct_chains_3m":1,"overall_distinct_category_3m":171,"overall_distinct_days_shopped_3m":16,"overall_aov_3m":130,"overall_sale_quantity_6m":1346,"overall_returned_items_6m":0,"overall_distinct_products_6m":492,"overall_distinct_brands_6m":285,"overall_distinct_chains_6m":1,"overall_distinct_category_6m":228,"overall_distinct_days_shopped_6m":35,"overall_aov_6m":106,"overall_sale_quantity_12m":2523,"overall_returned_items_12m":0,"overall_distinct_products_12m":717,"overall_distinct_brands_12m":394,"overall_distinct_chains_12m":1,"overall_distinct_category_12m":276,"overall_distinct_days_shopped_12m":64,"overall_aov_12m":113} 4 | 
{"customer_id":"4638295293","brand":"9542","promo_sensitive":1,"brand_seasonality":0,"total_sale_quantity_1m":0,"max_sale_amount_1m":0,"max_sale_quantity_1m":0,"total_returned_items_1m":0,"distinct_products_1m":0,"distinct_chains_1m":0,"distinct_category_1m":0,"distinct_days_shopped_1m":0,"aov_1m":0,"total_sale_quantity_3m":0,"max_sale_amount_3m":0,"max_sale_quantity_3m":0,"total_returned_items_3m":0,"distinct_products_3m":0,"distinct_chains_3m":0,"distinct_category_3m":0,"distinct_days_shopped_3m":0,"aov_3m":0,"total_sale_quantity_6m":0,"max_sale_amount_6m":0,"max_sale_quantity_6m":0,"total_returned_items_6m":0,"distinct_products_6m":0,"distinct_chains_6m":0,"distinct_category_6m":0,"distinct_days_shopped_6m":0,"aov_6m":0,"total_sale_quantity_12m":0,"max_sale_amount_12m":0,"max_sale_quantity_12m":0,"total_returned_items_12m":0,"distinct_products_12m":0,"distinct_chains_12m":0,"distinct_category_12m":0,"distinct_days_shopped_12m":0,"aov_12m":0,"overall_sale_quantity_1m":231,"overall_returned_items_1m":0,"overall_distinct_products_1m":138,"overall_distinct_brands_1m":98,"overall_distinct_chains_1m":1,"overall_distinct_category_1m":103,"overall_distinct_days_shopped_1m":15,"overall_aov_1m":56,"overall_sale_quantity_3m":705,"overall_returned_items_3m":0,"overall_distinct_products_3m":334,"overall_distinct_brands_3m":207,"overall_distinct_chains_3m":1,"overall_distinct_category_3m":191,"overall_distinct_days_shopped_3m":40,"overall_aov_3m":64,"overall_sale_quantity_6m":1319,"overall_returned_items_6m":0,"overall_distinct_products_6m":517,"overall_distinct_brands_6m":281,"overall_distinct_chains_6m":1,"overall_distinct_category_6m":246,"overall_distinct_days_shopped_6m":71,"overall_aov_6m":68,"overall_sale_quantity_12m":2663,"overall_returned_items_12m":0,"overall_distinct_products_12m":753,"overall_distinct_brands_12m":355,"overall_distinct_chains_12m":1,"overall_distinct_category_12m":296,"overall_distinct_days_shopped_12m":138,"overall_aov_12m":70} 5 | 
{"customer_id":"308864487","brand":"64256","promo_sensitive":0,"brand_seasonality":0,"total_sale_quantity_1m":0,"max_sale_amount_1m":0,"max_sale_quantity_1m":0,"total_returned_items_1m":0,"distinct_products_1m":0,"distinct_chains_1m":0,"distinct_category_1m":0,"distinct_days_shopped_1m":0,"aov_1m":0,"total_sale_quantity_3m":0,"max_sale_amount_3m":0,"max_sale_quantity_3m":0,"total_returned_items_3m":0,"distinct_products_3m":0,"distinct_chains_3m":0,"distinct_category_3m":0,"distinct_days_shopped_3m":0,"aov_3m":0,"total_sale_quantity_6m":0,"max_sale_amount_6m":0,"max_sale_quantity_6m":0,"total_returned_items_6m":0,"distinct_products_6m":0,"distinct_chains_6m":0,"distinct_category_6m":0,"distinct_days_shopped_6m":0,"aov_6m":0,"total_sale_quantity_12m":0,"max_sale_amount_12m":0,"max_sale_quantity_12m":0,"total_returned_items_12m":0,"distinct_products_12m":0,"distinct_chains_12m":0,"distinct_category_12m":0,"distinct_days_shopped_12m":0,"aov_12m":0,"overall_sale_quantity_1m":271,"overall_returned_items_1m":0,"overall_distinct_products_1m":147,"overall_distinct_brands_1m":99,"overall_distinct_chains_1m":1,"overall_distinct_category_1m":103,"overall_distinct_days_shopped_1m":6,"overall_aov_1m":160,"overall_sale_quantity_3m":638,"overall_returned_items_3m":1,"overall_distinct_products_3m":279,"overall_distinct_brands_3m":161,"overall_distinct_chains_3m":1,"overall_distinct_category_3m":155,"overall_distinct_days_shopped_3m":21,"overall_aov_3m":122,"overall_sale_quantity_6m":1083,"overall_returned_items_6m":1,"overall_distinct_products_6m":412,"overall_distinct_brands_6m":219,"overall_distinct_chains_6m":1,"overall_distinct_category_6m":202,"overall_distinct_days_shopped_6m":41,"overall_aov_6m":101,"overall_sale_quantity_12m":1796,"overall_returned_items_12m":2,"overall_distinct_products_12m":584,"overall_distinct_brands_12m":278,"overall_distinct_chains_12m":1,"overall_distinct_category_12m":237,"overall_distinct_days_shopped_12m":74,"overall_aov_12m":91} 6 | 
{"customer_id":"4087233172","brand":"11375","promo_sensitive":0,"brand_seasonality":0,"total_sale_quantity_1m":0,"max_sale_amount_1m":0,"max_sale_quantity_1m":0,"total_returned_items_1m":0,"distinct_products_1m":0,"distinct_chains_1m":0,"distinct_category_1m":0,"distinct_days_shopped_1m":0,"aov_1m":0,"total_sale_quantity_3m":0,"max_sale_amount_3m":0,"max_sale_quantity_3m":0,"total_returned_items_3m":0,"distinct_products_3m":0,"distinct_chains_3m":0,"distinct_category_3m":0,"distinct_days_shopped_3m":0,"aov_3m":0,"total_sale_quantity_6m":0,"max_sale_amount_6m":0,"max_sale_quantity_6m":0,"total_returned_items_6m":0,"distinct_products_6m":0,"distinct_chains_6m":0,"distinct_category_6m":0,"distinct_days_shopped_6m":0,"aov_6m":0,"total_sale_quantity_12m":0,"max_sale_amount_12m":0,"max_sale_quantity_12m":0,"total_returned_items_12m":0,"distinct_products_12m":0,"distinct_chains_12m":0,"distinct_category_12m":0,"distinct_days_shopped_12m":0,"aov_12m":0,"overall_sale_quantity_1m":225,"overall_returned_items_1m":0,"overall_distinct_products_1m":110,"overall_distinct_brands_1m":81,"overall_distinct_chains_1m":1,"overall_distinct_category_1m":91,"overall_distinct_days_shopped_1m":4,"overall_aov_1m":234,"overall_sale_quantity_3m":770,"overall_returned_items_3m":0,"overall_distinct_products_3m":238,"overall_distinct_brands_3m":160,"overall_distinct_chains_3m":1,"overall_distinct_category_3m":150,"overall_distinct_days_shopped_3m":12,"overall_aov_3m":267,"overall_sale_quantity_6m":1434,"overall_returned_items_6m":0,"overall_distinct_products_6m":371,"overall_distinct_brands_6m":248,"overall_distinct_chains_6m":1,"overall_distinct_category_6m":196,"overall_distinct_days_shopped_6m":23,"overall_aov_6m":268,"overall_sale_quantity_12m":1869,"overall_returned_items_12m":0,"overall_distinct_products_12m":450,"overall_distinct_brands_12m":290,"overall_distinct_chains_12m":1,"overall_distinct_category_12m":219,"overall_distinct_days_shopped_12m":31,"overall_aov_12m":258} 7 | 
{"customer_id":"449291271","brand":"11224","promo_sensitive":0,"brand_seasonality":0,"total_sale_quantity_1m":0,"max_sale_amount_1m":0,"max_sale_quantity_1m":0,"total_returned_items_1m":0,"distinct_products_1m":0,"distinct_chains_1m":0,"distinct_category_1m":0,"distinct_days_shopped_1m":0,"aov_1m":0,"total_sale_quantity_3m":0,"max_sale_amount_3m":0,"max_sale_quantity_3m":0,"total_returned_items_3m":0,"distinct_products_3m":0,"distinct_chains_3m":0,"distinct_category_3m":0,"distinct_days_shopped_3m":0,"aov_3m":0,"total_sale_quantity_6m":0,"max_sale_amount_6m":0,"max_sale_quantity_6m":0,"total_returned_items_6m":0,"distinct_products_6m":0,"distinct_chains_6m":0,"distinct_category_6m":0,"distinct_days_shopped_6m":0,"aov_6m":0,"total_sale_quantity_12m":0,"max_sale_amount_12m":0,"max_sale_quantity_12m":0,"total_returned_items_12m":0,"distinct_products_12m":0,"distinct_chains_12m":0,"distinct_category_12m":0,"distinct_days_shopped_12m":0,"aov_12m":0,"overall_sale_quantity_1m":366,"overall_returned_items_1m":0,"overall_distinct_products_1m":228,"overall_distinct_brands_1m":168,"overall_distinct_chains_1m":1,"overall_distinct_category_1m":153,"overall_distinct_days_shopped_1m":10,"overall_aov_1m":110,"overall_sale_quantity_3m":774,"overall_returned_items_3m":0,"overall_distinct_products_3m":395,"overall_distinct_brands_3m":254,"overall_distinct_chains_3m":1,"overall_distinct_category_3m":212,"overall_distinct_days_shopped_3m":26,"overall_aov_3m":84,"overall_sale_quantity_6m":1220,"overall_returned_items_6m":0,"overall_distinct_products_6m":540,"overall_distinct_brands_6m":313,"overall_distinct_chains_6m":1,"overall_distinct_category_6m":248,"overall_distinct_days_shopped_6m":40,"overall_aov_6m":83,"overall_sale_quantity_12m":2573,"overall_returned_items_12m":0,"overall_distinct_products_12m":909,"overall_distinct_brands_12m":479,"overall_distinct_chains_12m":1,"overall_distinct_category_12m":342,"overall_distinct_days_shopped_12m":82,"overall_aov_12m":84} 8 | 
{"customer_id":"3687973006","brand":"14869","promo_sensitive":0,"brand_seasonality":0,"total_sale_quantity_1m":0,"max_sale_amount_1m":0,"max_sale_quantity_1m":0,"total_returned_items_1m":0,"distinct_products_1m":0,"distinct_chains_1m":0,"distinct_category_1m":0,"distinct_days_shopped_1m":0,"aov_1m":0,"total_sale_quantity_3m":0,"max_sale_amount_3m":0,"max_sale_quantity_3m":0,"total_returned_items_3m":0,"distinct_products_3m":0,"distinct_chains_3m":0,"distinct_category_3m":0,"distinct_days_shopped_3m":0,"aov_3m":0,"total_sale_quantity_6m":0,"max_sale_amount_6m":0,"max_sale_quantity_6m":0,"total_returned_items_6m":0,"distinct_products_6m":0,"distinct_chains_6m":0,"distinct_category_6m":0,"distinct_days_shopped_6m":0,"aov_6m":0,"total_sale_quantity_12m":0,"max_sale_amount_12m":0,"max_sale_quantity_12m":0,"total_returned_items_12m":0,"distinct_products_12m":0,"distinct_chains_12m":0,"distinct_category_12m":0,"distinct_days_shopped_12m":0,"aov_12m":0,"overall_sale_quantity_1m":269,"overall_returned_items_1m":0,"overall_distinct_products_1m":142,"overall_distinct_brands_1m":99,"overall_distinct_chains_1m":1,"overall_distinct_category_1m":97,"overall_distinct_days_shopped_1m":9,"overall_aov_1m":62,"overall_sale_quantity_3m":713,"overall_returned_items_3m":0,"overall_distinct_products_3m":306,"overall_distinct_brands_3m":184,"overall_distinct_chains_3m":1,"overall_distinct_category_3m":184,"overall_distinct_days_shopped_3m":18,"overall_aov_3m":76,"overall_sale_quantity_6m":1016,"overall_returned_items_6m":0,"overall_distinct_products_6m":412,"overall_distinct_brands_6m":235,"overall_distinct_chains_6m":1,"overall_distinct_category_6m":233,"overall_distinct_days_shopped_6m":29,"overall_aov_6m":69,"overall_sale_quantity_12m":1123,"overall_returned_items_12m":0,"overall_distinct_products_12m":444,"overall_distinct_brands_12m":248,"overall_distinct_chains_12m":1,"overall_distinct_category_12m":240,"overall_distinct_days_shopped_12m":39,"overall_aov_12m":57} 9 | 
{"customer_id":"2513546762","brand":"3789","promo_sensitive":0,"brand_seasonality":0,"total_sale_quantity_1m":0,"max_sale_amount_1m":0,"max_sale_quantity_1m":0,"total_returned_items_1m":0,"distinct_products_1m":0,"distinct_chains_1m":0,"distinct_category_1m":0,"distinct_days_shopped_1m":0,"aov_1m":0,"total_sale_quantity_3m":0,"max_sale_amount_3m":0,"max_sale_quantity_3m":0,"total_returned_items_3m":0,"distinct_products_3m":0,"distinct_chains_3m":0,"distinct_category_3m":0,"distinct_days_shopped_3m":0,"aov_3m":0,"total_sale_quantity_6m":0,"max_sale_amount_6m":0,"max_sale_quantity_6m":0,"total_returned_items_6m":0,"distinct_products_6m":0,"distinct_chains_6m":0,"distinct_category_6m":0,"distinct_days_shopped_6m":0,"aov_6m":0,"total_sale_quantity_12m":0,"max_sale_amount_12m":0,"max_sale_quantity_12m":0,"total_returned_items_12m":0,"distinct_products_12m":0,"distinct_chains_12m":0,"distinct_category_12m":0,"distinct_days_shopped_12m":0,"aov_12m":0,"overall_sale_quantity_1m":2,"overall_returned_items_1m":0,"overall_distinct_products_1m":1,"overall_distinct_brands_1m":1,"overall_distinct_chains_1m":1,"overall_distinct_category_1m":1,"overall_distinct_days_shopped_1m":1,"overall_aov_1m":7,"overall_sale_quantity_3m":86,"overall_returned_items_3m":0,"overall_distinct_products_3m":38,"overall_distinct_brands_3m":25,"overall_distinct_chains_3m":1,"overall_distinct_category_3m":22,"overall_distinct_days_shopped_3m":11,"overall_aov_3m":43,"overall_sale_quantity_6m":163,"overall_returned_items_6m":0,"overall_distinct_products_6m":86,"overall_distinct_brands_6m":65,"overall_distinct_chains_6m":1,"overall_distinct_category_6m":50,"overall_distinct_days_shopped_6m":24,"overall_aov_6m":33,"overall_sale_quantity_12m":283,"overall_returned_items_12m":0,"overall_distinct_products_12m":136,"overall_distinct_brands_12m":97,"overall_distinct_chains_12m":1,"overall_distinct_category_12m":80,"overall_distinct_days_shopped_12m":39,"overall_aov_12m":34} 10 | 
{"customer_id":"496827583","brand":"15441","promo_sensitive":1,"brand_seasonality":0,"total_sale_quantity_1m":0,"max_sale_amount_1m":0,"max_sale_quantity_1m":0,"total_returned_items_1m":0,"distinct_products_1m":0,"distinct_chains_1m":0,"distinct_category_1m":0,"distinct_days_shopped_1m":0,"aov_1m":0,"total_sale_quantity_3m":0,"max_sale_amount_3m":0,"max_sale_quantity_3m":0,"total_returned_items_3m":0,"distinct_products_3m":0,"distinct_chains_3m":0,"distinct_category_3m":0,"distinct_days_shopped_3m":0,"aov_3m":0,"total_sale_quantity_6m":0,"max_sale_amount_6m":0,"max_sale_quantity_6m":0,"total_returned_items_6m":0,"distinct_products_6m":0,"distinct_chains_6m":0,"distinct_category_6m":0,"distinct_days_shopped_6m":0,"aov_6m":0,"total_sale_quantity_12m":0,"max_sale_amount_12m":0,"max_sale_quantity_12m":0,"total_returned_items_12m":0,"distinct_products_12m":0,"distinct_chains_12m":0,"distinct_category_12m":0,"distinct_days_shopped_12m":0,"aov_12m":0,"overall_sale_quantity_1m":270,"overall_returned_items_1m":0,"overall_distinct_products_1m":164,"overall_distinct_brands_1m":110,"overall_distinct_chains_1m":1,"overall_distinct_category_1m":122,"overall_distinct_days_shopped_1m":7,"overall_aov_1m":111,"overall_sale_quantity_3m":776,"overall_returned_items_3m":0,"overall_distinct_products_3m":304,"overall_distinct_brands_3m":184,"overall_distinct_chains_3m":1,"overall_distinct_category_3m":178,"overall_distinct_days_shopped_3m":19,"overall_aov_3m":117,"overall_sale_quantity_6m":1643,"overall_returned_items_6m":0,"overall_distinct_products_6m":468,"overall_distinct_brands_6m":264,"overall_distinct_chains_6m":1,"overall_distinct_category_6m":225,"overall_distinct_days_shopped_6m":41,"overall_aov_6m":115,"overall_sale_quantity_12m":3282,"overall_returned_items_12m":0,"overall_distinct_products_12m":697,"overall_distinct_brands_12m":368,"overall_distinct_chains_12m":1,"overall_distinct_category_12m":284,"overall_distinct_days_shopped_12m":71,"overall_aov_12m":131} 11 | 
{"customer_id":"126611756","brand":"15889","promo_sensitive":0,"brand_seasonality":0,"total_sale_quantity_1m":0,"max_sale_amount_1m":0,"max_sale_quantity_1m":0,"total_returned_items_1m":0,"distinct_products_1m":0,"distinct_chains_1m":0,"distinct_category_1m":0,"distinct_days_shopped_1m":0,"aov_1m":0,"total_sale_quantity_3m":0,"max_sale_amount_3m":0,"max_sale_quantity_3m":0,"total_returned_items_3m":0,"distinct_products_3m":0,"distinct_chains_3m":0,"distinct_category_3m":0,"distinct_days_shopped_3m":0,"aov_3m":0,"total_sale_quantity_6m":0,"max_sale_amount_6m":0,"max_sale_quantity_6m":0,"total_returned_items_6m":0,"distinct_products_6m":0,"distinct_chains_6m":0,"distinct_category_6m":0,"distinct_days_shopped_6m":0,"aov_6m":0,"total_sale_quantity_12m":0,"max_sale_amount_12m":0,"max_sale_quantity_12m":0,"total_returned_items_12m":0,"distinct_products_12m":0,"distinct_chains_12m":0,"distinct_category_12m":0,"distinct_days_shopped_12m":0,"aov_12m":0,"overall_sale_quantity_1m":180,"overall_returned_items_1m":0,"overall_distinct_products_1m":115,"overall_distinct_brands_1m":80,"overall_distinct_chains_1m":1,"overall_distinct_category_1m":94,"overall_distinct_days_shopped_1m":16,"overall_aov_1m":52,"overall_sale_quantity_3m":546,"overall_returned_items_3m":0,"overall_distinct_products_3m":278,"overall_distinct_brands_3m":180,"overall_distinct_chains_3m":1,"overall_distinct_category_3m":182,"overall_distinct_days_shopped_3m":46,"overall_aov_3m":52,"overall_sale_quantity_6m":995,"overall_returned_items_6m":0,"overall_distinct_products_6m":402,"overall_distinct_brands_6m":242,"overall_distinct_chains_6m":1,"overall_distinct_category_6m":229,"overall_distinct_days_shopped_6m":96,"overall_aov_6m":44,"overall_sale_quantity_12m":2183,"overall_returned_items_12m":0,"overall_distinct_products_12m":671,"overall_distinct_brands_12m":364,"overall_distinct_chains_12m":1,"overall_distinct_category_12m":314,"overall_distinct_days_shopped_12m":211,"overall_aov_12m":43} 12 | 
{"customer_id":"4351032692","brand":"75834","promo_sensitive":0,"brand_seasonality":0,"total_sale_quantity_1m":0,"max_sale_amount_1m":0,"max_sale_quantity_1m":0,"total_returned_items_1m":0,"distinct_products_1m":0,"distinct_chains_1m":0,"distinct_category_1m":0,"distinct_days_shopped_1m":0,"aov_1m":0,"total_sale_quantity_3m":0,"max_sale_amount_3m":0,"max_sale_quantity_3m":0,"total_returned_items_3m":0,"distinct_products_3m":0,"distinct_chains_3m":0,"distinct_category_3m":0,"distinct_days_shopped_3m":0,"aov_3m":0,"total_sale_quantity_6m":0,"max_sale_amount_6m":0,"max_sale_quantity_6m":0,"total_returned_items_6m":0,"distinct_products_6m":0,"distinct_chains_6m":0,"distinct_category_6m":0,"distinct_days_shopped_6m":0,"aov_6m":0,"total_sale_quantity_12m":0,"max_sale_amount_12m":0,"max_sale_quantity_12m":0,"total_returned_items_12m":0,"distinct_products_12m":0,"distinct_chains_12m":0,"distinct_category_12m":0,"distinct_days_shopped_12m":0,"aov_12m":0,"overall_sale_quantity_1m":209,"overall_returned_items_1m":0,"overall_distinct_products_1m":105,"overall_distinct_brands_1m":79,"overall_distinct_chains_1m":1,"overall_distinct_category_1m":87,"overall_distinct_days_shopped_1m":22,"overall_aov_1m":29,"overall_sale_quantity_3m":728,"overall_returned_items_3m":0,"overall_distinct_products_3m":284,"overall_distinct_brands_3m":204,"overall_distinct_chains_3m":1,"overall_distinct_category_3m":186,"overall_distinct_days_shopped_3m":62,"overall_aov_3m":45,"overall_sale_quantity_6m":973,"overall_returned_items_6m":0,"overall_distinct_products_6m":359,"overall_distinct_brands_6m":253,"overall_distinct_chains_6m":1,"overall_distinct_category_6m":214,"overall_distinct_days_shopped_6m":78,"overall_aov_6m":48,"overall_sale_quantity_12m":973,"overall_returned_items_12m":0,"overall_distinct_products_12m":359,"overall_distinct_brands_12m":253,"overall_distinct_chains_12m":1,"overall_distinct_category_12m":214,"overall_distinct_days_shopped_12m":78,"overall_aov_12m":48} 13 | 
{"customer_id":"688552826","brand":"17812","promo_sensitive":0,"brand_seasonality":0,"total_sale_quantity_1m":0,"max_sale_amount_1m":0,"max_sale_quantity_1m":0,"total_returned_items_1m":0,"distinct_products_1m":0,"distinct_chains_1m":0,"distinct_category_1m":0,"distinct_days_shopped_1m":0,"aov_1m":0,"total_sale_quantity_3m":0,"max_sale_amount_3m":0,"max_sale_quantity_3m":0,"total_returned_items_3m":0,"distinct_products_3m":0,"distinct_chains_3m":0,"distinct_category_3m":0,"distinct_days_shopped_3m":0,"aov_3m":0,"total_sale_quantity_6m":0,"max_sale_amount_6m":0,"max_sale_quantity_6m":0,"total_returned_items_6m":0,"distinct_products_6m":0,"distinct_chains_6m":0,"distinct_category_6m":0,"distinct_days_shopped_6m":0,"aov_6m":0,"total_sale_quantity_12m":0,"max_sale_amount_12m":0,"max_sale_quantity_12m":0,"total_returned_items_12m":0,"distinct_products_12m":0,"distinct_chains_12m":0,"distinct_category_12m":0,"distinct_days_shopped_12m":0,"aov_12m":0,"overall_sale_quantity_1m":358,"overall_returned_items_1m":0,"overall_distinct_products_1m":195,"overall_distinct_brands_1m":131,"overall_distinct_chains_1m":1,"overall_distinct_category_1m":158,"overall_distinct_days_shopped_1m":8,"overall_aov_1m":96,"overall_sale_quantity_3m":781,"overall_returned_items_3m":0,"overall_distinct_products_3m":325,"overall_distinct_brands_3m":201,"overall_distinct_chains_3m":1,"overall_distinct_category_3m":227,"overall_distinct_days_shopped_3m":30,"overall_aov_3m":55,"overall_sale_quantity_6m":1479,"overall_returned_items_6m":0,"overall_distinct_products_6m":506,"overall_distinct_brands_6m":282,"overall_distinct_chains_6m":1,"overall_distinct_category_6m":310,"overall_distinct_days_shopped_6m":65,"overall_aov_6m":48,"overall_sale_quantity_12m":3090,"overall_returned_items_12m":1,"overall_distinct_products_12m":839,"overall_distinct_brands_12m":421,"overall_distinct_chains_12m":1,"overall_distinct_category_12m":388,"overall_distinct_days_shopped_12m":162,"overall_aov_12m":41} 14 | 
{"customer_id":"401218837","brand":"2771","promo_sensitive":0,"brand_seasonality":0,"total_sale_quantity_1m":0,"max_sale_amount_1m":0,"max_sale_quantity_1m":0,"total_returned_items_1m":0,"distinct_products_1m":0,"distinct_chains_1m":0,"distinct_category_1m":0,"distinct_days_shopped_1m":0,"aov_1m":0,"total_sale_quantity_3m":0,"max_sale_amount_3m":0,"max_sale_quantity_3m":0,"total_returned_items_3m":0,"distinct_products_3m":0,"distinct_chains_3m":0,"distinct_category_3m":0,"distinct_days_shopped_3m":0,"aov_3m":0,"total_sale_quantity_6m":0,"max_sale_amount_6m":0,"max_sale_quantity_6m":0,"total_returned_items_6m":0,"distinct_products_6m":0,"distinct_chains_6m":0,"distinct_category_6m":0,"distinct_days_shopped_6m":0,"aov_6m":0,"total_sale_quantity_12m":0,"max_sale_amount_12m":0,"max_sale_quantity_12m":0,"total_returned_items_12m":0,"distinct_products_12m":0,"distinct_chains_12m":0,"distinct_category_12m":0,"distinct_days_shopped_12m":0,"aov_12m":0,"overall_sale_quantity_1m":461,"overall_returned_items_1m":0,"overall_distinct_products_1m":140,"overall_distinct_brands_1m":102,"overall_distinct_chains_1m":1,"overall_distinct_category_1m":101,"overall_distinct_days_shopped_1m":14,"overall_aov_1m":48,"overall_sale_quantity_3m":1333,"overall_returned_items_3m":0,"overall_distinct_products_3m":293,"overall_distinct_brands_3m":198,"overall_distinct_chains_3m":1,"overall_distinct_category_3m":174,"overall_distinct_days_shopped_3m":36,"overall_aov_3m":55,"overall_sale_quantity_6m":2545,"overall_returned_items_6m":0,"overall_distinct_products_6m":486,"overall_distinct_brands_6m":282,"overall_distinct_chains_6m":1,"overall_distinct_category_6m":242,"overall_distinct_days_shopped_6m":74,"overall_aov_6m":52,"overall_sale_quantity_12m":4143,"overall_returned_items_12m":0,"overall_distinct_products_12m":709,"overall_distinct_brands_12m":376,"overall_distinct_chains_12m":1,"overall_distinct_category_12m":309,"overall_distinct_days_shopped_12m":142,"overall_aov_12m":49} 15 | 
{"customer_id":"2975993688","brand":"12021","promo_sensitive":0,"brand_seasonality":0,"total_sale_quantity_1m":0,"max_sale_amount_1m":0,"max_sale_quantity_1m":0,"total_returned_items_1m":0,"distinct_products_1m":0,"distinct_chains_1m":0,"distinct_category_1m":0,"distinct_days_shopped_1m":0,"aov_1m":0,"total_sale_quantity_3m":0,"max_sale_amount_3m":0,"max_sale_quantity_3m":0,"total_returned_items_3m":0,"distinct_products_3m":0,"distinct_chains_3m":0,"distinct_category_3m":0,"distinct_days_shopped_3m":0,"aov_3m":0,"total_sale_quantity_6m":0,"max_sale_amount_6m":0,"max_sale_quantity_6m":0,"total_returned_items_6m":0,"distinct_products_6m":0,"distinct_chains_6m":0,"distinct_category_6m":0,"distinct_days_shopped_6m":0,"aov_6m":0,"total_sale_quantity_12m":0,"max_sale_amount_12m":0,"max_sale_quantity_12m":0,"total_returned_items_12m":0,"distinct_products_12m":0,"distinct_chains_12m":0,"distinct_category_12m":0,"distinct_days_shopped_12m":0,"aov_12m":0,"overall_sale_quantity_1m":369,"overall_returned_items_1m":0,"overall_distinct_products_1m":217,"overall_distinct_brands_1m":151,"overall_distinct_chains_1m":1,"overall_distinct_category_1m":157,"overall_distinct_days_shopped_1m":7,"overall_aov_1m":186,"overall_sale_quantity_3m":1007,"overall_returned_items_3m":0,"overall_distinct_products_3m":447,"overall_distinct_brands_3m":272,"overall_distinct_chains_3m":1,"overall_distinct_category_3m":246,"overall_distinct_days_shopped_3m":23,"overall_aov_3m":174,"overall_sale_quantity_6m":2032,"overall_returned_items_6m":0,"overall_distinct_products_6m":690,"overall_distinct_brands_6m":380,"overall_distinct_chains_6m":1,"overall_distinct_category_6m":316,"overall_distinct_days_shopped_6m":51,"overall_aov_6m":154,"overall_sale_quantity_12m":3823,"overall_returned_items_12m":0,"overall_distinct_products_12m":1071,"overall_distinct_brands_12m":535,"overall_distinct_chains_12m":1,"overall_distinct_category_12m":385,"overall_distinct_days_shopped_12m":112,"overall_aov_12m":130} 
-------------------------------------------------------------------------------- /ai-platform/test-predictions/online_test.json: -------------------------------------------------------------------------------- 1 | {"customer_id":"69818569","brand":"11516","promo_sensitive":0,"brand_seasonality":1,"total_sale_quantity_1m":0,"max_sale_amount_1m":0,"max_sale_quantity_1m":0,"total_returned_items_1m":0,"distinct_products_1m":0,"distinct_chains_1m":0,"distinct_category_1m":0,"distinct_days_shopped_1m":0,"aov_1m":0,"total_sale_quantity_3m":0,"max_sale_amount_3m":0,"max_sale_quantity_3m":0,"total_returned_items_3m":0,"distinct_products_3m":0,"distinct_chains_3m":0,"distinct_category_3m":0,"distinct_days_shopped_3m":0,"aov_3m":0,"total_sale_quantity_6m":1,"max_sale_amount_6m":15,"max_sale_quantity_6m":1,"total_returned_items_6m":0,"distinct_products_6m":1,"distinct_chains_6m":1,"distinct_category_6m":1,"distinct_days_shopped_6m":1,"aov_6m":15,"total_sale_quantity_12m":1,"max_sale_amount_12m":15,"max_sale_quantity_12m":1,"total_returned_items_12m":0,"distinct_products_12m":1,"distinct_chains_12m":1,"distinct_category_12m":1,"distinct_days_shopped_12m":1,"aov_12m":15,"overall_sale_quantity_1m":118,"overall_returned_items_1m":0,"overall_distinct_products_1m":60,"overall_distinct_brands_1m":44,"overall_distinct_chains_1m":1,"overall_distinct_category_1m":46,"overall_distinct_days_shopped_1m":8,"overall_aov_1m":45,"overall_sale_quantity_3m":335,"overall_returned_items_3m":0,"overall_distinct_products_3m":123,"overall_distinct_brands_3m":75,"overall_distinct_chains_3m":1,"overall_distinct_category_3m":85,"overall_distinct_days_shopped_3m":25,"overall_aov_3m":38,"overall_sale_quantity_6m":572,"overall_returned_items_6m":0,"overall_distinct_products_6m":199,"overall_distinct_brands_6m":113,"overall_distinct_chains_6m":1,"overall_distinct_category_6m":118,"overall_distinct_days_shopped_6m":48,"overall_aov_6m":37,"overall_sale_quantity_12m":1149,"overall_returned_items_12m":0,"overall_distinct_products_12m":381,"overall_distinct_brands_12m":213,"overall_distinct_chains_12m":1,"overall_distinct_category_12m":175,"overall_distinct_days_shopped_12m":95,"overall_aov_12m":42} -------------------------------------------------------------------------------- /ai-platform/train.sh: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env bash 2 | # model training on AI Platform 3 | # --scale-tier=CUSTOM and --config=config.yaml for custom machine types 4 | # --config=hptuning_config.yaml for hyperparameter tuning job 5 | 6 | PROJECT=$1 7 | BUCKET=$2 8 | MODEL_TYPE=WD # DNN, WD or BT 9 | JOBNAME=${MODEL_TYPE}_training_$(date +"%Y%m%d_%H%M%S") 10 | 11 | gcloud ai-platform jobs submit training ${JOBNAME} \ 12 | --scale-tier=BASIC \ 13 | --job-dir=gs://${BUCKET}/models/${MODEL_TYPE}/${JOBNAME} \ 14 | --region=europe-west1 \ 15 | --package-path=package/trainer \ 16 | --module-name=trainer.task \ 17 | --runtime-version=1.13 \ 18 | --python-version=3.5 \ 19 | -- \ 20 | --project=${PROJECT} \ 21 | --bucket=${BUCKET} \ 22 | --train_data=gs://${BUCKET}/data/trainData/*.csv \ 23 | --dev_data=gs://${BUCKET}/data/devData/*.csv \ 24 | --test_data=gs://${BUCKET}/data/testData/*.csv \ 25 | --schema_path=gs://${BUCKET}/schema.json \ 26 | --brand_vocab=gs://${BUCKET}/brand_vocab.csv \ 27 | --cloud \ 28 | --early_stopping \ 29 | --model_type=${MODEL_TYPE} \ 30 | --train_epochs=5 \ 31 | --batch_size=256 \ 32 | --learning_rate=0.0004843447060766648 \ 33 | --optimizer=ProximalAdagrad \ 34 | --hidden_units='128,64,32,16' \ 35 | --dropout=0.488892936706543 \ 36 | --feature_selec='[1,2,3,4,5,6]' -------------------------------------------------------------------------------- /exploration/visualisation-queries/custom_query_1.sql: -------------------------------------------------------------------------------- 1 | -- custom SQL query for Tableau dashboard 2 | -- visualisation: customer - transaction distribution 3 | SELECT 4 |
num_transactions_binned, 5 | COUNT(id) AS num_customers 6 | FROM ( 7 | SELECT 8 | *, 9 | CASE 10 | WHEN num_transactions = 1 THEN "1" 11 | WHEN num_transactions = 2 THEN "2" 12 | WHEN num_transactions > 2 AND num_transactions <= 5 THEN "3 - 5" 13 | WHEN num_transactions > 5 AND num_transactions <= 10 THEN "6 - 10" 14 | WHEN num_transactions > 10 AND num_transactions <= 20 THEN "11 - 20" 15 | WHEN num_transactions > 20 AND num_transactions <= 30 THEN "21 - 30" 16 | WHEN num_transactions > 30 AND num_transactions <= 50 THEN "31 - 50" 17 | WHEN num_transactions > 50 AND num_transactions <= 100 THEN "51 - 100" 18 | WHEN num_transactions > 100 AND num_transactions <= 250 THEN "101 - 250" 19 | WHEN num_transactions > 250 AND num_transactions <= 500 THEN "251 - 500" 20 | WHEN num_transactions > 500 AND num_transactions <= 1000 THEN "501 - 1000" 21 | WHEN num_transactions > 1000 AND num_transactions <= 2500 THEN "1001 - 2500" 22 | WHEN num_transactions > 2500 AND num_transactions <= 5000 THEN "2501 - 5000" 23 | WHEN num_transactions > 5000 AND num_transactions <= 10000 THEN "5001 - 10000" 24 | ELSE "> 10000" 25 | END AS num_transactions_binned 26 | FROM ( 27 | SELECT 28 | id, 29 | COUNT(*) AS num_transactions 30 | FROM 31 | `PROJECT.DATASET.transactions` 32 | GROUP BY 33 | id ) ) 34 | GROUP BY 35 | num_transactions_binned -------------------------------------------------------------------------------- /exploration/visualisation-queries/custom_query_2.sql: -------------------------------------------------------------------------------- 1 | -- custom SQL query for Tableau dashboard 2 | -- visualisation: Average Order Value (excluding returned items) 3 | SELECT 4 | AOV_binned, 5 | COUNT(id) AS num_customers 6 | FROM ( 7 | SELECT 8 | *, 9 | CASE 10 | WHEN AOV < 10 THEN "< 10" 11 | WHEN AOV >= 10 AND AOV <= 20 THEN "10 - 20" 12 | WHEN AOV > 20 AND AOV <= 30 THEN "20 - 30" 13 | WHEN AOV > 30 AND AOV <= 50 THEN "30 - 50" 14 | WHEN AOV > 50 AND AOV <= 100 THEN "50 - 100" 15 | WHEN 
AOV > 100 AND AOV <= 250 THEN "100 - 250" 16 | WHEN AOV > 250 AND AOV <= 500 THEN "250 - 500" 17 | WHEN AOV > 500 AND AOV <= 1000 THEN "500 - 1000" 18 | WHEN AOV > 1000 AND AOV <= 5000 THEN "1000 - 5000" 19 | WHEN AOV > 5000 AND AOV <= 10000 THEN "5000 - 10000" 20 | ELSE ">10000" 21 | END AS AOV_binned 22 | FROM ( 23 | SELECT 24 | id, 25 | SUM(totalamount) AS totalamount, 26 | COUNT(*) AS num_transactions, 27 | AVG(totalamount) AS AOV 28 | FROM ( 29 | SELECT 30 | id, 31 | chain, 32 | date, 33 | transaction_id, 34 | SUM(abs_purchaseamount) AS totalamount 35 | FROM ( 36 | SELECT 37 | *, 38 | ABS(purchaseamount) AS abs_purchaseamount, 39 | CASE WHEN purchaseamount < 0 AND purchasequantity < 0 THEN 1 ELSE 0 END AS returned_flag, 40 | CONCAT(CAST(id AS STRING), "-", CAST(chain AS STRING), "-", CAST(date AS STRING)) AS transaction_id 41 | FROM 42 | `PROJECT.DATASET.transactions` 43 | ) 44 | WHERE returned_flag = 0 45 | GROUP BY 46 | id, 47 | chain, 48 | date, 49 | transaction_id ) 50 | GROUP BY 51 | id ) ) 52 | GROUP BY 53 | AOV_binned -------------------------------------------------------------------------------- /exploration/visualisation-queries/custom_query_3.sql: -------------------------------------------------------------------------------- 1 | -- custom SQL query for Tableau dashboard 2 | -- visualisation: brand transaction distribution 3 | SELECT 4 | num_transactions_binned, 5 | COUNT(brand) AS num_brand 6 | FROM ( 7 | SELECT 8 | *, 9 | CASE 10 | WHEN num_transactions = 1 THEN "1" 11 | WHEN num_transactions > 1 AND num_transactions <= 10 THEN "2 - 10" 12 | WHEN num_transactions > 10 AND num_transactions <= 50 THEN "11 - 50" 13 | WHEN num_transactions > 50 AND num_transactions <= 100 THEN "51 - 100" 14 | WHEN num_transactions > 100 AND num_transactions <= 250 THEN "101 - 250" 15 | WHEN num_transactions > 250 AND num_transactions <= 500 THEN "251 - 500" 16 | WHEN num_transactions > 500 AND num_transactions <= 1000 THEN "501 - 1000" 17 | WHEN num_transactions > 
1000 AND num_transactions <= 2500 THEN "1001 - 2500" 18 | WHEN num_transactions > 2500 AND num_transactions <= 5000 THEN "2501 - 5000" 19 | WHEN num_transactions > 5000 AND num_transactions <= 10000 THEN "5001 - 10000" 20 | WHEN num_transactions > 10000 AND num_transactions <= 25000 THEN "10001 - 25000" 21 | ELSE "> 25000" 22 | END AS num_transactions_binned 23 | FROM ( 24 | SELECT 25 | brand, 26 | COUNT(*) AS num_transactions 27 | FROM 28 | `PROJECT.DATASET.transactions` 29 | GROUP BY 30 | brand ) ) 31 | GROUP BY 32 | num_transactions_binned -------------------------------------------------------------------------------- /exploration/visualisation-queries/custom_query_4.sql: -------------------------------------------------------------------------------- 1 | -- custom SQL query for Tableau dashboard 2 | -- visualisation: product mapping 3 | SELECT 4 | dept, 5 | COUNT(DISTINCT(category)) AS distinct_category, 6 | COUNT(DISTINCT(brand)) AS distinct_brands 7 | FROM 8 | `PROJECT.DATASET.transactions` 9 | GROUP BY dept 10 | ORDER BY distinct_brands -------------------------------------------------------------------------------- /processing-pipeline/README.md: -------------------------------------------------------------------------------- 1 | # Brand Propensity Model for Major Retailer 2 | 3 | > The purpose of these scripts is to carry out the data processing and 4 | > feature engineering in **BigQuery**. 5 | 6 | 7 | ### Raw files required: 8 | ------ 9 | These were loaded into BigQuery from Cloud Storage in CSV format. 
10 | BigQuery dataset: `propensity_dataset` 11 | - history (promotional information) 12 | - transactions (transaction table) 13 | 14 | 15 | ### Files required in package: 16 | ------ 17 | 18 | Ad-hoc SQL queries: 19 | - `supporting-queries.sql` 20 | 21 | SQL queries for data preprocessing: 22 | - `a010_impute_missing_values.sql` 23 | - `a020_remove_zero_quantity_rows.sql` 24 | - `a030_create_returned_flag.sql` 25 | - `a040_absolute_values.sql` 26 | - `a050_create_transaction_id.sql` 27 | - `a060_create_product_id.sql` 28 | - `a070_create_product_price.sql` 29 | 30 | SQL queries for feature engineering: 31 | - `f010_top_brands.sql` 32 | - `x010_cross_join.sql` 33 | - `f020_label_creation.sql` 34 | - `f030_promo_sensitive_feature.sql` 35 | - `f040_brand_seasonality.sql` 36 | - `f050_create_brand_features.sql` 37 | - `f060_create_overall_features.sql` 38 | - `f070_compute_aov.sql` 39 | - `f080_impute_nulls.sql` 40 | - `f090_type_cast.sql` 41 | 42 | SQL queries for splitting into train/dev/test, generating the baseline and downsampling: 43 | - `t010_train_test_field.sql` 44 | - `t020_train_data.sql` 45 | - `t030_test_data.sql` 46 | - `t040_test_dev_split.sql` 47 | - `t050_dev_data.sql` 48 | - `t060_test_data_final.sql` 49 | - `b010_create_baseline.sql` 50 | - `b020_join_baseline.sql` 51 | - `b030_baseline_metrics.sql` 52 | - `s010_downsampling.sql` 53 | - `s020_downsampled_features.sql` 54 | 55 | Bash script which references the SQL queries and runs them sequentially: 56 | - `bq-processing.sh` 57 | 58 | 59 | ### To execute: 60 | ------ 61 | Please enter the GCP project and dataset name as command-line arguments. 
62 | 63 | 1) Create a dataset in BigQuery 2 | 2) Load the "transactions" and "history" tables from Cloud Storage into this dataset 65 | 3) Execute the script with the GCP project and dataset as arguments 66 | 67 | ``` 68 | bash bq-processing.sh <PROJECT> <DATASET> 69 | ``` 70 | 71 | ### Running time: 72 | ------ 73 | Prefix the above command with `/usr/bin/time` to compute the running time. 74 | 75 | Total running time: **~25 minutes** 76 | -------------------------------------------------------------------------------- /processing-pipeline/a010_impute_missing_values.sql: -------------------------------------------------------------------------------- 1 | -- impute missing values in productmeasure column (11500472 rows in total) 2 | -- create new table based on bash script e.g. "cleaned" 3 | SELECT 4 | id, 5 | chain, 6 | dept, 7 | category, 8 | company, 9 | brand, 10 | date, 11 | productsize, 12 | CASE 13 | WHEN productmeasure IS NULL THEN "Unknown" 14 | ELSE productmeasure 15 | END AS productmeasure, 16 | purchasequantity, 17 | purchaseamount 18 | FROM 19 | `PROJECT.DATASET.transactions` -------------------------------------------------------------------------------- /processing-pipeline/a020_remove_zero_quantity_rows.sql: -------------------------------------------------------------------------------- 1 | -- removing rows where purchasequantity = 0 (this only represents 0.15% of the data) 2 | -- overwrite "cleaned" table 3 | -- expect 349137053 rows left 4 | SELECT 5 | * 6 | FROM 7 | `PROJECT.DATASET.cleaned` 8 | WHERE 9 | purchasequantity != 0 -------------------------------------------------------------------------------- /processing-pipeline/a030_create_returned_flag.sql: -------------------------------------------------------------------------------- 1 | -- creating a flag for returned items 2 | -- overwrite "cleaned" table 3 | SELECT 4 | *, 5 | CASE 6 | WHEN purchasequantity < 0 AND purchaseamount < 0 THEN 1 7 | ELSE 0 8 | END AS returned_flag 9 | FROM 10 | 
`PROJECT.DATASET.cleaned` -------------------------------------------------------------------------------- /processing-pipeline/a040_absolute_values.sql: -------------------------------------------------------------------------------- 1 | -- for rows not flagged as returns (returned_flag = 0), convert purchasequantity and purchaseamount to absolute values 2 | -- overwrite "cleaned" table 3 | SELECT 4 | id, 5 | chain, 6 | dept, 7 | category, 8 | company, 9 | brand, 10 | date, 11 | productsize, 12 | productmeasure, 13 | CASE 14 | WHEN returned_flag = 0 THEN ABS(purchasequantity) 15 | ELSE purchasequantity 16 | END AS purchasequantity, 17 | CASE 18 | WHEN returned_flag = 0 THEN ABS(purchaseamount) 19 | ELSE purchaseamount 20 | END AS purchaseamount, 21 | returned_flag 22 | FROM 23 | `PROJECT.DATASET.cleaned` 24 | 25 | 26 | /* 27 | -- test outcome 28 | SELECT 29 | count(*) 30 | FROM 31 | `PROJECT.DATASET.transactions` 32 | WHERE purchaseamount < 0 33 | 34 | 35 | SELECT 36 | count(*) 37 | FROM 38 | `PROJECT.DATASET.transactions` 39 | WHERE purchasequantity < 0 40 | */ -------------------------------------------------------------------------------- /processing-pipeline/a050_create_transaction_id.sql: -------------------------------------------------------------------------------- 1 | -- creating a pseudo transaction ID by concatenating id, chain and date 2 | -- overwrite "cleaned" table 3 | SELECT 4 | CONCAT(CAST(id AS STRING), "-", CAST(chain AS STRING), "-", CAST(date AS STRING)) AS transaction_id, 5 | * 6 | FROM 7 | `PROJECT.DATASET.cleaned` -------------------------------------------------------------------------------- /processing-pipeline/a060_create_product_id.sql: -------------------------------------------------------------------------------- 1 | -- create pseudo product ID by concatenating brand, productsize and productmeasure 2 | -- overwrite "cleaned" table 3 | SELECT 4 | *, 5 | CONCAT(CAST(brand AS STRING), "-", CAST(productsize AS STRING),
"-", CAST(productmeasure AS STRING)) AS product_id 6 | FROM 7 | `PROJECT.DATASET.cleaned` 8 | -------------------------------------------------------------------------------- /processing-pipeline/a070_create_product_price.sql: -------------------------------------------------------------------------------- 1 | -- extracting individual unit price by dividing the total purchase amount by the quantity 2 | -- overwrite "cleaned" table 3 | SELECT 4 | *, 5 | purchaseamount/purchasequantity AS productprice 6 | FROM 7 | `PROJECT.DATASET.cleaned` -------------------------------------------------------------------------------- /processing-pipeline/b010_create_baseline.sql: -------------------------------------------------------------------------------- 1 | -- create baseline where label is if customer bought in February 2013, they will also buy in March 2013 2 | -- save as "baseline" 3 | 4 | SELECT 5 | * 6 | EXCEPT(id_b, brand_b, target), 7 | CASE 8 | WHEN target IS NULL THEN 0 9 | ELSE target 10 | END AS label 11 | FROM ( 12 | SELECT 13 | customer_id, 14 | brand 15 | FROM 16 | `PROJECT.DATASET.test`) a 17 | LEFT JOIN 18 | -- target for whether customer in test set bought in February 2013 or not 19 | ( 20 | SELECT 21 | CAST(id AS STRING) AS id_b, 22 | CAST(brand AS STRING) AS brand_b, 23 | CASE 24 | WHEN COUNT(*) > 0 THEN 1 25 | ELSE 0 26 | END AS target 27 | FROM 28 | `PROJECT.DATASET.cleaned` 29 | WHERE 30 | date >= '2013-02-01' 31 | AND date < DATE_ADD(DATE(CAST('2013-02-01' AS TIMESTAMP)), INTERVAL 1 MONTH) 32 | AND returned_flag = 0 33 | GROUP BY 34 | id_b, 35 | brand_b ) b 36 | ON 37 | a.customer_id = b.id_b 38 | AND a.brand = b.brand_b -------------------------------------------------------------------------------- /processing-pipeline/b020_join_baseline.sql: -------------------------------------------------------------------------------- 1 | -- overwrite baseline table 2 | SELECT 3 | customer_id, 4 | brand, 5 | label, 6 | predicted_label 7 | FROM ( 8 | SELECT 9 | 
customer_id, 10 | brand, 11 | label 12 | FROM 13 | `PROJECT.DATASET.test` ) a 14 | INNER JOIN ( 15 | SELECT 16 | CAST(customer_id AS STRING) AS customer_id_b, 17 | CAST(brand AS STRING) AS brand_b, 18 | label AS predicted_label 19 | FROM 20 | `PROJECT.DATASET.baseline` ) b 21 | ON 22 | a.customer_id = b.customer_id_b 23 | AND a.brand = b.brand_b -------------------------------------------------------------------------------- /processing-pipeline/b030_baseline_metrics.sql: -------------------------------------------------------------------------------- 1 | -- save as baseline_metrics 2 | SELECT 3 | *, 4 | 2*((precision*recall)/(precision+recall)) AS f1, 5 | (1 + recall)/2 AS auc -- single-threshold AUC approximation; exact only when fpr = 0 6 | FROM ( 7 | SELECT 8 | *, 9 | (true_positive + true_negative) / total_points AS accuracy, 10 | true_positive / (true_positive + false_negative) AS recall, 11 | true_positive / (true_positive + false_positive) AS precision, 12 | false_positive / (false_positive + true_negative) AS fpr 13 | FROM ( 14 | SELECT 15 | SUM(CASE 16 | WHEN predicted_label = 0 AND label = 0 THEN 1 17 | ELSE 0 END) AS true_negative, 18 | SUM(CASE 19 | WHEN predicted_label = 1 AND label = 1 THEN 1 20 | ELSE 0 END) AS true_positive, 21 | SUM(CASE 22 | WHEN predicted_label = 0 AND label = 1 THEN 1 23 | ELSE 0 END) AS false_negative, 24 | SUM(CASE 25 | WHEN predicted_label = 1 AND label = 0 THEN 1 26 | ELSE 0 END) AS false_positive, 27 | SUM(CASE 28 | WHEN label = 0 THEN 1 29 | ELSE 0 END) AS zeros, 30 | SUM(CASE 31 | WHEN label = 1 THEN 1 32 | ELSE 0 END) AS ones, 33 | COUNT(*) AS total_points 34 | FROM 35 | `PROJECT.DATASET.baseline` ) ) -------------------------------------------------------------------------------- /processing-pipeline/bq-processing.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | # prefix the invocation with /usr/bin/time to measure how long the script takes 3 | 4 | # script to run the data processing and feature engineering on the dataset 5 | 6 | # enter GCP
project and BigQuery dataset name as command line args 7 | # ensure that the "transactions" and "history" table are in this dataset 8 | project=$1 9 | dataset_name=$2 10 | 11 | dataset="${project}:${dataset_name}" 12 | 13 | # table names 14 | processed_table="cleaned" 15 | top_brands="top_brands" 16 | crossed_table="customerxbrand" 17 | feature_table="features" 18 | train="train" 19 | test="test" 20 | dev="dev" 21 | baseline="baseline" 22 | baseline_metrics="baseline_metrics" 23 | downsampled="downsampled" 24 | 25 | # target month for prediction 26 | target_month="2013-03-01" 27 | 28 | : ' 29 | Data processing and creation of reference tables. 30 | ' 31 | # imputing missing values for products 32 | sed "s/PROJECT/$project/g;s/DATASET/$dataset_name/g" a010_impute_missing_values.sql > a010_impute_missing_values_temp.sql 33 | cat a010_impute_missing_values_temp.sql | bq query --destination_table=$dataset.$processed_table --use_legacy_sql=false --allow_large_results 34 | rm a010_impute_missing_values_temp.sql 35 | 36 | # removing rows where product quantity is zero 37 | sed "s/PROJECT/$project/g;s/DATASET/$dataset_name/g" a020_remove_zero_quantity_rows.sql > a020_remove_zero_quantity_rows_temp.sql 38 | cat a020_remove_zero_quantity_rows_temp.sql | bq query --destination_table=$dataset.$processed_table --use_legacy_sql=false --allow_large_results --replace=true 39 | rm a020_remove_zero_quantity_rows_temp.sql 40 | 41 | # creating variable for whether transaction is to return items or not 42 | sed "s/PROJECT/$project/g;s/DATASET/$dataset_name/g" a030_create_returned_flag.sql > a030_create_returned_flag_temp.sql 43 | cat a030_create_returned_flag_temp.sql | bq query --destination_table=$dataset.$processed_table --use_legacy_sql=false --allow_large_results --replace=true 44 | rm a030_create_returned_flag_temp.sql 45 | 46 | # convert product quantity and purchase amount to absolute when not marked as return 47 | sed "s/PROJECT/$project/g;s/DATASET/$dataset_name/g" 
a040_absolute_values.sql > a040_absolute_values_temp.sql 48 | cat a040_absolute_values_temp.sql | bq query --destination_table=$dataset.$processed_table --use_legacy_sql=false --allow_large_results --replace=true 49 | rm a040_absolute_values_temp.sql 50 | 51 | # creating pseudo transaction ID based on concatenation of columns 52 | sed "s/PROJECT/$project/g;s/DATASET/$dataset_name/g" a050_create_transaction_id.sql > a050_create_transaction_id_temp.sql 53 | cat a050_create_transaction_id_temp.sql | bq query --destination_table=$dataset.$processed_table --use_legacy_sql=false --allow_large_results --replace=true 54 | rm a050_create_transaction_id_temp.sql 55 | 56 | # creating pseudo product ID based on concatenation of columns 57 | sed "s/PROJECT/$project/g;s/DATASET/$dataset_name/g" a060_create_product_id.sql > a060_create_product_id_temp.sql 58 | cat a060_create_product_id_temp.sql | bq query --destination_table=$dataset.$processed_table --use_legacy_sql=false --allow_large_results --replace=true 59 | rm a060_create_product_id_temp.sql 60 | 61 | # creating product price for each unit as purchaseamount is multiple of quantity 62 | sed "s/PROJECT/$project/g;s/DATASET/$dataset_name/g" a070_create_product_price.sql > a070_create_product_price_temp.sql 63 | cat a070_create_product_price_temp.sql | bq query --destination_table=$dataset.$processed_table --use_legacy_sql=false --allow_large_results --replace=true 64 | rm a070_create_product_price_temp.sql 65 | 66 | : ' 67 | Feature engineering for demographic data. 
68 | ' 69 | # extract top 1000 brands by volume of transactions 70 | sed "s/TRGT_MONTH/$target_month/g;s/PROJECT/$project/g;s/DATASET/$dataset_name/g" f010_top_brands.sql > f010_top_brands_temp.sql 71 | cat f010_top_brands_temp.sql | bq query --destination_table=$dataset.$top_brands --use_legacy_sql=false --allow_large_results 72 | rm f010_top_brands_temp.sql 73 | 74 | # cross join between top 1000 brands and customers who have transacted at least once with these brands 75 | sed "s/TRGT_MONTH/$target_month/g;s/PROJECT/$project/g;s/DATASET/$dataset_name/g" x010_cross_join.sql > x010_cross_join_temp.sql 76 | cat x010_cross_join_temp.sql | bq query --destination_table=$dataset.$crossed_table --use_legacy_sql=false --allow_large_results 77 | rm x010_cross_join_temp.sql 78 | 79 | # generate binary target variable 80 | sed "s/TRGT_MONTH/$target_month/g;s/PROJECT/$project/g;s/DATASET/$dataset_name/g" f020_label_creation.sql > f020_label_creation_temp.sql 81 | cat f020_label_creation_temp.sql | bq query --destination_table=$dataset.$feature_table --use_legacy_sql=false --allow_large_results 82 | rm f020_label_creation_temp.sql 83 | 84 | # generate promo sensitive feature 85 | sed "s/PROJECT/$project/g;s/DATASET/$dataset_name/g" f030_promo_sensitive_feature.sql > f030_promo_sensitive_feature_temp.sql 86 | cat f030_promo_sensitive_feature_temp.sql | bq query --destination_table=$dataset.$feature_table --use_legacy_sql=false --allow_large_results --replace=true 87 | rm f030_promo_sensitive_feature_temp.sql 88 | 89 | # generate brand seasonality feature 90 | sed "s/TRGT_MONTH/$target_month/g;s/PROJECT/$project/g;s/DATASET/$dataset_name/g" f040_brand_seasonality.sql > f040_brand_seasonality_temp.sql 91 | cat f040_brand_seasonality_temp.sql | bq query --destination_table=$dataset.$feature_table --use_legacy_sql=false --allow_large_results --replace=true 92 | rm f040_brand_seasonality_temp.sql 93 | 94 | # generation of brand-related behavioural features 95 | sed 
"s/TRGT_MONTH/$target_month/g;s/PROJECT/$project/g;s/DATASET/$dataset_name/g" f050_create_brand_features.sql > f050_create_brand_features_temp.sql 96 | cat f050_create_brand_features_temp.sql | bq query --destination_table=$dataset.$feature_table --use_legacy_sql=false --allow_large_results --replace=true 97 | rm f050_create_brand_features_temp.sql 98 | 99 | # generation of behavioural features across all brands 100 | sed "s/TRGT_MONTH/$target_month/g;s/PROJECT/$project/g;s/DATASET/$dataset_name/g" f060_create_overall_features.sql > f060_create_overall_features_temp.sql 101 | cat f060_create_overall_features_temp.sql | bq query --destination_table=$dataset.$feature_table --use_legacy_sql=false --allow_large_results --replace=true 102 | rm f060_create_overall_features_temp.sql 103 | 104 | # compute AOV features 105 | sed "s/PROJECT/$project/g;s/DATASET/$dataset_name/g" f070_compute_aov.sql > f070_compute_aov_temp.sql 106 | cat f070_compute_aov_temp.sql | bq query --destination_table=$dataset.$feature_table --use_legacy_sql=false --allow_large_results --replace=true 107 | rm f070_compute_aov_temp.sql 108 | 109 | # impute NULLs with zero 110 | sed "s/PROJECT/$project/g;s/DATASET/$dataset_name/g" f080_impute_nulls.sql > f080_impute_nulls_temp.sql 111 | cat f080_impute_nulls_temp.sql | bq query --destination_table=$dataset.$feature_table --use_legacy_sql=false --allow_large_results --replace=true 112 | rm f080_impute_nulls_temp.sql 113 | 114 | # type cast 115 | sed "s/PROJECT/$project/g;s/DATASET/$dataset_name/g" f090_type_cast.sql > f090_type_cast_temp.sql 116 | cat f090_type_cast_temp.sql | bq query --destination_table=$dataset.$feature_table --use_legacy_sql=false --allow_large_results --replace=true 117 | rm f090_type_cast_temp.sql 118 | 119 | : ' 120 | Creation of train / test set 121 | ' 122 | # create train / test field 123 | sed "s/PROJECT/$project/g;s/DATASET/$dataset_name/g" t010_train_test_field.sql > t010_train_test_field_temp.sql 124 | cat 
t010_train_test_field_temp.sql | bq query --destination_table=$dataset.$feature_table --use_legacy_sql=false --allow_large_results --replace=true 125 | rm t010_train_test_field_temp.sql 126 | 127 | # create training data 128 | sed "s/PROJECT/$project/g;s/DATASET/$dataset_name/g" t020_train_data.sql > t020_train_data_temp.sql 129 | cat t020_train_data_temp.sql | bq query --destination_table=$dataset.$train --use_legacy_sql=false --allow_large_results 130 | rm t020_train_data_temp.sql 131 | 132 | # create testing data 133 | sed "s/PROJECT/$project/g;s/DATASET/$dataset_name/g" t030_test_data.sql > t030_test_data_temp.sql 134 | cat t030_test_data_temp.sql | bq query --destination_table=$dataset.$test --use_legacy_sql=false --allow_large_results 135 | rm t030_test_data_temp.sql 136 | 137 | # create test / dev field 138 | sed "s/PROJECT/$project/g;s/DATASET/$dataset_name/g" t040_test_dev_split.sql > t040_test_dev_split_temp.sql 139 | cat t040_test_dev_split_temp.sql | bq query --destination_table=$dataset.$test --use_legacy_sql=true --allow_large_results --replace=true 140 | rm t040_test_dev_split_temp.sql 141 | 142 | # create dev data 143 | sed "s/PROJECT/$project/g;s/DATASET/$dataset_name/g" t050_dev_data.sql > t050_dev_data_temp.sql 144 | cat t050_dev_data_temp.sql | bq query --destination_table=$dataset.$dev --use_legacy_sql=false --allow_large_results 145 | rm t050_dev_data_temp.sql 146 | 147 | # create final test data 148 | sed "s/PROJECT/$project/g;s/DATASET/$dataset_name/g" t060_test_data_final.sql > t060_test_data_final_temp.sql 149 | cat t060_test_data_final_temp.sql | bq query --destination_table=$dataset.$test --use_legacy_sql=false --allow_large_results --replace=true 150 | rm t060_test_data_final_temp.sql 151 | 152 | # create baseline 153 | sed "s/PROJECT/$project/g;s/DATASET/$dataset_name/g" b010_create_baseline.sql > b010_create_baseline_temp.sql 154 | cat b010_create_baseline_temp.sql | bq query --destination_table=$dataset.$baseline 
--use_legacy_sql=false --allow_large_results 155 | rm b010_create_baseline_temp.sql 156 | 157 | # join true labels onto baseline 158 | sed "s/PROJECT/$project/g;s/DATASET/$dataset_name/g" b020_join_baseline.sql > b020_join_baseline_temp.sql 159 | cat b020_join_baseline_temp.sql | bq query --destination_table=$dataset.$baseline --use_legacy_sql=false --allow_large_results --replace=true 160 | rm b020_join_baseline_temp.sql 161 | 162 | # compute baseline metrics (accuracy, precision, recall, auc) 163 | sed "s/PROJECT/$project/g;s/DATASET/$dataset_name/g" b030_baseline_metrics.sql > b030_baseline_metrics_temp.sql 164 | cat b030_baseline_metrics_temp.sql | bq query --destination_table=$dataset.$baseline_metrics --use_legacy_sql=false --allow_large_results 165 | rm b030_baseline_metrics_temp.sql 166 | 167 | # downsampling majority class in training set to 1:1 ratio 168 | sed "s/PROJECT/$project/g;s/DATASET/$dataset_name/g" s010_downsampling.sql > s010_downsampling_temp.sql 169 | cat s010_downsampling_temp.sql | bq query --destination_table=$dataset.$downsampled --use_legacy_sql=true --allow_large_results 170 | rm s010_downsampling_temp.sql 171 | 172 | # joining corresponding features to downsampled customer/brand combinations 173 | sed "s/PROJECT/$project/g;s/DATASET/$dataset_name/g" s020_downsampled_features.sql > s020_downsampled_features_temp.sql 174 | cat s020_downsampled_features_temp.sql | bq query --destination_table=$dataset.$downsampled --use_legacy_sql=false --allow_large_results --replace=true 175 | rm s020_downsampled_features_temp.sql -------------------------------------------------------------------------------- /processing-pipeline/f010_top_brands.sql: -------------------------------------------------------------------------------- 1 | -- extract top 1000 brands by number of purchases 2 | -- save as "top_brands" table 3 | SELECT 4 | brand 5 | FROM ( 6 | SELECT 7 | brand, 8 | COUNT(*) AS num_purchases 9 | FROM 10 | `PROJECT.DATASET.cleaned` 11 | WHERE 12 
| date < 'TRGT_MONTH' 13 | GROUP BY 14 | brand 15 | ORDER BY 16 | num_purchases DESC 17 | LIMIT 18 | 1000 ) 19 | -------------------------------------------------------------------------------- /processing-pipeline/f020_label_creation.sql: -------------------------------------------------------------------------------- 1 | -- label generation 2 | -- create table "features" 3 | SELECT 4 | * 5 | EXCEPT(id_b, brand_b, target), 6 | CASE 7 | WHEN target IS NULL THEN 0 8 | ELSE target 9 | END AS label 10 | FROM 11 | `PROJECT.DATASET.customerxbrand` a 12 | LEFT JOIN 13 | -- target for whether customer bought in March 2013 or not 14 | ( 15 | SELECT 16 | id AS id_b, 17 | brand AS brand_b, 18 | CASE 19 | WHEN COUNT(*) > 0 THEN 1 20 | ELSE 0 21 | END AS target 22 | FROM 23 | `PROJECT.DATASET.cleaned` 24 | WHERE 25 | date >= 'TRGT_MONTH' 26 | AND date < DATE_ADD(DATE(CAST('TRGT_MONTH' AS TIMESTAMP)), INTERVAL 1 MONTH) 27 | AND returned_flag = 0 28 | GROUP BY 29 | id_b, 30 | brand_b ) b 31 | ON 32 | a.customer_id = b.id_b 33 | AND a.brand = b.brand_b -------------------------------------------------------------------------------- /processing-pipeline/f030_promo_sensitive_feature.sql: -------------------------------------------------------------------------------- 1 | -- promotional sensitive feature 2 | -- overwrite "features" table 3 | 4 | SELECT 5 | * EXCEPT(id, repeater) 6 | FROM ( 7 | SELECT 8 | *, 9 | CASE 10 | WHEN repeater = "t" THEN 1 11 | ELSE 0 12 | END AS promo_sensitive 13 | FROM 14 | `PROJECT.DATASET.features` a 15 | LEFT JOIN ( 16 | SELECT 17 | repeater, 18 | id 19 | FROM 20 | `PROJECT.DATASET.history` ) b 21 | ON 22 | a.customer_id = b.id ) -------------------------------------------------------------------------------- /processing-pipeline/f040_brand_seasonality.sql: -------------------------------------------------------------------------------- 1 | -- brand seasonality feature (based on how many months they have bought in last 12 months) 2 | -- overwrite 
"features" 3 | SELECT 4 | * 5 | EXCEPT(id_b, brand_b, brand_seasonality), 6 | CASE 7 | WHEN brand_seasonality IS NULL THEN 0 8 | ELSE brand_seasonality 9 | END AS brand_seasonality 10 | FROM 11 | `PROJECT.DATASET.features` a 12 | LEFT JOIN 13 | -- seasonality 14 | ( 15 | SELECT 16 | id AS id_b, 17 | brand AS brand_b, 18 | COUNT(DISTINCT(EXTRACT(MONTH FROM date))) AS brand_seasonality 19 | FROM 20 | `PROJECT.DATASET.cleaned` 21 | WHERE 22 | date < 'TRGT_MONTH' 23 | AND date >= DATE_ADD(DATE(CAST('TRGT_MONTH' AS TIMESTAMP)), INTERVAL -12 MONTH) 24 | GROUP BY 25 | id_b, 26 | brand_b ) b 27 | ON 28 | a.customer_id = b.id_b 29 | AND a.brand = b.brand_b -------------------------------------------------------------------------------- /processing-pipeline/f050_create_brand_features.sql: -------------------------------------------------------------------------------- 1 | -- generation of brand-related behavioural features 2 | -- overwrite "features" table 3 | SELECT 4 | * 5 | EXCEPT(id_b, brand_b, id_c, brand_c, id_d, brand_d, id_e, brand_e) 6 | FROM 7 | `PROJECT.DATASET.features` a 8 | LEFT JOIN 9 | -- 1M 10 | ( 11 | SELECT 12 | id AS id_b, 13 | brand AS brand_b, 14 | COUNT(DISTINCT(transaction_id)) AS total_transactions_1m, 15 | SUM(CASE 16 | WHEN returned_flag = 0 THEN purchaseamount 17 | ELSE 0 END) AS total_sale_amount_1m, 18 | SUM(CASE 19 | WHEN returned_flag = 0 THEN purchasequantity 20 | ELSE 0 END) AS total_sale_quantity_1m, 21 | MAX(CASE 22 | WHEN returned_flag = 0 THEN purchaseamount 23 | ELSE 0 END) AS max_sale_amount_1m, 24 | MAX(CASE 25 | WHEN returned_flag = 0 THEN purchasequantity 26 | ELSE 0 END) AS max_sale_quantity_1m, 27 | SUM(returned_flag) AS total_returned_items_1m, 28 | COUNT(DISTINCT(product_id)) AS distinct_products_1m, 29 | COUNT(DISTINCT(chain)) AS distinct_chains_1m, 30 | COUNT(DISTINCT(category)) AS distinct_category_1m, 31 | COUNT(DISTINCT(date)) AS distinct_days_shopped_1m 32 | FROM 33 | `PROJECT.DATASET.cleaned` 34 | WHERE 35 | date < 
'TRGT_MONTH' 36 | AND date >= DATE_ADD(DATE(CAST('TRGT_MONTH' AS TIMESTAMP)), INTERVAL -1 MONTH) 37 | GROUP BY 38 | id_b, 39 | brand_b ) b 40 | ON 41 | a.customer_id = b.id_b 42 | AND a.brand = b.brand_b 43 | LEFT JOIN 44 | -- 3M 45 | ( 46 | SELECT 47 | id AS id_c, 48 | brand AS brand_c, 49 | COUNT(DISTINCT(transaction_id)) AS total_transactions_3m, 50 | SUM(CASE 51 | WHEN returned_flag = 0 THEN purchaseamount 52 | ELSE 0 END) AS total_sale_amount_3m, 53 | SUM(CASE 54 | WHEN returned_flag = 0 THEN purchasequantity 55 | ELSE 0 END) AS total_sale_quantity_3m, 56 | MAX(CASE 57 | WHEN returned_flag = 0 THEN purchaseamount 58 | ELSE 0 END) AS max_sale_amount_3m, 59 | MAX(CASE 60 | WHEN returned_flag = 0 THEN purchasequantity 61 | ELSE 0 END) AS max_sale_quantity_3m, 62 | SUM(returned_flag) AS total_returned_items_3m, 63 | COUNT(DISTINCT(product_id)) AS distinct_products_3m, 64 | COUNT(DISTINCT(chain)) AS distinct_chains_3m, 65 | COUNT(DISTINCT(category)) AS distinct_category_3m, 66 | COUNT(DISTINCT(date)) AS distinct_days_shopped_3m 67 | FROM 68 | `PROJECT.DATASET.cleaned` 69 | WHERE 70 | date < 'TRGT_MONTH' 71 | AND date >= DATE_ADD(DATE(CAST('TRGT_MONTH' AS TIMESTAMP)), INTERVAL -3 MONTH) 72 | GROUP BY 73 | id_c, 74 | brand_c ) c 75 | ON 76 | a.customer_id = c.id_c 77 | AND a.brand = c.brand_c 78 | LEFT JOIN 79 | -- 6M 80 | ( 81 | SELECT 82 | id AS id_d, 83 | brand AS brand_d, 84 | COUNT(DISTINCT(transaction_id)) AS total_transactions_6m, 85 | SUM(CASE 86 | WHEN returned_flag = 0 THEN purchaseamount 87 | ELSE 0 END) AS total_sale_amount_6m, 88 | SUM(CASE 89 | WHEN returned_flag = 0 THEN purchasequantity 90 | ELSE 0 END) AS total_sale_quantity_6m, 91 | MAX(CASE 92 | WHEN returned_flag = 0 THEN purchaseamount 93 | ELSE 0 END) AS max_sale_amount_6m, 94 | MAX(CASE 95 | WHEN returned_flag = 0 THEN purchasequantity 96 | ELSE 0 END) AS max_sale_quantity_6m, 97 | SUM(returned_flag) AS total_returned_items_6m, 98 | COUNT(DISTINCT(product_id)) AS distinct_products_6m, 99 | 
COUNT(DISTINCT(chain)) AS distinct_chains_6m, 100 | COUNT(DISTINCT(category)) AS distinct_category_6m, 101 | COUNT(DISTINCT(date)) AS distinct_days_shopped_6m 102 | FROM 103 | `PROJECT.DATASET.cleaned` 104 | WHERE 105 | date < 'TRGT_MONTH' 106 | AND date >= DATE_ADD(DATE(CAST('TRGT_MONTH' AS TIMESTAMP)), INTERVAL -6 MONTH) 107 | GROUP BY 108 | id_d, 109 | brand_d ) d 110 | ON 111 | a.customer_id = d.id_d 112 | AND a.brand = d.brand_d 113 | LEFT JOIN 114 | -- 12M 115 | ( 116 | SELECT 117 | id AS id_e, 118 | brand AS brand_e, 119 | COUNT(DISTINCT(transaction_id)) AS total_transactions_12m, 120 | SUM(CASE 121 | WHEN returned_flag = 0 THEN purchaseamount 122 | ELSE 0 END) AS total_sale_amount_12m, 123 | SUM(CASE 124 | WHEN returned_flag = 0 THEN purchasequantity 125 | ELSE 0 END) AS total_sale_quantity_12m, 126 | MAX(CASE 127 | WHEN returned_flag = 0 THEN purchaseamount 128 | ELSE 0 END) AS max_sale_amount_12m, 129 | MAX(CASE 130 | WHEN returned_flag = 0 THEN purchasequantity 131 | ELSE 0 END) AS max_sale_quantity_12m, 132 | SUM(returned_flag) AS total_returned_items_12m, 133 | COUNT(DISTINCT(product_id)) AS distinct_products_12m, 134 | COUNT(DISTINCT(chain)) AS distinct_chains_12m, 135 | COUNT(DISTINCT(category)) AS distinct_category_12m, 136 | COUNT(DISTINCT(date)) AS distinct_days_shopped_12m 137 | FROM 138 | `PROJECT.DATASET.cleaned` 139 | WHERE 140 | date < 'TRGT_MONTH' 141 | AND date >= DATE_ADD(DATE(CAST('TRGT_MONTH' AS TIMESTAMP)), INTERVAL -12 MONTH) 142 | GROUP BY 143 | id_e, 144 | brand_e ) e 145 | ON 146 | a.customer_id = e.id_e 147 | AND a.brand = e.brand_e -------------------------------------------------------------------------------- /processing-pipeline/f060_create_overall_features.sql: -------------------------------------------------------------------------------- 1 | -- generation of behavioural features across all brands 2 | -- overwrite "features" table 3 | 4 | SELECT 5 | * 6 | EXCEPT(id_b, id_c, id_d, id_e) 7 | FROM 8 | `PROJECT.DATASET.features` 
a 9 | LEFT JOIN 10 | -- 1M 11 | ( 12 | SELECT 13 | id AS id_b, 14 | COUNT(DISTINCT(transaction_id)) AS overall_transactions_1m, 15 | SUM(CASE 16 | WHEN returned_flag = 0 THEN purchaseamount 17 | ELSE 0 END) AS overall_sale_amount_1m, 18 | SUM(CASE 19 | WHEN returned_flag = 0 THEN purchasequantity 20 | ELSE 0 END) AS overall_sale_quantity_1m, 21 | SUM(returned_flag) AS overall_returned_items_1m, 22 | COUNT(DISTINCT(product_id)) AS overall_distinct_products_1m, 23 | COUNT(DISTINCT(brand)) AS overall_distinct_brands_1m, 24 | COUNT(DISTINCT(chain)) AS overall_distinct_chains_1m, 25 | COUNT(DISTINCT(category)) AS overall_distinct_category_1m, 26 | COUNT(DISTINCT(date)) AS overall_distinct_days_shopped_1m 27 | FROM 28 | `PROJECT.DATASET.cleaned` 29 | WHERE 30 | date < 'TRGT_MONTH' 31 | AND date >= DATE_ADD(DATE(CAST('TRGT_MONTH' AS TIMESTAMP)), INTERVAL -1 MONTH) 32 | GROUP BY 33 | id_b ) b 34 | ON 35 | a.customer_id = b.id_b 36 | LEFT JOIN 37 | -- 3M 38 | ( 39 | SELECT 40 | id AS id_c, 41 | COUNT(DISTINCT(transaction_id)) AS overall_transactions_3m, 42 | SUM(CASE 43 | WHEN returned_flag = 0 THEN purchaseamount 44 | ELSE 0 END) AS overall_sale_amount_3m, 45 | SUM(CASE 46 | WHEN returned_flag = 0 THEN purchasequantity 47 | ELSE 0 END) AS overall_sale_quantity_3m, 48 | SUM(returned_flag) AS overall_returned_items_3m, 49 | COUNT(DISTINCT(product_id)) AS overall_distinct_products_3m, 50 | COUNT(DISTINCT(brand)) AS overall_distinct_brands_3m, 51 | COUNT(DISTINCT(chain)) AS overall_distinct_chains_3m, 52 | COUNT(DISTINCT(category)) AS overall_distinct_category_3m, 53 | COUNT(DISTINCT(date)) AS overall_distinct_days_shopped_3m 54 | FROM 55 | `PROJECT.DATASET.cleaned` 56 | WHERE 57 | date < 'TRGT_MONTH' 58 | AND date >= DATE_ADD(DATE(CAST('TRGT_MONTH' AS TIMESTAMP)), INTERVAL -3 MONTH) 59 | GROUP BY 60 | id_c ) c 61 | ON 62 | a.customer_id = c.id_c 63 | LEFT JOIN 64 | -- 6M 65 | ( 66 | SELECT 67 | id AS id_d, 68 | COUNT(DISTINCT(transaction_id)) AS overall_transactions_6m, 69 | 
SUM(CASE 70 | WHEN returned_flag = 0 THEN purchaseamount 71 | ELSE 0 END) AS overall_sale_amount_6m, 72 | SUM(CASE 73 | WHEN returned_flag = 0 THEN purchasequantity 74 | ELSE 0 END) AS overall_sale_quantity_6m, 75 | SUM(returned_flag) AS overall_returned_items_6m, 76 | COUNT(DISTINCT(product_id)) AS overall_distinct_products_6m, 77 | COUNT(DISTINCT(brand)) AS overall_distinct_brands_6m, 78 | COUNT(DISTINCT(chain)) AS overall_distinct_chains_6m, 79 | COUNT(DISTINCT(category)) AS overall_distinct_category_6m, 80 | COUNT(DISTINCT(date)) AS overall_distinct_days_shopped_6m 81 | FROM 82 | `PROJECT.DATASET.cleaned` 83 | WHERE 84 | date < 'TRGT_MONTH' 85 | AND date >= DATE_ADD(DATE(CAST('TRGT_MONTH' AS TIMESTAMP)), INTERVAL -6 MONTH) 86 | GROUP BY 87 | id_d ) d 88 | ON 89 | a.customer_id = d.id_d 90 | LEFT JOIN 91 | -- 12M 92 | ( 93 | SELECT 94 | id AS id_e, 95 | COUNT(DISTINCT(transaction_id)) AS overall_transactions_12m, 96 | SUM(CASE 97 | WHEN returned_flag = 0 THEN purchaseamount 98 | ELSE 0 END) AS overall_sale_amount_12m, 99 | SUM(CASE 100 | WHEN returned_flag = 0 THEN purchasequantity 101 | ELSE 0 END) AS overall_sale_quantity_12m, 102 | SUM(returned_flag) AS overall_returned_items_12m, 103 | COUNT(DISTINCT(product_id)) AS overall_distinct_products_12m, 104 | COUNT(DISTINCT(brand)) AS overall_distinct_brands_12m, 105 | COUNT(DISTINCT(chain)) AS overall_distinct_chains_12m, 106 | COUNT(DISTINCT(category)) AS overall_distinct_category_12m, 107 | COUNT(DISTINCT(date)) AS overall_distinct_days_shopped_12m 108 | FROM 109 | `PROJECT.DATASET.cleaned` 110 | WHERE 111 | date < 'TRGT_MONTH' 112 | AND date >= DATE_ADD(DATE(CAST('TRGT_MONTH' AS TIMESTAMP)), INTERVAL -12 MONTH) 113 | GROUP BY 114 | id_e ) e 115 | ON 116 | a.customer_id = e.id_e -------------------------------------------------------------------------------- /processing-pipeline/f070_compute_aov.sql: -------------------------------------------------------------------------------- 1 | -- compute AOV by 
total_sale_amount_[x]/total_transactions_[x] 2 | -- this reduces 16 features down to 8 3 | -- overwrite "features" 4 | 5 | SELECT 6 | *, 7 | total_sale_amount_1m/total_transactions_1m AS aov_1m, 8 | total_sale_amount_3m/total_transactions_3m AS aov_3m, 9 | total_sale_amount_6m/total_transactions_6m AS aov_6m, 10 | total_sale_amount_12m/total_transactions_12m AS aov_12m, 11 | overall_sale_amount_1m/overall_transactions_1m AS overall_aov_1m, 12 | overall_sale_amount_3m/overall_transactions_3m AS overall_aov_3m, 13 | overall_sale_amount_6m/overall_transactions_6m AS overall_aov_6m, 14 | overall_sale_amount_12m/overall_transactions_12m AS overall_aov_12m 15 | FROM 16 | `PROJECT.DATASET.features` -------------------------------------------------------------------------------- /processing-pipeline/f080_impute_nulls.sql: -------------------------------------------------------------------------------- 1 | -- impute NULL values with zero 2 | -- overwrite "features" 3 | 4 | SELECT 5 | customer_id, 6 | brand, 7 | promo_sensitive, 8 | brand_seasonality, 9 | IFNULL( total_sale_quantity_1m, 0) AS total_sale_quantity_1m , 10 | IFNULL( max_sale_amount_1m, 0 ) AS max_sale_amount_1m , 11 | IFNULL( max_sale_quantity_1m, 0 ) AS max_sale_quantity_1m , 12 | IFNULL( total_returned_items_1m, 0 ) AS total_returned_items_1m , 13 | IFNULL( distinct_products_1m, 0 ) AS distinct_products_1m , 14 | IFNULL( distinct_chains_1m, 0 ) AS distinct_chains_1m , 15 | IFNULL( distinct_category_1m, 0 ) AS distinct_category_1m , 16 | IFNULL( distinct_days_shopped_1m, 0 ) AS distinct_days_shopped_1m , 17 | IFNULL( aov_1m, 0 ) AS aov_1m, 18 | IFNULL( total_sale_quantity_3m, 0 ) AS total_sale_quantity_3m , 19 | IFNULL( max_sale_amount_3m, 0 ) AS max_sale_amount_3m , 20 | IFNULL( max_sale_quantity_3m, 0 ) AS max_sale_quantity_3m , 21 | IFNULL( total_returned_items_3m, 0 ) AS total_returned_items_3m , 22 | IFNULL( distinct_products_3m, 0 ) AS distinct_products_3m , 23 | IFNULL( distinct_chains_3m, 0 ) AS 
distinct_chains_3m , 24 | IFNULL( distinct_category_3m, 0 ) AS distinct_category_3m , 25 | IFNULL( distinct_days_shopped_3m, 0 ) AS distinct_days_shopped_3m , 26 | IFNULL( aov_3m, 0 ) AS aov_3m, 27 | IFNULL( total_sale_quantity_6m, 0 ) AS total_sale_quantity_6m , 28 | IFNULL( max_sale_amount_6m, 0 ) AS max_sale_amount_6m , 29 | IFNULL( max_sale_quantity_6m, 0 ) AS max_sale_quantity_6m , 30 | IFNULL( total_returned_items_6m, 0 ) AS total_returned_items_6m , 31 | IFNULL( distinct_products_6m, 0 ) AS distinct_products_6m , 32 | IFNULL( distinct_chains_6m, 0 ) AS distinct_chains_6m , 33 | IFNULL( distinct_category_6m, 0 ) AS distinct_category_6m , 34 | IFNULL( distinct_days_shopped_6m, 0 ) AS distinct_days_shopped_6m , 35 | IFNULL( aov_6m, 0 ) AS aov_6m, 36 | IFNULL( total_sale_quantity_12m, 0 ) AS total_sale_quantity_12m , 37 | IFNULL( max_sale_amount_12m, 0 ) AS max_sale_amount_12m , 38 | IFNULL( max_sale_quantity_12m, 0 ) AS max_sale_quantity_12m , 39 | IFNULL( total_returned_items_12m, 0 ) AS total_returned_items_12m , 40 | IFNULL( distinct_products_12m, 0 ) AS distinct_products_12m , 41 | IFNULL( distinct_chains_12m, 0 ) AS distinct_chains_12m , 42 | IFNULL( distinct_category_12m, 0 ) AS distinct_category_12m , 43 | IFNULL( distinct_days_shopped_12m, 0 ) AS distinct_days_shopped_12m , 44 | IFNULL( aov_12m, 0 ) AS aov_12m, 45 | IFNULL( overall_sale_quantity_1m, 0 ) AS overall_sale_quantity_1m , 46 | IFNULL( overall_returned_items_1m, 0 ) AS overall_returned_items_1m , 47 | IFNULL( overall_distinct_products_1m, 0 ) AS overall_distinct_products_1m , 48 | IFNULL( overall_distinct_brands_1m, 0 ) AS overall_distinct_brands_1m , 49 | IFNULL( overall_distinct_chains_1m, 0 ) AS overall_distinct_chains_1m , 50 | IFNULL( overall_distinct_category_1m, 0 ) AS overall_distinct_category_1m , 51 | IFNULL( overall_distinct_days_shopped_1m, 0 ) AS overall_distinct_days_shopped_1m , 52 | IFNULL( overall_aov_1m, 0 ) AS overall_aov_1m , 53 | IFNULL( overall_sale_quantity_3m, 0 ) AS 
overall_sale_quantity_3m , 54 | IFNULL( overall_returned_items_3m, 0 ) AS overall_returned_items_3m , 55 | IFNULL( overall_distinct_products_3m, 0 ) AS overall_distinct_products_3m , 56 | IFNULL( overall_distinct_brands_3m, 0 ) AS overall_distinct_brands_3m , 57 | IFNULL( overall_distinct_chains_3m, 0 ) AS overall_distinct_chains_3m , 58 | IFNULL( overall_distinct_category_3m, 0 ) AS overall_distinct_category_3m , 59 | IFNULL( overall_distinct_days_shopped_3m, 0 ) AS overall_distinct_days_shopped_3m , 60 | IFNULL( overall_aov_3m, 0 ) AS overall_aov_3m , 61 | IFNULL( overall_sale_quantity_6m, 0 ) AS overall_sale_quantity_6m , 62 | IFNULL( overall_returned_items_6m, 0 ) AS overall_returned_items_6m , 63 | IFNULL( overall_distinct_products_6m, 0 ) AS overall_distinct_products_6m , 64 | IFNULL( overall_distinct_brands_6m, 0 ) AS overall_distinct_brands_6m , 65 | IFNULL( overall_distinct_chains_6m, 0 ) AS overall_distinct_chains_6m , 66 | IFNULL( overall_distinct_category_6m, 0 ) AS overall_distinct_category_6m , 67 | IFNULL( overall_distinct_days_shopped_6m, 0 ) AS overall_distinct_days_shopped_6m , 68 | IFNULL( overall_aov_6m, 0 ) AS overall_aov_6m , 69 | IFNULL( overall_sale_quantity_12m, 0 ) AS overall_sale_quantity_12m , 70 | IFNULL( overall_returned_items_12m, 0 ) AS overall_returned_items_12m , 71 | IFNULL( overall_distinct_products_12m, 0 ) AS overall_distinct_products_12m , 72 | IFNULL( overall_distinct_brands_12m, 0 ) AS overall_distinct_brands_12m , 73 | IFNULL( overall_distinct_chains_12m, 0 ) AS overall_distinct_chains_12m , 74 | IFNULL( overall_distinct_category_12m, 0 ) AS overall_distinct_category_12m , 75 | IFNULL( overall_distinct_days_shopped_12m, 0 ) AS overall_distinct_days_shopped_12m , 76 | IFNULL( overall_aov_12m, 0 ) AS overall_aov_12m , 77 | label 78 | FROM 79 | `PROJECT.DATASET.features` -------------------------------------------------------------------------------- /processing-pipeline/f090_type_cast.sql: 
-------------------------------------------------------------------------------- 1 | -- type cast 2 | -- overwrite "features" 3 | 4 | SELECT 5 | CAST(customer_id AS STRING) AS customer_id, 6 | CAST( brand AS STRING ) AS brand, 7 | CAST( promo_sensitive AS INT64 ) AS promo_sensitive, 8 | CAST( brand_seasonality AS INT64 ) AS brand_seasonality, 9 | CAST( total_sale_quantity_1m AS INT64 ) AS total_sale_quantity_1m , 10 | CAST( max_sale_amount_1m AS INT64 ) AS max_sale_amount_1m , 11 | CAST( max_sale_quantity_1m AS INT64 ) AS max_sale_quantity_1m , 12 | CAST( total_returned_items_1m AS INT64 ) AS total_returned_items_1m , 13 | CAST( distinct_products_1m AS INT64 ) AS distinct_products_1m , 14 | CAST( distinct_chains_1m AS INT64 ) AS distinct_chains_1m , 15 | CAST( distinct_category_1m AS INT64 ) AS distinct_category_1m , 16 | CAST( distinct_days_shopped_1m AS INT64 ) AS distinct_days_shopped_1m , 17 | CAST( aov_1m AS INT64 ) AS aov_1m, 18 | CAST( total_sale_quantity_3m AS INT64 ) AS total_sale_quantity_3m , 19 | CAST( max_sale_amount_3m AS INT64 ) AS max_sale_amount_3m , 20 | CAST( max_sale_quantity_3m AS INT64 ) AS max_sale_quantity_3m , 21 | CAST( total_returned_items_3m AS INT64 ) AS total_returned_items_3m , 22 | CAST( distinct_products_3m AS INT64 ) AS distinct_products_3m , 23 | CAST( distinct_chains_3m AS INT64 ) AS distinct_chains_3m , 24 | CAST( distinct_category_3m AS INT64 ) AS distinct_category_3m , 25 | CAST( distinct_days_shopped_3m AS INT64 ) AS distinct_days_shopped_3m , 26 | CAST( aov_3m AS INT64 ) AS aov_3m, 27 | CAST( total_sale_quantity_6m AS INT64 ) AS total_sale_quantity_6m , 28 | CAST( max_sale_amount_6m AS INT64 ) AS max_sale_amount_6m , 29 | CAST( max_sale_quantity_6m AS INT64 ) AS max_sale_quantity_6m , 30 | CAST( total_returned_items_6m AS INT64 ) AS total_returned_items_6m , 31 | CAST( distinct_products_6m AS INT64 ) AS distinct_products_6m , 32 | CAST( distinct_chains_6m AS INT64 ) AS distinct_chains_6m , 33 | CAST( distinct_category_6m AS 
INT64 ) AS distinct_category_6m , 34 | CAST( distinct_days_shopped_6m AS INT64 ) AS distinct_days_shopped_6m , 35 | CAST( aov_6m AS INT64 ) AS aov_6m, 36 | CAST( total_sale_quantity_12m AS INT64 ) AS total_sale_quantity_12m , 37 | CAST( max_sale_amount_12m AS INT64 ) AS max_sale_amount_12m , 38 | CAST( max_sale_quantity_12m AS INT64 ) AS max_sale_quantity_12m , 39 | CAST( total_returned_items_12m AS INT64 ) AS total_returned_items_12m , 40 | CAST( distinct_products_12m AS INT64 ) AS distinct_products_12m , 41 | CAST( distinct_chains_12m AS INT64 ) AS distinct_chains_12m , 42 | CAST( distinct_category_12m AS INT64 ) AS distinct_category_12m , 43 | CAST( distinct_days_shopped_12m AS INT64 ) AS distinct_days_shopped_12m , 44 | CAST( aov_12m AS INT64 ) AS aov_12m, 45 | CAST( overall_sale_quantity_1m AS INT64 ) AS overall_sale_quantity_1m , 46 | CAST( overall_returned_items_1m AS INT64 ) AS overall_returned_items_1m , 47 | CAST( overall_distinct_products_1m AS INT64 ) AS overall_distinct_products_1m , 48 | CAST( overall_distinct_brands_1m AS INT64 ) AS overall_distinct_brands_1m , 49 | CAST( overall_distinct_chains_1m AS INT64 ) AS overall_distinct_chains_1m , 50 | CAST( overall_distinct_category_1m AS INT64 ) AS overall_distinct_category_1m , 51 | CAST( overall_distinct_days_shopped_1m AS INT64 ) AS overall_distinct_days_shopped_1m , 52 | CAST( overall_aov_1m AS INT64 ) AS overall_aov_1m , 53 | CAST( overall_sale_quantity_3m AS INT64 ) AS overall_sale_quantity_3m , 54 | CAST( overall_returned_items_3m AS INT64 ) AS overall_returned_items_3m , 55 | CAST( overall_distinct_products_3m AS INT64 ) AS overall_distinct_products_3m , 56 | CAST( overall_distinct_brands_3m AS INT64 ) AS overall_distinct_brands_3m , 57 | CAST( overall_distinct_chains_3m AS INT64 ) AS overall_distinct_chains_3m , 58 | CAST( overall_distinct_category_3m AS INT64 ) AS overall_distinct_category_3m , 59 | CAST( overall_distinct_days_shopped_3m AS INT64 ) AS overall_distinct_days_shopped_3m , 60 | 
CAST( overall_aov_3m AS INT64 ) AS overall_aov_3m , 61 | CAST( overall_sale_quantity_6m AS INT64 ) AS overall_sale_quantity_6m , 62 | CAST( overall_returned_items_6m AS INT64 ) AS overall_returned_items_6m , 63 | CAST( overall_distinct_products_6m AS INT64 ) AS overall_distinct_products_6m , 64 | CAST( overall_distinct_brands_6m AS INT64 ) AS overall_distinct_brands_6m , 65 | CAST( overall_distinct_chains_6m AS INT64 ) AS overall_distinct_chains_6m , 66 | CAST( overall_distinct_category_6m AS INT64 ) AS overall_distinct_category_6m , 67 | CAST( overall_distinct_days_shopped_6m AS INT64 ) AS overall_distinct_days_shopped_6m , 68 | CAST( overall_aov_6m AS INT64 ) AS overall_aov_6m , 69 | CAST( overall_sale_quantity_12m AS INT64 ) AS overall_sale_quantity_12m , 70 | CAST( overall_returned_items_12m AS INT64 ) AS overall_returned_items_12m , 71 | CAST( overall_distinct_products_12m AS INT64 ) AS overall_distinct_products_12m , 72 | CAST( overall_distinct_brands_12m AS INT64 ) AS overall_distinct_brands_12m , 73 | CAST( overall_distinct_chains_12m AS INT64 ) AS overall_distinct_chains_12m , 74 | CAST( overall_distinct_category_12m AS INT64 ) AS overall_distinct_category_12m , 75 | CAST( overall_distinct_days_shopped_12m AS INT64 ) AS overall_distinct_days_shopped_12m , 76 | CAST( overall_aov_12m AS INT64 ) AS overall_aov_12m , 77 | CAST(label AS INT64) AS label 78 | FROM 79 | `PROJECT.DATASET.features` -------------------------------------------------------------------------------- /processing-pipeline/s010_downsampling.sql: -------------------------------------------------------------------------------- 1 | -- downsampling due to class imbalance 2 | -- 17,813,514 rows now in total 3 | -- legacy SQL = true 4 | -- save as "downsampled" 5 | 6 | -- TODO: best practice is to use a hashing function e.g. 
FARM_FINGERPRINT 7 | 8 | SELECT 9 | customer_id, 10 | brand, 11 | label 12 | FROM 13 | -- taking all rows (8511369) where label = 1 14 | ( 15 | SELECT 16 | CAST(customer_id AS STRING) AS customer_id, 17 | CAST(brand AS STRING) AS brand, 18 | label 19 | FROM 20 | [PROJECT:DATASET.train] 21 | WHERE 22 | label = 1 ), 23 | -- randomly taking ~1/2 rows where label is zero 24 | ( 25 | SELECT 26 | CAST(customer_id AS STRING) AS customer_id, 27 | CAST(brand AS STRING) AS brand, 28 | label 29 | FROM 30 | [PROJECT:DATASET.train] 31 | WHERE 32 | label = 0 33 | AND RAND(42) < 0.02 ), 34 | -- taking ~1/2 rows where we know we have some interactions in the 12M prior 35 | -- and label = 0 36 | ( 37 | SELECT 38 | CAST(a.id AS STRING) AS customer_id, 39 | CAST(a.brand AS STRING) AS brand, 40 | b.label AS label 41 | FROM ( 42 | SELECT 43 | CAST(id AS STRING) AS id, 44 | CAST(brand AS STRING) AS brand 45 | FROM 46 | [PROJECT:DATASET.cleaned] 47 | WHERE 48 | date < CAST('2013-03-01' AS DATE) 49 | GROUP BY 50 | id, 51 | brand ) a 52 | INNER JOIN ( 53 | SELECT 54 | CAST(customer_id AS STRING) AS id_b, 55 | CAST(brand AS STRING) AS brand_b, 56 | label 57 | FROM 58 | [PROJECT:DATASET.train] ) b 59 | ON 60 | a.id = b.id_b 61 | AND a.brand = b.brand_b 62 | WHERE 63 | label = 0 64 | AND RAND(42) < 0.15 ) 65 | GROUP BY customer_id, brand, label -------------------------------------------------------------------------------- /processing-pipeline/s020_downsampled_features.sql: -------------------------------------------------------------------------------- 1 | -- join on features for downsampled data 2 | -- overwrite "downsampled" 3 | SELECT 4 | b.* 5 | FROM ( 6 | SELECT 7 | customer_id AS customer_id_a, 8 | brand AS brand_a, 9 | label AS label_a 10 | FROM 11 | `PROJECT.DATASET.downsampled` ) a 12 | INNER JOIN ( 13 | SELECT 14 | * 15 | FROM 16 | `PROJECT.DATASET.train` ) b 17 | ON 18 | a.customer_id_a = b.customer_id 19 | AND a.brand_a = b.brand 
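The TODO in `s010_downsampling.sql` flags two things worth noting: the query runs in legacy SQL, where a comma in the FROM clause means UNION ALL, and `RAND(42)` sampling is not reproducible across table rewrites. A standard-SQL sketch of the first two sampling branches using `FARM_FINGERPRINT` for a deterministic sample (the third, interaction-based branch would follow the same pattern; the ~2% threshold mirrors the original `RAND(42) < 0.02`):

```sql
-- Sketch only: standard-SQL equivalent of the first two s010 branches.
-- Legacy SQL's comma-in-FROM becomes an explicit UNION ALL, and the random
-- filter becomes a stable per-row hash bucket, so reruns keep the same rows.
SELECT
  customer_id,
  brand,
  label
FROM (
  -- all positives
  SELECT
    CAST(customer_id AS STRING) AS customer_id,
    CAST(brand AS STRING) AS brand,
    label
  FROM
    `PROJECT.DATASET.train`
  WHERE
    label = 1
  UNION ALL
  -- ~2% of negatives, chosen deterministically by hashing (customer, brand)
  SELECT
    CAST(customer_id AS STRING) AS customer_id,
    CAST(brand AS STRING) AS brand,
    label
  FROM
    `PROJECT.DATASET.train`
  WHERE
    label = 0
    AND MOD(ABS(FARM_FINGERPRINT(CONCAT(CAST(customer_id AS STRING), CAST(brand AS STRING)))), 100) < 2 )
GROUP BY
  customer_id,
  brand,
  label
```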
-------------------------------------------------------------------------------- /processing-pipeline/supporting-queries.sql: -------------------------------------------------------------------------------- 1 | -- supporting SQL queries for data quality checks and ad hoc analysis 2 | 3 | ---------- "transactions" table ---------- 4 | 5 | -- checking for NULLs 6 | SELECT 7 | COUNTIF( id IS NULL) AS id, 8 | COUNTIF( chain IS NULL) AS chain, 9 | COUNTIF( dept IS NULL) AS dept, 10 | COUNTIF( category IS NULL) AS category, 11 | COUNTIF( company IS NULL) AS company, 12 | COUNTIF( brand IS NULL) AS brand, 13 | COUNTIF( date IS NULL) AS date, 14 | COUNTIF( productsize IS NULL) AS productsize, 15 | COUNTIF( productmeasure IS NULL) AS productmeasure, 16 | COUNTIF( purchasequantity IS NULL) AS purchasequantity, 17 | COUNTIF( purchaseamount IS NULL) AS purchaseamount 18 | FROM 19 | `PROJECT.DATASET.transactions` 20 | 21 | -- checking for duplicates 22 | SELECT 23 | *, 24 | COUNT(*) 25 | FROM 26 | `PROJECT.DATASET.transactions` 27 | GROUP BY 28 | id, 29 | chain, 30 | dept, 31 | category, 32 | company, 33 | brand, 34 | date, 35 | productsize, 36 | productmeasure, 37 | purchasequantity, 38 | purchaseamount 39 | HAVING 40 | COUNT(*) > 1 41 | 42 | 43 | 44 | ---------- "history" table ---------- 45 | 46 | -- checking for NULLs 47 | SELECT 48 | COUNTIF( id IS NULL) AS id, 49 | COUNTIF( chain IS NULL) AS chain, 50 | COUNTIF( offer IS NULL) AS offer, 51 | COUNTIF( market IS NULL) AS market , 52 | COUNTIF( repeattrips IS NULL) AS repeattrips, 53 | COUNTIF( repeater IS NULL) AS repeater, 54 | COUNTIF( offerdate IS NULL) AS offerdate 55 | FROM 56 | `PROJECT.DATASET.history` 57 | 58 | -- checking for duplicates 59 | SELECT 60 | *, 61 | COUNT(*) 62 | FROM 63 | `PROJECT.DATASET.history` 64 | GROUP BY 65 | id, 66 | chain, 67 | offer, 68 | market, 69 | repeattrips, 70 | repeater, 71 | offerdate 72 | HAVING 73 | COUNT(*) > 1 74 | 75 | 76 | 77 | ---------- target variable ---------- 78 | SELECT 
79 | month, 80 | year, 81 | COUNT(*) AS num_transactions, 82 | COUNT(DISTINCT(brand)) AS unique_brands, 83 | COUNT(DISTINCT(id)) AS unique_customers 84 | FROM ( 85 | SELECT 86 | *, 87 | EXTRACT(MONTH 88 | FROM 89 | date) AS month, 90 | EXTRACT(YEAR 91 | FROM 92 | date) AS year 93 | FROM 94 | `PROJECT.DATASET.transactions` ) 95 | GROUP BY 96 | month, 97 | year 98 | ORDER BY 99 | year, 100 | month ASC 101 | 102 | 103 | 104 | ---------- product hierarchy ---------- 105 | -- see how many categories each brand is mapped to 106 | SELECT 107 | brand, 108 | COUNT(DISTINCT(category)) AS cnt 109 | FROM 110 | `PROJECT.DATASET.transactions` 111 | GROUP BY brand 112 | ORDER BY cnt DESC 113 | 114 | -- see how many departments each category is mapped to 115 | SELECT 116 | category, 117 | COUNT(DISTINCT(dept)) AS cnt 118 | FROM 119 | `PROJECT.DATASET.transactions` 120 | GROUP BY category 121 | ORDER BY cnt DESC 122 | 123 | 124 | 125 | 126 | 127 | 128 | 129 | 130 | 131 | 132 | -------------------------------------------------------------------------------- /processing-pipeline/t010_train_test_field.sql: -------------------------------------------------------------------------------- 1 | -- train / test label split into 80 : 20 2 | -- overwrite "features" table 3 | 4 | -- TODO: best practice is to use a hashing function e.g.
FARM_FINGERPRINT 5 | 6 | SELECT 7 | *, 8 | CASE 9 | WHEN RAND() < 0.8 THEN "train" 10 | ELSE "test" 11 | END AS train_test 12 | FROM 13 | `PROJECT.DATASET.features` -------------------------------------------------------------------------------- /processing-pipeline/t020_train_data.sql: -------------------------------------------------------------------------------- 1 | -- train table 2 | -- create "train" 3 | 4 | SELECT 5 | * 6 | EXCEPT(train_test) 7 | FROM 8 | `PROJECT.DATASET.features` 9 | WHERE 10 | train_test = "train" -------------------------------------------------------------------------------- /processing-pipeline/t030_test_data.sql: -------------------------------------------------------------------------------- 1 | -- test table 2 | -- create "test" table 3 | 4 | SELECT 5 | * 6 | EXCEPT(train_test) 7 | FROM 8 | `PROJECT.DATASET.features` 9 | WHERE 10 | train_test = "test" -------------------------------------------------------------------------------- /processing-pipeline/t040_test_dev_split.sql: -------------------------------------------------------------------------------- 1 | -- overwrite "test" 2 | -- to split the table into test and dev sets 3 | 4 | -- TODO: best practice is to use a hashing function e.g. 
FARM_FINGERPRINT 5 | 6 | SELECT 7 | *, 8 | CASE 9 | WHEN RAND() < 0.5 THEN "test" 10 | ELSE "dev" 11 | END AS test_dev 12 | FROM 13 | `PROJECT.DATASET.test` -------------------------------------------------------------------------------- /processing-pipeline/t050_dev_data.sql: -------------------------------------------------------------------------------- 1 | -- create "dev" table 2 | SELECT 3 | * 4 | EXCEPT(test_dev) 5 | FROM 6 | `PROJECT.DATASET.test` 7 | WHERE 8 | test_dev = "dev" -------------------------------------------------------------------------------- /processing-pipeline/t060_test_data_final.sql: -------------------------------------------------------------------------------- 1 | -- overwrite "test" to be left with the final test dataset 2 | SELECT 3 | * 4 | EXCEPT(test_dev) 5 | FROM 6 | `PROJECT.DATASET.test` 7 | WHERE 8 | test_dev = "test" -------------------------------------------------------------------------------- /processing-pipeline/x010_cross_join.sql: -------------------------------------------------------------------------------- 1 | -- cross join customers (with at least 1 transaction with top brands) and top 1000 brands that were active prior to March 2013 2 | -- create table "customerxbrand" 3 | 4 | SELECT 5 | * 6 | FROM ( 7 | SELECT 8 | DISTINCT(id) AS customer_id 9 | FROM ( 10 | SELECT 11 | id, 12 | brand 13 | FROM 14 | `PROJECT.DATASET.cleaned` 15 | WHERE 16 | date < 'TRGT_MONTH' 17 | GROUP BY 18 | id, 19 | brand ) a 20 | INNER JOIN ( 21 | SELECT 22 | brand AS brand_b 23 | FROM 24 | `PROJECT.DATASET.top_brands` ) b 25 | ON 26 | a.brand = b.brand_b ) 27 | CROSS JOIN ( 28 | SELECT 29 | * 30 | FROM 31 | `PROJECT.DATASET.top_brands` ) --------------------------------------------------------------------------------
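The TODOs in `t010_train_test_field.sql` and `t040_test_dev_split.sql` flag that `RAND()` makes the splits non-reproducible: rerunning either query reassigns rows to different sides. A hashed split on a stable key is deterministic. A sketch for the 80:20 split of t010, assuming `customer_id` is the key the split should be stable on (note this also keeps each customer entirely on one side, whereas the original assigns each (customer, brand) row independently):

```sql
-- Sketch: deterministic 80/20 split keyed on customer_id.
-- FARM_FINGERPRINT gives a stable INT64 hash, so a rerun reproduces the
-- same assignment, and a customer never straddles train and test.
SELECT
  *,
  CASE
    WHEN MOD(ABS(FARM_FINGERPRINT(CAST(customer_id AS STRING))), 10) < 8 THEN "train"
    ELSE "test"
  END AS train_test
FROM
  `PROJECT.DATASET.features`
```

The 50:50 test/dev split of t040 follows the same pattern with `MOD(..., 2)`; hashing a different salt (e.g. `CONCAT(customer_id, "_dev")`) keeps the two splits independent of each other.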