├── .gitignore ├── ANN Model.ipynb ├── Customers Segmentation.ipynb ├── Data Description and Analysis.ipynb ├── Data Preparation.ipynb ├── Exploratory Data Analysis.ipynb ├── Feature Extraction.ipynb ├── LICENSE ├── Market Basket Analysis.ipynb ├── NN Architecture.png ├── Plots ├── Add-to-cart-VS-reorder.png ├── Most-popular-products.png ├── NN Architecture.png ├── NN-Performance.png ├── NN-Report.png ├── Reorder-organic-inorganic-products.png ├── Total-organic-inorganic-products.png ├── XGBoost Feature Importance Plot.eps ├── XGBoost Feature Importance Plot.png ├── XGBoost Performance.png ├── XGBoost-Report.png ├── aisle-high-reorder.png ├── aisle-low-reorder.png ├── cluster.png ├── cumsum_products.png ├── dow.png ├── elbow.png ├── heatmap.png ├── orders.png ├── popular-aisles.png ├── popular-departments.png ├── prior.png ├── readme.md ├── reorder-df.png ├── reorder-total-orders.png └── train.png ├── README.md └── XGBoost Model.ipynb /.gitignore: -------------------------------------------------------------------------------- 1 | # Byte-compiled / optimized / DLL files 2 | __pycache__/ 3 | *.py[cod] 4 | *$py.class 5 | 6 | # C extensions 7 | *.so 8 | 9 | # Distribution / packaging 10 | .Python 11 | build/ 12 | develop-eggs/ 13 | dist/ 14 | downloads/ 15 | eggs/ 16 | .eggs/ 17 | lib/ 18 | lib64/ 19 | parts/ 20 | sdist/ 21 | var/ 22 | wheels/ 23 | *.egg-info/ 24 | .installed.cfg 25 | *.egg 26 | MANIFEST 27 | 28 | # PyInstaller 29 | # Usually these files are written by a python script from a template 30 | # before PyInstaller builds the exe, so as to inject date/other infos into it. 
31 | *.manifest 32 | *.spec 33 | 34 | # Installer logs 35 | pip-log.txt 36 | pip-delete-this-directory.txt 37 | 38 | # Unit test / coverage reports 39 | htmlcov/ 40 | .tox/ 41 | .coverage 42 | .coverage.* 43 | .cache 44 | nosetests.xml 45 | coverage.xml 46 | *.cover 47 | .hypothesis/ 48 | .pytest_cache/ 49 | 50 | # Translations 51 | *.mo 52 | *.pot 53 | 54 | # Django stuff: 55 | *.log 56 | local_settings.py 57 | db.sqlite3 58 | 59 | # Flask stuff: 60 | instance/ 61 | .webassets-cache 62 | 63 | # Scrapy stuff: 64 | .scrapy 65 | 66 | # Sphinx documentation 67 | docs/_build/ 68 | 69 | # PyBuilder 70 | target/ 71 | 72 | # Jupyter Notebook 73 | .ipynb_checkpoints 74 | 75 | # pyenv 76 | .python-version 77 | 78 | # celery beat schedule file 79 | celerybeat-schedule 80 | 81 | # SageMath parsed files 82 | *.sage.py 83 | 84 | # Environments 85 | .env 86 | .venv 87 | env/ 88 | venv/ 89 | ENV/ 90 | env.bak/ 91 | venv.bak/ 92 | 93 | # Spyder project settings 94 | .spyderproject 95 | .spyproject 96 | 97 | # Rope project settings 98 | .ropeproject 99 | 100 | # mkdocs documentation 101 | /site 102 | 103 | # mypy 104 | .mypy_cache/ 105 | -------------------------------------------------------------------------------- /Data Preparation.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Data Preparation" 8 | ] 9 | }, 10 | { 11 | "cell_type": "code", 12 | "execution_count": 1, 13 | "metadata": {}, 14 | "outputs": [], 15 | "source": [ 16 | "import numpy as np\n", 17 | "import pandas as pd\n", 18 | "import matplotlib.pyplot as plt\n", 19 | "import seaborn as sns\n", 20 | "import gc\n", 21 | "pd.options.mode.chained_assignment = None\n", 22 | "\n", 23 | "root = 'C:/Data/instacart-market-basket-analysis/'" 24 | ] 25 | }, 26 | { 27 | "cell_type": "markdown", 28 | "metadata": {}, 29 | "source": [ 30 | "#### Reading all data" 31 | ] 32 | }, 33 | { 34 | 
"cell_type": "code", 35 | "execution_count": 2, 36 | "metadata": {}, 37 | "outputs": [], 38 | "source": [ 39 | "orders = pd.read_csv(root + 'orders.csv', \n", 40 | " dtype={\n", 41 | " 'order_id': np.int32,\n", 42 | " 'user_id': np.int64,\n", 43 | " 'eval_set': 'category',\n", 44 | " 'order_number': np.int16,\n", 45 | " 'order_dow': np.int8,\n", 46 | " 'order_hour_of_day': np.int8,\n", 47 | " 'days_since_prior_order': np.float32})\n", 48 | "\n", 49 | "\n", 50 | "order_products_train = pd.read_csv(root + 'order_products__train.csv', \n", 51 | " dtype={\n", 52 | " 'order_id': np.int32,\n", 53 | " 'product_id': np.uint16,\n", 54 | " 'add_to_cart_order': np.int16,\n", 55 | " 'reordered': np.int8})\n", 56 | "\n", 57 | "order_products_prior = pd.read_csv(root + 'order_products__prior.csv', \n", 58 | " dtype={\n", 59 | " 'order_id': np.int32,\n", 60 | " 'product_id': np.uint16,\n", 61 | " 'add_to_cart_order': np.int16,\n", 62 | " 'reordered': np.int8})\n", 63 | "\n", 64 | "product_features = pd.read_pickle(root + 'product_features.pkl')\n", 65 | "\n", 66 | "user_features = pd.read_pickle(root + 'user_features.pkl')\n", 67 | "\n", 68 | "user_product_features = pd.read_pickle(root + 'user_product_features.pkl')\n", 69 | "\n", 70 | "products = pd.read_csv(root +'products.csv')\n", 71 | "\n", 72 | "aisles = pd.read_csv(root + 'aisles.csv')\n", 73 | "\n", 74 | "departments = pd.read_csv(root + 'departments.csv')" 75 | ] 76 | }, 77 | { 78 | "cell_type": "markdown", 79 | "metadata": {}, 80 | "source": [ 81 | "#### merging train order data with orders" 82 | ] 83 | }, 84 | { 85 | "cell_type": "code", 86 | "execution_count": 3, 87 | "metadata": { 88 | "scrolled": true 89 | }, 90 | "outputs": [ 91 | { 92 | "data": { 93 | "text/html": [ 94 | "
\n", 95 | "\n", 108 | "\n", 109 | " \n", 110 | " \n", 111 | " \n", 112 | " \n", 113 | " \n", 114 | " \n", 115 | " \n", 116 | " \n", 117 | " \n", 118 | " \n", 119 | " \n", 120 | " \n", 121 | " \n", 122 | " \n", 123 | " \n", 124 | " \n", 125 | " \n", 126 | " \n", 127 | " \n", 128 | " \n", 129 | " \n", 130 | " \n", 131 | " \n", 132 | " \n", 133 | " \n", 134 | " \n", 135 | " \n", 136 | " \n", 137 | " \n", 138 | " \n", 139 | " \n", 140 | " \n", 141 | " \n", 142 | " \n", 143 | " \n", 144 | " \n", 145 | " \n", 146 | " \n", 147 | " \n", 148 | " \n", 149 | " \n", 150 | " \n", 151 | " \n", 152 | " \n", 153 | " \n", 154 | " \n", 155 | " \n", 156 | " \n", 157 | " \n", 158 | " \n", 159 | " \n", 160 | " \n", 161 | " \n", 162 | " \n", 163 | " \n", 164 | " \n", 165 | " \n", 166 | " \n", 167 | " \n", 168 | " \n", 169 | " \n", 170 | " \n", 171 | " \n", 172 | " \n", 173 | " \n", 174 | " \n", 175 | " \n", 176 | " \n", 177 | " \n", 178 | " \n", 179 | " \n", 180 | " \n", 181 | " \n", 182 | " \n", 183 | " \n", 184 | " \n", 185 | " \n", 186 | " \n", 187 | " \n", 188 | " \n", 189 | " \n", 190 | " \n", 191 | "
   order_id  user_id  eval_set  order_number  order_dow  order_hour_of_day  days_since_prior_order  product_id  add_to_cart_order  reordered
0   1187899        1     train            11          4                  8                    14.0         196                  1          1
1   1187899        1     train            11          4                  8                    14.0       25133                  2          1
2   1187899        1     train            11          4                  8                    14.0       38928                  3          1
3   1187899        1     train            11          4                  8                    14.0       26405                  4          1
4   1187899        1     train            11          4                  8                    14.0       39657                  5          1
\n", 192 | "
" 193 | ], 194 | "text/plain": [ 195 | " order_id user_id eval_set order_number order_dow order_hour_of_day \\\n", 196 | "0 1187899 1 train 11 4 8 \n", 197 | "1 1187899 1 train 11 4 8 \n", 198 | "2 1187899 1 train 11 4 8 \n", 199 | "3 1187899 1 train 11 4 8 \n", 200 | "4 1187899 1 train 11 4 8 \n", 201 | "\n", 202 | " days_since_prior_order product_id add_to_cart_order reordered \n", 203 | "0 14.0 196 1 1 \n", 204 | "1 14.0 25133 2 1 \n", 205 | "2 14.0 38928 3 1 \n", 206 | "3 14.0 26405 4 1 \n", 207 | "4 14.0 39657 5 1 " 208 | ] 209 | }, 210 | "execution_count": 3, 211 | "metadata": {}, 212 | "output_type": "execute_result" 213 | } 214 | ], 215 | "source": [ 216 | "train_orders = orders.merge(order_products_train, on = 'order_id', how = 'inner')\n", 217 | "train_orders.head()" 218 | ] 219 | }, 220 | { 221 | "cell_type": "markdown", 222 | "metadata": {}, 223 | "source": [ 224 | "removing unnecessary columns from train_orders" 225 | ] 226 | }, 227 | { 228 | "cell_type": "code", 229 | "execution_count": 4, 230 | "metadata": {}, 231 | "outputs": [], 232 | "source": [ 233 | "train_orders.drop(['eval_set', 'add_to_cart_order', 'order_id'], axis = 1, inplace = True)" 234 | ] 235 | }, 236 | { 237 | "cell_type": "markdown", 238 | "metadata": {}, 239 | "source": [ 240 | "unique user_ids in train data" 241 | ] 242 | }, 243 | { 244 | "cell_type": "code", 245 | "execution_count": 5, 246 | "metadata": {}, 247 | "outputs": [ 248 | { 249 | "data": { 250 | "text/plain": [ 251 | "array([ 1, 2, 5, 7, 8, 9, 10, 13, 14, 17], dtype=int64)" 252 | ] 253 | }, 254 | "execution_count": 5, 255 | "metadata": {}, 256 | "output_type": "execute_result" 257 | } 258 | ], 259 | "source": [ 260 | "train_users = train_orders.user_id.unique()\n", 261 | "train_users[:10]" 262 | ] 263 | }, 264 | { 265 | "cell_type": "markdown", 266 | "metadata": {}, 267 | "source": [ 268 | "keeping only train_users in the data" 269 | ] 270 | }, 271 | { 272 | "cell_type": "code", 273 | "execution_count": 6, 274 | 
"metadata": { 275 | "scrolled": true 276 | }, 277 | "outputs": [ 278 | { 279 | "data": { 280 | "text/plain": [ 281 | "(13307953, 11)" 282 | ] 283 | }, 284 | "execution_count": 6, 285 | "metadata": {}, 286 | "output_type": "execute_result" 287 | } 288 | ], 289 | "source": [ 290 | "user_product_features.shape" 291 | ] 292 | }, 293 | { 294 | "cell_type": "code", 295 | "execution_count": 7, 296 | "metadata": {}, 297 | "outputs": [ 298 | { 299 | "data": { 300 | "text/html": [ 301 | "
\n", 302 | "\n", 315 | "\n", 316 | " \n", 317 | " \n", 318 | " \n", 319 | " \n", 320 | " \n", 321 | " \n", 322 | " \n", 323 | " \n", 324 | " \n", 325 | " \n", 326 | " \n", 327 | " \n", 328 | " \n", 329 | " \n", 330 | " \n", 331 | " \n", 332 | " \n", 333 | " \n", 334 | " \n", 335 | " \n", 336 | " \n", 337 | " \n", 338 | " \n", 339 | " \n", 340 | " \n", 341 | " \n", 342 | " \n", 343 | " \n", 344 | " \n", 345 | " \n", 346 | " \n", 347 | " \n", 348 | " \n", 349 | " \n", 350 | " \n", 351 | " \n", 352 | " \n", 353 | " \n", 354 | " \n", 355 | " \n", 356 | " \n", 357 | " \n", 358 | " \n", 359 | " \n", 360 | " \n", 361 | " \n", 362 | " \n", 363 | " \n", 364 | " \n", 365 | " \n", 366 | " \n", 367 | " \n", 368 | " \n", 369 | " \n", 370 | " \n", 371 | " \n", 372 | " \n", 373 | " \n", 374 | " \n", 375 | " \n", 376 | " \n", 377 | " \n", 378 | " \n", 379 | " \n", 380 | " \n", 381 | " \n", 382 | " \n", 383 | " \n", 384 | " \n", 385 | " \n", 386 | " \n", 387 | " \n", 388 | " \n", 389 | " \n", 390 | " \n", 391 | " \n", 392 | " \n", 393 | " \n", 394 | " \n", 395 | " \n", 396 | " \n", 397 | " \n", 398 | " \n", 399 | " \n", 400 | " \n", 401 | " \n", 402 | " \n", 403 | " \n", 404 | "
   user_id  product_id  total_product_orders_by_user  total_product_reorders_by_user  user_product_reorder_percentage  avg_add_to_cart_by_user  avg_days_since_last_bought  last_ordered_in  is_reorder_3  is_reorder_2  is_reorder_1
0        1         196                            10                               9                         0.900000                 1.400000                   17.600000               10           1.0           1.0           1.0
1        1       10258                             9                               8                         0.888889                 3.333333                   19.555555               10           1.0           1.0           1.0
2        1       10326                             1                               0                         0.000000                 5.000000                   28.000000                5           0.0           0.0           0.0
3        1       12427                            10                               9                         0.900000                 3.300000                   17.600000               10           1.0           1.0           1.0
4        1       13032                             3                               2                         0.666667                 6.333333                   21.666666               10           1.0           0.0           0.0
\n", 405 | "
" 406 | ], 407 | "text/plain": [ 408 | " user_id product_id total_product_orders_by_user \\\n", 409 | "0 1 196 10 \n", 410 | "1 1 10258 9 \n", 411 | "2 1 10326 1 \n", 412 | "3 1 12427 10 \n", 413 | "4 1 13032 3 \n", 414 | "\n", 415 | " total_product_reorders_by_user user_product_reorder_percentage \\\n", 416 | "0 9 0.900000 \n", 417 | "1 8 0.888889 \n", 418 | "2 0 0.000000 \n", 419 | "3 9 0.900000 \n", 420 | "4 2 0.666667 \n", 421 | "\n", 422 | " avg_add_to_cart_by_user avg_days_since_last_bought last_ordered_in \\\n", 423 | "0 1.400000 17.600000 10 \n", 424 | "1 3.333333 19.555555 10 \n", 425 | "2 5.000000 28.000000 5 \n", 426 | "3 3.300000 17.600000 10 \n", 427 | "4 6.333333 21.666666 10 \n", 428 | "\n", 429 | " is_reorder_3 is_reorder_2 is_reorder_1 \n", 430 | "0 1.0 1.0 1.0 \n", 431 | "1 1.0 1.0 1.0 \n", 432 | "2 0.0 0.0 0.0 \n", 433 | "3 1.0 1.0 1.0 \n", 434 | "4 1.0 0.0 0.0 " 435 | ] 436 | }, 437 | "execution_count": 7, 438 | "metadata": {}, 439 | "output_type": "execute_result" 440 | } 441 | ], 442 | "source": [ 443 | "user_product_features.head()" 444 | ] 445 | }, 446 | { 447 | "cell_type": "code", 448 | "execution_count": 8, 449 | "metadata": {}, 450 | "outputs": [ 451 | { 452 | "data": { 453 | "text/html": [ 454 | "
\n", 455 | "\n", 468 | "\n", 469 | " \n", 470 | " \n", 471 | " \n", 472 | " \n", 473 | " \n", 474 | " \n", 475 | " \n", 476 | " \n", 477 | " \n", 478 | " \n", 479 | " \n", 480 | " \n", 481 | " \n", 482 | " \n", 483 | " \n", 484 | " \n", 485 | " \n", 486 | " \n", 487 | " \n", 488 | " \n", 489 | " \n", 490 | " \n", 491 | " \n", 492 | " \n", 493 | " \n", 494 | " \n", 495 | " \n", 496 | " \n", 497 | " \n", 498 | " \n", 499 | " \n", 500 | " \n", 501 | " \n", 502 | " \n", 503 | " \n", 504 | " \n", 505 | " \n", 506 | " \n", 507 | " \n", 508 | " \n", 509 | " \n", 510 | " \n", 511 | " \n", 512 | " \n", 513 | " \n", 514 | " \n", 515 | " \n", 516 | " \n", 517 | " \n", 518 | " \n", 519 | " \n", 520 | " \n", 521 | " \n", 522 | " \n", 523 | " \n", 524 | " \n", 525 | " \n", 526 | " \n", 527 | " \n", 528 | " \n", 529 | " \n", 530 | " \n", 531 | " \n", 532 | " \n", 533 | " \n", 534 | " \n", 535 | " \n", 536 | " \n", 537 | " \n", 538 | " \n", 539 | " \n", 540 | " \n", 541 | " \n", 542 | " \n", 543 | " \n", 544 | " \n", 545 | " \n", 546 | " \n", 547 | " \n", 548 | " \n", 549 | " \n", 550 | " \n", 551 | " \n", 552 | " \n", 553 | " \n", 554 | " \n", 555 | " \n", 556 | " \n", 557 | "
   user_id  product_id  total_product_orders_by_user  total_product_reorders_by_user  user_product_reorder_percentage  avg_add_to_cart_by_user  avg_days_since_last_bought  last_ordered_in  is_reorder_3  is_reorder_2  is_reorder_1
0        1         196                            10                               9                         0.900000                 1.400000                   17.600000               10           1.0           1.0           1.0
1        1       10258                             9                               8                         0.888889                 3.333333                   19.555555               10           1.0           1.0           1.0
2        1       10326                             1                               0                         0.000000                 5.000000                   28.000000                5           0.0           0.0           0.0
3        1       12427                            10                               9                         0.900000                 3.300000                   17.600000               10           1.0           1.0           1.0
4        1       13032                             3                               2                         0.666667                 6.333333                   21.666666               10           1.0           0.0           0.0
\n", 558 | "
" 559 | ], 560 | "text/plain": [ 561 | " user_id product_id total_product_orders_by_user \\\n", 562 | "0 1 196 10 \n", 563 | "1 1 10258 9 \n", 564 | "2 1 10326 1 \n", 565 | "3 1 12427 10 \n", 566 | "4 1 13032 3 \n", 567 | "\n", 568 | " total_product_reorders_by_user user_product_reorder_percentage \\\n", 569 | "0 9 0.900000 \n", 570 | "1 8 0.888889 \n", 571 | "2 0 0.000000 \n", 572 | "3 9 0.900000 \n", 573 | "4 2 0.666667 \n", 574 | "\n", 575 | " avg_add_to_cart_by_user avg_days_since_last_bought last_ordered_in \\\n", 576 | "0 1.400000 17.600000 10 \n", 577 | "1 3.333333 19.555555 10 \n", 578 | "2 5.000000 28.000000 5 \n", 579 | "3 3.300000 17.600000 10 \n", 580 | "4 6.333333 21.666666 10 \n", 581 | "\n", 582 | " is_reorder_3 is_reorder_2 is_reorder_1 \n", 583 | "0 1.0 1.0 1.0 \n", 584 | "1 1.0 1.0 1.0 \n", 585 | "2 0.0 0.0 0.0 \n", 586 | "3 1.0 1.0 1.0 \n", 587 | "4 1.0 0.0 0.0 " 588 | ] 589 | }, 590 | "execution_count": 8, 591 | "metadata": {}, 592 | "output_type": "execute_result" 593 | } 594 | ], 595 | "source": [ 596 | "df = user_product_features[user_product_features.user_id.isin(train_users)]\n", 597 | "df.head()" 598 | ] 599 | }, 600 | { 601 | "cell_type": "code", 602 | "execution_count": 9, 603 | "metadata": { 604 | "scrolled": false 605 | }, 606 | "outputs": [ 607 | { 608 | "data": { 609 | "text/html": [ 610 | "
\n", 611 | "\n", 624 | "\n", 625 | " \n", 626 | " \n", 627 | " \n", 628 | " \n", 629 | " \n", 630 | " \n", 631 | " \n", 632 | " \n", 633 | " \n", 634 | " \n", 635 | " \n", 636 | " \n", 637 | " \n", 638 | " \n", 639 | " \n", 640 | " \n", 641 | " \n", 642 | " \n", 643 | " \n", 644 | " \n", 645 | " \n", 646 | " \n", 647 | " \n", 648 | " \n", 649 | " \n", 650 | " \n", 651 | " \n", 652 | " \n", 653 | " \n", 654 | " \n", 655 | " \n", 656 | " \n", 657 | " \n", 658 | " \n", 659 | " \n", 660 | " \n", 661 | " \n", 662 | " \n", 663 | " \n", 664 | " \n", 665 | " \n", 666 | " \n", 667 | " \n", 668 | " \n", 669 | " \n", 670 | " \n", 671 | " \n", 672 | " \n", 673 | " \n", 674 | " \n", 675 | " \n", 676 | " \n", 677 | " \n", 678 | " \n", 679 | " \n", 680 | " \n", 681 | " \n", 682 | " \n", 683 | " \n", 684 | " \n", 685 | " \n", 686 | " \n", 687 | " \n", 688 | " \n", 689 | " \n", 690 | " \n", 691 | " \n", 692 | " \n", 693 | " \n", 694 | " \n", 695 | " \n", 696 | " \n", 697 | " \n", 698 | " \n", 699 | " \n", 700 | " \n", 701 | " \n", 702 | " \n", 703 | " \n", 704 | " \n", 705 | " \n", 706 | " \n", 707 | " \n", 708 | " \n", 709 | " \n", 710 | " \n", 711 | " \n", 712 | " \n", 713 | " \n", 714 | " \n", 715 | " \n", 716 | " \n", 717 | " \n", 718 | " \n", 719 | " \n", 720 | " \n", 721 | " \n", 722 | " \n", 723 | " \n", 724 | " \n", 725 | " \n", 726 | " \n", 727 | " \n", 728 | " \n", 729 | " \n", 730 | " \n", 731 | " \n", 732 | " \n", 733 | " \n", 734 | " \n", 735 | " \n", 736 | " \n", 737 | " \n", 738 | " \n", 739 | " \n", 740 | " \n", 741 | " \n", 742 | " \n", 743 | "
   user_id  product_id  total_product_orders_by_user  total_product_reorders_by_user  user_product_reorder_percentage  avg_add_to_cart_by_user  avg_days_since_last_bought  last_ordered_in  is_reorder_3  is_reorder_2  is_reorder_1  order_number  order_dow  order_hour_of_day  days_since_prior_order  reordered
0        1         196                          10.0                             9.0                         0.900000                 1.400000                   17.600000             10.0           1.0           1.0           1.0          11.0        4.0                8.0                    14.0        1.0
1        1       10258                           9.0                             8.0                         0.888889                 3.333333                   19.555555             10.0           1.0           1.0           1.0          11.0        4.0                8.0                    14.0        1.0
2        1       10326                           1.0                             0.0                         0.000000                 5.000000                   28.000000              5.0           0.0           0.0           0.0           NaN        NaN                NaN                     NaN        NaN
3        1       12427                          10.0                             9.0                         0.900000                 3.300000                   17.600000             10.0           1.0           1.0           1.0           NaN        NaN                NaN                     NaN        NaN
4        1       13032                           3.0                             2.0                         0.666667                 6.333333                   21.666666             10.0           1.0           0.0           0.0          11.0        4.0                8.0                    14.0        1.0
\n", 744 | "
" 745 | ], 746 | "text/plain": [ 747 | " user_id product_id total_product_orders_by_user \\\n", 748 | "0 1 196 10.0 \n", 749 | "1 1 10258 9.0 \n", 750 | "2 1 10326 1.0 \n", 751 | "3 1 12427 10.0 \n", 752 | "4 1 13032 3.0 \n", 753 | "\n", 754 | " total_product_reorders_by_user user_product_reorder_percentage \\\n", 755 | "0 9.0 0.900000 \n", 756 | "1 8.0 0.888889 \n", 757 | "2 0.0 0.000000 \n", 758 | "3 9.0 0.900000 \n", 759 | "4 2.0 0.666667 \n", 760 | "\n", 761 | " avg_add_to_cart_by_user avg_days_since_last_bought last_ordered_in \\\n", 762 | "0 1.400000 17.600000 10.0 \n", 763 | "1 3.333333 19.555555 10.0 \n", 764 | "2 5.000000 28.000000 5.0 \n", 765 | "3 3.300000 17.600000 10.0 \n", 766 | "4 6.333333 21.666666 10.0 \n", 767 | "\n", 768 | " is_reorder_3 is_reorder_2 is_reorder_1 order_number order_dow \\\n", 769 | "0 1.0 1.0 1.0 11.0 4.0 \n", 770 | "1 1.0 1.0 1.0 11.0 4.0 \n", 771 | "2 0.0 0.0 0.0 NaN NaN \n", 772 | "3 1.0 1.0 1.0 NaN NaN \n", 773 | "4 1.0 0.0 0.0 11.0 4.0 \n", 774 | "\n", 775 | " order_hour_of_day days_since_prior_order reordered \n", 776 | "0 8.0 14.0 1.0 \n", 777 | "1 8.0 14.0 1.0 \n", 778 | "2 NaN NaN NaN \n", 779 | "3 NaN NaN NaN \n", 780 | "4 8.0 14.0 1.0 " 781 | ] 782 | }, 783 | "execution_count": 9, 784 | "metadata": {}, 785 | "output_type": "execute_result" 786 | } 787 | ], 788 | "source": [ 789 | "df = df.merge(train_orders, on = ['user_id', 'product_id'], how = 'outer')\n", 790 | "df.head()" 791 | ] 792 | }, 793 | { 794 | "cell_type": "markdown", 795 | "metadata": {}, 796 | "source": [ 797 | "for order_number, order_dow, order_hour_of_day, days_since_prior_order, impute null values with mean values grouped by users as these products will also be potential candidate for order." 
798 | ] 799 | }, 800 | { 801 | "cell_type": "code", 802 | "execution_count": 10, 803 | "metadata": { 804 | "scrolled": true 805 | }, 806 | "outputs": [], 807 | "source": [ 808 | "df.order_number.fillna(df.groupby('user_id')['order_number'].transform('mean'), inplace = True)\n", 809 | "df.order_dow.fillna(df.groupby('user_id')['order_dow'].transform('mean'), inplace = True)\n", 810 | "df.order_hour_of_day.fillna(df.groupby('user_id')['order_hour_of_day'].transform('mean'), inplace = True)\n", 811 | "df.days_since_prior_order.fillna(df.groupby('user_id')['days_since_prior_order'].\\\n", 812 | " transform('mean'), inplace = True)" 813 | ] 814 | }, 815 | { 816 | "cell_type": "markdown", 817 | "metadata": {}, 818 | "source": [ 819 | "Removing those products which were bought the first time in last order by a user" 820 | ] 821 | }, 822 | { 823 | "cell_type": "code", 824 | "execution_count": 11, 825 | "metadata": {}, 826 | "outputs": [ 827 | { 828 | "data": { 829 | "text/plain": [ 830 | "1.0 828824\n", 831 | "0.0 555793\n", 832 | "Name: reordered, dtype: int64" 833 | ] 834 | }, 835 | "execution_count": 11, 836 | "metadata": {}, 837 | "output_type": "execute_result" 838 | } 839 | ], 840 | "source": [ 841 | "df.reordered.value_counts()" 842 | ] 843 | }, 844 | { 845 | "cell_type": "code", 846 | "execution_count": 12, 847 | "metadata": {}, 848 | "outputs": [ 849 | { 850 | "data": { 851 | "text/plain": [ 852 | "7645837" 853 | ] 854 | }, 855 | "execution_count": 12, 856 | "metadata": {}, 857 | "output_type": "execute_result" 858 | } 859 | ], 860 | "source": [ 861 | "df.reordered.isnull().sum()" 862 | ] 863 | }, 864 | { 865 | "cell_type": "code", 866 | "execution_count": 13, 867 | "metadata": {}, 868 | "outputs": [], 869 | "source": [ 870 | "df = df[df.reordered != 0]" 871 | ] 872 | }, 873 | { 874 | "cell_type": "code", 875 | "execution_count": 14, 876 | "metadata": {}, 877 | "outputs": [ 878 | { 879 | "data": { 880 | "text/plain": [ 881 | "(8474661, 16)" 882 | ] 883 | }, 884 | 
"execution_count": 14, 885 | "metadata": {}, 886 | "output_type": "execute_result" 887 | } 888 | ], 889 | "source": [ 890 | "df.shape" 891 | ] 892 | }, 893 | { 894 | "cell_type": "markdown", 895 | "metadata": {}, 896 | "source": [ 897 | "Now imputing 0 in reordered as they were not reordered by user in his/her last order." 898 | ] 899 | }, 900 | { 901 | "cell_type": "code", 902 | "execution_count": 15, 903 | "metadata": {}, 904 | "outputs": [ 905 | { 906 | "data": { 907 | "text/plain": [ 908 | "user_id 0\n", 909 | "product_id 0\n", 910 | "total_product_orders_by_user 0\n", 911 | "total_product_reorders_by_user 0\n", 912 | "user_product_reorder_percentage 0\n", 913 | "avg_add_to_cart_by_user 0\n", 914 | "avg_days_since_last_bought 0\n", 915 | "last_ordered_in 0\n", 916 | "is_reorder_3 0\n", 917 | "is_reorder_2 0\n", 918 | "is_reorder_1 0\n", 919 | "order_number 0\n", 920 | "order_dow 0\n", 921 | "order_hour_of_day 0\n", 922 | "days_since_prior_order 0\n", 923 | "reordered 0\n", 924 | "dtype: int64" 925 | ] 926 | }, 927 | "execution_count": 15, 928 | "metadata": {}, 929 | "output_type": "execute_result" 930 | } 931 | ], 932 | "source": [ 933 | "df.reordered.fillna(0, inplace = True)\n", 934 | "\n", 935 | "df.isnull().sum()" 936 | ] 937 | }, 938 | { 939 | "cell_type": "code", 940 | "execution_count": 16, 941 | "metadata": {}, 942 | "outputs": [ 943 | { 944 | "data": { 945 | "text/html": [ 946 | "
\n", 947 | "\n", 960 | "\n", 961 | " \n", 962 | " \n", 963 | " \n", 964 | " \n", 965 | " \n", 966 | " \n", 967 | " \n", 968 | " \n", 969 | " \n", 970 | " \n", 971 | " \n", 972 | " \n", 973 | " \n", 974 | " \n", 975 | " \n", 976 | " \n", 977 | " \n", 978 | " \n", 979 | " \n", 980 | " \n", 981 | " \n", 982 | " \n", 983 | " \n", 984 | " \n", 985 | " \n", 986 | " \n", 987 | " \n", 988 | " \n", 989 | " \n", 990 | " \n", 991 | " \n", 992 | " \n", 993 | " \n", 994 | " \n", 995 | " \n", 996 | " \n", 997 | " \n", 998 | " \n", 999 | " \n", 1000 | " \n", 1001 | " \n", 1002 | " \n", 1003 | " \n", 1004 | " \n", 1005 | " \n", 1006 | " \n", 1007 | " \n", 1008 | " \n", 1009 | " \n", 1010 | " \n", 1011 | " \n", 1012 | " \n", 1013 | " \n", 1014 | " \n", 1015 | " \n", 1016 | " \n", 1017 | " \n", 1018 | " \n", 1019 | " \n", 1020 | " \n", 1021 | " \n", 1022 | " \n", 1023 | " \n", 1024 | " \n", 1025 | " \n", 1026 | " \n", 1027 | " \n", 1028 | " \n", 1029 | " \n", 1030 | " \n", 1031 | " \n", 1032 | " \n", 1033 | " \n", 1034 | " \n", 1035 | " \n", 1036 | " \n", 1037 | " \n", 1038 | " \n", 1039 | " \n", 1040 | " \n", 1041 | " \n", 1042 | " \n", 1043 | " \n", 1044 | " \n", 1045 | " \n", 1046 | " \n", 1047 | " \n", 1048 | " \n", 1049 | " \n", 1050 | " \n", 1051 | " \n", 1052 | " \n", 1053 | " \n", 1054 | " \n", 1055 | " \n", 1056 | " \n", 1057 | " \n", 1058 | " \n", 1059 | " \n", 1060 | " \n", 1061 | " \n", 1062 | " \n", 1063 | " \n", 1064 | " \n", 1065 | " \n", 1066 | " \n", 1067 | " \n", 1068 | " \n", 1069 | " \n", 1070 | " \n", 1071 | " \n", 1072 | " \n", 1073 | " \n", 1074 | " \n", 1075 | " \n", 1076 | " \n", 1077 | " \n", 1078 | " \n", 1079 | "
   user_id  product_id  total_product_orders_by_user  total_product_reorders_by_user  user_product_reorder_percentage  avg_add_to_cart_by_user  avg_days_since_last_bought  last_ordered_in  is_reorder_3  is_reorder_2  is_reorder_1  order_number  order_dow  order_hour_of_day  days_since_prior_order  reordered
0        1         196                          10.0                             9.0                         0.900000                 1.400000                   17.600000             10.0           1.0           1.0           1.0          11.0        4.0                8.0                    14.0        1.0
1        1       10258                           9.0                             8.0                         0.888889                 3.333333                   19.555555             10.0           1.0           1.0           1.0          11.0        4.0                8.0                    14.0        1.0
2        1       10326                           1.0                             0.0                         0.000000                 5.000000                   28.000000              5.0           0.0           0.0           0.0          11.0        4.0                8.0                    14.0        0.0
3        1       12427                          10.0                             9.0                         0.900000                 3.300000                   17.600000             10.0           1.0           1.0           1.0          11.0        4.0                8.0                    14.0        0.0
4        1       13032                           3.0                             2.0                         0.666667                 6.333333                   21.666666             10.0           1.0           0.0           0.0          11.0        4.0                8.0                    14.0        1.0
\n", 1080 | "
" 1081 | ], 1082 | "text/plain": [ 1083 | " user_id product_id total_product_orders_by_user \\\n", 1084 | "0 1 196 10.0 \n", 1085 | "1 1 10258 9.0 \n", 1086 | "2 1 10326 1.0 \n", 1087 | "3 1 12427 10.0 \n", 1088 | "4 1 13032 3.0 \n", 1089 | "\n", 1090 | " total_product_reorders_by_user user_product_reorder_percentage \\\n", 1091 | "0 9.0 0.900000 \n", 1092 | "1 8.0 0.888889 \n", 1093 | "2 0.0 0.000000 \n", 1094 | "3 9.0 0.900000 \n", 1095 | "4 2.0 0.666667 \n", 1096 | "\n", 1097 | " avg_add_to_cart_by_user avg_days_since_last_bought last_ordered_in \\\n", 1098 | "0 1.400000 17.600000 10.0 \n", 1099 | "1 3.333333 19.555555 10.0 \n", 1100 | "2 5.000000 28.000000 5.0 \n", 1101 | "3 3.300000 17.600000 10.0 \n", 1102 | "4 6.333333 21.666666 10.0 \n", 1103 | "\n", 1104 | " is_reorder_3 is_reorder_2 is_reorder_1 order_number order_dow \\\n", 1105 | "0 1.0 1.0 1.0 11.0 4.0 \n", 1106 | "1 1.0 1.0 1.0 11.0 4.0 \n", 1107 | "2 0.0 0.0 0.0 11.0 4.0 \n", 1108 | "3 1.0 1.0 1.0 11.0 4.0 \n", 1109 | "4 1.0 0.0 0.0 11.0 4.0 \n", 1110 | "\n", 1111 | " order_hour_of_day days_since_prior_order reordered \n", 1112 | "0 8.0 14.0 1.0 \n", 1113 | "1 8.0 14.0 1.0 \n", 1114 | "2 8.0 14.0 0.0 \n", 1115 | "3 8.0 14.0 0.0 \n", 1116 | "4 8.0 14.0 1.0 " 1117 | ] 1118 | }, 1119 | "execution_count": 16, 1120 | "metadata": {}, 1121 | "output_type": "execute_result" 1122 | } 1123 | ], 1124 | "source": [ 1125 | "df.head()" 1126 | ] 1127 | }, 1128 | { 1129 | "cell_type": "markdown", 1130 | "metadata": {}, 1131 | "source": [ 1132 | "#### Merging product and user features" 1133 | ] 1134 | }, 1135 | { 1136 | "cell_type": "code", 1137 | "execution_count": 17, 1138 | "metadata": { 1139 | "scrolled": true 1140 | }, 1141 | "outputs": [ 1142 | { 1143 | "data": { 1144 | "text/html": [ 1145 | "
\n", 1146 | "\n", 1159 | "\n", 1160 | " \n", 1161 | " \n", 1162 | " \n", 1163 | " \n", 1164 | " \n", 1165 | " \n", 1166 | " \n", 1167 | " \n", 1168 | " \n", 1169 | " \n", 1170 | " \n", 1171 | " \n", 1172 | " \n", 1173 | " \n", 1174 | " \n", 1175 | " \n", 1176 | " \n", 1177 | " \n", 1178 | " \n", 1179 | " \n", 1180 | " \n", 1181 | " \n", 1182 | " \n", 1183 | " \n", 1184 | " \n", 1185 | " \n", 1186 | " \n", 1187 | " \n", 1188 | " \n", 1189 | " \n", 1190 | " \n", 1191 | " \n", 1192 | " \n", 1193 | " \n", 1194 | " \n", 1195 | " \n", 1196 | " \n", 1197 | " \n", 1198 | " \n", 1199 | " \n", 1200 | " \n", 1201 | " \n", 1202 | " \n", 1203 | " \n", 1204 | " \n", 1205 | " \n", 1206 | " \n", 1207 | " \n", 1208 | " \n", 1209 | " \n", 1210 | " \n", 1211 | " \n", 1212 | " \n", 1213 | " \n", 1214 | " \n", 1215 | " \n", 1216 | " \n", 1217 | " \n", 1218 | " \n", 1219 | " \n", 1220 | " \n", 1221 | " \n", 1222 | " \n", 1223 | " \n", 1224 | " \n", 1225 | " \n", 1226 | " \n", 1227 | " \n", 1228 | " \n", 1229 | " \n", 1230 | " \n", 1231 | " \n", 1232 | " \n", 1233 | " \n", 1234 | " \n", 1235 | " \n", 1236 | " \n", 1237 | " \n", 1238 | " \n", 1239 | " \n", 1240 | " \n", 1241 | " \n", 1242 | " \n", 1243 | " \n", 1244 | " \n", 1245 | " \n", 1246 | " \n", 1247 | " \n", 1248 | " \n", 1249 | " \n", 1250 | " \n", 1251 | " \n", 1252 | " \n", 1253 | " \n", 1254 | " \n", 1255 | " \n", 1256 | " \n", 1257 | " \n", 1258 | " \n", 1259 | " \n", 1260 | " \n", 1261 | " \n", 1262 | " \n", 1263 | " \n", 1264 | " \n", 1265 | " \n", 1266 | " \n", 1267 | " \n", 1268 | " \n", 1269 | " \n", 1270 | " \n", 1271 | " \n", 1272 | " \n", 1273 | " \n", 1274 | " \n", 1275 | " \n", 1276 | " \n", 1277 | " \n", 1278 | " \n", 1279 | " \n", 1280 | " \n", 1281 | " \n", 1282 | " \n", 1283 | " \n", 1284 | " \n", 1285 | " \n", 1286 | " \n", 1287 | " \n", 1288 | " \n", 1289 | " \n", 1290 | " \n", 1291 | " \n", 1292 | " \n", 1293 | " \n", 1294 | " \n", 1295 | " \n", 1296 | " \n", 1297 | " \n", 1298 | " \n", 1299 | " \n", 1300 | " 
\n", 1301 | " \n", 1302 | " \n", 1303 | " \n", 1304 | " \n", 1305 | " \n", 1306 | " \n", 1307 | " \n", 1308 | "
   product_id  mean_add_to_cart_order  total_orders  total_reorders  reorder_percentage  unique_users  order_first_time_total_cnt  order_second_time_total_cnt  is_organic  second_time_percent  ...  department_total_orders  department_total_reorders  department_reorder_percentage  department_unique_users  department_0  department_1  department_2  department_3  department_4  department_5
0           1                5.801836          1852          1136.0            0.613391           716                         716                          276           0             0.385475  ...                  2887550                  1657973.0                       0.574180                   174219             0             0             0             0             0             1
1           2                9.888889            90            12.0            0.133333            78                          78                            8           0             0.102564  ...                  1875577                   650301.0                       0.346721                   172755             0             0             0             0             1             0
2           3                6.415162           277           203.0            0.732852            74                          74                           36           0             0.486486  ...                  2690129                  1757892.0                       0.653460                   172795             0             0             0             0             1             1
3           4                9.507599           329           147.0            0.446809           182                         182                           64           0             0.351648  ...                  2236432                  1211890.0                       0.541885                   163233             0             0             0             1             0             0
4           5                6.466667            15             9.0            0.600000             6                           6                            4           0             0.666667  ...                  1875577                   650301.0                       0.346721                   172755             0             0             0             0             1             0
\n", 1309 | "

5 rows × 37 columns

\n", 1310 | "
" 1311 | ], 1312 | "text/plain": [ 1313 | " product_id mean_add_to_cart_order total_orders total_reorders \\\n", 1314 | "0 1 5.801836 1852 1136.0 \n", 1315 | "1 2 9.888889 90 12.0 \n", 1316 | "2 3 6.415162 277 203.0 \n", 1317 | "3 4 9.507599 329 147.0 \n", 1318 | "4 5 6.466667 15 9.0 \n", 1319 | "\n", 1320 | " reorder_percentage unique_users order_first_time_total_cnt \\\n", 1321 | "0 0.613391 716 716 \n", 1322 | "1 0.133333 78 78 \n", 1323 | "2 0.732852 74 74 \n", 1324 | "3 0.446809 182 182 \n", 1325 | "4 0.600000 6 6 \n", 1326 | "\n", 1327 | " order_second_time_total_cnt is_organic second_time_percent ... \\\n", 1328 | "0 276 0 0.385475 ... \n", 1329 | "1 8 0 0.102564 ... \n", 1330 | "2 36 0 0.486486 ... \n", 1331 | "3 64 0 0.351648 ... \n", 1332 | "4 4 0 0.666667 ... \n", 1333 | "\n", 1334 | " department_total_orders department_total_reorders \\\n", 1335 | "0 2887550 1657973.0 \n", 1336 | "1 1875577 650301.0 \n", 1337 | "2 2690129 1757892.0 \n", 1338 | "3 2236432 1211890.0 \n", 1339 | "4 1875577 650301.0 \n", 1340 | "\n", 1341 | " department_reorder_percentage department_unique_users department_0 \\\n", 1342 | "0 0.574180 174219 0 \n", 1343 | "1 0.346721 172755 0 \n", 1344 | "2 0.653460 172795 0 \n", 1345 | "3 0.541885 163233 0 \n", 1346 | "4 0.346721 172755 0 \n", 1347 | "\n", 1348 | " department_1 department_2 department_3 department_4 department_5 \n", 1349 | "0 0 0 0 0 1 \n", 1350 | "1 0 0 0 1 0 \n", 1351 | "2 0 0 0 1 1 \n", 1352 | "3 0 0 1 0 0 \n", 1353 | "4 0 0 0 1 0 \n", 1354 | "\n", 1355 | "[5 rows x 37 columns]" 1356 | ] 1357 | }, 1358 | "execution_count": 17, 1359 | "metadata": {}, 1360 | "output_type": "execute_result" 1361 | } 1362 | ], 1363 | "source": [ 1364 | "product_features.head()" 1365 | ] 1366 | }, 1367 | { 1368 | "cell_type": "code", 1369 | "execution_count": 18, 1370 | "metadata": {}, 1371 | "outputs": [ 1372 | { 1373 | "data": { 1374 | "text/html": [ 1375 | "
" 1534 | ], 1535 | "text/plain": [ 1536 | " user_id avg_dow std_dow avg_doh std_doh avg_since_order \\\n", 1537 | "0 1 2.644068 1.256194 10.542373 3.500355 18.542374 \n", 1538 | "1 2 2.005128 0.971222 10.441026 1.649854 14.902564 \n", 1539 | "2 3 1.011364 1.245630 16.352273 1.454599 10.181818 \n", 1540 | "3 4 4.722222 0.826442 13.111111 1.745208 11.944445 \n", 1541 | "4 5 1.621622 1.276961 15.729730 2.588958 10.189189 \n", 1542 | "\n", 1543 | " std_since_order total_orders_by_user total_products_by_user \\\n", 1544 | "0 10.559066 10 59 \n", 1545 | "1 9.671712 14 195 \n", 1546 | "2 5.867395 12 88 \n", 1547 | "3 9.973330 5 18 \n", 1548 | "4 7.600577 4 37 \n", 1549 | "\n", 1550 | " total_unique_product_by_user total_reorders_by_user \\\n", 1551 | "0 18 41.0 \n", 1552 | "1 102 93.0 \n", 1553 | "2 33 55.0 \n", 1554 | "3 17 1.0 \n", 1555 | "4 23 14.0 \n", 1556 | "\n", 1557 | " reorder_propotion_by_user average_order_size reorder_in_order orders_3 \\\n", 1558 | "0 0.694915 5.900000 0.705833 6 \n", 1559 | "1 0.476923 13.928571 0.447961 19 \n", 1560 | "2 0.625000 7.333333 0.658817 6 \n", 1561 | "3 0.055556 3.600000 0.028571 7 \n", 1562 | "4 0.378378 9.250000 0.377778 9 \n", 1563 | "\n", 1564 | " orders_2 orders_1 reorder_3 reorder_2 reorder_1 \n", 1565 | "0 6 9 0.666667 1.0 0.666667 \n", 1566 | "1 9 16 0.578947 0.0 0.625000 \n", 1567 | "2 5 6 0.833333 1.0 1.000000 \n", 1568 | "3 2 3 0.142857 0.0 0.000000 \n", 1569 | "4 5 12 0.444444 0.4 0.666667 " 1570 | ] 1571 | }, 1572 | "execution_count": 18, 1573 | "metadata": {}, 1574 | "output_type": "execute_result" 1575 | } 1576 | ], 1577 | "source": [ 1578 | "user_features.head()" 1579 | ] 1580 | }, 1581 | { 1582 | "cell_type": "code", 1583 | "execution_count": 19, 1584 | "metadata": {}, 1585 | "outputs": [ 1586 | { 1587 | "data": { 1588 | "text/html": [ 1589 | "
" 1755 | ], 1756 | "text/plain": [ 1757 | " user_id product_id total_product_orders_by_user \\\n", 1758 | "0 1 196 10.0 \n", 1759 | "1 1 10258 9.0 \n", 1760 | "2 1 10326 1.0 \n", 1761 | "3 1 12427 10.0 \n", 1762 | "4 1 13032 3.0 \n", 1763 | "\n", 1764 | " total_product_reorders_by_user user_product_reorder_percentage \\\n", 1765 | "0 9.0 0.900000 \n", 1766 | "1 8.0 0.888889 \n", 1767 | "2 0.0 0.000000 \n", 1768 | "3 9.0 0.900000 \n", 1769 | "4 2.0 0.666667 \n", 1770 | "\n", 1771 | " avg_add_to_cart_by_user avg_days_since_last_bought last_ordered_in \\\n", 1772 | "0 1.400000 17.600000 10.0 \n", 1773 | "1 3.333333 19.555555 10.0 \n", 1774 | "2 5.000000 28.000000 5.0 \n", 1775 | "3 3.300000 17.600000 10.0 \n", 1776 | "4 6.333333 21.666666 10.0 \n", 1777 | "\n", 1778 | " is_reorder_3 is_reorder_2 ... total_reorders_by_user \\\n", 1779 | "0 1.0 1.0 ... 41.0 \n", 1780 | "1 1.0 1.0 ... 41.0 \n", 1781 | "2 0.0 0.0 ... 41.0 \n", 1782 | "3 1.0 1.0 ... 41.0 \n", 1783 | "4 1.0 0.0 ... 41.0 \n", 1784 | "\n", 1785 | " reorder_propotion_by_user average_order_size reorder_in_order orders_3 \\\n", 1786 | "0 0.694915 5.9 0.705833 6 \n", 1787 | "1 0.694915 5.9 0.705833 6 \n", 1788 | "2 0.694915 5.9 0.705833 6 \n", 1789 | "3 0.694915 5.9 0.705833 6 \n", 1790 | "4 0.694915 5.9 0.705833 6 \n", 1791 | "\n", 1792 | " orders_2 orders_1 reorder_3 reorder_2 reorder_1 \n", 1793 | "0 6 9 0.666667 1.0 0.666667 \n", 1794 | "1 6 9 0.666667 1.0 0.666667 \n", 1795 | "2 6 9 0.666667 1.0 0.666667 \n", 1796 | "3 6 9 0.666667 1.0 0.666667 \n", 1797 | "4 6 9 0.666667 1.0 0.666667 \n", 1798 | "\n", 1799 | "[5 rows x 71 columns]" 1800 | ] 1801 | }, 1802 | "execution_count": 19, 1803 | "metadata": {}, 1804 | "output_type": "execute_result" 1805 | } 1806 | ], 1807 | "source": [ 1808 | "df = df.merge(product_features, on = 'product_id', how = 'left')\n", 1809 | "df = df.merge(user_features, on = 'user_id', how = 'left')\n", 1810 | "df.head()" 1811 | ] 1812 | }, 1813 | { 1814 | "cell_type": "markdown", 1815 | 
"metadata": {}, 1816 | "source": [ 1817 | "The dataframe has null values because the product was never bought earlier by a user" 1818 | ] 1819 | }, 1820 | { 1821 | "cell_type": "code", 1822 | "execution_count": 20, 1823 | "metadata": {}, 1824 | "outputs": [ 1825 | { 1826 | "data": { 1827 | "text/plain": [ 1828 | "(8474661, 71)" 1829 | ] 1830 | }, 1831 | "execution_count": 20, 1832 | "metadata": {}, 1833 | "output_type": "execute_result" 1834 | } 1835 | ], 1836 | "source": [ 1837 | "df.shape" 1838 | ] 1839 | }, 1840 | { 1841 | "cell_type": "code", 1842 | "execution_count": 21, 1843 | "metadata": {}, 1844 | "outputs": [ 1845 | { 1846 | "data": { 1847 | "text/plain": [ 1848 | "reorder_1 0\n", 1849 | "aisle_mean_add_to_cart_order 0\n", 1850 | "reorder_percentage 0\n", 1851 | "unique_users 0\n", 1852 | "order_first_time_total_cnt 0\n", 1853 | "order_second_time_total_cnt 0\n", 1854 | "is_organic 0\n", 1855 | "second_time_percent 0\n", 1856 | "aisle_std_add_to_cart_order 0\n", 1857 | "total_orders 0\n", 1858 | "aisle_total_orders 0\n", 1859 | "aisle_total_reorders 0\n", 1860 | "aisle_reorder_percentage 0\n", 1861 | "aisle_unique_users 0\n", 1862 | "aisle_0 0\n", 1863 | "aisle_1 0\n", 1864 | "total_reorders 0\n", 1865 | "mean_add_to_cart_order 0\n", 1866 | "aisle_3 0\n", 1867 | "last_ordered_in 0\n", 1868 | "product_id 0\n", 1869 | "total_product_orders_by_user 0\n", 1870 | "total_product_reorders_by_user 0\n", 1871 | "user_product_reorder_percentage 0\n", 1872 | "avg_add_to_cart_by_user 0\n", 1873 | "avg_days_since_last_bought 0\n", 1874 | "is_reorder_3 0\n", 1875 | "reordered 0\n", 1876 | "is_reorder_2 0\n", 1877 | "is_reorder_1 0\n", 1878 | " ..\n", 1879 | "total_orders_by_user 0\n", 1880 | "total_products_by_user 0\n", 1881 | "total_unique_product_by_user 0\n", 1882 | "reorder_propotion_by_user 0\n", 1883 | "std_dow 0\n", 1884 | "average_order_size 0\n", 1885 | "reorder_in_order 0\n", 1886 | "orders_3 0\n", 1887 | "orders_2 0\n", 1888 | "orders_1 0\n", 1889 | 
"reorder_3 0\n", 1890 | "avg_doh 0\n", 1891 | "avg_dow 0\n", 1892 | "aisle_5 0\n", 1893 | "department_total_reorders 0\n", 1894 | "aisle_6 0\n", 1895 | "aisle_7 0\n", 1896 | "aisle_8 0\n", 1897 | "department_mean_add_to_cart_order 0\n", 1898 | "department_std_add_to_cart_order 0\n", 1899 | "department_total_orders 0\n", 1900 | "department_reorder_percentage 0\n", 1901 | "department_5 0\n", 1902 | "department_unique_users 0\n", 1903 | "department_0 0\n", 1904 | "department_1 0\n", 1905 | "department_2 0\n", 1906 | "department_3 0\n", 1907 | "department_4 0\n", 1908 | "user_id 0\n", 1909 | "Length: 71, dtype: int64" 1910 | ] 1911 | }, 1912 | "execution_count": 21, 1913 | "metadata": {}, 1914 | "output_type": "execute_result" 1915 | } 1916 | ], 1917 | "source": [ 1918 | "df.isnull().sum().sort_values(ascending = False)" 1919 | ] 1920 | }, 1921 | { 1922 | "cell_type": "code", 1923 | "execution_count": 22, 1924 | "metadata": {}, 1925 | "outputs": [], 1926 | "source": [ 1927 | "df.to_pickle(root + 'Finaldata.pkl')" 1928 | ] 1929 | }, 1930 | { 1931 | "cell_type": "code", 1932 | "execution_count": 23, 1933 | "metadata": {}, 1934 | "outputs": [ 1935 | { 1936 | "data": { 1937 | "text/html": [ 1938 | "
" 2104 | ], 2105 | "text/plain": [ 2106 | " user_id product_id total_product_orders_by_user \\\n", 2107 | "0 1 196 10.0 \n", 2108 | "1 1 10258 9.0 \n", 2109 | "2 1 10326 1.0 \n", 2110 | "3 1 12427 10.0 \n", 2111 | "4 1 13032 3.0 \n", 2112 | "\n", 2113 | " total_product_reorders_by_user user_product_reorder_percentage \\\n", 2114 | "0 9.0 0.900000 \n", 2115 | "1 8.0 0.888889 \n", 2116 | "2 0.0 0.000000 \n", 2117 | "3 9.0 0.900000 \n", 2118 | "4 2.0 0.666667 \n", 2119 | "\n", 2120 | " avg_add_to_cart_by_user avg_days_since_last_bought last_ordered_in \\\n", 2121 | "0 1.400000 17.600000 10.0 \n", 2122 | "1 3.333333 19.555555 10.0 \n", 2123 | "2 5.000000 28.000000 5.0 \n", 2124 | "3 3.300000 17.600000 10.0 \n", 2125 | "4 6.333333 21.666666 10.0 \n", 2126 | "\n", 2127 | " is_reorder_3 is_reorder_2 ... total_reorders_by_user \\\n", 2128 | "0 1.0 1.0 ... 41.0 \n", 2129 | "1 1.0 1.0 ... 41.0 \n", 2130 | "2 0.0 0.0 ... 41.0 \n", 2131 | "3 1.0 1.0 ... 41.0 \n", 2132 | "4 1.0 0.0 ... 41.0 \n", 2133 | "\n", 2134 | " reorder_propotion_by_user average_order_size reorder_in_order orders_3 \\\n", 2135 | "0 0.694915 5.9 0.705833 6 \n", 2136 | "1 0.694915 5.9 0.705833 6 \n", 2137 | "2 0.694915 5.9 0.705833 6 \n", 2138 | "3 0.694915 5.9 0.705833 6 \n", 2139 | "4 0.694915 5.9 0.705833 6 \n", 2140 | "\n", 2141 | " orders_2 orders_1 reorder_3 reorder_2 reorder_1 \n", 2142 | "0 6 9 0.666667 1.0 0.666667 \n", 2143 | "1 6 9 0.666667 1.0 0.666667 \n", 2144 | "2 6 9 0.666667 1.0 0.666667 \n", 2145 | "3 6 9 0.666667 1.0 0.666667 \n", 2146 | "4 6 9 0.666667 1.0 0.666667 \n", 2147 | "\n", 2148 | "[5 rows x 71 columns]" 2149 | ] 2150 | }, 2151 | "execution_count": 23, 2152 | "metadata": {}, 2153 | "output_type": "execute_result" 2154 | } 2155 | ], 2156 | "source": [ 2157 | "df2 = pd.read_pickle(root +'Finaldata.pkl')\n", 2158 | "df2.head()" 2159 | ] 2160 | }, 2161 | { 2162 | "cell_type": "markdown", 2163 | "metadata": {}, 2164 | "source": [ 2165 | "Yayyyyy. 
Ready for some cool modeling now :p" 2166 | ] 2167 | } 2168 | ], 2169 | "metadata": { 2170 | "kernelspec": { 2171 | "display_name": "Python 3", 2172 | "language": "python", 2173 | "name": "python3" 2174 | }, 2175 | "language_info": { 2176 | "codemirror_mode": { 2177 | "name": "ipython", 2178 | "version": 3 2179 | }, 2180 | "file_extension": ".py", 2181 | "mimetype": "text/x-python", 2182 | "name": "python", 2183 | "nbconvert_exporter": "python", 2184 | "pygments_lexer": "ipython3", 2185 | "version": "3.7.3" 2186 | } 2187 | }, 2188 | "nbformat": 4, 2189 | "nbformat_minor": 2 2190 | } 2191 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2019 Arch Jignesh Desai 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 
22 | -------------------------------------------------------------------------------- /Market Basket Analysis.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Market Basket Analysis" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "Market basket analysis scrutinizes the products customers tend to buy together, and uses the information to decide which products should be cross-sold or promoted together. The term arises from the shopping carts supermarket shoppers fill up during a shopping trip.\n", 15 | "\n", 16 | "Association Rule Mining is used when we want to find an association between different objects in a set, or to find frequent patterns in a transaction database, a relational database, or any other information repository.\n", 17 | "\n", 18 | "The most common approach to finding these patterns is Market Basket Analysis, a key technique used by large retailers like Amazon, Flipkart, etc., to analyze customer buying habits by finding associations between the different items that customers place in their “shopping baskets”. The discovery of these associations can help retailers develop marketing strategies by gaining insight into which items are frequently purchased together by customers. The strategies may include:\n", 19 | "\n", 20 | "- Changing the store layout according to trends\n", 21 | "- Customer behavior analysis\n", 22 | "- Catalog design\n", 23 | "- Cross marketing on online stores\n", 24 | "- Customized emails with add-on sales, etc." 25 | ] 26 | }, 27 | { 28 | "cell_type": "markdown", 29 | "metadata": {}, 30 | "source": [ 31 | "### Metrics\n", 32 | "\n", 33 | "- **Support** : It is the default popularity of an item.
In mathematical terms, the support of item A is the ratio of transactions involving A to the total number of transactions.\n", 34 | "\n", 35 | "\n", 36 | "- **Confidence** : Likelihood that a customer who bought A also bought B. It is the ratio of the number of transactions involving both A and B to the number of transactions involving A.\n", 37 | " - Confidence(A => B) = Support(A, B)/Support(A)\n", 38 | "\n", 39 | "\n", 40 | "- **Lift** : How much more likely B is to be bought when A is bought, relative to how often B is bought overall.\n", 41 | " \n", 42 | " - Lift(A => B) = Confidence(A => B)/Support(B)\n", 43 | " \n", 44 | " - Lift (A => B) = 1 means that there is no correlation within the itemset.\n", 45 | " - Lift (A => B) > 1 means that there is a positive correlation within the itemset, i.e., the products in the itemset, A and B, are more likely to be bought together.\n", 46 | " - Lift (A => B) < 1 means that there is a negative correlation within the itemset, i.e., the products in the itemset, A and B, are less likely to be bought together." 47 | ] 48 | }, 49 | { 50 | "cell_type": "markdown", 51 | "metadata": {}, 52 | "source": [ 53 | "**Apriori Algorithm:** The Apriori algorithm relies on the property that any subset of a frequent itemset must also be frequent. It is the algorithm used for Market Basket Analysis in this notebook. Say, a transaction containing {Grapes, Apple, Mango} also contains {Grapes, Mango}. So, according to the Apriori principle, if {Grapes, Apple, Mango} is frequent, then {Grapes, Mango} must also be frequent."
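As a rough illustration of the support, confidence, and lift definitions and of the Apriori property described above, here is a minimal sketch in plain Python. The toy `transactions` list and the helper names `support`, `confidence`, and `lift` are assumptions made for this example only; they are not part of the notebook, the Instacart data, or mlxtend.

```python
# Toy transactions (hypothetical data, not taken from the Instacart files).
transactions = [
    {"Grapes", "Apple", "Mango"},
    {"Grapes", "Mango"},
    {"Apple", "Mango"},
    {"Grapes", "Apple"},
    {"Grapes", "Apple", "Mango"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Confidence(A => B) = Support(A, B) / Support(A)."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)


def lift(antecedent, consequent, transactions):
    """Lift(A => B) = Confidence(A => B) / Support(B)."""
    return confidence(antecedent, consequent, transactions) / support(
        consequent, transactions
    )


print("support(Grapes):", support({"Grapes"}, transactions))
print("support(Grapes, Mango):", support({"Grapes", "Mango"}, transactions))
print("confidence(Grapes => Mango):", confidence({"Grapes"}, {"Mango"}, transactions))
print("lift(Grapes => Mango):", lift({"Grapes"}, {"Mango"}, transactions))

# Apriori property: a superset can never be more frequent than its subset.
assert support({"Grapes", "Apple", "Mango"}, transactions) <= support(
    {"Grapes", "Mango"}, transactions
)
```

On this toy data the lift of Grapes => Mango comes out slightly below 1, which under the interpretation above suggests the two items are bought together a little less often than their individual popularity would predict. The final assertion checks the Apriori property on one itemset and its subset; it is what lets the algorithm prune candidate itemsets whose subsets are already infrequent.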
54 | ] 55 | }, 56 | { 57 | "cell_type": "code", 58 | "execution_count": 1, 59 | "metadata": {}, 60 | "outputs": [], 61 | "source": [ 62 | "import numpy as np\n", 63 | "import pandas as pd\n", 64 | "from mlxtend.frequent_patterns import apriori\n", 65 | "from mlxtend.frequent_patterns import association_rules\n", 66 | "\n", 67 | "root = 'C:/Data/instacart-market-basket-analysis/'" 68 | ] 69 | }, 70 | { 71 | "cell_type": "markdown", 72 | "metadata": {}, 73 | "source": [ 74 | "### Data" 75 | ] 76 | }, 77 | { 78 | "cell_type": "code", 79 | "execution_count": 2, 80 | "metadata": {}, 81 | "outputs": [], 82 | "source": [ 83 | "orders = pd.read_csv(root + 'orders.csv')\n", 84 | "order_products_prior = pd.read_csv(root + 'order_products__prior.csv')\n", 85 | "order_products_train = pd.read_csv(root + 'order_products__train.csv')\n", 86 | "products = pd.read_csv(root + 'products.csv')" 87 | ] 88 | }, 89 | { 90 | "cell_type": "code", 91 | "execution_count": 3, 92 | "metadata": {}, 93 | "outputs": [ 94 | { 95 | "data": { 96 | "text/plain": [ 97 | "(33819106, 4)" 98 | ] 99 | }, 100 | "execution_count": 3, 101 | "metadata": {}, 102 | "output_type": "execute_result" 103 | } 104 | ], 105 | "source": [ 106 | "order_products = pd.concat([order_products_prior, order_products_train])\n", 107 | "order_products.shape" 108 | ] 109 | }, 110 | { 111 | "cell_type": "code", 112 | "execution_count": 4, 113 | "metadata": {}, 114 | "outputs": [ 115 | { 116 | "data": { 117 | "text/html": [ 118 | "
" 181 | ], 182 | "text/plain": [ 183 | " order_id product_id add_to_cart_order reordered\n", 184 | "0 2 33120 1 1\n", 185 | "1 2 28985 2 1\n", 186 | "2 2 9327 3 0\n", 187 | "3 2 45918 4 1\n", 188 | "4 2 30035 5 0" 189 | ] 190 | }, 191 | "execution_count": 4, 192 | "metadata": {}, 193 | "output_type": "execute_result" 194 | } 195 | ], 196 | "source": [ 197 | "order_products.head()" 198 | ] 199 | }, 200 | { 201 | "cell_type": "code", 202 | "execution_count": 5, 203 | "metadata": {}, 204 | "outputs": [ 205 | { 206 | "data": { 207 | "text/plain": [ 208 | "49685" 209 | ] 210 | }, 211 | "execution_count": 5, 212 | "metadata": {}, 213 | "output_type": "execute_result" 214 | } 215 | ], 216 | "source": [ 217 | "order_products.product_id.nunique()" 218 | ] 219 | }, 220 | { 221 | "cell_type": "markdown", 222 | "metadata": {}, 223 | "source": [ 224 | "Of the 49,685 unique products, keep only the 100 most frequent." 225 | ] 226 | }, 227 | { 228 | "cell_type": "code", 229 | "execution_count": 6, 230 | "metadata": {}, 231 | "outputs": [ 232 | { 233 | "data": { 234 | "text/html": [ 235 | "
" 344 | ], 345 | "text/plain": [ 346 | " product_id frequency product_name aisle_id department_id\n", 347 | "0 24852 491291 Banana 24 4\n", 348 | "1 13176 394930 Bag of Organic Bananas 24 4\n", 349 | "2 21137 275577 Organic Strawberries 24 4\n", 350 | "3 21903 251705 Organic Baby Spinach 123 4\n", 351 | "4 47209 220877 Organic Hass Avocado 24 4\n", 352 | "5 47766 184224 Organic Avocado 24 4\n", 353 | "6 47626 160792 Large Lemon 24 4\n", 354 | "7 16797 149445 Strawberries 24 4\n", 355 | "8 26209 146660 Limes 24 4\n", 356 | "9 27845 142813 Organic Whole Milk 84 16" 357 | ] 358 | }, 359 | "execution_count": 6, 360 | "metadata": {}, 361 | "output_type": "execute_result" 362 | } 363 | ], 364 | "source": [ 365 | "product_counts = order_products.groupby('product_id')['order_id'].count().reset_index().rename(columns = {'order_id':'frequency'})\n", 366 | "product_counts = product_counts.sort_values('frequency', ascending=False)[0:100].reset_index(drop = True)\n", 367 | "product_counts = product_counts.merge(products, on = 'product_id', how = 'left')\n", 368 | "product_counts.head(10)" 369 | ] 370 | }, 371 | { 372 | "cell_type": "markdown", 373 | "metadata": {}, 374 | "source": [ 375 | "Keeping 100 most frequent items in order_products dataframe" 376 | ] 377 | }, 378 | { 379 | "cell_type": "code", 380 | "execution_count": 7, 381 | "metadata": {}, 382 | "outputs": [ 383 | { 384 | "data": { 385 | "text/plain": [ 386 | "[13176, 21137, 21903, 47209, 47766, 47626, 16797, 26209, 27845]" 387 | ] 388 | }, 389 | "execution_count": 7, 390 | "metadata": {}, 391 | "output_type": "execute_result" 392 | } 393 | ], 394 | "source": [ 395 | "freq_products = list(product_counts.product_id)\n", 396 | "freq_products[1:10]" 397 | ] 398 | }, 399 | { 400 | "cell_type": "code", 401 | "execution_count": 8, 402 | "metadata": {}, 403 | "outputs": [ 404 | { 405 | "data": { 406 | "text/plain": [ 407 | "100" 408 | ] 409 | }, 410 | "execution_count": 8, 411 | "metadata": {}, 412 | "output_type": 
"execute_result" 413 | } 414 | ], 415 | "source": [ 416 | "len(freq_products)" 417 | ] 418 | }, 419 | { 420 | "cell_type": "code", 421 | "execution_count": 9, 422 | "metadata": {}, 423 | "outputs": [ 424 | { 425 | "data": { 426 | "text/plain": [ 427 | "(7795471, 4)" 428 | ] 429 | }, 430 | "execution_count": 9, 431 | "metadata": {}, 432 | "output_type": "execute_result" 433 | } 434 | ], 435 | "source": [ 436 | "order_products = order_products[order_products.product_id.isin(freq_products)]\n", 437 | "order_products.shape" 438 | ] 439 | }, 440 | { 441 | "cell_type": "code", 442 | "execution_count": 10, 443 | "metadata": {}, 444 | "outputs": [ 445 | { 446 | "data": { 447 | "text/plain": [ 448 | "2444982" 449 | ] 450 | }, 451 | "execution_count": 10, 452 | "metadata": {}, 453 | "output_type": "execute_result" 454 | } 455 | ], 456 | "source": [ 457 | "order_products.order_id.nunique()" 458 | ] 459 | }, 460 | { 461 | "cell_type": "code", 462 | "execution_count": 13, 463 | "metadata": {}, 464 | "outputs": [ 465 | { 466 | "data": { 467 | "text/html": [ 468 | "
" 549 | ], 550 | "text/plain": [ 551 | " order_id product_id add_to_cart_order reordered product_name \\\n", 552 | "0 2 28985 2 1 Michigan Organic Kale \n", 553 | "1 2 17794 6 1 Carrots \n", 554 | "2 3 24838 2 1 Unsweetened Almondmilk \n", 555 | "3 3 21903 4 1 Organic Baby Spinach \n", 556 | "4 3 46667 6 1 Organic Ginger Root \n", 557 | "\n", 558 | " aisle_id department_id \n", 559 | "0 83 4 \n", 560 | "1 83 4 \n", 561 | "2 91 16 \n", 562 | "3 123 4 \n", 563 | "4 83 4 " 564 | ] 565 | }, 566 | "execution_count": 13, 567 | "metadata": {}, 568 | "output_type": "execute_result" 569 | } 570 | ], 571 | "source": [ 572 | "order_products = order_products.merge(products, on = 'product_id', how='left')\n", 573 | "order_products.head()" 574 | ] 575 | }, 576 | { 577 | "cell_type": "markdown", 578 | "metadata": {}, 579 | "source": [ 580 | "Structuring the data for feeding in the algorithm" 581 | ] 582 | }, 583 | { 584 | "cell_type": "code", 585 | "execution_count": 14, 586 | "metadata": {}, 587 | "outputs": [ 588 | { 589 | "data": { 590 | "text/html": [ 591 | "
\n", 592 | "\n", 605 | "\n", 606 | " \n", 607 | " \n", 608 | " \n", 609 | " \n", 610 | " \n", 611 | " \n", 612 | " \n", 613 | " \n", 614 | " \n", 615 | " \n", 616 | " \n", 617 | " \n", 618 | " \n", 619 | " \n", 620 | " \n", 621 | " \n", 622 | " \n", 623 | " \n", 624 | " \n", 625 | " \n", 626 | " \n", 627 | " \n", 628 | " \n", 629 | " \n", 630 | " \n", 631 | " \n", 632 | " \n", 633 | " \n", 634 | " \n", 635 | " \n", 636 | " \n", 637 | " \n", 638 | " \n", 639 | " \n", 640 | " \n", 641 | " \n", 642 | " \n", 643 | " \n", 644 | " \n", 645 | " \n", 646 | " \n", 647 | " \n", 648 | " \n", 649 | " \n", 650 | " \n", 651 | " \n", 652 | " \n", 653 | " \n", 654 | " \n", 655 | " \n", 656 | " \n", 657 | " \n", 658 | " \n", 659 | " \n", 660 | " \n", 661 | " \n", 662 | " \n", 663 | " \n", 664 | " \n", 665 | " \n", 666 | " \n", 667 | " \n", 668 | " \n", 669 | " \n", 670 | " \n", 671 | " \n", 672 | " \n", 673 | " \n", 674 | " \n", 675 | " \n", 676 | " \n", 677 | " \n", 678 | " \n", 679 | " \n", 680 | " \n", 681 | " \n", 682 | " \n", 683 | " \n", 684 | " \n", 685 | " \n", 686 | " \n", 687 | " \n", 688 | " \n", 689 | " \n", 690 | " \n", 691 | " \n", 692 | " \n", 693 | " \n", 694 | " \n", 695 | " \n", 696 | " \n", 697 | " \n", 698 | " \n", 699 | " \n", 700 | " \n", 701 | " \n", 702 | " \n", 703 | " \n", 704 | " \n", 705 | " \n", 706 | " \n", 707 | " \n", 708 | " \n", 709 | " \n", 710 | " \n", 711 | " \n", 712 | " \n", 713 | " \n", 714 | " \n", 715 | " \n", 716 | " \n", 717 | " \n", 718 | " \n", 719 | " \n", 720 | " \n", 721 | " \n", 722 | " \n", 723 | " \n", 724 | " \n", 725 | " \n", 726 | " \n", 727 | " \n", 728 | " \n", 729 | " \n", 730 | " \n", 731 | " \n", 732 | " \n", 733 | " \n", 734 | " \n", 735 | " \n", 736 | " \n", 737 | " \n", 738 | " \n", 739 | " \n", 740 | " \n", 741 | " \n", 742 | " \n", 743 | " \n", 744 | " \n", 745 | " \n", 746 | " \n", 747 | " \n", 748 | " \n", 749 | " \n", 750 | " \n", 751 | " \n", 752 | " \n", 753 | " \n", 754 | " \n", 755 | " \n", 756 | " \n", 757 | " 
\n", 758 | " \n", 759 | " \n", 760 | " \n", 761 | " \n", 762 | " \n", 763 | " \n", 764 | " \n", 765 | " \n", 766 | " \n", 767 | " \n", 768 | " \n", 769 | " \n", 770 | " \n", 771 | " \n", 772 | " \n", 773 | " \n", 774 | " \n", 775 | " \n", 776 | " \n", 777 | " \n", 778 | "
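product_name | 100% Raw Coconut Water | 100% Whole Wheat Bread | 2% Reduced Fat Milk | Apple Honeycrisp Organic | Asparagus | Bag of Organic Bananas | Banana | Bartlett Pears | Blueberries | Boneless Skinless Chicken Breasts | ... | Sparkling Natural Mineral Water | Sparkling Water Grapefruit | Spring Water | Strawberries | Uncured Genoa Salami | Unsalted Butter | Unsweetened Almondmilk | Unsweetened Original Almond Breeze Almond Milk | Whole Milk | Yellow Onions
order_id
1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0
2 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0
3 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0
5 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0
9 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0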
\n", 779 | "

5 rows × 100 columns

\n", 780 | "
" 781 | ], 782 | "text/plain": [ 783 | "product_name 100% Raw Coconut Water 100% Whole Wheat Bread \\\n", 784 | "order_id \n", 785 | "1 0.0 0.0 \n", 786 | "2 0.0 0.0 \n", 787 | "3 0.0 0.0 \n", 788 | "5 0.0 0.0 \n", 789 | "9 0.0 0.0 \n", 790 | "\n", 791 | "product_name 2% Reduced Fat Milk Apple Honeycrisp Organic Asparagus \\\n", 792 | "order_id \n", 793 | "1 0.0 0.0 0.0 \n", 794 | "2 0.0 0.0 0.0 \n", 795 | "3 0.0 0.0 0.0 \n", 796 | "5 1.0 0.0 0.0 \n", 797 | "9 0.0 0.0 0.0 \n", 798 | "\n", 799 | "product_name Bag of Organic Bananas Banana Bartlett Pears Blueberries \\\n", 800 | "order_id \n", 801 | "1 1.0 0.0 0.0 0.0 \n", 802 | "2 0.0 0.0 0.0 0.0 \n", 803 | "3 0.0 0.0 0.0 0.0 \n", 804 | "5 1.0 0.0 0.0 0.0 \n", 805 | "9 0.0 0.0 0.0 0.0 \n", 806 | "\n", 807 | "product_name Boneless Skinless Chicken Breasts ... \\\n", 808 | "order_id ... \n", 809 | "1 0.0 ... \n", 810 | "2 0.0 ... \n", 811 | "3 0.0 ... \n", 812 | "5 0.0 ... \n", 813 | "9 0.0 ... \n", 814 | "\n", 815 | "product_name Sparkling Natural Mineral Water Sparkling Water Grapefruit \\\n", 816 | "order_id \n", 817 | "1 0.0 0.0 \n", 818 | "2 0.0 0.0 \n", 819 | "3 0.0 0.0 \n", 820 | "5 0.0 0.0 \n", 821 | "9 0.0 0.0 \n", 822 | "\n", 823 | "product_name Spring Water Strawberries Uncured Genoa Salami \\\n", 824 | "order_id \n", 825 | "1 0.0 0.0 0.0 \n", 826 | "2 0.0 0.0 0.0 \n", 827 | "3 0.0 0.0 0.0 \n", 828 | "5 0.0 0.0 0.0 \n", 829 | "9 0.0 0.0 0.0 \n", 830 | "\n", 831 | "product_name Unsalted Butter Unsweetened Almondmilk \\\n", 832 | "order_id \n", 833 | "1 0.0 0.0 \n", 834 | "2 0.0 0.0 \n", 835 | "3 0.0 1.0 \n", 836 | "5 0.0 0.0 \n", 837 | "9 0.0 0.0 \n", 838 | "\n", 839 | "product_name Unsweetened Original Almond Breeze Almond Milk Whole Milk \\\n", 840 | "order_id \n", 841 | "1 0.0 0.0 \n", 842 | "2 0.0 0.0 \n", 843 | "3 0.0 0.0 \n", 844 | "5 0.0 0.0 \n", 845 | "9 0.0 0.0 \n", 846 | "\n", 847 | "product_name Yellow Onions \n", 848 | "order_id \n", 849 | "1 0.0 \n", 850 | "2 0.0 \n", 851 | "3 0.0 \n", 852 | "5 
0.0 \n", 853 | "9 0.0 \n", 854 | "\n", 855 | "[5 rows x 100 columns]" 856 | ] 857 | }, 858 | "execution_count": 14, 859 | "metadata": {}, 860 | "output_type": "execute_result" 861 | } 862 | ], 863 | "source": [ 864 | "basket = order_products.groupby(['order_id', 'product_name'])['reordered'].count().unstack().reset_index().fillna(0).set_index('order_id')\n", 865 | "basket.head()" 866 | ] 867 | }, 868 | { 869 | "cell_type": "code", 870 | "execution_count": 15, 871 | "metadata": {}, 872 | "outputs": [], 873 | "source": [ 874 | "del product_counts, products, order_products, order_products_prior, order_products_train" 875 | ] 876 | }, 877 | { 878 | "cell_type": "markdown", 879 | "metadata": {}, 880 | "source": [ 881 | "Encoding the units: convert per-order product counts to binary 0/1 purchase flags" 882 | ] 883 | }, 884 | { 885 | "cell_type": "code", 886 | "execution_count": 16, 887 | "metadata": {}, 888 | "outputs": [ 889 | { 890 | "data": { 891 | "text/html": [ 892 | "
\n", 893 | "\n", 906 | "\n", 907 | " \n", 908 | " \n", 909 | " \n", 910 | " \n", 911 | " \n", 912 | " \n", 913 | " \n", 914 | " \n", 915 | " \n", 916 | " \n", 917 | " \n", 918 | " \n", 919 | " \n", 920 | " \n", 921 | " \n", 922 | " \n", 923 | " \n", 924 | " \n", 925 | " \n", 926 | " \n", 927 | " \n", 928 | " \n", 929 | " \n", 930 | " \n", 931 | " \n", 932 | " \n", 933 | " \n", 934 | " \n", 935 | " \n", 936 | " \n", 937 | " \n", 938 | " \n", 939 | " \n", 940 | " \n", 941 | " \n", 942 | " \n", 943 | " \n", 944 | " \n", 945 | " \n", 946 | " \n", 947 | " \n", 948 | " \n", 949 | " \n", 950 | " \n", 951 | " \n", 952 | " \n", 953 | " \n", 954 | " \n", 955 | " \n", 956 | " \n", 957 | " \n", 958 | " \n", 959 | " \n", 960 | " \n", 961 | " \n", 962 | " \n", 963 | " \n", 964 | " \n", 965 | " \n", 966 | " \n", 967 | " \n", 968 | " \n", 969 | " \n", 970 | " \n", 971 | " \n", 972 | " \n", 973 | " \n", 974 | " \n", 975 | " \n", 976 | " \n", 977 | " \n", 978 | " \n", 979 | " \n", 980 | " \n", 981 | " \n", 982 | " \n", 983 | " \n", 984 | " \n", 985 | " \n", 986 | " \n", 987 | " \n", 988 | " \n", 989 | " \n", 990 | " \n", 991 | " \n", 992 | " \n", 993 | " \n", 994 | " \n", 995 | " \n", 996 | " \n", 997 | " \n", 998 | " \n", 999 | " \n", 1000 | " \n", 1001 | " \n", 1002 | " \n", 1003 | " \n", 1004 | " \n", 1005 | " \n", 1006 | " \n", 1007 | " \n", 1008 | " \n", 1009 | " \n", 1010 | " \n", 1011 | " \n", 1012 | " \n", 1013 | " \n", 1014 | " \n", 1015 | " \n", 1016 | " \n", 1017 | " \n", 1018 | " \n", 1019 | " \n", 1020 | " \n", 1021 | " \n", 1022 | " \n", 1023 | " \n", 1024 | " \n", 1025 | " \n", 1026 | " \n", 1027 | " \n", 1028 | " \n", 1029 | " \n", 1030 | " \n", 1031 | " \n", 1032 | " \n", 1033 | " \n", 1034 | " \n", 1035 | " \n", 1036 | " \n", 1037 | " \n", 1038 | " \n", 1039 | " \n", 1040 | " \n", 1041 | " \n", 1042 | " \n", 1043 | " \n", 1044 | " \n", 1045 | " \n", 1046 | " \n", 1047 | " \n", 1048 | " \n", 1049 | " \n", 1050 | " \n", 1051 | " \n", 1052 | " \n", 1053 | " \n", 1054 
| " \n", 1055 | " \n", 1056 | " \n", 1057 | " \n", 1058 | " \n", 1059 | " \n", 1060 | " \n", 1061 | " \n", 1062 | " \n", 1063 | " \n", 1064 | " \n", 1065 | " \n", 1066 | " \n", 1067 | " \n", 1068 | " \n", 1069 | " \n", 1070 | " \n", 1071 | " \n", 1072 | " \n", 1073 | " \n", 1074 | " \n", 1075 | " \n", 1076 | " \n", 1077 | " \n", 1078 | " \n", 1079 | "
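product_name | 100% Raw Coconut Water | 100% Whole Wheat Bread | 2% Reduced Fat Milk | Apple Honeycrisp Organic | Asparagus | Bag of Organic Bananas | Banana | Bartlett Pears | Blueberries | Boneless Skinless Chicken Breasts | ... | Sparkling Natural Mineral Water | Sparkling Water Grapefruit | Spring Water | Strawberries | Uncured Genoa Salami | Unsalted Butter | Unsweetened Almondmilk | Unsweetened Original Almond Breeze Almond Milk | Whole Milk | Yellow Onions
order_id
1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0
5 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
9 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0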
\n", 1080 | "

5 rows × 100 columns

\n", 1081 | "
" 1082 | ], 1083 | "text/plain": [ 1084 | "product_name 100% Raw Coconut Water 100% Whole Wheat Bread \\\n", 1085 | "order_id \n", 1086 | "1 0 0 \n", 1087 | "2 0 0 \n", 1088 | "3 0 0 \n", 1089 | "5 0 0 \n", 1090 | "9 0 0 \n", 1091 | "\n", 1092 | "product_name 2% Reduced Fat Milk Apple Honeycrisp Organic Asparagus \\\n", 1093 | "order_id \n", 1094 | "1 0 0 0 \n", 1095 | "2 0 0 0 \n", 1096 | "3 0 0 0 \n", 1097 | "5 1 0 0 \n", 1098 | "9 0 0 0 \n", 1099 | "\n", 1100 | "product_name Bag of Organic Bananas Banana Bartlett Pears Blueberries \\\n", 1101 | "order_id \n", 1102 | "1 1 0 0 0 \n", 1103 | "2 0 0 0 0 \n", 1104 | "3 0 0 0 0 \n", 1105 | "5 1 0 0 0 \n", 1106 | "9 0 0 0 0 \n", 1107 | "\n", 1108 | "product_name Boneless Skinless Chicken Breasts ... \\\n", 1109 | "order_id ... \n", 1110 | "1 0 ... \n", 1111 | "2 0 ... \n", 1112 | "3 0 ... \n", 1113 | "5 0 ... \n", 1114 | "9 0 ... \n", 1115 | "\n", 1116 | "product_name Sparkling Natural Mineral Water Sparkling Water Grapefruit \\\n", 1117 | "order_id \n", 1118 | "1 0 0 \n", 1119 | "2 0 0 \n", 1120 | "3 0 0 \n", 1121 | "5 0 0 \n", 1122 | "9 0 0 \n", 1123 | "\n", 1124 | "product_name Spring Water Strawberries Uncured Genoa Salami \\\n", 1125 | "order_id \n", 1126 | "1 0 0 0 \n", 1127 | "2 0 0 0 \n", 1128 | "3 0 0 0 \n", 1129 | "5 0 0 0 \n", 1130 | "9 0 0 0 \n", 1131 | "\n", 1132 | "product_name Unsalted Butter Unsweetened Almondmilk \\\n", 1133 | "order_id \n", 1134 | "1 0 0 \n", 1135 | "2 0 0 \n", 1136 | "3 0 1 \n", 1137 | "5 0 0 \n", 1138 | "9 0 0 \n", 1139 | "\n", 1140 | "product_name Unsweetened Original Almond Breeze Almond Milk Whole Milk \\\n", 1141 | "order_id \n", 1142 | "1 0 0 \n", 1143 | "2 0 0 \n", 1144 | "3 0 0 \n", 1145 | "5 0 0 \n", 1146 | "9 0 0 \n", 1147 | "\n", 1148 | "product_name Yellow Onions \n", 1149 | "order_id \n", 1150 | "1 0 \n", 1151 | "2 0 \n", 1152 | "3 0 \n", 1153 | "5 0 \n", 1154 | "9 0 \n", 1155 | "\n", 1156 | "[5 rows x 100 columns]" 1157 | ] 1158 | }, 1159 | "execution_count": 16, 1160 | 
"metadata": {}, 1161 | "output_type": "execute_result" 1162 | } 1163 | ], 1164 | "source": [ 1165 | "def encode_units(x):\n", 1166 | " if x <= 0:\n", 1167 | " return 0\n", 1168 | " if x >= 1:\n", 1169 | " return 1 \n", 1170 | " \n", 1171 | "basket = basket.applymap(encode_units)\n", 1172 | "basket.head()" 1173 | ] 1174 | }, 1175 | { 1176 | "cell_type": "code", 1177 | "execution_count": 17, 1178 | "metadata": {}, 1179 | "outputs": [ 1180 | { 1181 | "data": { 1182 | "text/plain": [ 1183 | "244498200" 1184 | ] 1185 | }, 1186 | "execution_count": 17, 1187 | "metadata": {}, 1188 | "output_type": "execute_result" 1189 | } 1190 | ], 1191 | "source": [ 1192 | "basket.size" 1193 | ] 1194 | }, 1195 | { 1196 | "cell_type": "code", 1197 | "execution_count": 18, 1198 | "metadata": {}, 1199 | "outputs": [ 1200 | { 1201 | "data": { 1202 | "text/plain": [ 1203 | "(2444982, 100)" 1204 | ] 1205 | }, 1206 | "execution_count": 18, 1207 | "metadata": {}, 1208 | "output_type": "execute_result" 1209 | } 1210 | ], 1211 | "source": [ 1212 | "basket.shape" 1213 | ] 1214 | }, 1215 | { 1216 | "cell_type": "markdown", 1217 | "metadata": {}, 1218 | "source": [ 1219 | "Creating frequent sets and rules" 1220 | ] 1221 | }, 1222 | { 1223 | "cell_type": "code", 1224 | "execution_count": 19, 1225 | "metadata": { 1226 | "scrolled": true 1227 | }, 1228 | "outputs": [ 1229 | { 1230 | "data": { 1231 | "text/html": [ 1232 | "
\n", 1233 | "\n", 1246 | "\n", 1247 | " \n", 1248 | " \n", 1249 | " \n", 1250 | " \n", 1251 | " \n", 1252 | " \n", 1253 | " \n", 1254 | " \n", 1255 | " \n", 1256 | " \n", 1257 | " \n", 1258 | " \n", 1259 | " \n", 1260 | " \n", 1261 | " \n", 1262 | " \n", 1263 | " \n", 1264 | " \n", 1265 | " \n", 1266 | " \n", 1267 | " \n", 1268 | " \n", 1269 | " \n", 1270 | " \n", 1271 | " \n", 1272 | " \n", 1273 | " \n", 1274 | " \n", 1275 | " \n", 1276 | " \n", 1277 | " \n", 1278 | " \n", 1279 | " \n", 1280 | " \n", 1281 | "
 | support | itemsets
0 | 0.016062 | (100% Raw Coconut Water)
1 | 0.025814 | (100% Whole Wheat Bread)
2 | 0.015800 | (2% Reduced Fat Milk)
3 | 0.035694 | (Apple Honeycrisp Organic)
4 | 0.029101 | (Asparagus)
\n", 1282 | "
" 1283 | ], 1284 | "text/plain": [ 1285 | " support itemsets\n", 1286 | "0 0.016062 (100% Raw Coconut Water)\n", 1287 | "1 0.025814 (100% Whole Wheat Bread)\n", 1288 | "2 0.015800 (2% Reduced Fat Milk)\n", 1289 | "3 0.035694 (Apple Honeycrisp Organic)\n", 1290 | "4 0.029101 (Asparagus)" 1291 | ] 1292 | }, 1293 | "execution_count": 19, 1294 | "metadata": {}, 1295 | "output_type": "execute_result" 1296 | } 1297 | ], 1298 | "source": [ 1299 | "frequent_items = apriori(basket, min_support=0.01, use_colnames=True, low_memory=True)\n", 1300 | "frequent_items.head()" 1301 | ] 1302 | }, 1303 | { 1304 | "cell_type": "code", 1305 | "execution_count": 20, 1306 | "metadata": {}, 1307 | "outputs": [ 1308 | { 1309 | "data": { 1310 | "text/html": [ 1311 | "
\n", 1312 | "\n", 1325 | "\n", 1326 | " \n", 1327 | " \n", 1328 | " \n", 1329 | " \n", 1330 | " \n", 1331 | " \n", 1332 | " \n", 1333 | " \n", 1334 | " \n", 1335 | " \n", 1336 | " \n", 1337 | " \n", 1338 | " \n", 1339 | " \n", 1340 | " \n", 1341 | " \n", 1342 | " \n", 1343 | " \n", 1344 | " \n", 1345 | " \n", 1346 | " \n", 1347 | " \n", 1348 | " \n", 1349 | " \n", 1350 | " \n", 1351 | " \n", 1352 | " \n", 1353 | " \n", 1354 | " \n", 1355 | " \n", 1356 | " \n", 1357 | " \n", 1358 | " \n", 1359 | " \n", 1360 | "
 | support | itemsets
124 | 0.010235 | (Organic Blueberries, Organic Strawberries)
125 | 0.010966 | (Organic Raspberries, Organic Hass Avocado)
126 | 0.017314 | (Organic Strawberries, Organic Hass Avocado)
127 | 0.014533 | (Organic Strawberries, Organic Raspberries)
128 | 0.010130 | (Organic Strawberries, Organic Whole Milk)
\n", 1361 | "
" 1362 | ], 1363 | "text/plain": [ 1364 | " support itemsets\n", 1365 | "124 0.010235 (Organic Blueberries, Organic Strawberries)\n", 1366 | "125 0.010966 (Organic Raspberries, Organic Hass Avocado)\n", 1367 | "126 0.017314 (Organic Strawberries, Organic Hass Avocado)\n", 1368 | "127 0.014533 (Organic Strawberries, Organic Raspberries)\n", 1369 | "128 0.010130 (Organic Strawberries, Organic Whole Milk)" 1370 | ] 1371 | }, 1372 | "execution_count": 20, 1373 | "metadata": {}, 1374 | "output_type": "execute_result" 1375 | } 1376 | ], 1377 | "source": [ 1378 | "frequent_items.tail()" 1379 | ] 1380 | }, 1381 | { 1382 | "cell_type": "code", 1383 | "execution_count": 21, 1384 | "metadata": {}, 1385 | "outputs": [ 1386 | { 1387 | "data": { 1388 | "text/plain": [ 1389 | "(129, 2)" 1390 | ] 1391 | }, 1392 | "execution_count": 21, 1393 | "metadata": {}, 1394 | "output_type": "execute_result" 1395 | } 1396 | ], 1397 | "source": [ 1398 | "frequent_items.shape" 1399 | ] 1400 | }, 1401 | { 1402 | "cell_type": "code", 1403 | "execution_count": 22, 1404 | "metadata": {}, 1405 | "outputs": [ 1406 | { 1407 | "data": { 1408 | "text/html": [ 1409 | "
\n", 1410 | "\n", 1423 | "\n", 1424 | " \n", 1425 | " \n", 1426 | " \n", 1427 | " \n", 1428 | " \n", 1429 | " \n", 1430 | " \n", 1431 | " \n", 1432 | " \n", 1433 | " \n", 1434 | " \n", 1435 | " \n", 1436 | " \n", 1437 | " \n", 1438 | " \n", 1439 | " \n", 1440 | " \n", 1441 | " \n", 1442 | " \n", 1443 | " \n", 1444 | " \n", 1445 | " \n", 1446 | " \n", 1447 | " \n", 1448 | " \n", 1449 | " \n", 1450 | " \n", 1451 | " \n", 1452 | " \n", 1453 | " \n", 1454 | " \n", 1455 | " \n", 1456 | " \n", 1457 | " \n", 1458 | " \n", 1459 | " \n", 1460 | " \n", 1461 | " \n", 1462 | " \n", 1463 | " \n", 1464 | " \n", 1465 | " \n", 1466 | " \n", 1467 | " \n", 1468 | " \n", 1469 | " \n", 1470 | " \n", 1471 | " \n", 1472 | " \n", 1473 | " \n", 1474 | " \n", 1475 | " \n", 1476 | " \n", 1477 | " \n", 1478 | " \n", 1479 | " \n", 1480 | " \n", 1481 | " \n", 1482 | " \n", 1483 | " \n", 1484 | " \n", 1485 | " \n", 1486 | " \n", 1487 | " \n", 1488 | " \n", 1489 | " \n", 1490 | " \n", 1491 | " \n", 1492 | " \n", 1493 | " \n", 1494 | " \n", 1495 | " \n", 1496 | " \n", 1497 | " \n", 1498 | " \n", 1499 | " \n", 1500 | " \n", 1501 | " \n", 1502 | " \n", 1503 | " \n", 1504 | " \n", 1505 | " \n", 1506 | " \n", 1507 | " \n", 1508 | " \n", 1509 | " \n", 1510 | " \n", 1511 | " \n", 1512 | " \n", 1513 | " \n", 1514 | " \n", 1515 | " \n", 1516 | " \n", 1517 | " \n", 1518 | " \n", 1519 | " \n", 1520 | " \n", 1521 | " \n", 1522 | " \n", 1523 | " \n", 1524 | " \n", 1525 | " \n", 1526 | " \n", 1527 | " \n", 1528 | " \n", 1529 | " \n", 1530 | " \n", 1531 | " \n", 1532 | " \n", 1533 | " \n", 1534 | " \n", 1535 | " \n", 1536 | " \n", 1537 | " \n", 1538 | " \n", 1539 | " \n", 1540 | " \n", 1541 | " \n", 1542 | " \n", 1543 | " \n", 1544 | " \n", 1545 | " \n", 1546 | " \n", 1547 | " \n", 1548 | " \n", 1549 | " \n", 1550 | " \n", 1551 | " \n", 1552 | " \n", 1553 | " \n", 1554 | " \n", 1555 | " \n", 1556 | " \n", 1557 | " \n", 1558 | " \n", 1559 | " \n", 1560 | " \n", 1561 | " \n", 1562 | " \n", 1563 | " \n", 1564 | " 
\n", 1565 | " \n", 1566 | " \n", 1567 | " \n", 1568 | " \n", 1569 | " \n", 1570 | " \n", 1571 | " \n", 1572 | " \n", 1573 | " \n", 1574 | " \n", 1575 | " \n", 1576 | " \n", 1577 | " \n", 1578 | " \n", 1579 | " \n", 1580 | " \n", 1581 | " \n", 1582 | " \n", 1583 | " \n", 1584 | " \n", 1585 | " \n", 1586 | " \n", 1587 | " \n", 1588 | " \n", 1589 | " \n", 1590 | " \n", 1591 | " \n", 1592 | " \n", 1593 | " \n", 1594 | " \n", 1595 | " \n", 1596 | " \n", 1597 | " \n", 1598 | " \n", 1599 | " \n", 1600 | " \n", 1601 | " \n", 1602 | " \n", 1603 | " \n", 1604 | " \n", 1605 | " \n", 1606 | " \n", 1607 | " \n", 1608 | " \n", 1609 | " \n", 1610 | " \n", 1611 | " \n", 1612 | " \n", 1613 | " \n", 1614 | " \n", 1615 | " \n", 1616 | " \n", 1617 | " \n", 1618 | " \n", 1619 | " \n", 1620 | " \n", 1621 | " \n", 1622 | " \n", 1623 | " \n", 1624 | " \n", 1625 | " \n", 1626 | " \n", 1627 | " \n", 1628 | " \n", 1629 | " \n", 1630 | " \n", 1631 | " \n", 1632 | " \n", 1633 | " \n", 1634 | " \n", 1635 | " \n", 1636 | " \n", 1637 | " \n", 1638 | " \n", 1639 | " \n", 1640 | " \n", 1641 | " \n", 1642 | " \n", 1643 | " \n", 1644 | " \n", 1645 | " \n", 1646 | " \n", 1647 | " \n", 1648 | " \n", 1649 | " \n", 1650 | " \n", 1651 | " \n", 1652 | " \n", 1653 | " \n", 1654 | " \n", 1655 | " \n", 1656 | " \n", 1657 | " \n", 1658 | " \n", 1659 | " \n", 1660 | " \n", 1661 | " \n", 1662 | " \n", 1663 | " \n", 1664 | " \n", 1665 | " \n", 1666 | " \n", 1667 | " \n", 1668 | " \n", 1669 | " \n", 1670 | " \n", 1671 | " \n", 1672 | " \n", 1673 | " \n", 1674 | " \n", 1675 | " \n", 1676 | " \n", 1677 | " \n", 1678 | " \n", 1679 | " \n", 1680 | " \n", 1681 | " \n", 1682 | " \n", 1683 | " \n", 1684 | " \n", 1685 | " \n", 1686 | " \n", 1687 | " \n", 1688 | " \n", 1689 | " \n", 1690 | " \n", 1691 | " \n", 1692 | " \n", 1693 | " \n", 1694 | " \n", 1695 | " \n", 1696 | " \n", 1697 | " \n", 1698 | " \n", 1699 | " \n", 1700 | " \n", 1701 | " \n", 1702 | " \n", 1703 | " \n", 1704 | " \n", 1705 | " \n", 1706 | " \n", 1707 | 
" \n", 1708 | " \n", 1709 | " \n", 1710 | " \n", 1711 | " \n", 1712 | " \n", 1713 | " \n", 1714 | " \n", 1715 | " \n", 1716 | " \n", 1717 | " \n", 1718 | " \n", 1719 | " \n", 1720 | " \n", 1721 | " \n", 1722 | " \n", 1723 | " \n", 1724 | " \n", 1725 | " \n", 1726 | " \n", 1727 | " \n", 1728 | " \n", 1729 | " \n", 1730 | " \n", 1731 | " \n", 1732 | " \n", 1733 | " \n", 1734 | " \n", 1735 | " \n", 1736 | " \n", 1737 | " \n", 1738 | " \n", 1739 | " \n", 1740 | " \n", 1741 | " \n", 1742 | " \n", 1743 | " \n", 1744 | " \n", 1745 | " \n", 1746 | " \n", 1747 | " \n", 1748 | " \n", 1749 | " \n", 1750 | " \n", 1751 | " \n", 1752 | " \n", 1753 | " \n", 1754 | " \n", 1755 | " \n", 1756 | " \n", 1757 | " \n", 1758 | " \n", 1759 | " \n", 1760 | " \n", 1761 | " \n", 1762 | " \n", 1763 | " \n", 1764 | " \n", 1765 | " \n", 1766 | " \n", 1767 | " \n", 1768 | " \n", 1769 | " \n", 1770 | " \n", 1771 | " \n", 1772 | " \n", 1773 | " \n", 1774 | " \n", 1775 | " \n", 1776 | " \n", 1777 | " \n", 1778 | " \n", 1779 | " \n", 1780 | " \n", 1781 | " \n", 1782 | " \n", 1783 | " \n", 1784 | " \n", 1785 | " \n", 1786 | " \n", 1787 | " \n", 1788 | " \n", 1789 | " \n", 1790 | " \n", 1791 | " \n", 1792 | " \n", 1793 | " \n", 1794 | " \n", 1795 | " \n", 1796 | " \n", 1797 | " \n", 1798 | " \n", 1799 | " \n", 1800 | " \n", 1801 | " \n", 1802 | " \n", 1803 | " \n", 1804 | " \n", 1805 | " \n", 1806 | " \n", 1807 | " \n", 1808 | " \n", 1809 | " \n", 1810 | " \n", 1811 | " \n", 1812 | " \n", 1813 | " \n", 1814 | " \n", 1815 | " \n", 1816 | " \n", 1817 | " \n", 1818 | " \n", 1819 | " \n", 1820 | " \n", 1821 | " \n", 1822 | " \n", 1823 | " \n", 1824 | " \n", 1825 | " \n", 1826 | " \n", 1827 | " \n", 1828 | " \n", 1829 | " \n", 1830 | " \n", 1831 | " \n", 1832 | " \n", 1833 | " \n", 1834 | " \n", 1835 | " \n", 1836 | " \n", 1837 | " \n", 1838 | " \n", 1839 | " \n", 1840 | " \n", 1841 | " \n", 1842 | " \n", 1843 | " \n", 1844 | " \n", 1845 | " \n", 1846 | " \n", 1847 | " \n", 1848 | " \n", 1849 | " \n", 1850 
| " \n", 1851 | " \n", 1852 | " \n", 1853 | " \n", 1854 | " \n", 1855 | " \n", 1856 | " \n", 1857 | " \n", 1858 | " \n", 1859 | " \n", 1860 | " \n", 1861 | " \n", 1862 | " \n", 1863 | " \n", 1864 | " \n", 1865 | " \n", 1866 | " \n", 1867 | " \n", 1868 | " \n", 1869 | " \n", 1870 | " \n", 1871 | " \n", 1872 | " \n", 1873 | " \n", 1874 | " \n", 1875 | " \n", 1876 | " \n", 1877 | " \n", 1878 | " \n", 1879 | " \n", 1880 | " \n", 1881 | " \n", 1882 | " \n", 1883 | " \n", 1884 | " \n", 1885 | " \n", 1886 | " \n", 1887 | " \n", 1888 | " \n", 1889 | " \n", 1890 | " \n", 1891 | " \n", 1892 | " \n", 1893 | " \n", 1894 | " \n", 1895 | " \n", 1896 | " \n", 1897 | " \n", 1898 | " \n", 1899 | " \n", 1900 | " \n", 1901 | " \n", 1902 | " \n", 1903 | " \n", 1904 | " \n", 1905 | " \n", 1906 | " \n", 1907 | " \n", 1908 | " \n", 1909 | " \n", 1910 | " \n", 1911 | " \n", 1912 | " \n", 1913 | " \n", 1914 | " \n", 1915 | " \n", 1916 | " \n", 1917 | " \n", 1918 | " \n", 1919 | " \n", 1920 | " \n", 1921 | " \n", 1922 | " \n", 1923 | " \n", 1924 | " \n", 1925 | " \n", 1926 | " \n", 1927 | " \n", 1928 | " \n", 1929 | " \n", 1930 | " \n", 1931 | " \n", 1932 | " \n", 1933 | " \n", 1934 | " \n", 1935 | " \n", 1936 | " \n", 1937 | " \n", 1938 | " \n", 1939 | " \n", 1940 | " \n", 1941 | " \n", 1942 | " \n", 1943 | " \n", 1944 | " \n", 1945 | " \n", 1946 | " \n", 1947 | " \n", 1948 | " \n", 1949 | " \n", 1950 | " \n", 1951 | " \n", 1952 | " \n", 1953 | " \n", 1954 | " \n", 1955 | " \n", 1956 | " \n", 1957 | " \n", 1958 | " \n", 1959 | " \n", 1960 | " \n", 1961 | " \n", 1962 | " \n", 1963 | " \n", 1964 | " \n", 1965 | " \n", 1966 | " \n", 1967 | " \n", 1968 | " \n", 1969 | " \n", 1970 | " \n", 1971 | " \n", 1972 | " \n", 1973 | " \n", 1974 | " \n", 1975 | " \n", 1976 | " \n", 1977 | " \n", 1978 | " \n", 1979 | " \n", 1980 | " \n", 1981 | " \n", 1982 | " \n", 1983 | " \n", 1984 | " \n", 1985 | " \n", 1986 | " \n", 1987 | " \n", 1988 | " \n", 1989 | " \n", 1990 | " \n", 1991 | " \n", 1992 | " \n", 
1993 | " \n", 1994 | " \n", 1995 | " \n", 1996 | " \n", 1997 | " \n", 1998 | " \n", 1999 | " \n", 2000 | " \n", 2001 | " \n", 2002 | " \n", 2003 | " \n", 2004 | " \n", 2005 | " \n", 2006 | " \n", 2007 | " \n", 2008 | " \n", 2009 | " \n", 2010 | " \n", 2011 | " \n", 2012 | " \n", 2013 | " \n", 2014 | " \n", 2015 | " \n", 2016 | " \n", 2017 | " \n", 2018 | " \n", 2019 | " \n", 2020 | " \n", 2021 | " \n", 2022 | " \n", 2023 | " \n", 2024 | " \n", 2025 | " \n", 2026 | " \n", 2027 | " \n", 2028 | " \n", 2029 | " \n", 2030 | " \n", 2031 | " \n", 2032 | " \n", 2033 | " \n", 2034 | " \n", 2035 | " \n", 2036 | " \n", 2037 | " \n", 2038 | " \n", 2039 | " \n", 2040 | " \n", 2041 | " \n", 2042 | " \n", 2043 | " \n", 2044 | " \n", 2045 | " \n", 2046 | " \n", 2047 | " \n", 2048 | " \n", 2049 | " \n", 2050 | " \n", 2051 | " \n", 2052 | " \n", 2053 | " \n", 2054 | " \n", 2055 | " \n", 2056 | " \n", 2057 | " \n", 2058 | " \n", 2059 | " \n", 2060 | " \n", 2061 | " \n", 2062 | " \n", 2063 | " \n", 2064 | " \n", 2065 | " \n", 2066 | " \n", 2067 | " \n", 2068 | " \n", 2069 | " \n", 2070 | " \n", 2071 | " \n", 2072 | " \n", 2073 | " \n", 2074 | " \n", 2075 | " \n", 2076 | " \n", 2077 | " \n", 2078 | " \n", 2079 | " \n", 2080 | " \n", 2081 | " \n", 2082 | " \n", 2083 | " \n", 2084 | " \n", 2085 | " \n", 2086 | " \n", 2087 | " \n", 2088 | " \n", 2089 | " \n", 2090 | " \n", 2091 | " \n", 2092 | " \n", 2093 | " \n", 2094 | " \n", 2095 | " \n", 2096 | " \n", 2097 | " \n", 2098 | " \n", 2099 | " \n", 2100 | " \n", 2101 | " \n", 2102 | " \n", 2103 | " \n", 2104 | " \n", 2105 | " \n", 2106 | " \n", 2107 | " \n", 2108 | " \n", 2109 | " \n", 2110 | " \n", 2111 | " \n", 2112 | "
 | antecedents | consequents | antecedent support | consequent support | support | confidence | lift | leverage | conviction
35 | (Limes) | (Large Lemon) | 0.059984 | 0.065764 | 0.011860 | 0.197723 | 3.006544 | 0.007915 | 1.164480
34 | (Large Lemon) | (Limes) | 0.065764 | 0.059984 | 0.011860 | 0.180345 | 3.006544 | 0.007915 | 1.146843
52 | (Organic Strawberries) | (Organic Raspberries) | 0.112711 | 0.058325 | 0.014533 | 0.128940 | 2.210731 | 0.007959 | 1.081069
53 | (Organic Raspberries) | (Organic Strawberries) | 0.058325 | 0.112711 | 0.014533 | 0.249174 | 2.210731 | 0.007959 | 1.181751
37 | (Organic Avocado) | (Large Lemon) | 0.075348 | 0.065764 | 0.010538 | 0.139862 | 2.126728 | 0.005583 | 1.086147
36 | (Large Lemon) | (Organic Avocado) | 0.065764 | 0.075348 | 0.010538 | 0.160244 | 2.126728 | 0.005583 | 1.101097
47 | (Organic Strawberries) | (Organic Blueberries) | 0.112711 | 0.042956 | 0.010235 | 0.090809 | 2.114024 | 0.005394 | 1.052633
46 | (Organic Blueberries) | (Organic Strawberries) | 0.042956 | 0.112711 | 0.010235 | 0.238274 | 2.114024 | 0.005394 | 1.164840
49 | (Organic Hass Avocado) | (Organic Raspberries) | 0.090339 | 0.058325 | 0.010966 | 0.121389 | 2.081257 | 0.005697 | 1.071777
48 | (Organic Raspberries) | (Organic Hass Avocado) | 0.058325 | 0.090339 | 0.010966 | 0.188018 | 2.081257 | 0.005697 | 1.120298
24 | (Banana) | (Organic Fuji Apple) | 0.200938 | 0.037992 | 0.014378 | 0.071552 | 1.883367 | 0.006744 | 1.036147
25 | (Organic Fuji Apple) | (Banana) | 0.037992 | 0.200938 | 0.014378 | 0.378441 | 1.883367 | 0.006744 | 1.285576
5 | (Bag of Organic Bananas) | (Organic Raspberries) | 0.161527 | 0.058325 | 0.017294 | 0.107065 | 1.835662 | 0.007873 | 1.054584
4 | (Organic Raspberries) | (Bag of Organic Bananas) | 0.058325 | 0.161527 | 0.017294 | 0.296508 | 1.835662 | 0.007873 | 1.191874
3 | (Bag of Organic Bananas) | (Organic Hass Avocado) | 0.161527 | 0.090339 | 0.026487 | 0.163981 | 1.815175 | 0.011895 | 1.088087
2 | (Organic Hass Avocado) | (Bag of Organic Bananas) | 0.090339 | 0.161527 | 0.026487 | 0.293199 | 1.815175 | 0.011895 | 1.186294
14 | (Honeycrisp Apple) | (Banana) | 0.034078 | 0.200938 | 0.012122 | 0.355725 | 1.770317 | 0.005275 | 1.240249
15 | (Banana) | (Honeycrisp Apple) | 0.200938 | 0.034078 | 0.012122 | 0.060329 | 1.770317 | 0.005275 | 1.027936
39 | (Organic Avocado) | (Organic Baby Spinach) | 0.075348 | 0.102948 | 0.013207 | 0.175281 | 1.702625 | 0.005450 | 1.087707
38 | (Organic Baby Spinach) | (Organic Avocado) | 0.102948 | 0.075348 | 0.013207 | 0.128289 | 1.702625 | 0.005450 | 1.060733
50 | (Organic Strawberries) | (Organic Hass Avocado) | 0.112711 | 0.090339 | 0.017314 | 0.153616 | 1.700440 | 0.007132 | 1.074762
51 | (Organic Hass Avocado) | (Organic Strawberries) | 0.090339 | 0.112711 | 0.017314 | 0.191659 | 1.700440 | 0.007132 | 1.097666
12 | (Cucumber Kirby) | (Banana) | 0.040789 | 0.200938 | 0.013432 | 0.329296 | 1.638788 | 0.005236 | 1.191377
13 | (Banana) | (Cucumber Kirby) | 0.200938 | 0.040789 | 0.013432 | 0.066844 | 1.638788 | 0.005236 | 1.027922
43 | (Organic Hass Avocado) | (Organic Baby Spinach) | 0.090339 | 0.102948 | 0.014787 | 0.163679 | 1.589929 | 0.005486 | 1.072618
42 | (Organic Baby Spinach) | (Organic Hass Avocado) | 0.102948 | 0.090339 | 0.014787 | 0.143632 | 1.589929 | 0.005486 | 1.062232
55 | (Organic Whole Milk) | (Organic Strawberries) | 0.058411 | 0.112711 | 0.010130 | 0.173423 | 1.538645 | 0.003546 | 1.073449
54 | (Organic Strawberries) | (Organic Whole Milk) | 0.112711 | 0.058411 | 0.010130 | 0.089873 | 1.538645 | 0.003546 | 1.034569
20 | (Organic Avocado) | (Banana) | 0.075348 | 0.200938 | 0.022745 | 0.301866 | 1.502282 | 0.007605 | 1.144568
21 | (Banana) | (Organic Avocado) | 0.200938 | 0.075348 | 0.022745 | 0.113194 | 1.502282 | 0.007605 | 1.042677
30 | (Seedless Red Grapes) | (Banana) | 0.035480 | 0.200938 | 0.010534 | 0.296906 | 1.477596 | 0.003405 | 1.136493
31 | (Banana) | (Seedless Red Grapes) | 0.200938 | 0.035480 | 0.010534 | 0.052425 | 1.477596 | 0.003405 | 1.017883
7 | (Bag of Organic Bananas) | (Organic Strawberries) | 0.161527 | 0.112711 | 0.026463 | 0.163832 | 1.453551 | 0.008257 | 1.061136
6 | (Organic Strawberries) | (Bag of Organic Bananas) | 0.112711 | 0.161527 | 0.026463 | 0.234787 | 1.453551 | 0.008257 | 1.095739
32 | (Strawberries) | (Banana) | 0.061123 | 0.200938 | 0.017661 | 0.288936 | 1.437931 | 0.005379 | 1.123754
33 | (Banana) | (Strawberries) | 0.200938 | 0.061123 | 0.017661 | 0.087891 | 1.437931 | 0.005379 | 1.029347
45 | (Organic Strawberries) | (Organic Baby Spinach) | 0.112711 | 0.102948 | 0.016267 | 0.144326 | 1.401939 | 0.004664 | 1.048358
44 | (Organic Baby Spinach) | (Organic Strawberries) | 0.102948 | 0.112711 | 0.016267 | 0.158014 | 1.401939 | 0.004664 | 1.053805
11 | (Bag of Organic Bananas) | (Organic Yellow Onion) | 0.161527 | 0.048146 | 0.010460 | 0.064756 | 1.344989 | 0.002683 | 1.017760
10 | (Organic Yellow Onion) | (Bag of Organic Bananas) | 0.048146 | 0.161527 | 0.010460 | 0.217252 | 1.344989 | 0.002683 | 1.071191
16 | (Large Lemon) | (Banana) | 0.065764 | 0.200938 | 0.017603 | 0.267663 | 1.332062 | 0.004388 | 1.091111
17 | (Banana) | (Large Lemon) | 0.200938 | 0.065764 | 0.017603 | 0.087602 | 1.332062 | 0.004388 | 1.023934
0 | (Organic Baby Spinach) | (Bag of Organic Bananas) | 0.102948 | 0.161527 | 0.021517 | 0.209007 | 1.293944 | 0.004888 | 1.060026
1 | (Bag of Organic Bananas) | (Organic Baby Spinach) | 0.161527 | 0.102948 | 0.021517 | 0.133208 | 1.293944 | 0.004888 | 1.034911
41 | (Organic Avocado) | (Organic Strawberries) | 0.075348 | 0.112711 | 0.010254 | 0.136095 | 1.207468 | 0.001762 | 1.027068
40 | (Organic Strawberries) | (Organic Avocado) | 0.112711 | 0.075348 | 0.010254 | 0.090980 | 1.207468 | 0.001762 | 1.017197
9 | (Bag of Organic Bananas) | (Organic Whole Milk) | 0.161527 | 0.058411 | 0.011288 | 0.069883 | 1.196413 | 0.001853 | 1.012335
8 | (Organic Whole Milk) | (Bag of Organic Bananas) | 0.058411 | 0.161527 | 0.011288 | 0.193253 | 1.196413 | 0.001853 | 1.039326
29 | (Organic Whole Milk) | (Banana) | 0.058411 | 0.200938 | 0.013368 | 0.228866 | 1.138984 | 0.001631 | 1.036216
28 | (Banana) | (Organic Whole Milk) | 0.200938 | 0.058411 | 0.013368 | 0.066529 | 1.138984 | 0.001631 | 1.008697
19 | (Banana) | (Limes) | 0.200938 | 0.059984 | 0.013539 | 0.067380 | 1.123292 | 0.001486 | 1.007930
18 | (Limes) | (Banana) | 0.059984 | 0.200938 | 0.013539 | 0.225713 | 1.123292 | 0.001486 | 1.031996
23 | (Banana) | (Organic Baby Spinach) | 0.200938 | 0.102948 | 0.021839 | 0.108683 | 1.055712 | 0.001152 | 1.006435
22 | (Organic Baby Spinach) | (Banana) | 0.102948 | 0.200938 | 0.021839 | 0.212133 | 1.055712 | 0.001152 | 1.014209
27 | (Banana) | (Organic Strawberries) | 0.200938 | 0.112711 | 0.023857 | 0.118728 | 1.053382 | 0.001209 | 1.006827
26 | (Organic Strawberries) | (Banana) | 0.112711 | 0.200938 | 0.023857 | 0.211665 | 1.053382 | 0.001209 | 1.013607
\n", 2113 | "
" 2114 | ], 2115 | "text/plain": [ 2116 | " antecedents consequents antecedent support \\\n", 2117 | "35 (Limes) (Large Lemon) 0.059984 \n", 2118 | "34 (Large Lemon) (Limes) 0.065764 \n", 2119 | "52 (Organic Strawberries) (Organic Raspberries) 0.112711 \n", 2120 | "53 (Organic Raspberries) (Organic Strawberries) 0.058325 \n", 2121 | "37 (Organic Avocado) (Large Lemon) 0.075348 \n", 2122 | "36 (Large Lemon) (Organic Avocado) 0.065764 \n", 2123 | "47 (Organic Strawberries) (Organic Blueberries) 0.112711 \n", 2124 | "46 (Organic Blueberries) (Organic Strawberries) 0.042956 \n", 2125 | "49 (Organic Hass Avocado) (Organic Raspberries) 0.090339 \n", 2126 | "48 (Organic Raspberries) (Organic Hass Avocado) 0.058325 \n", 2127 | "24 (Banana) (Organic Fuji Apple) 0.200938 \n", 2128 | "25 (Organic Fuji Apple) (Banana) 0.037992 \n", 2129 | "5 (Bag of Organic Bananas) (Organic Raspberries) 0.161527 \n", 2130 | "4 (Organic Raspberries) (Bag of Organic Bananas) 0.058325 \n", 2131 | "3 (Bag of Organic Bananas) (Organic Hass Avocado) 0.161527 \n", 2132 | "2 (Organic Hass Avocado) (Bag of Organic Bananas) 0.090339 \n", 2133 | "14 (Honeycrisp Apple) (Banana) 0.034078 \n", 2134 | "15 (Banana) (Honeycrisp Apple) 0.200938 \n", 2135 | "39 (Organic Avocado) (Organic Baby Spinach) 0.075348 \n", 2136 | "38 (Organic Baby Spinach) (Organic Avocado) 0.102948 \n", 2137 | "50 (Organic Strawberries) (Organic Hass Avocado) 0.112711 \n", 2138 | "51 (Organic Hass Avocado) (Organic Strawberries) 0.090339 \n", 2139 | "12 (Cucumber Kirby) (Banana) 0.040789 \n", 2140 | "13 (Banana) (Cucumber Kirby) 0.200938 \n", 2141 | "43 (Organic Hass Avocado) (Organic Baby Spinach) 0.090339 \n", 2142 | "42 (Organic Baby Spinach) (Organic Hass Avocado) 0.102948 \n", 2143 | "55 (Organic Whole Milk) (Organic Strawberries) 0.058411 \n", 2144 | "54 (Organic Strawberries) (Organic Whole Milk) 0.112711 \n", 2145 | "20 (Organic Avocado) (Banana) 0.075348 \n", 2146 | "21 (Banana) (Organic Avocado) 0.200938 \n", 2147 | "30 
(Seedless Red Grapes) (Banana) 0.035480 \n", 2148 | "31 (Banana) (Seedless Red Grapes) 0.200938 \n", 2149 | "7 (Bag of Organic Bananas) (Organic Strawberries) 0.161527 \n", 2150 | "6 (Organic Strawberries) (Bag of Organic Bananas) 0.112711 \n", 2151 | "32 (Strawberries) (Banana) 0.061123 \n", 2152 | "33 (Banana) (Strawberries) 0.200938 \n", 2153 | "45 (Organic Strawberries) (Organic Baby Spinach) 0.112711 \n", 2154 | "44 (Organic Baby Spinach) (Organic Strawberries) 0.102948 \n", 2155 | "11 (Bag of Organic Bananas) (Organic Yellow Onion) 0.161527 \n", 2156 | "10 (Organic Yellow Onion) (Bag of Organic Bananas) 0.048146 \n", 2157 | "16 (Large Lemon) (Banana) 0.065764 \n", 2158 | "17 (Banana) (Large Lemon) 0.200938 \n", 2159 | "0 (Organic Baby Spinach) (Bag of Organic Bananas) 0.102948 \n", 2160 | "1 (Bag of Organic Bananas) (Organic Baby Spinach) 0.161527 \n", 2161 | "41 (Organic Avocado) (Organic Strawberries) 0.075348 \n", 2162 | "40 (Organic Strawberries) (Organic Avocado) 0.112711 \n", 2163 | "9 (Bag of Organic Bananas) (Organic Whole Milk) 0.161527 \n", 2164 | "8 (Organic Whole Milk) (Bag of Organic Bananas) 0.058411 \n", 2165 | "29 (Organic Whole Milk) (Banana) 0.058411 \n", 2166 | "28 (Banana) (Organic Whole Milk) 0.200938 \n", 2167 | "19 (Banana) (Limes) 0.200938 \n", 2168 | "18 (Limes) (Banana) 0.059984 \n", 2169 | "23 (Banana) (Organic Baby Spinach) 0.200938 \n", 2170 | "22 (Organic Baby Spinach) (Banana) 0.102948 \n", 2171 | "27 (Banana) (Organic Strawberries) 0.200938 \n", 2172 | "26 (Organic Strawberries) (Banana) 0.112711 \n", 2173 | "\n", 2174 | " consequent support support confidence lift leverage conviction \n", 2175 | "35 0.065764 0.011860 0.197723 3.006544 0.007915 1.164480 \n", 2176 | "34 0.059984 0.011860 0.180345 3.006544 0.007915 1.146843 \n", 2177 | "52 0.058325 0.014533 0.128940 2.210731 0.007959 1.081069 \n", 2178 | "53 0.112711 0.014533 0.249174 2.210731 0.007959 1.181751 \n", 2179 | "37 0.065764 0.010538 0.139862 2.126728 0.005583 1.086147 
\n", 2180 | "36 0.075348 0.010538 0.160244 2.126728 0.005583 1.101097 \n", 2181 | "47 0.042956 0.010235 0.090809 2.114024 0.005394 1.052633 \n", 2182 | "46 0.112711 0.010235 0.238274 2.114024 0.005394 1.164840 \n", 2183 | "49 0.058325 0.010966 0.121389 2.081257 0.005697 1.071777 \n", 2184 | "48 0.090339 0.010966 0.188018 2.081257 0.005697 1.120298 \n", 2185 | "24 0.037992 0.014378 0.071552 1.883367 0.006744 1.036147 \n", 2186 | "25 0.200938 0.014378 0.378441 1.883367 0.006744 1.285576 \n", 2187 | "5 0.058325 0.017294 0.107065 1.835662 0.007873 1.054584 \n", 2188 | "4 0.161527 0.017294 0.296508 1.835662 0.007873 1.191874 \n", 2189 | "3 0.090339 0.026487 0.163981 1.815175 0.011895 1.088087 \n", 2190 | "2 0.161527 0.026487 0.293199 1.815175 0.011895 1.186294 \n", 2191 | "14 0.200938 0.012122 0.355725 1.770317 0.005275 1.240249 \n", 2192 | "15 0.034078 0.012122 0.060329 1.770317 0.005275 1.027936 \n", 2193 | "39 0.102948 0.013207 0.175281 1.702625 0.005450 1.087707 \n", 2194 | "38 0.075348 0.013207 0.128289 1.702625 0.005450 1.060733 \n", 2195 | "50 0.090339 0.017314 0.153616 1.700440 0.007132 1.074762 \n", 2196 | "51 0.112711 0.017314 0.191659 1.700440 0.007132 1.097666 \n", 2197 | "12 0.200938 0.013432 0.329296 1.638788 0.005236 1.191377 \n", 2198 | "13 0.040789 0.013432 0.066844 1.638788 0.005236 1.027922 \n", 2199 | "43 0.102948 0.014787 0.163679 1.589929 0.005486 1.072618 \n", 2200 | "42 0.090339 0.014787 0.143632 1.589929 0.005486 1.062232 \n", 2201 | "55 0.112711 0.010130 0.173423 1.538645 0.003546 1.073449 \n", 2202 | "54 0.058411 0.010130 0.089873 1.538645 0.003546 1.034569 \n", 2203 | "20 0.200938 0.022745 0.301866 1.502282 0.007605 1.144568 \n", 2204 | "21 0.075348 0.022745 0.113194 1.502282 0.007605 1.042677 \n", 2205 | "30 0.200938 0.010534 0.296906 1.477596 0.003405 1.136493 \n", 2206 | "31 0.035480 0.010534 0.052425 1.477596 0.003405 1.017883 \n", 2207 | "7 0.112711 0.026463 0.163832 1.453551 0.008257 1.061136 \n", 2208 | "6 0.161527 0.026463 0.234787 
1.453551 0.008257 1.095739 \n", 2209 | "32 0.200938 0.017661 0.288936 1.437931 0.005379 1.123754 \n", 2210 | "33 0.061123 0.017661 0.087891 1.437931 0.005379 1.029347 \n", 2211 | "45 0.102948 0.016267 0.144326 1.401939 0.004664 1.048358 \n", 2212 | "44 0.112711 0.016267 0.158014 1.401939 0.004664 1.053805 \n", 2213 | "11 0.048146 0.010460 0.064756 1.344989 0.002683 1.017760 \n", 2214 | "10 0.161527 0.010460 0.217252 1.344989 0.002683 1.071191 \n", 2215 | "16 0.200938 0.017603 0.267663 1.332062 0.004388 1.091111 \n", 2216 | "17 0.065764 0.017603 0.087602 1.332062 0.004388 1.023934 \n", 2217 | "0 0.161527 0.021517 0.209007 1.293944 0.004888 1.060026 \n", 2218 | "1 0.102948 0.021517 0.133208 1.293944 0.004888 1.034911 \n", 2219 | "41 0.112711 0.010254 0.136095 1.207468 0.001762 1.027068 \n", 2220 | "40 0.075348 0.010254 0.090980 1.207468 0.001762 1.017197 \n", 2221 | "9 0.058411 0.011288 0.069883 1.196413 0.001853 1.012335 \n", 2222 | "8 0.161527 0.011288 0.193253 1.196413 0.001853 1.039326 \n", 2223 | "29 0.200938 0.013368 0.228866 1.138984 0.001631 1.036216 \n", 2224 | "28 0.058411 0.013368 0.066529 1.138984 0.001631 1.008697 \n", 2225 | "19 0.059984 0.013539 0.067380 1.123292 0.001486 1.007930 \n", 2226 | "18 0.200938 0.013539 0.225713 1.123292 0.001486 1.031996 \n", 2227 | "23 0.102948 0.021839 0.108683 1.055712 0.001152 1.006435 \n", 2228 | "22 0.200938 0.021839 0.212133 1.055712 0.001152 1.014209 \n", 2229 | "27 0.112711 0.023857 0.118728 1.053382 0.001209 1.006827 \n", 2230 | "26 0.200938 0.023857 0.211665 1.053382 0.001209 1.013607 " 2231 | ] 2232 | }, 2233 | "execution_count": 22, 2234 | "metadata": {}, 2235 | "output_type": "execute_result" 2236 | } 2237 | ], 2238 | "source": [ 2239 | "rules = association_rules(frequent_items, metric=\"lift\", min_threshold=1)\n", 2240 | "rules.sort_values('lift', ascending=False)" 2241 | ] 2242 | }, 2243 | { 2244 | "cell_type": "code", 2245 | "execution_count": null, 2246 | "metadata": {}, 2247 | "outputs": [], 2248 | 
"source": [] 2249 | } 2250 | ], 2251 | "metadata": { 2252 | "kernelspec": { 2253 | "display_name": "Python 3", 2254 | "language": "python", 2255 | "name": "python3" 2256 | }, 2257 | "language_info": { 2258 | "codemirror_mode": { 2259 | "name": "ipython", 2260 | "version": 3 2261 | }, 2262 | "file_extension": ".py", 2263 | "mimetype": "text/x-python", 2264 | "name": "python", 2265 | "nbconvert_exporter": "python", 2266 | "pygments_lexer": "ipython3", 2267 | "version": "3.7.3" 2268 | } 2269 | }, 2270 | "nbformat": 4, 2271 | "nbformat_minor": 2 2272 | } 2273 | -------------------------------------------------------------------------------- /NN Architecture.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/archd3sai/Instacart-Market-Basket-Analysis/eebc34070e6e4803c43f968a5e0975d268fb036e/NN Architecture.png -------------------------------------------------------------------------------- /Plots/Add-to-cart-VS-reorder.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/archd3sai/Instacart-Market-Basket-Analysis/eebc34070e6e4803c43f968a5e0975d268fb036e/Plots/Add-to-cart-VS-reorder.png -------------------------------------------------------------------------------- /Plots/Most-popular-products.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/archd3sai/Instacart-Market-Basket-Analysis/eebc34070e6e4803c43f968a5e0975d268fb036e/Plots/Most-popular-products.png -------------------------------------------------------------------------------- /Plots/NN Architecture.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/archd3sai/Instacart-Market-Basket-Analysis/eebc34070e6e4803c43f968a5e0975d268fb036e/Plots/NN Architecture.png -------------------------------------------------------------------------------- 
/Plots/NN-Performance.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/archd3sai/Instacart-Market-Basket-Analysis/eebc34070e6e4803c43f968a5e0975d268fb036e/Plots/NN-Performance.png -------------------------------------------------------------------------------- /Plots/NN-Report.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/archd3sai/Instacart-Market-Basket-Analysis/eebc34070e6e4803c43f968a5e0975d268fb036e/Plots/NN-Report.png -------------------------------------------------------------------------------- /Plots/Reorder-organic-inorganic-products.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/archd3sai/Instacart-Market-Basket-Analysis/eebc34070e6e4803c43f968a5e0975d268fb036e/Plots/Reorder-organic-inorganic-products.png -------------------------------------------------------------------------------- /Plots/Total-organic-inorganic-products.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/archd3sai/Instacart-Market-Basket-Analysis/eebc34070e6e4803c43f968a5e0975d268fb036e/Plots/Total-organic-inorganic-products.png -------------------------------------------------------------------------------- /Plots/XGBoost Feature Importance Plot.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/archd3sai/Instacart-Market-Basket-Analysis/eebc34070e6e4803c43f968a5e0975d268fb036e/Plots/XGBoost Feature Importance Plot.png -------------------------------------------------------------------------------- /Plots/XGBoost Performance.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/archd3sai/Instacart-Market-Basket-Analysis/eebc34070e6e4803c43f968a5e0975d268fb036e/Plots/XGBoost Performance.png -------------------------------------------------------------------------------- /Plots/XGBoost-Report.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/archd3sai/Instacart-Market-Basket-Analysis/eebc34070e6e4803c43f968a5e0975d268fb036e/Plots/XGBoost-Report.png -------------------------------------------------------------------------------- /Plots/aisle-high-reorder.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/archd3sai/Instacart-Market-Basket-Analysis/eebc34070e6e4803c43f968a5e0975d268fb036e/Plots/aisle-high-reorder.png -------------------------------------------------------------------------------- /Plots/aisle-low-reorder.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/archd3sai/Instacart-Market-Basket-Analysis/eebc34070e6e4803c43f968a5e0975d268fb036e/Plots/aisle-low-reorder.png -------------------------------------------------------------------------------- /Plots/cluster.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/archd3sai/Instacart-Market-Basket-Analysis/eebc34070e6e4803c43f968a5e0975d268fb036e/Plots/cluster.png -------------------------------------------------------------------------------- /Plots/cumsum_products.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/archd3sai/Instacart-Market-Basket-Analysis/eebc34070e6e4803c43f968a5e0975d268fb036e/Plots/cumsum_products.png -------------------------------------------------------------------------------- /Plots/dow.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/archd3sai/Instacart-Market-Basket-Analysis/eebc34070e6e4803c43f968a5e0975d268fb036e/Plots/dow.png -------------------------------------------------------------------------------- /Plots/elbow.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/archd3sai/Instacart-Market-Basket-Analysis/eebc34070e6e4803c43f968a5e0975d268fb036e/Plots/elbow.png -------------------------------------------------------------------------------- /Plots/heatmap.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/archd3sai/Instacart-Market-Basket-Analysis/eebc34070e6e4803c43f968a5e0975d268fb036e/Plots/heatmap.png -------------------------------------------------------------------------------- /Plots/orders.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/archd3sai/Instacart-Market-Basket-Analysis/eebc34070e6e4803c43f968a5e0975d268fb036e/Plots/orders.png -------------------------------------------------------------------------------- /Plots/popular-aisles.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/archd3sai/Instacart-Market-Basket-Analysis/eebc34070e6e4803c43f968a5e0975d268fb036e/Plots/popular-aisles.png -------------------------------------------------------------------------------- /Plots/popular-departments.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/archd3sai/Instacart-Market-Basket-Analysis/eebc34070e6e4803c43f968a5e0975d268fb036e/Plots/popular-departments.png -------------------------------------------------------------------------------- /Plots/prior.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/archd3sai/Instacart-Market-Basket-Analysis/eebc34070e6e4803c43f968a5e0975d268fb036e/Plots/prior.png -------------------------------------------------------------------------------- /Plots/readme.md: -------------------------------------------------------------------------------- 1 | This folder contains plots of analysis and model. 2 | -------------------------------------------------------------------------------- /Plots/reorder-df.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/archd3sai/Instacart-Market-Basket-Analysis/eebc34070e6e4803c43f968a5e0975d268fb036e/Plots/reorder-df.png -------------------------------------------------------------------------------- /Plots/reorder-total-orders.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/archd3sai/Instacart-Market-Basket-Analysis/eebc34070e6e4803c43f968a5e0975d268fb036e/Plots/reorder-total-orders.png -------------------------------------------------------------------------------- /Plots/train.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/archd3sai/Instacart-Market-Basket-Analysis/eebc34070e6e4803c43f968a5e0975d268fb036e/Plots/train.png -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Instacart Market Basket Analysis 2 | 3 | ## Introduction 4 | 5 | Instacart is an American technology company that operates as a same-day grocery delivery and pick up service in the U.S. and Canada. Customers shop for groceries through the Instacart mobile app or Instacart.com from various retailer partners. The order is shopped and delivered by an Instacart personal shopper. 
6 |
7 | ### Objectives:
8 | - Analyze the anonymized [data](https://www.kaggle.com/c/instacart-market-basket-analysis/data) of 3 million grocery orders from more than 200,000 Instacart users, open sourced by Instacart
9 | - Discover hidden associations between products for better cross-selling and upselling
10 | - Perform customer segmentation for targeted marketing and to anticipate customer behavior
11 | - Build a Machine Learning model to predict which previously purchased products will be in a user’s next order
12 |
13 | ### Project Organization
14 | ```
15 | .
16 | ├── Plots/ : Contains all plots
17 | ├── Data Description and Analysis.ipynb : Initial analysis to understand data
18 | ├── Exploratory Data Analysis.ipynb : EDA to analyze customer purchase pattern
19 | ├── Customers Segmentation.ipynb : Customer Segmentation based on product aisles
20 | ├── Market Basket Analysis.ipynb : Market Basket Analysis to find products association
21 | ├── Feature Extraction.ipynb : Feature engineering and extraction for a ML model
22 | ├── Data Preparation.ipynb : Data preparation for modeling
23 | ├── ANN Model.ipynb : Neural Network model for product reorder prediction
24 | ├── XGBoost Model.ipynb : XGBoost model for product reorder prediction
25 | ├── LICENSE : License
26 | └── README.md : Project Report
27 | ```
28 |
29 |
30 | ## Data Description
31 |
32 | - **aisles:** This file lists the aisles; there are 134 unique aisles in total.
33 |
34 | - **departments:** This file lists the departments; there are 21 unique departments in total.
35 |
36 | - **orders:** This file contains all the orders made by different users. From the analysis below, we can conclude the following:
37 |   - There are 3421083 orders in total, made by 206209 users.
38 |   - There are three sets of orders: Prior, Train and Test. The distributions of orders in the Train and Test sets are similar, whereas the distribution of orders in the Prior set is different.
39 |   - The total number of orders per customer ranges from 0 to 100.
40 |   - Based on the 'Orders VS Day of Week' plot, we can map 0 and 1 to Saturday and Sunday respectively, on the assumption that most people buy groceries on weekends.
41 |   - The majority of orders are placed during the daytime.
42 |   - Customers tend to order once a week, which is supported by the peaks at 7, 14, 21 and 30 in the 'Orders VS Days since prior order' graph.
43 |   - Based on the heatmap of 'Day of Week' against 'Hour of Day', we can say that Saturday afternoons and Sunday mornings are the prime times for orders.
44 |
45 |

46 | 47 |

48 | 49 |

50 | 51 |

52 | 53 |

54 | 55 |

56 |
57 | - **products:** This file lists all 49688 products along with their aisle and department. The number of products varies across aisles and departments.
58 |
59 | - **order_products_prior:** This file records which products were ordered and the sequence in which they were added to the cart. It also tells us whether each product was a reorder.
60 |
61 |   - This file covers 3214874 orders, through which 49677 distinct products were ordered.
62 |   - From the 'Count VS Items in cart' plot, we can say that most people buy 1-15 items per order, and the largest order contained 145 items.
63 |   - The percentage of reordered items in this set is 58.97%.
64 |
65 |

66 | 67 |

68 |
69 |
70 | - **order_products_train:** This file records which products were ordered and the sequence in which they were added to the cart. It also tells us whether each product was a reorder.
71 |   - This file covers 131209 orders, through which 39123 distinct products were ordered.
72 |   - From the 'Count VS Items in cart' plot, we can say that most people buy 1-15 items per order, and the largest order contained 145 items.
73 |   - The percentage of reordered items in this set is 59.86%.
74 |
75 |

76 | 77 |

78 |
79 | ## Exploratory Data Analysis
80 | For the analysis, I combined all of the separate data files into a single dataframe. To fit the dataframe in memory, I reduced its size by about 50% (from 4.1 GB to 2.0 GB) through type conversion, without losing any information.
81 |
82 | - This plot shows the most popular aisles based on total products bought.
83 |
84 |

85 | 86 |

87 |
88 | - As the plot below shows, the reorder percentage of day-to-day food items is high, while for other products such as vitamins, first-aid items, beauty products, etc. the reorder percentage is low. This makes sense, as we buy groceries regularly but do not buy those other items in every order.
89 |
90 |

91 | 92 | 93 |

94 |
95 | - The plot below shows the popular departments. The store layout should place popular departments close to each other.
96 |
97 |

98 | 99 |

100 |
101 | - The plot below shows the most popular products. As we can see, many of the most popular products are organic.
102 |
103 |

104 | 105 |

106 |
107 | - We can see that there are fewer organic products, but their mean reorder percentage is high. This suggests that the store should stock more organic products.
108 |
109 |

110 | 111 | 112 |

113 |
114 |
115 | - We can plot add-to-cart order against mean reorder percentage. As we can see, the lower the add-to-cart order, the higher the reorder percentage. This makes sense, as we tend to first pick the things we need on a day-to-day basis.
116 |
117 |

118 | 119 |

120 |
121 | - In the plot below of reorder percentage against number of product purchases, we see a ceiling effect. Many people try a product once and never reorder it. There are also users who buy certain products regularly.
122 |
123 |

124 | 125 |

126 |
127 | - We can see that the products with the highest reorder ratio have only a few unique users (1-15). This means that these users like these products and buy them regularly.
128 |
129 |

130 | 131 |

132 |
133 | - In the plot below of cumulative total users per product against products, we can see that 85% of the users buy only 10000 of the 49688 products. If we are interested in shelf space optimization, we should stock only these 10000 products. Here, I assume that the profit from the remaining 39688 products is not significantly high. If we had the prices of these products, we could also have taken into account products with high revenue, a high reorder percentage and high total sales.
134 |
135 |

136 | 137 |

138 |
139 | ## Customer Segmentation
140 |
141 | Customer segmentation is the process of dividing customers into groups based on common characteristics so companies can market to each group effectively and appropriately. We can perform segmentation using the data on which products users buy. Since there are thousands of products and also thousands of customers, I used aisles, which represent categories of products.
142 |
143 | I then performed Principal Component Analysis to reduce the number of dimensions, as KMeans does not produce good results in high dimensions. Using 10 principal components, I carried out KMeans clustering. I chose 5 as the optimal number of clusters using the Elbow method shown below.
144 |
145 |

146 | 147 |
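
The Elbow method mentioned above can be illustrated with a small, self-contained sketch. This is not the notebook's code: it uses a hand-written Lloyd's k-means in pure Python instead of a library KMeans, skips the PCA step, and runs on twelve made-up 2-D points. The helper names (`kmeans_inertia`, `farthest_point_init`) and the deterministic farthest-point initialisation are assumptions chosen to keep the example reproducible.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean_point(cluster):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(cluster)
    return tuple(sum(coords) / n for coords in zip(*cluster))


def farthest_point_init(points, k):
    """Deterministic init (an assumption of this sketch): start from the
    first point, then repeatedly add the point farthest from all
    centroids chosen so far."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(sq_dist(p, c) for c in centroids))
        )
    return centroids


def kmeans_inertia(points, k, n_iter=100):
    """Run basic Lloyd's k-means and return the inertia, i.e. the sum of
    squared distances from each point to its nearest centroid."""
    centroids = farthest_point_init(points, k)
    for _ in range(n_iter):
        # Assignment step: attach every point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: sq_dist(p, centroids[j]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        new_centroids = [
            mean_point(c) if c else centroids[j] for j, c in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return sum(min(sq_dist(p, c) for c in centroids) for p in points)


# Toy data (invented): three well-separated groups of four points each.
centers = [(0, 0), (10, 10), (20, 0)]
offsets = [(-1, -1), (-1, 1), (1, -1), (1, 1)]
points = [(cx + dx, cy + dy) for cx, cy in centers for dx, dy in offsets]

# Inertia for k = 1..5; the "elbow" is where the curve stops dropping sharply.
inertias = {k: kmeans_inertia(points, k) for k in range(1, 6)}
for k, value in inertias.items():
    print(k, round(value, 2))
```

On this toy data the inertia falls steeply up to k = 3 and barely changes afterwards, so the elbow sits at 3. The same reading of the curve is what leads to 5 clusters on the real aisle data in this project.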

148 |
149 | The clustering can be visualized along the first two principal components, as shown below.
150 |
151 |

152 | 153 |

154 |
155 | The clustering results in 5 neat clusters. After checking the most frequent products in each of them, we can conclude the following:
156 | - Cluster 1 contains 5428 consumers who have a very strong preference for the water seltzer sparkling water aisle.
157 | - Cluster 2 contains 55784 consumers who mostly order fresh vegetables, followed by fruits.
158 | - Cluster 3 contains 7948 consumers who mostly buy packaged produce and fresh fruits.
159 | - Cluster 4 contains 37949 consumers who have a very strong preference for fruits, followed by fresh vegetables.
160 | - Cluster 5 contains 99100 consumers who order products from many aisles. Their mean number of orders is low compared to the other clusters, which tells us that they are either infrequent users of Instacart or new users who do not have many orders yet.
161 |
162 | ## Market Basket Analysis
163 |
164 | Market Basket Analysis is a modelling technique based upon the theory that if you buy a certain group of items, you are more or less likely to buy another group of items. Market basket analysis may provide the retailer with information to understand the purchase behavior of a buyer. This information can then be used for purposes of cross-selling and up-selling, in addition to influencing sales promotions, loyalty programs, store design, and discount plans.
165 |
166 | Market basket analysis scrutinizes the products customers tend to buy together, and uses the information to decide which products should be cross-sold or promoted together. The term arises from the shopping carts supermarket shoppers fill up during a shopping trip.
167 |
168 | Association Rule Mining is used when we want to find associations between different objects in a set, or to find frequent patterns in a transaction database, relational database or any other information repository.
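
To make association rules concrete, here is a minimal sketch of the three measures commonly used to score a rule A => B: support (how often an itemset appears), confidence (how often B appears in baskets that contain A) and lift (confidence relative to B's own support). The baskets are invented for illustration and are not taken from the Instacart data; the function names are hypothetical helpers, not part of any library.

```python
# Toy baskets, invented purely for illustration.
transactions = [
    {"Limes", "Large Lemon", "Banana"},
    {"Limes", "Large Lemon"},
    {"Banana", "Organic Strawberries"},
    {"Banana"},
    {"Limes", "Banana"},
]


def support(itemset, baskets):
    """Fraction of baskets that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in baskets) / len(baskets)


def confidence(antecedent, consequent, baskets):
    """Support(A, B) / Support(A): share of A-baskets that also hold B."""
    both = set(antecedent) | set(consequent)
    return support(both, baskets) / support(antecedent, baskets)


def lift(antecedent, consequent, baskets):
    """Confidence(A => B) / Support(B); values above 1 mean A and B
    co-occur more often than they would if they were independent."""
    return confidence(antecedent, consequent, baskets) / support(consequent, baskets)


a, b = {"Limes"}, {"Large Lemon"}
print("support(A, B):", support(a | b, transactions))
print("confidence(A => B):", confidence(a, b, transactions))
print("lift(A => B):", lift(a, b, transactions))
```

Note that lift is symmetric (swapping A and B gives the same value), whereas confidence is not, which is why each product pair yields two rules with the same lift but different confidence.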
169 |
170 | The most common approach to finding these patterns is Market Basket Analysis, a key technique used by large retailers such as Amazon, Flipkart, etc. to analyze customer buying habits by finding associations between the different items that customers place in their “shopping baskets”. The discovery of these associations can help retailers develop marketing strategies by gaining insight into which items are frequently purchased together by customers. The strategies may include:
171 |
172 | - Changing the store layout according to trends
173 | - Customer behavior analysis
174 | - Catalog design
175 | - Cross-marketing on online stores
176 | - Customized emails with add-on sales, etc.
177 |
178 | ### Metrics
179 |
180 | **Support** : The baseline popularity of an item. In mathematical terms, the support of item A is the ratio of the number of transactions involving A to the total number of transactions.
181 |
182 | **Confidence** : The likelihood that a customer who bought A also bought B. It is the ratio of the number of transactions involving both A and B to the number of transactions involving A.
183 | - Confidence(A => B) = Support(A, B)/Support(A)
184 |
185 | **Lift** : The increase in the sale of B when you sell A, relative to how often B sells on its own.
186 | - Lift(A => B) = Confidence(A => B)/Support(B)
187 |
188 | - Lift (A => B) = 1 means that there is no correlation within the itemset.
189 | - Lift (A => B) > 1 means that there is a positive correlation within the itemset, i.e., the products in the itemset, A and B, are more likely to be bought together.
190 | - Lift (A => B) < 1 means that there is a negative correlation within the itemset, i.e., the products in the itemset, A and B, are unlikely to be bought together.
191 |
192 | **Apriori Algorithm:** The Apriori algorithm relies on the property that any subset of a frequent itemset must also be frequent. It is the algorithm behind Market Basket Analysis. Say, a transaction containing {Grapes, Apple, Mango} also contains {Grapes, Mango}.
So, according to the principle of Apriori, if {Grapes, Apple, Mango} is frequent, then {Grapes, Mango} must also be frequent.
193 |
194 | I used the apriori algorithm from the Mlxtend Python library to find associations among the top 100 most frequent products, which resulted in 28 product pairs (56 rules in total) with lift higher than 1. The 10 product pairs with the highest lift are shown below:
195 |
196 | | Product A | Product B | Lift |
197 | | ------------- | ------------- | ---- |
198 | | Limes | Large Lemon | 3.01 |
199 | | Organic Strawberries | Organic Raspberries | 2.21 |
200 | | Organic Avocado | Large Lemon | 2.13 |
201 | | Organic Strawberries | Organic Blueberries | 2.11 |
202 | | Organic Hass Avocado | Organic Raspberries | 2.08 |
203 | | Banana | Organic Fuji Apple | 1.88 |
204 | | Bag of Organic Bananas | Organic Raspberries | 1.84 |
205 | | Organic Hass Avocado | Bag of Organic Bananas | 1.82 |
206 | | Honeycrisp Apple | Banana | 1.77 |
207 | | Organic Avocado | Organic Baby Spinach | 1.70 |
208 |
209 | ## ML Model to Predict Product Reorders
210 |
211 | We can use this anonymized transactional data of customer orders over time to predict which previously purchased products will be in a user’s next order. This would help recommend products to a user.
212 |
213 | To build a model, I need to extract features from previous orders to understand each user's purchase pattern and how popular a particular product is. I extracted the following features from the users' transactional data.
214 |
215 | **Product Level Features:** To understand the product's popularity among users
216 | ```
217 | (1) Product's average add-to-cart-order
218 | (2) Total times the product was ordered
219 | (3) Total times the product was reordered
220 | (4) Reorder percentage of a product
221 | (5) Total unique users of a product
222 | (6) Is the product Organic?
223 | (7) Percentage of users that buy the product a second time
224 | ```
225 |
226 | **Aisle and Department Level Features:** To capture whether a department and aisle are related to day-to-day products (vegetables, fruits, soda, water, etc.) or once-in-a-while products (medicines, personal-care, etc.)
227 | ```
228 | (8) Reorder percentage, Total orders and reorders of a product aisle
229 | (9) Mean and std of aisle add-to-cart-order
230 | (10) Aisle unique users
231 | (11) Reorder percentage, Total orders and reorders of a product department
232 | (12) Mean and std of department add-to-cart-order
233 | (13) Department unique users
234 | (14) Binary encoding of aisle feature (Because one-hot encoding results in many features and makes the dataframe sparse)
235 | (15) Binary encoding of department feature (Because one-hot encoding results in many features and makes the dataframe sparse)
236 | ```
237 |
238 | **User Level Features:** To capture user's purchase pattern and behavior
239 | ```
240 | (16) User's average and std day-of-week of order
241 | (17) User's average and std hour-of-day of order
242 | (18) User's average and std days-since-prior-order
243 | (19) Total orders by a user
244 | (20) Total products user has bought
245 | (21) Total unique products user has bought
246 | (22) User's total reordered products
247 | (23) User's overall reorder percentage
248 | (24) Average order size of a user
249 | (25) User's mean of reordered items of all orders
250 | (26) Percentage of reordered items in user's last three orders
251 | (27) Total orders in user's last three orders
252 | ```
253 |
254 | **User-product Level Features:** To capture user's pattern of ordering-reordering specific products
255 | ```
256 | (28) User's avg add-to-cart-order for a product
257 | (29) User's avg days_since_prior_order for a product
258 | (30) User's product total orders, reorders and reorders percentage
259 | (31) User's order number when the product was bought last
260 | (32) User's product
purchase history of last three orders
261 | ```
262 |
263 | ### ML Models
264 |
265 | Using the extracted features, I prepared a dataframe that contains all the products a user has bought previously, user level features, product level features, aisle and department level features, user-product level features, and information about the current order such as its day-of-week, hour-of-day, etc. The target is 'reordered', which indicates which of the previously purchased items the user ordered this time.
266 |
267 | Since the dataframe is huge, I reduced its memory consumption by downcasting so that the data fits into my memory. I preferred MinMaxScaler over StandardScaler, as the latter requires 16 GB of RAM for its operation. I followed the standard process for model building and relied on XGBoost, as it handles large data, can be parallelized and gives feature importance. I also built a Neural Network to see what the best performance from this kind of model would be, disregarding some inherent randomness in both of these models. To balance the data, I used cost-sensitive learning by assigning class weights (~{0:1, 1:10}). I did not use random upsampling or SMOTE, as they would increase the data size and I do not have much memory. I also avoided random down-sampling, since it discards information that might be important and could introduce bias.
268 |
269 | Since the F1 score can be gamed by changing the classification threshold, I relied on the AUC score for model evaluation. The performance of both models is shown below using the confusion matrix, ROC curve and classification report. The feature importance plot from the XGBoost model is also shown, to highlight the features that help predict a product's reorder. The two models perform almost identically, with XGBoost slightly better in terms of ROC-AUC.
270 |
271 | **Neural Network Model Architecture and Performance:**
272 |
273 |

274 | 275 |

276 | 277 |

278 | 279 |

280 | 281 |

282 | 283 |

284 | 285 | 286 | **XGBoost Model's Performance and Feature Importance:** 287 | 288 |

289 | 290 |

291 | 292 |

293 | 294 |

295 | 296 |

297 | 298 |

299 | 300 | 301 | ## Future Work 302 | 303 | - Utilize Collaborative filtering to recommend products to a customer. 304 | --------------------------------------------------------------------------------