├── .gitignore
├── ANN Model.ipynb
├── Customers Segmentation.ipynb
├── Data Description and Analysis.ipynb
├── Data Preparation.ipynb
├── Exploratory Data Analysis.ipynb
├── Feature Extraction.ipynb
├── LICENSE
├── Market Basket Analysis.ipynb
├── NN Architecture.png
├── Plots
│   ├── Add-to-cart-VS-reorder.png
│   ├── Most-popular-products.png
│   ├── NN Architecture.png
│   ├── NN-Performance.png
│   ├── NN-Report.png
│   ├── Reorder-organic-inorganic-products.png
│   ├── Total-organic-inorganic-products.png
│   ├── XGBoost Feature Importance Plot.eps
│   ├── XGBoost Feature Importance Plot.png
│   ├── XGBoost Performance.png
│   ├── XGBoost-Report.png
│   ├── aisle-high-reorder.png
│   ├── aisle-low-reorder.png
│   ├── cluster.png
│   ├── cumsum_products.png
│   ├── dow.png
│   ├── elbow.png
│   ├── heatmap.png
│   ├── orders.png
│   ├── popular-aisles.png
│   ├── popular-departments.png
│   ├── prior.png
│   ├── readme.md
│   ├── reorder-df.png
│   ├── reorder-total-orders.png
│   └── train.png
├── README.md
└── XGBoost Model.ipynb
/.gitignore:
--------------------------------------------------------------------------------
1 | # Byte-compiled / optimized / DLL files
2 | __pycache__/
3 | *.py[cod]
4 | *$py.class
5 |
6 | # C extensions
7 | *.so
8 |
9 | # Distribution / packaging
10 | .Python
11 | build/
12 | develop-eggs/
13 | dist/
14 | downloads/
15 | eggs/
16 | .eggs/
17 | lib/
18 | lib64/
19 | parts/
20 | sdist/
21 | var/
22 | wheels/
23 | *.egg-info/
24 | .installed.cfg
25 | *.egg
26 | MANIFEST
27 |
28 | # PyInstaller
29 | # Usually these files are written by a python script from a template
30 | # before PyInstaller builds the exe, so as to inject date/other infos into it.
31 | *.manifest
32 | *.spec
33 |
34 | # Installer logs
35 | pip-log.txt
36 | pip-delete-this-directory.txt
37 |
38 | # Unit test / coverage reports
39 | htmlcov/
40 | .tox/
41 | .coverage
42 | .coverage.*
43 | .cache
44 | nosetests.xml
45 | coverage.xml
46 | *.cover
47 | .hypothesis/
48 | .pytest_cache/
49 |
50 | # Translations
51 | *.mo
52 | *.pot
53 |
54 | # Django stuff:
55 | *.log
56 | local_settings.py
57 | db.sqlite3
58 |
59 | # Flask stuff:
60 | instance/
61 | .webassets-cache
62 |
63 | # Scrapy stuff:
64 | .scrapy
65 |
66 | # Sphinx documentation
67 | docs/_build/
68 |
69 | # PyBuilder
70 | target/
71 |
72 | # Jupyter Notebook
73 | .ipynb_checkpoints
74 |
75 | # pyenv
76 | .python-version
77 |
78 | # celery beat schedule file
79 | celerybeat-schedule
80 |
81 | # SageMath parsed files
82 | *.sage.py
83 |
84 | # Environments
85 | .env
86 | .venv
87 | env/
88 | venv/
89 | ENV/
90 | env.bak/
91 | venv.bak/
92 |
93 | # Spyder project settings
94 | .spyderproject
95 | .spyproject
96 |
97 | # Rope project settings
98 | .ropeproject
99 |
100 | # mkdocs documentation
101 | /site
102 |
103 | # mypy
104 | .mypy_cache/
105 |
--------------------------------------------------------------------------------
/Data Preparation.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# Data Preparation"
8 | ]
9 | },
10 | {
11 | "cell_type": "code",
12 | "execution_count": 1,
13 | "metadata": {},
14 | "outputs": [],
15 | "source": [
16 | "import numpy as np\n",
17 | "import pandas as pd\n",
18 | "import matplotlib.pyplot as plt\n",
19 | "import seaborn as sns\n",
20 | "import gc\n",
21 | "pd.options.mode.chained_assignment = None\n",
22 | "\n",
23 | "root = 'C:/Data/instacart-market-basket-analysis/'"
24 | ]
25 | },
26 | {
27 | "cell_type": "markdown",
28 | "metadata": {},
29 | "source": [
30 | "#### Reading all data"
31 | ]
32 | },
33 | {
34 | "cell_type": "code",
35 | "execution_count": 2,
36 | "metadata": {},
37 | "outputs": [],
38 | "source": [
39 | "orders = pd.read_csv(root + 'orders.csv', \n",
40 | " dtype={\n",
41 | " 'order_id': np.int32,\n",
42 | " 'user_id': np.int64,\n",
43 | " 'eval_set': 'category',\n",
44 | " 'order_number': np.int16,\n",
45 | " 'order_dow': np.int8,\n",
46 | " 'order_hour_of_day': np.int8,\n",
47 | " 'days_since_prior_order': np.float32})\n",
48 | "\n",
49 | "\n",
50 | "order_products_train = pd.read_csv(root + 'order_products__train.csv', \n",
51 | " dtype={\n",
52 | " 'order_id': np.int32,\n",
53 | " 'product_id': np.uint16,\n",
54 | " 'add_to_cart_order': np.int16,\n",
55 | " 'reordered': np.int8})\n",
56 | "\n",
57 | "order_products_prior = pd.read_csv(root + 'order_products__prior.csv', \n",
58 | " dtype={\n",
59 | " 'order_id': np.int32,\n",
60 | " 'product_id': np.uint16,\n",
61 | " 'add_to_cart_order': np.int16,\n",
62 | " 'reordered': np.int8})\n",
63 | "\n",
64 | "product_features = pd.read_pickle(root + 'product_features.pkl')\n",
65 | "\n",
66 | "user_features = pd.read_pickle(root + 'user_features.pkl')\n",
67 | "\n",
68 | "user_product_features = pd.read_pickle(root + 'user_product_features.pkl')\n",
69 | "\n",
70 | "products = pd.read_csv(root +'products.csv')\n",
71 | "\n",
72 | "aisles = pd.read_csv(root + 'aisles.csv')\n",
73 | "\n",
74 | "departments = pd.read_csv(root + 'departments.csv')"
75 | ]
76 | },
77 | {
78 | "cell_type": "markdown",
79 | "metadata": {},
80 | "source": [
81 | "#### Merging train order data with orders"
82 | ]
83 | },
84 | {
85 | "cell_type": "code",
86 | "execution_count": 3,
87 | "metadata": {
88 | "scrolled": true
89 | },
90 | "outputs": [
91 | {
92 | "data": {
194 | "text/plain": [
195 | " order_id user_id eval_set order_number order_dow order_hour_of_day \\\n",
196 | "0 1187899 1 train 11 4 8 \n",
197 | "1 1187899 1 train 11 4 8 \n",
198 | "2 1187899 1 train 11 4 8 \n",
199 | "3 1187899 1 train 11 4 8 \n",
200 | "4 1187899 1 train 11 4 8 \n",
201 | "\n",
202 | " days_since_prior_order product_id add_to_cart_order reordered \n",
203 | "0 14.0 196 1 1 \n",
204 | "1 14.0 25133 2 1 \n",
205 | "2 14.0 38928 3 1 \n",
206 | "3 14.0 26405 4 1 \n",
207 | "4 14.0 39657 5 1 "
208 | ]
209 | },
210 | "execution_count": 3,
211 | "metadata": {},
212 | "output_type": "execute_result"
213 | }
214 | ],
215 | "source": [
216 | "train_orders = orders.merge(order_products_train, on = 'order_id', how = 'inner')\n",
217 | "train_orders.head()"
218 | ]
219 | },
220 | {
221 | "cell_type": "markdown",
222 | "metadata": {},
223 | "source": [
224 | "Removing unnecessary columns from train_orders"
225 | ]
226 | },
227 | {
228 | "cell_type": "code",
229 | "execution_count": 4,
230 | "metadata": {},
231 | "outputs": [],
232 | "source": [
233 | "train_orders.drop(['eval_set', 'add_to_cart_order', 'order_id'], axis = 1, inplace = True)"
234 | ]
235 | },
236 | {
237 | "cell_type": "markdown",
238 | "metadata": {},
239 | "source": [
240 | "Unique user_ids in the train data"
241 | ]
242 | },
243 | {
244 | "cell_type": "code",
245 | "execution_count": 5,
246 | "metadata": {},
247 | "outputs": [
248 | {
249 | "data": {
250 | "text/plain": [
251 | "array([ 1, 2, 5, 7, 8, 9, 10, 13, 14, 17], dtype=int64)"
252 | ]
253 | },
254 | "execution_count": 5,
255 | "metadata": {},
256 | "output_type": "execute_result"
257 | }
258 | ],
259 | "source": [
260 | "train_users = train_orders.user_id.unique()\n",
261 | "train_users[:10]"
262 | ]
263 | },
264 | {
265 | "cell_type": "markdown",
266 | "metadata": {},
267 | "source": [
268 | "Keeping only train_users in the data"
269 | ]
270 | },
271 | {
272 | "cell_type": "code",
273 | "execution_count": 6,
274 | "metadata": {
275 | "scrolled": true
276 | },
277 | "outputs": [
278 | {
279 | "data": {
280 | "text/plain": [
281 | "(13307953, 11)"
282 | ]
283 | },
284 | "execution_count": 6,
285 | "metadata": {},
286 | "output_type": "execute_result"
287 | }
288 | ],
289 | "source": [
290 | "user_product_features.shape"
291 | ]
292 | },
293 | {
294 | "cell_type": "code",
295 | "execution_count": 7,
296 | "metadata": {},
297 | "outputs": [
298 | {
299 | "data": {
407 | "text/plain": [
408 | " user_id product_id total_product_orders_by_user \\\n",
409 | "0 1 196 10 \n",
410 | "1 1 10258 9 \n",
411 | "2 1 10326 1 \n",
412 | "3 1 12427 10 \n",
413 | "4 1 13032 3 \n",
414 | "\n",
415 | " total_product_reorders_by_user user_product_reorder_percentage \\\n",
416 | "0 9 0.900000 \n",
417 | "1 8 0.888889 \n",
418 | "2 0 0.000000 \n",
419 | "3 9 0.900000 \n",
420 | "4 2 0.666667 \n",
421 | "\n",
422 | " avg_add_to_cart_by_user avg_days_since_last_bought last_ordered_in \\\n",
423 | "0 1.400000 17.600000 10 \n",
424 | "1 3.333333 19.555555 10 \n",
425 | "2 5.000000 28.000000 5 \n",
426 | "3 3.300000 17.600000 10 \n",
427 | "4 6.333333 21.666666 10 \n",
428 | "\n",
429 | " is_reorder_3 is_reorder_2 is_reorder_1 \n",
430 | "0 1.0 1.0 1.0 \n",
431 | "1 1.0 1.0 1.0 \n",
432 | "2 0.0 0.0 0.0 \n",
433 | "3 1.0 1.0 1.0 \n",
434 | "4 1.0 0.0 0.0 "
435 | ]
436 | },
437 | "execution_count": 7,
438 | "metadata": {},
439 | "output_type": "execute_result"
440 | }
441 | ],
442 | "source": [
443 | "user_product_features.head()"
444 | ]
445 | },
446 | {
447 | "cell_type": "code",
448 | "execution_count": 8,
449 | "metadata": {},
450 | "outputs": [
451 | {
452 | "data": {
560 | "text/plain": [
561 | " user_id product_id total_product_orders_by_user \\\n",
562 | "0 1 196 10 \n",
563 | "1 1 10258 9 \n",
564 | "2 1 10326 1 \n",
565 | "3 1 12427 10 \n",
566 | "4 1 13032 3 \n",
567 | "\n",
568 | " total_product_reorders_by_user user_product_reorder_percentage \\\n",
569 | "0 9 0.900000 \n",
570 | "1 8 0.888889 \n",
571 | "2 0 0.000000 \n",
572 | "3 9 0.900000 \n",
573 | "4 2 0.666667 \n",
574 | "\n",
575 | " avg_add_to_cart_by_user avg_days_since_last_bought last_ordered_in \\\n",
576 | "0 1.400000 17.600000 10 \n",
577 | "1 3.333333 19.555555 10 \n",
578 | "2 5.000000 28.000000 5 \n",
579 | "3 3.300000 17.600000 10 \n",
580 | "4 6.333333 21.666666 10 \n",
581 | "\n",
582 | " is_reorder_3 is_reorder_2 is_reorder_1 \n",
583 | "0 1.0 1.0 1.0 \n",
584 | "1 1.0 1.0 1.0 \n",
585 | "2 0.0 0.0 0.0 \n",
586 | "3 1.0 1.0 1.0 \n",
587 | "4 1.0 0.0 0.0 "
588 | ]
589 | },
590 | "execution_count": 8,
591 | "metadata": {},
592 | "output_type": "execute_result"
593 | }
594 | ],
595 | "source": [
596 | "df = user_product_features[user_product_features.user_id.isin(train_users)]\n",
597 | "df.head()"
598 | ]
599 | },
600 | {
601 | "cell_type": "code",
602 | "execution_count": 9,
603 | "metadata": {
604 | "scrolled": false
605 | },
606 | "outputs": [
607 | {
608 | "data": {
746 | "text/plain": [
747 | " user_id product_id total_product_orders_by_user \\\n",
748 | "0 1 196 10.0 \n",
749 | "1 1 10258 9.0 \n",
750 | "2 1 10326 1.0 \n",
751 | "3 1 12427 10.0 \n",
752 | "4 1 13032 3.0 \n",
753 | "\n",
754 | " total_product_reorders_by_user user_product_reorder_percentage \\\n",
755 | "0 9.0 0.900000 \n",
756 | "1 8.0 0.888889 \n",
757 | "2 0.0 0.000000 \n",
758 | "3 9.0 0.900000 \n",
759 | "4 2.0 0.666667 \n",
760 | "\n",
761 | " avg_add_to_cart_by_user avg_days_since_last_bought last_ordered_in \\\n",
762 | "0 1.400000 17.600000 10.0 \n",
763 | "1 3.333333 19.555555 10.0 \n",
764 | "2 5.000000 28.000000 5.0 \n",
765 | "3 3.300000 17.600000 10.0 \n",
766 | "4 6.333333 21.666666 10.0 \n",
767 | "\n",
768 | " is_reorder_3 is_reorder_2 is_reorder_1 order_number order_dow \\\n",
769 | "0 1.0 1.0 1.0 11.0 4.0 \n",
770 | "1 1.0 1.0 1.0 11.0 4.0 \n",
771 | "2 0.0 0.0 0.0 NaN NaN \n",
772 | "3 1.0 1.0 1.0 NaN NaN \n",
773 | "4 1.0 0.0 0.0 11.0 4.0 \n",
774 | "\n",
775 | " order_hour_of_day days_since_prior_order reordered \n",
776 | "0 8.0 14.0 1.0 \n",
777 | "1 8.0 14.0 1.0 \n",
778 | "2 NaN NaN NaN \n",
779 | "3 NaN NaN NaN \n",
780 | "4 8.0 14.0 1.0 "
781 | ]
782 | },
783 | "execution_count": 9,
784 | "metadata": {},
785 | "output_type": "execute_result"
786 | }
787 | ],
788 | "source": [
789 | "df = df.merge(train_orders, on = ['user_id', 'product_id'], how = 'outer')\n",
790 | "df.head()"
791 | ]
792 | },
793 | {
794 | "cell_type": "markdown",
795 | "metadata": {},
796 | "source": [
797 | "For order_number, order_dow, order_hour_of_day and days_since_prior_order, impute null values with the mean value per user, as these products are also potential candidates for the order."
798 | ]
799 | },
800 | {
801 | "cell_type": "code",
802 | "execution_count": 10,
803 | "metadata": {
804 | "scrolled": true
805 | },
806 | "outputs": [],
807 | "source": [
808 | "# Assign the result back; fillna(inplace=True) on an attribute-accessed column may not update df in newer pandas\n",
809 | "df['order_number'] = df['order_number'].fillna(df.groupby('user_id')['order_number'].transform('mean'))\n",
810 | "df['order_dow'] = df['order_dow'].fillna(df.groupby('user_id')['order_dow'].transform('mean'))\n",
811 | "df['order_hour_of_day'] = df['order_hour_of_day'].fillna(df.groupby('user_id')['order_hour_of_day'].transform('mean'))\n",
812 | "df['days_since_prior_order'] = df['days_since_prior_order'].fillna(df.groupby('user_id')['days_since_prior_order'].transform('mean'))"
813 | ]
814 | },
815 | {
816 | "cell_type": "markdown",
817 | "metadata": {},
818 | "source": [
819 | "Removing products that a user bought for the first time in their last order (reordered == 0)"
820 | ]
821 | },
822 | {
823 | "cell_type": "code",
824 | "execution_count": 11,
825 | "metadata": {},
826 | "outputs": [
827 | {
828 | "data": {
829 | "text/plain": [
830 | "1.0 828824\n",
831 | "0.0 555793\n",
832 | "Name: reordered, dtype: int64"
833 | ]
834 | },
835 | "execution_count": 11,
836 | "metadata": {},
837 | "output_type": "execute_result"
838 | }
839 | ],
840 | "source": [
841 | "df.reordered.value_counts()"
842 | ]
843 | },
844 | {
845 | "cell_type": "code",
846 | "execution_count": 12,
847 | "metadata": {},
848 | "outputs": [
849 | {
850 | "data": {
851 | "text/plain": [
852 | "7645837"
853 | ]
854 | },
855 | "execution_count": 12,
856 | "metadata": {},
857 | "output_type": "execute_result"
858 | }
859 | ],
860 | "source": [
861 | "df.reordered.isnull().sum()"
862 | ]
863 | },
864 | {
865 | "cell_type": "code",
866 | "execution_count": 13,
867 | "metadata": {},
868 | "outputs": [],
869 | "source": [
870 | "df = df[df.reordered != 0]"
871 | ]
872 | },
873 | {
874 | "cell_type": "code",
875 | "execution_count": 14,
876 | "metadata": {},
877 | "outputs": [
878 | {
879 | "data": {
880 | "text/plain": [
881 | "(8474661, 16)"
882 | ]
883 | },
884 | "execution_count": 14,
885 | "metadata": {},
886 | "output_type": "execute_result"
887 | }
888 | ],
889 | "source": [
890 | "df.shape"
891 | ]
892 | },
893 | {
894 | "cell_type": "markdown",
895 | "metadata": {},
896 | "source": [
897 | "Now imputing 0 for the remaining null values in reordered, as those products were not reordered by the user in their last order."
898 | ]
899 | },
900 | {
901 | "cell_type": "code",
902 | "execution_count": 15,
903 | "metadata": {},
904 | "outputs": [
905 | {
906 | "data": {
907 | "text/plain": [
908 | "user_id 0\n",
909 | "product_id 0\n",
910 | "total_product_orders_by_user 0\n",
911 | "total_product_reorders_by_user 0\n",
912 | "user_product_reorder_percentage 0\n",
913 | "avg_add_to_cart_by_user 0\n",
914 | "avg_days_since_last_bought 0\n",
915 | "last_ordered_in 0\n",
916 | "is_reorder_3 0\n",
917 | "is_reorder_2 0\n",
918 | "is_reorder_1 0\n",
919 | "order_number 0\n",
920 | "order_dow 0\n",
921 | "order_hour_of_day 0\n",
922 | "days_since_prior_order 0\n",
923 | "reordered 0\n",
924 | "dtype: int64"
925 | ]
926 | },
927 | "execution_count": 15,
928 | "metadata": {},
929 | "output_type": "execute_result"
930 | }
931 | ],
932 | "source": [
933 | "df['reordered'] = df['reordered'].fillna(0)\n",
934 | "\n",
935 | "df.isnull().sum()"
936 | ]
937 | },
938 | {
939 | "cell_type": "code",
940 | "execution_count": 16,
941 | "metadata": {},
942 | "outputs": [
943 | {
944 | "data": {
1082 | "text/plain": [
1083 | " user_id product_id total_product_orders_by_user \\\n",
1084 | "0 1 196 10.0 \n",
1085 | "1 1 10258 9.0 \n",
1086 | "2 1 10326 1.0 \n",
1087 | "3 1 12427 10.0 \n",
1088 | "4 1 13032 3.0 \n",
1089 | "\n",
1090 | " total_product_reorders_by_user user_product_reorder_percentage \\\n",
1091 | "0 9.0 0.900000 \n",
1092 | "1 8.0 0.888889 \n",
1093 | "2 0.0 0.000000 \n",
1094 | "3 9.0 0.900000 \n",
1095 | "4 2.0 0.666667 \n",
1096 | "\n",
1097 | " avg_add_to_cart_by_user avg_days_since_last_bought last_ordered_in \\\n",
1098 | "0 1.400000 17.600000 10.0 \n",
1099 | "1 3.333333 19.555555 10.0 \n",
1100 | "2 5.000000 28.000000 5.0 \n",
1101 | "3 3.300000 17.600000 10.0 \n",
1102 | "4 6.333333 21.666666 10.0 \n",
1103 | "\n",
1104 | " is_reorder_3 is_reorder_2 is_reorder_1 order_number order_dow \\\n",
1105 | "0 1.0 1.0 1.0 11.0 4.0 \n",
1106 | "1 1.0 1.0 1.0 11.0 4.0 \n",
1107 | "2 0.0 0.0 0.0 11.0 4.0 \n",
1108 | "3 1.0 1.0 1.0 11.0 4.0 \n",
1109 | "4 1.0 0.0 0.0 11.0 4.0 \n",
1110 | "\n",
1111 | " order_hour_of_day days_since_prior_order reordered \n",
1112 | "0 8.0 14.0 1.0 \n",
1113 | "1 8.0 14.0 1.0 \n",
1114 | "2 8.0 14.0 0.0 \n",
1115 | "3 8.0 14.0 0.0 \n",
1116 | "4 8.0 14.0 1.0 "
1117 | ]
1118 | },
1119 | "execution_count": 16,
1120 | "metadata": {},
1121 | "output_type": "execute_result"
1122 | }
1123 | ],
1124 | "source": [
1125 | "df.head()"
1126 | ]
1127 | },
1128 | {
1129 | "cell_type": "markdown",
1130 | "metadata": {},
1131 | "source": [
1132 | "#### Merging product and user features"
1133 | ]
1134 | },
1135 | {
1136 | "cell_type": "code",
1137 | "execution_count": 17,
1138 | "metadata": {
1139 | "scrolled": true
1140 | },
1141 | "outputs": [
1142 | {
1143 | "data": {
1312 | "text/plain": [
1313 | " product_id mean_add_to_cart_order total_orders total_reorders \\\n",
1314 | "0 1 5.801836 1852 1136.0 \n",
1315 | "1 2 9.888889 90 12.0 \n",
1316 | "2 3 6.415162 277 203.0 \n",
1317 | "3 4 9.507599 329 147.0 \n",
1318 | "4 5 6.466667 15 9.0 \n",
1319 | "\n",
1320 | " reorder_percentage unique_users order_first_time_total_cnt \\\n",
1321 | "0 0.613391 716 716 \n",
1322 | "1 0.133333 78 78 \n",
1323 | "2 0.732852 74 74 \n",
1324 | "3 0.446809 182 182 \n",
1325 | "4 0.600000 6 6 \n",
1326 | "\n",
1327 | " order_second_time_total_cnt is_organic second_time_percent ... \\\n",
1328 | "0 276 0 0.385475 ... \n",
1329 | "1 8 0 0.102564 ... \n",
1330 | "2 36 0 0.486486 ... \n",
1331 | "3 64 0 0.351648 ... \n",
1332 | "4 4 0 0.666667 ... \n",
1333 | "\n",
1334 | " department_total_orders department_total_reorders \\\n",
1335 | "0 2887550 1657973.0 \n",
1336 | "1 1875577 650301.0 \n",
1337 | "2 2690129 1757892.0 \n",
1338 | "3 2236432 1211890.0 \n",
1339 | "4 1875577 650301.0 \n",
1340 | "\n",
1341 | " department_reorder_percentage department_unique_users department_0 \\\n",
1342 | "0 0.574180 174219 0 \n",
1343 | "1 0.346721 172755 0 \n",
1344 | "2 0.653460 172795 0 \n",
1345 | "3 0.541885 163233 0 \n",
1346 | "4 0.346721 172755 0 \n",
1347 | "\n",
1348 | " department_1 department_2 department_3 department_4 department_5 \n",
1349 | "0 0 0 0 0 1 \n",
1350 | "1 0 0 0 1 0 \n",
1351 | "2 0 0 0 1 1 \n",
1352 | "3 0 0 1 0 0 \n",
1353 | "4 0 0 0 1 0 \n",
1354 | "\n",
1355 | "[5 rows x 37 columns]"
1356 | ]
1357 | },
1358 | "execution_count": 17,
1359 | "metadata": {},
1360 | "output_type": "execute_result"
1361 | }
1362 | ],
1363 | "source": [
1364 | "product_features.head()"
1365 | ]
1366 | },
1367 | {
1368 | "cell_type": "code",
1369 | "execution_count": 18,
1370 | "metadata": {},
1371 | "outputs": [
1372 | {
1373 | "data": {
1535 | "text/plain": [
1536 | " user_id avg_dow std_dow avg_doh std_doh avg_since_order \\\n",
1537 | "0 1 2.644068 1.256194 10.542373 3.500355 18.542374 \n",
1538 | "1 2 2.005128 0.971222 10.441026 1.649854 14.902564 \n",
1539 | "2 3 1.011364 1.245630 16.352273 1.454599 10.181818 \n",
1540 | "3 4 4.722222 0.826442 13.111111 1.745208 11.944445 \n",
1541 | "4 5 1.621622 1.276961 15.729730 2.588958 10.189189 \n",
1542 | "\n",
1543 | " std_since_order total_orders_by_user total_products_by_user \\\n",
1544 | "0 10.559066 10 59 \n",
1545 | "1 9.671712 14 195 \n",
1546 | "2 5.867395 12 88 \n",
1547 | "3 9.973330 5 18 \n",
1548 | "4 7.600577 4 37 \n",
1549 | "\n",
1550 | " total_unique_product_by_user total_reorders_by_user \\\n",
1551 | "0 18 41.0 \n",
1552 | "1 102 93.0 \n",
1553 | "2 33 55.0 \n",
1554 | "3 17 1.0 \n",
1555 | "4 23 14.0 \n",
1556 | "\n",
1557 | " reorder_propotion_by_user average_order_size reorder_in_order orders_3 \\\n",
1558 | "0 0.694915 5.900000 0.705833 6 \n",
1559 | "1 0.476923 13.928571 0.447961 19 \n",
1560 | "2 0.625000 7.333333 0.658817 6 \n",
1561 | "3 0.055556 3.600000 0.028571 7 \n",
1562 | "4 0.378378 9.250000 0.377778 9 \n",
1563 | "\n",
1564 | " orders_2 orders_1 reorder_3 reorder_2 reorder_1 \n",
1565 | "0 6 9 0.666667 1.0 0.666667 \n",
1566 | "1 9 16 0.578947 0.0 0.625000 \n",
1567 | "2 5 6 0.833333 1.0 1.000000 \n",
1568 | "3 2 3 0.142857 0.0 0.000000 \n",
1569 | "4 5 12 0.444444 0.4 0.666667 "
1570 | ]
1571 | },
1572 | "execution_count": 18,
1573 | "metadata": {},
1574 | "output_type": "execute_result"
1575 | }
1576 | ],
1577 | "source": [
1578 | "user_features.head()"
1579 | ]
1580 | },
1581 | {
1582 | "cell_type": "code",
1583 | "execution_count": 19,
1584 | "metadata": {},
1585 | "outputs": [
1586 | {
1587 | "data": {
1756 | "text/plain": [
1757 | " user_id product_id total_product_orders_by_user \\\n",
1758 | "0 1 196 10.0 \n",
1759 | "1 1 10258 9.0 \n",
1760 | "2 1 10326 1.0 \n",
1761 | "3 1 12427 10.0 \n",
1762 | "4 1 13032 3.0 \n",
1763 | "\n",
1764 | " total_product_reorders_by_user user_product_reorder_percentage \\\n",
1765 | "0 9.0 0.900000 \n",
1766 | "1 8.0 0.888889 \n",
1767 | "2 0.0 0.000000 \n",
1768 | "3 9.0 0.900000 \n",
1769 | "4 2.0 0.666667 \n",
1770 | "\n",
1771 | " avg_add_to_cart_by_user avg_days_since_last_bought last_ordered_in \\\n",
1772 | "0 1.400000 17.600000 10.0 \n",
1773 | "1 3.333333 19.555555 10.0 \n",
1774 | "2 5.000000 28.000000 5.0 \n",
1775 | "3 3.300000 17.600000 10.0 \n",
1776 | "4 6.333333 21.666666 10.0 \n",
1777 | "\n",
1778 | " is_reorder_3 is_reorder_2 ... total_reorders_by_user \\\n",
1779 | "0 1.0 1.0 ... 41.0 \n",
1780 | "1 1.0 1.0 ... 41.0 \n",
1781 | "2 0.0 0.0 ... 41.0 \n",
1782 | "3 1.0 1.0 ... 41.0 \n",
1783 | "4 1.0 0.0 ... 41.0 \n",
1784 | "\n",
1785 | " reorder_propotion_by_user average_order_size reorder_in_order orders_3 \\\n",
1786 | "0 0.694915 5.9 0.705833 6 \n",
1787 | "1 0.694915 5.9 0.705833 6 \n",
1788 | "2 0.694915 5.9 0.705833 6 \n",
1789 | "3 0.694915 5.9 0.705833 6 \n",
1790 | "4 0.694915 5.9 0.705833 6 \n",
1791 | "\n",
1792 | " orders_2 orders_1 reorder_3 reorder_2 reorder_1 \n",
1793 | "0 6 9 0.666667 1.0 0.666667 \n",
1794 | "1 6 9 0.666667 1.0 0.666667 \n",
1795 | "2 6 9 0.666667 1.0 0.666667 \n",
1796 | "3 6 9 0.666667 1.0 0.666667 \n",
1797 | "4 6 9 0.666667 1.0 0.666667 \n",
1798 | "\n",
1799 | "[5 rows x 71 columns]"
1800 | ]
1801 | },
1802 | "execution_count": 19,
1803 | "metadata": {},
1804 | "output_type": "execute_result"
1805 | }
1806 | ],
1807 | "source": [
1808 | "df = df.merge(product_features, on = 'product_id', how = 'left')\n",
1809 | "df = df.merge(user_features, on = 'user_id', how = 'left')\n",
1810 | "df.head()"
1811 | ]
1812 | },
1813 | {
1814 | "cell_type": "markdown",
1815 | "metadata": {},
1816 | "source": [
1817 | "Null values can appear after the merges when a product was never bought earlier by a user, so we check the shape and the null count per column; the check below shows that none remain."
1818 | ]
1819 | },
1820 | {
1821 | "cell_type": "code",
1822 | "execution_count": 20,
1823 | "metadata": {},
1824 | "outputs": [
1825 | {
1826 | "data": {
1827 | "text/plain": [
1828 | "(8474661, 71)"
1829 | ]
1830 | },
1831 | "execution_count": 20,
1832 | "metadata": {},
1833 | "output_type": "execute_result"
1834 | }
1835 | ],
1836 | "source": [
1837 | "df.shape"
1838 | ]
1839 | },
1840 | {
1841 | "cell_type": "code",
1842 | "execution_count": 21,
1843 | "metadata": {},
1844 | "outputs": [
1845 | {
1846 | "data": {
1847 | "text/plain": [
1848 | "reorder_1 0\n",
1849 | "aisle_mean_add_to_cart_order 0\n",
1850 | "reorder_percentage 0\n",
1851 | "unique_users 0\n",
1852 | "order_first_time_total_cnt 0\n",
1853 | "order_second_time_total_cnt 0\n",
1854 | "is_organic 0\n",
1855 | "second_time_percent 0\n",
1856 | "aisle_std_add_to_cart_order 0\n",
1857 | "total_orders 0\n",
1858 | "aisle_total_orders 0\n",
1859 | "aisle_total_reorders 0\n",
1860 | "aisle_reorder_percentage 0\n",
1861 | "aisle_unique_users 0\n",
1862 | "aisle_0 0\n",
1863 | "aisle_1 0\n",
1864 | "total_reorders 0\n",
1865 | "mean_add_to_cart_order 0\n",
1866 | "aisle_3 0\n",
1867 | "last_ordered_in 0\n",
1868 | "product_id 0\n",
1869 | "total_product_orders_by_user 0\n",
1870 | "total_product_reorders_by_user 0\n",
1871 | "user_product_reorder_percentage 0\n",
1872 | "avg_add_to_cart_by_user 0\n",
1873 | "avg_days_since_last_bought 0\n",
1874 | "is_reorder_3 0\n",
1875 | "reordered 0\n",
1876 | "is_reorder_2 0\n",
1877 | "is_reorder_1 0\n",
1878 | " ..\n",
1879 | "total_orders_by_user 0\n",
1880 | "total_products_by_user 0\n",
1881 | "total_unique_product_by_user 0\n",
1882 | "reorder_propotion_by_user 0\n",
1883 | "std_dow 0\n",
1884 | "average_order_size 0\n",
1885 | "reorder_in_order 0\n",
1886 | "orders_3 0\n",
1887 | "orders_2 0\n",
1888 | "orders_1 0\n",
1889 | "reorder_3 0\n",
1890 | "avg_doh 0\n",
1891 | "avg_dow 0\n",
1892 | "aisle_5 0\n",
1893 | "department_total_reorders 0\n",
1894 | "aisle_6 0\n",
1895 | "aisle_7 0\n",
1896 | "aisle_8 0\n",
1897 | "department_mean_add_to_cart_order 0\n",
1898 | "department_std_add_to_cart_order 0\n",
1899 | "department_total_orders 0\n",
1900 | "department_reorder_percentage 0\n",
1901 | "department_5 0\n",
1902 | "department_unique_users 0\n",
1903 | "department_0 0\n",
1904 | "department_1 0\n",
1905 | "department_2 0\n",
1906 | "department_3 0\n",
1907 | "department_4 0\n",
1908 | "user_id 0\n",
1909 | "Length: 71, dtype: int64"
1910 | ]
1911 | },
1912 | "execution_count": 21,
1913 | "metadata": {},
1914 | "output_type": "execute_result"
1915 | }
1916 | ],
1917 | "source": [
1918 | "df.isnull().sum().sort_values(ascending = False)"
1919 | ]
1920 | },
1921 | {
1922 | "cell_type": "code",
1923 | "execution_count": 22,
1924 | "metadata": {},
1925 | "outputs": [],
1926 | "source": [
1927 | "df.to_pickle(root + 'Finaldata.pkl')"
1928 | ]
1929 | },
1930 | {
1931 | "cell_type": "code",
1932 | "execution_count": 23,
1933 | "metadata": {},
1934 | "outputs": [
1935 | {
1936 | "data": {
2105 | "text/plain": [
2106 | " user_id product_id total_product_orders_by_user \\\n",
2107 | "0 1 196 10.0 \n",
2108 | "1 1 10258 9.0 \n",
2109 | "2 1 10326 1.0 \n",
2110 | "3 1 12427 10.0 \n",
2111 | "4 1 13032 3.0 \n",
2112 | "\n",
2113 | " total_product_reorders_by_user user_product_reorder_percentage \\\n",
2114 | "0 9.0 0.900000 \n",
2115 | "1 8.0 0.888889 \n",
2116 | "2 0.0 0.000000 \n",
2117 | "3 9.0 0.900000 \n",
2118 | "4 2.0 0.666667 \n",
2119 | "\n",
2120 | " avg_add_to_cart_by_user avg_days_since_last_bought last_ordered_in \\\n",
2121 | "0 1.400000 17.600000 10.0 \n",
2122 | "1 3.333333 19.555555 10.0 \n",
2123 | "2 5.000000 28.000000 5.0 \n",
2124 | "3 3.300000 17.600000 10.0 \n",
2125 | "4 6.333333 21.666666 10.0 \n",
2126 | "\n",
2127 | " is_reorder_3 is_reorder_2 ... total_reorders_by_user \\\n",
2128 | "0 1.0 1.0 ... 41.0 \n",
2129 | "1 1.0 1.0 ... 41.0 \n",
2130 | "2 0.0 0.0 ... 41.0 \n",
2131 | "3 1.0 1.0 ... 41.0 \n",
2132 | "4 1.0 0.0 ... 41.0 \n",
2133 | "\n",
2134 | " reorder_propotion_by_user average_order_size reorder_in_order orders_3 \\\n",
2135 | "0 0.694915 5.9 0.705833 6 \n",
2136 | "1 0.694915 5.9 0.705833 6 \n",
2137 | "2 0.694915 5.9 0.705833 6 \n",
2138 | "3 0.694915 5.9 0.705833 6 \n",
2139 | "4 0.694915 5.9 0.705833 6 \n",
2140 | "\n",
2141 | " orders_2 orders_1 reorder_3 reorder_2 reorder_1 \n",
2142 | "0 6 9 0.666667 1.0 0.666667 \n",
2143 | "1 6 9 0.666667 1.0 0.666667 \n",
2144 | "2 6 9 0.666667 1.0 0.666667 \n",
2145 | "3 6 9 0.666667 1.0 0.666667 \n",
2146 | "4 6 9 0.666667 1.0 0.666667 \n",
2147 | "\n",
2148 | "[5 rows x 71 columns]"
2149 | ]
2150 | },
2151 | "execution_count": 23,
2152 | "metadata": {},
2153 | "output_type": "execute_result"
2154 | }
2155 | ],
2156 | "source": [
2157 | "df2 = pd.read_pickle(root +'Finaldata.pkl')\n",
2158 | "df2.head()"
2159 | ]
2160 | },
2161 | {
2162 | "cell_type": "markdown",
2163 | "metadata": {},
2164 | "source": [
2165 | "Yayyyyy. Ready for some cool modeling now :p"
2166 | ]
2167 | }
2168 | ],
2169 | "metadata": {
2170 | "kernelspec": {
2171 | "display_name": "Python 3",
2172 | "language": "python",
2173 | "name": "python3"
2174 | },
2175 | "language_info": {
2176 | "codemirror_mode": {
2177 | "name": "ipython",
2178 | "version": 3
2179 | },
2180 | "file_extension": ".py",
2181 | "mimetype": "text/x-python",
2182 | "name": "python",
2183 | "nbconvert_exporter": "python",
2184 | "pygments_lexer": "ipython3",
2185 | "version": "3.7.3"
2186 | }
2187 | },
2188 | "nbformat": 4,
2189 | "nbformat_minor": 2
2190 | }
2191 |
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 | MIT License
2 |
3 | Copyright (c) 2019 Arch Jignesh Desai
4 |
5 | Permission is hereby granted, free of charge, to any person obtaining a copy
6 | of this software and associated documentation files (the "Software"), to deal
7 | in the Software without restriction, including without limitation the rights
8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 |
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 |
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 |
--------------------------------------------------------------------------------
/Market Basket Analysis.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# Market Basket Analysis"
8 | ]
9 | },
10 | {
11 | "cell_type": "markdown",
12 | "metadata": {},
13 | "source": [
14 | "Market basket analysis scrutinizes the products customers tend to buy together, and uses the information to decide which products should be cross-sold or promoted together. The term arises from the shopping carts supermarket shoppers fill up during a shopping trip.\n",
15 | "\n",
16 | "Association Rule Mining is used when we want to find an association between different objects in a set, find frequent patterns in a transaction database, relational databases or any other information repository.\n",
17 | "\n",
18 | "The most common approach to finding these patterns is Market Basket Analysis, a key technique used by large retailers such as Amazon and Flipkart to analyze customer buying habits by finding associations between the different items that customers place in their “shopping baskets”. The discovery of these associations can help retailers develop marketing strategies by gaining insight into which items are frequently purchased together by customers. The strategies may include:\n",
19 | "\n",
20 | "- Changing the store layout according to trends\n",
21 | "- Customer behavior analysis\n",
22 | "- Catalog Design\n",
23 | "- Cross marketing on online stores\n",
24 | "- Customized emails with add-on sales, etc."
25 | ]
26 | },
27 | {
28 | "cell_type": "markdown",
29 | "metadata": {},
30 | "source": [
31 | "### Metrics\n",
32 | "\n",
33 | "- **Support** : The baseline popularity of an item. In mathematical terms, the support of item A is the ratio of transactions involving A to the total number of transactions.\n",
34 | "\n",
35 | "\n",
36 | "- **Confidence** : Likelihood that a customer who bought A also bought B. It is the ratio of the number of transactions involving both A and B to the number of transactions involving A.\n",
37 | " - Confidence(A => B) = Support(A, B)/Support(A)\n",
38 | "\n",
39 | "\n",
40 | "- **Lift** : How much more likely B is to be bought when A is bought, compared with how often B is bought overall.\n",
41 | " \n",
42 | " - Lift(A => B) = Confidence(A => B)/Support(B)\n",
43 | " \n",
44 | " - Lift (A => B) = 1 means that there is no correlation within the itemset.\n",
45 | " - Lift (A => B) > 1 means that there is a positive correlation within the itemset, i.e., products A and B are more likely to be bought together.\n",
46 | " - Lift (A => B) < 1 means that there is a negative correlation within the itemset, i.e., products A and B are unlikely to be bought together."
47 | ]
48 | },
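The three metrics defined above can be checked by hand on a tiny example. The following is a minimal sketch in plain Python; the toy `transactions` list and the helper names `support`, `confidence`, and `lift` are assumptions made for illustration and are not part of the notebook or of mlxtend.

```python
# Toy transactions (hypothetical data, not taken from the Instacart files).
transactions = [
    {"banana", "milk"},
    {"banana", "bread"},
    {"banana", "milk", "bread"},
    {"milk"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Confidence(A => B) = Support(A, B) / Support(A)."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)


def lift(antecedent, consequent, transactions):
    """Lift(A => B) = Confidence(A => B) / Support(B)."""
    return confidence(antecedent, consequent, transactions) / support(
        consequent, transactions
    )


banana_support = support({"banana"}, transactions)
banana_milk_confidence = confidence({"banana"}, {"milk"}, transactions)
banana_milk_lift = lift({"banana"}, {"milk"}, transactions)
print(banana_support, banana_milk_confidence, banana_milk_lift)
```

In this toy data the lift of banana => milk comes out below 1, so the two items appear together slightly less often than independence would predict.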
49 | {
50 | "cell_type": "markdown",
51 | "metadata": {},
52 | "source": [
53 | "**Apriori Algorithm:** The Apriori algorithm relies on the property that any subset of a frequent itemset must itself be frequent. It's a standard algorithm behind Market Basket Analysis. Say, a transaction containing {Grapes, Apple, Mango} also contains {Grapes, Mango}. So, according to the principle of Apriori, if {Grapes, Apple, Mango} is frequent, then {Grapes, Mango} must also be frequent."
54 | ]
55 | },
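To make the subset principle concrete, here is a minimal level-wise sketch of the Apriori idea in plain Python. It is an illustration under stated assumptions, not the mlxtend implementation that the notebook imports: the function name `apriori_sketch` and the toy transactions are hypothetical, and no attempt is made at efficiency.

```python
from itertools import combinations


def apriori_sketch(transactions, min_support):
    """Return {frozenset(itemset): support} for all frequent itemsets.

    Candidates of size k are built only from frequent itemsets of size k-1,
    and a candidate is dropped if any of its (k-1)-subsets is not frequent.
    """
    n = len(transactions)
    items = sorted({item for t in transactions for item in t})
    current = [frozenset([item]) for item in items]
    frequent = {}
    k = 1
    while current:
        # Count support for the current candidates and keep the frequent ones.
        level = {}
        for candidate in current:
            s = sum(candidate <= t for t in transactions) / n
            if s >= min_support:
                level[candidate] = s
        frequent.update(level)
        # Join frequent k-itemsets into (k+1)-item candidates.
        k += 1
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune candidates that have an infrequent (k-1)-subset.
        current = [
            c
            for c in candidates
            if all(frozenset(sub) in level for sub in combinations(c, k - 1))
        ]
    return frequent


# Hypothetical toy baskets reusing the fruit names from the text above.
baskets = [
    {"Grapes", "Apple", "Mango"},
    {"Grapes", "Mango"},
    {"Apple", "Mango"},
    {"Grapes", "Apple", "Mango"},
]
frequent_itemsets = apriori_sketch(baskets, min_support=0.5)
for itemset, s in sorted(frequent_itemsets.items(), key=lambda kv: len(kv[0])):
    print(sorted(itemset), s)
```

Because {Grapes, Apple, Mango} is frequent here, every one of its subsets also shows up in the result, which is exactly the property that lets the algorithm prune candidates early.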
56 | {
57 | "cell_type": "code",
58 | "execution_count": 1,
59 | "metadata": {},
60 | "outputs": [],
61 | "source": [
62 | "import numpy as np\n",
63 | "import pandas as pd\n",
64 | "from mlxtend.frequent_patterns import apriori\n",
65 | "from mlxtend.frequent_patterns import association_rules\n",
66 | "\n",
67 | "root = 'C:/Data/instacart-market-basket-analysis/'"
68 | ]
69 | },
70 | {
71 | "cell_type": "markdown",
72 | "metadata": {},
73 | "source": [
74 | "### Data"
75 | ]
76 | },
77 | {
78 | "cell_type": "code",
79 | "execution_count": 2,
80 | "metadata": {},
81 | "outputs": [],
82 | "source": [
83 | "orders = pd.read_csv(root + 'orders.csv')\n",
84 | "order_products_prior = pd.read_csv(root + 'order_products__prior.csv')\n",
85 | "order_products_train = pd.read_csv(root + 'order_products__train.csv')\n",
86 | "products = pd.read_csv(root + 'products.csv')"
87 | ]
88 | },
89 | {
90 | "cell_type": "code",
91 | "execution_count": 3,
92 | "metadata": {},
93 | "outputs": [
94 | {
95 | "data": {
96 | "text/plain": [
97 | "(33819106, 4)"
98 | ]
99 | },
100 | "execution_count": 3,
101 | "metadata": {},
102 | "output_type": "execute_result"
103 | }
104 | ],
105 | "source": [
106 | "order_products = pd.concat([order_products_prior, order_products_train])\n",
107 | "order_products.shape"
108 | ]
109 | },
110 | {
111 | "cell_type": "code",
112 | "execution_count": 4,
113 | "metadata": {},
114 | "outputs": [
115 | {
116 | "data": {
182 | "text/plain": [
183 | " order_id product_id add_to_cart_order reordered\n",
184 | "0 2 33120 1 1\n",
185 | "1 2 28985 2 1\n",
186 | "2 2 9327 3 0\n",
187 | "3 2 45918 4 1\n",
188 | "4 2 30035 5 0"
189 | ]
190 | },
191 | "execution_count": 4,
192 | "metadata": {},
193 | "output_type": "execute_result"
194 | }
195 | ],
196 | "source": [
197 | "order_products.head()"
198 | ]
199 | },
200 | {
201 | "cell_type": "code",
202 | "execution_count": 5,
203 | "metadata": {},
204 | "outputs": [
205 | {
206 | "data": {
207 | "text/plain": [
208 | "49685"
209 | ]
210 | },
211 | "execution_count": 5,
212 | "metadata": {},
213 | "output_type": "execute_result"
214 | }
215 | ],
216 | "source": [
217 | "order_products.product_id.nunique()"
218 | ]
219 | },
220 | {
221 | "cell_type": "markdown",
222 | "metadata": {},
223 | "source": [
224 | "Out of 49,685 unique products, we keep only the 100 most frequent."
225 | ]
226 | },
227 | {
228 | "cell_type": "code",
229 | "execution_count": 6,
230 | "metadata": {},
231 | "outputs": [
232 | {
233 | "data": {
345 | "text/plain": [
346 | " product_id frequency product_name aisle_id department_id\n",
347 | "0 24852 491291 Banana 24 4\n",
348 | "1 13176 394930 Bag of Organic Bananas 24 4\n",
349 | "2 21137 275577 Organic Strawberries 24 4\n",
350 | "3 21903 251705 Organic Baby Spinach 123 4\n",
351 | "4 47209 220877 Organic Hass Avocado 24 4\n",
352 | "5 47766 184224 Organic Avocado 24 4\n",
353 | "6 47626 160792 Large Lemon 24 4\n",
354 | "7 16797 149445 Strawberries 24 4\n",
355 | "8 26209 146660 Limes 24 4\n",
356 | "9 27845 142813 Organic Whole Milk 84 16"
357 | ]
358 | },
359 | "execution_count": 6,
360 | "metadata": {},
361 | "output_type": "execute_result"
362 | }
363 | ],
364 | "source": [
365 | "product_counts = order_products.groupby('product_id')['order_id'].count().reset_index().rename(columns = {'order_id':'frequency'})\n",
366 | "product_counts = product_counts.sort_values('frequency', ascending=False)[0:100].reset_index(drop = True)\n",
367 | "product_counts = product_counts.merge(products, on = 'product_id', how = 'left')\n",
368 | "product_counts.head(10)"
369 | ]
370 | },
371 | {
372 | "cell_type": "markdown",
373 | "metadata": {},
374 | "source": [
375 | "Keep only the rows of the order_products dataframe that involve the 100 most frequent products."
376 | ]
377 | },
378 | {
379 | "cell_type": "code",
380 | "execution_count": 7,
381 | "metadata": {},
382 | "outputs": [
383 | {
384 | "data": {
385 | "text/plain": [
386 | "[13176, 21137, 21903, 47209, 47766, 47626, 16797, 26209, 27845]"
387 | ]
388 | },
389 | "execution_count": 7,
390 | "metadata": {},
391 | "output_type": "execute_result"
392 | }
393 | ],
394 | "source": [
395 | "freq_products = list(product_counts.product_id)\n",
396 | "freq_products[1:10]"
397 | ]
398 | },
399 | {
400 | "cell_type": "code",
401 | "execution_count": 8,
402 | "metadata": {},
403 | "outputs": [
404 | {
405 | "data": {
406 | "text/plain": [
407 | "100"
408 | ]
409 | },
410 | "execution_count": 8,
411 | "metadata": {},
412 | "output_type": "execute_result"
413 | }
414 | ],
415 | "source": [
416 | "len(freq_products)"
417 | ]
418 | },
419 | {
420 | "cell_type": "code",
421 | "execution_count": 9,
422 | "metadata": {},
423 | "outputs": [
424 | {
425 | "data": {
426 | "text/plain": [
427 | "(7795471, 4)"
428 | ]
429 | },
430 | "execution_count": 9,
431 | "metadata": {},
432 | "output_type": "execute_result"
433 | }
434 | ],
435 | "source": [
436 | "order_products = order_products[order_products.product_id.isin(freq_products)]\n",
437 | "order_products.shape"
438 | ]
439 | },
440 | {
441 | "cell_type": "code",
442 | "execution_count": 10,
443 | "metadata": {},
444 | "outputs": [
445 | {
446 | "data": {
447 | "text/plain": [
448 | "2444982"
449 | ]
450 | },
451 | "execution_count": 10,
452 | "metadata": {},
453 | "output_type": "execute_result"
454 | }
455 | ],
456 | "source": [
457 | "order_products.order_id.nunique()"
458 | ]
459 | },
460 | {
461 | "cell_type": "code",
462 | "execution_count": 13,
463 | "metadata": {},
464 | "outputs": [
465 | {
466 | "data": {
550 | "text/plain": [
551 | " order_id product_id add_to_cart_order reordered product_name \\\n",
552 | "0 2 28985 2 1 Michigan Organic Kale \n",
553 | "1 2 17794 6 1 Carrots \n",
554 | "2 3 24838 2 1 Unsweetened Almondmilk \n",
555 | "3 3 21903 4 1 Organic Baby Spinach \n",
556 | "4 3 46667 6 1 Organic Ginger Root \n",
557 | "\n",
558 | " aisle_id department_id \n",
559 | "0 83 4 \n",
560 | "1 83 4 \n",
561 | "2 91 16 \n",
562 | "3 123 4 \n",
563 | "4 83 4 "
564 | ]
565 | },
566 | "execution_count": 13,
567 | "metadata": {},
568 | "output_type": "execute_result"
569 | }
570 | ],
571 | "source": [
572 | "order_products = order_products.merge(products, on = 'product_id', how='left')\n",
573 | "order_products.head()"
574 | ]
575 | },
576 | {
577 | "cell_type": "markdown",
578 | "metadata": {},
579 | "source": [
580 | "Reshape the data into an order-by-product matrix (one row per order, one column per product) to feed into the Apriori algorithm."
581 | ]
582 | },
583 | {
584 | "cell_type": "code",
585 | "execution_count": 14,
586 | "metadata": {},
587 | "outputs": [
588 | {
589 | "data": {
782 | "text/plain": [
783 | "product_name 100% Raw Coconut Water 100% Whole Wheat Bread \\\n",
784 | "order_id \n",
785 | "1 0.0 0.0 \n",
786 | "2 0.0 0.0 \n",
787 | "3 0.0 0.0 \n",
788 | "5 0.0 0.0 \n",
789 | "9 0.0 0.0 \n",
790 | "\n",
791 | "product_name 2% Reduced Fat Milk Apple Honeycrisp Organic Asparagus \\\n",
792 | "order_id \n",
793 | "1 0.0 0.0 0.0 \n",
794 | "2 0.0 0.0 0.0 \n",
795 | "3 0.0 0.0 0.0 \n",
796 | "5 1.0 0.0 0.0 \n",
797 | "9 0.0 0.0 0.0 \n",
798 | "\n",
799 | "product_name Bag of Organic Bananas Banana Bartlett Pears Blueberries \\\n",
800 | "order_id \n",
801 | "1 1.0 0.0 0.0 0.0 \n",
802 | "2 0.0 0.0 0.0 0.0 \n",
803 | "3 0.0 0.0 0.0 0.0 \n",
804 | "5 1.0 0.0 0.0 0.0 \n",
805 | "9 0.0 0.0 0.0 0.0 \n",
806 | "\n",
807 | "product_name Boneless Skinless Chicken Breasts ... \\\n",
808 | "order_id ... \n",
809 | "1 0.0 ... \n",
810 | "2 0.0 ... \n",
811 | "3 0.0 ... \n",
812 | "5 0.0 ... \n",
813 | "9 0.0 ... \n",
814 | "\n",
815 | "product_name Sparkling Natural Mineral Water Sparkling Water Grapefruit \\\n",
816 | "order_id \n",
817 | "1 0.0 0.0 \n",
818 | "2 0.0 0.0 \n",
819 | "3 0.0 0.0 \n",
820 | "5 0.0 0.0 \n",
821 | "9 0.0 0.0 \n",
822 | "\n",
823 | "product_name Spring Water Strawberries Uncured Genoa Salami \\\n",
824 | "order_id \n",
825 | "1 0.0 0.0 0.0 \n",
826 | "2 0.0 0.0 0.0 \n",
827 | "3 0.0 0.0 0.0 \n",
828 | "5 0.0 0.0 0.0 \n",
829 | "9 0.0 0.0 0.0 \n",
830 | "\n",
831 | "product_name Unsalted Butter Unsweetened Almondmilk \\\n",
832 | "order_id \n",
833 | "1 0.0 0.0 \n",
834 | "2 0.0 0.0 \n",
835 | "3 0.0 1.0 \n",
836 | "5 0.0 0.0 \n",
837 | "9 0.0 0.0 \n",
838 | "\n",
839 | "product_name Unsweetened Original Almond Breeze Almond Milk Whole Milk \\\n",
840 | "order_id \n",
841 | "1 0.0 0.0 \n",
842 | "2 0.0 0.0 \n",
843 | "3 0.0 0.0 \n",
844 | "5 0.0 0.0 \n",
845 | "9 0.0 0.0 \n",
846 | "\n",
847 | "product_name Yellow Onions \n",
848 | "order_id \n",
849 | "1 0.0 \n",
850 | "2 0.0 \n",
851 | "3 0.0 \n",
852 | "5 0.0 \n",
853 | "9 0.0 \n",
854 | "\n",
855 | "[5 rows x 100 columns]"
856 | ]
857 | },
858 | "execution_count": 14,
859 | "metadata": {},
860 | "output_type": "execute_result"
861 | }
862 | ],
863 | "source": [
864 | "basket = order_products.groupby(['order_id', 'product_name'])['reordered'].count().unstack().reset_index().fillna(0).set_index('order_id')\n",
865 | "basket.head()"
866 | ]
867 | },
868 | {
869 | "cell_type": "code",
870 | "execution_count": 15,
871 | "metadata": {},
872 | "outputs": [],
873 | "source": [
874 | "del product_counts, products, order_products, order_products_prior, order_products_train"
875 | ]
876 | },
877 | {
878 | "cell_type": "markdown",
879 | "metadata": {},
880 | "source": [
881 | "Encode the counts as binary flags: 1 if the product appears in the order, 0 otherwise."
882 | ]
883 | },
884 | {
885 | "cell_type": "code",
886 | "execution_count": 16,
887 | "metadata": {},
888 | "outputs": [
889 | {
890 | "data": {
1083 | "text/plain": [
1084 | "product_name 100% Raw Coconut Water 100% Whole Wheat Bread \\\n",
1085 | "order_id \n",
1086 | "1 0 0 \n",
1087 | "2 0 0 \n",
1088 | "3 0 0 \n",
1089 | "5 0 0 \n",
1090 | "9 0 0 \n",
1091 | "\n",
1092 | "product_name 2% Reduced Fat Milk Apple Honeycrisp Organic Asparagus \\\n",
1093 | "order_id \n",
1094 | "1 0 0 0 \n",
1095 | "2 0 0 0 \n",
1096 | "3 0 0 0 \n",
1097 | "5 1 0 0 \n",
1098 | "9 0 0 0 \n",
1099 | "\n",
1100 | "product_name Bag of Organic Bananas Banana Bartlett Pears Blueberries \\\n",
1101 | "order_id \n",
1102 | "1 1 0 0 0 \n",
1103 | "2 0 0 0 0 \n",
1104 | "3 0 0 0 0 \n",
1105 | "5 1 0 0 0 \n",
1106 | "9 0 0 0 0 \n",
1107 | "\n",
1108 | "product_name Boneless Skinless Chicken Breasts ... \\\n",
1109 | "order_id ... \n",
1110 | "1 0 ... \n",
1111 | "2 0 ... \n",
1112 | "3 0 ... \n",
1113 | "5 0 ... \n",
1114 | "9 0 ... \n",
1115 | "\n",
1116 | "product_name Sparkling Natural Mineral Water Sparkling Water Grapefruit \\\n",
1117 | "order_id \n",
1118 | "1 0 0 \n",
1119 | "2 0 0 \n",
1120 | "3 0 0 \n",
1121 | "5 0 0 \n",
1122 | "9 0 0 \n",
1123 | "\n",
1124 | "product_name Spring Water Strawberries Uncured Genoa Salami \\\n",
1125 | "order_id \n",
1126 | "1 0 0 0 \n",
1127 | "2 0 0 0 \n",
1128 | "3 0 0 0 \n",
1129 | "5 0 0 0 \n",
1130 | "9 0 0 0 \n",
1131 | "\n",
1132 | "product_name Unsalted Butter Unsweetened Almondmilk \\\n",
1133 | "order_id \n",
1134 | "1 0 0 \n",
1135 | "2 0 0 \n",
1136 | "3 0 1 \n",
1137 | "5 0 0 \n",
1138 | "9 0 0 \n",
1139 | "\n",
1140 | "product_name Unsweetened Original Almond Breeze Almond Milk Whole Milk \\\n",
1141 | "order_id \n",
1142 | "1 0 0 \n",
1143 | "2 0 0 \n",
1144 | "3 0 0 \n",
1145 | "5 0 0 \n",
1146 | "9 0 0 \n",
1147 | "\n",
1148 | "product_name Yellow Onions \n",
1149 | "order_id \n",
1150 | "1 0 \n",
1151 | "2 0 \n",
1152 | "3 0 \n",
1153 | "5 0 \n",
1154 | "9 0 \n",
1155 | "\n",
1156 | "[5 rows x 100 columns]"
1157 | ]
1158 | },
1159 | "execution_count": 16,
1160 | "metadata": {},
1161 | "output_type": "execute_result"
1162 | }
1163 | ],
1164 | "source": [
1165 | "def encode_units(x):\n",
1166 | "    # Flag presence: 1 if the product appears in the order, otherwise 0.\n",
1167 | "    if x >= 1:\n",
1168 | "        return 1\n",
1169 | "    return 0\n",
1170 | " \n",
1171 | "basket = basket.applymap(encode_units)\n",
1172 | "basket.head()"
1173 | ]
1174 | },
1175 | {
1176 | "cell_type": "code",
1177 | "execution_count": 17,
1178 | "metadata": {},
1179 | "outputs": [
1180 | {
1181 | "data": {
1182 | "text/plain": [
1183 | "244498200"
1184 | ]
1185 | },
1186 | "execution_count": 17,
1187 | "metadata": {},
1188 | "output_type": "execute_result"
1189 | }
1190 | ],
1191 | "source": [
1192 | "basket.size"
1193 | ]
1194 | },
1195 | {
1196 | "cell_type": "code",
1197 | "execution_count": 18,
1198 | "metadata": {},
1199 | "outputs": [
1200 | {
1201 | "data": {
1202 | "text/plain": [
1203 | "(2444982, 100)"
1204 | ]
1205 | },
1206 | "execution_count": 18,
1207 | "metadata": {},
1208 | "output_type": "execute_result"
1209 | }
1210 | ],
1211 | "source": [
1212 | "basket.shape"
1213 | ]
1214 | },
1215 | {
1216 | "cell_type": "markdown",
1217 | "metadata": {},
1218 | "source": [
1219 | "Generate frequent itemsets with Apriori, then derive association rules from them."
1220 | ]
1221 | },
1222 | {
1223 | "cell_type": "code",
1224 | "execution_count": 19,
1225 | "metadata": {
1226 | "scrolled": true
1227 | },
1228 | "outputs": [
1229 | {
1230 | "data": {
1284 | "text/plain": [
1285 | " support itemsets\n",
1286 | "0 0.016062 (100% Raw Coconut Water)\n",
1287 | "1 0.025814 (100% Whole Wheat Bread)\n",
1288 | "2 0.015800 (2% Reduced Fat Milk)\n",
1289 | "3 0.035694 (Apple Honeycrisp Organic)\n",
1290 | "4 0.029101 (Asparagus)"
1291 | ]
1292 | },
1293 | "execution_count": 19,
1294 | "metadata": {},
1295 | "output_type": "execute_result"
1296 | }
1297 | ],
1298 | "source": [
1299 | "frequent_items = apriori(basket, min_support=0.01, use_colnames=True, low_memory=True)\n",
1300 | "frequent_items.head()"
1301 | ]
1302 | },
1303 | {
1304 | "cell_type": "code",
1305 | "execution_count": 20,
1306 | "metadata": {},
1307 | "outputs": [
1308 | {
1309 | "data": {
1363 | "text/plain": [
1364 | " support itemsets\n",
1365 | "124 0.010235 (Organic Blueberries, Organic Strawberries)\n",
1366 | "125 0.010966 (Organic Raspberries, Organic Hass Avocado)\n",
1367 | "126 0.017314 (Organic Strawberries, Organic Hass Avocado)\n",
1368 | "127 0.014533 (Organic Strawberries, Organic Raspberries)\n",
1369 | "128 0.010130 (Organic Strawberries, Organic Whole Milk)"
1370 | ]
1371 | },
1372 | "execution_count": 20,
1373 | "metadata": {},
1374 | "output_type": "execute_result"
1375 | }
1376 | ],
1377 | "source": [
1378 | "frequent_items.tail()"
1379 | ]
1380 | },
1381 | {
1382 | "cell_type": "code",
1383 | "execution_count": 21,
1384 | "metadata": {},
1385 | "outputs": [
1386 | {
1387 | "data": {
1388 | "text/plain": [
1389 | "(129, 2)"
1390 | ]
1391 | },
1392 | "execution_count": 21,
1393 | "metadata": {},
1394 | "output_type": "execute_result"
1395 | }
1396 | ],
1397 | "source": [
1398 | "frequent_items.shape"
1399 | ]
1400 | },
1401 | {
1402 | "cell_type": "code",
1403 | "execution_count": 22,
1404 | "metadata": {},
1405 | "outputs": [
1406 | {
1407 | "data": {
1408 | "text/html": [
1409 | "\n",
1410 | "\n",
1423 | "
\n",
1424 | " \n",
1425 | " \n",
1426 | " | \n",
1427 | " antecedents | \n",
1428 | " consequents | \n",
1429 | " antecedent support | \n",
1430 | " consequent support | \n",
1431 | " support | \n",
1432 | " confidence | \n",
1433 | " lift | \n",
1434 | " leverage | \n",
1435 | " conviction | \n",
1436 | "
\n",
1437 | " \n",
1438 | " \n",
1439 | " \n",
1440 | " 35 | \n",
1441 | " (Limes) | \n",
1442 | " (Large Lemon) | \n",
1443 | " 0.059984 | \n",
1444 | " 0.065764 | \n",
1445 | " 0.011860 | \n",
1446 | " 0.197723 | \n",
1447 | " 3.006544 | \n",
1448 | " 0.007915 | \n",
1449 | " 1.164480 | \n",
1450 | "
\n",
1451 | " \n",
1452 | " 34 | \n",
1453 | " (Large Lemon) | \n",
1454 | " (Limes) | \n",
1455 | " 0.065764 | \n",
1456 | " 0.059984 | \n",
1457 | " 0.011860 | \n",
1458 | " 0.180345 | \n",
1459 | " 3.006544 | \n",
1460 | " 0.007915 | \n",
1461 | " 1.146843 | \n",
1462 | "
\n",
1463 | " \n",
1464 | " 52 | \n",
1465 | " (Organic Strawberries) | \n",
1466 | " (Organic Raspberries) | \n",
1467 | " 0.112711 | \n",
1468 | " 0.058325 | \n",
1469 | " 0.014533 | \n",
1470 | " 0.128940 | \n",
1471 | " 2.210731 | \n",
1472 | " 0.007959 | \n",
1473 | " 1.081069 | \n",
1474 | "
\n",
1475 | " \n",
1476 | " 53 | \n",
1477 | " (Organic Raspberries) | \n",
1478 | " (Organic Strawberries) | \n",
1479 | " 0.058325 | \n",
1480 | " 0.112711 | \n",
1481 | " 0.014533 | \n",
1482 | " 0.249174 | \n",
1483 | " 2.210731 | \n",
1484 | " 0.007959 | \n",
1485 | " 1.181751 | \n",
1486 | "
\n",
1487 | " \n",
1488 | " 37 | \n",
1489 | " (Organic Avocado) | \n",
1490 | " (Large Lemon) | \n",
1491 | " 0.075348 | \n",
1492 | " 0.065764 | \n",
1493 | " 0.010538 | \n",
1494 | " 0.139862 | \n",
1495 | " 2.126728 | \n",
1496 | " 0.005583 | \n",
1497 | " 1.086147 | \n",
1498 | "
\n",
1499 | " \n",
1500 | " 36 | \n",
1501 | " (Large Lemon) | \n",
1502 | " (Organic Avocado) | \n",
1503 | " 0.065764 | \n",
1504 | " 0.075348 | \n",
1505 | " 0.010538 | \n",
1506 | " 0.160244 | \n",
1507 | " 2.126728 | \n",
1508 | " 0.005583 | \n",
1509 | " 1.101097 | \n",
1510 | "
\n",
1511 | " \n",
1512 | " 47 | \n",
1513 | " (Organic Strawberries) | \n",
1514 | " (Organic Blueberries) | \n",
1515 | " 0.112711 | \n",
1516 | " 0.042956 | \n",
1517 | " 0.010235 | \n",
1518 | " 0.090809 | \n",
1519 | " 2.114024 | \n",
1520 | " 0.005394 | \n",
1521 | " 1.052633 | \n",
1522 | "
\n",
1523 | " \n",
1524 | " 46 | \n",
1525 | " (Organic Blueberries) | \n",
1526 | " (Organic Strawberries) | \n",
1527 | " 0.042956 | \n",
1528 | " 0.112711 | \n",
1529 | " 0.010235 | \n",
1530 | " 0.238274 | \n",
1531 | " 2.114024 | \n",
1532 | " 0.005394 | \n",
1533 | " 1.164840 | \n",
1534 | "
\n",
1535 | " \n",
1536 | " 49 | \n",
1537 | " (Organic Hass Avocado) | \n",
1538 | " (Organic Raspberries) | \n",
1539 | " 0.090339 | \n",
1540 | " 0.058325 | \n",
1541 | " 0.010966 | \n",
1542 | " 0.121389 | \n",
1543 | " 2.081257 | \n",
1544 | " 0.005697 | \n",
1545 | " 1.071777 | \n",
1546 | "
\n",
1547 | " \n",
1548 | " 48 | \n",
1549 | " (Organic Raspberries) | \n",
1550 | " (Organic Hass Avocado) | \n",
1551 | " 0.058325 | \n",
1552 | " 0.090339 | \n",
1553 | " 0.010966 | \n",
1554 | " 0.188018 | \n",
1555 | " 2.081257 | \n",
1556 | " 0.005697 | \n",
1557 | " 1.120298 | \n",
1558 | "
\n",
1559 | " \n",
1560 | " 24 | \n",
1561 | " (Banana) | \n",
1562 | " (Organic Fuji Apple) | \n",
1563 | " 0.200938 | \n",
1564 | " 0.037992 | \n",
1565 | " 0.014378 | \n",
1566 | " 0.071552 | \n",
1567 | " 1.883367 | \n",
1568 | " 0.006744 | \n",
1569 | " 1.036147 | \n",
1570 | "
\n",
1571 | " \n",
1572 | " 25 | \n",
1573 | " (Organic Fuji Apple) | \n",
1574 | " (Banana) | \n",
1575 | " 0.037992 | \n",
1576 | " 0.200938 | \n",
1577 | " 0.014378 | \n",
1578 | " 0.378441 | \n",
1579 | " 1.883367 | \n",
1580 | " 0.006744 | \n",
1581 | " 1.285576 | \n",
1582 | "
\n",
1583 | " \n",
1584 | " 5 | \n",
1585 | " (Bag of Organic Bananas) | \n",
1586 | " (Organic Raspberries) | \n",
1587 | " 0.161527 | \n",
1588 | " 0.058325 | \n",
1589 | " 0.017294 | \n",
1590 | " 0.107065 | \n",
1591 | " 1.835662 | \n",
1592 | " 0.007873 | \n",
1593 | " 1.054584 | \n",
1594 | "
\n",
1595 | " \n",
1596 | " 4 | \n",
1597 | " (Organic Raspberries) | \n",
1598 | " (Bag of Organic Bananas) | \n",
1599 | " 0.058325 | \n",
1600 | " 0.161527 | \n",
1601 | " 0.017294 | \n",
1602 | " 0.296508 | \n",
1603 | " 1.835662 | \n",
1604 | " 0.007873 | \n",
1605 | " 1.191874 | \n",
1606 | "
\n",
1607 | " \n",
1608 | " 3 | \n",
1609 | " (Bag of Organic Bananas) | \n",
1610 | " (Organic Hass Avocado) | \n",
1611 | " 0.161527 | \n",
1612 | " 0.090339 | \n",
1613 | " 0.026487 | \n",
1614 | " 0.163981 | \n",
1615 | " 1.815175 | \n",
1616 | " 0.011895 | \n",
1617 | " 1.088087 | \n",
1618 | "
\n",
1619 | " \n",
1620 | " 2 | \n",
1621 | " (Organic Hass Avocado) | \n",
1622 | " (Bag of Organic Bananas) | \n",
1623 | " 0.090339 | \n",
1624 | " 0.161527 | \n",
1625 | " 0.026487 | \n",
1626 | " 0.293199 | \n",
1627 | " 1.815175 | \n",
1628 | " 0.011895 | \n",
1629 | " 1.186294 | \n",
1630 | "
\n",
1631 | " \n",
1632 | " 14 | \n",
1633 | " (Honeycrisp Apple) | \n",
1634 | " (Banana) | \n",
1635 | " 0.034078 | \n",
1636 | " 0.200938 | \n",
1637 | " 0.012122 | \n",
1638 | " 0.355725 | \n",
1639 | " 1.770317 | \n",
1640 | " 0.005275 | \n",
1641 | " 1.240249 | \n",
1642 | "
\n",
1643 | " \n",
1644 | " 15 | \n",
1645 | " (Banana) | \n",
1646 | " (Honeycrisp Apple) | \n",
1647 | " 0.200938 | \n",
1648 | " 0.034078 | \n",
1649 | " 0.012122 | \n",
1650 | " 0.060329 | \n",
1651 | " 1.770317 | \n",
1652 | " 0.005275 | \n",
1653 | " 1.027936 | \n",
1654 | "
\n",
1655 | " \n",
1656 | " 39 | \n",
1657 | " (Organic Avocado) | \n",
1658 | " (Organic Baby Spinach) | \n",
1659 | " 0.075348 | \n",
1660 | " 0.102948 | \n",
1661 | " 0.013207 | \n",
1662 | " 0.175281 | \n",
1663 | " 1.702625 | \n",
1664 | " 0.005450 | \n",
1665 | " 1.087707 | \n",
1666 | "
\n",
1667 | " \n",
1668 | " 38 | \n",
1669 | " (Organic Baby Spinach) | \n",
1670 | " (Organic Avocado) | \n",
1671 | " 0.102948 | \n",
1672 | " 0.075348 | \n",
1673 | " 0.013207 | \n",
1674 | " 0.128289 | \n",
1675 | " 1.702625 | \n",
1676 | " 0.005450 | \n",
1677 | " 1.060733 | \n",
1678 | "
\n",
1679 | " \n",
1680 | " 50 | \n",
1681 | " (Organic Strawberries) | \n",
1682 | " (Organic Hass Avocado) | \n",
1683 | " 0.112711 | \n",
1684 | " 0.090339 | \n",
1685 | " 0.017314 | \n",
1686 | " 0.153616 | \n",
1687 | " 1.700440 | \n",
1688 | " 0.007132 | \n",
1689 | " 1.074762 | \n",
1690 | "
\n",
1691 | " \n",
1692 | " 51 | \n",
1693 | " (Organic Hass Avocado) | \n",
1694 | " (Organic Strawberries) | \n",
1695 | " 0.090339 | \n",
1696 | " 0.112711 | \n",
1697 | " 0.017314 | \n",
1698 | " 0.191659 | \n",
1699 | " 1.700440 | \n",
1700 | " 0.007132 | \n",
1701 | " 1.097666 | \n",
1702 | "
\n",
1703 | " \n",
1704 | " 12 | \n",
1705 | " (Cucumber Kirby) | \n",
1706 | " (Banana) | \n",
1707 | " 0.040789 | \n",
1708 | " 0.200938 | \n",
1709 | " 0.013432 | \n",
1710 | " 0.329296 | \n",
1711 | " 1.638788 | \n",
1712 | " 0.005236 | \n",
1713 | " 1.191377 | \n",
1714 | "
\n",
1715 | " \n",
1716 | " 13 | \n",
1717 | " (Banana) | \n",
1718 | " (Cucumber Kirby) | \n",
1719 | " 0.200938 | \n",
1720 | " 0.040789 | \n",
1721 | " 0.013432 | \n",
1722 | " 0.066844 | \n",
1723 | " 1.638788 | \n",
1724 | " 0.005236 | \n",
1725 | " 1.027922 | \n",
1726 | "
\n",
1727 | " \n",
1728 | " 43 | \n",
1729 | " (Organic Hass Avocado) | \n",
1730 | " (Organic Baby Spinach) | \n",
1731 | " 0.090339 | \n",
1732 | " 0.102948 | \n",
1733 | " 0.014787 | \n",
1734 | " 0.163679 | \n",
1735 | " 1.589929 | \n",
1736 | " 0.005486 | \n",
1737 | " 1.072618 | \n",
1738 | "
\n",
1739 | " \n",
1740 | " 42 | \n",
1741 | " (Organic Baby Spinach) | \n",
1742 | " (Organic Hass Avocado) | \n",
1743 | " 0.102948 | \n",
1744 | " 0.090339 | \n",
1745 | " 0.014787 | \n",
1746 | " 0.143632 | \n",
1747 | " 1.589929 | \n",
1748 | " 0.005486 | \n",
1749 | " 1.062232 | \n",
1750 | "
\n",
1751 | " \n",
1752 | " 55 | \n",
1753 | " (Organic Whole Milk) | \n",
1754 | " (Organic Strawberries) | \n",
1755 | " 0.058411 | \n",
1756 | " 0.112711 | \n",
1757 | " 0.010130 | \n",
1758 | " 0.173423 | \n",
1759 | " 1.538645 | \n",
1760 | " 0.003546 | \n",
1761 | " 1.073449 | \n",
1762 | "
\n",
1763 | " \n",
1764 | " 54 | \n",
1765 | " (Organic Strawberries) | \n",
1766 | " (Organic Whole Milk) | \n",
1767 | " 0.112711 | \n",
1768 | " 0.058411 | \n",
1769 | " 0.010130 | \n",
1770 | " 0.089873 | \n",
1771 | " 1.538645 | \n",
1772 | " 0.003546 | \n",
1773 | " 1.034569 | \n",
1774 | "
\n",
1775 | " \n",
1776 | " 20 | \n",
1777 | " (Organic Avocado) | \n",
1778 | " (Banana) | \n",
1779 | " 0.075348 | \n",
1780 | " 0.200938 | \n",
1781 | " 0.022745 | \n",
1782 | " 0.301866 | \n",
1783 | " 1.502282 | \n",
1784 | " 0.007605 | \n",
1785 | " 1.144568 | \n",
1786 | "
\n",
1787 | " \n",
1788 | " 21 | \n",
1789 | " (Banana) | \n",
1790 | " (Organic Avocado) | \n",
1791 | " 0.200938 | \n",
1792 | " 0.075348 | \n",
1793 | " 0.022745 | \n",
1794 | " 0.113194 | \n",
1795 | " 1.502282 | \n",
1796 | " 0.007605 | \n",
1797 | " 1.042677 | \n",
1798 | "
\n",
1799 | " \n",
1800 | " 30 | \n",
1801 | " (Seedless Red Grapes) | \n",
1802 | " (Banana) | \n",
1803 | " 0.035480 | \n",
1804 | " 0.200938 | \n",
1805 | " 0.010534 | \n",
1806 | " 0.296906 | \n",
1807 | " 1.477596 | \n",
1808 | " 0.003405 | \n",
1809 | " 1.136493 | \n",
1810 | "
\n",
1811 | " \n",
1812 | " 31 | \n",
1813 | " (Banana) | \n",
1814 | " (Seedless Red Grapes) | \n",
1815 | " 0.200938 | \n",
1816 | " 0.035480 | \n",
1817 | " 0.010534 | \n",
1818 | " 0.052425 | \n",
1819 | " 1.477596 | \n",
1820 | " 0.003405 | \n",
1821 | " 1.017883 | \n",
1822 | "
\n",
1823 | " \n",
1824 | " 7 | \n",
1825 | " (Bag of Organic Bananas) | \n",
1826 | " (Organic Strawberries) | \n",
1827 | " 0.161527 | \n",
1828 | " 0.112711 | \n",
1829 | " 0.026463 | \n",
1830 | " 0.163832 | \n",
1831 | " 1.453551 | \n",
1832 | " 0.008257 | \n",
1833 | " 1.061136 | \n",
1834 | "
\n",
1835 | " \n",
1836 | " 6 | \n",
1837 | " (Organic Strawberries) | \n",
1838 | " (Bag of Organic Bananas) | \n",
1839 | " 0.112711 | \n",
1840 | " 0.161527 | \n",
1841 | " 0.026463 | \n",
1842 | " 0.234787 | \n",
1843 | " 1.453551 | \n",
1844 | " 0.008257 | \n",
1845 | " 1.095739 | \n",
1846 | "
\n",
1847 | " \n",
1848 | " 32 | \n",
1849 | " (Strawberries) | \n",
1850 | " (Banana) | \n",
1851 | " 0.061123 | \n",
1852 | " 0.200938 | \n",
1853 | " 0.017661 | \n",
1854 | " 0.288936 | \n",
1855 | " 1.437931 | \n",
1856 | " 0.005379 | \n",
1857 | " 1.123754 | \n",
1858 | "
\n",
1859 | " \n",
1860 | " 33 | \n",
1861 | " (Banana) | \n",
1862 | " (Strawberries) | \n",
1863 | " 0.200938 | \n",
1864 | " 0.061123 | \n",
1865 | " 0.017661 | \n",
1866 | " 0.087891 | \n",
1867 | " 1.437931 | \n",
1868 | " 0.005379 | \n",
1869 | " 1.029347 | \n",
1870 | "
\n",
1871 | " \n",
1872 | " 45 | \n",
1873 | " (Organic Strawberries) | \n",
1874 | " (Organic Baby Spinach) | \n",
1875 | " 0.112711 | \n",
1876 | " 0.102948 | \n",
1877 | " 0.016267 | \n",
1878 | " 0.144326 | \n",
1879 | " 1.401939 | \n",
1880 | " 0.004664 | \n",
1881 | " 1.048358 | \n",
1882 | "
\n",
1883 | " \n",
1884 | " 44 | \n",
1885 | " (Organic Baby Spinach) | \n",
1886 | " (Organic Strawberries) | \n",
1887 | " 0.102948 | \n",
1888 | " 0.112711 | \n",
1889 | " 0.016267 | \n",
1890 | " 0.158014 | \n",
1891 | " 1.401939 | \n",
1892 | " 0.004664 | \n",
1893 | " 1.053805 | \n",
1894 | "
\n",
1895 | " \n",
1896 | " 11 | \n",
1897 | " (Bag of Organic Bananas) | \n",
1898 | " (Organic Yellow Onion) | \n",
1899 | " 0.161527 | \n",
1900 | " 0.048146 | \n",
1901 | " 0.010460 | \n",
1902 | " 0.064756 | \n",
1903 | " 1.344989 | \n",
1904 | " 0.002683 | \n",
1905 | " 1.017760 | \n",
1906 | "
\n",
1907 | " \n",
1908 | " 10 | \n",
1909 | " (Organic Yellow Onion) | \n",
1910 | " (Bag of Organic Bananas) | \n",
1911 | " 0.048146 | \n",
1912 | " 0.161527 | \n",
1913 | " 0.010460 | \n",
1914 | " 0.217252 | \n",
1915 | " 1.344989 | \n",
1916 | " 0.002683 | \n",
1917 | " 1.071191 | \n",
1918 | "
\n",
1919 | " \n",
1920 | " 16 | \n",
1921 | " (Large Lemon) | \n",
1922 | " (Banana) | \n",
1923 | " 0.065764 | \n",
1924 | " 0.200938 | \n",
1925 | " 0.017603 | \n",
1926 | " 0.267663 | \n",
1927 | " 1.332062 | \n",
1928 | " 0.004388 | \n",
1929 | " 1.091111 | \n",
1930 | "
\n",
1931 | " \n",
1932 | " 17 | \n",
1933 | " (Banana) | \n",
1934 | " (Large Lemon) | \n",
1935 | " 0.200938 | \n",
1936 | " 0.065764 | \n",
1937 | " 0.017603 | \n",
1938 | " 0.087602 | \n",
1939 | " 1.332062 | \n",
1940 | " 0.004388 | \n",
1941 | " 1.023934 | \n",
1942 | "
\n",
1943 | " \n",
1944 | " 0 | \n",
1945 | " (Organic Baby Spinach) | \n",
1946 | " (Bag of Organic Bananas) | \n",
1947 | " 0.102948 | \n",
1948 | " 0.161527 | \n",
1949 | " 0.021517 | \n",
1950 | " 0.209007 | \n",
1951 | " 1.293944 | \n",
1952 | " 0.004888 | \n",
1953 | " 1.060026 | \n",
1954 | "
\n",
1955 | " \n",
1956 | " 1 | \n",
1957 | " (Bag of Organic Bananas) | \n",
1958 | " (Organic Baby Spinach) | \n",
1959 | " 0.161527 | \n",
1960 | " 0.102948 | \n",
1961 | " 0.021517 | \n",
1962 | " 0.133208 | \n",
1963 | " 1.293944 | \n",
1964 | " 0.004888 | \n",
1965 | " 1.034911 | \n",
1966 | "
\n",
1967 | " \n",
1968 | " 41 | \n",
1969 | " (Organic Avocado) | \n",
1970 | " (Organic Strawberries) | \n",
1971 | " 0.075348 | \n",
1972 | " 0.112711 | \n",
1973 | " 0.010254 | \n",
1974 | " 0.136095 | \n",
1975 | " 1.207468 | \n",
1976 | " 0.001762 | \n",
1977 | " 1.027068 | \n",
1978 | "
\n",
1979 | " \n",
1980 | " 40 | \n",
1981 | " (Organic Strawberries) | \n",
1982 | " (Organic Avocado) | \n",
1983 | " 0.112711 | \n",
1984 | " 0.075348 | \n",
1985 | " 0.010254 | \n",
1986 | " 0.090980 | \n",
1987 | " 1.207468 | \n",
1988 | " 0.001762 | \n",
1989 | " 1.017197 | \n",
1990 | "
\n",
1991 | " \n",
1992 | " 9 | \n",
1993 | " (Bag of Organic Bananas) | \n",
1994 | " (Organic Whole Milk) | \n",
1995 | " 0.161527 | \n",
1996 | " 0.058411 | \n",
1997 | " 0.011288 | \n",
1998 | " 0.069883 | \n",
1999 | " 1.196413 | \n",
2000 | " 0.001853 | \n",
2001 | " 1.012335 | \n",
2002 | "
\n",
2003 | " \n",
2004 | " 8 | \n",
2005 | " (Organic Whole Milk) | \n",
2006 | " (Bag of Organic Bananas) | \n",
2007 | " 0.058411 | \n",
2008 | " 0.161527 | \n",
2009 | " 0.011288 | \n",
2010 | " 0.193253 | \n",
2011 | " 1.196413 | \n",
2012 | " 0.001853 | \n",
2013 | " 1.039326 | \n",
2014 | "
\n",
2015 | " \n",
2016 | " 29 | \n",
2017 | " (Organic Whole Milk) | \n",
2018 | " (Banana) | \n",
2019 | " 0.058411 | \n",
2020 | " 0.200938 | \n",
2021 | " 0.013368 | \n",
2022 | " 0.228866 | \n",
2023 | " 1.138984 | \n",
2024 | " 0.001631 | \n",
2025 | " 1.036216 | \n",
2026 | "
\n",
2027 | " \n",
2028 | " 28 | \n",
2029 | " (Banana) | \n",
2030 | " (Organic Whole Milk) | \n",
2031 | " 0.200938 | \n",
2032 | " 0.058411 | \n",
2033 | " 0.013368 | \n",
2034 | " 0.066529 | \n",
2035 | " 1.138984 | \n",
2036 | " 0.001631 | \n",
2037 | " 1.008697 | \n",
2038 | "
\n",
2039 | " \n",
2040 | " 19 | \n",
2041 | " (Banana) | \n",
2042 | " (Limes) | \n",
2043 | " 0.200938 | \n",
2044 | " 0.059984 | \n",
2045 | " 0.013539 | \n",
2046 | " 0.067380 | \n",
2047 | " 1.123292 | \n",
2048 | " 0.001486 | \n",
2049 | " 1.007930 | \n",
2050 | "
\n",
2051 | " \n",
2052 | " 18 | \n",
2053 | " (Limes) | \n",
2054 | " (Banana) | \n",
2055 | " 0.059984 | \n",
2056 | " 0.200938 | \n",
2057 | " 0.013539 | \n",
2058 | " 0.225713 | \n",
2059 | " 1.123292 | \n",
2060 | " 0.001486 | \n",
2061 | " 1.031996 | \n",
2062 | "
\n",
2063 | " \n",
2064 | " 23 | \n",
2065 | " (Banana) | \n",
2066 | " (Organic Baby Spinach) | \n",
2067 | " 0.200938 | \n",
2068 | " 0.102948 | \n",
2069 | " 0.021839 | \n",
2070 | " 0.108683 | \n",
2071 | " 1.055712 | \n",
2072 | " 0.001152 | \n",
2073 | " 1.006435 | \n",
2074 | "
\n",
2075 | " \n",
2076 | " 22 | \n",
2077 | " (Organic Baby Spinach) | \n",
2078 | " (Banana) | \n",
2079 | " 0.102948 | \n",
2080 | " 0.200938 | \n",
2081 | " 0.021839 | \n",
2082 | " 0.212133 | \n",
2083 | " 1.055712 | \n",
2084 | " 0.001152 | \n",
2085 | " 1.014209 | \n",
2086 | "
\n",
2087 | " \n",
2088 | " 27 | \n",
2089 | " (Banana) | \n",
2090 | " (Organic Strawberries) | \n",
2091 | " 0.200938 | \n",
2092 | " 0.112711 | \n",
2093 | " 0.023857 | \n",
2094 | " 0.118728 | \n",
2095 | " 1.053382 | \n",
2096 | " 0.001209 | \n",
2097 | " 1.006827 | \n",
2098 | "
\n",
2099 | " \n",
2100 | " 26 | \n",
2101 | " (Organic Strawberries) | \n",
2102 | " (Banana) | \n",
2103 | " 0.112711 | \n",
2104 | " 0.200938 | \n",
2105 | " 0.023857 | \n",
2106 | " 0.211665 | \n",
2107 | " 1.053382 | \n",
2108 | " 0.001209 | \n",
2109 | " 1.013607 | \n",
2110 | "
\n",
2111 | " \n",
2112 | "
\n",
2113 | "
"
2114 | ],
2115 | "text/plain": [
2116 | " antecedents consequents antecedent support \\\n",
2117 | "35 (Limes) (Large Lemon) 0.059984 \n",
2118 | "34 (Large Lemon) (Limes) 0.065764 \n",
2119 | "52 (Organic Strawberries) (Organic Raspberries) 0.112711 \n",
2120 | "53 (Organic Raspberries) (Organic Strawberries) 0.058325 \n",
2121 | "37 (Organic Avocado) (Large Lemon) 0.075348 \n",
2122 | "36 (Large Lemon) (Organic Avocado) 0.065764 \n",
2123 | "47 (Organic Strawberries) (Organic Blueberries) 0.112711 \n",
2124 | "46 (Organic Blueberries) (Organic Strawberries) 0.042956 \n",
2125 | "49 (Organic Hass Avocado) (Organic Raspberries) 0.090339 \n",
2126 | "48 (Organic Raspberries) (Organic Hass Avocado) 0.058325 \n",
2127 | "24 (Banana) (Organic Fuji Apple) 0.200938 \n",
2128 | "25 (Organic Fuji Apple) (Banana) 0.037992 \n",
2129 | "5 (Bag of Organic Bananas) (Organic Raspberries) 0.161527 \n",
2130 | "4 (Organic Raspberries) (Bag of Organic Bananas) 0.058325 \n",
2131 | "3 (Bag of Organic Bananas) (Organic Hass Avocado) 0.161527 \n",
2132 | "2 (Organic Hass Avocado) (Bag of Organic Bananas) 0.090339 \n",
2133 | "14 (Honeycrisp Apple) (Banana) 0.034078 \n",
2134 | "15 (Banana) (Honeycrisp Apple) 0.200938 \n",
2135 | "39 (Organic Avocado) (Organic Baby Spinach) 0.075348 \n",
2136 | "38 (Organic Baby Spinach) (Organic Avocado) 0.102948 \n",
2137 | "50 (Organic Strawberries) (Organic Hass Avocado) 0.112711 \n",
2138 | "51 (Organic Hass Avocado) (Organic Strawberries) 0.090339 \n",
2139 | "12 (Cucumber Kirby) (Banana) 0.040789 \n",
2140 | "13 (Banana) (Cucumber Kirby) 0.200938 \n",
2141 | "43 (Organic Hass Avocado) (Organic Baby Spinach) 0.090339 \n",
2142 | "42 (Organic Baby Spinach) (Organic Hass Avocado) 0.102948 \n",
2143 | "55 (Organic Whole Milk) (Organic Strawberries) 0.058411 \n",
2144 | "54 (Organic Strawberries) (Organic Whole Milk) 0.112711 \n",
2145 | "20 (Organic Avocado) (Banana) 0.075348 \n",
2146 | "21 (Banana) (Organic Avocado) 0.200938 \n",
2147 | "30 (Seedless Red Grapes) (Banana) 0.035480 \n",
2148 | "31 (Banana) (Seedless Red Grapes) 0.200938 \n",
2149 | "7 (Bag of Organic Bananas) (Organic Strawberries) 0.161527 \n",
2150 | "6 (Organic Strawberries) (Bag of Organic Bananas) 0.112711 \n",
2151 | "32 (Strawberries) (Banana) 0.061123 \n",
2152 | "33 (Banana) (Strawberries) 0.200938 \n",
2153 | "45 (Organic Strawberries) (Organic Baby Spinach) 0.112711 \n",
2154 | "44 (Organic Baby Spinach) (Organic Strawberries) 0.102948 \n",
2155 | "11 (Bag of Organic Bananas) (Organic Yellow Onion) 0.161527 \n",
2156 | "10 (Organic Yellow Onion) (Bag of Organic Bananas) 0.048146 \n",
2157 | "16 (Large Lemon) (Banana) 0.065764 \n",
2158 | "17 (Banana) (Large Lemon) 0.200938 \n",
2159 | "0 (Organic Baby Spinach) (Bag of Organic Bananas) 0.102948 \n",
2160 | "1 (Bag of Organic Bananas) (Organic Baby Spinach) 0.161527 \n",
2161 | "41 (Organic Avocado) (Organic Strawberries) 0.075348 \n",
2162 | "40 (Organic Strawberries) (Organic Avocado) 0.112711 \n",
2163 | "9 (Bag of Organic Bananas) (Organic Whole Milk) 0.161527 \n",
2164 | "8 (Organic Whole Milk) (Bag of Organic Bananas) 0.058411 \n",
2165 | "29 (Organic Whole Milk) (Banana) 0.058411 \n",
2166 | "28 (Banana) (Organic Whole Milk) 0.200938 \n",
2167 | "19 (Banana) (Limes) 0.200938 \n",
2168 | "18 (Limes) (Banana) 0.059984 \n",
2169 | "23 (Banana) (Organic Baby Spinach) 0.200938 \n",
2170 | "22 (Organic Baby Spinach) (Banana) 0.102948 \n",
2171 | "27 (Banana) (Organic Strawberries) 0.200938 \n",
2172 | "26 (Organic Strawberries) (Banana) 0.112711 \n",
2173 | "\n",
2174 | " consequent support support confidence lift leverage conviction \n",
2175 | "35 0.065764 0.011860 0.197723 3.006544 0.007915 1.164480 \n",
2176 | "34 0.059984 0.011860 0.180345 3.006544 0.007915 1.146843 \n",
2177 | "52 0.058325 0.014533 0.128940 2.210731 0.007959 1.081069 \n",
2178 | "53 0.112711 0.014533 0.249174 2.210731 0.007959 1.181751 \n",
2179 | "37 0.065764 0.010538 0.139862 2.126728 0.005583 1.086147 \n",
2180 | "36 0.075348 0.010538 0.160244 2.126728 0.005583 1.101097 \n",
2181 | "47 0.042956 0.010235 0.090809 2.114024 0.005394 1.052633 \n",
2182 | "46 0.112711 0.010235 0.238274 2.114024 0.005394 1.164840 \n",
2183 | "49 0.058325 0.010966 0.121389 2.081257 0.005697 1.071777 \n",
2184 | "48 0.090339 0.010966 0.188018 2.081257 0.005697 1.120298 \n",
2185 | "24 0.037992 0.014378 0.071552 1.883367 0.006744 1.036147 \n",
2186 | "25 0.200938 0.014378 0.378441 1.883367 0.006744 1.285576 \n",
2187 | "5 0.058325 0.017294 0.107065 1.835662 0.007873 1.054584 \n",
2188 | "4 0.161527 0.017294 0.296508 1.835662 0.007873 1.191874 \n",
2189 | "3 0.090339 0.026487 0.163981 1.815175 0.011895 1.088087 \n",
2190 | "2 0.161527 0.026487 0.293199 1.815175 0.011895 1.186294 \n",
2191 | "14 0.200938 0.012122 0.355725 1.770317 0.005275 1.240249 \n",
2192 | "15 0.034078 0.012122 0.060329 1.770317 0.005275 1.027936 \n",
2193 | "39 0.102948 0.013207 0.175281 1.702625 0.005450 1.087707 \n",
2194 | "38 0.075348 0.013207 0.128289 1.702625 0.005450 1.060733 \n",
2195 | "50 0.090339 0.017314 0.153616 1.700440 0.007132 1.074762 \n",
2196 | "51 0.112711 0.017314 0.191659 1.700440 0.007132 1.097666 \n",
2197 | "12 0.200938 0.013432 0.329296 1.638788 0.005236 1.191377 \n",
2198 | "13 0.040789 0.013432 0.066844 1.638788 0.005236 1.027922 \n",
2199 | "43 0.102948 0.014787 0.163679 1.589929 0.005486 1.072618 \n",
2200 | "42 0.090339 0.014787 0.143632 1.589929 0.005486 1.062232 \n",
2201 | "55 0.112711 0.010130 0.173423 1.538645 0.003546 1.073449 \n",
2202 | "54 0.058411 0.010130 0.089873 1.538645 0.003546 1.034569 \n",
2203 | "20 0.200938 0.022745 0.301866 1.502282 0.007605 1.144568 \n",
2204 | "21 0.075348 0.022745 0.113194 1.502282 0.007605 1.042677 \n",
2205 | "30 0.200938 0.010534 0.296906 1.477596 0.003405 1.136493 \n",
2206 | "31 0.035480 0.010534 0.052425 1.477596 0.003405 1.017883 \n",
2207 | "7 0.112711 0.026463 0.163832 1.453551 0.008257 1.061136 \n",
2208 | "6 0.161527 0.026463 0.234787 1.453551 0.008257 1.095739 \n",
2209 | "32 0.200938 0.017661 0.288936 1.437931 0.005379 1.123754 \n",
2210 | "33 0.061123 0.017661 0.087891 1.437931 0.005379 1.029347 \n",
2211 | "45 0.102948 0.016267 0.144326 1.401939 0.004664 1.048358 \n",
2212 | "44 0.112711 0.016267 0.158014 1.401939 0.004664 1.053805 \n",
2213 | "11 0.048146 0.010460 0.064756 1.344989 0.002683 1.017760 \n",
2214 | "10 0.161527 0.010460 0.217252 1.344989 0.002683 1.071191 \n",
2215 | "16 0.200938 0.017603 0.267663 1.332062 0.004388 1.091111 \n",
2216 | "17 0.065764 0.017603 0.087602 1.332062 0.004388 1.023934 \n",
2217 | "0 0.161527 0.021517 0.209007 1.293944 0.004888 1.060026 \n",
2218 | "1 0.102948 0.021517 0.133208 1.293944 0.004888 1.034911 \n",
2219 | "41 0.112711 0.010254 0.136095 1.207468 0.001762 1.027068 \n",
2220 | "40 0.075348 0.010254 0.090980 1.207468 0.001762 1.017197 \n",
2221 | "9 0.058411 0.011288 0.069883 1.196413 0.001853 1.012335 \n",
2222 | "8 0.161527 0.011288 0.193253 1.196413 0.001853 1.039326 \n",
2223 | "29 0.200938 0.013368 0.228866 1.138984 0.001631 1.036216 \n",
2224 | "28 0.058411 0.013368 0.066529 1.138984 0.001631 1.008697 \n",
2225 | "19 0.059984 0.013539 0.067380 1.123292 0.001486 1.007930 \n",
2226 | "18 0.200938 0.013539 0.225713 1.123292 0.001486 1.031996 \n",
2227 | "23 0.102948 0.021839 0.108683 1.055712 0.001152 1.006435 \n",
2228 | "22 0.200938 0.021839 0.212133 1.055712 0.001152 1.014209 \n",
2229 | "27 0.112711 0.023857 0.118728 1.053382 0.001209 1.006827 \n",
2230 | "26 0.200938 0.023857 0.211665 1.053382 0.001209 1.013607 "
2231 | ]
2232 | },
2233 | "execution_count": 22,
2234 | "metadata": {},
2235 | "output_type": "execute_result"
2236 | }
2237 | ],
2238 | "source": [
2239 | "rules = association_rules(frequent_items, metric=\"lift\", min_threshold=1)\n",
2240 | "rules.sort_values('lift', ascending=False)"
2241 | ]
2242 | },
2243 | {
2244 | "cell_type": "code",
2245 | "execution_count": null,
2246 | "metadata": {},
2247 | "outputs": [],
2248 | "source": []
2249 | }
2250 | ],
2251 | "metadata": {
2252 | "kernelspec": {
2253 | "display_name": "Python 3",
2254 | "language": "python",
2255 | "name": "python3"
2256 | },
2257 | "language_info": {
2258 | "codemirror_mode": {
2259 | "name": "ipython",
2260 | "version": 3
2261 | },
2262 | "file_extension": ".py",
2263 | "mimetype": "text/x-python",
2264 | "name": "python",
2265 | "nbconvert_exporter": "python",
2266 | "pygments_lexer": "ipython3",
2267 | "version": "3.7.3"
2268 | }
2269 | },
2270 | "nbformat": 4,
2271 | "nbformat_minor": 2
2272 | }
2273 |
--------------------------------------------------------------------------------
/NN Architecture.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/archd3sai/Instacart-Market-Basket-Analysis/eebc34070e6e4803c43f968a5e0975d268fb036e/NN Architecture.png
--------------------------------------------------------------------------------
/Plots/Add-to-cart-VS-reorder.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/archd3sai/Instacart-Market-Basket-Analysis/eebc34070e6e4803c43f968a5e0975d268fb036e/Plots/Add-to-cart-VS-reorder.png
--------------------------------------------------------------------------------
/Plots/Most-popular-products.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/archd3sai/Instacart-Market-Basket-Analysis/eebc34070e6e4803c43f968a5e0975d268fb036e/Plots/Most-popular-products.png
--------------------------------------------------------------------------------
/Plots/NN Architecture.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/archd3sai/Instacart-Market-Basket-Analysis/eebc34070e6e4803c43f968a5e0975d268fb036e/Plots/NN Architecture.png
--------------------------------------------------------------------------------
/Plots/NN-Performance.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/archd3sai/Instacart-Market-Basket-Analysis/eebc34070e6e4803c43f968a5e0975d268fb036e/Plots/NN-Performance.png
--------------------------------------------------------------------------------
/Plots/NN-Report.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/archd3sai/Instacart-Market-Basket-Analysis/eebc34070e6e4803c43f968a5e0975d268fb036e/Plots/NN-Report.png
--------------------------------------------------------------------------------
/Plots/Reorder-organic-inorganic-products.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/archd3sai/Instacart-Market-Basket-Analysis/eebc34070e6e4803c43f968a5e0975d268fb036e/Plots/Reorder-organic-inorganic-products.png
--------------------------------------------------------------------------------
/Plots/Total-organic-inorganic-products.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/archd3sai/Instacart-Market-Basket-Analysis/eebc34070e6e4803c43f968a5e0975d268fb036e/Plots/Total-organic-inorganic-products.png
--------------------------------------------------------------------------------
/Plots/XGBoost Feature Importance Plot.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/archd3sai/Instacart-Market-Basket-Analysis/eebc34070e6e4803c43f968a5e0975d268fb036e/Plots/XGBoost Feature Importance Plot.png
--------------------------------------------------------------------------------
/Plots/XGBoost Performance.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/archd3sai/Instacart-Market-Basket-Analysis/eebc34070e6e4803c43f968a5e0975d268fb036e/Plots/XGBoost Performance.png
--------------------------------------------------------------------------------
/Plots/XGBoost-Report.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/archd3sai/Instacart-Market-Basket-Analysis/eebc34070e6e4803c43f968a5e0975d268fb036e/Plots/XGBoost-Report.png
--------------------------------------------------------------------------------
/Plots/aisle-high-reorder.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/archd3sai/Instacart-Market-Basket-Analysis/eebc34070e6e4803c43f968a5e0975d268fb036e/Plots/aisle-high-reorder.png
--------------------------------------------------------------------------------
/Plots/aisle-low-reorder.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/archd3sai/Instacart-Market-Basket-Analysis/eebc34070e6e4803c43f968a5e0975d268fb036e/Plots/aisle-low-reorder.png
--------------------------------------------------------------------------------
/Plots/cluster.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/archd3sai/Instacart-Market-Basket-Analysis/eebc34070e6e4803c43f968a5e0975d268fb036e/Plots/cluster.png
--------------------------------------------------------------------------------
/Plots/cumsum_products.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/archd3sai/Instacart-Market-Basket-Analysis/eebc34070e6e4803c43f968a5e0975d268fb036e/Plots/cumsum_products.png
--------------------------------------------------------------------------------
/Plots/dow.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/archd3sai/Instacart-Market-Basket-Analysis/eebc34070e6e4803c43f968a5e0975d268fb036e/Plots/dow.png
--------------------------------------------------------------------------------
/Plots/elbow.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/archd3sai/Instacart-Market-Basket-Analysis/eebc34070e6e4803c43f968a5e0975d268fb036e/Plots/elbow.png
--------------------------------------------------------------------------------
/Plots/heatmap.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/archd3sai/Instacart-Market-Basket-Analysis/eebc34070e6e4803c43f968a5e0975d268fb036e/Plots/heatmap.png
--------------------------------------------------------------------------------
/Plots/orders.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/archd3sai/Instacart-Market-Basket-Analysis/eebc34070e6e4803c43f968a5e0975d268fb036e/Plots/orders.png
--------------------------------------------------------------------------------
/Plots/popular-aisles.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/archd3sai/Instacart-Market-Basket-Analysis/eebc34070e6e4803c43f968a5e0975d268fb036e/Plots/popular-aisles.png
--------------------------------------------------------------------------------
/Plots/popular-departments.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/archd3sai/Instacart-Market-Basket-Analysis/eebc34070e6e4803c43f968a5e0975d268fb036e/Plots/popular-departments.png
--------------------------------------------------------------------------------
/Plots/prior.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/archd3sai/Instacart-Market-Basket-Analysis/eebc34070e6e4803c43f968a5e0975d268fb036e/Plots/prior.png
--------------------------------------------------------------------------------
/Plots/readme.md:
--------------------------------------------------------------------------------
1 | This folder contains plots of analysis and model.
2 |
--------------------------------------------------------------------------------
/Plots/reorder-df.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/archd3sai/Instacart-Market-Basket-Analysis/eebc34070e6e4803c43f968a5e0975d268fb036e/Plots/reorder-df.png
--------------------------------------------------------------------------------
/Plots/reorder-total-orders.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/archd3sai/Instacart-Market-Basket-Analysis/eebc34070e6e4803c43f968a5e0975d268fb036e/Plots/reorder-total-orders.png
--------------------------------------------------------------------------------
/Plots/train.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/archd3sai/Instacart-Market-Basket-Analysis/eebc34070e6e4803c43f968a5e0975d268fb036e/Plots/train.png
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Instacart Market Basket Analysis
2 |
3 | ## Introduction
4 |
5 | Instacart is an American technology company that operates a same-day grocery delivery and pickup service in the U.S. and Canada. Customers shop for groceries from various retail partners through the Instacart mobile app or Instacart.com. Each order is shopped and delivered by an Instacart personal shopper.
6 |
7 | ### Objectives:
8 | - Analyze the anonymized [data](https://www.kaggle.com/c/instacart-market-basket-analysis/data) of 3 million grocery orders from more than 200,000 Instacart users, open-sourced by Instacart
9 | - Uncover hidden associations between products for better cross-selling and upselling
10 | - Perform customer segmentation for targeted marketing and to anticipate customer behavior
11 | - Build a machine learning model to predict which previously purchased products will be in a user’s next order
12 |
13 | ### Project Organization
14 | ```
15 | .
16 | ├── Plots/ : Contains all plots
17 | ├── Data Description and Analysis.ipynb : Initial analysis to understand data
18 | ├── Exploratory Data Analysis.ipynb : EDA to analyze customer purchase pattern
19 | ├── Customers Segmentation.ipynb : Customer Segmentation based on product aisles
20 | ├── Market Basket Analysis.ipynb : Market Basket Analysis to find products association
21 | ├── Feature Extraction.ipynb : Feature engineering and extraction for a ML model
22 | ├── Data Preparation.ipynb : Data preparation for modeling
23 | ├── ANN Model.ipynb : Neural Network model for product reorder prediction
24 | ├── XGBoost Model.ipynb : XGBoost model for product reorder prediction
25 | ├── LICENSE : License
26 | └── README.md : Project Report
27 | ```
28 |
29 |
30 | ## Data Description
31 |
32 | - **aisles:** This file lists the aisles; there are 134 unique aisles in total.
33 | 
34 | - **departments:** This file lists the departments; there are 21 unique departments in total.
35 | 
36 | - **orders:** This file contains all the orders made by different users. From the analysis below, we can conclude the following:
37 | - There are 3421083 orders in total, made by 206209 users.
38 | - There are three sets of orders: Prior, Train and Test. The distributions of orders in the Train and Test sets are similar, whereas the distribution of orders in the Prior set is different.
39 | - The total number of orders per customer ranges from 0 to 100.
40 | - Based on the plot of 'Orders VS Day of Week', we can map 0 and 1 to Saturday and Sunday respectively, on the assumption that most people buy groceries on weekends.
41 | - The majority of orders are made during the daytime.
42 | - Customers tend to order weekly, which is supported by peaks at 7, 14, 21 and 30 days in the 'Orders VS Days since prior order' graph.
43 | - Based on the heatmap of 'Day of Week' against 'Hour of Day', we can say that Saturday afternoons and Sunday mornings are prime time for orders.
44 |
45 |
46 |
47 |
48 |
49 |
50 |
51 |
52 |
53 |
54 |
55 |
56 |
57 | - **products:** This file lists all 49688 products along with their aisle and department. The number of products varies across aisles and departments.
58 | 
59 | - **order_products_prior:** This file records which products were ordered in each order and the sequence in which they were added to the cart. It also tells us whether each product was a reorder.
60 | 
61 | - This file covers 3214874 orders, through which 49677 distinct products were ordered.
62 | - From the 'Count VS Items in cart' plot, we can say that most people buy 1-15 items per order and that the largest order contained 145 items.
63 | - The percentage of reordered items in this set is 58.97%.
64 |
65 |
66 |
67 |
68 |
69 |
70 | - **order_products_train:** This file records which products were ordered in each order and the sequence in which they were added to the cart. It also tells us whether each product was a reorder.
71 | - This file covers 131209 orders, through which 39123 distinct products were ordered.
72 | - From the 'Count VS Items in cart' plot, we can say that most people buy 1-15 items per order and that the largest order contained 145 items.
73 | - The percentage of reordered items in this set is 59.86%.
74 |
75 |
76 |
77 |
78 |
79 | ## Exploratory Data Analysis
80 | For the analysis, I combined all of the separate data files into a single dataframe. To fit the dataframe in memory, I reduced its size by about 50% (from 4.1 GB to 2.0 GB) through type conversion, without losing any information.
81 |
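The type conversion mentioned above can be sketched roughly as follows. This is a minimal illustration on a made-up toy frame, assuming pandas and NumPy are available; the helper name `downcast_numeric` is hypothetical and is not taken from the notebooks, and the column names only mirror those in the Kaggle files.

```python
import numpy as np
import pandas as pd


def downcast_numeric(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy whose numeric columns use the smallest dtype that fits.

    Hypothetical helper for illustration; not the notebook's own code.
    """
    out = df.copy()
    for col in out.columns:
        if pd.api.types.is_integer_dtype(out[col]):
            out[col] = pd.to_numeric(out[col], downcast="integer")
        elif pd.api.types.is_float_dtype(out[col]):
            # float32 keeps roughly 7 significant digits, so only downcast
            # floats when that precision is enough for the column.
            out[col] = pd.to_numeric(out[col], downcast="float")
    return out


# Toy data standing in for the merged order table (values are made up).
toy = pd.DataFrame({
    "order_id": np.arange(1, 1001, dtype="int64"),
    "add_to_cart_order": np.tile(np.arange(1, 11, dtype="int64"), 100),
    "reordered": np.tile(np.array([0, 1], dtype="int64"), 500),
    "days_since_prior_order": np.tile(np.array([7.0, 14.0, 30.0, 1.0]), 250),
})

small = downcast_numeric(toy)
before = toy.memory_usage(deep=True).sum()
after = small.memory_usage(deep=True).sum()
print(f"{before} bytes -> {after} bytes")
```

Small integer columns such as `reordered` benefit the most, since they shrink from 8 bytes to 1 byte per value.
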
82 | - This plot shows most popular aisles based on total products bought.
83 |
84 |
85 |
86 |
87 |
88 | - As the plot below shows, the reorder percentage of day-to-day food items is high, whereas for other products such as vitamins, first-aid items, and beauty products it is low. This makes sense, as we buy groceries regularly but do not buy those other items in every order.
89 |
90 |
91 |
92 |
93 |
94 |
95 | - The plot below shows the popular departments. The store layout should place popular departments close to each other.
96 |
97 |
98 |
99 |
100 |
101 | - The plot below shows the most popular products. Many of the most popular products are organic.
102 |
103 |
104 |
105 |
106 |
107 | - There are fewer organic products than non-organic ones, but their mean reorder percentage is higher. This suggests that the store should stock more organic products.
108 |
109 |
110 |
111 |
112 |
113 |
114 |
115 | - We can plot add-to-cart order against mean reorder percentage. The lower the add-to-cart order, the higher the reorder percentage. This makes sense, as we tend to first add the things we need on a day-to-day basis.
116 |
117 |
118 |
119 |
120 |
121 | - In the plot below of reorder percentage versus number of product purchases, we see a ceiling effect. Many people try a product once and never reorder it. There are also users who buy certain products regularly.
122 |
123 |
124 |
125 |
126 |
127 | - The products with the highest reorder ratio have only a few unique users each (1-15). This means that these users like these products and buy them regularly.
128 |
129 |
130 |
131 |
132 |
133 | - In the plot below of cumulative total users per product versus products, we can see that 85% of the users buy only 10000 of the 49688 products. If we are interested in shelf space optimization, we should stock only these 10000 products. Here, I assume that the profit from the remaining 39688 products is not significantly high. If we had the prices of these products, we could also have considered products with high revenue, high reorder percentage, and high total sales.
134 |
135 |
136 |
137 |
138 |
139 | ## Customer Segmentation
140 |
141 | Customer segmentation is the process of dividing customers into groups based on common characteristics so companies can market to each group effectively and appropriately. We can perform segmentation using the data on which products users buy. Since there are thousands of products and also thousands of customers, I used aisles, which represent categories of products.
142 | 
143 | I then performed Principal Component Analysis (PCA) to reduce the number of dimensions, as KMeans does not produce good results in high-dimensional spaces. I carried out KMeans clustering on 10 principal components and chose 5 as the optimal number of clusters using the Elbow method shown below.
144 |
145 |
146 |
147 |
148 |
149 | The clustering can be visualized along the first two principal components, as shown below.
150 |
151 |
152 |
153 |
154 |
155 | The clustering produces 5 neat clusters. After checking the most frequent products in each, we can conclude the following:
156 | - Cluster 1 contains 5428 consumers who have a very strong preference for the water seltzer sparkling water aisle.
157 | - Cluster 2 contains 55784 consumers who mostly order fresh vegetables, followed by fruits.
158 | - Cluster 3 contains 7948 consumers who mostly buy packaged produce and fresh fruits.
159 | - Cluster 4 contains 37949 consumers who have a very strong preference for fruits, followed by fresh vegetables.
160 | - Cluster 5 contains 99100 consumers who order products from many aisles. Their mean number of orders is low compared to other clusters, which tells us that either they are not frequent users of Instacart or they are new users who do not have many orders yet.
161 |
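The PCA and KMeans steps described above can be sketched as follows. This is only an illustration under stated assumptions: it uses scikit-learn and a synthetic user-by-aisle count matrix in place of the real one, and the variable names are hypothetical. The numbers of components (10) and clusters (5) are the ones reported in the text.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Synthetic stand-in for the user x aisle purchase-count matrix
# (200 users, 134 aisles); the real matrix is built from the order data.
user_aisle = rng.poisson(lam=2.0, size=(200, 134)).astype(float)

# Reduce to 10 principal components, as described in the text.
components = PCA(n_components=10, random_state=0).fit_transform(user_aisle)

# Elbow method: record the within-cluster sum of squares for each k.
inertia = {}
for k in range(2, 9):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(components)
    inertia[k] = model.inertia_

# Final clustering with the k chosen from the elbow plot (5 in the text).
labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(components)
print(np.bincount(labels))  # number of users assigned to each cluster
```

Plotting `inertia` against `k` gives the elbow curve; on random data like this there is no clear elbow, so the sketch only shows the mechanics.
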
162 | ## Market Basket Analysis
163 |
164 | Market Basket Analysis is a modelling technique based upon the theory that if you buy a certain group of items, you are more or less likely to buy another group of items. Market basket analysis may provide the retailer with information to understand the purchase behavior of a buyer. This information can then be used for purposes of cross-selling and up-selling, in addition to influencing sales promotions, loyalty programs, store design, and discount plans.
165 |
166 | Market basket analysis scrutinizes the products customers tend to buy together, and uses the information to decide which products should be cross-sold or promoted together. The term arises from the shopping carts supermarket shoppers fill up during a shopping trip.
167 |
168 | Association Rule Mining is used to find associations between different objects in a set and to find frequent patterns in a transaction database, a relational database, or any other information repository.
169 | 
170 | The most common application of this approach is Market Basket Analysis, a key technique used by large retailers such as Amazon and Flipkart to analyze customer buying habits by finding associations between the different items that customers place in their “shopping baskets”. The discovery of these associations can help retailers develop marketing strategies by gaining insight into which items are frequently purchased together by customers. The strategies may include:
171 |
172 | - Changing the store layout according to trends
173 | - Customer behavior analysis
174 | - Catalog design
175 | - Cross marketing on online stores
176 | - Customized emails with add-on sales, etc.
177 |
178 | ### Metrics
179 |
180 | **Support** : The baseline popularity of an item. In mathematical terms, the support of item A is the ratio of the number of transactions involving A to the total number of transactions.
181 |
182 | **Confidence** : Likelihood that a customer who bought A also bought B. It is the ratio of the number of transactions involving both A and B to the number of transactions involving A.
183 | - Confidence(A => B) = Support(A, B)/Support(A)
184 |
185 | **Lift** : How much more likely B is to be bought when A is bought, compared with how often B is bought overall.
186 | - Lift(A => B) = Confidence(A => B)/Support(B)
187 |
188 | - Lift(A => B) = 1 means that there is no correlation within the itemset.
189 | - Lift(A => B) > 1 means that there is a positive correlation within the itemset, i.e., the products in the itemset, A and B, are more likely to be bought together.
190 | - Lift(A => B) < 1 means that there is a negative correlation within the itemset, i.e., the products in the itemset, A and B, are unlikely to be bought together.
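
As a worked illustration of the three metrics, here is a minimal sketch in plain Python. The baskets are made-up toy data (not taken from the Instacart dataset), and the helper names `support`, `confidence` and `lift` are hypothetical, chosen only for this example.

```python
# Toy baskets (hypothetical data, not taken from the Instacart dataset).
baskets = [
    {"Limes", "Large Lemon", "Banana"},
    {"Limes", "Large Lemon"},
    {"Banana", "Organic Strawberries"},
    {"Limes", "Banana"},
    {"Organic Strawberries"},
]


def support(itemset, baskets):
    """Fraction of baskets that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in baskets) / len(baskets)


def confidence(a, b, baskets):
    """Confidence(A => B) = Support(A, B) / Support(A)."""
    return support(set(a) | set(b), baskets) / support(a, baskets)


def lift(a, b, baskets):
    """Lift(A => B) = Confidence(A => B) / Support(B)."""
    return confidence(a, b, baskets) / support(b, baskets)


a, b = {"Limes"}, {"Large Lemon"}
print(support(a, baskets), confidence(a, b, baskets), lift(a, b, baskets))
```

Note that lift is symmetric, Lift(A => B) = Lift(B => A), while confidence generally is not, so every product pair yields two rules that share the same lift.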
191 |
192 | **Apriori Algorithm:** The Apriori algorithm relies on the property that any subset of a frequent itemset must also be frequent. It is the algorithm behind Market Basket Analysis. For example, any transaction containing {Grapes, Apple, Mango} also contains {Grapes, Mango}. So, according to the Apriori principle, if {Grapes, Apple, Mango} is frequent, then {Grapes, Mango} must also be frequent.
193 |
194 | I used the apriori algorithm from the mlxtend Python library to find associations among the top 100 most frequent products, which resulted in 28 product pairs (56 rules in total) that have lift higher than 1. The top 10 product pairs with the highest lift are shown below:
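
To make the Apriori principle concrete, here is a dependency-free sketch of the level-wise search. It is an illustration only, not the mlxtend implementation; the function name `apriori` and the toy baskets are assumptions made for this example.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {frozenset(itemset): support} for all frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def frequent(candidates):
        # Count each candidate and keep those meeting the support threshold.
        counts = {c: sum(c <= t for t in transactions) for c in candidates}
        return {c: k / n for c, k in counts.items() if k / n >= min_support}

    # Level 1: frequent single items.
    current = frequent({frozenset([i]) for t in transactions for i in t})
    result = dict(current)
    k = 2
    while current:
        # Join step: combine frequent (k-1)-itemsets into candidate k-itemsets.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step (Apriori principle): drop any candidate that has an
        # infrequent (k-1)-subset, since it cannot be frequent itself.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in current for s in combinations(c, k - 1))
        }
        current = frequent(candidates)
        result.update(current)
        k += 1
    return result


# Toy baskets (hypothetical data).
baskets = [
    {"Grapes", "Apple", "Mango"},
    {"Grapes", "Mango"},
    {"Grapes", "Apple", "Mango"},
    {"Apple"},
    {"Mango", "Banana"},
]
frequent_itemsets = apriori(baskets, min_support=0.4)
for itemset, sup in sorted(frequent_itemsets.items(), key=lambda kv: -kv[1]):
    print(sorted(itemset), sup)
```

The prune step is where the principle pays off: candidates with an infrequent subset are discarded before any counting pass over the transactions.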
195 |
196 | | Product A | Product B | Lift |
197 | | ------------- | ------------- | ---- |
198 | | Limes | Large Lemon | 3 |
199 | | Organic Strawberries | Organic Raspberries | 2.21 |
200 | | Organic Avocado | Large Lemon | 2.12 |
201 | | Organic Strawberries | Organic Blueberries | 2.11 |
202 | | Organic Hass Avocado | Organic Raspberries | 2.08 |
203 | | Banana | Organic Fuji Apple | 1.88 |
204 | | Bag of Organic Bananas | Organic Raspberries | 1.83 |
205 | | Organic Hass Avocado | Bag of Organic Bananas | 1.81 |
206 | | Honeycrisp Apple | Banana | 1.77 |
207 | | Organic Avocado | Organic Baby Spinach | 1.70 |
208 |
209 | ## ML Model to Predict Product Reorders
210 |
211 | We can utilize this anonymized transactional data of customer orders over time to predict which previously purchased products will be in a user’s next order. This would help recommend the products to a user.
212 |
213 | To build a model, I need to extract features from previous orders to understand each user's purchase pattern and how popular a particular product is. I extract the following features from the users' transactional data.
214 |
215 | **Product Level Features:** To understand the product's popularity among users
216 | ```
217 | (1) Product's average add-to-cart-order
218 | (2) Total times the product was ordered
219 | (3) Total times the product was reordered
220 | (4) Reorder percentage of a product
221 | (5) Total unique users of a product
222 | (6) Is the product Organic?
223 | (7) Percentage of users that buy the product a second time
224 | ```
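
As a rough illustration of how some of the product level features above can be aggregated, here is a dependency-free sketch. The row layout `(user_id, product_id, add_to_cart_order, reordered)`, the function name `product_features` and the sample rows are assumptions made for this example. With the full dataset, an equivalent dataframe group-by would be the more practical choice.

```python
from collections import defaultdict

# Hypothetical rows: (user_id, product_id, add_to_cart_order, reordered)
rows = [
    (1, "Banana", 1, 0),
    (1, "Banana", 2, 1),
    (2, "Banana", 1, 0),
    (2, "Limes", 3, 0),
    (1, "Limes", 4, 0),
    (1, "Limes", 2, 1),
]


def product_features(rows):
    """Aggregate per-product popularity features from order rows."""
    acc = defaultdict(lambda: {"orders": 0, "reorders": 0, "cart_sum": 0, "users": set()})
    for user_id, product_id, cart_position, reordered in rows:
        a = acc[product_id]
        a["orders"] += 1
        a["reorders"] += reordered
        a["cart_sum"] += cart_position
        a["users"].add(user_id)
    return {
        product_id: {
            "avg_add_to_cart_order": a["cart_sum"] / a["orders"],  # feature (1)
            "total_orders": a["orders"],                           # feature (2)
            "total_reorders": a["reorders"],                       # feature (3)
            "reorder_pct": a["reorders"] / a["orders"],            # feature (4)
            "unique_users": len(a["users"]),                       # feature (5)
        }
        for product_id, a in acc.items()
    }


features = product_features(rows)
print(features["Banana"])
```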
225 |
226 | **Aisle and Department Level Features:** To capture if a department and aisle are related to day-to-day products (vegetables, fruits, soda, water, etc.) or once-in-a-while products (medicines, personal-care, etc.)
227 | ```
228 | (8) Reorder percentage, total orders and reorders of a product aisle
229 | (9) Mean and std of aisle add-to-cart-order
230 | (10) Aisle unique users
231 | (11) Reorder percentage, total orders and reorders of a product department
232 | (12) Mean and std of department add-to-cart-order
233 | (13) Department unique users
234 | (14) Binary encoding of aisle feature (because one-hot encoding results in many features and makes the dataframe sparse)
235 | (15) Binary encoding of department feature (because one-hot encoding results in many features and makes the dataframe sparse)
236 | ```
237 |
238 | **User Level Features:** To capture user's purchase pattern and behavior
239 | ```
240 | (16) User's average and std day-of-week of order
241 | (17) User's average and std hour-of-day of order
242 | (18) User's average and std days-since-prior-order
243 | (19) Total orders by a user
244 | (20) Total products user has bought
245 | (21) Total unique products user has bought
246 | (22) User's total reordered products
247 | (23) User's overall reorder percentage
248 | (24) Average order size of a user
249 | (25) User's mean of reordered items of all orders
250 | (26) Percentage of reordered items in user's last three orders
251 | (27) Total orders in user's last three orders
252 | ```
253 |
254 | **User-product Level Features:** To capture user's pattern of ordering-reordering specific products
255 | ```
256 | (28) User's avg add-to-cart-order for a product
257 | (29) User's avg days_since_prior_order for a product
258 | (30) User's product total orders, reorders and reorders percentage
259 | (31) User's order number when the product was bought last
260 | (32) User's product purchase history of last three orders
261 | ```
262 |
263 | ### ML Models
264 |
265 | Using the extracted features, I prepared a dataframe that contains all the products a user has bought previously, the user level features, product level features, aisle and department level features, user-product level features and information about the current order, such as the order's day-of-week, hour-of-day, etc. The target is 'reordered', which shows which of the previously purchased items the user ordered this time.
266 |
267 | Since the dataframe is huge, I reduced its memory consumption by downcasting so that the data fits into my memory. I preferred MinMaxScaler over StandardScaler as the latter required 16 GB of RAM for its operation. I followed the standard process for model building and relied on XGBoost as it handles large data, can be parallelized and gives feature importance. I also built a Neural Network to see what the best performance from this model would be, disregarding some inherent randomness in both of these models. To balance the data, I used cost-sensitive learning by assigning class weights (~{0:1, 1:10}). I did not use random upsampling/SMOTE as it would increase the data size and I do not have much memory. I also avoided random downsampling, since it discards information which might be important and would result in bias.
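
The idea behind cost-sensitive learning can be sketched with a class-weighted log loss. This is a minimal illustration of the principle only, not the loss code used by XGBoost or the neural network in this project; the function name `weighted_log_loss` is hypothetical, and the weights mirror the approximate {0:1, 1:10} weighting mentioned above.

```python
import math

# Class weights in the spirit of the ~{0: 1, 1: 10} weighting mentioned above.
CLASS_WEIGHT = {0: 1.0, 1: 10.0}


def weighted_log_loss(y_true, probs, class_weight=CLASS_WEIGHT, eps=1e-12):
    """Mean binary cross-entropy where each example is scaled by its class weight."""
    total = 0.0
    for y, p in zip(y_true, probs):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        loss = -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
        total += class_weight[y] * loss
    return total / len(y_true)


# An equally confident mistake is penalized more on the minority (reordered) class.
missed_reorder = weighted_log_loss([1], [0.1])  # positive predicted with p = 0.1
false_alarm = weighted_log_loss([0], [0.9])     # negative predicted with p = 0.9
print(missed_reorder / false_alarm)
```

Unlike upsampling, this changes only how much each example contributes to the loss, so the size of the training data stays the same.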
268 |
269 | Since the F1 score can be inflated by changing the classification threshold, I relied on the AUC score for model evaluation. The performance of both models is shown below using the confusion matrix, ROC curve and classification report. The feature importance plot from the XGBoost model is also shown to highlight the features that help most in predicting a product's reorder. The performance of the two models is almost the same, with XGBoost performing slightly better in terms of ROC-AUC.
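
The difference between a threshold-dependent metric and a ranking metric can be seen in a small dependency-free sketch. The labels and scores are made-up toy values, and the helpers `f1_at` and `roc_auc` are hypothetical names for this example; the AUC here is computed as the fraction of (positive, negative) pairs that are ranked correctly.

```python
def f1_at(y_true, scores, threshold):
    """F1 score after converting scores to labels at the given threshold."""
    pred = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for p, t in zip(pred, y_true) if p == 1 and t == 1)
    fp = sum(1 for p, t in zip(pred, y_true) if p == 1 and t == 0)
    fn = sum(1 for p, t in zip(pred, y_true) if p == 0 and t == 1)
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0


def roc_auc(y_true, scores):
    """Probability that a random positive is scored above a random negative."""
    pos = [s for s, t in zip(scores, y_true) if t == 1]
    neg = [s for s, t in zip(scores, y_true) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy labels and model scores (hypothetical values).
y_true = [0, 0, 0, 0, 1, 0, 1, 1]
scores = [0.1, 0.2, 0.3, 0.35, 0.4, 0.6, 0.7, 0.9]

for threshold in (0.38, 0.5, 0.8):
    print(threshold, f1_at(y_true, scores, threshold))
print("AUC:", roc_auc(y_true, scores))
```

The F1 value moves with every threshold choice, while the AUC depends only on how the scores rank positives against negatives.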
270 |
271 | **Neural Network Model Architecture and Performance:**
272 |
273 |
274 |
275 |
276 |
277 |
278 |
279 |
280 |
281 |
282 |
283 |
284 |
285 |
286 | **XGBoost Model's Performance and Feature Importance:**
287 |
288 |
289 |
290 |
291 |
292 |
293 |
294 |
295 |
296 |
297 |
298 |
299 |
300 |
301 | ## Future Work
302 |
303 | - Use collaborative filtering to recommend products to a customer.
304 |
--------------------------------------------------------------------------------