├── README.md
├── assets
│   ├── apriori_algorithm.png
│   ├── datacamp.svg
│   └── olist_marketplaces.png
├── data
│   ├── olist_order_items_dataset.csv
│   ├── olist_products_dataset.csv
│   ├── olist_transactions.csv
│   └── product_category_name_translation.csv
└── notebooks
    ├── Market-Basket-Analysis-in-Python.ipynb
    └── Market-Basket-Analysis-in-Python_Solution.ipynb

/README.md: -------------------------------------------------------------------------------- 1 | # **Market Basket Analysis**
by **Isaiah Hull** 2 | 3 | Live training sessions are designed to mimic the flow of how a real data scientist would address a problem or a task. As such, a session needs to have some “narrative” where learners are achieving stated learning objectives in the form of a real-life data science task or project. For example, a data visualization live session could center on analyzing a dataset and creating a report with a specific business objective in mind _(e.g., analyzing and visualizing churn)_, and a data cleaning live session could be about preparing a dataset for analysis. 4 | 5 | As part of the 'Live training Spec' process, you will need to complete the following tasks: 6 | 7 | Edit this README by filling in the information for steps 1 - 4. 8 | 9 | ## Step 1: Foundations 10 | 11 | This part of the 'Live training Spec' process is designed to help guide you through session design by having you think through several key questions. Please make sure to delete the examples provided here for you. 12 | 13 | ### A. What problem(s) will students learn how to solve? (minimum of 5 problems) 14 | 15 | - Prepare data for use in Market Basket Analysis. 16 | - Identify patterns in consumer decision-making with `mlxtend`. 17 | - Use metrics to evaluate the properties of patterns. 18 | - Construct association rules that provide concrete recommendations for businesses. 19 | - Perform pruning to identify useful rules. 20 | - Visualize patterns and rules using `seaborn` and `matplotlib`. 21 | 22 | ### B. What technologies, packages, or functions will students use? Please be exhaustive. 23 | 24 | - `numpy` 25 | - `pandas` 26 | - `matplotlib` 27 | - `seaborn` 28 | - `mlxtend` 29 | 30 | ### C. What terms or jargon will you define? 31 | 32 | - Transaction: A set of items purchased together. 33 | - Itemset: A collection of unique items. 34 | - Association rule: An "if-then" statement of association between two itemsets.
For instance, "if coffee then milk" is an association rule that implies that customers who purchase coffee are also likely to purchase milk. 35 | - Metric: The numerical measure of the intensity of an association between itemsets. 36 | - Pruning: The removal of itemsets or rules that perform poorly according to a metric. 37 | 38 | ### D. What mistakes or misconceptions do you expect? 39 | 40 | - I expect students to be confused about the definitions of metrics and how to interpret them. 41 | - Students are likely to be confused about how the Apriori algorithm works and what it achieves. 42 | 43 | ### E. What datasets will you use? 44 | 45 | - Brazilian E-Commerce Public Dataset by Olist: https://www.kaggle.com/olistbr/brazilian-ecommerce 46 | 47 | ## Step 2: Who is this session for? 48 | 49 | Terms like "beginner" and "expert" mean different things to different people, so we use personas to help instructors clarify a live training's audience. When designing a specific live training, instructors should explain how it will or won't help these people, and what extra skills or prerequisite knowledge they are assuming their students have above and beyond what's included in the persona. 50 | 51 | - [x] Please select the roles and industries that align with your live training. 52 | - [x] Include an explanation describing your reasoning and any other relevant information. 53 | 54 | ### What roles would this live training be suitable for? 55 | 56 | *Check all that apply.* 57 | 58 | - [ ] Data Consumer 59 | - [ ] Leader 60 | - [x] Data Analyst 61 | - [ ] Citizen Data Scientist 62 | - [x] Data Scientist 63 | - [ ] Data Engineer 64 | - [ ] Database Administrator 65 | - [x] Statistician 66 | - [x] Machine Learning Scientist 67 | - [ ] Programmer 68 | - [ ] Other (please describe) 69 | 70 | Reasoning: Market Basket Analysis has limited overlap with popular methods in machine learning and data science (e.g. deep learning, gradient boosting, clustering, etc.). 
As such, learning the basics of Market Basket Analysis will open up an entirely new toolset for many data analysts, data scientists, statisticians, and machine learning scientists. 71 | 72 | ### What industries would this apply to? 73 | 74 | - Industries: Retail, e-commerce, streaming services. 75 | - Reasoning: Market Basket Analysis can be used to analyze associations between itemsets in any domain. While it is typically applied in retail settings, it can also be used in other applications, such as building recommender systems for e-commerce sites or streaming services. 76 | 77 | ### What level of expertise should learners have before beginning the live training? 78 | 79 | *List three or more examples of skills that you expect learners to have before beginning the live training* 80 | 81 | - Can define and manipulate an `array` in `numpy`. 82 | - Can define a `DataFrame` in `pandas`, create columns, and apply basic methods, such as `.mean()` and `.sum()`. 83 | - Can use `.apply()` and `lambda` functions to transform columns in a `DataFrame`. 84 | - Can generate basic plots in `matplotlib`. 85 | 86 | ## Step 3: Prerequisites 87 | 88 | List any prerequisite courses you think your live training could build on. This could be the live session’s companion course or a course you think students should take before the session. Prerequisites act as a guiding principle for your session and will set the topic framework, but you do not have to limit yourself in the live session to the syntax used in the prerequisite courses. 89 | 90 | [Data Manipulation with Pandas]( 91 | https://www.datacamp.com/courses/data-manipulation-with-pandas) 92 | 93 | [Market Basket Analysis in Python](https://learn.datacamp.com/courses/market-basket-analysis-in-python) 94 | 95 | ## Step 4: Session Outline 96 | 97 | A live training session usually begins with an introductory presentation, followed by the live training itself, and an ending presentation.
Your live session is expected to be around 2h30m-3h long (including Q&A) with a hard-limit at 3h30m. You can check out our live training content guidelines [here](_LINK_). 98 | 99 | 100 | **Introduction Slides** 101 | - Introduction to the webinar and instructor (led by DataCamp TA). 102 | - Summary of webinar topics. 103 | - Define Market Basket Analysis. 104 | - Discuss applications. 105 | - Introduce packages used in webinar. 106 | - Set expectations about Q&A. 107 | 108 | **Data Preparation** 109 | - Discuss Brazilian e-commerce dataset. 110 | - Import data using `pd.read_csv()`. 111 | - Define transaction and itemset. 112 | - Identify transactions in dataset using `pandas` 113 | and `numpy` methods. 114 | - Convert transactions to list of lists. 115 | - Q&A 116 | 117 | **Association Rules, Metrics, and Pruning** 118 | - One-hot encode transactions using 119 | `TransactionEncoder` from `mlxtend`. 120 | - Use `.mean()` to compute support for 121 | individual items. 122 | - Use `.mean()` to compute support for 123 | itemsets. 124 | - Compute confidence using `lambda` function. 125 | - Visualize results in `matplotlib` and `seaborn`. 126 | - Q&A 127 | 128 | **The Apriori Algorithm** 129 | - Introduce `apriori` and `association_rules` 130 | from `mlxtend`. 131 | - Use `min_support`, `max_len`, and `min_threshold` to 132 | perform pruning over itemsets and association rules. 133 | - Identify useful association rules in e-commerce dataset 134 | through the use of pruning. 135 | - Visualize results in `matplotlib` and `seaborn`. 136 | - Q&A 137 | 138 | **Ending Slides** 139 | - Summarize webinar topics. 140 | - Reference DataCamp course on MBA. 141 | - Explain additional topics covered. 142 | - Provide additional supporting material and appendices. 143 | 144 | ## Authoring your session 145 | 146 | To get yourself started with setting up your live session, follow the steps below: 147 | 148 | 1. 
Download and install the "Open in Colab" extension from [here](https://chrome.google.com/webstore/detail/open-in-colab/iogfkhleblhcpcekbiedikdehleodpjo?hl=en). This will let you take any Jupyter notebook you see in a GitHub repository and open it as a **temporary** Colab link. 149 | 2. Upload your dataset(s) to the `data` folder. 150 | 3. Upload your images, GIFs, or any other assets you want to use in the notebook to the `assets` folder. 151 | 4. Check out the notebook templates in the `notebooks` folder, and keep the template you want for your session while deleting the rest. 152 | 153 | You can author and save your progress on your notebook using **either** of these methods. 154 | 155 | _**How to author your notebook: By directly saving into GitHub**_ 156 | 157 | 1. Preview your desired notebook, click the "Open in Colab" extension, and start developing your content in Colab _(which will act as the solution code to the session)_. :warning: **Important** :warning: Your progress will **not** be saved on Google Colab since it's a temporary link. To save your progress, make sure to select `File`, then `Save a copy in GitHub`, and follow the remaining prompts. 158 | 2. Once your notebook is ready to go, give it the name `session_name_solution.ipynb`. Then create an empty version of the notebook to be filled out by you and learners during the session, and name it `session_name.ipynb`. 159 | 3. Create Colab links for both notebooks and save them in notebooks :tada: 160 | 161 | _**How to author your notebook: By uploading notebook into GitHub**_ 162 | 163 | 1. Preview your desired notebook, click the "Open in Colab" extension, and start developing your content in Colab _(which will act as the solution code to the session)_. Once you're done, select `file`, then `download .ipynb`, and overwrite the notebook by uploading it into GitHub. 164 | 2.
Once your notebook is ready to go, give it the name `session_name_solution.ipynb`. Then create an empty version of the notebook to be filled out by you and learners during the session, and name it `session_name.ipynb`. 165 | 3. Create Colab links for both notebooks and save them in notebooks :tada: 166 | 167 | 168 | You can check out either of those methods in action using this [recording](https://www.loom.com/share/1eeb148129244edd93fbc34bf5dc7f0d). 169 | -------------------------------------------------------------------------------- /assets/apriori_algorithm.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/datacamp/Market-Basket-Analysis-in-python-live-training/84ab96d17c41f8c627ff50fe414e8ca98728308f/assets/apriori_algorithm.png -------------------------------------------------------------------------------- /assets/datacamp.svg: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /assets/olist_marketplaces.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/datacamp/Market-Basket-Analysis-in-python-live-training/84ab96d17c41f8c627ff50fe414e8ca98728308f/assets/olist_marketplaces.png -------------------------------------------------------------------------------- /data/product_category_name_translation.csv: -------------------------------------------------------------------------------- 1 | product_category_name,product_category_name_english 2 | beleza_saude,health_beauty 3 | informatica_acessorios,computers_accessories 4 | automotivo,auto 5 | cama_mesa_banho,bed_bath_table 6 | moveis_decoracao,furniture_decor 7 | esporte_lazer,sports_leisure 8 | perfumaria,perfume 9 | utilidades_domesticas,housewares 10 | telefonia,telephony 11 | relogios_presentes,watches_gifts 12 | alimentos_bebidas,food_drink 13 |
bebes,baby 14 | papelaria,stationery 15 | tablets_impressao_imagem,tablets_printing_image 16 | brinquedos,toys 17 | telefonia_fixa,fixed_telephony 18 | ferramentas_jardim,garden_tools 19 | fashion_bolsas_e_acessorios,fashion_bags_accessories 20 | eletroportateis,small_appliances 21 | consoles_games,consoles_games 22 | audio,audio 23 | fashion_calcados,fashion_shoes 24 | cool_stuff,cool_stuff 25 | malas_acessorios,luggage_accessories 26 | climatizacao,air_conditioning 27 | construcao_ferramentas_construcao,construction_tools_construction 28 | moveis_cozinha_area_de_servico_jantar_e_jardim,kitchen_dining_laundry_garden_furniture 29 | construcao_ferramentas_jardim,costruction_tools_garden 30 | fashion_roupa_masculina,fashion_male_clothing 31 | pet_shop,pet_shop 32 | moveis_escritorio,office_furniture 33 | market_place,market_place 34 | eletronicos,electronics 35 | eletrodomesticos,home_appliances 36 | artigos_de_festas,party_supplies 37 | casa_conforto,home_comfort 38 | construcao_ferramentas_ferramentas,costruction_tools_tools 39 | agro_industria_e_comercio,agro_industry_and_commerce 40 | moveis_colchao_e_estofado,furniture_mattress_and_upholstery 41 | livros_tecnicos,books_technical 42 | casa_construcao,home_construction 43 | instrumentos_musicais,musical_instruments 44 | moveis_sala,furniture_living_room 45 | construcao_ferramentas_iluminacao,construction_tools_lights 46 | industria_comercio_e_negocios,industry_commerce_and_business 47 | alimentos,food 48 | artes,art 49 | moveis_quarto,furniture_bedroom 50 | livros_interesse_geral,books_general_interest 51 | construcao_ferramentas_seguranca,construction_tools_safety 52 | fashion_underwear_e_moda_praia,fashion_underwear_beach 53 | fashion_esporte,fashion_sport 54 | sinalizacao_e_seguranca,signaling_and_security 55 | pcs,computers 56 | artigos_de_natal,christmas_supplies 57 | fashion_roupa_feminina,fashion_female_clothing 58 | eletrodomesticos_2,home_appliances_2 59 | livros_importados,books_imported 60 | 
bebidas,drinks 61 | cine_foto,cine_photo 62 | la_cuisine,cuisine 63 | musica,music 64 | casa_conforto_2,home_comfort_2 65 | portateis_casa_forno_e_cafe,small_appliances_home_oven_and_coffee 66 | cds_dvds_musicais,cds_dvds_music 67 | dvds_blu_ray,dvds_blu_ray 68 | flores,flowers 69 | artes_e_artesanato,arts_and_crafts 70 | fraldas_higiene,diapers_and_hygiene 71 | fashion_roupa_infanto_juvenil,fashion_childrens_clothes 72 | seguros_e_servicos,security_and_services -------------------------------------------------------------------------------- /notebooks/Market-Basket-Analysis-in-Python.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "nbformat": 4, 3 | "nbformat_minor": 0, 4 | "metadata": { 5 | "colab": { 6 | "name": "Market-Basket-Analysis-in-Python.ipynb", 7 | "provenance": [] 8 | }, 9 | "kernelspec": { 10 | "display_name": "Python 3", 11 | "language": "python", 12 | "name": "python3" 13 | }, 14 | "language_info": { 15 | "codemirror_mode": { 16 | "name": "ipython", 17 | "version": 3 18 | }, 19 | "file_extension": ".py", 20 | "mimetype": "text/x-python", 21 | "name": "python", 22 | "nbconvert_exporter": "python", 23 | "pygments_lexer": "ipython3", 24 | "version": "3.7.1" 25 | } 26 | }, 27 | "cells": [ 28 | { 29 | "cell_type": "markdown", 30 | "metadata": { 31 | "colab_type": "text", 32 | "id": "6Ijg5wUCTQYG" 33 | }, 34 | "source": [ 35 | "

\n", 36 | "\"DataCamp\n", 37 | "

\n", 38 | "

\n", 39 | "\n", 40 | "## **Market Basket Analysis in Python**\n", 41 | "\n", 42 | "Welcome to this hands-on training event on Market Basket Analysis in Python. In this session, you will learn how to:\n", 43 | "* Identify patterns in consumer decision-making with the `mlxtend` package.\n", 44 | "* Use metrics to evaluate the properties of patterns.\n", 45 | "* Construct \"rules\" that provide concrete recommendations for businesses.\n", 46 | "* Visualize patterns and rules using `seaborn` and `matplotlib`.\n", 47 | "\n", 48 | "## **The dataset**\n", 49 | "\n", 50 | "**We'll use a dataset from a Brazilian ecommerce site (olist.com) that is divided into three CSV files:**\n", 51 | "\n", 52 | "1. `olist_order_items_dataset.csv`\n", 53 | "2. `olist_products_dataset.csv`\n", 54 | "3. `product_category_name_translation.csv`\n", 55 | "\n", 56 | "**The column definitions are as follows:**\n", 57 | "\n", 58 | "`olist_order_items_dataset.csv`:\n", 59 | "\n", 60 | "- `order_id`: The unique identifier for a transaction.\n", 61 | "- `order_item_id`: The order of an item within a transaction.\n", 62 | "- `product_id`: The unique identifier for a product.\n", 63 | "- `price`: The product's price.\n", 64 | "\n", 65 | "`olist_products_dataset.csv`:\n", 66 | "\n", 67 | "- `product_id`: The unique identifier for a product.\n", 68 | "- `product_category_name`: The name of an item's product category in Portuguese.\n", 69 | "- `product_weight_g`: The product's weight in grams.\n", 70 | "- `product_length_cm`: The product's length in centimeters.\n", 71 | "- `product_width_cm`: The product's width in centimeters.\n", 72 | "- `product_height_cm`: The product's height in centimeters.\n", 73 | "\n", 74 | "`product_category_name_translation.csv`:\n", 75 | "\n", 76 | "- `product_category_name`: The name of an item's product category in Portuguese.\n", 77 | "- `product_category_name_english`: The name of an item's product category in English.\n" 78 | ] 79 | }, 80 | { 81 | "cell_type": 
"markdown", 82 | "metadata": { 83 | "colab_type": "text", 84 | "id": "BMYfcKeDY85K" 85 | }, 86 | "source": [ 87 | "## **Data preparation**" 88 | ] 89 | }, 90 | { 91 | "cell_type": "markdown", 92 | "metadata": { 93 | "id": "y3xDirMYnuYB", 94 | "colab_type": "text" 95 | }, 96 | "source": [ 97 | "The first step in any Market Basket Analysis (MBA) project is to determine what constitutes an **item**, an **itemset**, and a **transaction**. This will depend on the dataset we're using and the question we're attempting to answer.\n", 98 | "\n", 99 | "* **Grocery store**\n", 100 | "\t* Item: Grocery\n", 101 | "\t* Itemset: Collection of groceries\n", 102 | "\t* Transaction: Basket of items purchased\n", 103 | "* **Music streaming service**\n", 104 | "\t* Item: Song\n", 105 | "\t* Itemset: Collection of unique songs\n", 106 | "\t* Transaction: User song library\n", 107 | "* **Ebook store**\n", 108 | "\t* Item: Ebook\n", 109 | "\t* Itemset: One or more ebooks\n", 110 | "\t* Transaction: User ebook library\n" 111 | ] 112 | }, 113 | { 114 | "cell_type": "markdown", 115 | "metadata": { 116 | "id": "4_gO3NX1JvFy", 117 | "colab_type": "text" 118 | }, 119 | "source": [ 120 | "**In this live training session, we'll use a dataset of transactions from olist.com, a Brazilian ecommerce site.**\n", 121 | "* 100,000+ orders over 2016-2018.\n", 122 | "* Olist connects sellers to marketplaces.\n", 123 | "* Seller can register products with Olist.\n", 124 | "* Customer makes purchase at marketplace from Olist store.\n", 125 | "* Seller fulfills orders." 
126 | ] 127 | }, 128 | { 129 | "cell_type": "markdown", 130 | "metadata": { 131 | "id": "_mo4P8zBp_9d", 132 | "colab_type": "text" 133 | }, 134 | "source": [ 135 | "\n", 136 | "\n", 137 | "---\n", 138 | "\n", 139 | "\n", 140 | "![alt](https://github.com/datacamp/Market-Basket-Analysis-in-python-live-training/blob/master/assets/olist_marketplaces.png?raw=true)\n", 141 | "\n", 142 | "\n", 143 | "\n", 144 | "\n", 145 | "\n", 146 | "---\n", 147 | "\n" 148 | ] 149 | }, 150 | { 151 | "cell_type": "markdown", 152 | "metadata": { 153 | "id": "D1HMEk73_ie6", 154 | "colab_type": "text" 155 | }, 156 | "source": [ 157 | "**What is an item?**\n", 158 | " * A product purchased from Olist.\n", 159 | "\n", 160 | "**What is an itemset?**\n", 161 | " * A collection of one or more products.\n", 162 | "\n", 163 | "**What is a transaction?**\n", 164 | " * An itemset that corresponds to a customer's order." 165 | ] 166 | }, 167 | { 168 | "cell_type": "code", 169 | "metadata": { 170 | "id": "xhliXEYb02sm", 171 | "colab_type": "code", 172 | "colab": {} 173 | }, 174 | "source": [ 175 | "# Import modules.\n", 176 | "import numpy as np\n", 177 | "import pandas as pd\n", 178 | "import matplotlib.pyplot as plt\n", 179 | "import seaborn as sns\n", 180 | "\n", 181 | "# Set default aesthetic parameters.\n", 182 | "sns.set()\n", 183 | "\n", 184 | "# Define path to data.\n", 185 | "data_path = 'https://github.com/datacamp/Market-Basket-Analysis-in-python-live-training/raw/master/data/'" 186 | ], 187 | "execution_count": null, 188 | "outputs": [] 189 | }, 190 | { 191 | "cell_type": "code", 192 | "metadata": { 193 | "colab_type": "code", 194 | "id": "EMQfyC7GUNhT", 195 | "colab": {} 196 | }, 197 | "source": [ 198 | "# Load orders dataset.\n", 199 | "\n", 200 | "\n", 201 | "# Load products dataset.\n", 202 | "\n", 203 | "\n", 204 | "# Load translations dataset.\n" 205 | ], 206 | "execution_count": null, 207 | "outputs": [] 208 | }, 209 | { 210 | "cell_type": "code", 211 | "metadata": { 212 |
"id": "KOKdds8Qe6wq", 213 | "colab_type": "code", 214 | "colab": { 215 | "base_uri": "https://localhost:8080/", 216 | "height": 204 217 | }, 218 | "outputId": "da00d052-1eef-423e-e3d0-3d30d9089bb7" 219 | }, 220 | "source": [ 221 | "# Print orders header.\n" 222 | ], 223 | "execution_count": null, 224 | "outputs": [ 225 | { 226 | "output_type": "execute_result", 227 | "data": { 228 | "text/html": [ 229 | "
\n", 230 | "\n", 243 | "\n", 244 | " \n", 245 | " \n", 246 | " \n", 247 | " \n", 248 | " \n", 249 | " \n", 250 | " \n", 251 | " \n", 252 | " \n", 253 | " \n", 254 | " \n", 255 | " \n", 256 | " \n", 257 | " \n", 258 | " \n", 259 | " \n", 260 | " \n", 261 | " \n", 262 | " \n", 263 | " \n", 264 | " \n", 265 | " \n", 266 | " \n", 267 | " \n", 268 | " \n", 269 | " \n", 270 | " \n", 271 | " \n", 272 | " \n", 273 | " \n", 274 | " \n", 275 | " \n", 276 | " \n", 277 | " \n", 278 | " \n", 279 | " \n", 280 | " \n", 281 | " \n", 282 | " \n", 283 | " \n", 284 | " \n", 285 | " \n", 286 | " \n", 287 | " \n", 288 | " \n", 289 | " \n", 290 | "
   order_id                          order_item_id  product_id                        price
0  b8bfa12431142333a0c84802f9529d87  1              765a8070ece0f1383d0f5faf913dfb9b  81.0
1  b8bfa12431142333a0c84802f9529d87  2              a41e356c76fab66334f36de622ecbd3a  99.3
2  b8bfa12431142333a0c84802f9529d87  3              765a8070ece0f1383d0f5faf913dfb9b  81.0
3  00010242fe8c5a6d1ba2dd792cb16214  1              4244733e06e7ecb4970a6e2683c13e61  58.9
4  00018f77f2f0320c557190d7a144bdd3  1              e5f2d52b802189ee658865ca93d83a8f  239.9
\n", 291 | "
" 292 | ], 293 | "text/plain": [ 294 | " order_id ... price\n", 295 | "0 b8bfa12431142333a0c84802f9529d87 ... 81.0\n", 296 | "1 b8bfa12431142333a0c84802f9529d87 ... 99.3\n", 297 | "2 b8bfa12431142333a0c84802f9529d87 ... 81.0\n", 298 | "3 00010242fe8c5a6d1ba2dd792cb16214 ... 58.9\n", 299 | "4 00018f77f2f0320c557190d7a144bdd3 ... 239.9\n", 300 | "\n", 301 | "[5 rows x 4 columns]" 302 | ] 303 | }, 304 | "metadata": { 305 | "tags": [] 306 | }, 307 | "execution_count": 3 308 | } 309 | ] 310 | }, 311 | { 312 | "cell_type": "code", 313 | "metadata": { 314 | "id": "8jxinQxfAB6e", 315 | "colab_type": "code", 316 | "colab": { 317 | "base_uri": "https://localhost:8080/", 318 | "height": 204 319 | }, 320 | "outputId": "eab66112-b228-4a67-a33b-fcdc3459ddab" 321 | }, 322 | "source": [ 323 | "# Print orders info.\n" 324 | ], 325 | "execution_count": null, 326 | "outputs": [ 327 | { 328 | "output_type": "stream", 329 | "text": [ 330 | "\n", 331 | "RangeIndex: 112650 entries, 0 to 112649\n", 332 | "Data columns (total 4 columns):\n", 333 | " # Column Non-Null Count Dtype \n", 334 | "--- ------ -------------- ----- \n", 335 | " 0 order_id 112650 non-null object \n", 336 | " 1 order_item_id 112650 non-null int64 \n", 337 | " 2 product_id 112650 non-null object \n", 338 | " 3 price 112650 non-null float64\n", 339 | "dtypes: float64(1), int64(1), object(2)\n", 340 | "memory usage: 3.4+ MB\n" 341 | ], 342 | "name": "stdout" 343 | } 344 | ] 345 | }, 346 | { 347 | "cell_type": "code", 348 | "metadata": { 349 | "id": "D5QCoN3CEaGi", 350 | "colab_type": "code", 351 | "colab": { 352 | "base_uri": "https://localhost:8080/", 353 | "height": 204 354 | }, 355 | "outputId": "01918f16-8ef2-4699-8ed1-4cd78cd9b2d6" 356 | }, 357 | "source": [ 358 | "# Print products header.\n" 359 | ], 360 | "execution_count": null, 361 | "outputs": [ 362 | { 363 | "output_type": "execute_result", 364 | "data": { 365 | "text/html": [ 366 | "
\n", 367 | "\n", 380 | "\n", 381 | " \n", 382 | " \n", 383 | " \n", 384 | " \n", 385 | " \n", 386 | " \n", 387 | " \n", 388 | " \n", 389 | " \n", 390 | " \n", 391 | " \n", 392 | " \n", 393 | " \n", 394 | " \n", 395 | " \n", 396 | " \n", 397 | " \n", 398 | " \n", 399 | " \n", 400 | " \n", 401 | " \n", 402 | " \n", 403 | " \n", 404 | " \n", 405 | " \n", 406 | " \n", 407 | " \n", 408 | " \n", 409 | " \n", 410 | " \n", 411 | " \n", 412 | " \n", 413 | " \n", 414 | " \n", 415 | " \n", 416 | " \n", 417 | " \n", 418 | " \n", 419 | " \n", 420 | " \n", 421 | " \n", 422 | " \n", 423 | " \n", 424 | " \n", 425 | " \n", 426 | " \n", 427 | " \n", 428 | " \n", 429 | " \n", 430 | " \n", 431 | " \n", 432 | " \n", 433 | " \n", 434 | " \n", 435 | " \n", 436 | " \n", 437 | " \n", 438 | " \n", 439 | "
   product_id                        product_category_name  product_weight_g  product_length_cm  product_height_cm  product_width_cm
0  1e9e8ef04dbcff4541ed26657ea517e5  perfumaria             225.0             16.0               10.0               14.0
1  3aa071139cb16b67ca9e5dea641aaa2f  artes                  1000.0            30.0               18.0               20.0
2  96bd76ec8810374ed1b65e291975717f  esporte_lazer          154.0             18.0               9.0                15.0
3  cef67bcfe19066a932b7673e239eb23d  bebes                  371.0             26.0               4.0                26.0
4  9dc1a7de274444849c219cff195d0b71  utilidades_domesticas  625.0             20.0               17.0               13.0
\n", 440 | "
" 441 | ], 442 | "text/plain": [ 443 | " product_id ... product_width_cm\n", 444 | "0 1e9e8ef04dbcff4541ed26657ea517e5 ... 14.0\n", 445 | "1 3aa071139cb16b67ca9e5dea641aaa2f ... 20.0\n", 446 | "2 96bd76ec8810374ed1b65e291975717f ... 15.0\n", 447 | "3 cef67bcfe19066a932b7673e239eb23d ... 26.0\n", 448 | "4 9dc1a7de274444849c219cff195d0b71 ... 13.0\n", 449 | "\n", 450 | "[5 rows x 6 columns]" 451 | ] 452 | }, 453 | "metadata": { 454 | "tags": [] 455 | }, 456 | "execution_count": 99 457 | } 458 | ] 459 | }, 460 | { 461 | "cell_type": "code", 462 | "metadata": { 463 | "id": "Gd93j3O_AWsU", 464 | "colab_type": "code", 465 | "colab": { 466 | "base_uri": "https://localhost:8080/", 467 | "height": 238 468 | }, 469 | "outputId": "ca205535-15b4-4c24-8829-87caa5e69d3a" 470 | }, 471 | "source": [ 472 | "# Print products info.\n" 473 | ], 474 | "execution_count": null, 475 | "outputs": [ 476 | { 477 | "output_type": "stream", 478 | "text": [ 479 | "\n", 480 | "RangeIndex: 32951 entries, 0 to 32950\n", 481 | "Data columns (total 6 columns):\n", 482 | " # Column Non-Null Count Dtype \n", 483 | "--- ------ -------------- ----- \n", 484 | " 0 product_id 32951 non-null object \n", 485 | " 1 product_category_name 32341 non-null object \n", 486 | " 2 product_weight_g 32949 non-null float64\n", 487 | " 3 product_length_cm 32949 non-null float64\n", 488 | " 4 product_height_cm 32949 non-null float64\n", 489 | " 5 product_width_cm 32949 non-null float64\n", 490 | "dtypes: float64(4), object(2)\n", 491 | "memory usage: 1.5+ MB\n" 492 | ], 493 | "name": "stdout" 494 | } 495 | ] 496 | }, 497 | { 498 | "cell_type": "code", 499 | "metadata": { 500 | "id": "AzBZGfXJEZ0P", 501 | "colab_type": "code", 502 | "colab": { 503 | "base_uri": "https://localhost:8080/", 504 | "height": 204 505 | }, 506 | "outputId": "2def817c-b13d-464a-c0b9-a6b80c090363" 507 | }, 508 | "source": [ 509 | "# Print translations header.\n" 510 | ], 511 | "execution_count": null, 512 | "outputs": [ 513 | { 514 | 
"output_type": "execute_result", 515 | "data": { 516 | "text/html": [ 517 | "
\n", 518 | "\n", 531 | "\n", 532 | " \n", 533 | " \n", 534 | " \n", 535 | " \n", 536 | " \n", 537 | " \n", 538 | " \n", 539 | " \n", 540 | " \n", 541 | " \n", 542 | " \n", 543 | " \n", 544 | " \n", 545 | " \n", 546 | " \n", 547 | " \n", 548 | " \n", 549 | " \n", 550 | " \n", 551 | " \n", 552 | " \n", 553 | " \n", 554 | " \n", 555 | " \n", 556 | " \n", 557 | " \n", 558 | " \n", 559 | " \n", 560 | " \n", 561 | " \n", 562 | " \n", 563 | " \n", 564 | " \n", 565 | " \n", 566 | "
   product_category_name   product_category_name_english
0  beleza_saude            health_beauty
1  informatica_acessorios  computers_accessories
2  automotivo              auto
3  cama_mesa_banho         bed_bath_table
4  moveis_decoracao        furniture_decor
\n", 567 | "
" 568 | ], 569 | "text/plain": [ 570 | " product_category_name product_category_name_english\n", 571 | "0 beleza_saude health_beauty\n", 572 | "1 informatica_acessorios computers_accessories\n", 573 | "2 automotivo auto\n", 574 | "3 cama_mesa_banho bed_bath_table\n", 575 | "4 moveis_decoracao furniture_decor" 576 | ] 577 | }, 578 | "metadata": { 579 | "tags": [] 580 | }, 581 | "execution_count": 100 582 | } 583 | ] 584 | }, 585 | { 586 | "cell_type": "code", 587 | "metadata": { 588 | "id": "l4ci_uMgy81Z", 589 | "colab_type": "code", 590 | "colab": {} 591 | }, 592 | "source": [ 593 | "# Print translations info.\n" 594 | ], 595 | "execution_count": null, 596 | "outputs": [] 597 | }, 598 | { 599 | "cell_type": "markdown", 600 | "metadata": { 601 | "id": "sC5SO_1LgNO5", 602 | "colab_type": "text" 603 | }, 604 | "source": [ 605 | "---\n", 606 | "

Q&A 1

\n", 607 | "\n", 608 | "---" 609 | ] 610 | }, 611 | { 612 | "cell_type": "markdown", 613 | "metadata": { 614 | "id": "3bGr9T2DGo20", 615 | "colab_type": "text" 616 | }, 617 | "source": [ 618 | "### **Translating item category names**" 619 | ] 620 | }, 621 | { 622 | "cell_type": "markdown", 623 | "metadata": { 624 | "id": "e9Y2uCaKwl3j", 625 | "colab_type": "text" 626 | }, 627 | "source": [ 628 | "**The product names are given in Portuguese.**\n", 629 | " * We'll translate the names to English using a `pandas` `DataFrame` named `translations`.\n", 630 | " * `.merge()` performs a join operation on columns or indices.\n", 631 | " * `on` is the column on which to perform the join.\n", 632 | " * `how` specifies which keys to use to perform the join. " 633 | ] 634 | }, 635 | { 636 | "cell_type": "code", 637 | "metadata": { 638 | "id": "6AXIu0a_fLuG", 639 | "colab_type": "code", 640 | "colab": {} 641 | }, 642 | "source": [ 643 | "# Translate product names to English.\n", 644 | "\n", 645 | "\n", 646 | "# Print English names.\n" 647 | ], 648 | "execution_count": null, 649 | "outputs": [] 650 | }, 651 | { 652 | "cell_type": "markdown", 653 | "metadata": { 654 | "id": "FVqfYEjwHXZM", 655 | "colab_type": "text" 656 | }, 657 | "source": [ 658 | "### **Convert product IDs to product category names.**" 659 | ] 660 | }, 661 | { 662 | "cell_type": "markdown", 663 | "metadata": { 664 | "id": "welfsnP1xJzC", 665 | "colab_type": "text" 666 | }, 667 | "source": [ 668 | "**We can work with product IDs directly, but do not have product names.**\n", 669 | " * Map product IDs to product category names, which are available in `products`.\n", 670 | " * Use another `.merge()` with `orders` and subset of `products` columns.\n", 671 | " \n", 672 | "**Using category names will also simplify the analysis, since there are fewer categories than products.**" 673 | ] 674 | }, 675 | { 676 | "cell_type": "code", 677 | "metadata": { 678 | "id": "H1wmY51JtTu7", 679 | "colab_type": "code", 680 | "colab": 
{} 681 | }, 682 | "source": [ 683 | "# Define product category name in orders DataFrame.\n" 684 | ], 685 | "execution_count": null, 686 | "outputs": [] 687 | }, 688 | { 689 | "cell_type": "code", 690 | "metadata": { 691 | "id": "kogImVcnu4q7", 692 | "colab_type": "code", 693 | "colab": {} 694 | }, 695 | "source": [ 696 | "# Print orders header.\n" 697 | ], 698 | "execution_count": null, 699 | "outputs": [] 700 | }, 701 | { 702 | "cell_type": "code", 703 | "metadata": { 704 | "id": "J906oRHjgsZE", 705 | "colab_type": "code", 706 | "colab": {} 707 | }, 708 | "source": [ 709 | "# Drop products without a defined category.\n" 710 | ], 711 | "execution_count": null, 712 | "outputs": [] 713 | }, 714 | { 715 | "cell_type": "code", 716 | "metadata": { 717 | "id": "S0DEdAgkfke2", 718 | "colab_type": "code", 719 | "colab": {} 720 | }, 721 | "source": [ 722 | "# Print number of unique items.\n" 723 | ], 724 | "execution_count": null, 725 | "outputs": [] 726 | }, 727 | { 728 | "cell_type": "code", 729 | "metadata": { 730 | "id": "sp_ZZcj7IEMD", 731 | "colab_type": "code", 732 | "colab": {} 733 | }, 734 | "source": [ 735 | "# Print number of unique categories.\n" 736 | ], 737 | "execution_count": null, 738 | "outputs": [] 739 | }, 740 | { 741 | "cell_type": "markdown", 742 | "metadata": { 743 | "id": "Yxp4Dk15IP9g", 744 | "colab_type": "text" 745 | }, 746 | "source": [ 747 | "**Insight**: Performing \"aggregation\" up to the product category level reduces the number of potential itemsets from $2^{32328}$ to $2^{71}$." 
748 | ] 749 | }, 750 | { 751 | "cell_type": "markdown", 752 | "metadata": { 753 | "id": "_z5WqVXFIn23", 754 | "colab_type": "text" 755 | }, 756 | "source": [ 757 | "### **Construct transactions from order and product data**" 758 | ] 759 | }, 760 | { 761 | "cell_type": "markdown", 762 | "metadata": { 763 | "id": "fDofpDQ8zw7n", 764 | "colab_type": "text" 765 | }, 766 | "source": [ 767 | "* **We will perform Market Basket Analysis on transactions.**\n", 768 | " * A transaction consists of the unique items purchased by a customer.\n", 769 | "* **Need to extract transactions from orders `DataFrame`.**\n", 770 | " * Group all items in an order." 771 | ] 772 | }, 773 | { 774 | "cell_type": "code", 775 | "metadata": { 776 | "id": "LObvSR1bfs8N", 777 | "colab_type": "code", 778 | "colab": {} 779 | }, 780 | "source": [ 781 | "# Identify transactions associated with example order.\n", 782 | "example1 = orders[orders['order_id'] == 'fe64170e936bc5f6a6a41def260984b9']['product_category_name_english']\n", 783 | "\n", 784 | "# Print example.\n" 785 | ], 786 | "execution_count": null, 787 | "outputs": [] 788 | }, 789 | { 790 | "cell_type": "code", 791 | "metadata": { 792 | "id": "B2EEHJpPWOVJ", 793 | "colab_type": "code", 794 | "colab": {} 795 | }, 796 | "source": [ 797 | "# Identify transactions associated with example order.\n", 798 | "example2 = orders[orders['order_id'] == 'fffb9224b6fc7c43ebb0904318b10b5f']['product_category_name_english']\n", 799 | "\n", 800 | "# Print example.\n" 801 | ], 802 | "execution_count": null, 803 | "outputs": [] 804 | }, 805 | { 806 | "cell_type": "markdown", 807 | "metadata": { 808 | "id": "n83EDUs0Wa_2", 809 | "colab_type": "text" 810 | }, 811 | "source": [ 812 | "**Insight**: Aggregation reduces the number of items and, therefore, itemsets." 
813 | ] 814 | }, 815 | { 816 | "cell_type": "markdown", 817 | "metadata": { 818 | "id": "0yVc5cmhCHt6", 819 | "colab_type": "text" 820 | }, 821 | "source": [ 822 | "**Map `orders` to `transactions`.**\n", 823 | "* `.groupby()` splits a `DataFrame` into groups according to some criterion.\n", 824 | "* `.unique()` returns an array of the unique values." 825 | ] 826 | }, 827 | { 828 | "cell_type": "code", 829 | "metadata": { 830 | "id": "nXES2DJ3Ry8l", 831 | "colab_type": "code", 832 | "colab": {} 833 | }, 834 | "source": [ 835 | "# Recover transaction itemsets from orders DataFrame.\n", 836 | "\n", 837 | "\n", 838 | "# Print transactions header.\n" 839 | ], 840 | "execution_count": null, 841 | "outputs": [] 842 | }, 843 | { 844 | "cell_type": "code", 845 | "metadata": { 846 | "id": "l06VMDNQfzqZ", 847 | "colab_type": "code", 848 | "colab": { 849 | "base_uri": "https://localhost:8080/", 850 | "height": 564 851 | }, 852 | "outputId": "bcf16881-eabe-4f22-b70a-18c097c4c0a8" 853 | }, 854 | "source": [ 855 | "# Plot 50 largest categories of transactions.\n" 856 | ], 857 | "execution_count": null, 858 | "outputs": [ 859 | { 860 | "output_type": "execute_result", 861 | "data": { 862 | "text/plain": [ 863 | "" 864 | ] 865 | }, 866 | "metadata": { 867 | "tags": [] 868 | }, 869 | "execution_count": 59 870 | }, 871 | { 872 | "output_type": "display_data", 873 | "data": { 874 | "image/png": "\n", 875 | "text/plain": [ 876 | "
" 877 | ] 878 | }, 879 | "metadata": { 880 | "tags": [] 881 | } 882 | } 883 | ] 884 | }, 885 | { 886 | "cell_type": "markdown", 887 | "metadata": { 888 | "id": "aDSZJBjJT3Iw", 889 | "colab_type": "text" 890 | }, 891 | "source": [ 892 | "**Insight 1:** The most common itemsets consist of a single item.\n", 893 | "\n", 894 | "**Insight 2:** There's a long tail of categories that consist of infrequently purchased items." 895 | ] 896 | }, 897 | { 898 | "cell_type": "markdown", 899 | "metadata": { 900 | "id": "ApMsvYgwHqIl", 901 | "colab_type": "text" 902 | }, 903 | "source": [ 904 | "**Use `.tolist()` to transform a `DataFrame` or `Series` object into a list.**" 905 | ] 906 | }, 907 | { 908 | "cell_type": "code", 909 | "metadata": { 910 | "id": "nrBBAg9kf5R1", 911 | "colab_type": "code", 912 | "colab": {} 913 | }, 914 | "source": [ 915 | "# Convert the pandas series to list of lists.\n" 916 | ], 917 | "execution_count": null, 918 | "outputs": [] 919 | }, 920 | { 921 | "cell_type": "markdown", 922 | "metadata": { 923 | "id": "J_JdOGzOVUed", 924 | "colab_type": "text" 925 | }, 926 | "source": [ 927 | "### **Summarize final transaction data**" 928 | ] 929 | }, 930 | { 931 | "cell_type": "code", 932 | "metadata": { 933 | "id": "NXo1XlNh2nvV", 934 | "colab_type": "code", 935 | "colab": {} 936 | }, 937 | "source": [ 938 | "# Print length of transactions.\n" 939 | ], 940 | "execution_count": null, 941 | "outputs": [] 942 | }, 943 | { 944 | "cell_type": "code", 945 | "metadata": { 946 | "id": "KHWb2YGe2n__", 947 | "colab_type": "code", 948 | "colab": {} 949 | }, 950 | "source": [ 951 | "# Count number of unique item categories for each transaction.\n" 952 | ], 953 | "execution_count": null, 954 | "outputs": [] 955 | }, 956 | { 957 | "cell_type": "code", 958 | "metadata": { 959 | "id": "5dcX5qwN2njx", 960 | "colab_type": "code", 961 | "colab": {} 962 | }, 963 | "source": [ 964 | "# Print median number of items in a transaction.\n" 965 | ], 966 | "execution_count": null, 967 | 
"outputs": [] 968 | }, 969 | { 970 | "cell_type": "code", 971 | "metadata": { 972 | "id": "bZm2vOczf7mB", 973 | "colab_type": "code", 974 | "colab": {} 975 | }, 976 | "source": [ 977 | "# Print maximum number of items in a transaction.\n" 978 | ], 979 | "execution_count": null, 980 | "outputs": [] 981 | }, 982 | { 983 | "cell_type": "markdown", 984 | "metadata": { 985 | "id": "xLOYfyVps4Uu", 986 | "colab_type": "text" 987 | }, 988 | "source": [ 989 | "---\n", 990 | "

Q&A 2

\n", 991 | "\n", 992 | "---" 993 | ] 994 | }, 995 | { 996 | "cell_type": "markdown", 997 | "metadata": { 998 | "id": "tXqmKHdXiCt6", 999 | "colab_type": "text" 1000 | }, 1001 | "source": [ 1002 | "## **Association Rules and Metrics**" 1003 | ] 1004 | }, 1005 | { 1006 | "cell_type": "markdown", 1007 | "metadata": { 1008 | "id": "RQnsrXg7aKgS", 1009 | "colab_type": "text" 1010 | }, 1011 | "source": [ 1012 | "**Association rule:** an \"if-then\" relationship between two itemsets.\n", 1013 | " * **rule:** if *{coffee}* then *{milk}*.\n", 1014 | " * **antecedent:** coffee\n", 1015 | " * **consequent:** milk\n", 1016 | "\n", 1017 | "**Metric:** a measure of the strength of association between two itemsets.\n", 1018 | " * **rule:** if *{coffee}* then *{milk}*\n", 1019 | " * **support:** 0.10\n", 1020 | " * **leverage:** 0.03\n", 1021 | "\n" 1022 | ] 1023 | }, 1024 | { 1025 | "cell_type": "markdown", 1026 | "metadata": { 1027 | "id": "QbFpAuLocyqt", 1028 | "colab_type": "text" 1029 | }, 1030 | "source": [ 1031 | "### **One-hot encode the transaction data**" 1032 | ] 1033 | }, 1034 | { 1035 | "cell_type": "markdown", 1036 | "metadata": { 1037 | "id": "jT0FHyUfIDNC", 1038 | "colab_type": "text" 1039 | }, 1040 | "source": [ 1041 | "* **One-hot encoding data.**\n", 1042 | " * `TransactionEncoder()` instantiates an encoder object.\n", 1043 | " * `.fit()` creates mapping between list and one-hot encoding.\n", 1044 | " * `.transform()` transforms list into one-hot encoded array." 1045 | ] 1046 | }, 1047 | { 1048 | "cell_type": "markdown", 1049 | "metadata": { 1050 | "id": "3Y65e_9jzBUw", 1051 | "colab_type": "text" 1052 | }, 1053 | "source": [ 1054 | "* **Applying one-hot encoding will transform the list of lists (of transactions) into a `DataFrame`.**\n", 1055 | " * The columns correspond to item categories and the rows correspond to transactions. 
A `True` value indicates that a transaction contains an item from the corresponding category.\n", 1056 | "* **One-hot encoding simplifies the computation of metrics.**\n", 1057 | " * We will also use a one-hot encoded `DataFrame` as an input to different `mlxtend` functions." 1058 | ] 1059 | }, 1060 | { 1061 | "cell_type": "code", 1062 | "metadata": { 1063 | "id": "PADu6cwylDWC", 1064 | "colab_type": "code", 1065 | "colab": {} 1066 | }, 1067 | "source": [ 1068 | "from mlxtend.preprocessing import TransactionEncoder\n", 1069 | "\n", 1070 | "# Instantiate an encoder.\n", 1071 | "\n", 1072 | "\n", 1073 | "# Fit encoder to list of lists.\n", 1074 | "\n", 1075 | "\n", 1076 | "# Transform lists into one-hot encoded array.\n", 1077 | "\n", 1078 | "\n", 1079 | "# Convert array to pandas DataFrame.\n" 1080 | ], 1081 | "execution_count": null, 1082 | "outputs": [] 1083 | }, 1084 | { 1085 | "cell_type": "code", 1086 | "metadata": { 1087 | "id": "uXvgq0wclEZ_", 1088 | "colab_type": "code", 1089 | "colab": {} 1090 | }, 1091 | "source": [ 1092 | "# Print header.\n" 1093 | ], 1094 | "execution_count": null, 1095 | "outputs": [] 1096 | }, 1097 | { 1098 | "cell_type": "markdown", 1099 | "metadata": { 1100 | "id": "QcQf1RW7ffzc", 1101 | "colab_type": "text" 1102 | }, 1103 | "source": [ 1104 | "\n", 1105 | "### **Compute the support metric**\n", 1106 | "\n", 1107 | "* Support measures the frequency with which an itemset appears in a database of transactions." 
1108 | ] 1109 | }, 1110 | { 1111 | "cell_type": "markdown", 1112 | "metadata": { 1113 | "id": "mwjsbXSKByym", 1114 | "colab_type": "text" 1115 | }, 1116 | "source": [ 1117 | "\n", 1118 | "$$support(X) = \\frac{\\text{number of transactions containing X}}{\\text{total number of transactions}}$$" 1119 | ] 1120 | }, 1121 | { 1122 | "cell_type": "markdown", 1123 | "metadata": { 1124 | "id": "rdoVfdc_H8KU", 1125 | "colab_type": "text" 1126 | }, 1127 | "source": [ 1128 | "* `.mean(axis=0)` computes support values for one-hot encoded `DataFrame`. \n", 1129 | "* A high support value indicates that items in an itemset are purchased together frequently and, thus, are associated with each other." 1130 | ] 1131 | }, 1132 | { 1133 | "cell_type": "code", 1134 | "metadata": { 1135 | "id": "n3E8jFSelMRj", 1136 | "colab_type": "code", 1137 | "colab": {} 1138 | }, 1139 | "source": [ 1140 | "# Print support metric over all rows for each column.\n" 1141 | ], 1142 | "execution_count": null, 1143 | "outputs": [] 1144 | }, 1145 | { 1146 | "cell_type": "markdown", 1147 | "metadata": { 1148 | "id": "k4xq0z7IdwmS", 1149 | "colab_type": "text" 1150 | }, 1151 | "source": [ 1152 | "**Observation:** In retail and ecommerce settings, any particular item is likely to account for a small share of transactions. Here, we've aggregated up to the product category level and very popular categories are still only present in 5% of transactions. Consequently, itemsets with 2 or more item categories will account for a vanishingly small share of total transactions (e.g. 0.01%)." 
1153 | ] 1154 | }, 1155 | { 1156 | "cell_type": "markdown", 1157 | "metadata": { 1158 | "id": "7NNdtDOXiC8z", 1159 | "colab_type": "text" 1160 | }, 1161 | "source": [ 1162 | "### **Compute the item count distribution over transactions**" 1163 | ] 1164 | }, 1165 | { 1166 | "cell_type": "markdown", 1167 | "metadata": { 1168 | "id": "Aey6WvpMM_26", 1169 | "colab_type": "text" 1170 | }, 1171 | "source": [ 1172 | "* `onehot.sum(axis=1)` sums across the columns in a `DataFrame`. " 1173 | ] 1174 | }, 1175 | { 1176 | "cell_type": "code", 1177 | "metadata": { 1178 | "id": "q87IRIx0lR9U", 1179 | "colab_type": "code", 1180 | "colab": {} 1181 | }, 1182 | "source": [ 1183 | "# Print distribution of item counts.\n" 1184 | ], 1185 | "execution_count": null, 1186 | "outputs": [] 1187 | }, 1188 | { 1189 | "cell_type": "markdown", 1190 | "metadata": { 1191 | "id": "fWmrboxOhnr4", 1192 | "colab_type": "text" 1193 | }, 1194 | "source": [ 1195 | "**Insight:** Only 726 transactions contain more than one item category. We may want to consider whether aggregation discards too many multi-item itemsets." 
1196 | ] 1197 | }, 1198 | { 1199 | "cell_type": "markdown", 1200 | "metadata": { 1201 | "id": "fj7yQ2DqiQkh", 1202 | "colab_type": "text" 1203 | }, 1204 | "source": [ 1205 | "### **Create a column for an itemset with multiple items**" 1206 | ] 1207 | }, 1208 | { 1209 | "cell_type": "markdown", 1210 | "metadata": { 1211 | "id": "3ue1byfs4ejs", 1212 | "colab_type": "text" 1213 | }, 1214 | "source": [ 1215 | "* **We can create multi-item columns using the logical AND operation.**\n", 1216 | " * `True & True = True`\n", 1217 | " * `True & False = False`\n", 1218 | " * `False & True = False`\n", 1219 | " * `False & False = False`" 1220 | ] 1221 | }, 1222 | { 1223 | "cell_type": "code", 1224 | "metadata": { 1225 | "id": "immnq5stlWaf", 1226 | "colab_type": "code", 1227 | "colab": {} 1228 | }, 1229 | "source": [ 1230 | "# Add sports_leisure and health_beauty to DataFrame.\n", 1231 | "\n", 1232 | "\n", 1233 | "# Print support value.\n" 1234 | ], 1235 | "execution_count": null, 1236 | "outputs": [] 1237 | }, 1238 | { 1239 | "cell_type": "markdown", 1240 | "metadata": { 1241 | "id": "VHlEhjUuikdj", 1242 | "colab_type": "text" 1243 | }, 1244 | "source": [ 1245 | "**Insight:** Only 0.014% of transactions contain a product from both the sports and leisure, and health and beauty categories. These are typically the type of numbers we will work with when we set pruning thresholds in the following section." 
1246 | ] 1247 | }, 1248 | { 1249 | "cell_type": "markdown", 1250 | "metadata": { 1251 | "id": "BvoKwShnjC4z", 1252 | "colab_type": "text" 1253 | }, 1254 | "source": [ 1255 | "### **Aggregate the dataset further by combining product sub-categories**" 1256 | ] 1257 | }, 1258 | { 1259 | "cell_type": "markdown", 1260 | "metadata": { 1261 | "id": "vHIEvm0zjLk7", 1262 | "colab_type": "text" 1263 | }, 1264 | "source": [ 1265 | "* **We can use the inclusive OR operation to combine multiple categories.**\n", 1266 | " * `True | True = True`\n", 1267 | " * `True | False = True`\n", 1268 | " * `False | True = True`\n", 1269 | " * `False | False = False`" 1270 | ] 1271 | }, 1272 | { 1273 | "cell_type": "code", 1274 | "metadata": { 1275 | "id": "qLGroyNZlX1U", 1276 | "colab_type": "code", 1277 | "colab": {} 1278 | }, 1279 | "source": [ 1280 | "# Merge books_imported and books_technical.\n", 1281 | "\n", 1282 | "\n", 1283 | "# Print support values for books, books_imported, and books_technical.\n" 1284 | ], 1285 | "execution_count": null, 1286 | "outputs": [] 1287 | }, 1288 | { 1289 | "cell_type": "markdown", 1290 | "metadata": { 1291 | "id": "9E2CHkMfqHx8", 1292 | "colab_type": "text" 1293 | }, 1294 | "source": [ 1295 | "### **Compute the confidence metric**" 1296 | ] 1297 | }, 1298 | { 1299 | "cell_type": "markdown", 1300 | "metadata": { 1301 | "id": "vcPL0Iy3rY2m", 1302 | "colab_type": "text" 1303 | }, 1304 | "source": [ 1305 | "* **The support metric doesn't provide information about direction.**\n", 1306 | " * $support(antecedent, consequent) = support(consequent, antecedent)$\n", 1307 | "\n", 1308 | "* **The confidence metric has a direction.**\n", 1309 | " * Conditional probability of the consequent, given the antecedent." 
1310 | ] 1311 | }, 1312 | { 1313 | "cell_type": "markdown", 1314 | "metadata": { 1315 | "id": "C2JLWdbnr8Nl", 1316 | "colab_type": "text" 1317 | }, 1318 | "source": [ 1319 | "$$confidence(antecedent \\rightarrow consequent)= \\frac{support(antecedent, consequent)}{support(antecedent)}$$" 1320 | ] 1321 | }, 1322 | { 1323 | "cell_type": "markdown", 1324 | "metadata": { 1325 | "id": "RC3Zc1uWHsm3", 1326 | "colab_type": "text" 1327 | }, 1328 | "source": [ 1329 | "* A high value of confidence indicates that the antecedent and consequent are associated and that the direction of the association runs from the antecedent to the consequent." 1330 | ] 1331 | }, 1332 | { 1333 | "cell_type": "code", 1334 | "metadata": { 1335 | "id": "US-Z5hs7qGFl", 1336 | "colab_type": "code", 1337 | "colab": {} 1338 | }, 1339 | "source": [ 1340 | "# Compute joint support for sports_leisure and health_beauty.\n", 1341 | "\n", 1342 | "\n", 1343 | "# Print confidence metric for sports_leisure -> health_beauty.\n" 1344 | ], 1345 | "execution_count": null, 1346 | "outputs": [] 1347 | }, 1348 | { 1349 | "cell_type": "code", 1350 | "metadata": { 1351 | "id": "oRacycCMtKeh", 1352 | "colab_type": "code", 1353 | "colab": {} 1354 | }, 1355 | "source": [ 1356 | "# Print confidence for health_beauty -> sports_leisure.\n" 1357 | ], 1358 | "execution_count": null, 1359 | "outputs": [] 1360 | }, 1361 | { 1362 | "cell_type": "markdown", 1363 | "metadata": { 1364 | "id": "QC_SuQBMtRsa", 1365 | "colab_type": "text" 1366 | }, 1367 | "source": [ 1368 | "**Insight:** $confidence(sports\\_leisure \\rightarrow health\\_beauty)$ was higher than $confidence(health\\_beauty \\rightarrow sports\\_leisure)$. Since the two have the same joint support, the confidence measures will differ only by the antecedent support. The higher confidence metric means that the antecedent has *lower* support." 
1369 | ] 1370 | }, 1371 | { 1372 | "cell_type": "markdown", 1373 | "metadata": { 1374 | "id": "iKm1vKDFldpt", 1375 | "colab_type": "text" 1376 | }, 1377 | "source": [ 1378 | "---\n", 1379 | "

Q&A 3

\n", 1380 | "\n", 1381 | "---" 1382 | ] 1383 | }, 1384 | { 1385 | "cell_type": "markdown", 1386 | "metadata": { 1387 | "id": "kXwJcMyViCcW", 1388 | "colab_type": "text" 1389 | }, 1390 | "source": [ 1391 | "## **The Apriori Algorithm and Pruning**" 1392 | ] 1393 | }, 1394 | { 1395 | "cell_type": "markdown", 1396 | "metadata": { 1397 | "id": "h7JGKJX3wsYK", 1398 | "colab_type": "text" 1399 | }, 1400 | "source": [ 1401 | "**The Apriori algorithm** identifies frequent (high-support) itemsets using the Apriori principle, which states that any superset of an infrequent itemset is also infrequent." 1402 | ] 1403 | }, 1404 | { 1405 | "cell_type": "markdown", 1406 | "metadata": { 1407 | "id": "Xru6-VBAwZz9", 1408 | "colab_type": "text" 1409 | }, 1410 | "source": [ 1411 | "![alt](https://github.com/datacamp/Market-Basket-Analysis-in-python-live-training/blob/master/assets/apriori_algorithm.png?raw=True)" 1412 | ] 1413 | }, 1414 | { 1415 | "cell_type": "markdown", 1416 | "metadata": { 1417 | "id": "-tgWgdJozS3g", 1418 | "colab_type": "text" 1419 | }, 1420 | "source": [ 1421 | "**Pruning** is the process of removing itemsets or association rules, typically based on the application of a metric threshold. 
" 1422 | ] 1423 | }, 1424 | { 1425 | "cell_type": "markdown", 1426 | "metadata": { 1427 | "id": "O-nGUl2Cx951", 1428 | "colab_type": "text" 1429 | }, 1430 | "source": [ 1431 | "**The `mlxtend` module will enable us to apply the Apriori algorithm, perform pruning, and compute association rules.**" 1432 | ] 1433 | }, 1434 | { 1435 | "cell_type": "markdown", 1436 | "metadata": { 1437 | "id": "RPqHxNBczJFD", 1438 | "colab_type": "text" 1439 | }, 1440 | "source": [ 1441 | "### **Applying the Apriori algorithm**" 1442 | ] 1443 | }, 1444 | { 1445 | "cell_type": "markdown", 1446 | "metadata": { 1447 | "id": "_9_EnUE5NSYC", 1448 | "colab_type": "text" 1449 | }, 1450 | "source": [ 1451 | "* Use `apriori()` to identify frequent itemsets.\n", 1452 | "* `min_support` set the item frequency threshold used for pruning." 1453 | ] 1454 | }, 1455 | { 1456 | "cell_type": "code", 1457 | "metadata": { 1458 | "id": "oTdaZ39VljgV", 1459 | "colab_type": "code", 1460 | "colab": {} 1461 | }, 1462 | "source": [ 1463 | "from mlxtend.frequent_patterns import apriori\n", 1464 | "\n", 1465 | "# Apply apriori algorithm to data with min support threshold of 0.01.\n", 1466 | "\n", 1467 | "\n", 1468 | "# Print frequent itemsets.\n" 1469 | ], 1470 | "execution_count": null, 1471 | "outputs": [] 1472 | }, 1473 | { 1474 | "cell_type": "markdown", 1475 | "metadata": { 1476 | "id": "iQ3gYEK2yPCi", 1477 | "colab_type": "text" 1478 | }, 1479 | "source": [ 1480 | "**Observation 1:** `apriori` returns a `DataFrame` with a `support` column and an `itemsets` column.\n", 1481 | "\n", 1482 | "**Observation 2:** By default `apriori` returns itemset numbers, rather than labels. We can change this by using the `use_colnames` parameter.\n", 1483 | "\n", 1484 | "**Insight:** All itemsets with a support of greater than 0.01 contain a single item." 
1485 | ] 1486 | }, 1487 | { 1488 | "cell_type": "markdown", 1489 | "metadata": { 1490 | "id": "aOawxLPlN0O3", 1491 | "colab_type": "text" 1492 | }, 1493 | "source": [ 1494 | "* Use `use_colnames` to use item names, rather than integer IDs." 1495 | ] 1496 | }, 1497 | { 1498 | "cell_type": "code", 1499 | "metadata": { 1500 | "id": "L_MrF6Ckllde", 1501 | "colab_type": "code", 1502 | "colab": {} 1503 | }, 1504 | "source": [ 1505 | "# Apply apriori algorithm to data with min support threshold of 0.001.\n", 1506 | "\n", 1507 | "\n", 1508 | "# Print frequent itemsets.\n" 1509 | ], 1510 | "execution_count": null, 1511 | "outputs": [] 1512 | }, 1513 | { 1514 | "cell_type": "markdown", 1515 | "metadata": { 1516 | "id": "jHkW8KmCyp0h", 1517 | "colab_type": "text" 1518 | }, 1519 | "source": [ 1520 | "**Insight:** Lowering the support threshold increased the number of itemsets returned and even yielded itemsets with more than one item." 1521 | ] 1522 | }, 1523 | { 1524 | "cell_type": "code", 1525 | "metadata": { 1526 | "id": "lT7h9l_Glnf6", 1527 | "colab_type": "code", 1528 | "colab": {} 1529 | }, 1530 | "source": [ 1531 | "# Apply apriori algorithm to data with min support threshold of 0.00005.\n", 1532 | "\n", 1533 | "\n", 1534 | "# Print frequent itemsets.\n" 1535 | ], 1536 | "execution_count": null, 1537 | "outputs": [] 1538 | }, 1539 | { 1540 | "cell_type": "markdown", 1541 | "metadata": { 1542 | "id": "pHmgv5bqzYmN", 1543 | "colab_type": "text" 1544 | }, 1545 | "source": [ 1546 | "**Observation:** Notice how low we must set the support threshold (0.005%) to return a high number of itemsets with more than one item." 
1547 | ] 1548 | }, 1549 | { 1550 | "cell_type": "code", 1551 | "metadata": { 1552 | "id": "j273yq0Alo0H", 1553 | "colab_type": "code", 1554 | "colab": {} 1555 | }, 1556 | "source": [ 1557 | "# Apply apriori algorithm to data with a two-item limit.\n" 1558 | ], 1559 | "execution_count": null, 1560 | "outputs": [] 1561 | }, 1562 | { 1563 | "cell_type": "markdown", 1564 | "metadata": { 1565 | "id": "CTo4IKmy0BXr", 1566 | "colab_type": "text" 1567 | }, 1568 | "source": [ 1569 | "**Insight:** What do we gain from the apriori algorithm? We start off with $2^{71}$ potential itemsets and immediately reduce it to 113 without enumerating all $2^{71}$ itemsets." 1570 | ] 1571 | }, 1572 | { 1573 | "cell_type": "markdown", 1574 | "metadata": { 1575 | "id": "kBAjlmz-zuWk", 1576 | "colab_type": "text" 1577 | }, 1578 | "source": [ 1579 | "### **Computing association rules from Apriori output**" 1580 | ] 1581 | }, 1582 | { 1583 | "cell_type": "markdown", 1584 | "metadata": { 1585 | "id": "E74Qv6fTOARv", 1586 | "colab_type": "text" 1587 | }, 1588 | "source": [ 1589 | "* Use `association_rules()` to compute and prune association rules from output of `apriori()`." 
1590 | ] 1591 | }, 1592 | { 1593 | "cell_type": "code", 1594 | "metadata": { 1595 | "id": "AF6jhDkmlpM8", 1596 | "colab_type": "code", 1597 | "colab": {} 1598 | }, 1599 | "source": [ 1600 | "from mlxtend.frequent_patterns import association_rules\n", 1601 | "\n", 1602 | "# Recover association rules using support and a minimum threshold of 0.0001.\n", 1603 | "\n", 1604 | "\n", 1605 | "# Print rules header.\n" 1606 | ], 1607 | "execution_count": null, 1608 | "outputs": [] 1609 | }, 1610 | { 1611 | "cell_type": "markdown", 1612 | "metadata": { 1613 | "id": "sz3aVycbz6pt", 1614 | "colab_type": "text" 1615 | }, 1616 | "source": [ 1617 | "**Notice that `association_rules` automatically computes seven metrics.**" 1618 | ] 1619 | }, 1620 | { 1621 | "cell_type": "markdown", 1622 | "metadata": { 1623 | "id": "3_rM_sYn0nPa", 1624 | "colab_type": "text" 1625 | }, 1626 | "source": [ 1627 | "### **Pruning association rules**" 1628 | ] 1629 | }, 1630 | { 1631 | "cell_type": "code", 1632 | "metadata": { 1633 | "id": "jejN-n9Blql6", 1634 | "colab_type": "code", 1635 | "colab": {} 1636 | }, 1637 | "source": [ 1638 | "# Recover association rules using confidence threshold of 0.01.\n", 1639 | "\n", 1640 | "\n", 1641 | "# Print rules.\n" 1642 | ], 1643 | "execution_count": null, 1644 | "outputs": [] 1645 | }, 1646 | { 1647 | "cell_type": "code", 1648 | "metadata": { 1649 | "id": "_JhzujmIlv7C", 1650 | "colab_type": "code", 1651 | "colab": {} 1652 | }, 1653 | "source": [ 1654 | "# Select rules with a consequent support above 0.095.\n", 1655 | "\n", 1656 | "\n", 1657 | "# Print rules.\n" 1658 | ], 1659 | "execution_count": null, 1660 | "outputs": [] 1661 | }, 1662 | { 1663 | "cell_type": "markdown", 1664 | "metadata": { 1665 | "id": "nsSaO4EU2mwX", 1666 | "colab_type": "text" 1667 | }, 1668 | "source": [ 1669 | "### **The leverage metric**\n", 1670 | "\n", 1671 | "* **Leverage provides a sanity check.**\n", 1672 | " * $support(antecedent, consequent)$ = joint support in data.\n", 1673 | 
" * $support(antecedent) * support(consequent)$ = expected joint support for unrelated antecedent and consequent." 1674 | ] 1675 | }, 1676 | { 1677 | "cell_type": "markdown", 1678 | "metadata": { 1679 | "id": "mfYjXEBTIqj7", 1680 | "colab_type": "text" 1681 | }, 1682 | "source": [ 1683 | "* **Leverage formula**\n", 1684 | " * $$leverage(antecendent, consequent) = \n", 1685 | "support(antecedent, consequent) - support(antecedent) * support(consequent)$$" 1686 | ] 1687 | }, 1688 | { 1689 | "cell_type": "markdown", 1690 | "metadata": { 1691 | "id": "b9AxLt1rIqQU", 1692 | "colab_type": "text" 1693 | }, 1694 | "source": [ 1695 | "* **For most problems, we will discard itemsets with negative leverage.**\n", 1696 | " * Negative leverage means that the items appear together less frequently than we would expect if they were randomly and independently distributed across transactions." 1697 | ] 1698 | }, 1699 | { 1700 | "cell_type": "code", 1701 | "metadata": { 1702 | "id": "6Cjpf3B8lwVG", 1703 | "colab_type": "code", 1704 | "colab": {} 1705 | }, 1706 | "source": [ 1707 | "# Select rules with leverage higher than 0.0.\n", 1708 | "\n", 1709 | "\n", 1710 | "# Print rules.\n" 1711 | ], 1712 | "execution_count": null, 1713 | "outputs": [] 1714 | }, 1715 | { 1716 | "cell_type": "markdown", 1717 | "metadata": { 1718 | "id": "JFSSJq5u5qmQ", 1719 | "colab_type": "text" 1720 | }, 1721 | "source": [ 1722 | "**Insight:** The Apriori algorithm reduced the number of itemsets from $2^{71}$ to 113. Pruning allowed us to identify to a single association rule that could be useful for cross-promotional purposes: $\\{home\\_comfort\\} \\rightarrow \\{bed\\_bath\\_table\\}$." 
1723 | ] 1724 | }, 1725 | { 1726 | "cell_type": "markdown", 1727 | "metadata": { 1728 | "id": "mbqWXtzR0sif", 1729 | "colab_type": "text" 1730 | }, 1731 | "source": [ 1732 | "### **Visualizing patterns in metrics**" 1733 | ] 1734 | }, 1735 | { 1736 | "cell_type": "markdown", 1737 | "metadata": { 1738 | "id": "jdIvXojWOphd", 1739 | "colab_type": "text" 1740 | }, 1741 | "source": [ 1742 | "* `sns.scatterplot()` creates a scatterplot from two columns in a `DataFrame`." 1743 | ] 1744 | }, 1745 | { 1746 | "cell_type": "code", 1747 | "metadata": { 1748 | "id": "JiA_CqVLlyss", 1749 | "colab_type": "code", 1750 | "colab": { 1751 | "base_uri": "https://localhost:8080/", 1752 | "height": 356 1753 | }, 1754 | "outputId": "51ef2101-bf5c-4f6d-81fb-a6c0745f12f3" 1755 | }, 1756 | "source": [ 1757 | "# Recover association rules with a minimum support greater than 0.000001.\n", 1758 | "\n", 1759 | "\n", 1760 | "# Plot leverage against confidence.\n", 1761 | "\n" 1762 | ], 1763 | "execution_count": null, 1764 | "outputs": [ 1765 | { 1766 | "output_type": "execute_result", 1767 | "data": { 1768 | "text/plain": [ 1769 | "" 1770 | ] 1771 | }, 1772 | "metadata": { 1773 | "tags": [] 1774 | }, 1775 | "execution_count": 130 1776 | }, 1777 | { 1778 | "output_type": "display_data", 1779 | "data": { 1780 | "image/png": "\n", 1781 | "text/plain": [ 1782 | "
" 1783 | ] 1784 | }, 1785 | "metadata": { 1786 | "tags": [] 1787 | } 1788 | } 1789 | ] 1790 | }, 1791 | { 1792 | "cell_type": "markdown", 1793 | "metadata": { 1794 | "id": "v95xAd8803y3", 1795 | "colab_type": "text" 1796 | }, 1797 | "source": [ 1798 | "**Insight 1**: Leverage and confidence contain some of the same information about the strength of an association." 1799 | ] 1800 | }, 1801 | { 1802 | "cell_type": "markdown", 1803 | "metadata": { 1804 | "id": "S-BoO_JMmkdV", 1805 | "colab_type": "text" 1806 | }, 1807 | "source": [ 1808 | "---\n", 1809 | "

Q&A 4

\n", 1810 | "\n", 1811 | "---" 1812 | ] 1813 | }, 1814 | { 1815 | "cell_type": "code", 1816 | "metadata": { 1817 | "id": "j7v4_LSImSy2", 1818 | "colab_type": "code", 1819 | "colab": {} 1820 | }, 1821 | "source": [ 1822 | "# Experiment on your own here." 1823 | ], 1824 | "execution_count": null, 1825 | "outputs": [] 1826 | } 1827 | ] 1828 | } --------------------------------------------------------------------------------