├── CaseStudy#1 - Danny's Diner ├── CaseStudy1_schema.sql ├── CaseStudy1_solution.ipynb └── README.md ├── CaseStudy#2 - Pizza Runner ├── CaseStudy2_schema.sql ├── CaseStudy2_solutions.ipynb ├── CaseStudy2_solutions.md └── README.md ├── CaseStudy#3 - Foodie-Fi ├── CaseStudy3_schema.sql ├── CaseStudy3_solutions.ipynb ├── CaseStudy3_solutions.md ├── README.md └── images │ ├── question10.png │ ├── question2.png │ ├── question3.png │ ├── question6.png │ └── question7.png ├── CaseStudy#4 - Data Bank ├── CaseStudy4_schema.sql ├── CaseStudy4_solutions.ipynb ├── CaseStudy4_solutions.md └── README.md ├── CaseStudy#5 - Data Mart ├── CaseStudy5_schema.sql ├── CaseStudy5_solutions.ipynb ├── CaseStudy5_solutions.md ├── README.md └── image │ └── plot.png ├── CaseStudy#6 - Clique Bait ├── CaseStudy6_schema.sql ├── CaseStudy6_solutions.ipynb ├── CaseStudy6_solutions.md ├── README.md └── images │ └── CliqueBait_ERD.png ├── CaseStudy#7 - Balanced Tree ├── CaseStudy7_schema.sql ├── CaseStudy7_solutions.ipynb ├── CaseStudy7_solutions.md └── README.md ├── CaseStudy#8 - Fresh Segments ├── CaseStudy8_schema.sql ├── CaseStudy8_solutions.ipynb ├── CaseStudy8_solutions.md └── README.md ├── LICENSE └── README.md /CaseStudy#1 - Danny's Diner/CaseStudy1_schema.sql: -------------------------------------------------------------------------------- 1 | ########################################################## 2 | # Case Study #1: Danny's Diner 3 | # This SQL script is a DDL file for creating tables. 4 | # Source: https://8weeksqlchallenge.com/case-study-1/ 5 | ########################################################## 6 | 7 | 8 | -- Create Schema 9 | CREATE SCHEMA IF NOT EXISTS dannys_diner; 10 | 11 | 12 | -- Create and Populate "sales" table 13 | CREATE TABLE IF NOT EXISTS dannys_diner.sales ( 14 | customer_id VARCHAR(1), 15 | order_date DATE, 16 | product_id INTEGER 17 | ); 18 | 19 | INSERT INTO dannys_diner.sales 20 | (customer_id, order_date, product_id) 21 | VALUES 22 | ('A', '2021-01-01', '1'), 23 | ('A', '2021-01-01', '2'), 24 | ('A', '2021-01-07', '2'), 25 | ('A', '2021-01-10', '3'), 26 | ('A', '2021-01-11', '3'), 27 | ('A', '2021-01-11', '3'), 28 | ('B', '2021-01-01', '2'), 29 | ('B', '2021-01-02', '2'), 30 | ('B', '2021-01-04', '1'), 31 | ('B', '2021-01-11', '1'), 32 | ('B', '2021-01-16', '3'), 33 | ('B', '2021-02-01', '3'), 34 | ('C', '2021-01-01', '3'), 35 | ('C', '2021-01-01', '3'), 36 | ('C', '2021-01-07', '3'); 37 | 38 | 39 | -- Create and Populate "menu" table 40 | CREATE TABLE IF NOT EXISTS dannys_diner.menu ( 41 | product_id INTEGER, 42 | product_name VARCHAR(5), 43 | price INTEGER 44 | ); 45 | 46 | INSERT INTO dannys_diner.menu 47 | (product_id, product_name, price) 48 | VALUES 49 | ('1', 'sushi', '10'), 50 | ('2', 'curry', '15'), 51 | ('3', 'ramen', '12'); 52 | 53 | 54 | -- Create and Populate "members" table 55 | CREATE TABLE IF NOT EXISTS dannys_diner.members ( 56 | customer_id VARCHAR(1), 57 | join_date DATE 58 | ); 59 | 60 | INSERT INTO dannys_diner.members 61 | (customer_id, join_date) 62 | VALUES 63 | ('A', '2021-01-07'), 64 | ('B', '2021-01-09'); -------------------------------------------------------------------------------- /CaseStudy#1 - Danny's Diner/README.md: -------------------------------------------------------------------------------- 1 | # Case Study #1: Danny's Diner 🍥 2 | [![View Main Folder](https://img.shields.io/badge/View-Main_Folder-F5788D.svg?logo=GitHub)](https://github.com/chanronnie/8WeekSQLChallenge) 3 | [![View Case Study 
1](https://img.shields.io/badge/View-Case_Study_1-68032C)](https://8weeksqlchallenge.com/case-study-1/)
4 | ![1](https://github.com/chanronnie/8WeekSQLChallenge/assets/121308347/3ebb1080-b8d3-4381-850d-4a003cc9476d) 5 | 6 | The case study presented here is part of the **8 Week SQL Challenge**.\ 7 | It is kindly brought to us by [**Data With Danny**](https://8weeksqlchallenge.com).\ 8 | I use `MySQL queries` in `Jupyter Notebook` to quickly view results. 9 | 10 | 11 | 12 | ## Table of Contents 13 | * [Entity Relationship Diagram](#entity-relationship-diagram) 14 | * [Datasets](#datasets) 15 | * [Case Study Questions](#case-study-questions) 16 | * [Solutions](#solutions) 17 | 18 | 19 | ## Entity Relationship Diagram 20 | ![Danny's Diner](https://github.com/chanronnie/8WeekSQLChallenge/assets/121308347/d71bffd1-6513-456c-9686-d95dbf1eeaaf) 21 | 22 | ## Datasets 23 | The Case Study #1 contains 3 tables: 24 | - **sales**: This table captures all the order information (`order_date` and `product_id`) of each customer (`customer_id`). 25 | - **menu**: This table lists the IDs, names and prices of each menu item. 26 | - **members**: This table captures the dates (`join_date`) when each customer joined the member program. 27 | 28 | ## Case Study Questions 29 | 1. What is the total amount each customer spent at the restaurant? 30 | 2. How many days has each customer visited the restaurant? 31 | 3. What was the first item from the menu purchased by each customer? 32 | 4. What is the most purchased item on the menu and how many times was it purchased by all customers? 33 | 5. Which item was the most popular for each customer? 34 | 6. Which item was purchased first by the customer after they became a member? 35 | 7. Which item was purchased just before the customer became a member? 36 | 8. What is the total items and amount spent for each member before they became a member? 37 | 9. If each $1 spent equates to 10 points and sushi has a 2x points multiplier - how many points would each customer have? 38 | 10. In the first week after a customer joins the program (including their join date) they earn 2x points on all items, not just sushi - how many points do customer A and B have at the end of January? 39 | 40 | 41 | ## Solutions 42 | - View `dannys_diner` database: [**here**](CaseStudy1_schema.sql) 43 | - View Solution: [**here**](CaseStudy1_solution.ipynb) 44 | 45 | -------------------------------------------------------------------------------- /CaseStudy#2 - Pizza Runner/CaseStudy2_schema.sql: -------------------------------------------------------------------------------- 1 | ########################################################## 2 | # Case Study #2: Pizza Runner 3 | # This SQL script is a DDL file for creating tables. 
4 | # Source: https://8weeksqlchallenge.com/case-study-2/ 5 | ########################################################## 6 | 7 | 8 | -- Create Schema 9 | CREATE SCHEMA IF NOT EXISTS pizza_runner; 10 | USE pizza_runner; 11 | 12 | 13 | -- Create and Populate "runners" table 14 | DROP TABLE IF EXISTS runners; 15 | CREATE TABLE runners ( 16 | runner_id INTEGER, 17 | registration_date DATE 18 | ); 19 | INSERT INTO runners 20 | (runner_id, registration_date) 21 | VALUES 22 | (1, '2021-01-01'), 23 | (2, '2021-01-03'), 24 | (3, '2021-01-08'), 25 | (4, '2021-01-15'); 26 | 27 | 28 | -- Create and Populate "customers_orders" table 29 | DROP TABLE IF EXISTS customer_orders; 30 | CREATE TABLE customer_orders ( 31 | order_id INTEGER, 32 | customer_id INTEGER, 33 | pizza_id INTEGER, 34 | exclusions VARCHAR(4), 35 | extras VARCHAR(4), 36 | order_time TIMESTAMP 37 | ); 38 | 39 | INSERT INTO customer_orders 40 | (order_id, customer_id, pizza_id, exclusions, extras, order_time) 41 | VALUES 42 | ('1', '101', '1', '', '', '2020-01-01 18:05:02'), 43 | ('2', '101', '1', '', '', '2020-01-01 19:00:52'), 44 | ('3', '102', '1', '', '', '2020-01-02 23:51:23'), 45 | ('3', '102', '2', '', NULL, '2020-01-02 23:51:23'), 46 | ('4', '103', '1', '4', '', '2020-01-04 13:23:46'), 47 | ('4', '103', '1', '4', '', '2020-01-04 13:23:46'), 48 | ('4', '103', '2', '4', '', '2020-01-04 13:23:46'), 49 | ('5', '104', '1', 'null', '1', '2020-01-08 21:00:29'), 50 | ('6', '101', '2', 'null', 'null', '2020-01-08 21:03:13'), 51 | ('7', '105', '2', 'null', '1', '2020-01-08 21:20:29'), 52 | ('8', '102', '1', 'null', 'null', '2020-01-09 23:54:33'), 53 | ('9', '103', '1', '4', '1, 5', '2020-01-10 11:22:59'), 54 | ('10', '104', '1', 'null', 'null', '2020-01-11 18:34:49'), 55 | ('10', '104', '1', '2, 6', '1, 4', '2020-01-11 18:34:49'); 56 | 57 | 58 | -- Create and Populate "runner_orders" table 59 | DROP TABLE IF EXISTS runner_orders; 60 | CREATE TABLE runner_orders ( 61 | order_id INTEGER, 62 | runner_id INTEGER, 63 | pickup_time VARCHAR(19), 64 | distance VARCHAR(7), 65 | duration VARCHAR(10), 66 | cancellation VARCHAR(23) 67 | ); 68 | 69 | INSERT INTO runner_orders 70 | (order_id, runner_id, pickup_time, distance, duration, cancellation) 71 | VALUES 72 | ('1', '1', '2020-01-01 18:15:34', '20km', '32 minutes', ''), 73 | ('2', '1', '2020-01-01 19:10:54', '20km', '27 minutes', ''), 74 | ('3', '1', '2020-01-03 00:12:37', '13.4km', '20 mins', NULL), 75 | ('4', '2', '2020-01-04 13:53:03', '23.4', '40', NULL), 76 | ('5', '3', '2020-01-08 21:10:57', '10', '15', NULL), 77 | ('6', '3', 'null', 'null', 'null', 'Restaurant Cancellation'), 78 | ('7', '2', '2020-01-08 21:30:45', '25km', '25mins', 'null'), 79 | ('8', '2', '2020-01-10 00:15:02', '23.4 km', '15 minute', 'null'), 80 | ('9', '2', 'null', 'null', 'null', 'Customer Cancellation'), 81 | ('10', '1', '2020-01-11 18:50:20', '10km', '10minutes', 'null'); 82 | 83 | 84 | -- Create and Populate "pizza_names" table 85 | DROP TABLE IF EXISTS pizza_names; 86 | CREATE TABLE pizza_names ( 87 | pizza_id INTEGER, 88 | pizza_name TEXT 89 | ); 90 | INSERT INTO pizza_names 91 | (pizza_id, pizza_name) 92 | VALUES 93 | (1, 'Meatlovers'), 94 | (2, 'Vegetarian'); 95 | 96 | 97 | -- Create and Populate "pizza_recipes" table 98 | DROP TABLE IF EXISTS pizza_recipes; 99 | CREATE TABLE pizza_recipes ( 100 | pizza_id INTEGER, 101 | toppings TEXT 102 | ); 103 | INSERT INTO pizza_recipes 104 | (pizza_id, toppings) 105 | VALUES 106 | (1, '1, 2, 3, 4, 5, 6, 8, 10'), 107 | (2, '4, 6, 7, 9, 11, 12'); 108 | 109 | 110 | -- Create and 
Populate "pizza_toppings" table 111 | DROP TABLE IF EXISTS pizza_toppings; 112 | CREATE TABLE pizza_toppings ( 113 | topping_id INTEGER, 114 | topping_name TEXT 115 | ); 116 | INSERT INTO pizza_toppings 117 | (topping_id, topping_name) 118 | VALUES 119 | (1, 'Bacon'), 120 | (2, 'BBQ Sauce'), 121 | (3, 'Beef'), 122 | (4, 'Cheese'), 123 | (5, 'Chicken'), 124 | (6, 'Mushrooms'), 125 | (7, 'Onions'), 126 | (8, 'Pepperoni'), 127 | (9, 'Peppers'), 128 | (10, 'Salami'), 129 | (11, 'Tomatoes'), 130 | (12, 'Tomato Sauce'); -------------------------------------------------------------------------------- /CaseStudy#2 - Pizza Runner/README.md: -------------------------------------------------------------------------------- 1 | # Case Study #2: Pizza Runner 🍕 2 | [![View Main Folder](https://img.shields.io/badge/View-Main_Folder-F5788D.svg?logo=GitHub)](https://github.com/chanronnie/8WeekSQLChallenge) 3 | [![View Case Study 2](https://img.shields.io/badge/View-Case_Study_2-9336E5)](https://8weeksqlchallenge.com/case-study-2/)
4 | ![2](https://github.com/chanronnie/8WeekSQLChallenge/assets/121308347/b7a82869-9156-4cf5-add0-67f701bf3e45) 5 | 6 | 7 | The case study presented here is part of the **8 Week SQL Challenge**.\ 8 | It is kindly brought to us by [**Data With Danny**](https://8weeksqlchallenge.com).\ 9 | I use `MySQL queries` in `Jupyter Notebook` to quickly view results. 10 | 11 | 12 | ## Table of Contents 13 | * [Entity Relationship Diagram](#entity-relationship-diagram) 14 | * [Datasets](#datasets) 15 | * [Case Study Questions](#case-study-questions) 16 | * [Solutions](#solutions) 17 | * [MySQL Topics Covered](#mysql-topics-covered) 18 | 19 | ## Entity Relationship Diagram 20 | ![Pizza Runner](https://github.com/chanronnie/8WeekSQLChallenge/assets/121308347/4222d127-ee94-49be-95ce-6e81d9b3774a) 21 | 22 | 23 | ## Datasets 24 | The Case Study #2 contains 6 tables: 25 | - **customer_orders**: This table captures all the pizza order and delivery information of each customer (`customer_id`). 26 | - **runner_orders**: This table lists the delivery and runner information for each order. 27 | - **runners**: This table lists the runner IDs and their registration dates. 28 | - **pizza_names**: This table maps each pizza_id to its corresponding pizza name 29 | - **pizza_recipes**: This table maps the toppings used for each pizza at Pizza Runner. 30 | - **pizza_toppings**: This table lists the toppings used at Pizza Runner. 31 | 32 | ## Case Study Questions 33 | Case Study #2 is categorized into five question groups\ 34 | To view the specific section, please open the link in a *`new tab`* or *`window`*.\ 35 | [A. Pizza Metrics](CaseStudy2_solutions.md#A)\ 36 | [B. Runner and Customer Experience](CaseStudy2_solutions.md#B)\ 37 | [C. Ingredient Optimisation](CaseStudy2_solutions.md#C)\ 38 | [D. Pricing and Ratings](CaseStudy2_solutions.md#D)\ 39 | [E. Bonus Questions](CaseStudy2_solutions.md#E) 40 | 41 | 42 | ## Solutions 43 | - View `pizza_runner` database: [here](https://github.com/chanronnie/8WeekSQLChallenge/blob/main/CaseStudy%232%20-%20Pizza%20Runner/CaseStudy2_schema.sql) 44 | - View Solution: 45 | - [Markdown File](CaseStudy2_solutions.md): offers a more fluid and responsive viewing experience 46 | - [Jupyter Notebook](CaseStudy2_solutions.ipynb): contains the original code 47 | 48 | ## MySQL Topics Covered 49 | - Data Cleaning 50 | - Common Table Expressions (CTE) 51 | - Temporary Tables 52 | - Window Functions 53 | - Subqueries 54 | - JOIN, UNION ALL 55 | - String and Time Data Manipulation 56 | -------------------------------------------------------------------------------- /CaseStudy#3 - Foodie-Fi/README.md: -------------------------------------------------------------------------------- 1 | # Case Study #3: Foodie-Fi 🥑 2 | [![View Main Folder](https://img.shields.io/badge/View-Main_Folder-F5788D.svg?logo=GitHub)](https://github.com/chanronnie/8WeekSQLChallenge) 3 | [![View Case Study 3](https://img.shields.io/badge/View-Case_Study_3-049C20)](https://8weeksqlchallenge.com/case-study-3/)
4 | ![3](https://github.com/chanronnie/8WeekSQLChallenge/assets/121308347/660c3a4a-8c2e-4f40-842b-464a3f4bf3ff) 5 | 6 | 7 | 8 | The case study presented here is part of the **8 Week SQL Challenge**.\ 9 | It is kindly brought to us by [**Data With Danny**](https://8weeksqlchallenge.com).\ 10 | I use `MySQL queries` in `Jupyter Notebook` to quickly view results. 11 | 12 | 13 | ## Table of Contents 14 | * [Entity Relationship Diagram](#entity-relationship-diagram) 15 | * [Datasets](#datasets) 16 | * [Case Study Questions](#case-study-questions) 17 | * [Solutions](#solutions) 18 | * [MySQL Topics Covered](#mysql-topics-covered) 19 | 20 | ## Entity Relationship Diagram 21 | ![FoodieFi](https://github.com/chanronnie/8WeekSQLChallenge/assets/121308347/7fb1d1c0-25c6-44b0-af05-b834f90e70ee) 22 | 23 | 24 | 25 | ## Datasets 26 | The Case Study #3 contains 2 tables: 27 | - **subscriptions**: This table captures all the plan subscription infoprmation of each customer at Foodie-Fi. 28 | - **plans**: This table lists the plans available and prices at Foodie-Fi. 29 | 30 | ## Case Study Questions 31 | Case Study #3 is categorized into 3 question groups\ 32 | To view the specific section, please open the link in a *`new tab`* or *`window`*.\ 33 | [A. Customer Journey](CaseStudy3_solutions.md#A)\ 34 | [B. Data Analysis Questions](CaseStudy3_solutions.md#B)\ 35 | [C. Challenge Payment Question](CaseStudy3_solutions.md#C) 36 | 37 | ## Solutions 38 | - View `foodie_fi` database: [**here**](CaseStudy3_schema.sql) 39 | - View Solution: 40 | - [**Markdown File**](CaseStudy3_solutions.md): offers a more fluid and responsive viewing experience 41 | - [**Jupyter Notebook**](CaseStudy3_solutions.ipynb): contains the original code 42 | 43 | ## MySQL Topics Covered 44 | - Common Table Expressions (CTE) 45 | - Temporary Tables 46 | - Window Functions 47 | - Subqueries 48 | - JOIN, UNION ALL 49 | -------------------------------------------------------------------------------- /CaseStudy#3 - Foodie-Fi/images/question10.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/chanronnie/8WeekSQLChallenge/4b36f745d833a82a61a9ce70ab177462eb7dbfd0/CaseStudy#3 - Foodie-Fi/images/question10.png -------------------------------------------------------------------------------- /CaseStudy#3 - Foodie-Fi/images/question2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/chanronnie/8WeekSQLChallenge/4b36f745d833a82a61a9ce70ab177462eb7dbfd0/CaseStudy#3 - Foodie-Fi/images/question2.png -------------------------------------------------------------------------------- /CaseStudy#3 - Foodie-Fi/images/question3.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/chanronnie/8WeekSQLChallenge/4b36f745d833a82a61a9ce70ab177462eb7dbfd0/CaseStudy#3 - Foodie-Fi/images/question3.png -------------------------------------------------------------------------------- /CaseStudy#3 - Foodie-Fi/images/question6.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/chanronnie/8WeekSQLChallenge/4b36f745d833a82a61a9ce70ab177462eb7dbfd0/CaseStudy#3 - Foodie-Fi/images/question6.png -------------------------------------------------------------------------------- /CaseStudy#3 - Foodie-Fi/images/question7.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/chanronnie/8WeekSQLChallenge/4b36f745d833a82a61a9ce70ab177462eb7dbfd0/CaseStudy#3 - Foodie-Fi/images/question7.png -------------------------------------------------------------------------------- /CaseStudy#4 - Data Bank/CaseStudy4_solutions.md: -------------------------------------------------------------------------------- 1 | # Case Study #4: Data Bank 2 | The case study questions presented here are created by [**Data With Danny**](https://linktr.ee/datawithdanny). They are part of the [**8 Week SQL Challenge**](https://8weeksqlchallenge.com/). 3 | 4 | My SQL queries are written in the `PostgreSQL 15` dialect, integrated into `Jupyter Notebook`, which allows us to instantly view the query results and document the queries. 5 | 6 | For more details about the **Case Study #4**, click [**here**](https://8weeksqlchallenge.com/case-study-4/). 7 | 8 | ## Table of Contents 9 | 10 | ### [1. Importing Libraries](#Import) 11 | 12 | ### [2. Tables of the Database](#Tables) 13 | 14 | ### [3. Case Study Questions](#CaseStudyQuestions) 15 | 16 | - [A. Customer Nodes Exploration](#A) 17 | - [B. Customer Transactions](#B) 18 | - [C. Data Allocation Challenge](#C) 19 | 20 | 21 | 22 | ## 1. Importing required Libraries 23 | 24 | 25 | ```python 26 | import psycopg2 as pg2 27 | import pandas as pd 28 | import os 29 | import warnings 30 | 31 | warnings.filterwarnings('ignore') 32 | ``` 33 | 34 | 35 | 36 | ## 2. Tables of the Database 37 | 38 | ### Connecting PostgreSQL database through Jupyter Notebook 39 | 40 | 41 | ```python 42 | # Get the PostgreSQL password 43 | mypassword = os.getenv("POSTGRESQL_PASSWORD") 44 | 45 | # Connect SQL database 46 | conn = pg2.connect(user = 'postgres', password = mypassword, database = 'data_bank') 47 | cursor = conn.cursor() 48 | ``` 49 | 50 | Now, let's list the table names of the `data_bank` database. 51 | 52 | 53 | ```python 54 | query_ShowTables = """ 55 | SELECT 56 | table_schema, 57 | table_name 58 | FROM information_schema.tables 59 | WHERE table_schema = 'data_bank'; 60 | """ 61 | cursor.execute(query_ShowTables) 62 | 63 | print('--- Tables within "data_bank" database --- ') 64 | for table in cursor: 65 | print(table[1]) 66 | ``` 67 | 68 | --- Tables within "data_bank" database --- 69 | regions 70 | customer_nodes 71 | customer_transactions 72 | 73 | 74 | The followings are the 3 tables within the `data_bank` database. Please click [**here**](https://8weeksqlchallenge.com/case-study-4/) to get more insights about the tables. 75 | 76 | 77 | ```python 78 | cursor.execute(query_ShowTables) 79 | for table in cursor: 80 | print("Table: ", table[1]) 81 | query = "SELECT * FROM " + table[0] + '.' + table[1] 82 | df = pd.read_sql(query, conn) 83 | display(df) 84 | ``` 85 | 86 | Table: regions 87 | 88 | 89 | 90 | 91 | 92 | 93 | 94 | 95 | 96 | 97 | 98 | 99 | 100 | 101 | 102 | 103 | 104 | 105 | 106 | 107 | 108 | 109 | 110 | 111 | 112 | 113 | 114 | 115 | 116 | 117 | 118 | 119 | 120 | 121 | 122 | 123 |
|     | region_id | region_name |
| --- | --- | --- |
| 0 | 1 | Australia |
| 1 | 2 | America |
| 2 | 3 | Africa |
| 3 | 4 | Asia |
| 4 | 5 | Europe |
Table: customer_nodes
|     | customer_id | region_id | node_id | start_date | end_date |
| --- | --- | --- | --- | --- | --- |
| 0 | 1 | 3 | 4 | 2020-01-02 | 2020-01-03 |
| 1 | 2 | 3 | 5 | 2020-01-03 | 2020-01-17 |
| 2 | 3 | 5 | 4 | 2020-01-27 | 2020-02-18 |
| 3 | 4 | 5 | 4 | 2020-01-07 | 2020-01-19 |
| 4 | 5 | 3 | 3 | 2020-01-15 | 2020-01-23 |
| ... | ... | ... | ... | ... | ... |
| 3495 | 496 | 3 | 4 | 2020-02-25 | 9999-12-31 |
| 3496 | 497 | 5 | 4 | 2020-05-27 | 9999-12-31 |
| 3497 | 498 | 1 | 2 | 2020-04-05 | 9999-12-31 |
| 3498 | 499 | 5 | 1 | 2020-02-03 | 9999-12-31 |
| 3499 | 500 | 2 | 2 | 2020-04-15 | 9999-12-31 |

3500 rows × 5 columns

Table: customer_transactions
|     | customer_id | txn_date | txn_type | txn_amount |
| --- | --- | --- | --- | --- |
| 0 | 429 | 2020-01-21 | deposit | 82 |
| 1 | 155 | 2020-01-10 | deposit | 712 |
| 2 | 398 | 2020-01-01 | deposit | 196 |
| 3 | 255 | 2020-01-14 | deposit | 563 |
| 4 | 185 | 2020-01-29 | deposit | 626 |
| ... | ... | ... | ... | ... |
| 5863 | 189 | 2020-02-03 | withdrawal | 870 |
| 5864 | 189 | 2020-03-22 | purchase | 718 |
| 5865 | 189 | 2020-02-06 | purchase | 393 |
| 5866 | 189 | 2020-01-22 | deposit | 302 |
| 5867 | 189 | 2020-01-27 | withdrawal | 861 |

5868 rows × 4 columns
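
A side note on the connection: `pd.read_sql` works with the raw `psycopg2` connection used here, although recent pandas versions emit a warning recommending a SQLAlchemy connectable. A possible alternative setup is sketched below (only a sketch; it assumes SQLAlchemy is installed and the server runs on localhost, which is not stated in the original notebook):

```python
from sqlalchemy import create_engine

# Hypothetical connection string; host and port are assumptions, not part of the original notebook
engine = create_engine(f"postgresql+psycopg2://postgres:{mypassword}@localhost:5432/data_bank")

# pd.read_sql accepts the engine directly, which avoids the DBAPI warning
pd.read_sql("SELECT * FROM data_bank.regions;", engine)
```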

328 | 329 | 330 | 331 | 332 | 333 | ## Case Study Questions 334 | 335 | 336 | ## A. Customer Nodes Exploration 337 | 338 | #### 1. How many unique nodes are there on the Data Bank system? 339 | 340 | 341 | ```python 342 | pd.read_sql(""" 343 | SELECT COUNT(DISTINCT node_id) AS nodes_count 344 | FROM data_bank.customer_nodes 345 | """, conn) 346 | ``` 347 | 348 | 349 | 350 | 351 | 352 | 353 | 354 | 355 | 356 | 357 | 358 | 359 | 360 | 361 |
|     | nodes_count |
| --- | --- |
| 0 | 5 |
**Result (Attempt #1)**\
In the Data Bank system, there are 5 unique nodes to which customers are randomly reallocated.

However, if we interpret the question differently, we may understand that nodes are unique to each region, and customers are randomly distributed across the nodes *according to their region*. The following query shows the possible unique nodes in each region.

**Here is attempt #2**


```python
pd.read_sql("""
SELECT
    cn.region_id,
    r.region_name,
    STRING_AGG(DISTINCT cn.node_id::VARCHAR(1), ', ') AS nodes
FROM data_bank.customer_nodes cn
JOIN data_bank.regions r ON cn.region_id = r.region_id
GROUP BY cn.region_id, r.region_name
""", conn)
```
|     | region_id | region_name | nodes |
| --- | --- | --- | --- |
| 0 | 1 | Australia | 1, 2, 3, 4, 5 |
| 1 | 2 | America | 1, 2, 3, 4, 5 |
| 2 | 3 | Africa | 1, 2, 3, 4, 5 |
| 3 | 4 | Asia | 1, 2, 3, 4, 5 |
| 4 | 5 | Europe | 1, 2, 3, 4, 5 |
428 | 429 | 430 | 431 | 432 | 433 | ```python 434 | pd.read_sql(""" 435 | SELECT SUM(nb_nodes)::INTEGER AS total_nodes 436 | FROM 437 | ( 438 | -- Find the number of unique nodes per region 439 | 440 | SELECT region_id, COUNT(DISTINCT node_id) AS nb_nodes 441 | FROM data_bank.customer_nodes 442 | GROUP BY region_id 443 | ) n 444 | """, conn) 445 | ``` 446 | 447 | 448 | 449 | 450 | 451 | 452 | 453 | 454 | 455 | 456 | 457 | 458 | 459 | 460 |
|     | total_nodes |
| --- | --- |
| 0 | 25 |
**Result (Attempt #2)**\
Hence, there are 25 unique nodes in the Data Bank system across the world.

___
#### 2. What is the number of nodes per region?


```python
pd.read_sql("""
SELECT
    r.region_name,
    COUNT(DISTINCT node_id) AS nb_nodes
FROM data_bank.customer_nodes cn
JOIN data_bank.regions r ON cn.region_id = r.region_id
GROUP BY r.region_name;
""", conn)
```
|     | region_name | nb_nodes |
| --- | --- | --- |
| 0 | Africa | 5 |
| 1 | America | 5 |
| 2 | Asia | 5 |
| 3 | Australia | 5 |
| 4 | Europe | 5 |
519 | 520 | 521 | 522 | 523 | **Result**\ 524 | There are 5 nodes per region. 525 | 526 | ___ 527 | #### 3. How many customers are allocated to each region? 528 | 529 | 530 | ```python 531 | pd.read_sql(""" 532 | SELECT 533 | r.region_name AS region, 534 | COUNT(DISTINCT customer_id) AS nb_customers 535 | FROM data_bank.customer_nodes cn 536 | JOIN data_bank.regions r ON cn.region_id = r.region_id 537 | GROUP BY r.region_name 538 | ORDER BY nb_customers DESC; 539 | """, conn) 540 | ``` 541 | 542 | 543 | 544 | 545 | 546 | 547 | 548 | 549 | 550 | 551 | 552 | 553 | 554 | 555 | 556 | 557 | 558 | 559 | 560 | 561 | 562 | 563 | 564 | 565 | 566 | 567 | 568 | 569 | 570 | 571 | 572 | 573 | 574 | 575 | 576 | 577 |
|     | region | nb_customers |
| --- | --- | --- |
| 0 | Australia | 110 |
| 1 | America | 105 |
| 2 | Africa | 102 |
| 3 | Asia | 95 |
| 4 | Europe | 88 |
578 | 579 | 580 | 581 | 582 | **Result** 583 | - The majority of Data Bank customers are from Australia, totaling 110 customers. 584 | - Following Australia, America and Africa are the second and third regions with the highest number of Data Bank customers. 585 | - Asia has a total of 95 Data Bank customers. 586 | - Data Bank does not seem to attract many customers from Europe, with only 88 clients. 587 | 588 | ___ 589 | #### 4. How many days on average are customers reallocated to a different node? 590 | 591 | 592 | ```python 593 | pd.read_sql(""" 594 | SELECT ROUND(AVG(end_date::date - start_date::date), 1) AS avg_reallocation_days 595 | FROM data_bank.customer_nodes 596 | WHERE end_date != '9999-12-31' 597 | """, conn) 598 | ``` 599 | 600 | 601 | 602 | 603 | 604 | 605 | 606 | 607 | 608 | 609 | 610 | 611 | 612 | 613 |
|     | avg_reallocation_days |
| --- | --- |
| 0 | 14.6 |
614 | 615 | 616 | 617 | 618 | **Result (Attempt #1)**\ 619 | Customers are reallocated to a different node on average after 14.6 days . 620 | 621 | However, if we examine the `customer_id 7`, we can observe two instances where the customer has been reallocated to the same node (nodes 2 and 4). For customer ID 7, it took him/her 622 | - 21 days to be reallocated from node 4 to node 2 623 | - 34 days to be reallocated from node 2 to node 4 624 | 625 | 626 | ```python 627 | pd.read_sql(""" 628 | SELECT * 629 | FROM data_bank.customer_nodes 630 | WHERE customer_id = 7 631 | """, conn) 632 | ``` 633 | 634 | 635 | 636 | 637 | 638 | 639 | 640 | 641 | 642 | 643 | 644 | 645 | 646 | 647 | 648 | 649 | 650 | 651 | 652 | 653 | 654 | 655 | 656 | 657 | 658 | 659 | 660 | 661 | 662 | 663 | 664 | 665 | 666 | 667 | 668 | 669 | 670 | 671 | 672 | 673 | 674 | 675 | 676 | 677 | 678 | 679 | 680 | 681 | 682 | 683 | 684 | 685 | 686 | 687 | 688 | 689 | 690 | 691 | 692 | 693 | 694 | 695 | 696 | 697 | 698 | 699 | 700 | 701 | 702 | 703 |
|     | customer_id | region_id | node_id | start_date | end_date |
| --- | --- | --- | --- | --- | --- |
| 0 | 7 | 2 | 5 | 2020-01-20 | 2020-02-04 |
| 1 | 7 | 2 | 4 | 2020-02-05 | 2020-02-20 |
| 2 | 7 | 2 | 4 | 2020-02-21 | 2020-02-26 |
| 3 | 7 | 2 | 2 | 2020-02-27 | 2020-03-05 |
| 4 | 7 | 2 | 2 | 2020-03-06 | 2020-04-01 |
| 5 | 7 | 2 | 4 | 2020-04-02 | 2020-04-07 |
| 6 | 7 | 2 | 5 | 2020-04-08 | 9999-12-31 |
704 | 705 | 706 | 707 | 708 | **Table 1**: Customer 7 reallocated to node no 4 twice 709 | 710 | node_id | start_date | end_date | nb_reallocation_days 711 | --- | --- | --- | --- 712 | 4 | 2020-02-05 | 2020-02-20 | -- 713 | 4 | 2020-02-21 | 2020-02-26 | 21 714 | 715 | **Table 2**: Customer 7 reallocated to node no 2 twice 716 | 717 | node_id | start_date | end_date | nb_reallocation_days 718 | --- | --- | --- | --- 719 | 2 | 2020-02-27 | 2020-03-05 | -- 720 | 2 | 2020-03-06 | 2020-04-01 | 34 721 | 722 | Hence, here is a query for the `reallocation_cte` that will be used later to answer the question. The query takes into consideration counting the total number of days for which customers will be reallocated to a different node, and includes all instances of both same and different nodes. 723 | 724 | 725 | ```python 726 | # Let's see for customer ID 7 727 | 728 | pd.read_sql(""" 729 | SELECT *, 730 | CASE 731 | WHEN LEAD(node_id) OVER w = node_id THEN NULL 732 | WHEN LAG(node_id) OVER w = node_id THEN end_date::date - LAG(start_date) OVER w::date 733 | ELSE end_date::date - start_date::date 734 | END AS nb_days 735 | FROM data_bank.customer_nodes 736 | WHERE end_date != '9999-12-31' AND customer_id = 7 737 | WINDOW w AS (PARTITION BY customer_id ORDER BY start_date) 738 | """, conn) 739 | ``` 740 | 741 | 742 | 743 | 744 | 745 | 746 | 747 | 748 | 749 | 750 | 751 | 752 | 753 | 754 | 755 | 756 | 757 | 758 | 759 | 760 | 761 | 762 | 763 | 764 | 765 | 766 | 767 | 768 | 769 | 770 | 771 | 772 | 773 | 774 | 775 | 776 | 777 | 778 | 779 | 780 | 781 | 782 | 783 | 784 | 785 | 786 | 787 | 788 | 789 | 790 | 791 | 792 | 793 | 794 | 795 | 796 | 797 | 798 | 799 | 800 | 801 | 802 | 803 | 804 | 805 | 806 | 807 | 808 | 809 |
|     | customer_id | region_id | node_id | start_date | end_date | nb_days |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | 7 | 2 | 5 | 2020-01-20 | 2020-02-04 | 15.0 |
| 1 | 7 | 2 | 4 | 2020-02-05 | 2020-02-20 | NaN |
| 2 | 7 | 2 | 4 | 2020-02-21 | 2020-02-26 | 21.0 |
| 3 | 7 | 2 | 2 | 2020-02-27 | 2020-03-05 | NaN |
| 4 | 7 | 2 | 2 | 2020-03-06 | 2020-04-01 | 34.0 |
| 5 | 7 | 2 | 4 | 2020-04-02 | 2020-04-07 | 5.0 |
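
The 21- and 34-day figures above can be sanity-checked with plain Python date arithmetic (a quick verification, not part of the SQL solution):

```python
from datetime import date

# Node 4: end of the second stay minus start of the first stay
print(date(2020, 2, 26) - date(2020, 2, 5))   # 21 days, 0:00:00

# Node 2: end of the second stay minus start of the first stay
print(date(2020, 4, 1) - date(2020, 2, 27))   # 34 days, 0:00:00
```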
810 | 811 | 812 | 813 | 814 | Now, let's utilize the aforementioned query as the `reallocation_cte`, to provide a comprehensive answer. 815 | 816 | 817 | ```python 818 | pd.read_sql(""" 819 | WITH reallocation_cte AS 820 | ( 821 | -- Find the number of reallocation days 822 | SELECT *, 823 | CASE 824 | WHEN LEAD(node_id) OVER w = node_id THEN NULL 825 | WHEN LAG(node_id) OVER w = node_id THEN end_date::date - LAG(start_date) OVER w::date 826 | ELSE end_date::date - start_date::date 827 | END AS nb_reallocation_days 828 | FROM data_bank.customer_nodes 829 | WHERE end_date != '9999-12-31' 830 | WINDOW w AS (PARTITION BY customer_id ORDER BY start_date) 831 | ) 832 | SELECT ROUND(AVG(nb_reallocation_days),1) AS avg_reallocation_days 833 | FROM reallocation_cte; 834 | """, conn) 835 | ``` 836 | 837 | 838 | 839 | 840 | 841 | 842 | 843 | 844 | 845 | 846 | 847 | 848 | 849 | 850 |
|     | avg_reallocation_days |
| --- | --- |
| 0 | 17.3 |
851 | 852 | 853 | 854 | 855 | **Result (Attempt #2)**\ 856 | Customers are reallocated to a different node on average after 17.3 days . 857 | 858 | ___ 859 | #### 5. What is the median, 80th and 95th percentile for this same reallocation days metric for each region? 860 | 861 | Since I have previously addressed the reallocation days problem (see question #4 above) using two different approaches, I will also provide my answer to this question in two attempts. Please note that the questions 4 and 5 are rather vague. 862 | 863 | **Result (Attempt #1)** 864 | 865 | 866 | ```python 867 | pd.read_sql(""" 868 | SELECT 869 | region_name, 870 | PERCENTILE_DISC(0.5) WITHIN GROUP(ORDER BY reallocation_days) as median, 871 | PERCENTILE_DISC(0.8) WITHIN GROUP(ORDER BY reallocation_days) as percentile_80th, 872 | PERCENTILE_DISC(0.95) WITHIN GROUP(ORDER BY reallocation_days) as percentile_95th 873 | FROM 874 | ( 875 | SELECT 876 | n.node_id, 877 | n.start_date, 878 | n.end_date, 879 | r.region_name, 880 | n.end_date::date - n.start_date::date AS reallocation_days 881 | FROM data_bank.customer_nodes n 882 | JOIN data_bank.regions r ON n.region_id = r.region_id 883 | WHERE n.end_date != '9999-12-31' 884 | ) re 885 | GROUP BY region_name; 886 | """, conn) 887 | ``` 888 | 889 | 890 | 891 | 892 | 893 | 894 | 895 | 896 | 897 | 898 | 899 | 900 | 901 | 902 | 903 | 904 | 905 | 906 | 907 | 908 | 909 | 910 | 911 | 912 | 913 | 914 | 915 | 916 | 917 | 918 | 919 | 920 | 921 | 922 | 923 | 924 | 925 | 926 | 927 | 928 | 929 | 930 | 931 | 932 | 933 | 934 | 935 | 936 |
|     | region_name | median | percentile_80th | percentile_95th |
| --- | --- | --- | --- | --- |
| 0 | Africa | 15 | 24 | 28 |
| 1 | America | 15 | 23 | 28 |
| 2 | Asia | 15 | 23 | 28 |
| 3 | Australia | 15 | 23 | 28 |
| 4 | Europe | 15 | 24 | 28 |
937 | 938 | 939 | 940 | 941 | **Result (Attempt #2)** 942 | 943 | 944 | ```python 945 | pd.read_sql(""" 946 | WITH reallocation_cte AS 947 | ( 948 | SELECT *, 949 | CASE 950 | WHEN LEAD(node_id) OVER w = node_id THEN NULL 951 | WHEN LAG(node_id) OVER w = node_id THEN end_date::date - LAG(start_date) OVER w::date 952 | ELSE end_date::date - start_date::date 953 | END AS nb_reallocation_days 954 | FROM data_bank.customer_nodes 955 | WHERE end_date != '9999-12-31' 956 | WINDOW w AS (PARTITION BY customer_id ORDER BY start_date) 957 | ) 958 | SELECT 959 | r.region_name, 960 | PERCENTILE_DISC(0.5) WITHIN GROUP(ORDER BY cte.nb_reallocation_days) as median, 961 | PERCENTILE_DISC(0.8) WITHIN GROUP(ORDER BY cte.nb_reallocation_days) as percentile_80th, 962 | PERCENTILE_DISC(0.95) WITHIN GROUP(ORDER BY cte.nb_reallocation_days) as percentile_95th 963 | FROM reallocation_cte cte 964 | JOIN data_bank.regions r ON cte.region_id = r.region_id 965 | GROUP BY r.region_name; 966 | """, conn) 967 | ``` 968 | 969 | 970 | 971 | 972 | 973 | 974 | 975 | 976 | 977 | 978 | 979 | 980 | 981 | 982 | 983 | 984 | 985 | 986 | 987 | 988 | 989 | 990 | 991 | 992 | 993 | 994 | 995 | 996 | 997 | 998 | 999 | 1000 | 1001 | 1002 | 1003 | 1004 | 1005 | 1006 | 1007 | 1008 | 1009 | 1010 | 1011 | 1012 | 1013 | 1014 | 1015 | 1016 |
|     | region_name | median | percentile_80th | percentile_95th |
| --- | --- | --- | --- | --- |
| 0 | Africa | 17 | 27 | 36 |
| 1 | America | 17 | 26 | 36 |
| 2 | Asia | 17 | 25 | 34 |
| 3 | Australia | 17 | 26 | 36 |
| 4 | Europe | 18 | 27 | 37 |
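
A brief aside on the two percentile functions: `PERCENTILE_DISC` returns an actual value from the data set, while `PERCENTILE_CONT` interpolates between values. The pandas `quantile` method makes a loosely similar distinction through its `interpolation` argument; a small illustration on made-up numbers (not the case-study data):

```python
import pandas as pd

days = pd.Series([10, 14, 15, 21, 30])

# Pick an existing observation, similar in spirit to PERCENTILE_DISC
print(days.quantile(0.8, interpolation="lower"))   # 21

# Interpolate between observations, like PERCENTILE_CONT
print(days.quantile(0.8, interpolation="linear"))  # 22.8
```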
1017 | 1018 | 1019 | 1020 | 1021 | ___ 1022 | 1023 | ## B. Customer Transactions 1024 | #### 1. What is the unique count and total amount for each transaction type? 1025 | 1026 | 1027 | ```python 1028 | pd.read_sql(""" 1029 | SELECT 1030 | txn_type AS transaction_type, 1031 | to_char(COUNT(txn_type), 'FM 999,999') AS count, 1032 | to_char(SUM(txn_amount), 'FM$ 999,999,999.99') AS total_amount 1033 | FROM data_bank.customer_transactions 1034 | GROUP BY txn_type 1035 | """, conn) 1036 | ``` 1037 | 1038 | 1039 | 1040 | 1041 | 1042 | 1043 | 1044 | 1045 | 1046 | 1047 | 1048 | 1049 | 1050 | 1051 | 1052 | 1053 | 1054 | 1055 | 1056 | 1057 | 1058 | 1059 | 1060 | 1061 | 1062 | 1063 | 1064 | 1065 | 1066 | 1067 |
|     | transaction_type | count | total_amount |
| --- | --- | --- | --- |
| 0 | purchase | 1,617 | $ 806,537. |
| 1 | withdrawal | 1,580 | $ 793,003. |
| 2 | deposit | 2,671 | $ 1,359,168. |
**Result**
- The most frequently used transaction type at Data Bank is **deposit**, with a total of 2,671 deposits, amounting to $1,359,168.
- There is a total of 1,617 **purchases** made at Data Bank, totalling $ 806,537.
- There is a total of 1,580 **withdrawals** made at Data Bank, totalling $ 793,003.

___
#### 2. What is the average total historical deposit counts and amounts for all customers?


```python
pd.read_sql("""
SELECT
    AVG(deposit_count)::INTEGER AS nb_deposit,
    to_char(AVG(avg_deposit_amount), 'FM$ 999,999.99') AS avg_deposit_amount
FROM
(
    -- Find the average count and amount of deposits made by each customer
    SELECT
        customer_id,
        COUNT(txn_type) AS deposit_count,
        AVG(txn_amount) AS avg_deposit_amount
    FROM data_bank.customer_transactions
    WHERE txn_type = 'deposit'
    GROUP BY customer_id
    ORDER BY customer_id
) d
""", conn)
```
|     | nb_deposit | avg_deposit_amount |
| --- | --- | --- |
| 0 | 5 | $ 508.61 |
1117 | 1118 | 1119 | 1120 | 1121 | **Result**\ 1122 | The Data Bank customers made an average of 5 deposits, with an average amount of $ 508.61. 1123 | 1124 | ___ 1125 | #### 3. For each month - how many Data Bank customers make more than 1 deposit and either 1 purchase or 1 withdrawal in a single month? 1126 | 1127 | 1128 | ```python 1129 | pd.read_sql(""" 1130 | WITH counting_transactions_cte AS 1131 | ( 1132 | SELECT 1133 | customer_id, 1134 | EXTRACT(MONTH FROM txn_date)::INTEGER AS month, 1135 | TO_CHAR(txn_date, 'MONTH') AS month_name, 1136 | SUM(CASE WHEN txn_type = 'deposit' THEN 1 ELSE 0 END) AS deposit, 1137 | SUM(CASE WHEN txn_type = 'purchase' THEN 1 ELSE 0 END) AS purchase, 1138 | SUM(CASE WHEN txn_type = 'withdrawal' THEN 1 ELSE 0 END) AS withdrawal 1139 | FROM data_bank.customer_transactions 1140 | GROUP BY customer_id, month, month_name 1141 | ORDER BY customer_id 1142 | ) 1143 | SELECT 1144 | month, 1145 | month_name, 1146 | COUNT(DISTINCT customer_id) AS nb_customers 1147 | FROM counting_transactions_cte 1148 | WHERE deposit > 1 AND (purchase > 0 OR withdrawal > 0) 1149 | GROUP BY month, month_name 1150 | ORDER BY month; 1151 | """, conn) 1152 | ``` 1153 | 1154 | 1155 | 1156 | 1157 | 1158 | 1159 | 1160 | 1161 | 1162 | 1163 | 1164 | 1165 | 1166 | 1167 | 1168 | 1169 | 1170 | 1171 | 1172 | 1173 | 1174 | 1175 | 1176 | 1177 | 1178 | 1179 | 1180 | 1181 | 1182 | 1183 | 1184 | 1185 | 1186 | 1187 | 1188 | 1189 |
|     | month | month_name | nb_customers |
| --- | --- | --- | --- |
| 0 | 1 | JANUARY | 168 |
| 1 | 2 | FEBRUARY | 181 |
| 2 | 3 | MARCH | 192 |
| 3 | 4 | APRIL | 70 |
1190 | 1191 | 1192 | 1193 | 1194 | **Result** 1195 | - **March** is the month when Data Bank customers make the most transactions (more than 1 deposit and either 1 purchase or 1 withdrawal in a single month), with a total of 192 customers. 1196 | - Following March, **February** is the second month with the highest number of customers engaging in the given transactions. 1197 | - In **January**, a total of 168 customers made more than 1 deposit and either 1 purchase or 1 withdrawal. 1198 | - The month of **April** has the fewest number of customers conducting these types of transactions, with only 70 clients. 1199 | 1200 | ___ 1201 | #### 4. What is the closing balance for each customer at the end of the month? 1202 | 1203 | Here are the results showing the first 5 customers. 1204 | 1205 | 1206 | ```python 1207 | pd.read_sql(""" 1208 | SELECT 1209 | customer_id, 1210 | EXTRACT(MONTH FROM txn_date)::INTEGER AS month, 1211 | TO_CHAR(txn_date, 'MONTH') AS month_name, 1212 | SUM(CASE WHEN txn_type = 'deposit' THEN txn_amount ELSE - txn_amount END) AS closing_balance 1213 | FROM data_bank.customer_transactions 1214 | WHERE customer_id <= 5 1215 | GROUP BY customer_id, month, month_name 1216 | ORDER BY customer_id, month; 1217 | """, conn) 1218 | ``` 1219 | 1220 | 1221 | 1222 | 1223 | 1224 | 1225 | 1226 | 1227 | 1228 | 1229 | 1230 | 1231 | 1232 | 1233 | 1234 | 1235 | 1236 | 1237 | 1238 | 1239 | 1240 | 1241 | 1242 | 1243 | 1244 | 1245 | 1246 | 1247 | 1248 | 1249 | 1250 | 1251 | 1252 | 1253 | 1254 | 1255 | 1256 | 1257 | 1258 | 1259 | 1260 | 1261 | 1262 | 1263 | 1264 | 1265 | 1266 | 1267 | 1268 | 1269 | 1270 | 1271 | 1272 | 1273 | 1274 | 1275 | 1276 | 1277 | 1278 | 1279 | 1280 | 1281 | 1282 | 1283 | 1284 | 1285 | 1286 | 1287 | 1288 | 1289 | 1290 | 1291 | 1292 | 1293 | 1294 | 1295 | 1296 | 1297 | 1298 | 1299 | 1300 | 1301 | 1302 | 1303 | 1304 | 1305 | 1306 | 1307 | 1308 | 1309 | 1310 | 1311 | 1312 | 1313 | 1314 | 1315 | 1316 | 1317 | 1318 | 1319 | 1320 | 1321 | 1322 | 1323 |
|     | customer_id | month | month_name | closing_balance |
| --- | --- | --- | --- | --- |
| 0 | 1 | 1 | JANUARY | 312 |
| 1 | 1 | 3 | MARCH | -952 |
| 2 | 2 | 1 | JANUARY | 549 |
| 3 | 2 | 3 | MARCH | 61 |
| 4 | 3 | 1 | JANUARY | 144 |
| 5 | 3 | 2 | FEBRUARY | -965 |
| 6 | 3 | 3 | MARCH | -401 |
| 7 | 3 | 4 | APRIL | 493 |
| 8 | 4 | 1 | JANUARY | 848 |
| 9 | 4 | 3 | MARCH | -193 |
| 10 | 5 | 1 | JANUARY | 954 |
| 11 | 5 | 3 | MARCH | -2877 |
| 12 | 5 | 4 | APRIL | -490 |
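
Note that the query above reports the net movement within each month. If a cumulative month-end balance is wanted instead, the same aggregation can be wrapped in a window `SUM`; a possible variation (a sketch, not part of the original solution):

```python
pd.read_sql("""
WITH monthly AS
(
    SELECT
        customer_id,
        EXTRACT(MONTH FROM txn_date)::INTEGER AS month,
        SUM(CASE WHEN txn_type = 'deposit' THEN txn_amount ELSE - txn_amount END) AS net_change
    FROM data_bank.customer_transactions
    WHERE customer_id <= 5
    GROUP BY customer_id, month
)
SELECT
    customer_id,
    month,
    net_change,
    SUM(net_change) OVER (PARTITION BY customer_id ORDER BY month) AS closing_balance
FROM monthly
ORDER BY customer_id, month;
""", conn)
```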
1324 | 1325 | 1326 | 1327 | 1328 | ___ 1329 | #### 5. What is the percentage of customers who increase their closing balance by more than 5%? 1330 | 1331 | 1332 | ```python 1333 | pd.read_sql(""" 1334 | WITH monthly_balance_cte AS 1335 | ( 1336 | SELECT 1337 | customer_id, 1338 | EXTRACT(MONTH FROM txn_date)::INTEGER AS month, 1339 | SUM(CASE WHEN txn_type = 'deposit' THEN txn_amount ELSE - txn_amount END) AS closing_balance 1340 | FROM data_bank.customer_transactions 1341 | GROUP BY customer_id, month 1342 | ORDER BY customer_id, month 1343 | ), 1344 | balance_greaterthan5_cte AS 1345 | ( 1346 | SELECT COUNT(DISTINCT customer_id) AS nb_customers 1347 | FROM 1348 | ( 1349 | SELECT 1350 | customer_id, 1351 | (LEAD(closing_balance) OVER (PARTITION BY customer_id ORDER BY month) - closing_balance)/ closing_balance::numeric*100 AS percent_change 1352 | FROM monthly_balance_cte 1353 | ) pc 1354 | WHERE percent_change > 5 1355 | ) 1356 | SELECT 1357 | MAX(nb_customers) AS nb_customers, 1358 | COUNT(DISTINCT ct.customer_id) AS total_customers, 1359 | CONCAT(ROUND(MAX(nb_customers)/COUNT(DISTINCT ct.customer_id)::numeric * 100,1), ' %') AS percentage_customers 1360 | FROM balance_greaterthan5_cte b, data_bank.customer_transactions ct 1361 | """, conn) 1362 | ``` 1363 | 1364 | 1365 | 1366 | 1367 | 1368 | 1369 | 1370 | 1371 | 1372 | 1373 | 1374 | 1375 | 1376 | 1377 | 1378 | 1379 | 1380 | 1381 |
|     | nb_customers | total_customers | percentage_customers |
| --- | --- | --- | --- |
| 0 | 269 | 500 | 53.8 % |
1382 | 1383 | 1384 | 1385 | 1386 | **Result**\ 1387 | There are 53.8 % of the Data Bank customers who increase their closing balance by more than 5%. 1388 | 1389 | ___ 1390 | 1391 | ## C. Data Allocation Challenge 1392 | 1393 | To test out a few different hypotheses - the Data Bank team wants to run an experiment where different groups of customers would be allocated data using 3 different options: 1394 | 1395 | - `Option 1`: data is allocated based off the amount of money at the end of the previous month 1396 | - `Option 2`: data is allocated on the average amount of money kept in the account in the previous 30 days 1397 | - `Option 3`: data is updated real-time 1398 | 1399 | For this multi-part challenge question - you have been requested to generate the following data elements to help the Data Bank team estimate how much data will need to be provisioned for each option: 1400 | - running customer balance column that includes the impact each transaction 1401 | - customer balance at the end of each month 1402 | - minimum, average and maximum values of the running balance for each customer 1403 | 1404 | Using all of the data available - how much data would have been required for each option on a monthly basis? 1405 | 1406 | ### Running Balance 1407 | 1408 | 1409 | ```python 1410 | pd.read_sql(""" 1411 | SELECT *, 1412 | SUM( 1413 | CASE WHEN txn_type = 'deposit' THEN txn_amount ELSE -txn_amount END 1414 | ) OVER (PARTITION BY customer_id ORDER BY txn_date) AS running_balance 1415 | FROM data_bank.customer_transactions 1416 | WHERE customer_id <= 5; 1417 | """, conn) 1418 | ``` 1419 | 1420 | 1421 | 1422 | 1423 | 1424 | 1425 | 1426 | 1427 | 1428 | 1429 | 1430 | 1431 | 1432 | 1433 | 1434 | 1435 | 1436 | 1437 | 1438 | 1439 | 1440 | 1441 | 1442 | 1443 | 1444 | 1445 | 1446 | 1447 | 1448 | 1449 | 1450 | 1451 | 1452 | 1453 | 1454 | 1455 | 1456 | 1457 | 1458 | 1459 | 1460 | 1461 | 1462 | 1463 | 1464 | 1465 | 1466 | 1467 | 1468 | 1469 | 1470 | 1471 | 1472 | 1473 | 1474 | 1475 | 1476 | 1477 | 1478 | 1479 | 1480 | 1481 | 1482 | 1483 | 1484 | 1485 | 1486 | 1487 | 1488 | 1489 | 1490 | 1491 | 1492 | 1493 | 1494 | 1495 | 1496 | 1497 | 1498 | 1499 | 1500 | 1501 | 1502 | 1503 | 1504 | 1505 | 1506 | 1507 | 1508 | 1509 | 1510 | 1511 | 1512 | 1513 | 1514 | 1515 | 1516 | 1517 | 1518 | 1519 | 1520 | 1521 | 1522 | 1523 | 1524 | 1525 | 1526 | 1527 | 1528 | 1529 | 1530 | 1531 | 1532 | 1533 | 1534 | 1535 | 1536 | 1537 | 1538 | 1539 | 1540 | 1541 | 1542 | 1543 | 1544 | 1545 | 1546 | 1547 | 1548 | 1549 | 1550 | 1551 | 1552 | 1553 | 1554 | 1555 | 1556 | 1557 | 1558 | 1559 | 1560 | 1561 | 1562 | 1563 | 1564 | 1565 | 1566 | 1567 | 1568 | 1569 | 1570 | 1571 | 1572 | 1573 | 1574 | 1575 | 1576 | 1577 | 1578 | 1579 | 1580 | 1581 | 1582 | 1583 | 1584 | 1585 | 1586 | 1587 | 1588 | 1589 | 1590 | 1591 | 1592 | 1593 | 1594 | 1595 | 1596 | 1597 | 1598 | 1599 | 1600 | 1601 | 1602 | 1603 | 1604 | 1605 | 1606 | 1607 | 1608 | 1609 | 1610 | 1611 | 1612 | 1613 | 1614 | 1615 | 1616 | 1617 | 1618 | 1619 | 1620 | 1621 | 1622 | 1623 | 1624 | 1625 | 1626 | 1627 | 1628 | 1629 | 1630 | 1631 | 1632 | 1633 |
|     | customer_id | txn_date | txn_type | txn_amount | running_balance |
| --- | --- | --- | --- | --- | --- |
| 0 | 1 | 2020-01-02 | deposit | 312 | 312 |
| 1 | 1 | 2020-03-05 | purchase | 612 | -300 |
| 2 | 1 | 2020-03-17 | deposit | 324 | 24 |
| 3 | 1 | 2020-03-19 | purchase | 664 | -640 |
| 4 | 2 | 2020-01-03 | deposit | 549 | 549 |
| 5 | 2 | 2020-03-24 | deposit | 61 | 610 |
| 6 | 3 | 2020-01-27 | deposit | 144 | 144 |
| 7 | 3 | 2020-02-22 | purchase | 965 | -821 |
| 8 | 3 | 2020-03-05 | withdrawal | 213 | -1034 |
| 9 | 3 | 2020-03-19 | withdrawal | 188 | -1222 |
| 10 | 3 | 2020-04-12 | deposit | 493 | -729 |
| 11 | 4 | 2020-01-07 | deposit | 458 | 458 |
| 12 | 4 | 2020-01-21 | deposit | 390 | 848 |
| 13 | 4 | 2020-03-25 | purchase | 193 | 655 |
| 14 | 5 | 2020-01-15 | deposit | 974 | 974 |
| 15 | 5 | 2020-01-25 | deposit | 806 | 1780 |
| 16 | 5 | 2020-01-31 | withdrawal | 826 | 954 |
| 17 | 5 | 2020-03-02 | purchase | 886 | 68 |
| 18 | 5 | 2020-03-19 | deposit | 718 | 786 |
| 19 | 5 | 2020-03-26 | withdrawal | 786 | 0 |
| 20 | 5 | 2020-03-27 | deposit | 412 | -288 |
| 21 | 5 | 2020-03-27 | withdrawal | 700 | -288 |
| 22 | 5 | 2020-03-29 | purchase | 852 | -1140 |
| 23 | 5 | 2020-03-31 | purchase | 783 | -1923 |
| 24 | 5 | 2020-04-02 | withdrawal | 490 | -2413 |
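
For comparison, the same running balance can be reproduced client-side in pandas with a grouped cumulative sum (only a sketch; on same-day transactions the row-by-row cumsum can differ slightly from the SQL window's default `RANGE` frame, which sums all rows sharing a date together):

```python
# Pull the raw transactions through the existing connection
txns = pd.read_sql("SELECT * FROM data_bank.customer_transactions WHERE customer_id <= 5;", conn)

# Deposits add to the balance; purchases and withdrawals subtract
txns["signed_amount"] = txns["txn_amount"].where(txns["txn_type"] == "deposit", -txns["txn_amount"])

# Mirror SUM(...) OVER (PARTITION BY customer_id ORDER BY txn_date)
txns = txns.sort_values(["customer_id", "txn_date"])
txns["running_balance"] = txns.groupby("customer_id")["signed_amount"].cumsum()
txns.head()
```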
1634 | 1635 | 1636 | 1637 | 1638 | ### Monthly Balance 1639 | 1640 | 1641 | ```python 1642 | pd.read_sql(""" 1643 | SELECT 1644 | customer_id, 1645 | EXTRACT(MONTH FROM txn_date)::INTEGER AS month, 1646 | TO_CHAR(txn_date, 'Month') AS month_name, 1647 | SUM(CASE WHEN txn_type = 'deposit' THEN txn_amount ELSE -txn_amount END) AS closing_balance 1648 | FROM data_bank.customer_transactions 1649 | WHERE customer_id <= 5 1650 | GROUP BY customer_id, month, month_name 1651 | ORDER BY customer_id, month; 1652 | """, conn) 1653 | ``` 1654 | 1655 | 1656 | 1657 | 1658 | 1659 | 1660 | 1661 | 1662 | 1663 | 1664 | 1665 | 1666 | 1667 | 1668 | 1669 | 1670 | 1671 | 1672 | 1673 | 1674 | 1675 | 1676 | 1677 | 1678 | 1679 | 1680 | 1681 | 1682 | 1683 | 1684 | 1685 | 1686 | 1687 | 1688 | 1689 | 1690 | 1691 | 1692 | 1693 | 1694 | 1695 | 1696 | 1697 | 1698 | 1699 | 1700 | 1701 | 1702 | 1703 | 1704 | 1705 | 1706 | 1707 | 1708 | 1709 | 1710 | 1711 | 1712 | 1713 | 1714 | 1715 | 1716 | 1717 | 1718 | 1719 | 1720 | 1721 | 1722 | 1723 | 1724 | 1725 | 1726 | 1727 | 1728 | 1729 | 1730 | 1731 | 1732 | 1733 | 1734 | 1735 | 1736 | 1737 | 1738 | 1739 | 1740 | 1741 | 1742 | 1743 | 1744 | 1745 | 1746 | 1747 | 1748 | 1749 | 1750 | 1751 | 1752 | 1753 | 1754 | 1755 | 1756 | 1757 | 1758 |
|     | customer_id | month | month_name | closing_balance |
| --- | --- | --- | --- | --- |
| 0 | 1 | 1 | January | 312 |
| 1 | 1 | 3 | March | -952 |
| 2 | 2 | 1 | January | 549 |
| 3 | 2 | 3 | March | 61 |
| 4 | 3 | 1 | January | 144 |
| 5 | 3 | 2 | February | -965 |
| 6 | 3 | 3 | March | -401 |
| 7 | 3 | 4 | April | 493 |
| 8 | 4 | 1 | January | 848 |
| 9 | 4 | 3 | March | -193 |
| 10 | 5 | 1 | January | 954 |
| 11 | 5 | 3 | March | -2877 |
| 12 | 5 | 4 | April | -490 |
1759 | 1760 | 1761 | 1762 | 1763 | ### Min, Average, Max Transaction 1764 | 1765 | 1766 | ```python 1767 | pd.read_sql(""" 1768 | SELECT 1769 | customer_id, 1770 | MIN(running_balance) AS min_transaction, 1771 | MAX(running_balance) AS max_transaction, 1772 | ROUND(AVG(running_balance),2) AS avg_transaction 1773 | FROM 1774 | ( 1775 | SELECT *, 1776 | SUM(CASE WHEN txn_type = 'deposit' THEN txn_amount ELSE -txn_amount END) OVER (PARTITION BY customer_id ORDER BY txn_date) AS running_balance 1777 | FROM data_bank.customer_transactions 1778 | ) running_balance 1779 | WHERE customer_id <= 10 1780 | GROUP BY customer_id 1781 | ORDER BY customer_id 1782 | """, conn) 1783 | ``` 1784 | 1785 | 1786 | 1787 | 1788 | 1789 | 1790 | 1791 | 1792 | 1793 | 1794 | 1795 | 1796 | 1797 | 1798 | 1799 | 1800 | 1801 | 1802 | 1803 | 1804 | 1805 | 1806 | 1807 | 1808 | 1809 | 1810 | 1811 | 1812 | 1813 | 1814 | 1815 | 1816 | 1817 | 1818 | 1819 | 1820 | 1821 | 1822 | 1823 | 1824 | 1825 | 1826 | 1827 | 1828 | 1829 | 1830 | 1831 | 1832 | 1833 | 1834 | 1835 | 1836 | 1837 | 1838 | 1839 | 1840 | 1841 | 1842 | 1843 | 1844 | 1845 | 1846 | 1847 | 1848 | 1849 | 1850 | 1851 | 1852 | 1853 | 1854 | 1855 | 1856 | 1857 | 1858 | 1859 | 1860 | 1861 | 1862 | 1863 | 1864 | 1865 | 1866 | 1867 |
|     | customer_id | min_transaction | max_transaction | avg_transaction |
| --- | --- | --- | --- | --- |
| 0 | 1 | -640 | 312 | -151.00 |
| 1 | 2 | 549 | 610 | 579.50 |
| 2 | 3 | -1222 | 144 | -732.40 |
| 3 | 4 | 458 | 848 | 653.67 |
| 4 | 5 | -2413 | 1780 | -135.45 |
| 5 | 6 | -552 | 2197 | 624.00 |
| 6 | 7 | 887 | 3539 | 2268.69 |
| 7 | 8 | -1029 | 1363 | 173.70 |
| 8 | 9 | -91 | 2030 | 1021.70 |
| 9 | 10 | -5090 | 556 | -2229.83 |
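
The per-customer summary can likewise be spot-checked in pandas, reusing the `txns` frame from the earlier sketch (which only covers customers 1 to 5, so values for same-day transactions may differ slightly from the SQL output):

```python
# Grouped min / max / mean of the running balance per customer
summary = (
    txns.groupby("customer_id")["running_balance"]
        .agg(min_transaction="min", max_transaction="max", avg_transaction="mean")
        .round(2)
)
print(summary)
```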
1868 | 1869 | 1870 | 1871 | 1872 | 1873 | ```python 1874 | conn.close() 1875 | ``` 1876 | -------------------------------------------------------------------------------- /CaseStudy#4 - Data Bank/README.md: -------------------------------------------------------------------------------- 1 | # Case Study #4: Data Bank 💱 2 | [![View Main Folder](https://img.shields.io/badge/View-Main_Folder-F5788D.svg?logo=GitHub)](https://github.com/chanronnie/8WeekSQLChallenge) 3 | [![View Case Study 4](https://img.shields.io/badge/View-Case_Study_4-brown)](https://8weeksqlchallenge.com/case-study-4/)
4 | ![4](https://github.com/chanronnie/8WeekSQLChallenge/assets/121308347/8d5eb6d0-c3d2-4e6c-998d-6c00fff3d516) 5 | 6 | 7 | The case study presented here is part of the **8 Week SQL Challenge**.\ 8 | It is kindly brought to us by [**Data With Danny**](https://8weeksqlchallenge.com). 9 | 10 | This time, I am using `PostgreSQL queries` (instead of MySQL) in `Jupyter Notebook` to quickly view results, which provides me with an opportunity 11 | - to learn PostgreSQL 12 | - to utilize handy mathematical and string functions. 13 | 14 | 15 | ## Table of Contents 16 | * [Entity Relationship Diagram](#entity-relationship-diagram) 17 | * [Datasets](#datasets) 18 | * [Case Study Questions](#case-study-questions) 19 | * [Solutions](#solutions) 20 | * [PostgreSQL Topics Covered](#postgresql-topics-covered) 21 | 22 | ## Entity Relationship Diagram 23 | ![DataBank](https://github.com/chanronnie/8WeekSQLChallenge/assets/121308347/5e232441-2712-4812-90a3-3a9ee5cd1ef0) 24 | 25 | 26 | 27 | 28 | ## Datasets 29 | The Case Study #4 contains 3 tables: 30 | - **regions**: This table maps the region_id to its respective region_name values. 31 | - **customer_nodes**: This table lists the regions and node reallocation informations of all Data Bank customers. 32 | - **customer_transactions**: This table lists all the transaction informations of all Data Bank customers. 33 | 34 | ## Case Study Questions 35 | Case Study #4 is categorized into 3 question groups\ 36 | To view the specific section, please open the link in a *`new tab`* or *`window`*.\ 37 | [A. Customer Nodes Exploration](CaseStudy4_solutions.md#A)\ 38 | [B. Customer Transactions](CaseStudy4_solutions.md#B)\ 39 | [C. Data Allocation Challenge](CaseStudy4_solutions.md#C) 40 | 41 | ## Solutions 42 | - View `data_bank` database: [**here**](CaseStudy4_schema.sql) 43 | - View Solution: 44 | - [**Markdown File**](CaseStudy4_solutions.md): offers a more fluid and responsive viewing experience 45 | - [**Jupyter Notebook**](CaseStudy4_solutions.ipynb): contains the original code 46 | 47 | ## PostgreSQL Topics Covered 48 | - Common Table Expressions (CTE) 49 | - Window Functions 50 | - Subqueries 51 | - JOIN, UNION ALL 52 | - MEDIAN, PERCENTILE 53 | -------------------------------------------------------------------------------- /CaseStudy#5 - Data Mart/CaseStudy5_solutions.md: -------------------------------------------------------------------------------- 1 | # Case Study #5: Data Mart 2 | The case study questions presented here are created by [**Data With Danny**](https://linktr.ee/datawithdanny). They are part of the [**8 Week SQL Challenge**](https://8weeksqlchallenge.com/). 3 | 4 | My SQL queries are written in the `PostgreSQL 15` dialect, integrated into `Jupyter Notebook`, which allows us to instantly view the query results and document the queries. 5 | 6 | For more details about the **Case Study #5**, click [**here**](https://8weeksqlchallenge.com/case-study-5/). 7 | 8 | ## Table of Contents 9 | 10 | ### [1. Importing Libraries](#Import) 11 | 12 | ### [2. Tables of the Database](#Tables) 13 | 14 | ### [3. Case Study Questions](#CaseStudyQuestions) 15 | 16 | - [A. Data Cleansing](#A) 17 | - [B. Data Exploration](#B) 18 | - [C. Before & After Analysis](#C) 19 | - [D. Bonus Question](#D) 20 | 21 | 22 | ## 1. 
Importing Required Libraries 23 | 24 | 25 | ```python 26 | import psycopg2 as pg2 27 | import pandas as pd 28 | from datetime import datetime 29 | import seaborn as sns 30 | import matplotlib.pyplot as plt 31 | import matplotlib.ticker as ticker 32 | import os 33 | import warnings 34 | 35 | warnings.filterwarnings('ignore') 36 | ``` 37 | 38 | ### Connecting PostgreSQL database from Jupyter Notebook 39 | 40 | 41 | ```python 42 | # Get PostgreSQL password 43 | mypassword = os.getenv('POSTGRESQL_PASSWORD') 44 | 45 | # Connecting to database 46 | conn = pg2.connect(user = 'postgres', password = mypassword, database = 'data_mart') 47 | cursor = conn.cursor() 48 | ``` 49 | 50 | 51 | 52 | ## 2. Tables of the Database 53 | 54 | 55 | ```python 56 | cursor.execute(""" 57 | SELECT table_schema, table_name 58 | FROM information_schema.tables 59 | WHERE table_schema = 'data_mart' 60 | """ 61 | ) 62 | 63 | print('--- Tables within "data_mart" database --- ') 64 | for table in cursor: 65 | print(table[1]) 66 | ``` 67 | 68 | --- Tables within "data_mart" database --- 69 | weekly_sales 70 | 71 | 72 | Here is the `weekly_sales` table, containing 17,117 data since the March 26, 2018 73 | 74 | 75 | ```python 76 | pd.read_sql(""" 77 | SELECT * 78 | FROM data_mart.weekly_sales; 79 | """, conn) 80 | ``` 81 | 82 | 83 | 84 | 85 | 86 | 87 | 88 | 89 | 90 | 91 | 92 | 93 | 94 | 95 | 96 | 97 | 98 | 99 | 100 | 101 | 102 | 103 | 104 | 105 | 106 | 107 | 108 | 109 | 110 | 111 | 112 | 113 | 114 | 115 | 116 | 117 | 118 | 119 | 120 | 121 | 122 | 123 | 124 | 125 | 126 | 127 | 128 | 129 | 130 | 131 | 132 | 133 | 134 | 135 | 136 | 137 | 138 | 139 | 140 | 141 | 142 | 143 | 144 | 145 | 146 | 147 | 148 | 149 | 150 | 151 | 152 | 153 | 154 | 155 | 156 | 157 | 158 | 159 | 160 | 161 | 162 | 163 | 164 | 165 | 166 | 167 | 168 | 169 | 170 | 171 | 172 | 173 | 174 | 175 | 176 | 177 | 178 | 179 | 180 | 181 | 182 | 183 | 184 | 185 | 186 | 187 | 188 | 189 | 190 | 191 | 192 | 193 | 194 | 195 | 196 | 197 | 198 | 199 | 200 | 201 | 202 | 203 | 204 | 205 | 206 | 207 |
|     | week_date | region | platform | segment | customer_type | transactions | sales |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 31/8/20 | ASIA | Retail | C3 | New | 120631 | 3656163 |
| 1 | 31/8/20 | ASIA | Retail | F1 | New | 31574 | 996575 |
| 2 | 31/8/20 | USA | Retail | null | Guest | 529151 | 16509610 |
| 3 | 31/8/20 | EUROPE | Retail | C1 | New | 4517 | 141942 |
| 4 | 31/8/20 | AFRICA | Retail | C2 | New | 58046 | 1758388 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 17112 | 26/3/18 | AFRICA | Retail | C3 | New | 98342 | 3706066 |
| 17113 | 26/3/18 | USA | Shopify | C4 | New | 16 | 2784 |
| 17114 | 26/3/18 | USA | Retail | F2 | New | 25665 | 1064172 |
| 17115 | 26/3/18 | EUROPE | Retail | C4 | New | 883 | 33523 |
| 17116 | 26/3/18 | AFRICA | Retail | C3 | Existing | 218516 | 12083475 |

17117 rows × 7 columns
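
Note that `week_date` is stored as text in `dd/mm/yy` form, which is why the cleansing step in the next section converts it with `TO_DATE`. A quick client-side check of the same conversion (a sketch using the open `conn`):

```python
# Parse a small sample the same way TO_DATE(week_date, 'dd/mm/yy') will
sample = pd.read_sql("SELECT week_date FROM data_mart.weekly_sales LIMIT 5;", conn)
sample["week_date_parsed"] = pd.to_datetime(sample["week_date"], format="%d/%m/%y")
print(sample)
```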

209 | 210 | 211 | 212 | 213 | 214 | 215 | ## 3. Case Study Questions 216 | 217 | 218 | ## A. Data Cleansing 219 | In a single query, perform the following operations and generate a *new table* in the data_mart schema named `clean_weekly_sales`: 220 | 221 | - Convert the `week_date` to a DATE format 222 | 223 | - Add a `week_number` as the second column for each `week_date` value, for example any value from the 1st of January to 7th of January will be 1, 8th to 14th will be 2 etc 224 | 225 | - Add a `month_number` with the calendar month for each `week_date` value as the 3rd column 226 | 227 | - Add a `calendar_year` column as the 4th column containing either 2018, 2019 or 2020 values 228 | 229 | - Add a new column called `age_band` after the original `segment` column using the following mapping on the number inside the segment value 230 | 231 |
segment | age_band
--- | ---
1 | Young Adults
2 | Middle Aged
3 or 4 | Retirees
- Add a new `demographic` column using the following mapping for the first letter in the segment values:
segment | demographic
--- | ---
C | Couples
F | Families
251 | 252 | - Ensure all null string values with an "unknown" string value in the original `segment` column as well as the new `age_band` and `demographic` columns 253 | 254 | - Generate a new `avg_transaction` column as the sales value divided by transactions rounded to 2 decimal places for each record 255 | 256 | ___ 257 | Creating an empty table is the initial step required to store the processed data from the original dataset `weekly_sales` in the desired column order. By using a single query with the `INSERT INTO` statement, it is possible to execute various data cleaning steps within the `SELECT` block. 258 | 259 | 260 | ```python 261 | # Creating table 262 | cursor.execute("DROP TABLE IF EXISTS data_mart.clean_weekly_sales;") 263 | cursor.execute(""" 264 | CREATE TABLE data_mart.clean_weekly_sales 265 | ( 266 | "week_date" DATE, 267 | "week_number" INTEGER, 268 | "month_number" INTEGER, 269 | "calendar_year" INTEGER, 270 | "region" VARCHAR(13), 271 | "platform" VARCHAR(7), 272 | "segment" VARCHAR(10), 273 | "age_band" VARCHAR(50), 274 | "demographic" VARCHAR(10), 275 | "customer_type" VARCHAR(8), 276 | "transactions" INTEGER, 277 | "sales" INTEGER, 278 | "avg_transaction" DECIMAL 279 | ); 280 | """) 281 | 282 | 283 | # Inserting required and processed data into the newly created table 284 | cursor.execute(""" 285 | INSERT INTO data_mart.clean_weekly_sales 286 | SELECT 287 | TO_DATE(week_date, 'dd/mm/yy') AS week_date, 288 | DATE_PART('week', TO_DATE(week_date, 'dd/mm/yy'))::INTEGER as week_number, 289 | DATE_PART('month', TO_DATE(week_date, 'dd/mm/yy'))::INTEGER as month_number, 290 | DATE_PART('year', TO_DATE(week_date, 'dd/mm/yy'))::INTEGER as calendar_year, 291 | region, 292 | platform, 293 | CASE WHEN segment = 'null' THEN 'unknown' ELSE segment END as segment, 294 | CASE 295 | WHEN segment LIKE '%1' THEN 'Young Adults' 296 | WHEN segment LIKE '%2' THEN 'Middle Aged' 297 | WHEN REGEXP_LIKE(segment, '3|4') THEN 'Retirees' 298 | ELSE 'unknown' 299 | END AS age_band, 300 | 301 | CASE 302 | WHEN segment LIKE 'C%' THEN 'Couples' 303 | WHEN segment LIKE 'F%' THEN 'Families' 304 | ELSE 'unknown' 305 | END AS demographic, 306 | customer_type, 307 | transactions, 308 | sales, 309 | ROUND(sales/transactions::numeric, 2) AS avg_transaction 310 | FROM data_mart.weekly_sales; 311 | """) 312 | 313 | # Saving updates 314 | conn.commit() 315 | ``` 316 | 317 | **Result** 318 | 319 | 320 | ```python 321 | pd.read_sql(""" 322 | SELECT * 323 | FROM data_mart.clean_weekly_sales; 324 | """, conn) 325 | ``` 326 | 327 | 328 | 329 | 330 | 331 | 332 | 333 | 334 | 335 | 336 | 337 | 338 | 339 | 340 | 341 | 342 | 343 | 344 | 345 | 346 | 347 | 348 | 349 | 350 | 351 | 352 | 353 | 354 | 355 | 356 | 357 | 358 | 359 | 360 | 361 | 362 | 363 | 364 | 365 | 366 | 367 | 368 | 369 | 370 | 371 | 372 | 373 | 374 | 375 | 376 | 377 | 378 | 379 | 380 | 381 | 382 | 383 | 384 | 385 | 386 | 387 | 388 | 389 | 390 | 391 | 392 | 393 | 394 | 395 | 396 | 397 | 398 | 399 | 400 | 401 | 402 | 403 | 404 | 405 | 406 | 407 | 408 | 409 | 410 | 411 | 412 | 413 | 414 | 415 | 416 | 417 | 418 | 419 | 420 | 421 | 422 | 423 | 424 | 425 | 426 | 427 | 428 | 429 | 430 | 431 | 432 | 433 | 434 | 435 | 436 | 437 | 438 | 439 | 440 | 441 | 442 | 443 | 444 | 445 | 446 | 447 | 448 | 449 | 450 | 451 | 452 | 453 | 454 | 455 | 456 | 457 | 458 | 459 | 460 | 461 | 462 | 463 | 464 | 465 | 466 | 467 | 468 | 469 | 470 | 471 | 472 | 473 | 474 | 475 | 476 | 477 | 478 | 479 | 480 | 481 | 482 | 483 | 484 | 485 | 486 | 487 | 488 | 489 | 490 | 491 | 492 | 493 | 494 | 
week_date | week_number | month_number | calendar_year | region | platform | segment | age_band | demographic | customer_type | transactions | sales | avg_transaction
--- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | ---
2020-08-31 | 36 | 8 | 2020 | ASIA | Retail | C3 | Retirees | Couples | New | 120631 | 3656163 | 30.31
2020-08-31 | 36 | 8 | 2020 | ASIA | Retail | F1 | Young Adults | Families | New | 31574 | 996575 | 31.56
2020-08-31 | 36 | 8 | 2020 | USA | Retail | unknown | unknown | unknown | Guest | 529151 | 16509610 | 31.20
2020-08-31 | 36 | 8 | 2020 | EUROPE | Retail | C1 | Young Adults | Couples | New | 4517 | 141942 | 31.42
2020-08-31 | 36 | 8 | 2020 | AFRICA | Retail | C2 | Middle Aged | Couples | New | 58046 | 1758388 | 30.29
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ...
2018-03-26 | 13 | 3 | 2018 | AFRICA | Retail | C3 | Retirees | Couples | New | 98342 | 3706066 | 37.69
2018-03-26 | 13 | 3 | 2018 | USA | Shopify | C4 | Retirees | Couples | New | 16 | 2784 | 174.00
2018-03-26 | 13 | 3 | 2018 | USA | Retail | F2 | Middle Aged | Families | New | 25665 | 1064172 | 41.46
2018-03-26 | 13 | 3 | 2018 | EUROPE | Retail | C4 | Retirees | Couples | New | 883 | 33523 | 37.96
2018-03-26 | 13 | 3 | 2018 | AFRICA | Retail | C3 | Retirees | Couples | Existing | 218516 | 12083475 | 55.30

*17117 rows × 13 columns*
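As a quick sanity check on the cleaning steps (a sketch that simply reuses the `conn` session opened earlier), both counters below should come back as zero: no literal `'null'` strings should remain in `segment`, and every `avg_transaction` should equal the rounded `sales / transactions` ratio.

```python
# Sanity check: both columns should return 0 if the cleaning worked as intended
pd.read_sql("""
SELECT
    SUM(CASE WHEN segment = 'null' THEN 1 ELSE 0 END) AS remaining_null_strings,
    SUM(CASE WHEN avg_transaction <> ROUND(sales/transactions::NUMERIC, 2) THEN 1 ELSE 0 END) AS avg_transaction_mismatches
FROM data_mart.clean_weekly_sales;
""", conn)
```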

526 | 527 | 528 | 529 | 530 | 531 | 532 | ## B. Data Exploration 533 | 534 | #### 1. What day of the week is used for each `week_date` value? 535 | 536 | 537 | ```python 538 | pd.read_sql(""" 539 | SELECT DISTINCT TO_CHAR(week_date, 'Day') AS day_of_week 540 | FROM data_mart.clean_weekly_sales; 541 | """, conn) 542 | ``` 543 | 544 | 545 | 546 | 547 | 548 | 549 | 550 | 551 | 552 | 553 | 554 | 555 | 556 | 557 |
| day_of_week |
| --- |
| Monday |
558 | 559 | 560 | 561 | 562 | **Result**\ 563 | Monday is the designated day of the week for each `week_date` value. 564 | 565 | ___ 566 | #### 2. What range of week numbers are missing from the dataset? 567 | 568 | Taking into consideration that there are 52 weeks in a year, we can utilize the `generate_series` function to generate a series of numbers from 1 to 52. By performing a **LEFT JOIN** between this number series and the `clean_week_sales` table, we can identify all the week numbers that are not present in the dataset. 569 | 570 | 571 | ```python 572 | pd.read_sql(""" 573 | SELECT gs.week_number AS missing_week_nb 574 | FROM generate_series(1,52,1) gs(week_number) 575 | LEFT JOIN data_mart.clean_weekly_sales d ON gs.week_number = d.week_number 576 | WHERE d.week_number IS NULL 577 | """, conn) 578 | ``` 579 | 580 | 581 | 582 | 583 | 584 | 585 | 586 | 587 | 588 | 589 | 590 | 591 | 592 | 593 | 594 | 595 | 596 | 597 | 598 | 599 | 600 | 601 | 602 | 603 | 604 | 605 | 606 | 607 | 608 | 609 | 610 | 611 | 612 | 613 | 614 | 615 | 616 | 617 | 618 | 619 | 620 | 621 | 622 | 623 | 624 | 625 | 626 | 627 | 628 | 629 | 630 | 631 | 632 | 633 | 634 | 635 | 636 | 637 | 638 | 639 | 640 | 641 | 642 | 643 | 644 | 645 | 646 | 647 | 648 | 649 | 650 | 651 | 652 | 653 | 654 | 655 | 656 | 657 | 658 | 659 | 660 | 661 | 662 | 663 | 664 | 665 | 666 | 667 | 668 | 669 | 670 | 671 | 672 | 673 | 674 | 675 | 676 | 677 | 678 | 679 | 680 | 681 | 682 | 683 | 684 | 685 | 686 | 687 | 688 | 689 | 690 | 691 | 692 | 693 | 694 | 695 | 696 | 697 | 698 | 699 | 700 | 701 |
| missing_week_nb |
| --- |
| 1 |
| 2 |
| 3 |
| 4 |
| 5 |
| 6 |
| 7 |
| 8 |
| 9 |
| 10 |
| 11 |
| 12 |
| 37 |
| 38 |
| 39 |
| 40 |
| 41 |
| 42 |
| 43 |
| 44 |
| 45 |
| 46 |
| 47 |
| 48 |
| 49 |
| 50 |
| 51 |
| 52 |
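The same anti-join can also be written with the `EXCEPT` set operator; this is only an alternative sketch of the query above, not a different result.

```python
# Alternative: subtract the observed week numbers from the full 1-52 series
pd.read_sql("""
SELECT gs.week_number AS missing_week_nb
FROM generate_series(1, 52, 1) gs(week_number)
EXCEPT
SELECT week_number
FROM data_mart.clean_weekly_sales
ORDER BY 1;
""", conn)
```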
702 | 703 | 704 | 705 | 706 | **Result** 707 | - The first range of missing week numbers in the dataset is from week 1 to week 12. 708 | - The second missing range of week numbers is from week 37 to week 52. 709 | 710 | ___ 711 | #### 3. How many total transactions were there for each year in the dataset? 712 | 713 | 714 | ```python 715 | pd.read_sql(""" 716 | SELECT 717 | calendar_year AS year, 718 | COUNT(transactions) AS nb_transactions 719 | FROM data_mart.clean_weekly_sales 720 | GROUP BY year; 721 | """, conn) 722 | ``` 723 | 724 | 725 | 726 | 727 | 728 | 729 | 730 | 731 | 732 | 733 | 734 | 735 | 736 | 737 | 738 | 739 | 740 | 741 | 742 | 743 | 744 | 745 | 746 | 747 | 748 | 749 |
year | nb_transactions
--- | ---
2018 | 5698
2020 | 5711
2019 | 5708
750 | 751 | 752 | 753 | 754 | ___ 755 | #### 4. What is the total sales for each region for each month? 756 | 757 | 758 | ```python 759 | df4 = pd.read_sql(""" 760 | SELECT 761 | region, 762 | calendar_year AS year, 763 | TO_CHAR(week_date, 'Month') AS month_name, 764 | SUM(sales) AS total_sales 765 | FROM data_mart.clean_weekly_sales 766 | GROUP BY region, year, month_number, month_name 767 | ORDER BY region, year, month_number 768 | """, conn) 769 | 770 | df4 771 | ``` 772 | 773 | 774 | 775 | 776 | 777 | 778 | 779 | 780 | 781 | 782 | 783 | 784 | 785 | 786 | 787 | 788 | 789 | 790 | 791 | 792 | 793 | 794 | 795 | 796 | 797 | 798 | 799 | 800 | 801 | 802 | 803 | 804 | 805 | 806 | 807 | 808 | 809 | 810 | 811 | 812 | 813 | 814 | 815 | 816 | 817 | 818 | 819 | 820 | 821 | 822 | 823 | 824 | 825 | 826 | 827 | 828 | 829 | 830 | 831 | 832 | 833 | 834 | 835 | 836 | 837 | 838 | 839 | 840 | 841 | 842 | 843 | 844 | 845 | 846 | 847 | 848 | 849 | 850 | 851 | 852 | 853 | 854 | 855 | 856 | 857 | 858 | 859 | 860 | 861 | 862 | 863 |
region | year | month_name | total_sales
--- | --- | --- | ---
AFRICA | 2018 | March | 130542213
AFRICA | 2018 | April | 650194751
AFRICA | 2018 | May | 522814997
AFRICA | 2018 | June | 519127094
AFRICA | 2018 | July | 674135866
... | ... | ... | ...
USA | 2020 | April | 221952003
USA | 2020 | May | 225545881
USA | 2020 | June | 277763625
USA | 2020 | July | 223735311
USA | 2020 | August | 277361606

*140 rows × 4 columns*
865 | 866 | 867 | 868 | 869 | **Result**\ 870 | The output table above, containing 140 rows, makes it challenging to gain insights into the sales performance at Data Mart. Therefore, let's examine a time-series visualization that represents the total sales made by Data Mart by region, using the `hue` parameter. 871 | 872 | 873 | ```python 874 | # Insert a new column for datetime using calendar_year and month_name values 875 | df4['date'] = df4.apply(lambda row: datetime(row['year'], datetime.strptime(row['month_name'].strip(), '%B').month, 1), axis=1) 876 | 877 | 878 | # plot the total_sales by region 879 | sns.set_style("darkgrid") 880 | sns.lineplot(df4['date'], df4['total_sales'], hue = df4['region']) 881 | plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left') 882 | 883 | # customize the format, labels and title 884 | plt.xticks(rotation=45, ha='right') 885 | plt.ylabel("total sales") 886 | formatter = ticker.StrMethodFormatter('${x:,.0f}') 887 | plt.gca().yaxis.set_major_formatter(formatter) 888 | plt.title('Total Sales at Data Mart between 2018 and 2020'); 889 | ``` 890 | 891 | 892 | 893 | ![png](image/plot.png) 894 | 895 | 896 | 897 | **Insights** 898 | - Data Mart has achieved the highest sales in Oceania since 2018. 899 | - Following Oceania, Africa and Asia are the second and third most lucrative regions for Data Mart sales. 900 | - South America and Europe recorded the lowest sales figures from 2018 to 2020. 901 | 902 | ___ 903 | #### 5. What is the total count of transactions for each platform 904 | 905 | 906 | ```python 907 | pd.read_sql(""" 908 | SELECT 909 | platform, 910 | COUNT(transactions) AS nb_transactions 911 | FROM data_mart.clean_weekly_sales 912 | GROUP BY platform; 913 | """, conn) 914 | ``` 915 | 916 | 917 | 918 | 919 | 920 | 921 | 922 | 923 | 924 | 925 | 926 | 927 | 928 | 929 | 930 | 931 | 932 | 933 | 934 | 935 | 936 |
platform | nb_transactions
--- | ---
Shopify | 8549
Retail | 8568
937 | 938 | 939 | 940 | 941 | ___ 942 | #### 6. What is the percentage of sales for Retail vs Shopify for each month? 943 | 944 | 945 | ```python 946 | pd.read_sql(""" 947 | WITH platform_sales_cte AS 948 | ( 949 | SELECT 950 | calendar_year, 951 | month_number, 952 | TO_CHAR(week_date, 'Month') AS month, 953 | SUM(sales) AS monthly_sales, 954 | SUM(CASE WHEN platform = 'Retail' THEN sales ELSE 0 END) AS retail_sales, 955 | SUM(CASE WHEN platform = 'Shopify' THEN sales ELSE 0 END) AS shopify_sales 956 | FROM data_mart.clean_weekly_sales 957 | GROUP BY calendar_year, month_number, month 958 | ORDER BY calendar_year, month 959 | ) 960 | SELECT 961 | calendar_year AS year, 962 | month, 963 | ROUND(retail_sales/monthly_sales::numeric*100,1) AS retail_percent_sales, 964 | ROUND(shopify_sales/monthly_sales::numeric*100,1) AS shopify_percent_sales 965 | FROM platform_sales_cte; 966 | """, conn) 967 | ``` 968 | 969 | 970 | 971 | 972 | 973 | 974 | 975 | 976 | 977 | 978 | 979 | 980 | 981 | 982 | 983 | 984 | 985 | 986 | 987 | 988 | 989 | 990 | 991 | 992 | 993 | 994 | 995 | 996 | 997 | 998 | 999 | 1000 | 1001 | 1002 | 1003 | 1004 | 1005 | 1006 | 1007 | 1008 | 1009 | 1010 | 1011 | 1012 | 1013 | 1014 | 1015 | 1016 | 1017 | 1018 | 1019 | 1020 | 1021 | 1022 | 1023 | 1024 | 1025 | 1026 | 1027 | 1028 | 1029 | 1030 | 1031 | 1032 | 1033 | 1034 | 1035 | 1036 | 1037 | 1038 | 1039 | 1040 | 1041 | 1042 | 1043 | 1044 | 1045 | 1046 | 1047 | 1048 | 1049 | 1050 | 1051 | 1052 | 1053 | 1054 | 1055 | 1056 | 1057 | 1058 | 1059 | 1060 | 1061 | 1062 | 1063 | 1064 | 1065 | 1066 | 1067 | 1068 | 1069 | 1070 | 1071 | 1072 | 1073 | 1074 | 1075 | 1076 | 1077 | 1078 | 1079 | 1080 | 1081 | 1082 | 1083 | 1084 | 1085 | 1086 | 1087 | 1088 | 1089 | 1090 | 1091 | 1092 | 1093 | 1094 | 1095 | 1096 | 1097 | 1098 | 1099 | 1100 | 1101 | 1102 | 1103 | 1104 | 1105 | 1106 | 1107 | 1108 | 1109 | 1110 | 1111 | 1112 | 1113 | 1114 | 1115 | 1116 | 1117 | 1118 | 1119 | 1120 | 1121 |
year | month | retail_percent_sales | shopify_percent_sales
--- | --- | --- | ---
2018 | April | 97.9 | 2.1
2018 | August | 97.7 | 2.3
2018 | July | 97.8 | 2.2
2018 | June | 97.8 | 2.2
2018 | March | 97.9 | 2.1
2018 | May | 97.7 | 2.3
2018 | September | 97.7 | 2.3
2019 | April | 97.8 | 2.2
2019 | August | 97.2 | 2.8
2019 | July | 97.4 | 2.6
2019 | June | 97.4 | 2.6
2019 | March | 97.7 | 2.3
2019 | May | 97.5 | 2.5
2019 | September | 97.1 | 2.9
2020 | April | 97.0 | 3.0
2020 | August | 96.5 | 3.5
2020 | July | 96.7 | 3.3
2020 | June | 96.8 | 3.2
2020 | March | 97.3 | 2.7
2020 | May | 96.7 | 3.3
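A side note on ordering: the CTE sorts on the text month name (`ORDER BY calendar_year, month`), which is why April precedes March in the output above. If a calendar ordering is preferred, the outer query can sort on `month_number` instead, roughly as sketched below.

```python
pd.read_sql("""
WITH platform_sales_cte AS
(
    SELECT
        calendar_year,
        month_number,
        TO_CHAR(week_date, 'Month') AS month,
        SUM(sales) AS monthly_sales,
        SUM(CASE WHEN platform = 'Retail' THEN sales ELSE 0 END) AS retail_sales,
        SUM(CASE WHEN platform = 'Shopify' THEN sales ELSE 0 END) AS shopify_sales
    FROM data_mart.clean_weekly_sales
    GROUP BY calendar_year, month_number, month
)
SELECT
    calendar_year AS year,
    month,
    ROUND(retail_sales/monthly_sales::numeric*100,1) AS retail_percent_sales,
    ROUND(shopify_sales/monthly_sales::numeric*100,1) AS shopify_percent_sales
FROM platform_sales_cte
ORDER BY calendar_year, month_number;  -- calendar order instead of alphabetical
""", conn)
```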
1122 | 1123 | 1124 | 1125 | 1126 | **Result**\ 1127 | Retail has achieved significantly higher sales, accounting for approximately 97% of the total sales, while Shopify only accounted for 2.5% of the sales. 1128 | 1129 | ___ 1130 | #### 7. What is the percentage of sales by demographic for each year in the dataset? 1131 | 1132 | 1133 | ```python 1134 | pd.read_sql(""" 1135 | WITH demographic_sales_cte AS 1136 | ( 1137 | SELECT 1138 | calendar_year AS year, 1139 | SUM(sales) AS total_sales, 1140 | SUM(CASE WHEN demographic = 'Families' THEN sales END) AS families_sales, 1141 | SUM(CASE WHEN demographic = 'Couples' THEN sales END) AS couples_sales, 1142 | SUM(CASE WHEN demographic = 'unknown' THEN sales END) AS unknown_sales 1143 | FROM data_mart.clean_weekly_sales 1144 | GROUP BY year 1145 | ) 1146 | SELECT 1147 | year, 1148 | ROUND(families_sales/total_sales::numeric*100,1) AS families_sales_percent, 1149 | ROUND(couples_sales/total_sales::numeric*100,1) AS couples_sales_percent, 1150 | ROUND(unknown_sales/total_sales::numeric*100,1) AS unknown_sales_percent 1151 | FROM demographic_sales_cte; 1152 | """, conn) 1153 | ``` 1154 | 1155 | 1156 | 1157 | 1158 | 1159 | 1160 | 1161 | 1162 | 1163 | 1164 | 1165 | 1166 | 1167 | 1168 | 1169 | 1170 | 1171 | 1172 | 1173 | 1174 | 1175 | 1176 | 1177 | 1178 | 1179 | 1180 | 1181 | 1182 | 1183 | 1184 | 1185 | 1186 | 1187 | 1188 |
year | families_sales_percent | couples_sales_percent | unknown_sales_percent
--- | --- | --- | ---
2018 | 32.0 | 26.4 | 41.6
2020 | 32.7 | 28.7 | 38.6
2019 | 32.5 | 27.3 | 40.3
1189 | 1190 | 1191 | 1192 | 1193 | **Result**\ 1194 | Families contributed slightly more to the sales of Data Mart, accounting for approximately 5% more sales compared to Couples. 1195 | 1196 | ___ 1197 | #### 8. Which `age_band` and `demographic` values contribute the most to Retail sales? 1198 | 1199 | 1200 | ```python 1201 | query8 = """ 1202 | SELECT 1203 | age_band, 1204 | demographic, 1205 | ROUND(SUM(sales)/(SELECT SUM(sales) FROM data_mart.clean_weekly_sales WHERE platform = 'Retail')::NUMERIC * 100,1) AS contribution_percent 1206 | FROM data_mart.clean_weekly_sales 1207 | WHERE platform = 'Retail' 1208 | GROUP BY age_band, demographic 1209 | ORDER BY contribution_percent DESC 1210 | """ 1211 | 1212 | pd.read_sql(query8, conn) 1213 | ``` 1214 | 1215 | 1216 | 1217 | 1218 | 1219 | 1220 | 1221 | 1222 | 1223 | 1224 | 1225 | 1226 | 1227 | 1228 | 1229 | 1230 | 1231 | 1232 | 1233 | 1234 | 1235 | 1236 | 1237 | 1238 | 1239 | 1240 | 1241 | 1242 | 1243 | 1244 | 1245 | 1246 | 1247 | 1248 | 1249 | 1250 | 1251 | 1252 | 1253 | 1254 | 1255 | 1256 | 1257 | 1258 | 1259 | 1260 | 1261 | 1262 | 1263 | 1264 | 1265 | 1266 | 1267 | 1268 |
age_band | demographic | contribution_percent
--- | --- | ---
unknown | unknown | 40.5
Retirees | Families | 16.7
Retirees | Couples | 16.1
Middle Aged | Families | 11.0
Young Adults | Couples | 6.6
Middle Aged | Couples | 4.7
Young Adults | Families | 4.5
1269 | 1270 | 1271 | 1272 | 1273 | Alternatively, let's extract the values with the highest contribution percent. 1274 | 1275 | 1276 | ```python 1277 | pd.read_sql(query8 + "LIMIT 2", conn) 1278 | ``` 1279 | 1280 | 1281 | 1282 | 1283 | 1284 | 1285 | 1286 | 1287 | 1288 | 1289 | 1290 | 1291 | 1292 | 1293 | 1294 | 1295 | 1296 | 1297 | 1298 | 1299 | 1300 | 1301 | 1302 | 1303 |
age_band | demographic | contribution_percent
--- | --- | ---
unknown | unknown | 40.5
Retirees | Families | 16.7
1304 | 1305 | 1306 | 1307 | 1308 | **Result**\ 1309 | Retired Families are the age_band and demographic values that contributed the most to the Retail sales. 1310 | 1311 | ___ 1312 | #### 9. Can we use the `avg_transaction` column to find the average transaction size for each year for Retail vs Shopify? If not - how would you calculate it instead? 1313 | 1314 | **Method 1:** When using `GROUP BY` to calculate the average transaction size for each year, the obtained results differ slightly from using the `avg_transaction` column directly. 1315 | 1316 | 1317 | ```python 1318 | pd.read_sql(""" 1319 | SELECT 1320 | calendar_year AS year, 1321 | platform, 1322 | ROUND(AVG(avg_transaction),2) AS avg_transaction1, 1323 | ROUND(SUM(sales)::NUMERIC/SUM(transactions)::NUMERIC,2) AS avg_transaction2 1324 | FROM data_mart.clean_weekly_sales 1325 | GROUP BY year, platform 1326 | ORDER BY platform, year 1327 | """, conn) 1328 | ``` 1329 | 1330 | 1331 | 1332 | 1333 | 1334 | 1335 | 1336 | 1337 | 1338 | 1339 | 1340 | 1341 | 1342 | 1343 | 1344 | 1345 | 1346 | 1347 | 1348 | 1349 | 1350 | 1351 | 1352 | 1353 | 1354 | 1355 | 1356 | 1357 | 1358 | 1359 | 1360 | 1361 | 1362 | 1363 | 1364 | 1365 | 1366 | 1367 | 1368 | 1369 | 1370 | 1371 | 1372 | 1373 | 1374 | 1375 | 1376 | 1377 | 1378 | 1379 | 1380 | 1381 | 1382 | 1383 | 1384 |
year | platform | avg_transaction1 | avg_transaction2
--- | --- | --- | ---
2018 | Retail | 42.91 | 36.56
2019 | Retail | 41.97 | 36.83
2020 | Retail | 40.64 | 36.56
2018 | Shopify | 188.28 | 192.48
2019 | Shopify | 177.56 | 183.36
2020 | Shopify | 174.87 | 179.03
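The gap between the two columns is the usual difference between an average of ratios and a ratio of sums: `AVG(avg_transaction)` weights every weekly record equally, while `SUM(sales)/SUM(transactions)` weights by transaction volume. A tiny illustration with made-up numbers (not Data Mart data):

```python
# Two hypothetical weekly records for the same platform and year
sales = [100, 9000]
transactions = [10, 1000]

# Average of the per-record ratios -- what AVG(avg_transaction) computes
avg_of_ratios = sum(s / t for s, t in zip(sales, transactions)) / len(sales)
print(avg_of_ratios)              # (10.0 + 9.0) / 2 = 9.5

# Ratio of the sums -- what SUM(sales) / SUM(transactions) computes
ratio_of_sums = sum(sales) / sum(transactions)
print(round(ratio_of_sums, 2))    # 9100 / 1010 = 9.01
```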
1385 | 1386 | 1387 | 1388 | 1389 | **Method #2:** With this following method, we obtain the same results as using the `avg_transactions` values. 1390 | 1391 | 1392 | ```python 1393 | pd.read_sql(""" 1394 | SELECT 1395 | calendar_year AS year, 1396 | ROUND(AVG(CASE WHEN platform = 'Retail' THEN avg_transaction END),2) AS retail_avg_txn1, 1397 | ROUND(AVG(CASE WHEN platform = 'Retail' THEN sales::numeric/transactions::numeric END),2) AS retail_avg_txn2, 1398 | ROUND(AVG(CASE WHEN platform = 'Shopify' THEN avg_transaction END),2) AS shopify_avg_txn1, 1399 | ROUND(AVG(CASE WHEN platform = 'Shopify' THEN sales::numeric/transactions::numeric END),2) AS shopify_avg_txn2 1400 | FROM data_mart.clean_weekly_sales 1401 | GROUP BY year 1402 | """, conn) 1403 | ``` 1404 | 1405 | 1406 | 1407 | 1408 | 1409 | 1410 | 1411 | 1412 | 1413 | 1414 | 1415 | 1416 | 1417 | 1418 | 1419 | 1420 | 1421 | 1422 | 1423 | 1424 | 1425 | 1426 | 1427 | 1428 | 1429 | 1430 | 1431 | 1432 | 1433 | 1434 | 1435 | 1436 | 1437 | 1438 | 1439 | 1440 | 1441 | 1442 |
year | retail_avg_txn1 | retail_avg_txn2 | shopify_avg_txn1 | shopify_avg_txn2
--- | --- | --- | --- | ---
2018 | 42.91 | 42.91 | 188.28 | 188.28
2020 | 40.64 | 40.64 | 174.87 | 174.87
2019 | 41.97 | 41.97 | 177.56 | 177.56
1443 | 1444 | 1445 | 1446 | 1447 | 1448 | 1449 | ## C. Before & After Analysis 1450 | This technique is usually used when we inspect an important event and want to inspect the impact before and after a certain point in time. 1451 | 1452 | Taking the `week_date` value of **2020-06-15** as the baseline week where the Data Mart sustainable packaging changes came into effect. 1453 | 1454 | We would include all `week_date` values for **2020-06-15** as the start of the period after the change and the previous `week_date` values would be before 1455 | 1456 | Using this analysis approach - answer the following questions: 1457 | 1458 | > 1. What is the total sales for the 4 weeks before and after 2020-06-15? What is the growth or reduction rate in actual values and percentage of sales? 1459 | > 2. What about the entire 12 weeks before and after? 1460 | > 3. How do the sale metrics for these 2 periods before and after compare with the previous years in 2018 and 2019? 1461 | 1462 | #### 1. What is the total sales for the 4 weeks before and after 2020-06-15? What is the growth or reduction rate in actual values and percentage of sales? 1463 | 1464 | The week number of the date **2020-06-15** is 25. Therefore, 1465 | - the four weeks before this date are `week 21, 22, 23, and 24`, and 1466 | - the four weeks after are `week 25, 26, 27, and 28`. 1467 | 1468 | 1469 | ```python 1470 | pd.read_sql(""" 1471 | SELECT DISTINCT week_number 1472 | FROM data_mart.clean_weekly_sales 1473 | WHERE week_date = '2020-06-15' 1474 | """, conn) 1475 | ``` 1476 | 1477 | 1478 | 1479 | 1480 | 1481 | 1482 | 1483 | 1484 | 1485 | 1486 | 1487 | 1488 | 1489 | 1490 |
| week_number |
| --- |
| 25 |
1491 | 1492 | 1493 | 1494 | 1495 | **Result**\ 1496 | The implementation of sustainable packaging changes appears to have a **negative impact** on sales at Data Mart, resulting in a reduction rate of 1.15% in sales. 1497 | 1498 | 1499 | ```python 1500 | pd.read_sql(""" 1501 | SELECT 1502 | TO_CHAR(total_sales_before, 'FM$ 999,999,999,999') AS total_sales_before, 1503 | TO_CHAR(total_sales_after, 'FM$ 999,999,999,999') AS total_sales_after, 1504 | TO_CHAR(total_sales_after - total_sales_before, 'FM$ 999,999,999,999') AS difference, 1505 | ROUND((total_sales_after - total_sales_before)/total_sales_before::NUMERIC * 100,2) AS sales_change_percent 1506 | FROM 1507 | ( 1508 | SELECT 1509 | SUM(CASE WHEN week_number BETWEEN 21 AND 24 THEN sales ELSE 0 END) AS total_sales_before, 1510 | SUM(CASE WHEN week_number BETWEEN 25 AND 28 THEN sales ELSE 0 END) AS total_sales_after 1511 | FROM data_mart.clean_weekly_sales 1512 | WHERE calendar_year = 2020 1513 | ) sales; 1514 | """, conn) 1515 | ``` 1516 | 1517 | 1518 | 1519 | 1520 | 1521 | 1522 | 1523 | 1524 | 1525 | 1526 | 1527 | 1528 | 1529 | 1530 | 1531 | 1532 | 1533 | 1534 | 1535 | 1536 |
total_sales_before | total_sales_after | difference | sales_change_percent
--- | --- | --- | ---
$ 2,345,878,357 | $ 2,318,994,169 | $ -26,884,188 | -1.15
1537 | 1538 | 1539 | 1540 | 1541 | ___ 1542 | #### 2. What about the entire 12 weeks before and after? 1543 | 1544 | 1545 | ```python 1546 | pd.read_sql(""" 1547 | SELECT 1548 | TO_CHAR(total_sales_before, 'FM$ 999,999,999,999') AS total_sales_before, 1549 | TO_CHAR(total_sales_after, 'FM$ 999,999,999,999') AS total_sales_after, 1550 | TO_CHAR(total_sales_after - total_sales_before, 'FM$ 999,999,999,999') AS difference, 1551 | ROUND((total_sales_after - total_sales_before)/total_sales_before::NUMERIC * 100, 2) AS sales_change_percent 1552 | FROM 1553 | ( 1554 | SELECT 1555 | SUM(CASE WHEN week_number BETWEEN 13 AND 24 THEN sales ELSE 0 END) AS total_sales_before, 1556 | SUM(CASE WHEN week_number BETWEEN 25 AND 37 THEN sales ELSE 0 END) AS total_sales_after 1557 | FROM data_mart.clean_weekly_sales 1558 | WHERE calendar_year = 2020 1559 | ) sales 1560 | """, conn) 1561 | ``` 1562 | 1563 | 1564 | 1565 | 1566 | 1567 | 1568 | 1569 | 1570 | 1571 | 1572 | 1573 | 1574 | 1575 | 1576 | 1577 | 1578 | 1579 | 1580 | 1581 | 1582 |
total_sales_before | total_sales_after | difference | sales_change_percent
--- | --- | --- | ---
$ 7,126,273,147 | $ 6,973,947,753 | $ -152,325,394 | -2.14
1583 | 1584 | 1585 | 1586 | 1587 | **Result**\ 1588 | The implementation of sustainable packaging changes appears to have a **negative impact** on sales at Data Mart, resulting in a reduction rate of 2.14% in sales. 1589 | 1590 | ___ 1591 | #### 3. How do the sale metrics for these 2 periods before and after compare with the previous years in 2018 and 2019? 1592 | 1593 | 1594 | ```python 1595 | pd.read_sql(""" 1596 | SELECT 1597 | year, 1598 | TO_CHAR(total_sales_before, 'FM$ 999,999,999,999') AS total_sales_before, 1599 | TO_CHAR(total_sales_after, 'FM$ 999,999,999,999') AS total_sales_after, 1600 | TO_CHAR(total_sales_after - total_sales_before, 'FM$ 999,999,999,999') AS difference, 1601 | ROUND((total_sales_after - total_sales_before)/total_sales_before::NUMERIC * 100, 2) AS sales_change_percent 1602 | FROM 1603 | ( 1604 | SELECT 1605 | calendar_year AS year, 1606 | SUM(CASE WHEN week_number BETWEEN 13 AND 24 THEN sales ELSE 0 END) AS total_sales_before, 1607 | SUM(CASE WHEN week_number BETWEEN 25 AND 37 THEN sales ELSE 0 END) AS total_sales_after 1608 | FROM data_mart.clean_weekly_sales 1609 | GROUP BY year 1610 | ) sales 1611 | ORDER BY year 1612 | """, conn) 1613 | ``` 1614 | 1615 | 1616 | 1617 | 1618 | 1619 | 1620 | 1621 | 1622 | 1623 | 1624 | 1625 | 1626 | 1627 | 1628 | 1629 | 1630 | 1631 | 1632 | 1633 | 1634 | 1635 | 1636 | 1637 | 1638 | 1639 | 1640 | 1641 | 1642 | 1643 | 1644 | 1645 | 1646 | 1647 | 1648 | 1649 | 1650 | 1651 | 1652 |
year | total_sales_before | total_sales_after | difference | sales_change_percent
--- | --- | --- | --- | ---
2018 | $ 6,396,562,317 | $ 6,500,818,510 | $ 104,256,193 | 1.63
2019 | $ 6,883,386,397 | $ 6,862,646,103 | $ -20,740,294 | -0.30
2020 | $ 7,126,273,147 | $ 6,973,947,753 | $ -152,325,394 | -2.14
1653 | 1654 | 1655 | 1656 | 1657 | 1658 | 1659 | ## D. Bonus Question 1660 | Which areas of the business have the highest negative impact in sales metrics performance in 2020 for the 12 week before and after period? 1661 | 1662 | - region 1663 | - platform 1664 | - age_band 1665 | - demographic 1666 | - customer_type 1667 | 1668 | Do you have any further recommendations for Danny’s team at Data Mart or any interesting insights based off this analysis? 1669 | 1670 | ### Overview 1671 | The following query output displays the various areas of the business that experienced the highest negative impact on sales metrics performance in 2020. By setting `WHERE sales_before > sales_after`, only the negative percent changes record will be filtered and displayed (the top 10 records). 1672 | 1673 | 1674 | ```python 1675 | pd.read_sql(""" 1676 | WITH overview_sales_cte AS 1677 | ( 1678 | SELECT 1679 | region, 1680 | platform, 1681 | age_band, 1682 | demographic, 1683 | customer_type, 1684 | SUM(CASE WHEN week_number BETWEEN 13 AND 24 THEN sales ELSE 0 END) AS sales_before, 1685 | SUM(CASE WHEN week_number BETWEEN 25 AND 37 THEN sales ELSE 0 END) AS sales_after 1686 | FROM data_mart.clean_weekly_sales 1687 | WHERE calendar_year = 2020 1688 | GROUP BY region, platform, age_band, demographic, customer_type 1689 | ) 1690 | SELECT 1691 | region, 1692 | platform, 1693 | age_band, 1694 | demographic, 1695 | customer_type, 1696 | ROUND((sales_after - sales_before)::NUMERIC/sales_before::NUMERIC * 100, 2) AS sales_change_percent 1697 | FROM overview_sales_cte 1698 | WHERE sales_before > sales_after 1699 | ORDER BY sales_change_percent ASC 1700 | LIMIT 10; 1701 | """, conn) 1702 | ``` 1703 | 1704 | 1705 | 1706 | 1707 | 1708 | 1709 | 1710 | 1711 | 1712 | 1713 | 1714 | 1715 | 1716 | 1717 | 1718 | 1719 | 1720 | 1721 | 1722 | 1723 | 1724 | 1725 | 1726 | 1727 | 1728 | 1729 | 1730 | 1731 | 1732 | 1733 | 1734 | 1735 | 1736 | 1737 | 1738 | 1739 | 1740 | 1741 | 1742 | 1743 | 1744 | 1745 | 1746 | 1747 | 1748 | 1749 | 1750 | 1751 | 1752 | 1753 | 1754 | 1755 | 1756 | 1757 | 1758 | 1759 | 1760 | 1761 | 1762 | 1763 | 1764 | 1765 | 1766 | 1767 | 1768 | 1769 | 1770 | 1771 | 1772 | 1773 | 1774 | 1775 | 1776 | 1777 | 1778 | 1779 | 1780 | 1781 | 1782 | 1783 | 1784 | 1785 | 1786 | 1787 | 1788 | 1789 | 1790 | 1791 | 1792 | 1793 | 1794 | 1795 | 1796 | 1797 | 1798 | 1799 | 1800 | 1801 | 1802 | 1803 | 1804 | 1805 | 1806 | 1807 | 1808 |
region | platform | age_band | demographic | customer_type | sales_change_percent
--- | --- | --- | --- | --- | ---
SOUTH AMERICA | Shopify | unknown | unknown | Existing | -42.23
EUROPE | Shopify | Retirees | Families | New | -33.71
EUROPE | Shopify | Young Adults | Families | New | -27.97
SOUTH AMERICA | Retail | unknown | unknown | Existing | -23.20
SOUTH AMERICA | Retail | Retirees | Families | New | -21.28
SOUTH AMERICA | Shopify | Middle Aged | Families | New | -19.73
SOUTH AMERICA | Shopify | Retirees | Families | New | -19.04
SOUTH AMERICA | Shopify | Retirees | Couples | New | -18.39
SOUTH AMERICA | Retail | Retirees | Couples | Existing | -16.80
SOUTH AMERICA | Retail | Retirees | Couples | New | -15.86
1809 | 1810 | 1811 | 1812 | 1813 | ___ 1814 | In the followings, I will analyze each area individually to determine which category has been the most impacted. 1815 | 1816 | ### Sales By Region 1817 | The **Asia** region has been the most negatively impacted by the implementation of the new system at Data Mart. 1818 | 1819 | 1820 | ```python 1821 | pd.read_sql(""" 1822 | WITH region_sales_cte AS 1823 | ( 1824 | SELECT 1825 | region, 1826 | SUM(CASE WHEN week_number BETWEEN 13 AND 24 THEN sales END) AS sales_before, 1827 | SUM(CASE WHEN week_number BETWEEN 25 AND 37 THEN sales END) AS sales_after 1828 | FROM data_mart.clean_weekly_sales 1829 | WHERE calendar_year = 2020 1830 | GROUP BY region 1831 | ) 1832 | SELECT 1833 | region, 1834 | ROUND((sales_after - sales_before)/sales_before::NUMERIC * 100,2) AS sales_change_percent 1835 | FROM region_sales_cte 1836 | ORDER BY sales_change_percent ASC; 1837 | """, conn) 1838 | ``` 1839 | 1840 | 1841 | 1842 | 1843 | 1844 | 1845 | 1846 | 1847 | 1848 | 1849 | 1850 | 1851 | 1852 | 1853 | 1854 | 1855 | 1856 | 1857 | 1858 | 1859 | 1860 | 1861 | 1862 | 1863 | 1864 | 1865 | 1866 | 1867 | 1868 | 1869 | 1870 | 1871 | 1872 | 1873 | 1874 | 1875 | 1876 | 1877 | 1878 | 1879 | 1880 | 1881 | 1882 | 1883 | 1884 | 1885 |
region | sales_change_percent
--- | ---
ASIA | -3.26
OCEANIA | -3.03
SOUTH AMERICA | -2.15
CANADA | -1.92
USA | -1.60
AFRICA | -0.54
EUROPE | 4.73
1886 | 1887 | 1888 | 1889 | 1890 | ### Sales By Platform 1891 | 1892 | The **Retail** platform has been the most negatively impacted by the implementation of the new system at Data Mart. 1893 | 1894 | 1895 | ```python 1896 | pd.read_sql(""" 1897 | WITH platform_sales_cte AS 1898 | ( 1899 | SELECT 1900 | platform, 1901 | SUM(CASE WHEN week_number BETWEEN 13 AND 24 THEN sales END) AS sales_before, 1902 | SUM(CASE WHEN week_number BETWEEN 25 AND 37 THEN sales END) AS sales_after 1903 | FROM data_mart.clean_weekly_sales 1904 | WHERE calendar_year = 2020 1905 | GROUP BY platform 1906 | ) 1907 | SELECT 1908 | platform, 1909 | ROUND((sales_after - sales_before)/sales_before::NUMERIC * 100,2) AS sales_change_percent 1910 | FROM platform_sales_cte 1911 | ORDER BY sales_change_percent ASC; 1912 | """, conn) 1913 | ``` 1914 | 1915 | 1916 | 1917 | 1918 | 1919 | 1920 | 1921 | 1922 | 1923 | 1924 | 1925 | 1926 | 1927 | 1928 | 1929 | 1930 | 1931 | 1932 | 1933 | 1934 | 1935 |
platform | sales_change_percent
--- | ---
Retail | -2.43
Shopify | 7.18
1936 | 1937 | 1938 | 1939 | 1940 | ### Sales By Age_Band 1941 | The **Middle Aged** customers have been the most negatively impacted by the implementation of the new system at Data Mart. 1942 | 1943 | 1944 | ```python 1945 | pd.read_sql(""" 1946 | WITH age_band_sales_cte AS 1947 | ( 1948 | SELECT 1949 | age_band, 1950 | SUM(CASE WHEN week_number BETWEEN 13 AND 24 THEN sales END) AS sales_before, 1951 | SUM(CASE WHEN week_number BETWEEN 25 AND 37 THEN sales END) AS sales_after 1952 | FROM data_mart.clean_weekly_sales 1953 | WHERE calendar_year = 2020 1954 | GROUP BY age_band 1955 | ) 1956 | SELECT 1957 | age_band, 1958 | ROUND((sales_after - sales_before)/sales_before::NUMERIC * 100,2) AS sales_change_percent 1959 | FROM age_band_sales_cte 1960 | ORDER BY sales_change_percent ASC; 1961 | """, conn) 1962 | ``` 1963 | 1964 | 1965 | 1966 | 1967 | 1968 | 1969 | 1970 | 1971 | 1972 | 1973 | 1974 | 1975 | 1976 | 1977 | 1978 | 1979 | 1980 | 1981 | 1982 | 1983 | 1984 | 1985 | 1986 | 1987 | 1988 | 1989 | 1990 | 1991 | 1992 | 1993 | 1994 |
age_band | sales_change_percent
--- | ---
unknown | -3.34
Middle Aged | -1.97
Retirees | -1.23
Young Adults | -0.92
1995 | 1996 | 1997 | 1998 | 1999 | ### Sales By Demographic 2000 | The **Families** demographic category has been the most negatively impacted by the implementation of the new system at Data Mart. 2001 | 2002 | 2003 | ```python 2004 | pd.read_sql(""" 2005 | WITH demographic_sales_cte AS 2006 | ( 2007 | SELECT 2008 | demographic, 2009 | SUM(CASE WHEN week_number BETWEEN 13 AND 24 THEN sales END) AS sales_before, 2010 | SUM(CASE WHEN week_number BETWEEN 25 AND 37 THEN sales END) AS sales_after 2011 | FROM data_mart.clean_weekly_sales 2012 | WHERE calendar_year = 2020 2013 | GROUP BY demographic 2014 | ) 2015 | SELECT 2016 | demographic, 2017 | ROUND((sales_after - sales_before)/sales_before::NUMERIC * 100,2) AS sales_change_percent 2018 | FROM demographic_sales_cte 2019 | ORDER BY sales_change_percent ASC; 2020 | """, conn) 2021 | ``` 2022 | 2023 | 2024 | 2025 | 2026 | 2027 | 2028 | 2029 | 2030 | 2031 | 2032 | 2033 | 2034 | 2035 | 2036 | 2037 | 2038 | 2039 | 2040 | 2041 | 2042 | 2043 | 2044 | 2045 | 2046 | 2047 | 2048 |
demographic | sales_change_percent
--- | ---
unknown | -3.34
Families | -1.82
Couples | -0.87
2049 | 2050 | 2051 | 2052 | 2053 | ### Sales By Customer_Type 2054 | 2055 | The **Guest** customer type has been the most negatively impacted by the implementation of the new system at Data Mart. 2056 | 2057 | 2058 | ```python 2059 | pd.read_sql(""" 2060 | WITH customer_type_sales_cte AS 2061 | ( 2062 | SELECT 2063 | customer_type, 2064 | SUM(CASE WHEN week_number BETWEEN 13 AND 24 THEN sales END) AS sales_before, 2065 | SUM(CASE WHEN week_number BETWEEN 25 AND 37 THEN sales END) AS sales_after 2066 | FROM data_mart.clean_weekly_sales 2067 | WHERE calendar_year = 2020 2068 | GROUP BY customer_type 2069 | ) 2070 | SELECT 2071 | customer_type, 2072 | ROUND((sales_after - sales_before)/sales_before::NUMERIC * 100,2) AS sales_change_percent 2073 | FROM customer_type_sales_cte 2074 | ORDER BY sales_change_percent ASC; 2075 | """, conn) 2076 | ``` 2077 | 2078 | 2079 | 2080 | 2081 | 2082 | 2083 | 2084 | 2085 | 2086 | 2087 | 2088 | 2089 | 2090 | 2091 | 2092 | 2093 | 2094 | 2095 | 2096 | 2097 | 2098 | 2099 | 2100 | 2101 | 2102 | 2103 |
customer_type | sales_change_percent
--- | ---
Guest | -3.00
Existing | -2.27
New | 1.01
2104 | 2105 | 2106 | 2107 | 2108 | 2109 | ```python 2110 | conn.close() 2111 | ``` 2112 | -------------------------------------------------------------------------------- /CaseStudy#5 - Data Mart/README.md: -------------------------------------------------------------------------------- 1 | # Case Study #5: Data Mart 🧺 2 | [![View Main Folder](https://img.shields.io/badge/View-Main_Folder-F5788D.svg?logo=GitHub)](https://github.com/chanronnie/8WeekSQLChallenge) 3 | [![View Case Study 5](https://img.shields.io/badge/View-Case_Study_5-red)](https://8weeksqlchallenge.com/case-study-5/)
4 | ![5](https://github.com/chanronnie/8WeekSQLChallenge/assets/121308347/a07627c4-27c5-47f5-a6f8-dc3379831287) 5 | 6 | 7 | The case study presented here is part of the **8 Week SQL Challenge**.\ 8 | It is kindly brought to us by [**Data With Danny**](https://8weeksqlchallenge.com). 9 | 10 | This time, I am using `PostgreSQL queries` (instead of MySQL) in `Jupyter Notebook` to quickly view results, which provides me with an opportunity 11 | - to learn PostgreSQL 12 | - to utilize handy mathematical and string functions. 13 | 14 | 15 | 16 | ## Table of Contents 17 | * [Entity Relationship Diagram](#entity-relationship-diagram) 18 | * [Datasets](#datasets) 19 | * [Case Study Questions](#case-study-questions) 20 | * [Solutions](#solutions) 21 | * [PostgreSQL Topics Covered](#postgresql-topics-covered) 22 | 23 | 24 | ## Entity Relationship Diagram 25 | ![DataMart](https://github.com/chanronnie/8WeekSQLChallenge/assets/121308347/b0bb72b8-e579-41c6-99f0-ce73718f0d73) 26 | 27 | 28 | ## Datasets 29 | The Case Study #5 contains 1 dataset:
30 | **`weekly_sales`**: This table shows all the sales records made at Data Mart. 31 | 32 |
33 | View dataset 34 | 35 | Here are the 10 random rows of the `weekly_sales` dataset 36 | 37 | week_date | region | platform | segment | customer_type | transactions | sales 38 | --- | --- | --- | --- | --- | --- | --- 39 | 9/9/20 | OCEANIA | Shopify | C3 | New | 610 | 110033.89 40 | 29/7/20 | AFRICA | Retail | C1 | New | 110692 | 3053771.19 41 | 22/7/20 | EUROPE | Shopify | C4 | Existing | 24 | 8101.54 42 | 13/5/20 | AFRICA | Shopify | null | Guest | 5287 | 1003301.37 43 | 24/7/19 | ASIA | Retail | C1 | New | 127342 | 3151780.41 44 | 10/7/19 | CANADA | Shopify | F3 | New | 51 | 8844.93 45 | 26/6/19 | OCEANIA | Retail | C3 | New | 152921 | 5551385.36 46 | 29/5/19 | SOUTH AMERICA | Shopify | null | New | 53 | 10056.2 47 | 22/8/18 | AFRICA | Retail | null | Existing | 31721 | 1718863.58 48 | 25/7/18 | SOUTH AMERICA | Retail | null | New | 2136 | 81757.91 49 | 50 |
51 | 52 | ## Case Study Questions 53 | Case Study #5 is categorized into 4 question groups\ 54 | To view the specific section, please open the link in a *`new tab`* or *`window`*.\ 55 | [A. Data Cleansing](CaseStudy5_solutions.md#A)\ 56 | [B. Data Exploration](CaseStudy5_solutions.md#B)\ 57 | [C. Before & After Analysis](CaseStudy5_solutions.md#C)\ 58 | [D. Bonus Question](CaseStudy5_solutions.md#D) 59 | 60 | ## Solutions 61 | - View `data_mart` database: [**here**](CaseStudy5_schema.sql) 62 | - View Solution: 63 | - [**Markdown File**](CaseStudy5_solutions.md): offers a more fluid and responsive viewing experience 64 | - [**Jupyter Notebook**](CaseStudy5_solutions.ipynb): contains the original code 65 | 66 | 67 | ## PostgreSQL Topics Covered 68 | - Data Cleaning 69 | - Pivot Table using CASE WHEN 70 | - DATE Type Manipulation 71 | - Common Table Expressions (CTE) 72 | - Window Functions 73 | - Subqueries 74 | - JOIN, UNION ALL 75 | -------------------------------------------------------------------------------- /CaseStudy#5 - Data Mart/image/plot.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/chanronnie/8WeekSQLChallenge/4b36f745d833a82a61a9ce70ab177462eb7dbfd0/CaseStudy#5 - Data Mart/image/plot.png -------------------------------------------------------------------------------- /CaseStudy#6 - Clique Bait/CaseStudy6_solutions.md: -------------------------------------------------------------------------------- 1 | # Case Study #6: Clique Bait 2 | The case study questions presented here are created by [**Data With Danny**](https://linktr.ee/datawithdanny). They are part of the [**8 Week SQL Challenge**](https://8weeksqlchallenge.com/). 3 | 4 | My SQL queries are written in the `PostgreSQL 15` dialect, integrated into `Jupyter Notebook`, which allows us to instantly view the query results and document the queries. 5 | 6 | For more details about the **Case Study #6**, click [**here**](https://8weeksqlchallenge.com/case-study-6/). 7 | 8 | ## Table of Contents 9 | 10 | ### [1. Importing Libraries](#Import) 11 | 12 | ### [2. Tables of the Database](#Tables) 13 | 14 | ### [3. Case Study Questions](#CaseStudyQuestions) 15 | 16 | - [A. Enterprise Relationship Diagram](#A) 17 | - [B. Digital Analysis](#B) 18 | - [C. Product Funnel Analysis](#C) 19 | - [D. Campaigns Analysis](#D) 20 | 21 | 22 | 23 | ## 1. Importing Libraries 24 | 25 | 26 | ```python 27 | import psycopg2 as pg2 28 | import pandas as pd 29 | import os 30 | import warnings 31 | 32 | warnings.filterwarnings("ignore") 33 | ``` 34 | 35 | ### Connecting PostgreSQL database from Jupyter Notebook 36 | 37 | 38 | ```python 39 | # Get my PostgreSQL password 40 | mypassword = os.getenv("POSTGRESQL_PASSWORD") 41 | 42 | # Connecting to database 43 | conn = pg2.connect(user = 'postgres', password = mypassword, database = 'clique_bait') 44 | cursor = conn.cursor() 45 | ``` 46 | 47 | ___ 48 | 49 | ## 2. Tables of the Database 50 | 51 | First, let's verify if the connected database contains the 5 dataset names. 
52 | 53 | 54 | ```python 55 | cursor.execute(""" 56 | SELECT table_schema, table_name 57 | FROM information_schema.tables 58 | WHERE table_schema = 'clique_bait' 59 | """) 60 | 61 | table_names = [] 62 | print('--- Tables within "data_mart" database --- ') 63 | for table in cursor: 64 | print(table[1]) 65 | table_names.append(table[1]) 66 | ``` 67 | 68 | --- Tables within "data_mart" database --- 69 | event_identifier 70 | campaign_identifier 71 | page_hierarchy 72 | users 73 | events 74 | 75 | 76 | #### Here are the 5 datasets of the "clique_bait" database. For more details about each dataset, please click [here](https://8weeksqlchallenge.com/case-study-6/). 77 | 78 | 79 | ```python 80 | for table in table_names: 81 | print("\nTable: ", table) 82 | display(pd.read_sql("SELECT * FROM clique_bait." + table, conn)) 83 | ``` 84 | 85 | 86 | Table: event_identifier 87 | 88 | 89 | 90 | 91 | 92 | 93 | 94 | 95 | 96 | 97 | 98 | 99 | 100 | 101 | 102 | 103 | 104 | 105 | 106 | 107 | 108 | 109 | 110 | 111 | 112 | 113 | 114 | 115 | 116 | 117 | 118 | 119 | 120 | 121 | 122 | 123 |
event_type | event_name
--- | ---
1 | Page View
2 | Add to Cart
3 | Purchase
4 | Ad Impression
5 | Ad Click
124 | 125 | 126 | 127 | 128 | Table: campaign_identifier 129 | 130 | 131 | 132 | 133 | 134 | 135 | 136 | 137 | 138 | 139 | 140 | 141 | 142 | 143 | 144 | 145 | 146 | 147 | 148 | 149 | 150 | 151 | 152 | 153 | 154 | 155 | 156 | 157 | 158 | 159 | 160 | 161 | 162 | 163 | 164 | 165 | 166 | 167 |
campaign_id | products | campaign_name | start_date | end_date
--- | --- | --- | --- | ---
1 | 1-3 | BOGOF - Fishing For Compliments | 2020-01-01 | 2020-01-14
2 | 4-5 | 25% Off - Living The Lux Life | 2020-01-15 | 2020-01-28
3 | 6-8 | Half Off - Treat Your Shellf(ish) | 2020-02-01 | 2020-03-31
168 | 169 | 170 | 171 | 172 | Table: page_hierarchy 173 | 174 | 175 | 176 | 177 | 178 | 179 | 180 | 181 | 182 | 183 | 184 | 185 | 186 | 187 | 188 | 189 | 190 | 191 | 192 | 193 | 194 | 195 | 196 | 197 | 198 | 199 | 200 | 201 | 202 | 203 | 204 | 205 | 206 | 207 | 208 | 209 | 210 | 211 | 212 | 213 | 214 | 215 | 216 | 217 | 218 | 219 | 220 | 221 | 222 | 223 | 224 | 225 | 226 | 227 | 228 | 229 | 230 | 231 | 232 | 233 | 234 | 235 | 236 | 237 | 238 | 239 | 240 | 241 | 242 | 243 | 244 | 245 | 246 | 247 | 248 | 249 | 250 | 251 | 252 | 253 | 254 | 255 | 256 | 257 | 258 | 259 | 260 | 261 | 262 | 263 | 264 | 265 | 266 | 267 | 268 | 269 | 270 | 271 | 272 | 273 | 274 | 275 | 276 | 277 |
page_id | page_name | product_category | product_id
--- | --- | --- | ---
1 | Home Page | None | NaN
2 | All Products | None | NaN
3 | Salmon | Fish | 1.0
4 | Kingfish | Fish | 2.0
5 | Tuna | Fish | 3.0
6 | Russian Caviar | Luxury | 4.0
7 | Black Truffle | Luxury | 5.0
8 | Abalone | Shellfish | 6.0
9 | Lobster | Shellfish | 7.0
10 | Crab | Shellfish | 8.0
11 | Oyster | Shellfish | 9.0
12 | Checkout | None | NaN
13 | Confirmation | None | NaN
278 | 279 | 280 | 281 | 282 | Table: users 283 | 284 | 285 | 286 | 287 | 288 | 289 | 290 | 291 | 292 | 293 | 294 | 295 | 296 | 297 | 298 | 299 | 300 | 301 | 302 | 303 | 304 | 305 | 306 | 307 | 308 | 309 | 310 | 311 | 312 | 313 | 314 | 315 | 316 | 317 | 318 | 319 | 320 | 321 | 322 | 323 | 324 | 325 | 326 | 327 | 328 | 329 | 330 | 331 | 332 | 333 | 334 | 335 | 336 | 337 | 338 | 339 | 340 | 341 | 342 | 343 | 344 | 345 | 346 | 347 | 348 | 349 | 350 | 351 | 352 | 353 | 354 | 355 | 356 | 357 | 358 | 359 | 360 | 361 |
user_id | cookie_id | start_date
--- | --- | ---
1 | c4ca42 | 2020-02-04
2 | c81e72 | 2020-01-18
3 | eccbc8 | 2020-02-21
4 | a87ff6 | 2020-02-22
5 | e4da3b | 2020-02-01
... | ... | ...
254 | 46dd2f | 2020-03-29
94 | 59511b | 2020-03-22
49 | d345a8 | 2020-02-23
211 | a26e03 | 2020-02-20
64 | 87a4ba | 2020-03-18

*1782 rows × 3 columns*
363 | 364 | 365 | 366 | 367 | Table: events 368 | 369 | 370 | 371 | 372 | 373 | 374 | 375 | 376 | 377 | 378 | 379 | 380 | 381 | 382 | 383 | 384 | 385 | 386 | 387 | 388 | 389 | 390 | 391 | 392 | 393 | 394 | 395 | 396 | 397 | 398 | 399 | 400 | 401 | 402 | 403 | 404 | 405 | 406 | 407 | 408 | 409 | 410 | 411 | 412 | 413 | 414 | 415 | 416 | 417 | 418 | 419 | 420 | 421 | 422 | 423 | 424 | 425 | 426 | 427 | 428 | 429 | 430 | 431 | 432 | 433 | 434 | 435 | 436 | 437 | 438 | 439 | 440 | 441 | 442 | 443 | 444 | 445 | 446 | 447 | 448 | 449 | 450 | 451 | 452 | 453 | 454 | 455 | 456 | 457 | 458 | 459 | 460 | 461 | 462 | 463 | 464 | 465 | 466 | 467 | 468 | 469 | 470 | 471 | 472 | 473 | 474 | 475 | 476 | 477 | 478 | 479 | 480 | 481 | 482 |
visit_id | cookie_id | page_id | event_type | sequence_number | event_time
--- | --- | --- | --- | --- | ---
ccf365 | c4ca42 | 1 | 1 | 1 | 2020-02-04 19:16:09.182546
ccf365 | c4ca42 | 2 | 1 | 2 | 2020-02-04 19:16:17.358191
ccf365 | c4ca42 | 6 | 1 | 3 | 2020-02-04 19:16:58.454669
ccf365 | c4ca42 | 9 | 1 | 4 | 2020-02-04 19:16:58.609142
ccf365 | c4ca42 | 9 | 2 | 5 | 2020-02-04 19:17:51.729420
... | ... | ... | ... | ... | ...
355a6a | 87a4ba | 10 | 1 | 15 | 2020-03-18 22:44:16.541396
355a6a | 87a4ba | 11 | 1 | 16 | 2020-03-18 22:44:18.900830
355a6a | 87a4ba | 11 | 2 | 17 | 2020-03-18 22:45:12.670472
355a6a | 87a4ba | 12 | 1 | 18 | 2020-03-18 22:45:54.081818
355a6a | 87a4ba | 13 | 3 | 19 | 2020-03-18 22:45:54.984666

*32734 rows × 6 columns*
484 | 485 | 486 | 487 | 488 | 489 | ## 3. Case Study Questions 490 | 491 | 492 | ## A. Enterprise Relationship Diagram 493 | Using the following DDL schema details to create an ERD for all the Clique Bait datasets.
494 | [Click here](https://dbdiagram.io/) to access the DB Diagram tool to create the ERD. 495 |
496 |
497 |
498 | **ANSWER**
499 | Here is the code that I used for creating the **ERD** for all the Clique Bait datasets on [DB Diagram tool](https://dbdiagram.io/). 500 | 501 | ``` 502 | TABLE event_identifier { 503 | "event_type" INTEGER 504 | "event_name" VARCHAR(13) 505 | } 506 | 507 | TABLE campaign_identifier { 508 | "campaign_id" INTEGER 509 | "products" VARCHAR(3) 510 | "campaign_name" VARCHAR(33) 511 | "start_date" TIMESTAMP 512 | "end_date" TIMESTAMP 513 | } 514 | 515 | TABLE page_hierarchy { 516 | "page_id" INTEGER 517 | "page_name" VARCHAR(14) 518 | "product_category" VARCHAR(9) 519 | "product_id" INTEGER 520 | } 521 | 522 | TABLE users { 523 | "user_id" INTEGER 524 | "cookie_id" VARCHAR(6) 525 | "start_date" TIMESTAMP 526 | } 527 | 528 | TABLE events { 529 | "visit_id" VARCHAR(6) 530 | "cookie_id" VARCHAR(6) 531 | "page_id" INTEGER 532 | "event_type" INTEGER 533 | "sequence_number" INTEGER 534 | "event_time" TIMESTAMP 535 | } 536 | 537 | // Establish connections or references between datasets 538 | Ref: "events"."event_type" > "event_identifier"."event_type" 539 | Ref: "events"."page_id" > "page_hierarchy"."page_id" 540 | Ref: "events"."cookie_id" > "users"."cookie_id" 541 | ``` 542 | 543 | **Result** 544 | ![CliqueBait_ERD](images/CliqueBait_ERD.png) 545 | 546 | 547 | 548 | ## B. Digital Analysis 549 | 550 | #### 1. How many users are there? 551 | 552 | 553 | ```python 554 | pd.read_sql(""" 555 | SELECT COUNT(DISTINCT user_id) AS nb_users 556 | FROM clique_bait.users; 557 | """, conn) 558 | ``` 559 | 560 | 561 | 562 | 563 | 564 | 565 | 566 | 567 | 568 | 569 | 570 | 571 | 572 | 573 |
| nb_users |
| --- |
| 500 |
574 | 575 | 576 | 577 | 578 | **Result**\ 579 | There are 500 users. 580 | 581 | ___ 582 | #### 2. How many cookies does each user have on average? 583 | 584 | 585 | ```python 586 | pd.read_sql(""" 587 | SELECT ROUND(AVG(nb_cookie_ids))::INTEGER AS avg_cookies_per_user 588 | FROM 589 | ( 590 | SELECT DISTINCT user_id, COUNT(cookie_id) AS nb_cookie_ids 591 | FROM clique_bait.users 592 | GROUP BY user_id 593 | ) nb_cookies_per_user; 594 | """, conn) 595 | ``` 596 | 597 | 598 | 599 | 600 | 601 | 602 | 603 | 604 | 605 | 606 | 607 | 608 | 609 | 610 |
| avg_cookies_per_user |
| --- |
| 4 |
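The subquery can also be collapsed into a single aggregate; this is only an equivalent sketch of the same calculation.

```python
# 1782 cookies spread over 500 users ≈ 3.6, which rounds to 4
pd.read_sql("""
SELECT ROUND(COUNT(cookie_id)::NUMERIC / COUNT(DISTINCT user_id))::INTEGER AS avg_cookies_per_user
FROM clique_bait.users;
""", conn)
```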
611 | 612 | 613 | 614 | 615 | **Result**\ 616 | Each user has an average of 4 cookies. 617 | 618 | ___ 619 | #### 3. What is the unique number of visits by all users per month? 620 | 621 | 622 | ```python 623 | pd.read_sql(""" 624 | SELECT 625 | DATE_PART('month', u.start_date)::INTEGER AS month, 626 | TO_CHAR(u.start_date,'Month') AS month_name, 627 | COUNT(DISTINCT e.visit_id) AS nb_visits 628 | FROM clique_bait.users u 629 | JOIN clique_bait.events e ON u.cookie_id = e.cookie_id 630 | GROUP BY month, month_name 631 | ORDER BY month 632 | """, conn) 633 | ``` 634 | 635 | 636 | 637 | 638 | 639 | 640 | 641 | 642 | 643 | 644 | 645 | 646 | 647 | 648 | 649 | 650 | 651 | 652 | 653 | 654 | 655 | 656 | 657 | 658 | 659 | 660 | 661 | 662 | 663 | 664 | 665 | 666 | 667 | 668 | 669 | 670 | 671 | 672 | 673 | 674 | 675 | 676 |
month | month_name | nb_visits
--- | --- | ---
1 | January | 876
2 | February | 1488
3 | March | 916
4 | April | 248
5 | May | 36
677 | 678 | 679 | 680 | 681 | **Result** 682 | - The month of **February** has the highest number of visits on the Clique Bait website, with a total of 1,488 visits. 683 | - Following February, **March** and **January** are the second and third most visited months, accounting for 916 visits and 876 visits, respectively. 684 | - The month of **May** has the lowest number of visits, with only 36. 685 | 686 | ___ 687 | #### 4. What is the number of events for each event type? 688 | 689 | 690 | ```python 691 | pd.read_sql(""" 692 | SELECT 693 | ei.event_name, 694 | COUNT(*) AS nb_events 695 | FROM clique_bait.events e 696 | JOIN clique_bait. event_identifier ei ON e.event_type = ei.event_type 697 | GROUP BY ei.event_name 698 | ORDER BY nb_events DESC 699 | """, conn) 700 | ``` 701 | 702 | 703 | 704 | 705 | 706 | 707 | 708 | 709 | 710 | 711 | 712 | 713 | 714 | 715 | 716 | 717 | 718 | 719 | 720 | 721 | 722 | 723 | 724 | 725 | 726 | 727 | 728 | 729 | 730 | 731 | 732 | 733 | 734 | 735 | 736 | 737 |
event_name | nb_events
--- | ---
Page View | 20928
Add to Cart | 8451
Purchase | 1777
Ad Impression | 876
Ad Click | 702
738 | 739 | 740 | 741 | 742 | **Result** 743 | - **Page View** has the highest number of events, with a total of 20,928. 744 | - Following Page View, **Add to Cart** and **Purchase** are the second and third most interacted pages, accounting for 8,451 events and 1,777 events, respectively. 745 | - Taking into consideration that not all users receive a campaign impression and that not all users clicked on the impression advertisement, the **Ad Impression** page has been interacted with 876 times, and clicking the impression (see **Ad Click**) has occurred only 702 times. 746 | 747 | ___ 748 | #### 5. What is the percentage of visits which have a purchase event? 749 | 750 | 751 | ```python 752 | pd.read_sql(""" 753 | SELECT 754 | COUNT(*) AS nb_purchase_event, 755 | (SELECT COUNT(DISTINCT visit_id) FROM clique_bait.events) AS total_nb_visits, 756 | CONCAT(ROUND(COUNT(*)/(SELECT COUNT(DISTINCT visit_id) FROM clique_bait.events)::NUMERIC * 100, 1), ' %') AS purchase_percent 757 | FROM clique_bait.events e 758 | JOIN clique_bait.event_identifier ei ON e.event_type = ei.event_type 759 | WHERE ei.event_name = 'Purchase' 760 | """, conn) 761 | ``` 762 | 763 | 764 | 765 | 766 | 767 | 768 | 769 | 770 | 771 | 772 | 773 | 774 | 775 | 776 | 777 | 778 | 779 | 780 |
nb_purchase_event | total_nb_visits | purchase_percent
--- | --- | ---
1777 | 3564 | 49.9 %
781 | 782 | 783 | 784 | 785 | **Result**\ 786 | 49.9% of the visits resulted in a purchase event. 787 | 788 | ___ 789 | #### 6. What is the percentage of visits which view the checkout page but do not have a purchase event? 790 | 791 | 792 | ```python 793 | pd.read_sql(""" 794 | SELECT 795 | (nb_checkouts - nb_purchases) AS nb_checkouts_without_purchase, 796 | (SELECT COUNT(DISTINCT visit_id) FROM clique_bait.events) AS total_visits, 797 | ROUND((nb_checkouts - nb_purchases)/(SELECT COUNT(DISTINCT visit_id) FROM clique_bait.events)::NUMERIC * 100,2) AS percent 798 | FROM 799 | ( 800 | SELECT 801 | SUM(CASE WHEN ph.page_name = 'Checkout' AND ei.event_name = 'Page View' THEN 1 ELSE 0 END) AS nb_checkouts, 802 | SUM(CASE WHEN ei.event_name = 'Purchase' THEN 1 ELSE 0 END) AS nb_purchases 803 | FROM clique_bait.events e 804 | JOIN clique_bait.event_identifier ei ON e.event_type = ei.event_type 805 | JOIN clique_bait.page_hierarchy ph ON e.page_id = ph.page_id 806 | ) c 807 | """, conn) 808 | ``` 809 | 810 | 811 | 812 | 813 | 814 | 815 | 816 | 817 | 818 | 819 | 820 | 821 | 822 | 823 | 824 | 825 | 826 | 827 |
nb_checkouts_without_purchase | total_visits | percent
--- | --- | ---
326 | 3564 | 9.15
828 | 829 | 830 | 831 | 832 | **Result**\ 833 | Only 9.15% of the visits resulted in viewing the checkout page without a purchase event. 834 | 835 | ___ 836 | #### 7. What are the top 3 pages by number of views? 837 | 838 | 839 | ```python 840 | pd.read_sql(""" 841 | SELECT 842 | ph.page_name, 843 | COUNT(*) AS nb_views 844 | FROM clique_bait.events e 845 | JOIN clique_bait.page_hierarchy ph ON e.page_id = ph.page_id 846 | JOIN clique_bait.event_identifier ei ON e.event_type = ei.event_type 847 | WHERE ei.event_name = 'Page View' 848 | GROUP BY ph.page_name 849 | ORDER BY nb_views DESC 850 | LIMIT 3 851 | """, conn) 852 | ``` 853 | 854 | 855 | 856 | 857 | 858 | 859 | 860 | 861 | 862 | 863 | 864 | 865 | 866 | 867 | 868 | 869 | 870 | 871 | 872 | 873 | 874 | 875 | 876 | 877 | 878 | 879 |
page_name | nb_views
--- | ---
All Products | 3174
Checkout | 2103
Home Page | 1782
880 | 881 | 882 | 883 | 884 | ___ 885 | #### 8. What is the number of views and cart adds for each product category? 886 | 887 | 888 | ```python 889 | pd.read_sql(""" 890 | SELECT 891 | ph.product_category, 892 | SUM(CASE WHEN event_name = 'Page View' THEN 1 ELSE 0 END) AS nb_views, 893 | SUM(CASE WHEN event_name = 'Add to Cart' THEN 1 ELSE 0 END) AS nb_card_adds 894 | FROM clique_bait.events e 895 | JOIN clique_bait.page_hierarchy ph ON e.page_id = ph.page_id 896 | JOIN clique_bait.event_identifier ei ON e.event_type = ei.event_type 897 | WHERE ph.product_category IS NOT NULL 898 | GROUP BY ph.product_category 899 | ORDER BY nb_views DESC 900 | """, conn) 901 | ``` 902 | 903 | 904 | 905 | 906 | 907 | 908 | 909 | 910 | 911 | 912 | 913 | 914 | 915 | 916 | 917 | 918 | 919 | 920 | 921 | 922 | 923 | 924 | 925 | 926 | 927 | 928 | 929 | 930 | 931 | 932 |
product_category | nb_views | nb_card_adds
--- | --- | ---
Shellfish | 6204 | 3792
Fish | 4633 | 2789
Luxury | 3032 | 1870
933 | 934 | 935 | 936 | 937 | ___ 938 | #### 9. What are the top 3 products by purchases? 939 | 940 | 941 | ```python 942 | pd.read_sql(""" 943 | WITH visitID_with_purchases_cte AS 944 | ( 945 | -- Retrieve visit IDS that have made purchases 946 | 947 | SELECT e.visit_id 948 | FROM clique_bait.events e 949 | JOIN clique_bait.event_identifier ei ON e.event_type = ei.event_type 950 | WHERE ei.event_name = 'Purchase' 951 | ) 952 | SELECT 953 | ph.page_name as product, 954 | COUNT(*) AS nb_purchases 955 | FROM visitID_with_purchases_cte cte 956 | JOIN clique_bait.events e ON cte.visit_id = e.visit_id 957 | JOIN clique_bait.event_identifier ei ON e.event_type = ei.event_type 958 | JOIN clique_bait.page_hierarchy ph ON e.page_id = ph.page_id 959 | WHERE ph.product_category IS NOT NULL 960 | AND ei.event_name = 'Add to Cart' 961 | GROUP BY ph.page_name 962 | ORDER BY nb_purchases DESC 963 | LIMIT 3 964 | """, conn) 965 | ``` 966 | 967 | 968 | 969 | 970 | 971 | 972 | 973 | 974 | 975 | 976 | 977 | 978 | 979 | 980 | 981 | 982 | 983 | 984 | 985 | 986 | 987 | 988 | 989 | 990 | 991 | 992 |
product | nb_purchases
--- | ---
Lobster | 754
Oyster | 726
Crab | 719
993 | 994 | 995 | 996 | 997 | Alternatively, we can utilize the Window Function `LAST_VALUE() OVER (PARTITION BY... ORDER BY...)` to extract the most recent event recorded during each visit on the Clique Bait website. The **Purchase** event typically represents the final action taken by users in each visit. Therefore, by applying the `LAST_VALUE()` in the `add_last_event_cte`, we can subsequently filter and identify the products that have indeed been purchased, as indicated in the following statement:
`WHERE event_name = 'Add to Cart' AND last_event = 'Purchase'`. 998 | 999 | 1000 | ```python 1001 | pd.read_sql(""" 1002 | WITH add_last_event_cte AS 1003 | ( 1004 | SELECT 1005 | e.visit_id, 1006 | e.sequence_number, 1007 | ph.page_name, 1008 | ph.product_category, 1009 | ei.event_name, 1010 | LAST_VALUE(ei.event_name) OVER (PARTITION BY e.visit_id ORDER BY e.sequence_number ROWS BETWEEN UNBOUNDED preceding AND UNBOUNDED following) AS last_event 1011 | FROM clique_bait.events e 1012 | JOIN clique_bait.event_identifier ei ON e.event_type = ei.event_type 1013 | JOIN clique_bait.page_hierarchy ph ON e.page_id = ph.page_id 1014 | ) 1015 | SELECT 1016 | page_name, 1017 | COUNT(event_name) AS nb_purchases 1018 | FROM add_last_event_cte 1019 | WHERE product_category IS NOT NULL AND event_name = 'Add to Cart' AND last_event = 'Purchase' 1020 | GROUP BY page_name 1021 | ORDER BY nb_purchases DESC 1022 | """, conn) 1023 | ``` 1024 | 1025 | 1026 | 1027 | 1028 | 1029 | 1030 | 1031 | 1032 | 1033 | 1034 | 1035 | 1036 | 1037 | 1038 | 1039 | 1040 | 1041 | 1042 | 1043 | 1044 | 1045 | 1046 | 1047 | 1048 | 1049 | 1050 | 1051 | 1052 | 1053 | 1054 | 1055 | 1056 | 1057 | 1058 | 1059 | 1060 | 1061 | 1062 | 1063 | 1064 | 1065 | 1066 | 1067 | 1068 | 1069 | 1070 | 1071 | 1072 | 1073 | 1074 | 1075 | 1076 | 1077 | 1078 | 1079 | 1080 |
page_name | nb_purchases
--- | ---
Lobster | 754
Oyster | 726
Crab | 719
Salmon | 711
Black Truffle | 707
Kingfish | 707
Abalone | 699
Russian Caviar | 697
Tuna | 697
1081 | 1082 | 1083 | 1084 | 1085 | 1086 | ## C. Product Funnel Analysis 1087 | Using a single SQL query - create a new output table which has the following details: 1088 | 1089 | - How many times was each product viewed? 1090 | - How many times was each product added to cart? 1091 | - How many times was each product added to a cart but not purchased (abandoned)? 1092 | - How many times was each product purchased? 1093 | 1094 | Additionally, create another table which further aggregates the data for the above points but this time **for each product category** instead of individual products. 1095 | 1096 | ### Table 1: Aggregate the data by products 1097 | 1098 | 1099 | ```python 1100 | # Create table 1101 | cursor.execute("DROP TABLE IF EXISTS clique_bait.products;") 1102 | cursor.execute(""" 1103 | CREATE TABLE clique_bait.products 1104 | ( 1105 | "product" VARCHAR(255), 1106 | "nb_views" INTEGER, 1107 | "nb_cart_adds" INTEGER, 1108 | "nb_abandoned" INTEGER, 1109 | "nb_purchases" INTEGER 1110 | ); 1111 | """) 1112 | 1113 | 1114 | # Populate the table 1115 | cursor.execute(""" 1116 | INSERT INTO clique_bait.products 1117 | WITH add_last_event_cte AS 1118 | ( 1119 | SELECT 1120 | e.visit_id, 1121 | e.sequence_number, 1122 | ph.page_name, 1123 | ph.product_category, 1124 | ei.event_name, 1125 | LAST_VALUE(ei.event_name) OVER (PARTITION BY e.visit_id ORDER BY e.sequence_number ROWS BETWEEN UNBOUNDED preceding AND UNBOUNDED following) AS last_event 1126 | FROM clique_bait.events e 1127 | JOIN clique_bait.event_identifier ei ON e.event_type = ei.event_type 1128 | JOIN clique_bait.page_hierarchy ph ON e.page_id = ph.page_id 1129 | ) 1130 | SELECT 1131 | page_name AS product, 1132 | SUM(CASE WHEN event_name = 'Page View' THEN 1 ELSE 0 END) AS nb_views, 1133 | SUM(CASE WHEN event_name = 'Add to Cart' THEN 1 ELSE 0 END) AS nb_cart_adds, 1134 | SUM(CASE WHEN event_name = 'Add to Cart' AND last_event != 'Purchase' THEN 1 ELSE 0 END) AS nb_abandoned, 1135 | SUM(CASE WHEN event_name = 'Add to Cart' AND last_event = 'Purchase' THEN 1 ELSE 0 END) AS nb_purchases 1136 | FROM add_last_event_cte 1137 | WHERE product_category IS NOT NULL 1138 | GROUP BY page_name 1139 | """) 1140 | 1141 | 1142 | # Saving 1143 | conn.commit() 1144 | ``` 1145 | 1146 | **Result** 1147 | 1148 | 1149 | ```python 1150 | pd.read_sql("""SELECT * FROM clique_bait.products""", conn) 1151 | ``` 1152 | 1153 | 1154 | 1155 | 1156 | 1157 | 1158 | 1159 | 1160 | 1161 | 1162 | 1163 | 1164 | 1165 | 1166 | 1167 | 1168 | 1169 | 1170 | 1171 | 1172 | 1173 | 1174 | 1175 | 1176 | 1177 | 1178 | 1179 | 1180 | 1181 | 1182 | 1183 | 1184 | 1185 | 1186 | 1187 | 1188 | 1189 | 1190 | 1191 | 1192 | 1193 | 1194 | 1195 | 1196 | 1197 | 1198 | 1199 | 1200 | 1201 | 1202 | 1203 | 1204 | 1205 | 1206 | 1207 | 1208 | 1209 | 1210 | 1211 | 1212 | 1213 | 1214 | 1215 | 1216 | 1217 | 1218 | 1219 | 1220 | 1221 | 1222 | 1223 | 1224 | 1225 | 1226 | 1227 | 1228 | 1229 | 1230 | 1231 | 1232 | 1233 | 1234 | 1235 | 1236 | 1237 | 1238 |
product | nb_views | nb_cart_adds | nb_abandoned | nb_purchases
--- | --- | --- | --- | ---
Abalone | 1525 | 932 | 233 | 699
Oyster | 1568 | 943 | 217 | 726
Salmon | 1559 | 938 | 227 | 711
Crab | 1564 | 949 | 230 | 719
Tuna | 1515 | 931 | 234 | 697
Lobster | 1547 | 968 | 214 | 754
Kingfish | 1559 | 920 | 213 | 707
Russian Caviar | 1563 | 946 | 249 | 697
Black Truffle | 1469 | 924 | 217 | 707
1239 | 1240 | 1241 | 1242 | 1243 | ### Table 2: Aggregate the data by product categories 1244 | 1245 | 1246 | ```python 1247 | # Create the table 1248 | cursor.execute("DROP TABLE IF EXISTS clique_bait.product_category;") 1249 | cursor.execute(""" 1250 | CREATE TABLE clique_bait.product_category 1251 | ( 1252 | "product" VARCHAR(255), 1253 | "nb_views" INTEGER, 1254 | "nb_cart_adds" INTEGER, 1255 | "nb_abandoned" INTEGER, 1256 | "nb_purchases" INTEGER 1257 | ); 1258 | """) 1259 | 1260 | 1261 | # Populate the table 1262 | cursor.execute(""" 1263 | INSERT INTO clique_bait.product_category 1264 | WITH add_last_event_cte AS 1265 | ( 1266 | SELECT 1267 | e.visit_id, 1268 | e.sequence_number, 1269 | ph.product_category, 1270 | ei.event_name, 1271 | LAST_VALUE(ei.event_name) OVER (PARTITION BY e.visit_id ORDER BY e.sequence_number ROWS BETWEEN UNBOUNDED preceding AND UNBOUNDED following) AS last_event 1272 | FROM clique_bait.events e 1273 | JOIN clique_bait.event_identifier ei ON e.event_type = ei.event_type 1274 | JOIN clique_bait.page_hierarchy ph ON e.page_id = ph.page_id 1275 | ) 1276 | SELECT 1277 | product_category, 1278 | SUM(CASE WHEN event_name = 'Page View' THEN 1 ELSE 0 END) AS nb_views, 1279 | SUM(CASE WHEN event_name = 'Add to Cart' THEN 1 ELSE 0 END) AS nb_cart_adds, 1280 | SUM(CASE WHEN event_name = 'Add to Cart' AND last_event != 'Purchase' THEN 1 ELSE 0 END) AS nb_abandoned, 1281 | SUM(CASE WHEN event_name = 'Add to Cart' AND last_event = 'Purchase' THEN 1 ELSE 0 END) AS nb_purchases 1282 | FROM add_last_event_cte 1283 | WHERE product_category IS NOT NULL 1284 | GROUP BY product_category 1285 | 1286 | """) 1287 | 1288 | # Saving 1289 | conn.commit() 1290 | ``` 1291 | 1292 | **Result** 1293 | 1294 | 1295 | ```python 1296 | pd.read_sql("SELECT * FROM clique_bait.product_category", conn) 1297 | ``` 1298 | 1299 | 1300 | 1301 | 1302 | 1303 | 1304 | 1305 | 1306 | 1307 | 1308 | 1309 | 1310 | 1311 | 1312 | 1313 | 1314 | 1315 | 1316 | 1317 | 1318 | 1319 | 1320 | 1321 | 1322 | 1323 | 1324 | 1325 | 1326 | 1327 | 1328 | 1329 | 1330 | 1331 | 1332 | 1333 | 1334 | 1335 | 1336 |
product | nb_views | nb_cart_adds | nb_abandoned | nb_purchases
--- | --- | --- | --- | ---
Luxury | 3032 | 1870 | 466 | 1404
Shellfish | 6204 | 3792 | 894 | 2898
Fish | 4633 | 2789 | 674 | 2115
1337 | 1338 | 1339 | 1340 | 1341 | ___ 1342 | ### Use your 2 new output tables - answer the following questions: 1343 | 1344 | #### 1. Which product had the most views, cart adds and purchases? 1345 | 1346 | 1347 | ```python 1348 | pd.read_sql(""" 1349 | SELECT 1350 | product, 1351 | nb_views, 1352 | nb_cart_adds, 1353 | nb_purchases 1354 | FROM clique_bait.products 1355 | ORDER BY nb_views DESC, nb_cart_adds DESC, nb_purchases DESC 1356 | LIMIT 1; 1357 | """, conn) 1358 | ``` 1359 | 1360 | 1361 | 1362 | 1363 | 1364 | 1365 | 1366 | 1367 | 1368 | 1369 | 1370 | 1371 | 1372 | 1373 | 1374 | 1375 | 1376 | 1377 | 1378 | 1379 |
product | nb_views | nb_cart_adds | nb_purchases
--- | --- | --- | ---
Oyster | 1568 | 943 | 726
1380 | 1381 | 1382 | 1383 | 1384 | Sorted by `nb_views` first, the query above returns **Oyster** as the top product. However, because the `ORDER BY` clause prioritizes views, this single row does not necessarily identify the leader for cart adds or purchases.
1385 | Alternatively, we could break down the data for each individual product, which reveals that: 1386 | 1387 | - **Oyster** is the most viewed product. 1388 | - **Lobster** is the most added product to cart and the most purchased product 1389 | 1390 | 1391 | ```python 1392 | pd.read_sql(""" 1393 | WITH max_cte AS 1394 | ( 1395 | SELECT 1396 | *, 1397 | CASE WHEN MAX(nb_views) OVER (ORDER BY nb_views DESC) = nb_views THEN product ELSE '' END AS most_viewed, 1398 | CASE WHEN MAX(nb_cart_adds) OVER (ORDER BY nb_cart_adds DESC) = nb_cart_adds THEN product ELSE '' END AS most_cart_added, 1399 | CASE WHEN MAX(nb_purchases) OVER (ORDER BY nb_purchases DESC) = nb_purchases THEN product ELSE '' END AS most_purchased 1400 | FROM clique_bait.products 1401 | ) 1402 | SELECT most_viewed, most_cart_added, most_purchased 1403 | FROM max_cte 1404 | WHERE most_viewed != '' OR most_cart_added !='' OR most_purchased !='' 1405 | """, conn) 1406 | ``` 1407 | 1408 | 1409 | 1410 | 1411 | 1412 | 1413 | 1414 | 1415 | 1416 | 1417 | 1418 | 1419 | 1420 | 1421 | 1422 | 1423 | 1424 | 1425 | 1426 | 1427 | 1428 | 1429 | 1430 | 1431 |
| most_viewed | most_cart_added | most_purchased |
| --- | --- | --- |
| Oyster |  |  |
|  | Lobster | Lobster |
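The `MAX() OVER (ORDER BY ... DESC)` trick above works because, with a descending sort, the running maximum in each column is always the global maximum. If a flatter formulation is preferred, an equivalent version (just a sketch) uses three scalar subqueries:

```python
# Sketch of an equivalent query using scalar subqueries instead of window functions.
pd.read_sql("""
SELECT
    (SELECT product FROM clique_bait.products ORDER BY nb_views DESC LIMIT 1) AS most_viewed,
    (SELECT product FROM clique_bait.products ORDER BY nb_cart_adds DESC LIMIT 1) AS most_cart_added,
    (SELECT product FROM clique_bait.products ORDER BY nb_purchases DESC LIMIT 1) AS most_purchased;
""", conn)
```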
1432 | 1433 | 1434 | 1435 | 1436 | ___ 1437 | #### 2. Which product was most likely to be abandoned? 1438 | 1439 | 1440 | ```python 1441 | pd.read_sql(""" 1442 | SELECT product, nb_abandoned 1443 | FROM clique_bait.products 1444 | ORDER BY nb_abandoned DESC 1445 | LIMIT 1; 1446 | """, conn) 1447 | ``` 1448 | 1449 | 1450 | 1451 | 1452 | 1453 | 1454 | 1455 | 1456 | 1457 | 1458 | 1459 | 1460 | 1461 | 1462 | 1463 | 1464 |
product | nb_abandoned
--- | ---
Russian Caviar | 249
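Note that this answer ranks products by the absolute number of abandoned cart adds. "Most likely to be abandoned" can also be read as the highest abandonment *rate* (abandoned cart adds divided by total cart adds); a variant along those lines, just a sketch:

```python
# Sketch: rank by abandonment rate (abandoned / cart adds) instead of the raw count.
pd.read_sql("""
SELECT
    product,
    nb_abandoned,
    ROUND(nb_abandoned / nb_cart_adds::NUMERIC * 100, 2) AS abandonment_rate
FROM clique_bait.products
ORDER BY nb_abandoned / nb_cart_adds::NUMERIC DESC
LIMIT 1;
""", conn)
```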
1465 | 1466 | 1467 | 1468 | 1469 | ___ 1470 | #### 3. Which product had the highest view to purchase percentage? 1471 | 1472 | 1473 | ```python 1474 | pd.read_sql(""" 1475 | SELECT 1476 | product, 1477 | nb_views, 1478 | nb_purchases, 1479 | CONCAT(ROUND(nb_purchases/nb_views::NUMERIC * 100,2), ' %') AS percent 1480 | FROM clique_bait.products 1481 | ORDER BY percent DESC 1482 | LIMIT 1; 1483 | """, conn) 1484 | ``` 1485 | 1486 | 1487 | 1488 | 1489 | 1490 | 1491 | 1492 | 1493 | 1494 | 1495 | 1496 | 1497 | 1498 | 1499 | 1500 | 1501 | 1502 | 1503 | 1504 | 1505 |
product | nb_views | nb_purchases | percent
--- | --- | --- | ---
Lobster | 1547 | 754 | 48.74 %
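One caveat: `percent` is built with `CONCAT`, so it is a text value, and `ORDER BY percent DESC` therefore sorts lexicographically. It happens to return the right row here because every view-to-purchase ratio has two integer digits, but a more defensive variant (a sketch) sorts on the numeric ratio and only formats it for display:

```python
# Sketch: order by the numeric ratio, then format the percentage for display only.
pd.read_sql("""
SELECT
    product,
    nb_views,
    nb_purchases,
    CONCAT(ROUND(nb_purchases / nb_views::NUMERIC * 100, 2), ' %') AS percent
FROM clique_bait.products
ORDER BY nb_purchases / nb_views::NUMERIC DESC
LIMIT 1;
""", conn)
```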
1506 | 1507 | 1508 | 1509 | 1510 | ___ 1511 | #### 4. What is the average conversion rate from view to cart add? 1512 | 1513 | 1514 | ```python 1515 | pd.read_sql(""" 1516 | SELECT 1517 | CONCAT(ROUND(AVG(nb_cart_adds/nb_views::NUMERIC * 100),2), ' %') AS conversion_rate 1518 | FROM clique_bait.products 1519 | """, conn) 1520 | ``` 1521 | 1522 | 1523 | 1524 | 1525 | 1526 | 1527 | 1528 | 1529 | 1530 | 1531 | 1532 | 1533 | 1534 | 1535 |
| conversion_rate |
| --- |
| 60.95 % |
1536 | 1537 | 1538 | 1539 | 1540 | ___ 1541 | #### 5. What is the average conversion rate from cart add to purchase? 1542 | 1543 | 1544 | ```python 1545 | pd.read_sql(""" 1546 | SELECT 1547 | CONCAT(ROUND(AVG(nb_purchases/nb_cart_adds::NUMERIC * 100),2), ' %') AS conversion_rate 1548 | FROM clique_bait.products 1549 | """, conn) 1550 | ``` 1551 | 1552 | 1553 | 1554 | 1555 | 1556 | 1557 | 1558 | 1559 | 1560 | 1561 | 1562 | 1563 | 1564 | 1565 |
| conversion_rate |
| --- |
| 75.93 % |
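Both conversion rates above are simple averages of the per-product rates, which weight every product equally. If a traffic-weighted (pooled) figure is wanted instead, a sketch that sums the numerators and denominators first would be:

```python
# Sketch: pooled conversion rates across all products
# (total cart adds over total views, and total purchases over total cart adds).
pd.read_sql("""
SELECT
    CONCAT(ROUND(SUM(nb_cart_adds) / SUM(nb_views)::NUMERIC * 100, 2), ' %') AS view_to_cart_rate,
    CONCAT(ROUND(SUM(nb_purchases) / SUM(nb_cart_adds)::NUMERIC * 100, 2), ' %') AS cart_to_purchase_rate
FROM clique_bait.products;
""", conn)
```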
1566 | 1567 | 1568 | 1569 | 1570 | 1571 | ## D. Campaigns Analysis 1572 | Generate a table that has 1 single row for every unique `visit_id` record and has the following columns: 1573 | 1574 | - `user_id` 1575 | - `visit_id` 1576 | - `visit_start_time`: the earliest event_time for each visit 1577 | - `page_views`: count of page views for each visit 1578 | - `cart_adds`: count of product cart add events for each visit 1579 | - `purchase`: 1/0 flag if a purchase event exists for each visit 1580 | - `campaign_name`: map the visit to a campaign if the visit_start_time falls between the start_date and end_date 1581 | - `impression`: count of ad impressions for each visit 1582 | - `click`: count of ad clicks for each visit 1583 | - `cart_products` **(Optional column)**: a comma separated text value with products added to the cart sorted by the order they were added to the cart (hint: use the `sequence_number`) 1584 | 1585 | Use the subsequent dataset to generate at least 5 insights for the Clique Bait team - bonus: prepare a single A4 infographic that the team can use for their management reporting sessions, be sure to emphasise the most important points from your findings. 1586 | 1587 | Some ideas you might want to investigate further include: 1588 | 1589 | - Identifying users who have received impressions during each campaign period and comparing each metric with other users who did not have an impression event 1590 | - Does clicking on an impression lead to higher purchase rates? 1591 | - What is the uplift in purchase rate when comparing users who click on a campaign impression versus users who do not receive an impression? What if we compare them with users who just see an impression but do not click? 1592 | - What metrics can you use to quantify the success or failure of each campaign compared to each other?
1593 | 1594 | 1595 | ```python 1596 | # Create table 1597 | cursor.execute("DROP TABLE IF EXISTS clique_bait.campaign_analysis;") 1598 | cursor.execute(""" 1599 | CREATE TABLE clique_bait.campaign_analysis 1600 | ( 1601 | "user_id" INTEGER, 1602 | "visit_id" VARCHAR(6), 1603 | "visit_start_time" TIMESTAMP, 1604 | "page_views" INTEGER, 1605 | "cart_adds" INTEGER, 1606 | "purchases" INTEGER, 1607 | "impressions" INTEGER, 1608 | "clicks" INTEGER, 1609 | "campaign_name" VARCHAR(33), 1610 | "cart_products" VARCHAR(255) 1611 | ); 1612 | """) 1613 | 1614 | 1615 | # Populate the table 1616 | cursor.execute(""" 1617 | INSERT INTO clique_bait.campaign_analysis 1618 | WITH cart_products_cte AS 1619 | ( 1620 | -- Generate a sequence of products added to the cart 1621 | 1622 | SELECT 1623 | u.user_id, 1624 | e.visit_id, 1625 | STRING_AGG(ph.page_name, ', ' ORDER BY sequence_number) AS cart_products 1626 | FROM clique_bait.users u 1627 | JOIN clique_bait.events e ON u.cookie_id = e.cookie_id 1628 | JOIN clique_bait.page_hierarchy ph ON e.page_id = ph.page_id 1629 | JOIN clique_bait.event_identifier ei ON e.event_type = ei.event_type 1630 | WHERE ei.event_name = 'Add to Cart' 1631 | GROUP BY u.user_id, e.visit_id 1632 | ) 1633 | 1634 | SELECT 1635 | u.user_id, 1636 | e.visit_id, 1637 | MIN(e.event_time) AS visit_start_time, 1638 | SUM(CASE WHEN ei.event_name = 'Page View' THEN 1 ELSE 0 END) AS page_views, 1639 | SUM(CASE WHEN ei.event_name = 'Add to Cart' THEN 1 ELSE 0 END) AS cart_adds, 1640 | SUM(CASE WHEN ei.event_name = 'Purchase' THEN 1 ELSE 0 END) AS purchases, 1641 | SUM(CASE WHEN ei.event_name = 'Ad Impression' THEN 1 ELSE 0 END) AS impressions, 1642 | SUM(CASE WHEN ei.event_name = 'Ad Click' THEN 1 ELSE 0 END) AS clicks, 1643 | CASE WHEN MIN(e.event_time) BETWEEN ci.start_date AND ci.end_date THEN ci.campaign_name ELSE '' END AS campaign_name, 1644 | CASE WHEN cp.cart_products IS NULL THEN '' ELSE cp.cart_products END AS cart_products 1645 | 1646 | FROM clique_bait.users u 1647 | JOIN clique_bait.events e ON u.cookie_id = e.cookie_id 1648 | JOIN clique_bait.page_hierarchy ph ON e.page_id = ph.page_id 1649 | JOIN clique_bait.event_identifier ei ON e.event_type = ei.event_type 1650 | JOIN clique_bait.campaign_identifier ci ON e.event_time BETWEEN ci.start_date AND ci.end_date 1651 | LEFT JOIN cart_products_cte cp ON cp.user_id = u.user_id AND cp.visit_id = e.visit_id 1652 | GROUP BY u.user_id, e.visit_id, ci.start_date, ci.end_date, ci.campaign_name, cp.cart_products 1653 | ORDER BY u.user_id, e.visit_id; 1654 | """) 1655 | 1656 | 1657 | # Saving the updates 1658 | conn.commit() 1659 | ``` 1660 | 1661 | **Result**
1662 | Here is the table listing the first 5 users only. 1663 | 1664 | 1665 | ```python 1666 | pd.read_sql(""" 1667 | SELECT * 1668 | FROM clique_bait.campaign_analysis 1669 | WHERE user_id < 6 1670 | """, conn) 1671 | ``` 1672 | 1673 |
| user_id | visit_id | visit_start_time | page_views | cart_adds | purchases | impressions | clicks | campaign_name | cart_products |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 02a5d5 | 2020-02-26 16:57:26.260871 | 4 | 0 | 0 | 0 | 0 | Half Off - Treat Your Shellf(ish) |  |
| 1 | 0826dc | 2020-02-26 05:58:37.918618 | 1 | 0 | 0 | 0 | 0 | Half Off - Treat Your Shellf(ish) |  |
| 1 | 0fc437 | 2020-02-04 17:49:49.602976 | 10 | 6 | 1 | 1 | 1 | Half Off - Treat Your Shellf(ish) | Tuna, Russian Caviar, Black Truffle, Abalone, ... |
| 1 | 30b94d | 2020-03-15 13:12:54.023936 | 9 | 7 | 1 | 1 | 1 | Half Off - Treat Your Shellf(ish) | Salmon, Kingfish, Tuna, Russian Caviar, Abalon... |
| 1 | 41355d | 2020-03-25 00:11:17.860655 | 6 | 1 | 0 | 0 | 0 | Half Off - Treat Your Shellf(ish) | Lobster |
| 1 | ccf365 | 2020-02-04 19:16:09.182546 | 7 | 3 | 1 | 0 | 0 | Half Off - Treat Your Shellf(ish) | Lobster, Crab, Oyster |
| 1 | eaffde | 2020-03-25 20:06:32.342989 | 10 | 8 | 1 | 1 | 1 | Half Off - Treat Your Shellf(ish) | Salmon, Tuna, Russian Caviar, Black Truffle, A... |
| 1 | f7c798 | 2020-03-15 02:23:26.312543 | 9 | 3 | 1 | 0 | 0 | Half Off - Treat Your Shellf(ish) | Russian Caviar, Crab, Oyster |
| 2 | 0635fb | 2020-02-16 06:42:42.735730 | 9 | 4 | 1 | 0 | 0 | Half Off - Treat Your Shellf(ish) | Salmon, Kingfish, Abalone, Crab |
| 2 | 1f1198 | 2020-02-01 21:51:55.078775 | 1 | 0 | 0 | 0 | 0 | Half Off - Treat Your Shellf(ish) |  |
| 2 | 3b5871 | 2020-01-18 10:16:32.158475 | 9 | 6 | 1 | 1 | 1 | 25% Off - Living The Lux Life | Salmon, Kingfish, Russian Caviar, Black Truffl... |
| 2 | 49d73d | 2020-02-16 06:21:27.138532 | 11 | 9 | 1 | 1 | 1 | Half Off - Treat Your Shellf(ish) | Salmon, Kingfish, Tuna, Russian Caviar, Black ... |
| 2 | 910d9a | 2020-02-01 10:40:46.875968 | 8 | 1 | 0 | 0 | 0 | Half Off - Treat Your Shellf(ish) | Abalone |
| 2 | c5c0ee | 2020-01-18 10:35:22.765382 | 1 | 0 | 0 | 0 | 0 | 25% Off - Living The Lux Life |  |
| 2 | d58cbd | 2020-01-18 23:40:54.761906 | 8 | 4 | 0 | 0 | 0 | 25% Off - Living The Lux Life | Kingfish, Tuna, Abalone, Crab |
| 2 | e26a84 | 2020-01-18 16:06:40.907280 | 6 | 2 | 1 | 0 | 0 | 25% Off - Living The Lux Life | Salmon, Oyster |
| 3 | 25502e | 2020-02-21 11:26:15.353389 | 1 | 0 | 0 | 0 | 0 | Half Off - Treat Your Shellf(ish) |  |
| 3 | 9a2f24 | 2020-02-21 03:19:10.032455 | 6 | 2 | 1 | 0 | 0 | Half Off - Treat Your Shellf(ish) | Kingfish, Black Truffle |
| 3 | bf200a | 2020-03-11 04:10:26.708385 | 7 | 2 | 1 | 0 | 0 | Half Off - Treat Your Shellf(ish) | Salmon, Crab |
| 3 | eb13cd | 2020-03-11 21:36:37.222763 | 1 | 0 | 0 | 0 | 0 | Half Off - Treat Your Shellf(ish) |  |
| 4 | 07a950 | 2020-03-19 17:56:24.610445 | 6 | 0 | 0 | 0 | 0 | Half Off - Treat Your Shellf(ish) |  |
| 4 | 4c0ce3 | 2020-02-22 19:42:45.498271 | 1 | 0 | 0 | 0 | 0 | Half Off - Treat Your Shellf(ish) |  |
| 4 | 7caba5 | 2020-02-22 17:49:37.646174 | 5 | 2 | 0 | 0 | 0 | Half Off - Treat Your Shellf(ish) | Tuna, Lobster |
| 4 | b90e25 | 2020-03-19 11:01:58.182947 | 9 | 4 | 1 | 1 | 1 | Half Off - Treat Your Shellf(ish) | Tuna, Black Truffle, Lobster, Crab |
| 5 | 05c52a | 2020-02-11 12:30:33.479052 | 9 | 8 | 0 | 1 | 1 | Half Off - Treat Your Shellf(ish) | Salmon, Kingfish, Tuna, Russian Caviar, Black ... |
| 5 | 4bffe1 | 2020-02-26 16:03:10.377881 | 8 | 1 | 0 | 0 | 0 | Half Off - Treat Your Shellf(ish) | Russian Caviar |
| 5 | 580bf6 | 2020-02-11 04:05:42.307991 | 8 | 6 | 0 | 0 | 0 | Half Off - Treat Your Shellf(ish) | Salmon, Tuna, Russian Caviar, Black Truffle, A... |
| 5 | b45feb | 2020-02-01 07:47:45.025247 | 1 | 0 | 0 | 0 | 0 | Half Off - Treat Your Shellf(ish) |  |
| 5 | f61ed7 | 2020-02-01 06:30:39.766168 | 8 | 2 | 1 | 0 | 0 | Half Off - Treat Your Shellf(ish) | Lobster, Crab |
| 5 | fa70cb | 2020-02-26 11:12:20.361638 | 1 | 0 | 0 | 0 | 0 | Half Off - Treat Your Shellf(ish) |  |
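One thing to keep in mind when reading this table: the `INNER JOIN` on `campaign_identifier` keeps only events whose `event_time` falls inside one of the three campaign windows, so visits that start outside every window never reach `campaign_analysis`. Switching that join to a `LEFT JOIN` would keep such visits with an empty `campaign_name` instead; that is a design choice. A quick diagnostic (just a sketch) counts how many visits are affected:

```python
# Sketch: count visits whose first event falls outside every campaign window.
# These visits are dropped by the INNER JOIN on campaign_identifier above.
pd.read_sql("""
WITH visit_starts AS
(
    SELECT visit_id, MIN(event_time) AS visit_start_time
    FROM clique_bait.events
    GROUP BY visit_id
)
SELECT COUNT(*) AS visits_outside_campaigns
FROM visit_starts v
WHERE NOT EXISTS
(
    SELECT 1
    FROM clique_bait.campaign_identifier ci
    WHERE v.visit_start_time BETWEEN ci.start_date AND ci.end_date
);
""", conn)
```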
2082 | 2083 | 2084 | 2085 | 2086 | ### Further investigation ... 2087 | ### Does clicking on an impression lead to higher purchase rates? 2088 | 2089 | 2090 | Among users who received an impression, the purchase rate is significantly higher for those who clicked on the advertisement than for those who did not, with purchase rates of 71.89% and 13.12%, respectively. 2091 | 2092 | 2093 | ```python 2094 | pd.read_sql(""" 2095 | SELECT 2096 | ROUND(SUM(CASE WHEN clicks = 1 THEN purchases ELSE 0 END)/COUNT(visit_id)::NUMERIC * 100, 2) AS clicked_purchase_rate, 2097 | ROUND(SUM(CASE WHEN clicks = 0 THEN purchases ELSE 0 END)/COUNT(visit_id)::NUMERIC * 100, 2) AS no_clicked_purchase_rate 2098 | FROM clique_bait.campaign_analysis 2099 | WHERE impressions = 1 2100 | """, conn) 2101 | ``` 2102 | 2103 |
clicked_purchase_rate | no_clicked_purchase_rate
--- | ---
71.89 | 13.12
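Note that both figures share the same denominator (every visit with an impression), so they are shares of that total rather than per-group rates. As a cross-check, the purchase rate within each group can be computed by grouping on `clicks`; a sketch:

```python
# Sketch: conditional purchase rate within each group of impression visits
# (purchases by clickers / clicker visits, and likewise for non-clickers).
pd.read_sql("""
SELECT
    clicks,
    COUNT(visit_id) AS visits,
    ROUND(SUM(purchases) / COUNT(visit_id)::NUMERIC * 100, 2) AS purchase_rate
FROM clique_bait.campaign_analysis
WHERE impressions = 1
GROUP BY clicks
ORDER BY clicks DESC;
""", conn)
```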
2119 | 2120 | 2121 | 2122 | 2123 | ___ 2124 | ### What is the uplift in purchase rate when comparing users who click on a campaign impression versus users who do not receive an impression? What if we compare them with users who just see an impression but do not click? 2125 | 2126 | 2127 | Among all users who visited the Clique Bait website: 2128 | - The purchase rate is highest for users who do not receive an impression at all, at 28.6%. 2129 | - Clicking on a campaign impression leads to a lower purchase rate of 17.6% than that of the no-impression group. 2130 | - However, users who receive a campaign impression but do not click on it show the lowest purchase rate, at only 3.2%. 2131 | 2132 | 2133 | ```python 2134 | pd.read_sql(""" 2135 | SELECT 2136 | ROUND(SUM(CASE WHEN impressions = 1 AND clicks = 1 THEN purchases ELSE 0 END)/COUNT(visit_id)::NUMERIC * 100, 1) AS clicked_purchase_rate, 2137 | ROUND(SUM(CASE WHEN impressions = 1 AND clicks = 0 THEN purchases ELSE 0 END)/COUNT(visit_id)::NUMERIC * 100, 1) AS no_clicked_purchase_rate, 2138 | ROUND(SUM(CASE WHEN impressions = 0 AND clicks = 0 THEN purchases ELSE 0 END)/COUNT(visit_id)::NUMERIC * 100, 1) AS no_impression_purchase_rate 2139 | FROM clique_bait.campaign_analysis 2140 | """, conn) 2141 | ``` 2142 | 2143 |
clicked_purchase_rate | no_clicked_purchase_rate | no_impression_purchase_rate
--- | --- | ---
17.6 | 3.2 | 28.6
2161 | 2162 | 2163 | 2164 | 2165 | 2166 | ```python 2167 | conn.close() 2168 | ``` 2169 | -------------------------------------------------------------------------------- /CaseStudy#6 - Clique Bait/README.md: -------------------------------------------------------------------------------- 1 | # Case Study #6: Clique Bait 🪝
2 | [![View Main Folder](https://img.shields.io/badge/View-Main_Folder-F5788D.svg?logo=GitHub)](https://github.com/chanronnie/8WeekSQLChallenge) 3 | [![View Case Study 6](https://img.shields.io/badge/View-Case_Study_6-06816E)](https://8weeksqlchallenge.com/case-study-6/)
4 | ![6](https://github.com/chanronnie/8WeekSQLChallenge/assets/121308347/364b11b1-7a4b-47d9-b7a6-6fd87075d140) 5 | 6 | 7 | The case study presented here is part of the **8 Week SQL Challenge**.\ 8 | It is kindly brought to us by [**Data With Danny**](https://8weeksqlchallenge.com). 9 | 10 | This time, I am using `PostgreSQL queries` (instead of MySQL) in `Jupyter Notebook` to quickly view results, which provides me with an opportunity 11 | - to learn PostgreSQL 12 | - to utilize handy mathematical and string functions. 13 | 14 | 15 | 16 | ## Table of Contents 17 | * [Entity Relationship Diagram](#entity-relationship-diagram) 18 | * [Datasets](#datasets) 19 | * [Case Study Questions](#case-study-questions) 20 | * [Solutions](#solutions) 21 | * [PostgreSQL Topics Covered](#postgresql-topics-covered) 22 | 23 | ## Entity Relationship Diagram 24 | Question A will require the creation of an Entity-Relationship Diagram (ERD). 25 | 26 | 27 | ## Datasets 28 | Case Study #6 contains 5 datasets: 29 | 30 | 31 | ### `users` 32 |
33 | 34 | View table 35 | 36 | 37 | `users` : This table shows all the user_id along with their unique cookie_id (PK). 38 | 39 | user_id | cookie_id | start_date 40 | --- | --- | --- 41 | 397 | 759ff | 2020-03-30 00:00:00 42 | 215 | 863329 | 2020-01-26 00:00:00 43 | 191 | eefca9 | 2020-03-15 00:00:00 44 | 89 | 764796 | 2020-01-07 00:00:00 45 | 127 | 17ccc5 | 2020-01-22 00:00:00 46 | 81 | b0b666 | 2020-03-01 00:00:00 47 | 260 | a4f236 | 2020-01-08 00:00:00 48 | 203 | d1182f | 2020-04-18 00:00:00 49 | 23 | 12dbc8 | 2020-01-18 00:00:00 50 | 375 | f61d69 | 2020-01-03 00:00:00 51 | 52 |
53 | 54 | 55 | ### `events` 56 |
57 | 58 | View table 59 | 60 | 61 | `events` : The table lists all the website interactions by customers (cookie_id, webpage visited, etc.). 62 | 63 | visit_id | cookie_id | page_id | event_type | sequence_number | event_time 64 | --- | --- | --- | --- | --- | --- 65 | 719fd3 | 3d83d3 | 5 | 1 | 4 | 2020-03-02 00:29:09.975502 66 | fb1eb1 | c5ff25 | 5 | 2 | 8 | 2020-01-22 07:59:16.761931 67 | 23fe81 | 1e8c2d | 10 | 1 | 9 | 2020-03-21 13:14:11.745667 68 | ad91aa | 648115 | 6 | 1 | 3 | 2020-04-27 16:28:09.824606 69 | 5576d7 | ac418c | 6 | 1 | 4 | 2020-01-18 04:55:10.149236 70 | 48308b | c686c1 | 8 | 1 | 5 | 2020-01-29 06:10:38.702163 71 | 46b17d | 78f9b3 | 7 | 1 | 12 | 2020-02-16 09:45:31.926407 72 | 9fd196 | ccf057 | 4 | 1 | 5 | 2020-02-14 08:29:12.922164 73 | edf853 | f85454 | 1 | 1 | 1 | 2020-02-22 12:59:07.652207 74 | 3c6716 | 02e74f | 3 | 2 | 5 | 2020-01-31 17:56:20.777383 75 | 76 |
77 | 78 | 79 | ### `event_identifier` 80 |
81 | 82 | View table 83 | 84 | 85 | `event_identifier` : The table maps the event_type to its corresponding event_name. 86 | 87 | event_type | event_name 88 | --- | --- 89 | 1 | Page View 90 | 2 | Add to Cart 91 | 3 | Purchase 92 | 4 | Ad Impression 93 | 5 | Ad Click 94 | 95 |
96 | 97 | 98 | ### `campaign_identifier` 99 |
100 | 101 | View table 102 | 103 | 104 | `campaign_identifier`: The table lists the information of the three campaigns that ran on the Clique Bait website. 105 | 106 | campaign_id | products | campaign_name | start_date | end_date 107 | --- | --- | --- | --- | --- 108 | 1 | 1-3 | BOGOF - Fishing For Compliments | 2020-01-01 00:00:00 | 2020-01-14 00:00:00 109 | 2 | 4-5 | 25% Off - Living The Lux Life | 2020-01-15 00:00:00 | 2020-01-28 00:00:00 110 | 3 | 6-8 | Half Off - Treat Your Shellf(ish) | 2020-02-01 00:00:00 | 2020-03-31 00:00:00 111 | 112 |
113 | 114 | 115 | ### `page_hierarchy` 116 |
117 | 118 | View table 119 | 120 | 121 | `page_hierarchy`: The table maps the page_id to its page_name. 122 | 123 | page_id | page_name | product_category | product_id 124 | --- | --- | --- | --- 125 | 1 | Home Page | null | null 126 | 2 | All Products | null | null 127 | 3 | Salmon | Fish | 1 128 | 4 | Kingfish | Fish | 2 129 | 5 | Tuna | Fish | 3 130 | 6 | Russian Caviar | Luxury | 4 131 | 7 | Black Truffle | Luxury | 5 132 | 8 | Abalone | Shellfish | 6 133 | 9 | Lobster | Shellfish | 7 134 | 10 | Crab | Shellfish | 8 135 | 11 | Oyster | Shellfish | 9 136 | 12 | Checkout | null | null 137 | 13 | Confirmation | null | null 138 | 139 |
140 | 141 | 142 | ## Case Study Questions 143 | Case Study #6 is categorized into 4 question groups\ 144 | To view the specific section, please open the link in a *`new tab`* or *`window`*.\ 145 | [A. Enterprise Relationship Diagram](CaseStudy6_solutions.md#A)\ 146 | [B. Digital Analysis](CaseStudy6_solutions.md#B)\ 147 | [C. Product Funnel Analysis](CaseStudy6_solutions.md#C)\ 148 | [D. Campaigns Analysis](CaseStudy6_solutions.md#D) 149 | 150 | ## Solutions 151 | - View `clique_bait` database: [**CaseStudy6_schema.sql**](https://raw.githubusercontent.com/chanronnie/8WeekSQLChallenge/main/CaseStudy%236%20-%20Clique%20Bait/CaseStudy6_schema.sql) 152 | - View Solution: 153 | - [**Markdown File**](CaseStudy6_solutions.md): offers a more fluid and responsive viewing experience 154 | - [**Jupyter Notebook**](CaseStudy6_solutions.ipynb): contains the original code 155 | 156 | ## PostgreSQL Topics Covered 157 | - Establish relationships between datasets to create an Entity-Relationship Diagram (ERD) 158 | - Data Cleaning 159 | - Create Tables 160 | - Pivot Table using CASE WHEN 161 | - DATE and STRING Type Manipulation 162 | - Common Table Expressions (CTE) 163 | - Window Functions 164 | - Subqueries 165 | - JOIN, UNION ALL 166 | -------------------------------------------------------------------------------- /CaseStudy#6 - Clique Bait/images/CliqueBait_ERD.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/chanronnie/8WeekSQLChallenge/4b36f745d833a82a61a9ce70ab177462eb7dbfd0/CaseStudy#6 - Clique Bait/images/CliqueBait_ERD.png -------------------------------------------------------------------------------- /CaseStudy#7 - Balanced Tree/README.md: -------------------------------------------------------------------------------- 1 | # Case Study #7: Balanced Tree Clothing Co.🌲
2 | [![View Main Folder](https://img.shields.io/badge/View-Main_Folder-F5788D.svg?logo=GitHub)](https://github.com/chanronnie/8WeekSQLChallenge) 3 | [![View Case Study 7](https://img.shields.io/badge/View-Case_Study_7-06816E)](https://8weeksqlchallenge.com/case-study-7/)
4 | ![7](https://github.com/chanronnie/8WeekSQLChallenge/assets/121308347/d355752c-2e17-4706-a6cf-f92310e84fa1) 5 | 6 | 7 | 8 | The case study presented here is part of the **8 Week SQL Challenge**.\ 9 | It is kindly brought to us by [**Data With Danny**](https://8weeksqlchallenge.com). 10 | 11 | This time, I am using `PostgreSQL queries` (instead of MySQL) in `Jupyter Notebook` to quickly view results, which provides me with an opportunity 12 | - to learn PostgreSQL 13 | - to utilize handy mathematical and string functions. 14 | 15 | 16 | 17 | ## Table of Contents 18 | * [Entity Relationship Diagram](#entity-relationship-diagram) 19 | * [Datasets](#datasets) 20 | * [Case Study Questions](#case-study-questions) 21 | * [Solutions](#solutions) 22 | * [PostgreSQL Topics Covered](#postgresql-topics-covered) 23 | 24 | ## Entity Relationship Diagram 25 | Again, I used the [DB Diagram tool](https://dbdiagram.io/home) to create the following ERD. 26 | ![balanced_tree](https://github.com/chanronnie/8WeekSQLChallenge/assets/121308347/9867786d-6078-42f8-abf5-3565768d2832) 27 | 28 | 29 | 30 |
31 | View code to create ERD 32 | 33 | Here is the code that I used for creating the ERD for all the Balanced Tree datasets on DB Diagram tool 34 | 35 | ```markdown 36 | TABLE product_hierarchy { 37 | "id" INTEGER 38 | "parent_id" INTEGER 39 | "level_text" VARCHAR(19) 40 | "level_name" VARCHAR(8) 41 | } 42 | 43 | TABLE product_prices { 44 | "id" INTEGER 45 | "product_id" VARCHAR(6) 46 | "price" INTEGER 47 | } 48 | 49 | TABLE product_details { 50 | "product_id" VARCHAR(6) 51 | "price" INTEGER 52 | "product_name" VARCHAR(32) 53 | "category_id" INTEGER 54 | "segment_id" INTEGER 55 | "style_id" INTEGER 56 | "category_name" VARCHAR(6) 57 | "segment_name" VARCHAR(6) 58 | "style_name" VARCHAR(19) 59 | } 60 | 61 | TABLE sales { 62 | "prod_id" VARCHAR(6) 63 | "qty" INTEGER 64 | "price" INTEGER 65 | "discount" INTEGER 66 | "member" BOOLEAN 67 | "txn_id" VARCHAR(6) 68 | "start_txn_time" TIMESTAMP 69 | } 70 | 71 | # Establish relationships between datasets 72 | REF: sales.prod_id > product_details.product_id 73 | REF: product_hierarchy.id - product_prices.id 74 | ``` 75 | 76 |
77 | 78 | 79 | ## Datasets 80 | Case Study #7 contains 4 datasets: 81 | 82 |
83 | View Table: product_details 84 | 85 | The `product_details` dataset lists the product information. 86 | 87 | product_id | price | product_name | category_id | segment_id | style_id | category_name | segment_name | style_name 88 | --- | --- | --- | --- | --- | --- | --- | --- | --- 89 | c4a632 | 13 | Navy Oversized Jeans - Womens | 1 | 3 | 7 | Womens | Jeans | Navy Oversized 90 | e83aa3 | 32 | Black Straight Jeans - Womens | 1 | 3 | 8 | Womens | Jeans | Black Straight 91 | e31d39 | 10 | Cream Relaxed Jeans - Womens | 1 | 3 | 9 | Womens | Jeans | Cream Relaxed 92 | d5e9a6 | 23 | Khaki Suit Jacket - Womens | 1 | 4 | 10 | Womens | Jacket | Khaki Suit 93 | 72f5d4 | 19 | Indigo Rain Jacket - Womens | 1 | 4 | 11 | Womens | Jacket | Indigo Rain 94 | 9ec847 | 54 | Grey Fashion Jacket - Womens | 1 | 4 | 12 | Womens | Jacket | Grey Fashion 95 | 5d267b | 40 | White Tee Shirt - Mens | 2 | 5 | 13 | Mens | Shirt | White Tee 96 | c8d436 | 10 | Teal Button Up Shirt - Mens | 2 | 5 | 14 | Mens | Shirt | Teal Button Up 97 | 2a2353 | 57 | Blue Polo Shirt - Mens | 2 | 5 | 15 | Mens | Shirt | Blue Polo 98 | f084eb | 36 | Navy Solid Socks - Mens | 2 | 6 | 16 | Mens | Socks | Navy Solid 99 | b9a74d | 17 | White Striped Socks - Mens | 2 | 6 | 17 | Mens | Socks | White Striped 100 | 2feb6b | 29 | Pink Fluro Polkadot Socks - Mens | 2 | 6 | 18 | Mens | Socks | Pink Fluro Polkadot 101 | 102 | 103 |
104 | 105 |
106 | View Table: sales 107 | 108 | The `sales` dataset lists product level information (transactions made for Balanced Tree Co.) 109 | 110 | prod_id | qty | price | discount | member | txn_id | start_txn_time 111 | --- | --- | --- | --- | --- | --- | --- 112 | c4a632 | 4 | 13 | 17 | t | 54f307 | 2021-02-13 01:59:43.296 113 | 5d267b | 4 | 40 | 17 | t | 54f307 | 2021-02-13 01:59:43.296 114 | b9a74d | 4 | 17 | 17 | t | 54f307 | 2021-02-13 01:59:43.296 115 | 2feb6b | 2 | 29 | 17 | t | 54f307 | 2021-02-13 01:59:43.296 116 | c4a632 | 5 | 13 | 21 | t | 26cc98 | 2021-01-19 01:39:00.3456 117 | e31d39 | 2 | 10 | 21 | t | 26cc98 | 2021-01-19 01:39:00.3456 118 | 72f5d4 | 3 | 19 | 21 | t | 26cc98 | 2021-01-19 01:39:00.3456 119 | 2a2353 | 3 | 57 | 21 | t | 26cc98 | 2021-01-19 01:39:00.3456 120 | f084eb | 3 | 36 | 21 | t | 26cc98 | 2021-01-19 01:39:00.3456 121 | c4a632 | 1 | 13 | 21 | f | ef648d | 2021-01-27 02:18:17.1648 122 | 123 |
124 | 125 | 126 |
127 | View Table: product_hierarchy 128 | 129 | The `product_hierarchy` dataset will be used to answer the **Bonus Question** to recreate the `product_details` dataset. 130 | 131 | id | parent_id | level_text | level_name 132 | --- | --- | --- | --- 133 | 1 | | Womens | Category 134 | 2 | | Mens | Category 135 | 3 | 1 | Jeans | Segment 136 | 4 | 1 | Jacket | Segment 137 | 5 | 2 | Shirt | Segment 138 | 6 | 2 | Socks | Segment 139 | 7 | 3 | Navy Oversized | Style 140 | 8 | 3 | Black Straight | Style 141 | 9 | 3 | Cream Relaxed | Style 142 | 10 | 4 | Khaki Suit | Style 143 | 11 | 4 | Indigo Rain | Style 144 | 12 | 4 | Grey Fashion | Style 145 | 13 | 5 | White Tee | Style 146 | 14 | 5 | Teal Button Up | Style 147 | 15 | 5 | Blue Polo | Style 148 | 16 | 6 | Navy Solid | Style 149 | 17 | 6 | White Striped | Style 150 | 18 | 6 | Pink Fluro Polkadot | Style 151 | 152 |
153 | 154 |
155 | View Table: product_prices 156 | 157 | The `product_prices` dataset will be used to answer the **Bonus Question** to recreate the `product_details` dataset. 158 | 159 | id | product_id | price 160 | --- | --- | --- 161 | 7 | c4a632 | 13 162 | 8 | e83aa3 | 32 163 | 9 | e31d39 | 10 164 | 10 | d5e9a6 | 23 165 | 11 | 72f5d4 | 19 166 | 12 | 9ec847 | 54 167 | 13 | 5d267b | 40 168 | 14 | c8d436 | 10 169 | 15 | 2a2353 | 57 170 | 16 | f084eb | 36 171 | 17 | b9a74d | 17 172 | 18 | 2feb6b | 29 173 | 174 |
175 | 176 | 177 | 178 | 179 | ## Case Study Questions 180 | Case Study #7 is categorized into 4 question groups\ 181 | To view the specific section, please open the link in a *`new tab`* or *`window`*.\ 182 | [A. High Level Sales Analysis](CaseStudy7_solutions.md#A)\ 183 | [B. Transaction Analysis](CaseStudy7_solutions.md#B)\ 184 | [C. Product Analysis](CaseStudy7_solutions.md#C)\ 185 | [D. Bonus Challenge](CaseStudy7_solutions.md#D) 186 | 187 | 188 | ## Solutions 189 | - View `balanced_tree` database: [**here**](CaseStudy7_schema.sql) 190 | - View Solution: 191 | - [**Markdown File**](CaseStudy7_solutions.md): offers a more fluid and responsive viewing experience 192 | - [**Jupyter Notebook**](CaseStudy7_solutions.ipynb): contains the original code 193 | 194 | 195 | ## PostgreSQL Topics Covered 196 | - Establish relationships between datasets to create an Entity-Relationship Diagram (ERD) 197 | - Common Table Expressions (CTE) 198 | - Window Functions 199 | - Subqueries 200 | - JOIN, SELF JOIN 201 | -------------------------------------------------------------------------------- /CaseStudy#8 - Fresh Segments/README.md: -------------------------------------------------------------------------------- 1 | # Case Study #8: Fresh Segments
2 | [![View Main Folder](https://img.shields.io/badge/View-Main_Folder-F5788D.svg?logo=GitHub)](https://github.com/chanronnie/8WeekSQLChallenge) 3 | [![View Case Study 8](https://img.shields.io/badge/View-Case_Study_8-07A092)](https://8weeksqlchallenge.com/case-study-8/)
4 | ![8](https://github.com/chanronnie/8WeekSQLChallenge/assets/121308347/4f340132-a725-4702-a865-808b623c467c) 5 | 6 | 7 | The case study presented here is part of the **8 Week SQL Challenge**.\ 8 | It is kindly brought to us by [**Data With Danny**](https://8weeksqlchallenge.com).\ 9 | This time, I am using `PostgreSQL queries` (instead of MySQL) in `Jupyter Notebook`. 10 | 11 | 12 | ## Table of Contents 13 | * [Entity Relationship Diagram](#entity-relationship-diagram) 14 | * [Datasets](#datasets) 15 | * [Case Study Questions](#case-study-questions) 16 | * [Solutions](#solutions) 17 | * [PostgreSQL Topics Covered](#postgresql-topics-covered) 18 | 19 | 20 | ## Entity Relationship Diagram 21 | Again, I used the [DB Diagram tool](https://dbdiagram.io/home) to create the following ERD. 22 | ![Fresh Segments](https://github.com/chanronnie/8WeekSQLChallenge/assets/121308347/1a9d92ec-c04b-44ed-9600-e98220789525) 23 | 24 | 25 |
26 | View code to create ERD 27 | 28 | Here is the code that I used for creating the ERD for the Fresh Segments datasets on DB Diagram tool 29 | 30 | ```markdown 31 | TABLE interest_metrics { 32 | "_month" VARCHAR(4) 33 | "_year" VARCHAR(4) 34 | "month_year" VARCHAR(7) 35 | "interest_id" VARCHAR(5) 36 | "composition" FLOAT 37 | "index_value" FLOAT 38 | "ranking" INTEGER 39 | "percentile_ranking" FLOAT 40 | } 41 | 42 | TABLE interest_map { 43 | "id" INTEGER 44 | "interest_name" TEXT 45 | "interest_summary" TEXT 46 | "created_at" TIMESTAMP 47 | "last_modified" TIMESTAMP 48 | } 49 | 50 | 51 | # Establish relationships between datasets 52 | REF: interest_metrics.interest_id > interest_map.id 53 | ``` 54 | 55 |
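One practical note implied by the ERD above: `interest_metrics.interest_id` is stored as `VARCHAR` while `interest_map.id` is an `INTEGER`, so joining the two tables typically requires a cast. A minimal sketch, assuming the `psycopg2` connection (`conn`) set up in the solution notebook:

```python
# Sketch: join the two Fresh Segments tables, casting interest_id to INTEGER.
pd.read_sql("""
SELECT im.interest_name, m.month_year, m.composition, m.ranking
FROM fresh_segments.interest_metrics m
JOIN fresh_segments.interest_map im ON m.interest_id::INTEGER = im.id
LIMIT 5;
""", conn)
```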
56 | 57 | 58 | ## Datasets 59 | Case Study #8 contains 2 datasets: 60 | 61 | ### `interest_metrics` 62 | 63 |
64 | View Table: interest_metrics 65 | 66 | The `interest_metrics` table presents aggregated interest metrics for a major client of Fresh Segments, reflecting the performance of interest_id based on clicks and interactions with targeted advertising content among their customer base. 67 | 68 | _month | _year | month_year | interest_id | composition | index_value | ranking | percentile_ranking 69 | --- | --- | --- | --- | --- | --- | --- | --- 70 | 7 | 2018 | 07-2018 | 32486 | 11.89 | 6.19 | 1 | 99.86 71 | 7 | 2018 | 07-2018 | 6106 | 9.93 | 5.31 | 2 | 99.73 72 | 7 | 2018 | 07-2018 | 18923 | 10.85 | 5.29 | 3 | 99.59 73 | 7 | 2018 | 07-2018 | 6344 | 10.32 | 5.1 | 4 | 99.45 74 | 7 | 2018 | 07-2018 | 100 | 10.77 | 5.04 | 5 | 99.31 75 | 7 | 2018 | 07-2018 | 69 | 10.82 | 5.03 | 6 | 99.18 76 | 7 | 2018 | 07-2018 | 79 | 11.21 | 4.97 | 7 | 99.04 77 | 7 | 2018 | 07-2018 | 6111 | 10.71 | 4.83 | 8 | 98.9 78 | 7 | 2018 | 07-2018 | 6214 | 9.71 | 4.83 | 8 | 98.9 79 | 7 | 2018 | 07-2018 | 19422 | 10.11 | 4.81 | 10 | 98.63 80 | 81 |
82 | 83 | ### `interest_map` 84 | 85 |
86 | View Table: interest_map 87 | 88 | The **`interest_map`** links the interest_id to the corresponding interest_name. 89 | 90 | id | interest_name | interest_summary | created_at | last_modified 91 | --- | --- | --- | --- | --- 92 | 1 | Fitness Enthusiasts | Consumers using fitness tracking apps and websites. | 2016-05-26 14:57:59 | 2018-05-23 11:30:12 93 | 2 | Gamers | Consumers researching game reviews and cheat codes. | 2016-05-26 14:57:59 | 2018-05-23 11:30:12 94 | 3 | Car Enthusiasts | Readers of automotive news and car reviews. | 2016-05-26 14:57:59 | 2018-05-23 11:30:12 95 | 4 | Luxury Retail Researchers | Consumers researching luxury product reviews and gift ideas. | 2016-05-26 14:57:59 | 2018-05-23 11:30:12 96 | 5 | Brides & Wedding Planners | People researching wedding ideas and vendors. | 2016-05-26 14:57:59 | 2018-05-23 11:30:12 97 | 6 | Vacation Planners | Consumers reading reviews of vacation destinations and accommodations. | 2016-05-26 14:57:59 | 2018-05-23 11:30:13 98 | 7 | Motorcycle Enthusiasts | Readers of motorcycle news and reviews. | 2016-05-26 14:57:59 | 2018-05-23 11:30:13 99 | 8 | Business News Readers | Readers of online business news content. | 2016-05-26 14:57:59 | 2018-05-23 11:30:12 100 | 12 | Thrift Store Shoppers | Consumers shopping online for clothing at thrift stores and researching locations. | 2016-05-26 14:57:59 | 2018-03-16 13:14:00 101 | 13 | Advertising Professionals | People who read advertising industry news. | 2016-05-26 14:57:59 | 2018-05-23 11:30:12 102 | 103 | 104 |
105 | 106 | 107 | ## Case Study Questions 108 | Case Study #8 is categorized into 4 question groups\ 109 | To view the specific section, please open the link in a *`new tab`* or *`window`*.\ 110 | [A. Data Exploration and Cleansing](CaseStudy8_solutions.md#A)\ 111 | [B. Interest Analysis](CaseStudy8_solutions.md#B)\ 112 | [C. Segment Analysis](CaseStudy8_solutions.md#C)\ 113 | [D. Index Analysis](CaseStudy8_solutions.md#D) 114 | 115 | 116 | ## Solutions 117 | - View `fresh_segments` database: [**here**](CaseStudy8_schema.sql) 118 | - View Solution: 119 | - [**Markdown File**](CaseStudy8_solutions.md): offers a more fluid and responsive viewing experience 120 | - [**Jupyter Notebook**](CaseStudy8_solutions.ipynb): contains the original code 121 | 122 | 123 | ## PostgreSQL Topics Covered 124 | - Data Cleaning 125 | - Common Table Expressions (CTE) 126 | - Window Functions (rolling average) 127 | - Subqueries 128 | - JOIN 129 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2023 Ronnie Chan 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # 8 Week SQL Challenge 2 | ![case](https://github.com/chanronnie/8WeekSQLChallenge/assets/121308347/80a28034-5afd-435c-83dc-0d9a6ffa100f) 3 | 4 | 5 | In my journey to switch careers into the field of Data, I am excited to undertake the [**#8WeekSQLChallenge**](https://8weeksqlchallenge.com/), kindly created by [Data with Danny](https://linktr.ee/datawithdanny) in order to enhance my SQL skills. 6 | This repository will store all the solutions for the case studies included in this challenge. 
7 | 8 | ## Case Studies 9 | Case Study | Topic | SQL | Status| 10 | | --- | --- | --- | --- | 11 | |📄 [**#1 Danny's Diner**](https://github.com/chanronnie/8WeekSQLChallenge/tree/main/CaseStudy%231%20-%20Danny's%20Diner) |Customer Analytics | MySQL | **Completed** ✔️| 12 | |📄 [**#2 Pizza Runner**](https://github.com/chanronnie/8WeekSQLChallenge/tree/main/CaseStudy%232%20-%20Pizza%20Runner) | Delivery Operations Analytics | MySQL| **Completed** ✔️| 13 | |📄 [**#3 Foodie-Fi**](https://github.com/chanronnie/8WeekSQLChallenge/tree/main/CaseStudy%233%20-%20Foodie-Fi)| Subscription Analytics | MySQL| **Completed** ✔️| 14 | |📄 [**#4 Data Bank**](https://github.com/chanronnie/8WeekSQLChallenge/tree/main/CaseStudy%234%20-%20Data%20Bank) | Financial Data Storage and Usage Analytics | PostgreSQL| **Completed** ✔️ | 15 | |📄 [**#5 Data Mart**](https://github.com/chanronnie/8WeekSQLChallenge/tree/main/CaseStudy%235%20-%20Data%20Mart) | Sales Performance Analytics | PostgreSQL | **Completed** ✔️ | 16 | |📄 [**#6 Clique Bait**](https://github.com/chanronnie/8WeekSQLChallenge/tree/main/CaseStudy%236%20-%20Clique%20Bait) | Digital Analytics| PostgreSQL | **Completed** ✔️ | 17 | |📄 [**#7 Balanced Tree**](https://github.com/chanronnie/8WeekSQLChallenge/tree/main/CaseStudy%237%20-%20Balanced%20Tree) | Sales Performance Analytics | PostgreSQL | **Completed** ✔️ | 18 | |📄 [**#8 Fresh Segments**](https://github.com/chanronnie/8WeekSQLChallenge/tree/main/CaseStudy%238%20-%20Fresh%20Segments) | Digital Marketing Analytics | PostgreSQL | **Completed** ✔️| 19 | 20 | ## Technologies 21 | ![MySQL](https://img.shields.io/badge/mysql-%2300f.svg?style=for-the-badge&logo=mysql&logoColor=white) 22 | ![Postgres](https://img.shields.io/badge/postgres-%23316192.svg?style=for-the-badge&logo=postgresql&logoColor=white) 23 | ![Jupyter Notebook](https://img.shields.io/badge/jupyter-%23FA0F00.svg?style=for-the-badge&logo=jupyter&logoColor=white) 24 | 25 | ## Installation 26 | 27 | For writing **MySQL** queries in Jupyter Notebook, we will need to install the `pymysql` library 28 | ``` 29 | pip install pymysql 30 | ``` 31 | 32 | For writing **PostgreSQL** queries in Jupyter Notebook, we will need to install the `psycopg2` library 33 | ``` 34 | pip install psycopg2 35 | ``` 36 | --------------------------------------------------------------------------------
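To complement the installation commands above, here is a minimal connection sketch showing how a notebook can query one of the PostgreSQL case-study schemas with `psycopg2` and `pandas`. The host, credentials, and database name are placeholders, and the example assumes the `clique_bait` schema has already been loaded from `CaseStudy6_schema.sql`.

```python
# Minimal sketch: connect to a local PostgreSQL instance and run a query in Jupyter.
# All connection parameters below are placeholders - adjust them to your own setup.
import pandas as pd
import psycopg2

conn = psycopg2.connect(
    host="localhost",
    user="postgres",
    password="your_password",
    dbname="your_database",
)

df = pd.read_sql("SELECT * FROM clique_bait.event_identifier;", conn)
print(df)

conn.close()
```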