├── CaseStudy#1 - Danny's Diner
│   ├── CaseStudy1_schema.sql
│   ├── CaseStudy1_solution.ipynb
│   └── README.md
├── CaseStudy#2 - Pizza Runner
│   ├── CaseStudy2_schema.sql
│   ├── CaseStudy2_solutions.ipynb
│   ├── CaseStudy2_solutions.md
│   └── README.md
├── CaseStudy#3 - Foodie-Fi
│   ├── CaseStudy3_schema.sql
│   ├── CaseStudy3_solutions.ipynb
│   ├── CaseStudy3_solutions.md
│   ├── README.md
│   └── images
│       ├── question10.png
│       ├── question2.png
│       ├── question3.png
│       ├── question6.png
│       └── question7.png
├── CaseStudy#4 - Data Bank
│   ├── CaseStudy4_schema.sql
│   ├── CaseStudy4_solutions.ipynb
│   ├── CaseStudy4_solutions.md
│   └── README.md
├── CaseStudy#5 - Data Mart
│   ├── CaseStudy5_schema.sql
│   ├── CaseStudy5_solutions.ipynb
│   ├── CaseStudy5_solutions.md
│   ├── README.md
│   └── image
│       └── plot.png
├── CaseStudy#6 - Clique Bait
│   ├── CaseStudy6_schema.sql
│   ├── CaseStudy6_solutions.ipynb
│   ├── CaseStudy6_solutions.md
│   ├── README.md
│   └── images
│       └── CliqueBait_ERD.png
├── CaseStudy#7 - Balanced Tree
│   ├── CaseStudy7_schema.sql
│   ├── CaseStudy7_solutions.ipynb
│   ├── CaseStudy7_solutions.md
│   └── README.md
├── CaseStudy#8 - Fresh Segments
│   ├── CaseStudy8_schema.sql
│   ├── CaseStudy8_solutions.ipynb
│   ├── CaseStudy8_solutions.md
│   └── README.md
├── LICENSE
└── README.md
/CaseStudy#1 - Danny's Diner/CaseStudy1_schema.sql:
--------------------------------------------------------------------------------
1 | ##########################################################
2 | # Case Study #1: Danny's Diner
3 | # This SQL script is a DDL file for creating tables.
4 | # Source: https://8weeksqlchallenge.com/case-study-1/
5 | ##########################################################
6 |
7 |
8 | -- Create Schema
9 | CREATE SCHEMA IF NOT EXISTS dannys_diner;
10 |
11 |
12 | -- Create and Populate "sales" table
13 | CREATE TABLE IF NOT EXISTS dannys_diner.sales (
14 | customer_id VARCHAR(1),
15 | order_date DATE,
16 | product_id INTEGER
17 | );
18 |
19 | INSERT INTO dannys_diner.sales
20 | (customer_id, order_date, product_id)
21 | VALUES
22 | ('A', '2021-01-01', '1'),
23 | ('A', '2021-01-01', '2'),
24 | ('A', '2021-01-07', '2'),
25 | ('A', '2021-01-10', '3'),
26 | ('A', '2021-01-11', '3'),
27 | ('A', '2021-01-11', '3'),
28 | ('B', '2021-01-01', '2'),
29 | ('B', '2021-01-02', '2'),
30 | ('B', '2021-01-04', '1'),
31 | ('B', '2021-01-11', '1'),
32 | ('B', '2021-01-16', '3'),
33 | ('B', '2021-02-01', '3'),
34 | ('C', '2021-01-01', '3'),
35 | ('C', '2021-01-01', '3'),
36 | ('C', '2021-01-07', '3');
37 |
38 |
39 | -- Create and Populate "menu" table
40 | CREATE TABLE IF NOT EXISTS dannys_diner.menu (
41 | product_id INTEGER,
42 | product_name VARCHAR(5),
43 | price INTEGER
44 | );
45 |
46 | INSERT INTO dannys_diner.menu
47 | (product_id, product_name, price)
48 | VALUES
49 | ('1', 'sushi', '10'),
50 | ('2', 'curry', '15'),
51 | ('3', 'ramen', '12');
52 |
53 |
54 | -- Create and Populate "members" table
55 | CREATE TABLE IF NOT EXISTS dannys_diner.members (
56 | customer_id VARCHAR(1),
57 | join_date DATE
58 | );
59 |
60 | INSERT INTO dannys_diner.members
61 | (customer_id, join_date)
62 | VALUES
63 | ('A', '2021-01-07'),
64 | ('B', '2021-01-09');
--------------------------------------------------------------------------------
/CaseStudy#1 - Danny's Diner/README.md:
--------------------------------------------------------------------------------
1 | # Case Study #1: Danny's Diner 🍥
2 | [8WeekSQLChallenge Repository](https://github.com/chanronnie/8WeekSQLChallenge)
3 | [Case Study #1 on 8weeksqlchallenge.com](https://8weeksqlchallenge.com/case-study-1/)
4 | 
5 |
6 | The case study presented here is part of the **8 Week SQL Challenge**.\
7 | It is kindly brought to us by [**Data With Danny**](https://8weeksqlchallenge.com).\
8 | I use `MySQL queries` in `Jupyter Notebook` to quickly view results.
9 |
10 |
11 |
12 | ## Table of Contents
13 | * [Entity Relationship Diagram](#entity-relationship-diagram)
14 | * [Datasets](#datasets)
15 | * [Case Study Questions](#case-study-questions)
16 | * [Solutions](#solutions)
17 |
18 |
19 | ## Entity Relationship Diagram
20 | 
21 |
22 | ## Datasets
23 | Case Study #1 contains 3 tables:
24 | - **sales**: This table captures all the order information (`order_date` and `product_id`) of each customer (`customer_id`).
25 | - **menu**: This table lists the IDs, names and prices of each menu item.
26 | - **members**: This table captures the dates (`join_date`) when each customer joined the member program.
27 |
28 | ## Case Study Questions
29 | 1. What is the total amount each customer spent at the restaurant?
30 | 2. How many days has each customer visited the restaurant?
31 | 3. What was the first item from the menu purchased by each customer?
32 | 4. What is the most purchased item on the menu and how many times was it purchased by all customers?
33 | 5. Which item was the most popular for each customer?
34 | 6. Which item was purchased first by the customer after they became a member?
35 | 7. Which item was purchased just before the customer became a member?
36 | 8. What is the total items and amount spent for each member before they became a member?
37 | 9. If each $1 spent equates to 10 points and sushi has a 2x points multiplier - how many points would each customer have?
38 | 10. In the first week after a customer joins the program (including their join date) they earn 2x points on all items, not just sushi - how many points do customer A and B have at the end of January?
39 |
40 |
41 | ## Solutions
42 | - View `dannys_diner` database: [**here**](CaseStudy1_schema.sql)
43 | - View Solution: [**here**](CaseStudy1_solution.ipynb)
44 |
45 |
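46 | For a quick flavour of the approach, here is a minimal sketch for question 9 (each $1 spent earns 10 points, and sushi earns a 2x multiplier), assuming the `dannys_diner` schema linked above; the full, tested queries are in the linked notebook.
47 | 
48 | ```sql
49 | -- Sketch only: points per customer, with sushi earning double points
50 | SELECT
51 |     s.customer_id,
52 |     SUM(CASE WHEN m.product_name = 'sushi' THEN m.price * 20
53 |              ELSE m.price * 10 END) AS total_points
54 | FROM dannys_diner.sales s
55 | JOIN dannys_diner.menu m ON s.product_id = m.product_id
56 | GROUP BY s.customer_id;
57 | ```
58 | 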
--------------------------------------------------------------------------------
/CaseStudy#2 - Pizza Runner/CaseStudy2_schema.sql:
--------------------------------------------------------------------------------
1 | ##########################################################
2 | # Case Study #2: Pizza Runner
3 | # This SQL script is a DDL file for creating tables.
4 | # Source: https://8weeksqlchallenge.com/case-study-2/
5 | ##########################################################
6 |
7 |
8 | -- Create Schema
9 | CREATE SCHEMA IF NOT EXISTS pizza_runner;
10 | USE pizza_runner;
11 |
12 |
13 | -- Create and Populate "runners" table
14 | DROP TABLE IF EXISTS runners;
15 | CREATE TABLE runners (
16 | runner_id INTEGER,
17 | registration_date DATE
18 | );
19 | INSERT INTO runners
20 | (runner_id, registration_date)
21 | VALUES
22 | (1, '2021-01-01'),
23 | (2, '2021-01-03'),
24 | (3, '2021-01-08'),
25 | (4, '2021-01-15');
26 |
27 |
28 | -- Create and Populate "customer_orders" table
29 | DROP TABLE IF EXISTS customer_orders;
30 | CREATE TABLE customer_orders (
31 | order_id INTEGER,
32 | customer_id INTEGER,
33 | pizza_id INTEGER,
34 | exclusions VARCHAR(4),
35 | extras VARCHAR(4),
36 | order_time TIMESTAMP
37 | );
38 |
39 | INSERT INTO customer_orders
40 | (order_id, customer_id, pizza_id, exclusions, extras, order_time)
41 | VALUES
42 | ('1', '101', '1', '', '', '2020-01-01 18:05:02'),
43 | ('2', '101', '1', '', '', '2020-01-01 19:00:52'),
44 | ('3', '102', '1', '', '', '2020-01-02 23:51:23'),
45 | ('3', '102', '2', '', NULL, '2020-01-02 23:51:23'),
46 | ('4', '103', '1', '4', '', '2020-01-04 13:23:46'),
47 | ('4', '103', '1', '4', '', '2020-01-04 13:23:46'),
48 | ('4', '103', '2', '4', '', '2020-01-04 13:23:46'),
49 | ('5', '104', '1', 'null', '1', '2020-01-08 21:00:29'),
50 | ('6', '101', '2', 'null', 'null', '2020-01-08 21:03:13'),
51 | ('7', '105', '2', 'null', '1', '2020-01-08 21:20:29'),
52 | ('8', '102', '1', 'null', 'null', '2020-01-09 23:54:33'),
53 | ('9', '103', '1', '4', '1, 5', '2020-01-10 11:22:59'),
54 | ('10', '104', '1', 'null', 'null', '2020-01-11 18:34:49'),
55 | ('10', '104', '1', '2, 6', '1, 4', '2020-01-11 18:34:49');
56 |
57 |
58 | -- Create and Populate "runner_orders" table
59 | DROP TABLE IF EXISTS runner_orders;
60 | CREATE TABLE runner_orders (
61 | order_id INTEGER,
62 | runner_id INTEGER,
63 | pickup_time VARCHAR(19),
64 | distance VARCHAR(7),
65 | duration VARCHAR(10),
66 | cancellation VARCHAR(23)
67 | );
68 |
69 | INSERT INTO runner_orders
70 | (order_id, runner_id, pickup_time, distance, duration, cancellation)
71 | VALUES
72 | ('1', '1', '2020-01-01 18:15:34', '20km', '32 minutes', ''),
73 | ('2', '1', '2020-01-01 19:10:54', '20km', '27 minutes', ''),
74 | ('3', '1', '2020-01-03 00:12:37', '13.4km', '20 mins', NULL),
75 | ('4', '2', '2020-01-04 13:53:03', '23.4', '40', NULL),
76 | ('5', '3', '2020-01-08 21:10:57', '10', '15', NULL),
77 | ('6', '3', 'null', 'null', 'null', 'Restaurant Cancellation'),
78 | ('7', '2', '2020-01-08 21:30:45', '25km', '25mins', 'null'),
79 | ('8', '2', '2020-01-10 00:15:02', '23.4 km', '15 minute', 'null'),
80 | ('9', '2', 'null', 'null', 'null', 'Customer Cancellation'),
81 | ('10', '1', '2020-01-11 18:50:20', '10km', '10minutes', 'null');
82 |
83 |
84 | -- Create and Populate "pizza_names" table
85 | DROP TABLE IF EXISTS pizza_names;
86 | CREATE TABLE pizza_names (
87 | pizza_id INTEGER,
88 | pizza_name TEXT
89 | );
90 | INSERT INTO pizza_names
91 | (pizza_id, pizza_name)
92 | VALUES
93 | (1, 'Meatlovers'),
94 | (2, 'Vegetarian');
95 |
96 |
97 | -- Create and Populate "pizza_recipes" table
98 | DROP TABLE IF EXISTS pizza_recipes;
99 | CREATE TABLE pizza_recipes (
100 | pizza_id INTEGER,
101 | toppings TEXT
102 | );
103 | INSERT INTO pizza_recipes
104 | (pizza_id, toppings)
105 | VALUES
106 | (1, '1, 2, 3, 4, 5, 6, 8, 10'),
107 | (2, '4, 6, 7, 9, 11, 12');
108 |
109 |
110 | -- Create and Populate "pizza_toppings" table
111 | DROP TABLE IF EXISTS pizza_toppings;
112 | CREATE TABLE pizza_toppings (
113 | topping_id INTEGER,
114 | topping_name TEXT
115 | );
116 | INSERT INTO pizza_toppings
117 | (topping_id, topping_name)
118 | VALUES
119 | (1, 'Bacon'),
120 | (2, 'BBQ Sauce'),
121 | (3, 'Beef'),
122 | (4, 'Cheese'),
123 | (5, 'Chicken'),
124 | (6, 'Mushrooms'),
125 | (7, 'Onions'),
126 | (8, 'Pepperoni'),
127 | (9, 'Peppers'),
128 | (10, 'Salami'),
129 | (11, 'Tomatoes'),
130 | (12, 'Tomato Sauce');
--------------------------------------------------------------------------------
/CaseStudy#2 - Pizza Runner/README.md:
--------------------------------------------------------------------------------
1 | # Case Study #2: Pizza Runner 🍕
2 | [8WeekSQLChallenge Repository](https://github.com/chanronnie/8WeekSQLChallenge)
3 | [Case Study #2 on 8weeksqlchallenge.com](https://8weeksqlchallenge.com/case-study-2/)
4 | 
5 |
6 |
7 | The case study presented here is part of the **8 Week SQL Challenge**.\
8 | It is kindly brought to us by [**Data With Danny**](https://8weeksqlchallenge.com).\
9 | I use `MySQL queries` in `Jupyter Notebook` to quickly view results.
10 |
11 |
12 | ## Table of Contents
13 | * [Entity Relationship Diagram](#entity-relationship-diagram)
14 | * [Datasets](#datasets)
15 | * [Case Study Questions](#case-study-questions)
16 | * [Solutions](#solutions)
17 | * [MySQL Topics Covered](#mysql-topics-covered)
18 |
19 | ## Entity Relationship Diagram
20 | 
21 |
22 |
23 | ## Datasets
24 | Case Study #2 contains 6 tables:
25 | - **customer_orders**: This table captures all the pizza order information (pizza, exclusions, extras and order time) for each customer (`customer_id`).
26 | - **runner_orders**: This table lists the delivery and runner information for each order.
27 | - **runners**: This table lists the runner IDs and their registration dates.
28 | - **pizza_names**: This table maps each pizza_id to its corresponding pizza name.
29 | - **pizza_recipes**: This table maps the toppings used for each pizza at Pizza Runner.
30 | - **pizza_toppings**: This table lists the toppings used at Pizza Runner.
31 |
32 | ## Case Study Questions
33 | Case Study #2 is categorized into five question groups.\
34 | To view the specific section, please open the link in a *`new tab`* or *`window`*.\
35 | [A. Pizza Metrics](CaseStudy2_solutions.md#A)\
36 | [B. Runner and Customer Experience](CaseStudy2_solutions.md#B)\
37 | [C. Ingredient Optimisation](CaseStudy2_solutions.md#C)\
38 | [D. Pricing and Ratings](CaseStudy2_solutions.md#D)\
39 | [E. Bonus Questions](CaseStudy2_solutions.md#E)
40 |
41 |
42 | ## Solutions
43 | - View `pizza_runner` database: [here](https://github.com/chanronnie/8WeekSQLChallenge/blob/main/CaseStudy%232%20-%20Pizza%20Runner/CaseStudy2_schema.sql)
44 | - View Solution:
45 | - [Markdown File](CaseStudy2_solutions.md): offers a more fluid and responsive viewing experience
46 | - [Jupyter Notebook](CaseStudy2_solutions.ipynb): contains the original code
47 |
48 | ## MySQL Topics Covered
49 | - Data Cleaning
50 | - Common Table Expressions (CTE)
51 | - Temporary Tables
52 | - Window Functions
53 | - Subqueries
54 | - JOIN, UNION ALL
55 | - String and Time Data Manipulation
56 |
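57 | As an illustration of the Data Cleaning topic above: in the raw `customer_orders` table (see the schema linked above), the `exclusions` and `extras` columns mix empty strings, the literal string 'null' and true NULLs. Here is a sketch of one possible clean-up using a temporary table; the notebook's actual cleaning query may differ.
58 | 
59 | ```sql
60 | -- Sketch only: standardise '', 'null' and NULL into real NULLs
61 | CREATE TEMPORARY TABLE customer_orders_clean AS
62 | SELECT
63 |     order_id,
64 |     customer_id,
65 |     pizza_id,
66 |     CASE WHEN exclusions IS NULL OR exclusions IN ('', 'null') THEN NULL ELSE exclusions END AS exclusions,
67 |     CASE WHEN extras IS NULL OR extras IN ('', 'null') THEN NULL ELSE extras END AS extras,
68 |     order_time
69 | FROM pizza_runner.customer_orders;
70 | ```
71 | 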
--------------------------------------------------------------------------------
/CaseStudy#3 - Foodie-Fi/README.md:
--------------------------------------------------------------------------------
1 | # Case Study #3: Foodie-Fi 🥑
2 | [8WeekSQLChallenge Repository](https://github.com/chanronnie/8WeekSQLChallenge)
3 | [Case Study #3 on 8weeksqlchallenge.com](https://8weeksqlchallenge.com/case-study-3/)
4 | 
5 |
6 |
7 |
8 | The case study presented here is part of the **8 Week SQL Challenge**.\
9 | It is kindly brought to us by [**Data With Danny**](https://8weeksqlchallenge.com).\
10 | I use `MySQL queries` in `Jupyter Notebook` to quickly view results.
11 |
12 |
13 | ## Table of Contents
14 | * [Entity Relationship Diagram](#entity-relationship-diagram)
15 | * [Datasets](#datasets)
16 | * [Case Study Questions](#case-study-questions)
17 | * [Solutions](#solutions)
18 | * [MySQL Topics Covered](#mysql-topics-covered)
19 |
20 | ## Entity Relationship Diagram
21 | 
22 |
23 |
24 |
25 | ## Datasets
26 | Case Study #3 contains 2 tables:
27 | - **subscriptions**: This table captures all the plan subscription information of each customer at Foodie-Fi.
28 | - **plans**: This table lists the available plans and their prices at Foodie-Fi.
29 |
30 | ## Case Study Questions
31 | Case Study #3 is categorized into 3 question groups.\
32 | To view the specific section, please open the link in a *`new tab`* or *`window`*.\
33 | [A. Customer Journey](CaseStudy3_solutions.md#A)\
34 | [B. Data Analysis Questions](CaseStudy3_solutions.md#B)\
35 | [C. Challenge Payment Question](CaseStudy3_solutions.md#C)
36 |
37 | ## Solutions
38 | - View `foodie_fi` database: [**here**](CaseStudy3_schema.sql)
39 | - View Solution:
40 | - [**Markdown File**](CaseStudy3_solutions.md): offers a more fluid and responsive viewing experience
41 | - [**Jupyter Notebook**](CaseStudy3_solutions.ipynb): contains the original code
42 |
43 | ## MySQL Topics Covered
44 | - Common Table Expressions (CTE)
45 | - Temporary Tables
46 | - Window Functions
47 | - Subqueries
48 | - JOIN, UNION ALL
49 |
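50 | As a small illustration of the window-function work listed above: pairing each subscription with the customer's next plan is a typical step when tracing the customer journey. A sketch, assuming the `subscriptions` table has `customer_id`, `plan_id` and `start_date` columns and `plans` maps `plan_id` to `plan_name` (see the schema linked above):
51 | 
52 | ```sql
53 | -- Sketch only: each plan a customer takes, together with the plan they switch to next
54 | SELECT
55 |     s.customer_id,
56 |     p.plan_name,
57 |     s.start_date,
58 |     LEAD(p.plan_name) OVER (PARTITION BY s.customer_id ORDER BY s.start_date) AS next_plan
59 | FROM foodie_fi.subscriptions s
60 | JOIN foodie_fi.plans p ON s.plan_id = p.plan_id;
61 | ```
62 | 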
--------------------------------------------------------------------------------
/CaseStudy#3 - Foodie-Fi/images/question10.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/chanronnie/8WeekSQLChallenge/4b36f745d833a82a61a9ce70ab177462eb7dbfd0/CaseStudy#3 - Foodie-Fi/images/question10.png
--------------------------------------------------------------------------------
/CaseStudy#3 - Foodie-Fi/images/question2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/chanronnie/8WeekSQLChallenge/4b36f745d833a82a61a9ce70ab177462eb7dbfd0/CaseStudy#3 - Foodie-Fi/images/question2.png
--------------------------------------------------------------------------------
/CaseStudy#3 - Foodie-Fi/images/question3.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/chanronnie/8WeekSQLChallenge/4b36f745d833a82a61a9ce70ab177462eb7dbfd0/CaseStudy#3 - Foodie-Fi/images/question3.png
--------------------------------------------------------------------------------
/CaseStudy#3 - Foodie-Fi/images/question6.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/chanronnie/8WeekSQLChallenge/4b36f745d833a82a61a9ce70ab177462eb7dbfd0/CaseStudy#3 - Foodie-Fi/images/question6.png
--------------------------------------------------------------------------------
/CaseStudy#3 - Foodie-Fi/images/question7.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/chanronnie/8WeekSQLChallenge/4b36f745d833a82a61a9ce70ab177462eb7dbfd0/CaseStudy#3 - Foodie-Fi/images/question7.png
--------------------------------------------------------------------------------
/CaseStudy#4 - Data Bank/CaseStudy4_solutions.md:
--------------------------------------------------------------------------------
1 | # Case Study #4: Data Bank
2 | The case study questions presented here are created by [**Data With Danny**](https://linktr.ee/datawithdanny). They are part of the [**8 Week SQL Challenge**](https://8weeksqlchallenge.com/).
3 |
4 | My SQL queries are written in the `PostgreSQL 15` dialect, integrated into `Jupyter Notebook`, which allows us to instantly view the query results and document the queries.
5 |
6 | For more details about the **Case Study #4**, click [**here**](https://8weeksqlchallenge.com/case-study-4/).
7 |
8 | ## Table of Contents
9 |
10 | ### [1. Importing Libraries](#Import)
11 |
12 | ### [2. Tables of the Database](#Tables)
13 |
14 | ### [3. Case Study Questions](#CaseStudyQuestions)
15 |
16 | - [A. Customer Nodes Exploration](#A)
17 | - [B. Customer Transactions](#B)
18 | - [C. Data Allocation Challenge](#C)
19 |
20 |
21 |
22 | ## 1. Importing required Libraries
23 |
24 |
25 | ```python
26 | import psycopg2 as pg2
27 | import pandas as pd
28 | import os
29 | import warnings
30 |
31 | warnings.filterwarnings('ignore')
32 | ```
33 |
34 |
35 |
36 | ## 2. Tables of the Database
37 |
38 | ### Connecting to the PostgreSQL database through Jupyter Notebook
39 |
40 |
41 | ```python
42 | # Get the PostgreSQL password
43 | mypassword = os.getenv("POSTGRESQL_PASSWORD")
44 |
45 | # Connect SQL database
46 | conn = pg2.connect(user = 'postgres', password = mypassword, database = 'data_bank')
47 | cursor = conn.cursor()
48 | ```
49 |
50 | Now, let's list the table names of the `data_bank` database.
51 |
52 |
53 | ```python
54 | query_ShowTables = """
55 | SELECT
56 | table_schema,
57 | table_name
58 | FROM information_schema.tables
59 | WHERE table_schema = 'data_bank';
60 | """
61 | cursor.execute(query_ShowTables)
62 |
63 | print('--- Tables within "data_bank" database --- ')
64 | for table in cursor:
65 | print(table[1])
66 | ```
67 |
68 | --- Tables within "data_bank" database ---
69 | regions
70 | customer_nodes
71 | customer_transactions
72 |
73 |
74 | The following are the 3 tables within the `data_bank` database. Please click [**here**](https://8weeksqlchallenge.com/case-study-4/) to get more insights about the tables.
75 |
76 |
77 | ```python
78 | cursor.execute(query_ShowTables)
79 | for table in cursor:
80 | print("Table: ", table[1])
81 | query = "SELECT * FROM " + table[0] + '.' + table[1]
82 | df = pd.read_sql(query, conn)
83 | display(df)
84 | ```
85 |
86 | Table: regions
87 | 
88 | |   | region_id | region_name |
89 | |---|---|---|
90 | | 0 | 1 | Australia |
91 | | 1 | 2 | America |
92 | | 2 | 3 | Africa |
93 | | 3 | 4 | Asia |
94 | | 4 | 5 | Europe |
95 | 
127 | Table: customer_nodes
128 | 
129 | |   | customer_id | region_id | node_id | start_date | end_date |
130 | |---|---|---|---|---|---|
131 | | 0 | 1 | 3 | 4 | 2020-01-02 | 2020-01-03 |
132 | | 1 | 2 | 3 | 5 | 2020-01-03 | 2020-01-17 |
133 | | 2 | 3 | 5 | 4 | 2020-01-27 | 2020-02-18 |
134 | | 3 | 4 | 5 | 4 | 2020-01-07 | 2020-01-19 |
135 | | 4 | 5 | 3 | 3 | 2020-01-15 | 2020-01-23 |
136 | | ... | ... | ... | ... | ... | ... |
137 | | 3495 | 496 | 3 | 4 | 2020-02-25 | 9999-12-31 |
138 | | 3496 | 497 | 5 | 4 | 2020-05-27 | 9999-12-31 |
139 | | 3497 | 498 | 1 | 2 | 2020-04-05 | 9999-12-31 |
140 | | 3498 | 499 | 5 | 1 | 2020-02-03 | 9999-12-31 |
141 | | 3499 | 500 | 2 | 2 | 2020-04-15 | 9999-12-31 |
142 | 
231 | 3500 rows × 5 columns
232 |
233 |
234 |
235 | Table: customer_transactions
236 | 
237 | |   | customer_id | txn_date | txn_type | txn_amount |
238 | |---|---|---|---|---|
239 | | 0 | 429 | 2020-01-21 | deposit | 82 |
240 | | 1 | 155 | 2020-01-10 | deposit | 712 |
241 | | 2 | 398 | 2020-01-01 | deposit | 196 |
242 | | 3 | 255 | 2020-01-14 | deposit | 563 |
243 | | 4 | 185 | 2020-01-29 | deposit | 626 |
244 | | ... | ... | ... | ... | ... |
245 | | 5863 | 189 | 2020-02-03 | withdrawal | 870 |
246 | | 5864 | 189 | 2020-03-22 | purchase | 718 |
247 | | 5865 | 189 | 2020-02-06 | purchase | 393 |
248 | | 5866 | 189 | 2020-01-22 | deposit | 302 |
249 | | 5867 | 189 | 2020-01-27 | withdrawal | 861 |
250 | 
327 | 5868 rows × 4 columns
328 |
329 |
330 |
331 |
332 |
333 | ## Case Study Questions
334 |
335 |
336 | ## A. Customer Nodes Exploration
337 |
338 | #### 1. How many unique nodes are there on the Data Bank system?
339 |
340 |
341 | ```python
342 | pd.read_sql("""
343 | SELECT COUNT(DISTINCT node_id) AS nodes_count
344 | FROM data_bank.customer_nodes
345 | """, conn)
346 | ```
347 |
348 | 
349 | |   | nodes_count |
350 | |---|---|
351 | | 0 | 5 |
352 | 
366 | **Result (Attempt #1)**\
367 | In the Data Bank system, there are 5 unique nodes to which customers will be randomly reallocated.
368 |
369 | However, if we interpret the question differently, we may understand that nodes are unique for each region, and customers are randomly distributed across the nodes *according to their region*. The following query shows the possible unique nodes in each region.
370 |
371 | **Here is the attempt #2**
372 |
373 |
374 | ```python
375 | pd.read_sql("""
376 | SELECT
377 | cn.region_id,
378 | r.region_name,
379 | STRING_AGG(DISTINCT cn.node_id::VARCHAR(1), ', ') AS nodes
380 | FROM data_bank.customer_nodes cn
381 | JOIN data_bank.regions r ON cn.region_id = r.region_id
382 | GROUP BY cn.region_id, r.region_name
383 | """, conn)
384 | ```
385 |
386 | 
387 | |   | region_id | region_name | nodes |
388 | |---|---|---|---|
389 | | 0 | 1 | Australia | 1, 2, 3, 4, 5 |
390 | | 1 | 2 | America | 1, 2, 3, 4, 5 |
391 | | 2 | 3 | Africa | 1, 2, 3, 4, 5 |
392 | | 3 | 4 | Asia | 1, 2, 3, 4, 5 |
393 | | 4 | 5 | Europe | 1, 2, 3, 4, 5 |
394 | 
431 |
432 |
433 | ```python
434 | pd.read_sql("""
435 | SELECT SUM(nb_nodes)::INTEGER AS total_nodes
436 | FROM
437 | (
438 | -- Find the number of unique nodes per region
439 |
440 | SELECT region_id, COUNT(DISTINCT node_id) AS nb_nodes
441 | FROM data_bank.customer_nodes
442 | GROUP BY region_id
443 | ) n
444 | """, conn)
445 | ```
446 |
447 | 
448 | |   | total_nodes |
449 | |---|---|
450 | | 0 | 25 |
451 | 
465 | **Result (Attempt #2)**\
466 | Hence, there are 25 unique nodes in the Data Bank system across the world.
467 |
468 | ___
469 | #### 2. What is the number of nodes per region?
470 |
471 |
472 | ```python
473 | pd.read_sql("""
474 | SELECT
475 | r.region_name,
476 | COUNT(DISTINCT node_id) AS nb_nodes
477 | FROM data_bank.customer_nodes cn
478 | JOIN data_bank.regions r ON cn.region_id = r.region_id
479 | GROUP BY r.region_name;
480 | """, conn)
481 | ```
482 |
483 | 
484 | |   | region_name | nb_nodes |
485 | |---|---|---|
486 | | 0 | Africa | 5 |
487 | | 1 | America | 5 |
488 | | 2 | Asia | 5 |
489 | | 3 | Australia | 5 |
490 | | 4 | Europe | 5 |
491 | 
522 |
523 | **Result**\
524 | There are 5 nodes per region.
525 |
526 | ___
527 | #### 3. How many customers are allocated to each region?
528 |
529 |
530 | ```python
531 | pd.read_sql("""
532 | SELECT
533 | r.region_name AS region,
534 | COUNT(DISTINCT customer_id) AS nb_customers
535 | FROM data_bank.customer_nodes cn
536 | JOIN data_bank.regions r ON cn.region_id = r.region_id
537 | GROUP BY r.region_name
538 | ORDER BY nb_customers DESC;
539 | """, conn)
540 | ```
541 |
542 | 
543 | |   | region | nb_customers |
544 | |---|---|---|
545 | | 0 | Australia | 110 |
546 | | 1 | America | 105 |
547 | | 2 | Africa | 102 |
548 | | 3 | Asia | 95 |
549 | | 4 | Europe | 88 |
550 | 
581 |
582 | **Result**
583 | - The majority of Data Bank customers are from Australia, totaling 110 customers.
584 | - Following Australia, America and Africa are the second and third regions with the highest number of Data Bank customers.
585 | - Asia has a total of 95 Data Bank customers.
586 | - Data Bank does not seem to attract many customers from Europe, with only 88 clients.
587 |
588 | ___
589 | #### 4. How many days on average are customers reallocated to a different node?
590 |
591 |
592 | ```python
593 | pd.read_sql("""
594 | SELECT ROUND(AVG(end_date::date - start_date::date), 1) AS avg_reallocation_days
595 | FROM data_bank.customer_nodes
596 | WHERE end_date != '9999-12-31'
597 | """, conn)
598 | ```
599 |
600 | 
601 | |   | avg_reallocation_days |
602 | |---|---|
603 | | 0 | 14.6 |
604 | 
618 | **Result (Attempt #1)**\
619 | Customers are reallocated to a different node every 14.6 days on average.
620 |
621 | However, if we examine `customer_id` 7, we can observe two instances where the customer was reallocated to the same node (nodes 4 and 2). For customer 7, it took
622 | - 21 days to be reallocated from node 4 to node 2
623 | - 34 days to be reallocated from node 2 to node 4
624 |
625 |
626 | ```python
627 | pd.read_sql("""
628 | SELECT *
629 | FROM data_bank.customer_nodes
630 | WHERE customer_id = 7
631 | """, conn)
632 | ```
633 |
634 | 
635 | |   | customer_id | region_id | node_id | start_date | end_date |
636 | |---|---|---|---|---|---|
637 | | 0 | 7 | 2 | 5 | 2020-01-20 | 2020-02-04 |
638 | | 1 | 7 | 2 | 4 | 2020-02-05 | 2020-02-20 |
639 | | 2 | 7 | 2 | 4 | 2020-02-21 | 2020-02-26 |
640 | | 3 | 7 | 2 | 2 | 2020-02-27 | 2020-03-05 |
641 | | 4 | 7 | 2 | 2 | 2020-03-06 | 2020-04-01 |
642 | | 5 | 7 | 2 | 4 | 2020-04-02 | 2020-04-07 |
643 | | 6 | 7 | 2 | 5 | 2020-04-08 | 9999-12-31 |
644 | 
708 | **Table 1**: Customer 7 reallocated to node no 4 twice
709 |
710 | node_id | start_date | end_date | nb_reallocation_days
711 | --- | --- | --- | ---
712 | 4 | 2020-02-05 | 2020-02-20 | --
713 | 4 | 2020-02-21 | 2020-02-26 | 21
714 |
715 | **Table 2**: Customer 7 reallocated to node no 2 twice
716 |
717 | node_id | start_date | end_date | nb_reallocation_days
718 | --- | --- | --- | ---
719 | 2 | 2020-02-27 | 2020-03-05 | --
720 | 2 | 2020-03-06 | 2020-04-01 | 34
721 |
722 | Hence, here is the query for the `reallocation_cte` that will be used later to answer the question. It counts the number of days before each customer is reallocated to a different node, treating consecutive stints on the same node as a single allocation period.
723 |
724 |
725 | ```python
726 | # Let's see for customer ID 7
727 |
728 | pd.read_sql("""
729 | SELECT *,
730 | CASE
731 | WHEN LEAD(node_id) OVER w = node_id THEN NULL
732 | WHEN LAG(node_id) OVER w = node_id THEN end_date::date - LAG(start_date) OVER w::date
733 | ELSE end_date::date - start_date::date
734 | END AS nb_days
735 | FROM data_bank.customer_nodes
736 | WHERE end_date != '9999-12-31' AND customer_id = 7
737 | WINDOW w AS (PARTITION BY customer_id ORDER BY start_date)
738 | """, conn)
739 | ```
740 |
741 | 
742 | |   | customer_id | region_id | node_id | start_date | end_date | nb_days |
743 | |---|---|---|---|---|---|---|
744 | | 0 | 7 | 2 | 5 | 2020-01-20 | 2020-02-04 | 15.0 |
745 | | 1 | 7 | 2 | 4 | 2020-02-05 | 2020-02-20 | NaN |
746 | | 2 | 7 | 2 | 4 | 2020-02-21 | 2020-02-26 | 21.0 |
747 | | 3 | 7 | 2 | 2 | 2020-02-27 | 2020-03-05 | NaN |
748 | | 4 | 7 | 2 | 2 | 2020-03-06 | 2020-04-01 | 34.0 |
749 | | 5 | 7 | 2 | 4 | 2020-04-02 | 2020-04-07 | 5.0 |
750 | 
814 | Now, let's utilize the aforementioned query as the `reallocation_cte`, to provide a comprehensive answer.
815 |
816 |
817 | ```python
818 | pd.read_sql("""
819 | WITH reallocation_cte AS
820 | (
821 | -- Find the number of reallocation days
822 | SELECT *,
823 | CASE
824 | WHEN LEAD(node_id) OVER w = node_id THEN NULL
825 | WHEN LAG(node_id) OVER w = node_id THEN end_date::date - LAG(start_date) OVER w::date
826 | ELSE end_date::date - start_date::date
827 | END AS nb_reallocation_days
828 | FROM data_bank.customer_nodes
829 | WHERE end_date != '9999-12-31'
830 | WINDOW w AS (PARTITION BY customer_id ORDER BY start_date)
831 | )
832 | SELECT ROUND(AVG(nb_reallocation_days),1) AS avg_reallocation_days
833 | FROM reallocation_cte;
834 | """, conn)
835 | ```
836 |
837 | 
838 | |   | avg_reallocation_days |
839 | |---|---|
840 | | 0 | 17.3 |
841 | 
855 | **Result (Attempt #2)**\
856 | Customers are reallocated to a different node every 17.3 days on average.
857 |
858 | ___
859 | #### 5. What is the median, 80th and 95th percentile for this same reallocation days metric for each region?
860 |
861 | Since I have previously addressed the reallocation days problem (see question #4 above) using two different approaches, I will also answer this question in two attempts. Please note that questions 4 and 5 are rather vague.
862 |
863 | **Result (Attempt #1)**
864 |
865 |
866 | ```python
867 | pd.read_sql("""
868 | SELECT
869 | region_name,
870 | PERCENTILE_DISC(0.5) WITHIN GROUP(ORDER BY reallocation_days) as median,
871 | PERCENTILE_DISC(0.8) WITHIN GROUP(ORDER BY reallocation_days) as percentile_80th,
872 | PERCENTILE_DISC(0.95) WITHIN GROUP(ORDER BY reallocation_days) as percentile_95th
873 | FROM
874 | (
875 | SELECT
876 | n.node_id,
877 | n.start_date,
878 | n.end_date,
879 | r.region_name,
880 | n.end_date::date - n.start_date::date AS reallocation_days
881 | FROM data_bank.customer_nodes n
882 | JOIN data_bank.regions r ON n.region_id = r.region_id
883 | WHERE n.end_date != '9999-12-31'
884 | ) re
885 | GROUP BY region_name;
886 | """, conn)
887 | ```
888 |
889 | 
890 | |   | region_name | median | percentile_80th | percentile_95th |
891 | |---|---|---|---|---|
892 | | 0 | Africa | 15 | 24 | 28 |
893 | | 1 | America | 15 | 23 | 28 |
894 | | 2 | Asia | 15 | 23 | 28 |
895 | | 3 | Australia | 15 | 23 | 28 |
896 | | 4 | Europe | 15 | 24 | 28 |
897 | 
941 | **Result (Attempt #2)**
942 |
943 |
944 | ```python
945 | pd.read_sql("""
946 | WITH reallocation_cte AS
947 | (
948 | SELECT *,
949 | CASE
950 | WHEN LEAD(node_id) OVER w = node_id THEN NULL
951 | WHEN LAG(node_id) OVER w = node_id THEN end_date::date - LAG(start_date) OVER w::date
952 | ELSE end_date::date - start_date::date
953 | END AS nb_reallocation_days
954 | FROM data_bank.customer_nodes
955 | WHERE end_date != '9999-12-31'
956 | WINDOW w AS (PARTITION BY customer_id ORDER BY start_date)
957 | )
958 | SELECT
959 | r.region_name,
960 | PERCENTILE_DISC(0.5) WITHIN GROUP(ORDER BY cte.nb_reallocation_days) as median,
961 | PERCENTILE_DISC(0.8) WITHIN GROUP(ORDER BY cte.nb_reallocation_days) as percentile_80th,
962 | PERCENTILE_DISC(0.95) WITHIN GROUP(ORDER BY cte.nb_reallocation_days) as percentile_95th
963 | FROM reallocation_cte cte
964 | JOIN data_bank.regions r ON cte.region_id = r.region_id
965 | GROUP BY r.region_name;
966 | """, conn)
967 | ```
968 |
969 | 
970 | |   | region_name | median | percentile_80th | percentile_95th |
971 | |---|---|---|---|---|
972 | | 0 | Africa | 17 | 27 | 36 |
973 | | 1 | America | 17 | 26 | 36 |
974 | | 2 | Asia | 17 | 25 | 34 |
975 | | 3 | Australia | 17 | 26 | 36 |
976 | | 4 | Europe | 18 | 27 | 37 |
977 | 
1021 | ___
1022 |
1023 | ## B. Customer Transactions
1024 | #### 1. What is the unique count and total amount for each transaction type?
1025 |
1026 |
1027 | ```python
1028 | pd.read_sql("""
1029 | SELECT
1030 | txn_type AS transaction_type,
1031 | to_char(COUNT(txn_type), 'FM 999,999') AS count,
1032 | to_char(SUM(txn_amount), 'FM$ 999,999,999.99') AS total_amount
1033 | FROM data_bank.customer_transactions
1034 | GROUP BY txn_type
1035 | """, conn)
1036 | ```
1037 |
1038 | 
1039 | |   | transaction_type | count | total_amount |
1040 | |---|---|---|---|
1041 | | 0 | purchase | 1,617 | $ 806,537. |
1042 | | 1 | withdrawal | 1,580 | $ 793,003. |
1043 | | 2 | deposit | 2,671 | $ 1,359,168. |
1044 | 
1071 |
1072 | **Result**
1073 | - The most frequently used transaction type at Data Bank is **deposit**, with a total of 2,671 deposits amounting to $1,359,168.
1074 | - There is a total of 1,617 **purchases** made at Data Bank, totalling $806,537.
1075 | - There is a total of 1,580 **withdrawals** made at Data Bank, totalling $793,003.
1076 |
1077 | ___
1078 | #### 2. What is the average total historical deposit counts and amounts for all customers?
1079 |
1080 |
1081 | ```python
1082 | pd.read_sql("""
1083 | SELECT
1084 | AVG(deposit_count)::INTEGER As nb_deposit,
1085 | to_char(AVG(avg_deposit_amount), 'FM$ 999,999.99') AS avg_deposit_amount
1086 | FROM
1087 | (
1088 |     -- Find the average count and amount of deposits made by each customer
1089 | SELECT
1090 | customer_id,
1091 | COUNT(txn_type) as deposit_count,
1092 | AVG(txn_amount) AS avg_deposit_amount
1093 | FROM data_bank.customer_transactions
1094 | WHERE txn_type = 'deposit'
1095 | GROUP BY customer_id
1096 | ORDER BY customer_id
1097 | ) d
1098 | """, conn)
1099 | ```
1100 |
1101 | 
1102 | |   | nb_deposit | avg_deposit_amount |
1103 | |---|---|---|
1104 | | 0 | 5 | $ 508.61 |
1105 | 
1120 |
1121 | **Result**\
1122 | The Data Bank customers made an average of 5 deposits, with an average amount of $ 508.61.
1123 |
1124 | ___
1125 | #### 3. For each month - how many Data Bank customers make more than 1 deposit and either 1 purchase or 1 withdrawal in a single month?
1126 |
1127 |
1128 | ```python
1129 | pd.read_sql("""
1130 | WITH counting_transactions_cte AS
1131 | (
1132 | SELECT
1133 | customer_id,
1134 | EXTRACT(MONTH FROM txn_date)::INTEGER AS month,
1135 | TO_CHAR(txn_date, 'MONTH') AS month_name,
1136 | SUM(CASE WHEN txn_type = 'deposit' THEN 1 ELSE 0 END) AS deposit,
1137 | SUM(CASE WHEN txn_type = 'purchase' THEN 1 ELSE 0 END) AS purchase,
1138 | SUM(CASE WHEN txn_type = 'withdrawal' THEN 1 ELSE 0 END) AS withdrawal
1139 | FROM data_bank.customer_transactions
1140 | GROUP BY customer_id, month, month_name
1141 | ORDER BY customer_id
1142 | )
1143 | SELECT
1144 | month,
1145 | month_name,
1146 | COUNT(DISTINCT customer_id) AS nb_customers
1147 | FROM counting_transactions_cte
1148 | WHERE deposit > 1 AND (purchase > 0 OR withdrawal > 0)
1149 | GROUP BY month, month_name
1150 | ORDER BY month;
1151 | """, conn)
1152 | ```
1153 |
1154 | 
1155 | |   | month | month_name | nb_customers |
1156 | |---|---|---|---|
1157 | | 0 | 1 | JANUARY | 168 |
1158 | | 1 | 2 | FEBRUARY | 181 |
1159 | | 2 | 3 | MARCH | 192 |
1160 | | 3 | 4 | APRIL | 70 |
1161 | 
1193 |
1194 | **Result**
1195 | - **March** is the month when Data Bank customers make the most transactions (more than 1 deposit and either 1 purchase or 1 withdrawal in a single month), with a total of 192 customers.
1196 | - Following March, **February** is the second month with the highest number of customers engaging in the given transactions.
1197 | - In **January**, a total of 168 customers made more than 1 deposit and either 1 purchase or 1 withdrawal.
1198 | - The month of **April** has the fewest number of customers conducting these types of transactions, with only 70 clients.
1199 |
1200 | ___
1201 | #### 4. What is the closing balance for each customer at the end of the month?
1202 |
1203 | Here are the results showing the first 5 customers.
1204 |
1205 |
1206 | ```python
1207 | pd.read_sql("""
1208 | SELECT
1209 | customer_id,
1210 | EXTRACT(MONTH FROM txn_date)::INTEGER AS month,
1211 | TO_CHAR(txn_date, 'MONTH') AS month_name,
1212 | SUM(CASE WHEN txn_type = 'deposit' THEN txn_amount ELSE - txn_amount END) AS closing_balance
1213 | FROM data_bank.customer_transactions
1214 | WHERE customer_id <= 5
1215 | GROUP BY customer_id, month, month_name
1216 | ORDER BY customer_id, month;
1217 | """, conn)
1218 | ```
1219 |
1220 | 
1221 | |   | customer_id | month | month_name | closing_balance |
1222 | |---|---|---|---|---|
1223 | | 0 | 1 | 1 | JANUARY | 312 |
1224 | | 1 | 1 | 3 | MARCH | -952 |
1225 | | 2 | 2 | 1 | JANUARY | 549 |
1226 | | 3 | 2 | 3 | MARCH | 61 |
1227 | | 4 | 3 | 1 | JANUARY | 144 |
1228 | | 5 | 3 | 2 | FEBRUARY | -965 |
1229 | | 6 | 3 | 3 | MARCH | -401 |
1230 | | 7 | 3 | 4 | APRIL | 493 |
1231 | | 8 | 4 | 1 | JANUARY | 848 |
1232 | | 9 | 4 | 3 | MARCH | -193 |
1233 | | 10 | 5 | 1 | JANUARY | 954 |
1234 | | 11 | 5 | 3 | MARCH | -2877 |
1235 | | 12 | 5 | 4 | APRIL | -490 |
1236 | 
1327 |
1328 | ___
1329 | #### 5. What is the percentage of customers who increase their closing balance by more than 5%?
1330 |
1331 |
1332 | ```python
1333 | pd.read_sql("""
1334 | WITH monthly_balance_cte AS
1335 | (
1336 | SELECT
1337 | customer_id,
1338 | EXTRACT(MONTH FROM txn_date)::INTEGER AS month,
1339 | SUM(CASE WHEN txn_type = 'deposit' THEN txn_amount ELSE - txn_amount END) AS closing_balance
1340 | FROM data_bank.customer_transactions
1341 | GROUP BY customer_id, month
1342 | ORDER BY customer_id, month
1343 | ),
1344 | balance_greaterthan5_cte AS
1345 | (
1346 | SELECT COUNT(DISTINCT customer_id) AS nb_customers
1347 | FROM
1348 | (
1349 | SELECT
1350 | customer_id,
1351 | (LEAD(closing_balance) OVER (PARTITION BY customer_id ORDER BY month) - closing_balance)/ closing_balance::numeric*100 AS percent_change
1352 | FROM monthly_balance_cte
1353 | ) pc
1354 | WHERE percent_change > 5
1355 | )
1356 | SELECT
1357 | MAX(nb_customers) AS nb_customers,
1358 | COUNT(DISTINCT ct.customer_id) AS total_customers,
1359 | CONCAT(ROUND(MAX(nb_customers)/COUNT(DISTINCT ct.customer_id)::numeric * 100,1), ' %') AS percentage_customers
1360 | FROM balance_greaterthan5_cte b, data_bank.customer_transactions ct
1361 | """, conn)
1362 | ```
1363 |
1364 | 
1365 | |   | nb_customers | total_customers | percentage_customers |
1366 | |---|---|---|---|
1367 | | 0 | 269 | 500 | 53.8 % |
1368 | 
1385 |
1386 | **Result**\
1387 | There are 53.8 % of the Data Bank customers who increase their closing balance by more than 5%.
1388 |
1389 | ___
1390 |
1391 | ## C. Data Allocation Challenge
1392 |
1393 | To test out a few different hypotheses - the Data Bank team wants to run an experiment where different groups of customers would be allocated data using 3 different options:
1394 |
1395 | - `Option 1`: data is allocated based off the amount of money at the end of the previous month
1396 | - `Option 2`: data is allocated on the average amount of money kept in the account in the previous 30 days
1397 | - `Option 3`: data is updated real-time
1398 |
1399 | For this multi-part challenge question - you have been requested to generate the following data elements to help the Data Bank team estimate how much data will need to be provisioned for each option:
1400 | - running customer balance column that includes the impact each transaction
1401 | - customer balance at the end of each month
1402 | - minimum, average and maximum values of the running balance for each customer
1403 |
1404 | Using all of the data available - how much data would have been required for each option on a monthly basis?
1405 |
1406 | ### Running Balance
1407 |
1408 |
1409 | ```python
1410 | pd.read_sql("""
1411 | SELECT *,
1412 | SUM(
1413 | CASE WHEN txn_type = 'deposit' THEN txn_amount ELSE -txn_amount END
1414 | ) OVER (PARTITION BY customer_id ORDER BY txn_date) AS running_balance
1415 | FROM data_bank.customer_transactions
1416 | WHERE customer_id <= 5;
1417 | """, conn)
1418 | ```
1419 |
1420 | 
1421 | |   | customer_id | txn_date | txn_type | txn_amount | running_balance |
1422 | |---|---|---|---|---|---|
1423 | | 0 | 1 | 2020-01-02 | deposit | 312 | 312 |
1424 | | 1 | 1 | 2020-03-05 | purchase | 612 | -300 |
1425 | | 2 | 1 | 2020-03-17 | deposit | 324 | 24 |
1426 | | 3 | 1 | 2020-03-19 | purchase | 664 | -640 |
1427 | | 4 | 2 | 2020-01-03 | deposit | 549 | 549 |
1428 | | 5 | 2 | 2020-03-24 | deposit | 61 | 610 |
1429 | | 6 | 3 | 2020-01-27 | deposit | 144 | 144 |
1430 | | 7 | 3 | 2020-02-22 | purchase | 965 | -821 |
1431 | | 8 | 3 | 2020-03-05 | withdrawal | 213 | -1034 |
1432 | | 9 | 3 | 2020-03-19 | withdrawal | 188 | -1222 |
1433 | | 10 | 3 | 2020-04-12 | deposit | 493 | -729 |
1434 | | 11 | 4 | 2020-01-07 | deposit | 458 | 458 |
1435 | | 12 | 4 | 2020-01-21 | deposit | 390 | 848 |
1436 | | 13 | 4 | 2020-03-25 | purchase | 193 | 655 |
1437 | | 14 | 5 | 2020-01-15 | deposit | 974 | 974 |
1438 | | 15 | 5 | 2020-01-25 | deposit | 806 | 1780 |
1439 | | 16 | 5 | 2020-01-31 | withdrawal | 826 | 954 |
1440 | | 17 | 5 | 2020-03-02 | purchase | 886 | 68 |
1441 | | 18 | 5 | 2020-03-19 | deposit | 718 | 786 |
1442 | | 19 | 5 | 2020-03-26 | withdrawal | 786 | 0 |
1443 | | 20 | 5 | 2020-03-27 | deposit | 412 | -288 |
1444 | | 21 | 5 | 2020-03-27 | withdrawal | 700 | -288 |
1445 | | 22 | 5 | 2020-03-29 | purchase | 852 | -1140 |
1446 | | 23 | 5 | 2020-03-31 | purchase | 783 | -1923 |
1447 | | 24 | 5 | 2020-04-02 | withdrawal | 490 | -2413 |
1448 | 
1637 |
1638 | ### Monthly Balance
1639 |
1640 |
1641 | ```python
1642 | pd.read_sql("""
1643 | SELECT
1644 | customer_id,
1645 | EXTRACT(MONTH FROM txn_date)::INTEGER AS month,
1646 | TO_CHAR(txn_date, 'Month') AS month_name,
1647 | SUM(CASE WHEN txn_type = 'deposit' THEN txn_amount ELSE -txn_amount END) AS closing_balance
1648 | FROM data_bank.customer_transactions
1649 | WHERE customer_id <= 5
1650 | GROUP BY customer_id, month, month_name
1651 | ORDER BY customer_id, month;
1652 | """, conn)
1653 | ```
1654 | 
1655 | |   | customer_id | month | month_name | closing_balance |
1656 | |---|---|---|---|---|
1657 | | 0 | 1 | 1 | January | 312 |
1658 | | 1 | 1 | 3 | March | -952 |
1659 | | 2 | 2 | 1 | January | 549 |
1660 | | 3 | 2 | 3 | March | 61 |
1661 | | 4 | 3 | 1 | January | 144 |
1662 | | 5 | 3 | 2 | February | -965 |
1663 | | 6 | 3 | 3 | March | -401 |
1664 | | 7 | 3 | 4 | April | 493 |
1665 | | 8 | 4 | 1 | January | 848 |
1666 | | 9 | 4 | 3 | March | -193 |
1667 | | 10 | 5 | 1 | January | 954 |
1668 | | 11 | 5 | 3 | March | -2877 |
1669 | | 12 | 5 | 4 | April | -490 |
1670 | 
1762 |
1763 | ### Min, Average, Max Transaction
1764 |
1765 |
1766 | ```python
1767 | pd.read_sql("""
1768 | SELECT
1769 | customer_id,
1770 | MIN(running_balance) AS min_transaction,
1771 | MAX(running_balance) AS max_transaction,
1772 | ROUND(AVG(running_balance),2) AS avg_transaction
1773 | FROM
1774 | (
1775 | SELECT *,
1776 | SUM(CASE WHEN txn_type = 'deposit' THEN txn_amount ELSE -txn_amount END) OVER (PARTITION BY customer_id ORDER BY txn_date) AS running_balance
1777 | FROM data_bank.customer_transactions
1778 | ) running_balance
1779 | WHERE customer_id <= 10
1780 | GROUP BY customer_id
1781 | ORDER BY customer_id
1782 | """, conn)
1783 | ```
1784 |
1785 | 
1786 | |   | customer_id | min_transaction | max_transaction | avg_transaction |
1787 | |---|---|---|---|---|
1788 | | 0 | 1 | -640 | 312 | -151.00 |
1789 | | 1 | 2 | 549 | 610 | 579.50 |
1790 | | 2 | 3 | -1222 | 144 | -732.40 |
1791 | | 3 | 4 | 458 | 848 | 653.67 |
1792 | | 4 | 5 | -2413 | 1780 | -135.45 |
1793 | | 5 | 6 | -552 | 2197 | 624.00 |
1794 | | 6 | 7 | 887 | 3539 | 2268.69 |
1795 | | 7 | 8 | -1029 | 1363 | 173.70 |
1796 | | 8 | 9 | -91 | 2030 | 1021.70 |
1797 | | 9 | 10 | -5090 | 556 | -2229.83 |
1798 | 
1871 |
1872 |
1873 | ```python
1874 | conn.close()
1875 | ```
1876 |
--------------------------------------------------------------------------------
/CaseStudy#4 - Data Bank/README.md:
--------------------------------------------------------------------------------
1 | # Case Study #4: Data Bank 💱
2 | [8WeekSQLChallenge Repository](https://github.com/chanronnie/8WeekSQLChallenge)
3 | [Case Study #4 on 8weeksqlchallenge.com](https://8weeksqlchallenge.com/case-study-4/)
4 | 
5 |
6 |
7 | The case study presented here is part of the **8 Week SQL Challenge**.\
8 | It is kindly brought to us by [**Data With Danny**](https://8weeksqlchallenge.com).
9 |
10 | This time, I am using `PostgreSQL queries` (instead of MySQL) in `Jupyter Notebook` to quickly view results, which provides me with an opportunity:
11 | - to learn PostgreSQL
12 | - to utilize handy mathematical and string functions.
13 |
14 |
15 | ## Table of Contents
16 | * [Entity Relationship Diagram](#entity-relationship-diagram)
17 | * [Datasets](#datasets)
18 | * [Case Study Questions](#case-study-questions)
19 | * [Solutions](#solutions)
20 | * [PostgreSQL Topics Covered](#postgresql-topics-covered)
21 |
22 | ## Entity Relationship Diagram
23 | 
24 |
25 |
26 |
27 |
28 | ## Datasets
29 | Case Study #4 contains 3 tables:
30 | - **regions**: This table maps each region_id to its respective region_name value.
31 | - **customer_nodes**: This table lists the region and node reallocation information of all Data Bank customers.
32 | - **customer_transactions**: This table lists all the transaction information of all Data Bank customers.
33 |
34 | ## Case Study Questions
35 | Case Study #4 is categorized into 3 question groups.\
36 | To view the specific section, please open the link in a *`new tab`* or *`window`*.\
37 | [A. Customer Nodes Exploration](CaseStudy4_solutions.md#A)\
38 | [B. Customer Transactions](CaseStudy4_solutions.md#B)\
39 | [C. Data Allocation Challenge](CaseStudy4_solutions.md#C)
40 |
41 | ## Solutions
42 | - View `data_bank` database: [**here**](CaseStudy4_schema.sql)
43 | - View Solution:
44 | - [**Markdown File**](CaseStudy4_solutions.md): offers a more fluid and responsive viewing experience
45 | - [**Jupyter Notebook**](CaseStudy4_solutions.ipynb): contains the original code
46 |
47 | ## PostgreSQL Topics Covered
48 | - Common Table Expressions (CTE)
49 | - Window Functions
50 | - Subqueries
51 | - JOIN, UNION ALL
52 | - Median and percentiles (PERCENTILE_DISC)
53 |
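54 | As a small taste of the "handy mathematical and string functions" mentioned above, the solutions lean on PostgreSQL's `TO_CHAR` to format query output directly in SQL; a minimal sketch (the real queries are in the solutions file):
55 | 
56 | ```sql
57 | -- Sketch only: format a raw integer amount as a readable dollar figure
58 | SELECT TO_CHAR(1359168, 'FM$ 999,999,999') AS formatted_total;
59 | ```
60 | 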
--------------------------------------------------------------------------------
/CaseStudy#5 - Data Mart/CaseStudy5_solutions.md:
--------------------------------------------------------------------------------
1 | # Case Study #5: Data Mart
2 | The case study questions presented here are created by [**Data With Danny**](https://linktr.ee/datawithdanny). They are part of the [**8 Week SQL Challenge**](https://8weeksqlchallenge.com/).
3 |
4 | My SQL queries are written in the `PostgreSQL 15` dialect, integrated into `Jupyter Notebook`, which allows us to instantly view the query results and document the queries.
5 |
6 | For more details about the **Case Study #5**, click [**here**](https://8weeksqlchallenge.com/case-study-5/).
7 |
8 | ## Table of Contents
9 |
10 | ### [1. Importing Libraries](#Import)
11 |
12 | ### [2. Tables of the Database](#Tables)
13 |
14 | ### [3. Case Study Questions](#CaseStudyQuestions)
15 |
16 | - [A. Data Cleansing](#A)
17 | - [B. Data Exploration](#B)
18 | - [C. Before & After Analysis](#C)
19 | - [D. Bonus Question](#D)
20 |
21 |
22 | ## 1. Importing Required Libraries
23 |
24 |
25 | ```python
26 | import psycopg2 as pg2
27 | import pandas as pd
28 | from datetime import datetime
29 | import seaborn as sns
30 | import matplotlib.pyplot as plt
31 | import matplotlib.ticker as ticker
32 | import os
33 | import warnings
34 |
35 | warnings.filterwarnings('ignore')
36 | ```
37 |
38 | ### Connecting to the PostgreSQL database from Jupyter Notebook
39 |
40 |
41 | ```python
42 | # Get PostgreSQL password
43 | mypassword = os.getenv('POSTGRESQL_PASSWORD')
44 |
45 | # Connecting to database
46 | conn = pg2.connect(user = 'postgres', password = mypassword, database = 'data_mart')
47 | cursor = conn.cursor()
48 | ```
49 |
50 |
51 |
52 | ## 2. Tables of the Database
53 |
54 |
55 | ```python
56 | cursor.execute("""
57 | SELECT table_schema, table_name
58 | FROM information_schema.tables
59 | WHERE table_schema = 'data_mart'
60 | """
61 | )
62 |
63 | print('--- Tables within "data_mart" database --- ')
64 | for table in cursor:
65 | print(table[1])
66 | ```
67 |
68 | --- Tables within "data_mart" database ---
69 | weekly_sales
70 |
71 |
72 | Here is the `weekly_sales` table, containing 17,117 rows of data since March 26, 2018.
73 |
74 |
75 | ```python
76 | pd.read_sql("""
77 | SELECT *
78 | FROM data_mart.weekly_sales;
79 | """, conn)
80 | ```
81 |
82 | 
83 | |   | week_date | region | platform | segment | customer_type | transactions | sales |
84 | |---|---|---|---|---|---|---|---|
85 | | 0 | 31/8/20 | ASIA | Retail | C3 | New | 120631 | 3656163 |
86 | | 1 | 31/8/20 | ASIA | Retail | F1 | New | 31574 | 996575 |
87 | | 2 | 31/8/20 | USA | Retail | null | Guest | 529151 | 16509610 |
88 | | 3 | 31/8/20 | EUROPE | Retail | C1 | New | 4517 | 141942 |
89 | | 4 | 31/8/20 | AFRICA | Retail | C2 | New | 58046 | 1758388 |
90 | | ... | ... | ... | ... | ... | ... | ... | ... |
91 | | 17112 | 26/3/18 | AFRICA | Retail | C3 | New | 98342 | 3706066 |
92 | | 17113 | 26/3/18 | USA | Shopify | C4 | New | 16 | 2784 |
93 | | 17114 | 26/3/18 | USA | Retail | F2 | New | 25665 | 1064172 |
94 | | 17115 | 26/3/18 | EUROPE | Retail | C4 | New | 883 | 33523 |
95 | | 17116 | 26/3/18 | AFRICA | Retail | C3 | Existing | 218516 | 12083475 |
96 | 
208 | 17117 rows × 7 columns
209 |
210 |
211 |
212 |
213 |
214 |
215 | ## 3. Case Study Questions
216 |
217 |
218 | ## A. Data Cleansing
219 | In a single query, perform the following operations and generate a *new table* in the data_mart schema named `clean_weekly_sales`:
220 |
221 | - Convert the `week_date` to a DATE format
222 |
223 | - Add a `week_number` as the second column for each `week_date` value, for example any value from the 1st of January to 7th of January will be 1, 8th to 14th will be 2 etc
224 |
225 | - Add a `month_number` with the calendar month for each `week_date` value as the 3rd column
226 |
227 | - Add a `calendar_year` column as the 4th column containing either 2018, 2019 or 2020 values
228 |
229 | - Add a new column called `age_band` after the original `segment` column using the following mapping on the number inside the segment value
230 |
231 |
232 |
233 | segment | age_band
234 | --- | ---
235 | 1 | Young Adults
236 | 2 | Middle Aged
237 | 3 or 4 | Retirees
238 |
239 |
240 |
241 | - Add a new `demographic` column using the following mapping for the first letter in the segment values:
242 |
243 |
244 |
245 | segment | demographic
246 | --- | ---
247 | C | Couples
248 | F | Families
249 |
250 |
251 |
252 | - Ensure all null string values are replaced with an "unknown" string value in the original `segment` column as well as the new `age_band` and `demographic` columns
253 |
254 | - Generate a new `avg_transaction` column as the sales value divided by transactions rounded to 2 decimal places for each record
255 |
256 | ___
257 | Creating an empty table is the first step: it will store the processed data from the original `weekly_sales` dataset in the desired column order. A single query with the `INSERT INTO` statement can then perform all of the data cleaning steps inside its `SELECT` block.
258 |
259 |
260 | ```python
261 | # Creating table
262 | cursor.execute("DROP TABLE IF EXISTS data_mart.clean_weekly_sales;")
263 | cursor.execute("""
264 | CREATE TABLE data_mart.clean_weekly_sales
265 | (
266 | "week_date" DATE,
267 | "week_number" INTEGER,
268 | "month_number" INTEGER,
269 | "calendar_year" INTEGER,
270 | "region" VARCHAR(13),
271 | "platform" VARCHAR(7),
272 | "segment" VARCHAR(10),
273 | "age_band" VARCHAR(50),
274 | "demographic" VARCHAR(10),
275 | "customer_type" VARCHAR(8),
276 | "transactions" INTEGER,
277 | "sales" INTEGER,
278 | "avg_transaction" DECIMAL
279 | );
280 | """)
281 |
282 |
283 | # Inserting required and processed data into the newly created table
284 | cursor.execute("""
285 | INSERT INTO data_mart.clean_weekly_sales
286 | SELECT
287 | TO_DATE(week_date, 'dd/mm/yy') AS week_date,
288 | DATE_PART('week', TO_DATE(week_date, 'dd/mm/yy'))::INTEGER as week_number,
289 | DATE_PART('month', TO_DATE(week_date, 'dd/mm/yy'))::INTEGER as month_number,
290 | DATE_PART('year', TO_DATE(week_date, 'dd/mm/yy'))::INTEGER as calendar_year,
291 | region,
292 | platform,
293 | CASE WHEN segment = 'null' THEN 'unknown' ELSE segment END as segment,
294 | CASE
295 | WHEN segment LIKE '%1' THEN 'Young Adults'
296 | WHEN segment LIKE '%2' THEN 'Middle Aged'
297 | WHEN REGEXP_LIKE(segment, '3|4') THEN 'Retirees'
298 | ELSE 'unknown'
299 | END AS age_band,
300 |
301 | CASE
302 | WHEN segment LIKE 'C%' THEN 'Couples'
303 | WHEN segment LIKE 'F%' THEN 'Families'
304 | ELSE 'unknown'
305 | END AS demographic,
306 | customer_type,
307 | transactions,
308 | sales,
309 | ROUND(sales/transactions::numeric, 2) AS avg_transaction
310 | FROM data_mart.weekly_sales;
311 | """)
312 |
313 | # Saving updates
314 | conn.commit()
315 | ```
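
Before previewing the result, a quick sanity check (a sketch using the column names defined above) can confirm that the new table has the same row count as the source and that no literal `'null'` strings remain in `segment`:

```python
# Sanity check (sketch): row counts should match, and the literal string 'null'
# should no longer appear in the cleaned segment column.
pd.read_sql("""
SELECT
    (SELECT COUNT(*) FROM data_mart.weekly_sales)       AS raw_rows,
    (SELECT COUNT(*) FROM data_mart.clean_weekly_sales) AS clean_rows,
    (SELECT COUNT(*) FROM data_mart.clean_weekly_sales
     WHERE segment = 'null')                            AS remaining_null_strings;
""", conn)
```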
316 |
317 | **Result**
318 |
319 |
320 | ```python
321 | pd.read_sql("""
322 | SELECT *
323 | FROM data_mart.clean_weekly_sales;
324 | """, conn)
325 | ```
326 |
327 |
328 |
329 |
330 | |
331 | week_date |
332 | week_number |
333 | month_number |
334 | calendar_year |
335 | region |
336 | platform |
337 | segment |
338 | age_band |
339 | demographic |
340 | customer_type |
341 | transactions |
342 | sales |
343 | avg_transaction |
344 |
345 |
346 |
347 |
348 | 0 |
349 | 2020-08-31 |
350 | 36 |
351 | 8 |
352 | 2020 |
353 | ASIA |
354 | Retail |
355 | C3 |
356 | Retirees |
357 | Couples |
358 | New |
359 | 120631 |
360 | 3656163 |
361 | 30.31 |
362 |
363 |
364 | 1 |
365 | 2020-08-31 |
366 | 36 |
367 | 8 |
368 | 2020 |
369 | ASIA |
370 | Retail |
371 | F1 |
372 | Young Adults |
373 | Families |
374 | New |
375 | 31574 |
376 | 996575 |
377 | 31.56 |
378 |
379 |
380 | 2 |
381 | 2020-08-31 |
382 | 36 |
383 | 8 |
384 | 2020 |
385 | USA |
386 | Retail |
387 | unknown |
388 | unknown |
389 | unknown |
390 | Guest |
391 | 529151 |
392 | 16509610 |
393 | 31.20 |
394 |
395 |
396 | 3 |
397 | 2020-08-31 |
398 | 36 |
399 | 8 |
400 | 2020 |
401 | EUROPE |
402 | Retail |
403 | C1 |
404 | Young Adults |
405 | Couples |
406 | New |
407 | 4517 |
408 | 141942 |
409 | 31.42 |
410 |
411 |
412 | 4 |
413 | 2020-08-31 |
414 | 36 |
415 | 8 |
416 | 2020 |
417 | AFRICA |
418 | Retail |
419 | C2 |
420 | Middle Aged |
421 | Couples |
422 | New |
423 | 58046 |
424 | 1758388 |
425 | 30.29 |
426 |
427 |
428 | ... |
429 | ... |
430 | ... |
431 | ... |
432 | ... |
433 | ... |
434 | ... |
435 | ... |
436 | ... |
437 | ... |
438 | ... |
439 | ... |
440 | ... |
441 | ... |
442 |
443 |
444 | 17112 |
445 | 2018-03-26 |
446 | 13 |
447 | 3 |
448 | 2018 |
449 | AFRICA |
450 | Retail |
451 | C3 |
452 | Retirees |
453 | Couples |
454 | New |
455 | 98342 |
456 | 3706066 |
457 | 37.69 |
458 |
459 |
460 | 17113 |
461 | 2018-03-26 |
462 | 13 |
463 | 3 |
464 | 2018 |
465 | USA |
466 | Shopify |
467 | C4 |
468 | Retirees |
469 | Couples |
470 | New |
471 | 16 |
472 | 2784 |
473 | 174.00 |
474 |
475 |
476 | 17114 |
477 | 2018-03-26 |
478 | 13 |
479 | 3 |
480 | 2018 |
481 | USA |
482 | Retail |
483 | F2 |
484 | Middle Aged |
485 | Families |
486 | New |
487 | 25665 |
488 | 1064172 |
489 | 41.46 |
490 |
491 |
492 | 17115 |
493 | 2018-03-26 |
494 | 13 |
495 | 3 |
496 | 2018 |
497 | EUROPE |
498 | Retail |
499 | C4 |
500 | Retirees |
501 | Couples |
502 | New |
503 | 883 |
504 | 33523 |
505 | 37.96 |
506 |
507 |
508 | 17116 |
509 | 2018-03-26 |
510 | 13 |
511 | 3 |
512 | 2018 |
513 | AFRICA |
514 | Retail |
515 | C3 |
516 | Retirees |
517 | Couples |
518 | Existing |
519 | 218516 |
520 | 12083475 |
521 | 55.30 |
522 |
523 |
524 |
525 | 17117 rows × 13 columns
526 |
527 |
528 |
529 |
530 |
531 |
532 | ## B. Data Exploration
533 |
534 | #### 1. What day of the week is used for each `week_date` value?
535 |
536 |
537 | ```python
538 | pd.read_sql("""
539 | SELECT DISTINCT TO_CHAR(week_date, 'Day') AS day_of_week
540 | FROM data_mart.clean_weekly_sales;
541 | """, conn)
542 | ```
543 |
544 |
545 |
546 |
547 | |
548 | day_of_week |
549 |
550 |
551 |
552 |
553 | 0 |
554 | Monday |
555 |
556 |
557 |
558 |
559 |
560 |
561 |
562 | **Result**\
563 | Monday is the designated day of the week for each `week_date` value.
564 |
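An equivalent numeric check (a sketch) uses `EXTRACT(ISODOW ...)`, which returns 1 for Monday through 7 for Sunday:

```python
# Equivalent numeric check (sketch): ISODOW is 1 for Monday ... 7 for Sunday.
pd.read_sql("""
SELECT DISTINCT EXTRACT(ISODOW FROM week_date) AS iso_day_of_week
FROM data_mart.clean_weekly_sales;
""", conn)
```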
565 | ___
566 | #### 2. What range of week numbers are missing from the dataset?
567 |
568 | Given that there are 52 weeks in a year, we can use the `generate_series` function to generate a series of numbers from 1 to 52. A **LEFT JOIN** between this number series and the `clean_weekly_sales` table then reveals all the week numbers that are not present in the dataset.
569 |
570 |
571 | ```python
572 | pd.read_sql("""
573 | SELECT gs.week_number AS missing_week_nb
574 | FROM generate_series(1,52,1) gs(week_number)
575 | LEFT JOIN data_mart.clean_weekly_sales d ON gs.week_number = d.week_number
576 | WHERE d.week_number IS NULL
577 | """, conn)
578 | ```
579 |
580 |
581 |
582 |
583 | |
584 | missing_week_nb |
585 |
586 |
587 |
588 |
589 | 0 |
590 | 1 |
591 |
592 |
593 | 1 |
594 | 2 |
595 |
596 |
597 | 2 |
598 | 3 |
599 |
600 |
601 | 3 |
602 | 4 |
603 |
604 |
605 | 4 |
606 | 5 |
607 |
608 |
609 | 5 |
610 | 6 |
611 |
612 |
613 | 6 |
614 | 7 |
615 |
616 |
617 | 7 |
618 | 8 |
619 |
620 |
621 | 8 |
622 | 9 |
623 |
624 |
625 | 9 |
626 | 10 |
627 |
628 |
629 | 10 |
630 | 11 |
631 |
632 |
633 | 11 |
634 | 12 |
635 |
636 |
637 | 12 |
638 | 37 |
639 |
640 |
641 | 13 |
642 | 38 |
643 |
644 |
645 | 14 |
646 | 39 |
647 |
648 |
649 | 15 |
650 | 40 |
651 |
652 |
653 | 16 |
654 | 41 |
655 |
656 |
657 | 17 |
658 | 42 |
659 |
660 |
661 | 18 |
662 | 43 |
663 |
664 |
665 | 19 |
666 | 44 |
667 |
668 |
669 | 20 |
670 | 45 |
671 |
672 |
673 | 21 |
674 | 46 |
675 |
676 |
677 | 22 |
678 | 47 |
679 |
680 |
681 | 23 |
682 | 48 |
683 |
684 |
685 | 24 |
686 | 49 |
687 |
688 |
689 | 25 |
690 | 50 |
691 |
692 |
693 | 26 |
694 | 51 |
695 |
696 |
697 | 27 |
698 | 52 |
699 |
700 |
701 |
702 |
703 |
704 |
705 |
706 | **Result**
707 | - The first range of missing week numbers is week 1 to week 12.
708 | - The second range of missing week numbers is week 37 to week 52 (the sketch after this list collapses the missing weeks into these ranges directly).
709 |
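As referenced above, this sketch collapses the missing week numbers into contiguous ranges directly in SQL, using the classic `week_number - ROW_NUMBER()` gaps-and-islands trick:

```python
# Sketch: group consecutive missing week numbers into ranges.
# week_number - ROW_NUMBER() is constant within each run of consecutive values.
pd.read_sql("""
WITH missing_weeks AS
(
    SELECT gs.week_number
    FROM generate_series(1, 52, 1) gs(week_number)
    LEFT JOIN data_mart.clean_weekly_sales d ON gs.week_number = d.week_number
    WHERE d.week_number IS NULL
)
SELECT
    MIN(week_number) AS range_start,
    MAX(week_number) AS range_end
FROM
(
    SELECT week_number,
           week_number - ROW_NUMBER() OVER (ORDER BY week_number) AS grp
    FROM missing_weeks
) g
GROUP BY grp
ORDER BY range_start;
""", conn)
```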
710 | ___
711 | #### 3. How many total transactions were there for each year in the dataset?
712 |
713 |
714 | ```python
715 | pd.read_sql("""
716 | SELECT
717 | calendar_year AS year,
718 | COUNT(transactions) AS nb_transactions
719 | FROM data_mart.clean_weekly_sales
720 | GROUP BY year;
721 | """, conn)
722 | ```
723 |
724 |
725 |
726 |
727 | |
728 | year |
729 | nb_transactions |
730 |
731 |
732 |
733 |
734 | 0 |
735 | 2018 |
736 | 5698 |
737 |
738 |
739 | 1 |
740 | 2020 |
741 | 5711 |
742 |
743 |
744 | 2 |
745 | 2019 |
746 | 5708 |
747 |
748 |
749 |
750 |
751 |
752 |
753 |
754 | ___
755 | #### 4. What is the total sales for each region for each month?
756 |
757 |
758 | ```python
759 | df4 = pd.read_sql("""
760 | SELECT
761 | region,
762 | calendar_year AS year,
763 | TO_CHAR(week_date, 'Month') AS month_name,
764 | SUM(sales) AS total_sales
765 | FROM data_mart.clean_weekly_sales
766 | GROUP BY region, year, month_number, month_name
767 | ORDER BY region, year, month_number
768 | """, conn)
769 |
770 | df4
771 | ```
772 |
773 |
774 |
775 |
776 |
777 | |
778 | region |
779 | year |
780 | month_name |
781 | total_sales |
782 |
783 |
784 |
785 |
786 | 0 |
787 | AFRICA |
788 | 2018 |
789 | March |
790 | 130542213 |
791 |
792 |
793 | 1 |
794 | AFRICA |
795 | 2018 |
796 | April |
797 | 650194751 |
798 |
799 |
800 | 2 |
801 | AFRICA |
802 | 2018 |
803 | May |
804 | 522814997 |
805 |
806 |
807 | 3 |
808 | AFRICA |
809 | 2018 |
810 | June |
811 | 519127094 |
812 |
813 |
814 | 4 |
815 | AFRICA |
816 | 2018 |
817 | July |
818 | 674135866 |
819 |
820 |
821 | ... |
822 | ... |
823 | ... |
824 | ... |
825 | ... |
826 |
827 |
828 | 135 |
829 | USA |
830 | 2020 |
831 | April |
832 | 221952003 |
833 |
834 |
835 | 136 |
836 | USA |
837 | 2020 |
838 | May |
839 | 225545881 |
840 |
841 |
842 | 137 |
843 | USA |
844 | 2020 |
845 | June |
846 | 277763625 |
847 |
848 |
849 | 138 |
850 | USA |
851 | 2020 |
852 | July |
853 | 223735311 |
854 |
855 |
856 | 139 |
857 | USA |
858 | 2020 |
859 | August |
860 | 277361606 |
861 |
862 |
863 |
864 | 140 rows × 4 columns
865 |
866 |
867 |
868 |
869 | **Result**\
870 | With 140 rows, the output table above makes it hard to see Data Mart's sales performance at a glance. Let's instead examine a time-series visualization of total sales by region, using seaborn's `hue` parameter to split the lines by region.
871 |
872 |
873 | ```python
874 | # Insert a new column for datetime using calendar_year and month_name values
875 | df4['date'] = df4.apply(lambda row: datetime(row['year'], datetime.strptime(row['month_name'].strip(), '%B').month, 1), axis=1)
876 |
877 |
878 | # plot the total_sales by region
879 | sns.set_style("darkgrid")
880 | sns.lineplot(data=df4, x='date', y='total_sales', hue='region')
881 | plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left')
882 |
883 | # customize the format, labels and title
884 | plt.xticks(rotation=45, ha='right')
885 | plt.ylabel("total sales")
886 | formatter = ticker.StrMethodFormatter('${x:,.0f}')
887 | plt.gca().yaxis.set_major_formatter(formatter)
888 | plt.title('Total Sales at Data Mart between 2018 and 2020');
889 | ```
890 |
891 |
892 |
893 | 
894 |
895 |
896 |
897 | **Insights**
898 | - Data Mart has achieved the highest sales in Oceania since 2018.
899 | - Following Oceania, Africa and Asia are the second and third most lucrative regions for Data Mart sales.
900 | - South America and Europe recorded the lowest sales figures from 2018 to 2020 (the quick aggregation below confirms this ranking).
901 |
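As noted in the last insight, a quick pandas aggregation over `df4` confirms the ranking of total sales per region across the whole 2018–2020 period:

```python
# Quick check: total sales per region over the full period, highest first.
df4.groupby('region')['total_sales'].sum().sort_values(ascending=False)
```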
902 | ___
903 | #### 5. What is the total count of transactions for each platform?
904 |
905 |
906 | ```python
907 | pd.read_sql("""
908 | SELECT
909 | platform,
910 | COUNT(transactions) AS nb_transactions
911 | FROM data_mart.clean_weekly_sales
912 | GROUP BY platform;
913 | """, conn)
914 | ```
915 |
916 |
917 |
918 |
919 | |
920 | platform |
921 | nb_transactions |
922 |
923 |
924 |
925 |
926 | 0 |
927 | Shopify |
928 | 8549 |
929 |
930 |
931 | 1 |
932 | Retail |
933 | 8568 |
934 |
935 |
936 |
937 |
938 |
939 |
940 |
941 | ___
942 | #### 6. What is the percentage of sales for Retail vs Shopify for each month?
943 |
944 |
945 | ```python
946 | pd.read_sql("""
947 | WITH platform_sales_cte AS
948 | (
949 | SELECT
950 | calendar_year,
951 | month_number,
952 | TO_CHAR(week_date, 'Month') AS month,
953 | SUM(sales) AS monthly_sales,
954 | SUM(CASE WHEN platform = 'Retail' THEN sales ELSE 0 END) AS retail_sales,
955 | SUM(CASE WHEN platform = 'Shopify' THEN sales ELSE 0 END) AS shopify_sales
956 | FROM data_mart.clean_weekly_sales
957 | GROUP BY calendar_year, month_number, month
958 | ORDER BY calendar_year, month
959 | )
960 | SELECT
961 | calendar_year AS year,
962 | month,
963 | ROUND(retail_sales/monthly_sales::numeric*100,1) AS retail_percent_sales,
964 | ROUND(shopify_sales/monthly_sales::numeric*100,1) AS shopify_percent_sales
965 | FROM platform_sales_cte;
966 | """, conn)
967 | ```
968 |
969 |
970 |
971 |
972 | |
973 | year |
974 | month |
975 | retail_percent_sales |
976 | shopify_percent_sales |
977 |
978 |
979 |
980 |
981 | 0 |
982 | 2018 |
983 | April |
984 | 97.9 |
985 | 2.1 |
986 |
987 |
988 | 1 |
989 | 2018 |
990 | August |
991 | 97.7 |
992 | 2.3 |
993 |
994 |
995 | 2 |
996 | 2018 |
997 | July |
998 | 97.8 |
999 | 2.2 |
1000 |
1001 |
1002 | 3 |
1003 | 2018 |
1004 | June |
1005 | 97.8 |
1006 | 2.2 |
1007 |
1008 |
1009 | 4 |
1010 | 2018 |
1011 | March |
1012 | 97.9 |
1013 | 2.1 |
1014 |
1015 |
1016 | 5 |
1017 | 2018 |
1018 | May |
1019 | 97.7 |
1020 | 2.3 |
1021 |
1022 |
1023 | 6 |
1024 | 2018 |
1025 | September |
1026 | 97.7 |
1027 | 2.3 |
1028 |
1029 |
1030 | 7 |
1031 | 2019 |
1032 | April |
1033 | 97.8 |
1034 | 2.2 |
1035 |
1036 |
1037 | 8 |
1038 | 2019 |
1039 | August |
1040 | 97.2 |
1041 | 2.8 |
1042 |
1043 |
1044 | 9 |
1045 | 2019 |
1046 | July |
1047 | 97.4 |
1048 | 2.6 |
1049 |
1050 |
1051 | 10 |
1052 | 2019 |
1053 | June |
1054 | 97.4 |
1055 | 2.6 |
1056 |
1057 |
1058 | 11 |
1059 | 2019 |
1060 | March |
1061 | 97.7 |
1062 | 2.3 |
1063 |
1064 |
1065 | 12 |
1066 | 2019 |
1067 | May |
1068 | 97.5 |
1069 | 2.5 |
1070 |
1071 |
1072 | 13 |
1073 | 2019 |
1074 | September |
1075 | 97.1 |
1076 | 2.9 |
1077 |
1078 |
1079 | 14 |
1080 | 2020 |
1081 | April |
1082 | 97.0 |
1083 | 3.0 |
1084 |
1085 |
1086 | 15 |
1087 | 2020 |
1088 | August |
1089 | 96.5 |
1090 | 3.5 |
1091 |
1092 |
1093 | 16 |
1094 | 2020 |
1095 | July |
1096 | 96.7 |
1097 | 3.3 |
1098 |
1099 |
1100 | 17 |
1101 | 2020 |
1102 | June |
1103 | 96.8 |
1104 | 3.2 |
1105 |
1106 |
1107 | 18 |
1108 | 2020 |
1109 | March |
1110 | 97.3 |
1111 | 2.7 |
1112 |
1113 |
1114 | 19 |
1115 | 2020 |
1116 | May |
1117 | 96.7 |
1118 | 3.3 |
1119 |
1120 |
1121 |
1122 |
1123 |
1124 |
1125 |
1126 | **Result**\
1127 | Retail generates the vast majority of sales, accounting for roughly 97% of each month's total, while Shopify accounts for only about 2% to 3.5%, a share that trends upward over time (the overall split is computed in the sketch below).
1128 |
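To put a single figure on that overall split, here is a sketch that aggregates over the entire dataset rather than per month:

```python
# Overall Retail vs Shopify share of sales across the entire dataset (sketch).
pd.read_sql("""
SELECT
    platform,
    ROUND(SUM(sales) / (SELECT SUM(sales) FROM data_mart.clean_weekly_sales)::NUMERIC * 100, 1) AS percent_of_total_sales
FROM data_mart.clean_weekly_sales
GROUP BY platform;
""", conn)
```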
1129 | ___
1130 | #### 7. What is the percentage of sales by demographic for each year in the dataset?
1131 |
1132 |
1133 | ```python
1134 | pd.read_sql("""
1135 | WITH demographic_sales_cte AS
1136 | (
1137 | SELECT
1138 | calendar_year AS year,
1139 | SUM(sales) AS total_sales,
1140 | SUM(CASE WHEN demographic = 'Families' THEN sales END) AS families_sales,
1141 | SUM(CASE WHEN demographic = 'Couples' THEN sales END) AS couples_sales,
1142 | SUM(CASE WHEN demographic = 'unknown' THEN sales END) AS unknown_sales
1143 | FROM data_mart.clean_weekly_sales
1144 | GROUP BY year
1145 | )
1146 | SELECT
1147 | year,
1148 | ROUND(families_sales/total_sales::numeric*100,1) AS families_sales_percent,
1149 | ROUND(couples_sales/total_sales::numeric*100,1) AS couples_sales_percent,
1150 | ROUND(unknown_sales/total_sales::numeric*100,1) AS unknown_sales_percent
1151 | FROM demographic_sales_cte;
1152 | """, conn)
1153 | ```
1154 |
1155 |
1156 |
1157 |
1158 | |
1159 | year |
1160 | families_sales_percent |
1161 | couples_sales_percent |
1162 | unknown_sales_percent |
1163 |
1164 |
1165 |
1166 |
1167 | 0 |
1168 | 2018 |
1169 | 32.0 |
1170 | 26.4 |
1171 | 41.6 |
1172 |
1173 |
1174 | 1 |
1175 | 2020 |
1176 | 32.7 |
1177 | 28.7 |
1178 | 38.6 |
1179 |
1180 |
1181 | 2 |
1182 | 2019 |
1183 | 32.5 |
1184 | 27.3 |
1185 | 40.3 |
1186 |
1187 |
1188 |
1189 |
1190 |
1191 |
1192 |
1193 | **Result**\
1194 | Families contribute slightly more to Data Mart's sales than Couples each year, by roughly 4 to 6 percentage points, while close to 40% of sales come from customers with an unknown demographic.
1195 |
1196 | ___
1197 | #### 8. Which `age_band` and `demographic` values contribute the most to Retail sales?
1198 |
1199 |
1200 | ```python
1201 | query8 = """
1202 | SELECT
1203 | age_band,
1204 | demographic,
1205 | ROUND(SUM(sales)/(SELECT SUM(sales) FROM data_mart.clean_weekly_sales WHERE platform = 'Retail')::NUMERIC * 100,1) AS contribution_percent
1206 | FROM data_mart.clean_weekly_sales
1207 | WHERE platform = 'Retail'
1208 | GROUP BY age_band, demographic
1209 | ORDER BY contribution_percent DESC
1210 | """
1211 |
1212 | pd.read_sql(query8, conn)
1213 | ```
1214 |
1215 |
1216 |
1217 |
1218 | |
1219 | age_band |
1220 | demographic |
1221 | contribution_percent |
1222 |
1223 |
1224 |
1225 |
1226 | 0 |
1227 | unknown |
1228 | unknown |
1229 | 40.5 |
1230 |
1231 |
1232 | 1 |
1233 | Retirees |
1234 | Families |
1235 | 16.7 |
1236 |
1237 |
1238 | 2 |
1239 | Retirees |
1240 | Couples |
1241 | 16.1 |
1242 |
1243 |
1244 | 3 |
1245 | Middle Aged |
1246 | Families |
1247 | 11.0 |
1248 |
1249 |
1250 | 4 |
1251 | Young Adults |
1252 | Couples |
1253 | 6.6 |
1254 |
1255 |
1256 | 5 |
1257 | Middle Aged |
1258 | Couples |
1259 | 4.7 |
1260 |
1261 |
1262 | 6 |
1263 | Young Adults |
1264 | Families |
1265 | 4.5 |
1266 |
1267 |
1268 |
1269 |
1270 |
1271 |
1272 |
1273 | To isolate the top contributors, let's keep only the two rows with the highest contribution percentage.
1274 |
1275 |
1276 | ```python
1277 | pd.read_sql(query8 + "LIMIT 2", conn)
1278 | ```
1279 |
1280 |
1281 |
1282 |
1283 | |
1284 | age_band |
1285 | demographic |
1286 | contribution_percent |
1287 |
1288 |
1289 |
1290 |
1291 | 0 |
1292 | unknown |
1293 | unknown |
1294 | 40.5 |
1295 |
1296 |
1297 | 1 |
1298 | Retirees |
1299 | Families |
1300 | 16.7 |
1301 |
1302 |
1303 |
1304 |
1305 |
1306 |
1307 |
1308 | **Result**\
1309 | Setting aside the unknown segment (which alone accounts for 40.5% of Retail sales), **Retirees** in the **Families** demographic contribute the most to Retail sales, at 16.7%, closely followed by Retiree Couples at 16.1% (the sketch below re-ranks the known segments only).
1310 |
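As mentioned above, this sketch re-runs the same breakdown with the unknown segment filtered out, so the contribution percentages are relative to Retail sales from known segments only:

```python
# Same breakdown as query8, but excluding rows with an unknown segment (sketch).
# Percentages are relative to Retail sales from known segments only.
pd.read_sql("""
SELECT
    age_band,
    demographic,
    ROUND(SUM(sales) / (SELECT SUM(sales)
                        FROM data_mart.clean_weekly_sales
                        WHERE platform = 'Retail' AND age_band != 'unknown')::NUMERIC * 100, 1) AS contribution_percent
FROM data_mart.clean_weekly_sales
WHERE platform = 'Retail' AND age_band != 'unknown'
GROUP BY age_band, demographic
ORDER BY contribution_percent DESC;
""", conn)
```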
1311 | ___
1312 | #### 9. Can we use the `avg_transaction` column to find the average transaction size for each year for Retail vs Shopify? If not - how would you calculate it instead?
1313 |
1314 | **Method 1:** Not directly: averaging the pre-computed `avg_transaction` values (`AVG(avg_transaction)`) is an unweighted average of weekly ratios, which differs from the true average transaction size, `SUM(sales) / SUM(transactions)`. The query below computes both figures side by side for each year and platform.
1315 |
1316 |
1317 | ```python
1318 | pd.read_sql("""
1319 | SELECT
1320 | calendar_year AS year,
1321 | platform,
1322 | ROUND(AVG(avg_transaction),2) AS avg_transaction1,
1323 | ROUND(SUM(sales)::NUMERIC/SUM(transactions)::NUMERIC,2) AS avg_transaction2
1324 | FROM data_mart.clean_weekly_sales
1325 | GROUP BY year, platform
1326 | ORDER BY platform, year
1327 | """, conn)
1328 | ```
1329 |
1330 |
1331 |
1332 |
1333 | |
1334 | year |
1335 | platform |
1336 | avg_transaction1 |
1337 | avg_transaction2 |
1338 |
1339 |
1340 |
1341 |
1342 | 0 |
1343 | 2018 |
1344 | Retail |
1345 | 42.91 |
1346 | 36.56 |
1347 |
1348 |
1349 | 1 |
1350 | 2019 |
1351 | Retail |
1352 | 41.97 |
1353 | 36.83 |
1354 |
1355 |
1356 | 2 |
1357 | 2020 |
1358 | Retail |
1359 | 40.64 |
1360 | 36.56 |
1361 |
1362 |
1363 | 3 |
1364 | 2018 |
1365 | Shopify |
1366 | 188.28 |
1367 | 192.48 |
1368 |
1369 |
1370 | 4 |
1371 | 2019 |
1372 | Shopify |
1373 | 177.56 |
1374 | 183.36 |
1375 |
1376 |
1377 | 5 |
1378 | 2020 |
1379 | Shopify |
1380 | 174.87 |
1381 | 179.03 |
1382 |
1383 |
1384 |
1385 |
1386 |
1387 |
1388 |
1389 | **Method 2:** With the following method, we obtain the same results as averaging the `avg_transaction` values, since `AVG(sales/transactions)` computes the same unweighted row-level average.
1390 |
1391 |
1392 | ```python
1393 | pd.read_sql("""
1394 | SELECT
1395 | calendar_year AS year,
1396 | ROUND(AVG(CASE WHEN platform = 'Retail' THEN avg_transaction END),2) AS retail_avg_txn1,
1397 | ROUND(AVG(CASE WHEN platform = 'Retail' THEN sales::numeric/transactions::numeric END),2) AS retail_avg_txn2,
1398 | ROUND(AVG(CASE WHEN platform = 'Shopify' THEN avg_transaction END),2) AS shopify_avg_txn1,
1399 | ROUND(AVG(CASE WHEN platform = 'Shopify' THEN sales::numeric/transactions::numeric END),2) AS shopify_avg_txn2
1400 | FROM data_mart.clean_weekly_sales
1401 | GROUP BY year
1402 | """, conn)
1403 | ```
1404 |
1405 |
1406 |
1407 |
1408 | |
1409 | year |
1410 | retail_avg_txn1 |
1411 | retail_avg_txn2 |
1412 | shopify_avg_txn1 |
1413 | shopify_avg_txn2 |
1414 |
1415 |
1416 |
1417 |
1418 | 0 |
1419 | 2018 |
1420 | 42.91 |
1421 | 42.91 |
1422 | 188.28 |
1423 | 188.28 |
1424 |
1425 |
1426 | 1 |
1427 | 2020 |
1428 | 40.64 |
1429 | 40.64 |
1430 | 174.87 |
1431 | 174.87 |
1432 |
1433 |
1434 | 2 |
1435 | 2019 |
1436 | 41.97 |
1437 | 41.97 |
1438 | 177.56 |
1439 | 177.56 |
1440 |
1441 |
1442 |
1443 |
1444 |
1445 |
1446 |
1447 |
1448 |
1449 | ## C. Before & After Analysis
1450 | This technique is typically used when we want to assess the impact of an important event by comparing the periods before and after a certain point in time.
1451 |
1452 | Taking the `week_date` value of **2020-06-15** as the baseline week where the Data Mart sustainable packaging changes came into effect.
1453 |
1454 | We would include all `week_date` values from **2020-06-15** onward as the period after the change, and the previous `week_date` values as the period before.
1455 |
1456 | Using this analysis approach - answer the following questions:
1457 |
1458 | > 1. What is the total sales for the 4 weeks before and after 2020-06-15? What is the growth or reduction rate in actual values and percentage of sales?
1459 | > 2. What about the entire 12 weeks before and after?
1460 | > 3. How do the sale metrics for these 2 periods before and after compare with the previous years in 2018 and 2019?
1461 |
1462 | #### 1. What is the total sales for the 4 weeks before and after 2020-06-15? What is the growth or reduction rate in actual values and percentage of sales?
1463 |
1464 | The week number of the date **2020-06-15** is 25. Therefore,
1465 | - the four weeks before this date are `week 21, 22, 23, and 24`, and
1466 | - the four weeks after are `week 25, 26, 27, and 28`.
1467 |
1468 |
1469 | ```python
1470 | pd.read_sql("""
1471 | SELECT DISTINCT week_number
1472 | FROM data_mart.clean_weekly_sales
1473 | WHERE week_date = '2020-06-15'
1474 | """, conn)
1475 | ```
1476 |
1477 |
1478 |
1479 |
1480 | |
1481 | week_number |
1482 |
1483 |
1484 |
1485 |
1486 | 0 |
1487 | 25 |
1488 |
1489 |
1490 |
1491 |
1492 |
1493 |
1494 |
1495 | **Result**\
1496 | The implementation of sustainable packaging changes appears to have had a **negative impact** on sales at Data Mart, with a 1.15% reduction in sales over the four weeks after the change, as the query below shows.
1497 |
1498 |
1499 | ```python
1500 | pd.read_sql("""
1501 | SELECT
1502 | TO_CHAR(total_sales_before, 'FM$ 999,999,999,999') AS total_sales_before,
1503 | TO_CHAR(total_sales_after, 'FM$ 999,999,999,999') AS total_sales_after,
1504 | TO_CHAR(total_sales_after - total_sales_before, 'FM$ 999,999,999,999') AS difference,
1505 | ROUND((total_sales_after - total_sales_before)/total_sales_before::NUMERIC * 100,2) AS sales_change_percent
1506 | FROM
1507 | (
1508 | SELECT
1509 | SUM(CASE WHEN week_number BETWEEN 21 AND 24 THEN sales ELSE 0 END) AS total_sales_before,
1510 | SUM(CASE WHEN week_number BETWEEN 25 AND 28 THEN sales ELSE 0 END) AS total_sales_after
1511 | FROM data_mart.clean_weekly_sales
1512 | WHERE calendar_year = 2020
1513 | ) sales;
1514 | """, conn)
1515 | ```
1516 |
1517 |
1518 |
1519 |
1520 | |
1521 | total_sales_before |
1522 | total_sales_after |
1523 | difference |
1524 | sales_change_percent |
1525 |
1526 |
1527 |
1528 |
1529 | 0 |
1530 | $ 2,345,878,357 |
1531 | $ 2,318,994,169 |
1532 | $ -26,884,188 |
1533 | -1.15 |
1534 |
1535 |
1536 |
1537 |
1538 |
1539 |
1540 |
1541 | ___
1542 | #### 2. What about the entire 12 weeks before and after?
1543 |
1544 |
1545 | ```python
1546 | pd.read_sql("""
1547 | SELECT
1548 | TO_CHAR(total_sales_before, 'FM$ 999,999,999,999') AS total_sales_before,
1549 | TO_CHAR(total_sales_after, 'FM$ 999,999,999,999') AS total_sales_after,
1550 | TO_CHAR(total_sales_after - total_sales_before, 'FM$ 999,999,999,999') AS difference,
1551 | ROUND((total_sales_after - total_sales_before)/total_sales_before::NUMERIC * 100, 2) AS sales_change_percent
1552 | FROM
1553 | (
1554 | SELECT
1555 | SUM(CASE WHEN week_number BETWEEN 13 AND 24 THEN sales ELSE 0 END) AS total_sales_before,
1556 | SUM(CASE WHEN week_number BETWEEN 25 AND 37 THEN sales ELSE 0 END) AS total_sales_after
1557 | FROM data_mart.clean_weekly_sales
1558 | WHERE calendar_year = 2020
1559 | ) sales
1560 | """, conn)
1561 | ```
1562 |
1563 |
1564 |
1565 |
1566 | |
1567 | total_sales_before |
1568 | total_sales_after |
1569 | difference |
1570 | sales_change_percent |
1571 |
1572 |
1573 |
1574 |
1575 | 0 |
1576 | $ 7,126,273,147 |
1577 | $ 6,973,947,753 |
1578 | $ -152,325,394 |
1579 | -2.14 |
1580 |
1581 |
1582 |
1583 |
1584 |
1585 |
1586 |
1587 | **Result**\
1588 | The implementation of sustainable packaging changes appears to have had a **negative impact** on sales at Data Mart, with a 2.14% reduction in sales over the twelve weeks after the change.
1589 |
1590 | ___
1591 | #### 3. How do the sale metrics for these 2 periods before and after compare with the previous years in 2018 and 2019?
1592 |
1593 |
1594 | ```python
1595 | pd.read_sql("""
1596 | SELECT
1597 | year,
1598 | TO_CHAR(total_sales_before, 'FM$ 999,999,999,999') AS total_sales_before,
1599 | TO_CHAR(total_sales_after, 'FM$ 999,999,999,999') AS total_sales_after,
1600 | TO_CHAR(total_sales_after - total_sales_before, 'FM$ 999,999,999,999') AS difference,
1601 | ROUND((total_sales_after - total_sales_before)/total_sales_before::NUMERIC * 100, 2) AS sales_change_percent
1602 | FROM
1603 | (
1604 | SELECT
1605 | calendar_year AS year,
1606 | SUM(CASE WHEN week_number BETWEEN 13 AND 24 THEN sales ELSE 0 END) AS total_sales_before,
1607 | SUM(CASE WHEN week_number BETWEEN 25 AND 37 THEN sales ELSE 0 END) AS total_sales_after
1608 | FROM data_mart.clean_weekly_sales
1609 | GROUP BY year
1610 | ) sales
1611 | ORDER BY year
1612 | """, conn)
1613 | ```
1614 |
1615 |
1616 |
1617 |
1618 | |
1619 | year |
1620 | total_sales_before |
1621 | total_sales_after |
1622 | difference |
1623 | sales_change_percent |
1624 |
1625 |
1626 |
1627 |
1628 | 0 |
1629 | 2018 |
1630 | $ 6,396,562,317 |
1631 | $ 6,500,818,510 |
1632 | $ 104,256,193 |
1633 | 1.63 |
1634 |
1635 |
1636 | 1 |
1637 | 2019 |
1638 | $ 6,883,386,397 |
1639 | $ 6,862,646,103 |
1640 | $ -20,740,294 |
1641 | -0.30 |
1642 |
1643 |
1644 | 2 |
1645 | 2020 |
1646 | $ 7,126,273,147 |
1647 | $ 6,973,947,753 |
1648 | $ -152,325,394 |
1649 | -2.14 |
1650 |
1651 |
1652 |
1653 |
1654 |
1655 |
1656 |
1657 |
1658 |
1659 | ## D. Bonus Question
1660 | Which areas of the business have the highest negative impact in sales metrics performance in 2020 for the 12 week before and after period?
1661 |
1662 | - region
1663 | - platform
1664 | - age_band
1665 | - demographic
1666 | - customer_type
1667 |
1668 | Do you have any further recommendations for Danny’s team at Data Mart or any interesting insights based off this analysis?
1669 |
1670 | ### Overview
1671 | The following query output displays the areas of the business that experienced the highest negative impact on sales performance in 2020. The `WHERE sales_before > sales_after` condition keeps only the combinations with a negative percentage change, and the 10 worst ones are displayed.
1672 |
1673 |
1674 | ```python
1675 | pd.read_sql("""
1676 | WITH overview_sales_cte AS
1677 | (
1678 | SELECT
1679 | region,
1680 | platform,
1681 | age_band,
1682 | demographic,
1683 | customer_type,
1684 | SUM(CASE WHEN week_number BETWEEN 13 AND 24 THEN sales ELSE 0 END) AS sales_before,
1685 | SUM(CASE WHEN week_number BETWEEN 25 AND 37 THEN sales ELSE 0 END) AS sales_after
1686 | FROM data_mart.clean_weekly_sales
1687 | WHERE calendar_year = 2020
1688 | GROUP BY region, platform, age_band, demographic, customer_type
1689 | )
1690 | SELECT
1691 | region,
1692 | platform,
1693 | age_band,
1694 | demographic,
1695 | customer_type,
1696 | ROUND((sales_after - sales_before)::NUMERIC/sales_before::NUMERIC * 100, 2) AS sales_change_percent
1697 | FROM overview_sales_cte
1698 | WHERE sales_before > sales_after
1699 | ORDER BY sales_change_percent ASC
1700 | LIMIT 10;
1701 | """, conn)
1702 | ```
1703 |
1704 |
1705 |
1706 |
1707 | |
1708 | region |
1709 | platform |
1710 | age_band |
1711 | demographic |
1712 | customer_type |
1713 | sales_change_percent |
1714 |
1715 |
1716 |
1717 |
1718 | 0 |
1719 | SOUTH AMERICA |
1720 | Shopify |
1721 | unknown |
1722 | unknown |
1723 | Existing |
1724 | -42.23 |
1725 |
1726 |
1727 | 1 |
1728 | EUROPE |
1729 | Shopify |
1730 | Retirees |
1731 | Families |
1732 | New |
1733 | -33.71 |
1734 |
1735 |
1736 | 2 |
1737 | EUROPE |
1738 | Shopify |
1739 | Young Adults |
1740 | Families |
1741 | New |
1742 | -27.97 |
1743 |
1744 |
1745 | 3 |
1746 | SOUTH AMERICA |
1747 | Retail |
1748 | unknown |
1749 | unknown |
1750 | Existing |
1751 | -23.20 |
1752 |
1753 |
1754 | 4 |
1755 | SOUTH AMERICA |
1756 | Retail |
1757 | Retirees |
1758 | Families |
1759 | New |
1760 | -21.28 |
1761 |
1762 |
1763 | 5 |
1764 | SOUTH AMERICA |
1765 | Shopify |
1766 | Middle Aged |
1767 | Families |
1768 | New |
1769 | -19.73 |
1770 |
1771 |
1772 | 6 |
1773 | SOUTH AMERICA |
1774 | Shopify |
1775 | Retirees |
1776 | Families |
1777 | New |
1778 | -19.04 |
1779 |
1780 |
1781 | 7 |
1782 | SOUTH AMERICA |
1783 | Shopify |
1784 | Retirees |
1785 | Couples |
1786 | New |
1787 | -18.39 |
1788 |
1789 |
1790 | 8 |
1791 | SOUTH AMERICA |
1792 | Retail |
1793 | Retirees |
1794 | Couples |
1795 | Existing |
1796 | -16.80 |
1797 |
1798 |
1799 | 9 |
1800 | SOUTH AMERICA |
1801 | Retail |
1802 | Retirees |
1803 | Couples |
1804 | New |
1805 | -15.86 |
1806 |
1807 |
1808 |
1809 |
1810 |
1811 |
1812 |
1813 | ___
1814 | In the following sections, I will analyze each area individually to determine which category has been impacted the most.
1815 |
1816 | ### Sales By Region
1817 | The **Asia** region has been the most negatively impacted by the implementation of the new system at Data Mart.
1818 |
1819 |
1820 | ```python
1821 | pd.read_sql("""
1822 | WITH region_sales_cte AS
1823 | (
1824 | SELECT
1825 | region,
1826 | SUM(CASE WHEN week_number BETWEEN 13 AND 24 THEN sales END) AS sales_before,
1827 | SUM(CASE WHEN week_number BETWEEN 25 AND 37 THEN sales END) AS sales_after
1828 | FROM data_mart.clean_weekly_sales
1829 | WHERE calendar_year = 2020
1830 | GROUP BY region
1831 | )
1832 | SELECT
1833 | region,
1834 | ROUND((sales_after - sales_before)/sales_before::NUMERIC * 100,2) AS sales_change_percent
1835 | FROM region_sales_cte
1836 | ORDER BY sales_change_percent ASC;
1837 | """, conn)
1838 | ```
1839 |
1840 |
1841 |
1842 |
1843 | |
1844 | region |
1845 | sales_change_percent |
1846 |
1847 |
1848 |
1849 |
1850 | 0 |
1851 | ASIA |
1852 | -3.26 |
1853 |
1854 |
1855 | 1 |
1856 | OCEANIA |
1857 | -3.03 |
1858 |
1859 |
1860 | 2 |
1861 | SOUTH AMERICA |
1862 | -2.15 |
1863 |
1864 |
1865 | 3 |
1866 | CANADA |
1867 | -1.92 |
1868 |
1869 |
1870 | 4 |
1871 | USA |
1872 | -1.60 |
1873 |
1874 |
1875 | 5 |
1876 | AFRICA |
1877 | -0.54 |
1878 |
1879 |
1880 | 6 |
1881 | EUROPE |
1882 | 4.73 |
1883 |
1884 |
1885 |
1886 |
1887 |
1888 |
1889 |
1890 | ### Sales By Platform
1891 |
1892 | The **Retail** platform has been the most negatively impacted by the implementation of the new system at Data Mart.
1893 |
1894 |
1895 | ```python
1896 | pd.read_sql("""
1897 | WITH platform_sales_cte AS
1898 | (
1899 | SELECT
1900 | platform,
1901 | SUM(CASE WHEN week_number BETWEEN 13 AND 24 THEN sales END) AS sales_before,
1902 | SUM(CASE WHEN week_number BETWEEN 25 AND 37 THEN sales END) AS sales_after
1903 | FROM data_mart.clean_weekly_sales
1904 | WHERE calendar_year = 2020
1905 | GROUP BY platform
1906 | )
1907 | SELECT
1908 | platform,
1909 | ROUND((sales_after - sales_before)/sales_before::NUMERIC * 100,2) AS sales_change_percent
1910 | FROM platform_sales_cte
1911 | ORDER BY sales_change_percent ASC;
1912 | """, conn)
1913 | ```
1914 |
1915 |
1916 |
1917 |
1918 | |
1919 | platform |
1920 | sales_change_percent |
1921 |
1922 |
1923 |
1924 |
1925 | 0 |
1926 | Retail |
1927 | -2.43 |
1928 |
1929 |
1930 | 1 |
1931 | Shopify |
1932 | 7.18 |
1933 |
1934 |
1935 |
1936 |
1937 |
1938 |
1939 |
1940 | ### Sales By Age_Band
1941 | Excluding the unknown segment, **Middle Aged** customers have been the most negatively impacted by the implementation of the new system at Data Mart.
1942 |
1943 |
1944 | ```python
1945 | pd.read_sql("""
1946 | WITH age_band_sales_cte AS
1947 | (
1948 | SELECT
1949 | age_band,
1950 | SUM(CASE WHEN week_number BETWEEN 13 AND 24 THEN sales END) AS sales_before,
1951 | SUM(CASE WHEN week_number BETWEEN 25 AND 37 THEN sales END) AS sales_after
1952 | FROM data_mart.clean_weekly_sales
1953 | WHERE calendar_year = 2020
1954 | GROUP BY age_band
1955 | )
1956 | SELECT
1957 | age_band,
1958 | ROUND((sales_after - sales_before)/sales_before::NUMERIC * 100,2) AS sales_change_percent
1959 | FROM age_band_sales_cte
1960 | ORDER BY sales_change_percent ASC;
1961 | """, conn)
1962 | ```
1963 |
1964 |
1965 |
1966 |
1967 | |
1968 | age_band |
1969 | sales_change_percent |
1970 |
1971 |
1972 |
1973 |
1974 | 0 |
1975 | unknown |
1976 | -3.34 |
1977 |
1978 |
1979 | 1 |
1980 | Middle Aged |
1981 | -1.97 |
1982 |
1983 |
1984 | 2 |
1985 | Retirees |
1986 | -1.23 |
1987 |
1988 |
1989 | 3 |
1990 | Young Adults |
1991 | -0.92 |
1992 |
1993 |
1994 |
1995 |
1996 |
1997 |
1998 |
1999 | ### Sales By Demographic
2000 | Excluding the unknown segment, the **Families** demographic has been the most negatively impacted by the implementation of the new system at Data Mart.
2001 |
2002 |
2003 | ```python
2004 | pd.read_sql("""
2005 | WITH demographic_sales_cte AS
2006 | (
2007 | SELECT
2008 | demographic,
2009 | SUM(CASE WHEN week_number BETWEEN 13 AND 24 THEN sales END) AS sales_before,
2010 | SUM(CASE WHEN week_number BETWEEN 25 AND 37 THEN sales END) AS sales_after
2011 | FROM data_mart.clean_weekly_sales
2012 | WHERE calendar_year = 2020
2013 | GROUP BY demographic
2014 | )
2015 | SELECT
2016 | demographic,
2017 | ROUND((sales_after - sales_before)/sales_before::NUMERIC * 100,2) AS sales_change_percent
2018 | FROM demographic_sales_cte
2019 | ORDER BY sales_change_percent ASC;
2020 | """, conn)
2021 | ```
2022 |
2023 |
2024 |
2025 |
2026 | |
2027 | demographic |
2028 | sales_change_percent |
2029 |
2030 |
2031 |
2032 |
2033 | 0 |
2034 | unknown |
2035 | -3.34 |
2036 |
2037 |
2038 | 1 |
2039 | Families |
2040 | -1.82 |
2041 |
2042 |
2043 | 2 |
2044 | Couples |
2045 | -0.87 |
2046 |
2047 |
2048 |
2049 |
2050 |
2051 |
2052 |
2053 | ### Sales By Customer_Type
2054 |
2055 | The **Guest** customer type has been the most negatively impacted by the implementation of the new system at Data Mart.
2056 |
2057 |
2058 | ```python
2059 | pd.read_sql("""
2060 | WITH customer_type_sales_cte AS
2061 | (
2062 | SELECT
2063 | customer_type,
2064 | SUM(CASE WHEN week_number BETWEEN 13 AND 24 THEN sales END) AS sales_before,
2065 | SUM(CASE WHEN week_number BETWEEN 25 AND 37 THEN sales END) AS sales_after
2066 | FROM data_mart.clean_weekly_sales
2067 | WHERE calendar_year = 2020
2068 | GROUP BY customer_type
2069 | )
2070 | SELECT
2071 | customer_type,
2072 | ROUND((sales_after - sales_before)/sales_before::NUMERIC * 100,2) AS sales_change_percent
2073 | FROM customer_type_sales_cte
2074 | ORDER BY sales_change_percent ASC;
2075 | """, conn)
2076 | ```
2077 |
2078 |
2079 |
2080 |
2081 | |
2082 | customer_type |
2083 | sales_change_percent |
2084 |
2085 |
2086 |
2087 |
2088 | 0 |
2089 | Guest |
2090 | -3.00 |
2091 |
2092 |
2093 | 1 |
2094 | Existing |
2095 | -2.27 |
2096 |
2097 |
2098 | 2 |
2099 | New |
2100 | 1.01 |
2101 |
2102 |
2103 |
2104 |
2105 |
2106 |
2107 |
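For a more compact overview, the per-dimension breakdowns above can be stacked into a single result with `UNION ALL`. This is only a sketch showing two of the five dimensions (the `dimension`/`value` column names are mine); the remaining dimensions follow the same pattern:

```python
# Sketch: stack per-dimension before/after comparisons into one result with UNION ALL.
# Only region and platform are shown; age_band, demographic and customer_type
# can be added with additional UNION ALL branches of the same shape.
pd.read_sql("""
WITH base AS
(
    SELECT
        region,
        platform,
        SUM(CASE WHEN week_number BETWEEN 13 AND 24 THEN sales END) AS sales_before,
        SUM(CASE WHEN week_number BETWEEN 25 AND 37 THEN sales END) AS sales_after
    FROM data_mart.clean_weekly_sales
    WHERE calendar_year = 2020
    GROUP BY region, platform
)
SELECT 'region' AS dimension, region AS value,
       ROUND((SUM(sales_after) - SUM(sales_before)) / SUM(sales_before)::NUMERIC * 100, 2) AS sales_change_percent
FROM base
GROUP BY region
UNION ALL
SELECT 'platform', platform,
       ROUND((SUM(sales_after) - SUM(sales_before)) / SUM(sales_before)::NUMERIC * 100, 2)
FROM base
GROUP BY platform
ORDER BY dimension, sales_change_percent;
""", conn)
```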
2108 |
2109 | ```python
2110 | conn.close()
2111 | ```
2112 |
--------------------------------------------------------------------------------
/CaseStudy#5 - Data Mart/README.md:
--------------------------------------------------------------------------------
1 | # Case Study #5: Data Mart 🧺
2 | [](https://github.com/chanronnie/8WeekSQLChallenge)
3 | [](https://8weeksqlchallenge.com/case-study-5/)
4 | 
5 |
6 |
7 | The case study presented here is part of the **8 Week SQL Challenge**.\
8 | It is kindly brought to us by [**Data With Danny**](https://8weeksqlchallenge.com).
9 |
10 | This time, I am using `PostgreSQL queries` (instead of MySQL) in `Jupyter Notebook` to quickly view results, which provides me with an opportunity
11 | - to learn PostgreSQL
12 | - to utilize handy mathematical and string functions.
13 |
14 |
15 |
16 | ## Table of Contents
17 | * [Entity Relationship Diagram](#entity-relationship-diagram)
18 | * [Datasets](#datasets)
19 | * [Case Study Questions](#case-study-questions)
20 | * [Solutions](#solutions)
21 | * [PostgreSQL Topics Covered](#postgresql-topics-covered)
22 |
23 |
24 | ## Entity Relationship Diagram
25 | 
26 |
27 |
28 | ## Datasets
29 | Case Study #5 contains 1 dataset:
30 | **`weekly_sales`**: This table shows all the sales records made at Data Mart.
31 |
32 |
33 | View dataset
34 |
35 | Here are 10 random rows from the `weekly_sales` dataset:
36 |
37 | week_date | region | platform | segment | customer_type | transactions | sales
38 | --- | --- | --- | --- | --- | --- | ---
39 | 9/9/20 | OCEANIA | Shopify | C3 | New | 610 | 110033.89
40 | 29/7/20 | AFRICA | Retail | C1 | New | 110692 | 3053771.19
41 | 22/7/20 | EUROPE | Shopify | C4 | Existing | 24 | 8101.54
42 | 13/5/20 | AFRICA | Shopify | null | Guest | 5287 | 1003301.37
43 | 24/7/19 | ASIA | Retail | C1 | New | 127342 | 3151780.41
44 | 10/7/19 | CANADA | Shopify | F3 | New | 51 | 8844.93
45 | 26/6/19 | OCEANIA | Retail | C3 | New | 152921 | 5551385.36
46 | 29/5/19 | SOUTH AMERICA | Shopify | null | New | 53 | 10056.2
47 | 22/8/18 | AFRICA | Retail | null | Existing | 31721 | 1718863.58
48 | 25/7/18 | SOUTH AMERICA | Retail | null | New | 2136 | 81757.91
49 |
50 |
51 |
52 | ## Case Study Questions
53 | Case Study #5 is categorized into 4 question groups\
54 | To view the specific section, please open the link in a *`new tab`* or *`window`*.\
55 | [A. Data Cleansing](CaseStudy5_solutions.md#A)\
56 | [B. Data Exploration](CaseStudy5_solutions.md#B)\
57 | [C. Before & After Analysis](CaseStudy5_solutions.md#C)\
58 | [D. Bonus Question](CaseStudy5_solutions.md#D)
59 |
60 | ## Solutions
61 | - View `data_mart` database: [**here**](CaseStudy5_schema.sql)
62 | - View Solution:
63 | - [**Markdown File**](CaseStudy5_solutions.md): offers a more fluid and responsive viewing experience
64 | - [**Jupyter Notebook**](CaseStudy5_solutions.ipynb): contains the original code
65 |
66 |
67 | ## PostgreSQL Topics Covered
68 | - Data Cleaning
69 | - Pivot Table using CASE WHEN
70 | - DATE Type Manipulation
71 | - Common Table Expressions (CTE)
72 | - Window Functions
73 | - Subqueries
74 | - JOIN, UNION ALL
75 |
--------------------------------------------------------------------------------
/CaseStudy#5 - Data Mart/image/plot.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/chanronnie/8WeekSQLChallenge/4b36f745d833a82a61a9ce70ab177462eb7dbfd0/CaseStudy#5 - Data Mart/image/plot.png
--------------------------------------------------------------------------------
/CaseStudy#6 - Clique Bait/CaseStudy6_solutions.md:
--------------------------------------------------------------------------------
1 | # Case Study #6: Clique Bait
2 | The case study questions presented here are created by [**Data With Danny**](https://linktr.ee/datawithdanny). They are part of the [**8 Week SQL Challenge**](https://8weeksqlchallenge.com/).
3 |
4 | My SQL queries are written in the `PostgreSQL 15` dialect, integrated into `Jupyter Notebook`, which allows us to instantly view the query results and document the queries.
5 |
6 | For more details about the **Case Study #6**, click [**here**](https://8weeksqlchallenge.com/case-study-6/).
7 |
8 | ## Table of Contents
9 |
10 | ### [1. Importing Libraries](#Import)
11 |
12 | ### [2. Tables of the Database](#Tables)
13 |
14 | ### [3. Case Study Questions](#CaseStudyQuestions)
15 |
16 | - [A. Enterprise Relationship Diagram](#A)
17 | - [B. Digital Analysis](#B)
18 | - [C. Product Funnel Analysis](#C)
19 | - [D. Campaigns Analysis](#D)
20 |
21 |
22 |
23 | ## 1. Importing Libraries
24 |
25 |
26 | ```python
27 | import psycopg2 as pg2
28 | import pandas as pd
29 | import os
30 | import warnings
31 |
32 | warnings.filterwarnings("ignore")
33 | ```
34 |
35 | ### Connecting PostgreSQL database from Jupyter Notebook
36 |
37 |
38 | ```python
39 | # Get my PostgreSQL password
40 | mypassword = os.getenv("POSTGRESQL_PASSWORD")
41 |
42 | # Connecting to database
43 | conn = pg2.connect(user = 'postgres', password = mypassword, database = 'clique_bait')
44 | cursor = conn.cursor()
45 | ```
46 |
47 | ___
48 |
49 | ## 2. Tables of the Database
50 |
51 | First, let's verify that the connected database contains the 5 expected tables.
52 |
53 |
54 | ```python
55 | cursor.execute("""
56 | SELECT table_schema, table_name
57 | FROM information_schema.tables
58 | WHERE table_schema = 'clique_bait'
59 | """)
60 |
61 | table_names = []
62 | print('--- Tables within "clique_bait" database --- ')
63 | for table in cursor:
64 | print(table[1])
65 | table_names.append(table[1])
66 | ```
67 |
68 |     --- Tables within "clique_bait" database --- 
69 | event_identifier
70 | campaign_identifier
71 | page_hierarchy
72 | users
73 | events
74 |
75 |
76 | #### Here are the 5 datasets of the "clique_bait" database. For more details about each dataset, please click [here](https://8weeksqlchallenge.com/case-study-6/).
77 |
78 |
79 | ```python
80 | for table in table_names:
81 | print("\nTable: ", table)
82 | display(pd.read_sql("SELECT * FROM clique_bait." + table, conn))
83 | ```
84 |
85 |
86 | Table: event_identifier
87 |
88 |
89 |
90 |
91 | |
92 | event_type |
93 | event_name |
94 |
95 |
96 |
97 |
98 | 0 |
99 | 1 |
100 | Page View |
101 |
102 |
103 | 1 |
104 | 2 |
105 | Add to Cart |
106 |
107 |
108 | 2 |
109 | 3 |
110 | Purchase |
111 |
112 |
113 | 3 |
114 | 4 |
115 | Ad Impression |
116 |
117 |
118 | 4 |
119 | 5 |
120 | Ad Click |
121 |
122 |
123 |
124 |
125 |
126 |
127 |
128 | Table: campaign_identifier
129 |
130 |
131 |
132 |
133 | |
134 | campaign_id |
135 | products |
136 | campaign_name |
137 | start_date |
138 | end_date |
139 |
140 |
141 |
142 |
143 | 0 |
144 | 1 |
145 | 1-3 |
146 | BOGOF - Fishing For Compliments |
147 | 2020-01-01 |
148 | 2020-01-14 |
149 |
150 |
151 | 1 |
152 | 2 |
153 | 4-5 |
154 | 25% Off - Living The Lux Life |
155 | 2020-01-15 |
156 | 2020-01-28 |
157 |
158 |
159 | 2 |
160 | 3 |
161 | 6-8 |
162 | Half Off - Treat Your Shellf(ish) |
163 | 2020-02-01 |
164 | 2020-03-31 |
165 |
166 |
167 |
168 |
169 |
170 |
171 |
172 | Table: page_hierarchy
173 |
174 |
175 |
176 |
177 | |
178 | page_id |
179 | page_name |
180 | product_category |
181 | product_id |
182 |
183 |
184 |
185 |
186 | 0 |
187 | 1 |
188 | Home Page |
189 | None |
190 | NaN |
191 |
192 |
193 | 1 |
194 | 2 |
195 | All Products |
196 | None |
197 | NaN |
198 |
199 |
200 | 2 |
201 | 3 |
202 | Salmon |
203 | Fish |
204 | 1.0 |
205 |
206 |
207 | 3 |
208 | 4 |
209 | Kingfish |
210 | Fish |
211 | 2.0 |
212 |
213 |
214 | 4 |
215 | 5 |
216 | Tuna |
217 | Fish |
218 | 3.0 |
219 |
220 |
221 | 5 |
222 | 6 |
223 | Russian Caviar |
224 | Luxury |
225 | 4.0 |
226 |
227 |
228 | 6 |
229 | 7 |
230 | Black Truffle |
231 | Luxury |
232 | 5.0 |
233 |
234 |
235 | 7 |
236 | 8 |
237 | Abalone |
238 | Shellfish |
239 | 6.0 |
240 |
241 |
242 | 8 |
243 | 9 |
244 | Lobster |
245 | Shellfish |
246 | 7.0 |
247 |
248 |
249 | 9 |
250 | 10 |
251 | Crab |
252 | Shellfish |
253 | 8.0 |
254 |
255 |
256 | 10 |
257 | 11 |
258 | Oyster |
259 | Shellfish |
260 | 9.0 |
261 |
262 |
263 | 11 |
264 | 12 |
265 | Checkout |
266 | None |
267 | NaN |
268 |
269 |
270 | 12 |
271 | 13 |
272 | Confirmation |
273 | None |
274 | NaN |
275 |
276 |
277 |
278 |
279 |
280 |
281 |
282 | Table: users
283 |
284 |
285 |
286 |
287 | |
288 | user_id |
289 | cookie_id |
290 | start_date |
291 |
292 |
293 |
294 |
295 | 0 |
296 | 1 |
297 | c4ca42 |
298 | 2020-02-04 |
299 |
300 |
301 | 1 |
302 | 2 |
303 | c81e72 |
304 | 2020-01-18 |
305 |
306 |
307 | 2 |
308 | 3 |
309 | eccbc8 |
310 | 2020-02-21 |
311 |
312 |
313 | 3 |
314 | 4 |
315 | a87ff6 |
316 | 2020-02-22 |
317 |
318 |
319 | 4 |
320 | 5 |
321 | e4da3b |
322 | 2020-02-01 |
323 |
324 |
325 | ... |
326 | ... |
327 | ... |
328 | ... |
329 |
330 |
331 | 1777 |
332 | 25 |
333 | 46dd2f |
334 | 2020-03-29 |
335 |
336 |
337 | 1778 |
338 | 94 |
339 | 59511b |
340 | 2020-03-22 |
341 |
342 |
343 | 1779 |
344 | 49 |
345 | d345a8 |
346 | 2020-02-23 |
347 |
348 |
349 | 1780 |
350 | 211 |
351 | a26e03 |
352 | 2020-02-20 |
353 |
354 |
355 | 1781 |
356 | 64 |
357 | 87a4ba |
358 | 2020-03-18 |
359 |
360 |
361 |
362 | 1782 rows × 3 columns
363 |
364 |
365 |
366 |
367 | Table: events
368 |
369 |
370 |
371 |
372 | |
373 | visit_id |
374 | cookie_id |
375 | page_id |
376 | event_type |
377 | sequence_number |
378 | event_time |
379 |
380 |
381 |
382 |
383 | 0 |
384 | ccf365 |
385 | c4ca42 |
386 | 1 |
387 | 1 |
388 | 1 |
389 | 2020-02-04 19:16:09.182546 |
390 |
391 |
392 | 1 |
393 | ccf365 |
394 | c4ca42 |
395 | 2 |
396 | 1 |
397 | 2 |
398 | 2020-02-04 19:16:17.358191 |
399 |
400 |
401 | 2 |
402 | ccf365 |
403 | c4ca42 |
404 | 6 |
405 | 1 |
406 | 3 |
407 | 2020-02-04 19:16:58.454669 |
408 |
409 |
410 | 3 |
411 | ccf365 |
412 | c4ca42 |
413 | 9 |
414 | 1 |
415 | 4 |
416 | 2020-02-04 19:16:58.609142 |
417 |
418 |
419 | 4 |
420 | ccf365 |
421 | c4ca42 |
422 | 9 |
423 | 2 |
424 | 5 |
425 | 2020-02-04 19:17:51.729420 |
426 |
427 |
428 | ... |
429 | ... |
430 | ... |
431 | ... |
432 | ... |
433 | ... |
434 | ... |
435 |
436 |
437 | 32729 |
438 | 355a6a |
439 | 87a4ba |
440 | 10 |
441 | 1 |
442 | 15 |
443 | 2020-03-18 22:44:16.541396 |
444 |
445 |
446 | 32730 |
447 | 355a6a |
448 | 87a4ba |
449 | 11 |
450 | 1 |
451 | 16 |
452 | 2020-03-18 22:44:18.900830 |
453 |
454 |
455 | 32731 |
456 | 355a6a |
457 | 87a4ba |
458 | 11 |
459 | 2 |
460 | 17 |
461 | 2020-03-18 22:45:12.670472 |
462 |
463 |
464 | 32732 |
465 | 355a6a |
466 | 87a4ba |
467 | 12 |
468 | 1 |
469 | 18 |
470 | 2020-03-18 22:45:54.081818 |
471 |
472 |
473 | 32733 |
474 | 355a6a |
475 | 87a4ba |
476 | 13 |
477 | 3 |
478 | 19 |
479 | 2020-03-18 22:45:54.984666 |
480 |
481 |
482 |
483 | 32734 rows × 6 columns
484 |
485 |
486 |
487 |
488 |
489 | ## 3. Case Study Questions
490 |
491 |
492 | ## A. Enterprise Relationship Diagram
493 | Using the following DDL schema details to create an ERD for all the Clique Bait datasets.
494 | [Click here](https://dbdiagram.io/) to access the DB Diagram tool to create the ERD.
495 |
496 |
497 |
498 | **ANSWER**
499 | Here is the code that I used for creating the **ERD** for all the Clique Bait datasets on [DB Diagram tool](https://dbdiagram.io/).
500 |
501 | ```
502 | TABLE event_identifier {
503 | "event_type" INTEGER
504 | "event_name" VARCHAR(13)
505 | }
506 |
507 | TABLE campaign_identifier {
508 | "campaign_id" INTEGER
509 | "products" VARCHAR(3)
510 | "campaign_name" VARCHAR(33)
511 | "start_date" TIMESTAMP
512 | "end_date" TIMESTAMP
513 | }
514 |
515 | TABLE page_hierarchy {
516 | "page_id" INTEGER
517 | "page_name" VARCHAR(14)
518 | "product_category" VARCHAR(9)
519 | "product_id" INTEGER
520 | }
521 |
522 | TABLE users {
523 | "user_id" INTEGER
524 | "cookie_id" VARCHAR(6)
525 | "start_date" TIMESTAMP
526 | }
527 |
528 | TABLE events {
529 | "visit_id" VARCHAR(6)
530 | "cookie_id" VARCHAR(6)
531 | "page_id" INTEGER
532 | "event_type" INTEGER
533 | "sequence_number" INTEGER
534 | "event_time" TIMESTAMP
535 | }
536 |
537 | // Establish connections or references between datasets
538 | Ref: "events"."event_type" > "event_identifier"."event_type"
539 | Ref: "events"."page_id" > "page_hierarchy"."page_id"
540 | Ref: "events"."cookie_id" > "users"."cookie_id"
541 | ```
542 |
543 | **Result**
544 | 
545 |
546 |
547 |
548 | ## B. Digital Analysis
549 |
550 | #### 1. How many users are there?
551 |
552 |
553 | ```python
554 | pd.read_sql("""
555 | SELECT COUNT(DISTINCT user_id) AS nb_users
556 | FROM clique_bait.users;
557 | """, conn)
558 | ```
559 |
560 |
561 |
562 |
563 | |
564 | nb_users |
565 |
566 |
567 |
568 |
569 | 0 |
570 | 500 |
571 |
572 |
573 |
574 |
575 |
576 |
577 |
578 | **Result**\
579 | There are 500 users.
580 |
581 | ___
582 | #### 2. How many cookies does each user have on average?
583 |
584 |
585 | ```python
586 | pd.read_sql("""
587 | SELECT ROUND(AVG(nb_cookie_ids))::INTEGER AS avg_cookies_per_user
588 | FROM
589 | (
590 | SELECT DISTINCT user_id, COUNT(cookie_id) AS nb_cookie_ids
591 | FROM clique_bait.users
592 | GROUP BY user_id
593 | ) nb_cookies_per_user;
594 | """, conn)
595 | ```
596 |
597 |
598 |
599 |
600 | |
601 | avg_cookies_per_user |
602 |
603 |
604 |
605 |
606 | 0 |
607 | 4 |
608 |
609 |
610 |
611 |
612 |
613 |
614 |
615 | **Result**\
616 | Each user has an average of 4 cookies.
617 |
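The same figure can be obtained without a subquery (a sketch): divide the total number of cookies by the number of distinct users.

```python
# Sketch: average number of cookies per user, without a subquery.
pd.read_sql("""
SELECT ROUND(COUNT(cookie_id)::NUMERIC / COUNT(DISTINCT user_id)) AS avg_cookies_per_user
FROM clique_bait.users;
""", conn)
```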
618 | ___
619 | #### 3. What is the unique number of visits by all users per month?
620 |
621 |
622 | ```python
623 | pd.read_sql("""
624 | SELECT
625 | DATE_PART('month', u.start_date)::INTEGER AS month,
626 | TO_CHAR(u.start_date,'Month') AS month_name,
627 | COUNT(DISTINCT e.visit_id) AS nb_visits
628 | FROM clique_bait.users u
629 | JOIN clique_bait.events e ON u.cookie_id = e.cookie_id
630 | GROUP BY month, month_name
631 | ORDER BY month
632 | """, conn)
633 | ```
634 |
635 |
636 |
637 |
638 | |
639 | month |
640 | month_name |
641 | nb_visits |
642 |
643 |
644 |
645 |
646 | 0 |
647 | 1 |
648 | January |
649 | 876 |
650 |
651 |
652 | 1 |
653 | 2 |
654 | February |
655 | 1488 |
656 |
657 |
658 | 2 |
659 | 3 |
660 | March |
661 | 916 |
662 |
663 |
664 | 3 |
665 | 4 |
666 | April |
667 | 248 |
668 |
669 |
670 | 4 |
671 | 5 |
672 | May |
673 | 36 |
674 |
675 |
676 |
677 |
678 |
679 |
680 |
681 | **Result**
682 | - The month of **February** has the highest number of visits on the Clique Bait website, with a total of 1,488 visits.
683 | - Following February, **March** and **January** are the second and third most visited months, accounting for 916 visits and 876 visits, respectively.
684 | - The month of **May** has the lowest number of visits, with only 36.
685 |
686 | ___
687 | #### 4. What is the number of events for each event type?
688 |
689 |
690 | ```python
691 | pd.read_sql("""
692 | SELECT
693 | ei.event_name,
694 | COUNT(*) AS nb_events
695 | FROM clique_bait.events e
696 |     JOIN clique_bait.event_identifier ei ON e.event_type = ei.event_type
697 | GROUP BY ei.event_name
698 | ORDER BY nb_events DESC
699 | """, conn)
700 | ```
701 |
702 |
703 |
704 |
705 | |
706 | event_name |
707 | nb_events |
708 |
709 |
710 |
711 |
712 | 0 |
713 | Page View |
714 | 20928 |
715 |
716 |
717 | 1 |
718 | Add to Cart |
719 | 8451 |
720 |
721 |
722 | 2 |
723 | Purchase |
724 | 1777 |
725 |
726 |
727 | 3 |
728 | Ad Impression |
729 | 876 |
730 |
731 |
732 | 4 |
733 | Ad Click |
734 | 702 |
735 |
736 |
737 |
738 |
739 |
740 |
741 |
742 | **Result**
743 | - **Page View** is the most frequent event, with a total of 20,928 occurrences.
744 | - **Add to Cart** and **Purchase** are the second and third most frequent events, with 8,451 and 1,777 occurrences, respectively.
745 | - Since not every user receives a campaign impression, and not every impression is clicked, the **Ad Impression** event occurred only 876 times and **Ad Click** only 702 times (the click-through rate is computed in the sketch below).
746 |
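As referenced in the last point, a small sketch puts a number on that drop-off: the click-through rate on ad impressions works out to roughly 80% (702 clicks out of 876 impressions).

```python
# Click-through rate on ad impressions (sketch): Ad Click events / Ad Impression events.
pd.read_sql("""
SELECT
    ROUND(SUM(CASE WHEN ei.event_name = 'Ad Click' THEN 1 ELSE 0 END)::NUMERIC
          / SUM(CASE WHEN ei.event_name = 'Ad Impression' THEN 1 ELSE 0 END) * 100, 1) AS ad_click_through_rate_percent
FROM clique_bait.events e
JOIN clique_bait.event_identifier ei ON e.event_type = ei.event_type;
""", conn)
```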
747 | ___
748 | #### 5. What is the percentage of visits which have a purchase event?
749 |
750 |
751 | ```python
752 | pd.read_sql("""
753 | SELECT
754 | COUNT(*) AS nb_purchase_event,
755 | (SELECT COUNT(DISTINCT visit_id) FROM clique_bait.events) AS total_nb_visits,
756 | CONCAT(ROUND(COUNT(*)/(SELECT COUNT(DISTINCT visit_id) FROM clique_bait.events)::NUMERIC * 100, 1), ' %') AS purchase_percent
757 | FROM clique_bait.events e
758 | JOIN clique_bait.event_identifier ei ON e.event_type = ei.event_type
759 | WHERE ei.event_name = 'Purchase'
760 | """, conn)
761 | ```
762 |
763 |
764 |
765 |
766 | |
767 | nb_purchase_event |
768 | total_nb_visits |
769 | purchase_percent |
770 |
771 |
772 |
773 |
774 | 0 |
775 | 1777 |
776 | 3564 |
777 | 49.9 % |
778 |
779 |
780 |
781 |
782 |
783 |
784 |
785 | **Result**\
786 | 49.9% of the visits resulted in a purchase event.
787 |
788 | ___
789 | #### 6. What is the percentage of visits which view the checkout page but do not have a purchase event?
790 |
791 |
792 | ```python
793 | pd.read_sql("""
794 | SELECT
795 | (nb_checkouts - nb_purchases) AS nb_checkouts_without_purchase,
796 | (SELECT COUNT(DISTINCT visit_id) FROM clique_bait.events) AS total_visits,
797 | ROUND((nb_checkouts - nb_purchases)/(SELECT COUNT(DISTINCT visit_id) FROM clique_bait.events)::NUMERIC * 100,2) AS percent
798 | FROM
799 | (
800 | SELECT
801 | SUM(CASE WHEN ph.page_name = 'Checkout' AND ei.event_name = 'Page View' THEN 1 ELSE 0 END) AS nb_checkouts,
802 | SUM(CASE WHEN ei.event_name = 'Purchase' THEN 1 ELSE 0 END) AS nb_purchases
803 | FROM clique_bait.events e
804 | JOIN clique_bait.event_identifier ei ON e.event_type = ei.event_type
805 | JOIN clique_bait.page_hierarchy ph ON e.page_id = ph.page_id
806 | ) c
807 | """, conn)
808 | ```
809 |
810 |
811 |
812 |
813 | |
814 | nb_checkouts_without_purchase |
815 | total_visits |
816 | percent |
817 |
818 |
819 |
820 |
821 | 0 |
822 | 326 |
823 | 3564 |
824 | 9.15 |
825 |
826 |
827 |
828 |
829 |
830 |
831 |
832 | **Result**\
833 | Only 9.15% of the visits resulted in viewing the checkout page without a purchase event.
834 |
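A complementary sketch expresses the same finding as a rough conversion rate: purchase events as a share of checkout page views.

```python
# Complementary sketch: rough checkout-to-purchase conversion rate
# (purchase events as a share of checkout page views).
pd.read_sql("""
SELECT
    ROUND(nb_purchases / nb_checkouts::NUMERIC * 100, 2) AS checkout_to_purchase_percent
FROM
(
    SELECT
        SUM(CASE WHEN ph.page_name = 'Checkout' AND ei.event_name = 'Page View' THEN 1 ELSE 0 END) AS nb_checkouts,
        SUM(CASE WHEN ei.event_name = 'Purchase' THEN 1 ELSE 0 END) AS nb_purchases
    FROM clique_bait.events e
    JOIN clique_bait.event_identifier ei ON e.event_type = ei.event_type
    JOIN clique_bait.page_hierarchy ph ON e.page_id = ph.page_id
) c;
""", conn)
```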
835 | ___
836 | #### 7. What are the top 3 pages by number of views?
837 |
838 |
839 | ```python
840 | pd.read_sql("""
841 | SELECT
842 | ph.page_name,
843 | COUNT(*) AS nb_views
844 | FROM clique_bait.events e
845 | JOIN clique_bait.page_hierarchy ph ON e.page_id = ph.page_id
846 | JOIN clique_bait.event_identifier ei ON e.event_type = ei.event_type
847 | WHERE ei.event_name = 'Page View'
848 | GROUP BY ph.page_name
849 | ORDER BY nb_views DESC
850 | LIMIT 3
851 | """, conn)
852 | ```
853 |
854 |
855 |
856 |
857 | |
858 | page_name |
859 | nb_views |
860 |
861 |
862 |
863 |
864 | 0 |
865 | All Products |
866 | 3174 |
867 |
868 |
869 | 1 |
870 | Checkout |
871 | 2103 |
872 |
873 |
874 | 2 |
875 | Home Page |
876 | 1782 |
877 |
878 |
879 |
880 |
881 |
882 |
883 |
884 | ___
885 | #### 8. What is the number of views and cart adds for each product category?
886 |
887 |
888 | ```python
889 | pd.read_sql("""
890 | SELECT
891 | ph.product_category,
892 | SUM(CASE WHEN event_name = 'Page View' THEN 1 ELSE 0 END) AS nb_views,
893 | SUM(CASE WHEN event_name = 'Add to Cart' THEN 1 ELSE 0 END) AS nb_card_adds
894 | FROM clique_bait.events e
895 | JOIN clique_bait.page_hierarchy ph ON e.page_id = ph.page_id
896 | JOIN clique_bait.event_identifier ei ON e.event_type = ei.event_type
897 | WHERE ph.product_category IS NOT NULL
898 | GROUP BY ph.product_category
899 | ORDER BY nb_views DESC
900 | """, conn)
901 | ```
902 |
903 |
904 |
905 |
906 | |
907 | product_category |
908 | nb_views |
909 | nb_card_adds |
910 |
911 |
912 |
913 |
914 | 0 |
915 | Shellfish |
916 | 6204 |
917 | 3792 |
918 |
919 |
920 | 1 |
921 | Fish |
922 | 4633 |
923 | 2789 |
924 |
925 |
926 | 2 |
927 | Luxury |
928 | 3032 |
929 | 1870 |
930 |
931 |
932 |
933 |
934 |
935 |
936 |
937 | ___
938 | #### 9. What are the top 3 products by purchases?
939 |
940 |
941 | ```python
942 | pd.read_sql("""
943 | WITH visitID_with_purchases_cte AS
944 | (
945 | -- Retrieve visit IDS that have made purchases
946 |
947 | SELECT e.visit_id
948 | FROM clique_bait.events e
949 | JOIN clique_bait.event_identifier ei ON e.event_type = ei.event_type
950 | WHERE ei.event_name = 'Purchase'
951 | )
952 | SELECT
953 | ph.page_name as product,
954 | COUNT(*) AS nb_purchases
955 | FROM visitID_with_purchases_cte cte
956 | JOIN clique_bait.events e ON cte.visit_id = e.visit_id
957 | JOIN clique_bait.event_identifier ei ON e.event_type = ei.event_type
958 | JOIN clique_bait.page_hierarchy ph ON e.page_id = ph.page_id
959 | WHERE ph.product_category IS NOT NULL
960 | AND ei.event_name = 'Add to Cart'
961 | GROUP BY ph.page_name
962 | ORDER BY nb_purchases DESC
963 | LIMIT 3
964 | """, conn)
965 | ```
966 |
967 |
968 |
969 |
970 | |
971 | product |
972 | nb_purchases |
973 |
974 |
975 |
976 |
977 | 0 |
978 | Lobster |
979 | 754 |
980 |
981 |
982 | 1 |
983 | Oyster |
984 | 726 |
985 |
986 |
987 | 2 |
988 | Crab |
989 | 719 |
990 |
991 |
992 |
993 |
994 |
995 |
996 |
997 | Alternatively, we can use the window function `LAST_VALUE() OVER (PARTITION BY ... ORDER BY ...)` to extract the last event recorded during each visit to the Clique Bait website. Since the **Purchase** event is typically the final action a user takes in a visit, applying `LAST_VALUE()` in the `add_last_event_cte` lets us identify the products that were actually purchased with the filter `WHERE event_name = 'Add to Cart' AND last_event = 'Purchase'`.
998 |
999 |
1000 | ```python
1001 | pd.read_sql("""
1002 | WITH add_last_event_cte AS
1003 | (
1004 | SELECT
1005 | e.visit_id,
1006 | e.sequence_number,
1007 | ph.page_name,
1008 | ph.product_category,
1009 | ei.event_name,
1010 | LAST_VALUE(ei.event_name) OVER (PARTITION BY e.visit_id ORDER BY e.sequence_number ROWS BETWEEN UNBOUNDED preceding AND UNBOUNDED following) AS last_event
1011 | FROM clique_bait.events e
1012 | JOIN clique_bait.event_identifier ei ON e.event_type = ei.event_type
1013 | JOIN clique_bait.page_hierarchy ph ON e.page_id = ph.page_id
1014 | )
1015 | SELECT
1016 | page_name,
1017 | COUNT(event_name) AS nb_purchases
1018 | FROM add_last_event_cte
1019 | WHERE product_category IS NOT NULL AND event_name = 'Add to Cart' AND last_event = 'Purchase'
1020 | GROUP BY page_name
1021 | ORDER BY nb_purchases DESC
1022 | """, conn)
1023 | ```
1024 |
1025 |
1026 | page_name | nb_purchases
1027 | --- | ---
1028 | Lobster | 754
1029 | Oyster | 726
1030 | Crab | 719
1031 | Salmon | 711
1032 | Black Truffle | 707
1033 | Kingfish | 707
1034 | Abalone | 699
1035 | Russian Caviar | 697
1036 | Tuna | 697
1037 |
1085 |
1086 | ## C. Product Funnel Analysis
1087 | Using a single SQL query - create a new output table which has the following details:
1088 |
1089 | - How many times was each product viewed?
1090 | - How many times was each product added to cart?
1091 | - How many times was each product added to a cart but not purchased (abandoned)?
1092 | - How many times was each product purchased?
1093 |
1094 | Additionally, create another table which further aggregates the data for the above points but this time **for each product category** instead of individual products.
1095 |
1096 | ### Table 1: Aggregate the data by products
1097 |
1098 |
1099 | ```python
1100 | # Create table
1101 | cursor.execute("DROP TABLE IF EXISTS clique_bait.products;")
1102 | cursor.execute("""
1103 | CREATE TABLE clique_bait.products
1104 | (
1105 | "product" VARCHAR(255),
1106 | "nb_views" INTEGER,
1107 | "nb_cart_adds" INTEGER,
1108 | "nb_abandoned" INTEGER,
1109 | "nb_purchases" INTEGER
1110 | );
1111 | """)
1112 |
1113 |
1114 | # Populate the table
1115 | cursor.execute("""
1116 | INSERT INTO clique_bait.products
1117 | WITH add_last_event_cte AS
1118 | (
1119 | SELECT
1120 | e.visit_id,
1121 | e.sequence_number,
1122 | ph.page_name,
1123 | ph.product_category,
1124 | ei.event_name,
1125 | LAST_VALUE(ei.event_name) OVER (PARTITION BY e.visit_id ORDER BY e.sequence_number ROWS BETWEEN UNBOUNDED preceding AND UNBOUNDED following) AS last_event
1126 | FROM clique_bait.events e
1127 | JOIN clique_bait.event_identifier ei ON e.event_type = ei.event_type
1128 | JOIN clique_bait.page_hierarchy ph ON e.page_id = ph.page_id
1129 | )
1130 | SELECT
1131 | page_name AS product,
1132 | SUM(CASE WHEN event_name = 'Page View' THEN 1 ELSE 0 END) AS nb_views,
1133 | SUM(CASE WHEN event_name = 'Add to Cart' THEN 1 ELSE 0 END) AS nb_cart_adds,
1134 | SUM(CASE WHEN event_name = 'Add to Cart' AND last_event != 'Purchase' THEN 1 ELSE 0 END) AS nb_abandoned,
1135 | SUM(CASE WHEN event_name = 'Add to Cart' AND last_event = 'Purchase' THEN 1 ELSE 0 END) AS nb_purchases
1136 | FROM add_last_event_cte
1137 | WHERE product_category IS NOT NULL
1138 | GROUP BY page_name
1139 | """)
1140 |
1141 |
1142 | # Saving
1143 | conn.commit()
1144 | ```
1145 |
1146 | **Result**
1147 |
1148 |
1149 | ```python
1150 | pd.read_sql("""SELECT * FROM clique_bait.products""", conn)
1151 | ```
1152 |
1153 |
1154 | product | nb_views | nb_cart_adds | nb_abandoned | nb_purchases
1155 | --- | --- | --- | --- | ---
1156 | Abalone | 1525 | 932 | 233 | 699
1157 | Oyster | 1568 | 943 | 217 | 726
1158 | Salmon | 1559 | 938 | 227 | 711
1159 | Crab | 1564 | 949 | 230 | 719
1160 | Tuna | 1515 | 931 | 234 | 697
1161 | Lobster | 1547 | 968 | 214 | 754
1162 | Kingfish | 1559 | 920 | 213 | 707
1163 | Russian Caviar | 1563 | 946 | 249 | 697
1164 | Black Truffle | 1469 | 924 | 217 | 707
1165 |
1242 |
1243 | ### Table 2: Aggregate the data by product categories
1244 |
1245 |
1246 | ```python
1247 | # Create the table
1248 | cursor.execute("DROP TABLE IF EXISTS clique_bait.product_category;")
1249 | cursor.execute("""
1250 | CREATE TABLE clique_bait.product_category
1251 | (
1252 | "product" VARCHAR(255),
1253 | "nb_views" INTEGER,
1254 | "nb_cart_adds" INTEGER,
1255 | "nb_abandoned" INTEGER,
1256 | "nb_purchases" INTEGER
1257 | );
1258 | """)
1259 |
1260 |
1261 | # Populate the table
1262 | cursor.execute("""
1263 | INSERT INTO clique_bait.product_category
1264 | WITH add_last_event_cte AS
1265 | (
1266 | SELECT
1267 | e.visit_id,
1268 | e.sequence_number,
1269 | ph.product_category,
1270 | ei.event_name,
1271 | LAST_VALUE(ei.event_name) OVER (PARTITION BY e.visit_id ORDER BY e.sequence_number ROWS BETWEEN UNBOUNDED preceding AND UNBOUNDED following) AS last_event
1272 | FROM clique_bait.events e
1273 | JOIN clique_bait.event_identifier ei ON e.event_type = ei.event_type
1274 | JOIN clique_bait.page_hierarchy ph ON e.page_id = ph.page_id
1275 | )
1276 | SELECT
1277 | product_category,
1278 | SUM(CASE WHEN event_name = 'Page View' THEN 1 ELSE 0 END) AS nb_views,
1279 | SUM(CASE WHEN event_name = 'Add to Cart' THEN 1 ELSE 0 END) AS nb_cart_adds,
1280 | SUM(CASE WHEN event_name = 'Add to Cart' AND last_event != 'Purchase' THEN 1 ELSE 0 END) AS nb_abandoned,
1281 | SUM(CASE WHEN event_name = 'Add to Cart' AND last_event = 'Purchase' THEN 1 ELSE 0 END) AS nb_purchases
1282 | FROM add_last_event_cte
1283 | WHERE product_category IS NOT NULL
1284 | GROUP BY product_category
1285 |
1286 | """)
1287 |
1288 | # Saving
1289 | conn.commit()
1290 | ```
1291 |
1292 | **Result**
1293 |
1294 |
1295 | ```python
1296 | pd.read_sql("SELECT * FROM clique_bait.product_category", conn)
1297 | ```
1298 |
1299 |
1300 | product | nb_views | nb_cart_adds | nb_abandoned | nb_purchases
1301 | --- | --- | --- | --- | ---
1302 | Luxury | 3032 | 1870 | 466 | 1404
1303 | Shellfish | 6204 | 3792 | 894 | 2898
1304 | Fish | 4633 | 2789 | 674 | 2115
1305 |
1340 |
1341 | ___
1342 | ### Use your 2 new output tables - answer the following questions:
1343 |
1344 | #### 1. Which product had the most views, cart adds and purchases?
1345 |
1346 |
1347 | ```python
1348 | pd.read_sql("""
1349 | SELECT
1350 | product,
1351 | nb_views,
1352 | nb_cart_adds,
1353 | nb_purchases
1354 | FROM clique_bait.products
1355 | ORDER BY nb_views DESC, nb_cart_adds DESC, nb_purchases DESC
1356 | LIMIT 1;
1357 | """, conn)
1358 | ```
1359 |
1360 |
1361 | product | nb_views | nb_cart_adds | nb_purchases
1362 | --- | --- | --- | ---
1363 | Oyster | 1568 | 943 | 726
1364 |
1383 |
1384 | The query above ranks the products by views (then cart adds and purchases) and returns **Oyster** as the top result.
1385 | However, breaking the data down metric by metric reveals that:
1386 |
1387 | - **Oyster** is the most viewed product.
1388 | - **Lobster** is the product most often added to cart and the most purchased product.
1389 |
1390 |
1391 | ```python
1392 | pd.read_sql("""
1393 | WITH max_cte AS
1394 | (
1395 | SELECT
1396 | *,
1397 | CASE WHEN MAX(nb_views) OVER (ORDER BY nb_views DESC) = nb_views THEN product ELSE '' END AS most_viewed,
1398 | CASE WHEN MAX(nb_cart_adds) OVER (ORDER BY nb_cart_adds DESC) = nb_cart_adds THEN product ELSE '' END AS most_cart_added,
1399 | CASE WHEN MAX(nb_purchases) OVER (ORDER BY nb_purchases DESC) = nb_purchases THEN product ELSE '' END AS most_purchased
1400 | FROM clique_bait.products
1401 | )
1402 | SELECT most_viewed, most_cart_added, most_purchased
1403 | FROM max_cte
1404 | WHERE most_viewed != '' OR most_cart_added !='' OR most_purchased !=''
1405 | """, conn)
1406 | ```
1407 |
1408 |
1409 | most_viewed | most_cart_added | most_purchased
1410 | --- | --- | ---
1411 | Oyster | |
1412 | | Lobster | Lobster
1413 |
1435 |
1436 | ___
1437 | #### 2. Which product was most likely to be abandoned?
1438 |
1439 |
1440 | ```python
1441 | pd.read_sql("""
1442 | SELECT product, nb_abandoned
1443 | FROM clique_bait.products
1444 | ORDER BY nb_abandoned DESC
1445 | LIMIT 1;
1446 | """, conn)
1447 | ```
1448 |
1449 |
1450 | product | nb_abandoned
1451 | --- | ---
1452 | Russian Caviar | 249
1453 |
1468 |
1469 | ___
1470 | #### 3. Which product had the highest view to purchase percentage?
1471 |
1472 |
1473 | ```python
1474 | pd.read_sql("""
1475 | SELECT
1476 | product,
1477 | nb_views,
1478 | nb_purchases,
1479 | CONCAT(ROUND(nb_purchases/nb_views::NUMERIC * 100,2), ' %') AS percent
1480 | FROM clique_bait.products
1481 | ORDER BY nb_purchases/nb_views::NUMERIC DESC
1482 | LIMIT 1;
1483 | """, conn)
1484 | ```
1485 |
1486 |
1487 | product | nb_views | nb_purchases | percent
1488 | --- | --- | --- | ---
1489 | Lobster | 1547 | 754 | 48.74 %
1490 |
1509 |
1510 | ___
1511 | #### 4. What is the average conversion rate from view to cart add?
1512 |
1513 |
1514 | ```python
1515 | pd.read_sql("""
1516 | SELECT
1517 | CONCAT(ROUND(AVG(nb_cart_adds/nb_views::NUMERIC * 100),2), ' %') AS conversion_rate
1518 | FROM clique_bait.products
1519 | """, conn)
1520 | ```
1521 |
1522 |
1523 | conversion_rate
1524 | ---
1525 | 60.95 %
1526 |
1539 |
1540 | ___
1541 | #### 5. What is the average conversion rate from cart add to purchase?
1542 |
1543 |
1544 | ```python
1545 | pd.read_sql("""
1546 | SELECT
1547 | CONCAT(ROUND(AVG(nb_purchases/nb_cart_adds::NUMERIC * 100),2), ' %') AS conversion_rate
1548 | FROM clique_bait.products
1549 | """, conn)
1550 | ```
1551 |
1552 |
1553 | conversion_rate
1554 | ---
1555 | 75.93 %
1556 |
1570 |
1571 | ## D. Campaigns Analysis
1572 | Generate a table that has 1 single row for every unique `visit_id` record and has the following columns:
1573 |
1574 | - `user_id`
1575 | - `visit_id`
1576 | - `visit_start_time`: the earliest event_time for each visit
1577 | - `page_views`: count of page views for each visit
1578 | - `cart_adds`: count of product cart add events for each visit
1579 | - `purchase`: 1/0 flag if a purchase event exists for each visit
1580 | - `campaign_name`: map the visit to a campaign if the visit_start_time falls between the start_date and end_date
1581 | - `impression`: count of ad impressions for each visit
1582 | - `click`: count of ad clicks for each visit
1583 | - `cart_products`**(Optional column)**: a comma separated text value with products added to the cart sorted by the order they were added to the cart (hint: use the `sequence_number`)
1584 |
1585 | Use the subsequent dataset to generate at least 5 insights for the Clique Bait team - bonus: prepare a single A4 infographic that the team can use for their management reporting sessions, be sure to emphasise the most important points from your findings.
1586 |
1587 | Some ideas you might want to investigate further include:
1588 |
1589 | - Identifying users who have received impressions during each campaign period and comparing each metric with other users who did not have an impression event
1590 | - Does clicking on an impression lead to higher purchase rates?
1591 | - What is the uplift in purchase rate when comparing users who click on a campaign impression versus users who do not receive an impression? What if we compare them with users who just receive an impression but do not click?
1592 | - What metrics can you use to quantify the success or failure of each campaign compared to each other?
1593 |
1594 |
1595 | ```python
1596 | # Create table
1597 | cursor.execute("DROP TABLE IF EXISTS clique_bait.campaign_analysis;")
1598 | cursor.execute("""
1599 | CREATE TABLE clique_bait.campaign_analysis
1600 | (
1601 | "user_id" INTEGER,
1602 | "visit_id" VARCHAR(6),
1603 | "visit_start_time" TIMESTAMP,
1604 | "page_views" INTEGER,
1605 | "cart_adds" INTEGER,
1606 | "purchases" INTEGER,
1607 | "impressions" INTEGER,
1608 | "clicks" INTEGER,
1609 | "campaign_name" VARCHAR(33),
1610 | "cart_products" VARCHAR(255)
1611 | );
1612 | """)
1613 |
1614 |
1615 | # Populate the table
1616 | cursor.execute("""
1617 | INSERT INTO clique_bait.campaign_analysis
1618 | WITH cart_products_cte AS
1619 | (
1620 | -- Generate a sequence of products added to the cart
1621 |
1622 | SELECT
1623 | u.user_id,
1624 | e.visit_id,
1625 | STRING_AGG(ph.page_name, ', ' ORDER BY sequence_number) AS cart_products
1626 | FROM clique_bait.users u
1627 | JOIN clique_bait.events e ON u.cookie_id = e.cookie_id
1628 | JOIN clique_bait.page_hierarchy ph ON e.page_id = ph.page_id
1629 | JOIN clique_bait.event_identifier ei ON e.event_type = ei.event_type
1630 | WHERE ei.event_name = 'Add to Cart'
1631 | GROUP BY u.user_id, e.visit_id
1632 | )
1633 |
1634 | SELECT
1635 | u.user_id,
1636 | e.visit_id,
1637 | MIN(e.event_time) AS visit_start_time,
1638 | SUM(CASE WHEN ei.event_name = 'Page View' THEN 1 ELSE 0 END) AS page_views,
1639 | SUM(CASE WHEN ei.event_name = 'Add to Cart' THEN 1 ELSE 0 END) AS cart_adds,
1640 | SUM(CASE WHEN ei.event_name = 'Purchase' THEN 1 ELSE 0 END) AS purchases,
1641 | SUM(CASE WHEN ei.event_name = 'Ad Impression' THEN 1 ELSE 0 END) AS impressions,
1642 | SUM(CASE WHEN ei.event_name = 'Ad Click' THEN 1 ELSE 0 END) AS clicks,
1643 | CASE WHEN MIN(e.event_time) BETWEEN ci.start_date AND ci.end_date THEN ci.campaign_name ELSE '' END AS campaign_name,
1644 | CASE WHEN cp.cart_products IS NULL THEN '' ELSE cp.cart_products END AS cart_products
1645 |
1646 | FROM clique_bait.users u
1647 | JOIN clique_bait.events e ON u.cookie_id = e.cookie_id
1648 | JOIN clique_bait.page_hierarchy ph ON e.page_id = ph.page_id
1649 | JOIN clique_bait.event_identifier ei ON e.event_type = ei.event_type
1650 | JOIN clique_bait.campaign_identifier ci ON e.event_time BETWEEN ci.start_date AND ci.end_date
1651 | LEFT JOIN cart_products_cte cp ON cp.user_id = u.user_id AND cp.visit_id = e.visit_id
1652 | GROUP BY u.user_id, e.visit_id, ci.start_date, ci.end_date, ci.campaign_name, cp.cart_products
1653 | ORDER BY u.user_id, e.visit_id;
1654 | """)
1655 |
1656 |
1657 | # Saving the updates
1658 | conn.commit()
1659 | ```
1660 |
1661 | **Result**
1662 | Here is the table listing the first 5 users only.
1663 |
1664 |
1665 | ```python
1666 | pd.read_sql("""
1667 | SELECT *
1668 | FROM clique_bait.campaign_analysis
1669 | WHERE user_id < 6
1670 | """, conn)
1671 | ```
1672 |
1673 |
1674 | user_id | visit_id | visit_start_time | page_views | cart_adds | purchases | impressions | clicks | campaign_name | cart_products
1675 | --- | --- | --- | --- | --- | --- | --- | --- | --- | ---
1676 | 1 | 02a5d5 | 2020-02-26 16:57:26.260871 | 4 | 0 | 0 | 0 | 0 | Half Off - Treat Your Shellf(ish) |
1677 | 1 | 0826dc | 2020-02-26 05:58:37.918618 | 1 | 0 | 0 | 0 | 0 | Half Off - Treat Your Shellf(ish) |
1678 | 1 | 0fc437 | 2020-02-04 17:49:49.602976 | 10 | 6 | 1 | 1 | 1 | Half Off - Treat Your Shellf(ish) | Tuna, Russian Caviar, Black Truffle, Abalone, ...
1679 | 1 | 30b94d | 2020-03-15 13:12:54.023936 | 9 | 7 | 1 | 1 | 1 | Half Off - Treat Your Shellf(ish) | Salmon, Kingfish, Tuna, Russian Caviar, Abalon...
1680 | 1 | 41355d | 2020-03-25 00:11:17.860655 | 6 | 1 | 0 | 0 | 0 | Half Off - Treat Your Shellf(ish) | Lobster
1681 | 1 | ccf365 | 2020-02-04 19:16:09.182546 | 7 | 3 | 1 | 0 | 0 | Half Off - Treat Your Shellf(ish) | Lobster, Crab, Oyster
1682 | 1 | eaffde | 2020-03-25 20:06:32.342989 | 10 | 8 | 1 | 1 | 1 | Half Off - Treat Your Shellf(ish) | Salmon, Tuna, Russian Caviar, Black Truffle, A...
1683 | 1 | f7c798 | 2020-03-15 02:23:26.312543 | 9 | 3 | 1 | 0 | 0 | Half Off - Treat Your Shellf(ish) | Russian Caviar, Crab, Oyster
1684 | 2 | 0635fb | 2020-02-16 06:42:42.735730 | 9 | 4 | 1 | 0 | 0 | Half Off - Treat Your Shellf(ish) | Salmon, Kingfish, Abalone, Crab
1685 | 2 | 1f1198 | 2020-02-01 21:51:55.078775 | 1 | 0 | 0 | 0 | 0 | Half Off - Treat Your Shellf(ish) |
1686 | 2 | 3b5871 | 2020-01-18 10:16:32.158475 | 9 | 6 | 1 | 1 | 1 | 25% Off - Living The Lux Life | Salmon, Kingfish, Russian Caviar, Black Truffl...
1687 | 2 | 49d73d | 2020-02-16 06:21:27.138532 | 11 | 9 | 1 | 1 | 1 | Half Off - Treat Your Shellf(ish) | Salmon, Kingfish, Tuna, Russian Caviar, Black ...
1688 | 2 | 910d9a | 2020-02-01 10:40:46.875968 | 8 | 1 | 0 | 0 | 0 | Half Off - Treat Your Shellf(ish) | Abalone
1689 | 2 | c5c0ee | 2020-01-18 10:35:22.765382 | 1 | 0 | 0 | 0 | 0 | 25% Off - Living The Lux Life |
1690 | 2 | d58cbd | 2020-01-18 23:40:54.761906 | 8 | 4 | 0 | 0 | 0 | 25% Off - Living The Lux Life | Kingfish, Tuna, Abalone, Crab
1691 | 2 | e26a84 | 2020-01-18 16:06:40.907280 | 6 | 2 | 1 | 0 | 0 | 25% Off - Living The Lux Life | Salmon, Oyster
1692 | 3 | 25502e | 2020-02-21 11:26:15.353389 | 1 | 0 | 0 | 0 | 0 | Half Off - Treat Your Shellf(ish) |
1693 | 3 | 9a2f24 | 2020-02-21 03:19:10.032455 | 6 | 2 | 1 | 0 | 0 | Half Off - Treat Your Shellf(ish) | Kingfish, Black Truffle
1694 | 3 | bf200a | 2020-03-11 04:10:26.708385 | 7 | 2 | 1 | 0 | 0 | Half Off - Treat Your Shellf(ish) | Salmon, Crab
1695 | 3 | eb13cd | 2020-03-11 21:36:37.222763 | 1 | 0 | 0 | 0 | 0 | Half Off - Treat Your Shellf(ish) |
1696 | 4 | 07a950 | 2020-03-19 17:56:24.610445 | 6 | 0 | 0 | 0 | 0 | Half Off - Treat Your Shellf(ish) |
1697 | 4 | 4c0ce3 | 2020-02-22 19:42:45.498271 | 1 | 0 | 0 | 0 | 0 | Half Off - Treat Your Shellf(ish) |
1698 | 4 | 7caba5 | 2020-02-22 17:49:37.646174 | 5 | 2 | 0 | 0 | 0 | Half Off - Treat Your Shellf(ish) | Tuna, Lobster
1699 | 4 | b90e25 | 2020-03-19 11:01:58.182947 | 9 | 4 | 1 | 1 | 1 | Half Off - Treat Your Shellf(ish) | Tuna, Black Truffle, Lobster, Crab
1700 | 5 | 05c52a | 2020-02-11 12:30:33.479052 | 9 | 8 | 0 | 1 | 1 | Half Off - Treat Your Shellf(ish) | Salmon, Kingfish, Tuna, Russian Caviar, Black ...
1701 | 5 | 4bffe1 | 2020-02-26 16:03:10.377881 | 8 | 1 | 0 | 0 | 0 | Half Off - Treat Your Shellf(ish) | Russian Caviar
1702 | 5 | 580bf6 | 2020-02-11 04:05:42.307991 | 8 | 6 | 0 | 0 | 0 | Half Off - Treat Your Shellf(ish) | Salmon, Tuna, Russian Caviar, Black Truffle, A...
1703 | 5 | b45feb | 2020-02-01 07:47:45.025247 | 1 | 0 | 0 | 0 | 0 | Half Off - Treat Your Shellf(ish) |
1704 | 5 | f61ed7 | 2020-02-01 06:30:39.766168 | 8 | 2 | 1 | 0 | 0 | Half Off - Treat Your Shellf(ish) | Lobster, Crab
1705 | 5 | fa70cb | 2020-02-26 11:12:20.361638 | 1 | 0 | 0 | 0 | 0 | Half Off - Treat Your Shellf(ish) |
1706 |
2085 |
2086 | ### Further investigation
2087 | ### Does clicking on an impression lead to higher purchase rates?
2088 |
2089 |
2090 | Among visits that received an ad impression, the purchase rate is significantly higher when the user clicked on the advertisement than when they did not: 71.89% vs. 13.12%.
2091 |
2092 |
2093 | ```python
2094 | pd.read_sql("""
2095 | SELECT
2096 | ROUND(SUM(CASE WHEN clicks = 1 THEN purchases ELSE 0 END)/COUNT(visit_id)::NUMERIC * 100, 2) AS clicked_purchase_rate,
2097 | ROUND(SUM(CASE WHEN clicks = 0 THEN purchases ELSE 0 END)/COUNT(visit_id)::NUMERIC * 100, 2) AS no_clicked_purchase_rate
2098 | FROM clique_bait.campaign_analysis
2099 | WHERE impressions = 1
2100 | """, conn)
2101 | ```
2102 |
2103 |
2104 |
2105 | clicked_purchase_rate | no_clicked_purchase_rate
2106 | --- | ---
2107 | 71.89 | 13.12
2108 |
2122 |
2123 | ___
2124 | ### What is the uplift in purchase rate when comparing users who click on a campaign impression versus users who do not receive an impression? What if we compare them with users who just receive an impression but do not click?
2125 |
2126 |
2127 | Considering all visits to the Clique Bait website (each figure below is a share of **all** visits):
2128 | - 28.6% of all visits ended in a purchase without the user receiving any ad impression.
2129 | - 17.6% of all visits ended in a purchase after the user clicked on a campaign impression.
2130 | - Only 3.2% of all visits ended in a purchase when the user received an impression but did not click on it.
2131 |
2132 |
2133 | ```python
2134 | pd.read_sql("""
2135 | SELECT
2136 | ROUND(SUM(CASE WHEN impressions = 1 AND clicks = 1 THEN purchases ELSE 0 END)/COUNT(visit_id)::NUMERIC * 100, 1) AS clicked_purchase_rate,
2137 | ROUND(SUM(CASE WHEN impressions = 1 AND clicks = 0 THEN purchases ELSE 0 END)/COUNT(visit_id)::NUMERIC * 100, 1) AS no_clicked_purchase_rate,
2138 | ROUND(SUM(CASE WHEN impressions = 0 AND clicks = 0 THEN purchases ELSE 0 END)/COUNT(visit_id)::NUMERIC * 100, 1) AS no_impression_purchase_rate
2139 | FROM clique_bait.campaign_analysis
2140 | """, conn)
2141 | ```
2142 |
2143 |
2144 |
2145 | clicked_purchase_rate | no_clicked_purchase_rate | no_impression_purchase_rate
2146 | --- | --- | ---
2147 | 17.6 | 3.2 | 28.6
2148 |
2164 |
2165 |
2166 | ```python
2167 | conn.close()
2168 | ```
2169 |
--------------------------------------------------------------------------------
/CaseStudy#6 - Clique Bait/README.md:
--------------------------------------------------------------------------------
1 | # Case Study #6: Clique Bait 🪝
2 | [](https://github.com/chanronnie/8WeekSQLChallenge)
3 | [](https://8weeksqlchallenge.com/case-study-6/)
4 | 
5 |
6 |
7 | The case study presented here is part of the **8 Week SQL Challenge**.\
8 | It is kindly brought to us by [**Data With Danny**](https://8weeksqlchallenge.com).
9 |
10 | This time, I am using `PostgreSQL queries` (instead of MySQL) in `Jupyter Notebook` to quickly view results, which provides me with an opportunity
11 | - to learn PostgreSQL
12 | - to utilize handy mathematical and string functions.
13 |
14 |
15 |
16 | ## Table of Contents
17 | * [Entity Relationship Diagram](#entity-relationship-diagram)
18 | * [Datasets](#datasets)
19 | * [Case Study Questions](#case-study-questions)
20 | * [Solutions](#solutions)
21 | * [PostgreSQL Topics Covered](#postgresql-topics-covered)
22 |
23 | ## Entity Relationship Diagram
24 | Question A will require creating an Entity-Relationship Diagram (ERD).
25 |
26 |
27 | ## Datasets
28 | Case Study #6 contains 5 datasets:
29 |
30 |
31 | ### `users`
32 |
33 |
34 | View table
35 |
36 |
37 | `users` : This table shows all the user_id along with their unique cookie_id (PK).
38 |
39 | user_id | cookie_id | start_date
40 | --- | --- | ---
41 | 397 | 759ff | 2020-03-30 00:00:00
42 | 215 | 863329 | 2020-01-26 00:00:00
43 | 191 | eefca9 | 2020-03-15 00:00:00
44 | 89 | 764796 | 2020-01-07 00:00:00
45 | 127 | 17ccc5 | 2020-01-22 00:00:00
46 | 81 | b0b666 | 2020-03-01 00:00:00
47 | 260 | a4f236 | 2020-01-08 00:00:00
48 | 203 | d1182f | 2020-04-18 00:00:00
49 | 23 | 12dbc8 | 2020-01-18 00:00:00
50 | 375 | f61d69 | 2020-01-03 00:00:00
51 |
52 |
53 |
54 |
55 | ### `events`
56 |
57 |
58 | View table
59 |
60 |
61 | `events` : The table lists all the website interactions by customers (cookie_id, webpage visited, etc.).
62 |
63 | visit_id | cookie_id | page_id | event_type | sequence_number | event_time
64 | --- | --- | --- | --- | --- | ---
65 | 719fd3 | 3d83d3 | 5 | 1 | 4 | 2020-03-02 00:29:09.975502
66 | fb1eb1 | c5ff25 | 5 | 2 | 8 | 2020-01-22 07:59:16.761931
67 | 23fe81 | 1e8c2d | 10 | 1 | 9 | 2020-03-21 13:14:11.745667
68 | ad91aa | 648115 | 6 | 1 | 3 | 2020-04-27 16:28:09.824606
69 | 5576d7 | ac418c | 6 | 1 | 4 | 2020-01-18 04:55:10.149236
70 | 48308b | c686c1 | 8 | 1 | 5 | 2020-01-29 06:10:38.702163
71 | 46b17d | 78f9b3 | 7 | 1 | 12 | 2020-02-16 09:45:31.926407
72 | 9fd196 | ccf057 | 4 | 1 | 5 | 2020-02-14 08:29:12.922164
73 | edf853 | f85454 | 1 | 1 | 1 | 2020-02-22 12:59:07.652207
74 | 3c6716 | 02e74f | 3 | 2 | 5 | 2020-01-31 17:56:20.777383
75 |
76 |
77 |
78 |
79 | ### `event_identifier`
80 |
81 |
82 | View table
83 |
84 |
85 | `event_identifier` : The table maps the event_type to its corresponding event_name.
86 |
87 | event_type | event_name
88 | --- | ---
89 | 1 | Page View
90 | 2 | Add to Cart
91 | 3 | Purchase
92 | 4 | Ad Impression
93 | 5 | Ad Click
94 |
95 |
96 |
97 |
98 | ### `campaign_identifier`
99 |
100 |
101 | View table
102 |
103 |
104 | `campaign_identifier`: The table lists the information of the three campaigns that ran on the Clique Bait website.
105 |
106 | campaign_id | products | campaign_name | start_date | end_date
107 | --- | --- | --- | --- | ---
108 | 1 | 1-3 | BOGOF - Fishing For Compliments | 2020-01-01 00:00:00 | 2020-01-14 00:00:00
109 | 2 | 4-5 | 25% Off - Living The Lux Life | 2020-01-15 00:00:00 | 2020-01-28 00:00:00
110 | 3 | 6-8 | Half Off - Treat Your Shellf(ish) | 2020-02-01 00:00:00 | 2020-03-31 00:00:00
111 |
112 |
113 |
114 |
115 | ### `page_hierarchy`
116 |
117 |
118 | View table
119 |
120 |
121 | `page_hierarchy`: The table maps the page_id to its page_name.
122 |
123 | page_id | page_name | product_category | product_id
124 | --- | --- | --- | ---
125 | 1 | Home Page | null | null
126 | 2 | All Products | null | null
127 | 3 | Salmon | Fish | 1
128 | 4 | Kingfish | Fish | 2
129 | 5 | Tuna | Fish | 3
130 | 6 | Russian Caviar | Luxury | 4
131 | 7 | Black Truffle | Luxury | 5
132 | 8 | Abalone | Shellfish | 6
133 | 9 | Lobster | Shellfish | 7
134 | 10 | Crab | Shellfish | 8
135 | 11 | Oyster | Shellfish | 9
136 | 12 | Checkout | null | null
137 | 13 | Confirmation | null | null
138 |
139 |
140 |
141 |
142 | ## Case Study Questions
143 | Case Study #6 is categorized into 4 question groups\
144 | To view the specific section, please open the link in a *`new tab`* or *`window`*.\
145 | [A. Enterprise Relationship Diagram](CaseStudy6_solutions.md#A)\
146 | [B. Digital Analysis](CaseStudy6_solutions.md#B)\
147 | [C. Product Funnel Analysis](CaseStudy6_solutions.md#C)\
148 | [D. Campaigns Analysis](CaseStudy6_solutions.md#D)
149 |
150 | ## Solutions
151 | - View `clique_bait` database: [**CaseStudy6_schema.sql**](https://raw.githubusercontent.com/chanronnie/8WeekSQLChallenge/main/CaseStudy%236%20-%20Clique%20Bait/CaseStudy6_schema.sql)
152 | - View Solution:
153 | - [**Markdown File**](CaseStudy6_solutions.md): offers a more fluid and responsive viewing experience
154 | - [**Jupyter Notebook**](CaseStudy6_solutions.ipynb): contains the original code
155 |
156 | ## PostgreSQL Topics Covered
157 | - Establish relationships between datasets to create an Entity-Relationship Diagram (ERD)
158 | - Data Cleaning
159 | - Create Tables
160 | - Pivot Table using CASE WHEN (see the example below)
161 | - DATE and STRING Type Manipulation
162 | - Common Table Expressions (CTE)
163 | - Window Functions
164 | - Subqueries
165 | - JOIN, UNION ALL
166 |
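167 | For illustration, here is a minimal sketch (not part of the original notebook) of the `CASE WHEN` pivot pattern listed above, which the solutions use to turn event rows into one count column per event type. The connection parameters are placeholders for a local PostgreSQL instance that already has the `clique_bait` schema loaded.
168 |
169 | ```python
170 | import pandas as pd
171 | import psycopg2
172 |
173 | # Placeholder credentials for a local PostgreSQL instance with the clique_bait schema loaded
174 | conn = psycopg2.connect(host="localhost", dbname="postgres", user="postgres", password="postgres")
175 |
176 | # Pivot event rows into one count column per event type using CASE WHEN
177 | query = """
178 | SELECT
179 |     ph.page_name,
180 |     SUM(CASE WHEN ei.event_name = 'Page View' THEN 1 ELSE 0 END) AS nb_views,
181 |     SUM(CASE WHEN ei.event_name = 'Add to Cart' THEN 1 ELSE 0 END) AS nb_cart_adds,
182 |     SUM(CASE WHEN ei.event_name = 'Purchase' THEN 1 ELSE 0 END) AS nb_purchases
183 | FROM clique_bait.events e
184 | JOIN clique_bait.event_identifier ei ON e.event_type = ei.event_type
185 | JOIN clique_bait.page_hierarchy ph ON e.page_id = ph.page_id
186 | GROUP BY ph.page_name
187 | ORDER BY nb_views DESC;
188 | """
189 |
190 | print(pd.read_sql(query, conn))
191 | conn.close()
192 | ```
193 |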
--------------------------------------------------------------------------------
/CaseStudy#6 - Clique Bait/images/CliqueBait_ERD.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/chanronnie/8WeekSQLChallenge/4b36f745d833a82a61a9ce70ab177462eb7dbfd0/CaseStudy#6 - Clique Bait/images/CliqueBait_ERD.png
--------------------------------------------------------------------------------
/CaseStudy#7 - Balanced Tree/README.md:
--------------------------------------------------------------------------------
1 | # Case Study #7: Balanced Tree Clothing Co.🌲
2 | [](https://github.com/chanronnie/8WeekSQLChallenge)
3 | [](https://8weeksqlchallenge.com/case-study-7/)
4 | 
5 |
6 |
7 |
8 | The case study presented here is part of the **8 Week SQL Challenge**.\
9 | It is kindly brought to us by [**Data With Danny**](https://8weeksqlchallenge.com).
10 |
11 | This time, I am using `PostgreSQL queries` (instead of MySQL) in `Jupyter Notebook` to quickly view results, which provides me with an opportunity
12 | - to learn PostgreSQL
13 | - to utilize handy mathematical and string functions.
14 |
15 |
16 |
17 | ## Table of Contents
18 | * [Entity Relationship Diagram](#entity-relationship-diagram)
19 | * [Datasets](#datasets)
20 | * [Case Study Questions](#case-study-questions)
21 | * [Solutions](#solutions)
22 | * [PostgreSQL Topics Covered](#postgresql-topics-covered)
23 |
24 | ## Entity Relationship Diagram
25 | Again, I used the [DB Diagram tool](https://dbdiagram.io/home) to create the following ERD.
26 | 
27 |
28 |
29 |
30 |
31 | View code to create ERD
32 |
33 | Here is the code that I used to create the ERD for all the Balanced Tree datasets on the DB Diagram tool.
34 |
35 | ```markdown
36 | TABLE product_hierarchy {
37 | "id" INTEGER
38 | "parent_id" INTEGER
39 | "level_text" VARCHAR(19)
40 | "level_name" VARCHAR(8)
41 | }
42 |
43 | TABLE product_prices {
44 | "id" INTEGER
45 | "product_id" VARCHAR(6)
46 | "price" INTEGER
47 | }
48 |
49 | TABLE product_details {
50 | "product_id" VARCHAR(6)
51 | "price" INTEGER
52 | "product_name" VARCHAR(32)
53 | "category_id" INTEGER
54 | "segment_id" INTEGER
55 | "style_id" INTEGER
56 | "category_name" VARCHAR(6)
57 | "segment_name" VARCHAR(6)
58 | "style_name" VARCHAR(19)
59 | }
60 |
61 | TABLE sales {
62 | "prod_id" VARCHAR(6)
63 | "qty" INTEGER
64 | "price" INTEGER
65 | "discount" INTEGER
66 | "member" BOOLEAN
67 | "txn_id" VARCHAR(6)
68 | "start_txn_time" TIMESTAMP
69 | }
70 |
71 | # Establish relationships between datasets
72 | REF: sales.prod_id > product_details.product_id
73 | REF: product_hierarchy.id - product_prices.id
74 | ```
75 |
76 |
77 |
78 |
79 | ## Datasets
80 | Case Study #7 contains 4 datasets:
81 |
82 |
83 | View Table: product_details
84 |
85 | The `product_details` dataset lists the product information.
86 |
87 | product_id | price | product_name | category_id | segment_id | style_id | category_name | segment_name | style_name
88 | --- | --- | --- | --- | --- | --- | --- | --- | ---
89 | c4a632 | 13 | Navy Oversized Jeans - Womens | 1 | 3 | 7 | Womens | Jeans | Navy Oversized
90 | e83aa3 | 32 | Black Straight Jeans - Womens | 1 | 3 | 8 | Womens | Jeans | Black Straight
91 | e31d39 | 10 | Cream Relaxed Jeans - Womens | 1 | 3 | 9 | Womens | Jeans | Cream Relaxed
92 | d5e9a6 | 23 | Khaki Suit Jacket - Womens | 1 | 4 | 10 | Womens | Jacket | Khaki Suit
93 | 72f5d4 | 19 | Indigo Rain Jacket - Womens | 1 | 4 | 11 | Womens | Jacket | Indigo Rain
94 | 9ec847 | 54 | Grey Fashion Jacket - Womens | 1 | 4 | 12 | Womens | Jacket | Grey Fashion
95 | 5d267b | 40 | White Tee Shirt - Mens | 2 | 5 | 13 | Mens | Shirt | White Tee
96 | c8d436 | 10 | Teal Button Up Shirt - Mens | 2 | 5 | 14 | Mens | Shirt | Teal Button Up
97 | 2a2353 | 57 | Blue Polo Shirt - Mens | 2 | 5 | 15 | Mens | Shirt | Blue Polo
98 | f084eb | 36 | Navy Solid Socks - Mens | 2 | 6 | 16 | Mens | Socks | Navy Solid
99 | b9a74d | 17 | White Striped Socks - Mens | 2 | 6 | 17 | Mens | Socks | White Striped
100 | 2feb6b | 29 | Pink Fluro Polkadot Socks - Mens | 2 | 6 | 18 | Mens | Socks | Pink Fluro Polkadot
101 |
102 |
103 |
104 |
105 |
106 | View Table: sales
107 |
108 | The `sales` dataset lists product-level transaction information for Balanced Tree Clothing Co.
109 |
110 | prod_id | qty | price | discount | member | txn_id | start_txn_time
111 | --- | --- | --- | --- | --- | --- | ---
112 | c4a632 | 4 | 13 | 17 | t | 54f307 | 2021-02-13 01:59:43.296
113 | 5d267b | 4 | 40 | 17 | t | 54f307 | 2021-02-13 01:59:43.296
114 | b9a74d | 4 | 17 | 17 | t | 54f307 | 2021-02-13 01:59:43.296
115 | 2feb6b | 2 | 29 | 17 | t | 54f307 | 2021-02-13 01:59:43.296
116 | c4a632 | 5 | 13 | 21 | t | 26cc98 | 2021-01-19 01:39:00.3456
117 | e31d39 | 2 | 10 | 21 | t | 26cc98 | 2021-01-19 01:39:00.3456
118 | 72f5d4 | 3 | 19 | 21 | t | 26cc98 | 2021-01-19 01:39:00.3456
119 | 2a2353 | 3 | 57 | 21 | t | 26cc98 | 2021-01-19 01:39:00.3456
120 | f084eb | 3 | 36 | 21 | t | 26cc98 | 2021-01-19 01:39:00.3456
121 | c4a632 | 1 | 13 | 21 | f | ef648d | 2021-01-27 02:18:17.1648
122 |
123 |
124 |
125 |
126 |
127 | View Table: product_hierarchy
128 |
129 | The `product_hierarchy` dataset will be used to answer the **Bonus Question** to recreate the `product_details` dataset.
130 |
131 | id | parent_id | level_text | level_name
132 | --- | --- | --- | ---
133 | 1 | | Womens | Category
134 | 2 | | Mens | Category
135 | 3 | 1 | Jeans | Segment
136 | 4 | 1 | Jacket | Segment
137 | 5 | 2 | Shirt | Segment
138 | 6 | 2 | Socks | Segment
139 | 7 | 3 | Navy Oversized | Style
140 | 8 | 3 | Black Straight | Style
141 | 9 | 3 | Cream Relaxed | Style
142 | 10 | 4 | Khaki Suit | Style
143 | 11 | 4 | Indigo Rain | Style
144 | 12 | 4 | Grey Fashion | Style
145 | 13 | 5 | White Tee | Style
146 | 14 | 5 | Teal Button Up | Style
147 | 15 | 5 | Blue Polo | Style
148 | 16 | 6 | Navy Solid | Style
149 | 17 | 6 | White Striped | Style
150 | 18 | 6 | Pink Fluro Polkadot | Style
151 |
152 |
153 |
154 |
155 | View Table: product_prices
156 |
157 | The `product_prices` dataset will be used to answer the **Bonus Question** to recreate the `product_details` dataset.
158 |
159 | id | product_id | price
160 | --- | --- | ---
161 | 7 | c4a632 | 13
162 | 8 | e83aa3 | 32
163 | 9 | e31d39 | 10
164 | 10 | d5e9a6 | 23
165 | 11 | 72f5d4 | 19
166 | 12 | 9ec847 | 54
167 | 13 | 5d267b | 40
168 | 14 | c8d436 | 10
169 | 15 | 2a2353 | 57
170 | 16 | f084eb | 36
171 | 17 | b9a74d | 17
172 | 18 | 2feb6b | 29
173 |
174 |
175 |
176 |
177 |
178 |
179 | ## Case Study Questions
180 | Case Study #7 is categorized into 4 question groups\
181 | To view the specific section, please open the link in a *`new tab`* or *`window`*.\
182 | [A. High Level Sales Analysis](CaseStudy7_solutions.md#A)\
183 | [B. Transaction Analysis](CaseStudy7_solutions.md#B)\
184 | [C. Product Analysis](CaseStudy7_solutions.md#C)\
185 | [D. Bonus Challenge](CaseStudy7_solutions.md#D)
186 |
187 |
188 | ## Solutions
189 | - View `balanced_tree` database: [**here**](CaseStudy7_schema.sql)
190 | - View Solution:
191 | - [**Markdown File**](CaseStudy7_solutions.md): offers a more fluid and responsive viewing experience
192 | - [**Jupyter Notebook**](CaseStudy7_solutions.ipynb): contains the original code
193 |
194 |
195 | ## PostgreSQL Topics Covered
196 | - Establish relationships between datasets to create an Entity-Relationship Diagram (ERD)
197 | - Common Table Expressions (CTE)
198 | - Window Functions
199 | - Subqueries
200 | - JOIN, SELF JOIN
201 |
--------------------------------------------------------------------------------
/CaseStudy#8 - Fresh Segments/README.md:
--------------------------------------------------------------------------------
1 | # Case Study #8: Fresh Segments
2 | [](https://github.com/chanronnie/8WeekSQLChallenge)
3 | [](https://8weeksqlchallenge.com/case-study-8/)
4 | 
5 |
6 |
7 | The case study presented here is part of the **8 Week SQL Challenge**.\
8 | It is kindly brought to us by [**Data With Danny**](https://8weeksqlchallenge.com).\
9 | This time, I am using `PostgreSQL queries` (instead of MySQL) in `Jupyter Notebook`.
10 |
11 |
12 | ## Table of Contents
13 | * [Entity Relationship Diagram](#entity-relationship-diagram)
14 | * [Datasets](#datasets)
15 | * [Case Study Questions](#case-study-questions)
16 | * [Solutions](#solutions)
17 | * [PostgreSQL Topics Covered](#postgresql-topics-covered)
18 |
19 |
20 | ## Entity Relationship Diagram
21 | Again, I used the [DB Diagram tool](https://dbdiagram.io/home) to create the following ERD.
22 | 
23 |
24 |
25 |
26 | View code to create ERD
27 |
28 | Here is the code that I used to create the ERD for the Fresh Segments datasets on the DB Diagram tool.
29 |
30 | ```markdown
31 | TABLE interest_metrics {
32 | "_month" VARCHAR(4)
33 | "_year" VARCHAR(4)
34 | "month_year" VARCHAR(7)
35 | "interest_id" VARCHAR(5)
36 | "composition" FLOAT
37 | "index_value" FLOAT
38 | "ranking" INTEGER
39 | "percentile_ranking" FLOAT
40 | }
41 |
42 | TABLE interest_map {
43 | "id" INTEGER
44 | "interest_name" TEXT
45 | "interest_summary" TEXT
46 | "created_at" TIMESTAMP
47 | "last_modified" TIMESTAMP
48 | }
49 |
50 |
51 | # Establish relationships between datasets
52 | REF: interest_metrics.interest_id > interest_map.id
53 | ```
54 |
55 |
56 |
57 |
58 | ## Datasets
59 | Case Study #8 contains 2 datasets:
60 |
61 | ### `interest_metrics`
62 |
63 |
64 | View Table: interest_metrics
65 |
66 | The `interest_metrics` table presents aggregated interest metrics for a major client of Fresh Segments, reflecting the performance of interest_id based on clicks and interactions with targeted advertising content among their customer base.
67 |
68 | _month | _year | month_year | interest_id | composition | index_value | ranking | percentile_ranking
69 | --- | --- | --- | --- | --- | --- | --- | ---
70 | 7 | 2018 | 07-2018 | 32486 | 11.89 | 6.19 | 1 | 99.86
71 | 7 | 2018 | 07-2018 | 6106 | 9.93 | 5.31 | 2 | 99.73
72 | 7 | 2018 | 07-2018 | 18923 | 10.85 | 5.29 | 3 | 99.59
73 | 7 | 2018 | 07-2018 | 6344 | 10.32 | 5.1 | 4 | 99.45
74 | 7 | 2018 | 07-2018 | 100 | 10.77 | 5.04 | 5 | 99.31
75 | 7 | 2018 | 07-2018 | 69 | 10.82 | 5.03 | 6 | 99.18
76 | 7 | 2018 | 07-2018 | 79 | 11.21 | 4.97 | 7 | 99.04
77 | 7 | 2018 | 07-2018 | 6111 | 10.71 | 4.83 | 8 | 98.9
78 | 7 | 2018 | 07-2018 | 6214 | 9.71 | 4.83 | 8 | 98.9
79 | 7 | 2018 | 07-2018 | 19422 | 10.11 | 4.81 | 10 | 98.63
80 |
81 |
82 |
83 | ### `interest_map`
84 |
85 |
86 | View Table: interest_map
87 |
88 | The **`interest_map`** links the interest_id to the corresponding interest_name.
89 |
90 | id | interest_name | interest_summary | created_at | last_modified
91 | --- | --- | --- | --- | ---
92 | 1 | Fitness Enthusiasts | Consumers using fitness tracking apps and websites. | 2016-05-26 14:57:59 | 2018-05-23 11:30:12
93 | 2 | Gamers | Consumers researching game reviews and cheat codes. | 2016-05-26 14:57:59 | 2018-05-23 11:30:12
94 | 3 | Car Enthusiasts | Readers of automotive news and car reviews. | 2016-05-26 14:57:59 | 2018-05-23 11:30:12
95 | 4 | Luxury Retail Researchers | Consumers researching luxury product reviews and gift ideas. | 2016-05-26 14:57:59 | 2018-05-23 11:30:12
96 | 5 | Brides & Wedding Planners | People researching wedding ideas and vendors. | 2016-05-26 14:57:59 | 2018-05-23 11:30:12
97 | 6 | Vacation Planners | Consumers reading reviews of vacation destinations and accommodations. | 2016-05-26 14:57:59 | 2018-05-23 11:30:13
98 | 7 | Motorcycle Enthusiasts | Readers of motorcycle news and reviews. | 2016-05-26 14:57:59 | 2018-05-23 11:30:13
99 | 8 | Business News Readers | Readers of online business news content. | 2016-05-26 14:57:59 | 2018-05-23 11:30:12
100 | 12 | Thrift Store Shoppers | Consumers shopping online for clothing at thrift stores and researching locations. | 2016-05-26 14:57:59 | 2018-03-16 13:14:00
101 | 13 | Advertising Professionals | People who read advertising industry news. | 2016-05-26 14:57:59 | 2018-05-23 11:30:12
102 |
103 |
104 |
105 |
106 |
107 | ## Case Study Questions
108 | Case Study #8 is categorized into 4 question groups\
109 | To view the specific section, please open the link in a *`new tab`* or *`window`*.\
110 | [A. Data Exploration and Cleansing](CaseStudy8_solutions.md#A)\
111 | [B. Interest Analysis](CaseStudy8_solutions.md#B)\
112 | [C. Segment Analysis](CaseStudy8_solutions.md#C)\
113 | [D. Index Analysis](CaseStudy8_solutions.md#D)
114 |
115 |
116 | ## Solutions
117 | - View `fresh_segments` database: [**here**](CaseStudy8_schema.sql)
118 | - View Solution:
119 | - [**Markdown File**](CaseStudy8_solutions.md): offers a more fluid and responsive viewing experience
120 | - [**Jupyter Notebook**](CaseStudy8_solutions.ipynb): contains the original code
121 |
122 |
123 | ## PostgreSQL Topics Covered
124 | - Data Cleaning
125 | - Common Table Expressions (CTE)
126 | - Window Functions (rolling average, sketched below)
127 | - Subqueries
128 | - JOIN
129 |
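130 | As a short sketch of the rolling-average window function pattern listed above (not taken from the original notebook), the query below computes a 3-month rolling average of `composition` per `interest_id`, converting the `MM-YYYY` text in `month_year` to a date for chronological ordering. The connection parameters are placeholders for a local PostgreSQL instance with the `fresh_segments` schema loaded.
131 |
132 | ```python
133 | import pandas as pd
134 | import psycopg2
135 |
136 | # Placeholder credentials for a local PostgreSQL instance with the fresh_segments schema loaded
137 | conn = psycopg2.connect(host="localhost", dbname="postgres", user="postgres", password="postgres")
138 |
139 | # 3-month rolling average of composition per interest_id
140 | query = """
141 | SELECT
142 |     interest_id,
143 |     month_year,
144 |     composition,
145 |     ROUND(AVG(composition) OVER (
146 |         PARTITION BY interest_id
147 |         ORDER BY TO_DATE(month_year, 'MM-YYYY')
148 |         ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
149 |     )::NUMERIC, 2) AS rolling_3_month_avg
150 | FROM fresh_segments.interest_metrics
151 | ORDER BY interest_id, TO_DATE(month_year, 'MM-YYYY');
152 | """
153 |
154 | print(pd.read_sql(query, conn))
155 | conn.close()
156 | ```
157 |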
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 | MIT License
2 |
3 | Copyright (c) 2023 Ronnie Chan
4 |
5 | Permission is hereby granted, free of charge, to any person obtaining a copy
6 | of this software and associated documentation files (the "Software"), to deal
7 | in the Software without restriction, including without limitation the rights
8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 |
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 |
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # 8 Week SQL Challenge
2 | 
3 |
4 |
5 | In my journey to switch careers into the field of Data, I am excited to undertake the [**#8WeekSQLChallenge**](https://8weeksqlchallenge.com/), kindly created by [Data with Danny](https://linktr.ee/datawithdanny) in order to enhance my SQL skills.
6 | This repository will store all the solutions for the case studies included in this challenge.
7 |
8 | ## Case Studies
9 | Case Study | Topic | SQL | Status|
10 | | --- | --- | --- | --- |
11 | |📄 [**#1 Danny's Diner**](https://github.com/chanronnie/8WeekSQLChallenge/tree/main/CaseStudy%231%20-%20Danny's%20Diner) |Customer Analytics | MySQL | **Completed** ✔️|
12 | |📄 [**#2 Pizza Runner**](https://github.com/chanronnie/8WeekSQLChallenge/tree/main/CaseStudy%232%20-%20Pizza%20Runner) | Delivery Operations Analytics | MySQL| **Completed** ✔️|
13 | |📄 [**#3 Foodie-Fi**](https://github.com/chanronnie/8WeekSQLChallenge/tree/main/CaseStudy%233%20-%20Foodie-Fi)| Subscription Analytics | MySQL| **Completed** ✔️|
14 | |📄 [**#4 Data Bank**](https://github.com/chanronnie/8WeekSQLChallenge/tree/main/CaseStudy%234%20-%20Data%20Bank) | Financial Data Storage and Usage Analytics | PostgreSQL| **Completed** ✔️ |
15 | |📄 [**#5 Data Mart**](https://github.com/chanronnie/8WeekSQLChallenge/tree/main/CaseStudy%235%20-%20Data%20Mart) | Sales Performance Analytics | PostgreSQL | **Completed** ✔️ |
16 | |📄 [**#6 Clique Bait**](https://github.com/chanronnie/8WeekSQLChallenge/tree/main/CaseStudy%236%20-%20Clique%20Bait) | Digital Analytics| PostgreSQL | **Completed** ✔️ |
17 | |📄 [**#7 Balanced Tree**](https://github.com/chanronnie/8WeekSQLChallenge/tree/main/CaseStudy%237%20-%20Balanced%20Tree) | Sales Performance Analytics | PostgreSQL | **Completed** ✔️ |
18 | |📄 [**#8 Fresh Segments**](https://github.com/chanronnie/8WeekSQLChallenge/tree/main/CaseStudy%238%20-%20Fresh%20Segments) | Digital Marketing Analytics | PostgreSQL | **Completed** ✔️|
19 |
20 | ## Technologies
21 | 
22 | 
23 | 
24 |
25 | ## Installation
26 |
27 | For writing **MySQL** queries in Jupyter Notebook, we will need to install the `pymysql` library
28 | ```
29 | pip install pymysql
30 | ```
31 |
32 | For writing **PostgreSQL** queries in Jupyter Notebook, we will need to install the `psycopg2` library
33 | ```
34 | pip install psycopg2
35 | ```
36 |
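37 | As a quick usage sketch (not part of the original notebooks), here is how these libraries are typically paired with `pandas` to run queries inside Jupyter. The credentials and database names below are placeholders; adjust them to your local setup.
38 |
39 | ```python
40 | import pandas as pd
41 | import pymysql
42 | import psycopg2
43 |
44 | # MySQL connection (placeholder credentials) - used for Case Studies #1 to #3
45 | mysql_conn = pymysql.connect(host="localhost", user="root", password="password", database="your_database")
46 | print(pd.read_sql("SELECT VERSION();", mysql_conn))
47 | mysql_conn.close()
48 |
49 | # PostgreSQL connection (placeholder credentials) - used for Case Studies #4 to #8
50 | pg_conn = psycopg2.connect(host="localhost", dbname="postgres", user="postgres", password="password")
51 | print(pd.read_sql("SELECT * FROM clique_bait.event_identifier;", pg_conn))
52 | pg_conn.close()
53 | ```
54 |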
--------------------------------------------------------------------------------