├── Case Study #1-Danny's Dinner
└── README.md
├── Case Study #2 - Pizza Runner
└── README.md
├── Case Study #3 - Foodie-Fi
└── README.md
├── Case Study #4 - Data Bank
└── README.md
├── Case Study #5 - Data Mart
└── README.md
├── Case Study #6 - Clique Bait
└── README.md
├── Case Study #7 - Balanced Tree Clothing Co
└── README.md
├── Case Study #8 - Fresh Segments
└── README.md
└── README.md
/Case Study #1-Danny's Dinner/README.md:
--------------------------------------------------------------------------------
1 |
2 | Case Study #1 - Danny's Diner👨🏻‍🍳
3 |
4 | Contents
5 |
13 |
14 |
15 | In early 2021, Danny follows his passion for Japanese food and opens "Danny's Diner," a charming restaurant offering sushi, curry, and ramen. However, lacking data analysis expertise, the restaurant struggles to leverage the basic data collected during its initial months to make informed business decisions. Danny's Diner seeks assistance in using this data effectively to keep the restaurant thriving.
16 |
17 |
18 | Danny aims to utilize customer data to gain valuable insights into their visiting patterns, spending habits, and favorite menu items. By establishing a deeper connection with his customers, he can provide a more personalized experience for his loyal patrons.
19 |
20 | He plans to use these insights to make informed decisions about expanding the existing customer loyalty program. Additionally, Danny seeks assistance in generating basic datasets for his team to inspect the data conveniently, without requiring SQL expertise.
21 |
22 | Due to privacy concerns, he has shared a sample of his overall customer data, hoping it will be sufficient for you to create fully functional SQL queries to address his questions.
23 |
24 | The case study revolves around three key datasets:
25 |
26 | - Sales
27 | - Menu
28 | - Members
29 |
30 |
31 |
32 |
33 |
34 |
35 |
36 |
37 |
38 | What is the total amount each customer spent at the restaurant?
39 |
40 | ```sql
41 | SELECT S.customer_id, SUM(M.price) AS total_amnt
42 | FROM sales S
43 | JOIN menu M ON S.product_id = M.product_id
44 | GROUP BY S.customer_id
45 | ORDER BY customer_id
46 | ```
47 |
48 | Answer:
49 |
50 |
51 | - The SQL query retrieves the `customer_id` and calculates the total amount spent (`total_amnt`) by each customer at the restaurant.
52 | - It combines data from the `sales` and `menu` tables based on matching `product_id`.
53 | - The results are grouped by `customer_id`.
54 | - The query then calculates the total sum of `price` for each group of sales records with the same `customer_id`.
55 | - Finally, the results are sorted in ascending order based on the `customer_id`.
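The join-then-aggregate pattern described above can be tried on a tiny in-memory database. This is a minimal sketch using Python's built-in `sqlite3` module rather than PostgreSQL, and the rows are hypothetical sample values invented for illustration, not Danny's actual data.

```python
import sqlite3

# Hypothetical sample rows for illustration only; not Danny's actual data.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE menu  (product_id INTEGER, product_name TEXT, price INTEGER);
    CREATE TABLE sales (customer_id TEXT, order_date TEXT, product_id INTEGER);
    INSERT INTO menu  VALUES (1, 'sushi', 10), (2, 'curry', 15), (3, 'ramen', 12);
    INSERT INTO sales VALUES
        ('A', '2021-01-01', 1),
        ('A', '2021-01-02', 2),
        ('B', '2021-01-01', 3),
        ('B', '2021-01-03', 3);
""")

# Same shape as the query above: join on product_id, then sum price per customer.
totals = conn.execute("""
    SELECT s.customer_id, SUM(m.price) AS total_amnt
    FROM sales s
    JOIN menu m ON s.product_id = m.product_id
    GROUP BY s.customer_id
    ORDER BY s.customer_id
""").fetchall()

print(totals)  # [('A', 25), ('B', 24)]
```

With these sample rows, customer A spends 10 + 15 and customer B spends 12 + 12.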
56 |
57 |
58 |
59 |
60 | How many days has each customer visited the restaurant?
61 |
62 | ```sql
63 | SELECT customer_id, COUNT(DISTINCT order_date) AS No_Days
64 | FROM sales
65 | GROUP BY customer_id
66 | ```
67 | Answer:
68 |
69 |
70 | - The SQL query selects the `customer_id` and counts the number of distinct order dates (`No_Days`) for each customer.
71 | - It retrieves data from the `sales` table.
72 | - The results are grouped by `customer_id`.
73 | - The `COUNT(DISTINCT order_date)` function calculates the number of unique order dates for each customer.
74 | - Finally, the query presents the total number of unique order dates as `No_Days` for each customer.
75 |
76 |
77 |
78 |
79 | What was the first item from the menu purchased by each customer?
80 |
81 | ```sql
82 | WITH CTE AS
83 | (SELECT S.customer_id,DENSE_RANK() OVER(PARTITION BY S.customer_id ORDER BY S.order_date)AS rn,M.product_name
84 | FROM sales S
85 | JOIN menu M ON S.product_id=M.product_id)
86 |
87 | SELECT customer_id,product_name
88 | FROM CTE
89 | WHERE rn=1
90 | ```
91 |
92 | Answer:
93 |
94 |
95 | - The SQL query uses a Common Table Expression (CTE) named `CTE` to generate a temporary result set.
96 | - Within the CTE, it selects the `customer_id`, assigns a dense rank to each row based on the order date for each customer, and retrieves the corresponding `product_name` from the `menu` table.
97 | - The `sales` table is joined with the `menu` table on matching `product_id`.
98 | - The `DENSE_RANK()` function assigns a rank to each row within the partition of each `customer_id` based on the `order_date` in ascending order.
99 | - Each `customer_id` has its own partition and separate ranks based on the order dates of their purchases.
100 | - Next, the main query selects the `customer_id` and corresponding `product_name` from the CTE.
101 | - It keeps only rows where the rank `rn` is equal to 1, that is, the purchases made on each customer's earliest `order_date`.
102 | - As a result, the query returns the first purchased product for each customer. Rows that share the same `order_date` tie for rank 1, so a customer who ordered several items on their first day gets one row per item.
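To see how `DENSE_RANK()` treats same-day orders, the sketch below runs the same CTE shape against a tiny in-memory SQLite database. It assumes SQLite 3.25 or later (needed for window functions), and the rows are hypothetical sample values, not Danny's actual data.

```python
import sqlite3

# Hypothetical sample rows for illustration only; not Danny's actual data.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE menu  (product_id INTEGER, product_name TEXT, price INTEGER);
    CREATE TABLE sales (customer_id TEXT, order_date TEXT, product_id INTEGER);
    INSERT INTO menu  VALUES (1, 'sushi', 10), (2, 'curry', 15), (3, 'ramen', 12);
    INSERT INTO sales VALUES
        ('A', '2021-01-01', 1),
        ('A', '2021-01-01', 2),
        ('A', '2021-01-05', 3),
        ('B', '2021-01-02', 2);
""")

# Rank each customer's orders by date; rows sharing the earliest date all get rank 1.
first_items = conn.execute("""
    WITH ranked AS (
        SELECT s.customer_id,
               m.product_name,
               DENSE_RANK() OVER (PARTITION BY s.customer_id ORDER BY s.order_date) AS rn
        FROM sales s
        JOIN menu m ON s.product_id = m.product_id
    )
    SELECT customer_id, product_name
    FROM ranked
    WHERE rn = 1
    ORDER BY customer_id, product_name
""").fetchall()

print(first_items)  # [('A', 'curry'), ('A', 'sushi'), ('B', 'curry')]
```

Customer A placed two orders on the first day, so both items come back with rank 1.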
103 |
104 |
105 |
106 | What is the most purchased item on the menu and how many times was it purchased by all customers?
107 |
108 | ```sql
109 | SELECT M.product_name,COUNT(S.product_id)AS most_ordered
110 | FROM Sales S
111 | JOIN menu M ON S.product_id=M.product_id
112 | GROUP BY M.product_name
113 | ORDER BY most_ordered DESC
114 | LIMIT 1
115 | ```
116 |
117 | Answer:
118 |
119 |
120 | - The SQL query selects the `product_name` from the `menu` table and counts the number of times each product was ordered (`most_ordered`).
121 | - It retrieves data from the `Sales` table and joins it with the `menu` table based on matching `product_id`.
122 | - The results are grouped by `product_name`.
123 | - The `COUNT(S.product_id)` function calculates the number of occurrences of each `product_id` in the `Sales` table.
124 | - The query then presents the `product_name` and its corresponding count as `most_ordered` for each product.
125 | - Next, the results are sorted in descending order based on the `most_ordered` column, so the most ordered product appears first.
126 | - The `LIMIT 1` clause restricts the result to only one row, effectively showing the most ordered product.
127 |
128 |
129 |
130 |
131 | Which item was the most popular for each customer?
132 |
133 | ```sql
134 | WITH CTE AS
135 | (SELECT S.customer_id,M.product_name,COUNT(M.product_id)AS order_count,DENSE_RANK() OVER(PARTITION BY S.customer_id ORDER BY COUNT(S.product_id)DESC) AS rnk
136 | FROM sales S
137 | JOIN menu M ON S.product_id=M.product_id
138 | GROUP BY S.customer_id,M.product_name)
139 |
140 | SELECT customer_id,product_name,order_count
141 | FROM CTE
142 | WHERE rnk=1
143 | ```
144 |
145 | Answer:
146 |
147 |
148 | - The SQL query uses a Common Table Expression (CTE) named `CTE` to generate a temporary result set.
149 | - Within the CTE, it selects the `customer_id`, `product_name`, and counts the number of times each product was ordered (`order_count`) for each customer.
150 | - It retrieves data from the `sales` table and joins it with the `menu` table based on matching `product_id`.
151 | - The results are grouped by `customer_id` and `product_name` to get the count of orders for each product of each customer.
152 | - The `COUNT(M.product_id)` function counts the joined sales rows in each `customer_id` and `product_name` group, which is the number of times that customer ordered that product.
153 | - The `DENSE_RANK()` function assigns a rank to each row within the partition of each `customer_id` based on the order count of products in descending order.
154 | - Each `customer_id` has its own partition and separate ranks based on the number of product orders.
155 | - Next, the main query selects the `customer_id`, `product_name`, and `order_count` from the CTE.
156 | - It keeps only rows where the rank `rnk` is equal to 1, which means the most ordered product for each `customer_id`. Products tied for the highest count are all returned.
157 | - As a result, the query returns the customer's ID, the most ordered product, and the number of times it was ordered by that customer.
158 |
159 |
160 |
161 | Which item was purchased first by the customer after they became a member?
162 |
163 | ```sql
164 | SELECT DISTINCT ON (s.customer_id)
165 | s.customer_id,
166 | m.product_name
167 | FROM sales s
168 | JOIN members mbr ON s.customer_id = mbr.customer_id
169 | JOIN menu m ON s.product_id = m.product_id
170 | WHERE s.order_date > mbr.join_date
171 | ORDER BY s.customer_id, s.order_date;
172 | ```
173 |
174 | Answer:
175 |
176 |
177 | - The SQL query returns one row per `customer_id` with the corresponding `product_name`, using the `sales`, `members`, and `menu` tables.
178 | - It filters the data based on the condition that the `order_date` in the `sales` table is greater than the `join_date` of the customer in the `members` table.
179 | - The `sales` table is aliased as `s`, the `members` table is aliased as `mbr`, and the `menu` table is aliased as `m`.
180 | - The query performs inner joins between the `sales` and `members` tables on matching `customer_id` and between the `sales` and `menu` tables on matching `product_id`.
181 | - Only rows that meet the join condition and the `order_date > join_date` condition are considered in the result set.
182 | - The query selects the `customer_id` and corresponding `product_name` for each customer who has placed an order after their `join_date`.
183 | - The results are sorted by `customer_id` and then by `order_date` in ascending order, so each customer's earliest qualifying order comes first.
184 | - The `DISTINCT ON (s.customer_id)` clause keeps only the first row for each `customer_id` in that sort order. Without `order_date` in the `ORDER BY`, the row kept for each customer would be arbitrary.
185 | - As a result, the query returns a unique list of `customer_id` along with the first `product_name` they ordered after joining as a member.
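`DISTINCT ON` is a PostgreSQL extension, and the row it keeps for each customer depends on the `ORDER BY`. As a hedged alternative, the same "earliest order after joining" result can be expressed with the standard `ROW_NUMBER()` window function. The sketch below runs it on SQLite (3.25 or later) with hypothetical sample rows that are deliberately inserted out of date order.

```python
import sqlite3

# Hypothetical sample rows for illustration only; not Danny's actual data.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE menu    (product_id INTEGER, product_name TEXT, price INTEGER);
    CREATE TABLE sales   (customer_id TEXT, order_date TEXT, product_id INTEGER);
    CREATE TABLE members (customer_id TEXT, join_date TEXT);
    INSERT INTO menu    VALUES (1, 'sushi', 10), (2, 'curry', 15), (3, 'ramen', 12);
    INSERT INTO members VALUES ('A', '2021-01-07'), ('B', '2021-01-09');
    INSERT INTO sales   VALUES
        ('A', '2021-01-01', 1),
        ('A', '2021-01-10', 3),
        ('A', '2021-01-08', 2),
        ('B', '2021-01-16', 3),
        ('B', '2021-01-11', 1);
""")

# Number each member's post-join orders by date, then keep the earliest one.
first_after_join = conn.execute("""
    WITH ranked AS (
        SELECT s.customer_id,
               m.product_name,
               ROW_NUMBER() OVER (PARTITION BY s.customer_id ORDER BY s.order_date) AS rn
        FROM sales s
        JOIN members mbr ON s.customer_id = mbr.customer_id
        JOIN menu m ON s.product_id = m.product_id
        WHERE s.order_date > mbr.join_date
    )
    SELECT customer_id, product_name
    FROM ranked
    WHERE rn = 1
    ORDER BY customer_id
""").fetchall()

print(first_after_join)  # [('A', 'curry'), ('B', 'sushi')]
```

Ordering by `s.order_date DESC` with a `<` filter gives the mirror-image "last order before joining" query.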
186 |
187 |
188 |
189 | Which item was purchased just before the customer became a member?
190 |
191 | ```sql
192 | SELECT DISTINCT ON (s.customer_id)
193 | s.customer_id,
194 | m.product_name
195 | FROM sales s
196 | JOIN members mbr ON s.customer_id = mbr.customer_id
197 | JOIN menu m ON s.product_id = m.product_id
198 | WHERE s.order_date < mbr.join_date
199 | ORDER BY s.customer_id, s.order_date DESC;
200 | ```
201 |
202 | Answer:
203 |
204 |
205 | - The SQL query returns one row per `customer_id` with the corresponding `product_name`, using the `sales`, `members`, and `menu` tables.
206 | - It filters the data based on the condition that the `order_date` in the `sales` table is less than the `join_date` of the customer in the `members` table.
207 | - The `sales` table is aliased as `s`, the `members` table is aliased as `mbr`, and the `menu` table is aliased as `m`.
208 | - The query performs inner joins between the `sales` and `members` tables on matching `customer_id` and between the `sales` and `menu` tables on matching `product_id`.
209 | - Only rows that meet the join condition and the `order_date < join_date` condition are considered in the result set.
210 | - The query selects the `customer_id` and corresponding `product_name` for each customer who has placed an order before their `join_date`.
211 | - The results are sorted by `customer_id` in ascending order and then by `order_date` in descending order, so each customer's most recent pre-membership order comes first.
212 | - The `DISTINCT ON (s.customer_id)` clause keeps only the first row for each `customer_id` in that sort order. Without `order_date DESC` in the `ORDER BY`, the row kept for each customer would be arbitrary.
213 | - As a result, the query returns a unique list of `customer_id` along with the last `product_name` they ordered before joining as a member.
214 |
215 |
216 |
217 | What is the total items and amount spent for each member before they became a member?
218 |
219 | ```sql
220 | SELECT S.customer_id,
221 | COUNT(S.product_id) AS total_item,
222 | SUM(M.price) AS total_amont
223 | FROM sales S
224 | JOIN menu M ON S.product_id=M.product_id
225 | JOIN members ME ON S.customer_id=ME.customer_id
226 | WHERE S.order_date < ME.join_date
227 | GROUP BY S.customer_id
228 | ORDER BY S.customer_id
229 | ```
230 |
231 | Answer:
232 |
233 |
234 | - The SQL query retrieves the `customer_id` along with the total count of items ordered (`total_item`) and the total amount spent (`total_amont`) by each customer.
235 | - It retrieves data from the `sales` table and joins it with the `menu` table based on matching `product_id`.
236 | - It also joins the `sales` table with the `members` table based on matching `customer_id`.
237 | - The results are filtered based on the condition that the `order_date` in the `sales` table is less than the `join_date` of the customer in the `members` table.
238 | - The `COUNT(S.product_id)` function counts the sales rows in each group, giving the total number of items ordered by each customer.
239 | - The `SUM(M.price)` function calculates the sum of the `price` from the `menu` table, providing the total amount spent by each customer.
240 | - Results are grouped by `customer_id` to get the totals for each customer.
241 | - The query then presents the `customer_id`, `total_item`, and `total_amont` for each customer who placed orders before joining as a member.
242 | - Finally, the results are sorted in ascending order based on the `customer_id`.
243 |
244 |
245 |
246 | If each $1 spent equates to 10 points and sushi has a 2x points multiplier - how many points would each customer have?
247 |
248 | ```sql
249 | SELECT s.customer_id,
250 | SUM(CASE
251 | WHEN m.product_name = 'sushi' THEN price * 2
252 | ELSE price
253 | END) * 10 AS total_points
254 | FROM sales s
255 | JOIN menu m ON s.product_id = m.product_id
256 | GROUP BY s.customer_id
257 | ORDER BY s.customer_id;
258 | ```
259 |
260 | Answer:
261 |
262 |
263 | - The SQL query retrieves the `customer_id` and calculates the total points (`total_points`) earned by each customer based on their purchases from the `sales` and `menu` tables.
264 | - It retrieves data from the `sales` table and joins it with the `menu` table based on matching `product_id`.
265 | - The query uses a `CASE` expression to differentiate between 'sushi' and other products.
266 | - If the product name is 'sushi', the price is multiplied by 2 to give double points.
267 | - Otherwise, the regular price is used.
268 | - The `SUM` function adds up these weighted prices across each customer's purchases.
269 | - The sum is then multiplied by 10, because each $1 spent is worth 10 points.
270 | - Results are grouped by `customer_id` to get the total points for each customer.
271 | - The query then presents the `customer_id` and the resulting `total_points` for each customer.
272 | - Finally, the results are sorted in ascending order based on the `customer_id`.
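The points arithmetic above can be checked by hand with a few lines of Python. This is a minimal sketch: the helper name `points_for` and the order rows are hypothetical, chosen only to illustrate the 10-points-per-dollar rule and the 2x sushi multiplier.

```python
def points_for(product_name: str, price: int) -> int:
    """Return points for one order: $1 = 10 points, sushi earns double."""
    multiplier = 2 if product_name == "sushi" else 1
    return price * multiplier * 10


# Hypothetical (customer_id, product_name, price) rows; not Danny's actual data.
orders = [
    ("A", "sushi", 10),
    ("A", "curry", 15),
    ("B", "ramen", 12),
    ("B", "ramen", 12),
]

# Accumulate points per customer, mirroring SUM(...) with GROUP BY customer_id.
total_points = {}
for customer_id, product_name, price in orders:
    total_points[customer_id] = total_points.get(customer_id, 0) + points_for(product_name, price)

print(total_points)  # {'A': 350, 'B': 240}
```

Customer A earns 10 × 2 × 10 = 200 for sushi plus 15 × 10 = 150 for curry. Customer B earns 12 × 10 = 120 twice.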
273 |
274 |
275 |
276 | In the first week after a customer joins the program (including their join date) they earn 2x points on all items, not just sushi - how many points do customer A and B have at the end of January?
277 |
278 | ```sql
279 | WITH dates_cte AS (
280 | SELECT
281 | customer_id,
282 | join_date,
283 | join_date + INTERVAL '6 days' AS valid_date,
284 | DATE_TRUNC('month', '2021-01-31'::DATE) + INTERVAL '1 month' - INTERVAL '1 day' AS last_date
285 | FROM members
286 | )
287 |
288 | SELECT
289 | s.customer_id,
290 | SUM(CASE
291 | WHEN m.product_name = 'sushi' OR (s.order_date BETWEEN dates.join_date AND dates.valid_date) THEN 2 * 10 * m.price
292 | ELSE 10 * m.price END) AS points
293 | FROM sales s
294 | INNER JOIN dates_cte AS dates
295 | ON s.customer_id = dates.customer_id
296 | AND dates.join_date <= s.order_date
297 | AND s.order_date <= dates.last_date
298 | INNER JOIN menu m
299 | ON s.product_id = m.product_id
300 | GROUP BY s.customer_id
301 | ORDER BY s.customer_id;
302 | ```
303 |
304 | Answer:
305 |
306 |
307 |
308 | - The SQL query starts by creating a Common Table Expression (CTE) named `dates_cte`.
309 | - Within the CTE, it selects `customer_id`, `join_date`, `join_date + INTERVAL '6 days'` as `valid_date`, and the last day of January 2021 as `last_date`.
310 | - The CTE gives each member a double-points window running from `join_date` to 6 days later (7 days including the join date), plus the January 2021 cut-off.
311 | - Next, the main query selects the `customer_id` and calculates the total points (`points`) earned by each customer based on their purchases from the `sales` and `menu` tables.
312 | - It joins the `sales` table with `dates_cte` on `customer_id`. The extra join conditions keep only orders placed on or after `join_date` and on or before `last_date`, so purchases made before joining are not counted.
313 | - The query uses a `CASE` expression to differentiate between double-points purchases and other purchases.
314 | - If the product name is 'sushi' or the order date falls within the range of `join_date` to `valid_date`, the points are calculated as 2 times 10 times the price of the product.
315 | - Otherwise, the points are calculated as 10 times the price of the product.
316 | - The `SUM` function calculates the total points for each customer by adding up the points earned from their purchases.
317 | - Results are grouped by `customer_id` to get the total points for each customer.
318 | - The query then presents the `customer_id` and the calculated `points` for each customer based on their purchases.
319 | - Finally, the results are sorted in ascending order based on the `customer_id`.
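The first-week rule comes down to a date-range check, which can be sketched in plain Python. The function name `order_points` and the sample dates are hypothetical. The sketch mirrors the `CASE` expression only and does not apply the query's join-date and end-of-January filters.

```python
from datetime import date, timedelta


def order_points(order_date: date, join_date: date, product_name: str, price: int) -> int:
    """Points for one order: double for sushi or for orders in the first week of membership."""
    # join_date + 6 days gives a 7-day window that includes the join date itself.
    bonus_end = join_date + timedelta(days=6)
    in_first_week = join_date <= order_date <= bonus_end
    multiplier = 2 if (product_name == "sushi" or in_first_week) else 1
    return 10 * price * multiplier


# Hypothetical join date and orders, for illustration only.
join = date(2021, 1, 7)
on_join_day = order_points(date(2021, 1, 7), join, "curry", 15)    # inside the window
last_bonus_day = order_points(date(2021, 1, 13), join, "ramen", 12)  # last day of the window
day_after = order_points(date(2021, 1, 14), join, "ramen", 12)     # window has closed
sushi_later = order_points(date(2021, 1, 20), join, "sushi", 10)   # sushi is always doubled

print(on_join_day, last_bonus_day, day_after, sushi_later)  # 300 240 120 200
```

The boundary cases show why the interval is 6 days and not 7: an order on the seventh calendar day (13 January here) still earns double, and the next day does not.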
320 |
321 |
322 |
323 |
324 | Join All The Things
325 |
326 | ```sql
327 | WITH customer_member_status AS (
328 | SELECT
329 | s.customer_id,
330 | s.order_date,
331 | m.product_name,
332 | m.price,
333 | CASE
334 | WHEN mbr.join_date <= s.order_date THEN 'Y'
335 | ELSE 'N'
336 | END AS member
337 | FROM sales s
338 | INNER JOIN menu m ON s.product_id = m.product_id
339 | LEFT JOIN members mbr ON s.customer_id = mbr.customer_id
340 | )
341 | SELECT
342 | customer_id,
343 | order_date,
344 | product_name,
345 | price,
346 | member
347 | FROM customer_member_status
348 | ORDER BY
349 | customer_id,
350 | member DESC,
351 | order_date;
352 | ```
353 |
354 | Answer:
355 |
356 |
357 | - The SQL query starts by creating a Common Table Expression (CTE) named `customer_member_status`.
358 | - Within the CTE, it selects `customer_id`, `order_date`, `product_name`, `price`, and uses a `CASE` expression to determine whether the customer is a member ('Y') or not ('N') based on their join date in the `members` table.
359 | - The `sales` table is aliased as `s`, the `menu` table as `m`, and the `members` table as `mbr`.
360 | - The query performs an inner join between the `sales` and `menu` tables on matching `product_id` and a left join between the `sales` and `members` tables on matching `customer_id`.
361 | - For each row, the `CASE` expression checks if the `join_date` from the `members` table is less than or equal to the `order_date` from the `sales` table.
362 | - If true, it assigns 'Y' (member) to the `member` column, otherwise 'N' (non-member).
363 | - Next, the main query selects the `customer_id`, `order_date`, `product_name`, `price`, and `member` from the `customer_member_status` CTE.
364 | - Results are ordered first by `customer_id` in ascending order, then by `member` in descending order (so members appear first), and finally by `order_date` in ascending order.
365 | - The query presents the final results with the selected columns for each customer, showing whether they are a member or not for each order and sorted accordingly.
366 |
367 |
368 |
369 | Rank All The Things
370 | Danny needs additional information about the ranking of customer products. However, he specifically requires null ranking values for non-member purchases, as he is not interested in ranking customers who are not yet part of the loyalty program.
371 |
372 | ```sql
373 | WITH customers_data AS (
374 | SELECT
375 | sales.customer_id,
376 | sales.order_date,
377 | menu.product_name,
378 | menu.price,
379 | CASE
380 | WHEN members.join_date > sales.order_date THEN 'N'
381 | WHEN members.join_date <= sales.order_date THEN 'Y'
382 | ELSE 'N'
383 | END AS member_status
384 | FROM sales
385 | LEFT JOIN members ON sales.customer_id = members.customer_id
386 | INNER JOIN menu ON sales.product_id = menu.product_id
387 | )
388 |
389 | SELECT
390 | customer_id,
391 | order_date,
392 | product_name,
393 | price,
394 | member_status AS member,
395 | CASE
396 | WHEN member_status = 'N' THEN NULL
397 | ELSE RANK() OVER (PARTITION BY customer_id, member_status ORDER BY order_date)
398 | END AS ranking
399 | FROM customers_data;
400 | ```
401 |
402 | Answer:
403 |
404 |
405 |
406 | - The SQL query starts by creating a Common Table Expression (CTE) named `customers_data`.
407 | - Within the CTE, it selects `customer_id`, `order_date`, `product_name`, `price`, and uses a `CASE` expression to determine the `member_status` by comparing the `join_date` in the `members` table with the `order_date` in the `sales` table.
408 | - The `sales`, `members`, and `menu` tables are referenced by their full names, without aliases.
409 | - The query performs a left join between the `sales` and `members` tables on matching `customer_id` and an inner join between the `sales` and `menu` tables on matching `product_id`.
410 | - For each row, the `CASE` expression compares the `join_date` from the `members` table to the `order_date` from the `sales` table, assigning 'Y' if the `join_date` is less than or equal to the `order_date` (customer is a member) and 'N' otherwise. Customers with no row in `members` have a NULL `join_date` and fall through to 'N'.
411 | - Next, the main query selects the `customer_id`, `order_date`, `product_name`, `price`, and `member_status` from the `customers_data` CTE.
412 | - It also uses a `CASE` expression to calculate the `ranking` for each customer's orders if they are a member ('Y'). The `RANK()` function is used with `PARTITION BY` to rank within each customer and membership status, ordered by `order_date`.
413 | - If the customer is not a member ('N'), the `ranking` is set to `NULL`.
414 | - Results are presented with the selected columns and `ranking` for each customer's orders, showing whether they are a member or not and the order of their purchases.
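The interaction between the left join, the `CASE` expression, and `RANK()` is easier to see on a handful of rows. The sketch below runs a trimmed version of the query (without `price`) on SQLite 3.25 or later. The rows are hypothetical sample values, and an `ORDER BY` is added only to make the printed output deterministic.

```python
import sqlite3

# Hypothetical sample rows for illustration only; not Danny's actual data.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE menu    (product_id INTEGER, product_name TEXT, price INTEGER);
    CREATE TABLE sales   (customer_id TEXT, order_date TEXT, product_id INTEGER);
    CREATE TABLE members (customer_id TEXT, join_date TEXT);
    INSERT INTO menu    VALUES (1, 'sushi', 10), (2, 'curry', 15), (3, 'ramen', 12);
    INSERT INTO members VALUES ('A', '2021-01-07');
    INSERT INTO sales   VALUES
        ('A', '2021-01-01', 1),
        ('A', '2021-01-07', 2),
        ('A', '2021-01-10', 2),
        ('A', '2021-01-10', 3),
        ('A', '2021-01-11', 1),
        ('C', '2021-01-03', 3);
""")

ranked_rows = conn.execute("""
    WITH customers_data AS (
        SELECT sales.customer_id,
               sales.order_date,
               menu.product_name,
               CASE WHEN members.join_date <= sales.order_date THEN 'Y' ELSE 'N' END AS member_status
        FROM sales
        LEFT JOIN members ON sales.customer_id = members.customer_id
        INNER JOIN menu ON sales.product_id = menu.product_id
    )
    SELECT customer_id, order_date, product_name, member_status,
           CASE WHEN member_status = 'N' THEN NULL
                ELSE RANK() OVER (PARTITION BY customer_id, member_status ORDER BY order_date)
           END AS ranking
    FROM customers_data
    ORDER BY customer_id, order_date, product_name
""").fetchall()

# Keep just the ranking column to highlight the NULLs and the tie.
rankings = [row[4] for row in ranked_rows]
print(rankings)  # [None, 1, 2, 2, 4, None]
```

Customer C has no `members` row, so the ranking is `NULL`. Customer A's two orders on 2021-01-10 tie at rank 2, and `RANK()` then skips to 4.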
415 |
416 |
417 |
418 |
419 |
420 |
421 | - Customer Spending: The total amount spent by each customer at Danny's Diner varies widely. Some customers have spent significantly more than others, indicating potential high-value customers or loyal patrons.
422 | - Customer Visits: The number of days each customer visited the restaurant also varies, showing different patterns of customer engagement. Some customers visit frequently, while others may visit less often.
423 | - First Purchases: Understanding the first items purchased by each customer can help Danny identify popular entry items and potentially attract new customers.
424 | - Most Popular Item: The most purchased item on the menu is valuable information for Danny. He can use this insight to optimize inventory management and capitalize on the popularity of the item.
425 | - Personalized Recommendations: Knowing the most popular item for each customer allows Danny to make personalized menu suggestions, enhancing the dining experience for his customers.
426 | - Customer Loyalty: The data about purchases before and after joining the loyalty program helps Danny evaluate the effectiveness of the loyalty program and its impact on customer behavior.
427 | - Bonus Points for New Members: By offering 2x points to new members during their first week, Danny incentivizes more spending, encouraging members to engage more actively with the program.
428 | - Member Points: The points earned by each member can be used to assess their loyalty and potentially offer targeted rewards and promotions.
429 | - Data Visualization: Creating visualizations based on the data and insights can further aid Danny in understanding trends and making data-driven decisions.
430 | - Customer Segmentation: By analyzing customer spending habits, Danny can segment his customer base and tailor marketing strategies accordingly.
431 | - Expanding Membership: Danny can use the insights from this data to refine his loyalty program and attract new members, leveraging the success of the program.
432 | - Inventory Management: Knowing the most popular and least popular items can help Danny optimize his inventory, reduce wastage, and maximize profits.
433 | - Menu Optimization: Danny can use the data to evaluate the performance of different menu items and consider introducing new dishes based on customer preferences.
434 | - Customer Engagement: Analyzing customer behavior can help Danny understand what keeps customers coming back and help him create more engaging experiences.
435 | - Long-Term Growth: By leveraging data analysis, Danny can make informed decisions that contribute to the long-term growth and success of Danny's Diner.
436 |
437 |
438 |
439 |
440 |
441 |
442 |
443 |
444 |
445 |
--------------------------------------------------------------------------------
/Case Study #3 - Foodie-Fi/README.md:
--------------------------------------------------------------------------------
1 | Case Study #3 - Foodie-Fi🫕🥑
2 |
3 | Contents
4 |
5 |
18 |
19 |
20 | In response to the growing popularity of subscription-based businesses, Danny recognized a significant market gap. He saw an opportunity to introduce a distinctive streaming service, one that exclusively featured culinary content—a culinary counterpart to platforms like Netflix. With this vision in mind, Danny collaborated with a group of astute friends to establish Foodie-Fi, a startup launched in 2020. The company quickly began offering monthly and annual subscription plans, granting subscribers unrestricted access to an array of exclusive culinary videos sourced from across the globe.
21 |
22 | At the heart of Foodie-Fi's foundation lies Danny's commitment to data-driven decision-making. He was resolute in leveraging data to inform all future investments and innovative features. This case study delves into the realm of subscription-style digital data, where insights derived from data analysis are pivotal in addressing critical business inquiries and steering the company toward sustained growth and success.
23 |
24 | Entity Relationship Diagram
25 |
26 |
27 | Case Study Questions & Solutions
28 |
29 | A. Customer Journey🫕🥑
30 |
31 |
32 | Based off the 8 sample customers provided in the sample from the subscriptions table, write a brief description about each customer’s onboarding journey.
33 |
34 | ```sql
35 | SELECT
36 | s.customer_id,
37 | p.plan_name,
38 | s.start_date
39 | FROM
40 | subscriptions s
41 | JOIN
42 | plans p ON s.plan_id = p.plan_id
43 | ORDER BY
44 | s.customer_id, s.start_date;
45 | ```
46 |
47 | Answer:
48 |
49 |
50 |
51 |
52 | - Customer 1:
55 |   - Started with a trial plan on 2020-08-01.
56 |   - Upgraded to the basic monthly plan on 2020-08-08.
57 |
59 | - Customer 2:
62 |   - Started with a trial plan on 2020-09-20.
63 |   - Upgraded to the pro annual plan on 2020-09-27.
64 |
66 | - Customer 11:
69 |   - Started with a trial plan on 2020-11-19.
70 |   - Cancelled (moved to the churn plan) on 2020-11-26, right after the trial.
71 |
73 | - Customer 13:
76 |   - Started with a trial plan on 2020-12-15.
77 |   - Upgraded to the basic monthly plan on 2020-12-22.
78 |   - Upgraded to the pro monthly plan on 2021-03-29.
79 |
81 | - Customer 15:
84 |   - Started with a trial plan on 2020-03-17.
85 |   - Upgraded to the pro monthly plan on 2020-03-24.
86 |   - Cancelled (moved to the churn plan) on 2020-04-29.
87 |
89 | - Customer 16:
92 |   - Started with a trial plan on 2020-05-31.
93 |   - Upgraded to the basic monthly plan on 2020-06-07.
94 |   - Upgraded to the pro annual plan on 2020-10-21.
95 |
97 | - Customer 18:
100 |   - Started with a trial plan on 2020-07-06.
101 |   - Upgraded to the pro monthly plan on 2020-07-13.
102 |
104 | - Customer 19:
107 |   - Started with a trial plan on 2020-06-22.
108 |   - Upgraded to the pro monthly plan on 2020-06-29.
109 |   - Upgraded to the pro annual plan on 2020-08-29.
110 |
111 |
112 |
113 |
114 |
115 | B. Data Analysis Questions📊
116 |
117 | How many customers has Foodie-Fi ever had?
118 |
119 | ```sql
120 | SELECT COUNT(DISTINCT customer_id) AS customer_count
121 | FROM subscriptions
122 | ```
123 | Answer:
124 |
125 |
126 | - The SQL query retrieves the count of distinct `customer_id` values from the `subscriptions` table.
127 | - The `COUNT(DISTINCT customer_id)` function calculates the number of unique `customer_id` values in the `subscriptions` table.
128 | - As a result, the query presents the total number of unique customers who have subscriptions in the `subscriptions` table.
129 |
130 |
131 |
132 | What is the monthly distribution of trial plan start_date values for our dataset - use the start of the month as the group by value
133 |
134 | ```sql
135 | SELECT
136 | DATE_TRUNC('month', start_date) AS month_start,
137 | COUNT(*) AS trial_starts
138 | FROM
139 | subscriptions
140 | WHERE
141 | plan_id = 0
142 | GROUP BY
143 | month_start
144 | ORDER BY
145 | month_start;
146 | ```
147 | Answer:
148 |
149 |
150 | - The SQL query retrieves data related to trial subscriptions from the `subscriptions` table.
151 | - The `DATE_TRUNC` function is used to truncate the `start_date` column to the start of each month, resulting in the `month_start` column.
152 | - The `COUNT(*)` function calculates the number of records (trial starts) in the `subscriptions` table for each truncated month.
153 | - The `WHERE plan_id = 0` condition filters the data to include only trial subscriptions, where the `plan_id` is 0.
154 | - Results are grouped by the `month_start` column, which represents the start of each month after truncation.
155 | - The `ORDER BY month_start` clause arranges the results in ascending order based on the `month_start` column.
156 | - As a result, the query presents the truncated start of each month and the corresponding count of trial subscription starts for that month from the `subscriptions` table.
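Truncating a date to the start of its month and then counting per bucket can be mirrored in plain Python, which may help when checking the query's output by hand. This is a sketch under stated assumptions: the tuples are hypothetical `(customer_id, plan_id, start_date)` rows, and `date.replace(day=1)` stands in for `DATE_TRUNC('month', ...)`.

```python
from collections import Counter
from datetime import date

# Hypothetical (customer_id, plan_id, start_date) rows; plan_id 0 is the trial plan.
subscriptions = [
    (1, 0, date(2020, 8, 1)),
    (1, 1, date(2020, 8, 8)),
    (2, 0, date(2020, 9, 20)),
    (3, 0, date(2020, 8, 15)),
]

# Keep trial rows only, snap each start_date to the first of its month, and count.
trial_starts = Counter(
    start_date.replace(day=1)
    for _customer_id, plan_id, start_date in subscriptions
    if plan_id == 0
)

# Sort by month, mirroring ORDER BY month_start.
monthly = sorted(trial_starts.items())
print(monthly)  # [(datetime.date(2020, 8, 1), 2), (datetime.date(2020, 9, 1), 1)]
```

The non-trial row for customer 1 is filtered out before counting, just as the `WHERE plan_id = 0` clause does.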
157 |
158 |
159 |
160 | What plan start_date values occur after the year 2020 for our dataset? Show the breakdown by count of events for each plan_name
161 |
162 | ```sql
163 | SELECT
164 | p.plan_name,
165 | p.plan_id,
166 | COUNT(*) AS event_count
167 | FROM
168 | subscriptions s
169 | JOIN
170 | plans p ON s.plan_id = p.plan_id
171 | WHERE
172 | s.start_date > '2020-12-31' -- Filter for start dates after 2020
173 | GROUP BY
174 | p.plan_name, p.plan_id
175 | ORDER BY
176 | p.plan_name, p.plan_id;
177 | ```
178 | Answer:
179 |
180 |
181 | - The SQL query retrieves data related to subscription events from the `subscriptions` table and their corresponding plan details from the `plans` table.
182 | - The `SELECT` statement selects the `plan_name`, `plan_id`, and the count of records (subscription events) as `event_count`.
183 | - The `JOIN` clause connects the `subscriptions` table with the `plans` table using the common `plan_id` column.
184 | - The `WHERE s.start_date > '2020-12-31'` condition filters the data to include only records with start dates after December 31, 2020.
185 | - Results are grouped by the `plan_name` and `plan_id` columns from the `plans` table.
186 | - The `ORDER BY p.plan_name, p.plan_id` clause arranges the results in ascending order based on the `plan_name` and `plan_id` columns.
187 | - As a result, the query presents the plan name, plan ID, and the corresponding count of subscription events for each plan that started after December 31, 2020, using data from the `subscriptions` and `plans` tables.
188 |
189 |
190 |
191 | What is the customer count and percentage of customers who have churned rounded to 1 decimal place?
192 |
193 | ```sql
194 | SELECT
195 | COUNT(DISTINCT CASE WHEN s.plan_id = 4 AND s.start_date <= CURRENT_DATE THEN s.customer_id END) AS churned_customers,
196 | ROUND(COUNT(DISTINCT CASE WHEN s.plan_id = 4 AND s.start_date <= CURRENT_DATE THEN s.customer_id END) * 100.0 / COUNT(DISTINCT s.customer_id), 1) AS churned_percentage
197 | FROM
198 | subscriptions s;
199 | ```
200 | Answer:
201 |
202 |
203 | - The SQL query calculates the count and percentage of churned customers from the `subscriptions` table.
204 | - The first `COUNT(DISTINCT CASE WHEN ... END)` expression calculates the count of distinct customer IDs who have a plan ID of 4 and started that plan on or before the current date. This count represents the churned customers.
205 | - The second `ROUND(...)` expression calculates the churned percentage by dividing the count of churned customers by the count of distinct customer IDs and then multiplying by 100.0. The `ROUND` function rounds the result to one decimal place.
206 | - The `FROM subscriptions s` clause specifies that the data is being retrieved from the `subscriptions` table with the alias `s`.
207 | - As a result, the query presents the count of churned customers (those with plan ID 4 and a start date on or before the current date) and the corresponding percentage of churned customers in relation to all distinct customers from the `subscriptions` table.
208 |
209 |
210 |
211 | How many customers have churned straight after their initial free trial - what percentage is this rounded to the nearest whole number?
212 |
213 | ```sql
214 | WITH ranked_cte AS (
215 | SELECT
216 | sub.customer_id,
217 | plans.plan_name,
218 | LEAD(plans.plan_name) OVER (
219 | PARTITION BY sub.customer_id
220 | ORDER BY sub.start_date) AS next_plan
221 | FROM subscriptions AS sub
222 | JOIN plans
223 | ON sub.plan_id = plans.plan_id
224 | )
225 |
226 | SELECT
227 | COUNT(customer_id) AS customers,
228 | ROUND(100.0 *
229 | COUNT(customer_id)
230 | / (SELECT COUNT(DISTINCT customer_id)
231 | FROM subscriptions)
232 | ) AS churn_percentage
233 | FROM ranked_cte
234 | WHERE plan_name = 'trial'
235 | AND next_plan = 'churn';
236 |
237 | ```
238 | Answer:
239 |
240 |
241 | - The SQL query calculates the count and percentage of customers who moved from the trial plan straight to the churn plan, based on data from the `subscriptions` and `plans` tables.
242 | - The `ranked_cte` Common Table Expression (CTE) is defined to retrieve customer IDs, current plan names, and the next plan names using the `LEAD` window function.
243 | - Within the `ranked_cte` CTE, the `LEAD(plans.plan_name) OVER (PARTITION BY sub.customer_id ORDER BY sub.start_date)` function retrieves the name of the next plan for each customer, ordered by their subscription start date.
244 | - The main query selects the count of customer IDs who transitioned from a 'trial' plan to a 'churn' plan and calculates the churn percentage.
245 | - The `COUNT(customer_id)` function counts the number of customers who transitioned from a 'trial' plan to a 'churn' plan.
246 | - The `ROUND(...)` function calculates the churn percentage by dividing the count of customers who transitioned by the total count of distinct customer IDs from the `subscriptions` table, multiplied by 100.0. The result is rounded to the nearest whole number.
247 | - The subquery `(SELECT COUNT(DISTINCT customer_id) FROM subscriptions)` calculates the total count of distinct customer IDs in the `subscriptions` table.
248 | - The `WHERE plan_name = 'trial' AND next_plan = 'churn'` condition filters the data to include only cases where customers transitioned from a 'trial' plan to a 'churn' plan.
249 | - As a result, the query presents the count and percentage of customers who have transitioned from a 'trial' plan to a 'churn' plan using data from the `subscriptions` and `plans` tables.
250 |
251 |
252 |
253 | What is the number and percentage of customer plans after their initial free trial?
254 |
255 | ```sql
256 | WITH next_plans AS (
257 | SELECT
258 | customer_id,
259 | plan_id,
260 | LEAD(plan_id) OVER(
261 | PARTITION BY customer_id
262 | ORDER BY plan_id) as next_plan_id
263 | FROM subscriptions
264 | )
265 |
266 | SELECT
267 | next_plan_id AS plan_id,
268 | COUNT(customer_id) AS after_free_customers,
269 | ROUND(100 *
270 | COUNT(customer_id)::NUMERIC
271 | / (SELECT COUNT(DISTINCT customer_id)
272 | FROM subscriptions)
273 | ,1) AS after_percentage
274 | FROM next_plans
275 | WHERE next_plan_id IS NOT NULL
276 | AND plan_id = 0
277 | GROUP BY next_plan_id
278 | ORDER BY next_plan_id;
279 | ```
280 | Answer:
281 |
282 |
283 | - The SQL query calculates the count and percentage of customers who transitioned from a free trial plan to a subsequent plan, based on data from the `subscriptions` table.
284 | - The `next_plans` Common Table Expression (CTE) is defined to retrieve customer IDs, current plan IDs, and the next plan IDs using the `LEAD` window function.
285 | - Within the `next_plans` CTE, the `LEAD(plan_id) OVER (PARTITION BY customer_id ORDER BY plan_id)` function retrieves the ID of the next plan for each customer, ordered by plan ID. This matches chronological order only when a customer's plan IDs increase over time.
286 | - The main query selects the next plan ID, counts the number of customers who transitioned to that plan, and calculates the percentage of customers who transitioned.
287 | - The `COUNT(customer_id)` function counts the number of customers who transitioned to each next plan.
288 | - The `ROUND(...)` function calculates the percentage of customers who transitioned to each next plan by casting the count of transitioning customers to `NUMERIC`, multiplying by 100, and dividing by the total count of distinct customer IDs from the `subscriptions` table. The result is rounded to one decimal place.
289 | - The subquery `(SELECT COUNT(DISTINCT customer_id) FROM subscriptions)` calculates the total count of distinct customer IDs in the `subscriptions` table.
290 | - The `WHERE next_plan_id IS NOT NULL AND plan_id = 0` condition filters the data to include only cases where customers transitioned from a free trial plan to a subsequent plan.
291 | - Results are grouped by `next_plan_id` and ordered by `next_plan_id`.
292 | - As a result, the query presents the next plan ID, count of customers who transitioned to each next plan, and the corresponding percentage of customers who transitioned from a free trial plan to that next plan.
293 |
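Ordering `LEAD` by `plan_id` gives the chronologically next plan only while plan IDs rise over time. The following is a minimal sketch, not the author's query, that orders by `start_date` instead. It assumes PostgreSQL and the `customer_id`, `plan_id`, and `start_date` columns used above; the `toy_subscriptions` rows are made-up illustration data, not Foodie-Fi records.

```sql
-- Hypothetical toy rows: customer 1 goes trial -> pro monthly -> basic monthly.
WITH toy_subscriptions (customer_id, plan_id, start_date) AS (
  VALUES
    (1, 0, DATE '2020-01-01'),
    (1, 2, DATE '2020-01-08'),
    (1, 1, DATE '2020-03-08'),
    (2, 0, DATE '2020-02-01'),
    (2, 4, DATE '2020-02-08')
),
next_plans AS (
  SELECT
    customer_id,
    plan_id,
    -- Order by start_date so "next" means the next plan in time.
    LEAD(plan_id) OVER (
      PARTITION BY customer_id
      ORDER BY start_date
    ) AS next_plan_id
  FROM toy_subscriptions
)
SELECT customer_id, next_plan_id
FROM next_plans
WHERE plan_id = 0
ORDER BY customer_id;
```

With these toy rows, customer 1's plan after the trial is 2 (pro monthly), whereas ordering by `plan_id` would report 1.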
294 |
295 |
296 | What is the customer count and percentage breakdown of all 5 plan_name values at 2020-12-31?
297 |
298 | ```sql
299 | SELECT
300 | p.plan_name,
301 | COUNT(DISTINCT s.customer_id) AS customer_count,
302 | ROUND(100.0 * COUNT(DISTINCT s.customer_id) / (SELECT COUNT(DISTINCT customer_id) FROM subscriptions WHERE start_date <= '2020-12-31'), 1) AS percentage
303 | FROM
304 | subscriptions s
305 | JOIN
306 | plans p ON s.plan_id = p.plan_id
307 | WHERE
308 | s.start_date <= '2020-12-31'
309 | GROUP BY
310 | p.plan_name,p.plan_id
311 | ORDER BY
312 | p.plan_id;
313 | ```
314 | Answer:
315 |
316 |
317 | - The SQL query calculates the count and percentage of customers for each plan who subscribed on or before December 31, 2020, based on data from the `subscriptions` and `plans` tables.
318 | - The `SELECT` statement selects the `plan_name`, count of distinct customer IDs, and the corresponding percentage of customers for each plan.
319 | - The `COUNT(DISTINCT s.customer_id)` function calculates the count of distinct customer IDs who subscribed on or before December 31, 2020, for each plan.
320 | - The `ROUND(...)` expression calculates the percentage of customers for each plan by dividing the count of distinct customer IDs by the total count of distinct customer IDs who subscribed on or before December 31, 2020, multiplied by 100.0. The result is rounded to one decimal place.
321 | - The subquery `(SELECT COUNT(DISTINCT customer_id) FROM subscriptions WHERE start_date <= '2020-12-31')` calculates the total count of distinct customer IDs who subscribed on or before December 31, 2020.
322 | - The `FROM subscriptions s` clause specifies that the data is being retrieved from the `subscriptions` table with the alias `s`.
323 | - The `JOIN plans p ON s.plan_id = p.plan_id` clause connects the `subscriptions` table with the `plans` table using the common `plan_id` column.
324 | - The `WHERE s.start_date <= '2020-12-31'` condition filters the data to include only records with subscription start dates on or before December 31, 2020.
325 | - Results are grouped by the `plan_name` and `plan_id` columns from the `plans` table.
326 | - The `ORDER BY p.plan_id` clause arranges the results in ascending order based on the `plan_id`.
327 | - As a result, the query presents the plan name, count of distinct customers, and corresponding percentage of customers for each plan who subscribed on or before December 31, 2020.
328 |
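A query that counts every plan started on or before the cutoff can place one customer under several plans, so the percentages can sum to more than 100. If the breakdown should reflect each customer's plan as of 2020-12-31, one option is to keep only the most recent row per customer. The sketch below assumes PostgreSQL; `toy_subscriptions` and its rows are hypothetical illustration data, and in the real schema the result would be joined to `plans` for the plan names.

```sql
-- Hypothetical toy rows; only rows on or before the cutoff are considered.
WITH toy_subscriptions (customer_id, plan_id, start_date) AS (
  VALUES
    (1, 0, DATE '2020-08-01'),
    (1, 1, DATE '2020-08-08'),
    (2, 0, DATE '2020-12-28'),
    (2, 2, DATE '2021-01-04'),  -- after the cutoff, ignored
    (3, 0, DATE '2020-05-01'),
    (3, 4, DATE '2020-05-08')
),
latest_plan AS (
  SELECT
    customer_id,
    plan_id,
    -- rn = 1 marks the customer's most recent plan as of the cutoff.
    ROW_NUMBER() OVER (
      PARTITION BY customer_id
      ORDER BY start_date DESC
    ) AS rn
  FROM toy_subscriptions
  WHERE start_date <= DATE '2020-12-31'
)
SELECT
  plan_id,
  COUNT(*) AS customer_count,
  -- SUM(COUNT(*)) OVER () is the total number of customers across all plans.
  ROUND(100.0 * COUNT(*) / SUM(COUNT(*)) OVER (), 1) AS percentage
FROM latest_plan
WHERE rn = 1
GROUP BY plan_id
ORDER BY plan_id;
```

Each customer is counted exactly once here, so the percentages add up to 100 apart from rounding.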
329 |
330 |
331 | How many customers have upgraded to an annual plan in 2020?
332 |
333 | ```sql
334 | SELECT
335 | COUNT(DISTINCT customer_id) AS upgraded_customers_count
336 | FROM
337 | subscriptions
338 | WHERE
339 | start_date >= '2020-01-01'
340 | AND start_date <= '2020-12-31'
341 | AND plan_id = 3; -- Plan ID 3 represents the annual pro plan
342 | ```
343 | Answer:
344 |
345 |
346 | - The SQL query calculates the count of distinct customers who upgraded to the annual pro plan (Plan ID 3) during the year 2020.
347 | - The `COUNT(DISTINCT customer_id)` function calculates the count of distinct customer IDs who meet the specified conditions.
348 | - The `FROM subscriptions` clause specifies that the data is being retrieved from the `subscriptions` table.
349 | - The `WHERE start_date >= '2020-01-01' AND start_date <= '2020-12-31' AND plan_id = 3` condition filters the data to include only records with start dates within the year 2020 and a plan ID equal to 3 (representing the annual pro plan).
350 | - As a result, the query presents the count of distinct customers who upgraded to the annual pro plan (Plan ID 3) during the year 2020.
351 |
352 |
353 |
354 | How many customers downgraded from a pro monthly to a basic monthly plan in 2020?
355 |
356 | ```sql
357 | SELECT COUNT(*) AS num_downgrades
358 | FROM subscriptions prev
359 | JOIN subscriptions current ON prev.customer_id = current.customer_id
360 | AND prev.plan_id = 2 -- Plan ID 2 represents pro monthly plan
361 | AND current.plan_id = 1 -- Plan ID 1 represents basic monthly plan
362 | AND prev.start_date <= current.start_date
363 | WHERE EXTRACT(YEAR FROM prev.start_date) = 2020;
364 | ```
365 | Answer:
366 |
367 |
368 | - The SQL query counts downgrades from the pro monthly plan (Plan ID 2) to the basic monthly plan (Plan ID 1), based on data from the `subscriptions` table.
369 | - The `COUNT(*)` function counts the joined rows that meet the specified conditions.
370 | - The query performs a self-join on the `subscriptions` table, where the `prev` alias represents the earlier subscription and the `current` alias represents the later subscription.
371 | - The `ON prev.customer_id = current.customer_id AND prev.plan_id = 2 AND current.plan_id = 1 AND prev.start_date <= current.start_date` condition pairs a pro monthly row (Plan ID 2) with a basic monthly row (Plan ID 1) for the same customer, where the pro monthly plan started on or before the basic monthly plan.
372 | - The `WHERE EXTRACT(YEAR FROM prev.start_date) = 2020` condition keeps only pairs whose pro monthly plan started in 2020. It does not check the start date of the basic monthly plan, which is when the downgrade takes place, and the join does not require the two plans to be consecutive.
373 | - As a result, the query presents the count of pro monthly (Plan ID 2) to basic monthly (Plan ID 1) pairs whose pro monthly plan started in 2020, based on the specified conditions from the `subscriptions` table.
374 |
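A self-join on plan IDs pairs any pro monthly row with any later basic monthly row. A stricter reading of "downgraded" requires the basic monthly plan to be the very next plan and the switch itself to fall in 2020. The sketch below shows that variant with `LEAD`; it assumes PostgreSQL, and `toy_subscriptions` with its rows is hypothetical illustration data.

```sql
WITH toy_subscriptions (customer_id, plan_id, start_date) AS (
  VALUES
    (1, 0, DATE '2020-01-01'),
    (1, 2, DATE '2020-01-08'),
    (1, 1, DATE '2020-04-08'),   -- pro monthly -> basic monthly in 2020
    (2, 0, DATE '2020-02-01'),
    (2, 1, DATE '2020-02-08'),
    (2, 2, DATE '2020-06-08')    -- an upgrade, not a downgrade
),
plan_changes AS (
  SELECT
    customer_id,
    plan_id,
    -- Next plan and its start date, in chronological order per customer.
    LEAD(plan_id)    OVER w AS next_plan_id,
    LEAD(start_date) OVER w AS next_start_date
  FROM toy_subscriptions
  WINDOW w AS (PARTITION BY customer_id ORDER BY start_date)
)
SELECT COUNT(*) AS num_downgrades
FROM plan_changes
WHERE plan_id = 2                                  -- pro monthly
  AND next_plan_id = 1                             -- immediately followed by basic monthly
  AND EXTRACT(YEAR FROM next_start_date) = 2020;   -- the switch happened in 2020
```

For the toy rows this returns 1, because only customer 1 moves directly from plan 2 to plan 1.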
375 |
376 |
377 | C. Challenge Payment Question💰
378 |
379 | The Foodie-Fi team wants you to create a new payments table for the year 2020 that includes amounts paid by each customer in the subscriptions table with the following requirements:
380 |
381 | - monthly payments always occur on the same day of month as the original start_date of any monthly paid plan
382 | - upgrades from basic to monthly or pro plans are reduced by the current paid amount in that month and start immediately
383 | - upgrades from pro monthly to pro annual are paid at the end of the current billing period and also starts at the end of the month period
384 | - once a customer churns they will no longer make payments
385 |
386 | ```sql
387 | WITH all_plans AS (
388 | SELECT
389 | s.customer_id,
390 | s.plan_id,
391 | p.plan_name,
392 | s.start_date AS plan_start_date,
393 | p.price,
394 | CASE
395 | WHEN p.plan_name LIKE 'pro monthly%' THEN
396 | DATE_TRUNC('month', s.start_date)::DATE
397 | ELSE
398 | CASE
399 | WHEN EXTRACT(MONTH FROM s.start_date) = 12 THEN s.start_date
400 | ELSE DATE_TRUNC('month', s.start_date + INTERVAL '1 month')::DATE
401 | END
402 | END AS payment_date
403 | FROM subscriptions s
404 | JOIN plans p ON s.plan_id = p.plan_id
405 | WHERE s.start_date >= '2020-01-01' AND s.start_date <= '2020-12-31'
406 | )
407 | , payments AS (
408 | SELECT
409 | customer_id,
410 | plan_id,
411 | plan_name,
412 | payment_date,
413 | price AS amount,
414 | ROW_NUMBER() OVER (PARTITION BY customer_id, plan_id, plan_name, payment_date ORDER BY plan_start_date) AS payment_order
415 | FROM all_plans
416 | )
417 | -- Finalize the query
418 | SELECT
419 | customer_id,
420 | plan_id,
421 | plan_name,
422 | payment_date,
423 | amount,
424 | payment_order
425 | FROM payments
426 | WHERE payment_order = 1 OR (amount > 0 AND payment_order > 1)
427 | ORDER BY customer_id, plan_id, payment_date, payment_order
428 | LIMIT 20;
429 | ```
430 | Answer:
431 |
432 |
433 | - The SQL query derives payment rows for subscription plans that started during the year 2020, based on data from the `subscriptions` and `plans` tables.
434 | - The `all_plans` Common Table Expression (CTE) is defined to retrieve customer details, plan details, start dates, prices, and payment dates based on subscription data.
435 | - Within the `all_plans` CTE, a nested `CASE` expression sets the payment date. For plans whose name starts with 'pro monthly', it is the first day of the start month. For all other plans, it is the start date itself when the plan starts in December, and otherwise the first day of the following month.
436 | - The `payments` CTE is defined to build individual payment records based on customer, plan, and payment date.
437 | - Within the `payments` CTE, the `ROW_NUMBER()` function assigns a payment order to each record within the same customer, plan ID, plan name, and payment date partition, ordered by plan start date.
438 | - The main query selects the finalized customer payment details, including customer ID, plan ID, plan name, payment date, amount, and payment order.
439 | - The `WHERE payment_order = 1 OR (amount > 0 AND payment_order > 1)` condition filters the data to include the first payment for each partition and subsequent payments with a positive amount.
440 | - The `ORDER BY customer_id, plan_id, payment_date, payment_order` clause arranges the results in ascending order based on customer ID, plan ID, payment date, and payment order.
441 | - The `LIMIT 20` clause limits the results to the first 20 records.
442 | - As a result, the query presents one payment row per subscription record that started in 2020 (limited to 20 rows), including the customer ID, plan ID, plan name, payment date, amount, and payment order. It does not expand monthly plans into recurring monthly payments.
443 |
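The first requirement says monthly payments recur on the same day of the month as the plan's start date. One way to produce such recurring rows is `generate_series`, sketched below for PostgreSQL. This is an illustration under simplifying assumptions, not a full solution: it covers monthly plans only, ignores the upgrade discount and annual plans, and the `toy_subscriptions` rows and the `price` value are placeholders. It also assumes a start day that exists in every month, since repeated one-month steps from the 29th to the 31st drift.

```sql
WITH toy_subscriptions (customer_id, plan_id, start_date, price) AS (
  VALUES
    (1, 1, DATE '2020-09-08', 9.90),   -- monthly plan, placeholder price
    (1, 4, DATE '2020-11-20', NULL)    -- churn: no payments after this
),
plan_periods AS (
  SELECT
    customer_id,
    plan_id,
    start_date,
    price,
    -- Bill until the day before the next plan starts, capped at the end of 2020.
    LEAST(
      COALESCE(
        LEAD(start_date) OVER (PARTITION BY customer_id ORDER BY start_date) - 1,
        DATE '2020-12-31'
      ),
      DATE '2020-12-31'
    ) AS last_billable_date
  FROM toy_subscriptions
)
SELECT
  pp.customer_id,
  pp.plan_id,
  gs.paid_at::DATE AS payment_date,
  pp.price AS amount,
  ROW_NUMBER() OVER (PARTITION BY pp.customer_id ORDER BY gs.paid_at) AS payment_order
FROM plan_periods AS pp
CROSS JOIN LATERAL generate_series(
  pp.start_date::TIMESTAMP,
  pp.last_billable_date::TIMESTAMP,
  INTERVAL '1 month'
) AS gs(paid_at)
WHERE pp.plan_id IN (1, 2)   -- monthly plans only in this sketch
ORDER BY pp.customer_id, payment_date;
```

For the toy customer this yields payments on 2020-09-08, 2020-10-08, and 2020-11-08, and nothing after the churn date.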
444 |
445 |
446 |
447 |
448 |
--------------------------------------------------------------------------------
/Case Study #4 - Data Bank/README.md:
--------------------------------------------------------------------------------
1 | Case Study #4 - Data Bank🏦
2 |
3 | Contents
4 |
5 |
18 |
19 | Welcome to an enthralling journey into the world of finance, innovation, and data convergence. In this captivating case study project, we immerse ourselves in the realm of Neo-Banks, the trailblazing digital financial entities revolutionizing the industry landscape by eliminating physical branches. Our focus zooms in on an ingenious visionary named Danny, whose aspirations transcend the ordinary, leading to the inception of a remarkable initiative – "Data Bank."
20 |
21 | Within the fabric of this narrative lies a compelling fusion of Neo-Banking, cryptocurrency, and cutting-edge data management. The birth of Data Bank signifies a paradigm shift where the contours of digital banking intertwine seamlessly with the frontiers of secure distributed data storage.
22 |
23 | The essence of Data Bank mirrors that of traditional digital banks, albeit with an unprecedented twist. Not confined merely to financial transactions, this avant-garde establishment boasts the accolade of harboring the world's most impregnable distributed data storage platform. As we delve deeper, a unique interplay emerges – customers' cloud data storage allotments are intricately tied to their financial standing. Yet, the intricacies of this symbiotic relationship beckon us to explore further.
24 |
25 | Our protagonist, Data Bank, has set its sights on twin goals: expanding its customer base and, concurrently, gaining insights to navigate the data storage landscape. It is in this intricate tapestry of aspirations that we invite you, the analytical trailblazers, to join hands with the Data Bank team.
26 |
27 | This case study project is an expedition into the realms of metric calculation, growth strategy formulation, and incisive data-driven insights. As collaborators in this journey, we wield the tools of analysis to forecast and shape future developments. The canvas before us is one of intricate calculation, strategic expansion, and data-driven foresight. Embark on this expedition as we decipher the chronicles of Data Bank's evolution, where innovation intersects with pragmatism, and where data illuminates the path forward.
28 |
29 | Entity Relationship Diagram
30 |
31 | Case Study Questions & Solutions
32 |
33 | A. Customer Nodes Exploration👥
34 |
35 | How many unique nodes are there on the Data Bank system?
36 |
37 | ```sql
38 | SELECT COUNT(DISTINCT node_id) AS unique_nodes_count
39 | FROM customer_nodes;
40 | ```
41 | Answer:
42 |
43 |
44 | - The SQL query calculates the count of distinct node IDs in the `customer_nodes` table, representing unique nodes.
45 | - The `COUNT(DISTINCT node_id)` function calculates the count of distinct values in the `node_id` column of the `customer_nodes` table.
46 | - As a result, the query presents the count of unique node IDs in the `customer_nodes` table.
47 |
48 |
49 |
50 | What is the number of nodes per region?
51 |
52 | ```sql
53 | SELECT r.region_name, COUNT(DISTINCT cn.node_id) AS nodes_per_region
54 | FROM regions r
55 | LEFT JOIN customer_nodes cn ON r.region_id = cn.region_id
56 | GROUP BY r.region_name
57 | ORDER BY nodes_per_region DESC;
58 | ```
59 | Answer:
60 |
61 |
62 | - The SQL query retrieves the count of distinct nodes per region from the `regions` and `customer_nodes` tables.
63 | - The `SELECT` statement selects the `region_name` and the count of distinct `node_id` values for each region.
64 | - The `LEFT JOIN` clause joins the `regions` table with the `customer_nodes` table using `region_id` as the common column.
65 | - The `GROUP BY r.region_name` clause groups the results by `region_name` to calculate the count of distinct nodes for each region.
66 | - The `ORDER BY nodes_per_region DESC` clause orders the results in descending order based on the count of nodes per region.
67 | - As a result, the query presents the region names and the corresponding count of distinct nodes per region, ordered by the count of nodes in descending order.
68 |
69 |
70 |
71 | How many customers are allocated to each region?
72 |
73 | ```sql
74 | SELECT r.region_name, COUNT(DISTINCT cn.customer_id) AS customers_per_region
75 | FROM regions r
76 | LEFT JOIN customer_nodes cn ON r.region_id = cn.region_id
77 | GROUP BY r.region_name
78 | ORDER BY customers_per_region DESC;
79 | ```
80 | Answer:
81 |
82 |
83 | - The SQL query retrieves the count of customers per region from the `regions` and `customer_nodes` tables.
84 | - The `SELECT` statement selects the `region_name` and the count of distinct `customer_id` values for each region.
85 | - The `LEFT JOIN` clause joins the `regions` table with the `customer_nodes` table using `region_id` as the common column.
86 | - The `GROUP BY r.region_name` clause groups the results by `region_name` to calculate the count of customers for each region.
87 | - The `ORDER BY customers_per_region DESC` clause orders the results in descending order based on the count of customers per region.
88 | - As a result, the query presents the region names and the corresponding count of customers per region, ordered by customer count in descending order.
89 |
90 |
91 |
92 | How many days on average are customers reallocated to a different node?
93 |
94 | ```sql
95 | WITH CTE AS
96 | (SELECT customer_id,
97 | node_id,
98 | end_date - start_date AS node_dys
99 | FROM customer_nodes
100 | WHERE end_date != '9999-12-31'
101 | GROUP BY customer_id, node_id, start_date, end_date),
102 |
103 | CTE2 AS
104 | (
105 | SELECT customer_id,node_id,
106 | SUM(node_dys) AS total_node
107 | FROM CTE
108 | GROUP BY customer_id, node_id)
109 |
110 | SELECT ROUND(AVG(total_node)) AS avg_n
111 | FROM CTE2;
112 | ```
113 | Answer:
114 |
115 |
116 | - The SQL query calculates the average total number of days that each customer-node combination was active, based on data from the `customer_nodes` table.
117 | - The first Common Table Expression, named `CTE`, calculates the number of days for each node allocation as the difference between the `end_date` and `start_date` columns, aliased as `node_dys`. The `WHERE end_date != '9999-12-31'` condition filters out records with the default end date.
118 | - `CTE2` calculates the total number of days each customer-node combination was active. It sums the `node_dys` values for each customer-node combination using the `SUM(node_dys)` function.
119 | - The main query calculates the rounded average of the `total_node` values using the `AVG(total_node)` function and aliases it as `avg_n`.
120 | - As a result, the query presents the rounded average total number of days that each customer-node combination was active, computed from `CTE2`, which is itself derived from the `customer_nodes` table.
121 |
122 |
123 |
124 | What is the median, 80th and 95th percentile for this same reallocation days metric for each region?
125 |
126 | ```sql
127 | WITH date_diff_cte AS (
128 | SELECT
129 | cn.region_id,
130 | end_date - start_date AS node_dys
131 | FROM
132 | customer_nodes cn
133 | WHERE
134 | end_date != '9999-12-31'
135 | )
136 | SELECT
137 | r.region_name,
138 | PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY node_dys) AS median,
139 | PERCENTILE_CONT(0.8) WITHIN GROUP (ORDER BY node_dys) AS percentile_80,
140 | PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY node_dys) AS percentile_95
141 | FROM
142 | date_diff_cte d
143 | JOIN
144 | regions r ON d.region_id = r.region_id
145 | GROUP BY
146 | r.region_name;
147 | ```
148 | Answer:
149 |
150 |
151 | - The SQL query calculates statistical percentiles of node active durations for each region based on data from the `customer_nodes` and `regions` tables.
152 | - The `WITH` clause defines a Common Table Expression (CTE) named `date_diff_cte` that calculates the difference in days between `end_date` and `start_date` for each node allocation. The condition `end_date != '9999-12-31'` filters out records with the default end date.
153 | - The main query selects the `region_name` column from the `regions` table and calculates the statistical percentiles within groups of node active durations for each region.
154 | - `PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY node_dys)` calculates the median, `PERCENTILE_CONT(0.8) WITHIN GROUP (ORDER BY node_dys)` calculates the 80th percentile, and `PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY node_dys)` calculates the 95th percentile of node active durations for each region.
155 | - The `JOIN` clause joins the `date_diff_cte` CTE with the `regions` table using `region_id` as the common column.
156 | - The `GROUP BY r.region_name` clause groups the results by `region_name` to present the statistical percentiles of node active durations for each region.
157 | - As a result, the query presents the region names along with the median, 80th percentile, and 95th percentile of node active durations, calculated using data from the `date_diff_cte` CTE and the `regions` table.
158 |
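`PERCENTILE_CONT` interpolates between neighboring values, so it can return a number that never occurs in the data. Its counterpart `PERCENTILE_DISC` returns an actual observed value. The small PostgreSQL sketch below compares the two on four made-up durations; the `toy` table and `days` column are hypothetical names used only for illustration.

```sql
SELECT
  -- Interpolated median: halfway between the two middle values (14 and 20).
  PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY days) AS median_interpolated,
  -- Discrete median: the first value whose cumulative share reaches 50%.
  PERCENTILE_DISC(0.5) WITHIN GROUP (ORDER BY days) AS median_observed
FROM (VALUES (10), (14), (20), (30)) AS toy(days);
```

Here `median_interpolated` is 17 and `median_observed` is 14. Which one is appropriate depends on whether a fractional number of days is acceptable in the report.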
159 |
160 |
161 | B. Customer Transactions💸
162 |
163 | What is the unique count and total amount for each transaction type?
164 |
165 | ```sql
166 | SELECT
167 | txn_type AS transaction_type,
168 | COUNT(customer_id) AS tn_cnt,
169 | SUM(txn_amount) AS ttl_amnt
170 | FROM customer_transactions
171 | GROUP BY txn_type;
172 | ```
173 |
174 | Answer:
175 |
176 |
177 | - The SQL query calculates transaction statistics based on data from the `customer_transactions` table.
178 | - The `SELECT` statement selects three columns: `txn_type` as `transaction_type`, `COUNT(customer_id)` as `tn_cnt` (transaction count), and `SUM(txn_amount)` as `ttl_amnt` (total amount).
179 | - The `GROUP BY txn_type` clause groups the results by `txn_type` to calculate statistics for each transaction type.
180 | - `COUNT(customer_id)` calculates the count of transactions for each transaction type.
181 | - `SUM(txn_amount)` calculates the total amount of transactions for each transaction type.
182 | - As a result, the query presents the transaction types, the count of transactions, and the total transaction amount for each transaction type.
183 |
184 |
185 |
186 | What is the average total historical deposit counts and amounts for all customers?
187 |
188 | ```sql
189 | WITH CTE AS (
190 | SELECT
191 | customer_id,
192 | COUNT(customer_id) AS cnt,
193 | AVG(txn_amount) AS avg_amnt
194 | FROM customer_transactions
195 | WHERE txn_type = 'deposit'
196 | GROUP BY customer_id
197 | )
198 |
199 | SELECT
200 | ROUND(AVG(cnt)) AS count_avgcnt,
201 | ROUND(AVG(avg_amnt)) AS amt_avgcnt
202 | FROM CTE;
203 | ```
204 | Answer:
205 |
206 |
207 | - The SQL query calculates average statistics based on deposit transactions for each customer using data from the `customer_transactions` table.
208 | - The `WITH` clause defines a Common Table Expression (CTE) named `CTE` that calculates the count of deposit transactions and the average transaction amount for each customer. The condition `txn_type = 'deposit'` filters the transactions to consider only deposits.
209 | - The `GROUP BY customer_id` clause groups the results by `customer_id` to calculate statistics for each customer.
210 | - The main query calculates the rounded average count of deposit transactions and the rounded average of average transaction amounts by applying `AVG(cnt)` and `AVG(avg_amnt)` to the `CTE`.
211 | - As a result, the query presents the rounded average count of deposit transactions and the rounded average of average transaction amounts for all customers.
212 |
213 |
214 |
215 | For each month - how many Data Bank customers make more than 1 deposit and either 1 purchase or 1 withdrawal in a single month?
216 |
217 | ```sql
218 | WITH monthly_transactions AS (
219 | SELECT
220 | customer_id,
221 | DATE_TRUNC('month', txn_date) AS transaction_month,
222 | SUM(CASE WHEN txn_type = 'deposit' THEN 1 ELSE 0 END) AS deposit_count,
223 | SUM(CASE WHEN txn_type = 'purchase' THEN 1 ELSE 0 END) AS purchase_count,
224 | SUM(CASE WHEN txn_type = 'withdrawal' THEN 1 ELSE 0 END) AS withdrawal_count
225 | FROM customer_transactions
226 | GROUP BY customer_id, DATE_TRUNC('month', txn_date)
227 | )
228 |
229 | SELECT
230 | transaction_month AS mth,
231 | COUNT(DISTINCT customer_id) AS customer_count
232 | FROM monthly_transactions
233 | WHERE deposit_count > 1
234 | AND (purchase_count >= 1 OR withdrawal_count >= 1)
235 | GROUP BY transaction_month
236 | ORDER BY transaction_month;
237 | ```
238 | Answer:
239 |
240 |
241 | - The SQL query analyzes monthly transaction patterns for customers based on data from the `customer_transactions` table.
242 | - The `WITH` clause defines a Common Table Expression (CTE) named `monthly_transactions` that aggregates transactions for each customer per month. It calculates the count of deposit transactions, purchase transactions, and withdrawal transactions for each customer in each month using conditional aggregations.
243 | - The `GROUP BY customer_id, DATE_TRUNC('month', txn_date)` clause groups the results by `customer_id` and the truncated `transaction_month` to calculate statistics for each customer in each month.
244 | - The main query selects the `transaction_month` column as `mth` and calculates the count of distinct customers with the specified conditions using the `COUNT(DISTINCT customer_id)` function.
245 | - The `WHERE` clause keeps only customer-month rows with more than one deposit transaction and at least one purchase or withdrawal transaction.
246 | - The `GROUP BY transaction_month` clause groups the results by `transaction_month` to analyze monthly patterns.
247 | - The `ORDER BY transaction_month` clause sorts the results by `transaction_month` in ascending order.
248 | - As a result, the query presents the months along with the count of distinct customers who meet the specified transaction conditions for each month.
249 |
250 |
251 |
252 | What is the closing balance for each customer at the end of the month?
253 |
254 | ```sql
255 | WITH customer_monthly_balances AS (
256 | SELECT
257 | customer_id,
258 | DATE_TRUNC('month', txn_date) AS transaction_month,
259 | SUM(CASE WHEN txn_type = 'deposit' THEN txn_amount ELSE 0 END) -
260 | SUM(CASE WHEN txn_type IN ('purchase', 'withdrawal') THEN txn_amount ELSE 0 END) AS closing_balance
261 | FROM customer_transactions
262 | GROUP BY customer_id, transaction_month
263 | )
264 |
265 | SELECT
266 | customer_id,
267 | transaction_month,
268 | closing_balance
269 | FROM customer_monthly_balances
270 | ORDER BY customer_id, transaction_month;
271 | ```
272 | Answer:
273 |
274 |
275 | - The SQL query calculates monthly closing balances for customers based on data from the `customer_transactions` table.
276 | - The `WITH` clause defines a Common Table Expression (CTE) named `customer_monthly_balances` that aggregates transactions for each customer per month. It calculates the closing balance by subtracting the sum of purchase and withdrawal amounts from the sum of deposit amounts for each customer in each month. This value is the net movement within that month; balances from earlier months are not carried forward.
277 | - The `GROUP BY customer_id, transaction_month` clause groups the results by `customer_id` and the truncated `transaction_month` to calculate the closing balance for each customer in each month. PostgreSQL allows the `transaction_month` output alias in `GROUP BY`.
278 | - The main query selects the `customer_id`, `transaction_month`, and `closing_balance` columns from the `customer_monthly_balances` CTE.
279 | - The `ORDER BY customer_id, transaction_month` clause sorts the results by `customer_id` and `transaction_month` in ascending order.
280 | - As a result, the query presents the customer IDs, transaction months, and closing balances for each customer based on monthly transaction data.
281 |
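If the closing balance should carry forward from month to month, the monthly net change can be accumulated with a running `SUM(...) OVER`. The sketch below assumes PostgreSQL and the same `txn_type` values used above; `toy_transactions` and its rows are hypothetical illustration data.

```sql
WITH toy_transactions (customer_id, txn_date, txn_type, txn_amount) AS (
  VALUES
    (1, DATE '2020-01-05', 'deposit',    500),
    (1, DATE '2020-01-20', 'purchase',   200),
    (1, DATE '2020-02-11', 'withdrawal', 100),
    (1, DATE '2020-03-02', 'deposit',     50)
),
monthly_net AS (
  SELECT
    customer_id,
    DATE_TRUNC('month', txn_date)::DATE AS transaction_month,
    -- Deposits add to the balance; purchases and withdrawals reduce it.
    SUM(CASE WHEN txn_type = 'deposit' THEN txn_amount ELSE -txn_amount END) AS net_change
  FROM toy_transactions
  GROUP BY customer_id, transaction_month
)
SELECT
  customer_id,
  transaction_month,
  net_change,
  -- Running total of net changes up to and including this month.
  SUM(net_change) OVER (
    PARTITION BY customer_id
    ORDER BY transaction_month
  ) AS closing_balance
FROM monthly_net
ORDER BY customer_id, transaction_month;
```

For the toy customer the net changes are 300, -100, and 50, giving closing balances of 300, 200, and 250. Months without any transactions produce no row in this sketch; filling them would need a calendar of months to join against.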
282 |
283 |
284 | What is the percentage of customers who increase their closing balance by more than 5%?
285 |
286 | ```sql
287 | WITH customer_monthly_balances AS (
288 | SELECT
289 | customer_id,
290 | DATE_TRUNC('month', txn_date) AS transaction_month,
291 | SUM(CASE WHEN txn_type = 'deposit' THEN txn_amount ELSE 0 END) -
292 | SUM(CASE WHEN txn_type IN ('purchase', 'withdrawal') THEN txn_amount ELSE 0 END) AS closing_balance
293 | FROM customer_transactions
294 | GROUP BY customer_id, transaction_month
295 | )
296 | SELECT
297 | ROUND(
298 | (COUNT(*) FILTER (WHERE closing_balance_increase > 5) * 100.0) / COUNT(*),
299 | 2
300 | ) AS percentage_increase
301 | FROM (
302 | SELECT
303 | customer_id,
304 | transaction_month,
305 | closing_balance,
306 | LAG(closing_balance) OVER (PARTITION BY customer_id ORDER BY transaction_month) AS prev_balance,
307 | (closing_balance - LAG(closing_balance) OVER (PARTITION BY customer_id ORDER BY transaction_month)) * 100.0 / NULLIF(LAG(closing_balance) OVER (PARTITION BY customer_id ORDER BY transaction_month), 0) AS closing_balance_increase
308 | FROM customer_monthly_balances
309 | ) AS balance_changes;
310 | ```
311 | Answer:
312 |
313 |
314 | - The SQL query calculates the share of month-over-month closing balance changes that exceed 5%, based on monthly transaction data from the `customer_transactions` table.
315 | - The `WITH` clause defines a Common Table Expression (CTE) named `customer_monthly_balances` that aggregates transactions for each customer per month and calculates the closing balance as deposits minus purchases and withdrawals.
316 | - The main query reads from a subquery, `balance_changes`, that is built on the `customer_monthly_balances` CTE.
317 | - Within that subquery, the `LAG` window function is used to retrieve the previous month's closing balance for each customer.
318 | - The calculated `closing_balance_increase` represents the percentage change in closing balance for each month relative to the previous month's balance. `NULLIF(..., 0)` avoids division by zero, so the value is `NULL` when the previous balance is 0 or when there is no previous month.
319 | - The `FILTER` clause counts only rows where `closing_balance_increase` is greater than 5.
320 | - The main query then divides that count by the total number of rows and multiplies by 100. Each row is one customer-month, so the result is the percentage of customer-months, not of distinct customers, with an increase above 5%.
321 | - The `ROUND` function is used to round the percentage to two decimal places.
322 | - As a result, the query presents the percentage of customer-month rows whose closing balance increased by more than 5% over the previous month.
323 |
324 |
325 |
326 |
327 |
328 |
329 |
--------------------------------------------------------------------------------
/Case Study #5 - Data Mart/README.md:
--------------------------------------------------------------------------------
1 | Case Study #5 - Data Mart🛒
2 |
3 | Contents
4 |
5 |
17 |
18 |
19 | This case study focuses on Data Mart, Danny's latest venture, launched after he successfully ran global operations for an online grocery store specializing in fresh produce. Danny now wants our help analyzing his sales performance.
20 |
21 | In June 2020, Data Mart made significant changes to its supply chain: all products switched to sustainable packaging at every step from the farm to the customer.
22 |
23 | Danny wants us to quantify the impact of this change on Data Mart's sales performance across its separate business areas.
24 |
25 | The key questions are:
26 |
27 | - What was the quantifiable impact of the changes introduced in June 2020?
28 | - Which platform, region, segment, and customer type were most affected by this change?
29 | - What can we do about future introductions of similar sustainability updates to minimize the impact on sales?
30 |
31 | Entity Relationship Diagram
32 |
33 | Case Study Questions & Solutions
34 |
35 | 1. Data Cleansing Steps📊
36 |
37 |
38 |
39 | In a single query, the following operations are performed to generate a new table named `clean_weekly_sales` in the `data_mart` schema:
40 |
41 |
42 | - Convert the `week_date` to a `DATE` format
43 | - Add a `week_number` as the second column for each `week_date` value, where the week number corresponds to the range of dates within that week
44 | - Add a `month_number` with the calendar month for each `week_date` value as the third column
45 | - Add a `calendar_year` column as the fourth column containing values of 2018, 2019, or 2020 based on the year of the `week_date`
46 | - Add a new column called `age_band` after the original `segment` column, mapping the number in the segment value to an age group: 1 to Young Adults, 2 to Middle Aged, and 3 or 4 to Retirees
47 | - Add a new `demographic` column, mapping the first letter of the segment value: C to Couples and F to Families
48 | - Replace all null string values with an "unknown" string value in the original `segment` column as well as the new `age_band` and `demographic` columns
49 | - Generate a new `avg_transaction` column as the `sales` value divided by `transactions`, rounded to 2 decimal places for each record
50 |
51 |
52 | ```sql
53 | DROP TABLE IF EXISTS clean_weekly_sales;
54 | CREATE TABLE clean_weekly_sales AS
55 | SELECT
56 | TO_DATE(week_date, 'DD/MM/YY') AS week_date,
57 | DATE_PART('week', TO_DATE(week_date, 'DD/MM/YY')) AS week_number,
58 | EXTRACT(MONTH FROM TO_DATE(week_date, 'DD/MM/YY')) AS month_number,
59 | DATE_TRUNC('month', TO_DATE(week_date, 'DD/MM/YY')) AS transaction_month,
60 | CASE
61 | WHEN EXTRACT(YEAR FROM TO_DATE(week_date, 'DD/MM/YY')) = 2018 THEN 2018
62 | WHEN EXTRACT(YEAR FROM TO_DATE(week_date, 'DD/MM/YY')) = 2019 THEN 2019
63 | WHEN EXTRACT(YEAR FROM TO_DATE(week_date, 'DD/MM/YY')) = 2020 THEN 2020
64 | END AS calendar_year,
65 | CASE
66 | WHEN RIGHT(segment,1) = '1' THEN 'Young Adults'
67 | WHEN RIGHT(segment,1) = '2' THEN 'Middle Aged'
68 | WHEN RIGHT(segment,1) IN ('3','4') THEN 'Retirees'
69 | ELSE 'unknown' END AS age_band, platform,
70 | CASE
71 | WHEN LEFT(segment, 1) = 'C' THEN 'Couples'
72 | WHEN LEFT(segment, 1) = 'F' THEN 'Families'
73 | ELSE 'unknown'
74 | END AS demographic,
75 | COALESCE(NULLIF(segment, ''), 'unknown') AS segment,
76 | region,
77 | transactions,
78 | sales,
79 | ROUND(sales::NUMERIC / transactions, 2) AS avg_transaction
80 | FROM weekly_sales;
81 |
82 | SELECT *
83 | FROM clean_weekly_sales;
84 | ```
85 | Answer:
86 |
87 |
88 | - The SQL script drops any existing `clean_weekly_sales` table and recreates it by selecting and transforming data from the existing `weekly_sales` table.
89 | - Within the `SELECT` statement, `TO_DATE` parses the text `week_date` column, and the parsed date is used to derive `week_number`, `month_number`, `transaction_month`, and `calendar_year`.
90 | - Using a `CASE` expression, the `age_band` column is determined from the last character of the `segment` column.
91 | - Similarly, the `demographic` column is determined from the first letter of the `segment` column.
92 | - The `COALESCE(NULLIF(segment, ''), 'unknown')` expression replaces a `NULL` or empty-string `segment` with 'unknown'; a literal 'null' string would pass through unchanged and would need its own `NULLIF`.
93 | - Finally, `transactions` and `sales` are carried over unchanged, and `avg_transaction` is calculated as `sales` divided by `transactions`, rounded to two decimal places.
94 | - After the transformation, the script selects all columns from the newly created `clean_weekly_sales` table.
95 | - As a result, the `clean_weekly_sales` table contains cleaned data from the `weekly_sales` table, with columns for date information, age band, demographic, platform, region, and sales metrics.
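
To make the transformations concrete, here is a minimal Python sketch that applies the same rules to a single row. The `clean_row` helper and the sample inputs are hypothetical; the mappings are copied from the `CASE` expressions in the query, and `isocalendar()` returns the ISO 8601 week number, which is also what `DATE_PART('week', ...)` returns in PostgreSQL.

```python
from datetime import datetime

# Mappings taken from the CASE expressions in the query above.
AGE_BANDS = {"1": "Young Adults", "2": "Middle Aged", "3": "Retirees", "4": "Retirees"}
DEMOGRAPHICS = {"C": "Couples", "F": "Families"}


def clean_row(week_date, segment, sales, transactions):
    """Mirror the main transformations for a single hypothetical input row."""
    parsed = datetime.strptime(week_date, "%d/%m/%y").date()
    segment = segment or ""
    return {
        "week_date": parsed.isoformat(),
        "week_number": parsed.isocalendar()[1],  # ISO week number
        "month_number": parsed.month,
        "calendar_year": parsed.year,
        "age_band": AGE_BANDS.get(segment[-1:], "unknown"),
        "demographic": DEMOGRAPHICS.get(segment[:1], "unknown"),
        "avg_transaction": round(sales / transactions, 2),
    }


example = clean_row("15/6/20", "C3", 1000, 30)
print(example)
```

Any segment whose first or last character is not in the mappings, including an empty value, falls through to 'unknown', just like the `ELSE` branches in the SQL.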
96 |
97 |
98 |
99 | 2. Data Exploration🤯📊
100 |
101 | What day of the week is used for each week_date value?
102 |
103 | ```sql
104 | SELECT DISTINCT(TO_CHAR(week_date, 'day')) AS day
105 | FROM clean_weekly_sales;
106 | ```
107 | Answer:
108 |
109 |
110 | - The SQL query selects distinct weekdays from the `clean_weekly_sales` table.
111 | - The `TO_CHAR` function is used to format the `week_date` column as a weekday string.
112 | - The format 'day' specifies that only the name of the day of the week should be extracted.
113 | - The result is a list of unique weekday names present in the `week_date` column of the `clean_weekly_sales` table.
114 |
115 |
116 |
117 | What range of week numbers are missing from the dataset?
118 |
119 | ```sql
120 | WITH week_number_cte AS (
121 | SELECT GENERATE_SERIES(1, 52) AS week_num
122 | )
123 | SELECT DISTINCT w.week_num
124 | FROM week_number_cte AS w
125 | LEFT JOIN clean_weekly_sales AS s
126 | ON w.week_num = s.week_number
127 | WHERE s.week_number IS NULL
128 | ORDER BY w.week_num;
129 | ```
130 |
131 | Answer:
132 |
133 |
134 | - The SQL query uses a common table expression (CTE) named `week_number_cte` to generate a series of week numbers from 1 to 52.
135 | - The `GENERATE_SERIES` function is used to create a sequence of numbers within the specified range.
136 | - The main `SELECT` statement then selects distinct `week_num` values from the `week_number_cte` CTE.
137 | - A `LEFT JOIN` is performed between the `week_number_cte` CTE and the `clean_weekly_sales` table on the condition that the `week_num` matches the `week_number` column in the sales data.
138 | - The `WHERE` clause keeps only rows where the `week_number` in the sales data is `NULL`, indicating missing data for that week.
139 | - The result is a list of week numbers that are missing data in the `clean_weekly_sales` table.
140 | - The list is ordered in ascending order of week numbers.
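
The same anti-join pattern can be tried without a PostgreSQL instance. The sketch below is an illustration only: it runs on SQLite through Python's `sqlite3` module, replaces `GENERATE_SERIES` with a recursive CTE because stock SQLite has no such function, and uses a made-up set of present weeks. The `missing_weeks` helper is hypothetical.

```python
import sqlite3

QUERY = """
WITH RECURSIVE week_number_cte(week_num) AS (
    SELECT 1
    UNION ALL
    SELECT week_num + 1 FROM week_number_cte WHERE week_num < 52
)
SELECT w.week_num
FROM week_number_cte AS w
LEFT JOIN (SELECT DISTINCT week_number FROM clean_weekly_sales) AS s
    ON w.week_num = s.week_number
WHERE s.week_number IS NULL   -- keep only weeks with no matching sales row
ORDER BY w.week_num
"""


def missing_weeks(present_weeks):
    """Return the week numbers from 1 to 52 that have no sales rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE clean_weekly_sales (week_number INT)")
    conn.executemany(
        "INSERT INTO clean_weekly_sales VALUES (?)", [(w,) for w in present_weeks]
    )
    result = [row[0] for row in conn.execute(QUERY)]
    conn.close()
    return result


# Hypothetical coverage: only weeks 13 to 36 have sales rows.
gaps = missing_weeks(range(13, 37))
print(gaps[0], gaps[-1], len(gaps))
```

Joining against the distinct week numbers keeps the result to one row per missing week, so no `DISTINCT` is needed in the outer query.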
141 |
142 |
143 |
144 | How many total transactions were there for each year in the dataset?
145 |
146 | ```sql
147 | SELECT calendar_year, SUM(transactions) AS total_transactions
148 | FROM clean_weekly_sales
149 | GROUP BY calendar_year
150 | ORDER BY calendar_year;
151 | ```
152 | Answer:
153 |
154 |
155 | - The SQL query selects the `calendar_year` and calculates the total number of transactions for each calendar year from the `clean_weekly_sales` table.
156 | - The `SUM` function is used to add up the `transactions` values for each year.
157 | - The results are grouped by the `calendar_year` column.
158 | - The `ORDER BY` clause sorts the results in ascending order based on the `calendar_year`.
159 | - The outcome is a list that displays the total number of transactions for each calendar year present in the `clean_weekly_sales` table.
160 |
161 |
162 |
163 | What is the total sales for each region for each month?
164 |
165 | ```sql
166 | SELECT calendar_year, month_number, region,
167 | SUM(sales) AS total_sales
168 | FROM clean_weekly_sales
169 | GROUP BY calendar_year, month_number, region
170 | ORDER BY calendar_year, month_number, region;
171 | ```
172 | Answer:
173 |
174 |
175 | - The SQL query selects the `calendar_year`, `month_number`, and `region` and calculates the total sales for each combination from the `clean_weekly_sales` table.
176 | - The `SUM` function is used to add up the `sales` values within each group.
177 | - There is no `WHERE` filter on `month_number`, so every month in the data is included.
178 | - The results are grouped by the `calendar_year`, `month_number`, and `region` columns, so the same month in different years is kept separate.
179 | - The `ORDER BY` clause sorts the results in ascending order based on the year, month, and `region`.
180 | - The outcome is a list that displays the total sales for each region in each month from the `clean_weekly_sales` table.
181 |
182 |
183 |
184 | What is the total count of transactions for each platform?
185 |
186 | ```sql
187 | SELECT platform, SUM(transactions) AS total_transactions
188 | FROM clean_weekly_sales
189 | GROUP BY platform
190 | ORDER BY platform;
191 | ```
192 | Answer:
193 |
194 |
195 | - The SQL query selects the `platform` column and calculates the total number of transactions for each platform from the `clean_weekly_sales` table.
196 | - The `SUM` function is used to add up the `transactions` values for each platform.
197 | - The results are grouped by the `platform` column.
198 | - The `ORDER BY` clause sorts the results in ascending order based on the `platform`.
199 | - The outcome is a list that displays the total number of transactions for each platform in the `clean_weekly_sales` table.
200 |
201 |
202 |
203 | What is the percentage of sales for Retail vs Shopify for each month?
204 |
205 | ```sql
206 | WITH monthly_platform_sales AS (
207 | SELECT
208 | calendar_year,
209 | month_number,
210 | platform,
211 | SUM(sales) AS monthly_sales
212 | FROM clean_weekly_sales
213 | GROUP BY calendar_year, month_number, platform
214 | )
215 |
216 | SELECT
217 | calendar_year,
218 | month_number,
219 | ROUND(100.0 * SUM(CASE WHEN platform = 'Retail' THEN monthly_sales ELSE 0 END) / SUM(monthly_sales), 2) AS retail_percentage,
220 | ROUND(100.0 * SUM(CASE WHEN platform = 'Shopify' THEN monthly_sales ELSE 0 END) / SUM(monthly_sales), 2) AS shopify_percentage
221 | FROM monthly_platform_sales
222 | GROUP BY calendar_year, month_number
223 | ORDER BY calendar_year, month_number;
224 | ```
225 | Answer:
226 |
227 |
228 | - The SQL query first creates a Common Table Expression (CTE) named `monthly_platform_sales`.
229 | - In the CTE, the query calculates the total monthly sales for each combination of `calendar_year`, `month_number`, and `platform` from the `clean_weekly_sales` table using the `SUM` function.
230 | - The results are grouped by `calendar_year`, `month_number`, and `platform`.
231 | - The main part of the query selects data from the `monthly_platform_sales` CTE.
232 | - It calculates the percentage of retail sales by dividing the sum of monthly sales from the 'Retail' platform by the total monthly sales, multiplied by 100. The `ROUND` function is used to round the result to two decimal places.
233 | - Similarly, it calculates the percentage of Shopify sales by dividing the sum of monthly sales from the 'Shopify' platform by the total monthly sales, multiplied by 100.
234 | - The results are grouped by `calendar_year` and `month_number`.
235 | - The `ORDER BY` clause sorts the results in ascending order based on `calendar_year` and `month_number`.
236 | - The outcome is a table that displays the percentage of retail and Shopify sales for each month and calendar year based on the data from the `clean_weekly_sales` table.
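
Percentage formulas of this kind are sensitive to integer division: when both operands are integers, SQL truncates the quotient before `ROUND` ever sees it. The sketch below illustrates the effect on SQLite via Python's `sqlite3` module, which truncates integer division the same way PostgreSQL does. The table contents are invented, and the column is declared as an integer on the assumption that `sales` is stored as an integer type.

```python
import sqlite3

QUERY = """
SELECT
    calendar_year,
    month_number,
    -- integer arithmetic: the fractional part is lost
    100 * SUM(CASE WHEN platform = 'Retail' THEN sales ELSE 0 END) / SUM(sales)
        AS retail_pct_integer,
    -- a decimal literal forces floating-point division
    ROUND(100.0 * SUM(CASE WHEN platform = 'Retail' THEN sales ELSE 0 END)
          / SUM(sales), 2) AS retail_pct_decimal
FROM clean_weekly_sales
GROUP BY calendar_year, month_number
"""

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE clean_weekly_sales "
    "(calendar_year INT, month_number INT, platform TEXT, sales INT)"
)
# Hypothetical sales figures for a single month.
conn.executemany(
    "INSERT INTO clean_weekly_sales VALUES (?, ?, ?, ?)",
    [(2020, 6, "Retail", 970), (2020, 6, "Shopify", 35)],
)
result = conn.execute(QUERY).fetchone()
conn.close()
print(result)
```

Writing `100.0` instead of `100`, or casting one operand to `NUMERIC`, keeps the decimals that the two-decimal rounding is meant to show.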
237 |
238 |
239 |
240 | What is the percentage of sales by demographic for each year in the dataset?
241 |
242 | ```sql
243 | WITH yearly_demographic_sales AS (
244 | SELECT
245 | calendar_year,
246 | demographic,
247 | SUM(sales) AS yearly_sales
248 | FROM clean_weekly_sales
249 | GROUP BY calendar_year, demographic
250 | )
251 |
252 | SELECT
253 | calendar_year,
254 | demographic,
255 | ROUND(100.0 * yearly_sales / SUM(yearly_sales) OVER (PARTITION BY calendar_year), 2) AS sales_percentage
256 | FROM yearly_demographic_sales
257 | ORDER BY calendar_year, demographic;
258 | ```
259 | Answer:
260 |
261 |
262 | - The SQL query creates a Common Table Expression (CTE) named `yearly_demographic_sales`.
263 | - In the CTE, the query calculates the total yearly sales for each combination of `calendar_year` and `demographic` from the `clean_weekly_sales` table using the `SUM` function.
264 | - The results are grouped by `calendar_year` and `demographic`.
265 | - The main part of the query selects data from the `yearly_demographic_sales` CTE.
266 | - It calculates the sales percentage for each demographic in a given year by dividing the yearly sales for that demographic by the total yearly sales for the same year, multiplied by 100. The `ROUND` function is used to round the result to two decimal places.
267 | - The `SUM(yearly_sales) OVER (PARTITION BY calendar_year)` expression calculates the total yearly sales for each year and is used as the denominator for calculating the percentage.
268 | - The results are displayed with columns for `calendar_year`, `demographic`, and `sales_percentage`.
269 | - The results are ordered in ascending order based on `calendar_year` and `demographic`.
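
The windowed denominator can be seen at work on a few invented rows. This is a minimal sketch on SQLite through Python's `sqlite3` module (window functions need SQLite 3.25 or later); the sample rows and the `demographic_shares` helper are hypothetical.

```python
import sqlite3

QUERY = """
WITH yearly_demographic_sales AS (
    SELECT calendar_year, demographic, SUM(sales) AS yearly_sales
    FROM clean_weekly_sales
    GROUP BY calendar_year, demographic
)
SELECT
    calendar_year,
    demographic,
    -- the window total repeats on every row of the same year
    ROUND(100.0 * yearly_sales
          / SUM(yearly_sales) OVER (PARTITION BY calendar_year), 2) AS sales_percentage
FROM yearly_demographic_sales
ORDER BY calendar_year, demographic
"""

# Hypothetical rows: (calendar_year, demographic, sales).
ROWS = [
    (2019, "Couples", 300), (2019, "Families", 500), (2019, "unknown", 200),
    (2020, "Couples", 100), (2020, "Families", 300),
]


def demographic_shares(rows):
    """Return (year, demographic, percentage) tuples for the toy rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE clean_weekly_sales "
        "(calendar_year INT, demographic TEXT, sales INT)"
    )
    conn.executemany("INSERT INTO clean_weekly_sales VALUES (?, ?, ?)", rows)
    result = conn.execute(QUERY).fetchall()
    conn.close()
    return result


shares = demographic_shares(ROWS)
for row in shares:
    print(row)
```

Because the window is partitioned by `calendar_year`, the percentages within each year add up to 100 without a second aggregation step or a self-join.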
270 |
271 |
272 |
273 | Which age_band and demographic values contribute the most to Retail sales?
274 |
275 | ```sql
276 | SELECT
277 | age_band,
278 | demographic,
279 | SUM(sales) AS retail_sales
280 | FROM clean_weekly_sales
281 | WHERE platform = 'Retail'
282 | GROUP BY age_band, demographic
283 | ORDER BY retail_sales DESC;
284 | ```
285 | Answer:
286 |
287 |
288 | - The SQL query selects data from the `clean_weekly_sales` table.
289 | - It retrieves the columns `age_band` and `demographic` along with the calculated sum of `sales` as `retail_sales`.
290 | - The query filters the data to include only rows where the `platform` is 'Retail' using the `WHERE` clause.
291 | - The results are grouped by `age_band` and `demographic` using the `GROUP BY` clause.
292 | - The calculated sum of `sales` for each group is displayed as `retail_sales`.
293 | - The results are ordered in descending order based on `retail_sales` using the `ORDER BY` clause with the `DESC` keyword.
294 |
295 |
296 |
297 | Can we use the avg_transaction column to find the average transaction size for each year for Retail vs Shopify?
298 |
299 | ```sql
300 | SELECT
301 | calendar_year,
302 | platform,
303 | AVG(avg_transaction) AS average_transaction_size
304 | FROM clean_weekly_sales
305 | GROUP BY calendar_year, platform
306 | ORDER BY calendar_year, platform;
307 | ```
308 | Answer:
309 |
310 |
311 | - The SQL query selects data from the `clean_weekly_sales` table.
312 | - It retrieves the columns `calendar_year` and `platform` along with the calculated average of `avg_transaction` as `average_transaction_size`.
313 | - The query calculates the average transaction size using the `AVG` function on the `avg_transaction` column. This is an unweighted average of the row-level averages, so every row counts equally regardless of how many transactions it covers; a transaction-weighted figure would use `SUM(sales) / SUM(transactions)` instead.
314 | - The results are grouped by `calendar_year` and `platform` using the `GROUP BY` clause.
315 | - The calculated average transaction size for each group is displayed as `average_transaction_size`.
316 | - The results are ordered in ascending order based on `calendar_year` and `platform` using the `ORDER BY` clause.
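
Whether `avg_transaction` is suitable here depends on how much the rows differ in size. The small Python sketch below uses two invented rows to compare what `AVG(avg_transaction)` computes with the transaction-weighted alternative `SUM(sales) / SUM(transactions)`.

```python
# Hypothetical rows for one year and platform: (sales, transactions).
rows = [(1000, 10), (9000, 300)]

# What AVG(avg_transaction) computes: each row counts equally.
row_averages = [round(sales / transactions, 2) for sales, transactions in rows]
average_of_averages = sum(row_averages) / len(row_averages)

# Transaction-weighted alternative: SUM(sales) / SUM(transactions).
weighted_average = sum(s for s, _ in rows) / sum(t for _, t in rows)

print(average_of_averages, round(weighted_average, 2))
```

The first figure is pulled toward the small row with the large average, while the second reflects what a typical transaction was worth, which is usually what "average transaction size" is meant to describe.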
317 |
318 |
319 |
320 | 3. Before & After Analysis📈📉
321 |
322 | This technique is typically used to measure the impact of an important event by comparing a metric before and after a certain point in time.
323 |
324 | We take the `week_date` value of 2020-06-15 as the baseline week in which the Data Mart sustainable packaging changes came into effect.
325 |
326 | All `week_date` values from 2020-06-15 onward belong to the period after the change, and all earlier `week_date` values belong to the period before it.
327 |
328 | Using this analysis approach, answer the following questions:
329 |
330 | - What is the total sales for the 4 weeks before and after 2020-06-15? What is the growth or reduction rate in actual values and percentage of sales?
331 |
332 | ```sql
333 | WITH CTE AS (
334 | SELECT
335 | week_date,
336 | week_number,
337 | SUM(sales) AS totl_sales
338 | FROM clean_weekly_sales
339 | WHERE (week_number BETWEEN 21 AND 28)
340 | AND (EXTRACT(YEAR FROM week_date) = 2020)
341 | GROUP BY week_date, week_number
342 | ),
343 | CTE2 AS (
344 | SELECT
345 | SUM(CASE
346 | WHEN week_number BETWEEN 21 AND 24 THEN totl_sales END) AS before_sales,
347 | SUM(CASE
348 | WHEN week_number BETWEEN 25 AND 28 THEN totl_sales END) AS after_sales
349 | FROM CTE
350 | )
351 |
352 | SELECT
353 | after_sales - before_sales AS sales_va,
354 | ROUND(100.0 *
355 | (after_sales - before_sales)
356 | / before_sales, 2) AS variance_percent
357 | FROM CTE2;
358 | ```
359 |
360 | Answer:
361 |
362 |
363 | - The SQL query uses two Common Table Expressions (CTEs) to perform calculations on the `clean_weekly_sales` table.
364 | - The first CTE, named `CTE`, calculates the total sales for each week in week numbers 21 to 28 of the year 2020.
365 | - The `GROUP BY` clause groups the results by `week_date` and `week_number`.
366 | - The second CTE, named `CTE2`, further processes the data from `CTE`.
367 | - It sums total sales for the four weeks before the change (week numbers 21 to 24) and the four weeks from the change onward (week numbers 25 to 28); 2020-06-15 falls in week 25.
368 | - The main query calculates the variance in sales between the "after" and "before" periods by subtracting `before_sales` from `after_sales`.
369 | - The query also calculates the percentage variance by dividing the variance by `before_sales` and then multiplying by 100. It rounds the percentage to 2 decimal places using the `ROUND` function.
370 | - The calculated sales variance and percentage variance are displayed as `sales_va` and `variance_percent`, respectively.
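
The week numbers used in the query can be double-checked with a few lines of Python. This is only a sanity-check sketch: `week_window` is a hypothetical helper, and it relies on `isocalendar()` returning ISO 8601 week numbers, the same numbering `DATE_PART('week', ...)` uses in PostgreSQL.

```python
from datetime import date, timedelta

CHANGE_DATE = date(2020, 6, 15)  # baseline week from the case study


def week_window(change_date, weeks):
    """Return (before_weeks, after_weeks) as lists of ISO week numbers."""
    before = [
        (change_date - timedelta(weeks=i)).isocalendar()[1]
        for i in range(weeks, 0, -1)
    ]
    after = [
        (change_date + timedelta(weeks=i)).isocalendar()[1]
        for i in range(weeks)
    ]
    return before, after


before_weeks, after_weeks = week_window(CHANGE_DATE, 4)
print(before_weeks, after_weeks)
```

The baseline week itself is the first week of the "after" period, which is why the split sits between weeks 24 and 25.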
371 |
372 |
373 |
374 | What about the entire 12 weeks before and after?
375 |
376 | ```sql
377 | WITH packaging_sales AS (
378 | SELECT
379 | week_date,
380 | week_number,
381 | SUM(sales) AS total_sales
382 | FROM clean_weekly_sales
383 | WHERE week_date >= '2020-03-23' AND week_date <= '2020-08-31' -- 12 weeks before and 12 weeks after 2020-06-15
384 | GROUP BY week_date, week_number
385 | ),
386 | before_after_changes AS (
387 | SELECT
388 | SUM(CASE
389 | WHEN week_date >= '2020-03-23' AND week_date <= '2020-06-08' THEN total_sales END) AS before_packaging_sales,
390 | SUM(CASE
391 | WHEN week_date >= '2020-06-15' AND week_date <= '2020-08-31' THEN total_sales END) AS after_packaging_sales
392 | FROM packaging_sales
393 | )
394 |
395 | SELECT
396 | after_packaging_sales - before_packaging_sales AS sales_variance,
397 | ROUND(100.0 *
398 | (after_packaging_sales - before_packaging_sales)
399 | / before_packaging_sales, 2) AS variance_percentage
400 | FROM before_after_changes;
401 | ```
402 | Answer:
403 |
404 |
405 | - The SQL query consists of two Common Table Expressions (CTEs) to perform calculations on the `clean_weekly_sales` table.
406 | - The first CTE, named `packaging_sales`, calculates the total sales for each week from 2020-03-23 to 2020-08-31, which covers the 12 weeks before and the 12 weeks after the change on 2020-06-15.
407 | - The `GROUP BY` clause groups the results by `week_date` and `week_number`.
408 | - The second CTE, named `before_after_changes`, further processes the data from `packaging_sales`.
409 | - It sums total sales for the 12 weeks before the change (week dates from '2020-03-23' to '2020-06-08') and the 12 weeks from the change onward (week dates from '2020-06-15' to '2020-08-31').
410 | - The main query calculates the variance in sales between the "after" and "before" periods by subtracting `before_packaging_sales` from `after_packaging_sales`.
411 | - The query also calculates the percentage variance by dividing the variance by `before_packaging_sales` and then multiplying by 100. It rounds the percentage to 2 decimal places using the `ROUND` function.
412 | - The calculated sales variance and percentage variance are displayed as `sales_variance` and `variance_percentage`, respectively.
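
For a 12-week comparison around 2020-06-15 on weekly data, the first and last `week_date` of each period follow from simple date arithmetic. The sketch below assumes one row date per week, seven days apart; `window_bounds` is a hypothetical helper.

```python
from datetime import date, timedelta

CHANGE_DATE = date(2020, 6, 15)  # baseline week from the case study


def window_bounds(change_date, weeks):
    """First and last week_date of the before and after periods."""
    first_before = change_date - timedelta(weeks=weeks)
    last_before = change_date - timedelta(weeks=1)
    first_after = change_date
    last_after = change_date + timedelta(weeks=weeks - 1)
    return first_before, last_before, first_after, last_after


bounds = window_bounds(CHANGE_DATE, 12)
print([d.isoformat() for d in bounds])
```

Each period then contains exactly 12 weekly dates, and the outer date filter has to span from the first "before" date to the last "after" date so that neither period is cut short.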
413 |
414 |
415 |
416 |
417 |
418 |
419 |
420 |
421 |
422 |
--------------------------------------------------------------------------------
/Case Study #6 - Clique Bait/README.md:
--------------------------------------------------------------------------------
1 | Case Study #6 - Clique Bait🦞🍢🦈
2 |
3 | Contents
4 |
5 |
16 |
17 |
18 | Meet Danny, the innovative mind behind Clique Bait, an unconventional online seafood store. With a background in digital data analytics, Danny's mission is to revolutionize the seafood industry. In this study, we analyze Clique Bait's user journey and propose creative solutions to calculate funnel fallout rates. By understanding where users drop off, we help optimize the online shopping experience, aligning with Danny's vision of merging data expertise with seafood excellence. Join us in uncovering insights that drive conversion and customer satisfaction.
19 |
20 | *Note: The data and insights presented in this case study are for illustrative purposes only and do not reflect actual data associated with Clique Bait.*
21 |
22 | Entity Relationship Diagram
23 |
24 | Case Study Questions & Solutions
25 |
26 | A. Digital Analysis
27 |
28 | How many users are there?
29 |
30 | ```sql
31 | SELECT COUNT (DISTINCT user_id) AS user_cnt
32 | FROM users;
33 | ```
34 | Answer:
35 |
36 |
37 | - The SQL query retrieves the count of distinct `user_id` values from the `users` table.
38 | - The `COUNT` function is used to calculate the number of distinct user IDs.
39 | - The result is presented as `user_cnt` in the output.
40 |
41 |
42 |
43 | How many cookies does each user have on average?
44 |
45 | ```sql
46 | SELECT ROUND(AVG(cookie_count)) AS average_cookies_per_user
47 | FROM (
48 | SELECT user_id, COUNT(DISTINCT cookie_id) AS cookie_count
49 | FROM users
50 | GROUP BY user_id
51 | ) AS user_cookies;
52 | ```
53 | Answer:
54 |
55 |
56 | - The SQL query calculates the average number of cookies per user.
57 | - It starts with an inner subquery that groups the data in the `users` table by `user_id`.
58 | - Within the subquery, the `COUNT(DISTINCT cookie_id)` expression calculates the number of distinct `cookie_id` values for each `user_id`.
59 | - The result of the subquery includes the `user_id` and the corresponding count of distinct cookies.
60 | - The outer query then averages the `cookie_count` column from the subquery with `AVG` and rounds the result to the nearest whole number with `ROUND`.
61 | - The average is presented as `average_cookies_per_user` in the output.
62 |
63 |
64 |
65 | What is the unique number of visits by all users per month?
66 |
67 | ```sql
68 | SELECT EXTRACT(YEAR FROM event_time) AS year,
69 | EXTRACT(MONTH FROM event_time) AS month,
70 | COUNT(DISTINCT visit_id) AS unique_visits
71 | FROM events
72 | GROUP BY year, month
73 | ORDER BY year, month;
74 | ```
75 | Answer:
76 |
77 |
78 | - The SQL query calculates the number of unique visits for each month and year from the `events` table.
79 | - It uses the `EXTRACT` function to extract the `YEAR` and `MONTH` components from the `event_time` column.
80 | - The `COUNT(DISTINCT visit_id)` expression counts the distinct `visit_id` values for each combination of year and month.
81 | - The results are grouped by the `year` and `month` columns.
82 | - The `ORDER BY` clause sorts the results in ascending order based on the `year` and `month` values.
83 | - The output includes columns for `year`, `month`, and `unique_visits` (the count of distinct visit IDs).
84 |
85 |
86 |
87 | What is the number of events for each event type?
88 |
89 | ```sql
90 | SELECT event_type, COUNT(*) AS event_count
91 | FROM events
92 | GROUP BY event_type
93 | ORDER BY event_type;
94 | ```
95 | Answer:
96 |
97 |
98 | - The SQL query calculates the count of events for each event type from the `events` table.
99 | - It retrieves the `event_type` column values.
100 | - The `COUNT(*)` function counts the occurrences of each event type.
101 | - The results are grouped by the `event_type` column.
102 | - The `ORDER BY` clause sorts the results in ascending order based on the `event_type`.
103 | - The output includes columns for `event_type` and `event_count` (the count of events for each type).
104 |
105 |
106 |
107 | What is the percentage of visits which have a purchase event?
108 |
109 | ```sql
110 | SELECT (COUNT(DISTINCT CASE WHEN event_type = 3 THEN visit_id END) * 100.0 / COUNT(DISTINCT visit_id)) AS purchase_percentage
111 | FROM events;
112 | ```
113 | Answer:
114 |
115 |
116 | - The SQL query calculates the percentage of visits that contain an event with `event_type` equal to 3, indicating a purchase event.
117 | - The `COUNT(DISTINCT CASE WHEN event_type = 3 THEN visit_id END)` expression counts the number of unique `visit_id`s where the `event_type` is 3 (purchase event).
118 | - The `COUNT(DISTINCT visit_id)` expression counts the total number of unique `visit_id`s in the `events` table.
119 | - Multiplying the first count by 100.0 and dividing by the second gives the percentage of visits with a purchase; the decimal literal avoids integer division.
120 | - The output is a single column named `purchase_percentage` which represents the calculated percentage.
121 |
122 |
123 |
124 | What is the percentage of visits which view the checkout page but do not have a purchase event?
125 |
126 | ```sql
127 | WITH CTE AS (
128 | SELECT
129 | visit_id,
130 | MAX(CASE WHEN event_type = 1 AND page_id = 12 THEN 1 ELSE 0 END) AS checkout_viewed,
131 | MAX(CASE WHEN event_type = 3 THEN 1 ELSE 0 END) AS purchase_made
132 | FROM events
133 | GROUP BY visit_id
134 | )
135 |
136 | SELECT
137 | ROUND(100 * (1 - (SUM(purchase_made)::numeric / SUM(checkout_viewed))), 2) AS percentagewithout_purchase
138 | FROM CTE;
139 | ```
140 | Answer:
141 |
142 |
143 | - The SQL query uses a Common Table Expression (CTE) named `CTE` to calculate specific event indicators for each `visit_id`.
144 | - The CTE calculates two indicators for each visit: `checkout_viewed` and `purchase_made`.
145 | - The `checkout_viewed` indicator is calculated using the `MAX` function along with a `CASE` expression to identify visits where `event_type` is 1 (page view) and `page_id` is 12 (indicating a checkout page view).
146 | - The `purchase_made` indicator is calculated similarly for visits where `event_type` is 3 (purchase event).
147 | - The `GROUP BY visit_id` clause produces one row per visit.
148 | - The main query divides the number of visits with a purchase, `SUM(purchase_made)`, by the number of visits that viewed the checkout page, `SUM(checkout_viewed)`. This assumes that every purchase visit also viewed the checkout page.
149 | - The expression `(1 - (SUM(purchase_made) / SUM(checkout_viewed)))` is then the share of checkout visits without a purchase; the `::numeric` cast in the query avoids integer division.
150 | - The query then multiplies the ratio by 100, rounds the result to two decimal places using the `ROUND` function, and presents it as `percentagewithout_purchase`.
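
The per-visit flags also allow the share to be computed directly, by counting visits that viewed the checkout page and did not purchase, without relying on the "1 minus ratio" shortcut. The sketch below is one way to do that, run on SQLite via Python's `sqlite3` module. The events are invented, the page and event IDs follow the ones used in the query above, and logging the purchase on `page_id` 13 is an assumption of the toy data.

```python
import sqlite3

QUERY = """
WITH visit_flags AS (
    SELECT
        visit_id,
        MAX(CASE WHEN event_type = 1 AND page_id = 12 THEN 1 ELSE 0 END) AS checkout_viewed,
        MAX(CASE WHEN event_type = 3 THEN 1 ELSE 0 END) AS purchase_made
    FROM events
    GROUP BY visit_id
)
SELECT ROUND(
    100.0 * SUM(CASE WHEN checkout_viewed = 1 AND purchase_made = 0 THEN 1 ELSE 0 END)
          / SUM(checkout_viewed), 2)
FROM visit_flags
"""

# Hypothetical events: (visit_id, page_id, event_type).
EVENTS = [
    ("a", 12, 1), ("a", 13, 3),  # checkout, then purchase
    ("b", 12, 1),                # checkout only
    ("c", 12, 1),                # checkout only
    ("d", 2, 1),                 # never reached checkout
]


def checkout_without_purchase(events):
    """Percentage of checkout-viewing visits that have no purchase event."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE events (visit_id TEXT, page_id INT, event_type INT)")
    conn.executemany("INSERT INTO events VALUES (?, ?, ?)", events)
    (pct,) = conn.execute(QUERY).fetchone()
    conn.close()
    return pct


pct = checkout_without_purchase(EVENTS)
print(pct)
```

Both formulations agree whenever every purchase visit also has a checkout page view; the direct count stays correct even if that assumption does not hold.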
151 |
152 |
153 |
154 | What are the top 3 pages by number of views?
155 |
156 | ```sql
157 | SELECT page_id, COUNT(*) AS page_views
158 | FROM events
159 | WHERE event_type = 1
160 | GROUP BY page_id
161 | ORDER BY page_views DESC
162 | LIMIT 3;
163 | ```
164 | Answer:
165 |
166 |
167 | - The SQL query retrieves page views by counting the occurrences of each `page_id` with `event_type` equal to 1 (indicating a page view).
168 | - The `WHERE event_type = 1` clause filters the results to include only events of type 1 (page view).
169 | - The `GROUP BY page_id` clause groups the results by unique `page_id`.
170 | - The `ORDER BY page_views DESC` clause sorts the results in descending order based on the count of page views.
171 | - The `LIMIT 3` clause limits the output to the top 3 `page_id` entries with the highest page views.
172 |
173 |
174 |
175 | What is the number of views and cart adds for each product category?
176 |
177 | ```sql
178 | SELECT ph.product_category,
179 | SUM(CASE WHEN e.event_type = 1 THEN 1 ELSE 0 END) AS num_views,
180 | SUM(CASE WHEN e.event_type = 2 THEN 1 ELSE 0 END) AS num_cart_adds
181 | FROM page_hierarchy ph
182 | LEFT JOIN events e ON ph.page_id = e.page_id
183 | GROUP BY ph.product_category
184 | ORDER BY ph.product_category;
185 | ```
186 | Answer:
187 |
188 |
189 | - The SQL query retrieves statistics for product categories from the `page_hierarchy` and `events` tables.
190 | - The `LEFT JOIN` operation combines data from the `page_hierarchy` and `events` tables using the common `page_id` column.
191 | - The `SUM(CASE WHEN e.event_type = 1 THEN 1 ELSE 0 END)` expression calculates the total number of page views (event_type = 1) for each product category.
192 | - The `SUM(CASE WHEN e.event_type = 2 THEN 1 ELSE 0 END)` expression calculates the total number of cart additions (event_type = 2) for each product category.
193 | - The `GROUP BY ph.product_category` clause groups the results by unique product categories from the `page_hierarchy` table.
194 | - The `ORDER BY ph.product_category` clause sorts the results in ascending order based on the product category.
195 |
196 |
197 |
198 | What are the top 3 products by purchases?
199 |
200 | ```sql
201 | SELECT ph.page_name AS product_name,
202 | SUM(CASE WHEN e.event_type = 3 THEN 1 ELSE 0 END) AS num_purchases
203 | FROM page_hierarchy ph
204 | LEFT JOIN events e ON ph.page_id = e.page_id
205 | WHERE ph.page_name IS NOT NULL
206 | GROUP BY ph.page_name
207 | ORDER BY num_purchases DESC
208 | LIMIT 3;
209 | ```
210 |
211 | - The SQL query retrieves statistics for product purchases from the `page_hierarchy` and `events` tables.
212 | - The `LEFT JOIN` operation combines data from the `page_hierarchy` and `events` tables using the common `page_id` column.
213 | - The `WHERE ph.page_name IS NOT NULL` clause keeps only rows where the `page_name` in the `page_hierarchy` table is not null.
214 | - The `SUM(CASE WHEN e.event_type = 3 THEN 1 ELSE 0 END)` expression counts the purchase events (event_type = 3) recorded on each page. If purchase events are logged on a confirmation page rather than on the product pages, purchases per product have to be derived from cart adds in visits that contain a purchase event, as in the funnel query in section B.
215 | - The `GROUP BY ph.page_name` clause groups the results by unique product names from the `page_hierarchy` table.
216 | - The `ORDER BY num_purchases DESC` clause sorts the results in descending order based on the number of purchases.
217 | - The `LIMIT 3` clause limits the output to the top 3 products with the highest number of purchases.
218 |
219 |
220 |
221 | B. Product Funnel Analysis
222 |
223 | Using a single SQL query, create a new output table which has the following details:
224 |
225 | - How many times was each product viewed?
226 | - How many times was each product added to cart?
227 | - How many times was each product added to a cart but not purchased (abandoned)?
228 | - How many times was each product purchased?
229 |
230 | ```sql
231 | WITH ProdView AS (
232 | SELECT
233 | e.visit_id,
234 | ph.product_id,
235 | ph.page_name AS product_name,
236 | ph.product_category,
237 | COUNT(CASE WHEN e.event_type = 1 THEN 1 ELSE NULL END) AS views
238 | FROM events AS e
239 | JOIN page_hierarchy AS ph ON e.page_id = ph.page_id
240 | WHERE product_id IS NOT NULL
241 | GROUP BY e.visit_id, ph.product_id, ph.page_name, ph.product_category
242 | ),
243 | ProdCart AS (
244 | SELECT
245 | e.visit_id,
246 | ph.product_id,
247 | COUNT(CASE WHEN e.event_type = 2 THEN 1 ELSE NULL END) AS cart_adds
248 | FROM events AS e
249 | JOIN page_hierarchy AS ph ON e.page_id = ph.page_id
250 | WHERE product_id IS NOT NULL
251 | GROUP BY e.visit_id, ph.product_id
252 | ),
253 | PurchaseEvents AS (
254 | SELECT DISTINCT visit_id
255 | FROM events
256 | WHERE event_type = 3
257 | ),
258 | ProductStats AS (
259 | SELECT
260 | pvc.visit_id,
261 | pvc.product_id,
262 | pvc.product_name,
263 | pvc.product_category,
264 | pvc.views,
265 | pcc.cart_adds,
266 | CASE WHEN pe.visit_id IS NOT NULL THEN 1 ELSE 0 END AS purchases
267 | FROM ProdView AS pvc
268 | JOIN ProdCart AS pcc ON pvc.visit_id = pcc.visit_id AND pvc.product_id = pcc.product_id
269 | LEFT JOIN PurchaseEvents AS pe ON pvc.visit_id = pe.visit_id
270 | )
271 | SELECT
272 | product_name,
273 | product_category,
274 | SUM(views) AS total_views,
275 | SUM(cart_adds) AS total_cart_adds,
276 | SUM(CASE WHEN cart_adds > 0 AND purchases = 0 THEN 1 ELSE 0 END) AS abandoned,
277 | SUM(CASE WHEN cart_adds > 0 AND purchases = 1 THEN 1 ELSE 0 END) AS purchases
278 | FROM ProductStats
279 | GROUP BY product_name, product_category
280 | ORDER BY product_name;
281 | ```
282 |
283 | Answer:
284 |
285 |
286 |
287 | - The query uses several Common Table Expressions (CTEs) to analyze product interactions and purchases.
288 | - The first CTE, `ProdView`, counts the page views (`event_type = 1`) for each product in each visit by joining the `events` and `page_hierarchy` tables on the `page_id` column.
289 | - The `WHERE product_id IS NOT NULL` clause filters out rows whose page has no `product_id`.
290 | - The `GROUP BY` clause groups the results by `visit_id`, `product_id`, `product_name`, and `product_category`.
291 | - The second CTE, `ProdCart`, counts the cart additions (`event_type = 2`) for each product in each visit in the same way.
292 | - The third CTE, `PurchaseEvents`, identifies the distinct visit IDs that contain a purchase event (`event_type = 3`).
293 | - The fourth CTE, `ProductStats`, combines the previous CTEs into one row per visit and product and flags whether that visit ended in a purchase.
294 | - The main query selects `product_name` and `product_category`, sums `views` and `cart_adds`, and counts a cart add as abandoned when its visit has no purchase event and as purchased when it does.
295 | - The `GROUP BY` clause groups the results by `product_name` and `product_category`.
296 | - The `ORDER BY product_name` clause sorts the results alphabetically by product name.
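
The conditional counts used in the CTEs above can also be written with PostgreSQL's `FILTER` clause. The sketch below is only an illustration on a few made-up rows: `sample_events` and its values are hypothetical and not taken from the dataset, and it assumes PostgreSQL 9.4 or later.

```sql
-- Hypothetical sample rows: (visit_id, product_id, event_type)
-- event_type 1 = page view, 2 = add to cart
WITH sample_events (visit_id, product_id, event_type) AS (
  VALUES
    ('v1', 1, 1),
    ('v1', 1, 2),
    ('v2', 1, 1),
    ('v2', 2, 1),
    ('v2', 2, 2)
)
SELECT
  product_id,
  -- COUNT ignores NULLs, so the CASE form counts only the matching rows
  COUNT(CASE WHEN event_type = 1 THEN 1 END) AS views_case,
  -- FILTER expresses the same condition without a CASE expression
  COUNT(*) FILTER (WHERE event_type = 1) AS views_filter,
  COUNT(*) FILTER (WHERE event_type = 2) AS cart_adds
FROM sample_events
GROUP BY product_id
ORDER BY product_id;
```

Both forms return the same counts (2 views and 1 cart add for product 1, 1 view and 1 cart add for product 2), so the choice is mostly a matter of readability.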
297 |
298 |
299 |
300 | Additionally, create another table which further aggregates the data for the above points but this time for each product category instead of individual products
301 |
302 | ```sql
303 | WITH ProductPageEvents AS (
304 | SELECT
305 | e.visit_id,
306 | ph.product_id,
307 | ph.page_name AS product_name,
308 | ph.product_category,
309 | COUNT(CASE WHEN e.event_type = 1 THEN 1 ELSE NULL END) AS page_view,
310 | COUNT(CASE WHEN e.event_type = 2 THEN 1 ELSE NULL END) AS cart_add
311 | FROM events AS e
312 | JOIN page_hierarchy AS ph ON e.page_id = ph.page_id
313 | WHERE product_id IS NOT NULL
314 | GROUP BY e.visit_id, ph.product_id, ph.page_name, ph.product_category
315 | ),
316 | PurchaseEvents AS (
317 | SELECT DISTINCT visit_id
318 | FROM events
319 | WHERE event_type = 3
320 | ),
321 | CombinedTable AS (
322 | SELECT
323 | ppe.visit_id,
324 | ppe.product_id,
325 | ppe.product_name,
326 | ppe.product_category,
327 | ppe.page_view,
328 | ppe.cart_add,
329 | CASE WHEN pe.visit_id IS NOT NULL THEN 1 ELSE 0 END AS purchase
330 | FROM ProductPageEvents AS ppe
331 | LEFT JOIN PurchaseEvents AS pe ON ppe.visit_id = pe.visit_id
332 | ),
333 | ProductInfo AS (
334 | SELECT
335 | product_name,
336 | product_category,
337 | SUM(page_view) AS views,
338 | SUM(cart_add) AS cart_adds,
339 | SUM(CASE WHEN cart_add = 1 AND purchase = 0 THEN 1 ELSE 0 END) AS abandoned,
340 | SUM(CASE WHEN cart_add = 1 AND purchase = 1 THEN 1 ELSE 0 END) AS purchases
341 | FROM CombinedTable
342 | GROUP BY product_id, product_name, product_category
343 | ),
344 | CategoryStats AS (
345 | SELECT
346 | product_category,
347 | SUM(views) AS total_category_views,
348 | SUM(cart_adds) AS total_category_cart_adds,
349 | SUM(abandoned) AS total_category_abandoned,
350 | SUM(purchases) AS total_category_purchases
351 | FROM ProductInfo
352 | GROUP BY product_category
353 | )
354 |
355 |
356 | SELECT
357 | product_category,
358 | total_category_views,
359 | total_category_cart_adds,
360 | total_category_abandoned,
361 | total_category_purchases
362 | FROM CategoryStats;
363 | ```
364 | Answer:
365 |
366 |
367 | - The query aggregates the product funnel metrics by product category.
368 | - The first CTE, `ProductPageEvents`, counts page views and cart additions for each product in each visit by joining the `events` and `page_hierarchy` tables on the `page_id` column.
369 | - The `WHERE product_id IS NOT NULL` clause filters out rows whose page has no `product_id`.
370 | - The `GROUP BY` clause groups the results by `visit_id`, `product_id`, `product_name`, and `product_category`.
371 | - The second CTE, `PurchaseEvents`, identifies the distinct visit IDs that contain a purchase event (`event_type = 3`).
372 | - The third CTE, `CombinedTable`, joins the previous two CTEs and flags whether each visit resulted in a purchase.
373 | - The fourth CTE, `ProductInfo`, aggregates the statistics for each product: views, cart additions, abandoned cart adds, and purchases.
374 | - Its `GROUP BY` clause groups the results by `product_id`, `product_name`, and `product_category`.
375 | - The fifth CTE, `CategoryStats`, sums the product-level statistics for each `product_category`.
376 | - Its `GROUP BY` clause groups the results by `product_category`.
377 | - The main query selects `product_category` and the category totals for views, cart additions, abandoned cart adds, and purchases.
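
Since the later questions reuse the product-level output, one option is to store it in a temporary table once and roll it up from there instead of repeating the CTE chain. The sketch below assumes PostgreSQL; the table name `product_funnel_sketch` and every number in it are made up for illustration.

```sql
-- Hypothetical product-level funnel rows standing in for the first output table
CREATE TEMP TABLE product_funnel_sketch AS
SELECT *
FROM (
  VALUES
    ('Salmon',  'Fish',      100, 60, 15, 45),
    ('Tuna',    'Fish',       90, 50, 10, 40),
    ('Lobster', 'Shellfish', 120, 80, 20, 60)
) AS t (product_name, product_category, views, cart_adds, abandoned, purchases);

-- Roll the product rows up to one row per category
SELECT
  product_category,
  SUM(views)     AS total_category_views,
  SUM(cart_adds) AS total_category_cart_adds,
  SUM(abandoned) AS total_category_abandoned,
  SUM(purchases) AS total_category_purchases
FROM product_funnel_sketch
GROUP BY product_category
ORDER BY product_category;
```

With these sample rows, the Fish category totals 190 views, 110 cart adds, 25 abandoned, and 85 purchases, and Shellfish keeps the single Lobster row.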
378 |
379 |
380 |
381 | Use your 2 new output tables - answer the following questions:
382 | Which product had the most views, cart adds and purchases?
383 |
384 | ```sql
385 | WITH ProductPageEvents AS (
386 | SELECT
387 | e.visit_id,
388 | ph.product_id,
389 | ph.page_name AS product_name,
390 | ph.product_category,
391 | COUNT(CASE WHEN e.event_type = 1 THEN 1 ELSE NULL END) AS page_view,
392 | COUNT(CASE WHEN e.event_type = 2 THEN 1 ELSE NULL END) AS cart_add
393 | FROM events AS e
394 | JOIN page_hierarchy AS ph ON e.page_id = ph.page_id
395 | WHERE product_id IS NOT NULL
396 | GROUP BY e.visit_id, ph.product_id, ph.page_name, ph.product_category
397 | ),
398 | PurchaseEvents AS (
399 | SELECT DISTINCT visit_id
400 | FROM events
401 | WHERE event_type = 3
402 | ),
403 | CombinedTable AS (
404 | SELECT
405 | ppe.visit_id,
406 | ppe.product_id,
407 | ppe.product_name,
408 | ppe.product_category,
409 | ppe.page_view,
410 | ppe.cart_add,
411 | CASE WHEN pe.visit_id IS NOT NULL THEN 1 ELSE 0 END AS purchase
412 | FROM ProductPageEvents AS ppe
413 | LEFT JOIN PurchaseEvents AS pe ON ppe.visit_id = pe.visit_id
414 | ),
415 | ProductInfo AS (
416 | SELECT
417 | product_name,
418 | product_category,
419 | SUM(page_view) AS views,
420 | SUM(cart_add) AS cart_adds,
421 | SUM(CASE WHEN cart_add = 1 AND purchase = 0 THEN 1 ELSE 0 END) AS abandoned,
422 | SUM(CASE WHEN cart_add = 1 AND purchase = 1 THEN 1 ELSE 0 END) AS purchases
423 | FROM CombinedTable
424 | GROUP BY product_id, product_name, product_category
425 | )
426 |
427 | SELECT
428 | product_name,
429 | MAX(views) AS most_views,
430 | MAX(cart_adds) AS most_cart_adds,
431 | MAX(purchases) AS most_purchases
432 | FROM ProductInfo
433 | GROUP BY product_name
434 | ORDER BY most_views DESC, most_cart_adds DESC, most_purchases DESC
435 | LIMIT 1;
436 | ```
437 | Answer:
438 |
439 |
440 | - The query builds the product-level statistics and returns the top-ranked product.
441 | - The first CTE, `ProductPageEvents`, counts page views and cart additions for each product in each visit by joining the `events` and `page_hierarchy` tables on the `page_id` column.
442 | - The `WHERE product_id IS NOT NULL` clause filters out rows whose page has no `product_id`.
443 | - The `GROUP BY` clause groups the results by `visit_id`, `product_id`, `product_name`, and `product_category`.
444 | - The second CTE, `PurchaseEvents`, identifies the distinct visit IDs that contain a purchase event (`event_type = 3`).
445 | - The third CTE, `CombinedTable`, joins the previous two CTEs and flags whether each visit resulted in a purchase.
446 | - The fourth CTE, `ProductInfo`, aggregates the statistics for each product: views, cart additions, abandoned cart adds, and purchases.
447 | - Its `GROUP BY` clause groups the results by `product_id`, `product_name`, and `product_category`.
448 | - The main query selects `product_name` with `MAX(views)`, `MAX(cart_adds)`, and `MAX(purchases)`. Because `ProductInfo` already holds one row per product, each `MAX` simply returns that product's own value.
449 | - The `GROUP BY` clause groups the results by `product_name`.
450 | - The `ORDER BY` clause sorts by views, then cart additions, then purchases, all in descending order, so cart additions and purchases only break ties in views.
451 | - The `LIMIT 1` clause keeps the first row, which is the most-viewed product. The query does not check each metric independently, so this product also leads in cart additions and purchases only if the data happens to agree.
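
To find the leader for each metric on its own, each column can be ranked separately. The following is a minimal sketch on made-up product rows (the names and numbers in `product_info` are hypothetical), assuming PostgreSQL 9.4 or later for the `FILTER` clause.

```sql
-- Hypothetical product-level rows standing in for ProductInfo
WITH product_info (product_name, views, cart_adds, purchases) AS (
  VALUES
    ('Product A', 150, 90, 70),
    ('Product B', 140, 95, 65),
    ('Product C', 130, 85, 75)
),
ranked AS (
  SELECT
    product_name,
    RANK() OVER (ORDER BY views DESC)     AS views_rank,
    RANK() OVER (ORDER BY cart_adds DESC) AS cart_adds_rank,
    RANK() OVER (ORDER BY purchases DESC) AS purchases_rank
  FROM product_info
)
SELECT
  -- Each column picks the product ranked first for that metric;
  -- if several products tie, MAX returns the alphabetically last name
  MAX(product_name) FILTER (WHERE views_rank = 1)     AS most_viewed,
  MAX(product_name) FILTER (WHERE cart_adds_rank = 1) AS most_cart_adds,
  MAX(product_name) FILTER (WHERE purchases_rank = 1) AS most_purchases
FROM ranked;
```

In this sample the three leaders differ (Product A, Product B, and Product C respectively), which a single `ORDER BY ... LIMIT 1` would not reveal.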
452 |
453 |
454 |
455 | Which product was most likely to be abandoned?
456 |
457 | ```sql
458 | WITH ProductPageEvents AS (
459 | SELECT
460 | e.visit_id,
461 | ph.product_id,
462 | ph.page_name AS product_name,
463 | ph.product_category,
464 | COUNT(CASE WHEN e.event_type = 1 THEN 1 ELSE NULL END) AS page_view,
465 | COUNT(CASE WHEN e.event_type = 2 THEN 1 ELSE NULL END) AS cart_add
466 | FROM events AS e
467 | JOIN page_hierarchy AS ph ON e.page_id = ph.page_id
468 | WHERE product_id IS NOT NULL
469 | GROUP BY e.visit_id, ph.product_id, ph.page_name, ph.product_category
470 | ),
471 | PurchaseEvents AS (
472 | SELECT DISTINCT visit_id
473 | FROM events
474 | WHERE event_type = 3
475 | ),
476 | CombinedTable AS (
477 | SELECT
478 | ppe.visit_id,
479 | ppe.product_id,
480 | ppe.product_name,
481 | ppe.product_category,
482 | ppe.page_view,
483 | ppe.cart_add,
484 | CASE WHEN pe.visit_id IS NOT NULL THEN 1 ELSE 0 END AS purchase
485 | FROM ProductPageEvents AS ppe
486 | LEFT JOIN PurchaseEvents AS pe ON ppe.visit_id = pe.visit_id
487 | ),
488 | ProductInfo AS (
489 | SELECT
490 | product_name,
491 | product_category,
492 | SUM(page_view) AS views,
493 | SUM(cart_add) AS cart_adds,
494 | SUM(CASE WHEN cart_add = 1 AND purchase = 0 THEN 1 ELSE 0 END) AS abandoned,
495 | SUM(CASE WHEN cart_add = 1 AND purchase = 1 THEN 1 ELSE 0 END) AS purchases
496 | FROM CombinedTable
497 | GROUP BY product_id, product_name, product_category
498 | )
499 |
500 | SELECT
501 | product_name,
502 | MAX(abandoned_percentage) AS most_likely_abandoned
503 | FROM (
504 | SELECT
505 | product_name,
506 | (CAST(abandoned AS decimal) / CAST(cart_adds AS decimal)) * 100 AS abandoned_percentage
507 | FROM ProductInfo
508 | ) AS AbandonedPercentage
509 | GROUP BY product_name
510 | ORDER BY most_likely_abandoned DESC
511 | LIMIT 1;
512 | ```
513 | Answer:
514 |
515 |
516 | - The query identifies the product with the highest share of cart adds that were abandoned.
517 | - The first CTE, `ProductPageEvents`, counts page views and cart additions for each product in each visit by joining the `events` and `page_hierarchy` tables on the `page_id` column.
518 | - The `WHERE product_id IS NOT NULL` clause filters out rows whose page has no `product_id`.
519 | - The `GROUP BY` clause groups the results by `visit_id`, `product_id`, `product_name`, and `product_category`.
520 | - The second CTE, `PurchaseEvents`, identifies the distinct visit IDs that contain a purchase event (`event_type = 3`).
521 | - The third CTE, `CombinedTable`, joins the previous two CTEs and flags whether each visit resulted in a purchase.
522 | - The fourth CTE, `ProductInfo`, aggregates the statistics for each product: page views, cart additions, abandoned cart adds, and purchases.
523 | - Its `GROUP BY` clause groups the results by `product_id`, `product_name`, and `product_category`.
524 | - The derived table `AbandonedPercentage` calculates the abandoned percentage for each product as `(abandoned / cart_adds) * 100`, casting both counts to `decimal` so the division is not truncated.
526 | - The outer query selects the product name and `MAX(abandoned_percentage)`. With one row per product, the `MAX` returns that product's own percentage.
527 | - The `GROUP BY` clause groups the results by `product_name`.
528 | - The `ORDER BY` clause sorts the results by abandoned percentage in descending order.
529 | - The `LIMIT 1` clause keeps only the product with the highest abandoned percentage.
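
The ratio above divides by `cart_adds`, which would raise a division-by-zero error for a product that was never added to a cart. A hedged sketch of one common guard, `NULLIF`, is shown below on made-up rows; `product_info` and its values are hypothetical.

```sql
-- Hypothetical product-level rows; Product C was never added to a cart
WITH product_info (product_name, cart_adds, abandoned) AS (
  VALUES
    ('Product A', 80, 20),
    ('Product B', 50, 5),
    ('Product C', 0, 0)
)
SELECT
  product_name,
  -- NULLIF turns a zero denominator into NULL, so the division yields NULL instead of an error;
  -- 100.0 makes the arithmetic numeric rather than integer
  ROUND(100.0 * abandoned / NULLIF(cart_adds, 0), 2) AS abandoned_percentage
FROM product_info
ORDER BY abandoned_percentage DESC NULLS LAST;
```

Here Product A comes first at 25.00, Product B follows at 10.00, and Product C is listed last with a NULL percentage.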
530 |
531 |
532 |
533 | Which product had the highest view to purchase percentage?
534 |
535 | ```sql
536 | WITH ProductPageEvents AS (
537 | SELECT
538 | e.visit_id,
539 | ph.product_id,
540 | ph.page_name AS product_name,
541 | ph.product_category,
542 | COUNT(CASE WHEN e.event_type = 1 THEN 1 ELSE NULL END) AS page_view,
543 | COUNT(CASE WHEN e.event_type = 2 THEN 1 ELSE NULL END) AS cart_add
544 | FROM events AS e
545 | JOIN page_hierarchy AS ph ON e.page_id = ph.page_id
546 | WHERE product_id IS NOT NULL
547 | GROUP BY e.visit_id, ph.product_id, ph.page_name, ph.product_category
548 | ),
549 | PurchaseEvents AS (
550 | SELECT DISTINCT visit_id
551 | FROM events
552 | WHERE event_type = 3
553 | ),
554 | CombinedTable AS (
555 | SELECT
556 | ppe.visit_id,
557 | ppe.product_id,
558 | ppe.product_name,
559 | ppe.product_category,
560 | ppe.page_view,
561 | ppe.cart_add,
562 | CASE WHEN pe.visit_id IS NOT NULL THEN 1 ELSE 0 END AS purchase
563 | FROM ProductPageEvents AS ppe
564 | LEFT JOIN PurchaseEvents AS pe ON ppe.visit_id = pe.visit_id
565 | ),
566 | ProductInfo AS (
567 | SELECT
568 | product_name,
569 | product_category,
570 | SUM(page_view) AS views,
571 | SUM(cart_add) AS cart_adds,
572 | SUM(CASE WHEN cart_add = 1 AND purchase = 0 THEN 1 ELSE 0 END) AS abandoned,
573 | SUM(CASE WHEN cart_add = 1 AND purchase = 1 THEN 1 ELSE 0 END) AS purchases
574 | FROM CombinedTable
575 | GROUP BY product_id, product_name, product_category
576 | ),
577 | ViewToPurchase AS (
578 | SELECT
579 | product_name,
580 | (CAST(purchases AS decimal) / CAST(views AS decimal)) * 100 AS view_to_purchase_percentage
581 | FROM ProductInfo
582 | )
583 |
584 | SELECT
585 | product_name
586 | FROM ViewToPurchase
587 | ORDER BY view_to_purchase_percentage DESC
588 | LIMIT 1;
589 | ```
590 | Answer:
591 |
592 |
593 | - The query identifies the product with the highest view-to-purchase conversion rate.
594 | - The first CTE, `ProductPageEvents`, counts page views and cart additions for each product in each visit by joining the `events` and `page_hierarchy` tables on the `page_id` column.
595 | - The `WHERE product_id IS NOT NULL` clause filters out rows whose page has no `product_id`.
596 | - The `GROUP BY` clause groups the results by `visit_id`, `product_id`, `product_name`, and `product_category`.
597 | - The second CTE, `PurchaseEvents`, identifies the distinct visit IDs that contain a purchase event (`event_type = 3`).
598 | - The third CTE, `CombinedTable`, joins the previous two CTEs and flags whether each visit resulted in a purchase.
599 | - The fourth CTE, `ProductInfo`, aggregates the statistics for each product: page views, cart additions, abandoned cart adds, and purchases.
600 | - Its `GROUP BY` clause groups the results by `product_id`, `product_name`, and `product_category`.
601 | - The fifth CTE, `ViewToPurchase`, calculates the view-to-purchase percentage for each product as `(purchases / views) * 100`, casting both counts to `decimal` so the division is not truncated.
603 | - The main query returns the product name with the highest value in the `view_to_purchase_percentage` column.
604 | - The `ORDER BY` clause sorts the results by view-to-purchase percentage in descending order.
605 | - The `LIMIT 1` clause keeps only the product with the highest view-to-purchase conversion rate.
606 |
607 |
608 |
609 | What is the average conversion rate from view to cart add?
610 |
611 | ```sql
612 | WITH ProductPageEvents AS (
613 | SELECT
614 | e.visit_id,
615 | ph.product_id,
616 | ph.page_name AS product_name,
617 | ph.product_category,
618 | COUNT(CASE WHEN e.event_type = 1 THEN 1 ELSE NULL END) AS page_view,
619 | COUNT(CASE WHEN e.event_type = 2 THEN 1 ELSE NULL END) AS cart_add
620 | FROM events AS e
621 | JOIN page_hierarchy AS ph ON e.page_id = ph.page_id
622 | WHERE product_id IS NOT NULL
623 | GROUP BY e.visit_id, ph.product_id, ph.page_name, ph.product_category
624 | ),
625 | ProductStats AS (
626 | SELECT
627 | product_name,
628 | SUM(page_view) AS views,
629 | SUM(cart_add) AS cart_adds
630 | FROM ProductPageEvents
631 | GROUP BY product_name
632 | )
633 |
634 | SELECT
635 | AVG((CAST(cart_adds AS decimal) / CAST(views AS decimal)) * 100) AS average_conversion_rate
636 | FROM ProductStats;
637 | ```
638 | Answer:
639 |
640 |
641 | - The query calculates the average conversion rate from page view to cart add across products.
642 | - The first CTE, `ProductPageEvents`, counts page views and cart additions for each product in each visit by joining the `events` and `page_hierarchy` tables on the `page_id` column.
643 | - The `WHERE product_id IS NOT NULL` clause filters out rows whose page has no `product_id`.
644 | - The `GROUP BY` clause groups the results by `visit_id`, `product_id`, `product_name`, and `product_category`.
645 | - The second CTE, `ProductStats`, sums page views and cart additions for each product from the `ProductPageEvents` CTE.
646 | - Its `GROUP BY` clause groups the results by `product_name`.
647 | - The main query computes `(cart_adds / views) * 100` for each product and then averages these per-product rates with the `AVG` function.
648 | - The result is returned as `average_conversion_rate`.
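
An average of per-product rates is not the same as the overall rate computed from the totals, because products with more views carry no extra weight in the former. The sketch below contrasts the two on made-up numbers; `product_stats` and its values are hypothetical.

```sql
-- Hypothetical product-level rows standing in for ProductStats
WITH product_stats (product_name, views, cart_adds) AS (
  VALUES
    ('Product A', 100, 50),   -- 50% conversion
    ('Product B', 300, 240)   -- 80% conversion
)
SELECT
  -- Mean of the per-product rates: (50 + 80) / 2 = 65.00
  ROUND(AVG(100.0 * cart_adds / views), 2) AS avg_of_product_rates,
  -- Pooled rate: all cart adds over all views, 290 / 400 = 72.50
  ROUND(100.0 * SUM(cart_adds) / SUM(views), 2) AS overall_rate
FROM product_stats;
```

Which figure counts as "the average conversion rate" depends on whether every product or every view should weigh equally.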
649 |
650 |
651 |
652 | C. Campaigns Analysis
653 |
654 | Generate a table that has 1 single row for every unique visit_id record and has the following columns:
655 |
656 | - user_id
657 | - visit_id
658 | - visit_start_time: the earliest event_time for each visit
659 | - page_views: count of page views for each visit
660 | - cart_adds: count of product cart add events for each visit
661 | - purchase: 1/0 flag if a purchase event exists for each visit
662 | - campaign_name: map the visit to a campaign if the visit_start_time falls between the start_date and end_date
663 | - impression: count of ad impressions for each visit
664 | - click: count of ad clicks for each visit
665 | - (Optional column) cart_products: a comma separated text value with products added to the cart sorted by the order they were added to the cart (hint: use the sequence_number)
666 |
667 | ```sql
668 | SELECT
669 | u.user_id,
670 | e.visit_id,
671 | MIN(e.event_time) AS visit_start_time,
672 | SUM(CASE WHEN e.event_type = 1 THEN 1 ELSE 0 END) AS page_views,
673 | SUM(CASE WHEN e.event_type = 2 THEN 1 ELSE 0 END) AS cart_adds,
674 | MAX(CASE WHEN e.event_type = 3 THEN 1 ELSE 0 END) AS purchase,
675 | ci.campaign_name,
676 | SUM(CASE WHEN e.event_type = 4 THEN 1 ELSE 0 END) AS impression,
677 | SUM(CASE WHEN e.event_type = 5 THEN 1 ELSE 0 END) AS click,
678 | STRING_AGG(
679 | CASE WHEN e.event_type = 2 THEN ph.page_name END,
680 | ', '
681 | ORDER BY e.sequence_number
682 | ) AS cart_products
683 | FROM users u
684 | INNER JOIN events e ON u.cookie_id = e.cookie_id
685 | LEFT JOIN campaign_identifier ci ON e.event_time BETWEEN ci.start_date AND ci.end_date
686 | LEFT JOIN page_hierarchy ph ON e.page_id = ph.page_id
687 | GROUP BY u.user_id, e.visit_id, ci.campaign_name;
688 |
689 | ```
690 | Answer:
691 |
692 |
693 | - The query builds a summary row per visit from the `users`, `events`, `campaign_identifier`, and `page_hierarchy` tables.
694 | - The `SELECT` list returns `user_id`, `visit_id`, and the aggregated metrics for each visit.
695 | - `MIN(e.event_time)` finds the earliest event time in each visit, which serves as the visit start time.
696 | - `SUM(CASE WHEN e.event_type = 1 THEN 1 ELSE 0 END)` counts the page views.
697 | - `SUM(CASE WHEN e.event_type = 2 THEN 1 ELSE 0 END)` counts the cart additions in the same way, and event types 4 and 5 give the ad impressions and ad clicks.
698 | - `MAX(CASE WHEN e.event_type = 3 THEN 1 ELSE 0 END)` returns 1 if a purchase event occurred during the visit and 0 otherwise.
699 | - The `LEFT JOIN` clauses attach `campaign_identifier` by event time and `page_hierarchy` by page ID. The campaign join is evaluated for every event rather than for the visit start time, so a visit whose events straddle a campaign boundary would appear once per campaign.
700 | - The `GROUP BY` clause groups the results by `user_id`, `visit_id`, and `campaign_name`.
701 | - `STRING_AGG` concatenates the page names of cart-add events, separated by commas and ordered by `sequence_number`. The `CASE` returns NULL for all other events, and `STRING_AGG` skips NULLs.
702 | - The result summarizes user behavior during each visit: page views, cart additions, purchases, campaign association, ad impressions, ad clicks, and the products added to the cart.
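
The requirement maps each visit to a campaign by its `visit_start_time`. One way to do that, sketched here on made-up rows, is to compute the start time per visit first and join to the campaigns afterwards. The CTE names `sample_events`, `sample_campaigns`, and `visit_starts`, and all values in them, are hypothetical.

```sql
-- Hypothetical events: visit v1 starts just before a campaign boundary and ends after it
WITH sample_events (visit_id, event_time) AS (
  VALUES
    ('v1', TIMESTAMP '2020-01-14 23:50:00'),
    ('v1', TIMESTAMP '2020-01-15 00:10:00'),
    ('v2', TIMESTAMP '2020-01-20 09:00:00')
),
sample_campaigns (campaign_name, start_date, end_date) AS (
  VALUES
    ('Campaign One', TIMESTAMP '2020-01-01 00:00:00', TIMESTAMP '2020-01-14 23:59:59'),
    ('Campaign Two', TIMESTAMP '2020-01-15 00:00:00', TIMESTAMP '2020-01-28 23:59:59')
),
visit_starts AS (
  -- One row per visit with its earliest event time
  SELECT visit_id, MIN(event_time) AS visit_start_time
  FROM sample_events
  GROUP BY visit_id
)
SELECT vs.visit_id, vs.visit_start_time, c.campaign_name
FROM visit_starts AS vs
LEFT JOIN sample_campaigns AS c
  ON vs.visit_start_time BETWEEN c.start_date AND c.end_date
ORDER BY vs.visit_id;
```

Because the join runs on one row per visit, `v1` is mapped only to Campaign One even though its second event falls inside Campaign Two.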
703 |
704 |
705 |
706 |
707 |
708 |
709 |
710 |
711 |
712 |
713 |
--------------------------------------------------------------------------------
/Case Study #7 - Balanced Tree Clothing Co/README.md:
--------------------------------------------------------------------------------
1 | Case Study #7 - Balanced Tree Clothing Co.👚🧶
2 |
3 | Contents
4 |
5 |
18 |
19 | In this case study, we look at the operations of Balanced Tree Clothing Company, a fashion company that offers a curated range of clothing and lifestyle wear for the modern adventurer. Danny, the CEO, has asked us to analyze the company's sales performance and prepare a basic financial report to share across the organization. The analysis offers insight into the company's business strategy and performance metrics.
20 |
21 | Entity Relationship Diagram
22 |
23 | Case Study Questions & Solutions
24 |
25 | A. High Level Sales Analysis
26 |
27 | What was the total quantity sold for all products?
28 |
29 | ```sql
30 | SELECT SUM(qty) AS total_quantity
31 | FROM balanced_tree.sales;
32 | ```
33 | Answer:
34 |
35 |
36 | - The query calculates the total quantity sold across all products in the `balanced_tree.sales` table.
37 | - `SUM(qty)` adds up the `qty` column over every sales row.
38 | - The result is a single value: the total number of units sold.
39 |
40 |
41 |
42 | What is the total generated revenue for all products before discounts?
43 |
44 | ```sql
45 | SELECT SUM(price * qty) AS total_revenue_before_discount
46 | FROM balanced_tree.sales;
51 | ```
52 | Answer:
53 |
54 |
55 | - The query calculates the total revenue before discounts from the `balanced_tree.sales` table.
56 | - For each sales row, `price * qty` gives the revenue of that line before any discount is applied.
57 | - `SUM` adds these row-level amounts across all rows.
58 | - The result is a single value: the total revenue generated by all products before discounts.
59 |
60 |
61 |
62 | What was the total discount amount for all products?
63 |
64 | ```sql
65 | SELECT SUM(price * qty * discount / 100.0) AS total_discount_amount
66 | FROM balanced_tree.sales;
67 |
68 | ```
69 | Answer:
70 |
71 |
72 | - The query calculates the total discount amount across all sales rows in the `balanced_tree.sales` table.
73 | - For each row, `price * qty` is the revenue before discount, and multiplying it by `discount / 100.0` gives the amount discounted, since `discount` holds a percentage.
74 | - Dividing by `100.0` rather than `100` avoids integer division, which would truncate the result when the columns are integers.
75 | - The `SUM` function adds these row-level discount amounts.
76 | - The result is the total amount saved through discounts across all sales in the dataset.
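
The discount formulas in this section divide by 100. Assuming `price`, `qty`, and `discount` are stored as integers (an assumption about the schema), PostgreSQL performs integer division unless one operand is numeric. The sketch below uses literal values only, so it runs without any table.

```sql
SELECT
  17 / 100            AS integer_division,   -- both operands are integers: truncated to 0
  17 / 100.0          AS numeric_division,   -- one numeric operand keeps the fraction (0.17)
  52 * 2 * 17 / 100   AS truncated_discount, -- 1768 / 100 truncates to 17
  52 * 2 * 17 / 100.0 AS exact_discount;     -- 17.68
```

The first column shows why a term such as `discount / 100` evaluates to 0 for any integer discount below 100, which silently removes the discount from the calculation.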
77 |
78 |
79 |
80 |
81 | B. Transaction Analysis
82 |
83 | How many unique transactions were there?
84 |
85 | ```sql
86 | SELECT COUNT(DISTINCT txn_id) AS unique_transactions
87 | FROM balanced_tree.sales;
88 | ```
89 | Answer:
90 |
91 |
92 | - The query counts the unique transactions in the `balanced_tree.sales` table.
93 | - `COUNT(DISTINCT txn_id)` counts the distinct values of the `txn_id` column, so a transaction that spans several product rows is counted once.
94 | - The result is the number of unique transactions in the sales dataset.
95 |
96 |
97 |
98 | What is the average unique products purchased in each transaction?
99 |
100 | ```sql
101 | SELECT AVG(avg_unique_products) AS average_unique_products_per_transaction
102 | FROM (
103 | SELECT txn_id, COUNT(DISTINCT prod_id) AS avg_unique_products
104 | FROM balanced_tree.sales
105 | GROUP BY txn_id
106 | ) AS unique_products_per_transaction;
107 | ```
108 | Answer:
109 |
110 |
111 | - The query calculates the average number of unique products per transaction in the `balanced_tree.sales` table.
112 | - The subquery counts the distinct `prod_id` values for each `txn_id`, which gives the number of unique products in each transaction.
113 | - The outer query averages those counts with `AVG`.
114 | - The result shows how many different products a typical transaction contains.
115 |
116 |
117 |
118 | What are the 25th, 50th and 75th percentile values for the revenue per transaction?
119 |
120 | ```sql
121 | SELECT
122 | PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY revenue) AS percentile_25th,
123 | PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY revenue) AS percentile_50th,
124 | PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY revenue) AS percentile_75th
125 | FROM (
126 | SELECT
127 | txn_id,
128 | SUM(price * qty) AS revenue
129 | FROM balanced_tree.sales
130 | GROUP BY txn_id
131 | ) AS revenue_cte;
132 | ```
133 | Answer:
134 |
135 |
136 | - The query calculates the 25th, 50th (median), and 75th percentiles of revenue per transaction in the `balanced_tree.sales` table.
137 | - The subquery, aliased as `revenue_cte` (a derived table rather than a true CTE), calculates the revenue for each `txn_id` by summing `price * qty`.
138 | - The outer query applies the `PERCENTILE_CONT` function with the fractions `0.25`, `0.5`, and `0.75` to those revenue values.
139 | - The results describe the distribution of revenue across transactions, with the 50th percentile being the median.
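
`PERCENTILE_CONT` interpolates between neighboring values, so it can return a number that no transaction actually has. PostgreSQL also offers `PERCENTILE_DISC`, which returns an existing value. The sketch below compares the two on four made-up revenue figures; `txn_revenue` is hypothetical sample data.

```sql
-- Hypothetical per-transaction revenue values
WITH txn_revenue (txn_id, revenue) AS (
  VALUES ('t1', 10), ('t2', 20), ('t3', 30), ('t4', 40)
)
SELECT
  -- Interpolates halfway between 20 and 30: returns 25
  PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY revenue) AS median_cont,
  -- Returns the first actual value whose cumulative share reaches 0.5: returns 20
  PERCENTILE_DISC(0.5) WITHIN GROUP (ORDER BY revenue) AS median_disc
FROM txn_revenue;
```

For reporting a typical revenue figure the interpolated value is usually fine; `PERCENTILE_DISC` is the better fit when the answer must be a value that occurs in the data.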
140 |
141 |
142 |
143 | What is the average discount value per transaction?
144 |
145 | ```sql
146 | SELECT
147 | ROUND(SUM(qty * price * discount / 100.0) / COUNT(DISTINCT txn_id), 2) AS avg_discount
148 | FROM balanced_tree.sales;
149 | ```
150 | Answer:
151 |
152 |
153 | - The query calculates the average discount value per transaction in the `balanced_tree.sales` table.
154 | - For each sales row, the discount amount is `qty * price * discount / 100.0`. Dividing by `100.0` rather than `100` keeps the cents that integer division would truncate.
155 | - The `SUM` function totals these discount amounts, and the total is divided by the number of distinct `txn_id` values from `COUNT(DISTINCT txn_id)`.
156 | - The result is the average discount per transaction, rounded to two decimal places with the `ROUND` function.
157 |
158 |
159 |
160 | What is the percentage split of all transactions for members vs non-members?
161 |
162 | ```sql
163 | SELECT
164 | CASE
165 | WHEN member = 't' THEN 'Member'
166 | WHEN member = 'f' THEN 'Non-Member'
167 | END AS member_status,
168 | COUNT(DISTINCT txn_id) AS transaction_count,
169 | ROUND(100.0 * COUNT(DISTINCT txn_id) / (SELECT COUNT(DISTINCT txn_id) FROM balanced_tree.sales), 2) AS percentage
170 | FROM balanced_tree.sales
171 | GROUP BY member_status;
172 | ```
173 | Answer:
174 |
175 |
176 | - The query splits transactions into two groups based on the `member` column: 'Member' and 'Non-Member'.
177 | - The `CASE` expression turns the 't' and 'f' values in the `member` column into readable labels.
178 | - `COUNT(DISTINCT txn_id)` counts the transactions in each group. Counting distinct IDs matters because a transaction can span several sales rows.
179 | - The percentage divides each group's count by the total number of distinct transactions. Multiplying by `100.0` keeps the division from being truncated to an integer.
180 | - The subquery `(SELECT COUNT(DISTINCT txn_id) FROM balanced_tree.sales)` supplies that total.
181 | - The result shows the member status, the transaction count for each group, and the percentage of transactions in each group, rounded to two decimal places with the `ROUND` function.
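
The grand total can also be obtained with a window function over the grouped counts instead of a scalar subquery. This is a minimal sketch on made-up rows, assuming a boolean `member` flag; `sample_txns` and its values are hypothetical.

```sql
-- Hypothetical transactions: three by members, one by a non-member
WITH sample_txns (txn_id, member) AS (
  VALUES ('t1', TRUE), ('t2', TRUE), ('t3', TRUE), ('t4', FALSE)
)
SELECT
  CASE WHEN member THEN 'Member' ELSE 'Non-Member' END AS member_status,
  COUNT(*) AS transaction_count,
  -- SUM(COUNT(*)) OVER () adds the group counts back up to the grand total
  ROUND(100.0 * COUNT(*) / SUM(COUNT(*)) OVER (), 2) AS percentage
FROM sample_txns
GROUP BY member;
```

With these rows the split is 75.00 for members and 25.00 for non-members, and the table is scanned only once.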
182 |
183 |
184 |
185 | What is the average revenue for member transactions and non-member transactions?
186 |
187 | ```sql
188 | SELECT
189 | CASE
190 | WHEN member = 't' THEN 'Member'
191 | WHEN member = 'f' THEN 'Non-Member'
192 | END AS member_status,
193 | ROUND(AVG(txn_revenue), 2) AS avg_revenue
194 | FROM (SELECT txn_id, member, SUM(price * qty) AS txn_revenue FROM balanced_tree.sales GROUP BY txn_id, member) AS txn_totals
195 | GROUP BY member_status;
196 | ```
197 | Answer:
198 |
199 |
200 | - The query splits transactions into two groups based on the `member` column: 'Member' and 'Non-Member'.
201 | - The `CASE` expression turns the 't' and 'f' values in the `member` column into readable labels.
202 | - The subquery `txn_totals` sums `price * qty` for each transaction, `AVG` averages those transaction totals within each group, and `ROUND` rounds the result to two decimal places. Averaging `price * qty` directly would give the average per sales row, not per transaction.
203 | - The `GROUP BY` clause groups the results by the member status label.
204 | - The result shows the member status ('Member' or 'Non-Member') and the average revenue per transaction for each group.
205 |
206 |
207 |
208 | C. Product Analysis
209 |
210 | What are the top 3 products by total revenue before discount?
211 |
212 | ```sql
213 | SELECT
214 | pd.product_id,
215 | pd.product_name,
216 | SUM(s.price * s.qty) AS total_revenue_before_discount
217 | FROM balanced_tree.product_details pd
218 | JOIN balanced_tree.sales s ON pd.product_id = s.prod_id
219 | GROUP BY pd.product_id, pd.product_name
220 | ORDER BY total_revenue_before_discount DESC
221 | LIMIT 3;
222 | ```
223 | Answer:
224 |
225 |
226 | - The query returns each product together with its total revenue before any discounts.
227 | - The `balanced_tree.product_details` table is joined with the `balanced_tree.sales` table on the `product_id` and `prod_id` columns.
228 | - The `SUM` function calculates the total revenue for each product by multiplying the `price` and `qty` columns from the sales table.
229 | - The `GROUP BY` clause groups the results by `product_id` and `product_name`.
230 | - The `ORDER BY` clause sorts the results in descending order of total revenue before discount.
231 | - The `LIMIT` clause keeps the 3 products with the highest total revenue before discount.
232 | - The result shows the product ID, product name, and total revenue before discount for the top 3 products.
233 |
234 |
235 |
236 | What is the total quantity, revenue and discount for each segment?
237 |
238 | ```sql
239 | SELECT
240 | pd.segment_name,
241 | SUM(s.qty) AS total_quantity,
242 | SUM(s.price * s.qty) AS total_revenue,
243 | SUM(s.qty * s.price * s.discount / 100.0) AS total_discount
244 | FROM balanced_tree.product_details pd
245 | JOIN balanced_tree.sales s ON pd.product_id = s.prod_id
246 | GROUP BY pd.segment_name
247 | ORDER BY total_revenue DESC;
248 | ```
249 | Answer:
250 |
251 |
252 | - The SQL query retrieves information about product segments, along with aggregated data on total quantity sold, total revenue generated, and total discounts applied.
253 | - The
balanced_tree.product_details
table is joined with the balanced_tree.sales
table using the product_id
and prod_id
columns.
254 | - The
SUM
function calculates the total quantity sold (qty
), total revenue generated by multiplying price
and qty
, and total discount amount by considering qty
, price
, and discount
columns from the sales table.
255 | - The
GROUP BY
clause groups the results by segment_name
from the product_details
table.
256 | - The
ORDER BY
clause arranges the results in descending order of total revenue.
257 | - The result displays the product segment name, total quantity sold, total revenue generated, and total discount amount for each segment.
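
One detail of the discount formula is worth checking: if `qty`, `price`, and `discount` are integer columns, `qty * price * discount / 100` is an integer division, so each sale's discount is truncated before it is summed. The sketch below is a minimal illustration on hypothetical toy rows, using Python's built-in `sqlite3` module as a stand-in for PostgreSQL (both truncate integer division); the table contents are assumptions, not case-study data.

```python
import sqlite3

# Hypothetical toy rows (qty, price, discount %); not taken from the case study data.
rows = [(1, 10, 15), (3, 13, 17), (2, 19, 5)]

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (qty INTEGER, price INTEGER, discount INTEGER)")
con.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)

# Integer division truncates each row's discount before it is summed.
truncated = con.execute(
    "SELECT SUM(qty * price * discount / 100) FROM sales"
).fetchone()[0]

# Dividing by 100.0 keeps the fractional part of every row.
exact = con.execute(
    "SELECT ROUND(SUM(qty * price * discount / 100.0), 2) FROM sales"
).fetchone()[0]

print(truncated, exact)
```

If fractional discount amounts matter for the report, dividing by `100.0` (or casting one operand to `numeric` in PostgreSQL) avoids the truncation.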
258 |
259 |
260 |
261 | What is the top selling product for each segment?
262 |
263 | ```sql
264 | WITH top_selling_products AS (
265 | SELECT
266 | pd.segment_id,
267 | pd.segment_name,
268 | pd.product_id,
269 | pd.product_name,
270 | SUM(s.qty) AS total_quantity,
271 | ROW_NUMBER() OVER (PARTITION BY pd.segment_id ORDER BY SUM(s.qty) DESC) AS rank
272 | FROM balanced_tree.product_details pd
273 | JOIN balanced_tree.sales s ON pd.product_id = s.prod_id
274 | GROUP BY pd.segment_id, pd.segment_name, pd.product_id, pd.product_name
275 | )
276 | SELECT
277 | segment_id,
278 | segment_name,
279 | product_id,
280 | product_name,
281 | total_quantity
282 | FROM top_selling_products
283 | WHERE rank = 1;
284 | ```
285 | Answer:
286 |
287 |
288 | - The SQL query calculates and presents information about the top-selling products within each product segment.
289 | - The query starts by creating a Common Table Expression (CTE) named `top_selling_products`.
290 | - The `balanced_tree.product_details` table is joined with the `balanced_tree.sales` table using the `product_id` and `prod_id` columns.
291 | - The `SUM` function calculates the total quantity sold (`qty`) for each product within a segment.
292 | - The `ROW_NUMBER()` window function is used with the `OVER` clause to number the products within each segment by total quantity sold, highest first. The `PARTITION BY` clause ensures that the numbering restarts for each segment. Because `ROW_NUMBER()` never repeats a number, products tied on quantity are ordered arbitrarily and only one of them receives rank 1.
293 | - The main query selects the columns `segment_id`, `segment_name`, `product_id`, `product_name`, and `total_quantity` from the `top_selling_products` CTE.
294 | - The `WHERE` clause filters the results to include only rows with a rank of 1, representing the top-selling product within each segment.
295 | - The result displays the segment ID, segment name, product ID, product name, and total quantity sold for the top-selling product within each segment.
296 |
297 |
298 |
299 | What is the total quantity, revenue and discount for each category?
300 |
301 | ```sql
302 | SELECT
303 | pd.category_id,
304 | pd.category_name,
305 | SUM(s.qty) AS total_quantity,
306 | SUM(s.qty * s.price) AS total_revenue,
307 | SUM(s.qty * s.price * s.discount / 100) AS total_discount
308 | FROM balanced_tree.product_details pd
309 | JOIN balanced_tree.sales s ON pd.product_id = s.prod_id
310 | GROUP BY pd.category_id, pd.category_name;
311 | ```
312 | Answer:
313 |
314 |
315 | - The SQL query calculates and presents aggregated information about product categories based on sales data.
316 | - The query selects the `category_id` and `category_name` columns from the `balanced_tree.product_details` table.
317 | - The `SUM` function calculates the total quantity sold (`qty`) and the total revenue earned (`qty * price`) for each product category.
318 | - The formula `qty * price * discount / 100` gives the discount amount of each sale, and the `SUM` function adds these amounts up for each product category.
319 | - The `JOIN` clause connects the `balanced_tree.product_details` table with the `balanced_tree.sales` table using the `product_id` and `prod_id` columns.
320 | - The `GROUP BY` clause groups the results by `category_id` and `category_name`.
321 | - The query calculates the total quantity, total revenue, and total discount for each product category.
322 | - The query has no `ORDER BY` clause, so the order of the returned rows is not guaranteed.
323 | - The query presents the results, including the `category_id`, `category_name`, total quantity sold, total revenue earned, and total discount amount for each product category.
324 |
325 |
326 |
327 | What is the top selling product for each category?
328 |
329 | ```sql
330 | WITH top_selling_cte AS (
331 | SELECT
332 | pd.category_id,
333 | pd.category_name,
334 | pd.product_id,
335 | pd.product_name,
336 | SUM(s.qty) AS total_quantity,
337 | RANK() OVER (
338 | PARTITION BY pd.category_id
339 | ORDER BY SUM(s.qty) DESC) AS ranking
340 | FROM balanced_tree.product_details pd
341 | JOIN balanced_tree.sales s ON pd.product_id = s.prod_id
342 | GROUP BY
343 | pd.category_id, pd.category_name, pd.product_id, pd.product_name
344 | )
345 |
346 | SELECT
347 | category_id,
348 | category_name,
349 | product_id,
350 | product_name,
351 | total_quantity
352 | FROM top_selling_cte
353 | WHERE ranking = 1;
354 | ```
355 | Answer:
356 |
357 |
358 | - The SQL query aims to identify the top-selling products within each product category based on sales data.
359 | - The query starts by creating a Common Table Expression (CTE) named `top_selling_cte`.
360 | - Within the CTE, the query selects relevant columns such as `category_id`, `category_name`, `product_id`, and `product_name` from the `balanced_tree.product_details` table.
361 | - The `SUM` function calculates the total quantity sold (`qty`) for each product, with the rows grouped by `category_id`, `category_name`, `product_id`, and `product_name`.
362 | - The `RANK()` window function assigns a ranking to products within each product category, ordering them by total quantity sold in descending order.
363 | - The `PARTITION BY` clause divides the data into partitions based on `category_id`, ensuring that ranking is done within each category.
364 | - The main query selects columns from the CTE, including `category_id`, `category_name`, `product_id`, `product_name`, and `total_quantity`.
365 | - The `WHERE` clause filters the results to include only rows where the product has the highest ranking (`ranking = 1`) within its category.
366 | - The query presents the top-selling products within each product category, displaying the `category_id`, `category_name`, `product_id`, `product_name`, and total quantity sold for each product.
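
The segment query uses `ROW_NUMBER()` while the category query uses `RANK()`, and the two behave differently when products tie on quantity. The following is a minimal sketch on hypothetical totals, run through Python's built-in `sqlite3` module rather than PostgreSQL; the `totals` table and the `top_products` helper are assumptions made for illustration.

```python
import sqlite3

# Hypothetical per-product totals with a tie inside category 1.
totals = [(1, "A", 50), (1, "B", 50), (1, "C", 20), (2, "D", 70)]

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE totals (category_id INTEGER, product TEXT, total_quantity INTEGER)")
con.executemany("INSERT INTO totals VALUES (?, ?, ?)", totals)

def top_products(window_fn):
    """Return the rows ranked first in each category by the given window function."""
    sql = f"""
        SELECT category_id, product FROM (
            SELECT category_id, product,
                   {window_fn}() OVER (
                       PARTITION BY category_id
                       ORDER BY total_quantity DESC) AS rn
            FROM totals
        )
        WHERE rn = 1
        ORDER BY category_id, product
    """
    return con.execute(sql).fetchall()

with_rank = top_products("RANK")              # keeps both tied products
with_row_number = top_products("ROW_NUMBER")  # keeps exactly one row per category
print(with_rank, with_row_number)
```

`RANK()` is the safer choice when ties should all be reported; `ROW_NUMBER()` suits cases where exactly one row per group is required, ideally with a tie-breaking column added to the `ORDER BY`.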
367 |
368 |
369 |
370 | What is the percentage split of revenue by product for each segment?
371 |
372 | ```sql
373 | WITH segment_revenue AS (
374 | SELECT
375 | p.segment_id,
376 | p.segment_name,
377 | s.prod_id,
378 | p.product_name,
379 | SUM(s.price * s.qty) AS total_revenue
380 | FROM balanced_tree.sales s
381 | JOIN balanced_tree.product_details p ON s.prod_id = p.product_id
382 | GROUP BY
383 | p.segment_id, p.segment_name, s.prod_id, p.product_name
384 | )
385 |
386 | SELECT
387 | sr.segment_id,
388 | sr.segment_name,
389 | sr.prod_id,
390 | sr.product_name,
391 | sr.total_revenue,
392 | ROUND(100 * sr.total_revenue / SUM(sr.total_revenue) OVER (PARTITION BY sr.segment_id), 2) AS revenue_percentage
393 | FROM segment_revenue sr;
394 | ```
395 | Answer:
396 |
397 |
398 | - The SQL query calculates the revenue distribution percentage of each product within its corresponding segment based on sales data.
399 | - The query begins by defining a Common Table Expression (CTE) named `segment_revenue`.
400 | - Within the CTE, the query selects columns such as `segment_id`, `segment_name`, `prod_id`, `product_name`, and the total revenue (`total_revenue`) for each product.
401 | - The `SUM` function calculates the total revenue by multiplying the price and quantity for each product, and the results are grouped by `segment_id`, `segment_name`, `prod_id`, and `product_name`.
402 | - The main query selects columns from the `segment_revenue` CTE, including `segment_id`, `segment_name`, `prod_id`, `product_name`, and `total_revenue`.
403 | - The `ROUND` function rounds the revenue percentage of each product within its segment to two decimal places.
404 | - The percentage is calculated as the total revenue of the product divided by the total revenue of all products within the same segment, then multiplied by 100.
405 | - The `SUM(total_revenue)` is calculated over a window partitioned by `segment_id`, ensuring that the percentage is calculated for each segment separately.
406 | - The calculated revenue percentage is displayed as `revenue_percentage`.
407 | - The query presents the segment-wise revenue distribution for each product, showing the `segment_id`, `segment_name`, `prod_id`, `product_name`, total revenue, and the revenue percentage.
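
The percentage-of-partition pattern can be checked on a small example. This is a minimal sketch with hypothetical revenue figures, run through Python's built-in `sqlite3` module instead of PostgreSQL; the product names and numbers are assumptions, and `100.0` is used only to force floating-point division in SQLite.

```python
import sqlite3

# Hypothetical per-product revenue; names and figures are illustrative only.
revenue = [(1, "jeans_a", 300), (1, "jeans_b", 100),
           (2, "socks_a", 250), (2, "socks_b", 250)]

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE segment_revenue "
    "(segment_id INTEGER, product_name TEXT, total_revenue INTEGER)"
)
con.executemany("INSERT INTO segment_revenue VALUES (?, ?, ?)", revenue)

# Each product's revenue divided by the sum over its own segment.
shares = con.execute("""
    SELECT segment_id, product_name,
           ROUND(100.0 * total_revenue
                 / SUM(total_revenue) OVER (PARTITION BY segment_id), 2)
               AS revenue_percentage
    FROM segment_revenue
    ORDER BY segment_id, product_name
""").fetchall()
print(shares)
```

The percentages within each segment add up to 100, which is a quick sanity check for any query of this shape.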
408 |
409 |
410 |
411 | What is the percentage split of revenue by segment for each category?
412 |
413 | ```sql
414 | WITH category_segment_revenue AS (
415 | SELECT
416 | p.category_id,
417 | p.category_name,
418 | p.segment_id,
419 | p.segment_name,
420 | SUM(s.price * s.qty) AS total_revenue
421 | FROM balanced_tree.sales s
422 | JOIN balanced_tree.product_details p ON s.prod_id = p.product_id
423 | GROUP BY
424 | p.category_id, p.category_name, p.segment_id, p.segment_name
425 | )
426 |
427 | SELECT
428 | csr.category_id,
429 | csr.category_name,
430 | csr.segment_id,
431 | csr.segment_name,
432 | csr.total_revenue,
433 | ROUND(100 * csr.total_revenue / SUM(csr.total_revenue) OVER (PARTITION BY csr.category_id), 2) AS revenue_percentage
434 | FROM category_segment_revenue csr;
435 | ```
436 | Answer:
437 |
438 |
439 | - The SQL query calculates the revenue distribution percentage of each segment within its corresponding category based on sales data.
440 | - The query begins by defining a Common Table Expression (CTE) named `category_segment_revenue`.
441 | - Within the CTE, the query selects columns such as `category_id`, `category_name`, `segment_id`, `segment_name`, and the total revenue (`total_revenue`) for each segment within its category.
442 | - The `SUM` function calculates the total revenue by multiplying the price and quantity for each sale, and the results are grouped by `category_id`, `category_name`, `segment_id`, and `segment_name`.
443 | - The main query selects columns from the `category_segment_revenue` CTE, including `category_id`, `category_name`, `segment_id`, `segment_name`, and `total_revenue`.
444 | - The `ROUND` function rounds the revenue percentage of each segment within its category to two decimal places.
445 | - The percentage is calculated as the total revenue of the segment divided by the total revenue of all segments within the same category, then multiplied by 100.
446 | - The `SUM(total_revenue)` is calculated over a window partitioned by `category_id`, ensuring that the percentage is calculated for each category separately.
447 | - The calculated revenue percentage is displayed as `revenue_percentage`.
448 | - The query presents the category-wise revenue distribution for each segment, showing the `category_id`, `category_name`, `segment_id`, `segment_name`, total revenue, and the revenue percentage.
449 |
450 |
451 |
452 | What is the percentage split of total revenue by category?
453 |
454 | ```sql
455 | WITH category_revenue AS (
456 | SELECT
457 | p.category_id,
458 | p.category_name,
459 | SUM(s.price * s.qty) AS total_revenue
460 | FROM balanced_tree.sales s
461 | JOIN balanced_tree.product_details p ON s.prod_id = p.product_id
462 | GROUP BY
463 | p.category_id, p.category_name
464 | )
465 |
466 | SELECT
467 | cr.category_id,
468 | cr.category_name,
469 | cr.total_revenue,
470 | ROUND(100 * cr.total_revenue / SUM(cr.total_revenue) OVER (), 2) AS revenue_percentage
471 | FROM category_revenue cr;
472 | ```
473 | Answer:
474 |
475 |
476 | - The SQL query calculates the revenue distribution percentage for each category based on sales data and product details.
477 | - The query begins by defining a Common Table Expression (CTE) named `category_revenue`.
478 | - Within the CTE, the query selects columns such as `category_id`, `category_name`, and the total revenue (`total_revenue`) for each category.
479 | - The `SUM` function calculates the total revenue by multiplying the price and quantity for each sale, and the results are grouped by `category_id` and `category_name`.
480 | - The main query selects columns from the `category_revenue` CTE, including `category_id`, `category_name`, and `total_revenue`.
481 | - The `ROUND` function rounds the revenue percentage of each category to two decimal places.
482 | - The percentage is calculated as the total revenue of the category divided by the overall total revenue, then multiplied by 100.
483 | - The `SUM(total_revenue)` is calculated over the entire result set using the `OVER ()` clause, ensuring that the percentage is calculated based on the total revenue across all categories.
484 | - The calculated revenue percentage is displayed as `revenue_percentage`.
485 | - The query presents the revenue distribution for each category, showing the `category_id`, `category_name`, total revenue, and the revenue percentage.
486 |
487 |
488 |
489 |
490 |
--------------------------------------------------------------------------------
/Case Study #8 - Fresh Segments/README.md:
--------------------------------------------------------------------------------
1 | Case Study #8 - Fresh Segments👨🏼💻
2 |
3 | Contents
4 |
5 |
16 |
17 |
18 | Discover the transformative journey of Fresh Segments, a pioneering digital marketing agency founded by Danny. By decoding online ad click behavior trends tailored to individual businesses, Fresh Segments empowers companies to understand their customer base like never before.
19 |
20 | In this case study, we delve into Fresh Segments' innovative methodology. Clients entrust their customer lists to Fresh Segments, which then compiles and analyzes interest metrics, creating a comprehensive dataset for detailed examination.
21 |
22 | The agency's forte lies in unraveling interest composition and rankings for each client, spotlighting the proportions of customers interacting with specific online assets. These insights, presented monthly, offer a holistic view of evolving customer interests.
23 |
24 | Danny seeks our expertise in deciphering aggregated metrics for a sample client. Our mission is to distill key insights from this data, shedding light on customer lists and the array of captivating interests. Through this analysis, we illuminate overarching patterns, offering strategic guidance to propel Fresh Segments forward.
25 |
26 | Case Study Questions & Solutions
27 |
28 | A. Data Exploration and Cleansing
29 |
30 | Update the fresh_segments.interest_metrics table by modifying the month_year column to be a date data type with the start of the month
31 |
32 | ```sql
33 | alter table interest_metrics
34 | drop column month_year
35 | ```
36 | ```sql
37 | alter table interest_metrics
38 | add month_year date
39 | ```
40 | ```sql
41 | UPDATE interest_metrics
42 | SET month_year = TO_DATE(_year || '-' || LPAD(_month::text, 2, '0') || '-01', 'YYYY-MM-DD');
43 | ```
44 | ```sql
45 | SELECT *
46 | FROM interest_metrics;
47 | ```
48 | Answer:
49 |
50 |
51 | - Drop Column Command: Removes the `month_year` column from the `interest_metrics` table.
52 |
53 | - Alter Table - Add Column Command: Adds a new `month_year` column of type `DATE` to the `interest_metrics` table.
54 |
55 | - Update Command: Populates the `month_year` column for existing records in the `interest_metrics` table. It builds a `YYYY-MM-01` string from the `_year` and `_month` columns and converts it to a date with the `TO_DATE` function.
56 |
57 | - Select All Records Command: Retrieves and displays all records from the `interest_metrics` table to verify the change.
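
The string assembled inside `TO_DATE` can be reproduced outside the database to see what it produces. The sketch below is a minimal Python illustration of the same steps (zero-pad the month, append `-01`, parse as `YYYY-MM-DD`); the `month_start` helper is a hypothetical name, and returning `None` for missing inputs mirrors how concatenating a NULL in SQL yields NULL.

```python
from datetime import date, datetime

def month_start(year, month):
    """Build the first-of-month date from separate year and month values."""
    if year is None or month is None:
        return None  # mirrors SQL: concatenating NULL yields NULL
    text = f"{year}-{int(month):02d}-01"  # same shape as the TO_DATE input
    return datetime.strptime(text, "%Y-%m-%d").date()

print(month_start(2018, 7) == date(2018, 7, 1))
print(month_start(None, None))
```

Rows whose `_month` or `_year` is missing therefore end up with a NULL `month_year`, which is what the next questions examine.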
58 |
59 |
60 |
61 | What is the count of records in the fresh_segments.interest_metrics table for each month_year value, sorted in chronological order (earliest to latest) with the null values appearing first?
62 |
63 | ```sql
64 | SELECT
65 | month_year,
66 | COUNT(*) AS record_count
67 | FROM interest_metrics
68 | GROUP BY month_year
69 | ORDER BY month_year NULLS FIRST;
70 |
71 | ```
72 |
73 | Answer:
74 |
75 |
76 | - Select Command: Retrieves data from the `interest_metrics` table.
77 |
78 | - Columns Selected:
79 |
80 |   - `month_year`: Represents the month and year of the records.
81 |   - `COUNT(*)`: Calculates the number of records for each `month_year`.
82 |
83 |
84 |
85 | - Group By Clause: Groups the records based on the `month_year` column.
86 |
87 | - Order By Clause: Arranges the results in ascending order of `month_year`. Records with a NULL `month_year` are listed first (`NULLS FIRST`).
88 |
89 |
90 |
91 | What do you think we should do with these null values in the fresh_segments.interest_metrics table?
92 |
93 | ```sql
94 | SELECT month_year, COUNT(*) AS record_count
95 | FROM interest_metrics
96 | WHERE month_year IS NOT NULL
97 | GROUP BY month_year
98 | ORDER BY month_year;
99 | ```
100 | Answer:
101 |
102 |
103 | - Select Command: Retrieves data from the `interest_metrics` table.
104 |
105 | - Columns Selected:
106 |
107 |   - `month_year`: Represents the month and year of the records.
108 |   - `COUNT(*)`: Calculates the number of records for each `month_year`.
109 |
110 |
111 |
112 | - Where Clause: Keeps only the records where `month_year` is not NULL, so rows without a date are excluded from the counts.
113 |
114 | - Group By Clause: Groups the non-null records based on the `month_year` column.
115 |
116 | - Order By Clause: Arranges the results in ascending order of `month_year`.
117 |
118 |
119 |
120 | How many interest_id values exist in the fresh_segments.interest_metrics table but not in the fresh_segments.interest_map table? What about the other way around?
121 |
122 | ```sql
123 | SELECT COUNT(DISTINCT im.interest_id) AS count_interest_metrics_not_in_map
124 | FROM interest_metrics im
125 | LEFT JOIN interest_map map ON im.interest_id::integer = map.id
126 | WHERE map.id IS NULL;
127 |
128 | ```
129 | Answer:
130 |
131 |
132 | - Select Command: Retrieves data from the `interest_metrics` table.
133 |
134 | - Columns Selected:
135 |
136 |   - `COUNT(DISTINCT im.interest_id)`: Counts the number of distinct `interest_id` values that are not present in the `interest_map` table.
137 |
138 |
139 |
140 | - Join: Performs a left join between `interest_metrics` and `interest_map` using the `interest_id` column (cast to integer) and the `id` column.
141 |
142 | - Where Clause: Keeps only the rows where the `id` column from the `interest_map` table is NULL, which indicates that the `interest_id` has no match in the map.
143 |
144 |
145 |
146 | ```sql
147 | SELECT COUNT(DISTINCT map.id) AS count_interest_map_not_in_metrics
148 | FROM interest_map map
149 | LEFT JOIN interest_metrics im ON map.id = im.interest_id::integer
150 | WHERE im.interest_id IS NULL;
151 | ```
152 |
153 | Answer:
154 |
155 |
156 | - Select Command: Retrieves data from the `interest_map` table.
157 |
158 | - Columns Selected:
159 |
160 |   - `COUNT(DISTINCT map.id)`: Counts the number of distinct `id` values that are not present in the `interest_metrics` table.
161 |
162 |
163 |
164 | - Join: Performs a left join between `interest_map` and `interest_metrics` using the `id` column and the `interest_id` column (cast to integer).
165 |
166 | - Where Clause: Keeps only the rows where the `interest_id` column from the `interest_metrics` table is NULL, which indicates that the `id` has no match in the metrics.
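
Both queries use the same anti-join pattern: a `LEFT JOIN` followed by an `IS NULL` test on the right-hand key keeps only the unmatched rows. A minimal sketch on hypothetical ids follows, run through Python's built-in `sqlite3` module rather than PostgreSQL, so `CAST(... AS INTEGER)` replaces the `::integer` syntax; the ids themselves are assumptions.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE interest_metrics (interest_id TEXT)")
con.execute("CREATE TABLE interest_map (id INTEGER)")
# Hypothetical ids: 3 is only in metrics, 4 and 5 are only in the map.
con.executemany("INSERT INTO interest_metrics VALUES (?)",
                [("1",), ("2",), ("2",), ("3",)])
con.executemany("INSERT INTO interest_map VALUES (?)",
                [(1,), (2,), (4,), (5,)])

# Metrics ids with no matching row in the map.
only_in_metrics = con.execute("""
    SELECT COUNT(DISTINCT im.interest_id)
    FROM interest_metrics im
    LEFT JOIN interest_map map ON CAST(im.interest_id AS INTEGER) = map.id
    WHERE map.id IS NULL
""").fetchone()[0]

# Map ids with no matching row in the metrics.
only_in_map = con.execute("""
    SELECT COUNT(DISTINCT map.id)
    FROM interest_map map
    LEFT JOIN interest_metrics im ON map.id = CAST(im.interest_id AS INTEGER)
    WHERE im.interest_id IS NULL
""").fetchone()[0]

print(only_in_metrics, only_in_map)
```

Swapping which table sits on the left of the join is all that changes between the two directions.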
167 |
168 |
169 |
170 | Summarise the id values in the fresh_segments.interest_map by its total record count in this table
171 |
172 | ```sql
173 | SELECT id, interest_name, COUNT(*) AS record_count
174 | FROM interest_map im
175 | JOIN interest_metrics ime ON im.id = ime.interest_id::integer
176 | GROUP BY id, interest_name
177 | ORDER BY record_count DESC;
178 | ```
179 | Answer:
180 |
181 |
182 | - Select Command: Retrieves data from the `interest_map` and `interest_metrics` tables.
183 |
184 | - Columns Selected:
185 |
186 |   - `id`: The unique identifier of the interest in the `interest_map` table.
187 |   - `interest_name`: The name of the interest in the `interest_map` table.
188 |   - `COUNT(*)`: Counts the matching `interest_metrics` records for each interest.
189 |   - `AS record_count`: Names the count column `record_count`.
190 |
191 |
192 |
193 | - Join: Performs an inner join between `interest_map` and `interest_metrics` using the `id` and `interest_id` columns.
194 |
195 | - Group By: Groups the result set by `id` and `interest_name`.
196 |
197 | - Order By: Orders the results in descending order based on the `record_count` column.
198 |
199 |
200 |
201 |
202 | What sort of table join should we perform for our analysis and why? Check your logic by checking the rows where interest_id = 21246 in your joined output and include all columns from fresh_segments.interest_metrics and all columns from fresh_segments.interest_map except from the id column.
203 |
204 | ```sql
205 | SELECT
206 | im.*,
207 | map.interest_name AS mapped_interest_name,
208 | map.interest_summary AS mapped_interest_summary,
209 | map.created_at AS mapped_created_at,
210 | map.last_modified AS mapped_last_modified
211 | FROM interest_metrics AS im
212 | LEFT JOIN interest_map AS map ON im.interest_id::VARCHAR = map.id::VARCHAR
213 | WHERE im.interest_id = '21246';
214 | ```
215 |
216 | Answer:
217 |
218 |
219 | - Select Command: Retrieves data from the `interest_metrics` and `interest_map` tables.
220 |
221 | - Columns Selected:
222 |
223 |   - `im.*`: Selects all columns from the `interest_metrics` table.
224 |   - `map.interest_name AS mapped_interest_name`: Retrieves the `interest_name` column from the `interest_map` table and aliases it as `mapped_interest_name`.
225 |   - `map.interest_summary AS mapped_interest_summary`: Retrieves the `interest_summary` column from the `interest_map` table and aliases it as `mapped_interest_summary`.
226 |   - `map.created_at AS mapped_created_at`: Retrieves the `created_at` column from the `interest_map` table and aliases it as `mapped_created_at`.
227 |   - `map.last_modified AS mapped_last_modified`: Retrieves the `last_modified` column from the `interest_map` table and aliases it as `mapped_last_modified`.
228 |
229 |
230 |
231 | - Join: Performs a left join between `interest_metrics` and `interest_map` using the `interest_id` and `id` columns, casting both to `VARCHAR` for comparison.
232 |
233 | - Where: Filters the result to only include rows where `interest_id` is equal to '21246'.
234 |
235 |
236 |
237 | Are there any records in your joined table where the month_year value is before the created_at value from the fresh_segments.interest_map table? Do you think these values are valid and why?
238 |
239 |
240 | ```sql
241 | SELECT
242 | im.*,
243 | im.month_year AS interest_metrics_month_year,
244 | map.created_at AS interest_map_created_at
245 | FROM
246 | interest_metrics AS im
247 | LEFT JOIN
248 | interest_map AS map
249 | ON
250 | im.interest_id = map.id::varchar
251 | WHERE
252 | im.month_year < map.created_at
253 | ORDER BY
254 | im.interest_id, im.month_year;
255 | ```
256 | Answer:
257 |
258 |
259 | - Select Command: Retrieves data from the `interest_metrics` and `interest_map` tables.
260 |
261 | - Columns Selected:
262 |
263 |   - `im.*`: Selects all columns from the `interest_metrics` table.
264 |   - `im.month_year AS interest_metrics_month_year`: Aliases the `month_year` column from the `interest_metrics` table as `interest_metrics_month_year`.
265 |   - `map.created_at AS interest_map_created_at`: Retrieves the `created_at` column from the `interest_map` table and aliases it as `interest_map_created_at`.
266 |
267 |
268 |
269 | - Join: Performs a left join between `interest_metrics` and `interest_map`, comparing the `interest_id` column from `interest_metrics` with `id` cast to `VARCHAR`.
270 |
271 | - Where: Filters the result to only include rows where `month_year` from `interest_metrics` is earlier than `created_at` from `interest_map`.
272 |
273 | - Order By: Orders the results first by `interest_id` and then by `month_year` in ascending order.
274 |
275 |
276 |
277 | B. Interest Analysis
278 |
279 | Which interests have been present in all month_year dates in our dataset?
280 |
281 | ```sql
282 | SELECT DISTINCT interest_id
283 | FROM interest_metrics
284 | GROUP BY interest_id
285 | HAVING COUNT(DISTINCT month_year) = (SELECT COUNT(DISTINCT month_year) FROM interest_metrics);
286 | ```
287 | Answer:
288 |
289 |
290 | - Select Command: Retrieves distinct `interest_id` values from the `interest_metrics` table.
291 |
292 | - Distinct: Ensures that only unique `interest_id` values are selected. It is redundant here, because `GROUP BY interest_id` already returns one row per interest.
293 |
294 | - Group By: Groups the data in the `interest_metrics` table by `interest_id`.
295 |
296 | - Having Clause: Filters the grouped results to include only those groups where the count of distinct `month_year` values is equal to the total count of distinct `month_year` values in the entire `interest_metrics` table.
297 |
298 | - Subquery: The subquery `(SELECT COUNT(DISTINCT month_year) FROM interest_metrics)` calculates the total count of distinct `month_year` values in the `interest_metrics` table.
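
The `HAVING COUNT(DISTINCT ...) = (subquery)` comparison is a common way to find groups that cover every value of another column. Below is a minimal sketch on hypothetical rows, run through Python's built-in `sqlite3` module rather than PostgreSQL; the interest ids and months are assumptions.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE interest_metrics (interest_id TEXT, month_year TEXT)")
# Hypothetical rows: interest 10 appears in all three months, 20 in two, 30 in one.
con.executemany("INSERT INTO interest_metrics VALUES (?, ?)", [
    ("10", "2018-07-01"), ("10", "2018-08-01"), ("10", "2018-09-01"),
    ("20", "2018-07-01"), ("20", "2018-09-01"),
    ("30", "2018-08-01"),
])

# Keep only interests whose distinct month count equals the overall month count.
always_present = [row[0] for row in con.execute("""
    SELECT interest_id
    FROM interest_metrics
    GROUP BY interest_id
    HAVING COUNT(DISTINCT month_year) =
           (SELECT COUNT(DISTINCT month_year) FROM interest_metrics)
""")]
print(always_present)
```

Only the interest that appears in every distinct month survives the `HAVING` filter.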
299 |
300 |
301 |
302 | Using this same total_months measure - calculate the cumulative percentage of all records starting at 14 months - which total_months value passes the 90% cumulative percentage value?
303 |
304 | ```sql
305 | WITH MonthlyInterestCounts AS (
306 | SELECT interest_id, COUNT(DISTINCT month_year) AS month_count
307 | FROM interest_metrics
308 | GROUP BY interest_id
309 | ),
310 | InterestCumulativePercentage AS (
311 | SELECT
312 | month_count,
313 | COUNT(interest_id) AS interest_count,
314 | SUM(COUNT(interest_id)) OVER (ORDER BY month_count DESC) AS total_interest_count
315 | FROM MonthlyInterestCounts
316 | GROUP BY month_count
317 | )
318 | SELECT
319 | month_count,
320 | total_interest_count,
321 | ROUND(total_interest_count * 100.0 / (SELECT SUM(interest_count) FROM InterestCumulativePercentage), 2) AS cumulative_percent
322 | FROM InterestCumulativePercentage
323 | WHERE ROUND(total_interest_count * 100.0 / (SELECT SUM(interest_count) FROM InterestCumulativePercentage), 2) > 90;
324 | ```
325 | Answer:
326 |
327 |
328 | - Common Table Expressions (CTEs):
329 |
330 |   - `MonthlyInterestCounts`: Calculates the count of distinct `month_year` values for each `interest_id` from the `interest_metrics` table.
331 |   - `InterestCumulativePercentage`: Counts the interests that share each `month_count` and uses a window function to calculate the running total of those counts, from the highest `month_count` downwards.
332 |
333 |
334 | - Main Query:
335 |
336 |   - `month_count`: Represents the number of distinct `month_year` values for each `interest_id`.
337 |   - `total_interest_count`: Shows the running total of `interest_id`s with at least the current `month_count`.
338 |   - `cumulative_percent`: Expresses `total_interest_count` as a percentage of the total sum of `interest_count` from the `InterestCumulativePercentage` CTE.
339 |   - `WHERE`: Filters the results to include only rows where the calculated cumulative percentage is greater than 90.
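
The running-total logic can be traced on a handful of rows. This is a minimal sketch with hypothetical month counts for ten interests, run through Python's built-in `sqlite3` module rather than PostgreSQL; the `month_counts` table is an assumption that stands in for the `MonthlyInterestCounts` CTE, and the aggregation is done in a CTE before the window function is applied.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE month_counts (interest_id TEXT, month_count INTEGER)")
# Hypothetical month counts for ten interests.
counts = [14] * 6 + [13] * 2 + [12] + [5]
con.executemany("INSERT INTO month_counts VALUES (?, ?)",
                [(str(i), c) for i, c in enumerate(counts)])

rows = con.execute("""
    WITH per_count AS (
        SELECT month_count, COUNT(*) AS interest_count
        FROM month_counts
        GROUP BY month_count
    )
    SELECT month_count,
           ROUND(100.0 * SUM(interest_count) OVER (ORDER BY month_count DESC)
                 / SUM(interest_count) OVER (), 2) AS cumulative_percent
    FROM per_count
    ORDER BY month_count DESC
""").fetchall()
print(rows)
```

With a strict `> 90` filter, a row that sits at exactly 90 percent is excluded, so the first `month_count` reported is the one that takes the running total past 90.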
340 |
341 |
342 |
343 |
344 | If we were to remove all interest_id values which are lower than the total_months value we found in the previous question - how many total data points would we be removing?
345 |
346 | ```sql
347 | WITH InterestMonthsCounts AS (
348 | SELECT interest_id, COUNT(DISTINCT month_year) AS month_count
349 | FROM interest_metrics
350 | GROUP BY interest_id
351 | ),
352 | TotalMonths AS (
353 | SELECT MAX(month_count) AS total_months
354 | FROM InterestMonthsCounts
355 | )
356 | SELECT SUM(imc.month_count) AS total_data_points_to_keep,
357 | (SELECT COUNT(*) FROM interest_metrics) - SUM(imc.month_count) AS total_data_points_removed
358 | FROM InterestMonthsCounts imc
359 | JOIN TotalMonths tm ON imc.month_count >= tm.total_months;
360 | ```
361 | Answer:
362 |
363 |
364 | - Common Table Expressions (CTEs):
365 |
366 |   - `InterestMonthsCounts`: Calculates the count of distinct `month_year` values for each `interest_id` from the `interest_metrics` table using `COUNT` with `GROUP BY`.
367 |   - `TotalMonths`: Takes the maximum `month_count` from the `InterestMonthsCounts` CTE as `total_months`. With this choice, only interests that appear in every month pass the join condition below.
368 |
369 |
370 | - Main Query:
371 |
372 |   - `total_data_points_to_keep`: Calculates the total count of data points to keep, which is the sum of `month_count` values for the interests in `InterestMonthsCounts` that pass the join condition.
373 |   - `total_data_points_removed`: Computes the total count of data points to be removed, which is the difference between the total row count of the `interest_metrics` table and the sum of `month_count` values that are kept.
374 |   - `JOIN`: Joins the `InterestMonthsCounts` CTE with the `TotalMonths` CTE on the condition that `month_count` in `InterestMonthsCounts` is greater than or equal to `total_months` in `TotalMonths`.
375 |
376 |
377 |
378 |
379 | C. Segment Analysis
380 |
381 | Using our filtered dataset by removing the interests with less than 6 months worth of data, which are the top 10 and bottom 10 interests which have the largest composition values in any month_year? Only use the maximum composition value for each interest but you must keep the corresponding month_year
382 |
383 | ```sql
384 | WITH cte AS (
385 | SELECT interest_id, COUNT(DISTINCT month_year) AS month_count
386 | FROM interest_metrics
387 | GROUP BY interest_id
388 | HAVING COUNT(DISTINCT month_year) >= 6
389 | )
390 |
391 | SELECT *
392 | INTO filtered_table
393 | FROM interest_metrics
394 | WHERE interest_id IN (SELECT interest_id FROM cte);
395 |
396 | -- Top 10 interests with largest composition values
397 | SELECT
398 | f.month_year,
399 | f.interest_id,
400 | im.interest_name,
401 | MAX(f.composition) AS max_composition
402 | FROM filtered_table f
403 | JOIN interest_map im ON f.interest_id = im.id::varchar
404 | GROUP BY f.month_year, f.interest_id, im.interest_name
405 | ORDER BY max_composition DESC
406 | LIMIT 10;
407 |
408 | -- Bottom 10 interests with smallest composition values
409 | SELECT
410 | f.month_year,
411 | f.interest_id,
412 | im.interest_name,
413 | MAX(f.composition) AS max_composition
414 | FROM filtered_table f
415 | JOIN interest_map im ON f.interest_id = im.id::varchar
416 | GROUP BY f.month_year, f.interest_id, im.interest_name
417 | ORDER BY max_composition ASC
418 | LIMIT 10;
419 | ```
420 | Answer:
421 |
422 |
423 | - Common Table Expression (CTE):
424 |
425 |   - Creates a CTE named `cte` that counts the number of distinct `month_year` values for each `interest_id` in `interest_metrics`.
426 |   - Groups the results by `interest_id`.
427 |   - Filters the results to only include rows with a `month_count` of 6 or more.
428 |
429 |
430 |
431 |
432 | - Query 1:
433 |
434 |   - Selects all columns from `interest_metrics` into a table named `filtered_table`.
435 |   - Filters the rows to include only those with `interest_id` values present in the `interest_id` column of the `cte`.
436 |
437 |
438 |
439 |
440 | - Query 2:
441 |
442 |   - Selects `month_year` and `interest_id`, calculates the maximum `composition`, and retrieves `interest_name`.
443 |   - Joins `filtered_table` with `interest_map` using `interest_id` and `id`.
444 |   - Groups the results by `month_year`, `interest_id`, and `interest_name`.
445 |   - Calculates the maximum `composition` for each group. Each group is one interest in one month, so the same interest can appear more than once in the output.
446 |   - Orders the result by `max_composition` in descending order.
447 |   - Limits the output to the top 10 records using `LIMIT 10`.
448 |
449 |
450 |
451 |
452 | - Query 3:
453 |
454 |   - Selects `month_year` and `interest_id`, calculates the maximum `composition`, and retrieves `interest_name`.
455 |   - Joins `filtered_table` with `interest_map` using `interest_id` and `id`.
456 |   - Groups the results by `month_year`, `interest_id`, and `interest_name`.
457 |   - Calculates the maximum `composition` for each group. Each group is one interest in one month, so the same interest can appear more than once in the output.
458 |   - Orders the result by `max_composition` in ascending order.
459 |   - Limits the output to the bottom 10 records using `LIMIT 10`.
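
Because the queries above group by `month_year` as well as `interest_id`, they return one maximum per interest and month. If a single peak per interest is wanted, together with the month in which it occurred, one option is to number each interest's rows by `composition` and keep the first. The sketch below is a minimal illustration on hypothetical rows, run through Python's built-in `sqlite3` module rather than PostgreSQL; it is one possible approach, not the query used in this solution.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE filtered_table "
    "(interest_id TEXT, month_year TEXT, composition REAL)"
)
# Hypothetical rows: each interest has one peak month.
con.executemany("INSERT INTO filtered_table VALUES (?, ?, ?)", [
    ("10", "2018-07-01", 5.0), ("10", "2018-08-01", 9.5),
    ("20", "2018-07-01", 7.25), ("20", "2018-08-01", 3.0),
])

# Number each interest's rows by composition and keep the highest one.
peaks = con.execute("""
    SELECT interest_id, month_year, composition
    FROM (
        SELECT interest_id, month_year, composition,
               ROW_NUMBER() OVER (
                   PARTITION BY interest_id
                   ORDER BY composition DESC) AS rn
        FROM filtered_table
    )
    WHERE rn = 1
    ORDER BY composition DESC
""").fetchall()
print(peaks)
```

Adding `LIMIT 10` to the outer query, with `DESC` or `ASC` ordering, would then give the top or bottom ten interests by their peak composition.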
460 |
461 |
462 |
463 |
464 | Which 5 interests had the lowest average ranking value?
465 |
466 | ```sql
467 | SELECT
468 | f.interest_id,
469 | im.interest_name,
470 | AVG(f.ranking) AS avg_ranking
471 | FROM filtered_table f
472 | JOIN interest_map im ON f.interest_id = im.id::varchar
473 | GROUP BY f.interest_id, im.interest_name
474 | ORDER BY avg_ranking ASC
475 | LIMIT 5;
476 | ```
477 | Answer:
478 |
479 |
480 |
481 | - Query:
482 |
483 |   - Selects `interest_id`, calculates the average of `ranking`, and retrieves `interest_name`.
484 |   - Joins `filtered_table` with `interest_map` using `interest_id` and `id`.
485 |   - Groups the results by `interest_id` and `interest_name`.
486 |   - Calculates the average of `ranking` for each group.
487 |   - Orders the result by `avg_ranking` in ascending order.
488 |   - Limits the output to the first 5 records, which have the lowest averages, using `LIMIT 5`.
489 |
490 |
491 |
492 |
493 | Which 5 interests had the largest standard deviation in their percentile_ranking value?
494 |
495 | ```sql
496 | SELECT
497 | f.interest_id,
498 | im.interest_name,
499 | STDDEV(f.percentile_ranking) AS std_dev_percentile_ranking
500 | FROM filtered_table f
501 | JOIN interest_map im ON f.interest_id = im.id::varchar
502 | GROUP BY f.interest_id, im.interest_name
503 | ORDER BY std_dev_percentile_ranking DESC
504 | LIMIT 5;
505 | ```
506 | Answer:
507 |
508 |
509 | - Query:
510 |
511 | - Selects `interest_id` from `filtered_table`, calculates the standard deviation of `percentile_ranking`, and retrieves `interest_name` from `interest_map`.
512 | - Joins `filtered_table` with `interest_map` by matching `interest_id` to `id` cast to `varchar`.
513 | - Groups the results by `interest_id` and `interest_name`.
514 | - Calculates the standard deviation of `percentile_ranking` for each group.
515 | - Orders the result by `std_dev_percentile_ranking` in descending order.
516 | - Limits the output to the top 5 records using `LIMIT 5`.
517 |
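In PostgreSQL, `STDDEV` is an alias for `STDDEV_SAMP` (the sample standard deviation), which is `NULL` for a group with a single row, and `ORDER BY ... DESC` places `NULL`s first by default. Whether that matters here depends on how `filtered_table` was built. The sketch below uses a few made-up rows (the `sample` data is hypothetical, not taken from the case study) to show the effect and how `NULLS LAST` keeps such groups from crowding the top of the result.

```sql
-- Hypothetical toy rows standing in for filtered_table
WITH sample (interest_id, percentile_ranking) AS (
  VALUES
    ('1', 10.0),
    ('1', 30.0),
    ('2', 50.0)   -- single-row group: sample standard deviation is NULL
)
SELECT
  interest_id,
  STDDEV(percentile_ranking)     AS std_dev_sample,      -- same as STDDEV_SAMP
  STDDEV_POP(percentile_ranking) AS std_dev_population
FROM sample
GROUP BY interest_id
-- NULLS LAST pushes groups without a defined deviation to the end
ORDER BY std_dev_sample DESC NULLS LAST;
```

With these rows, interest `1` has a sample deviation of about 14.14 and a population deviation of 10, while interest `2` has `NULL` and 0 respectively.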
518 |
519 |
520 |
521 | D. Index Analysis
522 |
523 | What are the top 10 interests by average composition for each month?
524 |
525 | ```sql
526 | SELECT _month, _year, month_year, interest_id, average_composition
527 | FROM (
528 |   SELECT
529 |     _month, _year, month_year, interest_id,
530 |     ROUND(SUM(composition)::numeric / SUM(index_value)::numeric, 2) AS average_composition,
531 |     RANK() OVER (PARTITION BY month_year ORDER BY SUM(composition)::numeric / SUM(index_value)::numeric DESC) AS ranks
532 |   FROM interest_metrics
533 |   GROUP BY _month, _year, month_year, interest_id
534 | ) ranked
535 | WHERE ranks <= 10 ORDER BY _year, _month, average_composition DESC;
536 | ```
537 | Answer:
538 |
539 |
540 | - Query:
541 |
542 | - The inner query selects `_month`, `_year`, `month_year`, and `interest_id`, and calculates the average composition as `SUM(composition) / SUM(index_value)`, rounded to two decimal places.
543 | - Groups the results by `_month`, `_year`, `month_year`, and `interest_id`, then ranks each interest within its `month_year` partition by average composition in descending order using `RANK()`.
544 | - The outer query keeps only rows with `ranks` up to 10, so each month contributes its own top 10; a single `LIMIT 10` would return only 10 rows for the whole result.
545 | - Orders the result by `_year` and `_month` in ascending order, then by `average_composition` in descending order.
546 |
547 |
548 |
549 |
550 | For all of these top 10 interests - which interest appears the most often?
551 |
552 | ```sql
553 | WITH ranks_tab AS (
554 | SELECT
555 | i.interest_id,
556 | interest_name,
557 | month_year,
558 | AVG(composition / index_value) AS avg_composition,
559 | RANK() OVER (PARTITION BY month_year ORDER BY AVG(composition / index_value) DESC) AS ranks
560 | FROM interest_metrics i
561 | JOIN interest_map m ON i.interest_id = m.id::varchar
562 | GROUP BY i.interest_id, interest_name, month_year
563 | )
564 | SELECT
565 | interest_id,
566 | interest_name,
567 | COUNT(*) OVER (PARTITION BY interest_name) AS counts
568 | FROM ranks_tab
569 | WHERE ranks <= 10
570 | ORDER BY 3 DESC;
571 | ```
572 | Answer:
573 |
574 |
575 | - Query:
576 |
577 | - Uses a common table expression (CTE) named `ranks_tab` to calculate the average composition for each `interest_id` and `month_year`.
578 | - Assigns a rank to each record within the same `month_year` partition based on the descending order of `avg_composition`.
579 | - Joins the `interest_metrics` table with the `interest_map` table using `interest_id` to retrieve the `interest_name`.
580 | - Groups the results by `interest_id`, `interest_name`, and `month_year`.
581 |
582 | - Final Query:
583 |
584 | - Selects `interest_id` and `interest_name`, and calculates the count of records with the same `interest_name` using the window function `COUNT(*)` over a partition.
585 | - Filters the results to only include records with `ranks` up to 10.
586 | - Orders the result by the third column (counts) in descending order.
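
Because the window `COUNT(*)` keeps one row per top-10 appearance, the most frequent interest is listed once for every month in which it appears. A minimal sketch, assuming only the most frequent interest (including ties) is wanted, collapses those rows with `GROUP BY` instead. The `top_10` rows are made-up stand-ins for the `ranks_tab` rows with `ranks <= 10`, not data from the case study.

```sql
-- Hypothetical stand-in for the rows of ranks_tab where ranks <= 10
WITH top_10 (interest_id, interest_name, month_year) AS (
  VALUES
    ('1', 'Interest A', DATE '2018-07-01'),
    ('1', 'Interest A', DATE '2018-08-01'),
    ('2', 'Interest B', DATE '2018-07-01')
),
counts AS (
  -- one row per interest with the number of months it made the top 10
  SELECT interest_id, interest_name, COUNT(*) AS appearances
  FROM top_10
  GROUP BY interest_id, interest_name
)
SELECT interest_id, interest_name, appearances
FROM counts
-- keep every interest tied for the highest count
WHERE appearances = (SELECT MAX(appearances) FROM counts);
```

On the toy rows this returns a single row, `Interest A` with 2 appearances. Comparing against `MAX(appearances)` rather than using `LIMIT 1` keeps ties visible.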
587 |
588 |
589 |
590 |
591 | What is the average of the average composition for the top 10 interests for each month?
592 |
593 | ```sql
594 | WITH ranks_tab AS (
595 | SELECT
596 | i.interest_id,
597 | interest_name,
598 | month_year,
599 | AVG(composition / index_value) AS avg_composition,
600 | RANK() OVER (PARTITION BY month_year ORDER BY AVG(composition / index_value) DESC) AS ranks
601 | FROM interest_metrics i
602 | JOIN interest_map m ON i.interest_id = m.id::varchar
603 | GROUP BY i.interest_id, interest_name, month_year
604 | )
605 | SELECT
606 | month_year,
607 | AVG(avg_composition) AS average_avg_composition
608 | FROM ranks_tab
609 | WHERE ranks <= 10
610 | GROUP BY month_year
611 | ORDER BY month_year;
612 | ```
613 |
614 | Answer:
615 |
616 |
617 | - Query:
618 |
619 | - Uses a common table expression (CTE) named `ranks_tab` to calculate the average composition for each `interest_id` and `month_year`.
620 | - Assigns a rank to each record within the same `month_year` partition based on the descending order of `avg_composition`.
621 | - Joins the `interest_metrics` table with the `interest_map` table using `interest_id` to retrieve the `interest_name`.
622 | - Groups the results by `interest_id`, `interest_name`, and `month_year`.
623 |
624 | - Final Query:
625 |
626 | - Selects `month_year` and calculates the average of `avg_composition` using the `AVG()` function.
627 | - Filters the results to only include records with `ranks` up to 10.
628 | - Groups the results by `month_year`.
629 | - Orders the result by `month_year` in ascending order.
630 |
631 |
632 |
633 |
634 |
635 |
636 |
637 |
638 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # 8 Week SQL challenge🔍
2 | This repository serves as a comprehensive solution for the #8WeekSQLChallenge, encompassing eight distinct case studies. The repository is a testament to my prowess in SQL, as it demonstrates my adeptness in solving diverse SQL challenges and showcases my expertise in query composition and problem-solving. These case studies cover a wide array of topics, including ranking, common table expressions (CTEs), datetime functions, text manipulation, aggregation, NULL handling, and more.
3 |
4 | This repository reflects the culmination of my journey through various SQL challenges. Each case study presents a unique problem-solving scenario that required me to apply a range of SQL techniques such as CTEs, case statements, date functions, and complex aggregations. As a result, I have honed my skills in SQL query crafting, data manipulation, and analysis.
5 |
6 | Through this repository, I proudly present my ability to tackle intricate SQL challenges and exemplify my aptitude for effectively addressing real-world data scenarios. It serves as a testament to my growth, proficiency, and innovation in the realm of SQL.
7 | The #8WeekSQLChallenge was created by Danny Ma.
8 |
9 | Feel free to reach out for any questions or suggestions about this project. I'm open to discussions and eager to assist.
10 |
11 | LinkedIn | Mariya Joseph
12 |
13 | Don't forget to follow and star ⭐ the repository if you find it valuable.
14 | Tools Used🛠️ : PostgreSQL
15 |
16 | CONTENTS📝
17 |
58 |
--------------------------------------------------------------------------------