├── Case Study #1-Danny's Dinner
    └── README.md
├── Case Study #2 - Pizza Runner
    └── README.md
├── Case Study #3 - Foodie-Fi
    └── README.md
├── Case Study #4 - Data Bank
    └── README.md
├── Case Study #5 - Data Mart
    └── README.md
├── Case Study #6 - Clique Bait
    └── README.md
├── Case Study #7 - Balanced Tree Clothing Co
    └── README.md
├── Case Study #8 - Fresh Segments
    └── README.md
└── README.md


/Case Study #1-Danny's Dinner/README.md:
--------------------------------------------------------------------------------

Case Study #1 - Danny's Diner👨🏻‍🍳


Introduction
Problem Statement
Entity Relationship Diagram
Case Study Questions & Solutions
Bonus Questions & Solutions
Key Insights


Introduction


In early 2021, Danny follows his passion for Japanese food and opens "Danny's Diner," a charming restaurant offering sushi, curry, and ramen. However, lacking data analysis expertise, the restaurant struggles to leverage the basic data collected during its initial months to make informed business decisions. Danny's Diner seeks assistance in using this data effectively to keep the restaurant thriving.


Problem Statement


Danny aims to utilize customer data to gain valuable insights into their visiting patterns, spending habits, and favorite menu items. By establishing a deeper connection with his customers, he can provide a more personalized experience for his loyal patrons.

He plans to use these insights to make informed decisions about expanding the existing customer loyalty program. Additionally, Danny seeks assistance in generating basic datasets for his team to inspect the data conveniently, without requiring SQL expertise.

Due to privacy concerns, he has shared a sample of his overall customer data, hoping it will be sufficient for you to create fully functional SQL queries to address his questions.

The case study revolves around three key datasets:

- Sales
- Menu
- Members


Entity Relationship Diagram


Case Study Questions & Solutions


What is the total amount each customer spent at the restaurant?

Answer:

The SQL query retrieves the customer_id and calculates the total amount spent (total_amnt) by each customer at the restaurant.
It combines data from the sales and menu tables based on matching product_id.
The results are grouped by customer_id.
The query then calculates the total sum of price for each group of sales records with the same customer_id.
Finally, the results are sorted in ascending order based on the customer_id.
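
The steps above can be sketched as follows. This is a minimal, hedged example rather than the original query: it runs the SQL through Python's built-in sqlite3 module so it can be tried without a database server, and the column types and sample rows are assumptions made for illustration. The table names, column names, and the total_amnt alias follow the description above.

```python
import sqlite3


def total_spent_per_customer():
    con = sqlite3.connect(":memory:")
    # Illustrative rows only (assumed for this sketch, not the case-study data).
    con.executescript("""
        CREATE TABLE sales (customer_id TEXT, order_date TEXT, product_id INTEGER);
        CREATE TABLE menu (product_id INTEGER, product_name TEXT, price INTEGER);
        INSERT INTO menu VALUES (1, 'sushi', 10), (2, 'curry', 15), (3, 'ramen', 12);
        INSERT INTO sales VALUES
            ('A', '2021-01-01', 1), ('A', '2021-01-01', 2), ('A', '2021-01-07', 3),
            ('B', '2021-01-01', 2), ('B', '2021-01-02', 2), ('B', '2021-01-04', 1);
    """)
    query = """
        SELECT s.customer_id, SUM(m.price) AS total_amnt
        FROM sales AS s
        JOIN menu AS m ON s.product_id = m.product_id  -- attach a price to each sale
        GROUP BY s.customer_id                         -- one row per customer
        ORDER BY s.customer_id
    """
    rows = con.execute(query).fetchall()
    con.close()
    return rows


print(total_spent_per_customer())  # [('A', 37), ('B', 40)]
```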

How many days has each customer visited the restaurant?

Answer:

The SQL query selects the customer_id and counts the number of distinct order dates (No_Days) for each customer.
It retrieves data from the sales table.
The results are grouped by customer_id.
The COUNT(DISTINCT order_date) function calculates the number of unique order dates for each customer.
Finally, the query presents the total number of unique order dates as No_Days for each customer.

What was the first item from the menu purchased by each customer?

Answer:

The SQL query uses a Common Table Expression (CTE) named CTE to generate a temporary result set.
Within the CTE, it selects the customer_id, assigns a dense rank to each row based on the order date for each customer, and retrieves the corresponding product_name from the menu table.
The sales table is joined with the menu table on matching product_id.
The DENSE_RANK() function assigns a rank to each row within the partition of each customer_id based on the order_date in ascending order.
Each customer_id has its own partition and separate ranks based on the order dates of their purchases.
Next, the main query selects the customer_id and corresponding product_name from the CTE.
It filters the results and only includes rows where the rank rn is equal to 1, which means the earliest purchase for each customer_id.
As a result, the query returns the first purchased product for each customer.
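
A minimal sketch of the DENSE_RANK() pattern described above, run through Python's sqlite3 module (window functions need SQLite 3.25 or newer). The sample rows and column types are assumptions; the CTE name, the rn alias, and the table and column names follow the description. Because DENSE_RANK() gives tied rows the same rank, a customer who ordered two items on their first day gets two result rows.

```python
import sqlite3


def first_items_per_customer():
    con = sqlite3.connect(":memory:")
    # Illustrative rows only: customer A orders two items on the first day.
    con.executescript("""
        CREATE TABLE sales (customer_id TEXT, order_date TEXT, product_id INTEGER);
        CREATE TABLE menu (product_id INTEGER, product_name TEXT, price INTEGER);
        INSERT INTO menu VALUES (1, 'sushi', 10), (2, 'curry', 15), (3, 'ramen', 12);
        INSERT INTO sales VALUES
            ('A', '2021-01-01', 1), ('A', '2021-01-01', 2), ('A', '2021-01-07', 2),
            ('B', '2021-01-01', 2), ('B', '2021-01-02', 3);
    """)
    query = """
        WITH cte AS (
            SELECT s.customer_id,
                   m.product_name,
                   DENSE_RANK() OVER (
                       PARTITION BY s.customer_id   -- ranks restart per customer
                       ORDER BY s.order_date        -- earliest date gets rank 1
                   ) AS rn
            FROM sales AS s
            JOIN menu AS m ON s.product_id = m.product_id
        )
        SELECT customer_id, product_name
        FROM cte
        WHERE rn = 1
        ORDER BY customer_id, product_name
    """
    rows = con.execute(query).fetchall()
    con.close()
    return rows


print(first_items_per_customer())
# [('A', 'curry'), ('A', 'sushi'), ('B', 'curry')]
```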

What is the most purchased item on the menu and how many times was it purchased by all customers?

Answer:

The SQL query selects the product_name from the menu table and counts the number of times each product was ordered (most_ordered).
It retrieves data from the Sales table and joins it with the menu table based on matching product_id.
The results are grouped by product_name.
The COUNT(S.product_id) function calculates the number of occurrences of each product_id in the Sales table.
The query then presents the product_name and its corresponding count as most_ordered for each product.
Next, the results are sorted in descending order based on the most_ordered column, so the most ordered product appears first.
The LIMIT 1 clause is used to restrict the result to only one row, effectively showing the most ordered product.

Which item was the most popular for each customer?

Answer:

The SQL query uses a Common Table Expression (CTE) named CTE to generate a temporary result set.
Within the CTE, it selects the customer_id, product_name, and counts the number of times each product was ordered (order_count) for each customer.
It retrieves data from the sales table and joins it with the menu table based on matching product_id.
The results are grouped by customer_id and product_name to get the count of orders for each product of each customer.
The COUNT(M.product_id) function counts the joined rows in each customer_id and product_name group, that is, how many times that customer ordered that product.
The DENSE_RANK() function assigns a rank to each row within the partition of each customer_id based on the order count of products in descending order.
Each customer_id has its own partition and separate ranks based on the number of product orders.
Next, the main query selects the customer_id, product_name, and order_count from the CTE.
It filters the results and only includes rows where the rank rnk is equal to 1, which means the most ordered product for each customer_id.
As a result, the query returns the customer's ID, the most ordered product, and the number of times it was ordered by that customer.

Which item was purchased first by the customer after they became a member?

Answer:

The SQL query retrieves distinct rows for each unique customer_id with their corresponding product_name from the sales and menu tables.
It filters the data based on the condition that the order_date in the sales table is greater than the join_date of the customer in the members table.
The sales table is aliased as s, the members table is aliased as mbr, and the menu table is aliased as m.
The query performs inner joins between sales and members tables on matching customer_id and between sales and menu tables on matching product_id.
Only rows that meet the join condition and the order_date > join_date condition are considered in the result set.
The query selects the customer_id and corresponding product_name for each customer who has placed an order after their join_date.
The results are sorted in ascending order based on the customer_id.
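The DISTINCT ON (s.customer_id) clause keeps only the first row for each customer_id, where "first" is determined by the ORDER BY clause. For that row to be the earliest purchase after joining, the ordering must continue with s.order_date after s.customer_id; with customer_id alone, the row kept for each customer is not guaranteed to be the earliest one.
As a result, the query returns one row per customer_id with a product_name they ordered after joining as a member. Because the date comparison is strict, purchases made on the join date itself are excluded.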

Which item was purchased just before the customer became a member?

Answer:

The SQL query retrieves distinct rows for each unique customer_id with their corresponding product_name from the sales and menu tables.
It filters the data based on the condition that the order_date in the sales table is less than the join_date of the customer in the members table.
The sales table is aliased as s, the members table is aliased as mbr, and the menu table is aliased as m.
The query performs inner joins between sales and members tables on matching customer_id and between sales and menu tables on matching product_id.
Only rows that meet the join condition and the order_date < join_date condition are considered in the result set.
The query selects the customer_id and corresponding product_name for each customer who has placed an order before their join_date.
The results are sorted in ascending order based on the customer_id.
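The DISTINCT ON (s.customer_id) clause keeps only the first row for each customer_id according to the ORDER BY clause. Because the question asks for the purchase made just before joining, the ordering needs to continue with s.order_date in descending order so that the most recent pre-membership order comes first.
As a result, the query returns one row per customer_id with a product_name they ordered before joining as a member; it is the last such purchase only when rows are ordered by order_date descending within each customer.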
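
The two membership questions above both pick one row per customer relative to the join date. DISTINCT ON is specific to PostgreSQL, so this hedged sketch swaps in ROW_NUMBER() as a portable stand-in and runs it through Python's sqlite3 module. The helper name purchase_around_join, the sample rows, and the column types are assumptions for illustration; the table aliases s, mbr, and m follow the description.

```python
import sqlite3


def purchase_around_join(comparison, direction):
    """Return one product per member: comparison is '>' or '<', direction 'ASC' or 'DESC'."""
    assert comparison in (">", "<") and direction in ("ASC", "DESC")
    con = sqlite3.connect(":memory:")
    # Illustrative rows only (assumed for this sketch).
    con.executescript("""
        CREATE TABLE sales (customer_id TEXT, order_date TEXT, product_id INTEGER);
        CREATE TABLE menu (product_id INTEGER, product_name TEXT, price INTEGER);
        CREATE TABLE members (customer_id TEXT, join_date TEXT);
        INSERT INTO menu VALUES (1, 'sushi', 10), (2, 'curry', 15), (3, 'ramen', 12);
        INSERT INTO members VALUES ('A', '2021-01-07'), ('B', '2021-01-09');
        INSERT INTO sales VALUES
            ('A', '2021-01-01', 1), ('A', '2021-01-05', 2),
            ('A', '2021-01-10', 3), ('A', '2021-01-11', 1),
            ('B', '2021-01-04', 1), ('B', '2021-01-11', 2);
    """)
    query = f"""
        WITH ranked AS (
            SELECT s.customer_id,
                   m.product_name,
                   ROW_NUMBER() OVER (
                       PARTITION BY s.customer_id
                       ORDER BY s.order_date {direction}
                   ) AS rn
            FROM sales AS s
            JOIN members AS mbr ON s.customer_id = mbr.customer_id
            JOIN menu AS m ON s.product_id = m.product_id
            WHERE s.order_date {comparison} mbr.join_date
        )
        SELECT customer_id, product_name FROM ranked WHERE rn = 1
        ORDER BY customer_id
    """
    rows = con.execute(query).fetchall()
    con.close()
    return rows


# First purchase after joining: earliest order_date after join_date.
print(purchase_around_join(">", "ASC"))   # [('A', 'ramen'), ('B', 'curry')]
# Purchase just before joining: latest order_date before join_date.
print(purchase_around_join("<", "DESC"))  # [('A', 'curry'), ('B', 'sushi')]
```

Ordering by order_date inside the window makes the "first" and "last" choice explicit instead of relying on whichever row the engine returns first for a customer.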

What is the total items and amount spent for each member before they became a member?

Answer:

The SQL query retrieves the customer_id along with the total count of items ordered (total_item) and the total amount spent (total_amont) by each customer.
It retrieves data from the sales table and joins it with the menu table based on matching product_id.
It also joins the sales table with the members table based on matching customer_id.
The results are filtered based on the condition that the order_date in the sales table is less than the join_date of the customer in the members table.
The COUNT(S.product_id) function counts the sales rows in each customer_id group, giving the total number of items ordered by each customer before joining.
The SUM(M.price) function calculates the sum of the price from the menu table, providing the total amount spent by each customer.
Results are grouped by customer_id to get the totals for each customer.
The query then presents the customer_id, total_item, and total_amont for each customer who placed orders before joining as a member.
Finally, the results are sorted in ascending order based on the customer_id.


If each $1 spent equates to 10 points and sushi has a 2x points multiplier - how many points would each customer have?

Answer:


The SQL query retrieves the customer_id and calculates the total points (total_points) earned by each customer based on their purchases from the sales and menu tables.
It retrieves data from the sales table and joins it with the menu table based on matching product_id.
The query uses a CASE statement to differentiate between 'sushi' and other products.
If the product name is 'sushi', the price is multiplied by 2 to give double points.
Otherwise, the regular price is considered.
The SUM function calculates the total points for each customer by adding up the points earned from their purchases.
The summed value is then multiplied by 10, because each $1 spent earns 10 points.
Results are grouped by customer_id to get the total points for each customer.
The query then presents the customer_id and the scaled total_points for each customer based on their purchases.
Finally, the results are sorted in ascending order based on the customer_id.
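
The points rule can be checked by hand: a $10 sushi earns 10 × 2 × 10 = 200 points, while a $15 curry earns 15 × 10 = 150. A minimal sketch of the CASE-based query, run through Python's sqlite3 module, is shown below; the sample rows and column types are assumptions made for illustration.

```python
import sqlite3


def points_per_customer():
    con = sqlite3.connect(":memory:")
    # Illustrative rows only (assumed for this sketch).
    con.executescript("""
        CREATE TABLE sales (customer_id TEXT, order_date TEXT, product_id INTEGER);
        CREATE TABLE menu (product_id INTEGER, product_name TEXT, price INTEGER);
        INSERT INTO menu VALUES (1, 'sushi', 10), (2, 'curry', 15), (3, 'ramen', 12);
        INSERT INTO sales VALUES
            ('A', '2021-01-01', 1), ('A', '2021-01-01', 2), ('A', '2021-01-07', 3),
            ('B', '2021-01-01', 2), ('B', '2021-01-02', 2), ('B', '2021-01-04', 1);
    """)
    query = """
        SELECT s.customer_id,
               SUM(CASE WHEN m.product_name = 'sushi' THEN m.price * 2  -- 2x multiplier
                        ELSE m.price
                   END) * 10 AS total_points                           -- $1 = 10 points
        FROM sales AS s
        JOIN menu AS m ON s.product_id = m.product_id
        GROUP BY s.customer_id
        ORDER BY s.customer_id
    """
    rows = con.execute(query).fetchall()
    con.close()
    return rows


# A: (10*2 + 15 + 12) * 10 = 470; B: (15 + 15 + 10*2) * 10 = 500
print(points_per_customer())  # [('A', 470), ('B', 500)]
```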


In the first week after a customer joins the program (including their join date) they earn 2x points on all items, not just sushi - how many points do customer A and B have at the end of January?

Answer:


The SQL query starts by creating a Common Table Expression (CTE) named dates_cte.
Within the CTE, it selects customer_id, join_date, join_date + INTERVAL '6 days' as valid_date, and the last day of the month for the date '2021-01-31' as last_date.
The CTE is used to generate date ranges for each customer, from their join_date to 6 days later, and the last day of the month for January 2021.
Next, the main query selects the customer_id and calculates the total points (points) earned by each customer based on their purchases from the sales and menu tables.
It retrieves data from the sales table, joins it with the dates_cte CTE, and joins the menu table to obtain each product's name and price.
The query uses a CASE statement to differentiate between 'sushi' purchases and other products.
If the product name is 'sushi' or the order date falls within the range of join_date to valid_date, the points are calculated as 2 times 10 times the price of the product.
Otherwise, for other products, the points are calculated as 10 times the price of the product.
The SUM function calculates the total points for each customer by adding up the points earned from their purchases.
Results are grouped by customer_id to get the total points for each customer.
The query then presents the customer_id and the calculated points for each customer based on their purchases.
Finally, the results are sorted in ascending order based on the customer_id.
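
A hedged sketch of the first-week bonus logic, run through Python's sqlite3 module. SQLite has no INTERVAL type, so date(join_date, '+6 days') stands in for join_date + INTERVAL '6 days'. The join on customer_id, the WHERE filter on last_date, the sample rows, and the column types are assumptions; the CTE name and its column aliases follow the description above.

```python
import sqlite3


def january_points():
    con = sqlite3.connect(":memory:")
    # Illustrative rows only (assumed for this sketch).
    con.executescript("""
        CREATE TABLE sales (customer_id TEXT, order_date TEXT, product_id INTEGER);
        CREATE TABLE menu (product_id INTEGER, product_name TEXT, price INTEGER);
        CREATE TABLE members (customer_id TEXT, join_date TEXT);
        INSERT INTO menu VALUES (1, 'sushi', 10), (2, 'curry', 15), (3, 'ramen', 12);
        INSERT INTO members VALUES ('A', '2021-01-07'), ('B', '2021-01-09');
        INSERT INTO sales VALUES
            ('A', '2021-01-01', 2),  -- before joining: 15 * 10 = 150
            ('A', '2021-01-07', 2),  -- join date, first week: 15 * 20 = 300
            ('A', '2021-01-13', 3),  -- last day of first week: 12 * 20 = 240
            ('A', '2021-01-14', 3),  -- after first week: 12 * 10 = 120
            ('B', '2021-01-04', 1),  -- sushi always doubles: 10 * 20 = 200
            ('B', '2021-01-16', 3),  -- after first week: 12 * 10 = 120
            ('B', '2021-02-01', 3);  -- February, excluded
    """)
    query = """
        WITH dates_cte AS (
            SELECT customer_id,
                   join_date,
                   date(join_date, '+6 days') AS valid_date,
                   '2021-01-31' AS last_date
            FROM members
        )
        SELECT s.customer_id,
               SUM(CASE WHEN m.product_name = 'sushi'
                          OR s.order_date BETWEEN d.join_date AND d.valid_date
                        THEN 2 * 10 * m.price
                        ELSE 10 * m.price
                   END) AS points
        FROM sales AS s
        JOIN dates_cte AS d ON s.customer_id = d.customer_id
        JOIN menu AS m ON s.product_id = m.product_id
        WHERE s.order_date <= d.last_date
        GROUP BY s.customer_id
        ORDER BY s.customer_id
    """
    rows = con.execute(query).fetchall()
    con.close()
    return rows


print(january_points())  # [('A', 810), ('B', 320)]
```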


Bonus Questions & Solutions


Join All The Things

Answer:


The SQL query starts by creating a Common Table Expression (CTE) named customer_member_status.
Within the CTE, it selects customer_id, order_date, product_name, price, and uses a CASE statement to determine whether the customer is a member ('Y') or not ('N') based on their join date in the members table.
The sales table is aliased as s, the menu table as m, and the members table as mbr.
The query performs an inner join between the sales and menu tables on matching product_id and a left join between the sales and members tables on matching customer_id, so customers without a membership record are kept.
For each row, the CASE statement checks if the join_date from the members table is less than or equal to the order_date from the sales table.
If true, it assigns 'Y' (member) to the member column, otherwise 'N' (non-member).
Next, the main query selects the customer_id, order_date, product_name, price, and member from the customer_member_status CTE.
Results are ordered first by customer_id in ascending order, then by member in descending order (so members appear first), and finally by order_date in ascending order.
The query presents the final results with the selected columns for each customer, showing whether they are a member or not for each order and sorted accordingly.


Rank All The Things


Danny needs additional information about the ranking of customer products. However, he specifically requires null ranking values for non-member purchases, as he is not interested in ranking customers who are not yet part of the loyalty program.

Answer:


The SQL query starts by creating a Common Table Expression (CTE) named customers_data.
Within the CTE, it selects customer_id, order_date, product_name, price, and uses a CASE statement to determine the member_status by comparing the join_date in the members table with the order_date in the sales table.
The sales table is aliased as sales, the members table as members, and the menu table as menu.
The query performs a left join between the sales and members tables on matching customer_id and an inner join between the sales and menu tables on matching product_id.
For each row, the CASE statement checks the join_date from the members table and compares it to the order_date from the sales table, assigning 'Y' if the join_date is less than or equal to the order_date (customer is a member) and 'N' otherwise (non-member).
Next, the main query selects the customer_id, order_date, product_name, price, and member_status from the customers_data CTE.
It also uses a CASE statement to calculate the ranking for each customer's orders if they are a member ('Y'). The RANK() function is used with PARTITION BY to partition the ranking within each customer and their membership status and ordered by order_date.
If the customer is not a member ('N'), the ranking is set to NULL.
Results are presented with the selected columns and ranking for each customer's orders, showing whether they are a member or not and the order of their orders.
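
A minimal sketch of the member flag and the conditional ranking, run through Python's sqlite3 module. The sample rows, column types, the ranking alias, and the short table aliases are assumptions; the CTE name and the member_status column follow the description above. Customer C has no row in members, so the left join yields a NULL join_date, the comparison is not true, and the CASE falls through to 'N'.

```python
import sqlite3


def ranked_orders():
    con = sqlite3.connect(":memory:")
    # Illustrative rows only: A joins on 2021-01-07, C never joins.
    con.executescript("""
        CREATE TABLE sales (customer_id TEXT, order_date TEXT, product_id INTEGER);
        CREATE TABLE menu (product_id INTEGER, product_name TEXT, price INTEGER);
        CREATE TABLE members (customer_id TEXT, join_date TEXT);
        INSERT INTO menu VALUES (1, 'sushi', 10), (2, 'curry', 15), (3, 'ramen', 12);
        INSERT INTO members VALUES ('A', '2021-01-07');
        INSERT INTO sales VALUES
            ('A', '2021-01-01', 1), ('A', '2021-01-07', 2), ('A', '2021-01-10', 3),
            ('C', '2021-01-01', 3);
    """)
    query = """
        WITH customers_data AS (
            SELECT s.customer_id, s.order_date, m.product_name, m.price,
                   CASE WHEN mbr.join_date <= s.order_date THEN 'Y' ELSE 'N' END
                       AS member_status
            FROM sales AS s
            LEFT JOIN members AS mbr ON s.customer_id = mbr.customer_id
            JOIN menu AS m ON s.product_id = m.product_id
        )
        SELECT customer_id, order_date, product_name, price, member_status,
               CASE WHEN member_status = 'Y'
                    THEN RANK() OVER (
                        PARTITION BY customer_id, member_status
                        ORDER BY order_date
                    )
               END AS ranking   -- NULL for non-member purchases
        FROM customers_data
        ORDER BY customer_id, order_date
    """
    rows = con.execute(query).fetchall()
    con.close()
    return rows


for row in ranked_orders():
    print(row)
```

Partitioning by both customer_id and member_status keeps the pre-membership rows from consuming rank numbers, so the first member purchase is ranked 1.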


Key Insights


Customer Spending: The total amount spent by each customer at Danny's Diner varies widely. Some customers have spent significantly more than others, indicating potential high-value customers or loyal patrons.
Customer Visits: The number of days each customer visited the restaurant also varies, showing different patterns of customer engagement. Some customers visit frequently, while others may visit less often.
First Purchases: Understanding the first items purchased by each customer can help Danny identify popular entry items and potentially attract new customers.
Most Popular Item: The most purchased item on the menu is valuable information for Danny. He can use this insight to optimize inventory management and capitalize on the popularity of the item.
Personalized Recommendations: Knowing the most popular item for each customer allows Danny to make personalized menu suggestions, enhancing the dining experience for his customers.
Customer Loyalty: The data about purchases before and after joining the loyalty program helps Danny evaluate the effectiveness of the loyalty program and its impact on customer behavior.
Bonus Points for New Members: By offering 2x points to new members during their first week, Danny incentivizes more spending, encouraging members to engage more actively with the program.
Member Points: The points earned by each member can be used to assess their loyalty and potentially offer targeted rewards and promotions.
Data Visualization: Creating visualizations based on the data and insights can further aid Danny in understanding trends and making data-driven decisions.
Customer Segmentation: By analyzing customer spending habits, Danny can segment his customer base and tailor marketing strategies accordingly.
Expanding Membership: Danny can use the insights from this data to refine his loyalty program and attract new members, leveraging the success of the program.
Inventory Management: Knowing the most popular and least popular items can help Danny optimize his inventory, reduce wastage, and maximize profits.
Menu Optimization: Danny can use the data to evaluate the performance of different menu items and consider introducing new dishes based on customer preferences.
Customer Engagement: Analyzing customer behavior can help Danny understand what keeps customers coming back and help him create more engaging experiences.
Long-Term Growth: By leveraging data analysis, Danny can make informed decisions that contribute to the long-term growth and success of Danny's Diner.

--------------------------------------------------------------------------------
/Case Study #3 - Foodie-Fi/README.md:
--------------------------------------------------------------------------------

Case Study #3 - Foodie-Fi🫕🥑


Introduction
Entity Relationship Diagram
Case Study Questions & Solutions

A. Customer Journey
B. Data Analysis Questions
C. Challenge Payment Question


Introduction


In response to the growing popularity of subscription-based businesses, Danny recognized a significant market gap. He saw an opportunity to introduce a distinctive streaming service, one that exclusively featured culinary content—a culinary counterpart to platforms like Netflix. With this vision in mind, Danny collaborated with a group of astute friends to establish Foodie-Fi, a startup launched in 2020. The company quickly began offering monthly and annual subscription plans, granting subscribers unrestricted access to an array of exclusive culinary videos sourced from across the globe.

At the heart of Foodie-Fi's foundation lies Danny's commitment to data-driven decision-making. He was resolute in leveraging data to inform all future investments and innovative features. This case study delves into the realm of subscription-style digital data, where insights derived from data analysis are pivotal in addressing critical business inquiries and steering the company toward sustained growth and success.


Entity Relationship Diagram


Case Study Questions & Solutions


A. Customer Journey🫕🥑


Based off the 8 sample customers provided in the sample from the subscriptions table, write a brief description about each customer’s onboarding journey.

Answer:


Customer 1:
- Started with a trial plan on 2020-08-01.
- Upgraded to the basic monthly plan on 2020-08-08.

Customer 2:
- Started with a trial plan on 2020-09-20.
- Upgraded to the pro annual plan on 2020-09-27.

Customer 11:
- Started with a trial plan on 2020-11-19.
- Downgraded to the churn plan on 2020-11-26.

Customer 13:
- Started with a trial plan on 2020-12-15.
- Upgraded to the basic monthly plan on 2020-12-22.
- Upgraded to the pro monthly plan on 2021-03-29.

Customer 15:
- Started with a trial plan on 2020-03-17.
- Upgraded to the pro monthly plan on 2020-03-24.
- Downgraded to the churn plan on 2020-04-29.

Customer 16:
- Started with a trial plan on 2020-05-31.
- Upgraded to the basic monthly plan on 2020-06-07.
- Upgraded to the pro annual plan on 2020-10-21.

Customer 18:
- Started with a trial plan on 2020-07-06.
- Upgraded to the pro monthly plan on 2020-07-13.

Customer 19:
- Started with a trial plan on 2020-06-22.
- Upgraded to the pro monthly plan on 2020-06-29.
- Upgraded to the pro annual plan on 2020-08-29.


B. Data Analysis Questions📊


How many customers has Foodie-Fi ever had?

Answer:

The SQL query retrieves the count of distinct customer_id values from the subscriptions table.
The COUNT(DISTINCT customer_id) function calculates the number of unique customer_id values in the subscriptions table.
As a result, the query presents the total number of unique customers who have subscriptions in the subscriptions table.

What is the monthly distribution of trial plan start_date values for our dataset - use the start of the month as the group by value

Answer:

The SQL query retrieves data related to trial subscriptions from the subscriptions table.
The DATE_TRUNC function is used to truncate the start_date column to the start of each month, resulting in the month_start column.
The COUNT(*) function calculates the number of records (trial starts) in the subscriptions table for each truncated month.
The WHERE plan_id = 0 condition filters the data to include only trial subscriptions, where the plan_id is 0.
Results are grouped by the month_start column, which represents the start of each month after truncation.
The ORDER BY month_start clause arranges the results in ascending order based on the month_start column.
As a result, the query presents the truncated start of each month and the corresponding count of trial subscription starts for that month from the subscriptions table.
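
A hedged sketch of the monthly grouping, run through Python's sqlite3 module. SQLite has no DATE_TRUNC, so strftime('%Y-%m-01', start_date) is used here as a stand-in that also maps each date to the first day of its month. The trial_starts alias, the sample rows, and the column types are assumptions made for illustration.

```python
import sqlite3


def trial_starts_by_month():
    con = sqlite3.connect(":memory:")
    # Illustrative rows only; plan_id 0 is the trial plan, as in the description.
    con.executescript("""
        CREATE TABLE subscriptions (customer_id INTEGER, plan_id INTEGER, start_date TEXT);
        INSERT INTO subscriptions VALUES
            (1, 0, '2020-08-01'), (1, 1, '2020-08-08'),
            (2, 0, '2020-09-20'),
            (3, 0, '2020-08-15');
    """)
    query = """
        SELECT strftime('%Y-%m-01', start_date) AS month_start,  -- truncate to month
               COUNT(*) AS trial_starts
        FROM subscriptions
        WHERE plan_id = 0            -- trial rows only
        GROUP BY month_start
        ORDER BY month_start
    """
    rows = con.execute(query).fetchall()
    con.close()
    return rows


print(trial_starts_by_month())  # [('2020-08-01', 2), ('2020-09-01', 1)]
```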

What plan start_date values occur after the year 2020 for our dataset? Show the breakdown by count of events for each plan_name

Answer:

The SQL query retrieves data related to subscription events from the subscriptions table and their corresponding plan details from the plans table.
The SELECT statement selects the plan_name, plan_id, and the count of records (subscription events) as event_count from the subscriptions table.
The JOIN clause connects the subscriptions table with the plans table using the common plan_id column.
The WHERE s.start_date > '2020-12-31' condition filters the data to include only records with start dates after December 31, 2020.
Results are grouped by the plan_name and plan_id columns from the plans table.
The ORDER BY p.plan_name, p.plan_id clause arranges the results in ascending order based on the plan_name and plan_id columns.
As a result, the query presents the plan name, plan ID, and the corresponding count of subscription events for each plan that started after December 31, 2020, using data from the subscriptions and plans tables.

What is the customer count and percentage of customers who have churned rounded to 1 decimal place?

Answer:

The SQL query calculates the count and percentage of churned customers based on certain conditions from the subscriptions table.
The first COUNT(DISTINCT CASE WHEN ... END) expression calculates the count of distinct customer IDs who have a plan ID of 4 and started their subscription on or before the current date. This count represents the churned customers.
The second ROUND(...) expression calculates the churned percentage by dividing the count of churned customers by the count of distinct customer IDs and then multiplying by 100.0. The ROUND function rounds the result to one decimal place.
The FROM subscriptions s clause specifies that the data is being retrieved from the subscriptions table with the alias s.
As a result, the query presents the count of churned customers (those with plan ID 4 and a start date on or before the current date) and the corresponding percentage of churned customers in relation to all distinct customers from the subscriptions table.

How many customers have churned straight after their initial free trial - what percentage is this rounded to the nearest whole number?

Answer:

The SQL query calculates the count and percentage of customers who have churned from a trial plan to a churn plan, based on data from the subscriptions and plans tables.
The ranked_cte Common Table Expression (CTE) is defined to retrieve customer IDs, current plan names, and the next plan names using the LEAD window function.
Within the ranked_cte CTE, the LEAD(plans.plan_name) OVER (PARTITION BY sub.customer_id ORDER BY sub.start_date) function retrieves the name of the next plan for each customer, ordered by their subscription start date.
The main query selects the count of customer IDs who transitioned from a 'trial' plan to a 'churn' plan and calculates the churn percentage.
The COUNT(customer_id) function counts the number of customers who transitioned from a 'trial' plan to a 'churn' plan.
The ROUND(...) function calculates the churn percentage by dividing the count of customers who transitioned by the total count of distinct customer IDs from the subscriptions table, multiplied by 100.0. The result is rounded to the nearest whole number.
The subquery (SELECT COUNT(DISTINCT customer_id) FROM subscriptions) calculates the total count of distinct customer IDs in the subscriptions table.
The WHERE plan_name = 'trial' AND next_plan = 'churn' condition filters the data to include only cases where customers transitioned from a 'trial' plan to a 'churn' plan.
As a result, the query presents the count and percentage of customers who have transitioned from a 'trial' plan to a 'churn' plan using data from the subscriptions and plans tables.
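
The LEAD-based approach described above can be sketched as follows, using Python's sqlite3 module (window functions need SQLite 3.25 or newer). The sample rows, the column types, and the churned_after_trial and churn_pct aliases are assumptions; the CTE name, the next_plan column, and the plan names follow the description. Customer 4 churns only after a paid month, so only customer 3 counts as churning straight after the trial.

```python
import sqlite3


def churn_after_trial():
    con = sqlite3.connect(":memory:")
    # Illustrative rows only (assumed for this sketch).
    con.executescript("""
        CREATE TABLE plans (plan_id INTEGER, plan_name TEXT);
        CREATE TABLE subscriptions (customer_id INTEGER, plan_id INTEGER, start_date TEXT);
        INSERT INTO plans VALUES
            (0, 'trial'), (1, 'basic monthly'), (2, 'pro monthly'),
            (3, 'pro annual'), (4, 'churn');
        INSERT INTO subscriptions VALUES
            (1, 0, '2020-08-01'), (1, 1, '2020-08-08'),
            (2, 0, '2020-09-20'), (2, 3, '2020-09-27'),
            (3, 0, '2020-11-19'), (3, 4, '2020-11-26'),
            (4, 0, '2020-03-17'), (4, 2, '2020-03-24'), (4, 4, '2020-04-29');
    """)
    query = """
        WITH ranked_cte AS (
            SELECT sub.customer_id,
                   plans.plan_name,
                   LEAD(plans.plan_name) OVER (
                       PARTITION BY sub.customer_id
                       ORDER BY sub.start_date
                   ) AS next_plan          -- plan that follows the current row
            FROM subscriptions AS sub
            JOIN plans ON sub.plan_id = plans.plan_id
        )
        SELECT COUNT(customer_id) AS churned_after_trial,
               ROUND(100.0 * COUNT(customer_id)
                     / (SELECT COUNT(DISTINCT customer_id) FROM subscriptions)
               ) AS churn_pct
        FROM ranked_cte
        WHERE plan_name = 'trial' AND next_plan = 'churn'
    """
    row = con.execute(query).fetchone()
    con.close()
    return row


print(churn_after_trial())  # (1, 25.0)
```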

What is the number and percentage of customer plans after their initial free trial?

Answer:

The SQL query calculates the count and percentage of customers who transitioned from a free trial plan to a subsequent plan, based on data from the subscriptions table.
The next_plans Common Table Expression (CTE) is defined to retrieve customer IDs, current plan IDs, and the next plan IDs using the LEAD window function.
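Within the next_plans CTE, the LEAD(plan_id) OVER (PARTITION BY customer_id ORDER BY plan_id) function retrieves the ID of the next plan for each customer, ordered by plan ID. Ordering by start_date instead would follow the actual sequence of events even if a customer later moved to a lower-numbered plan.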
The main query selects the next plan ID, counts the number of customers who transitioned to that plan, and calculates the percentage of customers who transitioned.
The COUNT(customer_id) function counts the number of customers who transitioned to each next plan.
The ROUND(...) function calculates the percentage of customers who transitioned to each next plan by dividing the count of transitioning customers by the total count of distinct customer IDs from the subscriptions table, multiplied by 100.0. The result is rounded to one decimal place.
The subquery (SELECT COUNT(DISTINCT customer_id) FROM subscriptions) calculates the total count of distinct customer IDs in the subscriptions table.
The WHERE next_plan_id IS NOT NULL AND plan_id = 0 condition filters the data to include only cases where customers transitioned from a free trial plan to a subsequent plan.
Results are grouped by the next_plan_id and ordered by next_plan_id.
As a result, the query presents the next plan ID, count of customers who transitioned to each next plan, and the corresponding percentage of customers who transitioned from a free trial plan to that next plan.

What is the customer count and percentage breakdown of all 5 plan_name values at 2020-12-31?

Answer:

The SQL query calculates the count and percentage of customers for each plan who subscribed on or before December 31, 2020, based on data from the subscriptions and plans tables.
The SELECT statement selects the plan_name, count of distinct customer IDs, and the corresponding percentage of customers for each plan.
The COUNT(DISTINCT s.customer_id) function calculates the count of distinct customer IDs who subscribed on or before December 31, 2020, for each plan.
The ROUND(...) expression calculates the percentage of customers for each plan by dividing the count of distinct customer IDs by the total count of distinct customer IDs who subscribed on or before December 31, 2020, multiplied by 100.0. The result is rounded to one decimal place.
The subquery (SELECT COUNT(DISTINCT customer_id) FROM subscriptions WHERE start_date <= '2020-12-31') calculates the total count of distinct customer IDs who subscribed on or before December 31, 2020.
The FROM subscriptions s clause specifies that the data is being retrieved from the subscriptions table with the alias s.
The JOIN plans p ON s.plan_id = p.plan_id clause connects the subscriptions table with the plans table using the common plan_id column.
The WHERE s.start_date <= '2020-12-31' condition filters the data to include only records with subscription start dates on or before December 31, 2020.
Results are grouped by the plan_name and plan_id columns from the plans table.
The ORDER BY p.plan_id clause arranges the results in ascending order based on the plan_id.
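As a result, the query presents the plan name, count of distinct customers, and corresponding percentage of customers for each plan who subscribed on or before December 31, 2020. Note that a customer who changed plans before that date is counted under every plan they started, so this is a count of plans started by December 31, 2020 rather than a snapshot of each customer's active plan on that date; a snapshot would keep only each customer's most recent subscription row on or before the date.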

How many customers have upgraded to an annual plan in 2020?

Answer:

The SQL query calculates the count of distinct customers who upgraded to the annual pro plan (Plan ID 3) during the year 2020.
The COUNT(DISTINCT customer_id) function calculates the count of distinct customer IDs who meet the specified conditions.
The FROM subscriptions clause specifies that the data is being retrieved from the subscriptions table.
The WHERE start_date >= '2020-01-01' AND start_date <= '2020-12-31' AND plan_id = 3 condition filters the data to include only records with start dates within the year 2020 and a plan ID equal to 3 (representing the annual pro plan).
As a result, the query presents the count of distinct customers who upgraded to the annual pro plan (Plan ID 3) during the year 2020.

How many customers downgraded from a pro monthly to a basic monthly plan in 2020?

Answer:

The SQL query calculates the count of downgrades from the pro monthly plan (Plan ID 2) to the basic monthly plan (Plan ID 1) during the year 2020, based on data from the subscriptions table.
The COUNT(*) function calculates the count of records that meet the specified conditions.
The query performs a self-join on the subscriptions table, where the prev table represents the previous subscription and the current table represents the current subscription.
The ON prev.customer_id = current.customer_id AND prev.plan_id = 2 AND current.plan_id = 1 AND prev.start_date <= current.start_date condition ensures that the join is made between a pro monthly plan (Plan ID 2) and a basic monthly plan (Plan ID 1) for the same customer with a valid start date sequence.
The WHERE EXTRACT(YEAR FROM prev.start_date) = 2020 condition keeps only pairs whose pro monthly subscription (prev) started in 2020; it does not by itself check that the downgrade (current.start_date) also happened in 2020.
As a result, the query presents the count of downgrades from the pro monthly plan (Plan ID 2) to the basic monthly plan (Plan ID 1) during the year 2020 based on the specified conditions from the subscriptions table.
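
A minimal sketch of the self-join, run through Python's sqlite3 module. SQLite has no EXTRACT, so strftime('%Y', ...) stands in for EXTRACT(YEAR FROM ...), and the alias cur is used in place of current to avoid any clash with an SQL keyword. The sample rows and column types are assumptions. Customer 1 moves from pro monthly to basic monthly; customer 2 moves the other way and is rejected by the start-date ordering condition.

```python
import sqlite3


def count_downgrades_2020():
    con = sqlite3.connect(":memory:")
    # Illustrative rows only: plan 1 = basic monthly, plan 2 = pro monthly.
    con.executescript("""
        CREATE TABLE subscriptions (customer_id INTEGER, plan_id INTEGER, start_date TEXT);
        INSERT INTO subscriptions VALUES
            (1, 0, '2020-02-22'), (1, 2, '2020-03-01'), (1, 1, '2020-05-01'),
            (2, 0, '2020-01-25'), (2, 1, '2020-02-01'), (2, 2, '2020-04-01');
    """)
    query = """
        SELECT COUNT(*)
        FROM subscriptions AS prev
        JOIN subscriptions AS cur
          ON prev.customer_id = cur.customer_id
         AND prev.plan_id = 2                      -- pro monthly came first
         AND cur.plan_id = 1                       -- basic monthly came after
         AND prev.start_date <= cur.start_date
        WHERE strftime('%Y', prev.start_date) = '2020'
    """
    count = con.execute(query).fetchone()[0]
    con.close()
    return count


print(count_downgrades_2020())  # 1
```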


C. Challenge Payment Question💰


The Foodie-Fi team wants you to create a new payments table for the year 2020 that includes amounts paid by each customer in the subscriptions table with the following requirements:
- monthly payments always occur on the same day of month as the original start_date of any monthly paid plan
- upgrades from basic to monthly or pro plans are reduced by the current paid amount in that month and start immediately
- upgrades from pro monthly to pro annual are paid at the end of the current billing period and also start at the end of the month period
- once a customer churns they will no longer make payments

Answer:

The SQL query calculates customer payments for various subscription plans during the year 2020 based on data from the subscriptions and plans tables.
The all_plans Common Table Expression (CTE) is defined to retrieve customer details, plan details, start dates, prices, and payment dates based on subscription data.
Within the all_plans CTE, a conditional expression calculates the payment date based on the plan type (monthly or annual) and ensures that payments for annual plans are made in December.
The payments CTE is defined to calculate individual payment records based on customer, plan, and payment date.
Within the payments CTE, the ROW_NUMBER() function assigns a payment order to each record within the same customer, plan, plan name, and payment date partition.
The main query selects the finalized customer payment details, including customer ID, plan ID, plan name, payment date, amount, and payment order.
The WHERE payment_order = 1 OR (amount > 0 AND payment_order > 1) condition filters the data to include the first payment for each partition and subsequent payments with a positive amount.
The query ORDER BY customer_id, plan_id, payment_date, payment_order arranges the results in ascending order based on customer ID, plan ID, payment date, and payment order.
The LIMIT 20 clause limits the results to the first 20 records.
As a result, the query presents the customer payments for various subscription plans during the year 2020, including the customer ID, plan ID, plan name, payment date, amount, and payment order.

--------------------------------------------------------------------------------
/Case Study #4 - Data Bank/README.md:
--------------------------------------------------------------------------------

Case Study #4 - Data Bank🏦


Introduction
Entity Relationship Diagram
Case Study Questions & Solutions

A. Customer Nodes Exploration
B. Customer Transactions


Introduction


Welcome to an enthralling journey into the world of finance, innovation, and data convergence. In this captivating case study project, we immerse ourselves in the realm of Neo-Banks, the trailblazing digital financial entities revolutionizing the industry landscape by eliminating physical branches. Our focus zooms in on an ingenious visionary named Danny, whose aspirations transcend the ordinary, leading to the inception of a remarkable initiative – "Data Bank."

Within the fabric of this narrative lies a compelling fusion of Neo-Banking, cryptocurrency, and cutting-edge data management. The birth of Data Bank signifies a paradigm shift where the contours of digital banking intertwine seamlessly with the frontiers of secure distributed data storage.

The essence of Data Bank mirrors that of traditional digital banks, albeit with an unprecedented twist. Not confined merely to financial transactions, this avant-garde establishment boasts the accolade of harboring the world's most impregnable distributed data storage platform. As we delve deeper, a unique interplay emerges – customers' cloud data storage allotments are intricately tied to their financial standing. Yet, the intricacies of this symbiotic relationship beckon us to explore further.

Our protagonist, Data Bank, has set its sights on twin goals: expanding its customer base and, concurrently, gaining insights to navigate the data storage landscape. It is in this intricate tapestry of aspirations that we invite you, the analytical trailblazers, to join hands with the Data Bank team.

This case study project is an expedition into the realms of metric calculation, growth strategy formulation, and incisive data-driven insights. As collaborators in this journey, we wield the tools of analysis to forecast and shape future developments. The canvas before us is one of intricate calculation, strategic expansion, and data-driven foresight. Embark on this expedition as we decipher the chronicles of Data Bank's evolution, where innovation intersects with pragmatism, and where data illuminates the path forward.


Entity Relationship Diagram


Case Study Questions & Solutions


A. Customer Nodes Exploration👥


How many unique nodes are there on the Data Bank system?

Answer:

The SQL query calculates the count of distinct node IDs in the customer_nodes table, representing unique nodes.
The COUNT(DISTINCT node_id) function calculates the count of distinct values in the node_id column of the customer_nodes table.
As a result, the query presents the count of unique node IDs in the customer_nodes table.

What is the number of nodes per region?

Answer:

The SQL query retrieves the count of distinct nodes per region from the regions and customer_nodes tables.
The SELECT statement selects the region_name and the count of distinct node_id values for each region.
The LEFT JOIN clause joins the regions table with the customer_nodes table using the region_id as the common column.
The GROUP BY r.region_name clause groups the results by region_name to calculate the count of distinct nodes for each region.
The ORDER BY nodes_per_region DESC clause orders the results in descending order based on the count of nodes per region.
As a result, the query presents the region names and the corresponding count of distinct nodes per region, ordered by the count of nodes in descending order.

How many customers are allocated to each region?

Answer:

The SQL query retrieves the count of customers per region from the regions and customer_nodes tables.
The SELECT statement selects the region_name and the count of distinct customer_id values for each region.
The LEFT JOIN clause joins the regions table with the customer_nodes table using the region_id as the common column.
The GROUP BY r.region_name clause groups the results by region_name to calculate the count of customers for each region.
The ORDER BY customers_per_region DESC clause orders the results in descending order based on the count of customers per region.
As a result, the query presents the region names and the corresponding count of customers per region, ordered by customer count in descending order.

How many days on average are customers reallocated to a different node?

Answer:

The SQL query calculates the average total number of days that each customer-node combination was active, based on data from the customer_nodes table.
The CTE (Common Table Expression) named CTE is defined to calculate the number of days each customer-node combination was active. It calculates the difference between the end_date and start_date columns and renames it as node_dys. The WHERE end_date != '9999-12-31' condition filters out records with the default end date.
The CTE2 is defined to calculate the total number of days each customer-node combination was active. It sums the node_dys values for each customer-node combination using the SUM(node_dys) function.
The main query calculates the rounded average of the total_node values using the AVG(total_node) function and renames it as avg_n.
As a result, the query presents the rounded average total number of days that each customer-node combination was active, based on the data from the CTE2 and customer_nodes tables.
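The two-CTE structure described above can be sketched like this. The example assumes SQLite via Python's `sqlite3`, so the date difference is computed with `julianday()` instead of PostgreSQL date subtraction, and the rows are invented; the names `node_dys`, `total_node`, and `avg_n` follow the explanation.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customer_nodes "
    "(customer_id INTEGER, node_id INTEGER, start_date TEXT, end_date TEXT)"
)
# Hypothetical sample rows.
conn.executemany(
    "INSERT INTO customer_nodes VALUES (?, ?, ?, ?)",
    [
        (1, 1, "2020-01-01", "2020-01-11"),  # 10 days on node 1
        (1, 2, "2020-01-11", "2020-02-01"),  # 21 days on node 2
        (1, 1, "2020-02-01", "2020-02-06"),  # 5 more days on node 1
        (2, 3, "2020-01-05", "2020-01-29"),  # 24 days on node 3
        (2, 4, "2020-01-29", "9999-12-31"),  # open-ended allocation, excluded
    ],
)

avg_n = conn.execute(
    """
    WITH cte AS (
        SELECT customer_id, node_id,
               julianday(end_date) - julianday(start_date) AS node_dys
        FROM customer_nodes
        WHERE end_date != '9999-12-31'
    ),
    cte2 AS (
        SELECT customer_id, node_id, SUM(node_dys) AS total_node
        FROM cte
        GROUP BY customer_id, node_id
    )
    SELECT ROUND(AVG(total_node)) AS avg_n
    FROM cte2
    """
).fetchone()[0]
print(avg_n)  # per-combination totals are 15, 21 and 24 days
```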

What is the median, 80th and 95th percentile for this same reallocation days metric for each region?

Answer:

The SQL query calculates statistical percentiles of node active durations for each region based on data from the customer_nodes and regions tables.
The WITH clause defines a Common Table Expression (CTE) named date_diff_cte that calculates the difference in days between end_date and start_date for each active node. The condition end_date != '9999-12-31' filters out records with the default end date.
The main query selects the region_name column from the regions table and calculates the statistical percentiles within groups of node active durations for each region.
The PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY node_dys) calculates the median, PERCENTILE_CONT(0.8) WITHIN GROUP (ORDER BY node_dys) calculates the 80th percentile, and PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY node_dys) calculates the 95th percentile of node active durations for each region.
The JOIN clause joins the date_diff_cte CTE with the regions table using the region_id as the common column.
The GROUP BY r.region_name groups the results by region_name to present the statistical percentiles of node active durations for each region.
As a result, the query presents the region names along with the median, 80th percentile, and 95th percentile of node active durations, calculated using data from the date_diff_cte CTE and the regions table.
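To make the percentile step concrete, the sketch below reproduces the linear interpolation that a continuous percentile performs, in plain Python. The helper name `percentile_cont`, the region names, and the `node_dys` values are assumptions for illustration, not output from the case-study data.

```python
import math


def percentile_cont(values, fraction):
    """Continuous percentile with linear interpolation between neighbours."""
    ordered = sorted(values)
    position = fraction * (len(ordered) - 1)  # zero-based fractional row index
    lower, upper = math.floor(position), math.ceil(position)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (position - lower)


# Made-up node_dys values per region, purely for illustration.
node_days_by_region = {
    "Africa": [2, 10, 14, 20, 28],
    "Europe": [5, 15],
}

# Median, 80th and 95th percentile for each region.
region_percentiles = {
    region: tuple(percentile_cont(days, f) for f in (0.5, 0.8, 0.95))
    for region, days in node_days_by_region.items()
}
print(region_percentiles)
```

Because the result is interpolated, a percentile can fall between two observed durations, which is why the two-row region above yields a median that appears in neither row.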


B. Customer Transactions💸


What is the unique count and total amount for each transaction type?

Answer:

The SQL query calculates transaction statistics based on data from the customer_transactions table.
The SELECT statement selects three columns: txn_type as transaction_type, COUNT(customer_id) as tn_cnt (transaction count), and SUM(txn_amount) as ttl_amnt (total amount).
The GROUP BY txn_type clause groups the results by txn_type to calculate statistics for each transaction type.
The COUNT(customer_id) calculates the count of transactions for each transaction type.
The SUM(txn_amount) calculates the total amount of transactions for each transaction type.
As a result, the query presents the transaction types, the count of transactions, and the total transaction amount for each transaction type.

What is the average total historical deposit counts and amounts for all customers?

Answer:

The SQL query calculates average statistics based on deposit transactions for each customer using data from the customer_transactions table.
The WITH clause defines a Common Table Expression (CTE) named CTE that calculates the count of deposit transactions and the average transaction amount for each customer. The condition txn_type = 'deposit' filters the transactions to consider only deposits.
The GROUP BY customer_id clause groups the results by customer_id to calculate statistics for each customer.
The main query calculates the rounded average count of deposit transactions and the rounded average of average transaction amounts using the AVG(cnt) and AVG(avg_amnt) functions from the CTE.
As a result, the query presents the rounded average count of deposit transactions and the rounded average of average transaction amounts for all customers.

For each month - how many Data Bank customers make more than 1 deposit and either 1 purchase or 1 withdrawal in a single month?

Answer:

The SQL query analyzes monthly transaction patterns for customers based on data from the customer_transactions table.
The WITH clause defines a Common Table Expression (CTE) named monthly_transactions that aggregates transactions for each customer per month. It calculates the count of deposit transactions, purchase transactions, and withdrawal transactions for each customer in each month using conditional aggregations.
The GROUP BY customer_id, DATE_TRUNC('month', txn_date) clause groups the results by customer_id and the truncated transaction_month to calculate statistics for each customer in each month.
The main query selects the transaction_month column as mth and calculates the count of distinct customers with the specified conditions using the COUNT(DISTINCT customer_id) function.
The WHERE clause keeps only customer-month rows with more than one deposit transaction and at least one purchase or withdrawal transaction.
The GROUP BY transaction_month clause groups the results by transaction_month to analyze monthly patterns.
The ORDER BY transaction_month sorts the results by the transaction_month in ascending order.
As a result, the query presents the months along with the count of distinct customers who meet the specified transaction conditions for each month.
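The conditional aggregation described above can be sketched as follows. The example assumes SQLite via Python's `sqlite3`, so `strftime('%Y-%m', ...)` replaces `DATE_TRUNC('month', ...)`; the count column names and the sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customer_transactions "
    "(customer_id INTEGER, txn_date TEXT, txn_type TEXT, txn_amount INTEGER)"
)
# Hypothetical sample rows.
conn.executemany(
    "INSERT INTO customer_transactions VALUES (?, ?, ?, ?)",
    [
        (1, "2020-01-02", "deposit", 100),
        (1, "2020-01-10", "deposit", 50),
        (1, "2020-01-20", "purchase", 30),    # customer 1 qualifies in January
        (2, "2020-01-05", "deposit", 200),
        (2, "2020-01-15", "withdrawal", 20),  # only one deposit: no match
        (3, "2020-02-01", "deposit", 10),
        (3, "2020-02-03", "deposit", 10),
        (3, "2020-02-09", "withdrawal", 5),   # customer 3 qualifies in February
        (1, "2020-02-11", "deposit", 40),     # single deposit: no match
    ],
)

monthly_counts = conn.execute(
    """
    WITH monthly_transactions AS (
        SELECT customer_id,
               strftime('%Y-%m', txn_date) AS transaction_month,
               SUM(CASE WHEN txn_type = 'deposit' THEN 1 ELSE 0 END) AS deposit_count,
               SUM(CASE WHEN txn_type = 'purchase' THEN 1 ELSE 0 END) AS purchase_count,
               SUM(CASE WHEN txn_type = 'withdrawal' THEN 1 ELSE 0 END) AS withdrawal_count
        FROM customer_transactions
        GROUP BY customer_id, strftime('%Y-%m', txn_date)
    )
    SELECT transaction_month AS mth,
           COUNT(DISTINCT customer_id) AS customer_count
    FROM monthly_transactions
    WHERE deposit_count > 1
      AND (purchase_count >= 1 OR withdrawal_count >= 1)
    GROUP BY transaction_month
    ORDER BY transaction_month
    """
).fetchall()
print(monthly_counts)
```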

What is the closing balance for each customer at the end of the month?

Answer:

The SQL query calculates monthly closing balances for customers based on data from the customer_transactions table.
The WITH clause defines a Common Table Expression (CTE) named customer_monthly_balances that aggregates transactions for each customer per month. It calculates the closing balance by subtracting the sum of purchase and withdrawal amounts from the sum of deposit amounts for each customer in each month.
The GROUP BY customer_id, DATE_TRUNC('month', txn_date) clause groups the results by customer_id and the truncated transaction_month to calculate the closing balance for each customer in each month.
The main query selects the customer_id, transaction_month, and closing_balance columns from the customer_monthly_balances CTE.
The ORDER BY customer_id, transaction_month sorts the results by customer_id and transaction_month in ascending order.
As a result, the query presents the customer IDs, transaction months, and closing balances for each customer based on monthly transaction data.

What is the percentage of customers who increase their closing balance by more than 5%?

Answer:

The SQL query calculates the percentage of customers whose closing balance increased by more than 5%, based on monthly transaction data from the customer_transactions table.
The WITH clause defines a Common Table Expression (CTE) named customer_monthly_balances that aggregates transactions for each customer per month and calculates the closing balance difference.
The main query calculates the percentage increase by using a series of calculations and filters on the results from the customer_monthly_balances CTE.
Within the main query, the LAG window function is used to retrieve the previous month's closing balance for each customer.
The calculated closing_balance_increase represents the percentage increase in closing balance for each month, considering the previous month's balance.
The FILTER condition filters the results to consider only rows where closing_balance_increase is greater than 5%.
The main query then calculates the percentage of customers with a closing balance increase greater than 5% by dividing the count of such customers by the total count of customers and multiplying by 100.
The ROUND function is used to round the percentage to two decimal places.
As a result, the query presents the percentage of customers whose closing balance increased by more than 5%, based on monthly transaction data.
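One way to combine the monthly balances with `LAG` is sketched below. It assumes SQLite 3.25 or later (for window functions) through Python's `sqlite3`, uses a `CASE` expression inside `COUNT(DISTINCT ...)` where the explanation mentions `FILTER`, treats each month's net movement as its closing balance, and runs on invented rows. The `with_previous` CTE and `previous_balance` column are hypothetical names.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customer_transactions "
    "(customer_id INTEGER, txn_date TEXT, txn_type TEXT, txn_amount INTEGER)"
)
# Hypothetical sample rows.
conn.executemany(
    "INSERT INTO customer_transactions VALUES (?, ?, ?, ?)",
    [
        (1, "2020-01-03", "deposit", 100),   # January balance 100
        (1, "2020-02-04", "deposit", 120),
        (1, "2020-02-20", "purchase", 10),   # February balance 110 (+10%)
        (2, "2020-01-07", "deposit", 200),   # January balance 200
        (2, "2020-02-09", "deposit", 204),   # February balance 204 (+2%)
    ],
)

pct_customers = conn.execute(
    """
    WITH customer_monthly_balances AS (
        SELECT customer_id,
               strftime('%Y-%m', txn_date) AS transaction_month,
               SUM(CASE WHEN txn_type = 'deposit' THEN txn_amount
                        ELSE -txn_amount END) AS closing_balance
        FROM customer_transactions
        GROUP BY customer_id, strftime('%Y-%m', txn_date)
    ),
    with_previous AS (
        SELECT customer_id,
               closing_balance,
               LAG(closing_balance) OVER (
                   PARTITION BY customer_id ORDER BY transaction_month
               ) AS previous_balance
        FROM customer_monthly_balances
    )
    SELECT ROUND(
        100.0 * COUNT(DISTINCT CASE
            WHEN previous_balance > 0
             AND (closing_balance - previous_balance) * 100.0 / previous_balance > 5
            THEN customer_id END)
        / COUNT(DISTINCT customer_id), 2)
    FROM with_previous
    """
).fetchone()[0]
print(pct_customers)
```

The `previous_balance > 0` guard is a design choice of this sketch: it avoids dividing by zero and skips the first month of each customer, where `LAG` returns NULL.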

--------------------------------------------------------------------------------
/Case Study #5 - Data Mart/README.md:
--------------------------------------------------------------------------------

Case Study #5 - Data Mart🛒


Introduction
Entity Relationship Diagram
Case Study Questions & Solutions

1. Data Cleansing Steps
2. Data Exploration
3. Before & After Analysis


Introduction


At the center of this case study is Danny's latest enterprise, Data Mart, which follows his successful oversight of global operations for an online grocery store specializing in fresh produce. Danny now seeks our help in analyzing his sales performance.

In June 2020, Data Mart underwent significant supply chain modifications. Embracing sustainability, the company transitioned all its products to eco-friendly packaging throughout the entire journey from the farm to the end consumer.

Danny is enlisting our expertise to quantify the impact of this change on Data Mart's sales performance across its distinct business areas.

Central to this inquiry are the following key questions:

What is the measurable impact resulting from the changes implemented in June 2020?

Among platforms, regions, segments, and customer categories, which experienced the most pronounced effects due to these adjustments?

How can we strategize for potential future implementations of analogous sustainability enhancements to the business, with the aim of mitigating any adverse consequences on sales?

Entity Relationship Diagram

Case Study Questions & Solutions

1. Data Cleansing Steps📊

In a single query, the following operations are performed to generate a new table named clean_weekly_sales in the data_mart schema:
- Convert the week_date to a DATE format
- Add a week_number as the second column for each week_date value, where the week number corresponds to the range of dates within that week
- Add a month_number with the calendar month for each week_date value as the third column
- Add a calendar_year column as the fourth column containing values of 2018, 2019, or 2020 based on the year of the week_date
- Add a new column called age_band after the original segment column using predefined mappings for age groups
- Add a new demographic column using the following mapping for the first letter in the segment values
- Ensure all null string values are replaced with an "unknown" string value in the original segment column as well as the new age_band and demographic columns
- Generate a new avg_transaction column as the sales value divided by transactions rounded to 2 decimal places for each record

Answer:

The SQL script begins by creating a new table named clean_weekly_sales by selecting and transforming data from an existing table named weekly_sales.
Within the SELECT statement, various operations are performed on the week_date column to extract different date-related information like week_number, month_number, transaction_month, and calendar_year based on the year of the week_date.
Using CASE statements, the age_band column is determined based on the last digit of the segment column.
Similarly, the demographic column is determined based on the first letter of the segment column.
The COALESCE function ensures that the segment column is replaced with 'unknown' when it has no value (COALESCE substitutes only NULLs, so empty strings need converting to NULL first).
Finally, the transactions, sales, and avg_transaction columns are calculated using appropriate formulas.
After the transformation, the script selects all columns from the newly created clean_weekly_sales table.
As a result, the clean_weekly_sales table contains transformed and cleaned data from the weekly_sales table with various columns representing date information, age band, demographic, platform, sales metrics, and more.

2. Data Exploration🤯📊

What day of the week is used for each week_date value?

Answer:

The SQL query selects distinct weekdays from the clean_weekly_sales table.
The TO_CHAR function is used to format the week_date column as a weekday string.
The format 'day' specifies that only the name of the day of the week should be extracted.
The result is a list of unique weekday names present in the week_date column of the clean_weekly_sales table.

What range of week numbers are missing from the dataset?

Answer:

The SQL query uses a common table expression (CTE) named week_number_cte to generate a series of week numbers from 1 to 52.
The GENERATE_SERIES function is used to create a sequence of numbers within the specified range.
The main SELECT statement then selects distinct week_num values from the week_number_cte CTE.
A LEFT JOIN is performed between the week_number_cte CTE and the clean_weekly_sales table on the condition that the week_num matches the week_number column in the sales data.
The WHERE clause filters rows where the week_number in the sales data is NULL, indicating missing data for that week.
The result is a list of week numbers that are missing data in the clean_weekly_sales table.
The list is ordered in ascending order of week numbers.
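The anti-join pattern above can be sketched as follows. The example assumes SQLite via Python's `sqlite3`, where a recursive CTE stands in for `GENERATE_SERIES`, and the table is filled with a made-up set of week numbers purely to show the mechanics.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE clean_weekly_sales (week_number INTEGER)")
# Hypothetical data: only weeks 13 through 36 are present.
conn.executemany(
    "INSERT INTO clean_weekly_sales VALUES (?)",
    [(week,) for week in range(13, 37)],
)

missing_weeks = [
    row[0]
    for row in conn.execute(
        """
        WITH RECURSIVE week_number_cte(week_num) AS (
            SELECT 1
            UNION ALL
            SELECT week_num + 1 FROM week_number_cte WHERE week_num < 52
        )
        SELECT DISTINCT w.week_num
        FROM week_number_cte AS w
        LEFT JOIN clean_weekly_sales AS s
          ON w.week_num = s.week_number
        WHERE s.week_number IS NULL  -- no matching sales row for this week
        ORDER BY w.week_num
        """
    )
]
print(missing_weeks)
```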

How many total transactions were there for each year in the dataset?

Answer:

The SQL query selects the calendar_year and calculates the total number of transactions for each calendar year from the clean_weekly_sales table.
The SUM function is used to add up the transactions values for each year.
The results are grouped by the calendar_year column.
The ORDER BY clause sorts the results in ascending order based on the calendar_year.
The outcome is a list that displays the total number of transactions for each calendar year present in the clean_weekly_sales table.

What is the total sales for each region for each month?

Answer:

The SQL query selects the region and calculates the total sales for each region from the clean_weekly_sales table.
The SUM function is used to add up the sales values for each region.
The WHERE clause filters the rows to include only data where the month_number is 7, indicating July.
The results are grouped by the region column.
The ORDER BY clause sorts the results in ascending order based on the region.
The outcome is a list that displays the total sales for each region in the month of July from the clean_weekly_sales table.

What is the total count of transactions for each platform?

Answer:

The SQL query selects the platform column and calculates the total number of transactions for each platform from the clean_weekly_sales table.
The SUM function is used to add up the transactions values for each platform.
The results are grouped by the platform column.
The ORDER BY clause sorts the results in ascending order based on the platform.
The outcome is a list that displays the total number of transactions for each platform in the clean_weekly_sales table.

What is the percentage of sales for Retail vs Shopify for each month?

Answer:

The SQL query first creates a Common Table Expression (CTE) named monthly_platform_sales.
In the CTE, the query calculates the total monthly sales for each combination of calendar_year, month_number, and platform from the clean_weekly_sales table using the SUM function.
The results are grouped by calendar_year, month_number, and platform.
The main part of the query selects data from the monthly_platform_sales CTE.
It calculates the percentage of retail sales by dividing the sum of monthly sales from the 'Retail' platform by the total monthly sales, multiplied by 100. The ROUND function is used to round the result to two decimal places.
Similarly, it calculates the percentage of Shopify sales by dividing the sum of monthly sales from the 'Shopify' platform by the total monthly sales, multiplied by 100.
The results are grouped by calendar_year and month_number.
The ORDER BY clause sorts the results in ascending order based on calendar_year and month_number.
The outcome is a table that displays the percentage of retail and Shopify sales for each month and calendar year based on the data from the clean_weekly_sales table.
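A minimal sketch of the platform split, assuming SQLite via Python's `sqlite3` and a handful of invented rows. The output column names `retail_percentage` and `shopify_percentage` are hypothetical; the CTE name follows the explanation.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE clean_weekly_sales "
    "(calendar_year INTEGER, month_number INTEGER, platform TEXT, sales INTEGER)"
)
# Hypothetical sample rows.
conn.executemany(
    "INSERT INTO clean_weekly_sales VALUES (?, ?, ?, ?)",
    [
        (2020, 3, "Retail", 600),
        (2020, 3, "Retail", 300),
        (2020, 3, "Shopify", 100),
        (2020, 4, "Retail", 150),
        (2020, 4, "Shopify", 50),
    ],
)

platform_split = conn.execute(
    """
    WITH monthly_platform_sales AS (
        SELECT calendar_year, month_number, platform,
               SUM(sales) AS monthly_sales
        FROM clean_weekly_sales
        GROUP BY calendar_year, month_number, platform
    )
    SELECT calendar_year,
           month_number,
           ROUND(100.0 * SUM(CASE WHEN platform = 'Retail' THEN monthly_sales ELSE 0 END)
                 / SUM(monthly_sales), 2) AS retail_percentage,
           ROUND(100.0 * SUM(CASE WHEN platform = 'Shopify' THEN monthly_sales ELSE 0 END)
                 / SUM(monthly_sales), 2) AS shopify_percentage
    FROM monthly_platform_sales
    GROUP BY calendar_year, month_number
    ORDER BY calendar_year, month_number
    """
).fetchall()
print(platform_split)
```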

What is the percentage of sales by demographic for each year in the dataset?

Answer:

The SQL query creates a Common Table Expression (CTE) named yearly_demographic_sales.
In the CTE, the query calculates the total yearly sales for each combination of calendar_year and demographic from the clean_weekly_sales table using the SUM function.
The results are grouped by calendar_year and demographic.
The main part of the query selects data from the yearly_demographic_sales CTE.
It calculates the sales percentage for each demographic in a given year by dividing the yearly sales for that demographic by the total yearly sales for the same year, multiplied by 100. The ROUND function is used to round the result to two decimal places.
The SUM(yearly_sales) OVER (PARTITION BY calendar_year) calculates the total yearly sales for each year and is used as the denominator for calculating the percentage.
The results are displayed with columns for calendar_year, demographic, and sales_percentage.
The results are ordered in ascending order based on calendar_year and demographic.

Which age_band and demographic values contribute the most to Retail sales?

Answer:

The SQL query selects data from the clean_weekly_sales table.
It retrieves the columns age_band and demographic along with the calculated sum of sales as retail_sales.
The query filters the data to include only rows where the platform is 'Retail' using the WHERE clause.
The results are grouped by age_band and demographic using the GROUP BY clause.
The calculated sum of sales for each group is displayed as retail_sales.
The results are ordered in descending order based on retail_sales using the ORDER BY clause with the DESC keyword.

Can we use the avg_transaction column to find the average transaction size for each year for Retail vs Shopify?

Answer:

The SQL query selects data from the clean_weekly_sales table.
It retrieves the columns calendar_year and platform along with the calculated average of avg_transaction as average_transaction_size.
The query calculates the average transaction size using the AVG function on the avg_transaction column.
The results are grouped by calendar_year and platform using the GROUP BY clause.
The calculated average transaction size for each group is displayed as average_transaction_size.
The results are ordered in ascending order based on calendar_year and platform using the ORDER BY clause.

3. Before & After Analysis📈📉

This technique is usually used when we inspect an important event and want to measure its impact before and after a certain point in time.

Take the week_date value of 2020-06-15 as the baseline week where the Data Mart sustainable packaging changes came into effect.

All week_date values from 2020-06-15 onward count as the period after the change, and earlier week_date values count as the period before.

Using this analysis approach - answer the following questions:
1. What is the total sales for the 4 weeks before and after 2020-06-15? What is the growth or reduction rate in actual values and percentage of sales?

Answer:

The SQL query uses two Common Table Expressions (CTEs) to perform calculations on the clean_weekly_sales table.
The first CTE named CTE calculates the total sales for each week within the specified range (week numbers 21 to 28) of the year 2020.
The GROUP BY clause groups the results by week_date and week_number.
The second CTE named CTE2 further processes the data from the CTE.
It calculates the sum of total sales for the weeks before (week numbers 21 to 24) and after (week numbers 25 to 28) the packaging change.
The main query calculates the variance in sales between the "after" and "before" periods by subtracting the before_sales from the after_sales.
The query also calculates the percentage variance by dividing the variance by the before_sales and then multiplying by 100. It rounds the percentage to 2 decimal places using the ROUND function.
The calculated sales variance and percentage variance are displayed as sales_va and variance_percent, respectively.
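The before/after comparison can be sketched as below. The example assumes SQLite via Python's `sqlite3`, groups by `week_number` only as a simplification, and uses invented sales figures; the output column names `sales_variance` and `variance_percent` are chosen for this sketch.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE clean_weekly_sales "
    "(calendar_year INTEGER, week_number INTEGER, sales INTEGER)"
)
# Hypothetical figures: 100 per week before the change, 90 per week after.
rows = [(2020, week, 100) for week in range(21, 25)]
rows += [(2020, week, 90) for week in range(25, 29)]
rows.append((2019, 22, 999))  # other years are filtered out
conn.executemany("INSERT INTO clean_weekly_sales VALUES (?, ?, ?)", rows)

before_after = conn.execute(
    """
    WITH cte AS (
        SELECT week_number, SUM(sales) AS total_sales
        FROM clean_weekly_sales
        WHERE calendar_year = 2020
          AND week_number BETWEEN 21 AND 28
        GROUP BY week_number
    ),
    cte2 AS (
        SELECT SUM(CASE WHEN week_number BETWEEN 21 AND 24 THEN total_sales END) AS before_sales,
               SUM(CASE WHEN week_number BETWEEN 25 AND 28 THEN total_sales END) AS after_sales
        FROM cte
    )
    SELECT after_sales - before_sales AS sales_variance,
           ROUND(100.0 * (after_sales - before_sales) / before_sales, 2) AS variance_percent
    FROM cte2
    """
).fetchone()
print(before_after)
```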

What about the entire 12 weeks before and after?

Answer:

The SQL query consists of two Common Table Expressions (CTEs) to perform calculations on the clean_weekly_sales table.
The first CTE named packaging_sales calculates the total sales for each week within a specific date range, including 12 weeks before and after a central period.
The GROUP BY clause groups the results by week_date and week_number.
The second CTE named before_after_changes further processes the data from the packaging_sales.
It calculates the sum of total sales for the weeks before the central period (weeks from '2020-04-06' to '2020-06-07') and after the central period (weeks from '2020-06-15' to '2020-08-02').
The main query calculates the variance in sales between the "after" and "before" periods by subtracting the before_packaging_sales from the after_packaging_sales.
The query also calculates the percentage variance by dividing the variance by the before_packaging_sales and then multiplying by 100. It rounds the percentage to 2 decimal places using the ROUND function.
The calculated sales variance and percentage variance are displayed as sales_variance and variance_percentage, respectively.

Case Study #6 - Clique Bait🦞🍢🦈

Introduction
Entity Relationship Diagram
Case Study Questions & Solutions

A. Digital Analysis
B. Product Funnel Analysis
C. Campaigns Analysis

Introduction

Meet Danny, the innovative mind behind Clique Bait, an unconventional online seafood store. With a background in digital data analytics, Danny's mission is to revolutionize the seafood industry. In this study, we analyze Clique Bait's user journey and propose creative solutions to calculate funnel fallout rates. By understanding where users drop off, we help optimize the online shopping experience, aligning with Danny's vision of merging data expertise with seafood excellence. Join us in uncovering insights that drive conversion and customer satisfaction.

*Note: The data and insights presented in this case study are for illustrative purposes only and do not reflect actual data associated with Clique Bait.*

Entity Relationship Diagram

Case Study Questions & Solutions

A. Digital Analysis

How many users are there?

Answer:

The SQL query retrieves the count of distinct user_id values from the users table.
The COUNT function is used to calculate the number of distinct user IDs.
The result is presented as user_cnt in the output.

How many cookies does each user have on average?

Answer:

The SQL query calculates the average number of cookies per user.
It starts with an inner subquery that groups the data in the users table by user_id.
Within the subquery, the COUNT(DISTINCT cookie_id) function calculates the number of distinct cookie_id values for each user_id.
The result of the subquery includes the user_id and the corresponding count of distinct cookies.
The outer query then calculates the average of the cookie_count column from the subquery using the ROUND(AVG(cookie_count)) function.
The average is presented as average_cookies_per_user in the output.

What is the unique number of visits by all users per month?

Answer:

The SQL query calculates the number of unique visits for each month and year from the events table.
It uses the EXTRACT function to extract the YEAR and MONTH components from the event_time column.
The COUNT(DISTINCT visit_id) function counts the distinct visit_id values for each combination of year and month.
The results are grouped by the year and month columns.
The ORDER BY clause sorts the results in ascending order based on the year and month values.
The output includes columns for year, month, and unique_visits (the count of distinct visit IDs).

What is the number of events for each event type?

Answer:

The SQL query calculates the count of events for each event type from the events table.
It retrieves the event_type column values.
The COUNT(*) function counts the occurrences of each event type.
The results are grouped by the event_type column.
The ORDER BY clause sorts the results in ascending order based on the event_type.
The output includes columns for event_type and event_count (the count of events for each type).

What is the percentage of visits which have a purchase event?

Answer:

The SQL query calculates the percentage of visits where the event type is equal to 3, indicating a purchase event.
The COUNT(DISTINCT CASE WHEN event_type = 3 THEN visit_id END) counts the number of unique visit_ids where the event_type is 3 (purchase event).
The COUNT(DISTINCT visit_id) counts the total number of unique visit_ids in the events table.
The calculation (purchase_count * 100.0 / total_visits) computes the percentage of purchase events among all visits.
The output is a single column named purchase_percentage which represents the calculated percentage.
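
The conditional distinct count can be checked on a few rows. The sketch below assumes an events table reduced to visit_id and event_type, uses SQLite through Python's sqlite3 module in place of PostgreSQL, and works on invented data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (visit_id TEXT, event_type INTEGER)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    # Invented visits: v1 and v3 contain a purchase (event_type = 3), v2 and v4 do not.
    [("v1", 1), ("v1", 2), ("v1", 3), ("v2", 1),
     ("v3", 1), ("v3", 3), ("v4", 1), ("v4", 2)],
)

purchase_percentage = conn.execute(
    """
    SELECT COUNT(DISTINCT CASE WHEN event_type = 3 THEN visit_id END) * 100.0
           / COUNT(DISTINCT visit_id) AS purchase_percentage
    FROM events
    """
).fetchone()[0]

print(purchase_percentage)
```

The CASE expression returns NULL for non-purchase rows, and COUNT ignores NULLs, so only visits with a purchase are counted in the numerator. Multiplying by 100.0 rather than 100 keeps the division in floating point.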

What is the percentage of visits which view the checkout page but do not have a purchase event?

Answer:

The SQL query uses a Common Table Expression (CTE) named CTE to calculate specific event indicators for each visit_id.
The CTE calculates two indicators for each visit: checkout_viewed and purchase_made.
The checkout_viewed indicator is calculated using the MAX function along with a CASE statement to identify visits where event_type is 1 (page view) and page_id is 12 (indicating a checkout page view).
The purchase_made indicator is calculated similarly for visits where event_type is 3 (purchase event).
The GROUP BY visit_id groups the results by visit_id.
The main query calculates the percentage of checkout visits without a purchase from the sum of checkout_viewed and the sum of purchase_made.
The expression (1 - (SUM(purchase_made) / SUM(checkout_viewed))) takes the share of checkout visits that ended in a purchase and subtracts it from 1, leaving the share of checkout visits without a purchase.
The query then multiplies the ratio by 100, rounds the result to two decimal places using the ROUND function, and presents it as percentagewithout_purchase.
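
A small sketch of the per-visit flags and the final ratio, run on SQLite through Python's sqlite3 module rather than PostgreSQL. The table layout and rows are assumptions made for illustration. It also computes the same expression with plain integer arithmetic, because both flag sums are integers and integer division would truncate the ratio.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (visit_id TEXT, event_type INTEGER, page_id INTEGER)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    # Invented visits: v1 and v4 view checkout (page 12) and purchase,
    # v2 and v5 view checkout only, v3 never reaches checkout.
    [("v1", 1, 12), ("v1", 3, 13), ("v2", 1, 12), ("v3", 1, 2),
     ("v4", 1, 12), ("v4", 3, 13), ("v5", 1, 12)],
)

with_float, with_integer = conn.execute(
    """
    WITH cte AS (
        SELECT visit_id,
               MAX(CASE WHEN event_type = 1 AND page_id = 12 THEN 1 ELSE 0 END) AS checkout_viewed,
               MAX(CASE WHEN event_type = 3 THEN 1 ELSE 0 END) AS purchase_made
        FROM events
        GROUP BY visit_id
    )
    SELECT
        -- Multiplying by 1.0 forces floating-point division.
        ROUND(100.0 * (1 - 1.0 * SUM(purchase_made) / SUM(checkout_viewed)), 2),
        -- Same expression with integer division: 2 / 4 truncates to 0.
        100 * (1 - SUM(purchase_made) / SUM(checkout_viewed))
    FROM cte
    """
).fetchone()

print(with_float, with_integer)
```

Two of the four checkout visits end without a purchase, so the floating-point version gives 50.0 while the integer version reports 100. Casting one operand before dividing avoids the truncation.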

What are the top 3 pages by number of views?

Answer:

The SQL query retrieves page views by counting the occurrences of each page_id with event_type equal to 1 (indicating a page view).
The WHERE event_type = 1 clause filters the results to include only events of type 1 (page view).
The GROUP BY page_id groups the results by unique page_id.
The ORDER BY page_views DESC clause sorts the results in descending order based on the count of page views.
The LIMIT 3 limits the output to the top 3 page_id entries with the highest page views.

What is the number of views and cart adds for each product category?

Answer:

The SQL query retrieves statistics for product categories from the page_hierarchy and events tables.
The LEFT JOIN operation combines data from the page_hierarchy and events tables using the common page_id column.
The SUM(CASE WHEN e.event_type = 1 THEN 1 ELSE 0 END) expression calculates the total number of page views (event_type = 1) for each product category.
The SUM(CASE WHEN e.event_type = 2 THEN 1 ELSE 0 END) expression calculates the total number of cart additions (event_type = 2) for each product category.
The GROUP BY ph.product_category clause groups the results by unique product categories from the page_hierarchy table.
The ORDER BY ph.product_category clause sorts the results in ascending order based on the product category.
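
The conditional sums per category can be sketched as follows, using SQLite through Python's sqlite3 module as a stand-in for PostgreSQL. The rows are invented, and the WHERE clause that drops pages without a product category is an assumption added here so that non-product pages do not form their own group.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE page_hierarchy (page_id INTEGER, page_name TEXT, product_category TEXT)")
conn.execute("CREATE TABLE events (visit_id TEXT, page_id INTEGER, event_type INTEGER)")
conn.executemany(
    "INSERT INTO page_hierarchy VALUES (?, ?, ?)",
    [(1, "Home Page", None), (3, "Salmon", "Fish"),
     (4, "Kingfish", "Fish"), (6, "Lobster", "Shellfish")],
)
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    # event_type 1 = page view, 2 = add to cart (invented rows).
    [("v1", 1, 1), ("v1", 3, 1), ("v1", 3, 2), ("v1", 4, 1),
     ("v2", 6, 1), ("v2", 6, 2), ("v2", 3, 1)],
)

category_stats = conn.execute(
    """
    SELECT ph.product_category,
           SUM(CASE WHEN e.event_type = 1 THEN 1 ELSE 0 END) AS page_views,
           SUM(CASE WHEN e.event_type = 2 THEN 1 ELSE 0 END) AS cart_adds
    FROM page_hierarchy AS ph
    LEFT JOIN events AS e ON e.page_id = ph.page_id
    WHERE ph.product_category IS NOT NULL  -- assumption: skip non-product pages
    GROUP BY ph.product_category
    ORDER BY ph.product_category
    """
).fetchall()

print(category_stats)
```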

What are the top 3 products by purchases?

Answer:

The SQL query retrieves statistics for product purchases from the page_hierarchy and events tables.
The LEFT JOIN operation combines data from the page_hierarchy and events tables using the common page_id column.
The WHERE ph.page_name IS NOT NULL clause keeps only rows where the page_name in the page_hierarchy table is not null.
The SUM(CASE WHEN e.event_type = 3 THEN 1 ELSE 0 END) expression calculates the total number of purchases (event_type = 3) for each product name.
The GROUP BY ph.page_name clause groups the results by unique product names from the page_hierarchy table.
The ORDER BY num_purchases DESC clause sorts the results in descending order based on the number of purchases.
The LIMIT 3 clause limits the output to the top 3 products with the highest number of purchases.

B. Product Funnel Analysis

Using a single SQL query - create a new output table which has the following details:

How many times was each product viewed?
How many times was each product added to cart?
How many times was each product added to a cart but not purchased (abandoned)?
How many times was each product purchased?

Answer:

The SQL query involves multiple Common Table Expressions (CTEs) to analyze product interactions and purchases.
The first CTE named ProdView calculates the number of views for each product by joining the events and page_hierarchy tables on the page_id column.
The WHERE product_id IS NOT NULL clause filters out rows with null product IDs.
The GROUP BY clause groups the results by visit_id, product_id, product_name, and product_category.
The second CTE named ProdCart calculates the number of cart additions for each product in a similar manner to the first CTE.
The third CTE named PurchaseEvents identifies distinct visit IDs with purchase events (event_type = 3).
The fourth CTE named ProductStats combines data from the previous CTEs to calculate aggregated statistics for each product.
The main query selects product_name, product_category, and aggregates the views, cart_adds, and purchases based on different conditions.
The GROUP BY clause groups the results by product_name and product_category.
The ORDER BY product_name clause sorts the results alphabetically by product name.
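
The funnel logic can be condensed into a short sketch. This is not the query described above: it folds views and cart adds into a single CTE (hypothetical names product_events and purchase_visits) instead of separate ProdView and ProdCart CTEs, runs on SQLite through Python's sqlite3 module instead of PostgreSQL, and uses invented rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE page_hierarchy (page_id INTEGER, page_name TEXT, product_id INTEGER)")
conn.execute("CREATE TABLE events (visit_id TEXT, page_id INTEGER, event_type INTEGER)")
conn.executemany(
    "INSERT INTO page_hierarchy VALUES (?, ?, ?)",
    [(1, "Home Page", None), (3, "Salmon", 1), (6, "Lobster", 2), (13, "Confirmation", None)],
)
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    # event_type 1 = view, 2 = add to cart, 3 = purchase (invented rows).
    [("v1", 1, 1), ("v1", 3, 1), ("v1", 3, 2), ("v1", 6, 1), ("v1", 13, 3),
     ("v2", 3, 1), ("v2", 3, 2), ("v2", 6, 1), ("v2", 6, 2),
     ("v3", 6, 1), ("v3", 6, 2), ("v3", 13, 3)],
)

funnel = conn.execute(
    """
    WITH product_events AS (
        -- One row per visit and product with view and cart-add counts.
        SELECT e.visit_id, ph.page_name AS product_name,
               SUM(CASE WHEN e.event_type = 1 THEN 1 ELSE 0 END) AS viewed,
               SUM(CASE WHEN e.event_type = 2 THEN 1 ELSE 0 END) AS added
        FROM events AS e
        JOIN page_hierarchy AS ph ON ph.page_id = e.page_id
        WHERE ph.product_id IS NOT NULL
        GROUP BY e.visit_id, ph.page_name
    ),
    purchase_visits AS (
        SELECT DISTINCT visit_id FROM events WHERE event_type = 3
    )
    SELECT pe.product_name,
           SUM(pe.viewed) AS views,
           SUM(pe.added) AS cart_adds,
           SUM(CASE WHEN pe.added > 0 AND pv.visit_id IS NULL THEN 1 ELSE 0 END) AS abandoned,
           SUM(CASE WHEN pe.added > 0 AND pv.visit_id IS NOT NULL THEN 1 ELSE 0 END) AS purchases
    FROM product_events AS pe
    LEFT JOIN purchase_visits AS pv ON pv.visit_id = pe.visit_id
    GROUP BY pe.product_name
    ORDER BY pe.product_name
    """
).fetchall()

print(funnel)
```

A product counts as purchased only when it was added to the cart in a visit that also contains a purchase event, and as abandoned when it was added in a visit without one. In visit v1 Lobster is viewed but never added, so it contributes to neither column.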

Answer:

The SQL query analyzes product page events, purchases, and statistics based on different categories.
The first CTE named ProductPageEvents calculates page views and cart additions for each product by joining the events and page_hierarchy tables on the page_id column.
The WHERE product_id IS NOT NULL clause filters out rows with null product IDs.
The GROUP BY clause groups the results by visit_id, product_id, product_name, and product_category.
The second CTE named PurchaseEvents identifies distinct visit IDs with purchase events (event_type = 3).
The third CTE named CombinedTable combines data from the previous CTEs and determines if a visit resulted in a purchase.
The fourth CTE named ProductInfo aggregates statistics for each product, including views, cart additions, abandoned carts, and successful purchases.
The GROUP BY clause groups the results by product_id, product_name, and product_category.
The fifth CTE named CategoryStats aggregates category-level statistics based on the product_category.
The GROUP BY clause groups the results by product_category.
The main query selects product_category and the aggregated statistics for total category views, cart additions, abandoned carts, and purchases.

Use your 2 new output tables - answer the following questions:

Which product had the most views, cart adds and purchases?

Answer:

The SQL query focuses on analyzing product page events, purchases, and statistics to find the most viewed, most cart-added, and most purchased product.
The first CTE named ProductPageEvents calculates page views and cart additions for each product by joining the events and page_hierarchy tables on the page_id column.
The WHERE product_id IS NOT NULL clause filters out rows with null product IDs.
The GROUP BY clause groups the results by visit_id, product_id, product_name, and product_category.
The second CTE named PurchaseEvents identifies distinct visit IDs with purchase events (event_type = 3).
The third CTE named CombinedTable combines data from the previous CTEs and determines if a visit resulted in a purchase.
The fourth CTE named ProductInfo aggregates statistics for each product, including views, cart additions, abandoned carts, and successful purchases.
The GROUP BY clause groups the results by product_id, product_name, and product_category.
The main query selects product_name and aggregates the maximum views, cart additions, and purchases for each product.
The GROUP BY clause groups the results by product_name.
The ORDER BY clause orders the results based on most views, most cart additions, and most purchases in descending order.
The LIMIT 1 clause restricts the output to only the top product with the most views, cart additions, and purchases.

Which product was most likely to be abandoned?

Answer:

The SQL query aims to identify the product with the highest likelihood of abandoned cart based on statistics.
The first CTE named ProductPageEvents calculates page views and cart additions for each product by joining the events and page_hierarchy tables on the page_id column.
The WHERE product_id IS NOT NULL clause filters out rows with null product IDs.
The GROUP BY clause groups the results by visit_id, product_id, product_name, and product_category.
The second CTE named PurchaseEvents identifies distinct visit IDs with purchase events (event_type = 3).
The third CTE named CombinedTable combines data from the previous CTEs and determines if a visit resulted in a purchase.
The fourth CTE named ProductInfo aggregates statistics for each product, including page views, cart additions, abandoned carts, and successful purchases.
The GROUP BY clause groups the results by product_id, product_name, and product_category.
The main query selects product_name and calculates the percentage of abandoned carts for each product using the formula (abandoned / cart_adds) * 100.
The calculated abandoned percentage and product name are included in the derived table AbandonedPercentage.
The outer query selects the product name and maximum abandoned percentage for each product.
The GROUP BY clause groups the results by product_name.
The ORDER BY clause orders the results based on the most likely abandoned percentage in descending order.
The LIMIT 1 clause restricts the output to only the top product with the highest likelihood of abandoned cart.

Which product had the highest view to purchase percentage?

Answer:

The SQL query aims to identify the product with the highest view-to-purchase conversion rate based on statistics.
The first CTE named ProductPageEvents calculates page views and cart additions for each product by joining the events and page_hierarchy tables on the page_id column.
The WHERE product_id IS NOT NULL clause filters out rows with null product IDs.
The GROUP BY clause groups the results by visit_id, product_id, product_name, and product_category.
The second CTE named PurchaseEvents identifies distinct visit IDs with purchase events (event_type = 3).
The third CTE named CombinedTable combines data from the previous CTEs and determines if a visit resulted in a purchase.
The fourth CTE named ProductInfo aggregates statistics for each product, including page views, cart additions, abandoned carts, and successful purchases.
The GROUP BY clause groups the results by product_id, product_name, and product_category.
The fifth CTE named ViewToPurchase calculates the view-to-purchase conversion percentage for each product using the formula (purchases / views) * 100.
The calculated view-to-purchase percentage and product name are included in the derived table ViewToPurchase.
The main query selects the product name with the highest view-to-purchase percentage based on the view_to_purchase_percentage column.
The ORDER BY clause orders the results based on the view-to-purchase percentage in descending order.
The LIMIT 1 clause restricts the output to only the top product with the highest view-to-purchase conversion rate.

What is the average conversion rate from view to cart add?

Answer:

The SQL query calculates the average conversion rate for products based on their page views and cart additions.
The first CTE named ProductPageEvents calculates page views and cart additions for each product by joining the events and page_hierarchy tables on the page_id column.
The WHERE product_id IS NOT NULL clause filters out rows with null product IDs.
The GROUP BY clause groups the results by visit_id, product_id, product_name, and product_category.
The second CTE named ProductStats aggregates page views and cart additions for each product from the ProductPageEvents CTE.
The GROUP BY clause groups the results by product_name.
The main query calculates the average conversion rate using the formula (cart_adds / views) * 100 for each product and then calculates the average of these conversion rates using the AVG function.
The result is displayed as average_conversion_rate.
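
The averaging step can be illustrated on pre-aggregated counts. The sketch below assumes a hypothetical product_stats table that already holds views and cart_adds per product, with invented numbers, and runs on SQLite through Python's sqlite3 module rather than PostgreSQL. It also computes the pooled rate (all cart adds over all views) for comparison, since the two figures generally differ.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical pre-aggregated table standing in for the ProductStats CTE.
conn.execute("CREATE TABLE product_stats (product_name TEXT, views INTEGER, cart_adds INTEGER)")
conn.executemany(
    "INSERT INTO product_stats VALUES (?, ?, ?)",
    [("Salmon", 100, 60), ("Lobster", 200, 130), ("Crab", 50, 30)],  # invented counts
)

average_rate, pooled_rate = conn.execute(
    """
    SELECT
        -- Mean of the per-product rates: every product weighs the same.
        ROUND(AVG(100.0 * cart_adds / views), 2) AS average_conversion_rate,
        -- Pooled rate: products with more views weigh more.
        ROUND(100.0 * SUM(cart_adds) / SUM(views), 2) AS pooled_conversion_rate
    FROM product_stats
    """
).fetchone()

print(average_rate, pooled_rate)
```

Here the per-product rates are 60, 65 and 60 percent, so their average is 61.67, while the pooled rate is 220 / 350, or 62.86 percent. Which one answers the question depends on whether every product should count equally.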

C. Campaigns Analysis

Generate a table that has 1 single row for every unique visit_id record and has the following columns:
- user_id
- visit_id
- visit_start_time: the earliest event_time for each visit
- page_views: count of page views for each visit
- cart_adds: count of product cart add events for each visit
- purchase: 1/0 flag if a purchase event exists for each visit
- campaign_name: map the visit to a campaign if the visit_start_time falls between the start_date and end_date
- impression: count of ad impressions for each visit
- click: count of ad clicks for each visit
- (Optional column) cart_products: a comma separated text value with products added to the cart sorted by the order they were added to the cart (hint: use the sequence_number)

Answer:

The SQL query retrieves user-related data along with various event metrics from the events table.
The SELECT statement selects columns such as user_id, visit_id, and more.
The MIN(e.event_time) function is used to find the minimum event time for each visit, serving as the visit start time.
The SUM(CASE WHEN e.event_type = 1 THEN 1 ELSE 0 END) aggregates the count of page views using a conditional statement.
The SUM(CASE WHEN e.event_type = 2 THEN 1 ELSE 0 END) aggregates the count of cart additions using a similar conditional statement.
The MAX(CASE WHEN e.event_type = 3 THEN 1 ELSE 0 END) identifies if a purchase event occurred during the visit.
The LEFT JOIN clauses join the campaign_identifier and page_hierarchy tables based on event time and page ID respectively.
The GROUP BY clause groups the results by user_id, visit_id, and campaign_name.
The STRING_AGG function concatenates page names for cart products using a comma separator.
The result of the query provides insights into user behavior during visits, including metrics like page views, cart additions, purchases, campaign associations, and more.

--------------------------------------------------------------------------------
/Case Study #7 - Balanced Tree Clothing Co/README.md:
--------------------------------------------------------------------------------

Case Study #7 - Balanced Tree Clothing Co.👚🧶

Introduction
Entity Relationship Diagram
Case Study Questions & Solutions

A. High Level Sales Analysis
B. Transaction Analysis
C. Product Analysis

Introduction

In this case study, we delve into the operations of Balanced Tree Clothing Company, a leading fashion enterprise renowned for curating a meticulously tailored selection of apparel and lifestyle gear designed for contemporary explorers. At the helm is Danny, the visionary CEO, who has enlisted our expertise in scrutinizing their sales accomplishments and creating a fundamental financial synopsis destined to be disseminated across the organization. This analysis offers valuable insights into their business strategies and performance metrics.

Entity Relationship Diagram

Case Study Questions & Solutions

A. High Level Sales Analysis

What was the total quantity sold for all products?

Answer:

The SQL query calculates the total quantity sold across all products in the balanced_tree.sales table.
The SELECT statement uses the SUM(qty) function to add up the qty column over every row.
The result of the query is a single value: the total number of units sold.

What is the total generated revenue for all products before discounts?

Answer:

The SQL query calculates the total revenue for all products before discounts from the balanced_tree.sales table.
For each row, price * qty gives the revenue for that line item before any discount is applied.
The SUM function then adds these values across all rows.
The result of the query is the total revenue generated before discounts.

What was the total discount amount for all products?

Answer:

The SQL query calculates the total discount amount for all sales in the balanced_tree.sales table.
For each row, price * qty is the revenue before discount and price * qty * discount / 100 is the discount amount, since discount is stored as a percentage.
Note that the expression (price * qty) - (price * qty * (discount / 100)) gives the revenue after discount rather than the discount itself, so only the second term is summed to answer this question.
The SUM function aggregates the per-row discount amounts to yield the total discount amount.
The result of the query provides the total amount saved due to discounts across all sales in the dataset.
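
One detail worth checking when writing the discount formula is integer division. The sketch below assumes that qty, price and discount are stored as integers, uses SQLite through Python's sqlite3 module in place of PostgreSQL, and works on two invented rows. Both engines truncate when dividing one integer by another.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (prod_id TEXT, qty INTEGER, price INTEGER, discount INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?)",
    [("p1", 2, 10, 17), ("p2", 1, 40, 25)],  # invented rows; discount is a percentage
)

total_discount, truncated_discount = conn.execute(
    """
    SELECT
        -- Dividing by 100.0 keeps the fractional part: 3.4 + 10.0.
        SUM(price * qty * discount / 100.0) AS total_discount,
        -- (discount / 100) is integer division, so 17 / 100 and 25 / 100 are both 0.
        SUM(price * qty * (discount / 100)) AS truncated_discount
    FROM sales
    """
).fetchone()

print(total_discount, truncated_discount)
```

With integer columns the parenthesised form silently returns 0, so dividing by 100.0 or casting to a numeric type first is the safer way to write it.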

B. Transaction Analysis

How many unique transactions were there?

Answer:

The SQL query calculates the number of unique transactions in the balanced_tree.sales table.
It uses the COUNT(DISTINCT txn_id) aggregation function to count the distinct values of the txn_id column, which represents unique transactions.
The result of the query provides the count of unique transactions present in the sales dataset.

What is the average unique products purchased in each transaction?

Answer:

The SQL query calculates the average number of unique products per transaction in the balanced_tree.sales table.
It uses a subquery to calculate the count of distinct prod_id values for each txn_id, which represents the unique products in each transaction.
The outer query then calculates the average of the counts obtained from the subquery, providing the average number of unique products per transaction.
The result of the query gives insight into the average diversity of products bought per transaction.

What are the 25th, 50th and 75th percentile values for the revenue per transaction?

Answer:

The SQL query calculates the 25th, 50th (median), and 75th percentiles of revenue per transaction in the balanced_tree.sales table.
It uses a common table expression (CTE) named revenue_cte to calculate the total revenue for each txn_id by summing the product of price and qty.
The main query then applies the PERCENTILE_CONT function to calculate the desired percentiles (0.25, 0.5, and 0.75) of the calculated revenue values.
The results provide insights into the distribution of revenue across transactions, with the median being the 50th percentile.
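
PERCENTILE_CONT returns a continuous percentile: it sorts the values and interpolates linearly between neighbours when the requested fraction falls between two rows. The plain Python sketch below mirrors that definition on invented revenue figures, so the numbers from the SQL query can be sanity-checked by hand. The helper name percentile_cont is chosen here for illustration and is not a library function.

```python
import math

def percentile_cont(values, fraction):
    """Linear-interpolation percentile over a list of numbers (illustrative helper)."""
    ordered = sorted(values)
    if not ordered:
        return None
    # Zero-based fractional position of the requested percentile.
    position = fraction * (len(ordered) - 1)
    lower = math.floor(position)
    upper = math.ceil(position)
    weight = position - lower
    # Interpolate between the two neighbouring values.
    return ordered[lower] + (ordered[upper] - ordered[lower]) * weight

# Invented revenue-per-transaction values.
revenues = [400, 100, 300, 200]
quartiles = [percentile_cont(revenues, f) for f in (0.25, 0.5, 0.75)]
print(quartiles)
```

With four values the median falls halfway between 200 and 300, which is why the 50th percentile is 250 even though no transaction has that revenue.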

What is the average discount value per transaction?

Answer:

The SQL query calculates the average discount per transaction in the balanced_tree.sales table.
It calculates the total discount amount for each transaction by multiplying the qty, price, and discount columns and then dividing by 100.
The main query uses the SUM function to calculate the sum of all the calculated discount amounts and then divides it by the total number of distinct txn_id values using the COUNT function.
The result is the average discount per transaction, rounded using the ROUND function.

What is the percentage split of all transactions for members vs non-members?

Answer:

The SQL query categorizes transactions into two groups based on the member column: 'Member' and 'Non-Member'.
The CASE statement is used to transform the 't' and 'f' values in the member column into more meaningful labels.
The COUNT function calculates the number of transactions for each member status category.
The main query calculates the percentage of transactions for each member status category out of the total number of transactions in the balanced_tree.sales table.
The subquery (SELECT COUNT(txn_id) FROM balanced_tree.sales) calculates the total number of transactions in the table.
The result is the member status ('Member' or 'Non-Member'), the transaction count for each category, and the percentage of transactions represented by each category, rounded to two decimal places using the ROUND function.
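
A sketch of the split on invented rows, run on SQLite through Python's sqlite3 module rather than PostgreSQL. It differs from the description above in one respect, which is an assumption made here: because the sales table holds one row per product in a transaction, the sketch counts DISTINCT txn_id so that large baskets are not counted several times.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (txn_id TEXT, prod_id TEXT, member TEXT)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    # Invented rows: transactions a, b, d are member purchases, c is not.
    [("a", "p1", "t"), ("a", "p2", "t"), ("b", "p1", "t"),
     ("c", "p1", "f"), ("c", "p2", "f"), ("c", "p3", "f"), ("d", "p2", "t")],
)

member_split = conn.execute(
    """
    SELECT CASE WHEN member = 't' THEN 'Member' ELSE 'Non-Member' END AS member_status,
           COUNT(DISTINCT txn_id) AS transaction_count,
           ROUND(100.0 * COUNT(DISTINCT txn_id)
                 / (SELECT COUNT(DISTINCT txn_id) FROM sales), 2) AS percentage
    FROM sales
    GROUP BY member_status
    ORDER BY member_status
    """
).fetchall()

print(member_split)
```

Counting rows instead of distinct transactions would give 4 of 7 rows (57.14%) for members on this data rather than 3 of 4 transactions (75%).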

What is the average revenue for member transactions and non-member transactions?

Answer:

The SQL query categorizes transactions into two groups based on the member column: 'Member' and 'Non-Member'.
The CASE statement is used to transform the 't' and 'f' values in the member column into more meaningful labels.
The AVG function calculates the average revenue for each member status category from the product of the price and qty columns, and the ROUND function rounds the result to two decimal places.
The GROUP BY clause groups the results by the transformed member status.
The result displays the member status ('Member' or 'Non-Member') and the average revenue for each category.

C. Product Analysis

What are the top 3 products by total revenue before discount?

Answer:

The SQL query retrieves product-specific information along with the total revenue generated from sales before any discounts.
The balanced_tree.product_details table is joined with the balanced_tree.sales table using the product_id and prod_id columns.
The SUM function calculates the total revenue for each product by multiplying the price and qty columns from the sales table.
The GROUP BY clause groups the results by product_id and product_name.
The ORDER BY clause arranges the results in descending order of total revenue before discount.
The LIMIT clause limits the output to the top 3 products with the highest total revenue before discount.
The result displays the product ID, product name, and total revenue generated from sales for the top 3 products.

What is the total quantity, revenue and discount for each segment?

Answer:

The SQL query retrieves information about product segments, along with aggregated data on total quantity sold, total revenue generated, and total discounts applied.
The balanced_tree.product_details table is joined with the balanced_tree.sales table using the product_id and prod_id columns.
The SUM function calculates the total quantity sold (qty), total revenue generated by multiplying price and qty, and total discount amount by considering qty, price, and discount columns from the sales table.
The GROUP BY clause groups the results by segment_name from the product_details table.
The ORDER BY clause arranges the results in descending order of total revenue.
The result displays the product segment name, total quantity sold, total revenue generated, and total discount amount for each segment.

What is the top selling product for each segment?

Answer:

The SQL query calculates and presents information about the top-selling products within each product segment.
The query starts by creating a Common Table Expression (CTE) named top_selling_products.
The balanced_tree.product_details table is joined with the balanced_tree.sales table using the product_id and prod_id columns.
The SUM function calculates the total quantity sold (qty) for each product within a segment.
The ROW_NUMBER() window function is used with the OVER clause to assign a rank to each product within its corresponding segment based on total quantity sold. The PARTITION BY clause ensures that the ranking is done separately for each segment.
The main query selects columns segment_id, segment_name, product_id, product_name, and total_quantity from the top_selling_products CTE.
The WHERE clause filters the results to include only rows with a rank of 1, representing the top-selling product within each segment.
The result displays the segment ID, segment name, product ID, product name, and total quantity sold for the top-selling product within each segment.
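
The ranking step can be sketched on a few invented rows. The example runs on SQLite through Python's sqlite3 module in place of PostgreSQL (window functions need SQLite 3.25 or newer), and it splits the work into two CTEs, product_totals and top_selling_products, so that the aggregation and the ranking are easy to read separately. The product names are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE product_details "
    "(product_id TEXT, product_name TEXT, segment_id INTEGER, segment_name TEXT)"
)
conn.execute("CREATE TABLE sales (prod_id TEXT, qty INTEGER)")
conn.executemany(
    "INSERT INTO product_details VALUES (?, ?, ?, ?)",
    [("p1", "Jeans A", 3, "Jeans"), ("p2", "Jeans B", 3, "Jeans"),
     ("p3", "Jacket A", 4, "Jacket"), ("p4", "Jacket B", 4, "Jacket")],
)
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("p1", 3), ("p1", 2), ("p2", 4), ("p3", 1), ("p4", 5), ("p4", 1)],
)

top_products = conn.execute(
    """
    WITH product_totals AS (
        SELECT pd.segment_id, pd.segment_name, pd.product_id, pd.product_name,
               SUM(s.qty) AS total_quantity
        FROM product_details AS pd
        JOIN sales AS s ON s.prod_id = pd.product_id
        GROUP BY pd.segment_id, pd.segment_name, pd.product_id, pd.product_name
    ),
    top_selling_products AS (
        -- Number the products within each segment, best seller first.
        SELECT *, ROW_NUMBER() OVER (
                   PARTITION BY segment_id ORDER BY total_quantity DESC
               ) AS row_num
        FROM product_totals
    )
    SELECT segment_id, segment_name, product_id, product_name, total_quantity
    FROM top_selling_products
    WHERE row_num = 1
    ORDER BY segment_id
    """
).fetchall()

print(top_products)
```

ROW_NUMBER returns exactly one product per segment even when two products tie on quantity. RANK would return every tied product instead.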

What is the total quantity, revenue and discount for each category?

Answer:

The SQL query calculates and presents aggregated information about product categories based on sales data.
The query starts by selecting the category_id and category_name columns from the balanced_tree.product_details table.
The SUM function is used to calculate the total quantity sold (qty) and the total revenue earned (qty * price) for each product category.
The formula qty * price * discount / 100 is used to calculate the total discount amount for each product category, and the result is summed using the SUM function.
The JOIN clause connects the balanced_tree.product_details table with the balanced_tree.sales table using the product_id and prod_id columns.
The GROUP BY clause groups the results by category_id and category_name.
The query calculates the total quantity, total revenue, and total discount for each product category.
The ORDER BY clause sorts the results by total revenue in descending order.
The query presents the results, including the category_id, category_name, total quantity sold, total revenue earned, and total discount amount for each product category.

What is the top selling product for each category?

Answer:

The SQL query aims to identify the top-selling products within each product category based on sales data.
The query starts by creating a Common Table Expression (CTE) named top_selling_cte.
Within the CTE, the query selects relevant columns such as category_id, category_name, product_id, and product_name from the balanced_tree.product_details table.
The SUM function calculates the total quantity sold (qty) for each product and is grouped by category_id, category_name, product_id, and product_name.
The RANK() window function assigns a ranking to products within each product category, ordering them by total quantity sold in descending order.
The PARTITION BY clause divides the data into partitions based on category_id, ensuring that ranking is done within each category.
The main query selects columns from the CTE, including category_id, category_name, product_id, product_name, and total_quantity.
The WHERE clause filters the results to include only rows where the product has the highest ranking (ranking = 1) within its category.
The query presents the top-selling products within each product category, displaying the category_id, category_name, product_id, product_name, and total quantity sold for each product.

What is the percentage split of revenue by product for each segment?

Answer:

The SQL query calculates the revenue distribution percentage of each product within its corresponding segment based on sales data.
The query begins by defining a Common Table Expression (CTE) named segment_revenue.
Within the CTE, the query selects columns such as segment_id, segment_name, prod_id, product_name, and the total revenue (total_revenue) for each product.
The SUM function calculates the total revenue by multiplying the price and quantity for each product, and the results are grouped by segment_id, segment_name, prod_id, and product_name.
The main query selects columns from the segment_revenue CTE, including segment_id, segment_name, prod_id, product_name, and total_revenue.
The ROUND function is used to calculate the revenue percentage of each product within its segment.
The percentage is calculated as the total revenue of the product divided by the total revenue of all products within the same segment, then multiplied by 100.
The SUM(total_revenue) is calculated over a window partitioned by segment_id, ensuring that the percentage is calculated for each segment separately.
The calculated revenue percentage is displayed as revenue_percentage.
The query presents the segment-wise revenue distribution for each product, showing the segment_id, segment_name, prod_id, product_name, total revenue, and the revenue percentage.
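
The windowed percentage can be tried on a small invented dataset. The sketch below runs on SQLite through Python's sqlite3 module as a stand-in for PostgreSQL (window functions need SQLite 3.25 or newer). The table layouts and product names are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE product_details "
    "(product_id TEXT, product_name TEXT, segment_id INTEGER, segment_name TEXT)"
)
conn.execute("CREATE TABLE sales (prod_id TEXT, qty INTEGER, price INTEGER)")
conn.executemany(
    "INSERT INTO product_details VALUES (?, ?, ?, ?)",
    [("p1", "Jeans A", 3, "Jeans"), ("p2", "Jeans B", 3, "Jeans"),
     ("p3", "Jacket A", 4, "Jacket"), ("p4", "Jacket B", 4, "Jacket")],
)
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("p1", 2, 10), ("p1", 1, 10), ("p2", 1, 70), ("p3", 5, 10), ("p4", 3, 50)],
)

revenue_split = conn.execute(
    """
    WITH segment_revenue AS (
        SELECT pd.segment_id, pd.segment_name, s.prod_id, pd.product_name,
               SUM(s.price * s.qty) AS total_revenue
        FROM sales AS s
        JOIN product_details AS pd ON pd.product_id = s.prod_id
        GROUP BY pd.segment_id, pd.segment_name, s.prod_id, pd.product_name
    )
    SELECT segment_name, product_name, total_revenue,
           -- The window sum is the segment total, repeated on every row of the segment.
           ROUND(100.0 * total_revenue
                 / SUM(total_revenue) OVER (PARTITION BY segment_id), 2) AS revenue_percentage
    FROM segment_revenue
    ORDER BY segment_id, prod_id
    """
).fetchall()

print(revenue_split)
```

Changing the PARTITION BY column to category_id, or using an empty OVER (), gives the per-category and overall splits with the same pattern.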

What is the percentage split of revenue by segment for each category?

Answer:

The SQL query calculates the revenue distribution percentage of each segment within its corresponding category based on sales data.
The query begins by defining a Common Table Expression (CTE) named category_segment_revenue.
Within the CTE, the query selects columns such as category_id, category_name, segment_id, segment_name, and the total revenue (total_revenue) for each segment within its category.
The SUM function calculates the total revenue by multiplying the price and quantity for each sale, and the results are grouped by category_id, category_name, segment_id, and segment_name.
The main query selects columns from the category_segment_revenue CTE, including category_id, category_name, segment_id, segment_name, and total_revenue.
The ROUND function is used to calculate the revenue percentage of each segment within its category.
The percentage is calculated as the total revenue of the segment divided by the total revenue of all segments within the same category, then multiplied by 100.
The SUM(total_revenue) is calculated over a window partitioned by category_id, ensuring that the percentage is calculated for each category separately.
The calculated revenue percentage is displayed as revenue_percentage.
The query presents the category-wise revenue distribution for each segment, showing the category_id, category_name, segment_id, segment_name, total revenue, and the revenue percentage.

What is the percentage split of total revenue by category?

Answer:

The SQL query calculates the revenue distribution percentage for each category based on sales data and product details.
The query begins by defining a Common Table Expression (CTE) named category_revenue.
Within the CTE, the query selects columns such as category_id, category_name, and the total revenue (total_revenue) for each category.
The SUM function calculates the total revenue by multiplying the price and quantity for each sale, and the results are grouped by category_id and category_name.
The main query selects columns from the category_revenue CTE, including category_id, category_name, and total_revenue.
The ROUND function is used to calculate the revenue percentage of each category.
The percentage is calculated as the total revenue of the category divided by the overall total revenue, then multiplied by 100.
The SUM(total_revenue) is calculated over the entire result set using the OVER () clause, ensuring that the percentage is calculated based on the total revenue across all categories.
The calculated revenue percentage is displayed as revenue_percentage.
The query presents the revenue distribution for each category, showing the category_id, category_name, total revenue, and the revenue percentage.

--------------------------------------------------------------------------------
/Case Study #8 - Fresh Segments/README.md:
--------------------------------------------------------------------------------

Case Study #8 - Fresh Segments👨🏼‍💻

Introduction
Case Study Questions & Solutions

A. Data Exploration and Cleansing
B. Interest Analysis
C. Segment Analysis
D. Index Analysis

Introduction

Discover the transformative journey of Fresh Segments, a pioneering digital marketing agency founded by Danny. By decoding online ad click behavior trends tailored to individual businesses, Fresh Segments empowers companies to understand their customer base like never before.

In this case study, we delve into Fresh Segments' innovative methodology. Clients entrust their customer lists to Fresh Segments, which then compiles and analyzes interest metrics, creating a comprehensive dataset for detailed examination.

The agency's forte lies in unraveling interest composition and rankings for each client, spotlighting the proportions of customers interacting with specific online assets. These insights, presented monthly, offer a holistic view of evolving customer interests.

Danny seeks our expertise in deciphering aggregated metrics for a sample client. Our mission is to distill key insights from this data, shedding light on customer lists and the array of captivating interests. Through this analysis, we illuminate overarching patterns, offering strategic guidance to propel Fresh Segments forward.

Case Study Questions & Solutions

A. Data Exploration and Cleansing

Update the fresh_segments.interest_metrics table by modifying the month_year column to be a date data type with the start of the month

Answer:

Drop Column Command: Removes the month_year column from the interest_metrics table.
Alter Table - Add Column Command: Adds a new month_year column of type DATE to the interest_metrics table.
Update Command: Populates the new month_year column for existing records in the interest_metrics table. It builds each value from the _year and _month columns using the TO_DATE function, so every date falls on the first day of its month.
Select All Records Command: Retrieves and displays all records from the interest_metrics table, presumably to view the changes made.

What is the count of records in the fresh_segments.interest_metrics table for each month_year value, sorted in chronological order (earliest to latest) with the null values appearing first?

Answer:

Select Command: Retrieves data from the interest_metrics table.
Columns Selected:
- month_year: Represents the month and year of the records.
- COUNT(*): Calculates the number of records for each month_year.
Group By Clause: Groups the records based on the month_year column.
Order By Clause: Arranges the results in ascending order of month_year. Records with NULL month_year values are listed first (NULLS FIRST).
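
The grouping and NULL-first ordering described above can be sketched on a toy table. This is a minimal illustration, not the repository's query: it runs the SQL through Python's built-in sqlite3 module as a stand-in for PostgreSQL (assuming SQLite 3.30 or later, which accepts NULLS FIRST), and the sample rows are invented.

```python
import sqlite3

# In-memory SQLite database standing in for PostgreSQL (an assumption of this sketch).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE interest_metrics (month_year TEXT, interest_id INTEGER)")

# Invented sample rows: two records have no month_year.
conn.executemany(
    "INSERT INTO interest_metrics VALUES (?, ?)",
    [
        ("2018-07-01", 1),
        ("2018-07-01", 2),
        ("2018-08-01", 1),
        (None, 3),
        (None, None),
    ],
)

# Count records per month_year; the NULL group is listed before the dated groups.
rows = conn.execute(
    """
    SELECT month_year, COUNT(*) AS record_count
    FROM interest_metrics
    GROUP BY month_year
    ORDER BY month_year ASC NULLS FIRST
    """
).fetchall()

for month_year, record_count in rows:
    print(month_year, record_count)
```

All NULL month_year rows collapse into a single group, which is why they show up as one line at the top of the result.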

What do you think we should do with these null values in the fresh_segments.interest_metrics table?

Answer:

Select Command: Retrieves data from the interest_metrics table.
Columns Selected:
- month_year: Represents the month and year of the records.
- COUNT(*): Calculates the number of records for each month_year.
Where Clause: Keeps only records where month_year is not NULL, which excludes the NULL rows from the counts.
Group By Clause: Groups the non-null records based on the month_year column.
Order By Clause: Arranges the results in ascending order of month_year.

How many interest_id values exist in the fresh_segments.interest_metrics table but not in the fresh_segments.interest_map table? What about the other way around?

Answer:

Select Command: Retrieves data from the interest_metrics table.
Columns Selected:
- COUNT(DISTINCT im.interest_id): Counts the number of distinct interest_id values that are not present in the interest_map table.
Join: Performs a left join between interest_metrics and interest_map using the interest_id and id columns.
Where Clause: Keeps only rows where the id column from the interest_map table is NULL, which indicates that the interest_id has no match in the map.

Answer:

Select Command: Retrieves data from the interest_map table.
Columns Selected:
- COUNT(DISTINCT map.id): Counts the number of distinct id values that are not present in the interest_metrics table.
Join: Performs a left join between interest_map and interest_metrics using the id and interest_id columns.
Where Clause: Keeps only rows where the interest_id column from the interest_metrics table is NULL, which indicates that the id has no match in the metrics.
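
Both directions of this check follow the same anti-join pattern: left join, then keep only the rows where the right-hand key is NULL. The sketch below runs the pattern on invented data, with Python's sqlite3 module standing in for PostgreSQL; the helper name `count_unmatched` is hypothetical and exists only for this illustration.

```python
import sqlite3

# In-memory SQLite database standing in for PostgreSQL (an assumption of this sketch).
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE interest_metrics (interest_id INTEGER, month_year TEXT);
    CREATE TABLE interest_map (id INTEGER, interest_name TEXT);

    -- Invented rows: interest 9 is missing from the map; ids 3 and 4 have no metrics.
    INSERT INTO interest_metrics VALUES
        (1, '2018-07-01'), (1, '2018-08-01'), (2, '2018-07-01'), (9, '2018-07-01');
    INSERT INTO interest_map VALUES (1, 'A'), (2, 'B'), (3, 'C'), (4, 'D');
    """
)


def count_unmatched(sql):
    """Hypothetical helper: run a single-value query and return that value."""
    return conn.execute(sql).fetchone()[0]


# interest_id values in interest_metrics with no matching id in interest_map.
metrics_only = count_unmatched(
    """
    SELECT COUNT(DISTINCT im.interest_id)
    FROM interest_metrics AS im
    LEFT JOIN interest_map AS map ON im.interest_id = map.id
    WHERE map.id IS NULL
    """
)

# id values in interest_map with no matching interest_id in interest_metrics.
map_only = count_unmatched(
    """
    SELECT COUNT(DISTINCT map.id)
    FROM interest_map AS map
    LEFT JOIN interest_metrics AS im ON map.id = im.interest_id
    WHERE im.interest_id IS NULL
    """
)

print(metrics_only, map_only)
```

COUNT(DISTINCT ...) matters in the first query because an interest can appear in interest_metrics once per month, and each unmatched interest should be counted only once.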

Summarise the id values in the fresh_segments.interest_map by its total record count in this table

Answer:

Select Command: Retrieves data from the interest_map and interest_metrics tables.
Columns Selected:
- id: The unique identifier of the interest in the interest_map table.
- interest_name: The name of the interest in the interest_map table.
- COUNT(*): Calculates the count of records.
- AS record_count: Renames the count column as record_count.
Join: Performs an inner join between interest_map and interest_metrics using the id and interest_id columns.
Group By: Groups the result set by id and interest_name.
Order By: Orders the results in descending order based on the record_count column.

What sort of table join should we perform for our analysis and why? Check your logic by checking the rows where interest_id = 21246 in your joined output and include all columns from fresh_segments.interest_metrics and all columns from fresh_segments.interest_map except from the id column.

Answer:

Select Command: Retrieves data from the interest_metrics and interest_map tables.
Columns Selected:
- im.*: Selects all columns from the interest_metrics table.
- map.interest_name AS mapped_interest_name: Retrieves the interest_name column from the interest_map table and aliases it as mapped_interest_name.
- map.interest_summary AS mapped_interest_summary: Retrieves the interest_summary column from the interest_map table and aliases it as mapped_interest_summary.
- map.created_at AS mapped_created_at: Retrieves the created_at column from the interest_map table and aliases it as mapped_created_at.
- map.last_modified AS mapped_last_modified: Retrieves the last_modified column from the interest_map table and aliases it as mapped_last_modified.
Join: Performs a left join between interest_metrics and interest_map using the interest_id and id columns, casting both to VARCHAR for comparison.
Where: Filters the result to only include rows where interest_id is equal to '21246'.

Are there any records in your joined table where the month_year value is before the created_at value from the fresh_segments.interest_map table? Do you think these values are valid and why?

Answer:

Select Command: Retrieves data from the interest_metrics and interest_map tables.
Columns Selected:
- im.*: Selects all columns from the interest_metrics table.
- im.month_year AS interest_metrics_month_year: Aliases the month_year column from interest_metrics table as interest_metrics_month_year.
- map.created_at AS interest_map_created_at: Retrieves the created_at column from the interest_map table and aliases it as interest_map_created_at.
Join: Performs a left join between interest_metrics and interest_map using the interest_id column from interest_metrics and casting id to VARCHAR for comparison.
Where: Filters the result to only include rows where month_year from interest_metrics is less than created_at from interest_map.
Order By: Orders the results first by interest_id and then by month_year in ascending order.


B. Interest Analysis


Which interests have been present in all month_year dates in our dataset?

Answer:

Select Command: Retrieves distinct interest_id values from the interest_metrics table.
Distinct: Ensures that only unique interest_id values are selected.
Group By: Groups the data in the interest_metrics table by interest_id.
Having Clause: Filters the grouped results to include only those groups where the count of distinct month_year values is equal to the total count of distinct month_year values in the entire interest_metrics table.
Subquery: The subquery (SELECT COUNT(DISTINCT month_year) FROM interest_metrics) calculates the total count of distinct month_year values in the interest_metrics table.
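
The HAVING comparison against the subquery can be tried on a small example. This is a sketch under stated assumptions: SQLite (via Python's sqlite3) replaces PostgreSQL, and the three months and three interests are invented.

```python
import sqlite3

# In-memory SQLite database standing in for PostgreSQL (an assumption of this sketch).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE interest_metrics (interest_id INTEGER, month_year TEXT)")

# Invented rows covering three months. Interest 2 misses September,
# and interest 3 has a duplicate July record.
conn.executemany(
    "INSERT INTO interest_metrics VALUES (?, ?)",
    [
        (1, "2018-07-01"), (1, "2018-08-01"), (1, "2018-09-01"),
        (2, "2018-07-01"), (2, "2018-08-01"),
        (3, "2018-07-01"), (3, "2018-07-01"), (3, "2018-08-01"), (3, "2018-09-01"),
    ],
)

# Keep only interests whose distinct month count equals the dataset-wide distinct month count.
present_in_all_months = [
    row[0]
    for row in conn.execute(
        """
        SELECT interest_id
        FROM interest_metrics
        GROUP BY interest_id
        HAVING COUNT(DISTINCT month_year) =
               (SELECT COUNT(DISTINCT month_year) FROM interest_metrics)
        ORDER BY interest_id
        """
    )
]

print(present_in_all_months)
```

Counting distinct month_year values, rather than rows, keeps the duplicate July record for interest 3 from inflating its month count.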

Using this same total_months measure - calculate the cumulative percentage of all records starting at 14 months - which total_months value passes the 90% cumulative percentage value?

Answer:

Common Table Expressions (CTEs):

MonthlyInterestCounts: Calculates the count of distinct month_year values for each interest_id from the interest_metrics table.
InterestCumulativePercentage: Computes the cumulative percentage of interests based on the count of interest_ids and their month_count. It uses a window function to calculate the cumulative sum of interest_count.

Main Query:

month_count: Represents the number of distinct month_year values for each interest_id.
total_interest_count: Shows the total count of interest_ids considered up to the current month_count in descending order.
cumulative_percent: Calculates the cumulative percentage of interest_ids based on the total_interest_count and the total sum of interest_count from the InterestCumulativePercentage CTE.
WHERE: Filters the results to include only rows where the calculated cumulative percentage is greater than 90.
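
A running total over month_count in descending order is the core of this calculation. The sketch below reproduces the idea on invented data with three months instead of fourteen. It uses Python's sqlite3 module in place of PostgreSQL (assuming SQLite 3.25 or later for window functions), and the CTE names are illustrative rather than the repository's.

```python
import sqlite3

# In-memory SQLite database standing in for PostgreSQL (an assumption of this sketch).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE interest_metrics (interest_id INTEGER, month_year TEXT)")

# Invented data: interests 1-5 appear in 3 months, 6-7 in 2 months, 8 in 1 month.
months = ["2018-07-01", "2018-08-01", "2018-09-01"]
months_per_interest = {1: 3, 2: 3, 3: 3, 4: 3, 5: 3, 6: 2, 7: 2, 8: 1}
conn.executemany(
    "INSERT INTO interest_metrics VALUES (?, ?)",
    [(iid, months[k]) for iid, n in months_per_interest.items() for k in range(n)],
)

passing = conn.execute(
    """
    WITH monthly_interest_counts AS (
        -- Distinct months of data per interest.
        SELECT interest_id, COUNT(DISTINCT month_year) AS month_count
        FROM interest_metrics
        GROUP BY interest_id
    ),
    interest_counts AS (
        -- Number of interests sharing each month_count.
        SELECT month_count, COUNT(*) AS interest_count
        FROM monthly_interest_counts
        GROUP BY month_count
    ),
    cumulative AS (
        -- Running total from the highest month_count downwards, as a percentage.
        SELECT month_count,
               SUM(interest_count) OVER (ORDER BY month_count DESC) AS total_interest_count,
               ROUND(
                   100.0 * SUM(interest_count) OVER (ORDER BY month_count DESC)
                   / (SELECT SUM(interest_count) FROM interest_counts),
                   2
               ) AS cumulative_percent
        FROM interest_counts
    )
    SELECT month_count, total_interest_count, cumulative_percent
    FROM cumulative
    WHERE cumulative_percent > 90
    ORDER BY month_count DESC
    """
).fetchall()

print(passing)
```

In this toy data the running percentages are 62.5, 87.5 and 100.0, so only the last group clears the 90% filter.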

If we were to remove all interest_id values which are lower than the total_months value we found in the previous question - how many total data points would we be removing?

Answer:

Common Table Expressions (CTEs):

InterestMonthsCounts: Calculates the count of distinct month_year values for each interest_id from the interest_metrics table using the COUNT and GROUP BY aggregation.
TotalMonths: Determines the maximum month_count from the InterestMonthsCounts CTE as the total_months.

Main Query:

total_data_points_to_keep: Calculates the total count of data points to keep, which is the sum of month_count values from the InterestMonthsCounts CTE.
total_data_points_removed: Computes the total count of data points to be removed, which is the difference between the total count of data points in the interest_metrics table and the sum of month_count values from the InterestMonthsCounts CTE.
JOIN: Joins the InterestMonthsCounts CTE with the TotalMonths CTE based on the condition that month_count in InterestMonthsCounts is greater than or equal to total_months in TotalMonths.


C. Segment Analysis


Using our filtered dataset by removing the interests with less than 6 months worth of data, which are the top 10 and bottom 10 interests which have the largest composition values in any month_year? Only use the maximum composition value for each interest but you must keep the corresponding month_year

Answer:

Common Table Expression (CTE):

Creates a CTE named cte that counts the number of distinct month_year values for each interest_id in interest_metrics.
Groups the results by interest_id.
Filters the results to only include rows with a month_count of 6 or more.

Query 1:

Selects all columns from interest_metrics into a table named filtered_table.
Filters the rows to include only those with interest_id values present in the interest_id column of the cte.

Query 2:

Selects month_year, interest_id, calculates the maximum composition, and retrieves interest_name from the filtered_table.
Joins filtered_table with interest_map using interest_id and id.
Groups the results by month_year, interest_id, and interest_name.
Calculates the maximum composition for each group.
Orders the result by max_composition in descending order.
Limits the output to the top 10 records using LIMIT 10.

Query 3:

Selects month_year, interest_id, calculates the maximum composition, and retrieves interest_name from the filtered_table.
Joins filtered_table with interest_map using interest_id and id.
Groups the results by month_year, interest_id, and interest_name.
Calculates the maximum composition for each group.
Orders the result by max_composition in ascending order.
Limits the output to the bottom 10 records using LIMIT 10.

Which 5 interests had the lowest average ranking value?

Answer:

Query:

Selects interest_id, calculates the average of ranking, and retrieves interest_name from the filtered_table.
Joins filtered_table with interest_map using interest_id and id.
Groups the results by interest_id and interest_name.
Calculates the average of ranking for each group.
Orders the result by avg_ranking in ascending order.
Limits the output to the top 5 records using LIMIT 5.

Which 5 interests had the largest standard deviation in their percentile_ranking value?

Answer:

Query:

Selects interest_id, calculates the standard deviation of percentile_ranking, and retrieves interest_name from the filtered_table.
Joins filtered_table with interest_map using interest_id and id.
Groups the results by interest_id and interest_name.
Calculates the standard deviation of percentile_ranking for each group.
Orders the result by std_dev_percentile_ranking in descending order.
Limits the output to the top 5 records using LIMIT 5.


D. Index Analysis


What are the top 10 interests by the average composition for each month?

Answer:

Query:

Selects _month, _year, month_year, interest_id, and calculates the average composition using the formula: SUM(composition) / SUM(index_value), rounded to two decimal places.
Groups the results by _month, _year, month_year, and interest_id.
Orders the result by _year in ascending order, then _month in ascending order, and finally average_composition in descending order.
Limits the output to the top 10 records using LIMIT 10.

For all of these top 10 interests - which interest appears the most often?

Answer:

Query:

Uses a common table expression (CTE) named ranks_tab to calculate the average composition for each interest_id and month_year.
Assigns a rank to each record within the same month_year partition based on the descending order of avg_composition.
Joins the interest_metrics table with the interest_map table using interest_id to retrieve the interest_name.
Groups the results by interest_id, interest_name, and month_year.

Final Query:

Selects interest_id, interest_name, and calculates the count of records with the same interest_name using the window function COUNT(*) over a partition.
Filters the results to only include records with ranks up to 10.
Orders the result by the third column (counts) in descending order.

What is the average of the average composition for the top 10 interests for each month?

Answer:

Query:

Uses a common table expression (CTE) named ranks_tab to calculate the average composition for each interest_id and month_year.
Assigns a rank to each record within the same month_year partition based on the descending order of avg_composition.
Joins the interest_metrics table with the interest_map table using interest_id to retrieve the interest_name.
Groups the results by interest_id, interest_name, and month_year.

Final Query:

Selects month_year and calculates the average of avg_composition using the AVG() function.
Filters the results to only include records with ranks up to 10.
Groups the results by month_year.
Orders the result by month_year in ascending order.
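
The rank-then-average pattern can be sketched as follows. This is a simplified illustration rather than the repository's query: it keeps the top 2 interests per month instead of the top 10, takes the average composition to be composition divided by index_value for each row, omits the join to interest_map, and runs on SQLite through Python's sqlite3 module (window functions need SQLite 3.25 or later). All rows are invented.

```python
import sqlite3

# In-memory SQLite database standing in for PostgreSQL (an assumption of this sketch).
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE interest_metrics (
        month_year TEXT, interest_id INTEGER, composition REAL, index_value REAL
    )
    """
)

# Invented rows: three interests in each of two months.
conn.executemany(
    "INSERT INTO interest_metrics VALUES (?, ?, ?, ?)",
    [
        ("2018-07-01", 1, 6.0, 2.0),
        ("2018-07-01", 2, 4.0, 2.0),
        ("2018-07-01", 3, 1.0, 1.0),
        ("2018-08-01", 1, 8.0, 2.0),
        ("2018-08-01", 2, 3.0, 3.0),
        ("2018-08-01", 3, 4.0, 2.0),
    ],
)

top_n = 2  # the case study uses 10; 2 keeps the toy example readable

monthly_averages = conn.execute(
    """
    WITH ranks_tab AS (
        -- Rank interests inside each month by average composition, highest first.
        SELECT month_year,
               interest_id,
               composition / index_value AS avg_composition,
               RANK() OVER (
                   PARTITION BY month_year
                   ORDER BY composition / index_value DESC
               ) AS rnk
        FROM interest_metrics
    )
    SELECT month_year, AVG(avg_composition) AS avg_of_top
    FROM ranks_tab
    WHERE rnk <= ?
    GROUP BY month_year
    ORDER BY month_year
    """,
    (top_n,),
).fetchall()

print(monthly_averages)
```

Partitioning the rank by month_year is what makes the cutoff apply within each month; a plain LIMIT would cap the result as a whole instead.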

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# 8 Week SQL challenge🔍

This repository serves as a comprehensive solution for the #8WeekSQLChallenge, encompassing eight distinct case studies. The repository is a testament to my prowess in SQL, as it demonstrates my adeptness in solving diverse SQL challenges and showcases my expertise in query composition and problem-solving. These case studies cover a wide array of topics, including ranking, common table expressions (CTEs), datetime functions, text manipulation, aggregation, NULL handling, and more.

This repository reflects the culmination of my journey through various SQL challenges. Each case study presents a unique problem-solving scenario that required me to apply a range of SQL techniques such as CTEs, case statements, date functions, and complex aggregations. As a result, I have honed my skills in SQL query crafting, data manipulation, and analysis.

Through this repository, I proudly present my ability to tackle intricate SQL challenges and exemplify my aptitude for effectively addressing real-world data scenarios. It serves as a testament to my growth, proficiency, and innovation in the realm of SQL.

The creator of the #8WeekSQLChallenge is Danny Ma.


Tools Used🛠️ : PostgreSQL


CONTENTS📝


Case Study #1 - Danny's Diner
Case Study #2 - Pizza Runner
Case Study #3 - Foodie-Fi
Case Study #4 - Data Bank
Case Study #5 - Data Mart
Case Study #6 - Clique Bait
Case Study #7 - Balanced Tree Clothing Co.
Case Study #8 - Fresh Segments

--------------------------------------------------------------------------------

Case Study #1 - Danny's Diner👨🏻‍🍳

Contents

What is the total amount each customer spent at the restaurant?

Answer:

How many days has each customer visited the restaurant?

Answer:

What was the first item from the menu purchased by each customer?

Answer:

What is the most purchased item on the menu and how many times was it purchased by all customers?

Answer:

Which item was the most popular for each customer?

Answer:

Which item was purchased first by the customer after they became a member?

Answer:

Which item was purchased just before the customer became a member?

Answer:

What is the total items and amount spent for each member before they became a member?

If each $1 spent equates to 10 points and sushi has a 2x points multiplier - how many points would each customer have?

Answer:

In the first week after a customer joins the program (including their join date) they earn 2x points on all items, not just sushi - how many points do customer A and B have at the end of January?

Answer:

Join All The Things

Answer:

Rank All The Things

Answer:

Case Study #3 - Foodie-Fi🫕🥑

Contents

Entity Relationship Diagram

Case Study Questions & Solutions

A. Customer Journey🫕🥑

Based off the 8 sample customers provided in the sample from the subscriptions table, write a brief description about each customer’s onboarding journey.

Answer:

B. Data Analysis Questions📊

How many customers has Foodie-Fi ever had?

Answer:

What is the monthly distribution of trial plan start_date values for our dataset - use the start of the month as the group by value

Answer:

What plan start_date values occur after the year 2020 for our dataset? Show the breakdown by count of events for each plan_name

Answer:

What is the customer count and percentage of customers who have churned rounded to 1 decimal place?

Answer:

How many customers have churned straight after their initial free trial - what percentage is this rounded to the nearest whole number?

Answer:

What is the number and percentage of customer plans after their initial free trial?

Answer:

What is the customer count and percentage breakdown of all 5 plan_name values at 2020-12-31?

Answer:

How many customers have upgraded to an annual plan in 2020?

Answer:

How many customers downgraded from a pro monthly to a basic monthly plan in 2020?

Answer:

C. Challenge Payment Question💰

Answer:

Case Study #4 - Data Bank🏦

Contents

Entity Relationship Diagram

Case Study Questions & Solutions

A. Customer Nodes Exploration👥

How many unique nodes are there on the Data Bank system?

Answer:

What is the number of nodes per region?

Answer:

How many customers are allocated to each region?

Answer:

How many days on average are customers reallocated to a different node?

Answer:

What is the median, 80th and 95th percentile for this same reallocation days metric for each region?

Answer:

B. Customer Transactions💸

What is the unique count and total amount for each transaction type?

Answer:

What is the average total historical deposit counts and amounts for all customers?

Answer:

For each month - how many Data Bank customers make more than 1 deposit and either 1 purchase or 1 withdrawal in a single month?

Answer:

What is the closing balance for each customer at the end of the month?

Answer:

What is the percentage of customers who increase their closing balance by more than 5%?

Answer:

Case Study #5 - Data Mart🛒