├── Case Study #1 - Danny's Diner └── README.md ├── Case Study #2 - Pizza Runner ├── A. Pizza Metrics.md ├── B. Runner and Customer Experience.md ├── C. Ingredient Optimisation.md ├── Data Cleaning and Transformation.md └── README.md ├── Case Study #3 - Foodie-Fi └── README.md ├── Case Study #4 - Data Bank └── README.md ├── Case Study #5 - Data Mart └── README.md ├── Case Study #6 - Clique Bait └── README.md ├── Case Study #7 - Balanced Tree Clothing Co. └── README.md ├── Case Study #8: Fresh Segments ├── A. Data Exploration and Cleansing.md ├── B. Interest Analysis.md └── README.md └── README.md /Case Study #1 - Danny's Diner/README.md: -------------------------------------------------------------------------------- 1 | # 🍜 Case Study #1: Danny's Diner 2 | Image 3 | 4 | ## 📚 Table of Contents 5 | - [Business Task](#business-task) 6 | - [Entity Relationship Diagram](#entity-relationship-diagram) 7 | - [Question and Solution](#question-and-solution) 8 | 9 | Please note that all the information regarding the case study has been sourced from the following link: [here](https://8weeksqlchallenge.com/case-study-1/). 10 | 11 | *** 12 | 13 | ## Business Task 14 | Danny wants to use the data to answer a few simple questions about his customers, especially about their visiting patterns, how much money they’ve spent and also which menu items are their favourite. 15 | 16 | *** 17 | 18 | ## Entity Relationship Diagram 19 | 20 | ![image](https://user-images.githubusercontent.com/81607668/127271130-dca9aedd-4ca9-4ed8-b6ec-1e1920dca4a8.png) 21 | 22 | *** 23 | 24 | ## Question and Solution 25 | 26 | Please join me in executing the queries using PostgreSQL on [DB Fiddle](https://www.db-fiddle.com/f/2rM8RAnq7h5LLDTzZiRWcd/138). It would be great to work together on the questions! 27 | 28 | Additionally, I have also published this case study on [Medium](https://katiehuangx.medium.com/8-week-sql-challenge-case-study-week-1-dannys-diner-2ba026c897ab?source=friends_link&sk=ed355696f5a70ff8b3d5a1b905e5dabe). 29 | 30 | If you have any questions, reach out to me on [LinkedIn](https://www.linkedin.com/in/katiehuangx/). 31 | 32 | **1. What is the total amount each customer spent at the restaurant?** 33 | 34 | ````sql 35 | SELECT 36 | sales.customer_id, 37 | SUM(menu.price) AS total_sales 38 | FROM dannys_diner.sales 39 | INNER JOIN dannys_diner.menu 40 | ON sales.product_id = menu.product_id 41 | GROUP BY sales.customer_id 42 | ORDER BY sales.customer_id ASC; 43 | ```` 44 | 45 | #### Steps: 46 | - Use **JOIN** to merge `dannys_diner.sales` and `dannys_diner.menu` tables as `sales.customer_id` and `menu.price` are from both tables. 47 | - Use **SUM** to calculate the total sales contributed by each customer. 48 | - Group the aggregated results by `sales.customer_id`. 49 | 50 | #### Answer: 51 | | customer_id | total_sales | 52 | | ----------- | ----------- | 53 | | A | 76 | 54 | | B | 74 | 55 | | C | 36 | 56 | 57 | - Customer A spent $76. 58 | - Customer B spent $74. 59 | - Customer C spent $36. 60 | 61 | *** 62 | 63 | **2. How many days has each customer visited the restaurant?** 64 | 65 | ````sql 66 | SELECT 67 | customer_id, 68 | COUNT(DISTINCT order_date) AS visit_count 69 | FROM dannys_diner.sales 70 | GROUP BY customer_id; 71 | ```` 72 | 73 | #### Steps: 74 | - To determine the unique number of visits for each customer, utilize **COUNT(DISTINCT `order_date`)**. 75 | - It's important to apply the **DISTINCT** keyword while calculating the visit count to avoid duplicate counting of days. 
For instance, if Customer A visited the restaurant twice on '2021–01–07', counting without **DISTINCT** would result in 2 days instead of the accurate count of 1 day. 76 | 77 | #### Answer: 78 | | customer_id | visit_count | 79 | | ----------- | ----------- | 80 | | A | 4 | 81 | | B | 6 | 82 | | C | 2 | 83 | 84 | - Customer A visited 4 times. 85 | - Customer B visited 6 times. 86 | - Customer C visited 2 times. 87 | 88 | *** 89 | 90 | **3. What was the first item from the menu purchased by each customer?** 91 | 92 | ````sql 93 | WITH ordered_sales AS ( 94 | SELECT 95 | sales.customer_id, 96 | sales.order_date, 97 | menu.product_name, 98 | DENSE_RANK() OVER ( 99 | PARTITION BY sales.customer_id 100 | ORDER BY sales.order_date) AS rank 101 | FROM dannys_diner.sales 102 | INNER JOIN dannys_diner.menu 103 | ON sales.product_id = menu.product_id 104 | ) 105 | 106 | SELECT 107 | customer_id, 108 | product_name 109 | FROM ordered_sales 110 | WHERE rank = 1 111 | GROUP BY customer_id, product_name; 112 | ```` 113 | 114 | #### Steps: 115 | - Create a Common Table Expression (CTE) named `ordered_sales_cte`. Within the CTE, create a new column `rank` and calculate the row number using **DENSE_RANK()** window function. The **PARTITION BY** clause divides the data by `customer_id`, and the **ORDER BY** clause orders the rows within each partition by `order_date`. 116 | - In the outer query, select the appropriate columns and apply a filter in the **WHERE** clause to retrieve only the rows where the rank column equals 1, which represents the first row within each `customer_id` partition. 117 | - Use the GROUP BY clause to group the result by `customer_id` and `product_name`. 118 | 119 | #### Answer: 120 | | customer_id | product_name | 121 | | ----------- | ----------- | 122 | | A | curry | 123 | | A | sushi | 124 | | B | curry | 125 | | C | ramen | 126 | 127 | - Customer A placed an order for both curry and sushi simultaneously, making them the first items in the order. 128 | - Customer B's first order is curry. 129 | - Customer C's first order is ramen. 130 | 131 | I have received feedback suggesting the use of `ROW_NUMBER()` instead of `DENSE_RANK()` for determining the "first order" in this question. 132 | 133 | However, since the `order_date` does not have a timestamp, it is impossible to determine the exact sequence of items ordered by the customer. 134 | 135 | Therefore, it would be inaccurate to conclude that curry is the customer's first order purely based on the alphabetical order of the product names. For this reason, I maintain my solution of using `DENSE_RANK()` and consider both curry and sushi as Customer A's first order. 136 | 137 | *** 138 | 139 | **4. What is the most purchased item on the menu and how many times was it purchased by all customers?** 140 | 141 | ````sql 142 | SELECT 143 | menu.product_name, 144 | COUNT(sales.product_id) AS most_purchased_item 145 | FROM dannys_diner.sales 146 | INNER JOIN dannys_diner.menu 147 | ON sales.product_id = menu.product_id 148 | GROUP BY menu.product_name 149 | ORDER BY most_purchased_item DESC 150 | LIMIT 1; 151 | ```` 152 | 153 | #### Steps: 154 | - Perform a **COUNT** aggregation on the `product_id` column and **ORDER BY** the result in descending order using `most_purchased` field. 155 | - Apply the **LIMIT** 1 clause to filter and retrieve the highest number of purchased items. 
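A quick side note on the `LIMIT 1` approach: if two menu items ever tied for the top purchase count, `LIMIT 1` would silently keep only one of them. A tie-safe sketch (not part of the original solution, simply reusing the `DENSE_RANK()` idea from Question 5 below) could look like this:

```sql
-- Tie-safe variant: keep every item that shares the highest purchase count.
WITH item_counts AS (
  SELECT
    menu.product_name,
    COUNT(sales.product_id) AS purchase_count,
    DENSE_RANK() OVER (ORDER BY COUNT(sales.product_id) DESC) AS rank
  FROM dannys_diner.sales
  INNER JOIN dannys_diner.menu
    ON sales.product_id = menu.product_id
  GROUP BY menu.product_name
)

SELECT
  product_name,
  purchase_count
FROM item_counts
WHERE rank = 1;
```

For this dataset both versions return the same single row, so the simpler `LIMIT 1` query above works perfectly well.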
156 | 157 | #### Answer: 158 | | most_purchased | product_name | 159 | | ----------- | ----------- | 160 | | 8 | ramen | 161 | 162 | 163 | - Most purchased item on the menu is ramen which is 8 times. Yummy! 164 | 165 | *** 166 | 167 | **5. Which item was the most popular for each customer?** 168 | 169 | ````sql 170 | WITH most_popular AS ( 171 | SELECT 172 | sales.customer_id, 173 | menu.product_name, 174 | COUNT(menu.product_id) AS order_count, 175 | DENSE_RANK() OVER ( 176 | PARTITION BY sales.customer_id 177 | ORDER BY COUNT(sales.customer_id) DESC) AS rank 178 | FROM dannys_diner.menu 179 | INNER JOIN dannys_diner.sales 180 | ON menu.product_id = sales.product_id 181 | GROUP BY sales.customer_id, menu.product_name 182 | ) 183 | 184 | SELECT 185 | customer_id, 186 | product_name, 187 | order_count 188 | FROM most_popular 189 | WHERE rank = 1; 190 | ```` 191 | 192 | *Each user may have more than 1 favourite item.* 193 | 194 | #### Steps: 195 | - Create a CTE named `fav_item_cte` and within the CTE, join the `menu` table and `sales` table using the `product_id` column. 196 | - Group results by `sales.customer_id` and `menu.product_name` and calculate the count of `menu.product_id` occurrences for each group. 197 | - Utilize the **DENSE_RANK()** window function to calculate the ranking of each `sales.customer_id` partition based on the count of orders **COUNT(`sales.customer_id`)** in descending order. 198 | - In the outer query, select the appropriate columns and apply a filter in the **WHERE** clause to retrieve only the rows where the rank column equals 1, representing the rows with the highest order count for each customer. 199 | 200 | #### Answer: 201 | | customer_id | product_name | order_count | 202 | | ----------- | ---------- |------------ | 203 | | A | ramen | 3 | 204 | | B | sushi | 2 | 205 | | B | curry | 2 | 206 | | B | ramen | 2 | 207 | | C | ramen | 3 | 208 | 209 | - Customer A and C's favourite item is ramen. 210 | - Customer B enjoys all items on the menu. He/she is a true foodie, sounds like me. 211 | 212 | *** 213 | 214 | **6. Which item was purchased first by the customer after they became a member?** 215 | 216 | ```sql 217 | WITH joined_as_member AS ( 218 | SELECT 219 | members.customer_id, 220 | sales.product_id, 221 | ROW_NUMBER() OVER ( 222 | PARTITION BY members.customer_id 223 | ORDER BY sales.order_date) AS row_num 224 | FROM dannys_diner.members 225 | INNER JOIN dannys_diner.sales 226 | ON members.customer_id = sales.customer_id 227 | AND sales.order_date > members.join_date 228 | ) 229 | 230 | SELECT 231 | customer_id, 232 | product_name 233 | FROM joined_as_member 234 | INNER JOIN dannys_diner.menu 235 | ON joined_as_member.product_id = menu.product_id 236 | WHERE row_num = 1 237 | ORDER BY customer_id ASC; 238 | ``` 239 | 240 | #### Steps: 241 | - Create a CTE named `joined_as_member` and within the CTE, select the appropriate columns and calculate the row number using the **ROW_NUMBER()** window function. The **PARTITION BY** clause divides the data by `members.customer_id` and the **ORDER BY** clause orders the rows within each `members.customer_id` partition by `sales.order_date`. 242 | - Join tables `dannys_diner.members` and `dannys_diner.sales` on `customer_id` column. Additionally, apply a condition to only include sales that occurred *after* the member's `join_date` (`sales.order_date > members.join_date`). 243 | - In the outer query, join the `joined_as_member` CTE with the `dannys_diner.menu` on the `product_id` column. 
244 | - In the **WHERE** clause, filter to retrieve only the rows where the row_num column equals 1, representing the first row within each `customer_id` partition. 245 | - Order result by `customer_id` in ascending order. 246 | 247 | #### Answer: 248 | | customer_id | product_name | 249 | | ----------- | ---------- | 250 | | A | ramen | 251 | | B | sushi | 252 | 253 | - Customer A's first order as a member is ramen. 254 | - Customer B's first order as a member is sushi. 255 | 256 | *** 257 | 258 | **7. Which item was purchased just before the customer became a member?** 259 | 260 | ````sql 261 | WITH purchased_prior_member AS ( 262 | SELECT 263 | members.customer_id, 264 | sales.product_id, 265 | ROW_NUMBER() OVER ( 266 | PARTITION BY members.customer_id 267 | ORDER BY sales.order_date DESC) AS rank 268 | FROM dannys_diner.members 269 | INNER JOIN dannys_diner.sales 270 | ON members.customer_id = sales.customer_id 271 | AND sales.order_date < members.join_date 272 | ) 273 | 274 | SELECT 275 | p_member.customer_id, 276 | menu.product_name 277 | FROM purchased_prior_member AS p_member 278 | INNER JOIN dannys_diner.menu 279 | ON p_member.product_id = menu.product_id 280 | WHERE rank = 1 281 | ORDER BY p_member.customer_id ASC; 282 | ```` 283 | 284 | #### Steps: 285 | - Create a CTE called `purchased_prior_member`. 286 | - In the CTE, select the appropriate columns and calculate the rank using the **ROW_NUMBER()** window function. The rank is determined based on the order dates of the sales in descending order within each customer's group. 287 | - Join `dannys_diner.members` table with `dannys_diner.sales` table based on the `customer_id` column, only including sales that occurred *before* the customer joined as a member (`sales.order_date < members.join_date`). 288 | - Join `purchased_prior_member` CTE with `dannys_diner.menu` table based on `product_id` column. 289 | - Filter the result set to include only the rows where the rank is 1, representing the earliest purchase made by each customer before they became a member. 290 | - Sort the result by `customer_id` in ascending order. 291 | 292 | #### Answer: 293 | | customer_id | product_name | 294 | | ----------- | ---------- | 295 | | A | sushi | 296 | | B | sushi | 297 | 298 | - Both customers' last order before becoming members are sushi. 299 | 300 | *** 301 | 302 | **8. What is the total items and amount spent for each member before they became a member?** 303 | 304 | ```sql 305 | SELECT 306 | sales.customer_id, 307 | COUNT(sales.product_id) AS total_items, 308 | SUM(menu.price) AS total_sales 309 | FROM dannys_diner.sales 310 | INNER JOIN dannys_diner.members 311 | ON sales.customer_id = members.customer_id 312 | AND sales.order_date < members.join_date 313 | INNER JOIN dannys_diner.menu 314 | ON sales.product_id = menu.product_id 315 | GROUP BY sales.customer_id 316 | ORDER BY sales.customer_id; 317 | ``` 318 | 319 | #### Steps: 320 | - Select the columns `sales.customer_id` and calculate the count of `sales.product_id` as total_items for each customer and the sum of `menu.price` as total_sales. 321 | - From `dannys_diner.sales` table, join `dannys_diner.members` table on `customer_id` column, ensuring that `sales.order_date` is earlier than `members.join_date` (`sales.order_date < members.join_date`). 322 | - Then, join `dannys_diner.menu` table to `dannys_diner.sales` table on `product_id` column. 323 | - Group the results by `sales.customer_id`. 324 | - Order the result by `sales.customer_id` in ascending order. 
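Before looking at the aggregated answer, it can help to eyeball the underlying rows. The sketch below is just a sanity check of my own, using the same joins and the same `sales.order_date < members.join_date` filter; it lists each pre-membership purchase so the totals can be verified by hand:

```sql
-- Drill-down: one row per pre-membership purchase (same joins and filter as above).
SELECT
  sales.customer_id,
  sales.order_date,
  menu.product_name,
  menu.price
FROM dannys_diner.sales
INNER JOIN dannys_diner.members
  ON sales.customer_id = members.customer_id
  AND sales.order_date < members.join_date
INNER JOIN dannys_diner.menu
  ON sales.product_id = menu.product_id
ORDER BY sales.customer_id, sales.order_date;
```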
325 | 326 | #### Answer: 327 | | customer_id | total_items | total_sales | 328 | | ----------- | ---------- |---------- | 329 | | A | 2 | 25 | 330 | | B | 3 | 40 | 331 | 332 | Before becoming members, 333 | - Customer A spent $25 on 2 items. 334 | - Customer B spent $40 on 3 items. 335 | 336 | *** 337 | 338 | **9. If each $1 spent equates to 10 points and sushi has a 2x points multiplier — how many points would each customer have?** 339 | 340 | ```sql 341 | WITH points_cte AS ( 342 | SELECT 343 | menu.product_id, 344 | CASE 345 | WHEN product_id = 1 THEN price * 20 346 | ELSE price * 10 END AS points 347 | FROM dannys_diner.menu 348 | ) 349 | 350 | SELECT 351 | sales.customer_id, 352 | SUM(points_cte.points) AS total_points 353 | FROM dannys_diner.sales 354 | INNER JOIN points_cte 355 | ON sales.product_id = points_cte.product_id 356 | GROUP BY sales.customer_id 357 | ORDER BY sales.customer_id; 358 | ``` 359 | 360 | #### Steps: 361 | Let's break down the question to understand the point calculation for each customer's purchases. 362 | - Each $1 spent = 10 points. However, `product_id` 1 sushi gets 2x points, so each $1 spent = 20 points. 363 | - Here's how the calculation is performed using a conditional CASE statement: 364 | - If product_id = 1, multiply every $1 by 20 points. 365 | - Otherwise, multiply $1 by 10 points. 366 | - Then, calculate the total points for each customer. 367 | 368 | #### Answer: 369 | | customer_id | total_points | 370 | | ----------- | ---------- | 371 | | A | 860 | 372 | | B | 940 | 373 | | C | 360 | 374 | 375 | - Total points for Customer A is $860. 376 | - Total points for Customer B is $940. 377 | - Total points for Customer C is $360. 378 | 379 | *** 380 | 381 | **10. In the first week after a customer joins the program (including their join date) they earn 2x points on all items, not just sushi — how many points do customer A and B have at the end of January?** 382 | 383 | ```sql 384 | WITH dates_cte AS ( 385 | SELECT 386 | customer_id, 387 | join_date, 388 | join_date + 6 AS valid_date, 389 | DATE_TRUNC( 390 | 'month', '2021-01-31'::DATE) 391 | + interval '1 month' 392 | - interval '1 day' AS last_date 393 | FROM dannys_diner.members 394 | ) 395 | 396 | SELECT 397 | sales.customer_id, 398 | SUM(CASE 399 | WHEN menu.product_name = 'sushi' THEN 2 * 10 * menu.price 400 | WHEN sales.order_date BETWEEN dates.join_date AND dates.valid_date THEN 2 * 10 * menu.price 401 | ELSE 10 * menu.price END) AS points 402 | FROM dannys_diner.sales 403 | INNER JOIN dates_cte AS dates 404 | ON sales.customer_id = dates.customer_id 405 | AND dates.join_date <= sales.order_date 406 | AND sales.order_date <= dates.last_date 407 | INNER JOIN dannys_diner.menu 408 | ON sales.product_id = menu.product_id 409 | GROUP BY sales.customer_id; 410 | ``` 411 | 412 | #### Assumptions: 413 | - On Day -X to Day 1 (the day a customer becomes a member), each $1 spent earns 10 points. However, for sushi, each $1 spent earns 20 points. 414 | - From Day 1 to Day 7 (the first week of membership), each $1 spent for any items earns 20 points. 415 | - From Day 8 to the last day of January 2021, each $1 spent earns 10 points. However, sushi continues to earn double the points at 20 points per $1 spent. 416 | 417 | #### Steps: 418 | - Create a CTE called `dates_cte`. 419 | - In `dates_cte`, calculate the `valid_date` by adding 6 days to the `join_date` and determine the `last_date` of the month by subtracting 1 day from the last day of January 2021. 
420 | - From `dannys_diner.sales` table, join `dates_cte` on `customer_id` column, ensuring that the `order_date` of the sale is after the `join_date` (`dates.join_date <= sales.order_date`) and not later than the `last_date` (`sales.order_date <= dates.last_date`). 421 | - Then, join `dannys_diner.menu` table based on the `product_id` column. 422 | - In the outer query, calculate the points by using a `CASE` statement to determine the points based on our assumptions above. 423 | - If the `product_name` is 'sushi', multiply the price by 2 and then by 10. For orders placed between `join_date` and `valid_date`, also multiply the price by 2 and then by 10. 424 | - For all other products, multiply the price by 10. 425 | - Calculate the sum of points for each customer. 426 | 427 | #### Answer: 428 | | customer_id | total_points | 429 | | ----------- | ---------- | 430 | | A | 1020 | 431 | | B | 320 | 432 | 433 | - Total points for Customer A is 1,020. 434 | - Total points for Customer B is 320. 435 | 436 | *** 437 | 438 | ## BONUS QUESTIONS 439 | 440 | **Join All The Things** 441 | 442 | **Recreate the table with: customer_id, order_date, product_name, price, member (Y/N)** 443 | 444 | ```sql 445 | SELECT 446 | sales.customer_id, 447 | sales.order_date, 448 | menu.product_name, 449 | menu.price, 450 | CASE 451 | WHEN members.join_date > sales.order_date THEN 'N' 452 | WHEN members.join_date <= sales.order_date THEN 'Y' 453 | ELSE 'N' END AS member_status 454 | FROM dannys_diner.sales 455 | LEFT JOIN dannys_diner.members 456 | ON sales.customer_id = members.customer_id 457 | INNER JOIN dannys_diner.menu 458 | ON sales.product_id = menu.product_id 459 | ORDER BY members.customer_id, sales.order_date 460 | ``` 461 | 462 | #### Answer: 463 | | customer_id | order_date | product_name | price | member | 464 | | ----------- | ---------- | -------------| ----- | ------ | 465 | | A | 2021-01-01 | sushi | 10 | N | 466 | | A | 2021-01-01 | curry | 15 | N | 467 | | A | 2021-01-07 | curry | 15 | Y | 468 | | A | 2021-01-10 | ramen | 12 | Y | 469 | | A | 2021-01-11 | ramen | 12 | Y | 470 | | A | 2021-01-11 | ramen | 12 | Y | 471 | | B | 2021-01-01 | curry | 15 | N | 472 | | B | 2021-01-02 | curry | 15 | N | 473 | | B | 2021-01-04 | sushi | 10 | N | 474 | | B | 2021-01-11 | sushi | 10 | Y | 475 | | B | 2021-01-16 | ramen | 12 | Y | 476 | | B | 2021-02-01 | ramen | 12 | Y | 477 | | C | 2021-01-01 | ramen | 12 | N | 478 | | C | 2021-01-01 | ramen | 12 | N | 479 | | C | 2021-01-07 | ramen | 12 | N | 480 | 481 | *** 482 | 483 | **Rank All The Things** 484 | 485 | **Danny also requires further information about the ```ranking``` of customer products, but he purposely does not need the ranking for non-member purchases so he expects null ```ranking``` values for the records when customers are not yet part of the loyalty program.** 486 | 487 | ```sql 488 | WITH customers_data AS ( 489 | SELECT 490 | sales.customer_id, 491 | sales.order_date, 492 | menu.product_name, 493 | menu.price, 494 | CASE 495 | WHEN members.join_date > sales.order_date THEN 'N' 496 | WHEN members.join_date <= sales.order_date THEN 'Y' 497 | ELSE 'N' END AS member_status 498 | FROM dannys_diner.sales 499 | LEFT JOIN dannys_diner.members 500 | ON sales.customer_id = members.customer_id 501 | INNER JOIN dannys_diner.menu 502 | ON sales.product_id = menu.product_id 503 | ) 504 | 505 | SELECT 506 | *, 507 | CASE 508 | WHEN member_status = 'N' then NULL 509 | ELSE RANK () OVER ( 510 | PARTITION BY customer_id, member_status 511 | ORDER BY order_date 
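      -- partitioning by member_status as well makes the member ('Y') rows rank from 1,
      -- instead of continuing the count after the pre-membership rows (which the CASE already turns into NULL)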
512 | ) END AS ranking 513 | FROM customers_data; 514 | ``` 515 | 516 | #### Answer: 517 | | customer_id | order_date | product_name | price | member | ranking | 518 | | ----------- | ---------- | -------------| ----- | ------ |-------- | 519 | | A | 2021-01-01 | sushi | 10 | N | NULL 520 | | A | 2021-01-01 | curry | 15 | N | NULL 521 | | A | 2021-01-07 | curry | 15 | Y | 1 522 | | A | 2021-01-10 | ramen | 12 | Y | 2 523 | | A | 2021-01-11 | ramen | 12 | Y | 3 524 | | A | 2021-01-11 | ramen | 12 | Y | 3 525 | | B | 2021-01-01 | curry | 15 | N | NULL 526 | | B | 2021-01-02 | curry | 15 | N | NULL 527 | | B | 2021-01-04 | sushi | 10 | N | NULL 528 | | B | 2021-01-11 | sushi | 10 | Y | 1 529 | | B | 2021-01-16 | ramen | 12 | Y | 2 530 | | B | 2021-02-01 | ramen | 12 | Y | 3 531 | | C | 2021-01-01 | ramen | 12 | N | NULL 532 | | C | 2021-01-01 | ramen | 12 | N | NULL 533 | | C | 2021-01-07 | ramen | 12 | N | NULL 534 | 535 | *** 536 | -------------------------------------------------------------------------------- /Case Study #2 - Pizza Runner/A. Pizza Metrics.md: -------------------------------------------------------------------------------- 1 | # 🍕 Case Study #2 - Pizza Runner 2 | 3 | ## 🍝 Solution - A. Pizza Metrics 4 | 5 | ### 1. How many pizzas were ordered? 6 | 7 | ````sql 8 | SELECT COUNT(*) AS pizza_order_count 9 | FROM #customer_orders; 10 | ```` 11 | 12 | **Answer:** 13 | 14 | ![1*Ma9L4y6O_zhln6Wy7CdWMQ](https://user-images.githubusercontent.com/81607668/129473598-d6d55ab2-59c7-4040-97db-d1b0c1c5b294.png) 15 | 16 | - Total of 14 pizzas were ordered. 17 | 18 | ### 2. How many unique customer orders were made? 19 | 20 | ````sql 21 | SELECT 22 | COUNT(DISTINCT order_id) AS unique_order_count 23 | FROM #customer_orders; 24 | ```` 25 | 26 | **Answer:** 27 | 28 | ![image](https://user-images.githubusercontent.com/81607668/129737993-710198bd-433d-469f-b5de-14e4022a3a45.png) 29 | 30 | - There are 10 unique customer orders. 31 | 32 | ### 3. How many successful orders were delivered by each runner? 33 | 34 | ````sql 35 | SELECT 36 | runner_id, 37 | COUNT(order_id) AS successful_orders 38 | FROM #runner_orders 39 | WHERE distance != 0 40 | GROUP BY runner_id; 41 | ```` 42 | 43 | **Answer:** 44 | 45 | ![image](https://user-images.githubusercontent.com/81607668/129738112-6eada46a-8c32-495a-8e26-793b2fec89ef.png) 46 | 47 | - Runner 1 has 4 successful delivered orders. 48 | - Runner 2 has 3 successful delivered orders. 49 | - Runner 3 has 1 successful delivered order. 50 | 51 | ### 4. How many of each type of pizza was delivered? 52 | 53 | ````sql 54 | SELECT 55 | p.pizza_name, 56 | COUNT(c.pizza_id) AS delivered_pizza_count 57 | FROM #customer_orders AS c 58 | JOIN #runner_orders AS r 59 | ON c.order_id = r.order_id 60 | JOIN pizza_names AS p 61 | ON c.pizza_id = p.pizza_id 62 | WHERE r.distance != 0 63 | GROUP BY p.pizza_name; 64 | ```` 65 | 66 | **Answer:** 67 | 68 | ![image](https://user-images.githubusercontent.com/81607668/129738140-c9c002ff-5aed-48ab-bdfa-cadbd98973a9.png) 69 | 70 | - There are 9 delivered Meatlovers pizzas and 3 Vegetarian pizzas. 71 | 72 | ### 5. 
How many Vegetarian and Meatlovers were ordered by each customer?** 73 | 74 | ````sql 75 | SELECT 76 | c.customer_id, 77 | p.pizza_name, 78 | COUNT(p.pizza_name) AS order_count 79 | FROM #customer_orders AS c 80 | JOIN pizza_names AS p 81 | ON c.pizza_id= p.pizza_id 82 | GROUP BY c.customer_id, p.pizza_name 83 | ORDER BY c.customer_id; 84 | ```` 85 | 86 | **Answer:** 87 | 88 | ![image](https://user-images.githubusercontent.com/81607668/129738167-269df165-1c9a-446a-b757-c7fc9a9021ed.png) 89 | 90 | - Customer 101 ordered 2 Meatlovers pizzas and 1 Vegetarian pizza. 91 | - Customer 102 ordered 2 Meatlovers pizzas and 2 Vegetarian pizzas. 92 | - Customer 103 ordered 3 Meatlovers pizzas and 1 Vegetarian pizza. 93 | - Customer 104 ordered 1 Meatlovers pizza. 94 | - Customer 105 ordered 1 Vegetarian pizza. 95 | 96 | ### 6. What was the maximum number of pizzas delivered in a single order? 97 | 98 | ````sql 99 | WITH pizza_count_cte AS 100 | ( 101 | SELECT 102 | c.order_id, 103 | COUNT(c.pizza_id) AS pizza_per_order 104 | FROM #customer_orders AS c 105 | JOIN #runner_orders AS r 106 | ON c.order_id = r.order_id 107 | WHERE r.distance != 0 108 | GROUP BY c.order_id 109 | ) 110 | 111 | SELECT 112 | MAX(pizza_per_order) AS pizza_count 113 | FROM pizza_count_cte; 114 | ```` 115 | 116 | **Answer:** 117 | 118 | ![image](https://user-images.githubusercontent.com/81607668/129738201-f676edd4-2530-4663-9ed8-6e6ec4d9cc68.png) 119 | 120 | - Maximum number of pizza delivered in a single order is 3 pizzas. 121 | 122 | ### 7. For each customer, how many delivered pizzas had at least 1 change and how many had no changes? 123 | 124 | ````sql 125 | SELECT 126 | c.customer_id, 127 | SUM( 128 | CASE WHEN c.exclusions <> ' ' OR c.extras <> ' ' THEN 1 129 | ELSE 0 130 | END) AS at_least_1_change, 131 | SUM( 132 | CASE WHEN c.exclusions = ' ' AND c.extras = ' ' THEN 1 133 | ELSE 0 134 | END) AS no_change 135 | FROM #customer_orders AS c 136 | JOIN #runner_orders AS r 137 | ON c.order_id = r.order_id 138 | WHERE r.distance != 0 139 | GROUP BY c.customer_id 140 | ORDER BY c.customer_id; 141 | ```` 142 | 143 | **Answer:** 144 | 145 | ![image](https://user-images.githubusercontent.com/81607668/129738236-2c4383cb-9d42-458c-b9be-9963c336ee58.png) 146 | 147 | - Customer 101 and 102 likes his/her pizzas per the original recipe. 148 | - Customer 103, 104 and 105 have their own preference for pizza topping and requested at least 1 change (extra or exclusion topping) on their pizza. 149 | 150 | ### 8. How many pizzas were delivered that had both exclusions and extras? 151 | 152 | ````sql 153 | SELECT 154 | SUM( 155 | CASE WHEN exclusions IS NOT NULL AND extras IS NOT NULL THEN 1 156 | ELSE 0 157 | END) AS pizza_count_w_exclusions_extras 158 | FROM #customer_orders AS c 159 | JOIN #runner_orders AS r 160 | ON c.order_id = r.order_id 161 | WHERE r.distance >= 1 162 | AND exclusions <> ' ' 163 | AND extras <> ' '; 164 | ```` 165 | 166 | **Answer:** 167 | 168 | ![image](https://user-images.githubusercontent.com/81607668/129738278-dd3e7056-309d-42fc-a5e3-00f7b5d4609e.png) 169 | 170 | - Only 1 pizza delivered that had both extra and exclusion topping. That’s one fussy customer! 171 | 172 | ### 9. What was the total volume of pizzas ordered for each hour of the day? 
173 | 174 | ````sql 175 | SELECT 176 | DATEPART(HOUR, [order_time]) AS hour_of_day, 177 | COUNT(order_id) AS pizza_count 178 | FROM #customer_orders 179 | GROUP BY DATEPART(HOUR, [order_time]); 180 | ```` 181 | 182 | **Answer:** 183 | 184 | ![image](https://user-images.githubusercontent.com/81607668/129738302-573430e9-1785-4c71-adc1-464ffa94de8a.png) 185 | 186 | - Highest volume of pizza ordered is at 13 (1:00 pm), 18 (6:00 pm) and 21 (9:00 pm). 187 | - Lowest volume of pizza ordered is at 11 (11:00 am), 19 (7:00 pm) and 23 (11:00 pm). 188 | 189 | ### 10. What was the volume of orders for each day of the week? 190 | 191 | ````sql 192 | SELECT 193 | FORMAT(DATEADD(DAY, 2, order_time),'dddd') AS day_of_week, -- add 2 to adjust 1st day of the week as Monday 194 | COUNT(order_id) AS total_pizzas_ordered 195 | FROM #customer_orders 196 | GROUP BY FORMAT(DATEADD(DAY, 2, order_time),'dddd'); 197 | ```` 198 | 199 | **Answer:** 200 | 201 | ![image](https://user-images.githubusercontent.com/81607668/129738331-233744f6-3b57-4f4f-9a51-f7a699a9eb2e.png) 202 | 203 | - There are 5 pizzas ordered on Friday and Monday. 204 | - There are 3 pizzas ordered on Saturday. 205 | - There is 1 pizza ordered on Sunday. 206 | 207 | ***Click [here](https://github.com/katiehuangx/8-Week-SQL-Challenge/blob/main/Case%20Study%20%232%20-%20Pizza%20Runner/B.%20Runner%20and%20Customer%20Experience.md) for solution for B. Runner and Customer Experience!*** 208 | 209 | -------------------------------------------------------------------------------- /Case Study #2 - Pizza Runner/B. Runner and Customer Experience.md: -------------------------------------------------------------------------------- 1 | # 🍕 Case Study #2 Pizza Runner 2 | 3 | ## Solution - B. Runner and Customer Experience 4 | 5 | ### 1. How many runners signed up for each 1 week period? (i.e. week starts 2021-01-01) 6 | 7 | ````sql 8 | SELECT 9 | DATEPART(WEEK, registration_date) AS registration_week, 10 | COUNT(runner_id) AS runner_signup 11 | FROM runners 12 | GROUP BY DATEPART(WEEK, registration_date); 13 | ```` 14 | 15 | **Answer:** 16 | 17 | ![image](https://user-images.githubusercontent.com/81607668/129739658-a233932a-9f79-4280-a618-8bab6d3bd1f2.png) 18 | 19 | - On Week 1 of Jan 2021, 2 new runners signed up. 20 | - On Week 2 and 3 of Jan 2021, 1 new runner signed up. 21 | 22 | ### 2. What was the average time in minutes it took for each runner to arrive at the Pizza Runner HQ to pickup the order? 23 | 24 | ````sql 25 | WITH time_taken_cte AS 26 | ( 27 | SELECT 28 | c.order_id, 29 | c.order_time, 30 | r.pickup_time, 31 | DATEDIFF(MINUTE, c.order_time, r.pickup_time) AS pickup_minutes 32 | FROM #customer_orders AS c 33 | JOIN #runner_orders AS r 34 | ON c.order_id = r.order_id 35 | WHERE r.distance != 0 36 | GROUP BY c.order_id, c.order_time, r.pickup_time 37 | ) 38 | 39 | SELECT 40 | AVG(pickup_minutes) AS avg_pickup_minutes 41 | FROM time_taken_cte 42 | WHERE pickup_minutes > 1; 43 | ```` 44 | 45 | **Answer:** 46 | 47 | ![image](https://user-images.githubusercontent.com/81607668/129739701-e94b75e9-7193-4cf3-8e77-3c76be8b638d.png) 48 | 49 | - The average time taken in minutes by runners to arrive at Pizza Runner HQ to pick up the order is 15 minutes. 50 | 51 | ### 3. Is there any relationship between the number of pizzas and how long the order takes to prepare? 
52 | 53 | ````sql 54 | WITH prep_time_cte AS 55 | ( 56 | SELECT 57 | c.order_id, 58 | COUNT(c.order_id) AS pizza_order, 59 | c.order_time, 60 | r.pickup_time, 61 | DATEDIFF(MINUTE, c.order_time, r.pickup_time) AS prep_time_minutes 62 | FROM #customer_orders AS c 63 | JOIN #runner_orders AS r 64 | ON c.order_id = r.order_id 65 | WHERE r.distance != 0 66 | GROUP BY c.order_id, c.order_time, r.pickup_time 67 | ) 68 | 69 | SELECT 70 | pizza_order, 71 | AVG(prep_time_minutes) AS avg_prep_time_minutes 72 | FROM prep_time_cte 73 | WHERE prep_time_minutes > 1 74 | GROUP BY pizza_order; 75 | ```` 76 | 77 | **Answer:** 78 | 79 | ![image](https://user-images.githubusercontent.com/81607668/129739816-05e3ba03-d3fe-4206-8557-869930a897d1.png) 80 | 81 | - On average, a single pizza order takes 12 minutes to prepare. 82 | - An order with 3 pizzas takes 30 minutes at an average of 10 minutes per pizza. 83 | - It takes 16 minutes to prepare an order with 2 pizzas which is 8 minutes per pizza — making 2 pizzas in a single order the ultimate efficiency rate. 84 | 85 | ### 4. Is there any relationship between the number of pizzas and how long the order takes to prepare? 86 | 87 | ````sql 88 | SELECT 89 | c.customer_id, 90 | AVG(r.distance) AS avg_distance 91 | FROM #customer_orders AS c 92 | JOIN #runner_orders AS r 93 | ON c.order_id = r.order_id 94 | WHERE r.duration != 0 95 | GROUP BY c.customer_id; 96 | ```` 97 | 98 | **Answer:** 99 | 100 | ![image](https://user-images.githubusercontent.com/81607668/129739847-5e338f4f-b42c-4531-9685-e2e822063183.png) 101 | 102 | _(Assuming that distance is calculated from Pizza Runner HQ to customer’s place)_ 103 | 104 | - Customer 104 stays the nearest to Pizza Runner HQ at average distance of 10km, whereas Customer 105 stays the furthest at 25km. 105 | 106 | ### 5. What was the difference between the longest and shortest delivery times for all orders? 107 | 108 | _Edit 08/10/21: Thanks to my reader, Ankush Taneja on Medium who caught my mistake. I've amended to the correct solution. Also, I was doing this case study using SQL Server few months ago, but I'm using PostgreSQL on SQLpad now so there could be a slight difference to the syntax._ 109 | 110 | Firstly, I'm going to filter results with non-null duration first just to have a feel. You can skip this step and go straight to the answer. 111 | 112 | ````sql 113 | SELECT 114 | order_id, duration 115 | FROM #runner_orders 116 | WHERE duration not like ' '; 117 | ```` 118 | 119 | image 120 | 121 | ```sql 122 | SELECT MAX(duration::NUMERIC) - MIN(duration::NUMERIC) AS delivery_time_difference 123 | FROM runner_orders2 124 | where duration not like ' '; 125 | ``` 126 | 127 | **Answer:** 128 | 129 | image 130 | 131 | - The difference between longest (40 minutes) and shortest (10 minutes) delivery time for all orders is 30 minutes. 132 | 133 | ### 6. What was the average speed for each runner for each delivery and do you notice any trend for these values? 
134 | 135 | ````sql 136 | SELECT 137 | r.runner_id, 138 | c.customer_id, 139 | c.order_id, 140 | COUNT(c.order_id) AS pizza_count, 141 | r.distance, (r.duration / 60) AS duration_hr , 142 | ROUND((r.distance/r.duration * 60), 2) AS avg_speed 143 | FROM #runner_orders AS r 144 | JOIN #customer_orders AS c 145 | ON r.order_id = c.order_id 146 | WHERE distance != 0 147 | GROUP BY r.runner_id, c.customer_id, c.order_id, r.distance, r.duration 148 | ORDER BY c.order_id; 149 | ```` 150 | 151 | **Answer:** 152 | 153 | ![image](https://user-images.githubusercontent.com/81607668/129739931-54127037-0879-43bf-b53f-e4a1a6ebffeb.png) 154 | 155 | _(Average speed = Distance in km / Duration in hour)_ 156 | - Runner 1’s average speed runs from 37.5km/h to 60km/h. 157 | - Runner 2’s average speed runs from 35.1km/h to 93.6km/h. Danny should investigate Runner 2 as the average speed has a 300% fluctuation rate! 158 | - Runner 3’s average speed is 40km/h 159 | 160 | ### 7. What is the successful delivery percentage for each runner? 161 | 162 | ````sql 163 | SELECT 164 | runner_id, 165 | ROUND(100 * SUM( 166 | CASE WHEN distance = 0 THEN 0 167 | ELSE 1 END) / COUNT(*), 0) AS success_perc 168 | FROM #runner_orders 169 | GROUP BY runner_id; 170 | ```` 171 | 172 | **Answer:** 173 | 174 | ![image](https://user-images.githubusercontent.com/81607668/129740007-021d78fb-ec32-46c0-98f2-9e8f1891baed.png) 175 | 176 | - Runner 1 has 100% successful delivery. 177 | - Runner 2 has 75% successful delivery. 178 | - Runner 3 has 50% successful delivery 179 | 180 | _(It’s not right to attribute successful delivery to runners as order cancellations are out of the runner’s control.)_ 181 | 182 | *** 183 | -------------------------------------------------------------------------------- /Case Study #2 - Pizza Runner/C. Ingredient Optimisation.md: -------------------------------------------------------------------------------- 1 | # 🍕 Case Study #2 Pizza Runner 2 | 3 | ## Solution - C. Ingredient Optimisation 4 | 5 | ### 1. What are the standard ingredients for each pizza? 6 | 7 | ### 2. What was the most commonly added extra? 8 | 9 | ```sql 10 | WITH toppings_cte AS ( 11 | SELECT 12 | pizza_id, 13 | REGEXP_SPLIT_TO_TABLE(toppings, '[,\s]+')::INTEGER AS topping_id 14 | FROM pizza_runner.pizza_recipes) 15 | 16 | SELECT 17 | t.topping_id, pt.topping_name, 18 | COUNT(t.topping_id) AS topping_count 19 | FROM toppings_cte t 20 | INNER JOIN pizza_runner.pizza_toppings pt 21 | ON t.topping_id = pt.topping_id 22 | GROUP BY t.topping_id, pt.topping_name 23 | ORDER BY topping_count DESC; 24 | ``` 25 | 26 | **Solution** 27 | 28 | image 29 | 30 | ### 3. What was the most common exclusion? 31 | 32 | ### 4. Generate an order item for each record in the customers_orders table in the format of one of the following: 33 | - Meat Lovers 34 | - Meat Lovers - Exclude Beef 35 | - Meat Lovers - Extra Bacon 36 | - Meat Lovers - Exclude Cheese, Bacon - Extra Mushroom, Peppers 37 | 38 | ### 5. Generate an alphabetically ordered comma separated ingredient list for each pizza order from the customer_orders table and add a 2x in front of any relevant ingredients 39 | 40 | ### 6. For example: "Meat Lovers: 2xBacon, Beef, ... , Salami" 41 | 42 | ### 7. What is the total quantity of each ingredient used in all delivered pizzas sorted by most frequent first? 
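As a starting point, here is a simplified sketch. It counts only each pizza's standard toppings (exclusions and extras are ignored for brevity), assumes the cleaned `customer_orders_temp` and `runner_orders_temp` tables from the Data Cleaning step, and reuses the `REGEXP_SPLIT_TO_TABLE` trick from Question 2 above:

```sql
-- Simplified sketch: standard toppings of delivered pizzas only (no exclusions/extras).
WITH base_toppings AS (
  SELECT
    c.order_id,
    REGEXP_SPLIT_TO_TABLE(pr.toppings, '[,\s]+')::INTEGER AS topping_id
  FROM customer_orders_temp AS c
  INNER JOIN pizza_runner.pizza_recipes AS pr
    ON c.pizza_id = pr.pizza_id
)

SELECT
  pt.topping_name,
  COUNT(*) AS total_quantity
FROM base_toppings AS b
INNER JOIN runner_orders_temp AS r
  ON b.order_id = r.order_id
INNER JOIN pizza_runner.pizza_toppings AS pt
  ON b.topping_id = pt.topping_id
WHERE r.cancellation = ' '   -- delivered orders only: cancelled orders keep their cancellation text
GROUP BY pt.topping_name
ORDER BY total_quantity DESC;
```

A complete answer would also add back each order's `extras` and subtract its `exclusions` before aggregating.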
43 | -------------------------------------------------------------------------------- /Case Study #2 - Pizza Runner/Data Cleaning and Transformation.md: -------------------------------------------------------------------------------- 1 | # 🍕 Case Study #2 - Pizza Runner 2 | 3 | ## 🧼 Data Cleaning & Transformation 4 | 5 | ### 🔨 Table: customer_orders 6 | 7 | Looking at the `customer_orders` table below, we can see that there are 8 | - In the `exclusions` column, there are missing/ blank spaces ' ' and null values. 9 | - In the `extras` column, there are missing/ blank spaces ' ' and null values. 10 | 11 | image 12 | 13 | Our course of action to clean the table: 14 | - Create a temporary table with all the columns 15 | - Remove null values in `exlusions` and `extras` columns and replace with blank space ' '. 16 | 17 | ````sql 18 | CREATE TEMP TABLE customer_orders_temp AS 19 | SELECT 20 | order_id, 21 | customer_id, 22 | pizza_id, 23 | CASE 24 | WHEN exclusions IS null OR exclusions LIKE 'null' THEN ' ' 25 | ELSE exclusions 26 | END AS exclusions, 27 | CASE 28 | WHEN extras IS NULL or extras LIKE 'null' THEN ' ' 29 | ELSE extras 30 | END AS extras, 31 | order_time 32 | FROM pizza_runner.customer_orders; 33 | ````` 34 | 35 | This is how the clean `customers_orders_temp` table looks like and we will use this table to run all our queries. 36 | 37 | image 38 | 39 | *** 40 | 41 | ### 🔨 Table: runner_orders 42 | 43 | Looking at the `runner_orders` table below, we can see that there are 44 | - In the `exclusions` column, there are missing/ blank spaces ' ' and null values. 45 | - In the `extras` column, there are missing/ blank spaces ' ' and null values 46 | 47 | image 48 | 49 | Our course of action to clean the table: 50 | - In `pickup_time` column, remove nulls and replace with blank space ' '. 51 | - In `distance` column, remove "km" and nulls and replace with blank space ' '. 52 | - In `duration` column, remove "minutes", "minute" and nulls and replace with blank space ' '. 53 | - In `cancellation` column, remove NULL and null and and replace with blank space ' '. 54 | 55 | ````sql 56 | CREATE TEMP TABLE runner_orders_temp AS 57 | SELECT 58 | order_id, 59 | runner_id, 60 | CASE 61 | WHEN pickup_time LIKE 'null' THEN ' ' 62 | ELSE pickup_time 63 | END AS pickup_time, 64 | CASE 65 | WHEN distance LIKE 'null' THEN ' ' 66 | WHEN distance LIKE '%km' THEN TRIM('km' from distance) 67 | ELSE distance 68 | END AS distance, 69 | CASE 70 | WHEN duration LIKE 'null' THEN ' ' 71 | WHEN duration LIKE '%mins' THEN TRIM('mins' from duration) 72 | WHEN duration LIKE '%minute' THEN TRIM('minute' from duration) 73 | WHEN duration LIKE '%minutes' THEN TRIM('minutes' from duration) 74 | ELSE duration 75 | END AS duration, 76 | CASE 77 | WHEN cancellation IS NULL or cancellation LIKE 'null' THEN ' ' 78 | ELSE cancellation 79 | END AS cancellation 80 | FROM pizza_runner.runner_orders; 81 | ```` 82 | 83 | Then, we alter the `pickup_time`, `distance` and `duration` columns to the correct data type. 84 | 85 | ````sql 86 | ALTER TABLE runner_orders_temp 87 | ALTER COLUMN pickup_time DATETIME, 88 | ALTER COLUMN distance FLOAT, 89 | ALTER COLUMN duration INT; 90 | ```` 91 | 92 | This is how the clean `runner_orders_temp` table looks like and we will use this table to run all our queries. 93 | 94 | image 95 | 96 | *** 97 | 98 | Click here for [solution](https://github.com/katiehuangx/8-Week-SQL-Challenge/blob/main/Case%20Study%20%232%20-%20Pizza%20Runner/A.%20Pizza%20Metrics.md) to **A. Pizza Metrics**! 
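One note on the `ALTER TABLE` step above: the exact syntax for changing a column's type differs between SQL engines. If you are following along in PostgreSQL, a sketch of the equivalent conversion, turning the blank ' ' placeholders back into proper NULLs so the casts succeed, would be:

```sql
-- PostgreSQL-flavoured sketch: blanks become NULL, then each column is cast to its proper type.
ALTER TABLE runner_orders_temp
  ALTER COLUMN pickup_time TYPE TIMESTAMP USING NULLIF(TRIM(pickup_time), '')::TIMESTAMP,
  ALTER COLUMN distance    TYPE FLOAT     USING NULLIF(TRIM(distance), '')::FLOAT,
  ALTER COLUMN duration    TYPE INT       USING NULLIF(TRIM(duration), '')::INT;
```

Either way, the goal is the same: `pickup_time`, `distance` and `duration` end up with date and numeric types so the queries in parts A and B can do arithmetic on them.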
99 | -------------------------------------------------------------------------------- /Case Study #2 - Pizza Runner/README.md: -------------------------------------------------------------------------------- 1 | # 🍕 Case Study #2 Pizza Runner 2 | 3 | Image 4 | 5 | ## 📚 Table of Contents 6 | - [Business Task](#business-task) 7 | - [Entity Relationship Diagram](#entity-relationship-diagram) 8 | - Solution 9 | - [Data Cleaning and Transformation](#-data-cleaning--transformation) 10 | - [A. Pizza Metrics](#a-pizza-metrics) 11 | - [B. Runner and Customer Experience](#b-runner-and-customer-experience) 12 | - [C. Ingredient Optimisation](#c-ingredient-optimisation) 13 | - [D. Pricing and Ratings](#d-pricing-and-ratings) 14 | 15 | Please note that all the information regarding the case study has been sourced from the following link: [here](https://8weeksqlchallenge.com/case-study-2/). 16 | 17 | I also published this on [Medium](https://medium.com/analytics-vidhya/8-week-sql-challenge-case-study-2-pizza-runner-ba32f0a6f9fb?source=friends_link&sk=5463dad7c9b0b1ba83d570f09e1fce90)! 18 | *** 19 | 20 | ## Business Task 21 | Danny is expanding his new Pizza Empire and at the same time, he wants to Uberize it, so Pizza Runner was launched! 22 | 23 | Danny started by recruiting “runners” to deliver fresh pizza from Pizza Runner Headquarters (otherwise known as Danny’s house) and also maxed out his credit card to pay freelance developers to build a mobile app to accept orders from customers. 24 | 25 | ## Entity Relationship Diagram 26 | 27 | ![Pizza Runner](https://github.com/katiehuangx/8-Week-SQL-Challenge/assets/81607668/78099a4e-4d0e-421f-a560-b72e4321f530) 28 | 29 | ## 🧼 Data Cleaning & Transformation 30 | 31 | ### 🔨 Table: customer_orders 32 | 33 | Looking at the `customer_orders` table below, we can see that there are 34 | - In the `exclusions` column, there are missing/ blank spaces ' ' and null values. 35 | - In the `extras` column, there are missing/ blank spaces ' ' and null values. 36 | 37 | image 38 | 39 | Our course of action to clean the table: 40 | - Create a temporary table with all the columns 41 | - Remove null values in `exlusions` and `extras` columns and replace with blank space ' '. 42 | 43 | ````sql 44 | CREATE TEMP TABLE customer_orders_temp AS 45 | SELECT 46 | order_id, 47 | customer_id, 48 | pizza_id, 49 | CASE 50 | WHEN exclusions IS null OR exclusions LIKE 'null' THEN ' ' 51 | ELSE exclusions 52 | END AS exclusions, 53 | CASE 54 | WHEN extras IS NULL or extras LIKE 'null' THEN ' ' 55 | ELSE extras 56 | END AS extras, 57 | order_time 58 | FROM pizza_runner.customer_orders; 59 | ````` 60 | 61 | This is how the clean `customers_orders_temp` table looks like and we will use this table to run all our queries. 62 | 63 | image 64 | 65 | *** 66 | 67 | ### 🔨 Table: runner_orders 68 | 69 | Looking at the `runner_orders` table below, we can see that there are 70 | - In the `exclusions` column, there are missing/ blank spaces ' ' and null values. 71 | - In the `extras` column, there are missing/ blank spaces ' ' and null values 72 | 73 | image 74 | 75 | Our course of action to clean the table: 76 | - In `pickup_time` column, remove nulls and replace with blank space ' '. 77 | - In `distance` column, remove "km" and nulls and replace with blank space ' '. 78 | - In `duration` column, remove "minutes", "minute" and nulls and replace with blank space ' '. 79 | - In `cancellation` column, remove NULL and null and and replace with blank space ' '. 
80 | 81 | ````sql 82 | CREATE TEMP TABLE runner_orders_temp AS 83 | SELECT 84 | order_id, 85 | runner_id, 86 | CASE 87 | WHEN pickup_time LIKE 'null' THEN ' ' 88 | ELSE pickup_time 89 | END AS pickup_time, 90 | CASE 91 | WHEN distance LIKE 'null' THEN ' ' 92 | WHEN distance LIKE '%km' THEN TRIM('km' from distance) 93 | ELSE distance 94 | END AS distance, 95 | CASE 96 | WHEN duration LIKE 'null' THEN ' ' 97 | WHEN duration LIKE '%mins' THEN TRIM('mins' from duration) 98 | WHEN duration LIKE '%minute' THEN TRIM('minute' from duration) 99 | WHEN duration LIKE '%minutes' THEN TRIM('minutes' from duration) 100 | ELSE duration 101 | END AS duration, 102 | CASE 103 | WHEN cancellation IS NULL or cancellation LIKE 'null' THEN ' ' 104 | ELSE cancellation 105 | END AS cancellation 106 | FROM pizza_runner.runner_orders; 107 | ```` 108 | 109 | Then, we alter the `pickup_time`, `distance` and `duration` columns to the correct data type. 110 | 111 | ````sql 112 | ALTER TABLE runner_orders_temp 113 | ALTER COLUMN pickup_time DATETIME, 114 | ALTER COLUMN distance FLOAT, 115 | ALTER COLUMN duration INT; 116 | ```` 117 | 118 | This is how the clean `runner_orders_temp` table looks like and we will use this table to run all our queries. 119 | 120 | image 121 | 122 | *** 123 | 124 | ## Solution 125 | 126 | ## A. Pizza Metrics 127 | 128 | ### 1. How many pizzas were ordered? 129 | 130 | ````sql 131 | SELECT COUNT(*) AS pizza_order_count 132 | FROM customer_orders_temp; 133 | ```` 134 | 135 | **Answer:** 136 | 137 | ![1*Ma9L4y6O_zhln6Wy7CdWMQ](https://user-images.githubusercontent.com/81607668/129473598-d6d55ab2-59c7-4040-97db-d1b0c1c5b294.png) 138 | 139 | - Total of 14 pizzas were ordered. 140 | 141 | ### 2. How many unique customer orders were made? 142 | 143 | ````sql 144 | SELECT COUNT(DISTINCT order_id) AS unique_order_count 145 | FROM customer_orders_temp; 146 | ```` 147 | 148 | **Answer:** 149 | 150 | ![image](https://user-images.githubusercontent.com/81607668/129737993-710198bd-433d-469f-b5de-14e4022a3a45.png) 151 | 152 | - There are 10 unique customer orders. 153 | 154 | ### 3. How many successful orders were delivered by each runner? 155 | 156 | ````sql 157 | SELECT 158 | runner_id, 159 | COUNT(order_id) AS successful_orders 160 | FROM #runner_orders 161 | WHERE distance != 0 162 | GROUP BY runner_id; 163 | ```` 164 | 165 | **Answer:** 166 | 167 | ![image](https://user-images.githubusercontent.com/81607668/129738112-6eada46a-8c32-495a-8e26-793b2fec89ef.png) 168 | 169 | - Runner 1 has 4 successful delivered orders. 170 | - Runner 2 has 3 successful delivered orders. 171 | - Runner 3 has 1 successful delivered order. 172 | 173 | ### 4. How many of each type of pizza was delivered? 174 | 175 | ````sql 176 | SELECT 177 | p.pizza_name, 178 | COUNT(c.pizza_id) AS delivered_pizza_count 179 | FROM #customer_orders AS c 180 | JOIN #runner_orders AS r 181 | ON c.order_id = r.order_id 182 | JOIN pizza_names AS p 183 | ON c.pizza_id = p.pizza_id 184 | WHERE r.distance != 0 185 | GROUP BY p.pizza_name; 186 | ```` 187 | 188 | **Answer:** 189 | 190 | ![image](https://user-images.githubusercontent.com/81607668/129738140-c9c002ff-5aed-48ab-bdfa-cadbd98973a9.png) 191 | 192 | - There are 9 delivered Meatlovers pizzas and 3 Vegetarian pizzas. 193 | 194 | ### 5. 
How many Vegetarian and Meatlovers were ordered by each customer?** 195 | 196 | ````sql 197 | SELECT 198 | c.customer_id, 199 | p.pizza_name, 200 | COUNT(p.pizza_name) AS order_count 201 | FROM #customer_orders AS c 202 | JOIN pizza_names AS p 203 | ON c.pizza_id= p.pizza_id 204 | GROUP BY c.customer_id, p.pizza_name 205 | ORDER BY c.customer_id; 206 | ```` 207 | 208 | **Answer:** 209 | 210 | ![image](https://user-images.githubusercontent.com/81607668/129738167-269df165-1c9a-446a-b757-c7fc9a9021ed.png) 211 | 212 | - Customer 101 ordered 2 Meatlovers pizzas and 1 Vegetarian pizza. 213 | - Customer 102 ordered 2 Meatlovers pizzas and 2 Vegetarian pizzas. 214 | - Customer 103 ordered 3 Meatlovers pizzas and 1 Vegetarian pizza. 215 | - Customer 104 ordered 1 Meatlovers pizza. 216 | - Customer 105 ordered 1 Vegetarian pizza. 217 | 218 | ### 6. What was the maximum number of pizzas delivered in a single order? 219 | 220 | ````sql 221 | WITH pizza_count_cte AS 222 | ( 223 | SELECT 224 | c.order_id, 225 | COUNT(c.pizza_id) AS pizza_per_order 226 | FROM #customer_orders AS c 227 | JOIN #runner_orders AS r 228 | ON c.order_id = r.order_id 229 | WHERE r.distance != 0 230 | GROUP BY c.order_id 231 | ) 232 | 233 | SELECT 234 | MAX(pizza_per_order) AS pizza_count 235 | FROM pizza_count_cte; 236 | ```` 237 | 238 | **Answer:** 239 | 240 | ![image](https://user-images.githubusercontent.com/81607668/129738201-f676edd4-2530-4663-9ed8-6e6ec4d9cc68.png) 241 | 242 | - Maximum number of pizza delivered in a single order is 3 pizzas. 243 | 244 | ### 7. For each customer, how many delivered pizzas had at least 1 change and how many had no changes? 245 | 246 | ````sql 247 | SELECT 248 | c.customer_id, 249 | SUM( 250 | CASE WHEN c.exclusions <> ' ' OR c.extras <> ' ' THEN 1 251 | ELSE 0 252 | END) AS at_least_1_change, 253 | SUM( 254 | CASE WHEN c.exclusions = ' ' AND c.extras = ' ' THEN 1 255 | ELSE 0 256 | END) AS no_change 257 | FROM #customer_orders AS c 258 | JOIN #runner_orders AS r 259 | ON c.order_id = r.order_id 260 | WHERE r.distance != 0 261 | GROUP BY c.customer_id 262 | ORDER BY c.customer_id; 263 | ```` 264 | 265 | **Answer:** 266 | 267 | ![image](https://user-images.githubusercontent.com/81607668/129738236-2c4383cb-9d42-458c-b9be-9963c336ee58.png) 268 | 269 | - Customer 101 and 102 likes his/her pizzas per the original recipe. 270 | - Customer 103, 104 and 105 have their own preference for pizza topping and requested at least 1 change (extra or exclusion topping) on their pizza. 271 | 272 | ### 8. How many pizzas were delivered that had both exclusions and extras? 273 | 274 | ````sql 275 | SELECT 276 | SUM( 277 | CASE WHEN exclusions IS NOT NULL AND extras IS NOT NULL THEN 1 278 | ELSE 0 279 | END) AS pizza_count_w_exclusions_extras 280 | FROM #customer_orders AS c 281 | JOIN #runner_orders AS r 282 | ON c.order_id = r.order_id 283 | WHERE r.distance >= 1 284 | AND exclusions <> ' ' 285 | AND extras <> ' '; 286 | ```` 287 | 288 | **Answer:** 289 | 290 | ![image](https://user-images.githubusercontent.com/81607668/129738278-dd3e7056-309d-42fc-a5e3-00f7b5d4609e.png) 291 | 292 | - Only 1 pizza delivered that had both extra and exclusion topping. That’s one fussy customer! 293 | 294 | ### 9. What was the total volume of pizzas ordered for each hour of the day? 
295 | 296 | ````sql 297 | SELECT 298 | DATEPART(HOUR, [order_time]) AS hour_of_day, 299 | COUNT(order_id) AS pizza_count 300 | FROM #customer_orders 301 | GROUP BY DATEPART(HOUR, [order_time]); 302 | ```` 303 | 304 | **Answer:** 305 | 306 | ![image](https://user-images.githubusercontent.com/81607668/129738302-573430e9-1785-4c71-adc1-464ffa94de8a.png) 307 | 308 | - Highest volume of pizza ordered is at 13 (1:00 pm), 18 (6:00 pm) and 21 (9:00 pm). 309 | - Lowest volume of pizza ordered is at 11 (11:00 am), 19 (7:00 pm) and 23 (11:00 pm). 310 | 311 | ### 10. What was the volume of orders for each day of the week? 312 | 313 | ````sql 314 | SELECT 315 | FORMAT(DATEADD(DAY, 2, order_time),'dddd') AS day_of_week, -- add 2 to adjust 1st day of the week as Monday 316 | COUNT(order_id) AS total_pizzas_ordered 317 | FROM #customer_orders 318 | GROUP BY FORMAT(DATEADD(DAY, 2, order_time),'dddd'); 319 | ```` 320 | 321 | **Answer:** 322 | 323 | ![image](https://user-images.githubusercontent.com/81607668/129738331-233744f6-3b57-4f4f-9a51-f7a699a9eb2e.png) 324 | 325 | - There are 5 pizzas ordered on Friday and Monday. 326 | - There are 3 pizzas ordered on Saturday. 327 | - There is 1 pizza ordered on Sunday. 328 | 329 | *** 330 | 331 | ## B. Runner and Customer Experience 332 | 333 | ### 1. How many runners signed up for each 1 week period? (i.e. week starts 2021-01-01) 334 | 335 | ````sql 336 | SELECT 337 | DATEPART(WEEK, registration_date) AS registration_week, 338 | COUNT(runner_id) AS runner_signup 339 | FROM runners 340 | GROUP BY DATEPART(WEEK, registration_date); 341 | ```` 342 | 343 | **Answer:** 344 | 345 | ![image](https://user-images.githubusercontent.com/81607668/129739658-a233932a-9f79-4280-a618-8bab6d3bd1f2.png) 346 | 347 | - On Week 1 of Jan 2021, 2 new runners signed up. 348 | - On Week 2 and 3 of Jan 2021, 1 new runner signed up. 349 | 350 | ### 2. What was the average time in minutes it took for each runner to arrive at the Pizza Runner HQ to pickup the order? 351 | 352 | ````sql 353 | WITH time_taken_cte AS 354 | ( 355 | SELECT 356 | c.order_id, 357 | c.order_time, 358 | r.pickup_time, 359 | DATEDIFF(MINUTE, c.order_time, r.pickup_time) AS pickup_minutes 360 | FROM #customer_orders AS c 361 | JOIN #runner_orders AS r 362 | ON c.order_id = r.order_id 363 | WHERE r.distance != 0 364 | GROUP BY c.order_id, c.order_time, r.pickup_time 365 | ) 366 | 367 | SELECT 368 | AVG(pickup_minutes) AS avg_pickup_minutes 369 | FROM time_taken_cte 370 | WHERE pickup_minutes > 1; 371 | ```` 372 | 373 | **Answer:** 374 | 375 | ![image](https://user-images.githubusercontent.com/81607668/129739701-e94b75e9-7193-4cf3-8e77-3c76be8b638d.png) 376 | 377 | - The average time taken in minutes by runners to arrive at Pizza Runner HQ to pick up the order is 15 minutes. 378 | 379 | ### 3. Is there any relationship between the number of pizzas and how long the order takes to prepare? 
380 | 381 | ````sql 382 | WITH prep_time_cte AS 383 | ( 384 | SELECT 385 | c.order_id, 386 | COUNT(c.order_id) AS pizza_order, 387 | c.order_time, 388 | r.pickup_time, 389 | DATEDIFF(MINUTE, c.order_time, r.pickup_time) AS prep_time_minutes 390 | FROM #customer_orders AS c 391 | JOIN #runner_orders AS r 392 | ON c.order_id = r.order_id 393 | WHERE r.distance != 0 394 | GROUP BY c.order_id, c.order_time, r.pickup_time 395 | ) 396 | 397 | SELECT 398 | pizza_order, 399 | AVG(prep_time_minutes) AS avg_prep_time_minutes 400 | FROM prep_time_cte 401 | WHERE prep_time_minutes > 1 402 | GROUP BY pizza_order; 403 | ```` 404 | 405 | **Answer:** 406 | 407 | ![image](https://user-images.githubusercontent.com/81607668/129739816-05e3ba03-d3fe-4206-8557-869930a897d1.png) 408 | 409 | - On average, a single pizza order takes 12 minutes to prepare. 410 | - An order with 3 pizzas takes 30 minutes at an average of 10 minutes per pizza. 411 | - It takes 16 minutes to prepare an order with 2 pizzas which is 8 minutes per pizza — making 2 pizzas in a single order the ultimate efficiency rate. 412 | 413 | ### 4. Is there any relationship between the number of pizzas and how long the order takes to prepare? 414 | 415 | ````sql 416 | SELECT 417 | c.customer_id, 418 | AVG(r.distance) AS avg_distance 419 | FROM #customer_orders AS c 420 | JOIN #runner_orders AS r 421 | ON c.order_id = r.order_id 422 | WHERE r.duration != 0 423 | GROUP BY c.customer_id; 424 | ```` 425 | 426 | **Answer:** 427 | 428 | ![image](https://user-images.githubusercontent.com/81607668/129739847-5e338f4f-b42c-4531-9685-e2e822063183.png) 429 | 430 | _(Assuming that distance is calculated from Pizza Runner HQ to customer’s place)_ 431 | 432 | - Customer 104 stays the nearest to Pizza Runner HQ at average distance of 10km, whereas Customer 105 stays the furthest at 25km. 433 | 434 | ### 5. What was the difference between the longest and shortest delivery times for all orders? 435 | 436 | _Edit 08/10/21: Thanks to my reader, Ankush Taneja on Medium who caught my mistake. I've amended to the correct solution. Also, I was doing this case study using SQL Server few months ago, but I'm using PostgreSQL on SQLpad now so there could be a slight difference to the syntax._ 437 | 438 | Firstly, I'm going to filter results with non-null duration first just to have a feel. You can skip this step and go straight to the answer. 439 | 440 | ````sql 441 | SELECT 442 | order_id, duration 443 | FROM #runner_orders 444 | WHERE duration not like ' '; 445 | ```` 446 | 447 | image 448 | 449 | ```sql 450 | SELECT MAX(duration::NUMERIC) - MIN(duration::NUMERIC) AS delivery_time_difference 451 | FROM runner_orders2 452 | where duration not like ' '; 453 | ``` 454 | 455 | **Answer:** 456 | 457 | image 458 | 459 | - The difference between longest (40 minutes) and shortest (10 minutes) delivery time for all orders is 30 minutes. 460 | 461 | ### 6. What was the average speed for each runner for each delivery and do you notice any trend for these values? 
462 | 463 | ````sql 464 | SELECT 465 | r.runner_id, 466 | c.customer_id, 467 | c.order_id, 468 | COUNT(c.order_id) AS pizza_count, 469 | r.distance, (r.duration / 60) AS duration_hr , 470 | ROUND((r.distance/r.duration * 60), 2) AS avg_speed 471 | FROM #runner_orders AS r 472 | JOIN #customer_orders AS c 473 | ON r.order_id = c.order_id 474 | WHERE distance != 0 475 | GROUP BY r.runner_id, c.customer_id, c.order_id, r.distance, r.duration 476 | ORDER BY c.order_id; 477 | ```` 478 | 479 | **Answer:** 480 | 481 | ![image](https://user-images.githubusercontent.com/81607668/129739931-54127037-0879-43bf-b53f-e4a1a6ebffeb.png) 482 | 483 | _(Average speed = Distance in km / Duration in hour)_ 484 | - Runner 1’s average speed runs from 37.5km/h to 60km/h. 485 | - Runner 2’s average speed runs from 35.1km/h to 93.6km/h. Danny should investigate Runner 2 as the average speed has a 300% fluctuation rate! 486 | - Runner 3’s average speed is 40km/h 487 | 488 | ### 7. What is the successful delivery percentage for each runner? 489 | 490 | ````sql 491 | SELECT 492 | runner_id, 493 | ROUND(100 * SUM( 494 | CASE WHEN distance = 0 THEN 0 495 | ELSE 1 END) / COUNT(*), 0) AS success_perc 496 | FROM #runner_orders 497 | GROUP BY runner_id; 498 | ```` 499 | 500 | **Answer:** 501 | 502 | ![image](https://user-images.githubusercontent.com/81607668/129740007-021d78fb-ec32-46c0-98f2-9e8f1891baed.png) 503 | 504 | - Runner 1 has 100% successful delivery. 505 | - Runner 2 has 75% successful delivery. 506 | - Runner 3 has 50% successful delivery 507 | 508 | _(It’s not right to attribute successful delivery to runners as order cancellations are out of the runner’s control.)_ 509 | 510 | *** 511 | 512 | ## C. Ingredient Optimisation 513 | 514 | ### 1. What are the standard ingredients for each pizza? 515 | 516 | ### 2. What was the most commonly added extra? 517 | 518 | ```sql 519 | WITH toppings_cte AS ( 520 | SELECT 521 | pizza_id, 522 | REGEXP_SPLIT_TO_TABLE(toppings, '[,\s]+')::INTEGER AS topping_id 523 | FROM pizza_runner.pizza_recipes) 524 | 525 | SELECT 526 | t.topping_id, pt.topping_name, 527 | COUNT(t.topping_id) AS topping_count 528 | FROM toppings_cte t 529 | INNER JOIN pizza_runner.pizza_toppings pt 530 | ON t.topping_id = pt.topping_id 531 | GROUP BY t.topping_id, pt.topping_name 532 | ORDER BY topping_count DESC; 533 | ``` 534 | 535 | **Solution** 536 | 537 | image 538 | 539 | ### 3. What was the most common exclusion? 540 | 541 | ### 4. Generate an order item for each record in the customers_orders table in the format of one of the following: 542 | - Meat Lovers 543 | - Meat Lovers - Exclude Beef 544 | - Meat Lovers - Extra Bacon 545 | - Meat Lovers - Exclude Cheese, Bacon - Extra Mushroom, Peppers 546 | 547 | ### 5. Generate an alphabetically ordered comma separated ingredient list for each pizza order from the customer_orders table and add a 2x in front of any relevant ingredients 548 | 549 | ### 6. For example: "Meat Lovers: 2xBacon, Beef, ... , Salami" 550 | 551 | ### 7. What is the total quantity of each ingredient used in all delivered pizzas sorted by most frequent first? 552 | 553 | *** 554 | 555 | ## D. Pricing and Ratings 556 | 557 | 1. If a Meat Lovers pizza costs $12 and Vegetarian costs $10 and there were no charges for changes - how much money has Pizza Runner made so far if there are no delivery fees? 558 | 2. What if there was an additional $1 charge for any pizza extras? 559 | - Add cheese is $1 extra 560 | 3. 
The Pizza Runner team now wants to add an additional ratings system that allows customers to rate their runner, how would you design an additional table for this new dataset generate a schema for this new table and insert your own data for ratings for each successful customer order between 1 to 5. 561 | 4. Using your newly generated table - can you join all of the information together to form a table which has the following information for successful deliveries? 562 | - customer_id 563 | - order_id 564 | - runner_id 565 | - rating 566 | - order_time 567 | - pickup_time 568 | - Time between order and pickup 569 | - Delivery duration 570 | - Average speed 571 | - Total number of pizzas 572 | 5. If a Meat Lovers pizza was $12 and Vegetarian $10 fixed prices with no cost for extras and each runner is paid $0.30 per kilometre traveled - how much money does Pizza Runner have left over after these deliveries? 573 | 574 | *** 575 | -------------------------------------------------------------------------------- /Case Study #3 - Foodie-Fi/README.md: -------------------------------------------------------------------------------- 1 | # 🥑 Case Study #3: Foodie-Fi 2 | 3 | image 4 | 5 | ## 📚 Table of Contents 6 | - [Business Task](#business-task) 7 | - [Entity Relationship Diagram](#entity-relationship-diagram) 8 | - [Question and Solution](#question-and-solution) 9 | 10 | Please note that all the information regarding the case study has been sourced from the following link: [here](https://8weeksqlchallenge.com/case-study-3/). 11 | 12 | *** 13 | 14 | ## Business Task 15 | Danny and his friends launched a new startup Foodie-Fi and started selling monthly and annual subscriptions, giving their customers unlimited on-demand access to exclusive food videos from around the world. 16 | 17 | This case study focuses on using subscription style digital data to answer important business questions on customer journey, payments, and business performances. 18 | 19 | ## Entity Relationship Diagram 20 | 21 | ![image](https://user-images.githubusercontent.com/81607668/129744449-37b3229b-80b2-4cce-b8e0-707d7f48dcec.png) 22 | 23 | **Table 1: `plans`** 24 | 25 | image 26 | 27 | There are 5 customer plans. 28 | 29 | - Trial — Customer sign up to an initial 7 day free trial and will automatically continue with the pro monthly subscription plan unless they cancel, downgrade to basic or upgrade to an annual pro plan at any point during the trial. 30 | - Basic plan — Customers have limited access and can only stream their videos and is only available monthly at $9.90. 31 | - Pro plan — Customers have no watch time limits and are able to download videos for offline viewing. Pro plans start at $19.90 a month or $199 for an annual subscription. 32 | 33 | When customers cancel their Foodie-Fi service — they will have a Churn plan record with a null price, but their plan will continue until the end of the billing period. 34 | 35 | **Table 2: `subscriptions`** 36 | 37 | image 38 | 39 | Customer subscriptions show the **exact date** where their specific `plan_id` starts. 40 | 41 | If customers downgrade from a pro plan or cancel their subscription — the higher plan will remain in place until the period is over — the `start_date` in the subscriptions table will reflect the date that the actual plan changes. 42 | 43 | When customers upgrade their account from a basic plan to a pro or annual pro plan — the higher plan will take effect straightaway. 
44 | 45 | When customers churn, they will keep their access until the end of their current billing period, but the start_date will be technically the day they decided to cancel their service. 46 | 47 | *** 48 | 49 | ## Question and Solution 50 | 51 | Please join me in executing the queries using PostgreSQL on [DB Fiddle](https://www.db-fiddle.com/f/rHJhRrXy5hbVBNJ6F6b9gJ/16). It would be great to work together on the questions! 52 | 53 | Additionally, I have also published this case study on [Medium](https://medium.com/analytics-vidhya/8-week-sql-challenge-case-study-3-foodie-fi-3d8497376ea9?sk=579afc01c30aa6149d85050f8a46ddef). 54 | 55 | If you have any questions, reach out to me on [LinkedIn](https://www.linkedin.com/in/katiehuangx/). 56 | 57 | ## 🎞️ A. Customer Journey 58 | 59 | Based off the 8 sample customers provided in the sample subscriptions table below, write a brief description about each customer’s onboarding journey. 60 | 61 | **Table: Sample of `subscriptions` table** 62 | 63 | Screenshot 2021-08-17 at 11 36 10 PM 64 | 65 | **Answer:** 66 | 67 | ```sql 68 | SELECT 69 | sub.customer_id, 70 | plans.plan_id, 71 | plans.plan_name, 72 | sub.start_date 73 | FROM foodie_fi.plans 74 | JOIN foodie_fi.subscriptions AS sub 75 | ON plans.plan_id = sub.plan_id 76 | WHERE sub.customer_id IN (1,2,11,13,15,16,18,19); 77 | ``` 78 | 79 | image 80 | 81 | Based on the results above, I have selected three customers to focus on and will now share their onboarding journey. 82 | 83 | _(Refer to the table below)_ 84 | 85 | Customer 1: This customer initiated their journey by starting the free trial on 1 Aug 2020. After the trial period ended, on 8 Aug 2020, they subscribed to the basic monthly plan. 86 | 87 | image 88 | 89 | Customer 13: The onboarding journey for this customer began with a free trial on 15 Dec 2020. Following the trial period, on 22 Dec 2020, they subscribed to the basic monthly plan. After three months, on 29 Mar 2021, they upgraded to the pro monthly plan. 90 | 91 | image 92 | 93 | Customer 15: Initially, this customer commenced their onboarding journey with a free trial on 17 Mar 2020. Once the trial ended, on 24 Mar 2020, they upgraded to the pro monthly plan. However, the following month, on 29 Apr 2020, the customer decided to terminate their subscription and subsequently churned until the paid subscription ends. 94 | 95 | image 96 | 97 | *** 98 | 99 | ## B. Data Analysis Questions 100 | 101 | ### 1. How many customers has Foodie-Fi ever had? 102 | 103 | To determine the count of unique customers for Foodie-Fi, I utilize the `COUNT()` function wrapped around `DISTINCT`. 104 | 105 | ```sql 106 | SELECT COUNT(DISTINCT customer_id) AS num_of_customers 107 | FROM foodie_fi.subscriptions; 108 | ``` 109 | 110 | **Answer:** 111 | 112 | image 113 | 114 | - Foodie-Fi has 1,000 unique customers. 115 | 116 | ### 2. What is the monthly distribution of trial plan start_date values for our dataset - use the start of the month as the group by value 117 | 118 | In other words, the question is asking for the monthly count of users on the trial plan subscription. 119 | - To start, extract the numerical value of month from `start_date` column using the `DATE_PART()` function, specifying the 'month' part of a date. 120 | - Filter the results to retrieve only users with trial plan subscriptions (`plan_id = 0). 
121 | 
122 | ```sql
123 | SELECT 
124 |   DATE_PART('month', start_date) AS month_date, -- Extract the month from start_date as a number
125 |   COUNT(sub.customer_id) AS trial_plan_subscriptions
126 | FROM foodie_fi.subscriptions AS sub
127 | JOIN foodie_fi.plans p
128 |   ON sub.plan_id = p.plan_id
129 | WHERE sub.plan_id = 0 -- Trial plan ID is 0
130 | GROUP BY DATE_PART('month',start_date)
131 | ORDER BY month_date;
132 | ```
133 | 
134 | **Answer:**
135 | 
136 | image
137 | 
138 | Among all the months, March has the highest number of trial plans, while February has the lowest number of trial plans.
139 | 
140 | ### 3. What plan start_date values occur after the year 2020 for our dataset? Show the breakdown by count of events for each plan_name.
141 | 
142 | To put it simply, we have to determine the count of plans with start dates on or after 1 January 2021 grouped by plan names.
143 | 1. Filter plans based on their start dates by including only the plans occurring on or after January 1, 2021.
144 | 2. Calculate the number of customers as the number of events.
145 | 3. Group results based on the plan names. For better readability, order results in ascending order of the plan ID.
146 | 
147 | ````sql
148 | SELECT 
149 |   plans.plan_id,
150 |   plans.plan_name,
151 |   COUNT(sub.customer_id) AS num_of_events
152 | FROM foodie_fi.subscriptions AS sub
153 | JOIN foodie_fi.plans 
154 |   ON sub.plan_id = plans.plan_id
155 | WHERE sub.start_date >= '2021-01-01'
156 | GROUP BY plans.plan_id, plans.plan_name
157 | ORDER BY plans.plan_id;
158 | ````
159 | 
160 | **Answer:**
161 | 
162 | | plan_id | plan_name | num_of_events |
163 | | ------- | ------------- | ------------- |
164 | | 1 | basic monthly | 8 |
165 | | 2 | pro monthly | 60 |
166 | | 3 | pro annual | 63 |
167 | | 4 | churn | 71 |
168 | 
169 | ### 4. What is the customer count and percentage of customers who have churned rounded to 1 decimal place?
170 | 
171 | Let's analyze the question:
172 | - First, we need to determine:
173 |   - The number of customers who have churned, meaning those who have discontinued their subscription.
174 |   - The total number of customers, including both active and churned ones.
175 | 
176 | - To calculate the churn rate, we divide the number of churned customers by the total number of customers. The result should be rounded to one decimal place.
177 | 
178 | ```sql
179 | SELECT 
180 |   COUNT(DISTINCT sub.customer_id) AS churned_customers,
181 |   ROUND(100.0 * COUNT(DISTINCT sub.customer_id) 
182 |     / (SELECT COUNT(DISTINCT customer_id) 
183 |       FROM foodie_fi.subscriptions)
184 |   ,1) AS churn_percentage
185 | FROM foodie_fi.subscriptions AS sub
186 | JOIN foodie_fi.plans 
187 |   ON sub.plan_id = plans.plan_id
188 | WHERE plans.plan_id = 4; -- Filter results to customers with churn plan only
189 | ```
190 | 
191 | **Answer:**
192 | 
193 | image
194 | 
195 | - Out of the total customer base of Foodie-Fi, 307 customers have churned. This represents approximately 30.7% of the overall customer count.
196 | 
197 | ### 5. How many customers have churned straight after their initial free trial - what percentage is this rounded to the nearest whole number?
198 | 
199 | Within a CTE called `ranked_cte`, determine which customers churned immediately after the trial plan by utilizing the `ROW_NUMBER()` window function to assign rankings to each customer's plans.
200 | 
201 | In this scenario, if a customer churned right after the trial plan, the plan rankings would appear as follows:
202 | - Trial Plan - Rank 1
203 | - Churned - Rank 2
204 | 
205 | In the outer query:
206 | - Apply 2 conditions in the WHERE clause:
207 |   - Filter `plan_id = 4`.
208 |   - Filter for customers who have churned immediately after their trial with `row_num = 2`.
209 | - Count the number of customers who churned immediately after their trial period by summing a `CASE` flag that checks whether the row number is 2 (`row_num = 2`) and the plan name is 'churn' (`plan_name = 'churn'`).
210 | - Calculate the churn percentage by dividing the `churned_customers` count by the total count of distinct customer IDs in the `subscriptions` table. Round the percentage to a whole number.
211 | 
212 | ```sql
213 | WITH ranked_cte AS (
214 |   SELECT 
215 |     sub.customer_id,  
216 |     plans.plan_id, plans.plan_name,
217 |     ROW_NUMBER() OVER (
218 |       PARTITION BY sub.customer_id 
219 |       ORDER BY sub.start_date) AS row_num
220 |   FROM foodie_fi.subscriptions AS sub
221 |   JOIN foodie_fi.plans 
222 |     ON sub.plan_id = plans.plan_id
223 | )
224 | 
225 | SELECT 
226 |   SUM(CASE 
227 |     WHEN row_num = 2 AND plan_name = 'churn' THEN 1 
228 |     ELSE 0 END) AS churned_customers,
229 |   ROUND(100.0 * SUM(
230 |     CASE 
231 |       WHEN row_num = 2 AND plan_name = 'churn' THEN 1 
232 |       ELSE 0 END) 
233 |     / (SELECT COUNT(DISTINCT customer_id) 
234 |       FROM foodie_fi.subscriptions)
235 |   ) AS churn_percentage
236 | FROM ranked_cte
237 | WHERE plan_id = 4 -- Filter to churn plan.
238 |   AND row_num = 2; -- Customers who have churned immediately after trial have the churn plan ranked as 2.
239 | ```
240 | 
241 | Here's another solution using the `LEAD()` window function:
242 | ```sql
243 | WITH ranked_cte AS (
244 |   SELECT 
245 |     sub.customer_id,  
246 |     plans.plan_name, 
247 |     LEAD(plans.plan_name) OVER ( 
248 |       PARTITION BY sub.customer_id
249 |       ORDER BY sub.start_date) AS next_plan
250 |   FROM foodie_fi.subscriptions AS sub
251 |   JOIN foodie_fi.plans 
252 |     ON sub.plan_id = plans.plan_id
253 | )
254 | 
255 | SELECT 
256 |   COUNT(customer_id) AS churned_customers,
257 |   ROUND(100.0 * 
258 |     COUNT(customer_id)
259 |     / (SELECT COUNT(DISTINCT customer_id) 
260 |       FROM foodie_fi.subscriptions)
261 |   ) AS churn_percentage
262 | FROM ranked_cte
263 | WHERE plan_name = 'trial' 
264 |   AND next_plan = 'churn';
265 | ```
266 | 
267 | **Answer:**
268 | 
269 | image
270 | 
271 | - A total of 92 customers churned immediately after the initial free trial period, representing approximately 9% of the entire customer base.
272 | 
273 | ### 6. What is the number and percentage of customer plans after their initial free trial?
274 | 275 | ```sql 276 | WITH next_plans AS ( 277 | SELECT 278 | customer_id, 279 | plan_id, 280 | LEAD(plan_id) OVER( 281 | PARTITION BY customer_id 282 | ORDER BY plan_id) as next_plan_id 283 | FROM foodie_fi.subscriptions 284 | ) 285 | 286 | SELECT 287 | next_plan_id AS plan_id, 288 | COUNT(customer_id) AS converted_customers, 289 | ROUND(100 * 290 | COUNT(customer_id)::NUMERIC 291 | / (SELECT COUNT(DISTINCT customer_id) 292 | FROM foodie_fi.subscriptions) 293 | ,1) AS conversion_percentage 294 | FROM next_plans 295 | WHERE next_plan_id IS NOT NULL 296 | AND plan_id = 0 297 | GROUP BY next_plan_id 298 | ORDER BY next_plan_id; 299 | ``` 300 | 301 | **Answer:** 302 | 303 | | plan_id | converted_customers | conversion_percentage | 304 | | ------- | ------------------- | --------------------- | 305 | | 1 | 546 | 54.6 | 306 | | 2 | 325 | 32.5 | 307 | | 3 | 37 | 3.7 | 308 | | 4 | 92 | 9.2 | 309 | 310 | - More than 80% of Foodie-Fi's customers are on paid plans with a majority opting for Plans 1 and 2. 311 | - There is potential for improvement in customer acquisition for Plan 3 as only a small percentage of customers are choosing this higher-priced plan. 312 | 313 | ### 7. What is the customer count and percentage breakdown of all 5 plan_name values at 2020-12-31? 314 | 315 | In the cte called `next_dates`, we begin by filtering the results to include only the plans with start dates on or before '2020-12-31'. To identify the next start date for each plan, we utilize the `LEAD()` window function. 316 | 317 | In the outer query, we filter the results where the `next_date` is NULL. This step helps us identify the most recent plan that each customer subscribed to as of '2020-12-31'. 318 | 319 | Lastly, we perform calculations to determine the total count of customers and the percentage of customers associated with each trial plan. 320 | 321 | ```sql 322 | WITH next_dates AS ( 323 | SELECT 324 | customer_id, 325 | plan_id, 326 | start_date, 327 | LEAD(start_date) OVER ( 328 | PARTITION BY customer_id 329 | ORDER BY start_date 330 | ) AS next_date 331 | FROM foodie_fi.subscriptions 332 | WHERE start_date <= '2020-12-31' 333 | ) 334 | 335 | SELECT 336 | plan_id, 337 | COUNT(DISTINCT customer_id) AS customers, 338 | ROUND(100.0 * 339 | COUNT(DISTINCT customer_id) 340 | / (SELECT COUNT(DISTINCT customer_id) 341 | FROM foodie_fi.subscriptions) 342 | ,1) AS percentage 343 | FROM next_dates 344 | WHERE next_date IS NULL 345 | GROUP BY plan_id; 346 | ``` 347 | 348 | **Answer:** 349 | 350 | image 351 | 352 | ### 8. How many customers have upgraded to an annual plan in 2020? 353 | 354 | ```sql 355 | SELECT COUNT(DISTINCT customer_id) AS num_of_customers 356 | FROM foodie_fi.subscriptions 357 | WHERE plan_id = 3 358 | AND start_date <= '2020-12-31'; 359 | ``` 360 | 361 | **Answer:** 362 | 363 | image 364 | 365 | - 196 customers have upgraded to an annual plan in 2020. 366 | 367 | ### 9. How many days on average does it take for a customer to upgrade to an annual plan from the day they join Foodie-Fi? 368 | 369 | This question is straightforward and the query provided is self-explanatory. 370 | 371 | ````sql 372 | WITH trial_plan AS ( 373 | -- trial_plan CTE: Filter results to include only the customers subscribed to the trial plan. 374 | SELECT 375 | customer_id, 376 | start_date AS trial_date 377 | FROM foodie_fi.subscriptions 378 | WHERE plan_id = 0 379 | ), annual_plan AS ( 380 | -- annual_plan CTE: Filter results to only include the customers subscribed to the pro annual plan. 
381 | SELECT 382 | customer_id, 383 | start_date AS annual_date 384 | FROM foodie_fi.subscriptions 385 | WHERE plan_id = 3 386 | ) 387 | -- Find the average of the differences between the start date of a trial plan and a pro annual plan. 388 | SELECT 389 | ROUND( 390 | AVG( 391 | annual.annual_date - trial.trial_date) 392 | ,0) AS avg_days_to_upgrade 393 | FROM trial_plan AS trial 394 | JOIN annual_plan AS annual 395 | ON trial.customer_id = annual.customer_id; 396 | ```` 397 | 398 | **Answer:** 399 | 400 | image 401 | 402 | - On average, customers take approximately 105 days from the day they join Foodie-Fi to upgrade to an annual plan. 403 | 404 | ### 10. Can you further breakdown this average value into 30 day periods (i.e. 0-30 days, 31-60 days etc) 405 | 406 | To understand how the `WIDTH_BUCKET()` function works in creating buckets of 30-day periods, you can refer to this [StackOverflow](https://stackoverflow.com/questions/50518548/creating-a-bin-column-in-postgres-to-check-an-integer-and-return-a-string) answer. 407 | 408 | ```sql 409 | WITH trial_plan AS ( 410 | -- trial_plan CTE: Filter results to include only the customers subscribed to the trial plan. 411 | SELECT 412 | customer_id, 413 | start_date AS trial_date 414 | FROM foodie_fi.subscriptions 415 | WHERE plan_id = 0 416 | ), annual_plan AS ( 417 | -- annual_plan CTE: Filter results to only include the customers subscribed to the pro annual plan. 418 | SELECT 419 | customer_id, 420 | start_date AS annual_date 421 | FROM foodie_fi.subscriptions 422 | WHERE plan_id = 3 423 | ), bins AS ( 424 | -- bins CTE: Put customers in 30-day buckets based on the average number of days taken to upgrade to a pro annual plan. 425 | SELECT 426 | WIDTH_BUCKET(annual.annual_date - trial.trial_date, 0, 365, 12) AS avg_days_to_upgrade 427 | FROM trial_plan AS trial 428 | JOIN annual_plan AS annual 429 | ON trial.customer_id = annual.customer_id 430 | ) 431 | 432 | SELECT 433 | ((avg_days_to_upgrade - 1) * 30 || ' - ' || avg_days_to_upgrade * 30 || ' days') AS bucket, 434 | COUNT(*) AS num_of_customers 435 | FROM bins 436 | GROUP BY avg_days_to_upgrade 437 | ORDER BY avg_days_to_upgrade; 438 | ``` 439 | 440 | **Answer:** 441 | 442 | | bucket | num_of_customers | 443 | | -------------- | ---------------- | 444 | | 0 - 30 days | 49 | 445 | | 30 - 60 days | 24 | 446 | | 60 - 90 days | 35 | 447 | | 90 - 120 days | 35 | 448 | | 120 - 150 days | 43 | 449 | | 150 - 180 days | 37 | 450 | | 180 - 210 days | 24 | 451 | | 210 - 240 days | 4 | 452 | | 240 - 270 days | 4 | 453 | | 270 - 300 days | 1 | 454 | | 300 - 330 days | 1 | 455 | | 330 - 360 days | 1 | 456 | 457 | ### 11. How many customers downgraded from a pro monthly to a basic monthly plan in 2020? 458 | 459 | ```sql 460 | WITH ranked_cte AS ( 461 | SELECT 462 | sub.customer_id, 463 | plans.plan_id, 464 | plans.plan_name, 465 | LEAD(plans.plan_id) OVER ( 466 | PARTITION BY sub.customer_id 467 | ORDER BY sub.start_date) AS next_plan_id 468 | FROM foodie_fi.subscriptions AS sub 469 | JOIN foodie_fi.plans 470 | ON sub.plan_id = plans.plan_id 471 | WHERE DATE_PART('year', start_date) = 2020 472 | ) 473 | 474 | SELECT 475 | COUNT(customer_id) AS churned_customers 476 | FROM ranked_cte 477 | WHERE plan_id = 2 478 | AND next_plan_id = 1; 479 | ``` 480 | 481 | **Answer:** 482 | 483 | In 2020, there were no instances where customers downgraded from a pro monthly plan to a basic monthly plan. 
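As an optional sanity check (my own addition, not part of the original solution), the same `LEAD()` pattern can be grouped by year to confirm that the pro monthly to basic monthly downgrade never occurs anywhere in the dataset, not just in 2020:

```sql
-- Hedged sketch: count pro monthly -> basic monthly downgrades for every year.
-- Uses the same foodie_fi.subscriptions and foodie_fi.plans tables as above.
WITH plan_sequences AS (
  SELECT 
    sub.customer_id,
    plans.plan_id,
    sub.start_date,
    LEAD(plans.plan_id) OVER (
      PARTITION BY sub.customer_id
      ORDER BY sub.start_date) AS next_plan_id
  FROM foodie_fi.subscriptions AS sub
  JOIN foodie_fi.plans
    ON sub.plan_id = plans.plan_id
)

SELECT 
  DATE_PART('year', start_date) AS calendar_year,
  COUNT(customer_id) AS downgraded_customers
FROM plan_sequences
WHERE plan_id = 2       -- pro monthly
  AND next_plan_id = 1  -- basic monthly
GROUP BY DATE_PART('year', start_date);
```

An empty result here would confirm that the zero count for 2020 is not an artefact of the year filter.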
484 | 485 | *** 486 | -------------------------------------------------------------------------------- /Case Study #4 - Data Bank/README.md: -------------------------------------------------------------------------------- 1 | ## Case Study #4: Data Bank 2 | 3 | Image 4 | 5 | ## 📚 Table of Contents 6 | - [Business Task](#business-task) 7 | - [Entity Relationship Diagram](#entity-relationship-diagram) 8 | - [Question and Solution](#question-and-solution) 9 | 10 | Please note that all the information regarding the case study has been sourced from the following link: [here](https://8weeksqlchallenge.com/case-study-4/). 11 | 12 | *** 13 | 14 | ## Business Task 15 | Danny launched a new initiative, Data Bank which runs **banking activities** and also acts as the world’s most secure distributed **data storage platform**! 16 | 17 | Customers are allocated cloud data storage limits which are directly linked to how much money they have in their accounts. 18 | 19 | The management team at Data Bank want to increase their total customer base - but also need some help tracking just how much data storage their customers will need. 20 | 21 | This case study is all about calculating metrics, growth and helping the business analyse their data in a smart way to better forecast and plan for their future developments! 22 | 23 | ## Entity Relationship Diagram 24 | 25 | image 26 | 27 | **Table 1: `regions`** 28 | 29 | This regions table contains the `region_id` and their respective `region_name` values. 30 | 31 | image 32 | 33 | **Table 2: `customer_nodes`** 34 | 35 | Customers are randomly distributed across the nodes according to their region. This random distribution changes frequently to reduce the risk of hackers getting into Data Bank’s system and stealing customer’s money and data! 36 | 37 | image 38 | 39 | **Table 3: Customer Transactions** 40 | 41 | This table stores all customer deposits, withdrawals and purchases made using their Data Bank debit card. 42 | 43 | image 44 | 45 | *** 46 | 47 | ## Question and Solution 48 | 49 | Please join me in executing the queries using PostgreSQL on [DB Fiddle](https://www.db-fiddle.com/f/2GtQz4wZtuNNu7zXH5HtV4/3). It would be great to work together on the questions! 50 | 51 | If you have any questions, reach out to me on [LinkedIn](https://www.linkedin.com/in/katiehuangx/). 52 | 53 | ## 🏦 A. Customer Nodes Exploration 54 | 55 | **1. How many unique nodes are there on the Data Bank system?** 56 | 57 | ````sql 58 | SELECT COUNT(DISTINCT node_id) AS unique_nodes 59 | FROM data_bank.customer_nodes; 60 | ```` 61 | 62 | **Answer:** 63 | 64 | |unique_nodes| 65 | |:----| 66 | |5| 67 | 68 | - There are 5 unique nodes on the Data Bank system. 69 | 70 | *** 71 | 72 | **2. What is the number of nodes per region?** 73 | 74 | ````sql 75 | SELECT 76 | regions.region_name, 77 | COUNT(DISTINCT customers.node_id) AS node_count 78 | FROM data_bank.regions 79 | JOIN data_bank.customer_nodes AS customers 80 | ON regions.region_id = customers.region_id 81 | GROUP BY regions.region_name; 82 | ```` 83 | 84 | **Answer:** 85 | 86 | |region_name|node_count| 87 | |:----|:----| 88 | |Africa|5| 89 | |America|5| 90 | |Asia|5| 91 | |Australia|5| 92 | |Europe|5| 93 | 94 | *** 95 | 96 | **3. 
How many customers are allocated to each region?** 97 | 98 | ````sql 99 | SELECT 100 | region_id, 101 | COUNT(customer_id) AS customer_count 102 | FROM data_bank.customer_nodes 103 | GROUP BY region_id 104 | ORDER BY region_id; 105 | ```` 106 | 107 | **Answer:** 108 | 109 | |region_id|customer_count| 110 | |:----|:----| 111 | |1|770| 112 | |2|735| 113 | |3|714| 114 | |4|665| 115 | |5|616| 116 | 117 | *** 118 | 119 | **4. How many days on average are customers reallocated to a different node?** 120 | 121 | ````sql 122 | WITH node_days AS ( 123 | SELECT 124 | customer_id, 125 | node_id, 126 | end_date - start_date AS days_in_node 127 | FROM data_bank.customer_nodes 128 | WHERE end_date != '9999-12-31' 129 | GROUP BY customer_id, node_id, start_date, end_date 130 | ) 131 | , total_node_days AS ( 132 | SELECT 133 | customer_id, 134 | node_id, 135 | SUM(days_in_node) AS total_days_in_node 136 | FROM node_days 137 | GROUP BY customer_id, node_id 138 | ) 139 | 140 | SELECT ROUND(AVG(total_days_in_node)) AS avg_node_reallocation_days 141 | FROM total_node_days; 142 | ```` 143 | 144 | **Answer:** 145 | 146 | |avg_node_reallocation_days| 147 | |:----| 148 | |24| 149 | 150 | - On average, customers are reallocated to a different node every 24 days. 151 | 152 | *** 153 | 154 | **5. What is the median, 80th and 95th percentile for this same reallocation days metric for each region?** 155 | 156 | 157 | 158 | *** 159 | 160 | ## 🏦 B. Customer Transactions 161 | 162 | **1. What is the unique count and total amount for each transaction type?** 163 | 164 | ````sql 165 | SELECT 166 | txn_type, 167 | COUNT(customer_id) AS transaction_count, 168 | SUM(txn_amount) AS total_amount 169 | FROM data_bank.customer_transactions 170 | GROUP BY txn_type; 171 | ```` 172 | 173 | **Answer:** 174 | 175 | |txn_type|transaction_count|total_amount| 176 | |:----|:----|:----| 177 | |purchase|1617|806537| 178 | |deposit|2671|1359168| 179 | |withdrawal|1580|793003| 180 | 181 | *** 182 | 183 | **2. What is the average total historical deposit counts and amounts for all customers?** 184 | 185 | ````sql 186 | WITH deposits AS ( 187 | SELECT 188 | customer_id, 189 | COUNT(customer_id) AS txn_count, 190 | AVG(txn_amount) AS avg_amount 191 | FROM data_bank.customer_transactions 192 | WHERE txn_type = 'deposit' 193 | GROUP BY customer_id 194 | ) 195 | 196 | SELECT 197 | ROUND(AVG(txn_count)) AS avg_deposit_count, 198 | ROUND(AVG(avg_amount)) AS avg_deposit_amt 199 | FROM deposits; 200 | ```` 201 | **Answer:** 202 | 203 | |avg_deposit_count|avg_deposit_amt| 204 | |:----|:----| 205 | |5|509| 206 | 207 | - The average historical deposit count is 5 and the average historical deposit amount is $ 509. 208 | 209 | *** 210 | 211 | **3. For each month - how many Data Bank customers make more than 1 deposit and either 1 purchase or 1 withdrawal in a single month?** 212 | 213 | First, create a CTE called `monthly_transactions` to determine the count of deposit, purchase and withdrawal for each customer categorised by month using `CASE` statement and `SUM()`. 214 | 215 | In the main query, select the `mth` column and count the number of unique customers where: 216 | - `deposit_count` is greater than 1, indicating more than one deposit (`deposit_count > 1`). 217 | - Either `purchase_count` is greater than or equal to 1 (`purchase_count >= 1`) OR `withdrawal_count` is greater than or equal to 1 (`withdrawal_count >= 1`). 
218 | 
219 | ````sql
220 | WITH monthly_transactions AS (
221 |   SELECT 
222 |     customer_id, 
223 |     DATE_PART('month', txn_date) AS mth,
224 |     SUM(CASE WHEN txn_type = 'deposit' THEN 1 ELSE 0 END) AS deposit_count,
225 |     SUM(CASE WHEN txn_type = 'purchase' THEN 1 ELSE 0 END) AS purchase_count,
226 |     SUM(CASE WHEN txn_type = 'withdrawal' THEN 1 ELSE 0 END) AS withdrawal_count
227 |   FROM data_bank.customer_transactions
228 |   GROUP BY customer_id, DATE_PART('month', txn_date)
229 | )
230 | 
231 | SELECT
232 |   mth,
233 |   COUNT(DISTINCT customer_id) AS customer_count
234 | FROM monthly_transactions
235 | WHERE deposit_count > 1 
236 |   AND (purchase_count >= 1 OR withdrawal_count >= 1)
237 | GROUP BY mth
238 | ORDER BY mth;
239 | ````
240 | 
241 | **Answer:**
242 | 
243 | |month|customer_count|
244 | |:----|:----|
245 | |1|170|
246 | |2|277|
247 | |3|292|
248 | |4|103|
249 | 
250 | ***
251 | 
252 | **4. What is the closing balance for each customer at the end of the month? Also show the change in balance each month in the same table output.**
253 | 
254 | Update Jun 2, 2023: Even after 2 years, I continue to find this question incredibly challenging. I have cleaned up the code and provided additional explanations.
255 | 
256 | The key aspect to understanding the solution is to build up the table and run the CTEs cumulatively (run CTE 1 first, then run CTE 1 & 2, and so on). This approach allows for a better understanding of why specific columns were created or how the information in the tables progressed.
257 | 
258 | ```sql
259 | -- CTE 1 - To identify transaction amount as an inflow (+) or outflow (-)
260 | WITH monthly_balances_cte AS (
261 |   SELECT 
262 |     customer_id, 
263 |     (DATE_TRUNC('month', txn_date) + INTERVAL '1 MONTH - 1 DAY') AS closing_month, 
264 |     SUM(CASE 
265 |       WHEN txn_type = 'withdrawal' OR txn_type = 'purchase' THEN -txn_amount
266 |       ELSE txn_amount END) AS transaction_balance
267 |   FROM data_bank.customer_transactions
268 |   GROUP BY 
269 |     customer_id, txn_date 
270 | )
271 | 
272 | -- CTE 2 - Use GENERATE_SERIES() to generate a series of month-end dates for each customer.
273 | , monthend_series_cte AS (
274 |   SELECT
275 |     DISTINCT customer_id,
276 |     ('2020-01-31'::DATE + GENERATE_SERIES(0,3) * INTERVAL '1 MONTH') AS ending_month
277 |   FROM data_bank.customer_transactions
278 | )
279 | 
280 | -- CTE 3 - Calculate total monthly change and ending balance for each month using window function SUM()
281 | , monthly_changes_cte AS (
282 |   SELECT 
283 |     monthend_series_cte.customer_id, 
284 |     monthend_series_cte.ending_month,
285 |     SUM(monthly_balances_cte.transaction_balance) OVER (
286 |       PARTITION BY monthend_series_cte.customer_id, monthend_series_cte.ending_month
287 |       ORDER BY monthend_series_cte.ending_month
288 |     ) AS total_monthly_change,
289 |     SUM(monthly_balances_cte.transaction_balance) OVER (
290 |       PARTITION BY monthend_series_cte.customer_id 
291 |       ORDER BY monthend_series_cte.ending_month
292 |       ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
293 |     ) AS ending_balance
294 |   FROM monthend_series_cte
295 |   LEFT JOIN monthly_balances_cte
296 |     ON monthend_series_cte.ending_month = monthly_balances_cte.closing_month
297 |     AND monthend_series_cte.customer_id = monthly_balances_cte.customer_id
298 | )
299 | 
300 | -- Final query: Display the output of customer monthly statement with the ending balances.
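-- (Added note, hedged): the LEFT JOIN above can produce several rows per (customer_id, ending_month)
-- when a month has multiple transaction dates, and ending_balance is a running SUM across those rows.
-- Grouping below and taking MIN(ending_balance) collapses each month back to a single row, while
-- COALESCE(total_monthly_change, 0) shows a 0 change for months with no transactions.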
301 | SELECT 
302 |   customer_id, 
303 |   ending_month, 
304 |   COALESCE(total_monthly_change, 0) AS total_monthly_change, 
305 |   MIN(ending_balance) AS ending_balance
306 | FROM monthly_changes_cte
307 | GROUP BY 
308 |   customer_id, ending_month, total_monthly_change
309 | ORDER BY 
310 |   customer_id, ending_month;
311 | ```
312 | 
313 | **Answer:**
314 | 
315 | Showing results for customers ID 1, 2 and 3 only:
316 | |customer_id|ending_month|total_monthly_change|ending_balance|
317 | |:----|:----|:----|:----|
318 | |1|2020-01-31T00:00:00.000Z|312|312|
319 | |1|2020-02-29T00:00:00.000Z|0|312|
320 | |1|2020-03-31T00:00:00.000Z|-952|-964|
321 | |1|2020-04-30T00:00:00.000Z|0|-640|
322 | |2|2020-01-31T00:00:00.000Z|549|549|
323 | |2|2020-02-29T00:00:00.000Z|0|549|
324 | |2|2020-03-31T00:00:00.000Z|61|610|
325 | |2|2020-04-30T00:00:00.000Z|0|610|
326 | |3|2020-01-31T00:00:00.000Z|144|144|
327 | |3|2020-02-29T00:00:00.000Z|-965|-821|
328 | |3|2020-03-31T00:00:00.000Z|-401|-1222|
329 | |3|2020-04-30T00:00:00.000Z|493|-729|
330 | 
331 | ***
332 | 
333 | **5. Comparing the closing balance of a customer’s first month and the closing balance from their second month, what percentage of customers:**
334 | 
335 | For this question, I have created 2 temporary tables to solve the questions below:
336 | - Create temp table #1 `customer_monthly_balances` by copying and pasting the code from the solution to Question 4.
337 | - Create temp table #2 `ranked_monthly_balances` from temp table #1 by applying the `ROW_NUMBER()` function.
338 | 
339 | ```sql
340 | -- Temp table #1: Create a temp table using Question 4 solution
341 | CREATE TEMP TABLE customer_monthly_balances AS (
342 |   WITH monthly_balances_cte AS (
343 |     SELECT 
344 |       customer_id, 
345 |       (DATE_TRUNC('month', txn_date) + INTERVAL '1 MONTH - 1 DAY') AS closing_month, 
346 |       SUM(CASE 
347 |         WHEN txn_type = 'withdrawal' OR txn_type = 'purchase' THEN -txn_amount
348 |         ELSE txn_amount END) AS transaction_balance
349 |     FROM data_bank.customer_transactions
350 |     GROUP BY 
351 |       customer_id, txn_date 
352 |   ), monthend_series_cte AS (
353 |     SELECT
354 |       DISTINCT customer_id,
355 |       ('2020-01-31'::DATE + GENERATE_SERIES(0,3) * INTERVAL '1 MONTH') AS ending_month
356 |     FROM data_bank.customer_transactions
357 |   ), monthly_changes_cte AS (
358 |     SELECT 
359 |       monthend_series_cte.customer_id, 
360 |       monthend_series_cte.ending_month,
361 |       SUM(monthly_balances_cte.transaction_balance) OVER (
362 |         PARTITION BY monthend_series_cte.customer_id, monthend_series_cte.ending_month
363 |         ORDER BY monthend_series_cte.ending_month
364 |       ) AS total_monthly_change,
365 |       SUM(monthly_balances_cte.transaction_balance) OVER (
366 |         PARTITION BY monthend_series_cte.customer_id 
367 |         ORDER BY monthend_series_cte.ending_month
368 |         ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
369 |       ) AS ending_balance
370 |     FROM monthend_series_cte
371 |     LEFT JOIN monthly_balances_cte
372 |       ON monthend_series_cte.ending_month = monthly_balances_cte.closing_month
373 |       AND monthend_series_cte.customer_id = monthly_balances_cte.customer_id
374 |   )
375 | 
376 |   SELECT 
377 |     customer_id, 
378 |     ending_month, 
379 |     COALESCE(total_monthly_change, 0) AS total_monthly_change, 
380 |     MIN(ending_balance) AS ending_balance
381 |   FROM monthly_changes_cte
382 |   GROUP BY 
383 |     customer_id, ending_month, total_monthly_change
384 |   ORDER BY 
385 |     customer_id, ending_month
386 | );
387 | 
388 | -- Temp table #2: Create a temp table using temp table #1 `customer_monthly_balances`
389 | CREATE TEMP TABLE ranked_monthly_balances AS (
390 |   SELECT 
391 |     customer_id, 
392 |     ending_month,
393 |     ending_balance,
394 |     ROW_NUMBER() OVER (
395 |       PARTITION BY customer_id 
396 |       ORDER BY ending_month) AS ranked_row
397 |   FROM customer_monthly_balances
398 | );
399 | ```
400 | 
401 | **- What percentage of customers have a negative first month balance? What percentage of customers have a positive first month balance?**
402 | 
403 | To address both questions, I'm using one solution since the two questions are opposite ends of the same spectrum.
404 | 
405 | ````sql
406 | -- Method 1
407 | SELECT 
408 |   ROUND(100.0 * 
409 |     SUM(CASE 
410 |       WHEN ending_balance::TEXT LIKE '-%' THEN 1 ELSE 0 END)
411 |     /(SELECT COUNT(DISTINCT customer_id) 
412 |     FROM customer_monthly_balances),1) AS negative_first_month_percentage,
413 |   ROUND(100.0 * 
414 |     SUM(CASE 
415 |       WHEN ending_balance::TEXT NOT LIKE '-%' THEN 1 ELSE 0 END)
416 |     /(SELECT COUNT(DISTINCT customer_id) 
417 |     FROM customer_monthly_balances),1) AS positive_first_month_percentage
418 | FROM ranked_monthly_balances
419 | WHERE ranked_row = 1;
420 | ````
421 | 
422 | A cheeky solution would be to simply calculate one of the percentages requested and then deduct it from 100%.
423 | ```sql
424 | -- Method 2
425 | SELECT 
426 |   ROUND(100.0 * 
427 |     COUNT(customer_id)
428 |     /(SELECT COUNT(DISTINCT customer_id) 
429 |     FROM customer_monthly_balances),1) AS negative_first_month_percentage,
430 |   100 - ROUND(100.0 * COUNT(customer_id)
431 |     /(SELECT COUNT(DISTINCT customer_id) 
432 |     FROM customer_monthly_balances),1) AS positive_first_month_percentage
433 | FROM ranked_monthly_balances
434 | WHERE ranked_row = 1
435 |   AND ending_balance::TEXT LIKE '-%';
436 | ```
437 | 
438 | **Answer:**
439 | 
440 | |negative_first_month_percentage|positive_first_month_percentage|
441 | |:----|:----|
442 | |44.8|55.2|
443 | 
444 | **- What percentage of customers increase their opening month’s positive closing balance by more than 5% in the following month?**
445 | 
446 | I'm using the `LEAD()` window function to get the balances for the following month, then filtering the results to keep only the records with balances for the 1st and 2nd month.
447 | 
448 | Important assumptions:
449 | - Negative balances in the `following_balance` field have been excluded from the results. This is because a higher negative balance in the following month does not represent a true increase in balances.
450 | - Including negative balances could lead to a misrepresentation of the answer as the percentage of variance would still appear as a positive percentage.
451 | 452 | ````sql 453 | WITH following_month_cte AS ( 454 | SELECT 455 | customer_id, 456 | ending_month, 457 | ending_balance, 458 | LEAD(ending_balance) OVER ( 459 | PARTITION BY customer_id 460 | ORDER BY ending_month) AS following_balance 461 | FROM ranked_monthly_balances 462 | ) 463 | , variance_cte AS ( 464 | SELECT 465 | customer_id, 466 | ending_month, 467 | ROUND(100.0 * 468 | (following_balance - ending_balance) / ending_balance,1) AS variance 469 | FROM following_month_cte 470 | WHERE ending_month = '2020-01-31' 471 | AND following_balance::TEXT NOT LIKE '-%' 472 | GROUP BY 473 | customer_id, ending_month, ending_balance, following_balance 474 | HAVING ROUND(100.0 * (following_balance - ending_balance) / ending_balance,1) > 5.0 475 | ) 476 | 477 | SELECT 478 | ROUND(100.0 * 479 | COUNT(customer_id) 480 | / (SELECT COUNT(DISTINCT customer_id) 481 | FROM ranked_monthly_balances),1) AS increase_5_percentage 482 | FROM variance_cte; 483 | ```` 484 | 485 | **Answer:** 486 | 487 | |increase_5_percentage| 488 | |:----| 489 | |20.0| 490 | 491 | - Among the customers, 20% experience a growth of more than 5% in their positive closing balance from the opening month to the following month. 492 | 493 | **- What percentage of customers reduce their opening month’s positive closing balance by more than 5% in the following month?** 494 | 495 | ````sql 496 | WITH following_month_cte AS ( 497 | SELECT 498 | customer_id, 499 | ending_month, 500 | ending_balance, 501 | LEAD(ending_balance) OVER ( 502 | PARTITION BY customer_id 503 | ORDER BY ending_month) AS following_balance 504 | FROM ranked_monthly_balances 505 | ) 506 | , variance_cte AS ( 507 | SELECT 508 | customer_id, 509 | ending_month, 510 | ROUND((100.0 * 511 | following_balance - ending_balance) / ending_balance,1) AS variance 512 | FROM following_month_cte 513 | WHERE ending_month = '2020-01-31' 514 | AND following_balance::TEXT NOT LIKE '-%' 515 | GROUP BY 516 | customer_id, ending_month, ending_balance, following_balance 517 | HAVING ROUND((100.0 * (following_balance - ending_balance)) / ending_balance,2) < 5.0 518 | ) 519 | 520 | SELECT 521 | ROUND(100.0 * 522 | COUNT(customer_id) 523 | / (SELECT COUNT(DISTINCT customer_id) 524 | FROM ranked_monthly_balances),1) AS reduce_5_percentage 525 | FROM variance_cte; 526 | ```` 527 | 528 | **Answer:** 529 | 530 | |reduce_5_percentage| 531 | |:----| 532 | |25.6| 533 | 534 | - Among the customers, 25.6% experience a drop of more than 5% in their positive closing balance from the opening month to the following month. 
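Since the increase and decrease questions share the same scaffolding, the buckets can also be produced in a single pass. The sketch below is my own addition (not part of the original solution); it reuses the `ranked_monthly_balances` temp table and its `ranked_row` column from above, and compares balances by multiplication so no division by a zero opening balance can occur:

```sql
WITH first_two_months AS (
  SELECT 
    customer_id,
    ending_balance,
    LEAD(ending_balance) OVER (
      PARTITION BY customer_id
      ORDER BY ending_month) AS following_balance
  FROM ranked_monthly_balances
  WHERE ranked_row <= 2
)

SELECT 
  CASE 
    WHEN ending_balance <= 0 THEN 'opening balance not positive'
    WHEN following_balance > ending_balance * 1.05 THEN 'increased by more than 5%'
    WHEN following_balance < ending_balance * 0.95 THEN 'reduced by more than 5%'
    ELSE 'within +/- 5%' 
  END AS balance_bucket,
  COUNT(*) AS customers
FROM first_two_months
WHERE following_balance IS NOT NULL -- keep only the first-month row; LEAD() carries the second month
GROUP BY balance_bucket
ORDER BY customers DESC;
```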
535 | 536 | **- What percentage of customers move from a positive balance in the first month to a negative balance in the second month?** 537 | 538 | ````sql 539 | WITH following_month_cte AS ( 540 | SELECT 541 | customer_id, 542 | ending_month, 543 | ending_balance, 544 | LEAD(ending_balance) OVER ( 545 | PARTITION BY customer_id 546 | ORDER BY ending_month) AS following_balance 547 | FROM ranked_monthly_balances 548 | ) 549 | , variance_cte AS ( 550 | SELECT * 551 | FROM following_month_cte 552 | WHERE ending_month = '2020-01-31' 553 | AND ending_balance::TEXT NOT LIKE '-%' 554 | AND following_balance::TEXT LIKE '-%' 555 | ) 556 | 557 | SELECT 558 | ROUND(100.0 * 559 | COUNT(customer_id) 560 | / (SELECT COUNT(DISTINCT customer_id) 561 | FROM ranked_monthly_balances),1) AS positive_to_negative_percentage 562 | FROM variance_cte; 563 | ```` 564 | 565 | **Answer:** 566 | 567 | |positive_to_negative_percentage| 568 | |:----| 569 | |20.2| 570 | 571 | - Among the customers, 20.2% transitioned from having a positive balance (`ending_balance`) in the first month to having a negative balance (`following_balance`) in the following month. 572 | 573 | *** 574 | 575 | Do give me a 🌟 if you like what you're reading. Thank you! 🙆🏻‍♀️ 576 | -------------------------------------------------------------------------------- /Case Study #5 - Data Mart/README.md: -------------------------------------------------------------------------------- 1 | # Case Study #5: Data Mart 2 | 3 | Image 4 | 5 | ## 📚 Table of Contents 6 | - [Business Task](#business-task) 7 | - [Entity Relationship Diagram](#entity-relationship-diagram) 8 | - [Question and Solution](#question-and-solution) 9 | 10 | Please note that all the information regarding the case study has been sourced from the following link: [here](https://8weeksqlchallenge.com/case-study-5/). 11 | 12 | *** 13 | 14 | ## Business Task 15 | Data Mart is an online supermarket that specialises in fresh produce. 16 | 17 | In June 2020, large scale supply changes were made at Data Mart. All Data Mart products now use sustainable packaging methods in every single step from the farm all the way to the customer. 18 | 19 | Danny needs your help to analyse and quantify the impact of this change on the sales performance for Data Mart and it’s separate business areas. 20 | 21 | The key business question to answer are the following: 22 | - What was the quantifiable impact of the changes introduced in June 2020? 23 | - Which platform, region, segment and customer types were the most impacted by this change? 24 | - What can we do about future introduction of similar sustainability updates to the business to minimise impact on sales? 25 | 26 | ## Entity Relationship Diagram 27 | 28 | For this case study there is only a single table: `data_mart.weekly_sales` 29 | 30 | image 31 | 32 | Here are some further details about the dataset: 33 | 34 | 1. Data Mart has international operations using a multi-region strategy. 35 | 2. Data Mart has both, a retail and online `platform` in the form of a Shopify store front to serve their customers. 36 | 3. Customer `segment` and `customer_type` data relates to personal age and demographics information that is shared with Data Mart. 37 | 4. `transactions` is the count of unique purchases made through Data Mart and `sales` is the actual dollar amount of purchases. 38 | 39 | Each record in the dataset is related to a specific aggregated slice of the underlying sales data rolled up into a `week_date` value which represents the start of the sales week. 
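To get a feel for that structure before any cleaning, a quick hedged snippet (my own addition) pulls a handful of raw rows from `data_mart.weekly_sales`:

```sql
-- Inspect a random sample of the raw weekly sales data.
SELECT *
FROM data_mart.weekly_sales
ORDER BY RANDOM()
LIMIT 10;
```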
40 | 
41 | 10 random rows are shown in the table output below from `data_mart.weekly_sales`.
42 | 
43 | image
44 | 
45 | ***
46 | 
47 | ## Question and Solution
48 | 
49 | Please join me in executing the queries using PostgreSQL on [DB Fiddle](https://www.db-fiddle.com/f/jmnwogTsUE8hGqkZv9H7E8/8). It would be great to work together on the questions!
50 | 
51 | If you have any questions, reach out to me on [LinkedIn](https://www.linkedin.com/in/katiehuangx/).
52 | 
53 | ## 🧼 A. Data Cleansing Steps
54 | 
55 | In a single query, perform the following operations and generate a new table in the `data_mart` schema named `clean_weekly_sales`:
56 | - Convert the `week_date` to a `DATE` format
57 | - Add a `week_number` as the second column for each `week_date` value, for example any value from the 1st of January to 7th of January will be 1, 8th to 14th will be 2 etc
58 | - Add a `month_number` with the calendar month for each `week_date` value as the 3rd column
59 | - Add a `calendar_year` column as the 4th column containing either 2018, 2019 or 2020 values
60 | - Add a new column called `age_band` after the original segment column using the following mapping on the number inside the segment value
61 | 
62 | image
63 | 
64 | - Add a new `demographic` column using the following mapping for the first letter in the `segment` values:
65 | 
66 | | segment | demographic | 
67 | | ------- | ----------- |
68 | | C | Couples |
69 | | F | Families |
70 | 
71 | - Ensure all `null` string values are replaced with an "unknown" string value in the original `segment` column as well as the new `age_band` and `demographic` columns
72 | - Generate a new `avg_transaction` column as the sales value divided by transactions rounded to 2 decimal places for each record
73 | 
74 | **Answer:**
75 | 
76 | Let's construct the structure of the `clean_weekly_sales` table and lay out the actions to be taken.
77 | 78 | _`*` represent new columns_ 79 | 80 | | Column Name | Action to take | 81 | | ------- | --------------- | 82 | | week_date | Convert to `DATE` using `TO_DATE` 83 | | week_number* | Extract number of week using `DATE_PART` 84 | | month_number* | Extract number of month using `DATE_PART` 85 | | calendar_year* | Extract year using `DATE_PART` 86 | | region | No changes 87 | | platform | No changes 88 | | segment | No changes 89 | | age_band* | Use `CASE` statement and apply conditional logic on `segment` with 1 = `Young Adults`, 2 = `Middle Aged`, 3/4 = `Retirees` and null = `Unknown` 90 | | demographic* | Use `CASE WHEN` and apply conditional logic on based on `segment`, C = `Couples` and F = `Families` and null = `Unknown` 91 | | transactions | No changes 92 | | avg_transaction* | Divide `sales` with `transactions` and round up to 2 decimal places 93 | | sales | No changes 94 | 95 | ````sql 96 | DROP TABLE IF EXISTS clean_weekly_sales; 97 | CREATE TEMP TABLE clean_weekly_sales AS ( 98 | SELECT 99 | TO_DATE(week_date, 'DD/MM/YY') AS week_date, 100 | DATE_PART('week', TO_DATE(week_date, 'DD/MM/YY')) AS week_number, 101 | DATE_PART('month', TO_DATE(week_date, 'DD/MM/YY')) AS month_number, 102 | DATE_PART('year', TO_DATE(week_date, 'DD/MM/YY')) AS calendar_year, 103 | region, 104 | platform, 105 | segment, 106 | CASE 107 | WHEN RIGHT(segment,1) = '1' THEN 'Young Adults' 108 | WHEN RIGHT(segment,1) = '2' THEN 'Middle Aged' 109 | WHEN RIGHT(segment,1) in ('3','4') THEN 'Retirees' 110 | ELSE 'unknown' END AS age_band, 111 | CASE 112 | WHEN LEFT(segment,1) = 'C' THEN 'Couples' 113 | WHEN LEFT(segment,1) = 'F' THEN 'Families' 114 | ELSE 'unknown' END AS demographic, 115 | transactions, 116 | ROUND((sales::NUMERIC/transactions),2) AS avg_transaction, 117 | sales 118 | FROM data_mart.weekly_sales 119 | ); 120 | ```` 121 | 122 | image 123 | 124 | *** 125 | 126 | ## 🛍 B. Data Exploration 127 | 128 | **1. What day of the week is used for each week_date value?** 129 | 130 | ````sql 131 | SELECT DISTINCT(TO_CHAR(week_date, 'day')) AS week_day 132 | FROM clean_weekly_sales; 133 | ```` 134 | 135 | **Answer:** 136 | 137 | |week_day| 138 | |:----| 139 | |monday| 140 | 141 | - Monday is used for the `week_date` value. 142 | 143 | **2. What range of week numbers are missing from the dataset?** 144 | - First, generate a range of week numbers for the entire year from 1st week to the 52nd week using the `GENERATE_SERIES()` function. 145 | - Then, perform a `LEFT JOIN` with the `clean_weekly_sales`. Ensure that the join sequence is the CTE followed by the `clean_weekly_sales` as reversing the sequence would result in null results (unless you opt for a `RIGHT JOIN` instead!). 146 | 147 | ````sql 148 | WITH week_number_cte AS ( 149 | SELECT GENERATE_SERIES(1,52) AS week_number 150 | ) 151 | 152 | SELECT DISTINCT week_no.week_number 153 | FROM week_number_cte AS week_no 154 | LEFT JOIN clean_weekly_sales AS sales 155 | ON week_no.week_number = sales.week_number 156 | WHERE sales.week_number IS NULL; -- Filter to identify the missing week numbers where the values are `NULL`. 157 | ```` 158 | 159 | **Answer:** 160 | 161 | _I'm posting only the results of 5 rows here. Ensure that you have retrieved 28 rows!_ 162 | 163 | |week_number| 164 | |:----| 165 | |1| 166 | |2| 167 | |3| 168 | |37| 169 | |41| 170 | 171 | - The dataset is missing a total of 28 `week_number` records. 172 | 173 | **3. 
How many total transactions were there for each year in the dataset?** 174 | 175 | ````sql 176 | SELECT 177 | calendar_year, 178 | SUM(transactions) AS total_transactions 179 | FROM clean_weekly_sales 180 | GROUP BY calendar_year 181 | ORDER BY calendar_year; 182 | ```` 183 | 184 | **Answer:** 185 | 186 | |calendar_year|total_transactions| 187 | |:----|:----| 188 | |2018|346406460| 189 | |2019|365639285| 190 | |2020|375813651| 191 | 192 | **4. What is the total sales for each region for each month?** 193 | 194 | ````sql 195 | SELECT 196 | month_number, 197 | region, 198 | SUM(sales) AS total_sales 199 | FROM clean_weekly_sales 200 | GROUP BY month_number, region 201 | ORDER BY month_number, region; 202 | ```` 203 | 204 | **Answer:** 205 | 206 | I'm only showing the results for the month of March. 207 | 208 | |month_number|region|total_sales| 209 | |:----|:----|:----| 210 | |3|AFRICA|567767480| 211 | |3|ASIA|529770793| 212 | |3|CANADA|144634329| 213 | |3|EUROPE|35337093| 214 | |3|OCEANIA|783282888| 215 | |3|SOUTH AMERICA|71023109| 216 | |3|USA|225353043| 217 | 218 | **5. What is the total count of transactions for each platform?** 219 | 220 | ````sql 221 | SELECT 222 | platform, 223 | SUM(transactions) AS total_transactions 224 | FROM clean_weekly_sales 225 | GROUP BY platform; 226 | ```` 227 | 228 | **Answer:** 229 | 230 | |platform|total_transactions| 231 | |:----|:----| 232 | |Retail|1081934227| 233 | |Shopify|5925169| 234 | 235 | **6. What is the percentage of sales for Retail vs Shopify for each month?** 236 | 237 | ````sql 238 | WITH monthly_transactions AS ( 239 | SELECT 240 | calendar_year, 241 | month_number, 242 | platform, 243 | SUM(sales) AS monthly_sales 244 | FROM clean_weekly_sales 245 | GROUP BY calendar_year, month_number, platform 246 | ) 247 | 248 | SELECT 249 | calendar_year, 250 | month_number, 251 | ROUND(100 * MAX 252 | (CASE 253 | WHEN platform = 'Retail' THEN monthly_sales ELSE NULL END) 254 | / SUM(monthly_sales),2) AS retail_percentage, 255 | ROUND(100 * MAX 256 | (CASE 257 | WHEN platform = 'Shopify' THEN monthly_sales ELSE NULL END) 258 | / SUM(monthly_sales),2) AS shopify_percentage 259 | FROM monthly_transactions 260 | GROUP BY calendar_year, month_number 261 | ORDER BY calendar_year, month_number; 262 | ```` 263 | 264 | **Answer:** 265 | 266 | _Although I am only displaying the rows for the year 2018, please note that the overall results consist of 20 rows._ 267 | 268 | |calendar_year|month_number|retail_percentage|shopify_percentage| 269 | |:----|:----|:----|:----| 270 | |2018|3|97.92|2.08| 271 | |2018|4|97.93|2.07| 272 | |2018|5|97.73|2.27| 273 | |2018|6|97.76|2.24| 274 | |2018|7|97.75|2.25| 275 | |2018|8|97.71|2.29| 276 | |2018|9|97.68|2.32| 277 | 278 | **7. 
What is the percentage of sales by demographic for each year in the dataset?** 279 | 280 | ````sql 281 | WITH demographic_sales AS ( 282 | SELECT 283 | calendar_year, 284 | demographic, 285 | SUM(sales) AS yearly_sales 286 | FROM clean_weekly_sales 287 | GROUP BY calendar_year, demographic 288 | ) 289 | 290 | SELECT 291 | calendar_year, 292 | ROUND(100 * MAX 293 | (CASE 294 | WHEN demographic = 'Couples' THEN yearly_sales ELSE NULL END) 295 | / SUM(yearly_sales),2) AS couples_percentage, 296 | ROUND(100 * MAX 297 | (CASE 298 | WHEN demographic = 'Families' THEN yearly_sales ELSE NULL END) 299 | / SUM(yearly_sales),2) AS families_percentage, 300 | ROUND(100 * MAX 301 | (CASE 302 | WHEN demographic = 'unknown' THEN yearly_sales ELSE NULL END) 303 | / SUM(yearly_sales),2) AS unknown_percentage 304 | FROM demographic_sales 305 | GROUP BY calendar_year; 306 | ```` 307 | 308 | **Answer:** 309 | 310 | |calendar_year|couples_percentage|families_percentage|unknown_percentage| 311 | |:----|:----|:----|:----| 312 | |2019|27.28|32.47|40.25| 313 | |2018|26.38|31.99|41.63| 314 | |2020|28.72|32.73|38.55| 315 | 316 | **8. Which age_band and demographic values contribute the most to Retail sales?** 317 | 318 | ````sql 319 | SELECT 320 | age_band, 321 | demographic, 322 | SUM(sales) AS retail_sales, 323 | ROUND(100 * 324 | SUM(sales)::NUMERIC 325 | / SUM(SUM(sales)) OVER (), 326 | 1) AS contribution_percentage 327 | FROM clean_weekly_sales 328 | WHERE platform = 'Retail' 329 | GROUP BY age_band, demographic 330 | ORDER BY retail_sales DESC; 331 | ```` 332 | 333 | **Answer:** 334 | 335 | |age_band|demographic|retail_sales|contribution_percentage| 336 | |:----|:----|:----|:----| 337 | |unknown|unknown|16067285533|40.5| 338 | |Retirees|Families|6634686916|16.7| 339 | |Retirees|Couples|6370580014|16.1| 340 | |Middle Aged|Families|4354091554|11.0| 341 | |Young Adults|Couples|2602922797|6.6| 342 | |Middle Aged|Couples|1854160330|4.7| 343 | |Young Adults|Families|1770889293|4.5| 344 | 345 | The majority of the highest retail sales accounting for 42% are contributed by unknown `age_band` and `demographic`. This is followed by retired families at 16.73% and retired couples at 16.07%. 346 | 347 | **9. Can we use the `avg_transaction` column to find the average transaction size for each year for Retail vs Shopify? If not - how would you calculate it instead?** 348 | 349 | ````sql 350 | SELECT 351 | calendar_year, 352 | platform, 353 | ROUND(AVG(avg_transaction),0) AS avg_transaction_row, 354 | SUM(sales) / sum(transactions) AS avg_transaction_group 355 | FROM clean_weekly_sales 356 | GROUP BY calendar_year, platform 357 | ORDER BY calendar_year, platform; 358 | ```` 359 | 360 | **Answer:** 361 | 362 | |calendar_year|platform|avg_transaction_row|avg_transaction_group| 363 | |:----|:----|:----|:----| 364 | |2018|Retail|43|36| 365 | |2018|Shopify|188|192| 366 | |2019|Retail|42|36| 367 | |2019|Shopify|178|183| 368 | |2020|Retail|41|36| 369 | |2020|Shopify|175|179| 370 | 371 | The difference between `avg_transaction_row` and `avg_transaction_group` is as follows: 372 | - `avg_transaction_row` calculates the average transaction size by dividing the sales of each row by the number of transactions in that row. 373 | - On the other hand, `avg_transaction_group` calculates the average transaction size by dividing the total sales for the entire dataset by the total number of transactions. 374 | 375 | For finding the average transaction size for each year by platform accurately, it is recommended to use `avg_transaction_group`. 
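If exact decimals matter, a small variation of the query above (my own tweak, not part of the original solution) casts to `NUMERIC` before dividing so the group-level average is not truncated by integer division:

```sql
SELECT 
  calendar_year,
  platform,
  ROUND(SUM(sales)::NUMERIC / SUM(transactions), 2) AS avg_transaction_group
FROM clean_weekly_sales
GROUP BY calendar_year, platform
ORDER BY calendar_year, platform;
```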
376 | 377 | *** 378 | 379 | ## 🧼 C. Before & After Analysis 380 | 381 | This technique is usually used when we inspect an important event and want to inspect the impact before and after a certain point in time. 382 | 383 | Taking the `week_date` value of `2020-06-15` as the baseline week where the Data Mart sustainable packaging changes came into effect. We would include all `week_date` values for `2020-06-15` as the start of the period after the change and the previous week_date values would be before. 384 | 385 | Using this analysis approach - answer the following questions: 386 | 387 | **1. What is the total sales for the 4 weeks before and after `2020-06-15`? What is the growth or reduction rate in actual values and percentage of sales?** 388 | 389 | Before we proceed, we determine the the week_number corresponding to '2020-06-15' to use it as a filter in our analysis. 390 | 391 | ````sql 392 | SELECT DISTINCT week_number 393 | FROM clean_weekly_sales 394 | WHERE week_date = '2020-06-15' 395 | AND calendar_year = '2020'; 396 | ```` 397 | 398 | |week_number| 399 | |:----| 400 | |25| 401 | 402 | The `week_number` is 25. I created 2 CTEs: 403 | - `packaging_sales` CTE: Filter the dataset for 4 weeks before and after `2020-06-15` and calculate the sum of sales within the period. 404 | - `before_after_changes` CTE: Utilize a `CASE` statement to capture the sales for 4 weeks before and after `2020-06-15` and then calculate the total sales for the specified period. 405 | 406 | ````sql 407 | WITH packaging_sales AS ( 408 | SELECT 409 | week_date, 410 | week_number, 411 | SUM(sales) AS total_sales 412 | FROM clean_weekly_sales 413 | WHERE (week_number BETWEEN 21 AND 28) 414 | AND (calendar_year = 2020) 415 | GROUP BY week_date, week_number 416 | ) 417 | , before_after_changes AS ( 418 | SELECT 419 | SUM(CASE 420 | WHEN week_number BETWEEN 21 AND 24 THEN total_sales END) AS before_packaging_sales, 421 | SUM(CASE 422 | WHEN week_number BETWEEN 25 AND 28 THEN total_sales END) AS after_packaging_sales 423 | FROM packaging_sales 424 | ) 425 | 426 | SELECT 427 | after_packaging_sales - before_packaging_sales AS sales_variance, 428 | ROUND(100 * 429 | (after_packaging_sales - before_packaging_sales) 430 | / before_packaging_sales,2) AS variance_percentage 431 | FROM before_after_changes; 432 | ```` 433 | 434 | **Answer:** 435 | 436 | |sales_variance|variance_percentage| 437 | |:----|:----| 438 | |-26884188|-1.15| 439 | 440 | Since the implementation of the new sustainable packaging, there has been a decrease in sales amounting by $26,884,188 reflecting a negative change at 1.15%. Introducing a new packaging does not always guarantee positive results as customers may not readily recognise your product on the shelves due to the change in packaging. 441 | 442 | *** 443 | 444 | **2. What about the entire 12 weeks before and after?** 445 | 446 | We can apply a similar approach and solution to address this question. 
447 | 448 | ````sql 449 | WITH packaging_sales AS ( 450 | SELECT 451 | week_date, 452 | week_number, 453 | SUM(sales) AS total_sales 454 | FROM clean_weekly_sales 455 | WHERE (week_number BETWEEN 13 AND 37) 456 | AND (calendar_year = 2020) 457 | GROUP BY week_date, week_number 458 | ) 459 | , before_after_changes AS ( 460 | SELECT 461 | SUM(CASE 462 | WHEN week_number BETWEEN 13 AND 24 THEN total_sales END) AS before_packaging_sales, 463 | SUM(CASE 464 | WHEN week_number BETWEEN 25 AND 37 THEN total_sales END) AS after_packaging_sales 465 | FROM packaging_sales 466 | ) 467 | 468 | SELECT 469 | after_packaging_sales - before_packaging_sales AS sales_variance, 470 | ROUND(100 * 471 | (after_packaging_sales - before_packaging_sales) / before_packaging_sales,2) AS variance_percentage 472 | FROM before_after_changes; 473 | ```` 474 | 475 | **Answer:** 476 | 477 | |sales_variance|variance_percentage| 478 | |:----|:----| 479 | |-152325394|-2.14| 480 | 481 | Looks like the sales have experienced a further decline, now at a negative 2.14%! If I'm Danny's boss, I wouldn't be too happy with the results. 482 | 483 | *** 484 | 485 | **3. How do the sale metrics for these 2 periods before and after compare with the previous years in 2018 and 2019?** 486 | 487 | I'm breaking down this question to 2 parts. 488 | 489 | **Part 1: How do the sale metrics for 4 weeks before and after compare with the previous years in 2018 and 2019?** 490 | - Basically, the question is asking us to find the sales variance between 4 weeks before and after `'2020-06-15'` for years 2018, 2019 and 2020. Perhaps we can find a pattern here. 491 | - We can apply the same solution as above and add `calendar_year` into the syntax. 492 | 493 | ````sql 494 | WITH changes AS ( 495 | SELECT 496 | calendar_year, 497 | week_number, 498 | SUM(sales) AS total_sales 499 | FROM clean_weekly_sales 500 | WHERE week_number BETWEEN 21 AND 28 501 | GROUP BY calendar_year, week_number 502 | ) 503 | , before_after_changes AS ( 504 | SELECT 505 | calendar_year, 506 | SUM(CASE 507 | WHEN week_number BETWEEN 13 AND 24 THEN total_sales END) AS before_packaging_sales, 508 | SUM(CASE 509 | WHEN week_number BETWEEN 25 AND 28 THEN total_sales END) AS after_packaging_sales 510 | FROM changes 511 | GROUP BY calendar_year 512 | ) 513 | 514 | SELECT 515 | calendar_year, 516 | after_packaging_sales - before_packaging_sales AS sales_variance, 517 | ROUND(100 * 518 | (after_packaging_sales - before_packaging_sales) 519 | / before_packaging_sales,2) AS variance_percentage 520 | FROM before_after_changes; 521 | ```` 522 | 523 | **Answer:** 524 | 525 | |calendar_year|sales_variance|variance_percentage| 526 | |:----|:----|:----| 527 | |2018|4102105|0.19| 528 | |2019|2336594|0.10| 529 | |2020|-26884188|-1.15| 530 | 531 | In 2018, there was a sales variance of $4,102,105, indicating a positive change of 0.19% compared to the period before the packaging change. 532 | 533 | Similarly, in 2019, there was a sales variance of $2,336,594, corresponding to a positive change of 0.10% when comparing the period before and after the packaging change. 534 | 535 | However, in 2020, there was a substantial decrease in sales following the packaging change. The sales variance amounted to $26,884,188, indicating a significant negative change of -1.15%. This reduction represents a considerable drop compared to the previous years. 
536 | 
537 | **Part 2: How do the sale metrics for 12 weeks before and after compare with the previous years in 2018 and 2019?**
538 | - Use the same solution as above, changing the ranges to weeks 13 to 24 for the before period and weeks 25 to 37 for the after period.
539 | 
540 | ````sql
541 | WITH changes AS (
542 |   SELECT 
543 |     calendar_year,
544 |     week_number, 
545 |     SUM(sales) AS total_sales
546 |   FROM clean_weekly_sales
547 |   WHERE week_number BETWEEN 13 AND 37
548 |   GROUP BY calendar_year, week_number
549 | )
550 | , before_after_changes AS (
551 |   SELECT 
552 |     calendar_year,
553 |     SUM(CASE 
554 |       WHEN week_number BETWEEN 13 AND 24 THEN total_sales END) AS before_packaging_sales,
555 |     SUM(CASE 
556 |       WHEN week_number BETWEEN 25 AND 37 THEN total_sales END) AS after_packaging_sales
557 |   FROM changes
558 |   GROUP BY calendar_year
559 | )
560 | 
561 | SELECT 
562 |   calendar_year,
563 |   after_packaging_sales - before_packaging_sales AS sales_variance, 
564 |   ROUND(100 * 
565 |     (after_packaging_sales - before_packaging_sales) 
566 |     / before_packaging_sales,2) AS variance_percentage
567 | FROM before_after_changes;
568 | ````
569 | 
570 | **Answer:**
571 | 
572 | |calendar_year|sales_variance|variance_percentage|
573 | |:----|:----|:----|
574 | |2018|104256193|1.63|
575 | |2019|-20740294|-0.30|
576 | |2020|-152325394|-2.14|
577 | 
578 | There were noticeable variations in the percentage changes across all three years. However, the most significant contrast emerges when comparing the worst-performing year, 2020, with the best-performing year, 2018: the gap between the two comes to a stark 3.77% (1.63% + 2.14%).
579 | 
580 | ***
581 | 
582 | ## D. Bonus Question
583 | 
584 | Which areas of the business have the highest negative impact in sales metrics performance in 2020 for the 12 week before and after period?
585 | - `region`
586 | - `platform`
587 | - `age_band`
588 | - `demographic`
589 | - `customer_type`
590 | 
591 | Do you have any further recommendations for Danny’s team at Data Mart or any interesting insights based off this analysis?
592 | 
593 | ***
594 | 
--------------------------------------------------------------------------------
/Case Study #6 - Clique Bait/README.md:
--------------------------------------------------------------------------------
1 | # Case Study #6: Clique Bait
2 | 
3 | Image
4 | 
5 | ## 📚 Table of Contents
6 | - [Business Task](#business-task)
7 | - [Entity Relationship Diagram](#entity-relationship-diagram)
8 | - [Question and Solution](#question-and-solution)
9 | 
10 | Please note that all the information regarding the case study has been sourced from the following link: [here](https://8weeksqlchallenge.com/case-study-6/).
11 | 
12 | ***
13 | 
14 | ## Business Task
15 | Clique Bait is an online seafood store. 
16 | 
17 | In this case study - you are required to support the founder and CEO Danny’s vision and analyse his dataset and come up with creative solutions to calculate funnel fallout rates for the Clique Bait online store.
18 | 19 | ## Entity Relationship Diagram 20 | 21 | image 22 | 23 | **Table: `users`** 24 | 25 | image 26 | 27 | **Table: `events`** 28 | 29 | image 30 | 31 | **Table: `event_identifier`** 32 | 33 | image 34 | 35 | **Table: `page_hierarchy`** 36 | 37 | image 38 | 39 | **Table: `campaign_identifier`** 40 | 41 | image 42 | 43 | *** 44 | 45 | ## Question and Solution 46 | 47 | Please join me in executing the queries using PostgreSQL on [DB Fiddle](https://www.db-fiddle.com/f/jmnwogTsUE8hGqkZv9H7E8/17). It would be great to work together on the questions! 48 | 49 | If you have any questions, reach out to me on [LinkedIn](https://www.linkedin.com/in/katiehuangx/). 50 | 51 | ## 👩🏻‍💻 A. Digital Analysis 52 | 53 | **1. How many users are there?** 54 | 55 | ````sql 56 | SELECT 57 | COUNT(DISTINCT user_id) AS user_count 58 | FROM clique_bait.users; 59 | ```` 60 | 61 | image 62 | 63 | **2. How many cookies does each user have on average?** 64 | 65 | This was one of those tricky questions that seems easy, but the solution is not as clear as it seems. 66 | 67 | - Question is asking the number of cookies each user have on average. That's calling us to either use a `DISTINCT` or `GROUP BY` in order to ensure the count of cookies belonging to each user is unique. 68 | - Next, round up the average cookie count with 0 decimal point as it will not make sense for the cookie to be in fractional form. 69 | 70 | ````sql 71 | WITH cookie AS ( 72 | SELECT 73 | user_id, 74 | COUNT(cookie_id) AS cookie_id_count 75 | FROM clique_bait.users 76 | GROUP BY user_id) 77 | 78 | SELECT 79 | ROUND(AVG(cookie_id_count),0) AS avg_cookie_id 80 | FROM cookie; 81 | ```` 82 | 83 | image 84 | 85 | **3. What is the unique number of visits by all users per month?** 86 | - First, extract numerical month from `event_time` so that we can group the data by month. 87 | - Unique is a keyword to use `DISTINCT`. 88 | 89 | ````sql 90 | SELECT 91 | EXTRACT(MONTH FROM event_time) as month, 92 | COUNT(DISTINCT visit_id) AS unique_visit_count 93 | FROM clique_bait.events 94 | GROUP BY EXTRACT(MONTH FROM event_time); 95 | ```` 96 | 97 | image 98 | 99 | **4. What is the number of events for each event type?** 100 | 101 | ````sql 102 | SELECT 103 | event_type, 104 | COUNT(*) AS event_count 105 | FROM clique_bait.events 106 | GROUP BY event_type 107 | ORDER BY event_type; 108 | ```` 109 | 110 | image 111 | 112 | **5. What is the percentage of visits which have a purchase event?** 113 | - Join the events and events_identifier table and filter by `Purchase` event only. 114 | - As the data is now filtered to having `Purchase` events only, counting the distinct visit IDs would give you the number of purchase events. 115 | - Then, divide the number of purchase events with a subquery of total number of distinct visits from the `events` table. 116 | 117 | ````sql 118 | SELECT 119 | 100 * COUNT(DISTINCT e.visit_id)/ 120 | (SELECT COUNT(DISTINCT visit_id) FROM clique_bait.events) AS percentage_purchase 121 | FROM clique_bait.events AS e 122 | JOIN clique_bait.event_identifier AS ei 123 | ON e.event_type = ei.event_type 124 | WHERE ei.event_name = 'Purchase'; 125 | ```` 126 | 127 | image 128 | 129 | **6. What is the percentage of visits which view the checkout page but do not have a purchase event?** 130 | The strategy to answer this question is to breakdown the question into 2 parts. 
131 | 132 | Part 1: Create a `CTE` and using `CASE statements`, find the `MAX()` of: 133 | - `event_type` = 1 (Page View) and `page_id` = 12 (Checkout), and assign "1" to these events. These events are when user viewed the checkout page. 134 | - `event_type` = 3 (Purchase) and assign "1" to these events. These events signifies users who made a purchase. 135 | 136 | We're using MAX() because we do not want to group the results by `event_type` and `page_id`. Since the max score is "1", it would mean "Give me the max score for each event". 137 | 138 | Part 2: Using the table we have created, find the percentage of visits which view checkout page. 139 | 140 | ````sql 141 | WITH checkout_purchase AS ( 142 | SELECT 143 | visit_id, 144 | MAX(CASE WHEN event_type = 1 AND page_id = 12 THEN 1 ELSE 0 END) AS checkout, 145 | MAX(CASE WHEN event_type = 3 THEN 1 ELSE 0 END) AS purchase 146 | FROM clique_bait.events 147 | GROUP BY visit_id) 148 | 149 | SELECT 150 | ROUND(100 * (1-(SUM(purchase)::numeric/SUM(checkout))),2) AS percentage_checkout_view_with_no_purchase 151 | FROM checkout_purchase 152 | ```` 153 | 154 | image 155 | 156 | **7. What are the top 3 pages by number of views?** 157 | 158 | ````sql 159 | SELECT 160 | ph.page_name, 161 | COUNT(*) AS page_views 162 | FROM clique_bait.events AS e 163 | JOIN clique_bait.page_hierarchy AS ph 164 | ON e.page_id = ph.page_id 165 | WHERE e.event_type = 1 -- "Page View" 166 | GROUP BY ph.page_name 167 | ORDER BY page_views DESC -- Order by descending to retrieve highest to lowest number of views 168 | LIMIT 3; -- Limit results to 3 to find the top 3 169 | ```` 170 | 171 | **8. What is the number of views and cart adds for each product category?** 172 | 173 | ````sql 174 | SELECT 175 | ph.product_category, 176 | SUM(CASE WHEN e.event_type = 1 THEN 1 ELSE 0 END) AS page_views, 177 | SUM(CASE WHEN e.event_type = 2 THEN 1 ELSE 0 END) AS cart_adds 178 | FROM clique_bait.events AS e 179 | JOIN clique_bait.page_hierarchy AS ph 180 | ON e.page_id = ph.page_id 181 | WHERE ph.product_category IS NOT NULL 182 | GROUP BY ph.product_category 183 | ORDER BY page_views DESC; 184 | ```` 185 | 186 | image 187 | 188 | **9. What are the top 3 products by purchases?** 189 | 190 | 191 | *** 192 | 193 | ## 👩🏻‍💻 B. Product Funnel Analysis 194 | 195 | Using a single SQL query - create a new output table which has the following details: 196 | 197 | 1. How many times was each product viewed? 198 | 2. How many times was each product added to cart? 199 | 3. How many times was each product added to a cart but not purchased (abandoned)? 200 | 4. How many times was each product purchased? 201 | 202 | ## Planning Our Strategy 203 | 204 | Let us visualize the output table. 205 | 206 | | Column | Description | 207 | | ------- | ----------- | 208 | | product | Name of the product | 209 | | views | Number of views for each product | 210 | | cart_adds | Number of cart adds for each product | 211 | | abandoned | Number of times product was added to a cart, but not purchased | 212 | | purchased | Number of times product was purchased | 213 | 214 | These information would come from these 2 tables. 215 | - `events` table - visit_id, page_id, event_type 216 | - `page_hierarchy` table - page_id, product_category 217 | 218 | **Solution** 219 | 220 | - Note 1 - In `product_page_events` CTE, find page views and cart adds for individual visit ids by wrapping `SUM` around `CASE statements` so that we do not have to group the results by `event_type` as well. 
221 | - Note 2 - In `purchase_events` CTE, get only visit ids that have made purchases.
222 | - Note 3 - In `combined_table` CTE, merge `product_page_events` and `purchase_events` using `LEFT JOIN`. Take note of the table sequence. To flag the visits with a purchase, we use a `CASE` statement: where the purchase-side visit id is not null, the visit included a purchase.
223 | 
224 | ```sql
225 | WITH product_page_events AS ( -- Note 1
226 |   SELECT 
227 |     e.visit_id,
228 |     ph.product_id,
229 |     ph.page_name AS product_name,
230 |     ph.product_category,
231 |     SUM(CASE WHEN e.event_type = 1 THEN 1 ELSE 0 END) AS page_view, -- 1 for Page View
232 |     SUM(CASE WHEN e.event_type = 2 THEN 1 ELSE 0 END) AS cart_add -- 2 for Add Cart
233 |   FROM clique_bait.events AS e
234 |   JOIN clique_bait.page_hierarchy AS ph
235 |     ON e.page_id = ph.page_id
236 |   WHERE product_id IS NOT NULL
237 |   GROUP BY e.visit_id, ph.product_id, ph.page_name, ph.product_category
238 | ),
239 | purchase_events AS ( -- Note 2
240 |   SELECT 
241 |     DISTINCT visit_id
242 |   FROM clique_bait.events
243 |   WHERE event_type = 3 -- 3 for Purchase
244 | ),
245 | combined_table AS ( -- Note 3
246 |   SELECT 
247 |     ppe.visit_id, 
248 |     ppe.product_id, 
249 |     ppe.product_name, 
250 |     ppe.product_category, 
251 |     ppe.page_view, 
252 |     ppe.cart_add,
253 |     CASE WHEN pe.visit_id IS NOT NULL THEN 1 ELSE 0 END AS purchase
254 |   FROM product_page_events AS ppe
255 |   LEFT JOIN purchase_events AS pe
256 |     ON ppe.visit_id = pe.visit_id
257 | ),
258 | product_info AS (
259 |   SELECT 
260 |     product_id, product_name, -- product_id is needed for the final ORDER BY
261 |     product_category, 
262 |     SUM(page_view) AS views,
263 |     SUM(cart_add) AS cart_adds, 
264 |     SUM(CASE WHEN cart_add = 1 AND purchase = 0 THEN 1 ELSE 0 END) AS abandoned,
265 |     SUM(CASE WHEN cart_add = 1 AND purchase = 1 THEN 1 ELSE 0 END) AS purchases
266 |   FROM combined_table
267 |   GROUP BY product_id, product_name, product_category)
268 | 
269 | SELECT *
270 | FROM product_info
271 | ORDER BY product_id;
272 | ```
273 | 
274 | The logic behind the `abandoned` column: `cart_add = 1` means the customer added the item to the cart, while `purchase = 0` means the customer never purchased it, i.e. the cart was abandoned.
275 | 
276 | image
277 | 
278 | ***
279 | 
280 | Additionally, create another table which further aggregates the data for the above points but this time for each product category instead of individual products.
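Before the full solution, one shortcut worth noting: if the product-level `product_info` summary above has been saved as a table (an assumption - it may only exist as a query result), the category-level numbers can be rolled up directly from it, since all four measures are simple sums over products:

```sql
-- Sketch: roll the product-level summary up to category level,
-- assuming product_info was persisted as a table.
SELECT 
  product_category,
  SUM(views) AS views,
  SUM(cart_adds) AS cart_adds,
  SUM(abandoned) AS abandoned,
  SUM(purchases) AS purchases
FROM product_info
GROUP BY product_category;
```

The solution below builds the same output from the events data directly, which also works when `product_info` was not persisted.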
281 | 
282 | **Solution**
283 | 
284 | ```sql
285 | WITH product_page_events AS ( -- Note 1
286 |   SELECT 
287 |     e.visit_id,
288 |     ph.product_id,
289 |     ph.page_name AS product_name,
290 |     ph.product_category,
291 |     SUM(CASE WHEN e.event_type = 1 THEN 1 ELSE 0 END) AS page_view, -- 1 for Page View
292 |     SUM(CASE WHEN e.event_type = 2 THEN 1 ELSE 0 END) AS cart_add -- 2 for Add Cart
293 |   FROM clique_bait.events AS e
294 |   JOIN clique_bait.page_hierarchy AS ph
295 |     ON e.page_id = ph.page_id
296 |   WHERE product_id IS NOT NULL
297 |   GROUP BY e.visit_id, ph.product_id, ph.page_name, ph.product_category
298 | ),
299 | purchase_events AS ( -- Note 2
300 |   SELECT 
301 |     DISTINCT visit_id
302 |   FROM clique_bait.events
303 |   WHERE event_type = 3 -- 3 for Purchase
304 | ),
305 | combined_table AS ( -- Note 3
306 |   SELECT 
307 |     ppe.visit_id, 
308 |     ppe.product_id, 
309 |     ppe.product_name, 
310 |     ppe.product_category, 
311 |     ppe.page_view, 
312 |     ppe.cart_add,
313 |     CASE WHEN pe.visit_id IS NOT NULL THEN 1 ELSE 0 END AS purchase
314 |   FROM product_page_events AS ppe
315 |   LEFT JOIN purchase_events AS pe
316 |     ON ppe.visit_id = pe.visit_id
317 | ),
318 | product_category AS (
319 |   SELECT 
320 |     product_category,
321 |     SUM(page_view) AS views,
322 |     SUM(cart_add) AS cart_adds, 
323 |     SUM(CASE WHEN cart_add = 1 AND purchase = 0 THEN 1 ELSE 0 END) AS abandoned,
324 |     SUM(CASE WHEN cart_add = 1 AND purchase = 1 THEN 1 ELSE 0 END) AS purchases
325 |   FROM combined_table
326 |   GROUP BY product_category)
327 | 
328 | SELECT *
329 | FROM product_category
330 | ```
331 | 
332 | image
333 | 
334 | ***
335 | 
336 | Use your 2 new output tables - answer the following questions:
337 | 
338 | **1. Which product had the most views, cart adds and purchases?**
339 | 
340 | **2. Which product was most likely to be abandoned?**
341 | 
342 | Screenshot 2021-10-09 at 4 18 13 PM
343 | 
344 | - Oyster has the most views.
345 | - Lobster has the most cart adds and purchases.
346 | - Russian Caviar is most likely to be abandoned.
347 | 
348 | **3. Which product had the highest view to purchase percentage?**
349 | 
350 | ```sql
351 | SELECT 
352 |   product_name, 
353 |   product_category, 
354 |   ROUND(100 * purchases::numeric/views, 2) AS purchase_per_view_percentage -- cast to avoid integer division
355 | FROM product_info
356 | ORDER BY purchase_per_view_percentage DESC
357 | ```
358 | 
359 | image
360 | 
361 | - Lobster has the highest view to purchase percentage at 48.74%.
362 | 
363 | **4. What is the average conversion rate from view to cart add?**
364 | 
365 | **5. What is the average conversion rate from cart add to purchase?**
366 | 
367 | ```sql
368 | SELECT 
369 |   ROUND(100 * AVG(cart_adds::numeric/views), 2) AS avg_view_to_cart_add_conversion,
370 |   ROUND(100 * AVG(purchases::numeric/cart_adds), 2) AS avg_cart_add_to_purchases_conversion_rate
371 | FROM product_info
372 | ```
373 | 
374 | image
375 | 
376 | - The average view to cart add rate is 60.95%, and the average cart add to purchase rate is 75.93%.
377 | - Although the view to cart add rate is lower, the conversion from cart add to purchase further down the funnel is at least 15% higher.
378 | 
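One nuance before moving on: the answers to questions 4 and 5 are averages of the per-product rates. The overall funnel conversion, which weights every product by its traffic, can be computed from the same table. A sketch, again assuming `product_info` is available:

```sql
-- Sketch: overall (traffic-weighted) funnel conversion rates,
-- as opposed to the average of per-product rates shown above.
SELECT 
  ROUND(100 * SUM(cart_adds)::numeric / SUM(views), 2) AS overall_view_to_cart_add_rate,
  ROUND(100 * SUM(purchases)::numeric / SUM(cart_adds), 2) AS overall_cart_add_to_purchase_rate
FROM product_info;
```

The two figures can diverge when a few high-traffic products convert very differently from the rest, so it is worth checking both.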
379 | ***
380 | 
381 | ## 👩🏻‍💻 C. Campaigns Analysis
382 | 
383 | Generate a table that has 1 single row for every unique visit_id record and has the following columns:
384 | - `user_id`
385 | - `visit_id`
386 | - `visit_start_time`: the earliest event_time for each visit
387 | - `page_views`: count of page views for each visit
388 | - `cart_adds`: count of product cart add events for each visit
389 | - `purchase`: 1/0 flag if a purchase event exists for each visit
390 | - `campaign_name`: map the visit to a campaign if the visit_start_time falls between the start_date and end_date
391 | - `impression`: count of ad impressions for each visit
392 | - `click`: count of ad clicks for each visit
393 | - (Optional column) `cart_products`: a comma separated text value with products added to the cart sorted by the order they were added to the cart (hint: use the sequence_number)
394 | 
395 | **Solution**
396 | 
397 | Steps:
398 | - We will merge multiple tables:
399 |   - Using INNER JOIN for the `users` and `events` tables.
400 |   - Joining the `campaign_identifier` table using LEFT JOIN, as we want to keep every visit and map a campaign only where `event_time` falls between `start_date` and `end_date`.
401 |   - Joining the `page_hierarchy` table using LEFT JOIN, as we want to keep all event rows even when a page has no matching product.
402 | - To generate the earliest `visit_start_time` for each unique `visit_id`, use `MIN()` on `event_time`.
403 | - Wrap a CASE statement inside `SUM()` to count the `page_views`, `cart_adds`, `purchase`, ad `impression` and ad `click` events.
404 | - To get a list of products added to the cart sorted by the order they were added, 
405 |   - Firstly, use a CASE statement to only get cart add events.
406 |   - Then, use `STRING_AGG()` to separate products by a comma `,` and sort them using `sequence_number`.
407 | 
408 | ```sql
409 | SELECT 
410 |   u.user_id, e.visit_id, 
411 |   MIN(e.event_time) AS visit_start_time,
412 |   SUM(CASE WHEN e.event_type = 1 THEN 1 ELSE 0 END) AS page_views,
413 |   SUM(CASE WHEN e.event_type = 2 THEN 1 ELSE 0 END) AS cart_adds,
414 |   SUM(CASE WHEN e.event_type = 3 THEN 1 ELSE 0 END) AS purchase,
415 |   c.campaign_name,
416 |   SUM(CASE WHEN e.event_type = 4 THEN 1 ELSE 0 END) AS impression, 
417 |   SUM(CASE WHEN e.event_type = 5 THEN 1 ELSE 0 END) AS click, 
418 |   STRING_AGG(CASE WHEN p.product_id IS NOT NULL AND e.event_type = 2 THEN p.page_name ELSE NULL END, 
419 |     ', ' ORDER BY e.sequence_number) AS cart_products
420 | FROM clique_bait.users AS u
421 | INNER JOIN clique_bait.events AS e
422 |   ON u.cookie_id = e.cookie_id
423 | LEFT JOIN clique_bait.campaign_identifier AS c
424 |   ON e.event_time BETWEEN c.start_date AND c.end_date
425 | LEFT JOIN clique_bait.page_hierarchy AS p
426 |   ON e.page_id = p.page_id
427 | GROUP BY u.user_id, e.visit_id, c.campaign_name;
428 | ```
429 | 
430 | | user_id | visit_id | visit_start_time | page_views | cart_adds | purchase | campaign_name | impression | click | cart_products |
431 | |---------|----------|--------------------------|------------|-----------|----------|-----------------------------------|------------|-------|----------------------------------------------------------------|
432 | | 1 | 02a5d5 | 2020-02-26T16:57:26.261Z | 4 | 0 | 0 | Half Off - Treat Your Shellf(ish) | 0 | 0 | |
433 | | 1 | 0826dc | 2020-02-26T05:58:37.919Z | 1 | 0 | 0 | Half Off - Treat Your Shellf(ish) | 0 | 0 | |
434 | | 1 | 0fc437 | 2020-02-04T17:49:49.603Z | 10 | 6 | 1 | Half Off - Treat Your Shellf(ish) | 1 | 1 | Tuna, Russian Caviar, Black Truffle, Abalone, Crab, Oyster |
435 | | 1 | 30b94d | 2020-03-15T13:12:54.024Z | 9 | 
7 | 1 | Half Off - Treat Your Shellf(ish) | 1 | 1 | Salmon, Kingfish, Tuna, Russian Caviar, Abalone, Lobster, Crab | 436 | | 1 | 41355d | 2020-03-25T00:11:17.861Z | 6 | 1 | 0 | Half Off - Treat Your Shellf(ish) | 0 | 0 | Lobster | 437 | 438 | *** 439 | 440 | Use the subsequent dataset to generate at least 5 insights for the Clique Bait team - bonus: prepare a single A4 infographic that the team can use for their management reporting sessions, be sure to emphasise the most important points from your findings. 441 | 442 | Some ideas you might want to investigate further include: 443 | 444 | - Identifying users who have received impressions during each campaign period and comparing each metric with other users who did not have an impression event 445 | - Does clicking on an impression lead to higher purchase rates? 446 | - What is the uplift in purchase rate when comparing users who click on a campaign impression versus users who do not receive an impression? What if we compare them with users who just an impression but do not click? 447 | - What metrics can you use to quantify the success or failure of each campaign compared to each other? 448 | 449 | -------------------------------------------------------------------------------- /Case Study #7 - Balanced Tree Clothing Co./README.md: -------------------------------------------------------------------------------- 1 | ## 🌲 Case Study #7: Balanced Tree 2 | 3 | Image 4 | 5 | ## 📚 Table of Contents 6 | - [Business Task](#business-task) 7 | - [Entity Relationship Diagram](#entity-relationship-diagram) 8 | - [Question and Solution](#question-and-solution) 9 | 10 | Please note that all the information regarding the case study has been sourced from the following link: [here](https://8weeksqlchallenge.com/case-study-7/). 11 | 12 | *** 13 | 14 | ## Business Task 15 | 16 | Balanced Tree Clothing Company prides themselves on providing an optimised range of clothing and lifestyle wear for the modern adventurer! 17 | 18 | Danny, the CEO of this trendy fashion company has asked you to assist the team’s merchandising teams **analyse their sales performance and generate a basic financial report** to share with the wider business. 
19 | 20 | ## Entity Relationship Diagram 21 | 22 | image 23 | 24 | **Table 1: `product_details`** 25 | 26 | |product_id|price|product_name|category_id|segment_id|style_id|category_name|segment_name|style_name| 27 | |:----|:----|:----|:----|:----|:----|:----|:----|:----| 28 | |c4a632|13|Navy Oversized Jeans - Womens|1|3|7|Womens|Jeans|Navy Oversized| 29 | |e83aa3|32|Black Straight Jeans - Womens|1|3|8|Womens|Jeans|Black Straight| 30 | |e31d39|10|Cream Relaxed Jeans - Womens|1|3|9|Womens|Jeans|Cream Relaxed| 31 | |d5e9a6|23|Khaki Suit Jacket - Womens|1|4|10|Womens|Jacket|Khaki Suit| 32 | |72f5d4|19|Indigo Rain Jacket - Womens|1|4|11|Womens|Jacket|Indigo Rain| 33 | |9ec847|54|Grey Fashion Jacket - Womens|1|4|12|Womens|Jacket|Grey Fashion| 34 | |5d267b|40|White Tee Shirt - Mens|2|5|13|Mens|Shirt|White Tee| 35 | |c8d436|10|Teal Button Up Shirt - Mens|2|5|14|Mens|Shirt|Teal Button Up| 36 | |2a2353|57|Blue Polo Shirt - Mens|2|5|15|Mens|Shirt|Blue Polo| 37 | |f084eb|36|Navy Solid Socks - Mens|2|6|16|Mens|Socks|Navy Solid| 38 | 39 | 40 | **Table 2: `sales`** 41 | 42 | |prod_id|qty|price|discount|member|txn_id|start_txn_time| 43 | |:----|:----|:----|:----|:----|:----|:----| 44 | |c4a632|4|13|17|true|54f307|2021-02-13T01:59:43.296Z| 45 | |5d267b|4|40|17|true|54f307|2021-02-13T01:59:43.296Z| 46 | |b9a74d|4|17|17|true|54f307|2021-02-13T01:59:43.296Z| 47 | |2feb6b|2|29|17|true|54f307|2021-02-13T01:59:43.296Z| 48 | |c4a632|5|13|21|true|26cc98|2021-01-19T01:39:00.345Z| 49 | |e31d39|2|10|21|true|26cc98|2021-01-19T01:39:00.345Z| 50 | |72f5d4|3|19|21|true|26cc98|2021-01-19T01:39:00.345Z| 51 | |2a2353|3|57|21|true|26cc98|2021-01-19T01:39:00.345Z| 52 | |f084eb|3|36|21|true|26cc98|2021-01-19T01:39:00.345Z| 53 | |c4a632|1|13|21|false|ef648d|2021-01-27T02:18:17.164Z| 54 | 55 | **Table 3: `product_hierarchy`** 56 | 57 | |id|parent_id|level_text|level_name| 58 | |:----|:----|:----|:----| 59 | |1|null|Womens|Category| 60 | |2|null|Mens|Category| 61 | |3|1|Jeans|Segment| 62 | |4|1|Jacket|Segment| 63 | |5|2|Shirt|Segment| 64 | |6|2|Socks|Segment| 65 | |7|3|Navy Oversized|Style| 66 | |8|3|Black Straight|Style| 67 | |9|3|Cream Relaxed|Style| 68 | |10|4|Khaki Suit|Style| 69 | 70 | **Table 4: `product_prices`** 71 | 72 | |id|product_id|price| 73 | |:----|:----|:----| 74 | |7|c4a632|13| 75 | |8|e83aa3|32| 76 | |9|e31d39|10| 77 | |10|d5e9a6|23| 78 | |11|72f5d4|19| 79 | |12|9ec847|54| 80 | |13|5d267b|40| 81 | |14|c8d436|10| 82 | |15|2a2353|57| 83 | |16|f084eb|36| 84 | 85 | *** 86 | 87 | ## Question and Solution 88 | 89 | Please join me in executing the queries using PostgreSQL on [DB Fiddle](https://www.db-fiddle.com/f/dkhULDEjGib3K58MvDjYJr/8). It would be great to work together on the questions! 90 | 91 | If you have any questions, reach out to me on [LinkedIn](https://www.linkedin.com/in/katiehuangx/). 92 | 93 | ## 📈 A. High Level Sales Analysis 94 | 95 | **1. 
What was the total quantity sold for all products?** 96 | 97 | ```sql 98 | SELECT 99 | product.product_name, 100 | SUM(sales.qty) AS total_quantity 101 | FROM balanced_tree.sales 102 | INNER JOIN balanced_tree.product_details AS product 103 | ON sales.prod_id = product.product_id 104 | GROUP BY product.product_name; 105 | ``` 106 | 107 | **Answer:** 108 | 109 | |product_name|total_quantity| 110 | |:----|:----| 111 | |White Tee Shirt - Mens|3800| 112 | |Navy Solid Socks - Mens|3792| 113 | |Grey Fashion Jacket - Womens|3876| 114 | |Navy Oversized Jeans - Womens|3856| 115 | |Pink Fluro Polkadot Socks - Mens|3770| 116 | |Khaki Suit Jacket - Womens|3752| 117 | |Black Straight Jeans - Womens|3786| 118 | |White Striped Socks - Mens|3655| 119 | |Blue Polo Shirt - Mens|3819| 120 | |Indigo Rain Jacket - Womens|3757| 121 | |Cream Relaxed Jeans - Womens|3707| 122 | |Teal Button Up Shirt - Mens|3646| 123 | 124 | *** 125 | 126 | **2. What is the total generated revenue for all products before discounts?** 127 | 128 | ```sql 129 | SELECT 130 | product.product_name, 131 | SUM(sales.qty) * SUM(sales.price) AS total_revenue 132 | FROM balanced_tree.sales 133 | INNER JOIN balanced_tree.product_details AS product 134 | ON sales.prod_id = product.product_id 135 | GROUP BY product.product_name; 136 | ``` 137 | 138 | **Answer:** 139 | 140 | |product_name|total_revenue| 141 | |:----|:----| 142 | |White Tee Shirt - Mens|192736000| 143 | |Navy Solid Socks - Mens|174871872| 144 | |Grey Fashion Jacket - Womens|266862600| 145 | |Navy Oversized Jeans - Womens|63863072| 146 | |Pink Fluro Polkadot Socks - Mens|137537140| 147 | |Khaki Suit Jacket - Womens|107611112| 148 | |Black Straight Jeans - Womens|150955392| 149 | |White Striped Socks - Mens|77233805| 150 | |Blue Polo Shirt - Mens|276022044| 151 | |Indigo Rain Jacket - Womens|89228750| 152 | |Cream Relaxed Jeans - Womens|46078010| 153 | |Teal Button Up Shirt - Mens|45283320| 154 | 155 | *** 156 | 157 | **3. What was the total discount amount for all products?** 158 | 159 | ```sql 160 | SELECT 161 | product.product_name, 162 | SUM(sales.qty * sales.price * sales.discount/100) AS total_discount 163 | FROM balanced_tree.sales 164 | INNER JOIN balanced_tree.product_details AS product 165 | ON sales.prod_id = product.product_id 166 | GROUP BY product.product_name; 167 | ``` 168 | 169 | **Answer:** 170 | 171 | |product_name|total_discount| 172 | |:----|:----| 173 | |White Tee Shirt - Mens|17968| 174 | |Navy Solid Socks - Mens|16059| 175 | |Grey Fashion Jacket - Womens|24781| 176 | |Navy Oversized Jeans - Womens|5538| 177 | |Pink Fluro Polkadot Socks - Mens|12344| 178 | |Khaki Suit Jacket - Womens|9660| 179 | |Black Straight Jeans - Womens|14156| 180 | |White Striped Socks - Mens|6877| 181 | |Blue Polo Shirt - Mens|26189| 182 | |Indigo Rain Jacket - Womens|8010| 183 | |Cream Relaxed Jeans - Womens|3979| 184 | |Teal Button Up Shirt - Mens|3925| 185 | 186 | *** 187 | 188 | ## 🧾 B. Transaction Analysis 189 | 190 | **1. How many unique transactions were there?** 191 | 192 | ```sql 193 | SELECT COUNT(DISTINCT txn_id) AS transaction_count 194 | FROM balanced_tree.sales; 195 | ``` 196 | 197 | **Answer:** 198 | 199 | |transaction_count| 200 | |:----| 201 | |2500| 202 | 203 | *** 204 | 205 | **2. 
What is the average unique products purchased in each transaction?** 206 | 207 | ```sql 208 | SELECT ROUND(AVG(total_quantity)) AS avg_unique_products 209 | FROM ( 210 | SELECT 211 | txn_id, 212 | SUM(qty) AS total_quantity 213 | FROM balanced_tree.sales 214 | GROUP BY txn_id 215 | ) AS total_quantities; 216 | ``` 217 | 218 | **Answer:** 219 | 220 | |avg_unique_products| 221 | |:----| 222 | |18| 223 | 224 | *** 225 | 226 | **3. What are the 25th, 50th and 75th percentile values for the revenue per transaction?** 227 | 228 | ```sql 229 | WITH revenue_cte AS ( 230 | SELECT 231 | txn_id, 232 | SUM(price * qty) AS revenue 233 | FROM balanced_tree.sales 234 | GROUP BY txn_id 235 | ) 236 | 237 | SELECT 238 | PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY revenue) AS median_25th, 239 | PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY revenue) AS median_50th, 240 | PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY revenue) AS median_75th 241 | FROM revenue_cte; 242 | ``` 243 | 244 | **Answer:** 245 | 246 | |median_25th|median_50th|median_75th| 247 | |:----|:----|:----| 248 | |375.75|509.5|647| 249 | 250 | *** 251 | 252 | **4. What is the average discount value per transaction?** 253 | 254 | ```sql 255 | SELECT ROUND(AVG(discount_amt)) AS avg_discount 256 | FROM ( 257 | SELECT 258 | txn_id, 259 | SUM(qty * price * discount/100) AS discount_amt 260 | FROM balanced_tree.sales 261 | GROUP BY txn_id 262 | ) AS discounted_value 263 | ``` 264 | 265 | **Answer:** 266 | 267 | |avg_discount| 268 | |:----| 269 | |60| 270 | 271 | **5. What is the percentage split of all transactions for members vs non-members?** 272 | 273 | ```sql 274 | WITH transactions_cte AS ( 275 | SELECT 276 | member, 277 | COUNT(DISTINCT txn_id) AS transactions 278 | FROM balanced_tree.sales 279 | GROUP BY member 280 | ) 281 | 282 | SELECT 283 | member, 284 | transactions, 285 | ROUND(100 * transactions 286 | /(SELECT SUM(transactions) 287 | FROM transactions_cte)) AS percentage 288 | FROM transactions_cte 289 | GROUP BY member, transactions; 290 | ``` 291 | 292 | **Answer:** 293 | 294 | Members have a transaction count at 60% compared to than non-members who account for only 40% of the transactions. 295 | 296 | |member|transactions|percentage| 297 | |:----|:----|:----| 298 | |false|995|40| 299 | |true|1505|60| 300 | 301 | *** 302 | 303 | **6. What is the average revenue for member transactions and non-member transactions?** 304 | 305 | ```sql 306 | WITH revenue_cte AS ( 307 | SELECT 308 | member, 309 | txn_id, 310 | SUM(price * qty) AS revenue 311 | FROM balanced_tree.sales 312 | GROUP BY member, txn_id 313 | ) 314 | 315 | SELECT 316 | member, 317 | ROUND(AVG(revenue),2) AS avg_revenue 318 | FROM revenue_cte 319 | GROUP BY member; 320 | ``` 321 | 322 | **Answer:** 323 | 324 | The average revenue per transaction for members is only $1.23 higher compared to non-members. 325 | 326 | |member|avg_revenue| 327 | |:----|:----| 328 | |false|515.04| 329 | |true|516.27| 330 | 331 | *** 332 | 333 | ## 👚 C. Product Analysis 334 | 335 | **1. 
What are the top 3 products by total revenue before discount?** 336 | 337 | ```sql 338 | SELECT 339 | product.product_id, 340 | product.product_name, 341 | SUM(sales.qty) * SUM(sales.price) AS total_revenue 342 | FROM balanced_tree.sales 343 | INNER JOIN balanced_tree.product_details AS product 344 | ON sales.prod_id = product.product_id 345 | GROUP BY product.product_id, product.product_name 346 | ORDER BY total_revenue DESC 347 | LIMIT 3; 348 | ``` 349 | 350 | **Answer:** 351 | 352 | |product_id|product_name|total_revenue| 353 | |:----|:----|:----| 354 | |2a2353|Blue Polo Shirt - Mens|276022044| 355 | |9ec847|Grey Fashion Jacket - Womens|266862600| 356 | |5d267b|White Tee Shirt - Mens|192736000| 357 | 358 | *** 359 | 360 | **2. What is the total quantity, revenue and discount for each segment?** 361 | 362 | ```sql 363 | SELECT 364 | product.segment_id, 365 | product.segment_name, 366 | SUM(sales.qty) AS total_quantity, 367 | SUM(sales.qty * sales.price) AS total_revenue, 368 | SUM((sales.qty * sales.price) * sales.discount/100) AS total_discount 369 | FROM balanced_tree.sales 370 | INNER JOIN balanced_tree.product_details AS product 371 | ON sales.prod_id = product.product_id 372 | GROUP BY product.segment_id, product.segment_name; 373 | ``` 374 | 375 | **Answer:** 376 | 377 | |segment_id|segment_name|total_quantity|total_revenue|total_discount| 378 | |:----|:----|:----|:----|:----| 379 | |4|Jacket|11385|366983|42451| 380 | |6|Socks|11217|307977|35280| 381 | |5|Shirt|11265|406143|48082| 382 | |3|Jeans|11349|208350|23673| 383 | 384 | *** 385 | 386 | **3. What is the top selling product for each segment?** 387 | 388 | ```sql 389 | WITH top_selling_cte AS ( 390 | SELECT 391 | product.segment_id, 392 | product.segment_name, 393 | product.product_id, 394 | product.product_name, 395 | SUM(sales.qty) AS total_quantity, 396 | RANK() OVER ( 397 | PARTITION BY segment_id 398 | ORDER BY SUM(sales.qty) DESC) AS ranking 399 | FROM balanced_tree.sales 400 | INNER JOIN balanced_tree.product_details AS product 401 | ON sales.prod_id = product.product_id 402 | GROUP BY 403 | product.segment_id, product.segment_name, product.product_id, product.product_name 404 | ) 405 | 406 | SELECT 407 | segment_id, 408 | segment_name, 409 | product_id, 410 | product_name, 411 | total_quantity 412 | FROM top_selling_cte 413 | WHERE ranking = 1; 414 | ``` 415 | 416 | **Answer:** 417 | 418 | |segment_id|segment_name|product_id|product_name|total_quantity| 419 | |:----|:----|:----|:----|:----| 420 | |3|Jeans|c4a632|Navy Oversized Jeans - Womens|3856| 421 | |4|Jacket|9ec847|Grey Fashion Jacket - Womens|3876| 422 | |5|Shirt|2a2353|Blue Polo Shirt - Mens|3819| 423 | |6|Socks|f084eb|Navy Solid Socks - Mens|3792| 424 | 425 | *** 426 | 427 | **4. 
What is the total quantity, revenue and discount for each category?** 428 | 429 | ```sql 430 | SELECT 431 | product.category_id, 432 | product.category_name, 433 | SUM(sales.qty) AS total_quantity, 434 | SUM(sales.qty * sales.price) AS total_revenue, 435 | SUM((sales.qty * sales.price) * sales.discount/100) AS total_discount 436 | FROM balanced_tree.sales 437 | INNER JOIN balanced_tree.product_details AS product 438 | ON sales.prod_id = product.product_id 439 | GROUP BY product.category_id, product.category_name 440 | ORDER BY product.category_id; 441 | ``` 442 | 443 | **Answer:** 444 | 445 | |category_id|category_name|total_quantity|total_revenue|total_discount| 446 | |:----|:----|:----|:----|:----| 447 | |1|Womens|22734|575333|66124| 448 | |2|Mens|22482|714120|83362| 449 | 450 | *** 451 | 452 | **5. What is the top selling product for each category?** 453 | 454 | ```sql 455 | WITH top_selling_cte AS ( 456 | SELECT 457 | product.category_id, 458 | product.category_name, 459 | product.product_id, 460 | product.product_name, 461 | SUM(sales.qty) AS total_quantity, 462 | RANK() OVER ( 463 | PARTITION BY product.category_id 464 | ORDER BY SUM(sales.qty) DESC) AS ranking 465 | FROM balanced_tree.sales 466 | INNER JOIN balanced_tree.product_details AS product 467 | ON sales.prod_id = product.product_id 468 | GROUP BY 469 | product.category_id, product.category_name, product.product_id, product.product_name 470 | ) 471 | 472 | SELECT 473 | category_id, 474 | category_name, 475 | product_id, 476 | product_name, 477 | total_quantity 478 | FROM top_selling_cte 479 | WHERE ranking = 1; 480 | ``` 481 | 482 | **Answer:** 483 | 484 | |category_id|category_name|product_id|product_name|total_quantity| 485 | |:----|:----|:----|:----|:----| 486 | |1|Womens|9ec847|Grey Fashion Jacket - Womens|3876| 487 | |2|Mens|2a2353|Blue Polo Shirt - Mens|3819| 488 | 489 | *** 490 | 491 | **6. What is the percentage split of revenue by product for each segment?** 492 | 493 | **Answer:** 494 | 495 | *** 496 | 497 | **7. What is the percentage split of revenue by segment for each category?** 498 | 499 | **Answer:** 500 | 501 | *** 502 | 503 | **8. What is the percentage split of total revenue by category?** 504 | 505 | **Answer:** 506 | 507 | *** 508 | 509 | **9. What is the total transaction “penetration” for each product? (hint: penetration = number of transactions where at least 1 quantity of a product was purchased divided by total number of transactions)** 510 | 511 | **Answer:** 512 | 513 | *** 514 | 515 | **10. What is the most common combination of at least 1 quantity of any 3 products in a 1 single transaction?** 516 | 517 | **Answer:** 518 | 519 | *** 520 | 521 | ## 📝 Reporting Challenge 522 | 523 | Write a single SQL script that combines all of the previous questions into a scheduled report that the Balanced Tree team can run at the beginning of each month to calculate the previous month’s values. 524 | 525 | Imagine that the Chief Financial Officer (which is also Danny) has asked for all of these questions at the end of every month. 526 | 527 | He first wants you to generate the data for January only - but then he also wants you to demonstrate that you can easily run the samne analysis for February without many changes (if at all). 
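One way to make the month switch painless is to pin the reporting month in a single CTE and have every report query read from it. This is only a sketch of the idea, not the full report script; the column names follow the `sales` table shown above:

```sql
-- Sketch: change one date value to re-run the whole report for another month.
WITH report_period AS (
  SELECT 
    DATE '2021-01-01' AS period_start,                      -- switch to '2021-02-01' for February
    DATE '2021-01-01' + INTERVAL '1 month' AS period_end
),
monthly_sales AS (
  SELECT s.*
  FROM balanced_tree.sales AS s
  JOIN report_period AS p
    ON s.start_txn_time >= p.period_start
   AND s.start_txn_time < p.period_end
)

-- Every report query then selects FROM monthly_sales, for example:
SELECT COUNT(DISTINCT txn_id) AS monthly_transaction_count
FROM monthly_sales;
```

Because only the `report_period` CTE changes from month to month, the rest of the report can stay exactly the same.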
528 | 529 | Feel free to split up your final outputs into as many tables as you need - but be sure to explicitly reference which table outputs relate to which question for full marks :) 530 | 531 | *** 532 | 533 | ## 💡 Bonus Challenge 534 | 535 | Use a single SQL query to transform the `product_hierarchy` and `product_prices` datasets to the `product_details` table. 536 | 537 | Hint: you may want to consider using a recursive CTE to solve this problem! 538 | 539 | *** 540 | -------------------------------------------------------------------------------- /Case Study #8: Fresh Segments/A. Data Exploration and Cleansing.md: -------------------------------------------------------------------------------- 1 | # 🍅 Case Study #8 - Fresh Segments 2 | 3 | ## 🧼 Solution - A. Data Exploration and Cleansing 4 | 5 | ### 1. Update the `fresh_segments.interest_metrics` table by modifying the `month_year` column to be a `date` data type with the start of the month 6 | 7 | ```sql 8 | ALTER TABLE fresh_segments.interest_metrics 9 | ALTER month_year TYPE DATE USING month_year::DATE; 10 | ``` 11 | 12 | image 13 | 14 | ### 2. What is count of records in the `fresh_segments.interest_metrics` for each `month_year` value sorted in chronological order (earliest to latest) with the `null` values appearing first? 15 | 16 | ```sql 17 | SELECT 18 | month_year, COUNT(*) 19 | FROM fresh_segments.interest_metrics 20 | GROUP BY month_year 21 | ORDER BY month_year NULLS FIRST; 22 | ``` 23 | 24 | image 25 | 26 | ### 3. What do you think we should do with these `null` values in the `fresh_segments.interest_metrics`? 27 | 28 | The `null` values appear in `_month`, `_year`, `month_year`, and `interest_id`. The corresponding values in `composition`, `index_value`, `ranking`, and `percentile_ranking` fields are not meaningful without the specific information on `interest_id` and dates. 29 | 30 | Before dropping the values, it would be useful to find out the percentage of `null` values. 31 | 32 | ```sql 33 | SELECT 34 | ROUND(100 * (SUM(CASE WHEN interest_id IS NULL THEN 1 END) * 1.0 / 35 | COUNT(*)),2) AS null_perc 36 | FROM fresh_segments.interest_metrics 37 | ``` 38 | 39 | image 40 | 41 | The percentage of null values is 8.36% which is less than 10%, hence I would suggest to drop all the `null` values. 42 | 43 | ```sql 44 | DELETE FROM fresh_segments.interest_metrics 45 | WHERE interest_id IS NULL; 46 | 47 | -- Run again to confirm that there are no null values. 48 | SELECT 49 | ROUND(100 * (SUM(CASE WHEN interest_id IS NULL THEN 1 END) * 1.0 / 50 | COUNT(*)),2) AS null_perc 51 | FROM fresh_segments.interest_metrics 52 | ``` 53 | 54 | image 55 | 56 | Confirmed that there are no `null` values in `fresh_segments.interest_metrics`. 57 | 58 | ### 4. How many `interest_id` values exist in the `fresh_segments.interest_metrics` table but not in the `fresh_segments.interest_map` table? What about the other way around? 59 | 60 | ```sql 61 | SELECT 62 | COUNT(DISTINCT map.id) AS map_id_count, 63 | COUNT(DISTINCT metrics.interest_id) AS metrics_id_count, 64 | SUM(CASE WHEN map.id is NULL THEN 1 END) AS not_in_metric, 65 | SUM(CASE WHEN metrics.interest_id is NULL THEN 1 END) AS not_in_map 66 | FROM fresh_segments.interest_map map 67 | FULL OUTER JOIN fresh_segments.interest_metrics metrics 68 | ON metrics.interest_id = map.id; 69 | ``` 70 | 71 | image 72 | 73 | - There are 1,209 unique `id`s in `interest_map`. 74 | - There are 1,202 unique `interest_id`s in `interest_metrics`. 
75 | - There are no `interest_id` that did not appear in `interest_map`. All 1,202 ids were present in the `interest_metrics` table. 76 | - There are 7 `id`s that did not appear in `interest_metrics`. 77 | 78 | ### 5. Summarise the id values in the `fresh_segments.interest_map` by its total record count in this table 79 | 80 | I found the solution for this question to be strange - hence I came up with another summary of the id values too. 81 | 82 | Original solution: 83 | 84 | ```sql 85 | SELECT COUNT(*) 86 | FROM fresh_segments.interest_map 87 | ``` 88 | 89 | image 90 | 91 | My solution: 92 | 93 | ```sql 94 | SELECT 95 | id, 96 | interest_name, 97 | COUNT(*) 98 | FROM fresh_segments.interest_map map 99 | JOIN fresh_segments.interest_metrics metrics 100 | ON map.id = metrics.interest_id 101 | GROUP BY id, interest_name 102 | ORDER BY count DESC, id; 103 | ``` 104 | 105 | image 106 | 107 | ### 6. What sort of table join should we perform for our analysis and why? Check your logic by checking the rows where 'interest_id = 21246' in your joined output and include all columns from `fresh_segments.interest_metrics` and all columns from `fresh_segments.interest_map` except from the id column. 108 | 109 | We should be using `INNER JOIN` to perform our analysis. 110 | 111 | ```sql 112 | SELECT * 113 | FROM fresh_segments.interest_map map 114 | INNER JOIN fresh_segments.interest_metrics metrics 115 | ON map.id = metrics.interest_id 116 | WHERE metrics.interest_id = 21246 117 | AND metrics._month IS NOT NULL; -- There were instances when interest_id is available, however the date values were not - hence filter them out. 118 | ``` 119 | 120 | image 121 | 122 | The results should come up to 10 rows only. 123 | 124 | ### 7. Are there any records in your joined table where the `month_year` value is before the `created_at` value from the `fresh_segments.interest_map` table? Do you think these values are valid and why? 125 | 126 | ```sql 127 | SELECT 128 | COUNT(*) 129 | FROM fresh_segments.interest_map map 130 | INNER JOIN fresh_segments.interest_metrics metrics 131 | ON map.id = metrics.interest_id 132 | WHERE metrics.month_year < map.created_at::DATE; 133 | ``` 134 | 135 | image 136 | 137 | There are 188 records where the `month_year` date is before the `created_at` date. 138 | 139 | However, it looks like these records are created in the same month as `month_year`. Do you remember that the `month_year` column's date is set to default on 1st day of the month? 140 | 141 | image 142 | 143 | Running another test to see whether date in `month_year` and `created_at` are in the same month. 144 | 145 | ```sql 146 | SELECT 147 | COUNT(*) 148 | FROM fresh_segments.interest_map map 149 | INNER JOIN fresh_segments.interest_metrics metrics 150 | ON map.id = metrics.interest_id 151 | WHERE metrics.month_year < DATE_TRUNC('mon', map.created_at::DATE); 152 | ``` 153 | 154 | image 155 | 156 | Seems like all the records' dates are in the same month, hence we will consider the records as valid. 157 | 158 | *** 159 | -------------------------------------------------------------------------------- /Case Study #8: Fresh Segments/B. Interest Analysis.md: -------------------------------------------------------------------------------- 1 | # 🍅 Case Study #8 - Fresh Segments 2 | 3 | ## 📚 Solution - B. Interest Analysis 4 | 5 | ### 1. Which interests have been present in all `month_year` dates in our dataset? 6 | 7 | Find out how many unique `month_year` in dataset. 
8 | 9 | ```sql 10 | SELECT 11 | COUNT(DISTINCT month_year) AS unique_month_year_count, 12 | COUNT(DISTINCT interest_id) AS unique_interest_id_count 13 | FROM fresh_segments.interest_metrics; 14 | ``` 15 | 16 | image 17 | 18 | There are 14 distinct `month_year` dates and 1202 distinct `interest_id`s. 19 | 20 | ```sql 21 | WITH interest_cte AS ( 22 | SELECT 23 | interest_id, 24 | COUNT(DISTINCT month_year) AS total_months 25 | FROM fresh_segments.interest_metrics 26 | WHERE month_year IS NOT NULL 27 | GROUP BY interest_id 28 | ) 29 | 30 | SELECT 31 | c.total_months, 32 | COUNT(DISTINCT c.interest_id) 33 | FROM interest_cte c 34 | WHERE total_months = 14 35 | GROUP BY c.total_months 36 | ORDER BY count DESC; 37 | ``` 38 | 39 | image 40 | 41 | 480 interests out of 1202 interests are present in all the `month_year` dates. 42 | 43 | ### 2. Using this same total_months measure - calculate the cumulative percentage of all records starting at 14 months - which total_months value passes the 90% cumulative percentage value? 44 | 45 | Find out the point in which interests present in a particular number of months are not performing well. For example, interest id 101 only appeared in 6 months due to non or lack of clicks and interactions, so we can consider to cut the interest off. 46 | 47 | ```sql 48 | WITH cte_interest_months AS ( 49 | SELECT 50 | interest_id, 51 | MAX(DISTINCT month_year) AS total_months 52 | FROM fresh_segments.interest_metrics 53 | WHERE interest_id IS NOT NULL 54 | GROUP BY interest_id 55 | ), 56 | cte_interest_counts AS ( 57 | SELECT 58 | total_months, 59 | COUNT(DISTINCT interest_id) AS interest_count 60 | FROM cte_interest_months 61 | GROUP BY total_months 62 | ) 63 | 64 | SELECT 65 | total_months, 66 | interest_count, 67 | ROUND(100 * SUM(interest_count) OVER (ORDER BY total_months DESC) / -- Create running total field using cumulative values of interest count 68 | (SUM(INTEREST_COUNT) OVER ()),2) AS cumulative_percentage 69 | FROM cte_interest_counts; 70 | ``` 71 | 72 | image 73 | 74 | Interests with total months of 6 and above received a 90% and above percentage. Interests below this mark should be investigated to improve their clicks and customer interactions. 75 | 76 | ### 3. If we were to remove all `interest_id` values which are lower than the `total_months` value we found in the previous question - how many total data points would we be removing? 77 | 78 | ### 4. Does this decision make sense to remove these data points from a business perspective? Use an example where there are all 14 months present to a removed interest example for your arguments - think about what it means to have less months present from a segment perspective. 79 | 80 | ### 5. If we include all of our interests regardless of their counts - how many unique interests are there for each month? 81 | -------------------------------------------------------------------------------- /Case Study #8: Fresh Segments/README.md: -------------------------------------------------------------------------------- 1 | # Case Study #8: Fresh Segments 2 | 3 | Image 4 | 5 | ## 📚 Table of Contents 6 | - [Business Task](#business-task) 7 | - [Entity Relationship Diagram](#entity-relationship-diagram) 8 | - [Question and Solution](#question-and-solution) 9 | 10 | Please note that all the information regarding the case study has been sourced from the following link: [here](https://8weeksqlchallenge.com/case-study-8/). 
11 | 12 | *** 13 | 14 | ## Business Task 15 | 16 | Fresh Segments is a digital marketing agency that helps other businesses analyse trends in online ad click behaviour for their unique customer base. 17 | 18 | Clients share their customer lists with the Fresh Segments team who then aggregate interest metrics and generate a single dataset worth of metrics for further analysis. 19 | 20 | In particular - the composition and rankings for different interests are provided for each client showing the proportion of their customer list who interacted with online assets related to each interest for each month. 21 | 22 | Danny has asked for your assistance to analyse aggregated metrics for an example client and provide some high level insights about the customer list and their interests. 23 | 24 | *** 25 | 26 | ## Entity Relationship Diagram 27 | 28 | **Table: `interest_map`** 29 | 30 | | id | interest_name | interest_summary | created_at | last_modified | 31 | |----|---------------------------|------------------------------------------------------------------------------------|-------------------------|-------------------------| 32 | | 1 | Fitness Enthusiasts | Consumers using fitness tracking apps and websites. | 2016-05-26T14:57:59.000 | 2018-05-23T11:30:12.000 | 33 | | 2 | Gamers | Consumers researching game reviews and cheat codes. | 2016-05-26T14:57:59.000 | 2018-05-23T11:30:12.000 | 34 | | 3 | Car Enthusiasts | Readers of automotive news and car reviews. | 2016-05-26T14:57:59.000 | 2018-05-23T11:30:12.000 | 35 | | 4 | Luxury Retail Researchers | Consumers researching luxury product reviews and gift ideas. | 2016-05-26T14:57:59.000 | 2018-05-23T11:30:12.000 | 36 | | 5 | Brides & Wedding Planners | People researching wedding ideas and vendors. | 2016-05-26T14:57:59.000 | 2018-05-23T11:30:12.000 | 37 | | 6 | Vacation Planners | Consumers reading reviews of vacation destinations and accommodations. | 2016-05-26T14:57:59.000 | 2018-05-23T11:30:13.000 | 38 | | 7 | Motorcycle Enthusiasts | Readers of motorcycle news and reviews. | 2016-05-26T14:57:59.000 | 2018-05-23T11:30:13.000 | 39 | | 8 | Business News Readers | Readers of online business news content. | 2016-05-26T14:57:59.000 | 2018-05-23T11:30:12.000 | 40 | | 12 | Thrift Store Shoppers | Consumers shopping online for clothing at thrift stores and researching locations. | 2016-05-26T14:57:59.000 | 2018-03-16T13:14:00.000 | 41 | | 13 | Advertising Professionals | People who read advertising industry news. 
| 2016-05-26T14:57:59.000 | 2018-05-23T11:30:12.000 | 42 | 43 | **Table: `interest_metrics`** 44 | 45 | | month | year | month_year | interest_id | composition | index_value | ranking | percentile_ranking | 46 | |-------|------|------------|-------------|-------------|-------------|---------|--------------------| 47 | | 7 | 2018 | Jul-18 | 32486 | 11.89 | 6.19 | 1 | 99.86 | 48 | | 7 | 2018 | Jul-18 | 6106 | 9.93 | 5.31 | 2 | 99.73 | 49 | | 7 | 2018 | Jul-18 | 18923 | 10.85 | 5.29 | 3 | 99.59 | 50 | | 7 | 2018 | Jul-18 | 6344 | 10.32 | 5.1 | 4 | 99.45 | 51 | | 7 | 2018 | Jul-18 | 100 | 10.77 | 5.04 | 5 | 99.31 | 52 | | 7 | 2018 | Jul-18 | 69 | 10.82 | 5.03 | 6 | 99.18 | 53 | | 7 | 2018 | Jul-18 | 79 | 11.21 | 4.97 | 7 | 99.04 | 54 | | 7 | 2018 | Jul-18 | 6111 | 10.71 | 4.83 | 8 | 98.9 | 55 | | 7 | 2018 | Jul-18 | 6214 | 9.71 | 4.83 | 8 | 98.9 | 56 | | 7 | 2018 | Jul-18 | 19422 | 10.11 | 4.81 | 10 | 98.63 | 57 | 58 | *** 59 | 60 | ## Question and Solution 61 | 62 | Please join me in executing the queries using PostgreSQL on [DB Fiddle](https://www.db-fiddle.com/f/jmnwogTsUE8hGqkZv9H7E8/17). It would be great to work together on the questions! 63 | 64 | If you have any questions, reach out to me on [LinkedIn](https://www.linkedin.com/in/katiehuangx/). 65 | 66 | ## 🧼 A. Data Exploration and Cleansing 67 | 68 | **1. Update the `fresh_segments.interest_metrics` table by modifying the `month_year` column to be a `date` data type with the start of the month** 69 | 70 | ```sql 71 | ALTER TABLE fresh_segments.interest_metrics 72 | ALTER month_year TYPE DATE USING month_year::DATE; 73 | ``` 74 | 75 | image 76 | 77 | *** 78 | 79 | **2. What is count of records in the `fresh_segments.interest_metrics` for each `month_year` value sorted in chronological order (earliest to latest) with the `null` values appearing first?** 80 | 81 | ```sql 82 | SELECT 83 | month_year, COUNT(*) 84 | FROM fresh_segments.interest_metrics 85 | GROUP BY month_year 86 | ORDER BY month_year NULLS FIRST; 87 | ``` 88 | 89 | image 90 | 91 | **3. What do you think we should do with these `null` values in the `fresh_segments.interest_metrics`?** 92 | 93 | The `null` values appear in `_month`, `_year`, `month_year`, and `interest_id`. The corresponding values in `composition`, `index_value`, `ranking`, and `percentile_ranking` fields are not meaningful without the specific information on `interest_id` and dates. 94 | 95 | Before dropping the values, it would be useful to find out the percentage of `null` values. 96 | 97 | ```sql 98 | SELECT 99 | ROUND(100 * (SUM(CASE WHEN interest_id IS NULL THEN 1 END) * 1.0 / 100 | COUNT(*)),2) AS null_perc 101 | FROM fresh_segments.interest_metrics 102 | ``` 103 | 104 | image 105 | 106 | The percentage of null values is 8.36% which is less than 10%, hence I would suggest to drop all the `null` values. 107 | 108 | ```sql 109 | DELETE FROM fresh_segments.interest_metrics 110 | WHERE interest_id IS NULL; 111 | 112 | -- Run again to confirm that there are no null values. 113 | SELECT 114 | ROUND(100 * (SUM(CASE WHEN interest_id IS NULL THEN 1 END) * 1.0 / 115 | COUNT(*)),2) AS null_perc 116 | FROM fresh_segments.interest_metrics 117 | ``` 118 | 119 | image 120 | 121 | Confirmed that there are no `null` values in `fresh_segments.interest_metrics`. 122 | 123 | *** 124 | 125 | **4. How many `interest_id` values exist in the `fresh_segments.interest_metrics` table but not in the `fresh_segments.interest_map` table? 
What about the other way around?** 126 | 127 | ```sql 128 | SELECT 129 | COUNT(DISTINCT map.id) AS map_id_count, 130 | COUNT(DISTINCT metrics.interest_id) AS metrics_id_count, 131 | SUM(CASE WHEN map.id is NULL THEN 1 END) AS not_in_metric, 132 | SUM(CASE WHEN metrics.interest_id is NULL THEN 1 END) AS not_in_map 133 | FROM fresh_segments.interest_map map 134 | FULL OUTER JOIN fresh_segments.interest_metrics metrics 135 | ON metrics.interest_id = map.id; 136 | ``` 137 | 138 | image 139 | 140 | - There are 1,209 unique `id`s in `interest_map`. 141 | - There are 1,202 unique `interest_id`s in `interest_metrics`. 142 | - There are no `interest_id` that did not appear in `interest_map`. All 1,202 ids were present in the `interest_metrics` table. 143 | - There are 7 `id`s that did not appear in `interest_metrics`. 144 | 145 | *** 146 | 147 | **5. Summarise the id values in the `fresh_segments.interest_map` by its total record count in this table.** 148 | 149 | I found the solution for this question to be strange - hence I came up with another summary of the id values too. 150 | 151 | Original solution: 152 | 153 | ```sql 154 | SELECT COUNT(*) 155 | FROM fresh_segments.interest_map 156 | ``` 157 | 158 | image 159 | 160 | My solution: 161 | 162 | ```sql 163 | SELECT 164 | id, 165 | interest_name, 166 | COUNT(*) 167 | FROM fresh_segments.interest_map map 168 | JOIN fresh_segments.interest_metrics metrics 169 | ON map.id = metrics.interest_id 170 | GROUP BY id, interest_name 171 | ORDER BY count DESC, id; 172 | ``` 173 | 174 | image 175 | 176 | *** 177 | 178 | **6. What sort of table join should we perform for our analysis and why? Check your logic by checking the rows where 'interest_id = 21246' in your joined output and include all columns from `fresh_segments.interest_metrics` and all columns from `fresh_segments.interest_map` except from the id column.** 179 | 180 | We should be using `INNER JOIN` to perform our analysis. 181 | 182 | ```sql 183 | SELECT * 184 | FROM fresh_segments.interest_map map 185 | INNER JOIN fresh_segments.interest_metrics metrics 186 | ON map.id = metrics.interest_id 187 | WHERE metrics.interest_id = 21246 188 | AND metrics._month IS NOT NULL; -- There were instances when interest_id is available, however the date values were not - hence filter them out. 189 | ``` 190 | 191 | image 192 | 193 | The results should come up to 10 rows only. 194 | 195 | *** 196 | 197 | **7. Are there any records in your joined table where the `month_year` value is before the `created_at` value from the `fresh_segments.interest_map` table? Do you think these values are valid and why?** 198 | 199 | ```sql 200 | SELECT 201 | COUNT(*) 202 | FROM fresh_segments.interest_map map 203 | INNER JOIN fresh_segments.interest_metrics metrics 204 | ON map.id = metrics.interest_id 205 | WHERE metrics.month_year < map.created_at::DATE; 206 | ``` 207 | 208 | image 209 | 210 | There are 188 records where the `month_year` date is before the `created_at` date. 211 | 212 | However, it looks like these records are created in the same month as `month_year`. Do you remember that the `month_year` column's date is set to default on 1st day of the month? 213 | 214 | image 215 | 216 | Running another test to see whether date in `month_year` and `created_at` are in the same month. 
217 | 218 | ```sql 219 | SELECT 220 | COUNT(*) 221 | FROM fresh_segments.interest_map map 222 | INNER JOIN fresh_segments.interest_metrics metrics 223 | ON map.id = metrics.interest_id 224 | WHERE metrics.month_year < DATE_TRUNC('mon', map.created_at::DATE); 225 | ``` 226 | 227 | image 228 | 229 | Seems like all the records' dates are in the same month, hence we will consider the records as valid. 230 | 231 | *** 232 | 233 | ## 📚 B. Interest Analysis 234 | 235 | **1. Which interests have been present in all `month_year` dates in our dataset?** 236 | 237 | Find out how many unique `month_year` in dataset. 238 | 239 | ```sql 240 | SELECT 241 | COUNT(DISTINCT month_year) AS unique_month_year_count, 242 | COUNT(DISTINCT interest_id) AS unique_interest_id_count 243 | FROM fresh_segments.interest_metrics; 244 | ``` 245 | 246 | image 247 | 248 | There are 14 distinct `month_year` dates and 1202 distinct `interest_id`s. 249 | 250 | ```sql 251 | WITH interest_cte AS ( 252 | SELECT 253 | interest_id, 254 | COUNT(DISTINCT month_year) AS total_months 255 | FROM fresh_segments.interest_metrics 256 | WHERE month_year IS NOT NULL 257 | GROUP BY interest_id 258 | ) 259 | 260 | SELECT 261 | c.total_months, 262 | COUNT(DISTINCT c.interest_id) 263 | FROM interest_cte c 264 | WHERE total_months = 14 265 | GROUP BY c.total_months 266 | ORDER BY count DESC; 267 | ``` 268 | 269 | image 270 | 271 | 480 interests out of 1202 interests are present in all the `month_year` dates. 272 | 273 | *** 274 | 275 | **2. Using this same total_months measure - calculate the cumulative percentage of all records starting at 14 months - which total_months value passes the 90% cumulative percentage value?** 276 | 277 | Find out the point in which interests present in a particular number of months are not performing well. For example, interest id 101 only appeared in 6 months due to non or lack of clicks and interactions, so we can consider to cut the interest off. 278 | 279 | ```sql 280 | WITH cte_interest_months AS ( 281 | SELECT 282 | interest_id, 283 | MAX(DISTINCT month_year) AS total_months 284 | FROM fresh_segments.interest_metrics 285 | WHERE interest_id IS NOT NULL 286 | GROUP BY interest_id 287 | ), 288 | cte_interest_counts AS ( 289 | SELECT 290 | total_months, 291 | COUNT(DISTINCT interest_id) AS interest_count 292 | FROM cte_interest_months 293 | GROUP BY total_months 294 | ) 295 | 296 | SELECT 297 | total_months, 298 | interest_count, 299 | ROUND(100 * SUM(interest_count) OVER (ORDER BY total_months DESC) / -- Create running total field using cumulative values of interest count 300 | (SUM(INTEREST_COUNT) OVER ()),2) AS cumulative_percentage 301 | FROM cte_interest_counts; 302 | ``` 303 | 304 | image 305 | 306 | Interests with total months of 6 and above received a 90% and above percentage. Interests below this mark should be investigated to improve their clicks and customer interactions. 307 | *** 308 | 309 | **3. If we were to remove all `interest_id` values which are lower than the `total_months` value we found in the previous question - how many total data points would we be removing?** 310 | 311 | *** 312 | 313 | **4. Does this decision make sense to remove these data points from a business perspective? Use an example where there are all 14 months present to a removed interest example for your arguments - think about what it means to have less months present from a segment perspective. ** 314 | 315 | *** 316 | 317 | **5. 
311 | ***
312 | 
313 | **4. Does this decision make sense to remove these data points from a business perspective? Use an example where there are all 14 months present to a removed interest example for your arguments - think about what it means to have less months present from a segment perspective.**
314 | 
315 | ***
316 | 
317 | **5. If we include all of our interests regardless of their counts - how many unique interests are there for each month?**
318 | 
319 | ***
320 | 
321 | ## 🧩 C. Segment Analysis
322 | 
323 | 1. Using the complete dataset - which are the top 10 and bottom 10 interests which have the largest composition values in any month_year? Only use the maximum composition value for each interest but you must keep the corresponding month_year
324 | 2. Which 5 interests had the lowest average ranking value?
325 | 3. Which 5 interests had the largest standard deviation in their percentile_ranking value?
326 | 4. For the 5 interests found in the previous question - what was minimum and maximum percentile_ranking values for each interest and its corresponding year_month value? Can you describe what is happening for these 5 interests?
327 | 5. How would you describe our customers in this segment based off their composition and ranking values? What sort of products or services should we show to these customers and what should we avoid?
328 | 
329 | ***
330 | 
331 | ## 👆🏻 D. Index Analysis
332 | 
333 | The `index_value` is a measure which can be used to reverse calculate the average composition for Fresh Segments’ clients. Average composition can be calculated by dividing the composition column by the index_value column rounded to 2 decimal places.
334 | 
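The questions in this section are likewise left open here, but since the formula above is fully specified, a minimal sketch of the average composition calculation could look like this (assuming the `composition` and `index_value` columns of `fresh_segments.interest_metrics`; the ordering and `LIMIT` are arbitrary):

```sql
-- Illustrative only: average composition = composition / index_value, rounded to 2 decimal places
SELECT 
  month_year,
  interest_id,
  composition,
  index_value,
  ROUND((composition / index_value)::NUMERIC, 2) AS avg_composition
FROM fresh_segments.interest_metrics
WHERE month_year IS NOT NULL
ORDER BY month_year, interest_id
LIMIT 10;
```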
335 | 1. What is the top 10 interests by the average composition for each month?
336 | 2. For all of these top 10 interests - which interest appears the most often?
337 | 3. What is the average of the average composition for the top 10 interests for each month?
338 | 4. What is the 3 month rolling average of the max average composition value from September 2018 to August 2019 and include the previous top ranking interests in the same output shown below.
339 | 5. Provide a possible reason why the max average composition might change from month to month? Could it signal something is not quite right with the overall business model for Fresh Segments?
340 | 
341 | ***
342 | 
343 | Do give me a 🌟 if you like what you're reading. Thank you! 🙆🏻‍♀️
344 | 
-------------------------------------------------------------------------------- /README.md: --------------------------------------------------------------------------------
1 | # 8-Week SQL Challenges
2 | 
3 | This repository serves as the solution for the 8 case studies from the **[#8WeekSQLChallenge](https://8weeksqlchallenge.com)**.
4 | 
5 | It showcases my ability to tackle various SQL challenges and demonstrates my proficiency in SQL query writing and problem-solving skills.
6 | 
7 | A special thanks to [Data with Danny](https://www.linkedin.com/company/datawithdanny/) for creating these insightful and engaging SQL case studies! 👋🏻
8 | 
9 | ## 📚 Table of Contents
10 | 
11 | Please find the solution links for the case studies below. Simply click on the links to access each solution.
12 | - [Case Study #1: Danny's Diner](https://github.com/katiehuangx/8-Week-SQL-Challenge/blob/main/Case%20Study%20%231%20-%20Danny's%20Diner/README.md)
13 | - [Case Study #2: Pizza Runner](https://github.com/katiehuangx/8-Week-SQL-Challenge/blob/main/Case%20Study%20%232%20-%20Pizza%20Runner/README.md)
14 | - [Case Study #3: Foodie-Fi](https://github.com/katiehuangx/8-Week-SQL-Challenge/blob/main/Case%20Study%20%233%20-%20Foodie-Fi/README.md)
15 | - [Case Study #4: Data Bank](https://github.com/katiehuangx/8-Week-SQL-Challenge/blob/main/Case%20Study%20%234%20-%20Data%20Bank/README.md)
16 | - [Case Study #5: Data Mart](https://github.com/katiehuangx/8-Week-SQL-Challenge/blob/main/Case%20Study%20%235%20-%20Data%20Mart/README.md)
17 | - [Case Study #6: Clique Bait](https://github.com/katiehuangx/8-Week-SQL-Challenge/blob/main/Case%20Study%20%236%20-%20Clique%20Bait/README.md)
18 | - [Case Study #7: Balanced Tree](https://github.com/katiehuangx/8-Week-SQL-Challenge/tree/main/Case%20Study%20%237%20-%20Balanced%20Tree%20Clothing%20Co.)
19 | - [Case Study #8: Fresh Segments](https://github.com/katiehuangx/8-Week-SQL-Challenge/blob/main/Case%20Study%20%238%3A%20Fresh%20Segments/README.md)
20 | 
21 | ## Important Note Regarding My Solutions
22 | 
23 | Hello visitors,
24 | 
25 | I wanted to address a concerning issue. It has come to my attention that some of the SQL case study solutions on this repository have been copied without permission. I want to emphasize that this behaviour goes against the principles of academic integrity and intellectual honesty.
26 | 
27 | Creating these solutions has been a labour of love, involving substantial time and effort. Therefore, I kindly ask that you avoid copying them word for word. Instead, I encourage you to utilize them as a reference and derive inspiration from the underlying concepts, applying them in your own unique way.
28 | 
29 | If you have any questions or need assistance, please feel free to reach out to me. I'm here to support your learning journey and provide guidance.
30 | 
31 | Thank you for your understanding and for being a part of this community.
32 | 
33 | Best wishes,
34 | 
35 | Katie
36 | 
--------------------------------------------------------------------------------