├── README.md ├── Relevel_case_studies ├── Capstone_Project.sql ├── Case_Study_1.sql ├── Case_Study_2.sql ├── Case_Study_3.sql ├── Case_Study_4.sql ├── Case_Study_5.sql ├── Case_Study_6.sql ├── Case_Study_7.sql └── Detailed_Case_Study_Ginny.sql └── Retail_case_studies └── Case_Study_1.sql /README.md: -------------------------------------------------------------------------------- 1 | # SQL Case Studies 2 | 3 | Welcome to data town :) 4 | 5 |

1. Relevel case studies

6 | 7 | This section of SQL Case Study uses data from mode.com website. It's free to use, you just have to sign in. They have a huge resource of dataset to practice. 8 | 9 | Steps I followed: 10 | 11 | Login to mode.com 12 | Export the required dataset(by default .csv format) 13 | Open your MySQL Workbench. 14 | Create a new database 'casestudies' using command 'create database casestudies' 15 | Run the SQL command 'use casestudies' to work in the specific database 16 | Import the .csv dataset downloaded from mode.com 17 | The .csv dataset will be imported in current database i.e., 'casestudies' 18 | Run 'Select * from table_name' to view your imported dataset. 19 | 20 | Once all above steps are performed successfully, jump right into deriving insights from your data. Yay, happy exploring :) 21 | 22 |

2. Retail case studies

23 | 24 | In this section, I will work on retail datasets and using SQL we will uncover the hidden gems of retail industry data. The dataset used in this section is taken from kaggle. 25 | 26 | Dataset used in Case_Study_1.sql -> https://www.kaggle.com/datasets/mohammadtalib786/retail-sales-dataset?select=retail_sales_dataset.csv 27 | 28 | -------------------------------------------------------------------------------- /Relevel_case_studies/Capstone_Project.sql: -------------------------------------------------------------------------------- 1 | # Introduction to Case 2 | 3 | /* Ron and his buddies founded Foodie-Fi and began selling monthly and annual 4 | subscriptions, providing clients with unrestricted on-demand access to exclusive cuisine 5 | videos from around the world. 6 | This case study focuses on the use of subscription-style digital data to answer critical 7 | business questions about the customer journey, payments, and business performance. */ 8 | 9 | /* There are 2 tables in the given schema: 10 | 1) Plans 11 | 2) Subscriptions 12 | 13 | # Table 'Plans' description 14 | 15 | /* There are 5 customer plans: 16 | Trial— Customers sign up for a 7-day free trial and will be automatically enrolled in the pro monthly subscription plan 17 | unless they unsubscribe, downgrade to basic, or upgrade to an annual pro plan during the trial. 18 | Basic plan — Customers have limited access and can only stream their videos with the basic package, which is only 19 | available monthly for $9.90. 20 | Pro plan — Customers on the Pro plan have no watch time limits and can download videos for offline viewing. Pro 21 | plans begin at $19.90 per month or $199 for a yearly subscription. 
22 | When clients cancel their Foodie-Fi service, a Churn plan record with a null pricing is created, but their plan continues 23 | until the end of the billing cycle */ 24 | 25 | # Table 'Subscriptions' description 26 | 27 | /* Customer subscriptions display the precise date on which their specific plan id begins. 28 | If a customer downgrades from a pro plan or cancels their subscription — the higher program will remain in place until 29 | the period expires — the start date in the subscriptions table will reflect the date the actual plan changes. 30 | When clients upgrade their account from a basic plan to a pro or annual pro plan, the higher plan becomes active 31 | immediately. 32 | When customers cancel their subscription, they will retain access until the end of their current billing cycle, but the 33 | start date will be the day they opted to quit their service. */ 34 | 35 | create database capstone_1; 36 | use capstone_1; 37 | 38 | # The column names got garbled while exporting the data, so they had to be renamed back. NOTE(review): as written these renames are no-ops (old and new name are identical) — the source name should be the garbled/exported one (commonly a BOM-prefixed header such as 'ï»¿customer_id'); confirm against the imported schema 39 | alter table Subscriptions rename column customer_id to customer_id; 40 | alter table plans rename column plan_id to plan_id; 41 | 42 | # QUERY 1 43 | 44 | # How many customers has Foodie-Fi ever had? 45 | select count(distinct customer_id) as num_customers from Subscriptions; 46 | 47 | # QUERY 2 48 | 49 | # What is the monthly distribution of trial plan start_date values for our dataset? — Use the start of the month as the group by value. 50 | with cte as (select s.* , p.plan_name, p.price, extract(month from s.start_date) as month_num 51 | from subscriptions as s 52 | join plans as p 53 | on s.plan_id=p.plan_id 54 | where plan_name='trial') 55 | select month_num, count(plan_name) as monthly_distribtion 56 | from cte 57 | group by 1 58 | order by 1; 59 | 60 | # QUERY 3 61 | 62 | # What plan start_date values occur after the year 2020 for our dataset? 
Show the breakdown by count of events for each plan_name 63 | with cte as 64 | (select * from 65 | (select s.*, p.plan_name, extract(year from s.start_date) as year from subscriptions as s join plans as p on s.plan_id=p.plan_id) sub 66 | where year>2020) 67 | select plan_id, plan_name, count(customer_id) as req_ans 68 | from cte 69 | group by 1 70 | order by 1; 71 | 72 | # QUERY 4 73 | 74 | # What is the customer count and percentage of customers who have churned rounded to 1 decimal place? 75 | with cte as (select count(s.plan_id) as churn_count 76 | from subscriptions as s 77 | left join plans as p 78 | on s.plan_id=p.plan_id 79 | where p.plan_name='churn') 80 | select churn_count, 81 | round((100*churn_count/(select count(distinct customer_id) as total_count from subscriptions)),1) as percent_churn 82 | from cte; 83 | 84 | # QUERY 5 85 | 86 | # How many customers have churned straight after their initial free trial? — what percentage is this rounded to the nearest whole number? 87 | # METHOD 1 88 | with cte as 89 | (select s.customer_id, s.plan_id, p.plan_name, 90 | row_number() over(partition by s.customer_id order by s.plan_id) as req_rank 91 | from subscriptions as s 92 | left join plans as p 93 | on s.plan_id=p.plan_id) 94 | select count(customer_id) as num_customers, 95 | round(100*count(customer_id)/(select count(distinct customer_id) from subscriptions),0) as req_perc 96 | from cte where plan_id=4 and req_rank=2; 97 | 98 | # METHOD 2 99 | with cte as 100 | (select s.customer_id, s.plan_id, p.plan_name, 101 | lead(s.plan_id) over(partition by s.customer_id order by s.plan_id) as next_plan 102 | from subscriptions as s 103 | left join plans as p 104 | on s.plan_id=p.plan_id) 105 | select count(*) as req_count, 106 | round(100*count(*)/(select count(distinct customer_id) from subscriptions),0) as req_perc 107 | from cte where plan_id=0 and next_plan=4; 108 | 109 | 110 | # QUERY 6 111 | 112 | # What is the number and percentage of customer plans after their 
initial free trial? 113 | with cte as 114 | (select customer_id, plan_id, 115 | lead(plan_id,1) over(partition by customer_id order by plan_id) as next_plan 116 | from subscriptions) 117 | select next_plan, 118 | count(*) as req_count, 119 | round(100*count(*)/(select count(distinct customer_id) from subscriptions),2) as req_perc 120 | from cte 121 | where plan_id=0 and next_plan is not null 122 | group by 1 123 | order by 1; 124 | 125 | # QUERY 7 126 | 127 | # What is the customer count and percentage breakdown of all 5 plan_name values at 2020–12–31? 128 | 129 | # NEED MORE CLARITY ON THIS PROBLEM STATEMENT 130 | 131 | # QUERY 8 132 | 133 | # How many customers have upgraded to an annual plan in 2020? 134 | select count(customer_id) as num_customers 135 | from subscriptions 136 | where plan_id=3 137 | and year(start_date)='2020'; 138 | 139 | # QUERY 9 140 | 141 | # How many days on average does it take a customer to start an annual plan from the day they join Foodie-Fi? 142 | with trial_plan as 143 | (select customer_id, date(start_date) as trial_date 144 | from subscriptions 145 | where plan_id=0), 146 | annual_plan as 147 | (select customer_id, date(start_date) as annual_date 148 | from subscriptions 149 | where plan_id=3) 150 | select round(avg(datediff(annual_date,trial_date)),0) as avg_days_to_upgrade 151 | from trial_plan as tp 152 | join annual_plan as ap 153 | on tp.customer_id=ap.customer_id; 154 | 155 | # QUERY 10 156 | 157 | # Can you further breakdown this average value into 30-day periods? (i.e. 0–30 days, 31–60 days etc) 158 | # NEED TO WORK ON THIS PROBLEM STATEMENT AS LESS CLARITY ON PROBLEM STATEMENT 159 | 160 | # QUERY 11 161 | 162 | # How many customers downgraded from a pro-monthly to a basic monthly plan in 2020? 
163 | with cte as (select s.customer_id, s.plan_id, p.plan_name, 164 | lead(s.plan_id,1) over(partition by s.customer_id order by s.plan_id) as next_plan 165 | from subscriptions as s 166 | left join plans as p 167 | on s.plan_id=p.plan_id 168 | where year(start_date)=2020) 169 | select count(*) as num_customers 170 | from cte 171 | where plan_id=2 and next_plan=1; 172 | 173 | 174 | 175 | 176 | 177 | 178 | 179 | 180 | 181 | 182 | 183 | 184 | 185 | 186 | 187 | 188 | 189 | 190 | 191 | 192 | 193 | 194 | 195 | 196 | 197 | 198 | 199 | 200 | 201 | 202 | 203 | 204 | 205 | 206 | 207 | 208 | 209 | 210 | 211 | 212 | -------------------------------------------------------------------------------- /Relevel_case_studies/Case_Study_1.sql: -------------------------------------------------------------------------------- 1 | create database CaseStudies; 2 | use CaseStudies; 3 | 4 | create table oscar_nominees ( 5 | year year, 6 | category varchar(50), 7 | nominee varchar(30), 8 | movie varchar(30), 9 | winner bool, 10 | id int not null, 11 | primary key(id) 12 | ); 13 | 14 | 15 | # Write a query to display all the records in the table tutorial.oscar_nominees 16 | SELECT * FROM oscar_nominees; 17 | 18 | # Write a query to find the distinct values in the ‘year’ column 19 | select distinct year from oscar_nominees; 20 | 21 | # Write a query to filter the records from year 1999 to year 2006 22 | select * from oscar_nominees where year between 1999 and 2006; 23 | 24 | # Write a query to filter the records for either year 1991 or 1998 25 | select * from oscar_nominees where year=1991 or year=1998; 26 | 27 | # Write a query to return the winner movie name for the year of 1997 28 | select movie from oscar_nominees where year=1997 and winner='true'; 29 | 30 | # Write a query to return the winner in the ‘actor in a leading role’ and ‘actress in a leading role’ category for the year of 1994,1980, and 2008 31 | select * from oscar_nominees where (category='actor in a leading role' or 
category='actress in a leading role') and year IN(1994,1980,2008) and winner='true'; 32 | 33 | # Write a query to return the name of the movie starting from letter ‘a’ 34 | select movie from oscar_nominees where movie like 'a%'; 35 | 36 | # Write a query to return the name of movies containing the word ‘the’ 37 | select movie from oscar_nominees where movie like '%the%'; 38 | 39 | # Write a query to return all the records where the nominee name starts with “c” and ends with “r” 40 | select * from oscar_nominees where nominee like 'c%r'; 41 | 42 | # Write a query to return all the records where the movie was released in 2005 and movie name does not start with ‘a’ and ‘c’ and nominee was a winner 43 | select * from oscar_nominees where year=2005 and (movie not like 'a%' and movie not like 'c%') and winner='true'; 44 | 45 | # Write a query to return the name of nominees who got more nominations than ‘Akim Tamiroff’. Solve this using CTE 46 | with cte as (select nominee, count(*) as num_nominations from oscar_nominees group by 1) 47 | select nominee from cte where num_nominations>(select num_nominations from cte where nominee='Akim Tamiroff'); 48 | 49 | # Write a query to find the nominee name with the second highest number of oscar wins. Solve using subquery. 50 | with oscar_wins as 51 | (select nominee, count(*) as num_oscar_wins from oscar_nominees where winner='True' group by 1 order by 2 desc) 52 | select nominee, num_oscar_wins 53 | from oscar_wins 54 | where num_oscar_wins = (select max(num_oscar_wins) from oscar_wins where num_oscar_wins<(select max(num_oscar_wins) from oscar_wins)); 55 | 56 | # Write a query to create three columns per nominee 57 | /* 1. Number of wins 58 | 2. Number of loss 59 | 3. 
Total nomination */ 60 | select nominee, 61 | sum(case when winner='true' then 1 else 0 end) as num_wins, 62 | sum(case when winner='false' then 1 else 0 end) as num_loss, 63 | count(*) as total_nominations 64 | from oscar_nominees 65 | group by 1 66 | order by 4 desc; 67 | 68 | # Write a query to create two columns 69 | /* ● Win_rate: Number of wins/total wins 70 | ● Loss_rate: Number of loss/total wins */ 71 | select movie, 72 | 100*sum(case when winner='true' then 1 else 0 end)/count(*) as win_rate, 73 | 100*sum(case when winner='false' then 1 else 0 end)/count(*) as loss_rate 74 | from oscar_nominees 75 | group by 1; 76 | 77 | # Write a query to return all the records of the nominees who have lost but won at least once 78 | select * 79 | from oscar_nominees 80 | where nominee in(select distinct nominee from oscar_nominees where winner= 'true') and winner = 'false'; 81 | 82 | # Write a query to find the nominees who are nominated for both ‘actor in a leading role’ and ‘actor in supporting role’ 83 | select distinct nominee from oscar_nominees 84 | where nominee in(select nominee from oscar_nominees where category='actor in a leading role') 85 | and category='actor in a supporting role'; 86 | 87 | # Write a query to find the movie which won more than average number of wins per winning movie 88 | with cte as (select movie, count(*) as num_wins from oscar_nominees where winner='true' group by 1) 89 | select movie from cte where num_wins>(select avg(num_wins) from cte); 90 | 91 | # Write a query to return the year which have more winners than year 1970 92 | with cte as (select year, count(*) as num_wins from oscar_nominees where winner='true' group by 1) 93 | select year from cte where num_wins > (select num_wins from cte where year=1970); 94 | 95 | # Write a query to return all the movies which have won oscars both in the actor and actress category 96 | select distinct movie 97 | from oscar_nominees 98 | where (winner='true' and category like '%actor%') 99 | and 
movie in(select movie from oscar_nominees where category like '%actress%' and winner='true'); 100 | 101 | # Write a query to return the movie name which did not win a single oscar 102 | with cte as 103 | (select movie, 104 | sum(case when winner='true' then 1 else 0 end) as num_wins, 105 | sum(case when winner='false' then 1 else 0 end) as num_loss 106 | from oscar_nominees 107 | group by 1) 108 | select distinct movie from cte where num_wins=0; 109 | 110 | # METHOD 2 for above query 111 | select distinct movie 112 | from oscar_nominees 113 | where winner='false' and movie not in(select distinct movie from oscar_nominees where winner='true'); 114 | 115 | 116 | 117 | 118 | 119 | 120 | 121 | 122 | 123 | 124 | 125 | 126 | 127 | 128 | 129 | 130 | 131 | 132 | 133 | 134 | 135 | -------------------------------------------------------------------------------- /Relevel_case_studies/Case_Study_2.sql: -------------------------------------------------------------------------------- 1 | use casestudies; 2 | create table kag_conversion_data( 3 | ad_id int, 4 | xyz_campaign_id int, 5 | fb_campaign_id int, 6 | age int, 7 | gender varchar(2), 8 | interest int, 9 | impressions int, 10 | clicks int, 11 | spent float, 12 | total_conversion int, 13 | approved_conversion int 14 | ); 15 | select * from kag_conversion_data; 16 | 17 | # Write a query to count the total number of records in the kag_conversion_data dataset 18 | select count(ad_id) as total_records from kag_conversion_data; 19 | 20 | # Write a query to count the distinct number of fb_campaign_id 21 | select count(distinct fb_campaign_id) as total_fb_campaign_id from kag_conversion_data; 22 | 23 | # Write a query to find the maximum spent, average interest, minimum impressions for ad_id 24 | select max(spent) as max_spent, avg(interest) as average_interest, min(impressions) as min_impressions from kag_conversion_data; 25 | 26 | # Write a query to create an additional column spent per impressions(spent/impressions) 27 | select 
*, (spent/impressions) as spent_per_impressions from kag_conversion_data; 28 | 29 | # Write a query to count the ad_campaign for each age group 30 | select age, count(ad_id) as num_campaign from kag_conversion_data group by age; 31 | 32 | # Write a query to calculate the average spent on ads for each gender category 33 | select gender, round(avg(spent),2) as avg_spent from kag_conversion_data group by gender; 34 | 35 | # Write a query to find the total approved conversion per xyz campaign id. Arrange the total conversion in descending order 36 | select xyz_campaign_id, sum(approved_conversion) as total_approved_conversion from kag_conversion_data group by 1 order by 2 desc; 37 | 38 | # Write a query to show the fb_campaign_id and total interest per fb_campaign_id. Only show the campaign which has more than 300 interests 39 | select fb_campaign_id, sum(interest) as total_interest from kag_conversion_data group by 1 having sum(interest)>300; 40 | 41 | # Write a query to find the age and gender segment with maximum impression to interest ratio. 
Return three columns - age, gender, impression_to_interest 42 | select age, gender, sum(impressions)/sum(interest) as max_imp_per_interest from kag_conversion_data group by 1,2 order by 3 desc limit 1; 43 | 44 | # Write a query to find the top 2 xyz_campaign_id and gender segment with the maximum total_unapproved_conversion (total_conversion - approved_conversion) 45 | select xyz_campaign_id, gender, (sum(total_conversion)-sum(approved_conversion)) as total_unapproved_conversion from kag_conversion_data group by 1, 2 order by 3 desc limit 2; 46 | -------------------------------------------------------------------------------- /Relevel_case_studies/Case_Study_3.sql: -------------------------------------------------------------------------------- 1 | use casestudies; 2 | create table crunchbase_companies ( 3 | permalink varchar(50), 4 | name varchar(30), 5 | homepage_url varchar(40), 6 | category_code varchar(20), 7 | funding_total_usd int, 8 | status varchar(10), 9 | country_code varchar(10), 10 | state_code varchar(10), 11 | region varchar(20), 12 | city varchar(20), 13 | funding_rounds int, 14 | founded_at date, 15 | founded_month date, 16 | founded_quarter varchar(15), 17 | founded_year int, 18 | first_funding_at date, 19 | last_funding_at date, 20 | last_milestone_at date, 21 | id int, 22 | primary key(id) 23 | ); 24 | 25 | select * from crunchbase_companies; 26 | 27 | # Find the top 5 countries(country code) with the highest number of operating companies. 
Ensure the country code is not null 28 | select country_code, count(status) as num_operating_companies from crunchbase_companies where status='operating' and country_code!= "" group by 1 order by 2 desc limit 5; 29 | 30 | # How many companies have no country code available in the dataset 31 | select count(id) as total from crunchbase_companies where country_code=""; 32 | 33 | # Find the number of companies starting with letter ‘g’ founded in France(FRA) and still operational(status = operating) 34 | select count(id) as total from crunchbase_companies where name like 'g%' and country_code='FRA' and status='operating'; 35 | 36 | # How many advertising, founded after 2003, are acquired? 37 | select count(name) as total from crunchbase_companies where category_code='advertising' and founded_year>2003 and status='acquired'; 38 | 39 | # Calculate the average funding_total_usd per company for the companies founded in the software, education, and analytics category 40 | select category_code, avg(funding_total_usd) as avg_funding from crunchbase_companies where category_code IN('software','education','analytics') group by 1; 41 | 42 | # Find the city having more than 50 closed companies. Return the city and number of companies closed 43 | select city, count(name) as num_companies from crunchbase_companies where status='closed' and city!="" group by 1 having count(name)>50; 44 | 45 | # Find the number of bio-tech companies who are founded after 2000 and either have more than 1Mn funding or have ipo and secured more than 1 round of funding 46 | select count(name) as num_companies from crunchbase_companies where category_code='biotech' and founded_year>2000 and (funding_total_usd>1000000 or (funding_rounds>1 and status='ipo')); 47 | 48 | # Find number of all acquired companies founded between 1980 and 2005 and founded in the city ending with the word ‘city’. 
Return the city name and number of acquired companies 49 | select city, count(name) as num_acquired_company from crunchbase_companies where status='acquired' and founded_year between 1980 and 2005 and city like '%city' group by 1; 50 | 51 | # Find the number of ‘hardware’ companies founded outside ‘USA’ and did not take any funding. Return the country code and number of hardware companies in descending order. 52 | select country_code, count(name) as num_hardware_companies from crunchbase_companies where category_code='hardware' and funding_total_usd="" and country_code!='USA' group by 1 order by 2 desc; 53 | 54 | # Find the 5 most popular company category(category with highest companies) across the city Singapore, Shanghai, and Bangalore. Return category code and number of companies 55 | select category_code, count(name) as num_companies from crunchbase_companies where city IN('Singapore','Shanghai','Bangalore') group by 1 order by 2 desc limit 5; 56 | 57 | 58 | -------------------------------------------------------------------------------- /Relevel_case_studies/Case_Study_4.sql: -------------------------------------------------------------------------------- 1 | use casestudies; 2 | create table college_football_players 3 | ( 4 | id int, 5 | primary key(id), 6 | full_school_name varchar(30), 7 | school_name varchar(30) 8 | ); 9 | select * FROM college_football_players; 10 | 11 | create table college_football_teams( 12 | id int primary key, 13 | school_name varchar(30), 14 | conference varchar(35) 15 | ); 16 | 17 | select * from college_football_teams; 18 | 19 | # Write a query to return player_name, school_name, position, conference from the above dataset 20 | select p.player_name, p.school_name, p.position, t.conference from college_football_players as p join college_football_teams as t on p.school_name=t.school_name; 21 | 22 | # Write a query to find the total number of players playing in each conference.Order the output in the descending order of number of 
players 23 | select t.conference, count(p.id) as num_players from college_football_players as p join college_football_teams as t on p.school_name=t.school_name group by 1 order by 2 desc; 24 | 25 | # Write a query to find the average height of players per division 26 | select t.division,round(avg(p.height),2) as avg_height from college_football_players as p join college_football_teams as t on p.school_name=t.school_name group by 1; 27 | 28 | # Write a query to return to the conference where average weight is more than 210. Order the output in the descending order of average weight 29 | select t.conference, round(avg(p.weight),2) as avg_weight from college_football_players as p join college_football_teams as t on p.school_name=t.school_name group by 1 having avg(p.weight)>210 order by 2 desc; 30 | 31 | # Write a query to return to the top 3 conference with the highest BMI (weight/height) ratio 32 | select t.conference, round(703*(sum(p.weight)/sum(power(p.height,2))),2) as BMI from college_football_players as p join college_football_teams as t on p.school_name=t.school_name group by 1 order by 2 desc limit 3; 33 | 34 | # BMI formula: 35 | /* USC Units: BMI = 703 × mass (lbs)/height2 (in) = BMI in kg/m2 36 | SI, Metric Units: BMI = mass (kg)/height2 (m) = BMI in kg/m2 */ 37 | 38 | 39 | 40 | 41 | 42 | 43 | 44 | 45 | 46 | 47 | 48 | 49 | -------------------------------------------------------------------------------- /Relevel_case_studies/Case_Study_5.sql: -------------------------------------------------------------------------------- 1 | # Write a query to join the tables: excel_sql_inventory_data and excel_sql_transaction_data 2 | select inventory.*, transaction.transaction_id, transaction.time from excel_sql_inventory_data as inventory join excel_sql_transaction_data as transaction on inventory.product_id=transaction.product_id; 3 | 4 | # Find the product which does not sell a single unit 5 | select i.*, t.transaction_id 6 | from excel_sql_inventory_data as i 7 | 
left join excel_sql_transaction_data as t 8 | on i.product_id=t.product_id 9 | where t.transaction_id is null; 10 | 11 | # Write a query to find how many units are sold per product. Sort the data in terms of unit sold(descending order) 12 | select i.product_id, i.product_name, count(t.product_id) units_sold 13 | from excel_sql_inventory_data as i 14 | left join excel_sql_transaction_data as t 15 | on i.product_id=t.product_id 16 | group by 1 17 | order by 3 desc; 18 | 19 | # Write a query to return product_type and units_sold where product_type is sold more than 50 times 20 | select i.product_type, count(t.transaction_id) as units_sold 21 | from excel_sql_inventory_data as i 22 | left join excel_sql_transaction_data as t 23 | on i.product_id=t.product_id 24 | group by 1 25 | having count(t.transaction_id)>50; 26 | 27 | # Write a query to return the total revenue generated 28 | select round(sum(i.price_unit),2) as total_revenue 29 | from excel_sql_inventory_data as i 30 | left join excel_sql_transaction_data as t 31 | on i.product_id=t.product_id 32 | where t.transaction_id!="null"; 33 | 34 | # Write a query to return the most selling product under product_type = ‘dry goods 35 | select i.product_name, count(t.transaction_id) as units_sold 36 | from excel_sql_inventory_data as i 37 | left join excel_sql_transaction_data as t 38 | on i.product_id=t.product_id 39 | where i.product_type='dry_goods' 40 | group by 1 41 | order by 2 desc 42 | limit 1; 43 | 44 | # Write a query to find the difference between inventory and total sales per product_type 45 | select i.product_type, abs(sum(i.current_inventory)-count(t.transaction_id)) as req_diff 46 | from excel_sql_inventory_data as i 47 | left join excel_sql_transaction_data as t 48 | on i.product_id=t.product_id 49 | group by 1 50 | order by 2 desc; 51 | 52 | # Find the product-wise sales for product_type =’dairy’ 53 | select i.product_name, round(sum(i.price_unit)*count(t.transaction_id),2) as sales 54 | from 
excel_sql_inventory_data as i 55 | left join excel_sql_transaction_data as t 56 | on i.product_id=t.product_id 57 | where product_type='dairy' 58 | group by 1 59 | order by 2 desc; 60 | -------------------------------------------------------------------------------- /Relevel_case_studies/Case_Study_6.sql: -------------------------------------------------------------------------------- 1 | use casestudies; 2 | select * from city_populations; 3 | 4 | # Write a query to return all the records where the city population is more than average population of dataset 5 | select * 6 | from city_populations 7 | where population_estimate_2012 > (select avg(population_estimate_2012) from city_populations); 8 | 9 | # Write a query to return all the records where the city population is more than the most populated city of Texas(TX) state 10 | select * 11 | from city_populations 12 | where population_estimate_2012 > (select max(population_estimate_2012) from city_populations where state='TX'); 13 | 14 | # Find the number of cities where population is more than the average population of Illinois(IL) state 15 | select count(id) as num_cities 16 | from city_populations 17 | where population_estimate_2012 > (select avg(population_estimate_2012) from city_populations where state='IL'); 18 | 19 | # Write a query to add the additional column - percentage_population(city population/total population of dataset) 20 | select * , 21 | round((population_estimate_2012*100)/(select sum(population_estimate_2012) from city_populations),2) as percentage_population 22 | from city_populations; 23 | 24 | # Write a query to add the additional column - percentage_population_state(city population/total population of the state) 25 | select a.*, round(100.0 * population_estimate_2012/state_population,2) as percentage_population 26 | from city_populations a 27 | left join(select state, sum(population_estimate_2012) as state_population from city_populations group by state) b 28 | on a.state = b.state 29 | 
order by a.state;

/* Write a query to add the additional column - population density. The column logic is:
   - Population more than average             -> High
   - Population less than or equal to average -> Low */
# FIX: string literals use single quotes (ANSI SQL) - double quotes are identifier
# quotes under ANSI_QUOTES. The overall average is also computed once in a CTE
# instead of twice via repeated scalar subqueries.
with overall as (
    select avg(population_estimate_2012) as avg_pop
    from city_populations
)
select cp.*,
       case
           when cp.population_estimate_2012 >  o.avg_pop then 'High'
           when cp.population_estimate_2012 <= o.avg_pop then 'Low'
       end as population_density            # NULL population stays NULL, as before
from city_populations cp
cross join overall o;

/* Write a query to add an additional column - num_cities. Each row in the dataset should tell the
   number of cities (per state) in the dataset */
select *,
       count(city) over (partition by state) as num_cities
from city_populations;

/* Write a query to add an additional column - total_population. Each row in the dataset should tell
   the total population of the state */
select *,
       sum(population_estimate_2012) over (partition by state) as total_population
from city_populations;

# Write a query to return the rows where population is more than the average population of the state

# METHOD 1: derived table
select *
from (select *,
             avg(population_estimate_2012) over (partition by state) as avg_population_state
      from city_populations) sub
where sub.population_estimate_2012 > sub.avg_population_state;

# METHOD 2: CTE (same result, usually easier to read)
with cte as (
    select *,
           avg(population_estimate_2012) over (partition by state) as avg_population_state
    from city_populations
)
select *
from cte
where population_estimate_2012 > avg_population_state;

# Write a query to calculate the cumulative sum of population. Arrange the data in ascending order
# of the population.
# FIX: ORDER BY without a frame defaults to RANGE BETWEEN UNBOUNDED PRECEDING AND
# CURRENT ROW, so tied populations all get the same running total. An explicit ROWS
# frame produces a true row-by-row cumulative sum.
select *,
       sum(population_estimate_2012)
           over (order by population_estimate_2012
                 rows between unbounded preceding and current row) as cum_population
from city_populations;

/* Write a query to add a column rolling_avg. Each row in the dataset includes the average population
   for the two rows above and two rows below (including current row) */
select *,
       avg(population_estimate_2012)
           over (order by population_estimate_2012
                 rows between 2 preceding and 2 following) as rolling_avg
from city_populations;

/* Write a query to rank the cities in California (CA) state in terms of population. City with the
   highest population is given rank 1. Use both rank and dense_rank function */
select *,
       rank()       over (order by population_estimate_2012 desc) as rank_population,
       dense_rank() over (order by population_estimate_2012 desc) as dense_rank_population
from city_populations
where state = 'CA';

# Write a query to find the top 2 most populated cities per state

# METHOD 1: derived table
select *
from (select *,
             row_number() over (partition by state
                                order by population_estimate_2012 desc) as population_rank
      from city_populations) sub
where sub.population_rank <= 2;

# METHOD 2: CTE
with cte as (
    select *,
           row_number() over (partition by state
                              order by population_estimate_2012 desc) as population_rank
    from city_populations
)
select * from cte where population_rank <= 2;

/* Write a query to add a column - perc_pop. Each row in this column should represent the percentage
   of population a city contributes in that state */
select *,
       population_estimate_2012 * 100
           / sum(population_estimate_2012) over (partition by state) as perc_pop
from city_populations;

# Write a query to find the cities which lie in the top 10 decile in terms of population.
# FIX: the original returned every row with its decile number; the question asks for the
# cities *in* the top decile, so filter on ntile(10) = 1.
select *
from (select *,
             ntile(10) over (order by population_estimate_2012 desc) as pop_decile
      from city_populations) sub
where sub.pop_decile = 1;

/* Write a query to arrange the cities in the descending order of population and add a column
   calculating difference in population from a row below (in the dataset) */
select *,
       population_estimate_2012
           - lead(population_estimate_2012) over (order by population_estimate_2012 desc) as req_diff
from city_populations;

# Write a query to return the state, first city and last city (in terms of id number) in the state.
# FIX: the original windows had no ORDER BY, so first_value/last_value picked rows in an
# unspecified order. "In terms of id number" requires ORDER BY id, and last_value needs an
# explicit full-partition frame (the default frame stops at the current row, which would
# make last_value return the current row's id).
select distinct state,
       first_value(id) over (partition by state order by id) as first_city,
       last_value(id)  over (partition by state order by id
                             rows between unbounded preceding and unbounded following) as last_city
from city_populations;

--------------------------------------------------------------------------------
-- /Relevel_case_studies/Case_Study_7.sql
--------------------------------------------------------------------------------

select * from sat_scores;

# Write a query to add column - avg_sat_writing.
# Each row in this column should include average marks in the writing section of the student per school
select *,
       avg(sat_writing) over (partition by school) as avg_sat_writing
from sat_scores;

# In the above question, add an additional column - count_per_school. Each row of this column should
# include number of students per school
select *,
       avg(sat_writing)  over (partition by school) as avg_sat_writing,
       count(student_id) over (partition by school) as count_per_school
from sat_scores;

# In the above question, add two additional columns - max_per_teacher and min_per_teacher. Each row
# of these columns should include maximum and minimum marks in maths per teacher respectively
select *,
       avg(sat_writing)  over (partition by school)  as avg_sat_writing,
       count(student_id) over (partition by school)  as count_per_school,
       max(sat_math)     over (partition by teacher) as max_per_teacher,
       min(sat_math)     over (partition by teacher) as min_per_teacher
from sat_scores;

/* For the dataset, write a query to add the two columns cum_hrs_studied and total_hrs_studied. Each
   row in cum_hrs_studied should display the cumulative sum of hours studied per school. Each row in
   the total_hrs_studied will display total hours studied per school. Order the data in the ascending
   order of student id */
# FIX: an explicit ROWS frame on the running sum avoids the default RANGE frame, under
# which tied student_id values (if any) would share one running total. The ORDER BY on
# the full-partition total was redundant and is dropped.
select *,
       sum(hrs_studied) over (partition by school order by student_id
                              rows between unbounded preceding and current row) as cum_hrs_studied,
       sum(hrs_studied) over (partition by school
                              rows between unbounded preceding and unbounded following) as total_hrs_studied
from sat_scores;

/* For the dataset, write a query to add column sub_hrs_studied and total_hrs_studied. Each row in
   sub_hrs_studied should display the sum of hrs_studied for a row above, a row below, and current
   row per school. Order the data in the ascending order of student id */
select *,
       sum(hrs_studied) over (partition by school
                              rows between unbounded preceding and unbounded following) as total_hrs_studied,
       sum(hrs_studied) over (partition by school order by student_id
                              rows between 1 preceding and 1 following) as sub_hrs_studied
from sat_scores;

/* Write a query to rank the students per school on the basis of scores in verbal. Use both rank and
   dense_rank function. Students with the highest marks should get rank 1.
   Note: see if there is difference in ranking provided by both the functions */
select *,
       rank()       over (partition by school order by sat_verbal desc) as rank_verbal,
       dense_rank() over (partition by school order by sat_verbal desc) as d_rank_verbal
from sat_scores;

/* Write a query to rank the students per school on the basis of scores in writing. Use both rank and
   dense_rank function. Student with the highest marks should get rank 1.
   Note: see if there is difference in ranking provided by both the functions for teacher = 'Spellman' */
# NOTE: WHERE is applied before window functions, so ranks here are computed only
# among Spellman's students - which is what this exercise asks for.
select *,
       rank()       over (partition by school order by sat_writing desc) as rank_writing,
       dense_rank() over (partition by school order by sat_writing desc) as d_rank_writing
from sat_scores
where teacher='Spellman';

# Write a query to find the top 5 students per teacher who spent maximum hours studying
# (rank() lets tied students share a position, so more than 5 rows may come back on ties)
select *
from (select *,
             rank() over (partition by teacher order by hrs_studied desc) as rank_hr_studied
      from sat_scores) sub
where sub.rank_hr_studied <= 5;

# Write a query to find the worst 5 students per school who got minimum marks in sat_math
# (row_number breaks ties arbitrarily; switch to rank() if tied students must all appear)
select *
from (select *,
             row_number() over (partition by school order by sat_math) as sat_math_ranking
      from sat_scores) sub
where sub.sat_math_ranking <= 5;

# Write a query to divide the dataset into quartiles on the basis of marks in sat_verbal
select *,
       ntile(4) over (order by sat_verbal) as quartile
from sat_scores;

/* For 'Petersville HS' school, write a query to arrange the students in the ascending order of hours
   studied. Also, add a column to find the difference in hours studied from the student above (in the
   row). Exclude the cases where hrs_studied is null */
# FIX: the original excluded rows with  sub.hrs_studied != ""  - comparing a numeric
# column against an empty string (which MySQL coerces to 0, silently dropping 0-hour
# rows too). NULLs are excluded with IS NOT NULL, and the filter runs *before* lag()
# so NULL rows never act as a neighbour in the difference.
select student_id,
       hrs_studied,
       hrs_studied - lag(hrs_studied) over (order by hrs_studied) as diff_hr_studied
from sat_scores
where school='Petersville HS'
  and hrs_studied is not null;

/* For 'Washington HS' school, write a query to arrange the students in the descending order of
   sat_math. Also, add a column to find the difference in sat_math from the student below (in the row) */
select *,
       sat_math - lead(sat_math) over (order by sat_math desc) as diff_sat_math
from sat_scores
where school='Washington HS';

/* Write a query to return 4 columns - student_id, school, sat_writing, difference in sat_writing and
   average marks scored in sat_writing in the school */
select student_id, school, sat_writing,
       sat_writing - avg(sat_writing) over (partition by school) as req_diff
from sat_scores;

/* Write a query to return 4 columns - student_id, teacher, sat_verbal, difference in sat_verbal and
   minimum marks scored in sat_verbal per teacher */
select student_id, teacher, sat_verbal,
       sat_verbal - min(sat_verbal) over (partition by teacher) as req_diff
from sat_scores;

/* Write a query to return the student_id and school who are in bottom 20 in each of sat_verbal,
   sat_writing, and sat_math for their school */

# METHOD 1: derived table
select student_id, school
from (select *,
             row_number() over (partition by school order by sat_verbal)  as verbal_rank,
             row_number() over (partition by school order by sat_writing) as writing_rank,
             row_number() over (partition by school order by sat_math)    as math_rank
      from sat_scores) sub
where sub.verbal_rank <= 20 and sub.writing_rank <= 20 and sub.math_rank <= 20;

# METHOD 2: CTE
with data as (
    select *,
           row_number() over (partition by school order by sat_verbal)  as verbal_rank,
           row_number() over (partition by school order by sat_writing) as writing_rank,
           row_number() over (partition by school order by sat_math)    as math_rank
    from sat_scores)
select student_id, school
from data
where verbal_rank <= 20 and writing_rank <= 20 and math_rank <= 20;

# Write a query to find the student_id for the highest mark and lowest mark per teacher for sat_writing
# (last_value needs the explicit full-partition frame; the default frame would stop at the current row)
select distinct teacher,
       first_value(student_id) over (partition by teacher order by sat_writing desc) as highest_marks_student,
       last_value(student_id)  over (partition by teacher order by sat_writing desc
                                     rows between unbounded preceding and unbounded following) as lowest_marks_student
from sat_scores;

--------------------------------------------------------------------------------
-- /Relevel_case_studies/Detailed_Case_Study_Ginny.sql
--------------------------------------------------------------------------------

# INTRODUCTION

/* Ginny is a big fan of Japanese food, so she decided to start a restaurant at the beginning of 2021
   that sells her three favorite foods: sushi, curry, and ramen.
   Ginny's Diner needs your help to stay afloat - the restaurant has collected some fundamental data
   from their few months of operation but has no idea how to use it to help them operate the business. */

# PROBLEM STATEMENT

/* Ginny wants to use the data to answer a few simple questions about her customers, especially about their:
   - visiting patterns,
   - how much money they've spent, and
   - which menu items are their favorite.
14 | This deeper connection with her customers will help her deliver a better and more personalized experience for her loyal 15 | customers. 16 | She plans on using these insights to help him decide whether she should expand the existing customer loyalty program. 17 | Additionally, she needs help to generate some essential datasets so her team can quickly inspect the data without needing to 18 | use SQL. 19 | 20 | The data set contains the following three tables, which you may refer to in the relationship diagram below to understand the 21 | connection: 22 | ● sales 23 | ● members 24 | ● menu 25 | */ 26 | 27 | # CREATING OUR DATABASE FOR THE CASE STUDY AND ANALYSIS 28 | 29 | CREATE database Detailed_Case_Study; 30 | 31 | CREATE TABLE sales ( 32 | customer_id VARCHAR(1), 33 | order_date DATE, 34 | product_id INTEGER 35 | ); 36 | 37 | INSERT INTO sales 38 | (customer_id, order_date, product_id) 39 | VALUES 40 | ('A', '2021-01-01', '1'), 41 | ('A', '2021-01-01', '2'), 42 | ('A', '2021-01-07', '2'), 43 | ('A', '2021-01-10', '3'), 44 | ('A', '2021-01-11', '3'), 45 | ('A', '2021-01-11', '3'), 46 | ('B', '2021-01-01', '2'), 47 | ('B', '2021-01-02', '2'), 48 | ('B', '2021-01-04', '1'), 49 | ('B', '2021-01-11', '1'), 50 | ('B', '2021-01-16', '3'), 51 | ('B', '2021-02-01', '3'), 52 | ('C', '2021-01-01', '3'), 53 | ('C', '2021-01-01', '3'), 54 | ('C', '2021-01-07', '3'); 55 | 56 | 57 | CREATE TABLE menu ( 58 | product_id INTEGER, 59 | product_name VARCHAR(5), 60 | price INTEGER 61 | ); 62 | 63 | INSERT INTO menu 64 | (product_id, product_name, price) 65 | VALUES 66 | ('1', 'sushi', '10'), 67 | ('2', 'curry', '15'), 68 | ('3', 'ramen', '12'); 69 | 70 | 71 | CREATE TABLE members ( 72 | customer_id VARCHAR(1), 73 | join_date DATE 74 | ); 75 | 76 | INSERT INTO members 77 | (customer_id, join_date) 78 | VALUES 79 | ('A', '2021-01-07'), 80 | ('B', '2021-01-09'); 81 | 82 | # QUERY 1 83 | 84 | # What is the total amount each customer spent at the restaurant? 
85 | with data as (select s.customer_id, s.product_id, m.price 86 | from sales s 87 | left join menu m 88 | on s.product_id=m.product_id) 89 | select customer_id, sum(price) as total_spent from data group by 1; 90 | 91 | # QUERY 2 92 | 93 | # How many days has each customer visited the restaurant? 94 | select customer_id, count(distinct order_date) as num_visit_days 95 | from sales 96 | group by 1; 97 | 98 | # QUERY 3 99 | 100 | # What was the first item from the menu purchased by each customer? 101 | with data as ( 102 | select s.*, m.product_name, 103 | rank() over(partition by s.customer_id order by s.order_date) as ord_Rank 104 | from sales s 105 | left join menu m 106 | on s.product_id=m.product_id) 107 | select customer_id, product_name from data where ord_Rank=1 group by 1,2; 108 | 109 | # QUERY 4 110 | 111 | # What is the most purchased item on the menu and how many times was it purchased by all customers? 112 | select m.product_name, count(s.product_id) as num_purchase 113 | from sales as s 114 | left join menu as m 115 | on s.product_id=m.product_id 116 | group by 1 117 | order by 2 desc 118 | limit 1; 119 | 120 | # QUERY 5 121 | 122 | # Which item was the most popular one for each customer? 123 | with data as 124 | (select customer_id, product_name, order_amount, rank() over(partition by customer_id order by order_amount desc) as ord_rank from 125 | (select s.customer_id, 126 | count(s.product_id) as order_amount, 127 | m.product_name 128 | from sales as s 129 | left join menu as m 130 | on s.product_id=m.product_id 131 | group by 1, s.product_id 132 | order by 3 desc) as a) 133 | select customer_id, product_name, order_amount from data where ord_rank=1; 134 | 135 | # QUERY 6 136 | 137 | /* Which item was purchased first by the customer after they became a member? 138 | Note: Assumption to consider is Customer's join date does not mean that customer ordered on that date. 
*/ 139 | with cte as (select * from (select s.*, m.join_date, me.product_name from sales as s 140 | left join members as m 141 | on s.customer_id=m.customer_id 142 | left join menu as me 143 | on s.product_id=me.product_id) as a 144 | where join_date is not null) 145 | select customer_id, product_name, min(order_date) as first_purchase from cte where order_date>join_date group by 1; 146 | 147 | # QUERY 7 148 | 149 | # Which item was purchased right before the customer became a member? 150 | select customer_id, product_name, order_date from ( 151 | with cte as 152 | (select * from (select s.*, m.join_date, me.product_name 153 | from sales as s 154 | left join members as m 155 | on s.customer_id=m.customer_id 156 | join menu as me 157 | on s.product_id=me.product_id) as a 158 | where join_date is not null and order_dateorder_date 175 | group by 1; 176 | 177 | # QUERY 9 178 | 179 | # If each customers’ $1 spent equates to 10 points and sushi has a 2x points multiplier — how many points would each customer have? 180 | select customer_id, sum(points_earned) as total_points from ( 181 | select s.customer_id, 182 | case 183 | when m.product_name='sushi' then 20*m.price else 10*m.price 184 | end as points_earned 185 | from sales as s 186 | left join menu as m 187 | on s.product_id=m.product_id) sub 188 | group by 1; 189 | 190 | # QUERY 10 191 | 192 | /* In the first week after a customer joins the program, (including their join date) they earn 2x points on all items; not just sushi — 193 | how many points do customer A and B have at the end of Jan21? 
194 | Note :here you can use a concept of interval function which returns the index of the argument that is more than the first 195 | argument meaning It returns 0 if 1st number is less than the 2nd number and 1 if 1st number is less than the 3rd number and so 196 | on or -1 if 1st number is null */ 197 | with cte as (select s.*, m.product_name, m.price, me.join_date, 198 | case 199 | when s.order_date between me.join_date and date_add(me.join_date, interval 6 day) then 20*m.price 200 | when m.product_name='sushi' then 20*m.price 201 | else 10*m.price 202 | end as point_earned 203 | from sales as s 204 | left join menu as m 205 | on s.product_id=m.product_id 206 | inner join members as me 207 | on s.customer_id=me.customer_id 208 | ) 209 | select customer_id, sum(point_earned) as total_points 210 | from cte 211 | where order_date<='2021-01-31' 212 | group by 1; 213 | 214 | 215 | 216 | 217 | 218 | 219 | 220 | 221 | 222 | 223 | 224 | 225 | 226 | 227 | 228 | 229 | 230 | -------------------------------------------------------------------------------- /Retail_case_studies/Case_Study_1.sql: -------------------------------------------------------------------------------- 1 | select * from retail_sales_dataset; 2 | select count(distinct customer_id) from retail_sales_dataset; 3 | --------------------------------------------------------------------------------