├── README.md
├── Relevel_case_studies
│   ├── Capstone_Project.sql
│   ├── Case_Study_1.sql
│   ├── Case_Study_2.sql
│   ├── Case_Study_3.sql
│   ├── Case_Study_4.sql
│   ├── Case_Study_5.sql
│   ├── Case_Study_6.sql
│   ├── Case_Study_7.sql
│   └── Detailed_Case_Study_Ginny.sql
└── Retail_case_studies
    └── Case_Study_1.sql
/README.md:
--------------------------------------------------------------------------------
1 | # SQL Case Studies
2 |
3 | Welcome to data town :)
4 |
5 | 1. Relevel case studies
6 |
7 | This section of the SQL case studies uses data from the mode.com website. It's free to use; you just have to sign in, and it offers a huge collection of datasets to practice on.
8 |
9 | Steps I followed:
10 |
11 | 1. Log in to mode.com.
12 | 2. Export the required dataset (by default in .csv format).
13 | 3. Open MySQL Workbench.
14 | 4. Create a new database 'casestudies' using the command 'create database casestudies'.
15 | 5. Run 'use casestudies' to work in that specific database.
16 | 6. Import the .csv dataset downloaded from mode.com.
17 | 7. The .csv dataset is imported into the current database, i.e. 'casestudies'.
18 | 8. Run 'select * from table_name' to view your imported dataset.
19 |
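The setup steps above can be sketched in SQL. The table name is a placeholder; MySQL Workbench names the table after the imported .csv file:

```sql
-- Create and switch to the working database
create database casestudies;
use casestudies;

-- After importing the .csv via Workbench's Table Data Import Wizard,
-- sanity-check the load (replace table_name with your imported table)
select * from table_name limit 10;
```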
20 | Once all the above steps are performed successfully, jump right into deriving insights from your data. Yay, happy exploring :)
21 |
22 | 2. Retail case studies
23 |
24 | In this section, I work on retail datasets, using SQL to uncover the hidden gems of retail industry data. The dataset used in this section is taken from Kaggle.
25 |
26 | Dataset used in Case_Study_1.sql -> https://www.kaggle.com/datasets/mohammadtalib786/retail-sales-dataset?select=retail_sales_dataset.csv
27 |
28 |
--------------------------------------------------------------------------------
/Relevel_case_studies/Capstone_Project.sql:
--------------------------------------------------------------------------------
1 | # Introduction to Case
2 |
3 | /* Ron and his buddies founded Foodie-Fi and began selling monthly and annual
4 | subscriptions, providing clients with unrestricted on-demand access to exclusive cuisine
5 | videos from around the world.
6 | This case study focuses on the use of subscription-style digital data to answer critical
7 | business questions about the customer journey, payments, and business performance. */
8 |
9 | /* There are 2 tables in the given schema:
10 | 1) Plans
11 | 2) Subscriptions */
12 |
13 | # Table 'Plans' description
14 |
15 | /* There are 5 customer plans:
16 | Trial— Customers sign up for a 7-day free trial and will be automatically enrolled in the pro monthly subscription plan
17 | unless they unsubscribe, downgrade to basic, or upgrade to an annual pro plan during the trial.
18 | Basic plan — Customers have limited access and can only stream their videos with the basic package, which is only
19 | available monthly for $9.90.
20 | Pro plan — Customers on the Pro plan have no watch time limits and can download videos for offline viewing. Pro
21 | plans begin at $19.90 per month or $199 for a yearly subscription.
22 | When clients cancel their Foodie-Fi service, a Churn plan record with a null pricing is created, but their plan continues
23 | until the end of the billing cycle */
24 |
25 | # Table 'Subscriptions' description
26 |
27 | /* Customer subscriptions display the precise date on which their specific plan id begins.
28 | If a customer downgrades from a pro plan or cancels their subscription — the higher program will remain in place until
29 | the period expires — the start date in the subscriptions table will reflect the date the actual plan changes.
30 | When clients upgrade their account from a basic plan to a pro or annual pro plan, the higher plan becomes active
31 | immediately.
32 | When customers cancel their subscription, they will retain access until the end of their current billing cycle, but the
33 | start date will be the day they opted to quit their service. */
34 |
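A rough sketch of the two tables, inferred from the descriptions above; the column types are assumptions, and the plan_id numbering (0 trial, 1 basic monthly, 2 pro monthly, 3 pro annual, 4 churn) is inferred from the queries that follow:

```sql
create table plans (
    plan_id int primary key,    -- 0 trial, 1 basic, 2 pro monthly, 3 pro annual, 4 churn
    plan_name varchar(20),
    price decimal(5,2)          -- null for churn records
);

create table subscriptions (
    customer_id int,
    plan_id int,
    start_date date             -- date the plan takes effect
);
```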
35 | create database capstone_1;
36 | use capstone_1;
37 |
38 | # The column names picked up a garbled prefix while exporting (often a UTF-8 BOM), so they had to be renamed back. The source names below are placeholders for whatever the import actually produced.
39 | alter table Subscriptions rename column `ï»¿customer_id` to customer_id;
40 | alter table plans rename column `ï»¿plan_id` to plan_id;
41 |
42 | # QUERY 1
43 |
44 | # How many customers has Foodie-Fi ever had?
45 | select count(distinct customer_id) as num_customers from Subscriptions;
46 |
47 | # QUERY 2
48 |
49 | # What is the monthly distribution of trial plan start_date values for our dataset? — Use the start of the month as the group by value.
50 | with cte as (select s.* , p.plan_name, p.price, extract(month from s.start_date) as month_num
51 | from subscriptions as s
52 | join plans as p
53 | on s.plan_id=p.plan_id
54 | where plan_name='trial')
55 | select month_num, count(plan_name) as monthly_distribution
56 | from cte
57 | group by 1
58 | order by 1;
59 |
60 | # QUERY 3
61 |
62 | # What plan start_date values occur after the year 2020 for our dataset? Show the breakdown by count of events for each plan_name
63 | with cte as
64 | (select * from
65 | (select s.*, p.plan_name, extract(year from s.start_date) as year from subscriptions as s join plans as p on s.plan_id=p.plan_id) sub
66 | where year>2020)
67 | select plan_id, plan_name, count(customer_id) as req_ans
68 | from cte
69 | group by 1,2
70 | order by 1;
71 |
72 | # QUERY 4
73 |
74 | # What is the customer count and percentage of customers who have churned rounded to 1 decimal place?
75 | with cte as (select count(s.plan_id) as churn_count
76 | from subscriptions as s
77 | left join plans as p
78 | on s.plan_id=p.plan_id
79 | where p.plan_name='churn')
80 | select churn_count,
81 | round((100*churn_count/(select count(distinct customer_id) as total_count from subscriptions)),1) as percent_churn
82 | from cte;
83 |
84 | # QUERY 5
85 |
86 | # How many customers have churned straight after their initial free trial? — what percentage is this rounded to the nearest whole number?
87 | # METHOD 1
88 | with cte as
89 | (select s.customer_id, s.plan_id, p.plan_name,
90 | row_number() over(partition by s.customer_id order by s.plan_id) as req_rank
91 | from subscriptions as s
92 | left join plans as p
93 | on s.plan_id=p.plan_id)
94 | select count(customer_id) as num_customers,
95 | round(100*count(customer_id)/(select count(distinct customer_id) from subscriptions),0) as req_perc
96 | from cte where plan_id=4 and req_rank=2;
97 |
98 | # METHOD 2
99 | with cte as
100 | (select s.customer_id, s.plan_id, p.plan_name,
101 | lead(s.plan_id) over(partition by s.customer_id order by s.plan_id) as next_plan
102 | from subscriptions as s
103 | left join plans as p
104 | on s.plan_id=p.plan_id)
105 | select count(*) as req_count,
106 | round(100*count(*)/(select count(distinct customer_id) from subscriptions),0) as req_perc
107 | from cte where plan_id=0 and next_plan=4;
108 |
109 |
110 | # QUERY 6
111 |
112 | # What is the number and percentage of customer plans after their initial free trial?
113 | with cte as
114 | (select customer_id, plan_id,
115 | lead(plan_id,1) over(partition by customer_id order by plan_id) as next_plan
116 | from subscriptions)
117 | select next_plan,
118 | count(*) as req_count,
119 | round(100*count(*)/(select count(distinct customer_id) from subscriptions),2) as req_perc
120 | from cte
121 | where plan_id=0 and next_plan is not null
122 | group by 1
123 | order by 1;
124 |
125 | # QUERY 7
126 |
127 | # What is the customer count and percentage breakdown of all 5 plan_name values at 2020–12–31?
128 |
129 | # NEED MORE CLARITY ON THIS PROBLEM STATEMENT
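One possible reading of the question: for each customer, take the latest plan with start_date on or before 2020-12-31, then report the count and percentage per plan_name. A sketch under that assumption, not a confirmed answer:

```sql
with latest as (
    select customer_id, plan_id,
           row_number() over(partition by customer_id order by start_date desc) as rn
    from subscriptions
    where start_date <= '2020-12-31')
select p.plan_name,
       count(*) as customer_count,
       round(100*count(*)/(select count(distinct customer_id) from subscriptions),1) as perc
from latest as l
join plans as p on l.plan_id=p.plan_id
where rn=1
group by 1
order by 1;
```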
130 |
131 | # QUERY 8
132 |
133 | # How many customers have upgraded to an annual plan in 2020?
134 | select count(customer_id) as num_customers
135 | from subscriptions
136 | where plan_id=3
137 | and year(start_date)=2020;
138 |
139 | # QUERY 9
140 |
141 | # How many days on average does it take a customer to start an annual plan from the day they join Foodie-Fi?
142 | with trial_plan as
143 | (select customer_id, date(start_date) as trial_date
144 | from subscriptions
145 | where plan_id=0),
146 | annual_plan as
147 | (select customer_id, date(start_date) as annual_date
148 | from subscriptions
149 | where plan_id=3)
150 | select round(avg(datediff(annual_date,trial_date)),0) as avg_days_to_upgrade
151 | from trial_plan as tp
152 | join annual_plan as ap
153 | on tp.customer_id=ap.customer_id;
154 |
155 | # QUERY 10
156 |
157 | # Can you further breakdown this average value into 30-day periods? (i.e. 0–30 days, 31–60 days etc)
158 | # NEED TO WORK ON THIS PROBLEM STATEMENT AS LESS CLARITY ON PROBLEM STATEMENT
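Assuming the intent is to bucket each trial-to-annual upgrade interval into 30-day bins, one sketch that reuses the CTEs from Query 9:

```sql
with trial_plan as (
    select customer_id, date(start_date) as trial_date
    from subscriptions
    where plan_id=0),
annual_plan as (
    select customer_id, date(start_date) as annual_date
    from subscriptions
    where plan_id=3)
select floor(datediff(annual_date,trial_date)/30) as bin_30_days,  -- 0 covers 0-30 days, 1 covers the next 30, ...
       count(*) as num_customers
from trial_plan as tp
join annual_plan as ap on tp.customer_id=ap.customer_id
group by 1
order by 1;
```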
159 |
160 | # QUERY 11
161 |
162 | # How many customers downgraded from a pro-monthly to a basic monthly plan in 2020?
163 | with cte as (select s.customer_id, s.plan_id, p.plan_name,
164 | lead(s.plan_id,1) over(partition by s.customer_id order by s.start_date) as next_plan
165 | from subscriptions as s
166 | left join plans as p
167 | on s.plan_id=p.plan_id
168 | where year(start_date)=2020)
169 | select count(*) as num_customers
170 | from cte
171 | where plan_id=2 and next_plan=1;
172 |
--------------------------------------------------------------------------------
/Relevel_case_studies/Case_Study_1.sql:
--------------------------------------------------------------------------------
1 | create database CaseStudies;
2 | use CaseStudies;
3 |
4 | create table oscar_nominees (
5 | year year,
6 | category varchar(50),
7 | nominee varchar(30),
8 | movie varchar(30),
9 | winner bool,
10 | id int not null,
11 | primary key(id)
12 | );
13 |
14 |
15 | # Write a query to display all the records in the table tutorial.oscar_nominees
16 | SELECT * FROM oscar_nominees;
17 |
18 | # Write a query to find the distinct values in the ‘year’ column
19 | select distinct year from oscar_nominees;
20 |
21 | # Write a query to filter the records from year 1999 to year 2006
22 | select * from oscar_nominees where year between 1999 and 2006;
23 |
24 | # Write a query to filter the records for either year 1991 or 1998
25 | select * from oscar_nominees where year=1991 or year=1998;
26 |
27 | # Write a query to return the winner movie name for the year of 1997
28 | select movie from oscar_nominees where year=1997 and winner=true;
29 |
30 | # Write a query to return the winner in the ‘actor in a leading role’ and ‘actress in a leading role’ category for the year of 1994,1980, and 2008
31 | select * from oscar_nominees where (category='actor in a leading role' or category='actress in a leading role') and year IN(1994,1980,2008) and winner=true;
32 |
33 | # Write a query to return the name of the movie starting from letter ‘a’
34 | select movie from oscar_nominees where movie like 'a%';
35 |
36 | # Write a query to return the name of movies containing the word ‘the’
37 | select movie from oscar_nominees where movie like '%the%';
38 |
39 | # Write a query to return all the records where the nominee name starts with “c” and ends with “r”
40 | select * from oscar_nominees where nominee like 'c%r';
41 |
42 | # Write a query to return all the records where the movie was released in 2005 and movie name does not start with ‘a’ and ‘c’ and nominee was a winner
43 | select * from oscar_nominees where year=2005 and (movie not like 'a%' and movie not like 'c%') and winner=true;
44 |
45 | # Write a query to return the name of nominees who got more nominations than ‘Akim Tamiroff’. Solve this using CTE
46 | with cte as (select nominee, count(*) as num_nominations from oscar_nominees group by 1)
47 | select nominee from cte where num_nominations>(select num_nominations from cte where nominee='Akim Tamiroff');
48 |
49 | # Write a query to find the nominee name with the second highest number of oscar wins. Solve using subquery.
50 | with oscar_wins as
51 | (select nominee, count(*) as num_oscar_wins from oscar_nominees where winner=true group by 1 order by 2 desc)
52 | select nominee, num_oscar_wins
53 | from oscar_wins
54 | where num_oscar_wins = (select max(num_oscar_wins) from oscar_wins where num_oscar_wins<(select max(num_oscar_wins) from oscar_wins));
55 |
56 | # Write a query to create three columns per nominee
57 | /* 1. Number of wins
58 | 2. Number of loss
59 | 3. Total nomination */
60 | select nominee,
61 | sum(case when winner=true then 1 else 0 end) as num_wins,
62 | sum(case when winner=false then 1 else 0 end) as num_loss,
63 | count(*) as total_nominations
64 | from oscar_nominees
65 | group by 1
66 | order by 4 desc;
67 |
68 | # Write a query to create two columns
69 | /* ● Win_rate: Number of wins/total nominations
70 | ● Loss_rate: Number of losses/total nominations */
71 | select movie,
72 | 100*sum(case when winner=true then 1 else 0 end)/count(*) as win_rate,
73 | 100*sum(case when winner=false then 1 else 0 end)/count(*) as loss_rate
74 | from oscar_nominees
75 | group by 1;
76 |
77 | # Write a query to return all the records of the nominees who have lost but won at least once
78 | select *
79 | from oscar_nominees
80 | where nominee in(select distinct nominee from oscar_nominees where winner=true) and winner=false;
81 |
82 | # Write a query to find the nominees who are nominated for both ‘actor in a leading role’ and ‘actor in supporting role’
83 | select distinct nominee from oscar_nominees
84 | where nominee in(select nominee from oscar_nominees where category='actor in a leading role')
85 | and category='actor in a supporting role';
86 |
87 | # Write a query to find the movie which won more than average number of wins per winning movie
88 | with cte as (select movie, count(*) as num_wins from oscar_nominees where winner=true group by 1)
89 | select movie from cte where num_wins>(select avg(num_wins) from cte);
90 |
91 | # Write a query to return the year which have more winners than year 1970
92 | with cte as (select year, count(*) as num_wins from oscar_nominees where winner=true group by 1)
93 | select year from cte where num_wins > (select num_wins from cte where year=1970);
94 |
95 | # Write a query to return all the movies which have won oscars both in the actor and actress category
96 | select distinct movie
97 | from oscar_nominees
98 | where (winner=true and category like '%actor%')
99 | and movie in(select movie from oscar_nominees where category like '%actress%' and winner=true);
100 |
101 | # Write a query to return the movie name which did not win a single oscar
102 | with cte as
103 | (select movie,
104 | sum(case when winner=true then 1 else 0 end) as num_wins,
105 | sum(case when winner=false then 1 else 0 end) as num_loss
106 | from oscar_nominees
107 | group by 1)
108 | select distinct movie from cte where num_wins=0;
109 |
110 | # METHOD 2 for above query
111 | select distinct movie
112 | from oscar_nominees
113 | where winner=false and movie not in(select distinct movie from oscar_nominees where winner=true);
114 |
115 |
--------------------------------------------------------------------------------
/Relevel_case_studies/Case_Study_2.sql:
--------------------------------------------------------------------------------
1 | use casestudies;
2 | create table kag_conversion_data(
3 | ad_id int,
4 | xyz_campaign_id int,
5 | fb_campaign_id int,
6 | age int,
7 | gender varchar(2),
8 | interest int,
9 | impressions int,
10 | clicks int,
11 | spent float,
12 | total_conversion int,
13 | approved_conversion int
14 | );
15 | select * from kag_conversion_data;
16 |
17 | # Write a query to count the total number of records in the kag_conversion_data dataset
18 | select count(ad_id) as total_records from kag_conversion_data;
19 |
20 | # Write a query to count the distinct number of fb_campaign_id
21 | select count(distinct fb_campaign_id) as total_fb_campaign_id from kag_conversion_data;
22 |
23 | # Write a query to find the maximum spent, average interest, minimum impressions for ad_id
24 | select max(spent) as max_spent, avg(interest) as average_interest, min(impressions) as min_impressions from kag_conversion_data;
25 |
26 | # Write a query to create an additional column spent per impressions(spent/impressions)
27 | select *, (spent/impressions) as spent_per_impressions from kag_conversion_data;
28 |
29 | # Write a query to count the ad_campaign for each age group
30 | select age, count(ad_id) as num_campaign from kag_conversion_data group by age;
31 |
32 | # Write a query to calculate the average spent on ads for each gender category
33 | select gender, round(avg(spent),2) as avg_spent from kag_conversion_data group by gender;
34 |
35 | # Write a query to find the total approved conversion per xyz campaign id. Arrange the total conversion in descending order
36 | select xyz_campaign_id, sum(approved_conversion) as total_approved_conversion from kag_conversion_data group by 1 order by 2 desc;
37 |
38 | # Write a query to show the fb_campaign_id and total interest per fb_campaign_id. Only show the campaign which has more than 300 interests
39 | select fb_campaign_id, sum(interest) as total_interest from kag_conversion_data group by 1 having sum(interest)>300;
40 |
41 | # Write a query to find the age and gender segment with maximum impression to interest ratio. Return three columns - age, gender, impression_to_interest
42 | select age, gender, sum(impressions)/sum(interest) as max_imp_per_interest from kag_conversion_data group by 1,2 order by 3 desc limit 1;
43 |
44 | # Write a query to find the top 2 xyz_campaign_id and gender segment with the maximum total_unapproved_conversion (total_conversion - approved_conversion)
45 | select xyz_campaign_id, gender, (sum(total_conversion)-sum(approved_conversion)) as total_unapproved_conversion from kag_conversion_data group by 1, 2 order by 3 desc limit 2;
46 |
--------------------------------------------------------------------------------
/Relevel_case_studies/Case_Study_3.sql:
--------------------------------------------------------------------------------
1 | use casestudies;
2 | create table crunchbase_companies (
3 | permalink varchar(50),
4 | name varchar(30),
5 | homepage_url varchar(40),
6 | category_code varchar(20),
7 | funding_total_usd int,
8 | status varchar(10),
9 | country_code varchar(10),
10 | state_code varchar(10),
11 | region varchar(20),
12 | city varchar(20),
13 | funding_rounds int,
14 | founded_at date,
15 | founded_month date,
16 | founded_quarter varchar(15),
17 | founded_year int,
18 | first_funding_at date,
19 | last_funding_at date,
20 | last_milestone_at date,
21 | id int,
22 | primary key(id)
23 | );
24 |
25 | select * from crunchbase_companies;
26 |
27 | # Find the top 5 countries(country code) with the highest number of operating companies. Ensure the country code is not null
28 | select country_code, count(status) as num_operating_companies from crunchbase_companies where status='operating' and country_code!= "" group by 1 order by 2 desc limit 5;
29 |
30 | # How many companies have no country code available in the dataset
31 | select count(id) as total from crunchbase_companies where country_code="";
32 |
33 | # Find the number of companies starting with letter ‘g’ founded in France(FRA) and still operational(status = operating)
34 | select count(id) as total from crunchbase_companies where name like 'g%' and country_code='FRA' and status='operating';
35 |
36 | # How many advertising companies, founded after 2003, are acquired?
37 | select count(name) as total from crunchbase_companies where category_code='advertising' and founded_year>2003 and status='acquired';
38 |
39 | # Calculate the average funding_total_usd per company for the companies founded in the software, education, and analytics category
40 | select category_code, avg(funding_total_usd) as avg_funding from crunchbase_companies where category_code IN('software','education','analytics') group by 1;
41 |
42 | # Find the city having more than 50 closed companies. Return the city and number of companies closed
43 | select city, count(name) as num_companies from crunchbase_companies where status='closed' and city!="" group by 1 having count(name)>50;
44 |
45 | # Find the number of bio-tech companies who are founded after 2000 and either have more than 1Mn funding or have ipo and secured more than 1 round of funding
46 | select count(name) as num_companies from crunchbase_companies where category_code='biotech' and founded_year>2000 and (funding_total_usd>1000000 or (funding_rounds>1 and status='ipo'));
47 |
48 | # Find number of all acquired companies founded between 1980 and 2005 and founded in the city ending with the word ‘city’. Return the city name and number of acquired companies
49 | select city, count(name) as num_acquired_company from crunchbase_companies where status='acquired' and founded_year between 1980 and 2005 and city like '%city' group by 1;
50 |
51 | # Find the number of ‘hardware’ companies founded outside ‘USA’ and did not take any funding. Return the country code and number of hardware companies in descending order.
52 | select country_code, count(name) as num_hardware_companies from crunchbase_companies where category_code='hardware' and (funding_total_usd is null or funding_total_usd=0) and country_code!='USA' group by 1 order by 2 desc;
53 |
54 | # Find the 5 most popular company category(category with highest companies) across the city Singapore, Shanghai, and Bangalore. Return category code and number of companies
55 | select category_code, count(name) as num_companies from crunchbase_companies where city IN('Singapore','Shanghai','Bangalore') group by 1 order by 2 desc limit 5;
56 |
57 |
58 |
--------------------------------------------------------------------------------
/Relevel_case_studies/Case_Study_4.sql:
--------------------------------------------------------------------------------
1 | use casestudies;
2 | create table college_football_players
3 | (
4 | id int, primary key(id),
5 | full_school_name varchar(30), school_name varchar(30),
6 | player_name varchar(30), position varchar(10),
7 | height int, weight int
8 | );
9 | select * FROM college_football_players;
10 |
11 | create table college_football_teams(
12 | id int primary key,
13 | school_name varchar(30),
14 | division varchar(30), conference varchar(35)
15 | );
16 |
17 | select * from college_football_teams;
18 |
19 | # Write a query to return player_name, school_name, position, conference from the above dataset
20 | select p.player_name, p.school_name, p.position, t.conference from college_football_players as p join college_football_teams as t on p.school_name=t.school_name;
21 |
22 | # Write a query to find the total number of players playing in each conference.Order the output in the descending order of number of players
23 | select t.conference, count(p.id) as num_players from college_football_players as p join college_football_teams as t on p.school_name=t.school_name group by 1 order by 2 desc;
24 |
25 | # Write a query to find the average height of players per division
26 | select t.division,round(avg(p.height),2) as avg_height from college_football_players as p join college_football_teams as t on p.school_name=t.school_name group by 1;
27 |
28 | # Write a query to return to the conference where average weight is more than 210. Order the output in the descending order of average weight
29 | select t.conference, round(avg(p.weight),2) as avg_weight from college_football_players as p join college_football_teams as t on p.school_name=t.school_name group by 1 having avg(p.weight)>210 order by 2 desc;
30 |
31 | # Write a query to return to the top 3 conference with the highest BMI (weight/height) ratio
32 | select t.conference, round(703*(sum(p.weight)/sum(power(p.height,2))),2) as BMI from college_football_players as p join college_football_teams as t on p.school_name=t.school_name group by 1 order by 2 desc limit 3;
33 |
34 | # BMI formula:
35 | /* USC units: BMI = 703 × mass (lb) / height² (in²), giving BMI in kg/m²
36 | SI (metric) units: BMI = mass (kg) / height² (m²), giving BMI in kg/m² */
37 |
--------------------------------------------------------------------------------
/Relevel_case_studies/Case_Study_5.sql:
--------------------------------------------------------------------------------
1 | # Write a query to join the tables: excel_sql_inventory_data and excel_sql_transaction_data
2 | select inventory.*, transaction.transaction_id, transaction.time from excel_sql_inventory_data as inventory join excel_sql_transaction_data as transaction on inventory.product_id=transaction.product_id;
3 |
4 | # Find the product which does not sell a single unit
5 | select i.*, t.transaction_id
6 | from excel_sql_inventory_data as i
7 | left join excel_sql_transaction_data as t
8 | on i.product_id=t.product_id
9 | where t.transaction_id is null;
10 |
11 | # Write a query to find how many units are sold per product. Sort the data in terms of unit sold(descending order)
12 | select i.product_id, i.product_name, count(t.product_id) units_sold
13 | from excel_sql_inventory_data as i
14 | left join excel_sql_transaction_data as t
15 | on i.product_id=t.product_id
16 | group by 1
17 | order by 3 desc;
18 |
19 | # Write a query to return product_type and units_sold where product_type is sold more than 50 times
20 | select i.product_type, count(t.transaction_id) as units_sold
21 | from excel_sql_inventory_data as i
22 | left join excel_sql_transaction_data as t
23 | on i.product_id=t.product_id
24 | group by 1
25 | having count(t.transaction_id)>50;
26 |
27 | # Write a query to return the total revenue generated
28 | select round(sum(i.price_unit),2) as total_revenue
29 | from excel_sql_inventory_data as i
30 | left join excel_sql_transaction_data as t
31 | on i.product_id=t.product_id
32 | where t.transaction_id is not null;
33 |
34 | # Write a query to return the most selling product under product_type = 'dry_goods'
35 | select i.product_name, count(t.transaction_id) as units_sold
36 | from excel_sql_inventory_data as i
37 | left join excel_sql_transaction_data as t
38 | on i.product_id=t.product_id
39 | where i.product_type='dry_goods'
40 | group by 1
41 | order by 2 desc
42 | limit 1;
43 |
44 | # Write a query to find the difference between inventory and total sales per product_type
45 | select i.product_type, abs(sum(i.current_inventory)-count(t.transaction_id)) as req_diff
46 | from excel_sql_inventory_data as i
47 | left join excel_sql_transaction_data as t
48 | on i.product_id=t.product_id
49 | group by 1
50 | order by 2 desc;
51 |
52 | # Find the product-wise sales for product_type =’dairy’
53 | select i.product_name, round(max(i.price_unit)*count(t.transaction_id),2) as sales
54 | from excel_sql_inventory_data as i
55 | left join excel_sql_transaction_data as t
56 | on i.product_id=t.product_id
57 | where product_type='dairy'
58 | group by 1
59 | order by 2 desc;
60 |
--------------------------------------------------------------------------------
/Relevel_case_studies/Case_Study_6.sql:
--------------------------------------------------------------------------------
1 | use casestudies;
2 | select * from city_populations;
3 |
4 | # Write a query to return all the records where the city population is more than average population of dataset
5 | select *
6 | from city_populations
7 | where population_estimate_2012 > (select avg(population_estimate_2012) from city_populations);
8 |
9 | # Write a query to return all the records where the city population is more than the most populated city of Texas(TX) state
10 | select *
11 | from city_populations
12 | where population_estimate_2012 > (select max(population_estimate_2012) from city_populations where state='TX');
13 |
14 | # Find the number of cities where population is more than the average population of Illinois(IL) state
15 | select count(id) as num_cities
16 | from city_populations
17 | where population_estimate_2012 > (select avg(population_estimate_2012) from city_populations where state='IL');
18 |
19 | # Write a query to add the additional column - percentage_population(city population/total population of dataset)
20 | select * ,
21 | round((population_estimate_2012*100)/(select sum(population_estimate_2012) from city_populations),2) as percentage_population
22 | from city_populations;
23 |
24 | # Write a query to add the additional column - percentage_population_state(city population/total population of the state)
25 | select a.*, round(100.0 * population_estimate_2012/state_population,2) as percentage_population
26 | from city_populations a
27 | left join(select state, sum(population_estimate_2012) as state_population from city_populations group by state) b
28 | on a.state = b.state
29 | order by a.state;
30 |
31 | # Write a query to add the additional column - population density. The column logic is:
32 | /* ● Population more than average - High
33 | ● Population less than or equal to average - Low */
34 | select *,
35 | case when population_estimate_2012 > (select avg(population_estimate_2012) from city_populations) then "High"
36 | when population_estimate_2012 <= (select avg(population_estimate_2012) from city_populations) then "Low"
37 | end as population_density
38 | from city_populations;
39 |
40 | /* Write a query to add an additional column - num_cities. Each row in the dataset should tell the
41 | number of cities in the dataset */
42 | select *,
43 | count(city) over(partition by state) as num_cities
44 | from city_populations;
45 |
46 | /* Write a query to add an additional column - total_population. Each row in the dataset should tell the
47 | total population of the state */
48 | select *,
49 | sum(population_estimate_2012) over(partition by state) as total_population
50 | from city_populations;
51 |
52 | # Write a query to return the rows where population is more than the average population of the state
53 |
54 | # METHOD 1
55 | select * from
56 | (select *,
57 | avg(population_estimate_2012) over(partition by state) as avg_population_state
58 | from city_populations) sub
59 | where population_estimate_2012 > sub.avg_population_state;
60 |
61 | # METHOD 2
62 | with cte as (
63 | select *,
64 | avg(population_estimate_2012) over(partition by state) as avg_population_state
65 | from city_populations)
66 | select * from cte where population_estimate_2012 > avg_population_state;
67 |
68 | # Write a query to calculate the cumulative sum of population. Arrange the data in ascending order of the population
69 | select *,
70 | sum(population_estimate_2012) over(order by population_estimate_2012) as cum_population
71 | from city_populations;
72 |
73 | /* Write a query to add a column rolling_avg. Each row in the dataset includes the average population
74 | for the two rows above and two rows below(including current row */
75 | select *,
76 | avg(population_estimate_2012) over(order by population_estimate_2012 rows between 2 preceding and 2 following) as rolling_avg
77 | from city_populations;
78 |
79 | /* Write a query to rank the cities in California(CA) state in terms of population. City with the highest
80 | population is given rank 1. Use both rank and dense_rank function */
81 | select *,
82 | rank() over(order by population_estimate_2012 desc) as rank_population,
83 | dense_rank() over(order by population_estimate_2012 desc) as dense_rank_population
84 | from city_populations
85 | where state='CA';
86 |
87 | # Write a query to find the top 2 most populated cities per state
88 |
89 | # Method 1
90 | select * from
91 | (select *,
92 | row_number() over(partition by state order by population_estimate_2012 desc) as population_rank
93 | from city_populations) sub
94 | where sub.population_rank<=2;
95 |
96 | # Method 2
97 | with cte as (
98 | select *,
99 | row_number() over(partition by state order by population_estimate_2012 desc) as population_rank
100 | from city_populations)
101 | select * from cte where population_rank <= 2;
102 |
103 | /* Write a query to add a column - perc_pop. Each row in this column should represent the percentage
104 | of population a city contributes in that state */
105 | select *,
106 | population_estimate_2012*100/sum(population_estimate_2012) over(partition by state) as perc_pop
107 | from city_populations;
108 |
109 | # Write a query to find the cities which lie in the top decile (top 10%) in terms of population
110 | select * from (select *,
111 | ntile(10) over(order by population_estimate_2012 desc) as population_decile
112 | from city_populations) sub where sub.population_decile=1;
113 |
114 | /* Write a query to arrange the cities in the descending order of population and add a column
115 | calculating difference in population from a row below (in the dataset) */
116 | select *,
117 | population_estimate_2012 - lead(population_estimate_2012) over(order by population_estimate_2012 desc) as req_diff
118 | from city_populations;
119 |
120 | # Write a query to return the state, first city and last city (in terms of id number) in the state
121 | select distinct state,
122 | first_value(id) over(partition by state order by id) as first_city,
123 | last_value(id) over(partition by state order by id rows between unbounded preceding and unbounded following) as last_city
124 | from city_populations;
125 |
--------------------------------------------------------------------------------
/Relevel_case_studies/Case_Study_7.sql:
--------------------------------------------------------------------------------
1 | select * from sat_scores;
2 |
3 | # Write a query to add a column - avg_sat_writing. Each row in this column should include the average marks in the writing section per school
4 | select *,
5 | avg(sat_writing) over(partition by school) as avg_sat_writing
6 | from sat_scores;
7 |
8 | # In the above question, add an additional column - count_per_school. Each row of this column should include the number of students per school
9 | select *,
10 | avg(sat_writing) over(partition by school) as avg_sat_writing,
11 | count(student_id) over(partition by school) as count_per_school
12 | from sat_scores;
13 |
14 | # In the above question, add two additional columns - max_per_teacher and min_per_teacher. Each row of these columns should include the maximum and minimum marks in maths per teacher, respectively
15 | select *,
16 | avg(sat_writing) over(partition by school) as avg_sat_writing,
17 | count(student_id) over(partition by school) as count_per_school,
18 | max(sat_math) over(partition by teacher) as max_per_teacher,
19 | min(sat_math) over(partition by teacher) as min_per_teacher
20 | from sat_scores;
21 |
22 | /* For the dataset, write a query to add the two columns cum_hrs_studied and total_hrs_studied. Each
23 | row in cum_hrs_studied should display the cumulative sum of hours studied per school. Each row in
24 | the total_hrs_studied will display total hours studied per school. Order the data in the ascending
25 | order of student id */
26 | select *,
27 | sum(hrs_studied) over(partition by school order by student_id) as cum_hrs_studied,
28 | sum(hrs_studied) over(partition by school order by student_id rows between unbounded preceding and unbounded following) as total_hrs_studied
29 | from sat_scores;
30 |
31 | /* For the dataset, write a query to add column sub_hrs_studied and total_hrs_studied. Each row in
32 | sub_hrs_studied should display the sum of hrs_studied for a row above, a row below, and current
33 | row per school. Order the data in the ascending order of student id */
34 | select *,
35 | sum(hrs_studied) over(partition by school order by student_id rows between unbounded preceding and unbounded following) as total_hrs_studied,
36 | sum(hrs_studied) over(partition by school order by student_id rows between 1 preceding and 1 following) as sub_hrs_studied
37 | from sat_scores;
38 |
39 | /* Write a query to rank the students per school on the basis of scores in verbal. Use both rank and
40 | dense_rank function. Students with the highest marks should get rank 1.
41 | **Note: see if there is difference in ranking provided by both the functions */
42 | select *,
43 | rank() over(partition by school order by sat_verbal desc) as rank_verbal,
44 | dense_rank() over(partition by school order by sat_verbal desc) as d_rank_verbal
45 | from sat_scores;
46 |
47 | /* Write a query to rank the students per school on the basis of scores in writing. Use both rank and
48 | dense_rank function. Student with the highest marks should get rank 1.
49 | **Note: see if there is difference in ranking provided by both the functions for teacher = ‘Spellman’ */
50 | select *,
51 | rank() over(partition by school order by sat_writing desc) as rank_writing,
52 | dense_rank() over(partition by school order by sat_writing desc) as d_rank_writing
53 | from sat_scores
54 | where teacher='Spellman';
55 |
56 | # Write a query to find the top 5 students per teacher who spent maximum hours studying
57 | select * from
58 | (select *,
59 | rank() over(partition by teacher order by hrs_studied desc) as rank_hr_studied
60 | from sat_scores) sub
61 | where sub.rank_hr_studied<=5;
62 |
63 | # Write a query to find the worst 5 students per school who got minimum marks in sat_math
64 | select * from
65 | (select *,
66 | row_number() over(partition by school order by sat_math) as sat_math_ranking
67 | from sat_scores) sub
68 | where sub.sat_math_ranking<=5;
69 |
70 | # Write a query to divide the dataset into quartile on the basis of marks in sat_verbal
71 | select *,
72 | ntile(4) over(order by sat_verbal) as quartile
73 | from sat_scores;
74 |
75 | /* For ‘Petersville HS’ school, write a query to arrange the students in the ascending order of hours
76 | studied. Also, add a column to find the difference in hours studied from the student above(in the
77 | row). Exclude the cases where hrs_studied is null */
78 | # hrs_studied is filtered inside the query so null rows never enter the lag() window
79 | select student_id, hrs_studied,
80 | hrs_studied-lag(hrs_studied) over(order by hrs_studied) as diff_hr_studied
81 | from sat_scores
82 | where school='Petersville HS'
83 | and hrs_studied is not null;
84 |
85 | /* For ‘Washington HS’ school, write a query to arrange the students in the descending order of
86 | sat_math. Also, add a column to find the difference in sat_math from the student below(in the row) */
87 | select *,
88 | sat_math-lead(sat_math) over(order by sat_math desc) as diff_sat_math
89 | from sat_scores
90 | where school='Washington HS';
91 |
92 | /* Write a query to return 4 columns - student_id, school, sat_writing, and the difference between
93 | sat_writing and the average marks scored in sat_writing in the school */
94 | select student_id, school, sat_writing,
95 | sat_writing-avg(sat_writing) over(partition by school) as req_diff
96 | from sat_scores;
97 |
98 | /* Write a query to return 4 columns - student_id, teacher, sat_verbal, and the difference between
99 | sat_verbal and the minimum marks scored in sat_verbal per teacher */
100 | select student_id, teacher, sat_verbal,
101 | sat_verbal-min(sat_verbal) over(partition by teacher) as req_diff
102 | from sat_scores;
103 |
104 | /* Write a query to return the student_id and school who are in bottom 20 in each of sat_verbal,
105 | sat_writing, and sat_math for their school */
106 |
107 | # METHOD 1
108 | select student_id, school from
109 | (select *,
110 | row_number() over(partition by school order by sat_verbal) as verbal_rank,
111 | row_number() over(partition by school order by sat_writing) as writing_rank,
112 | row_number() over(partition by school order by sat_math) as math_rank
113 | from sat_scores) sub
114 | where sub.verbal_rank<=20 and sub.writing_rank<=20 and sub.math_rank<=20;
115 |
116 | # METHOD - 2
117 | with data as (
118 | select *,
119 | row_number() over(partition by school order by sat_verbal) as verbal_rank,
120 | row_number() over(partition by school order by sat_writing) as writing_rank,
121 | row_number() over(partition by school order by sat_math) as math_rank
122 | from sat_scores)
123 | select student_id, school
124 | from data
125 | where verbal_rank<=20 and writing_rank<=20 and math_rank<=20;
126 |
127 | # Write a query to find the student_id for the highest mark and lowest mark per teacher for sat_writing
128 | select distinct teacher,
129 | first_value(student_id) over(partition by teacher order by sat_writing desc rows between unbounded preceding and unbounded following) as highest_marks_student,
130 | last_value(student_id) over(partition by teacher order by sat_writing desc rows between unbounded preceding and unbounded following) as lowest_marks_student
131 | from sat_scores;
132 |
133 |
--------------------------------------------------------------------------------
/Relevel_case_studies/Detailed_Case_Study_Ginny.sql:
--------------------------------------------------------------------------------
1 | # INTRODUCTION
2 |
3 | /* Ginny is a big fan of Japanese food, so she decided to start a restaurant at the beginning of 2021 that sells her three favorite
4 | foods: sushi, curry, and ramen.
5 | Ginny's Diner needs your help to stay afloat – the restaurant has collected some fundamental data from their few months of
6 | operation but has no idea how to use it to help them operate the business. */
7 |
8 | # PROBLEM STATEMENT
9 |
10 | /* Ginny wants to use the data to answer a few simple questions about her customers, especially about their:
11 | ● visiting patterns,
12 | ● how much money they’ve spent, and
13 | ● which menu items are their favorite.
14 | This deeper connection with her customers will help her deliver a better and more personalized experience for her loyal
15 | customers.
16 | She plans on using these insights to help her decide whether she should expand the existing customer loyalty program.
17 | Additionally, she needs help to generate some essential datasets so her team can quickly inspect the data without needing to
18 | use SQL.
19 |
20 | The data set contains the following three tables, which you may refer to in the relationship diagram below to understand the
21 | connection:
22 | ● sales
23 | ● members
24 | ● menu
25 | */
26 |
27 | # CREATING OUR DATABASE FOR THE CASE STUDY AND ANALYSIS
28 |
29 | CREATE database Detailed_Case_Study;
30 |
31 | CREATE TABLE sales (
32 | customer_id VARCHAR(1),
33 | order_date DATE,
34 | product_id INTEGER
35 | );
36 |
37 | INSERT INTO sales
38 | (customer_id, order_date, product_id)
39 | VALUES
40 | ('A', '2021-01-01', '1'),
41 | ('A', '2021-01-01', '2'),
42 | ('A', '2021-01-07', '2'),
43 | ('A', '2021-01-10', '3'),
44 | ('A', '2021-01-11', '3'),
45 | ('A', '2021-01-11', '3'),
46 | ('B', '2021-01-01', '2'),
47 | ('B', '2021-01-02', '2'),
48 | ('B', '2021-01-04', '1'),
49 | ('B', '2021-01-11', '1'),
50 | ('B', '2021-01-16', '3'),
51 | ('B', '2021-02-01', '3'),
52 | ('C', '2021-01-01', '3'),
53 | ('C', '2021-01-01', '3'),
54 | ('C', '2021-01-07', '3');
55 |
56 |
57 | CREATE TABLE menu (
58 | product_id INTEGER,
59 | product_name VARCHAR(5),
60 | price INTEGER
61 | );
62 |
63 | INSERT INTO menu
64 | (product_id, product_name, price)
65 | VALUES
66 | ('1', 'sushi', '10'),
67 | ('2', 'curry', '15'),
68 | ('3', 'ramen', '12');
69 |
70 |
71 | CREATE TABLE members (
72 | customer_id VARCHAR(1),
73 | join_date DATE
74 | );
75 |
76 | INSERT INTO members
77 | (customer_id, join_date)
78 | VALUES
79 | ('A', '2021-01-07'),
80 | ('B', '2021-01-09');
81 |
82 | # QUERY 1
83 |
84 | # What is the total amount each customer spent at the restaurant?
85 | with data as (select s.customer_id, s.product_id, m.price
86 | from sales s
87 | left join menu m
88 | on s.product_id=m.product_id)
89 | select customer_id, sum(price) as total_spent from data group by 1;
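A quick hand check against the seed data inserted above confirms what the query should return:

```sql
/* Expected totals from the INSERTs above:
   A: 10 + 15 + 15 + 12 + 12 + 12 = 76
   B: 15 + 15 + 10 + 10 + 12 + 12 = 74
   C: 12 + 12 + 12 = 36 */
```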
90 |
91 | # QUERY 2
92 |
93 | # How many days has each customer visited the restaurant?
94 | select customer_id, count(distinct order_date) as num_visit_days
95 | from sales
96 | group by 1;
97 |
98 | # QUERY 3
99 |
100 | # What was the first item from the menu purchased by each customer?
101 | with data as (
102 | select s.*, m.product_name,
103 | rank() over(partition by s.customer_id order by s.order_date) as ord_Rank
104 | from sales s
105 | left join menu m
106 | on s.product_id=m.product_id)
107 | select customer_id, product_name from data where ord_Rank=1 group by 1,2;
108 |
109 | # QUERY 4
110 |
111 | # What is the most purchased item on the menu and how many times was it purchased by all customers?
112 | select m.product_name, count(s.product_id) as num_purchase
113 | from sales as s
114 | left join menu as m
115 | on s.product_id=m.product_id
116 | group by 1
117 | order by 2 desc
118 | limit 1;
119 |
120 | # QUERY 5
121 |
122 | # Which item was the most popular one for each customer?
123 | with data as
124 | (select customer_id, product_name, order_amount, rank() over(partition by customer_id order by order_amount desc) as ord_rank from
125 | (select s.customer_id,
126 | count(s.product_id) as order_amount,
127 | m.product_name
128 | from sales as s
129 | left join menu as m
130 | on s.product_id=m.product_id
131 | group by 1, s.product_id
132 | order by 3 desc) as a)
133 | select customer_id, product_name, order_amount from data where ord_rank=1;
134 |
135 | # QUERY 6
136 |
137 | /* Which item was purchased first by the customer after they became a member?
138 | Note: the assumption here is that a customer's join date does not necessarily mean the customer ordered on that date. */
139 | with cte as (select * from (select s.*, m.join_date, me.product_name from sales as s
140 | left join members as m
141 | on s.customer_id=m.customer_id
142 | left join menu as me
143 | on s.product_id=me.product_id) as a
144 | where join_date is not null and order_date>join_date)
145 | select customer_id, product_name, order_date as first_purchase from (select *, rank() over(partition by customer_id order by order_date) as ord_rank from cte) sub where ord_rank=1;
146 |
147 | # QUERY 7
148 |
149 | # Which item was purchased right before the customer became a member?
150 | with cte as
151 | (select * from (select s.*, m.join_date, me.product_name
152 | from sales as s
153 | left join members as m
154 | on s.customer_id=m.customer_id
155 | join menu as me
156 | on s.product_id=me.product_id) as a
157 | where join_date is not null and order_date<join_date)
158 | select customer_id, product_name, order_date from
159 | (select *, rank() over(partition by customer_id order by order_date desc) as ord_rank
160 | from cte) sub
161 | where ord_rank=1;
162 |
163 | # QUERY 8
164 |
165 | # What is the total number of items purchased and the total amount spent by each member before they became a member?
166 | select s.customer_id,
167 | count(s.product_id) as total_items,
168 | sum(m.price) as amount_spent
169 | from sales as s
170 | inner join members as me
171 | on s.customer_id=me.customer_id
172 | left join menu as m
173 | on s.product_id=m.product_id
174 | where me.join_date>s.order_date
175 | group by 1;
176 |
177 | # QUERY 9
178 |
179 | # If each customers’ $1 spent equates to 10 points and sushi has a 2x points multiplier — how many points would each customer have?
180 | select customer_id, sum(points_earned) as total_points from (
181 | select s.customer_id,
182 | case
183 | when m.product_name='sushi' then 20*m.price else 10*m.price
184 | end as points_earned
185 | from sales as s
186 | left join menu as m
187 | on s.product_id=m.product_id) sub
188 | group by 1;
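Hand-checking the points logic against the seed data above:

```sql
/* Customer A: 1 sushi (10*20 = 200) + 2 curry (2*15*10 = 300) + 3 ramen (3*12*10 = 360) = 860
   Customer B: 2 sushi (400) + 2 curry (300) + 2 ramen (240) = 940
   Customer C: 3 ramen (360) = 360 */
```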
189 |
190 | # QUERY 10
191 |
192 | /* In the first week after a customer joins the program (including their join date), they earn 2x points on all items, not just sushi.
193 | How many points do customers A and B have at the end of January 2021?
194 | Note: MySQL also has an INTERVAL(N, N1, N2, ...) function that returns 0 if N < N1, 1 if N < N2, and so on
195 | (and -1 if N is NULL), which could be used to bound the membership week; the solution below uses
196 | DATE_ADD(join_date, INTERVAL 6 DAY) instead. */
197 | with cte as (select s.*, m.product_name, m.price, me.join_date,
198 | case
199 | when s.order_date between me.join_date and date_add(me.join_date, interval 6 day) then 20*m.price
200 | when m.product_name='sushi' then 20*m.price
201 | else 10*m.price
202 | end as point_earned
203 | from sales as s
204 | left join menu as m
205 | on s.product_id=m.product_id
206 | inner join members as me
207 | on s.customer_id=me.customer_id
208 | )
209 | select customer_id, sum(point_earned) as total_points
210 | from cte
211 | where order_date<='2021-01-31'
212 | group by 1;
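Hand check for this query against the seed data, counting only orders on or before 2021-01-31:

```sql
/* Customer A (joins 2021-01-07):
   01-01 sushi 2x -> 200, 01-01 curry -> 150,
   01-07 curry in first membership week 2x -> 300,
   01-10 ramen 2x -> 240, 01-11 ramen x2 2x -> 480  => 1370
   Customer B (joins 2021-01-09):
   01-01 curry 150, 01-02 curry 150, 01-04 sushi 200,
   01-11 sushi in first membership week 2x -> 200, 01-16 ramen -> 120  => 820
   (the 02-01 ramen order falls outside January and is excluded) */
```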
213 |
214 |
--------------------------------------------------------------------------------
/Retail_case_studies/Case_Study_1.sql:
--------------------------------------------------------------------------------
1 | select * from retail_sales_dataset;
2 | select count(distinct customer_id) from retail_sales_dataset;
3 |
--------------------------------------------------------------------------------