├── .gitignore ├── LICENSE ├── Practical-SQL-exercises.Rproj ├── README.Rmd ├── README.md ├── assets └── cover.jpg ├── exercises ├── Chapter-01.sql ├── Chapter-02.sql ├── Chapter-03.sql ├── Chapter-04.sql ├── Chapter-05.sql ├── Chapter-06.sql ├── Chapter-07.sql ├── Chapter-08.sql ├── Chapter-09.sql ├── Chapter-10.sql ├── Chapter-11.sql ├── Chapter-12.sql ├── Chapter-13.sql ├── Chapter-14.sql ├── Chapter-15.sql ├── Chapter-16.sql ├── Chapter-17.sql ├── Chapter-18.sql └── extra.sql └── feedback ├── feedback.Rmd └── feedback.pdf /.gitignore: -------------------------------------------------------------------------------- 1 | .Rproj.user 2 | .Rhistory 3 | .RData 4 | .Ruserdata 5 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2021 Indrajeet Patil 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /Practical-SQL-exercises.Rproj: -------------------------------------------------------------------------------- 1 | Version: 1.0 2 | 3 | RestoreWorkspace: No 4 | SaveWorkspace: No 5 | AlwaysSaveHistory: No 6 | 7 | EnableCodeIndexing: Yes 8 | UseSpacesForTab: Yes 9 | NumSpacesForTab: 2 10 | Encoding: UTF-8 11 | 12 | RnwWeave: knitr 13 | LaTeX: pdfLaTeX 14 | 15 | AutoAppendNewline: Yes 16 | StripTrailingWhitespace: Yes 17 | 18 | QuitChildProcessesOnExit: Yes 19 | DisableExecuteRprofile: Yes 20 | -------------------------------------------------------------------------------- /README.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "Practical-SQL exercises" 3 | output: github_document 4 | --- 5 | 6 | ```{r setup, include=FALSE} 7 | knitr::opts_chunk$set(echo = TRUE) 8 | ``` 9 | 10 | Solutions to exercises from the _Practical SQL_ (1st Edition) [book](https://nostarch.com/practicalSQL). 
11 | 12 | - [My solutions](https://github.com/IndrajeetPatil/Practical-SQL-exercises/tree/master/exercises) 13 | 14 | - [Author's solutions](https://github.com/anthonydb/practical-sql/blob/master/Try_It_Yourself/Try_It_Yourself.sql) 15 | 16 | ```{r, out.width="60%", echo=FALSE} 17 | knitr::include_graphics("assets/cover.jpg") 18 | ``` 19 | 20 | ## See also 21 | 22 | ### Code and data 23 | 24 | - [1st Edition Code and Data](https://github.com/anthonydb/practical-sql) 25 | 26 | - [2nd Edition Code and Data](https://github.com/anthonydb/practical-sql-2) 27 | 28 | ### Resources 29 | 30 | - [A curated list of awesome PostgreSQL software, libraries, tools and resources](https://github.com/dhamaniasad/awesome-postgres) 31 | 32 | ### Useful tips for debugging a few common problems 33 | 34 | - [If you find pgAdmin is too slow on Windows](https://dba.stackexchange.com/questions/201646/slow-connect-time-to-postgresql-on-windows-10) 35 | 36 | - [If you have problems importing data on Windows](https://stackoverflow.com/a/65459173/7973626) 37 | 38 | - [Configuring binary path on Windows](https://dba.stackexchange.com/questions/149169/binary-path-in-the-pgadmin-preferences) 39 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | Practical-SQL exercises 2 | ================ 3 | 4 | Solutions to exercises from the *Practical SQL* (1st Edition) 5 | [book](https://nostarch.com/practicalSQL). 6 | 7 | - [My 8 | solutions](https://github.com/IndrajeetPatil/Practical-SQL-exercises/tree/master/exercises) 9 | 10 | - [Author’s 11 | solutions](https://github.com/anthonydb/practical-sql/blob/master/Try_It_Yourself/Try_It_Yourself.sql) 12 | 13 | 14 | 15 | ## See also 16 | 17 | ### Code and data 18 | 19 | - [1st Edition Code and 20 | Data](https://github.com/anthonydb/practical-sql) 21 | 22 | - [2nd Edition Code and 23 | Data](https://github.com/anthonydb/practical-sql-2) 24 | 25 | ### Resources 26 | 27 | - [A curated list of awesome PostgreSQL software, libraries, tools and 28 | resources](https://github.com/dhamaniasad/awesome-postgres) 29 | 30 | ### Useful tips for debugging a few common problems 31 | 32 | - [If you find pgAdmin is too slow on 33 | Windows](https://dba.stackexchange.com/questions/201646/slow-connect-time-to-postgresql-on-windows-10) 34 | 35 | - [If you have problems importing data on 36 | Windows](https://stackoverflow.com/a/65459173/7973626) 37 | 38 | - [Configuring binary path on 39 | Windows](https://dba.stackexchange.com/questions/149169/binary-path-in-the-pgadmin-preferences) 40 | -------------------------------------------------------------------------------- /assets/cover.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/IndrajeetPatil/Practical-SQL-exercises/fb02667a2070212d7ff234ba79fc8941e866f2c0/assets/cover.jpg -------------------------------------------------------------------------------- /exercises/Chapter-01.sql: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------- 2 | -- Practical SQL: A Beginner's Guide to Storytelling with Data 3 | -- by Anthony DeBarros 4 | -- 1st Edition 5 | -- Chapter 1 "Try It Yourself" Exercises 6 | -------------------------------------------------------------- 7 | 8 | 9 | --------- 10 | -- Q1 -- 11 | --------- 12 | 13 | 14 | -- create a database for a zoo 15 | 16 | CREATE 
DATABASE zoo; 17 | 18 | -- create a table describing animals in zoo (humans not included ;-) ) 19 | -- total_count is smallint coz a zoo won't have thousands of same animals 20 | 21 | CREATE TABLE animals ( 22 | species_counter smallserial, 23 | species_name varchar(100), 24 | total_count smallint, 25 | avg_weight numeric, 26 | date_acquisition date 27 | ); 28 | 29 | 30 | --------- 31 | -- Q2 -- 32 | --------- 33 | 34 | 35 | -- insert some made up data into it 36 | 37 | INSERT INTO 38 | animals ( 39 | species_name, 40 | total_count, 41 | avg_weight, 42 | date_acquisition 43 | ) 44 | VALUES 45 | ('Zebra', 2, 200.10, '2020-2-3'), 46 | ('Lion', 1, 190.50, '2018-7-8'), 47 | ('Tiger', 2, 150.40, '2017-12-1'); 48 | 49 | -------------------------------------------------------------------------------- /exercises/Chapter-02.sql: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------- 2 | -- Practical SQL: A Beginner's Guide to Storytelling with Data 3 | -- by Anthony DeBarros 4 | -- 1st Edition 5 | -- Chapter 2 "Try It Yourself" Exercises 6 | -------------------------------------------------------------- 7 | 8 | 9 | --------- 10 | -- Q1 -- 11 | --------- 12 | 13 | 14 | -- list in the format specified by the superintendant 15 | 16 | SELECT 17 | school, 18 | last_name 19 | FROM 20 | teachers 21 | ORDER BY 22 | school, 23 | last_name DESC; 24 | 25 | 26 | --------- 27 | -- Q2 -- 28 | --------- 29 | 30 | 31 | -- find teachers with specified conditions 32 | 33 | SELECT 34 | * 35 | FROM 36 | teachers 37 | WHERE 38 | (first_name LIKE 'S%') 39 | AND (salary > 40000); 40 | 41 | 42 | --------- 43 | -- Q3 -- 44 | --------- 45 | 46 | 47 | -- find and order teachers by hiring date and salary 48 | 49 | SELECT 50 | * 51 | FROM 52 | teachers 53 | WHERE 54 | hire_date > '2010-1-1' 55 | ORDER BY 56 | salary DESC; 57 | -------------------------------------------------------------------------------- /exercises/Chapter-03.sql: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------- 2 | -- Practical SQL: A Beginner's Guide to Storytelling with Data 3 | -- by Anthony DeBarros 4 | -- 1st Edition 5 | -- Chapter 3 "Try It Yourself" Exercises 6 | -------------------------------------------------------------- 7 | 8 | 9 | --------------- 10 | -- Q1 and Q2 -- 11 | --------------- 12 | 13 | 14 | -- create a database for a grocery delivery company 15 | 16 | CREATE DATABASE grocery; 17 | 18 | -- create a table for drivers 19 | -- choose the correct datatype for miles driven with required precision 20 | /* 21 | -- since mileage can't be more than 999 (3 digits on the left), and 22 | -- needed precision is one tenth of a mile (1 digit on the right) 23 | the correct datatype would be decimal(4, 1) 24 | */ 25 | -- why have separate columns for first and last names? 26 | /* 27 | - we can then sort either by first or last names 28 | - query them further (e.g. check if two drivers have same name) 29 | - etc. 
30 | */ 31 | 32 | CREATE TABLE drivers( 33 | driver_id serial, 34 | first_name varchar(50), 35 | last_name varchar(50), 36 | workday date, 37 | clock_in time with time zone, 38 | clock_out time with time zone, 39 | miles_driven decimal(4, 1) 40 | ); 41 | 42 | -- add entries to the table 43 | 44 | INSERT INTO 45 | drivers ( 46 | first_name, 47 | last_name, 48 | workday, 49 | clock_in, 50 | clock_out, 51 | miles_driven 52 | ) 53 | VALUES 54 | ( 55 | 'Adam', 56 | 'Driver', 57 | '2021-04-05', 58 | '08:15 GMT+1', 59 | '13:00 GMT+1', 60 | 123.4545 61 | ), 62 | ( 63 | 'Adam', 64 | 'Driver', 65 | '2021-04-05', 66 | '11:30 CET', 67 | '18:00 CET', 68 | 234 69 | ), 70 | ( 71 | 'Krystyna', 72 | 'Pszczyna', 73 | '2021-04-06', 74 | '13:15 EST', 75 | '23:00 EST', 76 | 678.4344 77 | ); 78 | 79 | 80 | --------- 81 | -- Q3 -- 82 | --------- 83 | 84 | 85 | -- what do you see when a malformed date entry is cast to timestamp? 86 | 87 | CREATE TABLE drivers2( 88 | first_name varchar(50), 89 | workhours timestamp 90 | ); 91 | 92 | INSERT INTO 93 | drivers2 (first_name, workhours) 94 | VALUES 95 | ('X', '4//2017'); 96 | 97 | /* 98 | ERROR: invalid input syntax for type timestamp: "4//2017" 99 | LINE 9: ('X', '4//2017'); 100 | ^ 101 | SQL state: 22007 102 | Character: 146 103 | */ 104 | 105 | -- delete this unnecessary table 106 | 107 | DROP TABLE drivers2; 108 | -------------------------------------------------------------------------------- /exercises/Chapter-04.sql: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------- 2 | -- Practical SQL: A Beginner's Guide to Storytelling with Data 3 | -- by Anthony DeBarros 4 | -- 1st Edition 5 | -- Chapter 4 "Try It Yourself" Exercises 6 | -------------------------------------------------------------- 7 | 8 | 9 | --------- 10 | -- Q1 -- 11 | --------- 12 | 13 | 14 | -- importing specified text file 15 | -- Ref. 
COPY movies
FROM -- your file path will be different
'C:\Users\IndrajeetPatil\Documents\GitHub\Practical-SQL-exercises\assets\ch4q1.txt'
WITH (FORMAT CSV, HEADER, DELIMITER ':', quote '#');


---------
-- Q2 --
---------


-- counties with most housing units

COPY (
  SELECT
    geo_name,
    state_us_abbreviation,
    housing_unit_count_100_percent
  FROM
    us_counties_2010
  ORDER BY
    housing_unit_count_100_percent DESC
  LIMIT
    20
)
TO -- your file path will be different
'C:\Users\IndrajeetPatil\Documents\GitHub\Practical-SQL-exercises\assets\ch4q2.txt'
WITH (FORMAT csv, HEADER);


---------
-- Q3 --
---------


-- fixed-point numbers
/*
numeric(3,8) will not work for the provided values and you will get the error
'NUMERIC scale 8 must be between 0 and precision 3'.
This is because with this scale all digits will be to the right of the decimal
point, and so the value needs to be between 0 and 1.
*/

SELECT
  CAST(17519.668 AS numeric(3, 8));

-- it instead needs to be the following
-- scale of 3 means 3 digits to the right of the decimal point

SELECT
  CAST(17519.668 AS numeric(8, 3));

-- if the value were to be between 0 and 1, the following would work

SELECT
  CAST(0.17519668 AS numeric(8, 8));
--------------------------------------------------------------------------------
/exercises/Chapter-05.sql:
--------------------------------------------------------------------------------
--------------------------------------------------------------
-- Practical SQL: A Beginner's Guide to Storytelling with Data
-- by Anthony DeBarros
-- 1st Edition
-- Chapter 5 "Try It Yourself" Exercises
--------------------------------------------------------------


---------
-- Q1 --
---------


-- calculating the area of a circle whose radius is 5 inches
/* no parentheses needed because the exponentiation operator has
a higher precedence than multiplication
*/

SELECT
  3.14 * 5 ^ 2;


---------
-- Q2 --
---------


/*
You first need to have created a database and then also a table
code here: https://github.com/anthonydb/practical-sql/blob/ebc2233c41f2ba18512daf7c4a1a2f4c7f72a520/Chapter_04/Chapter_04.sql#L20-L123

-- Importing data from csv (path to file will be different for everyone)
-- If you get a 'Permission denied' error, see: https://stackoverflow.com/a/65459173/7973626
-- If that still doesn't work, use the pgAdmin GUI to import data
-- configure binary path, see: https://dba.stackexchange.com/questions/149169/binary-path-in-the-pgadmin-preferences
*/
-- Q2. NY county with the highest percentage of people identifying as
-- American Indian and Alaska Native alone
-- CREATE DATABASE us_census;


COPY us_counties_2010
FROM -- your file path will be different
'C:\Users\IndrajeetPatil\Documents\GitHub\Practical-SQL-exercises\assets'
WITH (FORMAT csv, HEADER);

-- Franklin County, with just about 7% of people identifying as American Indian and Alaska Native alone
-- You can see the map here: https://www.census.gov/quickfacts/fact/map/franklincountynewyork/RHI325219
-- As to why? I have no clue. Google search wasn't that helpful.
49 | 50 | SELECT 51 | geo_name, 52 | ( 53 | CAST(p0010005 AS numeric(8, 1)) / population_count_100_percent 54 | ) * 100 AS "prop" 55 | FROM 56 | us_counties_2010 57 | WHERE 58 | state_us_abbreviation = 'NY' 59 | ORDER BY 60 | prop DESC; 61 | 62 | 63 | --------- 64 | -- Q3 -- 65 | --------- 66 | 67 | 68 | -- median county population in CA (179140.5) was higher than NY (91301) 69 | 70 | SELECT 71 | state_us_abbreviation, 72 | percentile_cont(0.5) WITHIN GROUP ( 73 | ORDER BY 74 | population_count_100_percent 75 | ) 76 | FROM 77 | us_counties_2010 78 | WHERE 79 | state_us_abbreviation IN ('NY', 'CA') 80 | GROUP BY 81 | state_us_abbreviation; 82 | -------------------------------------------------------------------------------- /exercises/Chapter-06.sql: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------- 2 | -- Practical SQL: A Beginner's Guide to Storytelling with Data 3 | -- by Anthony DeBarros 4 | -- 1st Edition 5 | -- Chapter 6 "Try It Yourself" Exercises 6 | -------------------------------------------------------------- 7 | 8 | 9 | --------- 10 | -- Q1 -- 11 | --------- 12 | 13 | 14 | -- The curious case of missing counties 15 | -- identify which counties don’t exist in either of the tables 16 | /* 17 | Present in 2000 but not in 2010 (these counties were incorporated in city-and-borough) 18 | - Clifton Forge city, Virginia 19 | - Prince of Wales-Outer Ketchikan Census Area, Alaska 20 | - Skagway-Hoonah-Angoon Census Area, Alaska 21 | - Wrangell-Petersburg Census Area, Alaska 22 | 23 | Present in 2010 but not in 2000 (newly created counties) 24 | - Broomfield County 25 | - Hoonah-Angoon Census Area 26 | - Petersburg Census Area 27 | - Prince of Wales-Hyder Census Area 28 | - Skagway Municipality 29 | - Wrangell City and Borough 30 | */ 31 | 32 | SELECT 33 | c2000.geo_name AS county_2000_name, 34 | c2000.county_fips AS county_2000, 35 | c2010.geo_name AS county_2010_name, 36 | c2010.county_fips AS county_2010 37 | FROM 38 | us_counties_2010 AS c2010 FULL 39 | JOIN us_counties_2000 AS c2000 ON c2010.state_fips = c2000.state_fips 40 | AND c2010.county_fips = c2000.county_fips 41 | WHERE 42 | (c2010.county_fips IS NULL) 43 | OR (c2000.county_fips IS NULL) 44 | ORDER BY 45 | county_2010_name; 46 | 47 | 48 | --------- 49 | -- Q2 -- 50 | --------- 51 | 52 | 53 | -- median of the percent change in county population: 3.2% 54 | 55 | SELECT 56 | percentile_cont(.5) WITHIN GROUP ( 57 | ORDER BY 58 | ( 59 | (c2010.p0010001::numeric - c2000.p0010001) / c2000.p0010001 * 100 60 | ) 61 | ) AS median_pct_change 62 | FROM 63 | us_counties_2010 AS c2010 64 | INNER JOIN us_counties_2000 AS c2000 ON c2010.state_fips = c2000.state_fips 65 | AND c2010.county_fips = c2000.county_fips 66 | AND c2010.p0010001 <> c2000.p0010001; 67 | 68 | 69 | --------- 70 | -- Q3 -- 71 | --------- 72 | 73 | 74 | -- the greatest percentage loss of population 75 | /* 76 | The greatest percentage loss of population between 2000 and 2010 was seen 77 | in St. Bernard Parish, Louisiana (-46.6%). This was because, on August 29, 2005, 78 | St. Bernard was devastated by Hurricane Katrina. 
*/

SELECT
  c2010.geo_name,
  c2010.state_us_abbreviation AS st,
  c2010.p0010001 AS pop_2010,
  c2000.p0010001 AS pop_2000,
  round(
    (c2010.p0010001::numeric - c2000.p0010001) / c2000.p0010001 * 100,
    1
  ) AS pct_change
FROM
  us_counties_2010 AS c2010
  INNER JOIN us_counties_2000 AS c2000 ON c2010.state_fips = c2000.state_fips
  AND c2010.county_fips = c2000.county_fips
  AND c2010.p0010001 <> c2000.p0010001
ORDER BY
  pct_change ASC;
--------------------------------------------------------------------------------
/exercises/Chapter-07.sql:
--------------------------------------------------------------------------------
--------------------------------------------------------------
-- Practical SQL: A Beginner's Guide to Storytelling with Data
-- by Anthony DeBarros
-- 1st Edition
-- Chapter 7 "Try It Yourself" Exercises
--------------------------------------------------------------


---------
-- Q1 --
---------


-- Choosing primary and foreign keys and imposing constraints with justification

CREATE DATABASE vinyl;

CREATE TABLE albums (
  album_id bigserial,
  -- each album should have a unique catalog code
  album_catalog_code varchar(100) NOT NULL,
  -- why not unique?: multiple artists can have albums with the same title
  album_title text,
  -- why not unique?: same artist can release multiple albums in a year
  album_artist text,
  -- why not unique?: multiple albums released on the same date
  album_release_date date,
  album_genre varchar(40),
  -- each album should have a unique description
  album_description text,
  CONSTRAINT album_description_unique UNIQUE (album_description),
  CONSTRAINT album_code_unique UNIQUE (album_catalog_code),
  CONSTRAINT albums_key PRIMARY KEY (album_id)
);

CREATE TABLE songs (
  song_id bigserial,
  song_title text,
  song_artist text,
  -- no midnight calls and consistent with albums data type
  album_id bigserial REFERENCES albums (album_id),
  CONSTRAINT songs_key PRIMARY KEY (song_id)
);


---------
-- Q2 --
---------


-- Alternative natural primary key
/*
Instead of the `album_id` column as a *surrogate* primary key for the `albums` table,
the `album_artist` column could be a *natural* primary key.
In this case, the foreign key for the `songs` table will be `song_artist`.
56 | */ 57 | 58 | CONSTRAINT albums_key PRIMARY KEY (album_artist); 59 | 60 | 61 | --------- 62 | -- Q3 -- 63 | --------- 64 | 65 | 66 | -- Possible indexing columns 67 | /* 68 | good candidates for indexes: `album_id`, `album_catalog_code`, `album_description` 69 | */ 70 | 71 | CREATE INDEX album_id_idx ON albums (album_id); 72 | CREATE INDEX album_catalog_code_idx ON albums (album_catalog_code); 73 | CREATE INDEX album_description_idx ON albums (album_description); 74 | -------------------------------------------------------------------------------- /exercises/Chapter-08.sql: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------- 2 | -- Practical SQL: A Beginner's Guide to Storytelling with Data 3 | -- by Anthony DeBarros 4 | -- 1st Edition 5 | -- Chapter 8 "Try It Yourself" Exercises 6 | -------------------------------------------------------------- 7 | 8 | 9 | --------- 10 | -- Q1 -- 11 | --------- 12 | 13 | 14 | -- Use of technology in library 15 | -- Create tables using Listing 8-1 and Listing 8-2 16 | /* 17 | Results show that- 18 | 19 | 1. There has been a big increases in gpterms (the number of internet-connected 20 | computers used by the public), with the biggest being 827% for Guam. 21 | Only two states had decrease: Hawaii (-2.4%) and Puerto Rico (-6.18%) 22 | 23 | 2. The pitusr (use of public internet computers per year), on the other hand, 24 | has seen both increase and descrease across states. 25 | */ 26 | 27 | SELECT 28 | pls14.stabr, 29 | SUM(pls14.gpterms) AS gpterms_14, 30 | SUM(pls09.gpterms) AS gpterms_09, 31 | SUM(pls14.pitusr) AS pitusr_14, 32 | SUM(pls09.pitusr) AS pitusr_09, 33 | round( 34 | ( 35 | CAST( 36 | SUM(pls14.gpterms) - SUM(pls09.gpterms) AS numeric(10, 1) 37 | ) / SUM(pls09.gpterms) 38 | ) * 100, 39 | 2 40 | ) AS pct_change_gpterms, 41 | round( 42 | ( 43 | CAST( 44 | SUM(pls14.pitusr) - SUM(pls09.pitusr) AS numeric(10, 1) 45 | ) / SUM(pls09.pitusr) 46 | ) * 100, 47 | 2 48 | ) AS pct_change_pitusr 49 | FROM 50 | pls_fy2014_pupld14a AS pls14 51 | INNER JOIN pls_fy2009_pupld09a AS pls09 ON pls14.fscskey = pls09.fscskey 52 | WHERE 53 | -- negative integers means missing entries 54 | (pls14.gpterms > 0) 55 | AND (pls09.gpterms > 0) 56 | AND (pls14.pitusr > 0) 57 | AND (pls09.pitusr > 0) 58 | GROUP BY 59 | pls14.stabr 60 | ORDER BY 61 | pct_change_gpterms DESC; 62 | 63 | 64 | --------- 65 | -- Q2 -- 66 | --------- 67 | 68 | 69 | -- create a table with Bureau of Economic Analysis Codes and corresponding regions 70 | -- Ref: 71 | 72 | CREATE TABLE region_details ( 73 | obereg varchar(2) CONSTRAINT region_key PRIMARY KEY, 74 | region_name text 75 | ); 76 | 77 | INSERT INTO 78 | region_details(obereg, region_name) 79 | VALUES 80 | ('01', 'New England Region'), 81 | ('02', 'Mideast Region'), 82 | ('03', 'Great Lakes Region'), 83 | ('04', 'Plain Region'), 84 | ('05', 'Southeast Region'), 85 | ('06', 'Southwest Region'), 86 | ('07', 'Rocky Mountain Region'), 87 | ('08', 'Far West Region'), 88 | ('09', 'Territories Region'); 89 | 90 | -- percent change in visit by Bureau of Economic Analysis Code regions 91 | /* 92 | All regions have seen decrease, with the highest one (-12.66%) in Great Lakes region, 93 | while the lowest in (-1.78%) in Rocky Mountain Region. 
94 | */ 95 | 96 | SELECT 97 | pls14.region_name, 98 | pls14.obereg, 99 | round( 100 | ( 101 | (sum(pls14.visits) :: numeric - sum(pls09.visits)) / sum(pls09.visits) 102 | ) * 100, 103 | 2 104 | ) AS pct_change_visits 105 | FROM 106 | ( 107 | SELECT 108 | pls14.obereg, 109 | region_name, 110 | visits, 111 | fscskey 112 | FROM 113 | pls_fy2014_pupld14a AS pls14 114 | LEFT JOIN region_details AS regions ON pls14.obereg = regions.obereg 115 | ) AS pls14 116 | INNER JOIN pls_fy2009_pupld09a AS pls09 ON pls14.fscskey = pls09.fscskey 117 | WHERE 118 | pls14.visits > 0 119 | AND pls09.visits > 0 120 | GROUP BY 121 | pls14.region_name, 122 | pls14.obereg 123 | ORDER BY 124 | pct_change_visits; 125 | 126 | 127 | --------- 128 | -- Q3 -- 129 | --------- 130 | 131 | 132 | -- Missing agencies 133 | -- There are 236 agencies which are not present in either of the tables 134 | 135 | SELECT 136 | * 137 | FROM 138 | ( 139 | SELECT 140 | fscskey, 141 | libname 142 | FROM 143 | pls_fy2014_pupld14a 144 | ) AS pls14 FULL 145 | JOIN ( 146 | SELECT 147 | fscskey, 148 | libname 149 | FROM 150 | pls_fy2009_pupld09a 151 | ) AS pls09 ON pls14.fscskey = pls09.fscskey 152 | WHERE 153 | (pls14.fscskey IS NULL) 154 | OR (pls09.fscskey IS NULL); 155 | -------------------------------------------------------------------------------- /exercises/Chapter-09.sql: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------- 2 | -- Practical SQL: A Beginner's Guide to Storytelling with Data 3 | -- by Anthony DeBarros 4 | -- 1st Edition 5 | -- Chapter 9 "Try It Yourself" Exercises 6 | -------------------------------------------------------------- 7 | 8 | 9 | ---------- 10 | -- Q1-3 -- 11 | ---------- 12 | 13 | 14 | -- creating count of different activities at the plant 15 | -- Step 1: create two new columns 16 | 17 | ALTER TABLE meat_poultry_egg_inspect 18 | ADD COLUMN meat_processing boolean, 19 | ADD COLUMN poultry_processing boolean; 20 | 21 | -- Step 2: set them to true based on activities column 22 | 23 | UPDATE 24 | meat_poultry_egg_inspect 25 | SET 26 | meat_processing = TRUE 27 | WHERE 28 | activities ILIKE '%Meat Processing%'; 29 | 30 | UPDATE 31 | meat_poultry_egg_inspect 32 | SET 33 | poultry_processing = TRUE 34 | WHERE 35 | activities ILIKE '%Poultry Processing%'; 36 | 37 | -- Step 3: counts 38 | -- Meat Processing: 4832 39 | 40 | SELECT 41 | count(*) AS meat_processing_total 42 | FROM 43 | meat_poultry_egg_inspect 44 | WHERE 45 | meat_processing IS NOT NULL; 46 | 47 | -- Poultry Processing: 3787 48 | 49 | SELECT 50 | count(*) AS poultry_processing 51 | FROM 52 | meat_poultry_egg_inspect 53 | WHERE 54 | poultry_processing IS NOT NULL; 55 | 56 | -- Both activities: 3395 57 | 58 | SELECT 59 | count(*) AS both_activities 60 | FROM 61 | meat_poultry_egg_inspect 62 | WHERE 63 | (poultry_processing IS NOT NULL) 64 | AND (meat_processing IS NOT NULL); 65 | -------------------------------------------------------------------------------- /exercises/Chapter-10.sql: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------- 2 | -- Practical SQL: A Beginner's Guide to Storytelling with Data 3 | -- by Anthony DeBarros 4 | -- 1st Edition 5 | -- Chapter 10 "Try It Yourself" Exercises 6 | -------------------------------------------------------------- 7 | 8 | 9 | --------- 10 | -- Q1 -- 11 | --------- 12 | 13 | 14 | -- Relationship between masters degree and 
median household income 15 | /* 16 | The r of the variables pct_bachelors_higher and median_hh_income was about .68, 17 | while the r of pct_masters_higher and median_hh_income is lower (.57). 18 | One possibility is that getting a master's degree or higherhas a more 19 | incremental effect on salary than getting a bachelor's degree. 20 | */ 21 | 22 | SELECT 23 | round( 24 | cast( 25 | corr(acs.pct_masters_higher, acs.median_hh_income) AS numeric 26 | ), 27 | 2 28 | ) 29 | FROM 30 | acs_2011_2015_stats AS acs; 31 | 32 | 33 | --------- 34 | -- Q2 -- 35 | --------- 36 | 37 | 38 | -- Meaningful comparisons 39 | -- Milwaukee is at the top with 12.29 motor vehicle thefts per 1000 people 40 | 41 | SELECT 42 | city, 43 | motor_vehicle_theft, 44 | population, 45 | round( 46 | (motor_vehicle_theft::numeric / population) * 1000, 47 | 2 48 | ) AS pct_motor_vehicle_theft 49 | FROM 50 | fbi_crime_data_2015 51 | WHERE 52 | population > 500000 53 | ORDER BY 54 | pct_motor_vehicle_theft DESC; 55 | 56 | -- Detroit is at the top with 17.60 violent crimes per 1000 people 57 | 58 | SELECT 59 | city, 60 | violent_crime, 61 | population, 62 | round( 63 | (violent_crime::numeric / population) * 1000, 64 | 2 65 | ) AS pct_violent_crime 66 | FROM 67 | fbi_crime_data_2015 68 | WHERE 69 | population > 500000 70 | ORDER BY 71 | pct_violent_crime DESC; 72 | 73 | 74 | --------- 75 | -- Q3 -- 76 | --------- 77 | 78 | 79 | -- Ranking library agencies based on rates of visits 80 | /* 81 | The highest rate (of 12962.83 per 1000 people) is the Cuyahoga County Public Library, 82 | which has 27 branches and serve 47 communities in Cuyahoga County, Ohio. 83 | For more, see: 84 | */ 85 | 86 | SELECT 87 | libname, 88 | city, 89 | stabr, 90 | visits, 91 | popu_lsa, 92 | round((visits::numeric / popu_lsa) * 1000, 2) AS pct_visits, 93 | rank() over ( 94 | ORDER BY 95 | round((visits::numeric / popu_lsa) * 1000, 2) DESC 96 | ) 97 | FROM 98 | pls_fy2014_pupld14a 99 | WHERE 100 | (popu_lsa > 250000) 101 | AND (visits > 0) 102 | ORDER BY 103 | pct_visits DESC; 104 | 105 | -------------------------------------------------------------------------------- /exercises/Chapter-11.sql: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------- 2 | -- Practical SQL: A Beginner's Guide to Storytelling with Data 3 | -- by Anthony DeBarros 4 | -- 1st Edition 5 | -- Chapter 11 "Try It Yourself" Exercises 6 | -------------------------------------------------------------- 7 | 8 | 9 | --------- 10 | -- Q1 -- 11 | --------- 12 | 13 | 14 | -- sorted durations of journeys 15 | /* 16 | The longest journey durations are astoundingly close to 24 hours, even though 17 | the trip distance is a mere few kilometers. Is this due to traffic jams? 18 | 19 | On the lower end, the shortest durations are instantenous (0 seconds), and yet 20 | the customers seem to have paid a handsome sum. Why were they charged?! 
*/

SELECT
  trip_id,
  tpep_pickup_datetime,
  tpep_dropoff_datetime,
  (tpep_dropoff_datetime - tpep_pickup_datetime) AS duration,
  trip_distance,
  total_amount
FROM
  nyc_yellow_taxi_trips_2016_06_01
ORDER BY
  duration DESC;


---------
-- Q2 --
---------


-- exploring time zones
-- note: PostgreSQL interprets POSIX-style offsets such as 'GMT+2' with the sign
-- reversed (positive offsets lie west of Greenwich); named zones such as
-- 'Europe/Moscow' avoid this confusion

CREATE TABLE time_test_table(test_dttm timestamptz);

-- without daylight saving (which usually lasts from March to November)

INSERT INTO
  time_test_table
VALUES
  ('2100-01-01 00:00:00 UTC-04:00');

-- London: 2099-12-31 20:00:00

SELECT
  test_dttm AT TIME ZONE 'GMT'
FROM
  time_test_table;

-- Johannesburg: 2099-12-31 18:00:00

SELECT
  test_dttm AT TIME ZONE 'GMT+2'
FROM
  time_test_table;

-- Moscow: 2099-12-31 17:00:00

SELECT
  test_dttm AT TIME ZONE 'GMT+3'
FROM
  time_test_table;

-- Melbourne: 2099-12-31 09:00:00

SELECT
  test_dttm AT TIME ZONE 'GMT+11'
FROM
  time_test_table;

-- Los Angeles (for fun): 2100-01-01 04:00:00

SELECT
  test_dttm AT TIME ZONE 'GMT-8'
FROM
  time_test_table;


---------
-- Q3 --
---------


-- correlation and r2 values for associations
/*
Relationship between trip duration and total amount is strong (r = .80, r2 = .64),
with 64% of the variance in total amount paid accounted for by the trip duration.

Relationship between trip distance and total amount is stronger (r = .86, r2 = .73),
with 73% of the variance in total amount paid accounted for by the trip distance.

The rest of the variance is explained by other factors (e.g. the tip amount is not
related to either duration or distance).
103 | */ 104 | 105 | SELECT 106 | -- relationship between trip duration and total amount 107 | round( 108 | corr( 109 | ( 110 | date_part( 111 | 'epoch', 112 | tpep_dropoff_datetime - tpep_pickup_datetime 113 | ) 114 | ), 115 | total_amount 116 | )::numeric, 117 | 2 118 | ) AS r_duration_amount, 119 | round( 120 | regr_r2( 121 | ( 122 | date_part( 123 | 'epoch', 124 | tpep_dropoff_datetime - tpep_pickup_datetime 125 | ) 126 | ), 127 | total_amount 128 | )::numeric, 129 | 2 130 | ) AS r2_duration_amount, 131 | -- relationship between trip distance and total amount 132 | round(corr(trip_distance, total_amount)::numeric, 2) AS r_distance_amount, 133 | round( 134 | regr_r2(trip_distance, total_amount)::numeric, 135 | 2 136 | ) AS r2_distance_amount 137 | FROM 138 | nyc_yellow_taxi_trips_2016_06_01 139 | WHERE 140 | (tpep_dropoff_datetime - tpep_pickup_datetime) < '03:00:00'::interval; 141 | -------------------------------------------------------------------------------- /exercises/Chapter-12.sql: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------- 2 | -- Practical SQL: A Beginner's Guide to Storytelling with Data 3 | -- by Anthony DeBarros 4 | -- 1st Edition 5 | -- Chapter 12 "Try It Yourself" Exercises 6 | -------------------------------------------------------------- 7 | 8 | 9 | --------- 10 | -- Q1 -- 11 | --------- 12 | 13 | 14 | -- Groupings of Waikiki's daily maximum temperatures 15 | /* 16 | In this news schema, Waikiki’s daily maximum temperature fall most often in the 17 | '86-87' group (118 observations), while the least observed bins are 18 | '90 or more' (5 observations) and '79 or less' (8 observations). 19 | */ 20 | 21 | WITH temps_collapsed (max_temperature_group) AS ( 22 | SELECT 23 | CASE 24 | WHEN max_temp >= 90 THEN '90 or more' 25 | WHEN max_temp BETWEEN 88 26 | AND 89 THEN '88-89' 27 | WHEN max_temp BETWEEN 86 28 | AND 87 THEN '86-87' 29 | WHEN max_temp BETWEEN 84 30 | AND 85 THEN '84-85' 31 | WHEN max_temp BETWEEN 82 32 | AND 83 THEN '82-83' 33 | WHEN max_temp BETWEEN 80 34 | AND 81 THEN '80-81' 35 | WHEN max_temp <= 79 THEN '79 or less' 36 | END 37 | FROM 38 | temperature_readings 39 | WHERE 40 | station_name = 'WAIKIKI 717.2 HI US' 41 | ) 42 | SELECT 43 | max_temperature_group, 44 | count(max_temperature_group) AS n_obs 45 | FROM 46 | temps_collapsed 47 | GROUP BY 48 | max_temperature_group 49 | ORDER BY 50 | n_obs DESC; 51 | 52 | 53 | --------- 54 | -- Q2 -- 55 | --------- 56 | 57 | 58 | -- instead of office-by-flavor, make flavor-by-office crosstabs table 59 | -- the counts remain the same 60 | 61 | SELECT * 62 | FROM crosstab('SELECT 63 | flavor, 64 | office, 65 | count(*) 66 | FROM ice_cream_survey 67 | GROUP BY flavor, office 68 | ORDER BY flavor', 69 | 70 | 'SELECT office 71 | FROM ice_cream_survey 72 | GROUP BY office 73 | ORDER BY office') 74 | 75 | AS (flavor varchar(20), 76 | Downtown bigint, 77 | Midtown bigint, 78 | Uptown bigint); 79 | -------------------------------------------------------------------------------- /exercises/Chapter-13.sql: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------- 2 | -- Practical SQL: A Beginner's Guide to Storytelling with Data 3 | -- by Anthony DeBarros 4 | -- 1st Edition 5 | -- Chapter 13 "Try It Yourself" Exercises 6 | -------------------------------------------------------------- 7 | 8 | 9 | --------- 10 | -- Q1 -- 11 | --------- 12 | 
13 | 14 | -- function to remove the comma: regexp_replace. For example: 15 | 16 | SELECT 17 | regexp_replace('Alvarez, Jr.', ',', ''); 18 | 19 | -- can regex help? 20 | -- just to extract first name: outputs 'Alvarez' 21 | 22 | SELECT 23 | (regexp_match('Alvarez, Jr.', '\w+')) [1]; 24 | 25 | -- just to extract suffix: outputs 'Jr.' 26 | 27 | SELECT 28 | (regexp_match('Alvarez, Jr.', '\s.+')) [1]; 29 | 30 | -- create a table with author names with suffixes and add values 31 | 32 | CREATE TABLE authors( 33 | author_id bigserial, 34 | author_name varchar(50) 35 | ); 36 | 37 | INSERT INTO 38 | authors (author_name) 39 | VALUES 40 | ('Anthony'), 41 | ('Alvarez, Jr.'), 42 | ('Williams, Sr.'); 43 | 44 | -- add a new column for suffixes and extract suffixes to it 45 | 46 | ALTER TABLE authors ADD COLUMN suffix varchar(10); 47 | 48 | UPDATE 49 | authors 50 | SET 51 | suffix = (regexp_match(author_name, '\s.+')) [1]; 52 | 53 | -- now the suffixes can be removed from the original name 54 | 55 | UPDATE 56 | authors 57 | SET 58 | author_name = regexp_replace(author_name, ',\s.+', ''); 59 | 60 | 61 | --------- 62 | -- Q2 -- 63 | --------- 64 | 65 | 66 | -- for simplicity, let's create a new table only with one speech 67 | 68 | CREATE TABLE speech AS ( 69 | SELECT 70 | * 71 | FROM 72 | president_speeches 73 | where 74 | sotu_id = 4 75 | ); 76 | 77 | -- convert all text to lower case 78 | -- split the text into words based on space character (\s) 79 | -- then remove '.' and ',' and ':' 80 | -- then select only distinct words 81 | -- then counts only the words whose length is equal to or more than 5 82 | -- answer for this particular speech is 850 83 | 84 | WITH words AS ( 85 | SELECT 86 | DISTINCT regexp_replace( 87 | regexp_split_to_table(lower(speech_text), '\s+'), 88 | '[.,:]', 89 | '' 90 | ) AS clean_words 91 | FROM 92 | speech 93 | ORDER BY 94 | clean_words DESC 95 | ) 96 | SELECT 97 | count(clean_words) 98 | FROM 99 | words 100 | WHERE 101 | length(clean_words) >= 5; 102 | 103 | 104 | --------- 105 | -- Q3 -- 106 | --------- 107 | 108 | 109 | -- Using ts_rank_cd() function 110 | /* 111 | Yes, ts_rank_cd() instead of ts_rank() does change result. 112 | With the former, the rankings are: 113 | Dwight D. Eisenhower, Richard M. Nixon, Harry S. Truman, Gerald R. Ford, Dwight D. 
Eisenhower 114 | Compare this rankind with that in table on page 238 115 | */ 116 | 117 | SELECT 118 | president, 119 | speech_date, 120 | ts_rank_cd( 121 | search_speech_text, 122 | to_tsquery('war & security & threat & enemy') 123 | ) AS score 124 | FROM 125 | president_speeches 126 | WHERE 127 | search_speech_text @@ to_tsquery('war & security & threat & enemy') 128 | ORDER BY 129 | score DESC 130 | LIMIT 131 | 5; 132 | -------------------------------------------------------------------------------- /exercises/Chapter-14.sql: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------- 2 | -- Practical SQL: A Beginner's Guide to Storytelling with Data 3 | -- by Anthony DeBarros 4 | -- 1st Edition 5 | -- Chapter 14 "Try It Yourself" Exercises 6 | -------------------------------------------------------------- 7 | 8 | 9 | --------- 10 | -- Q1 -- 11 | --------- 12 | 13 | 14 | -- area of each state in square miles 15 | 16 | SELECT 17 | statefp10, 18 | sum( 19 | round( 20 | (ST_Area(geom::geography) / 2589988.110336)::numeric, 21 | 2 22 | ) 23 | ) AS square_miles 24 | FROM 25 | us_counties_2010_shp 26 | GROUP BY 27 | statefp10 28 | ORDER BY 29 | square_miles DESC; 30 | 31 | -- How many states are bigger than Yukon-Koyukuk? 32 | -- 3 states (statefp10: 02, 06, 48) 33 | 34 | SELECT 35 | statefp10, 36 | sum( 37 | round( 38 | (ST_Area(geom::geography) / 2589988.110336)::numeric, 39 | 2 40 | ) 41 | ) AS square_miles 42 | FROM 43 | us_counties_2010_shp 44 | GROUP BY 45 | statefp10 46 | HAVING 47 | sum( 48 | (ST_Area(geom::geography) / 2589988.110336)::numeric 49 | ) > ( 50 | SELECT 51 | (ST_Area(geom::geography) / 2589988.110336)::numeric 52 | FROM 53 | us_counties_2010_shp 54 | WHERE 55 | name10 = 'Yukon-Koyukuk' 56 | ) 57 | ORDER BY 58 | square_miles DESC; 59 | 60 | 61 | --------- 62 | -- Q2 -- 63 | --------- 64 | 65 | 66 | -- extracting lat/long info 67 | -- compute distance between these two markets 68 | 69 | WITH 70 | market_start (geog_point) AS 71 | ( 72 | SELECT geog_point 73 | FROM farmers_markets 74 | WHERE market_name = 'The Oakleaf Greenmarket' 75 | ), 76 | market_end (geog_point) AS 77 | ( 78 | SELECT geog_point 79 | FROM farmers_markets 80 | WHERE market_name = 'Columbia Farmers Market' 81 | ) 82 | SELECT ST_Distance(market_start.geog_point, market_end.geog_point) / 1609.344 -- convert to meters to miles 83 | FROM market_start, market_end; 84 | 85 | 86 | --------- 87 | -- Q3 -- 88 | --------- 89 | 90 | 91 | -- cast column geom from geometry type to geography 92 | -- create two CTEs, one for US county 93 | 94 | WITH us_counties AS ( 95 | SELECT 96 | namelsad10, 97 | ST_SetSRID(geom::geography, 4326) AS "geom" 98 | FROM 99 | us_counties_2010_shp 100 | ), 101 | markets AS ( 102 | SELECT 103 | * 104 | FROM 105 | farmers_markets 106 | WHERE 107 | county IS NULL 108 | ) 109 | SELECT 110 | markets.county, 111 | us_counties.namelsad10, 112 | us_counties.geom 113 | FROM 114 | markets 115 | JOIN us_counties ON ST_Intersects(markets.geog_point, us_counties.geom) 116 | ORDER BY 117 | markets.fmid; 118 | -------------------------------------------------------------------------------- /exercises/Chapter-15.sql: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------- 2 | -- Practical SQL: A Beginner's Guide to Storytelling with Data 3 | -- by Anthony DeBarros 4 | -- 1st Edition 5 | -- Chapter 15 "Try It Yourself" Exercises 
6 | -------------------------------------------------------------- 7 | 8 | 9 | --------- 10 | -- Q1 -- 11 | --------- 12 | 13 | 14 | -- creating view for taxi trip counts per hour 15 | 16 | CREATE OR REPLACE VIEW trip_hour_counts AS 17 | SELECT 18 | date_part('hour', tpep_pickup_datetime) AS trip_hour, 19 | count(*) 20 | FROM 21 | nyc_yellow_taxi_trips_2016_06_01 22 | GROUP BY 23 | trip_hour; 24 | 25 | -- Use it in Listing 11-8 to simlify it 26 | 27 | SELECT 28 | * 29 | FROM 30 | trip_hour_counts 31 | ORDER BY 32 | trip_hour; 33 | 34 | 35 | --------- 36 | -- Q2 -- 37 | --------- 38 | 39 | 40 | -- custom function to calculate rates per thousand 41 | 42 | CREATE OR REPLACE FUNCTION rates_per_thousand( 43 | observed_number numeric, 44 | base_number numeric, 45 | decimal_places integer DEFAULT 1 46 | ) RETURNS numeric AS ' 47 | SELECT round( 48 | (observed_number::numeric / base_number) * 1000, decimal_places 49 | ); 50 | ' 51 | LANGUAGE SQL 52 | IMMUTABLE 53 | RETURNS NULL ON NULL INPUT; 54 | 55 | -- use it in Listing 10-9 to make sure you get the same results 56 | 57 | SELECT 58 | city, 59 | st, 60 | population, 61 | property_crime, 62 | rates_per_thousand(property_crime, population, 1) 63 | FROM 64 | fbi_crime_data_2015 65 | WHERE 66 | population >= 500000 67 | ORDER BY 68 | (property_crime::numeric / population) DESC; 69 | 70 | 71 | --------- 72 | -- Q3 -- 73 | --------- 74 | 75 | 76 | -- implement a trigger that automatically adds an inspection date to the 77 | -- meat_poultry_egg_inspect table each time you insert a new facility 78 | 79 | -- add the new column if you haven't already done so 80 | 81 | ALTER TABLE meat_poultry_egg_inspect ADD COLUMN inspection_date date; 82 | 83 | -- creating the add_inspection_date() function 84 | 85 | CREATE OR REPLACE FUNCTION add_inspection_date() 86 | RETURNS trigger AS 87 | $$ 88 | BEGIN 89 | NEW.inspection_date = now() + '6 months'::interval; 90 | RETURN NEW; 91 | END; 92 | $$ 93 | LANGUAGE plpgsql; 94 | 95 | -- specifying trigger 96 | 97 | CREATE OR REPLACE TRIGGER facility_insert 98 | BEFORE INSERT 99 | ON meat_poultry_egg_inspect 100 | FOR EACH ROW 101 | EXECUTE PROCEDURE add_inspection_date(); 102 | 103 | -- Update the table 104 | 105 | INSERT INTO 106 | meat_poultry_egg_inspect (est_number) 107 | VALUES 108 | ('XXXXXXX'); 109 | 110 | -- check if that worked! 
-- let's see the final row

SELECT
  *
FROM
  meat_poultry_egg_inspect
WHERE
  est_number = (
    SELECT
      max(est_number)
    FROM
      meat_poultry_egg_inspect
  );

-- if it works, let's remove this unnecessary trial entry

DELETE FROM
  meat_poultry_egg_inspect
WHERE
  est_number = 'XXXXXXX';
--------------------------------------------------------------------------------
/exercises/Chapter-16.sql:
--------------------------------------------------------------------------------
--------------------------------------------------------------
-- Practical SQL: A Beginner's Guide to Storytelling with Data
-- by Anthony DeBarros
-- 1st Edition
-- Chapter 16 "Try It Yourself" Exercises
--------------------------------------------------------------


/*
The exercise is redoing any of the prior analyses at the command prompt using psql,
so nothing extra needs to be done here.
*/
--------------------------------------------------------------------------------
/exercises/Chapter-17.sql:
--------------------------------------------------------------------------------
--------------------------------------------------------------
-- Practical SQL: A Beginner's Guide to Storytelling with Data
-- by Anthony DeBarros
-- 1st Edition
-- Chapter 17 "Try It Yourself" Exercises
--------------------------------------------------------------

-- backing up the entire gis_analysis database

-- note that this needs to be run from the normal, and not psql, command prompt

pg_dump -d gis_analysis -U postgres -Fc > gis_analysis_backup.sql

-- backing up only a single table from this database

pg_dump -t 'spatial_ref_sys' -d gis_analysis -U postgres -Fc > spatial_ref_sys_backup.sql

-- delete the database using pgAdmin

-- restore using the backup
-- with -C, connect to an existing database (here postgres); the gis_analysis
-- database named in the dump is recreated before the data is restored into it

pg_restore -C -d postgres -U postgres gis_analysis_backup.sql
--------------------------------------------------------------------------------
/exercises/Chapter-18.sql:
--------------------------------------------------------------------------------
--------------------------------------------------------------
-- Practical SQL: A Beginner's Guide to Storytelling with Data
-- by Anthony DeBarros
-- 1st Edition
-- Chapter 18 "Try It Yourself" Exercises
--------------------------------------------------------------


-- No exercises!
🎉 10 | -------------------------------------------------------------------------------- /exercises/extra.sql: -------------------------------------------------------------------------------- 1 | -- The following exercises are not from the book, but just for my own practice 2 | -- This script will keep updating while I keep practicing 3 | -- the tables come from datasets mentioned across various chapters in the book 4 | 5 | ---------- 6 | --- P1 --- 7 | ---------- 8 | 9 | -- find first_name with at least one 'e' *and* 'a' at second position 10 | 11 | SELECT 12 | first_name 13 | FROM 14 | teachers 15 | WHERE 16 | (first_name LIKE '%e%') 17 | AND (first_name LIKE '_a%'); 18 | 19 | ---------- 20 | --- P2 --- 21 | ---------- 22 | 23 | -- find teachers whose last name is not 'Bush' and someone who 24 | -- was hired after 2005 and then sort by their school name (ascending) and 25 | -- salary (descending) 26 | 27 | SELECT 28 | * 29 | FROM 30 | teachers 31 | WHERE 32 | (first_name <> 'Bush') 33 | AND (hire_date > '1-1-2005') 34 | ORDER BY 35 | school ASC, 36 | salary DESC; 37 | 38 | ---------- 39 | --- P3 --- 40 | ---------- 41 | 42 | -- Practising more with the US census 2010 data 43 | -- check out where housing crisis is expected to be happening 44 | 45 | SELECT 46 | geo_name, 47 | state_us_abbreviation, 48 | ( 49 | CAST(housing_unit_count_100_percent AS numeric) / CAST(population_count_100_percent AS numeric) 50 | ) * 100 AS housing_prop 51 | FROM 52 | us_counties_2010 53 | WHERE 54 | population_count_100_percent > 10000 55 | ORDER BY 56 | housing_prop ASC, 57 | state_us_abbreviation; 58 | 59 | ---------- 60 | --- P4 --- 61 | ---------- 62 | 63 | -- TODO: how to avoid the repeated computation of pct_violent_crime here? 64 | -- Select a few states and rank cities according to violent crime rates 65 | 66 | SELECT 67 | st, 68 | city, 69 | violent_crime, 70 | population, 71 | round( 72 | (violent_crime::numeric / population) * 1000, 73 | 2 74 | ) AS pct_violent_crime, 75 | rank() over (PARTITION BY st ORDER BY round((violent_crime::numeric / population) * 1000, 2) DESC) 76 | FROM 77 | fbi_crime_data_2015 78 | WHERE 79 | (population > 500000) 80 | AND (st IN('Texas', 'California')); 81 | 82 | ---------- 83 | --- P5 --- 84 | ---------- 85 | 86 | -- summary of library visit rates (min, median, max) 87 | 88 | SELECT 89 | MIN(pct_data.pct_visits), 90 | percentile_cont(0.5) WITHIN GROUP ( 91 | ORDER BY 92 | pct_data.pct_visits 93 | ), 94 | MAX(pct_data.pct_visits) 95 | FROM 96 | ( 97 | SELECT 98 | libname, 99 | city, 100 | stabr, 101 | visits, 102 | popu_lsa, 103 | round((visits::numeric / popu_lsa) * 1000, 2) AS pct_visits, 104 | rank() over ( 105 | ORDER BY 106 | round((visits::numeric / popu_lsa) * 1000, 2) DESC 107 | ) 108 | FROM 109 | pls_fy2014_pupld14a 110 | WHERE 111 | (popu_lsa > 250000) 112 | AND (visits > 0) 113 | ORDER BY 114 | pct_visits DESC 115 | ) AS pct_data; 116 | 117 | ---------- 118 | --- P6 --- 119 | ---------- 120 | 121 | -- FBI data 122 | -- create a view with correlations between all crime variables 123 | -- note that the maximum correlation is between violence and burglary, while 124 | -- minimum between violence and larceny 125 | 126 | CREATE OR REPLACE VIEW corr_table AS ( 127 | SELECT 128 | round( 129 | corr(rpt_violent_crime, rpt_property_crime)::numeric, 130 | 2 131 | ) AS r_violence_property, 132 | round( 133 | corr(rpt_violent_crime, rpt_burglary)::numeric, 134 | 2 135 | ) AS r_violence_burglary, 136 | round( 137 | corr(rpt_violent_crime, rpt_larceny_theft)::numeric, 138 | 
2 139 | ) AS r_violence_larceny, 140 | round( 141 | corr(rpt_violent_crime, rpt_motor_vehicle_theft)::numeric, 142 | 2 143 | ) AS r_violence_motor, 144 | round( 145 | corr(rpt_burglary, rpt_larceny_theft)::numeric, 146 | 2 147 | ) AS r_burglary_larceny, 148 | round( 149 | corr(rpt_burglary, rpt_motor_vehicle_theft)::numeric, 150 | 2 151 | ) AS r_burglary_motor, 152 | round( 153 | corr(rpt_larceny_theft, rpt_motor_vehicle_theft)::numeric, 154 | 2 155 | ) AS r_larceny_motor 156 | FROM 157 | -- derived table 158 | ( 159 | SELECT 160 | rates_per_thousand(violent_crime, population, 3) AS rpt_violent_crime, 161 | rates_per_thousand(property_crime, population, 3) AS rpt_property_crime, 162 | rates_per_thousand(burglary, population, 3) AS rpt_burglary, 163 | rates_per_thousand(larceny_theft, population, 3) AS rpt_larceny_theft, 164 | rates_per_thousand(motor_vehicle_theft, population, 3) AS rpt_motor_vehicle_theft 165 | FROM 166 | fbi_crime_data_2015 167 | ) AS rpt_crimes 168 | ); 169 | 170 | ---------- 171 | --- P7 --- 172 | ---------- 173 | 174 | -- NYC taxi data 175 | -- min, max, and average distance travelled and passanger count by hour of travel 176 | 177 | SELECT 178 | date_part('hour', tpep_pickup_datetime) as pickup_hour, 179 | min(trip_distance) AS min_distance_rate, 180 | max(trip_distance) AS max_distance_rate, 181 | sum(trip_distance)::numeric / count(trip_id) AS avg_distance_rate, 182 | min(passenger_count) AS min_passenger_count, 183 | max(passenger_count) AS max_passenger_count, 184 | sum(passenger_count)::numeric / count(trip_id) AS avg_passenger_count 185 | FROM 186 | nyc_yellow_taxi_trips_2016_06_01 187 | GROUP BY 188 | pickup_hour 189 | ORDER BY 190 | pickup_hour; 191 | 192 | 193 | ---------- 194 | --- P8 --- 195 | ---------- 196 | 197 | -- calculate the descriptives for passenger count per trip for each vendor and hour 198 | -- but only for trips with pick up times between 8-6 pm 199 | -- we can see that vendor_id 2 consistently gets more passengers per trip pretty much 200 | -- across all hours of interest 201 | 202 | SELECT 203 | date_part('hour', tpep_pickup_datetime) AS pickup_hour, 204 | vendor_id, 205 | round( 206 | sum(passenger_count)::numeric / count(trip_id), 207 | 2 208 | ) AS avg_passenger_per_trip, 209 | min(passenger_count) AS min_passenger_count, 210 | max(passenger_count) AS max_passenger_count 211 | FROM 212 | nyc_yellow_taxi_trips_2016_06_01 213 | GROUP BY 214 | vendor_id, 215 | pickup_hour 216 | HAVING 217 | date_part('hour', tpep_pickup_datetime) BETWEEN '8' AND '18' 218 | ORDER BY 219 | pickup_hour, 220 | vendor_id; 221 | 222 | 223 | ---------- 224 | --- P9 --- 225 | ---------- 226 | 227 | -- See how media proportion of each type has changed across years across libraries 228 | 229 | -- update tables to add a new column for total of all media 230 | 231 | ALTER TABLE 232 | pls_fy2014_pupld14a 233 | ADD 234 | total_media integer; 235 | 236 | UPDATE 237 | pls_fy2014_pupld14a 238 | SET 239 | total_media = (bkvol + ebook + audio_ph + video_ph); 240 | 241 | ALTER TABLE 242 | pls_fy2009_pupld09a 243 | ADD 244 | total_media integer; 245 | 246 | UPDATE 247 | pls_fy2009_pupld09a 248 | SET 249 | total_media = (bkvol + ebook + audio + video); 250 | 251 | -- create a view out of proportions 252 | -- create CTEs that show the proportion of each type of media, 253 | -- namely, books, ebooks, audio books, videos 254 | -- two CTEs: one for 2009 and one for 2014 255 | 256 | CREATE OR REPLACE VIEW media_pct_change AS 257 | WITH media14 AS ( 258 | SELECT 259 | stabr, 260 | 
fscskey, 261 | bkvol, 262 | (bkvol::numeric / total_media) AS book_prop, 263 | ebook, 264 | (ebook::numeric / total_media) AS ebook_prop, 265 | audio_ph, 266 | (audio_ph::numeric / total_media) AS audio_prop, 267 | video_ph, 268 | (video_ph::numeric / total_media) AS video_prop, 269 | total_media 270 | FROM 271 | pls_fy2014_pupld14a 272 | WHERE 273 | total_media > 0 274 | ), 275 | media09 AS ( 276 | SELECT 277 | fscskey, 278 | bkvol, 279 | (bkvol::numeric / total_media) AS book_prop, 280 | ebook, 281 | (ebook::numeric / total_media) AS ebook_prop, 282 | audio, 283 | (audio::numeric / total_media) AS audio_prop, 284 | video, 285 | (video::numeric / total_media) AS video_prop, 286 | total_media 287 | FROM 288 | pls_fy2009_pupld09a 289 | WHERE 290 | total_media > 0 291 | ) 292 | SELECT 293 | stabr, 294 | media14.fscskey, 295 | round( 296 | ( 297 | (media14.total_media - media09.total_media)::numeric / media09.total_media 298 | ) * 100, 299 | 2 300 | ) AS pct_total_media, 301 | round( 302 | ( 303 | (media14.book_prop - media09.book_prop)::numeric / media09.book_prop 304 | ) * 100, 305 | 2 306 | ) AS pct_book_prop, 307 | round( 308 | ( 309 | (media14.ebook_prop - media09.ebook_prop)::numeric / media09.ebook_prop 310 | ) * 100, 311 | 2 312 | ) AS pct_ebook_prop, 313 | round( 314 | ( 315 | (media14.audio_prop - media09.audio_prop)::numeric / media09.audio_prop 316 | ) * 100, 317 | 2 318 | ) AS pct_audio_prop, 319 | round( 320 | ( 321 | (media14.video_prop - media09.video_prop)::numeric / media09.video_prop 322 | ) * 100, 323 | 2 324 | ) AS pct_video_prop 325 | FROM 326 | media14 327 | INNER JOIN media09 ON media14.fscskey = media09.fscskey 328 | WHERE 329 | media14.book_prop > 0 330 | AND media09.book_prop > 0 331 | AND media14.ebook_prop > 0 332 | AND media09.ebook_prop > 0 333 | AND media14.audio_prop > 0 334 | AND media09.audio_prop > 0 335 | AND media14.video_prop > 0 336 | AND media09.video_prop > 0; 337 | 338 | -- a view table with average changes in media 339 | 340 | CREATE 341 | OR REPLACE VIEW avg_media_pct_change AS 342 | SELECT 343 | stabr, 344 | round(avg(pct_total_media), 2) AS avg_pct_total_media, 345 | round(avg(pct_book_prop), 2) AS avg_pct_book_prop, 346 | round(avg(pct_ebook_prop), 2) AS avg_pct_ebook_prop, 347 | round(avg(pct_audio_prop), 2) AS avg_pct_audio_prop, 348 | round(avg(pct_video_prop), 2) AS avg_pct_video_prop 349 | FROM 350 | media_pct_change 351 | GROUP BY 352 | stabr 353 | ORDER BY 354 | avg_pct_total_media DESC; 355 | 356 | -- descriptive statistics for these changes 357 | -- median of average changes across states 358 | 359 | SELECT 360 | percentile_cont(0.5) WITHIN GROUP (ORDER BY avg_pct_book_prop DESC) AS median_avg_pct_book_prop, 361 | percentile_cont(0.5) WITHIN GROUP (ORDER BY avg_pct_ebook_prop DESC) AS median_avg_pct_ebook_prop, 362 | percentile_cont(0.5) WITHIN GROUP (ORDER BY avg_pct_audio_prop DESC) AS median_avg_pct_audio_prop, 363 | percentile_cont(0.5) WITHIN GROUP (ORDER BY avg_pct_video_prop DESC) AS median_avg_pct_video_prop 364 | FROM 365 | avg_media_pct_change; 366 | 367 | /* 368 | From results, we can see that there has been a massive uptick in ebooks in libraries. 369 | The median of avergae change in ebooks across state libraries from 2009 to 2014 is 370 | is a whopping 22,000%! 371 | There has been a decrease in physical books with a median change of -11%. Similar 372 | trend is observed for audio books: -9%. 373 | For video collection, there is a modest increase, with a median at 16%. 
*/

-- the state where the maximum percentage increase was seen in ebooks: Delaware (613,276%)
-- this massive change is mostly due to Milford Public Library, which had 1 ebook
-- in 2009 and 19,770 ebooks in 2014, which is a 1,976,900% change!

SELECT
  stabr,
  avg_pct_ebook_prop
FROM
  avg_media_pct_change
WHERE
  avg_pct_ebook_prop = (
    SELECT
      max(avg_pct_ebook_prop)
    FROM
      avg_media_pct_change
  );

-- the state where the minimum percentage increase was seen in ebooks: Minnesota (51%)
-- (using a scalar subquery to figure this out)

SELECT
  stabr,
  avg_pct_ebook_prop
FROM
  avg_media_pct_change
WHERE
  avg_pct_ebook_prop = (
    SELECT
      min(avg_pct_ebook_prop)
    FROM
      avg_media_pct_change
  );
--------------------------------------------------------------------------------
/feedback/feedback.Rmd:
--------------------------------------------------------------------------------
---
title: "Practical SQL: A Beginner's Guide to Storytelling with Data"
subtitle: "Feedback on the 1st Edition"
output: pdf_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

## General comments

* When joins are mentioned for the first time, the book aliases the tables being joined by explicitly using the `AS` keyword (e.g. `left_table AS lt`), but it stops using the keyword later on (e.g. `left_table lt`). Although it might not be hard to see that the keyword is optional, I think it is still good practice for beginners to repeatedly see aliasing with `AS`, so that they form the habit of making their aliasing intent explicit. It may also avoid confusion for readers who wonder why `AS` is suddenly missing and whether that is an error.

* While discussing writing functions for triggers, I think it would be helpful if the book clarified that we don't need to specify the table explicitly, because the function always picks up the table named in the `CREATE TRIGGER` statement. Users coming from other programming languages will spend some time scratching their heads over how the function *knows*, for example in Listing 15-22, which table the `NEW` keyword refers to.

## Possible clarifications

* I wish the book emphasized that a DBMS, a data model, and a query language (SQL) are all different things. Only then will it be clear that, for example, *MySQL*, *PostgreSQL*, etc. are RDBMSs, and **not** different dialects of SQL developed for respective databases (which is what I walked away thinking having read the book).

* The way derived tables, CTEs, and views are discussed in the book, it can sometimes feel like they can't all be combined in the same workflow. So maybe an example that contains more than one of these concepts could be produced to drive this point home:

```sql
CREATE OR REPLACE VIEW ... AS
WITH ... AS (
  ...
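  -- (the final query on this CTE could itself read from a derived table,
  -- e.g. FROM (SELECT ...) AS dt, so that all three concepts appear together)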
27 | ) 28 | ``` 29 | 30 | ## Wishlist 31 | 32 | * The book does not cover some common terms used while talking about relational data model: 33 | *attribute*, *tuple*, *domain*, *degree*, *cardinality* 34 | 35 | * I wish the book had also covered the following operators: 36 | 37 | - `UNION` (and `UNION ALL`) 38 | - `ANY` and `ALL` 39 | 40 | * While discussing joins, if not explicit treatment, self joins may at least deserve a hat tip. 41 | -------------------------------------------------------------------------------- /feedback/feedback.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/IndrajeetPatil/Practical-SQL-exercises/fb02667a2070212d7ff234ba79fc8941e866f2c0/feedback/feedback.pdf --------------------------------------------------------------------------------